[
  {
    "slug": "accelerated-failure-time-models",
    "name": "Accelerated Failure Time (AFT) Models",
    "short_definition": "A fully parametric family of survival regression models that expresses covariate effects as multiplicative stretching or compression of the event-time axis, producing a \"time ratio\" — how many times longer or shorter the event takes — rather than a hazard ratio; the preferred choice when the proportional hazards assumption fails and when a direct time-scale interpretation is clinically or economically needed.",
    "long_description": "**The accelerated failure time model**\n\nThe AFT model specifies that log(T) = Xβ + σε, where T is the time to event, X is the\ncovariate matrix, β are the log-time-scale regression coefficients, σ is a scale parameter,\nand ε is a standardized error term. Every exp(β_k) is a time ratio (TR): a multiplicative\nfactor that stretches or compresses the entire event-time distribution for covariate k. The\nchoice of distribution for ε defines the model family: an extreme-value ε gives the Weibull\nAFT (monotone hazards, the only family that is simultaneously a valid PH model), a Gaussian ε\ngives the log-normal AFT (arc-shaped, non-monotone hazard), a logistic ε gives the log-logistic\nAFT (arc-shaped hazard, closed-form survivor function), and the generalized gamma (GG) nests\nall three as special limiting cases via two shape parameters.\n\nThe time ratio is the fundamental AFT estimand. If TR = 1.5, every quantile of the treated\nevent-time distribution is 1.5 times the corresponding quantile of the control distribution.\nA patient who would progress at 10 months under control progresses at 15 months under\ntreatment; one who would progress at 4 months progresses at 6 months; one who would progress\nat 20 months progresses at 30 months. This invariance — the same multiplicative shift at every\nquantile — is the core mathematical property of the AFT family and is what makes the time ratio\ninterpretation clean and distribution-agnostic within the chosen family.\n\n**Interpreting the output**\n\nA Weibull AFT model fit to a comparative cohort study on time to progression returns:\ntreatment log-time coefficient 0.405 (95% CI 0.10 to 0.71); scale parameter σ = 0.78\n(Weibull shape k = 1/σ ≈ 1.28).\n\n*Formal interpretation.* exp(0.405) = 1.50 is the time ratio (also called the acceleration\nfactor). 
Conditional on the Weibull model and the fitted covariates, every quantile of the\ntreated progression-time distribution is 1.50 times the corresponding control quantile. The\n95% CI on the log-time scale (0.10 to 0.71) maps to a time-ratio CI of approximately\nexp(0.10) to exp(0.71), roughly 1.11 to 2.03 — the data are compatible with treatment\nstretching time to progression by between 11% and 103%. The Weibull shape k ≈ 1.28 allows\na parallel proportional-hazards reading: HR = TR^(−k) = 1.50^(−1.28) ≈ 0.59, meaning the\nhazard is approximately 41% lower under treatment. For non-Weibull AFT families (log-normal,\nlog-logistic), no simple closed-form HR exists; the time ratio is the native and only\nwell-defined covariate effect estimate.\n\n*Practical interpretation.* A time ratio of 1.50 means treatment stretches time to\nprogression by about 50%. A patient who would typically progress at 10 months under\ncontrol typically progresses at 15 months under treatment; one who would progress early\nat 4 months progresses at 6 months; one who would progress late at 20 months progresses\nat 30 months. Clinicians often find this more natural than \"the hazard is 41% lower at\nany given instant in follow-up,\" because time-ratio language maps directly onto how\npatients and oncologists discuss disease course — months of progression-free life, not\ninstantaneous rates. For payers and HTA analysts, a time ratio also directly quantifies\nthe economic value of delay: multiplying the control arm's expected time-on-therapy\ncost by the time ratio gives the treated arm's expected cost under the AFT assumption.\n\n**AFT versus proportional hazards**\n\nThe Cox PH model estimates a hazard ratio assumed constant across follow-up. 
When hazards\nconverge, cross, or show a delayed treatment effect — the canonical pattern in\nimmuno-oncology, where treated patients may initially show similar or slightly higher\nhazard before the immune response matures into durable benefit — PH fails and a single\naveraged HR is a misleading summary that obscures both the early and late treatment\nprofile. AFT models make no PH assumption. The time ratio is identified through the\ndistributional parameters and remains well-defined even under non-PH data, provided the\nchosen distributional family fits adequately.\n\nThe Weibull distribution is the sole family that admits both a PH and an AFT\nreparameterization. All other AFT families (log-normal, log-logistic, generalized gamma)\ncan only be expressed in AFT form. This makes the Weibull the natural starting point for\nAFT analysis: if Weibull fits adequately, results can be stated in either HR or TR\nlanguage to satisfy different reviewer expectations across regulatory and HTA contexts.\n\nThe generalized gamma (GG) acts as a diagnostic umbrella: it nests Weibull (shape q = 1),\nlog-normal (q → 0), and gamma as sub-models. Fitting GG first and using likelihood-ratio\ntests to select the most parsimonious sub-family is the principled approach for HTA\nextrapolation. When GG AIC is only marginally better than a sub-family's AIC, prefer the\nsub-family for interpretability and probabilistic sensitivity analysis stability.\n\n**RWE and HTA context**\n\nIn claims data, heavy administrative censoring from disenrollment or Medicare Advantage\nenrollment without continuous fee-for-service coverage means fitted AFT distribution\nparameters depend heavily on what the model assumes about the hazard in the unobserved\ntail — the same fundamental tension that makes HTA extrapolation challenging. AFT\ndistributions with a declining late hazard (log-normal, log-logistic) extrapolate more\noptimistically than Weibull or Gompertz given identical in-sample data. 
NICE DSU TSD 14\n(Latimer 2013) recommends evaluating the full six-distribution candidate set, overlaying the\nfitted curves on the Kaplan-Meier estimate, inspecting projected hazards beyond the data, and\nflooring all-cause hazard against the general-population mortality envelope. The four standard\nAFT families (Weibull, log-normal, log-logistic, generalized gamma) are four of those six\nTSD-14 candidates; AFT parameterization adds interpretable time ratios to the distributional\nselection exercise.\n\nFor real-world progression-free survival (rwPFS) and time-to-discontinuation endpoints,\ntime ratios are increasingly reported alongside hazard ratios in HTA dossiers, especially\nin immuno-oncology where PH is often violated. A semiparametric rank-based estimation\napproach (discussed in Wei 1992), which is consistent without a distributional assumption,\nexists for robustness checks, though regulatory submissions typically require\nmaximum-likelihood parametric AFT fits from the TSD-14 candidate set.\n\n**Diagnostics**\n\n(1) Log(−log(KM)) vs log(t) plot: a straight line supports Weibull; systematic curvature\nargues against Weibull and motivates checking log-normal or log-logistic. (2) Fitted survivor\ncurve vs KM overlay: the parametric S(t) should track the KM closely in the observable window;\nsystematic deviation indicates inadequate distributional fit. (3) AIC/BIC comparison\nacross all candidate families: select the most parsimonious model with competitive AIC\nand a plausible projected hazard; AIC alone does not determine extrapolation quality\nbecause in-sample fit and tail behavior can diverge sharply. 
(4) Standardized residual\nQ-Q plot: residuals r_i = (log(t_i) − X_i β-hat) / σ-hat should follow the assumed error\ndistribution — normal for log-normal AFT, logistic for log-logistic AFT, extreme-value for\nWeibull AFT (residuals of censored observations are themselves censored and must be treated\nas such); heavy-tailed residuals or systematic curvature indicate a more flexible model is\nneeded.\n\n**Pros, cons, and trade-offs**\n\nPros of AFT over Cox PH: time ratios are interpretable when PH is violated and when\nclinicians think in terms of time rather than instantaneous rates; the time ratio keeps the\nsame interpretation under every error distribution in the AFT class (the estimate itself\nvaries by family); fully parametric\nmaximum-likelihood estimation is efficient when the distributional form is correct;\ndirect quantile prediction is available without a Breslow baseline estimator; natural\nfor HTA extrapolation where a distributional assumption is unavoidable and where the\ntime ratio directly quantifies the economic value of delay.\n\nCons: commits to a distributional family — misspecification biases all estimates; when\nthe hazard ratio is the pre-specified regulatory primary endpoint (FDA oncology\nregistration trial) and PH holds well, Cox is preferred because it needs no distributional\nassumption; cure fractions\nand time-varying covariates require extensions that are less mature than standard Cox in\nregulatory software.\n\n**When to use**\n\nUse AFT models when: (a) PH is violated (Schoenfeld test significant, visual crossing or\nconvergence of hazards) and a single averaged HR would be misleading; (b) a direct\ntime-scale effect — \"treatment stretches event time by X%\" — is the preferred clinical\nor HTA communication; (c) the endpoint is rwPFS or time-to-discontinuation in an HTA\nextrapolation and the TSD-14 candidate set is being evaluated; (d) a Weibull, log-normal,\nor log-logistic distributional form is biologically motivated by prior literature on the\nendpoint; or (e) the research context is aging or chronic disease where the\nacceleration-factor framing — does the exposure age subjects faster or slower? 
— is\nconceptually natural and supported by domain literature.\n\n**When NOT to use**\n\nDo not use AFT models when: (a) the hazard ratio is the pre-specified primary estimand\nand PH holds well — Cox PH estimates that estimand directly and imposes no distributional\ncommitment;\n(b) a genuine cure fraction or long-term survivor plateau is clinically supported —\nmixture or non-mixture cure models are the appropriate tool; (c) time-varying covariates\nare central to the causal question — Cox PH with a counting-process layout handles them\nmore naturally and with more mature regulatory software support; (d) competing risks\ndominate the analysis — cause-specific or Fine-Gray models are preferred and do not\nnaturally map to the AFT framework; or (e) the entire analysis is on the hazard scale for\nregulatory or communication reasons and the audience is not positioned to interpret time\nratios alongside or in place of hazard ratios.",
    "primary_category": "Inferential_Statistics",
    "tags": [
      "statistics",
      "survival-analysis",
      "parametric-models",
      "time-ratio",
      "time-to-event",
      "weibull",
      "accelerated-failure-time",
      "hta-extrapolation",
      "inferential"
    ],
    "applies_to_study_types": [
      "cohort_retrospective",
      "cohort_prospective",
      "rct",
      "active_comparator_new_user",
      "claims_analysis",
      "ehr_study",
      "registry_study"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "primary",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1002/sim.4780111409",
        "url": "https://doi.org/10.1002/sim.4780111409",
        "citation_text": "Wei LJ. The accelerated failure time model: a useful alternative to the Cox regression model in survival analysis. Statistics in Medicine. 1992;11(14-15):1871-1879.",
        "year": 1992,
        "authors_short": "Wei",
        "notes": "The canonical methodological introduction to AFT models as a practical alternative to Cox PH regression. Establishes the time-ratio interpretation, discusses the six major distribution families, demonstrates the Weibull's dual AFT-PH membership, and compares semiparametric rank-based AFT estimation to the parametric approach. Essential reading for any analyst choosing between AFT and Cox for a survival endpoint."
      },
      {
        "role": "explain",
        "doi": "10.1177/0272989X12472398",
        "url": "https://doi.org/10.1177/0272989X12472398",
        "citation_text": "Latimer NR. Survival analysis for economic evaluations alongside clinical trials — extrapolation with patient-level data: inconsistencies, limitations, and a practical guide. Medical Decision Making. 2013;33(6):743-754.",
        "year": 2013,
        "authors_short": "Latimer",
        "notes": "The peer-reviewed companion to NICE DSU TSD 14; establishes the six-distribution candidate set (Weibull, exponential, Gompertz, log-normal, log-logistic, generalized gamma — four of which are AFT families) and the AIC/BIC-plus-plausibility selection workflow for HTA extrapolation. Grounds the AFT concept in its primary regulatory and health-economic application and explains when a distributional commitment is unavoidable."
      },
      {
        "role": "use",
        "doi": "10.1016/j.exger.2008.10.005",
        "url": "https://doi.org/10.1016/j.exger.2008.10.005",
        "citation_text": "Swindell WR. Accelerated failure time models provide a useful statistical framework for aging research. Experimental Gerontology. 2009;44(3):190-200.",
        "year": 2009,
        "authors_short": "Swindell",
        "notes": "Demonstrates AFT models in aging and gerontology — a domain where the acceleration-factor framing is conceptually natural (does the exposure age subjects faster or slower?) — providing accessible worked examples and a distribution-comparison workflow applicable to any chronic-disease or RWE study where the time-ratio interpretation adds clinical value."
      }
    ],
    "plain_language_summary": "An accelerated failure time (AFT) model is a statistical method for comparing how long two groups take to reach an outcome — such as disease progression or death — using a single number called the time ratio. If the time ratio is 1.5, the treatment multiplies every patient's expected time to the outcome by 1.5: a patient who would progress at 10 months on the control treatment would instead progress at 15 months on the new treatment, and the same 50% stretch applies whether we look at early progressors or late progressors. Unlike the more common hazard ratio (which measures event rates at each moment of follow-up), the time ratio directly says \"how much longer\" in calendar time — a framing that clinicians, patients, and health economists often find more intuitive.",
    "key_terms": [
      {
        "term": "time ratio",
        "definition": "The multiplicative factor by which a covariate stretches or compresses the time until an event; a time ratio of 1.5 means the event takes 1.5 times as long in the treated group as in the control group at every point in the survival distribution."
      },
      {
        "term": "acceleration factor",
        "definition": "Another name for the time ratio — it describes how much slower (greater than 1) or faster (less than 1) the event occurs under the condition of interest compared to the reference group; an acceleration factor less than 1 means the event is accelerated (harmful if the event is bad)."
      },
      {
        "term": "error distribution",
        "definition": "The statistical family assumed for the variability in log-event-time after removing covariate effects; the choice (Weibull, log-normal, log-logistic, generalized gamma) determines the shape of the hazard over time and affects how the model extrapolates beyond the observed data."
      },
      {
        "term": "log-linear model for time",
        "definition": "The core AFT structure — instead of modeling the hazard directly, AFT models treat log(survival time) as a linear function of covariates, like ordinary linear regression applied to the logarithm of the time-to-event outcome."
      },
      {
        "term": "proportional hazards vs AFT",
        "definition": "The proportional hazards (PH) model assumes the ratio of hazards between groups is constant throughout follow-up; AFT models assume the entire time axis is stretched or compressed by a constant factor — the two are equivalent only for the Weibull distribution."
      }
    ],
    "worked_example": {
      "scenario": "A pharmacoepidemiology team is comparing time to cancer progression between 6 patients who received a new therapy (treated arm) and 6 patients who received standard care (control arm) in a small registry cohort. All 12 patients experienced the event (no censoring in this illustration). The team fits a Weibull AFT model to estimate the time ratio for treatment. The model returns a log-time coefficient for treatment of 0.405, which the team uses to predict how the full distribution of progression times — not just the median — is shifted by treatment.",
      "dataset": {
        "caption": "One row per patient. time_months is months from study entry to cancer progression. event = 1 for all rows (complete data, no censoring in this example). The treated-arm times are approximately 1.5 times the control-arm times, consistent with a time ratio of 1.5 produced by the fitted Weibull AFT model.",
        "columns": [
          "person_id",
          "arm",
          "time_months",
          "event"
        ],
        "rows": [
          [
            "C1",
            "control",
            3,
            1
          ],
          [
            "C2",
            "control",
            4,
            1
          ],
          [
            "C3",
            "control",
            8,
            1
          ],
          [
            "C4",
            "control",
            12,
            1
          ],
          [
            "C5",
            "control",
            20,
            1
          ],
          [
            "C6",
            "control",
            25,
            1
          ],
          [
            "T1",
            "treated",
            4.5,
            1
          ],
          [
            "T2",
            "treated",
            6,
            1
          ],
          [
            "T3",
            "treated",
            12,
            1
          ],
          [
            "T4",
            "treated",
            18,
            1
          ],
          [
            "T5",
            "treated",
            30,
            1
          ],
          [
            "T6",
            "treated",
            37.5,
            1
          ]
        ]
      },
      "steps": [
        "Sort the 6 control-arm times: 3, 4, 8, 12, 20, 25 months. The sample median is the average of the 3rd and 4th sorted values: (8 + 12) / 2 = 10 months. The first quartile (25th percentile, 2nd sorted value) is 4 months; the third quartile (75th percentile, 5th sorted value) is 20 months.",
        "The Weibull AFT model is fit to all 12 patients. The model returns a log-time coefficient for treatment of 0.405 (95% CI approximately 0.10 to 0.71). Taking the exponential gives the time ratio: TR = exp(0.405) ≈ 1.50.",
        "The time ratio means every quantile of the treated progression-time distribution is 1.50 times the corresponding control quantile. Apply this to the control median: treated median = 10 × 1.5 = 15 months.",
        "Verify with the treated-arm data: sorted treated times are 4.5, 6, 12, 18, 30, 37.5. Treated median = (12 + 18) / 2 = 15 months. This matches the AFT prediction exactly.",
        "First quartile scaling: control Q1 = 4 months; treated Q1 = 4 × 1.5 = 6 months. Check: 2nd treated value = 6 months. Third quartile scaling: control Q3 = 20 months; treated Q3 = 20 × 1.5 = 30 months. Check: 5th treated value = 30 months."
      ],
      "result": "Weibull AFT model fit to 12 patients (6 per arm). Control median = (8 + 12) / 2 = 10 months. Time ratio TR = exp(0.405) ≈ 1.50. Treated median = 10 × 1.5 = 15 months; verified by data: (12 + 18) / 2 = 15 months. Q1 scaling: 4 × 1.5 = 6 months (data: 6 months). Q3 scaling: 20 × 1.5 = 30 months (data: 30 months). Every quantile is multiplied by the same factor of 1.5, demonstrating the AFT property that the time ratio is invariant across the entire progression-time distribution."
    },
    "prerequisites": [
      "inferential-statistics-foundations",
      "weibull-distribution",
      "censoring-mechanisms-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Weibull AFT (PH-AFT bridge)",
        "description": "The Weibull is the only distribution that is simultaneously a valid proportional hazards model and a valid AFT model. The relationship between the two parameterizations is HR = TR^(−k), where k is the Weibull shape parameter (k = 1/scale in survreg output). This allows reporting in both HR and TR language, satisfying reviewers who expect a hazard ratio while also providing the time-scale interpretation. A shape k greater than 1 indicates an increasing hazard; k less than 1 indicates a decreasing hazard; k = 1 is the exponential special case with a constant hazard.",
        "edge_cases": [
          "When k = 1 (exponential), HR = 1/TR exactly. For k near 1 (weakly monotone hazard), the Weibull is the most parsimonious choice.",
          "survreg() in R returns the scale parameter sigma = 1/k. The Weibull shape is k = 1/scale, not scale itself."
        ],
        "data_source_notes": "Claims: verify the PH assumption (log-log plot, Schoenfeld test) before choosing Weibull vs a more flexible family. If PH holds and Weibull AIC is competitive, the Weibull is the preferred AFT family because it allows dual HR/TR reporting. Registry: natural for disease progression endpoints with monotone hazards."
      },
      {
        "name": "Log-normal AFT (arc-shaped hazard)",
        "description": "The log-normal AFT assumes the error distribution is Gaussian on the log-time scale. The hazard first rises and then falls to zero as time increases (arc-shaped), making it appropriate for conditions where risk peaks early and then declines as the most susceptible patients have already had the event. Time ratios are computed identically to the Weibull case: TR = exp(beta_treatment). The log-normal does not admit a PH reparameterization.",
        "edge_cases": [
          "The declining late hazard can produce optimistic HTA extrapolations if the true hazard does not actually decline; always compare against Weibull and Gompertz on AIC and tail plausibility.",
          "Large sigma values (diffuse log-time distributions) indicate high heterogeneity in event timing; check for unmodeled subgroups or competing events."
        ],
        "data_source_notes": "Common in RWE for time-to-treatment-discontinuation, where initial stable patients are unlikely to discontinue early but risk peaks and then declines as durable responders stabilize. Claims disenrollment censoring can inflate estimated sigma; confirm that late survival is not an artifact of unobserved Medicare Advantage events."
      },
      {
        "name": "Generalized gamma AFT (diagnostic umbrella)",
        "description": "The generalized gamma nests Weibull, log-normal, and gamma as limiting cases via two shape parameters, allowing four hazard shapes: monotone increasing, monotone decreasing, bathtub, and arc-shaped. Fitting GG first and using likelihood-ratio tests against nested sub-families is the recommended strategy for model selection in HTA extrapolation. If GG AIC is only marginally better than a sub-family, prefer the sub-family for parsimony and PSA stability.",
        "edge_cases": [
          "GG parameter identification requires adequate event counts; with fewer than approximately 50 events, the two shape parameters may be weakly identified, leading to unstable estimates.",
          "When GG converges to a sub-family boundary, the LRT for the sub-family may have a non-standard distribution; use the AIC comparison rather than the p-value alone."
        ],
        "data_source_notes": "Used as the primary diagnostic in NICE TSD 21 flexible-modeling workflows. In HTA submissions, GG is fitted alongside Royston-Parmar splines to benchmark parametric flexibility against the observed Kaplan-Meier. Available in flexsurvreg (dist = gengamma), PROC LIFEREG (DIST=GAMMA in SAS), and lifelines GeneralizedGammaRegressionFitter."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "cox-ph-regression",
        "pros_of_this": "Time ratios are valid and interpretable when PH fails (crossing or converging hazards); direct quantile prediction is available without a Breslow baseline estimator; natural for HTA extrapolation where a distributional assumption is unavoidable; Weibull AFT allows dual HR/TR reporting for mixed regulatory-HTA audiences.",
        "cons_of_this": "Requires committing to a distributional family — Cox imposes no distributional assumption on the baseline hazard; Cox handles time-varying covariates and left-truncation more naturally in standard software; Cox is preferred when PH holds and HR is the regulatory primary endpoint.",
        "when_to_prefer": "Prefer AFT when Schoenfeld test or visual diagnostics indicate PH violation, when the time-scale interpretation is clinically primary, or when HTA extrapolation makes a distributional assumption unavoidable anyway."
      },
      {
        "compared_to": "survival-extrapolation-hta-rwe",
        "pros_of_this": "AFT parameterization adds interpretable time ratios to the TSD-14 distributional selection exercise; the time ratio directly quantifies the economic value of treatment delay in the cost-effectiveness model without additional transformation.",
        "cons_of_this": "Extrapolation quality depends on both the distributional choice and the persistence assumption for the treatment effect; AIC does not discipline the unobserved tail.",
        "when_to_prefer": "Always use AFT within the TSD-14 candidate-set workflow; it is the parameterization for four of the six standard distributions and adds TR interpretation to the selection exercise."
      },
      {
        "compared_to": "restricted-mean-survival-time-rmst",
        "pros_of_this": "AFT provides a single time-ratio effect estimate valid across the full event-time distribution and is parameterization-consistent for extrapolation beyond the observed window; RMST avoids distributional assumptions but cannot answer the lifetime question HTA requires.",
        "cons_of_this": "RMST makes no distributional assumption in the observed window and is robust to PH violation; AFT requires a distributional commitment that is only testable in observed data and must be justified for the unobserved extrapolation tail.",
        "when_to_prefer": "Use AFT when the analysis must extend beyond observed follow-up; use RMST as a face-validity anchor confirming that the AFT model's fitted mean over the observed window matches the empirical restricted mean survival time."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Exclude Medicare Advantage-only person-time before fitting (unobserved deaths inflate late survival and distort AFT distribution parameter estimates, especially sigma for log-normal and log-logistic). Pin time zero at treatment initiation to avoid immortal-time bias. Drop the final incomplete quarters to avoid claim-lag artifacts. For elderly cohorts, floor the extrapolated all-cause hazard at the matched general-population life table. Compare Weibull (interpretable in both HR and TR), log-normal, log-logistic, and generalized gamma on AIC/BIC and KM overlay before selecting a family.",
      "ehr": "Visit-driven capture makes censoring potentially informative; verify non-differential loss to follow-up or apply IPCW before fitting AFT models. The distributional assumption is untestable beyond the observed window; pair the AFT fit with a registry or death-index validation of the tail if long-term data are available.",
      "registry": "The natural source of long-term external data to anchor the AFT tail and validate the projected hazard. Confirm vital-status completeness via death-index linkage before trusting registry tails; differential dropout of sicker patients can inflate late survival and produce an overly optimistic time ratio if censoring is informative.",
      "primary": "For RCT and primary-study data, AFT models complement Cox when PH is violated (immuno- oncology delayed effects) or when a distributional assumption is required for lifetime projection. Report both the parametric time ratio and the empirical RMST over the observed window as a face-validity cross-check.",
      "linked": "Linked claims-registry-vital-records data provides the best AFT fitting substrate: registry severity, claims duration, and reliable mortality from the death index. Reconcile order, fill, and service-date discrepancies and linkage selection before setting time zero and event dates to avoid reintroducing immortal time or informative censoring at the data seam."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\nfrom lifelines import WeibullAFTFitter, LogNormalAFTFitter\n\n# cohort: analysis-ready DataFrame described in the header above.\ncohort = pd.read_parquet(\"cohort.parquet\")\n\n# ── 1. Weibull AFT ──────────────────────────────────────────────────────────\nwaf = WeibullAFTFitter()\nwaf.fit(cohort, duration_col=\"time\", event_col=\"event\")\nprint(\"=== Weibull AFT: time ratios ===\")\nprint(waf.summary[[\"coef\", \"exp(coef)\", \"exp(coef) lower 95%\",\n                    \"exp(coef) upper 95%\", \"p\"]])\n# exp(coef) for 'arm' is the time ratio TR:\n# TR > 1 → treatment stretches time to event (beneficial if the event is harmful)\n# TR < 1 → treatment compresses time to event\n\n# ── 2. Predict median event time by arm ─────────────────────────────────────\nnewdata = pd.DataFrame({\"arm\": [0, 1]})\nmedians = waf.predict_median(newdata)\nprint(f\"\\nControl median: {medians.iloc[0]:.1f}  Treated median: {medians.iloc[1]:.1f}\")\nprint(f\"Empirical time ratio from medians: {medians.iloc[1] / medians.iloc[0]:.3f}\")\n\n# ── 3. Log-normal AFT for comparison ────────────────────────────────────────\nlnaf = LogNormalAFTFitter()\nlnaf.fit(cohort, duration_col=\"time\", event_col=\"event\")\nprint(\"\\n=== Log-normal AFT: time ratios ===\")\nprint(lnaf.summary[[\"coef\", \"exp(coef)\", \"exp(coef) lower 95%\",\n                     \"exp(coef) upper 95%\", \"p\"]])\n\n# ── 4. Compare AIC (lower is better; inspect tail before selecting) ──────────\nprint(f\"\\nWeibull AIC  : {waf.AIC_:.2f}\")\nprint(f\"Log-normal AIC: {lnaf.AIC_:.2f}\")\n# AIC alone does not determine extrapolation quality. Overlay fitted curves on the\n# KM and inspect projected hazard beyond the data before selecting a distribution.",
        "description": "Fit Weibull and log-normal AFT models using lifelines and extract time ratios directly\nfrom the model summary. Required input DataFrame `cohort` with columns:\n  time    : float, time to event or censoring (> 0)\n  event   : int, 1 = event occurred, 0 = censored\n  arm     : int, 0 = control, 1 = treated\n  [additional baseline covariates as needed]\nWeibullAFTFitter and LogNormalAFTFitter both output a summary with exp(coef) = time ratio\nand its 95% CI. Use predict_median() for quantile predictions by covariate profile and\ncompare AIC across families before selecting the distribution for extrapolation.",
        "dependencies": [
          "lifelines",
          "pandas"
        ],
        "source_citations": [
          "wei-1992"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(survival)\nlibrary(flexsurv)\n\n# cohort: analysis-ready data.frame (described in the header above).\ncohort <- readRDS(\"cohort.rds\")\ncohort$arm <- relevel(factor(cohort$arm), ref = \"control\")\n\n# ── 1. Weibull AFT via survreg() ─────────────────────────────────────────────\nfit_wb <- survreg(Surv(time, event) ~ arm,\n                  data = cohort, dist = \"weibull\")\nprint(summary(fit_wb))\n\n# exp(coef) is the time ratio; survreg 'scale' = sigma = 1 / Weibull_shape_k\nTR_wb <- exp(coef(fit_wb)[\"armtreated\"])\nk      <- 1 / fit_wb$scale                      # Weibull shape parameter\ncat(sprintf(\"Weibull TR: %.3f   shape k: %.3f   equivalent HR: %.3f\\n\",\n            TR_wb, k, TR_wb^(-k)))\n\n# Predict median (p = 0.50 quantile) and quartiles by arm\nq_vals <- c(0.25, 0.50, 0.75)\nfor (p in q_vals) {\n  q_preds <- predict(fit_wb,\n                     newdata = data.frame(arm = c(\"control\", \"treated\")),\n                     type = \"quantile\", p = p)\n  cat(sprintf(\"Q%.0f: control %.1f  treated %.1f  ratio %.3f\\n\",\n              p * 100, q_preds[1], q_preds[2], q_preds[2] / q_preds[1]))\n}\n\n# ── 2. Log-normal AFT via survreg() ──────────────────────────────────────────\nfit_ln <- survreg(Surv(time, event) ~ arm,\n                  data = cohort, dist = \"lognormal\")\nTR_ln <- exp(coef(fit_ln)[\"armtreated\"])\ncat(sprintf(\"\\nLog-normal TR: %.3f\\n\", TR_ln))\n\n# ── 3. 
Generalized gamma via flexsurvreg() (diagnostic umbrella) ────────────\nfit_gg <- flexsurvreg(Surv(time, event) ~ arm,\n                      data = cohort, dist = \"gengamma\")\nprint(fit_gg)\n\n# AIC comparison: lower is better; inspect tail plausibility before selecting.\naic_compare <- data.frame(\n  model = c(\"weibull\", \"lognormal\", \"gengamma\"),\n  AIC   = c(AIC(flexsurvreg(Surv(time, event) ~ arm, data = cohort, dist = \"weibull\")),\n            AIC(flexsurvreg(Surv(time, event) ~ arm, data = cohort, dist = \"lnorm\")),\n            AIC(fit_gg))\n)\nprint(aic_compare[order(aic_compare$AIC), ])",
        "description": "Fit Weibull and log-normal AFT models with survreg() from the survival package and\nthe generalized gamma with flexsurvreg() from flexsurv. Compute time ratios by\nexponentiating the treatment coefficient and predict quantiles at each covariate profile.\nRequired input data.frame `cohort` with columns:\n  time  : numeric, time to event or censoring (> 0)\n  event : integer, 1 = event, 0 = censored\n  arm   : factor, reference level = \"control\"\nKey note on survreg(): the 'scale' output is sigma = 1/k for Weibull; the shape is\nk = 1/scale. exp(coef[\"armtreated\"]) from survreg() IS the time ratio.",
        "dependencies": [
          "survival",
          "flexsurv"
        ],
        "source_citations": [
          "wei-1992"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* ── 1. Weibull AFT ── */\nproc lifereg data=work.cohort;\n  class arm (ref='0');\n  model time_months*event(0) = arm / dist=weibull;\n  ods output ParameterEstimates=work.wb_parms FitStatistics=work.wb_fit;\nrun;\n\n/* ── 2. Log-normal AFT ── */\nproc lifereg data=work.cohort;\n  class arm (ref='0');\n  model time_months*event(0) = arm / dist=lnormal;\n  ods output ParameterEstimates=work.ln_parms FitStatistics=work.ln_fit;\nrun;\n\n/* ── 3. Log-logistic AFT ── */\nproc lifereg data=work.cohort;\n  class arm (ref='0');\n  model time_months*event(0) = arm / dist=llogistic;\n  ods output ParameterEstimates=work.ll_parms FitStatistics=work.ll_fit;\nrun;\n\n/* ── 4. Generalized gamma (DIST=GAMMA in PROC LIFEREG) ── */\nproc lifereg data=work.cohort;\n  class arm (ref='0');\n  model time_months*event(0) = arm / dist=gamma;\n  ods output ParameterEstimates=work.gg_parms FitStatistics=work.gg_fit;\nrun;\n\n/* ── 5. Compute time ratio from any model's ODS ParameterEstimates ──\n   The arm coefficient is on the log-time scale; exp() gives the time ratio.    */\ndata work.wb_tr;\n  set work.wb_parms;\n  where parameter = 'arm';\n  time_ratio    = exp(estimate);\n  tr_lower_95ci = exp(estimate - 1.96 * stderr);\n  tr_upper_95ci = exp(estimate + 1.96 * stderr);\n  label time_ratio    = 'Time ratio (treated vs control)'\n        tr_lower_95ci = 'TR lower 95% CI'\n        tr_upper_95ci = 'TR upper 95% CI';\n  format time_ratio tr_lower_95ci tr_upper_95ci 8.4;\nrun;\nproc print data=work.wb_tr label; var time_ratio tr_lower_95ci tr_upper_95ci; run;\n\n/* ── 6. Compare AIC across distributions ──\n   Collect the -2*log-likelihood from each FitStatistics ODS dataset.           
*/\ndata work.aic_compare;\n  length model $20;\n  set work.wb_fit  (in=a)\n      work.ln_fit  (in=b)\n      work.ll_fit  (in=c)\n      work.gg_fit  (in=d);\n  where criterion = '-2 Log Likelihood';\n  if a then model = 'Weibull';\n  if b then model = 'Log-normal';\n  if c then model = 'Log-logistic';\n  if d then model = 'Gen. gamma';\n  /* AIC = -2LL + 2k, where k counts every estimated parameter:\n     intercept + arm + scale = 3, plus one shape for the generalized gamma = 4 */\n  n_parms = 3 + d;\n  aic = value + 2 * n_parms;\n  keep model value n_parms aic;\nrun;\nproc sort data=work.aic_compare; by aic; run;\nproc print data=work.aic_compare; run;\n/* Always inspect fitted survival curves overlaid on the KM curve, and the\n   projected hazard beyond the data, before selecting a distribution; AIC alone\n   does not determine tail plausibility for HTA extrapolation. Floor the all-cause\n   hazard at the general-population life table for lifetime-horizon models in\n   elderly cohorts.                                                              */",
        "description": "Fit Weibull, log-normal, log-logistic, and generalized gamma AFT models using PROC LIFEREG.\nExtract time ratios from the log-time-scale parameter estimates via ODS ParameterEstimates\nand exponentiate in a subsequent DATA step. PROC LIFEREG does not have an ESTIMATE statement\nwith an EXP option; extract via ODS and compute exp() manually.\nRequired dataset work.cohort with variables:\n  time_months : numeric, time to event or censoring (> 0)\n  event       : 0/1, 1 = event, 0 = censored\n  arm         : 0 = control, 1 = treated",
        "dependencies": [],
        "source_citations": [
          "wei-1992",
          "latimer-2013"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Q[Time-to-event analysis<br/>with a survival endpoint] --> PH{PH assumption<br/>tenable?<br/>Schoenfeld test +<br/>log-log plot}\n  PH -- Yes, PH holds --> COX[Cox PH: report HR<br/>+ absolute summary RMST]\n  PH -- No, hazards cross<br/>or converge --> AFT[Fit AFT candidate set:<br/>Weibull / log-normal /<br/>log-logistic / gen. gamma]\n  AFT --> AIC{Compare AIC/BIC<br/>+ KM overlay<br/>+ projected hazard shape}\n  AIC --> TR[Report time ratio TR<br/>= exp beta<br/>with 95% CI]\n  TR --> HTA{HTA extrapolation<br/>to lifetime horizon?}\n  HTA -- Yes --> EXT[Extend per TSD-14 workflow:<br/>floor at general-population<br/>mortality envelope]\n  HTA -- No --> DONE[Report TR + CI<br/>+ quantile predictions<br/>by covariate profile]",
        "caption": "Decision logic for choosing between Cox PH and AFT. When PH is violated, the AFT candidate set is evaluated on AIC, KM overlay, and tail plausibility before selecting the distribution for reporting time ratios and, where needed, HTA extrapolation.",
        "alt_text": "Flowchart routing a time-to-event question through a proportional-hazards check. If PH holds, Cox PH is used. If PH fails, an AFT candidate set is fit and compared on AIC and KM overlay. The selected model produces a time ratio and feeds HTA extrapolation if needed.",
        "source_type": "illustrative",
        "source_citations": [
          "wei-1992",
          "latimer-2013"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "see_also",
        "target_slug": "cox-ph-regression",
        "notes": "The semi-parametric PH alternative: prefer Cox when PH holds and HR is the primary estimand. Prefer AFT when Schoenfeld diagnostics indicate PH violation or when the time-ratio interpretation is clinically primary. For the Weibull family, the two are related by HR = TR^(−k) where k is the Weibull shape parameter."
      },
      {
        "relation_type": "requires",
        "target_slug": "weibull-distribution",
        "notes": "The Weibull is the canonical AFT family and the only distribution that admits both an AFT and a proportional hazards reparameterization, making it the bridge between the two model worlds and the natural starting point for AFT model selection."
      },
      {
        "relation_type": "see_also",
        "target_slug": "log-normal-distribution",
        "notes": "A common AFT error distribution with an arc-shaped hazard that rises then falls; log-normal AFT is appropriate when early risk is low, peaks, and then declines, and is one of the standard TSD-14 candidate distributions for HTA extrapolation."
      },
      {
        "relation_type": "used_with",
        "target_slug": "survival-extrapolation-hta-rwe",
        "notes": "The four AFT families (Weibull, log-normal, log-logistic, generalized gamma) are four of the six standard TSD-14 candidate distributions. The AFT parameterization adds interpretable time ratios to the distributional selection exercise mandated by NICE DSU TSD 14."
      },
      {
        "relation_type": "requires",
        "target_slug": "censoring-mechanisms-rwe",
        "notes": "AFT models assume non-informative censoring; heavy administrative censoring in claims from disenrollment or Medicare Advantage switches can bias distribution parameter estimates and distort extrapolation. Understanding censoring mechanisms is a prerequisite before interpreting AFT model outputs from RWE datasets."
      },
      {
        "relation_type": "see_also",
        "target_slug": "restricted-mean-survival-time-rmst",
        "notes": "RMST is a PH-free alternative effect measure that avoids distributional assumptions within the observed window. Use RMST as a face-validity anchor for an AFT model fit — the model's restricted mean over the observed period should match the empirical RMST — and as a complement when extrapolation beyond observed follow-up is not required."
      }
    ],
    "aliases": [
      "AFT model",
      "accelerated life model",
      "time ratio model",
      "survreg"
    ],
    "complexity": "advanced",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "acs-sdoh-area-level-linkage-rwe",
    "name": "ACS and Area-Level SDoH Linkage",
    "short_definition": "The use of American Community Survey variables and derived neighborhood indices, such as ADI, SVI, or SDI, to add contextual social determinants of health to patient-level RWE datasets through geography and time-period linkage.",
    "long_description": "ACS-based SDoH linkage attaches area-level socioeconomic and neighborhood context to patient-level claims, EHR, registry, or survey data. Common inputs include American Community Survey measures of income, education, unemployment, housing, crowding, transportation, language, and household composition. Those variables may be used directly or through composite indices such as the Area Deprivation Index, Social Vulnerability Index, or Social Deprivation Index.\n\nThe analytic value is real but limited. ACS linkage measures neighborhood context, not individual income, housing, food insecurity, education, or social need. Assigning tract-level deprivation to a patient creates ecological measurement error, especially when geocoding is coarse, addresses are stale, or ZIP codes cross heterogeneous neighborhoods. Time alignment also matters because ACS 5-year estimates aggregate several years.\n\nRWE reports should state the geography, ACS vintage, index construction, linkage rate, handling of missing or invalid addresses, and whether the variable is interpreted as context, confounding control, effect modification, or equity stratification.\n\n**Pros, cons, and trade-offs.** ACS linkage is valuable because it adds social and neighborhood context to data sources that often lack individual social-needs fields. It can support equity stratification, effect-modification analyses, and contextual confounding control. The trade-off is ecological measurement error. A tract or ZIP estimate is not the patient's income, education, housing status, or food security. Composite indices improve summary power but hide which domain is driving the association.\n\n**When to use.** Use ACS variables, ADI, SVI, SDI, or similar area-level measures when the research question concerns neighborhood context, contextual deprivation, place-based access, or equity patterns, and when geography, address date, ACS vintage, and linkage quality can be documented. 
Use direct ACS variables when a specific mechanism matters; use indices when a broad deprivation summary is the target.\n\n**When NOT to use - and when it is actively misleading.** Do not use area-level SDoH as proof of an individual's income, housing instability, transport access, language, or food insecurity. Do not compare tract and ZIP linkage as if precision were equivalent. It is actively misleading to interpret an area deprivation association as an individual social-need effect without acknowledging ecological bias and linkage selection.",
    "primary_category": "Data_Quality_Assessment",
    "tags": [
      "acs",
      "sdoh",
      "social-determinants",
      "area-level-linkage",
      "deprivation-index",
      "adi",
      "svi",
      "sdi",
      "census"
    ],
    "applies_to_study_types": [
      "health_equity",
      "claims_analysis",
      "ehr_study",
      "registry_linkage",
      "comparative_effectiveness"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1001/jamanetworkopen.2023.56121",
        "url": "https://doi.org/10.1001/jamanetworkopen.2023.56121",
        "citation_text": "Morenz AM, Liao JM, Au DH, Hayes SA. Area-Level Socioeconomic Disadvantage and Health Care Spending: A Systematic Review. JAMA Network Open. 2024;7(2):e2356121.",
        "year": 2024,
        "authors_short": "Morenz et al.",
        "notes": "Systematic review of ADI and SVI use in health care spending studies, with Census-derived area-level measures."
      },
      {
        "role": "explain",
        "doi": null,
        "url": "https://shvs.org/resource/leveraging-american-community-survey-acs-data-to-address-social-determinants-of-health-and-advance-health-equity/",
        "citation_text": "State Health and Value Strategies. Leveraging American Community Survey (ACS) Data to Address Social Determinants of Health and Advance Health Equity.",
        "year": 2024,
        "authors_short": "SHVS",
        "notes": "Practical ACS-for-SDoH overview."
      },
      {
        "role": "explain",
        "doi": null,
        "url": "https://www.graham-center.org/evidence-based-research/featured-work/social-deprivation-index",
        "citation_text": "Robert Graham Center. Social Deprivation Index.",
        "year": 2026,
        "authors_short": "Robert Graham Center",
        "notes": "Source for the SDI neighborhood deprivation measure."
      },
      {
        "role": "explain",
        "doi": "10.1056/NEJMp1802313",
        "url": "https://doi.org/10.1056/NEJMp1802313",
        "citation_text": "Kind AJH, Buckingham W. Making Neighborhood-Disadvantage Metrics Accessible - The Neighborhood Atlas. New England Journal of Medicine. 2018;378(26):2456-2458.",
        "year": 2018,
        "authors_short": "Kind and Buckingham",
        "notes": "Neighborhood Atlas / ADI source describing ACS-derived neighborhood-disadvantage metrics and public access."
      },
      {
        "role": "use",
        "doi": null,
        "url": "https://www.census.gov/programs-surveys/acs",
        "citation_text": "U.S. Census Bureau. American Community Survey (ACS).",
        "year": 2026,
        "authors_short": "U.S. Census Bureau",
        "notes": "Official source for ACS data products used in area-level SDoH linkage."
      }
    ],
    "plain_language_summary": "ACS linkage adds neighborhood social context to health data. It can tell you about the area where a patient likely lives, but it does not prove that patient's individual income, housing, or social needs.",
    "key_terms": [
      {
        "term": "American Community Survey",
        "definition": "Census Bureau survey producing demographic, economic, housing, and social estimates for geographic areas."
      },
      {
        "term": "Area-level SDoH",
        "definition": "Neighborhood or geographic social context assigned to individuals through residence or care-location geography."
      },
      {
        "term": "Ecological measurement error",
        "definition": "Error from using group-level geography to infer individual-level circumstances."
      }
    ],
    "worked_example": {
      "scenario": "A claims-EHR linked study adds neighborhood deprivation to a diabetes outcomes cohort. Patients are geocoded to census tract when possible and ZIP code when tract is unavailable. The analyst links 2018-2022 ACS 5-year variables to the address closest to index date.",
      "dataset": {
        "caption": "Area-level SDoH linkage quality.",
        "columns": [
          "patient_id",
          "address_date",
          "geography",
          "acs_vintage",
          "linked_variable",
          "quality_flag"
        ],
        "rows": [
          [
            "P001",
            "2023-02-11",
            "census tract",
            "ACS 2018-2022 5-year",
            "tract poverty percentile",
            "high precision"
          ],
          [
            "P002",
            "2021-09-04",
            "ZIP code",
            "ACS 2018-2022 5-year",
            "ZIP median income",
            "coarse geography"
          ],
          [
            "P003",
            "missing",
            "none",
            "none",
            "none",
            "unlinked"
          ]
        ]
      },
      "steps": [
        "Choose address timing relative to the clinical index date.",
        "Link ACS variables or indices at the finest permitted geography.",
        "Flag coarse or failed geocodes separately from linked records.",
        "Compare linked and unlinked patients to assess selection."
      ],
      "result": "The study uses tract-level context where available, flags ZIP-only linkage as lower precision, and avoids interpreting deprivation as the patient's individual income or housing status."
    },
    "prerequisites": [],
    "index_definitions": [
      {
        "name": "ACS variables",
        "definition": "Census-derived area measures such as income, education, unemployment, housing, transportation, language, and household composition.",
        "source": "American Community Survey",
        "use": "Direct contextual covariates or inputs to composite indices.",
        "notes": "Match geography and vintage to the study period."
      },
      {
        "name": "Area Deprivation Index",
        "definition": "Composite neighborhood deprivation measure built from census/ACS socioeconomic indicators.",
        "source": "Singh / Neighborhood Atlas lineage",
        "use": "Contextual deprivation adjustment, equity stratification, and effect modification.",
        "notes": "Often distributed as ranks; rank geography and vintage matter."
      },
      {
        "name": "Social Deprivation Index",
        "definition": "Composite index using ACS variables to measure social deprivation relevant to healthcare access and utilization.",
        "source": "Robert Graham Center",
        "use": "Area-level deprivation adjustment and subgroup analysis.",
        "notes": "Measures neighborhood deprivation, not individual social need."
      },
      {
        "name": "Social Vulnerability Index",
        "definition": "Census/ACS-based index summarizing community vulnerability across socioeconomic, household, minority/language, and housing/transportation themes.",
        "source": "CDC/ATSDR SVI",
        "use": "Community vulnerability context and public-health equity analyses.",
        "notes": "Interpret as contextual vulnerability, not patient-level exposure."
      }
    ],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Direct ACS variable linkage",
        "description": "Link individual ACS variables such as poverty, unemployment, education, housing, or transportation.",
        "edge_cases": [
          "Multiple comparisons and collinearity.",
          "Variables may not represent the individual patient."
        ],
        "data_source_notes": "Most transparent when the research question names a specific contextual domain."
      },
      {
        "name": "Composite deprivation index linkage",
        "description": "Link ADI, SVI, SDI, or study-built composite measures.",
        "edge_cases": [
          "Index construction and rank geography affect interpretation.",
          "Different indices are not interchangeable."
        ],
        "data_source_notes": "Better for summary stratification but less specific mechanistically."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Individual-level SDoH screening",
        "use_acs_when": "Individual social-needs data are absent but neighborhood context is relevant.",
        "use_individual_sdoh_when": "The question concerns patient-level food, housing, transport, financial, or social need.",
        "notes": "Area-level measures are contextual proxies and should be labeled that way."
      }
    ],
    "implementation_notes_by_data_source": {
      "linked": "Address governance and geocoding precision determine feasibility; retain linkage quality flags.",
      "claims": "ZIP may be all that is available, which is usable for broad context but weak for neighborhood inference.",
      "ehr": "EHR addresses can be stale; choose closest valid address before index and document refresh logic."
    },
    "implementations": [],
    "diagrams": [],
    "relations": [
      {
        "relation_type": "see_also",
        "target_slug": "sdoh-social-determinants-of-health",
        "notes": "This entry focuses on ACS/geographic linkage within the broader SDoH concept family."
      },
      {
        "relation_type": "requires",
        "target_slug": "tokenization-privacy-preserving-record-linkage-rwe",
        "notes": "Address linkage and de-identification governance determine whether area-level enrichment is possible."
      },
      {
        "relation_type": "see_also",
        "target_slug": "ecological",
        "notes": "Area-level SDoH variables create ecological interpretation risks."
      }
    ],
    "aliases": [
      "ACS SDoH",
      "American Community Survey SDoH",
      "area-level SDoH",
      "neighborhood deprivation linkage",
      "ADI linkage",
      "SVI linkage",
      "SDI linkage",
      "Census SDoH"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "hta",
      "fda"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "active-comparator-new-user",
    "name": "Active Comparator, New-User Design",
    "short_definition": "A cohort design that restricts to patients initiating either a study drug or a clinically interchangeable active comparator after a drug-free washout, with follow-up starting at initiation (time zero), to control confounding by indication and prevalent-user bias.",
    "long_description": "The **active comparator, new-user (ACNU) design** combines two restrictions that attack the two dominant sources of bias\nin observational drug studies. The **new-user (incident-user) restriction** requires that a patient have no dispensing of\nthe study drug or its comparator during a defined washout, so that follow-up starts at first exposure (time zero) for\neveryone. The **active-comparator restriction** chooses the reference group as initiators of a different drug used for the\n*same* indication, rather than non-users. Together they emulate the eligibility and treatment-assignment structure of the\nhead-to-head randomized trial you wish you could run.\n\n**Core conceptual distinction.** Two design choices are doing the work, and they are separable. (1) *New-user vs prevalent-user*:\nstarting follow-up at initiation removes immortal time, prevents adjustment for post-initiation variables on the causal\npathway, and avoids depletion-of-susceptibles (the survivors who tolerate a drug look healthier than incident users). (2)\n*Active comparator vs non-user*: comparing two treatment decisions for the same indication removes most **confounding by\nindication** and **healthy-user/healthy-adherer bias**, because both arms cleared the same clinical threshold to be treated.\nThe estimand is the comparative (drug A vs drug B) effect on initiation — an intention-to-treat-like contrast under a\nfirst-line strategy, or an as-treated/per-protocol contrast if you censor at switching/discontinuation and weight for\ninformative censoring. ACNU does **not** estimate \"drug vs no drug\"; if that is the policy question, the active comparator\nis the wrong reference.\n\n**Pros, cons, and trade-offs.**\n- **vs new-user with a non-user / unexposed comparator:** ACNU removes confounding by indication and healthy-user bias that\n  cripple drug-vs-no-drug comparisons in claims, and yields better covariate overlap (both arms are treated). 
Cost: it\n  answers a narrower question and loses power when the comparator is rarely used; if the comparator has its own effect on\n  the outcome, the contrast is shifted, not unbiased for an absolute effect. **Prefer ACNU** for nearly all comparative\n  safety/effectiveness questions among chronic-disease therapies.\n- **vs prevalent-user / ever-exposed designs:** ACNU eliminates survivor bias, depletion of susceptibles, and time-zero\n  misalignment. Cost: smaller cohorts and a population skewed toward initiators, who may differ from the prevalent users\n  who dominate real-world practice. **Prefer ACNU** when early effects matter or when prevalent-user bias is plausible;\n  consider a prevalent-new-user (Suissa) extension when initiation is too rare.\n- **vs target-trial emulation with clone-censor-weight:** ACNU is the analytic core of most two-drug target-trial\n  emulations and is far simpler to specify and defend. Cost: it is less flexible for sustained/dynamic strategies, grace\n  periods that create eligibility-time ambiguity, or multi-option regimens, where g-methods or clone-censor-weight add\n  value. **Prefer plain ACNU** unless the protocol genuinely requires a dynamic per-protocol estimand.\n\n**When to use.** Head-to-head comparative effectiveness or safety of two drugs for the same indication in claims, EHR, or\nregistry data; building the analytic engine of a target-trial emulation; any setting where confounding by indication would\ndoom a drug-vs-non-user contrast. 
A defensible comparator is the linchpin: it should treat the same indication, be a\nplausible alternative for the *same* patients at the *same* decision point, and not itself cause (or prevent) the outcome.\n\n**When NOT to use — and when it is actively misleading.**\n- **No clinically interchangeable comparator exists.** Forcing a comparator that is prescribed to systematically different\n  patients (e.g., a second-line agent vs a first-line agent) re-introduces confounding by indication and *channeling* —\n  the bias you came to remove. Diagnose with baseline covariate balance and clinical review before trusting the cohort.\n- **The comparator affects the outcome of interest.** Comparing two antihypertensives on stroke is fine; comparing them on\n  a renal outcome that one drug class directly modifies makes the \"null comparator\" assumption false.\n- **The genuine question is drug vs no treatment** (e.g., uptake, adherence's effect on cost). ACNU cannot answer it.\n- **Severe non-overlap / positivity violation.** If one drug is reserved for sicker or renally-impaired patients, PS\n  distributions separate, matching discards much of the cohort, and the surviving estimand no longer maps to a meaningful\n  population.\n- **One drug is much older.** Calendar-time imbalance (the comparator was first-line for a decade before the study drug\n  launched) creates secular confounding; require both drugs to be co-available and consider restricting to overlapping\n  calendar time.\n\n**Data-source operational depth.**\n- **Claims (FFS or commercial):** Exposure is the pharmacy claim (NDC + `fill_date` + `days_supply`). Require continuous\n  medical + pharmacy enrollment across the full washout (commonly 365 days) so the absence of prior dispensing is real,\n  not unobserved. Confirm indication with diagnosis codes in the baseline window. 
Index date = first qualifying fill.\n  Failure modes: Medicare Advantage and bundled/capitated arrangements drop fee-for-service claims, so \"no prior fill\" can\n  be missingness, not a true washout — restrict to enrollees with both Parts A/B/D (or commercial pharmacy benefit) and\n  exclude MA-only person-time. Sample fills, 90-day mail-order, and free samples distort `days_supply`.\n- **EHR:** Initiation is the *order* or *administration*, not the dispensing; linkage to pharmacy fills is preferred to\n  confirm the patient actually started. Problem lists, labs, and notes sharpen indication and baseline severity (an\n  advantage over claims), but visit-driven capture means a patient who leaves the system is differentially lost — define\n  observation windows explicitly and treat loss to follow-up as potentially informative.\n- **Registry:** Strongest for indication, disease severity, and adjudicated outcomes (e.g., cancer stage); typically weak\n  for complete pharmacy exposure. Link to claims for the full fill history and to a death index to firm up censoring.\n- **Linked claims–EHR–vital records:** The ideal substrate — EHR severity + claims completeness + reliable mortality — but\n  linkage introduces selection (only the linkable subset) and date-discrepancy issues between order, fill, and service\n  dates that must be reconciled before time-zero assignment.\n\n**Worked claims example.** Question: incident heart failure with second-generation sulfonylurea vs DPP-4 inhibitor among\nadults with type 2 diabetes in a commercial + Medicare FFS database. (1) Eligibility: age ≥18, ≥2 diabetes diagnoses, and\n365 days of continuous A/B/D (or commercial medical+pharmacy) enrollment before the first study fill. (2) Washout: no fill\nof *any* sulfonylurea or DPP-4 inhibitor in the 365-day lookback — this is what makes both arms incident users. (3) Time\nzero: the date of that first qualifying fill; assign the arm from the NDC dispensed on that date. 
(4) Baseline covariates:\nmeasured only in the 365 days up to and including time zero (comorbidities, HbA1c proxies, prior insulin, healthcare\nutilization), feeding a high-dimensional propensity score. (5) Follow-up: from time zero to first validated HF event,\ncensoring at disenrollment, death, end of data, and — for an as-treated analysis — treatment discontinuation (last\n`days_supply` end + a pre-specified grace period) or switch to the other arm. (6) Apply 1:1 PS matching (or overlap\nweighting), check standardized differences <0.1, and run sensitivity analyses on washout length, grace period, and a\nnegative-control outcome to detect residual confounding.",
    "primary_category": "Study_Design",
    "tags": [
      "active-comparator",
      "new-user-design",
      "incident-user",
      "confounding-by-indication",
      "pharmacoepidemiology",
      "head-to-head",
      "target-trial",
      "propensity-score"
    ],
    "applies_to_study_types": [
      "active_comparator_new_user"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1093/aje/kwg231",
        "url": "https://doi.org/10.1093/aje/kwg231",
        "citation_text": "Ray WA. Evaluating medication effects outside of clinical trials: new-user designs. American Journal of Epidemiology. 2003;158(9):915-920.",
        "year": 2003,
        "authors_short": "Ray",
        "notes": "Foundational statement of the new-user (incident-user) design and the biases that prevalent-user cohorts induce."
      },
      {
        "role": "introduce",
        "doi": "10.1007/s40471-015-0053-5",
        "url": "https://doi.org/10.1007/s40471-015-0053-5",
        "citation_text": "Lund JL, Richardson DB, Stürmer T. The active comparator, new user study design in pharmacoepidemiology: historical foundations and contemporary application. Current Epidemiology Reports. 2015;2(4):221-228.",
        "year": 2015,
        "authors_short": "Lund et al.",
        "notes": "Canonical articulation of the combined active-comparator, new-user framework with administrative-data application."
      },
      {
        "role": "explain",
        "doi": "10.1038/nrrheum.2015.30",
        "url": "https://doi.org/10.1038/nrrheum.2015.30",
        "citation_text": "Yoshida K, Solomon DH, Kim SC. Active-comparator design and new-user design in observational studies. Nature Reviews Rheumatology. 2015;11(7):437-441.",
        "year": 2015,
        "authors_short": "Yoshida et al.",
        "notes": "Accessible tutorial with clear diagrams of the comparator and new-user restrictions and their effect on confounding."
      }
    ],
    "plain_language_summary": "The active comparator, new-user design is a way to compare two drugs fairly using everyday healthcare records. You only keep patients who are just starting one of two competing drugs for the same condition (so neither group has been on its drug for years), you make everyone's first fill their shared 'day zero,' and then you watch both groups forward in time under the exact same rules. Comparing two real treatment choices for the same illness — rather than treated patients against untreated ones — keeps the two groups similar, so a difference in outcomes is more likely to be the drug and not the kind of patient who got it. It cannot answer 'is this drug better than no drug,' only 'is drug A better than drug B.'",
    "key_terms": [
      {
        "term": "active comparator",
        "definition": "The other drug you compare against — a real alternative prescribed for the same condition, not a group of untreated patients."
      },
      {
        "term": "new user (incident user)",
        "definition": "A patient who is newly starting the drug, with no fills of it (or of the comparator) during the washout period."
      },
      {
        "term": "washout",
        "definition": "A drug-free lookback period (here 365 days) with no fills of either drug, which shows the patient is newly starting treatment rather than continuing it."
      },
      {
        "term": "index date (time zero)",
        "definition": "Each patient's starting line: the date of their first qualifying fill, when their follow-up clock begins. Lining every patient up at this point makes the two groups comparable in time."
      },
      {
        "term": "days_supply",
        "definition": "How many days one filled prescription is meant to last, recorded on each pharmacy claim (e.g., a 90-day fill)."
      },
      {
        "term": "confounding by indication",
        "definition": "When sicker patients get one drug and healthier patients get another, so the patients differ before treatment even begins and muddy any comparison."
      }
    ],
    "worked_example": {
      "scenario": "We want to compare two diabetes drugs on the risk of being hospitalized for heart failure: a sulfonylurea (glipizide, our study drug) versus a DPP-4 inhibitor (sitagliptin, our active comparator). We pull pharmacy claims for two adults with type 2 diabetes. We require each to have a clean 365-day drug-free washout (no fill of either drug class) so both are new starters, set each patient's first qualifying fill as their day zero, and follow both forward for 180 days under identical rules to see whether either is hospitalized for heart failure.",
      "dataset": {
        "caption": "The raw rows an analyst would see in a claims pharmacy table, one row per fill. drug_class flags whether the fill is the study drug or the active comparator.",
        "columns": [
          "person_id",
          "fill_date",
          "drug",
          "drug_class",
          "days_supply"
        ],
        "rows": [
          [
            2001,
            "2024-01-01",
            "glipizide",
            "STUDY",
            90
          ],
          [
            2001,
            "2024-04-01",
            "glipizide",
            "STUDY",
            90
          ],
          [
            2002,
            "2024-01-01",
            "sitagliptin",
            "COMPARATOR",
            90
          ],
          [
            2002,
            "2024-04-01",
            "sitagliptin",
            "COMPARATOR",
            90
          ]
        ]
      },
      "steps": [
        "Check the washout: for each patient, look back 365 days before their first fill (all of 2023). Neither patient has any glipizide or sitagliptin fill in that window, so both qualify as brand-new starters.",
        "Set day zero: patient 2001's first fill (glipizide) and patient 2002's first fill (sitagliptin) are both on 2024-01-01, so both clocks start on the same aligned index date.",
        "Assign the arm from the drug filled on day zero: 2001 goes to the STUDY arm, 2002 goes to the COMPARATOR arm.",
        "Follow both forward for 180 days (2024-01-01 to 2024-06-29) under identical rules, watching for a heart failure hospitalization.",
        "Patient 2001 (study drug) is hospitalized for heart failure on 2024-05-15, which is day 135 of follow-up. Patient 2002 (comparator) reaches day 180 with no event and is censored at the end of the window.",
        "Because both patients cleared the same washout, share the same day zero, and follow the same rules, the only structural difference between them is which drug they started."
      ],
      "result": "Of 2 new initiators (1 study, 1 comparator), the study-drug patient had 1 heart failure hospitalization at day 135 of a 180-day follow-up; the comparator patient had 0 events over the full 180 days. Both had a clean 365-day washout and a shared index date of 2024-01-01, so the comparison is of two aligned first-time starters rather than of treated-vs-untreated patients.",
      "timeline_spec": {
        "title": "Active comparator, new-user design: two aligned first-time starters (study drug vs active comparator)",
        "window": {
          "start": "2023-01-01",
          "end": "2024-06-29",
          "label": "365-day shared washout, aligned index date 2024-01-01, then 180-day follow-up"
        },
        "events": [
          {
            "label": "Patient 2001 (STUDY) - glipizide Fill 1",
            "start": "2024-01-01",
            "length_days": 90,
            "quantity": "90 days_supply"
          },
          {
            "label": "Patient 2001 (STUDY) - glipizide Fill 2",
            "start": "2024-04-01",
            "length_days": 90,
            "quantity": "90 days_supply"
          },
          {
            "label": "Patient 2002 (COMPARATOR) - sitagliptin Fill 1",
            "start": "2024-01-01",
            "length_days": 90,
            "quantity": "90 days_supply"
          },
          {
            "label": "Patient 2002 (COMPARATOR) - sitagliptin Fill 2",
            "start": "2024-04-01",
            "length_days": 90,
            "quantity": "90 days_supply"
          }
        ],
        "spans": [
          {
            "kind": "washout",
            "start": "2023-01-01",
            "end": "2023-12-31",
            "label": "365-day drug-free washout (both patients, no study or comparator fill)"
          },
          {
            "kind": "exposed",
            "start": "2024-01-01",
            "end": "2024-06-29",
            "label": "Patient 2001 (STUDY) on-treatment follow-up"
          },
          {
            "kind": "followup",
            "start": "2024-05-15",
            "end": "2024-05-15",
            "label": "Patient 2001 heart failure hospitalization (day 135)"
          },
          {
            "kind": "exposed",
            "start": "2024-01-01",
            "end": "2024-06-29",
            "label": "Patient 2002 (COMPARATOR) on-treatment follow-up, no event"
          }
        ],
        "result": {
          "label": "Shared index 2024-01-01; study arm 1 HF event at day 135, comparator arm 0 events over 180 days",
          "value": 135
        },
        "caption": "Two new users for the same indication start at an aligned day zero after the same 365-day washout: one on the study drug, one on the active comparator. Because baseline is measured up to the shared index fill and follow-up begins at the fill under identical rules for both arms, the design reduces confounding by indication and removes the head start that prevalent users would have.",
        "alt_text": "Timeline with a 365-day drug-free washout across all of 2023 for both patients, a shared index date on 2024-01-01, and a 180-day follow-up. The study-drug patient (glipizide) has two 90-day fills and a heart failure hospitalization at day 135; the active-comparator patient (sitagliptin) has two 90-day fills and no event through day 180."
      }
    },
    "prerequisites": [
      "new-user-design",
      "washout-clean-lookback-period-rwe",
      "time-zero-index-date-alignment-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Strict within-class active comparator",
        "description": "Comparator drawn from the same therapeutic/indication class as the study drug (e.g., one DOAC vs another DOAC, one statin vs another statin) to minimize channeling and maximize covariate overlap.",
        "edge_cases": [
          "Within-class agents can still be preferentially prescribed (e.g., one DOAC favored in renal impairment), leaving residual channeling that balance diagnostics must catch.",
          "Requires rich indication/comorbidity data to confirm the two drugs treat the same patients at the same decision point."
        ],
        "data_source_notes": "claims: build NDC lists by therapeutic class and confirm indication with baseline diagnosis codes; EHR: use problem lists or NLP to confirm indication at the index encounter."
      },
      {
        "name": "ACNU with propensity-score matching or weighting",
        "description": "After cohort construction, balance pre-index confounders by 1:1 PS matching, IPTW, or overlap weighting using only covariates measured up to and including time zero.",
        "edge_cases": [
          "Near-positivity violations leave some initiators without acceptable matches; the matched estimand shifts toward the region of overlap.",
          "Weight truncation trades variance against bias and changes the target population."
        ],
        "data_source_notes": "claims: a high-dimensional PS (hundreds of diagnosis/procedure/drug-class proxies from the lookback) is widely used and can partly stand in for key confounders that are not directly measured."
      },
      {
        "name": "As-treated ACNU with time-varying exposure",
        "description": "Exposure is followed as time-varying after initiation (on-treatment windows, grace periods, switching rules) rather than an initiation-only (ITT-like) contrast.",
        "edge_cases": [
          "Differential discontinuation or switching by arm makes naive as-treated estimates susceptible to informative censoring; inverse-probability-of-censoring weighting is usually required."
        ],
        "data_source_notes": "Depends on careful episode construction (days_supply stitching, grace periods, stockpiling rules)."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "New-user design with a non-user comparator",
        "pros_of_this": "Reduces confounding by indication and healthy-user/healthy-adherer bias; both arms reflect a treatment decision, improving covariate overlap.",
        "cons_of_this": "Answers a narrower comparative question and loses power when the comparator is uncommon; cannot recover the effect versus no treatment, because the contrast also reflects any effect the comparator itself has on the outcome.",
        "when_to_prefer": "Comparative safety/effectiveness of two drugs for the same indication, where drug-vs-non-user would be confounded by indication."
      },
      {
        "compared_to": "Prevalent-user / ever-exposed designs",
        "pros_of_this": "Avoids depletion of susceptibles, survivor bias, and immortal time, and prevents adjustment for post-initiation mediators.",
        "cons_of_this": "Smaller cohorts; initiators may not represent prevalent users who dominate practice.",
        "when_to_prefer": "Questions about initiation effects, early harms/benefits, or whenever prevalent-user bias is plausible."
      },
      {
        "compared_to": "Target-trial emulation with clone-censor-weight",
        "pros_of_this": "Simpler to specify, communicate, and defend; the new-user + active-comparator + time-zero structure already maps to trial eligibility and assignment.",
        "cons_of_this": "Less flexible for sustained or dynamic strategies, for eligibility that is ambiguous during a grace period, or for multi-option regimens where g-methods are needed.",
        "when_to_prefer": "Two-drug head-to-head comparisons that do not require a dynamic per-protocol estimand."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Exposure = pharmacy claim (NDC + fill_date + days_supply). Require continuous medical + pharmacy enrollment across the entire washout so absence of prior fills is observed, not missing; exclude Medicare Advantage-only person-time where fee-for-service claims are unavailable. Time zero = first qualifying fill; assign arm from that NDC. Censor at disenrollment, death, end of data, and (as-treated) discontinuation/switch.",
      "ehr": "Initiation = order or administration; prefer linked dispensing to confirm the patient started. Use problem lists, labs, and notes to confirm indication and baseline severity. Define observation windows explicitly and treat loss to follow-up as potentially informative.",
      "registry": "Strong for indication, severity, and adjudicated outcomes; weak for full pharmacy exposure. Link to claims for fills and to a death index for censoring and mortality outcomes.",
      "linked": "Linked claims-EHR-vital-records is the ideal substrate (severity + completeness + mortality) but introduces linkage selection and order/fill/service date discrepancies that must be reconciled before time-zero assignment."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\n\nWASHOUT_DAYS = 365  # drug-free + continuous-enrollment lookback that defines \"new user\"\n\ndef build_acnu_cohort(rx: pd.DataFrame, enroll: pd.DataFrame) -> pd.DataFrame:\n    rx = rx.sort_values([\"person_id\", \"fill_date\"])\n\n    # Candidate index = first fill of EITHER the study drug or the active comparator.\n    study_fills = rx[rx[\"drug_class\"].isin([\"STUDY\", \"COMPARATOR\"])]\n    idx = (study_fills.groupby(\"person_id\")\n                      .first()\n                      .reset_index()\n                      .rename(columns={\"fill_date\": \"index_date\", \"drug_class\": \"arm\"}))\n\n    # New-user check: no fill of study OR comparator in the WASHOUT_DAYS before the index date.\n    # index_date is already the earliest observed qualifying fill, so this filter is a safeguard that\n    # only matters if the index rule is later relaxed (e.g., to allow re-initiation after a gap).\n    prior = study_fills.merge(idx[[\"person_id\", \"index_date\"]], on=\"person_id\")\n    prior_in_washout = prior[(prior[\"fill_date\"] < prior[\"index_date\"]) &\n                             (prior[\"fill_date\"] >= prior[\"index_date\"] - pd.Timedelta(days=WASHOUT_DAYS))]\n    idx = idx[~idx[\"person_id\"].isin(prior_in_washout[\"person_id\"])].copy()\n\n    # Continuous, FFS-observable enrollment spanning the full washout through index (no MA-only gaps).\n    # A single enrollment span must cover the whole window; adjacent spans are assumed already merged.\n    e = enroll.merge(idx[[\"person_id\", \"index_date\"]], on=\"person_id\")\n    e[\"covers\"] = ((e[\"enroll_start\"] <= e[\"index_date\"] - pd.Timedelta(days=WASHOUT_DAYS)) &\n                   (e[\"enroll_end\"]   >= e[\"index_date\"]) &\n                   (~e[\"ma_only\"]))\n    eligible = e.loc[e[\"covers\"], \"person_id\"].unique()\n\n    cohort = idx[idx[\"person_id\"].isin(eligible)].copy()\n    cohort[\"baseline_start\"] = cohort[\"index_date\"] - pd.Timedelta(days=WASHOUT_DAYS)  # covariate window\n    return cohort[[\"person_id\", \"arm\", \"index_date\", \"baseline_start\"]]",
        "description": "ACNU cohort construction from claims-style inputs. Required inputs (already cleaned and de-duplicated):\n  rx     : pharmacy fills  -> person_id, fill_date (datetime), drug_class in {'STUDY','COMPARATOR'}, days_supply\n  enroll : enrollment spans -> person_id, enroll_start, enroll_end, ma_only (bool)  # ma_only person-time lacks FFS claims\nReturns one row per eligible new initiator with arm and time zero. Build covariates and the propensity score only from\nthe returned [baseline_start, index_date] window, and apply outcome/censoring rules identically to both arms downstream.",
        "dependencies": [
          "pandas"
        ],
        "source_citations": [
          "lund-2015"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\nWASHOUT_DAYS <- 365L\n\nbuild_acnu_cohort <- function(rx, enroll) {\n  setDT(rx); setDT(enroll)\n  setorder(rx, person_id, fill_date)\n\n  study <- rx[drug_class %chin% c(\"STUDY\", \"COMPARATOR\")]\n  idx <- study[, .(index_date = fill_date[1L], arm = drug_class[1L]), by = person_id]\n\n  # New-user: drop anyone with a study/comparator fill in the washout window before index.\n  study <- merge(study, idx[, .(person_id, index_date)], by = \"person_id\")\n  prior_ids <- unique(study[fill_date < index_date &\n                            fill_date >= index_date - WASHOUT_DAYS, person_id])\n  # %in% rather than %chin%: person_id may be numeric, and %chin% accepts only character vectors.\n  idx <- idx[!person_id %in% prior_ids]\n\n  # Continuous FFS-observable enrollment across the full washout through index.\n  e <- merge(enroll, idx[, .(person_id, index_date)], by = \"person_id\")\n  ok <- e[enroll_start <= index_date - WASHOUT_DAYS &\n          enroll_end   >= index_date & !ma_only, unique(person_id)]\n\n  cohort <- idx[person_id %in% ok]\n  cohort[, baseline_start := index_date - WASHOUT_DAYS]\n  cohort[, .(person_id, arm, index_date, baseline_start)]\n}",
        "description": "ACNU cohort construction with data.table. Inputs mirror the Python version:\n  rx     : person_id, fill_date (Date), drug_class in {'STUDY','COMPARATOR'}, days_supply\n  enroll : person_id, enroll_start, enroll_end, ma_only (logical)",
        "dependencies": [
          "data.table"
        ],
        "source_citations": [
          "lund-2015"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "%let washout = 365;\n\n/* First fill of study drug or comparator = candidate time zero, with the arm dispensed that day. */\n/* Keep only qualifying fills, sort by person and date, then take the earliest row per person. */\nproc sort data=work.rx(where=(drug_class in ('STUDY','COMPARATOR'))) out=rx_q;\n  by person_id fill_date;\nrun;\n\ndata idx;\n  set rx_q;\n  by person_id;\n  if first.person_id;\n  index_date = fill_date;\n  format index_date date9.;\n  length arm $12;\n  arm = drug_class;             /* arm = drug_class on the earliest qualifying fill */\n  keep person_id index_date arm;\nrun;\n\n/* New-user restriction: exclude any prior study/comparator fill inside the washout window. */\nproc sql;\n  create table newuser as\n  select i.*\n  from idx i\n  where not exists (\n    select 1 from work.rx p\n    where p.person_id = i.person_id\n      and p.drug_class in ('STUDY','COMPARATOR')\n      and p.fill_date <  i.index_date\n      and p.fill_date >= i.index_date - &washout\n  );\nquit;\n\n/* Continuous, FFS-observable enrollment across the full washout through index (no MA-only spans). */\nproc sql;\n  create table cohort as\n  select n.person_id, n.arm, n.index_date,\n         n.index_date - &washout as baseline_start format=date9.\n  from newuser n\n  where exists (\n    select 1 from work.enroll e\n    where e.person_id = n.person_id\n      and e.ma_only = 0\n      and e.enroll_start <= n.index_date - &washout\n      and e.enroll_end   >= n.index_date\n  );\nquit;\n\n/* 1:1 nearest-neighbor PS matching on baseline covariates (the standard ACNU analytic step). 
*/\nproc psmatch data=work.analytic region=allobs;            /* analytic = cohort joined to work.cov */\n  class arm <categorical baseline covariates>;\n  psmodel arm(treated='STUDY') = <baseline covariates>;    /* covariates from the baseline window only */\n  match method=greedy(k=1) distance=lps caliper=0.2;       /* 0.2 SD of the logit-PS caliper */\n  assess lps var=(<key covariates>) / plots=(boxplot);     /* standardized differences pre/post match */\n  output out(obs=match)=matched matchid=mid;\nrun;",
        "description": "ACNU cohort construction and 1:1 PS matching in SAS. Required input datasets (post data-management):\n  work.rx     : person_id, fill_date, drug_class ('STUDY'/'COMPARATOR'), days_supply\n  work.enroll : person_id, enroll_start, enroll_end, ma_only (0/1)\n  work.cov    : person_id + baseline covariates measured in [index_date-365, index_date]\nPROC PSMATCH requires SAS/STAT 14.2+; confirm post-match standardized mean differences are <0.1 before fitting the\noutcome model on the matched set.",
        "dependencies": [],
        "source_citations": [
          "lund-2015"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": "active-comparator-new-user-timeline.svg",
        "mermaid": null,
        "caption": "Two new users for the same indication start at an aligned day zero after the same 365-day washout: one on the study drug, one on the active comparator. Because baseline is measured up to the shared index fill and follow-up begins at the fill under identical rules for both arms, the design reduces confounding by indication and removes the head start that prevalent users would have.",
        "alt_text": "Timeline with a 365-day drug-free washout across all of 2023 for both patients, a shared index date on 2024-01-01, and a 180-day follow-up. The study-drug patient (glipizide) has two 90-day fills and a heart failure hospitalization at day 135; the active-comparator patient (sitagliptin) has two 90-day fills and no event through day 180.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Pop[Source population with the indication] --> Wash[Continuous enrollment + drug-free washout<br/>no study drug or comparator fill]\n  Wash --> Init[First fill of study drug<br/>OR active comparator]\n  Init --> T0[Time zero = index fill date<br/>assign arm from dispensed NDC]\n  T0 --> Base[Baseline covariates measured<br/>only up to time zero -> propensity score]\n  Base --> Fup[Follow-up: identical outcome + censoring rules in both arms<br/>censor at disenroll / death / end of data / switch]\n  Fup --> Sens[Sensitivity: washout length, grace period,<br/>negative-control outcome, alternative comparator]",
        "caption": "Operational ACNU flow in real-world data. The washout establishes incident-user status, the active comparator reduces confounding by indication, time zero aligns follow-up at initiation, and outcome/censoring rules are identical across arms.",
        "alt_text": "Flowchart from source population through washout, first fill, time zero, baseline covariate measurement, follow-up, and sensitivity analyses for the active comparator new-user design.",
        "source_type": "illustrative",
        "source_citations": [
          "lund-2015"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "gantt\n  title ACNU timeline for one new initiator (claims)\n  dateFormat YYYY-MM-DD\n  axisFormat %b %Y\n  section Baseline\n  Continuous enrollment + washout (no study/comparator fill) :done, wash, 2023-01-01, 2023-12-31\n  section Time zero\n  First qualifying fill -> arm assignment :milestone, t0, 2024-01-01, 0d\n  section Follow-up\n  On-treatment exposure (days_supply + grace) :active, fu, 2024-01-01, 180d\n  Censor at switch / disenroll / death / data end :crit, cen, 2024-06-29, 1d",
        "caption": "Time-zero alignment for a single initiator. Because baseline is measured before the index fill and follow-up starts at the fill, there is no immortal time and no adjustment for post-initiation variables.",
        "alt_text": "Gantt timeline showing a 365-day washout in 2023, time zero at the first fill on 2024-01-01, an on-treatment follow-up window, and a censoring point.",
        "source_type": "illustrative",
        "source_citations": [
          "lund-2015"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "is_variant_of",
        "target_slug": "new-user-design",
        "notes": "ACNU adds an active-comparator requirement to the basic new-user restriction to further control confounding by indication and healthy-user bias."
      },
      {
        "relation_type": "used_with",
        "target_slug": "target-trial-emulation",
        "notes": "ACNU is the usual design core of a head-to-head target-trial emulation; new-user + active comparator + time-zero alignment map directly onto trial eligibility and assignment."
      },
      {
        "relation_type": "used_with",
        "target_slug": "propensity-score-methods-psm-iptw",
        "notes": "PS matching, IPTW, or overlap weighting on pre-index covariates is the standard balancing step after ACNU cohort construction."
      },
      {
        "relation_type": "see_also",
        "target_slug": "immortal-time-bias-handling",
        "notes": "Setting time zero at the index fill prevents the immortal time that arises when follow-up starts before the exposure decision (e.g., at diagnosis)."
      },
      {
        "relation_type": "see_also",
        "target_slug": "healthy-user-bias",
        "notes": "An active comparator mitigates healthy-user bias because both arms reflect a decision to treat; residual differences can remain and require adjustment or negative controls."
      },
      {
        "relation_type": "see_also",
        "target_slug": "prevalent-user-bias",
        "notes": "Prevalent-user/ever-exposed designs mix patients at different points in their treatment trajectory and suffer depletion of susceptibles; ACNU avoids these by design."
      }
    ],
    "aliases": [
      "ACNU",
      "active comparator new-user design",
      "active comparator, new user design",
      "incident active-comparator design"
    ],
    "complexity": "foundational",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "acute-event-deduplication-window-rwe",
    "name": "Acute Event Deduplication Window",
    "short_definition": "A pre-specified gap (or \"blackout\") rule that collapses multiple codes or encounters for the same acute condition occurring within a defined time window into a single counted event episode, so that one clinical event is not scored as several.",
    "long_description": "An **acute event deduplication window** is the operational rule that answers a single, decisive question during outcome\nascertainment: *given several codes or encounters carrying the same acute diagnosis, when do they represent ONE event and\nwhen do they represent TWO?* Administrative and EHR data fragment a single clinical event across many records — an acute\nmyocardial infarction (AMI) generates an inpatient stay, often a transfer to a second facility, physician (carrier)\nclaims, follow-up office visits still carrying the I21.x code, and rehospitalization within days. Counted naively, that\none infarction becomes four or five \"events,\" inflating incidence, double-counting in cost/utilization, and corrupting\ntime-to-event analyses. The deduplication window collapses records whose dates fall within a pre-specified gap of an\nanchor record into the same **event episode**, and only resets to a new episode once a record falls *beyond* that gap.\n\n**Core conceptual distinction — and what this is NOT.** The window has two parameters that must be pre-specified\nseparately: (1) the **anchor/index rule** (which record opens an episode — e.g., first inpatient claim with the diagnosis\nin the primary position) and (2) the **gap rule** (the blackout/clean interval that must elapse before a subsequent\nqualifying record can open a *new* episode — e.g., 30 days from the index discharge for AMI; 14 days for COPD or asthma\nexacerbations and for sepsis). This concept is narrowly about *within-condition episode grouping*. It is distinct from\nthree neighbors it is constantly confused with. A **pre-index washout / clean look-back period**\n(`washout-clean-lookback-period-rwe`) clears the baseline so the *first* observed event is plausibly incident rather than\nprevalent; the deduplication window operates *during follow-up* to merge fragments and to separate genuinely distinct\nrecurrences. 
A **restart / new-episode rule for treatment** (`restart-rechallenge-new-episode-rwe`) builds *exposure*\nepisodes from fills; this builds *outcome* episodes from diagnoses/encounters. **Recurrent-event analysis**\n(`recurrent-events-analysis-rwe`) is the downstream *model* (Andersen-Gill, PWP, frailty) that consumes the episode\nstream the window produces — get the window wrong and every recurrent-event estimate is wrong.\n\n**Estimand link.** The window is not cosmetic data cleaning; it defines the counting unit and therefore the estimand. A\nshort or zero gap counts each fragment, so the implied estimand is \"number of records,\" not \"number of events.\" A long\ngap can merge two true recurrences into one, biasing recurrence rates downward and shrinking the per-protocol event count.\nPre-specify the gap to match the clinical natural history of the condition and the question (first-event time-to-event\nvs. recurrence rate vs. annualized count), and pre-register it in the protocol/SAP before touching data.\n\n**Pros, cons, and trade-offs.**\n- **vs. no deduplication (count every qualifying record):** A window removes the dominant upward bias from administrative\n  fragmentation (transfers, carrier + facility claims for the same stay, resolved-condition follow-up coding). Cost: it\n  introduces a tuning parameter that can mask true early recurrences; the gap length is a researcher degree of freedom\n  that must be pre-specified and sensitivity-tested. **Prefer a window** for essentially every acute-event count, rate,\n  or cost analysis in fragmented data.\n- **vs. a \"one event per patient ever\" (first-event-only) rule:** First-event-only is the correct, simplest choice when\n  the estimand is time to *first* event in a survival model and recurrences are not of interest; it needs no gap\n  parameter. Cost: it discards all recurrence information and is wrong for incidence-rate, utilization, or cost questions\n  where repeat events matter. 
**Prefer the window** when recurrences carry information or cost; **prefer first-event-only**\n  for a clean first-event hazard.\n- **vs. encounter/claim-line based counting with manual chart adjudication:** Adjudication is the gold standard for what\n  constitutes a distinct event but does not scale and is usually unavailable in large claims. The window is the scalable\n  proxy; its validity should be anchored to an adjudicated or validated reference (PPV/sensitivity of the resulting event\n  count, not just the diagnosis code). **Prefer the window** for scale, but validate it against a chart-reviewed or\n  registry-adjudicated subset (`claims-outcome-algorithm-ppv-sensitivity-rwe`).\n\n**When to use.** Counting acute, potentially recurrent events from fragmented claims or EHR encounters — AMI, stroke,\nsepsis, heart-failure or COPD/asthma exacerbations, GI bleeds, fractures, VTE, hospitalized infections — for incidence\nrates, recurrence analysis, utilization, or episode-based costing. Use it whenever a single clinical event is expected to\ngenerate multiple records (transfers, split facility/professional claims, post-acute follow-up coding) and whenever the\nsame condition can genuinely recur within the observation period.\n\n**When NOT to use — and when it is actively misleading.**\n- **Chronic, continuously coded conditions** (diabetes, hypertension, CKD stage). There is no discrete \"event\" to\n  deduplicate; an arbitrary window manufactures pseudo-events from routine maintenance coding. 
Use prevalence/algorithm\n  definitions instead (`diagnosis-phenotype-algorithm-1ip-2op-time-window-rwe`).\n- **First-event time-to-event questions where recurrences are irrelevant.** A window adds a needless parameter; use a\n  first-event rule plus a pre-index washout.\n- **When the gap is tuned to the result.** Choosing the gap after seeing how it moves the effect estimate is\n  p-hacking by another name; it is most dangerous when the gap length differs by exposure arm or is data-driven. The gap\n  must be pre-specified, applied identically to all arms, and varied only in pre-planned sensitivity analyses.\n- **When a single window is applied across heterogeneous conditions.** A 30-day AMI rule misapplied to a 14-day\n  exacerbation condition (or vice versa) systematically miscounts; gaps are condition-specific by clinical natural history.\n\n**Data-source operational depth.**\n- **Claims (FFS):** The classic failure is treating each claim line as an event. One AMI hospitalization yields an\n  inpatient facility claim, a separate carrier (Part B) professional claim, possibly an interhospital **transfer** to a\n  second facility, and downstream office visits still carrying I21.x weeks later. Anchor on the **inpatient** facility\n  claim with the diagnosis in the *primary/principal* position, set the gap from the index **discharge** date (not admit\n  date), and treat transfers — overlapping or contiguous inpatient stays — as the same episode before applying the gap.\n  Preserve raw admit/discharge dates alongside the derived `episode_id`. Claim-reversal and adjudication lag can make an\n  event appear, vanish, and reappear; freeze the data cut and use service dates, not paid/process dates.\n- **Medicare Advantage vs FFS, and plan switching:** MA encounter data are notoriously incomplete and underreport\n  inpatient events, so episode counts are not comparable between MA and FFS person-time\n  (`medicare-ffs-ma-commercial-claims-differences-rwe`). 
A patient who switches plans mid-window can have the back half of\n  an episode unobserved, splitting one event into two or truncating recurrence follow-up. Require continuous,\n  FFS-observable enrollment across each episode window and censor at plan switches; do not pool MA-only person-time with FFS\n  for event rates.\n- **EHR:** Capture is encounter-driven and site-bounded. A patient readmitted to an outside hospital for a true\n  recurrence is *missed*, so a fixed gap can wrongly merge two events (the second is invisible) or, conversely,\n  resolved-condition problem-list carry-forward can manufacture spurious follow-up \"events.\" Prefer linkage to claims or a regional\n  HIE to capture out-of-system events, define observation windows explicitly, and treat out-of-network care as informative\n  missingness.\n- **Registry / linked data:** Registries often carry adjudicated event dates that should override code-derived anchors\n  when available; linkage to claims supplies the fragmentation (transfers, professional claims) that the registry omits,\n  and linkage to a death index ensures a fatal event is not mistaken for \"no recurrence.\" Reconcile registry, claims, and vital-records\n  dates before assigning episode boundaries, and check transportability of the chosen gap to the analysis population.\n- **Differential competing risks and immortal-time traps:** In elderly or sicker arms, death competes with recurrence; a\n  long gap that spans a death interval can suppress observed recurrences differentially by arm. Handle death as a\n  competing risk (`competing-risks-cause-specific-fine-gray-rwe`), not as censoring. 
In procedure-anchored studies,\n  counting only patients who survived to a post-discharge code creates immortal time\n  (`immortal-time-bias-handling`); align time zero to the index event, not to the deduplicated follow-up code.\n\n**Worked claims example (AMI, 30-day rule).** Question: incidence and 1-year recurrence of AMI in a Medicare FFS cohort.\n(1) **Eligibility / washout:** ≥365 days of continuous FFS Parts A/B enrollment with no inpatient I21.x before the first\nqualifying claim, so the first episode is incident, not prevalent (the deduplication window is *separate* from this clean\nperiod). (2) **Anchor rule:** an episode opens on the admit date of an inpatient facility claim with ICD-10 **I21.x in the\nprincipal position**. (3) **Transfer handling:** any inpatient claim whose admit date is on or within 1 day of a prior\nepisode's discharge date is folded into that episode (an interhospital transfer is one event). (4) **Gap rule:** the\nnext principal-position I21.x inpatient claim with an admit date **≥30 days after the index episode's discharge** opens a\n*new* episode; anything within 30 days of discharge — including carrier claims and post-MI office visits — is attributed\nto the index episode and does **not** count. (5) **Output:** one row per `episode_id` with index admit/discharge dates,\narm, and an `is_incident` flag; recurrence rate = (episodes − first episodes) per person-year of FFS-observable follow-up,\ncensoring at disenrollment, death, and data cut. (6) **Sensitivity / validation:** rerun at 14- and 60-day gaps, report\nhow incidence and recurrence move, and benchmark the event count's PPV against a chart-reviewed or registry-adjudicated\nsubset rather than trusting the code alone.\n\n**Interpreting the output**. Applying the 30-day deduplication window to Margaret's (person 2001) four raw AMI claims\ncollapses them into two distinct episodes. 
Episode 1 spans the January 3 index admit through the January 12 discharge;\nall claims with admit dates within 30 days of that discharge are attributed to Episode 1 and do not count as new events.\nEpisode 2 opens on March 2, 49 days after the Episode 1 discharge (outside the 30-day gap), and ends March 9.\n\n*Formal interpretation.* The 30-day window is a modeling assumption, not a clinical truth. Any two claims whose admit\ndates fall within 30 days of the preceding discharge are treated as one continuous episode, regardless of whether the\nsecond admission represented a genuine readmission or a billing artifact. The window choice is consequential: a 14-day\nwindow could split what the 30-day rule treats as one episode; a 60-day window would collapse events the\n30-day rule treats as distinct (Margaret's March 2 admission, 49 days after discharge, would fold into Episode 1).\nIncidence and recurrence estimates are therefore partly a function of the\nwindow specification, and sensitivity analyses at 14 and 60 days should be reported before presenting results to a\nregulator or HTA body.\n\n*Practical interpretation.* Always report the episode count alongside the raw claim count and the window length.\nReviewers who see \"218 AMI events\" without knowing the deduplication rule cannot evaluate whether recurrence was\nover- or under-counted. The window choice should be pre-specified and motivated by the condition's natural history;\nfor AMI, 30 days matches the conventional 30-day readmission window, aligning the deduplication rule with\nthe readmission outcome.",
    "primary_category": "Outcome_Measure",
    "tags": [
      "outcome_measure",
      "outcome-algorithm-construction",
      "acute-event-deduplication-window",
      "episode-of-care",
      "event-clean-period",
      "recurrent-events",
      "claims-outcome-ascertainment"
    ],
    "applies_to_study_types": [
      "claims_analysis",
      "ehr_study",
      "registry_linkage",
      "comparative_effectiveness",
      "descriptive_epidemiology"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1016/j.ahj.2004.02.013",
        "url": "https://doi.org/10.1016/j.ahj.2004.02.013",
        "citation_text": "Kiyota Y, Schneeweiss S, Glynn RJ, Cannuscio CC, Avorn J, Solomon DH. Accuracy of Medicare claims-based diagnosis of acute myocardial infarction: estimating positive predictive value on the basis of review of hospital records. American Heart Journal. 2004;148(1):99-104.",
        "year": 2004,
        "authors_short": "Kiyota & Schneeweiss et al.",
        "notes": "Defines the claims-based acute MI event by anchoring on the inpatient hospitalization (principal-position code) and shows that the validity of the event hinges on how hospitalization records are aggregated into a single event - the operational substrate for any deduplication window on acute events."
      },
      {
        "role": "explain",
        "doi": "10.1371/journal.pmed.1001885",
        "url": "https://doi.org/10.1371/journal.pmed.1001885",
        "citation_text": "Benchimol EI, Smeeth L, Guttmann A, et al. The REporting of studies Conducted using Observational Routinely-collected health Data (RECORD) statement. PLoS Medicine. 2015;12(10):e1001885.",
        "year": 2015,
        "authors_short": "Benchimol et al.",
        "notes": "Requires transparent reporting of how codes and encounters are turned into analytic events, including the date logic and aggregation rules - i.e., the deduplication window must be stated explicitly, not buried in code."
      },
      {
        "role": "explain",
        "doi": "10.1136/bmj.h5527",
        "url": "https://doi.org/10.1136/bmj.h5527",
        "citation_text": "Bossuyt PM, Reitsma JB, Bruns DE, et al. STARD 2015: an updated list of essential items for reporting diagnostic accuracy studies. BMJ. 2015;351:h5527.",
        "year": 2015,
        "authors_short": "Bossuyt et al.",
        "notes": "Framework for validating the resulting event algorithm (PPV/sensitivity of the deduplicated event count) against a reference standard, not merely the diagnosis code."
      },
      {
        "role": "demonstrate",
        "doi": "10.1161/CIRCULATIONAHA.114.013829",
        "url": "https://doi.org/10.1161/CIRCULATIONAHA.114.013829",
        "citation_text": "Yasaitis LC, Berkman LF, Chandra A. Comparison of self-reported and Medicare claims-identified acute myocardial infarction. Circulation. 2015;131(17):1477-1485.",
        "year": 2015,
        "authors_short": "Yasaitis et al.",
        "notes": "Applied use of a hospitalization-anchored, clean-period acute MI definition in Medicare claims, illustrating how episode aggregation choices affect agreement with an external (self-reported/adjudicated) event count."
      }
    ],
    "plain_language_summary": "When a patient has a heart attack, their insurance records often show multiple separate charges for the same event — the hospital stay, a transfer to a second hospital, and follow-up office visits that still list the same diagnosis code. An acute event deduplication window is a pre-specified rule that says: any claim that arrives within a set number of days of the first claim (say, 30 days for a heart attack) belongs to the same event, not a new one. Without this rule, one heart attack can get counted as four or five events, inflating incidence rates and corrupting cost analyses. The window collapses those fragments into a single counted episode; only a claim that arrives more than 30 days after the first discharge is treated as a true new event.",
    "key_terms": [
      {
        "term": "episode",
        "definition": "A single occurrence of a medical event — for example, one heart attack — treated as one unit for counting purposes, even if multiple records in the data describe it."
      },
      {
        "term": "deduplication window",
        "definition": "A pre-set number of days after the first claim for an acute event during which any additional claim for the same condition is considered part of that same episode, not a new event."
      },
      {
        "term": "anchor claim",
        "definition": "The first qualifying record that opens an episode — for example, an inpatient hospital claim where the heart-attack diagnosis code appears as the main reason for admission."
      },
      {
        "term": "incidence rate",
        "definition": "How often a medical event occurs in a population over time, usually expressed as events per 1,000 patient-years — miscounted episodes corrupt this figure directly."
      },
      {
        "term": "transfer",
        "definition": "When a patient is moved from one hospital to a second hospital during the same illness; this generates two separate inpatient records in claims data but represents one clinical event."
      }
    ],
    "worked_example": {
      "scenario": "Margaret is a 72-year-old Medicare patient who has a heart attack on January 3, 2023. She is admitted to Community Hospital that day, transferred to University Medical Center on January 5 (her records show a second inpatient claim), and then sees her cardiologist in an office visit on January 20, at which the cardiologist still lists the heart-attack diagnosis code. On March 2, she is admitted again with a new heart attack. A researcher applying a 30-day deduplication window wants to count how many distinct heart-attack episodes Margaret had in 2023.",
      "dataset": {
        "caption": "Raw qualifying claims for Margaret (person 2001) — every claim where the principal diagnosis is I21.x (acute MI), sorted by date.",
        "columns": [
          "person_id",
          "claim_type",
          "admit_date",
          "discharge_date",
          "dx_principal"
        ],
        "rows": [
          [
            2001,
            "IP",
            "2023-01-03",
            "2023-01-05",
            "I21.9"
          ],
          [
            2001,
            "IP",
            "2023-01-05",
            "2023-01-12",
            "I21.9"
          ],
          [
            2001,
            "Office",
            "2023-01-20",
            "2023-01-20",
            "I21.9"
          ],
          [
            2001,
            "IP",
            "2023-03-02",
            "2023-03-09",
            "I21.9"
          ]
        ]
      },
      "steps": [
        "Claim 1 (Jan 3 admit, Jan 5 discharge) is an inpatient hospital claim with I21.9 in the principal position — it opens Episode 1, and Jan 5 becomes the running discharge date.",
        "Claim 2 (Jan 5 admit) is a second inpatient claim whose admit date is 0 days after the Jan 5 discharge — it falls within the 1-day transfer window, so it is merged into Episode 1 as an interhospital transfer, not a new event. The running discharge date updates to Jan 12.",
        "Claim 3 (Jan 20 office visit) arrives 8 days after the Jan 12 discharge. Eight days is less than the 30-day window, so this follow-up visit is attributed to Episode 1 and is NOT counted as a new event.",
        "Claim 4 (Mar 2 admit) arrives 49 days after the Jan 12 discharge. Forty-nine days exceeds the 30-day window, so this claim opens Episode 2 — a true new event (recurrence).",
        "Result: 4 raw claims collapse to 2 distinct episodes. Counting records without a window would have reported 4 events for this patient."
      ],
      "result": "4 raw claims for patient 2001 collapse to 2 distinct episodes under the 30-day deduplication window: Episode 1 (index admit Jan 3, discharge Jan 12, spanning 3 raw claims) and Episode 2 (index admit Mar 2, a true recurrence 49 days after Episode 1 discharge).",
      "timeline_spec": {
        "title": "Four AMI claims collapsed to two episodes under a 30-day deduplication window",
        "window": {
          "start": "2023-01-03",
          "end": "2023-03-09",
          "label": "Observation window: Jan 3 to Mar 9 (65 days shown)"
        },
        "events": [
          {
            "label": "Claim 1: IP admit (Episode 1 anchor)",
            "start": "2023-01-03",
            "length_days": 2,
            "quantity": "Jan 3-5 inpatient stay"
          },
          {
            "label": "Claim 2: IP transfer (same event)",
            "start": "2023-01-05",
            "length_days": 7,
            "quantity": "Jan 5-12 transfer stay"
          },
          {
            "label": "Claim 3: Office visit (follow-up, inside 30d window)",
            "start": "2023-01-20",
            "length_days": 1,
            "quantity": "Jan 20 office visit, 8d after Ep1 discharge"
          },
          {
            "label": "Claim 4: IP admit (Episode 2, true recurrence)",
            "start": "2023-03-02",
            "length_days": 7,
            "quantity": "Mar 2-9 inpatient stay, 49d after Ep1 discharge"
          }
        ],
        "spans": [
          {
            "kind": "covered",
            "start": "2023-01-03",
            "end": "2023-01-20",
            "label": "Episode 1 window (anchor Jan 3 through last attributed claim Jan 20)"
          },
          {
            "kind": "gap",
            "start": "2023-01-21",
            "end": "2023-03-01",
            "label": "39-day clean gap (all days exceed 30-day window from Jan 12 discharge)"
          },
          {
            "kind": "covered",
            "start": "2023-03-02",
            "end": "2023-03-09",
            "label": "Episode 2 (recurrence, 49d after Ep1 discharge)"
          }
        ],
        "result": {
          "label": "4 raw claims collapse to 2 distinct episodes",
          "value": 2
        },
        "caption": "Margaret has four claims carrying an acute MI diagnosis code. Claims 1 and 2 are an interhospital transfer (0-day gap, merged automatically). Claim 3 is a follow-up office visit only 8 days after discharge — inside the 30-day blackout, so it belongs to Episode 1. Claim 4 arrives 49 days after the Episode 1 discharge, beyond the 30-day window, and opens a true new episode. Naive record-counting would report 4 events; the window correctly reports 2.",
        "alt_text": "Timeline for patient 2001 showing four acute MI claims between January 3 and March 9, 2023. Claims 1 and 2 form an interhospital transfer merged into Episode 1. Claim 3 (office visit on January 20) falls inside the 30-day window and is absorbed into Episode 1. Claim 4 (March 2 inpatient) falls 49 days after the Episode 1 discharge and opens Episode 2 as a true recurrence. The gap between episodes is highlighted in a neutral color to show the clean separation."
      }
    },
    "prerequisites": [
      "outcome-algorithm-construction-rwe",
      "washout-clean-lookback-period-rwe",
      "recurrent-events-analysis-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Fixed calendar-day gap (blackout window)",
        "description": "A single pre-specified number of days (e.g., 30 for AMI, 14 for exacerbations/sepsis) measured from the index episode's discharge (or service) date; subsequent qualifying records inside the window are attributed to the index episode, the first record beyond it opens a new episode.",
        "edge_cases": [
          "Gap measured from admit vs discharge date materially changes counts for long stays - pre-specify which.",
          "Interhospital transfers and contiguous/overlapping inpatient stays must be merged before the gap is applied.",
          "A death inside the window must be handled as a competing risk, not as evidence of \"no recurrence.\""
        ],
        "data_source_notes": "claims: anchor on inpatient facility claim with principal-position code; fold carrier claims and contiguous stays into the episode. ehr: out-of-system recurrences are invisible, so a fixed gap can wrongly merge two true events."
      },
      {
        "name": "Care-setting / condition-specific gap",
        "description": "Different gaps by condition and setting based on clinical natural history (e.g., inpatient-anchored 30-day AMI rule vs outpatient-anchored 14-day exacerbation rule), rather than one global window.",
        "edge_cases": [
          "Applying one global gap across heterogeneous conditions systematically miscounts.",
          "Outpatient-only events (no hospitalization) need an outpatient anchor and typically a shorter gap."
        ],
        "data_source_notes": "claims/ehr: justify each gap from published natural history or a validation substudy; document per condition in the SAP."
      },
      {
        "name": "First-event-only collapse",
        "description": "Degenerate window that keeps only the first qualifying episode per patient (infinite gap), appropriate for a time-to-first-event hazard where recurrences are not of interest.",
        "edge_cases": [
          "Discards all recurrence and repeat-utilization/cost information.",
          "Still requires transfer/fragment merging so the single retained event is dated correctly."
        ],
        "data_source_notes": "claims/ehr: pair with a pre-index washout so the first event is incident; this is not a substitute for the washout."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "No deduplication (count every qualifying record)",
        "pros_of_this": "Removes the dominant upward bias from administrative fragmentation (transfers, split facility/professional claims, resolved-condition follow-up coding); makes the counting unit a clinical event rather than a record.",
        "cons_of_this": "Introduces a tuning parameter (gap length) that can mask true early recurrences and is a researcher degree of freedom requiring pre-specification and sensitivity analysis.",
        "when_to_prefer": "Essentially every acute-event count, rate, recurrence, or episode-cost analysis in fragmented claims or EHR data."
      },
      {
        "compared_to": "First-event-only (one event per patient ever)",
        "pros_of_this": "Retains recurrence and repeat-cost information needed for incidence-rate, utilization, and cost estimands.",
        "cons_of_this": "Requires choosing and defending a finite gap; first-event-only needs no gap parameter at all.",
        "when_to_prefer": "When recurrences carry clinical or economic information; use first-event-only for a clean first-event hazard where recurrences are irrelevant."
      },
      {
        "compared_to": "Claim-line counting with chart adjudication as the reference",
        "pros_of_this": "Scales to large databases where adjudication is infeasible; reproducible and pre-specifiable.",
        "cons_of_this": "Is a proxy whose validity must be anchored to an adjudicated or registry reference; can both over- and under-merge relative to clinical truth.",
        "when_to_prefer": "Large-scale claims/EHR analyses, validated against a chart-reviewed or registry-adjudicated subset."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Anchor episodes on the inpatient facility claim with the diagnosis in the principal position; merge interhospital transfers and contiguous/overlapping stays before applying the gap; measure the gap from the index discharge (service) date, not paid/process dates; preserve raw admit/discharge dates alongside the derived episode_id. Require continuous FFS-observable enrollment across each window and censor at plan switches.",
      "ehr": "Capture is encounter- and site-bounded; out-of-system recurrences are missed and can cause a fixed gap to merge two true events, while problem-list carry-forward can manufacture spurious follow-up events. Prefer linkage to claims/HIE, define observation windows explicitly, and treat out-of-network care as informative missingness.",
      "registry": "Prefer adjudicated registry event dates over code-derived anchors when available; link to claims for fragmentation (transfers, professional claims) the registry omits and to a death index so a fatal event is not read as \"no recurrence.\" Check transportability of the chosen gap to the target population.",
      "linked": "Reconcile registry/claims/EHR/vital-records dates before assigning episode boundaries; linkage introduces selection (linkable subset) and order/fill/service date discrepancies that must be resolved before the gap is applied."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\n\nGAP_DAYS = 30          # condition-specific blackout (AMI=30; exacerbations/sepsis commonly 14)\nTRANSFER_DAYS = 1      # contiguous/overlapping IP stays within this many days are the same event (transfer)\n\ndef build_event_episodes(claims: pd.DataFrame,\n                         gap_days: int = GAP_DAYS,\n                         transfer_days: int = TRANSFER_DAYS) -> pd.DataFrame:\n    c = claims.sort_values([\"person_id\", \"admit_date\", \"discharge_date\"]).copy()\n    out = []\n    for pid, g in c.groupby(\"person_id\", sort=False):\n        episode_id = 0\n        ep_admit = ep_discharge = None\n        n_records = 0\n        first_seen = True\n        for _, row in g.iterrows():\n            if ep_admit is None:\n                # open the first episode for this person\n                episode_id, n_records = 1, 1\n                ep_admit, ep_discharge = row[\"admit_date\"], row[\"discharge_date\"]\n                continue\n            days_after_discharge = (row[\"admit_date\"] - ep_discharge).days\n            if days_after_discharge <= transfer_days:\n                # transfer / fragment of the SAME event -> extend the episode, do not count anew\n                ep_discharge = max(ep_discharge, row[\"discharge_date\"])\n                n_records += 1\n            elif days_after_discharge < gap_days:\n                # inside the blackout window -> attributed to current episode, not a new event\n                n_records += 1\n            else:\n                # beyond the gap -> close current episode, open a new one\n                out.append((pid, episode_id, ep_admit, ep_discharge, n_records, first_seen))\n                first_seen = False\n                episode_id += 1\n                ep_admit, ep_discharge, n_records = row[\"admit_date\"], row[\"discharge_date\"], 1\n        if ep_admit is not None:\n            out.append((pid, episode_id, ep_admit, ep_discharge, n_records, first_seen))\n    
episodes = pd.DataFrame(out, columns=[\"person_id\", \"episode_id\", \"index_admit\",\n                                          \"index_discharge\", \"n_records\", \"is_incident\"])\n    return episodes",
        "description": "Acute-event episode construction with a deduplication window from claims-style inputs. Required input (already cleaned,\none row per qualifying claim that carries the target diagnosis in the principal position):\n  claims : person_id, claim_id, claim_type ('IP'/'OP'), admit_date (datetime), discharge_date (datetime), dx_primary (str)\nReturns one row per event episode (person_id, episode_id, index_admit, index_discharge, n_records, is_incident).\nThe gap is measured from the running episode discharge; contiguous/overlapping inpatient stays (transfers) are merged\nbefore the gap test, so an interhospital transfer counts as one event. Build rates/costs from the returned episodes,\nnot the raw claims.",
        "dependencies": [
          "pandas"
        ],
        "source_citations": [
          "kiyota-2004"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\nGAP_DAYS <- 30L\nTRANSFER_DAYS <- 1L\n\nbuild_event_episodes <- function(claims, gap_days = GAP_DAYS, transfer_days = TRANSFER_DAYS) {\n  setDT(claims)\n  setorder(claims, person_id, admit_date, discharge_date)\n\n  assign_episodes <- function(admit, discharge) {\n    n <- length(admit)\n    epi <- integer(n); ep_discharge <- discharge[1L]; epi[1L] <- 1L\n    for (i in seq_len(n)[-1L]) {\n      gap <- as.integer(admit[i] - ep_discharge)\n      if (gap <= transfer_days) {                 # transfer / same-event fragment\n        epi[i] <- epi[i - 1L]\n        ep_discharge <- max(ep_discharge, discharge[i])\n      } else if (gap < gap_days) {                # inside blackout -> current episode\n        epi[i] <- epi[i - 1L]\n      } else {                                    # beyond gap -> new episode\n        epi[i] <- epi[i - 1L] + 1L\n        ep_discharge <- discharge[i]\n      }\n    }\n    epi\n  }\n\n  claims[, episode_id := assign_episodes(admit_date, discharge_date), by = person_id]\n  episodes <- claims[, .(index_admit = min(admit_date),\n                         index_discharge = max(discharge_date),\n                         n_records = .N),\n                     by = .(person_id, episode_id)]\n  episodes[, is_incident := episode_id == min(episode_id), by = person_id]\n  episodes[]\n}",
        "description": "Acute-event episode construction with a deduplication window using data.table. Input mirrors the Python version\n(one row per qualifying principal-position claim):\n  claims : person_id, claim_id, claim_type ('IP'/'OP'), admit_date (Date), discharge_date (Date), dx_primary\nReturns one row per episode. Within each person, a cumulative episode id increments only when a record's admit date is\n>= gap_days after the running episode discharge; contiguous stays within transfer_days are merged as one event.",
        "dependencies": [
          "data.table"
        ],
        "source_citations": [
          "kiyota-2004"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "%let gap = 30;        /* condition-specific blackout window (days)            */\n%let transfer = 1;    /* contiguous IP stays within this gap = same event     */\n\nproc sort data=work.claims; by person_id admit_date discharge_date; run;\n\ndata work.episodes(keep=person_id episode_id index_admit index_discharge n_records is_incident);\n  set work.claims;\n  by person_id;\n  retain episode_id ep_discharge index_admit index_discharge n_records;\n  format index_admit index_discharge date9.;\n\n  if first.person_id then do;              /* open first episode for the person */\n    episode_id = 1; ep_discharge = discharge_date;\n    index_admit = admit_date; index_discharge = discharge_date; n_records = 1;\n  end;\n  else do;\n    gap_days = admit_date - ep_discharge;\n    if gap_days <= &transfer then do;      /* transfer / same-event fragment    */\n      ep_discharge = max(ep_discharge, discharge_date);\n      index_discharge = ep_discharge; n_records + 1;\n    end;\n    else if gap_days < &gap then do;       /* inside blackout -> current episode */\n      n_records + 1;\n    end;\n    else do;                               /* beyond gap -> emit + open new      */\n      is_incident = (episode_id = 1);       /* flag the episode being emitted     */\n      output;\n      episode_id + 1; ep_discharge = discharge_date;\n      index_admit = admit_date; index_discharge = discharge_date; n_records = 1;\n    end;\n  end;\n\n  if last.person_id then do;               /* emit the still-open final episode  */\n    is_incident = (episode_id = 1);\n    output;\n  end;\nrun;",
        "description": "Acute-event episode construction with a deduplication window in SAS (sorted DATA step with RETAIN/LAG-style carry of the\nrunning episode discharge). Required input (one row per qualifying principal-position claim):\n  work.claims : person_id, claim_id, claim_type, admit_date, discharge_date, dx_primary\nProduces work.episodes (one row per episode) and an is_incident flag. This is date-arithmetic / window logic - no\nstatistical procedure is appropriate here; estimation (rates via PROC GENMOD, recurrence via PROC PHREG) happens\ndownstream on the episode table.",
        "dependencies": [],
        "source_citations": [
          "kiyota-2004"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": "acute-event-deduplication-window-rwe-timeline.svg",
        "mermaid": null,
        "caption": "Margaret has four claims carrying an acute MI diagnosis code. Claims 1 and 2 are an interhospital transfer (0-day gap, merged automatically). Claim 3 is a follow-up office visit only 8 days after discharge — inside the 30-day blackout, so it belongs to Episode 1. Claim 4 arrives 49 days after the Episode 1 discharge, beyond the 30-day window, and opens a true new episode. Naive record-counting would report 4 events; the window correctly reports 2.",
        "alt_text": "Timeline for patient 2001 showing four acute MI claims between January 3 and March 9, 2023. Claims 1 and 2 form an interhospital transfer merged into Episode 1. Claim 3 (office visit on January 20) falls inside the 30-day window and is absorbed into Episode 1. Claim 4 (March 2 inpatient) falls 49 days after the Episode 1 discharge and opens Episode 2 as a true recurrence. The gap between episodes is highlighted in a neutral color to show the clean separation.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "timeline\n  title One patient, four AMI claims collapsed into two episodes (30-day rule)\n  Jan 03 - IP admit I21.x : Episode 1 opens (index)\n  Jan 09 - transfer IP I21.x : Same event (<=1d after discharge) - merged into Episode 1\n  Jan 20 - office visit I21.x : Inside 30d blackout - attributed to Episode 1, NOT counted\n  Mar 02 - IP admit I21.x : >=30d after Ep1 discharge - Episode 2 opens (recurrence)",
        "caption": "A single infarction with an interhospital transfer and post-MI follow-up coding produces four records but one event; a true recurrence beyond the 30-day gap opens a second episode. Counting records would report four events.",
        "alt_text": "Timeline showing four acute MI claims for one patient collapsed by a 30-day deduplication window into two event episodes, with a transfer and a follow-up visit merged into the first episode.",
        "source_type": "illustrative",
        "source_citations": [
          "kiyota-2004"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Rec[Next qualifying claim<br/>principal-position dx] --> Open{An episode already open<br/>for this person?}\n  Open -- No --> New1[Open episode 1<br/>set index + discharge]\n  Open -- Yes --> Tx{Admit date within<br/>transfer window of<br/>running discharge?}\n  Tx -- Yes --> Merge[Merge: same event<br/>extend discharge, do not count anew]\n  Tx -- No --> Gap{Admit date >= gap_days<br/>after running discharge?}\n  Gap -- No --> Same[Inside blackout<br/>attribute to current episode]\n  Gap -- Yes --> New2[Close current episode<br/>open new episode = recurrence]",
        "caption": "Decision logic applied to each sorted claim: merge transfers, swallow records inside the blackout window, and open a new episode only once a record falls beyond the condition-specific gap.",
        "alt_text": "Decision flowchart for the deduplication window - merge transfers, attribute within-gap records to the current episode, and open a new episode when a record falls beyond the gap.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "is_variant_of",
        "target_slug": "outcome-algorithm-construction-rwe",
        "notes": "The deduplication window is the date-logic step within outcome-algorithm construction that turns qualifying records into counted event episodes."
      },
      {
        "relation_type": "see_also",
        "target_slug": "washout-clean-lookback-period-rwe",
        "notes": "Complementary but distinct - the pre-index washout establishes that the FIRST event is incident; the deduplication window operates during follow-up to merge fragments and separate recurrences."
      },
      {
        "relation_type": "used_with",
        "target_slug": "recurrent-events-analysis-rwe",
        "notes": "Recurrent-event models (Andersen-Gill, PWP, frailty) consume the episode stream the deduplication window produces; an over- or under-merging window biases every recurrence estimate."
      },
      {
        "relation_type": "alternative_to",
        "target_slug": "restart-rechallenge-new-episode-rwe",
        "notes": "Same gap-rule machinery but on the exposure side (treatment restart episodes from fills) rather than the outcome side (event episodes from diagnoses/encounters)."
      },
      {
        "relation_type": "used_with",
        "target_slug": "claims-outcome-algorithm-ppv-sensitivity-rwe",
        "notes": "The validity of the deduplicated event count (not just the diagnosis code) should be benchmarked via PPV and sensitivity against a chart-reviewed or registry-adjudicated reference."
      },
      {
        "relation_type": "see_also",
        "target_slug": "diagnosis-phenotype-algorithm-1ip-2op-time-window-rwe",
        "notes": "For chronic conditions there is no discrete event to deduplicate; a prevalence/phenotype algorithm replaces the episode window."
      },
      {
        "relation_type": "affects",
        "target_slug": "competing-risks-cause-specific-fine-gray-rwe",
        "notes": "Death inside the gap competes with recurrence; deaths must be handled as a competing risk, not as censoring or as evidence of no recurrence."
      },
      {
        "relation_type": "see_also",
        "target_slug": "medicare-ffs-ma-commercial-claims-differences-rwe",
        "notes": "MA encounter data underreport inpatient events, so episode counts are not comparable across MA and FFS person-time; require FFS-observable enrollment across each window."
      },
      {
        "relation_type": "see_also",
        "target_slug": "immortal-time-bias-handling",
        "notes": "Anchoring time zero on a deduplicated post-discharge follow-up code rather than the index event introduces immortal time in procedure-anchored studies."
      }
    ],
    "aliases": [
      "event clean period",
      "episode-of-care window",
      "outcome blackout window",
      "event deduplication gap",
      "30-day event rule",
      "acute event episode window"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta",
      "pqa-cms"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "agreement-statistics-kappa-icc-bland-altman",
    "name": "Agreement Statistics: Kappa, ICC, and Bland-Altman",
    "short_definition": "A family of statistical methods for quantifying how much two raters, instruments, or measurement approaches agree — as opposed to merely correlating — used to evaluate inter- abstractor reliability in chart review, concordance between data sources, and the substitutability of automated coding algorithms for a gold standard. Cohen's kappa corrects observed categorical agreement for chance; the intraclass correlation coefficient (ICC) extends this idea to continuous ratings; and the Bland-Altman limits-of-agreement plot characterizes systematic bias and clinically acceptable variability between two continuous measurement methods.",
    "long_description": "**Agreement measures a different question from correlation**\n\nThe single most consequential distinction in measurement validation is that *correlation*\nand *agreement* are fundamentally different quantities that answer different questions.\nTwo continuous instruments can have a Pearson r of 0.99 and yet fail completely as\nsubstitutes for each other: if Instrument A consistently returns values exactly twice\nthose of Instrument B, r = 1.0 while every paired measurement disagrees by 100%. This\nis not a theoretical curiosity — it is a recurring error in the HEOR and pharmacoepidemiology\nliterature when researchers validate claims-derived biomarker proxies against EHR laboratory\nvalues, compare chart-abstracted event rates to adjudicated standards, or calibrate NLP\npipeline outputs against manual annotation. Correlation describes the strength and direction\nof co-movement; agreement asks whether two raters or instruments produce *interchangeable*\nvalues. Reporting a high Pearson or Spearman correlation as evidence of agreement between\ntwo measurement methods is methodologically indefensible. The correct tool for agreement\ndepends on the scale of the variable being compared.\n\nFor **categorical** variables (confirmed/not confirmed, disease present/absent, three-level\nseverity grades), the foundational agreement measure is *Cohen's kappa*. For **continuous\nratings** from multiple raters or occasions on the same subjects (pain scores, troponin\nreadings, star ratings), the *intraclass correlation coefficient (ICC)* is the appropriate\nagreement measure. 
For **method comparison** — asking whether a new assay, a claims-derived\nproxy, or an algorithm output can substitute for a gold-standard measurement — *Bland and\nAltman's limits-of-agreement plot* is the deliverable.\n\n**Cohen's kappa for categorical agreement**\n\nPercent observed agreement (the fraction of cases where two raters agree) is an inadequate\nsole metric because raters can agree by chance alone. Two raters each assigning a binary\noutcome at 90% true prevalence would agree 82% of the time even with no actual information\nwhatsoever. Cohen's kappa corrects for this chance agreement by comparing what was observed\nto what would be expected if raters made independent decisions based only on their own\nmarginal tendencies:\n\n  kappa = (P_o - P_e) / (1 - P_e)\n\nwhere P_o is the observed proportion of agreement (the diagonal sum divided by total cases)\nand P_e is the expected proportion of agreement by chance alone, computed from the\nmarginal probabilities of each rater's calls:\n\n  P_e = P(R1=Y) * P(R2=Y) + P(R1=N) * P(R2=N)\n\nKappa ranges from -1 (perfect systematic disagreement) through 0 (no agreement beyond\nchance) to 1 (perfect agreement). The Landis and Koch (1977) benchmark scale places kappa\n< 0.20 as slight, 0.21–0.40 fair, 0.41–0.60 moderate, 0.61–0.80 substantial, and 0.81–1.00\nalmost perfect agreement. Regulatory submissions (FDA, EMA) for adjudicated primary endpoints\nin observational research typically expect kappa ≥ 0.60 before tiebreak as a prerequisite for\nthe adjudication to be considered reliable.\n\n*Weighted kappa for ordinal categories.* When categories are ordered — three-grade severity\nclassifications, staging systems, or Likert scales — disagreements that are close on the\nordinal scale should be penalized less than distant disagreements. Weighted kappa accomplishes\nthis by applying a weight matrix that reduces the contribution of near-misses. 
With linear\nweights, a one-grade discordance is penalized proportionally less than a two-grade discordance;\nwith quadratic weights, large discordances are penalized more steeply. The choice of weighting\nscheme materially affects the resulting kappa and must be pre-specified in the analysis plan.\n\n*The kappa paradox under extreme prevalence.* Kappa can be misleadingly low when one\ncategory strongly dominates the marginals, a phenomenon called the prevalence paradox.\nIf 95% of adjudicated cases are \"not a true event\" and both raters agree on \"not a true\nevent\" almost always, observed agreement may be 0.95 while kappa is only about 0.50, because\nP_e is already approximately 0.90: kappa = (0.95 - 0.90)/(1 - 0.90) = 0.50. Kappa falls\ntoward zero when the raters rarely agree on the rare category itself. This situation arises directly in endpoint adjudication\nof rare events (severe bleeding, fatal MI) and in NLP validation on highly imbalanced\ncorpora. In these settings, report both percent agreement and kappa, and explicitly note\nthe base-rate context so that a low kappa is not misread as evidence of poor reliability.\n\n**Intraclass correlation coefficient for continuous agreement**\n\nThe ICC is a family of statistics quantifying agreement among continuous measurements,\nextending the kappa idea to scales where the magnitude of disagreement matters and not\njust its presence. The key design choices are the statistical model and the type of\nagreement being assessed:\n\n- *One-way random effects*: each subject is rated by a different random set of raters,\n  as in multi-site studies where raters are not consistent across sites. The ICC is\n  estimated from between-subject and within-subject mean squares in a one-way ANOVA.\n- *Two-way random effects, absolute agreement*: the same set of raters evaluates all\n  subjects, and systematic differences between raters are treated as disagreement and\n  lower the ICC. 
Use this form whenever one rater consistently scores higher than another\n  in a clinically meaningful way — for example, two readers of radiologic images where\n  the offset matters for clinical decisions.\n- *Two-way random effects, consistency*: the same raters evaluate all subjects, but\n  systematic rater offsets are considered acceptable and are removed from the error term.\n  Use this form only when the comparison is inherently relative (e.g., rank ordering\n  of subjects, not absolute measurement) and any systematic calibration difference\n  between raters will be corrected during production.\n\nChoosing the wrong ICC form is a frequent error. Shrout and Fleiss (1979) catalogued six\nICC models; for regulatory submissions evaluating inter-abstractor reliability on\ncontinuous endpoints, the two-way absolute-agreement ICC with a single measurement\n(ICC(2,1)) is the most commonly required form. Always report the ICC model chosen,\nthe number of raters, and the 95% confidence interval alongside the point estimate.\n\n**Bland-Altman for method comparison**\n\nThe Bland-Altman plot directly addresses the practical clinical question: \"Can I substitute\nMethod B for Method A in clinical practice?\" For n paired observations, compute:\n\n- The *mean difference* (bias): the average of (Method A - Method B). A non-zero bias\n  indicates systematic over- or under-reading by one method across the measurement range.\n- The *limits of agreement*: bias ± 1.96 * SD(differences), where SD(differences) is the\n  standard deviation of the paired differences.\n\nThe plot shows the average of the two methods on the x-axis against the difference\n(Method A - Method B) on the y-axis, with horizontal reference lines at the bias and\nboth limits of agreement. 
The interpretation: approximately 95% of future paired\ndifferences between the two methods in similar patients are expected to fall within the\nlimits of agreement, assuming differences are normally distributed and do not depend on\nthe magnitude of the measurement. The clinical decision is whether those limits are\nnarrow enough to be acceptable — a judgment that requires a pre-specified maximum\nacceptable difference, not a statistical test alone.\n\nIf the spread of differences increases with the magnitude of the measurement (a funnel\nshape on the plot), proportional bias is present. In this case, log-transform both\nmeasurements before computing the Bland-Altman, and back-transform the limits of agreement\ninto ratio form (for example, \"Method A returns values within a factor of 1.15 of Method B\nfor 95% of paired measurements\"). A formal test for proportional bias uses the Pearson\ncorrelation between the difference (Method A - Method B) and the mean (Method A + Method B)/2;\na significant correlation indicates that the limits of agreement vary with the measurement\nlevel.\n\n**RWE applications**\n\nAgreement statistics arise in four recurring contexts in HEOR and pharmacoepidemiology:\n\n1. *Inter-abstractor reliability in chart review and endpoint adjudication*: a calibration\n   exercise on a shared set of 20–50 records before the full study adjudication begins\n   establishes pre-tiebreak kappa. Regulators use this to assess whether the adjudication\n   committee was consistent; a kappa below 0.40 on the calibration set typically triggers\n   revision of the case definition and adjudication charter before proceeding.\n2. *Claims-versus-registry variable concordance*: when a claims-derived comorbidity flag\n   or disease status variable is compared to a registry adjudicated label for the same\n   patients, kappa (binary) or ICC (ordered severity) quantifies alignment beyond what\n   sensitivity and specificity alone capture. 
Agreement complements PPV and sensitivity\n   by correcting for chance agreement.\n3. *Algorithm-versus-gold-standard beyond PPV*: PPV and sensitivity measure case-finding\n   accuracy of a binary classifier; kappa additionally accounts for the chance agreement\n   component, making it a better single metric when comparing multiple competing algorithm\n   specifications during development and when reporting to external stakeholders.\n4. *Duplicate-coder quality control in NLP validation*: training NLP models on annotated\n   data requires inter-annotator agreement on the labeled training corpus before the model\n   is trained. A kappa below 0.60 on a sample of the training labels suggests the annotation\n   scheme is ambiguous and the resulting model will have a ceiling on its accuracy.\n\n**Pros, cons, and trade-offs**\n\n*Cohen's kappa*: chance-corrected, interpretable scale from -1 to 1, comparable across\nstudies with similar prevalence, and widely understood by regulatory reviewers. Weighted kappa accommodates\nordinal categories. Cons: the prevalence paradox makes kappa misleadingly low at extreme\nbase rates; kappa does not capture the magnitude of disagreement on a continuous scale;\napplying kappa to ordinal data without weighting discards the ordering information.\n\n*ICC*: captures both systematic and random rater differences for continuous data; the\ntwo-way absolute-agreement form directly answers the question of interchangeability.\nThe 95% confidence interval is directly interpretable and can be pre-specified as a\nsuccess criterion. Cons: requires the same raters across all subjects for the two-way\nform; ICC values are inflated when between-subject variability is high even if rater\nprecision is poor, so the ICC for the same rater pair in a heterogeneous vs homogeneous\npopulation can differ dramatically.\n\n*Bland-Altman*: visually communicates both bias and variability in one plot and directly\naddresses the clinical substitutability question. 
Cons: requires a pre-defined maximum\nacceptable difference (a clinical judgment, not a statistical one); assumes differences\nare normally distributed; does not provide a single summary statistic for pre-registering\nan agreement threshold.\n\n**When to use**\n\nUse Cohen's kappa for any binary or categorical agreement task: endpoint adjudication,\nNLP annotation, diagnostic classification quality control, or comparing algorithm output\nto chart abstraction on a binary endpoint. Pre-specify kappa ≥ 0.60 as the minimum\nacceptable threshold for regulatory submissions; use weighted kappa for ordinal scales.\nUse ICC for panels of raters evaluating a continuous outcome (physiological parameters,\nbiomarker values, functional scores) — always specify the ICC form (one-way/two-way,\nabsolute/consistency, single/average measure) and report the 95% confidence interval.\nUse Bland-Altman when asking \"can one continuous measurement method substitute for\nanother?\" — not as a correlation substitute.\n\n**When NOT to use**\n\nDo not use Pearson or Spearman correlation to assess agreement between two measurement\nmethods or two raters measuring the same quantity; a correlation of 0.99 is fully\ncompatible with systematic bias that makes methods non-interchangeable. Do not use kappa\nas the sole metric when true prevalence is extreme (above 90% or below 10%); complement\nit with percent agreement and a description of the prevalence context. Do not use ICC\n(consistency) when systematic rater offsets carry clinical meaning — this choice hides\nthe most dangerous form of rater disagreement. Do not apply Bland-Altman to categorical\nor ordinal data; use kappa or weighted kappa. 
Do not interpret limits of agreement as\nthe measurement error of a single instrument; they reflect the total variability of\nthe paired *difference*, which aggregates both instruments' contributions.\n\n**Interpreting the output**\n\nFrom the worked example below: two cardiologists independently adjudicate 100 candidate\nMACE events from a comparative drug-safety cohort, blinded to treatment assignment.\nObserved agreement = 0.90. Rater 1's marginal confirm rate = 80/100 = 0.80; Rater 2's\nmarginal confirm rate = 90/100 = 0.90. Expected chance agreement Pe = 0.80 * 0.90 +\n0.20 * 0.10 = 0.720 + 0.020 = 0.740. Kappa = (0.90 - 0.74)/(1 - 0.74) = 0.16/0.26\n≈ 0.615.\n\n*(1) Formal interpretation.* Kappa = 0.615 quantifies the proportion of possible\nnon-chance agreement that was actually achieved: the two raters agreed substantially\nmore than would be predicted from their marginal calling rates alone. A kappa of 0\nwould mean all agreement is explainable by the raters independently applying their own\nconfirm-rate tendencies; a kappa of 1 would mean perfect agreement beyond chance.\nThe Landis-Koch scale places 0.615 in the \"substantial agreement\" band (0.61–0.80).\nThe observed agreement of 0.90 is meaningfully higher than the chance-expected agreement\nof 0.74; the 10 discordant cases where Rater 2 confirmed but Rater 1 did not represent\nthe irreducible clinical judgment zone at the boundary of the case definition.\n\n*(2) Practical interpretation.* A kappa of 0.615 meets the most common regulatory\nthreshold (≥ 0.60) for a pre-tiebreak inter-rater calibration exercise in a clinical\nevents committee. The 10 discordant cases should be reviewed by a third-party tiebreaker,\nand the specific features driving disagreement — the clinical findings that led Rater 1\nto classify as \"not confirmed\" — should be documented and used to sharpen the case\ndefinition charter for the full study adjudication. 
If kappa were below 0.40, the\ncalibration exercise would have failed and the adjudication would need to be paused to revise the charter\nbefore proceeding with the full event set.",
    "primary_category": "Inferential_Statistics",
    "tags": [
      "agreement",
      "inter-rater-reliability",
      "kappa",
      "icc",
      "intraclass-correlation",
      "bland-altman",
      "limits-of-agreement",
      "measurement-validation",
      "chart-review",
      "algorithm-validation",
      "adjudication",
      "method-comparison",
      "statistics"
    ],
    "applies_to_study_types": [
      "cohort_retrospective",
      "cohort_prospective",
      "claims_analysis",
      "ehr_study",
      "registry_study",
      "comparative_effectiveness",
      "algorithm_validation"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1016/S0140-6736(86)90837-8",
        "url": "https://doi.org/10.1016/S0140-6736(86)90837-8",
        "citation_text": "Bland JM, Altman DG. Statistical methods for assessing agreement between two methods of clinical measurement. The Lancet. 1986;327(8476):307-310.",
        "year": 1986,
        "authors_short": "Bland & Altman",
        "notes": "Foundational paper introducing the limits-of-agreement plot and establishing that correlation is the wrong tool for assessing method agreement. Demonstrated with two instruments measuring peak expiratory flow rate that high correlation is compatible with clinically unacceptable systematic bias. The most cited paper in the clinical measurement literature."
      },
      {
        "role": "explain",
        "doi": "10.1016/j.jclinepi.2010.03.002",
        "url": "https://doi.org/10.1016/j.jclinepi.2010.03.002",
        "citation_text": "Kottner J, Audige L, Brorson S, et al. Guidelines for Reporting Reliability and Agreement Studies (GRRAS) were proposed. Journal of Clinical Epidemiology. 2011;64(1):96-106.",
        "year": 2011,
        "authors_short": "Kottner et al.",
        "notes": "The GRRAS reporting guidelines for reliability and agreement studies; specifies the minimum elements required when reporting kappa, ICC, and Bland-Altman analyses in clinical and health research publications. Provides the framework for pre-specifying the model form, unit of measurement, and agreement threshold before data collection."
      },
      {
        "role": "explain",
        "doi": "10.11613/BM.2012.031",
        "url": "https://doi.org/10.11613/BM.2012.031",
        "citation_text": "McHugh ML. Interrater reliability: the kappa statistic. Biochemia Medica. 2012;22(3):276-282.",
        "year": 2012,
        "authors_short": "McHugh",
        "notes": "Practical reference covering Cohen's kappa calculation, interpretation against the Landis-Koch scale, weighted kappa for ordinal data, and the prevalence paradox. Widely cited as the accessible guide to kappa for clinical researchers; addresses the common pitfall of interpreting raw percent agreement without chance correction."
      }
    ],
    "plain_language_summary": "Agreement statistics answer a different question from correlation: two measurement methods can move together perfectly (correlation = 1.0) while one consistently reads twice as high as the other, so they cannot be used interchangeably. Cohen's kappa measures how much two reviewers agree on categorical calls — such as \"confirmed heart attack\" versus \"not confirmed\" — beyond what pure chance would predict, and is the standard metric reported after endpoint adjudication and algorithm validation. The intraclass correlation coefficient (ICC) extends this idea to continuous measurements from multiple raters, while the Bland-Altman limits-of-agreement plot shows whether a new measurement method differs from a gold standard by clinically acceptable amounts.",
    "key_terms": [
      {
        "term": "Cohen's kappa",
        "definition": "A statistic ranging from -1 to 1 that measures how much two raters agree on a categorical outcome beyond what would be expected if each rater made independent decisions based only on their own overall calling rates."
      },
      {
        "term": "expected agreement by chance (Pe)",
        "definition": "The proportion of cases where two raters would agree purely by chance, computed from each rater's marginal calling rates; kappa subtracts this baseline before computing the agreement ratio."
      },
      {
        "term": "intraclass correlation coefficient (ICC)",
        "definition": "A family of statistics measuring how consistently multiple raters score the same subjects on a continuous scale; the choice between absolute-agreement and consistency forms determines whether systematic rater differences count against the ICC."
      },
      {
        "term": "Bland-Altman limits of agreement",
        "definition": "The range within which approximately 95% of differences between two measurement methods are expected to fall; if this range is clinically acceptable, the two methods can be used interchangeably."
      },
      {
        "term": "prevalence paradox",
        "definition": "The phenomenon where kappa is misleadingly low even when observed agreement is high, because one outcome category is so dominant that chance agreement alone would explain most of the observed agreement."
      },
      {
        "term": "weighted kappa",
        "definition": "A variant of Cohen's kappa for ordinal categories that penalizes near-miss disagreements (e.g., one grade apart) less severely than large disagreements (e.g., three grades apart)."
      }
    ],
    "worked_example": {
      "scenario": "A claims-based comparative drug-safety study uses a clinical events committee to adjudicate candidate MACE (major adverse cardiovascular events) flagged by a claims algorithm. Two cardiologists independently review 100 candidate event packets, each stripped of the patient's drug assignment. The analyst wants to report the pre-tiebreak inter-rater agreement, compute Cohen's kappa to correct for chance, and interpret the result against the regulatory threshold of kappa ≥ 0.60.",
      "dataset": {
        "caption": "Two-by-two agreement table for 100 candidate MACE events reviewed independently by two cardiologists blinded to treatment assignment. Cell values are event counts. Rater 2 was more liberal, confirming 90 of 100 events versus 80 for Rater 1.",
        "columns": [
          "",
          "Rater 2: Confirmed",
          "Rater 2: Not confirmed",
          "Row total"
        ],
        "rows": [
          [
            "Rater 1: Confirmed",
            80,
            0,
            80
          ],
          [
            "Rater 1: Not confirmed",
            10,
            10,
            20
          ],
          [
            "Column total",
            90,
            10,
            100
          ]
        ]
      },
      "steps": [
        "Count agreements on the diagonal: 80 cases where both said confirmed, and 10 cases where both said not confirmed. Total agreements = 80 + 10 = 90 out of 100 events.",
        "Compute observed proportion agreement: observed_agreement = (80 + 10)/100 = 90/100 = 0.90.",
        "Compute each rater's marginal confirm rate. Rater 1 confirmed 80 cases: P_R1_confirm = 80/100 = 0.80, P_R1_not = 20/100 = 0.20. Rater 2 confirmed 90 cases: P_R2_confirm = 90/100 = 0.90, P_R2_not = 10/100 = 0.10.",
        "Compute expected agreement by chance (Pe): if both raters were drawing from their marginal rates independently, the probability both confirm is 0.80 * 0.90 = 0.720, and the probability both do not confirm is 0.20 * 0.10 = 0.020. So Pe = 0.720 + 0.020 = 0.740.",
        "Apply the kappa formula. Numerator (non-chance agreement): kappa_num = 0.90 - 0.74 = 0.16. Denominator (maximum possible non-chance agreement): kappa_den = 1 - 0.74 = 0.26. Kappa = kappa_num/kappa_den = 0.16/0.26 ≈ 0.615.",
        "Interpret: kappa ≈ 0.615 falls in the 'substantial agreement' band (0.61–0.80) on the Landis-Koch scale and meets the regulatory threshold of ≥ 0.60. The 10 discordant cases (Rater 2 confirmed, Rater 1 did not) represent edge cases to be resolved by the tiebreaker."
      ],
      "result": "observed_agreement = (80 + 10)/100 = 0.90. Pe = 0.80 * 0.90 + 0.20 * 0.10 = 0.720 + 0.020 = 0.740. kappa = 0.16/0.26 ≈ 0.615. The inter-rater agreement is substantial, meeting the kappa ≥ 0.60 threshold for endpoint adjudication in a regulatory submission. The 10 discordant events (Rater 2 confirmed, Rater 1 did not) are sent to a third cardiologist tiebreaker."
    },
    "prerequisites": [
      "descriptive-statistics",
      "pearson-spearman-correlation",
      "parametric-vs-nonparametric-tests"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Unweighted Cohen's kappa (binary or nominal categories)",
        "description": "The standard kappa for binary (yes/no) or unordered nominal (category A/B/C) agreement tasks. Treats all disagreements as equally serious regardless of category. Used for binary adjudication outcomes (confirmed/not confirmed), binary algorithm classifications (phenotype positive/negative), and multi-class classifications where no ordering exists among the categories.",
        "edge_cases": [
          "At extreme prevalence (> 90% or < 10% true positive rate), the prevalence paradox may make kappa near zero even when observed agreement exceeds 0.90; report both percent agreement and kappa with the base rate.",
          "Kappa requires discrete, pre-defined categories; continuous ratings discretized into bins lose information and the cut-point choice affects kappa."
        ],
        "data_source_notes": "Claims: used when validating a binary claims phenotype algorithm against chart- abstracted truth or registry adjudication. EHR: inter-annotator kappa on NLP training labels. Registry: comparing a registry adjudicated endpoint to a claims algorithm flag."
      },
      {
        "name": "Weighted kappa for ordinal categories",
        "description": "Extends kappa to ordered categories (e.g., severity grades 0/1/2/3) by applying a weight matrix so that near-miss disagreements (adjacent grades) are penalized less than large disagreements (non-adjacent grades). Linear weighting applies uniform distance penalties; quadratic weighting penalizes large discordances more steeply. The weighting scheme must be pre-specified because it materially changes the resulting kappa value.",
        "edge_cases": [
          "Quadratic-weighted kappa is numerically equivalent to the ICC for ordinal data in many common designs; researchers sometimes report one when they mean the other.",
          "Ordinal data with many tied ranks and few levels reduces the discriminating power of both kappa variants; polychoric correlation may be more appropriate as a supplementary measure."
        ],
        "data_source_notes": "Registry: inter-rater kappa for disease severity staging (e.g., NYHA class, KDIGO AKI stage). EHR: agreement on three-level NLP annotation labels (positive, negative, uncertain)."
      },
      {
        "name": "ICC two-way random effects, absolute agreement (ICC 2,1)",
        "description": "The standard ICC for the case where the same set of raters evaluates all subjects and systematic rater differences are treated as disagreement (they lower the ICC). Computed from a two-way random effects ANOVA with subject and rater as random factors. Reports the ICC for a single measurement (subscript 1), not the average of k measurements. This is the form required by most regulatory guidance for continuous inter-rater reliability in clinical studies.",
        "edge_cases": [
          "ICC values depend strongly on the between-subject variance in the sample; a heterogeneous sample inflates ICC even when rater precision is poor. Always report the range of the ratings alongside the ICC.",
          "Missing data (some raters did not evaluate all subjects) invalidates the balanced ANOVA; use mixed models with REML for unbalanced designs."
        ],
        "data_source_notes": "EHR: agreement between two readers on imaging-derived measurements. Registry: inter-rater ICC on adjudicated continuous endpoints. Primary: inter-observer agreement on clinical examination findings."
      },
      {
        "name": "Bland-Altman with proportional bias correction (log-transform)",
        "description": "When the spread of differences between two methods increases with the measurement magnitude (funnel shape on the standard Bland-Altman plot), proportional bias is present. Log-transform both measurements, compute the Bland-Altman on the log scale, and back-transform the limits of agreement as ratio bounds (e.g., \"Method A measures within a factor of 1.15 of Method B for 95% of paired samples\"). A Pearson r between the difference and the mean on the original scale formally tests for proportional bias.",
        "edge_cases": [
          "Log transformation requires all measurements to be positive; add a small constant before logging if zeros are present, or use a ratio-based method directly.",
          "Back-transformed limits are asymmetric (not ± some value) and must be reported as a ratio range, not as absolute differences."
        ],
        "data_source_notes": "Linked claims-EHR: validating claims-derived cost or utilization proxies against EHR-observed values where proportional scaling is expected."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "pearson-spearman-correlation",
        "pros_of_this": "Agreement statistics (kappa, ICC, Bland-Altman) directly answer the clinical substitutability question. A high correlation is not evidence of agreement; two instruments with r = 0.99 can have systematic bias that disqualifies substitution. Kappa corrects for chance; ICC decomposes systematic and random rater error; Bland-Altman shows the full distribution of differences.",
        "cons_of_this": "Correlation is appropriate when the research question is about the strength of association rather than substitutability — for example, collinearity screening among candidate covariates or exploratory biomarker screening. Agreement statistics require paired measurements on the same subjects or cases.",
        "when_to_prefer": "Use agreement statistics any time the question is whether two measurements of the same construct are interchangeable or whether two raters are producing consistent results. Use correlation only when the question is about the strength of co-movement, not about substitutability or reliability."
      },
      {
        "compared_to": "algorithm-validation",
        "pros_of_this": "Kappa complements PPV and sensitivity from a 2x2 validation table by correcting for prevalence, making it easier to compare algorithm specifications across datasets with different base rates. Agreement statistics also apply to the reliability of the reference standard itself (the adjudicators), which PPV cannot measure.",
        "cons_of_this": "PPV and sensitivity from a validation study directly quantify case-finding accuracy and feed quantitative bias analysis for correcting effect estimates. Kappa does not distinguish false positives from false negatives and cannot be used to correct misclassification in the effect estimand.",
        "when_to_prefer": "Report both: kappa for the inter-rater reliability of the adjudication process (the reference standard quality), and PPV/sensitivity for the operating characteristics of the algorithm against that reference standard."
      },
      {
        "compared_to": "endpoint-adjudication-chart-review-rwe",
        "pros_of_this": "Agreement statistics (kappa) provide the quantitative metric that makes an adjudication process verifiable and reportable. Adjudication describes the operational process; kappa measures whether that process produced consistent outcomes across raters.",
        "cons_of_this": "Kappa alone does not describe the adjudication workflow, the case definition, or the blinding procedures. A high kappa on a biased or unblinded adjudication process does not guarantee unbiased event classification.",
        "when_to_prefer": "Kappa is a required output of every multi-reviewer adjudication process. It should always be reported alongside the adjudication design details, not as a standalone metric."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Claims-based kappa arises most often when validating a binary claims algorithm against chart abstraction or registry adjudication. Compute kappa from the 2x2 table of algorithm call vs chart-confirmed truth, report both unweighted kappa (if binary) and percent agreement with the base rate. At large scale (> 10,000 adjudicated pairs), kappa's standard error is negligible and the focus should be on the confidence interval and on whether the kappa holds separately within exposure arms (to detect differential misclassification).",
      "ehr": "EHR applications include ICC for continuous biomarker agreement between two laboratory systems or two timepoints, inter-annotator kappa for NLP training labels, and Bland-Altman for comparing EHR-recorded values to reference assay measurements. Be alert to informative missingness: lab values that are ordered only when clinical suspicion is high are not a random sample of the target population, which can bias both ICC and Bland-Altman analyses toward better apparent agreement.",
      "registry": "Registry data often serves as the reference standard for kappa computation (registry adjudicated endpoint vs claims algorithm). When the registry variable is itself a rated scale (e.g., NYHA class, ECOG performance status), report weighted kappa for ordinal agreement. Be explicit about whether the registry variable was adjudicated by the registry protocol or inherited from a clinical note, as these carry different reference-standard quality.",
      "linked": "Linked claims-EHR-registry datasets enable the fullest method-comparison analysis: a claims-derived continuous proxy can be compared to an EHR lab value via Bland- Altman, a binary claims algorithm can be validated against a registry endpoint via kappa, and an ICC can be computed for continuous ratings across systems. Linkage selection (only the linkable subset is analyzed) is the key threat to generalizability of all three statistics."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\nimport pandas as pd\nfrom sklearn.metrics import cohen_kappa_score\n\n# ── 1. Cohen's kappa: 2x2 adjudication table from worked example ──\n# 100 events: a=80 (both confirmed), b=0 (R1 yes/R2 no), c=10 (R1 no/R2 yes), d=10 (both no)\nrater1 = [1] * 80 + [0] * 10 + [0] * 10   # confirmed=1, not confirmed=0\nrater2 = [1] * 80 + [1] * 10 + [0] * 10\n\nkappa_uw = cohen_kappa_score(rater1, rater2, weights=None)\nprint(f\"Unweighted kappa: {kappa_uw:.4f}\")\n# Manual verification: Po = 0.90, Pe = 0.80*0.90 + 0.20*0.10 = 0.74\n# kappa = (0.90 - 0.74) / (1 - 0.74) = 0.16 / 0.26 = 0.6154\n\n# ── 2. Weighted kappa for ordinal ratings (3-level severity: 0=none, 1=mild, 2=severe) ──\nr1_ord = [0, 0, 1, 1, 2, 2, 1, 0, 2, 1]\nr2_ord = [0, 1, 1, 2, 2, 2, 0, 0, 2, 1]\nkappa_lin  = cohen_kappa_score(r1_ord, r2_ord, weights=\"linear\")\nkappa_quad = cohen_kappa_score(r1_ord, r2_ord, weights=\"quadratic\")\nprint(f\"Linear-weighted kappa (ordinal):    {kappa_lin:.4f}\")\nprint(f\"Quadratic-weighted kappa (ordinal): {kappa_quad:.4f}\")\n# Note: quadratic-weighted kappa = ICC for balanced 2-rater designs on ordinal data\n\n# ── 3. 
ICC: two-way absolute agreement (ICC 2,1) using pingouin ──\nimport pingouin as pg\n# 10 patients rated by 2 raters on a continuous score (e.g., troponin proxy, 0-100)\nnp.random.seed(42)\nn_patients = 10\ntrue_score = np.random.uniform(20, 80, n_patients)\nratings_long = pd.DataFrame({\n    \"patient\": list(range(n_patients)) * 2,\n    \"rater\":   [\"R1\"] * n_patients + [\"R2\"] * n_patients,\n    \"score\":   np.concatenate([\n        true_score + np.random.normal(0, 3, n_patients),   # Rater 1: small random error\n        true_score + np.random.normal(2, 3, n_patients),   # Rater 2: +2 systematic bias\n    ]),\n})\nicc_result = pg.intraclass_corr(\n    data=ratings_long,\n    targets=\"patient\",   # subject identifier\n    raters=\"rater\",      # rater identifier\n    ratings=\"score\",     # continuous rating\n)\n# ICC(2,1): two-way random, single measurement, absolute agreement\nicc21 = icc_result[icc_result[\"Type\"] == \"ICC2\"].iloc[0]\nprint(f\"\\nICC(2,1) absolute agreement: {icc21['ICC']:.4f}  \"\n      f\"95% CI [{icc21['CI95%'][0]:.4f}, {icc21['CI95%'][1]:.4f}]\")\n# ICC(3,1): two-way mixed, single measurement, consistency (rater offset ignored)\nicc31 = icc_result[icc_result[\"Type\"] == \"ICC3\"].iloc[0]\nprint(f\"ICC(3,1) consistency:         {icc31['ICC']:.4f}  \"\n      f\"95% CI [{icc31['CI95%'][0]:.4f}, {icc31['CI95%'][1]:.4f}]\")\nprint(\"Note: ICC consistency > ICC absolute when systematic rater bias is present (+2 offset here).\")\n\n# ── 4. 
Bland-Altman limits of agreement ──\n# Comparing two troponin assays on 12 patients (systematic +3 offset in Assay B)\nassay_a = np.array([45, 62, 38, 71, 55, 49, 66, 42, 58, 74, 37, 53], dtype=float)\nassay_b = assay_a + np.random.normal(3, 4, len(assay_a))   # Assay B reads ~3 units higher\n\nmeans = (assay_a + assay_b) / 2\ndiffs = assay_a - assay_b          # Method A minus Method B convention\nbias  = np.mean(diffs)             # negative: Assay A reads lower than Assay B on average\nsd_d  = np.std(diffs, ddof=1)      # ddof=1 for sample SD\nloa_hi = bias + 1.96 * sd_d\nloa_lo = bias - 1.96 * sd_d\n\nprint(f\"\\nBland-Altman (Assay A - Assay B):\")\nprint(f\"  Bias (mean difference):    {bias:.2f}\")\nprint(f\"  SD of differences:         {sd_d:.2f}\")\nprint(f\"  Upper limit of agreement:  {loa_hi:.2f}\")\nprint(f\"  Lower limit of agreement:  {loa_lo:.2f}\")\nprint(\"  Clinical decision: are these limits acceptable for clinical use?\")\nprint(\"  (A pre-specified maximum acceptable difference is required to answer this.)\")",
        "description": "Demonstrates all three agreement methods. Cohen's kappa (unweighted and weighted) uses\nsklearn.metrics.cohen_kappa_score. ICC uses pingouin.intraclass_corr on long-format data.\nBland-Altman is computed manually with numpy, producing bias, SD of differences, and\nlimits of agreement. Uses the 100-event adjudication dataset from the worked example for\nkappa, and small synthetic datasets for ICC and Bland-Altman. No external plotting library\nrequired for the Bland-Altman statistics (only numpy); add matplotlib if the plot is needed.",
        "dependencies": [
          "scikit-learn",
          "pingouin",
          "numpy",
          "pandas"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(irr)\n\n# ── 1. Cohen's kappa (binary adjudication, worked example) ──\nrater1 <- c(rep(1, 80), rep(0, 10), rep(0, 10))   # confirmed=1, not confirmed=0\nrater2 <- c(rep(1, 80), rep(1, 10), rep(0, 10))\n\nratings_bin <- cbind(rater1, rater2)\nkappa_uw <- kappa2(ratings_bin, weight = \"unweighted\")\ncat(sprintf(\"Unweighted kappa: %.4f  (z=%.2f, p=%.4f)\\n\",\n            kappa_uw$value, kappa_uw$statistic, kappa_uw$p.value))\n# Expected: kappa ~ 0.615 (Po=0.90, Pe=0.74; see worked example)\n\n# ── 2. Weighted kappa for ordinal ratings ──\nr1_ord <- c(0, 0, 1, 1, 2, 2, 1, 0, 2, 1)\nr2_ord <- c(0, 1, 1, 2, 2, 2, 0, 0, 2, 1)\nratings_ord <- cbind(r1_ord, r2_ord)\n\nkappa_lin  <- kappa2(ratings_ord, weight = \"equal\")     # linear weights\nkappa_quad <- kappa2(ratings_ord, weight = \"squared\")   # quadratic weights\ncat(sprintf(\"Linear-weighted kappa:    %.4f\\n\", kappa_lin$value))\ncat(sprintf(\"Quadratic-weighted kappa: %.4f\\n\", kappa_quad$value))\n\n# ── 3. 
ICC: two-way absolute agreement and consistency ──\nset.seed(42)\nn <- 10\ntrue_score <- runif(n, 20, 80)\nr1_cont <- true_score + rnorm(n, 0, 3)   # Rater 1: random error only\nr2_cont <- true_score + rnorm(n, 2, 3)   # Rater 2: +2 systematic offset + random error\nratings_cont <- cbind(r1_cont, r2_cont)  # irr::icc expects wide format (rows = subjects)\n\n# Two-way random, absolute agreement, single measurement = ICC(2,1)\nicc_abs <- icc(ratings_cont, model = \"twoway\", type = \"agreement\", unit = \"single\")\ncat(sprintf(\"\\nICC(2,1) absolute agreement: %.4f  95%% CI [%.4f, %.4f]\\n\",\n            icc_abs$value, icc_abs$lbound, icc_abs$ubound))\n\n# Two-way mixed (raters treated as fixed), consistency, single measurement = ICC(3,1)\nicc_con <- icc(ratings_cont, model = \"twoway\", type = \"consistency\", unit = \"single\")\ncat(sprintf(\"ICC(3,1) consistency:         %.4f  95%% CI [%.4f, %.4f]\\n\",\n            icc_con$value, icc_con$lbound, icc_con$ubound))\ncat(\"Note: absolute < consistency when systematic rater bias is present.\\n\")\n\n# ── 4. Bland-Altman (base R) ──\nassay_a <- c(45, 62, 38, 71, 55, 49, 66, 42, 58, 74, 37, 53)\nset.seed(7)\nassay_b <- assay_a + rnorm(length(assay_a), 3, 4)   # Assay B: ~3 units higher\n\nba_means <- (assay_a + assay_b) / 2\nba_diffs  <- assay_a - assay_b\nbias      <- mean(ba_diffs)\nsd_d      <- sd(ba_diffs)         # R's sd() uses n-1 denominator by default\nloa_hi    <- bias + 1.96 * sd_d\nloa_lo    <- bias - 1.96 * sd_d\n\ncat(sprintf(\"\\nBland-Altman (Assay A - Assay B):\\n\"))\ncat(sprintf(\"  Bias (mean diff):          %.2f\\n\", bias))\ncat(sprintf(\"  SD of differences:         %.2f\\n\", sd_d))\ncat(sprintf(\"  Upper limit of agreement:  %.2f\\n\", loa_hi))\ncat(sprintf(\"  Lower limit of agreement:  %.2f\\n\", loa_lo))\n\n# If blandr is installed, use it for a more complete summary:\n# library(blandr)\n# blandr.statistics(assay_a, assay_b, sig.level = 0.95)\n# blandr.draw(assay_a, assay_b)",
        "description": "All three agreement methods in R. Unweighted and weighted kappa use irr::kappa2.\nICC uses irr::icc with explicit model and type arguments. Bland-Altman statistics\nare computed from base R; blandr::blandr.statistics provides the full summary if the\nblandr package is available. Uses the same adjudication dataset as the Python\nimplementation for kappa and small synthetic datasets for ICC and Bland-Altman.",
        "dependencies": [
          "irr",
          "blandr"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* ── Create adjudication dataset (worked example: 100 events) ── */\ndata work.adjud;\n  /* 80 both confirmed, 10 R1=no/R2=yes, 10 both not confirmed */\n  do i = 1 to 80;  r1 = 1; r2 = 1; output; end;\n  do i = 1 to 10;  r1 = 0; r2 = 1; output; end;\n  do i = 1 to 10;  r1 = 0; r2 = 0; output; end;\n  drop i;\nrun;\n\n/* ── 1. Unweighted kappa (AGREE option): matches worked example ── */\nproc freq data=work.adjud;\n  tables r1 * r2 / agree;\n  /* AGREE: outputs percent agreement, kappa, weighted kappa (if >2 levels), SE, and 95% CI */\n  /* Expected: kappa ~ 0.615 (Po=0.90, Pe=0.74) */\nrun;\n\n/* ── 2. Weighted kappa for ordinal 3-level ratings ── */\ndata work.ordinal;\n  input r1 r2 count;\n  datalines;\n0 0 3\n0 1 1\n1 0 1\n1 1 2\n1 2 1\n2 1 0\n2 2 2\n;\nrun;\nproc freq data=work.ordinal;\n  weight count;\n  tables r1 * r2 / agree;\n  /* Weighted kappa uses linear (equal) weights by default in PROC FREQ AGREE */\n  /* Add WTKAP option for Cicchetti-Allison (quadratic) weights:               */\n  /* tables r1*r2 / agree(wtkap);                                              */\nrun;\n\n/* ── 3. 
ICC via PROC MIXED variance components ──\n   Two-way random effects model: score = mu + subject_effect + rater_effect + error\n   Var(subject) = sigma2_s, Var(rater) = sigma2_r, Var(error) = sigma2_e\n   ICC absolute agreement: sigma2_s / (sigma2_s + sigma2_r + sigma2_e)\n   ICC consistency:        sigma2_s / (sigma2_s + sigma2_e)                    */\ndata work.ratings;\n  input patient rater score @@;   /* @@ holds the line: two records per data line */\n  datalines;\n1  1  47.2    1  2  51.8\n2  1  63.1    2  2  65.4\n3  1  39.5    3  2  43.2\n4  1  72.4    4  2  74.9\n5  1  56.7    5  2  60.1\n6  1  50.3    6  2  53.8\n7  1  67.2    7  2  69.4\n8  1  43.1    8  2  46.5\n9  1  59.8    9  2  62.3\n10 1  75.6   10  2  78.2\n;\nrun;\nproc mixed data=work.ratings method=reml;\n  class patient rater;\n  model score = / s;\n  random patient;            /* between-subject variance: sigma2_s */\n  random rater;              /* between-rater variance: sigma2_r; the residual is sigma2_e */\n  ods output CovParms = work.vc;   /* save variance components for ICC calculation */\nrun;\n/* Compute ICC from variance components stored in work.vc                   */\n/* ICC(2,1) absolute = sigma2_s / (sigma2_s + sigma2_r + sigma2_e)         */\n/* ICC(3,1) consistency = sigma2_s / (sigma2_s + sigma2_e)                 */\n/* With only 2 raters, sigma2_r is estimated imprecisely                    */\nproc transpose data=work.vc out=work.vc_wide;\n  id CovParm;      /* one column per component: patient, rater, Residual */\n  var Estimate;\nrun;\ndata work.icc;\n  set work.vc_wide;\n  icc_abs = patient / (patient + rater + Residual);\n  icc_con = patient / (patient + Residual);\n  put \"ICC(2,1) absolute agreement = \" icc_abs;\n  put \"ICC(3,1) consistency = \" icc_con;\nrun;\n\n/* ── 4. 
Bland-Altman statistics ── */\ndata work.ba;\n  input assay_a assay_b @@;    /* @@ reads all six pairs on each data line */\n  diff  = assay_a - assay_b;   /* Method A minus Method B */\n  avg   = (assay_a + assay_b) / 2;\n  datalines;\n45 48.3   62 66.1   38 41.4   71 74.2   55 59.0   49 52.8\n66 70.3   42 46.5   58 61.9   74 77.4   37 40.8   53 57.1\n;\nrun;\n/* Bias (mean difference) and SD of differences */\nproc means data=work.ba mean std n;\n  var diff;\n  output out=work.ba_stats mean=bias std=sd_diff;\nrun;\n/* Limits of agreement: bias +/- 1.96 * sd_diff */\ndata work.ba_loa;\n  set work.ba_stats;\n  loa_upper = bias + 1.96 * sd_diff;\n  loa_lower = bias - 1.96 * sd_diff;\n  put \"Bias = \" bias;\n  put \"SD of differences = \" sd_diff;\n  put \"Upper limit of agreement = \" loa_upper;\n  put \"Lower limit of agreement = \" loa_lower;\nrun;\n/* Scatter plot (mean vs difference) using PROC SGPLOT */\nproc sgplot data=work.ba;\n  scatter x=avg y=diff;\n  refline 0 / lineattrs=(pattern=solid);   /* zero-difference reference */\n  /* Add reflines for bias and LoA after computing them above             */\n  xaxis label=\"Mean of Assay A and Assay B\";\n  yaxis label=\"Difference (Assay A - Assay B)\";\n  title \"Bland-Altman Plot: Assay A vs Assay B\";\nrun;",
        "description": "Agreement statistics in SAS. PROC FREQ with the AGREE option computes Cohen's kappa\n(unweighted and weighted) from a two-way contingency table. ICC is obtained from\nPROC MIXED using restricted maximum likelihood (REML) with subject and rater as\nrandom effects; the two ICC forms (absolute agreement vs consistency) are derived from\nthe variance component estimates. Bland-Altman statistics (bias, SD, limits of\nagreement) are computed in a DATA step and summarized with PROC MEANS. Uses the\nsame adjudication data as the Python and R implementations.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Q[Two measurements of the same construct<br/>on the same subjects?] --> Scale[What is the scale?]\n  Scale --> Cat[Categorical / binary<br/>outcome]\n  Scale --> Cont[Continuous rating<br/>from multiple raters]\n  Scale --> Subst[Continuous method comparison:<br/>can Method B replace Method A?]\n  Cat --> Ord{Ordered categories?}\n  Ord -- No --> UKappa[\"Unweighted Cohen's kappa<br/>(binary or nominal)\"]\n  Ord -- Yes --> WKappa[\"Weighted kappa<br/>(linear or quadratic weights;<br/>pre-specify the weight scheme)\"]\n  Cont --> ICCform{Design?}\n  ICCform -- \"Same raters, all subjects\" --> TwoWay[\"Two-way ICC<br/>Use ABSOLUTE if rater offset matters<br/>Use CONSISTENCY if offset is irrelevant\"]\n  ICCform -- \"Different raters per subject\" --> OneWay[\"One-way ICC<br/>(raters are random from a pool)\"]\n  Subst --> BA[\"Bland-Altman limits-of-agreement plot<br/>Compute bias + 1.96 x SD_diff<br/>Ask: are these limits clinically acceptable?\"]\n  UKappa --> Interp[\"Interpret with Landis-Koch scale:<br/>< 0.20 slight  0.21-0.40 fair<br/>0.41-0.60 moderate  0.61-0.80 substantial<br/>0.81-1.00 almost perfect\"]\n  WKappa --> Interp\n  TwoWay --> Interp2[\"Report ICC point estimate + 95% CI<br/>+ the specific ICC model chosen\"]\n  OneWay --> Interp2\n  BA --> Interp3[\"Report bias (systematic offset) and<br/>limits of agreement (random variability)<br/>A pre-specified acceptable range is required\"]\n  style UKappa fill:#d4edda\n  style WKappa fill:#d4edda\n  style TwoWay fill:#cce5ff\n  style OneWay fill:#cce5ff\n  style BA fill:#fff3cd",
        "caption": "Decision tree for selecting the correct agreement statistic by variable type and study design. Correlation (Pearson, Spearman) is not shown because it does not answer the agreement question regardless of its value.",
        "alt_text": "Flowchart branching on variable scale (categorical, continuous rating, continuous method comparison) into the appropriate agreement statistic: unweighted or weighted kappa for categorical/ordinal outcomes, one-way or two-way ICC for continuous ratings from multiple raters, and Bland-Altman for method substitutability questions.",
        "source_type": "illustrative",
        "source_citations": [
          "kottner-2011"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "see_also",
        "target_slug": "pearson-spearman-correlation",
        "notes": "The pearson-spearman-correlation entry explicitly routes agreement questions here; the central teaching point of this concept is that correlation and agreement are different quantities. A Pearson r of 0.99 between two methods is compatible with systematic bias that makes them non-interchangeable."
      },
      {
        "relation_type": "used_with",
        "target_slug": "endpoint-adjudication-chart-review-rwe",
        "notes": "Endpoint adjudication committees report Cohen's kappa as the primary inter-rater reliability metric before the tiebreak. Kappa quantifies the quality of the reference standard produced by the adjudication process."
      },
      {
        "relation_type": "used_with",
        "target_slug": "algorithm-validation",
        "notes": "Algorithm validation studies report kappa alongside PPV and sensitivity; kappa corrects for prevalence effects that make PPV alone an incomplete summary of algorithm-reference-standard agreement across datasets with different base rates."
      },
      {
        "relation_type": "see_also",
        "target_slug": "ehr-phenotyping-algorithms-rwe",
        "notes": "NLP and rule-based EHR phenotyping algorithms require inter-annotator kappa on training labels before model training; the agreement between human annotators is the ceiling for algorithm accuracy."
      },
      {
        "relation_type": "requires",
        "target_slug": "descriptive-statistics",
        "notes": "Means, standard deviations, histograms, and marginal distributions are prerequisites for computing and interpreting kappa (marginal calling rates), ICC (between-subject and within-subject variances), and Bland-Altman (mean and SD of differences)."
      }
    ],
    "aliases": [
      "Cohen's kappa",
      "weighted kappa",
      "intraclass correlation coefficient",
      "ICC",
      "Bland-Altman plot",
      "limits of agreement",
      "inter-rater reliability",
      "interrater agreement",
      "method comparison statistics"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "ahrq-ccs-ccsr-clinical-classifications-rwe",
    "name": "AHRQ CCS and CCSR Clinical Classifications",
    "short_definition": "AHRQ HCUP diagnosis and procedure grouping tools that collapse detailed ICD codes into clinically meaningful categories for reporting, risk adjustment, cohort description, and feature construction.",
    "long_description": "AHRQ's Clinical Classifications Software family provides maintained code groupers that organize granular ICD diagnosis and procedure codes into clinically interpretable categories. Legacy CCS is widely used with ICD-9-CM. CCSR is the refined ICD-10-CM/PCS-era family, including diagnosis CCSR and procedure CCSR tools.\n\nIn RWE, CCS/CCSR groupers are useful when analysts need interpretable features instead of thousands of raw ICD codes. They can summarize baseline conditions, describe utilization mix, build descriptive dashboards, support model features, and create service-line or disease-area strata. They are not a replacement for disease-specific phenotype algorithms when the study endpoint or exposure requires high specificity.\n\nThe key operational decision is whether the grouping is used for description, covariate adjustment, or endpoint definition. Description tolerates broader categories. Confounding adjustment may benefit from grouped features. Endpoint definition usually needs validated code algorithms, diagnosis-position rules, claim-type rules, and adjudication or chart validation when possible.\n\n**Pros, cons, and trade-offs.** CCS/CCSR improves readability and reduces dimensionality. It lets a study show understandable condition groups instead of thousands of individual diagnosis or procedure codes. The trade-off is specificity: groupers are classification tools, not automatically validated phenotypes. A CCSR category may be too broad for an endpoint, too coarse for a mechanistic subgroup, or too heterogeneous for a causal covariate. The many-to-many structure of some mappings also means the analyst must decide whether categories are indicators, counts, hierarchies, or mutually exclusive groupings.\n\n**When to use.** Use CCS/CCSR for descriptive summaries, high-level utilization profiles, interpretable feature engineering, baseline covariates, and dashboards where clinical grouping is more useful than raw code granularity. 
It is especially helpful when the analysis spans many ICD-10-CM or ICD-10-PCS codes and reviewers need a stable, maintained grouping layer.\n\n**When NOT to use - and when it is actively misleading.** Do not use CCS/CCSR alone as a validated disease outcome, safety endpoint, or exposure algorithm when the question requires high PPV, timing rules, setting restrictions, or diagnosis-position logic. It is actively misleading to report a broad CCSR category as a specific phenotype without validating the codes, claim types, and positions used to create it.",
    "primary_category": "Data_Quality_Assessment",
    "tags": [
      "ahrq",
      "hcup",
      "ccs",
      "ccsr",
      "icd-10-cm",
      "icd-10-pcs",
      "diagnosis-grouping",
      "procedure-grouping"
    ],
    "applies_to_study_types": [
      "claims_analysis",
      "descriptive_epidemiology",
      "comparative_effectiveness",
      "predictive_modeling"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": null,
        "url": "https://hcup-us.ahrq.gov/toolssoftware/ccsr/ccs_refined.jsp",
        "citation_text": "Agency for Healthcare Research and Quality. Clinical Classifications Software Refined (CCSR). Healthcare Cost and Utilization Project.",
        "year": 2026,
        "authors_short": "AHRQ",
        "notes": "Main HCUP CCSR reference page."
      },
      {
        "role": "explain",
        "doi": null,
        "url": "https://hcup-us.ahrq.gov/toolssoftware/ccsr/dxccsr.jsp",
        "citation_text": "Agency for Healthcare Research and Quality. Clinical Classifications Software Refined for ICD-10-CM Diagnoses.",
        "year": 2026,
        "authors_short": "AHRQ",
        "notes": "Diagnosis CCSR reference and software."
      },
      {
        "role": "explain",
        "doi": null,
        "url": "https://hcup-us.ahrq.gov/toolssoftware/ccsr/prccsr.jsp",
        "citation_text": "Agency for Healthcare Research and Quality. Clinical Classifications Software Refined for ICD-10-PCS Procedures.",
        "year": 2026,
        "authors_short": "AHRQ",
        "notes": "Procedure CCSR reference and software."
      },
      {
        "role": "use",
        "doi": null,
        "url": "https://hcup-us.ahrq.gov/toolssoftware/ccs/ccs.jsp",
        "citation_text": "Agency for Healthcare Research and Quality. Clinical Classifications Software (CCS) for ICD-9-CM.",
        "year": 2026,
        "authors_short": "AHRQ",
        "notes": "Legacy ICD-9-CM CCS reference."
      },
      {
        "role": "demonstrate",
        "doi": "10.1002/pds.5702",
        "url": "https://doi.org/10.1002/pds.5702",
        "citation_text": "Hong LS, Garcia-Albeniz X, Friesen D, Foskett N, Beau-Lejdstrom R. Use of clinical classifications software to address ICD coding transition in large healthcare databases analyzed via high-dimensional propensity scores. Pharmacoepidemiology and Drug Safety. 2024;33(1):e5702.",
        "year": 2024,
        "authors_short": "Hong et al.",
        "notes": "Applied example of clinical classification software used to bridge ICD coding dictionaries in claims analyses."
      }
    ],
    "plain_language_summary": "CCS and CCSR are AHRQ tools that turn thousands of diagnosis and procedure codes into more readable clinical groups. They help analysts summarize claims or hospital data, but they are not automatically validated outcome definitions.",
    "key_terms": [
      {
        "term": "CCS",
        "definition": "AHRQ's legacy Clinical Classifications Software, commonly used for ICD-9-CM diagnosis and procedure grouping."
      },
      {
        "term": "CCSR",
        "definition": "AHRQ's refined ICD-10-era Clinical Classifications Software groupers for diagnoses and procedures."
      },
      {
        "term": "Grouper",
        "definition": "A mapping tool that assigns detailed codes to broader categories."
      }
    ],
    "worked_example": {
      "scenario": "A hospital-utilization study needs readable diagnosis groups for inpatient stays after October 2015. The analyst applies diagnosis CCSR to all diagnosis fields, then reports both principal-diagnosis CCSR for reason-for-admission and all-diagnosis CCSR flags for baseline burden.",
      "dataset": {
        "caption": "Simplified ICD-10-CM diagnosis grouping.",
        "columns": [
          "claim_id",
          "diagnosis_position",
          "icd10cm",
          "ccsr_category"
        ],
        "rows": [
          [
            "H001",
            "principal",
            "I21.4",
            "acute myocardial infarction"
          ],
          [
            "H001",
            "secondary",
            "E11.22",
            "diabetes mellitus with complications"
          ],
          [
            "H001",
            "secondary",
            "N18.4",
            "chronic kidney disease"
          ]
        ]
      },
      "steps": [
        "Select diagnosis CCSR rather than legacy ICD-9 CCS because the discharge is after ICD-10 implementation.",
        "Apply the release-specific mapping file and keep the version in the study archive.",
        "Separate principal diagnosis summaries from all-diagnosis covariate features.",
        "Avoid treating the broad CCSR category as a validated MI outcome without separate outcome-algorithm rules."
      ],
      "result": "The same hospitalization contributes to an admission-reason category and separate comorbidity-feature categories, with the grouper version recorded."
    },
    "prerequisites": [],
    "index_definitions": [
      {
        "name": "CCS",
        "definition": "Legacy AHRQ Clinical Classifications Software for grouping ICD-9-CM diagnoses and procedures into clinically meaningful categories.",
        "source": "AHRQ HCUP CCS",
        "use": "Historical ICD-9-CM analyses and long-run trend work that spans pre-ICD-10 data.",
        "notes": "Do not apply legacy CCS rules to ICD-10-CM without an appropriate crosswalk or CCSR strategy."
      },
      {
        "name": "Diagnosis CCSR",
        "definition": "Refined AHRQ ICD-10-CM diagnosis grouper with categories designed for clinical interpretability and analytic use.",
        "source": "AHRQ HCUP CCSR for ICD-10-CM diagnoses",
        "use": "ICD-10-CM diagnosis feature construction, descriptive summaries, and covariate grouping.",
        "notes": "A single diagnosis code may map to multiple CCSR categories in some cases."
      },
      {
        "name": "Procedure CCSR",
        "definition": "AHRQ ICD-10-PCS procedure grouper for inpatient procedure categories.",
        "source": "AHRQ HCUP CCSR for ICD-10-PCS procedures",
        "use": "Procedure burden summaries, service-line grouping, and inpatient-procedure feature construction.",
        "notes": "Not the same as CPT/HCPCS grouping for outpatient/professional services."
      }
    ],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Principal-diagnosis CCSR",
        "description": "Classifies the principal diagnosis of an inpatient stay, or the first-listed diagnosis of an outpatient encounter.",
        "edge_cases": [
          "Principal position meaning varies by setting.",
          "Outpatient first-listed diagnosis may not equal hospitalization reason."
        ],
        "data_source_notes": "Best for utilization ranking and reason-for-visit summaries."
      },
      {
        "name": "All-diagnosis CCSR features",
        "description": "Uses every diagnosis field to create broader condition features.",
        "edge_cases": [
          "Rule-out and history diagnoses can inflate prevalence.",
          "Many-to-many mappings require modeling choices."
        ],
        "data_source_notes": "Useful for covariate grouping, not a validated phenotype by itself."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Raw ICD code feature sets",
        "use_ccsr_when": "Interpretability and stable clinical categories matter.",
        "use_raw_codes_when": "Predictive modeling needs granular signals and enough sample size.",
        "notes": "CCSR reduces dimensionality but may obscure clinically meaningful subtypes."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Apply after choosing eligible claim types, diagnosis positions, and ICD era.",
      "ehr": "Encounter diagnoses can be grouped, but problem-list diagnoses may need separate governance.",
      "linked": "Use consistent grouper versions across claims and EHR-derived diagnosis extracts."
    },
    "implementations": [],
    "diagrams": [],
    "relations": [
      {
        "relation_type": "used_with",
        "target_slug": "icd-10-cm-diagnosis-coding",
        "notes": "Diagnosis CCSR groups ICD-10-CM codes."
      },
      {
        "relation_type": "used_with",
        "target_slug": "icd-10-pcs-procedure-coding",
        "notes": "Procedure CCSR groups ICD-10-PCS inpatient procedures."
      },
      {
        "relation_type": "see_also",
        "target_slug": "diagnosis-phenotype-algorithm-1ip-2op-time-window-rwe",
        "notes": "Disease phenotypes usually need more than broad CCS/CCSR grouping."
      }
    ],
    "aliases": [
      "AHRQ CCS",
      "AHRQ CCSR",
      "Clinical Classifications Software",
      "Clinical Classifications Software Refined",
      "HCUP CCS",
      "HCUP CCSR",
      "diagnosis CCSR",
      "procedure CCSR"
    ],
    "complexity": "foundational",
    "regulatory_relevance": [
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "algorithm-validation",
    "name": "Algorithm Validation",
    "short_definition": "A validation study that estimates the operating characteristics (PPV, sensitivity, specificity, NPV) of a claims- or EHR-based algorithm against a reference standard, so the algorithm-defined variable can be used in inference and, where needed, corrected for misclassification.",
    "long_description": "**Algorithm validation** quantifies how well a computable definition applied to routinely-collected data (an ICD/CPT/NDC/lab\nrule for a diagnosis, exposure, or outcome) reproduces the truth as captured by a **reference standard** (chart review,\nregistry linkage, adjudication committee, or a richer linked source). The deliverable is a set of operating characteristics\nfrom a 2x2 table of algorithm flag against adjudicated truth — **positive predictive value (PPV)**, **sensitivity (Se)**,\n**specificity (Sp)**, and **negative predictive value (NPV)** — each with a binomial confidence interval. Validation is not\na checkbox: it is the input to **quantitative bias analysis (QBA)** that propagates measurement error into the effect\nestimate. An algorithm is \"validated\" only relative to a population, an era, and a use; the same code list can be excellent\nfor case identification in one setting and dangerously biased in another.\n\n**Core conceptual distinction — which metric you can estimate, and which you can transport.** The metric you *can* estimate\ndepends on your sampling frame. If you sample only algorithm-positive charts (the common, cheap \"PPV study\"), you can\nestimate **PPV** but not Se or Sp, because you never observe truth among algorithm-negatives. To estimate **Se and Sp** you\nmust also sample (a usually weighted subset of) algorithm-negatives and adjudicate them — far more expensive and the reason\nmost published validations report PPV only. The critical, frequently-missed asymmetry: **Se and Sp are properties of the\nalgorithm-versus-truth classification and are approximately portable across populations; PPV and NPV are not.** By Bayes'\nrule, PPV = Se*pi / (Se*pi + (1 - Sp)*(1 - pi)), where pi is the true prevalence in the *target* population. A PPV of 0.90\nmeasured in a high-prevalence inpatient validation sample can collapse to 0.50 in a low-prevalence ambulatory cohort even\nthough the algorithm is unchanged. 
This is why correction methods that travel (e.g., the Se/Sp matrix correction) are built\non Se/Sp, and why porting a PPV number across data sources is a recurring error.\n\n**The estimand it feeds, and why misclassification direction matters.** Validation exists to defend a downstream causal or\ndescriptive estimand — an incidence rate, a risk ratio, a hazard ratio. **Non-differential** outcome misclassification\n(Se/Sp identical across exposure arms) with imperfect specificity biases a risk ratio *toward the null* in expectation, so\na \"null\" finding from an unvalidated, low-specificity outcome is uninterpretable. **Differential** misclassification\n(Se/Sp differing by exposure — e.g., treated patients are surveilled more and so their outcomes are coded more completely)\nbiases in an **unpredictable direction and magnitude**, and a correction that assumes non-differentiality can move the\nestimate *further from* the truth. The whole point of validation-plus-QBA is to replace the comforting but false\n\"non-differential bias is conservative\" reflex with an explicit, quantified adjustment.\n\n**Pros, cons, and trade-offs.**\n- **vs face-validity / citing someone else's number:** A study-specific internal validation removes the heroic transport\n  assumption and yields Se/Sp/PPV calibrated to *your* data, era, and population. Cost: charts, adjudicators, and\n  IRB-permitted re-identification. **Prefer internal validation** for primary endpoints in regulatory or HTA submissions;\n  borrowing an external estimate is acceptable only as a sensitivity-analysis anchor, never as the sole basis, and only\n  when the source population's prevalence and coding practices are demonstrably similar.\n- **vs a single high-PPV code list (no Se/Sp):** A PPV-only design is cheap and answers \"of those I flagged, how many are\n  real?\" — sufficient when you only need confirmed cases (case-finding) and accept lost power. 
It is **inadequate** when\n  you need to correct an effect estimate, because the matrix and most probabilistic corrections require Se and Sp. **Prefer\n  a Se/Sp design** whenever the validated variable is the outcome or exposure in a comparative analysis.\n- **vs probabilistic bias analysis (PBA):** A simple **matrix (deterministic) correction** plugs point estimates of Se/Sp\n  into a back-calculation and is transparent and fast, but understates uncertainty and can produce impossible (negative)\n  cell counts when Se/Sp estimates are imprecise. **PBA** draws Se/Sp from their sampling distributions (e.g., Beta priors\n  from the validation 2x2) and yields a simulation interval that honestly combines random and systematic error. **Prefer\n  PBA** when the validation sample is small or the correction must appear in a defensible inference; use the matrix\n  correction for a quick directional check.\n- **vs latent-class / Bayesian no-gold-standard models:** When the \"reference standard\" is itself imperfect (chart review\n  misses, adjudication disagrees), assuming it is truth biases Se/Sp. Latent-class models estimate algorithm and\n  reference-standard accuracy jointly without a perfect gold standard, at the cost of strong conditional-independence\n  assumptions that are easy to violate (two ICD-based references that share the same coding error are not independent).\n\n**When to use.** Any analysis where a study variable is defined by a data-derived algorithm and measurement error could\nchange the conclusion: the primary outcome of a comparative safety/effectiveness study, the exposure phenotype, a\nsubgroup-defining comorbidity, or a regulator/HTA-facing endpoint. 
Validate the *specific* algorithm you will run\n(operators, time windows, code versions), in a sample drawn from the *same* source and era, with a reference standard\nappropriate to the construct.\n\n**When NOT to use / when it is actively misleading or dangerous.**\n- **Transporting PPV across populations or eras.** Moving a PPV from a hospitalized validation sample to an ambulatory\n  cohort, or from ICD-9 to ICD-10 coding, or from a commercial to a Medicare population, ignores the prevalence and\n  coding-practice dependence above. If you must reuse external accuracy, transport **Se/Sp** (more stable) and re-derive\n  PPV at the target prevalence — never reuse PPV directly.\n- **Non-representative validation sample (selection on the algorithm or on adjudicability).** If chartable patients differ\n  systematically from the analytic cohort (e.g., only large integrated systems share notes), the estimated Se/Sp do not\n  apply to the population you will correct. This is selection bias dressed up as validation.\n- **Imperfect or absent gold standard treated as truth.** Using one claims rule to \"validate\" another, or an adjudication\n  committee with modest inter-rater agreement, anchors the correction to a biased target. Report kappa, and use\n  latent-class methods when no clean reference exists.\n- **Assuming non-differentiality without checking.** Surveillance, contact frequency, and coding intensity often differ by\n  exposure. A non-differential matrix correction applied to genuinely differential misclassification can amplify bias.\n  Validate Se/Sp *within exposure strata* when feasible, or carry differential scenarios in the PBA.\n- **PPV-only validation when you need bias direction.** A high PPV says flagged cases are real; it says nothing about the\n  cases you missed. 
With low sensitivity and a rate or absolute-risk estimand, PPV alone cannot tell you which way, or how\n  far, your estimate is off.\n\n**Data-source operational depth.**\n- **Claims (FFS vs MA vs commercial):** Algorithms are built from ICD diagnosis codes, CPT/HCPCS procedures, NDCs, and\n  place-of-service. The reference standard usually requires linkage to charts or a registry that the closed claims world\n  does not contain, so PPV-only designs dominate. Real failure modes: (a) **Medicare Advantage encounter data are\n  under-captured** relative to fee-for-service, so an algorithm validated in FFS over-states sensitivity when applied to\n  MA person-time; restrict validation and analysis to comparable benefit types or validate within each. (b) **ICD-9 to\n  ICD-10 transition (Oct 2015)** broke many code lists; a pre-2015 PPV does not carry forward. (c) **Diagnosis codes entered\n  for a work-up rather than confirmed disease** (e.g., \"rule-out MI\" coded then refuted) inflate false positives; requiring\n  an inpatient primary-position code plus a confirmatory lab or procedure (e.g., troponin testing or revascularization)\n  raises PPV at the cost of sensitivity.\n- **EHR:** Richer substrate (labs, vitals, notes) supports phenotype algorithms and even NLP, and notes can serve as part\n  of the reference standard. However, EHR capture is **visit-driven and system-bounded**: care delivered outside the system\n  is invisible, so a \"no event\" can be unobserved rather than true. This lowers the algorithm's true sensitivity, and a\n  reference standard drawn from the same system will not reveal the shortfall. 
Local coding habits make\n  EHR-derived Se/Sp poorly portable across institutions.\n- **Registry:** Often the *reference standard* itself (adjudicated stage, confirmed diagnosis) rather than the thing being\n  validated; linking claims/EHR to a disease or death registry is the standard way to estimate Se/Sp because the registry\n  captures truth the claims source cannot.\n- **Linked claims-EHR-vital-records:** Lets you adjudicate algorithm-negatives (needed for Se/Sp) using the partner source,\n  but introduces **linkage selection** (only the linkable subset is validated) and date-discrepancy issues; validate the\n  linkage rate and check that linked patients resemble the analytic cohort before generalizing the operating characteristics.\n\n**Worked claims example (outcome algorithm for acute MI, with correction).** Goal: validate and correct a claims algorithm\nfor incident acute myocardial infarction (AMI) in a comparative drug-safety cohort. (1) **Algorithm:** an inpatient claim\nwith an ICD-10-CM I21.x code in the **primary** position during a stay of >=1 day, with no AMI code in the 365-day\nwashout (incident). (2) **Eligibility for the analytic cohort:** age >=18 and 365 days of continuous medical enrollment\n(FFS A/B, or commercial) before `index_date`, so absence of a prior AMI code is observed, not missing. (3) **Sampling for\nvalidation:** because we need a risk-ratio correction, draw a stratified random sample of **both** algorithm-positive and\nalgorithm-negative person-records (weighting the rare positives up), within the linked subset where charts are obtainable.\n(4) **Reference standard:** chart adjudication using troponin dynamics and the universal MI definition; record `truth_flag`.\n(5) **2x2 and metrics:** from `algorithm_flag` x `truth_flag` compute PPV with an exact (Clopper-Pearson) 95% CI; from the\nweighted positive *and* negative samples compute Se and Sp. 
(6) **Correction:** if Se=0.78, Sp=0.997 and the raw algorithm\ncounted A_obs cases among the N people in an exposure arm, the matrix-corrected true count is\nA_true = (A_obs - (1 - Sp) * N) / (Se - (1 - Sp)). For illustration, A_obs = 150 and N = 10,000 give\n(150 - 0.003 * 10,000) / (0.78 - 0.003) = 120 / 0.777, or about 154 true cases. Apply the correction separately within each\nexposure arm and propagate uncertainty with a probabilistic bias analysis drawing Se/Sp from Beta distributions anchored\nto the validation 2x2. (7) **Report** per RECORD: the code list with versions, the sampling frame, the reference standard,\nthe 2x2 counts, all four operating characteristics with CIs, and the corrected versus uncorrected effect estimate.",
    "primary_category": "Study_Design",
    "tags": [
      "algorithm-validation",
      "positive-predictive-value",
      "sensitivity-specificity",
      "outcome-misclassification",
      "phenotyping",
      "quantitative-bias-analysis",
      "reference-standard",
      "chart-review"
    ],
    "applies_to_study_types": [
      "algorithm_validation"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1002/pds.2321",
        "url": "https://doi.org/10.1002/pds.2321",
        "citation_text": "Carnahan RM, Moores KG. Mini-Sentinel's systematic reviews of validated methods for identifying health outcomes using administrative and claims data: methods and lessons learned. Pharmacoepidemiology and Drug Safety. 2012;21(S1):82-89.",
        "year": 2012,
        "authors_short": "Carnahan & Moores",
        "notes": "Canonical framework for what a claims/EHR algorithm validation must report (PPV, sensitivity, specificity, reference standard) and the recurring pitfalls across the Mini-Sentinel validation program."
      },
      {
        "role": "explain",
        "doi": "10.1093/ije/dyu149",
        "url": "https://doi.org/10.1093/ije/dyu149",
        "citation_text": "Lash TL, Fox MP, MacLehose RF, Maldonado G, McCandless LC, Greenland S. Good practices for quantitative bias analysis. International Journal of Epidemiology. 2014;43(6):1969-1985.",
        "year": 2014,
        "authors_short": "Lash et al.",
        "notes": "Authoritative guidance on turning validation parameters (Se/Sp) into a corrected effect estimate via deterministic matrix correction and probabilistic bias analysis, including the differential-vs-non-differential distinction."
      },
      {
        "role": "explain",
        "doi": "10.1002/pds.5601",
        "url": "https://doi.org/10.1002/pds.5601",
        "citation_text": "Lanes S, Beachler DC. Validation to correct for outcome misclassification bias. Pharmacoepidemiology and Drug Safety. 2023;32(6):700-703.",
        "year": 2023,
        "authors_short": "Lanes & Beachler",
        "notes": "Concise statement of how a validation substudy supplies the Se/Sp needed to correct outcome misclassification in the effect estimate, and why PPV alone is insufficient for bias correction."
      },
      {
        "role": "explain",
        "doi": "10.1371/journal.pmed.1001885",
        "url": "https://doi.org/10.1371/journal.pmed.1001885",
        "citation_text": "Benchimol EI, Smeeth L, Guttmann A, et al. The REporting of studies Conducted using Observational Routinely-collected health Data (RECORD) Statement. PLOS Medicine. 2015;12(10):e1001885.",
        "year": 2015,
        "authors_short": "Benchimol et al.",
        "notes": "Reporting standard requiring explicit description of code lists, validation status, and the accuracy of algorithm-defined variables in routinely-collected-data studies."
      },
      {
        "role": "demonstrate",
        "doi": "10.1186/s12913-018-3727-0",
        "url": "https://doi.org/10.1186/s12913-018-3727-0",
        "citation_text": "Ando T, Ooba N, Mochizuki M, et al. Positive predictive value of ICD-10 codes for acute myocardial infarction in Japan: a validation study at a single center. BMC Health Services Research. 2018;18(1):895.",
        "year": 2018,
        "authors_short": "Ando et al.",
        "notes": "Worked applied PPV validation of an ICD-10 claims algorithm for AMI against clinical/angiographic reference, illustrating chart-confirmed PPV estimation and its dependence on the code-position rule."
      }
    ],
    "plain_language_summary": "An algorithm validation study checks whether a rule written in claims or electronic health record data actually identifies the right patients. To find cases of a disease in insurance data, researchers write a rule — for example, \"two or more diagnosis codes for Type 2 diabetes within any 12-month period\" — but that rule will sometimes flag patients who don't truly have the disease and miss patients who do. A validation study solves this by pulling a random sample of flagged and un-flagged records, having clinicians review the actual charts to confirm the true diagnosis, and then measuring how often the rule agreed with the chart. The result is a report card — numbers like sensitivity, specificity, and positive predictive value — that tells every future researcher exactly how trustworthy the rule is before they use it to study treatment effects or costs.",
    "key_terms": [
      {
        "term": "algorithm",
        "definition": "A written rule applied to data fields — such as diagnosis codes, procedure codes, or lab results — that labels each patient as a case or a non-case of a condition."
      },
      {
        "term": "gold standard",
        "definition": "The most reliable available confirmation of a patient's true status, almost always a clinician reading the actual medical chart, because the chart contains details the billing data does not capture."
      },
      {
        "term": "positive predictive value (PPV)",
        "definition": "Of all the patients the algorithm flagged as cases, the fraction who truly have the condition — a measure of how much you can trust a positive flag."
      },
      {
        "term": "sensitivity",
        "definition": "Of all the patients who truly have the condition, the fraction the algorithm successfully caught — a measure of how many real cases the rule finds."
      },
      {
        "term": "specificity",
        "definition": "Of all the patients who truly do not have the condition, the fraction the algorithm correctly left un-flagged — a measure of how rarely the rule fires on healthy patients."
      }
    ],
    "worked_example": {
      "scenario": "A research team is building a cohort of patients with Type 2 diabetes using an insurance claims database. Their algorithm flags any patient who has at least two outpatient diagnosis codes for Type 2 diabetes recorded on different dates. Before using this cohort to compare treatments, they need to know how accurate the algorithm is. They draw a random sample of 200 patient records — 100 the algorithm flagged as diabetes cases and 100 it did not flag — and have a clinician review each chart to record the true diagnosis.",
      "dataset": {
        "caption": "Results of chart review for the 200 sampled records, organized as a 2x2 table of algorithm decision versus gold-standard chart finding.",
        "columns": [
          "",
          "Gold standard +",
          "Gold standard -"
        ],
        "rows": [
          [
            "Algorithm +",
            "TP = 80",
            "FP = 20"
          ],
          [
            "Algorithm -",
            "FN = 20",
            "TN = 80"
          ]
        ]
      },
      "steps": [
        "Read the table: TP (true positives) = 80 patients the algorithm flagged who the chart confirmed do have diabetes; FP (false positives) = 20 patients the algorithm flagged who the chart showed do not have diabetes; FN (false negatives) = 20 patients the algorithm missed who the chart showed do have diabetes; TN (true negatives) = 80 patients the algorithm did not flag and the chart confirmed do not have diabetes.",
        "Compute PPV — ask 'of the 100 patients the algorithm flagged, how many are real cases?': PPV = TP / (TP + FP) = 80 / (80 + 20) = 80 / 100 = 0.80.",
        "Compute sensitivity — ask 'of the 100 patients who truly have diabetes, how many did the algorithm catch?': Sensitivity = TP / (TP + FN) = 80 / (80 + 20) = 80 / 100 = 0.80.",
        "Compute specificity — ask 'of the 100 patients who truly do not have diabetes, how many did the algorithm correctly leave un-flagged?': Specificity = TN / (TN + FP) = 80 / (80 + 20) = 80 / 100 = 0.80.",
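        "Notice that each formula uses a different denominator: PPV divides by everyone the algorithm flagged (100); sensitivity divides by everyone who truly has the disease (100); specificity divides by everyone who truly does not have the disease (100). In real studies these three denominators are usually different numbers, which is why the three metrics can diverge dramatically. Also note that this sample was drawn by algorithm status (100 flagged, 100 un-flagged), so PPV can be read directly from the table, but sensitivity and specificity match the full database only if flagged patients make up about half of it; otherwise each row must be weighted back to its share of the database before computing them."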
      ],
      "result": "PPV = 0.80 (80 of 100 flagged patients are true cases), Sensitivity = 0.80 (the algorithm found 80 of the 100 true diabetes patients in the sample), Specificity = 0.80 (the algorithm correctly left un-flagged 80 of the 100 true non-cases). All three are 0.80 in this balanced example; in practice, a claims diabetes algorithm often has high PPV (few false positives) but lower sensitivity (missing patients who see out-of-network providers or pay out of pocket)."
    },
    "prerequisites": [
      "outcome-algorithm-construction-rwe",
      "sensitivity-specificity-rwe",
      "diagnostic-accuracy"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Internal (study-specific) validation",
        "description": "Validate the exact algorithm in a sample drawn from the same data source, era, and population as the analytic cohort, so the operating characteristics apply directly to the inference.",
        "edge_cases": [
          "Requires IRB-permitted access to charts or a linked reference standard, which may not exist in closed claims.",
          "Small validation samples yield wide CIs on Se/Sp that dominate the corrected-estimate uncertainty."
        ],
        "data_source_notes": "claims: usually needs linkage to charts/registry for adjudication; ehr: notes can serve as part of the reference standard within the same system."
      },
      {
        "name": "External validation borrowed as a sensitivity anchor",
        "description": "Reuse Se/Sp (never PPV) from a published validation of a similar algorithm to anchor a bias analysis when an internal study is infeasible.",
        "edge_cases": [
          "Borrowed Se/Sp are defensible only when source and target populations have similar coding practices and case mix; a PPV transported directly is wrong whenever target prevalence differs from the source.",
          "Code-version drift (ICD-9 to ICD-10) can invalidate borrowed parameters."
        ],
        "data_source_notes": "Re-derive target PPV from borrowed Se/Sp and the target prevalence rather than importing PPV."
      },
      {
        "name": "PPV-only design with stratified sampling",
        "description": "Sample and adjudicate only algorithm-positive records (often stratified by code subtype or care setting) to estimate PPV efficiently when confirmed-case identification is the goal.",
        "edge_cases": [
          "Cannot estimate sensitivity or specificity, so cannot support a rate/effect-estimate correction.",
          "Stratified sampling requires sampling-weighted PPV; an unweighted estimate is biased toward over-sampled strata."
        ],
        "data_source_notes": "claims: stratify by primary vs secondary code position and inpatient vs outpatient to expose PPV heterogeneity that a pooled estimate hides."
      },
      {
        "name": "Full Se/Sp design (positives and sampled negatives)",
        "description": "Adjudicate a weighted sample of both algorithm-positive and algorithm-negative records so all four operating characteristics are estimable and a matrix/probabilistic correction is possible.",
        "edge_cases": [
          "Algorithm-negatives are numerous and mostly true negatives, so naive sampling is inefficient; over-sample suspected cases.",
          "Differential capture by exposure requires estimating Se/Sp within exposure strata."
        ],
        "data_source_notes": "linked: the partner source (registry/EHR) is what makes negative-record adjudication feasible."
      },
      {
        "name": "Probabilistic vs deterministic bias correction",
        "description": "Either plug point Se/Sp into a matrix correction (deterministic) or draw Se/Sp from their sampling distributions and simulate the corrected estimate (probabilistic bias analysis).",
        "edge_cases": [
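          "Deterministic correction yields impossible (negative) corrected cell counts when the observed case proportion falls below (1 - Sp), and becomes unstable when Se is close to (1 - Sp); both are more likely when Se/Sp are imprecisely estimated.",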
          "PBA requires a defensible parameterization (e.g., Beta priors from the validation 2x2) and enough draws for stable tails."
        ],
        "data_source_notes": "Report both the uncorrected and corrected effect estimates with their intervals."
      },
      {
        "name": "Latent-class / no-gold-standard validation",
        "description": "Jointly estimate algorithm and reference-standard accuracy when no perfect gold standard exists, using a latent-class or Bayesian model.",
        "edge_cases": [
          "Conditional-independence between tests is often violated when both share a coding source, biasing accuracy estimates.",
          "Identifiability typically needs multiple imperfect references or informative priors."
        ],
        "data_source_notes": "Useful when chart review or adjudication is itself imperfect and treating it as truth would bias Se/Sp."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Borrowing an external PPV without internal validation",
        "pros_of_this": "Operating characteristics are calibrated to the actual data source, era, and population, removing the transport assumption that makes a borrowed PPV unreliable.",
        "cons_of_this": "Requires charts/linkage, adjudicators, and IRB approval; small samples give imprecise Se/Sp.",
        "when_to_prefer": "Primary endpoints in regulatory or HTA submissions, or whenever target prevalence/coding differ from any available external source."
      },
      {
        "compared_to": "PPV-only validation",
        "pros_of_this": "A full Se/Sp design supplies the parameters needed to correct an effect estimate and to determine the direction of misclassification bias.",
        "cons_of_this": "Far more expensive because algorithm-negatives must also be adjudicated.",
        "when_to_prefer": "When the validated variable is an outcome or exposure in a comparative analysis rather than a case-finding filter."
      },
      {
        "compared_to": "Deterministic matrix correction",
        "pros_of_this": "Probabilistic bias analysis combines random and systematic uncertainty in one simulation interval and shows how often the Se/Sp draws imply impossible corrected counts, which a single point correction cannot.",
        "cons_of_this": "Heavier to specify and communicate; results depend on the chosen Se/Sp priors.",
        "when_to_prefer": "Small validation samples or any correction that must withstand regulatory/peer scrutiny."
      },
      {
        "compared_to": "Assuming a perfect gold standard",
        "pros_of_this": "Latent-class/no-gold-standard models avoid anchoring Se/Sp to an imperfect reference.",
        "cons_of_this": "Strong conditional-independence assumptions and identifiability constraints that are easy to violate.",
        "when_to_prefer": "When chart review or adjudication has only modest agreement and treating it as truth would bias accuracy."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Algorithm = ICD/CPT/HCPCS/NDC rule with code position, care setting, and time windows; reference standard usually needs chart/registry linkage, so PPV-only designs dominate. Restrict validation and analysis to comparable benefit types (FFS vs Medicare Advantage encounter capture differs); honor the ICD-9/ICD-10 break; prefer primary-position inpatient codes plus a confirmatory procedure/lab to raise PPV.",
      "ehr": "Labs, vitals, and notes enrich phenotypes and can serve as part of the reference standard, but visit-driven, system-bounded capture means \"no event\" may be unobserved; institution-specific coding limits portability of Se/Sp.",
      "registry": "Typically the reference standard itself (adjudicated diagnosis/stage/death); link claims/EHR to the registry to estimate sensitivity and specificity, not just PPV.",
      "linked": "Linkage lets you adjudicate algorithm-negatives needed for Se/Sp, but introduces linkage selection and date discrepancies; validate the linkage rate and confirm linked patients resemble the analytic cohort."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\nfrom scipy import stats\n\ndef _arrays(val):\n    # Flags and sampling weights as NumPy arrays; weights default to 1 (simple random sample).\n    a = np.asarray(val[\"algorithm_flag\"]); t = np.asarray(val[\"truth_flag\"])\n    w = np.asarray(val.get(\"sample_weight\", np.ones(len(a))), dtype=float)\n    return a, t, w\n\ndef _weighted_prop(w, in_num, in_den):\n    # Weighted proportion and Kish effective sample size, (sum w)^2 / sum(w^2), of the denominator group.\n    wd = w[in_den]\n    if wd.sum() == 0:\n        return 0.0, 0.0   # empty group: n_eff = 0 signals \"no information\" to callers\n    return w[in_num & in_den].sum() / wd.sum(), wd.sum() ** 2 / (wd ** 2).sum()\n\ndef operating_characteristics(val):\n    # 2x2 of algorithm_flag (rows) x truth_flag (cols). Pass sample_weight if a stratified sample was drawn.\n    a, t, w = _arrays(val)\n\n    def ci(in_num, in_den):\n        # Clopper-Pearson interval on the effective sample size: exact for unit weights, an approximation\n        # for weighted samples (using raw weighted totals as counts would give CIs that are far too narrow).\n        est, n_eff = _weighted_prop(w, in_num, in_den)\n        if n_eff == 0:\n            return (np.nan, np.nan, np.nan)\n        x = est * n_eff\n        lo = stats.beta.ppf(0.025, x, n_eff - x + 1) if x > 0 else 0.0\n        hi = stats.beta.ppf(0.975, x + 1, n_eff - x) if x < n_eff else 1.0\n        return (est, lo, hi)\n\n    return {\n        \"PPV\": ci(t == 1, a == 1),   # NOT transportable: depends on target prevalence\n        \"NPV\": ci(t == 0, a == 0),\n        \"Se\":  ci(a == 1, t == 1),   # approximately transportable across populations\n        \"Sp\":  ci(a == 0, t == 0),\n    }\n\ndef matrix_correct_count(a_obs, n_total, se, sp):\n    # Back-calculate the true number of cases from the algorithm-counted total under a given Se/Sp.\n    # Apply this WITHIN each exposure arm (with arm-specific Se/Sp if misclassification is differential).\n    denom = se - (1.0 - sp)\n    if denom <= 0:\n        raise ValueError(\"Se must exceed (1 - Sp); algorithm has no discriminating value.\")\n    a_true = (a_obs - (1.0 - sp) * n_total) / denom\n    return max(a_true, 0.0)   # negative (impossible) corrected counts are clipped to 0\n\ndef pba_corrected_count(a_obs, n_total, val, n_draws=20000, seed=1):\n    # Probabilistic bias analysis: draw Se/Sp from Beta posteriors (Jeffreys prior) anchored to the\n    # validation 2x2, propagate into the corrected case count, and summarize the simulation interval.\n    # Weighted proportions scaled to the effective sample size respect a stratified sampling design;\n    # with unit weights they reduce to the raw TP/FN and TN/FP counts.\n    rng = np.random.default_rng(seed)\n    a, t, w = _arrays(val)\n    se_hat, n_se = _weighted_prop(w, a == 1, t == 1)\n    sp_hat, n_sp = _weighted_prop(w, a == 0, t == 0)\n    se_draws = rng.beta(se_hat * n_se + 0.5, (1 - se_hat) * n_se + 0.5, n_draws)\n    sp_draws = rng.beta(sp_hat * n_sp + 0.5, (1 - sp_hat) * n_sp + 0.5, n_draws)\n    # Draws with Se <= (1 - Sp) carry no discriminating information and are dropped.\n    out = np.array([matrix_correct_count(a_obs, n_total, s, p)\n                    for s, p in zip(se_draws, sp_draws)\n                    if (s - (1.0 - p)) > 0])\n    return {\"median\": float(np.median(out)),\n            \"ci95\": (float(np.percentile(out, 2.5)), float(np.percentile(out, 97.5)))}",
        "description": "Operating characteristics + misclassification correction from a validation table. Required input:\n  val : adjudicated validation sample -> person_id, algorithm_flag (0/1), truth_flag (0/1)\n        [optional] sample_weight  # inverse sampling probability for stratified/weighted PPV-Se-Sp\nComputes PPV/Se/Sp/NPV with exact (Clopper-Pearson) CIs, then a deterministic matrix correction of the\nalgorithm-counted case total in the FULL analytic cohort, and a probabilistic bias analysis (PBA) interval.",
        "dependencies": [
          "numpy",
          "scipy"
        ],
        "source_citations": [
          "carnahan-2012",
          "lash-2014"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "op_char <- function(val) {\n  w <- if (!is.null(val$sample_weight)) val$sample_weight else rep(1, nrow(val))\n  a <- val$algorithm_flag; t <- val$truth_flag\n  tp <- sum(w[a == 1 & t == 1]); fp <- sum(w[a == 1 & t == 0])\n  fn <- sum(w[a == 0 & t == 1]); tn <- sum(w[a == 0 & t == 0])\n\n  exact_ci <- function(num, den) {           # Clopper-Pearson exact binomial CI\n    num <- round(num); den <- round(den)\n    if (den == 0) return(c(est = NA, lo = NA, hi = NA))\n    bt <- binom.test(num, den)\n    c(est = num / den, lo = bt$conf.int[1], hi = bt$conf.int[2])\n  }\n  list(PPV = exact_ci(tp, tp + fp),   # NOT transportable across populations\n       NPV = exact_ci(tn, tn + fn),\n       Se  = exact_ci(tp, tp + fn),   # approximately transportable\n       Sp  = exact_ci(tn, tn + fp))\n}\n\nmatrix_correct_count <- function(a_obs, n_total, se, sp) {\n  denom <- se - (1 - sp)\n  if (denom <= 0) stop(\"Se must exceed (1 - Sp); algorithm does not discriminate.\")\n  max((a_obs - (1 - sp) * n_total) / denom, 0)\n}\n\npba_corrected_count <- function(a_obs, n_total, val, n_draws = 20000L, seed = 1L) {\n  set.seed(seed)\n  a <- val$algorithm_flag; t <- val$truth_flag\n  tp <- sum(a == 1 & t == 1); fn <- sum(a == 0 & t == 1)\n  tn <- sum(a == 0 & t == 0); fp <- sum(a == 1 & t == 0)\n  se <- rbeta(n_draws, tp + 0.5, fn + 0.5)   # Jeffreys prior on Se\n  sp <- rbeta(n_draws, tn + 0.5, fp + 0.5)   # Jeffreys prior on Sp\n  keep <- (se - (1 - sp)) > 0\n  out <- mapply(matrix_correct_count, se = se[keep], sp = sp[keep],\n                MoreArgs = list(a_obs = a_obs, n_total = n_total))\n  list(median = median(out),\n       ci95 = quantile(out, c(0.025, 0.975), names = FALSE))\n}",
        "description": "Operating characteristics + matrix/probabilistic misclassification correction. Required input:\n  val : data.frame with algorithm_flag (0/1), truth_flag (0/1), optionally sample_weight\nPPV/Se/Sp/NPV use exact (Clopper-Pearson) intervals; the correction back-calculates the true case\ncount from the algorithm-counted total and is applied within exposure arm for differential error.",
        "dependencies": [
          "stats"
        ],
        "source_citations": [
          "carnahan-2012",
          "lash-2014"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* PPV = P(truth=1 | algorithm=1): restrict to algorithm-positive, exact CI on the proportion true. */\nproc freq data=work.val(where=(algorithm_flag=1));\n  tables truth_flag / binomial(level='1') alpha=0.05;   /* PPV; exact (Clopper-Pearson) CI via EXACT stmt */\n  exact binomial;\nrun;\n\n/* Sensitivity = P(algorithm=1 | truth=1): restrict to true cases. */\nproc freq data=work.val(where=(truth_flag=1));\n  tables algorithm_flag / binomial(level='1') alpha=0.05;\n  exact binomial;\nrun;\n\n/* Specificity = P(algorithm=0 | truth=0): restrict to true non-cases. */\nproc freq data=work.val(where=(truth_flag=0));\n  tables algorithm_flag / binomial(level='0') alpha=0.05;\n  exact binomial;\nrun;\n\n/* Full 2x2 with agreement (kappa) -- gauges how trustworthy the reference standard is. */\nproc freq data=work.val;\n  tables algorithm_flag*truth_flag / agree;\nrun;\n\n/* Matrix correction of the algorithm-counted case total in the analytic cohort.            */\n/* Plug in Se/Sp point estimates from the PROC FREQ output; apply WITHIN each exposure arm   */\n/* when misclassification may be differential.                                               */\n%let se = 0.78;        /* sensitivity from validation */\n%let sp = 0.997;       /* specificity from validation */\ndata corrected;\n  a_obs   = 1200;      /* algorithm-counted cases in the arm */\n  n_total = 50000;     /* persons at risk in the arm        */\n  denom = &se - (1 - &sp);\n  /* Guard: skip the correction when the algorithm does not discriminate (avoids a zero or negative divisor). */\n  if denom <= 0 then put 'ERROR: Se must exceed (1 - Sp).';\n  else do;\n    a_true = max((a_obs - (1 - &sp) * n_total) / denom, 0);\n    put 'Corrected true case count = ' a_true;\n  end;\nrun;",
        "description": "Validation operating characteristics in SAS. Required input dataset:\n  work.val : person_id, algorithm_flag (0/1), truth_flag (0/1)\nPROC FREQ supplies exact binomial CIs for PPV/Se/Sp and KAPPA for reference-standard agreement; a short\ndata step applies the matrix correction to the algorithm-counted case total in the analytic cohort.\n(Use PROC FREQ here, not a PS/matching proc -- this is a measurement-accuracy task, not estimation.)",
        "dependencies": [],
        "source_citations": [
          "carnahan-2012",
          "lash-2014"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Algo[Apply algorithm to claims/EHR<br/>code list + position + time windows] --> Frame[Define sampling frame]\n  Frame --> SampP[Sample algorithm-POSITIVE records]\n  Frame --> SampN[Sample algorithm-NEGATIVE records<br/>weighted; required for Se/Sp]\n  SampP --> Ref[Reference standard<br/>chart review / registry linkage / adjudication]\n  SampN --> Ref\n  Ref --> Tab[2x2: algorithm_flag x truth_flag]\n  Tab --> Met[PPV, NPV, Se, Sp + exact CIs<br/>report kappa for reference quality]\n  Met --> Corr[Bias correction: matrix or probabilistic<br/>propagate Se/Sp into the effect estimate]\n  Corr --> Out[Corrected estimand + uncorrected estimand<br/>reported per RECORD]",
        "caption": "Validation-to-correction workflow. A PPV-only design samples positives only; estimating sensitivity and specificity (needed to correct an effect estimate) requires adjudicating sampled algorithm-negatives as well.",
        "alt_text": "Flowchart from applying the algorithm, through sampling positives and negatives, reference-standard adjudication, the 2x2 table, operating characteristics with confidence intervals, and bias correction of the effect estimate.",
        "source_type": "illustrative",
        "source_citations": [
          "carnahan-2012",
          "lash-2014"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Q{Misclassification<br/>differential by exposure?} -->|No: Se/Sp equal across arms| ND[Non-differential]\n  Q -->|Yes: Se/Sp differ by arm| D[Differential]\n  ND --> NDe[\"With Sp < 1, risk ratio is biased<br/>TOWARD THE NULL (predictable)\"]\n  D --> De[\"Direction and magnitude UNPREDICTABLE;<br/>a non-differential correction can worsen bias\"]\n  NDe --> Fix1[Correct with one Se/Sp set<br/>matrix or probabilistic]\n  De --> Fix2[Estimate Se/Sp WITHIN exposure strata<br/>or carry differential scenarios in PBA]",
        "caption": "Why the misclassification type drives the correction. Non-differential outcome misclassification with imperfect specificity attenuates the risk ratio; differential misclassification has no guaranteed direction, so a non-differential correction can move the estimate further from the truth.",
        "alt_text": "Decision diagram contrasting non-differential misclassification (predictable null bias, single correction) with differential misclassification (unpredictable direction, requires stratum-specific Se/Sp or scenario analysis).",
        "source_type": "illustrative",
        "source_citations": [
          "lash-2014"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "see_also",
        "target_slug": "claims-outcome-algorithm-ppv-sensitivity-rwe",
        "notes": "Companion concept focused on building and reporting the claims outcome algorithm whose operating characteristics this validation estimates."
      },
      {
        "relation_type": "produces",
        "target_slug": "misclassification-bias-correction-rwe",
        "notes": "The Se/Sp/PPV estimated here are the inputs to misclassification (matrix/probabilistic) correction of the effect estimate."
      },
      {
        "relation_type": "used_with",
        "target_slug": "quantitative-bias-analysis-toolkit-rwe",
        "notes": "Validation parameters feed deterministic and probabilistic bias analysis that propagate measurement error into the estimand."
      },
      {
        "relation_type": "used_with",
        "target_slug": "endpoint-adjudication-chart-review-rwe",
        "notes": "Chart review or an adjudication committee is the usual reference standard against which the algorithm is validated."
      },
      {
        "relation_type": "requires",
        "target_slug": "outcome-algorithm-construction-rwe",
        "notes": "An explicit, versioned outcome algorithm (codes, positions, time windows) must exist before its accuracy can be validated."
      },
      {
        "relation_type": "see_also",
        "target_slug": "external-adjustment-validation-substudy-bias-correction-rwe",
        "notes": "A validation substudy is the canonical external-adjustment source for correcting misclassification in the main study."
      },
      {
        "relation_type": "see_also",
        "target_slug": "diagnostic-accuracy",
        "notes": "Shares the 2x2 sensitivity/specificity/PPV machinery; algorithm validation applies it to data-derived phenotypes rather than clinical tests."
      },
      {
        "relation_type": "see_also",
        "target_slug": "ehr-phenotyping-algorithms-rwe",
        "notes": "EHR phenotype algorithms (including NLP) are a major target of validation, with notes serving as part of the reference standard."
      }
    ],
    "aliases": [
      "algorithm validation",
      "claims algorithm validation",
      "outcome algorithm validation",
      "phenotype validation",
      "validation study"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "all-cause-vs-attributable-costs-rwe",
    "name": "All-Cause vs Attributable Costs",
    "short_definition": "The choice between summing every cost a patient incurs (all-cause) versus isolating only the costs caused by or assignable to a specific disease, treatment, or event (attributable), where attribution is operationalized by diagnosis coding, validated algorithms, episode windows, or an incremental comparison against a matched or modeled counterfactual.",
    "long_description": "**All-cause** and **attributable** costs answer two different economic questions about the same\npatients, and conflating them is one of the most common and most consequential errors in claims-\nand EHR-based HEOR. **All-cause cost** is the sum of the paid or allowed amount on *every* claim —\nmedical and pharmacy, related or unrelated — accruing during a patient's observation window. It\nmeasures the total economic footprint of the patient (the payer's total cost of care) and is the\nright numerator for budget impact, total-cost-of-care contracts, and any question where spillover\nonto comorbidity, complications, or downstream utilization is part of the value story.\n**Attributable cost** is the subset of spend caused by, or assignable to, the index condition or\nintervention. It is the numerator for cost-of-illness, disease burden, and the cost side of a\ncost-effectiveness or budget-impact model that prices a *specific* disease.\n\n**Core conceptual distinction — six operational definitions, not two.** \"Attributable\" is not a\nsingle thing. Onukwugha et al. (2016) classify the field into six estimation methods, and the\nchoice is an estimand decision that must be pre-specified before any programming:\n(1) *Sum_All Medical* — all-cause; sum everything.\n(2) *Sum_Diagnosis-Specific* — keep only claims carrying a qualifying diagnosis/procedure/NDC\n(the \"direct disease-specific\" or \"top-down\" sum). This is a **descriptive accounting** quantity:\nit answers \"what was spent on claims labelled with this disease,\" not \"what did the disease cause.\"\n(3) *Matched* — mean cost in the diseased/exposed cohort minus mean cost in a demographically and\nclinically matched non-diseased/unexposed cohort. 
The difference is the **excess (incremental)**\ncost.\n(4) *Regression* — model total cost on a disease indicator plus covariates (typically a two-part\nmodel, or a gamma/Tweedie GLM with log link); the adjusted incremental cost is the average marginal\neffect of the indicator (with a log link the raw coefficient is a log cost ratio, not a dollar\ndifference).\n(5)/(6) *Other_Total* and *Other_Incremental* — phase-of-care, prevalence-weighted, or\neconometric variants.\nThe critical fault line: methods (1)-(2) are **bookkeeping** (which dollars carry the right label),\nwhile (3)-(4) are **causal/counterfactual** (how much higher is spend *because of* the disease).\nOnly the matched/regression incremental constructs estimate \"the cost the disease caused,\" and only\nthey are defensible as the cost-of-illness or the cost arm of a comparative economic model.\n\n**Pros, cons, and trade-offs (named alternatives).**\n- **Sum_Diagnosis-Specific (attributable accounting) vs Sum_All Medical (all-cause):** Diagnosis-\n  specific focuses the signal and is trivial to compute from line-level claims, but it systematically\n  *under*-counts (a disease-driven sepsis admission coded only with the sepsis code is dropped from a\n  diabetes attributable sum) *and* mis-attributes (a routine PCP visit that happens to carry the\n  disease code in any position is counted). The fraction of all-cause cost it captures — the\n  \"% attributable\" — is often only 30-60% in chronic disease and should always be reported as a\n  transparency metric. **Prefer all-cause** for total-cost-of-care and budget questions; **prefer\n  diagnosis-specific** only for descriptive disease-spend accounting where you explicitly disclaim a\n  causal reading.\n- **Matched/incremental vs diagnosis-specific:** The incremental approach is the only one that\n  isolates *caused* cost and the only one that captures disease-driven spend that is coded under a\n  different diagnosis (the stroke, MI, or fall that the disease produced). 
Cost: it requires a credible\n  counterfactual, inherits every confounding and positivity problem of any comparative RWE analysis,\n  and is sensitive to the matching/weighting specification and to the cost metric (allowed vs paid).\n  **Prefer incremental** whenever the claim is \"burden caused by the disease\" or \"cost offset of the\n  treatment.\"\n- **Matching vs regression for the incremental estimand:** Matching is transparent and makes the\n  comparison explicit but discards unmatched patients and loses power; a two-part or gamma-GLM\n  regression uses the full sample and handles the cost distribution (mass at zero, right skew)\n  directly but rests on functional-form assumptions. **Report both** when feasible; they should agree\n  in direction and rough magnitude, and a material gap signals poor overlap or a misspecified cost\n  model.\n- **vs HCRU (resource-utilization counts):** Costs add the monetary and intensity weighting that raw\n  counts lack and map directly onto budget and HTA decisions, but dollar attribution is more\n  consequential and more scrutinized than count attribution, and costs move with price and payment-\n  model changes that leave volume unchanged. Apply the *same* attribution rule to HCRU and costs and\n  report both — see `hcru-healthcare-resource-utilization`.\n\n**When to use.** Use **all-cause** for payer total-cost-of-care, budget impact, value-based-contract\nperformance, and any intervention with plausible spillover (e.g., a drug that reduces all-cause\nhospitalization). Use **attributable/incremental** for cost-of-illness, disease burden, the cost arm\nof a CEA/CUA, and any \"cost offset of treating X\" message — and within attributable, use the\n*matched or regression incremental* construct, not the diagnosis-specific sum, whenever the claim is\ncausal.\n\n**When NOT to use / when this is actively misleading.**\n- **Do not report a diagnosis-specific sum as \"the cost of the disease.\"** It is descriptive\n  accounting; presenting it as caused cost overstates precision and is the single most common\n  cost-of-illness (COI) error. 
If a reviewer asks \"compared to what?\", a diagnosis-specific sum has no answer — that is the\n  tell that you needed an incremental design.\n- **Do not use all-cause incremental as a disease-cost estimate when arms differ on unrelated\n  spend.** If the exposed cohort is older/sicker, all-cause incremental folds unrelated comorbidity\n  cost into the estimate; this is the mirror failure of the diagnosis-specific under-count.\n- **Incremental costs with a non-overlapping comparator are uninterpretable.** If the disease/exposure\n  is reserved for sicker patients, positivity fails, matching discards most of the cohort, and the\n  surviving estimand no longer maps to a meaningful population — the same positivity logic as in\n  `active-comparator-new-user`.\n- **Beware circularity:** using the *same* code list to define the cohort and to attribute its costs\n  guarantees a high apparent attributable fraction by construction. Define attribution independently\n  and report sensitivity to the code list.\n- **Differential follow-up and competing risks break naive cost sums.** In elderly claims, the sicker\n  arm dies sooner and therefore accrues *less* cumulative cost — a survivor will out-spend a decedent\n  — so unadjusted mean cumulative cost can run *backwards* to the disease effect. Use cost over a\n  fixed window (e.g., PPPM over observed person-time), or phase-of-care / partitioned-survival costing\n  that handles the truncation, rather than total cost to death.\n\n**Data-source operational depth.**\n- **Claims (FFS):** The reference substrate for cost attribution because every adjudicated line carries\n  a paid/allowed amount, a service date, place of service, and diagnosis/procedure/NDC fields needed\n  for both labelling and summing. 
Failure modes: (a) **Medicare Advantage / capitated person-time has\n  no FFS claim-level dollars** — MA encounter records frequently lack reliable paid amounts, so MA\n  person-time silently *deflates* both all-cause and attributable sums; restrict to FFS (or commercial\n  with full pharmacy benefit) and exclude MA-only spans. (b) **Adjudication lag and claim reversals**\n  — pull with a run-out window (commonly 3-6 months) and net out reversed/voided lines before summing,\n  or late and negative claims distort totals. (c) **Bundled/episode payments** — a single bundle\n  payment may be the only true dollar figure while internal service lines show utilization but not\n  granular cost; treat the bundle as the attributable cost and use internal lines only for HCRU. (d)\n  **Diagnosis position** — primary-only is conservative and misses secondary manifestations; any-\n  position over-attributes (rule-out, history, and comorbidity codes). (e) **Medical/pharmacy\n  crossover** — infused drugs bill to the medical benefit as J-codes (HCPCS), not pharmacy NDCs; a\n  pharmacy-only attribution rule will miss them entirely.\n- **EHR:** Has charges, not adjudicated payer cost, and only for care delivered *inside* the system.\n  External care — the main source of \"unrelated\" all-cause cost — is invisible, so all-cause sums are\n  structurally incomplete and attribution to a counterfactual is unreliable without linkage to claims.\n  Charge-to-cost ratios or reference unit costs are required to monetize, and encounter-driven capture\n  means a patient who leaves the system is differentially lost.\n- **Registry:** Strong for clean disease confirmation and severity (sharpening the *labelling* side of\n  attribution) but rarely carries cost; link to claims for the dollar figures and for the full all-\n  cause footprint.\n- **Linked claims-EHR(-registry):** The ideal substrate — EHR/registry severity for credible\n  matching plus claims completeness for the dollars — but linkage 
introduces selection (only the\n  linkable subset) and date-discrepancy issues that must be reconciled before windowing costs.\n\n**Standardization.** Both all-cause and attributable totals are almost always converted to PPPM/PPPY\nusing person-time denominators, and the *same* person-time and enrollment rules must apply to the\nnumerator costs and to any comparator (see `healthcare-costs-pppm-pppy-pmpm`). Pre-specify the cost\nbasis (allowed vs paid), the perspective (payer vs patient out-of-pocket vs both), and any\ninflation-adjustment to a common dollar year.\n\n**Worked claims example.** Question: what is the diabetes-attributable annual medical + pharmacy cost\nin a commercial + Medicare FFS database? (1) **Cohort:** adults with >=2 outpatient or >=1 inpatient\ndiabetes diagnosis (ICD-10 E11.x) and >=365 days of continuous medical + pharmacy FFS enrollment\n(exclude MA-only spans); index_date = first qualifying diagnosis. (2) **Follow-up window:** the 365\ndays from index_date, censoring at disenrollment, death, or data end; require full enrollment in the\nwindow or annualize over observed person-months. (3) **All-cause cost:** sum `paid_amt` over *all*\nmedical and pharmacy claims with `service_date` in the window. (4) **Diagnosis-specific attributable\ncost:** sum `paid_amt` only on medical claims carrying E11.x in any diagnosis position, plus pharmacy\nclaims whose NDC maps to an antidiabetic drug class — and report the attributable fraction\n(attributable / all-cause), which will land well under 100% and is itself a finding. 
(5) **Incremental\n(excess) cost:** draw a non-diabetic comparator from the same database with identical enrollment\nrules, 1:1 match on age band, sex, region, index calendar quarter, and a comorbidity score measured\nin the 365-day baseline, then take mean(all-cause cost | diabetic) - mean(all-cause cost | matched\nnon-diabetic) — this is the defensible \"cost caused by diabetes,\" and it will typically exceed the\ndiagnosis-specific sum because it recaptures the disease-driven spend coded under MI, renal, and\namputation claims. (6) **Robustness:** repeat the incremental estimate with a two-part / gamma-GLM\nregression on the unmatched sample, vary diagnosis position (primary-only vs any), and report PPPM so\nthe estimate is comparable across patients with partial follow-up.\n\n**Interpreting the output**\n\nA researcher reports three cost figures for three diabetic patients versus matched non-diabetic controls:\nall-cause mean cost $19,600 per diabetic patient, diagnosis-specific (diabetes-labelled) mean $7,467\n(attributable fraction 38%), and matched-control incremental cost $9,733.\n\n*(1) Formal interpretation.* The all-cause figure ($19,600) is the mean annual paid cost per\ndiabetic patient regardless of which condition prompted each claim; it is a payer total-cost-of-care\nestimate, not an estimate of what diabetes caused. The diagnosis-specific figure ($7,467) is the\naccounting sum of claims carrying a diabetes label and represents only 38% of all-cause cost — 62% of\nspending is on claims coded for other diagnoses, some of which were caused by diabetes but labelled\nunder downstream complications or comorbidities. It is a descriptive accounting construct, not a causal\nestimate. The incremental figure ($9,733) subtracts the matched non-diabetic mean from the diabetic\nmean, attributing the difference to diabetes; it exceeds the diagnosis-specific sum because it\nrecaptures disease-driven spend labelled under other codes. 
The incremental estimate rests on the\nassumption that matched controls are a credible counterfactual — it inherits all the confounding and\npositivity considerations of any comparative design.\n\n*(2) Practical interpretation.* When a dossier claims \"diabetes costs $X per patient,\" ask which method\nproduced X. A diagnosis-specific $7,467 framing systematically understates burden; the matched\nincremental $9,733 is the defensible figure for cost-of-illness, value arguments, and cost-offset\nmodeling. Report all three for transparency; lead with incremental as the primary estimate.",
    "primary_category": "Health_Economic",
    "tags": [
      "health_economic",
      "cost-of-illness",
      "attributable-costs",
      "incremental-costs",
      "all-cause-costs",
      "claims-cost-estimation",
      "two-part-model",
      "place-of-service"
    ],
    "applies_to_study_types": [
      "claims_analysis",
      "ehr_study",
      "registry_linkage",
      "comparative_effectiveness",
      "cost_of_illness"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1007/s40273-015-0325-4",
        "url": "https://doi.org/10.1007/s40273-015-0325-4",
        "citation_text": "Onukwugha E, McRae J, Kravetz A, Varga S, Khairnar R, Mullins CD. Cost-of-Illness Studies: An Updated Review of Current Methods. PharmacoEconomics. 2016;34(1):43-58.",
        "year": 2016,
        "authors_short": "Onukwugha et al.",
        "notes": "Canonical taxonomy of the six cost-estimation methods (Sum_All Medical, Sum_Diagnosis-Specific, Matched, Regression, Other_Total, Other_Incremental) that operationalizes the all-cause vs attributable vs incremental distinction."
      },
      {
        "role": "explain",
        "doi": "10.2165/00019053-200624090-00005",
        "url": "https://doi.org/10.2165/00019053-200624090-00005",
        "citation_text": "Akobundu E, Ju J, Blatt L, Mullins CD. Cost-of-Illness Studies: A Review of Current Methods. PharmacoEconomics. 2006;24(9):869-890.",
        "year": 2006,
        "authors_short": "Akobundu et al.",
        "notes": "Earlier methods review establishing the sum-all-medical / disease-specific / matched contrasts and the criteria for choosing among them; read with Onukwugha 2016 for the evolution of practice."
      },
      {
        "role": "demonstrate",
        "doi": "10.1016/j.jval.2021.11.1351",
        "url": "https://doi.org/10.1016/j.jval.2021.11.1351",
        "citation_text": "Husereau D, Drummond M, Augustovski F, et al. Consolidated Health Economic Evaluation Reporting Standards 2022 (CHEERS 2022) Statement. Value in Health. 2022;25(1):3-9.",
        "year": 2022,
        "authors_short": "Husereau et al.",
        "notes": "Reporting standard requiring explicit statement of perspective and which costs are included or excluded - the disclosure that the all-cause vs attributable choice exists to satisfy."
      }
    ],
    "plain_language_summary": "When researchers add up what a patient cost the insurance system, they face a choice: count every single dollar spent on that person (all-cause costs), or count only the dollars that are directly tied to one specific disease (disease-attributable costs). All-cause costs give a full picture of the patient's total expense, while attributable costs try to isolate how much the disease itself is responsible for. The cleanest way to find the attributable share is to compare what patients with the disease spent against what a similar group without the disease spent, so that only the difference is credited to the condition.",
    "key_terms": [
      {
        "term": "all-cause costs",
        "definition": "The total amount paid for every medical and pharmacy claim a patient generated during a study period, regardless of what condition prompted each visit or prescription."
      },
      {
        "term": "disease-attributable costs",
        "definition": "The portion of a patient's total spending that can be linked to a specific condition, either by identifying claims that carry that condition's diagnosis codes or by measuring how much more patients with the disease spent compared to similar patients without it."
      },
      {
        "term": "matched control",
        "definition": "A person without the disease who is similar in age, sex, and health complexity to someone with the disease, used as a comparison to estimate what the diseased patient would have spent if they did not have the disease."
      },
      {
        "term": "incremental cost",
        "definition": "The extra spending caused by a disease, calculated by subtracting the average cost of matched controls from the average cost of patients with the disease."
      },
      {
        "term": "attributable fraction",
        "definition": "The percentage of a patient's total spending that is captured by the disease-specific rule, calculated as disease-attributable cost divided by all-cause cost; a useful transparency check because it is often well below 100 percent in chronic disease."
      }
    ],
    "worked_example": {
      "scenario": "A researcher studying type 2 diabetes wants to know both how much diabetic patients cost in total and how much of that cost is caused by diabetes. She identifies three diabetic patients and three matched non-diabetic patients with similar age, sex, and general health. Over one year she adds up each person's claims. She then computes all-cause costs (every dollar), diagnosis-specific attributable costs (only claims labelled with a diabetes code or antidiabetic prescription), and the incremental attributable cost (the average difference between diabetic and matched non-diabetic patients).",
      "dataset": {
        "caption": "Annual claims summary for three diabetic patients and their matched non-diabetic controls. All-cause = every claim; diabetes-labelled = only claims with a diabetes diagnosis code or antidiabetic drug.",
        "columns": [
          "person_id",
          "group",
          "match_id",
          "all_cause_cost_usd",
          "diabetes_labelled_cost_usd"
        ],
        "rows": [
          [
            2001,
            "diabetic",
            1,
            18400,
            7200
          ],
          [
            2002,
            "diabetic",
            2,
            24600,
            9100
          ],
          [
            2003,
            "diabetic",
            3,
            15800,
            6100
          ],
          [
            3001,
            "non-diabetic",
            1,
            9800,
            0
          ],
          [
            3002,
            "non-diabetic",
            2,
            11200,
            0
          ],
          [
            3003,
            "non-diabetic",
            3,
            8600,
            0
          ]
        ]
      },
      "steps": [
        "Average all-cause cost for diabetic patients: (18,400 + 24,600 + 15,800) / 3 = 58,800 / 3 = 19,600 dollars.",
        "Average all-cause cost for matched non-diabetic patients: (9,800 + 11,200 + 8,600) / 3 = 29,600 / 3 = 9,867 dollars (rounded to the nearest dollar).",
        "Incremental (caused) cost = diabetic average minus non-diabetic average: 19,600 - 9,867 = 9,733 dollars per patient per year. This is the best estimate of what diabetes itself cost, because it cancels out spending that any similar patient would have had regardless of diabetes.",
        "Average diagnosis-specific attributable cost (diabetes-labelled claims only): (7,200 + 9,100 + 6,100) / 3 = 22,400 / 3 = 7,467 dollars.",
        "Attributable fraction using the diagnosis-specific method: 7,467 / 19,600 = 0.38, meaning only 38 percent of total diabetic spending carries a diabetes label. The other 62 percent of spending is on claims coded for complications, other conditions, or unrelated care, but some of that spending was still caused by diabetes.",
        "Notice that the diagnosis-specific sum (7,467 dollars) is lower than the incremental estimate (9,733 dollars). The difference exists because some diabetes-driven spending, such as a heart-disease hospitalization caused by diabetes, is coded under the heart-disease diagnosis and would be missed by the label-only rule."
      ],
      "result": "All-cause cost per diabetic patient: 19,600 dollars. Diagnosis-specific attributable cost: 7,467 dollars (attributable fraction = 38%). Incremental (caused) cost versus matched controls: 9,733 dollars. The incremental figure is the more defensible estimate of what diabetes costs the system because it recovers disease-driven spending that is labelled under other diagnosis codes."
    },
    "prerequisites": [
      "hcru-healthcare-resource-utilization",
      "healthcare-costs-pppm-pppy-pmpm",
      "claims-analysis"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Sum_Diagnosis-specific attribution (primary-only vs any-position)",
        "description": "Keep only claims carrying a qualifying diagnosis/procedure code (medical) or NDC (pharmacy); a descriptive accounting of labelled spend, not a causal estimate.",
        "edge_cases": [
          "Primary-only is conservative (drops disease-driven events coded under their proximate diagnosis, e.g., a diabetes-driven MI admission); any-position over-attributes via rule-out, history, and comorbidity codes.",
          "Misses disease-caused spend coded under a different diagnosis entirely - the structural under-count that the incremental approach exists to recover.",
          "Code-list maintenance across ICD-9/10 transitions and annual updates; document the exact version."
        ],
        "data_source_notes": "claims: report primary-only and any-position as a sensitivity pair, and always report the attributable fraction (attributable / all-cause). Use validated published code lists when available."
      },
      {
        "name": "Matched incremental (excess) cost",
        "description": "Mean cost in the diseased/exposed cohort minus mean cost in a demographically and clinically matched non-diseased/unexposed comparator; the difference is the caused cost.",
        "edge_cases": [
          "Residual confounding after matching biases the excess cost; requires rich baseline covariates and genuine overlap (positivity).",
          "Sensitive to the cost metric (allowed vs paid) and to whether all-cause or diagnosis-specific cost is differenced.",
          "Differential mortality/competing risks: the sicker arm dies sooner and accrues less cumulative cost; difference over a fixed window or PPPM rather than cost-to-death."
        ],
        "data_source_notes": "claims: apply identical enrollment and follow-up rules to both arms; match on a baseline comorbidity score measured in a pre-index window; report unadjusted and adjusted incremental, the distribution, and the % zero-cost patients."
      },
      {
        "name": "Regression-based incremental cost (two-part / gamma GLM)",
        "description": "Model total cost on a disease/exposure indicator plus covariates using a two-part model (logistic for any-cost, then gamma/log-OLS for amount) or a gamma/Tweedie GLM with log link; the indicator coefficient is the adjusted incremental cost.",
        "edge_cases": [
          "Mass at zero and heavy right skew violate OLS assumptions; choose the link/family by Park test and modified Park test, and validate with predicted-vs-observed by cost decile.",
          "Retransformation bias if log-OLS is used with heteroskedastic residuals (use Duan smearing or a GLM that needs no retransformation)."
        ],
        "data_source_notes": "claims: uses the full sample (no discarded unmatched patients) but rests on functional-form assumptions; report alongside a matched estimate as a cross-check."
      },
      {
        "name": "Episode / bundle-window attribution",
        "description": "Attribute all (or qualifying) costs occurring inside a pre-defined clinical episode window (e.g., 30/90 days post-procedure, or a full bundle period) to the index event.",
        "edge_cases": [
          "Episode definitions vary by payer/program; cost leakage outside captured claims; overlapping or competing episodes.",
          "In pure bundled data the bundle payment is the only true dollar figure - treat it as the attributable cost and use internal lines for HCRU only."
        ],
        "data_source_notes": "claims: align episode start with the index event date; use payment-model indicators where present."
      },
      {
        "name": "Medical vs pharmacy and place-of-service stratified attribution",
        "description": "Apply the attribution rule separately by benefit (medical vs pharmacy) and by place of service (inpatient, ED, outpatient, pharmacy) and report the splits.",
        "edge_cases": [
          "Infused/specialty drugs bill to the medical benefit as J-codes (HCPCS), not pharmacy NDCs; a pharmacy-only rule misses them.",
          "Site-of-service migration (infusion moving from hospital outpatient to physician office) shifts costs across categories over calendar time."
        ],
        "data_source_notes": "claims: join medical and pharmacy files; map POS/revenue codes; critical for specialty-pharmacy or procedure-heavy conditions."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "All-cause cost (Sum_All Medical, no attribution)",
        "pros_of_this": "Attribution focuses the economic signal on the condition or intervention of interest; the incremental construct supports causal \"cost caused by the disease\" claims.",
        "cons_of_this": "Diagnosis-specific attribution under-counts disease-driven spend coded elsewhere and over-counts incidental labelled claims; the attributable fraction is often only 30-60% in chronic disease and is hard to communicate. All-cause is complete and assumption-light but mixes in unrelated spend.",
        "when_to_prefer": "Prefer attributable/incremental when the question is disease- or treatment-specific (COI, the cost arm of a CEA, \"cost offset\"); prefer all-cause for total-cost-of-care, budget impact, and interventions with plausible spillover."
      },
      {
        "compared_to": "hcru-healthcare-resource-utilization",
        "pros_of_this": "Costs add the monetary and intensity weighting that raw counts lack and map directly onto budget and HTA decisions.",
        "cons_of_this": "Dollar attribution is more consequential and more scrutinized than count attribution, and costs move with price and payment-model changes that volume does not.",
        "when_to_prefer": "Report both; apply the identical attribution rule to HCRU and costs, and use HCRU to explain the drivers behind any cost difference."
      },
      {
        "compared_to": "healthcare-costs-pppm-pppy-pmpm",
        "pros_of_this": "The all-cause vs attributable choice is the upstream estimand decision that determines what the PPPM/PPPY numerator even contains.",
        "cons_of_this": "Standardization (PPPM/PPPY), cost basis (allowed/paid), perspective, and outlier handling are specified in the costs entry - this concept must be read together with it.",
        "when_to_prefer": "Decide the attribution rule first, then apply the standardization and modeling machinery of the costs entry to the resulting series."
      },
      {
        "compared_to": "cost-outlier-handling-rwe",
        "pros_of_this": "Attribution is decided before or together with outlier rules so that a genuinely disease-linked high-cost event is retained or winsorized transparently rather than dropped.",
        "cons_of_this": "Over-attribution can pull in unrelated catastrophic events (a cancer patient's unrelated trauma admission), amplifying outlier problems if any-position labelling is used.",
        "when_to_prefer": "Pre-specify the attribution rule first, then decide outlier handling on the resulting attributable or incremental series."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Reference substrate - every adjudicated line carries paid/allowed amount, service date, POS, and dx/px/NDC. Pull with a 3-6 month run-out window and net out reversed/voided claims before summing. Exclude Medicare Advantage / capitated person-time (no reliable claim-level dollars). Compute all-cause and (diagnosis-specific and/or incremental) attributable in parallel and report the attributable fraction. Stratify by medical vs pharmacy and by POS; capture J-code (HCPCS) drugs on the medical benefit. Use a fixed cost window or PPPM rather than cost-to-death to avoid competing-risk truncation.",
      "ehr": "Charges (not adjudicated payer cost) for in-system care only; external care - the main source of unrelated all-cause cost - is invisible. Monetize charges with charge-to-cost ratios or reference unit costs; linkage to claims is usually required for credible all-cause totals or any incremental attribution.",
      "registry": "Strong for disease confirmation and severity (sharpens the labelling side of attribution) but rarely carries cost; link to claims for the dollar figures and the full all-cause footprint.",
      "linked": "Linked claims-EHR(-registry) is the ideal substrate (severity for matching + claims dollars) but introduces linkage selection and order/fill/service date discrepancies that must be reconciled before windowing costs."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\nimport numpy as np\n\nWINDOW_DAYS = 365\nDM_DX_PREFIX = \"E11\"          # type 2 diabetes, ICD-10\nDM_RX_CLASS = \"ANTIDIABETIC\" # value of ndc_drug_class for antidiabetic fills\n\ndef cost_summary(claims: pd.DataFrame, cohort: pd.DataFrame) -> pd.DataFrame:\n    dx_cols = [c for c in claims.columns if c.startswith(\"dx\")]\n\n    c = claims.merge(cohort[[\"person_id\", \"index_date\"]], on=\"person_id\")\n    # Keep only claims inside the fixed post-index window (avoids competing-risk truncation).\n    in_window = ((c[\"service_date\"] >= c[\"index_date\"]) &\n                 (c[\"service_date\"] <  c[\"index_date\"] + pd.Timedelta(days=WINDOW_DAYS)))\n    c = c.loc[in_window].copy()\n\n    # Diagnosis-specific attribution: E11.x in ANY dx position, OR an antidiabetic NDC class.\n    dx_hit = c[dx_cols].apply(lambda s: s.astype(\"string\").str.startswith(DM_DX_PREFIX, na=False)).any(axis=1)\n    rx_hit = (c[\"benefit\"] == \"RX\") & (c[\"ndc_drug_class\"] == DM_RX_CLASS)\n    c[\"attributable\"] = dx_hit | rx_hit\n\n    per_person = c.groupby(\"person_id\").agg(\n        all_cause=(\"paid_amt\", \"sum\"),\n        attributable=(\"paid_amt\", lambda s: s[c.loc[s.index, \"attributable\"]].sum()),\n    ).reset_index()\n    return cohort.merge(per_person, on=\"person_id\", how=\"left\").fillna({\"all_cause\": 0.0,\n                                                                        \"attributable\": 0.0})\n\ndef attributable_fraction(summary: pd.DataFrame) -> float:\n    # Transparency metric: share of all-cause spend captured by the diagnosis-specific rule.\n    return summary[\"attributable\"].sum() / summary[\"all_cause\"].sum()\n\ndef matched_incremental_cost(summary: pd.DataFrame) -> float:\n    # Excess (caused) cost = mean ALL-CAUSE among diabetics minus their matched non-diabetic controls.\n    paired = summary.dropna(subset=[\"match_id\"])\n    wide = paired.pivot_table(index=\"match_id\", columns=\"diabetic\",\n   
                           values=\"all_cause\", aggfunc=\"mean\")\n    wide.columns = [\"non_diabetic\", \"diabetic\"]            # False, True after pivot ordering\n    return float((wide[\"diabetic\"] - wide[\"non_diabetic\"]).mean())",
        "description": "All-cause vs attributable cost from claims, plus the matched-incremental (excess) cost.\nRequired inputs (already cleaned, de-duplicated, reversals netted out):\n  claims : person_id, service_date (datetime), paid_amt (float),\n           benefit in {'MED','RX'}, dx1..dxN (str, ICD-10), ndc_drug_class (str or NaN)\n  cohort : person_id, index_date (datetime), diabetic (bool), enroll_end (datetime),\n           match_id (int, links one diabetic to one matched non-diabetic; NaN if unmatched)\nCosts are summed over the fixed 365-day post-index window. Attribution here is diagnosis-specific\n(E11.x in ANY position, or an antidiabetic NDC class); the incremental estimate differences\nALL-CAUSE cost across the matched pair, which is the construct that recovers disease-driven spend\ncoded under other diagnoses.",
        "dependencies": [
          "pandas",
          "numpy"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\nWINDOW_DAYS  <- 365L\nDM_DX_PREFIX <- \"E11\"\nDM_RX_CLASS  <- \"ANTIDIABETIC\"\n\ncost_summary <- function(claims, cohort) {\n  setDT(claims); setDT(cohort)\n  dx_cols <- grep(\"^dx\", names(claims), value = TRUE)\n\n  c <- merge(claims, cohort[, .(person_id, index_date)], by = \"person_id\")\n  c <- c[service_date >= index_date &\n         service_date <  index_date + WINDOW_DAYS]\n\n  # Diagnosis-specific attribution: E11.x in any dx position, or an antidiabetic NDC class.\n  dx_hit <- Reduce(`|`, lapply(dx_cols, function(col)\n              startsWith(as.character(c[[col]]), DM_DX_PREFIX) %in% TRUE))\n  rx_hit <- c$benefit == \"RX\" & c$ndc_drug_class == DM_RX_CLASS & !is.na(c$ndc_drug_class)\n  c[, attributable := dx_hit | rx_hit]\n\n  per_person <- c[, .(all_cause    = sum(paid_amt),\n                      attributable = sum(paid_amt[attributable])), by = person_id]\n  out <- merge(cohort, per_person, by = \"person_id\", all.x = TRUE)\n  out[is.na(all_cause),    all_cause    := 0]\n  out[is.na(attributable), attributable := 0]\n  out[]\n}\n\nattributable_fraction <- function(summary) sum(summary$attributable) / sum(summary$all_cause)\n\nmatched_incremental_cost <- function(summary) {\n  paired <- summary[!is.na(match_id)]\n  w <- dcast(paired, match_id ~ diabetic, value.var = \"all_cause\", fun.aggregate = mean)\n  mean(w[[\"TRUE\"]] - w[[\"FALSE\"]], na.rm = TRUE)   # excess (caused) cost per matched pair\n}",
        "description": "All-cause vs attributable (diagnosis-specific) cost and matched-incremental cost with data.table.\nInputs mirror the Python version:\n  claims : person_id, service_date (Date), paid_amt, benefit ('MED'/'RX'),\n           dx1..dxN (character, ICD-10), ndc_drug_class (character or NA)\n  cohort : person_id, index_date (Date), diabetic (logical), match_id (integer or NA)",
        "dependencies": [
          "data.table"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "%let window = 365;\n\n/* All-cause and diagnosis-specific attributable cost in the fixed post-index window. */\nproc sql;\n  create table cost_pp as\n  select c.person_id,\n         sum(cl.paid_amt) as all_cause,\n         sum( case when ( cl.dx1 like 'E11%' or cl.dx2 like 'E11%' or cl.dx3 like 'E11%'\n                          or (cl.benefit='RX' and cl.ndc_drug_class='ANTIDIABETIC') )\n                   then cl.paid_amt else 0 end ) as attributable\n  from work.cohort c\n  inner join work.claims cl\n    on cl.person_id = c.person_id\n   and cl.service_date >= c.index_date\n   and cl.service_date <  c.index_date + &window\n  group by c.person_id;\nquit;\n\n/* Attributable fraction: transparency metric (share of all-cause captured by the rule). */\nproc sql;\n  select sum(attributable) / sum(all_cause) as attributable_fraction\n  from cost_pp;\nquit;\n\n/* Join cost back to cohort + covariates for the regression-based incremental estimate. */\nproc sql;\n  create table analytic as\n  select c.*, p.all_cause\n  from work.cohort c left join cost_pp p on c.person_id = p.person_id;\nquit;\n\n/* Gamma GLM with log link on ALL-CAUSE cost; DIABETIC coefficient = adjusted incremental cost. */\nproc genmod data=analytic;\n  class diabetic(ref='0') / param=ref;\n  model all_cause = diabetic <baseline covariates>\n        / dist=gamma link=log;                      /* right-skew, non-negative cost */\n  lsmeans diabetic / diff exp cl;                   /* dollar difference / cost ratio */\nrun;",
        "description": "All-cause and diagnosis-specific attributable cost (PROC SQL), then a regression-based\nincremental cost via a gamma GLM with log link (PROC GENMOD) - the assumption-light alternative\nto matching that uses the full sample. Required inputs (post data-management):\n  work.claims : person_id, service_date, paid_amt, benefit ('MED'/'RX'),\n                dx1-dxN (ICD-10), ndc_drug_class\n  work.cohort : person_id, index_date, diabetic (1/0), plus baseline covariates for the GLM\nThe GENMOD coefficient on DIABETIC (exponentiated as a cost ratio, or via LSMEANS as a dollar\ndifference) is the adjusted incremental cost; gamma/log handles the right-skewed, non-negative\ncost distribution without retransformation.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Claims[All adjudicated claims in window<br/>paid_amt, service_date, dx, NDC] --> Q{What is the question?}\n  Q -->|Total cost of care / budget| AllCause[All-cause cost<br/>Sum_All Medical: sum everything]\n  Q -->|Cost OF the disease| Attr{Causal claim?}\n  Attr -->|Descriptive accounting| DxSpec[Diagnosis-specific sum<br/>keep claims labelled with the disease<br/>UNDER-counts + mis-attributes]\n  Attr -->|Yes, caused cost| Incr[Incremental / excess cost]\n  Incr --> Match[Matched comparator<br/>mean diseased - mean matched control]\n  Incr --> Reg[Regression<br/>two-part / gamma GLM coefficient]\n  AllCause --> PPPM[Standardize: PPPM/PPPY, person-time, dollar year]\n  DxSpec --> Frac[Report attributable fraction = attributable / all-cause]\n  Match --> PPPM\n  Reg --> PPPM",
        "caption": "Decision logic from a pool of claims to the right cost estimand. Diagnosis-specific sums are descriptive accounting; only the matched or regression incremental constructs estimate the cost the disease caused.",
        "alt_text": "Flowchart branching from all claims into all-cause cost versus attributable cost, and within attributable into a descriptive diagnosis-specific sum versus a causal incremental estimate computed by matching or regression, all standardized to PPPM.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  subgraph Diseased[Diabetic cohort]\n    D_dx[Disease-labelled claims<br/>E11.x, antidiabetic NDC]\n    D_other[Disease-CAUSED claims<br/>coded as MI, renal, amputation]\n    D_unrel[Unrelated claims<br/>injury, dermatology]\n  end\n  subgraph Control[Matched non-diabetic cohort]\n    C_unrel[Unrelated claims only]\n  end\n  D_dx --> DxSum[Diagnosis-specific sum<br/>= D_dx ONLY]\n  D_dx --> Excess[Incremental / excess cost]\n  D_other --> Excess\n  D_unrel -. cancels against .-> C_unrel\n  Excess --> Note[Incremental recaptures D_other<br/>that the diagnosis-specific sum drops]",
        "caption": "Why diagnosis-specific and incremental costs differ. The diagnosis-specific sum keeps only explicitly labelled claims; the matched incremental contrast also captures disease-caused spend coded under other diagnoses, while unrelated cost cancels against the matched control.",
        "alt_text": "Diagram contrasting a diagnosis-specific sum (labelled claims only) with an incremental estimate that additionally captures disease-caused claims coded under other diagnoses, with unrelated cost cancelling against the matched control cohort.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "used_with",
        "target_slug": "healthcare-costs-pppm-pppy-pmpm",
        "notes": "The all-cause vs attributable choice is the upstream estimand decision; PPPM/PPPY standardization, cost basis, perspective, and two-part/gamma modeling of the resulting series live in the costs entry. Use together for the full pipeline."
      },
      {
        "relation_type": "used_with",
        "target_slug": "hcru-healthcare-resource-utilization",
        "notes": "Apply the identical attribution rule (diagnosis position, code list, incremental contrast) to HCRU counts and to costs; decisions here affect both endpoints symmetrically."
      },
      {
        "relation_type": "see_also",
        "target_slug": "burden-of-disease-cost-of-illness",
        "notes": "COI and burden studies present all-cause, diagnosis-specific, and incremental costs; the matched/regression incremental construct is preferred for \"burden caused by the disease.\""
      },
      {
        "relation_type": "see_also",
        "target_slug": "cost-outlier-handling-rwe",
        "notes": "Outlier rules (winsorization, trimming) should be applied after or consistently with the attribution rule; an unrelated extreme cost may be retained in all-cause but excluded from the attributable/incremental series."
      },
      {
        "relation_type": "complements",
        "target_slug": "icer-net-monetary-benefit-rwe",
        "notes": "Incremental (attributable) costs form the cost side of ICER and NMB; all-cause incremental is commonly reported as a secondary total-cost perspective."
      },
      {
        "relation_type": "see_also",
        "target_slug": "active-comparator-new-user",
        "notes": "The matched/regression incremental estimand inherits the same confounding, positivity, and comparator-choice problems as any comparative RWE design; ACNU machinery (matching, weighting, overlap diagnostics) applies directly."
      }
    ],
    "aliases": [
      "all-cause vs attributable costs",
      "attributable costs",
      "incremental costs",
      "excess costs",
      "disease-specific costs",
      "cost attribution rwe"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "hta",
      "ema",
      "pqa-cms"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "as-treated-risk-window-construction-rwe",
    "name": "As-Treated Risk Window Construction",
    "short_definition": "The exposure-definition rule that converts dispensing or administration records into on-treatment follow-up time by stitching supply intervals, applying grace periods and carryover, and censoring person-time when treatment stops or switches, so that risk is attributed only while the drug is plausibly acting.",
    "long_description": "**As-treated (on-treatment) risk window construction** is the operational step that turns a stream of exposure\nrecords into *time at risk*: the spans of follow-up during which a patient is counted as actively exposed. It\nanswers three coupled questions — when does exposed person-time **start** (almost always at the index fill/order,\ni.e. time zero), when is a patient **still on treatment** (supply-interval stitching, grace periods, carryover of\noversupply), and when does exposed person-time **end** (run-out, discontinuation, switch, or the structural\ncensoring events of disenrollment, death, and end of data). It is the engine behind an **as-treated / per-protocol\nestimand** and is distinct from an **intention-to-treat (ITT) / first-line** analysis that attributes all\npost-initiation follow-up to the initial drug regardless of later behavior.\n\n**Core conceptual / estimand distinction.** The risk-window rule *is* the estimand made concrete. Under ITT you\ncount outcomes for the whole observation window from time zero; the rule is trivial (one window per person) but the\ncontrast is the effect of *starting* a strategy and dilutes with discontinuation and switching. Under as-treated you\ncensor (or split) person-time when the patient leaves the protocol, so the contrast approaches the effect of\n*staying on* the drug — but only validly if you weight for the informative censoring that discontinuation/switching\ninduce (inverse-probability-of-censoring weighting, IPCW). A naive as-treated analysis with no IPCW silently\nconditions on staying treated, which is a post-baseline variable on the causal pathway, and is biased whenever\nprognosis predicts who stays. 
The window rule also fixes the **lag/induction** structure — whether the first N days\nafter initiation are \"at risk\" (acute outcomes) or excluded (latency for chronic outcomes) — and whether an outcome\nduring a grace-period extension or a post-discontinuation \"legacy\" window still counts (carryover, depletion of the\npharmacologic effect).\n\n**Pros, cons, and trade-offs.**\n- **vs intention-to-treat / first-line attribution:** As-treated targets the biologically interpretable\n  on-treatment effect, recovers dose-response and acute toxicities that ITT washes out, and matches the labeling\n  question \"what happens while a patient takes this.\" Cost: it requires careful episode logic and IPCW; done naively\n  it reintroduces selection bias and is *less* defensible than a clean ITT. **Prefer ITT** for the policy/adherence\n  question and as the primary in a target-trial emulation; **prefer as-treated** as the per-protocol companion or\n  when the mechanism is acute and on-treatment timing dominates.\n- **vs a single fixed risk window** (e.g. \"90 days after the index fill\" for everyone): A fixed window is simple,\n  immune to gap-rule arbitrariness, and standard for acute, single-dose, or vaccine-style exposures. Cost: it\n  misclassifies person-time for chronic refilled therapy — patients who refilled for two years get 90 days, patients\n  who stopped at day 10 get 90 days. **Prefer fixed windows** for acute/one-shot exposures and as a sensitivity\n  analysis; **prefer stitched as-treated windows** for chronic, refillable drugs.\n- **vs current-vs-former-vs-never time-updated exposure** (the fuller g-method machinery): Time-updated exposure with\n  a marginal structural model handles time-varying confounding affected by prior treatment that as-treated censoring\n  cannot. Cost: far heavier specification and data demands. 
**Prefer the as-treated window** when discontinuation is\n  not strongly confounded by evolving prognosis; escalate to time-updated/MSM when it is.\n- **Grace period is the central nuisance parameter.** Too short and you create artifactual gaps, fragment one true\n  episode into many, and manufacture immortal/\"unexposed\" time between fills; too long and you carry exposure status\n  far past the last pill, misattributing late events to a drug no longer present. The grace period must be\n  pre-specified and varied in sensitivity analysis — it is the single choice most likely to move a hazard ratio.\n\n**When to use.** Comparative safety where the hazard tracks active pharmacology (e.g., bleeding on an anticoagulant,\nhypoglycemia on a sulfonylurea), dose-response questions, per-protocol arms of a target-trial emulation, and any\nITT analysis whose effect is suspected to be diluted by heavy discontinuation. Use it whenever \"off-drug\"\nperson-time should not count as exposed.\n\n**When NOT to use — and when it is actively misleading or dangerous.**\n- **When discontinuation/switching is strongly driven by evolving prognosis and you do not weight for it.** Sick\n  patients stop drugs (or are stopped by clinicians); naive as-treated censoring then makes the remaining\n  on-treatment time look healthier than it is — a textbook **healthy-adherer / informative-censoring** bias. If you\n  cannot estimate censoring weights, an ITT primary is safer and more honest.\n- **For chronic-effect or carcinogenic outcomes with long latency.** Counting only current on-treatment time, with no\n  induction lag and no legacy window, biases toward the null because the relevant etiologic exposure occurred years\n  earlier. 
Use exposure-lag/induction windows or cumulative-dose metrics instead.\n- **In procedure or hospitalization studies where the window is anchored to a future event.** Defining \"exposed\"\n  time using post-index information (e.g., counting the days until a procedure that only treated patients receive)\n  creates **immortal time** — guaranteed event-free survival assigned to the exposed arm. The window must be built\n  only from information available at its start.\n- **In data that cannot observe the off-drug state.** If supply, stop dates, or enrollment are not reliably captured,\n  a \"discontinuation\" is indistinguishable from data loss and the window boundaries are noise.\n\n**Data-source operational depth.**\n- **Claims (FFS):** Exposure spans come from pharmacy claims (`ndc`, `fill_date`, `days_supply`). The standard\n  construction: sort fills per `person_id`, project each fill forward by `days_supply`, and if the next `fill_date`\n  falls within the supply end plus the **grace period**, stitch the two into one continuous episode; otherwise close\n  the episode at `supply_end + grace` (or at the last supply end). **Carryover/stockpiling**: when an early refill\n  overlaps unused supply, shift the new supply start to the prior supply end so on-hand days accumulate (capped to\n  avoid implausible hoarding). Failure modes and workarounds: (1) **Medicare Advantage / capitated person-time lacks\n  FFS pharmacy claims** — an MA enrollee's \"gap\" is missingness, not discontinuation; restrict to Parts A/B/D (or\n  commercial pharmacy benefit) and exclude MA-only spans. (2) **90-day mail-order and sample fills** distort\n  `days_supply`, lengthening or hiding episodes. (3) **Inpatient days** suppress outpatient pharmacy claims even\n  though the drug is administered; bridge known inpatient stays so they are not scored as gaps. 
(4) **Last fill near\n  death** leaves leftover supply that should be censored at death, not carried forward.\n- **EHR:** Exposure is the *order* or *administration*, not a paid claim; an active prescription with no fill is not\n  on-treatment. e-Prescribing and medication-administration records help, but **external-care leakage** (fills at\n  pharmacies outside the system) makes apparent discontinuation unreliable; link to dispensing where possible and\n  treat loss to follow-up as potentially informative. Encounter-driven capture means the *absence* of a stop note is\n  not evidence of continued use.\n- **Registry:** Often records treatment *lines* or start/stop at adjudicated visits but rarely day-level supply;\n  derive coarse windows from visit-anchored start/stop and link to claims for granular refill stitching.\n- **Linked claims–EHR–vital records:** The ideal substrate — EHR start dates + claims fill completeness + a death\n  index for the right-censoring boundary — but order/fill/service-date discrepancies must be reconciled before the\n  window is built, or episodes will start and end on the wrong dates.\n\n**Competing risks within the window.** In elderly claims populations, death is a frequent and *differentially\ndistributed* competing event: an exposure that delays death lengthens on-treatment person-time and inflates the\nobserved rate of any non-fatal outcome. Decide explicitly whether death censors the window (cause-specific) or is a\ncompeting event (subdistribution); a cause-specific window with differential mortality by arm can mislead.\n\n**Worked claims example.** Question: rate of major GI bleeding *while on* low-dose aspirin in a Medicare FFS +\ncommercial cohort. (1) **Eligibility / time zero:** first aspirin fill (`index_date`) after ≥365 days continuous\nA/B/D (or commercial medical+pharmacy) enrollment with no aspirin fill in the lookback (incident user). 
(2)\n**Window start:** day after `index_date` (or `index_date` itself, pre-specified); apply a 1-day induction so a bleed\ncoded on the index day is not attributed to a drug not yet taken. (3) **Stitching:** for each subsequent fill,\n`supply_end = fill_date + days_supply`; if the next `fill_date <= supply_end + 30` (30-day grace), continue the\nepisode and, if `fill_date < supply_end`, carry the unused days forward (cap total on-hand at 90 days). (4)\n**Window end (censor exposed time at the earliest of):** last `supply_end + 30`-day grace run-out (discontinuation),\nswitch to a different antiplatelet (NDC change), disenrollment, death, or end of data. (5) **Bridging:** any\ninpatient stay overlapping a stitched episode is treated as on-treatment (drug administered in hospital), not a gap.\n(6) **Outcome:** first inpatient claim with a primary GI-bleed `dx` during open on-treatment person-time; person-time\nand events outside open windows are excluded (or contribute to an \"off-treatment\" comparison group). (7)\n**Sensitivity:** rerun with grace = 0/15/60 days, induction = 0/7 days, a 30-day post-discontinuation legacy window,\nand IPCW for informative discontinuation; report how the rate and any comparative HR move with each.",
    "primary_category": "Exposure_Definition",
    "tags": [
      "exposure_definition",
      "as-treated",
      "on-treatment-window",
      "grace-period",
      "carryover-stockpiling",
      "per-protocol",
      "informative-censoring",
      "time-at-risk",
      "pharmacoepidemiology"
    ],
    "applies_to_study_types": [
      "claims_analysis",
      "ehr_study",
      "registry_linkage",
      "target_trial_emulation",
      "comparative_effectiveness"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1093/aje/kwg231",
        "url": "https://doi.org/10.1093/aje/kwg231",
        "citation_text": "Ray WA. Evaluating medication effects outside of clinical trials: new-user designs. American Journal of Epidemiology. 2003;158(9):915-920.",
        "year": 2003,
        "authors_short": "Ray",
        "notes": "Establishes the incident-user framework and time-zero anchoring on which any on-treatment window is built."
      },
      {
        "role": "explain",
        "doi": "10.1093/aje/kwm324",
        "url": "https://doi.org/10.1093/aje/kwm324",
        "citation_text": "Suissa S. Immortal time bias in pharmacoepidemiology. American Journal of Epidemiology. 2008;167(4):492-499.",
        "year": 2008,
        "authors_short": "Suissa",
        "notes": "The canonical statement of how mis-defined exposure windows manufacture immortal (guaranteed event-free) time; the failure mode an as-treated rule must avoid."
      },
      {
        "role": "explain",
        "doi": "10.1093/aje/kwv254",
        "url": "https://doi.org/10.1093/aje/kwv254",
        "citation_text": "Hernán MA, Robins JM. Using big data to emulate a target trial when a randomized trial is not available. American Journal of Epidemiology. 2016;183(8):758-764.",
        "year": 2016,
        "authors_short": "Hernán & Robins",
        "notes": "Frames the ITT vs per-protocol (as-treated) estimand distinction and the need to adjust for informative censoring when follow-up is truncated at deviation from protocol."
      },
      {
        "role": "demonstrate",
        "doi": "10.1002/pds.1230",
        "url": "https://doi.org/10.1002/pds.1230",
        "citation_text": "Andrade SE, Kahler KH, Frech F, Chan KA. Methods for evaluation of medication adherence and persistence using automated databases. Pharmacoepidemiology and Drug Safety. 2006;15(8):565-574.",
        "year": 2006,
        "authors_short": "Andrade et al.",
        "notes": "Operational reference for supply-interval stitching, permissible gaps, and persistence from automated pharmacy data — the mechanics of episode construction."
      },
      {
        "role": "explain",
        "doi": "10.1111/j.1365-2125.2012.04167.x",
        "url": "https://doi.org/10.1111/j.1365-2125.2012.04167.x",
        "citation_text": "Vrijens B, De Geest S, Hughes DA, et al. A new taxonomy for describing and defining adherence to medications. British Journal of Clinical Pharmacology. 2012;73(5):691-705.",
        "year": 2012,
        "authors_short": "Vrijens et al.",
        "notes": "ABC taxonomy distinguishing initiation, implementation, and discontinuation/persistence — the vocabulary for where an on-treatment window opens and closes."
      }
    ],
    "plain_language_summary": "An as-treated risk window marks exactly the days a patient had a drug on hand and counts only those days when attributing side effects or outcomes to the drug. You project each prescription fill forward by the number of days it is supposed to last, stitch consecutive fills together when a new fill arrives before the previous one runs out (or within a short grace period), and close the window when the patient goes too long without refilling. Any event that happens inside the window is charged to the drug; any event outside the window — after the patient has stopped — is not, because the drug was no longer acting. The method cannot see cash-paid fills or free samples, so gaps in the data may look like the patient stopped when they actually kept taking the drug.",
    "key_terms": [
      {
        "term": "days_supply",
        "definition": "The number of days one filled prescription is meant to last — a 30-day fill has days_supply = 30."
      },
      {
        "term": "index fill",
        "definition": "The first prescription fill that starts the clock — the patient's 'day zero' when on-treatment follow-up begins."
      },
      {
        "term": "grace period",
        "definition": "A short buffer (often 30 days) added after a fill's supply runs out, during which the patient is still counted as on treatment even if they have not yet refilled — it forgives a brief late trip to the pharmacy."
      },
      {
        "term": "on-treatment person-time",
        "definition": "The total count of days during which a patient is inside an open risk window and therefore counted as actively exposed to the drug."
      },
      {
        "term": "episode",
        "definition": "One continuous stretch of on-treatment days, starting with the index fill and ending when the supply plus grace period runs out or the patient switches drugs."
      }
    ],
    "worked_example": {
      "scenario": "Patient 2001 is newly started on metoprolol (a blood pressure pill) on January 1, 2024. She fills it twice before stopping. We want to know which days count as 'at risk' — meaning the drug was plausibly in her system — so we can correctly attribute a heart-rate event to the drug only if it happened while she was actually taking it. We use a 30-day grace period: if her next fill arrives within 30 days of her supply running out, the two fills are joined into one unbroken risk window.",
      "dataset": {
        "caption": "Pharmacy claims rows for patient 2001 — exactly the columns an analyst sees in a real table.",
        "columns": [
          "person_id",
          "fill_date",
          "drug",
          "days_supply"
        ],
        "rows": [
          [
            2001,
            "2024-01-01",
            "metoprolol",
            30
          ],
          [
            2001,
            "2024-02-15",
            "metoprolol",
            30
          ]
        ]
      },
      "steps": [
        "Fill A (Jan 1, 30-day supply) covers Jan 1 through Jan 30 — those 30 days are inside the risk window.",
        "After Jan 30 the supply is gone, but the 30-day grace period keeps the window open through Feb 29 while we wait to see if she refills.",
        "Fill B arrives on Feb 15, which is before the grace expires (Feb 29) — so the two fills are stitched into one continuous episode; no gap is recorded.",
        "Fill B's 30-day supply runs from Feb 15 through Mar 15; the 30-day grace tail then extends the window through Apr 14.",
        "No Fill C arrives by Apr 14, so the episode closes on Apr 14 — all days from Apr 15 onward are off-treatment.",
        "Event A (heart-rate drop, Mar 10) falls inside the risk window (Jan 1–Apr 14) and is counted as an on-treatment event.",
        "Event B (a separate ER visit, May 1) falls outside the closed window and is NOT counted as an on-treatment event — the drug was no longer acting.",
        "Total on-treatment days: Jan 1–Apr 14 = 31 (Jan) + 29 (Feb, leap year) + 31 (Mar) + 14 (Apr 1-14) = 105 days.",
        "Incidence rate = 1 on-treatment event ÷ 105 on-treatment days × 1,000 = 9.5 events per 1,000 person-days."
      ],
      "result": "105 on-treatment days; 1 event inside the window; incidence rate = 9.5 events per 1,000 person-days. Event B (May 1) does not contribute because the risk window closed on Apr 14.",
      "timeline_spec": {
        "title": "As-treated risk window for one metoprolol patient (30-day grace, two fills)",
        "window": {
          "start": "2024-01-01",
          "end": "2024-06-30",
          "label": "Observation period (6 months)"
        },
        "events": [
          {
            "label": "Fill A (index fill)",
            "start": "2024-01-01",
            "length_days": 30,
            "quantity": "30 days_supply"
          },
          {
            "label": "Fill B (refill on Feb 15 — supply ran out Jan 30, gap = 16 days, within 30-day grace)",
            "start": "2024-02-15",
            "length_days": 30,
            "quantity": "30 days_supply"
          },
          {
            "label": "Event A: heart-rate drop (INSIDE risk window — counted)",
            "start": "2024-03-10",
            "length_days": 1,
            "quantity": "outcome event"
          },
          {
            "label": "Event B: ER visit (OUTSIDE risk window — not counted)",
            "start": "2024-05-01",
            "length_days": 1,
            "quantity": "outcome event"
          }
        ],
        "spans": [
          {
            "kind": "exposed",
            "start": "2024-01-01",
            "end": "2024-01-30",
            "label": "Fill A supply (30 days at risk)"
          },
          {
            "kind": "exposed",
            "start": "2024-01-31",
            "end": "2024-02-14",
            "label": "Grace period — still at risk (15 days; Fill B not yet arrived)"
          },
          {
            "kind": "exposed",
            "start": "2024-02-15",
            "end": "2024-03-15",
            "label": "Fill B supply (30 days at risk; stitched to Fill A)"
          },
          {
            "kind": "exposed",
            "start": "2024-03-16",
            "end": "2024-04-14",
            "label": "Grace tail after Fill B (30 days at risk; no Fill C arrives)"
          },
          {
            "kind": "unexposed",
            "start": "2024-04-15",
            "end": "2024-06-30",
            "label": "Off-treatment: grace expired, no refill — not at risk (77 days)"
          }
        ],
        "result": {
          "label": "105 on-treatment days (Jan 1–Apr 14); 1 event inside window; rate = 9.5 per 1,000 person-days",
          "value": 105
        },
        "caption": "Two fills are stitched into one 105-day risk window because Fill B arrives before the 30-day grace expires. Event A (Mar 10) lands inside and is attributed to metoprolol. Event B (May 1) lands in the off-treatment gap and is excluded from the on-treatment rate. If the grace period were shorter — say 10 days — Fill B would arrive too late to stitch and the single window would close Jan 30, cutting 75 days of legitimate at-risk time and possibly missing Event A entirely.",
        "alt_text": "Horizontal timeline from January 1 to June 30, 2024. A blue exposed bar covers January 1 through April 14, subdivided into Fill A supply (Jan 1–Jan 30), a grace-period segment (Jan 31–Feb 14), Fill B supply (Feb 15–Mar 15), and a grace tail (Mar 16–Apr 14). A red event marker on March 10 sits inside the blue bar labeled 'Event A — counted.' A grey unexposed bar runs April 15 through June 30. A second red event marker on May 1 sits inside the grey bar labeled 'Event B — not counted.'"
      }
    },
    "prerequisites": [
      "new-user-design",
      "time-zero-index-date-alignment-rwe",
      "grace-period-gap-rules-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Fixed-duration risk window (single-dose / acute exposure)",
        "description": "Each exposure record opens a pre-specified fixed window (e.g., 1-42 days post-vaccination, 30 days post-dispensing) regardless of refills; the simplest and most defensible rule for acute or one-shot exposures.",
        "edge_cases": [
          "Misclassifies chronic refillable therapy (everyone gets the same window irrespective of true duration).",
          "Overlapping fixed windows from sequential fills require a merge rule to avoid double-counting person-time."
        ],
        "data_source_notes": "claims: window = fill_date + fixed_days; ignore days_supply. Standard for vaccines and acute drugs; use as a sensitivity analysis for chronic drugs."
      },
      {
        "name": "Supply-stitched episode with grace period and carryover",
        "description": "Project each fill by days_supply, merge consecutive fills whose gap is within the grace period into a single episode, and carry unused supply forward (stockpiling) up to a cap; close the episode at run-out + grace.",
        "edge_cases": [
          "Grace period too short fragments one episode and creates artifactual unexposed gaps; too long carries exposure past the last pill.",
          "Early refills (stockpiling) inflate cumulative on-hand supply unless capped.",
          "Inpatient stays suppress outpatient claims and masquerade as gaps unless bridged."
        ],
        "data_source_notes": "claims: requires reliable days_supply; bridge known inpatient stays; cap carryover. The default for chronic refillable drugs."
      },
      {
        "name": "As-treated with informative-censoring weights (per-protocol)",
        "description": "Build the on-treatment window, then censor person-time at discontinuation/switch and re-weight by inverse probability of remaining uncensored (IPCW) to recover the per-protocol effect.",
        "edge_cases": [
          "Requires a correctly specified time-varying model for the censoring (discontinuation/switch) hazard.",
          "Extreme weights from rare adherence patterns inflate variance; truncation trades bias for variance."
        ],
        "data_source_notes": "Depends on rich time-updated covariates that predict discontinuation; weak covariates leave residual selection bias."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Intention-to-treat / first-line attribution",
        "pros_of_this": "Targets the biologically interpretable on-treatment effect; recovers dose-response and acute toxicities that ITT dilutes; matches the on-label \"while taking the drug\" question.",
        "cons_of_this": "Requires episode logic and inverse-probability-of-censoring weighting; naive as-treated reintroduces informative-censoring (healthy-adherer) bias and is less defensible than a clean ITT.",
        "when_to_prefer": "Acute mechanistic safety/effectiveness, dose-response, and the per-protocol companion to an ITT primary in a target-trial emulation."
      },
      {
        "compared_to": "Single fixed-duration risk window",
        "pros_of_this": "Reflects true treatment duration for chronic refillable drugs; does not assign identical follow-up to persistent and immediate-discontinuer patients.",
        "cons_of_this": "Sensitive to grace-period and stockpiling choices; more programming and more diagnostics.",
        "when_to_prefer": "Chronic, refillable therapy where on-treatment duration varies widely across patients."
      },
      {
        "compared_to": "Time-updated current/former/never exposure with a marginal structural model",
        "pros_of_this": "Far simpler to specify, communicate, and defend; adequate when discontinuation is not strongly confounded by evolving prognosis.",
        "cons_of_this": "Cannot handle time-varying confounding affected by prior treatment that g-methods address.",
        "when_to_prefer": "When discontinuation/switching is weakly prognostic; escalate to MSM/g-methods when it is strongly prognostic."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Episodes = days_supply projection with a pre-specified grace period and capped carryover; bridge inpatient stays; close at run-out+grace, switch, disenrollment, death, or end of data. Exclude Medicare Advantage-only person-time where FFS pharmacy claims are unavailable (a \"gap\" there is missingness, not discontinuation). Audit pre/post episode counts and grace-period sensitivity.",
      "ehr": "Window anchored to the order/administration, not the paid claim; an unfilled prescription is not on-treatment. External-care leakage makes apparent discontinuation unreliable — link to dispensing and treat loss to follow-up as potentially informative.",
      "registry": "Derive coarse windows from visit-anchored treatment-line start/stop; link to claims for day-level refill stitching and to a death index for the right-censoring boundary.",
      "linked": "Reconcile order/fill/service-date discrepancies before building windows; use EHR start + claims fills + a death index, but account for linkage selection (only the linkable subset is analyzable)."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\nimport numpy as np\n\nGRACE_DAYS = 30      # permissible gap between fills before an episode is closed\nCARRYOVER_CAP = 90   # max stockpiled on-hand days (guards against implausible hoarding)\n\ndef build_at_windows(rx: pd.DataFrame, censor: pd.DataFrame) -> pd.DataFrame:\n    rx = rx.sort_values([\"person_id\", \"fill_date\"]).copy()\n    episodes = []\n\n    for pid, g in rx.groupby(\"person_id\"):\n        # Earliest structural censoring date for this person (NaT-safe min).\n        c = censor.loc[censor[\"person_id\"] == pid, [\"disenroll_date\", \"death_date\", \"data_end\"]]\n        hard_stop = c.min(axis=1).min() if len(c) else pd.NaT\n\n        ep_start = None\n        on_hand_end = None     # running supply-end including carried-over days\n        ep_drug = None\n\n        for _, f in g.iterrows():\n            start = f[\"fill_date\"]\n            supply_end = start + pd.Timedelta(days=int(f[\"days_supply\"]))\n\n            if ep_start is None:\n                ep_start, on_hand_end, ep_drug = start, supply_end, f[\"drug_class\"]\n                continue\n\n            # Switch closes the current episode at run-out + grace.\n            switched = f[\"drug_class\"] != ep_drug\n            within_grace = start <= on_hand_end + pd.Timedelta(days=GRACE_DAYS)\n\n            if within_grace and not switched:\n                # Carry forward unused supply (stockpiling), capped.\n                base = max(on_hand_end, start)\n                on_hand_end = min(base + pd.Timedelta(days=int(f[\"days_supply\"])),\n                                  start + pd.Timedelta(days=CARRYOVER_CAP))\n            else:\n                episodes.append((pid, ep_drug, ep_start,\n                                 on_hand_end + pd.Timedelta(days=GRACE_DAYS)))\n                ep_start, on_hand_end, ep_drug = start, supply_end, f[\"drug_class\"]\n\n        if ep_start is not None:\n            episodes.append((pid, ep_drug, ep_start,\n       
                      on_hand_end + pd.Timedelta(days=GRACE_DAYS)))\n\n    out = pd.DataFrame(episodes,\n                       columns=[\"person_id\", \"drug_class\", \"episode_start\", \"episode_end\"])\n    # Right-censor every episode end at the structural stop (disenroll / death / data end).\n    out = out.merge(\n        censor.assign(hard_stop=censor[[\"disenroll_date\", \"death_date\", \"data_end\"]].min(axis=1))\n              [[\"person_id\", \"hard_stop\"]],\n        on=\"person_id\", how=\"left\")\n    out[\"episode_end\"] = out[[\"episode_end\", \"hard_stop\"]].min(axis=1)\n    out = out[out[\"episode_end\"] > out[\"episode_start\"]]   # drop empty windows\n    return out.drop(columns=\"hard_stop\").reset_index(drop=True)",
        "description": "Build as-treated on-treatment risk windows from claims-style pharmacy fills. Required inputs\n(cleaned, de-duplicated):\n  rx     : person_id, fill_date (datetime), ndc/drug_class, days_supply (int)\n  censor : person_id, disenroll_date, death_date, data_end (datetime; NaT if not applicable)\nReturns one row per continuous on-treatment episode with [episode_start, episode_end], where the end is the\nearliest of supply run-out + grace, a switch, disenrollment, death, or data end. Outcomes are attributed only to\nperson-time inside these episodes downstream.",
        "dependencies": [
          "pandas",
          "numpy"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\nGRACE_DAYS    <- 30L\nCARRYOVER_CAP <- 90L\n\nbuild_at_windows <- function(rx, censor) {\n  setDT(rx); setDT(censor)\n  setorder(rx, person_id, fill_date)\n\n  one_person <- function(g) {\n    ep_start <- on_hand_end <- ep_drug <- NULL\n    eps <- list()\n    for (i in seq_len(nrow(g))) {\n      start <- g$fill_date[i]\n      supply_end <- start + g$days_supply[i]\n      if (is.null(ep_start)) {\n        ep_start <- start; on_hand_end <- supply_end; ep_drug <- g$drug_class[i]; next\n      }\n      switched     <- g$drug_class[i] != ep_drug\n      within_grace <- start <= on_hand_end + GRACE_DAYS\n      if (within_grace && !switched) {\n        base <- max(on_hand_end, start)                      # carry unused supply forward\n        on_hand_end <- min(base + g$days_supply[i], start + CARRYOVER_CAP)\n      } else {\n        eps[[length(eps) + 1L]] <- list(ep_drug, ep_start, on_hand_end + GRACE_DAYS)\n        ep_start <- start; on_hand_end <- supply_end; ep_drug <- g$drug_class[i]\n      }\n    }\n    eps[[length(eps) + 1L]] <- list(ep_drug, ep_start, on_hand_end + GRACE_DAYS)\n    data.table(drug_class   = vapply(eps, `[[`, \"\", 1L),\n               episode_start = as.Date(vapply(eps, function(e) as.numeric(e[[2L]]), 0), origin = \"1970-01-01\"),\n               episode_end   = as.Date(vapply(eps, function(e) as.numeric(e[[3L]]), 0), origin = \"1970-01-01\"))\n  }\n\n  out <- rx[, one_person(.SD), by = person_id]\n  hs  <- censor[, .(hard_stop = pmin(disenroll_date, death_date, data_end, na.rm = TRUE)),\n                by = person_id]\n  out <- merge(out, hs, by = \"person_id\", all.x = TRUE)\n  out[!is.na(hard_stop), episode_end := pmin(episode_end, hard_stop)]\n  out[episode_end > episode_start, .(person_id, drug_class, episode_start, episode_end)]\n}",
        "description": "As-treated on-treatment windows with data.table. Inputs mirror the Python version:\n  rx     : person_id, fill_date (Date), drug_class, days_supply (integer)\n  censor : person_id, disenroll_date, death_date, data_end (Date; NA allowed)\nReturns one row per on-treatment episode, right-censored at the earliest structural stop.",
        "dependencies": [
          "data.table"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "%let grace = 30;\n%let cap   = 90;\n\nproc sort data=work.rx; by person_id fill_date; run;\n\n/* Stitch fills into on-treatment episodes with grace period + capped carryover. */\ndata episodes;\n  set work.rx;\n  by person_id;\n  retain ep_start on_hand_end ep_drug;\n  format ep_start episode_end date9.;\n  supply_end = fill_date + days_supply;\n\n  if first.person_id then do;\n    ep_start = fill_date; on_hand_end = supply_end; ep_drug = drug_class;\n  end;\n  else do;\n    if fill_date <= on_hand_end + &grace and drug_class = ep_drug then do;\n      /* within grace, same drug: carry unused supply forward, capped */\n      base = max(on_hand_end, fill_date);\n      on_hand_end = min(base + days_supply, fill_date + &cap);\n    end;\n    else do;\n      /* gap exceeded or switch: emit the closed episode, then open a new one */\n      drug_class_out = ep_drug; episode_start = ep_start;\n      episode_end = on_hand_end + &grace; output;\n      ep_start = fill_date; on_hand_end = supply_end; ep_drug = drug_class;\n    end;\n  end;\n\n  if last.person_id then do;\n    drug_class_out = ep_drug; episode_start = ep_start;\n    episode_end = on_hand_end + &grace; output;\n  end;\n  keep person_id drug_class_out episode_start episode_end;\n  rename drug_class_out = drug_class;\nrun;\n\n/* Right-censor each episode at the earliest of disenrollment / death / data end. */\nproc sql;\n  create table at_windows as\n  select e.person_id, e.drug_class, e.episode_start,\n         min(e.episode_end,\n             coalesce(c.disenroll_date, e.episode_end),\n             coalesce(c.death_date,     e.episode_end),\n             coalesce(c.data_end,       e.episode_end)) as episode_end format=date9.\n  from episodes e\n  left join work.censor c on e.person_id = c.person_id\n  having calculated episode_end > e.episode_start;\nquit;",
        "description": "As-treated on-treatment windows in SAS via PROC SQL + DATA step. Required input datasets (post data-management):\n  work.rx     : person_id, fill_date, drug_class, days_supply\n  work.censor : person_id, disenroll_date, death_date, data_end  (missing allowed)\nThe DATA step uses BY-group retain logic to stitch fills within the grace period (carryover capped), close\nepisodes at run-out + grace or a switch, and the final PROC SQL right-censors each episode at the earliest\nstructural stop. Outcomes are attributed only to person-time inside the resulting episodes.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": "as-treated-risk-window-construction-rwe-timeline.svg",
        "mermaid": null,
        "caption": "Two fills are stitched into one 105-day risk window because Fill B arrives before the 30-day grace expires. Event A (Mar 10) lands inside and is attributed to metoprolol. Event B (May 1) lands in the off-treatment gap and is excluded from the on-treatment rate. If the grace period were shorter — say 10 days — Fill B would arrive too late to stitch and the single window would close Jan 30, cutting 75 days of legitimate at-risk time and possibly missing Event A entirely.",
        "alt_text": "Horizontal timeline from January 1 to June 30, 2024. A blue exposed bar covers January 1 through April 14, subdivided into Fill A supply (Jan 1–Jan 30), a grace-period segment (Jan 31–Feb 14), Fill B supply (Feb 15–Mar 15), and a grace tail (Mar 16–Apr 14). A red event marker on March 10 sits inside the blue bar labeled 'Event A — counted.' A grey unexposed bar runs April 15 through June 30. A second red event marker on May 1 sits inside the grey bar labeled 'Event B — not counted.'",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  F[Pharmacy fills: fill_date + days_supply] --> P[Project each fill to supply_end]\n  P --> G{Next fill within<br/>supply_end + grace?}\n  G -->|Yes, same drug| S[Stitch: carry unused supply forward<br/>capped at CARRYOVER_CAP]\n  G -->|No, or switch| C[Close episode at<br/>supply_end + grace]\n  S --> P\n  C --> H[Right-censor at earliest of:<br/>disenroll / death / data end]\n  H --> W[On-treatment risk window]\n  W --> A[Attribute outcomes + person-time<br/>only inside open windows]\n  A --> Z[Sensitivity: vary grace / induction /<br/>legacy window / IPCW]",
        "caption": "As-treated window construction. Fills are projected by days_supply and stitched within the grace period with capped carryover; episodes close at run-out, switch, or structural censoring; outcomes are counted only while a window is open.",
        "alt_text": "Flowchart from pharmacy fills through supply projection, grace-period stitching with carryover, episode closure, structural censoring, outcome attribution, and sensitivity analyses.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "gantt\n  title On-treatment windows for one patient (claims)\n  dateFormat YYYY-MM-DD\n  axisFormat %b %Y\n  section Episode 1 (drug A)\n  Fill 1 supply (30d) :done, f1, 2024-01-01, 30d\n  Fill 2 within grace (stitched) :done, f2, 2024-02-05, 30d\n  Grace tail (run-out + 30d) :active, g1, 2024-03-06, 30d\n  section Gap (off-treatment, censored)\n  No fill > grace -> episode closed :crit, gap, 2024-04-05, 60d\n  section Episode 2 (drug A restart)\n  Fill 3 new episode :done, f3, 2024-06-04, 30d",
        "caption": "One patient with two on-treatment episodes. Fill 2 lands within the grace period and is stitched to Fill 1; the subsequent long gap closes the episode (off-treatment person-time is not counted as exposed) and Fill 3 opens a new episode.",
        "alt_text": "Gantt chart showing a stitched first episode with a grace tail, an off-treatment gap that closes the episode, and a later restart that opens a second episode.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "is_variant_of",
        "target_slug": "exposure-episode-construction-rwe",
        "notes": "As-treated window construction is the on-treatment specialization of general exposure-episode construction, focused on attributing risk only while the drug is plausibly acting."
      },
      {
        "relation_type": "used_with",
        "target_slug": "grace-period-gap-rules-rwe",
        "notes": "The grace period and gap rule are the central nuisance parameters that determine when consecutive fills are stitched and when an episode closes."
      },
      {
        "relation_type": "used_with",
        "target_slug": "time-zero-index-date-alignment-rwe",
        "notes": "The window must open at a correctly aligned time zero; misalignment reintroduces immortal time."
      },
      {
        "relation_type": "alternative_to",
        "target_slug": "clone-censor-weight-per-protocol",
        "notes": "Clone-censor-weight is the formal per-protocol alternative that handles grace-period eligibility ambiguity and informative censoring that a naive as-treated window cannot."
      },
      {
        "relation_type": "see_also",
        "target_slug": "immortal-time-bias-handling",
        "notes": "Anchoring window boundaries to post-index events (e.g., a future fill or procedure) creates immortal time; the window must be built only from information available at its start."
      },
      {
        "relation_type": "see_also",
        "target_slug": "time-updated-exposures-cumulative-dose-rwe",
        "notes": "When discontinuation is strongly confounded by evolving prognosis, escalate from as-treated censoring to time-updated exposure with cumulative dose and marginal structural models."
      },
      {
        "relation_type": "see_also",
        "target_slug": "inpatient-bridging-exposure-rwe",
        "notes": "Inpatient stays suppress outpatient pharmacy claims and must be bridged so they are not scored as gaps that artifactually close an episode."
      },
      {
        "relation_type": "see_also",
        "target_slug": "omop-time-at-risk-cohort-exit-rwe",
        "notes": "The OMOP time-at-risk / cohort-exit machinery is the standardized implementation of these on-treatment window and censoring rules."
      }
    ],
    "aliases": [
      "on-treatment risk window",
      "as-treated exposure window",
      "on-treatment follow-up window",
      "as-treated time at risk"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "atc-ddd-classification",
    "name": "ATC Classification and Defined Daily Dose (DDD)",
    "short_definition": "The WHO Anatomical Therapeutic Chemical (ATC) classification assigns every drug substance to a five-level hierarchy — from broad anatomical group down to the specific chemical substance — while the Defined Daily Dose (DDD) is a standardized unit of measurement (not a prescribing recommendation) representing the assumed average adult maintenance dose per day for the main indication; together, ATC and DDD are the international standard currency for drug utilization research, enabling cross-country and time-trend comparisons that are impossible with US-centric NDC codes alone.",
    "long_description": "**The ATC hierarchy: five levels from organ system to molecule**\n\nThe Anatomical Therapeutic Chemical (ATC) classification system, maintained by the WHO\nCollaborating Centre for Drug Statistics Methodology in Oslo, organizes every marketed drug\nsubstance into a five-level code. Walking through A10BA02 (metformin) makes each level concrete:\n\n- **Level 1 — Anatomical main group (1 letter):** A = Alimentary tract and metabolism. The 14\n  top-level groups map to organ systems or therapeutic domains: A (alimentary), B (blood), C\n  (cardiovascular), J (anti-infectives), L (antineoplastic), N (nervous system), R (respiratory),\n  and so on.\n- **Level 2 — Therapeutic subgroup (2 digits):** A10 = Drugs used in diabetes. This level\n  narrows to the therapeutic purpose within the organ system.\n- **Level 3 — Pharmacological subgroup (1 letter):** A10B = Blood glucose-lowering drugs,\n  excluding insulins. The third level specifies pharmacological mechanism.\n- **Level 4 — Chemical subgroup (1 letter):** A10BA = Biguanides. The fourth level reaches the\n  chemical class.\n- **Level 5 — Chemical substance (2 digits):** A10BA02 = metformin. The fifth level uniquely\n  identifies the active substance.\n\nAnalysts query ATC at whichever level the research question demands. A stewardship report on\nall antidiabetics queries A10. A comparative study of biguanides queries A10BA. A study of\nmetformin specifically queries A10BA02. The hierarchy is the design tool; the level chosen is\nan analytic decision that must be pre-specified.\n\n**DDD: a measurement unit, not a dose recommendation**\n\nThe DDD is the *assumed average maintenance dose per day* for a drug used for its *main\nindication* in *adults*. It is purely a measurement unit — a yardstick that makes the volume\nof different drugs comparable — not a therapeutic recommendation and not the dose any individual\npatient should receive. 
The WHO states this explicitly and emphatically: \"The DDD is a unit of\nmeasurement and does not necessarily reflect the recommended or prescribed daily dose.\"\n\nThis distinction is the single most important honesty point in applying DDD-based metrics.\nAn analyst who reports \"patients received 1.5 DDDs per day on average\" is not saying they were\nover-dosed; they are saying the volume of drug dispensed, expressed as a fraction of the\nreference unit, was 1.5. The clinical interpretation of that number depends entirely on context.\n\nFor metformin (A10BA02), the WHO DDD is 2 g (2000 mg). This is the assumed daily maintenance\ndose for type 2 diabetes. A patient receiving metformin 500 mg twice daily (1000 mg/day) is\nreceiving 0.5 DDDs/day. A patient on 1000 mg twice daily (2000 mg/day) is receiving 1.0\nDDD/day. A patient on 1000 mg three times daily (3000 mg/day) is receiving 1.5 DDDs/day. All\nthree patients are within the licensed dosing range; the DDD simply standardizes the comparison.\n\n**Why international RWE speaks ATC while US pharmacy claims speak NDC**\n\nUS pharmacy claims use NDC (National Drug Code) as the primary drug identifier. NDC is\npackage-level and labeler-specific — excellent for identifying the exact dispensed product, but\nill-suited for international comparison or therapeutic-class grouping without a mapping layer.\n\nInternational drug utilization research — WHO drug statistics, OECD health data, ECDC\nantimicrobial surveillance, EMA post-authorization safety studies, pan-European\npharmacoepidemiology networks — uses ATC/DDD as the common currency. This is because:\n\n1. ATC transcends country-specific billing codes. The same metformin molecule carries ATC\n   A10BA02 in the US, UK, Sweden, Germany, and Japan, allowing direct cross-country comparison.\n2. DDD/1000 inhabitants/day is scale-invariant. 
A rate of 300 DDD/1000/day for metformin means\n   the same thing regardless of the currency or healthcare system.\n\nThe NDC-to-ATC crosswalk path is: NDC → RxNorm (ingredient level via RxNav) → ATC (via the\nRxClass service, class type `ATC1-4`). This three-step chain is the standard\napproach for US claims studies that need to report in ATC terms or link to European registries.\nThe chain is lossy in at least three documented ways:\n\n- **Combination products map to multiple ATC codes.** A fixed-dose combination tablet containing\n  both metformin and sitagliptin (Janumet) has two ATC codes: A10BA02 (metformin component)\n  and A10BH01 (sitagliptin component). A simple NDC→single-ATC mapping must choose one or\n  create two records; both choices introduce error.\n- **One substance, multiple ATC codes for different indications.** Aspirin is the canonical\n  example. At antiplatelet doses (75–325 mg/day), aspirin is classified as B01AC06 (platelet\n  aggregation inhibitors). At analgesic doses (500–1000 mg per dose), it is N02BA01 (salicylic\n  acid and derivatives). Assigning \"the\" ATC code for aspirin without knowing the\n  clinical indication is meaningless — a common source of drug class miscounting in claims\n  studies. This is sometimes called the aspirin mapping trap and applies to any drug\n  used across multiple therapeutic domains (e.g., low-dose naltrexone for autoimmune\n  conditions vs standard-dose naltrexone for opioid dependence).\n- **ATC assignment lag for new substances.** Newly approved drugs may lack an ATC code at\n  launch. 
During the gap, NDC→ATC crosswalks return no match, systematically undercounting\n  exposure to novel agents in the period immediately following approval.\n\n**DDD/1000 inhabitants/day: the population utilization metric**\n\nThe standard WHO aggregate drug utilization metric is:\n\nDDD / 1000 inhabitants / day = (Total DDDs dispensed in period) / (population × days in period) × 1000\n\nThis quantity answers: \"Of every 1000 people in this population, how many are (notionally) on a\nfull DDD of this drug every day?\" A value of 300 DDD/1000/day for metformin means that, if\nevery treated person took exactly one DDD per day, 30% of the population (300 per 1000) would be\ntreated. It is a utilization intensity measure, not a prevalence measure — the two will differ whenever\nthe actual prescribed dose differs from the DDD.\n\nWhen computing DDD/1000/day from US pharmacy claims rather than national sales or dispensing\nstatistics, the denominator should be enrolled person-days, not census population. Pairing a\nclaims-based numerator with a census denominator creates a ratio that is neither interpretable\nnor comparable to WHO country-level statistics.\n\n**DDD vs days_supply vs PDC: what each measures and when to prefer each**\n\nThese three quantities answer different questions from the same dispensing record:\n\n- **DDDs:** Volume of drug (how much was dispensed in WHO standardized units). Answers \"how\n  much?\" at both patient and population level. Insensitive to whether the patient actually took\n  the drug; depends entirely on dispensed quantity and strength.\n- **days_supply:** Coverage duration as recorded by the dispensing pharmacy. Answers \"how long\n  was this fill meant to last?\" Directly usable for PDC/MPR without conversion. Does not require\n  knowing the DDD.\n- **PDC (Proportion of Days Covered):** Adherence metric. 
Uses days_supply to build an exposure\n  timeline; union-rule prevents double-counting of stockpiled days; answers \"what fraction of the\n  observation window did the patient have medication on hand?\"\n\nFor US claims adherence studies (PDC, MPR, persistence), days_supply is the correct input —\nDDD is irrelevant. For international utilization comparison and volume measurement, DDD is the\ncorrect metric — days_supply units differ across countries and pharmacy systems.\n\n**Where DDD breaks: seven documented failure modes**\n\n1. **Pediatrics.** The DDD is defined for adults. Pediatric dosing is weight-based and age-\n   adjusted; a child prescribed amoxicillin at the correct pediatric dose will appear to receive\n   a fraction of the DDD that has no clinical interpretation. DDD/1000/day is uninformative for\n   pediatric populations and should never be used as an adherence proxy.\n\n2. **Renal and hepatic dosing.** Drugs with mandatory dose reductions in renal impairment (e.g.,\n   metformin is contraindicated in severe CKD; most direct oral anticoagulants require dose\n   reduction) will show patients receiving less than 1 DDD/day — not because of poor adherence\n   but because 1 DDD is the wrong target for their physiology.\n\n3. **PRN (as-needed) medications.** NSAIDs, analgesics, migraine treatments, and benzodiazepines\n   used on demand have no stable daily dose. DDD/1000/day for a PRN drug mixes usage intensity\n   with number of users in a way that is difficult to disentangle.\n\n4. **Topical and transdermal formulations.** Many dermatological preparations have DDDs expressed\n   in grams of ointment, which bears no obvious relationship to therapeutic effect or coverage.\n\n5. 
**Combination products.** As noted above, a single dispensing event for a fixed-dose\n   combination generates multiple DDDs for different component ATC codes, requiring care to\n   avoid double-counting at the patient level while correctly attributing volume to each\n   component at the population level.\n\n6. **Biologics and many specialty drugs.** Some injectable biologics have no assigned DDD, either\n   because the molecule was approved after ATC assignment was established or because weight-based\n   or body-surface-area dosing makes a fixed DDD meaningless. Even where a DDD exists, it can be\n   awkward to apply: adalimumab has a parenteral DDD of 2.9 mg, which must be reconciled with\n   pen or syringe strengths through careful strength conversion.\n   Checking the WHO ATC/DDD index for a \"no DDD assigned\" status before computing any volume\n   metric is essential.\n\n7. **Off-label or non-main-indication use.** A patient prescribed a drug for an indication other\n   than the main indication used to set the DDD will appear to receive more or fewer DDDs than\n   expected. Aspirin at antithrombotic doses (75–100 mg) is prescribed at roughly 1/40 to 1/30 of\n   the 3 g analgesic DDD (N02BA01); a patient on aspirin for secondary prevention will appear to receive\n   only about 0.025–0.033 DDDs/day using the analgesic DDD, which is exactly wrong for characterizing\n   antithrombotic use.\n\n**ATC-DDD versioning: the reproducibility requirement**\n\nThe WHO Collaborating Centre updates the ATC/DDD system annually, typically effective 1 January.\nUpdates include: new ATC codes for newly approved substances, DDD changes for existing drugs\n(based on accumulating prescribing evidence), reclassifications moving a drug from one ATC\nbranch to another, and deletions. 
A DDD that was 2 g in one year may change in a subsequent\nupdate.\n\nFor RWE reproducibility, this means: (1) always report the ATC/DDD version used; (2) pin the\nversion at study initiation and do not update mid-study; (3) when replicating an earlier study,\nuse the version in force at the time of the original analysis; (4) be alert to version-change\nartifacts in longitudinal trend data — a step in DDD/1000/day over time may be an ATC\nreclassification rather than a real change in prescribing practice.\n\n**Pros, cons, and trade-offs**\n\n*ATC classification*\n- Pros: internationally standardized and maintained by a single authoritative body (WHO\n  Collaborating Centre, Oslo); enables cross-country and cross-database drug class comparison\n  without vocabulary harmonization; hierarchical design allows analysis at any level of\n  therapeutic granularity; freely available and openly documented; used as the classification\n  layer in OMOP CDM and linked to RxNorm via RxClass; required for regulatory submissions to\n  EMA and many HTA bodies; ESAC, OECD, WHO country-level statistics all report in ATC/DDD.\n- Cons: not natively present in US pharmacy claims (requires NDC→RxNorm→ATC crosswalk with\n  associated lossiness); annual versioning creates reproducibility obligations; combination\n  products require multi-code assignment; indication-dependent codes (aspirin) demand clinical\n  context that claims rarely provide; lag for newly approved substances.\n\n*DDD as a utilization unit*\n- Pros: scale-invariant, enabling direct comparison across drugs of different potency;\n  internationally recognized and applied consistently; allows computation of \"treated patient\n  equivalents\" for budget-impact and policy purposes; technically simple once the DDD value is\n  known (dispensed quantity × strength / DDD).\n- Cons: not a recommended dose, not a therapeutic target, not a coverage measure, and not an\n  adherence metric; systematically wrong for pediatrics, 
renally-adjusted doses, PRN use,\n  biologics without DDDs, and off-label use; requires knowledge of dispensed quantity (tablets)\n  and strength (mg), which is sometimes absent or unreliable in claims; combination products\n  require splitting.\n\n**When to use**\n\nUse ATC classification when: building drug exposure definitions that must be comparable across\ndata sources, countries, or time periods; reporting to regulatory bodies or HTA organizations\nthat require ATC codes; constructing therapeutic-class-level drug lists from US claims\n(NDC→RxNorm→ATC→expand back to ingredient) for inclusion/exclusion criteria; linking US claims\ndata to international registries or European databases; constructing off-label use studies where\ntherapeutic class membership is the independent variable.\n\nUse DDD/1000/day when: measuring population-level drug utilization volume for post-marketing\nsurveillance, stewardship programs, or market-access submissions; comparing utilization across\ncountries, regions, or time periods in a format consistent with WHO/OECD statistics; computing\nbudget-impact denominators at the health system level; expressing drug volume in a unit that is\nindependent of the number of dosing units per package.\n\n**When NOT to use**\n\nDo NOT use DDD as a proxy for adherence or days of coverage in individual-level analyses.\nFor adherence, use days_supply-based PDC or MPR. A patient who fills 90 DDDs of metformin\n(180 tablets × 1000 mg) is not necessarily adherent for 90 days; the actual coverage depends\non the prescribed regimen, which the DDD does not encode.\n\nDo NOT apply ATC/DDD to pediatric populations without explicitly acknowledging and quantifying\nthe dose-DDD mismatch. Reporting DDD/1000 children/day for an antibiotic is numerically\ncomputable but clinically misleading.\n\nDo NOT treat the ATC/DDD crosswalk as a solved problem in US claims. 
The NDC→RxNorm→ATC\nchain fails silently for: combination products (one NDC → multiple ATC codes), off-label\nuse (wrong ATC code selected), biologics without DDDs, and newly approved drugs without ATC\ncodes. Always audit unmapped NDCs and report the fraction assigned an ATC code.\n\nDo NOT use a stale ATC/DDD version across a multi-year study without checking whether any\nrelevant DDD assignments or reclassifications occurred during the study period. A trend artifact\ncreated by a WHO version update is indistinguishable from a real prescribing shift in the data.\n\nDo NOT use DDD to compare dosing adequacy between renally impaired patients and patients with\nnormal renal function. The apparently lower DDDs in CKD patients reflect guideline-concordant\ndose reduction, not inadequate treatment.\n\n**Interpreting the output**\n\nIn the worked example, a single dispensing of metformin 1000 mg × 180 tablets yields 90 DDDs.\nAt the plan level, 90000 DDDs dispensed across 10000 members in a 30-day window yields\n300 DDD/1000 members/day.\n\n*(1) Formal interpretation.* The 90-DDD single-dispensing quantity is a volume measure\nexpressing how many \"standard treatment days\" (at the WHO assumed maintenance dose of 2000 mg)\nare contained in this dispensing event. The ratio DDDs = 180000 mg / 2000 mg = 90 is\ndimensionless in the sense that it expresses dispensed volume relative to the reference unit.\nThe population rate of 300 DDD/1000 members/day is a utilization intensity measure: if every\ntreated member took exactly one DDD of metformin daily, 30% of members (300 per 1000)\nwould be on therapy. 
Because actual prescribed doses (typically 1000–2000 mg/day) sit at or just below the\n2000 mg DDD for metformin, this metric is reasonably well-behaved for this drug — but\nthe same arithmetic applied to aspirin or a biologic would require far more interpretation.\n\n*(2) Practical interpretation.* A plan pharmacist reviewing the 300 DDD/1000/day figure can\ncompare it to WHO country-level benchmarks or prior-year plan data to assess whether metformin\nutilization is changing. A formulary analyst can multiply the total DDDs dispensed by the\ncost per DDD to estimate plan spend. Neither can conclude from this metric alone that patients\nare adherent, that doses are appropriate, or that metformin is causing any clinical outcome.\nIf the question becomes \"are patients taking enough metformin to achieve glycemic targets,\" the\nDDD is the wrong tool — the analyst needs HbA1c laboratory data and prescribed dose, not DDDs.",
    "primary_category": "Data_Standard",
    "tags": [
      "atc",
      "ddd",
      "defined-daily-dose",
      "coding-system",
      "data-standard",
      "drug-classification",
      "who-atc",
      "drug-utilization",
      "pharmacoepidemiology",
      "drug-exposure"
    ],
    "applies_to_study_types": [
      "drug_utilization",
      "cohort_retrospective",
      "claims_analysis",
      "ehr_study",
      "registry_study"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.3390/pharmacy9010060",
        "url": "https://doi.org/10.3390/pharmacy9010060",
        "citation_text": "Hollingworth SA, Kairuz T. Measuring Medicine Use: Applying ATC/DDD Methodology to Real-World Data. Pharmacy. 2021;9(1):60.",
        "year": 2021,
        "authors_short": "Hollingworth & Kairuz",
        "notes": "A direct methods guide for applying the WHO ATC/DDD system to real-world dispensing data, covering how to convert dispensed quantities to DDDs, compute DDD/1000/day rates, and interpret utilization outputs. The paper explicitly addresses the DDD-as-measurement-unit distinction and several of the documented failure modes. Crossref-verified: first author Hollingworth, year 2021."
      },
      {
        "role": "explain",
        "doi": "10.1002/pds.5490",
        "url": "https://doi.org/10.1002/pds.5490",
        "citation_text": "Rasmussen L, Wettermark B, Steinke D, Pottegård A. Core concepts in pharmacoepidemiology: Measures of drug utilization based on individual-level drug dispensing data. Pharmacoepidemiology and Drug Safety. 2022;31(10):1015-1026.",
        "year": 2022,
        "authors_short": "Rasmussen et al.",
        "notes": "Canonical methods statement for individual-level drug utilization measures from dispensing data. Covers the DDD unit as the volume denominator for aggregate utilization and clarifies the distinction between DDD-based volume metrics and days_supply-based coverage measures (PDC/MPR). Crossref-verified: first author Rasmussen, year 2022."
      },
      {
        "role": "demonstrate",
        "doi": "10.1111/j.1742-7843.2009.00494.x",
        "url": "https://doi.org/10.1111/j.1742-7843.2009.00494.x",
        "citation_text": "Furu K, Wettermark B, Andersen M, Martikainen JE, Almarsdottir AB, Sørensen HT. The Nordic Countries as a Cohort for Pharmacoepidemiological Research. Basic and Clinical Pharmacology and Toxicology. 2010;106(2):86-94.",
        "year": 2010,
        "authors_short": "Furu et al.",
        "notes": "Describes large-scale pharmacoepidemiological infrastructure in the Nordic countries where ATC/DDD is the universal drug coding and measurement standard in all national prescription registries. Demonstrates how the ATC/DDD system enables pan-Nordic and cross-national drug utilization comparisons using administrative dispensing data — the paradigm application this catalog entry supports. Crossref-verified: first author Furu, year 2010."
      },
      {
        "role": "use",
        "doi": null,
        "url": "https://www.whocc.no/atc_ddd_index/",
        "citation_text": "WHO Collaborating Centre for Drug Statistics Methodology. ATC/DDD Index 2024. Oslo: WHO Collaborating Centre for Drug Statistics Methodology. Available at: https://www.whocc.no/atc_ddd_index/. Accessed 2026.",
        "year": null,
        "authors_short": "WHO Collaborating Centre",
        "notes": "The authoritative online ATC/DDD index — look up any substance's ATC code, DDD value, unit, route of administration, and assigned main indication. Updated annually; version must be pinned for reproducible studies. The definitive reference for confirming whether a biologic or newly approved substance has an assigned DDD."
      }
    ],
    "plain_language_summary": "The WHO ATC (Anatomical Therapeutic Chemical) system is a five-level coding hierarchy that organizes every drug from broad organ system down to the specific molecule — for example, metformin gets the code A10BA02, placing it in the diabetes drug group (A10), biguanide class (A10BA), and then the specific substance (A10BA02). The DDD (Defined Daily Dose) is the number the WHO assigns as the typical adult maintenance dose per day for that drug's main use — for metformin it is 2 g (2000 mg). Crucially, the DDD is a measurement unit like a \"standard drink,\" not a prescription: it lets researchers count and compare how much of a drug a population uses, without that number telling them anything about whether any individual patient was on the right dose. US pharmacy claims use NDC drug codes instead of ATC, so analysts working with American data must convert through an intermediate vocabulary (RxNorm) to reach ATC — a translation that sometimes fails for drugs with multiple uses or newly approved medicines.",
    "key_terms": [
      {
        "term": "ATC code",
        "definition": "A five-character alphanumeric code assigned by the WHO to each drug substance, classifying it from anatomical group (first character) down to specific chemical substance (fifth position) — for example, A10BA02 for metformin."
      },
      {
        "term": "DDD (Defined Daily Dose)",
        "definition": "The WHO's assumed average maintenance dose per day for a drug used for its main indication in adults — a measurement unit for comparing drug volumes across countries, not a recommended prescribing dose."
      },
      {
        "term": "DDD/1000 inhabitants/day",
        "definition": "The standard population-level utilization metric: total DDDs dispensed in a period divided by the number of people in the population times the number of days, then multiplied by 1000 — expressing how many of every 1000 people would be on the drug if each took exactly one DDD per day."
      },
      {
        "term": "NDC-to-ATC crosswalk",
        "definition": "The three-step translation used with US claims data: NDC (package-level US drug code) maps to RxNorm (ingredient-level) via the RxNav API, then RxNorm maps to ATC class via the RxClass service — lossy for combination products and drugs with multiple therapeutic indications."
      },
      {
        "term": "ATC version pinning",
        "definition": "The practice of recording and fixing which annual release of the WHO ATC/DDD index was used for a study, so that results are reproducible even after the WHO updates DDD values or reclassifies drugs in subsequent years."
      },
      {
        "term": "combination product trap",
        "definition": "The mapping problem that arises when a single dispensed product (one NDC) contains two active ingredients with different ATC codes — for example, metformin/sitagliptin — requiring either two separate DDD calculations or a deliberate single-component assignment choice."
      }
    ],
    "worked_example": {
      "scenario": "A health plan pharmacist wants to report metformin utilization for March 2023 in the format required by a WHO drug utilization study: DDDs per 1000 members per day. She starts with one representative pharmacy claim to confirm the DDD arithmetic, then scales to the full plan population. Metformin's ATC code is A10BA02 and its WHO DDD is 2000 mg (2 g).",
      "dataset": {
        "caption": "One pharmacy claim for metformin (representative of the plan data), plus the plan-level aggregate. The claim shows the dispensed product at the ATC and DDD level needed for calculation. Strength is per tablet; quantity is the number of tablets dispensed.",
        "columns": [
          "claim_id",
          "fill_date",
          "drug_name",
          "atc_code",
          "who_ddd_mg",
          "quantity_tablets",
          "strength_mg_per_tablet"
        ],
        "rows": [
          [
            "C001",
            "2023-03-01",
            "metformin",
            "A10BA02",
            2000,
            180,
            1000
          ]
        ]
      },
      "steps": [
        "Total milligrams dispensed in claim C001: 180 * 1000 = 180000 mg.",
        "Convert to DDDs: 180000 / 2000 = 90 DDDs. This single dispensing contains 90 standard treatment days of metformin at the WHO reference dose — note this does not mean the patient is adherent for 90 days; the actual coverage depends on the prescribed regimen (days_supply on the claim, not the DDD).",
        "Summing across all metformin claims for the plan's 10000 members in March 2023 (30 days), total DDDs dispensed = 90000 DDDs (hypothetical plan-level aggregate).",
        "Compute denominator in person-days: 10000 * 30 = 300000 person-days.",
        "DDD per person per day: 90000 / 300000 = 0.3.",
        "Scale to per 1000 members: 0.3 * 1000 = 300 DDD/1000 members/day. This is the standard WHO utilization rate. It means: if every member took exactly one DDD of metformin daily, 300 out of 1000 (30%) would be on therapy."
      ],
      "result": "Single claim: 180 * 1000 = 180000 mg total; 180000 / 2000 = 90 DDDs. Plan level March 2023: 90000 DDDs across 10000 members over 30 days; denominator = 10000 * 30 = 300000 person-days; rate = 90000 / 300000 = 0.3 DDD/person/day; scaled = 0.3 * 1000 = 300 DDD/1000 members/day. This rate is a volume measure — it does not indicate adherence, prescribed dose adequacy, or clinical outcomes."
    },
    "prerequisites": [
      "ndc-national-drug-code",
      "drug-utilization"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Aggregate DDD/1000 inhabitants/day from pharmacy claims",
        "description": "The canonical population-level utilization metric: map each claim's NDC to its ATC code, convert dispensed quantity × strength to DDDs using the WHO DDD value for that ATC, sum all DDDs in the study period, and divide by enrolled person-days × 1000. Use enrolled person-days (not census population) as the denominator when working from claims to avoid mixing the numerator and denominator populations.",
        "edge_cases": [
          "Combination products (one NDC → two ATC codes) require splitting the DDD calculation: each active component is counted under its own ATC code, so the same dispensing event contributes DDDs to both A10BA02 (metformin) and A10BH01 (sitagliptin) in a Janumet claim.",
          "A drug with no assigned DDD (e.g., some biologics, newly approved agents) cannot be included in DDD-based aggregate metrics; report its utilization in defined daily dose equivalents or in original dispensing units (vials, injections) instead."
        ],
        "data_source_notes": "Claims: requires NDC-to-ATC mapping via RxNorm/RxClass; audit unmapped fraction before finalizing; use enrolled person-days as denominator; exclude MA-only person-time in Medicare where FFS pharmacy claims are missing. Registry: many European prescription registries store ATC directly on the dispensing record, bypassing the crosswalk step."
      },
      {
        "name": "Individual-level DDD per patient (exposure intensity)",
        "description": "Per-patient DDDs dispensed in a window — total dispensed mg / WHO DDD. Useful as an exposure variable (dose intensity) in dose-response analyses or as a comparator between cohorts when actual prescribed dose varies. Distinguish from days_supply-based PDC: DDDs measure volume, PDC measures coverage duration.",
        "edge_cases": [
          "Dose changes mid-window invalidate a simple total-DDDs/window calculation; model time-varying dose or restrict to periods of stable dosing.",
          "For drugs where the DDD differs markedly from the typical prescribed dose (e.g., aspirin for antithrombotic use at 81 mg vs the analgesic DDD of 3000 mg), per-patient DDDs will be consistently much less than 1.0/day and the metric is difficult to interpret."
        ],
        "data_source_notes": "Claims: dispensed quantity and strength fields must be present and credible; a quality check on implausibly large quantities (>365 DDDs in one fill) or implausibly small strengths is recommended before computing patient-level DDD totals."
      },
      {
        "name": "ATC class-level drug list construction for cohort eligibility",
        "description": "Use ATC hierarchy as the inclusion/exclusion principle: identify all ingredients belonging to an ATC class (e.g., A10B for all non-insulin antidiabetics), map via RxClass to RxCUIs, expand to NDCs, and filter claims. This approach is reproducible and self-updating for new generics; the class definition is documented by its ATC code.",
        "edge_cases": [
          "RxClass ATC membership updates may lag the WHO annual release; pin both the RxNorm and ATC/DDD version dates in the study protocol for reproducibility.",
          "Drugs classified at ATC level 5 under multiple codes (aspirin dual-role) require an explicit decision rule: which ATC classification governs inclusion — the antiplatelet code (B01AC06), the analgesic code (N02BA01), or both? The choice must be pre-specified and documented."
        ],
        "data_source_notes": "Claims: use RxClass `getClassMembers` API (classType=ATC1-4) to pull all ingredient RxCUIs for a given ATC code, then expand to NDC via RxNorm to build the claims filter list; refresh when the ATC/DDD annual update is released. EHR: ATC codes may be present natively in European EHR systems; in US EHR, the same NDC→RxNorm→ATC chain applies."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "ndc-national-drug-code",
        "pros_of_this": "ATC codes are internationally standardized across countries, data sources, and labelers; a single ATC code covers the drug's entire market regardless of manufacturer or package. DDD provides a scale-invariant volume unit that enables cross-country and cross-drug comparison impossible with NDC alone.",
        "cons_of_this": "ATC/DDD is not natively present in US pharmacy claims and requires a three-step crosswalk (NDC→RxNorm→ATC) that is lossy for combination products and indication-dependent drugs. NDC provides exact package-level identification — the granularity needed for lot-specific safety signals, exact reimbursement, and days_supply-based adherence algorithms.",
        "when_to_prefer": "Prefer ATC/DDD for population-level volume reporting, cross-country comparison, and regulatory/HTA submissions requiring WHO-compatible metrics. Prefer NDC for US claims processing, adherence measurement (PDC/MPR), and labeler-specific pharmacovigilance."
      },
      {
        "compared_to": "drug-utilization",
        "pros_of_this": "ATC classification and DDD supply the *measurement vocabulary* for drug utilization studies. Without ATC codes, a DUS cannot group drugs by therapeutic class; without DDDs, it cannot express volume in a cross-drug-comparable unit.",
        "cons_of_this": "ATC/DDD alone does not produce a utilization study; population denominators, study windows, washout rules, and analytic decisions are all outside the ATC/DDD system. A drug utilization study requires ATC/DDD as a tool, not as a design.",
        "when_to_prefer": "ATC/DDD is always the preferred measurement layer for drug utilization research involving volume, trends, or international comparison. Think of it as the unit system; the DUS is what you build with those units."
      },
      {
        "compared_to": "rxnorm-drug-terminology",
        "pros_of_this": "ATC provides a therapeutic-class hierarchy directly usable for cross-country benchmarking and regulatory submissions. RxNorm does not carry a built-in therapeutic classification; it requires a separate RxClass query to reach ATC.",
        "cons_of_this": "For US claims work, RxNorm is the bridge that makes ATC reachable; ATC alone cannot be joined to US pharmacy claims without first going through RxNorm. RxNorm also handles the ingredient-level rollup (collapsing hundreds of NDCs to one concept) that ATC does not.",
        "when_to_prefer": "Use RxNorm as the normalization step and ATC as the classification/reporting layer. The two systems are complementary, not competing."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Map each pharmacy claim NDC to its ATC code via the NDC→RxNorm→ATC chain (RxNav ndcstatus + RxClass getClassMembers). Store the ATC code, DDD value, and DDD unit from the WHO index alongside the mapped claim. For each claim, compute ddd_dispensed = (quantity_dispensed × strength_mg) / (WHO_DDD_mg). Audit the fraction of claims mapped to an ATC code and characterize unmapped claims by drug class and calendar year before reporting. For the denominator, use FFS-observable enrolled person-days; exclude Medicare Advantage-only spans where pharmacy claims are not observable.",
      "ehr": "European EHR and national prescription registries often store ATC codes directly on the dispensing or prescribing record, bypassing the NDC crosswalk entirely. Verify the ATC version in use by the system. For US EHR, apply the same NDC→RxNorm→ATC chain as for claims; note that EHR medication order records reflect the prescribed drug but not necessarily the dispensed product, so quantity and strength fields may be less reliable than in adjudicated pharmacy claims.",
      "registry": "Disease registries linked to prescription registries (common in Scandinavia and in UK resources such as UK Biobank and CPRD) typically provide ATC codes at the dispensing level. Pin the ATC/DDD version used in the source registry alongside the study-assigned version to detect version-change artifacts when combining data sources. For registries without ATC codes, apply the RxNorm crosswalk or use the WHO ATC/DDD index look-up table directly by drug name and strength.",
      "linked": "When linking US claims to international registries or European cohorts, ATC is the common drug classification language. Standardize on ATC level 5 for ingredient-level comparisons and ATC levels 2–4 for class-level comparisons. Document any DDD version differences between the US-side (claims-derived) and international-side (registry-stored) sources and perform a sensitivity analysis using a common pinned version."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import requests\nimport pandas as pd\n\nWHO_DDD_TABLE = {\n    \"A10BA02\": {\"ddd_mg\": 2000, \"unit\": \"mg\", \"substance\": \"metformin\"},\n    \"C10AA01\": {\"ddd_mg\": 20,   \"unit\": \"mg\", \"substance\": \"simvastatin\"},\n    \"C10AA05\": {\"ddd_mg\": 20,   \"unit\": \"mg\", \"substance\": \"atorvastatin\"},\n    \"N02BA01\": {\"ddd_mg\": 3000, \"unit\": \"mg\", \"substance\": \"aspirin (analgesic)\"},\n    \"B01AC06\": {\"ddd_mg\": 75,   \"unit\": \"mg\", \"substance\": \"aspirin (antiplatelet)\"},\n}\n# NOTE: always load DDD values from the pinned WHO ATC/DDD index release for your study;\n# this table is illustrative only. Pin the release year in every study protocol.\nATC_DDD_RELEASE = \"2024\"\n\n\n# ── A. DDD value lookup ──────────────────────────────────────────────────────\ndef get_ddd_mg(atc_code: str) -> dict | None:\n    \"\"\"Return the WHO DDD metadata for an ATC code from the local reference table.\n    In production, load the full WHO index CSV; this snippet uses a small illustrative dict.\"\"\"\n    entry = WHO_DDD_TABLE.get(atc_code.upper())\n    if entry:\n        return {**entry, \"atc\": atc_code, \"release\": ATC_DDD_RELEASE}\n    return None\n\n\n# ── B. Convert a single dispensing to DDDs ───────────────────────────────────\ndef dispensing_to_ddds(quantity_units: int, strength_mg: float, atc_code: str) -> float | None:\n    \"\"\"Convert a dispensing event (quantity × strength) to DDDs for the given ATC code.\n\n    Parameters\n    ----------\n    quantity_units : number of tablets, capsules, or other unit-dose items dispensed\n    strength_mg    : milligrams per unit-dose item\n    atc_code       : 7-character ATC code (e.g. 'A10BA02')\n\n    Returns None if the ATC code has no DDD entry (e.g. 
biologic without a DDD).\n\n    Examples\n    --------\n    >>> dispensing_to_ddds(180, 1000, \"A10BA02\")\n    90.0  # 180 * 1000 / 2000\n    >>> dispensing_to_ddds(30, 20, \"C10AA05\")\n    30.0  # 30 * 20 / 20\n    \"\"\"\n    ddd_info = get_ddd_mg(atc_code)\n    if ddd_info is None:\n        return None\n    total_mg = quantity_units * strength_mg\n    return total_mg / ddd_info[\"ddd_mg\"]\n\n\n# ── C. DDD/1000 enrolled members/day from a claims DataFrame ────────────────\ndef ddd_per_1000_per_day(\n    rx: pd.DataFrame,\n    enroll: pd.DataFrame,\n    atc_code: str,\n    period_start: pd.Timestamp,\n    period_end: pd.Timestamp,\n) -> float:\n    \"\"\"Compute WHO aggregate DDD/1000 enrolled members/day for one ATC code and period.\n\n    rx columns: fill_date (Timestamp), atc_code (str), quantity_units (int), strength_mg (float)\n    enroll columns: person_id (any), enroll_start (Timestamp), enroll_end (Timestamp),\n                    ma_only (bool)  — exclude MA-only spans (no FFS pharmacy claims)\n\n    Returns nan if no enrolled person-days in the period.\n    \"\"\"\n    ddd_info = get_ddd_mg(atc_code)\n    if ddd_info is None:\n        return float(\"nan\")\n\n    # Numerator: total DDDs dispensed in the period\n    mask = (\n        (rx[\"atc_code\"] == atc_code)\n        & (rx[\"fill_date\"] >= period_start)\n        & (rx[\"fill_date\"] <= period_end)\n    )\n    rx_period = rx.loc[mask].copy()\n    rx_period[\"ddds\"] = (\n        rx_period[\"quantity_units\"] * rx_period[\"strength_mg\"] / ddd_info[\"ddd_mg\"]\n    )\n    total_ddds = rx_period[\"ddds\"].sum()\n\n    # Denominator: FFS-observable enrolled person-days in the period\n    e = enroll.loc[~enroll[\"ma_only\"]].copy()\n    e[\"days\"] = (\n        e[\"enroll_end\"].clip(upper=period_end) - e[\"enroll_start\"].clip(lower=period_start)\n    ).dt.days.clip(lower=0)\n    person_days = e[\"days\"].sum()\n\n    if person_days == 0:\n        return float(\"nan\")\n\n    return 1000.0 * 
total_ddds / person_days\n\n\n# ── Demo ─────────────────────────────────────────────────────────────────────\nif __name__ == \"__main__\":\n    # Single-claim DDD verification from the worked example\n    qty, strength, atc = 180, 1000, \"A10BA02\"\n    ddds = dispensing_to_ddds(qty, strength, atc)\n    print(f\"{qty} tablets × {strength} mg / {get_ddd_mg(atc)['ddd_mg']} mg DDD = {ddds:.0f} DDDs\")\n    # → 180 tablets × 1000 mg / 2000 mg DDD = 90 DDDs\n\n    # Population rate: 90000 DDDs across 10000 members in 30 days\n    total_ddds_plan = 90000\n    members = 10000\n    days = 30\n    rate = 1000 * total_ddds_plan / (members * days)\n    print(f\"Plan rate: 1000 * {total_ddds_plan} / ({members} * {days}) = {rate:.0f} DDD/1000/day\")\n    # → Plan rate: 1000 * 90000 / (10000 * 30) = 300 DDD/1000/day",
        "description": "Three utilities for ATC/DDD work in US pharmacy claims: (A) look up a substance's DDD value\nfrom a local reference table (an illustrative subset of the WHO ATC/DDD index; no RxNav API call is made);\n(B) convert a dispensing record's quantity and\nstrength to DDDs; (C) compute the DDD/1000 enrolled members/day aggregate metric from\npandas DataFrames of claims and enrollment spans. The functions use only pandas; requests is imported but not called in this snippet.",
        "dependencies": [
          "requests",
          "pandas"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(dplyr)\n\n# WHO DDD reference (illustrative subset — load from the full WHO index in production)\nWHO_DDD_TABLE <- data.frame(\n  atc_code    = c(\"A10BA02\", \"C10AA01\", \"C10AA05\", \"N02BA01\", \"B01AC06\"),\n  substance   = c(\"metformin\", \"simvastatin\", \"atorvastatin\",\n                  \"aspirin (analgesic)\", \"aspirin (antiplatelet)\"),\n  ddd_mg      = c(2000, 20, 20, 3000, 75),\n  unit        = rep(\"mg\", 5),\n  stringsAsFactors = FALSE\n)\nATC_DDD_RELEASE <- \"2024\"\n\n\n# ── A. DDD lookup ────────────────────────────────────────────────────────────\nget_ddd_mg <- function(atc_code) {\n  row <- WHO_DDD_TABLE[WHO_DDD_TABLE$atc_code == toupper(atc_code), ]\n  if (nrow(row) == 0) return(NULL)\n  as.list(c(row, release = ATC_DDD_RELEASE))\n}\n\n\n# ── B. Convert a dispensing to DDDs ──────────────────────────────────────────\ndispensing_to_ddds <- function(quantity_units, strength_mg, atc_code) {\n  # quantity_units: tablets/capsules/units dispensed\n  # strength_mg: mg per unit-dose item\n  # Returns NA if ATC code has no DDD entry.\n  ddd_info <- get_ddd_mg(atc_code)\n  if (is.null(ddd_info)) return(NA_real_)\n  total_mg <- quantity_units * strength_mg\n  total_mg / ddd_info$ddd_mg\n}\n\n\n# ── C. 
DDD/1000 enrolled members/day ─────────────────────────────────────────\nddd_per_1000_per_day <- function(rx, enroll, atc_code, period_start, period_end) {\n  # rx: fill_date (Date), atc_code (chr), quantity_units (int), strength_mg (dbl)\n  # enroll: person_id, enroll_start (Date), enroll_end (Date), ma_only (lgl)\n  ddd_info <- get_ddd_mg(atc_code)\n  if (is.null(ddd_info)) return(NA_real_)\n\n  # Numerator: total DDDs in period\n  rx_period <- rx |>\n    filter(atc_code == !!atc_code,\n           fill_date >= period_start,\n           fill_date <= period_end) |>\n    mutate(ddds = quantity_units * strength_mg / ddd_info$ddd_mg)\n  total_ddds <- sum(rx_period$ddds, na.rm = TRUE)\n\n  # Denominator: FFS-observable enrolled person-days (exclude MA-only spans)\n  person_days <- enroll |>\n    filter(!ma_only) |>\n    mutate(days = as.integer(\n      pmin(enroll_end, period_end) - pmax(enroll_start, period_start)\n    ) |> pmax(0L)) |>\n    pull(days) |>\n    sum()\n\n  if (person_days == 0) return(NA_real_)\n  1000 * total_ddds / person_days\n}\n\n\n# ── Demo: verify worked-example arithmetic ────────────────────────────────────\nddds_one_claim <- dispensing_to_ddds(180, 1000, \"A10BA02\")\ncat(sprintf(\"180 tablets x 1000 mg / 2000 mg DDD = %.0f DDDs\\n\", ddds_one_claim))\n# -> 180 tablets x 1000 mg / 2000 mg DDD = 90 DDDs\n\nplan_rate <- 1000 * 90000 / (10000 * 30)\ncat(sprintf(\"Plan rate: 1000 * 90000 / (10000 * 30) = %.0f DDD/1000/day\\n\", plan_rate))\n# -> Plan rate: 1000 * 90000 / (10000 * 30) = 300 DDD/1000/day",
        "description": "R equivalents using base R and dplyr: (A) DDD lookup from a local reference table; (B)\nper-claim DDD conversion; (C) DDD/1000 enrolled members/day from a data frame of claims.\nStructure mirrors the Python version. In production, load the WHO ATC/DDD index CSV; this\nsnippet uses a small illustrative lookup table.",
        "dependencies": [
          "dplyr"
        ],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  L1[\"Level 1 — Anatomical group (letter)\\nA = Alimentary tract and metabolism\"]\n  L2[\"Level 2 — Therapeutic subgroup (2 digits)\\nA10 = Drugs used in diabetes\"]\n  L3[\"Level 3 — Pharmacological subgroup (letter)\\nA10B = Blood glucose-lowering drugs, excl. insulins\"]\n  L4[\"Level 4 — Chemical subgroup (letter)\\nA10BA = Biguanides\"]\n  L5[\"Level 5 — Chemical substance (2 digits)\\nA10BA02 = metformin (DDD = 2000 mg)\"]\n  NDC[\"US pharmacy claim NDC\\n(package-level, labeler-specific)\"]\n  RXNORM[\"RxNorm ingredient rollup\\n(via RxNav ndcstatus API)\"]\n  ATC[\"ATC code\\n(via RxClass ATC1-4 query)\"]\n  L1 --> L2 --> L3 --> L4 --> L5\n  NDC -->|\"Step 1: NDC → RxCUI (ingredient)\"| RXNORM\n  RXNORM -->|\"Step 2: RxCUI → ATC (via RxClass)\"| ATC\n  ATC -->|\"Lossy for combination products\\nand dual-indication drugs (aspirin)\"| ATC",
        "caption": "The five-level ATC hierarchy for metformin (A10BA02) and the three-step NDC→RxNorm→ATC crosswalk path used to assign ATC codes in US pharmacy claims. The crosswalk is lossy for combination products and dual-indication drugs.",
        "alt_text": "Flowchart showing the ATC hierarchy five-level breakdown for metformin and the NDC-to-RxNorm to ATC translation chain, annotated with the combination-product and dual-indication failure modes.",
        "source_type": "illustrative",
        "source_citations": [
          "hollingworth-kairuz-2021"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "see_also",
        "target_slug": "ndc-national-drug-code",
        "notes": "NDC is the US billing counterpart to ATC — the package-level identifier on every US pharmacy claim. Mapping from NDC to ATC (via RxNorm) is the first step in applying the WHO utilization standard to US claims data; the lossiness of that crosswalk is the primary operational challenge for US-based ATC/DDD research."
      },
      {
        "relation_type": "see_also",
        "target_slug": "rxnorm-drug-terminology",
        "notes": "RxNorm is the crosswalk bridge: NDC codes map to RxNorm ingredient RxCUIs, and those RxCUIs map to ATC class codes via the RxClass service. Understanding how RxNorm term types roll up (SCD → ingredient), and that the final ingredient → ATC link comes from RxClass rather than from RxNorm itself, is prerequisite for correctly implementing the NDC→RxNorm→ATC chain in US claims."
      },
      {
        "relation_type": "used_with",
        "target_slug": "medical-code-crosswalks-mappings",
        "notes": "The NDC-to-ATC translation is a specific instance of medical code crosswalk work, subject to all the general challenges of versioning, unmapped fractions, and many-to-many relationships documented in the crosswalk methods entry."
      },
      {
        "relation_type": "used_with",
        "target_slug": "drug-utilization",
        "notes": "ATC/DDD supplies the primitive unit and therapeutic classification for drug utilization studies. The DDD is the numerator unit in DDD/1000/day (the denominator is person-days); the ATC code defines the drug class for any utilization volume calculation. Drug utilization study methods depend on ATC/DDD without replacing it."
      },
      {
        "relation_type": "see_also",
        "target_slug": "omop-standardized-vocabularies",
        "notes": "The OMOP Common Data Model includes ATC as a vocabulary in the CONCEPT table, linked to RxNorm standard concepts via CONCEPT_RELATIONSHIP and CONCEPT_ANCESTOR. ATC-level queries in OMOP use the ATC vocabulary_id alongside the RxNorm standard drug concepts."
      }
    ],
    "aliases": [
      "ATC classification",
      "ATC/DDD",
      "Defined Daily Dose",
      "DDD",
      "Anatomical Therapeutic Chemical",
      "WHO ATC"
    ],
    "complexity": "foundational",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "attributable-risk-paf",
    "name": "Attributable Risk and Population Attributable Fraction",
    "short_definition": "Attributable risk (AR) is the absolute excess risk in exposed persons compared to unexposed; the attributable fraction among the exposed (AF_exposed) converts that to the proportion of exposed cases attributable to the exposure; and the population attributable fraction (PAF, Levin's formula) extends to the whole population by weighting by exposure prevalence — together they answer not just how strong an association is, but how much of the disease burden would disappear if the exposure were eliminated, under an explicit causal assumption.",
    "long_description": "**The measure family: AR, AF_exposed, and PAF**\n\nThree related but distinct quantities address the \"how much burden is due to this exposure?\"\nquestion, and confusing them is one of the most common errors in burden-of-disease reporting.\n\n*Attributable risk (AR)*, also called the risk difference among the exposed or the excess\nrisk, is: AR = risk_exposed − risk_unexposed. It answers \"by how much does being exposed\nincrease my absolute probability of disease?\" Like all absolute risk differences, AR is\nclinically concrete and directly translatable to number needed to treat (NNT = 1/AR), but it\nis sensitive to background risk — the same causal RR produces a larger AR in a high-risk\npopulation than in a low-risk one, so AR is not transportable across populations without\nre-anchoring to local baseline risk.\n\n*Attributable fraction among the exposed (AF_exposed)*, also called the aetiologic fraction or\nattributable proportion, converts AR to a proportion: AF_exposed = AR / risk_exposed =\n(RR − 1) / RR. It answers \"among all cases in the exposed group, what fraction would not have\noccurred if the exposed group had the unexposed risk level?\" A value of 0.5 means half the\nevents in the exposed group are attributable to the exposure.\n\n*Population attributable fraction (PAF)* extends AF_exposed to the whole population by\nweighting by exposure prevalence (p, the proportion of the population that is exposed):\nPAF = p × (RR − 1) / (p × (RR − 1) + 1). This is Levin's formula (1953). It answers\n\"if the exposure were completely eliminated from the population, what fraction of ALL cases\nwould be prevented?\" PAF is the measure used in burden-of-disease analyses, payer narratives,\nand public health priority-setting. 
It is prevalence-dependent: the same RR yields a higher\nPAF in a population with greater exposure prevalence.\n\n**Levin's formula and Miettinen's adjusted estimator**\n\nLevin's formula is derived from the marginal risks: PAF = (R_total − R_unexposed) / R_total,\nwhich algebraically reduces to p*(RR−1)/(p*(RR−1)+1). This is the version in most introductory\nepidemiology texts and burden-of-disease papers. It is valid only when RR is the unconfounded,\nmarginal (causal) risk ratio and p is the exposure prevalence in the source population.\n\nA critical and frequently overlooked subtlety: using Levin's formula with a confounder-adjusted\nRR from a multivariable regression model is mathematically incorrect under confounding.\nThe Levin formula assumes the RR is the crude marginal risk ratio; substituting an adjusted\n(conditional) RR produces a biased PAF because it conflates marginal and conditional\nquantities. The correct approach when using an adjusted RR is Miettinen's case-based formula\n(1974): PAF = proportion_of_cases_exposed × (1 − 1/RR_adjusted). This uses the adjusted RR\nbut weights by the fraction of cases (not the fraction of the population) that are exposed —\na subtle but important difference. G-computation (outcome regression standardization) and\ntargeted maximum likelihood estimation (TMLE) provide model-based PAF estimates that\naccommodate full confounding adjustment and complex covariate structures; of the two, only\nTMLE is doubly robust.\n\n**Interpreting the output**\n\nTaking the worked example — PAF = 16.7% (illustrative 95% CI 8%–25%) — two layers of\ninterpretation are required.\n\n*Formal interpretation:* Under the assumptions that (1) the RR of 2.0 is causal (no\nunmeasured confounding, no measurement error, no selection bias) and (2) hypertension could\nbe completely and instantaneously eliminated from this population with all else held fixed,\napproximately 16.7% of cardiovascular events would not have occurred over this follow-up\nperiod. 
This PAF is prevalence-dependent: the same RR of 2.0 yields PAF ≈ 4.8% in a\npopulation where only 5% have hypertension (Levin: 0.05×1 / (0.05×1 + 1) = 0.05/1.05\n≈ 0.048), PAF ≈ 9.1% at 10% prevalence, and PAF = 50% in a population that is\nuniversally hypertensive. PAFs across multiple risk factors (hypertension, smoking, diabetes)\ndo NOT sum to 100% and can sum well beyond 100% because the counterfactual elimination of\neach factor independently removes cases that overlap with those removed by eliminating others.\nA confounded RR makes the PAF causally meaningless: it would answer a scenario where the\nassociation is broken without removing the exposure's causal path, which is not a coherent\nintervention.\n\n*Practical interpretation:* In plain English, roughly 1 in 6 cardiovascular events in this\npopulation is associated with hypertension at today's prevalence levels — but only if the\nlink is genuinely causal, and only if hypertension could be fully eliminated. In practice,\ntreating hypertension reduces but does not eliminate elevated blood pressure, so the\nrealistically preventable fraction is smaller than the PAF. This is the distinction between\nthe population attributable fraction (complete elimination) and the population intervention\neffect (PIE), which models a realistic partial reduction.\n\n**The misuse catalog**\n\nRockhill, Newman, and Weinberg (1998) identified four categories of PAF misuse that remain\nprevalent in HEOR and burden-of-disease literature:\n\n1. *Summing PAFs across risk factors.* If 50% of cardiovascular events are attributable to\nhypertension, 30% to smoking, and 25% to diabetes in the same population, these do NOT imply\nthat only 5% are unexplained. PAFs across overlapping exposures can and do sum well beyond\n100%, because the counterfactual elimination of each factor independently removes cases that\nwould also be removed by eliminating others. 
Joint counterfactual analysis or multiplicative\ndecomposition is required for multi-factor attribution.\n\n2. *Using confounded or unadjusted RRs.* A crude RR inflated by confounding produces a\nspuriously large PAF that does not correspond to any coherent causal quantity. Always use\nthe best available unconfounded RR estimate, and use Miettinen's formula or g-computation\nrather than Levin's when an adjusted RR is required.\n\n3. *Assuming complete and instantaneous exposure elimination.* Real interventions are partial:\nantihypertensive therapy reduces, not eliminates, elevated blood pressure. The true preventable\nfraction under a realistic policy is always smaller than the PAF. Report the PAF as an\nupper bound on what is theoretically preventable, not as the expected impact of a program.\n\n4. *Transporting PAFs across populations with different exposure prevalence.* A PAF from a\nEuropean cohort with 15% hypertension prevalence is not applicable to a US cohort with 40%\nprevalence. Only the (causal, unconfounded) RR is transportable; the PAF must be re-computed\nusing the local prevalence and the transported RR.\n\n**RWE and HEOR applications**\n\nPAF appears most prominently in two HEOR contexts: burden-of-disease studies and payer\nnarratives linking treatment gaps to outcomes.\n\n*Burden of disease:* Global burden of disease analyses use risk-weighted PAF estimates to\napportion deaths and disability-adjusted life years (DALYs) to modifiable risk factors. In\ncommercial HEOR work, the same logic supports claims such as \"X% of hospitalizations for\ncondition Y are attributable to medication nonadherence.\" These claims almost always rest on\nan observational RR, embedded causal assumptions, and a specific population's nonadherence\nprevalence. 
They overreach causally whenever the RR is confounded, and honest HEOR dossiers\ndistinguish the associative PAF from a causal PAF while providing sensitivity analyses over\nthe RR and prevalence assumptions.\n\n*Connection to attributable costs:* The PAF logic extends naturally to expenditure.\nAttributable cost = total condition cost × PAF, where PAF is computed from the fraction of\ntotal spending in a payer dataset that could be prevented by eliminating the exposure. The\n`all-cause-vs-attributable-costs-rwe` concept covers the HEOR-specific approach to separating\nall-cause and attributable spending; the PAF entry here provides the rate-based foundation.\n\n**Pros, cons, and trade-offs**\n\n*PAF vs attributable risk (AR):* PAF incorporates exposure prevalence and is the\npopulation-level burden measure relevant to payers and public health authorities. AR is the\nindividual-level excess risk in exposed persons; it is the input to NNT and is most useful\nfor communicating to a clinician or patient. Neither is universally preferable — they answer\ndifferent questions. Use AR when the audience is the individual clinician; use PAF when the\naudience is a budget-holder or policy-maker allocating prevention resources.\n\n*PAF vs relative risk (RR/HR):* RR is more stable across populations with different baseline\nrisk, making it the right quantity for transporting effects between studies. PAF is\npopulation-specific because it depends on prevalence; the recommended approach is to transport\nthe RR and combine it with local prevalence to produce a locally valid PAF. Use RR for\nmodelling and effect transportation; use PAF for burden statements anchored to a specific\npopulation.\n\n*Adjusted vs unadjusted PAF:* Unadjusted Levin's PAF is biased whenever the RR is confounded.\nMiettinen's case-based formula handles an adjusted RR correctly. 
G-computation provides the\nmost flexible adjusted PAF estimator in the presence of multiple or continuous confounders.\nAvoid the error of substituting an adjusted RR directly into Levin's formula.\n\n*Computational simplicity vs validity:* Levin's formula is a three-variable arithmetic\nexpression (p, RR, and the derived PAF) that can be computed in any spreadsheet. This\nsimplicity is also its weakness: every assumption is invisible to a reader who sees only the\nfinal number. Always report the RR, the prevalence estimate, its source, and a sensitivity\nanalysis over plausible prevalence ranges.\n\n**When to use**\n\nUse attributable risk and PAF when: the research question is explicitly about the population\nburden attributable to a modifiable exposure, and a defensible causal argument for the RR is\navailable; when preparing burden-of-disease sections of HTA dossiers or payer submissions;\nwhen combining a published or study-derived RR with local population prevalence to produce a\nlocally relevant PAF for a specific payer or geography; when performing multi-exposure burden\nanalyses with full awareness of the non-summability problem and joint counterfactual structure;\nor when designing or evaluating a prevention or treatment program and needing an upper-bound\nestimate of the theoretically preventable fraction.\n\n**When NOT to use**\n\n*When the RR is not causal.* A PAF computed from a confounded association answers a\nnon-scientific counterfactual. If adequate confounding adjustment is not possible, clearly\nlabel the output as an upper-bound associative estimate, not a preventable burden figure.\n\n*When the exposure is not eliminable.* Age, genetic sex, and most genetic variants cannot be\neliminated from a population. 
PAF for non-modifiable exposures has no actionable meaning;\nprefer AR or AF_exposed for characterizing individual-level excess risk attributable to\nnon-modifiable factors.\n\n*When summing PAFs across risk factors.* Presenting multiple PAFs summing beyond 100%\nwithout acknowledging the overlap is misleading. Use joint counterfactual analysis or\ncomplementary cumulative risk attribution for multi-factor burden reports.\n\n*When transporting a PAF across populations.* A published PAF from a different country,\npatient population, or time period is not applicable without re-anchoring to local exposure\nprevalence and verifying that the RR is transportable.\n\n*When the policy intervention is partial.* If the intervention achieves 40% exposure reduction\nrather than complete elimination, the expected population impact is substantially less than\nPAF × total cases. Use the population intervention effect (PIE) framework, which models the\nexpected PAF under a realistic partial reduction, rather than Levin's full-elimination PAF.",
    "primary_category": "Inferential_Statistics",
    "tags": [
      "statistics",
      "effect-measures",
      "epidemiology",
      "public-health",
      "attributable-risk"
    ],
    "applies_to_study_types": [
      "cohort_retrospective",
      "cohort_prospective",
      "case_control",
      "cross_sectional",
      "descriptive_analysis",
      "claims_analysis",
      "ehr_study",
      "registry_study"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "primary",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.2105/AJPH.88.1.15",
        "url": "https://doi.org/10.2105/AJPH.88.1.15",
        "citation_text": "Rockhill B, Newman B, Weinberg C. Use and misuse of population attributable fractions. Am J Public Health. 1998;88(1):15-19.",
        "year": 1998,
        "authors_short": "Rockhill et al.",
        "notes": "The canonical reference on the four categories of PAF misuse: summing PAFs beyond 100% across risk factors, using confounded RRs, assuming complete instantaneous exposure elimination, and transporting PAFs across populations with different prevalence. Required reading for any HEOR analyst using PAF in burden-of-disease claims."
      },
      {
        "role": "explain",
        "doi": "10.1136/bmj.k757",
        "url": "https://doi.org/10.1136/bmj.k757",
        "citation_text": "Mansournia MA, Altman DG. Population attributable fraction. BMJ. 2018;360:k757.",
        "year": 2018,
        "authors_short": "Mansournia & Altman",
        "notes": "Concise BMJ Statistics Notes tutorial covering Levin's formula, Miettinen's case-based adjusted estimator, and the causal interpretation requirement. Ideal for HEOR analysts reporting PAF in dossiers and burden-of-disease submissions."
      }
    ],
    "plain_language_summary": "Attributable risk (AR) measures how many more disease events occur among exposed people compared to unexposed people — the raw excess risk you carry because of the exposure. The population attributable fraction (PAF) extends this to ask what percentage of ALL disease cases in the entire population could theoretically be prevented if the exposure were completely eliminated. Both measures are powerful for public health planning and payer submissions, but they carry one strict requirement: the exposure-disease link must be genuinely causal, not just an association — a confounded PAF is scientifically meaningless.",
    "key_terms": [
      {
        "term": "attributable risk (AR)",
        "definition": "The difference in risk between exposed and unexposed groups; it estimates how many extra disease events occur per 100 (or 1000) exposed persons solely because of the exposure."
      },
      {
        "term": "attributable fraction among the exposed (AF_exposed)",
        "definition": "The proportion of disease cases in the exposed group that would not have occurred if those people had the same risk as the unexposed; equal to AR divided by the risk in the exposed group."
      },
      {
        "term": "population attributable fraction (PAF)",
        "definition": "The proportion of all cases in the entire population that is attributable to the exposure, accounting for what fraction of the population is actually exposed; computed via Levin's formula from exposure prevalence and the risk ratio."
      },
      {
        "term": "exposure prevalence",
        "definition": "The proportion of the population that is exposed to the risk factor; a higher prevalence means the same relative risk translates to a larger PAF, which is why PAFs are population-specific and cannot be transferred across populations without re-weighting."
      },
      {
        "term": "Levin's formula",
        "definition": "The standard PAF formula: PAF = p*(RR-1) / (p*(RR-1) + 1), where p is the population exposure prevalence and RR is the risk ratio; valid only when RR is the unconfounded (causal) estimate."
      },
      {
        "term": "counterfactual elimination",
        "definition": "The hypothetical scenario underlying PAF — what would happen to disease burden if the exposure were completely and instantly removed from the entire population; it is a mathematical thought experiment, not a realistic policy target."
      }
    ],
    "worked_example": {
      "scenario": "A retrospective cohort study examines hypertension (the exposure) and cardiovascular events over a 5-year follow-up. Of 100 hypertensive and 100 normotensive patients selected with similar baseline characteristics, 12 and 6 cardiovascular events occur, respectively. The analyst needs to compute: (1) the attributable risk in the exposed group, (2) the fraction of events among hypertensive patients attributable to hypertension, and (3) the population attributable fraction given that hypertension affects 20% of the broader population. All three measures come from the same 2x2 table and one prevalence figure.",
      "dataset": {
        "caption": "Summary 2x2 table from a 5-year retrospective cohort. Exposure = hypertension; outcome = any cardiovascular event. Population prevalence of hypertension p = 0.2.",
        "columns": [
          "group",
          "events",
          "non_events",
          "total_persons",
          "observed_risk"
        ],
        "rows": [
          [
            "Exposed (hypertension)",
            12,
            88,
            100,
            0.12
          ],
          [
            "Unexposed (no hypertension)",
            6,
            94,
            100,
            0.06
          ]
        ]
      },
      "steps": [
        "Risk among exposed = 12/100 = 0.12; risk among unexposed = 6/100 = 0.06.",
        "RR = 0.12/0.06 = 2.0; the risk in the hypertensive group is twice that in the unexposed group (a relative doubling, not a doubling of the absolute risk difference); the absolute risk increases by 0.06, or 6 percentage points.",
        "AR = 0.12 - 0.06 = 0.06; among every 100 hypertensive persons, 6 events would not have occurred if they carried the same risk as the unexposed, assuming the association is causal.",
        "AF_exposed = 0.06/0.12 = 0.5; half of all cardiovascular events among hypertensive patients are attributable to hypertension itself.",
        "Population prevalence of hypertension is p = 0.2; twenty percent of the broader population carries the exposure.",
        "Levin PAF = 0.2*(2.0-1)/(0.2*(2.0-1)+1) = 0.2/1.2 = 0.1667; approximately 16.7% of all cardiovascular events in the population are attributable to hypertension."
      ],
      "result": "AR = 0.12 - 0.06 = 0.06 excess events per 100 exposed persons. AF_exposed = 0.06/0.12 = 0.5; half of all events among the hypertensive group are attributable to hypertension. PAF = 0.2*(2.0-1)/(0.2*(2.0-1)+1) = 0.2/1.2 = 0.1667; roughly 1 in 6 cardiovascular events in this population is tied to hypertension at today's exposure levels — IF the association is causal and IF hypertension could be fully eliminated."
    },
    "prerequisites": [
      "risk-ratio-and-risk-difference",
      "descriptive-statistics"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [],
    "tradeoffs": [],
    "implementation_notes_by_data_source": {},
    "implementations": [
      {
        "lang": "python",
        "code": "# Attributable Risk and Population Attributable Fraction\n# Manual computation from a 2x2 summary table + population exposure prevalence.\n\n# ── Inputs (from the worked example) ──\nn_exp,   events_exp   = 100, 12  # exposed: 100 persons, 12 events\nn_unexp, events_unexp = 100,  6  # unexposed: 100 persons, 6 events\np_pop = 0.20                      # exposure prevalence in the broader population\n\n# ── 1. Individual-group risks ──\nr_exp   = events_exp   / n_exp    # 0.12\nr_unexp = events_unexp / n_unexp  # 0.06\n\n# ── 2. Core attributable measures ──\nRR         = r_exp / r_unexp                           # Risk ratio\nAR         = r_exp - r_unexp                           # Attributable risk (risk diff among exposed)\nAF_exposed = AR / r_exp                                # (RR-1)/RR\nPAF_levin  = p_pop * (RR - 1) / (p_pop * (RR - 1) + 1)  # Levin's formula\n\nprint(f\"Risk (exposed)   = {r_exp:.4f}\")\nprint(f\"Risk (unexposed) = {r_unexp:.4f}\")\nprint(f\"RR               = {RR:.4f}\")\nprint(f\"AR               = {AR:.4f}  ({AR*100:.1f}% excess risk)\")\nprint(f\"AF_exposed       = {AF_exposed:.4f}  ({AF_exposed*100:.0f}% of exposed cases attributable)\")\nprint(f\"PAF (Levin)      = {PAF_levin:.4f}  ({PAF_levin*100:.1f}% of population cases attributable)\")\n\n# ── 3. Miettinen's case-based formula (use with adjusted RR from a multivariable model) ──\n# PAF = proportion_of_cases_exposed * (1 - 1/RR_adjusted)\n# CAUTION: Do NOT substitute an adjusted RR into Levin's formula — that is biased under\n# confounding. 
Miettinen's formula uses the case fraction from your study population.\ntotal_cases    = events_exp + events_unexp\nprop_cases_exp = events_exp / total_cases      # fraction of all cases that are exposed\nPAF_miettinen  = prop_cases_exp * (1 - 1 / RR)\nprint(f\"\\nPAF (Miettinen, case-fraction based) = {PAF_miettinen:.4f}\")\nprint(\"  Differs from Levin because the 1:1 sampled study over-represents exposed cases; the population is 20/80 exposed.\")\nprint(\"  Use Miettinen's formula with a confounder-adjusted RR and a population-representative case fraction.\")\n\n# ── 4. Parametric bootstrap 95% CI for Levin's PAF ──\n# For a patient-level dataset, derive the CI by resampling individual rows instead.\n# This sketch is a parametric bootstrap: it redraws Bernoulli events at the observed risks\n# in a hypothetical population of 500 with 20% exposure prevalence.\nimport random\nrandom.seed(42)\npop_n_exp, pop_n_unexp = 100, 400   # 20% prevalence in a population of 500\n\ndef boot_paf(ne, re, nu, ru):\n    # One replicate: simulate events in each group, then PAF = (R_total - R_unexposed) / R_total,\n    # which is algebraically equivalent to Levin's formula at this prevalence.\n    ev_e = sum(random.random() < re for _ in range(ne))\n    ev_u = sum(random.random() < ru for _ in range(nu))\n    r_t = (ev_e + ev_u) / (ne + nu)\n    r_u = ev_u / nu\n    return (r_t - r_u) / r_t if r_t > 0 else 0.0\n\nboot_pafs = sorted(boot_paf(pop_n_exp, r_exp, pop_n_unexp, r_unexp) for _ in range(2000))\nci_lo = boot_pafs[int(0.025 * 2000)]\nci_hi = boot_pafs[int(0.975 * 2000)]\nprint(f\"\\nParametric bootstrap 95% CI for PAF: ({ci_lo:.3f}, {ci_hi:.3f})\")\nprint(\"  For production: bootstrap from patient-level rows, not aggregate Bernoulli draws.\")\n\n# ── 5. Sensitivity: PAF across exposure prevalence values (RR fixed) ──\nprint(\"\\nPAF sensitivity to exposure prevalence (RR = 2.0):\")\nfor p in (0.05, 0.10, 0.20, 0.30, 0.50):\n    paf = p * (RR - 1) / (p * (RR - 1) + 1)\n    print(f\"  p = {p:.2f}  ->  PAF = {paf:.3f}  ({paf*100:.1f}%)\")\nprint(\"PAFs across populations are NOT comparable without re-anchoring to local prevalence.\")",
        "description": "Computes attributable risk, attributable fraction among the exposed, and Levin's\npopulation attributable fraction from a 2x2 summary table and population exposure\nprevalence. Also demonstrates Miettinen's case-based formula for use with an adjusted\nRR, a bootstrap 95% CI sketch, and a sensitivity analysis over exposure prevalence.\nNo external dependencies required for core computation.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "# Attributable Risk and PAF in R\n# Manual computation + epiR::epi.2by2 + AF package for adjusted estimates.\n\n# ── 1. Manual computation from the worked example ──\nr_exp   <- 12 / 100   # risk in exposed\nr_unexp <-  6 / 100   # risk in unexposed\np_pop   <- 0.20        # population exposure prevalence\n\nRR         <- r_exp / r_unexp\nAR         <- r_exp - r_unexp\nAF_exposed <- AR / r_exp\nPAF_levin  <- p_pop * (RR - 1) / (p_pop * (RR - 1) + 1)\n\ncat(sprintf(\"AR          = %.4f\\n\", AR))\ncat(sprintf(\"RR          = %.4f\\n\", RR))\ncat(sprintf(\"AF_exposed  = %.4f\\n\", AF_exposed))\ncat(sprintf(\"PAF (Levin) = %.4f (%.1f%%)\\n\", PAF_levin, PAF_levin * 100))\n\n# ── 2. epiR::epi.2by2 — automated attributable fraction with 95% CIs ──\n# install.packages(\"epiR\")\n# Matrix layout: [exposed-case, exposed-noncase; unexposed-case, unexposed-noncase]\nif (requireNamespace(\"epiR\", quietly = TRUE)) {\n  library(epiR)\n  dat <- matrix(c(12, 88, 6, 94), nrow = 2, byrow = TRUE,\n                dimnames = list(Exposure = c(\"Exposed\", \"Unexposed\"),\n                                Outcome  = c(\"Case\", \"Non-case\")))\n  result <- epi.2by2(dat, method = \"cohort.count\",\n                     conf.level = 0.95, units = 100,\n                     interpret = FALSE, outcome = \"as.columns\")\n  print(summary(result))\n  # Key outputs in result$massoc.detail:\n  #   AFRisk.exp.strata.wald   -- AF among exposed (with 95% CI)\n  #   PAFRisk.strata.wald      -- Population AF, Levin (with 95% CI)\n} else {\n  cat(\"Install epiR: install.packages('epiR')\\n\")\n}\n\n# ── 3. 
AF package: regression-based adjusted attributable fraction ──\n# The AF package estimates PAF adjusting for confounders via logistic or Cox regression.\n# For cohort and cross-sectional data it estimates AF = 1 - P(Y0 = 1) / P(Y = 1) by\n# regression standardization, where Y0 is the outcome had the exposure been removed.\n# Example (requires patient-level data frame 'df' with: outcome, exposure, covariates):\n#   library(AF)\n#   model <- glm(outcome ~ exposure + age + sex, data = df, family = binomial())\n#   af_result <- AFglm(model, data = df, exposure = \"exposure\")\n#   summary(af_result)    # returns adjusted PAF with 95% CI\ncat(\"\\nFor confounder-adjusted PAF: install.packages('AF')\\n\")\ncat(\"Use AFglm() (binary outcome) or AFcoxph() (time-to-event) with a fitted model.\\n\")\n\n# ── 4. PAF sensitivity to exposure prevalence (RR fixed) ──\ncat(\"\\nPAF sensitivity (RR = 2.0):\\n\")\nfor (p in c(0.05, 0.10, 0.20, 0.30, 0.50)) {\n  paf <- p * (RR - 1) / (p * (RR - 1) + 1)\n  cat(sprintf(\"  p = %.2f  ->  PAF = %.3f  (%.1f%%)\\n\", p, paf, paf * 100))\n}\ncat(\"Reminder: PAFs across populations are NOT comparable without re-anchoring prevalence.\\n\")",
        "description": "Demonstrates manual computation of AR, AF_exposed, and Levin's PAF alongside\nepiR::epi.2by2 for automated attributable fraction outputs with confidence intervals.\nAlso documents the AF package for regression-based adjusted AF estimation and provides\na prevalence sensitivity table. Uses the same 2x2 table as the worked example.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* Attributable Risk and Population Attributable Fraction in SAS\n   DATA-step computation from a 2x2 summary + sensitivity loop.\n   For patient-level data, replace the summary counts with aggregated PROC FREQ output. */\n\n/* ── 1. DATA-step: manual computation from the 2x2 summary ── */\ndata work.paf_calc;\n  /* Inputs */\n  n_exp         = 100;  events_exp   = 12;\n  n_unexp       = 100;  events_unexp =  6;\n  p_pop         = 0.20; /* population exposure prevalence */\n\n  /* Risks */\n  r_exp   = events_exp   / n_exp;    /* 0.12 */\n  r_unexp = events_unexp / n_unexp;  /* 0.06 */\n\n  /* Effect measures */\n  RR         = r_exp / r_unexp;             /* 2.0  */\n  AR         = r_exp - r_unexp;             /* 0.06 */\n  AF_exposed = AR / r_exp;                  /* 0.5  */\n  PAF_levin  = p_pop * (RR - 1) / (p_pop * (RR - 1) + 1);  /* Levin formula */\n\n  /* Miettinen case-based formula (use with adjusted RR from PROC LOGISTIC) */\n  total_cases    = events_exp + events_unexp;\n  prop_cases_exp = events_exp / total_cases;\n  PAF_miettinen  = prop_cases_exp * (1 - 1/RR);\n\n  label\n    AR            = \"Attributable Risk\"\n    AF_exposed    = \"Attributable Fraction (exposed)\"\n    PAF_levin     = \"Population Attributable Fraction (Levin)\"\n    PAF_miettinen = \"PAF (Miettinen, use with adjusted RR)\";\nrun;\n\nproc print data=work.paf_calc noobs label;\n  var r_exp r_unexp RR AR AF_exposed PAF_levin PAF_miettinen;\nrun;\n\n/* ── 2. Sensitivity analysis: PAF across exposure prevalence values ── */\ndata work.paf_sensitivity;\n  RR = 2.0;   /* fixed from analysis */\n  do p_pop = 0.05, 0.10, 0.20, 0.30, 0.50;\n    PAF = p_pop * (RR - 1) / (p_pop * (RR - 1) + 1);\n    output;\n  end;\n  label p_pop = \"Population Exposure Prevalence\"\n        PAF   = \"Levin PAF\";\nrun;\nproc print data=work.paf_sensitivity noobs label; run;\n\n/* ── 3. 
PROC STDRATE for standardization-based attributable fraction ──\n   PROC STDRATE computes directly and indirectly standardized rates and risks and can\n   produce attributable fraction estimates when a multi-level (stratified) confounder\n   calls for standardization rather than regression adjustment. Available in SAS/STAT 13.1+.\n   Request attributable fraction output with the AF suboption of METHOD= in the\n   PROC STDRATE statement, i.e., METHOD=INDIRECT(AF) or METHOD=MH(AF), and list the\n   stratification variables in the STRATA statement. For simple Levin PAF from a 2x2,\n   the DATA step above is sufficient; reserve PROC STDRATE for complex stratified or\n   rate-based analyses. */\n/* Reference: SAS/STAT User's Guide, PROC STDRATE documentation. */\n\n/* ── 4. Extracting adjusted RR from PROC LOGISTIC for Miettinen formula ──\n   After fitting a logistic regression:\n     proc logistic data=mydata;\n       model outcome(event='1') = exposure age sex comorbidity;\n       ods output ParameterEstimates=pe;\n     run;\n   Extract the exposure parameter estimate and compute RR_adj (approximated by the OR\n   for rare outcomes, or estimated directly as a risk ratio from a log-binomial model\n   in PROC GENMOD with DIST=BINOMIAL LINK=LOG).\n   Then apply Miettinen:\n     PAF_miettinen = prop_cases_exposed * (1 - 1/RR_adjusted)\n   where prop_cases_exposed is computed from the analytic dataset as\n   sum(exposure=1 and outcome=1) / sum(outcome=1). */",
        "description": "Implements AR, AF_exposed, and Levin's PAF via a DATA step from a 2x2 summary table.\nAlso demonstrates Miettinen's case-based formula for use with an adjusted RR, a\nprevalence sensitivity loop, and a note on PROC STDRATE for direct standardization-based\nattributable fraction when a continuous or multi-level confounder requires standardization\nrather than regression adjustment.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [],
    "relations": [
      {
        "relation_type": "requires",
        "target_slug": "risk-ratio-and-risk-difference",
        "notes": "PAF is computed from RR (or RD) plus exposure prevalence; the risk-ratio-and-risk- difference entry covers the foundational relative and absolute risk measures that AR and PAF build on directly."
      },
      {
        "relation_type": "see_also",
        "target_slug": "number-needed-to-treat-rwe",
        "notes": "The clinical-translation cousin of AR: NNT = 1/AR answers \"how many exposed persons must we treat to prevent one event,\" while PAF answers \"what share of the entire population burden is attributable to the exposure.\" They are complementary framings of the same absolute risk difference."
      },
      {
        "relation_type": "see_also",
        "target_slug": "incidence-rate-calculation-rwe",
        "notes": "Incidence rates are the building blocks for rate-based attributable risk in cohort studies; rate differences and incidence rate ratios feed into analogous attributable-rate formulas when person-time denominators replace cumulative-risk denominators."
      },
      {
        "relation_type": "see_also",
        "target_slug": "hazard-ratio-interpretation",
        "notes": "Hazard ratios are commonly substituted for RR in Levin's PAF formula, which is only approximately valid when the HR is close to the risk ratio (rare outcome, short follow-up); in long follow-up or high-incidence settings, direct PAF estimation from cumulative incidence is preferred."
      },
      {
        "relation_type": "see_also",
        "target_slug": "all-cause-vs-attributable-costs-rwe",
        "notes": "The cost-side attributable concept: attributable costs extend PAF logic to healthcare expenditure, asking what fraction of payer spending is attributable to a condition or exposure; PAF from this entry provides the epidemiological foundation."
      }
    ],
    "aliases": [
      "AR",
      "attributable fraction",
      "PAF",
      "population attributable risk",
      "etiologic fraction",
      "Levin's formula"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "attrition-and-loss-to-follow-up-rwe",
    "name": "Attrition and Loss to Follow-Up",
    "short_definition": "The progressive, often non-random loss of observable person-time in longitudinal real-world cohorts (disenrollment, plan switching, system departure, death) that, when related to exposure or outcome, induces selection bias in incidence, treatment-effect, cost, and adherence estimates rather than merely reducing precision.",
    "long_description": "**Attrition** is the loss of a subject from observation before the end of intended follow-up; **loss to follow-up (LTFU)**\nis the subset where the patient becomes unobservable while still at risk and the outcome status going forward is unknown.\nIn real-world data these are near-universal: in claims, person-time ends at disenrollment (job change, plan switch, death,\nMedicare transition at age 65, hospice election); in EHR, capture decays as patients change providers or simply stop\nseeking care. The methodologically decisive question is never \"how much\" attrition there was but **whether the\ncensoring is independent of the outcome conditional on the measured covariates**. Standard time-to-event estimators\n(Kaplan-Meier, Cox) assume *non-informative (independent) censoring*: at any time, censored subjects have the same future\nhazard as those who remain. When that fails — when sicker or treatment-intolerant patients leave differentially —\nthe analysis is biased even if the model is otherwise correct.\n\n**The selection-bias structure (why this is collider bias, not missingness arithmetic).** Following Howe and Cole,\nthe cleanest way to see the danger is as a DAG. Let A = exposure, Y = outcome, C = censoring/LTFU indicator. If both\nA and Y (or their causes) affect C, then *restricting the analysis to the observed (uncensored) — i.e., conditioning\non C = 0 — opens a non-causal path A -> C <- Y*. C is a collider; selecting on it correlates A and Y even under the\nnull. 
This is the same machinery as healthy-survivor and depletion-of-susceptibles bias, and it is why \"complete-case\"\nor \"censor-and-ignore\" analyses are not conservative — they can bias toward OR away from the null unpredictably.\n\n**Core estimand distinction — pre-specify it.** The handling of LTFU is inseparable from the estimand:\n(1) *Independent-censoring estimand* (standard Cox/KM): valid only if censoring is non-informative given covariates; in\nRWE this is an assumption you must defend, not a default. (2) *IPCW-corrected (per-protocol / hypothetical) estimand*:\nthe effect that *would* have been observed had nobody been informatively censored, recovered by inverse-probability-of-\ncensoring weighting under \"no unmeasured common cause of censoring and outcome\" (a sequential exchangeability\nassumption). (3) *Competing-risks estimand* when death is the dominant driver of \"loss\": here death is **not** censoring\nbut a competing event, and you must choose between the *cause-specific hazard* (etiologic, rate among those still at\nrisk; estimate with a standard Cox on the event of interest, censoring at death) and the *subdistribution hazard /\ncumulative incidence function* (Fine-Gray; keeps decedents in an extended risk set, answers the absolute-risk /\nprognostic question). These are different quantities — a treatment can lower the cause-specific hazard of the event\nwhile raising its cumulative incidence if it also lowers competing mortality. Reporting one and interpreting it as the\nother is a classic, author-detectable error.\n\n**Pros, cons, and trade-offs of the main handling strategies.**\n- **Administrative censoring + standard Cox/KM (do nothing special).** Pro: transparent, no extra modeling, valid when\n  loss is genuinely administrative (end of data, calendar-driven). Con: silently assumes independent censoring; biased\n  whenever loss tracks prognosis or treatment tolerability. 
**Prefer** only when you can argue loss is non-informative\n  (e.g., loss driven by employer contract end, not health).\n- **IPCW (stabilized weights).** Pro: under correct specification recovers the no-informative-censoring effect using\n  time-varying predictors of dropout; pairs naturally with IPTW for doubly weighted MSMs. Con: requires rich\n  time-varying data on the *causes* of loss; unmeasured common causes of loss and outcome leave residual bias; extreme\n  weights inflate variance (truncate/stabilize and report the weight distribution). **Prefer** when the dropout\n  mechanism is plausibly captured by observed history (utilization, labs, comorbidity, adherence). Compared to\n  multiple imputation, IPCW targets the causal estimand directly and avoids imputing unobserved outcomes.\n- **Restriction to minimum continuous enrollment + sensitivity analysis.** Pro: simple, transparent, communicable to\n  reviewers and HTA bodies. Con: discards person-time and can itself be a collider if the enrollment-duration\n  restriction is affected by treatment or prognosis (you select on a downstream variable). **Prefer** when modeling\n  data for the censoring process are thin; always pair with a tipping-point / delta sensitivity for the discarded.\n- **Competing-risks framing (Fine-Gray / Aalen-Johansen).** Pro: correct when death is the main loss and absolute risk\n  is the question (cost, HCRU, prognosis). Con: changes the estimand (subdistribution vs cause-specific) and is not a\n  fix for *non-death* informative loss (disenrollment is still censoring). **Prefer** for OS/PFS-adjacent and economic\n  outcomes where decedents must remain in the denominator logic.\n- **Multiple imputation / pattern-mixture.** Pro: principled under MAR for longitudinal outcomes and PROs; pattern-\n  mixture and delta-adjustment let you stress MNAR explicitly. Con: imputing time-to-event outcomes is fragile;\n  MAR is often implausible for clinical dropout. 
**Prefer** for repeated-measures/PRO endpoints, not as the primary\n  handler of survival LTFU.\n\n**When to use careful LTFU handling (decision rules).** Treat attrition as informative until proven otherwise whenever:\nfollow-up exceeds ~6-12 months; loss rates differ across arms (always tabulate retention by arm); the exposure plausibly\naffects tolerability, hospitalization, mortality, or insurance status; or the outcome is mortality, a costly event, or\na slowly-accruing chronic endpoint. In these settings pre-specify IPCW or competing risks as primary, with restriction\n+ sensitivity as a transparency check.\n\n**When NOT to use / when a method is actively misleading.**\n- **Do not apply IPCW when you lack measured predictors of the loss process** — an under-specified weight model gives\n  false reassurance (it \"corrects\" using variables unrelated to dropout while the true drivers are unmeasured); a\n  transparent restriction + sensitivity analysis is more honest.\n- **Do not treat death as ordinary censoring** when death is informative for the outcome (almost always true for\n  morbidity, cost, and HCRU endpoints in elderly/oncology cohorts) — independent-censoring Cox will overstate\n  event-free survival because the frailest patients are removed from the risk set as \"censored.\" Use cause-specific\n  or Fine-Gray and state which.\n- **Do not interpret a cause-specific HR as an absolute-risk statement** — if the audience (HTA, payer) needs cumulative\n  incidence, the cause-specific hazard can mislead, especially when competing mortality differs by arm.\n- **Do not restrict on a post-baseline variable** (e.g., \"≥12 months enrolled,\" which requires surviving and staying\n  insured for 12 months) and then run an as-treated analysis — that re-creates immortal time and selects on a collider.\n\n**Data-source operational depth and failure modes.**\n- **Claims (FFS).** LTFU is sharp and observable: person-time ends at the enrollment-span boundary. 
Build retention from\n  the eligibility/enrollment file, not from claim gaps (absence of claims is not disenrollment). Failure mode:\n  **Medicare Advantage (MA) enrollees lack fee-for-service claims**, so MA person-time looks like a study exit when the\n  patient is simply unobservable to FFS — never count MA spans as \"events\" of loss or as person-time; restrict to A/B/D\n  (or commercial medical+pharmacy) FFS-observable time. Hospice election truncates routine claims while being a near-\n  certain marker of impending death — coding hospice exit as administrative censoring is severely informative and\n  biases survival upward. **Differential competing risks by exposure in the elderly**: arms with higher background\n  mortality lose person-time to death faster; treating that as censoring inflates the apparent event-free survival of\n  the sicker arm.\n- **EHR.** Loss is gradual and visit-driven: a patient who feels well, recovers, moves, or dies at home simply stops\n  generating records, with no disenrollment signal. Model the probability of any encounter/record in the next interval\n  given history for IPCW; treat key labs/outcomes as MNAR (sicker patients are tested more). 
Linkage to claims or a\n  death index is the standard remedy for the \"silent exit\" problem.\n- **Registry.** Strong for mortality and adjudicated progression (low LTFU on the primary clinical endpoint) but loses\n  PROs, utilization, and cost; link to claims for complete coverage and to vital records to firm up the censoring/\n  competing-event distinction.\n- **Linked claims-EHR-vital records.** Best substrate (EHR severity predictors for the IPCW model + claims completeness\n  + reliable death dates to separate competing risk from censoring), but linkage selection (only the linkable subset)\n  and date discrepancies between order/fill/service must be reconciled before assigning censoring times.\n\n**Worked claims example (continuous enrollment, washout, IPCW).** Question: 2-year all-cause and HF-specific event\nrisk in new initiators of Drug A vs active comparator Drug B among adults with type 2 diabetes in a commercial +\nMedicare FFS database. (1) **Eligibility / time zero**: first qualifying pharmacy fill (`fill_date`) with 365 days of\ncontinuous medical + pharmacy FFS enrollment beforehand and no prior A/B fill in that washout (incident users in both\narms). (2) **Observable person-time**: from `index_date` to the *earliest* of the validated HF event, death, end of\ncontinuous FFS enrollment, or end of data; **exclude all MA-only spans** so \"loss\" reflects real disenrollment, not a\ndata artifact. 
(3) **Retention reporting**: plot cumulative retention (1 - censoring CDF) by arm — if A loses 25% by\n6 months vs 15% for B because A causes more GI intolerance and job loss, attrition is differential and informative.\n(4) **IPCW model**: a pooled-logistic (discrete-time) model for \"still observable in interval t given observable through\nt-1,\" with arm, baseline comorbidity (Elixhauser/Charlson), prior-interval HCRU (inpatient days, ED visits), days_supply-\nbased adherence proxy, age band, and a flag for recent hospitalization; form stabilized inverse-probability-of-censoring\nweights and combine multiplicatively with IPTW. (5) **Estimand and models**: *primary* = cause-specific Cox for the HF\nevent with death as a competing event reported via Fine-Gray cumulative incidence (absolute risk is the decision-relevant\nquantity for HF); *secondary* = IPCW-weighted Cox targeting the no-informative-censoring per-protocol effect.\n(6) **Sensitivity**: admin-censoring-only Cox (shows the magnitude of the informative-censoring problem), weight\ntruncation at the 1st/99th percentile, alternative continuous-enrollment thresholds, and a delta/tipping-point analysis\nassuming the lost have 1.5x-2x the outcome hazard to bound MNAR. Always report baseline characteristics of retained\nvs lost patients alongside the longitudinal (CONSORT-style but time-resolved) attrition flow.\n\n**Interpreting the output**\n\nIn the five-patient cohort, the audit log shows: patients 1001, 1002, and 1004 each contribute\n365 person-days; patient 1003 is censored at day 150; patient 1005 at day 260. Total observed\nperson-days = 1,505 of 1,825 possible (82.5% of potential follow-up); 365-day retention =\n3/5 = 60%.\n\n*(1) Formal interpretation.* The 60% retention rate signals that 40% of the cohort was lost\nbefore the intended 365-day endpoint. Whether this loss is informative depends on the reason\nfor censoring relative to the outcome mechanism. 
If patients 1003 and 1005 disenrolled because\nof worsening disease — a process correlated with the event being counted — then standard\nadministrative censoring produces a biased event rate (informative censoring). The 82.5%\nperson-time capture is a secondary measure: high person-time percentage can coexist with\nheavy early-dropout bias if those lost early were the highest-risk patients in the cohort.\n\n*(2) Practical interpretation.* A 60% retention rate at 12 months is a threshold most\nregulatory reviewers will flag for mandatory sensitivity analysis. The directional question\nis whether retained patients are systematically healthier or sicker than those lost. If the\ndrug arm has higher dropout — for example, patients stopping due to adverse events — the\nsurviving arm appears artificially healthier, and IPCW or a competing-risk model is required\nbefore the event rate can be interpreted as unbiased. Always stratify the attrition diagnostic\nby treatment arm to detect differential loss.",
    "primary_category": "Bias_Control",
    "tags": [
      "attrition",
      "loss-to-follow-up",
      "informative-censoring",
      "selection-bias",
      "collider-bias",
      "ipcw",
      "competing-risks",
      "disenrollment",
      "claims",
      "ehr",
      "oncology-rwe"
    ],
    "applies_to_study_types": [
      "new_user",
      "active_comparator_new_user",
      "cohort_retrospective",
      "cohort_prospective",
      "claims_analysis",
      "ehr_study",
      "target_trial_emulation"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "linked",
      "registry"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1097/EDE.0000000000000409",
        "url": "https://doi.org/10.1097/EDE.0000000000000409",
        "citation_text": "Howe CJ, Cole SR, Lau B, Napravnik S, Eron JJ Jr. Selection bias due to loss to follow up in cohort studies. Epidemiology. 2016;27(1):91-97.",
        "year": 2016,
        "authors_short": "Howe et al.",
        "notes": "Canonical DAG-based account of how loss to follow-up induces collider/selection bias and how inverse-probability- of-censoring weighting corrects it; the single best conceptual anchor for this concept."
      },
      {
        "role": "explain",
        "doi": "10.1111/j.0006-341X.2000.00779.x",
        "url": "https://doi.org/10.1111/j.0006-341X.2000.00779.x",
        "citation_text": "Robins JM, Finkelstein DM. Correcting for noncompliance and dependent censoring in an AIDS clinical trial with inverse probability of censoring weighted (IPCW) log-rank tests. Biometrics. 2000;56(3):779-788.",
        "year": 2000,
        "authors_short": "Robins & Finkelstein",
        "notes": "Foundational derivation of IPCW for dependent (informative) censoring; the methodological basis for weight-based correction of attrition in time-to-event analyses."
      },
      {
        "role": "explain",
        "doi": "10.1097/00001648-200009000-00012",
        "url": "https://doi.org/10.1097/00001648-200009000-00012",
        "citation_text": "Hernán MA, Brumback B, Robins JM. Marginal structural models to estimate the causal effect of zidovudine on the survival of HIV-positive men. Epidemiology. 2000;11(5):561-570.",
        "year": 2000,
        "authors_short": "Hernán et al.",
        "notes": "Worked application combining inverse-probability-of-treatment and inverse-probability-of-censoring weights in a marginal structural model; the template for jointly handling confounding and informative dropout."
      },
      {
        "role": "explain",
        "doi": "10.1161/CIRCULATIONAHA.115.017719",
        "url": "https://doi.org/10.1161/CIRCULATIONAHA.115.017719",
        "citation_text": "Austin PC, Lee DS, Fine JP. Introduction to the analysis of survival data in the presence of competing risks. Circulation. 2016;133(6):601-609.",
        "year": 2016,
        "authors_short": "Austin et al.",
        "notes": "Clear exposition of cause-specific vs subdistribution (Fine-Gray) hazards and cumulative incidence; essential when death is the dominant driver of attrition and must be modeled as a competing event rather than as censoring."
      },
      {
        "role": "explain",
        "doi": "10.1093/aje/kwm324",
        "url": "https://doi.org/10.1093/aje/kwm324",
        "citation_text": "Suissa S. Immortal time bias in pharmacoepidemiology. American Journal of Epidemiology. 2008;167(4):492-499.",
        "year": 2008,
        "authors_short": "Suissa",
        "notes": "Shows how misaligned follow-up start and censoring rules create bias in database studies; relevant because post-baseline restrictions intended to mitigate attrition can re-introduce immortal time."
      },
      {
        "role": "use",
        "doi": "10.1136/bmj.m4856",
        "url": "https://doi.org/10.1136/bmj.m4856",
        "citation_text": "Wang SV, Pinheiro S, Hua W, et al. STaRT-RWE: structured template for planning and reporting on the implementation of real world evidence studies. BMJ. 2021;372:m4856.",
        "year": 2021,
        "authors_short": "Wang et al.",
        "notes": "Reporting standard adopted by FDA/EMA-facing studies that requires explicit specification of follow-up, censoring, loss-to-follow-up handling, and the associated sensitivity analyses."
      }
    ],
    "plain_language_summary": "When researchers follow a group of patients over time, not everyone stays visible for the entire period — some switch insurance plans, some stop seeking care, and some simply cannot be tracked anymore. Attrition is the word for this shrinking group, and loss to follow-up is the specific case where a patient disappears before the study ends and we still do not know what happened to them. The real danger is not just that the group gets smaller: if the patients who leave are sicker or tolerating treatment worse than those who stay, the remaining group looks artificially healthier, and any estimate of how well the treatment worked will be too optimistic.",
    "key_terms": [
      {
        "term": "follow-up",
        "definition": "The stretch of time a patient is actively observed in a study, from their start date until they leave, have the outcome, or the study ends — whichever comes first."
      },
      {
        "term": "censoring",
        "definition": "What analysts call it when a patient's observation ends before the outcome occurs and we do not know whether the outcome ever happened — the patient's true fate is cut off, or 'censored,' from our view."
      },
      {
        "term": "informative censoring",
        "definition": "A situation where the reason a patient left the study is connected to how sick they were or how well they were doing — meaning the patients who disappear are not a random sample, and those who stay behind look healthier than they truly are."
      },
      {
        "term": "at-risk cohort",
        "definition": "The set of patients who are still being observed at any given point in time and who have not yet had the outcome — this group shrinks as patients leave or are lost."
      },
      {
        "term": "retention rate",
        "definition": "The share of patients who remain continuously observable through a given time point, calculated as the number still being followed divided by the number who started."
      }
    ],
    "worked_example": {
      "scenario": "A health-plan database study enrolls 5 adults with a new diabetes diagnosis on January 1, 2023. The research team wants to track each patient for 365 days to count hospitalizations. Patients are observable as long as they remain enrolled in the health plan. Two patients disenroll before 365 days (they are lost to follow-up), and the team wants to summarize how much follow-up time they collected and what fraction of the original cohort remained at the end.",
      "dataset": {
        "caption": "Study roster: each row is one patient. 'days_observed' is person-time elapsed from January 1, 2023. Patients 1003 and 1005 disenrolled before the 365-day window closed.",
        "columns": [
          "person_id",
          "index_date",
          "exit_date",
          "days_observed",
          "status"
        ],
        "rows": [
          [
            1001,
            "2023-01-01",
            "2024-01-01",
            365,
            "retained"
          ],
          [
            1002,
            "2023-01-01",
            "2024-01-01",
            365,
            "retained"
          ],
          [
            1003,
            "2023-01-01",
            "2023-05-31",
            150,
            "lost_to_followup"
          ],
          [
            1004,
            "2023-01-01",
            "2024-01-01",
            365,
            "retained"
          ],
          [
            1005,
            "2023-01-01",
            "2023-09-18",
            260,
            "lost_to_followup"
          ]
        ]
      },
      "steps": [
        "Each patient starts on January 1, 2023. The 365-day window closes January 1, 2024. Days observed = days of person-time elapsed from the start date (so a patient who exits on May 31 has contributed 150 days, because January 1 + 150 days = May 31).",
        "Patient 1001: still enrolled on January 1, 2024 — 365 days observed. Retained.",
        "Patient 1002: still enrolled on January 1, 2024 — 365 days observed. Retained.",
        "Patient 1003: disenrolled on May 31, 2023 (150 days after January 1). Their follow-up bar ends at day 150. The remaining 215 days of the window (365 − 150 = 215) are unobserved. Lost to follow-up.",
        "Patient 1004: still enrolled on January 1, 2024 — 365 days observed. Retained.",
        "Patient 1005: disenrolled on September 18, 2023 (260 days after January 1). Their follow-up bar ends at day 260. The remaining 105 days of the window (365 − 260 = 105) are unobserved. Lost to follow-up.",
        "Total person-days collected: 365 + 365 + 150 + 365 + 260 = 1,505 days out of a maximum possible 5 × 365 = 1,825 days.",
        "Patients retained at 365 days: patients 1001, 1002, and 1004 — 3 of the original 5.",
        "Retention rate: 3 ÷ 5 = 0.60, or 60%.",
        "Important caveat: if patients 1003 and 1005 disenrolled because they were hospitalized and lost insurance coverage — meaning sicker patients left first — then the 3 remaining patients are healthier on average than the full original cohort. Counting hospitalizations only among the survivors would undercount events. That is informative censoring introducing bias."
      ],
      "result": {
        "label": "3 of 5 retained at 12 months = 60% retention; 1,505 of 1,825 possible person-days observed",
        "value": 0.6
      },
      "timeline_spec": {
        "title": "Patient follow-up over a 365-day observation window — 5-patient cohort, 2 lost to follow-up",
        "window": {
          "start": "2023-01-01",
          "end": "2024-01-01",
          "label": "Denominator: 365-day observation window"
        },
        "events": [
          {
            "label": "Patient 1001",
            "start": "2023-01-01",
            "length_days": 365,
            "quantity": "365 days observed"
          },
          {
            "label": "Patient 1002",
            "start": "2023-01-01",
            "length_days": 365,
            "quantity": "365 days observed"
          },
          {
            "label": "Patient 1003",
            "start": "2023-01-01",
            "length_days": 150,
            "quantity": "150 days observed"
          },
          {
            "label": "Patient 1004",
            "start": "2023-01-01",
            "length_days": 365,
            "quantity": "365 days observed"
          },
          {
            "label": "Patient 1005",
            "start": "2023-01-01",
            "length_days": 260,
            "quantity": "260 days observed"
          }
        ],
        "spans": [
          {
            "kind": "followup",
            "start": "2023-01-01",
            "end": "2024-01-01",
            "label": "Patient 1001 — retained 365 days",
            "row": "Patient 1001"
          },
          {
            "kind": "followup",
            "start": "2023-01-01",
            "end": "2024-01-01",
            "label": "Patient 1002 — retained 365 days",
            "row": "Patient 1002"
          },
          {
            "kind": "followup",
            "start": "2023-01-01",
            "end": "2023-05-31",
            "label": "Patient 1003 — observed 150 days",
            "row": "Patient 1003"
          },
          {
            "kind": "gap",
            "start": "2023-05-31",
            "end": "2024-01-01",
            "label": "Patient 1003 — lost to follow-up (215 days unobserved)",
            "row": "Patient 1003"
          },
          {
            "kind": "followup",
            "start": "2023-01-01",
            "end": "2024-01-01",
            "label": "Patient 1004 — retained 365 days",
            "row": "Patient 1004"
          },
          {
            "kind": "followup",
            "start": "2023-01-01",
            "end": "2023-09-18",
            "label": "Patient 1005 — observed 260 days",
            "row": "Patient 1005"
          },
          {
            "kind": "gap",
            "start": "2023-09-18",
            "end": "2024-01-01",
            "label": "Patient 1005 — lost to follow-up (105 days unobserved)",
            "row": "Patient 1005"
          }
        ],
        "result": {
          "label": "3 of 5 retained at 365 days = 60% retention; 1,505 / 1,825 person-days observed",
          "value": 0.6
        },
        "note": "Patients 1003 and 1005 left at different points, so the at-risk cohort shrank over time — 5 patients at day 0, 4 after day 150, and 3 after day 260. If the patients who left were sicker than average (informative censoring), any outcome rate estimated from the 3 survivors will be too low."
      },
      "caption": "Timeline showing 5 patients followed from January 1, 2023. Solid bars = days observed (follow-up); hatched bars = unobserved window after disenrollment. Patients 1003 and 1005 dropped out at days 150 and 260, respectively. Three of five patients (60%) remained observable through the full 365-day window.",
      "alt_text": "Horizontal bar chart with one row per patient. Patients 1001, 1002, and 1004 each have a solid bar spanning the full 365-day window. Patient 1003 has a solid bar for the first 150 days and a hatched bar for the remaining 215 days. Patient 1005 has a solid bar for the first 260 days and a hatched bar for the remaining 105 days. A summary line at the bottom reads: 3 of 5 retained at 12 months = 60% retention."
    },
    "prerequisites": [
      "continuous-enrollment-observable-time-rwe",
      "cohort-retrospective",
      "incidence-rate-calculation-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Administrative censoring at coverage end",
        "description": "Censor each patient at end of continuous enrollment or end of data and fit standard Cox/KM, assuming censoring is independent of the outcome given covariates. Defensible only when loss is genuinely administrative.",
        "edge_cases": [
          "Medicare eligibility at age 65 and employer-contract cycles create calendar/age-dependent censoring that can confound if treatment timing correlates with age or enrollment cohort.",
          "Hospice election and MA enrollment end FFS observability but are strongly outcome-related (impending death) or purely a data artifact (MA), respectively; treating either as administrative censoring is biased."
        ],
        "data_source_notes": "claims: derive censoring times from the eligibility file, never from claim gaps; exclude MA-only person-time and verify coverage end is not hospice-driven."
      },
      {
        "name": "IPCW with stabilized weights",
        "description": "Fit a (discrete-time pooled-logistic) model for remaining observable at each interval given time-varying history, form stabilized inverse-probability-of-censoring weights, and fit a weighted Cox/pooled-logistic outcome model. Combine multiplicatively with IPTW for a doubly weighted marginal structural model.",
        "edge_cases": [
          "Unmeasured common causes of loss and outcome violate the no-unmeasured-confounding-of-censoring assumption; weights cannot fix what is not measured.",
          "Extreme weights from rare covariate patterns inflate variance; truncate at the 1st/99th percentile and report the weight distribution and effective sample size."
        ],
        "data_source_notes": "claims: predictors = prior-interval HCRU (inpatient/ED), comorbidity score, days_supply-based adherence, demographics, geography, recent hospitalization flag; ehr: model probability of any encounter/record."
      },
      {
        "name": "Restriction to minimum continuous enrollment + sensitivity analysis",
        "description": "Restrict to patients with a minimum baseline-only follow-up requirement and run delta/tipping-point and best-/worst-case sensitivity analyses for the excluded or lost. Transparent fallback when censoring-process data are thin.",
        "edge_cases": [
          "Restricting on a post-baseline duration (surviving and staying enrolled k months) selects on a collider and can re-introduce immortal time; restrict only on baseline-fixed eligibility."
        ],
        "data_source_notes": "claims: implement from the enrollment file at baseline; report how the restricted population differs from the full incident cohort."
      },
      {
        "name": "Competing-risks framing (Fine-Gray / Aalen-Johansen)",
        "description": "When death is the dominant loss, model it as a competing event rather than censoring. Report cause-specific hazards (etiologic) and the cumulative incidence function / subdistribution hazard (absolute risk) explicitly.",
        "edge_cases": [
          "A treatment can reduce the cause-specific hazard of the event while increasing its cumulative incidence if it lowers competing mortality; report and interpret the correct quantity for the question."
        ],
        "data_source_notes": "claims/registry: requires reliable death dates (link to vital records); does not address non-death informative loss such as disenrollment, which remains censoring."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Standard Cox/KM with administrative censoring",
        "pros_of_this": "IPCW or competing-risks framing removes the bias from informative loss that standard independent- censoring estimators silently assume away; restriction + sensitivity makes the assumption transparent.",
        "cons_of_this": "IPCW needs rich time-varying predictors of the loss process and can be unstable with extreme weights; restriction sacrifices power and can induce collider bias if the restriction variable is post-baseline.",
        "when_to_prefer": "Use IPCW when the dropout mechanism is plausibly captured by observed history; use restriction + sensitivity when modeling data for censoring are limited; reserve plain administrative censoring for genuinely administrative loss."
      },
      {
        "compared_to": "competing-risks-analysis",
        "pros_of_this": "When death drives attrition, framing it as a competing event (Fine-Gray CIF or cause-specific Cox) is more appropriate than treating it as independent censoring, which removes the frailest from the risk set.",
        "cons_of_this": "Competing risks change the estimand (subdistribution vs cause-specific) and do not address non-death informative loss such as disenrollment, which still requires IPCW or restriction.",
        "when_to_prefer": "Prefer competing risks for morbidity/cost/HCRU and absolute-risk questions in elderly or oncology cohorts where mortality is high and differential by arm."
      },
      {
        "compared_to": "target-trial-emulation",
        "pros_of_this": "A target-trial protocol forces explicit, pre-specified follow-up and censoring rules at the design stage, preventing many post-hoc attrition decisions and the immortal time that ad hoc restrictions create.",
        "cons_of_this": "Even a perfect protocol leaves residual differential loss in the data that must still be handled analytically (IPCW, competing risks); the protocol disciplines, it does not eliminate.",
        "when_to_prefer": "Always specify the censoring rule in a target-trial framework; layer IPCW/competing-risks on top."
      },
      {
        "compared_to": "missing-data-trimming-winsorization-rwe",
        "pros_of_this": "For survival LTFU, IPCW targets the causal estimand directly without imputing unobserved event times, which multiple imputation must do under often-implausible MAR assumptions.",
        "cons_of_this": "For repeated-measures/PRO endpoints, multiple imputation and pattern-mixture models handle the missingness more naturally than IPCW and allow explicit MNAR stress-testing via delta-adjustment.",
        "when_to_prefer": "Use IPCW/competing risks for time-to-event LTFU; use MI/pattern-mixture for longitudinal continuous or PRO outcomes."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Derive censoring times from the eligibility/enrollment file (continuous medical + pharmacy spans), never from claim gaps. Exclude Medicare Advantage-only person-time, which lacks fee-for-service claims and would otherwise look like disenrollment. Flag hospice election (outcome-informative). For IPCW, build interval-level predictors from prior HCRU, comorbidity scores, days_supply-based adherence, and demographics. Report cumulative retention by arm and baseline characteristics of retained vs lost.",
      "ehr": "Loss is gradual and visit-driven; a patient who recovers, moves, or dies at home simply stops generating records. Model the probability of any encounter/record in the next interval given history for IPCW, and treat key labs/outcomes as MNAR. Link to claims or a death index to detect silent exits and to separate competing death from censoring.",
      "registry": "Low loss on mortality and adjudicated progression but high loss for PROs, utilization, and cost. Link to claims for complete coverage history and to vital records to firm up the censoring vs competing-event distinction.",
      "linked": "Linked claims-EHR-vital records is the ideal substrate (EHR severity predictors for the censoring model + claims completeness + reliable death dates) but introduces linkage selection and order/fill/service date discrepancies that must be reconciled before assigning censoring and competing-event times."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\nimport pandas as pd\nimport statsmodels.formula.api as smf\nfrom lifelines import KaplanMeierFitter\n\n# ---- (1) Retention by arm: KM on the CENSORING distribution (event = loss to follow-up) ----\n# A steep, arm-differential censoring curve is the signal that loss is informative.\ncens = person.copy()\ncens[\"lost\"] = cens[\"exit_reason\"].isin([\"disenroll\"]).astype(int)   # death handled as competing event, not loss\ncens[\"fu_days\"] = (cens[\"exit_date\"] - cens[\"index_date\"]).dt.days\nkmf = KaplanMeierFitter()\nfor a, g in cens.groupby(\"arm\"):\n    kmf.fit(g[\"fu_days\"], event_observed=g[\"lost\"], label=f\"arm {a}\")\n    print(a, \"retention at 180d:\", float(kmf.predict(180)))   # 1 - cumulative loss\n\n# ---- (2) Stabilized IPCW via discrete-time pooled logistic for staying observable ----\n# Numerator depends on arm only (baseline); denominator adds time-varying predictors of dropout.\nnum = smf.logit(\"observed_next ~ C(arm_t) + t + I(t**2)\", data=panel).fit(disp=0)\nden = smf.logit(\n    \"observed_next ~ C(arm_t) + t + I(t**2) + comorbidity + prior_hcru \"\n    \"+ adherence + C(age_band) + recent_hosp\",\n    data=panel,\n).fit(disp=0)\n\npanel = panel.sort_values([\"person_id\", \"t\"]).copy()\npanel[\"num_t\"] = num.predict(panel).values   # P(stay observable | baseline)\npanel[\"den_t\"] = den.predict(panel).values   # P(stay observable | full history)\n# Cumulative product within person = probability of remaining UNCENSORED through interval t.\npanel[\"cum_num\"] = panel.groupby(\"person_id\")[\"num_t\"].cumprod()\npanel[\"cum_den\"] = panel.groupby(\"person_id\")[\"den_t\"].cumprod()\npanel[\"ipcw\"] = panel[\"cum_num\"] / panel[\"cum_den\"]\n\n# Truncate extreme weights at the 1st/99th percentile and report stability.\nlo, hi = panel[\"ipcw\"].quantile([0.01, 0.99])\npanel[\"ipcw_trunc\"] = panel[\"ipcw\"].clip(lo, hi)\nprint(\"IPCW mean (should be ~1):\", round(panel[\"ipcw_trunc\"].mean(), 3),\n   
   \"max:\", round(panel[\"ipcw_trunc\"].max(), 2))\n# Downstream: multiply ipcw_trunc by IPTW and pass as weights to a (pooled-logistic or Cox) outcome model.",
        "description": "Retention reporting + stabilized IPCW for informative loss in a claims cohort. Required inputs (post data-management):\n  person : person_id, arm in {'A','B'}, index_date (datetime), exit_date (datetime), exit_reason in\n           {'event','death','disenroll','admin'}            # exit_date/reason already reconciled against the enrollment file (MA-only spans removed)\n  panel  : one row per person per 30-day interval the person is observable ->\n           person_id, t (0,1,2,... interval index), observed_next (1 if observable in t+1 else 0),\n           arm_t, comorbidity, prior_hcru, adherence, age_band, recent_hosp   # time-varying predictors of remaining observable\nProduces (1) cumulative retention by arm and (2) stabilized IPCW ready to multiply into IPTW for a weighted Cox.",
        "dependencies": [
          "pandas",
          "numpy",
          "statsmodels",
          "lifelines"
        ],
        "source_citations": [
          "howe-2016",
          "robins-finkelstein-2000"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(dplyr)\nlibrary(survival)\nlibrary(cmprsk)\n\n## ---- Stabilized IPCW from a discrete-time pooled logistic for staying observable ----\nnum <- glm(observed_next ~ factor(arm_t) + t + I(t^2),\n           family = binomial, data = panel)\nden <- glm(observed_next ~ factor(arm_t) + t + I(t^2) + comorbidity + prior_hcru +\n             adherence + factor(age_band) + recent_hosp,\n           family = binomial, data = panel)\n\npanel <- panel %>%\n  arrange(person_id, t) %>%\n  mutate(p_num = predict(num, type = \"response\"),\n         p_den = predict(den, type = \"response\")) %>%\n  group_by(person_id) %>%\n  mutate(ipcw = cumprod(p_num) / cumprod(p_den)) %>%\n  ungroup()\nqs <- quantile(panel$ipcw, c(0.01, 0.99))\npanel$ipcw_trunc <- pmin(pmax(panel$ipcw, qs[1]), qs[2])   # weight truncation\n\n## ---- IPCW-weighted Cox (per-protocol / no-informative-censoring estimand) ----\n## Attach the last-interval weight to each person's survival record, then fit weighted Cox.\nw <- panel %>% group_by(person_id) %>% summarise(w = dplyr::last(ipcw_trunc), .groups = \"drop\")\nsurv_w <- left_join(surv, w, by = \"person_id\")\nfit_ipcw <- coxph(Surv(fu_time, status == 1) ~ arm, data = surv_w,\n                  weights = w, robust = TRUE)   # robust SE required with weights\nprint(summary(fit_ipcw)$coefficients)\n\n## ---- Competing risks: death (status==2) is a COMPETING EVENT, not censoring ----\n## Cause-specific hazard (etiologic): standard Cox, censor at death.\nfit_cs <- coxph(Surv(fu_time, status == 1) ~ arm, data = surv)\n## Subdistribution hazard / cumulative incidence (absolute risk): Fine-Gray.\nfg <- crr(ftime = surv$fu_time, fstatus = surv$status,\n          cov1 = model.matrix(~ arm, surv)[, -1, drop = FALSE],\n          failcode = 1, cencode = 0)   # failcode=1 event of interest, 2 treated as competing\nprint(summary(fg))",
        "description": "Stabilized IPCW + IPCW-weighted Cox, and a Fine-Gray competing-risks model with death as the competing event.\nInputs mirror the Python version:\n  panel : person_id, t, observed_next (0/1), arm_t, comorbidity, prior_hcru, adherence, age_band, recent_hosp\n  surv  : person_id, arm (factor), fu_time (numeric), status in {0 = censored, 1 = event, 2 = death}",
        "dependencies": [
          "survival",
          "cmprsk",
          "dplyr"
        ],
        "source_citations": [
          "howe-2016",
          "austin-2016"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* (1) Retention by arm via the censoring (loss) distribution. */\nproc lifetest data=work.surv plots=survival(atrisk);\n  time fu_time*status(1 2);          /* event=loss only; codes 1,2 (event,death) are censored here */\n  strata arm;\nrun;\n\n/* (2) Stabilized IPCW: discrete-time pooled logistic for remaining observable. */\nproc logistic data=work.panel descending;          /* numerator: baseline (arm) only */\n  class arm_t age_band / param=ref;\n  model observed_next = arm_t t t*t;\n  output out=num p=p_num;\nrun;\nproc logistic data=work.panel descending;          /* denominator: + time-varying predictors of dropout */\n  class arm_t age_band / param=ref;\n  model observed_next = arm_t t t*t comorbidity prior_hcru adherence age_band recent_hosp;\n  output out=den p=p_den;\nrun;\n\n/* Cumulative product within person -> stabilized weight; truncate extremes. */\ndata ipcw;\n  merge num(keep=person_id t p_num) den(keep=person_id t p_den);\n  by person_id t;\n  retain cum_num cum_den;\n  if first.person_id then do; cum_num=1; cum_den=1; end;\n  by person_id;\n  cum_num = cum_num * p_num;\n  cum_den = cum_den * p_den;\n  ipcw = cum_num / cum_den;\nrun;\nproc univariate data=ipcw noprint; var ipcw; output out=q pctlpts=1 99 pctlpre=p; run;\ndata work.surv_w;\n  merge work.surv (in=a) ipcw(keep=person_id ipcw rename=(ipcw=w));\n  by person_id; if a;     /* keep the last-interval weight per person from your panel collapse step */\nrun;\n\n/* (3) IPCW-weighted Cox = no-informative-censoring (per-protocol) estimand. */\nproc phreg data=work.surv_w covsandwich(aggregate);   /* robust SE with weights */\n  class arm (ref='B') / param=ref;\n  model fu_time*status(0 2) = arm;     /* event=1; death(2) censored in the cause-specific weighted model */\n  weight w;\n  id person_id;\nrun;\n\n/* (4) Competing risks: death is a competing event, not censoring. 
*/\nproc phreg data=work.surv;             /* cause-specific hazard (etiologic) */\n  class arm (ref='B') / param=ref;\n  model fu_time*status(0 2) = arm;\nrun;\nproc phreg data=work.surv;             /* Fine-Gray subdistribution hazard (absolute risk / CIF) */\n  class arm (ref='B') / param=ref;\n  model fu_time*status(0) = arm / eventcode=1;   /* status 2 (death) kept as competing event */\nrun;",
        "description": "Retention curves, stabilized IPCW, IPCW-weighted Cox, and a Fine-Gray competing-risks Cox. Required datasets\n(post data-management; MA-only person-time already removed):\n  work.surv  : person_id, arm ('A'/'B'), fu_time, status (0=censored, 1=event, 2=death)\n  work.panel : person_id, t, observed_next (0/1), arm_t, comorbidity, prior_hcru, adherence, age_band, recent_hosp\nPROC PHREG eventcode= (Fine-Gray) requires SAS/STAT 14.1+; report cause-specific AND subdistribution results.",
        "dependencies": [],
        "source_citations": [
          "howe-2016",
          "austin-2016"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": "attrition-and-loss-to-follow-up-rwe-timeline.svg",
        "mermaid": null,
        "caption": "Data-grounded worked-example timeline (beginner layer), drawn to scale from worked_example.timeline_spec so the picture matches the numbers.",
        "alt_text": "Timeline for the worked example of attrition-and-loss-to-follow-up-rwe.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  A[Exposure / arm A] --> Y[Outcome Y]\n  A --> C[Loss to follow-up C]\n  Y --> C\n  U[Frailty / severity<br/>often unmeasured] --> Y\n  U --> C\n  C -. condition on observed .-> Collider[Selecting on C opens<br/>A to C from Y: collider/selection bias]",
        "caption": "Selection-bias DAG for loss to follow-up (after Howe & Cole). When both exposure and outcome (and shared causes like frailty) affect censoring C, restricting to the observed (C=0) conditions on a collider and biases the A-Y association even under the null. This is why ignoring informative loss is not conservative.",
        "alt_text": "Directed acyclic graph with exposure A and outcome Y both pointing to censoring C, an unmeasured frailty U pointing to Y and C, and a note that conditioning on observed status opens a collider path.",
        "source_type": "illustrative",
        "source_citations": [
          "howe-2016"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Start[Loss of person-time in follow-up] --> Q1{Is the loss DEATH?}\n  Q1 -->|Yes| CR[Competing risks:<br/>cause-specific Cox AND Fine-Gray CIF]\n  Q1 -->|No, disenroll/system exit| Q2{Loss related to exposure or prognosis?}\n  Q2 -->|No - administrative<br/>end of data/contract| Std[Standard Cox/KM<br/>independent censoring OK]\n  Q2 -->|Yes - informative| Q3{Rich time-varying predictors<br/>of the loss process available?}\n  Q3 -->|Yes| IPCW[Stabilized IPCW-weighted Cox<br/>combine with IPTW]\n  Q3 -->|No| Restrict[Restrict to baseline-fixed enrollment<br/>+ delta / tipping-point sensitivity]\n  CR --> Report[Report retention by arm, estimand,<br/>and sensitivity analyses]\n  Std --> Report\n  IPCW --> Report\n  Restrict --> Report",
        "caption": "Decision logic for handling attrition. Death is a competing event (not censoring); non-death informative loss is handled by IPCW when its drivers are measured, otherwise by transparent restriction plus MNAR sensitivity.",
        "alt_text": "Decision tree branching on whether loss is death (competing risks), administrative (standard Cox), or informative with or without predictors (IPCW versus restriction plus sensitivity).",
        "source_type": "illustrative",
        "source_citations": [
          "howe-2016",
          "austin-2016"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "see_also",
        "target_slug": "missing-data-trimming-winsorization-rwe",
        "notes": "Attrition is the dominant source of missing longitudinal data in RWE; the same MAR/MNAR logic applies, but survival LTFU is better handled by IPCW than by imputing event times, while PRO/repeated-measures loss favors MI."
      },
      {
        "relation_type": "see_also",
        "target_slug": "target-trial-emulation",
        "notes": "Target-trial protocols pre-specify the censoring and loss-to-follow-up rules (the follow-up component), preventing ad hoc, immortal-time-inducing restrictions; analytic LTFU correction is still layered on top."
      },
      {
        "relation_type": "see_also",
        "target_slug": "cox-ph-regression",
        "notes": "Informative censoring violates the independent-censoring assumption of standard Cox models and biases the HR; IPCW-weighted Cox or a competing-risks model is required when loss tracks prognosis."
      },
      {
        "relation_type": "see_also",
        "target_slug": "estimands-ate-att-intercurrent-events-rwe",
        "notes": "Death and disenrollment are intercurrent events; the chosen strategy (treatment-policy vs while-on-treatment vs competing-risk) determines how attrition enters the estimand and the analysis."
      },
      {
        "relation_type": "see_also",
        "target_slug": "therapeutic-area-specific-rwe-challenges-oncology",
        "notes": "Oncology has the highest attrition (death plus rapid switching); cause-specific vs subdistribution choices and explicit censoring handling are essential for credible OS/PFS and HCRU estimates."
      }
    ],
    "aliases": [
      "loss to follow-up",
      "informative censoring",
      "dependent censoring",
      "differential attrition",
      "drop-out bias",
      "retention",
      "IPCW in RWE"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "baseline-characteristics-and-covariate-balance-rwe",
    "name": "Baseline Characteristics and Covariate Balance",
    "short_definition": "The structured description of pre-exposure covariates across treatment groups together with the diagnostics (standardized mean differences, variance ratios, distributional overlap, effective sample size) used to judge whether a matched, weighted, or restricted real-world cohort achieved measured exchangeability with respect to the target estimand.",
    "long_description": "In a randomized trial the baseline table documents a known truth (randomization should have balanced everything, measured\nand unmeasured). In **nonrandomized real-world evidence (RWE) it documents the central problem the study exists to solve**:\ntreatment was *chosen*, so the arms differ on the very prognostic factors that drive the outcome. The baseline table and\nits balance diagnostics are therefore a **design instrument, not clerical output** — they show who got which treatment,\nwho was excluded by the eligibility cascade, and whether the adjustment strategy (matching, weighting, restriction) made\nthe comparison groups exchangeable on *measured* covariates within the population the estimand targets.\n\n**Core conceptual distinction: balance is a property of an estimator + target population, not of a dataset.** Standardized\ndifferences must be computed in the *same* pseudo-population the effect estimate describes. Inverse-probability-of-treatment\nweighting (IPTW) targets the **ATE** and balance is assessed in the full weighted sample; standardized-mortality-ratio\n(SMR/ATT) weights and 1:1 matching target the **ATT** and balance is assessed in the treated-anchored sample; overlap\n(ATO) weights target the population with clinical equipoise and balance is assessed in the overlap-weighted sample where,\nby construction, weighted means are *exactly* equal for any covariate in the propensity-score (PS) model. Reporting an\nunweighted Table 1 next to a weighted effect estimate, or pooling matched pairs as if they were the source population, is\na category error: the diagnostic no longer corresponds to the estimand. 
A \"balanced\" table also says nothing about\n**unmeasured** confounding — it is necessary, never sufficient, for valid causal inference.\n\n**The right diagnostics (and the cardinal sin).**\n- **Standardized mean difference (SMD):** for a continuous covariate, (mean_treated − mean_comparator) / pooled SD; for a\n  binary covariate, (p1 − p0) / sqrt[(p1(1−p1) + p0(1−p0))/2]. It is **sample-size-independent**, which is exactly why it\n  replaces hypothesis tests. The convention |SMD| < 0.10 flags adequate balance, but it is a heuristic, not a guarantee —\n  a strong confounder at SMD = 0.08 can matter more than a non-confounder at 0.15.\n- **Variance ratio** (treated/comparator) close to 1, and **higher-moment / distributional** checks (eCDF, KS-type\n  distance, quantile-quantile or mirrored histograms): two groups can have identical means yet different spreads or\n  skew, which the SMD of the mean alone hides. Report these for key continuous covariates.\n- **Propensity-score overlap and effective sample size (ESS).** Inspect the PS distribution by arm for non-overlap\n  (positivity); after weighting report ESS = (Σw)² / Σw², because a handful of extreme weights can collapse the real\n  information content even when SMDs look pristine.\n- **The cardinal sin: p-values as balance diagnostics.** In a 200,000-patient claims cohort a 0.3-year age difference is\n  \"p < 0.001\" and a 6-year difference can be \"p = 0.2\" in a 40-patient subgroup. The test answers *sample size*, not\n  *practical comparability*. 
Austin (2009) and consensus guidance reject significance testing of baseline balance; use\n  standardized differences.\n\n**Pros, cons, and trade-offs (vs the alternatives you would otherwise reach for).**\n- **vs an omnibus c-statistic / global balance metric (e.g., the PS model AUC or a single Mahalanobis distance):** the\n  covariate-by-covariate SMD/variance-ratio table is interpretable and *actionable* — it names the residually imbalanced\n  confounder so you can enrich the PS model, restrict, or re-specify the comparator. A single global number hides which\n  covariate failed and rewards overfitting the PS. **Prefer the per-covariate table** for reporting and decisions; use a\n  global summary (max/median SMD, a Love plot) only to *summarize* a high-dimensional table, never to replace it.\n- **vs a formal balance hypothesis test (Hotelling's T², per-covariate t/chi-square):** standardized differences are\n  sample-size-independent and decision-relevant; tests conflate imbalance with power and are explicitly discouraged. There\n  is essentially no setting in large RWE where a balance p-value is the right tool.\n- **vs trusting the outcome model's covariate adjustment to \"fix\" residual imbalance:** regression adjustment on top of a\n  poorly balanced design extrapolates across regions of non-overlap and is fragile to model misspecification. The balance\n  table tells you whether design (matching/weighting/restriction) carried the load so the outcome model only has to\n  interpolate. **Prefer design over reliance on modeling** when overlap is poor.\n- **Trade-off in what balance buys you:** chasing SMD < 0.10 on every one of hundreds of high-dimensional covariates via\n  aggressive trimming or tight calipers shrinks the cohort and ESS, widens confidence intervals, and shifts the estimand\n  toward an ever-narrower overlap population (Stürmer 2010). 
Balance is purchased with generalizability and precision;\n  the table should be read alongside ESS and the change in cohort composition, not optimized in isolation.\n\n**When to use.** Always, as a mandatory deliverable, in any nonrandomized comparative-effectiveness or safety study and in\nany descriptive cohort comparison: (1) an **unadjusted/design table** to expose treatment channeling and the eligibility\ncascade, and (2) an **adjusted table** (matched/weighted) computed in the estimand's population to certify that the\nbalancing step worked before any outcome model is fit. It is the standard acceptance gate for PS matching/IPTW/overlap\nweighting and for target-trial emulations.\n\n**When NOT to use / when it is actively misleading or dangerous.**\n- **Do not treat a balanced table as evidence of no confounding.** Measured balance is silent on unmeasured confounders\n  (frailty, performance status, over-the-counter use, indication severity not coded in claims). Pairing a pristine Love\n  plot with a causal claim, absent negative controls / quantitative bias analysis, is the most common over-reach reviewers\n  punish.\n- **Do not assess balance in the wrong population.** An unweighted Table 1 reported beside an ATE/IPTW estimate, or a\n  matched-pair table read as if it described the source cohort, misrepresents who the effect is about.\n- **Do not check balance only on the covariates already in the PS model.** Overlap weights and a saturated PS force those\n  to balance by construction; the informative check is on prognostic variables *excluded* from the model and on\n  transformations/interactions, where residual imbalance actually lives.\n- **Do not over-trim to manufacture balance** when it shrinks ESS and changes the target population — you may \"win\" the\n  table and lose the question (Stürmer 2010).\n- **Beware balance achieved on a mis-defined baseline window.** Covariates measured after time zero (e.g., a lab drawn the\n  week after initiation) 
introduce immortal-time bias and adjustment for mediators; a beautiful balanced table built on\n  post-baseline data is balanced on the wrong thing.\n\n**Data-source operational depth: real failure modes and workarounds.**\n- **Claims (FFS):** the strongest measured confounders are usually *utilization* (prior hospitalizations, ED visits,\n  outpatient counts), *cost*, *medication classes*, and *diagnosis/procedure counts* in a fixed lookback, not raw\n  comorbidity flags, which are noisy. Include these in Table 1 and the PS. *Failure mode:* **Medicare Advantage (MA)\n  person-time lacks fee-for-service (FFS) claims**, so MA enrollees have artifactually \"clean\" baselines (few coded\n  comorbidities), creating spurious balance and differential measurement error if MA mix differs by arm. *Workaround:*\n  restrict to enrollees with complete FFS Parts A/B/D (or a commercial benefit) across the full lookback, and report the\n  MA share by arm. *Failure mode:* a fixed lookback with **variable enrollment** makes counts depend on observed time, not\n  true burden; *workaround:* require continuous enrollment across the lookback or model person-time, and never let\n  enrollment length differ systematically by arm.\n- **EHR:** labs, vitals, stage, ECOG, smoking, and PROs are powerful confounders but **missing not at random**: a missing\n  lab often marks a healthier or less-engaged patient, and missingness can itself differ by arm and by site. *Failure\n  mode:* complete-case Table 1 silently restricts to the sickest, most-worked-up patients. *Workaround:* report a\n  missingness indicator per covariate *as its own row in Table 1*, balance the missingness indicators, and use multiple\n  imputation (or the missing-indicator method only when missingness is plausibly a measured marker), with sensitivity\n  analyses. 
Visit-driven capture also means baseline \"absence of disease\" can be absence of contact.\n- **Registry:** rich clinical staging/biomarkers (often the true confounders) but typically thin on full pharmacy/utilization;\n  *workaround:* link to claims for baseline HCRU and to a death index. *Failure mode:* voluntary-enrollment registries\n  select on prognosis, so balance within the registry may not transport.\n- **Linked claims–EHR–registry:** the ideal substrate (severity + completeness + mortality), but **linkage selects** the\n  linkable subset and **date discrepancies** (order vs fill vs service) can push covariates across the time-zero boundary;\n  reconcile dates and report balance both in the linked subset and against the unlinked source to gauge selection.\n\n**Worked claims example (end to end).** Comparative safety of SGLT2 inhibitors vs DPP-4 inhibitors on lower-limb\namputation among adults with type 2 diabetes in a commercial + Medicare FFS database. (1) *Cohort:* active-comparator,\nnew-user design: first fill (`fill_date`) of either class as `index_date`; require 365 days of continuous medical +\npharmacy enrollment with no MA-only person-time and no prior fill of either class in the lookback (true washout). (2)\n*Baseline window:* covariates measured **only in [index_date − 365, index_date]**: age, sex, region, prior amputation/PAD\ndiagnoses, neuropathy, insulin/metformin use, HbA1c proxy, counts of inpatient/ED/outpatient visits, total paid cost, and\na missing-HbA1c indicator. (3) *Unadjusted Table 1:* SGLT2 initiators are younger, with a lower prior-amputation rate and\nfewer hospitalizations: classic channeling (age SMD ≈ 0.42, prior-PAD SMD ≈ 0.28). (4) *Adjustment:* fit a high-dimensional PS,\napply overlap (ATO) weights for an equipoise estimand. 
(5) *Adjusted Table 1:* weighted SMDs for all PS covariates ≈ 0.00\nby construction; the *informative* check is on excluded prognostic proxies (e.g., wound-care procedure codes) and variance\nratios — all < 0.10 and ~1.0 here. (6) *Guardrails:* report **ESS** (e.g., 18,400 → 12,950 weighted) and the change in\nweighted-mean age vs the source to show how ATO reshaped the population; only then fit the outcome model. A worked SMD: in\nthe lookback, mean total cost \\$14,200 (treated) vs \\$11,800 (comparator), pooled SD \\$9,000 → SMD = 0.27 (clear\nchanneling); after ATO weighting \\$12,400 vs \\$12,250, pooled SD \\$9,000 → SMD = 0.017 (practically balanced), while the\nsame contrast may still be \"p < 0.001\" in 30,000 patients — which is precisely why the SMD, not the p-value, governs the\ndecision.",
    "primary_category": "Data_Quality_Assessment",
    "tags": [
      "table-1",
      "baseline-characteristics",
      "standardized-mean-difference",
      "covariate-balance",
      "propensity-score",
      "overlap",
      "love-plot",
      "effective-sample-size",
      "diagnostics"
    ],
    "applies_to_study_types": [
      "new_user",
      "active_comparator_new_user",
      "cohort_retrospective",
      "claims_analysis",
      "ehr_study",
      "target_trial_emulation"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "linked",
      "registry",
      "multi-database"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1002/sim.3697",
        "url": "https://doi.org/10.1002/sim.3697",
        "citation_text": "Austin PC. Balance diagnostics for comparing the distribution of baseline covariates between treatment groups in propensity-score matched samples. Statistics in Medicine. 2009;28(25):3083-3107.",
        "year": 2009,
        "authors_short": "Austin",
        "notes": "Canonical reference defining the standardized mean difference and distributional balance diagnostics and arguing against significance testing of baseline balance."
      },
      {
        "role": "explain",
        "doi": "10.1002/sim.6607",
        "url": "https://doi.org/10.1002/sim.6607",
        "citation_text": "Austin PC, Stuart EA. Moving towards best practice when using inverse probability of treatment weighting (IPTW) using the propensity score to estimate causal treatment effects in observational studies. Statistics in Medicine. 2015;34(28):3661-3679.",
        "year": 2015,
        "authors_short": "Austin & Stuart",
        "notes": "Best-practice guidance for assessing balance in the weighted pseudo-population, including weighted SMDs and effective sample size for IPTW."
      },
      {
        "role": "explain",
        "doi": "10.1214/09-STS313",
        "url": "https://doi.org/10.1214/09-STS313",
        "citation_text": "Stuart EA. Matching methods for causal inference: a review and a look forward. Statistical Science. 2010;25(1):1-21.",
        "year": 2010,
        "authors_short": "Stuart",
        "notes": "Frames balance checking as the diagnostic that should drive iteration on the design (matching/weighting) before any outcome model is fit, and discusses appropriate balance measures."
      },
      {
        "role": "demonstrate",
        "doi": "10.1093/aje/kwq198",
        "url": "https://doi.org/10.1093/aje/kwq198",
        "citation_text": "Stürmer T, Rothman KJ, Avorn J, Glynn RJ. Treatment effects in the presence of unmeasured confounding: dealing with observations in the tails of the propensity score distribution—a simulation study. American Journal of Epidemiology. 2010;172(7):843-854.",
        "year": 2010,
        "authors_short": "Stürmer et al.",
        "notes": "Demonstrates that trimming the PS tails to improve overlap/balance changes the target population and the estimand—the trade-off the balance table must be read against."
      },
      {
        "role": "use",
        "doi": "10.1093/aje/kwv254",
        "url": "https://doi.org/10.1093/aje/kwv254",
        "citation_text": "Hernán MA, Robins JM. Using big data to emulate a target trial when a randomized trial is not available. American Journal of Epidemiology. 2016;183(8):758-764.",
        "year": 2016,
        "authors_short": "Hernán & Robins",
        "notes": "Anchors the requirement that baseline eligibility and the covariate window be fixed at time zero, and that the population whose balance is assessed match the estimand of the emulated trial."
      }
    ],
    "plain_language_summary": "When a study isn't randomized, the people who got the new drug and the people who got the comparator usually differ before treatment even starts (one group is older, sicker, etc.), and those differences can fake or hide a treatment effect. The baseline ('Table 1') lays out each pre-treatment characteristic side by side for both groups and asks one question: are the groups similar enough to compare fairly? The standard yardstick is the standardized mean difference (SMD), a same-scale gap measure where roughly below 0.10 counts as 'close enough.' One honest caveat: a balanced table only proves the groups match on the things you measured, never on the things you didn't.",
    "key_terms": [
      {
        "term": "baseline characteristics",
        "definition": "The traits each patient already had before treatment started (age, sex, existing conditions), measured in a window before their first dose."
      },
      {
        "term": "standardized mean difference (SMD)",
        "definition": "A gap between the two groups divided by their typical spread, so any characteristic is on one comparable scale regardless of its original units."
      },
      {
        "term": "pooled standard deviation",
        "definition": "A single 'typical spread' for a characteristic, combining the spread from both groups, used as the denominator of the SMD."
      },
      {
        "term": "covariate",
        "definition": "Any baseline characteristic you compare across groups, such as age or whether the patient has diabetes."
      },
      {
        "term": "weighting",
        "definition": "Giving some patients more or less mathematical 'vote' so the two groups end up looking alike on the measured characteristics, like rebalancing a scale."
      },
      {
        "term": "treatment channeling",
        "definition": "When doctors tend to steer a particular type of patient toward one drug, which is why the untreated comparison group starts out different."
      }
    ],
    "worked_example": {
      "scenario": "We are comparing a new diabetes drug (treated arm, 500 patients) against an older comparator (500 patients) in a claims database. Before we trust any outcome comparison, we build Table 1 on three baseline characteristics measured before each patient's first fill: age, percent female, and percent with a pre-existing diabetes complication. For each one we compute an SMD to see whether the groups are comparable, then repeat after applying weights designed to make the groups match.",
      "dataset": {
        "caption": "A baseline ('Table 1') summary an analyst would actually report: group means/percentages plus the SMD column, shown before and after weighting.",
        "columns": [
          "covariate",
          "treated",
          "comparator",
          "pooled_SD",
          "SMD_before",
          "SMD_after_weighting"
        ],
        "rows": [
          [
            "age (years, mean)",
            "61",
            "67",
            "11.51",
            "-0.52",
            "-0.04"
          ],
          [
            "female (%)",
            "48%",
            "52%",
            "0.50",
            "-0.08",
            "-0.04"
          ],
          [
            "diabetes complication (%)",
            "30%",
            "45%",
            "0.48",
            "-0.31",
            "-0.04"
          ]
        ]
      },
      "steps": [
        "The SMD for a continuous covariate is (mean in treated minus mean in comparator) divided by the pooled standard deviation, the combined typical spread of the two groups.",
        "Work the age row before weighting: the treated group averages 61 years, the comparator 67. The two group spreads (SD 11 and 12) combine into a pooled SD of sqrt((11^2 + 12^2)/2) = sqrt(132.5) = 11.51 years.",
        "So the age SMD is (61 - 67) / 11.51 = -6 / 11.51 = -0.52. The sign just says treated is younger; we read the size, 0.52.",
        "Apply the balance rule: an absolute SMD below 0.10 is the convention for 'close enough.' Before weighting, age (0.52) and the diabetes complication (0.31) both blow past 0.10, so those groups are NOT comparable, while percent female (0.08) is already under the line.",
        "After weighting, recompute on the rebalanced groups. Age becomes (64 - 64.5) / 11.51 = -0.04, and the diabetes complication shrinks to 0.04. All three covariates are now below 0.10."
      ],
      "result": "Before weighting, the groups were imbalanced on age (|SMD| = 0.52) and diabetes complication (|SMD| = 0.31) and balanced on sex (|SMD| = 0.08) - a clear sign of treatment channeling toward younger, healthier patients. After weighting, all three covariates fall below the 0.10 cut-off (age -0.04, female -0.04, diabetes -0.04), so the table now passes the measured-balance check. Important caveat: this only certifies the three covariates we measured; it says nothing about unmeasured differences like frailty."
    },
    "prerequisites": [
      "estimands-ate-att-intercurrent-events-rwe",
      "propensity-score-methods-psm-iptw",
      "washout-clean-lookback-period-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Unadjusted (design) Table 1",
        "description": "Describes the source population and exposes treatment channeling and the eligibility cascade before any adjustment. The magnitude of the unadjusted SMDs is itself a finding about confounding.",
        "data_source_notes": "claims: include lookback HCRU, cost, medication classes and enrollment length; EHR: include missingness indicators for labs/stage as their own rows."
      },
      {
        "name": "Matched balance table (ATT)",
        "description": "Standardized differences computed in the matched (treated-anchored) sample after 1:1 or k:1 PS matching; report the number/fraction of treated unmatched, because the matched estimand shifts toward the region of overlap.",
        "edge_cases": [
          "Never use significance tests as the balance criterion; use SMD and variance ratios.",
          "Report how many treated patients were discarded and how the matched cohort differs from the source."
        ]
      },
      {
        "name": "Weighted balance table (IPTW/ATE, SMR/ATT, or overlap/ATO)",
        "description": "Weighted SMDs and variance ratios computed in the pseudo-population the weights define; always accompanied by effective sample size. Under exact overlap (ATO) weights, covariates in the PS balance by construction.",
        "edge_cases": [
          "A few extreme IPTW weights can balance the table while collapsing effective sample size—report ESS, not just SMDs.",
          "Check balance on prognostic covariates and transformations excluded from the PS, not only those forced to balance."
        ]
      },
      {
        "name": "High-dimensional balance summary",
        "description": "For hundreds of claims-derived covariates, summarize with a Love plot and max/median SMD, then list the top residual imbalances by name so they can be acted on."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Omnibus / global balance metric (PS c-statistic, single Mahalanobis distance)",
        "pros_of_this": "Per-covariate standardized differences are interpretable and actionable—they name the residually imbalanced confounder so the design can be revised.",
        "cons_of_this": "A long per-covariate table is harder to summarize at a glance and invites multiplicity if read as tests.",
        "when_to_prefer": "For reporting and design decisions; use a global summary only to compress a high-dimensional table (Love plot, max/median SMD), never to replace it."
      },
      {
        "compared_to": "Formal balance hypothesis testing (per-covariate t/chi-square, Hotelling's T-squared)",
        "pros_of_this": "Standardized differences are sample-size-independent and decision-relevant; they do not conflate imbalance with statistical power.",
        "cons_of_this": "Lacks a single p-value some reviewers still expect.",
        "when_to_prefer": "Essentially always in large RWE—balance significance tests are explicitly discouraged."
      },
      {
        "compared_to": "Relying on outcome-model covariate adjustment to absorb residual imbalance",
        "pros_of_this": "Certifying balance by design means the outcome model only interpolates within overlap rather than extrapolating across non-overlap.",
        "cons_of_this": "Demands an explicit balancing step (matching/weighting/restriction) and can shrink the cohort.",
        "when_to_prefer": "Whenever overlap is poor or the outcome model is at risk of misspecification—prefer design over modeling."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Prior utilization, cost, medication classes, and diagnosis/procedure counts over a fixed lookback are usually stronger confounding proxies than comorbidity flags. Require complete FFS enrollment across the lookback and exclude Medicare Advantage-only person-time, which lacks FFS claims and produces artifactually clean baselines; report the MA share by arm.",
      "ehr": "Report and balance missingness indicators for labs, vitals, stage, ECOG, smoking, and PROs as their own Table 1 rows; missingness is often informative (a marker of health status or site). Prefer multiple imputation over complete-case tables, which restrict to the most worked-up patients.",
      "registry": "Use clinical staging and biomarker detail (often the true confounders); link to claims for baseline HCRU and to a death index. Voluntary enrollment selects on prognosis, so balance within the registry may not transport.",
      "linked": "Linkage selects the linkable subset and introduces order/fill/service date discrepancies that can push covariates across time zero; reconcile dates and report balance both in the linked subset and against the unlinked source."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\nimport pandas as pd\n\n\ndef _wmean_wvar(x, w):\n    m = np.average(x, weights=w)\n    v = np.average((x - m) ** 2, weights=w)  # weighted (biased) variance; adequate for SMD\n    return m, v\n\n\ndef standardized_diff(x, treated, weight=None, binary=False):\n    x = np.asarray(x, dtype=float)\n    a = np.asarray(treated)\n    w = np.ones(len(x)) if weight is None else np.asarray(weight, dtype=float)\n    m1, v1 = _wmean_wvar(x[a == 1], w[a == 1])\n    m0, v0 = _wmean_wvar(x[a == 0], w[a == 0])\n    if binary:\n        # proportions: pooled SD from p(1-p), not the raw variance\n        denom = np.sqrt((m1 * (1 - m1) + m0 * (1 - m0)) / 2)\n    else:\n        denom = np.sqrt((v1 + v0) / 2)\n    smd = np.nan if denom == 0 else (m1 - m0) / denom\n    vratio = np.nan if (binary or v0 == 0) else v1 / v0  # variance ratio only meaningful for continuous\n    return smd, vratio\n\n\ndef effective_sample_size(weight):\n    w = np.asarray(weight, dtype=float)\n    return (w.sum() ** 2) / np.sum(w ** 2)  # ESS; falls sharply with extreme weights\n\n\ndef balance_table(df, treated_col, covariates, binary_cols=(), weight_col=None):\n    w = df[weight_col].to_numpy() if weight_col else None\n    rows = []\n    for c in covariates:\n        smd, vr = standardized_diff(df[c], df[treated_col], w, binary=c in set(binary_cols))\n        rows.append({\"covariate\": c, \"abs_smd\": abs(smd), \"variance_ratio\": vr,\n                     \"balanced\": abs(smd) < 0.10})\n    out = pd.DataFrame(rows).sort_values(\"abs_smd\", ascending=False)\n    if weight_col:  # report ESS by arm so a balanced table is not read in isolation\n        for arm in (1, 0):\n            ess = effective_sample_size(df.loc[df[treated_col] == arm, weight_col])\n            out.attrs[f\"ess_arm_{arm}\"] = ess\n    return out\n\n\n# balance_table(cohort, \"treated\",\n#               covariates=[\"age\", \"lookback_cost\", \"n_inpatient\", \"prior_pad\", \"missing_hba1c\"],\n#   
            binary_cols=[\"prior_pad\", \"missing_hba1c\"],\n#               weight_col=\"overlap_weight\")",
        "description": "Weighted/unweighted standardized differences and variance ratios for a balance table and Love plot. Required input:\n  df : one row per patient with\n       treated      : 1 = study arm, 0 = comparator (assigned at time zero)\n       weight       : analytic weight (IPTW/SMR/overlap); pass None or all-ones for the unadjusted table\n       <covariates> : continuous (e.g., age, lookback_cost) and binary (0/1) baseline covariates,\n                      measured only in [index_date - lookback, index_date]\nCompute the table BEFORE fitting any outcome model and in the population your estimand defines (weights set the\npopulation). Binary covariates use the proportion-based SMD; continuous use the pooled-SD SMD.",
        "dependencies": [
          "pandas",
          "numpy"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(WeightIt)\nlibrary(cobalt)\nlibrary(tableone)\n\ncovs <- c(\"age\", \"sex\", \"lookback_cost\", \"n_inpatient\", \"prior_pad\", \"missing_hba1c\")\n\n# Estimate overlap (ATO) weights from a high-dimensional-style PS model.\nw.out <- weightit(treated ~ age + sex + lookback_cost + n_inpatient + prior_pad + missing_hba1c,\n                  data = df, method = \"glm\", estimand = \"ATO\")\n\n# Balance table: weighted + unweighted SMDs, variance ratios, KS distance, and effective sample size.\nbal <- bal.tab(w.out, stats = c(\"mean.diffs\", \"variance.ratios\", \"ks.statistics\"),\n               un = TRUE, disp = c(\"means\"), thresholds = c(m = 0.10, v = 2))\nprint(bal)                       # includes Adjusted/Unadjusted ESS rows\nlove.plot(w.out, stats = \"mean.diffs\", thresholds = c(m = 0.10),\n          abs = TRUE, var.order = \"unadjusted\")\n\n# Companion stratified Table 1 with SMDs (no significance tests in large RWE).\nprint(CreateTableOne(vars = covs, strata = \"treated\", data = df, test = FALSE), smd = TRUE)",
        "description": "Production balance table and Love plot with cobalt/WeightIt. Required input:\n  df : one row per patient with `treated` (0/1), baseline covariates measured in the pre-index window, and an analytic\n       weight column when assessing a weighted estimand. cobalt computes weighted SMDs, variance ratios, KS statistics,\n       and effective sample size, and threshold-flags imbalance; it is the de facto standard for RWE balance reporting.",
        "dependencies": [
          "cobalt",
          "WeightIt",
          "tableone"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* ---- Path A: 1:1 PS matching with built-in balance diagnostics (ATT) ---- */\nproc psmatch data=work.cohort region=allobs;\n  class treated prior_pad missing_hba1c sex;\n  psmodel treated(treated='STUDY') = age sex lookback_cost n_inpatient prior_pad missing_hba1c;\n  match method=greedy(k=1) distance=lps caliper=0.2;     /* 0.2 SD of the logit-PS caliper */\n  assess lps var=(age lookback_cost n_inpatient prior_pad)   /* std diff + variance ratio, pre/post */\n         / weight=none plots=(stddiff boxplot);\n  output out(obs=match)=matched matchid=mid;\nrun;\n\n/* ---- Path B: balance in a WEIGHTED pseudo-population (IPTW/SMR/overlap) ---- */\n/* work.cohort.w = precomputed analytic weight; compute weighted means/variances by arm,    */\n/* then SMD = (m1 - m0) / sqrt((v1 + v0)/2). SURVEYMEANS honors the WEIGHT for the variance. */\nproc surveymeans data=work.cohort mean var;\n  class treated;\n  domain treated;\n  var age lookback_cost n_inpatient prior_pad missing_hba1c;\n  weight w;\nrun;\n\n/* Effective sample size by arm (a balanced weighted table is meaningless if ESS collapsed). */\nproc sql;\n  select treated,\n         (sum(w))**2 / sum(w*w) as ess\n  from work.cohort\n  group by treated;\nquit;",
        "description": "Balance assessment with PROC PSMATCH (matching) and weighted PROC SURVEYMEANS (weighting). Required input:\n  work.cohort : one row per patient with treated ('STUDY'/'COMPARATOR' or 1/0), baseline covariates from the pre-index\n                window, and (for the weighted path) a precomputed analytic weight `w`. PROC PSMATCH ASSESS reports the\n                standardized mean difference and variance ratio before and after matching and produces balance plots;\n                SAS/STAT 14.2+ is required. Confirm |std diff| < 0.10 before fitting any outcome model.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Raw[Eligible cohort at time zero<br/>covariates from baseline window only] --> T1u[Unadjusted Table 1<br/>SMD, variance ratio by arm]\n  T1u -->|Channeling visible| Adj[Balancing step matched to estimand]\n  Adj --> Match[1:1 / k:1 PS matching -> ATT]\n  Adj --> IPTW[IPTW -> ATE]\n  Adj --> ATO[Overlap weights -> equipoise/ATO]\n  Match --> T1a[Adjusted Table 1 in the estimand population]\n  IPTW --> T1a\n  ATO --> T1a\n  T1a --> Chk{All key SMD < 0.10,<br/>variance ratios ~1,<br/>ESS acceptable?}\n  Chk -->|Yes| Model[Fit outcome model<br/>+ negative controls for unmeasured confounding]\n  Chk -->|No| Redesign[Enrich PS / restrict / re-specify comparator / re-examine overlap]\n  Redesign --> Adj\nstyle Redesign fill:#fee2e2\nstyle Model fill:#dcfce7\nstyle Chk fill:#fef9c3",
        "caption": "Balance assessment is an iterative design gate, computed in the population the estimand defines, that must pass before the outcome model is fit. A passing table addresses measured—not unmeasured—confounding.",
        "alt_text": "Flowchart from the eligible cohort through an unadjusted Table 1, an estimand-specific balancing step (matching, IPTW, or overlap weighting), an adjusted Table 1, and a decision gate that loops back to redesign if balance or effective sample size is inadequate before fitting the outcome model.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "used_with",
        "target_slug": "active-comparator-new-user",
        "notes": "The baseline balance table is the acceptance diagnostic that certifies whether an ACNU cohort achieved plausible measured exchangeability before the outcome model."
      },
      {
        "relation_type": "used_with",
        "target_slug": "propensity-score-methods-psm-iptw",
        "notes": "Balance diagnostics (weighted/matched SMDs, variance ratios, ESS) are the mandatory check after PS matching, IPTW, or overlap weighting; assess in the population the weights define."
      },
      {
        "relation_type": "used_with",
        "target_slug": "marginal-structural-models-g-methods",
        "notes": "After IPTW (including time-varying treatment/censoring weights), weighted balance must be checked at baseline and, for time-varying exposures, at each interval."
      },
      {
        "relation_type": "requires",
        "target_slug": "estimands-ate-att-intercurrent-events-rwe",
        "notes": "Balance must be assessed in the population the estimand targets—ATE (full weighted sample), ATT (treated-anchored), or ATO (overlap)—so the diagnostic matches the effect being reported."
      },
      {
        "relation_type": "see_also",
        "target_slug": "missing-data-trimming-winsorization-rwe",
        "notes": "Weight trimming/winsorization to improve overlap changes the target population and effective sample size; Table 1 and ESS should display that shift rather than hide it."
      }
    ],
    "aliases": [
      "Table 1",
      "baseline table",
      "covariate balance",
      "standardized mean difference",
      "SMD",
      "love plot",
      "balance diagnostics",
      "baseline characteristics"
    ],
    "complexity": "foundational",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "bayesian-inference-foundations",
    "name": "Bayesian Inference Foundations",
    "short_definition": "A statistical framework that treats unknown parameters as probability distributions rather than fixed unknowns, combining a prior belief about a parameter with the likelihood of the observed data via Bayes' theorem to produce a posterior distribution — the complete post-data summary of uncertainty — that enables direct probability statements such as \"there is an 89% posterior probability the event rate exceeds 0.25,\" the interpretation most analysts want but that frequentist confidence intervals cannot deliver.",
    "long_description": "**Core theorem: prior × likelihood → posterior**\n\nBayesian inference rests on a single mechanistic rule. Before seeing data, the analyst\nencodes existing knowledge — from prior studies, expert opinion, theoretical constraints,\nor deliberate agnosticism — as a probability distribution p(θ) over the unknown parameter\nθ. This is the *prior*. The data D inform via the *likelihood* L(θ; D) = p(D | θ): the\nprobability of observing exactly these data if θ were the true value. Bayes' theorem\ncombines both:\n\n  posterior p(θ | D)  ∝  prior p(θ)  ×  likelihood p(D | θ)\n\nThe ∝ symbol (\"proportional to\") hides only a normalizing constant — a number that makes\nthe posterior integrate to 1. That constant does not depend on θ and does not affect\ninference about θ. The posterior is the complete statement of post-data knowledge: it\nassigns a probability to every interval of plausible θ values.\n\nIn a finite conjugate update (e.g., the beta-binomial pair described below), the posterior\nis computed in closed form with no numerical integration. In all other cases — logistic or\nsurvival regressions with Bayesian priors, hierarchical multi-site models, models with\ncorrelated parameters — the posterior is approximated by *Markov chain Monte Carlo* (MCMC).\n\n**Credible interval vs confidence interval — the interpretation most people want**\n\nThe single most practically important distinction between Bayesian and frequentist inference\nis in the interval that summarizes uncertainty. 
A *95% credible interval* (CrI) is an\ninterval [a, b] such that the posterior probability that θ ∈ [a, b] is 0.95; the equal-tailed\nversion runs from the 2.5th to the 97.5th posterior percentile.\nThe CrI is a direct probability statement about the parameter: \"Given the prior and the\ndata, I am 95% certain the true event rate is between 0.13 and 0.44.\"\n\nThis is precisely the interpretation most scientists, clinicians, and decision-makers apply\nwhen they see a confidence interval, and it is wrong for a frequentist confidence interval.\nA 95% CI is a statement about a *procedure*, not about a specific interval: if the study were\nrepeated infinitely many times and a CI constructed each time, 95% of those intervals would\ncontain the true fixed value of θ. For any single realized interval, θ either is or is not\ninside it; there is no 95% probability. The parameter θ is fixed and unknown; only the\ninterval is random in the frequentist framework.\n\nWhen sample sizes are large and priors are weak (non-informative), the numerical difference\nbetween a 95% CrI and a 95% CI is negligible. When samples are small, priors are informative,\nor parameters are on the boundary of their support, the two can diverge substantially. In\nrare-disease RWE settings with n < 50, large-sample CI approximations can be unreliable, whereas\nthe CrI propagates the uncertainty in the prior and the likelihood, conditional on the assumed\nmodel, into a single coherent statement.\n\n**Conjugate priors and the beta-binomial example**\n\nA *conjugate prior* is a prior distribution family that, combined with a specific likelihood,\nyields a posterior in the same family, enabling exact closed-form updating. 
The canonical\nexample for proportions and event rates is the *beta-binomial* pair:\n\n  Prior: Beta(α₀, β₀)   [event rate p has prior mean α₀/(α₀+β₀)]\n  Data: k events in n Bernoulli trials   [likelihood: Binomial(n, p)]\n  Posterior: Beta(α₀ + k, β₀ + n − k)   [exact; no MCMC required]\n\nThe shape parameters α and β carry the interpretation of pseudo-counts: α₀ is the number\nof prior \"events\" and β₀ is the number of prior \"non-events,\" giving the prior an effective\nsample size (ESS) of α₀ + β₀. The posterior simply adds observed counts to the pseudo-counts.\nThe Jeffreys non-informative prior for a binomial proportion is Beta(0.5, 0.5), which contributes\nonly 1 pseudo-observation. The beta-distribution entry in this catalog explains how to fit\nBeta(α, β) from a published mean and SE (method of moments), which is a common way to construct\ninformative priors from published literature.\n\n**Prior choice: informative, weakly informative, skeptical, and flat**\n\nThe prior encodes what is known before the data are seen. Four classes arise in practice:\n\n- *Informative prior*: encodes genuine external knowledge. A Beta(2, 8) prior on an event\n  rate says \"I believe the rate is around 0.20 and I have 10 prior pseudo-observations of\n  evidence.\" Legitimate when external evidence is well-documented, transportable to the current\n  patient population, and pre-specified before analysis.\n- *Weakly informative prior*: rules out physically impossible or substantively absurd values\n  while remaining nearly flat over the scientifically plausible range. A Beta(1.5, 1.5) prior on a\n  proportion or a Normal(0, 1) prior on a log-odds are examples: they keep the MCMC sampler out of\n  extreme regions while pulling the posterior only mildly toward the center of the range. 
Weakly\n  informative priors are the default recommendation for hierarchical models in RWE.\n- *Skeptical prior*: deliberately weights against large treatment effects, raising the\n  evidentiary bar. A Normal(0, 0.355²) on log-OR places about 95% of prior mass on OR between\n  0.5 and 2.0 (exp(±1.96 × 0.355) ≈ 0.50 and 2.0). Regulators and HTA reviewers increasingly\n  expect a prespecified skeptical prior to accompany the primary analysis in confirmatory\n  Bayesian submissions.\n- *Flat / non-informative prior*: places approximately equal mass across the parameter range.\n  Truly flat priors are mathematically improper on unbounded parameters and are not invariant\n  to reparameterization. In large samples a flat prior yields posteriors nearly identical to\n  frequentist maximum-likelihood results, making it a natural bridge when the goal is Bayesian\n  machinery with minimal prior impact.\n\n*Prior sensitivity analysis is mandatory reporting.* Every Bayesian submission should include\nresults under at least three prior specifications spanning from non-informative to the\nanalyst's preferred informative prior. Robustness of the conclusion across this grid means the\nprior is not load-bearing; strong sensitivity means the data are insufficient to overcome prior\nbeliefs, which is itself a substantive finding that must be disclosed. Regulatory agencies\n(for example, FDA's 2010 guidance on Bayesian statistics in medical device clinical trials) and\nHTA bodies (NICE DSU Technical Support Documents) expect prior sensitivity analyses for\nBayesian submissions.\n\n**MCMC and convergence diagnostics**\n\nMost practical Bayesian models do not have conjugate closed forms. *Markov chain Monte Carlo*\n(MCMC), which includes Metropolis-Hastings, Gibbs sampling, and the modern No-U-Turn Sampler\n(NUTS) used in Stan, PyMC, and brms, approximates the posterior by constructing a Markov chain\nwhose stationary distribution is the target posterior. 
After a burn-in (warmup) phase where the\nchain finds the high-probability region and tunes its step size, the retained draws are\ntreated as samples from the posterior. Convergence diagnostics exist not to prove convergence\n(that cannot be proven) but to detect non-convergence: the R-hat (potential scale reduction\nfactor) should be < 1.01 for every parameter; the MCMC effective sample size (the number of\neffectively independent draws, not to be confused with the prior ESS above) should exceed 400\nin the bulk and tail. Trace plots should look like well-mixed \"hairy caterpillars\": no\ntrends, no stuck chains, no chains drifting apart. Always report these diagnostics. A\nposterior from a non-converged chain is meaningless regardless of how sophisticated the model.\nUse Stan/brms (R) or PyMC (Python); they implement NUTS and report diagnostics automatically.\n\n**Where Bayes earns its keep in RWE/HEOR**\n\nBayesian inference is not universally superior to frequentist inference; it earns its additional\nmodeling and computational cost in specific settings:\n\n- *Rare-disease historical borrowing*: the beta-binomial conjugate update (or a\n  meta-analytic-predictive, MAP, prior) lets external control data informatively augment a small\n  concurrent trial with a quantifiable, pre-specified discount. 
See the Bayesian borrowing entry in this catalog.\n- *Probabilistic sensitivity analysis in cost-effectiveness*: PSA is already implicitly Bayesian;\n  it propagates parameter uncertainty through a decision model by drawing from distributions.\n  Embedding the PSA in a formal Bayesian framework (parameters as posterior draws from a\n  hierarchical model) gives coherent posterior probabilities over cost-effectiveness thresholds.\n- *Sequential safety monitoring*: Bayesian adaptive designs accumulate posterior probability of\n  harm continuously and can trigger early stopping when P(harm exceeds threshold | data) crosses\n  a pre-specified gate. The posterior itself needs no adjustment for interim looks, although the\n  design's frequentist operating characteristics should still be checked by simulation.\n- *Hierarchical shrinkage across sites*: a Bayesian hierarchical model partially pools\n  site-specific estimates toward the overall mean, shrinking noisy small-site estimates and\n  reducing false-discovery rates vs independently estimated site effects. This is closely\n  related to empirical Bayes / random-effects meta-analysis but with explicit priors on\n  the between-site variance.\n- *Bias priors in probabilistic bias analysis*: unmeasured confounding can be modeled by\n  placing a prior over bias parameters (strength of the confounder-outcome association,\n  confounder prevalence in exposed vs unexposed), propagating the resulting uncertainty into\n  adjusted effect estimates. See the probabilistic bias analysis entry in this catalog.\n\n**Frequentist-Bayes pragmatism (calibrated Bayes)**\n\nThe tribal framing of \"Bayesian vs frequentist\" is not useful in applied RWE/HEOR. Most\nproduction analyses use both paradigms at different layers. A Cox proportional hazards model\nis frequentist; adding Normal(0, σ²) priors on the regression coefficients and sampling via\nMCMC makes it Bayesian. 
*Calibrated Bayes* (Rubin 1984) is the pragmatic synthesis: use\nBayesian machinery where it offers genuine advantages (informative priors, hierarchical\nstructure, exact small-sample uncertainty, direct probability statements), and frequentist\nmethods where they are adequate (large samples, no genuine prior, regulatory conventions\nrequiring frequentist operating characteristics). The key sanity check: a Bayesian posterior\ninterval should have approximately correct frequentist coverage when the model is well\nspecified. When Bayesian and frequentist results agree, the prior is not carrying load.\nWhen they disagree, the prior is doing real work and must be justified explicitly.\n\n**Pros, cons, and trade-offs**\n\n*Pros of Bayesian inference:*\n- Direct probability statements about parameters: \"There is a 94% posterior probability the\n  drug reduces 12-month mortality by more than 5 percentage points.\"\n- Natural, principled incorporation of prior information in small-sample and rare-disease\n  settings; the prior ESS is a transparent, reportable quantity.\n- Coherent uncertainty propagation through complex hierarchical models without ad hoc\n  variance adjustments.\n- Posterior statements need no multiplicity adjustment for pre-specified interim looks in\n  adaptive designs (frequentist operating characteristics must still be checked; see the cons).\n- The 95% credible interval has the interpretation practitioners routinely and incorrectly\n  apply to confidence intervals.\n\n*Cons and limitations:*\n- Prior specification adds a modeling choice that must be pre-specified, justified, and\n  sensitivity-analyzed; a poorly chosen or post-hoc prior can bias inference in opaque ways.\n- Computational cost: MCMC can be slow for large RWE datasets or models with many parameters;\n  convergence must be verified and can fail.\n- Frequentist operating characteristics (type-I error, power) must still be verified by\n  simulation for regulatory submissions. The Bayesian framework does not by itself guarantee\n  error control; decision thresholds must be pre-specified and calibrated.\n- 
Regulatory reviewers trained in frequentist methods may require additional justification;\n  acceptance is increasing but uneven across jurisdictions.\n\n**When to use**\n\n- Small samples where prior information is genuine and documentable (rare disease, pediatric\n  extrapolation, single-arm trials augmented by historical controls).\n- Hierarchical models with parameter sharing across sites, subgroups, or indications.\n- Sequential adaptive designs where interim analyses are planned and a Bayesian stopping rule\n  is clinically motivated and pre-specified in the SAP.\n- PSA frameworks where coherent posterior propagation of parameter uncertainty is preferred.\n- When a direct probability statement about the parameter is the primary deliverable for a\n  payer or HTA body (e.g., P(cost-effective at WTP $150k/QALY)).\n\n**When NOT to use — and when it is actively misleading**\n\n- *Do not choose the prior to reach a desired result.* Prior specification must be locked in\n  the analysis plan before unblinding. Post-hoc prior tuning to achieve statistical significance\n  is data dredging with additional obfuscation. Pre-specify the prior ESS and sensitivity grid.\n- *Do not skip prior sensitivity analysis.* Reporting only the result under one prior — even\n  a nominally \"non-informative\" flat prior — hides sensitivity. Every Bayesian submission should\n  include at least three prior specifications from non-informative to the planned informative\n  prior.\n- *Do not interpret a high posterior probability as a causal claim.* A 98% posterior\n  probability that a treatment coefficient is positive is an associational statement about the\n  model, not proof of causation. 
Causal claims require the same identification assumptions\n  regardless of inferential framework.\n- *Do not use flat priors on variance components in hierarchical models.* A Uniform(0, ∞)\n  prior on a between-site standard deviation is improper and causes pathological posterior\n  behavior with few groups. Use a weakly informative Half-Normal or Half-Cauchy prior.\n- *Bayesian machinery does not fix a confounded design.* Sophisticated MCMC sampling of a\n  misspecified or confounded model yields a precise posterior of a wrong quantity.\n\n**Interpreting the output**\n\nFrom the worked example: prior Beta(2,8) (mean 0.2, equivalent to 10 pseudo-observations);\nobserve 6 events in 20 trials; posterior Beta(8,22), mean 8/30 ≈ 0.267.\n\n*(1) Formal interpretation.* The posterior Beta(8,22) is a probability distribution over the\nunknown event rate p. The posterior mean 8/30 ≈ 0.267 is a weighted compromise between the\nprior mean (2/10 = 0.2) and the data MLE (6/20 = 0.3), with weights proportional to the\nprior information (10 pseudo-observations) and the data (20 observations):\n(10 × 0.2 + 20 × 0.3)/30 = 8/30. The total information in the posterior is 8+22 = 30\npseudo-observations. A 95% equal-tailed credible interval from Beta(8,22) spans approximately\n[0.13, 0.44]: the posterior probability that the true event rate p lies in this interval is\nabout 0.95, a direct probability statement rather than a repeated-sampling claim. This\nposterior is exact because of conjugacy; no MCMC is required for this model.\n\n*(2) Practical interpretation.* Before seeing the new data, we believed the event rate was\naround 20%, with the strength of 10 observations. We then observed 30% in 20 patients.\nOur best updated estimate is approximately 26.7%: not 30%, because the prior pulled\nus back toward 20%; not 20%, because the new data pulled us upward. A decision-maker should\nnote that the prior is doing real work: if its 10 pseudo-observations came from a comparable\npopulation, the shrinkage is appropriate. 
If the prior population differed (different disease\nstage, earlier era), the posterior is biased toward the prior, and a sensitivity analysis\nunder Beta(0.5, 0.5) — contributing only 1 pseudo-observation — should accompany the primary\nresult to show how much the conclusion depends on the informative prior.",
    "primary_category": "Inferential_Statistics",
    "tags": [
      "bayesian",
      "statistics",
      "foundations",
      "prior",
      "posterior",
      "credible-interval",
      "MCMC",
      "conjugate-prior",
      "beta-binomial",
      "rare-disease",
      "hierarchical-model",
      "probabilistic-sensitivity-analysis"
    ],
    "applies_to_study_types": [
      "cohort_retrospective",
      "cohort_prospective",
      "rct",
      "registry_study",
      "single_arm_external_control",
      "rare_disease_study",
      "claims_analysis"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "primary",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1136/bmj.317.7166.1151",
        "url": "https://doi.org/10.1136/bmj.317.7166.1151",
        "citation_text": "Bland JM, Altman DG. Statistics notes: Bayesians and frequentists. BMJ. 1998;317(7166):1151-1160.",
        "year": 1998,
        "authors_short": "Bland & Altman",
        "notes": "Accessible BMJ Statistics Notes entry contrasting Bayesian and frequentist approaches for clinical researchers; explains the probability-of-the-parameter interpretation of the credible interval and why it differs from the frequentist confidence interval."
      },
      {
        "role": "explain",
        "doi": "10.1093/ije/dyi312",
        "url": "https://doi.org/10.1093/ije/dyi312",
        "citation_text": "Greenland S. Bayesian perspectives for epidemiological research: I. Foundations and basic methods. International Journal of Epidemiology. 2006;35(3):765-775.",
        "year": 2006,
        "authors_short": "Greenland",
        "notes": "Rigorous epidemiological treatment of Bayesian foundations including prior specification, likelihood construction, and posterior interpretation; covers the calibrated-Bayes pragmatism that motivates combining Bayesian and frequentist approaches in RWE practice."
      },
      {
        "role": "demonstrate",
        "doi": "10.1214/ss/1177011136",
        "url": "https://doi.org/10.1214/ss/1177011136",
        "citation_text": "Gelman A, Rubin DB. Inference from iterative simulation using multiple sequences. Statistical Science. 1992;7(4):457-472.",
        "year": 1992,
        "authors_short": "Gelman & Rubin",
        "notes": "Introduces the R-hat (potential scale reduction factor) convergence diagnostic for MCMC chains; the foundational reference for MCMC convergence assessment now implemented by default in Stan, PyMC, and brms."
      }
    ],
    "plain_language_summary": "Bayesian inference is a method for updating beliefs in light of new evidence: you start with a \"prior\" — a probability distribution capturing what you already know about some unknown number (like a drug's event rate) — then multiply it by the probability of the data you just observed to get a \"posterior,\" your revised belief. The key practical advantage is that the result is a genuine probability statement about the unknown number itself, for example \"there is an 89% chance the true rate is above 0.25,\" which is what most people intend when they report a confidence interval but which frequentist statistics cannot technically say. In real-world evidence and health economics, Bayesian methods are especially useful when data are sparse (rare disease, small trials) and existing evidence can be formally incorporated, or when a cost-effectiveness model needs to propagate all sources of uncertainty coherently.",
    "key_terms": [
      {
        "term": "prior",
        "definition": "A probability distribution over an unknown parameter that encodes what is believed about it before the current data are analyzed; in Bayesian updating it is combined with the data's likelihood to form the posterior."
      },
      {
        "term": "likelihood",
        "definition": "The probability of observing the actual data you collected, as a function of the unknown parameter; it tells you how much the data favor each possible parameter value."
      },
      {
        "term": "posterior",
        "definition": "The updated probability distribution over the unknown parameter after combining the prior with the likelihood; it is the complete summary of knowledge about the parameter given both the prior belief and the observed data."
      },
      {
        "term": "credible interval",
        "definition": "A Bayesian uncertainty range [a, b] such that the posterior probability the true parameter falls inside it equals the stated coverage (e.g., 95%); unlike a confidence interval, this is a direct probability statement about the parameter, not about a repeated-sampling procedure."
      },
      {
        "term": "conjugate prior",
        "definition": "A prior distribution family that, when combined with a specific likelihood, produces a posterior in the same family, allowing exact closed-form updating without numerical integration; the beta distribution is the conjugate prior for a binomial event rate."
      },
      {
        "term": "prior effective sample size",
        "definition": "The number of real observations that the prior is informationally equivalent to; a Beta(2,8) prior has effective sample size 10, meaning it carries as much weight as 10 observed data points when updating the posterior."
      }
    ],
    "worked_example": {
      "scenario": "A health outcomes analyst is studying an adverse event rate for a drug used in a rare pediatric disease. From published literature, the event rate is believed to be around 20%, based on evidence roughly equivalent to 10 patients. The analyst encodes this as a prior Beta(2,8) — 2 prior events and 8 prior non-events. A new registry study then observes 6 adverse events among 20 patients. The analyst wants to compute the updated (posterior) event rate using the exact beta-binomial conjugate formula, and confirm it lies between the prior belief (20%) and the data estimate (30%).",
      "dataset": {
        "caption": "Summary counts feeding the beta-binomial conjugate update. Each row represents one source of information: prior belief (encoded as pseudo-counts) and new observed data.",
        "columns": [
          "source",
          "pseudo_events",
          "pseudo_nonevents",
          "total_n",
          "proportion"
        ],
        "rows": [
          [
            "prior_Beta_2_8",
            2,
            8,
            10,
            0.2
          ],
          [
            "new_registry_data",
            6,
            14,
            20,
            0.3
          ],
          [
            "posterior_Beta_8_22",
            8,
            22,
            30,
            0.267
          ]
        ]
      },
      "steps": [
        "Step 1 — Identify prior parameters. Prior Beta(2,8): alpha_0 = 2 prior events, beta_0 = 8 prior non-events. Prior mean = 2/(2+8) = 2/10 = 0.2.",
        "Step 2 — Read the new data. Observed k = 6 events in n = 20 trials. Data maximum-likelihood estimate = 6/20 = 0.3, which is higher than the prior mean of 0.2.",
        "Step 3 — Compute the posterior alpha parameter by adding observed events to the prior event count: alpha_post = 2+6 = 8.",
        "Step 4 — Compute the posterior beta parameter by adding observed non-events to the prior non-event count: n-k = 20-6 = 14, so beta_post = 8+14 = 22.",
        "Step 5 — Verify the posterior total pseudo-count (effective sample size): alpha_post + beta_post = 8+22 = 30, which equals the prior ESS (10) plus new data (20), confirming 10+20 = 30.",
        "Step 6 — Compute the posterior mean. Posterior mean = 8/30 ≈ 0.267. This lies between the prior mean (0.2) and the data MLE (0.3). The data (20 observations) outweigh the prior (10 pseudo-observations) by 2 to 1, pulling the posterior mean two-thirds of the way from 0.2 toward 0.3."
      ],
      "result": "Posterior Beta(8,22): prior mean 2/10 = 0.2, data MLE 6/20 = 0.3, posterior mean 8/30 ≈ 0.267. The posterior ESS = 8+22 = 30, composed of 10 prior pseudo-observations plus 20 real data observations. The 95% credible interval from Beta(8,22) is approximately [0.12, 0.45] — the posterior probability the true event rate lies in this range is exactly 0.95. Because the posterior mean (0.267) falls between the prior (0.2) and the data (0.3), it is correctly shrunk toward the prior in proportion to how informative the prior was relative to the new data."
    },
    "prerequisites": [
      "inferential-statistics-foundations",
      "beta-distribution"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Closed-form conjugate update (beta-binomial)",
        "description": "When the prior is Beta(alpha, beta) and the likelihood is Binomial, the posterior is exactly Beta(alpha+k, beta+n-k) with no numerical integration. This is the fastest, most transparent, and most auditable form of Bayesian updating; it requires no MCMC and produces an exact posterior with known quantile functions available in all statistical packages.",
        "edge_cases": [
          "The prior effective sample size (alpha + beta) must be pre-specified and justified before analysis — choosing it retrospectively to widen or narrow the posterior is analytic manipulation.",
          "If the observed event rate differs substantially from the prior mean, the posterior is sensitive to the prior ESS: a larger ESS pulls the posterior harder toward the prior. Always report the prior sensitivity analysis (e.g., Jeffreys Beta(0.5,0.5) vs the planned informative prior)."
        ],
        "data_source_notes": "Claims and registry: compute the observed event count k and denominator n from the cohort definition; verify the endpoint algorithm's PPV/sensitivity before treating the claims-derived count as if it were an adjudicated count for the likelihood."
      },
      {
        "name": "MCMC posterior sampling (non-conjugate models)",
        "description": "For logistic regression, survival models, or hierarchical multi-site models with Bayesian priors — any model where the posterior does not have a closed form — MCMC via Stan/brms (R) or PyMC (Python) draws samples from the posterior. PROC MCMC or the BAYES statement in PROC GENMOD handles this in SAS. MCMC results are valid only after confirming convergence: R-hat < 1.01, Bulk ESS > 400, Tail ESS > 400 for all parameters.",
        "edge_cases": [
          "Do not set weakly informative priors and assume they are equivalent to non-informative ones — verify by running a prior predictive check: simulate data from the prior alone and confirm the implied outcomes span the plausible scientific range.",
          "Large EHR or claims datasets (> 500k rows) can make MCMC slow; consider subsampling, variational inference, or Laplace approximation for exploratory work, then validate with full MCMC before submission."
        ],
        "data_source_notes": "EHR and linked data: large datasets accelerate MCMC (more data = tighter likelihood) but also make priors less influential. At very large n, verify that the prior is not accidentally swamped and that the posterior interval matches the frequentist CI closely (calibration check)."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Frequentist hypothesis testing (p-values and confidence intervals)",
        "pros_of_this": "Produces direct posterior probability statements about parameters and enables credible intervals with the interpretation practitioners actually want; naturally incorporates informative prior knowledge; no multiplicity penalty for pre-specified interim analyses.",
        "cons_of_this": "Requires prior specification that must be justified and sensitivity-analyzed; MCMC adds computational overhead and a convergence-checking step; frequentist operating characteristics (type-I error, power) must be verified separately by simulation for regulatory settings.",
        "when_to_prefer": "Prefer Bayesian inference when data are sparse, prior information is genuine and documentable, a direct probability statement is the deliverable, or the design requires adaptive interim analyses. Prefer frequentist when sample sizes are large, no genuine prior exists, and regulatory conventions require frequentist error control."
      },
      {
        "compared_to": "Penalized maximum likelihood (ridge, LASSO)",
        "pros_of_this": "Bayesian hierarchical models with shrinkage priors are the principled version of regularization: the prior ESS is interpretable and reportable, the posterior quantifies uncertainty rather than producing a single point estimate, and the shrinkage is calibrated to the actual between-group variation in a multi-site model.",
        "cons_of_this": "Penalized ML is faster, more familiar to most applied analysts, and requires no MCMC; for prediction tasks (not inference) LASSO or elastic net with cross-validated lambda may achieve similar shrinkage with less overhead.",
        "when_to_prefer": "Prefer Bayesian hierarchical shrinkage when uncertainty quantification and site-level estimates are needed alongside the overall effect; prefer penalized ML for pure prediction pipelines where calibrated posterior uncertainty is not the primary output."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Compute observed event count and denominator from the claims cohort (continuous enrollment, washout, first-event discipline). Apply the conjugate update or PROC GENMOD BAYES on aggregate counts. Verify that the claims-algorithm endpoint PPV is high enough that the observed k is a reliable surrogate for the adjudicated event count; low PPV systematically biases the likelihood and hence the posterior.",
      "ehr": "EHR event counts are encounter-driven and miss out-of-system events; prefer linkage to claims or a death registry before using EHR data as the likelihood input for Bayesian updating. Hierarchical models with site-level random effects are well-suited for multi-site EHR data with varying follow-up completeness.",
      "registry": "Registry data often provide the cleanest adjudicated event counts for the likelihood. When the registry era pre-dates current standard of care, the historical count may reflect a different baseline rate than the current period — encode this uncertainty via an appropriately wide prior or a prior sensitivity analysis over a range of plausible historical rates.",
      "primary": "In a small-n pilot or rare-disease trial, the primary data are the main likelihood input. With n < 50, the posterior is highly sensitive to the prior ESS; always report results under the planned informative prior and under the Jeffreys Beta(0.5,0.5) prior so the reader can see how much the conclusion depends on the prior.",
      "linked": "Linked claims-EHR-registry data provide the richest likelihood but also the most complex model structure. Bayesian hierarchical models are well-suited for jointly modeling linkage uncertainty (not-missing-at-random indicators), site effects, and the primary treatment contrast."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\nfrom scipy import stats\n\n# ── Beta-binomial conjugate update (exact posterior, no MCMC) ─────────────────\nalpha_0, beta_0 = 2, 8          # prior Beta(2,8): mean = 2/(2+8) = 0.2; ESS = 10\nk, n            = 6, 20         # observe 6 events in 20 trials\n\nalpha_post = alpha_0 + k        # 2 + 6 = 8\nbeta_post  = beta_0 + (n - k)  # 8 + (20 - 6) = 22\n\nprior     = stats.beta(alpha_0, beta_0)\nposterior = stats.beta(alpha_post, beta_post)\n\nprior_mean = alpha_0 / (alpha_0 + beta_0)             # 2/10 = 0.200  (exact)\ndata_mle   = k / n                                     # 6/20 = 0.300  (exact)\npost_mean  = alpha_post / (alpha_post + beta_post)     # 8/30 ≈ 0.267\n\nprint(f\"Prior     Beta({alpha_0}, {beta_0}):  mean = {prior_mean:.4f}\")\nprint(f\"Data MLE: {k}/{n} = {data_mle:.4f}\")\nprint(f\"Posterior Beta({alpha_post}, {beta_post}): mean = {post_mean:.4f}\")\n\n# 95% credible interval — direct posterior probability statement\ncri_lo, cri_hi = posterior.ppf(0.025), posterior.ppf(0.975)\nprint(f\"95% credible interval: [{cri_lo:.3f}, {cri_hi:.3f}]\")\nprint(\"Interpretation: P(event rate in [{:.3f}, {:.3f}] | data) = 0.95\".format(cri_lo, cri_hi))\n\n# Direct posterior probability query: P(p > 0.25 | data)\nprob_q = 1 - posterior.cdf(0.25)\nprint(f\"P(event rate > 0.25 | prior + data): {prob_q:.3f}\")\n\n# Prior sensitivity: Jeffreys Beta(0.5, 0.5) (non-informative)\na_jeff = 0.5 + k\nb_jeff = 0.5 + (n - k)\npost_jeff = stats.beta(a_jeff, b_jeff)\nprint(f\"\\nSensitivity — Jeffreys prior Beta(0.5,0.5):\")\nprint(f\"  Posterior Beta({a_jeff}, {b_jeff}): mean = {a_jeff/(a_jeff+b_jeff):.4f}\")\nprint(f\"  95% CrI: [{post_jeff.ppf(0.025):.3f}, {post_jeff.ppf(0.975):.3f}]\")\n\n# ── PyMC sketch: Bayesian logistic regression (non-conjugate) ─────────────────\n# import pymc as pm\n# import arviz as az\n# with pm.Model():\n#     # Weakly informative priors on log-odds scale\n#     intercept = pm.Normal(\"intercept\", mu=0, 
sigma=2)\n#     beta_treat = pm.Normal(\"beta_treat\", mu=0, sigma=1)   # skeptical prior\n#     beta_age   = pm.Normal(\"beta_age\",   mu=0, sigma=1)\n#     p = pm.math.invlogit(intercept + beta_treat * treat + beta_age * age)\n#     obs = pm.Bernoulli(\"obs\", p=p, observed=y)\n#     idata = pm.sample(2000, tune=2000, target_accept=0.9,\n#                       chains=4, random_seed=1)\n# # Convergence diagnostics — MANDATORY before interpreting posterior\n# assert az.summary(idata)[\"r_hat\"].max() < 1.01, \"R-hat >= 1.01: chain not converged\"\n# print(az.summary(idata, var_names=[\"beta_treat\"], hdi_prob=0.95))\n# # Direct posterior probability: P(treatment increases rate)\n# print(\"P(beta_treat > 0 | data):\", float((idata.posterior[\"beta_treat\"] > 0).mean()))",
        "description": "Exact beta-binomial conjugate update using scipy.stats.beta, demonstrating the\nprior-to-posterior arithmetic from the worked example (prior Beta(2,8), observe 6\nevents in 20 trials, posterior Beta(8,22)). Includes posterior mean, 95% credible\ninterval, and a direct posterior probability query. Also shows a PyMC sketch for\nnon-conjugate Bayesian logistic regression with convergence diagnostic checks.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "# ── Beta-binomial conjugate update (exact posterior, base R) ──────────────────\nalpha_0 <- 2;  beta_0 <- 8    # prior Beta(2,8): mean = 2/(2+8) = 0.2; ESS = 10\nk <- 6;  n <- 20              # observe 6 events in 20 trials\n\nalpha_post <- alpha_0 + k              # 2 + 6 = 8\nbeta_post  <- beta_0 + (n - k)        # 8 + 14 = 22\n\nprior_mean <- alpha_0 / (alpha_0 + beta_0)            # 2/10 = 0.200\ndata_mle   <- k / n                                    # 6/20 = 0.300\npost_mean  <- alpha_post / (alpha_post + beta_post)   # 8/30 ≈ 0.267\n\ncat(sprintf(\"Prior     Beta(%d,%d): mean = %.4f\\n\", alpha_0, beta_0, prior_mean))\ncat(sprintf(\"Data MLE: %d/%d = %.4f\\n\", k, n, data_mle))\ncat(sprintf(\"Posterior Beta(%d,%d): mean = %.4f\\n\", alpha_post, beta_post, post_mean))\n\n# 95% credible interval — P(theta in CrI | data) = 0.95 exactly\ncri <- qbeta(c(0.025, 0.975), alpha_post, beta_post)\ncat(sprintf(\"95%% credible interval: [%.3f, %.3f]\\n\", cri[1], cri[2]))\n\n# Direct posterior probability: P(p > 0.25 | data)\ncat(sprintf(\"P(event rate > 0.25 | data): %.3f\\n\",\n            1 - pbeta(0.25, alpha_post, beta_post)))\n\n# Prior sensitivity: Jeffreys Beta(0.5, 0.5) — non-informative\na_jeff <- 0.5 + k;  b_jeff <- 0.5 + (n - k)\ncat(sprintf(\"\\nSensitivity check — Jeffreys prior Beta(0.5, 0.5):\\n\"))\ncat(sprintf(\"  Posterior Beta(%.1f, %.1f): mean = %.4f\\n\",\n            a_jeff, b_jeff, a_jeff / (a_jeff + b_jeff)))\ncat(sprintf(\"  95%% CrI: [%.3f, %.3f]\\n\",\n            qbeta(0.025, a_jeff, b_jeff), qbeta(0.975, a_jeff, b_jeff)))\n\n# ── brms sketch: Bayesian logistic regression via Stan ────────────────────────\n# library(brms)\n#\n# fit <- brm(\n#   outcome ~ treatment + age + cci,\n#   data   = rwe_df,\n#   family = bernoulli(link = \"logit\"),\n#   prior  = c(\n#     prior(normal(0, 2), class = Intercept),\n#     prior(normal(0, 1), class = b, coef = treatment),  # skeptical prior\n#     prior(normal(0, 1), class = b)            
          # weakly informative\n#   ),\n#   chains = 4, iter = 2000, warmup = 1000, seed = 1,\n#   backend = \"cmdstanr\"   # faster; requires cmdstanr + CmdStan\n# )\n#\n# # MANDATORY convergence check before interpreting results\n# stopifnot(max(rhat(fit), na.rm = TRUE) < 1.01)\n# cat(\"All R-hat < 1.01: no evidence of non-convergence.\\n\")\n# summary(fit)  # posterior means, 95% CrI, Bulk ESS, Tail ESS for each parameter\n#\n# # Direct posterior probability: P(treatment coefficient > 0 | data)\n# h <- hypothesis(fit, \"treatment > 0\")\n# print(h)   # Post.Prob column = posterior probability that treatment increases the log-odds",
        "description": "Exact beta-binomial conjugate update in base R, followed by a prior sensitivity\ncomparison (informative Beta(2,8) vs Jeffreys Beta(0.5,0.5)), and a brms sketch\nfor Bayesian logistic regression via Stan. All conjugate arithmetic reproduces the\nworked example. The brms block shows prior specification, the mandatory R-hat\nconvergence check, and how to extract direct posterior probability statements.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* ── Beta-binomial conjugate update (exact, DATA step) ─────────────────────── */\ndata work.bayes_update;\n  /* Prior Beta(2,8): prior mean = 2/(2+8) = 0.200, ESS = 10 pseudo-observations */\n  alpha_0   = 2;   beta_0    = 8;\n  k         = 6;   n         = 20;\n\n  /* Conjugate posterior                                                          */\n  alpha_post = alpha_0 + k;             /* 2 + 6 = 8                             */\n  beta_post  = beta_0  + (n - k);      /* 8 + 14 = 22                            */\n  post_total = alpha_post + beta_post;  /* 8 + 22 = 30                           */\n\n  prior_mean = alpha_0 / (alpha_0 + beta_0);        /* 2/10 = 0.200              */\n  data_mle   = k / n;                                /* 6/20 = 0.300              */\n  post_mean  = alpha_post / post_total;              /* 8/30 approx 0.267         */\n\n  /* 95% credible interval from Beta(8,22)                                        */\n  cri_lo = quantile('BETA', 0.025, alpha_post, beta_post);\n  cri_hi = quantile('BETA', 0.975, alpha_post, beta_post);\n\n  /* P(event rate > 0.25 | prior + data)                                          */\n  prob_gt_025 = 1 - cdf('BETA', 0.25, alpha_post, beta_post);\n\n  /* Sensitivity: Jeffreys Beta(0.5, 0.5) non-informative prior                   */\n  a_jeff = 0.5 + k;   b_jeff = 0.5 + (n - k);\n  post_mean_jeff = a_jeff / (a_jeff + b_jeff);\nrun;\n\nproc print data=work.bayes_update label noobs;\n  var prior_mean data_mle post_mean cri_lo cri_hi prob_gt_025 post_mean_jeff;\n  format prior_mean data_mle post_mean cri_lo cri_hi prob_gt_025 post_mean_jeff 8.4;\n  label prior_mean     = \"Prior mean (Beta 2,8)\"\n        data_mle       = \"Data MLE\"\n        post_mean      = \"Posterior mean\"\n        cri_lo         = \"95% CrI lower\"\n        cri_hi         = \"95% CrI upper\"\n        prob_gt_025    = \"P(rate>0.25|data)\"\n        post_mean_jeff = \"Posterior mean (Jeffreys prior)\";\n  title 
\"Beta-binomial conjugate update: prior Beta(2,8) + 6 events in 20\";\nrun;\n\n/* ── PROC MCMC: Bayesian logistic regression for binary RWE outcome ─────────── */\n/* Requires work.rwe_data with variables: outcome (0/1), treat, age, cci.        */\n/*\nproc mcmc data=work.rwe_data nbi=2000 nmc=10000 seed=1 thin=2\n          monitor=(b_int b_treat b_age b_cci)\n          diagnostics=all;   * single-chain diagnostics: ESS, MCSE, Geweke, and related tests;\n  parms b_int 0  b_treat 0  b_age 0  b_cci 0;\n  * Weakly informative priors (normal on log-odds scale);\n  prior b_int   ~ normal(0, var=4);   * Normal(0, 2^2);\n  prior b_treat ~ normal(0, var=1);   * Normal(0, 1^2) skeptical;\n  prior b_age   ~ normal(0, var=1);\n  prior b_cci   ~ normal(0, var=1);\n  p_event = logistic(b_int + b_treat*treat + b_age*age + b_cci*cci);\n  model outcome ~ binomial(n=1, p=p_event);\n  * After run: verify ESS > 400 for all parameters. A multi-chain PSRF (R-hat) check\n    needs separate runs with different seeds and initial values;\nrun;\n*/\n\n/* ── PROC GENMOD BAYES: faster alternative for standard GLMs ────────────────── */\n/* The BAYES statement adds MCMC sampling to a standard GENMOD call.             */\n/*\nproc genmod data=work.rwe_data descending;   * model P(outcome=1) rather than the default P(outcome=0);\n  class treat;\n  model outcome = treat age cci / dist=binomial link=logit;\n  * Zero-mean normal prior with variance 1 on every coefficient.\n    Check convergence (DIAGNOSTICS=ALL includes Gelman-Rubin) before trusting posterior summaries;\n  bayes seed=1 nmc=10000\n        coeffprior=normal(var=1)\n        diagnostics=all\n        plots=none;\nrun;\n*/",
        "description": "Exact beta-binomial conjugate update in a DATA step, producing posterior mean,\n95% credible interval, and a direct posterior probability using SAS built-in\ndistribution functions. Also includes a PROC MCMC block for Bayesian logistic\nregression and a PROC GENMOD BAYES block as a faster GLM alternative. The PROC MCMC\noutput includes R-hat (labeled PSRF) and ESS diagnostics; check these before\ninterpreting any posterior summary.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Prior[Prior p(theta)\\nencodes existing knowledge\\ne.g. Beta_2_8 mean=0.20] --> Bayes[Bayes' theorem\\nposterior prop prior x likelihood]\n  Data[Observed data\\ne.g. 6 events in 20 trials\\ndata MLE = 0.30] --> Bayes\n  Bayes --> Post[Posterior p(theta|data)\\nBeta_8_22 mean approx 0.267]\n  Post --> CrI[95% credible interval\\ndirect probability statement\\nP(theta in CrI) = 0.95]\n  Post --> Prob[Direct probability queries\\nP(rate > 0.25 | data) = ?]\n  Post --> Sens[Prior sensitivity analysis\\nrepeat under flat and skeptical priors\\nmandatory reporting]\n  Sens --> Report[Report: posterior is robust or\\nprior is load-bearing - say which]",
        "caption": "Bayesian inference pipeline: prior and data combine via Bayes' theorem to yield a posterior; outputs are credible intervals, direct probability queries, and a mandatory prior sensitivity grid.",
        "alt_text": "Flowchart from prior and observed data into Bayes' theorem and then into a posterior distribution, which supports credible intervals, direct probability statements, and prior sensitivity analysis.",
        "source_type": "illustrative",
        "source_citations": [
          "bland-altman-1998"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "used_with",
        "target_slug": "beta-distribution",
        "notes": "The beta distribution is the conjugate prior for a binomial proportion; the beta-binomial update in this entry (prior Beta(2,8) + data -> posterior Beta(8,22)) is the canonical worked example of conjugacy, and the beta-distribution entry explains how to fit Beta(alpha,beta) from a published mean and SE for use as an informative prior."
      },
      {
        "relation_type": "see_also",
        "target_slug": "borrowing-historical-controls-bayesian-rwe",
        "notes": "The flagship application of Bayesian inference in RWE: the beta-binomial conjugate update scales to MAP priors and robust MAP priors that formally borrow strength from historical or external control data in rare-disease trials, with quantifiable prior effective sample size."
      },
      {
        "relation_type": "see_also",
        "target_slug": "probabilistic-sensitivity-analysis-hea-rwe",
        "notes": "PSA in cost-effectiveness modeling is the health-economic instantiation of Bayesian uncertainty propagation: parameter distributions (beta for probabilities, gamma for costs) serve as priors or quasi-posteriors and are propagated through the decision model in the same spirit as the posterior draws described in this entry."
      },
      {
        "relation_type": "requires",
        "target_slug": "inferential-statistics-foundations",
        "notes": "Bayesian inference presupposes fluency with probability distributions, likelihood functions, and the contrast between point estimates and uncertainty intervals; the inferential statistics foundations entry covers these prerequisites before Bayesian posterior mechanics are introduced."
      },
      {
        "relation_type": "see_also",
        "target_slug": "unmeasured-confounding-probabilistic-bias-analysis-rwe",
        "notes": "Probabilistic bias analysis places prior distributions over unmeasured bias parameters (sensitivity, specificity, prevalence of the unmeasured confounder), propagating those priors through a bias model to produce a posterior distribution over the bias-adjusted effect estimate — a direct application of the prior-to-posterior logic in this entry."
      }
    ],
    "aliases": [
      "Bayes theorem",
      "posterior distribution",
      "beta-binomial update",
      "credible interval",
      "Bayesian credible interval",
      "prior-likelihood-posterior"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "benefit-carve-outs-medical-pharmacy-rwe",
    "name": "Benefit Carve-Outs in Claims Data",
    "short_definition": "A plan-design arrangement where services such as pharmacy, behavioral health, specialty drugs, fertility, dental, or vision are administered outside the main medical benefit, creating missingness or split observability in claims-based RWE.",
    "long_description": "A benefit carve-out occurs when a plan separates part of coverage from the main medical administrator and routes it to a different vendor or benefit manager. Pharmacy benefits are the classic example, but behavioral health, specialty pharmacy, fertility, dental, vision, transplant networks, and disease-management services can also be carved out.\n\nIn RWE, carve-outs are a data-completeness problem. A medical-claims file can look complete while missing all retail pharmacy fills. A pharmacy file can miss infused buy-and-bill drugs under the medical benefit. A behavioral-health carve-out can erase diagnoses, visits, and utilization needed to measure psychiatric comorbidity or outcomes. Missing carved-out services can be differential by employer, plan, state, calendar year, or patient subgroup.\n\nCarve-outs should be handled at the data-source-fitness stage. The analyst should identify benefit channels needed for the research question, confirm they are present, and censor, exclude, stratify, or sensitivity-test periods where the needed channel is absent. Treating carve-out missingness as \"no utilization\" is usually wrong.\n\n**Pros, cons, and trade-offs.** Carve-out flags can prevent the most damaging claims-data error: confusing missing benefit data with no service use. They make adherence, persistence, cost, behavioral-health, fertility, and specialty-drug analyses more honest. The trade-off is that channel completeness is often observed at plan, employer, or vendor level rather than patient level, so analysts may need conservative exclusions or person-month eligibility rules that reduce sample size.\n\n**When to use.** Use carve-out assessment whenever the exposure, outcome, utilization measure, or cost endpoint depends on a benefit channel that might be administered separately. 
Pharmacy, behavioral health, specialty pharmacy, dental/vision, fertility, transplant, and disease-management benefits should be checked explicitly when they matter to the estimand.\n\n**When NOT to use - and when it is actively misleading.** Do not infer nonadherence, no treatment, no behavioral-health care, or zero cost from an absent feed. Do not pool carved-out and integrated benefit periods without a data-fitness rule. It is actively misleading to report pharmacy PDC, exposure initiation, or total cost when the required benefit channel is absent for a subset of person-time.",
    "primary_category": "Data_Quality_Assessment",
    "tags": [
      "carve-out",
      "carve-in",
      "pharmacy-benefit",
      "behavioral-health",
      "claims-completeness",
      "pbm",
      "benefit-design"
    ],
    "applies_to_study_types": [
      "claims_analysis",
      "adherence",
      "cost_analysis",
      "comparative_effectiveness"
    ],
    "data_sources": [
      "claims"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.18553/jmcp.2020.26.10.1317",
        "url": "https://doi.org/10.18553/jmcp.2020.26.10.1317",
        "citation_text": "Parekh N, Vakharia N, Villani J, Bouchard J, Gagnon-Sanschagrin P, Mody R, Good CB. Effect of Carving in Pharmacy Benefits on Utilization and Costs. Journal of Managed Care & Specialty Pharmacy. 2020;26(10):1317-1324.",
        "year": 2020,
        "authors_short": "Parekh et al.",
        "notes": "Empirical study comparing carved-in and carved-out pharmacy benefit arrangements."
      },
      {
        "role": "explain",
        "doi": "10.37765/ajmc.2021.88708",
        "url": "https://doi.org/10.37765/ajmc.2021.88708",
        "citation_text": "Lucas E, Liu M, Ouyang J, et al. Reduced Medical Spending Associated With Integrated Pharmacy Benefits. American Journal of Managed Care. 2021;27(7):e242-e247.",
        "year": 2021,
        "authors_short": "Lucas et al.",
        "notes": "Evidence that integrated pharmacy benefits can alter downstream medical spending, underscoring carve-in/carve-out observability and policy relevance."
      },
      {
        "role": "explain",
        "doi": null,
        "url": "https://bgca.milliman.com/en/insight/Pharmacy-benefits-Carvein-or-carveout",
        "citation_text": "Milliman. Pharmacy benefits: Carve-in or carve-out.",
        "year": 2026,
        "authors_short": "Milliman",
        "notes": "Actuarial overview of pharmacy benefit carve-in/carve-out tradeoffs."
      }
    ],
    "plain_language_summary": "A carve-out means one benefit is handled by a different vendor. If your database only has the main medical claims, a carved-out pharmacy or behavioral-health benefit may simply be missing.",
    "key_terms": [
      {
        "term": "Carve-out",
        "definition": "A benefit administered separately from the main health-plan administrator."
      },
      {
        "term": "Carve-in",
        "definition": "A benefit kept within the main medical plan or integrated administrator."
      },
      {
        "term": "Observability",
        "definition": "Whether the data source actually captures the services needed for the study question."
      }
    ],
    "worked_example": {
      "scenario": "A claims adherence study compares two oral oncology drugs. The medical file is complete, but one employer group carved out pharmacy to a PBM not included in the extract. Patients in that employer group appear to have no fills after initiation.",
      "dataset": {
        "caption": "Apparent adherence by benefit-channel capture.",
        "columns": [
          "employer_group",
          "medical_feed",
          "pharmacy_feed",
          "observed_fills",
          "correct_interpretation"
        ],
        "rows": [
          [
            "integrated plan",
            "present",
            "present",
            "6 fills",
            "pharmacy exposure observable"
          ],
          [
            "pharmacy carve-out",
            "present",
            "absent",
            "0 fills",
            "missing feed, not nonadherence"
          ]
        ]
      },
      "steps": [
        "Identify pharmacy-feed completeness before calculating PDC or persistence.",
        "Exclude or separately flag periods without pharmacy capture.",
        "Confirm buy-and-bill products in medical claims separately from retail/specialty pharmacy fills.",
        "Report benefit-channel requirements in the data-fitness section."
      ],
      "result": "The carved-out employer group is not counted as nonadherent; it is ineligible for the pharmacy-adherence endpoint until PBM data are linked."
    },
    "prerequisites": [],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Pharmacy carve-out",
        "description": "Retail or specialty pharmacy claims are administered outside the medical carrier.",
        "edge_cases": [
          "Oral drug exposure appears absent.",
          "Specialty fills may be split from retail fills."
        ],
        "data_source_notes": "Fatal for adherence/exposure studies unless the PBM feed is present."
      },
      {
        "name": "Behavioral-health carve-out",
        "description": "Mental-health and substance-use services are administered separately.",
        "edge_cases": [
          "Psychiatric comorbidity and utilization are under-captured.",
          "Outcomes can appear artificially low."
        ],
        "data_source_notes": "Important for CNS, pain, addiction, maternal, and pediatric studies."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Integrated benefit data",
        "use_carveout_flags_when": "Benefit capture varies by employer, plan, state, or year.",
        "use_integrated_data_when": "Medical, pharmacy, behavioral, and specialty channels are demonstrably complete.",
        "notes": "A carve-out is a missing-data mechanism, not a patient behavior."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Profile medical, pharmacy, behavioral-health, and specialty channel presence at person-month and plan-month levels."
    },
    "implementations": [],
    "diagrams": [],
    "relations": [
      {
        "relation_type": "requires",
        "target_slug": "fit-for-purpose-data-assessment-rwe",
        "notes": "Carve-outs are a central data-source-fitness issue."
      },
      {
        "relation_type": "see_also",
        "target_slug": "claims-analysis",
        "notes": "Claims analysis should verify complete medical and pharmacy channel capture."
      },
      {
        "relation_type": "see_also",
        "target_slug": "erisa-self-insured-health-plans-rwe",
        "notes": "Self-insured employers may use separate vendors for carved-out benefits."
      }
    ],
    "aliases": [
      "carve out",
      "carve-out",
      "benefit carveout",
      "pharmacy carve-out",
      "behavioral health carve-out",
      "specialty pharmacy carve-out",
      "carved-in benefit",
      "carved-out benefit"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "beta-distribution",
    "name": "Beta Distribution for Proportions and Utilities",
    "short_definition": "A continuous probability distribution defined on the open interval (0, 1) that is the standard tool in health-economics modelling for representing uncertainty about probabilities, event rates, adherence proportions, and QALY utility weights; its two shape parameters alpha and beta are fitted from a reported mean and standard error via the method of moments, and its conjugate relationship with the binomial likelihood makes it the natural Bayesian prior for proportion and rare-event probability estimation in PSA-driven cost-effectiveness models.",
    "long_description": "**What is the beta distribution?**\n\nThe beta distribution, written Beta(α, β) with shape parameters α > 0 and β > 0, is a\ncontinuous probability distribution whose support is strictly the open interval (0, 1). Its\nprobability density is proportional to x^(α−1) × (1−x)^(β−1), and the two parameters\ncontrol shape completely: when α = β = 1 the distribution is flat (uniform); when both\nexceed 1 it is unimodal and bell-shaped; when both are below 1 it is U-shaped. For\nhealth-economic modelling, the bell-shaped regime (α > 1, β > 1) is the usual case,\nrepresenting a parameter that is most plausibly near an interior value rather than at the\nextremes. The mean is α/(α+β) and the variance is αβ/((α+β)² × (α+β+1)), which implies\nthat the standard deviation of a beta-distributed quantity is fully determined by the mean\nand the total count α+β — a useful property when fitting from summary statistics.\n\n**Why [0,1] support matters in RWE**\n\nMany parameters that enter health-economic models are bounded between 0 and 1: event\nprobabilities (response rate, transition probability, mortality risk), adherence proportions\nsuch as the Proportion of Days Covered (PDC), QALY utility weights, and quality-of-life\nindex scores from instruments such as the EQ-5D-3L. The central reason to use a beta\ndistribution for these quantities — rather than a normal distribution — is that the normal\ndistribution has infinite support on both sides. In probabilistic sensitivity analysis\n(PSA), where thousands of random draws are taken from parameter distributions, a normal\ndistribution centred near 0 or 1 will routinely produce draws below 0 or above 1 for\nparameters near the boundaries. A utility of −0.12 or a transition probability of 1.08\nis not a legitimate model input and will either crash the model, trigger silent errors\nin transition matrices, or produce inadmissible cost-effectiveness results. 
The beta\ndistribution eliminates this risk by construction: every draw lies in (0, 1). Near the\ncentre of the parameter range, normal approximations often produce similar results; the\ndifference is most consequential for probabilities or utilities in the range [0.05, 0.15]\nor [0.85, 0.95], where the normal tail already overlaps the inadmissible region at\nrealistic standard errors.\n\n**Method of moments: fitting Beta(α, β) from a mean and standard error**\n\nIn practice, an analyst seldom has access to patient-level data to fit a beta distribution\nby maximum likelihood. The typical input is a published point estimate and its standard\nerror, extracted from a clinical trial, registry analysis, or systematic review. The method\nof moments provides a closed-form solution that requires no optimisation:\n\nStep 1 — convert SE to variance: var = SE².\nStep 2 — compute total shape: α + β = mean × (1 − mean) / var − 1.\nStep 3 — compute α: α = mean × (α + β).\nStep 4 — compute β: β = (α + β) − α = (1 − mean) × (α + β).\nStep 5 — verify: mean = α/(α + β) and var = αβ/((α+β)² × (α+β+1)).\n\nThis procedure is unambiguous whenever var < mean × (1 − mean), which must hold for any\nproper beta distribution. 
If the reported variance exceeds this bound — for example, because\na small study reported a wide confidence interval — the method of moments yields a\nnon-positive shape parameter, signalling that the beta family cannot represent these\nstatistics and that the analyst should either investigate the source of the variance estimate\nor consider a different distributional form.\n\n**The PSA workhorse: beta for probabilities and utilities**\n\nThe ISPOR-SMDM Modelling Good Research Practices Task Force (Briggs et al., 2012)\ncodified the distributional assignment rule now standard in submissions to NICE, CADTH,\nICER, and other HTA bodies: beta distributions for parameters bounded in [0, 1]\n(probabilities, utilities, adherence rates); gamma or log-normal for strictly positive\nparameters (costs, length of stay); log-normal for relative risks and hazard ratios;\nnormal only for parameters with genuine support across the real line (regression\ncoefficients on an unconstrained scale). Violating this mapping — for example, assigning\na normal distribution to a utility near 0.90 — produces PSA results where a nontrivial\nfraction of draws exceed 1 or fall below 0, and the expected net benefit estimator\nwill be biased. Every PSA parameter table in an HTA submission should declare the\ndistributional family for each parameter and justify it against this schema. 
In practice,\nthe two-parameter set that covers the vast majority of HTA parameter tables is beta\n(probabilities, utilities) and gamma (costs, resource use); the two distributions appear\ntogether across the published ISPOR good-modelling-practice literature.\n\n**Interpreting the output**\n\nConsider a fitted Beta(12, 3) for the utility parameter of a treatment responder, derived\nfrom an EQ-5D-3L registry analysis reporting mean utility 0.80 with SE 0.10.\n\n*Formal interpretation.* The mean of Beta(12, 3) is 12/(12+3) = 12/15 = 0.80,\nreproducing the observed mean exactly by construction of the method-of-moments fit. The\ndistribution encodes *parameter uncertainty* (specifically, the second-order uncertainty\nabout the true population mean utility), not patient-to-patient heterogeneity in utility\nscores. In a PSA second-order Monte Carlo loop, each of the 5,000 iterations draws a\nsingle candidate value u_i from Beta(12, 3); this u_i is treated as the true utility for\nthat iteration and all patients in that model run are assigned that utility. The 95%\ncredible interval is read from the distribution's quantiles: qbeta(0.025, 12, 3) ≈ 0.57\nand qbeta(0.975, 12, 3) ≈ 0.95. Interpreted through a conjugate Bayesian lens, Beta(12, 3)\ncorresponds to having observed 12 pseudo-successes and 3 pseudo-failures in a prior dataset\nof 15 observations; the shape parameters play the role of pseudo-counts\nwhen the beta is used as a prior for a binomial proportion.\n\n*Practical interpretation.* In plain language: we believe the average utility of a\ntreatment responder is around 0.80, but given our data we cannot rule out values as low as\nroughly 0.57 or as high as roughly 0.95. Each PSA iteration draws one candidate truth from\nthis range. 
The width of the draw distribution directly contributes to the width of the\ncost-effectiveness acceptability curve and to the expected value of perfect information.\n\n*The critical confusion: parameter uncertainty vs patient heterogeneity.* This is the most\ncommon PSA error in submitted HTA models, and it is worth stating explicitly. The beta\ndistribution in PSA encodes uncertainty about the mean parameter value for the population,\nnot individual patient variation around that mean. In a cohort Markov model, the correct\nPSA procedure is to draw u_i once from Beta(12, 3) per PSA iteration, then apply u_i\nuniformly to all patients in that model run. Drawing a different beta-distributed value\nfor each simulated patient in a cohort model is a conceptual error that inflates PSA\nvariance and will draw criticism from NICE Technical Support Documents and CADTH reviewers.\nIn a microsimulation or discrete-event simulation that explicitly models individual\npatients, patient-level heterogeneity is captured by a separate distributional component\nover patients; the PSA outer loop remains a loop over uncertainty about the mean parameter.\nThese two sources of variation — parameter uncertainty and patient heterogeneity — are\ndistinct and must not be conflated.\n\n**Bayesian conjugacy: beta prior + binomial likelihood → beta posterior**\n\nThe beta distribution is the conjugate prior for the binomial likelihood. If the prior on\na probability p is Beta(α₀, β₀) and n independent Bernoulli trials produce k successes,\nthe posterior is exactly Beta(α₀ + k, β₀ + n − k) — no numerical integration required.\nThis conjugacy is valuable in rare-event settings. Suppose a disease has an annual event\nprobability of approximately 0.03 from a published meta-analysis, and a new registry study\nobserves 2 events in 50 person-years. A naive maximum-likelihood estimate of 2/50 = 0.04\nis unstable at this sample size. 
Using the published estimate to set a Beta(3, 97) prior\n(encoding a prior mean of 3/100 = 0.03 with 100 pseudo-observations), the posterior after\nobserving 2 events in 50 trials is Beta(5, 145), with posterior mean 5/150 ≈ 0.033 — a\nreasonable shrinkage toward the prior that respects existing evidence. This framework also\nallows the analyst to explicitly represent the strength of prior belief and to show HTA\nreviewers how sensitive results are to the prior specification.\n\n**Beta regression for bounded continuous outcomes**\n\nWhen the outcome variable itself is bounded in (0, 1) — rather than being a model\nparameter — beta regression (Ferrari & Cribari-Neto, 2004) is the appropriate regression\nframework. Beta regression models E[Y | X] via a link function (typically logit) on a\nlinear predictor and models the conditional distribution of Y as beta, respecting the\n[0, 1] boundary without transformation. Applications in RWE and HEOR include: regressing\nPDC on patient characteristics, drug class, and plan design while preserving the adherence\nproportion scale; regressing EQ-5D index scores on disease severity and comorbidities\nwithout the boundary artifacts that arise from OLS near 0 and 1; and analysing health-plan\nquality metrics (HEDIS rates) as continuous proportions. Beta regression requires outcomes\nstrictly in (0, 1); for outcomes with a mass at 0 or 1 (for example, complete non-initiation\nor perfect adherence), a zero-one-inflated beta (ZOIB) model adds mixture components at the\nboundaries. An alternative that is often pragmatically adequate when the fraction at the\nboundary is small is to recode exact 0 values to 0.001 and exact 1 values to 0.999, with\nthis choice documented in the analysis plan. 
Beta regression is available in the betareg\npackage in R and in the Smithson-Verkuilen (2006) formulation implemented in Stata and R;\na statsmodels BetaModel is available in Python.\n\n**Negative utilities: states worse than death**\n\nStandard utility weights from the EQ-5D can take negative values for health states judged\nworse than death; the UK EQ-5D-3L value set includes utilities as low as approximately\n−0.594. A raw negative utility breaks the [0, 1] support of the beta distribution. The\nstandard rescaling approach is to transform: u_rescaled = (u − u_min) / (u_max − u_min),\nwhere u_min is the minimum plausible utility (e.g., −0.594 for the UK EQ-5D-3L value set)\nand u_max = 1. The beta distribution is then fitted to u_rescaled, PSA draws are taken on\nthe rescaled scale, and each draw is back-transformed before entering the model. Analysts\nshould document the rescaling convention explicitly in the model technical report.\n\n**Pros, cons, and trade-offs**\n\n*Pros of the beta distribution:*\n- Strictly bounded to (0, 1), eliminating inadmissible PSA draws by construction.\n- Two-parameter family covering a wide range of shapes (symmetric, left-skewed, right-skewed,\n  near-uniform) depending on α and β, fitting any mean and variance satisfying the\n  feasibility constraint.\n- Conjugate with the binomial likelihood, enabling Bayesian updating in closed form with\n  no numerical integration.\n- Closed-form moments and quantile functions available in all statistical packages; PSA\n  draws require a single call (rbeta in R, scipy.stats.beta.rvs in Python, RAND('BETA')\n  in SAS).\n- ISPOR-SMDM endorsed distributional assignment for probabilities and utilities in HTA\n  submissions to NICE, CADTH, and other bodies.\n- Beta regression (Ferrari & Cribari-Neto 2004) extends the family to regression of\n  bounded continuous outcomes, with full R, Stata, and Python support.\n\n*Cons and limitations:*\n- Cannot accommodate exact 0 or 1 values without inflation 
extensions; boundary observations\n  require a ZOIB model or recoding, each of which adds a modelling decision to document.\n- Cannot directly represent negative utilities; a rescaling step is required before fitting\n  and after drawing, adding complexity and a parameter choice (u_min) to justify.\n- Method-of-moments fit requires var < mean × (1 − mean); published SEs that are\n  implausibly large relative to the mean will yield non-positive shape parameters.\n- In large individual-level datasets, fitting by method of moments discards distributional\n  shape information beyond mean and variance; maximum-likelihood or Bayesian estimation\n  from patient-level data is preferred when those data are available.\n- Beta regression can exhibit convergence issues with boundary-heavy data and requires\n  diagnostic inspection (randomised quantile residuals, pseudo-R²).\n\n**When to use**\n\n- *PSA parameter distributions for probabilities, utilities, and proportions*: whenever a\n  parameter is bounded in (0, 1) and the analyst has a published mean and SE, the beta\n  distribution via method of moments is the default ISPOR-endorsed choice for PSA.\n- *Bayesian prior on a proportion or event probability*: particularly valuable in rare-\n  event or small-sample settings where the maximum-likelihood estimate is unstable and\n  prior information from a meta-analysis or published registry can be encoded as Beta(α₀, β₀).\n- *Beta regression for bounded continuous outcomes* such as PDC, EQ-5D index scores, and\n  HEDIS rates, when the analyst wants to regress a [0, 1] outcome on covariates while\n  respecting the bounded nature of the outcome throughout modelling.\n- *Overdispersed count-denominator proportions*: the beta-binomial distribution (a beta\n  prior on the success probability of a binomial) handles extra-binomial variation in\n  site-level event rates, plan-level quality metrics, and similar grouped data.\n\n**When NOT to use**\n\n- *Exact 0 or 1 values present 
without inflation handling*: standard beta regression will\n  fail or produce degenerate estimates if a nontrivial fraction of observations are exactly\n  0 or 1. Use a zero-one-inflated beta model or document a pragmatic recode to (ε, 1−ε).\n- *Utilities below 0 that have not been rescaled*: the beta distribution cannot represent\n  negative values; fitting without the rescaling transformation described above will produce\n  nonsensical parameter estimates and PSA draws that cannot be back-transformed validly.\n- *Modelling binary outcomes themselves*: if each observation is a 0/1 event indicator,\n  the correct distributional model is binomial (for the count) or Bernoulli (for the\n  individual event), not beta. The beta distribution is for parameters (population\n  probabilities, proportions), not for individual 0/1 patient-level outcomes.\n- *Patient-level heterogeneity in a cohort Markov model*: drawing a different beta-\n  distributed utility for each simulated patient in a cohort model conflates parameter\n  uncertainty with patient heterogeneity, inflates PSA variance, and is a recognised\n  modelling error. Heterogeneity belongs in the model structure (subgroups, microsimulation\n  individual-level variation), not in the PSA parameter distributions.\n- *Outcomes with genuinely unbounded or positive-only support*: log-normal or gamma\n  distributions are the appropriate choice for costs, length of stay, and other strictly\n  positive right-skewed quantities; the beta distribution does not apply outside (0, 1).",
    "primary_category": "Inferential_Statistics",
    "tags": [
      "statistics",
      "primitive",
      "distributions",
      "utilities",
      "psa",
      "bayesian"
    ],
    "applies_to_study_types": [
      "cohort_retrospective",
      "cohort_prospective",
      "rct",
      "cross_sectional",
      "ehr_study",
      "registry_study",
      "claims_analysis"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "primary",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1080/0266476042000214501",
        "url": "https://doi.org/10.1080/0266476042000214501",
        "citation_text": "Ferrari SLP, Cribari-Neto F. Beta regression for modelling rates and proportions. Journal of Applied Statistics. 2004;31(7):799-815.",
        "year": 2004,
        "authors_short": "Ferrari & Cribari-Neto",
        "notes": "Foundational paper establishing beta regression as a natural regression framework for bounded continuous outcomes in (0,1); argues that OLS on logit-transformed proportions is hard to interpret on the original scale and tends to be heteroskedastic, and derives the beta regression log-likelihood, score equations, and link function options. Directly applicable to adherence proportions (PDC), EQ-5D index scores, and plan-level quality metrics in RWE."
      },
      {
        "role": "explain",
        "doi": "10.1016/j.jval.2012.04.014",
        "url": "https://doi.org/10.1016/j.jval.2012.04.014",
        "citation_text": "Briggs AH, Weinstein MC, Fenwick EAL, Karnon J, Sculpher MJ, Paltiel AD. Model parameter estimation and uncertainty: a report of the ISPOR-SMDM Modeling Good Research Practices Task Force-6. Value in Health. 2012;15(6):835-842.",
        "year": 2012,
        "authors_short": "Briggs et al.",
        "notes": "ISPOR-SMDM task force consensus statement establishing the canonical distributional assignment rule for PSA: beta for probabilities and utilities, gamma/log-normal for costs, normal for unconstrained parameters. The definitive reference for justifying the beta distribution in NICE, CADTH, and HTA submissions."
      },
      {
        "role": "demonstrate",
        "doi": "10.1037/1082-989X.11.1.54",
        "url": "https://doi.org/10.1037/1082-989X.11.1.54",
        "citation_text": "Smithson M, Verkuilen J. A better lemon squeezer? Maximum-likelihood regression with beta-distributed dependent variables. Psychological Methods. 2006;11(1):54-71.",
        "year": 2006,
        "authors_short": "Smithson & Verkuilen",
        "notes": "Systematic treatment of maximum-likelihood beta regression as an alternative to OLS or logit-transforming a bounded outcome; uses the mean-precision parameterisation (phi = alpha+beta) also found in betareg and Stata; addresses boundary values and goodness-of-fit diagnostics. Useful reference when fitting beta regression to PDC or EQ-5D outcomes with patient-level data."
      }
    ],
    "plain_language_summary": "The beta distribution is a flexible probability curve (usually bell-shaped in health-economics applications) that always stays between 0 and 1, making it the standard choice for representing uncertainty about probabilities, adherence rates, and quality-of-life scores in health-economics models. To set one up, you take a published mean and standard error for any rate or proportion, plug them into two simple formulas (the method of moments), and get two shape numbers that fully define the distribution. In probabilistic sensitivity analysis, where a model is run thousands of times with slightly different assumed values, drawing from a beta distribution guarantees the model never receives an impossible input like a probability of 1.08 or, for a weight bounded between 0 and 1, a value of −0.30; a normal distribution would routinely produce such errors for parameters near the 0 or 1 boundaries.",
    "key_terms": [
      {
        "term": "support [0,1]",
        "definition": "The range of values a beta-distributed variable can take — any real number between 0 and 1, but never outside it, which makes it safe for probabilities, adherence rates, and quality-of-life utility weights."
      },
      {
        "term": "alpha and beta parameters",
        "definition": "The two positive shape numbers (written α and β) that control where a beta distribution is centred and how spread out it is; larger values give a narrower, more confident distribution."
      },
      {
        "term": "method of moments",
        "definition": "A way to fit a statistical distribution using only a published mean and standard error rather than patient-level data, by matching the distribution's theoretical mean and variance to the reported values."
      },
      {
        "term": "conjugate prior",
        "definition": "A starting probability distribution for a parameter that, after combining with observed data, yields a posterior distribution of the same family — for the beta distribution, observing binary outcomes (events/non-events) keeps the posterior beta-shaped."
      },
      {
        "term": "probabilistic sensitivity analysis",
        "definition": "A technique in health-economics modelling that randomly draws plausible values for every uncertain parameter (typically 5,000 to 10,000 times) to show how much the overall cost-effectiveness conclusion might change depending on what the true values turn out to be."
      }
    ],
    "worked_example": {
      "scenario": "A health economist is building a Markov cost-effectiveness model for a new treatment in relapsing multiple sclerosis. One key input is the utility weight for a patient who responds to treatment, taken from a published EQ-5D-3L registry analysis that reported a mean utility of 0.80 with a standard error of 0.10 (from 120 patients). The analyst must fit a beta distribution to this parameter for PSA so that all 5,000 PSA iterations draw a plausible value between 0 and 1. Only the published summary statistics are available — no patient-level data — so the method of moments is used.",
      "dataset": {
        "caption": "Summary statistics from the published registry analysis. Only the mean and SE are available; these two numbers are the complete input to the method-of-moments fitting procedure.",
        "columns": [
          "parameter",
          "estimate_mean",
          "estimate_se",
          "source_n"
        ],
        "rows": [
          [
            "QoL utility (responder)",
            0.8,
            0.1,
            120
          ]
        ]
      },
      "steps": [
        "Step 1 — Convert SE to variance: var = 0.10*0.10 = 0.01.",
        "Step 2 — Compute the total shape parameter alpha+beta. The method-of-moments formula is mean*(1-mean)/var - 1 = 0.8*0.2/0.01 - 1 = 0.16/0.01 - 1 = 16 - 1 = 15. So alpha+beta = 15.",
        "Step 3 — Compute alpha from the mean: alpha = mean*(alpha+beta) = 0.8*15 = 12.",
        "Step 4 — Compute beta as the remainder: beta = (alpha+beta) - alpha = 15 - 12 = 3.",
        "Step 5 — Verify the mean: alpha/(alpha+beta) = 12/(12+3) = 12/15 = 0.80. Correct — the fitted distribution reproduces the published mean exactly.",
        "Step 6 — Verify the variance: alpha*beta/((alpha+beta)*(alpha+beta)*(alpha+beta+1)) = 12*3/(15*15*16) = 36/3600 = 0.01. This matches the target var = 0.10*0.10 = 0.01 — the fit is exact by construction of the method of moments.",
        "Step 7 — In PSA, call rbeta(5000, 12, 3) in R or scipy.stats.beta(12, 3).rvs(5000) in Python to draw 5,000 plausible utility values. All draws fall in (0,1). The 2.5th and 97.5th percentiles of Beta(12,3) are approximately 0.57 and 0.95, spanning the plausible range for this utility parameter in the model."
      ],
      "result": "Fitted distribution: Beta(alpha=12, beta=3). Theoretical mean = 12/(12+3) = 12/15 = 0.80. Variance = 12*3/(15*15*16) = 36/3600 = 0.01, matching target SE squared = 0.10*0.10 = 0.01. Every PSA draw lies in (0,1), eliminating inadmissible model inputs. The 95% interval of roughly [0.57, 0.95] conveys to HTA reviewers how much uncertainty about this single utility parameter feeds into the overall cost-effectiveness uncertainty."
    },
    "prerequisites": [
      "descriptive-statistics",
      "inferential-statistics-foundations"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [],
    "tradeoffs": [],
    "implementation_notes_by_data_source": {},
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\nfrom scipy import stats\n\n# ── Method-of-moments fit: Beta(alpha, beta) from mean and SE ──────────────────\nmean_util = 0.80\nse_util   = 0.10\nvar_util  = se_util ** 2                        # 0.10 ** 2 = 0.01\n\n# Feasibility check: var must be < mean*(1-mean)\nmax_var = mean_util * (1 - mean_util)           # 0.80 * 0.20 = 0.16\nif var_util >= max_var:\n    raise ValueError(\n        f\"var={var_util} >= mean*(1-mean)={max_var}; \"\n        \"Beta distribution cannot represent these statistics.\"\n    )\n\nalpha_plus_beta = mean_util * (1 - mean_util) / var_util - 1  # 0.16/0.01 - 1 = 15\nalpha = mean_util * alpha_plus_beta                             # 0.8 * 15 = 12\nbeta  = (1 - mean_util) * alpha_plus_beta                      # 0.2 * 15 = 3\n\nprint(f\"Fitted Beta({alpha:.4g}, {beta:.4g})\")\nprint(f\"  Theoretical mean:     {alpha / (alpha + beta):.4f}  (target: {mean_util})\")\nprint(f\"  Theoretical variance: \"\n      f\"{alpha * beta / ((alpha + beta)**2 * (alpha + beta + 1)):.6f}\"\n      f\"  (target: {var_util})\")\n\n# ── Distributional summary ──────────────────────────────────────────────────────\ndist = stats.beta(alpha, beta)\nprint(f\"  Mode:  {(alpha - 1) / (alpha + beta - 2):.4f}\")\nprint(f\"  95% CI: [{dist.ppf(0.025):.4f}, {dist.ppf(0.975):.4f}]\")\n\n# ── PSA Monte Carlo: 5,000 draws, all guaranteed in (0, 1) ───────────────────\nnp.random.seed(42)\npsa_draws = dist.rvs(5_000)\n\nassert psa_draws.min() > 0 and psa_draws.max() < 1, \"All PSA draws must be in (0,1)\"\nprint(f\"\\nPSA draws (n=5,000): mean={psa_draws.mean():.4f}, \"\n      f\"SD={psa_draws.std():.4f}, \"\n      f\"min={psa_draws.min():.4f}, max={psa_draws.max():.4f}\")\n\n# ── Bayesian conjugate update: Beta prior + binomial data -> Beta posterior ────\n# Prior: Beta(3, 97) — prior mean = 3/100 = 0.03 (from published meta-analysis)\n# Observed: 2 events in 50 trials (new registry)\nprior_a, prior_b     = 3, 97\nobs_k, obs_n         = 2, 
50\npost_a = prior_a + obs_k                         # 3 + 2 = 5\npost_b = prior_b + (obs_n - obs_k)               # 97 + 48 = 145\nposterior = stats.beta(post_a, post_b)\nprint(f\"\\nBayesian conjugate update:\")\nprint(f\"  Prior:     Beta({prior_a}, {prior_b}), mean = {prior_a/(prior_a+prior_b):.4f}\")\nprint(f\"  Observed:  {obs_k} events in {obs_n} trials (MLE = {obs_k/obs_n:.4f})\")\nprint(f\"  Posterior: Beta({post_a}, {post_b}), \"\n      f\"mean = {post_a/(post_a+post_b):.4f}\")\n\n# ── Beta regression for PDC or EQ-5D outcomes (statsmodels) ──────────────────\n# from statsmodels.othermod.betareg import BetaModel\n# mod = BetaModel.from_formula(\"pdc ~ age + cci\", data=df)\n# res = mod.fit()\n# Outcomes must be strictly in (0,1). Recode: pdc = pdc.clip(0.001, 0.999)\n# Mean-model coefficients are on the logit scale; apply scipy.special.expit() to the\n# full linear predictor (not to a single coefficient) to get a fitted mean in (0, 1).",
        "description": "Method-of-moments fit of Beta(α, β) from a mean and SE using scipy.stats.beta. Demonstrates\nfitting Beta(12, 3) from utility mean=0.80 and SE=0.10, verifying moments, running a 5,000-\niteration PSA Monte Carlo, and computing the Bayesian conjugate update for a rare-event\nprobability. Also shows the statsmodels BetaModel sketch for beta regression on patient-level\nbounded outcomes. All draws verified to lie in (0,1).",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "# ── Method-of-moments fit ─────────────────────────────────────────────────────\nmean_util <- 0.80\nse_util   <- 0.10\nvar_util  <- se_util^2              # 0.01\n\n# Feasibility check\nstopifnot(var_util < mean_util * (1 - mean_util))\n\nalpha_plus_beta <- mean_util * (1 - mean_util) / var_util - 1   # 15\nalpha <- mean_util * alpha_plus_beta                              # 12\nbeta  <- (1 - mean_util) * alpha_plus_beta                       # 3\n\ncat(sprintf(\"Fitted Beta(%.4g, %.4g)\\n\", alpha, beta))\ncat(sprintf(\"  Theoretical mean:     %.4f  (target %.2f)\\n\",\n            alpha / (alpha + beta), mean_util))\ncat(sprintf(\"  Theoretical variance: %.6f  (target %.4f)\\n\",\n            alpha * beta / ((alpha + beta)^2 * (alpha + beta + 1)),\n            var_util))\n\n# ── Quantiles and mode ────────────────────────────────────────────────────────\ncat(sprintf(\"  Mode:  %.4f\\n\", (alpha - 1) / (alpha + beta - 2)))\ncat(sprintf(\"  95%% CI: [%.4f, %.4f]\\n\",\n            qbeta(0.025, alpha, beta), qbeta(0.975, alpha, beta)))\n\n# ── PSA Monte Carlo ───────────────────────────────────────────────────────────\nset.seed(42)\npsa_draws <- rbeta(5000, alpha, beta)\n\nstopifnot(all(psa_draws > 0) && all(psa_draws < 1))\ncat(sprintf(\"\\nPSA draws (n=5000): mean=%.4f, SD=%.4f, min=%.4f, max=%.4f\\n\",\n            mean(psa_draws), sd(psa_draws), min(psa_draws), max(psa_draws)))\n\n# ── Bayesian conjugate update ─────────────────────────────────────────────────\nprior_a <- 3;  prior_b <- 97       # Beta(3,97): prior mean = 0.03\nobs_k   <- 2;  obs_n   <- 50       # 2 events in 50 trials\npost_a  <- prior_a + obs_k         # 5\npost_b  <- prior_b + obs_n - obs_k # 145\ncat(sprintf(\n  \"\\nBayesian update: Beta(%d,%d) + Bin(n=%d,k=%d) -> Beta(%d,%d), mean=%.4f\\n\",\n  prior_a, prior_b, obs_n, obs_k, post_a, post_b,\n  post_a / (post_a + post_b)\n))\n\n# ── Beta regression with betareg ──────────────────────────────────────────────\n# 
install.packages(\"betareg\")  # run once if not installed\n# library(betareg)\n#\n# Simulate a dataset: PDC as outcome, age and CCI as predictors\nset.seed(1)\nn   <- 200\ndf  <- data.frame(\n  pdc = rbeta(n, 8, 2),                 # mean PDC ≈ 0.80\n  age = rnorm(n, mean = 55, sd = 12),\n  cci = rpois(n, lambda = 2)\n)\n# Recode exact boundary values (betareg requires strictly open (0,1))\ndf$pdc <- pmin(pmax(df$pdc, 0.001), 0.999)\n#\n# Fit beta regression (logit link for the mean; constant precision phi):\n# mod <- betareg(pdc ~ age + cci, data = df)\n# summary(mod)\n# Coefficients are log-odds of the mean; plogis() applies to the full linear\n# predictor, not to a single coefficient. Fitted means on the PDC scale:\n# predict(mod, type = \"response\")",
        "description": "Method-of-moments fit, PSA Monte Carlo via rbeta, beta regression using the betareg\npackage, and Bayesian conjugate update in base R. Follows the same Beta(12, 3)\nmotivating example as the Python implementation. The betareg package uses a logit link\nfor the mean by default, so mean-model coefficients are log-odds; fitted means on the\n(0, 1) scale come from applying plogis() to the full linear predictor, not to a single\ncoefficient.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* ── Method-of-moments macro ────────────────────────────────────────────────── */\n%macro beta_mom(mean=, se=, alpha_out=alpha_mom, beta_out=beta_mom);\n  %global &alpha_out. &beta_out.;\n  %let var_  = %sysevalf(&se. * &se.);\n  %let apb_  = %sysevalf(&mean. * (1 - &mean.) / &var_. - 1);\n  %let &alpha_out. = %sysevalf(&mean. * &apb_.);\n  %let &beta_out.  = %sysevalf((1 - &mean.) * &apb_.);\n  %put NOTE: Fitted Beta(&&&alpha_out., &&&beta_out.) from mean=&mean., SE=&se.;\n%mend;\n\n/* Call: mean=0.80, SE=0.10 -> Beta(12, 3) */\n%beta_mom(mean=0.80, se=0.10)\n\n/* ── PSA Monte Carlo: 5,000 draws from Beta(12, 3) ─────────────────────────── */\ndata work.psa_draws;\n  call streaminit(42);             /* reproducible seed                          */\n  do iter = 1 to 5000;\n    utility_draw = rand('BETA', 12, 3);  /* all draws in (0,1) by construction   */\n    output;\n  end;\nrun;\n\nproc means data=work.psa_draws n mean std min max;\n  var utility_draw;\n  title \"PSA draws: Beta(12,3) utility parameter\";\nrun;\n\n/* ── Accumulate model output across PSA iterations ──────────────────────────── */\n/* In a full PSA, replace the stub below with the complete decision model.       */\n/* Draw ALL uncertain parameters per iteration, run the model, store delta-C and */\n/* delta-E, then summarise the joint distribution for the CE plane and CEAC.     */\ndata work.psa_results;\n  set work.psa_draws;\n  survival_years = 5;             /* deterministic for illustration               */\n  qaly           = utility_draw * survival_years;\n  cost           = 15000;\n  /* Real PSA: icer = (cost_new - cost_comp) / (qaly_new - qaly_comp) per iter */\nrun;\n\n/* ── PROC NLMIXED: ML beta regression for bounded continuous outcomes ──────── */\n/* Requires patient-level data with outcomes strictly in (0,1).                 */\n/* Set pdc = min(max(pdc, 0.001), 0.999) before calling.                       
*/\n/*\nproc nlmixed data=work.patient_data;\n  parms beta0=0 beta_age=0 beta_cci=0 log_phi=2;\n  eta   = beta0 + beta_age*age + beta_cci*cci;\n  mu    = 1 / (1 + exp(-eta));          * logit link for mean;\n  phi   = exp(log_phi);                  * log link for precision;\n  alpha_p = mu * phi;\n  beta_p  = (1 - mu) * phi;\n  ll      = lgamma(alpha_p + beta_p) - lgamma(alpha_p) - lgamma(beta_p)\n            + (alpha_p - 1)*log(pdc) + (beta_p - 1)*log(1 - pdc);\n  model pdc ~ general(ll);\nrun;\n*/",
        "description": "Method-of-moments macro, PSA Monte Carlo using RAND('BETA') in a DATA step, and a PROC\nNLMIXED sketch for maximum-likelihood beta regression. The DATA step PSA loop is the\npattern used in production SAS health-economic models for HTA submissions. RAND('BETA')\nguarantees draws in (0,1) by construction; no boundary checking is needed. The PROC\nNLMIXED block shows the beta log-likelihood with a logit link for mean and a log link for\nprecision; it is commented out because it requires patient-level data as input.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [],
    "relations": [
      {
        "relation_type": "used_with",
        "target_slug": "probabilistic-sensitivity-analysis-hea-rwe",
        "notes": "The beta distribution is the canonical PSA distributional assignment for probabilities and utilities; PSA loops that draw from a beta-distributed utility or transition probability typically obtain the shape parameters from the method-of-moments fitting procedure described in this entry, or directly from event counts when those are reported."
      },
      {
        "relation_type": "used_with",
        "target_slug": "qaly-utility-mapping-rwe",
        "notes": "Utility weights estimated from mapping algorithms (EQ-5D, SF-6D) are fitted with beta distributions for PSA; the mapping entry explains how utility estimates and standard errors are produced, and this entry explains how to convert those estimates into a distributional parameter for the PSA loop."
      },
      {
        "relation_type": "see_also",
        "target_slug": "binomial-distribution-logit-link",
        "notes": "The beta distribution is the conjugate prior for the binomial likelihood; the binomial entry covers the event-count distribution for which the beta is the natural Bayesian prior, enabling closed-form posterior updating when event data are observed."
      },
      {
        "relation_type": "see_also",
        "target_slug": "gamma-distribution",
        "notes": "The gamma distribution is the standard PSA assignment for strictly positive parameters such as costs and resource use; together, beta (utilities, probabilities) and gamma (costs) cover the two most common distributional families in ISPOR-endorsed HTA parameter tables."
      },
      {
        "relation_type": "see_also",
        "target_slug": "generalized-linear-models",
        "notes": "Beta regression follows the generalised-linear-model template, pairing a beta response distribution with a logit link for the mean, although it is not, strictly, a member of the classical exponential-dispersion GLM family; the GLM entry covers the broader regression framework that beta regression extends to bounded continuous outcomes in (0, 1)."
      },
      {
        "relation_type": "requires",
        "target_slug": "inferential-statistics-foundations",
        "notes": "The beta distribution requires foundational knowledge of continuous probability distributions, moments (mean, variance, standard deviation), and the role of distributional assumptions in statistical modelling and parameter estimation."
      }
    ],
    "aliases": [
      "beta regression",
      "beta-binomial",
      "Beta(α,β)"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "binomial-distribution-logit-link",
    "name": "Binomial Distribution and the Logit Link",
    "short_definition": "The Bernoulli/binomial distributional family for binary data, the logit transform that maps probabilities onto an unbounded linear scale, and the odds-ratio arithmetic that follows — the probabilistic primitives on which logistic regression, 2x2 table analysis, and case-control inference are all built.",
    "long_description": "**What binary outcomes are: from Bernoulli to binomial**\n\nEach patient in an RWE cohort contributes a binary endpoint — hospitalized (1) or not (0),\nresponded (1) or not (0), persistent on therapy (1) or not (0). The mathematical model for a\nsingle binary trial is the **Bernoulli distribution**: a random variable Y that equals 1 with\nprobability p and 0 with probability 1 − p. Its probability mass function is\nP(Y = y) = p^y × (1 − p)^(1 − y), which collapses to P(Y = 1) = p and P(Y = 0) = 1 − p.\nWhen n independent patients each share the same event probability p — the simplest case before\ncovariates enter — the total count of events follows a **binomial distribution**: Bin(n, p),\nwith mean np and variance np(1 − p). The binomial is discrete and right-skewed at small p; it\nis not a Gaussian bell curve. Logistic regression respects this by modelling each patient's\noutcome Y_i as Bernoulli(p_i) directly, rather than pretending the outcome is continuous.\n\n**Why model probabilities through a link function**\n\nThe clinical goal is to let each patient's event probability p_i depend on their\ncharacteristics X_i: p_i = f(beta_0 + beta'X_i). The simplest approach is the **linear\nprobability model (LPM)**: assume p_i = beta_0 + beta'X_i directly. This fails in a\nfundamental way — there is nothing to prevent predicted probabilities from exceeding 1 or\nfalling below 0. A regression line fitted to binary data will produce nonsensical predictions\noutside [0, 1] for covariate combinations near the tails, and generates heteroscedastic\nresiduals throughout the range. The fix is a **link function** that transforms the bounded\n[0, 1] probability onto the entire real line (−∞, +∞), so the linear predictor can roam\nfreely without ever violating the probability constraint. 
Three link functions dominate\npractice: (1) the logit (log-odds), (2) the probit (inverse-normal CDF), and (3) the log\nlink (used in log-binomial regression for direct risk ratios).\n\n**The logit function: log-odds and the logistic curve**\n\nThe logit transform maps any probability to the real line: logit(p) = log[p / (1 − p)].\nThe quantity p / (1 − p) is the **odds** — the ratio of the probability the event occurs to\nthe probability it does not. Odds of 0.25 mean the event is expected once for every four\nnon-events, corresponding to a probability of 0.2. Odds of 1.0 mean equal probability (p =\n0.5). Taking the logarithm maps odds from (0, ∞) to log-odds in (−∞, +∞): logit(0.1) ≈\n−2.20; logit(0.5) = 0.0; logit(0.9) ≈ 2.20. The inverse is the **logistic (expit) function**:\np = 1 / (1 + exp(−logit)), which maps any real number back to a valid probability and traces\nthe familiar S-shaped logistic curve. A logistic regression model fits\nlogit(p_i) = beta_0 + beta_1 X_{1i} + ... + beta_k X_{ki} by maximum likelihood; the linear\npredictor is unbounded, but the predicted probability p_i = expit(linear predictor) is always\nin (0, 1).\n\n**Logit vs probit vs log link: why logit dominates**\n\nThe **probit** link uses the inverse standard-normal CDF instead of the logit. The probit and\nlogit produce nearly identical fits in the range p ∈ [0.10, 0.90] — they differ mainly in\nthe extreme tails. The probit arises in latent-threshold models where the underlying liability\nis normally distributed; its coefficients carry no direct odds interpretation. The **log link**\n(log-binomial regression) maps the linear predictor to log(p), so exponentiated coefficients\nare **risk ratios (RRs)** rather than odds ratios — the more natural quantity for clinical\ncommunication when outcome prevalence is high. However, log-binomial models frequently fail to\nconverge because the linear predictor can push predicted probabilities above 1. 
The standard\nworkaround is **Poisson regression with a robust (sandwich) variance** — Zou's modified Poisson\n— which produces adjusted RRs without convergence failures. The logit link dominates for four\nreasons: (a) it guarantees probabilities in (0, 1) without convergence issues across all\noutcome prevalence levels; (b) it is the natural likelihood for case-control sampling; (c) it\nis numerically stable in high-dimensional covariate settings common in claims-based RWE; and\n(d) standardization via g-computation can convert any logistic model's conditional ORs into the\nmarginal RRs and risk differences (RDs) that decision-makers need.\n\n**The odds ratio and its misread as a risk ratio**\n\nExponentiating a logistic coefficient, exp(beta_j), gives the **conditional odds ratio (OR)**:\nthe multiplicative change in the odds of the event for a one-unit increase in X_j, holding all\nother covariates constant. The single most common interpretation error in published RWE is\nreading this OR as though it were a risk ratio. These two quantities are equal only when the\noutcome is rare in both comparison groups — the \"rare disease assumption\" holds approximately\nwhen event risk is below 5–10% in all arms. When the outcome is common — as it frequently is\nin HEOR for endpoints such as any hospitalization, medication adherence, or treatment switch,\nwhich routinely occur in 20–50% of patients — the OR is numerically further from 1.0 than the\nRR, and overstates the apparent magnitude of the effect in both directions (harm and benefit).\nIn the worked example below: the outcome is common (20% in Group A, 10% in Group B), yielding\nRR = 2.0 exactly and OR = 2.25. 
Reporting OR = 2.25 and stating \"the treatment was associated\nwith a 2.25-fold higher risk\" is incorrect and misleads clinicians and HTA reviewers about the\ntrue magnitude of the association.\n\n**Noncollapsibility: distinct from confounding**\n\nNoncollapsibility is a mathematical property of the odds ratio itself, not a form of bias or\nconfounding error. Even with perfectly balanced randomization and no confounding whatsoever,\na **conditional OR** (from a logistic model adjusting for covariates) will differ numerically\nfrom the **marginal OR** (from an unadjusted 2×2 table or from inverse-probability weighting).\nAdding a strong risk-factor covariate to a logistic model moves the conditional OR away from\nthe null even when that covariate is perfectly balanced between treatment arms and is not a\nconfounder. This occurs because the logistic link is nonlinear: averaging over a covariate\ndistribution does not recover the same OR as conditioning on it. By contrast, the risk\ndifference and risk ratio are collapsible: adding a balanced predictor does not systematically\nshift the marginal RD or RR. The practical implication is that observed differences in ORs\nacross studies with different covariate adjustment sets may reflect noncollapsibility rather\nthan true confounding; this distinction must be pre-specified in the estimand. Standardization\nto marginal effects (see marginal-effects-and-interpretation-of-inferential-statistics-rwe)\nconverts conditional logistic estimates into the collapsible marginal quantities that support\ncross-study comparisons and HTA decisions.\n\n**Case-control studies: why the OR is the only estimable quantity**\n\nIn a **case-control design**, investigators sample on the outcome — a predetermined number of\ncases and controls are enrolled, deliberately breaking the link between case frequency in the\nsample and prevalence in the source population. 
Because sampling fractions differ by outcome\nstatus, the absolute event rates in the enrolled sample do not reflect the true population\nincidence. Absolute risks (events / total enrolled) are therefore meaningless. What is\npreserved under outcome-stratified sampling is the cross-product ratio:\n(cases_exposed / cases_unexposed) / (controls_exposed / controls_unexposed) = OR. This\nalgebraic invariance is why the OR is the natural and only directly estimable effect measure\nin case-control studies, and why logistic regression is the canonical analysis for such data.\nConverting a case-control OR to an approximate RR requires external data on the absolute event\nrate in the source population from which cases and controls were drawn.\n\n**Interpreting the output**\n\nConsider a logistic model comparing a new treatment (Group A) versus standard care (Group B)\nfor a binary event, yielding an adjusted logistic coefficient of 0.811 with 95% CI 0.22 to\n1.40.\n\nFormal interpretation: exp(0.811) = 2.25 — patients in Group A had 2.25 times the adjusted\nodds of the event compared to Group B, holding covariates fixed. This is a conditional\n(covariate-specific, model-based) odds ratio. In the worked example (Group A: 20/100 = 0.20\nrisk; Group B: 10/100 = 0.10 risk), the unadjusted OR is exactly 2.25 while the RR is\nexactly 2.0. Because the outcome is common (20% in the treated arm), the OR overstates the\nrelative frequency; the conditional OR also shifts when strong but non-confounding predictors\nare added to the model — noncollapsibility, not confounding. 
A 95% CI for the log-OR of\n[0.22, 1.40] translates to a 95% CI for the OR of approximately [1.25, 4.06].\n\nPractical interpretation for clinicians and HTA bodies: for rare outcomes (risk below about\n10% in both arms), the OR and RR are sufficiently similar that it is defensible to state \"the\nadjusted OR was 2.25, approximately corresponding to a 2.25-fold higher risk.\" For common\noutcomes — as in this example — state instead: \"the adjusted OR was 2.25; because the event\nwas common (20% in the treated arm), the corresponding risk ratio is approximately 2.0 and\nthe absolute risk difference is approximately 10 percentage points. The absolute risk\ndifference is reported alongside the OR for clinical interpretation.\"\n\n**Pros, cons, and trade-offs**\n\nLogit link (logistic regression):\n- Pros: guarantees probabilities in (0, 1); numerically stable at all prevalence levels;\n  natural Bernoulli likelihood for binary data; the only directly interpretable effect measure\n  from case-control sampling; stable in high-dimensional covariate settings; easily\n  standardized to marginal RD and RR by g-computation; the standard outcome model inside\n  g-computation, AIPW, and TMLE for binary outcomes.\n- Cons: native measure is the OR, which is noncollapsible, further from 1.0 than the RR for\n  common outcomes, and routinely misread as a risk ratio; conditional ORs from logistic models\n  with different covariate sets cannot be directly compared.\n- When to prefer: case-control design; any binary endpoint when the analyst will standardize to\n  marginal effects; outcomes too common for log-binomial to converge; rare outcomes where\n  OR ≈ RR.\n\nLog link (log-binomial or Poisson-robust):\n- Pros: exponentiated coefficient is the RR directly, the quantity clinicians and HTA bodies prefer.\n- Cons: log-binomial frequently fails to converge; Poisson-robust requires a sandwich variance\n  correction; inappropriate for case-control data where the OR is the estimable quantity.\n- When to prefer: common outcome, RR is the pre-specified estimand, complete follow-up in a\n  fixed-window cohort.\n\nLinear probability model (OLS on 0/1):\n- Pros: coefficients are risk differences; collapsible; familiar to non-statisticians.\n- Cons: predictions violate [0, 1] at boundary covariate values; heteroscedastic residuals;\n  naive standard errors require correction.\n- When to prefer: only as a quick sanity check for risk differences, not for primary inference.\n\n**When to use**\n\nUse the binomial/logit framework when: (1) the endpoint is binary (yes/no) over a fixed risk\nwindow with complete follow-up; (2) the study design is case-control, where the OR is the only\nestimable measure and logistic regression is the canonical analysis; (3) the analyst will\nstandardize from the fitted logistic model to marginal RDs and RRs — the most defensible\nworkflow for common outcomes; (4) the outcome is rare, or convergence problems make the log link\ninfeasible; (5) a logistic model serves as the outcome model inside g-computation, AIPW, or TMLE.\n\n**When NOT to use**\n\n- When RR or RD is the communication target and the outcome is common (risk at or above 10% in\n  any arm): do not report the OR as a risk ratio. Use Poisson-robust regression for an adjusted\n  RR directly, or standardize the logistic model via g-computation. The gap between OR and RR\n  for a common outcome is not a model failure — it is a choice of scale, but reporting OR as\n  RR is an interpretation error that overstates effect magnitude.\n- Correlated binary outcomes — multiple eligible episodes per patient, or clustering within\n  facilities or healthcare systems: naive logistic standard errors are anticonservative. Use GEE\n  (population-average model) or add cluster-robust (sandwich) standard errors.\n- Perfect or quasi-complete separation in small samples: maximum likelihood estimates diverge\n  (infinite coefficients, enormous CIs). Use Firth penalized likelihood for model-based\n  inference. For a simple 2×2 table with very small n, Fisher's exact test is the\n  non-regression analogue.\n- Variable follow-up or informative censoring: a fixed-window logistic model treats all\n  patients with no observed event as event-free for the full window, which is incorrect when\n  some patients are unobservable for part of the window (disenrollment, death). Route to Cox\n  proportional-hazards regression or pooled (discrete-time) logistic for time-to-event\n  questions. See logistic-regression-for-binary-outcomes.",
    "primary_category": "Inferential_Statistics",
    "tags": [
      "statistics",
      "primitive",
      "distributions",
      "binary-outcomes",
      "odds-ratio",
      "logit",
      "Bernoulli",
      "binomial",
      "noncollapsibility",
      "case-control"
    ],
    "applies_to_study_types": [
      "cohort_retrospective",
      "cohort_prospective",
      "case_control",
      "cross_sectional",
      "active_comparator_new_user",
      "rct",
      "claims_analysis",
      "ehr_study",
      "registry_study"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "primary",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1136/bmj.320.7247.1468",
        "url": "https://doi.org/10.1136/bmj.320.7247.1468",
        "citation_text": "Bland JM, Altman DG. Statistics Notes: The odds ratio. BMJ. 2000;320(7247):1468.",
        "year": 2000,
        "authors_short": "Bland & Altman",
        "notes": "Classic BMJ Statistics Notes entry defining the odds ratio, its relationship to the risk ratio, and the conditions under which the OR approximates the RR (rare disease assumption). Written for clinical researchers and provides the intuitive grounding for the logit link's native effect measure."
      },
      {
        "role": "explain",
        "doi": "10.1001/jama.2018.6971",
        "url": "https://doi.org/10.1001/jama.2018.6971",
        "citation_text": "Norton EC, Dowd BE, Maciejewski ML. Odds Ratios — Current Best Practice and Use. JAMA. 2018;320(1):84-85.",
        "year": 2018,
        "authors_short": "Norton et al.",
        "notes": "Authoritative JAMA guide on when ORs are appropriate, when they mislead, and when alternatives (RR, RD, average marginal effects) should replace them. Directly motivates the noncollapsibility and OR-as-RR-misread content in this entry and is required reading before interpreting logistic output in RWE or HEOR."
      }
    ],
    "plain_language_summary": "When a clinical outcome is a yes-or-no event — hospitalized or not, responded or not — the underlying math treats each patient's result as a coin flip with a particular probability of landing heads. The logit link is the mathematical bridge that converts those bounded probabilities (which must stay between 0 and 1) into an unbounded number a regression model can use; it does this by working in terms of odds instead of probabilities. The key number the model produces is an odds ratio: the odds of the event in one group divided by the odds in the other. One limitation is that when events are common (say, in more than 10% of patients) the odds ratio sits further from 1 than the risk ratio, so differences look bigger than they really are; analysts should therefore report a risk ratio or risk difference alongside it when communicating with clinicians or payers.",
    "key_terms": [
      {
        "term": "odds",
        "definition": "The number of patients with the event divided by the number without it — for example, odds of 0.25 mean the event happens once for every four non-events, which corresponds to a 20% probability."
      },
      {
        "term": "log odds",
        "definition": "The natural logarithm of the odds; also called the logit. Taking the log converts odds (which are always positive) into a number that can be negative, zero, or positive, which a regression line can use without going out of bounds."
      },
      {
        "term": "logit",
        "definition": "The link function that maps a probability between 0 and 1 to the entire real line by computing the natural log of the odds; the inverse logit (expit or plogis) converts back to a probability."
      },
      {
        "term": "odds ratio",
        "definition": "The odds of the event in one group divided by the odds in another; an odds ratio of 2.25 means the first group has 2.25 times the odds of the event. For common outcomes the odds ratio is further from 1.0 than the risk ratio describing the same data (larger when the ratio is above 1, smaller when it is below 1)."
      },
      {
        "term": "risk ratio",
        "definition": "The probability (risk) of the event in one group divided by the probability in another; for common outcomes the risk ratio is closer to 1.0 than the odds ratio, so the two numbers can look quite different even when describing the same 2x2 table."
      },
      {
        "term": "noncollapsibility",
        "definition": "A property of odds ratios — not a bias — meaning that an adjusted (covariate-controlled) odds ratio and an unadjusted odds ratio differ even when the covariate is not a confounder; risk differences and risk ratios do not have this property."
      }
    ],
    "worked_example": {
      "scenario": "A health outcomes team is evaluating a new therapy (Group A) versus standard care (Group B) for a binary endpoint: hospitalized within 90 days (1 = yes, 0 = no). Group A has 100 patients, Group B has 100 patients. Twenty patients in Group A were hospitalized; ten patients in Group B were hospitalized. The team computes the risk ratio and the odds ratio from the 2x2 table and observes that the two numbers are not the same — even though they describe identical data — because the outcome is common in both arms.",
      "dataset": {
        "caption": "90-day hospitalization: counts per arm (n=100 per arm)",
        "columns": [
          "group",
          "hospitalized_yes",
          "hospitalized_no",
          "total"
        ],
        "rows": [
          [
            "A (new therapy)",
            20,
            80,
            100
          ],
          [
            "B (standard care)",
            10,
            90,
            100
          ]
        ]
      },
      "steps": [
        "Read Group A cells: events = 20, non-events = 80, total = 100. Risk A = 20/100 = 0.2. Odds A = 20/80 = 0.25. Interpretation: 1 event for every 4 non-events in Group A.",
        "Read Group B cells: events = 10, non-events = 90, total = 100. Risk B = 10/100 = 0.1. Odds B = 10/90 = 0.111. Interpretation: roughly 1 event for every 9 non-events in Group B.",
        "Compute the risk ratio: RR = 0.2 / 0.1 = 2.0. Group A patients have twice the risk of hospitalization compared to Group B. This is a ratio of probabilities.",
        "Compute the odds ratio using the cross-product formula: OR = (20 * 90) / (80 * 10) = 1800 / 800 = 2.25. The OR (2.25) is larger than the RR (2.0) because the outcome is common. The logistic regression coefficient for group membership equals log(OR) = log(2.25) approximately 0.811, and exp(0.811) approximately 2.25 recovers the OR.",
        "The divergence between OR and RR grows with outcome prevalence. Here, OR = 2.25 exceeds RR = 2.0 by 12.5 percent in relative terms (2.25 / 2.0 = 1.125). For a rare outcome where risk in both arms is below 5%, OR and RR would be nearly indistinguishable. Reporting OR = 2.25 as though it were a risk ratio overstates the association for a common outcome like hospitalization."
      ],
      "result": "Group A: risk = 20/100 = 0.2, odds = 20/80 = 0.25. Group B: risk = 10/100 = 0.1, odds = 10/90 = 0.111. RR = 0.2 / 0.1 = 2.0. OR = (20 * 90) / (80 * 10) = 1800 / 800 = 2.25. The OR (2.25) exceeds the RR (2.0) because the outcome is common. The logistic coefficient is log(2.25) approximately 0.811. The log-Wald 95% CI for the OR uses SE = sqrt(1/20 + 1/80 + 1/10 + 1/90) approximately 0.417, giving the interval exp(0.811 ± 1.96 × 0.417) approximately [0.99, 5.09] for a sample of this size; this interval includes 1.0, reflecting the modest sample size. Analysts reporting this result to clinicians should state the absolute risk difference (20% minus 10% = 10 percentage points) alongside the OR."
    },
    "prerequisites": [
      "descriptive-statistics",
      "inferential-statistics-foundations"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [],
    "tradeoffs": [],
    "implementation_notes_by_data_source": {},
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\nfrom scipy.stats import binom\nfrom scipy.special import logit, expit\nimport statsmodels.formula.api as smf\nimport pandas as pd\n\n# ── 1. Bernoulli / Binomial distribution ────────────────────────────────────\n# n=100 patients in each arm; Group A has event probability p=0.2\np_a, p_b = 0.2, 0.1\ndist_a = binom(n=100, p=p_a)\n\nprint(\"--- Binomial distribution (Group A, n=100, p=0.2) ---\")\nprint(f\"Expected events : {dist_a.mean():.1f}\")\nprint(f\"Variance        : {dist_a.var():.1f}\")\nprint(f\"P(exactly 20)   : {dist_a.pmf(20):.4f}\")\nprint(f\"P(15 to 25)     : {dist_a.cdf(25) - dist_a.cdf(14):.4f}\")\n\n# ── 2. Logit / expit round-trip ─────────────────────────────────────────────\nprint(\"\\n--- Logit / expit (qlogis / plogis in R) ---\")\nprint(f\"{'p':>5}  {'odds':>8}  {'logit':>8}  {'expit(logit)':>14}\")\nfor p_val in [0.1, 0.2, 0.5, 0.9]:\n    lo   = logit(p_val)                 # log-odds = log(p/(1-p))\n    odds = p_val / (1 - p_val)\n    back = expit(lo)                    # inverse logit: recover p\n    print(f\"{p_val:5.1f}  {odds:8.4f}  {lo:8.4f}  {back:14.4f}\")\n\n# ── 3. 
2x2 table: OR vs RR (the worked example arithmetic) ──────────────────\na, b_c, c, d = 20, 80, 10, 90          # a=events_A, b=non_A, c=events_B, d=non_B\nrisk_a = a / (a + b_c)                 # 20/100 = 0.2\nrisk_b = c / (c + d)                   # 10/100 = 0.1\nodds_a = a / b_c                        # 20/80  = 0.25\nodds_b = c / d                          # 10/90  = 0.111\nrr     = risk_a / risk_b               # 0.2/0.1 = 2.0\nor_val = (a * d) / (b_c * c)           # (20*90)/(80*10) = 1800/800 = 2.25\n\nprint(\"\\n--- 2x2 table: OR vs RR ---\")\nprint(f\"risk A = {risk_a:.3f}, odds A = {odds_a:.4f}\")\nprint(f\"risk B = {risk_b:.3f}, odds B = {odds_b:.4f}\")\nprint(f\"RR     = {rr:.4f}   (ratio of probabilities)\")\nprint(f\"OR     = {or_val:.4f}  (ratio of odds; larger than RR because outcome is common)\")\nprint(f\"log(OR)= {np.log(or_val):.4f}  (the logistic regression coefficient)\")\n\n# ── 4. Logistic regression: coefficient must equal log(OR) ──────────────────\ndf = pd.DataFrame(\n    [{\"group\": 1, \"event\": 1}] * 20 +   # A events\n    [{\"group\": 1, \"event\": 0}] * 80 +   # A non-events\n    [{\"group\": 0, \"event\": 1}] * 10 +   # B events\n    [{\"group\": 0, \"event\": 0}] * 90      # B non-events\n)\nfit = smf.logit(\"event ~ group\", data=df).fit(disp=0)\ncoef  = fit.params[\"group\"]\nci    = fit.conf_int().loc[\"group\"]\n\nprint(\"\\n--- Logistic regression ---\")\nprint(f\"Coefficient for group : {coef:.4f}  (= log(OR) = {np.log(or_val):.4f})\")\nprint(f\"exp(coef)             : {np.exp(coef):.4f}  (= OR = {or_val:.4f})\")\nprint(f\"95% CI (log-OR)       : [{ci[0]:.3f}, {ci[1]:.3f}]\")\nprint(f\"95% CI (OR)           : [{np.exp(ci[0]):.3f}, {np.exp(ci[1]):.3f}]\")\nprint(\"Note: OR > RR (2.25 > 2.0) because outcome is common (20% in Group A).\")\nprint(\"For clinical communication, compute marginal RD = risk_A - risk_B or use g-comp.\")",
        "description": "Bernoulli/binomial distribution mechanics (scipy.stats.binom), the logit/expit pair\n(scipy.special.logit and expit), the 2x2 OR vs RR arithmetic from the worked example,\nand statsmodels logistic regression confirming that the coefficient equals log(OR). Uses\nonly numpy, scipy, statsmodels, and pandas. No external clinical data required; the\n100-patient-per-arm dataset mirrors the worked example exactly.",
        "dependencies": [
          "numpy",
          "scipy",
          "statsmodels",
          "pandas"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "# ── 1. Binomial distribution ────────────────────────────────────────────────\np_a <- 0.2; p_b <- 0.1; n <- 100\ncat(\"--- Binomial distribution (Group A, n=100, p=0.2) ---\\n\")\ncat(sprintf(\"Expected events : %.1f\\n\", n * p_a))\ncat(sprintf(\"Variance        : %.1f\\n\", n * p_a * (1 - p_a)))\ncat(sprintf(\"P(exactly 20)   : %.4f\\n\", dbinom(20, size=n, prob=p_a)))\ncat(sprintf(\"P(15 to 25)     : %.4f\\n\", pbinom(25, n, p_a) - pbinom(14, n, p_a)))\n\n# ── 2. qlogis (logit) and plogis (expit / inverse-logit) ────────────────────\ncat(\"\\n--- Logit / expit ---\\n\")\ncat(sprintf(\"%-6s %-9s %-9s %-12s\\n\", \"p\", \"odds\", \"logit\", \"plogis(logit)\"))\nfor (pv in c(0.1, 0.2, 0.5, 0.9)) {\n  lo <- qlogis(pv)                       # logit: log(p / (1 - p))\n  cat(sprintf(\"%-6.1f %-9.4f %-9.4f %-12.4f\\n\",\n              pv, pv/(1-pv), lo, plogis(lo)))\n}\n\n# ── 3. 2x2 table: OR vs RR ──────────────────────────────────────────────────\na <- 20; b_c <- 80; cc <- 10; d <- 90\nrisk_a <- a/(a+b_c); risk_b <- cc/(cc+d)     # 0.2, 0.1\nodds_a <- a/b_c;     odds_b <- cc/d           # 0.25, 0.111\nrr_val <- risk_a / risk_b                     # 0.2/0.1 = 2.0\nor_val <- (a * d) / (b_c * cc)                # (20*90)/(80*10) = 2.25\ncat(sprintf(\"\\nRR = %.4f, OR = %.4f  (OR > RR: outcome common at %.0f%%)\\n\",\n            rr_val, or_val, risk_a*100))\n\n# Cross-check: fisher.test reports the conditional MLE of the OR (exact hypergeometric\n# likelihood), so it lands close to, but not exactly on, the cross-product OR\ntbl <- matrix(c(a, b_c, cc, d), nrow=2, byrow=TRUE,\n              dimnames=list(c(\"A\",\"B\"), c(\"Event\",\"No-event\")))\nft <- fisher.test(tbl)\ncat(sprintf(\"fisher.test OR  = %.4f  (conditional MLE; cross-product OR = %.4f)\\n\",\n            ft$estimate, or_val))\n\n# ── 4. glm with binomial family and logit link ───────────────────────────────\ndf <- data.frame(\n  group = c(rep(1,100), rep(0,100)),\n  event = c(rep(1,20), rep(0,80), rep(1,10), rep(0,90))\n)\nfit    <- glm(event ~ group, family=binomial(link=\"logit\"), data=df)\ncoef_g <- coef(fit)[\"group\"]\n# confint() on a glm returns profile-likelihood limits; suppressMessages() hides the profiling message\nci_g   <- suppressMessages(confint(fit))[\"group\", ]\n\ncat(sprintf(\"\\nglm coefficient : %.4f  (= log(OR) = %.4f)\\n\",\n            coef_g, log(or_val)))\ncat(sprintf(\"exp(coef)       : %.4f  (= OR = %.4f)\\n\", exp(coef_g), or_val))\ncat(sprintf(\"95%% CI (OR)     : [%.3f, %.3f]\\n\", exp(ci_g[1]), exp(ci_g[2])))\ncat(\"Note: for marginal RR/RD from this logistic fit use marginaleffects::avg_comparisons()\\n\")",
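        "description": "Binomial PMF/CDF (dbinom/pbinom), the logit/expit pair (qlogis/plogis, both base R), the\n2x2 cross-product OR vs RR arithmetic, and glm() with binomial family confirming the\nlogistic coefficient equals log(OR). Everything runs in a standard R installation: the\nprofile-likelihood confint() method for glm objects lives in stats from R 4.4.0 and in the\nrecommended MASS package (shipped with R) in earlier versions. The fisher.test() call reports\nthe conditional maximum-likelihood OR, which should be close to, but not exactly equal to,\nthe cross-product OR.",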
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* ─── 1. Binomial PMF and logit/expit in a DATA step ─── */\ndata work.binom_logit;\n  n = 100; p_a = 0.2; k = 20;\n\n  /* Binomial PMF: P(X = 20 | Binom(100, 0.2)) */\n  pmf_a = pdf('BINOMIAL', k, p_a, n);\n  put \"P(exactly 20 events | n=100, p=0.2) = \" pmf_a;\n\n  /* Logit and expit for a range of probabilities */\n  do p = 0.1, 0.2, 0.5, 0.9;\n    odds     = p / (1 - p);\n    log_odds = log(odds);              /* logit(p) */\n    p_back   = 1 / (1 + exp(-log_odds)); /* expit: recover p */\n    output;\n  end;\nrun;\nproc print data=work.binom_logit (drop=n p_a k pmf_a) noobs;\n  format odds log_odds p_back 8.4;\nrun;\n\n/* ─── 2. Create the 2x2 dataset (weighted rows) ─── */\ndata work.trial;\n  length group $1;\n  input group $ outcome count;\n  datalines;\nA 1 20\nA 0 80\nB 1 10\nB 0 90\n;\nrun;\n\n/* ─── 3. PROC FREQ: OR, RR, and chi-square from the 2x2 table ─── */\nproc freq data=work.trial order=data;\n  weight count;\n  tables group * outcome / relrisk chisq;\n  /* RELRISK prints:\n     - Risk ratio (RR): 0.2/0.1 = 2.0 for Group A vs B\n     - Odds ratio (OR): (20*90)/(80*10) = 1800/800 = 2.25 for Group A vs B\n     Both with 95% confidence limits.                                         */\nrun;\n\n/* ─── 4. 
PROC LOGISTIC: OR from a logistic model (unadjusted here) ─── */\n/*\n   EVENT= TRAP (most common SAS logistic error):\n     Default: SAS models the LOWER value of outcome -> P(outcome=0) = P(no event)\n     This reverses the sign of every coefficient.\n     Fix: always write  model outcome(event='1') = ...\n     when outcome is coded 1=event happened, 0=event did not happen.\n\n   FREQ (not WEIGHT): count holds cell frequencies, so each row stands for\n     that many patients and the reported number of observations is 200.\n   Add covariates to the MODEL statement to obtain an adjusted OR.\n   CLODDS=PL: profile-likelihood CI for OR (preferred over Wald at moderate n).\n*/\nproc logistic data=work.trial;\n  freq count;\n  class group (ref='B') / param=ref;\n  model outcome(event='1') = group   /* <-- event='1' is mandatory */\n        / clodds=pl;\n  /* Expected output:\n       Intercept        = log(odds_B) = log(10/90) approximately -2.197\n       group A          = log(OR_A_vs_B) = log(2.25) approximately  0.811\n       Odds Ratio (A/B) = 2.25; Wald 95% CI approximately [0.99, 5.09]\n       (SE of log OR = sqrt(1/20 + 1/80 + 1/10 + 1/90) approximately 0.417);\n       the profile-likelihood limits from CLODDS=PL differ slightly from Wald.\n     Without event='1': coefficients would be -0.811 (reversed sign).       */\nrun;",
        "description": "Binomial PMF via the PDF() function, logit/expit in a DATA step, the 2x2 table via PROC\nFREQ (RELRISK option prints both OR and RR with CIs), and PROC LOGISTIC for covariate-\nadjusted analysis. The EVENT= option in the MODEL statement is the most common SAS\npitfall: without event='1', SAS models P(outcome = 0) — the non-event — reversing all\ncoefficient signs. Always specify event='1' when outcome is coded 1 for event, 0 for\nno-event. CLODDS=PL requests profile-likelihood CIs for the odds ratio, which are more\naccurate than Wald CIs at small-to-moderate sample sizes.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  P[\"Probability p in (0,1)\\nbounded — cannot model directly\"] --> LOG[\"Logit = log(p / (1-p))\\nmaps to all reals\"]\n  LOG --> LP[\"Linear predictor\\nb0 + b1*X1 + ... + bk*Xk\\nunbounded, estimated by MLE\"]\n  LP --> EXPIT[\"Expit = 1/(1+exp(-LP))\\nconverts back to probability\"]\n  EXPIT --> OR[\"exp(bj) = conditional odds ratio\\nfor one-unit increase in Xj\"]\n  OR --> INTERP{Is the outcome common?}\n  INTERP -->|\"Risk below 5-10% in all arms\"| RARE[\"OR approximately equals RR\\nrare-disease approximation holds\"]\n  INTERP -->|\"Risk 10% or above\"| COMMON[\"OR > RR when OR > 1\\nStandardize to get marginal RD/RR\\nvia g-computation\"]",
        "caption": "From bounded probability to the logit linear predictor, back to probability via the expit function, and then to the odds ratio. The bottom branch shows when OR and RR diverge and how standardization resolves the gap for clinical communication.",
        "alt_text": "Flowchart showing probability feeding the logit, then a linear predictor estimated by MLE, then the expit back to probability, then the odds ratio, which branches to rare-outcome (OR equals RR) and common-outcome (OR exceeds RR; standardize to get marginal RD/RR).",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "see_also",
        "target_slug": "logistic-regression-for-binary-outcomes",
        "notes": "The logistic regression model is built directly on the logit link and the OR arithmetic described here; that entry covers model fitting, case-control and cohort variants, Firth penalization for sparse data, and g-computation for marginal RD/RR — the applied layer on top of this probabilistic primitive."
      },
      {
        "relation_type": "requires",
        "target_slug": "generalized-linear-models",
        "notes": "The logit link is one of several link functions in the generalized linear model (GLM) framework; understanding GLM mechanics (link function, variance function, maximum likelihood estimation) contextualizes why logit, probit, and log links each make sense for binary data and how they differ from the identity link in linear regression."
      },
      {
        "relation_type": "see_also",
        "target_slug": "chi-square-test",
        "notes": "The Pearson chi-square test is the unadjusted omnibus test for the same 2x2 table that logistic regression generalizes; the chi-square p-value tests independence but does not produce an OR, RR, or RD, making logistic regression necessary as soon as covariate adjustment is required."
      },
      {
        "relation_type": "see_also",
        "target_slug": "poisson-distribution",
        "notes": "The Poisson distribution with a log link (log-binomial) and the Poisson-with-robust- variance approach (Zou's modified Poisson) are the primary alternatives when the RR is the pre-specified estimand and the outcome is too common for the OR to approximate the RR; this entry explains why those alternatives exist and when to prefer them."
      },
      {
        "relation_type": "requires",
        "target_slug": "inferential-statistics-foundations",
        "notes": "Probability distributions, expected value, maximum likelihood estimation, and the concept of a sampling distribution are the prerequisites for understanding why the Bernoulli likelihood leads to the logit link and why the OR is the natural effect measure that emerges from that likelihood."
      }
    ],
    "aliases": [
      "Bernoulli distribution",
      "logit transform",
      "odds",
      "logistic distribution",
      "log odds"
    ],
    "complexity": "foundational",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "biomarker-defined-cohort-rwe",
    "name": "Biomarker-Defined Cohort (RWE)",
    "short_definition": "A cohort design that restricts eligibility to patients with a specific molecular, genomic, laboratory, pathology, or companion-diagnostic result, with time zero and follow-up anchored to biomarker ascertainment to avoid immortal time and testing-selection bias.",
    "long_description": "A **biomarker-defined cohort** restricts the study population to patients carrying (or lacking) a specific biomarker —\nan oncogenic driver mutation (e.g., *EGFR*, *KRAS* G12C, *BRAF* V600E), a protein-expression level (PD-L1 TPS ≥50%, HER2\nIHC 3+), a genomic signature (tumor mutational burden, MSI-high/dMMR), or a laboratory threshold (eGFR, LDL-C, viral load).\nIn real-world data the biomarker is not assigned at enrollment as in a trial; it is *discovered* through a test that was\nordered in routine care at some point in the patient's trajectory. The central methodological problem is therefore not\n\"who has the biomarker\" but **\"how does conditioning on a test result — and the timing of that test — distort the cohort,\nthe time axis, and the estimand.\"** Getting the index date and the entry condition right is the entire game.\n\n**Core conceptual distinction.** A biomarker-defined cohort sits at the intersection of three separable choices, and\nconflating them is the most common source of bias. (1) *Eligibility on biomarker status* requires that the result actually\nexist in the data — so the analyzable population is the **tested** subset, not the underlying disease population, and tested\npatients differ systematically from untested ones (access, performance status, larger or progressing tumors that warrant\nprofiling, academic-center care). (2) *Index date / time zero* must be chosen so that biomarker status is known at entry\nand no follow-up is \"guaranteed survived.\" Anchoring follow-up to the date of diagnosis while requiring a later NGS result\nbuilds in **immortal time**: a patient cannot appear in a biomarker-defined cohort unless they survived long enough to be\ntested and reported. 
(3) *Estimand*: a descriptive prevalence/outcome in the biomarker-positive subgroup is a very different\nquantity from a comparative (biomarker-stratified) treatment effect, and the latter inherits all the confounding-by-indication\nproblems of any observational comparison plus the testing-selection layer on top.\n\n**Pros, cons, and trade-offs.**\n- **vs an unselected (all-comers) disease cohort:** A biomarker-defined cohort answers the precision-medicine question\n  directly (efficacy, safety, utilization, or cost *within the marker stratum*) and supports external-control arms for\n  marker-targeted therapies. Cost: it is restricted to the tested subpopulation, so generalizability to the full indication\n  requires an explicit argument about who gets tested. **Prefer** it when the clinical and reimbursement decision is itself\n  marker-conditional (companion diagnostic, ESCAT/OncoKB-tiered actionability).\n- **vs defining the cohort on the biomarker *test order* (a CPT/HCPCS code or order flag) rather than the result:** Using the\n  test event alone (common when only claims are available) tells you a test happened, never the result; it cannot separate\n  positives from negatives and silently mixes strata. **Prefer result-based definition** whenever EHR/registry/NGS-report\n  data carry the actual call; fall back to test-order proxies only for utilization/testing-rate questions, never for\n  marker-conditional outcomes.\n- **vs target-trial emulation with clone-censor-weight from diagnosis:** A simple \"enter at biomarker ascertainment with\n  left truncation\" design is easier to specify and communicate and removes immortal time cleanly when the question is\n  outcomes *from the result date* forward. Cost: it cannot recover the full from-diagnosis estimand and conditions on\n  surviving-to-test. 
When the policy question is from diagnosis (e.g., the value of universal up-front testing), a\n  clone-censor-weight or sequential-trial emulation that handles the pre-result period is the more defensible (and more\n  complex) tool. **Prefer plain left-truncated entry** unless the from-diagnosis estimand is genuinely required.\n\n**When to use.** Comparative effectiveness, safety, natural-history, utilization, or cost analyses where the clinical\nquestion is intrinsically marker-conditional; building external/synthetic control arms for single-arm trials of targeted\nagents (regulatory and HTA submissions); characterizing testing rates, turnaround, and marker prevalence; HEOR models whose\ninputs (response, PFS, OS) are biomarker-stratified. The substrate is almost always EHR, registry, or linked\nEHR–claims–genomic data, because the result lives in pathology/molecular reports, not in administrative claims.\n\n**When NOT to use — and when it is actively misleading.**\n- **Claims-only data with no result.** Administrative claims carry CPT/HCPCS codes for the *test* (e.g., 81235 for *EGFR*,\n  81445/81455 for NGS panels, 0037U/0048U PLA codes, G-codes) but not the call. Defining \"biomarker-positive\" from claims\n  alone is impossible; attempting it forces a test-order surrogate that misclassifies massively. Use linked or EHR data, or\n  restrict the question to testing rates.\n- **Immortal time from result-date anchoring.** If you require a biomarker result obtained *after* the treatment line you\n  are studying, every patient in the cohort survived from line start to test report — the line looks falsely protective.\n  Nakamura et al. (2023) showed exactly this in a 5,743-patient genomic-profiling program: patients enrolled after first-line\n  initiation had a median OS advantage of ~8.9 months purely from immortal time. 
Anchor entry at the result date with left\n  truncation, or use a landmark/clone-censor-weight design.\n- **Ignoring testing-selection.** Tested patients are not a random sample of the diseased. A \"biomarker-positive vs negative\"\n  contrast that does not address why some patients were tested (and others were not) confounds marker status with the\n  indication-to-test. Report the tested-vs-untested comparison and, where possible, model selection or restrict to a setting\n  with near-universal testing (e.g., reflex testing).\n- **Assay/platform and threshold heterogeneity treated as one variable.** PD-L1 by 22C3 vs SP263, *HER2* IHC vs ISH,\n  NGS variant-allele-frequency cutoffs, and OncoKB/ESCAT actionability tiers drift across labs and calendar time; variant\n  reclassification (VUS → pathogenic) changes membership retroactively. Pooling them without a harmonization rule and a\n  sensitivity analysis on the cut point produces a marker definition no one can reproduce.\n\n**Data-source operational depth.**\n- **Claims (FFS or commercial):** Useful only for the *fact and date of testing* (procedure/PLA codes) and for downstream\n  utilization/cost once the cohort is defined elsewhere. Require continuous medical enrollment so an absent test code means\n  \"not tested,\" not \"unobserved\" — and note that Medicare Advantage and capitated/bundled arrangements drop fee-for-service\n  claims, so MA-only person-time cannot confirm a test did or did not occur; restrict to enrollees with observable claims.\n  Never read marker *status* from claims.\n- **EHR:** The biomarker result lives in pathology/molecular-report fields or unstructured notes; structured genomics tables\n  exist only in mature systems and frequently require NLP/abstraction. Strengths: the actual call, assay, specimen date,\n  and report date, plus indication and severity. 
Failure modes: results captured only when the test is in-network (external\n  reference-lab reports leak out of the EHR), report-date vs specimen-date ambiguity, and visit-driven loss to follow-up\n  that is differential by marker (positives funnel to targeted therapy and academic centers).\n- **Registry (e.g., SEER, disease/genomic registries):** Strong for adjudicated stage, histology, and often a curated\n  biomarker field with defined assay rules; typically weak for complete treatment and pharmacy exposure. Link to claims for\n  therapy and to a death index to firm up censoring and OS.\n- **Linked EHR–claims–genomic (e.g., clinicogenomic databases):** The ideal substrate — the genomic report supplies the\n  result, the EHR supplies severity and treatment, claims supply completeness and cost, and a vital-records link supplies\n  mortality. Singal et al. (2019) established this design for NSCLC. Costs: linkage selects the linkable subset, the\n  NGS-tested cohort is itself selected, and report/specimen/treatment dates must be reconciled before time-zero assignment.\n\n**Worked example (linked EHR–claims–NGS, oncology).** Question: real-world overall survival of first-line osimertinib in\n*EGFR*-mutant advanced NSCLC. (1) Source: a linked clinicogenomic database with NGS reports, EHR treatment records, and a\ndeath index. (2) Biomarker definition: a structured *EGFR* exon-19 deletion or L858R call from an NGS or validated\nsingle-gene assay; record `assay_type`, `specimen_date`, and `report_date`; pre-specify how to treat VUS and which platforms\nqualify. (3) Cohort entry to defeat immortal time: a patient becomes eligible only at `max(advanced_dx_date,\negfr_result_date)`, and time zero is the first qualifying line start *on or after* that date — so biomarker status is known\nat entry and no pre-result, survived-to-test period is counted as exposed follow-up. 
If the estimand is OS from advanced\ndiagnosis, instead **left-truncate** each patient's follow-up at `egfr_result_date` (delayed entry) so the risk set excludes\ntime before testing. (4) Tested-selection check: compare the *EGFR*-tested population to the broader advanced-NSCLC EHR\npopulation on stage, performance status, and site to characterize generalizability. (5) Follow-up: from time zero to death,\ncensoring at last EHR activity (treat as potentially informative) and end of data; reconcile `report_date` vs\n`specimen_date` and EHR vs claims service dates. (6) Sensitivity analyses: alternative index-date rules (result-date entry\nvs left-truncation vs clone-censor-weight from diagnosis), VAF/threshold and assay-platform variations, and a\nnegative-control marker to probe residual testing-selection.",
    "primary_category": "Study_Design",
    "tags": [
      "study_design",
      "biomarker-defined-cohort",
      "precision-medicine",
      "immortal-time-bias",
      "testing-selection",
      "clinicogenomic",
      "companion-diagnostic",
      "special-populations-methods"
    ],
    "applies_to_study_types": [
      "ehr_study",
      "registry_linkage",
      "claims_analysis",
      "comparative_effectiveness",
      "target_trial_emulation",
      "external_control_arm"
    ],
    "data_sources": [
      "ehr",
      "registry",
      "linked",
      "claims"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1001/jama.2019.3241",
        "url": "https://doi.org/10.1001/jama.2019.3241",
        "citation_text": "Singal G, Miller PG, Agarwala V, et al. Association of patient characteristics and tumor genomics with clinical outcomes among patients with non-small cell lung cancer using a clinicogenomic database. JAMA. 2019;321(14):1391-1399.",
        "year": 2019,
        "authors_short": "Singal et al.",
        "notes": "Landmark establishment of a linked real-world clinicogenomic (EHR + comprehensive genomic profiling) cohort and the design considerations for defining and analyzing biomarker-defined populations in routine-care data."
      },
      {
        "role": "explain",
        "doi": "10.1002/pds.5083",
        "url": "https://doi.org/10.1002/pds.5083",
        "citation_text": "Suissa S, Dell'Aniello S. Time-related biases in pharmacoepidemiology. Pharmacoepidemiology and Drug Safety. 2020;29(9):1101-1110.",
        "year": 2020,
        "authors_short": "Suissa & Dell'Aniello",
        "notes": "Systematic treatment of immortal time, time-window, and related time-axis biases — the dominant failure mode when a cohort is conditioned on a biomarker test obtained during follow-up."
      },
      {
        "role": "explain",
        "doi": "10.1016/j.jclinepi.2016.04.014",
        "url": "https://doi.org/10.1016/j.jclinepi.2016.04.014",
        "citation_text": "Hernán MA, Sauer BC, Hernández-Díaz S, Platt R, Shrier I. Specifying a target trial prevents immortal time bias and other self-inflicted injuries in observational analyses. Journal of Clinical Epidemiology. 2016;79:70-75.",
        "year": 2016,
        "authors_short": "Hernán et al.",
        "notes": "Provides the eligibility/time-zero machinery that disciplines index-date choice in biomarker-defined cohorts and prevents conditioning follow-up on surviving-to-test."
      },
      {
        "role": "demonstrate",
        "doi": "10.1200/PO.22.00653",
        "url": "https://doi.org/10.1200/PO.22.00653",
        "citation_text": "Nakamura Y, Yamashita R, Okamoto W, et al. Efficacy of targeted trials and signaling pathway landscape in advanced gastrointestinal cancers from SCRUM-Japan GI-SCREEN: a nationwide genomic profiling program. JCO Precision Oncology. 2023;7:e2200653.",
        "year": 2023,
        "authors_short": "Nakamura et al.",
        "notes": "Empirically demonstrates immortal time bias in a 5,743-patient genomic-profiling cohort — patients profiled after treatment initiation showed a spurious ~8.9-month median OS advantage; an explicit warning about index-date choice for biomarker-tested cohorts."
      },
      {
        "role": "use",
        "doi": null,
        "url": "https://www.fda.gov/regulatory-information/search-fda-guidance-documents/enrichment-strategies-clinical-trials-support-approval-human-drugs-and-biological-products",
        "citation_text": "U.S. Food and Drug Administration. Enrichment Strategies for Clinical Trials to Support Determination of Effectiveness of Human Drugs and Biological Products: Guidance for Industry. 2019.",
        "year": 2019,
        "authors_short": "FDA",
        "notes": "Regulatory framing for predictive-biomarker enrichment that biomarker-defined real-world cohorts operationalize for external-control and confirmatory evidence."
      }
    ],
    "plain_language_summary": "A biomarker-defined cohort is a study group built only from patients who had a specific molecular or lab test done AND received a qualifying result (for example, a genetic mutation found in their tumor). Because the test happens in routine care rather than a controlled trial, only patients who were actually tested can enter the study, and that tested group is not a random slice of all patients with the disease. The key trap to avoid is counting time before a patient got their test result as if they were already in the study, which makes survival look falsely better than it really is.",
    "key_terms": [
      {
        "term": "biomarker",
        "definition": "A measurable biological signal, such as a gene mutation, protein level, or lab value, that identifies a patient subgroup likely to respond differently to a treatment."
      },
      {
        "term": "index date",
        "definition": "The patient's day zero in the study, the date from which all follow-up time is counted, typically the date treatment starts or the date a qualifying test result was received."
      },
      {
        "term": "immortal time bias",
        "definition": "A distortion that occurs when a researcher counts the period before a patient met the study entry rule as part of their follow-up, making the drug or group look safer or more effective than it truly is because only survivors reached that entry point."
      },
      {
        "term": "testing-selection bias",
        "definition": "The distortion that arises because patients who received a biomarker test differ systematically from those who did not, for example having better insurance, being at a larger hospital, or having more advanced disease that prompted the test."
      },
      {
        "term": "left truncation",
        "definition": "A statistical technique that tells the survival analysis to start counting a patient's at-risk time only from the date they actually entered the study, not from some earlier date, so patients who died before being tested are correctly excluded from the risk set."
      }
    ],
    "worked_example": {
      "scenario": "A researcher wants to study survival in patients with advanced lung cancer whose tumors carry an EGFR mutation. The electronic health record contains NGS (genomic panel) test reports. The table below shows five patients: their cancer diagnosis date, the date their EGFR mutation result came back, and whether they were tested at all. The goal is to decide who enters the cohort, when their follow-up clock starts, and what happens to patients who were never tested.",
      "dataset": {
        "caption": "Patient records: diagnosis date, EGFR test result date, and result. Untested patients have no result.",
        "columns": [
          "person_id",
          "advanced_dx_date",
          "egfr_result_date",
          "egfr_result",
          "qualifies"
        ],
        "rows": [
          [
            "PT-01",
            "2023-01-10",
            "2023-02-05",
            "EGFR exon19del (positive)",
            "YES"
          ],
          [
            "PT-02",
            "2023-01-15",
            "2023-02-20",
            "EGFR wild-type (negative)",
            "NO"
          ],
          [
            "PT-03",
            "2023-02-01",
            "never tested",
            "no result",
            "NO"
          ],
          [
            "PT-04",
            "2023-02-10",
            "2023-03-01",
            "EGFR L858R (positive)",
            "YES"
          ],
          [
            "PT-05",
            "2023-03-05",
            "died before result",
            "no result",
            "NO"
          ]
        ]
      },
      "steps": [
        "Step 1 - Who qualifies by biomarker result: Only PT-01 and PT-04 have a confirmed qualifying EGFR mutation in their test report. PT-02 tested negative, PT-03 was never tested, and PT-05 died before a result was available.",
        "Step 2 - Set each qualifying patient's index date to the later of their diagnosis date or their test result date. For PT-01: result date (2023-02-05) is after diagnosis (2023-01-10), so index date = 2023-02-05. For PT-04: result date (2023-03-01) is after diagnosis (2023-02-10), so index date = 2023-03-01.",
        "Step 3 - Do NOT count the time between diagnosis and the result date as study follow-up. PT-01 had 26 days between diagnosis and result; those 26 days are excluded. If we mistakenly started follow-up at diagnosis, every patient in the cohort would be a survivor of that pre-test window, making survival look better than it is (immortal time bias).",
        "Step 4 - Note the selection issue: 3 of 5 patients (60%) are excluded because they were untested or negative. The 2 who qualify may differ from the other 3 in ways that affect outcomes (they may have been tested because their disease was progressing rapidly, or because they had access to a comprehensive cancer center)."
      ],
      "result": "Cohort size = 2 patients (PT-01, PT-04). Index dates: PT-01 on 2023-02-05, PT-04 on 2023-03-01. The 26-day and 19-day pre-result windows are excluded from follow-up. The 3 excluded patients represent the testing-selection issue: findings from this cohort describe EGFR-positive, tested patients only, not all advanced lung cancer patients."
    },
    "prerequisites": [
      "cohort-retrospective",
      "immortal-time-bias-handling",
      "new-user-design"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Result-based definition (preferred)",
        "description": "Eligibility on the actual biomarker call from a structured genomics/pathology field or abstracted report, with pre-specified assay/platform inclusion and threshold/VUS handling.",
        "edge_cases": [
          "Variant reclassification (VUS to pathogenic) changes membership retroactively; freeze a knowledge-base version and date.",
          "Assay heterogeneity (PD-L1 22C3 vs SP263; IHC vs ISH; NGS VAF cutoffs) requires a harmonization rule and threshold sensitivity analysis.",
          "External reference-lab results leak out of the EHR, undercounting positives."
        ],
        "data_source_notes": "ehr/linked: read from structured genomics tables or NLP-abstracted reports; registry: use the curated biomarker field with its defined assay rules."
      },
      {
        "name": "Test-order (surrogate) definition",
        "description": "Cohort keyed on the biomarker test event (CPT/HCPCS/PLA procedure code or order flag) when the result is unavailable; valid only for testing-rate, turnaround, and utilization questions, never for marker-conditional outcomes.",
        "edge_cases": [
          "Cannot separate positives from negatives; mixes strata and biases any outcome contrast toward the null.",
          "Procedure codes for broad NGS panels (e.g., 81445/81455) do not identify which gene was altered."
        ],
        "data_source_notes": "claims: identify the test via procedure/PLA codes with continuous enrollment so absence means not tested; confirms a test occurred, not its result."
      },
      {
        "name": "Left-truncated / landmark entry for immortal time",
        "description": "Patients enter the risk set at biomarker ascertainment (delayed entry) or at a fixed landmark, so no survived-to-test, pre-result person-time is counted; used when the estimand runs from diagnosis but the marker is learned later.",
        "edge_cases": [
          "Choosing the landmark too late discards events and shrinks the cohort; too early reintroduces immortal time.",
          "For the full from-diagnosis estimand, clone-censor-weight is needed instead of simple left truncation."
        ],
        "data_source_notes": "Requires reliable specimen/report dates reconciled against diagnosis and treatment dates before time-zero assignment."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Unselected (all-comers) disease cohort",
        "pros_of_this": "Answers the marker-conditional clinical, regulatory, and reimbursement question directly and supports biomarker-targeted external-control arms.",
        "cons_of_this": "Restricted to the tested subpopulation; generalizability to the full indication requires an explicit testing-selection argument.",
        "when_to_prefer": "When the clinical or HTA decision is itself biomarker-conditional (companion diagnostic, actionability tier)."
      },
      {
        "compared_to": "Cohort defined on the biomarker test order rather than the result",
        "pros_of_this": "Separates positives from negatives and enables true marker-conditional outcomes rather than a mixed-strata contrast.",
        "cons_of_this": "Requires EHR/registry/linked data carrying the actual call; not feasible in claims-only data.",
        "when_to_prefer": "Whenever the result is available; reserve test-order surrogates for testing-rate and utilization questions."
      },
      {
        "compared_to": "Target-trial emulation with clone-censor-weight from diagnosis",
        "pros_of_this": "Simpler to specify and communicate; left-truncated entry removes immortal time cleanly for outcomes measured from the result date forward.",
        "cons_of_this": "Cannot recover the full from-diagnosis estimand and conditions on surviving to test.",
        "when_to_prefer": "When the question is outcomes from biomarker ascertainment forward rather than from diagnosis."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Carries the test event (CPT/HCPCS/PLA codes) and date, never the result. Require continuous medical enrollment so an absent code means \"not tested\"; exclude Medicare Advantage-only person-time where fee-for-service claims are unavailable. Use for testing rates and downstream cost/utilization only.",
      "ehr": "Result lives in pathology/molecular-report fields or unstructured notes (often needing NLP/abstraction). Record assay, specimen date, and report date; watch external reference-lab leakage and report-vs-specimen-date ambiguity; treat differential loss to follow-up by marker as potentially informative.",
      "registry": "Strong for adjudicated stage, histology, and a curated biomarker field with defined assay rules; weak for full treatment/pharmacy exposure. Link to claims for therapy and to a death index for OS and censoring.",
      "linked": "Linked EHR-claims-genomic data is the ideal substrate (result + severity + completeness + mortality) but introduces linkage selection, NGS-tested selection, and report/specimen/treatment date discrepancies that must be reconciled before time-zero assignment."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\n\nQUALIFYING = {(\"EGFR\", \"exon19del\"), (\"EGFR\", \"L858R\")}  # pre-specified positive calls\nELIGIBLE_ASSAYS = {\"NGS\", \"single_gene_validated\"}        # pre-specified platforms\n\ndef build_biomarker_cohort(dx: pd.DataFrame, bmarker: pd.DataFrame, tx: pd.DataFrame) -> pd.DataFrame:\n    # 1) Result-based positive definition; ascertainment date = report_date (when status becomes known).\n    pos = bmarker[bmarker[\"assay_type\"].isin(ELIGIBLE_ASSAYS)].copy()\n    pos = pos[pos[[\"gene\", \"alteration\"]].apply(tuple, axis=1).isin(QUALIFYING)]\n    # Earliest qualifying report per person = the date biomarker status is first known in the data.\n    ascertain = (pos.sort_values(\"report_date\")\n                    .groupby(\"person_id\", as_index=False)\n                    .agg(egfr_result_date=(\"report_date\", \"first\"),\n                         assay_type=(\"assay_type\", \"first\")))\n\n    cohort = dx.merge(ascertain, on=\"person_id\", how=\"inner\")  # tested + positive only\n\n    # 2) Eligibility starts when BOTH advanced disease and a known positive result exist.\n    cohort[\"eligible_from\"] = cohort[[\"advanced_dx_date\", \"egfr_result_date\"]].max(axis=1)\n\n    # 3) Time zero = first treatment line starting ON OR AFTER eligibility (status known at entry,\n    #    so no survived-to-test, pre-result person-time is counted as exposed follow-up).\n    tx_sorted = tx.sort_values(\"line_start_date\")\n    merged = tx_sorted.merge(cohort[[\"person_id\", \"eligible_from\"]], on=\"person_id\")\n    first_line = (merged[merged[\"line_start_date\"] >= merged[\"eligible_from\"]]\n                  .groupby(\"person_id\", as_index=False)\n                  .first()\n                  .rename(columns={\"line_start_date\": \"index_date\"}))\n\n    out = cohort.merge(first_line[[\"person_id\", \"index_date\", \"regimen\"]], on=\"person_id\", how=\"inner\")\n    # 4) Alternative from-diagnosis estimand: left-truncate 
the risk set at the result date.\n    out[\"truncation_entry\"] = out[\"egfr_result_date\"]  # delayed entry for OS-from-diagnosis analyses\n    out[\"baseline_start\"] = out[\"index_date\"] - pd.Timedelta(days=365)  # covariate window ends at index\n    return out[[\"person_id\", \"advanced_dx_date\", \"egfr_result_date\", \"assay_type\",\n                \"index_date\", \"regimen\", \"truncation_entry\", \"baseline_start\"]]",
        "description": "Biomarker-defined cohort construction from linked EHR/genomic inputs, with index-date logic that defeats immortal time.\nRequired inputs (already cleaned and de-duplicated):\n  dx       : advanced-disease diagnosis -> person_id, advanced_dx_date (datetime)\n  bmarker  : biomarker results          -> person_id, report_date, specimen_date, gene, alteration, assay_type\n  tx       : treatment line starts      -> person_id, line_start_date, regimen, line_number\nDefines POSITIVE membership from the result (not the test order), then either (a) enters at the qualifying line start on\nor after biomarker ascertainment (no pre-result exposed time), or (b) left-truncates from-diagnosis follow-up at the\nresult date. Build covariates only from windows ending at index/entry.",
        "dependencies": [
          "pandas"
        ],
        "source_citations": [
          "singal-2019",
          "nakamura-2023"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\n\nQUALIFYING <- list(c(\"EGFR\", \"exon19del\"), c(\"EGFR\", \"L858R\"))\nELIGIBLE_ASSAYS <- c(\"NGS\", \"single_gene_validated\")\n\nbuild_biomarker_cohort <- function(dx, bmarker, tx) {\n  setDT(dx); setDT(bmarker); setDT(tx)\n\n  # Result-based positive definition; ascertainment = earliest qualifying report date.\n  pos <- bmarker[assay_type %chin% ELIGIBLE_ASSAYS]\n  keep <- pos[, paste(gene, alteration) %chin% sapply(QUALIFYING, paste, collapse = \" \")]\n  pos <- pos[keep == TRUE]\n  setorder(pos, report_date)\n  ascertain <- pos[, .(egfr_result_date = report_date[1L], assay_type = assay_type[1L]), by = person_id]\n\n  cohort <- merge(dx, ascertain, by = \"person_id\")            # tested + positive only\n  cohort[, eligible_from := pmax(advanced_dx_date, egfr_result_date)]\n\n  # Time zero = first line starting on/after eligibility (status known at entry).\n  setorder(tx, line_start_date)\n  m <- merge(tx, cohort[, .(person_id, eligible_from)], by = \"person_id\")\n  first_line <- m[line_start_date >= eligible_from,\n                  .(index_date = line_start_date[1L], regimen = regimen[1L]), by = person_id]\n\n  out <- merge(cohort, first_line, by = \"person_id\")\n  out[, truncation_entry := egfr_result_date]                 # delayed entry for from-diagnosis OS\n  out[, baseline_start := index_date - 365L]\n  out[, .(person_id, advanced_dx_date, egfr_result_date, assay_type,\n          index_date, regimen, truncation_entry, baseline_start)]\n}",
        "description": "Biomarker-defined cohort construction with data.table. Inputs mirror the Python version:\n  dx      : person_id, advanced_dx_date (Date)\n  bmarker : person_id, report_date (Date), specimen_date (Date), gene, alteration, assay_type\n  tx      : person_id, line_start_date (Date), regimen, line_number\nIndex date avoids immortal time; truncation_entry supports a left-truncated from-diagnosis analysis.",
        "dependencies": [
          "data.table"
        ],
        "source_citations": [
          "singal-2019",
          "nakamura-2023"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* 1) Result-based positive definition; ascertainment date = earliest qualifying report. */\nproc sql;\n  create table ascertain as\n  select person_id,\n         min(report_date) as egfr_result_date format=date9.\n  from work.bmarker\n  where assay_type in ('NGS','single_gene_validated')\n    and gene = 'EGFR'\n    and alteration in ('exon19del','L858R')\n  group by person_id;\nquit;\n\n/* 2) Tested + positive cohort; eligibility begins when advanced disease AND a known result both hold. */\nproc sql;\n  create table cohort as\n  select d.person_id,\n         d.advanced_dx_date,\n         a.egfr_result_date,\n         max(d.advanced_dx_date, a.egfr_result_date) as eligible_from format=date9.\n  from work.dx d\n  inner join ascertain a\n    on d.person_id = a.person_id;\nquit;\n\n/* 3) Time zero = first treatment line starting ON OR AFTER eligibility (status known at entry;\n      no survived-to-test, pre-result person-time is counted as exposed follow-up). */\nproc sql;\n  create table cohort_index as\n  select c.person_id,\n         c.advanced_dx_date,\n         c.egfr_result_date,\n         min(t.line_start_date)        as index_date format=date9.,\n         c.egfr_result_date            as truncation_entry format=date9.,  /* delayed entry option */\n         calculated index_date - 365   as baseline_start format=date9.\n  from cohort c\n  inner join work.tx t\n    on t.person_id = c.person_id\n   and t.line_start_date >= c.eligible_from\n  group by c.person_id, c.advanced_dx_date, c.egfr_result_date;\nquit;\n\n/* 4) Left-truncated (delayed-entry) survival from diagnosis: enter the risk set at the result date. */\n/*    proc phreg data=surv;                                                                          */\n/*      model (truncation_entry, exit_date)*event(0) = trt covars / ties=efron;  /* entry, exit time */\n/*    run;                                                                                           */",
        "description": "Biomarker-defined cohort construction in SAS (PROC SQL), with immortal-time-safe index dating. Required input datasets\n(post data-management):\n  work.dx      : person_id, advanced_dx_date\n  work.bmarker : person_id, report_date, specimen_date, gene, alteration, assay_type\n  work.tx      : person_id, line_start_date, regimen, line_number\nPositive membership is read from the RESULT (not the test order). Time zero is the first line on/after biomarker\nascertainment; truncation_entry supports a separate left-truncated survival analysis (entry= in PROC PHREG / PROC LIFETEST).",
        "dependencies": [],
        "source_citations": [
          "singal-2019",
          "nakamura-2023"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Pop[Advanced-disease patients in routine care] --> Tested{Biomarker test ordered<br/>and result captured?}\n  Tested -- No --> Untested[Untested subset<br/>differs by access / severity / site]\n  Tested -- Yes --> Result{Qualifying positive call?<br/>result-based, not test-order}\n  Result -- No --> Neg[Biomarker-negative stratum]\n  Result -- Yes --> Ascertain[Ascertainment date = report_date<br/>status now known in data]\n  Ascertain --> Entry[Entry / time zero on or after ascertainment<br/>or left-truncate from diagnosis]\n  Entry --> Sel[Report tested-vs-untested selection<br/>+ assay/threshold sensitivity]\n  Sel --> Analysis[Marker-conditional outcome analysis]",
        "caption": "Decision logic for a biomarker-defined cohort. Conditioning on a captured result selects the tested subset and requires entry at ascertainment (or left truncation) so that biomarker status is known at time zero and no survived-to-test person-time is counted.",
        "alt_text": "Flowchart from the advanced-disease population through whether a result was captured, whether the call is a qualifying positive, ascertainment dating, immortal-time-safe entry, selection and assay sensitivity checks, and the final marker-conditional analysis.",
        "source_type": "illustrative",
        "source_citations": [
          "nakamura-2023"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "gantt\n  title Index-date choice for one biomarker-positive patient (immortal time)\n  dateFormat YYYY-MM-DD\n  axisFormat %b %Y\n  section Pre-result (survived-to-test)\n  Advanced diagnosis :milestone, dx, 2023-02-01, 0d\n  First-line therapy started (status unknown) :crit, fl, 2023-02-10, 60d\n  section Ascertainment\n  NGS report -> EGFR positive :milestone, rep, 2023-04-15, 0d\n  section Immortal-time-safe follow-up\n  Risk set entry at result date (left truncation) :active, ft, 2023-04-15, 200d",
        "caption": "Anchoring follow-up at diagnosis while requiring a later NGS result builds in immortal time (the red pre-result span); entering the risk set at the result date via left truncation removes the survived-to-test interval.",
        "alt_text": "Gantt timeline showing diagnosis in February 2023, a first-line therapy span before the result is known, an NGS report in April 2023, and risk-set entry at the result date with left truncation.",
        "source_type": "illustrative",
        "source_citations": [
          "nakamura-2023"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "is_variant_of",
        "target_slug": "special-populations-rwe-methods",
        "notes": "Biomarker-defined cohorts are a special-populations design family where membership depends on a molecular/laboratory result rather than a demographic or clinical category."
      },
      {
        "relation_type": "see_also",
        "target_slug": "immortal-time-bias-handling",
        "notes": "Requiring a biomarker result obtained after follow-up start induces immortal time; entry at ascertainment or left truncation is the fix."
      },
      {
        "relation_type": "used_with",
        "target_slug": "target-trial-emulation",
        "notes": "Specifying eligibility (known biomarker status) and time zero via a target-trial protocol disciplines index-date choice and supports clone-censor-weight when the from-diagnosis estimand is required."
      },
      {
        "relation_type": "used_with",
        "target_slug": "single-arm-external-control",
        "notes": "Biomarker-defined real-world cohorts are the usual substrate for external/synthetic control arms supporting single-arm trials of marker-targeted therapies."
      },
      {
        "relation_type": "see_also",
        "target_slug": "active-comparator-new-user",
        "notes": "When the biomarker-defined cohort drives a comparative treatment contrast, the new-user + active-comparator structure controls confounding by indication on top of the testing-selection layer."
      }
    ],
    "aliases": [
      "biomarker-defined cohort",
      "biomarker-enriched cohort",
      "molecularly-defined cohort",
      "companion-diagnostic-defined cohort",
      "clinicogenomic cohort"
    ],
    "complexity": "advanced",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "bootstrap-resampling-methods",
    "name": "Bootstrap and Resampling Methods",
    "short_definition": "A simulation-based approach to quantifying uncertainty: draw thousands of random re-samples (with replacement) from the observed data, compute the statistic of interest in each re-sample to build an empirical sampling distribution, and read off standard errors and confidence intervals directly from that distribution — without assuming normality or deriving analytic formulas. In HEOR, the bootstrap is the standard tool for uncertainty around skewed mean costs, ICER ellipses on the cost-effectiveness plane, medians and quantiles, and any pipeline (e.g., propensity-score matching) where closed-form standard errors are wrong or unavailable; the cluster bootstrap — which resamples whole patients rather than individual claim rows — is the RWE-critical variant for correlated longitudinal data.",
    "long_description": "**What the bootstrap actually does**\n\nThe bootstrap, introduced by Efron (1979), exploits a simple observation: the observed data\nare the best available approximation to the population. If you treat the n observations as\na surrogate population and draw B re-samples of size n *with replacement*, each re-sample\nmimics \"what would happen if you ran the study again.\" The statistic computed on each\nre-sample is one draw from the bootstrap sampling distribution. Aggregate B such draws and\nyou have an empirical approximation to the sampling distribution of your estimator — obtained\npurely from the data, without assuming normality, without a closed-form SE formula, and\nwithout the Delta method.\n\nThree quantities come out of the bootstrap distribution: the *bootstrap standard error* (the\nstandard deviation of the B replicate statistics), the *bootstrap bias estimate* (the mean\nof the replicates minus the original-sample statistic), and *bootstrap confidence intervals*\nin one of several flavors described below. The bias is usually small relative to the SE, but\nfor ICERs and ratio estimators it can matter.\n\n**Types of bootstrap confidence intervals**\n\nThree variants are in common use, in increasing order of accuracy:\n\nThe *normal-approximation* CI uses the bootstrap SE as a plug-in: estimate ± 1.96 × boot_SE.\nThis is the simplest form but assumes the bootstrap distribution is symmetric and bell-shaped\n— an assumption violated by many HEOR estimands (costs, ICERs).\n\nThe *percentile* CI reads directly from the tails of the bootstrap distribution: the 2.5th\nand 97.5th percentiles of the B replicate statistics. It is transformation-equivariant\n(working on the log scale gives the same result as exponentiating percentile bounds computed\non the log scale), does not assume symmetry, and is the dominant method in health economics\npractice. 
Its main limitation is poor coverage when the bootstrap distribution is skewed or\nbiased.\n\nThe *bias-corrected and accelerated* (BCa) CI adjusts the percentiles used to extract the\nCI, correcting for both bootstrap bias and the rate at which the SE changes with the\nparameter value (acceleration). BCa generally achieves better coverage than the percentile\nmethod, particularly for small n or skewed estimands, but requires B ≥ 2000 replicates and\njackknife acceleration estimates. For ICER CIs reported to HTA bodies (NICE, ICER), BCa is\nthe defensible choice.\n\n**When bootstrap beats formulas in HEOR**\n\nFour situations in which the bootstrap is the right tool, not just a convenience:\n\n*Mean cost differences from skewed claims data* — the canonical HEOR application. Total\nannual costs per patient follow a right-skewed distribution with a spike at low values and\na heavy upper tail driven by high-cost admissions. The analytic CI for the mean difference\nassumes an approximately normal sampling distribution of the mean, which the CLT protects\nonly at large n. At moderate n (< 500 per arm), or when comparing subgroups, the CLT may\nnot have kicked in and the normal-approximation CI is unreliable. The bootstrap does not\nrely on that normal approximation, although it is itself an approximation whose accuracy\nimproves with n. The bootstrap is standard here: Barber and Thompson (2000) demonstrated it\nspecifically for cost data in randomized trials.\n\n*ICER uncertainty* — a cluster of cost-effectiveness pairs on the CE plane. The ICER is a\nratio of incremental cost to incremental effect: ICER = ΔCost / ΔEffect. The Delta-method SE\nfor a ratio is unreliable when ΔEffect is close to zero or when costs and effects are both\nskewed. The bootstrap generates B pairs (ΔCost_b, ΔEffect_b) that trace out the joint\nsampling distribution on the cost-effectiveness plane. 
Reading the 95% CI from the\npercentile ellipse is the standard way to characterize ICER uncertainty for HTA submissions\nand net monetary benefit (NMB) analyses.\n\n*Medians and quantiles* — there is no simple analytic SE formula for the median that works\nat small-to-moderate n with skewed data; the bootstrap is the default. Similarly for the\n75th percentile of cost distributions or the 90th percentile of length-of-stay, neither of\nwhich has a standard formula under non-normal data.\n\n*Complex pipelines where analytic SEs are wrong* — propensity-score matching, inverse\nprobability weighting, two-part cost models, and propensity-trimmed cohorts all involve\nmultiple estimation steps. Analytic SEs derived from the final-stage regression ignore the\nvariance introduced by the earlier stages (selecting a caliper, estimating weights, choosing\na bandwidth). The full-pipeline cluster bootstrap — re-running the entire pipeline in each\nreplicate — correctly propagates all sources of uncertainty.\n\n**The cluster bootstrap: the RWE-critical variant**\n\nIn claims and EHR data, the unit of analysis is the patient, but the unit of observation is\noften the claim row: a patient may contribute dozens of pharmacy, outpatient, or inpatient\nrows. A naive bootstrap that resamples rows treats rows as independent, underestimating the\nSE because it ignores within-patient correlation. The *cluster bootstrap* resamples whole\npatients: draw B samples of n patients with replacement from the patient roster; for each\nsampled patient, keep all of their rows intact. This preserves the longitudinal correlation\nstructure and gives a valid SE even when patients have very different numbers of claims.\n\nThe cluster bootstrap is also the correct approach for matched cohorts, where pairs or\nclusters of matched patients must be resampled together to maintain balance. 
It extends\nnaturally to multi-level data (patients nested in hospitals) by resampling the highest-level\ncluster (hospitals).\n\n**Number of replicates: conventions**\n\nB = 1000 replicates is the minimum for a reliable percentile CI at the 95% level. B = 2000\nis the recommended minimum for BCa CIs (the jackknife acceleration estimate adds variance\nthat additional replicates dampen). B = 5000–10000 is used for small-p tail probabilities,\nbootstrap hypothesis tests, or when the number of patients is small (< 100) and the bootstrap\ndistribution is discrete. Runtime scales linearly with B; for large claims datasets, the\ncluster bootstrap is parallelizable by splitting replicate indices across cores.\n\n**Permutation tests: the testing cousin**\n\nThe permutation test addresses the same question — is this difference consistent with chance\n— using a complementary idea. Rather than resampling with replacement, it randomly shuffles\nthe group labels among the observed units and computes the test statistic in each shuffled\ndataset. The resulting null distribution is the exact distribution under the null of\nexchangeability (no treatment effect). The p-value is the fraction of permuted statistics\nmore extreme than the observed. Permutation tests are valid under a weaker assumption than\nthe bootstrap (exchangeability rather than IID sampling) and are exact in finite samples;\nthey do not produce CIs or effect estimates. The two methods are complementary: use the\nbootstrap when the goal is estimation and uncertainty quantification; use permutation when\nthe primary goal is testing a sharp null with no parametric assumptions.\n\n**Failure modes**\n\n*Tiny n* (< 20): the bootstrap distribution is discrete — only a finite number of distinct\nre-samples are possible — and tail coverage is poor. At n = 5, the exact percentile CI is\ndominated by the two extreme values in the sample; any additional precision claimed from the\nbootstrap is illusory. 
A minimum of n ≈ 30 is a practical threshold for reliable percentile\nCIs; BCa CIs require still larger n.\n\n*Extreme tails and heavy-tailed distributions*: when the underlying distribution is\nPareto-like (insurance catastrophic claims, rare high-cost events), the bootstrap SE is\nhighly variable across experiments because each re-sample may or may not capture the extreme\nobservation. Coverage of the bootstrap CI for the mean can be poor. In these cases,\nwinsorization of costs above the 99th percentile, or a model-based approach (a gamma GLM or\na log-normal model), may be more stable.\n\n*Matching with replacement subtleties*: in propensity-score matching where controls are\nmatched with replacement, the effective sample size for controls is smaller than the nominal\nsample size (some controls are used many times). The bootstrap should resample patients\nfrom the unmatched cohort and re-do the matching in each replicate, rather than resampling\nthe already-matched dataset, which would ignore the uncertainty introduced by the matching\nstep.\n\n*Time-series and panel data*: standard bootstrap is invalid for longitudinally correlated\noutcomes where the correlation is across time within a patient. The cluster bootstrap (cluster\n= patient) handles this correctly. 
If the research question involves time trends, a block\nbootstrap (sampling blocks of consecutive time-periods) may be needed in addition.\n\n**Pros, cons, and trade-offs**\n\nPros: applicable to a wide class of estimators without distributional assumptions; propagates\nuncertainty through complex pipelines where analytic SEs are unavailable or wrong, provided\nevery step is re-run in each replicate; produces full\nempirical sampling distributions that can be plotted and inspected; the cluster variant\nnaturally handles correlated data; BCa CIs have asymptotically correct coverage under\nmild regularity conditions; computationally straightforward to parallelize.\n\nCons: computationally intensive (B × pipeline runtime); fails at very small n or under very\nheavy tails; does not substitute for a valid identification strategy — the bootstrap\nquantifies sampling uncertainty but cannot correct for unmeasured confounding or selection\nbias; naive implementation (resampling rows instead of patients) in claims data gives\nanti-conservative SEs; results vary slightly from run to run (Monte Carlo error) unless the\nrandom seed is fixed.\n\n**When to use**\n\nUse the bootstrap when: (a) the estimator has no tractable analytic SE (ICERs, medians,\ncomplex pipeline outputs); (b) the outcome is right-skewed and n is insufficient for the\nCLT to protect the normal approximation for the mean (cost analyses at n < 500 per arm,\nsubgroup analyses); (c) the data have a clustered structure (claims, EHR rows nested within\npatients) and the cluster bootstrap is the correct SE estimator; (d) the analysis involves\nPS-matching, IPTW, or other multi-stage procedures where the SE must propagate uncertainty\nthrough all stages; (e) a full sampling distribution on the cost-effectiveness plane is\nneeded for probabilistic sensitivity analysis.\n\n**When NOT to use**\n\nDo not use the bootstrap as the primary method when: (a) n is very small (fewer than about\n20–30 independent units), because the discrete bootstrap distribution gives unreliable tail\ncoverage; (b) the\nestimand is a mean from a 
large balanced RCT where the CLT is fully operational and an\nanalytic t-test CI is both valid and simpler; (c) the naive implementation resamples\nrows rather than patients in claims data — this is a common error that produces CIs that are\ntoo narrow; (d) the goal is a sharp null test without the need for CIs — a permutation test\nis exact under exchangeability and more direct in that case; (e) the pipeline does not re-run all\nestimation steps (PS model, matching, outcome model) within each replicate — partial\nbootstrapping understates the true uncertainty.\n\n**Interpreting the output**\n\nThe worked example below uses five patients with annual costs [1200, 3400, 800, 5600, 2100]\nUSD (n = 5, illustrative only). The original sample mean is 2620 USD. Three explicit\nre-samples yield bootstrap replicate means of 2180, 3060, and 2880 USD. With B = 2000\nreplicates run in software, the bootstrap SE ≈ 780 USD and the percentile 95% CI ≈\n[1220, 4280] USD.\n\n*(1) Formal interpretation.* The bootstrap percentile 95% CI [1220, 4280] is a data-derived\ninterval: nominally, in 95% of hypothetical repeated studies of size n = 5 drawn from the same\npopulation, an interval constructed by the same percentile procedure would contain the true\npopulation mean cost (at n = 5 the actual coverage falls short of nominal, as noted under\nfailure modes). The interval is not symmetric around the point estimate (2620 USD) —\nit extends further to the right than to the left — because the bootstrap distribution\ninherits the right-skew of the underlying cost distribution, which pulls replicate means\nupward when the high-cost patient (5600 USD) is over-represented in a re-sample.\n\n*(2) Practical interpretation.* The mean annual cost is 2620 USD with a 95% confidence\ninterval of roughly 1220 to 4280 USD. This wide range reflects genuine uncertainty from a\nsmall sample; a decision-maker should treat the true mean as plausibly anywhere from about\nhalf to about 1.6 times the point estimate. 
In a real analysis with n in the hundreds per arm, this\ninterval would be far narrower, but the bootstrap procedure — including the cluster variant\nthat resamples patients rather than individual claims rows — remains the correct method for\nskewed cost data regardless of sample size.",
    "primary_category": "Inferential_Statistics",
    "tags": [
      "statistics",
      "bootstrap",
      "resampling",
      "confidence-intervals",
      "uncertainty-quantification",
      "cost-analysis",
      "icer",
      "nonparametric",
      "simulation",
      "cluster-bootstrap",
      "permutation-test"
    ],
    "applies_to_study_types": [
      "cohort_retrospective",
      "cohort_prospective",
      "rct",
      "cross_sectional",
      "active_comparator_new_user",
      "claims_analysis",
      "ehr_study",
      "registry_study",
      "cost_effectiveness_analysis"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "primary",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1214/aos/1176344552",
        "url": "https://doi.org/10.1214/aos/1176344552",
        "citation_text": "Efron B. Bootstrap methods: another look at the jackknife. Annals of Statistics. 1979;7(1):1-26.",
        "year": 1979,
        "authors_short": "Efron",
        "notes": "The foundational paper introducing the bootstrap. Efron showed that treating the empirical distribution as a surrogate population and resampling from it yields consistent SE and CI estimates for a wide class of estimators without analytic formulas."
      },
      {
        "role": "explain",
        "doi": "10.1002/(sici)1097-0258(20000515)19:9<1141::aid-sim479>3.0.co;2-f",
        "url": "https://doi.org/10.1002/(sici)1097-0258(20000515)19:9<1141::aid-sim479>3.0.co;2-f",
        "citation_text": "Carpenter J, Bithell J. Bootstrap confidence intervals: when, which, what? A practical guide for medical statisticians. Statistics in Medicine. 2000;19(9):1141-1164.",
        "year": 2000,
        "authors_short": "Carpenter & Bithell",
        "notes": "The definitive applied guide to choosing among normal-approximation, percentile, and BCa bootstrap CIs. Covers coverage properties, required number of replicates, and failure modes. Widely cited in medical statistics for guiding the percentile-vs-BCa choice."
      },
      {
        "role": "demonstrate",
        "doi": "10.1002/1097-0258(20001215)19:23<3219::aid-sim623>3.0.co;2-p",
        "url": "https://doi.org/10.1002/1097-0258(20001215)19:23<3219::aid-sim623>3.0.co;2-p",
        "citation_text": "Barber JA, Thompson SG. Analysis of cost data in randomized trials: an application of the non-parametric bootstrap. Statistics in Medicine. 2000;19(23):3219-3236.",
        "year": 2000,
        "authors_short": "Barber & Thompson",
        "notes": "Demonstrated the non-parametric bootstrap specifically for cost data in randomized trials, establishing it as the standard method for skewed healthcare cost distributions. Shows coverage properties of percentile and BCa CIs on realistic cost data."
      }
    ],
    "plain_language_summary": "The bootstrap is a way to measure how uncertain an estimate is — like a sample mean or a cost difference — without needing a complex formula. You take your observed data and repeatedly draw a new sample of the same size from it (allowing the same observation to be picked more than once), compute your statistic in each fake sample, and the spread of those results tells you how much the estimate could vary if you ran the study again. In health economics, the bootstrap is the go-to tool for costs (which are rarely bell-shaped) and for situations where the statistical pipeline is too complex for a formula — as long as you resample whole patients, not individual claim rows, to keep correlated data intact.",
    "key_terms": [
      {
        "term": "resampling with replacement",
        "definition": "Drawing a new sample of the same size from the observed data while allowing the same observation to appear multiple times, so each draw is independent of the others."
      },
      {
        "term": "bootstrap replicate",
        "definition": "One random re-sample drawn with replacement from the original data, used to compute one value of the statistic being studied."
      },
      {
        "term": "percentile confidence interval",
        "definition": "A bootstrap CI formed by taking the 2.5th and 97.5th percentiles of the distribution of replicate statistics — no symmetry assumption is required."
      },
      {
        "term": "BCa interval",
        "definition": "Bias-corrected and accelerated bootstrap CI; adjusts the percentile cutoffs to correct for bias and for how the standard error changes with the parameter value, giving better coverage than the plain percentile method."
      },
      {
        "term": "cluster bootstrap",
        "definition": "A bootstrap variant that resamples whole patients (or other clusters) rather than individual rows, preserving the correlation among multiple claims or visits from the same person."
      },
      {
        "term": "permutation test",
        "definition": "A significance test that randomly shuffles group labels among observed units to build a null distribution, used for testing rather than estimation; the bootstrap's hypothesis- testing cousin."
      }
    ],
    "worked_example": {
      "scenario": "A health economist is estimating the mean annual total cost for five patients with a rare condition in a pilot claims analysis. The five observed costs are right-skewed (one patient has very high costs). She applies the bootstrap to quantify uncertainty around the mean without assuming normality. She draws three explicit re-samples by hand to illustrate the method, then would run B = 2000 replicates in software for the final CI.",
      "dataset": {
        "caption": "Annual total cost per patient (USD) from five pilot patients. Costs are right-skewed; patient P04 is a high-cost outlier. These are the five observations that will be resampled with replacement.",
        "columns": [
          "patient_id",
          "annual_cost_usd"
        ],
        "rows": [
          [
            "P01",
            1200
          ],
          [
            "P02",
            3400
          ],
          [
            "P03",
            800
          ],
          [
            "P04",
            5600
          ],
          [
            "P05",
            2100
          ]
        ]
      },
      "steps": [
        "Compute the original sample mean. Sum of the five costs: 1200+3400+800+5600+2100 = 13100 dollars. Original mean = 13100/5 = 2620 dollars.",
        "Draw Resample A with replacement (five draws, order determined by random indices). One possible draw: [800, 3400, 1200, 3400, 2100]. P02 and P03 appear once each, P01 once, P02 again, P05 once; P04 is absent. Sum = 800+3400+1200+3400+2100 = 10900 dollars. Resample A mean = 10900/5 = 2180 dollars. This resample missed the high-cost patient, so its mean is below the original.",
        "Draw Resample B with replacement. One possible draw: [5600, 1200, 5600, 800, 2100]. P04 appears twice (sampled on two draws), P01 once, P03 once, P05 once. Sum = 5600+1200+5600+800+2100 = 15300 dollars. Resample B mean = 15300/5 = 3060 dollars. This resample overrepresents the high-cost patient, pulling the mean above the original.",
        "Draw Resample C with replacement. One possible draw: [2100, 2100, 3400, 1200, 5600]. P05 appears twice, P02 once, P01 once, P04 once. Sum = 2100+2100+3400+1200+5600 = 14400 dollars. Resample C mean = 14400/5 = 2880 dollars.",
        "The three replicate means are 2180, 3060, and 2880 USD. Their spread shows that the original estimate of 2620 USD is uncertain. With B = 2000 replicates run in software, the standard deviation of all replicate means — the bootstrap SE — is approximately 860 USD, and the percentile 95% CI spans approximately [1060, 4740] USD.",
        "In a real claims analysis, the cluster bootstrap would resample patient IDs rather than individual claim rows, so all claims for a drawn patient travel together and within-patient correlation is preserved."
      ],
      "result": "Original mean = 13100/5 = 2620 USD. Three explicit bootstrap replicate means: 10900/5 = 2180 USD (missed high-cost patient), 15300/5 = 3060 USD (over-represented high-cost patient), 14400/5 = 2880 USD. The variability across resamples — from 2180 to 3060 USD — illustrates that the true mean is uncertain by roughly plus or minus 440 USD even across just three replicates; with B = 2000 replicates the bootstrap SE is approximately 860 USD and the percentile 95% CI is approximately [1060, 4740] USD."
    },
    "prerequisites": [
      "inferential-statistics-foundations",
      "healthcare-costs-pppm-pppy-pmpm"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Standard (IID) bootstrap for mean costs",
        "description": "Resample n observations with replacement B times from the per-patient cost vector; compute the mean in each replicate; report the percentile or BCa 95% CI. This is the canonical application for cost analyses where patients are independent. Use B ≥ 1000 for percentile CIs and B ≥ 2000 for BCa.",
        "edge_cases": [
          "At n < 30 the bootstrap distribution is discrete and coverage is unreliable; report the CI with a note that n is below the threshold for reliable bootstrap inference.",
          "For mean cost differences between two arms, bootstrap the difference directly (paired resamples from each arm, preserving arm size) rather than computing arm-specific CIs and subtracting."
        ],
        "data_source_notes": "Claims: use per-patient annual or episode-level totals, not claim rows, as the unit of resampling. EHR: use per-patient summary cost or utilization aggregates."
      },
      {
        "name": "Cluster bootstrap for correlated claims data",
        "description": "Resample whole patients (patient IDs) with replacement; for each sampled patient, retain all of their claim rows. Recompute the full analysis (aggregation, outcome construction) on the re-sampled dataset in each replicate. This correctly propagates within-patient claim correlation into the SE and is the required approach whenever the outcome is built from multiple rows per patient (e.g., sum of pharmacy claims, count of ER visits, days-supply summed over fills).",
        "edge_cases": [
          "If patients are nested within facilities (hospitals, plans), resample the highest-level cluster (facility) to capture both within-patient and within-facility correlation.",
          "In PS-matched cohorts where matching was done with replacement (controls can match multiple treated), resample treated patients and re-run the PS match in each replicate rather than resampling the already-matched pairs."
        ],
        "data_source_notes": "Claims and EHR: essential when the analysis unit is the patient but data are stored at claim/encounter level. Registry: often already patient-level, so standard bootstrap applies; cluster if registry has repeated assessments per patient."
      },
      {
        "name": "Bootstrap for ICER uncertainty (CE plane)",
        "description": "Generate B paired re-samples from the joint cost-and-effect patient-level dataset; compute ΔCost_b and ΔEffect_b in each replicate to plot a cloud of B points on the cost-effectiveness plane. Compute the net monetary benefit NMB_b = lambda * ΔEffect_b minus ΔCost_b in each replicate and read the 95% percentile CI from those values. This is the standard method for ICER uncertainty submitted to NICE and ICER.",
        "edge_cases": [
          "When ΔEffect crosses zero in some replicates, the ICER changes quadrant and the ratio is undefined; report the NMB CI rather than the ICER CI in those cases.",
          "For PSA-linked models, treat the bootstrap as the sampling uncertainty layer and run PSA over model parameters within each replicate if resources allow, or treat them as separate uncertainty layers."
        ],
        "data_source_notes": "Linked cost and effectiveness data from RCTs or observational cohorts; patient-level QALY estimates from linked registry or EHR utility weights."
      },
      {
        "name": "Permutation test (hypothesis-testing variant)",
        "description": "To test the null hypothesis that two groups have the same distribution: randomly shuffle the group labels among all patients B times; compute the test statistic (e.g., mean difference, median difference) in each shuffled dataset; the p-value is the fraction of shuffled statistics more extreme than the observed. No CI is produced. Use when the primary question is a sharp null test (exchangeability) without the need for an effect estimate or CI.",
        "edge_cases": [
          "Permutation tests assume exchangeability under the null — valid in RCTs with strict randomization; in observational data with uncontrolled confounding, the exchangeability assumption does not hold and the p-value is not interpretable as a causal test.",
          "For clustered data, permute whole clusters (patients), not individual observations, to maintain the correlation structure under the null."
        ],
        "data_source_notes": "Most natural in RCT settings. In observational cohorts, permutation tests are used primarily as sensitivity checks after primary adjusted analyses."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "gamma-distribution",
        "pros_of_this": "The bootstrap makes no distributional assumption and automatically handles any shape of cost distribution; it is valid even when the true distribution is not gamma, lognormal, or any other parametric family. For complex pipelines, it requires no model specification beyond the resampling unit.",
        "cons_of_this": "A parametric gamma or Tweedie GLM is more efficient when the distributional assumption is approximately correct — it uses the data more fully and can accommodate covariate adjustment, interactions, and subgroup contrasts without re-running a full bootstrap loop. The GLM also extrapolates to covariate values not well-represented in the sample; the bootstrap does not.",
        "when_to_prefer": "Use the bootstrap when distributional assumptions are questionable, n is moderate, or the pipeline is multi-stage. Prefer a gamma GLM when the distributional form is plausible, covariate adjustment is needed, and computational efficiency matters."
      },
      {
        "compared_to": "inferential-statistics-foundations",
        "pros_of_this": "The bootstrap provides valid SEs and CIs for estimators (medians, ratios, pipeline outputs) that have no closed-form analytic SE, extending inference beyond the textbook parametric formulas.",
        "cons_of_this": "For simple means in large samples where the CLT applies, analytic SEs from t-tests or regression are exact, computationally trivial, and easier to report. The bootstrap adds computational cost without benefit in those cases.",
        "when_to_prefer": "Use analytic SEs for simple means and proportions at large n. Use the bootstrap for non-standard estimands, small n, skewed outcomes, or multi-stage pipelines."
      },
      {
        "compared_to": "propensity-score-methods-psm-iptw",
        "pros_of_this": "The full-pipeline cluster bootstrap is the recommended SE method after propensity-score matching or IPTW, because it propagates uncertainty from the PS estimation step through the outcome model. Analytic SEs from the outcome regression that ignore the PS step are anti-conservative.",
        "cons_of_this": "The bootstrap for a large claims PS-matched cohort can be computationally expensive (re-running PS estimation, matching, and outcome model B times). Sandwich variance estimators or influence-function SEs for IPTW are computationally cheaper alternatives when the functional form is known.",
        "when_to_prefer": "Use the full-pipeline cluster bootstrap as the default for SE estimation after PS-matching or IPTW; use analytic sandwich SEs only when the pipeline is simple enough for a closed-form influence function to be derived."
      },
      {
        "compared_to": "icer-net-monetary-benefit-rwe",
        "pros_of_this": "The bootstrap generates the full joint sampling distribution of ΔCost and ΔEffect on the cost-effectiveness plane, enabling probabilistic sensitivity analysis, cost- effectiveness acceptability curves, and plotting of the CE ellipse — outputs that a simple Delta-method SE cannot provide.",
        "cons_of_this": "The Delta method is an order of magnitude faster for ICER SEs and is adequate when ΔEffect is well away from zero and both costs and effects are approximately normal.",
        "when_to_prefer": "Use the bootstrap for ICER uncertainty when ΔEffect is small or uncertain, when costs or effects are skewed, or when a full CE acceptability curve is needed for HTA."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "The cluster is always the patient, not the claim row. Build per-patient cost or utilization aggregates first; then bootstrap the patient-level vector. For pipeline analyses (PS match, two-part models), wrap the entire pipeline — PS estimation, matching, outcome model — inside the bootstrap loop and re-run it on each re-sampled patient roster. Fix the random seed for reproducibility. B = 2000 is the standard minimum for BCa CIs submitted to HTA bodies.",
      "ehr": "Cluster by patient (not by encounter or lab draw). When the outcome is built from longitudinal observations (e.g., mean A1c over a year constructed from multiple lab draws), all lab draws for a resampled patient travel with them. For very large EHR cohorts, consider block-bootstrap by calendar period if the analysis spans multiple years with changing coding practices.",
      "registry": "Registry data are usually already patient-level summaries; standard IID bootstrap applies. For registries with repeated patient visits, cluster by patient. BCa CIs are preferred for small registries (n < 200) where the bootstrap distribution may be skewed.",
      "primary": "Small randomized trials and pilot studies: the bootstrap is appropriate for cost endpoints but coverage is unreliable at n < 30. Report that the CI is based on a small bootstrap sample and that it is exploratory. For paired pre-post designs, use the paired bootstrap (resample patient-level differences).",
      "linked": "Linked claims-EHR-registry cohorts are patient-clustered by construction. Use the cluster bootstrap resampling the patient identifier across all linked files simultaneously, so that each resampled patient's claims, EHR rows, and registry entries are all retained together. Coordinate with the data governance team on whether bootstrap replicates must stay within the trusted research environment or can be exported as summary statistics."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\nfrom scipy import stats\n\nrng = np.random.default_rng(seed=42)\n\n# ── Motivating dataset: five annual costs from worked example ──\ncosts = np.array([1200, 3400, 800, 5600, 2100], dtype=float)\nn = len(costs)\noriginal_mean = costs.mean()\nprint(f\"Original sample mean: {original_mean:.2f} USD\")\n\n# ── 1. Manual bootstrap loop for SE and percentile CI ──\nB = 2000\nboot_means = np.empty(B)\nfor b in range(B):\n    sample = rng.choice(costs, size=n, replace=True)\n    boot_means[b] = sample.mean()\n\nboot_se = boot_means.std(ddof=1)\npct_ci = np.percentile(boot_means, [2.5, 97.5])\nprint(f\"Bootstrap SE: {boot_se:.2f}\")\nprint(f\"Percentile 95% CI: [{pct_ci[0]:.2f}, {pct_ci[1]:.2f}]\")\n\n# ── 2. BCa CI via scipy.stats.bootstrap ──\nresult = stats.bootstrap(\n    (costs,),\n    statistic=np.mean,\n    n_resamples=2000,\n    confidence_level=0.95,\n    method=\"BCa\",\n    random_state=42,\n)\nprint(f\"BCa 95% CI: [{result.confidence_interval.low:.2f}, {result.confidence_interval.high:.2f}]\")\nprint(f\"Bootstrap SE (scipy): {result.standard_error:.2f}\")\n\n# ── 3. 
Cluster bootstrap for correlated claims data ──\n# Simulated patient-level claims dataset: multiple rows per patient\nimport pandas as pd\n\nclaims = pd.DataFrame({\n    \"patient_id\": [1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 5, 5],\n    \"claim_cost\":  [400, 300, 500, 1200, 2200, 250, 250, 150, 150, 5600, 900, 1200],\n})\n\n# Build per-patient totals first (these are the analysis unit).\n# For a simple sum, aggregating once is equivalent to re-aggregating each\n# resampled claims table; a multi-stage pipeline would be re-run inside the loop.\npatient_totals = claims.groupby(\"patient_id\")[\"claim_cost\"].sum()\npatient_ids = patient_totals.index.to_numpy()\nn_patients = len(patient_ids)\n\n# Cluster bootstrap: resample patient IDs; all of a patient's claims travel together\nB_clust = 2000\nclust_means = np.empty(B_clust)\nfor b in range(B_clust):\n    sampled_ids = rng.choice(patient_ids, size=n_patients, replace=True)\n    # .loc with repeated labels returns one total per draw, so a patient\n    # drawn twice contributes twice to the replicate mean\n    clust_means[b] = patient_totals.loc[sampled_ids].mean()\n\nclust_pct_ci = np.percentile(clust_means, [2.5, 97.5])\nprint(f\"\\nCluster bootstrap mean: {np.mean(clust_means):.2f}\")\nprint(f\"Cluster bootstrap SE: {clust_means.std(ddof=1):.2f}\")\nprint(f\"Cluster percentile 95% CI: [{clust_pct_ci[0]:.2f}, {clust_pct_ci[1]:.2f}]\")\nprint(\"Note: cluster bootstrap resamples patients, not rows — correct for correlated claims.\")\n\n# ── 4. 
Permutation test for two-group mean difference ──\ngroup_a = np.array([1200, 3400, 800, 5600, 2100], dtype=float)\ngroup_b = np.array([900, 1100, 2200, 3000, 1500], dtype=float)\nobserved_diff = group_a.mean() - group_b.mean()\ncombined = np.concatenate([group_a, group_b])\nn_a = len(group_a)\n\nperm_diffs = np.empty(B)\nfor b in range(B):\n    perm = rng.permutation(combined)\n    perm_diffs[b] = perm[:n_a].mean() - perm[n_a:].mean()\n\np_perm = np.mean(np.abs(perm_diffs) >= np.abs(observed_diff))\nprint(f\"\\nPermutation test: observed diff = {observed_diff:.2f}, p = {p_perm:.4f}\")\nprint(\"Permutation test produces a p-value only (no CI); use bootstrap for estimation.\")",
        "description": "Bootstrap SE and percentile CI using a manual numpy loop (transparent), scipy.stats.bootstrap\nfor automated BCa CIs, and a cluster bootstrap by patient_id for correlated claims data.\nUses the five-patient cost dataset from the worked example as a minimal illustration;\nappend a larger synthetic dataset section for production use.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(boot)\n\n# ── Motivating dataset ──\ncosts <- c(1200, 3400, 800, 5600, 2100)\nn <- length(costs)\ncat(\"Original mean:\", mean(costs), \"USD\\n\")\n\n# ── 1. Manual bootstrap loop ──\nset.seed(42)\nB <- 2000\nboot_means <- replicate(B, mean(sample(costs, size = n, replace = TRUE)))\ncat(sprintf(\"Bootstrap SE: %.2f\\n\", sd(boot_means)))\ncat(sprintf(\"Percentile 95%% CI: [%.2f, %.2f]\\n\",\n            quantile(boot_means, 0.025), quantile(boot_means, 0.975)))\n\n# ── 2. boot package: percentile and BCa ──\nmean_fn <- function(data, indices) mean(data[indices])\nboot_obj <- boot(data = costs, statistic = mean_fn, R = B)\n\npct_ci <- boot.ci(boot_obj, type = \"perc\")\nbca_ci <- boot.ci(boot_obj, type = \"bca\")\ncat(sprintf(\"Percentile CI: [%.2f, %.2f]\\n\",\n            pct_ci$percent[4], pct_ci$percent[5]))\ncat(sprintf(\"BCa CI:        [%.2f, %.2f]\\n\",\n            bca_ci$bca[4], bca_ci$bca[5]))\ncat(\"BCa adjusts for bootstrap bias and acceleration (prefer for skewed estimands).\\n\")\n\n# ── 3. 
Cluster bootstrap by patient_id ──\nclaims <- data.frame(\n  patient_id = c(1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 5, 5),\n  claim_cost = c(400, 300, 500, 1200, 2200, 250, 250, 150, 150, 5600, 900, 1200)\n)\n\ncluster_bootstrap <- function(data, B = 2000, seed = 42) {\n  set.seed(seed)\n  patient_ids  <- unique(data$patient_id)\n  n_patients   <- length(patient_ids)\n  # Per-patient totals (these are the analysis units)\n  totals <- tapply(data$claim_cost, data$patient_id, sum)\n\n  replicate(B, {\n    drawn_ids  <- sample(patient_ids, size = n_patients, replace = TRUE)\n    mean(totals[as.character(drawn_ids)])\n  })\n}\n\nclust_means <- cluster_bootstrap(claims)\ncat(sprintf(\"\\nCluster bootstrap SE: %.2f\\n\", sd(clust_means)))\ncat(sprintf(\"Cluster percentile 95%% CI: [%.2f, %.2f]\\n\",\n            quantile(clust_means, 0.025), quantile(clust_means, 0.975)))\ncat(\"Cluster bootstrap: resamples patients not rows — correct for correlated claims data.\\n\")\n\n# ── 4. Permutation test ──\ngroup_a <- c(1200, 3400, 800, 5600, 2100)\ngroup_b <- c(900, 1100, 2200, 3000, 1500)\nobs_diff <- mean(group_a) - mean(group_b)\ncombined <- c(group_a, group_b)\nn_a <- length(group_a)\n\nset.seed(42)\nperm_diffs <- replicate(B, {\n  perm <- sample(combined)\n  mean(perm[seq_len(n_a)]) - mean(perm[(n_a + 1):length(perm)])\n})\np_perm <- mean(abs(perm_diffs) >= abs(obs_diff))\ncat(sprintf(\"\\nPermutation p-value: %.4f (observed diff = %.2f)\\n\", p_perm, obs_diff))\ncat(\"Permutation test gives a p-value only; use boot() for a CI on the mean difference.\\n\")\n\n# ── 5. 
Bootstrap for ICER uncertainty ──\n# Patient-level incremental cost and incremental effect (illustrative)\nset.seed(42)\nn_pts <- 50\ndelta_cost   <- rnorm(n_pts, mean = 5000, sd = 3000)\ndelta_effect <- rnorm(n_pts, mean = 0.10,  sd = 0.08)\nicer_fn <- function(data, idx) {\n  mean(data[idx, 1]) / mean(data[idx, 2])\n}\nicer_data <- cbind(delta_cost, delta_effect)\nicer_boot <- boot(data = icer_data, statistic = icer_fn, R = B)\nicer_pct  <- boot.ci(icer_boot, type = \"perc\")\ncat(sprintf(\"\\nICER bootstrap percentile 95%% CI: [%.0f, %.0f] USD/QALY\\n\",\n            icer_pct$percent[4], icer_pct$percent[5]))",
        "description": "Bootstrap CIs using the boot package (percentile and BCa), a manual loop for transparency,\nand a cluster bootstrap function resampling by patient_id. Includes a permutation test\nusing base R sample(). Uses the five-patient cost dataset from the worked example.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* ── 1. Create the patient-level cost dataset ── */\ndata work.patients;\n  input patient_id $ annual_cost_usd;\n  datalines;\nP01 1200\nP02 3400\nP03  800\nP04 5600\nP05 2100\n;\nrun;\n\n/* ── 2. Standard bootstrap: resample patient rows with replacement ──\n   PROC SURVEYSELECT METHOD=URS draws n units with replacement in each replicate.\n   REPS= sets the number of bootstrap replicates.\n   SEED= fixes the random seed for reproducibility.                               */\nproc surveyselect data=work.patients\n      method=urs         /* unrestricted random sampling = with replacement      */\n      n=5                /* resample the same number of units as the original    */\n      reps=2000          /* B bootstrap replicates                               */\n      seed=20240101\n      out=work.boot_sample\n      outall;            /* keep original sample in replicate 0                  */\nrun;\n/* Each replicate is identified by the variable Replicate (1 to 2000).           */\n\n/* ── 3. Compute mean cost in each replicate ── */\nproc means data=work.boot_sample noprint;\n  by Replicate;\n  var annual_cost_usd;\n  output out=work.boot_means (drop=_type_ _freq_)\n         mean=boot_mean;\nrun;\n/* Remove replicate 0 (original sample) for the bootstrap distribution           */\ndata work.boot_dist;\n  set work.boot_means;\n  where Replicate >= 1;\nrun;\n\n/* ── 4. Read percentile CI from the bootstrap distribution ── */\nproc univariate data=work.boot_dist noprint;\n  var boot_mean;\n  output out=work.boot_ci\n         mean=boot_mean_of_means\n         std=boot_se\n         pctlpts=2.5 97.5\n         pctlpre=pct_;\nrun;\nproc print data=work.boot_ci; run;\n/* pct_2_5 = lower bound; pct_97_5 = upper bound; boot_se = bootstrap SE        */\n\n/* ── 5. 
Cluster bootstrap: resample patients, keep all claim rows per patient ──\n   When data are at claim level (multiple rows per patient), cluster by patient_id\n   using PROC SURVEYSELECT with the CLUSTER statement.                            */\ndata work.claims;\n  input patient_id $ claim_cost;\n  datalines;\nP01  400\nP01  300\nP01  500\nP02 1200\nP02 2200\nP03  250\nP03  250\nP03  150\nP03  150\nP04 5600\nP05  900\nP05 1200\n;\nrun;\n\n/* Sort by cluster variable before SURVEYSELECT                                  */\nproc sort data=work.claims; by patient_id; run;\n\nproc surveyselect data=work.claims\n      method=urs\n      n=5               /* number of clusters (patients) to resample            */\n      reps=2000\n      seed=20240101\n      out=work.clust_boot;\n  cluster patient_id;   /* resample whole patients, keeping all their rows      */\nrun;\n/* Without OUTHITS, each drawn patient's rows appear once per replicate and\n   NumberHits records how many times that patient was drawn.                      */\n\n/* Compute per-patient totals in each replicate, carrying NumberHits along       */\nproc means data=work.clust_boot noprint nway;\n  by Replicate;\n  class patient_id NumberHits;      /* NumberHits is constant within a patient  */\n  var claim_cost;\n  output out=work.pt_totals sum=patient_total;\nrun;\n/* Mean across drawn patients: FREQ counts a patient once per draw               */\nproc means data=work.pt_totals noprint;\n  by Replicate;\n  freq NumberHits;\n  var patient_total;\n  output out=work.clust_means mean=clust_mean;\nrun;\ndata work.clust_dist;\n  set work.clust_means;\n  where Replicate >= 1;\nrun;\nproc univariate data=work.clust_dist noprint;\n  var clust_mean;\n  output out=work.clust_ci\n         std=clust_se\n         pctlpts=2.5 97.5\n         pctlpre=cpct_;\nrun;\nproc print data=work.clust_ci;\n  title \"Cluster bootstrap 95% CI for mean per-patient cost (USD)\";\nrun;\n/* cpct_2_5 and cpct_97_5 are the percentile CI bounds.\n   This correctly handles within-patient claim correlation.                       */",
        "description": "Cluster bootstrap in SAS using PROC SURVEYSELECT METHOD=URS (unrestricted random sampling\nwith replacement) to draw patient-level replicates, followed by a PROC MEANS BY-replicate\nanalysis to compute the statistic in each replicate. A PROC UNIVARIATE step reads the\ndistribution of replicate means to extract the percentile CI. Covers both standard and\ncluster bootstrap patterns.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  A[Original data<br/>n observations] --> B[Draw B resamples<br/>with replacement]\n  B --> C1[Resample 1 → stat_1]\n  B --> C2[Resample 2 → stat_2]\n  B --> C3[... → ...]\n  B --> CB[Resample B → stat_B]\n  C1 --> D[Bootstrap distribution<br/>of stat_1 ... stat_B]\n  C2 --> D\n  C3 --> D\n  CB --> D\n  D --> E1[Bootstrap SE<br/>= SD of replicates]\n  D --> E2[Percentile CI<br/>= 2.5th, 97.5th pct]\n  D --> E3[BCa CI<br/>= adjusted percentiles]\n  A --> F{Clustered data?}\n  F -->|Yes - claims or EHR| G[Resample PATIENTS<br/>not rows<br/>= Cluster bootstrap]\n  F -->|No - already patient-level| B\n  G --> B",
        "caption": "Bootstrap workflow: draw B resamples with replacement, compute the statistic in each, and read the SE and CI from the empirical distribution of replicates. For claims or EHR data, the cluster bootstrap resamples whole patients to preserve within-patient correlation.",
        "alt_text": "Flowchart showing the bootstrap procedure from original data through B resamples to a bootstrap distribution, branching to SE, percentile CI, and BCa CI. A parallel path shows the cluster bootstrap decision for correlated data.",
        "source_type": "illustrative",
        "source_citations": [
          "efron-1979"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "requires",
        "target_slug": "inferential-statistics-foundations",
        "notes": "The bootstrap addresses the same inferential goals as parametric SE formulas and t-tests but relaxes distributional assumptions; understanding what a SE and CI mean — and the repeated-sampling interpretation — is prerequisite before choosing between analytic and bootstrap methods."
      },
      {
        "relation_type": "used_with",
        "target_slug": "healthcare-costs-pppm-pppy-pmpm",
        "notes": "Mean per-patient-per-month or per-year costs are the canonical HEOR application for the bootstrap; the cluster bootstrap by patient_id is the standard SE method when costs are built from multiple claim rows per patient."
      },
      {
        "relation_type": "used_with",
        "target_slug": "icer-net-monetary-benefit-rwe",
        "notes": "Bootstrap resampling generates the joint sampling distribution of incremental costs and effects on the cost-effectiveness plane, supporting the ICER confidence ellipse and cost-effectiveness acceptability curves required for HTA submissions."
      },
      {
        "relation_type": "see_also",
        "target_slug": "gamma-distribution",
        "notes": "The gamma GLM is the model-based alternative for inference on mean costs; it is more efficient than the bootstrap when the distributional assumption is approximately correct and accommodates covariate adjustment naturally, making it the preferred primary analysis in many cost studies with the bootstrap reserved for sensitivity checks."
      },
      {
        "relation_type": "see_also",
        "target_slug": "propensity-score-methods-psm-iptw",
        "notes": "Propensity-score matching and IPTW analyses require the full-pipeline cluster bootstrap for valid SE estimation; analytic SEs that ignore the PS estimation step are anti-conservative, making the bootstrap a required companion to these causal methods."
      }
    ],
    "aliases": [
      "bootstrap CI",
      "bootstrap standard error",
      "percentile bootstrap",
      "BCa confidence interval",
      "cluster bootstrap",
      "resampling methods",
      "permutation test"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "borrowing-historical-controls-bayesian-rwe",
    "name": "Bayesian Borrowing from Historical / External Controls",
    "short_definition": "A Bayesian approach that constructs an informative prior on the concurrent control arm from historical or real-world external-control data, discounting the borrowed information by its commensurability with the current trial so that the strength of borrowing is data-driven rather than all-or-nothing.",
    "long_description": "**Bayesian dynamic borrowing** augments the control arm of a current study with information from one or more *historical\nor real-world external controls* by encoding that external information as an informative prior, then letting the data\ndecide how much weight that prior actually carries. The shared parameter is the control-arm outcome (e.g., the control\nevent rate, mean, or hazard); the treatment-arm parameter is left vague. The result is a posterior on the treatment\ncontrast that \"rents\" external control information when the external and concurrent controls agree and \"returns\" it when\nthey disagree. This is the practical answer to the central question Viele et al. pose — *how much should we borrow?* —\nreplacing the binary choice between naive pooling and ignoring history with a continuous, pre-specified discount.\n\n**Core conceptual / estimand distinction.** The estimand is unchanged by borrowing: it remains the comparative treatment\neffect (difference in means, risk difference/ratio, or hazard ratio) versus a control. What changes is the *prior on the\ncontrol parameter*, and the discriminating tuning quantity differs by method:\n- **Power prior** (Ibrahim & Chen): raise the historical likelihood to a power a0 in [0,1]. a0 = 1 is full pooling; a0 = 0\n  ignores history. 
a0 can be *fixed* (a deterministic discount, e.g., 0.5) or *modeled* (a hyperprior on a0 — the\n  \"normalized/modified\" power prior — letting the data move it).\n- **Commensurate prior** (Hobbs et al.): tie the current control parameter to the historical one through a\n  *commensurability* variance; a large estimated commensurability variance automatically down-weights borrowing under\n  conflict.\n- **Meta-analytic-predictive (MAP) prior** (Neuenschwander et al.): treat each historical control as an exchangeable\n  draw from a hierarchical model with between-trial heterogeneity tau, then use the *predictive* distribution for the\n  new control as the prior. Larger tau ⇒ less borrowing.\n- **Robust MAP** (Schmidli et al.): mix the MAP prior with a vague (unit-information) component given weight w (commonly\n  0.1–0.2). The vague component is an insurance policy: under prior-data conflict the posterior shifts toward it, capping\n  the damage from optimistic borrowing.\n\nThe common currency across all four is **prior effective sample size (ESS)** — how many \"extra control patients\" the prior\nis worth. ESS is the number to pre-specify, justify to regulators, and report; an ESS of, say, 40 historical controls in\na 60-patient concurrent control arm is a very different claim than an ESS of 5.\n\n**Pros, cons, and trade-offs.**\n- **vs naive pooling of historical and concurrent controls (Pocock combination):** Pooling assumes exact exchangeability;\n  any drift in standard of care, case mix, or outcome ascertainment biases the control estimate and inflates type-I error.\n  Dynamic borrowing keeps the upside (smaller control arm, more patients randomized to the experimental drug) while\n  *adaptively* discounting under conflict. 
Cost: more modeling, sensitivity analysis, and operating-characteristic\n  simulation; it can still mislead if conflict is real but small relative to noise (the robust weight cannot detect what\n  the data cannot resolve).\n- **vs frequentist \"test-then-pool\":** A pre-test for historical-vs-current difference, then pool if non-significant, has\n  notoriously poor operating characteristics — the test is underpowered, so it pools precisely when it should not, and the\n  final inference ignores the model-selection step. Bayesian borrowing folds the uncertainty about commensurability into\n  one coherent posterior. **Prefer dynamic borrowing** over test-then-pool in essentially all cases.\n- **vs propensity-score / IPW external-control adjustment:** PS methods target *patient-level* exchangeability by\n  re-weighting an external cohort to the trial's covariate distribution (an identification strategy); borrowing operates at\n  the *summary-parameter* level and addresses *residual* trial-to-trial discrepancy. They are complementary: PS-adjust the\n  external control to match the trial population, *then* borrow the adjusted control estimate with a discount. Borrowing is\n  not a substitute for confounding control.\n- **Fixed vs modeled a0 / robust vs non-robust:** Fixed a0 and non-robust MAP borrow aggressively and have the best power\n  *if the prior is right*; modeled a0 and robust MAP sacrifice some power for protection against prior-data conflict and far\n  better worst-case type-I error. 
For confirmatory regulatory work, the robust/dynamic versions are the defensible default.\n\n**When to use.** Rare diseases, pediatric extrapolation, and oncology settings where a fully concurrent randomized control\nis infeasible or unethical and high-quality historical/RWD controls exist; designs that *reduce* (not eliminate) the\nconcurrent control with a Bayesian augmented control; HTA submissions for ultra-rare indications where the alternative is no\ncomparative evidence at all. The FDA 2023 externally controlled trials guidance is the governing regulatory frame.\n\n**When NOT to use — and when it is actively misleading or dangerous.**\n- **Standard-of-care drift between eras.** If supportive care, monitoring intensity, or background therapy improved between\n  the historical period and the current trial, the historical control event rate is biased *in a fixed direction*. Robust\n  weights damp it but cannot fully correct a systematic shift, and a single historical control gives the heterogeneity\n  model no information to estimate tau — borrowing then becomes near-pooling of a biased control. Do **not** borrow across a\n  known practice change without an explicit, conservative discount and a tipping-point sensitivity analysis.\n- **Differential outcome ascertainment.** Historical/RWD controls with registry-adjudicated endpoints versus a current arm\n  with claims-algorithm or central-read endpoints are not measuring the same thing. Borrowing imports measurement bias.\n- **Case-mix / eligibility shift.** Newer cohorts caught by earlier screening, different staging, or changed referral\n  patterns differ in baseline prognosis. 
Borrow only after PS-matching or eligibility-restricting the external control to\n  the trial criteria.\n- **Differential follow-up or censoring rules.** Time-to-event borrowing across cohorts with different censoring or\n  follow-up duration distorts the control hazard.\n- **Manufacturing the answer.** Tuning a0 or the robust weight *after* seeing the trial result to reach significance is\n  indefensible; all borrowing parameters and the prior-data-conflict criterion must be locked in the SAP before unblinding,\n  with pre-specified operating characteristics (type-I error under drift, power, ESS) from simulation.\n\n**Data-source operational depth (building the external control).** Borrowing inherits every weakness of the external data\nit is built from.\n- **Claims (FFS vs MA):** A claims external-control arm requires the same continuous-enrollment, washout, and first-event\n  discipline as any claims cohort. Medicare-Advantage-only person-time lacks fee-for-service medical/pharmacy claims, so an\n  \"event-free\" external control built from MA enrollees can be missingness masquerading as a low event rate; restrict to\n  enrollees with complete A/B/D (or commercial medical+pharmacy) coverage. Endpoints defined by claims algorithms with\n  sensitivity well below 1 systematically *understate* events relative to an adjudicated trial endpoint, biasing the\n  borrowed control rate downward and the treatment effect toward harm; low PPV errs in the opposite direction by counting\n  false events. Pre-period for covariate and washout measurement must mirror the trial's baseline window.\n- **EHR:** Outcomes are encounter-driven; a patient who leaves the system looks event-free. Structured fields are sparse,\n  so severity/case-mix matching to trial eligibility is harder, yet matching is exactly what protects borrowing from\n  case-mix conflict. 
Prefer linkage to claims to firm up follow-up and capture out-of-system events.\n- **Registry:** Strongest for indication, stage/severity, and adjudicated endpoints — the best raw material for an external\n  control — but completeness and enrollment selection vary, and the registry era may predate current standard of care\n  (drift). Link to a death index to firm up survival endpoints.\n- **Linked claims–EHR–registry:** The ideal substrate (severity + completeness + adjudicated outcomes + mortality), but\n  linkage selection (only the linkable subset) and date discrepancies across order/fill/service dates must be reconciled\n  before defining the index date and follow-up that feed the borrowed summary.\n\n**Worked claims-style example.** Confirmatory single-arm trial of a new agent in a rare cancer, n = 70 treated, with no\nconcurrent control; the sponsor proposes a Bayesian external control from a linked claims–registry database (FDA 2023\nguidance frame). (1) **Eligibility-match the external control to the protocol:** age, stage, prior-line, ECOG proxy, and\n365 days of continuous A/B/D enrollment before `index_date` (first systemic-therapy claim meeting the trial's inclusion\nwindow), excluding MA-only person-time so absence of events is observed, not missing. (2) **Harmonize the endpoint:** define\nthe external overall-survival/progression endpoint with a validated claims algorithm and report its PPV/sensitivity; if it\ndiffers from the trial's adjudicated endpoint, pre-specify a bias adjustment. (3) **PS-adjust** the external cohort to the\ntrial covariate distribution (overlap weighting on the baseline window). (4) **Fit a MAP prior** on the adjusted external\ncontrol survival parameter; because only one external source exists, fix between-trial heterogeneity tau at a conservative\nvalue from a comparable disease area rather than estimating it from a single study. 
(5) **Robustify** the MAP with a 20%\nvague mixture component and compute prior **ESS** (target ESS ≈ 25–35 controls — a fraction of the 70 treated). (6)\n**Pre-specify the prior-data-conflict rule** (e.g., posterior weight on the vague component, or a Bayesian p-value\ncomparing observed to predicted control survival). (7) Compute the **posterior on the survival contrast** and run a\n**tipping-point sensitivity analysis** over a0/robust-weight and over a plausible standard-of-care drift adjustment to show\nthe conclusion is not an artifact of optimistic borrowing.\n\n**Interpreting the output**\n\nFrom the worked example: historical arm n = 100, r = 28 (28%); power prior a0 = 0.4; ESS = 40\nborrowed. Current trial controls n = 20, r = 7 (35%). Blended posterior control rate ≈ (11.2 + 7) /\n(40 + 20) = 18.2 / 60 ≈ 30.3%. Effective control arm size ≈ 60 (40 borrowed + 20 enrolled).\n\n*(1) Formal interpretation.* The posterior control rate ≈ 30.3% reflects a weighted compromise between\nthe historical rate (28%) and the current trial rate (35%), with the weights determined by the power\nprior discount (a0 = 0.4) and sample sizes. The posterior is not a frequentist estimate — it is the\nprobability distribution over the control rate given both data sources and the specified prior.\nCritically, the ESS of 40 borrowed controls is itself conditional on a0 = 0.4; halving a0 to 0.2\nwould halve the borrowed ESS and shift the posterior toward the current data. 
The trial needed only 20\nenrolled controls instead of the ≈ 60 that would give equivalent precision without borrowing,\nbut that sample-size reduction is only valid if the historical and current populations are genuinely\nexchangeable on all outcome-relevant dimensions.\n\n*(2) Practical interpretation.* If the historical 28% response rate reflected older, less-fit patients\nthan the current trial enrolled, borrowing drags the blended control estimate downward\nfrom the observed 35%, artificially inflating the apparent treatment benefit. A decision-maker\nreviewing a Bayesian-borrowing trial should always examine the prior-data-conflict diagnostic: if the\ncurrent and historical control rates are clearly discordant (a gap such as 35% vs 28% is still within\nsampling noise at n = 20), a robust MAP's vague mixture component should have absorbed much of the\nborrowed weight, and the tipping-point sensitivity over a0 should show the conclusion holds even at\nminimal borrowing.",
    "primary_category": "Causal_Inference_Method",
    "tags": [
      "bayesian-dynamic-borrowing",
      "power-prior",
      "commensurate-prior",
      "meta-analytic-predictive-prior",
      "robust-map-prior",
      "external-control",
      "prior-effective-sample-size",
      "rare-disease",
      "single-arm-trial"
    ],
    "applies_to_study_types": [
      "single_arm_external_control",
      "rare_disease_study",
      "registry_linkage",
      "comparative_effectiveness",
      "target_trial_emulation"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1016/0021-9681(76)90044-8",
        "url": "https://doi.org/10.1016/0021-9681(76)90044-8",
        "citation_text": "Pocock SJ. The combination of randomized and historical controls in clinical trials. Journal of Chronic Diseases. 1976;29(3):175-188.",
        "year": 1976,
        "authors_short": "Pocock",
        "notes": "Foundational statement of the acceptability conditions for combining historical with concurrent controls and the bias that arises when they are not met."
      },
      {
        "role": "introduce",
        "doi": "10.1214/ss/1009212673",
        "url": "https://doi.org/10.1214/ss/1009212673",
        "citation_text": "Ibrahim JG, Chen M-H. Power prior distributions for regression models. Statistical Science. 2000;15(1):46-60.",
        "year": 2000,
        "authors_short": "Ibrahim & Chen",
        "notes": "Original derivation of the power prior; defines the discount parameter a0 that underlies most dynamic-borrowing methods."
      },
      {
        "role": "explain",
        "doi": "10.1177/1740774509356002",
        "url": "https://doi.org/10.1177/1740774509356002",
        "citation_text": "Neuenschwander B, Capkun-Niggli G, Branson M, Spiegelhalter DJ. Summarizing historical information on controls in clinical trials. Clinical Trials. 2010;7(1):5-18.",
        "year": 2010,
        "authors_short": "Neuenschwander et al.",
        "notes": "Introduces the meta-analytic-predictive (MAP) prior, framing borrowing as a hierarchical model with explicit between-trial heterogeneity."
      },
      {
        "role": "explain",
        "doi": "10.1111/j.1541-0420.2011.01564.x",
        "url": "https://doi.org/10.1111/j.1541-0420.2011.01564.x",
        "citation_text": "Hobbs BP, Carlin BP, Mandrekar SJ, Sargent DJ. Hierarchical commensurate and power prior models for adaptive incorporation of historical information in clinical trials. Biometrics. 2011;67(3):1047-1056.",
        "year": 2011,
        "authors_short": "Hobbs et al.",
        "notes": "Commensurate prior; ties the current and historical control parameters through a commensurability variance that adaptively governs the strength of borrowing."
      },
      {
        "role": "explain",
        "doi": "10.1002/pst.1589",
        "url": "https://doi.org/10.1002/pst.1589",
        "citation_text": "Viele K, Berry S, Neuenschwander B, et al. Use of historical control data for assessing treatment effects in clinical trials. Pharmaceutical Statistics. 2014;13(1):41-54.",
        "year": 2014,
        "authors_short": "Viele et al.",
        "notes": "Practical synthesis of the \"how much to borrow\" problem; compares pooling, power, and hierarchical priors and their operating characteristics under prior-data conflict."
      },
      {
        "role": "demonstrate",
        "doi": "10.1111/biom.12242",
        "url": "https://doi.org/10.1111/biom.12242",
        "citation_text": "Schmidli H, Gsteiger S, Roychoudhury S, O'Hagan A, Spiegelhalter D, Neuenschwander B. Robust meta-analytic-predictive priors in clinical trials with historical control information. Biometrics. 2014;70(4):1023-1032.",
        "year": 2014,
        "authors_short": "Schmidli et al.",
        "notes": "Robust MAP prior (vague mixture component) with prior-data-conflict protection; the basis of the RBesT package used in the R implementation below."
      },
      {
        "role": "use",
        "doi": null,
        "url": "https://www.fda.gov/regulatory-information/search-fda-guidance-documents/considerations-design-and-conduct-externally-controlled-trials-drug-and-biological-products",
        "citation_text": "FDA. Considerations for the Design and Conduct of Externally Controlled Trials for Drug and Biological Products. Draft Guidance for Industry. 2023.",
        "year": 2023,
        "authors_short": "FDA",
        "notes": "Current regulatory expectations for externally controlled designs, including comparability of populations, endpoints, and the discipline required around borrowed external-control information."
      }
    ],
    "plain_language_summary": "When a new drug is tested in a rare disease, there are often too few patients to run a full randomized trial with its own control group. Bayesian borrowing solves this by using data from a past group of similar untreated patients as a stand-in control, then applying a discount so that if the past group turns out to look different from today's patients, the borrowed information counts for less. The key insight is that instead of a binary choice between using old data fully or ignoring it entirely, you get a sliding scale: the more similar the historical and current patients look, the more you borrow; the less similar, the less you borrow.",
    "key_terms": [
      {
        "term": "prior",
        "definition": "In Bayesian analysis, a prior is your starting belief about a number (like a disease event rate) before you see the current study data; borrowing from historical controls means encoding that history as a prior."
      },
      {
        "term": "borrowing",
        "definition": "Using summary information from a past or external group of patients to supplement the control arm of a current study, rather than collecting all that information again from scratch."
      },
      {
        "term": "power prior",
        "definition": "A specific borrowing technique that discounts the historical control data by raising its statistical contribution to a fraction called a0, where a0 = 1 means full use and a0 = 0 means ignore it entirely."
      },
      {
        "term": "effective sample size (ESS)",
        "definition": "The number of additional current-study control patients that the borrowed historical information is worth; an ESS of 30 means the prior is contributing as much information as 30 newly enrolled controls."
      },
      {
        "term": "prior-data conflict",
        "definition": "The situation where the borrowed historical control group behaves differently from the current control group, signaling that the historical patients may not be comparable and the discount should increase."
      }
    ],
    "worked_example": {
      "scenario": "A sponsor runs a single-arm trial of a new drug in a rare blood cancer: 60 patients receive the drug, and there is no concurrent randomized control group. The sponsor wants to compare the new drug's response rate against a historical control group drawn from a registry. The historical group had 100 patients; 28 of them responded (28%). The current trial enrolls 20 control-arm patients as well (small, to conserve enrollment for the treated arm) and observes 7 responders (35%). The sponsor applies a fixed power prior with a discount weight a0 = 0.4, meaning only 40% of the historical data counts. The question is: how many effective control patients does that borrowed history contribute, and what does the combined control estimate look like?",
      "dataset": {
        "caption": "Summary counts feeding the borrowing calculation. Each row represents one data source, not one patient.",
        "columns": [
          "source",
          "patients_n",
          "responders_r",
          "response_rate",
          "role"
        ],
        "rows": [
          [
            "historical_registry",
            100,
            28,
            "28%",
            "borrowed (discounted by a0)"
          ],
          [
            "current_trial_control",
            20,
            7,
            "35%",
            "full weight (a0 = 1.0)"
          ]
        ]
      },
      "steps": [
        "Step 1 - Calculate borrowed effective sample size: multiply the historical group size by the discount weight. ESS = 100 x 0.4 = 40. The prior counts as 40 phantom control patients.",
        "Step 2 - Calculate borrowed events: apply the same discount to the historical event count. Borrowed events = 28 x 0.4 = 11.2.",
        "Step 3 - Pool the borrowed information with the current control data: combined pseudo-events = 11.2 + 7 = 18.2; combined pseudo-patients = 40 + 20 = 60.",
        "Step 4 - Compute the posterior (blended) control response rate: 18.2 / 60 = 0.303, or about 30%.",
        "Step 5 - Interpret the result: the blended rate (30%) sits between the historical rate (28%) and the current small-sample rate (35%), pulled toward the historical data in proportion to how much was borrowed. Because the two rates were close, borrowing is reasonable; if they had differed by 15 percentage points or more, the sponsor would need a more conservative discount or a robust design."
      ],
      "result": "With a0 = 0.4, the 100-patient historical registry contributes an ESS of 40 phantom controls. Combined with 20 actual trial controls, the effective control arm is 60 patients (40 borrowed + 20 real), yielding a blended response rate of 18.2 / 60 = 0.303 (30%). The trial needed only 20 enrolled controls instead of the ~60 that would have been required without borrowing, but this efficiency comes with a risk: if the historical 28% rate reflects older, less-fit patients while the current trial enrolled healthier patients (hence the observed 35%), the borrowed prior drags the control estimate downward and makes the new drug look better than it truly is."
    },
    "prerequisites": [
      "single-arm-external-control",
      "rare-disease-external-controls-rwe",
      "propensity-score-methods-psm-iptw"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Fixed-weight power prior",
        "description": "Raise the historical-control likelihood to a fixed, pre-specified power a0 in [0,1] (e.g., 0.5), a deterministic discount chosen and justified before unblinding.",
        "edge_cases": [
          "Fixed a0 cannot adapt to prior-data conflict; if the historical control is biased, the discount must be set conservatively a priori and stress-tested by sensitivity analysis.",
          "Choosing a0 to hit a target prior ESS is the defensible way to set it, but the ESS-to-a0 map depends on the historical sample size and must be recomputed per study."
        ],
        "data_source_notes": "claims/registry: a0 should reflect confidence that the external endpoint definition and case mix match the trial; lower a0 when endpoint PPV or era drift is uncertain."
      },
      {
        "name": "Modeled (normalized) power prior / commensurate prior",
        "description": "Place a hyperprior on a0 (or on a commensurability variance) so the data govern the borrowing strength; under conflict the posterior down-weights the historical likelihood automatically.",
        "edge_cases": [
          "The normalized power prior is required for a coherent joint model when a0 is random; the unnormalized version is not a proper joint density.",
          "With a single historical study, the commensurability/heterogeneity parameter is weakly identified and must be given an informative hyperprior."
        ],
        "data_source_notes": "registry/linked: best when multiple external sources exist so the model can estimate heterogeneity rather than relying on a fixed value."
      },
      {
        "name": "(Robust) meta-analytic-predictive prior",
        "description": "Fit a hierarchical model across multiple historical controls, use the predictive distribution for the new control as the prior, and (robust version) mix it with a vague component of weight w to cap prior-data conflict.",
        "edge_cases": [
          "Requires several exchangeable historical controls to estimate between-trial heterogeneity tau; a single source forces a fixed/informative tau.",
          "The robust weight w (commonly 0.1-0.2) trades power for worst-case protection; report operating characteristics across a grid of w."
        ],
        "data_source_notes": "linked claims-EHR-registry: assemble multiple comparable external cohorts (e.g., by era or site) as the exchangeable units feeding the MAP model."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Naive pooling of historical and concurrent controls (Pocock combination)",
        "pros_of_this": "Adaptively discounts the historical control under prior-data conflict, preserving smaller-control-arm efficiency while controlling the type-I-error inflation that fixed pooling causes under drift.",
        "cons_of_this": "More modeling, simulation of operating characteristics, and sensitivity analysis; cannot correct a systematic bias that is small relative to sampling noise.",
        "when_to_prefer": "Whenever the exchangeability of historical and concurrent controls is plausible but not certain - i.e., essentially all real applications."
      },
      {
        "compared_to": "Frequentist test-then-pool",
        "pros_of_this": "Folds commensurability uncertainty into one coherent posterior; avoids the poor operating characteristics of an underpowered pre-test that pools when it should not and ignores the model-selection step.",
        "cons_of_this": "Requires a fully specified Bayesian model and prior, and pre-registered operating characteristics from simulation.",
        "when_to_prefer": "Essentially always preferred over test-then-pool for incorporating historical controls."
      },
      {
        "compared_to": "Propensity-score / IPW external-control adjustment",
        "pros_of_this": "Addresses residual trial-to-trial discrepancy at the summary-parameter level and yields a coherent posterior on the contrast with an interpretable prior ESS.",
        "cons_of_this": "Does not control patient-level confounding by itself; an unadjusted external control will be borrowed complete with its confounding.",
        "when_to_prefer": "As a complement after PS/IPW adjustment of the external cohort - adjust first, then borrow the adjusted estimate with a discount."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Build the external control with continuous-enrollment, washout, and first-event discipline; exclude MA-only person-time so event-free time is observed, not missing. Report endpoint-algorithm PPV/sensitivity - claims algorithms that miss events bias the borrowed control rate and the contrast. Measure baseline covariates in a window mirroring the trial's.",
      "ehr": "Outcomes are encounter-driven and out-of-system events are missed; prefer linkage to claims for follow-up. Sparse structured fields make eligibility/case-mix matching to the trial harder, which is exactly the matching that protects borrowing from case-mix conflict.",
      "registry": "Strongest raw material (adjudicated endpoints, stage/severity), but watch enrollment selection and era drift relative to current standard of care; link to a death index for survival endpoints.",
      "linked": "Ideal substrate (severity + completeness + adjudicated outcomes + mortality) but introduces linkage selection and order/fill/service date discrepancies that must be reconciled before defining index date and follow-up for the borrowed summary."
    },
    "implementations": [
      {
        "lang": "r",
        "code": "library(RBesT)\n\n# Aggregate historical/external control cohorts (binary endpoint).\nhist <- data.frame(\n  study = c(\"registry_A\", \"registry_B\", \"claims_C\"),\n  n     = c(120L, 95L, 140L),\n  r     = c(34L,  21L,  41L)   # events among external controls\n)\n\n# (1) Hierarchical MAP across cohorts; tau prior encodes between-trial heterogeneity.\nset.seed(1)\nmap_mc <- gMAP(cbind(r, n - r) ~ 1 | study, data = hist,\n               family = binomial,\n               tau.dist = \"HalfNormal\", tau.prior = 0.5,   # heterogeneity scale (logit)\n               beta.prior = 2)\n\n# (2) Parametric Beta-mixture approximation of the MAP prior.\nmap_prior <- automixfit(map_mc)\n\n# (3) Robustify: add a weakly-informative unit-information component (insurance against conflict).\nrob_prior <- robustify(map_prior, weight = 0.2, mean = 0.5)  # 20% vague weight\n\n# (4) Prior effective sample size = how many \"extra control patients\" we are borrowing.\ncat(\"Prior ESS (robust MAP):\", round(ess(rob_prior), 1), \"control patients\\n\")\n\n# (5) Update with the CURRENT trial control arm (e.g., 18 events / 30 controls).\npost_ctrl <- postmix(rob_prior, r = 18, n = 30)\n\n# Treatment arm: vague prior updated with current treated arm (e.g., 9 events / 70 treated).\npost_trt  <- postmix(mixbeta(c(1, 0.5, 0.5)), r = 9, n = 70)\n\n# Posterior on the risk difference (treated - control) by Monte Carlo.\ndraws_c <- rmix(post_ctrl, 1e5); draws_t <- rmix(post_trt, 1e5)\nrd <- draws_t - draws_c\ncat(sprintf(\"Posterior risk difference: %.3f (95%% CrI %.3f, %.3f)\\n\",\n            mean(rd), quantile(rd, .025), quantile(rd, .975)))\ncat(\"P(treated rate < control rate):\", mean(rd < 0), \"\\n\")",
        "description": "Robust MAP prior for a binary control endpoint using RBesT (Schmidli et al.'s package), the canonical implementation.\nRequired input - one row per historical/external control cohort with aggregate counts:\n  hist : study (chr), n (int, control patients), r (int, events)   # e.g., eligibility-matched external claims/registry cohorts\nSteps: (1) gMAP fits the hierarchical MAP across cohorts; (2) automix approximates the MAP as a Beta mixture;\n(3) robustify adds a vague component (weight 0.2); (4) ess reports prior effective sample size; (5) postmix combines\nthe robust prior with the CURRENT control likelihood (events/n) to get the posterior control rate. The treatment-arm\nposterior is fit separately (vague prior) and the contrast is formed by Monte Carlo draws.",
        "dependencies": [
          "RBesT"
        ],
        "source_citations": [
          "schmidli-2014",
          "neuenschwander-2010"
        ],
        "notes": ""
      },
      {
        "lang": "python",
        "code": "import numpy as np\nimport pymc as pm\nimport pytensor.tensor as pt\nimport arviz as az\n\nhist_n, hist_r = 355, 96       # pooled external-control patients / events\ncur_ctrl_n, cur_ctrl_r = 30, 18\ncur_trt_n,  cur_trt_r  = 70, 9\n\nwith pm.Model() as m:\n    # Shared control event probability; vague treated probability. Beta(1,1) base prior on p_ctrl.\n    p_ctrl = pm.Beta(\"p_ctrl\", 1.0, 1.0)\n    p_trt  = pm.Beta(\"p_trt\", 1.0, 1.0)\n\n    # Borrowing weight a0 in [0,1], modeled so the data govern borrowing.\n    a0 = pm.Beta(\"a0\", 1.0, 1.0)\n\n    # Historical control binomial log-likelihood raised to power a0.\n    hist_loglik = pm.logp(pm.Binomial.dist(n=hist_n, p=p_ctrl), hist_r)\n    pm.Potential(\"power_prior\", a0 * hist_loglik)\n\n    # NORMALIZED power prior: subtract log of the Beta normalizing constant C(a0) so the joint is proper.\n    # With Beta(1,1) on p_ctrl, the a0-scaled posterior kernel integrates to B(a0*r+1, a0*(n-r)+1).\n    log_norm = (pt.gammaln(a0 * hist_r + 1.0)\n                + pt.gammaln(a0 * (hist_n - hist_r) + 1.0)\n                - pt.gammaln(a0 * hist_n + 2.0))\n    pm.Potential(\"power_prior_norm\", -log_norm)\n\n    # Current data: control and treated arms (full weight).\n    pm.Binomial(\"cur_ctrl\", n=cur_ctrl_n, p=p_ctrl, observed=cur_ctrl_r)\n    pm.Binomial(\"cur_trt\",  n=cur_trt_n,  p=p_trt,  observed=cur_trt_r)\n\n    rd = pm.Deterministic(\"risk_diff\", p_trt - p_ctrl)\n    idata = pm.sample(2000, tune=2000, target_accept=0.95, random_seed=1)\n\na0_mean = float(idata.posterior[\"a0\"].mean())\nprint(f\"Posterior mean a0 (realized borrowing): {a0_mean:.2f}\")\nprint(f\"Approx borrowed ESS: {a0_mean * hist_n:.0f} external controls\")\nprint(az.summary(idata, var_names=[\"p_ctrl\", \"p_trt\", \"risk_diff\", \"a0\"], hdi_prob=0.95))\nrd_draws = idata.posterior[\"risk_diff\"].values.ravel()\nprint(f\"P(treated rate < control rate): {(rd_draws < 0).mean():.3f}\")",
        "description": "Power prior for a binary control endpoint in PyMC, with the discount a0 as a modeled hyperparameter (normalized power\nprior) so the data govern borrowing. Inputs are aggregate counts:\n  hist_n, hist_r  : external-control patients and events (e.g., eligibility-matched, PS-adjusted claims/registry cohort)\n  cur_ctrl_n/r    : current trial control arm\n  cur_trt_n/r     : current trial treated arm\nThe historical control contributes a Binomial log-likelihood scaled by a0. Because a0 is random, the joint must use the\nNORMALIZED power prior: subtract the (closed-form, Beta) normalizing constant so the density stays proper - omitting it\nis the classic unnormalized-power-prior error. Report the posterior on a0 (realized borrowing), an ESS proxy, and the\nposterior risk difference.",
        "dependencies": [
          "pymc",
          "pytensor",
          "numpy",
          "arviz"
        ],
        "source_citations": [
          "ibrahim-chen-2000",
          "viele-2014"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* One-row aggregate input; logits keep p in (0,1). */\ndata counts;\n  input hist_n hist_r cur_ctrl_n cur_ctrl_r cur_trt_n cur_trt_r;\n  datalines;\n355 96 30 18 70 9\n;\nrun;\n\nproc mcmc data=counts nbi=5000 nmc=50000 seed=1 monitor=(p_ctrl p_trt rd a0 ess);\n  /* Control & treated event probabilities (logit-parameterized), and logit of borrowing weight a0. */\n  parms b_ctrl 0  b_trt 0  la0 0;\n  prior b_ctrl ~ normal(0, var=10);\n  prior b_trt  ~ normal(0, var=10);\n  prior la0    ~ normal(0, var=2);            /* logit of a0 -> a0 in (0,1) */\n\n  a0     = logistic(la0);\n  p_ctrl = logistic(b_ctrl);\n  p_trt  = logistic(b_trt);\n\n  /* Normalized power prior on the historical control:\n     a0 * binomial log-likelihood  MINUS  log of the Beta normalizing constant C(a0). */\n  llh = lgamma(hist_n+1) - lgamma(hist_r+1) - lgamma(hist_n-hist_r+1)\n        + hist_r*log(p_ctrl) + (hist_n-hist_r)*log(1-p_ctrl);\n  lognorm = lgamma(a0*hist_r + 1) + lgamma(a0*(hist_n-hist_r) + 1) - lgamma(a0*hist_n + 2);\n  model general(a0*llh - lognorm);            /* attaches the discounted, normalized historical info to the joint */\n\n  /* Current trial arms at full weight. */\n  model cur_ctrl_r ~ binomial(n=cur_ctrl_n, p=p_ctrl);\n  model cur_trt_r  ~ binomial(n=cur_trt_n,  p=p_trt);\n\n  rd  = p_trt - p_ctrl;                       /* posterior risk difference (treated - control) */\n  ess = a0*hist_n;                            /* borrowed effective sample size (controls)      */\nrun;",
        "description": "Power prior for a binary control endpoint in PROC MCMC, with a0 fitted as a hyperparameter (data-driven borrowing).\nRequired input dataset (one row, aggregate counts; expand to one record per arm if preferred):\n  work.counts : hist_n hist_r cur_ctrl_n cur_ctrl_r cur_trt_n cur_trt_r\nThe historical control contributes a binomial log-likelihood scaled by a0, injected into the joint via a single\nMODEL GENERAL() term; the current arms enter at full weight. Because a0 is random, the GENERAL() term includes the\nclosed-form Beta log normalizing constant (the normalized power prior) - a duplicate PRIOR statement on one parameter\nwould error and the unnormalized form would be an improper joint. Monitor a0 (realized borrowing) and the risk\ndifference. Use PROC BGLIMM instead for the hierarchical (MAP) form when several exchangeable external cohorts exist.",
        "dependencies": [],
        "source_citations": [
          "ibrahim-chen-2000",
          "viele-2014"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Hist[Historical / external controls<br/>claims, registry, linked RWD] --> Match[Eligibility-match + PS-adjust<br/>to trial population]\n  Match --> Prior[Build informative control prior<br/>power / commensurate / MAP]\n  Prior --> Robust[Robustify: add vague mixture component<br/>report prior ESS]\n  Robust --> Conflict{Prior-data conflict?<br/>observed vs predicted control}\n  Conflict -- Agree --> Borrow[Posterior borrows strongly<br/>effective control arm enlarged]\n  Conflict -- Disagree --> Discount[Robust weight dominates<br/>borrowing automatically down-weighted]\n  Borrow --> Post[Posterior on treatment contrast]\n  Discount --> Post\n  Post --> Sens[Sensitivity: a0 / robust weight grid,<br/>SoC drift, tipping point]",
        "caption": "Bayesian dynamic-borrowing decision logic. The robust mixture and the prior-data-conflict check are what convert borrowing from all-or-nothing pooling into a data-adaptive discount.",
        "alt_text": "Flowchart from historical/external controls through eligibility matching and prior construction to a prior-data conflict check that either borrows strongly or down-weights, then a posterior on the contrast and sensitivity analyses.",
        "source_type": "illustrative",
        "source_citations": [
          "schmidli-2014",
          "viele-2014"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  subgraph Historical\n    H1[Control study 1<br/>n1, events1]\n    H2[Control study 2<br/>n2, events2]\n    H3[Control study k<br/>nk, eventsk]\n  end\n  H1 --> MAP[Hierarchical MAP model<br/>between-trial heterogeneity tau]\n  H2 --> MAP\n  H3 --> MAP\n  MAP --> Pred[Predictive prior for NEW control<br/>+ prior ESS]\n  Pred --> Comb((Combine))\n  Cur[Current control arm likelihood<br/>events / n] --> Comb\n  Comb --> PostC[Posterior on control parameter]\n  Trt[Current treated arm likelihood] --> Contrast[Posterior on treatment contrast]\n  PostC --> Contrast",
        "caption": "Meta-analytic-predictive construction. Multiple exchangeable historical controls feed a hierarchical model whose predictive distribution becomes the prior for the current control; larger heterogeneity tau yields a wider prior and less borrowing.",
        "alt_text": "Schematic of several historical control studies feeding a hierarchical MAP model that produces a predictive prior for the new control, combined with the current control and treated likelihoods to form the posterior contrast.",
        "source_type": "illustrative",
        "source_citations": [
          "neuenschwander-2010"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "used_with",
        "target_slug": "single-arm-external-control",
        "notes": "Bayesian borrowing is the analytic engine that lets a single-arm trial be augmented by, or compared against, an external control with a quantified, discounted strength of borrowing."
      },
      {
        "relation_type": "used_with",
        "target_slug": "rare-disease-external-controls-rwe",
        "notes": "Rare-disease programs are the primary setting where dynamic borrowing is justified, because a fully concurrent randomized control is often infeasible."
      },
      {
        "relation_type": "used_with",
        "target_slug": "propensity-score-methods-psm-iptw",
        "notes": "PS/IPW adjust the external control to the trial population first (patient-level exchangeability); borrowing then discounts the adjusted summary for residual trial-to-trial discrepancy."
      },
      {
        "relation_type": "see_also",
        "target_slug": "generalizability-transportability-external-validity-rwe",
        "notes": "Borrowing assumes the external control is transportable to the trial population; case-mix and era differences are transportability failures that bias borrowed estimates."
      },
      {
        "relation_type": "see_also",
        "target_slug": "tipping-point-analysis-rwe",
        "notes": "A tipping-point analysis over the borrowing weight and a plausible standard-of-care drift adjustment is the standard way to show a borrowed conclusion is robust."
      },
      {
        "relation_type": "see_also",
        "target_slug": "negative-control-outcomes-rwe",
        "notes": "Negative-control outcomes can detect systematic differences between the external and concurrent control sources before borrowing is applied."
      },
      {
        "relation_type": "is_variant_of",
        "target_slug": "special-populations-rwe-methods",
        "notes": "Part of the special-populations method family, where conventional concurrent controls are frequently unavailable."
      }
    ],
    "aliases": [
      "Bayesian dynamic borrowing",
      "power prior",
      "meta-analytic-predictive prior",
      "MAP prior",
      "robust MAP prior",
      "commensurate prior",
      "historical control borrowing"
    ],
    "complexity": "advanced",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "brier-score-calibration-rwe",
    "name": "Brier Score",
    "short_definition": "The mean squared error of probabilistic predictions for a binary outcome, BS = (1/N) * sum((p_i - y_i)^2), an overall accuracy measure that decomposes into calibration (reliability) and refinement (resolution and uncertainty) components and underlies model-performance assessment.",
    "long_description": "The **Brier score (BS)** is the **mean squared error of a probabilistic prediction**. For a binary outcome y in {0,1} and a\npredicted probability p in [0,1] over N subjects, BS = (1/N) * sum_i (p_i - y_i)^2. It ranges from 0 (perfect: every\nprediction equals the realized outcome) to 1 (worst: confident and always wrong); a non-informative model that always\npredicts the marginal prevalence q achieves BS = q*(1-q). Originating in weather forecast verification (Brier, 1950), it is\nnow the standard **overall performance** measure for clinical and RWE prediction models because, unlike discrimination\nmetrics (AUC, the c-statistic) that depend only on the *ranking* of predictions, the Brier score penalizes predictions that\nare mis-*scaled* — a model can rank perfectly (AUC 1.0) yet be badly calibrated and earn a poor Brier score.\n\n**Core conceptual distinction — the decomposition.** The Brier score's analytic value is that it splits into interpretable\nparts. Grouping predictions into K bins of (approximately) constant predicted probability, Murphy's (1973) decomposition is\nBS = **reliability** - **resolution** + **uncertainty**, where reliability = (1/N) * sum_k n_k * (p_bar_k - o_bar_k)^2 (the\nsquared gap between mean predicted probability p_bar_k and observed outcome rate o_bar_k in each bin — *lower is better*, this\nis the calibration term), resolution = (1/N) * sum_k n_k * (o_bar_k - o_bar)^2 (how far each bin's outcome rate departs from\nthe overall rate o_bar — *higher is better*, this is discrimination/sharpness), and uncertainty = o_bar*(1 - o_bar) (the\nirreducible variance of the outcome, fixed by the data). A second, equivalent split is **calibration + refinement**, where\nrefinement = uncertainty - resolution. 
The lesson: a low Brier score can be achieved either by good calibration *or* by high\nresolution, so the raw score conflates two distinct virtues — the decomposition is what lets you attribute a model's Brier\nperformance to its calibration versus its sharpness.\n\n**Scaling.** Because the achievable Brier score depends on outcome prevalence (a rare outcome makes a low absolute BS easy),\nthe raw score is not comparable across datasets with different prevalence. The **scaled Brier score** (a.k.a. **Brier skill\nscore**), BS_scaled = 1 - BS / BS_ref where BS_ref = q*(1-q) is the score of the prevalence-only model, rescales to a 0-to-1\n\"proportion of maximum achievable\" metric (1 = perfect, 0 = no better than predicting the marginal rate, negative = worse\nthan the null model). Report the scaled Brier score for cross-study comparison; report the raw score with its decomposition\nwithin a study.\n\n**Pros, cons, and trade-offs.**\n- **vs the c-statistic / AUC (`roc-auc-discrimination-rwe`):** The Brier score is a **proper scoring rule** — it is\n  minimized in expectation only by the true probabilities, so it cannot be gamed by mis-scaling and it rewards honest\n  probability estimates. AUC measures *only* discrimination (ranking) and is invariant to any monotone re-scaling of\n  predictions, so it is silent about calibration. **Prefer the Brier score** (with its decomposition) when the predicted\n  *probability itself* will be used (for thresholds, expected cost, decision curves); **prefer AUC** when only the rank\n  order matters and you want a prevalence-invariant discrimination summary. Report both — they answer different questions.\n- **vs a calibration plot / Hosmer-Lemeshow:** A calibration plot (and the calibration slope/intercept) shows *where* and\n  *how* miscalibration occurs across the probability range, information the single Brier number compresses away; the\n  reliability term of the decomposition is the scalar summary of exactly that plot. 
**Prefer the calibration plot** to\n  diagnose the shape of miscalibration; **use the Brier score / reliability term** as the one-number summary and for model\n  selection. The Hosmer-Lemeshow test is a discredited goodness-of-fit hypothesis test (power depends on N, arbitrary\n  binning); the reliability decomposition is the preferred quantitative alternative.\n- **vs the logarithmic score (log loss):** Both are proper scoring rules. Log loss penalizes confident errors far more\n  harshly (it is unbounded as p -> 0 for a positive case) and is the right loss when extreme over-confidence must be\n  punished; the Brier score is bounded, more robust to a handful of catastrophic predictions, and decomposes cleanly into\n  calibration and refinement. **Prefer the Brier score** for interpretable decomposition and robustness; **prefer log loss**\n  when calibrated tail probabilities are critical.\n\n**When to use.** Reporting the overall performance of any RWE prediction or prognostic model that outputs probabilities\n(risk of hospitalization, mortality, treatment response) alongside discrimination (AUC) and a calibration plot, as the\nmodern framework (Steyerberg et al., 2010) recommends; selecting among candidate models when the predicted probability will\nfeed a downstream decision (risk-based eligibility, expected-cost calculations, decision-curve analysis); tracking\nperformance over time or across sites with the scaled Brier score; quantifying calibration with the reliability term rather\nthan the Hosmer-Lemeshow test.\n\n**When NOT to use — and when it is actively misleading or dangerous.**\n- **Comparing raw Brier scores across datasets with different prevalence.** A rare outcome makes a low raw BS trivially\n  attainable (the null model already scores q*(1-q) ~= q for small q); comparing raw scores across populations confounds\n  model quality with base rate. 
Use the scaled Brier score and report prevalence.\n- **As a substitute for the calibration plot when the question is *where* the model fails.** The single Brier number cannot\n  tell you whether the model is over-confident at high risk and under-confident at low risk; treating a good Brier score as\n  proof of calibration across the whole range is misleading. Inspect the plot or the bin-wise reliability.\n- **For a model that will only ever be used as a rank-ordering tool.** If only the ordering of patients matters (e.g.,\n  waitlist prioritization) and no probability is acted on numerically, the calibration that the Brier score rewards is\n  irrelevant; AUC or a rank-based metric is the appropriate target and optimizing Brier may add nothing.\n- **On predictions evaluated on the same data used to fit them, without correction.** An apparent (resubstitution) Brier\n  score is optimistically biased; report a Brier score from cross-validation, bootstrap optimism correction, or an external\n  validation sample, exactly as for any other performance measure.\n\n**Data-source operational depth.**\n- **Claims (FFS):** The Brier score requires a binary outcome label that is itself a computable phenotype, so its honesty\n  inherits the phenotype's misclassification — a noisy outcome label inflates the irreducible-uncertainty term and caps the\n  achievable score. Build predictions and outcomes on FFS-observable person-time only; Medicare Advantage enrollees generate\n  no claims, so an MA-only span yields a fabricated \"no event\" label that biases both the outcome rate o_bar and the\n  reliability term. 
For time-to-event outcomes use the time-dependent (inverse-probability-of-censoring-weighted) Brier\n  score so administrative censoring does not masquerade as event-free survival.\n- **EHR:** Richer features usually improve resolution (sharper, more separated predictions), but encounter-driven\n  ascertainment makes an out-of-system event an unobserved \"negative,\" biasing the outcome label and therefore the\n  calibration term; require demonstrable in-system activity before labeling a subject event-free, and treat informative loss\n  to follow-up with censoring-weighted Brier scores.\n- **Registry / linked:** Adjudicated registry outcomes give the cleanest binary labels, tightening every term of the\n  decomposition; linked claims-EHR-vital-records additionally supplies a death index so a competing terminal event is not\n  miscoded as event-free. Linkage selects the linkable subset, so check that the validation sample's prevalence matches the\n  deployment population before trusting a scaled Brier score.\n\n**Worked example.** A logistic model predicts 1-year mortality in a linked claims-registry cohort of N = 5,000 with observed\nprevalence o_bar = 0.10 (so uncertainty = 0.10*0.90 = 0.090, and the prevalence-only null model scores BS_ref = 0.090).\nSuppose the fitted model achieves raw **BS = 0.072**. Then the **scaled Brier score = 1 - 0.072/0.090 = 0.20** — the model\nexplains 20% of the maximum achievable improvement over predicting the marginal rate, a modest but real gain. Decomposing\nacross deciles of predicted risk: if reliability = 0.004 (small squared calibration gaps) and resolution = 0.022, then BS =\nreliability - resolution + uncertainty = 0.004 - 0.022 + 0.090 = 0.072, confirming the arithmetic and attributing most of\nthe improvement to resolution (the model separates risk groups) with good calibration (low reliability term). 
The same model\nmight post AUC = 0.74; reporting AUC alone would have hidden the calibration story, and reporting raw BS alone would have\nhidden that 0.072 is only modestly better than the 0.090 null. Report all three (scaled Brier 0.20, the reliability/resolution\nsplit, and a calibration plot) for a complete performance picture.\n\n**Interpreting the output**\n\nIn the large-scale worked example, the model achieves Brier score = 0.072, scaled Brier = 0.20,\nreliability = 0.004, and resolution = 0.022, against a null Brier of 0.090.\n\n*(1) Formal interpretation.* The Brier score is the mean squared error between predicted probabilities\nand binary outcomes. A score of 0.072 is lower (better) than the null score of 0.090, which a model\npredicting the marginal event rate for everyone would achieve. The scaled Brier of 0.20 expresses this\nimprovement as a proportion of the maximum possible reduction, with 0 = no skill and 1 = perfect\ncalibration and discrimination. The Murphy decomposition separates BS into reliability (the average squared\ncalibration gap between predicted and observed rates; lower is better), resolution (the weighted spread of\nthe bin-wise observed event rates around the overall rate; higher means the model separates risk groups),\nand uncertainty (the outcome variance fixed by the data, 0.090 here). A reliability term of 0.004 indicates\npredictions are close to observed rates on average; a resolution of 0.022 indicates meaningful\nrisk-group separation. When calibration slope < 1, predictions are too extreme and shrinkage or\nrecalibration is needed; a slope > 1 indicates predictions are too conservative.\n\n*(2) Practical interpretation.* A scaled Brier of 0.20 is a modest but real gain over the null:\nadequate for population-level risk stratification but not sufficient for individual-level clinical\ndecisions that require tighter calibration. The low reliability (0.004) confirms predictions track\nobserved rates well, which matters whenever predicted probabilities drive enrollment thresholds or\nresource-allocation rules. 
Always pair the Brier score with AUC to separate calibration from\ndiscrimination: the same model here posts AUC = 0.74, showing acceptable rank ordering; the Brier\nadds the calibration dimension that AUC cannot see.",
    "primary_category": "Machine_Learning_and_Predictive",
    "tags": [
      "brier-score",
      "calibration",
      "proper-scoring-rule",
      "prediction-model-performance",
      "reliability-resolution",
      "brier-skill-score",
      "mean-squared-error",
      "prognostic-model"
    ],
    "applies_to_study_types": [
      "prediction_model",
      "prognostic_study",
      "claims_analysis",
      "ehr_study",
      "registry_linkage"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1097/EDE.0b013e3181c30fb2",
        "url": "https://doi.org/10.1097/EDE.0b013e3181c30fb2",
        "citation_text": "Steyerberg EW, Vickers AJ, Cook NR, Gerds T, Gonen M, Obuchowski N, Pencina MJ, Kattan MW. Assessing the performance of prediction models: a framework for traditional and novel measures. Epidemiology. 2010;21(1):128-138.",
        "year": 2010,
        "authors_short": "Steyerberg et al.",
        "notes": "The framework paper that positions the Brier score (and its scaled version) as the overall-performance measure alongside discrimination and calibration, and sets the modern reporting standard for clinical/RWE prediction models."
      },
      {
        "role": "explain",
        "doi": "10.1175/1520-0493(1950)078<0001:VOFEIT>2.0.CO;2",
        "url": "https://doi.org/10.1175/1520-0493(1950)078<0001:VOFEIT>2.0.CO;2",
        "citation_text": "Brier GW. Verification of forecasts expressed in terms of probability. Monthly Weather Review. 1950;78(1):1-3.",
        "year": 1950,
        "authors_short": "Brier",
        "notes": "The originating paper defining the score as the mean squared error of probabilistic forecasts; the source of the name and the quadratic proper-scoring-rule formulation later imported into biostatistics."
      },
      {
        "role": "explain",
        "doi": "10.1175/1520-0450(1973)012<0595:ANVPOT>2.0.CO;2",
        "url": "https://doi.org/10.1175/1520-0450(1973)012<0595:ANVPOT>2.0.CO;2",
        "citation_text": "Murphy AH. A new vector partition of the probability score. Journal of Applied Meteorology. 1973;12(4):595-600.",
        "year": 1973,
        "authors_short": "Murphy",
        "notes": "Derives the reliability-resolution-uncertainty decomposition of the Brier score that lets a single number be attributed to calibration versus sharpness; the basis for the calibration-plus-refinement view used in prediction modeling."
      },
      {
        "role": "demonstrate",
        "doi": "10.1016/j.jclinepi.2009.11.009",
        "url": "https://doi.org/10.1016/j.jclinepi.2009.11.009",
        "citation_text": "Rufibach K. Use of Brier score to assess binary predictions. Journal of Clinical Epidemiology. 2010;63(8):938-939.",
        "year": 2010,
        "authors_short": "Rufibach",
        "notes": "Clarifies the decomposition and the correct use and interpretation of the Brier score for binary clinical predictions, including the role of the reference (null-model) score for scaling."
      },
      {
        "role": "use",
        "doi": "10.1186/s12916-019-1466-7",
        "url": "https://doi.org/10.1186/s12916-019-1466-7",
        "citation_text": "Van Calster B, McLernon DJ, van Smeden M, Wynants L, Steyerberg EW. Calibration: the Achilles heel of predictive analytics. BMC Medicine. 2019;17(1):230.",
        "year": 2019,
        "authors_short": "Van Calster et al.",
        "notes": "Situates the calibration component of overall performance, argues calibration is routinely neglected, and motivates reporting the Brier score and calibration plot together rather than relying on discrimination alone."
      }
    ],
    "plain_language_summary": "The Brier score measures how accurate a model's predicted probabilities are for a yes/no outcome — it is the average of the squared differences between each patient's predicted probability and their actual result (0 for no event, 1 for event), so a score of 0 means every prediction was exactly right. It captures two things at once: whether the model's probabilities are in the right ballpark for similar patients (that property is called calibration) and whether the model can actually separate high-risk patients from low-risk ones. A Brier score of 0 is perfect; a score equal to the outcome rate multiplied by one minus the outcome rate is what you would get from a model that ignores all patient information and just guesses the overall event rate for everyone.",
    "key_terms": [
      {
        "term": "predicted probability",
        "definition": "A number between 0 and 1 that the model assigns to each patient, representing how likely the model thinks that patient is to experience the outcome (for example, 0.8 means the model thinks there is an 80% chance of the event)."
      },
      {
        "term": "calibration",
        "definition": "How well a model's predicted probabilities match the actual event rates observed in similar patients — a model is well-calibrated if patients it scores at 0.7 really do experience the event roughly 70% of the time."
      },
      {
        "term": "proper scoring rule",
        "definition": "A way of grading probability predictions that can only be made better by giving your honest best estimate — you cannot improve your score by inflating or deflating probabilities, so the score rewards truthful predictions."
      },
      {
        "term": "null model",
        "definition": "The simplest possible prediction: assign every patient the same probability equal to the overall event rate in the dataset, ignoring all individual patient information."
      },
      {
        "term": "scaled Brier score",
        "definition": "A version of the Brier score that is adjusted for how common the outcome is, so you can fairly compare model performance across studies where the event rate differs."
      }
    ],
    "worked_example": {
      "scenario": "A research team builds a model that predicts whether a patient will be hospitalized within 90 days. They test it on five patients from a validation dataset. For each patient the model outputs a predicted probability of hospitalization, and at the end of 90 days the team records whether hospitalization actually occurred (1 = yes, 0 = no). The team wants to compute the Brier score to summarize how well the model's probabilities matched reality, and to discuss what the score tells them about calibration.",
      "dataset": {
        "caption": "Validation rows: one row per patient with the model's predicted probability and the observed 90-day hospitalization outcome.",
        "columns": [
          "patient_id",
          "predicted_prob",
          "outcome"
        ],
        "rows": [
          [
            "P01",
            0.9,
            1
          ],
          [
            "P02",
            0.8,
            1
          ],
          [
            "P03",
            0.3,
            0
          ],
          [
            "P04",
            0.6,
            1
          ],
          [
            "P05",
            0.2,
            0
          ]
        ]
      },
      "steps": [
        "For each patient, subtract the observed outcome from the predicted probability, then square the result. This gives the squared error for that patient.",
        "P01: predicted 0.9, outcome 1. Difference = 0.9 - 1 = -0.1. Squared error = (-0.1)^2 = 0.01.",
        "P02: predicted 0.8, outcome 1. Difference = 0.8 - 1 = -0.2. Squared error = (-0.2)^2 = 0.04.",
        "P03: predicted 0.3, outcome 0. Difference = 0.3 - 0 = 0.3. Squared error = (0.3)^2 = 0.09.",
        "P04: predicted 0.6, outcome 1. Difference = 0.6 - 1 = -0.4. Squared error = (-0.4)^2 = 0.16.",
        "P05: predicted 0.2, outcome 0. Difference = 0.2 - 0 = 0.2. Squared error = (0.2)^2 = 0.04.",
        "Add up all five squared errors: 0.01 + 0.04 + 0.09 + 0.16 + 0.04 = 0.34.",
        "Divide by the number of patients to get the mean: 0.34 / 5 = 0.068. That is the Brier score.",
        "Now check calibration in plain terms. Three of the five patients had the event (outcome rate = 3/5 = 0.60). The model assigned probabilities of 0.9, 0.8, 0.3, 0.6, and 0.2; the average predicted probability is (0.9+0.8+0.3+0.6+0.2)/5 = 0.56, close to the 0.60 observed rate — a sign of reasonable overall calibration. A null model that predicted 0.60 for everyone would score (0.60-1)^2 + (0.60-1)^2 + (0.60-0)^2 + (0.60-1)^2 + (0.60-0)^2 = 0.16+0.16+0.36+0.16+0.36 = 1.20; divided by 5 that is 0.240. The actual model (0.068) is well below that null benchmark, meaning it adds real information beyond just knowing the event rate."
      ],
      "result": "Brier score = (0.01 + 0.04 + 0.09 + 0.16 + 0.04) / 5 = 0.34 / 5 = 0.068. The null model (always predict 0.60) would score 0.240, so the model substantially outperforms a guess based on the overall event rate alone. The model is reasonably well-calibrated: its average predicted probability (0.56) is close to the observed event rate (0.60), and the largest single squared error (0.16, from P04 who was hospitalized but only scored 0.6) flags the one patient where the model was most under-confident."
    },
    "prerequisites": [
      "logistic-regression-for-binary-outcomes",
      "roc-auc-discrimination-rwe",
      "prediction-model-validation-recalibration-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Scaled Brier score (Brier skill score)",
        "description": "BS_scaled = 1 - BS/BS_ref where BS_ref = prevalence*(1-prevalence) is the null (prevalence-only) model; rescales the raw score to a prevalence-comparable 0-to-1 metric (1 perfect, 0 no better than the marginal rate, negative worse).",
        "edge_cases": [
          "A negative scaled Brier score signals a model worse than predicting the constant prevalence - a clear stop signal.",
          "The reference score must use the validation-set prevalence, not the development-set prevalence, when prevalence differs."
        ],
        "data_source_notes": "linked: confirm the validation sample prevalence matches the deployment population so the scaled score transfers; otherwise BS_ref is computed on the wrong base rate."
      },
      {
        "name": "Brier decomposition (reliability - resolution + uncertainty)",
        "description": "Murphy's partition that attributes the raw score to calibration (reliability, lower better), sharpness (resolution, higher better), and the irreducible outcome variance (uncertainty), via binning of predicted probabilities.",
        "edge_cases": [
          "The decomposition is bin-dependent; the number and placement of probability bins changes the reliability/resolution split and introduces a within-bin approximation bias, so report the binning and consider bias-corrected estimators.",
          "With very few events per bin the resolution term is noisy; use deciles or risk-based bins with adequate event counts."
        ],
        "data_source_notes": "ehr/registry: adjudicated outcomes tighten the uncertainty term; a noisy phenotype label inflates it and caps the achievable score."
      },
      {
        "name": "Time-dependent (IPCW) Brier score",
        "description": "For censored time-to-event outcomes, the Brier score is evaluated at a fixed horizon t with inverse-probability-of-censoring weights so administratively censored subjects do not bias the squared-error average.",
        "edge_cases": [
          "Requires a correctly specified censoring model; informative censoring (loss to follow-up correlated with risk) biases the IPCW weights and therefore the score.",
          "The integrated Brier score over a range of horizons summarizes performance across follow-up but hides horizon-specific miscalibration."
        ],
        "data_source_notes": "claims: administrative censoring at study end and disenrollment is the dominant censoring mechanism; the censoring model should reflect FFS-observable follow-up only."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "roc-auc-discrimination-rwe",
        "pros_of_this": "A proper scoring rule that rewards honest, well-calibrated probabilities and penalizes mis-scaling; decomposes into calibration and refinement, so it captures performance the rank-only AUC cannot.",
        "cons_of_this": "The raw score depends on outcome prevalence and is not comparable across datasets without scaling, and a single number compresses away where calibration fails.",
        "when_to_prefer": "When the predicted probability itself drives a decision (thresholds, expected cost, decision curves); use AUC for prevalence-invariant discrimination and rank-only tasks."
      },
      {
        "compared_to": "prediction-model-validation-recalibration-rwe",
        "pros_of_this": "Gives a single overall-performance scalar (and its scaled version) that summarizes calibration and sharpness together for quick model selection and tracking.",
        "cons_of_this": "Does not localize miscalibration the way the calibration slope/intercept and calibration plot do, nor does it by itself prescribe a recalibration.",
        "when_to_prefer": "As the overall-performance headline number; pair with the calibration plot and slope/intercept to diagnose and fix miscalibration."
      },
      {
        "compared_to": "f1-score-precision-recall-rwe",
        "pros_of_this": "Evaluates the full probabilistic prediction without choosing a threshold, so it is not sensitive to an arbitrary operating point and rewards correct probability magnitude, not just classification at one cut.",
        "cons_of_this": "Does not directly express the case-finding burden (false positives per true case) that precision/recall make explicit for a deployed classifier.",
        "when_to_prefer": "When the model outputs and acts on probabilities; use precision/recall/F1 when a single hard classification threshold is deployed and false-positive burden is the concern."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Build predictions and binary outcome labels on FFS-observable person-time only; an MA-only span gives a fabricated \"no event\" label that biases the outcome rate and the reliability term. The outcome is a computable phenotype, so its misclassification inflates the irreducible-uncertainty term; for time-to-event outcomes use the IPCW (time-dependent) Brier score so administrative censoring is not read as event-free survival.",
      "ehr": "Richer features improve resolution, but encounter-driven ascertainment makes an out-of-system event an unobserved negative that biases calibration; require demonstrable in-system activity before labeling a subject event-free and use censoring-weighted scores for informative loss to follow-up.",
      "registry": "Adjudicated registry outcomes give the cleanest labels, tightening every decomposition term; check registry completeness and reporting lag so late-reported events are not miscoded as non-events at the evaluation horizon.",
      "linked": "Linked claims-EHR-vital-records adds a death index so a competing terminal event is not miscoded as event-free; linkage selects the linkable subset, so verify the validation prevalence matches the deployment population before trusting a scaled Brier score."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\nfrom sklearn.metrics import brier_score_loss\n\ndef brier_decomposition(y_true, y_prob, n_bins=10):\n    \"\"\"Raw Brier score, scaled Brier (skill) score, and Murphy's decomposition.\"\"\"\n    y_true = np.asarray(y_true, float)\n    y_prob = np.asarray(y_prob, float)\n    N = y_true.size\n    o_bar = y_true.mean()                       # overall outcome rate\n    bs = brier_score_loss(y_true, y_prob)       # raw mean squared error\n    bs_ref = o_bar * (1 - o_bar)                # prevalence-only null model\n    bs_scaled = 1 - bs / bs_ref                 # Brier skill score\n\n    # Bin by predicted probability for the reliability/resolution split.\n    edges = np.linspace(0.0, 1.0, n_bins + 1)\n    idx = np.clip(np.digitize(y_prob, edges[1:-1]), 0, n_bins - 1)\n    reliability = resolution = 0.0\n    for k in range(n_bins):\n        m = idx == k\n        n_k = m.sum()\n        if n_k == 0:\n            continue\n        p_bar_k = y_prob[m].mean()              # mean prediction in bin\n        o_bar_k = y_true[m].mean()              # observed rate in bin\n        reliability += n_k * (p_bar_k - o_bar_k) ** 2\n        resolution  += n_k * (o_bar_k - o_bar) ** 2\n    reliability /= N\n    resolution  /= N\n    uncertainty = o_bar * (1 - o_bar)\n    return {\"brier\": bs, \"brier_scaled\": bs_scaled,\n            \"reliability\": reliability, \"resolution\": resolution,\n            \"uncertainty\": uncertainty,\n            \"decomp_check\": reliability - resolution + uncertainty}\n\n# Worked example: 5000 subjects, prevalence 0.10, a moderately useful model.\nrng = np.random.default_rng(1)\ny = rng.binomial(1, 0.10, size=5000)\np = np.clip(0.10 + 0.25 * (y - 0.10) + rng.normal(0, 0.05, size=5000), 1e-4, 1 - 1e-4)\nres = brier_decomposition(y, p)\nprint({k: round(v, 4) for k, v in res.items()})",
        "description": "Compute the raw Brier score, the scaled Brier score (Brier skill score), and Murphy's\nreliability-resolution-uncertainty decomposition for binary probabilistic predictions. Inputs: y_true\n(0/1 array of adjudicated outcomes) and y_prob (predicted probabilities). Reproduces the worked example\n(prevalence 0.10, raw BS 0.072, scaled 0.20) and verifies BS = reliability - resolution + uncertainty.",
        "dependencies": [
          "numpy",
          "scikit-learn"
        ],
        "source_citations": [
          "steyerberg-2010",
          "murphy-1973"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(Hmisc)\n\n# Worked example: prevalence 0.10, a moderately useful model on 5000 subjects.\nset.seed(1)\ny <- rbinom(5000, 1, 0.10)\np <- pmin(pmax(0.10 + 0.25 * (y - 0.10) + rnorm(5000, 0, 0.05), 1e-4), 1 - 1e-4)\n\n# val.prob returns Brier, calibration intercept/slope, c-statistic together.\nvp <- val.prob(p = p, y = y, pl = FALSE)\nbrier   <- vp[\"Brier\"]\nbs_ref  <- mean(y) * (1 - mean(y))           # prevalence-only null model\ncat(\"raw Brier       =\", round(brier, 4), \"\\n\")\ncat(\"scaled Brier    =\", round(1 - brier / bs_ref, 4), \"\\n\")  # Brier skill score\n\n# Time-dependent (IPCW) Brier score for a censored Cox model at a horizon.\n# library(pec); library(survival)\n# fit <- coxph(Surv(time, status) ~ x1 + x2, data = d, x = TRUE)\n# pe  <- pec(list(Cox = fit), Surv(time, status) ~ 1, data = d,\n#            times = 365, cens.model = \"marginal\")   # IPCW Brier at 1 year\n# print(pe)",
        "description": "Brier score and its scaled version with the rms/Hmisc ecosystem, plus the time-dependent (IPCW) Brier\nscore for censored outcomes via pec. For binary outcomes, Hmisc::val.prob() returns the Brier score\ntogether with calibration slope/intercept and the c-statistic from a single validation call. Inputs:\npredicted probabilities and the binary (or survival) outcome.",
        "dependencies": [
          "Hmisc",
          "pec"
        ],
        "source_citations": [
          "steyerberg-2010",
          "rufibach-2010"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* Per-subject squared error; phat from PROC LOGISTIC (... / outpred=scored). */\ndata se;\n  set work.scored;\n  sq_err = (phat - y)**2;\nrun;\n\n/* Raw Brier score, overall outcome rate, and the prevalence-only reference. */\nproc sql;\n  create table brier as\n  select mean(sq_err)            as brier,\n         mean(y)                 as o_bar,\n         calculated o_bar * (1 - calculated o_bar) as bs_ref,\n         1 - calculated brier / calculated bs_ref as brier_scaled  /* skill score */\n  from se;\nquit;\n\n/* Reliability / resolution via decile bins of the predicted probability. */\nproc rank data=work.scored out=binned groups=10;\n  var phat; ranks bin;\nrun;\n\nproc sql;\n  create table decomp as\n  select sum(n_k * (p_bar_k - o_bar_k)**2) / sum(n_k) as reliability,\n         sum(n_k * (o_bar_k - g_rate)**2)  / sum(n_k) as resolution,\n         min(g_rate) * (1 - min(g_rate))             as uncertainty\n  from (select bin,\n               count(*)   as n_k,\n               mean(phat) as p_bar_k,\n               mean(y)    as o_bar_k,\n               (select mean(y) from binned) as g_rate\n        from binned group by bin);\nquit;\n\nproc print data=brier   noobs; run;\nproc print data=decomp  noobs; run;   /* brier ~= reliability - resolution + uncertainty */",
        "description": "Brier score and its scaled (skill-score) version from a scored validation dataset in SAS using a DATA\nstep and PROC SQL aggregation (no specialized PROC computes the Brier score directly; PROC LOGISTIC\nproduces the predicted probabilities upstream). Input: work.scored with y (0/1 adjudicated outcome) and\nphat (predicted probability). The PROC SQL ranks predictions into deciles for the reliability/resolution split.",
        "dependencies": [],
        "source_citations": [
          "steyerberg-2010",
          "murphy-1973"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  BS[Brier score<br/>mean&#40;p - y&#41;^2] --> D{Murphy decomposition}\n  D --> Rel[Reliability<br/>calibration gap per bin<br/>LOWER is better]\n  D --> Res[Resolution<br/>bin rates vs overall rate<br/>HIGHER is better]\n  D --> Unc[Uncertainty<br/>o&#183;&#40;1-o&#41; irreducible<br/>fixed by the data]\n  Rel --> Eq[BS = Reliability - Resolution + Uncertainty]\n  Res --> Eq\n  Unc --> Eq\n  Eq --> Scaled[Scaled Brier = 1 - BS / &#40;o&#183;&#40;1-o&#41;&#41;<br/>comparable across prevalence]",
        "caption": "The Brier score and Murphy's decomposition. A low raw score can come from good calibration (low reliability) or from high sharpness (high resolution); the decomposition attributes performance to each, while the scaled Brier score divides out the prevalence-driven uncertainty to permit cross-study comparison.",
        "alt_text": "Diagram of the Brier score decomposing into reliability, resolution, and uncertainty, recombining into the identity BS = reliability minus resolution plus uncertainty, and rescaling to the scaled Brier score.",
        "source_type": "illustrative",
        "source_citations": [
          "murphy-1973"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  P[Predicted probabilities + binary outcomes<br/>from a validation sample] --> Q{What is the question?}\n  Q -->|Rank order only| AUC[Use AUC / c-statistic<br/>prevalence-invariant discrimination]\n  Q -->|Probability is acted on| Brier[Compute Brier score]\n  Brier --> Cal[Pair with calibration plot<br/>+ slope/intercept]\n  Brier --> Cmp{Comparing across<br/>different prevalence?}\n  Cmp -->|Yes| Sc[Report scaled Brier score]\n  Cmp -->|No| Raw[Report raw Brier + decomposition]\n  Cal --> Fix[If miscalibrated -> recalibrate<br/>intercept/slope or refit]",
        "caption": "Choosing and reporting the Brier score. Use it when the predicted probability itself drives a decision, always pair it with a calibration plot, scale it for cross-prevalence comparison, and route detected miscalibration to recalibration.",
        "alt_text": "Decision flow from predicted probabilities to either AUC for rank-only questions or the Brier score when probabilities are acted on, then to a calibration plot, scaled-versus-raw reporting, and recalibration if miscalibrated.",
        "source_type": "illustrative",
        "source_citations": [
          "steyerberg-2010"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "complements",
        "target_slug": "roc-auc-discrimination-rwe",
        "notes": "The Brier score (overall performance, calibration-aware, proper scoring rule) and AUC (discrimination only, rank-based, prevalence-invariant) answer different questions and are reported together for prediction models."
      },
      {
        "relation_type": "used_with",
        "target_slug": "prediction-model-validation-recalibration-rwe",
        "notes": "The Brier score is the overall-performance measure in the standard validation framework; it is reported with the calibration slope/intercept and a calibration plot, and a poor reliability term motivates recalibration."
      },
      {
        "relation_type": "complements",
        "target_slug": "f1-score-precision-recall-rwe",
        "notes": "Brier evaluates the probability without choosing a threshold; precision/recall/F1 evaluate a hard classification at a chosen operating point - complementary views of the same predictions."
      },
      {
        "relation_type": "see_also",
        "target_slug": "cumulative-incidence-risk-rwe",
        "notes": "The time-dependent (IPCW) Brier score evaluates predicted absolute risks against the cumulative incidence at a fixed horizon, so the risk estimand defined there is the target the Brier score scores."
      },
      {
        "relation_type": "see_also",
        "target_slug": "diagnostic-accuracy",
        "notes": "Diagnostic-accuracy studies summarize a classification at a threshold, whereas the Brier score scores the underlying probabilistic prediction across all thresholds; the two characterize different facets of the same model."
      }
    ],
    "aliases": [
      "Brier score",
      "probability score",
      "mean squared error of predictions",
      "Brier skill score",
      "scaled Brier score",
      "quadratic scoring rule"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "journal",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "budget-impact",
    "name": "Budget Impact Analysis",
    "short_definition": "A payer-perspective financial projection that estimates the change in total and per-member-per-month expenditure attributable to adopting a new intervention into a defined, often dynamic, covered population over a short (typically 1-5 year) time horizon, using undiscounted current costs.",
    "long_description": "A **budget impact analysis (BIA)** answers a question that cost-effectiveness analysis (CEA) deliberately ignores:\n*can the payer afford this, and what does it do to next year's budget?* It projects the difference between two\nscenarios for a defined plan or system population over a short horizon — a **\"world without\"** the new intervention\n(current treatment mix) and a **\"world with\"** it (the intervention takes a forecast share of the eligible pool). The\nprimary outputs are the **total incremental budget** and the **per-member-per-month (PMPM)** or per-member-per-year\ncost change, reported year by year, not a single summary ratio.\n\n**Core conceptual distinction — BIA is not CEA.** This is the distinction the ISPOR task forces police hardest, and\nthe most common way a BIA is done wrong is by quietly running a CEA instead.\n- **Estimand / output.** CEA produces an *incremental cost-effectiveness ratio (ICER)* or *net monetary benefit\n  (NMB)* — efficiency per unit of health (cost per QALY). BIA produces an *affordability* number: total dollars and\n  PMPM hitting a specific budget holder. There is no QALY, no ICER, and no willingness-to-pay threshold in a BIA. If\n  your BIA reports an ICER or an NMB, you have built a CEA and mislabeled it.\n- **Discounting.** CEA discounts future costs and effects to present value because it compares lifetime efficiency.\n  BIA **does not discount** (per ISPOR BIA Good Practice II): payers manage *cash outflow in the actual budget year*,\n  so a dollar spent in year 3 is reported as a year-3 dollar. Applying CEA-style discounting to a BIA understates the\n  real budget pressure and is a recognized error.\n- **Population.** CEA typically follows a fixed *cohort* over a lifetime horizon. 
BIA must reflect the **whole\n  eligible population the payer actually covers**, which changes each year as members enter, age, are diagnosed,\n  disenroll, or die — usually modeled as an **open (prevalence + incidence) population**, not a closed cohort.\n- **Costs used.** CEA uses *lifetime* expected costs. BIA uses *current* (acquisition + administration + monitoring\n  + downstream HCRU offsets) costs over the budget window, and must include only resources the budget holder\n  actually pays for.\n\n**Mechanics.** The arithmetic is deliberately simple and transparent (regulators and pharmacy & therapeutics\ncommittees must be able to re-derive it): for each year *t*, `eligible_population(t)` is multiplied by a\n**market-share uptake curve** `share(t)` to get treated patients; each treated patient incurs `(drug_cost +\nadministration + monitoring - cost_offsets)`; untreated patients incur current-standard-of-care cost; the budget\nimpact is the world-with minus world-without total, divided by member-months for PMPM. The credibility of a BIA\nlives almost entirely in three inputs that real-world data are best positioned to supply: the **size of the\naddressable population**, the **realistic uptake/market-share trajectory** (not instantaneous 100% switching), and\nthe **defensible cost offsets** (avoided hospitalizations, reduced concomitant therapy), which is where claims- or\nEHR-derived HCRU enters.\n\n**Pros, cons, and trade-offs.**\n- **vs cost-effectiveness / cost-utility analysis:** BIA tells a payer the *cash* consequence of a coverage decision\n  on a real membership over a real budget cycle; CEA tells a value-for-money story over a lifetime. They are\n  complements, not substitutes — ISPOR good-practice guidance covers both, and HTA practice typically expects both:\n  CDA-AMC (formerly CADTH) requires a BIA, while NICE assesses budget impact separately (its Budget Impact Test) rather\n  than mandating a company-submitted BIA. 
**Use BIA** when the\n  decision is formulary placement, capitation rate-setting, or affordability under a fixed budget; **use CEA** when\n  the decision is whether the intervention is worth its price at a societal/lifetime level. Treating a favorable\n  ICER as proof of affordability is the classic mistake: a cost-effective drug can still be budget-busting (e.g.,\n  hepatitis C direct-acting antivirals: highly cost-effective, yet an acute multi-year budget shock).\n- **vs a static one-year \"drug cost × prevalence\" back-of-envelope:** a proper BIA models the *dynamic* eligible\n  population and a *ramped* uptake curve over multiple years, capturing the timing of spend that payers actually\n  care about. Cost: more inputs, more assumptions, more uncertainty to characterize. **Prefer the dynamic model**\n  whenever uptake is gradual, the population is growing, or the horizon exceeds one year.\n- **vs full Markov / partitioned-survival economic models:** those generate the per-patient cost and effect streams\n  that *feed* a BIA, but a BIA layered on top must collapse them to budget-window cash flows for the whole\n  population. Advantage: a transparent BIA spreadsheet is easier for a P&T committee to audit than a state-transition\n  engine. **Prefer a transparent BIA layer** even when a Markov model exists underneath; expose the inputs.\n\n**When to use.** Formulary / coverage decisions and tier placement; pharmacy and medical benefit budget forecasting;\ncapitation and premium rate-setting; HTA and payer submissions (NICE, CDA-AMC, the Institute for Clinical and\nEconomic Review, AMCP Format dossiers) where a budget-impact or affordability assessment is required or expected;\nnegotiating value-based or budget-cap contracts. BIA is the right tool whenever a\n*specific budget holder over a specific time window* must plan for the cash consequence of an adoption decision.\n\n**When NOT to use, and when it is actively misleading.**\n- **As a stand-in for value.** A BIA says nothing about whether the spend buys health. 
A drug can have a small\n  budget impact and be terrible value, or a large budget impact and be excellent value. Presenting a low budget\n  impact as evidence of *worth* is misleading; that is the CEA's job.\n- **With discounting, lifetime horizons, or QALYs bolted on.** These import CEA machinery that contradicts the BIA's\n  affordability purpose and inflate or deflate the headline number in ways payers will reject.\n- **With instantaneous or unrealistic uptake.** Assuming 100% of eligible patients switch on day one overstates\n  year-1 impact and destroys credibility; assuming token uptake hides a real future liability. Uptake must be\n  evidence-based (analogue launches, contract terms, prior-authorization friction).\n- **When the eligible population or offsets are guessed, not measured.** A BIA built on assumed prevalence and\n  assumed offsets is an opinion in a spreadsheet. If RWD can quantify the addressable pool and the HCRU offsets,\n  failing to use it is the principal source of an indefensible result.\n- **For a societal perspective.** BIA is payer/budget-holder perspective by construction; productivity and\n  out-of-pocket costs the budget holder does not pay belong in a CEA/societal analysis, not the budget line.\n\n**Data-source operational depth (RWD feeds the three load-bearing inputs).**\n- **Claims (FFS / commercial / Part D):** the workhorse for the *addressable population* (members with the\n  indication via diagnosis codes, satisfying continuous enrollment and prior-treatment criteria) and for *observed\n  per-patient annual cost and HCRU offsets* in a treated subcohort (PMPM medical + pharmacy spend). Failure modes:\n  **Medicare Advantage and capitated person-time lack complete FFS claims**, so MA-only enrollees produce\n  understated cost and missed utilization — derive cost offsets from members with full medical+pharmacy benefit and\n  do not pool MA-only person-time into PMPM denominators. 
Pharmacy claims give *acquisition* cost via days_supply ×\n  unit cost but miss in-office (medical-benefit) infusions billed under J-codes — capture both benefit silos or the\n  drug cost is truncated. Plan-paid vs allowed vs charged amounts differ by an order of magnitude; the *budget\n  holder pays the plan-paid (net of rebate) amount*, which claims rarely show — rebates must be layered on\n  separately. Lab-confirmed disease severity is absent, so the addressable pool may be over-broad.\n- **EHR:** sharpens the *eligible population* with labs, biomarkers, stage, and severity that claims lack (e.g.,\n  HbA1c, eGFR, tumor stage), tightening the denominator. Weak for *cost* (charges are not what the payer pays) and\n  for *completeness* once a patient leaves the system — visit-driven capture undercounts utilization that occurs\n  out-of-network, so EHR-derived offsets are typically biased toward zero.\n- **Registry:** strongest for indication, severity, and adjudicated eligibility (e.g., confirmed cancer stage),\n  underpinning a credible addressable-population count; weak for complete cost and pharmacy capture — link to claims\n  for the spend side.\n- **Linked claims-EHR:** the ideal substrate — EHR-tightened eligibility + claims-complete cost and offsets — but\n  only the linkable subset is observed, which can bias the addressable count and the cost estimate toward the\n  insured, system-engaged population.\n\n**Worked claims example.** A regional health plan with **1.2 million covered lives** must project the **3-year budget\nimpact** of adding a new injectable GLP-1 receptor agonist to its formulary for type 2 diabetes (T2D).\n(1) *Addressable population from claims:* members with ≥2 T2D diagnosis codes in the baseline year, ≥365 days of\ncontinuous medical + pharmacy enrollment (excluding MA-only person-time so cost is observable), and a metformin fill\nin the prior 12 months (the on-label \"inadequately controlled on background therapy\" population). 
This yields, say,\n**48,000 eligible members in year 1**, grown each year by observed incident T2D diagnoses and net enrollment churn\n(open population). (2) *Cost offsets from a treated subcohort:* among members already on an analogue injectable, the\nobserved change in diabetes-related PMPM medical spend (hospitalizations, ED visits) versus oral-only controls gives\nan evidence-based **annual HCRU offset per treated patient**, derived only from members with full medical+pharmacy\nbenefit. (3) *Acquisition cost:* days_supply × unit cost from pharmacy claims, with the rebate layered on separately\nto reach a net (post-rebate) price, plus any administration cost. (4) *Uptake curve:* market share ramps\n**8% -> 15% -> 22%** of eligible members over years 1-3,\nanchored to a prior GLP-1 launch in the same plan rather than assumed instantaneous uptake. (5) *Output:* total\nincremental budget per year and PMPM = (world-with total - world-without total) / member-months, **undiscounted**, with a\none-way sensitivity analysis on uptake, net price (rebate), and the magnitude of the offset, and a scenario for an\n*open* vs *closed* population. The decision-relevant deliverable is \"this adds \\$0.41 PMPM in year 1 rising to \\$1.05\nPMPM in year 3,\" not an ICER.\n\n**Interpreting the output**\n\nA regional health plan projects that adding a new injectable GLP-1 receptor agonist to its formulary\nadds $0.41 PMPM in year 1, rising to $1.05 PMPM in year 3, based on a 1.2-million-member plan, an\nuptake ramp of 8% to 22% of an eligible pool that starts at 48,000 members, and undiscounted annual costs.\n\n*(1) Formal interpretation.* The PMPM figures spread the incremental budget impact over the plan's\nentire enrolled membership (not just the eligible or treated patients), so they are a plan-affordability\nmetric, not a per-patient cost measure. Future-year dollars are reported without discounting because\nthe ISPOR BIA Good Practice guideline requires undiscounted cash-flow projection: the payer manages\nactual year-3 expenditure, not a present value. 
The uptake assumption (8% → 22%) is the single most\ninfluential parameter; the worked example includes a one-way sensitivity on uptake and net price\n(post-rebate) to bound the range. There is no ICER, no QALY, and no willingness-to-pay threshold:\na BIA measures affordability, not efficiency.\n\n*(2) Practical interpretation.* A $0.41 PMPM year-1 impact is a modest but real cost pressure on a\n1.2-million-member plan: about $5.9 million over the year if membership holds at 1.2 million. The rise to\n$1.05 PMPM in year 3 (about 2.6 times the year-1 figure, or roughly $15.1 million) reflects uptake growth,\nnot price increases.\nA formulary committee should ask: what is the HCRU offset (avoided hospitalizations, ED visits)\nincluded in the \"world-with\" scenario? If diabetes-related hospitalization offsets are credible, the\nnet PMPM impact will be lower than the gross drug cost figure, and that offset calculation should\nbe shown separately from drug acquisition cost.",
    "primary_category": "Health_Economic",
    "tags": [
      "budget-impact",
      "payer-perspective",
      "affordability",
      "pmpm",
      "market-share-uptake",
      "health-economics",
      "formulary-decision",
      "hta"
    ],
    "applies_to_study_types": [
      "budget_impact"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1016/j.jval.2013.08.2291",
        "url": "https://doi.org/10.1016/j.jval.2013.08.2291",
        "citation_text": "Sullivan SD, Mauskopf JA, Augustovski F, et al. Budget impact analysis-principles of good practice: report of the ISPOR 2012 Budget Impact Analysis Good Practice II Task Force. Value in Health. 2014;17(1):5-14.",
        "year": 2014,
        "authors_short": "Sullivan et al.",
        "notes": "The canonical methodological standard for BIA. Establishes the payer perspective, short time horizon, dynamic eligible population, undiscounted current costs, market-share uptake, and the explicit prohibition on discounting that separates a BIA from a CEA."
      },
      {
        "role": "explain",
        "doi": "10.1111/j.1524-4733.2007.00187.x",
        "url": "https://doi.org/10.1111/j.1524-4733.2007.00187.x",
        "citation_text": "Mauskopf JA, Sullivan SD, Annemans L, et al. Principles of good practice for budget impact analysis: report of the ISPOR Task Force on good research practices-budget impact analysis. Value in Health. 2007;10(5):336-347.",
        "year": 2007,
        "authors_short": "Mauskopf et al.",
        "notes": "The first ISPOR BIA task-force report; lays out the conceptual foundations (affordability vs efficiency, perspective, time horizon, open vs closed population) that Good Practice II later refined."
      },
      {
        "role": "demonstrate",
        "doi": "10.1007/s40273-016-0426-8",
        "url": "https://doi.org/10.1007/s40273-016-0426-8",
        "citation_text": "Mauskopf J, Earnshaw S. A methodological review of US budget-impact models for new drugs. PharmacoEconomics. 2016;34(11):1111-1131.",
        "year": 2016,
        "authors_short": "Mauskopf & Earnshaw",
        "notes": "Systematic review of how published US BIAs actually operationalize the population, uptake curve, cost offsets, and sensitivity analyses; documents common methodological shortcomings (static populations, missing offsets, inappropriate discounting) and what a defensible model looks like in practice."
      }
    ],
    "plain_language_summary": "A budget impact analysis answers a money question that a health plan asks before covering a new drug: how much will our total yearly drug bill go up if we add this? You take the members who could use the drug, estimate the share who actually will (the uptake), price out what those patients cost on the new drug versus what they cost today, and report the dollar difference for the budget. It is about affordability for a specific payer over a few years, not about whether the drug is worth the money for the health it buys (that is a separate cost-effectiveness question). Future-year dollars are reported as-is, because a payer cares about the actual cash leaving the budget that year.",
    "key_terms": [
      {
        "term": "eligible population",
        "definition": "The count of covered members who could appropriately receive the new drug, based on their diagnosis and treatment history."
      },
      {
        "term": "uptake",
        "definition": "The share of eligible members expected to actually take the new drug, which usually starts small and grows over a few years rather than jumping to everyone at once."
      },
      {
        "term": "payer perspective",
        "definition": "Counting only the costs the health plan itself pays, not costs paid by patients, employers, or society."
      },
      {
        "term": "PMPM (per-member-per-month)",
        "definition": "The budget impact spread across every covered member and every month, so plans can compare it against premiums on a familiar scale."
      },
      {
        "term": "cost offset",
        "definition": "Money the plan saves elsewhere because of the new drug, such as avoided hospital stays, which is subtracted from its cost."
      }
    ],
    "worked_example": {
      "scenario": "A small health plan is deciding whether to add a new injectable diabetes drug to its formulary. It has identified the members who could use it and wants the one-year budget impact: how many more dollars will the plan spend if it covers the new drug, compared to the world where everyone stays on today's oral therapy. We keep it to a single year and clean round numbers so the arithmetic is easy to follow.",
      "dataset": {
        "caption": "The plan-level inputs an analyst would assemble before running the model: who is eligible, how many will switch, and what each option costs per patient per year.",
        "columns": [
          "input",
          "value"
        ],
        "rows": [
          [
            "eligible_members",
            "10,000"
          ],
          [
            "expected_uptake",
            "20%"
          ],
          [
            "new_drug_cost_per_patient_per_year",
            "$6,000"
          ],
          [
            "current_oral_cost_per_patient_per_year",
            "$1,000"
          ]
        ]
      },
      "steps": [
        "Split the eligible members by uptake: 10,000 eligible x 20% = 2,000 members on the new drug, leaving 10,000 - 2,000 = 8,000 still on the current oral therapy.",
        "World without (today): every one of the 10,000 eligible members stays on oral therapy, so 10,000 x $1,000 = $10,000,000 per year.",
        "World with (new drug added): the 2,000 switchers cost 2,000 x $6,000 = $12,000,000, and the 8,000 who stay on oral cost 8,000 x $1,000 = $8,000,000, for a total of $12,000,000 + $8,000,000 = $20,000,000.",
        "Budget impact is the difference between the two worlds: $20,000,000 - $10,000,000 = $10,000,000."
      ],
      "result": "Adding the new drug raises the plan's annual spend on this population by $10,000,000 (from $10,000,000 to $20,000,000 per year). That $10,000,000 is the one-year budget impact the payer must plan for."
    },
    "prerequisites": [
      "healthcare-costs-pppm-pppy-pmpm",
      "hcru-healthcare-resource-utilization"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Open (prevalence + incidence) population",
        "description": "The eligible population is refreshed each year for incident cases, net enrollment change, aging, and death, reflecting the membership the payer actually covers over the horizon. This is the ISPOR-preferred default for most BIAs.",
        "edge_cases": [
          "Requires an annual incidence/diagnosis rate and a disenrollment/mortality assumption; getting these wrong drives the entire trajectory more than the per-patient cost does.",
          "Members can enter and leave mid-year; partial member-months must be handled correctly in the PMPM denominator."
        ],
        "data_source_notes": "claims: derive prevalent eligible members and incident diagnoses year over year; treat MA-only spans as person-time with incomplete cost so they do not deflate PMPM offsets."
      },
      {
        "name": "Closed (fixed cohort) population",
        "description": "A single index-year eligible cohort is followed forward without new entrants. Appropriate only for a one-time or non-recurring intervention (e.g., a curative one-time therapy) where the relevant question is the spend on today's prevalent pool.",
        "edge_cases": [
          "Understates multi-year budget impact for chronic-disease therapies where new patients keep becoming eligible.",
          "Mixing a closed population with a multi-year horizon for a chronic therapy is a frequent error flagged in review."
        ],
        "data_source_notes": "claims: fix the eligible cohort at index and carry it forward; still censor at disenrollment and death for accurate member-months."
      },
      {
        "name": "Dynamic market-share uptake curve",
        "description": "Treated share of the eligible pool ramps over the horizon (e.g., 8/15/22% across years 1-3) rather than switching instantaneously, with the curve anchored to an analogue launch, contract terms, or prior-authorization friction.",
        "edge_cases": [
          "Source-of-business assumptions matter — new patients vs switches from existing therapies change which costs are incremental vs displaced.",
          "An instantaneous-uptake assumption is a common way to overstate year-1 impact and lose payer credibility."
        ],
        "data_source_notes": "claims: estimate plausible uptake from the observed adoption trajectory of a comparable previously launched agent in the same plan."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Cost-effectiveness / cost-utility analysis (ICER, NMB)",
        "pros_of_this": "Answers affordability for a specific budget holder over a real budget cycle; transparent, auditable cash-flow arithmetic a P&T committee can re-derive; uses current costs and the actual covered population.",
        "cons_of_this": "Says nothing about value for money; a low budget impact does not mean the spend is worthwhile, and a high one does not mean it is poor value.",
        "when_to_prefer": "Formulary/coverage decisions, capitation and rate-setting, and any decision constrained by a fixed budget over a short horizon. Pair with a CEA; do not substitute one for the other."
      },
      {
        "compared_to": "Static one-year \"drug cost x prevalence\" estimate",
        "pros_of_this": "Captures the dynamic eligible population and ramped uptake, so the timing and trajectory of spend that payers plan around are represented.",
        "cons_of_this": "More inputs and assumptions, hence more uncertainty to characterize via sensitivity and scenario analysis.",
        "when_to_prefer": "Whenever uptake is gradual, the population grows, or the horizon exceeds one year (i.e., almost always for chronic therapies)."
      },
      {
        "compared_to": "Full Markov / partitioned-survival economic model",
        "pros_of_this": "A transparent BIA layer is far easier for payers and reviewers to audit than a state-transition engine; exposes population, uptake, and offset inputs directly.",
        "cons_of_this": "Must collapse rich per-patient cost streams to budget-window cash flows, losing within-patient longitudinal detail that the underlying model carries.",
        "when_to_prefer": "Always expose a transparent BIA layer for the affordability decision, even when a Markov/PSM model supplies the per-patient cost inputs underneath."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Source of the addressable population (indication dx + continuous enrollment + prior treatment) and of observed per-patient annual cost / HCRU offsets (PMPM medical + pharmacy spend). Exclude MA-only and capitated person-time when computing cost offsets, since FFS claims are incomplete there. Capture both pharmacy-benefit (days_supply x net unit cost) and medical-benefit (J-code infusion) drug spend. The budget holder pays plan-paid net-of-rebate amounts, not charged or allowed; layer rebates on separately.",
      "ehr": "Sharpens the eligible-population denominator with labs/biomarkers/stage that claims lack; weak for cost (charges != plan-paid) and undercounts out-of-network utilization, biasing offsets toward zero. Use EHR for eligibility, claims for spend.",
      "registry": "Strong for adjudicated indication and severity (credible addressable count); weak for complete cost and pharmacy capture. Link to claims for the spend side.",
      "linked": "EHR-tightened eligibility plus claims-complete cost is the ideal substrate, but only the linkable, insured, system-engaged subset is observed, which can bias both the population count and the cost estimate."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\n\ndef budget_impact(params: dict) -> pd.DataFrame:\n    \"\"\"Return per-year total and PMPM budget impact (world-with minus world-without).\"\"\"\n    years = range(1, params[\"horizon_years\"] + 1)\n    rows = []\n    eligible = params[\"eligible_y1\"]            # addressable members in year 1 (from claims)\n    for t in years:\n        # Open population: refresh eligible pool for incidence + net enrollment growth each year.\n        if t > 1:\n            eligible *= (1 + params[\"pop_growth_rate\"])\n        covered_lives = params[\"covered_lives_y1\"] * (1 + params[\"pop_growth_rate\"]) ** (t - 1)\n        member_months = covered_lives * 12\n\n        share = params[\"uptake_curve\"][t - 1]   # ramped market share, NOT instantaneous (e.g. .08/.15/.22)\n        treated = eligible * share\n\n        # Per treated patient: acquisition + administration + monitoring, net of measured HCRU offsets.\n        new_cost_pp = (params[\"drug_acq_cost\"] + params[\"admin_cost\"]\n                       + params[\"monitoring_cost\"] - params[\"hcru_offset\"])\n        # World-without: those patients would have been on current standard of care.\n        soc_cost_pp = params[\"soc_cost\"]\n\n        world_with = treated * new_cost_pp + (eligible - treated) * soc_cost_pp\n        world_without = eligible * soc_cost_pp\n        impact = world_with - world_without      # incremental budget for year t (undiscounted)\n\n        rows.append({\n            \"year\": t,\n            \"eligible\": round(eligible),\n            \"treated\": round(treated),\n            \"incremental_budget\": round(impact, 2),\n            \"pmpm\": round(impact / member_months, 4),\n        })\n    return pd.DataFrame(rows)\n\nparams = {\n    \"horizon_years\": 3,\n    \"covered_lives_y1\": 1_200_000,\n    \"eligible_y1\": 48_000,           # T2D, on metformin, continuously enrolled (from claims)\n    \"pop_growth_rate\": 0.02,         # incident dx + net 
enrollment churn (open population)\n    \"uptake_curve\": [0.08, 0.15, 0.22],\n    \"drug_acq_cost\": 9_600.0,        # net-of-rebate annual acquisition cost\n    \"admin_cost\": 0.0,\n    \"monitoring_cost\": 180.0,\n    \"hcru_offset\": 1_350.0,          # avoided diabetes-related medical spend (claims-derived)\n    \"soc_cost\": 1_100.0,             # current oral-therapy annual cost\n}\nprint(budget_impact(params))",
        "description": "Multi-year budget impact engine (payer perspective, undiscounted). This is a BIA, not a CEA: it returns total and\nPMPM budget impact by year, never an ICER/NMB. Required inputs (all derived upstream from claims/EHR; see the SAS\nblock for the claims derivation):\n  params : a dict of assumptions (year-1 covered lives, year-1 eligible count, growth rate, per-patient costs, and an\n           uptake curve given as a list with one market share per horizon year)\nCosts are CURRENT costs over the budget window and are NOT discounted, per ISPOR BIA Good Practice II.",
        "dependencies": [
          "pandas"
        ],
        "source_citations": [
          "sullivan-2014"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "budget_impact <- function(params) {\n  eligible <- params$eligible_y1                 # addressable members year 1 (from claims)\n  out <- vector(\"list\", params$horizon_years)\n  for (t in seq_len(params$horizon_years)) {\n    if (t > 1) eligible <- eligible * (1 + params$pop_growth_rate)  # open population refresh\n    covered_lives <- params$covered_lives_y1 * (1 + params$pop_growth_rate)^(t - 1)\n    member_months <- covered_lives * 12\n\n    share   <- params$uptake_curve[t]            # ramped market share, not instantaneous\n    treated <- eligible * share\n\n    new_cost_pp <- params$drug_acq_cost + params$admin_cost +\n                   params$monitoring_cost - params$hcru_offset   # net of measured HCRU offsets\n    soc_cost_pp <- params$soc_cost\n\n    world_with    <- treated * new_cost_pp + (eligible - treated) * soc_cost_pp\n    world_without <- eligible * soc_cost_pp\n    impact        <- world_with - world_without  # undiscounted incremental budget for year t\n\n    out[[t]] <- data.frame(year = t, eligible = round(eligible),\n                           treated = round(treated),\n                           incremental_budget = round(impact, 2),\n                           pmpm = round(impact / member_months, 4))\n  }\n  do.call(rbind, out)\n}\n\nparams <- list(\n  horizon_years = 3L, covered_lives_y1 = 1.2e6, eligible_y1 = 48000,\n  pop_growth_rate = 0.02, uptake_curve = c(0.08, 0.15, 0.22),\n  drug_acq_cost = 9600, admin_cost = 0, monitoring_cost = 180,\n  hcru_offset = 1350, soc_cost = 1100\n)\nprint(budget_impact(params))",
        "description": "Multi-year budget impact engine in base R, mirroring the Python version. Returns per-year total and PMPM budget\nimpact for the world-with vs world-without contrast. Undiscounted current costs (ISPOR BIA Good Practice II).\nInput `params` is a named list of assumptions derived upstream from claims/EHR: scalars plus an `uptake_curve`\nvector with one market share per horizon year.",
        "dependencies": [],
        "source_citations": [
          "sullivan-2014"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "%let index_year = 2023;\n%let t2dm   = 'E11';   /* ICD-10 T2DM prefix */\n%let metformin = '...';/* NDC list for metformin background therapy */\n\n/* 1. Members with >=2 T2DM diagnoses in the index year (indication). */\nproc sql;\n  create table dx_pts as\n  select person_id\n  from work.dx\n  where substr(dx_code,1,3) = &t2dm\n    and year(dx_date) = &index_year\n  group by person_id\n  having count(distinct dx_date) >= 2;\nquit;\n\n/* 2. >=365 days continuous medical+pharmacy enrollment, excluding MA-only (FFS cost observable). */\nproc sql;\n  create table enrolled as\n  select distinct e.person_id\n  from work.enroll e\n  where e.ma_only = 0 and e.med_benefit = 1 and e.rx_benefit = 1\n    and e.enroll_start <= mdy(1,1,&index_year)\n    and e.enroll_end   >= mdy(12,31,&index_year);\nquit;\n\n/* 3. Prior metformin background therapy in the index year (on-label addressable pool). */\nproc sql;\n  create table on_metformin as\n  select distinct person_id\n  from work.rx\n  where ndc in (&metformin) and year(fill_date) = &index_year;\nquit;\n\n/* Addressable eligible population = intersection of the three criteria. */\nproc sql;\n  create table eligible as\n  select d.person_id\n  from dx_pts d\n    inner join enrolled  e on d.person_id = e.person_id\n    inner join on_metformin m on d.person_id = m.person_id;\n  /* eligible_y1 input for the engine: */\n  select count(*) as eligible_y1 from eligible;\nquit;\n\n/* 4. Observed per-patient annual diabetes-related medical spend (HCRU offset basis), eligible members only. */\nproc sql;\n  create table offset_basis as\n  select c.person_id,\n         sum(case when c.dx_related = 1 then c.plan_paid else 0 end) as dm_med_cost\n  from work.medcost c\n  where c.person_id in (select person_id from eligible)\n    and year(c.svc_date) = &index_year\n  group by c.person_id;\n\n  /* Mean per-patient annual diabetes-related medical cost -> feeds hcru_offset / soc_cost assumptions. 
*/\n  select mean(dm_med_cost) as mean_dm_med_cost_per_patient\n  from offset_basis;\nquit;",
        "description": "Claims derivation of the two load-bearing BIA inputs: the addressable eligible population and the observed\nper-patient annual cost / HCRU offset that feed the budget-impact engine above. This is data management with\nPROC SQL, NOT statistical estimation -- no PHREG/GENMOD/PSMATCH belongs here. Required input datasets (post\nstandard data management):\n  work.dx      : person_id, dx_date, dx_code           (diagnosis claims)\n  work.rx      : person_id, fill_date, ndc, days_supply, plan_paid_net   (pharmacy claims, net of rebate)\n  work.enroll  : person_id, enroll_start, enroll_end, ma_only (0/1), med_benefit (0/1), rx_benefit (0/1)\n  work.medcost : person_id, svc_date, plan_paid, dx_related (0/1)        (medical claims, plan-paid amounts)\n&index_year = the baseline calendar year used to count the year-1 eligible population.",
        "dependencies": [],
        "source_citations": [
          "mauskopf-earnshaw-2016"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Pop[Covered population<br/>1.2M lives] --> Elig[Addressable eligible pool<br/>indication dx + continuous enrollment + prior tx]\n  Elig --> Open[Open population: refresh each year<br/>incidence + net enrollment - disenroll/death]\n  Open --> Share[Apply ramped market-share uptake curve<br/>NOT instantaneous switching]\n  Share --> WW[World WITH: treated x net new cost + untreated x SOC cost]\n  Open --> WO[World WITHOUT: all eligible x current SOC cost]\n  WW --> Impact[Incremental budget = world-with - world-without<br/>UNDISCOUNTED, current costs]\n  WO --> Impact\n  Impact --> Out[Outputs: total $ per year and PMPM<br/>+ one-way sensitivity on uptake, net price, offset]",
        "caption": "BIA computation flow. The eligible pool is refreshed annually (open population), uptake ramps over the horizon, and the world-with minus world-without contrast is reported as undiscounted total and PMPM dollars - never as an ICER or NMB.",
        "alt_text": "Flowchart from covered population through addressable eligible pool, open-population refresh, market-share uptake, world-with and world-without cost scenarios, to undiscounted total and PMPM budget impact with sensitivity analysis.",
        "source_type": "illustrative",
        "source_citations": [
          "sullivan-2014"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  Q[Decision question] --> D{Affordability for a<br/>budget holder over a<br/>short horizon?}\n  D -- Yes --> BIA[Budget Impact Analysis<br/>payer perspective, undiscounted,<br/>dynamic population, PMPM output]\n  D -- No --> V{Value for money<br/>over a lifetime?}\n  V -- Yes --> CEA[Cost-effectiveness / cost-utility<br/>ICER, NMB, discounted, QALYs]\n  V -- No --> Other[Cost-of-illness / HCRU<br/>descriptive economics]\n  BIA -. complements .-> CEA",
        "caption": "BIA vs CEA decision logic. BIA and CEA answer different questions (affordability vs efficiency) and are complements; most HTA bodies require both. Discounting and QALYs belong to CEA, not BIA.",
        "alt_text": "Decision flowchart distinguishing budget impact analysis (affordability, payer perspective, undiscounted, PMPM) from cost-effectiveness analysis (value, lifetime, discounted, QALYs), shown as complementary analyses.",
        "source_type": "illustrative",
        "source_citations": [
          "sullivan-2014"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "complements",
        "target_slug": "cost-effectiveness",
        "notes": "BIA answers affordability (cash to a budget holder over a short horizon); CEA answers value for money (cost per QALY over a lifetime). They are complementary, not interchangeable; HTA submissions typically require both."
      },
      {
        "relation_type": "see_also",
        "target_slug": "discounting-costs-effects-rwe",
        "notes": "Unlike CEA, a BIA does NOT discount future costs - payers manage cash outflow in the actual budget year. See this concept for the discounting that applies to CEA but is explicitly excluded from BIA."
      },
      {
        "relation_type": "produces",
        "target_slug": "healthcare-costs-pppm-pppy-pmpm",
        "notes": "The primary BIA output is reported as per-member-per-month (PMPM) / per-member-per-year budget impact across the covered population."
      },
      {
        "relation_type": "requires",
        "target_slug": "hcru-healthcare-resource-utilization",
        "notes": "Defensible cost offsets (avoided hospitalizations, reduced concomitant therapy) come from RWD-derived HCRU and are a load-bearing BIA input."
      },
      {
        "relation_type": "used_with",
        "target_slug": "claims-analysis",
        "notes": "Claims data supply the addressable eligible population and the observed per-patient cost / HCRU offsets that populate the model."
      },
      {
        "relation_type": "see_also",
        "target_slug": "health-economic-modeling-methods-rwe",
        "notes": "BIA is one of the core health-economic modeling methods; a Markov or partitioned-survival model often supplies the per-patient cost streams that a BIA collapses to budget-window cash flows."
      },
      {
        "relation_type": "used_with",
        "target_slug": "markov-transition-probabilities-rwe",
        "notes": "When per-patient cost and event trajectories are modeled with a state-transition model, the BIA layer aggregates them to whole-population, budget-window spend."
      },
      {
        "relation_type": "used_with",
        "target_slug": "probabilistic-sensitivity-analysis-hea-rwe",
        "notes": "Uncertainty in uptake, net price, eligible population, and offsets is characterized with one-way/scenario and probabilistic sensitivity analyses around the budget impact."
      },
      {
        "relation_type": "see_also",
        "target_slug": "icer-net-monetary-benefit-rwe",
        "notes": "ICER and NMB are CEA outputs; a correctly specified BIA reports neither, and confusing the two is a common BIA error."
      }
    ],
    "aliases": [
      "BIA",
      "budget impact analysis",
      "budget impact model",
      "payer budget impact model"
    ],
    "complexity": "foundational",
    "regulatory_relevance": [
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "burden-of-disease-cost-of-illness",
    "name": "Burden of Disease and Cost-of-Illness (COI) Studies",
    "short_definition": "A descriptive economic assessment that quantifies the epidemiologic, humanistic, and economic burden a disease imposes on patients, payers, and society; cost-of-illness (COI) is the dollar-valued component, estimating direct, indirect, and intangible costs as a total national burden, per-patient figures (PPPY/PPPM), or incremental burden versus disease-free controls.",
    "long_description": "**Burden of disease and cost-of-illness (COI)** studies answer a *descriptive* question — \"how big is the problem?\" — not\na *comparative-effectiveness* question. They quantify the epidemiologic burden (incidence, prevalence, mortality, DALYs),\nthe humanistic burden (HRQoL decrements, caregiver impact), and the economic burden (the dollars the disease consumes). COI\nis the dollar-valued economic component. It is deliberately *not* a causal contrast of one treatment against another; it is a\nbaseline accounting of what a disease costs, used to justify research investment, support disease-awareness and prevention\narguments, frame the denominator for budget-impact and cost-effectiveness work, and populate the \"natural history and unmet\nneed\" sections of regulatory and HTA dossiers.\n\n**Core estimand distinctions (must be pre-specified).** COI has no single number; the \"answer\" depends on four orthogonal\nchoices that must be fixed in the protocol before any cost is summed:\n- **Prevalence-based vs incidence-based.** *Prevalence-based* COI sums all costs incurred in a calendar window (usually one\n  year) by everyone who has the disease that year, regardless of when it began — the right frame for annual payer budgeting\n  and most claims studies. *Incidence-based* COI sums the present value of all *future* costs attributable to the cohort of\n  *new* cases arising in a period (a lifetime or fixed horizon), requiring survival/cost modelling and discounting — the\n  right frame for prevention value, where averting an incident case avoids a lifetime cost stream.\n- **All-cause vs disease-attributable vs incremental.** *All-cause* costs (every claim for a diseased patient) overstate\n  burden by including unrelated care. *Attributable* costs restrict to disease-coded claims but undercount downstream\n  sequelae and depend on coding fidelity. 
*Incremental* costs (the difference in cost between diseased patients and matched\n  disease-free controls) are the methodologically preferred operationalization of \"the burden *caused by* the disease,\"\n  because they net out the background cost of being a comparable person in the same system.\n- **Perspective.** Payer/healthcare-system (direct medical only: what claims see) vs societal (adds direct non-medical\n  costs such as transportation and informal care, and indirect costs from lost productivity). The perspective dictates which\n  cost buckets are in scope and must be stated up front; claims alone cannot support a true societal perspective.\n- **Mean vs median.** Report the **arithmetic mean** as the headline per-patient cost. Cost distributions are heavily\n  right-skewed; the mean (not the median) is the policy-relevant quantity because total burden = mean × population, and\n  budgets are denominated in totals. The median systematically understates burden and should never stand alone.\n\n**Costing method.** *Bottom-up* sums patient-level resource use × unit costs (or paid/allowed amounts) from micro-data;\nit is the claims/EHR workhorse. *Top-down* allocates aggregate national expenditure by diagnosis proportions; it is useful for national\ntotals but blind to patient heterogeneity. The two are routinely combined (bottom-up per-patient cost × top-down prevalence).\n\n**Pros, cons, and trade-offs.**\n- **vs healthcare-costs / PPPM-PPPY:** COI is the broader framing (epidemiologic + humanistic + full economic, including\n  indirect and societal costs) used for population priority-setting; per-patient cost metrics are one *component* of it.\n  Cost: COI is more assumption-heavy (indirect-cost valuation, discounting, prevalence extrapolation) and less granular for\n  a specific therapy's value. 
**Prefer COI** when the question is the magnitude of a disease's footprint; **prefer plain\n  cost metrics** when you only need a standardized per-patient spend.\n- **vs cost-effectiveness analysis (CEA):** COI describes burden; it does not compare interventions or produce an ICER, and\n  a large COI figure does *not* imply an intervention is cost-effective. COI frequently *precedes* CEA by sizing the\n  addressable cost. **Prefer CEA** for any \"is this intervention worth it?\" decision. Treating COI as if it answered the\n  cost-effectiveness question is a category error.\n- **vs budget-impact analysis (BIA):** COI is the static current burden; BIA is the forward-looking, population-scaled\n  financial consequence of *adopting* a new technology over a payer's planning horizon. COI supplies BIA's baseline disease\n  cost. **Prefer BIA** when a payer needs the affordability/cash-flow answer rather than the size-of-problem answer.\n\n**When to use.** Sizing the economic footprint of a disease for advocacy, research-prioritization, or value-story framing;\ngenerating the \"unmet need / natural history\" evidence in HTA submissions and FDA/EMA dossiers; establishing the cost\nbaseline that BIA and CEA build on; quantifying disparities in burden (equity-weighted or SDoH-stratified COI).\n\n**When NOT to use — and when it is actively misleading.**\n- **As a stand-in for treatment value.** A high COI number is sometimes wielded to argue a drug is worth its price. It\n  cannot: COI has no comparator and no outcome contrast. Use CEA/BIA for value claims.\n- **All-cause costs presented as \"the burden of the disease.\"** All-cause spend in a chronically ill, often elderly,\n  multimorbid population is dominated by *other* conditions; attributing it wholesale to the index disease inflates burden,\n  sometimes several-fold. 
Use incremental (matched-control) or carefully validated attributable costing.\n- **Mean cost from a sample with uncontrolled catastrophic outliers.** A handful of transplant, ICU, or end-of-life cases\n  can dominate the mean and the national total. Without pre-specified outlier handling and a sensitivity analysis, the\n  headline figure is an artifact of a few records (see cost-outlier-handling).\n- **Cross-perspective or cross-country comparison without harmonization.** A payer-perspective US claims COI and a societal\n  European COI are not comparable; mixing them or comparing nominal dollars across years without inflation-adjustment is\n  misleading.\n\n**Data-source operational depth.**\n- **Administrative claims (FFS or commercial):** The workhorse for *direct medical* bottom-up COI. Costs come from paid or\n  allowed amounts on inpatient, outpatient, professional, and pharmacy claims, ideally split by place of service. Real\n  failure modes: (1) **Medicare Advantage / capitated person-time lacks adjudicated FFS claims** — utilization is captured\n  as encounters with no reliable dollar amounts, so MA-only enrollees silently produce near-zero costs and bias the mean\n  *down*; restrict to FFS Parts A/B/D (or commercial members with a genuine pharmacy + medical benefit) and exclude MA-only\n  spans. (2) **Right-skew and catastrophic outliers** dominate the mean — winsorize (e.g., at the 99th–99.5th percentile) or\n  use gamma/log-link GLM, and report the untrimmed result as a sensitivity. (3) **Truncation at enrollment ends and at\n  death** creates partial person-time; annualize on observed enrolled days (PPPY = total cost ÷ enrolled person-days × 365),\n  not on a naive 12-month assumption, or burden of high-cost end-of-life care is dropped. 
(4) Claims see **no indirect costs,\n  no out-of-pocket beyond the claim, and limited long-term care**, a payer-perspective ceiling that must be stated.\n- **EHR:** Strong clinical detail (severity, labs, stage) sharpening case definition and risk adjustment, but charges or\n  RVUs are not true costs and capture is visit-driven; patients who leave the system are differentially lost. Link to claims\n  or apply cost-to-charge ratios before reporting dollars.\n- **Registry + survey:** Best for incidence/prevalence and patient-reported humanistic and indirect burden (work loss,\n  caregiver hours via human-capital or friction-cost valuation); typically must be linked to claims for credible direct\n  costs. National surveys (e.g., MEPS) anchor out-of-pocket and indirect components claims cannot see.\n- **Linked claims–EHR–registry–vital-records:** The ideal substrate (severity + complete spend + reliable mortality for\n  end-of-life and incidence-based costing), at the price of linkage selection and date-reconciliation work.\n\n**Worked claims example (incremental, prevalence-based, payer perspective).** Question: the one-year incremental direct\nmedical cost of adult psoriatic arthritis (PsA) in a commercial + Medicare FFS database. (1) **Cases:** ≥2 PsA diagnoses\n(ICD-10-CM L40.5x; M07.0-M07.3 only where WHO ICD-10 coding applies) ≥30 days apart in 2022; index_date = first PsA claim. (2) **Continuous enrollment:** require medical\n+ pharmacy enrollment for the full 12-month measurement year (2022) with FFS-observable spend; **exclude MA-only person-time**\nso costs are real, not missing. (3) **Disease-free controls:** sample patients with no PsA and no psoriasis diagnosis ever,\n**exact-matched 1:3 on age band, sex, region, and index year**, and (optionally) propensity-matched on baseline comorbidity\nburden; assign each control the case's index_date so measurement windows align. 
(4) **Cost capture:** sum allowed amounts\nacross inpatient, outpatient, professional, and pharmacy claims over the 12-month window; if enrollment is partial,\nannualize as cost ÷ enrolled_days × 365. (5) **Incremental cost** = mean(case cost) − mean(matched-control cost), the burden\n*attributable* to PsA. (6) **Skew + outliers:** winsorize total cost at the 99.5th percentile and fit a gamma GLM with a log\nlink (Manning–Mullahy family check) for the adjusted incremental cost; report both raw and modelled means. (7) **National\ntotal** = incremental PPPY × prevalent PsA population (from the same data or an external prevalence estimate). (8)\n**Sensitivity:** all-cause vs attributable costing, winsorization threshold, payer vs (where linkable) societal perspective,\nand discount/inflation adjustment if pooling across years. Reporting should state perspective, time horizon,\nepidemiological approach, costing method, and both per-patient and national totals, and (increasingly expected by HTA) an\nequity or SDoH-stratified view of how burden concentrates.\n\n**Interpreting the output**\n\nA prevalence-based COI study reports a national annual burden of type 2 diabetes of $358.4 billion:\n$268.8 billion in direct medical costs (75%) and $89.6 billion in indirect costs from lost productivity\n(25%), based on 28 million prevalent cases and a mean per-patient total cost of approximately $12,800\nper year (about $9,600 of it direct medical).\n\n*(1) Formal interpretation.* The $358.4 billion figure is a prevalence-based, societal-perspective\nannual accounting total: it sums all costs incurred by everyone with the disease in the study year,\nnot the discounted lifetime cost stream of new cases. It is not a causal estimate of what diabetes\ncaused; it is the aggregate economic footprint of a defined prevalent population. The 75/25 direct/\nindirect split depends on the human-capital method for indirect costs and would shift with the\nfriction-cost or opportunity-cost approach. 
National totals are computed as per-patient cost × prevalent\ncase count, so they inherit uncertainty from both the cost model (typically right-skewed, mean-driven)\nand the epidemiologic prevalence estimate. Double-counting of comorbidity costs is a known limitation\nof gross (all-cause) costing.\n\n*(2) Practical interpretation.* For a health system or policymaker, the $12,800 per-patient figure is\nthe lever for formulary and prevention decisions — reducing it by 10% across 28 million patients saves\napproximately $35.8 billion nationally. The $358.4B headline justifies research investment and frames\nHTA value arguments, but reviewers will ask whether an incremental (matched-control) design was used;\nif not, the total should be labeled a gross accounting estimate, not a causal burden.",
    "primary_category": "Health_Economic",
    "tags": [
      "burden-of-disease",
      "cost-of-illness",
      "coi",
      "economic-burden",
      "incremental-cost",
      "attributable-cost",
      "pppy",
      "daly",
      "societal-perspective",
      "prevalence-based",
      "incidence-based"
    ],
    "applies_to_study_types": [
      "cost_of_illness",
      "budget_impact",
      "cost_effectiveness",
      "drug_utilization",
      "cohort_retrospective",
      "claims_analysis"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "linked",
      "registry",
      "survey",
      "vital-statistics"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.3350/cmh.2014.20.4.327",
        "url": "https://doi.org/10.3350/cmh.2014.20.4.327",
        "citation_text": "Jo C. Cost-of-illness studies: concepts, scopes, and methods. Clinical and Molecular Hepatology. 2014;20(4):327-337.",
        "year": 2014,
        "authors_short": "Jo",
        "notes": "Widely cited primer that defines the prevalence/incidence, top-down/bottom-up, and direct/indirect/intangible cost taxonomy and walks through how each is operationalized."
      },
      {
        "role": "introduce",
        "doi": "10.1186/2191-1991-2-18",
        "url": "https://doi.org/10.1186/2191-1991-2-18",
        "citation_text": "Costa N, Derumeaux H, Rapp T, et al. Methodological considerations in cost of illness studies on Alzheimer disease. Health Economics Review. 2012;2:18.",
        "year": 2012,
        "authors_short": "Costa et al.",
        "notes": "Concrete worked review of the design choices (perspective, prevalence vs incidence, top-down vs bottom-up, cost-component scope) that drive divergent COI estimates for the same disease."
      },
      {
        "role": "explain",
        "doi": "10.1016/j.healthpol.2005.07.016",
        "url": "https://doi.org/10.1016/j.healthpol.2005.07.016",
        "citation_text": "Tarricone R. Cost-of-illness analysis. What room in health economics? Health Policy. 2006;77(1):51-63.",
        "year": 2006,
        "authors_short": "Tarricone",
        "notes": "Critical appraisal of COI's place in health economics — human-capital vs friction-cost valuation of productivity loss, the limits of COI for decision-making, and why it complements rather than substitutes for CEA."
      },
      {
        "role": "explain",
        "doi": "10.1007/s40273-015-0325-4",
        "url": "https://doi.org/10.1007/s40273-015-0325-4",
        "citation_text": "Onukwugha E, McRae J, Kravetz A, Varga S, Khairnar R, Mullins CD. Cost-of-illness studies: an updated review of current methods. PharmacoEconomics. 2016;34(1):43-58.",
        "year": 2016,
        "authors_short": "Onukwugha et al.",
        "notes": "Updated methods review with explicit attention to econometric handling of skewed cost data (GLM, two-part models) and the all-cause vs attributable vs matched-control (incremental) costing distinction central to claims-based COI."
      },
      {
        "role": "demonstrate",
        "doi": "10.1186/s41927-025-00459-1",
        "url": "https://doi.org/10.1186/s41927-025-00459-1",
        "citation_text": "Fernández-Ávila DG, et al. Exploring drug utilization patterns, healthcare resource utilization, and epidemiology of rheumatoid arthritis in Colombia: a retrospective claims database study. BMC Rheumatology. 2025.",
        "year": 2025,
        "authors_short": "Fernández-Ávila et al.",
        "notes": "Real-world claims-based example integrating epidemiology, HCRU, and cost burden for a chronic inflammatory disease, illustrating the bottom-up direct-medical-cost workflow described here."
      }
    ],
    "plain_language_summary": "A burden-of-disease study answers one question: how big is this disease's footprint? It adds up all the costs the disease creates — doctor visits, hospital stays, medicines, and lost work — to produce a single dollar figure for the whole population. A cost-of-illness study is the dollar-valued piece of that accounting; it can report the national total (all patients combined), a per-patient annual average, or the extra cost a sick person bears compared to a healthy similar person. The number is a snapshot of the problem's size, not a verdict on whether any treatment is worth its price.",
    "key_terms": [
      {
        "term": "direct medical costs",
        "definition": "Dollars paid for health care services — hospital stays, doctor visits, lab tests, and prescription drugs — that appear as claims in an insurance database."
      },
      {
        "term": "indirect costs",
        "definition": "Money lost because the disease keeps people from working, including missed workdays and reduced productivity on the job."
      },
      {
        "term": "per-patient per-year (PPPY) cost",
        "definition": "The average annual health care spend for one patient with the disease, calculated by dividing total costs by the number of patients."
      },
      {
        "term": "prevalence",
        "definition": "The total number of people living with a disease in a population at a given point in time."
      },
      {
        "term": "national burden",
        "definition": "The total economic cost of a disease across an entire country, calculated by multiplying the per-patient cost by the number of people with the disease."
      }
    ],
    "worked_example": {
      "scenario": "Imagine you are asked to estimate the total annual economic burden of type 2 diabetes in the United States. You have a published prevalence figure (28 million adults with the diagnosis) and a per-patient annual cost estimate from a claims database study that separates direct medical spending from indirect productivity losses. Your job is to multiply those two numbers together for each cost category and then add the categories up to reach a national total.",
      "dataset": {
        "caption": "Summary inputs for a prevalence-based cost-of-illness calculation. Each row is one cost category; the right column shows the per-patient annual average from claims and survey data.",
        "columns": [
          "cost_category",
          "cost_type",
          "per_patient_annual_usd"
        ],
        "rows": [
          [
            "Hospital and outpatient visits",
            "Direct medical",
            5200
          ],
          [
            "Prescription drugs",
            "Direct medical",
            4400
          ],
          [
            "Lost workdays and reduced productivity",
            "Indirect",
            3200
          ]
        ]
      },
      "steps": [
        "Add the two direct medical rows to get the total direct medical cost per patient: $5,200 + $4,400 = $9,600 per patient per year.",
        "The indirect cost row stands alone: $3,200 per patient per year.",
        "Total per-patient annual cost = direct + indirect = $9,600 + $3,200 = $12,800 per patient per year.",
        "Multiply direct medical cost per patient by prevalence to get the national direct medical burden: $9,600 x 28,000,000 = $268,800,000,000.",
        "Multiply indirect cost per patient by prevalence to get the national indirect burden: $3,200 x 28,000,000 = $89,600,000,000.",
        "Add the two national totals: $268,800,000,000 + $89,600,000,000 = $358,400,000,000."
      ],
      "result": "Total annual national burden of type 2 diabetes = $358.4 billion. Direct medical costs account for $268.8 billion (75%) and indirect productivity losses account for $89.6 billion (25%). These figures come from multiplying the per-patient annual cost of $12,800 by the 28 million prevalent cases: 28,000,000 x $12,800 = $358,400,000,000."
    },
    "prerequisites": [
      "prevalence-point-period-annual-rwe",
      "healthcare-costs-pppm-pppy-pmpm"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Prevalence-based COI",
        "description": "Sums all costs incurred in a fixed window (typically one year) by every individual who has the disease in that window, regardless of disease onset. The standard frame for annual payer budgeting and most claims-based studies.",
        "edge_cases": [
          "Mixes long-standing stable cases with newly diagnosed and end-stage cases; the annual mean masks within-disease cost trajectories (diagnosis-year and final-year-of-life spikes).",
          "Multiply per-patient cost by a prevalence estimate from the same or external data to obtain national totals; the two must refer to the same population and period."
        ],
        "data_source_notes": "claims: identify prevalent cases in the year, annualize cost on enrolled person-days, then scale by prevalence. Best supported data design.",
        "citations": [
          "jo-2014",
          "tarricone-2006"
        ]
      },
      {
        "name": "Incidence-based COI",
        "description": "Estimates the present value of all future costs attributable to the cohort of new cases arising in a period, over a lifetime or fixed horizon. The frame for prevention value, since averting an incident case avoids a future cost stream.",
        "edge_cases": [
          "Requires long longitudinal follow-up or cost/survival modelling and explicit discounting; sensitive to the discount rate and to extrapolation beyond observed follow-up.",
          "Phase-of-care costing (initial, continuing, terminal) is usually needed to capture diagnosis and end-of-life cost spikes."
        ],
        "data_source_notes": "Data-intensive; typically linked longitudinal claims or registry-linked data with survival models and a death index to define the terminal phase.",
        "citations": [
          "jo-2014",
          "costa-2012"
        ]
      },
      {
        "name": "All-cause vs disease-attributable vs incremental (matched-control) costing",
        "description": "Three operationalizations of \"burden\": all claims for diseased patients (all-cause), only disease-coded claims (attributable), or the difference between diseased patients and matched disease-free controls (incremental). Incremental is the preferred estimate of the cost caused by the disease.",
        "edge_cases": [
          "All-cause overstates burden in multimorbid/elderly populations; attributable undercounts sequelae and is hostage to coding fidelity; incremental requires defensible matching and shared measurement windows.",
          "Matched-control costing must align the control index_date to the case index_date so pre/post windows are comparable."
        ],
        "data_source_notes": "claims: exact- or PS-match disease-free controls on age, sex, region, and index year; report all three figures so reviewers can see how attribution choice moves the estimate.",
        "citations": [
          "onukwugha-2016"
        ]
      },
      {
        "name": "Top-down vs bottom-up costing",
        "description": "Top-down allocates national/aggregate expenditure by diagnosis proportions; bottom-up sums patient-level resource use and costs from micro-data. Hybrid (bottom-up per-patient cost x top-down prevalence) is common.",
        "edge_cases": [
          "Top-down misses patient heterogeneity and cannot support incremental costing; bottom-up requires a representative sample and complete cost capture, and is vulnerable to outliers."
        ],
        "data_source_notes": "Claims excel at bottom-up patient-level costing; national totals usually pair a bottom-up per-patient estimate with an external prevalence figure.",
        "citations": [
          "jo-2014"
        ]
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "healthcare-costs-pppm-pppy-pmpm",
        "pros_of_this": "Broader framing (epidemiologic + humanistic + full economic, including indirect and societal costs) for population priority-setting; per-patient cost metrics are one component of a COI study.",
        "cons_of_this": "More assumption-heavy (indirect-cost valuation, discounting, prevalence extrapolation) and less granular for a specific therapy's value than a clean per-patient spend metric.",
        "when_to_prefer": "When the question is the magnitude of a disease's economic footprint across a population, not the standardized spend for a defined cohort."
      },
      {
        "compared_to": "cost-effectiveness",
        "pros_of_this": "Describes the absolute size of a disease's burden and the addressable cost; requires no comparator or outcome model and often precedes CEA by sizing the problem.",
        "cons_of_this": "Produces no ICER and no comparative value statement; a large COI figure does not imply any intervention is cost-effective.",
        "when_to_prefer": "When sizing unmet need or the cost baseline, not when deciding whether an intervention is worth its price (use CEA for that)."
      },
      {
        "compared_to": "budget-impact",
        "pros_of_this": "Captures the current static economic burden of the disease as it stands, independent of any new technology.",
        "cons_of_this": "Not forward-looking and not population-scaled to an adoption scenario; cannot answer a payer's affordability/cash-flow question.",
        "when_to_prefer": "When the decision-maker needs the size-of-problem baseline; use BIA for the financial consequence of adopting a specific new technology over a planning horizon."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Bottom-up direct medical costs from paid/allowed amounts (inpatient, outpatient, professional, pharmacy), ideally by place of service. Require continuous medical + pharmacy enrollment over the measurement window and exclude MA-only / capitated person-time where adjudicated dollar amounts are absent. Annualize on enrolled person-days (PPPY = cost ÷ enrolled_days x 365), winsorize or GLM-model the right-skewed distribution, and report all-cause, attributable, and incremental (matched-control) figures.",
      "ehr": "Rich clinical detail for case definition and risk adjustment, but charges/RVUs are not costs and capture is visit-driven. Apply cost-to-charge ratios or link to claims before reporting dollars; treat patients lost from the system as potentially informative.",
      "registry": "Strong for incidence/prevalence and adjudicated severity/outcomes; weak on complete cost. Link to claims for direct costs and to a death index for incidence-based/terminal-phase costing.",
      "survey": "National surveys (e.g., MEPS) anchor out-of-pocket, indirect (productivity), and informal-care costs that claims cannot see; needed to move from a payer to a societal perspective.",
      "linked": "Linked claims-EHR-registry-vital-records is the ideal substrate (severity + complete spend + reliable mortality) but introduces linkage selection and order/fill/service date reconciliation before windows are assigned."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\nimport numpy as np\n\nDAYS_IN_YEAR = 365.0\nWINSOR_Q = 0.995          # cap catastrophic outliers; report untrimmed as sensitivity\nN_CONTROLS = 3            # 1:3 exact matching\n\ndef annualized_cost(members: pd.DataFrame, claims: pd.DataFrame) -> pd.DataFrame:\n    # Total paid per person over the year, then annualize on observed FFS-enrolled days.\n    total = (claims.groupby(\"person_id\")[\"paid_amount\"].sum()\n                   .rename(\"total_paid\").reset_index())\n    df = members.merge(total, on=\"person_id\", how=\"left\")\n    df[\"total_paid\"] = df[\"total_paid\"].fillna(0.0)        # enrolled, no claims -> zero cost (not missing)\n    df[\"cost_pppy\"] = df[\"total_paid\"] / df[\"enrolled_days\"] * DAYS_IN_YEAR\n    return df\n\ndef exact_match(df: pd.DataFrame, k: int = N_CONTROLS, seed: int = 1) -> pd.DataFrame:\n    # Exact 1:k matching of disease-free controls to cases on age band, sex, region.\n    rng = np.random.default_rng(seed)\n    df = df.copy()\n    df[\"age_band\"] = pd.cut(df[\"age\"], [0, 17, 44, 64, 200],\n                            labels=[\"0-17\", \"18-44\", \"45-64\", \"65+\"])\n    keys = [\"age_band\", \"sex\", \"region\"]\n    out = []\n    for _, stratum in df.groupby(keys, observed=True):\n        cases = stratum[stratum[\"is_case\"]]\n        ctrls = stratum[~stratum[\"is_case\"]]\n        if cases.empty or ctrls.empty:\n            continue                                       # no support in this cell -> dropped\n        take = min(len(ctrls), len(cases) * k)\n        out.append(cases)\n        out.append(ctrls.sample(take, random_state=int(rng.integers(1e9))))\n    return pd.concat(out, ignore_index=True)\n\ndef incremental_coi(members, claims):\n    df = annualized_cost(members, claims)\n    matched = exact_match(df)\n    cap = matched[\"cost_pppy\"].quantile(WINSOR_Q)\n    matched[\"cost_w\"] = matched[\"cost_pppy\"].clip(upper=cap)\n    mean_case = 
matched.loc[matched[\"is_case\"], \"cost_w\"].mean()\n    mean_ctrl = matched.loc[~matched[\"is_case\"], \"cost_w\"].mean()\n    return {\n        \"mean_case_pppy\": mean_case,\n        \"mean_control_pppy\": mean_ctrl,\n        \"incremental_pppy\": mean_case - mean_ctrl,   # burden attributable to the disease\n        \"winsor_cap\": cap,\n        \"n_cases\": int(matched[\"is_case\"].sum()),\n    }",
        "description": "Prevalence-based, incremental (matched-control) direct-medical COI from claims-style inputs. Required inputs\n(already cleaned, one measurement year):\n  members : person_id, age, sex, region, is_case (bool), enrolled_days (FFS-observable days in the year, MA-only excluded)\n  claims  : person_id, paid_amount, pos (place of service)  # one row per claim line, FFS amounts only\nCases are matched 1:N to disease-free controls on age band, sex, region; cost is annualized on enrolled days,\nwinsorized for catastrophic outliers, and the incremental mean is the burden attributable to the disease.",
        "dependencies": [
          "pandas",
          "numpy"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\n\ncoi_incremental <- function(members, claims, winsor_q = 0.995) {\n  setDT(members); setDT(claims)\n\n  # Annualize total paid on FFS-enrolled days (enrolled-but-no-claims -> 0, not NA).\n  tot <- claims[, .(total_paid = sum(paid_amount)), by = person_id]\n  df  <- merge(members, tot, by = \"person_id\", all.x = TRUE)\n  df[is.na(total_paid), total_paid := 0]\n  df[, cost_pppy := total_paid / enrolled_days * 365]\n\n  df[, age_band := cut(age, c(0, 17, 44, 64, Inf),\n                       labels = c(\"0-17\", \"18-44\", \"45-64\", \"65+\"))]\n\n  # Winsorize catastrophic outliers; keep raw for a sensitivity run.\n  cap <- quantile(df$cost_pppy, winsor_q, names = FALSE)\n  df[, cost_w := pmin(cost_pppy, cap)]\n\n  # Two-part model for skewed cost with a mass at zero (enrolled-but-no-claims):\n  #   part 1 = logistic for Pr(any cost > 0); part 2 = Gamma/log GLM on POSITIVE cost only\n  #   (Gamma's support is y > 0, so it must be fit on the positives, not on a +1-shifted\n  #   series -- the +1 shift biases the mean and is not the right fix for log(0)).\n  # E[cost] = Pr(cost > 0) * E[cost | cost > 0], so the adjusted prediction multiplies the parts.\n  df[, any_cost := as.integer(cost_w > 0)]\n  fit_p1 <- glm(any_cost ~ is_case + age_band + sex + region,\n                family = binomial(link = \"logit\"), data = df)\n  fit_p2 <- glm(cost_w ~ is_case + age_band + sex + region,\n                family = Gamma(link = \"log\"), data = df[cost_w > 0])\n\n  # Adjusted incremental cost = predicted(case) - predicted(control) at the reference profile.\n  base <- df[1]; base$age_band <- \"45-64\"; base$sex <- df[, names(sort(table(sex), TRUE))[1]]\n  case_row <- copy(base); case_row$is_case <- TRUE\n  ctrl_row <- copy(base); ctrl_row$is_case <- FALSE\n  pred_2p <- function(row) predict(fit_p1, row, type = \"response\") *\n                           predict(fit_p2, row, type = \"response\")\n  pc <- pred_2p(case_row)\n  pk <- 
pred_2p(ctrl_row)\n\n  list(adjusted_incremental_pppy = unname(pc - pk),\n       winsor_cap = cap, n_cases = df[is_case == TRUE, .N])\n}",
        "description": "Prevalence-based, incremental direct-medical COI in R with a gamma GLM for the skewed cost. Inputs mirror Python:\n  members : person_id, age, sex, region, is_case (logical), enrolled_days\n  claims  : person_id, paid_amount, pos\nThe gamma/log-link GLM gives an adjusted incremental cost robust to right-skew; the marginal means are back-transformed\nvia prediction at case vs control with covariates held at sample values.",
        "dependencies": [
          "data.table"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* 1. Bottom-up: total paid per person, annualized on FFS-enrolled days. */\nproc sql;\n  create table person_cost as\n  select m.person_id, m.age, m.sex, m.region, m.is_case, m.enrolled_days,\n         coalesce(c.total_paid, 0) as total_paid,\n         calculated total_paid / m.enrolled_days * 365 as cost_pppy\n  from work.members m\n  left join (select person_id, sum(paid_amount) as total_paid\n             from work.claims group by person_id) c\n    on m.person_id = c.person_id;\nquit;\n\n/* 2. Winsorize catastrophic outliers at the 99.5th percentile (report untrimmed as sensitivity). */\nproc univariate data=person_cost noprint;\n  var cost_pppy;\n  output out=caps pctlpts=99.5 pctlpre=p_;\nrun;\ndata person_cost;\n  if _n_ = 1 then set caps;\n  set person_cost;\n  cost_w = min(cost_pppy, p_99_5);\n  age_band = (age > 17) + (age > 44) + (age > 64);   /* 0=0-17,1=18-44,2=45-64,3=65+ */\nrun;\n\n/* 3. Unadjusted PPPY by case status -> raw incremental cost. */\nproc means data=person_cost mean std n maxdec=0;\n  class is_case;\n  var cost_w;\nrun;\n\n/* 4. Adjusted incremental cost: gamma GLM with log link (the standard skewed-cost model). */\nproc genmod data=person_cost;\n  class sex region age_band / param=ref;\n  model cost_w = is_case age_band sex region\n        / dist=gamma link=log type3;\n  estimate 'log incremental (case vs control)' is_case 1 / exp;  /* exp() = cost ratio */\nrun;",
        "description": "Prevalence-based incremental COI in SAS using genuine costing PROCs: PROC SQL/MEANS for bottom-up aggregation and PPPY,\nand PROC GENMOD with a gamma distribution + log link for the skewed adjusted incremental cost. Required inputs\n(post data-management):\n  work.members : person_id, age, sex, region, is_case (0/1), enrolled_days  (MA-only spans already excluded)\n  work.claims  : person_id, paid_amount, pos\nNo survival/PHREG model is used: COI is a cost-aggregation and skewed-cost-regression problem, not a hazard problem.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  TB[Total disease burden] --> DM[Direct medical costs]\n  TB --> DN[Direct non-medical costs<br/>transport, home modification, informal care]\n  TB --> IND[Indirect costs<br/>absenteeism, presenteeism,<br/>premature mortality]\n  TB --> INT[Intangible costs<br/>pain, HRQoL / DALY decrement]\n  DM --> INP[Inpatient]\n  DM --> OUT[Outpatient + professional]\n  DM --> RX[Pharmacy]\n  DM --> LTC[Long-term care]\n  DN -. valued via .-> HC[Human-capital or<br/>friction-cost method]\n  IND -. valued via .-> HC\n  DM -- payer perspective --> PAY[Payer / health-system COI]\n  DM --> SOC\n  DN --> SOC\n  IND --> SOC\n  SOC[Societal COI]",
        "caption": "Cost-component decomposition of disease burden. A payer-perspective COI captures only direct medical costs; a societal-perspective COI adds direct non-medical and indirect costs (valued by human-capital or friction-cost methods), with intangible burden often reported separately as HRQoL/DALY decrements.",
        "alt_text": "Tree decomposing total disease burden into direct medical (inpatient, outpatient, pharmacy, long-term care), direct non-medical, indirect, and intangible cost components, showing which feed the payer versus societal perspective.",
        "source_type": "illustrative",
        "source_citations": [
          "jo-2014"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  AC[All-cause cost<br/>every claim for diseased patients] -->|overstates: includes<br/>unrelated multimorbidity| Q{What is being<br/>measured?}\n  AT[Attributable cost<br/>disease-coded claims only] -->|undercounts sequelae;<br/>coding-dependent| Q\n  INC[Incremental cost<br/>diseased minus matched<br/>disease-free controls] -->|preferred: nets out<br/>background cost| Q\n  Q --> EST[Estimate of burden<br/>caused by the disease]",
        "caption": "The three costing estimands for \"burden of the disease.\" All-cause overstates burden in multimorbid populations, attributable undercounts downstream sequelae and depends on coding, and incremental (matched-control) is the preferred operationalization of the burden caused by the disease.",
        "alt_text": "Decision diagram contrasting all-cause, attributable, and incremental matched-control costing and what each estimates about disease burden.",
        "source_type": "illustrative",
        "source_citations": [
          "onukwugha-2016"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "used_with",
        "target_slug": "hcru-healthcare-resource-utilization",
        "notes": "HCRU (by place of service, medical vs pharmacy, all-cause vs attributable) is the volume driver of direct medical costs; report HCRU alongside dollar burden."
      },
      {
        "relation_type": "used_with",
        "target_slug": "healthcare-costs-pppm-pppy-pmpm",
        "notes": "Standardized per-patient costs (PPPY/PPPM), all-cause vs attributable, with outlier handling are the core dollar outputs of a COI study."
      },
      {
        "relation_type": "used_with",
        "target_slug": "prevalence-point-period-annual-rwe",
        "notes": "Prevalence-based COI scales a per-patient cost by a prevalence estimate; the cost and prevalence must refer to the same population and period."
      },
      {
        "relation_type": "see_also",
        "target_slug": "all-cause-vs-attributable-costs-rwe",
        "notes": "COI must present all-cause and disease-attributable (or incremental-vs-controls) costs; incremental is preferred for \"burden caused by\" the disease."
      },
      {
        "relation_type": "see_also",
        "target_slug": "cost-outlier-handling-rwe",
        "notes": "Bottom-up claims COI is dominated by a few catastrophic cases; pre-specified winsorization or robust/GLM methods plus sensitivity analyses are required before reporting means or totals."
      },
      {
        "relation_type": "see_also",
        "target_slug": "cost-effectiveness",
        "notes": "COI sizes the disease's cost baseline but produces no ICER; it precedes and complements CEA rather than answering the value question."
      },
      {
        "relation_type": "see_also",
        "target_slug": "budget-impact",
        "notes": "COI supplies the static baseline disease cost that a forward-looking, population-scaled budget-impact analysis builds on."
      },
      {
        "relation_type": "used_with",
        "target_slug": "poisson-negative-binomial-count-models",
        "notes": "Count models for HCRU volume (hospitalizations, visits) provide rate-based burden metrics that complement dollar totals and feed bottom-up costing."
      },
      {
        "relation_type": "see_also",
        "target_slug": "treatment-patterns-lines-of-therapy",
        "notes": "Treatment patterns shape the trajectory and magnitude of disease burden over time."
      },
      {
        "relation_type": "see_also",
        "target_slug": "sdoh-social-determinants-of-health",
        "notes": "SDoH drive disparities in disease burden and are increasingly incorporated into equity-stratified COI."
      }
    ],
    "aliases": [
      "COI",
      "burden of disease",
      "BOD",
      "economic burden",
      "cost-of-illness study"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "hta",
      "fda",
      "ema"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "cascade-of-care-analysis-rwe",
    "name": "Cascade of Care Analysis",
    "short_definition": "A staged framework that decomposes the patient pathway from diagnosis or eligibility through linkage, treatment initiation, persistence, and response/outcome attainment, quantifying conditional retention and absolute attrition at each gate to locate modifiable bottlenecks.",
    "long_description": "The **cascade of care** (care cascade, treatment cascade, \"the cascade\") is a descriptive-analytic framework that maps the\nordered steps a patient must clear to reach a clinically meaningful outcome and quantifies the proportion retained, and the\nabsolute number lost, at each step. It was formalized for HIV (diagnosed -> linked to care -> ART-initiated -> retained ->\nvirally suppressed) and has been transplanted to diabetes (diagnosed -> treated -> controlled), heart failure (eligible ->\nguideline-directed medical therapy initiated -> up-titrated to target dose), oncology (diagnosed -> staged -> systemic\ntherapy -> response), and opioid use disorder (diagnosed -> medication for OUD initiated -> retained). In RWE it is built\nentirely from date-stamped events in claims, EHR, or registry data: diagnosis codes for eligibility, pharmacy fills or\nJ-codes for treatment, lab values or CPT-II codes for response, and diagnosis/procedure/death for terminal outcomes.\n\n**Core conceptual distinction.** A cascade is not a single rate, not persistence, and not lines-of-therapy. A \"control\nrate\" or \"treated rate\" is a scalar that hides where the system leaks; the cascade replaces it with a set of *conditional*\nprobabilities (of those eligible, p1 are diagnosed; of those diagnosed, p2 are treated; of those treated, p3 respond) plus\nthe *absolute* count lost at each gate. The denominator semantics are the whole methodological game: a cascade can be read\nunconditionally (every stage as a share of the original eligible population, so the curve is monotone non-increasing) or\nconditionally (each stage as a share of the prior stage, exposing which single transition is the bottleneck). Persistence/\nPDC measure duration *conditional on having initiated* a specific drug, so they are blind to the pre-treatment losses\n(undiagnosed, diagnosed-but-untreated) that often dominate the cascade. 
Lines-of-therapy describe sequencing *after*\ninitiation and condition on reaching a later line. The cascade is the only one of the three that spans the full diagnosis-\nto-outcome arc and makes pre-initiation leakage and end-stage response visible in one object. It is descriptive, not\ncausal: it reports what happened to a real population, not what would happen under an intervention.\n\n**Pros, cons, and trade-offs.**\n- **vs a single summary metric (e.g., \"42% controlled\"):** the cascade localizes the bottleneck (90% diagnosed but only\n  35% initiated is a very different problem from 95% initiated but only 40% responding), which makes it directly\n  actionable for health systems, payers, and quality programs. Cost: it demands defensible stage definitions and date\n  logic, and the \"leak\" attribution shifts with those choices.\n- **vs persistence / time-to-discontinuation:** the cascade includes the diagnosis and initiation gaps that persistence\n  never observes and supports equity and access questions across the whole pathway. Persistence is the right tool for the\n  narrower \"how long do patients stay on drug X once started?\" question and gives finer time resolution within the\n  single stage it covers.\n- **vs treatment-patterns / lines-of-therapy:** the cascade exposes the large pre-first-line losses and the terminal\n  response stage; LOT is richer for sequencing, switching, and later-line cost once patients have initiated. Use the\n  cascade for population gap analysis, LOT for describing the complexity of treated patients.\n- **vs discrete-event or Markov simulation:** the cascade is the observed evidence layer; simulation is the decision\n  layer that takes cascade-derived transition probabilities as inputs to project \"what if we closed the stage-3 gap?\"\n  over a lifetime or budget horizon. 
A cascade alone cannot forecast; a simulation without a credible cascade is\n  ungrounded.\n\n**When to use.** Population-health, quality-improvement, value-based-contracting, or access questions framed as \"where in\nthe journey from eligibility to outcome are patients lost?\"; comparative work across systems, regions, payers, or pre/post\npolicy to detect differential leakage (an equity lens); therapeutic areas with clear, ordered, measurable stages; and as\nthe empirical scaffold supplying uptake, persistence, and response parameters to a budget-impact or cost-effectiveness\nmodel.\n\n**When NOT to use -- and when it is actively misleading or dangerous.**\n- **The stages are not genuinely sequential, or the \"leak\" is appropriate care.** Many diagnosed patients are correctly\n  not started on a drug (contraindication, mild disease, patient preference, watchful waiting). Drawing a cascade implies\n  every drop-off is a failure; presenting \"65% diagnosed but untreated\" as a quality gap without clinical adjudication is\n  misleading and can drive harmful over-treatment incentives.\n- **Intermediate stages rest on insensitive or non-specific codes.** Defining \"response\" as the mere absence of a\n  progression code, or \"control\" from sporadically captured labs, manufactures an artifactual cascade whose shape is\n  driven by coding completeness, not clinical reality.\n- **The denominator is itself selected and presented as the population.** Restricting to continuously enrolled patients,\n  or to those with a recorded lab, and then labeling the result \"the cascade for condition X\" conflates a convenience\n  sample with the target population. 
This is the cascade-specific form of selection bias.\n- **It is used to claim a drug \"works.\"** A cascade has no comparator and no counterfactual; reading \"of those treated,\n  60% responded\" as evidence of treatment efficacy is a category error -- that requires an active-comparator new-user\n  design or a target-trial emulation, not a funnel.\n\n**Data-source operational depth.**\n- **Claims (FFS / commercial):** diagnosis from ICD on medical claims (incident phenotypes typically require 1 inpatient\n  or 2 outpatient codes on different days to suppress rule-out coding); treatment from pharmacy fills (NDC + `fill_date` +\n  `days_supply`) or J-codes/HCPCS for infused/injected agents that never appear in pharmacy; response from LOINC labs or\n  CPT-II codes, which are sparsely captured in claims. Failure modes: (1) **Medicare Advantage person-time lacks FFS\n  claims** -- MA-only enrollees generate no usable medical/pharmacy claims, so absence of a treatment fill or lab is\n  missingness, not a true negative; restrict to enrollees with the relevant benefit (Parts A/B/D or commercial medical+\n  pharmacy) and exclude MA-only spans, or the later cascade stages collapse artifactually. (2) **Continuous-enrollment\n  requirements select the denominator** -- requiring 12 months pre/post enrollment drops the sickest (who die or\n  disenroll) and the healthiest (who churn), biasing every downstream conditional. (3) **Plan switching and FFS<->MA\n  transitions truncate observation** mid-cascade, so a patient who \"fails to reach control\" may simply have left the\n  observable data; censor explicitly and report the at-risk denominator per stage. 
(4) **Response capture differs by\n  exposure** -- patients on actively managed therapy get more labs ordered, so \"controlled\" looks higher for treated\n  patients partly because they are *measured* more, an immortal-time/ascertainment artifact at the response gate.\n- **EHR:** richer for the response stage (actual lab values, vitals, problem-list updates, e-prescribing orders) but\n  visit-driven and leaky to outside care. Use the actual result/order date, not a claim adjudication date, and require a\n  minimum observation window after each stage before scoring \"not achieved,\" or you misclassify \"not yet measured\" as\n  \"failed.\" A patient who leaves the system is differentially lost.\n- **Registry:** often carries adjudicated stage membership (cancer stage, dialysis initiation, viral load) and is the gold\n  standard for specific steps, but the enrolled population is selected and cost/utilization capture is incomplete; link to\n  claims for the full pathway.\n- **Linked claims-EHR(-vital records):** the ideal substrate -- claims give complete capture of diagnosis and initiation,\n  EHR gives the clinical detail and labs for response, and a death index firms up the terminal stage and censoring.\n  Reconcile order/fill/service/result dates across sources *before* assigning stage entry, or stage ordering inverts.\n\n**Worked claims example (type 2 diabetes; commercial + Medicare FFS, index 2022, follow-up through 2024).**\nDenominator: adults 40-80 with an incident type 2 diabetes phenotype (ICD-10 E11.x: 1 inpatient or 2 outpatient on\nseparate days) in 2022, with 12 months continuous medical+pharmacy enrollment pre-index and 12 months post-index, and\n**no MA-only person-time** across that window. Pre-specified stages (denominator = stage 1 throughout for the\nunconditional curve):\n1. Diagnosed (eligible denominator): 184,200 (100%).\n2. 
Linked: primary-care or endocrinology E/M visit within 180 days of the first qualifying diagnosis code: 141,900\n   (77.0% of diagnosed).\n3. Initiated: first fill (NDC + `days_supply`) or administration of any non-insulin antidiabetic (metformin, SGLT2i,\n   GLP-1 RA, DPP-4i, sulfonylurea, TZD) within 180 days of diagnosis or linkage: 98,300 (53.4% of diagnosed; 69.3% of\n   linked).\n4. Persistent: PDC >= 0.80 over the first 365 days post-initiation, allowing within-class switches and a 30-day refill\n   grace period: 71,800 (39.0% of diagnosed).\n5. Controlled: most recent HbA1c < 7.0% (or a documented individualized < 7.5% target) in the 12-month outcome window,\n   from LOINC labs or CPT-II codes: 42,100 (22.8% of diagnosed).\nRead-out: the **largest absolute loss is at initiation** (85,900 diagnosed patients never reach a first antidiabetic\nfill within the window); among initiators, ~27% fail to persist to 12 months; among persistent patients, ~41% have no\ndocumented controlled HbA1c -- but note that part of that last gap is *measurement*, since the controlled-lab denominator\nis itself conditional on a lab being captured (a cascade-specific selection trap). Stratifying by Area Deprivation Index\nquartile shows an 8-11 percentage-point lower retention at *every* gate for the most-deprived vs least-deprived quartile,\nthe kind of monotone equity signal a single control rate would hide. Sensitivity: widening the initiation window from 90\nto 365 days moves stage 3 from 48% to 61% of diagnosed -- proof that stage definitions, not biology, can dominate the\napparent leak, which is why the window, denominator choice, and code lists must be pre-specified in the SAP and varied in\nsensitivity analyses.\n\n**Operational variants.** (a) Static cross-sectional cascade -- every patient at their own current stage on one fixed\ndate; fast, but mixes incident and prevalent patients. 
(b) Longitudinal/cohort cascade -- patients followed from a common\ntime-zero (first diagnosis or first treatment opportunity); the cleaner design for attribution. (c) Comparative cascades --\nby payer, region, race/ethnicity, or pre/post a formulary or policy change. (d) Conditional cascade -- restricted to\npatients who reached a prior stage (\"of initiators, what share responded?\"), useful but easy to misread as the population\ncascade.\n\n**Interpreting the output**\n\nA type 2 diabetes care cascade among 184,200 commercially insured + Medicare FFS adults shows: 10,000\ndiagnosed (denominator rescaled for readability) → 7,700 linked to care (77.0%) → 5,330 initiated on\nantidiabetics (53.3% of diagnosed; 69.2% of linked) → 3,890 persistent at 12 months (38.9% of\ndiagnosed) → 2,280 with documented glycemic control (22.8% of diagnosed).\n\n*(1) Formal interpretation.* Each row has two valid denominators, and which you use changes the story.\nThe unconditional fractions (all referenced to the 10,000 diagnosed denominator) give a monotone\nnon-increasing curve showing cumulative leakage from the full eligible population. The conditional\nfractions (each stage referenced to the prior stage) reveal where the system is most inefficient\nwithin each transition. The largest *absolute* unconditional loss occurs at initiation: 4,670 diagnosed\npatients never reach a first antidiabetic fill, making the diagnosis-to-treatment gap the dominant\nbottleneck. 
Among those who initiated (5,330), approximately 27% failed to persist through 12 months.\nThe controlled-HbA1c stage is subject to measurement truncation: patients without a captured lab are\nscored as uncontrolled, so the 22.8% controlled rate is a lower bound.\n\n*(2) Practical interpretation.* The cascade tells a payer or quality program where to invest first:\nclosing the linkage-to-initiation gap (77.0% linked but only 53.3% initiated, a loss of 2,370 patients)\nrecovers more patients than improving persistence among those already on therapy (1,440 lost). A single\nsummary rate (\"22.8% controlled\") hides this entirely. Any intervention targeting a specific transition\nshould be evaluated against the stage-specific conditional rate, not the unconditional overall rate.",
    "primary_category": "Descriptive_Epidemiology",
    "tags": [
      "cascade-of-care",
      "care-cascade",
      "treatment-cascade",
      "attrition-funnel",
      "conditional-retention",
      "quality-measurement",
      "health-equity",
      "descriptive-epidemiology"
    ],
    "applies_to_study_types": [
      "descriptive_epidemiology",
      "drug_utilization"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1093/cid/ciq243",
        "url": "https://doi.org/10.1093/cid/ciq243",
        "citation_text": "Gardner EM, McLees MP, Steiner JF, del Rio C, Burman WJ. The spectrum of engagement in HIV care and its relevance to test-and-treat strategies for prevention of HIV infection. Clinical Infectious Diseases. 2011;52(6):793-800.",
        "year": 2011,
        "authors_short": "Gardner et al.",
        "notes": "Foundational paper that defined the staged HIV engagement cascade and established the conditional-retention logic now reused across therapeutic areas to locate system gaps and target test-and-treat policy."
      },
      {
        "role": "explain",
        "doi": "10.1056/nejmsa1213829",
        "url": "https://doi.org/10.1056/nejmsa1213829",
        "citation_text": "Ali MK, Bullard KM, Saaddine JB, Cowie CC, Imperatore G, Gregg EW. Achievement of goals in U.S. diabetes care, 1999-2010. New England Journal of Medicine. 2013;368(17):1613-1624.",
        "year": 2013,
        "authors_short": "Ali et al.",
        "notes": "Decomposes U.S. diabetes goal attainment by stage, making explicit how a single aggregate control rate masks stage-specific failures -- the canonical illustration of why the cascade view is more actionable than a scalar."
      },
      {
        "role": "demonstrate",
        "doi": "10.1016/j.jacc.2018.04.070",
        "url": "https://doi.org/10.1016/j.jacc.2018.04.070",
        "citation_text": "Greene SJ, Butler J, Albert NM, et al. Medical therapy for heart failure with reduced ejection fraction: the CHAMP-HF registry. Journal of the American College of Cardiology. 2018;72(4):351-366.",
        "year": 2018,
        "authors_short": "Greene et al.",
        "notes": "Registry demonstration of the guideline-directed medical therapy cascade in HFrEF, quantifying the large gaps between eligibility and actual use (and target dosing) at each step in real-world practice."
      },
      {
        "role": "use",
        "doi": "10.1080/00952990.2018.1546862",
        "url": "https://doi.org/10.1080/00952990.2018.1546862",
        "citation_text": "Williams AR, Nunes EV, Bisaga A, Levin FR, Olfson M. Development of a cascade of care for responding to the opioid epidemic. The American Journal of Drug and Alcohol Abuse. 2019;45(1):1-10.",
        "year": 2019,
        "authors_short": "Williams et al.",
        "notes": "Applied construction of an OUD cascade (diagnosed -> medication for OUD initiated -> retained) in real-world data, showing the framework operationalized for outcomes measurement and policy beyond its HIV origins."
      }
    ],
    "plain_language_summary": "A cascade of care tracks a population of patients through a series of ordered steps — from being diagnosed with a disease all the way to achieving a good health outcome — and counts how many people are still in the pipeline at each step. The picture it produces is a shrinking funnel: you can see not only the final fraction who reached the goal but exactly where along the journey the largest groups of patients fell away. This makes it directly useful for answering 'where should a health system focus first?' — a question that a single summary number like '22% of patients are well-controlled' cannot answer. One honest limitation: a drop at any step does not automatically mean a failure in care; some patients appropriately stop treatment for clinical reasons, so the funnel must be interpreted with that in mind.",
    "key_terms": [
      {
        "term": "cascade stage",
        "definition": "One named checkpoint on the patient journey (for example, 'received treatment') at which an analyst counts how many people have reached that point."
      },
      {
        "term": "attrition",
        "definition": "The number of patients lost between one stage and the next — the patients who were present at step N but did not show up at step N+1."
      },
      {
        "term": "unconditional share",
        "definition": "Each stage's count expressed as a percentage of the very first (largest) group, so all percentages have the same denominator and you can compare losses across stages directly."
      },
      {
        "term": "conditional share",
        "definition": "Each stage's count expressed as a percentage of only the immediately prior stage, showing which single transition is the steepest drop."
      },
      {
        "term": "denominator",
        "definition": "The group of patients that a percentage is calculated against; in a cascade the denominator choice determines whether you are reporting unconditional or conditional retention."
      }
    ],
    "worked_example": {
      "scenario": "A regional health plan wants to know why its type 2 diabetes quality scores are low. An analyst pulls one year of records and builds a five-stage cascade for 10,000 adults with a new diabetes diagnosis, following them forward to see how many reach each checkpoint: diagnosed, linked to a primary-care provider, started on a diabetes medication, still taking that medication at 12 months, and finally recorded as having their blood sugar under control. The funnel table below shows what the data actually looked like.",
      "dataset": {
        "caption": "Population funnel table — each row is one cascade stage. n = patients who reached this stage. pct_of_prior = n divided by the count at the immediately preceding stage (rounded to one decimal).",
        "columns": [
          "stage",
          "n",
          "pct_of_prior"
        ],
        "rows": [
          [
            "1. Diagnosed (starting population)",
            10000,
            null
          ],
          [
            "2. Linked to primary-care provider",
            7700,
            "77.0%"
          ],
          [
            "3. Started a diabetes medication",
            5330,
            "69.2%"
          ],
          [
            "4. Still taking medication at 12 months (persistent)",
            3890,
            "73.0%"
          ],
          [
            "5. Blood sugar recorded as controlled (HbA1c < 7%)",
            2280,
            "58.6%"
          ]
        ]
      },
      "steps": [
        "Stage 1 is the whole cohort: 10,000 newly diagnosed patients. This is the denominator for all unconditional percentages.",
        "Stage 2 — linkage: 7,700 of 10,000 patients had a visit with a primary-care provider within 180 days of diagnosis. That is 77.0% of the starting group; 2,300 patients (23.0%) never showed up to a follow-up visit.",
        "Stage 3 — treatment initiation: of the 7,700 linked patients, 5,330 picked up a first diabetes prescription within 180 days. That is 69.2% of those linked (5,330 / 7,700 = 0.692). As a share of the original 10,000, only 53.3% ever started a medication (5,330 / 10,000).",
        "Stage 4 — persistence: of the 5,330 who started a medication, 3,890 were still filling their prescription consistently at the 12-month mark (a 'proportion of days covered' of at least 80%). That is 73.0% of initiators (3,890 / 5,330 = 0.730). The remaining 1,440 patients (27%) stopped or had large gaps in their fills.",
        "Stage 5 — control: of the 3,890 persistent patients, 2,280 had a blood-sugar lab result below the 7% target by end of year. That is 58.6% of persistent patients (2,280 / 3,890 = 0.586). Expressed as a share of the original 10,000 diagnosed patients, only 22.8% reached control (2,280 / 10,000).",
        "Reading the funnel: the largest single-stage loss is at initiation (2,370 linked patients never started a medication), closely followed by linkage (2,300 patients never saw a provider within the window). Together those two steps shed 4,670 patients, more than the 3,050 lost across the two later stages combined. A health plan that put adherence coaching first would be optimizing the wrong stage."
      ],
      "result": "Of 10,000 diagnosed patients, 2,280 (22.8%) reached blood-sugar control. The conditional shares reproduce each count to within rounding: 10,000 × 0.770 = 7,700; 7,700 × 0.692 ≈ 5,328 (table: 5,330); 5,330 × 0.730 ≈ 3,891 (table: 3,890); 3,890 × 0.586 ≈ 2,280. The largest single-stage absolute loss is at initiation (−2,370), narrowly ahead of linkage (−2,300). The steepest conditional drop is at control (only 58.6% of persistent patients), although part of that gap reflects missing labs rather than poor control. The upstream linkage and initiation gaps, which together shed 4,670 patients, are the primary intervention target."
    },
    "prerequisites": [
      "descriptive-epidemiology-rwe",
      "prevalence-point-period-annual-rwe",
      "drug-utilization"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Unconditional (population-share) cascade",
        "description": "Every stage reported as a share of the single original eligible denominator, so the curve is monotone non-increasing and absolute losses are directly comparable across gates.",
        "edge_cases": [
          "A patient can satisfy a later stage's criteria without a recorded earlier stage (e.g., a fill with no prior visit claim); decide whether stages are strictly nested or independently scored before reporting.",
          "Death or disenrollment before a stage must be distinguished from failure to reach it (competing-risk vs censoring)."
        ],
        "data_source_notes": "claims: hold the denominator fixed at the incident diagnosis cohort and require observable (non-MA-only) person-time through each stage's assessment window."
      },
      {
        "name": "Conditional (transition) cascade",
        "description": "Each stage reported as a share of the immediately prior stage, isolating which single transition is the rate-limiting bottleneck.",
        "edge_cases": [
          "Conditional percentages are easily misread as population shares; always co-report the absolute counts and the original denominator.",
          "The response-stage conditional is biased upward when the response measurement itself is conditional on engagement."
        ],
        "data_source_notes": "claims/ehr: the controlled-lab denominator is patients with a captured lab, not all persistent patients -- report measurement completeness alongside the conditional."
      },
      {
        "name": "Comparative / stratified cascade",
        "description": "Parallel cascades by payer, region, race/ethnicity, SDoH stratum, or pre/post a policy change, compared as stage-wise risk differences.",
        "edge_cases": [
          "Differential data completeness across strata (e.g., Medicaid vs commercial lab capture) can masquerade as a true equity gap; benchmark capture before attributing differences to care."
        ],
        "data_source_notes": "linked: prefer linked labs for stratified response comparisons so the gap reflects care, not coding."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "persistence-time-to-discontinuation",
        "pros_of_this": "Captures pre-treatment losses (undiagnosed, diagnosed-but-untreated) that persistence never observes; directly supports system-level, equity, and value-based-contracting questions across the full pathway.",
        "cons_of_this": "Requires defensible multi-stage definitions and date logic and is more sensitive to coding/enrollment choices than a single-drug duration measure; coarser time resolution within any one stage.",
        "when_to_prefer": "When the question is \"where from diagnosis to outcome are patients lost?\" rather than \"how long do patients stay on drug X once they start?\""
      },
      {
        "compared_to": "treatment-patterns-lines-of-therapy",
        "pros_of_this": "Makes the large pre-first-line losses and the terminal response/outcome stage visible; surfaces equity and access gaps across the entire journey in one object.",
        "cons_of_this": "Less granular than LOT for sequencing, switching, and later-line cost once patients have initiated.",
        "when_to_prefer": "Population-level gap and equity analysis; use LOT when the audience needs treatment complexity, later-line cost, or post-failure sequencing."
      },
      {
        "compared_to": "discrete-event-simulation-rwe",
        "pros_of_this": "Purely descriptive and directly observable from RWE; no structural assumptions beyond the stage definitions; immediately interpretable to clinicians and payers.",
        "cons_of_this": "Cannot project lifetime outcomes, budget impact, or counterfactual \"what if we closed the stage-3 gap?\"; those require a simulation that consumes the cascade's transition rates as inputs.",
        "when_to_prefer": "When the deliverable is a gap analysis, quality report, or equity dashboard rather than a forward-looking decision model."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Diagnosis from ICD on medical claims (incident phenotype: 1 inpatient or 2 outpatient on separate days); treatment from pharmacy fills (NDC + fill_date + days_supply) or J-codes for infused agents; response from LOINC/CPT-II (sparse). Require continuous, non-MA-only medical+pharmacy enrollment through each stage assessment window so absence is a true negative, not missingness. Use the earliest qualifying event as stage-entry date; censor at disenrollment/death and report the at-risk denominator per stage.",
      "ehr": "Richer for the response stage (actual labs, vitals, orders) but visit-driven and leaky to outside care. Use actual result/order dates and impose a minimum post-stage observation window before scoring \"not achieved\"; treat loss to follow-up as potentially informative.",
      "registry": "Often carries adjudicated stage membership (cancer stage, dialysis start, viral load) and is gold standard for specific steps, but the enrolled population is selected and cost/utilization capture is incomplete; link to claims for the full pathway.",
      "linked": "Best substrate -- claims for complete diagnosis/initiation capture, EHR for response labs and clinical detail, death index for the terminal stage. Reconcile order/fill/service/result dates across sources before assigning stage entry, or stage ordering inverts."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\nimport numpy as np\n\nPRE_DAYS, POST_DAYS = 365, 365      # continuous-enrollment requirement around index\nLINK_DAYS, INIT_DAYS = 180, 180     # windows for linkage and initiation\nPDC_THRESHOLD, PERSIST_DAYS = 0.80, 365\nGRACE_DAYS = 30                     # refill grace inside the persistence window\nA1C_TARGET = 7.0\n\ndef _incident_index(dx: pd.DataFrame) -> pd.DataFrame:\n    # Incident phenotype: 1 inpatient OR 2 outpatient diagnoses on separate days; index = first qualifying date.\n    dx = dx.sort_values([\"person_id\", \"dx_date\"])\n    ip = dx.loc[dx[\"is_inpatient\"], [\"person_id\", \"dx_date\"]]\n    op = dx.loc[~dx[\"is_inpatient\"]].drop_duplicates([\"person_id\", \"dx_date\"])\n    op2 = op.groupby(\"person_id\").filter(lambda g: g[\"dx_date\"].nunique() >= 2)\n    qualifying = pd.concat([ip, op2[[\"person_id\", \"dx_date\"]]])\n    return (qualifying.groupby(\"person_id\")[\"dx_date\"].min()\n                      .rename(\"index_date\").reset_index())\n\ndef _enrolled(idx: pd.DataFrame, enroll: pd.DataFrame) -> pd.Series:\n    # Require a single non-MA-only span covering [index - PRE_DAYS, index + POST_DAYS].\n    e = enroll.merge(idx, on=\"person_id\")\n    covers = (~e[\"ma_only\"] &\n              (e[\"enroll_start\"] <= e[\"index_date\"] - pd.Timedelta(days=PRE_DAYS)) &\n              (e[\"enroll_end\"]   >= e[\"index_date\"] + pd.Timedelta(days=POST_DAYS)))\n    return e.loc[covers, \"person_id\"].drop_duplicates()\n\ndef build_cascade(dx, rx, visits, labs, enroll):\n    idx = _incident_index(dx)\n    idx = idx[idx[\"person_id\"].isin(_enrolled(idx, enroll))].copy()\n\n    # Stage 2: linked to PCP/endo within LINK_DAYS of index.\n    v = visits[visits[\"pcp_or_endo\"]].merge(idx, on=\"person_id\")\n    linked = v.loc[(v[\"visit_date\"] >= v[\"index_date\"]) &\n                   (v[\"visit_date\"] <= v[\"index_date\"] + pd.Timedelta(days=LINK_DAYS)),\n                   
\"person_id\"].drop_duplicates()\n\n    # Stage 3: first antidiabetic fill within INIT_DAYS of index; capture initiation date for the persistence window.\n    r = rx[rx[\"antidiabetic\"]].merge(idx, on=\"person_id\")\n    init = r.loc[(r[\"fill_date\"] >= r[\"index_date\"]) &\n                 (r[\"fill_date\"] <= r[\"index_date\"] + pd.Timedelta(days=INIT_DAYS))]\n    init_date = init.groupby(\"person_id\")[\"fill_date\"].min().rename(\"init_date\")\n    initiated = init_date.index\n\n    # Stage 4: PDC >= threshold over PERSIST_DAYS after initiation. Each fill is credited for days_supply plus the\n    # grace, but never past the next fill or the window end, so the grace forgives short gaps without inflating PDC.\n    rp = rx[rx[\"antidiabetic\"]].merge(init_date, on=\"person_id\")\n    rp[\"win_end\"] = rp[\"init_date\"] + pd.Timedelta(days=PERSIST_DAYS)\n    in_win = rp[(rp[\"fill_date\"] >= rp[\"init_date\"]) & (rp[\"fill_date\"] < rp[\"win_end\"])]\n    in_win = in_win.sort_values([\"person_id\", \"fill_date\"]).copy()\n    next_start = in_win.groupby(\"person_id\")[\"fill_date\"].shift(-1).fillna(in_win[\"win_end\"])\n    horizon = (next_start - in_win[\"fill_date\"]).dt.days      # days until the next fill (or window end)\n    in_win[\"covered\"] = np.minimum(in_win[\"days_supply\"] + GRACE_DAYS, horizon)\n    pdc = (in_win.groupby(\"person_id\")[\"covered\"].sum() / PERSIST_DAYS).clip(upper=1.0)\n    persistent = pdc[pdc >= PDC_THRESHOLD].index\n\n    # Stage 5: most recent HbA1c in the outcome window below target (patients with no captured lab score as not controlled).\n    lab = labs.merge(idx, on=\"person_id\")\n    lab = lab[(lab[\"result_date\"] > lab[\"index_date\"]) &\n              (lab[\"result_date\"] <= lab[\"index_date\"] + pd.Timedelta(days=POST_DAYS))]\n    last_a1c = lab.sort_values(\"result_date\").groupby(\"person_id\")[\"a1c_value\"].last()\n    controlled = last_a1c[last_a1c < A1C_TARGET].index\n\n    # Nest the stages: each gate keeps only patients who also passed every earlier gate, so both curves are\n    # monotone non-increasing and pct_of_prior cannot exceed 1.\n    n_dx = idx[\"person_id\"].nunique()\n    stages, reached = {}, set(idx[\"person_id\"])\n    for name, ids in [(\"1_diagnosed\", idx[\"person_id\"]), (\"2_linked\", linked), (\"3_initiated\", initiated),\n                      (\"4_persistent\", persistent), (\"5_controlled\", controlled)]:\n        reached = reached & set(ids)\n        stages[name] = reached\n    out, prior_n = [], n_dx\n    for name, ids in stages.items():\n        n = len(set(ids) & set(stages[\"1_diagnosed\"]))  # keep within the eligible 
denominator\n        out.append({\"stage\": name, \"n\": n,\n                    \"pct_of_diagnosed\": n / n_dx,\n                    \"pct_of_prior\": np.nan if prior_n == 0 else n / prior_n})\n        prior_n = n\n    return pd.DataFrame(out)",
        "description": "Longitudinal cohort cascade from claims-style tables. Required inputs (cleaned, de-duplicated):\n  dx     : medical-claim diagnoses -> person_id, dx_date (datetime), is_inpatient (bool)        # T2D phenotype source\n  rx     : pharmacy fills          -> person_id, fill_date (datetime), antidiabetic (bool), days_supply (int)\n  visits : E/M encounters          -> person_id, visit_date (datetime), pcp_or_endo (bool)      # for the \"linked\" stage\n  labs   : HbA1c results           -> person_id, result_date (datetime), a1c_value (float)\n  enroll : enrollment spans        -> person_id, enroll_start, enroll_end, ma_only (bool)       # ma_only lacks FFS claims\nBuilds an incident cohort (1 IP or 2 OP on separate days), enforces non-MA-only continuous enrollment, then scores\nfive nested stages and reports both unconditional (share of diagnosed) and conditional (share of prior stage) curves.",
        "dependencies": [
          "pandas",
          "numpy"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\n\nPRE_DAYS <- 365L; POST_DAYS <- 365L\nLINK_DAYS <- 180L; INIT_DAYS <- 180L\nPDC_THRESHOLD <- 0.80; PERSIST_DAYS <- 365L; GRACE_DAYS <- 30L; A1C_TARGET <- 7.0\n\nbuild_cascade <- function(dx, rx, visits, labs, enroll) {\n  setDT(dx); setDT(rx); setDT(visits); setDT(labs); setDT(enroll)\n\n  # Incident phenotype: 1 inpatient OR 2 outpatient on separate days; index = earliest qualifying date.\n  ip  <- dx[is_inpatient == TRUE, .(person_id, dx_date)]\n  op2 <- unique(dx[is_inpatient == FALSE, .(person_id, dx_date)])\n  op2 <- op2[, if (uniqueN(dx_date) >= 2L) .SD, by = person_id]\n  idx <- rbind(ip, op2)[, .(index_date = min(dx_date)), by = person_id]\n\n  # Non-MA-only continuous enrollment spanning [index - PRE, index + POST].\n  e <- merge(enroll, idx, by = \"person_id\")\n  ok <- e[ma_only == FALSE &\n          enroll_start <= index_date - PRE_DAYS &\n          enroll_end   >= index_date + POST_DAYS, unique(person_id)]\n  idx <- idx[person_id %in% ok]  # %in% works for integer or character IDs; %chin% is character-only\n\n  # Stage 2: linkage within LINK_DAYS.\n  v <- merge(visits[pcp_or_endo == TRUE], idx, by = \"person_id\")\n  linked <- v[visit_date >= index_date & visit_date <= index_date + LINK_DAYS, unique(person_id)]\n\n  # Stage 3: first antidiabetic fill within INIT_DAYS; keep initiation date for the persistence window.\n  r <- merge(rx[antidiabetic == TRUE], idx, by = \"person_id\")\n  init <- r[fill_date >= index_date & fill_date <= index_date + INIT_DAYS,\n            .(init_date = min(fill_date)), by = person_id]\n  initiated <- init$person_id\n\n  # Stage 4: PDC >= threshold over PERSIST_DAYS. Each fill is credited for days_supply plus the grace, but never\n  # past the next fill or the window end, so the grace forgives short gaps without inflating PDC.\n  rp <- merge(rx[antidiabetic == TRUE], init, by = \"person_id\")\n  rp <- rp[fill_date >= init_date & fill_date < init_date + PERSIST_DAYS]\n  setorder(rp, person_id, fill_date)\n  rp[, next_start := shift(fill_date, type = \"lead\"), by = person_id]\n  rp[is.na(next_start), next_start := init_date + PERSIST_DAYS]\n  rp[, covered := pmin(days_supply + GRACE_DAYS, as.integer(next_start - fill_date))]\n  pdc <- rp[, .(pdc = pmin(sum(covered) / PERSIST_DAYS, 1.0)), by = person_id]\n  persistent <- pdc[pdc >= PDC_THRESHOLD, 
person_id]\n\n  # Stage 5: most recent HbA1c in the outcome window below target.\n  l <- merge(labs, idx, by = \"person_id\")\n  l <- l[result_date > index_date & result_date <= index_date + POST_DAYS]\n  setorder(l, person_id, result_date)\n  last_a1c <- l[, .(a1c = a1c_value[.N]), by = person_id]\n  controlled <- last_a1c[a1c < A1C_TARGET, person_id]\n\n  n_dx <- idx[, uniqueN(person_id)]\n  ids <- list(`1_diagnosed` = idx$person_id, `2_linked` = linked,\n              `3_initiated` = initiated, `4_persistent` = persistent,\n              `5_controlled` = controlled)\n  # Nest the stages: each gate keeps only patients who also passed every earlier gate, so both curves are\n  # monotone non-increasing and pct_of_prior cannot exceed 1.\n  ids <- setNames(Reduce(intersect, ids, accumulate = TRUE), names(ids))\n  prior <- n_dx\n  res <- rbindlist(lapply(names(ids), function(nm) {\n    n <- length(ids[[nm]])\n    row <- data.table(stage = nm, n = n,\n                      pct_of_diagnosed = n / n_dx,\n                      pct_of_prior = if (prior == 0) NA_real_ else n / prior)\n    prior <<- n  # updates `prior` in build_cascade's frame for the next stage\n    row\n  }))\n  res[]\n}",
        "description": "Longitudinal cohort cascade with data.table. Inputs mirror the Python version:\n  dx     : person_id, dx_date (Date), is_inpatient (logical)\n  rx     : person_id, fill_date (Date), antidiabetic (logical), days_supply (integer)\n  visits : person_id, visit_date (Date), pcp_or_endo (logical)\n  labs   : person_id, result_date (Date), a1c_value (numeric)\n  enroll : person_id, enroll_start (Date), enroll_end (Date), ma_only (logical)\nReturns one row per stage with unconditional and conditional retention.",
        "dependencies": [
          "data.table"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "%let pre = 365; %let post = 365;\n%let link = 180; %let init = 180;\n%let persist = 365; %let grace = 30; %let pdc_thr = 0.80; %let a1c_target = 7.0;\n\n/* Incident phenotype: 1 inpatient OR 2 outpatient diagnoses on separate days; index = first qualifying date. */\nproc sql;\n  create table _ip as select distinct person_id, dx_date from work.dx where is_inpatient = 1;\n  create table _op as\n    select person_id, dx_date from work.dx where is_inpatient = 0\n    group by person_id, dx_date;                              /* distinct OP days */\n  create table _op2 as\n    select person_id from (select distinct person_id, dx_date from _op)\n    group by person_id having count(*) >= 2;\n  create table idx as\n    select person_id, min(dx_date) as index_date format=date9.\n    from (select person_id, dx_date from _ip\n          union all\n          select o.person_id, o.dx_date from _op o\n            where o.person_id in (select person_id from _op2))\n    group by person_id;\nquit;\n\n/* Non-MA-only continuous enrollment spanning [index - &pre, index + &post]. */\nproc sql;\n  create table cohort as\n  select i.person_id, i.index_date\n  from idx i\n  where exists (select 1 from work.enroll e\n                where e.person_id = i.person_id and e.ma_only = 0\n                  and e.enroll_start <= i.index_date - &pre\n                  and e.enroll_end   >= i.index_date + &post);\nquit;\n\n/* Stages 2 (linkage), 3 (initiation date), 5 (last HbA1c) by SQL; stage 4 (PDC) via DATA step below. 
*/\nproc sql;\n  create table _linked as\n    select distinct c.person_id from cohort c\n    where exists (select 1 from work.visits v\n                  where v.person_id = c.person_id and v.pcp_or_endo = 1\n                    and v.visit_date between c.index_date and c.index_date + &link);\n  create table _init as\n    select r.person_id, min(r.fill_date) as init_date format=date9.\n    from work.rx r inner join cohort c on r.person_id = c.person_id\n    where r.antidiabetic = 1 and r.fill_date between c.index_date and c.index_date + &init\n    group by r.person_id;\n  create table _lasta1c as\n    select l.person_id, l.a1c_value from work.labs l inner join cohort c\n      on l.person_id = c.person_id\n    where l.result_date > c.index_date and l.result_date <= c.index_date + &post\n    group by l.person_id\n    having l.result_date = max(l.result_date);               /* most recent in-window result */\nquit;\n\n/* Stage 4: PDC over &persist days after initiation, grace-extended days_supply capped at the window. */\nproc sql;\n  create table _pdcfills as\n    select r.person_id, min(r.days_supply + &grace, &persist) as covered\n    from work.rx r inner join _init t on r.person_id = t.person_id\n    where r.antidiabetic = 1 and r.fill_date >= t.init_date\n      and r.fill_date < t.init_date + &persist;\nquit;\nproc sql;\n  create table _persist as\n    select person_id from _pdcfills\n    group by person_id having min(sum(covered) / &persist, 1) >= &pdc_thr;\nquit;\n\n/* Assemble per-person stage flags within the eligible (diagnosed) denominator. 
*/\nproc sql;\n  create table cascade as\n  select c.person_id,\n         1 as diagnosed,\n         (c.person_id in (select person_id from _linked))                        as linked,\n         (c.person_id in (select person_id from _init))                          as initiated,\n         (c.person_id in (select person_id from _persist))                       as persistent,\n         (c.person_id in (select person_id from _lasta1c where a1c_value < &a1c_target)) as controlled\n  from cohort c;\nquit;\n\nproc freq data=cascade;\n  tables diagnosed linked initiated persistent controlled / nocum;\n  /* Conditional %% (share of prior stage) and absolute losses computed in reporting layer for the waterfall. */\nrun;",
        "description": "Longitudinal cohort cascade in SAS (PROC SQL + PROC FREQ). Required input datasets (post data-management):\n  work.dx     : person_id, dx_date, is_inpatient (0/1)\n  work.rx     : person_id, fill_date, antidiabetic (0/1), days_supply\n  work.visits : person_id, visit_date, pcp_or_endo (0/1)\n  work.labs   : person_id, result_date, a1c_value\n  work.enroll : person_id, enroll_start, enroll_end, ma_only (0/1)\nProduces work.cascade with stage flags, then PROC FREQ for stage counts; compute conditional/unconditional\npercentages and absolute losses downstream for the waterfall plot.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  A[\"Diagnosed / eligible<br/>184,200 (100%)\"] -->|\"77% linked\"| B[\"Linked to care<br/>141,900\"]\n  B -->|\"69% of linked\"| C[\"Initiated treatment<br/>98,300 (53% of dx)\"]\n  C -->|\"73% of initiated\"| D[\"Persistent 12 mo (PDC>=0.80)<br/>71,800 (39% of dx)\"]\n  D -->|\"59% of persistent\"| E[\"Controlled HbA1c<br/>42,100 (22.8% of dx)\"]\n  style A fill:#e3f2fd\n  style E fill:#c8e6c9",
        "caption": "Longitudinal cohort cascade for type 2 diabetes in U.S. claims data. Edge labels are conditional (share of the prior stage); node labels show absolute counts and the unconditional share of the diagnosed denominator. The largest absolute loss is at initiation.",
        "alt_text": "Funnel flowchart showing sequential attrition from diagnosis through linkage, initiation, persistence, and HbA1c control, with conditional retention percentages on the edges.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  Diag[\"Diagnosis\"] --> Init[\"Initiation gap<br/>largest absolute loss\"]\n  Init --> Persist[\"Persistence gap<br/>~27% of initiators lost\"]\n  Persist --> Resp[\"Response gap<br/>~41% of persistent not controlled<br/>(partly measurement)\"]\n  Init -. \"equity gap by ADI quartile\" .-> Persist",
        "caption": "The three high-impact leakage points and the equity signal that drive intervention targeting. The response-gap note flags that part of that drop is measurement (the controlled-lab denominator is itself conditional on a captured lab).",
        "alt_text": "Simplified left-to-right schematic highlighting the initiation, persistence, and response leakage points with an equity annotation linking initiation and persistence by area deprivation index.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "part_of",
        "target_slug": "treatment-patterns-lines-of-therapy",
        "notes": "The cascade supplies the pre-initiation and population-level view; lines-of-therapy describe post-initiation sequencing. Both are needed for a complete treatment-pattern picture."
      },
      {
        "relation_type": "produces",
        "target_slug": "healthcare-costs-pppm-pppy-pmpm",
        "notes": "Stage-specific HCRU and cost can be attached to each cascade gate to show where spend concentrates and what closing a gap would cost or save."
      },
      {
        "relation_type": "used_with",
        "target_slug": "health-economic-modeling-methods-rwe",
        "notes": "Cascade conditional probabilities and absolute counts are direct inputs to uptake, persistence, and response parameters in Markov/DES decision models."
      },
      {
        "relation_type": "see_also",
        "target_slug": "persistence-time-to-discontinuation",
        "notes": "Persistence is one middle stage of the cascade, with finer time resolution but blind to pre-initiation losses."
      },
      {
        "relation_type": "see_also",
        "target_slug": "discrete-event-simulation-rwe",
        "notes": "Simulation consumes cascade-derived transition rates to project the lifetime or budget impact of closing a specific gate -- the forward-looking complement to the descriptive cascade."
      },
      {
        "relation_type": "affects",
        "target_slug": "sdoh-social-determinants-of-health",
        "notes": "Stratified cascades by SDoH stratum, race/ethnicity, geography, or payer are a primary tool for documenting and monitoring health-equity gaps across the real-world care pathway."
      }
    ],
    "aliases": [
      "care cascade",
      "treatment cascade",
      "HIV care cascade",
      "diabetes cascade",
      "GDMT cascade",
      "sequential attrition analysis",
      "care continuum"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "hta",
      "pqa-cms"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "case-case-time-control",
    "name": "Case-Case-Time-Control Design",
    "short_definition": "A self-controlled case-only design that estimates a transient drug effect by dividing the cases' case-crossover exposure odds ratio by the same self-matched odds ratio computed in a set of future cases, removing both time-invariant confounding and the exposure-time trend that biases the case-crossover.",
    "long_description": "The **case-case-time-control (CCTC) design** is a self-controlled, case-only design for transient (acute, short-induction)\ndrug effects. It corrects the one bias the case-crossover cannot fix on its own: a secular **exposure-time trend**. The\nestimator is a *ratio of two conditional odds ratios*. First, run an ordinary **case-crossover** in the cases: within each\ncase, compare exposure in a hazard window just before the event to exposure in one or more earlier reference windows. This\nself-matched odds ratio (OR_case) cancels every time-invariant confounder — measured or not — because each person is their\nown control. But OR_case is contaminated if drug use is trending over calendar time (uptake of a new agent, decline after a\nwarning), because the hazard window is, by construction, *later* than the reference window. CCTC estimates that trend from a\ncomparator group of **future cases** — people who experience the same outcome but *later* — by running the identical\nself-matched comparison shifted onto their (pre-event) calendar period, yielding OR_futurecase. The causal estimate is\n**OR_CCTC = OR_case / OR_futurecase**, a ratio of conditional ORs typically obtained as the exposure × group interaction in\na single conditional logistic model. Design and estimator are inseparable here: building the windows without the ratio is\nnot the method.\n\n**Conceptual lineage (the spine).** *Maclure 1991 (case-crossover)* introduced within-person comparison of hazard vs\nreference windows, removing fixed confounding but assuming no exposure-time trend. *Suissa 1995 (case-time-control)* added\na concurrent control group — **non-cases** — to estimate and divide out that trend. The fatal weakness of case-time-control\nis that drug-use trends in non-cases need not equal trends in cases: if the disease itself drives prescribing (confounding\nby indication has a *temporal* analogue), non-cases mis-measure the cases' counterfactual trend. 
*Wang 2011 (CCTC)* replaced\nthe non-case time controls with **future cases**, who share the cases' indication, severity trajectory, and prescribing\ndynamics, so their pre-event exposure trend is a far better proxy for what the current cases' trend would have been absent\nthe event. CCTC is therefore best read as case-time-control with a sharper choice of time control.\n\n**Pros, cons, and trade-offs.**\n- **vs case-crossover:** CCTC removes the exposure-time-trend bias that case-crossover silently carries whenever drug use\n  drifts over calendar time (almost always true for newer drugs, post-warning drugs, or anything seasonal). Cost: it needs\n  a future-case sample, roughly doubles data requirements, and inflates variance (you are dividing two noisy ORs). **Prefer\n  case-crossover** only when you can defend a flat exposure trend (e.g., a long-established drug at steady-state use);\n  otherwise **prefer CCTC**.\n- **vs case-time-control (non-case controls):** CCTC's future cases share indication and channeling with the cases, so the\n  trend correction is less biased than Suissa's non-case correction whenever the outcome influences exposure trends.\n  Cost: future cases are scarcer than non-cases (you must wait for them to accrue) and require enough post-index follow-up\n  to identify them. **Prefer CCTC** in pharmacoepidemiology, where indication-driven prescribing is the norm.\n- **vs self-controlled case series (SCCS):** SCCS models the full observation time and is more efficient for recurrent or\n  well-defined transient exposures, but it requires that the event not censor or alter future exposure (the\n  event-dependent-exposure assumption) and that exposures be modeled as time-varying covariates. 
CCTC makes no parametric\n  rate model and tolerates event-dependent exposure better, but estimates only a transient (window-contrast) effect.\n  **Prefer SCCS** for vaccine-style point exposures with clean rate models; **prefer CCTC** for chronically-trending drug\n  exposures where event-dependent exposure and secular trend coexist.\n- **vs cohort / active-comparator new-user designs:** CCTC needs no measured confounders and no external comparator cohort,\n  a decisive advantage when key confounders are unrecorded. Cost: it cannot estimate effects of *time-invariant* exposures\n  (a within-person design has no variation in them), gives only relative (not absolute) effects, and answers a transient-\n  effect question, not a cumulative or long-latency one. **Prefer a cohort/ACNU design** for chronic cumulative effects or\n  when absolute risk is required.\n\n**When to use.** Acute outcomes with a plausible *transient* drug trigger (induction period of days to weeks): e.g.,\nfracture or fall after a sedating drug, GI bleed after an NSAID, MI after a COX-2 inhibitor, hip fracture after a\nbenzodiazepine. Settings where the dominant confounders (frailty, baseline disease severity, genetics) are time-invariant\nand unmeasured, so a within-person design is attractive — *and* where drug use is clearly trending over calendar time, so a\nplain case-crossover would be biased. CCTC is well suited to administrative claims, where dispensing dates and a large pool\nof future cases are both available.\n\n**When NOT to use — and when it is actively misleading.**\n- **The exposure trend differs between current and future cases.** This is the load-bearing assumption: future cases'\n  pre-event exposure trend must equal the current cases' counterfactual trend. 
It breaks under rapid drug launch/uptake or\n  post-warning withdrawal (the trend is changing too fast for a fixed future offset to track), mid-window guideline changes,\n  formulary switches, or shifting diagnostic criteria that redefine who becomes a case over calendar time. If the trend is\n  non-linear or regime-changing across the study period, OR_CCTC is not interpretable.\n- **The effect is cumulative or long-latency, not transient.** A within-person hazard-vs-reference contrast cannot capture\n  effects that build over months/years; the reference window would itself be \"exposed\" in causal terms. Use a cohort design.\n- **The exposure is stable within person** (time-invariant or near-constant chronic use): there is no within-person\n  discordance to estimate from, and the OR is undefined or unstable. Self-controlled designs need exposure *switching*.\n- **Event-dependent exposure with strong feedback.** If the event itself sharply changes exposure (hospitalization stops\n  outpatient fills), the reference windows must be chosen *before* the event and any post-event period excluded; naive\n  bidirectional windows reintroduce bias.\n- **Too few future cases or too-short follow-up.** Sparse future cases make OR_futurecase unstable; dividing by a noisy\n  denominator can produce wild point estimates and intervals. Pre-specify a minimum future-case sample.\n\n**Data-source operational depth.**\n- **Claims (FFS):** The natural substrate. Exposure on a given day is reconstructed from `fill_date` + `days_supply`\n  (a person is \"exposed\" on day d if some fill covers d). 
The hazard window is the N days immediately before the event\n  date; reference windows are equal-length windows further back (and, for future cases, windows before *their* later event).\n  Failure modes: (1) **Medicare Advantage / capitated person-time lacks fee-for-service pharmacy claims**, so \"unexposed\"\n  windows can be unobserved exposure — restrict to enrollees with full Part D (or commercial pharmacy benefit) covering all\n  windows. (2) 90-day mail-order and stockpiling distort daily exposure status; cap or carry-over `days_supply` consistently\n  in cases and future cases. (3) Reference and hazard windows must lie inside continuous enrollment, or window exposure is\n  censored asymmetrically.\n- **EHR:** Exposure is the *order/administration*, not the dispensing; outpatient adherence is unknown, so transient\n  triggering is mismeasured unless linked to fills. Visit-driven capture means windows with no encounter look \"unexposed\"\n  spuriously. Outcome dates from notes/labs may lag the true event, blurring the hazard window — anchor on the most\n  objective date available (e.g., admission date).\n- **Registry:** Adjudicated outcome timing is a strength (clean hazard-window anchoring); pharmacy exposure is usually weak\n  and must be linked to claims for daily exposure reconstruction.\n- **Linked claims–EHR:** Ideal — EHR for precise event dating and indication, claims for complete day-level exposure — but\n  reconcile order/fill/service dates before assigning windows; a date discrepancy of days can move a fill across the\n  hazard/reference boundary and flip a discordant pair.\n\n**Worked claims example.** Question: does initiating a sedating antipsychotic acutely trigger hip fracture in elderly\nnursing-home residents (Wang 2011's setting)? 
(1) **Cases:** residents with an incident hip-fracture admission;\n`event_date` = admission date; require continuous Part A/B/D enrollment with no MA-only person-time spanning all windows.\n(2) **Windows:** hazard window = days 1–30 before `event_date`; reference window = days 91–120 before `event_date`\n(equal 30-day length, separated by a washout to avoid carry-over). A person is \"exposed\" in a window if any antipsychotic\nfill (`fill_date` + `days_supply`) covers ≥1 day of it. (3) **Case-crossover OR (OR_case):** conditional logistic\nregression stratified on `person_id`, two rows per case (hazard vs reference), exposure as the predictor, say OR_case =\n2.4. This still embeds the rising secular use of antipsychotics in this population. (4) **Future cases:** residents who\nfracture *later* in the data; apply the *same* day 1–30 hazard / day 91–120 reference windows anchored on their own later\nevent, capturing the calendar-time exposure trend, say OR_futurecase = 1.6. (5) **CCTC estimate:**\nOR_CCTC = OR_case / OR_futurecase = 2.4 / 1.6 = **1.5**, the trend-adjusted transient effect. Operationally this is one\nconditional logistic model on the pooled case + future-case rows with an exposure × group(current vs future) interaction;\nthe interaction odds ratio *is* OR_CCTC, and its model-based CI propagates both ORs' uncertainty.",
    "primary_category": "Study_Design",
    "tags": [
      "self-controlled",
      "case-only",
      "case-crossover",
      "exposure-trend-bias",
      "within-person",
      "transient-effects",
      "pharmacoepidemiology",
      "confounding-control"
    ],
    "applies_to_study_types": [
      "case_case_time_control"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1097/EDE.0b013e31821d09cd",
        "url": "https://doi.org/10.1097/EDE.0b013e31821d09cd",
        "citation_text": "Wang S, Linkletter C, Maclure M, Dore D, Mor V, Buka S, Wellenius GA. Future cases as present controls to adjust for exposure-trend bias in case-only studies. Epidemiology. 2011;22(4):568-574.",
        "year": 2011,
        "authors_short": "Wang et al.",
        "notes": "Defining paper for the case-case-time-control design; introduces future cases as the time-control group and the ratio-of-conditional-odds-ratios estimator."
      },
      {
        "role": "explain",
        "doi": "10.1093/oxfordjournals.aje.a115853",
        "url": "https://doi.org/10.1093/oxfordjournals.aje.a115853",
        "citation_text": "Maclure M. The case-crossover design: a method for studying transient effects on the risk of acute events. American Journal of Epidemiology. 1991;133(2):144-153.",
        "year": 1991,
        "authors_short": "Maclure",
        "notes": "Parent design. CCTC builds the within-case hazard-vs-reference comparison on the case-crossover, then corrects its exposure-time-trend assumption."
      },
      {
        "role": "explain",
        "doi": "10.1097/00001648-199505000-00010",
        "url": "https://doi.org/10.1097/00001648-199505000-00010",
        "citation_text": "Suissa S. The case-time-control design. Epidemiology. 1995;6(3):248-253.",
        "year": 1995,
        "authors_short": "Suissa",
        "notes": "The precursor CCTC fixes. Case-time-control uses non-cases as time controls; CCTC replaces them with future cases to avoid the assumption that exposure trends are equal in non-cases and cases."
      },
      {
        "role": "demonstrate",
        "doi": "10.1371/journal.pone.0049444",
        "url": "https://doi.org/10.1371/journal.pone.0049444",
        "citation_text": "Nordmann S, Biard L, Ravaud P, Esposito-Farèse M, Tubach F. Case-only designs in pharmacoepidemiology: a systematic review. PLoS ONE. 2012;7(11):e49444.",
        "year": 2012,
        "authors_short": "Nordmann et al.",
        "notes": "Systematic review situating case-case-time-control among case-only designs (case-crossover, case-time-control, SCCS) and tabulating assumptions and applied use."
      }
    ],
    "plain_language_summary": "The case-case-time-control design answers the question: does taking a drug in the days just before an acute event actually raise the risk, after accounting for the fact that more people are using the drug over time? For each person who had the event, it compares whether they were on the drug in the 30 days right before the event with whether they were on it in an earlier 30-day window of their own history, so their stable background characteristics cancel out. It then corrects a hidden problem with that simple within-person comparison: if drug use in the whole population has been rising over calendar time, the window closest to the event will tend to look more exposed just because of the calendar, not because of a real risk. To fix that, the design borrows a second group, people who had the same event later, and runs the identical within-person comparison on them to measure how much of the difference was just the calendar trend; dividing the first result by the second gives the trend-corrected estimate.",
    "key_terms": [
      {
        "term": "hazard window",
        "definition": "The 30 days immediately before the acute event, the period when the drug might have triggered the outcome."
      },
      {
        "term": "referent window",
        "definition": "An earlier 30-day stretch from the same person, used as that person's own baseline for comparison; a washout gap separates it from the hazard window so leftover drug supply from one period does not bleed into the other."
      },
      {
        "term": "exposure-time trend",
        "definition": "A gradual rise or fall in how many people are using a drug across calendar time, which can make a simple within-person comparison look like a drug effect even when none exists."
      },
      {
        "term": "future cases",
        "definition": "People who experienced the same outcome but at a later calendar date; their within-person comparison estimates how much drug use was drifting over time, so that drift can be divided out of the main result."
      },
      {
        "term": "fill_date",
        "definition": "The date a pharmacy dispensed a prescription; paired with days_supply, it tells you which calendar days a patient had the drug on hand."
      },
      {
        "term": "days_supply",
        "definition": "How many days one dispensed prescription is meant to last, used to determine which days a patient was covered by each fill."
      }
    ],
    "worked_example": {
      "scenario": "We want to know whether starting an antipsychotic pill acutely raises the risk of a hip fracture in elderly nursing-home residents. We use insurance claims, where pharmacy fills record the drug name, the fill date, and how many days that supply lasts. Two residents fractured their hip: Margaret fractured on 2024-05-30 (she is a current case), and Ruth fractured on 2024-09-27 (she is a future case, the same outcome but later in calendar time, used to measure the drug-use trend). We look at each woman's own prescription history to decide whether she was on an antipsychotic in her hazard window and in her referent window.",
      "dataset": {
        "caption": "Pharmacy fills for two residents in the claims data (simplified to one drug each)",
        "columns": [
          "person_id",
          "name",
          "group",
          "event_date",
          "fill_date",
          "drug",
          "days_supply"
        ],
        "rows": [
          [
            "1001",
            "Margaret",
            "current",
            "2024-05-30",
            "2024-04-15",
            "quetiapine",
            30
          ],
          [
            "1001",
            "Margaret",
            "current",
            "2024-05-30",
            "2024-01-10",
            "quetiapine",
            30
          ],
          [
            "2002",
            "Ruth",
            "future",
            "2024-09-27",
            "2024-08-12",
            "quetiapine",
            30
          ],
          [
            "2002",
            "Ruth",
            "future",
            "2024-09-27",
            "2024-05-05",
            "quetiapine",
            30
          ]
        ]
      },
      "steps": [
        "Define windows for Margaret (current case, event 2024-05-30): hazard window = 2024-04-30 to 2024-05-29 (the 30 days before the fracture); referent window = 2024-01-01 to 2024-01-30 (days 121-150 before the event, leaving a 90-day washout between the referent and hazard windows).",
        "Check Margaret's fills against her windows: her 2024-04-15 fill (30 days supply) covers through 2024-05-14, which overlaps the hazard window (Apr 30 - May 29), so she is EXPOSED in the hazard window. Her 2024-01-10 fill (30 days supply) covers through 2024-02-08, which overlaps the referent window (Jan 1 - Jan 30), so she is EXPOSED in the referent window. Margaret is concordant (exposed in both), so her pair contributes no discordance to the within-person comparison.",
        "Now define windows for Ruth (future case, event 2024-09-27): same rule, shifted to her own event date. Hazard window = 2024-08-28 to 2024-09-26 (days 1-30 before the event); referent window = 2024-04-30 to 2024-05-29 (days 121-150 before the event).",
        "Check Ruth's fills: her 2024-08-12 fill covers through 2024-09-10, overlapping her hazard window, so EXPOSED in hazard. Her 2024-05-05 fill covers through 2024-06-03, which overlaps her referent window (Apr 30 - May 29), so EXPOSED in referent. Ruth is also concordant in this simple two-person illustration.",
        "In a full study with many such pairs, the within-current-case comparison yields OR_case = 2.4 (the raw within-person odds ratio suggesting higher exposure near the fracture). The within-future-case comparison yields OR_futurecase = 1.6 (capturing the secular rise in antipsychotic use over calendar time that makes every later window look more exposed).",
        "The trend-adjusted estimate is OR_CCTC = OR_case divided by OR_futurecase = 2.4 / 1.6 = 1.5. After removing the calendar trend, residents had 1.5 times the odds of antipsychotic exposure in the 30 days before the fracture as in their earlier referent window, which is read as a 1.5-fold (not 2.4-fold) short-term increase in the odds of fracture while exposed."
      ],
      "result": "OR_CCTC = 2.4 / 1.6 = 1.5. The raw within-person signal of 2.4 was inflated by rising antipsychotic use over calendar time; after the future-case trend correction the estimate drops to 1.5.",
      "timeline_spec": {
        "title": "CCTC windows: one current case (Margaret) and one future case (Ruth) illustrating the trend correction",
        "window": {
          "start": "2024-01-01",
          "end": "2024-10-15",
          "label": "Observation period spanning both cases"
        },
        "events": [
          {
            "label": "Margaret fill 1 (30d supply)",
            "start": "2024-01-10",
            "length_days": 30,
            "quantity": "30 days_supply"
          },
          {
            "label": "Margaret fill 2 (30d supply)",
            "start": "2024-04-15",
            "length_days": 30,
            "quantity": "30 days_supply"
          },
          {
            "label": "Ruth fill 1 (30d supply)",
            "start": "2024-05-05",
            "length_days": 30,
            "quantity": "30 days_supply"
          },
          {
            "label": "Ruth fill 2 (30d supply)",
            "start": "2024-08-12",
            "length_days": 30,
            "quantity": "30 days_supply"
          }
        ],
        "spans": [
          {
            "kind": "unexposed",
            "start": "2024-01-01",
            "end": "2024-01-30",
            "label": "Margaret referent window (30d, days 121-150 before fracture)"
          },
          {
            "kind": "exposed",
            "start": "2024-04-30",
            "end": "2024-05-29",
            "label": "Margaret hazard window (30d before fracture)"
          },
          {
            "kind": "gap",
            "start": "2024-05-30",
            "end": "2024-05-30",
            "label": "Margaret fracture (2024-05-30)"
          },
          {
            "kind": "unexposed",
            "start": "2024-04-30",
            "end": "2024-05-29",
            "label": "Ruth referent window (30d, days 121-150 before fracture)"
          },
          {
            "kind": "exposed",
            "start": "2024-08-28",
            "end": "2024-09-26",
            "label": "Ruth hazard window (30d before fracture)"
          },
          {
            "kind": "gap",
            "start": "2024-09-27",
            "end": "2024-09-27",
            "label": "Ruth fracture (2024-09-27)"
          }
        ],
        "result": {
          "label": "OR_case = 2.4 (current cases); OR_futurecase = 1.6 (future cases); OR_CCTC = 2.4 / 1.6 = 1.5",
          "value": 1.5
        },
        "caption": "Two residents share the same window structure anchored on their own event dates. Margaret is a current case (fracture May 30); Ruth is a future case (fracture Sep 27) used to estimate the calendar-time trend in antipsychotic use. Dividing the within-current-case odds ratio (2.4) by the within-future-case odds ratio (1.6) removes the trend and yields the corrected estimate of 1.5.",
        "alt_text": "Horizontal timeline from January to October 2024 showing fill bars and shaded referent and hazard windows for Margaret (current case, fracture May 30) and Ruth (future case, fracture Sep 27), illustrating how the same window structure applied to a later case estimates the secular drug-use trend."
      }
    },
    "prerequisites": [
      "case-crossover",
      "case-time-control",
      "self-controlled-case-series"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Single vs multiple reference (washout-separated) windows",
        "description": "One reference window per case is simplest; multiple earlier reference windows per case increase precision and let exposure-trend curvature be inspected, at the cost of a stronger within-person exchangeability assumption across more distant calendar time.",
        "edge_cases": [
          "Reference windows too close to the hazard window leak the acute effect (carry-over) and attenuate OR_case; too far back they straddle a different exposure regime and inflate trend bias.",
          "All windows must lie within continuous, FFS-observable enrollment or exposure is censored asymmetrically."
        ],
        "data_source_notes": "claims: place a washout gap between hazard and reference windows so days_supply from one does not bleed into the other; verify enrollment covers every window."
      },
      {
        "name": "Future-case offset and matching",
        "description": "Future cases are matched to current cases (commonly on calendar-time offset, age, sex) and their identical window structure is anchored on their own later event to estimate the secular trend.",
        "edge_cases": [
          "Too few future cases (rare outcome, short follow-up) makes OR_futurecase unstable and the CCTC ratio volatile; pre-specify a minimum future-case count and report the denominator OR separately.",
          "If the exposure trend is non-linear across the study period, a fixed future-case offset cannot track it and the estimate is uninterpretable."
        ],
        "data_source_notes": "claims: future cases require sufficient post-index follow-up to accrue; truncated data tails starve the denominator."
      },
      {
        "name": "Pooled conditional-logistic interaction estimator",
        "description": "Rather than computing OR_case and OR_futurecase separately and dividing, pool current- and future-case rows into one conditional logistic model with an exposure × group interaction. With current cases coded 1 and future cases as the reference, the interaction OR is OR_CCTC; under the opposite coding (future = 1), OR_CCTC is its reciprocal. Either way the CI propagates the uncertainty of both components.",
        "edge_cases": [
          "Sparse discordant strata yield separation; use exact conditional logistic or penalized (Firth) estimation."
        ],
        "data_source_notes": "Equivalent point estimate to the manual ratio, but model-based variance is preferred over delta-method or bootstrap on the ratio."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Case-crossover design",
        "pros_of_this": "Removes exposure-time-trend bias (secular drift in drug use) that the case-crossover silently carries; retains full control of time-invariant confounding.",
        "cons_of_this": "Requires a future-case sample, roughly doubles data needs, and increases variance because two odds ratios are divided.",
        "when_to_prefer": "Whenever drug use is trending over calendar time (new, withdrawn, seasonal, or guideline-affected drugs); use plain case-crossover only when a flat exposure trend is defensible."
      },
      {
        "compared_to": "Case-time-control design (non-case controls)",
        "pros_of_this": "Future cases share indication, severity trajectory, and channeling with the cases, so the trend correction is less biased when the outcome influences prescribing.",
        "cons_of_this": "Future cases are scarcer than non-cases and require post-index follow-up to accrue.",
        "when_to_prefer": "Pharmacoepidemiologic settings where prescribing is indication-driven, so non-cases mis-measure the cases' counterfactual exposure trend."
      },
      {
        "compared_to": "Self-controlled case series (SCCS)",
        "pros_of_this": "No parametric event-rate model; tolerates event-dependent exposure better and needs no full observation-time exposure history.",
        "cons_of_this": "Estimates only a transient window-contrast effect; generally less efficient than SCCS for clean point exposures with a well-specified rate model.",
        "when_to_prefer": "Chronically-trending drug exposures where secular trend and event-dependent exposure coexist; prefer SCCS for vaccine-style point exposures."
      },
      {
        "compared_to": "Cohort / active-comparator new-user design",
        "pros_of_this": "Needs no measured confounders and no external comparator cohort; automatically controls all time-invariant confounders.",
        "cons_of_this": "Cannot estimate effects of time-invariant exposures, yields only relative (not absolute) effects, and answers a transient-effect question only.",
        "when_to_prefer": "When key confounders are unmeasured and the effect is acute/transient; prefer cohort/ACNU for cumulative or long-latency effects or when absolute risk is required."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Reconstruct daily exposure from fill_date + days_supply; define equal-length hazard and washout-separated reference windows entirely within continuous, FFS-observable enrollment. Exclude MA-only/capitated person-time where pharmacy claims are unavailable so \"unexposed\" windows are truly observed. Apply identical window logic to current and future cases.",
      "ehr": "Exposure = order/administration, not dispensing; link to pharmacy fills to recover outpatient exposure, or transient triggering is mismeasured. Visit-driven gaps create spurious \"unexposed\" windows. Anchor the hazard window on the most objective event date (e.g., admission).",
      "registry": "Strong, adjudicated event dating for clean window anchoring; weak pharmacy exposure that must be linked to claims for day-level reconstruction.",
      "linked": "Linked claims-EHR is the ideal substrate (precise event dates + complete exposure) but reconcile order/fill/service dates before window assignment, since a few days' discrepancy can flip a discordant pair across the hazard/reference boundary."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\nimport numpy as np\nfrom statsmodels.discrete.conditional_models import ConditionalLogit\n\nHAZARD_DAYS = 30          # day 1..30 before the event\nREF_GAP_DAYS = 90         # washout between hazard and reference window\nREF_DAYS = 30             # reference window length (equal to hazard)\n\ndef _covered(rx_p, w_start, w_end):\n    \"\"\"True if any fill's [fill_date, fill_date+days_supply) covers >=1 day of [w_start, w_end].\"\"\"\n    supply_end = rx_p[\"fill_date\"] + pd.to_timedelta(rx_p[\"days_supply\"], unit=\"D\")\n    return bool(((rx_p[\"fill_date\"] <= w_end) & (supply_end > w_start)).any())\n\ndef _enrolled(enr_p, w_start, w_end):\n    \"\"\"True if a single non-MA enrollment span fully covers the window (so 'unexposed' is observed, not missing).\"\"\"\n    ok = (enr_p[\"enroll_start\"] <= w_start) & (enr_p[\"enroll_end\"] >= w_end) & (~enr_p[\"ma_only\"])\n    return bool(ok.any())\n\ndef build_cctc_rows(events, rx, enroll):\n    rows = []\n    rx_by = dict(tuple(rx.groupby(\"person_id\")))\n    en_by = dict(tuple(enroll.groupby(\"person_id\")))\n    for _, ev in events.iterrows():\n        pid, e0, grp = ev[\"person_id\"], ev[\"event_date\"], ev[\"group\"]\n        haz = (e0 - pd.Timedelta(days=HAZARD_DAYS), e0 - pd.Timedelta(days=1))\n        # +1 so that exactly REF_GAP_DAYS untouched days separate the two windows\n        ref_end = e0 - pd.Timedelta(days=HAZARD_DAYS + REF_GAP_DAYS + 1)\n        ref = (ref_end - pd.Timedelta(days=REF_DAYS - 1), ref_end)\n        rx_p = rx_by.get(pid, rx.iloc[0:0])\n        en_p = en_by.get(pid, enroll.iloc[0:0])\n        # Both windows must be observable; otherwise the discordant pair is censored asymmetrically -> drop the case.\n        if not (_enrolled(en_p, *haz) and _enrolled(en_p, *ref)):\n            continue\n        rows.append({\"person_id\": pid, \"group\": grp, \"window\": \"hazard\",\n                     \"exposed\": int(_covered(rx_p, *haz))})\n        rows.append({\"person_id\": pid, \"group\": grp, \"window\": 
\"reference\",\n                     \"exposed\": int(_covered(rx_p, *ref))})\n    return pd.DataFrame(rows)\n\ndef estimate_cctc(rows):\n    d = rows.copy()\n    d[\"hazard\"] = (d[\"window\"] == \"hazard\").astype(int)             # within-person time contrast\n    d[\"future\"] = (d[\"group\"] == \"future\").astype(int)\n    d[\"hazard_x_future\"] = d[\"hazard\"] * d[\"future\"]                # interaction = trend correction\n    # Conditional logistic, stratified on person; outcome = exposed in that window.\n    X = d[[\"hazard\", \"hazard_x_future\"]]\n    model = ConditionalLogit(d[\"exposed\"], X, groups=d[\"person_id\"])\n    res = model.fit(disp=False)\n    # OR_case = exp(hazard); OR_futurecase = exp(hazard + hazard_x_future);\n    # OR_CCTC = OR_case / OR_futurecase = exp(-hazard_x_future).\n    beta_int = res.params[\"hazard_x_future\"]\n    or_cctc = float(np.exp(-beta_int))\n    # Read the bounds positionally; negating the coefficient swaps lower and upper.\n    lo, hi = np.asarray(res.conf_int())[list(X.columns).index(\"hazard_x_future\")]\n    ci = (float(np.exp(-hi)), float(np.exp(-lo)))\n    return {\"OR_case\": float(np.exp(res.params[\"hazard\"])),\n            \"OR_CCTC\": or_cctc, \"CI95\": ci}",
        "description": "CCTC window construction + ratio estimator from claims-style inputs. Required inputs (cleaned, de-duplicated):\n  events : one row per case  -> person_id, event_date (datetime), group in {'current','future'}\n  rx     : pharmacy fills     -> person_id, fill_date (datetime), days_supply (int)\n  enroll : enrollment spans   -> person_id, enroll_start, enroll_end, ma_only (bool)  # ma_only lacks FFS pharmacy claims\nBuilds two rows per case (hazard vs reference window), checks that each window lies within a single non-MA enrollment span, then fits\none conditional logistic model. Because future cases are coded 1, OR_CCTC = exp(-coefficient) on the hazard x future interaction, i.e. the reciprocal of the interaction OR.",
        "dependencies": [
          "pandas",
          "numpy",
          "statsmodels"
        ],
        "source_citations": [
          "wang-2011"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\nlibrary(survival)\n\nHAZARD_DAYS <- 30L; REF_GAP_DAYS <- 90L; REF_DAYS <- 30L\n\ncovered <- function(fd, ds, ws, we) any(fd <= we & (fd + ds) > ws)\nenrolled <- function(es, ee, mao, ws, we) any(es <= ws & ee >= we & !mao)\n\nbuild_cctc_rows <- function(events, rx, enroll) {\n  setDT(events); setDT(rx); setDT(enroll)\n  out <- list()\n  for (i in seq_len(nrow(events))) {\n    pid <- events$person_id[i]; e0 <- events$event_date[i]; grp <- events$group[i]\n    haz_s <- e0 - HAZARD_DAYS;            haz_e <- e0 - 1L\n    # +1L so that exactly REF_GAP_DAYS untouched days separate the two windows\n    ref_e <- e0 - (HAZARD_DAYS + REF_GAP_DAYS + 1L); ref_s <- ref_e - (REF_DAYS - 1L)\n    rp <- rx[person_id == pid]; ep <- enroll[person_id == pid]\n    # Both windows must be observable, non-MA, or the discordant pair is censored asymmetrically -> drop.\n    if (!(enrolled(ep$enroll_start, ep$enroll_end, ep$ma_only, haz_s, haz_e) &&\n          enrolled(ep$enroll_start, ep$enroll_end, ep$ma_only, ref_s, ref_e))) next\n    out[[length(out) + 1L]] <- data.table(\n      person_id = pid, group = grp, window = c(\"hazard\", \"reference\"),\n      exposed = c(as.integer(covered(rp$fill_date, rp$days_supply, haz_s, haz_e)),\n                  as.integer(covered(rp$fill_date, rp$days_supply, ref_s, ref_e))))\n  }\n  rbindlist(out)\n}\n\nestimate_cctc <- function(rows) {\n  rows <- copy(rows)                                  # leave the caller's table unmodified\n  rows[, hazard := as.integer(window == \"hazard\")]   # within-person time contrast\n  rows[, future := as.integer(group == \"future\")]\n  # Conditional logistic stratified on person. The future main effect is constant within\n  # person and drops out, so only hazard and hazard:future are estimated.\n  fit <- clogit(exposed ~ hazard + hazard:future + strata(person_id), data = rows)\n  b_int <- coef(fit)[\"hazard:future\"]\n  ci <- exp(-rev(confint(fit)[\"hazard:future\", ]))   # OR_CCTC = exp(-interaction)\n  list(OR_case = unname(exp(coef(fit)[\"hazard\"])),\n       OR_CCTC = unname(exp(-b_int)),\n       CI95 = unname(ci))\n}",
        "description": "CCTC in R with survival::clogit (conditional logistic). Inputs mirror the Python version:\n  events : person_id, event_date (Date), group in {'current','future'}\n  rx     : person_id, fill_date (Date), days_supply (integer)\n  enroll : person_id, enroll_start, enroll_end, ma_only (logical)\nReturns OR_case and OR_CCTC; OR_CCTC = exp(-coef on hazard:future interaction).",
        "dependencies": [
          "data.table",
          "survival"
        ],
        "source_citations": [
          "wang-2011"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "%let hazard = 30;  %let refgap = 90;  %let refdays = 30;\n\n/* One stratum per case = two rows (hazard, reference). Window bounds in days before the event. */\ndata windows;\n  set work.events;\n  length window $9;\n  /* hazard window: day 1..&hazard before event */\n  window = 'hazard';    w_start = event_date - &hazard;  w_end = event_date - 1;        output;\n  /* reference window: equal length; +1 leaves exactly &refgap washout days between the windows */\n  window = 'reference'; w_end   = event_date - (&hazard + &refgap + 1);\n                        w_start = w_end - (&refdays - 1);                                output;\nrun;\n\n/* Drop a case unless BOTH windows are inside a single non-MA enrollment span (observed 'unexposed'). */\nproc sql;\n  create table windows_obs as\n  select w.* from windows w\n  where exists (select 1 from work.enroll e\n                where e.person_id = w.person_id and e.ma_only = 0\n                  and e.enroll_start <= w.w_start and e.enroll_end >= w.w_end);\n  /* keep only cases whose BOTH windows survived the enrollment filter */\n  create table keepers as\n  select person_id from windows_obs group by person_id having count(*) = 2;\n  create table win2 as\n  select * from windows_obs where person_id in (select person_id from keepers);\nquit;\n\n/* Window-level exposure: any fill covering >=1 day of the window (fill_date + days_supply > w_start). 
*/\nproc sql;\n  create table model_in as\n  select w.person_id, w.window,\n         (w.window = 'hazard')                          as hazard,\n         (w.group  = 'future')                          as future,\n         max( case when r.fill_date <= w.w_end\n                    and (r.fill_date + r.days_supply) > w.w_start\n                   then 1 else 0 end )                  as exposed\n  from win2 w left join work.rx r on r.person_id = w.person_id\n  group by w.person_id, w.window, calculated hazard, calculated future;\nquit;\n\n/* Conditional logistic via STRATA: exposed = hazard hazard*future, stratified on person. */\n/* The future main effect is constant within person, so it is not estimable and is omitted. */\nproc logistic data=model_in;\n  strata person_id;                              /* self-controlled stratum = the case */\n  model exposed(event='1') = hazard hazard*future;\n  /* OR_case = exp(hazard); OR_CCTC = exp(-(hazard*future interaction)). */\n  estimate 'log OR_CCTC' hazard*future -1 / exp cl;   /* exponentiated estimate = OR_CCTC with 95% CL */\nrun;",
        "description": "CCTC window construction and the ratio estimator in SAS. Required input datasets (post data-management):\n  work.events : person_id, event_date, group ('current'/'future')\n  work.rx     : person_id, fill_date, days_supply\n  work.enroll : person_id, enroll_start, enroll_end, ma_only (0/1)\nPROC LOGISTIC with STRATA person_id fits the conditional logistic model (conditional maximum likelihood within each person stratum); OR_CCTC = exp(-beta) on the\nhazard*future interaction. The two-row-per-case structure (hazard vs reference) is the self-controlled stratum.",
        "dependencies": [],
        "source_citations": [
          "wang-2011"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": "case-case-time-control-timeline.svg",
        "mermaid": null,
        "caption": "Two residents share the same window structure anchored on their own event dates. Margaret is a current case (fracture May 30); Ruth is a future case (fracture Sep 27) used to estimate the calendar-time trend in antipsychotic use. Dividing the within-current-case odds ratio (2.4) by the within-future-case odds ratio (1.6) removes the trend and yields the corrected estimate of 1.5.",
        "alt_text": "Horizontal timeline from January to October 2024 showing fill bars and shaded referent and hazard windows for Margaret (current case, fracture May 30) and Ruth (future case, fracture Sep 27), illustrating how the same window structure applied to a later case estimates the secular drug-use trend.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  CC[Case-crossover in current cases<br/>hazard vs reference window] --> ORc[OR_case<br/>controls fixed confounding<br/>but embeds exposure-time trend]\n  FUT[Same window contrast in FUTURE cases<br/>anchored on their later event] --> ORf[OR_futurecase<br/>estimates the secular<br/>exposure-time trend]\n  ORc --> RATIO[OR_CCTC = OR_case / OR_futurecase<br/>= exp of exposure x group interaction]\n  ORf --> RATIO\n  RATIO --> EST[Trend-adjusted transient effect]",
        "caption": "The case-case-time-control estimator. The current-case case-crossover removes time-invariant confounding; the future-case case-crossover estimates the secular exposure trend; their ratio is the trend-adjusted transient effect.",
        "alt_text": "Flowchart showing OR_case from current cases and OR_futurecase from future cases combining into the ratio OR_CCTC.",
        "source_type": "illustrative",
        "source_citations": [
          "wang-2011"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  M[Maclure 1991<br/>Case-crossover<br/>removes fixed confounding] --> S[Suissa 1995<br/>Case-time-control<br/>adds NON-CASE time controls]\n  S --> W[Wang 2011<br/>Case-case-time-control<br/>FUTURE CASES as time controls]\n  S -. weakness: non-cases' trend<br/>need not equal cases' trend .-> W",
        "caption": "Conceptual lineage. Case-time-control's non-case time controls assume equal exposure trends in non-cases and cases; CCTC replaces them with future cases who share indication and channeling.",
        "alt_text": "Timeline from Maclure 1991 case-crossover to Suissa 1995 case-time-control to Wang 2011 case-case-time-control.",
        "source_type": "illustrative",
        "source_citations": [
          "wang-2011"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "gantt\n  title CCTC windows for one current case (claims)\n  dateFormat YYYY-MM-DD\n  axisFormat %b\n  section Reference\n  Reference window (30d, washout-separated) :done, ref, 2024-01-01, 30d\n  Washout gap :crit, gap, 2024-01-31, 90d\n  section Hazard\n  Hazard window (30d before event) :active, haz, 2024-04-30, 30d\n  section Event\n  Acute event (e.g., hip fracture) :milestone, ev, 2024-05-30, 0d",
        "caption": "Window placement for one current case. The hazard window precedes the event; a washout-separated reference window sits earlier in the same person. Exposure in each window is reconstructed from fill_date + days_supply.",
        "alt_text": "Gantt chart showing a reference window, a washout gap, a hazard window, and the event date for a single case.",
        "source_type": "illustrative",
        "source_citations": [
          "wang-2011"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "is_variant_of",
        "target_slug": "case-time-control",
        "notes": "CCTC is case-time-control with future cases (rather than non-cases) as the time-control group, removing the assumption that exposure trends are equal in non-cases and cases."
      },
      {
        "relation_type": "is_variant_of",
        "target_slug": "case-crossover",
        "notes": "The current-case component of CCTC is a case-crossover; CCTC adds a future-case case-crossover to divide out the exposure-time trend the plain case-crossover assumes away."
      },
      {
        "relation_type": "alternative_to",
        "target_slug": "self-controlled-case-series",
        "notes": "Both are self-controlled, case-only designs for transient effects; SCCS models full observation time with a rate model and assumes event-independent exposure, while CCTC contrasts windows and tolerates event-dependent exposure and secular trend."
      },
      {
        "relation_type": "see_also",
        "target_slug": "case-control",
        "notes": "CCTC's future cases derive from the same source population as the current cases; the design reframes a case-control contrast as a within-person, trend-adjusted comparison."
      },
      {
        "relation_type": "see_also",
        "target_slug": "negative-control-outcomes-rwe",
        "notes": "Negative-control outcomes or exposures can probe residual exposure-trend or window-misspecification bias when the equal-trend assumption is in doubt."
      }
    ],
    "aliases": [
      "CCTC",
      "case-case-time-control design",
      "future-cases-as-controls design"
    ],
    "complexity": "advanced",
    "regulatory_relevance": [
      "fda",
      "ema"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "case-cohort-design",
    "name": "Case-Cohort Design",
    "short_definition": "A sampling-efficient design nested inside a fully assembled cohort that draws a single random subcohort at baseline and ascertains expensive covariates (biomarker assays, chart abstraction, genotyping) only on subcohort members plus any cases emerging from the full cohort; one subcohort serves multiple outcomes and supports absolute-risk estimation, but analysis requires weighted Cox regression with Prentice or Barlow weights and robust variance — unweighted Cox is a well-known error.",
    "long_description": "**Design mechanics**\n\nThe case-cohort design begins, like all nested sampling designs, with a fully assembled and\nenrolled cohort — every member has a time-zero, an eligibility record, and an observable\nfollow-up period. At **baseline (time-zero)**, before any outcomes are observed, the analyst\ndraws a random **subcohort** of size m from the full cohort of size N, yielding a sampling\nfraction π = m / N. During follow-up, the study team ascertains expensive covariates — stored\nbiospecimen assays, manual chart abstraction, genotyping, imaging reads — for every subcohort\nmember, regardless of whether they later develop the outcome. As cases accumulate anywhere in\nthe full cohort, any case *not already in the subcohort* is added to the measurement queue; the\ncovariate is ascertained for them as well. The analytic dataset therefore contains: (1) all\nsubcohort members (cases and non-cases alike), and (2) cases from outside the subcohort. The\nkey quantity is the **overlap**: cases who happen to fall inside the subcohort by chance are\ncounted once, not twice. Total assays = m + (total cases) − (cases inside subcohort).\n\n**The killer advantage over nested case-control: one subcohort, many outcomes.** In a\nnested case-control (NCC) design, controls are sampled fresh at each case's event time and are\ntherefore outcome-specific — a new control sample is needed for every endpoint. The case-cohort\nsubcohort is drawn once at baseline and reused for every outcome the study examines: fatal\nmyocardial infarction, incident diabetes, all-cause mortality, and any post-hoc endpoint can\neach use the same m subcohort members, with only the new-case set varying. 
This makes\ncase-cohort the preferred design for multi-endpoint biobank and registry substudies.\n\n**Absolute risk is estimable.** Because the subcohort is a probability sample of the full\ncohort, its person-time is a known fraction of the total cohort person-time, and event rates\n(incidence densities) and cumulative incidence can be estimated with appropriate weighting.\nNCC, by contrast, cannot recover absolute risks without additional data because the risk-set\nsampling probabilities depend on cohort size in a time-varying way that is not always recorded.\n\n**Pros, cons, and trade-offs.**\n\n- **vs full-cohort Cox:** The case-cohort design's sole advantage is measurement cost when an\n  exposure or covariate is expensive for every cohort member. When the exposure and confounders\n  are already universally available — as in ordinary claims or EHR data where the drug, diagnosis,\n  and covariates are coded for everyone — the case-cohort design discards information and the\n  full-cohort Cox model dominates. Choosing case-cohort for cheap-exposure claims data is\n  indefensible.\n\n- **vs nested case-control:** NCC re-samples controls at each event time, providing tight\n  time-matching that can be more efficient than case-cohort for a single time-to-event outcome\n  with strong time-related confounding (age drift, calendar-period effects). The case-cohort\n  design sacrifices some per-outcome efficiency to gain the multi-outcome reusability and\n  absolute-risk advantages described above. **Prefer NCC** for a single outcome with strong\n  time confounding or highly time-varying expensive exposures; **prefer case-cohort** for\n  multi-outcome biobank substudies, registry biomarker layers, and when absolute risk is needed.\n\n- **vs self-controlled designs (SCCS, case-crossover):** Self-controlled designs eliminate\n  all time-fixed confounding by within-person comparison but require transient reversible\n  exposures and acute outcomes. 
Case-cohort accommodates chronic exposures and stable biomarkers\n  and handles between-person confounders via covariate adjustment, at the cost of residual\n  unmeasured between-person confounding that self-controlled designs remove by design.\n\n**The classic analytic mistake: unweighted Cox.** The subcohort itself is a random sample of\nthe cohort, but the analytic dataset is not: it contains every case and only a fraction π of\nthe non-cases. A naive Cox regression that treats these records as an ordinary cohort lets\ncases from outside the subcohort sit in risk sets before their own event times, and so\nproduces a biased hazard-ratio estimate. The correct analysis uses a **weighted\npseudo-partial likelihood** whose risk sets are built from the subcohort. Two common weighting\nschemes are **Prentice (1986)** weights, which give subcohort members weight 1 while at risk\nand admit a case from outside the subcohort only at its own event time (weight 1); and\n**Barlow** weights (Barlow et al. 1999), which up-weight subcohort members by 1/π while they\nare at risk to represent the full cohort's at-risk pool and give every case weight 1 at its\nevent time (a case from outside the subcohort contributes nothing earlier). The two behave\nsimilarly in large samples. Both require a **robust (sandwich) variance\nestimator** because the same subcohort members appear in multiple pseudo-risk sets and the\nstandard Cox variance ignores this correlation. 
In R, `survival::cch` implements the Prentice\nestimator (and the related Self-Prentice and Lin-Ying variants) with an appropriate variance\nestimator; Barlow weighting is typically fit with `coxph` on counting-process data using case\nweights and a per-subject cluster term for the robust variance.\n\n**When to use.**  A retrospective cohort with expensive covariates (biospecimen assays,\nmanual chart abstraction, genotyping, expert adjudication) that cannot be collected\ncost-effectively for all N members; multi-outcome or multi-endpoint biobank or registry\nsubstudies where one sampled panel is to serve several hypotheses; situations where absolute\nincidence rates or cumulative incidence are needed alongside relative risks; and claims or\nEHR cohorts being linked to chart-validated outcome data for a validation substudy, where the\nchart-pull is the expensive item.\n\n**When NOT to use, and when it is actively misleading or dangerous.**\n\n- **Cheap, complete exposure in claims/EHR.** If the covariate of interest is already coded\n  in the database for every cohort member, the case-cohort design wastes information and\n  inflates variance; full-cohort Cox is strictly preferred.\n- **Unweighted Cox on the case-cohort dataset.** Fitting a standard (unweighted) Cox model\n  on the raw case-cohort data produces a biased hazard-ratio estimate. This is a frequently\n  cited analysis error. The estimator must use Prentice or Barlow weights plus\n  robust variance; failing to do so is quantitatively incorrect, not conservative.\n- **Highly time-varying expensive exposure.** If the expensive item is a time-varying\n  biomarker that must be re-ascertained at multiple person-specific time points, the\n  case-cohort's single baseline-sampling advantage erodes; NCC with time-matched controls may\n  be more efficient in this scenario.\n- **Very small cohorts.** With N below a few hundred, the subcohort may be nearly the full\n  cohort, eliminating the cost advantage. 
Check whether the expected number of cases in the\n  subcohort and outside it provides enough statistical power before committing to the design.\n- **Ignoring overlap between cases and subcohort.** Double-counting cases who are also\n  subcohort members (by including them in both the subcohort non-case row and a separate case\n  row) inflates the apparent efficiency and biases the estimate. Analytic code must de-duplicate\n  so that each subject appears with the correct weight-indexed contribution.\n\n**Interpreting the output**\n\nThe weighted Cox estimator from a correctly analyzed case-cohort produces a hazard-ratio\ncoefficient for each covariate. Using the worked example: cohort N = 50,000, subcohort\nm = 1,000 (π = 0.02), 400 total cases, 8 inside the subcohort; suppose the Barlow-weighted\nanalysis yields HR = 1.73 (robust 95% CI 1.31–2.28) for the exposure of interest.\n\n*Formal interpretation:* The Barlow-weighted Cox partial-likelihood estimator, with robust\nsandwich variance to account for the repeated appearance of subcohort members across\npseudo-risk sets, estimates an instantaneous rate ratio of 1.73 comparing exposed to unexposed\nsubjects among those still at risk at each event time. The confidence interval has the\nrepeated-sampling interpretation: if this analysis were repeated many times under the same\nsampling design, approximately 95% of such intervals would contain the true hazard ratio in\nthe source cohort. The estimate is conditional on measured covariates and requires the\nuntestable assumption that unmeasured confounders are not materially associated with both\nexposure and outcome.\n\n*Practical interpretation:* At any moment during follow-up, exposed patients had approximately\n73% higher instantaneous risk of the outcome than unexposed patients of the same measured\ncharacteristics. 
The confidence interval (1.31 to 2.28) excludes 1.0, indicating this\nassociation is unlikely to be due to chance, though residual unmeasured confounding cannot\nbe ruled out in an observational study.\n\n**Data-source operational depth.**\n\n- **Claims with chart-validated outcomes:** The classic RWE application is a large claims\n  cohort (potentially hundreds of thousands of patients) where the primary outcome requires\n  manual chart review for validation. Pull the full cohort from claims for exposure and\n  covariate ascertainment; draw the subcohort for chart validation plus all incident cases\n  flagged by the claims algorithm. This yields validated outcomes without reviewing every chart.\n- **Registry biomarker substudies:** A disease registry fixes the case set (all diagnoses are\n  adjudicated); the subcohort is drawn from registry enrollees without the event during the\n  baseline period; stored specimens or additional labs are run only on the subcohort plus cases.\n  Multiple biomarkers can be assayed from the same specimen bank across several hypotheses.\n- **EHR cohorts:** Define the cohort from encounter-based enrollment windows; the subcohort is\n  a random draw at index; expensive items (NLP-derived phenotypes, expert severity scores) are\n  applied to subcohort + cases. Observation windows and loss-to-follow-up rules apply as for\n  any EHR cohort.\n- **Competing risks:** In elderly or seriously ill cohorts, death may compete with the primary\n  outcome. Because the subcohort includes decedents who contributed person-time, the\n  case-cohort estimand can be extended to cause-specific hazards (exclude competing-event\n  person-time in the denominator) or to subdistribution hazards with additional weighting.\n  Specify the estimand before analysis; the choice affects both the weights and the\n  interpretation.",
    "primary_category": "Study_Design",
    "tags": [
      "case-cohort",
      "subcohort",
      "prentice-weights",
      "barlow-weights",
      "weighted-cox",
      "sampling-efficiency",
      "biomarker-substudy",
      "pharmacoepidemiology",
      "multi-outcome"
    ],
    "applies_to_study_types": [
      "case_cohort"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1093/biomet/73.1.1",
        "url": "https://doi.org/10.1093/biomet/73.1.1",
        "citation_text": "Prentice RL. A case-cohort design for epidemiologic cohort studies and disease prevention trials. Biometrika. 1986;73(1):1-11.",
        "year": 1986,
        "authors_short": "Prentice",
        "notes": "The original proposal of the case-cohort design: subcohort sampled at baseline, weighted pseudo-partial likelihood (Prentice weights), and robust variance for efficient analysis of expensive covariates in large cohorts."
      },
      {
        "role": "explain",
        "doi": "10.1016/S0895-4356(99)00102-X",
        "url": "https://doi.org/10.1016/S0895-4356(99)00102-X",
        "citation_text": "Barlow WE, Ichikawa L, Rosner D, Mangat S. Analysis of case-cohort designs. Journal of Clinical Epidemiology. 1999;52(12):1165-1172.",
        "year": 1999,
        "authors_short": "Barlow et al.",
        "notes": "Develops the self-weighted (Barlow) estimator for case-cohort data — simpler implementation than Prentice weights, same asymptotic efficiency, robust variance essential; introduces the PROC PHREG WEIGHT + COVS(AGGREGATE) approach in SAS."
      },
      {
        "role": "use",
        "doi": "10.1097/00001648-199103000-00013",
        "url": "https://doi.org/10.1097/00001648-199103000-00013",
        "citation_text": "Wacholder S. Practical considerations in choosing between the case-cohort and nested case-control designs. Epidemiology. 1991;2(2):155-158.",
        "year": 1991,
        "authors_short": "Wacholder",
        "notes": "Canonical practical comparison of case-cohort vs nested case-control: prefer case-cohort for multi-outcome substudies and absolute risk; prefer NCC for single time-matched outcome with strong time confounding."
      }
    ],
    "plain_language_summary": "A case-cohort study starts with a large group of patients all followed over time, then picks a smaller random sample — the subcohort — at the very beginning, before anyone develops the outcome of interest. Researchers only run expensive tests (like biomarker blood assays or detailed chart reviews) on subcohort members plus any patients who later develop the outcome, instead of testing everyone. The big advantage over a similar approach called nested case-control is that the same subcohort can be reused for several different outcomes, and you can also calculate how common the outcome was in the whole group — not just compare those who got it to those who did not. However, the statistical analysis requires a special weighted version of the Cox survival model; using the standard unweighted version on this kind of data is a known mistake that produces incorrect results.",
    "key_terms": [
      {
        "term": "subcohort",
        "definition": "A randomly chosen subset of the full cohort, selected at the very start of follow-up before any outcomes occur; the subcohort is the group on which expensive measurements are made, and it stands in for the whole cohort in the analysis."
      },
      {
        "term": "sampling fraction",
        "definition": "The proportion of the full cohort included in the subcohort (subcohort size divided by cohort size); this fraction is used to weight each subcohort member's contribution so the analysis represents the full cohort."
      },
      {
        "term": "Prentice/Barlow weights",
        "definition": "Numerical multipliers assigned to each subject's contribution to the statistical model that correct for the fact that only a fraction of the cohort was measured; without these weights the hazard-ratio estimate is biased."
      },
      {
        "term": "robust variance",
        "definition": "A variance calculation that accounts for the fact that the same subcohort members appear in many parts of the analysis, preventing artificially narrow confidence intervals."
      },
      {
        "term": "hazard ratio (HR)",
        "definition": "The ratio of the instantaneous rate of the outcome in the exposed group to that in the unexposed group among people still at risk at any given moment; the main effect estimate from the weighted Cox model in a case-cohort analysis."
      },
      {
        "term": "multi-outcome reusability",
        "definition": "The ability to use the same subcohort as the comparison group for several different outcomes, which is unique to the case-cohort design and not possible in nested case-control studies."
      }
    ],
    "worked_example": {
      "scenario": "A research team assembles a cohort of 50,000 adults from a linked claims-registry database to study whether a costly biomarker measured from stored serum predicts incident cardiovascular events. Running the assay on all 50,000 patients would cost roughly $500 per assay. Instead, they draw a subcohort of 1,000 patients at random at baseline (day zero), before any events occur. Over three years of follow-up, 400 patients across the full cohort develop the outcome; of those 400 cases, 8 were already in the subcohort. The team needs to calculate exactly how many assays are required and compare that to the full-cohort alternative.",
      "dataset": {
        "caption": "Summary counts for the case-cohort calculation — not one row per patient but the key group totals an analyst would record before deciding on the design.",
        "columns": [
          "group",
          "count",
          "assay_needed"
        ],
        "rows": [
          [
            "Full cohort (N)",
            50000,
            "would require 50000 assays"
          ],
          [
            "Subcohort drawn at baseline (m)",
            1000,
            "assayed regardless of outcome"
          ],
          [
            "Total cases in full cohort",
            400,
            "assayed because they are cases"
          ],
          [
            "Cases already inside subcohort",
            8,
            "already counted in subcohort — not duplicated"
          ],
          [
            "Cases outside subcohort (new additions)",
            392,
            "assayed as additional cases"
          ]
        ]
      },
      "steps": [
        "Sampling fraction: subcohort size divided by full cohort size gives 1000 / 50000 = 0.02, meaning 2% of the cohort is in the subcohort.",
        "Cases outside the subcohort: 400 total cases minus the 8 who were already selected into the subcohort gives 400 - 8 = 392 additional subjects needing an assay.",
        "Total assays required: all subcohort members plus all cases not already in the subcohort gives 1000 + 392 = 1392 assays.",
        "Cross-check using the overlap formula: subcohort size plus all cases minus cases inside the subcohort gives 1000 + 400 - 8 = 1392 assays — confirming the same answer.",
        "Cost ratio versus full-cohort measurement: 1392 / 50000 = 0.02784, meaning only about 2.8% as many assays are needed compared to measuring everyone.",
        "Because the subcohort is a probability sample drawn at baseline, it can be reused for a second or third outcome (say, incident diabetes or all-cause mortality) without any additional assays on the subcohort members — only new cases outside the subcohort for each additional outcome would require measurement, making the multi-outcome cost savings even larger."
      ],
      "result": "Total assays = 1000 + 400 - 8 = 1392 versus 50000 for full-cohort measurement. Cost ratio 1392 / 50000 = 0.02784 (approximately 2.8% of the full-cohort burden). The subcohort is then reused for every additional outcome at no extra baseline cost.",
      "timeline_spec": {
        "title": "Case-cohort design — subcohort drawn at baseline, cases added throughout follow-up",
        "window": {
          "start": "2022-01-01",
          "end": "2025-01-01",
          "label": "3-year follow-up window (full cohort N = 50,000)"
        },
        "events": [
          {
            "label": "Subcohort drawn (m = 1000, pi = 0.02)",
            "start": "2022-01-01",
            "length_days": 1,
            "quantity": "Random draw at baseline — 2% of cohort"
          },
          {
            "label": "SC1 (subcohort, no event)",
            "start": "2022-01-01",
            "length_days": 1096,
            "quantity": "Assayed at baseline; contributes full follow-up"
          },
          {
            "label": "SC2 (subcohort + case, event day 400)",
            "start": "2022-01-01",
            "length_days": 400,
            "quantity": "In subcohort AND a case — counted once"
          },
          {
            "label": "C1 (case outside subcohort, event day 200)",
            "start": "2022-01-01",
            "length_days": 200,
            "quantity": "Not in subcohort; assayed only because case"
          },
          {
            "label": "C2 (case outside subcohort, event day 600)",
            "start": "2022-01-01",
            "length_days": 600,
            "quantity": "Not in subcohort; assayed only because case"
          }
        ],
        "spans": [
          {
            "kind": "washout",
            "start": "2022-01-01",
            "end": "2022-01-01",
            "label": "Subcohort selected here — before any outcomes observed"
          },
          {
            "kind": "followup",
            "start": "2022-01-01",
            "end": "2024-12-31",
            "label": "Subcohort non-cases contribute full person-time with weight 1/pi"
          },
          {
            "kind": "exposed",
            "start": "2022-01-01",
            "end": "2022-07-19",
            "label": "C1 contributes to analysis only at event time (day 200)"
          },
          {
            "kind": "exposed",
            "start": "2022-01-01",
            "end": "2023-07-19",
            "label": "C2 contributes to analysis only at event time (day 600)"
          },
          {
            "kind": "followup",
            "start": "2022-01-01",
            "end": "2023-02-04",
            "label": "SC2 in subcohort until event (day 400); assayed from subcohort draw"
          }
        ],
        "result": {
          "label": "1000 + 400 - 8 = 1392 assays vs 50000; cost ratio 1392 / 50000 = 0.02784",
          "value": 0.02784
        },
        "caption": "Subcohort members (SC1, SC2) contribute follow-up time throughout the study with up-weights of 1/pi = 50 representing the full cohort risk pool. Cases outside the subcohort (C1, C2) are added to the analytic dataset only because they are cases; they receive weight 1 at their event time. SC2, who is both a subcohort member and a case, is counted once with the case weight at their event time. The single subcohort can be reused for additional outcomes without re-drawing.",
        "alt_text": "Timeline showing three years of follow-up from 2022 to 2025. A marker at the 2022-01-01 baseline shows the subcohort drawn from the full cohort. Subcohort member SC1 has a full-length follow-up bar. SC2's bar ends at day 400 with a case marker. Two additional case bars (C1 at day 200, C2 at day 600) are shorter and labeled as cases outside the subcohort. A note indicates 1392 total assays versus 50000 for the full cohort."
      }
    },
    "prerequisites": [
      "cohort-retrospective",
      "nested-case-control",
      "cox-ph-regression"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Prentice (1986) weighted pseudo-partial likelihood",
        "description": "At each event time t_i, subcohort members who have not yet failed contribute to the pseudo-risk set with weight 1/pi; the case at t_i (whether in the subcohort or not) contributes with weight 1. The resulting weighted score equations define the Prentice estimator; robust sandwich variance is required because subcohort members appear in multiple pseudo-risk sets.",
        "edge_cases": [
          "If a subcohort member develops the outcome, they shift from the weighted non-case contribution to the case contribution (weight 1) at their event time; the implementation must handle this transition correctly to avoid double-weighting.",
          "Very early events before the subcohort can be fully ascertained create a small window where the pseudo-risk set may not yet reflect the full subcohort; pre-specify a brief accrual period if needed."
        ],
        "data_source_notes": "Implemented in R via survival::cch(method=\"Prentice\"); SAS approximation via PROC PHREG with WEIGHT statement where weight = 1/pi for subcohort non-cases and 1 for all cases."
      },
      {
        "name": "Barlow (self-weighted) estimator",
        "description": "Assigns a constant weight of 1/pi to all subcohort members throughout their follow-up as non-cases, and weight 1 to all cases at their event time (regardless of subcohort membership). Simpler to implement, asymptotically equivalent to Prentice, and more natural in software that accepts a WEIGHT variable. Robust sandwich variance is still required.",
        "edge_cases": [
          "Cases inside the subcohort receive weight 1 at their event time and weight 1/pi during their prior non-case follow-up; code must distinguish these two contributions or use a counting-process formulation.",
          "The Barlow estimator can be slightly less efficient than Prentice in small samples; the difference is negligible for most applied settings (hundreds of cases or more)."
        ],
        "data_source_notes": "Implemented in R via survival::cch(method=\"Barlow\"); in SAS via PROC PHREG with a WEIGHT variable set to 1/pi for subcohort non-cases and 1 for cases, plus COVS(AGGREGATE) for the robust variance."
      },
      {
        "name": "Stratified case-cohort (outcome-specific sub-subcohorts)",
        "description": "For very large multi-ethnic or multi-center cohorts, a stratified subcohort is drawn with stratum-specific sampling fractions pi_s; this allows oversampling of underrepresented groups or high-risk strata. Each stratum's subjects receive stratum- specific weights 1/pi_s in the weighted Cox analysis.",
        "edge_cases": [
          "Stratum membership and stratum-specific pi_s must be pre-specified and documented before outcome ascertainment; post-hoc stratification inflates type-I error."
        ],
        "data_source_notes": "Implemented in R via survival::cch with the stratum argument; requires careful bookkeeping of stratum membership for each subject."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Full-cohort Cox proportional hazards",
        "pros_of_this": "Dramatically reduces the cost of expensive covariate ascertainment (assays, chart abstraction, genotyping) to roughly 3-5% of full-cohort measurement burden while retaining most statistical efficiency.",
        "cons_of_this": "Strictly less efficient than the full cohort; imposes additional analytic complexity (Prentice/Barlow weights, robust variance); pointless when covariates are already universally available in claims or EHR.",
        "when_to_prefer": "When ascertaining a key covariate for all N subjects is cost-prohibitive; never for cheap, universally coded claims-based exposures."
      },
      {
        "compared_to": "Nested case-control design",
        "pros_of_this": "One subcohort is reusable for multiple outcomes; supports absolute-risk and incidence-rate estimation; subcohort is a true probability sample of the cohort.",
        "cons_of_this": "Less efficient than NCC for a single outcome with strong time confounding (NCC's time-matched risk-set sampling can be more precise per case); analysis is more complex than NCC's conditional logistic regression.",
        "when_to_prefer": "Multi-endpoint biobank substudies, registry biomarker studies, any setting where absolute risk estimation is needed alongside relative risk; prefer NCC for a single time-matched analysis with highly time-varying expensive exposures."
      },
      {
        "compared_to": "Self-controlled designs (SCCS, case-crossover)",
        "pros_of_this": "Accommodates chronic, non-reversible exposures and stable biomarkers; supports between-person confounding adjustment via covariate modeling.",
        "cons_of_this": "Cannot eliminate unmeasured time-fixed confounding that self-controlled designs remove through within-person comparison.",
        "when_to_prefer": "Chronic exposures, stable biomarkers, or any setting where a within-person comparison is infeasible or the exposure is not transient."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Build the full cohort from eligibility, medical, and pharmacy files using standard new-user / continuous-enrollment rules (see cohort-retrospective). At the cohort's time-zero, draw the random subcohort using a random number generator seeded before any outcome ascertainment. Expensive covariates (e.g., chart-validated outcomes, linked lab values) are then ascertained only for subcohort members plus cases identified by the claims algorithm. Apply Barlow weights in PROC PHREG (SAS) or survival::cch (R) with COVS(AGGREGATE) / robust variance. Verify the overlap count (cases inside subcohort) before finalizing the analytic dataset.",
      "ehr": "Define cohort entry and the subcohort draw using the same encounter-based observability window required for any EHR cohort. The subcohort enables costly NLP phenotyping or expert severity scoring for a manageable fraction of the database. Ensure loss-to- follow-up rules apply consistently to both subcohort and non-subcohort cases.",
      "registry": "Registry enrollees provide the adjudicated case set; the subcohort is drawn from registry members who were event-free at baseline. Stored specimens or additional lab panels are run on subcohort plus cases. The registry's clinical data infrastructure often makes it straightforward to identify the 8 (cases in subcohort) vs 392 (cases outside) split. Multiple biomarker hypotheses are studied from the same stored draw.",
      "linked": "Linked claims-EHR-registry substudies are the natural home for case-cohort: the claims define the N = 50,000 cohort and the subcohort; the registry or EHR provides the expensive adjudicated outcome or biomarker for the 1,392 measurement subjects. Linkage selection bias (only linkable subjects are eligible) must be addressed in sensitivity analyses."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\nfrom lifelines import CoxPHFitter\n\ndef build_case_cohort_weights(cohort: pd.DataFrame, pi: float) -> pd.DataFrame:\n    \"\"\"\n    Assign Barlow weights to the case-cohort analytic dataset.\n\n    Barlow rule:\n      - All subcohort members (cases + non-cases): weight = 1 / pi during\n        their non-event follow-up.\n      - All cases (in subcohort or not): weight = 1 at their event time.\n    For a counting-process approximation in lifelines, we use a single-row\n    per subject and set weight = 1 for cases, 1/pi for subcohort non-cases.\n    Subcohort members who are also cases receive weight = 1 (case dominates).\n\n    This is a Barlow approximation; use survival::cch in R for the exact estimator.\n    \"\"\"\n    df = cohort.copy()\n\n    # Only include subjects in the analytic dataset:\n    #   - all subcohort members\n    #   - all cases (whether in subcohort or not)\n    analytic = df[df[\"in_subcohort\"] | (df[\"event\"] == 1)].copy()\n\n    # Barlow weight: 1 for cases, 1/pi for subcohort non-cases.\n    analytic[\"weight\"] = analytic.apply(\n        lambda r: 1.0 if r[\"event\"] == 1 else 1.0 / pi, axis=1\n    )\n\n    return analytic\n\n# --- example usage ---\n# cohort  = pd.DataFrame(...)  
# one row per subject, biomarker ascertained for subcohort + cases\n# N = len(cohort)              # full cohort size\n# m = cohort[\"in_subcohort\"].sum()\n# pi = m / N                   # sampling fraction\n\npi = 0.02   # 1000 / 50000 in the worked example\n\n# Build the analytic dataset.\nanalytic = build_case_cohort_weights(cohort, pi)\n\n# Compute follow-up duration in days.\nanalytic[\"duration\"] = (\n    pd.to_datetime(analytic[\"exit_date\"]) - pd.to_datetime(analytic[\"entry_date\"])\n).dt.days\n\n# Fit Barlow-weighted Cox with robust variance.\ncph = CoxPHFitter()\ncph.fit(\n    analytic[[\"duration\", \"event\", \"biomarker\", \"weight\"]],\n    duration_col=\"duration\",\n    event_col=\"event\",\n    weights_col=\"weight\",\n    robust=True   # sandwich variance — required for case-cohort; corrects for\n                  # repeated subcohort membership across pseudo-risk sets\n)\ncph.print_summary()\n# exp(coef) for biomarker is the Barlow-weighted hazard ratio.\n# For publishable analyses use R survival::cch or SAS PROC PHREG with COVS(AGGREGATE).",
        "description": "Case-cohort dataset construction and Barlow-weighted Cox analysis in Python. Because\nlifelines does not expose a dedicated case-cohort (cch) interface, this implementation\nbuilds the Barlow weight variable explicitly and uses lifelines CoxPHFitter with the\nweights_col argument. The robust variance option (robust=True) corrects the standard\nerrors for the repeated appearance of subcohort members. This is an honest approximation:\nthe Barlow pseudo-partial likelihood is not identical to the exact Prentice estimator,\nand for definitive published analyses R's survival::cch or SAS PROC PHREG should be\nused. The Python version is suitable for exploratory analysis and pipeline prototyping.\n\nRequired inputs (one row per subject):\n  cohort : person_id, entry_date, exit_date (datetime), event (0/1), in_subcohort (bool),\n           biomarker (float — ascertained for subcohort + cases only)\nSampling fraction pi = m / N must be computed from the subcohort draw.",
        "dependencies": [
          "pandas",
          "lifelines"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(survival)\n\n# --- worked example parameters ---\nN  <- 50000L   # full cohort size\nm  <- 1000L    # subcohort size  (drawn at baseline)\npi <- m / N    # sampling fraction: 1000 / 50000 = 0.02\n\n# dat: analytic dataset — subcohort members + all cases.\n# Build in advance; biomarker is NA for non-subcohort non-cases (they are excluded).\n# One row per subject; in_subcohort = TRUE/FALSE; event = 0/1.\n\n# Barlow method (recommended for most applied settings):\nfit_barlow <- cch(\n  Surv(entry, exit, event) ~ biomarker + exposure,\n  data        = dat,\n  subcoh      = ~in_subcohort,       # logical column marking subcohort membership\n  id          = ~person_id,\n  cohort.size = N,                   # full cohort N — required for weight computation\n  method      = \"Barlow\"             # alternatives: \"Prentice\", \"II.Borgan\"\n)\nsummary(fit_barlow)\n# exp(coef) is the Barlow-weighted hazard ratio with robust 95% CI.\n# The robust variance (sandwich) is applied automatically by cch().\n\n# Prentice method for comparison:\nfit_prentice <- cch(\n  Surv(entry, exit, event) ~ biomarker + exposure,\n  data        = dat,\n  subcoh      = ~in_subcohort,\n  id          = ~person_id,\n  cohort.size = N,\n  method      = \"Prentice\"\n)\nsummary(fit_prentice)\n\n# Both should produce similar HR estimates; CIs will differ slightly.\n# In the worked example (pi = 0.02, 400 cases, 8 in subcohort):\n# assays required = 1000 + 400 - 8 = 1392 vs 50000 full-cohort.",
        "description": "Canonical case-cohort analysis using survival::cch, which implements both the Prentice\n(1986) and Barlow et al. (1999) estimators with the correct robust variance. This is\nthe authoritative R implementation; exp(coef) from the fitted object is the case-cohort\nhazard ratio.\n\nRequired inputs:\n  dat : data.frame with one row per subject in the analytic set\n        (all subcohort members + all cases)\n        person_id, entry (numeric or 0), exit (follow-up time), event (0/1),\n        in_subcohort (logical), biomarker (numeric), exposure (factor or numeric)\n  N   : full cohort size (integer)",
        "dependencies": [
          "survival"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "%let pi       = 0.02;    /* sampling fraction: m / N = 1000 / 50000 */\n%let inv_pi   = 50;      /* 1 / pi = 1 / 0.02 = 50 */\n\n/* Assign Barlow weights:\n     - cases (event = 1): weight = 1  (whether in subcohort or not)\n     - subcohort non-cases (in_subcohort = 1, event = 0): weight = 1/pi\n   Cases outside the subcohort (in_subcohort = 0, event = 1) receive weight = 1.\n*/\ndata work.cc_weighted;\n  set work.analytic;\n  if event = 1 then weight = 1;\n  else weight = &inv_pi;      /* subcohort non-cases: up-weighted to represent full cohort */\nrun;\n\n/* Weighted Cox with robust sandwich variance (COVS(AGGREGATE) clusters on person_id). */\nproc phreg data = work.cc_weighted\n           covs(aggregate);            /* AGGREGATE = sandwich variance; required for case-cohort */\n  model exit * event(0) = biomarker exposure / ties = efron;\n  weight weight;                       /* Barlow weight variable */\n  id person_id;                        /* clustering unit for COVS(AGGREGATE) */\n  hazardratio biomarker / diff = ref;\n  hazardratio exposure  / diff = ref;\nrun;\n/*\n  exp(Parameter Estimate) = Barlow-weighted hazard ratio.\n  The \"Robust\" column in the PROC PHREG output gives sandwich-based SEs and 95% CIs.\n  Do NOT use the model-based (non-robust) standard errors for case-cohort data.\n\n  Worked-example audit:\n    Subcohort m = 1000, N = 50000, pi = 1000 / 50000 = 0.02, inv_pi = 50.\n    Total cases = 400; cases in subcohort = 8; cases outside = 400 - 8 = 392.\n    Analytic N = 1000 + 392 = 1392 subjects (work.analytic rows).\n    Assay cost ratio vs full cohort: 1392 / 50000 = 0.02784.\n*/",
        "description": "Barlow-weighted case-cohort analysis in SAS using PROC PHREG with a WEIGHT statement and\nCOVS(AGGREGATE) for the robust sandwich variance. This follows the approach described in\nBarlow et al. (1999). The COVS(AGGREGATE) option clusters the sandwich estimator at the\nsubject level, which is necessary because subcohort members contribute to multiple risk\nsets across the pseudo-partial likelihood.\n\nRequired input dataset (work.analytic):\n  person_id     — unique patient identifier\n  entry         — entry time (usually 0 for all subjects)\n  exit          — follow-up duration to event or censoring\n  event         — 0/1 event indicator\n  in_subcohort  — 1 if in subcohort, 0 if case outside subcohort\n  biomarker     — ascertained for all analytic subjects\n  exposure      — exposure indicator or level\n\nThe analytic dataset contains: all subcohort members + all cases (union).\npi must be computed before the DATA step below.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Cohort[\"Full assembled cohort<br/>N = 50,000 (time-zero, eligibility, follow-up defined)\"]\n  Cohort --> Draw[\"Draw random subcohort at baseline<br/>m = 1,000 (pi = 0.02)<br/>BEFORE any outcomes observed\"]\n  Draw --> AssaySC[\"Ascertain expensive covariate<br/>for ALL m subcohort members\"]\n  Cohort --> Cases[\"Identify all cases during follow-up<br/>400 total; 8 inside subcohort\"]\n  Cases --> NewCases[\"392 cases OUTSIDE subcohort<br/>(400 - 8 = 392 new assay subjects)\"]\n  NewCases --> AssayNew[\"Ascertain expensive covariate<br/>for 392 additional cases\"]\n  AssaySC --> Analytic[\"Analytic dataset<br/>1000 subcohort + 392 extra cases = 1392 subjects\"]\n  AssayNew --> Analytic\n  Analytic --> Weighted[\"Weighted Cox (Barlow / Prentice)<br/>+ robust sandwich variance\"]\n  Weighted --> HR[\"Hazard ratio + robust 95% CI<br/>for each outcome\"]\n  HR --> MultiOutcome[\"Reuse same subcohort for Outcome 2, 3 ...<br/>No new subcohort draw needed\"]",
        "caption": "Case-cohort flow: the subcohort is drawn once at baseline, expensive covariates are ascertained for subcohort members plus additional cases, and the same subcohort reappears for every subsequent outcome. Weighted Cox with robust variance is mandatory.",
        "alt_text": "Flowchart from full cohort through baseline subcohort draw and covariate ascertainment, case identification, additional case assay, analytic dataset assembly, weighted Cox analysis, and multi-outcome reuse of the subcohort.",
        "source_type": "illustrative",
        "source_citations": [
          "prentice-1986"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  subgraph NCC[\"Nested Case-Control\"]\n    direction TB\n    E1[\"Case A at t=120 → sample 4 controls<br/>from risk set at t=120\"]\n    E2[\"Case B at t=240 → sample 4 controls<br/>from risk set at t=240 (NEW sample)\"]\n    E3[\"Case C at t=400 → sample 4 controls<br/>from risk set at t=400 (NEW sample)\"]\n    E1 -.- E2 -.- E3\n  end\n  subgraph CC[\"Case-Cohort\"]\n    direction TB\n    Sub[\"Subcohort m = 1000<br/>drawn ONCE at t = 0\"]\n    Sub --> CA[\"Case A at t=120: adds if not in subcohort\"]\n    Sub --> CB[\"Case B at t=240: adds if not in subcohort\"]\n    Sub --> CC2[\"Case C at t=400: adds if not in subcohort\"]\n    Sub --> Out2[\"Second outcome: SAME subcohort\"]\n  end",
        "caption": "NCC samples new controls per case per outcome (outcome-specific, time-matched). Case-cohort draws one subcohort at baseline, reuses it for all outcomes, and adds only new cases not already in the subcohort. NCC wins on time-matching efficiency for one outcome; case-cohort wins on multi-outcome reusability and absolute-risk estimability.",
        "alt_text": "Side-by-side diagram. Left panel (NCC): three separate control samples drawn at three different case event times. Right panel (case-cohort): one subcohort drawn at baseline, reused for three cases and a second outcome.",
        "source_type": "illustrative",
        "source_citations": [
          "wacholder-1991"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "see_also",
        "target_slug": "nested-case-control",
        "notes": "The rival subsampling design within a cohort. NCC re-samples controls per case at each event time (outcome-specific, more efficient for one time-matched analysis); case-cohort uses a single baseline subcohort reusable across outcomes and supports absolute risk. Prefer NCC for a single time-to-event outcome with strong time confounding; prefer case-cohort for multi-outcome biobank substudies."
      },
      {
        "relation_type": "requires",
        "target_slug": "cohort-retrospective",
        "notes": "The case-cohort design is always nested inside a fully assembled cohort with defined time-zero, eligibility, and observable follow-up; all retrospective-cohort data-construction rules (washout, continuous enrollment, censoring) apply to the parent cohort before the subcohort is drawn."
      },
      {
        "relation_type": "see_also",
        "target_slug": "external-adjustment-validation-substudy-bias-correction-rwe",
        "notes": "Both designs measure an expensive quantity on a subset and use it to correct or enrich the full-cohort analysis; the case-cohort's subcohort + cases structure mirrors the validation substudy's internal sample logic."
      },
      {
        "relation_type": "used_with",
        "target_slug": "cox-ph-regression",
        "notes": "Case-cohort analysis uses a weighted variant of the Cox partial likelihood (Prentice or Barlow weights) with robust sandwich variance; the standard unweighted Cox model applied naively to case-cohort data produces a biased hazard-ratio estimate."
      },
      {
        "relation_type": "see_also",
        "target_slug": "biomarker-defined-cohort-rwe",
        "notes": "Registry and claims cohorts with stored biospecimens are the primary applied setting for case-cohort designs; one subcohort assay panel supports multiple biomarker hypotheses across multiple outcomes."
      }
    ],
    "aliases": [
      "case-cohort study",
      "Prentice design",
      "subcohort design",
      "case-subcohort design",
      "self-weighted case-cohort"
    ],
    "complexity": "advanced",
    "regulatory_relevance": [
      "fda",
      "ema"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "case-control",
    "name": "Case-Control Study Design",
    "short_definition": "An observational design that samples on outcome status — identifying cases who experience the event and controls who do not — and compares prior exposure odds between them, estimating an odds ratio that, under incidence-density (risk-set) sampling, equals the rate ratio without any rare-disease assumption.",
    "long_description": "The **case-control (CC) design** reverses the logic of a cohort: instead of following exposed and unexposed forward to\nthe outcome, it samples on the **outcome** — identifying everyone (or a sample) who became a **case** and a sample of\n**controls** who did not — and then looks backward at exposure in an etiologically relevant window. Its reason for\nexisting is efficiency: when an outcome is rare or its ascertainment is expensive (adjudication, chart review,\ngenotyping, biospecimen assay), enrolling 300 cases and 1,200 controls answers the question that would otherwise require\nfollowing hundreds of thousands of people in a cohort.\n\n**Core conceptual distinction — what the odds ratio actually estimates.** This is the point reviewers test first, and the\npoint most casual descriptions get wrong. The estimand depends entirely on *how controls are sampled*:\n- **Incidence-density (risk-set) sampling.** Controls are sampled from the population still at risk and event-free at the\n  *instant each case occurs* (the case's index date). Under this scheme the exposure odds ratio estimates the **incidence\n  rate ratio directly — with no rare-disease assumption** (Greenland & Thomas). This is the correct mental model for\n  pharmacoepidemiology and the one that makes a nested CC algebraically equivalent to a Cox model on the full cohort\n  (Breslow's partial likelihood for a matched risk set *is* the conditional-logistic likelihood).\n- **Cumulative (\"traditional\") sampling.** Controls are sampled from those still non-cases at the end of the study\n  period. 
Here the OR estimates the **risk (cumulative-incidence) odds ratio**, which approximates the risk ratio only\n  under the rare-disease assumption (outcome < ~10%).\n- **Case-cohort sampling.** Controls are a random subcohort sampled at baseline (a \"base series\"); with appropriate\n  weighting the OR estimates the **risk ratio** and one subcohort can serve multiple outcomes.\nConflating these three is the classic error: density-sampled CC does *not* need a rare disease, cumulative CC does. State\nthe sampling scheme and the resulting estimand explicitly in the SAP.\n\n**Pros, cons, and trade-offs.**\n- **vs a full cohort (cohort-prospective / cohort-retrospective):** CC needs orders of magnitude less data collection for\n  rare outcomes and lets you spend an expensive ascertainment budget (adjudication, assays) only on cases plus a handful\n  of controls. Cost: it cannot produce absolute risks, incidence, or numbers-needed-to-treat without known sampling\n  fractions; it is more exposed to selection bias through control choice; and it is harder to explain to clinical\n  stakeholders. **Prefer CC** when the outcome is rare or its measurement is the binding cost constraint. **Prefer a\n  cohort** when you need absolute risk, multiple outcomes from one exposure, or the outcome is common.\n- **vs nested case-control:** A nested CC samples cases and risk-set controls from inside an *already-defined* cohort\n  (claims, EHR, registry). It inherits the cohort's clean source population, time-zero, and density-sampling validity\n  while keeping CC efficiency — it is the modern default whenever a usable source cohort exists. Stand-alone CC (case\n  registry + controls sampled from eligibility files) is what you fall back to when no such cohort is available. 
**Prefer\n  nested** essentially always when you have the cohort.\n- **vs self-controlled designs (self-controlled-case-series, case-crossover, case-time-control):** Self-controlled\n  designs use each case as their own control and therefore null out *all* time-invariant confounding (genetics, chronic\n  frailty, baseline SES) by construction — something CC can only attempt through measurement and matching. But they\n  require transient, repeatable exposures and (for SCCS) recurrent or at least non-fatal events, and they cannot study\n  fixed exposures. **Prefer CC** for one-time or chronic exposures, fatal/non-recurrent outcomes, or when between-person\n  factors are themselves of interest; **prefer self-controlled** for acute drug triggers of recurrent events.\n\n**When to use.** Rare outcomes (specific malignancies, agranulocytosis, congenital anomalies, sudden cardiac death);\noutcomes requiring costly adjudication or biospecimens; rapid safety-signal evaluation in a defined source population;\nany setting where you have a clean cohort and want a nested density-sampled CC as an efficient surrogate for a Cox model.\n\n**When NOT to use — and when it is actively misleading.**\n- **The outcome is common and you used cumulative sampling.** The OR then overstates the risk ratio (the rare-disease\n  approximation fails). Switch to density sampling, a case-cohort design, or report the OR as an odds ratio only.\n- **Controls are not drawn from the population that gave rise to the cases.** This is the cardinal sin (Wacholder's\n  \"study base\" principle). Hospital controls for a community case series, or controls whose exposure prevalence differs\n  from the source base for reasons unrelated to disease, produce **selection bias that no analysis can repair**. 
Berkson's\n  bias (hospital-based controls) is the canonical example.\n- **Differential exposure ascertainment by case status.** In primary-data CC this is recall bias; in claims/EHR it is\n  surveillance/detection bias — cases, by virtue of being sicker or in more contact with the system, have *more complete\n  exposure capture* than controls, inflating the OR. Match or adjust on healthcare utilization to blunt it.\n- **Outcome misclassification.** Low PPV of the case algorithm contaminates the case series with non-cases and biases the\n  OR toward the null; this is why CC validity is downstream of the outcome phenotype (see claims-outcome-algorithm-ppv-\n  sensitivity-rwe).\n- **Prevalent exposure mixed in.** If \"exposed\" includes long-term prevalent users, depletion-of-susceptibles and\n  immortal-time issues from the source cohort leak in; apply a new-user/incident restriction inside the exposure window.\n\n**Data-source operational depth.**\n- **Claims (FFS vs MA, commercial):** Cases come from validated outcome algorithms (e.g., 1 inpatient *or* 2 outpatient\n  claims with the diagnosis in a primary/qualifying position, or a procedure + diagnosis). The index date is the\n  case-defining event date; for risk-set controls it is the case's index date assigned to a still-at-risk enrollee.\n  Exposure comes from pharmacy claims (NDC + `fill_date` + `days_supply`) for drugs or medical claims (CPT/HCPCS/J-codes,\n  ICD-10-PCS) for procedures/devices, ascertained in a pre-specified pre-index window (e.g., 90 days for an acute trigger,\n  cumulative for chronic exposure). **Failure modes:** (1) *Medicare Advantage person-time lacks fee-for-service claims* —\n  a control's \"no exposure\" can be unobserved rather than truly unexposed, so require both medical and pharmacy benefit\n  and exclude MA-only person-time. 
(2) *Differential competing risks by exposure in the elderly* — if the exposure raises\n  short-term mortality, those people die before becoming cases and are absent from the risk set, distorting the OR;\n  restrict risk sets to those alive and enrolled at the case's index date and run a competing-risks sensitivity analysis.\n  (3) *Detection bias* — match controls on prior utilization (counts of visits, baseline cost quartile) so cases are not\n  spuriously \"more exposed\" merely because they are in more contact with the system. (4) *Claim reversals/adjudication\n  lag* — net by `claim_id` and avoid the most recent, incompletely adjudicated months.\n- **EHR:** Cases via structured codes + NLP phenotyping, validated on a chart-reviewed subset. Exposure = the medication\n  *order/administration*, not a dispensing, so link to pharmacy fills where possible. The structural threat is\n  **visit-driven (informative) observation**: a control who is healthy and rarely visits looks \"unexposed\" simply because\n  little is recorded. Use encounter counts as a matching/adjustment proxy and define the observation window explicitly.\n- **Registry / linked:** Disease and product registries are often case-enriched and ideal for rare outcomes with\n  adjudicated case status; link to claims for complete pharmacy/exposure history and to a death index to firm up the risk\n  set. Linkage introduces selection (only the linkable subset) and date-discrepancy issues that must be reconciled before\n  index-date assignment.\n\n**Worked claims example (incidence-density nested CC).** Question: does current exposure to a high-risk oral NSAID raise\nthe risk of hospitalized upper-gastrointestinal (UGI) bleeding among adults ≥40 in a commercial + Medicare FFS database?\n(1) *Source cohort:* enrollees with ≥365 days of continuous medical **and** pharmacy enrollment (Parts A/B/D or commercial\nequivalent), excluding MA-only person-time so that absence of a fill is observed, not missing. 
(2) *Cases:* first\nhospitalization with a validated UGI-bleed algorithm (qualifying ICD-10-CM in the primary position on an inpatient claim);\nindex_date = admission date; exclude anyone with a prior UGI bleed in the lookback (incident cases only). (3) *Risk-set\ncontrols:* for each case, sample m = 4 controls from enrollees alive, enrolled, and event-free on that case's index_date,\nmatched on age (±2 y), sex, and index calendar month (matching on time = density sampling). (4) *Exposure:* \"current use\"\n= an NSAID `fill_date` with `days_supply` covering the case's index_date, or a fill ending within a 14-day carryover; the\nreference is non-use in the 90-day pre-index window. (5) *Covariates:* anticoagulant/antiplatelet/PPI use, prior GI\ndiagnoses, and baseline visit count (utilization, to control detection bias), all measured strictly before index_date.\n(6) *Analysis:* conditional logistic regression stratified on the matched set (`clogit` / `STRATA match_id`), reporting\nthe exposure OR as a rate ratio; sensitivity analyses vary the carryover window, the matching ratio, and add a\nnegative-control outcome (e.g., a condition the NSAID should not cause) to detect residual confounding.",
    "primary_category": "Study_Design",
    "tags": [
      "case-control",
      "odds-ratio",
      "incidence-density-sampling",
      "risk-set-matching",
      "conditional-logistic-regression",
      "pharmacoepidemiology",
      "rare-outcomes"
    ],
    "applies_to_study_types": [
      "case_control"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1016/s0140-6736(02)07605-5",
        "url": "https://doi.org/10.1016/s0140-6736(02)07605-5",
        "citation_text": "Schulz KF, Grimes DA. Case-control studies: research in reverse. The Lancet. 2002;359(9304):431-434.",
        "year": 2002,
        "authors_short": "Schulz & Grimes",
        "notes": "Concise, widely cited introduction to the case-control logic (sampling on outcome, looking back at exposure) and its principal pitfalls."
      },
      {
        "role": "explain",
        "doi": "10.1093/oxfordjournals.aje.a116396",
        "url": "https://doi.org/10.1093/oxfordjournals.aje.a116396",
        "citation_text": "Wacholder S, McLaughlin JK, Silverman DT, Mandel JS. Selection of controls in case-control studies: I. Principles. American Journal of Epidemiology. 1992;135(9):1019-1028.",
        "year": 1992,
        "authors_short": "Wacholder et al.",
        "notes": "The authoritative statement of the study-base principle — controls must arise from the same population that produced the cases — and the comparability, deconfounding, and efficiency criteria for valid control selection."
      },
      {
        "role": "explain",
        "doi": "10.1016/j.jclinepi.2004.10.012",
        "url": "https://doi.org/10.1016/j.jclinepi.2004.10.012",
        "citation_text": "Schneeweiss S, Avorn J. A review of uses of health care utilization databases for epidemiologic research on therapeutics. Journal of Clinical Epidemiology. 2005;58(4):323-337.",
        "year": 2005,
        "authors_short": "Schneeweiss & Avorn",
        "notes": "Foundational treatment of conventional and nested case-control implementation in administrative claims, including exposure ascertainment, confounding, and validity threats specific to utilization data."
      },
      {
        "role": "explain",
        "doi": "10.1097/00001648-199505000-00010",
        "url": "https://doi.org/10.1097/00001648-199505000-00010",
        "citation_text": "Suissa S. The case-time-control design. Epidemiology. 1995;6(3):248-253.",
        "year": 1995,
        "authors_short": "Suissa",
        "notes": "Introduces the case-time-control extension that uses a within-person reference period to remove time-invariant confounding from a case-crossover analysis of transient exposures; defines a key self-controlled variant of CC."
      },
      {
        "role": "demonstrate",
        "doi": "10.1056/nejmoa1506115",
        "url": "https://doi.org/10.1056/nejmoa1506115",
        "citation_text": "Filion KB, Azoulay L, Platt RW, et al. A multicenter observational study of incretin-based drugs and heart failure. New England Journal of Medicine. 2016;374(12):1145-1154.",
        "year": 2016,
        "authors_short": "Filion et al.",
        "notes": "Large-scale nested case-control of incretin-based drugs and heart-failure hospitalization, replicated across multiple administrative databases (CNODES distributed network) with risk-set matching and conditional logistic regression — an exemplar of the design in claims data."
      }
    ],
    "plain_language_summary": "A case-control study works backwards from the result. You start by finding people who already have the outcome you care about (the cases) and a comparison group who do not (the controls), then you look back in time to ask who had been exposed to the thing you suspect. Because you can hunt down rare cases directly instead of waiting for them to appear, this design is fast and cheap for uncommon events. The catch: it tells you whether exposure and outcome travel together (an odds ratio), not how many people overall will get sick, and a poorly chosen control group can quietly distort the answer.",
    "key_terms": [
      {
        "term": "case",
        "definition": "A person who already has the outcome you are studying, such as someone hospitalized for a stomach bleed."
      },
      {
        "term": "control",
        "definition": "A comparison person who does not have the outcome, picked to represent the same population the cases came from."
      },
      {
        "term": "exposure",
        "definition": "The thing you suspect might cause the outcome, like having taken a particular drug before getting sick."
      },
      {
        "term": "odds ratio",
        "definition": "A single number comparing the odds of having been exposed among cases versus controls; above 1 means exposure is more common in cases."
      },
      {
        "term": "2x2 table",
        "definition": "A four-box summary that cross-tabulates exposed vs. unexposed against case vs. control, giving the four counts you need to compute the odds ratio."
      }
    ],
    "worked_example": {
      "scenario": "We want to know whether taking a high-risk NSAID pain reliever is linked to being hospitalized for an upper-gastrointestinal (GI) bleed. We assemble 100 cases (adults hospitalized for a GI bleed) and 100 controls (similar adults with no such bleed), then check each person's pharmacy records to see who had filled an NSAID before their index date. We count everyone into a 2x2 table and compute the odds ratio.",
      "dataset": {
        "caption": "The 2x2 table an analyst builds after classifying each of the 200 people by exposure and case status. Cell letters a, b, c, d are labeled for the odds-ratio formula.",
        "columns": [
          "",
          "Cases",
          "Controls"
        ],
        "rows": [
          [
            "Exposed (took NSAID)",
            "a = 90",
            "b = 60"
          ],
          [
            "Unexposed (no NSAID)",
            "c = 10",
            "d = 40"
          ]
        ]
      },
      "steps": [
        "Sort each person into one of four boxes: a = exposed cases (90), b = exposed controls (60), c = unexposed cases (10), d = unexposed controls (40).",
        "The odds of having been exposed among cases is a / c = 90 / 10 = 9.",
        "The odds of having been exposed among controls is b / d = 60 / 40 = 1.5.",
        "The odds ratio compares those two odds: OR = (a / c) / (b / d), which rearranges to the cross-product OR = (a * d) / (b * c).",
        "Plug in the cells: OR = (90 * 40) / (60 * 10) = 3600 / 600 = 6.0."
      ],
      "result": "OR = (a*d)/(b*c) = (90*40)/(60*10) = 3600/600 = 6.0. Cases had 6 times the odds of prior NSAID exposure as controls, suggesting NSAID use is associated with GI bleeding."
    },
    "prerequisites": [],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Incidence-density (risk-set) / nested case-control",
        "description": "Controls are sampled from those still at risk and event-free at the exact index date of each case (matching on time). The exposure odds ratio estimates the incidence rate ratio directly, with no rare-disease assumption, and is algebraically equivalent to a Cox model on the underlying cohort.",
        "edge_cases": [
          "Requires precise person-time and an enrollment/eligibility file so risk sets contain only observable, at-risk enrollees on each index date.",
          "A control sampled at one case time can later become a case (and may be sampled again as a control) — this is correct under density sampling and must not be \"fixed\" by deduplication."
        ],
        "data_source_notes": "claims/EHR: build risk sets by joining cases against enrollment spans on the case index date; standard in pharmacoepidemiology for time-varying drug exposures."
      },
      {
        "name": "Cumulative (traditional) case-control",
        "description": "Controls are sampled from non-cases at the end of follow-up (or from the source population irrespective of timing). The OR estimates the risk odds ratio and approximates the risk ratio only under the rare-disease assumption.",
        "edge_cases": [
          "Biased when the outcome is common or when exposure influences survival/loss to follow-up."
        ],
        "data_source_notes": "claims: simplest to build from a final cohort file (exclude cases), but still requires index-date alignment so the exposure lookback is anchored correctly."
      },
      {
        "name": "1:m matched (risk-set matched) case-control",
        "description": "Each case is matched to m controls on index date and a small set of design variables (age, sex, calendar month, prior utilization), then analyzed with conditional logistic regression on the matched set.",
        "edge_cases": [
          "Over-matching on a variable correlated with exposure removes real signal and reduces power.",
          "Variable matching ratios and incomplete matches require either retaining unmatched cases via flexible strata or documenting their exclusion."
        ],
        "data_source_notes": "claims: match on index month, age, sex, region, and prior-year cost/utilization quartile; analyze with clogit or PROC LOGISTIC STRATA."
      },
      {
        "name": "Case-cohort design",
        "description": "A random subcohort sampled at baseline serves as the comparison base for all cases (and can serve multiple outcomes); weighting (Prentice / Barlow) corrects for the subcohort sampling fraction.",
        "edge_cases": [
          "Requires sampling-fraction weights and robust variance; naive unweighted analysis is biased."
        ],
        "data_source_notes": "claims/EHR: efficient when an expensive covariate (assay, chart abstraction) is collected only on the subcohort plus cases."
      },
      {
        "name": "Case-time-control / case-case-time-control",
        "description": "Self-controlled extensions for transient exposures that remove time-invariant confounding (case-time-control) and exposure-time trends (case-case-time-control) using within-person reference periods.",
        "edge_cases": [
          "Require within-person exposure variation; unsuitable for fixed or continuously maintained exposures."
        ],
        "data_source_notes": "See the dedicated case-time-control and case-case-time-control entries; implemented in claims via within-patient comparison of exposure across pre-defined time windows."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "cohort-prospective",
        "pros_of_this": "Orders-of-magnitude less data collection for rare outcomes; concentrates expensive ascertainment (adjudication, assays) on cases plus a small control sample.",
        "cons_of_this": "Cannot estimate absolute risk, incidence, or NNT without known sampling fractions; more exposed to control- selection bias; less intuitive to clinical stakeholders.",
        "when_to_prefer": "Rare outcomes, or when costly outcome/exposure measurement is the binding constraint."
      },
      {
        "compared_to": "cohort-retrospective",
        "pros_of_this": "Same efficiency advantage in existing databases; easy to add bespoke ascertainment to a small case+control set.",
        "cons_of_this": "A retrospective cohort can yield incidence and absolute risk from the full source population if data are complete; CC must mimic that base through careful sampling.",
        "when_to_prefer": "When the outcome is rare and primary or costly ascertainment on cases/controls is the bottleneck."
      },
      {
        "compared_to": "nested-case-control",
        "pros_of_this": "Stand-alone CC can be built from a case registry plus controls sampled from eligibility files when no source cohort is available.",
        "cons_of_this": "Nested CC inherits a clean source population, time-zero, and density-sampling validity and is the modern default whenever a usable cohort exists.",
        "when_to_prefer": "Use nested CC when a claims/EHR/registry source cohort exists; reserve stand-alone CC for mixed-source settings without one."
      },
      {
        "compared_to": "self-controlled-case-series",
        "pros_of_this": "Handles one-time/chronic exposures and fatal or non-recurrent outcomes; permits study of between-person factors (genetics, frailty) via matching/measurement.",
        "cons_of_this": "Must measure and adjust for time-invariant confounders that SCCS removes by design; vulnerable to control- selection and detection bias that within-person designs avoid.",
        "when_to_prefer": "Non-recurrent or fatal outcomes, fixed/chronic exposures, or when between-person comparison is the scientific goal."
      },
      {
        "compared_to": "claims-outcome-algorithm-ppv-sensitivity-rwe",
        "pros_of_this": "This entry supplies the design and sampling logic; the algorithm entry supplies the case-definition validation (PPV, sensitivity) on which CC validity depends.",
        "cons_of_this": "Case-algorithm validation is a prerequisite — low PPV biases the OR toward the null regardless of design quality.",
        "when_to_prefer": "Use both: validate the outcome algorithm first (or in a substudy), then apply it as the case definition in the CC."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Cases from validated outcome algorithms; index_date = case event date. Risk-set controls sampled from enrollees alive, enrolled (medical + pharmacy, no MA-only person-time), and event-free on the case index date, matched on time plus age/sex and prior utilization. Exposure from pharmacy NDC + fill_date + days_supply (drugs) or CPT/HCPCS/ICD-10-PCS (procedures) in the pre-specified pre-index window. Analyze with conditional logistic (clogit / STRATA). Restrict risk sets to those alive/enrolled to handle competing risks; report match rates and balance on matched and adjusted factors.",
      "ehr": "Cases via structured + NLP phenotyping, validated on a chart-reviewed subset. Exposure = order/administration; link to pharmacy fills to confirm initiation. Counter visit-driven (informative) observation by matching/adjusting on encounter counts; define the observation window explicitly and treat differential loss to follow-up as potentially informative.",
      "registry": "Often case-enriched and ideal for rare, adjudicated outcomes; sample controls from the underlying population or eligibility frame and link to claims/EHR for complete exposure history.",
      "linked": "Claims (exposure completeness) + registry/EHR (validated, adjudicated outcomes) + vital records (death, to firm up the risk set) is the strongest substrate; model linkage error (false matches/non-matches) in sensitivity analyses.",
      "multi-database": "Replicate the CC in each database under a common protocol and (preferably) a common data model (OMOP); pool with meta-analysis or distributed regression. Harmonize the outcome algorithm, exposure window, and matching factors; report database-specific and pooled estimates (cf. the Filion/CNODES network design)."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\nimport pandas as pd\n\nEXPOSURE_WINDOW_DAYS = 90    # pre-index etiologic window for \"current use\"\nCARRYOVER_DAYS = 14          # supply may end shortly before index and still count as current\nAGE_CALIPER = 2\nM_CONTROLS = 4\nRNG = np.random.default_rng(20240601)\n\ndef _enrolled_observable(person_ids, on_date, enroll):\n    \"\"\"Person_ids with continuous, non-MA-only enrollment covering `on_date`.\"\"\"\n    e = enroll[enroll[\"person_id\"].isin(person_ids)]\n    ok = e[(e[\"enroll_start\"] <= on_date) & (e[\"enroll_end\"] >= on_date) & (~e[\"ma_only\"])]\n    return set(ok[\"person_id\"])\n\ndef build_riskset_matched_cc(cases, cohort, enroll, m=M_CONTROLS):\n    case_ids = set(cases[\"person_id\"])\n    rows = []\n    for _, case in cases.iterrows():\n        idx = case[\"index_date\"]\n        # Candidate controls: event-free at idx (not yet a case or a case with a later index date),\n        # enrolled/observable on idx, and within the matching calipers.\n        future_or_noncase = cohort[\n            (~cohort[\"person_id\"].isin(case_ids))\n            | (cohort[\"person_id\"].map(\n                cases.set_index(\"person_id\")[\"index_date\"]).fillna(pd.Timestamp.max) > idx)\n        ]\n        observable = _enrolled_observable(future_or_noncase[\"person_id\"], idx, enroll)\n        pool = future_or_noncase[\n            (future_or_noncase[\"person_id\"].isin(observable))\n            & (future_or_noncase[\"person_id\"] != case[\"person_id\"])\n            & (future_or_noncase[\"sex\"] == case[\"sex\"])\n            & ((future_or_noncase[\"age\"] - case[\"age\"]).abs() <= AGE_CALIPER)\n        ]\n        picks = pool.sample(n=min(m, len(pool)), random_state=int(RNG.integers(1e9)))\n        match_id = f\"set_{case['person_id']}\"\n        rows.append({\"person_id\": case[\"person_id\"], \"match_id\": match_id,\n                     \"is_case\": 1, \"index_date\": idx})\n        for pid in picks[\"person_id\"]:\n    
        rows.append({\"person_id\": pid, \"match_id\": match_id,\n                         \"is_case\": 0, \"index_date\": idx})  # control inherits the case index date\n    return pd.DataFrame(rows)\n\ndef add_current_exposure(cc, rx, drug_class):\n    \"\"\"Flag 'current use': a fill whose [fill_date, fill_date+days_supply+carryover] covers index_date.\"\"\"\n    r = rx[rx[\"drug_class\"] == drug_class].merge(cc[[\"person_id\", \"match_id\", \"index_date\"]],\n                                                 on=\"person_id\")\n    r[\"cov_end\"] = r[\"fill_date\"] + pd.to_timedelta(r[\"days_supply\"] + CARRYOVER_DAYS, unit=\"D\")\n    r[\"covers\"] = (r[\"fill_date\"] <= r[\"index_date\"]) & (r[\"cov_end\"] >= r[\"index_date\"])\n    exposed = r.loc[r[\"covers\"], [\"person_id\", \"match_id\"]].drop_duplicates()\n    exposed[\"exposed\"] = 1\n    out = cc.merge(exposed, on=[\"person_id\", \"match_id\"], how=\"left\")\n    out[\"exposed\"] = out[\"exposed\"].fillna(0).astype(int)\n    return out\n\n# cc = build_riskset_matched_cc(cases, cohort, enroll)\n# cc = add_current_exposure(cc, rx, drug_class=\"NSAID\")\n# -> downstream: conditional logistic on `exposed` stratified by `match_id` (see R/SAS).",
        "description": "Incidence-density risk-set sampling and 1:m matched case-control construction from claims-style inputs. Required inputs\n(already cleaned and de-duplicated):\n  cases  : one row per incident case -> person_id, index_date (datetime), age, sex\n  cohort : everyone in the source population -> person_id, age, sex (the candidate-control universe)\n  enroll : enrollment spans -> person_id, enroll_start, enroll_end, ma_only (bool)  # ma_only lacks FFS claims\n  rx     : pharmacy fills -> person_id, fill_date (datetime), drug_class, days_supply\nDensity sampling: for each case, controls are drawn from enrollees alive, enrolled (no MA-only gap), and event-free on\nthat case's index_date, matched on sex and age (+/-2y); the control inherits the case's index_date so the pre-index\nexposure window is identical. The same person may serve as a control at multiple case times -- that is correct and is\nNOT deduplicated. Fit conditional logistic on the resulting match_id strata downstream.",
        "dependencies": [
          "pandas",
          "numpy"
        ],
        "source_citations": [
          "schneeweiss-2005"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(survival)\nlibrary(data.table)\n\n# --- Risk-set sampling skeleton (density sampling, m controls per case) ---------------\nsample_riskset <- function(cases, cohort, enroll, m = 4L, age_caliper = 2L) {\n  setDT(cases); setDT(cohort); setDT(enroll)\n  out <- vector(\"list\", nrow(cases))\n  for (i in seq_len(nrow(cases))) {\n    idx <- cases$index_date[i]; cid <- cases$person_id[i]\n    observable <- enroll[enroll_start <= idx & enroll_end >= idx & !ma_only, unique(person_id)]\n    pool <- cohort[person_id %in% observable & person_id != cid &\n                   sex == cases$sex[i] & abs(age - cases$age[i]) <= age_caliper, person_id]\n    ctrls <- if (length(pool) > m) sample(pool, m) else pool\n    out[[i]] <- data.table(\n      person_id = c(cid, ctrls),\n      match_id  = paste0(\"set_\", cid),\n      is_case   = c(1L, rep(0L, length(ctrls))),\n      index_date = idx)\n  }\n  rbindlist(out)\n}\n\n# --- Conditional logistic on the matched analytic file --------------------------------\n# cc: person_id, match_id, is_case, exposed, plus pre-index covariates\nfit <- clogit(\n  is_case ~ exposed + anticoagulant + prior_gi_dx + utilization_q + strata(match_id),\n  data = cc, method = \"exact\")\nsummary(fit)        # exp(coef) of `exposed` = incidence rate ratio under density sampling",
        "description": "Conditional logistic regression for the risk-set matched case-control file produced above. Input `cc` has one row per\nsubject: person_id, match_id (the risk set), is_case (1/0), and one column per exposure/covariate measured strictly in\nthe pre-index window. survival::clogit fits the conditional likelihood that, under density sampling, returns the\nincidence rate ratio (exp(beta)). A skeleton risk-set sampler with data.table is shown for completeness.",
        "dependencies": [
          "survival",
          "data.table"
        ],
        "source_citations": [
          "schneeweiss-2005"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* ---- Path 1: conditional logistic via STRATA (matched-sets analysis) ---- */\nproc logistic data=work.cc;\n  strata match_id;                                   /* one stratum per risk set */\n  model is_case(event='1') = exposed anticoagulant prior_gi_dx utilization_q\n        / clodds=pl;                                 /* profile-likelihood CIs for the OR (=RR) */\nrun;\n\n/* ---- Path 2: equivalent incidence-density fit via PROC PHREG ----\n   Encode each matched risk set so all members share a (t0,t1] interval and the case has status=1.\n   Breslow ties + the matched strata reproduce the conditional-logistic estimate (the rate ratio). */\nproc phreg data=work.cc_pheg;\n  strata match_id;\n  model (t0, t1)*is_case(0) = exposed anticoagulant prior_gi_dx utilization_q\n        / ties=breslow risklimits;\nrun;",
        "description": "SAS analysis for the matched case-control. Two genuinely correct paths are shown. (1) PROC LOGISTIC with STRATA fits the\nconditional likelihood for matched risk sets directly. (2) PROC PHREG with a counting-process (start,stop) setup is the\nequivalent route for incidence-density nested CC -- the Breslow/partial likelihood for tied risk sets equals the\nconditional-logistic likelihood, so it returns the same rate ratio. Required input work.cc: person_id, match_id, is_case\n(1/0), exposed (1/0), and pre-index covariates; for PHREG, t1/t0 encode the matched risk set as a (start,stop) interval\nwith status=is_case.",
        "dependencies": [],
        "source_citations": [
          "schneeweiss-2005"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Base[Source population / eligibility + enrollment spans] --> Cases[Identify incident cases<br/>validated outcome algorithm; index_date = event date]\n  Base --> Risk[Risk set at each case index_date<br/>alive, enrolled, FFS-observable, event-free]\n  Cases --> Sample[Density sampling: m controls per case<br/>matched on index_date, age, sex, prior utilization]\n  Risk --> Sample\n  Sample --> Exp[Ascertain exposure + covariates in the<br/>pre-index window only, anchored to index_date]\n  Exp --> Fit[Conditional logistic on matched strata<br/>OR = incidence rate ratio under density sampling]\n  Fit --> Sens[Sensitivity: exposure window, matching ratio,<br/>negative-control outcome, competing-risks check]",
        "caption": "Operational flow for an incidence-density (nested) matched case-control in real-world data. Risk-set sampling on the case index date is what makes the odds ratio equal the rate ratio; exposure and covariates are measured only before index.",
        "alt_text": "Flowchart from source population and enrollment spans through case identification, risk-set construction, density-sampled matched controls, pre-index exposure ascertainment, conditional logistic regression, and sensitivity analyses.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "flowchart TB\n  subgraph Selection[Two ways control sampling biases the OR]\n    A[Controls NOT from the study base<br/>e.g. hospital controls for community cases] -->|Berkson / selection bias| Bias1[OR distorted; unrepairable by analysis]\n    B[Differential exposure capture by case status<br/>sicker cases have more complete records] -->|detection / surveillance bias| Bias2[OR inflated away from null]\n  end\n  Fix1[Sample controls from the same population that produced the cases<br/>Wacholder study-base principle] --> A\n  Fix2[Match / adjust on prior healthcare utilization] --> B",
        "caption": "The two selection threats that define case-control validity and their fixes. Berkson-type bias from off-base controls cannot be repaired analytically; detection bias is mitigated by matching or adjusting on utilization.",
        "alt_text": "Diagram contrasting off-base control selection (Berkson bias, unrepairable) and differential exposure capture (detection bias, inflates the OR) with their respective design and analysis fixes.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "is_variant_of",
        "target_slug": "nested-case-control",
        "notes": "Nested CC samples cases and risk-set controls from inside a defined cohort, inheriting its source population, time-zero, and density-sampling validity; it is the modern default whenever a usable source cohort exists."
      },
      {
        "relation_type": "see_also",
        "target_slug": "cohort-prospective",
        "notes": "A cohort follows exposure forward to incidence and yields absolute risk; CC samples on outcome and looks back, far more efficient for rare outcomes but unable to give absolute risk without sampling fractions."
      },
      {
        "relation_type": "see_also",
        "target_slug": "cohort-retrospective",
        "notes": "Retrospective cohorts in claims can support a nested CC as an efficient sub-design; choose CC when the outcome is rare or when ascertaining exposure and covariates for the full cohort is costly."
      },
      {
        "relation_type": "alternative_to",
        "target_slug": "self-controlled-case-series",
        "notes": "SCCS uses each case as its own control and removes all time-invariant confounding, but needs recurrent/non-fatal events and transient exposures; CC handles fixed/chronic exposures and fatal outcomes at the cost of measuring those confounders."
      },
      {
        "relation_type": "see_also",
        "target_slug": "case-crossover",
        "notes": "The case-crossover design is a self-controlled CC for transient triggers of acute events; case-time-control adds a separate control series to remove exposure-time trends."
      },
      {
        "relation_type": "see_also",
        "target_slug": "case-time-control",
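        "notes": "Case-time-control is the within-person CC extension (Suissa 1995) that adds a control series to estimate and remove the exposure-time trend from a case-crossover analysis of transient exposures, while keeping the within-person control of time-invariant confounding."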
      },
      {
        "relation_type": "requires",
        "target_slug": "claims-outcome-algorithm-ppv-sensitivity-rwe",
        "notes": "The case definition is the outcome algorithm; its PPV and sensitivity bound CC validity, since low PPV biases the OR toward the null. Validate before or within the study."
      },
      {
        "relation_type": "used_with",
        "target_slug": "logistic-regression-for-binary-outcomes",
        "notes": "Conditional (matched) or unconditional logistic regression is the analytic engine of CC; see that entry for OR interpretation, rare-event handling, and separation."
      },
      {
        "relation_type": "requires",
        "target_slug": "time-zero-index-date-alignment-rwe",
        "notes": "Exposure and covariates must be measured strictly before the index date, with controls inheriting the case index date; misalignment reintroduces immortal-time and reverse-causation artifacts."
      },
      {
        "relation_type": "see_also",
        "target_slug": "quantitative-bias-analysis-toolkit-rwe",
        "notes": "Unmeasured confounding and selection bias are the dominant CC threats; use probabilistic bias analysis, negative controls, and E-values to quantify their plausible impact."
      },
      {
        "relation_type": "see_also",
        "target_slug": "medicare-ffs-ma-commercial-claims-differences-rwe",
        "notes": "Medicare Advantage person-time lacks fee-for-service claims, so a control's apparent non-exposure may be unobserved; restrict to FFS-observable enrollment when building risk sets."
      },
      {
        "relation_type": "used_with",
        "target_slug": "propensity-score-methods-psm-iptw",
        "notes": "When nested in a large claims/EHR cohort, high-dimensional propensity scores can match or weight controls within the CC to control measured and proxy confounding."
      }
    ],
    "aliases": [
      "case control",
      "case-control study",
      "conventional case-control",
      "matched case-control",
      "incidence-density case-control",
      "risk-set sampled case-control"
    ],
    "complexity": "foundational",
    "regulatory_relevance": [
      "fda",
      "ema"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "case-crossover",
    "name": "Case-Crossover Design",
    "short_definition": "A self-controlled design in which each case serves as its own control, contrasting exposure during a hazard (\"case\") window immediately before an acute event with exposure during one or more earlier referent (\"control\") windows in the same person, to estimate the transient effect of an intermittent exposure on the risk of abrupt event onset.",
    "long_description": "The **case-crossover design** studies whether a *transient* exposure triggers an *abrupt* event by comparing, within each\ncase, the exposure status during a short window just before the event (the **case/hazard window**) with exposure during one\nor more **referent/control windows** in that same individual at an earlier time. There is no separate control group: each\ncase is matched to itself. Analytically it is a matched case-control study where the matched sets are person-periods, so the\nnatural estimator is **conditional (stratum = person) logistic regression**, and the estimand is the **odds ratio for the\ntransient effect of exposure on event onset** within the hazard window (an approximation of the rate ratio under the usual\nrare-event/incidence-density logic of Mittleman, Maclure & Robins).\n\n**Core conceptual distinction — why self-controlled.** Because both windows come from the same person, *every time-invariant\ncharacteristic is matched out by design*: genetics, sex, baseline comorbidity, frailty, socioeconomic status, chronic\nchanneling, and any other stable confounder — measured or not — cancels in the within-person contrast. This is the design's\nunique selling point relative to cohort or conventional case-control designs, which must measure and adjust for those\nconfounders. 
The price is that the design controls **only** time-invariant confounding; anything that *varies within a person\nover the spacing between windows* (acute illness, season, day of week, secular trends in exposure prevalence) is not\ncontrolled and can bias the estimate badly.\n\n**Estimand and assumptions.** Under (i) a transient exposure with a well-defined effect (induction/hazard) period, (ii) a\nstable distribution of exposure over time within persons (the **\"no exposure-time trend\"** assumption), (iii) no carryover\nof effect from the case window into the referent windows (and vice versa), and (iv) the event not itself altering subsequent\nexposure, the conditional-logistic OR estimates the incidence rate ratio for the acute triggering effect. Violations of (ii)\nand (iii) are the dominant failure modes and motivate the case-time-control and case-case-time-control extensions below.\n\n**Pros, cons, and trade-offs.**\n- **vs cohort / new-user designs:** Case-crossover needs no comparator cohort and is immune to all stable confounding,\n  which is decisive when the suspected confounders are chronic and hard to measure in claims (frailty, lifestyle). Cost: it\n  answers only the *transient-trigger* question, gives an OR not an absolute risk, has no information on chronic-exposure\n  effects, and is vulnerable to within-person time trends that a cohort with an external comparator can absorb. **Prefer**\n  for short-acting drugs and acute events where confounding-by-indication and unmeasured frailty would wreck a cohort.\n- **vs conventional / nested case-control:** Removes between-person confounding entirely and needs no control selection,\n  eliminating one source of selection bias. 
Cost: it cannot study non-time-varying exposures, is less efficient when\n  exposure is rare or near-constant within persons, and shifts the threat from confounding to *exposure-trend* bias.\n- **vs self-controlled case series (SCCS):** Both are self-controlled and remove time-invariant confounding. SCCS models the\n  *full* observation time with a Poisson/conditional-Poisson model, handles recurrent events naturally, requires the event\n  not to censor/curtail observation (or a modified SCCS), and is well suited to vaccine-safety risk-interval analyses.\n  Case-crossover samples discrete referent windows, needs only the cases, and is simpler when the exposure is sharply\n  intermittent and the event is acute and rare. **Prefer SCCS** for recurrent events or when modeling age/time effects\n  explicitly; **prefer case-crossover** for a single acute event triggered by a sharply transient exposure.\n- **vs case-time-control / case-case-time-control:** When background exposure prevalence is changing over calendar time\n  (e.g., a drug's use is rising), the plain case-crossover OR is biased by that *exposure-time trend*. Case-time-control\n  (Suissa 1995) adds a separate control group to estimate and subtract the trend; case-case-time-control (Wang, Schneeweiss)\n  uses future cases as the control series to also absorb within-person trends from disease progression. These are the\n  correct fixes, not heavier covariate adjustment.\n\n**When to use.** (1) The exposure is **intermittent/transient** with a plausibly short induction period (an acute drug\neffect, a single dose, a behavioral trigger, an environmental spike). (2) The outcome is an **abrupt, acute, well-dated**\nevent (MI, GI bleed, motor-vehicle crash, fracture, anaphylaxis, arrhythmia, seizure). (3) Suspected confounders are\nlargely **stable within persons** and hard to measure. 
(4) You want a design that needs only the cases and is robust to\nunmeasured chronic confounding.\n\n**When NOT to use — and when it is actively misleading or dangerous.**\n- **Chronic / continuous exposure.** If exposure is essentially constant across the case and referent windows (a maintenance\n  statin taken daily for years), there is no within-person contrast and the design is uninformative or biased toward the\n  null. Use a cohort/new-user design instead.\n- **Exposure with a secular time trend.** If the drug's prevalence is rising (or falling) over calendar time, exposure is\n  systematically more (or less) likely in the more recent case window than in earlier referent windows for *reasons unrelated\n  to the event*, producing a spurious OR. This is the canonical case-crossover failure and is exactly what **case-time-control**\n  repairs. Bidirectional sampling (referent windows both before and after the event) partly mitigates it but assumes the event\n  does not affect future exposure.\n- **Carryover / long or ill-defined effect window.** If the effect persists across both windows, or the induction period is\n  long relative to window spacing, exposure \"bleeds\" into the referent window and attenuates the contrast. 
Case-crossover\n  demands a *short, well-characterized* hazard period.\n- **Event alters subsequent exposure (within-person reverse causation).** If a prodrome of the event changes exposure (a\n  patient with early symptoms stops or starts a drug), unidirectional (before-only) sampling is biased; bidirectional\n  sampling is invalid because post-event exposure is affected by the event.\n- **Depletion of susceptibles / first-event-only chronic use.** For drugs where susceptible patients have their event early\n  in treatment, the within-person exposure-event association is distorted; SCCS or a new-user cohort is safer.\n\n**Operational sampling choices (these are design decisions, not nuisances).** *Unidirectional* sampling draws referent\nwindows only *before* the event (mandatory when the event can affect future exposure or cause death); it is the more\nconservative default in pharmacoepidemiology. *Bidirectional* sampling draws referents both before and after the event and\ncontrols for *within-person* time trends, but is valid only when post-event exposure is unaffected by the event. The number\nand spacing of referent windows trade efficiency against the exposure-trend and carryover assumptions: more referent windows\nimprove precision (Mittleman, Maclure & Robins show the relative efficiency gains) but widen the calendar span over which the\nno-trend assumption must hold. A further choice is the **point-in-time** approach (was the person exposed on the index day of\neach window?) vs the **window/hazard-period** approach (was there any exposure within a multi-day hazard period?); in claims\nthe window approach with a `days_supply`-defined exposure interval is usually required because dispensing is bursty\n(Hallas et al. 
2024).\n\n**Data-source operational depth.**\n- **Claims (FFS):** Exposure is reconstructed from pharmacy fills — a person is \"exposed\" on a given day if a dispensing's\n  `[fill_date, fill_date + days_supply)` interval covers that day (apply a grace period and stockpiling rule, and account for\n  90-day mail-order and free samples that distort `days_supply`). The event date is the index claim for the acute outcome.\n  Require **continuous medical + pharmacy enrollment** spanning the entire span from the earliest referent window through the\n  event so that absence of a fill is true non-exposure, not unobserved dispensing. Failure modes: **Medicare Advantage and\n  capitated person-time lack fee-for-service claims**, so exposure in a referent window can be missing rather than absent —\n  restrict to enrollees with the relevant medical + Part D / pharmacy benefit and exclude MA-only person-time across *every*\n  window. Outpatient-only acute events may be miscoded or undated; prefer inpatient/ED principal-diagnosis events with a clean\n  admission date as the anchor.\n- **EHR:** Exposure is the *order/administration*, not the fill, and is visit-driven, so the referent windows are populated\n  only when the patient interacts with the system — differential visit intensity around the event (more contact just before\n  an MI) manufactures a spurious within-person exposure trend. 
Link to dispensing where possible and restrict to exposures\n  routinely captured (e.g., inpatient MAR) rather than self-managed OTC drugs.\n- **Registry / linked:** Registries give well-adjudicated, well-dated acute events (the ideal anchor) but rarely complete\n  longitudinal exposure; link to claims for the fill history and to a death index, since unidirectional sampling is mandatory\n  when the event is fatal (no post-event windows exist).\n\n**Worked claims example.** Question: does short-term NSAID exposure trigger acute upper-GI bleeding among older adults in a\nMedicare FFS + Part D database? (1) **Cases:** the first hospitalization with a principal ICD-10 diagnosis of acute upper-GI\nhemorrhage; the **event date** is the admission date. (2) **Enrollment:** require continuous Parts A/B/D and exclude any\nMA-only person-time over the 6 months preceding the event so every window is FFS-observable. (3) **Exposure:** a day is\nNSAID-exposed if any non-aspirin NSAID dispensing's `[fill_date, fill_date + days_supply + 3-day grace)` interval covers it.\n(4) **Hazard (case) window:** the 7 days immediately before the event date. (5) **Referent (control) windows:** four\nnon-overlapping 7-day windows ending 30, 60, 90, and 120 days before the event (unidirectional, because the bleed and any\nhospitalization can alter subsequent drug use). (6) Classify each window as exposed/unexposed by the rule in (3), creating one\nstratum per person with one case-window row and four referent-window rows. (7) Fit **conditional logistic regression**\nstratified on `person_id`; the exponentiated coefficient is the OR for the transient NSAID-triggering effect. (8) **Sensitivity:**\nvary the hazard-window length (3/7/14 days) and grace period; run a **case-time-control** analysis to net out the secular rise\nin NSAID use; and test a **negative-control exposure** (a drug with no plausible acute GI-bleed effect) — its OR should be ~1.",
    "primary_category": "Study_Design",
    "tags": [
      "self-controlled",
      "within-person",
      "transient-exposure",
      "acute-event",
      "conditional-logistic",
      "trigger-study",
      "pharmacoepidemiology",
      "exposure-time-trend"
    ],
    "applies_to_study_types": [
      "case_crossover"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1093/oxfordjournals.aje.a115853",
        "url": "https://doi.org/10.1093/oxfordjournals.aje.a115853",
        "citation_text": "Maclure M. The case-crossover design: a method for studying transient effects on the risk of acute events. American Journal of Epidemiology. 1991;133(2):144-153.",
        "year": 1991,
        "authors_short": "Maclure",
        "notes": "Original formulation of the design; introduces the case/referent within-person contrast and the transient-effect estimand."
      },
      {
        "role": "explain",
        "doi": "10.1146/annurev.publhealth.21.1.193",
        "url": "https://doi.org/10.1146/annurev.publhealth.21.1.193",
        "citation_text": "Maclure M, Mittleman MA. Should we use a case-crossover design? Annual Review of Public Health. 2000;21:193-221.",
        "year": 2000,
        "authors_short": "Maclure & Mittleman",
        "notes": "Definitive review of when the design is valid, including the no-exposure-trend and no-carryover assumptions and the distinction from cohort/SCCS designs."
      },
      {
        "role": "explain",
        "doi": "10.1093/oxfordjournals.aje.a117550",
        "url": "https://doi.org/10.1093/oxfordjournals.aje.a117550",
        "citation_text": "Mittleman MA, Maclure M, Robins JM. Control sampling strategies for case-crossover studies: an assessment of relative efficiency. American Journal of Epidemiology. 1995;142(1):91-98.",
        "year": 1995,
        "authors_short": "Mittleman et al.",
        "notes": "Establishes referent-window sampling strategies, the conditional-logistic/incidence-density-sampling justification, and the efficiency gains from multiple referent windows."
      },
      {
        "role": "demonstrate",
        "doi": "10.1177/0962280208092346",
        "url": "https://doi.org/10.1177/0962280208092346",
        "citation_text": "Delaney JAC, Suissa S. The case-crossover study design in pharmacoepidemiology. Statistical Methods in Medical Research. 2009;18(1):53-65.",
        "year": 2009,
        "authors_short": "Delaney & Suissa",
        "notes": "Pharmacoepidemiology-focused treatment of exposure-window construction from drug data and the threats specific to medication exposures."
      },
      {
        "role": "demonstrate",
        "doi": "10.1097/00001648-199505000-00010",
        "url": "https://doi.org/10.1097/00001648-199505000-00010",
        "citation_text": "Suissa S. The case-time-control design. Epidemiology. 1995;6(3):248-253.",
        "year": 1995,
        "authors_short": "Suissa",
        "notes": "Introduces the case-time-control extension that estimates and removes the exposure-time-trend bias to which the plain case-crossover design is vulnerable."
      }
    ],
    "plain_language_summary": "A case-crossover study asks: was the patient more likely to be taking a short-acting drug in the days just before their emergency hospitalization than they were in calmer periods months earlier? Instead of comparing the patient to other people, the design compares the same person to themselves — the hazard window (a brief period right before the event) versus earlier referent windows when no event occurred. Because both windows come from the same person, anything that is stable about them — their chronic conditions, genetics, income, general frailty — automatically cancels out. The trade-off is that it only works when the exposure is intermittent (taken sometimes but not always) and the event is sudden and well-dated, like a hospitalization.",
    "key_terms": [
      {
        "term": "hazard window",
        "definition": "The short time period immediately before the acute event (for example, the 7 days before a hospitalization), during which we ask whether the patient happened to be taking the drug of interest."
      },
      {
        "term": "referent window",
        "definition": "An earlier stretch of time in the same patient's history — for example, the 7-day period ending 30 days before the event — used as the comparison, standing in for what exposure looked like when nothing acute was happening."
      },
      {
        "term": "within-person control",
        "definition": "Using the same individual's earlier time periods as the comparison group, so that everything stable about that person (age, sex, chronic diseases, lifestyle) is held constant by design."
      },
      {
        "term": "conditional logistic regression",
        "definition": "A statistical method that compares exposure across matched groups — here, the hazard window versus the referent windows within a single patient — and estimates how many times more likely the event was when the exposure was present."
      },
      {
        "term": "exposure-time trend",
        "definition": "A gradual background change in how often people in general are taking a drug over months or years; if this trend exists, the more-recent hazard window will look more exposed than earlier referent windows for reasons unrelated to the event, which would bias the result."
      },
      {
        "term": "odds ratio (OR)",
        "definition": "A number expressing how much more common the acute event was during windows when the patient was exposed to the drug versus windows when they were not; an OR of 4 means the event was 4 times as likely on exposed days within that person."
      }
    ],
    "worked_example": {
      "scenario": "Two older adults were each hospitalized with an acute upper-gastrointestinal bleed. We want to know whether taking ibuprofen in the days just before the bleed raised their risk. For each patient we define one hazard window (the 7 days immediately before admission) and four referent windows (the 7-day periods ending 30, 60, 90, and 120 days before admission). We then look up whether a pharmacy claim for ibuprofen covered any day in each window, using the rule: a fill is 'on board' for days_supply consecutive calendar days starting on the fill date, so a 14-day fill covers the fill date and the 13 days after it. If, across many patients like these two, the drug turns up in the hazard window more often than in the referent windows, that is evidence of a triggering effect.",
      "dataset": {
        "caption": "Pharmacy fill records for two patients. Each row is one prescription dispensed; days_supply tells us how many days that bottle was meant to last. A day is considered 'ibuprofen-exposed' if it falls within [fill_date, fill_date + days_supply).",
        "columns": [
          "person_id",
          "fill_date",
          "drug",
          "days_supply"
        ],
        "rows": [
          [
            1001,
            "2023-09-05",
            "ibuprofen",
            14
          ],
          [
            1002,
            "2023-08-18",
            "ibuprofen",
            7
          ]
        ]
      },
      "steps": [
        "Patient 1001 had a GI bleed admission on 2023-09-15. Their ibuprofen fill on Sep 5 covers Sep 5 through Sep 18 (14 days). The hazard window runs Sep 8–Sep 14 — all seven days fall inside the fill's coverage, so this window is EXPOSED.",
        "Patient 1001's four referent windows (ending Aug 16, Jul 17, Jun 17, May 18) all fall outside any fill. No ibuprofen was on board during any of them, so all four are UNEXPOSED.",
        "Patient 1002 had a GI bleed admission on 2023-09-28. Their ibuprofen fill on Aug 18 covers Aug 18 through Aug 24 (7 days). The hazard window runs Sep 21–Sep 27 — well after the fill expired — so this window is UNEXPOSED.",
        "Patient 1002's referent window ending Aug 29 spans Aug 23–Aug 29. The fill's last covered days are Aug 23 and Aug 24, which fall inside this window, so referent window 1 is EXPOSED. The remaining three referent windows (ending Jul 30, Jun 30, May 31) have no fill coverage and are UNEXPOSED.",
        "Now count the informative pairs. A 'discordant pair' is any (hazard window, referent window) combination where one is exposed and the other is not — these pairs carry all the information about whether hazard-window exposure is unusually high.",
        "Patient 1001 contributes 4 discordant pairs: hazard=EXPOSED paired with each of 4 UNEXPOSED referent windows. Patient 1002 contributes 1 discordant pair: hazard=UNEXPOSED paired with referent window 1=EXPOSED. The other three of Patient 1002's referent windows are concordant (both unexposed) and contribute nothing.",
        "Estimated OR = (pairs where hazard exposed, referent not) ÷ (pairs where referent exposed, hazard not) = 4 ÷ 1 = 4.0."
      ],
      "result": "OR = 4 / 1 = 4.0 — across these two patients the ibuprofen-exposure odds were four times higher in the hazard windows than in the matched referent windows, consistent with NSAID use triggering acute GI bleeding. (In a real study, hundreds of cases would be pooled via conditional logistic regression to get a stable estimate with a confidence interval.)",
      "timeline_spec": {
        "title": "Case-crossover windows for Patient 1001 — NSAID exposure vs. GI bleed event",
        "window": {
          "start": "2023-05-12",
          "end": "2023-09-15",
          "label": "Observable person-time spanning all four referent windows through the event date"
        },
        "events": [
          {
            "label": "Ibuprofen fill",
            "start": "2023-09-05",
            "length_days": 14,
            "quantity": "14 days_supply"
          },
          {
            "label": "GI bleed admission",
            "start": "2023-09-15",
            "length_days": 1,
            "quantity": "event"
          }
        ],
        "spans": [
          {
            "kind": "unexposed",
            "start": "2023-05-12",
            "end": "2023-05-18",
            "label": "Referent 4 (ends 120 d before event): UNEXPOSED"
          },
          {
            "kind": "followup",
            "start": "2023-05-19",
            "end": "2023-06-10",
            "label": "Between referent windows (not assessed)"
          },
          {
            "kind": "unexposed",
            "start": "2023-06-11",
            "end": "2023-06-17",
            "label": "Referent 3 (ends 90 d before event): UNEXPOSED"
          },
          {
            "kind": "followup",
            "start": "2023-06-18",
            "end": "2023-07-10",
            "label": "Between referent windows (not assessed)"
          },
          {
            "kind": "unexposed",
            "start": "2023-07-11",
            "end": "2023-07-17",
            "label": "Referent 2 (ends 60 d before event): UNEXPOSED"
          },
          {
            "kind": "followup",
            "start": "2023-07-18",
            "end": "2023-08-09",
            "label": "Between referent windows (not assessed)"
          },
          {
            "kind": "unexposed",
            "start": "2023-08-10",
            "end": "2023-08-16",
            "label": "Referent 1 (ends 30 d before event): UNEXPOSED"
          },
          {
            "kind": "followup",
            "start": "2023-08-17",
            "end": "2023-09-07",
            "label": "Between last referent window and hazard window (not assessed)"
          },
          {
            "kind": "exposed",
            "start": "2023-09-08",
            "end": "2023-09-14",
            "label": "Hazard window (7 days before event): EXPOSED — ibuprofen fill on board"
          }
        ],
        "result": {
          "label": "Hazard window: EXPOSED | Referent windows: all UNEXPOSED → this patient's windows are all discordant in the same direction, contributing to OR > 1",
          "value": 4.0
        },
        "caption": "Each horizontal band is one 7-day window assessed for ibuprofen exposure in Patient 1001. The four referent windows (blue, UNEXPOSED) represent calm periods months earlier; the hazard window (orange, EXPOSED) sits right before the admission. Because this patient's hazard window is exposed while all four referent windows are not, the drug was unusually present just before the event. Pooling this pattern across many patients, and dividing by the count of the rarer opposite pattern, produces the odds ratio.",
        "alt_text": "Timeline for Patient 1001 running from May 2023 to September 15 2023. Four 7-day bands labeled Referent 4 through Referent 1 are shaded blue (unexposed) at 120, 90, 60, and 30 days before the event. A 14-day ibuprofen fill bar starts September 5. The hazard window (September 8-14) is shaded orange (exposed) immediately before the GI bleed admission marker on September 15."
      }
    },
    "prerequisites": [
      "case-control",
      "new-user-design",
      "self-controlled-case-series"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Unidirectional (retrospective) referent sampling",
        "description": "Referent windows are drawn only from before the event date. Required when the event (or its prodrome/treatment) can affect subsequent exposure, or when the event is fatal so no post-event person-time exists.",
        "edge_cases": [
          "More conservative but cannot separate the exposure effect from a within-person secular time trend in exposure.",
          "Sensitive to the spacing of referent windows relative to any background trend in prescribing."
        ],
        "data_source_notes": "claims: ensure continuous FFS enrollment back through the earliest referent window; exclude MA-only person-time in any window so absence of a fill is true non-exposure."
      },
      {
        "name": "Bidirectional / ambidirectional referent sampling",
        "description": "Referent windows are drawn both before and after the event date, which controls for within-person exposure-time trends by symmetry.",
        "edge_cases": [
          "Invalid when the event affects future exposure (reverse causation) or when the event causes death/loss to follow-up.",
          "Requires sufficient post-event observation time with the same enrollment/observability as pre-event time."
        ],
        "data_source_notes": "claims/EHR: confirm continuous post-event enrollment; verify the event does not systematically trigger starting/stopping the study drug."
      },
      {
        "name": "Case-time-control",
        "description": "Adds a separate control group whose exposure trend over the same calendar windows is used to estimate and subtract the background exposure-time trend from the case-crossover odds ratio (Suissa 1995).",
        "edge_cases": [
          "Requires that the control group's exposure trend validly represents the cases' counterfactual trend; a poorly chosen control series can over- or under-correct."
        ],
        "data_source_notes": "claims: select controls from the same source population and require identical enrollment/observability rules across all windows for cases and controls."
      },
      {
        "name": "Window (hazard-period) vs point-in-time exposure assessment",
        "description": "Window approach classifies a referent/case window as exposed if any exposure falls within a multi-day hazard period; point-in-time classifies exposure on a single index day per window.",
        "edge_cases": [
          "In claims, dispensing is bursty, so point-in-time misclassifies; the window approach with a days_supply-defined interval is usually required (Hallas et al. 2024).",
          "Window length must match the biological induction period; too long induces carryover, too short misses the effect."
        ],
        "data_source_notes": "claims: define exposure as days covered by [fill_date, fill_date + days_supply + grace); calibrate the hazard-window length in sensitivity analyses."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Cohort / new-user design",
        "pros_of_this": "Eliminates all time-invariant confounding (measured or not) by within-person matching; needs no comparator cohort; robust to unmeasured chronic frailty/channeling.",
        "cons_of_this": "Estimates only the transient-trigger OR, not absolute risk or chronic-exposure effects; vulnerable to within-person exposure-time trends and carryover that a cohort with an external comparator can absorb.",
        "when_to_prefer": "Short-acting/intermittent exposure and an acute, well-dated event where unmeasured stable confounding would dominate a cohort analysis."
      },
      {
        "compared_to": "Conventional / nested case-control",
        "pros_of_this": "Removes between-person confounding entirely and needs no separate control selection, eliminating control-selection bias.",
        "cons_of_this": "Cannot study non-time-varying exposures; less efficient when within-person exposure is rare or near-constant; trades confounding bias for exposure-trend bias.",
        "when_to_prefer": "When suspected confounders are stable within persons and the exposure is genuinely transient."
      },
      {
        "compared_to": "Self-controlled case series (SCCS)",
        "pros_of_this": "Simpler when the event is a single acute, rare occurrence triggered by a sharply intermittent exposure; uses only the cases and discrete referent windows.",
        "cons_of_this": "Less natural for recurrent events; does not model the full observation time or age/time effects explicitly as SCCS does.",
        "when_to_prefer": "Single acute event with a short induction period; choose SCCS for recurrent events or vaccine risk-interval analyses."
      },
      {
        "compared_to": "Case-time-control / case-case-time-control",
        "pros_of_this": "Simpler to specify and communicate; needs only the cases.",
        "cons_of_this": "Provides no protection against background exposure-time trends; the plain OR is biased when prescribing prevalence changes over calendar time.",
        "when_to_prefer": "When exposure prevalence is stable over the study window; otherwise use case-time-control (secular trend) or case-case-time-control (also within-person/progression trends)."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Exposure on a given day = any pharmacy fill whose [fill_date, fill_date + days_supply + grace) interval covers that day. Event date = the acute outcome's index claim (prefer inpatient/ED principal diagnosis with a clean admission date). Require continuous medical + pharmacy enrollment across all referent and case windows; exclude Medicare Advantage-only person-time in any window so absence of a fill is true non-exposure, not missingness.",
      "ehr": "Exposure = order/administration and is visit-driven; differential visit intensity around the event manufactures a spurious within-person exposure trend. Link to dispensing where possible and restrict to reliably captured exposures (e.g., inpatient MAR) rather than self-managed OTC drugs.",
      "registry": "Strong, adjudicated, well-dated acute events (the ideal anchor) but usually incomplete longitudinal exposure; link to claims for the fill history.",
      "linked": "Linked claims-EHR-registry-vital-records gives adjudicated event dates plus complete exposure plus mortality; mortality forces unidirectional sampling (no post-event windows). Reconcile order/fill/service dates before assigning window exposure."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\nimport pandas as pd\nimport statsmodels.api as sm\n\nHAZARD_DAYS = 7          # length of each window\nREFERENT_LAGS = [30, 60, 90, 120]  # days before event_date at which each referent window ENDS\nGRACE = 3                # days_supply grace to bridge late refills\n\ndef _covered(days_set: set, win_end, win_len=HAZARD_DAYS) -> int:\n    \"\"\"1 if any exposed day falls inside the window [win_end - win_len + 1, win_end].\"\"\"\n    win = {win_end - pd.Timedelta(days=d) for d in range(win_len)}\n    return int(bool(days_set & win))\n\ndef build_case_crossover(events: pd.DataFrame, rx: pd.DataFrame,\n                         enroll: pd.DataFrame) -> pd.DataFrame:\n    # Earliest day any window can reach back to; used for the enrollment-observability check.\n    earliest_lag = max(REFERENT_LAGS) + HAZARD_DAYS\n    ev = events.merge(enroll, on=\"person_id\")\n    ev[\"covers\"] = ((ev[\"enroll_start\"] <= ev[\"event_date\"] - pd.Timedelta(days=earliest_lag)) &\n                    (ev[\"enroll_end\"]   >= ev[\"event_date\"]) &\n                    (~ev[\"ma_only\"]))                 # every window must be FFS-observable\n    ok = ev.loc[ev[\"covers\"], \"person_id\"].unique()\n    ev = events[events[\"person_id\"].isin(ok)].copy()\n\n    # Per person, the set of calendar days covered by the exposure of interest.\n    rx = rx[rx[\"person_id\"].isin(ok)].copy()\n    rx[\"start\"] = rx[\"fill_date\"]\n    rx[\"end\"]   = rx[\"fill_date\"] + pd.to_timedelta(rx[\"days_supply\"] + GRACE, unit=\"D\")\n    exposed_days = {}\n    for pid, g in rx.groupby(\"person_id\"):\n        days = set()\n        for _, r in g.iterrows():\n            days |= {r[\"start\"] + pd.Timedelta(days=d)\n                     for d in range((r[\"end\"] - r[\"start\"]).days)}\n        exposed_days[pid] = days\n\n    rows = []\n    for _, r in ev.iterrows():\n        pid, e = r[\"person_id\"], r[\"event_date\"]\n        days = exposed_days.get(pid, set())\n        
# Case window: the HAZARD_DAYS immediately before (and including) the event date.\n        rows.append({\"person_id\": pid, \"is_case_window\": 1,\n                     \"exposed\": _covered(days, e)})\n        # Referent windows: each ends `lag` days before the event date (unidirectional).\n        for lag in REFERENT_LAGS:\n            rows.append({\"person_id\": pid, \"is_case_window\": 0,\n                         \"exposed\": _covered(days, e - pd.Timedelta(days=lag))})\n    return pd.DataFrame(rows)\n\ndef fit_case_crossover(df: pd.DataFrame):\n    # Conditional logistic = stratified-by-person logit of the case-window indicator on exposure.\n    model = sm.ConditionalLogit(endog=df[\"is_case_window\"],\n                                exog=df[[\"exposed\"]],\n                                groups=df[\"person_id\"])\n    res = model.fit(disp=False)\n    or_ = np.exp(res.params[\"exposed\"])\n    ci = np.exp(res.conf_int().loc[\"exposed\"])\n    return res, or_, (ci[0], ci[1])",
        "description": "Build case/referent person-period rows from claims and fit conditional logistic regression. Required inputs (cleaned,\nde-duplicated):\n  events : one acute event per person -> person_id, event_date (datetime)\n  rx     : pharmacy fills            -> person_id, fill_date (datetime), days_supply (int)  # exposure of interest only\n  enroll : continuous FFS spans      -> person_id, enroll_start, enroll_end, ma_only (bool) # ma_only lacks FFS claims\nUses a 7-day hazard window and four earlier 7-day referent windows (unidirectional). Output feeds statsmodels\nConditionalLogit (stratum = person_id); the exp(coef) is the transient-effect odds ratio.",
        "dependencies": [
          "pandas",
          "numpy",
          "statsmodels"
        ],
        "source_citations": [
          "maclure-1991",
          "mittleman-1995"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\nlibrary(survival)\n\nHAZARD_DAYS  <- 7L\nREFERENT_LAGS <- c(30L, 60L, 90L, 120L)  # days before event_date at which each referent window ENDS\nGRACE        <- 3L\n\nbuild_case_crossover <- function(events, rx, enroll) {\n  setDT(events); setDT(rx); setDT(enroll)\n  earliest_lag <- max(REFERENT_LAGS) + HAZARD_DAYS\n\n  # Keep only people FFS-observable across every window (no MA-only person-time).\n  ev <- merge(events, enroll, by = \"person_id\")\n  ok <- ev[enroll_start <= event_date - earliest_lag &\n           enroll_end   >= event_date & !ma_only, unique(person_id)]\n  ev <- events[person_id %in% ok]\n  rx <- rx[person_id %in% ok]\n\n  # Per-person set of exposed calendar days from [fill_date, fill_date + days_supply + grace).\n  exposed_days <- rx[, .(day = unlist(Map(function(s, n) s + seq_len(n) - 1L,\n                                          fill_date, days_supply + GRACE))),\n                     by = person_id]\n\n  covered <- function(pid, win_end) {\n    win <- win_end - (0:(HAZARD_DAYS - 1L))\n    as.integer(nrow(exposed_days[person_id == pid & day %in% win]) > 0L)\n  }\n\n  out <- rbindlist(lapply(seq_len(nrow(ev)), function(i) {\n    pid <- ev$person_id[i]; e <- ev$event_date[i]\n    case <- data.table(person_id = pid, is_case_window = 1L, exposed = covered(pid, e))\n    refs <- rbindlist(lapply(REFERENT_LAGS, function(lag)\n      data.table(person_id = pid, is_case_window = 0L, exposed = covered(pid, e - lag))))\n    rbind(case, refs)\n  }))\n  out[]\n}\n\nfit_case_crossover <- function(df) {\n  # Conditional logistic regression stratified on person; exp(coef) is the OR.\n  clogit(is_case_window ~ exposed + strata(person_id), data = df)\n}",
        "description": "Build case/referent rows and fit conditional logistic regression with survival::clogit (the standard tool). Inputs mirror\nthe Python version:\n  events : person_id, event_date (Date)\n  rx     : person_id, fill_date (Date), days_supply (integer)   # exposure of interest only\n  enroll : person_id, enroll_start, enroll_end, ma_only (logical)\nclogit(case_window ~ exposed + strata(person_id)) returns the transient-effect odds ratio via exp(coef).",
        "dependencies": [
          "data.table",
          "survival"
        ],
        "source_citations": [
          "maclure-1991",
          "mittleman-1995"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "%let hazard = 7;     /* window length (days)                          */\n%let grace  = 3;     /* days_supply grace                             */\n\n/* Keep only people FFS-observable back through the earliest referent window (no MA-only person-time). */\nproc sql;\n  create table eligible as\n  select e.person_id, e.event_date\n  from work.events e\n  where exists (\n    select 1 from work.enroll n\n    where n.person_id = e.person_id and n.ma_only = 0\n      and n.enroll_start <= e.event_date - (120 + &hazard)\n      and n.enroll_end   >= e.event_date );\nquit;\n\n/* Per-person exposed calendar days from [fill_date, fill_date + days_supply + grace). */\ndata exposed_days;\n  set work.rx;\n  do offset = 0 to (days_supply + &grace - 1);\n    exp_day = fill_date + offset;\n    output;\n  end;\n  keep person_id exp_day;\nrun;\nproc sort data=exposed_days nodupkey; by person_id exp_day; run;\n\n/* One case window + four referent windows per person; exposed=1 if any exposed day falls in the window. */\ndata windows;\n  set eligible;\n  array lags[5] _temporary_ (0 30 60 90 120);   /* 0 = case window; others = referent end-lags */\n  do k = 1 to 5;\n    is_case_window = (k = 1);\n    win_end = event_date - lags[k];\n    output;\n  end;\n  keep person_id event_date is_case_window win_end;\nrun;\n\nproc sql;\n  create table cc as\n  select w.person_id, w.is_case_window,\n         (case when exists (\n             select 1 from exposed_days d\n             where d.person_id = w.person_id\n               and d.exp_day between w.win_end - (&hazard - 1) and w.win_end )\n            then 1 else 0 end) as exposed\n  from windows w;\nquit;\n\n/* Conditional logistic regression: matched set = person_id. */\nproc logistic data=cc;\n  strata person_id;\n  model is_case_window(event='1') = exposed;\n  /* exp(estimate) for EXPOSED = transient-effect odds ratio; add /sensitivity windows in reruns. */\nrun;",
        "description": "Build case/referent person-period rows and fit conditional logistic regression with PROC LOGISTIC + STRATA (the true\nconditional likelihood for matched sets). Required inputs (post data-management):\n  work.events : person_id, event_date\n  work.rx     : person_id, fill_date, days_supply           /* exposure of interest only */\n  work.enroll : person_id, enroll_start, enroll_end, ma_only (0/1)\nUse STRATA person_id (NOT a PHREG tie-handling hack); exp(estimate) for EXPOSED is the transient-effect odds ratio.",
        "dependencies": [],
        "source_citations": [
          "maclure-1991",
          "mittleman-1995"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": "case-crossover-timeline.svg",
        "mermaid": null,
        "caption": "Each horizontal band is one 7-day window assessed for ibuprofen exposure in Patient 1001. The four referent windows (blue, UNEXPOSED) represent calm periods months earlier; the hazard window (orange, EXPOSED) sits right before the admission. Because this one patient's hazard window is exposed while all four referent windows are not, the comparison screams 'the drug was unusually present just before the event.' Pooling this pattern across many patients — and dividing by the rare opposite pattern — produces the odds ratio.",
        "alt_text": "Timeline for Patient 1001 running from May 2023 to September 15 2023. Four 7-day bands labeled Referent 4 through Referent 1 are shaded blue (unexposed) at 120, 90, 60, and 30 days before the event. A 14-day ibuprofen fill bar starts September 5. The hazard window (September 8-14) is shaded orange (exposed) immediately before the GI bleed admission marker on September 15.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "timeline\n  title Case-crossover windows for one subject (unidirectional, referent to event date)\n  120 days before : Referent window 4 : exposed? assessed from days_supply coverage\n  90 days before : Referent window 3 : exposed? assessed from days_supply coverage\n  60 days before : Referent window 2 : exposed? assessed from days_supply coverage\n  30 days before : Referent window 1 : exposed? assessed from days_supply coverage\n  Event date : Case / hazard window : exposed? assessed in the 7 days before the event",
        "caption": "One subject contributes a single case (hazard) window just before the acute event and four earlier referent windows. Conditional logistic regression compares case-window exposure to referent-window exposure within the person, so all time-invariant confounders cancel.",
        "alt_text": "Timeline showing four referent windows at 120, 90, 60, and 30 days before the event date and one case/hazard window immediately before the event date for a single subject.",
        "source_type": "illustrative",
        "source_citations": [
          "mittleman-1995"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  U[Time-invariant confounders<br/>genetics, sex, chronic comorbidity, SES] -. blocked by within-person matching .-> X\n  V[Time-varying factors<br/>exposure-time trend, season, acute illness] --> X[Transient exposure<br/>in window]\n  V --> Y[Acute event onset]\n  X --> Y\n  classDef blocked fill:#e8f5e9,stroke:#2e7d32;\n  classDef threat fill:#fdecea,stroke:#c62828;\n  class U blocked;\n  class V threat;",
        "caption": "Why the design works and where it fails. Stable confounders (U) are matched out by the within-person contrast (dashed blocked path). Time-varying factors (V) - especially secular exposure-time trends - remain open backdoor paths and are the motivation for case-time-control and bidirectional sampling.",
        "alt_text": "Directed graph showing time-invariant confounders blocked by within-person matching while time-varying factors open a backdoor path from exposure to the acute event.",
        "source_type": "illustrative",
        "source_citations": [
          "maclure-1991"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "see_also",
        "target_slug": "self-controlled-case-series",
        "notes": "Both are self-controlled designs that remove time-invariant confounding; SCCS models the full observation time and recurrent events with a (conditional) Poisson model, while case-crossover samples discrete referent windows for a single acute event."
      },
      {
        "relation_type": "alternative_to",
        "target_slug": "case-time-control",
        "notes": "Case-time-control adds a separate control group to estimate and subtract the exposure-time trend that biases the plain case-crossover odds ratio when prescribing prevalence changes over calendar time."
      },
      {
        "relation_type": "alternative_to",
        "target_slug": "case-case-time-control",
        "notes": "Case-case-time-control uses future cases as the control series to absorb both secular and within-person (disease-progression) exposure trends that the case-crossover and case-time-control designs may leave uncontrolled."
      },
      {
        "relation_type": "see_also",
        "target_slug": "case-control",
        "notes": "Case-crossover is operationally a matched case-control study in which the matched control person-time comes from the case's own earlier referent windows rather than from separate control subjects."
      },
      {
        "relation_type": "see_also",
        "target_slug": "nested-case-control",
        "notes": "Both are case-based sampling designs; nested case-control samples controls from a cohort's risk sets, whereas case-crossover samples referent windows from within each case."
      },
      {
        "relation_type": "used_with",
        "target_slug": "negative-control-exposures-rwe",
        "notes": "A negative-control exposure with no plausible acute triggering effect should yield an odds ratio near 1; a deviation flags residual exposure-time-trend or carryover bias in the case-crossover analysis."
      }
    ],
    "aliases": [
      "case crossover",
      "case-crossover study",
      "crossover case-control design"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "ema"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "case-report",
    "name": "Case Report",
    "short_definition": "A descriptive observational design that documents the detailed clinical course, exposures, interventions, and outcomes of a single patient (or a handful), reported under the CARE checklist to generate hypotheses and surface safety signals that aggregate analyses miss.",
    "long_description": "A **case report** is a structured narrative of one patient's clinical course — relevant history, the index exposure (drug,\ndevice, procedure, vaccine), co-interventions, the outcome, and the investigator's reasoning about whether the exposure\nplausibly caused the outcome. In RWE and pharmacoepidemiology its job is **signal generation and detailed description**, not\nmeasurement: it documents an unexpected adverse event, a novel phenotype, or an exceptional treatment response in enough\ngranularity that others can recognize the same pattern. The contemporary reporting standard is the **CARE checklist** (title,\nkeywords, abstract, introduction, patient information, clinical findings, *timeline*, diagnostic assessment, therapeutic\nintervention, follow-up and outcomes, discussion, patient perspective, informed consent); a report that omits the timeline\nor the consent statement is not publishable to current standards.\n\n**Why there is no estimand here.** A case report has **no denominator and no comparison group**, so it cannot estimate a\nrate, a risk, or an effect — there is nothing to be unbiased *about*. This is the defining feature, not a limitation to be\npatched. It means the familiar machinery of comparative RWE (time zero, washout, propensity scores, competing-risk\nestimands) does not apply to the *inference*, even though that machinery reappears below as a **case-finding** tool. The\npractical decision rule: if your question is \"does A cause/prevent Y?\" or \"how common is Y?\", a case report is the wrong\ndesign and you need a comparative cohort, case-control, or self-controlled analysis. 
If your question is \"has anyone ever\nseen Y after A, and what did it look like?\", the case report is the *right* and often only feasible design: it is how\nthalidomide embryopathy and clozapine agranulocytosis first entered the literature.\n\n**Pros, cons, and trade-offs.**\n- **vs case series:** A single report is faster, cheaper, and can stand alone for a truly first-in-kind event, but it\n  supports no pattern recognition and invites over-interpretation of a coincidence. **Prefer a single report** only for the\n  first description of a novel signal or an N-of-1 response; the moment you have 2-5 similar patients, a **case series**\n  with a denominator-aware discussion is more persuasive.\n- **vs case-control / cohort:** A case report needs no sampling frame, control group, or analysis plan and turns around in\n  weeks, but it cannot quantify association or incidence and its selection (\"only the interesting one got written up\") and\n  information biases are extreme. **Prefer a comparative design** the instant the goal shifts from \"describe\" to \"estimate.\"\n- **vs spontaneous-report systems (FAERS / EudraVigilance):** A spontaneous report is a coded line item optimized for\n  disproportionality screening across millions of records; a case report is a rich, adjudicated narrative optimized for\n  causality assessment of one patient. 
They are complements: a disproportionality signal motivates pulling and publishing\n  illustrative case reports, and a striking case report motivates querying the spontaneous-report database for corroboration.\n- **vs systematic review:** Individual case reports are the raw material a later review of rare events synthesizes, but each\n  report carries severe publication bias (positive/dramatic cases are written up, null observations are not), so a review\n  of case reports estimates the *literature*, not the *world*.\n\n**When to use.** First description of a novel or extremely rare adverse event, phenotype, or treatment response; an N-of-1\nresponse worth flagging before any formal study is feasible; a regulatory or pharmacovigilance signal that needs an\nadjudicated narrative to accompany a coded spontaneous report; teaching and recognition of a clinical pattern. Dechallenge\n(event resolves when the drug is stopped) and rechallenge (event recurs on re-exposure) are the strongest within-case\ncausality evidence and should be documented whenever they occurred naturally — never engineered.\n\n**When NOT to use — and when it is actively misleading.**\n- **As evidence of frequency or effect.** Reporting \"we observed 3 cases\" with no denominator and treating it as an\n  incidence, or implying causality from a single temporal association, is the classic abuse. 
A case report cannot rule out\n  coincidence, confounding by indication, or progression of the underlying disease.\n- **When a comparative design is feasible and the question is comparative.** Publishing a case report of an outcome that a\n  powered cohort could actually study substitutes anecdote for evidence and can entrench a false signal that later studies\n  waste resources refuting.\n- **As a denominator-free safety claim in a regulatory submission.** Case reports support hypothesis generation and\n  causality narratives; they do not establish risk magnitude and cannot anchor a labeling change on their own.\n- **When consent/de-identification is impossible.** A single patient is inherently identifiable from a rich timeline plus\n  rare diagnosis plus dates; HIPAA-compliant de-identification and documented informed consent are mandatory, and IRB or\n  privacy-board review is often required even though the study is \"just one patient.\"\n\n**Data-source operational depth.** In RWD, claims/EHR/registry play two distinct roles: (a) **case-finding** — efficiently\nflagging chart-review candidates from millions of records — and (b) **narrative source** — supplying the clinical detail the\nreport ultimately needs. They are good at very different things.\n- **Claims (FFS or commercial):** Excellent for *screening* (rare diagnosis code + specific NDC within a defined window\n  after dispensing + continuous enrollment + no prior history of the event), poor for *narrative* (no labs, notes, imaging,\n  or causality detail). Failure modes: Medicare Advantage and capitated person-time drop fee-for-service claims, so an\n  apparent \"no prior event\" can be missingness rather than a true negative history — restrict screening to enrollees with\n  observable A/B/D (or commercial medical+pharmacy) across the full lookback. 
`days_supply` and 90-day mail-order distort\n  the exposure timeline; claim reversals and coordination-of-benefits create phantom or duplicated exposures; place of\n  service matters for infused biologics (outpatient admin vs inpatient). Always pull the chart before believing a claims-only\n  case.\n- **EHR:** The primary *narrative* source — notes, problem lists, labs, pathology, imaging, exact dosing/administration,\n  and patient-reported outcomes — and the right substrate for documenting dechallenge/rechallenge with serial labs. Limit:\n  visit-driven capture means care delivered outside the system is invisible, so a \"complete\" EHR timeline may have silent\n  gaps; NLP or manual abstraction is needed to assemble the CARE timeline.\n- **Registry (device, pregnancy, rare disease):** Often *mandates* case reporting for enrolled patients and is strong for\n  indication, severity, and adjudicated outcomes; typically weak for full pharmacy exposure. Pregnancy registries with\n  mother-infant linkage enable reports on in-utero exposure outcomes that no single data source captures alone.\n- **Linked claims-EHR-vital-records:** The ideal substrate — claims completeness for exposure/utilization, EHR for the\n  adjudicated narrative, and a death index for fatal outcomes — but report the linkage quality and any date discrepancies\n  among order, fill, and service dates.\n\n**Worked claims example — suspected drug-induced liver injury (DILI).** Goal: surface chart-review candidates for a case\nreport of acute hepatocellular injury after initiation of a hepatotoxic agent, e.g., a newly marketed kinase inhibitor.\n(1) **Index exposure:** first pharmacy fill of any NDC on the drug's list (`fill_date` = candidate index date); the patient\nmust be a new user: no fill of any NDC on that list in the prior 180 days. 
(2) **Continuous, observable enrollment:** require 365 days\nof continuous medical + pharmacy FFS-observable enrollment before the index fill (so absence of prior liver disease is real,\nnot unobserved), and at least 90 observable days after it. (3) **Clean lookback:** no diagnosis of chronic liver disease,\nviral hepatitis, alcoholic liver disease, or biliary obstruction in the 365-day baseline, and no dispensing of a known\ncompeting hepatotoxin (e.g., high-dose acetaminophen, isoniazid, methotrexate) in the 30 days around index. (4) **Outcome\nwindow:** a *new* diagnosis of acute hepatocellular/hepatic injury (relevant ICD-10 codes) within 90 days after the index\nfill — the latency window appropriate to idiosyncratic DILI. (5) **Output:** a short candidate list (typically a handful of\npatients), each flagged for chart review of LFT trajectories (ALT/AST/alkaline phosphatase/bilirubin), R-ratio, exclusion\nof alternative causes, and — critically — any documented **dechallenge** (LFTs normalizing after stopping the drug) or\ninadvertent **rechallenge**. The claims query does *not* establish causality; its only job is to find the patient whose\nadjudicated timeline becomes the CARE-structured report and feeds downstream pharmacovigilance and, if a pattern emerges, a\nformal comparative study.",
    "primary_category": "Study_Design",
    "tags": [
      "descriptive",
      "single-patient",
      "signal-generation",
      "pharmacovigilance",
      "case-finding",
      "care-checklist",
      "hypothesis-generating"
    ],
    "applies_to_study_types": [
      "case_report"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": null,
        "url": "https://encepp.europa.eu/encepp-toolkit/methodological-guide_en",
        "citation_text": "ENCePP Guide on Methodological Standards in Pharmacoepidemiology. European Medicines Agency / ENCePP.",
        "year": 2023,
        "authors_short": "ENCePP",
        "notes": "Source of the descriptive study-type taxonomy used in this catalog; situates the case report among hypothesis-generating designs."
      },
      {
        "role": "introduce",
        "doi": "10.1186/1752-1947-7-223",
        "url": "https://doi.org/10.1186/1752-1947-7-223",
        "citation_text": "Gagnier JJ, Kienle G, Altman DG, Moher D, Sox H, Riley D. The CARE guidelines: consensus-based clinical case report guideline development. Journal of Medical Case Reports. 2013;7:223.",
        "year": 2013,
        "authors_short": "Gagnier et al.",
        "notes": "The defining CARE statement and 13-item reporting checklist (timeline, exposure detail, diagnostics, intervention, outcomes, patient perspective, consent); the contemporary standard for what a case report must contain."
      },
      {
        "role": "explain",
        "doi": "10.1136/bmj.326.7403.1346",
        "url": "https://doi.org/10.1136/bmj.326.7403.1346",
        "citation_text": "Aronson JK. Anecdotes as evidence: we need guidelines for reporting anecdotes of suspected adverse drug reactions. BMJ. 2003;326(7403):1346.",
        "year": 2003,
        "authors_short": "Aronson",
        "notes": "Articulates the epistemic role and limits of single case reports in pharmacovigilance — why an adjudicated anecdote generates signals it cannot itself confirm."
      },
      {
        "role": "demonstrate",
        "doi": "10.1016/j.jclinepi.2017.04.026",
        "url": "https://doi.org/10.1016/j.jclinepi.2017.04.026",
        "citation_text": "Riley DS, Barber MS, Kienle GS, et al. CARE guidelines for case reports: explanation and elaboration document. Journal of Clinical Epidemiology. 2017;89:218-235.",
        "year": 2017,
        "authors_short": "Riley et al.",
        "notes": "Item-by-item explanation with worked examples for each CARE element; the practical manual for assembling a complete, transparent report including the causality-relevant timeline."
      }
    ],
    "plain_language_summary": "A case report is a detailed written story of what happened to a single patient: their history, the drug or treatment they got, the problem that showed up afterward, and the doctor's reasoning about whether the treatment might have caused it. Its job is to describe and to raise a question (\"has anyone ever seen this before?\"), not to measure anything. Because there is only one patient and no comparison group, you cannot turn it into a rate or a risk — there is no group of people to divide by, so \"3 out of how many?\" has no answer. That is the whole point of the design, not a flaw to fix.",
    "key_terms": [
      {
        "term": "denominator",
        "definition": "The bottom number of a fraction — the total group you would divide by to get a rate; a single patient gives you a top number with nothing to divide by."
      },
      {
        "term": "signal generation",
        "definition": "Spotting a possible new problem worth investigating later, without yet proving it is real."
      },
      {
        "term": "dechallenge",
        "definition": "The drug is stopped and the problem goes away — a clue (not proof) that the drug was involved."
      },
      {
        "term": "rechallenge",
        "definition": "The patient is given the drug again and the same problem comes back — stronger evidence the drug was involved, but only documented when it happened naturally, never set up on purpose."
      },
      {
        "term": "CARE checklist",
        "definition": "The standard list of items a published case report must include, such as a patient timeline and a consent statement."
      },
      {
        "term": "case-finding",
        "definition": "Using a large database to flag a small handful of patients worth pulling for a detailed chart review."
      }
    ],
    "worked_example": {
      "scenario": "One adult patient starts a newly marketed cancer drug on January 1, 2025 — their first-ever fill of it, with no liver problems in the year before. About seven weeks later they develop sudden liver injury. The drug is stopped, and their liver tests return to normal over the next month. We want to write up exactly what happened to this one person. Notice what we are NOT doing: we are not asking how often this happens or whether the drug causes liver injury in general, because we have one patient and no comparison group.",
      "dataset": {
        "caption": "The patient's event log — the kind of date-and-event rows an analyst pulls from the chart and claims to build a timeline. One patient, listed in time order.",
        "columns": [
          "person_id",
          "date",
          "event"
        ],
        "rows": [
          [
            4471,
            "2024-01-01",
            "Continuous insurance coverage begins (clean year, no liver disease on record)"
          ],
          [
            4471,
            "2025-01-01",
            "First fill of suspect cancer drug (day zero for this patient)"
          ],
          [
            4471,
            "2025-02-20",
            "Symptom onset: jaundice and fatigue; liver tests sharply elevated"
          ],
          [
            4471,
            "2025-02-22",
            "Acute liver injury diagnosed"
          ],
          [
            4471,
            "2025-03-01",
            "Suspect drug stopped"
          ],
          [
            4471,
            "2025-03-31",
            "Liver tests back to normal (resolution)"
          ]
        ]
      },
      "steps": [
        "Lay every row out in calendar order for this one patient — that ordered list of dates and events IS the case report's timeline.",
        "The clean year before day zero (Jan 2024 to Jan 2025) shows no prior liver disease, so the new liver problem did not exist before the drug.",
        "Liver injury appears on Feb 20, 2025, which is 50 days after the first fill — inside the few-month window in which this kind of reaction typically shows up.",
        "The drug is stopped on Mar 1 and the liver tests return to normal by Mar 31; the problem going away after stopping is the strongest within-one-patient clue that the drug was involved.",
        "Count the patients you could divide by: there is exactly one, and no untreated comparison person, so no rate, risk, or percentage can be calculated — only a vivid description of this single course."
      ],
      "result": "This is a description of ONE patient (person 4471), not a measurement. The timeline shows a clean year, a first drug fill on day zero, liver injury 50 days later, and recovery once the drug stopped. There is no denominator, so no rate, risk, or percentage exists — the report's value is the detailed, credible story and the question it raises, not any number.",
      "timeline_spec": {
        "title": "Case-report timeline for one patient: suspected drug-induced liver injury",
        "window": {
          "start": "2024-01-01",
          "end": "2025-04-01",
          "label": "One patient only — no comparison group, so no rate can be computed"
        },
        "events": [
          {
            "label": "First fill of suspect drug (day zero)",
            "start": "2025-01-01",
            "length_days": 1,
            "quantity": "1 patient, first-ever fill"
          },
          {
            "label": "Symptom onset (jaundice, fatigue)",
            "start": "2025-02-20",
            "length_days": 2,
            "quantity": "day 50 after first fill"
          },
          {
            "label": "Acute liver injury diagnosed",
            "start": "2025-02-22",
            "length_days": 1,
            "quantity": "day 52"
          },
          {
            "label": "Suspect drug stopped",
            "start": "2025-03-01",
            "length_days": 1,
            "quantity": "day 59"
          },
          {
            "label": "Liver tests back to normal (resolution)",
            "start": "2025-03-31",
            "length_days": 1,
            "quantity": "30 days after stopping"
          }
        ],
        "spans": [
          {
            "kind": "exposed",
            "start": "2025-01-01",
            "end": "2025-03-01",
            "label": "On the suspect drug (59 days)"
          },
          {
            "kind": "followup",
            "start": "2025-03-01",
            "end": "2025-04-01",
            "label": "Drug stopped — dechallenge: liver tests normalize"
          }
        ],
        "result": {
          "label": "One patient described — no denominator, no rate computable",
          "value": 1
        },
        "caption": "A single patient's clinical course: a clean baseline year, the first drug fill on day zero, liver injury 50 days later, the drug stopped on day 59, and recovery 30 days after stopping (a dechallenge). Because this is one patient with no comparison group, the timeline can describe what happened but cannot produce any rate or risk.",
        "alt_text": "Horizontal timeline for one patient showing a clean baseline year, a first drug fill marked as day zero on 2025-01-01, an exposed period of 59 days, symptom onset and liver-injury diagnosis at about day 50, the drug stopped at day 59, and liver tests returning to normal 30 days later; labeled as one patient with no denominator and no rate computable."
      }
    },
    "prerequisites": [],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Adverse event / safety-signal case report",
        "description": "Detailed, adjudicated narrative of an unexpected or serious event temporally associated with a drug, vaccine, device, or procedure; the strongest evidence is documented dechallenge and (rarely, naturally occurring) rechallenge.",
        "edge_cases": [
          "Confounding by co-medications and progression of the underlying disease must be addressed explicitly, not assumed away.",
          "A coincidental temporal association is indistinguishable from causation in a single case without dechallenge/rechallenge or a strong biological mechanism."
        ],
        "data_source_notes": "claims: screen NDC + new outcome diagnosis within a latency-appropriate window post-dispense, clean lookback for the event and competing causes, then pull the chart. Code the event to MedDRA for cross-referencing with FAERS/EudraVigilance."
      },
      {
        "name": "Rare disease or novel-phenotype case report",
        "description": "Description of an unusual presentation, natural history, or treatment response not previously characterized in real-world practice.",
        "data_source_notes": "ehr/registry: richest for biomarkers, genetics, imaging, and longitudinal course; link claims for the full medication and utilization history that the EHR may miss for out-of-system care."
      },
      {
        "name": "Pregnancy or pediatric exposure case report",
        "description": "Maternal exposure (drug, vaccine) with fetal/infant outcomes, or pediatric off-label use; a frequent mandated input to pregnancy and product registries.",
        "data_source_notes": "claims/registry: anchor exposure timing to fill dates relative to estimated conception/gestational age, capture infant outcomes through mother-infant linkage. See pregnancy-exposure-window-rwe, mother-infant-linkage-rwe."
      },
      {
        "name": "N-of-1 within-person observation",
        "description": "A single patient observed across exposure and non-exposure periods (often informal dechallenge/rechallenge); illustrates within-person logic that a self-controlled design formalizes at population scale.",
        "edge_cases": [
          "A within-person pattern in one patient is hypothesis-generating only; it gains evidentiary weight only when replicated across patients in a self-controlled-case-series or case-crossover analysis."
        ],
        "data_source_notes": "ehr: serial labs/vitals across on- and off-treatment windows provide the within-person contrast; document timing precisely."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "case-series",
        "pros_of_this": "Faster and cheaper; can stand alone for a genuinely first-in-kind event.",
        "cons_of_this": "No pattern recognition or even crude denominator; high risk of over-interpreting a coincidence.",
        "when_to_prefer": "First description of a novel signal or an N-of-1 response, before 2-5 similar patients are available."
      },
      {
        "compared_to": "case-control",
        "pros_of_this": "No sampling frame, controls, or analysis plan required; turns around in weeks for rapid hypothesis generation.",
        "cons_of_this": "Cannot quantify association or incidence; selection and information bias are extreme.",
        "when_to_prefer": "Signal generation or detailed mechanistic description before a formal comparative study is feasible."
      },
      {
        "compared_to": "safety-signal-case-definition-rwe",
        "pros_of_this": "Provides the adjudicated, narrative-rich single case that anchors causality reasoning behind a coded signal.",
        "cons_of_this": "Lacks the disproportionality and denominator logic of systematic signal-detection across a spontaneous-report database.",
        "when_to_prefer": "When a striking individual case needs to be packaged for pharmacovigilance review or to illustrate a statistically detected signal."
      },
      {
        "compared_to": "systematic-review",
        "pros_of_this": "Supplies the raw, granular material later syntheses of rare events depend on.",
        "cons_of_this": "Each report carries severe publication bias; a review of reports estimates the literature, not the population.",
        "when_to_prefer": "Contributing a documented rare event to an evidence base that will later be systematically reviewed."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Use as a case-finding screen only (rare outcome diagnosis + specific NDC within a latency-appropriate post-dispense window + continuous FFS-observable enrollment across lookback and follow-up + clean baseline for the event and competing causes). Exclude Medicare Advantage-only person-time where fee-for-service claims are unavailable, so absence of prior history is observed rather than missing. Pull the chart before believing any claims-only case; claims carry no narrative and cannot establish causality.",
      "ehr": "Primary narrative source — notes, problem lists, labs, pathology, imaging, exact dosing/administration, PROs — and the right place to document dechallenge/rechallenge with serial labs. Visit-driven capture leaves silent gaps for out-of-system care; use NLP or manual abstraction to assemble the CARE timeline.",
      "registry": "Often mandates case reporting for enrolled patients (device, pregnancy, rare disease); strong for indication, severity, and adjudicated outcomes, weak for full pharmacy exposure. Supplement with claims/EHR for the complete exposure-outcome picture.",
      "linked": "Ideal substrate — claims completeness + EHR narrative + vital-records mortality. Report linkage quality and reconcile order/fill/service date discrepancies before fixing the timeline."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\n\nLOOKBACK_DAYS = 365   # clean-baseline + observable-time requirement before index\nNEWUSER_DAYS  = 180   # no prior target-drug fill within this window => incident exposure\nLATENCY_DAYS  = 90     # idiosyncratic DILI onset window after first dispense\nFOLLOWUP_DAYS = 90     # minimum observable follow-up to see (or rule out) the event\n\ndef find_dili_candidates(rx: pd.DataFrame, dx: pd.DataFrame, enroll: pd.DataFrame) -> pd.DataFrame:\n    rx = rx.sort_values([\"person_id\", \"fill_date\"])\n\n    # Index = first fill of the suspect (TARGET) drug.\n    target = rx[rx[\"drug_group\"] == \"TARGET\"]\n    idx = (target.groupby(\"person_id\", as_index=False)[\"fill_date\"].min()\n                 .rename(columns={\"fill_date\": \"index_date\"}))\n\n    # New-user: drop anyone with a TARGET fill in the NEWUSER_DAYS before their index date.\n    t = target.merge(idx, on=\"person_id\")\n    prior = t[(t[\"fill_date\"] < t[\"index_date\"]) &\n              (t[\"fill_date\"] >= t[\"index_date\"] - pd.Timedelta(days=NEWUSER_DAYS))]\n    idx = idx[~idx[\"person_id\"].isin(prior[\"person_id\"])].copy()\n\n    # Continuous, FFS-observable enrollment across full lookback through index + minimum follow-up (no MA-only gaps).\n    e = enroll.merge(idx, on=\"person_id\")\n    covers = ((e[\"enroll_start\"] <= e[\"index_date\"] - pd.Timedelta(days=LOOKBACK_DAYS)) &\n              (e[\"enroll_end\"]   >= e[\"index_date\"] + pd.Timedelta(days=FOLLOWUP_DAYS)) &\n              (~e[\"ma_only\"]))\n    idx = idx[idx[\"person_id\"].isin(e.loc[covers, \"person_id\"])].copy()\n\n    # Clean baseline: exclude pre-existing chronic liver disease in the 365-day lookback.\n    cl = dx[dx[\"dx_group\"] == \"CHRONIC_LIVER\"].merge(idx, on=\"person_id\")\n    chronic = cl[(cl[\"dx_date\"] < cl[\"index_date\"]) &\n                 (cl[\"dx_date\"] >= cl[\"index_date\"] - pd.Timedelta(days=LOOKBACK_DAYS))]\n    idx = 
idx[~idx[\"person_id\"].isin(chronic[\"person_id\"])].copy()\n\n    # Clean baseline: exclude a competing hepatotoxin dispensed within +/-30 days of index.\n    hep = rx[rx[\"drug_group\"] == \"HEPATOTOXIN\"].merge(idx, on=\"person_id\")\n    confounded = hep[(hep[\"fill_date\"] >= hep[\"index_date\"] - pd.Timedelta(days=30)) &\n                     (hep[\"fill_date\"] <= hep[\"index_date\"] + pd.Timedelta(days=30))]\n    idx = idx[~idx[\"person_id\"].isin(confounded[\"person_id\"])].copy()\n\n    # Outcome: NEW acute liver injury within the latency window after index.\n    li = dx[dx[\"dx_group\"] == \"LIVER_INJURY\"].merge(idx, on=\"person_id\")\n    events = li[(li[\"dx_date\"] > li[\"index_date\"]) &\n                (li[\"dx_date\"] <= li[\"index_date\"] + pd.Timedelta(days=LATENCY_DAYS))]\n    first_event = events.groupby(\"person_id\", as_index=False)[\"dx_date\"].min() \\\n                        .rename(columns={\"dx_date\": \"event_date\"})\n\n    candidates = idx.merge(first_event, on=\"person_id\")\n    candidates[\"days_to_event\"] = (candidates[\"event_date\"] - candidates[\"index_date\"]).dt.days\n    # Hand off to chart review (LFT trajectory, R-ratio, alternative causes, dechallenge/rechallenge).\n    return candidates[[\"person_id\", \"index_date\", \"event_date\", \"days_to_event\"]] \\\n               .sort_values(\"days_to_event\").reset_index(drop=True)",
        "description": "Case-FINDING screen for a drug-induced-liver-injury (DILI) case report. This does NOT estimate anything; it returns a\nshort list of chart-review candidates for adjudication. Required inputs (cleaned, de-duplicated):\n  rx     : pharmacy fills    -> person_id, fill_date (datetime64), ndc, drug_group ('TARGET' for the suspect drug,\n                                'HEPATOTOXIN' for known competing hepatotoxins)\n  dx     : diagnoses         -> person_id, dx_date (datetime64), icd10, dx_group ('LIVER_INJURY' acute hepatocellular,\n                                'CHRONIC_LIVER' chronic/viral/alcoholic/biliary)\n  enroll : enrollment spans  -> person_id, enroll_start, enroll_end, ma_only (bool)  # ma_only person-time lacks FFS claims\nA candidate is a NEW user of the target drug, continuously observable across a 365-day lookback and >=90 days after,\nwith no chronic liver disease or competing hepatotoxin in the baseline, who then develops acute liver injury within the\n90-day latency window. Output rows are inputs to chart review, not to any inferential model.",
        "dependencies": [
          "pandas"
        ],
        "source_citations": [
          "riley-2017"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\nLOOKBACK <- 365L; NEWUSER <- 180L; LATENCY <- 90L; FOLLOWUP <- 90L\n\nfind_dili_candidates <- function(rx, dx, enroll) {\n  setDT(rx); setDT(dx); setDT(enroll)\n\n  # Index = first fill of the suspect (TARGET) drug.\n  idx <- rx[drug_group == \"TARGET\", .(index_date = min(fill_date)), by = person_id]\n\n  # New-user: drop anyone with a TARGET fill within NEWUSER days before index.\n  t <- merge(rx[drug_group == \"TARGET\"], idx, by = \"person_id\")\n  prior_ids <- unique(t[fill_date < index_date & fill_date >= index_date - NEWUSER, person_id])\n  idx <- idx[!person_id %chin% prior_ids]\n\n  # Continuous FFS-observable enrollment across lookback through index + minimum follow-up.\n  e <- merge(enroll, idx, by = \"person_id\")\n  ok <- e[enroll_start <= index_date - LOOKBACK &\n          enroll_end   >= index_date + FOLLOWUP & !ma_only, unique(person_id)]\n  idx <- idx[person_id %chin% ok]\n\n  # Clean baseline: no chronic liver disease in lookback.\n  cl <- merge(dx[dx_group == \"CHRONIC_LIVER\"], idx, by = \"person_id\")\n  chronic_ids <- unique(cl[dx_date < index_date & dx_date >= index_date - LOOKBACK, person_id])\n  idx <- idx[!person_id %chin% chronic_ids]\n\n  # Clean baseline: no competing hepatotoxin within +/-30 days of index.\n  hep <- merge(rx[drug_group == \"HEPATOTOXIN\"], idx, by = \"person_id\")\n  conf_ids <- unique(hep[fill_date >= index_date - 30L & fill_date <= index_date + 30L, person_id])\n  idx <- idx[!person_id %chin% conf_ids]\n\n  # Outcome: new acute liver injury within the latency window after index.\n  li <- merge(dx[dx_group == \"LIVER_INJURY\"], idx, by = \"person_id\")\n  ev <- li[dx_date > index_date & dx_date <= index_date + LATENCY,\n           .(event_date = min(dx_date)), by = .(person_id, index_date)]\n  ev[, days_to_event := as.integer(event_date - index_date)]\n  setorder(ev, days_to_event)\n  ev[, .(person_id, index_date, event_date, days_to_event)]   # -> chart review\n}",
        "description": "Case-FINDING screen for a DILI case report, data.table. Inputs mirror the Python version:\n  rx     : person_id, fill_date (Date), ndc, drug_group ('TARGET'/'HEPATOTOXIN')\n  dx     : person_id, dx_date (Date), icd10, dx_group ('LIVER_INJURY'/'CHRONIC_LIVER')\n  enroll : person_id, enroll_start, enroll_end, ma_only (logical)\nReturns chart-review candidates (incident target-drug users with clean baseline who develop acute liver injury in the\nlatency window). This is screening, not estimation.",
        "dependencies": [
          "data.table"
        ],
        "source_citations": [
          "riley-2017"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "%let lookback = 365; %let newuser = 180; %let latency = 90; %let followup = 90;\n\n/* Index = first fill of the suspect (TARGET) drug. */\nproc sql;\n  create table idx as\n  select person_id, min(fill_date) as index_date format=date9.\n  from work.rx\n  where drug_group = 'TARGET'\n  group by person_id;\nquit;\n\n/* New-user: drop anyone with a prior TARGET fill within the new-user window before index. */\nproc sql;\n  create table newuser as\n  select i.* from idx i\n  where not exists (\n    select 1 from work.rx r\n    where r.person_id = i.person_id and r.drug_group = 'TARGET'\n      and r.fill_date <  i.index_date\n      and r.fill_date >= i.index_date - &newuser\n  );\nquit;\n\n/* Continuous FFS-observable enrollment across lookback through index + minimum follow-up (no MA-only spans). */\nproc sql;\n  create table enrolled as\n  select n.* from newuser n\n  where exists (\n    select 1 from work.enroll e\n    where e.person_id = n.person_id and e.ma_only = 0\n      and e.enroll_start <= n.index_date - &lookback\n      and e.enroll_end   >= n.index_date + &followup\n  );\nquit;\n\n/* Clean baseline: no chronic liver disease in lookback AND no competing hepatotoxin within +/-30 days of index. */\nproc sql;\n  create table clean as\n  select c.* from enrolled c\n  where not exists (\n          select 1 from work.dx d\n          where d.person_id = c.person_id and d.dx_group = 'CHRONIC_LIVER'\n            and d.dx_date <  c.index_date\n            and d.dx_date >= c.index_date - &lookback)\n    and not exists (\n          select 1 from work.rx h\n          where h.person_id = c.person_id and h.drug_group = 'HEPATOTOXIN'\n            and h.fill_date between c.index_date - 30 and c.index_date + 30);\nquit;\n\n/* Outcome: new acute liver injury within the latency window after index -> chart-review candidates. 
*/\nproc sql;\n  create table candidates as\n  select c.person_id, c.index_date,\n         min(d.dx_date) as event_date format=date9.,\n         min(d.dx_date) - c.index_date as days_to_event\n  from clean c\n  inner join work.dx d\n    on d.person_id = c.person_id and d.dx_group = 'LIVER_INJURY'\n   and d.dx_date >  c.index_date\n   and d.dx_date <= c.index_date + &latency\n  group by c.person_id, c.index_date\n  order by days_to_event;\nquit;",
        "description": "Case-FINDING screen for a DILI case report in SAS, mirroring the Python/R logic with PROC SQL set operations (no\nestimation procedure is appropriate for a case report). Required input datasets (post data-management):\n  work.rx     : person_id, fill_date, ndc, drug_group ('TARGET'/'HEPATOTOXIN')\n  work.dx     : person_id, dx_date, icd10, dx_group ('LIVER_INJURY'/'CHRONIC_LIVER')\n  work.enroll : person_id, enroll_start, enroll_end, ma_only (0/1)\nOutput work.candidates is the chart-review list (LFT trajectory, R-ratio, alternative causes, dechallenge/rechallenge).",
        "dependencies": [],
        "source_citations": [
          "riley-2017"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": "case-report-timeline.svg",
        "mermaid": null,
        "caption": "A single patient's clinical course: a clean baseline year, the first drug fill on day zero, liver injury 50 days later, the drug stopped on day 59, and recovery 30 days after stopping (a dechallenge). Because this is one patient with no comparison group, the timeline can describe what happened but cannot produce any rate or risk.",
        "alt_text": "Horizontal timeline for one patient showing a clean baseline year, a first drug fill marked as day zero on 2025-01-01, an exposed period of 59 days, symptom onset and liver-injury diagnosis at about day 50, the drug stopped at day 59, and liver tests returning to normal 30 days later; labeled as one patient with no denominator and no rate computable.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Src[RWD source: claims / EHR / registry] --> Screen[Case-finding screen<br/>rare outcome dx + specific NDC<br/>in latency window post-dispense]\n  Screen --> Elig[New user + continuous observable enrollment<br/>+ clean baseline for event and competing causes]\n  Elig --> Pull[Pull chart / EHR narrative:<br/>labs, imaging, notes, exact dosing]\n  Pull --> Adj[Adjudicate causality:<br/>timeline, alternative explanations,<br/>dechallenge / rechallenge]\n  Adj --> Care[Assemble CARE-structured report<br/>+ informed consent + de-identification]\n  Care --> Out[Contribute to case series / registry<br/>/ pharmacovigilance signal evaluation]",
        "caption": "Case-report workflow in real-world data. Claims/EHR/registry serve as a case-finding screen; the publishable report requires the chart-derived narrative, an adjudicated causality assessment, and CARE-compliant structure with consent.",
        "alt_text": "Flowchart from a real-world data source through a case-finding screen, eligibility and clean-baseline checks, chart pull, causality adjudication, CARE-structured reporting, and contribution to a series, registry, or signal evaluation.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "gantt\n  title Case-report timeline for one patient (DILI example)\n  dateFormat YYYY-MM-DD\n  axisFormat %b %Y\n  section Baseline\n  Clean lookback: no chronic liver disease, observable enrollment :done, base, 2024-01-01, 2024-12-31\n  section Index\n  First fill of suspect drug (incident exposure) :milestone, t0, 2025-01-01, 0d\n  section On-treatment\n  Suspect drug exposure :active, exp, 2025-01-01, 60d\n  section Outcome and challenge\n  Acute liver injury onset (within latency window) :crit, ev, 2025-02-20, 1d\n  Drug stopped; LFTs normalize (dechallenge) :done, dech, 2025-03-01, 30d",
        "caption": "Single-patient causality timeline. The clean baseline establishes no pre-existing liver disease, the index marks incident exposure, the event falls within the idiosyncratic-DILI latency window, and a documented dechallenge (resolution on withdrawal) provides the strongest within-case causality evidence.",
        "alt_text": "Gantt timeline showing a clean baseline year, an index exposure on 2025-01-01, an acute liver-injury event within the latency window, and a dechallenge with normalizing liver tests after the drug is stopped.",
        "source_type": "illustrative",
        "source_citations": [
          "riley-2017"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "see_also",
        "target_slug": "case-series",
        "notes": "Case reports are the building blocks of a case series; aggregating similar reports enables pattern recognition before formal analytic studies."
      },
      {
        "relation_type": "see_also",
        "target_slug": "safety-signal-case-definition-rwe",
        "notes": "A single adjudicated case report supplies the narrative behind a coded safety signal; systematic signal detection adds the disproportionality and denominator logic the report lacks."
      },
      {
        "relation_type": "see_also",
        "target_slug": "endpoint-adjudication-chart-review-rwe",
        "notes": "Case reports in RWD almost always require chart review/adjudication for clinical validity and causality assessment beyond claims codes."
      },
      {
        "relation_type": "see_also",
        "target_slug": "systematic-review",
        "notes": "Individual case reports of rare events are synthesized in systematic reviews; account for severe publication bias when aggregating them."
      },
      {
        "relation_type": "used_with",
        "target_slug": "self-controlled-case-series",
        "notes": "An N-of-1 within-person observation in a case report illustrates the within-person logic that SCCS formalizes, with a population rate the single case cannot provide."
      },
      {
        "relation_type": "see_also",
        "target_slug": "pregnancy-registry",
        "notes": "Pregnancy exposure case reports are a key input to pregnancy registries and inform dedicated pregnancy RWE designs."
      },
      {
        "relation_type": "see_also",
        "target_slug": "diagnosis-phenotype-algorithm-1ip-2op-time-window-rwe",
        "notes": "Rare cases found in claims can inform or challenge outcome/phenotype algorithm development and PPV assessment."
      }
    ],
    "aliases": [
      "case report",
      "clinical case report",
      "single-patient case report",
      "N-of-1 case report",
      "adverse event case report"
    ],
    "complexity": "foundational",
    "regulatory_relevance": [
      "fda",
      "ema"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "case-series",
    "name": "Case Series",
    "short_definition": "A descriptive, single-group study design that reports the characteristics, treatment, and outcomes of a set of patients selected because they share an exposure or diagnosis, with no comparator group and no defined denominator, and therefore no basis for rate or effect estimation.",
    "long_description": "A **case series** assembles a group of patients who share a defining feature — a treatment, a procedure, a diagnosis, or\na rare exposure — and *describes* them: who they are, what was done, and what happened. It is the simplest analytic unit\nabove the single **case report**, and the defining feature is what it lacks: there is **no comparison group** and, in the\npure form, **no enumerated denominator (population at risk)**. Because of those two absences, a case series can describe\n(\"12 of 40 patients had remission\") but cannot legitimately *estimate a rate* (it has no population at risk and no\nperson-time) or a *causal effect* (it has no counterfactual). Its scientific job is **description and hypothesis\ngeneration**, not estimation or confirmation.\n\n**Core conceptual distinction — case series vs cohort.** This is the single most-confused boundary in observational\ntaxonomy, and getting it wrong inflates the apparent evidence base. The discriminating question (Dekkers et al. 2012;\nMathes & Pieper 2017) is: *was the group defined by exposure/disease and then followed for outcomes within a population\nwhose denominator and person-time you can count?* If yes — sampled from a defined source population, with a denominator\nand follow-up — it is a **cohort study**, even if it has only one arm. If the group was defined by *already having the\noutcome or having been treated*, assembled retrospectively without a denominator, it is a **case series**. A \"single-arm\nstudy\" with continuous enrollment, an index date, and computable person-time is a cohort and should be analyzed and\nreported as one; calling it a case series understates its rigor, while calling an uncontrolled, denominator-free series a\ncohort overstates its rigor. 
The related boundaries: a **case report** is a series of n=1; a **case-control** study starts\nfrom cases *and* controls (it has a comparator and is analytic, not descriptive); the **self-controlled case series\n(SCCS)** uses only cases but is a fully causal within-person method (each case is its own control across exposed and\nunexposed time) — it merely borrows the word \"case series\" and is a different design entirely.\n\n**Pros, cons, and trade-offs.**\n- **vs case report (n=1):** A series buys you pattern, range, and a rough sense of frequency *within the reported group*\n  (e.g., \"8 of 30 developed the adverse event\"), which a single report cannot. Cost: it is still selected and\n  denominator-free. **Prefer a series** over scattered reports when several cases of a novel pattern exist.\n- **vs single-arm cohort / single-arm external-control study:** A cohort defines a source population, an index date, and\n  person-time, so it yields *incidence* and supports time-to-event analysis; with an external or historical comparator it\n  can support (cautious) comparative inference. A case series yields none of that. Cost: a cohort requires a denominator\n  and continuous observation you may not have for a rare event. **Prefer a cohort** whenever a denominator is obtainable —\n  in claims/EHR it almost always is, which makes a true denominator-free \"case series\" rare and usually a sign of a\n  missing denominator rather than a deliberate design.\n- **vs case-control / case-crossover / SCCS:** These are analytic designs with internal comparators (external controls,\n  or the case's own time) and *do* support effect estimation from cases. A descriptive case series does not. 
**Prefer the\n  self-controlled designs** when you have cases and want a causal estimate while controlling time-invariant confounding.\n- **Speed and feasibility (the genuine advantage):** A case series is fast, cheap, needs no comparator or sampling frame,\n  and is often the *only* feasible design for an ultra-rare event or a brand-new exposure where no denominator yet exists.\n  It is the natural first description of an emerging signal.\n\n**When to use.** Early characterization of a new drug/device/procedure or a newly recognized adverse event; natural\nhistory of an ultra-rare disease where no cohort denominator is assemblable; documenting the clinical spectrum,\npresentation, and management of a condition; **hypothesis generation and safety-signal description** that a later cohort,\ncase-control, SCCS, or trial will test. In pharmacovigilance, aggregated spontaneous reports and case series are the\nsubstrate of *signal detection* (FDA Sentinel signal identification, EMA/PRAC PSUR review).\n\n**When NOT to use — and when a case series is actively misleading or dangerous.**\n- **Any comparative or causal claim.** \"Patients on drug X did better than expected\" is uninterpretable: there is no\n  comparator and no adjustment for confounding by indication. Treating an uncontrolled series as evidence of effectiveness\n  is the classic error that controlled designs exist to prevent.\n- **Reporting a \"rate,\" \"incidence,\" or \"response rate\" computed from the series alone.** With no population at risk and\n  no person-time, the denominator is the reported cases themselves — a *selected* set (only patients who survived to be\n  treated, were referred to a center, and were chosen for write-up). \"Response in 24/30 = 80%\" silently conditions on\n  selection and survival; it is not the response rate in any real population. 
If you can compute a defensible denominator,\n  you are no longer doing a case series — report it as a cohort with incidence.\n- **Confusing a denominator-bearing single-arm cohort with a case series** (Dekkers/Mathes-Pieper): mislabeling demotes a\n  rigorous cohort and lets a denominator-free series masquerade as one — in either direction the evidence grade is wrong.\n- **Signal *confirmation*, label changes, or HTA comparative-effectiveness decisions.** Signal detection (acceptable) is\n  not signal confirmation (not acceptable from a series). HTA bodies (NICE, ICER) and reimbursement committees reject\n  uncontrolled case series for comparative effectiveness; an unanchored single arm cannot support a relative-effect claim.\n\n**Data-source operational depth.** The recurring failure mode in real-world data is that an apparent \"case series\" is\nreally a cohort with an *unmeasured or discarded* denominator — fix the denominator rather than lowering the design.\n- **Claims (FFS vs MA):** You can almost always reconstruct a denominator (all continuously enrolled members with the\n  indication), so a denominator-free series is rarely justified — the honest design is usually a cohort. If you do\n  characterize a case set (e.g., all members with a rare procedure), build it from `person_id` + `service_date` + dx/proc\n  codes with a continuous-enrollment requirement, and state explicitly that no rate is computed. Failure modes:\n  **Medicare Advantage person-time lacks fee-for-service claims**, so MA-only members appear as \"no events\" purely from\n  non-capture — never mix MA-only and FFS person-time when any frequency is reported. **Differential competing risks by\n  exposure in the elderly** (death removing patients before the outcome is coded) distort any naive \"proportion with\n  outcome\" — a case series cannot handle competing risks at all, which is itself a reason to escalate to a cohort with\n  Fine-Gray/cause-specific analysis. 
**Immortal time in procedure series** (crediting survival time that every patient\n  necessarily lived through in order to receive the procedure) inflates apparent outcomes; a series has no machinery to correct it.\n- **EHR:** Rich for clinical detail (labs, notes, severity, imaging), the natural strength of a descriptive series, but\n  capture is visit-driven, so patients who leave the system vanish (informative loss), and \"the cases we have notes on\"\n  is a referral-selected set. Good for characterizing presentation and management; poor for any frequency claim.\n- **Registry:** A disease/product registry with defined enrollment criteria and a denominator is a *cohort*; a registry\n  used only to pull \"all the cases of X\" for description is functioning as a case-series sampling frame. Registries are\n  the strongest substrate for rare-disease natural-history *description* (adjudicated phenotypes, longitudinal detail).\n- **Linked claims–EHR–registry:** Linkage adds clinical depth and mortality, but the linkable subset is selected; if any\n  frequency is reported, the denominator must be the linkable population, not the whole source.\n\n**Worked claims example.** Question: characterize patients who received a newly approved, rarely used gene therapy in a\nnational claims database (no comparator; description only). (1) Case identification: in `medical_claims`, find each\n`person_id` with the procedure/HCPCS code for the therapy; set `index_date` = first such `service_date`. (2) Observability:\nrequire 365 days of continuous medical + pharmacy enrollment before `index_date` and exclude **MA-only** person-time so\nbaseline characteristics are actually observable in FFS claims. (3) Baseline characterization (the deliverable): in the\n365-day lookback summarize demographics, comorbidities (Elixhauser/Charlson from dx codes), prior therapies\n(`drug_class`, `days_supply`), and healthcare utilization. 
(4) Post-index description: among the identified cases,\ntabulate documented outcomes/AEs and time from `index_date` to first occurrence — reported as **counts and within-series\nproportions only**. (5) Explicit non-claims: state that no incidence rate, no comparative effect, and no causal claim is\nmade because there is no denominator and no comparator. (6) Escalation note: if a credible denominator exists (all\nmembers with the indication), re-cast as a retrospective cohort with incidence and time-at-risk — at which point\ncompeting risks (death) and immortal time must be handled, which the case series intentionally does not.",
    "primary_category": "Study_Design",
    "tags": [
      "descriptive",
      "study-design",
      "single-group",
      "hypothesis-generating",
      "signal-detection",
      "pharmacovigilance",
      "no-comparator",
      "rare-disease"
    ],
    "applies_to_study_types": [
      "case_series"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.7326/0003-4819-156-1-201201030-00006",
        "url": "https://doi.org/10.7326/0003-4819-156-1-201201030-00006",
        "citation_text": "Dekkers OM, Egger M, Altman DG, Vandenbroucke JP. Distinguishing case series from cohort studies. Annals of Internal Medicine. 2012;156(1):37-40.",
        "year": 2012,
        "authors_short": "Dekkers et al.",
        "notes": "Authoritative criterion for the case-series/cohort boundary — sampling on exposure/disease without a denominator (case series) vs a defined source population with person-time (cohort). The reference standard for grading the design."
      },
      {
        "role": "explain",
        "doi": "10.1186/s12874-017-0391-8",
        "url": "https://doi.org/10.1186/s12874-017-0391-8",
        "citation_text": "Mathes T, Pieper D. Clarifying the distinction between case series and cohort studies in systematic reviews of comparative studies: potential impact on body of evidence and workload. BMC Medical Research Methodology. 2017;17(1):107.",
        "year": 2017,
        "authors_short": "Mathes & Pieper",
        "notes": "Shows how mislabeling single-arm cohorts as case series (and vice versa) distorts the apparent evidence base and GRADE assessments; operationalizes the distinction for reviewers."
      },
      {
        "role": "demonstrate",
        "doi": "10.1016/j.ajo.2010.08.047",
        "url": "https://doi.org/10.1016/j.ajo.2010.08.047",
        "citation_text": "Kempen JH. Appropriate use and reporting of uncontrolled case series in the medical literature. American Journal of Ophthalmology. 2011;151(1):7-10.e1.",
        "year": 2011,
        "authors_short": "Kempen",
        "notes": "Practical rules for when an uncontrolled case series is legitimate (description, hypothesis generation) versus when reporting proportions/\"response rates\" as if they were rates is actively misleading."
      }
    ],
    "plain_language_summary": "A case series takes a group of patients picked because they share one thing — the same diagnosis, drug, or procedure — and simply describes them: who they are, what was done, and what happened. There is no comparison group, and no count of how many similar people were out there to begin with (no population at risk). That means you can report what you saw inside the group (\"5 of 8 patients improved\") but you cannot turn it into a rate or claim the treatment caused the result, because you have nothing to compare against and no denominator.",
    "key_terms": [
      {
        "term": "population at risk (denominator)",
        "definition": "The full set of people who could have had the outcome, which you divide by to get a rate; a case series doesn't have one because it only collects the patients themselves."
      },
      {
        "term": "comparator group",
        "definition": "A separate group of similar patients who did not get the treatment, used to judge whether the treatment made a difference; a case series has none."
      },
      {
        "term": "rate",
        "definition": "A count of events divided by the population (or time) at risk, like \"3 cases per 1,000 people\"; it needs a denominator a case series cannot supply."
      },
      {
        "term": "within-series proportion",
        "definition": "A fraction calculated only among the collected patients (e.g., 5 of 8 = 62.5%), which describes this specific group but is not a population rate."
      },
      {
        "term": "hypothesis generation",
        "definition": "Using an early description to suggest an idea worth testing later with a stronger design, rather than to prove anything."
      }
    ],
    "worked_example": {
      "scenario": "A clinician notices that eight patients in her practice all received the same newly approved gene therapy and wonders how they fared. She pulls their records into a small table to describe the group. She wants to know: did the therapy work? The catch is that these eight are simply the patients who happened to get the drug and get written up — there is no list of everyone who could have received it, and no untreated group to hold them against.",
      "dataset": {
        "caption": "The eight collected patients — exactly the rows the clinician has, and nothing about anyone outside this group.",
        "columns": [
          "patient_id",
          "age",
          "treatment",
          "outcome"
        ],
        "rows": [
          [
            1,
            57,
            "gene_therapy_X",
            "improved"
          ],
          [
            2,
            63,
            "gene_therapy_X",
            "improved"
          ],
          [
            3,
            49,
            "gene_therapy_X",
            "no_change"
          ],
          [
            4,
            71,
            "gene_therapy_X",
            "improved"
          ],
          [
            5,
            55,
            "gene_therapy_X",
            "no_change"
          ],
          [
            6,
            68,
            "gene_therapy_X",
            "improved"
          ],
          [
            7,
            60,
            "gene_therapy_X",
            "worsened"
          ],
          [
            8,
            52,
            "gene_therapy_X",
            "improved"
          ]
        ]
      },
      "steps": [
        "Notice that every patient shares the same exposure (gene_therapy_X) and there is no second column of untreated patients — the whole table is one group, so there is no comparator.",
        "Count the outcomes you can actually see inside the group: 5 improved (patients 1, 2, 4, 6, 8), 2 had no change (patients 3, 5), and 1 worsened (patient 7). That is a fair description of these eight people.",
        "Try to turn '5 improved' into a rate and you hit a wall: a rate needs a denominator — everyone who could have improved — but the only people in the data are the 8 you already chose, so the 'denominator' would just be the cases themselves.",
        "Try to claim the therapy worked and you hit a second wall: with no untreated comparator group, you cannot know whether these 5 would have improved anyway, so no treatment effect can be computed.",
        "Contrast this with a cohort: if instead you had started from all patients with the indication in a database, given each an index date, and counted person-time, you could compute incidence and compare arms. The moment you have a countable denominator and follow-up, it is a cohort — not a case series."
      ],
      "result": "Descriptive summary: among the 8 collected patients, 5 improved, 2 had no change, and 1 worsened (a within-series proportion of 5/8 = 62.5% improved). No rate, incidence, or response rate can be computed because there is no population at risk (denominator) and no person-time, and no treatment effect can be claimed because there is no comparator group. The series can only describe these patients and generate a hypothesis for a later cohort or controlled study to test."
    },
    "prerequisites": [
      "case-report"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Consecutive case series",
        "description": "Every eligible patient over a defined period at one or more sites is included, in order of presentation, to reduce selection within the series and approximate the clinical spectrum more faithfully.",
        "edge_cases": [
          "Still selected at the site level (referral/catchment bias); \"consecutive\" controls within-series cherry-picking, not selection into the source.",
          "Period boundaries must be pre-specified or the count is post-hoc."
        ],
        "data_source_notes": "claims/EHR: enforce a fixed [start, end] service-date window and take all qualifying person_ids; do not drop patients after seeing their outcomes."
      },
      {
        "name": "Selected / illustrative case series",
        "description": "Cases chosen to illustrate a phenomenon, novel presentation, or management approach; explicitly not intended to estimate frequency.",
        "edge_cases": [
          "High selection makes any numerator/denominator meaningless; report as qualitative description only."
        ],
        "data_source_notes": "EHR is typical here (rich notes/imaging); never attach a proportion that implies a population rate."
      },
      {
        "name": "Aggregated spontaneous-report case series (pharmacovigilance)",
        "description": "A series of cases of a suspected adverse event assembled from spontaneous-report systems (e.g., FAERS) or active surveillance for signal description.",
        "edge_cases": [
          "Reporting is voluntary and stimulated by attention (notoriety/Weber effect); counts cannot be turned into rates.",
          "Supports signal detection, not signal confirmation."
        ],
        "data_source_notes": "spontaneous reports have no denominator by construction; pair with a denominator-bearing surveillance cohort (e.g., Sentinel) before any rate or effect claim."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Case report (n=1)",
        "pros_of_this": "Captures a pattern across multiple patients — range of presentation, management, and within-series frequency — rather than a single anecdote.",
        "cons_of_this": "Still selected and denominator-free; cannot estimate a population rate or effect.",
        "when_to_prefer": "When several cases of a novel exposure/event exist and a pattern (not just an anecdote) is informative."
      },
      {
        "compared_to": "Single-arm cohort / single-arm external-control study",
        "pros_of_this": "Far faster and feasible without a denominator or comparator; often the only option for an ultra-rare, newly emerged exposure.",
        "cons_of_this": "No denominator, no person-time, no incidence, no time-to-event, and no anchored comparative inference.",
        "when_to_prefer": "Only when a credible denominator is genuinely unobtainable; otherwise build a cohort."
      },
      {
        "compared_to": "Self-controlled case series (SCCS) / case-crossover",
        "pros_of_this": "No exposure model, no risk-window specification, no within-person assumptions needed — pure description.",
        "cons_of_this": "Cannot estimate a causal effect; SCCS/case-crossover use the same cases to produce adjusted within-person effect estimates that a descriptive series cannot.",
        "when_to_prefer": "When the goal is to characterize patients, not to estimate the effect of a transient exposure."
      },
      {
        "compared_to": "Case-control study",
        "pros_of_this": "No control sampling, no matching, no measurement of exposure odds — simple and quick.",
        "cons_of_this": "No comparator at all, so no odds ratio and no analytic inference about association.",
        "when_to_prefer": "Description/hypothesis generation; switch to case-control once a comparator and a testable hypothesis exist."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "A denominator is almost always reconstructable, so prefer a cohort; if describing a case set, build from person_id + service_date + dx/proc codes with continuous enrollment, exclude MA-only person-time (no FFS capture), and report counts/within-series proportions only — never a rate. Competing risks (death) and immortal time cannot be handled by a series and are themselves reasons to escalate to a cohort.",
      "ehr": "Strong for clinical characterization (labs, notes, severity); capture is visit-driven and referral-selected, so good for describing presentation/management, poor for any frequency claim. Treat loss to follow-up as informative.",
      "registry": "A registry with defined enrollment and a denominator is a cohort; used to pull \"all cases of X\" it is a case-series sampling frame. Best substrate for rare-disease natural-history description.",
      "linked": "Linkage adds clinical depth and mortality but the linkable subset is selected; any reported frequency must use the linkable population as denominator, not the full source."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\n\nLOOKBACK_DAYS = 365          # baseline-characterization window before index\nTHERAPY_CODES = {\"J9999\"}    # HCPCS/procedure code(s) defining the case-series exposure\n\ndef build_case_series(med: pd.DataFrame, rx: pd.DataFrame, enroll: pd.DataFrame) -> dict:\n    med = med.sort_values([\"person_id\", \"service_date\"])\n\n    # 1. Case identification: first occurrence of the defining exposure = index_date.\n    hits = med[med[\"code\"].isin(THERAPY_CODES)]\n    cases = (hits.groupby(\"person_id\")[\"service_date\"].min()\n                 .reset_index(name=\"index_date\"))\n\n    # 2. Observability: continuous, FFS-observable enrollment across the full lookback (no MA-only gaps),\n    #    so baseline characteristics are actually captured rather than missing.\n    e = enroll.merge(cases, on=\"person_id\")\n    e[\"covers\"] = ((e[\"enroll_start\"] <= e[\"index_date\"] - pd.Timedelta(days=LOOKBACK_DAYS)) &\n                   (e[\"enroll_end\"]   >= e[\"index_date\"]) &\n                   (~e[\"ma_only\"]))\n    eligible = e.loc[e[\"covers\"], \"person_id\"].unique()\n    cases = cases[cases[\"person_id\"].isin(eligible)].copy()\n    cases[\"baseline_start\"] = cases[\"index_date\"] - pd.Timedelta(days=LOOKBACK_DAYS)\n\n    # 3. 
Baseline characterization (the deliverable) — prior therapy use in the lookback window.\n    rxm = rx.merge(cases[[\"person_id\", \"index_date\", \"baseline_start\"]], on=\"person_id\")\n    prior = rxm[(rxm[\"fill_date\"] >= rxm[\"baseline_start\"]) & (rxm[\"fill_date\"] < rxm[\"index_date\"])]\n    prior_tx = (prior.groupby(\"person_id\")[\"drug_class\"].nunique()\n                     .reindex(cases[\"person_id\"], fill_value=0))\n\n    n = len(cases)\n    summary = {\n        \"n_cases\": n,\n        \"n_with_any_prior_therapy\": int((prior_tx > 0).sum()),\n        # within-series PROPORTION (descriptive) — NOT a population rate (no denominator/comparator)\n        \"pct_with_prior_therapy\": round(100 * (prior_tx > 0).sum() / n, 1) if n else None,\n    }\n    return {\"cases\": cases, \"summary\": summary}",
        "description": "Case identification and descriptive characterization from claims-style inputs (NOT estimation). Required inputs\n(cleaned, de-duplicated):\n  med    : medical claims -> person_id, service_date (datetime), code (dx/proc), code_type\n  rx     : pharmacy fills -> person_id, fill_date (datetime), drug_class, days_supply\n  enroll : enrollment spans -> person_id, enroll_start, enroll_end, ma_only (bool)  # ma_only person-time lacks FFS capture\nProduces one row per identified case with index_date and a baseline characterization table. Reports COUNTS and\nWITHIN-SERIES PROPORTIONS only: there is no denominator/comparator, so no incidence rate or effect is computed.",
        "dependencies": [
          "pandas"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\n\nLOOKBACK_DAYS <- 365L\nTHERAPY_CODES <- c(\"J9999\")   # defining exposure code(s)\n\nbuild_case_series <- function(med, rx, enroll) {\n  setDT(med); setDT(rx); setDT(enroll)\n  setorder(med, person_id, service_date)\n\n  # 1. Case identification: first occurrence of the defining exposure.\n  cases <- med[code %chin% THERAPY_CODES,\n               .(index_date = min(service_date)), by = person_id]\n\n  # 2. Observability: continuous FFS-observable enrollment across the lookback (drop MA-only).\n  e <- merge(enroll, cases, by = \"person_id\")\n  ok <- e[enroll_start <= index_date - LOOKBACK_DAYS &\n          enroll_end   >= index_date & !ma_only, unique(person_id)]\n  cases <- cases[person_id %chin% ok]\n  cases[, baseline_start := index_date - LOOKBACK_DAYS]\n\n  # 3. Baseline characterization: distinct prior therapy classes in the lookback window.\n  rxm <- merge(rx, cases[, .(person_id, index_date, baseline_start)], by = \"person_id\")\n  prior <- rxm[fill_date >= baseline_start & fill_date < index_date,\n               .(n_classes = uniqueN(drug_class)), by = person_id]\n  cases <- merge(cases, prior, by = \"person_id\", all.x = TRUE)\n  cases[is.na(n_classes), n_classes := 0L]\n\n  n <- nrow(cases)\n  summary <- list(\n    n_cases = n,\n    n_with_any_prior_therapy = sum(cases$n_classes > 0L),\n    # within-series proportion only — NOT a population rate\n    pct_with_prior_therapy = if (n) round(100 * mean(cases$n_classes > 0L), 1) else NA_real_\n  )\n  list(cases = cases[], summary = summary)\n}",
        "description": "Case identification and descriptive characterization with data.table. Inputs mirror the Python version:\n  med    : person_id, service_date (Date), code, code_type\n  rx     : person_id, fill_date (Date), drug_class, days_supply\n  enroll : person_id, enroll_start, enroll_end, ma_only (logical)\nReturns the identified cases and a within-series summary (counts/proportions only; no rate, no comparator).",
        "dependencies": [
          "data.table"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "%let lookback = 365;\n\n/* 1. Case identification: first occurrence of the defining exposure code = index_date. */\nproc sql;\n  create table cases as\n  select person_id, min(service_date) as index_date format=date9.\n  from work.med\n  where code in ('J9999')              /* HCPCS/procedure code(s) defining the series */\n  group by person_id;\nquit;\n\n/* 2. Observability: continuous, FFS-observable enrollment across the full lookback (exclude MA-only). */\nproc sql;\n  create table cohort as\n  select c.person_id, c.index_date,\n         c.index_date - &lookback as baseline_start format=date9.\n  from cases c\n  where exists (\n    select 1 from work.enroll e\n    where e.person_id = c.person_id\n      and e.ma_only = 0\n      and e.enroll_start <= c.index_date - &lookback\n      and e.enroll_end   >= c.index_date\n  );\nquit;\n\n/* 3. Baseline characterization: any prior therapy in the lookback window (descriptive deliverable). */\nproc sql;\n  create table chars as\n  select b.person_id, b.index_date,\n         (case when exists (\n            select 1 from work.rx r\n            where r.person_id = b.person_id\n              and r.fill_date >= b.baseline_start\n              and r.fill_date <  b.index_date) then 1 else 0 end) as any_prior_tx\n  from cohort b;\nquit;\n\n/* Counts and WITHIN-SERIES proportions only -- no rate, no comparator. */\nproc freq data=chars;\n  tables any_prior_tx / nocum;\nrun;",
        "description": "Case identification and descriptive characterization in SAS (PROC SQL + PROC FREQ/MEANS). This is a DESCRIPTIVE design:\nno PROC for rates/effects is appropriate because the series has no denominator and no comparator. Required input datasets\n(post data-management):\n  work.med    : person_id, service_date, code, code_type\n  work.rx     : person_id, fill_date, drug_class, days_supply\n  work.enroll : person_id, enroll_start, enroll_end, ma_only (0/1)\nPROC FREQ/MEANS report counts and within-series proportions only; do NOT divide by person-time (that would be a cohort).",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Start[Have a group of patients?] --> Q1{Defined by a shared<br/>exposure or diagnosis?}\n  Q1 -- No --> Other[Not a case series]\n  Q1 -- Yes --> Q2{Is there a comparator<br/>group?}\n  Q2 -- Yes --> CC[Case-control / cohort with comparator<br/>= analytic design]\n  Q2 -- No --> Q3{Can you enumerate a<br/>denominator + person-time<br/>in a defined source population?}\n  Q3 -- Yes --> Coh[Single-arm COHORT<br/>report incidence + time-at-risk]\n  Q3 -- No --> Q4{n = 1?}\n  Q4 -- Yes --> Rep[Case REPORT]\n  Q4 -- No --> CS[CASE SERIES<br/>describe only: counts + within-series proportions<br/>no rate, no effect, no causal claim]",
        "caption": "Decision logic separating a case series from its neighbors. The two gates that demote a study to \"case series\" are absence of a comparator and absence of a countable denominator/person-time; if either is present the correct design is a cohort or analytic comparative design (Dekkers et al. 2012).",
        "alt_text": "Decision flowchart starting from a patient group, branching on shared exposure, presence of a comparator, and a countable denominator, ending at case-control/cohort, single-arm cohort, case report, or case series.",
        "source_type": "illustrative",
        "source_citations": [
          "dekkers-2012"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  Spont[Spontaneous reports /<br/>aggregated case series] --> Sig[Signal DETECTION<br/>hypothesis generation<br/>FDA Sentinel / EMA PRAC]\n  Sig --> Need[Denominator + comparator required]\n  Need --> Cohort[Cohort / SCCS / case-control]\n  Cohort --> Conf[Signal CONFIRMATION<br/>rate + effect estimation]\n  Conf --> Reg[Label change / HTA decision]\n  Sig -. not sufficient for .-> Reg",
        "caption": "Where a descriptive case series sits in the evidence pipeline. It feeds signal detection and hypothesis generation but is not sufficient for signal confirmation, label changes, or HTA comparative-effectiveness decisions, which require a denominator and comparator.",
        "alt_text": "Pipeline diagram showing spontaneous reports and case series feeding signal detection, which requires escalation to a cohort, SCCS, or case-control study with denominator and comparator before signal confirmation and regulatory or HTA decisions.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "is_variant_of",
        "target_slug": "case-report",
        "notes": "A case series is the multi-patient generalization of a single case report; both are descriptive and denominator-free."
      },
      {
        "relation_type": "alternative_to",
        "target_slug": "cohort-retrospective",
        "notes": "The key discriminator (Dekkers/Mathes-Pieper) is a countable denominator and person-time; with those, a single-arm study is a retrospective cohort that yields incidence, not a case series."
      },
      {
        "relation_type": "see_also",
        "target_slug": "single-arm-external-control",
        "notes": "Adding an external/historical comparator to a single arm moves from pure description toward (cautious) comparative inference, which a case series cannot support."
      },
      {
        "relation_type": "see_also",
        "target_slug": "self-controlled-case-series",
        "notes": "The SCCS borrows the name but is a causal within-person method (each case its own control across exposed/unexposed time); it is not a descriptive case series."
      },
      {
        "relation_type": "see_also",
        "target_slug": "case-control",
        "notes": "A case-control study starts from cases plus a comparator and is analytic; a case series has no comparator and is descriptive only."
      },
      {
        "relation_type": "used_with",
        "target_slug": "signal-detection",
        "notes": "Aggregated case series are a primary substrate for pharmacovigilance signal detection and hypothesis generation, not signal confirmation."
      },
      {
        "relation_type": "part_of",
        "target_slug": "descriptive-epidemiology-rwe",
        "notes": "The case series is one of the core descriptive designs; it characterizes person, place, and time without estimating rates or effects."
      }
    ],
    "aliases": [
      "case series",
      "clinical case series",
      "uncontrolled case series",
      "consecutive case series"
    ],
    "complexity": "foundational",
    "regulatory_relevance": [
      "fda",
      "ema"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "case-time-control",
    "name": "Case-Time-Control Design",
    "short_definition": "A self-controlled, case-only design that augments the case-crossover by dividing each case's within-person exposure odds ratio by the analogous odds ratio estimated in a separate series of (typically future) cases or non-cases, intending to remove the exposure-time trend that biases the case-crossover when exposure prevalence drifts over calendar time.",
    "long_description": "The **case-time-control (CTC) design** (Suissa, 1995) was proposed as a correction to the **case-crossover** design. In a\ncase-crossover, each case is its own control: exposure during a short **hazard (case) window** immediately before an acute\nevent is contrasted with exposure during one or more earlier **referent (control) windows** in the *same* person, and the\neffect is estimated by conditional logistic regression. This eliminates all **time-invariant** between-person confounding\nby construction. Its fatal vulnerability is an **exposure-time trend**: if the population probability of being exposed is\nrising (or falling) over calendar time for reasons unrelated to the outcome — a drug being launched, a formulary change, a\nguideline shift — then the referent windows (further in the past) are systematically less (or more) exposed than the hazard\nwindow, and the case-crossover odds ratio is biased even when the drug has no effect. CTC tries to *measure* that trend in a\ncontrol group and divide it out: it computes the case-crossover odds ratio in the cases (OR_cc), computes the same\nwithin-person odds ratio in a series of **time controls** (a separate sample of subjects who do not contribute the event at\nthe index time — classically future cases or matched non-cases), and reports CTC_OR = OR_cc / OR_tc. Algebraically this is a\nsingle conditional logistic model on the pooled person-windows with a **case-status x exposure interaction term**; the\nexponentiated interaction coefficient is the CTC estimate.\n\n**Core conceptual / estimand distinction.** The estimand is the **transient, within-person rate ratio** of an abrupt event\nfor a time-varying (\"on\" vs \"off\") exposure, *purged of the exposure-time trend captured by the control series*. CTC does\n**not** estimate a chronic or cumulative effect, and it does **not** estimate a between-person contrast of ever- vs\nnever-users. 
The defining — and contested — identifying assumption is that **the exposure-time trend in the control series\nequals the counterfactual exposure-time trend the cases would have experienced absent the event**, *conditional on the same\nwithin-person comparison*. Greenland (1996) showed that this is far stronger than it looks: because the time controls are\n*different people*, the OR_tc carries the **between-person confounding** that the case-crossover was specifically designed to\navoid, and dividing by it re-injects that confounding unless the case and control series are exchangeable on those\nconfounders. Suissa (1998) substantially conceded the point and narrowed the conditions under which CTC is valid. This is\nwhy CTC is best understood not as a default tool but as a historically important bridge between the case-crossover and the\nmodern self-controlled toolkit.\n\n**Pros, cons, and trade-offs.**\n- **vs the plain case-crossover:** CTC's one advantage is that it *attempts* to remove exposure-time-trend bias, which the\n  case-crossover cannot. Cost: it does so by importing a between-person control series, so it trades a known, characterizable\n  bias (the trend) for a generally **uncontrollable** one (between-person confounding in OR_tc). Prefer the case-crossover —\n  with a **bidirectional** or symmetric referent-window scheme — when the trend is mild and the exposure is genuinely\n  transient; prefer neither over the options below when the trend is strong.\n- **vs the case-case-time-control design (Wang et al., 2011):** This is the decisive comparison. Case-case-time-control\n  keeps CTC's ratio structure but makes the control series **future cases** of the *same* outcome, so the control odds ratio\n  reflects the exposure trend among people who are exchangeable with the cases on outcome-related confounders. It removes\n  *both* the time-invariant confounding *and* the trend without Greenland's re-confounding problem. 
**Prefer\n  case-case-time-control over CTC essentially whenever a future-case series is obtainable** — it is the recommended modern\n  successor.\n- **vs the self-controlled case series (SCCS):** SCCS models the full observation time of each case with conditional Poisson\n  regression and can include a **calendar-time (age/period) covariate** to absorb the exposure-time trend within-person,\n  avoiding any external control group. Prefer SCCS when the exposure is well measured across each person's whole observation\n  window and the event does not censor future exposure; it is generally more efficient and avoids CTC's confounding trap.\n- **vs the active comparator, new-user cohort:** For a **sustained** effect of a chronically used drug, no self-controlled\n  design applies; a cohort with an active comparator and time-zero alignment is the right tool. CTC is only ever a candidate\n  for **transient** effects of **intermittent** exposures on **abrupt** events.\n\n**When to use.** Acute, abrupt-onset outcome (hip fracture, MI, motor-vehicle crash, seizure, anaphylaxis); a transient,\nintermittent exposure whose effect, if any, is short-lived (a benzodiazepine fill, an NSAID course, a triptan); a credible\nconcern that exposure prevalence is **trending over calendar time** so that a bare case-crossover would be biased; and a\nsetting where the cleaner alternatives (case-case-time-control, SCCS) are infeasible — for example, future cases cannot be\nassembled in the available data window and full per-person exposure histories needed for SCCS are unavailable.\n\n**When NOT to use — and when it is actively misleading or dangerous.**\n- **A future-case or exchangeable control series is available.** Then use **case-case-time-control**. 
Using CTC with\n  arbitrary non-cases when you could have used future cases is choosing the design with the known confounding flaw.\n- **The control series is not exchangeable with the cases on the time-invariant confounders.** This is the Greenland (1996)\n  failure: OR_tc then carries between-person confounding, and CTC_OR = OR_cc / OR_tc is biased — *and the direction is\n  unpredictable*. There is no diagnostic that fully verifies exchangeability, so a \"balanced\" presentation overstates CTC's\n  safety. Treat a non-exchangeable control series as disqualifying, not as a limitation to footnote.\n- **The effect is chronic or cumulative, or the exposure is continuous.** Self-controlled within-person contrasts have no\n  \"unexposed\" referent and the design collapses; use a cohort design.\n- **Event-dependent exposure / event-dependent observation.** If the event changes the probability of subsequent exposure\n  (e.g., the fracture stops the benzodiazepine) or truncates observation (death), using **future** windows as referents is\n  invalid; SCCS extensions for event-dependent exposure or strictly **pre-event (unidirectional)** referent windows are\n  required.\n- **The exposure-time trend itself is the causal pathway** (e.g., a prescribing surge driven by early outcomes). Then the\n  trend is not nuisance to be divided out and CTC removes signal.\n\n**Data-source operational depth.**\n- **Claims (FFS):** The natural substrate. Exposure on a given day is read from `fill_date` + `days_supply` (a day is\n  \"exposed\" if it falls inside an active supply interval). Require **continuous medical + pharmacy enrollment** spanning the\n  full referent-through-hazard window so that \"unexposed\" days are observed, not missing. 
Failure modes: **Medicare\n  Advantage / capitated person-time lacks fee-for-service claims**, so apparent non-exposure in a referent window can be\n  pure missingness — restrict to A/B/D (or commercial medical+pharmacy) enrollees and exclude MA-only spans. **90-day\n  mail-order and stockpiling** stretch `days_supply` and blur window assignment; **free samples and inpatient\n  administrations** are invisible in pharmacy claims, biasing exposure classification non-differentially toward the null.\n- **EHR:** Order/administration dates can place exposure more precisely than dispensing, but **visit-driven capture** means\n  referent windows in quiet periods look spuriously unexposed; link to pharmacy fills before trusting \"off\" days. Outcome\n  onset timing (essential for placing the 1-day-resolution hazard window) is often better in EHR than claims.\n- **Registry:** Strong for adjudicated, precisely dated acute events (the design's chief requirement) but typically weak for\n  complete intermittent-exposure histories; link to claims for fills. Registries rarely support assembling an exchangeable\n  time-control series, which often pushes the analysis toward SCCS instead.\n- **Linked claims-EHR-registry:** Best dating of the acute event (registry/EHR) with complete exposure (claims). Reconcile\n  order/fill/service-date discrepancies before fixing window boundaries, and beware that the **linkable subset is selected**,\n  which can break exchangeability between the case and control series.\n\n**Worked claims example.** Question: does a new benzodiazepine fill transiently raise the risk of hip fracture in older\nadults, in a FFS database where benzodiazepine use is rising over the study years (so a bare case-crossover would be biased\nupward by the trend)? (1) **Cases**: members aged >=65 with a first inpatient hip-fracture claim (`event_date`), with\ncontinuous A/B/D enrollment and no MA-only span over the 60 days before the event. 
(2) **Hazard window**: days 1-7 before\n`event_date`; a member is \"exposed\" in that window if any benzodiazepine `days_supply` interval (`fill_date` to\n`fill_date + days_supply - 1`) overlaps it. (3) **Referent window**: days 31-37 before `event_date`, classified identically.\nThis is a strictly pre-event (unidirectional) scheme, because a fracture plausibly *changes* later benzodiazepine use. (4)\n**Time-control series**: a sample of **future** hip-fracture cases (people who fracture later in the data), each matched to a\ncurrent case and assigned that case's `event_date` as a pseudo-`event_date`, so that the control windows fall at the same\ncalendar time as the case's windows and before the control's own fracture; their day 1-7 vs day 31-37 windows are classified\nthe same way. Using\nfuture cases (rather than arbitrary non-cases) is what keeps the control series more plausibly exchangeable on fracture-related\nconfounders and avoids the Greenland re-confounding problem; that is, this worked example is really a\ncase-case-time-control specification, which is the defensible way to run CTC's machinery. (5) **Estimation**: pool all\nperson-windows; fit conditional logistic regression stratified on person, with terms for `exposed` and the\n`exposed` x `case_status` **interaction** (the `case_status` main effect is constant within person and drops out);\nexp(interaction coefficient) is the CTC/case-case-time-control estimate. The ratio OR_cc / OR_tc\ndivides out the rising-prevalence trend; the residual reflects the transient hazard. (6) **Sensitivity**: vary\nhazard/referent window lengths and gaps, add multiple referent windows, restrict `days_supply` handling (cap stockpiling),\nand run a **negative-control exposure** (a drug with no plausible fracture mechanism) to detect residual trend or\nconfounding.",
    "primary_category": "Study_Design",
    "tags": [
      "self-controlled",
      "case-crossover-extension",
      "exposure-time-trend-adjustment",
      "transient-effects",
      "pharmacoepidemiology",
      "case-only"
    ],
    "applies_to_study_types": [
      "case_time_control"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1097/00001648-199505000-00010",
        "url": "https://doi.org/10.1097/00001648-199505000-00010",
        "citation_text": "Suissa S. The case-time-control design. Epidemiology. 1995;6(3):248-253.",
        "year": 1995,
        "authors_short": "Suissa",
        "notes": "Original proposal of the design as a correction to the case-crossover for exposure-time-trend bias."
      },
      {
        "role": "explain",
        "doi": "10.1093/oxfordjournals.aje.a115853",
        "url": "https://doi.org/10.1093/oxfordjournals.aje.a115853",
        "citation_text": "Maclure M. The case-crossover design: a method for studying transient effects on the risk of acute events. American Journal of Epidemiology. 1991;133(2):144-153.",
        "year": 1991,
        "authors_short": "Maclure",
        "notes": "The parent design CTC extends; defines the hazard/referent within-person comparison and the transient-effect estimand."
      },
      {
        "role": "explain",
        "doi": "10.1097/00001648-199605000-00003",
        "url": "https://doi.org/10.1097/00001648-199605000-00003",
        "citation_text": "Greenland S. Confounding and exposure trends in case-crossover and case-time-control designs. Epidemiology. 1996;7(3):231-239.",
        "year": 1996,
        "authors_short": "Greenland",
        "notes": "Foundational critique showing CTC's external control series can reintroduce between-person confounding; essential reading before using the design."
      },
      {
        "role": "explain",
        "doi": "10.1097/00001648-199807000-00016",
        "url": "https://doi.org/10.1097/00001648-199807000-00016",
        "citation_text": "Suissa S. The case-time-control design: further assumptions and conditions. Epidemiology. 1998;9(4):441-445.",
        "year": 1998,
        "authors_short": "Suissa",
        "notes": "Suissa's response narrowing the conditions under which CTC is valid, largely conceding Greenland's confounding point."
      },
      {
        "role": "demonstrate",
        "doi": "10.1097/EDE.0b013e31821d09cd",
        "url": "https://doi.org/10.1097/EDE.0b013e31821d09cd",
        "citation_text": "Wang S, Linkletter C, Maclure M, et al. Future cases as present controls to adjust for exposure trend bias in case-only studies. Epidemiology. 2011;22(4):568-574.",
        "year": 2011,
        "authors_short": "Wang et al.",
        "notes": "Introduces the case-case-time-control design (future cases as the control series), the recommended modern successor that fixes CTC's confounding flaw; demonstrates the ratio-of-odds-ratios estimation in practice."
      }
    ],
    "plain_language_summary": "The case-time-control design is a way to study whether a short-acting drug briefly raises the chance of a sudden event — like a hip fracture — by comparing what each patient was taking right before their event versus what they were taking a few weeks earlier. Because each person acts as their own comparison, most stable differences between people (like age or underlying frailty) cancel out. The twist this design adds on top of the simpler case-crossover is a second group of people — called time controls — whose drug records are examined over the same two windows so the study can subtract out any background drift in how often the drug was being prescribed over those years. In practice, researchers almost always prefer to build that second group from future patients who will eventually have the same kind of event, because those people are more similar to the cases in ways that matter.",
    "key_terms": [
      {
        "term": "hazard window",
        "definition": "The short stretch of days immediately before the event (here, days 1 through 7 before the fracture) where the study asks whether the patient had the drug on hand."
      },
      {
        "term": "referent window",
        "definition": "An earlier stretch of days from the same person (here, days 31 through 37 before the fracture) that serves as that person's own built-in comparison point when they were at lower immediate risk."
      },
      {
        "term": "exposure-time trend",
        "definition": "A steady rise or fall in how commonly a drug is prescribed across the whole population over calendar years, unrelated to any individual patient's risk — this can distort results if not accounted for."
      },
      {
        "term": "time-control series",
        "definition": "A separate group of people (ideally future patients who will later have the same event) who contribute the same two time windows so the study can measure and subtract the background prescribing trend."
      },
      {
        "term": "days_supply",
        "definition": "The number of days a single prescription fill is intended to last, recorded in a pharmacy claim — for example, a 30-day supply of a sleep aid."
      },
      {
        "term": "within-person comparison",
        "definition": "Using the same individual as their own control across different time points, so stable personal traits like genetics or chronic illness do not confuse the result."
      }
    ],
    "worked_example": {
      "scenario": "A claims database covers adults aged 65 and older from January 2018 through December 2019. Benzodiazepine prescribing has been rising steadily over those two years — a background trend unrelated to any one patient's fracture risk. We want to know whether having an active benzodiazepine fill in the week before a hip fracture raises fracture risk. We select two people: Patient A (a case) who fractured on 2019-03-15, and Patient B (a time control — a future case who fractured later, on 2019-11-20) who contributes the same two windows anchored on their own fracture date. For each person we check whether any benzodiazepine prescription covered the hazard window (days 1-7 before fracture) and the referent window (days 31-37 before fracture).",
      "dataset": {
        "caption": "Pharmacy fill records for the two study participants. Each row is one prescription dispensing.",
        "columns": [
          "person_id",
          "fill_date",
          "drug",
          "days_supply",
          "case_status"
        ],
        "rows": [
          [
            "A-001",
            "2019-02-15",
            "lorazepam",
            30,
            1
          ],
          [
            "B-002",
            "2019-10-30",
            "lorazepam",
            30,
            0
          ]
        ]
      },
      "steps": [
        "Patient A (case) fractured on 2019-03-15. Hazard window = 2019-03-08 through 2019-03-14. Referent window = 2019-02-06 through 2019-02-12.",
        "Patient A's lorazepam fill started 2019-02-15 and covered 30 days, meaning the supply lasted through 2019-03-16. That interval (Feb 15 to Mar 16) overlaps the hazard window (Mar 8-14): EXPOSED in hazard window = 1. It does NOT overlap the referent window (Feb 6-12): EXPOSED in referent window = 0. So within Patient A, the exposure odds ratio (OR_cc) numerator favors exposure near the event.",
        "Patient B (time control) had their fracture on 2019-11-20. Hazard window = 2019-11-13 through 2019-11-19. Referent window = 2019-10-14 through 2019-10-20.",
        "Patient B's lorazepam fill started 2019-10-30 and covered 30 days through 2019-11-28. That supply overlaps BOTH windows: hazard window (Nov 13-19) = EXPOSED 1, referent window (Oct 14-20) = EXPOSED 0. Same pattern as Patient A.",
        "OR_cc (cases only): exposed in hazard vs referent = (1/0) pattern — in a full cohort this ratio measures the association in cases. OR_tc (time controls only): same exposed-in-hazard / not-in-referent pattern. The CTC estimate = OR_cc divided by OR_tc. If OR_tc equals 1.0 (no trend), the adjustment does nothing. If the rising prescribing trend had made OR_tc = 1.4, then CTC_OR = OR_cc divided by 1.4, shrinking the raw case-crossover estimate toward the truth."
      ],
      "result": "With one case and one matched time control the numbers are illustrative, not a real p-value. The key arithmetic: if the raw within-person odds ratio in cases (OR_cc) were 2.8 and the time controls showed an OR_tc of 1.4 driven purely by the rising prescribing trend, then CTC_OR = 2.8 divided by 1.4 = 2.0. The design divided out the trend and left the estimated 2-fold transient fracture hazard during the first week of an active benzodiazepine fill.",
      "timeline_spec": {
        "title": "Case-time-control windows for one case (Patient A) and one time control (Patient B)",
        "window": {
          "start": "2019-02-06",
          "end": "2019-03-15",
          "label": "Study observation span shown (Patient A)"
        },
        "events": [
          {
            "label": "Patient A fill (lorazepam, 30-day supply)",
            "start": "2019-02-15",
            "length_days": 30,
            "quantity": "30 days_supply"
          },
          {
            "label": "Patient B fill (lorazepam, 30-day supply)",
            "start": "2019-10-30",
            "length_days": 30,
            "quantity": "30 days_supply"
          }
        ],
        "spans": [
          {
            "kind": "unexposed",
            "start": "2019-02-06",
            "end": "2019-02-12",
            "label": "Patient A referent window (days 31-37 pre-fracture): NOT covered by fill"
          },
          {
            "kind": "covered",
            "start": "2019-02-15",
            "end": "2019-03-16",
            "label": "Patient A: fill active (30 days)"
          },
          {
            "kind": "exposed",
            "start": "2019-03-08",
            "end": "2019-03-14",
            "label": "Patient A hazard window (days 1-7 pre-fracture): covered by fill"
          },
          {
            "kind": "unexposed",
            "start": "2019-10-14",
            "end": "2019-10-20",
            "label": "Patient B referent window (days 31-37 pre-fracture): NOT covered by fill"
          },
          {
            "kind": "covered",
            "start": "2019-10-30",
            "end": "2019-11-28",
            "label": "Patient B: fill active (30 days)"
          },
          {
            "kind": "exposed",
            "start": "2019-11-13",
            "end": "2019-11-19",
            "label": "Patient B hazard window (days 1-7 pre-fracture): covered by fill"
          }
        ],
        "result": {
          "label": "CTC_OR = OR_cc / OR_tc = 2.8 / 1.4 = 2.0 after dividing out the prescribing trend",
          "value": 2.0
        },
        "caption": "Timeline showing the hazard window (days 1-7 before fracture, shaded as exposed) and referent window (days 31-37 before fracture, shaded as unexposed) for Patient A (a case) and Patient B (a time control). The fill bar shows when the 30-day lorazepam supply was active. Both patients were covered during the hazard window but not the referent window, consistent with the rising background trend. Dividing OR_cc by OR_tc removes that trend.",
        "alt_text": "Two horizontal timeline rows, one per patient. Each row has a referent window bar (grey, not covered by fill) 31-37 days before fracture, a fill bar (blue) showing when the prescription was active, and a hazard window bar (orange, covered by fill) 1-7 days before fracture. An annotation shows the CTC division: OR_cc divided by OR_tc equals 2.0."
      }
    },
    "prerequisites": [
      "case-crossover",
      "case-case-time-control",
      "self-controlled-case-series"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Unidirectional (pre-event referent) windows",
        "description": "Referent windows are drawn only from time before the event (e.g., hazard days 1-7, referent days 31-37 pre-event), used when the event plausibly alters subsequent exposure.",
        "edge_cases": [
          "Pre-event referents remain vulnerable to the exposure-time trend (the very motivation for CTC), so a control series is still needed.",
          "Choosing the referent gap trades trend exposure (longer gap) against within-person comparability (shorter gap)."
        ],
        "data_source_notes": "claims: classify each window by overlap of fill_date..fill_date+days_supply-1; require continuous, FFS-observable enrollment spanning the earliest referent through the event."
      },
      {
        "name": "Bidirectional / future referent windows",
        "description": "Referent windows drawn symmetrically before and after the event to balance secular exposure trends within the case.",
        "edge_cases": [
          "Invalid when the event is event-dependent for exposure (it stops/starts the drug) or truncates observation (death) - future windows are then unobservable or distorted."
        ],
        "data_source_notes": "Requires observed post-event person-time, so disenrollment and mortality must be modeled as censoring."
      },
      {
        "name": "Case-case-time-control specification (future cases as controls)",
        "description": "The time-control series is a sample of future cases of the same outcome rather than arbitrary non-cases, making the control odds ratio exchangeable with the cases on outcome-related confounders.",
        "edge_cases": [
          "Requires enough accrual time to observe future cases; rare outcomes may leave too few future cases for a stable OR_tc.",
          "Pseudo-index (pseudo-event) dates for future cases must be assigned and windowed identically to the cases."
        ],
        "data_source_notes": "claims/EHR: assign each future case a pseudo-event_date at its own future event and apply the same hazard/referent window logic; this is the recommended default over classic non-case time controls."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Plain case-crossover design",
        "pros_of_this": "Attempts to remove exposure-time-trend bias that the case-crossover cannot address.",
        "cons_of_this": "Does so by importing a between-person control series, which can reintroduce the time-invariant confounding the case-crossover was designed to eliminate (Greenland 1996); trades a characterizable bias for a generally uncontrollable one.",
        "when_to_prefer": "Rarely - only when a strong exposure-time trend is present and the cleaner successors are infeasible; otherwise a bidirectional case-crossover or one of the designs below is preferable."
      },
      {
        "compared_to": "Case-case-time-control design (future cases as controls)",
        "pros_of_this": "Simpler to describe historically and conceptually as the original idea.",
        "cons_of_this": "The classic non-case control series is not guaranteed exchangeable; case-case-time-control's future-case control series removes both time-invariant confounding and the trend without re-confounding.",
        "when_to_prefer": "Essentially never when a future-case series is obtainable; prefer case-case-time-control."
      },
      {
        "compared_to": "Self-controlled case series (SCCS)",
        "pros_of_this": "Needs only short windows around the event rather than each person's full observation period; tolerant of incomplete long-run exposure history.",
        "cons_of_this": "Requires an external control series and is exposed to its confounding; SCCS absorbs the calendar-time trend within-person with a period covariate and needs no external controls.",
        "when_to_prefer": "When per-person full exposure histories are unavailable and an exchangeable (future-case) control series is feasible; otherwise SCCS is usually more efficient and cleaner."
      },
      {
        "compared_to": "Active comparator, new-user cohort",
        "pros_of_this": "Controls all time-invariant confounding by within-person design without measuring covariates.",
        "cons_of_this": "Only valid for transient effects of intermittent exposures on abrupt events; cannot estimate sustained or cumulative drug effects.",
        "when_to_prefer": "Never for chronic/cumulative effects of continuously used drugs - use a cohort design there."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Exposure on a day = that day lies within an active fill interval (fill_date to fill_date+days_supply-1). Require continuous medical + pharmacy enrollment across the earliest referent window through the event, and exclude Medicare Advantage-only person-time where fee-for-service claims are missing (apparent non-exposure can be missingness). Account for stockpiling/mail-order in days_supply; sample fills and inpatient administrations are invisible.",
      "ehr": "Use order/administration dates for finer exposure timing, but visit-driven capture makes quiet referent windows look spuriously unexposed - prefer linkage to fills. EHR usually dates the acute event onset more precisely, which matters for placing day-resolution hazard windows.",
      "registry": "Strong for adjudicated, precisely dated acute events; weak for complete intermittent-exposure histories. Link to claims for fills; registries rarely support an exchangeable time-control series, often favoring SCCS instead.",
      "linked": "Combine registry/EHR event dating with claims exposure completeness; reconcile order/fill/service-date discrepancies before fixing window boundaries, and note that the linkable subset is selected, which can break case-control exchangeability."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\nimport numpy as np\nfrom statsmodels.discrete.conditional_models import ConditionalLogit\n\nHAZARD = (1, 7)     # days before index that form the hazard (\"case\") window\nREFERENT = (31, 37) # earlier, pre-index referent (\"control\") window (unidirectional)\n\ndef _exposed_in_window(person_fills, lo_date, hi_date):\n    \"\"\"1 if any active fill interval [fill_date, fill_date+days_supply-1] overlaps [lo_date, hi_date].\"\"\"\n    if person_fills.empty:\n        return 0\n    start = person_fills[\"fill_date\"]\n    end = start + pd.to_timedelta(person_fills[\"days_supply\"] - 1, unit=\"D\")\n    overlap = (start <= hi_date) & (end >= lo_date)\n    return int(overlap.any())\n\ndef build_ctc_windows(events: pd.DataFrame, fills: pd.DataFrame) -> pd.DataFrame:\n    fills = fills.sort_values([\"person_id\", \"fill_date\"])\n    by_person = dict(tuple(fills.groupby(\"person_id\")))\n    rows = []\n    for _, r in events.iterrows():\n        pf = by_person.get(r[\"person_id\"], fills.iloc[0:0])\n        # window = \"case\" (hazard) vs \"referent\" within each subject; both anchored on the index date\n        haz_lo = r[\"event_date\"] - pd.Timedelta(days=HAZARD[1])\n        haz_hi = r[\"event_date\"] - pd.Timedelta(days=HAZARD[0])\n        ref_lo = r[\"event_date\"] - pd.Timedelta(days=REFERENT[1])\n        ref_hi = r[\"event_date\"] - pd.Timedelta(days=REFERENT[0])\n        rows.append({\"person_id\": r[\"person_id\"], \"case_status\": r[\"case_status\"], \"window\": 1,\n                     \"exposed\": _exposed_in_window(pf, haz_lo, haz_hi)})\n        rows.append({\"person_id\": r[\"person_id\"], \"case_status\": r[\"case_status\"], \"window\": 0,\n                     \"exposed\": _exposed_in_window(pf, ref_lo, ref_hi)})\n    return pd.DataFrame(rows)\n\ndef estimate_ctc(windows: pd.DataFrame):\n    # Outcome = being the hazard window (window=1); strata = person. 
Conditional logit removes\n    # all time-invariant person-level confounding. The case_status x exposed interaction is the CTC term.\n    d = windows.copy()\n    d[\"exp_x_case\"] = d[\"exposed\"] * d[\"case_status\"]\n    X = d[[\"exposed\", \"exp_x_case\"]]  # case_status main effect drops out (constant within person stratum)\n    model = ConditionalLogit(d[\"window\"], X, groups=d[\"person_id\"])\n    res = model.fit(disp=False)\n    ctc_or = np.exp(res.params[\"exp_x_case\"])\n    ci = np.exp(res.conf_int().loc[\"exp_x_case\"])\n    return {\"ctc_or\": ctc_or, \"ci_low\": ci[0], \"ci_high\": ci[1], \"result\": res}",
        "description": "Case-time-control / case-case-time-control estimation from claims-style inputs. Required inputs (cleaned, de-duplicated):\n  events : one row per subject with an (real or pseudo) index date\n           -> person_id, event_date (datetime), case_status in {1=case, 0=time-control}\n  fills  : intermittent-exposure dispensings\n           -> person_id, fill_date (datetime), days_supply (int)\nThe function expands each subject into a hazard window and a referent window, classifies each window as exposed by\noverlap with any active fill interval, then fits a single conditional logistic regression with a case_status x exposed\ninteraction. exp(interaction coef) is the CTC estimate (equivalently OR_cc / OR_tc).",
        "dependencies": [
          "pandas",
          "numpy",
          "statsmodels"
        ],
        "source_citations": [
          "wang-2011"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\nlibrary(survival)\n\nHAZARD   <- c(1L, 7L)    # days before index: hazard window\nREFERENT <- c(31L, 37L)  # days before index: referent window (unidirectional)\n\nexposed_in_window <- function(pf, lo, hi) {\n  if (nrow(pf) == 0L) return(0L)\n  start <- pf$fill_date\n  end   <- pf$fill_date + pf$days_supply - 1L\n  as.integer(any(start <= hi & end >= lo))\n}\n\nbuild_ctc_windows <- function(events, fills) {\n  setDT(events); setDT(fills); setorder(fills, person_id, fill_date)\n  out <- vector(\"list\", nrow(events))\n  for (i in seq_len(nrow(events))) {\n    e  <- events[i]\n    pf <- fills[person_id == e$person_id]\n    haz_lo <- e$event_date - HAZARD[2];   haz_hi <- e$event_date - HAZARD[1]\n    ref_lo <- e$event_date - REFERENT[2]; ref_hi <- e$event_date - REFERENT[1]\n    out[[i]] <- data.table(\n      person_id   = e$person_id,\n      case_status = e$case_status,\n      window      = c(1L, 0L),  # 1 = hazard, 0 = referent\n      exposed     = c(exposed_in_window(pf, haz_lo, haz_hi),\n                      exposed_in_window(pf, ref_lo, ref_hi)))\n  }\n  rbindlist(out)\n}\n\nestimate_ctc <- function(windows) {\n  # Conditional logit stratified on person; the exposed:case_status interaction is the CTC term.\n  fit <- clogit(window ~ exposed + exposed:case_status + strata(person_id), data = windows)\n  cf  <- summary(fit)$coefficients[\"exposed:case_status\", ]\n  ci  <- exp(confint(fit)[\"exposed:case_status\", ])\n  list(ctc_or = exp(cf[\"coef\"]), ci_low = ci[1], ci_high = ci[2], fit = fit)\n}",
        "description": "Case-time-control / case-case-time-control estimation with survival::clogit. Inputs mirror the Python version:\n  events : person_id, event_date (Date), case_status (1 = case, 0 = time-control)\n  fills  : person_id, fill_date (Date), days_supply (integer)\nBuilds hazard/referent windows, classifies exposure by fill-interval overlap, and fits a conditional logistic model whose\nexposed:case_status interaction is the CTC estimate (= OR_cc / OR_tc).",
        "dependencies": [
          "data.table",
          "survival"
        ],
        "source_citations": [
          "wang-2011"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "%let haz_lo = 7;  %let haz_hi = 1;     /* hazard window: 1-7 days before index   */\n%let ref_lo = 37; %let ref_hi = 31;    /* referent window: 31-37 days before index */\n\n/* One row per subject-window; exposed = any active fill interval overlaps the window. */\nproc sql;\n  create table ctc_windows as\n  select e.person_id, e.case_status, w.window,\n         max( case when f.fill_date is not null then 1 else 0 end ) as exposed\n  from work.events e\n  cross join (select 1 as window from sashelp.class(obs=1)\n              union all\n              select 0 as window from sashelp.class(obs=1)) w\n  left join work.fills f\n    on  f.person_id = e.person_id\n    /* window bounds depend on which window (hazard vs referent) this row is */\n    and f.fill_date <= ( e.event_date - (case when w.window=1 then &haz_hi else &ref_hi end) )\n    and ( f.fill_date + f.days_supply - 1 ) >=\n        ( e.event_date - (case when w.window=1 then &haz_lo else &ref_lo end) )\n  group by e.person_id, e.case_status, w.window;\nquit;\n\n/* True conditional likelihood via STRATA=person_id; interaction = the CTC term. */\nproc logistic data=ctc_windows;\n  strata person_id;                         /* conditions out time-invariant person confounding */\n  class case_status (ref='0') / param=ref;\n  model window(event='1') = exposed exposed*case_status;\n  oddsratio exposed;                         /* exp(exposed*case_status) is read from the parameter estimates */\nrun;",
        "description": "Case-time-control / case-case-time-control estimation in SAS. Required input datasets (post data-management):\n  work.events : person_id, event_date, case_status (1 = case, 0 = time-control / future case)\n  work.fills  : person_id, fill_date, days_supply\nBuilds the two within-person windows, classifies exposure by fill-interval overlap, and fits a TRUE conditional logistic\nmodel with PROC LOGISTIC STRATA=person_id. The exposed*case_status interaction estimate is the CTC term;\nexp(estimate) is the case-time-control odds ratio (= OR_cc / OR_tc).",
        "dependencies": [],
        "source_citations": [
          "wang-2011"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": "case-time-control-timeline.svg",
        "mermaid": null,
        "caption": "Timeline showing the hazard window (days 1-7 before fracture, shaded as exposed) and referent window (days 31-37 before fracture, shaded as unexposed) for Patient A (a case) and Patient B (a time control). The fill bar shows when the 30-day lorazepam supply was active. Both patients were covered during the hazard window but not the referent window, consistent with the rising background trend. Dividing OR_cc by OR_tc removes that trend.",
        "alt_text": "Two horizontal timeline rows, one per patient. Each row has a referent window bar (grey, not covered by fill) 31-37 days before fracture, a fill bar (blue) showing when the prescription was active, and a hazard window bar (orange, covered by fill) 1-7 days before fracture. An annotation shows the CTC division: OR_cc divided by OR_tc equals 2.0.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Q{Outcome abrupt + exposure transient/intermittent?} -->|No| Cohort[Use cohort design<br/>e.g. active comparator new-user]\n  Q -->|Yes| CC[Case-crossover<br/>within-person hazard vs referent]\n  CC --> Trend{Strong exposure-time trend<br/>over calendar time?}\n  Trend -->|No| UseCC[Case-crossover sufficient<br/>consider bidirectional referents]\n  Trend -->|Yes| FutureCases{Exchangeable future-case<br/>control series obtainable?}\n  FutureCases -->|Yes| CCTC[Case-case-time-control<br/>RECOMMENDED successor to CTC]\n  FutureCases -->|No, only non-cases| CTCwarn[Case-time-control<br/>WARNING: OR_tc may reintroduce<br/>between-person confounding Greenland 1996]\n  FutureCases -->|Full per-person history available| SCCS[Self-controlled case series<br/>absorb trend with period covariate]",
        "caption": "Decision logic placing case-time-control among its alternatives. CTC is a fallback only when a strong exposure-time trend exists and the cleaner successors (case-case-time-control, SCCS) are infeasible.",
        "alt_text": "Decision flowchart from acute-outcome/transient-exposure check, to case-crossover, to exposure-time-trend check, branching to case-case-time-control, self-controlled case series, or a warning-flagged case-time-control.",
        "source_type": "illustrative",
        "source_citations": [
          "greenland-1996",
          "wang-2011"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  subgraph Cases [Case series]\n    H1[Hazard window<br/>days 1-7 pre-event] --> ORcc[OR_cc:<br/>within-person<br/>exposure odds ratio]\n    R1[Referent window<br/>days 31-37 pre-event] --> ORcc\n  end\n  subgraph Controls [Time-control series: future cases / non-cases]\n    H2[Hazard window<br/>1-7 pre pseudo-index] --> ORtc[OR_tc:<br/>within-person<br/>exposure odds ratio]\n    R2[Referent window<br/>31-37 pre pseudo-index] --> ORtc\n  end\n  ORcc --> Ratio[CTC OR = OR_cc / OR_tc<br/>= exp interaction in pooled<br/>conditional logistic model]\n  ORtc --> Ratio\n  ORtc -. carries between-person confounding<br/>if controls not exchangeable .-> Bias[Greenland 1996:<br/>re-confounding risk]",
        "caption": "Estimation logic. The case-crossover odds ratio (OR_cc) is divided by the control-series odds ratio (OR_tc) to remove the exposure-time trend; if the control series is not exchangeable with the cases, OR_tc carries between-person confounding and the ratio is biased.",
        "alt_text": "Diagram showing the case series producing OR_cc from hazard and referent windows, a time-control series producing OR_tc, their ratio as the case-time-control estimate, and an annotation that OR_tc can reintroduce between-person confounding.",
        "source_type": "illustrative",
        "source_citations": [
          "suissa-1995",
          "greenland-1996"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "is_variant_of",
        "target_slug": "case-crossover",
        "notes": "CTC adds an external (time-control) series to the case-crossover to divide out the exposure-time trend the case-crossover cannot address."
      },
      {
        "relation_type": "alternative_to",
        "target_slug": "case-case-time-control",
        "notes": "Case-case-time-control replaces CTC's control series with future cases, removing both time-invariant confounding and the trend; it is the recommended modern successor that fixes CTC's confounding flaw."
      },
      {
        "relation_type": "alternative_to",
        "target_slug": "self-controlled-case-series",
        "notes": "SCCS absorbs the calendar-time exposure trend within-person via a period covariate and needs no external control series, avoiding CTC's re-confounding risk."
      },
      {
        "relation_type": "see_also",
        "target_slug": "case-control",
        "notes": "CTC's control series is conceptually a time-control analogue of the case-control comparison, but applied to the within-person exposure odds ratio rather than between-person."
      },
      {
        "relation_type": "see_also",
        "target_slug": "target-trial-emulation",
        "notes": "For sustained effects of continuously used drugs, self-controlled designs do not apply; emulate a trial with a cohort design instead."
      },
      {
        "relation_type": "used_with",
        "target_slug": "empirical-calibration-negative-controls-rwe",
        "notes": "Negative-control exposures help detect residual exposure-time-trend bias or between-person confounding left by the control series."
      }
    ],
    "aliases": [
      "CTC design",
      "case-time-control",
      "case time control design"
    ],
    "complexity": "advanced",
    "regulatory_relevance": [
      "fda",
      "ema"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "causal-mediation-effect-modification-rwe",
    "name": "Causal Mediation and Effect Modification",
    "short_definition": "A pair of distinct causal-inference tasks in RWE — mediation, which decomposes a total effect into pathway-specific direct and indirect components through a post-treatment mediator, and effect modification/interaction, which describes how the effect varies across baseline subgroups or scales — each with its own estimands, identification assumptions, and characteristic biases (notably post-treatment adjustment bias).",
    "long_description": "RWE manuscripts routinely call a variable a \"mediator,\" a \"moderator,\" a \"confounder,\" or an \"effect\nmodifier\" without specifying its causal timing or role, yet these are different objects requiring different\nanalyses. A **confounder** causes both exposure and outcome and is adjusted for. A **mediator** lies on the\ncausal path from exposure to outcome (exposure -> mediator -> outcome) and is measured *after* exposure. An\n**effect modifier (moderator)** is a baseline characteristic across whose levels the exposure effect\ndiffers in magnitude or sign. A single post-treatment variable can be simultaneously a mediator of the\nexposure and a confounder of later treatment — which is precisely why naive regression adjustment for it is\ndangerous. This entry yokes two tasks that share machinery but answer different questions: **mediation\nanalysis** (how much of the effect runs through a pathway) and **effect modification / interaction** (for\nwhom, and on what scale, the effect is larger).\n\n**Core estimand distinction — mediation.** With exposure A, mediator M, outcome Y, the *total effect* (TE)\nincludes all pathways. The **controlled direct effect (CDE)** is the A->Y effect with M fixed by\nintervention at a chosen value m. The **natural direct effect (NDE)** and **natural indirect effect (NIE)**\ndecompose TE = NDE + NIE by setting M to the value it *would naturally take* under a reference exposure —\nidentification of these requires a *cross-world* (counterfactual independence) assumption that no\nexperiment can verify. VanderWeele's **four-way decomposition** splits TE into a pure direct effect, a\nreference interaction, a mediated interaction, and a pure indirect effect, unifying mediation and\ninteraction in one framework. 
The estimand must be chosen and pre-specified: CDE answers \"what if we\nblocked the pathway,\" NDE/NIE answer \"how much of the effect is mediated,\" and the CDE and NDE coincide when\nthere is no A-by-M interaction.\n\n**Core estimand distinction — effect modification / interaction.** Effect modification is **scale-dependent**.\nA treatment can show no interaction on the multiplicative (ratio) scale yet strong interaction on the\nadditive (risk-difference) scale, and vice versa; whenever two factors each have a non-null effect,\ninteraction *cannot* be absent on both scales at once. Public-health and HTA decisions (who to treat, absolute\nbenefit, number-needed-to-treat) hinge on the **additive scale**, summarized by the **RERI** (relative\nexcess risk due to interaction), the **attributable proportion (AP)**, and the **synergy index (S)**;\nKnol & VanderWeele's reporting standard asks for both scales plus stratum-specific estimates. Critically,\n**effect modification is not the same as causal interaction**: a modifier may merely be a marker correlated\nwith the true causal interactor, and machine-learned heterogeneous treatment effects (HTE) describe\nconditional variation without licensing a causal-pathway interpretation.\n\n**Pros, cons, and trade-offs — mediation.**\n- **vs reporting only the total effect (e.g., a single PS-adjusted hazard ratio):** Mediation explains\n  *mechanism* (how much of a drug's cardiovascular benefit operates through weight or HbA1c), which is\n  valuable for label claims, surrogate validation, and pipeline decisions. Cost: it requires a correctly\n  timed, well-measured mediator and strong, untestable confounding assumptions; an unmeasured\n  mediator-outcome confounder biases NDE/NIE in an unknown direction. 
**Prefer the total effect alone** unless\n  the mechanistic question genuinely changes a decision.\n- **vs g-estimation / marginal structural models for direct effects:** When the mediator-outcome\n  confounders are themselves *affected by the exposure* (treatment-induced confounding), standard\n  regression-based mediation (Imai/VanderWeele product/difference methods) is biased and you must use\n  g-methods (MSM for natural effects, g-formula, or g-estimation of structural nested models). Cost:\n  g-methods are harder to specify and communicate. **Prefer regression-based mediation** only when no\n  intermediate confounder is plausibly on the A->L->M,Y structure.\n- **vs simply adjusting for the mediator in the outcome model:** This is the cardinal error. Conditioning\n  on a post-treatment mediator does *not* yield the total effect and generally does *not* yield a clean\n  direct effect either — it can open a collider path (M's unmeasured causes) and induce\n  **post-treatment/overadjustment bias**. Adjust for a post-treatment variable only when the estimand is\n  explicitly a controlled/natural direct effect and the cross-world or no-exposure-induced-confounding\n  assumptions are documented and defended.\n\n**Pros, cons, and trade-offs — effect modification.**\n- **vs a single pooled (marginal) effect:** Effect modification targets the policy question of *who\n  benefits*, supporting subgroup labeling and HTA reimbursement restrictions. Cost: multiplicity and\n  data-driven subgroup fishing inflate false positives; report pre-specified subgroups, the interaction\n  test, and both additive and multiplicative measures. 
**Prefer the pooled effect** when subgroups are not\n  pre-specified or the trial/cohort is underpowered for interaction (interaction tests need ~4x the sample\n  of main-effect tests).\n- **vs causal-ML heterogeneous treatment effects (causal forests, meta-learners):** ML estimates a\n  flexible CATE surface useful for hypothesis generation and individualized prediction. Cost: it does not\n  distinguish a true causal modifier from a correlated marker, and standard implementations target the\n  conditional ATE, not a pre-specified contrast. **Prefer parametric interaction terms** when the modifiers\n  are few, pre-specified, and clinically motivated.\n\n**When to use.** Use **mediation** when a stakeholder question is genuinely about mechanism or surrogacy\n(does the cardiovascular benefit of a GLP-1 agonist run through weight loss?), the mediator is measurable\nwith correct timing (post-exposure, pre-outcome), and mediator-outcome confounders are measured. Use\n**effect modification/interaction** when the decision is about targeting (which subgroup gets the largest\nabsolute benefit), the modifiers are baseline and pre-specified, and the cohort is powered for interaction.\n\n**When NOT to use — and when it is actively misleading or dangerous.**\n- **Never adjust for a post-treatment mediator in a primary total-effect model.** This is the single most\n  common and most damaging error: it changes the estimand from the total effect to an ill-defined quantity,\n  can introduce collider/overadjustment bias, and is routinely mislabeled as a \"fully adjusted\" total\n  effect. Adjust for a post-treatment variable *only* when the estimand is explicitly direct-effect-like\n  (CDE/NDE) and the assumptions are stated.\n- **Do not run regression-based mediation when a mediator-outcome confounder is affected by treatment.**\n  The product-of-coefficients and difference methods break under treatment-induced confounding; results are\n  biased even with infinite data. 
Switch to g-methods or do not decompose.\n- **Do not interpret NDE/NIE causally without confronting the cross-world assumption.** It is untestable;\n  report sensitivity analysis (e.g., the proportion of the effect that an unmeasured U would have to explain\n  to nullify the indirect effect).\n- **Do not declare \"no effect modification\" from a non-significant multiplicative interaction term.**\n  Absence of multiplicative interaction is compatible with strong additive interaction that matters for\n  treatment decisions; always report the additive scale.\n- **Do not mine subgroups.** Data-driven subgroup discovery without pre-specification or multiplicity\n  control manufactures spurious modifiers; the astrological-sign subgroup result from the ISIS-2 trial is\n  the standard cautionary tale.\n\n**Data-source operational depth.**\n- **Claims (FFS or commercial):** Strong for *utilization* mediators (adherence (PDC/MPR), treatment\n  switching, hospitalization, procedure completion) and for utilization-scale effect modifiers (line of\n  therapy, prior insulin). Weak for *biological* mediators: BMI/weight is captured only sparsely, and\n  often not contemporaneously, via ICD-10 Z68.x codes, and lab results are typically absent, so a biological mediator must be proxied\n  (e.g., bariatric-procedure codes, anti-obesity Rx initiation) or sourced from linked EHR. Failure modes:\n  (1) **Medicare Advantage-only person-time lacks fee-for-service claims**, so an apparent \"mediator not observed\"\n  can reflect missing data rather than a true zero; restrict to enrollees with full A/B/D or commercial\n  pharmacy benefit. (2) **Differential competing risks by exposure in the elderly**: in claims of older\n  adults, death competes with the non-fatal outcome and may differ by arm, so a mediator measured at a\n  later landmark is conditioned on differential survival; handle this with competing-risk or\n  landmark/clone-censor-weight approaches. 
(3) **Immortal time at the mediator-assessment window**: both\n  arms must survive (and remain enrolled) to the mediator measurement, which can induce immortal-time bias\n  if handled naively.\n- **EHR:** Best substrate for biological mediators (weight, HbA1c, eGFR, blood pressure, PROs), but\n  measurement is *visit-driven and irregular* — a mediator value exists only when a test was ordered, and\n  ordering is informative (sicker patients are measured more), so missingness is rarely at random. Define a\n  fixed mediator-assessment window relative to index, use the closest in-window value, and model\n  missingness (multiple imputation or IPW) rather than complete-case.\n- **Registry:** Strongest for adjudicated effect modifiers and mediators with clinical meaning (cancer\n  stage, disease activity, genomic markers, progression) measured on protocol-driven schedules; weak for\n  complete pharmacy exposure and full utilization pathways. Link to claims to capture utilization mediators\n  and to a death index for competing risks.\n- **Linked claims-EHR-vital records:** The ideal substrate — EHR biological mediators + claims utilization\n  completeness + reliable mortality for competing risks — but linkage selects the linkable subset and\n  introduces date-discrepancy issues (order vs fill vs lab-result dates) that must be reconciled before\n  timing a mediator relative to time zero.\n\n**Worked claims/linked example (mediation + immortal time).** Question: does the reduction in\nhospitalized heart-failure events under a GLP-1 receptor agonist vs a DPP-4 inhibitor operate through\nearly weight loss, among adults with type 2 diabetes in a linked commercial-claims + EHR database?\n(1) **Cohort (ACNU core):** age >=18, >=2 diabetes diagnoses, 365 days continuous medical+pharmacy\nenrollment, and *no* fill of any GLP-1 or DPP-4 agent in the 365-day washout; index_date = first\nqualifying fill (`fill_date`), arm assigned from the dispensed NDC. 
(2) **Mediator (M):** percent change in\nbody weight from the baseline EHR value to the value nearest a 6-month landmark (window 4-8 months\npost-index). Because weight is poorly captured in claims (Z68.x sparse), the EHR linkage supplies it; in\nclaims-only data the mediator would be proxied by anti-obesity Rx adds or bariatric procedure codes, an\ninferior operationalization to flag. (3) **Immortal-time control:** restrict the analysis to patients alive\nand enrolled at the 6-month landmark and *start follow-up for the outcome at the landmark* (landmark\nanalysis), so neither arm accrues immortal time waiting for the mediator; alternatively clone-censor-weight.\n(4) **Outcome (Y):** first hospitalized HF event (validated dx in primary position) from the landmark to\ndisenrollment, death, or data end. (5) **Confounding:** PS or covariate adjustment on baseline covariates\nmeasured in [index_date-365, index_date]; for the mediation step, additionally adjust the\nmediator-outcome relationship for landmark-window confounders — and verify none of them is\n*treatment-affected* (if HbA1c at the landmark both responds to arm and confounds weight->HF, regression\nmediation is invalid and g-methods are required). (6) **Estimands:** report TE (landmark HR), CDE (HF effect\nwith weight change fixed), and NDE/NIE with a cross-world sensitivity analysis. 
(7) **Effect-modification\ncompanion:** pre-specify baseline BMI category as a modifier and report stratum-specific HRs *and* the RERI\non the risk scale, since absolute HF reduction may be largest in the highest-BMI stratum even if the hazard\nratio is constant.\n\n**Interpreting the output**\n\nIn the GLP-1 versus DPP-4 analysis (250 per arm), decomposition yields: total risk\ndifference = −0.040; natural direct effect (NDE) = 0.000; natural indirect effect (NIE)\nthrough weight loss = −0.040; high-BMI subgroup RD = −0.06; normal-BMI subgroup RD = −0.02.\n\n*(1) Formal interpretation.* The zero NDE indicates that, had weight loss been set to its\ncounterfactual value under DPP-4 for every patient, GLP-1 and DPP-4 would produce identical\nheart-failure hospitalization risks; the entire estimated benefit is mediated through the weight-loss\npathway. The NIE of −0.040 represents the part of the total effect that operates through the mediator\nwhen exposure is fixed at GLP-1, and NDE + NIE = 0.000 + (−0.040) = −0.040 reproduces the total risk\ndifference. This decomposition relies on the sequential ignorability (cross-world) assumption: no\nunmeasured confounding of the mediator-outcome relationship, even within levels of exposure. The\neffect-modification finding (RD −0.06 in high-BMI vs −0.02 in normal-BMI) is a subgroup\ncontrast, not a mediation quantity, and must be assessed for interaction on the chosen effect\nscale before being interpreted as a modifier rather than chance variation.\n\n*(2) Practical interpretation.* Complete mediation through weight loss has a direct coverage-policy\nimplication: a payer that restricts GLP-1 coverage to patients who achieve a weight-loss threshold\nwould, under this estimate, capture essentially the entire heart-failure benefit, because the benefit\ndoes not appear to bypass the weight-loss pathway. The effect-modification finding supports label language about\nBMI-stratified expected benefit, but requires replication given the observational basis and the\nmultiple-comparison exposure inherent in subgroup reporting.",
    "primary_category": "Causal_Inference_Method",
    "tags": [
      "mediation",
      "effect-modification",
      "interaction",
      "moderators",
      "natural-direct-effect",
      "natural-indirect-effect",
      "controlled-direct-effect",
      "four-way-decomposition",
      "additive-interaction",
      "reri",
      "heterogeneity",
      "post-treatment-adjustment-bias"
    ],
    "applies_to_study_types": [
      "cohort_retrospective",
      "cohort_prospective",
      "ehr_study",
      "claims_analysis",
      "target_trial_emulation"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "linked",
      "registry"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1097/EDE.0000000000000121",
        "url": "https://doi.org/10.1097/EDE.0000000000000121",
        "citation_text": "VanderWeele TJ. A unification of mediation and interaction: a four-way decomposition. Epidemiology. 2014;25(5):749-761.",
        "year": 2014,
        "authors_short": "VanderWeele",
        "notes": "Unifies mediation and interaction into a single four-way decomposition (pure direct, reference interaction, mediated interaction, pure indirect) — the conceptual backbone tying both halves of this concept together."
      },
      {
        "role": "introduce",
        "doi": "10.1037/a0020761",
        "url": "https://doi.org/10.1037/a0020761",
        "citation_text": "Imai K, Keele L, Tingley D. A general approach to causal mediation analysis. Psychological Methods. 2010;15(4):309-334.",
        "year": 2010,
        "authors_short": "Imai et al.",
        "notes": "General counterfactual definition of natural direct/indirect effects and the simulation-based estimator implemented in the R mediation package."
      },
      {
        "role": "explain",
        "doi": "10.1146/annurev-publhealth-032315-021402",
        "url": "https://doi.org/10.1146/annurev-publhealth-032315-021402",
        "citation_text": "VanderWeele TJ. Mediation analysis: a practitioner's guide. Annual Review of Public Health. 2016;37:17-32.",
        "year": 2016,
        "authors_short": "VanderWeele",
        "notes": "Practical, assumption-by-assumption guide to choosing CDE vs NDE/NIE, handling exposure-mediator interaction, and conducting sensitivity analysis — the best single reference for applied teams."
      },
      {
        "role": "explain",
        "doi": "10.1093/ije/dyr218",
        "url": "https://doi.org/10.1093/ije/dyr218",
        "citation_text": "Knol MJ, VanderWeele TJ. Recommendations for presenting analyses of effect modification and interaction. International Journal of Epidemiology. 2012;41(2):514-520.",
        "year": 2012,
        "authors_short": "Knol & VanderWeele",
        "notes": "Defines the additive (RERI/AP/S) vs multiplicative reporting standard and stratum-specific presentation — the citation to anchor effect-modification reporting in a manuscript or SAP."
      },
      {
        "role": "demonstrate",
        "doi": "10.1214/10-STS321",
        "url": "https://doi.org/10.1214/10-STS321",
        "citation_text": "Imai K, Keele L, Yamamoto T. Identification, inference and sensitivity analysis for causal mediation effects. Statistical Science. 2010;25(1):51-71.",
        "year": 2010,
        "authors_short": "Imai et al.",
        "notes": "Formal identification, the sequential-ignorability assumption, and the sensitivity-analysis parameter for an unmeasured mediator-outcome confounder."
      },
      {
        "role": "demonstrate",
        "doi": "10.1097/EDE.0b013e31818f69ce",
        "url": "https://doi.org/10.1097/EDE.0b013e31818f69ce",
        "citation_text": "VanderWeele TJ. Marginal structural models for the estimation of direct and indirect effects. Epidemiology. 2009;20(1):18-26.",
        "year": 2009,
        "authors_short": "VanderWeele",
        "notes": "The g-method solution when mediator-outcome confounders are themselves affected by the exposure — the case where regression-based mediation is biased and an MSM is required."
      }
    ],
    "plain_language_summary": "Causal mediation analysis asks: of the total effect a drug has on an outcome, how much travels through a specific in-between biological or behavioral step (the mediator), and how much acts by other routes? Effect modification asks a different question: does the drug work better or worse for certain patient groups defined before treatment starts? These two questions sound similar but require entirely different analyses, and confusing them, or carelessly adjusting for a step that happens after treatment begins, is one of the most common errors in real-world evidence studies.",
    "key_terms": [
      {
        "term": "mediator",
        "definition": "A variable that sits on the causal path between the exposure and the outcome, occurring after treatment starts, such as weight loss caused by a drug that then reduces heart-failure risk."
      },
      {
        "term": "direct effect",
        "definition": "The portion of a drug's total effect on the outcome that does NOT pass through the mediator, sometimes called the natural direct effect (NDE)."
      },
      {
        "term": "indirect effect",
        "definition": "The portion of a drug's total effect that operates specifically through the mediator, sometimes called the natural indirect effect (NIE); together, direct and indirect effects add up to the total effect."
      },
      {
        "term": "effect modification",
        "definition": "When the size or direction of a drug's effect differs across subgroups defined by a characteristic that was measured before treatment started, such as baseline BMI category or age group."
      },
      {
        "term": "post-treatment adjustment bias",
        "definition": "The distortion that results when an analyst controls for a variable that was caused by the treatment itself; this changes what question is being answered and can introduce new, misleading associations."
      },
      {
        "term": "additive interaction",
        "definition": "A situation where two factors together produce more (or less) risk than the sum of their individual effects. It is assessed on the additive (risk-difference) scale, which is the scale most relevant for deciding who benefits most from a treatment."
      }
    ],
    "worked_example": {
      "scenario": "A 500-person cohort study compares a GLP-1 receptor agonist (drug A) to a DPP-4 inhibitor (drug B) on the 12-month risk of a heart-failure hospitalization. Researchers want to do two things: (1) decompose the drug's total risk-difference effect into the part that runs through early weight loss (the mediator) and the part that does not, and (2) check whether the drug works differently in patients with high versus normal baseline BMI (effect modification). Patients are assigned to A (n=250) or B (n=250) at the index date. Weight loss of at least 5% by the 6-month landmark is the mediator. The outcome is heart-failure hospitalization in months 7-12.",
      "dataset": {
        "caption": "Aggregated risk table (simplified from the cohort); each row is one patient subgroup defined by arm and mediator status.",
        "columns": [
          "arm",
          "achieved_5pct_weight_loss",
          "n_patients",
          "hf_events",
          "risk"
        ],
        "rows": [
          [
            "A (GLP-1)",
            "yes",
            150,
            9,
            0.06
          ],
          [
            "A (GLP-1)",
            "no",
            100,
            14,
            0.14
          ],
          [
            "B (DPP-4)",
            "yes",
            50,
            5,
            0.1
          ],
          [
            "B (DPP-4)",
            "no",
            200,
            28,
            0.14
          ]
        ]
      },
      "steps": [
        "Step 1 — Total effect. Overall risk in arm A = (9+14)/250 = 23/250 = 0.092. Overall risk in arm B = (5+28)/250 = 33/250 = 0.132. Total risk difference (RD) = 0.092 - 0.132 = -0.040, meaning drug A reduced the 12-month heart-failure risk by 4.0 percentage points.",
        "Step 2 — Direct effect (path NOT through weight loss). Among patients who did NOT achieve 5% weight loss, risk in A = 0.14 and risk in B = 0.14, so direct RD = 0.14 - 0.14 = 0.00. Strictly, this is a controlled direct effect with the mediator held at 'no weight loss'; among patients who did achieve 5% weight loss the same contrast is 0.06 - 0.10 = -0.04, so the value 0.00 is a deliberate simplification that ignores exposure-mediator interaction and mediator-outcome confounding.",
        "Step 3 — Indirect effect (path THROUGH weight loss). The indirect effect equals the total effect minus the direct effect: -0.040 - 0.000 = -0.040. Under this simplification, all of the drug's benefit travels through the weight-loss mediator.",
        "Step 4 — Arithmetic check. Direct (0.000) + indirect (-0.040) = -0.040 = total RD. The decomposition is exact by construction, because the indirect effect was defined as the remainder.",
        "Step 5 — Effect modification (a separate question). Now split by baseline BMI, a characteristic measured before treatment. In the high-BMI subgroup, RD = -0.06. In the normal-BMI subgroup, RD = -0.02. The drug's absolute benefit is three times as large in the high-BMI group. This is effect modification: the effect varies across a pre-treatment subgroup. Note that this is a different analysis from the mediation above; the mediator (weight loss) happened after treatment, while the BMI modifier was measured at baseline."
      ],
      "result": "Total RD = -0.040 (drug A reduces 12-month HF risk by 4.0 pp). Decomposition: direct effect = 0.00, indirect effect through weight loss = -0.040, sum = -0.040. Effect modification by baseline BMI: high-BMI subgroup RD = -0.06, normal-BMI subgroup RD = -0.02; the benefit is larger in the high-BMI group, but this reflects who benefits, not how the effect travels."
    },
    "prerequisites": [
      "dags-backdoor-criterion-drug-studies",
      "estimands-ate-att-intercurrent-events-rwe",
      "propensity-score-methods-psm-iptw"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Baseline effect modification (parametric interaction)",
        "description": "Estimate stratum-specific exposure effects, or fit a product (interaction) term, for pre-specified pre-treatment covariates such as age, baseline biomarker, line of therapy, frailty, or route of administration. Report both additive (RERI/AP) and multiplicative measures plus stratum-specific estimates.",
        "edge_cases": [
          "Interaction tests are under-powered: detecting an interaction of the same magnitude as a main effect takes roughly 4x the sample size, and a null multiplicative term does not rule out additive modification.",
          "Avoid data-driven subgroup discovery without pre-specification or multiplicity control."
        ],
        "data_source_notes": "claims: use baseline diagnosis/procedure/drug-class proxies for the modifier; EHR: closest in-window baseline lab value for biomarker modifiers."
      },
      {
        "name": "Causal mediation with a measured post-treatment mediator",
        "description": "Decompose the total effect into NDE/NIE (or CDE) through a mediator measured after exposure and before outcome, controlling exposure-outcome, mediator-outcome, and exposure-mediator confounding, and reporting a sensitivity analysis for unmeasured mediator-outcome confounding.",
        "edge_cases": [
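          "Mediator timing must be strictly post-exposure and pre-outcome; a variable measured before exposure is a confounder or modifier rather than a mediator, and one measured after the outcome may be a consequence of it.",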
          "Requires the cross-world (sequential ignorability) assumption for natural effects, which is untestable.",
          "Immortal-time bias if both arms must survive to a mediator-assessment landmark — use landmark or clone-censor-weight."
        ],
        "data_source_notes": "claims: biological mediators are poorly measured (BMI via sparse Z68.x); prefer linked EHR labs or proxy by procedure/Rx; EHR: measurement is visit-driven with informative missingness, so model the missingness rather than running a complete-case analysis."
      },
      {
        "name": "Mediation under treatment-affected confounding (g-methods)",
        "description": "When a mediator-outcome confounder L is itself affected by the exposure (A->L->M,Y), use a marginal structural model, the mediation g-formula, or g-estimation of structural nested models rather than regression-based product/difference methods.",
        "edge_cases": [
          "Regression-based mediation is biased here even with infinite data.",
          "Specification and communication are substantially harder; weight diagnostics and positivity must be checked."
        ],
        "data_source_notes": "Requires time-resolved measurement of L, M, and Y; linked claims-EHR is typically the minimum substrate."
      },
      {
        "name": "Additive interaction (RERI / AP / S)",
        "description": "Quantify whether the joint effect of two exposures exceeds the sum of their separate effects on the risk scale, the relevant scale for public-health and HTA targeting decisions.",
        "data_source_notes": "Compute from a single model with both main effects and the product term, then derive RERI from predicted risks (log-binomial/Poisson) with the delta method or bootstrap for CIs."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "marginal-effects-and-interpretation-of-inferential-statistics-rwe",
        "pros_of_this": "Explains mechanism (mediation) and identifies who benefits and on which scale (effect modification), beyond a single average contrast.",
        "cons_of_this": "Requires more, often untestable, assumptions and carries real risk of post-treatment adjustment bias and underpowered, multiplicity-driven false subgroup findings.",
        "when_to_prefer": "When the decision genuinely turns on mechanism/surrogacy or on targeting a pre-specified subgroup; otherwise report the total/marginal effect."
      },
      {
        "compared_to": "g-estimation-structural-nested-models",
        "pros_of_this": "Regression-based mediation is simpler to specify and communicate when no intermediate confounder is affected by the exposure.",
        "cons_of_this": "It is biased whenever a mediator-outcome confounder is treatment-affected; g-methods are required in that structure.",
        "when_to_prefer": "Use plain regression mediation only when treatment-induced confounding of the mediator is implausible; otherwise switch to g-methods."
      },
      {
        "compared_to": "predictive-and-causal-ml-models-rwe",
        "pros_of_this": "Parametric interaction with pre-specified, clinically motivated modifiers yields interpretable, testable, scale-explicit contrasts.",
        "cons_of_this": "Less flexible than causal-ML for high-dimensional heterogeneity; can miss complex modifier structure.",
        "when_to_prefer": "When modifiers are few and pre-specified; use causal-ML HTE for hypothesis generation, remembering it does not certify a causal-pathway interpretation."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Strong for utilization mediators (adherence, switching, HCRU, procedures) and utilization-scale modifiers; weak for biological mediators (BMI via sparse Z68.x, no labs). Restrict to FFS/full-benefit person-time so an unobserved mediator is a true zero, not MA-only missingness, and account for differential competing risks by exposure in the elderly.",
      "ehr": "Best for biological mediators (weight, HbA1c, eGFR, PROs) but measurement is visit-driven and informatively missing; define a fixed mediator-assessment window, take the closest in-window value, and model missingness (MI or IPW) rather than complete-case.",
      "registry": "Strong for adjudicated, protocol-scheduled mediators/modifiers (stage, disease activity, progression, genomics); weak for pharmacy exposure and utilization pathways — link to claims and a death index.",
      "linked": "Ideal substrate (EHR biology + claims utilization + reliable mortality for competing risks) but linkage selects the linkable subset and creates order/fill/result date discrepancies that must be reconciled before timing the mediator relative to time zero."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\nimport pandas as pd\nimport statsmodels.api as sm\nimport statsmodels.formula.api as smf\n\n# ---- Additive interaction (RERI) on the risk scale -------------------------------\n# Log-binomial gives risk ratios; RERI = RR11 - RR10 - RR01 + 1 (Knol & VanderWeele 2012).\nem = smf.glm(\"Y ~ A * high_bmi + age + cci\", data=df,\n             family=sm.families.Binomial(sm.families.links.Log())).fit()\nb = em.params\nRR10 = np.exp(b[\"A\"])                       # A only\nRR01 = np.exp(b[\"high_bmi\"])               # modifier only\nRR11 = np.exp(b[\"A\"] + b[\"high_bmi\"] + b[\"A:high_bmi\"])  # both\nreri = RR11 - RR10 - RR01 + 1\nprint(f\"RERI (additive interaction) = {reri:.3f}  (>0 => positive additive interaction)\")\n\n# ---- Regression-based mediation: NDE / NIE via the difference method -------------\n# Difference method: total effect minus mediator-adjusted (direct) effect on the RR scale.\n# Requires: no exposure-induced mediator-outcome confounding (else g-methods).\ndef mediation_rr(data):\n    tot = smf.glm(\"Y ~ A + age + cci\", data=data,\n                  family=sm.families.Binomial(sm.families.links.Log())).fit()\n    dir_ = smf.glm(\"Y ~ A + M + age + cci\", data=data,\n                   family=sm.families.Binomial(sm.families.links.Log())).fit()\n    te  = np.exp(tot.params[\"A\"])          # total effect (RR)\n    nde = np.exp(dir_.params[\"A\"])         # direct effect with M held in the model (RR)\n    nie = te / nde                          # indirect (mediated) effect on the RR scale\n    prop_med = np.log(nie) / np.log(te)    # proportion mediated (log-RR scale)\n    return te, nde, nie, prop_med\n\nte, nde, nie, prop = mediation_rr(df)\nrng = np.random.default_rng(20240601)\nboot = np.array([mediation_rr(df.sample(len(df), replace=True, random_state=int(rng.integers(1e9))))\n                 for _ in range(1000)])\nlo, hi = np.percentile(boot[:, 2], [2.5, 97.5])  # bootstrap CI for the indirect effect 
(NIE)\nprint(f\"TE={te:.2f}  NDE={nde:.2f}  NIE={nie:.2f} (95% CI {lo:.2f}-{hi:.2f})  prop. mediated={prop:.2%}\")",
        "description": "Effect modification (additive RERI) and regression-based mediation on a binary outcome.\nRequired input table `df` (one row per subject, cohort + baseline + landmark variables already built):\n  A          : exposure arm, 1 = study drug, 0 = active comparator (assigned at index_date)\n  M          : binary post-treatment mediator (e.g., >=5% weight loss by the 6-month landmark)\n  Y          : binary outcome (1 = hospitalized HF event after the landmark)\n  high_bmi   : binary baseline effect modifier (1 = baseline BMI >= 35)\n  age, cci   : baseline confounders (measured in [index_date-365, index_date])\nFit log-binomial models so exponentiated coefficients are risk ratios and RERI is on the additive risk scale. The\nmediation block estimates the NIE/NDE via the difference method with a nonparametric bootstrap CI; it is\nvalid ONLY if there is no exposure-mediator interaction and no mediator-outcome confounder is affected by A\n(otherwise use an MSM / g-formula).",
        "dependencies": [
          "pandas",
          "numpy",
          "statsmodels"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(mediation)\nlibrary(interactionR)\n\n## ---- Causal mediation: NDE / NIE with an A*M interaction allowed -----------------\nmed.fit <- glm(M ~ A + age + cci, family = binomial, data = df)          # mediator model\nout.fit <- glm(Y ~ A * M + age + cci, family = binomial, data = df)      # outcome model (interaction)\nset.seed(20240601)\nmed <- mediate(med.fit, out.fit, treat = \"A\", mediator = \"M\",\n               robustSE = TRUE, sims = 1000)\nsummary(med)        # ACME (=NIE), ADE (=NDE), total effect, proportion mediated, 95% CIs\n## Sensitivity to an unmeasured mediator-outcome confounder (Imai/Keele/Yamamoto 2010).\n## medsens() is documented for continuous or probit mediator/outcome models and is not expected to run\n## on this logit binary-mediator / binary-outcome pair, so the call is shown commented out; refit with a\n## supported model pair before enabling it.\n# summary(medsens(med, rho.by = 0.1))   # rho at which the indirect effect crosses 0\n\n## ---- Additive interaction (RERI / AP / S) ---------------------------------------\nem.fit <- glm(Y ~ A * high_bmi + age + cci, family = binomial, data = df)\ninteractionR(em.fit, exposure_names = c(\"A\", \"high_bmi\"),\n             ci.type = \"delta\", ci.level = 0.95)   # RERI, AP, synergy index with CIs",
        "description": "Causal mediation with simulation-based NDE/NIE and additive interaction (RERI) in R.\nInput data.frame `df` mirrors the Python version (A, M, Y, high_bmi, age, cci; cohort + landmark built).\nThe mediation::mediate path estimates natural effects under sequential ignorability and allows an\nexposure-mediator interaction; interactionR computes RERI/AP/S with CIs from a single logistic model.",
        "dependencies": [
          "mediation",
          "interactionR"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* ---- Causal mediation: CDE, NDE, NIE, and four-way decomposition --------------- */\n/* DECOMP is a PROC option; bootstrap settings go in the BOOTSTRAP statement and    */\n/* covariates in COVAR (they enter both the outcome and the mediator model).        */\nproc causalmed data=work.med decomp;                                  /* 4-way decomposition */\n  class A(ref='0') M(ref='0') high_bmi(ref='0') / param=ref;\n  model Y(event='1') = A M A*M / dist=binomial link=logit;            /* outcome model w/ A*M */\n  mediator M(event='1') = A / dist=binomial link=logit;               /* mediator model      */\n  covar age cci;\n  bootstrap nboot=1000 seed=20240601;\n  /* Covariate profile and CDE mediator level. The labeled form and the _mstar keyword  */\n  /* are recalled from the SAS/STAT documentation; verify them against your release.   */\n  evaluate 'age 65, cci 2, M fixed at 0' age=65 cci=2 _mstar=0;\nrun;\n\n/* ---- Additive interaction (RERI) from a log-binomial model --------------------- */\n/* RERI = RR11 - RR10 - RR01 + 1 is nonlinear in the coefficients, so a single      */\n/* ESTIMATE statement cannot return it. Estimate the three RRs, then combine them;  */\n/* obtain the CI by the delta method (e.g., PROC NLMIXED) or a bootstrap.           */\nproc genmod data=work.med;\n  /* A and high_bmi are numeric 0/1, so no CLASS statement is needed */\n  model Y(event='1') = A high_bmi A*high_bmi age cci / dist=binomial link=log;\n  estimate 'RR10: A only'        A 1 / exp;\n  estimate 'RR01: high_bmi only' high_bmi 1 / exp;\n  estimate 'RR11: both'          A 1 high_bmi 1 A*high_bmi 1 / exp;\nrun;",
        "description": "Causal mediation with PROC CAUSALMED and additive interaction (RERI) with PROC GENMOD in SAS.\nRequired input dataset work.med (one row per subject; cohort + baseline + landmark already constructed):\n  A        : exposure (0/1)        M : binary post-treatment mediator (0/1)\n  Y        : binary outcome (0/1)  high_bmi : binary baseline modifier (0/1)\n  age cci  : baseline confounders measured in [index_date-365, index_date]\nPROC CAUSALMED (SAS/STAT 14.3+) reports CDE, NDE, NIE, total effect and the four-way decomposition with\nbootstrap CIs; it does NOT give additive interaction, so RERI is computed separately from a log-binomial\nmodel. CAUSALMED assumes no exposure-induced mediator-outcome confounding — if that fails, use a\nweighted (MSM) approach instead.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "graph LR\n  L[Baseline confounder / effect modifier<br/>e.g., age, baseline BMI] --> A\n  L --> Y\n  A[Exposure A<br/>drug vs active comparator] --> M[Mediator M<br/>post-treatment, e.g., weight change]\n  A --> Y[Outcome Y<br/>hospitalized HF]\n  M --> Y\n  U[Unmeasured mediator-outcome confounder]:::bad --> M\n  U --> Y\n  L -. modifies effect of A on Y .-> Y\n  classDef bad fill:#fee2e2,stroke:#b91c1c;\n  class M mediator;\n  classDef mediator fill:#e0f2fe,stroke:#0369a1;",
        "caption": "Causal structure for mediation and effect modification. M is post-treatment and on the A->Y path (a mediator, never to be adjusted for in a total-effect model); L is a baseline confounder that also modifies the A->Y effect; U is the unmeasured mediator-outcome confounder that biases NDE/NIE and is the target of sensitivity analysis.",
        "alt_text": "Directed acyclic graph showing exposure A causing mediator M and outcome Y, M causing Y, a baseline confounder/modifier L pointing to A and Y, and an unmeasured confounder U pointing to both M and Y.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Q[A covariate V you are tempted to put in the model] --> T{Measured BEFORE exposure?}\n  T -- No --> P{Is the estimand explicitly a direct effect<br/>CDE / NDE?}\n  P -- No --> STOP[DO NOT adjust for V<br/>post-treatment adjustment bias]:::bad\n  P -- Yes --> G{Is any mediator-outcome confounder<br/>affected by the exposure?}\n  G -- Yes --> GM[Use g-methods: MSM / g-formula /<br/>g-estimation of SNMs]:::good\n  G -- No --> MED[Regression mediation: estimate CDE/NDE/NIE<br/>+ sensitivity analysis]:::good\n  T -- Yes --> C{Does V cause both exposure and outcome?}\n  C -- Yes --> CONF[Confounder: adjust / PS / match]:::good\n  C -- No --> EM{Does the A->Y effect differ across levels of V?}\n  EM -- Yes --> MOD[Effect modifier: stratify + report<br/>additive RERI and multiplicative]:::good\n  EM -- No --> NEU[Likely a precision covariate or marker;<br/>no special handling]\n  classDef bad fill:#fee2e2,stroke:#b91c1c;\n  classDef good fill:#dcfce7,stroke:#15803d;",
        "caption": "Decision logic for classifying a candidate variable as a confounder, mediator, or effect modifier, and the analytic consequence of each. The most-violated branch in RWE is the one that ends in 'DO NOT adjust': adjusting for a post-treatment variable in a total-effect model.",
        "alt_text": "Decision tree determining whether a variable measured before or after exposure should be adjusted for, treated as a mediator requiring direct-effect estimands or g-methods, treated as a confounder, or treated as an effect modifier reported on additive and multiplicative scales.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "see_also",
        "target_slug": "marginal-effects-and-interpretation-of-inferential-statistics-rwe",
        "notes": "Report stratum-specific marginal effects (and absolute risk differences) rather than only model interaction coefficients; the additive scale is the policy-relevant one."
      },
      {
        "relation_type": "used_with",
        "target_slug": "g-estimation-structural-nested-models",
        "notes": "Required for mediation/direct effects when a mediator-outcome confounder is itself affected by the exposure (treatment-induced confounding)."
      },
      {
        "relation_type": "see_also",
        "target_slug": "estimands-ate-att-intercurrent-events-rwe",
        "notes": "CDE, NDE, NIE, the four-way decomposition, and subgroup/interaction contrasts are distinct estimands that must be specified before modeling."
      },
      {
        "relation_type": "see_also",
        "target_slug": "predictive-and-causal-ml-models-rwe",
        "notes": "Causal-ML estimates heterogeneous (conditional) treatment effects but does not certify a causal-pathway or true-modifier interpretation."
      },
      {
        "relation_type": "see_also",
        "target_slug": "immortal-time-bias-handling",
        "notes": "Requiring both arms to survive to a mediator-assessment landmark can induce immortal time; address with landmark analysis or clone-censor-weight."
      }
    ],
    "aliases": [
      "mediation",
      "causal mediation analysis",
      "mediators",
      "moderators",
      "effect modifiers",
      "effect modification",
      "interaction",
      "natural direct effect",
      "natural indirect effect",
      "controlled direct effect",
      "four-way decomposition",
      "RERI"
    ],
    "complexity": "advanced",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "cdisc-sdtm-adam-rwe",
    "name": "CDISC Standards (SDTM/ADaM) for RWE Submissions",
    "short_definition": "CDISC's Study Data Tabulation Model (SDTM) and Analysis Data Model (ADaM) are the electronic submission formats the FDA requires for study data packages; applying them to real-world data forces explicit modeling decisions — mapping claims dispensings to EX-domain exposures, translating diagnosis codes to MedDRA for adverse-event domains, constructing ADTTE parameters for estimand-aligned time-to-event analyses — and documenting every derivation in define.xml and a Reviewer's Guide so an independent analyst can verify the traceability chain from raw source record to headline estimate.",
    "long_description": "**The standards trio: what CDISC requires for an FDA electronic submission**\n\nThe Clinical Data Interchange Standards Consortium (CDISC) defines a layered set of electronic\nsubmission standards that the FDA has required for new drug applications since 2016 and that\nincreasingly governs real-world evidence (RWE) submissions. Three layers are essential.\n\n*SDTM — Study Data Tabulation Model.* SDTM organizes subject-level data into domain datasets:\none dataset per clinical domain, structured so an FDA reviewer can navigate directly to the\nraw observations without further transformation. Each domain has a fixed set of required and\nexpected variables defined in the SDTM Implementation Guide (SDTMIG). The most relevant\ndomains for RWE studies are: **DM** (Demographics — one row per subject, required in every\nsubmission); **EX** (Exposure — one row per administered or dispensed dose, the primary link\nfrom claims pharmacy records to the submission package); **AE** (Adverse Events — one row per\nadverse event, with the term coded to MedDRA); **CM** (Concomitant Medications — other drugs\nthe subject was taking); and **LB** (Laboratory results). SDTM is not an analytic format; it\nis a standardized tabulation of what happened to whom, when, and at what dose.\n\n*ADaM — Analysis Data Model.* ADaM sits above SDTM and is the layer that actually feeds\nstatistical analyses and tables, listings, and figures (TLFs). The core datasets are:\n**ADSL** (the subject-level dataset — one row per subject, containing all baseline flags,\ntreatment arm assignments, and stratification variables); **ADTTE** (time-to-event dataset —\none row per subject per parameter, with PARAMCD, PARAM, AVAL, CNSR, ADT, and STARTDT as the\ncritical variables); **ADAE** (adverse event analysis dataset); and **ADLB** (lab analysis\ndataset). 
Every ADaM derived variable is required to trace back to its SDTM source variable,\nthe derivation algorithm, and the governing analysis plan section.\n\n*Define.xml and Reviewer's Guides.* Define.xml is the machine-readable metadata catalog —\nit documents every dataset, every variable, every controlled-terminology code, and every\nderivation algorithm. The Analysis Data Reviewer's Guide (ADRG) and Study Data Reviewer's\nGuide (SDRG) are human-readable companions that explain the analysis choices, dataset\nrelationships, and how to navigate the submission. Together these three artifacts — define.xml,\nADRG, SDRG — are what allow an FDA statistician to re-derive the headline estimate without\nasking the sponsor a single question.\n\n**The RWD-to-SDTM mapping problem: decisions masquerading as lookups**\n\nThe single most consequential insight for RWE practitioners is that RWD-to-SDTM mapping is\nnot a lookup process — it is a sequence of documented modeling decisions that the define.xml\nand ADRG must expose and justify.\n\n*Claims dispensing → EX (exposure) domain.* An administrative pharmacy claim carries a\nfill_date, an NDC code, and a days_supply. Mapping to EX requires deciding: Does fill_date\nequal exposure start (EXSTDTC)? Most RWE protocols treat dispensing date as exposure start\nbecause administration date is unobservable in claims — but this is not true for IV therapies\nor inpatient administration, where the service date differs. What dose (EXDOSE) and unit\n(EXDOSU) are assigned? For fixed-dose products the NDC determines the dose; for weight-based\nor titrated drugs this becomes a complex derivation. Does a 90-day mail-order fill represent\none exposure row or 90 one-day rows? 
The SDTMIG provides guidance, but the sponsor chooses.\nEvery choice becomes a row in define.xml.\n\n*Enrollment periods → DS (Disposition) domain.* An insurance enrollment span does not map\ncleanly to the DS domain, which tracks subject disposition events (entered study, completed,\ndiscontinued). A claims enrollment period is an eligibility flag, not a protocol-defined\ndisposition event. The sponsor must decide how to represent enrollment gaps, plan-type changes,\nand coverage termination in a domain designed for randomized trial milestones.\n\n*Diagnosis codes → MedDRA for the AE domain.* ICD-10 codes in claims were assigned for billing,\nnot for adverse event reporting. Mapping a billing code (ICD-10-CM) to a MedDRA preferred term\nrequires a code-to-concept crosswalk plus a decision about which MedDRA hierarchy\nlevel (for example, lowest-level term versus preferred term) to use. A single ICD-10 code can map to multiple MedDRA preferred terms depending on\ncontext; the sponsor must document which mapping was used and why, and the mapping is a\nprotocol-level decision that must appear in the analysis plan before any data are coded.\n\n**ADaM for the target-trial world: ADSL and ADTTE parameterization**\n\nFor RWE studies using a target-trial emulation framework, ADaM's ADTTE dataset is the primary\nvehicle for time-to-event estimands. The PARAMCD variable distinguishes parameters within the\nsame dataset: a single ADTTE dataset can contain TTDISC (time to discontinuation), TTEVENT\n(time to primary event), and TTDEATH (time to death) as separate PARAMCD values. This\nstructure maps naturally to the ICH E9(R1) variable and intercurrent-event attributes: AVAL is the continuous\ntime in days, ADT is the event or censoring date, and CNSR (0 = event, 1 = censored)\nimplements the intercurrent-event strategy. 
An ADTTE row with CNSR = 0 and AVAL = 87 means\nthe event occurred exactly 87 days after the reference date (STARTDT = ADT - 87 days);\nthe estimand-to-ADaM traceability matrix must document which intercurrent events are censored\n(producing CNSR = 1) and which are treated as events (CNSR = 0).\n\n**Traceability as the regulatory currency**\n\nThe FDA Study Data Technical Conformance Guide states that every ADaM derived variable\nmust be traceable to its source, either through a direct source variable reference in\ndefine.xml or through an explicit derivation algorithm. For RWE submissions, this traceability\nruns three levels deep: ADaM variable → SDTM source variable → raw data field (e.g., claim\nline on the institutional file or the pharmacy claims table). The complete chain —\nfill_date (claims) → EXSTDTC (SDTM EX) → STARTDT (ADaM ADTTE) → AVAL derivation (ADTTE) —\nis what a regulatory reviewer can follow, step by step, using define.xml and the ADRG.\nIf any link in the chain is undocumented, the submission receives a deficiency letter. This\nis why the CDISC standards are the regulatory currency for RWE: they impose a data structure\nthat forces traceability into the submission artifact rather than leaving it to the sponsor's\nnarrative.\n\n**OMOP vs CDISC: research CDM vs submission standard**\n\nOMOP (the OHDSI Common Data Model) and CDISC serve different purposes and should not be\nconfused. OMOP is a research CDM: its goal is to standardize data structure and vocabulary\nacross many sites so that one analysis script runs everywhere. CDISC is a submission standard:\nits goal is to provide a regulatory reviewer with a navigable, auditable package for a single\nstudy. OMOP is optimized for portability and network analytics; CDISC is optimized for\nper-submission transparency. 
Mappings from OMOP standard concepts (RxNorm, SNOMED) to CDISC\ncontrolled terminology (drug dictionaries, MedDRA) exist and are partially automated, but\nthe mapping is lossy — concept granularity differs, route/form distinctions differ, and\nOMOP's OBSERVATION_PERIOD does not translate directly to any SDTM domain. A study can be\nexecuted on an OMOP CDM and then post-processed into CDISC for submission, but the\nOMOP-to-CDISC conversion layer requires the same mapping decisions as a raw-claims-to-CDISC\nconversion, just with a more standardized input vocabulary.\n\n**Pros, cons, and trade-offs**\n\n*Pros of full CDISC SDTM/ADaM compliance for RWE submissions:* The primary benefit is\nregulatory credibility — an FDA reviewer can navigate the submission without sponsor\nassistance, run independent QC against define.xml, and verify the traceability chain from\nany headline estimate back to the raw dispensing or diagnosis record. CDISC-compliant\nsubmissions experience fewer deficiency letters and shorter review timelines than\nnon-conformant ones. A secondary benefit is internal auditability: the discipline of building\ndefine.xml before delivering ADaM datasets forces documentation of every modeling decision\nearly, when changing it is cheap.\n\n*Cons and costs:* CDISC compliance is expensive for RWE. Most commercial claims and EHR\ndatabases are not structured as SDTM, and the conversion layer requires specialized\nprogramming (SAS is the dominant language for CDISC production work), vocabulary crosswalks\n(NDC to WHO drug dictionary, ICD-10 to MedDRA), and a structured define.xml generation\npipeline. The effort can reach 20–40% of total study programming cost for a complex\nsubmission. 
Waivers from full compliance exist (the FDA can grant them for specific domains\nwhere mapping is not feasible), but waiver requests must be submitted early and are not\nguaranteed.\n\n*Trade-off with OMOP-only pipelines:* Using OMOP for the analytic layer and converting to\nCDISC only for submission is the most practical approach for most RWE submissions, but the\nOMOP-to-CDISC conversion is not a push-button step — it requires the same mapping decisions\nas a raw-to-CDISC conversion, and the conversion layer must itself be auditable.\n\n**When to use**\n\nCDISC SDTM/ADaM compliance is required or strongly indicated when: (1) the study will be\nsubmitted to FDA as part of a regulatory action — new drug application, supplemental\napplication, post-marketing requirement, or externally controlled trial; (2) the study is\na confirmatory RWE study under the FDA RWE Action Plan or a post-market safety commitment;\n(3) the study will be reviewed by EMA or another regulatory body whose guidelines cite CDISC\nstandards (EMA expects CDISC alignment for ICH E9(R1) submissions); (4) the study involves\nan electronic patient data submission to a regulatory authority, even if the primary data\nare observational. Plan for CDISC compliance at protocol conception — defining the EX-domain\nmapping rules and the ADTTE PARAMCD structure before data extraction is far cheaper than\nretro-fitting after the analysis.\n\n**When NOT to use**\n\nDo not apply full CDISC SDTM/ADaM production pipelines to internal exploratory or\nhypothesis-generating analyses where the regulatory reviewer is not the audience. The overhead\nis unjustifiable for a feasibility study, a retrospective chart review used only for internal\nplanning, or a real-world data analysis whose deliverable is a journal manuscript rather than\na regulatory submission. 
It also becomes **actively misleading** when CDISC-like formatting\nis applied to a study whose underlying modeling decisions are undocumented — a define.xml that\nlists variable labels but does not describe derivation algorithms gives a reviewer the\nappearance of compliance while hiding the critical decisions. Partial compliance with empty\nderivation fields is worse than no compliance, because it creates false confidence.\n\n**Interpreting the output**\n\nThe worked example produces an ADaM ADTTE record with AVAL = 87. This single field is the\ncharacteristic artifact of the traceability chain.\n\n*(1) Formal interpretation.* AVAL = 87 is the time-to-discontinuation in days for patient\nSTUDY-001-1001, measured from STARTDT = 2023-01-01 (the date of the first qualifying\napixaban dispensing, sourced from EXSTDTC in the EX domain, which was itself derived from\nthe claims fill_date via a documented mapping rule) to ADT = 2023-03-29 (the first day after\nthe last supply window closed with no qualifying refill within the grace period). CNSR = 0\nmeans the event (discontinuation) was observed; this patient contributes an uncensored\nobservation to any ADTTE-based time-to-event analysis. PARAMCD = TTDISC identifies the\nparameter; the corresponding ADSL record anchors the subject identifier, treatment arm, and\nbaseline characteristics. The value 87 is reproduced exactly from the date arithmetic:\nfrom Jan 1 to Feb 1 = 31 days, Feb 1 to Mar 1 = 28 days, Mar 1 to Mar 29 = 28 days,\ntotal = 31 + 28 + 28 = 87 days. No approximation or rounding occurs in any link of the chain.\n\n*(2) Practical interpretation.* AVAL = 87 tells a regulatory reviewer that this patient\nremained on apixaban for approximately three months before stopping. 
Crucially, an independent\nanalyst can verify this value without contacting the sponsor: define.xml documents that AVAL\nis derived as ADT minus STARTDT; STARTDT traces to EX.EXSTDTC; EXSTDTC traces to the\npharmacy claims fill_date; and ADT traces to the last covered day plus one, with the\n30-day grace-period rule documented in the ADRG. If the 30-day grace rule had been 60 days,\nthe patient might not have been classified as discontinuing at all (a refill arriving within\n60 days would extend the treatment episode). The value 87 is only interpretable — and only\nverifiable — if every one of these derivation choices is visible in define.xml and the ADRG.\nThat auditability is the entire regulatory purpose of CDISC compliance for RWE.",
    "primary_category": "Data_Standard",
    "tags": [
      "cdisc",
      "sdtm",
      "adam",
      "regulatory-submission",
      "define-xml",
      "data-standards",
      "rwe-submission",
      "traceability",
      "ehr-to-sdtm",
      "claims-to-sdtm",
      "fda-submission",
      "adtte",
      "adsl",
      "meddra"
    ],
    "applies_to_study_types": [
      "claims_analysis",
      "ehr_study",
      "registry_linkage",
      "target_trial_emulation"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.2196/30363",
        "url": "https://doi.org/10.2196/30363",
        "citation_text": "Facile R, Muhlbradt EE, Gong M, et al. Use of Clinical Data Interchange Standards Consortium (CDISC) Standards for Real-world Data: Expert Perspectives. JMIR Medical Informatics. 2022;10(1):e30363.",
        "year": 2022,
        "authors_short": "Facile et al.",
        "notes": "Systematic expert-consensus assessment of how CDISC SDTM and ADaM standards apply to real-world data sources (claims, EHR, registry), identifying the specific mapping challenges — including the NDC-to-EXTRT, diagnosis-to-MedDRA, and enrollment-to-DS problems — that this catalog entry addresses. The foundational reference for RWE practitioners approaching CDISC compliance for the first time."
      },
      {
        "role": "explain",
        "doi": "10.47912/jscdm.128",
        "url": "https://doi.org/10.47912/jscdm.128",
        "citation_text": "Sniadecki J, Lucey S, Shiu D. Integration of Clinical Trial and Real-World Data: A Case Study Examination of CDISC Standards. Journal of the Society for Clinical Data Management. 2023;5(1).",
        "year": 2023,
        "authors_short": "Sniadecki et al.",
        "notes": "Practical case study of integrating RWD into a CDISC-conformant submission package, covering the SDTM domain mapping decisions, define.xml requirements, and the reviewer-guide documentation needed when real-world sources supplement clinical trial data in an FDA submission."
      },
      {
        "role": "use",
        "doi": "10.1016/j.jval.2024.03.1698",
        "url": "https://doi.org/10.1016/j.jval.2024.03.1698",
        "citation_text": "Girman CJ, Mack CD, Teltsch DY, Lieberman DA. Key Components of a Strong Real World Data Quality and Relevance Package for Regulatory or Health Technology Assessment Submission. Value in Health. 2024;27(S1):S292.",
        "year": 2024,
        "authors_short": "Girman et al.",
        "notes": "Frames data quality and relevance documentation — including standards-conformance evidence — as a structured package the FDA and HTA bodies require; situates CDISC compliance within the broader regulatory submission quality framework."
      }
    ],
    "plain_language_summary": "When a real-world evidence study is submitted to the FDA for regulatory review, the data cannot arrive as a raw spreadsheet — they must follow two specific data structures called SDTM (which organizes what happened to each patient, like drug fills and diagnoses) and ADaM (which organizes the analysis-ready numbers, like time on treatment). Converting claims or electronic health record data into these structures is not automatic: every choice about how to map a pharmacy dispensing to an exposure record, or how to translate a billing diagnosis code into a standard adverse-event term, is a documented decision the FDA reviewer can audit. Think of it as building a paper trail so that any number in the final statistical table can be traced back, step by step, to the original patient data — and the CDISC standards define exactly what that paper trail must contain.",
    "key_terms": [
      {
        "term": "SDTM domain",
        "definition": "A standardized dataset in CDISC format that holds one category of clinical observation — for example, the EX domain holds drug exposure records and the AE domain holds adverse events — with fixed column names recognized by FDA review software."
      },
      {
        "term": "ADaM ADTTE",
        "definition": "The ADaM time-to-event dataset, with one row per patient per analysis parameter; AVAL is the number of days until the event or censoring, and CNSR (0 = event, 1 = censored) records whether the event occurred."
      },
      {
        "term": "define.xml",
        "definition": "A machine-readable file required in every FDA submission that documents every dataset name, every variable, its allowable values, and the rule used to derive it — the metadata catalog that lets a reviewer understand and verify the data without asking the sponsor."
      },
      {
        "term": "MedDRA",
        "definition": "The Medical Dictionary for Regulatory Activities — the controlled vocabulary the FDA and other regulators require for coding adverse events and medical history in submissions; mapping ICD-10 billing codes to MedDRA terms requires a documented crosswalk."
      },
      {
        "term": "EXSTDTC",
        "definition": "The SDTM EX domain variable for exposure start date in ISO 8601 format (YYYY-MM-DD); in RWE studies it is typically derived from the pharmacy claims fill_date, and the derivation rule must appear in define.xml."
      },
      {
        "term": "traceability chain",
        "definition": "The documented path from a raw data field (e.g., a claims fill_date) through each transformation step (EX domain, ADaM derivation) to the final analysis number, such that any intermediate value can be independently reproduced."
      }
    ],
    "worked_example": {
      "scenario": "A pharmacoepidemiology team is preparing an FDA regulatory submission for a post-marketing safety study of apixaban in atrial fibrillation. The analytic dataset is built from commercial claims, and the team must produce CDISC-conformant SDTM and ADaM datasets. They trace one patient's first qualifying dispensing — an 87-day supply of apixaban 5 mg dispensed on January 1, 2023 — through the three-layer CDISC chain to produce an ADaM ADTTE record for the time-to-discontinuation parameter. The patient does not refill within the 30-day grace period after the supply ends, so discontinuation is observed. The team must document every modeling decision in define.xml and the ADRG.",
      "dataset": {
        "caption": "Three-layer traceability chain for patient STUDY-001-1001: source claims row → SDTM EX domain record → ADaM ADTTE record. Each row shows one field, its value, and the derivation rule or modeling decision that produced it. The Derivation Rule column is what define.xml and the ADRG must document.",
        "columns": [
          "Layer",
          "Field_Name",
          "Value",
          "Derivation_Rule_or_Modeling_Decision"
        ],
        "rows": [
          [
            "Source (pharmacy claim)",
            "fill_date",
            "2023-01-01",
            "Date of first qualifying apixaban dispensing; raw field from the pharmacy claims table"
          ],
          [
            "Source (pharmacy claim)",
            "ndc",
            "00310-0892-10",
            "Illustrative NDC standing in for an apixaban 5 mg tablet (placeholder value, not a verified product code); maps to EXTRT and EXDOSE via sponsor formulary crosswalk"
          ],
          [
            "Source (pharmacy claim)",
            "days_supply",
            87,
            "Days of supply recorded on the claim (not the tablet quantity); 87 days of supply covers January 1 through March 28"
          ],
          [
            "SDTM EX domain",
            "EXSTDTC",
            "2023-01-01",
            "Modeling decision: fill_date = exposure start date; inpatient or IV administration would require a different rule"
          ],
          [
            "SDTM EX domain",
            "EXTRT",
            "APIXABAN",
            "Modeling decision: NDC translated to generic drug name via sponsor formulary crosswalk (not an automated CDISC lookup)"
          ],
          [
            "SDTM EX domain",
            "EXDOSE / EXDOSU",
            "5 / mg",
            "Dose from NDC product label for the 5 mg tablet form; EXDOSU code from CDISC controlled terminology (CT)"
          ],
          [
            "ADaM ADTTE",
            "PARAMCD / PARAM",
            "TTDISC / Time to Discontinuation (days)",
            "Pre-specified analysis parameter; PARAMCD and PARAM values frozen in the SAP before any data extraction"
          ],
          [
            "ADaM ADTTE",
            "STARTDT",
            "2023-01-01",
            "Sourced from EX.EXSTDTC; traceability documented in define.xml STARTDT derivation algorithm"
          ],
          [
            "ADaM ADTTE",
            "ADT",
            "2023-03-29",
            "Last covered day (March 28 = fill_date + 86 days) + 1; no qualifying refill within the 30-day grace period"
          ],
          [
            "ADaM ADTTE",
            "AVAL",
            87,
            "ADT minus STARTDT = March 29 minus January 1; derivation algorithm documented in define.xml"
          ],
          [
            "ADaM ADTTE",
            "CNSR",
            0,
            "0 = event observed (discontinuation); 1 = censored; strategy documented in the ADRG intercurrent-event section"
          ]
        ]
      },
      "steps": [
        "The source pharmacy claim for patient 1001 records a fill on 2023-01-01 for NDC 00310-0892-10 (apixaban 5 mg, 87 days supply). This is the raw evidence layer; nothing has been transformed yet. Because the fill date itself counts as the first covered day, the 87-day supply provides coverage from January 1 through March 28: the last covered day is fill_date + (days_supply - 1) days, that is, 87 - 1 = 86 days after January 1. Verification of 86 days from January 1 to March 28: January contributes 30 days (Jan 2 through Jan 31), February contributes 28 days, and March 1 through March 28 contributes 28 days; 30 + 28 + 28 = 86 calendar days, confirming March 28 as the last covered day.",
        "The SDTM EX domain record is built from the claims row. Modeling decision 1: fill_date (2023-01-01) becomes EXSTDTC (2023-01-01) because the sponsor protocol designates dispensing date as the exposure start; this would differ for IV or inpatient drugs. Modeling decision 2: the NDC is translated to EXTRT = APIXABAN via the sponsor's formulary crosswalk, and EXDOSE = 5 and EXDOSU = mg are drawn from the product label for that NDC. These are not automatic mappings; define.xml must describe the crosswalk source and version in the EXTRT variable-level metadata.",
        "No qualifying apixaban refill arrives within 30 days after March 28 (the last covered day), so discontinuation is confirmed once the grace period elapses. The ADaM discontinuation date ADT is back-dated to March 29, 2023, the first day the patient is no longer covered by supply; the grace period determines whether discontinuation occurred, not the date assigned to it. The STARTDT in ADTTE is sourced from EXSTDTC and equals 2023-01-01. AVAL is computed as ADT minus STARTDT in days.",
        "Date arithmetic for AVAL: from January 1 to March 29, count the days by month. January 1 to February 1 = 31 days (31 days in January). February 1 to March 1 = 28 days (2023 is not a leap year). March 1 to March 29 = 28 days. Total: 31 + 28 + 28 = 87 days. CNSR = 0 because discontinuation was observed (not censored). PARAMCD = TTDISC as pre-specified in the SAP."
      ],
      "result": "AVAL = 31 + 28 + 28 = 87 days (Jan 1 to Feb 1 = 31; Feb 1 to Mar 1 = 28; Mar 1 to Mar 29 = 28). CNSR = 0 (event: discontinuation observed after 87-day supply with no refill in grace period). Traceability chain: claims fill_date 2023-01-01 -> SDTM EXSTDTC 2023-01-01 -> ADaM ADTTE STARTDT 2023-01-01 -> AVAL = 87 days. Every link documented in define.xml and the ADRG."
    },
    "prerequisites": [
      "regulatory-readiness-rwe",
      "omop-cdm-method-patterns-rwe",
      "study-protocol-or-sap-elements"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Full SDTM + ADaM pipeline (confirmatory regulatory submission)",
        "description": "Production-grade CDISC pipeline for a confirmatory RWE study submitted to FDA. Requires SDTM domains (DM, EX, AE, CM, LB at minimum), ADaM datasets (ADSL, ADTTE, and domain-specific analysis datasets), define.xml generated from a controlled-metadata repository, and ADRG/SDRG documents that describe every modeling decision. SDTM must conform to the SDTMIG version specified in the Study Data Technical Conformance Guide for the relevant submission year.",
        "edge_cases": [
          "Waivers from specific SDTM domains are possible (request early); the FDA accepts waiver requests that explain why a domain cannot be mapped from the available data source (e.g., the EX domain cannot be populated from inpatient administration records that are not captured in outpatient claims).",
          "Mail-order 90-day fills in pharmacy claims create ambiguity in SDTM EX: one EX record for a 90-day fill or three for 30-day intervals? The protocol must specify, and the choice affects AVAL computation in ADTTE."
        ],
        "data_source_notes": "claims: fill_date → EXSTDTC, with days_supply setting the exposure episode length; NDC → EXTRT via formulary crosswalk; ICD-10 → MedDRA via sponsor-maintained terminology map; enrollment spans → DM/DS via sponsor-defined eligibility rules. ehr: administration date → EXSTDTC (not the prescription order date); free-text narratives must be structured before AE domain coding. linked: reconcile fill_date vs administration_date vs service_date across sources before populating EX domain."
      },
      {
        "name": "SDTM-like staging only (non-submission analytics)",
        "description": "An internal pipeline that restructures RWD into SDTM-like tables — using SDTM column names, domain logic, and controlled terminology — without producing a submission-grade define.xml or reviewer guides. Useful for multi-database RWE studies where harmonized structure improves reproducibility and cross-site comparability, even when no regulatory submission is planned. Not CDISC-conformant but CDISC-inspired.",
        "edge_cases": [
          "If any study using SDTM-like staging later becomes regulatory-facing, the staging layer will need a full compliance audit and define.xml retrofit — this is expensive. Plan for compliance from the start if there is any possibility of regulatory use."
        ],
        "data_source_notes": "Same mapping decisions as a full pipeline, but with reduced documentation burden. The critical investment is establishing the vocabulary crosswalks (NDC → drug name, ICD → MedDRA) at the outset, as these are the same whether or not full define.xml is produced."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "omop-cdm-method-patterns-rwe",
        "pros_of_this": "CDISC is the submission-required standard; OMOP is a research CDM. When the deliverable is an FDA submission, CDISC compliance is mandatory. CDISC's define.xml also provides per-variable derivation documentation that OMOP does not require, making the analysis auditable at a level OMOP was not designed to support.",
        "cons_of_this": "CDISC is submission-specific: a CDISC dataset cannot be reused across sites without conversion, whereas an OMOP dataset runs the same analysis script at every conformant site. For multi-database research analytics, OMOP is far more efficient. Converting OMOP to CDISC for submission is feasible but requires the same mapping decisions as a raw-to-CDISC conversion.",
        "when_to_prefer": "CDISC when the study will be reviewed by a regulatory authority and the dataset is submission-bound. OMOP when the primary goal is multi-database research portability and the submission layer (if any) comes later."
      },
      {
        "compared_to": "regulatory-readiness-rwe",
        "pros_of_this": "CDISC standards implement the transparency and reproducibility gate of regulatory readiness: define.xml and the ADRG are the machine- and human-readable artifacts that let a reviewer verify every claim in the submission package. They are the downstream product of readiness planning.",
        "cons_of_this": "CDISC compliance is necessary but not sufficient for regulatory readiness — a perfectly conformant define.xml can still fail the fit-for-purpose, estimand-traceability, or falsification gates. CDISC is the data format; readiness is the process.",
        "when_to_prefer": "Both together: regulatory readiness governs what decisions are made and pre-specified; CDISC governs how those decisions are documented and packaged for submission."
      },
      {
        "compared_to": "estimand-analysis-traceability-rwe",
        "pros_of_this": "The CDISC standards provide the data structure (define.xml derivation metadata) that makes the estimand-to-analysis traceability matrix machine-verifiable: every ADTTE row with PARAMCD = TTDISC is anchored to a define.xml derivation that a reviewer can read. CDISC gives the traceability matrix its teeth.",
        "cons_of_this": "CDISC alone does not ensure the right estimand was chosen — it only ensures the chosen estimand is documented. An ADTTE dataset with a poorly specified PARAMCD derivation is CDISC-conformant but estimand-inconsistent if the define.xml derivation algorithm does not match the SAP's intercurrent-event strategy.",
        "when_to_prefer": "Use both: the estimand traceability matrix defines the question and derivation rules; CDISC provides the standardized data structure that carries those rules into the submission."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Pharmacy claims are the most tractable RWD source for SDTM EX: fill_date → EXSTDTC, days_supply → episode duration, NDC → EXTRT via formulary crosswalk (NDC-to-drug-name mapping must be versioned and documented in define.xml). Medical claims drive AE and CM domains: ICD-10-CM codes must be MedDRA-coded using a validated, sponsor-documented crosswalk; billing position (primary vs secondary diagnosis) affects whether a code is classified as an AE or a comorbidity. Mail-order 90-day fills need an explicit EX-domain representation rule. Enrollment spans → DS/DM: continuous enrollment must be translated to study-period eligibility using the protocol's enrollment criteria; Medicare Advantage gaps (where FFS claims are absent) require documentation in DS.",
      "ehr": "EHR data require careful disambiguation of order vs administration dates for the EX domain: EXSTDTC should reflect actual administration, not the prescription order date, which overstates treatment exposure. Problem-list diagnoses may persist long after resolution — restrict AE coding to encounter-linked diagnoses with appropriate date windows. Free-text clinical notes are the primary source for adverse event narratives but require NLP or clinical review before MedDRA coding; document the coding method in the ADRG. Lab results populate the LB domain; LOINC → CDISC test codes require a controlled-terminology crosswalk.",
      "registry": "Registry data typically have adjudicated endpoints (strong for AE/primary outcome domains) and disease-severity fields (useful for DM and baseline flags in ADSL), but weak drug exposure capture for EX. Link registry records to pharmacy claims for EX domain population and to a death index for DS domain (study exit). Adjudicated outcome terms usually require a MedDRA coding step if the registry used site-specific terminology.",
      "linked": "Linked sources (claims + EHR + registry + vital records) provide the richest substrate for all SDTM domains but introduce reconciliation challenges: fill_date (claims), administration_date (EHR), and service_date (institutional claims) may differ for the same exposure event. The EX domain derivation algorithm must specify which date takes precedence and how conflicts are resolved, and this rule must appear in define.xml. Date concordance checks between source files are a standard QC step before any SDTM domain is finalized."
    },
    "implementations": [
      {
        "lang": "sas",
        "code": "/* ── Macro parameters: set before running ── */\n%let grace_days = 30;          /* grace period after last covered day              */\n%let study_id   = STUDY-001;   /* submission study identifier prefix               */\n\n/* ── Step 1: Identify first qualifying apixaban fill from pharmacy claims ── */\n/* Source table: work.pharmacy_claims                                            */\n/*   Columns: person_id, fill_date (date9.), ndc, drug_name, dose_mg, days_supply */\nproc sort data=work.pharmacy_claims (where=(study_drug_flag=1))\n          out=work.px_sorted;\n  by person_id fill_date;\nrun;\n\ndata work.first_fill;\n  set work.px_sorted; by person_id;\n  if first.person_id;           /* keep only the first qualifying fill per patient */\nrun;\n\n/* ── Step 2: SDTM EX domain ── */\n/* Modeling decisions documented here must appear in define.xml EX variable metadata. */\ndata work.ex;\n  set work.first_fill;\n  length DOMAIN $2 USUBJID $20 EXTRT $40 EXDOSU $10 EXSTDTC $10;\n  DOMAIN   = 'EX';\n  /* Modeling decision: concatenate study_id + person_id for USUBJID */\n  USUBJID  = catx('-', \"&study_id\", strip(put(person_id, best.)));\n  /* Modeling decision: fill_date = exposure start date (dispensing = initiation) */\n  EXSTDTC  = put(fill_date, is8601da.);   /* ISO 8601: YYYY-MM-DD, required by CDISC */\n  /* Modeling decision: drug_name from formulary crosswalk (not raw NDC) = EXTRT */\n  EXTRT    = upcase(strip(drug_name));\n  EXDOSE   = dose_mg;                     /* dose from NDC product label             */\n  EXDOSU   = 'mg';                        /* CDISC controlled terminology unit code  */\n  keep DOMAIN USUBJID EXSTDTC EXTRT EXDOSE EXDOSU person_id fill_date days_supply;\nrun;\n\nproc print data=work.ex noobs;\n  var USUBJID EXSTDTC EXTRT EXDOSE EXDOSU;\n  title \"SDTM EX domain -- verify EXSTDTC = fill_date in ISO 8601\";\nrun;\n\n/* ── Step 3: ADaM ADTTE -- time to discontinuation ── */\n/* Modeling decisions: PARAMCD name, 
AVAL derivation, CNSR coding, grace period.   */\n/* All documented in define.xml AVAL derivation algorithm and ADRG section 4.      */\ndata work.adtte;\n  set work.ex;\n  length PARAMCD $8 PARAM $40;\n  PARAMCD  = 'TTDISC';\n  PARAM    = 'Time to Discontinuation (days)';\n  STARTDT  = fill_date;                         /* traceability: STARTDT <- EXSTDTC */\n  /* Last covered day = fill_date + days_supply - 1 (fill_date is covered day 1)   */\n  last_cov = fill_date + days_supply - 1;\n  /* Discontinuation date = day after last covered day (no refill in grace window)  */\n  ADT      = last_cov + 1;                      /* for confirmed discontinuation    */\n  /* AVAL = ADT - STARTDT (integer days; equals days_supply for single-fill case)  */\n  AVAL     = ADT - STARTDT;    /* e.g., 2023-03-29 - 2023-01-01 = 87 days          */\n  CNSR     = 0;                /* 0 = event (discontinuation observed)              */\n  format STARTDT ADT date9.;\n  /* days_supply is retained only so the Step 4 QC check can read it;              */\n  /* drop it before exporting the submission ADTTE dataset.                        */\n  keep USUBJID PARAMCD PARAM STARTDT ADT AVAL CNSR days_supply;\nrun;\n\nproc print data=work.adtte noobs;\n  var USUBJID PARAMCD AVAL CNSR STARTDT ADT;\n  title \"ADaM ADTTE -- AVAL = ADT - STARTDT (days); CNSR = 0 for observed event\";\nrun;\n\n/* ── Step 4: QC check -- verify AVAL matches expected value ── */\n/* Valid only for the single-fill case, where AVAL must equal days_supply.          */\ndata _null_;\n  set work.adtte;\n  if AVAL ne days_supply then\n    put 'WARNING: AVAL ' AVAL ' does not equal days_supply for USUBJID=' USUBJID;\n  else\n    put 'QC PASS: AVAL = ' AVAL '(days) for USUBJID=' USUBJID;\nrun;\n/* Expected output: QC PASS: AVAL = 87 (days) for USUBJID=STUDY-001-1001            */\n/* define.xml entry for AVAL: \"ADT minus STARTDT in days; STARTDT sourced from      */\n/*   EX.EXSTDTC; ADT = last covered day (fill_date + days_supply - 1) + 1.\"         */",
        "description": "SDTM EX domain construction from pharmacy claims and ADaM ADTTE derivation in SAS.\nDemonstrates the modeling decisions at each layer: fill_date -> EXSTDTC (via the\nISO 8601 format CDISC requires), EXTRT taken from a drug_name column assumed to have been\npopulated upstream by the NDC formulary crosswalk, and AVAL computed from ADT minus STARTDT.\nThe ADTTE step assigns PARAMCD = TTDISC and CNSR = 0 for an observed discontinuation. The\ngrace_days macro variable records the protocol's grace period but is not exercised, because\nthe example has a single fill and no refill to test. In production, define.xml would be generated\nfrom a controlled metadata repository (e.g., Pinnacle 21 or a sponsor-built define catalog)\nand every DATA step derivation here would have a corresponding metadata entry.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "python",
        "code": "import pandas as pd\n\n# ── Source data: pharmacy claims for study-eligible patients ──\n# In production, this comes from the analytic claims extract.\npharmacy_claims = pd.DataFrame({\n    \"person_id\":       [1001],\n    \"fill_date\":       [pd.Timestamp(\"2023-01-01\")],\n    \"ndc\":             [\"00310-0892-10\"],\n    \"drug_name\":       [\"apixaban\"],       # from formulary crosswalk (modeling decision)\n    \"dose_mg\":         [5],\n    \"days_supply\":     [87],\n    \"study_drug_flag\": [1],\n})\n\nSTUDY_ID   = \"STUDY-001\"\n# Days after last covered day before discontinuation is declared.\n# Recorded for documentation; this single-fill example has no refill to test against it.\nGRACE_DAYS = 30\n\n# ── Step 1: SDTM EX domain ──\n# Modeling decisions encoded here must appear in define.xml variable-level metadata.\npx = pharmacy_claims[pharmacy_claims[\"study_drug_flag\"] == 1].copy()\n\n# Create one row per claim first, so the scalar DOMAIN value broadcasts to every row\n# (assigning a scalar to an empty DataFrame would leave DOMAIN as NaN).\nex = pd.DataFrame(index=range(len(px)))\nex[\"DOMAIN\"]   = \"EX\"\n# Modeling decision: USUBJID = study_id + person_id (pattern documented in define.xml DM domain)\nex[\"USUBJID\"]  = STUDY_ID + \"-\" + px[\"person_id\"].astype(str).values\n# Modeling decision: fill_date is the exposure start date (dispensing = initiation)\nex[\"EXSTDTC\"]  = px[\"fill_date\"].dt.strftime(\"%Y-%m-%d\").values  # ISO 8601 (CDISC required)\n# Modeling decision: generic drug name from crosswalk, upper-cased per SDTM convention\nex[\"EXTRT\"]    = px[\"drug_name\"].str.upper().values\nex[\"EXDOSE\"]   = px[\"dose_mg\"].values          # from NDC product label\nex[\"EXDOSU\"]   = \"mg\"                           # CDISC controlled terminology unit\n# Carry forward for ADaM derivation (not in the SDTM export but needed downstream)\nex[\"_fill_date\"]   = px[\"fill_date\"].values\nex[\"_days_supply\"] = px[\"days_supply\"].values\n\nprint(\"SDTM EX domain:\")\nprint(ex[[\"USUBJID\", \"EXSTDTC\", \"EXTRT\", \"EXDOSE\", \"EXDOSU\"]].to_string(index=False))\n\n# ── Step 2: ADaM ADTTE -- time to discontinuation ──\nadtte = ex.copy()\nadtte[\"PARAMCD\"] = \"TTDISC\"\nadtte[\"PARAM\"]   
= \"Time to Discontinuation (days)\"\nadtte[\"STARTDT\"] = adtte[\"_fill_date\"]          # STARTDT <- EXSTDTC (traceability)\n\n# Last covered day = fill_date + days_supply - 1 (supply is 0-indexed from fill_date)\nlast_covered = adtte[\"_fill_date\"] + pd.to_timedelta(adtte[\"_days_supply\"] - 1, unit=\"D\")\n# Discontinuation date = day after last covered day (no qualifying refill in grace window)\nadtte[\"ADT\"]     = last_covered + pd.Timedelta(days=1)  # 2023-01-01 + 87 days = 2023-03-29\n# AVAL = ADT - STARTDT in days (integer; equals days_supply for a single-fill discontinuation)\nadtte[\"AVAL\"]    = (adtte[\"ADT\"] - adtte[\"STARTDT\"]).dt.days   # 87\nadtte[\"CNSR\"]    = 0   # 0 = event (discontinuation observed); 1 = censored\n\nprint(\"\\nADaM ADTTE:\")\nprint(adtte[[\"USUBJID\", \"PARAMCD\", \"AVAL\", \"CNSR\",\n             \"STARTDT\", \"ADT\"]].to_string(index=False))\n\n# ── Arithmetic verification gate (mirrors the catalog's worked-example check) ──\n# AVAL = ADT - STARTDT = 2023-03-29 - 2023-01-01\n# Month breakdown: Jan 1 to Feb 1 = 31 days, Feb 1 to Mar 1 = 28 days,\n#                  Mar 1 to Mar 29 = 28 days; total = 31 + 28 + 28 = 87 days\nassert adtte[\"AVAL\"].iloc[0] == 87, (\n    f\"AVAL gate failed: expected 87, got {adtte['AVAL'].iloc[0]}\"\n)\nprint(f\"\\nAVAL = {adtte['AVAL'].iloc[0]} days (verified: 31 + 28 + 28 = 87)\")\n# define.xml entry for AVAL: \"Derived as ADT minus STARTDT in days.\n#   ADT = fill_date + days_supply (first day no longer covered under supply window);\n#   STARTDT sourced from EX.EXSTDTC = claims fill_date.\"",
        "description": "SDTM EX domain construction and ADaM ADTTE derivation using pandas. Mirrors the SAS\nimplementation: fill_date -> EXSTDTC in ISO 8601, NDC -> EXTRT via formulary crosswalk,\nAVAL = ADT - STARTDT in days. The assert at the end verifies the arithmetic gate:\nAVAL = 87 for the worked-example patient. In a production pipeline, define.xml would\nbe generated from a metadata catalog (e.g., a YAML or database-backed define builder)\nand each derivation step here would have a corresponding metadata record describing\nsource variable, derivation algorithm, and CDISC controlled terminology reference.",
        "dependencies": [
          "pandas"
        ],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Raw[Raw pharmacy claim<br/>fill_date, NDC, days_supply] -->|Modeling decision:<br/>fill_date = EXSTDTC| EX[SDTM EX domain<br/>EXSTDTC, EXTRT, EXDOSE]\n  EX -->|STARTDT from EXSTDTC<br/>AVAL = ADT - STARTDT| ADTTE[ADaM ADTTE<br/>PARAMCD, AVAL, CNSR, ADT]\n  Raw -->|ICD-10 -> MedDRA<br/>via sponsor crosswalk| AE[SDTM AE domain<br/>AETERM, AEDECOD, AESTDTC]\n  ADTTE --> TLF[Tables Listings Figures]\n  EX --> DefXML[define.xml<br/>variable metadata + derivation algorithms]\n  AE --> DefXML\n  ADTTE --> DefXML\n  DefXML --> ADRG[ADRG / SDRG<br/>Reviewer's Guides]\n  TLF --> Submission[FDA submission package]\n  ADRG --> Submission\n  DefXML --> Submission",
        "caption": "CDISC traceability chain for an RWE submission. Raw RWD flows through modeling decisions into SDTM domains (EX, AE), then into ADaM analysis datasets (ADTTE), with define.xml and reviewer guides documenting every derivation. An FDA reviewer navigates from any TLF number back to the raw source using this chain.",
        "alt_text": "Flowchart from raw pharmacy claims through SDTM EX and AE domains, then to ADaM ADTTE, tables/listings/figures, with define.xml and reviewer guides all feeding the FDA submission package. Modeling decisions are labeled on the transformation arrows.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  subgraph Research[Research layer]\n    OMOP[(OMOP CDM<br/>DRUG_EXPOSURE<br/>CONDITION_OCCURRENCE)]\n    RxNorm[RxNorm drug<br/>concepts]\n    SNOMED[SNOMED condition<br/>concepts]\n  end\n  subgraph Submission[Submission layer]\n    SDTM_EX[SDTM EX domain<br/>CDISC drug dictionary]\n    SDTM_AE[SDTM AE domain<br/>MedDRA coding]\n    DefXML2[define.xml]\n  end\n  OMOP -->|Lossy mapping<br/>RxNorm -> WHO drug dict<br/>SNOMED -> MedDRA| SDTM_EX\n  OMOP --> SDTM_AE\n  RxNorm --> SDTM_EX\n  SNOMED --> SDTM_AE\n  SDTM_EX --> DefXML2\n  SDTM_AE --> DefXML2\n  style Research fill:#e8f4fd\n  style Submission fill:#fdf6e3",
        "caption": "OMOP (research CDM) versus CDISC (submission standard): the two layers serve different purposes. Mapping from OMOP to CDISC is possible but lossy — vocabulary granularity differences and concept-mapping choices must be documented in define.xml just as any other raw-to-CDISC conversion would be.",
        "alt_text": "Two-panel diagram showing OMOP CDM (research layer) on the left with DRUG_EXPOSURE, CONDITION_OCCURRENCE, RxNorm, and SNOMED, and CDISC submission layer on the right with SDTM EX, SDTM AE, and define.xml, with arrows labeled as lossy mapping steps.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "see_also",
        "target_slug": "regulatory-readiness-rwe",
        "notes": "Regulatory readiness is the process layer that governs what decisions are made and pre-specified before CDISC packaging; CDISC standards implement the transparency and reproducibility gate that readiness requires."
      },
      {
        "relation_type": "used_with",
        "target_slug": "estimand-analysis-traceability-rwe",
        "notes": "The CDISC define.xml derivation metadata is the machine-readable implementation of the estimand-to-analysis traceability matrix; every ADTTE PARAMCD derivation in define.xml should correspond to a row in the traceability matrix."
      },
      {
        "relation_type": "see_also",
        "target_slug": "omop-cdm-method-patterns-rwe",
        "notes": "OMOP CDM is the research-side counterpart to CDISC: OMOP optimizes for multi-database portability and network analytics; CDISC optimizes for per-submission auditability. Many RWE pipelines use OMOP for analysis and then convert to CDISC for submission."
      },
      {
        "relation_type": "see_also",
        "target_slug": "fhir-interoperability-rwe",
        "notes": "FHIR (Fast Healthcare Interoperability Resources) is the exchange-side standard for transmitting clinical data between systems; CDISC is the submission-side standard for packaging that data for regulatory review. FHIR-to-CDISC mappings are an active area for EHR-source RWE submissions."
      },
      {
        "relation_type": "see_also",
        "target_slug": "fit-for-purpose-data-assessment-rwe",
        "notes": "Fit-for-purpose assessment determines whether a data source can support the SDTM domain mappings required by the submission — if the EX domain cannot be populated (no observable dispensing dates), the data source fails fit-for-purpose before any CDISC work begins."
      },
      {
        "relation_type": "see_also",
        "target_slug": "study-protocol-or-sap-elements",
        "notes": "The SAP must pre-specify the ADTTE PARAMCD structure, the EX-domain mapping rules, and the MedDRA coding approach; CDISC compliance requires that define.xml derivations match the locked SAP, making the SAP the anchor document for every define.xml entry."
      }
    ],
    "aliases": [
      "SDTM",
      "ADaM",
      "CDISC submission standards",
      "Study Data Tabulation Model",
      "Analysis Data Model",
      "define.xml",
      "ADRG",
      "SDRG",
      "ADTTE",
      "ADSL",
      "CDISC for real-world data",
      "RWE CDISC compliance"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "censoring-mechanisms-rwe",
    "name": "Censoring: Types, Mechanisms, and Informativeness",
    "short_definition": "Censoring is the condition where a subject's exact event time is unknown because observation ends before the event occurs; the fundamental taxonomy — right, left, interval censoring, and left truncation — governs how survival analysis software encodes the (time, event) pair from raw date fields, and whether the non-informative (independent) censoring assumption that underlies Kaplan-Meier and Cox models is defensible in a given real-world data source.",
    "long_description": "**What censoring is — and what it is not**\n\nCensoring is the condition where a subject leaves observation before the study period ends and\ntheir event status beyond that departure is unknown. It is fundamentally different from general\nmissingness: a censored observation carries the partial information that the event had not yet\noccurred up to the censoring time. This knowledge — that the subject was event-free through\ntime t — is what survival estimators exploit. A naive complete-case analysis that discards\ncensored subjects throws away this information and biases results whenever the censored\nsystematically differ from those who stay under observation.\n\nEvery survival analysis encodes censoring through a two-column construct: (time, event). The\n\"time\" column records the days from the index date to the end of observation — whether that\nend was an event or censoring. The \"event\" column is a binary indicator: 1 if the event was\nobserved, 0 if the subject was censored. Constructing that pair from raw date fields in a\nclaims or EHR table is the foundational primitive that every downstream survival analysis\ndepends on — Kaplan-Meier, Cox regression, competing-risks estimators, and IPCW-weighted\nanalyses all require this input.\n\n**A taxonomy of censoring types**\n\n*Right censoring* is overwhelmingly dominant in real-world evidence (RWE). A subject is\nright-censored when observation ends before the event: the study data cut occurs, the patient\ndisenrolls from an insurance plan, or follow-up terminates at a pre-specified study end date.\nThe event time is unknown but bounded below — we know it is greater than the observed\nfollow-up time, meaning the event, if it occurs, happens \"to the right\" on the time axis.\nRight censoring is so prevalent that when analysts say \"censored\" without qualification, they\nalmost always mean right censoring. Right censoring has two important subtypes for RWE\ninterpretation. 
Administrative censoring occurs when the data cut is a fixed calendar date or\nstudy end unrelated to patient health; it is the most plausibly non-informative form.\nInformative censoring occurs when the reason for censoring is causally linked to event risk.\n\n*Left censoring* is the mirror: the event occurred before observation began but the exact\ntime is unknown. Left censoring is rare in standard RWE survival analyses. New-user designs\nwith a carefully specified look-back window are specifically engineered to ensure patients\nare event-free at study entry, avoiding left censoring by construction. True left censoring\nappears in seroprevalence studies, environmental exposure research, or disease registries\nthat capture prevalent cases without anchoring event timing.\n\n*Interval censoring* is common in clinical practice but routinely under-acknowledged in RWE.\nAn event is interval-censored when we know it occurred in the interval (L, R] — after the\nlast clean test at time L and by the first positive test at time R — but cannot pinpoint the\nexact moment. In EHR data, diagnoses confirmed by laboratory testing or scheduled screening\n(HIV seroconversion, cancer detected by imaging, hepatitis C sustained virologic response)\nare genuinely interval-censored: the patient had a normal result at visit L and an abnormal\nresult at visit R, and the event happened somewhere in between. Treating interval-censored\nevent times as exact introduces measurement error whose magnitude depends on the detection\ninterval. 
For outcomes identified by inpatient claims codes the interval is typically narrow\nand the approximation is defensible; for gradual-onset conditions detected by infrequent\nelective monitoring, interval censoring should be modeled using interval-censoring-aware\nmethods or acknowledged as a study limitation.\n\n*Left truncation* (delayed entry) is not censoring at all, but it is so frequently confused\nwith left censoring that it merits a sharp paragraph here. Truncation occurs when subjects\nenter the study only if they survived a pre-observation period — patients who experienced the\nevent or died before study entry are never enrolled and never appear in the data. In a new-\nuser claims cohort requiring 12 months of continuous enrollment before the index date,\npatients who disenrolled or died during those 12 months are structurally absent. Survival\nmodels must condition on this delayed entry by including the left truncation time in the\nrisk-set construction; failing to do so underestimates early event rates. The prevalent-user\ntrap is the canonical left-truncation failure in pharmacoepidemiology: including patients who\nstarted treatment before the study observation window selects for those who survived long\nenough on treatment to still be taking it at study entry — a systematically healthy,\ntreatment-tolerant survivor cohort.\n\n**Why censoring happens in claims and EHR data specifically**\n\nIn commercial insurance claims, the dominant censoring mechanism is disenrollment — the\npatient changes employers, ages into Medicare, loses coverage, or selects a different plan\nduring open enrollment. Some disenrollment is purely administrative and plausibly non-\ninformative (an employer changes insurance carriers; all employees switch regardless of health\nstatus). 
But much disenrollment is health-correlated: job loss driven by disability is\nassociated with worsening chronic disease; income-related Medicaid churn is correlated with\npoverty and its health consequences; switching to Medicare Advantage at age 65 is correlated\nwith the burden of aging-related illness. The same mechanism — disenrollment — can be non-\ninformative or heavily informative depending on why it occurred, and claims data almost never\nrecord the reason.\n\nMedicare Advantage (MA) transitions are a specific, well-documented informative-censoring\ntrap. A deteriorating patient may transition from fee-for-service Medicare (where all Part A\nand Part B claims are observable) to a Medicare Advantage plan (where encounter-level data\nare largely opaque to researchers using CMS claims). This transition is health-correlated —\nsicker patients may seek plans with richer benefits for complex care — causing elevated-risk\npatients to disappear from the observable dataset precisely when their event hazard is\nhighest, biasing the KM curve optimistically.\n\nCompeting events — particularly death when it is not the primary event of interest — create\na structural problem that looks like censoring but demands different handling. Treating death\nas censoring for a non-fatal endpoint (hospital readmission, disease recurrence, medication\nfailure) estimates the cause-specific hazard in a hypothetical world where death cannot occur.\nThe 1 minus KM(t) curve computed under this convention overestimates the cumulative incidence\nof the non-fatal event in the real population, because it ignores that some patients who are\n\"censored\" at death would never have reached the event. When the scientific question is \"what\nfraction of real patients will experience the non-fatal event by time t?\", the competing-risks\ncumulative incidence function (Aalen-Johansen estimator; Fine-Gray model for covariate-adjusted\nestimates) is the correct tool. 
See the companion entry on competing risks for the full\ntreatment.\n\nIn EHR data, the primary censoring mechanism is care-site departure — the patient seeks care\noutside the system, switches to a different health network, or reduces their engagement with\nhealthcare. This departure is highly informative in both directions: referral to tertiary\ncenters selects for complex, severely ill patients; discharge to community care can signal\nimprovement. EHR censoring is among the most difficult to handle because the database rarely\ndistinguishes between patients who are healthy and disengaged versus patients who are severely\nill and seeking care elsewhere.\n\n**The non-informative (independent) censoring assumption**\n\nThe foundational assumption of Kaplan-Meier, the Cox partial likelihood, and most survival\nmethods is that censoring is non-informative (equivalently: independent, or random conditional\non measured covariates). Formally: at any time t, conditional on the covariates included in\nthe model, subjects censored at t have the same future event hazard as subjects who remain\nunder observation with the same covariate history. Equivalently, the censoring time and the\nevent time are conditionally independent given the covariates.\n\nThis assumption is plausible when censoring is purely administrative (the data cut is\ncalendar-driven, the plan year ends, the study enrollment closes). It becomes immediately\nsuspect for health-correlated disenrollment. A patient who loses their job and their insurance\nsimultaneously faces elevated risk of worsening chronic disease management from medication\nlapses, psychosocial stress correlated with cardiovascular risk, and delayed care-seeking that\nallows disease to progress undetected. All three pathways link disenrollment causally to the\nevent outcome, making the censoring informative. 
When informative censoring is present and\nuncorrected, the KM and Cox estimates are biased — typically in an optimistic direction,\nbecause the sicker patients who disenroll are underrepresented in the late-follow-up data,\nmaking the surviving cohort appear healthier than the full at-risk population.\n\n**Interpreting the output**\n\nConsider a Kaplan-Meier estimate reporting a median survival time of 14.0 months (95% CI:\n11-18 months) in a cohort where 40% of subjects were censored before the event.\n\nFormal interpretation: The KM median is the estimated time by which 50% of subjects are\nexpected to have had the event, under the assumption that censoring is non-informative. The\n40% who were censored are treated as representative of those who remained at risk with the\nsame covariate profile. The 95% confidence interval (11-18 months) is wide partly because\nheavy censoring means fewer events drive the estimate; at 40% censoring, the tail of the KM\ncurve after the median is based on a thinned risk set and should be interpreted with caution.\nSensitivity to censoring grows as the censoring fraction rises: beyond 40-50%, the median\nestimate itself becomes unreliable without IPCW correction or a competing-risks framing.\n\nPractical interpretation: \"We estimate that half of patients in this cohort would experience\nthe event by 14 months — but 40% left observation before the event occurred. If sicker or\nmore treatment-intolerant patients disenrolled faster, the 14-month estimate is biased\noptimistic: the patients still in follow-up at 14 months were systematically healthier than\nthose who left, so the true median could be shorter. 
Before reporting this estimate to a\npayer or HTA body, the analysis plan should pre-specify whether disenrollment is expected to\nbe health-correlated and what IPCW-weighted sensitivity analysis will be used to bound the\npotential bias direction and magnitude.\"\n\n**Pros, cons, and trade-offs**\n\nCensoring is a data property, not a method choice. The choices are in how censoring is handled\nand what assumption is made about its nature:\n\nAssuming non-informative censoring (standard KM and Cox): Pro — simple, transparent,\nuniversally familiar, computationally trivial, and valid when censoring is genuinely\nadministrative. Con — silently biased whenever censoring is health-correlated; the assumption\nis unverifiable from the data alone and must be defended from the data-generating process;\nfailure to state it explicitly is a common weakness in RWE manuscripts.\n\nIPCW-adjusted analysis: Pro — corrects for informative censoring when its causes are measured\nin time-varying covariates; pairs naturally with IPTW for doubly-weighted marginal structural\nmodels. Con — requires a correctly specified censoring model; adds modeling complexity;\nunmeasured drivers of censoring leave residual bias; extreme weights inflate variance and\nrequire stabilization and reporting.\n\nCompeting-risks framing (for death as a non-censoring competing event): Pro — correct when\ndeath competes with the primary non-fatal event; avoids overestimating cumulative incidence\nin a real mortality-affected population. Con — changes the estimand from cause-specific\nhazard to subdistribution hazard and cumulative incidence function, requiring competing-risks-\naware modeling and interpretation.\n\nRestriction to administratively censored subjects as a sensitivity analysis: Pro — cleanly\nisolates non-informative censoring; useful as a bounding analysis. 
Con — reduces sample size\nsubstantially and shifts the population to persistent enrollees who may be systematically\nhealthier, itself a form of collider restriction.\n\n**When to use**\n\nThe (time, event) pair construction and the censoring framework described here apply to every\nsurvival analysis built on real-world data, regardless of estimator:\n\nAny time-to-event endpoint — hospitalization, death, disease progression, treatment\ndiscontinuation, readmission — where some subjects do not reach the event during the\nobservation window requires censoring to be represented. Any claims or EHR cohort study where\npatients disenroll, transfer care sites, or have follow-up terminated by a data cut needs a\ncensoring date computed from disenrollment and study-end fields. Building any survival model\ninput — Surv() in R, durations/event_observed in lifelines (Python), or PROC LIFETEST in\nSAS — requires the primitive (time, event) pair produced by this approach. Identifying\ncompeting events (deciding which outcomes are events and which are competing events before\nrouting to cause-specific or subdistribution hazard models) also requires explicit censoring\ndate construction.\n\n**When NOT to use**\n\nThe standard censoring treatment — the assumption of non-informative censoring applied to KM\nor Cox without correction — is actively misleading in three situations:\n\nDo not treat death as censoring for non-fatal outcomes when competing mortality is substantial.\nIf death is a real alternative event that prevents the non-fatal outcome, treating it as\ncensoring estimates the cause-specific hazard (rate among surviving subjects in a hypothetical\nno-death world) rather than the cumulative incidence in the real population. The 1 minus KM\ncurve will overestimate actual population risk. 
Use competing-risks methods when absolute risk\nrather than the etiologic hazard is the policy-relevant quantity.\n\nDo not assume non-informative censoring without explicit justification when disenrollment is\nplausibly health-correlated. If the analysis population includes patients who disenroll because\nof worsening illness, economic instability tied to poor health, or Medicare Advantage\ntransitions driven by disease burden, the naive KM and Cox estimates are biased. At minimum,\ninclude a qualitative discussion of the censoring mechanism and the direction of likely bias;\nideally, report an IPCW-weighted estimate as a primary or sensitivity analysis.\n\nDo not treat genuinely interval-censored endpoints as exact event times when detection\nintervals are wide. For lab-confirmed diagnoses where months may separate the last clean test\nand the first positive result, the diagnosis-date assignment introduces systematic timing\nerror. Interval-censoring-aware models (icenReg in R; PROC LIFEREG in SAS, where the MODEL\nstatement takes the interval as a (lower, upper) response pair) are technically correct and\nshould be at least acknowledged as an alternative when the detection interval exceeds a few\nweeks.",
    "primary_category": "Inferential_Statistics",
    "tags": [
      "statistics",
      "primitive",
      "survival-analysis",
      "censoring",
      "follow-up"
    ],
    "applies_to_study_types": [
      "cohort_retrospective",
      "cohort_prospective",
      "rct",
      "active_comparator_new_user",
      "claims_analysis",
      "ehr_study",
      "registry_study"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "primary",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1038/sj.bjc.6601118",
        "url": "https://doi.org/10.1038/sj.bjc.6601118",
        "citation_text": "Clark TG, Bradburn MJ, Love SB, Altman DG. Survival Analysis Part I: Basic concepts and first analyses. British Journal of Cancer. 2003;89(2):232-238.",
        "year": 2003,
        "authors_short": "Clark et al.",
        "notes": "The canonical introductory reference for survival analysis in clinical research. Explains right censoring, the Kaplan-Meier estimator, the log-rank test, and the non-informative censoring assumption in accessible terms directly applicable to RWE practice."
      },
      {
        "role": "explain",
        "doi": "10.1146/annurev.publhealth.18.1.83",
        "url": "https://doi.org/10.1146/annurev.publhealth.18.1.83",
        "citation_text": "Leung KM, Elashoff RM, Afifi AA. Censoring issues in survival analysis. Annual Review of Public Health. 1997;18:83-104.",
        "year": 1997,
        "authors_short": "Leung et al.",
        "notes": "Comprehensive taxonomy of censoring types in survival analysis — right, left, and interval censoring — with discussion of when each arises in practice. Covers the non-informative censoring assumption and its consequences when violated; essential grounding for any RWE analyst building time-to-event analyses from administrative data."
      },
      {
        "role": "demonstrate",
        "doi": "10.1097/EDE.0b013e3181c1ea43",
        "url": "https://doi.org/10.1097/EDE.0b013e3181c1ea43",
        "citation_text": "Hernán MA. The hazards of hazard ratios. Epidemiology. 2010;21(1):13-15.",
        "year": 2010,
        "authors_short": "Hernán",
        "notes": "Sharp discussion of how censoring assumptions interact with hazard ratio interpretation, including the distinction between cause-specific hazards (death treated as censoring) and subdistribution hazards (competing risks). Useful background for understanding why treating death as censoring is misleading when cumulative incidence is the target quantity."
      }
    ],
    "plain_language_summary": "Censoring — the condition where a patient stops being observed before the study ends, so you do not know whether the event happened afterward — is present in nearly every survival study built on insurance claims or health records. When a patient disenrolls from an insurance plan, transfers to a different health system, or is still event-free when the data are collected, you record how long they were observed and mark them as censored, not as someone the event never happened to. Survival methods like Kaplan-Meier curves and Cox models need each patient's observation time paired with a 1-or-0 event flag; constructing that pair from raw dates in a claims table is the foundational step in every time-to-event analysis. A hidden danger is that if sicker patients tend to leave the database earlier, the patients who remain visible are healthier than the full group, making event-rate estimates look more optimistic than they truly are.",
    "key_terms": [
      {
        "term": "right censoring",
        "definition": "The most common situation in real-world studies: a patient stops being observed before the event occurs, so you know only that the event — if it happened at all — occurred after observation ended."
      },
      {
        "term": "administrative censoring",
        "definition": "An observation endpoint driven by a calendar rule rather than the patient's health — for example, a study data cut on a specific date records all patients still alive at that date as ending follow-up on that day regardless of how sick they were."
      },
      {
        "term": "informative censoring",
        "definition": "An observation endpoint where the reason for leaving the study is connected to how sick the patient is — if the sickest patients tend to drop out earliest, those still being watched are healthier than average and the event estimates become overly optimistic."
      },
      {
        "term": "risk set",
        "definition": "At any given time point in a survival analysis, the group of subjects who are still under observation and have not yet had the event — only they contribute to the estimated event rate at that time."
      },
      {
        "term": "event indicator",
        "definition": "The 1-or-0 number assigned to each patient at the end of their follow-up: 1 if the event was actually observed, and 0 if the patient left observation before the event happened."
      },
      {
        "term": "follow-up window",
        "definition": "The span of time from when a patient enters the study to when they either have the event or stop being observed, measured in days from the starting date to the ending date."
      }
    ],
    "worked_example": {
      "scenario": "Six new users of a diabetes drug are followed from their first prescription (the index date) until a hospitalization for hypoglycemia (the event) or until they leave observation — either by disenrolling from their insurance plan or when the study data are collected on June 30, 2022. The analyst must build the (follow_up_days, event) pair for each patient from the raw date fields before fitting a Kaplan-Meier curve. Three patients have the event; three are censored for different reasons, illustrating administrative censoring, informative early disenrollment, and end-of-study administrative censoring in a single cohort.",
      "dataset": {
        "caption": "Raw date fields as they appear in a claims extract. A tilde (~) means no date recorded for that field. The study_end is June 30, 2022 for all patients. The analyst must compute follow_up_days and event from these five columns.",
        "columns": [
          "person_id",
          "index_date",
          "event_date",
          "disenroll_date",
          "study_end"
        ],
        "rows": [
          [
            "PT001",
            "2022-01-01",
            "2022-03-01",
            null,
            "2022-06-30"
          ],
          [
            "PT002",
            "2022-01-01",
            null,
            "2022-04-30",
            "2022-06-30"
          ],
          [
            "PT003",
            "2022-02-01",
            null,
            null,
            "2022-06-30"
          ],
          [
            "PT004",
            "2022-03-01",
            "2022-05-15",
            null,
            "2022-06-30"
          ],
          [
            "PT005",
            "2022-01-01",
            null,
            "2022-02-15",
            "2022-06-30"
          ],
          [
            "PT006",
            "2022-01-01",
            "2022-06-10",
            null,
            "2022-06-30"
          ]
        ]
      },
      "steps": [
        "Step 1 — Compute the censoring date for each patient: censor_date = min(disenroll_date, study_end), ignoring missing values. PT001 has no disenrollment so censor_date = 2022-06-30. PT002 disenrolled 2022-04-30 which is before 2022-06-30, so censor_date = 2022-04-30. PT003 has neither; censor_date = 2022-06-30. PT004 censor_date = 2022-06-30. PT005 censor_date = 2022-02-15. PT006 censor_date = 2022-06-30.",
        "Step 2 — Determine the event indicator: event = 1 if event_date is not missing AND event_date <= censor_date; otherwise event = 0. PT001: event_date 2022-03-01 <= 2022-06-30, so event = 1. PT002: no event_date, so event = 0 (censored at disenrollment). PT003: no event_date, event = 0 (administrative censoring at study end). PT004: event_date 2022-05-15 <= 2022-06-30, event = 1. PT005: no event_date, event = 0 (early disenrollment). PT006: event_date 2022-06-10 <= 2022-06-30, event = 1.",
        "Step 3 — Compute end_date: if event = 1, end_date = event_date; if event = 0, end_date = censor_date. Then follow_up_days = end_date - index_date (calendar days).",
        "PT001: end_date = 2022-03-01 (event). January contributes 31 days (Jan 1 to Feb 1), February contributes 28 days (Feb 1 to Mar 1): 31+28 = 59 days, event = 1.",
        "PT002: end_date = 2022-04-30 (disenrolled). January: 31 days, February: 28 days, March: 31 days, April 1 to April 30 as timedelta: 29 days: 31+28+31+29 = 119 days, event = 0. This patient is censored because they switched plans — a plausibly informative mechanism.",
        "PT003: end_date = 2022-06-30 (study end). February 1 to March 1: 28 days, March: 31 days, April: 30 days, May: 31 days, June 1 to June 30 as timedelta: 29 days: 28+31+30+31+29 = 149 days, event = 0. This is pure administrative censoring — the least biasing type.",
        "PT004: end_date = 2022-05-15 (event). March 1 to April 1: 31 days, April 1 to May 1: 30 days, May 1 to May 15 as timedelta: 14 days: 31+30+14 = 75 days, event = 1.",
        "PT005: end_date = 2022-02-15 (early disenrollment). January: 31 days, February 1 to February 15 as timedelta: 14 days: 31+14 = 45 days, event = 0. Early disenrollment after only 45 days may reflect poor health — informative censoring is most suspect here.",
        "PT006: end_date = 2022-06-10 (event). January: 31 days, February: 28 days, March: 31 days, April: 30 days, May: 31 days, June 1 to June 10 as timedelta: 9 days: 31+28+31+30+31+9 = 160 days, event = 1.",
        "Summary: the six (follow_up_days, event) pairs are (59,1), (119,0), (149,0), (75,1), (45,0), (160,1). Total person-days of observed follow-up: 59+119+149+75+45+160 = 607. Event fraction: 3/6 = 0.50 (three events observed, three censored)."
      ],
      "result": "Follow-up construction complete. PT001: 31+28 = 59 days, event = 1. PT002: 31+28+31+29 = 119 days, event = 0 (plan disenrollment). PT003: 28+31+30+31+29 = 149 days, event = 0 (study end). PT004: 31+30+14 = 75 days, event = 1. PT005: 31+14 = 45 days, event = 0 (early disenrollment). PT006: 31+28+31+30+31+9 = 160 days, event = 1. Total person-days: 59+119+149+75+45+160 = 607. Event fraction: 3/6 = 0.50. These six rows are the direct input to KaplanMeierFitter (Python), survfit() (R), or PROC LIFETEST (SAS).",
      "timeline_spec": {
        "title": "Follow-up construction for 6 patients: event indicator and observation time from raw claims fields",
        "window": {
          "start": "2022-01-01",
          "end": "2022-06-30",
          "label": "Study window: Jan 1 to Jun 30, 2022"
        },
        "events": [
          {
            "label": "PT001 — event day 59",
            "start": "2022-01-01",
            "length_days": 59,
            "quantity": "event=1"
          },
          {
            "label": "PT002 — censored day 119 (disenrolled)",
            "start": "2022-01-01",
            "length_days": 119,
            "quantity": "event=0"
          },
          {
            "label": "PT003 — censored day 149 (study end)",
            "start": "2022-02-01",
            "length_days": 149,
            "quantity": "event=0"
          },
          {
            "label": "PT004 — event day 75",
            "start": "2022-03-01",
            "length_days": 75,
            "quantity": "event=1"
          },
          {
            "label": "PT005 — censored day 45 (early disenroll)",
            "start": "2022-01-01",
            "length_days": 45,
            "quantity": "event=0"
          },
          {
            "label": "PT006 — event day 160",
            "start": "2022-01-01",
            "length_days": 160,
            "quantity": "event=1"
          }
        ],
        "spans": [
          {
            "kind": "followup",
            "start": "2022-01-01",
            "end": "2022-03-01",
            "label": "PT001: 59 days, event"
          },
          {
            "kind": "followup",
            "start": "2022-01-01",
            "end": "2022-04-30",
            "label": "PT002: 119 days, censored (disenroll)"
          },
          {
            "kind": "followup",
            "start": "2022-02-01",
            "end": "2022-06-30",
            "label": "PT003: 149 days, censored (study end)"
          },
          {
            "kind": "followup",
            "start": "2022-03-01",
            "end": "2022-05-15",
            "label": "PT004: 75 days, event"
          },
          {
            "kind": "followup",
            "start": "2022-01-01",
            "end": "2022-02-15",
            "label": "PT005: 45 days, censored (early disenroll)"
          },
          {
            "kind": "followup",
            "start": "2022-01-01",
            "end": "2022-06-10",
            "label": "PT006: 160 days, event"
          }
        ],
        "result": {
          "label": "3 events, 3 censored; total person-days = 607",
          "value": 607
        }
      }
    },
    "prerequisites": [
      "study-time-windows-anatomy",
      "descriptive-statistics"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [],
    "tradeoffs": [],
    "implementation_notes_by_data_source": {
      "claims": "In commercial claims, the censoring date is min(disenrollment_date, study_end_date). Build continuous enrollment periods from enrollment span files (member_id, enroll_start, enroll_end) and identify the last day of the enrollment period overlapping with follow-up as the disenrollment date. Flag Medicare Advantage transitions separately (typically identified by HMO coverage codes or the absence of Part A and Part B claims in a period with enrollment records) and treat them as a distinct, likely informative, censoring mechanism. Consider running a sensitivity analysis restricting to patients who remained in FFS Medicare throughout to bound the bias from informative MA transition.",
      "ehr": "In EHR data, the censoring date is typically the last encounter date or last active status date. EHR censoring from care-site departure is highly informative: patients who transfer to specialty referral centers or stop engaging with primary care represent opposite ends of the illness spectrum. Build the censoring date from the maximum date of any clinical encounter (visit, lab, prescription) in the follow-up window, then add a grace period (e.g., 6-12 months of encounter silence = censored) to distinguish engaged-and-healthy from truly lost- to-care. Document the operational definition of \"last observed date\" in the SAP.",
      "registry": "Registries typically have cleaner censoring: predefined data cut dates, structured withdrawal forms, and explicit lost-to-follow-up indicators. Verify that the registry protocol distinguishes administrative censoring (data cut, planned end of registry) from withdrawal (patient-initiated or investigator-initiated). Investigator-initiated withdrawal in a clinical registry is often informative (safety, tolerability); include its indicator in any sensitivity analysis of the non-informative censoring assumption."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\nfrom lifelines import KaplanMeierFitter\n\n# ── Raw claims-like fields: as they appear in a cohort extract ──\ndata = pd.DataFrame({\n    \"person_id\":      [\"PT001\",\"PT002\",\"PT003\",\"PT004\",\"PT005\",\"PT006\"],\n    \"index_date\":     pd.to_datetime([\"2022-01-01\",\"2022-01-01\",\"2022-02-01\",\n                                      \"2022-03-01\",\"2022-01-01\",\"2022-01-01\"]),\n    \"event_date\":     pd.to_datetime([\"2022-03-01\", None, None,\n                                      \"2022-05-15\", None, \"2022-06-10\"]),\n    \"disenroll_date\": pd.to_datetime([None, \"2022-04-30\", None,\n                                      None, \"2022-02-15\", None]),\n    \"study_end\":      pd.to_datetime([\"2022-06-30\"] * 6),\n})\n\n# ── PRIMITIVE CONSTRUCTION: (follow_up_days, event) from raw date fields ──\n\n# Step 1: censoring date = earliest of disenrollment or study end (ignore NaT)\ndata[\"censor_date\"] = data[[\"disenroll_date\", \"study_end\"]].min(axis=1)\n\n# Step 2: event occurred if event_date exists AND is on or before censor_date\ndata[\"had_event\"] = (\n    data[\"event_date\"].notna() &\n    (data[\"event_date\"] <= data[\"censor_date\"])\n)\n\n# Step 3: end_date = event_date if event, else censor_date\ndata[\"end_date\"] = data[\"event_date\"].where(data[\"had_event\"], data[\"censor_date\"])\n\n# Step 4: follow-up time in days from index to end\ndata[\"follow_up_days\"] = (data[\"end_date\"] - data[\"index_date\"]).dt.days\n\n# Step 5: binary event indicator (1 = event observed, 0 = censored)\ndata[\"event\"] = data[\"had_event\"].astype(int)\n\nprint(\"Follow-up table:\")\nprint(data[[\"person_id\", \"follow_up_days\", \"event\"]].to_string(index=False))\n# Expected output:\n#  person_id  follow_up_days  event\n#      PT001              59      1\n#      PT002             119      0\n#      PT003             149      0\n#      PT004              75      1\n#      PT005              45      0\n#     
 PT006             160      1\n\n# ── Kaplan-Meier estimate ──\nkmf = KaplanMeierFitter()\nkmf.fit(\n    durations=data[\"follow_up_days\"],\n    event_observed=data[\"event\"],\n    label=\"New-user cohort\"\n)\nprint(f\"\\nMedian survival time: {kmf.median_survival_time_} days\")\n\n# 95% CI for the median, derived from the pointwise CI of the survival function\nfrom lifelines.utils import median_survival_times\nmedian_ci = median_survival_times(kmf.confidence_interval_)\nprint(f\"95% CI: {median_ci.values}\")\n# Note: with only 6 patients the CI will be wide; this illustrates the construction,\n# not a powered survival estimate. In production, decide how same-day ties are handled:\n# change the Step 2 test to event_date < censor_date (strict) if an event dated on the\n# censoring day should not count as observed.",
        "description": "Demonstrates the primitive (follow_up_days, event) construction from raw claims-like date\nfields using pandas, then fits a Kaplan-Meier curve with lifelines. The key operation is\ncomputing censor_date = min(disenroll_date, study_end) and resolving whether the event\noccurred before that date. Uses the six-patient cohort from the worked example.",
        "dependencies": [
          "pandas",
          "lifelines"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(survival)\n\n# ── Raw claims-like fields ──\ndf <- data.frame(\n  person_id      = c(\"PT001\",\"PT002\",\"PT003\",\"PT004\",\"PT005\",\"PT006\"),\n  index_date     = as.Date(c(\"2022-01-01\",\"2022-01-01\",\"2022-02-01\",\n                             \"2022-03-01\",\"2022-01-01\",\"2022-01-01\")),\n  event_date     = as.Date(c(\"2022-03-01\", NA, NA,\n                             \"2022-05-15\", NA, \"2022-06-10\")),\n  disenroll_date = as.Date(c(NA, \"2022-04-30\", NA, NA, \"2022-02-15\", NA)),\n  study_end      = as.Date(rep(\"2022-06-30\", 6)),\n  stringsAsFactors = FALSE\n)\n\n# ── PRIMITIVE CONSTRUCTION: Surv(time, event) from raw date fields ──\n\n# Step 1: censoring date = earliest of disenrollment or study end (na.rm ignores NA)\ndf$censor_date <- pmin(df$disenroll_date, df$study_end, na.rm = TRUE)\n\n# Step 2: event occurred if event_date exists AND is on or before censor_date\ndf$event <- as.integer(!is.na(df$event_date) & df$event_date <= df$censor_date)\n\n# Step 3: end_date = event_date if event, else censor_date\ndf$end_date <- as.Date(\n  ifelse(df$event == 1L, as.character(df$event_date), as.character(df$censor_date))\n)\n\n# Step 4: follow-up time in days\ndf$follow_up_days <- as.numeric(df$end_date - df$index_date)\n\ncat(\"Follow-up table:\\n\")\nprint(df[, c(\"person_id\", \"follow_up_days\", \"event\")])\n# Expected output:\n#   person_id follow_up_days event\n# 1     PT001             59     1\n# 2     PT002            119     0\n# 3     PT003            149     0\n# 4     PT004             75     1\n# 5     PT005             45     0\n# 6     PT006            160     1\n\n# ── Surv() object and Kaplan-Meier ──\n# Surv(time, event): event = 1 means observed; event = 0 means censored (the default coding)\nsurv_obj <- Surv(time = df$follow_up_days, event = df$event)\nkm_fit   <- survfit(surv_obj ~ 1, data = df)\n\nprint(summary(km_fit))\n# The summary shows survival probability at each event time.\n# With n=6 and 3 
events, CIs are wide but the construction is identical to large cohorts.\n\n# Median survival time with pointwise 95% CI (survfit() default: conf.type = \"log\")\nq50 <- quantile(km_fit, 0.5)\ncat(sprintf(\"\\nMedian: %g days  95%% CI: %g-%g\\n\",\n            q50$quantile, q50$lower, q50$upper))",
        "description": "Demonstrates the primitive Surv(time, event) construction from raw claims-like date fields\nin base R, then fits a Kaplan-Meier curve with survfit(). The pmin() call implements\ncensor_date = min(disenroll_date, study_end), and the event indicator is built with an\nexplicit logical test. Uses the six-patient cohort from the worked example.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* ── Raw claims-like data ── */\ndata work.cohort;\n  informat index_date event_date disenroll_date study_end yymmdd10.;\n  format   index_date event_date disenroll_date study_end date9.;\n  input person_id $\n        index_date     : yymmdd10.\n        event_date     : yymmdd10.\n        disenroll_date : yymmdd10.\n        study_end      : yymmdd10.;\n  datalines;\nPT001 2022-01-01 2022-03-01 . 2022-06-30\nPT002 2022-01-01 . 2022-04-30 2022-06-30\nPT003 2022-02-01 . . 2022-06-30\nPT004 2022-03-01 2022-05-15 . 2022-06-30\nPT005 2022-01-01 . 2022-02-15 2022-06-30\nPT006 2022-01-01 2022-06-10 . 2022-06-30\n;\nrun;\n\n/* ── PRIMITIVE CONSTRUCTION: follow_up_days and event indicator ── */\ndata work.survival;\n  set work.cohort;\n\n  /* Step 1: censoring date = min of disenroll and study_end (. = missing in SAS = +Inf) */\n  censor_date = min(disenroll_date, study_end);\n  format censor_date date9.;\n\n  /* Step 2-3: event and end_date */\n  if (event_date ^= .) and (event_date <= censor_date) then do;\n    event         = 1;          /* 1 = event observed */\n    end_date      = event_date;\n  end;\n  else do;\n    event         = 0;          /* 0 = censored        */\n    end_date      = censor_date;\n  end;\n  format end_date date9.;\n\n  /* Step 4: follow-up time in days */\n  follow_up_days = end_date - index_date;\n\n  keep person_id follow_up_days event;\nrun;\n\nproc print data=work.survival; run;\n/* Expected output:\n   PT001  59  1\n   PT002 119  0\n   PT003 149  0\n   PT004  75  1\n   PT005  45  0\n   PT006 160  1                                                         */\n\n/* ── Kaplan-Meier estimate with PROC LIFETEST ──\n   Syntax: TIME <follow_up_var> * <event_var>(<censored_value>)\n   event(0) tells SAS that 0 means censored in the event column.        
*/\nproc lifetest data=work.survival plots=survival(cl) outsurv=work.km_out;\n  time follow_up_days * event(0);\nrun;\n/* Output includes: KM survival function at each event time with pointwise 95% CI\n   (Greenwood variance, CONFTYPE=LOGLOG by default; Hall-Wellner or equal-precision\n   bands via the CONFBAND= option) and quartile estimates including the median with\n   95% CI. No log-rank test is produced without a STRATA statement.     */",
        "description": "Demonstrates the (follow_up_days, event) construction from raw claims-like date fields in\nSAS using a data step, then estimates the Kaplan-Meier curve with PROC LIFETEST. The\nmin() function provides censor_date, and the IF-THEN-ELSE block sets event and end_date.\nIn PROC LIFETEST, time follow_up_days * event(0) means event = 0 is the censored value.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  OBS[\"Observation ends<br/>before confirmed event time\"]\n  OBS --> RC[\"Right censoring<br/><b>dominant in RWE</b><br/>event time &gt; observed follow-up\"]\n  OBS --> LC[\"Left censoring<br/>event before observation began<br/>rare in standard RWE\"]\n  OBS --> IC[\"Interval censoring<br/>event in interval (L, R]<br/>EHR lab-detected outcomes\"]\n  OBS --> LT[\"Left truncation<br/>delayed entry — NOT censoring<br/>new-user design washout\"]\n  RC --> Admin[\"Administrative censoring<br/>data cut, study end<br/>plausibly non-informative\"]\n  RC --> Info[\"Informative censoring<br/>disenrollment, site transfer<br/>health-correlated → bias\"]\n  RC --> CR[\"Competing event<br/>death for non-fatal endpoint<br/>use competing-risks model<br/><i>not</i> treated as censoring\"]",
        "caption": "Taxonomy of censoring and related concepts. Right censoring dominates in claims and EHR studies. Left truncation is not censoring but is frequently confused with left censoring. Death for non-fatal outcomes is a competing event, not censoring.",
        "alt_text": "Flowchart branching from \"observation ends before confirmed event time\" into right censoring (dominant in RWE), left censoring (rare), interval censoring (EHR lab outcomes), and left truncation (delayed entry, not censoring). Right censoring further branches into administrative, informative, and competing event subcategories.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "requires",
        "target_slug": "study-time-windows-anatomy",
        "notes": "The index date, washout period, and study-end date defined by the study time window architecture are the raw inputs to the (follow_up_days, event) construction. Understanding how index dates and observation periods are specified is prerequisite to computing censoring dates correctly."
      },
      {
        "relation_type": "used_with",
        "target_slug": "cox-ph-regression",
        "notes": "The Cox proportional hazards model operates on the same (time, event) pair constructed here. Understanding censoring — and specifically the non-informative censoring assumption that Cox inherits — is essential for interpreting Cox hazard ratio estimates from claims and EHR data."
      },
      {
        "relation_type": "see_also",
        "target_slug": "inverse-probability-of-censoring-weighting-rwe",
        "notes": "IPCW is the primary analytical remedy for informative censoring: it reweights uncensored subjects by the inverse of their estimated probability of remaining uncensored, correcting the bias that arises when the non-informative censoring assumption fails. Read this entry first; then read IPCW for the fix."
      },
      {
        "relation_type": "see_also",
        "target_slug": "attrition-and-loss-to-follow-up-rwe",
        "notes": "Attrition and loss to follow-up covers the broader study-design and epidemiological consequences of differential dropout, including the DAG-based collider structure and the full menu of handling strategies (restriction, IPCW, multiple imputation, competing risks). The current entry covers the data primitive; the attrition entry covers the analytical response."
      },
      {
        "relation_type": "see_also",
        "target_slug": "competing-risks-cause-specific-fine-gray-rwe",
        "notes": "Competing risks is the correct framework when death (or another terminal event) would prevent the primary event of interest from occurring. Treating competing events as censoring estimates a cause-specific hazard in a hypothetical no-death world; when the policy question concerns real-world cumulative incidence, Fine-Gray or Aalen-Johansen estimators must replace 1-KM."
      },
      {
        "relation_type": "see_also",
        "target_slug": "continuous-enrollment-observable-time-rwe",
        "notes": "Continuous enrollment requirements define the observable person-time window in claims studies and directly determine the censoring date when enrollment lapses. The enrollment span operationalization is the upstream step before censoring dates can be computed."
      }
    ],
    "aliases": [
      "right censoring",
      "administrative censoring",
      "informative censoring",
      "loss to follow-up",
      "non-informative censoring",
      "independent censoring",
      "left truncation",
      "delayed entry",
      "interval censoring",
      "event indicator"
    ],
    "complexity": "foundational",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "cer-observational",
    "name": "Observational Comparative Effectiveness Research",
    "short_definition": "A family of non-randomized study designs that estimate the comparative effectiveness or safety of two or more real-world treatment strategies from secondary data (claims, EHR, registry), by emulating the eligibility, time-zero, and treatment-assignment structure of the head-to-head trial that cannot be run.",
    "long_description": "**Observational comparative effectiveness research (CER)** asks which of two (or more) clinically reasonable treatment\nstrategies works better, or is safer, *in routine care* — using data generated by clinical practice rather than by\nrandomization. It is the umbrella under which specific non-randomized designs live: the active-comparator new-user (ACNU)\ncohort, the prevalent-new-user cohort, the nested case-control study, self-controlled designs, and explicit target-trial\nemulation. The defining feature is not a single procedure but a *comparative estimand recovered without randomization*,\nwhich means every design choice exists to reconstruct, from observational data, the three things randomization gives you\nfor free: a well-defined eligible population, a sharp time zero, and exchangeable treatment groups.\n\n**Core conceptual distinction — effectiveness vs efficacy, and comparative vs absolute.** A randomized trial estimates\n*efficacy*: the effect of a treatment under protocol-enforced adherence in a selected, consenting population. Observational\nCER estimates *effectiveness*: the effect of a treatment *strategy* as actually used — imperfect adherence, broad\ncomorbidity, real prescribing patterns, longer horizons. These are different estimands, not noisy versions of the same\nnumber, and the gap (the \"efficacy-effectiveness gap\") is the substantive reason CER exists. The second distinction is\ncomparative vs absolute: observational CER is built to compare *strategy A vs strategy B for the same decision*, because a\ndrug-vs-nothing contrast in secondary data is almost always fatally confounded by indication. Choosing a clinically\ninterchangeable active comparator is what converts an unanswerable absolute question into an answerable comparative one.\n\n**The design spectrum within observational CER (how to pick).**\n- **Active-comparator, new-user (ACNU) cohort** — the default for chronic-therapy head-to-head questions. 
Both arms initiate\n  a drug for the same indication after a drug-free washout; time zero is initiation. Use when a clinically interchangeable\n  comparator and an initiation event both exist.\n- **Prevalent-new-user cohort (Suissa)** — when initiation of the study drug is too rare to support an incident-user\n  cohort, this matches new initiators to time-matched prevalent users of the comparator. Use when ACNU starves for power.\n- **Nested case-control** — when outcomes are rare and full-cohort covariate construction (e.g., expensive chart abstraction\n  or biomarker assays) is infeasible; sample controls by risk-set sampling to recover the rate ratio efficiently.\n- **Self-controlled designs (SCCS, case-crossover)** — when *time-invariant* confounding dominates (genetics, chronic\n  severity) and the exposure is transient with an acute outcome. Each person is their own control, eliminating between-person\n  confounding entirely — but they cannot handle time-varying confounders or estimate a between-drug contrast.\n- **Target-trial emulation** — the organizing framework, not a separate data structure: write the protocol of the trial you\n  wish you could run (eligibility, treatment strategies, assignment, time zero, outcome, estimand, analysis), then map each\n  element to the data. 
Use it always as the discipline; reach for clone-censor-weight when the strategies are *sustained or\n  dynamic* and grace periods create eligibility-time ambiguity.\nThe single most common, most expensive error in observational CER is not picking the wrong model — it is misaligning time\nzero (the eligibility, exposure, and follow-up start), which silently manufactures immortal-time and selection bias before\nany estimator runs.\n\n**Pros, cons, and trade-offs.**\n- **vs the randomized controlled trial:** Observational CER answers questions an RCT cannot or will not — head-to-head\n  comparisons no sponsor will fund, broad/elderly/multimorbid populations excluded from trials, long-horizon and rare\n  safety outcomes, and post-launch effectiveness. Cost: it carries unmeasured confounding that randomization removes by\n  design; positivity and exchangeability are assumptions you must defend, not guarantees. **Prefer the RCT** when it is\n  feasible and ethical and the question is efficacy; **prefer observational CER** when an RCT is infeasible, too slow, too\n  narrow, or already answered efficacy and the live question is real-world effectiveness.\n- **vs the pragmatic / registry-based RCT:** A pragmatic trial keeps randomization (so confounding is handled) while\n  relaxing protocol toward routine care. Cost: it is slow, costly, and still selects consenting sites/patients. **Prefer\n  a pragmatic trial** when randomization is achievable and equipoise exists; **prefer observational CER** when randomization\n  is impossible, unethical, or far too slow for the decision timeline.\n- **vs single-arm study with external/historical control:** Observational CER compares two *concurrent* arms from the same\n  source, sharing secular trends and measurement. Cost: it needs a real comparator population, which may not exist for an\n  ultra-rare disease or a first-in-class agent. 
**Prefer concurrent comparative CER** whenever a contemporaneous comparator\n  exists; fall back to an external control only when it genuinely does not.\n\n**When to use.** A defined comparative decision between treatment strategies used for the same indication; an RCT that is\ninfeasible/unethical/too slow/too narrow; a need for routine-care effectiveness, long-horizon or rare safety outcomes, or\nevidence for a payer/HTA dossier where head-to-head trial data are absent. The non-negotiable prerequisites are a clinically\ndefensible comparator, a fit-for-purpose data source that captures exposure and outcome with acceptable validity, and a\npre-specified protocol with an explicit time zero and estimand.\n\n**When NOT to use — and when it is actively misleading or dangerous.**\n- **No clinically interchangeable comparator exists.** Forcing a comparator prescribed to systematically different patients\n  (second-line vs first-line; sicker vs healthier) re-introduces confounding by indication and channeling — the exact bias\n  CER is meant to remove. The result can be precisely wrong and worse than no evidence.\n- **The real question is drug vs no treatment** (uptake, deprescribing, the effect of *being treated at all*). Active-comparator\n  CER structurally cannot answer it, and a non-user comparator in claims is almost always confounded by indication and\n  healthy-user bias.\n- **Exposure or outcome is poorly captured in the data.** If the outcome algorithm has low PPV/sensitivity, or exposure is\n  invisible (samples, inpatient-administered drugs absent from pharmacy claims, OTC use), differential misclassification can\n  create or hide an effect. 
Validate the algorithm before trusting the comparison, not after.\n- **Severe positivity violation / non-overlap.** If one strategy is reserved for sicker or renally-impaired patients,\n  propensity distributions separate; matching discards most of the cohort and the surviving estimand no longer maps to any\n  decision-relevant population.\n- **Strong time-varying confounding affected by prior treatment** (e.g., labs that drive both subsequent dosing and the\n  outcome). Standard PS/regression adjustment is biased here; this requires g-methods (MSM/IPTW, g-estimation), and if the\n  team cannot implement them, the naive comparative estimate is misleading.\n\n**Data-source operational depth.**\n- **Claims (FFS or commercial):** Exposure is the pharmacy claim (NDC + `fill_date` + `days_supply`); diagnosis/procedure\n  codes define indication, covariates, and many outcomes. Strengths: near-complete capture of dispensed drugs and billed\n  encounters within a covered, continuously enrolled population. Failure modes: (1) *Medicare Advantage / capitated\n  person-time lacks fee-for-service claims* — \"no prior diagnosis\" or \"no prior fill\" can be missingness, not truth; restrict\n  to enrollees with the relevant benefit (A/B/D or commercial medical+pharmacy) and exclude MA-only person-time. (2)\n  *Differential competing risks by exposure in elderly claims* — if one drug is preferentially given to frailer patients,\n  death competes with the outcome differently across arms; use cause-specific or Fine-Gray models rather than naive\n  Kaplan-Meier. (3) Inpatient-administered drugs are invisible to pharmacy claims; sample fills, 90-day mail order, and\n  stockpiling distort `days_supply`-based episodes. 
Workaround: continuous-enrollment requirements across the full lookback,\n  explicit episode/grace-period rules, and validated outcome algorithms.\n- **EHR:** Initiation is the *order or administration*, not the dispensing — link to pharmacy fills to confirm the patient\n  actually started. Problem lists, labs, vitals, and notes sharpen indication and baseline severity (a real advantage over\n  claims), but visit-driven capture means a patient who seeks care outside the system is differentially lost; define\n  observation windows explicitly and treat loss to follow-up as potentially informative. Care fragmentation is the\n  signature EHR failure mode in CER.\n- **Registry:** Strongest for indication, disease severity/stage, and adjudicated outcomes; typically weak for complete\n  pharmacy exposure and for non-registry comorbidity. Link to claims for the full fill history and to a death index to firm\n  up censoring. The signature failure mode is enrollment selection and incomplete capture of out-of-registry events.\n- **Linked claims-EHR-vital records:** The ideal substrate — EHR severity + claims completeness + reliable mortality — but\n  linkage introduces selection (only the linkable subset, which differs from the source population) and order/fill/service\n  date discrepancies that must be reconciled before time-zero assignment. The failure mode is treating the linked subset as\n  representative.\n\n**Worked claims example (a CER *design choice*, not just a cohort build).** Question: among adults with type 2 diabetes,\ndoes an SGLT2 inhibitor reduce heart-failure hospitalization compared with a DPP-4 inhibitor? In a commercial + Medicare\nFFS claims database, walk the decision: (1) *Is there an interchangeable active comparator?* Yes — both are added at a\nsimilar point in the treatment pathway, so an ACNU cohort is defensible (a non-user comparator would be confounded by\nindication and is rejected). 
(2) *Is initiation common enough for an incident-user cohort?* Yes — so ACNU, not\nprevalent-new-user. (3) *Eligibility:* age ≥18, ≥2 diabetes diagnoses, and 365 days of continuous A/B/D (or commercial\nmedical+pharmacy) enrollment before the first study fill, excluding MA-only person-time so absence of prior fills is\nobserved rather than missing. (4) *Washout / time zero:* no fill of *either* class in the 365-day lookback makes both arms\nincident users; time zero is the first qualifying fill, and the arm is read from the NDC dispensed that day. (5) *Outcome:*\na validated HF-hospitalization algorithm (inpatient claim with HF in the primary position) — its PPV is checked, not\nassumed. (6) *Confounding:* baseline covariates measured only in [time zero − 365, time zero] feed a high-dimensional\npropensity score; balance is confirmed with standardized differences <0.1. (7) *Competing risk:* because the cohort skews\nelderly, death is modeled as a competing event (cause-specific hazard for etiology, Fine-Gray for absolute risk). (8)\n*Sensitivity:* negative-control outcomes and an E-value quantify residual confounding; washout and grace-period lengths are\nvaried. The deliverable is a comparative hazard/risk contrast with a defensible target-trial mapping — the design decisions,\nnot the regression, are where the evidence is won or lost.",
    "primary_category": "Study_Design",
    "tags": [
      "comparative-effectiveness",
      "observational",
      "pharmacoepidemiology",
      "head-to-head",
      "confounding-by-indication",
      "target-trial",
      "effectiveness",
      "study-design"
    ],
    "applies_to_study_types": [
      "cer_observational"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1016/j.jclinepi.2004.10.012",
        "url": "https://doi.org/10.1016/j.jclinepi.2004.10.012",
        "citation_text": "Schneeweiss S, Avorn J. A review of uses of health care utilization databases for epidemiologic research on therapeutics. Journal of Clinical Epidemiology. 2005;58(4):323-337.",
        "year": 2005,
        "authors_short": "Schneeweiss & Avorn",
        "notes": "Foundational account of how administrative/claims databases are used for comparative therapeutic epidemiology, and the design and confounding pitfalls that observational CER must control."
      },
      {
        "role": "introduce",
        "doi": "10.1093/aje/kwv254",
        "url": "https://doi.org/10.1093/aje/kwv254",
        "citation_text": "Hernán MA, Robins JM. Using big data to emulate a target trial when a randomized trial is not available. American Journal of Epidemiology. 2016;183(8):758-764.",
        "year": 2016,
        "authors_short": "Hernán & Robins",
        "notes": "The organizing framework for modern observational CER — specify the protocol of the trial you would run, then emulate each component; makes time zero, eligibility, and the estimand explicit."
      },
      {
        "role": "explain",
        "doi": "10.1016/j.jval.2017.08.3019",
        "url": "https://doi.org/10.1016/j.jval.2017.08.3019",
        "citation_text": "Berger ML, Sox H, Willke RJ, et al. Good practices for real-world data studies of treatment and/or comparative effectiveness: recommendations from the joint ISPOR-ISPE Special Task Force on Real-World Evidence in Health Care Decision Making. Value in Health. 2017;20(8):1003-1008.",
        "year": 2017,
        "authors_short": "Berger et al.",
        "notes": "Consensus good-practice recommendations specific to real-world comparative-effectiveness studies — protocol pre-specification, transparency, replication, and decision-grade conduct."
      },
      {
        "role": "explain",
        "doi": "10.1056/NEJMsb1609216",
        "url": "https://doi.org/10.1056/NEJMsb1609216",
        "citation_text": "Sherman RE, Anderson SA, Dal Pan GJ, et al. Real-world evidence — what is it and what can it tell us? New England Journal of Medicine. 2016;375(23):2293-2297.",
        "year": 2016,
        "authors_short": "Sherman et al.",
        "notes": "Defines real-world data/evidence and frames where non-randomized comparative effectiveness can and cannot support regulatory and coverage decisions."
      },
      {
        "role": "demonstrate",
        "doi": "10.1097/EDE.0b013e3181a663cc",
        "url": "https://doi.org/10.1097/EDE.0b013e3181a663cc",
        "citation_text": "Schneeweiss S, Rassen JA, Glynn RJ, et al. High-dimensional propensity score adjustment in studies of treatment effects using health care claims data. Epidemiology. 2009;20(4):512-522.",
        "year": 2009,
        "authors_short": "Schneeweiss et al.",
        "notes": "Demonstrates the standard claims-based confounding-control engine for observational CER — empirically selected high-dimensional proxy covariates measured in the baseline window."
      },
      {
        "role": "use",
        "doi": "10.18553/jmcp.2016.22.10.1107",
        "url": "https://doi.org/10.18553/jmcp.2016.22.10.1107",
        "citation_text": "Dreyer NA, Bryant A, Velentgas P. The GRACE checklist: a validated assessment tool for high quality observational studies of comparative effectiveness. Journal of Managed Care & Specialty Pharmacy. 2016;22(10):1107-1113.",
        "year": 2016,
        "authors_short": "Dreyer et al.",
        "notes": "Validated checklist routinely used to rate and report the quality of observational CER studies for decision makers."
      }
    ],
    "plain_language_summary": "Observational comparative effectiveness research asks a head-to-head question — does drug A or drug B work better (or is safer) for the same condition — using data that routine care already generated, like insurance claims or medical records, because the clean trial that would randomly assign patients was never run. The whole job is to compare two real treatments fairly when nobody flipped a coin to decide who got which. The hardest part is that doctors choose treatments on purpose: sicker or different patients often get one drug over the other, so a raw comparison can credit the drug for differences that were really about who took it. That core problem — the treated and comparison groups not being alike to begin with — is why these studies live or die on choosing a fair comparison drug and adjusting for measured differences between the groups.",
    "key_terms": [
      {
        "term": "confounding by indication",
        "definition": "When the reason a patient was given a particular drug is also tied to their outcome, so the two treatment groups differ in ways that make a raw comparison unfair."
      },
      {
        "term": "active comparator",
        "definition": "A second real drug used for the same condition that you compare against, instead of comparing the drug to no treatment at all."
      },
      {
        "term": "risk difference",
        "definition": "One group's event rate minus the other's — how many more (or fewer) events per patient one drug is associated with."
      },
      {
        "term": "risk ratio",
        "definition": "One group's event rate divided by the other's — a value below 1 means the first drug had proportionally fewer events."
      },
      {
        "term": "claims data",
        "definition": "Billing records insurers keep on prescriptions filled and medical services delivered, reused here as a ready-made (if imperfect) view of who took what and what happened next."
      }
    ],
    "worked_example": {
      "scenario": "Among adults with type 2 diabetes, we want to know whether starting drug A (an SGLT2 inhibitor) is associated with fewer heart-failure hospitalizations than starting drug B (a DPP-4 inhibitor) over one year of follow-up. We pull two groups of new starters from a claims database, count how many in each group were hospitalized for heart failure, and compare the two event rates head to head.",
      "dataset": {
        "caption": "One summary row per treatment arm, built from a claims cohort: how many patients started each drug and how many had a heart-failure hospitalization within the year.",
        "columns": [
          "arm",
          "n",
          "events",
          "risk"
        ],
        "rows": [
          [
            "drug A (SGLT2 inhibitor)",
            500,
            30,
            0.06
          ],
          [
            "drug B (DPP-4 inhibitor)",
            400,
            40,
            0.1
          ]
        ]
      },
      "steps": [
        "Risk in each arm = events / n. Drug A: 30 / 500 = 0.060 (6.0%). Drug B: 40 / 400 = 0.100 (10.0%).",
        "Risk difference = risk(A) − risk(B) = 0.060 − 0.100 = −0.040, i.e., 4.0 fewer hospitalizations per 100 patients on drug A.",
        "Risk ratio = risk(A) / risk(B) = 0.060 / 0.100 = 0.60, so drug A's risk is 60% of drug B's — a 40% relative reduction.",
        "Pause before believing it: this is only fair if the two groups were alike at the start. If sicker patients were steered toward one drug (confounding by indication), part of this 0.60 could be about the patients, not the drug — which is why a fair active comparator and adjustment for measured baseline differences come before trusting any number."
      ],
      "result": "Drug A had a heart-failure-hospitalization risk of 30/500 = 0.060 versus 40/400 = 0.100 for drug B. Risk difference = 0.060 − 0.100 = −0.040 (4.0 fewer events per 100 patients); risk ratio = 0.060 / 0.100 = 0.60 (a 40% relative reduction) — an estimate that is only trustworthy if the two arms were comparable at baseline."
    },
    "prerequisites": [
      "new-user-design",
      "active-comparator-new-user",
      "propensity-score-methods-psm-iptw"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Active-comparator new-user (ACNU) cohort",
        "description": "Both arms initiate a drug for the same indication after a drug-free washout; time zero is initiation. The default observational CER design for chronic-therapy head-to-head questions.",
        "edge_cases": [
          "Requires a clinically interchangeable comparator prescribed to the same patients at the same decision point; residual channeling within a class must be caught by balance diagnostics.",
          "Loses power when the comparator is uncommon; consider a prevalent-new-user extension."
        ],
        "data_source_notes": "claims: exposure = NDC + fill_date + days_supply, indication confirmed by baseline diagnosis codes; require continuous enrollment across the full washout and exclude MA-only person-time."
      },
      {
        "name": "Prevalent-new-user cohort (Suissa)",
        "description": "When study-drug initiation is too rare for an incident-user cohort, match new initiators to time-matched prevalent users of the comparator, preserving a defensible time-zero contrast.",
        "edge_cases": [
          "Time-conditional matching is required so prevalent comparators are compared at an equivalent point in their treatment trajectory; mishandling re-introduces immortal-time and depletion-of-susceptibles bias."
        ],
        "data_source_notes": "claims: build exposure-time sets from dispensing history; verify continuous enrollment across the matched index times."
      },
      {
        "name": "Nested case-control within the source cohort",
        "description": "For rare outcomes or costly covariate ascertainment, sample controls by risk-set sampling from the same cohort to recover the rate ratio without building covariates on everyone.",
        "edge_cases": [
          "Incidence-density (risk-set) sampling is mandatory so the odds ratio estimates the rate ratio without rare-disease assumptions; calendar/age time scales must match cases and controls."
        ],
        "data_source_notes": "claims/EHR: efficient when outcome requires chart adjudication or linked lab/biomarker data that cannot be assembled for the full cohort."
      },
      {
        "name": "Target-trial emulation (explicit protocol mapping)",
        "description": "Specify the hypothetical trial protocol (eligibility, strategies, assignment, time zero, outcome, estimand, analysis) and map each element to the data; use clone-censor-weight for sustained or dynamic strategies.",
        "edge_cases": [
          "Grace periods for treatment initiation create eligibility-time ambiguity; cloning at time zero with censoring and inverse-probability weighting resolves it but adds modeling burden."
        ],
        "data_source_notes": "Applicable across claims/EHR/registry/linked; the discipline (not a new data structure) that prevents immortal-time and selection bias from creeping into any of the above designs."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Randomized controlled trial",
        "pros_of_this": "Answers head-to-head, broad-population, long-horizon, and rare-safety questions that are infeasible, unethical, too slow, or too narrow to randomize; estimates routine-care effectiveness rather than protocol efficacy.",
        "cons_of_this": "Carries unmeasured confounding that randomization removes by design; exchangeability and positivity are assumptions to defend, not guarantees.",
        "when_to_prefer": "When an RCT is infeasible/unethical/too slow/too narrow, or efficacy is already established and the live question is real-world comparative effectiveness."
      },
      {
        "compared_to": "Pragmatic / registry-based randomized trial",
        "pros_of_this": "Far faster and cheaper, uses existing data, no consent/site selection bottleneck, and captures the full treated population.",
        "cons_of_this": "Loses randomization, so confounding by indication must be controlled analytically and can never be fully ruled out.",
        "when_to_prefer": "When randomization is impossible, unethical, or too slow for the decision timeline."
      },
      {
        "compared_to": "Single-arm study with external/historical control",
        "pros_of_this": "Uses concurrent comparator arms from the same source that share secular trends and measurement, sharply reducing time-trend and ascertainment bias.",
        "cons_of_this": "Requires a real contemporaneous comparator population, which may not exist for ultra-rare disease or first-in-class agents.",
        "when_to_prefer": "Whenever a contemporaneous, clinically relevant comparator exists in the data; fall back to external controls only when it genuinely does not."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Exposure = pharmacy claim (NDC + fill_date + days_supply); indication and covariates from diagnosis/procedure codes. Require continuous medical + pharmacy enrollment across the full lookback so absence of prior diagnoses/fills is observed, not missing; exclude Medicare Advantage-only person-time where fee-for-service claims are unavailable. Model death as a competing risk in elderly cohorts. Validate outcome algorithms (PPV/sensitivity) before trusting the contrast.",
      "ehr": "Initiation = order or administration; prefer linked dispensing to confirm the patient started. Problem lists, labs, and notes sharpen indication/severity but care fragmentation causes differential loss to follow-up — define observation windows explicitly and treat loss as potentially informative.",
      "registry": "Strong for indication, severity/stage, and adjudicated outcomes; weak for complete pharmacy exposure and out-of-registry events. Link to claims for fills and to a death index for censoring and mortality outcomes.",
      "linked": "Linked claims-EHR-vital-records is the ideal substrate (severity + completeness + mortality) but introduces linkage-selection bias (the linkable subset differs from the source) and order/fill/service date discrepancies that must be reconciled before time-zero assignment."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\n\nWASHOUT_DAYS = 365  # drug-free + continuous, FFS-observable lookback that defines an incident user\n\ndef build_cer_cohort(rx: pd.DataFrame, enroll: pd.DataFrame) -> pd.DataFrame:\n    rx = rx.sort_values([\"person_id\", \"fill_date\"])\n    strat = rx[rx[\"drug_class\"].isin([\"STUDY\", \"COMPARATOR\"])]\n\n    # Candidate time zero = first fill of EITHER strategy; arm = the class dispensed that day.\n    idx = (strat.groupby(\"person_id\").first().reset_index()\n                .rename(columns={\"fill_date\": \"index_date\", \"drug_class\": \"arm\"}))\n\n    # New-user restriction: no fill of either strategy in the washout window before index.\n    prior = strat.merge(idx[[\"person_id\", \"index_date\"]], on=\"person_id\")\n    had_prior = prior[(prior[\"fill_date\"] < prior[\"index_date\"]) &\n                      (prior[\"fill_date\"] >= prior[\"index_date\"] - pd.Timedelta(days=WASHOUT_DAYS))]\n    idx = idx[~idx[\"person_id\"].isin(had_prior[\"person_id\"])].copy()\n\n    # Continuous, FFS-observable enrollment spanning the full washout through index\n    # (so \"no prior fill/dx\" is truly observed, not MA/capitated missingness).\n    e = enroll.merge(idx[[\"person_id\", \"index_date\"]], on=\"person_id\")\n    e[\"covers\"] = ((e[\"enroll_start\"] <= e[\"index_date\"] - pd.Timedelta(days=WASHOUT_DAYS)) &\n                   (e[\"enroll_end\"]   >= e[\"index_date\"]) & e[\"ffs_observable\"])\n    eligible = e.loc[e[\"covers\"], \"person_id\"].unique()\n\n    cohort = idx[idx[\"person_id\"].isin(eligible)].copy()\n    cohort[\"baseline_start\"] = cohort[\"index_date\"] - pd.Timedelta(days=WASHOUT_DAYS)  # covariate window\n    return cohort[[\"person_id\", \"arm\", \"index_date\", \"baseline_start\"]]",
        "description": "Two-arm observational-CER cohort construction (ACNU template) from claims-style inputs. This is cohort *construction*,\nnot outcome estimation — covariate building and the comparative model run downstream on the returned cohort. Required\ninputs (already cleaned and de-duplicated):\n  rx     : pharmacy fills  -> person_id, fill_date (datetime), drug_class in {'STUDY','COMPARATOR'}, days_supply (int)\n  enroll : enrollment spans -> person_id, enroll_start, enroll_end (datetime), ffs_observable (bool)  # False == MA/cap. person-time without FFS claims\nReturns one row per eligible new initiator with arm and time zero. Build covariates and the propensity score ONLY from\nthe returned [baseline_start, index_date] window, and apply identical outcome/censoring rules to both arms.",
        "dependencies": [
          "pandas"
        ],
        "source_citations": [
          "schneeweiss-2009"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\nWASHOUT_DAYS <- 365L\n\nbuild_cer_cohort <- function(rx, enroll) {\n  setDT(rx); setDT(enroll)\n  setorder(rx, person_id, fill_date)\n\n  strat <- rx[drug_class %chin% c(\"STUDY\", \"COMPARATOR\")]\n  idx <- strat[, .(index_date = fill_date[1L], arm = drug_class[1L]), by = person_id]\n\n  # New-user: drop anyone with a study/comparator fill in the washout window before index.\n  strat <- merge(strat, idx[, .(person_id, index_date)], by = \"person_id\")\n  prior_ids <- unique(strat[fill_date < index_date &\n                            fill_date >= index_date - WASHOUT_DAYS, person_id])\n  idx <- idx[!person_id %chin% prior_ids]\n\n  # Continuous, FFS-observable enrollment across the full washout through index.\n  e <- merge(enroll, idx[, .(person_id, index_date)], by = \"person_id\")\n  ok <- e[enroll_start <= index_date - WASHOUT_DAYS &\n          enroll_end   >= index_date & ffs_observable, unique(person_id)]\n\n  cohort <- idx[person_id %chin% ok]\n  cohort[, baseline_start := index_date - WASHOUT_DAYS]\n  cohort[, .(person_id, arm, index_date, baseline_start)]\n}",
        "description": "Two-arm observational-CER cohort construction (ACNU template) with data.table. Cohort construction only. Inputs mirror\nthe Python version:\n  rx     : person_id, fill_date (Date), drug_class in {'STUDY','COMPARATOR'}, days_supply (integer)\n  enroll : person_id, enroll_start, enroll_end (Date), ffs_observable (logical)  # FALSE == MA/capitated, no FFS claims",
        "dependencies": [
          "data.table"
        ],
        "source_citations": [
          "schneeweiss-2009"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "%let washout = 365;\n\n/* First fill of either strategy = candidate time zero, with the arm dispensed that day. */\nproc sort data=work.rx(where=(drug_class in ('STUDY','COMPARATOR'))) out=rx_q;\n  by person_id fill_date;\nrun;\n\ndata idx;\n  set rx_q;\n  by person_id;\n  if first.person_id;\n  index_date = fill_date;\n  format index_date date9.;\n  length arm $12;\n  arm = drug_class;\n  keep person_id index_date arm;\nrun;\n\n/* New-user restriction: exclude any prior study/comparator fill inside the washout window. */\nproc sql;\n  create table newuser as\n  select i.*\n  from idx i\n  where not exists (\n    select 1 from work.rx p\n    where p.person_id = i.person_id\n      and p.drug_class in ('STUDY','COMPARATOR')\n      and p.fill_date <  i.index_date\n      and p.fill_date >= i.index_date - &washout\n  );\nquit;\n\n/* Continuous, FFS-observable enrollment across the full washout through index (no MA/capitated gaps). */\nproc sql;\n  create table cohort as\n  select n.person_id, n.arm, n.index_date,\n         n.index_date - &washout as baseline_start format=date9.\n  from newuser n\n  where exists (\n    select 1 from work.enroll e\n    where e.person_id = n.person_id\n      and e.ffs_observable = 1\n      and e.enroll_start <= n.index_date - &washout\n      and e.enroll_end   >= n.index_date\n  );\nquit;",
        "description": "Two-arm observational-CER cohort construction (ACNU template) in SAS via PROC SQL — cohort construction, not estimation.\nRequired input datasets (post data-management):\n  work.rx     : person_id, fill_date, drug_class ('STUDY'/'COMPARATOR'), days_supply\n  work.enroll : person_id, enroll_start, enroll_end, ffs_observable (1/0)   /* 0 == MA/capitated, no FFS claims */\nOutput work.cohort has one row per eligible new initiator (person_id, arm, index_date, baseline_start). Build covariates\nand the propensity score from [baseline_start, index_date] only; apply identical outcome/censoring rules to both arms.",
        "dependencies": [],
        "source_citations": [
          "schneeweiss-2009"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Q[Comparative decision:<br/>strategy A vs strategy B] --> RCT{RCT feasible,<br/>ethical, timely?}\n  RCT -- Yes --> DoRCT[Run RCT / pragmatic trial<br/>randomization handles confounding]\n  RCT -- No --> COMP{Clinically interchangeable<br/>active comparator exists?}\n  COMP -- No, real question is<br/>drug vs no treatment --> STOP[Observational CER not valid here<br/>confounding by indication]\n  COMP -- No comparator pop.<br/>ultra-rare / first-in-class --> EXT[Single-arm + external control]\n  COMP -- Yes --> INIT{Initiation common enough<br/>for incident users?}\n  INIT -- No --> PNU[Prevalent-new-user cohort]\n  INIT -- Yes --> RARE{Outcome rare or<br/>covariates costly?}\n  RARE -- Yes --> NCC[Nested case-control]\n  RARE -- No --> TV{Time-varying confounding<br/>affected by prior treatment?}\n  TV -- Yes --> GM[Target-trial emulation +<br/>g-methods / clone-censor-weight]\n  TV -- No --> ACNU[Active-comparator new-user cohort<br/>+ propensity score]",
        "caption": "Decision logic for choosing an observational CER design. The first branch is whether an RCT should be run at all; the second is whether the comparative question is even answerable without randomization (an interchangeable active comparator must exist); the rest selects the specific design from the comparative cohort family.",
        "alt_text": "Decision tree starting from a comparative treatment question, branching on RCT feasibility, existence of an interchangeable comparator, initiation frequency, outcome rarity, and time-varying confounding, leading to ACNU, prevalent-new-user, nested case-control, g-methods, or external-control designs.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  Indication[Indication / disease severity] --> Tx[Treatment choice:<br/>strategy A vs B]\n  Indication --> Y[Outcome]\n  Tx --> Y\n  Confounders[Measured + unmeasured<br/>baseline confounders] --> Tx\n  Confounders --> Y\n  Indication -. confounding by indication .-> Tx\n  classDef bias fill:#fde,stroke:#b36;\n  class Indication bias",
        "caption": "The causal structure observational CER must defend against. Indication and baseline severity drive both the treatment decision and the outcome, opening a backdoor path (confounding by indication). An interchangeable active comparator plus baseline-covariate adjustment blocks the measured part of this path; unmeasured confounding is what negative controls and quantitative bias analysis probe.",
        "alt_text": "A directed acyclic graph showing indication and baseline confounders pointing into both treatment choice and outcome, illustrating the confounding-by-indication backdoor path that observational comparative effectiveness research must close.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "part_of",
        "target_slug": "comparative-effectiveness-research-cer-methods",
        "notes": "Observational CER is the non-randomized branch of the broader comparative-effectiveness-research method family."
      },
      {
        "relation_type": "see_also",
        "target_slug": "active-comparator-new-user",
        "notes": "The active-comparator new-user cohort is the default specific design used to operationalize an observational CER question."
      },
      {
        "relation_type": "used_with",
        "target_slug": "target-trial-emulation",
        "notes": "Target-trial emulation is the organizing discipline that keeps an observational CER protocol aligned on eligibility, time zero, and the estimand."
      },
      {
        "relation_type": "used_with",
        "target_slug": "propensity-score-methods-psm-iptw",
        "notes": "Propensity-score matching, IPTW, or overlap weighting on baseline covariates is the standard confounding-control step after CER cohort construction."
      },
      {
        "relation_type": "used_with",
        "target_slug": "high-dimensional-propensity-score-hdps-rwe",
        "notes": "High-dimensional propensity scores are the standard claims-based engine for controlling measured confounding in observational CER."
      },
      {
        "relation_type": "alternative_to",
        "target_slug": "pragmatic-trial",
        "notes": "Pragmatic/registry-based randomized trials answer similar real-world questions while preserving randomization; observational CER is the choice when randomization is infeasible."
      },
      {
        "relation_type": "see_also",
        "target_slug": "nested-case-control",
        "notes": "A nested case-control design is the efficient observational-CER option when outcomes are rare or covariate ascertainment is costly."
      },
      {
        "relation_type": "alternative_to",
        "target_slug": "single-arm-external-control",
        "notes": "When no contemporaneous comparator population exists (ultra-rare disease, first-in-class agent), a single-arm study with an external/historical control replaces concurrent comparative CER."
      },
      {
        "relation_type": "see_also",
        "target_slug": "self-controlled-case-series",
        "notes": "Self-controlled designs are the observational-CER choice when time-invariant confounding dominates and the exposure is transient with an acute outcome."
      },
      {
        "relation_type": "see_also",
        "target_slug": "prevalent-user-bias",
        "notes": "Prevalent-user/ever-exposed comparisons suffer depletion of susceptibles and time-zero misalignment; observational CER avoids these by using incident-user or time-matched prevalent-new-user designs."
      },
      {
        "relation_type": "see_also",
        "target_slug": "time-zero-index-date-alignment-rwe",
        "notes": "Misaligning eligibility, exposure, and follow-up start is the most common silent source of immortal-time and selection bias in observational CER."
      },
      {
        "relation_type": "see_also",
        "target_slug": "negative-control-outcomes-rwe",
        "notes": "Negative-control outcomes detect residual confounding that survives design and adjustment in observational CER."
      },
      {
        "relation_type": "requires",
        "target_slug": "fit-for-purpose-data-assessment-rwe",
        "notes": "A data source must be shown fit for purpose (valid exposure and outcome capture) before an observational CER contrast can be trusted."
      }
    ],
    "aliases": [
      "observational comparative effectiveness research",
      "observational CER",
      "non-randomized comparative effectiveness study",
      "real-world comparative effectiveness study",
      "comparative effectiveness research (observational)"
    ],
    "complexity": "foundational",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "charlson-comorbidity-index-rwe",
    "name": "Charlson / Quan-Charlson Comorbidity Index (CCI)",
    "short_definition": "A weighted count of 17 chronic conditions, each scored 1–6 by its association with 1-year mortality, summed into a single comorbidity-burden score used to risk-adjust or confounder-adjust RWE analyses; in claims, modern implementations should name the Deyo or Quan ICD code map and the weight set.",
    "long_description": "The **Charlson Comorbidity Index (CCI)** is the most widely used single-number summary of a patient's\nchronic-disease burden. Charlson et al. (1987) selected 19 conditions that independently predicted 1-year\nmortality in a hospital cohort and assigned each an integer **weight of 1, 2, 3, or 6** proportional to its\nadjusted mortality hazard (e.g., myocardial infarction, CHF, COPD, uncomplicated diabetes = 1; any tumor,\nmoderate/severe renal disease, diabetes with end-organ damage = 2; moderate/severe liver disease = 3;\nmetastatic solid tumor and AIDS = 6). The patient's CCI is the **sum of the weights of the conditions present**.\nAn **age-adjusted CCI** adds one point per decade of age over 40 (50–59 → +1, 60–69 → +2, ... ≥80 → +4). In\nRWE the index is almost never read off a chart — it is built from administrative data through a published\nICD-to-condition crosswalk: **Deyo (1992)** mapped Charlson to ICD-9-CM for claims, and **Quan (2005)** extended\nthe maps to ICD-10 and (2011) re-estimated condition weights in a modern multi-country discharge cohort. The\nCCI is fundamentally a **covariate-construction method**, not a study design or an outcome: its job is to make\ntwo exposure groups comparable on baseline sickness, either by entering the continuous score (or its\ncategories 0 / 1–2 / 3–4 / ≥5) in an outcome model, or by feeding it into a propensity score.\n\n**Core conceptual distinctions.** (1) *Index vs algorithm vs weights*: the **index** is the conceptual list of\nconditions; the **code algorithm** (Deyo, Quan) is the ICD crosswalk that decides whether a condition is\n\"present\" in claims; the **weights** (original Charlson 1987 vs updated Quan 2011) convert presence to a score.\nAll three must be reported — \"we used the Charlson index\" is under-specified without naming the code map and\nweight set. 
(2) *Lookback window*: comorbidities are ascertained over a **fixed baseline window before index\ndate** (commonly 6–12 months of continuous enrollment); a longer window finds more conditions and mechanically\nraises every patient's score, so the window must be equal across exposure groups. (3) *Diagnosis rule*: a\nsingle inpatient or outpatient diagnosis code is sensitive but noisy; a **≥1 inpatient OR ≥2 outpatient**\nrule reduces rule-out/coding-artifact false positives, exactly as in phenotype algorithms. (4) *CCI vs\nElixhauser*: CCI is a single mortality-weighted score (parsimonious, comparable across studies); Elixhauser is\na broader **31-condition set** usually entered as individual indicators or a van Walraven point score, trading\nparsimony for resolution.\n\n**Pros, cons, and trade-offs** (named against the alternatives).\n- **vs the Elixhauser comorbidity measures:** CCI is a compact, mortality-oriented single number that is\n  directly comparable across the thousands of studies that report it; Elixhauser captures more conditions and\n  typically predicts in-hospital outcomes better but costs degrees of freedom and cross-study comparability.\n  **Prefer CCI** when you need one interpretable summary or a parsimonious confounder; **switch to Elixhauser**\n  when condition-level resolution or maximal predictive performance matters.\n- **vs a high-dimensional propensity score (hdPS):** CCI is transparent, pre-specified, and clinically\n  interpretable but captures only 17 named conditions; hdPS empirically screens hundreds of claims codes and\n  can adjust for proxies CCI never sees, at the cost of interpretability and a data-driven (overfitting-prone)\n  variable set. 
They are complementary — many analyses include CCI **and** hdPS.\n- **vs individual comorbidity indicators:** entering each condition as its own covariate is the most flexible\n  (lets the outcome model learn condition-specific effects) but spends many degrees of freedom and can be\n  unstable in sparse data; the CCI's fixed weights buy stability and parsimony at the price of assuming the\n  1-year-mortality weighting is appropriate for *your* outcome.\n- **Original vs updated weights:** the 1987 weights were fit to 1980s hospital mortality; the Quan 2011 weights\n  reflect modern survival (several conditions now carry less weight). The choice shifts scores and must be\n  pre-specified and reported.\n\n**When to use.** As a baseline confounder for overall sickness in comparative cohort or case-control RWE; as a\nmatching/propensity-score input; as a stratification or subgroup variable; as a case-mix adjuster in HCRU,\ncost, and mortality models; and as a transparent, reviewer-familiar summary of cohort comparability in a\nTable 1. It is the default comorbidity adjuster for claims-based regulatory and HTA submissions because its\nprovenance and code maps are fully documented.\n\n**When NOT to use — and when it is actively misleading or dangerous.**\n- **As a proxy for the very outcome it was built to predict.** The CCI's weights come from 1-year mortality;\n  using it to adjust a mortality analysis risks adjusting away part of the effect (over-adjustment) or, if the\n  comorbidity is on the causal pathway from exposure to death, introducing collider/mediator bias. Treat\n  on-pathway comorbidities explicitly, not by burying them in a summary score.\n- **When the lookback window or enrollment differs across groups.** Score is mechanically a function of\n  observable person-time; an exposure group with longer baseline enrollment will look sicker purely from more\n  coding opportunity. 
Require equal continuous-enrollment windows before comparing scores.\n- **In populations unlike its derivation cohort.** The weights were estimated in general hospital inpatients;\n  in pediatric, obstetric, or single-disease specialty cohorts the mortality weighting can be irrelevant or\n  misleading. Validate or replace the index rather than importing it blindly.\n- **As a measure of frailty or function.** CCI counts diagnosed diseases; it does not capture functional\n  dependency, falls, or frailty, which a claims-based frailty index targets directly. A robust older adult with\n  treated hypertension and diabetes can outscore a frail one with few coded diagnoses.\n- **Reporting a number without the code map and weight set.** \"CCI = 4\" is uninterpretable and unreproducible\n  unless the ICD algorithm (Deyo vs Quan), the weight set (1987 vs 2011), the lookback, and the diagnosis rule\n  are stated.\n\n**Data-source operational depth.** In **claims**, presence is decided by the Deyo/Quan ICD crosswalk over a\nfixed baseline window; use both inpatient and outpatient (and often physician) files, require a defensible\ndiagnosis rule, and respect the diagnosis hierarchy (severe forms supersede mild — e.g., metastatic tumor\noverrides localized tumor, complicated overrides uncomplicated diabetes) so a condition is not double-weighted.\nIn **EHR**, problem lists and encounter diagnoses give richer detail but capture only in-system care; conditions\nmanaged elsewhere are missed, so the score can understate true burden. **Linked claims–EHR** maximizes capture\nbut inherits linkage selection. 
Across all sources the score is only as comparable as the underlying continuous\nenrollment and coding intensity, which differ systematically across payers (FFS vs Medicare Advantage vs\ncommercial), so cross-payer score comparisons need explicit caution.\n\n**Interpreting the output**\n\nIn the worked example, a patient carries CHF (weight 1), COPD (weight 1), complicated diabetes\n(weight 2), moderate renal disease (weight 2), and any malignancy (weight 2), giving CCI = 8; adding\nthe age adjustment for age 72 (+3) yields CCI_age-adjusted = 11.\n\n*(1) Formal interpretation.* CCI = 8 places this patient in a high-comorbidity stratum. The score is\nan additive index: each component condition contributes its pre-specified integer weight derived from\nCox proportional-hazards coefficients fit to 1-year all-cause mortality in the 1987 Charlson index\ncohort. The age-adjusted score of 11 incorporates one additional point per completed decade beyond 40\n(3 points at age 72), reflecting the independent mortality contribution of age in the original regression.\nThese weights do not update automatically to modern treatment effectiveness or the specific outcome in the\ncurrent study. CCI is a summary confounder-control variable, not a causal severity measure; a score of 8\ndoes not by itself predict this patient's outcome, nor does CCI quantify modifiable disease burden.\n\n*(2) Practical interpretation.* Use CCI as a covariate or stratification variable to adjust for\nbaseline comorbidity differences between exposure groups, not as an absolute mortality prediction for\nmodern patients. Confirm which ICD code-to-condition mapping (Deyo, Romano, or Quan 2005) and which\nweight set (Charlson original or Quan 2011) the algorithm applies, and specify the lookback window,\nbecause score values differ materially across implementations. 
When comparing CCI across data sources or payers, note that\ncoding intensity and enrollment continuity differ — a score of 8 in Medicare FFS may reflect more\ncomplete comorbidity capture than the same score in a commercial claims file.",
    "primary_category": "Bias_Control",
    "tags": [
      "charlson",
      "comorbidity-index",
      "risk-adjustment",
      "confounding-control",
      "claims",
      "icd-coding",
      "covariate-construction",
      "case-mix"
    ],
    "applies_to_study_types": [
      "cohort_retrospective",
      "comparative_effectiveness",
      "pharmacoepidemiology"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1016/0021-9681(87)90171-8",
        "url": "https://doi.org/10.1016/0021-9681(87)90171-8",
        "citation_text": "Charlson ME, Pompei P, Ales KL, MacKenzie CR. A new method of classifying prognostic comorbidity in longitudinal studies: development and validation. J Chronic Dis. 1987;40(5):373-383.",
        "year": 1987,
        "authors_short": "Charlson et al.",
        "notes": "Original derivation of the index and the 1–6 mortality weights in a hospital cohort."
      },
      {
        "role": "explain",
        "doi": "10.1016/0895-4356(92)90133-8",
        "url": "https://doi.org/10.1016/0895-4356(92)90133-8",
        "citation_text": "Deyo RA, Cherkin DC, Ciol MA. Adapting a clinical comorbidity index for use with ICD-9-CM administrative databases. J Clin Epidemiol. 1992;45(6):613-619.",
        "year": 1992,
        "authors_short": "Deyo et al.",
        "notes": "The canonical ICD-9-CM claims adaptation that made the CCI computable from administrative data."
      },
      {
        "role": "explain",
        "doi": "10.1097/01.mlr.0000182534.19832.83",
        "url": "https://doi.org/10.1097/01.mlr.0000182534.19832.83",
        "citation_text": "Quan H, Sundararajan V, Halfon P, et al. Coding algorithms for defining comorbidities in ICD-9-CM and ICD-10 administrative data. Med Care. 2005;43(11):1130-1139.",
        "year": 2005,
        "authors_short": "Quan et al.",
        "notes": "Extended the Charlson (and Elixhauser) code maps to ICD-10, the standard crosswalk for modern claims."
      },
      {
        "role": "demonstrate",
        "doi": "10.1093/aje/kwq433",
        "url": "https://doi.org/10.1093/aje/kwq433",
        "citation_text": "Quan H, Li B, Couris CM, et al. Updating and validating the Charlson comorbidity index and score for risk adjustment in hospital discharge abstracts using data from 6 countries. Am J Epidemiol. 2011;173(6):676-682.",
        "year": 2011,
        "authors_short": "Quan et al.",
        "notes": "Re-estimated the condition weights in a modern multi-country cohort; the updated weight set used in current studies."
      }
    ],
    "plain_language_summary": "The Charlson Comorbidity Index turns a patient's list of chronic illnesses into one number that stands for how sick they are. Each qualifying condition carries a point value — 1 for things like heart failure or COPD, up to 6 for metastatic cancer or AIDS — and you add up the points for the conditions a patient has. A common version also adds points for older age. In real-world data you do not read this off a chart; you let a published list of diagnosis codes decide which conditions count, looking back over a fixed window before the study start. Researchers use the score to make treatment groups fairer to compare, so that a difference in outcomes is not just because one group was sicker to begin with. The honest caveat: it only sees conditions that got coded, and it was built to predict death, so it is a blunt tool for anything else.",
    "key_terms": [
      {
        "term": "comorbidity weight",
        "definition": "The fixed point value (1, 2, 3, or 6) the index assigns to a condition based on how strongly that condition predicted death within a year."
      },
      {
        "term": "code algorithm (crosswalk)",
        "definition": "The published list of ICD diagnosis codes (Deyo for ICD-9, Quan for ICD-10) that decides whether a condition counts as present in administrative data."
      },
      {
        "term": "lookback (baseline) window",
        "definition": "The fixed stretch of time before the study index date over which you search the data for qualifying diagnoses; it must be the same length for every patient."
      },
      {
        "term": "age-adjusted CCI",
        "definition": "A version that adds one point for each decade of age over 40, combining disease burden and age into a single score."
      }
    ],
    "worked_example": {
      "scenario": "We want the age-adjusted Charlson score for one 72-year-old patient from a claims database. Over the 12-month baseline window the code algorithm flags four qualifying conditions; we look up each condition's Charlson weight, add them, then add the age points (one per decade over 40) to get the final score the analyst would carry into the outcome model.",
      "dataset": {
        "caption": "The conditions flagged for one patient over the 12-month lookback, with each condition's Charlson weight.",
        "columns": [
          "condition",
          "charlson_weight"
        ],
        "rows": [
          [
            "congestive_heart_failure",
            1
          ],
          [
            "chronic_pulmonary_disease",
            1
          ],
          [
            "diabetes_with_complications",
            2
          ],
          [
            "moderate_severe_renal_disease",
            2
          ],
          [
            "any_malignancy",
            2
          ]
        ]
      },
      "steps": [
        "List the qualifying conditions the code algorithm flagged over the equal baseline window, with each one's Charlson weight from the table.",
        "Add the condition weights to get the base Charlson score: 1 + 1 + 2 + 2 + 2 = 8.",
        "Compute the age points: the patient is 72, which is three full decades over 40 (50s, 60s, 70s), so age adds 3 points.",
        "Add the age points to the base score for the age-adjusted CCI: 8 + 3 = 11."
      ],
      "result": "Base Charlson Comorbidity Index = 1 + 1 + 2 + 2 + 2 = 8; age-adjusted CCI = 8 + 3 = 11. This patient enters the model in the highest comorbidity stratum (≥5), and the same code algorithm, weight set, lookback, and diagnosis rule are applied identically to every patient in both exposure groups."
    },
    "prerequisites": [
      "diagnosis-phenotype-algorithm-1ip-2op-time-window-rwe",
      "continuous-enrollment-observable-time-rwe",
      "baseline-characteristics-and-covariate-balance-rwe"
    ],
    "index_definitions": [
      {
        "name": "Original Charlson Comorbidity Index",
        "definition": "Mortality-weighted comorbidity score derived from chronic conditions associated with 1-year mortality; modern catalog use treats the collapsed 17-condition score as a baseline risk-adjustment covariate.",
        "source": "Charlson et al. 1987",
        "use": "Conceptual index and original integer weights.",
        "notes": "Report the weight set separately from the code map, diagnosis rule, and lookback window."
      },
      {
        "name": "Deyo Charlson ICD-9-CM adaptation",
        "definition": "Administrative-claims coding algorithm that maps ICD-9-CM diagnosis codes to Charlson condition flags.",
        "source": "Deyo et al. 1992",
        "use": "ICD-9-CM claims implementation and historical US administrative-data studies.",
        "notes": "Still relevant when baseline windows include pre-ICD-10 diagnosis history."
      },
      {
        "name": "Quan Charlson ICD-9/ICD-10 algorithms",
        "definition": "ICD-9-CM and ICD-10 coding algorithms for Charlson comorbidities in administrative data, published alongside Elixhauser mappings.",
        "source": "Quan et al. 2005",
        "use": "Crosswalk for modern coded claims/EHR administrative data.",
        "notes": "This is the phrase many users mean by \"Quan-Charlson comorbidity index\"; the code map is distinct from the weight set."
      },
      {
        "name": "Updated Quan Charlson weights",
        "definition": "Re-estimated Charlson score weights using hospital discharge data from 6 countries to improve modern mortality risk adjustment.",
        "source": "Quan et al. 2011",
        "use": "Updated mortality-calibrated score when modern weighting is desired.",
        "notes": "Scores are not numerically comparable with the original Charlson weights unless the same weight set is applied."
      }
    ],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Age-adjusted Charlson (Charlson 1994 age weighting)",
        "description": "Adds one point per decade of age over 40 to the condition-weight sum, combining demographic and disease risk into a single mortality-oriented score.",
        "edge_cases": [
          "Age and comorbidity are correlated; the combined score can mask which component drives an association — keep age as a separate covariate if you need to separate the two.",
          "Truncate at ≥80 → +4; do not extrapolate beyond the published decade weights."
        ],
        "data_source_notes": "claims/ehr: take age at index date; apply the same decade banding to all patients."
      },
      {
        "name": "Deyo (ICD-9) vs Quan (ICD-10) code algorithm",
        "description": "The ICD crosswalk that decides condition presence; Deyo for ICD-9-CM, Quan for the ICD-10 era. The transition date in the data (US ICD-10 from Oct 2015) must be handled so a baseline window spanning the switch uses both maps.",
        "edge_cases": [
          "A lookback window straddling the ICD-9→ICD-10 transition needs both crosswalks or the score drops for the ICD-10 portion.",
          "Apply the severity hierarchy (metastatic supersedes localized tumor; complicated supersedes uncomplicated diabetes) so a single clinical condition is not counted twice."
        ],
        "data_source_notes": "claims: confirm which ICD version each diagnosis field uses; some databases re-map historical codes."
      },
      {
        "name": "Updated (Quan 2011) vs original (Charlson 1987) weights",
        "description": "Two interchangeable weight sets mapping condition presence to points; the 2011 set reflects improved survival and down-weights several conditions.",
        "edge_cases": [
          "Scores are not comparable across weight sets; never pool studies that used different weights without re-deriving.",
          "For non-mortality outcomes, neither weight set is guaranteed optimal — consider condition-level indicators instead."
        ],
        "data_source_notes": "any: state the weight set explicitly in methods; both are published as integer lookup tables."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Elixhauser comorbidity measures",
        "pros_of_this": "One interpretable, mortality-weighted number that is directly comparable across the large literature reporting CCI; parsimonious as a confounder.",
        "cons_of_this": "Captures only 17 conditions and a single mortality-based weighting, so it can predict non-mortality outcomes less well than the broader Elixhauser set.",
        "when_to_prefer": "When you need a compact, reviewer-familiar summary of overall sickness or a parsimonious adjustment variable."
      },
      {
        "compared_to": "High-dimensional propensity score (hdPS)",
        "pros_of_this": "Transparent, pre-specified, and clinically interpretable; no data-driven variable selection or overfitting.",
        "cons_of_this": "Sees only named conditions and can miss proxies for confounding that an empirical code screen would capture.",
        "when_to_prefer": "When interpretability and pre-specification matter, or as a complement entered alongside an hdPS."
      },
      {
        "compared_to": "Individual comorbidity indicators",
        "pros_of_this": "Fixed weights give a stable, low-degrees-of-freedom covariate that behaves well in sparse data.",
        "cons_of_this": "Imposes the index's mortality weighting rather than letting the model learn condition-specific effects on your outcome.",
        "when_to_prefer": "When data are sparse or you want a single parsimonious adjuster instead of many condition terms."
      }
    ],
    "implementation_notes_by_data_source": {},
    "implementations": [
      {
        "lang": "python",
        "code": "import re\nimport pandas as pd\n\nLOOKBACK_DAYS = 365\n\n# condition -> (ICD-prefix regex, Charlson weight). ILLUSTRATIVE subset; swap in full Deyo/Quan maps.\nCHARLSON = {\n    \"mi\":              (r\"^(I21|I22|I252|410|412)\", 1),\n    \"chf\":             (r\"^(I50|428)\", 1),\n    \"copd\":            (r\"^(J4[0-7]|49[0-6])\", 1),\n    \"diabetes_uncx\":   (r\"^(E1[0-4][0-1]?|250[0-3])\", 1),\n    \"diabetes_cx\":     (r\"^(E1[0-4][2-8]|250[4-9])\", 2),\n    \"renal\":           (r\"^(N1[789]|585|586)\", 2),\n    \"tumor\":           (r\"^(C[0-7]|C8[0-5]|14[0-9]|1[5-9][0-9]|20[0-8])\", 2),\n    \"metastatic\":      (r\"^(C7[7-9]|C80|19[6-9])\", 6),\n    \"aids\":            (r\"^(B2[0-4]|04[2-4])\", 6),\n}\n# milder form -> more severe form that supersedes it (severity hierarchy)\nSUPERSEDES = {\"diabetes_uncx\": \"diabetes_cx\", \"tumor\": \"metastatic\"}\n\ndef charlson_score(diags: pd.DataFrame, base: pd.DataFrame, age_adjust: bool = True) -> pd.DataFrame:\n    df = diags.merge(base[[\"person_id\", \"index_date\"]], on=\"person_id\", how=\"inner\")\n    win = df[(df[\"dx_date\"] < df[\"index_date\"]) &\n             (df[\"dx_date\"] >= df[\"index_date\"] - pd.Timedelta(days=LOOKBACK_DAYS))]\n\n    rows = []\n    for pid, g in win.groupby(\"person_id\"):\n        codes = g[\"code\"].astype(str)\n        present = {c: codes.str.match(rx).any() for c, (rx, _) in CHARLSON.items()}\n        for mild, severe in SUPERSEDES.items():       # don't double-count severity\n            if present.get(severe):\n                present[mild] = False\n        score = sum(w for c, (_, w) in CHARLSON.items() if present[c])\n        rows.append({\"person_id\": pid, \"cci\": score})\n\n    out = base[[\"person_id\", \"age\"]].merge(pd.DataFrame(rows), on=\"person_id\", how=\"left\").fillna({\"cci\": 0})\n    out[\"cci\"] = out[\"cci\"].astype(int)\n    if age_adjust:\n        out[\"age_points\"] = ((out[\"age\"].clip(lower=40) - 40) // 
10).clip(upper=4).astype(int)\n        out[\"cci_age_adj\"] = out[\"cci\"] + out[\"age_points\"]\n    return out",
        "description": "Compute the (optionally age-adjusted) Charlson score from claims-style long diagnosis data. Inputs:\n  diags : person_id, code (ICD string), dx_date (datetime)\n  base  : person_id, index_date (datetime), age (int)\nA {condition: (regex, weight)} map stands in for the full Deyo/Quan crosswalk; replace with the published\ncode lists. Conditions are ascertained over a fixed lookback; the severity hierarchy zeroes the milder form\nwhen a more severe related condition is present.",
        "dependencies": [
          "pandas"
        ],
        "source_citations": [
          "deyo-1992",
          "quan-2011"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\n\nLOOKBACK_DAYS <- 365L\n\ncharlson_map <- list(   # condition = list(regex, weight) -- ILLUSTRATIVE subset\n  mi            = list(\"^(I21|I22|I252|410|412)\", 1L),\n  chf           = list(\"^(I50|428)\", 1L),\n  copd          = list(\"^(J4[0-7]|49[0-6])\", 1L),\n  diabetes_uncx = list(\"^(E1[0-4][0-1]?|250[0-3])\", 1L),\n  diabetes_cx   = list(\"^(E1[0-4][2-8]|250[4-9])\", 2L),\n  renal         = list(\"^(N1[789]|585|586)\", 2L),\n  tumor         = list(\"^(C[0-7]|C8[0-5]|14[0-9]|1[5-9][0-9]|20[0-8])\", 2L),\n  metastatic    = list(\"^(C7[7-9]|C80|19[6-9])\", 6L),\n  aids          = list(\"^(B2[0-4]|04[2-4])\", 6L)\n)\nsupersedes <- c(diabetes_uncx = \"diabetes_cx\", tumor = \"metastatic\")\n\ncharlson_score <- function(diags, base, age_adjust = TRUE) {\n  setDT(diags); setDT(base)\n  df <- merge(diags, base[, .(person_id, index_date)], by = \"person_id\")\n  win <- df[dx_date < index_date & dx_date >= index_date - LOOKBACK_DAYS]\n\n  score_one <- function(codes) {\n    present <- sapply(charlson_map, function(cw) any(grepl(cw[[1]], codes)))\n    for (mild in names(supersedes))                  # severity hierarchy\n      if (isTRUE(present[[supersedes[[mild]]]])) present[[mild]] <- FALSE\n    sum(mapply(function(cw, p) if (p) cw[[2]] else 0L, charlson_map, present))\n  }\n  sc <- win[, .(cci = score_one(as.character(code))), by = person_id]\n\n  out <- merge(base[, .(person_id, age)], sc, by = \"person_id\", all.x = TRUE)\n  out[is.na(cci), cci := 0L]\n  if (age_adjust) {\n    out[, age_points := pmin(pmax(age - 40L, 0L) %/% 10L, 4L)]\n    out[, cci_age_adj := cci + age_points]\n  }\n  out[]\n}",
        "description": "R/data.table version. Inputs mirror the Python version:\n  diags : person_id, code (character ICD), dx_date (Date)\n  base  : person_id, index_date (Date), age (integer)\nReplace the illustrative CHARLSON map with the full published Deyo/Quan code lists.",
        "dependencies": [
          "data.table"
        ],
        "source_citations": [
          "deyo-1992",
          "quan-2011"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "%let lookback = 365;\n\n/* Keep baseline-window diagnoses only. */\nproc sql;\n  create table win as\n  select d.person_id, d.code, d.dx_date, b.index_date, b.age\n  from work.diags d inner join work.base b on d.person_id = b.person_id\n  where d.dx_date < b.index_date\n    and d.dx_date >= b.index_date - &lookback;\nquit;\n\n/* Flag conditions per patient (ILLUSTRATIVE prefixes; replace with full crosswalk). */\ndata flags;\n  set win;\n  by person_id;\n  retain mi chf copd dm_uncx dm_cx renal tumor mets aids 0;\n  if first.person_id then do; mi=0; chf=0; copd=0; dm_uncx=0; dm_cx=0; renal=0; tumor=0; mets=0; aids=0; end;\n  c = cats(code);\n  if prxmatch(\"/^(I21|I22|I252|410|412)/\", c)            then mi=1;\n  if prxmatch(\"/^(I50|428)/\", c)                          then chf=1;\n  if prxmatch(\"/^(J4[0-7]|49[0-6])/\", c)                  then copd=1;\n  if prxmatch(\"/^(E1[0-4][0-1]?|250[0-3])/\", c)           then dm_uncx=1;\n  if prxmatch(\"/^(E1[0-4][2-8]|250[4-9])/\", c)            then dm_cx=1;\n  if prxmatch(\"/^(N1[789]|585|586)/\", c)                  then renal=1;\n  if prxmatch(\"/^(C[0-7]|C8[0-5]|14[0-9]|1[5-9][0-9]|20[0-8])/\", c) then tumor=1;\n  if prxmatch(\"/^(C7[7-9]|C80|19[6-9])/\", c)              then mets=1;\n  if prxmatch(\"/^(B2[0-4]|04[2-4])/\", c)                  then aids=1;\n  if last.person_id then output;\n  keep person_id age mi chf copd dm_uncx dm_cx renal tumor mets aids;\nrun;\n\n/* Apply severity hierarchy, sum weights, add age points. */\ndata charlson;\n  set flags;\n  if dm_cx then dm_uncx = 0;          /* complicated supersedes uncomplicated */\n  if mets  then tumor   = 0;          /* metastatic supersedes localized */\n  cci = mi*1 + chf*1 + copd*1 + dm_uncx*1 + dm_cx*2 + renal*2 + tumor*2 + mets*6 + aids*6;\n  age_points = min(max(floor((age - 40)/10), 0), 4);\n  cci_age_adj = cci + age_points;\nrun;",
        "description": "SAS build of the Charlson score from a long diagnosis table. Inputs:\n  work.diags : person_id, code (ICD), dx_date\n  work.base  : person_id, index_date, age\nUses an illustrative code map; substitute the full Deyo/Quan lists. Applies the diabetes and tumor severity\nhierarchy before summing weights, then optionally adds the per-decade age points.",
        "dependencies": [],
        "source_citations": [
          "deyo-1992",
          "quan-2011"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  D[Baseline-window claims<br/>equal lookback, continuous enrollment] --> A[Apply Deyo/Quan ICD crosswalk<br/>condition present yes/no]\n  A --> H{Severity hierarchy}\n  H -->|metastatic present| T1[Drop localized tumor]\n  H -->|complicated diabetes| T2[Drop uncomplicated diabetes]\n  T1 --> W[Sum Charlson weights 1/2/3/6]\n  T2 --> W\n  A --> W\n  W --> AG{Age-adjust?}\n  AG -->|yes| AP[Add 1 point per decade over 40, cap +4]\n  AG -->|no| S[Charlson score]\n  AP --> S\n  S --> U[Use as covariate / PS input / Table 1 stratum]",
        "caption": "Charlson construction pipeline — ascertain conditions over an equal baseline window, resolve the severity hierarchy, sum the integer weights, optionally add age points, and carry the score into the analysis as a confounder.",
        "alt_text": "Flowchart from baseline-window claims through ICD crosswalk, severity hierarchy, weight summation, optional age adjustment, to use of the Charlson score as a covariate.",
        "source_type": "illustrative",
        "source_citations": [
          "quan-2011"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "see_also",
        "target_slug": "elixhauser-comorbidity-index-rwe",
        "notes": "The broader 31-condition comorbidity measure; Elixhauser trades the CCI's parsimony for condition-level resolution and stronger in-hospital prediction."
      },
      {
        "relation_type": "see_also",
        "target_slug": "claims-based-frailty-index-rwe",
        "notes": "Frailty indices capture functional dependency that the diagnosis-counting CCI misses; the two adjust for different axes of risk and are often used together in older populations."
      },
      {
        "relation_type": "used_with",
        "target_slug": "propensity-score-methods-psm-iptw",
        "notes": "The Charlson score is a standard covariate entering propensity-score models to balance baseline comorbidity burden."
      },
      {
        "relation_type": "complements",
        "target_slug": "high-dimensional-propensity-score-hdps-rwe",
        "notes": "CCI is a transparent pre-specified summary; hdPS empirically screens hundreds of codes. Many analyses include both."
      },
      {
        "relation_type": "requires",
        "target_slug": "diagnosis-phenotype-algorithm-1ip-2op-time-window-rwe",
        "notes": "Deciding whether each Charlson condition is present uses the same inpatient/outpatient diagnosis rules as phenotype algorithms."
      },
      {
        "relation_type": "requires",
        "target_slug": "continuous-enrollment-observable-time-rwe",
        "notes": "Scores are only comparable when the comorbidity lookback sits inside an equal continuous-enrollment window across groups."
      },
      {
        "relation_type": "used_with",
        "target_slug": "baseline-characteristics-and-covariate-balance-rwe",
        "notes": "The Charlson score is a headline Table 1 covariate used to demonstrate baseline comparability between exposure groups."
      }
    ],
    "aliases": [
      "CCI",
      "Charlson index",
      "Charlson-Deyo index",
      "age-adjusted Charlson comorbidity index",
      "Charlson comorbidity score",
      "Quan-Charlson Comorbidity Index",
      "Quan Charlson comorbidity index",
      "Quan Charlson score",
      "Quan ICD Charlson",
      "Deyo-Charlson Comorbidity Index",
      "Deyo Charlson score"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "chi-square-test",
    "name": "Chi-Square Test of Independence",
    "short_definition": "The Pearson chi-square test of independence asks whether the distribution of a categorical outcome differs across groups — it compares what you actually observed in each cell of a contingency table to what you would expect if group membership and outcome were completely unrelated, and converts those discrepancies into a single test statistic that follows (approximately) a chi-squared distribution. The test answers the question \"is there any association?\" but not \"how large is it?\" — effect sizes such as the risk difference, relative risk, or odds ratio must be computed and reported separately to be useful for clinical or health-economic decisions.",
    "long_description": "**What the chi-square test of independence actually does**\n\nGiven a contingency table with r rows (groups) and c columns (outcome categories), the test\ncompares the observed count in every cell to the count that would be expected if the row\nvariable and the column variable were statistically independent. The expected count for cell\n(i, j) is:\n\n    E(i,j) = (row i total × column j total) / grand total\n\nThe Pearson chi-square statistic sums the squared, scaled discrepancies across all cells:\n\n    χ² = Σ [ (O - E)² / E ]\n\nUnder the null hypothesis of independence this statistic follows (approximately) a\nchi-squared distribution with degrees of freedom (r − 1)(c − 1). A large χ² signals that\nat least one cell count is further from its expected value than pure chance would produce.\nThe p-value is the probability of observing a χ² at least this large if independence were\ntruly the case.\n\nThe chi-squared approximation is valid when the expected counts are large enough for the\nCentral Limit Theorem to have kicked in. The widely cited rule of thumb — every expected cell\ncount should be ≥ 5 — is a practical heuristic, not a theorem. Campbell (2007) recommends\nthe more precise criterion that no more than 20 % of expected counts fall below 5 and none\nfalls below 1. When this criterion is violated, Fisher's exact test is preferred for 2 × 2\ntables; for larger tables, the options include collapsing sparse categories, a mid-p\ncorrection, or exact methods in software.\n\n**Degrees of freedom and table structure**\n\nFor a 2 × 2 table (two groups, binary outcome), df = 1. For a 3 × 2 table (three groups,\nbinary outcome), df = 2. For a 4 × 3 table, df = 6. The degrees of freedom capture how many\ncells are free to vary once the row and column totals are fixed. 
Increasing df shifts the null\ndistribution of χ² upward (its mean equals df), so a given χ² value corresponds to a larger p-value\nas the table grows.\n\n**Yates' continuity correction: the debate**\n\nFor 2 × 2 tables specifically, Yates (1934) proposed subtracting 0.5 from each |O − E|\nbefore squaring, bringing the distribution of the discrete statistic closer to the continuous\nchi-squared curve. The Yates correction makes the test more conservative (larger p-values).\nThe modern consensus, reviewed by Serra and Rea (2022), is that the correction is\nover-conservative: it produces p-values larger than the true exact p-value from Fisher's\nexact test, and Fisher's exact test itself is the preferred alternative when expected counts\nare small. Most analysts therefore do not apply Yates' correction as a default; when\nexpected counts are small, they switch to Fisher's exact test entirely. In software, the\n`correction=False` flag in `scipy.stats.chi2_contingency` and `correct=FALSE` in\n`chisq.test` disable Yates' correction (both functions apply it to 2 × 2 tables unless told\notherwise) and are the appropriate settings when expected counts are adequate.\n\n**Chi-square for trend: Cochran-Armitage**\n\nWhen the groups are ordered (e.g., dose groups: none, low, medium, high) and the outcome is\nbinary, the standard chi-square test discards that ordering and tests only whether any pattern of\nassociation exists. The Cochran-Armitage trend test uses the ordering explicitly to construct\na more powerful one-degree-of-freedom test for a monotone dose-response relationship. In\npharmacoepidemiology and benefit-risk assessment, where ordered exposure categories are\ncommon, the trend test is often more appropriate than a general chi-square; it also produces\na cleaner narrative (\"risk increases with dose\") rather than the omnibus \"at least one group\ndiffers.\"\n\n**Effect sizes: the deliverable that chi-square alone cannot provide**\n\nThe chi-square statistic and its p-value answer only one question: is there evidence against\nindependence? 
They do not tell you whether the association is clinically or economically\nmeaningful. For binary outcomes in a 2 × 2 table, the effect size choices are:\n\n- *Risk difference (RD)*: proportion with event in group A minus proportion in group B.\n  Directly interpretable: \"8 fewer events per 100 patients.\" Sensitive to baseline risk.\n- *Relative risk (RR)*: event rate in group A divided by event rate in group B.\n  Intuitive multiplicative framing: \"40 % lower risk.\" Preferred in clinical communication\n  and in prospective or cohort designs.\n- *Odds ratio (OR)*: the odds of the event in A divided by odds in B. Approximates the RR\n  when the outcome is rare (< 10 %); diverges substantially for common outcomes. Required\n  output of logistic regression; used extensively in case-control studies.\n- *Phi coefficient / Cramér's V*: standardized chi-square-based effect size. Phi applies to\n  2 × 2 tables; Cramér's V generalises to larger tables. Values near 0 indicate weak\n  association; near 1 indicate strong. Ben-Shachar et al. (2023) provide a modern treatment\n  of these statistics. Phi and Cramér's V are most useful for comparing association strength\n  across tables of different sizes, not for clinical communication.\n\nFor HEOR and HTA submissions, risk difference and relative risk (with 95 % confidence\nintervals) are the effect measures that connect most directly to budget-impact models,\nabsolute risk reduction, and number-needed-to-treat calculations. A chi-square p-value\nwithout these effect estimates is insufficient for a decision-making audience.\n\n**RWE realities: the large-n problem and why p-values mislead at claims scale**\n\nIn real-world evidence studies using administrative claims or EHR data, sample sizes of\n50,000 to several million patients are routine. At this scale, the chi-square test will\nreject the null for association differences that are substantively meaningless. 
A 0.2\npercentage-point difference in comorbidity prevalence between two cohorts of 500,000 patients\n(for example, 10.0 % versus 10.2 %) will yield a χ² of roughly 11 and a p-value below 0.001. This is not\nan error in the test — it is working exactly as designed. The test is highly powered to\ndetect any departure from independence, no matter how trivial.\n\nThe practical implication is that in large observational datasets, p-values from chi-square\ntests should be treated as nuisance outputs, not deliverables. What matters is:\n\n1. The absolute risk difference and its 95 % confidence interval, sized against a\n   pre-specified minimally important difference.\n2. The standardized difference (also called the standardized mean difference for binary\n   variables, where it equals the risk difference divided by the pooled standard deviation of\n   the proportion). A standardized difference < 0.10 indicates negligible imbalance for\n   covariate balance assessment, regardless of n.\n\nFor Table 1 balance comparisons in observational RWE studies, standardized differences are\nnow the preferred summary because, unlike chi-square p-values, they do not inflate with\nsample size. Reporting chi-square p-values in a Table 1 comparing a treated group of 200,000\nto a control group of 200,000 will produce a list of \"statistically significant\" differences\nfor many covariates even after careful propensity-score matching — this is uninformative\nand potentially misleading.\n\n**Independence assumption violations in real-world data**\n\nThe chi-square test assumes that each observation (each unit counted in the contingency table) is\nindependent. This assumption is violated in several common RWE settings:\n\n- *Claim-level rather than patient-level analysis*: a patient with 12 hospitalizations\n  contributes 12 rows. If the analyst builds a contingency table from claim rows rather than\n  from patient-level aggregates, the rows are correlated within patient. 
Naive chi-square on\n  such data is wrong — the effective sample size is far smaller than the row count, and\n  standard errors are underestimated. The correct unit of analysis is almost always the\n  patient; aggregate to one row per patient before any test.\n- *Clustering within providers or facilities*: patients treated by the same physician or at\n  the same hospital tend to have outcomes that are more similar than those of unrelated patients. If the chi-square table\n  is built from cross-sectional data with this structure, a cluster-robust or mixed-model\n  approach is needed; the standard test is anti-conservative.\n- *Repeated binary outcomes per patient over time*: if a patient can have the outcome in\n  multiple follow-up windows, standard chi-square treats each occurrence independently. Use\n  GEE (generalized estimating equations) with an exchangeable working correlation structure,\n  or collapse to a time-to-first-event analysis.\n\n**Pros, cons, and trade-offs**\n\n*Pros*:\n- Conceptually simple: observed vs expected, with a single formula most analysts can verify\n  by hand for a 2 × 2 table.\n- Valid under minimal assumptions when expected counts are adequate (≥ 5 in each cell).\n- Available in every statistical package; the statistic is the same across software, although\n  the default for the continuity correction varies.\n- Extends naturally to r × c tables with a single statistic and a single p-value.\n- At large n, the approximation is excellent; the test is computationally trivial.\n- The Cochran-Armitage extension handles ordered categorical predictors without additional\n  modelling machinery.\n\n*Cons*:\n- Does not produce a clinically meaningful effect estimate; risk difference, RR, or OR must\n  be computed separately and are the actual deliverable for decision-makers.\n- The chi-squared approximation breaks down when expected counts are small — the p-value is\n  unreliable in sparse tables.\n- Overpowered at large n: trivial associations are flagged as statistically\n  significant; p-values alone are 
uninterpretable at claims scale.\n- Cannot accommodate covariates or confounding adjustment; it is a marginal (unadjusted)\n  test. For adjusted inference, route to logistic regression or log-binomial regression.\n- Does not distinguish the direction of the association — a large χ² could reflect excess\n  events in group A or in group B; the contingency table must be inspected.\n- Sensitive to the unit of analysis: claim-level rows rather than patient-level rows\n  produce an anti-conservative test.\n\n*Trade-offs vs alternatives*:\n- *Fisher's exact test*: exact p-value from the hypergeometric distribution; no large-sample\n  approximation needed; preferred when expected counts are small; computationally heavier for\n  large tables (though modern software handles this for typical RWE table sizes).\n- *McNemar's test*: the paired-data analogue for a 2 × 2 table where the two rows represent\n  the same subjects measured twice (e.g., pre/post or matched pairs). Using a standard\n  chi-square test on paired binary data is a common and consequential mistake.\n- *Logistic regression*: extends chi-square from a bivariate marginal test to a multivariable\n  adjusted model; produces an OR with CI; handles continuous covariates; is the adjusted\n  version of the chi-square test and the preferred primary method for binary outcomes in\n  confounded observational comparisons.\n\n**When to use**\n\nUse the chi-square test of independence when:\n\n- The outcome is binary or multinomial (categorical with no intrinsic ordering or an\n  ordering you have decided not to exploit).\n- The unit of analysis is the patient (or other independent observational unit), not the\n  claim, visit, or encounter row.\n- All expected cell counts satisfy the ≥ 5 rule (or more precisely, the Campbell criterion).\n- The goal is a simple bivariate test of association — e.g., comparing 30-day readmission\n  rates between two treatment arms in a balanced, adjusted cohort.\n- A baseline Table 1 comparison 
in an RCT (where balance testing is legitimately of interest\n  and sample sizes are typically small enough that p-values are not trivially significant).\n- As a sensitivity check alongside an adjusted logistic or log-binomial regression model for\n  the confirmatory analysis.\n- For ordinal categorical predictors where you want to test a monotone trend, use the\n  Cochran-Armitage trend test rather than the general chi-square.\n\n**When NOT to use — and when it is actively misleading**\n\n- *Paired data (pre-post or matched pairs)*: if the two rows of the table represent the same\n  patients before and after (or matched patient pairs), observations are not independent and\n  chi-square is wrong. Use McNemar's test. This is one of the most common misapplications of\n  chi-square in the biomedical literature.\n- *Small expected counts*: when any expected cell count falls below 5 (or more than 20 % of\n  cells are below 5), switch to Fisher's exact test for 2 × 2 tables. For larger tables,\n  consider collapsing sparse categories or using an exact method.\n- *Claim-level rather than patient-level units*: building the contingency table from claim\n  rows creates correlated observations within patient; the effective sample size is\n  overstated and the chi-square p-value is anti-conservative. Always aggregate to one row per\n  patient before testing.\n- *As evidence of a causal association*: chi-square tests for statistical association, not\n  causation. An unadjusted chi-square on an unbalanced observational dataset estimates a\n  confounded marginal association. 
Do not interpret a significant p-value as evidence that\n  the exposure causes the outcome; route to propensity-score adjusted, regression-adjusted,\n  or g-method analyses for causal inference.\n- *As the primary analysis for confounded comparisons*: in any observational design without\n  randomisation, the chi-square is at best a descriptive screen; the confirmatory analysis\n  requires adjustment for confounders. Report the chi-square alongside the adjusted OR from\n  logistic regression, not instead of it.\n- *At large n for substantive decision-making*: at claims scale (hundreds of thousands of\n  patients), every chi-square test will be significant. Report absolute risk differences,\n  relative risks, standardized differences, and confidence intervals. Suppress or minimise\n  the p-value, or at minimum note that at this sample size statistical significance is not\n  informative.\n- *Ordered categories without exploiting the order*: if the columns represent an ordered\n  scale (severity grades, dose levels), the standard chi-square discards the ordering and is\n  less powerful than the Cochran-Armitage trend test. Use the trend test for ordered\n  alternatives.\n\n**Interpreting the output**\n\nIn the worked example, a discharge coordination program is compared with usual care on\n30-day readmission in a 2×2 contingency table. The Pearson chi-square statistic is\napproximately 9.52 on df = 1, with p ≈ 0.002. The estimated risk difference is −0.20\n(20 fewer readmissions per 100 patients) and the relative risk is 0.50 (half the\nreadmission rate in the program arm relative to usual care).\n\n*(1) Formal interpretation.* The chi-square statistic of approximately 9.52 is the weighted\nsum of squared discrepancies between observed and expected cell counts. 
Under the null\nhypothesis of no association between program assignment and readmission, a statistic this\nlarge or larger on df = 1 arises by chance in approximately 0.2% of samples (p ≈ 0.002).\nThe test confirms that the observed association is incompatible with independence at\nconventional significance levels. The p-value alone does not quantify the magnitude of the\nassociation — the risk difference (−0.20, i.e., 20 fewer readmissions per 100 patients) and\nthe relative risk (0.50) are the effect measures that carry clinical and economic meaning.\n\n*(2) Practical interpretation.* The data are consistent with the discharge program being\nassociated with roughly half the 30-day readmission rate of usual care — a risk difference\nof 20 per 100 that is substantial in most hospital settings. The chi-square p-value confirms\nthe association is unlikely to be pure chance; it does not establish causation. Because this\nis an unadjusted comparison, differences in baseline risk, patient mix, and facility factors\nbetween the two groups may confound the estimate. An adjusted logistic regression model\nusing the same outcome is the appropriate next step for confirmatory inference.",
    "primary_category": "Inferential_Statistics",
    "tags": [
      "statistics",
      "primitive",
      "hypothesis-testing",
      "categorical-data",
      "contingency-tables",
      "chi-square",
      "fisher-exact",
      "goodness-of-fit",
      "effect-size"
    ],
    "applies_to_study_types": [
      "cohort_retrospective",
      "cohort_prospective",
      "rct",
      "cross_sectional",
      "active_comparator_new_user",
      "descriptive_analysis",
      "claims_analysis",
      "ehr_study",
      "registry_study"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "primary",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1002/sim.2832",
        "url": "https://doi.org/10.1002/sim.2832",
        "citation_text": "Campbell I. Chi-squared and Fisher-Irwin tests of two-by-two tables with small sample recommendations. Statistics in Medicine. 2007;26(19):3661-3675.",
        "year": 2007,
        "authors_short": "Campbell",
        "notes": "The definitive modern reference for the chi-square vs Fisher exact decision in 2x2 tables. Campbell reviews simulation evidence on type-I error and recommends the 'N - 1' chi-squared test whenever all expected counts are at least 1, and the Fisher-Irwin test otherwise, a less restrictive criterion than the cruder \"all cells >= 5\" rule. Essential for any analyst choosing between chi-square and Fisher exact in sparse tables."
      },
      {
        "role": "explain",
        "doi": "10.11613/bm.2013.018",
        "url": "https://doi.org/10.11613/bm.2013.018",
        "citation_text": "McHugh ML. The Chi-square test of independence. Biochemia Medica. 2013;23(2):143-149.",
        "year": 2013,
        "authors_short": "McHugh",
        "notes": "A widely cited pedagogical overview covering the mechanics of the chi-square test, the degrees-of-freedom formula, the expected-count assumption, and the distinction between chi-square for goodness-of-fit versus chi-square for independence. Written for clinical researchers and health sciences analysts."
      },
      {
        "role": "demonstrate",
        "doi": "10.2427/13059",
        "url": "https://doi.org/10.2427/13059",
        "citation_text": "Serra A, Rea F, Di Carlo A. Continuity correction of Pearson's chi-square test in 2x2 contingency tables: a mini-review on recent development. Epidemiology, Biostatistics, and Public Health. 2022;19(1):e17027.",
        "year": 2022,
        "authors_short": "Serra et al.",
        "notes": "Systematic review of the Yates continuity correction debate across simulation studies and methodological papers. Concludes that the Yates correction is over-conservative relative to Fisher's exact test and that Fisher exact is the preferred alternative for small expected counts. Supports the recommendation to disable the Yates correction when expected counts are adequate and to switch entirely to Fisher when they are not."
      },
      {
        "role": "use",
        "doi": "10.3390/math11091982",
        "url": "https://doi.org/10.3390/math11091982",
        "citation_text": "Ben-Shachar MS, Patil I, Thériault R, Wiernik BM. Phi, Fei, Fo, Fum: Effect sizes for categorical data that use the chi-squared statistic. Mathematics. 2023;11(9):1982.",
        "year": 2023,
        "authors_short": "Ben-Shachar et al.",
        "notes": "Modern treatment of chi-square-based effect sizes including phi (2x2 tables), Cramér's V (larger tables), Fei (goodness-of-fit), and their relationships. Relevant when analysts need a standardized effect size alongside a chi-square p-value — for example, when comparing association strength across multiple contingency tables of different sizes."
      }
    ],
    "plain_language_summary": "The chi-square test of independence checks whether two categorical variables — such as treatment group and whether a patient was hospitalized — are genuinely related, or whether the difference between groups could plausibly be due to chance. It works by comparing what you actually counted in each cell of a table (for example, 20 readmissions in the treated group and 40 in the control group) to what you would expect to see if there were no association at all, then summarising those discrepancies in a single number. A large discrepancy produces a small p-value, which means the pattern is unlikely to be due to chance — but the test cannot tell you how large or important the difference is, so you always need to also report a concrete effect size such as the risk difference (\"20 fewer events per 100 patients\") or the relative risk.",
    "key_terms": [
      {
        "term": "contingency table",
        "definition": "A grid that cross-tabulates two categorical variables — for example, rows for treatment group (A vs B) and columns for outcome (event vs no event) — so you can see all four combinations at once."
      },
      {
        "term": "observed vs expected counts",
        "definition": "The \"observed\" count is the number you actually found in the data for a given cell; the \"expected\" count is what you would predict if group membership and outcome were completely unrelated (calculated from the row and column totals)."
      },
      {
        "term": "degrees of freedom",
        "definition": "For a chi-square test, this equals (number of rows minus 1) times (number of columns minus 1); it controls how large the test statistic needs to be before the result is considered unusual."
      },
      {
        "term": "Yates continuity correction",
        "definition": "A small adjustment to the chi-square formula for 2x2 tables that makes the result more conservative; the modern consensus is that it over-corrects, and Fisher's exact test is the preferred alternative when sample sizes are small."
      },
      {
        "term": "expected-cell-count rule",
        "definition": "The practical guideline that chi-square is only a reliable approximation when every cell has an expected count of at least 5; when this fails, switch to Fisher's exact test."
      },
      {
        "term": "standardized difference",
        "definition": "A sample-size-independent measure of how different two groups are on a variable, preferred over chi-square p-values for assessing covariate balance in large observational studies because it does not inflate with sample size."
      }
    ],
    "worked_example": {
      "scenario": "A health outcomes analyst is comparing 30-day hospital readmission rates between patients who received a new discharge coordination program (Program group) and patients who received usual care (Control group). Each group has 100 patients. The analyst builds a 2x2 contingency table, computes expected counts, calculates the chi-square statistic by hand, and reports the risk difference and relative risk alongside the p-value.",
      "dataset": {
        "caption": "30-day readmission counts for 100 patients per group. Totals chosen so every expected count is a whole number and each (O-E)^2/E term is an exact fraction.",
        "columns": [
          "group",
          "readmitted",
          "not_readmitted",
          "row_total"
        ],
        "rows": [
          [
            "Program",
            20,
            80,
            100
          ],
          [
            "Control",
            40,
            60,
            100
          ],
          [
            "Column total",
            60,
            140,
            200
          ]
        ]
      },
      "steps": [
        "Compute expected counts using E = (row total x column total) / grand total. E(Program, readmitted) = (100 x 60) / 200 = 6000 / 200 = 30. E(Program, not readmitted) = (100 x 140) / 200 = 14000 / 200 = 70. E(Control, readmitted) = (100 x 60) / 200 = 6000 / 200 = 30. E(Control, not readmitted) = (100 x 140) / 200 = 14000 / 200 = 70.",
        "All four expected counts equal 30 or 70, both well above 5. The chi-square approximation is valid.",
        "Compute the squared deviation divided by expected for each cell. Cell (Program, readmitted): difference = 20 - 30 = -10; squared = 100; term = 100/30. Cell (Program, not readmitted): difference = 80 - 70 = 10; squared = 100; term = 100/70. Cell (Control, readmitted): difference = 40 - 30 = 10; squared = 100; term = 100/30. Cell (Control, not readmitted): difference = 60 - 70 = -10; squared = 100; term = 100/70.",
        "Sum all four terms to get the test statistic. The two 100/30 terms sum to 200/30, and the two 100/70 terms sum to 200/70. Converting to a common denominator of 210: 200/30 = 1400/210 and 200/70 = 600/210, so the total is 2000/210 = 9.524.",
        "Degrees of freedom for a 2x2 table = (2 minus 1) times (2 minus 1) = 1. With the test statistic equal to 9.524 and df = 1, the p-value is approximately 0.002. There is strong evidence against independence.",
        "Compute the risk difference and relative risk. Risk in Program group = 20 / 100 = 0.20. Risk in Control group = 40 / 100 = 0.40. Risk difference = 0.20 - 0.40 = -0.20 (the program group has 20 fewer events per 100). Relative risk = 0.20 / 0.40 = 0.50 (50% lower risk in the program group).",
        "Interpretation: the chi-square test confirms a statistically significant association (p approximately 0.002). The clinically meaningful summary is that the program is associated with a 20 percentage-point reduction in 30-day readmissions (RD = -0.20, RR = 0.50). The chi-square p-value alone does not convey this magnitude."
      ],
      "result": "Expected counts: E(Program readmitted) = 6000/200 = 30, E(Program not readmitted) = 14000/200 = 70, E(Control readmitted) = 6000/200 = 30, E(Control not readmitted) = 14000/200 = 70. Test statistic = 2000/210 = 9.524, df = 1, p approximately 0.002. Risk difference = 20/100 - 40/100 = -0.20 (20 fewer readmissions per 100 patients). Relative risk = (20/100) / (40/100) = 0.20 / 0.40 = 0.50. The effect size (RD, RR) is the deliverable; the p-value only confirms non-randomness."
    },
    "prerequisites": [
      "inferential-statistics-foundations",
      "parametric-vs-nonparametric-tests",
      "descriptive-statistics"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "2x2 table: two groups, binary outcome (standard chi-square)",
        "description": "The most common form in HEOR and clinical research: comparing an event rate (readmission, treatment discontinuation, hospitalization, adverse event) between two groups. Chi-square with df = 1 is appropriate when all four expected cell counts are >= 5; Fisher exact is preferred otherwise. Always report the risk difference and relative risk (or OR) with 95 % CIs alongside the p-value.",
        "edge_cases": [
          "At large n (> 10,000 per group), even trivially small associations will usually be statistically significant. Report effect sizes prominently; note that the p-value is not informative at this scale.",
          "When the outcome is common (> 10% prevalence), the OR and RR diverge, with the OR lying further from 1 than the RR; prefer the RR for clinical communication and note that logistic regression estimates the OR, not the RR."
        ],
        "data_source_notes": "Claims: aggregate to one binary indicator per patient (ever readmitted within 30 days) before building the table. EHR: confirm that each row represents one unique patient, not one visit. Registry: verify that adjudicated binary outcomes are patient-level."
      },
      {
        "name": "rxc table: more than two groups or more than two outcome categories",
        "description": "When there are three or more groups (drug classes, lines of therapy, geographic regions) or three or more outcome categories (none / mild / severe), the chi-square generalises to an r x c table with df = (r-1)(c-1). A significant overall test tells you that at least one cell combination departs from independence; post-hoc pairwise comparisons with Bonferroni or FDR correction are needed to locate which pair(s) drive the result.",
        "edge_cases": [
          "Sparse cells become more likely in large tables; collapse clinically similar categories before testing if expected counts fall below 5.",
          "Unordered categorical columns (race/ethnicity categories, drug classes) require the standard chi-square; ordered columns (severity grades, dose groups) call for the Cochran-Armitage trend test."
        ],
        "data_source_notes": "For claims multi-drug comparisons, confirm that each patient is counted in exactly one row (e.g., the drug of first use or the drug with longest exposure) to avoid double-counting patients who switch therapies."
      },
      {
        "name": "Cochran-Armitage trend test for ordered categories",
        "description": "When the column (or row) categories are ordered — dose levels (0, 1, 2, 3), severity grades, or ordered time windows — the Cochran-Armitage trend test exploits that ordering to construct a one-degree-of-freedom test for a monotone dose-response or time trend. It is more powerful than a general chi-square for detecting monotone associations and produces a cleaner interpretive narrative.",
        "edge_cases": [
          "The trend test assumes a linear score assignment to ordered categories; verify that the spacing of scores is defensible or try multiple score assignments as a sensitivity check.",
          "A non-significant trend test does not rule out a non-monotone association (e.g., a U-shaped dose-response); inspect the table for non-linear patterns."
        ],
        "data_source_notes": "Pharmacoepidemiology: compare event rates across ordered exposure-intensity categories (days of supply or defined daily doses); trend test supports a dose-response argument in benefit-risk narratives."
      },
      {
        "name": "Goodness-of-fit chi-square (one-way)",
        "description": "Tests whether the distribution of a single categorical variable matches a theoretical (expected) distribution. For example, comparing the observed racial/ethnic distribution of a study cohort to the expected US population distribution from census data. The statistic has the same form, but the degrees of freedom are k - 1 for k categories rather than (r-1)(c-1); the expected counts come from external proportions rather than from a cross-tabulation of two variables.",
        "edge_cases": [
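          "If the expected proportions are specified in advance (e.g., testing whether a coin is fair, with expected proportions of 0.5 and 0.5), this is the standard goodness-of-fit use. If parameters of the expected distribution are instead estimated from the same data, subtract one degree of freedom per estimated parameter. If expected proportions come from a reference population, document the reference source.",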
          "Goodness-of-fit chi-square for continuous variables (testing normality) is rarely recommended; the Kolmogorov-Smirnov or Anderson-Darling tests have better power."
        ],
        "data_source_notes": "Claims and registry: compare cohort demographic distribution to National Health Interview Survey or census benchmarks to assess generalisability; report the chi-square alongside a table of observed vs expected proportions."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "parametric-vs-nonparametric-tests",
        "pros_of_this": "Chi-square is purpose-built for categorical data: it requires no assumptions about a continuous distribution and operates directly on cell counts. It generalises trivially to multi-category outcomes, whereas the t-test is restricted to continuous outcomes.",
        "cons_of_this": "Chi-square does not produce an interpretable effect estimate on its own; the analyst must separately compute the risk difference, RR, or OR. The chi-square-based framework also does not extend to covariate adjustment — for adjusted binary-outcome analyses, logistic regression is the standard.",
        "when_to_prefer": "Use chi-square (or Fisher exact) for unadjusted bivariate tests of categorical association; use logistic regression when covariate adjustment or multivariable modelling is needed."
      },
      {
        "compared_to": "fisher-exact-test",
        "pros_of_this": "Chi-square is computationally trivial for large tables and large samples; the large-sample approximation is excellent when expected counts are adequate.",
        "cons_of_this": "Fisher's exact test is valid regardless of sample size and is the correct choice when any expected cell count falls below 5. In 2x2 tables with adequate counts, both tests typically agree; the practical recommendation is to default to Fisher exact for any table with sparse cells and chi-square for the rest.",
        "when_to_prefer": "Use chi-square when all expected counts are >= 5, particularly for tables larger than 2x2 (where Fisher exact becomes computationally demanding). Use Fisher exact for 2x2 tables with small expected counts or whenever an exact p-value is required."
      },
      {
        "compared_to": "mcnemar-test",
        "pros_of_this": "Standard chi-square is appropriate for independent groups; it is simpler to compute and explain than McNemar's test.",
        "cons_of_this": "For paired binary data — pre/post measurements on the same patients, or matched-pair designs — chi-square is incorrect because the observations are not independent. McNemar's test accounts for the pairing and is the correct choice. Using chi-square on paired data is a common and consequential error: it underestimates the test statistic when concordant pairs dominate, which is typical in pre-post clinical data.",
        "when_to_prefer": "Always use chi-square for independent groups; always use McNemar when the same patients (or matched pairs) appear in both rows of the 2x2 table."
      },
      {
        "compared_to": "logistic-regression-for-binary-outcomes",
        "pros_of_this": "Chi-square is simpler, requires no modelling assumptions beyond independence and adequate expected counts, and is interpretable without regression output. It is the standard tool for bivariate categorical comparison in descriptive and baseline tables.",
        "cons_of_this": "Logistic regression adjusts for confounders, handles continuous covariates, and produces adjusted ORs with CIs. For any observational design where confounding is a concern, chi-square is inadequate as the primary analysis; logistic regression (or log-binomial for RR estimation) is required. Chi-square becomes the unadjusted baseline comparison in a sensitivity table, not the main result.",
        "when_to_prefer": "Use chi-square for unadjusted bivariate comparisons, Table 1, and RCT balance tests. Use logistic regression for all primary adjusted analyses in observational designs."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Always aggregate to the patient level before building the contingency table — one row per patient with a binary indicator for the outcome of interest. Never build the table from claim rows; within-patient correlation makes the chi-square anti-conservative. For large cohorts (> 50,000), the p-value will almost always be near zero; report the absolute risk difference and relative risk with 95 % CIs as the primary summary. Inspect expected counts with the EXPECTED option (SAS PROC FREQ) or print the expected matrix (Python/R) to confirm the chi-square approximation is valid before citing the p-value.",
      "ehr": "One visit row per patient is the correct unit for cross-sectional EHR comparisons; index visits (first qualifying encounter) or annual summary rows are common constructions. Binary outcomes such as diagnosis code presence, procedure completion, or 30-day return are natural chi-square inputs. Informative visit patterns (sicker patients have more encounters) can bias patient selection; document the index-visit construction rule.",
      "registry": "Adjudicated registry endpoints are typically the cleanest binary outcomes. Confirm that the registry does not double-count patients enrolled at multiple sites. Regional or practice-level clustering is common in registries; if the chi-square is being used to compare event rates across sites, consider whether a cluster-corrected approach is needed.",
      "primary": "Survey data with binary items are natural chi-square inputs. Survey weights require a design-based chi-square (Rao-Scott correction in SAS PROC SURVEYFREQ; svychisq in R's survey package) rather than the standard unweighted test; the unweighted chi-square ignores complex survey design and can produce incorrect p-values.",
      "linked": "Linked datasets (claims plus EHR, claims plus registry) typically produce large n, so expected-count violations are rare but p-value inflation is universal. Report absolute risk differences and standardized differences as the primary summaries; include the chi-square only for completeness. Confirm that linkage did not introduce selection bias (patients with linkable records may differ systematically from unlinked patients)."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import math\nfrom scipy import stats\nimport numpy as np\n\n# ── Motivating dataset: 30-day readmission (2x2 table) ──\n# Rows: [Program, Control]; Columns: [Readmitted, Not readmitted]\ntable = np.array([[20, 80],\n                  [40, 60]])\n\n# ── 1. Chi-square test of independence ──\n# correction=False: disable Yates' continuity correction (recommended when expected counts >= 5)\nchi2, p_val, dof, expected = stats.chi2_contingency(table, correction=False)\nprint(f\"Observed counts:\\n{table}\")\nprint(f\"\\nExpected counts:\\n{expected}\")\nprint(f\"\\nChi-square = {chi2:.4f}, df = {dof}, p = {p_val:.4f}\")\nprint(f\"Minimum expected count: {expected.min():.1f} (must be >= 5 for chi-square approximation)\")\n\n# ── 2. Fisher exact test (always valid; prefer when any expected count < 5) ──\n_, fisher_p = stats.fisher_exact(table)\nprint(f\"\\nFisher exact p = {fisher_p:.4f}\")\n\n# ── 3. Effect sizes: risk difference and relative risk ──\nn_a = table[0].sum()          # 100 patients in Program group\nn_b = table[1].sum()          # 100 patients in Control group\nrisk_a = table[0, 0] / n_a   # 20/100 = 0.20\nrisk_b = table[1, 0] / n_b   # 40/100 = 0.40\nrd = risk_a - risk_b          # risk difference: -0.20\nrr = risk_a / risk_b          # relative risk: 0.50\nprint(f\"\\nRisk in Program group:  {risk_a:.3f}\")\nprint(f\"Risk in Control group:  {risk_b:.3f}\")\n# Signed format: a negative value means fewer events in the Program group\nprint(f\"Risk difference (A-B):  {rd:.3f}  (i.e., {rd*100:+.0f} events per 100 patients)\")\nprint(f\"Relative risk (A/B):    {rr:.3f}\")\n\n# 95% CI for risk difference (Wald normal approximation, used here for brevity;\n# Newcombe's score-based interval is preferable in small samples)\nse_rd = math.sqrt(risk_a * (1 - risk_a) / n_a + risk_b * (1 - risk_b) / n_b)\nrd_lower, rd_upper = rd - 1.96 * se_rd, rd + 1.96 * se_rd\nprint(f\"95% CI for RD: ({rd_lower:.3f}, {rd_upper:.3f})\")\n\n# ── 4. 
Cramér's V: standardized chi-square effect size ──\nn_total = table.sum()\ncramers_v = math.sqrt(chi2 / (n_total * (min(table.shape) - 1)))\nprint(f\"\\nCramér's V = {cramers_v:.3f}  (0=no association, 1=perfect association)\")\n\n# ── 5. Rule: check expected counts before deciding which test to report ──\nif expected.min() < 5:\n    print(\"\\nWARNING: Expected count < 5. Report Fisher exact p-value, not chi-square.\")\nelse:\n    print(\"\\nAll expected counts >= 5: chi-square approximation is valid.\")",
        "description": "Chi-square test of independence and Fisher exact test using scipy.stats.chi2_contingency\nand scipy.stats.fisher_exact. Demonstrates the correction parameter (disable Yates\ncorrection by default), inspection of expected counts, computation of risk difference and\nrelative risk from the 2x2 table, and Cramér's V as a standardized effect size. Uses the\nreadmission motivating dataset from the beginner layer.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "# ── Motivating dataset: 30-day readmission (2x2 table) ──\ntab <- matrix(c(20, 80, 40, 60), nrow = 2, byrow = TRUE,\n              dimnames = list(c(\"Program\",\"Control\"), c(\"Readmitted\",\"Not_Readmitted\")))\ncat(\"Observed counts:\\n\"); print(tab)\n\n# ── 1. Chi-square test of independence ──\n# correct = FALSE: disable Yates' continuity correction (recommended when expected >= 5)\nchi_res <- chisq.test(tab, correct = FALSE)\ncat(\"\\nChi-square result:\\n\"); print(chi_res)\ncat(\"\\nExpected counts:\\n\"); print(chi_res$expected)\ncat(sprintf(\"Minimum expected count: %.1f\\n\", min(chi_res$expected)))\n\n# ── 2. Fisher exact test (exact p-value; always valid for 2x2 tables) ──\nfish_res <- fisher.test(tab)\ncat(sprintf(\"\\nFisher exact p = %.4f  OR = %.3f\\n\",\n            fish_res$p.value, fish_res$estimate))\n\n# ── 3. Effect sizes: risk difference and relative risk ──\nrisk_a <- tab[\"Program\",\"Readmitted\"] / sum(tab[\"Program\",])    # 20/100\nrisk_b <- tab[\"Control\",\"Readmitted\"] / sum(tab[\"Control\",])    # 40/100\nrd     <- risk_a - risk_b   # -0.20\nrr     <- risk_a / risk_b   # 0.50\ncat(sprintf(\"\\nRisk (Program) = %.3f, Risk (Control) = %.3f\\n\", risk_a, risk_b))\ncat(sprintf(\"Risk difference = %.3f  (%.0f fewer events per 100 patients)\\n\", rd, rd * 100))\ncat(sprintf(\"Relative risk   = %.3f\\n\", rr))\n\n# Simple 95% CI for risk difference (normal approximation)\nn_a <- sum(tab[\"Program\",]); n_b <- sum(tab[\"Control\",])\nse_rd <- sqrt(risk_a * (1 - risk_a) / n_a + risk_b * (1 - risk_b) / n_b)\ncat(sprintf(\"95%% CI for RD: (%.3f, %.3f)\\n\", rd - 1.96 * se_rd, rd + 1.96 * se_rd))\n\n# ── 4. Cramér's V ──\nn_total  <- sum(tab)\nchi2_val <- chi_res$statistic\ncramers_v <- sqrt(chi2_val / (n_total * (min(dim(tab)) - 1)))\ncat(sprintf(\"\\nCramér's V = %.3f\\n\", cramers_v))\n\n# ── 5. 
Guard: switch to Fisher exact when expected counts < 5 ──\nif (min(chi_res$expected) < 5) {\n  cat(\"WARNING: Expected count < 5. Use Fisher exact p-value instead of chi-square.\\n\")\n} else {\n  cat(\"All expected counts >= 5: chi-square approximation valid.\\n\")\n}\n\n# ── 6. Cochran-Armitage trend test for ordered categories (example) ──\n# For ordered dose groups (0,1,2,3) and a binary outcome:\n# prop.trend.test(x = c(events_per_group), n = c(total_per_group))\n# Example: event counts and group sizes across 4 ordered dose levels\nevents <- c(5, 10, 18, 30)\ntotals <- c(50, 50, 50, 50)\nca_res <- prop.trend.test(events, totals)\ncat(\"\\nCochran-Armitage trend test (4 ordered dose groups):\\n\"); print(ca_res)",
        "description": "Chi-square and Fisher exact tests in base R using chisq.test and fisher.test. Shows the\ncorrect=FALSE argument to disable Yates correction, inspection of expected counts via\n$expected, computation of risk difference and relative risk, and the pattern for switching\nto Fisher exact when expected counts are small. Uses the same readmission motivating dataset.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* ── Create motivating dataset ── */\ndata work.readmit;\n  input group $ outcome $ count;\n  datalines;\nProgram Readmitted    20\nProgram Not_Readmitted 80\nControl Readmitted    40\nControl Not_Readmitted 60\n;\nrun;\n\n/* ── 1. Chi-square test, Fisher exact, expected counts, risk difference, relative risk ── */\nproc freq data=work.readmit;\n  weight count;\n  tables group * outcome /\n    chisq          /* Pearson chi-square (and Yates-corrected version in output)          */\n    fisher         /* Fisher exact test (exact p-value; prefer when expected count < 5)   */\n    expected       /* Print expected cell frequencies to verify >= 5 rule                 */\n    nocol norow    /* Suppress row/column % (keep cell % and counts cleaner)              */\n    riskdiff       /* Risk difference (row 1 vs row 2) with 95% CI                        */\n    relrisk;       /* Relative risk and odds ratio with 95% CI                            */\n  /* Note: SAS prints BOTH the uncorrected and Yates-corrected chi-square.                */\n  /* For adequate expected counts (>= 5), use the uncorrected Pearson chi-square row.     */\n  /* When any expected count < 5, use the Fisher exact p-value in the output.             */\nrun;\n\n/* ── 2. Manual verification: expected counts via PROC FREQ output data set ── */\nproc freq data=work.readmit noprint;\n  weight count;\n  tables group * outcome / expected out=work.freq_out outexpected;\nrun;\nproc print data=work.freq_out; run;\n\n/* ── 3. 
Cochran-Armitage trend test for ordered categories ── */\n/* Example: event counts across 4 ordered dose levels (1=low, 4=high)           */\ndata work.trend;\n  input dose events total;\n  datalines;\n1 5  50\n2 10 50\n3 18 50\n4 30 50\n;\nrun;\n/* Expand to individual rows for PROC FREQ (binary outcome per patient) */\ndata work.trend_expanded;\n  set work.trend;\n  do i = 1 to events;   outcome = 1; output; end;\n  do i = 1 to (total - events); outcome = 0; output; end;\n  drop i;\nrun;\nproc freq data=work.trend_expanded;\n  tables dose * outcome / trend;  /* TREND option: Cochran-Armitage trend test */\nrun;",
        "description": "Chi-square and Fisher exact tests via PROC FREQ with the CHISQ, FISHER, and EXPECTED\noptions. Demonstrates manual computation of risk difference and relative risk using a\nDATA step, and the RISKDIFF and RELRISK options in PROC FREQ for direct effect-size\noutput. Shows how to switch to Fisher when expected counts are small. Uses the readmission\nmotivating dataset.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Start[\"Binary or multinomial categorical outcome?<br/>(group x outcome counts in a table)\"] --> Paired{Paired data?<br/>Same patients appear<br/>in both rows}\n  Paired -->|Yes| McNemar[\"Use McNemar's test<br/>(paired 2x2)<br/>NOT chi-square\"]\n  Paired -->|No| Unit{Unit of analysis<br/>correct?}\n  Unit -->|Claim/visit rows<br/>per patient| Fix[\"Aggregate to<br/>one row per patient first\"]\n  Fix --> Unit\n  Unit -->|One row<br/>per patient| ExpCount[\"Compute expected counts<br/>E = (row total × col total) / N\"]\n  ExpCount --> Check{Any expected<br/>count < 5?}\n  Check -->|Yes| Fisher[\"Fisher exact test<br/>(2x2) or collapse<br/>sparse categories\"]\n  Check -->|No| ChiSq[\"Chi-square test<br/>χ² = Σ (O-E)² / E<br/>df = (r-1)(c-1)\"]\n  ChiSq --> Ordered{Columns ordered?<br/>dose levels,<br/>severity grades}\n  Ordered -->|Yes| Trend[\"Cochran-Armitage<br/>trend test<br/>(more powerful)\"]\n  Ordered -->|No| Effect[\"Always report effect size:<br/>Risk difference, RR, OR<br/>NOT just the p-value\"]\n  Fisher --> Effect\n  Trend --> Effect\n  Effect --> LargeN{Large n?<br/>> 10,000 patients}\n  LargeN -->|Yes| SMD[\"Report standardized differences<br/>and absolute risk reduction.<br/>p-value not informative at scale.\"]\n  LargeN -->|No| Done[\"Report chi-square + p,<br/>effect size + 95% CI\"]",
        "caption": "Decision flowchart for chi-square and related tests of categorical association. Key branches: paired data (McNemar), small expected counts (Fisher exact), ordered categories (Cochran-Armitage), and the large-n override (report effect sizes, not just p-values).",
        "alt_text": "Flowchart starting at a binary/multinomial outcome, branching on paired data (McNemar), unit of analysis (aggregate first), expected counts (Fisher when small, chi-square when adequate), ordered categories (Cochran-Armitage), and large n (report effect sizes).",
        "source_type": "illustrative",
        "source_citations": [
          "campbell-2007"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "requires",
        "target_slug": "parametric-vs-nonparametric-tests",
        "notes": "The chi-square test sits within the broader framework of parametric and nonparametric tests for categorical data described in the parent entry. Understanding when chi-square is appropriate versus Fisher exact or McNemar requires the distributional reasoning and test-selection logic covered there."
      },
      {
        "relation_type": "requires",
        "target_slug": "inferential-statistics-foundations",
        "notes": "Hypothesis testing mechanics — null hypothesis, p-value, type-I error, power, and the distinction between statistical significance and practical importance — are prerequisite concepts that must be in place before the chi-square machinery makes sense."
      },
      {
        "relation_type": "see_also",
        "target_slug": "fisher-exact-test",
        "notes": "Fisher's exact test is the correct alternative when expected cell counts are small (any cell below 5), producing an exact p-value from the hypergeometric distribution rather than a chi-squared approximation. Every chi-square analysis should evaluate expected counts and switch to Fisher when the criterion is violated."
      },
      {
        "relation_type": "see_also",
        "target_slug": "mcnemar-test",
        "notes": "McNemar's test is the paired-data analogue of chi-square for 2x2 tables where the same patients appear in both rows (pre/post measurements or matched pairs). Applying chi-square to paired data is one of the most common misapplications in clinical research; recognising the design as paired is the key diagnostic step."
      },
      {
        "relation_type": "see_also",
        "target_slug": "logistic-regression-for-binary-outcomes",
        "notes": "Logistic regression is the covariate-adjusted extension of the chi-square test for binary outcomes. When confounders need to be controlled, chi-square is replaced by logistic regression as the primary analysis; chi-square becomes the unadjusted baseline comparison in a sensitivity table."
      }
    ],
    "aliases": [
      "Pearson chi-square",
      "chi-squared test",
      "test of independence",
      "goodness-of-fit test",
      "chi-square test",
      "Pearson chi-squared",
      "chi-square independence test"
    ],
    "complexity": "foundational",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "claim-adjustments-reversals-denials",
    "name": "Claim Adjustments, Reversals, and Denials",
    "short_definition": "Raw administrative claims files contain multiple transaction types for the same service event — original submissions, voids (full reversals with negative amounts), replacement adjustments (corrected resubmissions that supersede the original), and denials (claims the payer refused, with paid amount of zero) — and any analysis that naively sums or counts all raw rows will double-count costs, mis-measure utilization, and silently corrupt adherence metrics; correct analysis requires \"final-action\" logic that resolves each claim's adjustment lineage to its authoritative version before any aggregation, and must separately decide whether denied claims contribute to the exposure/utilization numerator (the service may have occurred) or the cost numerator (no spend) based on the pre-specified estimand.",
    "long_description": "**What claim adjustments, reversals, and denials are — and why they are the most common silent data bug in claims research**\n\nWhen a hospital, pharmacy, or physician bills an insurer, the first transaction submitted is\nrarely the last. Payers reject claims for coding errors, authorization failures, or missing\ndocumentation. Providers then correct and resubmit. Pharmacies reverse and re-adjudicate\nfills when a patient never picks up a prescription. Each of these administrative events\ngenerates a new claim row in the underlying database, and the raw extract may therefore\ncontain two, three, or more rows representing a single service event — some with positive\namounts, some with negative amounts, and some with zero paid. An analyst who runs a naive\n`GROUP BY patient_id, service_date` sum over raw rows will silently double-count costs,\noverstate utilization, and inflate adherence estimates in ways that are invisible without\ndeliberate QC profiling.\n\n**The four raw claim transaction types**\n\nUnderstanding the four categories is the prerequisite for correct deduplication:\n\n(1) *Original claim* (institutional TOB frequency digit 1; pharmacy claim_type \"B1\" or \"21\"):\nThe initial billing submission. In a well-curated research file this is what analysts intend\nto count — but raw files include originals that were subsequently voided or replaced.\n\n(2) *Void / full reversal* (institutional TOB frequency digit 8; pharmacy claim_type \"B2\"\nreversal transaction): A transaction that cancels the original in its entirety. Institutional\nvoids carry negative charge and payment amounts that exactly mirror the original. Pharmacy\nB2 reversals carry a zero or negative quantity and negative amounts. 
If both the original and\nthe void are present and not resolved, they arithmetically cancel — but only if the analyst\nadds them, not if the analyst counts rows (utilization is double-counted even when costs net\nto zero).\n\n(3) *Replacement / adjustment claim* (institutional TOB frequency digit 7): A corrected\nresubmission that supersedes the original. The replacement may change diagnosis codes,\nbilled amounts, dates of service, or provider identifiers. A naively processed file that\nretains both the original and its replacement will double-count the claim. When a claim goes\nthrough multiple rounds of correction, the lineage is: original → replacement-1 → replacement-2,\nand only the terminal replacement is the authoritative record.\n\n(4) *Denied claim* (any claim type, paid_amount = 0, with a claim adjustment reason code\nsuch as CO-4 \"service inconsistent with modifier\" or CO-97 \"payment is included in another\nservice\"): The payer adjudicated the claim and refused payment. The service may or may not\nhave actually occurred — a denial on a facility claim for a hospital stay that genuinely\nhappened (but was denied for lack of authorization) is informative for utilization but\ncontributes nothing to the cost numerator.\n\n**Institutional claims: the TOB frequency digit and pre-processing variability**\n\nFor institutional claims (UB-04/837I), the Type of Bill (FL4) frequency digit is the key\ndeduplication signal. Frequency 1 = admit-through-discharge (the normal complete bill);\n2, 3, 4 = interim bills for long stays (drop before counting admissions); 7 = replacement\nof a prior claim (keep as the authoritative version, discard all earlier versions for the\nsame claim control number); 8 = void/cancel (drop the void and its predecessor).\n\nThe critical vendor question is whether the research database has already applied this\nlogic. 
MedPAR (Medicare inpatient) is pre-processed to the admit-through-discharge level\nand generally does not contain voids or replacements as separate rows. Medicare Outpatient\nStandard Analytical Files (SAF), commercial databases (MarketScan, Optum), and raw FFS\nclaims from CMS do contain raw adjustment transactions and require analyst-side\ndeduplication. Applying a second round of deduplication to a pre-processed file (e.g.,\nMedPAR) that has already resolved replacements can incorrectly remove valid records —\nalways read the vendor data dictionary before deduplicating.\n\n**Pharmacy claims: the NCPDP B2 reversal transaction and phantom fills**\n\nIn the NCPDP pharmacy billing ecosystem, the transaction type code \"B2\" (or service type\ncode \"B2\" in some payer systems) denotes a reversal of a previously adjudicated fill. A\nsame-day B2 reversal for the same person, NDC, and pharmacy as a prior B1 fill means the\npatient never picked up the prescription — the pharmacist adjudicated the claim, then\nreversed it when the patient did not return. These are *phantom fills* that inflate\nmedication adherence metrics if not removed.\n\nThe phantom-fill adherence inflation pathway: if a research extract captures the original\nfill (B1) but the reversal (B2) arrives in a different claim batch or data period and is\nnot linked and netted, the fill appears as dispensed and its days_supply is added to the\nPDC numerator (days covered). A patient coded as adherent (PDC ≥ 0.80) because of phantom\nfills that were never actually dispensed represents a measurement error that cannot be\ndetected without checking for unmatched B2 reversals. 
The rule: match each B2 to its\nparent B1 on (person_id, NDC, pharmacy_id, fill_date) and remove both; any unmatched B2\n(reversals without a same-batch original) should also be removed.\n\n**Denied claims: the estimand-dependent decision**\n\nWhether a denied claim should count toward the numerator of a research variable is not a\ndata-cleaning question — it is an estimand question that must be pre-specified:\n\n- *For cost estimands* (PPPM, PPPY, budget impact): a denied claim contributes zero\n  paid dollars, but its billed or allowed amount may be nonzero, so including denied\n  claims in sums of those fields adds phantom spend. Exclude claims with paid_amount = 0\n  from cost outcomes.\n- *For utilization estimands* (did the patient attempt to access this service?): a denied\n  claim may legitimately count as an encounter — the patient sought care, the provider\n  submitted a bill, and the service may have occurred even if the payer refused payment.\n  Emergency department denied claims are the paradigmatic example: denial for an\n  out-of-network ED visit does not mean the visit did not happen.\n- *For exposure phenotypes* (did the patient receive this drug or procedure?): pharmacy\n  denials typically mean the fill was rejected at point-of-sale and the drug was not\n  dispensed, so denied pharmacy claims should generally be excluded from exposure counts.\n  Institutional denials for surgical procedures are ambiguous — the surgery may still have\n  been performed if the denial was for a documentation issue resolved post-service.\n\nThe safest approach is to build two analytic flags on each claim: `final_action_flag`\n(excluding voids, retaining only terminal replacements) and `paid_flag` (paid_amount > 0),\nand let the estimand definition determine which filter applies to each variable.\n\n**Final-action logic: deduplication by claim ID lineage**\n\nThe canonical algorithm for resolving a raw claims file to its final-action state:\n\nStep 1 — Identify and remove void claims (TOB freq = 8; pharmacy B2). 
For institutional\nvoids, also remove the predecessor original they reference (tracked via claim control number\nor ICN). For pharmacy, match and remove both the B2 and its parent B1.\n\nStep 2 — For replacement chains (TOB freq = 7 or equivalent commercial claim adjustment\ncodes), keep only the terminal replacement for each claim control number. Drop all earlier\noriginals and intermediate replacements. If the vendor uses a separate \"original_claim_id\"\ncross-reference field, join on that field rather than parsing the claim_id string.\n\nStep 3 — Drop interim bills (TOB freq = 2, 3, 4) for institutional claims; these are\nsub-period bills for long stays subsumed by the final admit-through-discharge claim.\n\nStep 4 — Retain denied claims as a separate analytic flag rather than deleting them\noutright; the estimand determines whether they contribute to the variable.\n\n**Negative-value rows breaking naive sums**\n\nA void claim with a -$10,000 allowed amount will arithmetically cancel the original $10,000\nif both are included in a GROUP BY sum. Three failure modes remain: (1) row counts still\ndouble even when costs net to zero, inflating utilization; (2) if the void and the\nreplacement ($11,000) are present but the original is missing (partial extract or lag),\nthe net sum is $1,000 not $11,000 — a cost undercount; (3) if the original and the\nreplacement are present but the void is not (partial extract or lag), the naive sum is\n$21,000 — a cost overcount. Negative amounts in any numeric\naggregation on claims are a diagnostic signal that final-action processing has not been\napplied. QC step: check for any sum(allowed_amount) < 0 at the patient-service level before\nproceeding.\n\n**Vendor differences: pre-processed vs raw files**\n\n| Database | Pre-processed to final action? 
| Notes |\n|---|---|---|\n| Medicare MedPAR | Yes (inpatient) | Deduplication applied; voids and replacements resolved |\n| Medicare Outpatient SAF | Partial | Claim status field (CLM_QUERY_CODE) identifies final action; analyst must filter |\n| Optum / MarketScan commercial | Partial | Vendor-specific; read the data dictionary; some files include a \"claim_status\" or \"pay_status\" field |\n| Raw CMS 5% sample / 100% Medicare FFS | No | Full adjustment history present; analyst must apply frequency-digit deduplication |\n| Medicaid T-MSIS/MAX | Highly variable | State-specific preprocessing; treat as raw until verified |\n\n**QC recipes**\n\nRun these checks before any aggregation on a new claims extract:\n- Count rows with paid_amount < 0 or allowed_amount < 0 by year and claim type — any negative\n  amounts indicate unresolved voids or reversals.\n- Compute the reversal rate (void/reversal claims ÷ total claims) by file-year — a sudden\n  increase signals extract-period lag where reversals from the prior period arrived late.\n- For pharmacy files: count B2 (or reversal claim_type) rows as a fraction of all fills by\n  NDC or drug class — high B2 rates (> 5%) in a drug class may indicate formulary barriers\n  or patient non-pickup that inflates apparent dispensing volume.\n- Join claim_id to original_claim_id and count orphaned replacements (replacements without\n  a matching original in the extract) — these indicate claim receipt lag and should be\n  treated as the authoritative record for that service.\n- For pharmacy adherence metrics: compare PDC computed before and after B2 removal — a\n  material difference (> 2 percentage points) indicates phantom-fill inflation that requires\n  correction.\n\n**Pros, cons, and trade-offs**\n\n*Applying final-action logic (pros)*: eliminates double-counted costs and utilization;\nproduces a defensible analytic file with one row per service event; enables correct cost\nsums and adherence metrics; is the standard expected in 
regulatory and HTA submissions.\n\n*Applying final-action logic (cons)*: requires knowledge of the vendor's preprocessing\nstate — over-deduplicating a pre-processed file removes valid records; requires a\ncross-reference field (claim control number, ICN, or original_claim_id) that is not always\npopulated in commercial databases; pharmacy B2 matching requires same-batch or multi-period\nlinking that can be complex in large extracts.\n\n*Not applying final-action logic (the silent-bug scenario)*: costs may be double-counted\n(original + replacement) or understated (void without replacement resolved); utilization\ncounts are inflated; adherence metrics are inflated by phantom fills; and none of these\nerrors are visible in the final analytic table without deliberate QC.\n\n*Denied claims trade-off*: including denied claims in cost analyses overstates spend;\nexcluding them from utilization analyses misses genuine service attempts. The choice must\nbe pre-specified in the SAP with estimand justification.\n\n**When to use**\n\nApply final-action claim deduplication as the first data-management step in every claims\nanalysis, before building cohorts, computing costs, or deriving adherence metrics. 
Apply\nit to: (1) all raw institutional claims files that have not been vendor-pre-processed\n(any file containing TOB frequency digit 7 or 8 rows); (2) all pharmacy claims files\nbefore computing days_supply-based coverage or adherence (PDC, MPR); (3) all commercial\nclaims files where the vendor does not explicitly document that adjustments are resolved.\nThe reversal-rate QC check should be run on every new annual data pull, because extract\nperiods may shift the batch boundary and change which reversals are captured.\n\n**When NOT to use — and when skipping this step is actively misleading**\n\n- *Do not re-deduplicate MedPAR or other pre-processed files*: applying frequency-digit\n  logic to a file that has already resolved adjustment chains will incorrectly remove valid\n  admit-through-discharge claims whose claim_id ends in \"R1\" or similar suffixes from prior\n  replacement rounds. Always confirm preprocessing state before applying deduplication.\n- *Do not assume all negative-amount rows are voids*: some commercial vendors encode\n  coordination-of-benefits (COB) adjustments as negative lines on the same claim; these\n  are not voids and should not be dropped. Read the vendor data dictionary.\n- *Do not exclude all denied claims from utilization without estimand justification*: for\n  ED visits, denied claims represent real encounters. Excluding them from the utilization\n  numerator will undercount ED utilization in analyses of high-cost patients.\n- *Do not apply pharmacy B2 removal without matching to the parent B1*: removing B2 rows\n  without also removing the matched B1 leaves an orphaned \"dispensed\" fill in the data —\n  the opposite of the intended correction. The pair must be removed together.\n- *Do not skip this step for \"small\" analyses*: the error is not proportional to sample\n  size. 
A single high-cost original + replacement pair (e.g., a $10,000 original and an\n  $11,000 replacement) inflates that patient's cost by $10,000 regardless of cohort size.\n  In a 50-patient pilot, this single error raises the mean per-patient cost by $200\n  ($10,000 / 50).\n\n**Interpreting the output**\n\nThe canonical QC output of final-action processing is a reconciliation table: raw row\ncount, rows removed by category (voids, interim bills, superseded originals, denials\nflagged), and final-action row count with cost totals.\n\nFor the worked example (five raw rows for one admission with allowed amounts of $10,000,\n-$10,000, $11,000, $2,000, and $0), the reconciliation produces:\nnaive sum = 13,000 vs final-action allowed = 11,000 — a $2,000 discrepancy.\n\n*(1) Formal interpretation.* The naive sum of the `allowed_amt` column across all five\nraw rows is $13,000. This figure is incorrect as a cost measure. Within the sum, the void\n(-$10,000) cancels the original ($10,000) and the replacement ($11,000) is counted once,\nso the $2,000 excess comes entirely from the denied line, which enters at face value\ndespite zero reimbursement. The row count is also wrong: five raw rows describe a single\nadmission. After final-action deduplication: the void removes the original (net: 0);\nthe replacement is the authoritative cost record ($11,000); the denied line is excluded\nfrom the cost numerator (paid_amount = 0). The correct cost numerator is $11,000. The\ndenied $2,000 line may count as a utilization event depending on the pre-specified estimand\nbut contributes nothing to cost.\n\n*(2) Practical interpretation.* For a payer or HEOR analyst, the $2,000 gap between the\nnaive sum ($13,000) and the final-action allowed amount ($11,000) represents an 18%\noverstatement of this patient's costs ($2,000 / $11,000) — a systematic error that scales\nacross a full cohort and produces inflated PPPM rates, biased incremental cost estimates,\nand budget-impact models that overstate the financial burden. 
The denied-claim decision (include in\nutilization, exclude from cost) must be documented in the SAP before unblinding results,\nbecause changing it post-hoc constitutes selective outcome reporting.",
    "primary_category": "Data_Standard",
    "tags": [
      "claims",
      "data-management",
      "deduplication",
      "claims-adjustment",
      "reversal",
      "void",
      "denial",
      "final-action",
      "pharmacy-claims",
      "B2-reversal",
      "phantom-fill",
      "adherence",
      "data-quality"
    ],
    "applies_to_study_types": [
      "claims_analysis",
      "cohort_retrospective",
      "comparative_effectiveness",
      "utilization_study",
      "cost_analysis",
      "drug_utilization"
    ],
    "data_sources": [
      "claims",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1016/j.jclinepi.2004.10.012",
        "url": "https://doi.org/10.1016/j.jclinepi.2004.10.012",
        "citation_text": "Schneeweiss S, Avorn J. A review of uses of health care utilization databases for epidemiologic research on therapeutics. Journal of Clinical Epidemiology. 2005;58(4):323-337.",
        "year": 2005,
        "authors_short": "Schneeweiss & Avorn",
        "notes": "Canonical statement of what administrative claims databases can and cannot measure; the section on data quality covers the structural limitations arising from billing byproduct data, including the adjustment and correction cycles that produce multiple raw rows per service event."
      },
      {
        "role": "explain",
        "doi": "10.1177/1062860606288774",
        "url": "https://doi.org/10.1177/1062860606288774",
        "citation_text": "Tyree PT, Lind BK, Lafferty WE. Challenges of using medical insurance claims data for utilization analysis. American Journal of Medical Quality. 2006;21(4):269-275.",
        "year": 2006,
        "authors_short": "Tyree et al.",
        "notes": "Systematic primer on the structural limitations of administrative claims for utilization analysis, including the observability constraint, enrollment gaps, and the billing-byproduct nature of claims records — the foundational framing for why adjustment and reversal transactions appear in raw extracts and must be resolved before any analysis."
      }
    ],
    "plain_language_summary": "When a hospital or pharmacy bills an insurer, the billing system can submit multiple transactions for the same service — an original bill, then a cancellation, then a corrected resubmission — and if a researcher adds up all those rows without cleaning them first, they get the wrong total. This concept covers the four types of raw claim transactions (originals, voids, replacements, and denials), how to reduce them to the single authoritative record for each service event, and the special case of pharmacy \"phantom fills\" — prescriptions that were reversed on the same day because the patient never picked them up, but which appear as genuine fills and inflate medication adherence calculations if not removed.",
    "key_terms": [
      {
        "term": "void claim",
        "definition": "A claim transaction that cancels a previously submitted claim in its entirety, usually carrying negative dollar amounts equal to the original; for institutional claims, the Type of Bill frequency digit is 8."
      },
      {
        "term": "replacement claim",
        "definition": "A corrected resubmission that supersedes an original claim, typically changing billed amounts, diagnosis codes, or provider information; for institutional claims, the Type of Bill frequency digit is 7."
      },
      {
        "term": "denied claim",
        "definition": "A claim the payer adjudicated and refused to pay, resulting in a paid amount of zero and a claim adjustment reason code; the service may still have occurred even though no reimbursement was made."
      },
      {
        "term": "B2 reversal",
        "definition": "The NCPDP pharmacy transaction code indicating a prescription fill was reversed after initial adjudication, usually because the patient never picked up the medication; these reversals must be matched and removed to avoid phantom fills inflating adherence metrics."
      },
      {
        "term": "final-action claim",
        "definition": "The resolved, authoritative version of a claim after all voids, replacements, and adjustments have been applied, leaving one definitive record per service event."
      },
      {
        "term": "claim lineage",
        "definition": "The chain of claim transactions linking an original submission to its subsequent voids and replacements, tracked via a claim control number or internal claim identifier."
      }
    ],
    "worked_example": {
      "scenario": "A researcher is computing total allowed costs for a single inpatient admission (patient 2042) to use in a PPPM analysis. She pulls the raw institutional claims extract and finds five rows for this admission. Before summing the allowed amounts, she needs to identify which rows represent the final authoritative cost record, which are administrative corrections that should be removed, and how to handle the one denied line when building cost versus utilization variables.",
      "dataset": {
        "caption": "Five raw claim rows for patient 2042 (four tied to one inpatient admission plus one denied outpatient claim), as they appear in the raw extract before final-action deduplication. allowed_amt is the negotiated allowed amount; denial_code is blank when the claim was not denied.",
        "columns": [
          "claim_id",
          "claim_type",
          "tob_freq",
          "allowed_amt",
          "paid_amt",
          "denial_code"
        ],
        "rows": [
          [
            "IP-2023-001",
            "original",
            "1",
            10000,
            8500,
            ""
          ],
          [
            "IP-2023-001-V",
            "void",
            "8",
            -10000,
            -8500,
            ""
          ],
          [
            "IP-2023-001-R",
            "replacement",
            "7",
            11000,
            9350,
            ""
          ],
          [
            "OP-2023-002",
            "denied",
            "1",
            2000,
            0,
            "CO-4"
          ],
          [
            "IP-2023-001-LC",
            "late_charge",
            "1",
            0,
            0,
            ""
          ]
        ]
      },
      "steps": [
        "Row 1 (IP-2023-001): original claim, TOB frequency 1, allowed_amt = $10,000. This is the first submission. Because a replacement (Row 3) exists for this claim, this original will be discarded in final-action processing.",
        "Row 2 (IP-2023-001-V): void claim, TOB frequency 8, allowed_amt = -$10,000. This cancels the original (Row 1). Both the void and its predecessor original are removed. After this step, the original and the void net to zero and neither remains in the final-action file.",
        "Row 3 (IP-2023-001-R): replacement claim, TOB frequency 7, allowed_amt = $11,000. This is the corrected resubmission that supersedes Row 1. It is the terminal member of the claim lineage and is KEPT as the authoritative record. Allowed cost = $11,000.",
        "Row 4 (OP-2023-002): denied claim, TOB frequency 1, allowed_amt = $2,000, paid_amt = $0, denial_code = CO-4 (procedure code inconsistent with the modifier used). The payer reimbursed nothing. For a COST variable: exclude the row; its $2,000 allowed_amt was never paid, so it must not enter the cost numerator. For a UTILIZATION variable (did the patient present for this outpatient service?): it may count as an encounter, depending on the estimand.",
        "Row 5 (IP-2023-001-LC): late charge, TOB frequency 1, all amounts = $0. This is a zero-dollar administrative line (e.g., a lab result that arrived after discharge billing); it contributes nothing to any variable and is retained but irrelevant.",
        "Naive sum of allowed_amt across all five rows (the wrong answer) = 10000 - 10000 + 11000 + 2000 + 0 = 13000. The original ($10,000) and its void (-$10,000) happen to net to zero here, so the entire error comes from adding the denied line at face value ($2,000) despite zero reimbursement. The naive sum is still unsafe in general: if the void were missing from the extract (for example, through extract lag), the superseded original would be double-counted alongside the replacement ($11,000).",
        "Final-action allowed (the correct cost numerator): keep only the replacement Row 3. Void and original cancel and are removed. Denied Row 4 excluded from cost (paid = $0). Late-charge Row 5 contributes $0. Final-action cost = $11,000. Discrepancy versus naive sum = 13000 - 11000 = 2000, a $2,000 overstatement (about 18% of the true cost)."
      ],
      "result": "Naive sum of raw allowed_amt = 10000 - 10000 + 11000 + 2000 + 0 = 13000. Final-action allowed (cost numerator) = 11000 (replacement only; void negates original; denied line excluded from cost). Overstatement if naive sum used = 13000 - 11000 = 2000. Utilization estimate: 1 inpatient admission (the replacement episode) + 1 potential outpatient service attempt (the denied OP-2023-002 claim), subject to estimand."
    },
    "prerequisites": [
      "ub-04-institutional-claim-fields",
      "claims-analysis"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Institutional claim final-action deduplication (TOB frequency digit)",
        "description": "For UB-04/837I institutional claims, the standard three-step deduplication protocol is: (1) drop voids (TOB frequency 8) and remove their predecessor originals using the claim control number cross-reference; (2) for replacement chains (TOB frequency 7), keep only the terminal replacement and drop all earlier versions for the same original claim control number; (3) drop interim bills (TOB frequency 2, 3, 4). Run these steps before any episode construction, cost aggregation, or readmission counting.",
        "edge_cases": [
          "MedPAR is already deduplicated to admit-through-discharge; applying frequency-digit deduplication a second time will remove valid records whose claim_id reflects prior replacement history. Always verify the preprocessing state of the file.",
          "A replacement that changes the principal diagnosis or discharge status is a legitimate coding correction — the replacement version should be kept even when the analytic classification of the episode changes.",
          "Voids without a corresponding replacement in the same extract period (partial extract lag) produce orphaned negative amounts; treat these as evidence of a reversal and flag the original as unresolvable rather than summing the negative amount naively."
        ],
        "data_source_notes": "MedPAR: pre-processed; no action needed. Medicare Outpatient SAF: filter on CLM_QUERY_CODE or CLM_RLT_COND_CD for final-action status. Commercial databases: read vendor documentation on whether claim_status or pay_status fields identify adjustment transactions."
      },
      {
        "name": "Pharmacy B2 reversal deduplication (NCPDP)",
        "description": "For NCPDP pharmacy claims, match each B2 reversal transaction to its parent B1 fill on (person_id, NDC or drug_class, pharmacy_id, fill_date). Remove both the B2 and the matched B1 from the analytic file, as the prescription was never dispensed. Unmatched B2 reversals (where the matching B1 is in a prior extract period) should also be removed as they represent reversals of phantom fills from outside the current window. Run this step before computing days_supply coverage, PDC, MPR, or any adherence metric.",
        "edge_cases": [
          "High B2 rates in specialty drugs (e.g., biologics, specialty oncology) may indicate prior-authorization denials at point of sale; document the reversal rate by drug class as a QC metric and include it in the data-limitations section of the study.",
          "Same-day B2 reversals are unambiguous phantom fills. B2 reversals for fills from a prior day or week are more complex: they may reflect late patient pickup cancellations or payer take-back adjustments rather than non-dispense events.",
          "Some pharmacy claims databases pre-net B2 reversals before the research extract is delivered; ask the data vendor explicitly whether reversal transactions are included."
        ],
        "data_source_notes": "MarketScan pharmacy files: contain the full reversal history; apply B2 matching before any adherence computation. Optum pharmacy files: check for a \"reversal_flag\" or \"claim_status\" field that pre-flags reversals. Medicare Part D PDE (Prescription Drug Event) files: largely pre-adjudicated; Part D plan-submitted PDEs have already been through CMS adjudication but may still contain small-N adjustment records — check for negative days_supply values as a diagnostic."
      },
      {
        "name": "Denied claim handling (estimand-dependent)",
        "description": "Denied claims (paid_amount = 0, with a claim adjustment reason code) must be handled differently depending on the research variable. For cost variables: exclude denied claims from all cost sums and PPPM calculations (they contribute zero reimbursed dollars, whatever billed or allowed amount the row carries). For utilization variables (counts of service encounters): include denied claims if the study question is whether the patient sought care, not whether the payer reimbursed it; exclude if the question is whether reimbursed care occurred. For exposure phenotypes (drug dispensed, procedure performed): exclude pharmacy denials (drug not dispensed); apply clinical judgment for institutional procedure denials (procedure may or may not have been performed). Document the decision in the SAP.",
        "edge_cases": [
          "Emergency department denied claims are the paradigmatic inclusion case: the patient presented, received care, and the denial was for a coverage or authorization reason, not because the encounter was fictitious.",
          "Denied claims for duplicate billing (reason code CO-18) represent a billing error, not a separate service; exclude from both cost and utilization.",
          "Denied claims can be used as a validation cross-check: if the denial rate is unusually high for a specific NDC or CPT code in one arm of a comparative study, it may indicate differential prior-authorization burden that is a potential confound."
        ],
        "data_source_notes": "Claims files vary in how denials are coded; check for pay_status, claim_status, or denial_code fields in the vendor dictionary. In some commercial databases, denied claims are suppressed from the research extract entirely — in that case, the absence of a denial file is itself informative and should be documented."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "ub-04-institutional-claim-fields",
        "pros_of_this": "This entry specifically addresses the multi-row adjustment problem that the UB-04 entry introduces at the deduplication step; it provides the full algorithm, QC recipes, and pharmacy-specific coverage (B2 reversals, PDC inflation) not covered by the field-anatomy entry.",
        "cons_of_this": "The TOB frequency digit semantics, claim control number linkage, and Form Locator structure must be understood from the UB-04 entry before the deduplication logic here is fully interpretable; this entry cannot stand alone without that prerequisite.",
        "when_to_prefer": "Use the UB-04 entry for field-level semantics; use this entry for the practical data management algorithm applied before analysis. Both are needed for a complete institutional claims data workflow."
      },
      {
        "compared_to": "healthcare-costs-pppm-pppy-pmpm",
        "pros_of_this": "Resolves the upstream data integrity problem (duplicate rows, phantom fills, denied-claim inclusion) that, if not corrected, silently corrupts the cost numerator that the PPPM entry computes. Final-action deduplication is a prerequisite, not an alternative.",
        "cons_of_this": "This entry does not cover the person-time denominator, two-part/gamma cost models, place-of-service decomposition, or outlier handling — those belong to the cost-measurement entry.",
        "when_to_prefer": "Apply this entry's deduplication first, then hand the clean file to the PPPM/PPPY pipeline. Skipping this step and computing PPPM on raw claims produces systematically inflated cost rates."
      },
      {
        "compared_to": "claims-analysis",
        "pros_of_this": "Drills into the specific mechanical failure mode (adjustment transactions, reversals, denials) that the broad claims-analysis entry mentions only briefly; provides the implementation-level algorithm, QC checks, and pharmacy-specific B2 logic.",
        "cons_of_this": "The broader claims-analysis entry covers enrollment, time zero, phenotypes, confounding control, and the full analytic pipeline; this entry covers only the raw-file preprocessing step.",
        "when_to_prefer": "Use the claims-analysis entry for study design; use this entry whenever the analytic plan requires explicit documentation that raw files were deduplicated to final-action before any aggregation (which it always should)."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "For institutional claims: apply the three-step TOB-frequency deduplication (drop voids and their originals; keep terminal replacements; drop interim bills) before any cohort construction, episode building, or cost aggregation. For pharmacy: match and remove B2 reversal pairs before computing days_supply coverage or adherence metrics. For denied claims: create a boolean final_paid flag (paid_amount > 0) to allow estimand-specific inclusion/exclusion. Run QC checks for negative allowed_amount rows and high reversal rates by file-year and drug class before signing off on the analytic dataset. Document the preprocessing state of each source file (vendor pre-processed vs. raw) in the methods section.",
      "linked": "When linking institutional claims to pharmacy or professional claims, apply deduplication independently within each file type before the cross-file join. B2 reversal matching must be completed on the pharmacy file before any adherence or exposure derivation. Check that the institutional final-action episodes align with the pharmacy refill timeline — a pharmacy fill date that falls within a voided institutional claim period is a data-quality signal worth investigating."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\n\n# ── 1. INSTITUTIONAL CLAIMS FINAL-ACTION DEDUPLICATION ──────────────────────────────────\n#\n#  Input DataFrame columns:\n#    claim_id          : str  (unique row ID; replacement rows may share a prefix with original)\n#    original_claim_id : str  (None/NaN for originals; filled for voids and replacements)\n#    tob_freq          : str  ('1'=final, '2'/'3'/'4'=interim, '7'=replacement, '8'=void)\n#    allowed_amt       : float\n#    paid_amt          : float\n#    denial_code       : str  (empty/None when paid)\n#\n#  Returns: deduplicated DataFrame with one row per final-action claim.\n\ndef deduplicate_institutional_claims(df: pd.DataFrame) -> pd.DataFrame:\n    df = df.copy()\n    df[\"tob_freq\"] = df[\"tob_freq\"].astype(str).str.strip()\n\n    # QC: report negative amounts before any action (should only be voids)\n    neg = df[df[\"allowed_amt\"] < 0]\n    if not neg.empty:\n        print(f\"QC: {len(neg)} rows with negative allowed_amt (expected = void rows only)\")\n\n    # Step 1: Identify void claims (freq=8) and their predecessors.\n    void_original_ids = set(\n        df.loc[df[\"tob_freq\"] == \"8\", \"original_claim_id\"].dropna()\n    )\n    # Remove void rows and the originals they cancel\n    df = df[~df[\"tob_freq\"].isin([\"8\"])]\n    df = df[~df[\"claim_id\"].isin(void_original_ids)]\n\n    # Step 2: For replacement chains (freq=7), keep only the terminal replacement.\n    # Group by original_claim_id; keep the replacement row, drop superseded originals.\n    replacement_ids = set(\n        df.loc[df[\"tob_freq\"] == \"7\", \"original_claim_id\"].dropna()\n    )\n    df = df[~(\n        df[\"claim_id\"].isin(replacement_ids) & ~df[\"tob_freq\"].isin([\"7\"])\n    )]\n\n    # Step 3: Drop interim bills (freq 2, 3, 4) — subsumed by the final bill.\n    df = df[~df[\"tob_freq\"].isin([\"2\", \"3\", \"4\"])]\n\n    print(f\"QC: {len(df)} final-action rows after deduplication.\")\n    return 
df.reset_index(drop=True)\n\n\n# ── 2. PHARMACY B2 REVERSAL REMOVAL ────────────────────────────────────────────────────\n#\n#  Input DataFrame columns:\n#    person_id       : str/int\n#    ndc             : str  (11-digit NDC)\n#    pharmacy_id     : str\n#    fill_date       : datetime\n#    days_supply     : int\n#    claim_type      : str  ('B1' = dispensed fill; 'B2' = reversal)\n#    allowed_amt     : float\n#\n#  A B2 reversal on the same (person_id, ndc, pharmacy_id, fill_date) cancels the B1.\n#  Both are removed. Unmatched B2s (no matching B1) are also removed — they represent\n#  reversals from prior extract periods.\n\ndef remove_pharmacy_b2_reversals(rx: pd.DataFrame) -> pd.DataFrame:\n    rx = rx.copy()\n    key = [\"person_id\", \"ndc\", \"pharmacy_id\", \"fill_date\"]\n\n    b2 = rx[rx[\"claim_type\"] == \"B2\"].set_index(key).index\n    b1_matched = rx[\n        (rx[\"claim_type\"] == \"B1\") & rx.set_index(key).index.isin(b2)\n    ].index\n\n    n_b2 = (rx[\"claim_type\"] == \"B2\").sum()\n    n_matched_b1 = len(b1_matched)\n    print(\n        f\"QC: {n_b2} B2 reversal rows; {n_matched_b1} matched B1 fills removed \"\n        f\"as phantom fills. Reversal rate = {n_b2 / max(len(rx), 1):.1%}\"\n    )\n\n    # Remove all B2 rows and matched B1 rows\n    b2_idx = rx[rx[\"claim_type\"] == \"B2\"].index\n    remove_idx = b2_idx.union(b1_matched)\n    rx = rx.drop(index=remove_idx)\n    return rx.reset_index(drop=True)\n\n\n# ── 3. 
DENIED CLAIM FLAG (estimand-dependent inclusion/exclusion) ──────────────────────\n#\n#  Adds a boolean column so downstream code can apply its own filter:\n#    final_paid = True  -> claim was reimbursed (include in COST numerators)\n#    final_paid = False -> nothing was paid: a denied claim or a zero-dollar line such as a\n#                          late charge (include in UTILIZATION numerators if warranted)\n\ndef flag_denied_claims(df: pd.DataFrame) -> pd.DataFrame:\n    df = df.copy()\n    df[\"final_paid\"] = df[\"paid_amt\"] > 0\n    n_unpaid = (~df[\"final_paid\"]).sum()\n    print(f\"QC: {n_unpaid} rows with paid_amt <= 0 flagged (denied or zero-dollar lines). \"\n          f\"Exclude from cost sums; apply estimand rule for utilization counts.\")\n    return df\n\n\n# ── EXAMPLE USAGE: worked-example dataset ─────────────────────────────────────────────\n\nif __name__ == \"__main__\":\n    raw = pd.DataFrame({\n        \"claim_id\":          [\"IP-2023-001\", \"IP-2023-001-V\", \"IP-2023-001-R\",\n                               \"OP-2023-002\", \"IP-2023-001-LC\"],\n        \"original_claim_id\": [None, \"IP-2023-001\", \"IP-2023-001\", None, \"IP-2023-001\"],\n        \"tob_freq\":          [\"1\", \"8\", \"7\", \"1\", \"1\"],\n        \"allowed_amt\":       [10000.0, -10000.0, 11000.0, 2000.0, 0.0],\n        \"paid_amt\":          [8500.0, -8500.0, 9350.0, 0.0, 0.0],\n        \"denial_code\":       [\"\", \"\", \"\", \"CO-4\", \"\"],\n    })\n\n    print(f\"Naive sum of allowed_amt: {raw['allowed_amt'].sum():.0f}\")  # 13000\n    clean = deduplicate_institutional_claims(raw)\n    clean = flag_denied_claims(clean)\n    cost_sum = clean.loc[clean[\"final_paid\"], \"allowed_amt\"].sum()\n    print(f\"Final-action cost numerator (paid claims only): {cost_sum:.0f}\")  # 11000",
        "description": "Final-action claim deduplication for raw institutional claims and pharmacy B2 reversal\nremoval. Operates on pandas DataFrames with the column schema matching the worked example.\nThree functions: (1) deduplicate_institutional_claims — applies TOB frequency digit\nlogic to resolve originals, voids, and replacements; (2) remove_pharmacy_b2_reversals —\nmatches and removes B2 reversal pairs from pharmacy claims; (3) flag_denied_claims —\nadds a boolean column so cost vs. utilization estimands can apply their own filter.\nQC outputs print to stdout; adapt logging as needed.",
        "dependencies": [
          "pandas"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\n\n# ── 1. INSTITUTIONAL CLAIMS FINAL-ACTION DEDUPLICATION ──────────────────────────────────\n\ndeduplicate_institutional_claims <- function(dt) {\n  dt <- copy(dt)\n  dt[, tob_freq := as.character(tob_freq)]\n\n  # QC: count negative allowed amounts (expected only for voids)\n  n_neg <- dt[allowed_amt < 0, .N]\n  cat(sprintf(\"QC: %d rows with negative allowed_amt (expected = void rows only)\\n\", n_neg))\n\n  # Step 1: remove voids (freq=8) and their original predecessors\n  void_original_ids <- dt[tob_freq == \"8\" & !is.na(original_claim_id), original_claim_id]\n  dt <- dt[tob_freq != \"8\"]\n  dt <- dt[!claim_id %chin% void_original_ids]\n\n  # Step 2: for replacements (freq=7), drop the superseded originals\n  replacement_ids <- dt[tob_freq == \"7\" & !is.na(original_claim_id), original_claim_id]\n  dt <- dt[!(claim_id %chin% replacement_ids & tob_freq != \"7\")]\n\n  # Step 3: drop interim bills (freq 2, 3, 4)\n  dt <- dt[!tob_freq %chin% c(\"2\", \"3\", \"4\")]\n\n  cat(sprintf(\"QC: %d final-action rows after deduplication.\\n\", nrow(dt)))\n  dt\n}\n\n\n# ── 2. PHARMACY B2 REVERSAL REMOVAL ────────────────────────────────────────────────────\n\nremove_pharmacy_b2_reversals <- function(rx) {\n  rx <- copy(rx)\n  key_cols <- c(\"person_id\", \"ndc\", \"pharmacy_id\", \"fill_date\")\n\n  b2_keys <- rx[claim_type == \"B2\", .SD, .SDcols = key_cols]\n  # Flag B1 rows that match a B2 reversal key\n  rx[, row_id := .I]\n  b1_matched_ids <- rx[\n    claim_type == \"B1\",\n  ][b2_keys, on = key_cols, nomatch = 0L, row_id]\n\n  n_b2 <- rx[claim_type == \"B2\", .N]\n  cat(sprintf(\n    \"QC: %d B2 reversals; %d matched B1 phantom fills removed. Reversal rate = %.1f%%\\n\",\n    n_b2, length(b1_matched_ids), 100 * n_b2 / nrow(rx)\n  ))\n\n  b2_ids  <- rx[claim_type == \"B2\", row_id]\n  drop_ids <- union(b2_ids, b1_matched_ids)\n  rx[!row_id %in% drop_ids][, row_id := NULL]\n}\n\n\n# ── 3. 
DENIED CLAIM FLAG ─────────────────────────────────────────────────────────────\n\n# final_paid is FALSE for any row with paid_amt <= 0: denied claims and zero-dollar\n# lines such as late charges.\nflag_denied_claims <- function(dt) {\n  dt <- copy(dt)\n  dt[, final_paid := paid_amt > 0]\n  n_unpaid <- dt[final_paid == FALSE, .N]\n  cat(sprintf(\n    \"QC: %d rows with paid_amt <= 0 (denied or zero-dollar). Exclude from cost sums; estimand governs utilization.\\n\",\n    n_unpaid\n  ))\n  dt\n}\n\n\n# ── WORKED EXAMPLE ────────────────────────────────────────────────────────────────────\n\nraw <- data.table(\n  claim_id          = c(\"IP-2023-001\",\"IP-2023-001-V\",\"IP-2023-001-R\",\"OP-2023-002\",\"IP-2023-001-LC\"),\n  original_claim_id = c(NA, \"IP-2023-001\", \"IP-2023-001\", NA, \"IP-2023-001\"),\n  tob_freq          = c(\"1\",\"8\",\"7\",\"1\",\"1\"),\n  allowed_amt       = c(10000, -10000, 11000, 2000, 0),\n  paid_amt          = c(8500, -8500, 9350, 0, 0),\n  denial_code       = c(\"\",\"\",\"\",\"CO-4\",\"\")\n)\n\ncat(sprintf(\"Naive sum of allowed_amt: %.0f\\n\", sum(raw$allowed_amt)))  # 13000\n\nclean <- deduplicate_institutional_claims(raw)\nclean <- flag_denied_claims(clean)\ncost_sum <- clean[final_paid == TRUE, sum(allowed_amt)]\ncat(sprintf(\"Final-action cost numerator (paid claims only): %.0f\\n\", cost_sum))  # 11000",
        "description": "Final-action institutional claim deduplication and pharmacy B2 reversal removal in R\nusing data.table. Mirrors the Python logic: three functions for institutional dedup,\npharmacy B2 removal, and denied-claim flagging. Includes the worked-example dataset\nto verify the naive-sum vs. final-action arithmetic.",
        "dependencies": [
          "data.table"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* ── Create worked-example dataset ── */\ndata work.raw_claims;\n  length claim_id $20 original_claim_id $20 tob_freq $1 denial_code $10;\n  input claim_id $ original_claim_id $ tob_freq $ allowed_amt paid_amt denial_code $;\n  datalines;\nIP-2023-001    .                 1  10000  8500  .\nIP-2023-001-V  IP-2023-001      8 -10000 -8500  .\nIP-2023-001-R  IP-2023-001      7  11000  9350  .\nOP-2023-002    .                 1   2000     0  CO-4\nIP-2023-001-LC IP-2023-001      1      0     0  .\n;\nrun;\n\n/* QC: Naive sum of allowed_amt (should = 13000 before deduplication) */\nproc sql;\n  select sum(allowed_amt) as naive_sum label='Naive sum (pre-dedup)' from work.raw_claims;\nquit;\n/* Expected output: naive_sum = 13000 */\n\n/* ── 1. INSTITUTIONAL FINAL-ACTION DEDUPLICATION ── */\n\n/* Step 1a: Identify void predecessors to remove */\nproc sql;\n  create table work.void_originals as\n  select original_claim_id as claim_id\n  from work.raw_claims\n  where tob_freq = '8' and original_claim_id ne '.';\nquit;\n\n/* Step 1b: Identify superseded originals (have a replacement, freq=7) */\nproc sql;\n  create table work.superseded as\n  select original_claim_id as claim_id\n  from work.raw_claims\n  where tob_freq = '7' and original_claim_id ne '.';\nquit;\n\n/* Step 2: Build final-action file — remove voids, superseded originals, interim bills */\nproc sql;\n  create table work.final_action as\n  select r.*\n  from work.raw_claims r\n  where r.tob_freq not in ('8','2','3','4')                   /* drop voids + interim */\n    and r.claim_id not in (select claim_id from work.void_originals)    /* drop voided originals */\n    and not (r.claim_id in (select claim_id from work.superseded)       /* drop superseded originals */\n             and r.tob_freq ne '7');                          /* keep the replacement itself */\nquit;\n\n/* QC: Row count and cost sum after deduplication */\nproc sql;\n  select count(*) as final_rows, sum(allowed_amt) as 
final_allowed_sum\n  from work.final_action;\nquit;\n/* Expected: final_rows = 3 (replacement + denied + late_charge), final_allowed_sum = 13000\n   NOTE: 13000 because denied ($2000) and late ($0) are still present.           */\n\n/* ── 2. FLAG DENIED CLAIMS (estimand-specific cost vs utilization filter) ── */\ndata work.final_action;\n  set work.final_action;\n  final_paid = (paid_amt > 0);   /* 1 = reimbursed; 0 = denied or zero-dollar line */\nrun;\n\n/* Cost numerator = only final_paid = 1 claims */\nproc sql;\n  select sum(allowed_amt) as final_action_cost\n  from work.final_action\n  where final_paid = 1;\nquit;\n/* Expected: final_action_cost = 11000 (replacement only) */\n\n/* QC: Unpaid row count (denied claims plus zero-dollar lines) */\nproc freq data=work.final_action;\n  tables final_paid / nocum;\n  title 'Unpaid vs paid claims after final-action deduplication';\nrun;",
        "description": "Final-action institutional claim deduplication in SAS (PROC SQL + data steps) and\ndenied-claim flagging. Mirrors the institutional and denied-claim steps of the Python and R\nimplementations; pharmacy B2 reversal removal is not included in this SAS listing.\nIncludes QC summary output at each step. Uses the worked-example dataset.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Raw[Raw claims extract<br/>originals + voids + replacements + denials] --> QC1{Any allowed_amt < 0?}\n  QC1 -->|Yes| Void[Remove voids and matched originals<br/>TOB freq = 8 for institutional<br/>B2 for pharmacy]\n  QC1 -->|No| Interim\n  Void --> Interim[Drop interim bills<br/>TOB freq 2 / 3 / 4]\n  Interim --> Repl[Keep terminal replacement only<br/>TOB freq = 7<br/>discard superseded originals]\n  Repl --> Denied{Denied claims<br/>paid_amt = 0?}\n  Denied -->|Cost estimand| ExcludeDenied[Exclude from cost sum<br/>paid_amt = 0 contributes nothing]\n  Denied -->|Utilization estimand| IncludeDenied[Include as encounter attempt<br/>estimand decision, document in SAP]\n  ExcludeDenied --> FinalCost[Final-action cost numerator]\n  IncludeDenied --> FinalUtil[Final-action utilization count]\n  FinalCost --> PPPM[PPPM / PPPY calculation<br/>see healthcare-costs-pppm-pppy-pmpm]",
        "caption": "Final-action deduplication decision tree for raw institutional claims. The same logic applies to pharmacy files (B2 reversal instead of void; no interim bills). The denied-claim branch depends on the pre-specified estimand.",
        "alt_text": "Flowchart from raw claims extract through negative-amount QC check, void removal, interim bill removal, replacement resolution, and denied-claim branching to final cost and utilization numerators.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "requires",
        "target_slug": "ub-04-institutional-claim-fields",
        "notes": "The TOB frequency digit (FL4 digit 4) is the primary deduplication signal for institutional claims; understanding the frequency code semantics (1=final, 7=replacement, 8=void) and claim control number linkage requires the UB-04 field reference."
      },
      {
        "relation_type": "see_also",
        "target_slug": "ncpdp-pharmacy-claim-fields",
        "notes": "The B2 reversal transaction type and the NCPDP claim structure that determines how pharmacy fills and reversals are identified are covered in the NCPDP pharmacy fields entry; this entry applies those field semantics to the reversal-removal algorithm."
      },
      {
        "relation_type": "see_also",
        "target_slug": "claims-analysis",
        "notes": "Final-action deduplication is the first data-management step in every claims analysis; the broader claims-analysis entry covers enrollment, time zero, phenotypes, and confounding — all of which assume a clean, deduplicated file as input."
      },
      {
        "relation_type": "used_with",
        "target_slug": "healthcare-costs-pppm-pppy-pmpm",
        "notes": "Unresolved voids and replacement pairs silently inflate the cost numerator; final-action deduplication must be applied before any PPPM or PPPY aggregation to produce defensible cost estimates. The negative-amount QC check is a standard gate before the cost pipeline."
      },
      {
        "relation_type": "see_also",
        "target_slug": "pdc-proportion-of-days-covered",
        "notes": "Pharmacy B2 reversals (phantom fills) inflate PDC by adding days_supply coverage for prescriptions that were never dispensed; B2 removal is a prerequisite for valid adherence metric computation using PDC or MPR."
      },
      {
        "relation_type": "see_also",
        "target_slug": "fit-for-purpose-data-assessment-rwe",
        "notes": "Fit-for-purpose assessment of a claims data source includes evaluating whether the vendor has pre-processed adjustments to final action or whether analyst-side deduplication is required; this is a key dimension of data-quality suitability evaluation."
      }
    ],
    "aliases": [
      "claim adjustment",
      "claim reversal",
      "claim void",
      "claim denial",
      "final action claim",
      "B2 reversal",
      "pharmacy reversal",
      "replacement claim",
      "TOB frequency digit",
      "adjustment transaction"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "hta",
      "pqa-cms"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "claims-analysis",
    "name": "Administrative Claims Analysis",
    "short_definition": "A family of study designs that constructs cohorts and defines exposures, outcomes, and covariates from administrative billing and enrollment records (diagnosis, procedure, drug, and eligibility files) generated for reimbursement rather than research, whose validity rests on transparent, pre-specified operationalization of enrollment, time zero, and code-based phenotypes.",
    "long_description": "**Administrative claims analysis** is not a single design but the substrate on which most US real-world evidence is built.\nThe raw material is billing and enrollment data — inpatient (facility/MedPAR), outpatient/professional (carrier), pharmacy\n(NDC + `days_supply`), DME, and eligibility files — created to adjudicate payment, then repurposed to estimate treatment\neffects, incidence, costs, and utilization in populations far larger than any trial. Because the data are a byproduct of\nreimbursement, every research variable is a *proxy*: a diagnosis code means \"someone billed for evaluating this,\" a fill\nmeans \"a pharmacy was reimbursed,\" and the absence of a code can mean the patient is healthy, was seen out-of-network, or\nwas enrolled in a plan that does not submit that claim type. The discipline of claims analysis is making those proxies\nexplicit, validated, and reproducible.\n\n**Core conceptual distinction — what claims observe versus what you want to measure.** Claims capture *adjudicated\nbilling events with service dates and enrollment spans*, not clinical truth. 
The central operating decisions are therefore\n(1) **observability**: a variable can only be measured during continuous, claim-submitting enrollment, so person-time\nmust be bounded by an explicit enrollment definition; (2) **time zero**: index must be a real claim event (a `fill_date`,\nprocedure date, or first qualifying diagnosis) with a pre-specified lookback for washout and covariates; and (3)\n**phenotype**: each exposure/outcome/covariate is an algorithm over codes (e.g., 1 inpatient OR 2 outpatient diagnoses\n≥30 days apart) with a known positive predictive value (PPV) and sensitivity *in this specific source and population*.\nGet these three right and claims rival registries for many questions; get them wrong and large N simply produces a\nprecisely estimated bias.\n\n**Pros, cons, and trade-offs.**\n- **vs EHR-only studies:** Claims give near-complete *cross-provider* capture within a plan (every reimbursed encounter,\n  pharmacy, and hospital, not just one health system) and clean enrollment denominators, which makes person-time and\n  \"loss\" well defined. EHR offers labs, vitals, notes, severity, and physiologic outcomes that claims lack. The classic\n  failure of claims-only: no HbA1c, ejection fraction, stage, or smoking status, so confounding by severity is controlled\n  only through proxies (a high-dimensional propensity score) rather than measurement. **Prefer linked claims–EHR** when\n  severity or a lab-defined outcome drives the question; **prefer claims alone** when the question is utilization, cost,\n  adherence, or a well-coded hospitalization outcome across a defined population.\n- **vs disease/product registries:** Registries win on adjudicated outcomes, disease severity, and indication; claims win\n  on completeness of treatment/utilization and unbiased denominators. Registries often miss off-protocol care and lack a\n  true denominator. 
**Prefer linkage** (registry for severity + claims for full exposure and a death index for censoring).\n- **vs primary data collection / pragmatic trials:** Claims are orders of magnitude cheaper, faster, and larger, and\n  capture routine practice — but cannot randomize, cannot capture PROs, and inherit all measurement and channeling bias.\n- **Trade-off that defines the field:** scale and external validity bought with measurement error and unmeasured\n  confounding. The methodological response is not \"more N\" but validated phenotypes, active-comparator new-user design,\n  hdPS, negative controls, and quantitative bias analysis.\n\n**When to use.** Comparative safety/effectiveness of marketed drugs across a defined insured population; drug utilization,\nadherence (PDC/MPR), and persistence; healthcare resource use and cost (PPPM/PPPY); incidence/prevalence with a clean\nenrollment denominator; post-authorization safety surveillance (the Sentinel model). Claims are the default when the\nquestion is about *what was billed across all providers a patient touched* and a population denominator is required.\n\n**When NOT to use — and when claims are actively misleading.**\n- **The key variable is clinical, not billed.** Disease *severity*, lab values, tumor stage, performance status,\n  physiologic outcomes (BP, eGFR), symptoms, and PROs are absent. Studying a severity-driven outcome on claims alone\n  invites uncontrolled confounding by indication — large N makes the bias look certain, not absent.\n- **Outcome ascertainment differs by exposure.** If one arm is seen more often (more visits → more incidental codes),\n  surveillance/detection bias inflates the more-monitored arm's apparent event rate. 
Equalize ascertainment opportunity\n  or use an outcome with mandatory encounters (hospitalization, death).\n- **Person-time is unobservable for one group.** In Medicare Advantage, fee-for-service claims are often *not* submitted\n  (capitated/encounter data may be incomplete for non-risk variables), so \"no prior fill\" or \"no event\" can be\n  missingness, not truth. Mixing MA-only and FFS person-time without restriction produces differential outcome capture.\n- **Immortal time from procedure/qualifying-event designs.** Defining exposure by a downstream procedure or a second fill,\n  then starting follow-up earlier, guarantees the exposed group survives to be exposed — a textbook claims pitfall in\n  surgery and oncology line-of-therapy studies. Align time zero to the first qualifying event.\n- **Cash-pay, samples, $4 generics, and specialty-pharmacy carve-outs** are invisible — adherence and exposure are\n  undercounted, often differentially by drug cost.\n\n**Data-source operational depth (each with real failure modes and workarounds).**\n- **Medicare FFS:** Payment-driven and relatively complete for covered services; MedPAR for inpatient, carrier for\n  professional, Part D for pharmacy. *Failure modes:* Part D started in 2006 (no earlier drug data); Part B drugs are\n  captured as J-codes on medical claims, not Part D, so oral-vs-infused capture differs; **differential competing risk of\n  death** in this elderly population means cause-specific vs subdistribution estimands genuinely diverge — pre-specify\n  which you report. *Workaround:* require enrollment in all needed parts (A/B/D); link to a death index (NDI/Vital\n  Statistics) because Part A only reliably captures inpatient deaths.\n- **Medicare Advantage:** HCC risk-adjustment incentives drive *higher coding intensity* (chart reviews, in-home health\n  risk assessments add diagnoses), so MA patients can look sicker than identical FFS patients — a confounder for any\n  code-count covariate. 
*Failure mode:* encounter submission has historically been less complete/standardized for\n  non-risk variables, and FFS-style line items may be absent under capitation, so **MA-only person-time can lack the FFS\n  claims** that define exposure or outcome. *Workaround:* either restrict to FFS person-time or treat MA explicitly with\n  source-specific phenotypes and sensitivity analyses; never pool naively.\n- **Commercial (e.g., MarketScan, Optum):** Younger, employed/dependent populations with high churn (job change ends\n  enrollment); benefit/formulary differences shape exposure; out-of-network and carved-out behavioral/specialty claims\n  may be missing. *Failure mode:* short median enrollment truncates lookback and follow-up, biasing toward short-term\n  effects and toward healthier persistent enrollees. *Workaround:* report the attrition funnel, test lookback length, and\n  model loss to follow-up as potentially informative.\n- **EHR (when used in place of or with claims):** Adds labs/vitals/notes/severity but capture is *visit-driven* — a\n  patient who leaves the system is differentially lost, and exposure is the *order/administration*, not a guaranteed fill.\n  *Workaround:* link to pharmacy claims to confirm initiation and to a death index for mortality.\n- **Registry and linked claims–EHR–vital records:** Registry adds adjudicated outcomes and severity but weak exposure;\n  linkage is the ideal substrate but introduces selection (only the linkable subset) and order/fill/service **date\n  discrepancies** that must be reconciled before time-zero assignment.\n\n**Worked claims example (FFS + commercial, oral anticoagulant safety).** Question: incidence of hospitalized GI bleed in\nnew users of oral anticoagulant A vs B among adults with atrial fibrillation. 
(1) **Enrollment/observability:** require\n≥365 days of continuous medical + pharmacy enrollment before index, and exclude MA-only person-time so absence of prior\nfills and outcomes is observed, not missing. (2) **Indication:** ≥1 inpatient OR ≥2 outpatient AF diagnoses (ICD-10 I48)\n≥7 days apart in the baseline window. (3) **Time zero:** the first pharmacy `fill_date` for A or B (NDC list), with arm\nassigned from that day's NDC; **washout** = no fill of *any* oral anticoagulant in the prior 365 days (this makes both\narms incident users and removes prevalent-user/depletion-of-susceptibles bias). (4) **Covariates:** measured only in\n`[index_date − 365, index_date]` (HAS-BLED component diagnoses, prior bleed, NSAID/antiplatelet fills, utilization\ncounts) feeding an hdPS to proxy unmeasured severity. (5) **Outcome:** first hospitalization with a primary GI-bleed\ndiagnosis — a high-PPV phenotype because inpatient coding requires a billed admission, reducing detection bias relative\nto outpatient codes. (6) **Exposure window/censoring:** stitch consecutive fills with a 30-day grace period and\nstockpiling cap; follow from time zero to first event, censoring at disenrollment, death, switch to the other arm, end of\non-treatment supply + grace, or end of data. (7) **Competing risk:** because death competes with GI bleed in an\nAF/elderly population, report the cause-specific hazard (etiology) *and* the cumulative incidence function via Fine–Gray\n(absolute risk) rather than naive Kaplan–Meier. 
(8) **Sensitivity:** vary washout (180/365/730 days) and grace period,\nadd a negative-control outcome, and check standardized differences <0.1 after PS adjustment.\n\n**The estimand must be pre-specified.** In claims, the choice among cause-specific rate, cumulative incidence (Fine–Gray\nsubdistribution), and an ITT-like (initiation) vs as-treated (on-treatment, censoring at switch/discontinuation with\ninverse-probability-of-censoring weights) contrast changes both the model and the interpretation — especially where\ndeath is a frequent competing event. Decide it in the protocol, not after seeing the data.",
    "primary_category": "Study_Design",
    "tags": [
      "claims",
      "administrative-data",
      "pharmacoepidemiology",
      "phenotype-algorithm",
      "continuous-enrollment",
      "payer-heterogeneity",
      "secondary-data",
      "real-world-data"
    ],
    "applies_to_study_types": [
      "claims_analysis"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1016/j.jclinepi.2004.10.012",
        "url": "https://doi.org/10.1016/j.jclinepi.2004.10.012",
        "citation_text": "Schneeweiss S, Avorn J. A review of uses of health care utilization databases for epidemiologic research on therapeutics. Journal of Clinical Epidemiology. 2005;58(4):323-337.",
        "year": 2005,
        "authors_short": "Schneeweiss & Avorn",
        "notes": "Canonical statement of what administrative claims/utilization databases can and cannot measure, and the design principles (validated algorithms, new-user, confounding control) that make claims studies credible."
      },
      {
        "role": "explain",
        "doi": "10.1097/EDE.0b013e3181a663cc",
        "url": "https://doi.org/10.1097/EDE.0b013e3181a663cc",
        "citation_text": "Schneeweiss S, Rassen JA, Glynn RJ, Avorn J, Mogun H, Brookhart MA. High-dimensional propensity score adjustment in studies of treatment effects using health care claims data. Epidemiology. 2009;20(4):512-522.",
        "year": 2009,
        "authors_short": "Schneeweiss et al.",
        "notes": "Explains how to exploit the high-dimensional code/utilization structure of claims to empirically proxy unmeasured confounders when clinical severity is unrecorded — the standard answer to the central limitation of claims."
      },
      {
        "role": "demonstrate",
        "doi": "10.1002/pds.2336",
        "url": "https://doi.org/10.1002/pds.2336",
        "citation_text": "Curtis LH, Weiner MG, Boudreau DM, et al. Design considerations, architecture, and use of the Mini-Sentinel distributed data system. Pharmacoepidemiology and Drug Safety. 2012;21(S1):23-31.",
        "year": 2012,
        "authors_short": "Curtis et al.",
        "notes": "Demonstrates large-scale, multi-database claims analysis (the FDA Sentinel model) with a common data model, distributed querying, and validated algorithms for active safety surveillance."
      },
      {
        "role": "use",
        "doi": "10.1136/bmj.k3532",
        "url": "https://doi.org/10.1136/bmj.k3532",
        "citation_text": "Langan SM, Schmidt SA, Wing K, et al. The reporting of studies conducted using observational routinely collected health data statement for pharmacoepidemiology (RECORD-PE). BMJ. 2018;363:k3532.",
        "year": 2018,
        "authors_short": "Langan et al.",
        "notes": "Reporting standard used in practice for claims/EHR pharmacoepidemiology studies; specifies the operational details (codes, enrollment, time windows, phenotype validation) that a defensible claims analysis must disclose."
      }
    ],
    "plain_language_summary": "Administrative claims analysis is research built from the billing and enrollment records an insurer keeps to pay for care, not to do science. Every diagnosis, drug fill, and procedure shows up as a dated line that says \"someone billed for this,\" so you reuse those lines to figure out who was treated, when, and what happened next. The catch is that a claim only proves a bill was paid, never that a patient was truly sick or truly took the medicine, and care the plan never paid for (cash-paid drugs, free samples, out-of-network visits) is simply invisible.",
    "key_terms": [
      {
        "term": "administrative claims",
        "definition": "The dated billing and enrollment records an insurer generates to reimburse care, later reused as research data."
      },
      {
        "term": "code system",
        "definition": "The shared dictionary a claim uses to label what happened, such as ICD-10 for diagnoses, CPT for procedures, and NDC for drugs."
      },
      {
        "term": "days_supply",
        "definition": "How many days one filled prescription is meant to last; counting the fill date as day 1, a fill on Feb 1, 2023 with days_supply 30 covers Feb 1 through Mar 2, 2023."
      },
      {
        "term": "phenotype",
        "definition": "A rule over codes that decides whether a person counts as having a condition, exposure, or outcome (for example, \"any claim with an ICD-10 code starting I21\" flags a heart attack)."
      },
      {
        "term": "enrollment span",
        "definition": "The stretch of dates a person was covered by the plan; outside it you see no claims, so absence of a record is not the same as absence of an event."
      }
    ],
    "worked_example": {
      "scenario": "We pull every 2023 claim line for one insured member (member_id 5001) and want to read those raw rows the way an analyst would: tell apart a diagnosis from a drug fill from a procedure, see when the member started a statin and for how long it was meant to last, and spot whether a heart attack was billed during the year.",
      "dataset": {
        "caption": "A handful of raw claim lines for one member, mixing pharmacy and medical claims exactly as they sit in the source tables.",
        "columns": [
          "member_id",
          "claim_type",
          "service_date",
          "code_system",
          "code",
          "days_supply",
          "amount"
        ],
        "rows": [
          [
            5001,
            "medical",
            "2023-01-10",
            "ICD-10",
            "I48.91",
            null,
            142.0
          ],
          [
            5001,
            "pharmacy",
            "2023-02-01",
            "NDC",
            "00071-0155-23",
            30,
            11.45
          ],
          [
            5001,
            "pharmacy",
            "2023-03-03",
            "NDC",
            "00071-0155-23",
            30,
            11.45
          ],
          [
            5001,
            "medical",
            "2023-05-12",
            "CPT",
            "93000",
            null,
            38.0
          ],
          [
            5001,
            "medical",
            "2023-09-15",
            "ICD-10",
            "I21.9",
            null,
            9820.0
          ]
        ]
      },
      "steps": [
        "Read claim_type and code_system together to see what each row is: the rows with claim_type 'medical' carry an ICD-10 diagnosis code (I48.91, I21.9) or a CPT procedure code (93000), while the 'pharmacy' rows carry an NDC drug code.",
        "Row 1 is a diagnosis: ICD-10 I48.91 is atrial fibrillation, billed on 2023-01-10. A diagnosis row has no days_supply (it is null) because nothing is being dispensed.",
        "Rows 2 and 3 are drug fills: the same NDC (00071-0155-23, atorvastatin) filled on 2023-02-01 and again on 2023-03-03. The exposure starts at the first fill's service_date, and counting that date as day 1, the last covered day is service_date + days_supply - 1: the 30-day fill on 2023-02-01 covers through 2023-03-02, and the 2023-03-03 refill continues coverage with no gap.",
        "Row 4 is a procedure, not a diagnosis: CPT 93000 (an electrocardiogram) billed on 2023-05-12; CPT codes describe a service performed, so this row says a test was done, not that a disease was present.",
        "Row 5 is the outcome we were watching for: ICD-10 I21.9 is an acute myocardial infarction (heart attack). Because its code starts with I21, the outcome phenotype flags it, dated 2023-09-15."
      ],
      "result": "Reading member 5001's rows in date order: a baseline atrial-fibrillation diagnosis on 2023-01-10; one statin (atorvastatin) start on 2023-02-01 with 30 days_supply (coverage through 2023-03-02), continued by a 30-day refill on 2023-03-03; an ECG procedure on 2023-05-12; and one MI diagnosis (ICD-10 I21.9) on 2023-09-15, flagged as the outcome."
    },
    "prerequisites": [],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Drug-exposure cohort (pharmacy claims, new-user)",
        "description": "Index = first pharmacy fill (NDC) of the study drug after a drug-free washout; exposure episodes stitched from days_supply with grace periods and stockpiling rules. The workhorse for comparative safety/effectiveness and adherence studies.",
        "edge_cases": [
          "90-day mail-order, free samples, and cash/$4-generic fills distort days_supply and undercount exposure differentially by drug cost.",
          "Part B (J-code) infused/administered drugs appear on medical, not pharmacy, claims — oral-vs-infused comparisons need both files."
        ],
        "data_source_notes": "claims: require continuous pharmacy benefit across washout; Medicare needs Part D (post-2006) plus medical claims for J-code drugs."
      },
      {
        "name": "Procedure/surgical cohort (medical claims, index event)",
        "description": "Index = first qualifying CPT/ICD-10-PCS/HCPCS procedure; covariates from the prior enrollment window, follow-up from the procedure date for revision/readmission/complication/death.",
        "edge_cases": [
          "Immortal time if follow-up starts before the procedure date or if the cohort is defined by a downstream confirmatory procedure.",
          "Facility vs professional claims capture the same procedure differently (bundling, bilateral modifiers, global periods)."
        ],
        "data_source_notes": "claims: reconcile facility (revenue/bill-type) and professional (CPT) records; de-duplicate same-day multiples and reversals before counting the index event."
      },
      {
        "name": "Diagnosis-based disease cohort (claims phenotype)",
        "description": "Cohort entry by a validated code algorithm (e.g., 1 inpatient OR 2 outpatient diagnoses within a window, optionally requiring a confirmatory drug or procedure) to raise PPV.",
        "edge_cases": [
          "Rule-out coding (a code placed during workup that is later negated) inflates apparent incidence — require a confirmatory second criterion or a code-position rule.",
          "MA coding intensity raises diagnosis counts relative to FFS, shifting phenotype sensitivity/PPV by payer."
        ],
        "data_source_notes": "claims: cite source- and population-specific PPV/sensitivity; EHR or chart-review validation substudies anchor the algorithm."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "EHR-only studies",
        "pros_of_this": "Complete cross-provider capture within a plan and a clean enrollment denominator, so person-time, utilization, cost, and loss to follow-up are well defined across all care settings, not one health system.",
        "cons_of_this": "No labs, vitals, notes, severity, stage, or PROs; confounding by severity is controlled only through code proxies (hdPS), not measurement.",
        "when_to_prefer": "Utilization, cost, adherence, incidence with a population denominator, or well-coded hospitalization/death outcomes; prefer linked claims-EHR when severity or a lab-defined outcome drives the question."
      },
      {
        "compared_to": "Disease or product registries",
        "pros_of_this": "Far more complete treatment, utilization, and cost capture, and an unbiased denominator; captures off-protocol and cross-provider care registries miss.",
        "cons_of_this": "Weak on adjudicated outcomes, disease severity, and indication; every clinical variable is a billing proxy.",
        "when_to_prefer": "When complete exposure/utilization and a true denominator matter more than adjudicated severity; link to a registry and a death index when severity and mortality matter."
      },
      {
        "compared_to": "Primary data collection / pragmatic trials",
        "pros_of_this": "Orders of magnitude cheaper, faster, and larger; captures routine practice with long historical depth.",
        "cons_of_this": "Cannot randomize, cannot collect PROs/physiologic measures, and inherits measurement error, channeling, and unmeasured confounding.",
        "when_to_prefer": "When randomization is infeasible/unethical, when scale and external validity are paramount, or for post-authorization safety surveillance."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Bound all person-time by continuous, claim-submitting enrollment; exclude MA-only spans where FFS line items are unavailable. Index = a real claim event (fill_date / procedure date / first qualifying diagnosis). De-duplicate reversals, voids, and same-day multiples; reconcile facility vs professional records before counting events. Every exposure/outcome/covariate is a code algorithm with source-specific PPV/sensitivity. Build covariates only from the pre-index lookback window.",
      "ehr": "Capture is visit-driven and single-system; exposure is the order/administration, not a guaranteed fill. Link to pharmacy claims to confirm initiation and to a death index for mortality; treat leaving the system as potentially informative loss to follow-up. Use labs/notes to add severity unavailable in claims.",
      "registry": "Strong for indication, severity, and adjudicated outcomes; weak for complete pharmacy/utilization exposure. Link to claims for full fill history and to a death index for censoring and mortality outcomes.",
      "linked": "Linked claims-EHR-vital records is the ideal substrate (EHR severity + claims completeness + reliable mortality) but introduces linkage selection (only the linkable subset) and order/fill/service date discrepancies that must be reconciled before time-zero assignment."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\n\nWASHOUT_DAYS = 365        # drug-free + continuous-enrollment lookback that defines a \"new user\"\nINDICATION_CODES = {\"I48\"}  # e.g., atrial fibrillation (ICD-10, truncated to 3 char for matching)\n\ndef build_claims_cohort(rx: pd.DataFrame, dx: pd.DataFrame, enroll: pd.DataFrame) -> pd.DataFrame:\n    rx = rx.sort_values([\"person_id\", \"fill_date\"])\n\n    # 1. Candidate time zero = first fill of the study drug class (NDC -> drug_class upstream).\n    study = rx[rx[\"drug_class\"].isin([\"STUDY\", \"COMPARATOR\"])]\n    idx = (study.groupby(\"person_id\", as_index=False).first()\n                .rename(columns={\"fill_date\": \"index_date\", \"drug_class\": \"arm\"}))\n\n    # 2. New-user washout: no study/comparator fill in the WASHOUT_DAYS before index.\n    prior = study.merge(idx[[\"person_id\", \"index_date\"]], on=\"person_id\")\n    in_washout = prior[(prior[\"fill_date\"] < prior[\"index_date\"]) &\n                       (prior[\"fill_date\"] >= prior[\"index_date\"] - pd.Timedelta(days=WASHOUT_DAYS))]\n    idx = idx[~idx[\"person_id\"].isin(in_washout[\"person_id\"])].copy()\n    idx[\"baseline_start\"] = idx[\"index_date\"] - pd.Timedelta(days=WASHOUT_DAYS)\n\n    # 3. Continuous, claim-submitting enrollment across the whole washout through index (no MA-only person-time).\n    e = enroll.merge(idx[[\"person_id\", \"index_date\", \"baseline_start\"]], on=\"person_id\")\n    covered = e[(e[\"enroll_start\"] <= e[\"baseline_start\"]) & (e[\"enroll_end\"] >= e[\"index_date\"]) &\n                e[\"med_benefit\"] & e[\"rx_benefit\"] & (~e[\"ma_only\"])]\n    idx = idx[idx[\"person_id\"].isin(covered[\"person_id\"])].copy()\n\n    # 4. 
Validate indication with a claims phenotype: >=1 IP OR >=2 OP qualifying dx in the baseline window.\n    d = dx.merge(idx[[\"person_id\", \"index_date\", \"baseline_start\"]], on=\"person_id\")\n    d = d[d[\"dx_code\"].str[:3].isin(INDICATION_CODES) &\n          (d[\"dx_date\"] >= d[\"baseline_start\"]) & (d[\"dx_date\"] <= d[\"index_date\"])]\n    ip = d[d[\"claim_setting\"] == \"IP\"].groupby(\"person_id\").size()\n    op = d[d[\"claim_setting\"] == \"OP\"].groupby(\"person_id\").size()\n    has_indication = set(ip[ip >= 1].index) | set(op[op >= 2].index)\n    idx = idx[idx[\"person_id\"].isin(has_indication)]\n\n    return idx[[\"person_id\", \"arm\", \"index_date\", \"baseline_start\"]].reset_index(drop=True)",
        "description": "Claims cohort construction (a drug-exposure new-user cohort) from standard claims tables. Required inputs, already\ncleaned and de-duplicated (reversals/voids removed):\n  rx     : pharmacy fills    -> person_id, fill_date (datetime), ndc, days_supply (int), drug_class\n  dx     : diagnosis claims  -> person_id, dx_date (datetime), dx_code, claim_setting in {'IP','OP'}\n  enroll : enrollment spans  -> person_id, enroll_start, enroll_end (datetime), med_benefit, rx_benefit, ma_only (bool)\nReturns one analysis-ready row per eligible new initiator with arm, validated indication, time zero, and the\n[baseline_start, index_date] covariate window. Build covariates / hdPS and apply outcome+censoring rules identically\nto both arms downstream. No toy data is created here; wire your own loaders to the schema above.",
        "dependencies": [
          "pandas"
        ],
        "source_citations": [
          "schneeweiss-avorn-2005"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\nWASHOUT_DAYS <- 365L\nINDICATION_CODES <- c(\"I48\")  # atrial fibrillation, 3-char ICD-10 match\n\nbuild_claims_cohort <- function(rx, dx, enroll) {\n  setDT(rx); setDT(dx); setDT(enroll)\n  setorder(rx, person_id, fill_date)\n\n  # 1. Candidate time zero = first study/comparator fill.\n  study <- rx[drug_class %chin% c(\"STUDY\", \"COMPARATOR\")]\n  idx <- study[, .(index_date = fill_date[1L], arm = drug_class[1L]), by = person_id]\n  idx[, baseline_start := index_date - WASHOUT_DAYS]\n\n  # 2. New-user washout: drop anyone with a prior study/comparator fill inside the window.\n  #    person_id is matched with %in% (not %chin%, which accepts only character vectors),\n  #    so integer and character IDs both work.\n  study <- merge(study, idx[, .(person_id, index_date)], by = \"person_id\")\n  prior_ids <- unique(study[fill_date < index_date &\n                            fill_date >= index_date - WASHOUT_DAYS, person_id])\n  idx <- idx[!person_id %in% prior_ids]\n\n  # 3. Continuous claim-submitting enrollment across washout -> index (no MA-only spans).\n  e <- merge(enroll, idx[, .(person_id, index_date, baseline_start)], by = \"person_id\")\n  ok_enroll <- e[enroll_start <= baseline_start & enroll_end >= index_date &\n                 med_benefit & rx_benefit & !ma_only, unique(person_id)]\n  idx <- idx[person_id %in% ok_enroll]\n\n  # 4. Indication phenotype: >=1 IP OR >=2 OP qualifying dx in the baseline window.\n  d <- merge(dx, idx[, .(person_id, index_date, baseline_start)], by = \"person_id\")\n  d <- d[substr(dx_code, 1L, 3L) %chin% INDICATION_CODES &\n         dx_date >= baseline_start & dx_date <= index_date]\n  cnt <- d[, .(ip = sum(claim_setting == \"IP\"), op = sum(claim_setting == \"OP\")), by = person_id]\n  keep <- cnt[ip >= 1L | op >= 2L, person_id]\n  idx <- idx[person_id %in% keep]\n\n  idx[, .(person_id, arm, index_date, baseline_start)]\n}",
        "description": "Claims cohort construction with data.table. Inputs mirror the Python version:\n  rx     : person_id, fill_date (Date), ndc, days_supply (int), drug_class\n  dx     : person_id, dx_date (Date), dx_code, claim_setting in {'IP','OP'}\n  enroll : person_id, enroll_start, enroll_end (Date), med_benefit, rx_benefit, ma_only (logical)\nReturns one analysis-ready row per eligible new initiator with arm, validated indication, and the covariate window.",
        "dependencies": [
          "data.table"
        ],
        "source_citations": [
          "schneeweiss-avorn-2005"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "%let washout = 365;\n\n/* 1. Candidate time zero = first study/comparator fill; arm = drug_class on that earliest fill. */\nproc sort data=work.rx(where=(drug_class in ('STUDY','COMPARATOR'))) out=rx_q;\n  by person_id fill_date;\nrun;\n\ndata idx;\n  set rx_q;\n  by person_id;\n  if first.person_id;\n  index_date = fill_date;\n  format index_date date9.;\n  length arm $12;\n  arm = drug_class;\n  keep person_id index_date arm;\nrun;\n\ndata idx; set idx; baseline_start = index_date - &washout; format baseline_start date9.; run;\n\n/* 2. New-user washout: exclude any prior study/comparator fill inside the lookback. */\nproc sql;\n  create table newuser as\n  select i.* from idx i\n  where not exists (\n    select 1 from work.rx p\n    where p.person_id = i.person_id\n      and p.drug_class in ('STUDY','COMPARATOR')\n      and p.fill_date <  i.index_date\n      and p.fill_date >= i.index_date - &washout\n  );\nquit;\n\n/* 3. Continuous, claim-submitting enrollment across washout -> index (no MA-only person-time). */\nproc sql;\n  create table enrolled as\n  select n.* from newuser n\n  where exists (\n    select 1 from work.enroll e\n    where e.person_id = n.person_id and e.med_benefit = 1 and e.rx_benefit = 1 and e.ma_only = 0\n      and e.enroll_start <= n.baseline_start\n      and e.enroll_end   >= n.index_date\n  );\nquit;\n\n/* 4. Indication phenotype: >=1 IP OR >=2 OP qualifying dx in the baseline window. 
*/\nproc sql;\n  create table cohort as\n  select e.person_id, e.arm, e.index_date, e.baseline_start\n  from enrolled e\n  where ( select count(*) from work.dx d\n            where d.person_id = e.person_id and d.claim_setting = 'IP'\n              and substr(d.dx_code,1,3) = 'I48'\n              and d.dx_date between e.baseline_start and e.index_date ) >= 1\n     or ( select count(*) from work.dx d\n            where d.person_id = e.person_id and d.claim_setting = 'OP'\n              and substr(d.dx_code,1,3) = 'I48'\n              and d.dx_date between e.baseline_start and e.index_date ) >= 2;\nquit;",
        "description": "Claims cohort construction in SAS (PROC SQL / data step) — the cohort-building plumbing, not estimation. Required input\ndatasets (post data-management, reversals/voids removed):\n  work.rx     : person_id, fill_date, ndc, days_supply, drug_class ('STUDY'/'COMPARATOR')\n  work.dx     : person_id, dx_date, dx_code, claim_setting ('IP'/'OP')\n  work.enroll : person_id, enroll_start, enroll_end, med_benefit (0/1), rx_benefit (0/1), ma_only (0/1)\nProduces work.cohort: one row per eligible new initiator with arm, index_date, and baseline_start. Build covariates\nonly within [baseline_start, index_date] and apply outcome/censoring rules identically to both arms downstream.",
        "dependencies": [],
        "source_citations": [
          "schneeweiss-avorn-2005"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Q[Research question + estimand] --> FIT[Fit-for-purpose data assessment<br/>payer mix FFS / MA / commercial, capture, timeliness]\n  FIT --> ENR[Continuous claim-submitting enrollment<br/>exclude MA-only person-time]\n  ENR --> T0[Time zero = real claim event<br/>fill_date / procedure / first qualifying dx]\n  T0 --> WASH[Washout / lookback before index<br/>new-user + baseline covariate window]\n  WASH --> PHEN[Code-based phenotypes with PPV / sensitivity<br/>exposure, outcome, covariates]\n  PHEN --> ADJ[Confounding control<br/>active comparator + hdPS + negative controls]\n  ADJ --> EST[Estimation honoring the pre-specified estimand<br/>cause-specific vs cumulative incidence; ITT vs as-treated]\n  EST --> SENS[Diagnostics + sensitivity<br/>attrition funnel, balance, washout length, QBA]",
        "caption": "Operational pipeline for an administrative claims analysis. Validity is set early — fit-for-purpose assessment, observable enrollment, a real-claim time zero, and validated phenotypes — long before estimation.",
        "alt_text": "Flowchart from research question and estimand through data fitness, continuous enrollment, time zero, washout, code-based phenotypes, confounding control, estimation, and sensitivity analyses.",
        "source_type": "illustrative",
        "source_citations": [
          "schneeweiss-avorn-2005"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  EXP[Exposure A vs B] -->|true effect| OUT[Outcome]\n  SEV[Unrecorded clinical severity] -->|confounds| EXP\n  SEV -->|causes| OUT\n  EXP -->|more visits = more codes| MON[Monitoring / detection]\n  MON -->|surveillance bias| OUT\n  MA[MA vs FFS coding intensity & capture] -->|misclassifies| SEV\n  MA -->|differential capture| OUT",
        "caption": "Why scale does not rescue claims. Severity is unrecorded (confounding by indication, addressed via active comparator + hdPS + negative controls), differential monitoring creates detection bias, and MA-vs-FFS coding/capture distorts both the confounder proxy and outcome ascertainment.",
        "alt_text": "Causal diagram showing unrecorded severity confounding exposure and outcome, monitoring-driven detection bias, and Medicare Advantage versus fee-for-service coding intensity distorting severity proxies and outcome capture.",
        "source_type": "illustrative",
        "source_citations": [
          "schneeweiss-2009-hdps"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "requires",
        "target_slug": "continuous-enrollment-observable-time-rwe",
        "notes": "Person-time, exposure, and outcome capture are only valid during continuous, claim-submitting enrollment; MA-only spans lacking FFS claims must be excluded or handled explicitly."
      },
      {
        "relation_type": "requires",
        "target_slug": "fit-for-purpose-data-assessment-rwe",
        "notes": "Every claims analysis begins with (and is limited by) a fit-for-purpose assessment of the specific source(s) — capture by payer type, timeliness, and whether key variables are billed at all."
      },
      {
        "relation_type": "requires",
        "target_slug": "time-zero-index-date-alignment-rwe",
        "notes": "Time zero must be a real claim event (fill date, procedure date, first qualifying diagnosis) with a pre-specified lookback; misaligned time zero is the source of immortal-time bias in procedure and line-of-therapy claims studies."
      },
      {
        "relation_type": "part_of",
        "target_slug": "diagnosis-phenotype-algorithm-1ip-2op-time-window-rwe",
        "notes": "Cohorts, exposures, outcomes, and covariates in claims are code algorithms (e.g., 1 IP / 2 OP with a time window) whose PPV and sensitivity must be established in the specific source and population."
      },
      {
        "relation_type": "used_with",
        "target_slug": "high-dimensional-propensity-score-hdps-rwe",
        "notes": "Because clinical severity is unrecorded in claims, hdPS empirically selects high-dimensional code/utilization proxies to control confounding by indication — the standard response to the central limitation of claims."
      },
      {
        "relation_type": "used_with",
        "target_slug": "active-comparator-new-user",
        "notes": "The active-comparator new-user design is the default analytic frame for comparative claims studies, removing confounding by indication and prevalent-user/immortal-time bias."
      },
      {
        "relation_type": "affects",
        "target_slug": "medicare-ffs-ma-commercial-claims-differences-rwe",
        "notes": "Payer type (FFS payment-driven completeness vs MA coding intensity and incomplete FFS capture vs commercial churn) is a fundamental source of heterogeneity in phenotype performance, confounding, and generalizability."
      },
      {
        "relation_type": "see_also",
        "target_slug": "claims-outcome-algorithm-ppv-sensitivity-rwe",
        "notes": "Outcome ascertainment in claims hinges on algorithm PPV/sensitivity and on equalizing detection opportunity across arms to avoid surveillance bias."
      },
      {
        "relation_type": "see_also",
        "target_slug": "competing-risks-cause-specific-fine-gray-rwe",
        "notes": "In elderly/claims populations death is a frequent competing event; the estimand choice (cause-specific hazard vs Fine-Gray cumulative incidence) must be pre-specified and reported."
      },
      {
        "relation_type": "see_also",
        "target_slug": "healthcare-costs-pppm-pppy-pmpm",
        "notes": "Claims provide the enrollment, time-zero, and phenotyping plumbing on which cost operationalization (allowed vs paid, PPPM/PPPY person-time standardization, outlier handling) is layered."
      },
      {
        "relation_type": "see_also",
        "target_slug": "database-feasibility-attrition-funnel-rwe",
        "notes": "A transparent attrition funnel (source population -> enrollment -> washout -> indication -> analytic cohort) is expected reporting for any claims study, especially given commercial churn and MA exclusions."
      }
    ],
    "aliases": [
      "claims analysis",
      "administrative claims analysis",
      "administrative claims data analysis",
      "healthcare claims study",
      "claims-based pharmacoepidemiology"
    ],
    "complexity": "foundational",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta",
      "pqa-cms"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "claims-based-frailty-index-rwe",
    "name": "Claims-Based Frailty Index (Faurot / Kim)",
    "short_definition": "A predicted frailty score built from administrative claims — the Faurot predicted-probability-of-dependency index and the Kim deficit-accumulation frailty index — used to adjust for frailty-related confounding that diagnosis-counting comorbidity indices (Charlson, Elixhauser) miss.",
    "long_description": "A **claims-based frailty index (CFI)** estimates a patient's frailty from administrative data when no direct\nfrailty assessment (gait speed, grip strength, activities-of-daily-living dependency) was ever recorded. It\nexists because comorbidity indices answer \"how many diseases?\" while frailty answers \"how depleted is this\nperson's physiologic reserve?\" These are distinct axes of risk, and a Charlson or Elixhauser score captures the\nsecond one poorly.\nTwo complementary algorithms dominate. **Faurot et al. (2015)** fit a logistic model predicting a directly\nmeasured frailty/dependency proxy from claims indicators (durable medical equipment, home health, oxygen,\nwheelchair, hospital beds, mobility/ADL-related codes, demographics) and output a **predicted probability of\ndependency** that is used as a continuous confounder. **Kim et al. (2018)** mapped a validated\n**deficit-accumulation frailty index** (Rockwood-style, the proportion of possible deficits a patient has) onto\n~93 claims variables, producing a continuous 0–1 CFI validated against the Health and Retirement Study (Kim\n2019). Both are **covariate-construction methods** for confounding control, not study designs or outcomes;\nboth are computed over a **fixed baseline window** (commonly 8–12 months) before the index date.\n\n**Core conceptual distinctions.** (1) *Frailty vs comorbidity*: frailty is a state of diminished reserve and\nvulnerability; comorbidity is disease count. A robust 80-year-old with well-controlled diabetes and hypertension\naccrues comorbidity points (diabetes on Charlson, both conditions on Elixhauser) yet has low frailty, while a\nfrail elder with sparse diagnoses scores low on Charlson,\nso the CFI adjusts for something the comorbidity indices structurally cannot. 
(2) *Predicted-probability vs\ndeficit-accumulation*: Faurot outputs the probability of a frailty proxy from a fitted model (the score is a\nmodel prediction, interpreted as risk of dependency); Kim outputs a deficit-accumulation index (the fraction of\npossible health deficits present, interpreted on the Rockwood 0–1 frailty scale). They correlate but are not\ninterchangeable and use different coefficient sets. (3) *Confounding by frailty / functional status*: the CFI\nis the standard tool for **confounding by frailty** in pharmacoepidemiology — the bias where frailer patients\nare differentially started on, or withheld from, a treatment (a major driver of \"healthy-user\" and\n\"frailty-by-indication\" effects). (4) *Continuous score, not a label*: the CFI is meant to enter models as a\ncontinuous covariate or PS input; dichotomizing into \"frail/not frail\" at a cut point discards information and\nis discouraged for adjustment.\n\n**Pros, cons, and trade-offs** (named against the alternatives).\n- **vs the Charlson / Elixhauser comorbidity indices:** the CFI captures functional decline and care-dependency\n  proxies (DME, home health, oxygen) those indices ignore, materially reducing residual confounding in older\n  populations; but it is less familiar, requires a fitted coefficient set, and predicts a *latent* construct\n  (frailty) rather than counting observed diagnoses. **Use the CFI alongside** a comorbidity index, not instead\n  of it — they adjust for different axes and the combination outperforms either alone.\n- **Faurot predicted-probability vs Kim deficit-accumulation:** Faurot is parsimonious and directly tied to a\n  dependency outcome; Kim is a richer ~93-variable deficit index validated against a gold-standard survey\n  frailty index. 
**Prefer Kim** when you want a Rockwood-scaled, survey-validated continuous index; **prefer\n  Faurot** for a lighter, dependency-oriented predicted probability.\n- **vs a high-dimensional propensity score (hdPS):** hdPS may empirically pick up frailty proxies on its own,\n  but a pre-specified CFI is transparent, validated, and guaranteed to be present; hdPS is data-driven and may\n  or may not select the relevant frailty codes. They are complementary.\n- **vs direct frailty assessment (Fried phenotype, clinical frailty scale):** direct measures are the gold\n  standard but are absent from claims/most EHR; the CFI is the best available *proxy* when direct measurement\n  was never done, at the cost of measurement error and dependence on coding of frailty-related services.\n\n**When to use.** As a confounder for frailty/functional status in comparative-effectiveness and safety RWE in\nolder or chronically ill populations; as a propensity-score input to control confounding by frailty; as an\neffect-modifier or subgroup variable (treatment effects often differ by frailty); and as a sensitivity-analysis\nadjustment to probe residual confounding beyond comorbidity. It is increasingly expected in Medicare\nclaims-based drug and device studies where frailty plausibly drives both treatment and outcome.\n\n**When NOT to use — and when it is actively misleading or dangerous.**\n- **Adjusting for an on-pathway frailty change.** If the exposure itself causes functional decline measured in\n  the baseline window (or the window overlaps post-treatment time), the CFI becomes a mediator/collider and\n  adjusting for it biases the effect. 
Ascertain the CFI strictly **before** treatment initiation.\n- **Importing coefficients into a different population or coding system.** The Faurot and Kim weights were fit\n  in specific (largely US Medicare) populations and code systems; applying them to a different age band, payer,\n  country, or ICD/HCPCS coding environment without re-validation can produce a miscalibrated score. Check\n  calibration before trusting it.\n- **Treating it as a frailty diagnosis for an individual.** The CFI is a population-level adjustment proxy with\n  real measurement error; it should not be read as a clinical determination that a specific patient is frail.\n- **Dichotomizing for adjustment.** Splitting the continuous score into frail/non-frail at a threshold throws\n  away information and weakens confounding control; keep it continuous (or finely categorized) when adjusting.\n- **Unequal or short lookback / thin service coding.** The CFI depends on coded frailty-related services; an\n  exposure group with shorter baseline enrollment or a setting that under-codes home health/DME will look\n  artificially robust. Require an equal continuous-enrollment window and adequate service capture.\n\n**Data-source operational depth.** The CFI is native to **claims** (especially Medicare FFS), where DME, home\nhealth, hospice, oxygen, wheelchair, and procedure/diagnosis codes that proxy dependency are reliably captured;\napply the published Faurot or Kim variable list and coefficients over a fixed pre-index window with continuous\nenrollment. **Medicare Advantage encounter data** under-capture some of these services, biasing the score\ndownward, so FFS-derived coefficients should not be applied uncritically to MA person-time. **EHR** can supply\nricher functional detail (notes, problem lists) but misses care delivered out-of-system and rarely codes ADL\ndependency discretely. 
**Linked claims–EHR–survey** is the validation substrate (Kim validated against the HRS\nsurvey frailty index) but inherits linkage selection. Across sources, the score is only comparable when\nbaseline enrollment and service-coding intensity are comparable across the exposure groups.\n\n**Interpreting the output**\n\nIn the worked example, a patient accumulates a linear predictor LP = 0.9 + 1.1 + 0.8 + 0.7 − 3.0 = 0.5,\ngiving a predicted probability = 1 / (1 + e^(−0.5)) ≈ 0.62.\n\n*(1) Formal interpretation.* A predicted probability of 0.62 means the Faurot-style logistic model assigns\nthis patient a 62% estimated probability of ADL dependency, the frailty proxy the model was fit to\npredict, as that proxy was defined in the derivation data. The linear predictor is the model intercept\n(negative here) plus the coefficients of the claims-detectable indicators the patient has (in this example\nage 75+, wheelchair or other durable medical equipment, hospital bed, and skilled home health). This is a\nproxy measure: it\napproximates frailty as captured by administrative service-use patterns, not a clinical frailty\nassessment such as the Fried phenotype or Clinical Frailty Scale. Misclassification error is\ninherent: patients who are clinically frail but do not generate the relevant service codes (e.g.,\nunengaged with healthcare, Medicare Advantage encounter data gaps) will be scored lower than their\ntrue frailty warrants.\n\n*(2) Practical interpretation.* A predicted probability of 0.62 is above 0.5, so the model rates\ndependency as more likely than not for this patient. There is no universal frail/not-frail cut point,\nhowever: published thresholds are tool-specific (a Faurot predicted probability and a Kim\ndeficit-accumulation index are on different scales) and depend on the validation context. Use\nthe continuous probability as a covariate in regression or propensity-score models rather than\ndichotomizing, to preserve information and reduce threshold-sensitivity. Never interpret this\nprobability as a clinical frailty diagnosis. 
When comparing frailty scores across treatment arms or\ndata sources, confirm that baseline enrollment continuity and service-coding intensity are comparable —\nsystematic differences in MA vs FFS enrollment or care-seeking patterns will bias the score\ndistribution independently of true frailty.",
    "primary_category": "Bias_Control",
    "tags": [
      "frailty",
      "claims-based-frailty-index",
      "faurot",
      "kim",
      "confounding-by-frailty",
      "risk-adjustment",
      "medicare-claims",
      "covariate-construction"
    ],
    "applies_to_study_types": [
      "cohort_retrospective",
      "comparative_effectiveness",
      "pharmacoepidemiology"
    ],
    "data_sources": [
      "claims",
      "linked",
      "ehr"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1002/pds.3719",
        "url": "https://doi.org/10.1002/pds.3719",
        "citation_text": "Faurot KR, Jonsson Funk M, Pate V, et al. Using claims data to predict dependency in activities of daily living as a proxy for frailty. Pharmacoepidemiol Drug Saf. 2015;24(1):59-66.",
        "year": 2015,
        "authors_short": "Faurot et al.",
        "notes": "Derived the claims-based predicted-probability-of-dependency frailty proxy used as a continuous confounder."
      },
      {
        "role": "explain",
        "doi": "10.1093/gerona/glx229",
        "url": "https://doi.org/10.1093/gerona/glx229",
        "citation_text": "Kim DH, Schneeweiss S, Glynn RJ, Lipsitz LA, Rockwood K, Avorn J. Measuring frailty in Medicare data: development and validation of a claims-based frailty index. J Gerontol A Biol Sci Med Sci. 2018;73(7):980-987.",
        "year": 2018,
        "authors_short": "Kim et al.",
        "notes": "Mapped a deficit-accumulation frailty index onto ~93 claims variables to produce a continuous 0–1 Medicare CFI."
      },
      {
        "role": "explain",
        "doi": "10.1093/gerona/62.7.722",
        "url": "https://doi.org/10.1093/gerona/62.7.722",
        "citation_text": "Rockwood K, Mitnitski A. Frailty in relation to the accumulation of deficits. J Gerontol A Biol Sci Med Sci. 2007;62(7):722-727.",
        "year": 2007,
        "authors_short": "Rockwood & Mitnitski",
        "notes": "The deficit-accumulation frailty framework underlying the Kim claims-based index."
      },
      {
        "role": "demonstrate",
        "doi": "10.1093/gerona/gly197",
        "url": "https://doi.org/10.1093/gerona/gly197",
        "citation_text": "Kim DH, Glynn RJ, Avorn J, et al. Validation of a claims-based frailty index against physical performance and adverse health outcomes in the Health and Retirement Study. J Gerontol A Biol Sci Med Sci. 2019;74(8):1271-1276.",
        "year": 2019,
        "authors_short": "Kim et al.",
        "notes": "Validated the claims-based frailty index against survey-measured physical performance and outcomes in the HRS."
      }
    ],
    "plain_language_summary": "Frailty is about how much physical reserve a person has left — whether they are robust or worn down — and it predicts bad outcomes on top of which diseases someone has. The problem is that claims data almost never record frailty directly. A claims-based frailty index gets around this by looking at the footprints frailty leaves in billing data: wheelchairs, home health visits, oxygen, hospital beds, and similar services. The Faurot version uses these to predict the chance a person is dependent on others for daily activities; the Kim version adds up the share of possible health \"deficits\" a person has to land them on a 0-to-1 frailty scale. Researchers use the resulting score to make treatment groups fairer to compare, because frailer patients are often steered toward or away from certain treatments. The honest caveats: it is an educated guess, not a real frailty exam; it only works where the underlying services get coded; and you must measure it before treatment starts so you do not accidentally adjust away the very effect you are studying.",
    "key_terms": [
      {
        "term": "frailty",
        "definition": "A state of reduced physical reserve and heightened vulnerability to stressors, distinct from simply having many diagnosed diseases."
      },
      {
        "term": "deficit-accumulation index",
        "definition": "A way of scoring frailty as the fraction of a long list of possible health problems a person actually has, landing them on a 0-to-1 scale (the Rockwood approach Kim's index uses)."
      },
      {
        "term": "predicted probability of dependency",
        "definition": "The Faurot index's output — a model's estimate of how likely a patient is to need help with daily activities, used as a continuous frailty proxy."
      },
      {
        "term": "confounding by frailty",
        "definition": "The bias that arises when frailer patients are systematically more or less likely to get a treatment, so frailty distorts the apparent treatment effect unless adjusted for."
      }
    ],
    "worked_example": {
      "scenario": "We want a Faurot-style claims-based frailty score for one patient, expressed as a predicted probability of dependency. The model has an intercept and a coefficient for each frailty-proxy service; we add the intercept and the coefficients for the services the patient used in the baseline window to get the linear predictor, then pass it through the logistic function to get the probability the analyst uses as a continuous confounder.",
      "dataset": {
        "caption": "The frailty-proxy claims indicators for one patient over the baseline window, with the model coefficient each contributes to the linear predictor (intercept shown as its own row).",
        "columns": [
          "predictor",
          "present",
          "coefficient"
        ],
        "rows": [
          [
            "intercept",
            1,
            -3.0
          ],
          [
            "age_75_plus",
            1,
            0.9
          ],
          [
            "wheelchair_or_dme",
            1,
            1.1
          ],
          [
            "hospital_bed",
            1,
            0.8
          ],
          [
            "home_oxygen",
            0,
            0.5
          ],
          [
            "skilled_home_health",
            1,
            0.7
          ]
        ]
      },
      "steps": [
        "Keep the coefficient for each predictor the patient actually has (present = 1) plus the intercept; home oxygen is absent, so its 0.5 contributes nothing.",
        "Add the contributing coefficients to form the linear predictor: 0.9 + 1.1 + 0.8 + 0.7 - 3.0 = 0.5.",
        "Convert the linear predictor to a probability with the logistic function 1/(1 + e^(-0.5)), which gives about 0.62.",
        "Carry that 0.62 predicted probability of dependency into the outcome model as a continuous frailty confounder (not a frail/not-frail label)."
      ],
      "result": "Linear predictor = 0.9 + 1.1 + 0.8 + 0.7 - 3.0 = 0.5; the logistic transform 1/(1 + e^(-0.5)) gives a predicted frailty probability of about 0.62. The same coefficient set and baseline window are applied identically to every patient, and the continuous score (not a dichotomized label) enters the adjustment."
    },
    "prerequisites": [
      "diagnosis-phenotype-algorithm-1ip-2op-time-window-rwe",
      "continuous-enrollment-observable-time-rwe",
      "healthy-user-bias"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Faurot predicted-probability index",
        "description": "A logistic model predicting an ADL-dependency frailty proxy from claims indicators (DME, home health, oxygen, mobility codes, demographics); output is a continuous predicted probability used as a confounder.",
        "edge_cases": [
          "Parsimonious but tied to a specific dependency outcome and derivation population; check calibration before reuse.",
          "Ascertain strictly pre-index so treatment-induced decline does not enter the score."
        ],
        "data_source_notes": "claims: requires reliable coding of DME/home-health/oxygen services; MA encounter data may under-capture these."
      },
      {
        "name": "Kim deficit-accumulation CFI (Medicare)",
        "description": "Maps a Rockwood-style deficit-accumulation frailty index onto ~93 claims variables, yielding a continuous 0–1 index validated against the HRS survey frailty index.",
        "edge_cases": [
          "Richer but heavier; needs the full published variable list and coefficients applied over the correct window.",
          "Calibrated in Medicare FFS; applying to MA, commercial, or non-US data needs re-validation."
        ],
        "data_source_notes": "claims/linked: best in Medicare FFS with full service capture; validation used linked survey data (HRS)."
      },
      {
        "name": "Continuous score vs dichotomized frailty category",
        "description": "Use the index as a continuous covariate (preferred for adjustment) or, only when a label is needed for description, categorize at published cut points (robust / pre-frail / frail).",
        "edge_cases": [
          "Dichotomizing for adjustment loses information and weakens confounding control; keep continuous when adjusting.",
          "Cut points are population-specific; do not transport thresholds without checking the score distribution."
        ],
        "data_source_notes": "any: report whether the score entered continuously or categorized, and the cut points if used."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Charlson / Elixhauser comorbidity indices",
        "pros_of_this": "Captures functional decline and care-dependency proxies (DME, home health, oxygen) that disease-counting indices ignore, reducing residual confounding by frailty in older populations.",
        "cons_of_this": "Less familiar, requires a fitted coefficient set, and predicts a latent construct with measurement error rather than counting observed diagnoses.",
        "when_to_prefer": "Whenever frailty plausibly drives both treatment and outcome — use it alongside, not instead of, a comorbidity index."
      },
      {
        "compared_to": "High-dimensional propensity score (hdPS)",
        "pros_of_this": "Pre-specified, transparent, and validated against survey frailty; guaranteed to be in the model.",
        "cons_of_this": "Captures only the published frailty proxies, whereas hdPS may empirically surface additional confounding proxies.",
        "when_to_prefer": "When you want a validated, interpretable frailty adjustment; combine with hdPS for broader control."
      },
      {
        "compared_to": "Direct frailty assessment (Fried phenotype, clinical frailty scale)",
        "pros_of_this": "Computable from existing claims when no direct frailty measure was ever recorded.",
        "cons_of_this": "A proxy with measurement error that depends on service coding, not a gold-standard physical assessment.",
        "when_to_prefer": "When direct frailty measurement is unavailable (the usual case in claims-based RWE)."
      }
    ],
    "implementation_notes_by_data_source": {},
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\nimport pandas as pd\n\nLOOKBACK_DAYS = 365\nINTERCEPT = -3.0\nAGE_75_PLUS_COEF = 0.9  # illustrative age_75_plus coefficient\n\n# predictor -> (code regex over the window, logistic coefficient). ILLUSTRATIVE; replace with published model.\nPREDICTORS = {\n    \"wheelchair_or_dme\": (r\"^(K000[0-9]|E114[0-9]|E124[0-9])\", 1.1),\n    \"hospital_bed\":      (r\"^(E029[0-9]|E030[0-9])\", 0.8),\n    \"home_oxygen\":       (r\"^(E0431|E0439|E1390)\", 0.5),\n    \"skilled_home_health\": (r\"^(G015[0-9]|G016[0-9])\", 0.7),\n}\n\ndef frailty_score(claims: pd.DataFrame, base: pd.DataFrame) -> pd.DataFrame:\n    # Keep only claims strictly before index and inside the lookback window.\n    df = claims.merge(base[[\"person_id\", \"index_date\"]], on=\"person_id\", how=\"inner\")\n    win = df[(df[\"svc_date\"] < df[\"index_date\"]) &\n             (df[\"svc_date\"] >= df[\"index_date\"] - pd.Timedelta(days=LOOKBACK_DAYS))]\n\n    # Per person, flag a predictor if any in-window code matches its regex.\n    present = {}\n    for pid, g in win.groupby(\"person_id\"):\n        codes = g[\"code\"].astype(str)\n        present[pid] = {p: bool(codes.str.match(rx).any()) for p, (rx, _) in PREDICTORS.items()}\n\n    # Score everyone in base; people with no in-window claims get all flags False.\n    rows = []\n    for _, r in base.iterrows():\n        pid = r[\"person_id\"]\n        p = present.get(pid, {k: False for k in PREDICTORS})\n        lp = INTERCEPT + (AGE_75_PLUS_COEF if r[\"age\"] >= 75 else 0.0)\n        lp += sum(coef for k, (_, coef) in PREDICTORS.items() if p[k])\n        prob = 1.0 / (1.0 + np.exp(-lp))                            # logistic transform\n        rows.append({\"person_id\": pid, \"frailty_lp\": lp, \"frailty_prob\": prob})\n    return pd.DataFrame(rows)",
        "description": "Compute a Faurot-style predicted-probability claims frailty score over a fixed pre-index window. Inputs:\n  claims : person_id, code (HCPCS/ICD string), svc_date (datetime)\n  base   : person_id, index_date (datetime), age (int)\nPREDICTORS maps each frailty proxy to (code regex, logistic coefficient); INTERCEPT and coefficients are\nILLUSTRATIVE — substitute the published Faurot (or Kim) variable list and weights. Output is a continuous\npredicted probability of dependency.",
        "dependencies": [
          "numpy",
          "pandas"
        ],
        "source_citations": [
          "faurot-2015",
          "kim-2018"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\n\nLOOKBACK_DAYS <- 365L\nINTERCEPT <- -3.0\n\npredictors <- list(  # name = list(regex, coefficient) -- ILLUSTRATIVE\n  wheelchair_or_dme   = list(\"^(K000[0-9]|E114[0-9]|E124[0-9])\", 1.1),\n  hospital_bed        = list(\"^(E029[0-9]|E030[0-9])\", 0.8),\n  home_oxygen         = list(\"^(E0431|E0439|E1390)\", 0.5),\n  skilled_home_health = list(\"^(G015[0-9]|G016[0-9])\", 0.7)\n)\n\nfrailty_score <- function(claims, base) {\n  setDT(claims); setDT(base)\n  df <- merge(claims, base[, .(person_id, index_date)], by = \"person_id\")\n  win <- df[svc_date < index_date & svc_date >= index_date - LOOKBACK_DAYS]\n\n  flag_one <- function(codes)\n    sapply(predictors, function(pc) any(grepl(pc[[1]], codes)))\n  fl <- win[, as.list(flag_one(as.character(code))), by = person_id]\n\n  out <- merge(base[, .(person_id, age)], fl, by = \"person_id\", all.x = TRUE)\n  for (p in names(predictors)) out[is.na(get(p)), (p) := FALSE]\n  out[, lp := INTERCEPT + fifelse(age >= 75L, 0.9, 0.0)]\n  for (p in names(predictors))\n    out[, lp := lp + as.integer(get(p)) * predictors[[p]][[2]]]\n  out[, frailty_prob := 1 / (1 + exp(-lp))]\n  out[, .(person_id, frailty_lp = lp, frailty_prob)]\n}",
        "description": "R/data.table version of the Faurot-style predicted-probability frailty score. Inputs mirror Python:\n  claims : person_id, code (character HCPCS/ICD), svc_date (Date)\n  base   : person_id, index_date (Date), age (integer)\nReplace the illustrative predictor regexes and coefficients with the published Faurot/Kim model.",
        "dependencies": [
          "data.table"
        ],
        "source_citations": [
          "faurot-2015",
          "kim-2018"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "%let lookback = 365;\n%let intercept = -3.0;\n\nproc sql;\n  create table win as\n  select c.person_id, c.code, b.index_date, b.age\n  from work.claims c inner join work.base b on c.person_id = b.person_id\n  where c.svc_date < b.index_date and c.svc_date >= b.index_date - &lookback;\nquit;\n\ndata flags;\n  set win;\n  by person_id;\n  retain wheelchair hosp_bed oxygen home_health 0;\n  if first.person_id then do; wheelchair=0; hosp_bed=0; oxygen=0; home_health=0; end;\n  c = cats(code);\n  if prxmatch(\"/^(K000[0-9]|E114[0-9]|E124[0-9])/\", c) then wheelchair=1;\n  if prxmatch(\"/^(E029[0-9]|E030[0-9])/\", c)           then hosp_bed=1;\n  if prxmatch(\"/^(E0431|E0439|E1390)/\", c)             then oxygen=1;\n  if prxmatch(\"/^(G015[0-9]|G016[0-9])/\", c)           then home_health=1;\n  if last.person_id then output;\n  keep person_id age wheelchair hosp_bed oxygen home_health;\nrun;\n\ndata frailty;\n  set flags;\n  lp = &intercept\n       + (age >= 75) * 0.9\n       + wheelchair * 1.1\n       + hosp_bed   * 0.8\n       + oxygen     * 0.5\n       + home_health* 0.7;\n  frailty_prob = 1 / (1 + exp(-lp));   /* logistic transform */\nrun;",
        "description": "SAS build of the Faurot-style predicted-probability frailty score over a pre-index window. Inputs:\n  work.claims : person_id, code (HCPCS/ICD), svc_date\n  work.base   : person_id, index_date, age\nIllustrative predictor prefixes and coefficients; substitute the published Faurot/Kim model. Output is a\ncontinuous predicted probability of dependency.",
        "dependencies": [],
        "source_citations": [
          "faurot-2015",
          "kim-2018"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  C[Pre-index claims<br/>fixed window, continuous enrollment] --> P[Flag frailty-proxy services<br/>DME, home health, oxygen, hospital bed]\n  P --> M{Algorithm}\n  M -->|Faurot| L[Logistic model<br/>intercept + sum of coefficients]\n  M -->|Kim| D[Deficit-accumulation<br/>fraction of ~93 possible deficits]\n  L --> T[Logistic transform to<br/>predicted probability of dependency]\n  D --> R[Continuous 0-1 frailty index]\n  T --> U[Enter as continuous confounder / PS input<br/>do not dichotomize for adjustment]\n  R --> U",
        "caption": "Claims-based frailty construction — ascertain frailty-proxy services strictly before index, then apply either the Faurot logistic predicted-probability model or the Kim deficit-accumulation index, and enter the continuous score as a confounder.",
        "alt_text": "Flowchart from pre-index claims through frailty-proxy service flags to a choice between the Faurot logistic predicted probability and the Kim deficit-accumulation index, feeding a continuous frailty confounder.",
        "source_type": "illustrative",
        "source_citations": [
          "faurot-2015",
          "kim-2018"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "see_also",
        "target_slug": "charlson-comorbidity-index-rwe",
        "notes": "Charlson counts diseases; the frailty index captures functional reserve. They adjust for different risk axes and are often used together."
      },
      {
        "relation_type": "see_also",
        "target_slug": "elixhauser-comorbidity-index-rwe",
        "notes": "Like Charlson, Elixhauser measures comorbidity burden, not frailty; the CFI complements it in older populations."
      },
      {
        "relation_type": "used_with",
        "target_slug": "propensity-score-methods-psm-iptw",
        "notes": "The continuous frailty score is a standard propensity-score input for controlling confounding by frailty."
      },
      {
        "relation_type": "used_with",
        "target_slug": "healthy-user-bias",
        "notes": "Confounding by frailty is a core driver of healthy-user/frailty-by-indication effects; the CFI is the standard adjustment for it."
      },
      {
        "relation_type": "complements",
        "target_slug": "high-dimensional-propensity-score-hdps-rwe",
        "notes": "hdPS may empirically surface frailty proxies; a pre-specified CFI guarantees a validated frailty adjustment is present. The two are often combined."
      },
      {
        "relation_type": "requires",
        "target_slug": "continuous-enrollment-observable-time-rwe",
        "notes": "The frailty score depends on coded services over an equal pre-index continuous-enrollment window to be comparable across groups."
      },
      {
        "relation_type": "used_with",
        "target_slug": "medicare-ffs-ma-commercial-claims-differences-rwe",
        "notes": "The CFI is calibrated in Medicare FFS; MA encounter data under-capture DME/home-health services and bias the score, so FFS coefficients should not be applied uncritically to MA person-time."
      }
    ],
    "aliases": [
      "CFI",
      "claims-based frailty index",
      "Faurot frailty index",
      "Kim frailty index",
      "claims-based frailty score"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "claims-outcome-algorithm-ppv-sensitivity-rwe",
    "name": "Claims Outcome Algorithm PPV/Sensitivity Trade-off",
    "short_definition": "The deliberate trade-off, when defining a claims- or EHR-based outcome algorithm, between a narrow high-PPV case definition that minimizes false positives and a broad high-sensitivity definition that minimizes missed events, chosen to match the estimand and the differential vs non-differential structure of the resulting outcome misclassification.",
    "long_description": "A **claims-based outcome algorithm** is the operational rule — diagnosis codes, code positions, encounter setting, procedure\nor lab confirmation, and timing windows — that converts raw administrative records into a binary \"case / non-case\" indicator\nfor analysis. Because no code list perfectly captures a clinical event, every algorithm sits on a frontier defined by its\n**positive predictive value (PPV)** (of records flagged as cases, the fraction that are true cases), **sensitivity** (of\ntrue cases, the fraction the algorithm flags), and **specificity** (of true non-cases, the fraction correctly left\nunflagged). The named trade-off is the choice along that frontier: tightening the rule (e.g., requiring a *primary*-position\ninpatient diagnosis plus a confirmatory troponin or revascularization procedure for acute myocardial infarction) raises PPV\nbut lowers sensitivity; loosening it (any-position diagnosis in any setting) raises sensitivity but admits false positives\nand lowers PPV.\n\n**Core conceptual distinction — why the trade-off is not symmetric, and why the estimand decides.** Outcome\nmisclassification distorts different estimands in different, predictable ways, and this is the single point a methodologist\nwill check first.\n- *Non-differential misclassification* (the algorithm's PPV/sensitivity are the **same in both exposure arms**): for a\n  relative effect (risk ratio, rate ratio, hazard ratio) with a binary, reasonably rare outcome, non-differential error\n  biases the estimate **toward the null** in expectation — but it is *not* guaranteed monotone in small samples or with\n  competing forms of error, so \"it only attenuates\" is a heuristic, not a license. Imperfect *sensitivity* alone (with\n  PPV = 1, i.e., you miss cases but never invent them) leaves the **rate ratio unbiased** while biasing absolute rates\n  downward; imperfect *specificity* (false positives) attenuates the ratio. 
This asymmetry is the practical reason a\n  high-PPV, lower-sensitivity algorithm is often preferred for comparative (ratio) estimands: a small, exposure-balanced\n  deficit in case capture costs little, whereas false positives directly dilute the contrast.\n- *Differential misclassification* (sensitivity or specificity **differs by exposure**) can bias a ratio estimate in **either\n  direction** and is the dangerous case. It arises whenever the exposure changes the probability of being worked up or\n  coded: a drug whose label prompts monitoring (e.g., LFTs, ECGs, imaging) generates more diagnostic encounters and thus\n  more coded events in its arm (detection / surveillance bias) even if true incidence is identical. No amount of\n  high-PPV tightening fixes differential capture; only a validation study that estimates the operating characteristics\n  *within each arm*, a quantitative-bias correction, or a design fix (active comparator with similar monitoring) addresses it.\n- For **absolute** estimands (incidence rate, cumulative incidence, number needed to harm, decision-analytic event\n  probabilities feeding a cost-effectiveness model), even non-differential error matters directly: a high-PPV/low-sensitivity\n  algorithm systematically *undercounts* events and must be PPV/sensitivity-corrected before the number is used.\n\n**Pros, cons, and trade-offs (named alternatives).**\n- **High-PPV narrow algorithm vs high-sensitivity broad algorithm.** The narrow rule (primary-position, inpatient,\n  procedure/lab-confirmed) yields clean contrasts for ratio estimands and is the default for safety signals where a false\n  positive is costly; its cost is undercounted absolute risk and lost power when the true event is uncommon. The broad rule\n  maximizes event capture (useful for screening, feasibility counts, or when sensitivity-corrected absolute rates are the\n  target) but pays in lower PPV and, if capture is exposure-related, in differential bias. 
**Prefer narrow** for comparative\n  ratio estimands and regulatory safety; **prefer broad with a validation-based correction** when an unbiased absolute rate\n  is the deliverable.\n- **vs adjudicated / chart-reviewed endpoints.** Full medical-record adjudication (see `endpoint-adjudication-chart-review-rwe`)\n  is the reference standard and removes the trade-off — but is infeasible at full-cohort scale, slow, and impossible in\n  de-identified claims without linkage. The standard compromise is a **claims algorithm validated against an adjudicated\n  subsample** to estimate PPV (and, with a sampling frame over non-flagged records, sensitivity), then either accept a\n  high-PPV algorithm or quantitatively correct.\n- **vs quantitative bias analysis on a fixed algorithm** (`quantitative-bias-analysis-toolkit-rwe`,\n  `misclassification-bias-correction-rwe`). Rather than chase a perfect code list, fix a transparent algorithm and *correct*\n  the estimate using externally or internally estimated sensitivity/specificity/PPV (matrix correction, multiple\n  imputation, or probabilistic bias analysis). **Prefer correction** when validation data or credible priors exist and the\n  estimand is absolute or the misclassification may be differential.\n\n**When to use this trade-off explicitly (decision rules).** Whenever the outcome is derived from codes rather than\nadjudicated: pre-specify the algorithm and its expected PPV in the protocol/SAP, justify the position on the\nPPV–sensitivity frontier by the estimand (ratio → favor PPV; absolute → favor a known, correctable sensitivity), and\npre-specify at least one sensitivity analysis that swaps a narrow for a broad definition (or vice versa). 
For regulatory\nsafety endpoints, anchor to a published validation (PPV with CI) for the specific code set, setting, and population.\n\n**When NOT to use / when this is actively misleading.**\n- **Do not** treat a PPV borrowed from a different population, coding era (ICD-9-CM vs ICD-10-CM), or care setting as your\n  own. PPV depends on the prevalence of the true outcome in the population the algorithm is applied to and is **not transportable**; a PPV measured in a\n  referral hospital need not apply to a community FFS population. Specificity and sensitivity travel better but still drift\n  across settings.\n- **Do not** claim a high-PPV algorithm \"removes bias\" when the threat is *differential* capture. It does not: you have a\n  clean numerator in each arm but the arms were screened differently.\n- **Do not** apply a non-differential attenuation argument to an absolute rate, to a non-rare outcome, or to a\n  case-finding window that interacts with follow-up time; in those settings the toward-the-null heuristic can fail.\n- **Do not** use any-position diagnosis codes from the index hospitalization as the outcome when the same codes also\n  define eligibility or the exposure-triggering event, because you will manufacture immortal-time-like or reverse-causation\n  artifacts; require the event code to fall strictly after time zero.\n\n**Data-source operational depth (with real failure modes and workarounds).**\n- **Claims (FFS vs Medicare Advantage vs commercial).** Build the algorithm from diagnosis codes (with *code position* and\n  *place of service*), procedure codes, and drug fills (`fill_date`, `days_supply`, NDC). *Failure mode:* in Medicare,\n  **MA-only person-time lacks complete fee-for-service claims**, so events are silently under-ascertained for MA enrollees,\n  and differentially so if exposure correlates with plan type; *workaround:* restrict to FFS Parts A/B (and D for drug outcomes) or\n  flag and sensitivity-analyze MA spans. 
*Failure mode:* same-day duplicate and reversed/denied claims inflate event\n  counts; *workaround:* dedupe on `person_id`+`claim_date`+code and drop reversals before counting. *Failure mode:*\n  adjudication lag and claims run-out truncate recent events; *workaround:* enforce a run-out buffer and a closed\n  ascertainment window. *Failure mode:* differential **competing risks** — in elderly claims cohorts, the sicker arm dies\n  before the non-fatal coded event can occur, so a naive cumulative-incidence comparison of the algorithm-defined outcome\n  is biased; *workaround:* model with cause-specific or Fine–Gray methods (`competing-risks-cause-specific-fine-gray-rwe`).\n- **EHR.** Capture is **encounter-driven**: a true event managed entirely outside the system (an out-of-network ED visit)\n  is invisible, lowering sensitivity differentially by who stays in-network. Structured fields (problem lists, labs) can\n  confirm cases and raise PPV, but missing structured data and free-text-only documentation suppress sensitivity. *Workaround:*\n  define an explicit observation window per patient and treat external-care leakage as informative; consider NLP for\n  note-based confirmation and report capture by site and calendar time.\n- **Registry.** Often the strongest case definition (adjudicated, staged) but **incomplete enrollment** and eligibility\n  rules limit denominator validity; outcome capture may stop at registry exit. 
*Workaround:* link to claims/EHR for\n  continued follow-up and to a death index for fatal events and censoring.\n- **Linked claims–EHR–vital records.** The ideal substrate for *measuring* the algorithm's PPV/sensitivity (EHR/registry\n  truth, claims breadth, reliable mortality), but linkage selection (only the linkable subset) must be assessed, and date\n  discrepancies between service, claim, and adjudication dates must be reconciled before counting an event relative to time zero.\n\n**Worked claims example (Medicare AMI, with the actual PPV correction arithmetic).** Question: incidence of acute\nmyocardial infarction (AMI) in a Medicare FFS cohort. (1) *Continuous enrollment / washout:* require Parts A and B FFS\nenrollment for 365 days before index; exclude MA-only person-time so the absence of a prior AMI is observed, not missing.\nDefine **incident** AMI by requiring no AMI diagnosis in that 365-day baseline. (2) *Algorithm (high-PPV variant):* an AMI\nis a hospitalization with ICD-10-CM I21.x in the **primary** diagnosis position and length of stay ≥ 1 day (or in-hospital\ndeath), with the admission date strictly after time zero; this adapts to ICD-10-CM the kind of definition Kiyota et al. validated\n(with ICD-9-CM codes) against hospital-record review in Medicare, where the primary-position inpatient code achieved high PPV. (3) *First-event coding:*\ntake the first qualifying admission per `person_id`; drop same-day duplicate and reversed claims. (4) *Time windows:*\nascertain events only within the closed follow-up window (time zero to disenrollment, death, or a fixed end date minus a\n90-day claims run-out buffer). (5) *Validation + correction:* in an adjudicated subsample, suppose the algorithm flags 100\nrecords of which chart review confirms 90, so PPV = 0.90 (95% CI ≈ 0.82–0.95, exact Clopper–Pearson). 
If the algorithm counts 1,000 AMIs in\nthe cohort, the **PPV-corrected true-case count** is 1,000 × 0.90 = **900** events; propagating the PPV CI gives roughly\n820–950 corrected events, which then feed the incidence rate (and any downstream cost or QALY model) instead of the raw\n1,000. If sensitivity is also estimable from a sample of *non-flagged* records, the corrected count generalizes to\ncorrected = observed × PPV ÷ sensitivity, and the gap between the high-PPV and a high-sensitivity broad definition becomes\nthe headline sensitivity analysis.\n\n**Interpreting the output**\n\nFrom the worked example (simpler numbers): TP = 80, FP = 20, FN = 20, TN = 30. PPV = 80 / 100 = 0.80;\nsensitivity = 80 / 100 = 0.80. From the claims cohort, 200 flagged AMIs → PPV-corrected true-case\ncount = 200 × 0.80 = 160. Correcting further for sensitivity: 160 / 0.80 = 200 total true AMIs.\n\n*(1) Formal interpretation.* PPV = 0.80 means that 20% of algorithm-flagged cases are false positives;\nthey inflate the numerator of any incidence rate or event count. Under non-differential misclassification\n(sensitivity and specificity equal across arms), false positives dilute the contrast toward the null in a\ncomparative relative-risk study, so the corrected rate ratio will typically be further from 1.0 than the observed one. Under differential\nmisclassification (operating characteristics differ by arm), bias can go in either direction, and arm-specific\nvalidation estimates are required. The PPV-corrected count (160) still understates total true AMIs because\nsensitivity is also < 1.0: true events the algorithm missed (false negatives) never appear in the flagged count.\nThe fully corrected count (200) requires both PPV and sensitivity to be estimable from the validation.\n\n*(2) Practical interpretation.* Treating every algorithm-flagged record as a confirmed event overstates\nthe confirmed count by 25% in this example (200 flagged vs 160 confirmed); the raw count matches the fully\ncorrected total of 200 only because PPV and sensitivity happen to be equal here. 
For a decision-analytic model\nor a cost-per-event calculation, the corrected figure (PPV-corrected, and also divided by sensitivity when total\nincidence is the target) is the appropriate input. For a comparative HR,\nthe direction of bias matters more than the absolute count: analysts must state and justify the\ndifferential vs non-differential assumption before interpreting any corrected ratio.",
    "primary_category": "Outcome_Measure",
    "tags": [
      "outcome_measure",
      "outcome-algorithm-construction",
      "positive-predictive-value",
      "sensitivity-specificity",
      "outcome-misclassification",
      "claims-validation",
      "quantitative-bias-analysis"
    ],
    "applies_to_study_types": [
      "claims_analysis",
      "ehr_study",
      "registry_linkage",
      "comparative_effectiveness",
      "drug_safety"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1016/j.jclinepi.2004.10.012",
        "url": "https://doi.org/10.1016/j.jclinepi.2004.10.012",
        "citation_text": "Schneeweiss S, Avorn J. A review of uses of health care utilization databases for epidemiologic research on therapeutics. Journal of Clinical Epidemiology. 2005;58(4):323-337.",
        "year": 2005,
        "authors_short": "Schneeweiss & Avorn",
        "notes": "Foundational treatment of claims/utilization databases for therapeutics research, including the necessity of validating code-based outcome definitions and the PPV/sensitivity considerations that govern algorithm choice."
      },
      {
        "role": "explain",
        "doi": "10.1093/ije/dyu149",
        "url": "https://doi.org/10.1093/ije/dyu149",
        "citation_text": "Lash TL, Fox MP, MacLehose RF, et al. Good practices for quantitative bias analysis. International Journal of Epidemiology. 2014;43(6):1969-1985.",
        "year": 2014,
        "authors_short": "Lash et al.",
        "notes": "Canonical good-practices reference for the bias-analysis machinery (sensitivity/specificity/PPV) used to quantify and correct outcome misclassification, including its direction under differential vs non-differential error."
      },
      {
        "role": "explain",
        "doi": "10.1136/bmj.h5527",
        "url": "https://doi.org/10.1136/bmj.h5527",
        "citation_text": "Bossuyt PM, Reitsma JB, Bruns DE, et al. STARD 2015: an updated list of essential items for reporting diagnostic accuracy studies. BMJ. 2015;351:h5527.",
        "year": 2015,
        "authors_short": "Bossuyt et al.",
        "notes": "Reporting standard for diagnostic-accuracy / validation studies; the template for transparently reporting the PPV, sensitivity, and specificity of a claims outcome algorithm against an adjudicated reference standard."
      },
      {
        "role": "demonstrate",
        "doi": "10.1016/j.ahj.2004.02.013",
        "url": "https://doi.org/10.1016/j.ahj.2004.02.013",
        "citation_text": "Kiyota Y, Schneeweiss S, Glynn RJ, et al. Accuracy of Medicare claims-based diagnosis of acute myocardial infarction: estimating positive predictive value on the basis of review of hospital records. American Heart Journal. 2004;148(1):99-104.",
        "year": 2004,
        "authors_short": "Kiyota et al.",
        "notes": "Canonical Medicare validation showing how code position and care setting move the PPV of an AMI claims algorithm; the empirical basis for the high-PPV worked example."
      },
      {
        "role": "demonstrate",
        "doi": "10.1002/pds.5601",
        "url": "https://doi.org/10.1002/pds.5601",
        "citation_text": "Lanes S, Beachler DC. Validation to correct for outcome misclassification bias. Pharmacoepidemiology and Drug Safety. 2023;32(6):700-703.",
        "year": 2023,
        "authors_short": "Lanes & Beachler",
        "notes": "Concise demonstration of using validation-substudy operating characteristics (PPV/sensitivity/specificity) to correct exposure-outcome associations for outcome misclassification."
      },
      {
        "role": "use",
        "doi": "10.1371/journal.pmed.1001885",
        "url": "https://doi.org/10.1371/journal.pmed.1001885",
        "citation_text": "Benchimol EI, Smeeth L, Guttmann A, et al. The REporting of studies Conducted using observational Routinely-collected health Data (RECORD) statement. PLoS Medicine. 2015;12(10):e1001885.",
        "year": 2015,
        "authors_short": "Benchimol et al.",
        "notes": "Routine-data reporting standard requiring transparent code lists and validation status for outcome definitions in claims/EHR studies."
      }
    ],
    "plain_language_summary": "When researchers use insurance claims to count a medical event — say, heart attacks in a diabetes drug study — they have to write a rule that says which billing codes count as a real event. Every rule sits on a trade-off: a tight rule (requiring a specific primary-position hospital code plus a confirming procedure) catches mainly true events but misses some real ones; a loose rule (any mention of the code, anywhere) catches nearly all real events but also flags many false alarms. Positive predictive value (PPV) measures how often a flagged record is a true event, while sensitivity measures how often a true event actually got flagged. For studies comparing two treatments on a relative scale (like a risk ratio), researchers usually favor a high-PPV rule because false alarms dilute the comparison — but a high-PPV rule undercounts events, so absolute rates reported from it need a correction step using the validated PPV.",
    "key_terms": [
      {
        "term": "claims outcome algorithm",
        "definition": "The set of billing codes and rules an analyst writes to identify a medical event — such as a hospitalization for heart attack — from insurance records."
      },
      {
        "term": "positive predictive value (PPV)",
        "definition": "Of all the records the algorithm flags as events, PPV is the fraction that are truly events confirmed by chart review — it answers the question, how clean is my flagged list?"
      },
      {
        "term": "sensitivity",
        "definition": "Of all the true events that actually occurred, sensitivity is the fraction that the algorithm successfully flagged — it answers, how many real events am I capturing?"
      },
      {
        "term": "chart review",
        "definition": "The process of a clinician reading the actual medical record for a patient to confirm whether a coded event (like a heart attack) truly happened — used as the gold-standard reference."
      },
      {
        "term": "2x2 table",
        "definition": "A four-cell grid that cross-tabulates what the algorithm called (event or not) against what the chart review found (true event or not), producing the counts that go into PPV and sensitivity."
      }
    ],
    "worked_example": {
      "scenario": "A Medicare claims study is counting acute heart attacks (AMIs) in a cohort of older adults with Type 2 diabetes. The research team uses a high-PPV algorithm: a primary-position inpatient ICD-10-CM code starting with I21 (the AMI code family). The algorithm flags 200 hospitalizations across the full cohort. To validate it, the team pulls the actual hospital charts for 100 of those flagged records and has a cardiologist confirm whether a true AMI occurred. The cardiologist also reviews 50 records the algorithm did NOT flag, and finds 20 true AMIs the algorithm missed. The 2x2 table below shows the result. The goal is to compute PPV and sensitivity, then explain what the trade-off means for this study.",
      "dataset": {
        "caption": "2x2 validation table: algorithm result vs. chart-review truth for the 150 reviewed records (100 flagged + 50 unflagged).",
        "columns": [
          "",
          "Chart review: TRUE AMI",
          "Chart review: NOT an AMI",
          "Row total"
        ],
        "rows": [
          [
            "Algorithm FLAGGED",
            "TP = 80",
            "FP = 20",
            "100"
          ],
          [
            "Algorithm DID NOT FLAG",
            "FN = 20",
            "TN = 30",
            "50"
          ],
          [
            "Column total",
            "100",
            "50",
            "150"
          ]
        ]
      },
      "steps": [
        "Identify the four cells: TP (true positives) = 80, the records the algorithm flagged AND the chart confirms as true AMIs. FP (false positives) = 20, flagged by the algorithm but NOT confirmed as AMIs. FN (false negatives) = 20, NOT flagged but the chart shows a real AMI occurred. TN (true negatives) = 30, not flagged and confirmed not to be AMIs.",
        "Compute PPV: divide TP by all flagged records. PPV = TP / (TP + FP) = 80 / (80 + 20) = 80 / 100 = 0.80. This means 80 out of every 100 records the algorithm flags are genuine heart attacks.",
        "Compute sensitivity: divide TP by all true AMIs found in the reviewed subsample. Sensitivity = TP / (TP + FN) = 80 / (80 + 20) = 80 / 100 = 0.80. This means the algorithm captures 80 out of every 100 real AMIs in the reviewed records.",
        "In this particular example both values happen to equal 0.80, but they measure different things and will usually differ. Changing the algorithm rule changes them in opposite directions: tightening the rule (e.g., also requiring a troponin procedure code) would push PPV higher (fewer false alarms) but push sensitivity lower (more missed cases).",
        "Apply the PPV-vs-sensitivity tradeoff: the study reports a relative effect (a hazard ratio comparing two drugs). For that purpose, the team accepts the 0.80 PPV rule because false alarms dilute the comparison roughly equally in both arms. However, if the team also wants to report the absolute rate of AMIs per 1,000 patient-years, the raw algorithm count is too low — sensitivity of 0.80 means 20 percent of real AMIs were missed. To correct the absolute rate, the team would divide the PPV-corrected count by sensitivity: corrected true events = (observed events x PPV) / sensitivity."
      ],
      "result": "PPV = 80 / (80 + 20) = 80 / 100 = 0.80. Sensitivity = 80 / (80 + 20) = 80 / 100 = 0.80. If the full-cohort algorithm flagged 200 AMIs, the PPV-corrected true-case estimate is 200 x 0.80 = 160 true AMIs. Because sensitivity is also 0.80, the algorithm missed roughly 20 percent of real events, so the PPV-corrected count still understates total true AMIs — for an absolute rate the corrected estimate would be 160 / 0.80 = 200 total true AMIs estimated in the cohort."
    },
    "prerequisites": [
      "sensitivity-specificity-rwe",
      "ppv-npv-rwe",
      "outcome-algorithm-construction-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "High-PPV narrow algorithm (primary-position, setting/procedure-confirmed)",
        "description": "Requires a primary-position diagnosis in a specific care setting (e.g., inpatient) plus a confirmatory procedure, lab, or minimum length of stay. Maximizes PPV at the cost of sensitivity; the default for comparative ratio estimands and regulatory safety endpoints.",
        "edge_cases": [
          "Confirmatory procedures/labs are themselves capture-dependent; if the exposed arm is monitored more, \"confirmation\" becomes a source of differential ascertainment rather than a fix.",
          "Code-position fields are unreliable in some commercial extracts, collapsing the narrow/broad distinction."
        ],
        "data_source_notes": "claims: require I21.x (or analog) in the primary position + inpatient place of service; dedupe and drop reversals. EHR: confirm with structured labs/orders, not free text alone."
      },
      {
        "name": "High-sensitivity broad algorithm (any-position, any-setting)",
        "description": "Counts the outcome on any-position diagnosis in any care setting (inpatient, outpatient, ED). Maximizes event capture for feasibility counts, screening, or sensitivity-corrected absolute rates; lower PPV and greater exposure to differential capture.",
        "edge_cases": [
          "Outpatient/ED rule-out and \"history of\" codes inflate false positives; restrict to confirmed (not rule-out) coding where available.",
          "Higher event counts can mask that the extra events are disproportionately false positives in the more-surveilled arm."
        ],
        "data_source_notes": "claims: include outpatient and ED settings and any diagnosis position; pair with a validation-based PPV correction before reporting absolute rates."
      },
      {
        "name": "Validated-and-corrected algorithm (fixed rule + bias correction)",
        "description": "Fixes a transparent algorithm, estimates PPV (and ideally sensitivity/specificity) in an adjudicated subsample or from external validation, and corrects the point estimate and CI via matrix correction, multiple imputation, or probabilistic bias analysis.",
        "edge_cases": [
          "PPV is not transportable across populations/coding eras; an externally borrowed PPV must be justified or replaced with an internal estimate.",
          "Differential operating characteristics require arm-specific validation, not a single pooled PPV."
        ],
        "data_source_notes": "linked: linkage to charts/registry/EHR truth is the natural substrate for measuring operating characteristics; reconcile service/claim/adjudication dates before counting."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "High-sensitivity broad outcome definition",
        "pros_of_this": "A high-PPV narrow definition yields cleaner numerators and less attenuation for comparative (ratio) estimands; preferred for regulatory safety where false positives are costly.",
        "cons_of_this": "Undercounts absolute incidence and loses power when the true event is uncommon; missed cases, if exposure-related, still bias the contrast.",
        "when_to_prefer": "Ratio estimands (RR/HR) and safety signals where non-differential under-capture costs little but false positives directly dilute the effect."
      },
      {
        "compared_to": "Adjudicated / chart-reviewed endpoints",
        "pros_of_this": "A claims algorithm is feasible at full-cohort scale, fast, and possible in de-identified data; no per-event record pull.",
        "cons_of_this": "Always imperfect; requires a validation substudy to quantify PPV/sensitivity and may need bias correction for absolute or differential settings.",
        "when_to_prefer": "When full adjudication is infeasible and a validated, transparently reported algorithm meets the estimand."
      },
      {
        "compared_to": "Quantitative bias analysis on a fixed algorithm",
        "pros_of_this": "Choosing a higher-PPV algorithm needs no priors or validation data and is simpler to pre-specify and defend.",
        "cons_of_this": "Cannot recover an unbiased absolute rate and cannot fix differential capture; tightening the rule does not address surveillance bias.",
        "when_to_prefer": "Comparative ratio estimands with non-differential, modest misclassification and no validation data."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Build from diagnosis codes with code position and place of service, procedures, and drug fills (fill_date, days_supply, NDC). Require the event strictly after time zero, dedupe same-day/reversed claims, enforce a closed ascertainment window with a run-out buffer, and restrict to FFS person-time (exclude MA-only spans) so events are completely captured. Account for competing risks (death before the coded non-fatal event) in elderly cohorts.",
      "ehr": "Capture is encounter-driven; out-of-system care lowers sensitivity, differentially by who stays in-network. Confirm cases with structured labs/orders to raise PPV; define explicit observation windows and report capture by site and calendar time.",
      "registry": "Strongest case definitions (adjudicated/staged) but incomplete enrollment limits denominators; link to claims/EHR for continued follow-up and to a death index for fatal events.",
      "linked": "Ideal substrate for measuring PPV/sensitivity (truth + breadth + mortality), but linkage selection and service/claim/adjudication date discrepancies must be reconciled before assigning events relative to time zero."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\nimport numpy as np\nfrom scipy.stats import norm\n\nAMI_CODES = (\"I21\",)  # ICD-10-CM AMI family; match on prefix\n\ndef derive_outcome(dx: pd.DataFrame, cohort: pd.DataFrame,\n                   primary_only: bool = True, settings=(\"IP\",)) -> pd.DataFrame:\n    \"\"\"First incident outcome per person, strictly after time zero, within the ascertainment window.\n\n    primary_only=True, settings=('IP',)        -> high-PPV narrow algorithm\n    primary_only=False, settings=('IP','OP','ED') -> high-sensitivity broad algorithm\n    \"\"\"\n    ev = dx[dx[\"dx_code\"].str.startswith(AMI_CODES)].copy()\n    if primary_only:\n        ev = ev[ev[\"dx_position\"] == 1]\n    ev = ev[ev[\"place_of_service\"].isin(settings)]\n\n    ev = ev.merge(cohort[[\"person_id\", \"index_date\", \"fu_end\"]], on=\"person_id\", how=\"inner\")\n    # Event must fall strictly after time zero and within the closed follow-up window.\n    ev = ev[(ev[\"claim_date\"] > ev[\"index_date\"]) & (ev[\"claim_date\"] <= ev[\"fu_end\"])]\n    # First qualifying event per person (incident).\n    first = (ev.sort_values([\"person_id\", \"claim_date\"])\n               .groupby(\"person_id\", as_index=False)\n               .first()[[\"person_id\", \"claim_date\"]]\n               .rename(columns={\"claim_date\": \"event_date\"}))\n    return first\n\ndef wilson_ci(x: int, n: int, alpha: float = 0.05):\n    \"\"\"Wilson score interval for a proportion (robust for PPV near 0/1 and small validation samples).\"\"\"\n    if n == 0:\n        return (np.nan, np.nan)\n    z = norm.ppf(1 - alpha / 2)\n    p = x / n\n    denom = 1 + z**2 / n\n    center = (p + z**2 / (2 * n)) / denom\n    half = (z * np.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))) / denom\n    return (center - half, center + half)\n\ndef ppv_corrected_count(validated: pd.DataFrame, observed_events: int):\n    \"\"\"Estimate PPV from the adjudicated subsample and apply it to the full-cohort flagged count.\n\n  
  Returns the corrected true-case count with an interval propagated from the PPV Wilson CI.\n    If a sensitivity estimate is available, divide the corrected count by sensitivity to recover\n    true incidence: corrected_true = observed * PPV / sensitivity.\n    \"\"\"\n    flagged = validated[validated[\"flagged\"] == 1]\n    tp = int(flagged[\"true_case\"].sum())\n    n_flagged = int(len(flagged))\n    ppv = tp / n_flagged if n_flagged else np.nan\n    lo, hi = wilson_ci(tp, n_flagged)\n    return {\n        \"ppv\": ppv, \"ppv_ci\": (lo, hi),\n        \"corrected_events\": observed_events * ppv,\n        \"corrected_events_ci\": (observed_events * lo, observed_events * hi),\n    }",
        "description": "Two operations on claims-style inputs: (1) derive incident outcome events under a configurable PPV/sensitivity\nalgorithm, and (2) compute validation operating characteristics + a PPV-corrected event count with a Wilson interval.\nRequired input tables (already cleaned and de-duplicated):\n  dx        : medical diagnosis claims -> person_id, claim_date (datetime), dx_code (str),\n              dx_position (1=primary), place_of_service ('IP'/'OP'/'ED')\n  cohort    : one row per person      -> person_id, index_date (datetime, time zero),\n              fu_end (datetime, end of ascertainment after run-out + censoring)\n  validated : adjudicated subsample   -> person_id, flagged (0/1 by algorithm), true_case (0/1 chart-confirmed)",
        "dependencies": [
          "pandas",
          "numpy",
          "scipy"
        ],
        "source_citations": [
          "kiyota-2004",
          "lanes-2023"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\n\nAMI_RE <- \"^I21\"  # ICD-10-CM AMI family\n\nderive_outcome <- function(dx, cohort, primary_only = TRUE, settings = c(\"IP\")) {\n  setDT(dx); setDT(cohort)\n  ev <- dx[grepl(AMI_RE, dx_code)]\n  if (primary_only) ev <- ev[dx_position == 1L]\n  ev <- ev[place_of_service %chin% settings]\n  ev <- merge(ev, cohort[, .(person_id, index_date, fu_end)], by = \"person_id\")\n  # Strictly after time zero and inside the closed ascertainment window.\n  ev <- ev[claim_date > index_date & claim_date <= fu_end]\n  setorder(ev, person_id, claim_date)\n  ev[, .(event_date = claim_date[1L]), by = person_id]  # first incident event per person\n}\n\nwilson_ci <- function(x, n, alpha = 0.05) {\n  if (n == 0) return(c(NA_real_, NA_real_))\n  z <- qnorm(1 - alpha / 2); p <- x / n; denom <- 1 + z^2 / n\n  center <- (p + z^2 / (2 * n)) / denom\n  half <- (z * sqrt(p * (1 - p) / n + z^2 / (4 * n^2))) / denom\n  c(center - half, center + half)\n}\n\nppv_corrected_count <- function(validated, observed_events) {\n  setDT(validated)\n  flagged <- validated[flagged == 1L]\n  tp <- sum(flagged$true_case); n_flagged <- nrow(flagged)\n  ppv <- if (n_flagged > 0) tp / n_flagged else NA_real_\n  ci <- wilson_ci(tp, n_flagged)\n  list(ppv = ppv, ppv_ci = ci,\n       corrected_events = observed_events * ppv,\n       corrected_events_ci = observed_events * ci)\n}",
        "description": "Mirror of the Python logic in base R + data.table. Inputs:\n  dx        : person_id, claim_date (Date), dx_code (character), dx_position (int), place_of_service (character)\n  cohort    : person_id, index_date (Date), fu_end (Date)\n  validated : person_id, flagged (0/1), true_case (0/1)\nReturns the incident-event table and the PPV-corrected count with a Wilson interval.",
        "dependencies": [
          "data.table"
        ],
        "source_citations": [
          "kiyota-2004",
          "lanes-2023"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* High-PPV algorithm: primary-position (dx_position=1) inpatient I21.x, strictly after time zero. */\nproc sql;\n  create table events as\n  select d.person_id, min(d.claim_date) as event_date format=date9.\n  from work.dx d\n  inner join work.cohort c\n    on d.person_id = c.person_id\n  where substr(d.dx_code,1,3) = 'I21'\n    and d.dx_position = 1\n    and d.place_of_service = 'IP'\n    and d.claim_date >  c.index_date     /* strictly after time zero */\n    and d.claim_date <= c.fu_end         /* inside the closed ascertainment window */\n  group by d.person_id;                   /* first incident event per person */\nquit;\n\n/* Observed full-cohort event count flagged by the algorithm. */\nproc sql noprint;\n  select count(*) into :observed trimmed from events;\nquit;\n\n/* PPV with an exact binomial CI from the adjudicated subsample (flagged records only). */\nproc freq data=work.validated(where=(flagged=1)) order=data;\n  tables true_case / binomial(level='1' exact) alpha=0.05;  /* exact Clopper-Pearson CI */\n  ods output BinomialProp=ppv_est;   /* contains the proportion (PPV) and its exact CI */\nrun;\n\n/* Apply the PPV (matrix) correction to the observed count.\n   If a sensitivity estimate SENS is available, set corrected_true = corrected_events / SENS. */\ndata correction;\n  set ppv_est;\n  if Name1 = '_BIN_';                 /* the point-estimate row */\n  ppv = nValue1;\n  observed = &observed;\n  corrected_events = observed * ppv;\n  /* corrected_true = corrected_events / sens;  * uncomment when sensitivity is estimated; */\n  keep observed ppv corrected_events;\nrun;",
        "description": "SAS implementation: derive incident events under a high-PPV algorithm, then validate and PPV-correct.\nRequired input datasets (post data-management):\n  work.dx        : person_id, claim_date, dx_code, dx_position, place_of_service\n  work.cohort    : person_id, index_date, fu_end\n  work.validated : person_id, flagged (0/1), true_case (0/1)\nPROC FREQ supplies the exact (Clopper-Pearson) binomial CI for PPV; the final data step applies the\nmatrix correction (corrected = observed * PPV, and corrected / sensitivity when sensitivity is known).",
        "dependencies": [],
        "source_citations": [
          "kiyota-2004",
          "lash-2014"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Raw[Raw claims/EHR records<br/>dx codes, position, setting, procedures] --> Choice{Position on the<br/>PPV-sensitivity frontier}\n  Choice -->|Narrow rule:<br/>primary-position, inpatient, confirmed| HiPPV[High PPV / lower sensitivity<br/>clean numerator, undercounts events]\n  Choice -->|Broad rule:<br/>any-position, any-setting| HiSens[High sensitivity / lower PPV<br/>more capture, more false positives]\n  HiPPV --> Estimand{Estimand?}\n  HiSens --> Estimand\n  Estimand -->|Ratio RR/HR| Ratio[Favor high PPV;<br/>non-differential error attenuates toward null]\n  Estimand -->|Absolute rate/risk| Abs[Validate + correct:<br/>corrected = observed x PPV / sensitivity]\n  Ratio --> Diff{Capture differential<br/>by exposure?}\n  Abs --> Diff\n  Diff -->|Yes - surveillance/detection bias| Danger[Bias in EITHER direction<br/>requires arm-specific validation or QBA]\n  Diff -->|No| Report[Report PPV/sensitivity with CIs<br/>+ narrow-vs-broad sensitivity analysis]",
        "caption": "Choosing and defending a position on the PPV-sensitivity frontier. The estimand (ratio vs absolute) and whether misclassification is differential by exposure determine whether a high-PPV rule suffices or a validation-based correction is required.",
        "alt_text": "Decision flowchart from raw records to a narrow high-PPV or broad high-sensitivity algorithm, branching on estimand type and differential-vs-non-differential misclassification to either reporting or quantitative bias correction.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  Algo[Claims algorithm flags<br/>1000 cases in full cohort] --> Sub[Adjudicate a subsample<br/>100 flagged records]\n  Sub --> TwoBytwo[Chart review:<br/>90 true cases, 10 false positives]\n  TwoBytwo --> PPV[PPV = 90/100 = 0.90<br/>Wilson 95% CI 0.82-0.95]\n  PPV --> Corr[Corrected true events<br/>= 1000 x 0.90 = 900<br/>interval ~820-950]\n  Corr --> Rate[Feed corrected count into<br/>incidence rate / cost-effectiveness inputs]",
        "caption": "Worked PPV correction. A validation substudy converts a raw algorithm count into a bias-corrected true-case count (and interval) before it enters absolute-rate or decision-analytic calculations.",
        "alt_text": "Data flow showing 1000 algorithm-flagged cases, a 100-record adjudicated subsample yielding PPV 0.90, and a corrected estimate of 900 true events feeding downstream rate and cost-effectiveness inputs.",
        "source_type": "illustrative",
        "source_citations": [
          "kiyota-2004"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "is_variant_of",
        "target_slug": "outcome-algorithm-construction-rwe",
        "notes": "The PPV/sensitivity trade-off is the central design decision within the broader outcome-algorithm-construction family."
      },
      {
        "relation_type": "used_with",
        "target_slug": "algorithm-validation",
        "notes": "Estimating PPV (and sensitivity/specificity) against an adjudicated reference is how the chosen frontier position is measured and justified."
      },
      {
        "relation_type": "used_with",
        "target_slug": "endpoint-adjudication-chart-review-rwe",
        "notes": "Chart adjudication provides the reference standard for the validation substudy that quantifies the algorithm's operating characteristics."
      },
      {
        "relation_type": "used_with",
        "target_slug": "misclassification-bias-correction-rwe",
        "notes": "When a high-PPV rule cannot fully remove bias (absolute estimands or differential capture), the PPV/sensitivity estimates feed a quantitative misclassification correction."
      },
      {
        "relation_type": "see_also",
        "target_slug": "quantitative-bias-analysis-toolkit-rwe",
        "notes": "Matrix correction, multiple imputation, and probabilistic bias analysis operationalize the correction of outcome-misclassification using these operating characteristics."
      },
      {
        "relation_type": "see_also",
        "target_slug": "external-adjustment-validation-substudy-bias-correction-rwe",
        "notes": "External or internal validation substudies supply the PPV/sensitivity/specificity inputs used to correct the primary analysis."
      },
      {
        "relation_type": "affects",
        "target_slug": "competing-risks-cause-specific-fine-gray-rwe",
        "notes": "Differential competing risks in elderly claims cohorts interact with outcome capture and must be modeled with cause-specific or subdistribution methods."
      },
      {
        "relation_type": "see_also",
        "target_slug": "negative-control-outcomes-rwe",
        "notes": "Negative-control outcomes help detect residual or differential outcome ascertainment that a high-PPV algorithm does not remove."
      }
    ],
    "aliases": [
      "claims outcome algorithm PPV sensitivity trade-off",
      "outcome algorithm positive predictive value vs sensitivity",
      "claims-based outcome case definition trade-off",
      "PPV-sensitivity trade-off in outcome ascertainment"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "clone-censor-weight-per-protocol",
    "name": "Clone-Censor-Weight for Per-Protocol Target-Trial Emulation",
    "short_definition": "A target-trial-emulation technique that estimates the per-protocol effect of treatment strategies not distinguishable at baseline by replicating each person once per strategy (cloning), artificially censoring each clone when its observed data first deviate from its assigned strategy, and applying inverse-probability-of-censoring weights to remove the selection bias that the artificial censoring induces.",
    "long_description": "**Clone-censor-weight (CCW)** is the standard device for estimating the *per-protocol* effect of treatment\nstrategies that two or more eligible patients could still satisfy at time zero but that diverge only later — the\nsetting where a naive design has no clean way to assign an arm. Classic examples are sustained strategies (\"initiate\ntreatment within a grace period and stay on it\" vs \"never initiate\"), duration strategies (\"treat for 12 months\" vs\n\"treat for 6 months\"), and threshold/dynamic strategies (\"start when eGFR crosses X\"). Because every eligible person\nis, at baseline, compatible with *every* strategy, CCW makes that compatibility explicit: it **clones** each person\ninto one copy per strategy, follows each clone under its assigned rule, **artificially censors** a clone the instant\nits observed behavior departs from that rule, and then **weights** the surviving clone-time by the inverse\nprobability of remaining uncensored so the censored clones are represented by similar still-adherent clones.\n\n**Core estimand distinction.** CCW targets the *observational analog of the per-protocol effect of a sustained\nstrategy* — what the cumulative incidence (or hazard) would have been had everyone followed the assigned strategy\nexactly. This is not the intention-to-treat (initiation) contrast, which an active-comparator new-user design with\nbaseline propensity-score adjustment already estimates cleanly. It is also not the naive \"as-treated\" or\n\"ever-exposed\" contrast, which classifies person-time by post-baseline behavior and so re-imports immortal time,\nselection on adherence, and time-varying confounding. 
The cloning step exists precisely *because* the strategies are\nnot distinguishable at t0: assigning the arm from baseline data alone (as one would in active-comparator new-user)\nis impossible for \"start within 30 days and persist,\" so instead of choosing an arm we place each person in *both*\narms and let the data censor the incompatible clone later. The inverse-probability-of-censoring weighting (IPCW)\nthat follows is, mechanically, a marginal structural model fit on a cloned scaffold; CCW is best understood as an\nMSM whose person-time has been duplicated to encode strategy membership.\n\n**Pros, cons, and trade-offs.**\n- **vs naive as-treated / ever-exposed:** CCW eliminates immortal-time bias (no person-time is counted before the\n  deviation that defines non-adherence), selection bias from post-baseline arm assignment, and confounding by\n  time-varying factors that drive both adherence and the outcome. Cost: it answers a *hypothetical full-adherence*\n  question that can diverge from real-world adherence, and it requires a correctly specified IPCW model. **Prefer\n  CCW** for any sustained/dynamic strategy where as-treated is the only alternative — as-treated is almost always\n  biased here.\n- **vs landmark analysis:** Landmark conditions on survival and exposure status up to a fixed landmark time, discards\n  everyone who had the event or deviated before it, and answers a question about the selected survivors. CCW clones\n  everyone at baseline and recovers the full eligible population through weighting, so it avoids the survivor\n  selection landmark builds in and accommodates grace periods naturally. Cost: heavier specification (two weighting\n  models, weight diagnostics) and bootstrap inference. 
**Prefer landmark** only when the scientific question really\n  is about a post-baseline conditioning set; otherwise CCW is the more defensible per-protocol estimator.\n- **vs marginal structural models / g-methods without cloning:** A standard MSM or the g-formula can estimate\n  sustained-strategy effects directly without cloning. CCW's advantage is interpretability and protocol\n  transparency — the cloning + censoring rules force you to write down the strategy, the grace period, and the\n  deviation definition exactly as a trial protocol would, which is why regulators and reviewers favor it. Cost: the\n  clone expansion multiplies the dataset, induces within-person correlation (requiring robust/bootstrap variance),\n  and can be less statistically efficient than a well-specified g-formula. **Prefer CCW** when explicit\n  protocol/estimand specification and auditability matter (regulatory submissions, head-to-head sustained\n  strategies); consider plain g-methods or TMLE on the cloned data when efficiency or double robustness is the\n  priority.\n\n**When to use.** Sustained, duration, or dynamic strategies where (a) eligible patients satisfy every strategy at\nbaseline, (b) divergence happens during follow-up (grace-period starts, persistence, biomarker-triggered\ninitiation), and (c) you want the effect under full adherence. CCW is the per-protocol engine of most one-arm and\nhead-to-head target-trial emulations of treatment *strategies* rather than point interventions.\n\n**When NOT to use — and when it is actively misleading.**\n- **The strategies are fully distinguishable at baseline** (e.g., drug A initiator vs drug B initiator, both single\n  point decisions). 
Then an active-comparator new-user design with IPTW is simpler, more efficient, and equally\n  valid; cloning adds dataset size and variance for no gain.\n- **Deviation is a collider / informed by the outcome process.** If patients stop because of early toxicity or\n  declining health that itself predicts the outcome, and those drivers are unmeasured, the IPCW \"no unmeasured\n  determinants of censoring\" assumption fails and CCW is biased — frequently *more* convincingly biased than a\n  crude analysis because the machinery lends false credibility.\n- **The grace period is so short (or the strategy so demanding) that the eligible-and-still-adherent stratum\n  collapses.** A handful of persistently adherent clones near the end of follow-up receive enormous weights;\n  effective sample size craters and the estimate becomes a high-variance artifact of a few people.\n- **Competing risks differ by strategy and are ignored.** In elderly claims cohorts, death competes with the event\n  and may itself depend on the strategy; a cause-specific-hazard CCW that censors at the competing event answers a\n  different (and often less policy-relevant) question than one targeting the subdistribution / cumulative incidence.\n  Pre-specify which.\n- **Adherence is unobservable in the data.** In Medicare Advantage-only person-time, fills are not captured, so the\n  deviation rule cannot fire correctly and clones are censored (or not) spuriously — restrict to fee-for-service /\n  full-benefit person-time before applying CCW.\n\n**Data-source operational depth.**\n- **Claims (FFS / commercial):** Treatment status over time is reconstructed from pharmacy fills (`fill_date` +\n  `days_supply`, stitched into on-treatment episodes with a stockpiling/grace rule). Require continuous medical +\n  pharmacy enrollment so a \"no fill\" gap is a true gap, not missingness. 
*Failure modes:* MA-only person-time lacks\n  FFS pharmacy/medical claims, so the adherence/deviation rule misfires — exclude it; differential competing risks\n  (death) by strategy in elderly cohorts bias a cause-specific CCW unless handled; immortal time sneaks back in if\n  the grace period is treated as \"guaranteed survival\" rather than encoded as eligible-but-uncensored clone-time;\n  90-day mail-order and free samples distort `days_supply` and therefore the deviation timing.\n- **EHR:** Orders + administrations give finer \"on-treatment\" granularity and capture *reasons* for deviation\n  (toxicity, response), which is invaluable for arguing the IPCW assumption — but visit-driven capture means a\n  patient who leaves the system looks like a deviation/censoring event when they merely changed providers. Link to\n  fills to confirm the patient actually started, and model loss to follow-up as a separate censoring process.\n- **Registry:** Structured start/stop dates and adjudicated outcomes support clean cloning and censoring rules and\n  are common in embedded/registry-based emulations; weak for complete longitudinal drug exposure, so link to claims\n  for the full fill history and to a death index to firm up the competing-risk handling.\n- **Linked claims-EHR-vital records:** The ideal substrate (EHR reasons-for-deviation + claims completeness +\n  reliable mortality for competing risks), at the cost of linkage selection and reconciling order/fill/service-date\n  discrepancies before time-zero and deviation timing are assigned.\n\n**Worked claims example.** Question: does *initiating a statin within 6 months of a first MI and persisting* vs\n*never initiating* reduce 1-year all-cause mortality, in a commercial + Medicare fee-for-service database? (1)\n*Eligibility:* adults with an incident MI hospitalization, 365 days of continuous A/B/D (or commercial\nmedical+pharmacy) enrollment before discharge, and no statin fill in that lookback. 
Time zero = discharge date. (2)\n*Strategies:* S1 = fill a statin within a 180-day grace period and remain covered thereafter; S0 = never fill a\nstatin. (3) *Cloning:* each eligible person becomes two clones, one assigned S1 and one assigned S0, both starting\nat t0. (4) *Artificial censoring:* the S0 clone is censored on the first day a statin fill is observed; the S1\nclone is censored at day 180 if no statin has been filled by then (failed the grace period) and, after a fill, on\nthe first day the patient's covered supply lapses beyond a 30-day permissible gap (failed persistence). A clone\nthat has the outcome before deviating is *not* censored — its event counts. (5) *IPCW:* fit a pooled logistic model\nper arm for the probability of remaining uncensored in each month given time-varying covariates (recent\nhospitalizations, cardiac procedures, comorbidity flags, prior adherence proxies) and baseline covariates; the\nunstabilized weight for a clone-month is the cumulative product of those inverse probabilities through that month.\nTo stabilize it, multiply by the cumulative probability of remaining uncensored from a numerator model that\nconditions only on time (and on baseline covariates only if the outcome model also adjusts for them). (6) *Estimation:*\nfit a weighted pooled logistic outcome model on the clone-months with a flexible function of time and a strategy\nindicator; convert the fitted discrete hazards into standardized 1-year cumulative-incidence curves and report the\nrisk difference and risk ratio at 12 months. (7) *Inference & diagnostics:* nonparametric bootstrap over *persons*\n(resample whole persons, re-clone, refit weights and outcome model) for confidence intervals; report unique N vs\nexpanded clone-months, the weight distribution and effective sample size, and sensitivity analyses to\nthe grace-period length (90 vs 180 vs 270 days), the permissible-gap rule, and weight truncation at the 1st/99th\npercentiles.\n\n**Interpreting the output**\n\nUsing the worked example, consider a hypothetical patient, Maria, who fills her first statin on day 100 and\npersists through day 365. Her Clone S1 (initiate-and-persist) contributes 365 uncensored days; her Clone S0\n(never-initiate) is artificially censored at day 100, when that first fill is observed. After IPCW\nreweighting across the full cohort, a per-protocol comparison yields — for illustration — HR = 0.74\n(95% CI 0.61–0.90) for 1-year mortality, approximated by the strategy odds ratio from the weighted pooled\nlogistic model when the monthly event risk is small.\n\nFormal interpretation: The clone-censor-weight HR of 0.74 is the per-protocol effect of the\ninitiate-and-persist strategy versus the never-initiate strategy — the causal effect that would be\nobserved if all patients in the target population had adhered to their assigned strategy throughout\nfollow-up. It is not an intent-to-treat estimate (which includes non-adherers) and does not apply to\nthe as-treated population. The IPCW step is essential: without reweighting, clones that deviated (like\nMaria's Clone S0 at day 100) would simply drop out, leaving a selected, possibly healthier subgroup of\nstill-adherent clones. The estimate is\nvalid under two untestable conditions: no unmeasured confounding conditional on covariates entering the\nIPCW model, and correct specification of the censoring model. Confidence intervals must be obtained by\ncluster bootstrap over original patients — not over expanded clone-months — to respect the within-patient\ncorrelation introduced by cloning.\n\nPractical interpretation: Had all eligible patients initiated a statin within 180 days and remained on it,\nthey would, on average, have died at a 26% lower rate over the year than had none of them ever started,\nprovided the adherence assumptions and no-unmeasured-confounding conditions hold. The clone-censor-weight\nframework makes this \"what if everyone\nhad followed the strategy\" question answerable from observational data without requiring treatment to have\nbeen randomized.",
    "primary_category": "Causal_Inference_Method",
    "tags": [
      "clone-censor-weight",
      "ccw",
      "per-protocol",
      "target-trial-emulation",
      "artificial-censoring",
      "inverse-probability-of-censoring-weighting",
      "sustained-treatment-strategy",
      "grace-period",
      "marginal-structural-model",
      "immortal-time-bias"
    ],
    "applies_to_study_types": [
      "new_user",
      "active_comparator_new_user",
      "cohort_retrospective",
      "cohort_prospective",
      "claims_analysis",
      "ehr_study",
      "pragmatic_trial"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "linked",
      "registry"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1093/aje/kwv254",
        "url": "https://doi.org/10.1093/aje/kwv254",
        "citation_text": "Hernán MA, Robins JM. Using big data to emulate a target trial when a randomized trial is not available. Am J Epidemiol. 2016;183(8):758-764.",
        "year": 2016,
        "authors_short": "Hernán & Robins",
        "notes": "Foundational target-trial-emulation paper that introduces cloning, artificial censoring, and weighting for emulating per-protocol effects of strategies not distinguishable at baseline."
      },
      {
        "role": "explain",
        "doi": "10.1016/j.jclinepi.2016.04.014",
        "url": "https://doi.org/10.1016/j.jclinepi.2016.04.014",
        "citation_text": "Hernán MA, Sauer BC, Hernández-Díaz S, Platt R, Shrier I. Specifying a target trial prevents immortal time bias and other self-inflicted injuries in observational analyses. J Clin Epidemiol. 2016;79:70-75.",
        "year": 2016,
        "authors_short": "Hernán et al.",
        "notes": "Explains why grace-period and sustained strategies create immortal time when handled naively, and how cloning/censoring/weighting with a specified target trial prevents it."
      },
      {
        "role": "explain",
        "doi": "10.1093/ije/dyaa057",
        "url": "https://doi.org/10.1093/ije/dyaa057",
        "citation_text": "Maringe C, Benitez Majano S, Exarchakou A, et al. Reflection on modern methods: trial emulation in the presence of immortal-time bias. Assessing the benefit of major surgery for elderly lung cancer patients using observational data. Int J Epidemiol. 2020;49(5):1719-1729.",
        "year": 2020,
        "authors_short": "Maringe et al.",
        "notes": "Step-by-step methodological tutorial that walks through the clone-censor-weight procedure (cloning, artificial censoring at deviation, IPCW) for a grace-period surgery strategy; the canonical applied how-to reference."
      },
      {
        "role": "demonstrate",
        "doi": "10.1016/j.jclinepi.2017.11.021",
        "url": "https://doi.org/10.1016/j.jclinepi.2017.11.021",
        "citation_text": "Danaei G, García Rodríguez LA, Cantero OF, Logan RW, Hernán MA. Electronic medical records can be used to emulate target trials of sustained treatment strategies. J Clin Epidemiol. 2018;96:12-22.",
        "year": 2018,
        "authors_short": "Danaei et al.",
        "notes": "Demonstrates ITT vs per-protocol estimation for single-treatment, joint-treatment, and head-to-head sustained strategies using inverse-probability weighting to censor non-adherent person-time, with pooled logistic models and standardized survival curves."
      },
      {
        "role": "use",
        "doi": "10.1212/WNL.0000000000010433",
        "url": "https://doi.org/10.1212/WNL.0000000000010433",
        "citation_text": "Caniglia EC, Rojas-Saunero LP, Hilal S, et al. Emulating a target trial of statin use and risk of dementia using cohort data. Neurology. 2020;95(10):e1322-e1332.",
        "year": 2020,
        "authors_short": "Caniglia et al.",
        "notes": "Applied emulation contrasting statin initiation (ITT analog) with sustained use (per-protocol analog via inverse-probability weighting and pooled logistic regression); illustrates the estimand distinction and the small-stratum/positivity cautions in practice."
      }
    ],
    "plain_language_summary": "Clone-censor-weight (CCW) is a technique for comparing two treatment strategies using real-world data when everyone in the study could plausibly follow either strategy on day one. The trick is that each person is placed into both groups simultaneously by creating two copies called clones; each clone is followed over time and removed from its group the moment the real person's behavior diverges from that group's rule, and then a statistical weight is applied to compensate for those removals so the comparison stays fair. This lets researchers answer a per-protocol question: what would the outcome have been if patients actually stuck to their assigned strategy from start to finish?",
    "key_terms": [
      {
        "term": "cloning",
        "definition": "Creating two identical copies of every patient at the study start date so that each copy can be assigned to a different treatment strategy simultaneously."
      },
      {
        "term": "censoring",
        "definition": "Removing a patient (or, in CCW, a clone) from further follow-up at the moment when tracking them would no longer answer the intended question."
      },
      {
        "term": "artificial censoring",
        "definition": "Deliberately removing a clone from its assigned strategy arm the instant the real patient's observed behavior deviates from that strategy's rule, even though the real patient is still alive and in the data."
      },
      {
        "term": "per-protocol effect",
        "definition": "The estimated difference in outcomes between two groups assuming everyone actually followed their assigned treatment strategy exactly as specified."
      },
      {
        "term": "inverse-probability-of-censoring weight",
        "definition": "A number multiplied onto each remaining clone-record to mathematically represent the clones that were artificially removed, so the surviving clones still reflect the full original population."
      },
      {
        "term": "grace period",
        "definition": "A window of time after the study start date during which a patient assigned to an initiation strategy is still considered adherent even if they have not yet taken their first dose."
      }
    ],
    "worked_example": {
      "scenario": "A researcher wants to know whether starting a statin within 180 days of a heart attack and staying on it reduces one-year mortality compared to never starting one. Using a claims database, every eligible patient is cloned into two copies at hospital discharge (day 0). Clone S1 is assigned to the initiate-and-persist strategy; Clone S0 is assigned to the never-initiate strategy. We follow one patient, Maria, to see exactly when and why each of her clones gets artificially censored.",
      "dataset": {
        "caption": "Maria's statin fill history after her MI discharge on 2024-01-01, as it would appear in a pharmacy claims table.",
        "columns": [
          "person_id",
          "fill_date",
          "drug",
          "days_supply"
        ],
        "rows": [
          [
            "M001",
            "2024-04-10",
            "atorvastatin",
            90
          ],
          [
            "M001",
            "2024-07-05",
            "atorvastatin",
            90
          ]
        ]
      },
      "steps": [
        "At time zero (2024-01-01, Maria's discharge date), she is cloned: Clone S1 is assigned to the initiate-and-persist arm; Clone S0 is assigned to the never-initiate arm.",
        "Both clones begin follow-up on 2024-01-01. Neither has deviated yet, so both remain active.",
        "Maria fills atorvastatin on 2024-04-10, which is day 100 after discharge. This is within the 180-day grace period, so Clone S1 has met its initiation requirement and stays active.",
        "That same fill (2024-04-10) is the first statin observed in the data. Clone S0 is assigned to never initiate, so this fill is a deviation from S0's rule. Clone S0 is artificially censored on 2024-04-10 (day 100).",
        "Clone S1 continues follow-up. The 90-day fill covers through 2024-07-08. Maria refills on 2024-07-05, three days before that supply runs out, so there is no gap at all, let alone one beyond the 30-day allowance. With carry-over, the second fill covers her into early October; assuming she keeps refilling on time after that (later fills are not shown in the table), persistence is maintained and Clone S1 stays active through the end of the 12-month window.",
        "Because Clone S0 was censored early, it contributes only 100 days of person-time to the never-initiate arm. To avoid underrepresenting patients like Maria (those who eventually started), the analyst fits an inverse-probability-of-censoring model and upweights the S0 clones that are still uncensored and share Maria's covariate profile."
      ],
      "result": "Clone S1 contributes 365 uncensored days under the initiate-and-persist strategy, provided Maria keeps refilling on time after the two fills shown. Clone S0 contributes 100 days before it is artificially censored on 2024-04-10. The per-protocol analysis upweights the S0 clones that remain uncensored to stand in for clones removed like Maria's, then compares weighted 1-year mortality risk across the two arms.",
      "timeline_spec": {
        "title": "One MI patient cloned into two strategy arms: who gets censored and when",
        "window": {
          "start": "2024-01-01",
          "end": "2024-12-31",
          "label": "12-month follow-up window from MI discharge"
        },
        "events": [
          {
            "label": "Time zero: MI discharge, both clones start",
            "start": "2024-01-01",
            "length_days": 0,
            "quantity": "Day 0 — both clones begin"
          },
          {
            "label": "First statin fill (day 100) — S1 grace-period met; S0 deviation",
            "start": "2024-04-10",
            "length_days": 0,
            "quantity": "Day 100 — deviation point for S0"
          },
          {
            "label": "Statin fill A (atorvastatin 90-day supply)",
            "start": "2024-04-10",
            "length_days": 90,
            "quantity": "90-day supply"
          },
          {
            "label": "Statin fill B (atorvastatin 90-day supply)",
            "start": "2024-07-05",
            "length_days": 90,
            "quantity": "90-day supply"
          }
        ],
        "spans": [
          {
            "kind": "followup",
            "start": "2024-01-01",
            "end": "2024-04-09",
            "label": "Clone S1 and Clone S0 both active (days 0-99)"
          },
          {
            "kind": "exposed",
            "start": "2024-04-10",
            "end": "2024-12-31",
            "label": "Clone S1 (initiate+persist) — uncensored through end of window"
          },
          {
            "kind": "unexposed",
            "start": "2024-04-10",
            "end": "2024-04-10",
            "label": "Clone S0 (never-initiate) — artificially censored day 100"
          }
        ],
        "result": {
          "label": "Clone S1: 365 days uncensored | Clone S0: 100 days then artificially censored at first fill",
          "value": null
        },
        "caption": "Maria is cloned at time zero into two arms. Clone S1 (initiate and persist) remains active all year because she filled within the 180-day grace period and maintained continuous supply. Clone S0 (never initiate) is artificially censored on day 100 the moment the first fill appears. Inverse-probability-of-censoring weights compensate for the censored S0 person-time so the per-protocol comparison remains unbiased.",
        "alt_text": "A timeline from 2024-01-01 to 2024-12-31 showing two parallel arms for one patient. Both arms are active from January through April 9. On April 10 (day 100), the S1 arm continues uncensored while the S0 arm ends with an artificial censoring marker. Two statin fill bars sit on the S1 arm: the first from April 10 through July 8, the second from July 5 through October 2, illustrating continuous coverage with a minor overlap."
      }
    },
    "prerequisites": [
      "target-trial-emulation",
      "new-user-design",
      "immortal-time-bias-handling"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "CCW with grace period (sustained-initiation strategy)",
        "description": "Allow a window (grace period) after time zero during which initiating the assigned treatment is still considered adherent; the \"initiate\" clone is censored only if it fails to start by the end of the grace period or, after starting, lapses in persistence; the \"never initiate\" clone is censored at first treatment.",
        "edge_cases": [
          "Longer grace periods raise power but can reintroduce immortal-time-like behavior if the grace window is modeled as guaranteed survival rather than as eligible, uncensored clone-time that still accrues outcomes.",
          "The deviation definition (\"any fill of the wrong drug\" vs \"sustained switch\" vs \"gap beyond X days\") requires explicit clinical specification and drives results; pre-register it."
        ],
        "data_source_notes": "claims: use fill_date + days_supply with a permissible-gap rule (e.g., 30 days) to define persistence; common for \"start RAS inhibitor within 30 days of CKD stage 3\" or \"statin within 180 days of MI\".",
        "citations": [
          "hernan-2016-aje",
          "maringe-2020"
        ]
      },
      {
        "name": "CCW for treatment-duration strategies",
        "description": "Compare fixed durations (\"treat for 12 months\" vs \"treat for 6 months\", or \"continue vs stop at 6 months\"); each clone is censored when its observed on-treatment duration diverges from the assigned protocol duration.",
        "edge_cases": [
          "Requires a clean on-treatment definition (covered days from stitched fills) and explicit handling of early stops caused by the outcome process, which are informative and threaten the IPCW assumption.",
          "Events occurring during the protocol window before any deviation are retained, not censored — a common implementation error is to censor at the duration boundary even for clones that already had the outcome."
        ],
        "data_source_notes": "claims: well suited to duration questions such as optimal length of dual antiplatelet therapy or antibiotic courses, using days_supply to reconstruct cumulative on-treatment time.",
        "citations": [
          "danaei-2018"
        ]
      },
      {
        "name": "CCW combined with g-methods, TMLE, or doubly robust estimation",
        "description": "Use cloning/censoring to encode the design and estimand, then estimate with the parametric g-formula, targeted maximum likelihood, or doubly robust weighting on the clone-time dataset for efficiency or robustness under high-dimensional time-varying confounding.",
        "edge_cases": [
          "Adds substantial complexity and computational cost but can recover efficiency lost to the clone expansion and provides protection against misspecification of either the censoring or outcome model.",
          "Variance must account for both the clone-induced within-person correlation and the estimation of nuisance models; bootstrap over persons is the safe default."
        ],
        "data_source_notes": "Advanced implementation for high-dimensional longitudinal claims/EHR; pair with cross-fitting when using machine-learning nuisance estimators.",
        "citations": [
          "danaei-2018"
        ]
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "naive-as-treated-or-ever-exposed",
        "pros_of_this": "Removes immortal-time bias and selection bias from post-baseline arm assignment by design and, when the censoring model is correctly specified and includes the factors that jointly drive adherence and outcome, adjusts for time-varying confounding by those factors; yields a well-defined per-protocol estimand under full adherence.",
        "cons_of_this": "More complex (two weighting models, weight diagnostics, bootstrap inference); estimates a hypothetical full-adherence effect that may differ from observed real-world adherence patterns.",
        "when_to_prefer": "Any sustained, duration, or dynamic strategy where the only naive alternative is an as-treated analysis — which is almost always biased in that setting."
      },
      {
        "compared_to": "landmark-analysis",
        "pros_of_this": "Clones everyone at baseline and recovers the full eligible population through weighting, avoiding the survivor selection that landmark analysis builds in by conditioning on survival and exposure status at a fixed time; accommodates grace periods naturally.",
        "cons_of_this": "Heavier specification and computation; requires correct IPCW models and bootstrap variance, whereas landmark is a simple conditional analysis.",
        "when_to_prefer": "When the scientific question is the per-protocol effect of a sustained strategy rather than an effect conditional on a chosen post-baseline survivor set."
      },
      {
        "compared_to": "marginal-structural-models-g-methods",
        "pros_of_this": "Forces explicit, auditable protocol specification (strategy, grace period, deviation rule), which regulators and reviewers favor; the cloned scaffold makes the per-protocol estimand transparent and communicable.",
        "cons_of_this": "The clone expansion multiplies the dataset, induces within-person correlation, and can be less statistically efficient than a well-specified g-formula or standard MSM fit on the original person-time.",
        "when_to_prefer": "Regulatory or head-to-head sustained-strategy questions where explicit estimand specification and auditability outweigh raw efficiency."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Reconstruct on-treatment status from fill_date + days_supply stitched into episodes with a permissible-gap/stockpiling rule. Require continuous medical + pharmacy enrollment so a no-fill gap is observed rather than missing, and exclude Medicare Advantage-only person-time where fills are unobservable and the deviation rule misfires. Clone each eligible person once per strategy at time zero; censor each clone at first deviation (wrong-arm treatment, failed grace-period start, or persistence lapse); fit IPCW via pooled logistic on clone-periods, then a weighted pooled logistic outcome model; bootstrap over persons for inference. Pre-specify competing-risk handling (cause-specific vs subdistribution) because death often differs by strategy in elderly cohorts.",
      "ehr": "Orders/administrations give finer on-treatment timing and capture reasons for deviation (toxicity, response) that support the IPCW no-unmeasured-determinants-of-censoring assumption; link to dispensing to confirm initiation. Visit-driven capture makes a patient who leaves the system look like a deviation, so model loss to follow-up as a distinct censoring process.",
      "registry": "Structured start/stop dates and adjudicated outcomes support clean cloning/censoring; weak for complete longitudinal exposure. Link to claims for the full fill history and to a death index to firm up competing-risk and censoring handling.",
      "linked": "Linked claims-EHR-vital-records is the ideal substrate (reasons-for-deviation + exposure completeness + reliable mortality for competing risks), at the cost of linkage selection and reconciling order/fill/service-date discrepancies before assigning time zero and deviation timing."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\nimport pandas as pd\nimport statsmodels.api as sm\nimport statsmodels.formula.api as smf\n\nGRACE_DAYS = 180          # window to initiate the statin and remain \"adherent\" to S1\nGAP_DAYS = 30             # permissible gap before a persistence lapse counts as deviation\nHORIZON_M = 12            # months of follow-up for the 1-year risk contrast\n\ndef build_person_periods(cohort, rx, outcome, fup_end):\n    \"\"\"One row per person-month from t0 to t0+HORIZON_M, with time-varying on-treatment + event flags.\"\"\"\n    rows = []\n    rx_by = {p: g.sort_values(\"fill_date\") for p, g in rx.groupby(\"person_id\")}\n    for r in cohort.itertuples(index=False):\n        pid, t0 = r.person_id, r.t0\n        ev = outcome.loc[outcome.person_id == pid, \"event_date\"]\n        ev_date = ev.min() if len(ev) else pd.NaT\n        adm = fup_end.loc[fup_end.person_id == pid, \"fup_end_date\"]\n        adm_date = adm.min() if len(adm) else pd.NaT\n        fills = rx_by.get(pid, pd.DataFrame(columns=[\"fill_date\", \"days_supply\"]))\n        # covered-day intervals stitched with stockpiling (carry-over of surplus supply);\n        # covered_until is the last covered day, so a 90-day fill covers fill_date .. fill_date + 89\n        covered_until = t0 - pd.Timedelta(days=1)\n        intervals = []\n        for f in fills.itertuples(index=False):\n            start = max(f.fill_date, covered_until + pd.Timedelta(days=1))\n            covered_until = start + pd.Timedelta(days=int(f.days_supply) - 1)\n            intervals.append((start, covered_until))\n        first_fill = fills[\"fill_date\"].min() if len(fills) else pd.NaT\n        for m in range(HORIZON_M):\n            m_start = t0 + pd.Timedelta(days=30 * m)\n            m_end = t0 + pd.Timedelta(days=30 * (m + 1)) - pd.Timedelta(days=1)\n            if pd.notna(adm_date) and adm_date < m_start:\n                break\n            on_tx = any(s <= m_end and e + pd.Timedelta(days=GAP_DAYS) >= m_start for s, e in intervals)\n            started_by = pd.notna(first_fill) and first_fill <= m_end\n            event_m = int(pd.notna(ev_date) and m_start <= ev_date <= m_end)\n            rows.append(dict(person_id=pid, month=m, on_tx=int(on_tx),\n                             started_by=int(started_by),\n                             days_since_t0=(m_start - t0).days, event=event_m))\n            if event_m:\n                break\n    pp = pd.DataFrame(rows)\n    return pp.merge(cohort, on=\"person_id\", how=\"left\")\n\ndef expand_and_censor(pp):\n    \"\"\"Duplicate each person-month into clone S1 and clone S0; set artificial censoring per the strategy.\"\"\"\n    out = []\n    for strat in (\"S1\", \"S0\"):\n        c = pp.copy()\n        c[\"strategy\"] = strat\n        c[\"clone_id\"] = c[\"person_id\"].astype(str) + \"_\" + strat\n        if strat == \"S0\":                       # never initiate -> censor at first treatment\n            c[\"artif_cens\"] = (c[\"on_tx\"] == 1).astype(int)\n        else:                                   # initiate within grace, then persist\n            grace_fail = (c[\"days_since_t0\"] > GRACE_DAYS) & (c[\"started_by\"] == 0)\n            persist_fail = (c[\"started_by\"] == 1) & (c[\"on_tx\"] == 0)\n            c[\"artif_cens\"] = (grace_fail | persist_fail).astype(int)\n        out.append(c)\n    x = pd.concat(out, ignore_index=True)\n    # An event in a month overrides artificial censoring: the event is observed, not censored.\n    x.loc[x[\"event\"] == 1, \"artif_cens\"] = 0\n    x = x.sort_values([\"clone_id\", \"month\"])\n    # Censoring is absorbing: keep each clone's rows up to and including its first deviation month,\n    # so a clone cannot re-enter its arm if on_tx later flips back.\n    prior_cens = x.groupby(\"clone_id\")[\"artif_cens\"].cumsum() - x[\"artif_cens\"]\n    return x[prior_cens == 0]\n\ndef stabilized_ipcw(clones, tv_covs):\n    \"\"\"Pooled-logistic IPCW: P(uncensored this month). Stabilized weight = prod(num)/prod(denom).\"\"\"\n    clones = clones.copy()\n    clones[\"uncens\"] = 1 - clones[\"artif_cens\"]\n    rhs_den = \"bs(days_since_t0, df=4) + \" + \" + \".join(tv_covs)\n    rhs_num = \"bs(days_since_t0, df=4)\"\n    w = []\n    for strat, g in clones.groupby(\"strategy\"):\n        den = smf.logit(\"uncens ~ \" + rhs_den, data=g).fit(disp=0)\n        num = smf.logit(\"uncens ~ \" + rhs_num, data=g).fit(disp=0)\n        g = g.assign(p_den=den.predict(g), p_num=num.predict(g))\n        g = g.sort_values([\"clone_id\", \"month\"])\n        g[\"sw\"] = (g.groupby(\"clone_id\")[\"p_num\"].cumprod() /\n                   g.groupby(\"clone_id\")[\"p_den\"].cumprod())\n        w.append(g)\n    out = pd.concat(w, ignore_index=True)\n    out[\"sw\"] = out[\"sw\"].clip(upper=out[\"sw\"].quantile(0.99))   # truncate extreme weights\n    return out\n\ndef run_ccw(cohort, rx, outcome, fup_end, tv_covs=(\"on_tx\",)):\n    # tv_covs should name the baseline and time-varying predictors of deviation. The on_tx default is a\n    # placeholder: on_tx itself defines deviation, so on its own it can separate the censoring model.\n    pp = build_person_periods(cohort, rx, outcome, fup_end)\n    clones = expand_and_censor(pp)\n    wdat = stabilized_ipcw(clones, list(tv_covs))\n    # Weighted pooled-logistic outcome model (discrete-time hazard) on uncensored clone-months.\n    d = wdat[wdat[\"artif_cens\"] == 0]\n    m = smf.glm(\"event ~ strategy + bs(days_since_t0, df=4)\", data=d,\n                family=sm.families.Binomial(), freq_weights=d[\"sw\"]).fit()\n    # Standardize: predict monthly hazard under each strategy, convert to 1-year cumulative incidence.\n    risks = {}\n    grid = wdat[[\"days_since_t0\"]].drop_duplicates().sort_values(\"days_since_t0\")\n    for strat in (\"S1\", \"S0\"):\n        g = grid.assign(strategy=strat)\n        h = m.predict(g).values\n        risks[strat] = 1 - np.prod(1 - h)\n    return dict(model=m, risk_S1=risks[\"S1\"], risk_S0=risks[\"S0\"],\n                risk_difference=risks[\"S1\"] - risks[\"S0\"])",
        "description": "Clone-censor-weight per-protocol emulation of a sustained \"initiate within grace period and persist\" vs\n\"never initiate\" strategy from claims-style inputs. Required inputs (already cleaned, de-duplicated):\n  cohort  : one row per eligible new-MI patient -> person_id, t0 (index/discharge date), <baseline covariates>\n  rx      : statin fills                        -> person_id, fill_date (datetime), days_supply (int)\n  outcome : first events                        -> person_id, event_date (datetime), event (1)  # all-cause death here\n  fup_end : administrative censoring            -> person_id, fup_end_date (disenroll/death/data-end, datetime)\nThe pipeline: (1) build a monthly person-period grid to t0+12mo; (2) expand to TWO clones per person (strategies\nS1=initiate&persist, S0=never); (3) set the artificial-censoring flag per clone by the deviation rule; (4) fit\nstabilized IPCW per arm via pooled logistic; (5) fit a weighted pooled-logistic outcome model and standardize to\n1-year cumulative incidence. Wrap in a person-level bootstrap for CIs (not shown: call run_ccw on resampled persons).",
        "dependencies": [
          "pandas",
          "numpy",
          "statsmodels"
        ],
        "source_citations": [
          "hernan-2016-aje",
          "danaei-2018"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\nlibrary(splines)\n\nGRACE_DAYS <- 180L; GAP_DAYS <- 30L; HORIZON_M <- 12L\n\nbuild_person_periods <- function(cohort, rx, outcome, fup_end) {\n  setDT(cohort); setDT(rx); setDT(outcome); setDT(fup_end); setorder(rx, person_id, fill_date)\n  out <- list()\n  for (i in seq_len(nrow(cohort))) {\n    pid <- cohort$person_id[i]; t0 <- cohort$t0[i]\n    ev  <- min(outcome[person_id == pid, event_date],  Inf)\n    adm <- min(fup_end[person_id == pid, fup_end_date], Inf)\n    f   <- rx[person_id == pid]\n    # stitch covered-day intervals with stockpiling carry-over;\n    # covered_until is the last covered day, so a 90-day fill covers fill_date .. fill_date + 89\n    covered_until <- t0 - 1L; iv <- list(); first_fill <- if (nrow(f)) min(f$fill_date) else as.Date(NA)\n    if (nrow(f)) for (j in seq_len(nrow(f))) {\n      start <- max(f$fill_date[j], covered_until + 1L)\n      covered_until <- start + f$days_supply[j] - 1L\n      iv[[length(iv) + 1L]] <- c(start, covered_until)\n    }\n    for (m in 0:(HORIZON_M - 1L)) {\n      ms <- t0 + 30L * m; me <- t0 + 30L * (m + 1L) - 1L\n      if (is.finite(adm) && adm < ms) break\n      on_tx <- any(vapply(iv, function(z) z[1] <= me && z[2] + GAP_DAYS >= ms, logical(1)))\n      started_by <- !is.na(first_fill) && first_fill <= me\n      event_m <- as.integer(is.finite(ev) && ev >= ms && ev <= me)\n      out[[length(out) + 1L]] <- data.table(person_id = pid, month = m,\n        on_tx = as.integer(on_tx), started_by = as.integer(started_by),\n        days_since_t0 = as.integer(ms - t0), event = event_m)\n      if (event_m == 1L) break\n    }\n  }\n  merge(rbindlist(out), cohort, by = \"person_id\")\n}\n\nexpand_and_censor <- function(pp) {\n  mk <- function(strat) {\n    c <- copy(pp); c[, `:=`(strategy = strat, clone_id = paste0(person_id, \"_\", strat))]\n    if (strat == \"S0\") c[, artif_cens := as.integer(on_tx == 1L)]\n    else c[, artif_cens := as.integer((days_since_t0 > GRACE_DAYS & started_by == 0L) |\n                                      (started_by == 1L & on_tx == 0L))]\n    c\n  }\n  x <- rbindlist(list(mk(\"S1\"), mk(\"S0\")))\n  x[event == 1L, artif_cens := 0L]          # observed events are not artificially censored\n  setorder(x, clone_id, month)\n  # censoring is absorbing: keep each clone's rows up to and including its first deviation month\n  x[, prior_cens := cumsum(artif_cens) - artif_cens, by = clone_id]\n  x <- x[prior_cens == 0L][, prior_cens := NULL]\n  x[]\n}\n\nstabilized_ipcw <- function(clones, tv_covs = c(\"on_tx\")) {\n  clones[, uncens := 1L - artif_cens]\n  fden <- as.formula(paste(\"uncens ~ bs(days_since_t0, df = 4) +\", paste(tv_covs, collapse = \" + \")))\n  fnum <- uncens ~ bs(days_since_t0, df = 4)\n  res <- clones[, {\n    den <- glm(fden, data = .SD, family = binomial()); num <- glm(fnum, data = .SD, family = binomial())\n    pd <- predict(den, .SD, type = \"response\"); pn <- predict(num, .SD, type = \"response\")\n    ord <- order(clone_id, month)\n    .SD[ord][, sw := ave(pn[ord], clone_id[ord], FUN = cumprod) /\n                   ave(pd[ord], clone_id[ord], FUN = cumprod)][]\n  }, by = strategy]\n  res[, sw := pmin(sw, quantile(sw, 0.99))]      # truncate extreme weights\n  res[]\n}\n\nrun_ccw <- function(cohort, rx, outcome, fup_end, tv_covs = c(\"on_tx\")) {\n  # tv_covs should name the baseline and time-varying predictors of deviation. The on_tx default is a\n  # placeholder: on_tx itself defines deviation, so on its own it can separate the censoring model.\n  pp <- build_person_periods(cohort, rx, outcome, fup_end)\n  w  <- stabilized_ipcw(expand_and_censor(pp), tv_covs)\n  fit <- glm(event ~ strategy + bs(days_since_t0, df = 4),\n             data = w[artif_cens == 0L], family = binomial(), weights = w[artif_cens == 0L, sw])\n  grid <- unique(w[, .(days_since_t0)])[order(days_since_t0)]\n  risk <- sapply(c(\"S1\", \"S0\"), function(s) {\n    h <- predict(fit, cbind(grid, strategy = s), type = \"response\"); 1 - prod(1 - h)\n  })\n  list(fit = fit, risk_S1 = risk[\"S1\"], risk_S0 = risk[\"S0\"],\n       risk_difference = unname(risk[\"S1\"] - risk[\"S0\"]))\n}",
        "description": "Clone-censor-weight per-protocol emulation in R with data.table + splines, mirroring the Python pipeline.\nInputs:\n  cohort  : person_id, t0 (Date), <baseline covariates>\n  rx      : person_id, fill_date (Date), days_supply (integer)\n  outcome : person_id, event_date (Date)            # first all-cause death\n  fup_end : person_id, fup_end_date (Date)           # administrative censoring\nReturns the standardized 1-year risk under each strategy and the risk difference. Wrap run_ccw() in a\nperson-level bootstrap (resample person_id, refit) for confidence intervals.",
        "dependencies": [
          "data.table",
          "splines"
        ],
        "source_citations": [
          "hernan-2016-aje",
          "danaei-2018"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "%let grace = 180; %let horizon = 12;\n\n/* (1) CLONE: duplicate each person-month into strategy S1 (initiate+persist) and S0 (never initiate). */\ndata clones;\n  set work.pp;\n  length strategy $2 clone_id $40;\n  do strategy = 'S1', 'S0';\n    clone_id = catx('_', put(person_id, best12.), strategy);\n    /* (2) ARTIFICIAL CENSORING per the assigned strategy. */\n    if strategy = 'S0' then artif_cens = (on_tx = 1);                 /* censor never-arm at first treatment   */\n    else artif_cens = ( (days_since_t0 > &grace and started_by = 0)   /* failed grace-period initiation        */\n                        or (started_by = 1 and on_tx = 0) );          /* failed persistence (gap > permissible)*/\n    if event = 1 then artif_cens = 0;                                 /* observed events are not censored      */\n    uncens = 1 - artif_cens;\n    output;\n  end;\nrun;\nproc sort data=clones; by strategy clone_id month; run;\n\n/* Censoring is absorbing: keep each clone's rows up to and including its first deviation month. */\ndata clones(drop=censored_before);\n  set clones;\n  by strategy clone_id month;\n  retain censored_before;\n  if first.clone_id then censored_before = 0;\n  if censored_before then delete;\n  if artif_cens = 1 then censored_before = 1;\nrun;\n\n/* (3) IPCW denominator and numerator models, fit separately by strategy, pooled over person-months. */\nproc logistic data=clones noprint;\n  by strategy;\n  model uncens(event='1') = days_since_t0 days_since_t0*days_since_t0 on_tx;  /* on_tx is a placeholder: use baseline + time-varying covs */\n  output out=pden p=p_den;\nrun;\nproc logistic data=clones noprint;\n  by strategy;\n  model uncens(event='1') = days_since_t0 days_since_t0*days_since_t0;        /* stabilization numerator (time)  */\n  output out=pnum p=p_num;\nrun;\n\n/* Stabilized weight = running product of numerator / running product of denominator within each clone. */\ndata weights;\n  /* pden carries every input variable (event, artif_cens, covariates); pnum contributes only p_num */\n  merge pden pnum(keep=strategy clone_id month p_num);\n  by strategy clone_id month;            /* sorted upstream; merge aligns the two predicted-probability streams */\n  retain cum_num cum_den;\n  if first.clone_id then do; cum_num = 1; cum_den = 1; end;\n  cum_num = cum_num * p_num;\n  cum_den = cum_den * p_den;\n  sw = cum_num / cum_den;\nrun;\n\n/* Truncate extreme stabilized weights at the 99th percentile. */\nproc univariate data=weights noprint; var sw; output out=p99 pctlpts=99 pctlpre=p; run;\ndata analytic;\n  if _n_ = 1 then set p99;\n  set weights(where=(artif_cens = 0));   /* outcome model uses uncensored clone-months only */\n  if sw > p99 then sw = p99;\nrun;\n\n/* (4) WEIGHTED POOLED-LOGISTIC outcome model (discrete-time hazard) -> strategy contrast. */\nproc genmod data=analytic;\n  class strategy clone_id;\n  weight sw;\n  model event(event='1') = strategy days_since_t0 days_since_t0*days_since_t0 / dist=bin link=logit;\n  repeated subject=clone_id / type=ind;  /* robust container; bootstrap over person_id for valid CIs */\n  estimate 'S1 vs S0 log-OR (per-month hazard)' strategy -1 1;  /* CLASS levels sort as S0, S1 */\nrun;\n/* Standardize the fitted monthly hazards under each strategy to 1-year cumulative incidence in a DATA step:\n   CI_s = 1 - PROD over months of (1 - hazard_s(month)), then risk difference = CI_S1 - CI_S0. */",
        "description": "Clone-censor-weight per-protocol emulation in SAS. Required input datasets (post data-management):\n  work.pp : LONG person-period table from claims, one row per person-month over the 12-month horizon, with\n            person_id, t0, month, days_since_t0, on_tx (0/1 covered this month), started_by (0/1 ever filled by\n            this month), event (0/1 outcome this month), and baseline covariates. Build work.pp by stitching\n            fill_date + days_supply into covered intervals with a 30-day permissible gap (PROC SQL / DATA step,\n            analogous to the Python/R build_person_periods).\nSteps below perform the method: (1) clone each person-month into S1 and S0; (2) set the artificial-censoring flag;\n(3) IPCW via PROC LOGISTIC pooled over person-months and cumulative-product stabilized weights in a DATA step;\n(4) weighted pooled-logistic outcome model in PROC GENMOD. Use a person-level bootstrap (PROC SURVEYSELECT by\nperson_id, re-run) for confidence intervals; robust SE alone understates variance because weights are estimated.",
        "dependencies": [],
        "source_citations": [
          "hernan-2016-aje",
          "danaei-2018"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": "clone-censor-weight-per-protocol-timeline.svg",
        "mermaid": null,
        "caption": "Maria is cloned at time zero into two arms. Clone S1 (initiate and persist) remains active all year because she filled within the 180-day grace period and maintained continuous supply. Clone S0 (never initiate) is artificially censored on day 100 the moment the first fill appears. Inverse-probability-of-censoring weights compensate for the censored S0 person-time so the per-protocol comparison remains unbiased.",
        "alt_text": "A timeline from 2024-01-01 to 2024-12-31 showing two parallel arms for one patient. Both arms are active from January through April 9. On April 10 (day 100), the S1 arm continues uncensored while the S0 arm ends with an artificial censoring marker. Two statin fill bars sit on the S1 arm: the first from April 10 through July 8, the second from July 5 through October 2, illustrating continuous coverage with a minor overlap.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Elig[Eligible at t0: satisfies BOTH strategies<br/>no prior statin, continuous enrollment] --> Clone{Clone once per strategy}\n  Clone -->|assign S1: initiate within grace + persist| A0[Clone A starts at t0 under S1]\n  Clone -->|assign S0: never initiate| B0[Clone B starts at t0 under S0]\n  A0 --> Acens[Artificially censor clone A at<br/>grace-period failure OR persistence lapse]\n  B0 --> Bcens[Artificially censor clone B at<br/>first statin fill]\n  Acens --> IPCW[Stabilized IPCW per arm<br/>pooled logistic: P uncensored each month]\n  Bcens --> IPCW\n  IPCW --> Out[Weighted pooled-logistic outcome model<br/>standardize to 1-year cumulative incidence]\n  Out --> Boot[Person-level bootstrap CIs<br/>weight + ESS diagnostics, sensitivity to grace/gap/truncation]",
        "caption": "The clone-censor-weight procedure. Because every eligible person satisfies both strategies at time zero, each is cloned once per strategy; each clone is artificially censored at first deviation from its assigned rule, and inverse-probability-of-censoring weights restore the eligible population before the weighted outcome model standardizes to cumulative incidence.",
        "alt_text": "Flowchart from an eligible patient at time zero, through cloning into one clone per strategy, artificial censoring at deviation, inverse-probability-of-censoring weighting, a weighted outcome model, and bootstrap inference with diagnostics.",
        "source_type": "illustrative",
        "source_citations": [
          "hernan-2016-aje"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "gantt\n  title One observed patient becomes two clones with different censor times\n  dateFormat YYYY-MM-DD\n  axisFormat %b %Y\n  section Observed (single patient)\n  MI discharge = time zero :milestone, t0, 2024-01-01, 0d\n  No statin until day 90, then persistent fills :obs, 2024-01-01, 365d\n  section Clone S1 (initiate within 180d + persist)\n  Eligible and uncensored (filled day 90 within grace, stays covered) :active, s1, 2024-01-01, 365d\n  Outcome or admin censoring ends follow-up :crit, s1e, 2024-12-31, 1d\n  section Clone S0 (never initiate)\n  Uncensored until first fill :done, s0, 2024-01-01, 89d\n  Artificially censored at first statin fill (day 90) :crit, s0c, 2024-03-31, 1d",
        "caption": "How a single observed person contributes to both arms. This patient initiated a statin on day 90 and persisted, so the S1 clone remains eligible and uncensored (its grace period and persistence are met) while the S0 clone is artificially censored on day 90 at the first fill. IPCW reweights to compensate for the censored S0 person-time.",
        "alt_text": "Gantt timeline showing one MI patient who starts a statin on day 90; the initiate-and-persist clone stays uncensored to the end while the never-initiate clone is artificially censored at day 90.",
        "source_type": "illustrative",
        "source_citations": [
          "hernan-2016-aje"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "used_with",
        "target_slug": "target-trial-emulation",
        "notes": "CCW is the per-protocol estimation engine of target-trial emulations of sustained, duration, or dynamic strategies that are not distinguishable at baseline."
      },
      {
        "relation_type": "see_also",
        "target_slug": "marginal-structural-models-g-methods",
        "notes": "CCW with IPCW is mechanically a marginal structural model fit on a cloned person-time scaffold; g-methods can also estimate sustained-strategy effects without cloning, trading explicit protocol specification for efficiency."
      },
      {
        "relation_type": "see_also",
        "target_slug": "landmark-analysis",
        "notes": "Both address exposure-timing problems; landmark conditions on survivors to a fixed time while CCW clones everyone at baseline and recovers the full population through weighting."
      },
      {
        "relation_type": "see_also",
        "target_slug": "immortal-time-bias-handling",
        "notes": "With a properly specified target trial, CCW is among the most robust ways to prevent immortal-time bias when estimating the effects of strategies that allow delayed initiation (grace periods) or require sustained treatment."
      },
      {
        "relation_type": "used_with",
        "target_slug": "cox-ph-regression",
        "notes": "A weighted Cox model (or, more commonly, a weighted pooled logistic) on the cloned, censored, weighted clone-time is the typical analytic step for the per-protocol contrast."
      },
      {
        "relation_type": "used_with",
        "target_slug": "predictive-and-causal-ml-models-rwe",
        "notes": "Causal ML (double/debiased ML or TMLE) can estimate nuisance models or the final contrast on the cloned weighted data for efficiency or double robustness."
      },
      {
        "relation_type": "see_also",
        "target_slug": "active-comparator-new-user",
        "notes": "When the strategies ARE distinguishable at baseline (two point initiations), an active-comparator new-user design with IPTW is simpler and more efficient than cloning."
      }
    ],
    "aliases": [
      "clone-censor-weight",
      "CCW",
      "cloning-censoring-weighting",
      "clone-censor-weighting",
      "per-protocol emulation",
      "artificial censoring"
    ],
    "complexity": "advanced",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "cluster-randomized",
    "name": "Cluster-Randomized Trial",
    "short_definition": "An experimental design in which intact groups (clinics, practices, hospitals, communities) rather than individuals are randomly assigned to study arms, so that allocation is at the cluster level while outcomes are usually measured on individuals within clusters, inducing within-cluster outcome correlation that must be accounted for in both sample size and analysis.",
    "long_description": "A **cluster-randomized trial (CRT)** — also called a group-randomized or cluster-randomised trial — randomizes\n*intact groups* (primary-care practices, hospital wards, dialysis units, nursing homes, geographic communities)\nto intervention or control, then ascertains outcomes on the individuals who belong to those groups. The defining\nfeature is a **mismatch between the unit of randomization (the cluster) and the unit of observation/analysis (the\nindividual)**. Because people within a cluster share clinicians, protocols, local case-mix, and environment, their\noutcomes are positively correlated. That correlation, summarized by the **intracluster correlation coefficient\n(ICC, ρ)**, is the single fact that governs everything downstream: it shrinks the effective sample size, inflates\nvariance, and makes the naïve \"treat every patient as independent\" analysis anticonservative (too-narrow confidence\nintervals, inflated type-I error). In RWE, CRTs most often appear as **pragmatic trials embedded in routine care** —\nrandomize clinics to a clinical-decision-support (CDS) alert, a care-pathway change, or an outreach program, and\nmeasure effectiveness using EHR or claims outcomes already being captured.\n\n**Core conceptual distinction (vs individually randomized trials).** An individually randomized trial balances\nconfounders in expectation at the person level and yields independent observations. A CRT randomizes a *handful* of\nunits — sometimes 6, 10, 20 clusters — so chance imbalance between arms is far more likely, and the observations are\nnot independent. Two estimands must be kept straight. The **cluster-level estimand** treats each cluster as one data\npoint (e.g., the mean of cluster-specific rates), is robust with few clusters, and weights clusters equally. 
The\n**individual-level (population-average or cluster-specific) estimand** uses every patient but must model the\ncorrelation — via generalized estimating equations (GEE, population-average) or a generalized linear mixed model\n(GLMM, cluster-specific/conditional). With a binary outcome and a non-identity link these two individual-level\nestimands are *not* equal (the marginal GEE odds ratio is attenuated relative to the conditional GLMM odds ratio);\nstating which you report is mandatory.\n\n**The design effect is not optional arithmetic.** The variance inflation from clustering is the **design effect**,\nDEFF = 1 + (m̄ − 1)·ρ, where m̄ is the average cluster size. A trial that would need N individuals under simple\nrandomization needs N·DEFF individuals under cluster randomization. With m̄ = 50 and a modest ρ = 0.02, DEFF ≈ 1.98 —\nthe study needs roughly twice the patients. When cluster sizes vary, the effective DEFF rises further; Eldridge et al.\nshowed it becomes 1 + ((1 + CV²)·m̄ − 1)·ρ, where CV is the coefficient of variation of cluster size. Powering a CRT\nwith an individually-randomized sample-size formula is a classic, fatal error: the trial is underpowered before it\nenrolls a single patient.\n\n**Pros, cons, and trade-offs.**\n- **vs the individually randomized trial (RCT):** CRTs are the right (often only) choice when the intervention is\n  delivered to a group and cannot be withheld from individuals in the same setting — a clinician trained in a new\n  protocol cannot un-know it for half their panel (**treatment contamination**), and ward-level infection-control or\n  community health-promotion interventions are inherently group-level. CRTs also ease logistics (train one site, not\n  one patient at a time) and reduce contamination. Cost: dramatically lower statistical efficiency (the design effect),\n  far greater vulnerability to chance imbalance and to **identification/recruitment bias** (see below), and more\n  complex analysis. 
**Prefer individual randomization** whenever contamination is implausible and individuals can be\n  consented and randomized independently — it is strictly more efficient.\n- **vs the stepped-wedge CRT:** A parallel CRT randomizes clusters to fixed arms for the whole study. A stepped-wedge\n  design rolls the intervention out to all clusters in randomized sequence, so every cluster eventually crosses over;\n  it suits one-directional rollouts, gives every site the intervention (often required for adoption/ethics), and uses\n  within-cluster contrasts. Cost: confounding of the treatment effect with secular time trends, a more demanding mixed\n  model, and sensitivity to how the time effect is specified. **Prefer parallel CRT** for a clean concurrent contrast;\n  **prefer stepped-wedge** when a phased rollout is unavoidable or when all clusters must receive the intervention.\n- **vs observational comparative-effectiveness (e.g., active-comparator new-user) in the same RWD:** randomization at\n  the cluster level removes confounding *by assignment* (no confounding-by-indication, no healthy-user bias) — the core\n  advantage over any observational design. Cost: you randomize few units, accept the design effect, and often cannot\n  blind; an observational ACNU/PS analysis on the full database may have far more power and external validity for a\n  *patient-level* drug-vs-drug question. 
**Prefer a CRT** for system/provider-level interventions; **prefer\n  observational CER** for individual-level drug choices where assignment can be modeled.\n- **vs target-trial emulation:** a CRT *is* a trial, so it needs no emulation; but a well-specified CRT protocol is\n  exactly the target trial that an observational study of a system-level intervention would try to emulate.\n\n**When to use.** The intervention acts at the group level (provider training, CDS/EHR alerts, care pathways,\nward protocols, community programs); individual randomization would cause contamination or is logistically impossible;\nand you can recruit/enumerate participants *before* clusters are revealed to be intervention or control. Pragmatic,\nembedded CRTs that read endpoints from routine EHR/claims data are a natural RWE application and are explicitly\nrecognized in regulatory real-world-evidence and pragmatic-trial guidance.\n\n**When NOT to use — and when it is actively misleading or dangerous.**\n- **Few clusters.** With fewer than ~15–20 clusters per arm, GEE and model-based GLMM standard errors are biased\n  *downward* — confidence intervals too narrow, type-I error inflated. If you must use few clusters, switch to a\n  cluster-level summary analysis (a t-test on cluster means) or apply a small-sample correction (Mancl–DeRouen /\n  Kauermann–Carroll for GEE; Satterthwaite/Kenward–Roger denominator df for mixed models). A \"significant\" individual-\n  level GEE result from 8 clusters is often an artifact, not an effect.\n- **Post-randomization recruitment with knowledge of allocation (identification/recruitment bias).** This is the\n  signature CRT failure. If patients are enrolled *after* a cluster is known to be intervention vs control, recruiters\n  (or patients) behave differently by arm — sicker patients are preferentially enrolled in the intervention clinic,\n  healthier ones in control — and the arms are no longer comparable despite \"randomization.\" Puffer et al. 
documented\n  this in published CRTs. The fix is structural: identify and consent the cohort, or define the closed population,\n  *before* allocation is disclosed; if outcomes come from a fixed, pre-existing roster (a true RWE strength), this bias\n  is largely defused.\n- **Ignoring clustering in the analysis.** Running ordinary logistic/Cox/linear regression on individual rows as if\n  independent is not conservative — it is wrong in the dangerous direction, manufacturing false precision.\n- **Adjusting for post-randomization, cluster-level variables** that are affected by the intervention re-introduces\n  confounding and can reverse the sign of the effect; baseline-only, pre-randomization adjustment is the safe rule.\n- **An intervention with no plausible group-level mechanism** gains nothing from cluster randomization and simply pays\n  the design-effect penalty for no reason — use individual randomization.\n\n**Data-source operational depth (RWE / routinely-collected data).**\n- **Claims:** Strong for hard, well-coded endpoints (hospitalization, ED visits, dispensings, death via linkage) on a\n  closed, enumerable denominator — ideal for an *embedded* CRT where practices/health-plan regions are randomized and\n  outcomes are read from claims. Failure modes: the cluster (e.g., a practice) must be unambiguously linkable to the\n  plan's members and stable over follow-up; **member churn / disenrollment** silently changes cluster composition and\n  cluster size (CV inflation in the design effect) — require continuous enrollment across the measurement window.\n  **Medicare Advantage encounter data are incomplete vs fee-for-service claims**, so a cluster's apparent event rate\n  can reflect MA penetration, not the intervention; restrict to a consistent benefit type or model it. 
Attribution of\n  a patient to exactly one cluster (which practice \"owns\" a patient seen at several) must be defined a priori.\n- **EHR:** Best for the intervention itself (the CDS alert, order set, problem list) and for clinical granularity\n  (labs, vitals, BP, HbA1c) needed for both outcomes and ICC estimation. Failure modes: visit-driven capture means a\n  patient who stops visiting an intervention clinic is differentially lost; **the ICC is often larger in EHR data**\n  because shared clinicians and local documentation habits cluster outcomes more tightly — using a too-small ICC from a\n  different setting under-powers the trial. Cross-coverage and patients seen at multiple sites blur cluster membership.\n- **Registry:** Useful when the cluster is a center already contributing to a disease registry (adjudicated outcomes,\n  disease severity); typically weak for complete medication exposure — link to claims for fills and to a death index\n  for censoring. Centers that join/leave the registry change the cluster set mid-study.\n- **Linked claims–EHR–vital records:** The ideal embedded-CRT substrate — EHR captures the intervention and severity,\n  claims complete the utilization picture, vital records fix mortality. Cost: linkage selects the linkable subset\n  (potential generalizability loss) and introduces order/fill/service-date discrepancies; reconcile dates before\n  defining the outcome window, and confirm linkage rates are balanced across arms.\n\n**Worked claims/EHR example.** Question: does a practice-level EHR clinical-decision-support alert that flags overdue\nstatin therapy reduce 12-month cardiovascular hospitalization among adults with type 2 diabetes, evaluated as a\npragmatic CRT in a claims-linked EHR network? (1) **Unit of randomization = the practice** (say 24 practices, 12 per\narm); **unit of analysis = the patient**. 
(2) **Define the closed cohort before allocation is revealed**: adults ≥40\nwith ≥2 diabetes diagnoses and ≥365 days of continuous enrollment ending at the index quarter, attributed to exactly\none practice by plurality of primary-care visits in the prior year — *enumerate this roster first* to prevent\nidentification bias. (3) **Randomize the 24 practices**, ideally with restricted/covariate-constrained randomization\non baseline practice-level statin rate and panel size to curb chance imbalance with few clusters. (4) **Outcome**:\nfirst CV hospitalization (inpatient claim with a qualifying primary diagnosis) over the 12 months after activation;\ncensor at disenrollment, death, or data end. (5) **Power the trial with the design effect**: with m̄ ≈ 300 patients\nper practice and an EHR-plausible ICC ρ = 0.01, DEFF = 1 + (300 − 1)·0.01 ≈ 4.0 — the effective sample size is one\nquarter of the head count, and unequal practice sizes (CV of cluster size) push it higher. (6) **Analysis**: report a\npre-specified primary estimand — a population-average risk difference via GEE with an exchangeable working correlation\nand cluster-robust (sandwich) variance *with a small-sample correction* given only 24 clusters, or a cluster-level\ncomparison of practice-specific rates as the robust primary, with the patient-level GLMM as secondary. Adjust only for\n**baseline, pre-randomization** covariates (age, sex, prior CV history, baseline statin use). (7) **Sensitivity**:\nre-estimate ICC from the observed data, vary the cluster-attribution rule, and restrict to a single benefit type to\nrule out MA encounter-completeness artifacts.",
    "primary_category": "Study_Design",
    "tags": [
      "cluster-randomization",
      "group-randomized",
      "pragmatic-trial",
      "intracluster-correlation",
      "design-effect",
      "gee",
      "mixed-effects",
      "recruitment-bias",
      "study-design"
    ],
    "applies_to_study_types": [
      "cluster_randomized"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1002/9781118763452",
        "url": "https://doi.org/10.1002/9781118763452",
        "citation_text": "Campbell MJ, Walters SJ. How to Design, Analyse and Report Cluster Randomised Trials in Medicine and Health Related Research. Chichester: John Wiley & Sons; 2014.",
        "year": 2014,
        "authors_short": "Campbell & Walters",
        "notes": "Comprehensive monograph on CRT design, sample-size with the design effect, ICC, and clustered analysis (GEE, mixed models, cluster-level summaries)."
      },
      {
        "role": "explain",
        "doi": "10.1136/bmj.328.7441.702",
        "url": "https://doi.org/10.1136/bmj.328.7441.702",
        "citation_text": "Campbell MK, Elbourne DR, Altman DG, for the CONSORT Group. CONSORT statement: extension to cluster randomised trials. BMJ. 2004;328(7441):702-708.",
        "year": 2004,
        "authors_short": "Campbell et al.",
        "notes": "Foundational articulation of the design rationale, the unit-of-randomization/unit-of-analysis distinction, and the reporting elements specific to CRTs."
      },
      {
        "role": "explain",
        "doi": "10.1093/ije/dyl129",
        "url": "https://doi.org/10.1093/ije/dyl129",
        "citation_text": "Eldridge SM, Ashby D, Kerry S. Sample size for cluster randomized trials: effect of coefficient of variation of cluster size and analysis method. International Journal of Epidemiology. 2006;35(5):1292-1300.",
        "year": 2006,
        "authors_short": "Eldridge et al.",
        "notes": "Derives how unequal cluster sizes (coefficient of variation) inflate the design effect; the basis for honest CRT power calculations in real-world data where cluster sizes vary."
      },
      {
        "role": "explain",
        "doi": "10.1136/bmj.327.7418.785",
        "url": "https://doi.org/10.1136/bmj.327.7418.785",
        "citation_text": "Puffer S, Torgerson D, Watson J. Evidence for risk of bias in cluster randomised trials: review of recent trials published in three general medical journals. BMJ. 2003;327(7418):785-789.",
        "year": 2003,
        "authors_short": "Puffer et al.",
        "notes": "Documents identification/recruitment bias from enrolling participants after allocation is known; the empirical case for recruiting before randomization is revealed."
      },
      {
        "role": "demonstrate",
        "doi": "10.1136/bmj.h391",
        "url": "https://doi.org/10.1136/bmj.h391",
        "citation_text": "Hemming K, Haines TP, Chilton PJ, Girling AJ, Lilford RJ. The stepped wedge cluster randomised trial: rationale, design, analysis, and reporting. BMJ. 2015;350:h391.",
        "year": 2015,
        "authors_short": "Hemming et al.",
        "notes": "Worked treatment of the stepped-wedge variant, including the mixed model that separates the intervention effect from secular time trends."
      }
    ],
    "plain_language_summary": "A cluster-randomized trial assigns whole groups — such as clinics, hospital wards, or communities — to an intervention or to usual care, rather than assigning individual patients. Researchers then measure outcomes on the patients inside those groups. This design is used when the intervention is delivered to an entire group at once (for example, training all providers at a clinic), so you cannot give it to some patients but not others in the same setting. The catch is that patients at the same clinic tend to look more alike than patients drawn randomly from everywhere, and that similarity must be accounted for when calculating how many patients you need and when analyzing the results.",
    "key_terms": [
      {
        "term": "cluster",
        "definition": "The group that receives the intervention as a whole unit — for example, a clinic, a hospital ward, or a community — rather than an individual patient."
      },
      {
        "term": "intracluster correlation coefficient (ICC)",
        "definition": "A number between 0 and 1 that measures how similar patients within the same cluster are to each other compared with patients across different clusters; even a small ICC (e.g., 0.05) meaningfully reduces the statistical information in the data."
      },
      {
        "term": "design effect",
        "definition": "The factor by which the required sample size grows because outcomes are correlated within clusters; a design effect of 2.0 means you need twice as many patients as an individually randomized trial would need to achieve the same statistical power."
      },
      {
        "term": "unit of randomization vs unit of analysis",
        "definition": "In a cluster-randomized trial the cluster (e.g., clinic) is randomized, but outcomes are usually measured on individual patients — this mismatch is what creates the within-cluster correlation problem."
      },
      {
        "term": "identification bias",
        "definition": "A distortion that occurs when patients are enrolled into the study after the clinic has already been told which arm it is in, so that sicker or healthier patients are selectively entered depending on the arm."
      }
    ],
    "worked_example": {
      "scenario": "A health system wants to test whether placing an automated reminder in the clinic electronic health record reduces 90-day hospital readmission among adults with heart failure. They cannot give the reminder to only some patients at a clinic because every provider at that clinic sees it. So they randomize six clinics as whole units: three get the reminder turned on (intervention), three do not (control). All heart-failure patients at each clinic are followed for 90 days and their readmission status is recorded.",
      "dataset": {
        "caption": "Each row is one clinic. Patients are nested inside their clinic and cannot move between groups.",
        "columns": [
          "clinic_id",
          "arm",
          "patients_n",
          "readmissions_n",
          "readmission_rate"
        ],
        "rows": [
          [
            "Clinic A",
            "Intervention",
            20,
            3,
            "15.0%"
          ],
          [
            "Clinic B",
            "Intervention",
            25,
            4,
            "16.0%"
          ],
          [
            "Clinic C",
            "Intervention",
            15,
            2,
            "13.3%"
          ],
          [
            "Clinic D",
            "Control",
            20,
            7,
            "35.0%"
          ],
          [
            "Clinic E",
            "Control",
            25,
            8,
            "32.0%"
          ],
          [
            "Clinic F",
            "Control",
            15,
            5,
            "33.3%"
          ]
        ]
      },
      "steps": [
        "Because whole clinics were randomized, the correct comparison starts at the clinic level. Compute each clinic's readmission rate (readmissions divided by patients): Clinic A = 3/20 = 15.0%, B = 4/25 = 16.0%, C = 2/15 = 13.3%, D = 7/20 = 35.0%, E = 8/25 = 32.0%, F = 5/15 = 33.3%.",
        "Average the three intervention clinic rates: (15.0 + 16.0 + 13.3) / 3 = 14.8%. Average the three control clinic rates: (35.0 + 32.0 + 33.3) / 3 = 33.4%. The cluster-level risk difference is 14.8% minus 33.4% = -18.6 percentage points.",
        "Now ask: did the sample size calculation account for within-clinic correlation? Patients at the same clinic share a care team, documentation habits, and local case mix, so their outcomes are more alike than two patients from different clinics. The ICC captures that similarity.",
        "Calculate the design effect: average cluster size m = 120 patients divided by 6 clinics = 20 patients per clinic. With an ICC of 0.05, design effect = 1 + (20 - 1) x 0.05 = 1 + 0.95 = 1.95. The trial effectively has only about half the statistical information of a 120-patient individually randomized trial.",
        "If the sample size was calculated assuming independent individuals (no clustering), the trial has only about half the effective sample size it was planned to have (120 / 1.95 is roughly 62), so it was underpowered before a single patient was enrolled. This is the classic error in cluster-randomized trials."
      ],
      "result": "Intervention clinics averaged a 14.8% readmission rate versus 33.4% in control clinics, a cluster-level risk difference of -18.6 percentage points. However, with only 3 clinics per arm and a design effect of 1.95 (ICC 0.05, 20 patients per clinic), the effective sample is roughly 62 independent observations rather than 120 — the confidence interval around that difference is wide, and any significance test must use cluster-level or clustering-adjusted methods, not a simple chi-square on pooled patient counts."
    },
    "prerequisites": [
      "pragmatic-trial",
      "cohort-prospective",
      "sample-size-power-precision-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Parallel cluster-randomized trial",
        "description": "Clusters are randomized once to intervention or control and remain in that arm for the whole study; the contrast is concurrent and between clusters.",
        "edge_cases": [
          "With few clusters, chance imbalance is likely; use restricted/covariate-constrained randomization or stratify clusters on baseline rate and size.",
          "Cluster-level confounders that differ by arm cannot be removed by randomization when the number of clusters is small."
        ],
        "data_source_notes": "claims/EHR: define cluster membership and the closed denominator before allocation; require continuous enrollment so cluster size and composition are stable."
      },
      {
        "name": "Stepped-wedge cluster-randomized trial",
        "description": "All clusters start in control and cross to intervention in a randomized sequence over successive time periods, so each cluster contributes both control and intervention person-time.",
        "edge_cases": [
          "The intervention effect is confounded with secular time trends; the analysis model must include a (correctly specified) time effect.",
          "The standard analysis model assumes the intervention effect is immediate and constant unless time-on-treatment terms are added; lagged effects bias the estimate."
        ],
        "data_source_notes": "EHR/claims: align rollout dates to data refresh cycles; misdated activation contaminates control person-time with treated outcomes."
      },
      {
        "name": "Cluster-level (cluster-summary) analysis",
        "description": "Each cluster is reduced to a single summary statistic (rate, mean, or adjusted residual) and the arms are compared with a t-test or weighted regression on those summaries.",
        "edge_cases": [
          "Loses individual-level covariate adjustment unless a two-stage (regress individuals, then summarize cluster residuals) approach is used.",
          "Equal vs size-weighted cluster weighting changes the estimand and the result when cluster sizes vary."
        ],
        "data_source_notes": "Robust default when the number of clusters is small (<~20-30), where GEE/GLMM standard errors are anticonservative."
      },
      {
        "name": "Individual-level model with clustering accounted for (GEE or GLMM)",
        "description": "Every patient is analyzed with a model that captures within-cluster correlation - GEE for a population-average (marginal) estimand, or a random-effects GLMM for a cluster-specific (conditional) estimand.",
        "edge_cases": [
          "GEE sandwich variances are biased downward with few clusters; apply Mancl-DeRouen or Kauermann-Carroll corrections.",
          "For non-identity links the marginal (GEE) and conditional (GLMM) effect estimates differ in magnitude and must not be conflated."
        ],
        "data_source_notes": "Estimate the ICC from the trial's own data and report it; an imported ICC from a different setting may be far off."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Individually randomized controlled trial",
        "pros_of_this": "Handles group-level interventions, prevents contamination, and can be embedded in routine care with outcomes read from EHR/claims.",
        "cons_of_this": "Much lower statistical efficiency (the design effect inflates the required sample size), greater chance imbalance with few clusters, and vulnerability to identification/recruitment bias.",
        "when_to_prefer": "When the intervention is delivered at the group level or contamination across individuals in the same setting is plausible."
      },
      {
        "compared_to": "Stepped-wedge cluster-randomized trial",
        "pros_of_this": "A clean concurrent between-cluster contrast not confounded by calendar time.",
        "cons_of_this": "Requires withholding the intervention from control clusters for the whole study, which can be ethically or operationally infeasible.",
        "when_to_prefer": "When a phased rollout is unnecessary and a simple parallel comparison is acceptable to sites and ethics review."
      },
      {
        "compared_to": "Observational comparative-effectiveness (e.g., active-comparator new-user) in the same data",
        "pros_of_this": "Randomization at the cluster level removes confounding by assignment (no confounding by indication or healthy-user bias).",
        "cons_of_this": "Few randomized units and the design effect cut power; blinding is often impossible; an observational PS analysis may have far more power for patient-level drug questions.",
        "when_to_prefer": "For provider/system-level interventions where assignment cannot be modeled; use observational CER for individual-level drug choices."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Cluster (practice/region/plan) must be unambiguously and stably linkable to enrollees. Require continuous enrollment across the outcome window so cluster size and composition are fixed; member churn changes the design effect. Restrict to a consistent benefit type (Medicare Advantage encounter data are less complete than fee-for-service) and attribute each patient to exactly one cluster a priori.",
      "ehr": "EHR captures the intervention (CDS alert, order set) and clinical detail for outcomes and ICC. Visit-driven capture makes loss differential when patients stop attending an intervention site; ICCs tend to be larger in EHR data, so do not import a small ICC from elsewhere. Resolve patients seen at multiple sites before assigning cluster membership.",
      "registry": "Suitable when the cluster is a participating center with adjudicated outcomes and severity; weak for complete pharmacy exposure (link to claims) and mortality (link to a death index). Centers joining/leaving mid-study alter the cluster set.",
      "linked": "Linked claims-EHR-vital records is the ideal embedded-CRT substrate (intervention + completeness + mortality) but introduces linkage selection and order/fill/service-date discrepancies; confirm linkage rates are balanced across arms before analysis."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\nimport pandas as pd\nimport statsmodels.api as sm\nimport statsmodels.formula.api as smf\n\n# ---- Step 1: required sample size under cluster randomization -------------------------\ndef crt_sample_size(n_simple: int, m_bar: float, rho: float, cv: float = 0.0):\n    \"\"\"Inflate a simple-randomization individual N by the (variable cluster size) design effect.\n    DEFF = 1 + ((1 + cv**2) * m_bar - 1) * rho   (cv=0 -> classic 1 + (m_bar - 1)*rho).\"\"\"\n    deff = 1.0 + ((1.0 + cv ** 2) * m_bar - 1.0) * rho\n    n_individuals = int(np.ceil(n_simple * deff))\n    n_clusters = int(np.ceil(n_individuals / m_bar))\n    return {\"deff\": deff, \"n_individuals\": n_individuals, \"n_clusters\": n_clusters}\n\n# Example: 788 needed under simple randomization, mean 300/practice, ICC 0.01, CV 0.6.\nprint(crt_sample_size(n_simple=788, m_bar=300, rho=0.01, cv=0.6))\n\n# ---- Step 2: individual-level population-average estimand via GEE ---------------------\n# df columns: outcome (0/1), arm (0/1), cluster_id, and baseline covariates only.\ndef fit_gee(df: pd.DataFrame, covars: list[str]) -> sm.GEE:\n    rhs = \" + \".join([\"arm\"] + covars)\n    model = smf.gee(\n        f\"outcome ~ {rhs}\", groups=\"cluster_id\", data=df,\n        family=sm.families.Binomial(),                 # logit link -> marginal OR\n        cov_struct=sm.cov_struct.Exchangeable(),       # one within-cluster correlation\n    )\n    # cov_type='bias_reduced' applies the Kauermann-Carroll small-sample sandwich correction,\n    # essential when the number of clusters is small (anticonservative otherwise).\n    return model.fit(cov_type=\"bias_reduced\")\n\n# ---- Step 3: robust cluster-level summary analysis (few clusters) ---------------------\ndef cluster_level_test(df: pd.DataFrame):\n    \"\"\"Reduce each cluster to its event rate, then compare arm means (Welch t-test).\"\"\"\n    from scipy import stats\n    rates = df.groupby([\"cluster_id\", 
\"arm\"])[\"outcome\"].mean().reset_index()\n    a = rates.loc[rates[\"arm\"] == 1, \"outcome\"]\n    b = rates.loc[rates[\"arm\"] == 0, \"outcome\"]\n    t, p = stats.ttest_ind(a, b, equal_var=False)\n    return {\"risk_diff_cluster_means\": a.mean() - b.mean(), \"t\": t, \"p\": p}",
        "description": "Sample-size and clustered-analysis core for a CRT with a binary outcome. Two required inputs:\n  design params : average cluster size (m_bar), ICC (rho), coefficient of variation of cluster size (cv),\n                  and the simple-randomization N from a standard two-proportion calculation.\n  patient table : person_id, cluster_id, arm in {0,1}, outcome (0/1), plus baseline (pre-randomization) covariates.\nStep 1 inflates the simple-randomization N by the variable-cluster-size design effect (Eldridge 2006).\nStep 2 fits a population-average (marginal) GEE; with few clusters prefer the cluster-level summary in Step 3.",
        "dependencies": [
          "pandas",
          "numpy",
          "statsmodels"
        ],
        "source_citations": [
          "eldridge-2006",
          "campbell-walters-2014"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(geepack)\nlibrary(lme4)\n\n## ---- Sample size: variable-cluster-size design effect (Eldridge 2006) ----------------\ncrt_sample_size <- function(n_simple, m_bar, rho, cv = 0) {\n  deff <- 1 + ((1 + cv^2) * m_bar - 1) * rho\n  n_ind <- ceiling(n_simple * deff)\n  list(deff = deff, n_individuals = n_ind, n_clusters = ceiling(n_ind / m_bar))\n}\ncrt_sample_size(n_simple = 788, m_bar = 300, rho = 0.01, cv = 0.6)\n\n## ---- Population-average (marginal) estimand: GEE with exchangeable correlation --------\n## geeglm uses the cluster-robust sandwich variance by default; cluster_id MUST be a factor\n## and rows sorted within cluster.\ndat <- dat[order(dat$cluster_id), ]\ndat$cluster_id <- factor(dat$cluster_id)\nfit_gee <- geeglm(outcome ~ arm + age + prior_cvd + baseline_statin,\n                  id = cluster_id, data = dat,\n                  family = binomial(\"logit\"), corstr = \"exchangeable\")\nsummary(fit_gee)   # arm coefficient = marginal log-OR; SE is sandwich-based\n\n## ---- Cluster-specific (conditional) estimand: random-intercept GLMM -------------------\nfit_glmm <- glmer(outcome ~ arm + age + prior_cvd + baseline_statin + (1 | cluster_id),\n                  data = dat, family = binomial(\"logit\"))\n## ICC on the latent scale: tau2 / (tau2 + pi^2/3)\ntau2 <- as.numeric(VarCorr(fit_glmm)$cluster_id)\nicc  <- tau2 / (tau2 + pi^2 / 3)\n\n## ---- Robust cluster-level summary t-test (few clusters) ------------------------------\nrates <- aggregate(outcome ~ cluster_id + arm, data = dat, FUN = mean)\nt.test(outcome ~ arm, data = rates, var.equal = FALSE)",
        "description": "CRT sample size and analysis in R. Inputs mirror the Python version:\n  patient data.frame: person_id, cluster_id, arm (0/1), outcome (0/1), baseline covariates.\nUses geepack for the population-average GEE and lme4 for the cluster-specific GLMM; the cluster-level\nsummary t-test is the robust fallback when the number of clusters is small.",
        "dependencies": [
          "geepack",
          "lme4"
        ],
        "source_citations": [
          "eldridge-2006",
          "campbell-walters-2014"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "proc sort data=work.crt; by cluster_id; run;\n\n/* Population-average (marginal) estimand: GEE, exchangeable working correlation. */\n/* MBN = Morel-Bokossa-Neerchal small-sample sandwich correction for few clusters. */\nproc genmod data=work.crt descending;\n  class cluster_id arm (ref='0');\n  model outcome = arm age prior_cvd baseline_statin / dist=binomial link=logit;\n  repeated subject=cluster_id / type=exch corrw modelse;\n  lsmeans arm / diff exp;            /* marginal odds ratio for arm */\nrun;\n\n/* Cluster-specific (conditional) estimand: random-intercept GLMM. */\nproc glimmix data=work.crt method=laplace;\n  class cluster_id arm (ref='0');\n  model outcome(event='1') = arm age prior_cvd baseline_statin / dist=binary link=logit solution oddsratio;\n  random intercept / subject=cluster_id;   /* between-cluster variance -> ICC */\nrun;\n\n/* Robust cluster-level summary analysis (preferred primary when clusters are few). */\nproc means data=work.crt noprint nway;\n  class cluster_id arm;\n  var outcome;\n  output out=clrates mean=cluster_rate;\nrun;\nproc ttest data=clrates;               /* compare practice-specific event rates by arm */\n  class arm;\n  var cluster_rate;\nrun;",
        "description": "CRT analysis in SAS. Required input dataset (post data-management), one row per patient:\n  work.crt : person_id, cluster_id, arm (0/1), outcome (0/1), and baseline (pre-randomization)\n             covariates (age, prior_cvd, baseline_statin). Sort by cluster_id before PROC GENMOD.\nPROC GENMOD/REPEATED gives the population-average (GEE, marginal) estimand; PROC GLIMMIX with a\nRANDOM intercept gives the cluster-specific (conditional) estimand. With few clusters, use the\ncluster-level summary (PROC MEANS + PROC TTEST) as the robust primary and apply EMPIRICAL small-\nsample corrections in GENMOD.",
        "dependencies": [],
        "source_citations": [
          "campbell-walters-2014"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Pop[Source population: enumerable patients in candidate clusters] --> Enum[Define closed cohort and attribute each patient to ONE cluster<br/>BEFORE allocation is revealed -> prevents identification bias]\n  Enum --> Rand[Randomize CLUSTERS to arms<br/>restricted / covariate-constrained with few clusters]\n  Rand --> Int[Intervention clusters: deliver group-level intervention<br/>e.g., EHR decision-support alert]\n  Rand --> Con[Control clusters: usual care]\n  Int --> Out[Ascertain individual outcomes from EHR/claims]\n  Con --> Out\n  Out --> Ana[Analysis accounts for within-cluster correlation:<br/>GEE marginal / GLMM conditional / cluster-level summary]\n  Ana --> Report[Report estimand, ICC, design effect, and small-sample handling]",
        "caption": "Operational flow of an embedded cluster-randomized trial in real-world data. The cohort is enumerated and attributed to clusters before allocation (preventing recruitment/identification bias), clusters are randomized, and the analysis explicitly models within-cluster correlation.",
        "alt_text": "Flowchart from source population through pre-allocation cohort definition, cluster randomization, intervention versus control clusters, EHR/claims outcome ascertainment, clustering-aware analysis, and reporting.",
        "source_type": "illustrative",
        "source_citations": [
          "campbell-walters-2014",
          "puffer-2003"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  ICC[Intracluster correlation rho] --> DEFF[\"Design effect<br/>1 + #40;m_bar &minus; 1#41; &times; rho\"]\n  Msize[Mean cluster size m_bar] --> DEFF\n  CV[Coefficient of variation of cluster size] --> DEFF\n  DEFF --> Ninfl[Required individuals = N_simple * DEFF]\n  Ninfl --> Nclust[Number of clusters = required individuals / m_bar]\n  DEFF --> Eff[Effective sample size = head count / DEFF]",
        "caption": "Why clustering costs power. The intracluster correlation and cluster size combine into the design effect, which inflates the required sample size and shrinks the effective sample size relative to an individually randomized trial.",
        "alt_text": "Diagram showing intracluster correlation, mean cluster size, and coefficient of variation feeding the design effect, which determines required individuals, number of clusters, and effective sample size.",
        "source_type": "illustrative",
        "source_citations": [
          "eldridge-2006"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "alternative_to",
        "target_slug": "pragmatic-trial",
        "notes": "Pragmatic trials are frequently implemented as cluster-randomized designs embedded in routine care; the cluster design is the allocation mechanism, the pragmatic frame the intent."
      },
      {
        "relation_type": "used_with",
        "target_slug": "gee-population-average-models-rwe",
        "notes": "GEE with cluster-robust variance is the standard population-average (marginal) analysis for individual-level CRT data."
      },
      {
        "relation_type": "used_with",
        "target_slug": "mixed-effects-models-longitudinal-rwe",
        "notes": "A random-intercept mixed model gives the cluster-specific (conditional) estimand and yields the ICC directly from the between-cluster variance."
      },
      {
        "relation_type": "used_with",
        "target_slug": "cluster-robust-standard-errors-rwe",
        "notes": "Within-cluster correlation requires cluster-robust (sandwich) variance estimation, with small-sample corrections when clusters are few."
      },
      {
        "relation_type": "used_with",
        "target_slug": "sample-size-power-precision-rwe",
        "notes": "CRT power calculations must inflate the individual sample size by the design effect, which depends on the ICC and cluster-size variability."
      },
      {
        "relation_type": "see_also",
        "target_slug": "baseline-characteristics-and-covariate-balance-rwe",
        "notes": "With few clusters, chance imbalance is common; restricted/covariate-constrained randomization and baseline-only adjustment are standard safeguards."
      },
      {
        "relation_type": "see_also",
        "target_slug": "selection-bias-sensitivity-analysis-rwe",
        "notes": "Identification/recruitment bias from enrolling after allocation is the signature CRT selection bias; pre-allocation cohort enumeration is the structural fix."
      },
      {
        "relation_type": "see_also",
        "target_slug": "target-trial-emulation",
        "notes": "A well-specified CRT protocol is the target trial an observational study of a system-level intervention would seek to emulate."
      }
    ],
    "aliases": [
      "CRT",
      "cluster randomized trial",
      "cluster-randomised trial",
      "group-randomized trial",
      "group randomised trial",
      "c-RCT"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "cluster-robust-standard-errors-rwe",
    "name": "Cluster-Robust Standard Errors",
    "short_definition": "A sandwich variance estimator that produces valid standard errors and confidence intervals when observations are correlated within clusters (the same patient, matched set, clinician, facility, plan, or family), leaving the point estimate unchanged.",
    "long_description": "**Cluster-robust standard errors (CRSE)** replace the model-based variance of a regression coefficient with a\n*sandwich* estimator that is consistent under arbitrary within-cluster correlation. The \"bread\" is the usual\ninverse-information (or inverse-Hessian) matrix; the \"meat\" is built from the sum of cluster-level score\ncontributions rather than individual-observation contributions. The point estimate (hazard ratio, odds ratio,\nrate ratio, risk difference) does **not** change — only its standard error, and therefore its Wald confidence\ninterval and p-value. CRSE is the variance counterpart to the design fact that, in real-world data, rows are not\nindependent: the same `person_id` contributes multiple visits or recurrent events, propensity-score matching\ncreates correlated matched sets, and patients nest within clinicians, facilities, and health plans.\n\n**Core conceptual distinction — what CRSE does and does not estimate.** CRSE targets the *marginal*\n(population-average) variance of a coefficient under a working model. It is agnostic about the source and form of\nthe correlation: you do not specify a within-cluster correlation structure correctly to get valid inference, you\nonly need clusters to be independent of one another and reasonably numerous. This is the key difference from a\n**random-effects / mixed model**, which models the correlation explicitly (a random intercept per cluster) and, in\nnonlinear models (logistic, Poisson), changes the *estimand itself* from a marginal to a conditional\n(subject-specific) effect. A GEE with an independence working correlation plus the robust sandwich is numerically\nidentical to fitting the naive model and applying CRSE; richer working correlations (exchangeable, AR-1) only buy\nefficiency, not validity. 
So the decision is not \"CRSE vs GEE\" — CRSE *is* the variance engine of GEE — but\n\"marginal effect with robust variance\" (CRSE/GEE) vs \"conditional effect with a random effect\" (mixed model). Pick\nthe estimand first; the variance method follows.\n\n**Pros, cons, and trade-offs.**\n- **vs naive (model-based / independence) standard errors:** CRSE corrects the near-universal under-estimation of\n  SEs that occurs when correlated rows are treated as independent, which otherwise yields anticonservative CIs and\n  inflated type-I error (a recurrent-event or repeated-visit analysis run without clustering will routinely report\n  CIs ~30-60% too narrow). Cost: essentially none at the point-estimate level and trivial computationally; the only\n  real risk is using CRSE when you have **too few clusters** (see below). **Always prefer CRSE** over naive SEs\n  whenever the unit of analysis is finer than the unit of independence.\n- **vs random-effects / mixed-effects models:** CRSE makes no distributional assumption about the cluster effects\n  and keeps a clean marginal interpretation, which is usually what a regulator or payer wants (\"the average effect\n  in the population\"). Cost: it discards the efficiency gains a correct random-effects model provides, gives no\n  cluster-level prediction (no BLUPs), and is **downward-biased with few clusters**, whereas a well-specified mixed\n  model can be more efficient and handles small numbers of large clusters better. **Prefer CRSE/GEE** when the\n  target is a marginal effect and the number of clusters is large; **prefer a mixed model** when you want\n  subject-specific effects, cluster-level prediction, or have few clusters with many members.\n- **vs heteroskedasticity-robust (HC0/HC1, \"White\") SEs:** ordinary robust SEs correct for heteroskedasticity but\n  still assume *independent rows*; they are the special case of CRSE where every cluster has size one. 
Using HC\n  instead of CR on clustered data leaves the SEs as wrong as the naive estimator. **Prefer CRSE** whenever any\n  cluster has more than one observation.\n- **vs cluster bootstrap:** the bootstrap (resampling whole clusters with replacement) is a robust alternative that\n  often performs better with moderate cluster counts and for non-smooth statistics. Cost: computation and seed\n  management. **Prefer the cluster bootstrap or a small-sample correction (CR2/CR3) over the standard CRSE** when\n  clusters are few.\n\n**When to use.** Any model fit on data where independence holds at the cluster level but not the row level: (1)\n**repeated measures / longitudinal outcomes** with multiple records per patient; (2) **recurrent-event survival**\n(Andersen-Gill, WLW, PWP), where each patient supplies several at-risk intervals and events; (3) **propensity-score\nmatched cohorts**, where the matched set is the cluster and within-set outcomes are correlated by construction; (4)\n**multilevel real-world data** — patients within clinicians, hospitals, plans, or geographic regions, or siblings\nwithin families; (5) **pooled / stacked person-time** designs (pooled logistic for hazards, case-time-control)\nwhere each patient contributes many person-period rows.\n\n**When NOT to use — and when CRSE is actively misleading.**\n- **Too few clusters.** The standard CRSE is consistent only as the *number of clusters* grows; with fewer than\n  roughly 40-50 clusters it is **downward-biased**, producing CIs that are too narrow and type-I error well above\n  nominal. A multi-site study with 8 hospitals clustered at the hospital level is the classic trap. Use a\n  bias-reduced estimator (CR2/CR3, the Mancl-DeRouen or Kauermann-Carroll corrections) with Satterthwaite degrees\n  of freedom, or a **cluster wild bootstrap**. 
Clustering at a *finer* level with many units (e.g., patient rather\n  than site) does not have this problem.\n- **You actually want a subject-specific effect.** If the question is conditional (\"for a given patient/clinician,\n  what is the effect\"), CRSE on a marginal model answers a different question; fit a mixed model.\n- **The clustering variable is endogenous or post-treatment.** Clustering on something affected by exposure\n  (e.g., the treating specialist chosen *because* of the drug) does not fix bias and can distort variance; clusters\n  must be exogenous, independent groupings.\n- **Singleton-heavy or one-observation-per-cluster data.** If essentially every cluster has size one, CRSE\n  collapses to HC robust SEs and adds nothing — and small numbers of large clusters mixed with many singletons can\n  behave erratically.\n- **CRSE does not cure confounding or model misspecification of the mean.** It fixes the variance, not the point\n  estimate; a biased coefficient with a correct CRSE is still biased, now with a confidently wrong interval.\n\n**Data-source operational depth.**\n- **Claims (FFS or commercial):** The natural cluster is `person_id`, because a single enrollee generates many\n  pharmacy and medical claims and, in recurrent-event or pooled-logistic designs, many analysis rows. Cluster on\n  `person_id`, not on the claim. Higher-level clustering (plan, PBM, provider TIN, geographic region) matters for\n  utilization and cost outcomes that are correlated by formulary, network, and local practice. 
Failure modes:\n  **plan-switching** moves a person across plan clusters mid-follow-up — decide a priori whether the cluster is the\n  person (usual choice) or the plan-spell; **Medicare Advantage-only person-time** lacks fee-for-service claims, so\n  apparent \"independent\" intervals are really missingness, and clustering cannot repair an undercounted denominator;\n  **few large plans** (a database dominated by 5-10 payers) recreates the few-clusters problem if you cluster at the\n  plan level — cluster at the person level or use a small-cluster correction.\n- **EHR:** Encounter-driven capture makes the patient the primary cluster, but **clinic/site** is a strong\n  second-level cluster because workflow, coding, and care patterns are shared. With a handful of contributing sites,\n  site-level CRSE is unreliable (few clusters) — prefer patient-level clustering, a mixed model with a site random\n  effect, or a wild bootstrap. External-care leakage means some within-patient correlation is unobserved; CRSE only\n  accounts for correlation among the rows you actually have.\n- **Registry:** Patients nest within **enrolling centers**; center is a meaningful cluster for adjudicated outcomes\n  and for quality-of-care contrasts. Registries often have moderate center counts (10-30), which sits squarely in\n  the few-clusters danger zone — report a CR2/Satterthwaite or bootstrap variant and state the cluster count.\n- **Linked claims-EHR-vital-records:** Multiple plausible cluster levels coexist (person, site, plan). Pre-specify\n  the clustering level in the SAP; nesting (patient within site within region) usually calls for clustering at the\n  highest level at which independence is credible, or for two-way / multiway clustering when neither dimension\n  nests cleanly (e.g., patients seen across several facilities).\n\n**Worked claims example — two scenarios where the choice changes inference.**\nScenario A (within-person recurrent events). 
Question: rate of recurrent COPD exacerbations on inhaled\ncorticosteroid + LABA vs LABA alone in a commercial + Medicare FFS cohort. Each `person_id` is followed from the\nindex fill and contributes *multiple* exacerbation events and at-risk intervals, so rows within a person are\ncorrelated. Fit an Andersen-Gill Cox model on `(start, stop, event)` person-interval rows with the treatment\ncovariate; the *naive* SE treats every interval as independent and understates the true SE. Cluster on `person_id`\n(`COVS(AGGREGATE)` / `cluster(id)`): the HR is unchanged, the SE widens appropriately, and the 95% CI now reflects\nthat 10,000 rows came from 3,200 people. Sensitivity: compare to a shared-frailty (random-effect) model to confirm\nthe marginal and conditional stories agree.\nScenario B (1:1 PS-matched cohort). Question: 1-year all-cause mortality, drug A vs drug B, after 1:1 propensity-score\nmatching among adults with `>=`365 days of continuous Medicare Parts A/B/D enrollment and a drug-free washout. Matching\ninduces within-pair correlation (matched patients have similar covariate values by construction). Fit the outcome model on\nthe matched cohort and cluster on the **matched-set id**, not the person, so the variance accounts for the matched\ndesign (Austin 2014). Failure mode to avoid: ignoring the matched-set clustering yields anticonservative SEs;\nover-clustering (e.g., on plan when the design unit is the pair) targets the wrong correlation. Report the cluster count\n(number of matched sets) and, if few large clusters appear at any candidate level, switch to CR2/Satterthwaite or a\ncluster wild bootstrap before trusting the interval.\n\n**Interpreting the output**\n\nConsider a stylized two-hospital study with illustrative numbers: 6 patients each in Hospital A (all treated) and\nHospital B (all comparators). The observed risk difference is −0.33. Under a naive independence assumption the SE is\n0.05, yielding a 95% CI of approximately −0.43 to −0.23. Clustering on hospital raises the SE to 0.09, yielding a 95% CI\nof approximately −0.51 to −0.15.\n\n*(1) Formal statistical interpretation.* The point estimate of −0.33 is identical under both SE approaches:\nclustering corrects the uncertainty, not the central estimate. The naive SE of 0.05 underestimates true\nvariability because it treats the 12 patients as 12 independent draws; within-hospital patients share\nunmeasured environmental and practice factors, so the effective information is closer to two hospitals than\ntwelve patients. The cluster-robust sandwich estimator targets the between-cluster variance, producing a\nwider CI calibrated to the actual clustering structure. With only two clusters, however, the sandwich\nestimator itself is unreliable, and because treatment is assigned entirely by hospital, the treatment effect\ncannot be separated from hospital-level differences; no variance correction repairs that. In designs with more\n(but still few) clusters, CR2/Satterthwaite or a cluster wild bootstrap is preferred.\n\n*(2) Practical interpretation for a decision-maker.* The risk difference is the same (−0.33), but the\nhonest uncertainty range is wider: −0.51 to −0.15 rather than −0.43 to −0.23. In a two-hospital study, a\nresult that looks statistically significant under naive SEs may not survive correct variance estimation.\nReport the cluster count alongside the cluster-robust interval so readers can judge whether the precision\nclaim is credible.",
    "primary_category": "Inferential_Statistics",
    "tags": [
      "cluster-robust-standard-errors",
      "sandwich-variance",
      "gee",
      "repeated-measures",
      "recurrent-events",
      "propensity-score-matching",
      "few-clusters",
      "variance-estimation"
    ],
    "applies_to_study_types": [
      "claims_analysis",
      "ehr_study",
      "registry_linkage",
      "comparative_effectiveness",
      "target_trial_emulation"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1093/biomet/73.1.13",
        "url": "https://doi.org/10.1093/biomet/73.1.13",
        "citation_text": "Liang KY, Zeger SL. Longitudinal data analysis using generalized linear models. Biometrika. 1986;73(1):13-22.",
        "year": 1986,
        "authors_short": "Liang & Zeger",
        "notes": "Introduces GEE and generalizes the sandwich variance to clustered/correlated data; the robust (\"empirical\") estimator here is the canonical cluster-robust standard error used throughout RWE."
      },
      {
        "role": "explain",
        "doi": "10.2307/1912934",
        "url": "https://doi.org/10.2307/1912934",
        "citation_text": "White H. A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica. 1980;48(4):817-838.",
        "year": 1980,
        "authors_short": "White",
        "notes": "Origin of the sandwich (heteroskedasticity-robust) covariance estimator; cluster-robust SEs are the clustered generalization, reducing to White's HC estimator when every cluster has size one."
      },
      {
        "role": "explain",
        "doi": "10.1111/j.0006-341X.2001.00126.x",
        "url": "https://doi.org/10.1111/j.0006-341X.2001.00126.x",
        "citation_text": "Mancl LA, DeRouen TA. A covariance estimator for GEE with improved small-sample properties. Biometrics. 2001;57(1):126-134.",
        "year": 2001,
        "authors_short": "Mancl & DeRouen",
        "notes": "Documents the downward bias of the standard sandwich with few clusters and gives a bias-corrected estimator; the basis for CR2/CR3-type corrections needed in multi-site RWE with small cluster counts."
      },
      {
        "role": "demonstrate",
        "doi": "10.1093/rfs/hhn053",
        "url": "https://doi.org/10.1093/rfs/hhn053",
        "citation_text": "Petersen MA. Estimating standard errors in finance panel data sets: comparing approaches. Review of Financial Studies. 2009;22(1):435-480.",
        "year": 2009,
        "authors_short": "Petersen",
        "notes": "Side-by-side comparison of clustered, robust, and naive variance estimators in panel data, showing how badly naive SEs fail under correlation; directly transferable to clustered patient-level RWE panels."
      },
      {
        "role": "use",
        "doi": "10.1002/sim.5984",
        "url": "https://doi.org/10.1002/sim.5984",
        "citation_text": "Austin PC. The use of propensity score methods with survival or time-to-event outcomes: reporting measures of effect similar to those used in randomized experiments. Statistics in Medicine. 2014;33(7):1242-1258.",
        "year": 2014,
        "authors_short": "Austin",
        "notes": "Applies robust (matched-set-clustered) variance estimation to propensity-score-matched survival analyses in biomedical data; the standard reference for clustering on the matched set in pharmacoepidemiology."
      }
    ],
    "plain_language_summary": "When you analyze data where multiple rows come from the same source (say, five hospital visits from the same patient, or ten patients from the same clinic), those rows are not truly independent of one another. Standard statistical software assumes every row is independent and, as a result, typically calculates standard errors that are too small, producing confidence intervals that look more precise than the data actually support. Cluster-robust standard errors fix this by grouping all rows from the same source together and recalculating the uncertainty around your estimate to reflect how much truly independent information you have. The result is usually wider, more honest confidence intervals: the point estimate (for example, a risk difference or odds ratio) does not change, only the uncertainty around it.",
    "key_terms": [
      {
        "term": "standard error",
        "definition": "A number that describes how uncertain a statistical estimate is — smaller means more precise, larger means more uncertain."
      },
      {
        "term": "confidence interval",
        "definition": "A range of plausible values for an estimate; a 95% confidence interval is expected to contain the true value in 95 out of 100 correctly run studies."
      },
      {
        "term": "cluster",
        "definition": "A natural grouping in the data whose members share something in common — for example, all patients treated at the same hospital, or all visits from the same person."
      },
      {
        "term": "independence assumption",
        "definition": "The statistical requirement that knowing the outcome for one row tells you nothing about the outcome for another row; violated whenever rows share a cluster."
      },
      {
        "term": "sandwich estimator",
        "definition": "The mathematical formula behind cluster-robust standard errors; it adjusts the variance calculation to account for within-cluster correlation without changing the main effect estimate."
      }
    ],
    "worked_example": {
      "scenario": "A researcher is studying whether patients admitted to Hospital A have a lower 30-day readmission rate than patients admitted to Hospital B. The dataset has 12 patients — 6 from each hospital. Because patients at the same hospital share the same doctors, protocols, and discharge practices, their outcomes are correlated. Fitting a regression that ignores this clustering produces a standard error that is too small and a confidence interval that is falsely narrow.",
      "dataset": {
        "caption": "One row per patient. The cluster variable is hospital_id, which groups patients who share a care environment.",
        "columns": [
          "person_id",
          "hospital_id",
          "readmitted_30d",
          "treated"
        ],
        "rows": [
          [
            101,
            "H-A",
            0,
            1
          ],
          [
            102,
            "H-A",
            0,
            1
          ],
          [
            103,
            "H-A",
            1,
            1
          ],
          [
            104,
            "H-A",
            0,
            1
          ],
          [
            105,
            "H-A",
            0,
            1
          ],
          [
            106,
            "H-A",
            1,
            1
          ],
          [
            201,
            "H-B",
            1,
            0
          ],
          [
            202,
            "H-B",
            1,
            0
          ],
          [
            203,
            "H-B",
            0,
            0
          ],
          [
            204,
            "H-B",
            1,
            0
          ],
          [
            205,
            "H-B",
            1,
            0
          ],
          [
            206,
            "H-B",
            0,
            0
          ]
        ]
      },
      "steps": [
        "Run the regression treating all 12 rows as independent. The model estimates a risk difference of -0.33 (Hospital A has 33 percentage points lower readmission than Hospital B).",
        "The naive standard error for that estimate comes out to 0.05, giving a 95% confidence interval of roughly -0.43 to -0.23 — it looks very precise.",
        "But patients within the same hospital are not independent: they share discharge nurses, post-discharge call protocols, and local referral networks. The variance calculation should therefore treat the hospital, not the patient, as the independent unit: in the extreme case of perfectly correlated outcomes, the 6 Hospital A rows would carry only 1 independent unit of information about Hospital A, and likewise for Hospital B.",
        "Apply cluster-robust standard errors, grouping by hospital_id. The software now sums the score contributions within each hospital before computing the variance, reflecting that we have 2 independent clusters, not 12 independent rows.",
        "The cluster-robust standard error is 0.09 — nearly twice as large as the naive version — and the 95% confidence interval widens to roughly -0.51 to -0.15.",
        "The point estimate (-0.33) is identical. Only the uncertainty changed: the naive interval was falsely narrow because it double-counted correlated observations as if they were independent."
      ],
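      "result": "Naive SE: 0.05, 95% CI approximately -0.43 to -0.23. Cluster-robust SE: 0.09, 95% CI approximately -0.51 to -0.15. These standard errors are stylized to show the direction of the correction rather than computed from the 12 rows shown. The lesson: ignoring clustering made the result look nearly twice as precise as it really was. With only 2 hospitals the cluster-robust SE is itself unreliable, because the sandwich estimator is biased downward with very few clusters; a real analysis would need many more hospitals or a small-sample correction."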
    },
    "prerequisites": [
      "logistic-regression-for-binary-outcomes",
      "gee-population-average-models-rwe",
      "cox-ph-regression"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Patient-level clustering (within-person correlation)",
        "description": "Cluster on person_id when each patient contributes multiple rows -- repeated visits, recurrent-event intervals, or pooled person-period records. The most common and safest choice in claims and EHR because patient counts are large.",
        "edge_cases": [
          "Plan- or site-switching mid-follow-up keeps the person as one cluster but mixes higher-level effects; decide a priori whether the cluster is the person or the person-plan spell.",
          "Singletons (patients with a single row) are handled automatically but contribute no within-cluster information."
        ],
        "data_source_notes": "claims: cluster on person_id, never on the individual claim; recurrent-event Cox uses COVS(AGGREGATE) with an ID statement in SAS PROC PHREG or cluster(id) in R coxph. EHR: patient is the primary cluster even when site is the scientific interest."
      },
      {
        "name": "Matched-set clustering (propensity-score matched cohorts)",
        "description": "After 1:1 or 1:k PS matching, cluster on the matched-set id so the variance reflects the induced within-pair correlation rather than treating matched patients as independent (Austin 2014).",
        "edge_cases": [
          "Clustering on person instead of matched set understates the matched-design correlation; clustering on plan/site targets the wrong level entirely.",
          "With caliper matching, the number of matched sets (the cluster count) can shrink substantially -- report it."
        ],
        "data_source_notes": "claims/EHR: carry the matchid from the matching step into the outcome model; cluster the variance on matchid."
      },
      {
        "name": "Higher-level / multiway clustering (clinician, facility, plan, region)",
        "description": "Cluster at the level shared by patients (clinician, hospital, payer, geography) for outcomes correlated by practice pattern, formulary, or network; use two-way/multiway clustering when units cross-classify rather than nest.",
        "edge_cases": [
          "Few large clusters (e.g., 8 hospitals, 5 payers) trigger downward bias -- switch to CR2/Satterthwaite or a cluster wild bootstrap, or model the level as a random effect instead.",
          "Cross-classified data (patients seen across multiple facilities) violate clean nesting; consider multiway clustering."
        ],
        "data_source_notes": "claims: provider TIN, plan, or rating area; report the cluster count. registry: enrolling center counts are often 10-30 -- in the few-clusters danger zone."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Naive (model-based / independence) standard errors",
        "pros_of_this": "Valid inference under arbitrary within-cluster correlation; corrects the systematic SE under-estimation that produces anticonservative CIs and inflated type-I error when correlated rows are treated as independent.",
        "cons_of_this": "Downward-biased with few clusters; provides no efficiency gain and no cluster-level prediction.",
        "when_to_prefer": "Always, whenever the unit of analysis is finer than the unit of independence and clusters are reasonably numerous."
      },
      {
        "compared_to": "Random-effects / mixed-effects models",
        "pros_of_this": "No distributional assumption on cluster effects; preserves a marginal (population-average) interpretation favored by regulators and payers; robust to misspecified correlation structure.",
        "cons_of_this": "Less efficient than a correctly specified random-effects model; no subject-specific effects or BLUPs; poor behavior with few clusters.",
        "when_to_prefer": "When the estimand is a marginal effect and the number of clusters is large."
      },
      {
        "compared_to": "Heteroskedasticity-robust (HC0/HC1, White) standard errors",
        "pros_of_this": "Accounts for within-cluster correlation, not just heteroskedasticity; HC robust SEs are the special case of CRSE with one observation per cluster.",
        "cons_of_this": "Requires identifying the correct clustering level; with truly independent rows it adds nothing.",
        "when_to_prefer": "Whenever any cluster contains more than one observation."
      },
      {
        "compared_to": "Cluster bootstrap / small-sample (CR2/CR3) corrections",
        "pros_of_this": "Standard CRSE is simpler and faster and needs no resampling.",
        "cons_of_this": "Standard CRSE is anticonservative with few clusters, where the bootstrap or CR2/CR3 with Satterthwaite degrees of freedom perform markedly better.",
        "when_to_prefer": "Standard CRSE when clusters are numerous; switch to bootstrap or CR2/CR3 when clusters are few (roughly <40-50)."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Cluster on person_id for within-person correlation (repeated claims, recurrent-event intervals, pooled person-time); use a higher level (plan, provider TIN, region) only when that level is the source of correlation and has many units. Plan-switching and MA-only person-time complicate cluster definition and denominators -- fix those before clustering. Recurrent-event Cox uses COVS(AGGREGATE) with an ID statement in SAS PROC PHREG or cluster(id) in R coxph.",
      "ehr": "Patient is the primary cluster; site/clinic is a meaningful second level but usually has too few units for reliable site-level CRSE -- prefer patient-level clustering, a site random effect, or a cluster wild bootstrap. Unobserved external-care correlation is not captured.",
      "registry": "Cluster on enrolling center for center-correlated outcomes, but center counts (often 10-30) sit in the few-clusters danger zone -- report a CR2/Satterthwaite or bootstrap variant and state the cluster count.",
      "linked": "Multiple cluster levels coexist (person, site, plan); pre-specify the level in the SAP, cluster at the highest level where independence is credible, and use two-way/multiway clustering when dimensions cross-classify rather than nest."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\nimport statsmodels.formula.api as smf\nfrom lifelines import CoxPHFitter\n\n# --- Repeated-measures / pooled logistic: cluster on person_id ---\n# Naive SEs assume independent person-period rows and are too narrow.\nmodel = smf.logit(\"y ~ treat + age + sex + comorbidity_score\", data=df)\nnaive = model.fit(disp=0)                                  # model-based (anticonservative) SEs\nrobust = model.fit(disp=0, cov_type=\"cluster\",            # cluster sandwich\n                   cov_kwds={\"groups\": df[\"person_id\"]})\n# Coefficients are identical; compare the standard errors:\nprint(naive.bse[\"treat\"], robust.bse[\"treat\"])\nprint(robust.summary())                                    # report the robust CI/p-value\n\n# --- Recurrent-event (Andersen-Gill) Cox: cluster on person_id ---\n# surv_df is in counting-process (start, stop, event) form, multiple rows per person.\ncph = CoxPHFitter()\ncph.fit(surv_df, duration_col=\"stop\", entry_col=\"start\",\n        event_col=\"event\", cluster_col=\"person_id\",        # robust SE for within-person correlation\n        formula=\"treat + age + sex + comorbidity_score\")\ncph.print_summary()                                         # se(coef) is the robust (sandwich) SE when cluster_col is set\n\n# --- PS-matched cohort: cluster on the matched-set id, not person ---\nmatched_fit = smf.logit(\"y ~ treat\", data=matched).fit(\n    disp=0, cov_type=\"cluster\", cov_kwds={\"groups\": matched[\"matchid\"]})",
        "description": "Cluster-robust inference for repeated-measures (pooled logistic) and recurrent-event survival in claims-style\ndata. Required input (one row per analysis unit, already cleaned):\n  df : person_id (cluster), y (0/1 outcome for the period), treat (0/1), plus baseline covariates\n  surv_df (recurrent events, counting process): person_id, start, stop, event (0/1), treat, covariates\nFor the logistic model, cov_type='cluster' swaps the naive variance for the cluster sandwich (point estimate\nunchanged). For recurrent events, lifelines' CoxPHFitter with cluster_col gives the robust (Lin-Wei-Yang-Ying)\nclustered variance. For a PS-matched analysis (a data frame named matched with a matchid column), cluster on\nthe matched-set id instead of person_id.",
        "dependencies": [
          "statsmodels",
          "lifelines",
          "pandas"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(sandwich); library(lmtest); library(geepack); library(survival)\n\n# --- GLM + cluster-robust (CR) variance on person_id ---\nfit <- glm(y ~ treat + age + sex + comorbidity_score, family = binomial, data = df)\ncoeftest(fit, vcov = vcovCL(fit, cluster = ~ person_id))            # CR sandwich\n# Few clusters: bias-reduced CR2 variant\ncoeftest(fit, vcov = vcovCL(fit, cluster = ~ person_id, type = \"HC2\"))\n\n# --- Equivalent population-average GEE (independence working corr -> same coefficients;\n#     sandwich SE matches vcovCL up to its default finite-sample adjustment) ---\ndf <- df[order(df$person_id), ]\ngee <- geeglm(y ~ treat + age + sex + comorbidity_score, family = binomial,\n              id = person_id, corstr = \"independence\", data = df)\nsummary(gee)                                                        # 'Std.err' is the robust (sandwich) SE\n\n# --- Recurrent-event Cox (Andersen-Gill) with robust clustered variance ---\nag <- coxph(Surv(start, stop, event) ~ treat + age + sex + comorbidity_score +\n              cluster(person_id), data = surv_df)                   # robust SE for within-person events\nsummary(ag)                                                         # 'robust se' column\n\n# --- PS-matched cohort: cluster on matchid ---\nmfit <- glm(y ~ treat, family = binomial, data = matched)\ncoeftest(mfit, vcov = vcovCL(mfit, cluster = ~ matchid))",
        "description": "Cluster-robust inference in R for (1) a GLM via the sandwich package, (2) a population-average GEE, and (3)\nrecurrent-event Cox. Required input:\n  df : person_id (cluster), y, treat, baseline covariates (one row per analysis unit)\n  surv_df : person_id, start, stop, event, treat, covariates (counting-process form)\nsandwich::vcovCL with lmtest::coeftest is the general-purpose tool; geepack::geeglm with corstr='independence'\nreturns the same point estimates and, up to vcovCL's default finite-sample adjustment, the same robust variance\nvia GEE. For few clusters, set type='HC2' (CR2) and use a small-sample df.",
        "dependencies": [
          "sandwich",
          "lmtest",
          "geepack",
          "survival"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* --- GEE/GENMOD: empirical (cluster-robust) SE under an independence working correlation --- */\nproc sort data=work.df; by person_id; run;\nproc genmod data=work.df descending;\n  class person_id;\n  model y = treat age sex comorbidity_score / dist=binomial link=logit;\n  repeated subject=person_id / type=ind;     /* type=ind -> robust SE = cluster sandwich, point est unchanged */\nrun;                                          /* read the 'Empirical Standard Error Estimates' table */\n\n/* --- Recurrent-event Cox (Andersen-Gill) with robust clustered variance --- */\nproc phreg data=work.surv covsandwich(aggregate);\n  model (start, stop)*event(0) = treat age sex comorbidity_score;\n  id person_id;                              /* aggregate score residuals within person -> robust SE */\nrun;                                          /* compare 'StdErrRatio' = robust/model-based SE */\n\n/* --- Clustered binary outcome via survey procedures (alternative) --- */\nproc surveylogistic data=work.df;\n  cluster person_id;                          /* design-based cluster-robust variance */\n  model y(event='1') = treat age sex comorbidity_score;\nrun;\n\n/* --- PS-matched cohort: cluster on the matched-set id --- */\nproc surveylogistic data=work.matched;\n  cluster matchid;                            /* matched-set is the cluster (Austin 2014) */\n  model y(event='1') = treat;\nrun;",
        "description": "Cluster-robust variance in SAS via the empirical (sandwich) estimator. Required input datasets (post\ndata-management):\n  work.df   : person_id, y (0/1), treat (0/1), baseline covariates -- one row per analysis unit\n  work.surv : person_id, start, stop, event (0/1), treat, covariates -- counting-process form\nPROC GENMOD with REPEATED SUBJECT=id / TYPE=IND fits the model under independence and reports the empirical\n(cluster-robust) standard errors -- the GEE route to CRSE that does not model the within-cluster correlation.\nPROC PHREG with COVSANDWICH(AGGREGATE) and ID gives the robust variance for recurrent events. PROC SURVEYLOGISTIC\nwith a CLUSTER statement is an alternative for clustered binary outcomes.",
        "dependencies": [],
        "source_citations": [
          "austin-2014"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Data[Analysis table<br/>rows finer than the unit of independence] --> Q{Where does independence hold?}\n  Q -->|Same patient, many rows<br/>visits / recurrent events / pooled person-time| Pat[Cluster on person_id]\n  Q -->|PS-matched cohort| Match[Cluster on matched-set id]\n  Q -->|Patients within<br/>clinician / site / plan / region| High[Cluster on the higher level]\n  Pat --> N{Number of clusters}\n  Match --> N\n  High --> N\n  N -->|Many >= 40-50| Std[Standard cluster sandwich<br/>point estimate unchanged, SE widened]\n  N -->|Few < 40-50| Small[CR2/CR3 + Satterthwaite df<br/>or cluster wild bootstrap]\n  Std --> Report[Report robust CI / p-value<br/>+ cluster count + cluster level]\n  Small --> Report",
        "caption": "Decision logic for cluster-robust standard errors. Choose the level at which observations are independent, then choose the variance estimator based on the number of clusters; the point estimate never changes.",
        "alt_text": "Flowchart deciding the clustering level (patient, matched set, or higher level) and then selecting the standard cluster sandwich for many clusters versus a small-sample correction or cluster wild bootstrap for few clusters.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  subgraph Naive[Naive variance]\n    I1[Row 1] --> M1[Sum over independent rows]\n    I2[Row 2] --> M1\n    I3[Row 3] --> M1\n  end\n  subgraph Cluster[Cluster-robust variance]\n    C1[Patient A rows<br/>score sum] --> M2[Sum over independent CLUSTERS]\n    C2[Patient B rows<br/>score sum] --> M2\n  end\n  M1 --> B1[Bread x Meat x Bread<br/>too small -> CI too narrow]\n  M2 --> B2[Bread x Meat x Bread<br/>valid -> CI correct]",
        "caption": "The sandwich estimator changes only the 'meat'. Naive variance sums independent-row score contributions; the cluster-robust meat sums score contributions aggregated within each independent cluster, widening the SE to reflect the true amount of information.",
        "alt_text": "Diagram contrasting the naive variance, which sums over independent rows and underestimates the standard error, with the cluster-robust variance, which sums score contributions aggregated within independent clusters to give a valid standard error.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "used_with",
        "target_slug": "gee-population-average-models-rwe",
        "notes": "GEE's robust (\"empirical\") variance is the cluster-robust standard error; a GEE with an independence working correlation plus the sandwich gives the same point estimates as the naive model, and its robust SE matches CRSE up to any finite-sample adjustment the software applies."
      },
      {
        "relation_type": "used_with",
        "target_slug": "recurrent-events-analysis-rwe",
        "notes": "Andersen-Gill, WLW, and PWP recurrent-event models require clustering on the patient (COVS(AGGREGATE) with an ID statement in SAS PROC PHREG; cluster(id) in R coxph) because each person contributes multiple correlated at-risk intervals and events."
      },
      {
        "relation_type": "used_with",
        "target_slug": "propensity-score-methods-psm-iptw",
        "notes": "Propensity-score matching induces within-pair correlation; the outcome-model variance should cluster on the matched-set id (Austin 2014) rather than treat matched patients as independent."
      },
      {
        "relation_type": "used_with",
        "target_slug": "longitudinal-outcomes-modeling-rwe",
        "notes": "Repeated-measures and pooled person-time models cluster on person_id so the variance reflects multiple correlated records per patient."
      },
      {
        "relation_type": "alternative_to",
        "target_slug": "mixed-effects-models-longitudinal-rwe",
        "notes": "Mixed models estimate conditional (subject-specific) effects by modeling cluster correlation explicitly; CRSE/GEE estimate marginal effects with robust variance and no distributional assumption -- choose by estimand."
      },
      {
        "relation_type": "see_also",
        "target_slug": "cox-ph-regression",
        "notes": "A standard Cox model gains valid inference under within-cluster correlation by adding cluster(id) / the robust sandwich variance, without changing the hazard-ratio estimate."
      }
    ],
    "aliases": [
      "cluster robust standard errors",
      "clustered standard errors",
      "robust sandwich variance",
      "empirical standard errors",
      "Huber-White cluster-robust standard errors"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "ema"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "cms-1500-professional-claim-fields",
    "name": "CMS-1500 / 837P Professional Claim Fields",
    "short_definition": "The paper form (CMS-1500, maintained by NUCC) and its electronic equivalent (837P transaction) that physicians and other non-institutional suppliers use to bill Medicare, Medicaid, and commercial payers; each claim carries up to 12 header-level diagnosis codes and one or more service lines, each with a diagnosis pointer that links the procedure to the applicable subset of header diagnoses — the foundational structure that determines what diagnosis is \"attached to\" a procedure in any professional claims analysis.",
    "long_description": "**The CMS-1500 / 837P professional claim** is the billing instrument used by physicians, nurse\npractitioners, physician assistants, therapists, laboratories, and most other non-institutional\nhealthcare suppliers (collectively, \"Part B suppliers\" in Medicare terminology). It is distinct\nfrom the UB-04 / 837I institutional claim, which hospitals use for inpatient admissions and\noutpatient facility services. Understanding the structure of the professional claim — particularly\nthe relationship between header-level diagnoses and the line-level diagnosis pointer — is\nfoundational to any study built on outpatient or professional claims data, because it determines\nwhat diagnostic information can be attributed to a specific procedure or encounter.\n\n**Paper form and electronic transaction.** The CMS-1500 is the paper form maintained by the\nNational Uniform Claim Committee (NUCC). Its electronic equivalent is the HIPAA-mandated 837P\n(Professional) ASC X12 transaction, which all payers must accept for electronic claims. Research\ndatabases derived from Medicare carrier files, commercial professional claims, or Medicaid\nprofessional claims are all downstream representations of 837P data. The field names and box\nnumbers in the CMS-1500 form map directly to 837P loop/segment identifiers, and most research\ndata dictionaries use the CMS-1500 box numbers as the conceptual reference.\n\n**Claim anatomy: header and service lines.** A professional claim has two structural layers:\n\n*Header (boxes 1–23 and 25–33):* Contains patient and insurance information, the provider\nidentifiers (billing provider NPI in box 33; referring/ordering provider NPI in box 17), the\ndate of the encounter if it spans multiple lines, and — most critically for diagnosis\nascertainment — **Item 21: the diagnosis code list** (up to 12 ICD-10-CM codes, labeled\nA through L). 
These 12 codes are a claim-level list; there is no \"principal diagnosis\" concept\non professional claims. The first-listed code (position A) is colloquially treated as the\n\"primary\" reason for the visit, but that designation is a billing convention and not a\nclinically adjudicated principal diagnosis. This matters substantially for phenotype algorithms:\nthe ambiguity around which of the 12 header codes is the \"main\" reason for the visit is a\nwell-known limitation of professional claims that motivates cautious dx-ascertainment strategies.\n\n*Service lines (box 24, columns A–J):* Each claim can have multiple service lines — typically\none per CPT/HCPCS procedure billed on that claim. The key fields in box 24 are:\n- **24A:** Service date(s) — from/through dates for the procedure; professional claims typically\n  have identical from/through dates for most outpatient services.\n- **24B:** Place of service (POS) — a two-digit CMS code indicating the physical setting where\n  the service was rendered (e.g., 11 = Office, 21 = Inpatient Hospital, 22 = On Campus\n  Outpatient Hospital, 23 = Emergency Room, 31 = Skilled Nursing Facility). POS is the\n  primary setting variable on professional claims and is how analysts distinguish an office visit\n  from a hospital-based outpatient visit for the same CPT code.\n- **24D:** Procedure code (CPT or HCPCS Level II) plus up to four two-character modifiers that\n  refine the procedure (e.g., modifier -25 for a significant, separately identifiable E/M\n  service; modifier -26 for professional component of a diagnostic service).\n- **24E (Diagnosis Pointer):** One to four letters (A–L) referencing the header diagnoses that\n  apply to this specific service line. This is the ONLY line-level diagnosis linkage in the\n  professional claim. 
To know what diagnosis a CPT code was billed for, an analyst must look up\n  which of the up to 12 header codes the pointer references.\n- **24G:** Units — the quantity of services rendered (e.g., number of units of a drug\n  administered, number of minutes for time-based services).\n- **24J:** Rendering provider NPI — the individual clinician who actually performed the service.\n  This is distinct from the billing provider NPI in box 33, which is often the practice group\n  or facility billing entity. In individual provider attribution studies, rendering NPI is the\n  correct identifier.\n\n**The diagnosis pointer — the central construct for RWE.** The diagnosis pointer in 24E is\narguably the most important and most misunderstood field for researchers. It links the procedure\nto 1–4 of the 12 header diagnosis codes. Without resolving the pointer, an analyst cannot\ndetermine which diagnosis was \"the reason for\" a given procedure. Example: a claim with 6\nheader diagnoses including diabetes (E11.9), hypertension (I10), and chronic kidney disease\n(N18.3) might have three service lines — an E/M visit (CPT 99213) pointing to all three, a\nnephrology consultation (CPT 99244) pointing to only N18.3, and a hemoglobin A1c test (CPT\n83036) pointing only to E11.9. Without the pointer, attributing the lab to diabetes requires\ninference; with the pointer, it is explicit.\n\n**Critical limitation in research databases: dropped pointers.** Many commercial research\ndatabases and some Medicare data products do not carry the diagnosis pointer at the line level\nin their standard extracts. When pointers are absent, analysts face a fundamental ambiguity:\nwhich of the up to 12 header diagnoses should be attributed to a given CPT line? 
The\nmost common response is to use all header diagnoses at the claim level,\ntreating every diagnosis on the claim as relevant to every procedure, which inflates the\napparent diagnostic burden and can produce false-positive phenotype hits. Because pointers are\nso often unavailable, most professional claims phenotype algorithms default to claim-level (not\nline-level) diagnosis ascertainment, even when pointer data are theoretically available.\nAnalysts should document whether pointer data are available and used in their database, as it\nis a material methodological distinction.\n\n**Provider identifiers and the attribution trilemma.** A professional claim carries at minimum\nthree provider NPI fields: rendering (box 24J), billing/group (box 33), and referring (box 17).\nFor provider-level RWE studies (evaluating practice variation, quality of care, or\nspecialty-specific outcomes), the choice of NPI drives the entire attribution: rendering NPI\ngives the individual clinician who delivered the service; billing NPI gives the practice or\ngroup; referring NPI traces who ordered the visit. Using billing NPI in place of rendering NPI\nconflates individual variation with group-level variation. In large multispecialty practices or\nhospital-based outpatient settings, the billing entity's NPI may represent hundreds of providers,\nmaking billing-NPI attribution meaningless for individual-level studies. Confusing rendering\nand billing NPI is the most common provider-attribution error in professional claims\nresearch.\n\n**Charges, allowed amounts, and adjudication fields.** The claim carries the submitted charge\n(box 24F per line and box 28 for total), but the research-relevant costs are the payer-allowed\namount and the patient-paid amount, which appear in adjudication fields in the data extract\n(not on the CMS-1500 form itself: the form is the provider's submission to the payer, and\nadjudicated amounts come back separately on the remittance advice, or ERA). 
In Medicare FFS carrier data, the allowed amount reflects the Medicare fee schedule amount;\nin commercial claims, it reflects the contracted rate. The difference between submitted charge\nand allowed amount (the \"contractual adjustment\") is invisible to the analyst in many commercial\ndatabases and should not be confused with a denial.\n\n**Other notable fields.** Box 23 carries the prior authorization number when one is required;\nfor laboratory services the same box carries the CLIA number, which identifies the performing\nlab for diagnostic test claims. Box 20 (\"outside lab\") indicates whether laboratory work was\nreferred out, which matters for distinguishing in-house lab from reference-lab claims in\nutilization studies.\n\n**Institutional versus professional claim: why the distinction matters for RWE.** A patient\nhospitalized for a procedure will generate both an institutional claim (UB-04/837I from the\nhospital) and one or more professional claims (CMS-1500/837P from the attending physician,\nsurgeon, anesthesiologist, radiologist, etc.). Counting both would double-count the visit. Most\ncohort-construction logic uses the institutional claim to identify the hospitalization and uses\nprofessional claims to identify outpatient events; merging them without de-duplication produces\ninflated counts. The structural difference (UB-04 has a principal diagnosis and a revenue code\nper line; CMS-1500 has a diagnosis list plus a pointer per line) means that the same ICD-10\ncode can appear on both claim types with different positional meaning.\n\n**RWE significance: professional claims carry most outpatient diagnosis volume.** In US\nadministrative claims databases, the large majority of outpatient diagnoses are recorded on\nprofessional claims, not institutional claims. Common phenotype algorithms follow the \"2 OP codes\"\nconvention (at least 2 outpatient professional claims with the relevant ICD code on different\ndates), relying entirely on the claim-level header diagnoses. 
The accuracy of any outpatient\nphenotype, the correct count of E/M visits, the site-of-care determination (from POS), and the\nattribution of care to a specific provider (via rendering NPI) all depend on correct\ninterpretation of the CMS-1500 / 837P structure described here.\n\n**Pros, cons, and trade-offs.**\n- **vs. UB-04 / 837I institutional claims:** Professional claims cover more encounter types\n  (every outpatient visit, procedure, lab, imaging, and Part B drug administration) and are\n  therefore the dominant source of outpatient diagnosis and utilization data. Institutional\n  claims have a true principal diagnosis (adjudicated by the facility under the Uniform Billing\n  guidelines) and revenue codes that give service-type granularity, but they cover only facility\n  events. For any outpatient-dominant phenotype (e.g., \"2 OP diagnoses ≥30 days apart\"), the\n  professional claim is the operative data source. **Prefer professional claims** for outpatient\n  encounter and diagnosis ascertainment; **prefer institutional claims** for inpatient event\n  identification and principal-diagnosis-based phenotypes.\n- **vs. EHR encounter data:** The professional claim's POS and CPT codes provide setting and\n  procedure with high consistency across providers because billing accuracy is financially\n  incentivized. EHR data has richer clinical detail (labs, vitals, problem list) but is\n  visit-driven, provider-specific, and lacks a standardized procedure taxonomy across systems.\n  **Prefer professional claims** for procedure/utilization counting across a defined population;\n  **prefer EHR** for severity, lab-based phenotypes, or when clinical detail is required.\n- **Diagnosis pointer availability:** When pointer data are present in the research extract,\n  line-level diagnosis attribution is more precise and reduces false-positive procedure-diagnosis\n  pairings. 
When pointer data are absent (common in commercial databases), claim-level\n  attribution is the only option and introduces diagnosis–procedure misattribution that can\n  inflate comorbidity burden and confound phenotype specificity. This is a database-specific\n  limitation that analysts must document.\n- **Rendering vs. billing NPI trade-off:** Rendering NPI gives individual-level attribution but\n  may be missing or filled with group NPIs in some older claims or in certain billing practices.\n  Billing NPI is more consistently populated but conflates individual and group. For\n  provider-level studies, validate rendering NPI completeness in the target database before\n  committing to an attribution strategy.\n\n**When to use.** Use professional claim fields as the primary data source when:\n(1) Identifying outpatient encounters, ambulatory E/M visits, or outpatient procedures;\n(2) Ascertaining diagnoses using \"2 OP\" or \"1 OP\" conventions for a phenotype algorithm;\n(3) Determining site of care (POS code) for outpatient services;\n(4) Counting procedure-level units (e.g., number of infusions, number of imaging studies);\n(5) Attributing care to an individual rendering provider for physician-level analyses;\n(6) Identifying Part B drug administrations via HCPCS J-codes on professional claims;\n(7) Any study where the dominant utilization is ambulatory rather than inpatient.\n\n**When NOT to use — and when professional claims are actively misleading or dangerous.**\n- **As the sole source for inpatient events.** A hospitalization generates professional claims\n  from multiple providers, but the encounter itself is defined by the institutional claim. 
Using\n  professional claims to count inpatient stays risks counting multiple clinicians' claims as\n  separate admissions, severely inflating inpatient event rates.\n- **When line-level diagnosis attribution is required but pointer data are absent.** If the\n  research question requires knowing exactly which diagnosis prompted a specific procedure (e.g.,\n  an imaging study for cancer surveillance versus pain), and the database lacks diagnosis\n  pointers, the answer is not obtainable from professional claims alone without probabilistic\n  inference. Presenting claim-level attribution as line-level attribution in this setting\n  produces misclassified procedure-diagnosis pairs.\n- **When the \"primary reason for visit\" must be adjudicated.** Position A in Item 21 is a\n  billing decision by the coder, not a clinical adjudication. For studies where the principal\n  reason for an outpatient visit is outcome-relevant (e.g., characterizing visits specifically\n  for a condition vs. incidental coding), using first-listed outpatient diagnosis as a surrogate\n  for \"primary diagnosis\" imports substantial measurement error. This contrasts with inpatient\n  principal diagnosis, which is subject to UHDDS guidelines and payer audit.\n- **When rendering NPI is consistently missing or defaulted to group NPI.** In some databases\n  or billing configurations, individual rendering NPIs are not reported or are filled with the\n  group NPI, making individual provider attribution impossible. Proceeding with a provider-level\n  study without validating NPI completeness and rendering vs. billing NPI fill rates produces\n  an attribution analysis anchored to a meaningless identifier.\n- **For Medicare Advantage patients when encounter data are incomplete.** MA plans submit\n  encounter data rather than FFS claims. In some MA data products, professional-claim-equivalent\n  encounter records may be incomplete or missing line-level detail. 
Treating MA encounter data\n  as equivalent to FFS professional claims without source-specific validation can produce\n  differential outcome and covariate misclassification.",
    "primary_category": "Data_Standard",
    "tags": [
      "coding-system",
      "data-standard",
      "primitive",
      "claims",
      "professional",
      "cms-1500",
      "837p",
      "place-of-service",
      "diagnosis-pointer",
      "rendering-provider",
      "npi",
      "cpt",
      "outpatient"
    ],
    "applies_to_study_types": [
      "claims_analysis",
      "cohort_retrospective",
      "comparative_effectiveness",
      "descriptive",
      "provider_attribution"
    ],
    "data_sources": [
      "claims",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1177/1062860606288774",
        "url": "https://doi.org/10.1177/1062860606288774",
        "citation_text": "Tyree PT, Lind BK, Lafferty WE. Challenges of using medical insurance claims data for utilization analysis. American Journal of Medical Quality. 2006;21(4):269-275.",
        "year": 2006,
        "authors_short": "Tyree et al.",
        "notes": "Foundational exposition of the structural challenges in medical insurance claims data, including diagnosis coding conventions, provider identification, and the gap between billed data and clinical truth — directly applicable to the CMS-1500 professional claim context."
      },
      {
        "role": "explain",
        "doi": "10.1001/archopht.1993.01090050039024",
        "url": "https://doi.org/10.1001/archopht.1993.01090050039024",
        "citation_text": "Javitt JC. Accuracy of coding in Medicare Part B claims. Archives of Ophthalmology. 1993;111(5):605-607.",
        "year": 1993,
        "authors_short": "Javitt",
        "notes": "Early empirical assessment of diagnosis and procedure coding accuracy in Medicare Part B (professional) claims — the direct predecessor to the 837P/CMS-1500 structure used today, establishing that coding accuracy is high for procedure codes but more variable for diagnosis codes."
      },
      {
        "role": "demonstrate",
        "doi": "10.1097/00005650-200208001-00012",
        "url": "https://doi.org/10.1097/00005650-200208001-00012",
        "citation_text": "Baldwin LM, Adamache W, Klabunde CN, Kenward K, Dahlman C, Warren JL. Linking physician characteristics and Medicare claims data: issues in data availability, quality, and integrity. Medical Care. 2002;40(8 Suppl):IV-82-95.",
        "year": 2002,
        "authors_short": "Baldwin et al.",
        "notes": "Demonstrates the practical challenges of linking physician-level characteristics to Medicare professional claims, including rendering vs. billing NPI disambiguation, provider specialty attribution, and data quality issues specific to the professional claims file — core issues for any provider-attribution study using CMS-1500 / 837P data."
      },
      {
        "role": "use",
        "doi": "10.7326/0003-4819-138-4-200302180-00006",
        "url": "https://doi.org/10.7326/0003-4819-138-4-200302180-00006",
        "citation_text": "Fisher ES, Wennberg DE, Stukel TA, Gottlieb DJ, Lucas FL, Pinder EL. The implications of regional variations in Medicare spending. Part 1: The content, quality, and accessibility of care. Annals of Internal Medicine. 2003;138(4):273-287.",
        "year": 2003,
        "authors_short": "Fisher et al.",
        "notes": "Landmark study using Medicare professional claims (carrier file / CMS-1500 equivalent) and POS codes to attribute services to providers and sites of care across regions — a canonical demonstration of how rendering NPI, place of service, and CPT codes on professional claims enable provider-level and site-of-care analyses at national scale."
      },
      {
        "role": "use",
        "doi": null,
        "url": "https://www.nucc.org",
        "citation_text": "National Uniform Claim Committee (NUCC). CMS-1500 Health Insurance Claim Form Reference Instruction Manual. Chicago, IL: NUCC. Available at https://www.nucc.org",
        "year": null,
        "authors_short": "NUCC",
        "notes": "The authoritative reference for the CMS-1500 claim form structure maintained by the National Uniform Claim Committee; specifies all box definitions, diagnosis pointer rules, place-of-service codes, and provider identifier requirements."
      },
      {
        "role": "use",
        "doi": null,
        "url": "https://www.cms.gov/Regulations-and-Guidance/Guidance/Manuals/Downloads/clm104c26.pdf",
        "citation_text": "Centers for Medicare and Medicaid Services. Medicare Claims Processing Manual, Chapter 26: Completing and Processing the Form CMS-1500 Data Set. Baltimore, MD: CMS. Available at https://www.cms.gov/Regulations-and-Guidance/Guidance/Manuals/Downloads/clm104c26.pdf",
        "year": null,
        "authors_short": "CMS",
        "notes": "The CMS Medicare Claims Processing Manual Chapter 26 defines every field in the CMS-1500 and 837P professional claim, including the diagnosis pointer rules, place-of-service code definitions, NPI reporting requirements, and adjudication conventions — the canonical regulatory source for professional claim field specifications."
      }
    ],
    "plain_language_summary": "The CMS-1500 form (also called the 837P in its electronic version) is the billing record that doctors and other outpatient providers send to insurance companies for every office visit, lab test, or procedure they perform. Each form lists up to 12 diagnoses for the entire visit, then has separate line entries for each procedure — and each procedure line contains a pointer telling the payer which of those 12 diagnoses explains why that specific service was done. Researchers use these records to count outpatient visits, identify conditions, and track which provider saw a patient, but must be careful because many databases drop the pointer field, leaving the diagnosis-to-procedure link ambiguous.",
    "key_terms": [
      {
        "term": "service line",
        "definition": "One row within a professional claim representing a single procedure or service billed, including its date, procedure code, place of service, units, and diagnosis pointer."
      },
      {
        "term": "diagnosis pointer",
        "definition": "A letter (A through L) in box 24E of each service line that points to one or more of the up to 12 header diagnoses, indicating which diagnosis explains why that specific procedure was performed."
      },
      {
        "term": "rendering provider",
        "definition": "The individual clinician who actually performed the service, identified by their NPI in box 24J; this is often different from the billing provider (box 33), which may be a practice group."
      },
      {
        "term": "billing provider",
        "definition": "The entity — usually a practice group or clinic — that submits the claim and receives payment; its NPI appears in box 33 and is distinct from the rendering provider who delivered the care."
      },
      {
        "term": "allowed amount",
        "definition": "The amount the insurer agrees to pay for a service after applying the contracted rate or fee schedule, as opposed to the submitted charge, which is what the provider billed."
      },
      {
        "term": "place of service",
        "definition": "A two-digit code in box 24B that records where the service was delivered (for example, 11 = doctor's office, 21 = inpatient hospital), used to distinguish ambulatory visits from hospital-based services."
      }
    ],
    "worked_example": {
      "scenario": "A pharmacy outcomes analyst is building a cohort of adults with type 2 diabetes who received a new GLP-1 receptor agonist injection (HCPCS J3490) at an outpatient clinic. She pulls all professional claims for a single patient — Patricia H., age 57 — to check which diagnoses were attached to the injection visits versus her regular office visits. The claim below is one of three service lines on a single professional claim submitted by her endocrinologist's group. The analyst needs to decide: how many of the 12 header diagnoses should be counted as \"present on this visit,\" and which diagnoses can be attributed specifically to each service line?\n",
      "dataset": {
        "caption": "One professional claim for Patricia H., showing the Item 21 header diagnoses (A–L) and three service lines from box 24. This is the structure an analyst sees in a professional claims research database.\n",
        "columns": [
          "field",
          "value"
        ],
        "rows": [
          [
            "Claim header: Item 21 diagnoses",
            "A=E11.9 (T2DM), B=I10 (HTN), C=E78.5 (hyperlipidemia), D=Z79.4 (long-term insulin use), E=E11.65 (T2DM with hyperglycemia), F=Z00.00 (general exam), G=J34.89 (other nasal disorders), H=M54.5 (low back pain), I=K21.0 (GERD), J=Z82.49 (family hx heart disease), K=Z96.641 (presence of knee prosthesis), L=E66.9 (obesity)"
          ],
          [
            "Line 1: box 24A (service date)",
            "2023-08-15"
          ],
          [
            "Line 1: box 24B (place of service)",
            "11 (Office)"
          ],
          [
            "Line 1: box 24D (procedure)",
            "CPT 99214 (office/outpatient visit, moderate complexity E/M)"
          ],
          [
            "Line 1: box 24E (diagnosis pointer)",
            "A, B, C, E"
          ],
          [
            "Line 1: box 24G (units)",
            "1"
          ],
          [
            "Line 1: box 24J (rendering NPI)",
            "1234567890 (Dr. Elena Reyes)"
          ],
          [
            "Line 2: box 24A (service date)",
            "2023-08-15"
          ],
          [
            "Line 2: box 24B (place of service)",
            "11 (Office)"
          ],
          [
            "Line 2: box 24D (procedure)",
            "HCPCS J3490 (unclassified drug — GLP-1 injection)"
          ],
          [
            "Line 2: box 24E (diagnosis pointer)",
            "A, E"
          ],
          [
            "Line 2: box 24G (units)",
            "1"
          ],
          [
            "Line 2: box 24J (rendering NPI)",
            "1234567890 (Dr. Elena Reyes)"
          ],
          [
            "Line 3: box 24A (service date)",
            "2023-08-15"
          ],
          [
            "Line 3: box 24B (place of service)",
            "11 (Office)"
          ],
          [
            "Line 3: box 24D (procedure)",
            "CPT 83036 (hemoglobin A1c)"
          ],
          [
            "Line 3: box 24E (diagnosis pointer)",
            "A, D, E"
          ],
          [
            "Line 3: box 24G (units)",
            "1"
          ],
          [
            "Line 3: box 24J (rendering NPI)",
            "1234567890 (Dr. Elena Reyes)"
          ]
        ]
      },
      "steps": [
        "Count the header diagnoses: Item 21 has 12 codes (positions A through L). All 12 are associated with this claim at the claim level. If the research database drops diagnosis pointers, an analyst must attribute all 12 diagnoses to all 3 service lines — every procedure appears co-present with every diagnosis.",
        "Service lines on this claim: 3 total. Each has a service date (2023-08-15), a POS code (11 = Office for all three), a procedure code, a diagnosis pointer, and a rendering NPI.",
        "Resolving the diagnosis pointer for Line 1 (E/M visit CPT 99214): pointer letters A, B, C, E map to E11.9 (T2DM), I10 (HTN), E78.5 (hyperlipidemia), and E11.65 (T2DM with hyperglycemia). The E/M visit was billed for 4 of the 12 header diagnoses. The 8 remaining diagnoses (F through L plus D) are on the claim but NOT pointed to by Line 1.",
        "Resolving the pointer for Line 2 (GLP-1 injection J3490): pointer letters A and E map to E11.9 and E11.65 only. The injection is explicitly attributed to T2DM — the 10 non-pointed diagnoses are irrelevant to this line. This is the correct denominator for a diabetes-specific utilization count: expr = 2 diagnosis codes pointed to line 2.",
        "Resolving the pointer for Line 3 (HbA1c CPT 83036): pointer letters A, D, E map to E11.9 (T2DM), Z79.4 (long-term insulin use), and E11.65 (T2DM with hyperglycemia). The lab is pointed to 3 of the 12 header codes.",
        "Provider attribution check: box 24J (rendering NPI 1234567890 = Dr. Reyes) is the same on all 3 lines. Box 33 on this claim carries the group NPI for the endocrinology practice (not shown). For a physician-level study, use rendering NPI from 24J. For a practice-level study, use the billing NPI from box 33.",
        "Confirming the pointer arithmetic: total header diagnoses = 12 (A–L). Line 1 points to 4 (A, B, C, E). Line 2 points to 2 (A, E). Line 3 points to 3 (A, D, E). No line points to more than 4 diagnoses, and no line exceeds the 12-position limit. Total unique diagnosis codes pointed across all 3 lines = A, B, C, D, E = 5 unique positions out of 12. Diagnoses in positions F through L (7 codes) appear on the claim but are not pointed to by any service line."
      ],
      "result": "With pointers available: Line 2 (GLP-1 injection) is attributed to 2 diagnosis codes (positions A and E = T2DM diagnoses only), giving a precise diabetes-specific utilization count. Line 1 (E/M visit) is attributed to 4 diagnoses (A, B, C, E). Line 3 (HbA1c) is attributed to 3 diagnoses (A, D, E). Total service lines on this claim = 3; header diagnoses = 12; diagnoses pointed to by at least one line = 5 (out of 12); diagnoses present on claim but unpointed = 7. If pointer data are absent (common in commercial databases), all 3 lines would be attributed to all 12 diagnoses — inflating the apparent diagnostic burden per line by a factor of up to 12/2 = 6 for the injection line. This is the claim-level ambiguity that drives conservative phenotyping practice."
    },
    "prerequisites": [
      "claims-analysis"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Claim-level versus line-level diagnosis ascertainment",
        "description": "When diagnosis pointers are available, each service line can be attributed only to its pointed diagnoses — a precise approach that reduces false-positive phenotype hits. When pointers are absent (common in many commercial databases), all header diagnoses are attributed to all lines (claim-level ascertainment), which is conservative but inflates apparent diagnostic burden. The \"2 OP codes\" phenotype convention defaults to claim-level because it asks only whether a qualifying diagnosis appeared on the claim, not which specific procedure it was linked to.\n",
        "edge_cases": [
          "When a single claim has both a qualifying diagnosis and a non-qualifying diagnosis on the same service line pointer, claim-level attribution misclassifies the event.",
          "Some databases store \"first listed\" diagnosis only (position A), discarding positions B-L, which underestimates comorbidity and can miss qualifying diagnoses not coded first."
        ],
        "data_source_notes": "Medicare carrier file: diagnosis pointers are generally preserved in RESDAC data extracts. Commercial databases (MarketScan, Optum): pointer availability varies by data vintage and product; check data dictionary before assuming pointer fields are populated."
      },
      {
        "name": "Rendering versus billing NPI for provider attribution",
        "description": "Rendering NPI (box 24J, line-level) identifies the individual clinician. Billing NPI (box 33, claim-level) identifies the practice or group. Individual provider studies should use rendering NPI; studies of practice patterns should use billing NPI. Mixed use — especially defaulting to billing NPI when rendering NPI is missing — conflates the two attribution levels.\n",
        "edge_cases": [
          "In solo practices, rendering and billing NPI are often the same; in large multispecialty groups, they almost always differ.",
          "Locum tenens providers may bill under the supervising physician's rendering NPI, masking true provider identity.",
          "Group-practice billing with shared NPI is common in anesthesiology and radiology, where individual rendering NPI fill rates may be lower."
        ],
        "data_source_notes": "Medicare carrier file: rendering NPI is required and generally well-populated since NPI mandate (2008). Commercial claims: rendering NPI completeness varies; some databases report fill rates below 60% for certain specialties."
      },
      {
        "name": "Place-of-service code as the site-of-care variable",
        "description": "The two-digit POS code in box 24B is the only reliable site-of-care indicator on professional claims. It distinguishes outpatient office (POS 11), hospital outpatient (POS 22), emergency department (POS 23), inpatient (POS 21), telehealth (POS 02 or 10), and over a dozen other settings. For studies of setting-specific utilization or cost differentials (e.g., office vs. hospital outpatient for the same CPT code), POS is the operative field.\n",
        "edge_cases": [
          "POS 22 (on-campus outpatient) triggers the \"site-of-service\" payment differential in Medicare, making the same CPT code cost substantially more than at POS 11 (office). Studies of total cost of care must account for POS-driven payment variation.",
          "Telehealth POS codes expanded dramatically during COVID-19 (2020+). Pre-pandemic POS coding for telehealth was inconsistent; restrict telehealth analyses to post-2019 data or validate POS coding with modifier GT."
        ],
        "data_source_notes": "POS is consistently populated across Medicare and commercial claims; it is a stable, reliable field for site-of-care stratification."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "ub-04-institutional-claim-fields",
        "pros_of_this": "Professional claims cover the vast majority of outpatient encounters, procedures, and Part B drug administrations; provide rendering-provider-level attribution; and include the diagnosis pointer linking specific procedures to specific diagnoses.",
        "cons_of_this": "No principal diagnosis concept (position A is a billing choice, not an adjudicated clinical determination); pointers often absent in research databases; multiple professional claims per hospitalization episode require de-duplication against the institutional claim.",
        "when_to_prefer": "Use professional claims for outpatient phenotypes, E/M visit counting, ambulatory procedure identification, and provider attribution. Use institutional claims for inpatient event identification and principal-diagnosis-based phenotypes."
      },
      {
        "compared_to": "diagnosis-phenotype-algorithm-1ip-2op-time-window-rwe",
        "pros_of_this": "The raw input to phenotype algorithms — understanding claim structure clarifies why \"2 OP\" defaults to claim-level and why IP and OP are structurally different (principal diagnosis vs. first-listed header code).",
        "cons_of_this": "Knowing the claim structure does not resolve which algorithm has the best PPV/Sp for a specific condition — validation studies are still required.",
        "when_to_prefer": "Always understand CMS-1500 field structure before implementing any professional claims phenotype algorithm; the \"2 OP\" and \"1 IP\" conventions map directly to the structural differences described here."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Professional claims appear in the Medicare carrier file (RESDAC), commercial databases (MarketScan professional, Optum professional), and Medicaid MAX/T-MSIS other services file. Key steps: (1) filter to professional claim type; (2) check diagnosis pointer availability in the data dictionary before choosing claim-level vs. line-level dx ascertainment; (3) use rendering NPI (24J) for individual provider attribution, validate fill rates; (4) use POS code (24B) to stratify by site of care; (5) join to enrollment file to confirm observable enrollment on the service date.\n",
      "linked": "When linking professional claims to EHR data, service date in box 24A is the join key. Order date in EHR typically precedes service date in claim by 1-7 days; use a window (e.g., ±7 days) for matching. Rendering NPI in 24J is the provider identifier for EHR-claims provider linkage.\n"
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\n\n# Synthetic professional claim table (one row per service line)\n# Field names follow standard Medicare carrier extract conventions.\nclaim_lines = pd.DataFrame({\n    \"claim_id\":    [\"CLM001\",\"CLM001\",\"CLM001\"],\n    \"line_num\":    [1, 2, 3],\n    \"service_dt\":  [\"2023-08-15\",\"2023-08-15\",\"2023-08-15\"],\n    \"pos_cd\":      [11, 11, 11],          # 24B: place of service (11=office)\n    \"proc_cd\":     [\"99214\",\"J3490\",\"83036\"],  # 24D: CPT/HCPCS\n    \"units\":       [1, 1, 1],             # 24G: units\n    \"rendering_npi\": [\"1234567890\",\"1234567890\",\"1234567890\"],  # 24J\n    \"billing_npi\":   [\"9876543210\",\"9876543210\",\"9876543210\"],  # box 33\n    \"dx_ptr\":      [\"A,B,C,E\", \"A,E\", \"A,D,E\"],   # 24E: diagnosis pointer\n    # Claim-level header diagnoses (Item 21, positions A-L)\n    \"dx_A\": [\"E11.9\"]*3, \"dx_B\": [\"I10\"]*3,   \"dx_C\": [\"E78.5\"]*3,\n    \"dx_D\": [\"Z79.4\"]*3, \"dx_E\": [\"E11.65\"]*3, \"dx_F\": [\"Z00.00\"]*3,\n    \"dx_G\": [\"J34.89\"]*3,\"dx_H\": [\"M54.5\"]*3,  \"dx_I\": [\"K21.0\"]*3,\n    \"dx_J\": [\"Z82.49\"]*3,\"dx_K\": [\"Z96.641\"]*3,\"dx_L\": [\"E66.9\"]*3,\n})\n\n# Build a lookup: position letter -> dx column name\npos_to_col = {letter: f\"dx_{letter}\" for letter in \"ABCDEFGHIJKL\"}\n\ndef resolve_pointers(row):\n    \"\"\"For one service line, return the list of ICD codes the pointer maps to.\"\"\"\n    ptrs = [p.strip() for p in str(row[\"dx_ptr\"]).split(\",\") if p.strip()]\n    codes = []\n    for letter in ptrs:\n        col = pos_to_col.get(letter.upper())\n        if col and col in row.index and pd.notna(row[col]):\n            codes.append(row[col])\n    return codes\n\n# Line-level diagnosis attribution (pointer-resolved)\nclaim_lines[\"pointed_dx\"] = claim_lines.apply(resolve_pointers, axis=1)\nclaim_lines[\"n_pointed_dx\"] = claim_lines[\"pointed_dx\"].str.len()\n\nprint(\"=== Pointer-resolved line-diagnosis pairs 
===\")\nfor _, row in claim_lines.iterrows():\n    print(f\"  Line {row['line_num']} ({row['proc_cd']}, POS {row['pos_cd']}): \"\n          f\"{row['pointed_dx']}  (n={row['n_pointed_dx']})\")\n\n# Claim-level ascertainment (fallback when pointer absent)\ndx_cols = [f\"dx_{l}\" for l in \"ABCDEFGHIJKL\"]\nall_claim_dx = (\n    claim_lines[dx_cols].iloc[0].dropna().tolist()\n)\nprint(f\"\\n=== Claim-level dx (all {len(all_claim_dx)} header codes, pointer absent) ===\")\nprint(f\"  All lines attributed to: {all_claim_dx}\")\n\n# Provider attribution: rendering vs billing NPI\nprint(\"\\n=== Provider attribution ===\")\nprint(f\"  Rendering NPI (24J, individual): {claim_lines['rendering_npi'].unique()}\")\nprint(f\"  Billing NPI (box 33, group):     {claim_lines['billing_npi'].unique()}\")\nprint(\"  -> Use rendering NPI for individual provider studies.\")\nprint(\"  -> Use billing NPI for practice/group-level studies.\")\n\n# Diagnosis inflation factor when pointers are absent\navg_pointed = claim_lines[\"n_pointed_dx\"].mean()\nn_header = len([c for c in all_claim_dx if c])\nprint(f\"\\n=== Diagnosis pointer ambiguity metric ===\")\nprint(f\"  Avg diagnoses per line with pointer: {avg_pointed:.1f}\")\nprint(f\"  Diagnoses per line without pointer:  {n_header}\")\n# n_header / avg_pointed = 12 / (avg ~3) ≈ 4\nprint(f\"  Inflation factor (claim-level vs pointed): \"\n      f\"{n_header} / {avg_pointed:.1f} = {n_header / avg_pointed:.1f}\")",
        "description": "Explodes a professional claims table into line-diagnosis pairs by resolving the diagnosis pointer, then counts pointer-attributed diagnoses per procedure line. Also demonstrates rendering vs. billing NPI extraction for provider attribution. Data are synthetic but use realistic Medicare-style field names. Implementation notes field is intentionally empty; see description.\n",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(tidyr)\nlibrary(dplyr)\n\n# Synthetic professional claim (one row per service line)\nclaim_lines <- data.frame(\n  claim_id      = rep(\"CLM001\", 3),\n  line_num      = 1:3,\n  service_dt    = rep(\"2023-08-15\", 3),\n  pos_cd        = rep(11L, 3),           # 24B: place of service (11=office)\n  proc_cd       = c(\"99214\",\"J3490\",\"83036\"),  # 24D: CPT/HCPCS\n  units         = rep(1L, 3),\n  rendering_npi = rep(\"1234567890\", 3),  # 24J: individual clinician\n  billing_npi   = rep(\"9876543210\", 3),  # box 33: practice group\n  dx_ptr        = c(\"A,B,C,E\",\"A,E\",\"A,D,E\"),   # 24E: diagnosis pointer\n  # Header diagnoses Item 21 (positions A-L, carried on every line row)\n  dx_A = rep(\"E11.9\", 3), dx_B = rep(\"I10\", 3),    dx_C = rep(\"E78.5\", 3),\n  dx_D = rep(\"Z79.4\", 3), dx_E = rep(\"E11.65\", 3), dx_F = rep(\"Z00.00\", 3),\n  dx_G = rep(\"J34.89\",3), dx_H = rep(\"M54.5\", 3),  dx_I = rep(\"K21.0\", 3),\n  dx_J = rep(\"Z82.49\",3), dx_K = rep(\"Z96.641\",3), dx_L = rep(\"E66.9\", 3),\n  stringsAsFactors = FALSE\n)\n\n# Resolve the diagnosis pointer for each service line\nresolve_pointers <- function(row_df) {\n  ptrs   <- trimws(unlist(strsplit(row_df$dx_ptr, \",\")))\n  cols   <- paste0(\"dx_\", toupper(ptrs))\n  valid  <- cols[cols %in% names(row_df)]\n  codes  <- unlist(row_df[valid], use.names = FALSE)\n  codes[!is.na(codes) & codes != \"\"]\n}\n\npointer_results <- apply(claim_lines, 1, function(row) {\n  df <- as.data.frame(t(row), stringsAsFactors = FALSE)\n  codes <- resolve_pointers(df)\n  data.frame(\n    claim_id     = row[\"claim_id\"],\n    line_num     = as.integer(row[\"line_num\"]),\n    proc_cd      = row[\"proc_cd\"],\n    pos_cd       = as.integer(row[\"pos_cd\"]),\n    rendering_npi = row[\"rendering_npi\"],\n    billing_npi  = row[\"billing_npi\"],\n    pointed_dx   = paste(codes, collapse = \"; \"),\n    n_pointed    = length(codes),\n    stringsAsFactors = FALSE\n  )\n})\nline_dx <- 
do.call(rbind, pointer_results)\n\ncat(\"=== Pointer-resolved line-diagnosis pairs ===\\n\")\nprint(line_dx[, c(\"line_num\",\"proc_cd\",\"pos_cd\",\"pointed_dx\",\"n_pointed\")])\n\n# Claim-level ascertainment (when pointer absent)\ndx_cols   <- paste0(\"dx_\", LETTERS[1:12])\nall_claim_dx <- unique(unlist(claim_lines[1, dx_cols]))\nall_claim_dx <- all_claim_dx[!is.na(all_claim_dx)]\ncat(sprintf(\"\\n=== Claim-level dx (pointer absent): %d header codes ===\\n\",\n            length(all_claim_dx)))\nprint(all_claim_dx)\n\n# Provider attribution summary\ncat(\"\\n=== Provider attribution ===\\n\")\ncat(\"Rendering NPI (24J, individual):\", unique(claim_lines$rendering_npi), \"\\n\")\ncat(\"Billing NPI (box 33, group):    \", unique(claim_lines$billing_npi), \"\\n\")\n\n# Diagnosis inflation when pointer absent\navg_pointed <- mean(line_dx$n_pointed)\nn_header    <- length(all_claim_dx)\ncat(sprintf(\"\\nInflation factor (claim-level vs pointed): %d / %.1f = %.1f\\n\",\n            n_header, avg_pointed, n_header / avg_pointed))",
        "description": "Resolves diagnosis pointers on professional claims to produce line-level dx pairs, and demonstrates rendering vs. billing NPI extraction for provider attribution. Uses base R and tidyr for the pointer explosion step. Implementation notes field is intentionally empty; see description.\n",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  CLM[\"CMS-1500 / 837P Professional Claim\"]\n  HDR[\"Item 21: Header Diagnoses (A–L)\\nUp to 12 ICD-10-CM codes\\nClaim-level: no principal dx\"]\n  SL1[\"Service Line 1\\n24A: Service date\\n24B: POS code\\n24D: CPT/HCPCS\\n24E: Pointer → A,B,C,E\\n24G: Units\\n24J: Rendering NPI\"]\n  SL2[\"Service Line 2\\n24D: J3490 (GLP-1)\\n24E: Pointer → A,E\\n24J: Rendering NPI\"]\n  SL3[\"Service Line 3\\n24D: CPT 83036 (HbA1c)\\n24E: Pointer → A,D,E\\n24J: Rendering NPI\"]\n  BOX33[\"Box 33: Billing/Group NPI\\n(Practice entity — distinct\\nfrom rendering NPI 24J)\"]\n  BOX17[\"Box 17: Referring Provider NPI\\n(Who ordered the visit)\"]\n  DXA[\"A: E11.9 T2DM\"]\n  DXB[\"B: I10 HTN\"]\n  DXC[\"C: E78.5 Hyperlipidemia\"]\n  DXD[\"D: Z79.4 Long-term insulin\"]\n  DXE[\"E: E11.65 T2DM w/ hyperglycemia\"]\n  DXF[\"F–L: 7 other diagnoses\\n(not pointed to by any line)\"]\n  CLM --> HDR\n  CLM --> SL1\n  CLM --> SL2\n  CLM --> SL3\n  CLM --> BOX33\n  CLM --> BOX17\n  HDR --> DXA\n  HDR --> DXB\n  HDR --> DXC\n  HDR --> DXD\n  HDR --> DXE\n  HDR --> DXF\n  SL1 -- \"pointer A,B,C,E\" --> DXA\n  SL1 -- \"pointer A,B,C,E\" --> DXB\n  SL1 -- \"pointer A,B,C,E\" --> DXC\n  SL1 -- \"pointer A,B,C,E\" --> DXE\n  SL2 -- \"pointer A,E\" --> DXA\n  SL2 -- \"pointer A,E\" --> DXE\n  SL3 -- \"pointer A,D,E\" --> DXA\n  SL3 -- \"pointer A,D,E\" --> DXD\n  SL3 -- \"pointer A,D,E\" --> DXE",
        "caption": "Anatomy of a CMS-1500 / 837P professional claim showing the Item 21 header diagnosis list (12 codes, A–L) and three service lines. Each line's diagnosis pointer (box 24E) links to a subset of header diagnoses — the GLP-1 injection line (SL2) points only to the two T2DM codes, while the E/M visit (SL1) points to four diagnoses. Seven of the 12 header diagnoses are not pointed to by any line on this claim. The billing NPI (box 33) is the practice group, distinct from the individual rendering NPI in box 24J.\n",
        "alt_text": "Flowchart showing a professional claim branching into Item 21 header diagnoses (A through L) and three service lines. Each service line connects via a labeled diagnosis pointer arrow to the subset of header diagnoses it references. The billing NPI (box 33) and referring provider (box 17) also hang off the claim header.",
        "source_type": "illustrative",
        "source_citations": [
          "tyree-2006"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "alternative_to",
        "target_slug": "ub-04-institutional-claim-fields",
        "notes": "The institutional counterpart — UB-04/837I covers facility claims (inpatient, outpatient facility) with a true principal diagnosis under UHDDS; CMS-1500/837P covers professional claims with no principal diagnosis, only a first-listed header code and a line-level pointer."
      },
      {
        "relation_type": "used_with",
        "target_slug": "cpt-procedure-coding",
        "notes": "Box 24D of the CMS-1500 carries the CPT or HCPCS Level II procedure code for each service line; CPT coding conventions, modifier rules, and unbundling edits all operate at the line level."
      },
      {
        "relation_type": "part_of",
        "target_slug": "place-of-service-codes",
        "notes": "Box 24B on each service line carries the two-digit CMS place-of-service code — the primary site-of-care variable on professional claims used to distinguish office, hospital outpatient, emergency department, and other settings."
      },
      {
        "relation_type": "used_with",
        "target_slug": "npi-national-provider-identifier",
        "notes": "Box 24J carries the rendering provider's NPI (individual clinician who delivered the service); box 33 carries the billing provider or group NPI. Disambiguation of rendering vs. billing NPI is the central challenge in provider attribution studies using professional claims."
      },
      {
        "relation_type": "see_also",
        "target_slug": "diagnosis-position-and-qualifiers",
        "notes": "Item 21 of the CMS-1500 lists diagnoses in positions A–L with no principal diagnosis designation; the first-listed (position A) is a billing convention, not a clinical adjudication. Understanding diagnosis position semantics is essential for phenotype algorithms built on professional claims."
      },
      {
        "relation_type": "used_with",
        "target_slug": "ndc-national-drug-code",
        "notes": "Some professional claims carry drug lines using HCPCS J-codes with an associated NDC in the drug identification field (loop 2410 in 837P), enabling Part B drug utilization analysis."
      },
      {
        "relation_type": "see_also",
        "target_slug": "claims-analysis",
        "notes": "The CMS-1500/837P professional claim is one of the two primary claim types in US administrative claims databases (along with UB-04/837I institutional); understanding its field structure is prerequisite to any professional claims analysis."
      },
      {
        "relation_type": "see_also",
        "target_slug": "claims-outcome-algorithm-ppv-sensitivity-rwe",
        "notes": "The accuracy of outcome phenotypes built on professional claims depends directly on whether claim-level or pointer-resolved line-level diagnosis ascertainment is used; pointer ambiguity is a primary driver of false-positive rates in outpatient diagnosis algorithms."
      },
      {
        "relation_type": "see_also",
        "target_slug": "diagnosis-phenotype-algorithm-1ip-2op-time-window-rwe",
        "notes": "The \"2 OP\" (two outpatient codes) convention in phenotype algorithms refers specifically to professional claim-level diagnosis codes; understanding the CMS-1500 header structure explains why claim-level (not line-level) ascertainment is the default."
      }
    ],
    "aliases": [
      "CMS-1500",
      "HCFA-1500",
      "837P",
      "professional claim",
      "diagnosis pointer",
      "Part B claim"
    ],
    "complexity": "foundational",
    "regulatory_relevance": [
      "cms"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "cms-hcc-risk-adjustment-rwe",
    "name": "CMS-HCC Risk Adjustment",
    "short_definition": "A CMS Medicare Advantage risk-adjustment model that maps ICD-10-CM diagnosis codes to hierarchical condition categories and combines demographic, Medicaid/LIS, disability, and disease interactions into a prospective payment risk score.",
    "long_description": "CMS-HCC risk adjustment is the payment model used to adjust Medicare Advantage and related Medicare risk-based payments for expected clinical cost. In RWE it is useful as both a real operational feature of Medicare data and a warning label: diagnoses are not neutral clinical facts when they also drive plan revenue. The model maps diagnosis codes from an encounter period into condition categories, applies hierarchies so more severe manifestations supersede related milder ones, and combines those condition flags with demographic and eligibility factors to form a risk score.\n\nAnalysts should distinguish the HCC model from a generic comorbidity index. Charlson and Elixhauser were built as research covariates. CMS-HCC was built for payment. It is prospective, model-versioned, payment-year-specific, and sensitive to coding intensity, chart reviews, health risk assessments, and Medicare Advantage encounter completeness. A high HCC count may reflect higher morbidity, more aggressive documentation, or both.\n\nIn comparative RWE, CMS-HCC variables can help summarize baseline disease burden in Medicare populations, stratify by payment-risk profile, or flag coding-intensity differences between Medicare Advantage and fee-for-service. They should not be substituted silently for validated comorbidity or frailty measures, and they should not be pooled across model years without checking which CMS model, mapping file, and coefficient set produced the score.\n\n**Pros, cons, and trade-offs.** CMS-HCC is highly operational because it reflects the same diagnosis, eligibility, and demographic information that Medicare uses for risk adjustment. That makes it useful for Medicare-specific baseline risk description and for audits of coding intensity. The trade-off is purpose: CMS-HCC is calibrated for payment, not for a generic clinical severity estimand. 
It can overstate morbidity when documentation intensity differs and understate clinical risk when important severity markers do not map to HCCs. In RWE, it works best as a transparent Medicare payment-risk feature, not as a substitute for condition-level confounders, frailty, or outcome-specific clinical severity.\n\n**When to use.** Use CMS-HCC when the cohort is Medicare-centered, model year and diagnosis-capture windows are known, and the research question needs Medicare payment-risk context, Medicare Advantage/FFS comparability checks, or coding-intensity diagnostics. It is also useful as a stratification variable when plan payment risk is part of the causal pathway or data-fitness question.\n\n**When NOT to use - and when it is actively misleading.** Do not use CMS-HCC as a timeless comorbidity score across commercial, Medicaid, and Medicare data without reconstructing the model logic. Do not compare scores across model years without version control. It is actively misleading to interpret a higher HCC count as purely higher disease burden when chart review, HRA diagnoses, encounter completeness, or plan coding incentives differ between study arms.",
    "primary_category": "Bias_Control",
    "tags": [
      "cms-hcc",
      "hcc",
      "risk-adjustment",
      "medicare-advantage",
      "coding-intensity",
      "claims",
      "payment-model"
    ],
    "applies_to_study_types": [
      "claims_analysis",
      "comparative_effectiveness",
      "pharmacoepidemiology",
      "health_economic_modeling"
    ],
    "data_sources": [
      "claims",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": null,
        "url": "https://www.cms.gov/medicare/payment/medicare-advantage-rates-statistics/risk-adjustment",
        "citation_text": "Centers for Medicare & Medicaid Services. Risk Adjustment. CMS Medicare Advantage Rates & Statistics.",
        "year": 2026,
        "authors_short": "CMS",
        "notes": "CMS overview of Medicare risk adjustment and model materials."
      },
      {
        "role": "explain",
        "doi": null,
        "url": "https://www.cms.gov/medicare/payment/medicare-advantage-rates-statistics/risk-adjustment/2026-model-software-icd-10-mappings",
        "citation_text": "Centers for Medicare & Medicaid Services. 2026 Model Software/ICD-10 Mappings.",
        "year": 2026,
        "authors_short": "CMS",
        "notes": "Official model software and diagnosis-to-HCC mapping materials."
      },
      {
        "role": "use",
        "doi": null,
        "url": "https://www.cms.gov/files/document/cy-2026-risk-adjustment-implementation-memo-g.pdf",
        "citation_text": "Centers for Medicare & Medicaid Services. Continued Phase-in of the 2024 CMS-HCC Risk Adjustment Model for CY 2026.",
        "year": 2026,
        "authors_short": "CMS",
        "notes": "Implementation memo documenting current model transition and operational details."
      }
    ],
    "plain_language_summary": "CMS-HCC is Medicare's diagnosis-based risk score system. It groups diagnosis codes into payment categories and gives plans higher payments for beneficiaries expected to cost more. For research, it can be a useful marker of baseline sickness, but it is also shaped by payment incentives and documentation intensity.",
    "key_terms": [
      {
        "term": "Hierarchical condition category",
        "definition": "A diagnosis grouping used by CMS risk adjustment, with hierarchy rules so severe categories can supersede less severe related categories."
      },
      {
        "term": "Risk score",
        "definition": "The model output that combines demographic, eligibility, diagnosis, and interaction factors to predict expected cost for a payment year."
      },
      {
        "term": "Coding intensity",
        "definition": "Differences in diagnosis capture caused by documentation and risk-adjustment incentives, not necessarily true clinical differences."
      }
    ],
    "worked_example": {
      "scenario": "A Medicare Advantage study needs a baseline risk variable for a 74-year-old dual-eligible beneficiary. During the diagnosis capture year, claims and encounter records include diabetes with complications, uncomplicated diabetes, COPD, and metastatic cancer. The analyst maps diagnoses to HCCs, applies hierarchies, then carries both the final HCC flags and the model-year risk score into Table 1 and the propensity model.",
      "dataset": {
        "caption": "Simplified diagnosis-to-HCC construction for one beneficiary.",
        "columns": [
          "source_record",
          "diagnosis_label",
          "raw_hcc_hit",
          "hierarchy_result"
        ],
        "rows": [
          [
            "inpatient claim",
            "diabetes with chronic complications",
            "diabetes complication HCC",
            "kept; supersedes uncomplicated diabetes"
          ],
          [
            "outpatient encounter",
            "uncomplicated diabetes",
            "diabetes without complication HCC",
            "suppressed by hierarchy"
          ],
          [
            "outpatient encounter",
            "chronic obstructive pulmonary disease",
            "COPD HCC",
            "kept"
          ],
          [
            "oncology encounter",
            "metastatic lung cancer",
            "metastatic cancer HCC",
            "kept"
          ]
        ]
      },
      "steps": [
        "Choose the CMS-HCC model year and mapping files before looking at outcomes.",
        "Map eligible diagnoses from the capture period to raw HCC hits.",
        "Apply hierarchy rules so lower-severity related HCCs do not double count disease burden.",
        "Add demographic and eligibility variables, including dual/LIS/disability indicators if required by the model.",
        "Store raw hits, final HCC flags, model version, and the final risk score for audit."
      ],
      "result": "The patient has final HCC flags for diabetes with complications, COPD, and metastatic cancer; the uncomplicated diabetes hit is suppressed. The study reports model year, diagnosis capture window, data source, and whether MA chart-review or HRA records were included."
    },
    "prerequisites": [],
    "index_definitions": [
      {
        "name": "CMS-HCC risk-adjustment model",
        "definition": "Medicare risk-adjustment model that converts demographics, eligibility variables, and diagnosis-based HCCs into a payment risk score.",
        "source": "CMS Medicare Advantage Risk Adjustment",
        "use": "Payment adjustment and, cautiously, baseline risk characterization in Medicare RWE.",
        "notes": "Always report model year/version and whether the data source is FFS, MA encounter, or linked."
      },
      {
        "name": "ICD-10-CM to HCC mapping",
        "definition": "Versioned CMS mapping from diagnosis codes to condition categories used by the model software.",
        "source": "CMS model software ICD-10 mappings",
        "use": "Reproducible construction of HCC flags from diagnosis records.",
        "notes": "Mappings are payment-year-specific and should not be mixed across years without justification."
      },
      {
        "name": "HCC hierarchy",
        "definition": "Rule set that suppresses lower-severity related HCCs when a higher-severity HCC in the same hierarchy is present.",
        "source": "CMS risk-adjustment model software",
        "use": "Prevents double counting related disease severity categories.",
        "notes": "Store both raw category hits and final hierarchy-applied flags when auditing."
      }
    ],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Payment-model score as covariate",
        "description": "Use the CMS-HCC risk score directly as a baseline risk-adjustment covariate.",
        "edge_cases": [
          "Model-year coefficients change.",
          "MA encounter and FFS diagnosis capture differ."
        ],
        "data_source_notes": "Best within a single Medicare model year and payer channel; risky across FFS/MA/commercial comparisons."
      },
      {
        "name": "HCC count or flags as coding-intensity proxy",
        "description": "Use distinct HCC count, chart-review flags, or HRA-origin diagnoses to diagnose coding intensity rather than clinical burden alone.",
        "edge_cases": [
          "High HCC count may reflect documentation intensity.",
          "Some MA records may be payment-oriented rather than utilization-oriented."
        ],
        "data_source_notes": "Useful in MA sensitivity analyses and payer-channel diagnostics."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Charlson / Quan-Charlson Comorbidity Index",
        "use_cms_hcc_when": "Medicare payment risk and coding-intensity context are central to the question.",
        "use_comorbidity_index_when": "A transparent research comorbidity covariate is needed across payers or studies.",
        "notes": "CMS-HCC is payment-calibrated; Charlson is research-calibrated."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Build only after an eligibility panel defines Medicare FFS versus MA observability; separate diagnosis sources such as encounter, chart review, and HRA where available.",
      "linked": "Linked EHR/claims data can help distinguish clinically documented disease from payment-oriented diagnosis capture, but linkage selection must be described."
    },
    "implementations": [],
    "diagrams": [],
    "relations": [
      {
        "relation_type": "see_also",
        "target_slug": "medicare-ffs-ma-commercial-claims-differences-rwe",
        "notes": "HCC coding intensity is one of the main reasons Medicare Advantage diagnosis profiles can differ from FFS."
      },
      {
        "relation_type": "alternative_to",
        "target_slug": "charlson-comorbidity-index-rwe",
        "notes": "CMS-HCC is payment-calibrated; Charlson is a research comorbidity index."
      },
      {
        "relation_type": "alternative_to",
        "target_slug": "elixhauser-comorbidity-index-rwe",
        "notes": "Elixhauser provides condition-level comorbidity adjustment without the same payment-model purpose."
      }
    ],
    "aliases": [
      "HCC",
      "CMS HCC",
      "CMS-HCC",
      "HCC risk score",
      "Medicare Advantage risk score",
      "CMS-HCC risk score",
      "CMS hierarchical condition category",
      "hierarchical condition category risk adjustment"
    ],
    "complexity": "advanced",
    "regulatory_relevance": [
      "cms",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "cohort-prospective",
    "name": "Prospective Cohort Study",
    "short_definition": "An observational design that defines a group by exposure status at a fixed time origin and follows it forward in calendar time to ascertain incident outcomes, so that exposure and covariates are recorded before the outcome occurs.",
    "long_description": "A **prospective cohort study** classifies people by exposure at a defined **time origin (time zero)** and then follows\nthem *forward* to observe who develops the outcome. The defining feature is temporal: exposure status and baseline\ncovariates are fixed and recorded **before** any outcome is known, so the direction of measurement matches the direction\nof inference. This is what distinguishes it from a retrospective cohort (where both exposure and outcome have already\noccurred when the investigator looks) and from a case-control study (which samples on the outcome and looks backward at\nexposure). In real-world data the \"prospective\" label is often about *analytic posture* rather than wall-clock timing:\na study built in an administrative database is technically conducted on already-accrued records, but it is **designed and\nanalyzed prospectively** when the protocol fixes eligibility, time zero, exposure, and covariate windows *a priori* and\nfollows each person forward from time zero — the structure Hernán and Robins formalize as **target-trial emulation**.\n\n**Core conceptual / estimand distinction.** A cohort is a *sampling-and-follow-up frame*, not an estimator. The design\nfixes who is in the risk set, when their clock starts, and what counts as person-time at risk; the estimand\n(cumulative incidence, incidence rate, hazard ratio, risk difference, restricted mean survival time) and the\nconfounding-control strategy (restriction, matching, propensity scores, g-methods) are layered on top. The cohort frame\ndelivers two things a case-control design cannot: it lets you estimate **absolute risks and rates** directly (numerator\nevents over a denominator of person-time you actually counted), and it lets you study **multiple outcomes** from a single\nexposure definition. 
Getting the frame right — one unambiguous time zero per person, exposure assigned *at* time zero,\ncovariates measured *before* time zero, follow-up that begins *at* time zero — is the work; nearly every notorious\ncohort bias is a violation of one of those four rules.\n\n**Pros, cons, and trade-offs.**\n- **vs case-control:** A cohort yields incidence and absolute risk, handles many outcomes at once, and avoids the\n  recall and selection biases that plague exposure ascertainment after the outcome is known. Cost: it is inefficient for\n  rare outcomes (you must follow large denominators for few events) and for outcomes with long induction periods.\n  **Prefer a cohort** when the exposure is rare or when absolute risk / multiple outcomes matter; **prefer\n  nested case-control or case-cohort** sampling *within* the cohort when an expensive covariate (biomarker, chart\n  abstraction, adjudication) must be collected and the outcome is rare.\n- **vs retrospective cohort:** Prospective measurement (or prospective *design* in RWD) lets you specify exposure and\n  confounders before knowing outcomes, which protects against data-driven definition tweaking and against conditioning on\n  post-baseline variables. Cost: in primary-data collection it is slower and more expensive; in RWD the trade is that you\n  are limited to variables the data captured, captured *before* time zero.\n- **vs cross-sectional:** A cohort establishes temporality (exposure precedes outcome), the single most important\n  ingredient for causal interpretation, which a cross-sectional snapshot cannot. Cost: follow-up infrastructure,\n  loss to follow-up, and competing risks.\n- **vs RCT:** A cohort can study harms, long-term outcomes, rare exposures, and populations excluded from trials, at\n  real-world scale and cost. 
Cost: treatment assignment is not randomized, so confounding (especially **confounding by\n  indication**) is the central threat and must be addressed by design (active comparator, new-user restriction) and\n  analysis (PS methods, negative controls), not assumed away.\n\n**When to use.** Estimating incidence or absolute risk; comparing outcomes across exposure groups when the exposure is\nreasonably common; studying multiple outcomes of one exposure; long-term safety and effectiveness questions; any setting\nwhere you can define a clean time zero and measure confounders before it. In pharmacoepidemiology the prospective cohort\nis the default frame, almost always implemented as a **new-user (incident-user)** cohort with an **active comparator** so\nthat follow-up starts at initiation for everyone and the two arms cleared the same clinical threshold to be treated.\n\n**When NOT to use — and when it is actively misleading or dangerous.**\n- **Very rare outcomes with expensive covariates.** Following a huge cohort to capture a handful of events wastes\n  measurement resources; a nested case-control or case-cohort design recovers nearly all the efficiency at a fraction of\n  the cost. Forcing a full cohort here is not *wrong*, but it is the wrong tool.\n- **Ill-defined or person-varying time zero.** If time zero is set at a point that itself depends on future events\n  (e.g., starting follow-up at diagnosis but classifying exposure by a treatment received *later*), you manufacture\n  **immortal time bias**: exposed person-time before the drug is dispensed is guaranteed event-free and spuriously\n  favors the exposed. 
This is the single most common fatal error in RWD cohorts and is *actively misleading* — it can\n  invert the sign of an effect.\n- **Prevalent-user (ever-exposed) cohorts.** Starting follow-up among current users mixes people at different points in\n  their treatment trajectory, induces **depletion of susceptibles** (survivors tolerate the drug and look healthier),\n  and forces adjustment for variables on the causal pathway. Use a new-user cohort unless initiation is too rare, in\n  which case consider the prevalent-new-user (Suissa) extension.\n- **Differential loss to follow-up by exposure.** If the exposed and unexposed are censored for outcome-related reasons\n  at different rates (informative censoring), naive estimates are biased; this demands explicit observation windows and,\n  often, inverse-probability-of-censoring weighting.\n- **No way to control confounding by indication.** A cohort comparing treated vs untreated for a condition that itself\n  predicts the outcome, with no active comparator and no measured confounders, produces a confounded contrast that looks\n  quantitative but is not interpretable as causal.\n\n**Data-source operational depth.**\n- **Claims (FFS or commercial):** The natural substrate for incident-user cohorts. Exposure = the pharmacy claim\n  (NDC + `fill_date` + `days_supply`); diagnoses come from medical claims (ICD-10-CM on professional/facility lines).\n  Require **continuous medical + pharmacy enrollment** across the whole baseline/washout window so that \"no prior fill\"\n  is a real observation, not unobserved person-time. 
*Failure modes:* (1) **Medicare Advantage / capitated person-time\n  lacks fee-for-service claims** — utilization and fills are invisible, so absence of an event or a prior fill is\n  missingness; restrict to enrollees with Parts A/B/D (or a commercial medical+pharmacy benefit) and drop MA-only spans.\n  (2) **Differential competing risks by exposure in elderly claims** — death is a competing event that is often\n  unobserved unless a death index or Part A inpatient-discharge-status is linked; if one arm is older/sicker, ignoring\n  the competing risk overstates the cumulative incidence of the event of interest (use Fine-Gray or report\n  cause-specific *and* cumulative-incidence estimates). (3) **Immortal time in procedure/treatment studies** — defining\n  the exposed group by receipt of a procedure but starting the clock at an earlier landmark builds guaranteed survival\n  into the exposed; align time zero to the procedure or use a landmark/time-varying treatment.\n- **EHR:** Time zero is the *order* or *administration*, not a dispensing; problem lists, labs, and notes sharpen\n  indication and baseline severity (an advantage over claims), but **visit-driven capture** means a patient who leaves\n  the system disappears — define observation windows explicitly and treat loss to follow-up as potentially informative.\n  Linkage to pharmacy fills is preferred to confirm the patient actually started.\n- **Registry:** Strongest for indication, disease severity, and adjudicated/validated outcomes (e.g., cancer stage,\n  cause of death); typically weak for complete medication exposure and for non-registry comorbidity. 
Link to claims for\n  the full fill history and to a death index to firm up censoring.\n- **Linked claims–EHR–vital records:** The ideal substrate (EHR severity + claims completeness + reliable mortality),\n  but linkage introduces **selection** (only the linkable subset) and **date-discrepancy** issues (order vs fill vs\n  service date) that must be reconciled before time zero is assigned.\n\n**Worked claims example.** Question: 2-year cumulative incidence of hospitalized GI bleed among adults *initiating* a\nnon-selective NSAID, in a commercial + Medicare FFS database. (1) **Eligibility:** age ≥18 and ≥365 days of continuous\nmedical + pharmacy enrollment before the first NSAID fill (FFS-observable, no MA-only spans). (2) **Washout:** no NSAID\nfill in the 365-day lookback — this makes the cohort *incident users* and removes prevalent-user bias. (3) **Time zero:**\nthe date of that first qualifying fill (the `fill_date` of the index NDC). (4) **Baseline covariates:** measured only in\nthe 365 days up to and including time zero (prior GI bleed, anticoagulant/antiplatelet use, age, utilization), so no\ncovariate is on the causal pathway. (5) **Follow-up and person-time:** from time zero forward to the first inpatient\nclaim with a principal GI-bleed diagnosis; **censor** at disenrollment, 2 years, or end of data, and record death (from\na linked death index) as a **competing event** rather than as a censoring event or an outcome. Do *not* count time after\nthe `days_supply` runs out as \"exposed\" person-time: in an as-treated analysis, the on-treatment window is the stitched\n`days_supply` episodes plus a pre-specified grace period. (6) **Estimand:** report the cumulative incidence function\ntreating death as a competing risk (not 1 − KM), plus the incidence rate per 1,000 person-years; compare exposure groups\nwith an active comparator (e.g., a different analgesic class) and PS adjustment rather than against never-users.",
    "primary_category": "Study_Design",
    "tags": [
      "cohort",
      "prospective",
      "longitudinal",
      "incidence",
      "time-zero",
      "new-user-design",
      "person-time",
      "target-trial",
      "pharmacoepidemiology"
    ],
    "applies_to_study_types": [
      "cohort_prospective"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1093/aje/kwg231",
        "url": "https://doi.org/10.1093/aje/kwg231",
        "citation_text": "Ray WA. Evaluating medication effects outside of clinical trials: new-user designs. American Journal of Epidemiology. 2003;158(9):915-920.",
        "year": 2003,
        "authors_short": "Ray",
        "notes": "Foundational statement of the incident-user cohort and the biases (immortal time, depletion of susceptibles) that prevalent-user cohorts induce; the default frame for pharmacoepidemiologic cohorts."
      },
      {
        "role": "introduce",
        "doi": "10.1093/aje/kwv254",
        "url": "https://doi.org/10.1093/aje/kwv254",
        "citation_text": "Hernán MA, Robins JM. Using big data to emulate a target trial when a randomized trial is not available. American Journal of Epidemiology. 2016;183(8):758-764.",
        "year": 2016,
        "authors_short": "Hernán & Robins",
        "notes": "Formalizes the prospective-analytic posture of an observational cohort as emulation of a target trial, fixing eligibility, time zero, treatment assignment, and follow-up a priori."
      },
      {
        "role": "explain",
        "doi": "10.1002/pds.5083",
        "url": "https://doi.org/10.1002/pds.5083",
        "citation_text": "Suissa S, Dell'Aniello S. Time-related biases in pharmacoepidemiology. Pharmacoepidemiology and Drug Safety. 2020;29(9):1101-1110.",
        "year": 2020,
        "authors_short": "Suissa & Dell'Aniello",
        "notes": "Definitive taxonomy of immortal time, time-window, and depletion-of-susceptibles biases that arise from misaligned time zero in cohort follow-up."
      },
      {
        "role": "demonstrate",
        "doi": "10.1371/journal.pmed.0040296",
        "url": "https://doi.org/10.1371/journal.pmed.0040296",
        "citation_text": "von Elm E, Altman DG, Egger M, Pocock SJ, Gøtzsche PC, Vandenbroucke JP. The Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) statement. PLoS Medicine. 2007;4(10):e296.",
        "year": 2007,
        "authors_short": "von Elm et al.",
        "notes": "Reporting standard with explicit cohort-study items (eligibility, follow-up, person-time, attrition) used to specify and report prospective cohorts."
      },
      {
        "role": "use",
        "doi": "10.1093/ije/dyv098",
        "url": "https://doi.org/10.1093/ije/dyv098",
        "citation_text": "Herrett E, Gallagher AM, Bhaskaran K, et al. Data Resource Profile: Clinical Practice Research Datalink (CPRD). International Journal of Epidemiology. 2015;44(3):827-836.",
        "year": 2015,
        "authors_short": "Herrett et al.",
        "notes": "Profiles a major real-world data resource purpose-built for prospective-design cohort follow-up with linkage to hospital and mortality records."
      }
    ],
    "plain_language_summary": "A prospective cohort study starts with a group of people sorted by whether they did or did not get a treatment, marks that starting moment as everyone's day zero, and then watches them move forward in time to see who later develops the outcome you care about. Because you decide who's in which group before any outcomes happen, you get to measure things in the same order you want to reason about them: cause first, effect second. The payoff is that you can directly count how often the outcome happens (the actual risk), and you can track several different outcomes from the same starting group. The catch is that it's slow and wasteful when the outcome is very rare, since you have to watch a lot of people to catch a few events.",
    "key_terms": [
      {
        "term": "time zero",
        "definition": "Each person's day zero, the single agreed-upon moment when their follow-up clock starts ticking, usually the day they begin the treatment being studied."
      },
      {
        "term": "incident outcome",
        "definition": "A brand-new occurrence of the outcome that happens after time zero, not one a person already had before they entered the study."
      },
      {
        "term": "cumulative incidence",
        "definition": "The share of the starting group that develops the outcome over the follow-up period, found by dividing the number of people with the event by the number of people you started with."
      },
      {
        "term": "censor",
        "definition": "Stopping the clock on a person before the study ends because you can no longer watch them, for example they left their insurance plan, so you stop counting their time without recording an outcome."
      },
      {
        "term": "fill_date",
        "definition": "The calendar date a pharmacy prescription was filled, the field in a claims table that usually marks when a patient actually started a drug."
      }
    ],
    "worked_example": {
      "scenario": "We want the 2-year (730-day) risk of a hospitalized GI bleed among adults who start taking an NSAID pain reliever. We have a tiny claims dataset of four patients, each with the pharmacy fill that marks their first NSAID. We set each person's day zero to that first fill, then follow every patient forward to see who is hospitalized for a GI bleed before two years are up. We then count the events and divide by the four people we started with.",
      "dataset": {
        "caption": "The raw rows an analyst would see: one starting fill per patient, plus what happened during forward follow-up.",
        "columns": [
          "person_id",
          "fill_date",
          "drug",
          "days_followed",
          "outcome"
        ],
        "rows": [
          [
            1001,
            "2023-01-15",
            "ibuprofen",
            180,
            "GI bleed hospitalization"
          ],
          [
            1002,
            "2023-02-03",
            "ibuprofen",
            730,
            "none"
          ],
          [
            1003,
            "2023-02-20",
            "naproxen",
            730,
            "none"
          ],
          [
            1004,
            "2023-03-11",
            "naproxen",
            400,
            "none (left plan, censored)"
          ]
        ]
      },
      "steps": [
        "For each patient, day zero is their first NSAID fill_date, and the clock starts there and only moves forward.",
        "Patient 1001 is hospitalized for a GI bleed 180 days after starting, so they count as one event.",
        "Patients 1002 and 1003 are watched the full 730 days and never have the event.",
        "Patient 1004 leaves their insurance plan at day 400, so we stop watching them with no event recorded; they are censored but still part of the four people we started with.",
        "Count the events (1) and divide by the number of people who started (4)."
      ],
      "result": "2-year cumulative incidence = 1 GI-bleed event / 4 patients enrolled at time zero = 0.25, or about 25%.",
      "timeline_spec": {
        "title": "Prospective cohort: enroll four NSAID initiators at time zero, then follow forward to see who develops a GI bleed",
        "window": {
          "start": "2023-01-15",
          "end": "2025-01-14",
          "label": "Observation window: up to 730 days (2 years) of forward follow-up per patient"
        },
        "events": [
          {
            "label": "Patient 1001 enrolls (first ibuprofen fill)",
            "start": "2023-01-15",
            "length_days": 180,
            "quantity": "180 days followed -> event"
          },
          {
            "label": "Patient 1002 enrolls (first ibuprofen fill)",
            "start": "2023-02-03",
            "length_days": 730,
            "quantity": "730 days followed -> event-free"
          },
          {
            "label": "Patient 1003 enrolls (first naproxen fill)",
            "start": "2023-02-20",
            "length_days": 730,
            "quantity": "730 days followed -> event-free"
          },
          {
            "label": "Patient 1004 enrolls (first naproxen fill)",
            "start": "2023-03-11",
            "length_days": 400,
            "quantity": "400 days followed -> censored, event-free"
          }
        ],
        "spans": [
          {
            "kind": "exposed",
            "start": "2023-01-15",
            "end": "2023-07-14",
            "label": "1001 followed 180 days, then GI bleed (the one event)"
          },
          {
            "kind": "followup",
            "start": "2023-02-03",
            "end": "2025-02-02",
            "label": "1002 followed full 730 days, event-free"
          },
          {
            "kind": "followup",
            "start": "2023-02-20",
            "end": "2025-02-19",
            "label": "1003 followed full 730 days, event-free"
          },
          {
            "kind": "followup",
            "start": "2023-03-11",
            "end": "2024-04-14",
            "label": "1004 followed 400 days, then censored"
          }
        ],
        "result": {
          "label": "1 GI-bleed event / 4 patients enrolled = cumulative incidence 0.25",
          "value": 0.25
        }
      }
    },
    "prerequisites": [],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Incident (new-user) cohort",
        "description": "Time zero is first exposure after a drug-free washout, so follow-up starts at initiation for everyone and exposed person-time is counted only from the moment exposure begins.",
        "edge_cases": [
          "Requires continuous, exposure-observable enrollment across the full washout so \"no prior use\" is observed rather than missing.",
          "Rare initiation yields small cohorts; consider the prevalent-new-user (Suissa) extension when incident use is too uncommon."
        ],
        "data_source_notes": "claims: washout on continuous A/B/D or commercial pharmacy enrollment, exclude MA-only spans; EHR: anchor on first order/administration and confirm with linked fills."
      },
      {
        "name": "Closed vs open (dynamic) cohort",
        "description": "A closed cohort fixes membership at time zero; an open/dynamic cohort lets people enter and leave the risk set over calendar time, accruing person-time only while at risk and observable.",
        "edge_cases": [
          "Open cohorts must handle late entry (left truncation) so person-time before observation is not counted as at risk.",
          "Re-entry rules (e.g., after disenrollment gaps) must be pre-specified to avoid immortal or double-counted person-time."
        ],
        "data_source_notes": "claims: model enrollment spans explicitly; allow late entry with age or time-on-study as the time scale."
      },
      {
        "name": "Prospective-analytic emulation in RWD (target-trial framing)",
        "description": "An administrative database analyzed as if conducted prospectively — protocol fixes eligibility, time zero, treatment strategies, and follow-up before looking at outcomes, eliminating data-driven definition drift.",
        "edge_cases": [
          "Grace periods and eligibility-time ambiguity can re-introduce immortal time; clone-censor-weight may be needed for sustained strategies.",
          "As-treated estimands require informative-censoring weighting when discontinuation/switching differ by arm."
        ],
        "data_source_notes": "Specify exposure, covariate, and outcome windows relative to time zero; never use post-time-zero information to define eligibility."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Case-control study",
        "pros_of_this": "Estimates incidence and absolute risk directly, supports multiple outcomes from one exposure definition, and avoids recall/selection bias in exposure ascertainment.",
        "cons_of_this": "Inefficient for rare outcomes or long induction periods; requires large denominators and follow-up infrastructure.",
        "when_to_prefer": "When the exposure is rare, when absolute risk or multiple outcomes are of interest, or when temporality must be unambiguous."
      },
      {
        "compared_to": "Nested case-control / case-cohort within the cohort",
        "pros_of_this": "Uses the full cohort's person-time and supports many outcomes without re-sampling.",
        "cons_of_this": "Wastes expensive covariate measurement (biomarkers, chart abstraction, adjudication) on the vast event-free majority when the outcome is rare.",
        "when_to_prefer": "When all needed covariates are already in the data (e.g., claims) and the outcome is not vanishingly rare."
      },
      {
        "compared_to": "Retrospective cohort",
        "pros_of_this": "Exposure and confounders are fixed before outcomes are known (or before they are examined), protecting against data-driven definition tweaking and conditioning on post-baseline variables.",
        "cons_of_this": "Slower and costlier in primary data collection; in RWD limited to variables captured before time zero.",
        "when_to_prefer": "When protocol pre-specification and temporal integrity are paramount and the design can be locked before outcome examination."
      },
      {
        "compared_to": "Randomized controlled trial",
        "pros_of_this": "Studies harms, long-term and rare exposures, and trial-excluded populations at real-world scale and cost.",
        "cons_of_this": "No randomization, so confounding by indication is the central threat and must be handled by design and analysis.",
        "when_to_prefer": "When randomization is infeasible, unethical, or cannot answer the long-term/real-world question."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Exposure = pharmacy claim (NDC + fill_date + days_supply); diagnoses from medical-claim ICD-10-CM. Require continuous medical+pharmacy enrollment across the full baseline/washout so absence of prior use is observed; exclude Medicare Advantage / capitated person-time that lacks FFS claims. Link a death index to capture the competing risk and to censor. Time zero = first qualifying fill; count exposed person-time only from time zero.",
      "ehr": "Time zero = order or administration; confirm initiation with linked dispensing where possible. Visit-driven capture makes loss to follow-up potentially informative — define observation windows explicitly. Problem lists, labs, and notes improve baseline severity and indication over claims.",
      "registry": "Strong for indication, severity, and adjudicated outcomes; weak for full medication exposure. Link to claims for fills and to a death index for censoring and mortality outcomes.",
      "linked": "Claims-EHR-vital-records is the ideal substrate (severity + completeness + mortality) but introduces linkage selection and order/fill/service date discrepancies that must be reconciled before time-zero assignment."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\n\nWASHOUT_DAYS = 365   # drug-free + continuous-enrollment lookback that makes a user \"incident\"\nSTUDY_NDCS = {...}   # set of NDCs defining the exposure of interest\n\ndef build_prospective_cohort(rx: pd.DataFrame, enroll: pd.DataFrame) -> pd.DataFrame:\n    rx = rx.sort_values([\"person_id\", \"fill_date\"])\n    study = rx[rx[\"ndc\"].isin(STUDY_NDCS)]\n\n    # Candidate time zero = first fill of the study exposure for each person.\n    idx = (study.groupby(\"person_id\", as_index=False)\n                .first()\n                .rename(columns={\"fill_date\": \"index_date\"})[[\"person_id\", \"index_date\"]])\n\n    # New-user restriction: no study fill in the washout window strictly before time zero.\n    prior = study.merge(idx, on=\"person_id\")\n    prior_in_washout = prior[(prior[\"fill_date\"] < prior[\"index_date\"]) &\n                             (prior[\"fill_date\"] >= prior[\"index_date\"] - pd.Timedelta(days=WASHOUT_DAYS))]\n    idx = idx[~idx[\"person_id\"].isin(prior_in_washout[\"person_id\"])].copy()\n\n    # Continuous, FFS-observable enrollment spanning the full washout through time zero (no MA-only gaps).\n    e = enroll.merge(idx, on=\"person_id\")\n    e[\"covers\"] = ((e[\"enroll_start\"] <= e[\"index_date\"] - pd.Timedelta(days=WASHOUT_DAYS)) &\n                   (e[\"enroll_end\"]   >= e[\"index_date\"]) &\n                   (~e[\"ma_only\"]))\n    eligible = e.loc[e[\"covers\"], \"person_id\"].unique()\n\n    cohort = idx[idx[\"person_id\"].isin(eligible)].copy()\n    # Baseline covariate window: measure confounders strictly up to and including time zero (never after).\n    cohort[\"baseline_start\"] = cohort[\"index_date\"] - pd.Timedelta(days=WASHOUT_DAYS)\n    # Follow-up starts AT time zero -> no immortal time. 
Censoring (disenroll/death/end-of-data) added downstream.\n    cohort[\"followup_start\"] = cohort[\"index_date\"]\n    return cohort[[\"person_id\", \"index_date\", \"baseline_start\", \"followup_start\"]]",
        "description": "Prospective (incident-user) cohort construction from claims-style inputs. Required inputs (already cleaned, de-duplicated):\n  rx     : pharmacy fills    -> person_id, fill_date (datetime64), ndc, days_supply\n  enroll : enrollment spans  -> person_id, enroll_start, enroll_end (datetime64), ma_only (bool)  # ma_only spans lack FFS claims\n  dx     : medical diagnoses -> person_id, dx_date (datetime64), icd10  # used downstream for covariates/outcomes\nReturns one row per eligible new initiator with time zero and the baseline-window bounds. This is COHORT CONSTRUCTION\nonly; estimation (incidence, cumulative incidence with competing risks, PS adjustment) happens downstream on this frame.",
        "dependencies": [
          "pandas"
        ],
        "source_citations": [
          "ray-2003"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\nWASHOUT_DAYS <- 365L\nSTUDY_NDCS <- c(...)  # NDCs defining the exposure of interest\n\nbuild_prospective_cohort <- function(rx, enroll) {\n  setDT(rx); setDT(enroll)\n  setorder(rx, person_id, fill_date)\n  study <- rx[ndc %chin% STUDY_NDCS]\n\n  # Candidate time zero = first study fill per person.\n  idx <- study[, .(index_date = fill_date[1L]), by = person_id]\n\n  # New-user restriction: drop anyone with a study fill in the washout window before time zero.\n  study <- merge(study, idx, by = \"person_id\")\n  prior_ids <- unique(study[fill_date < index_date &\n                            fill_date >= index_date - WASHOUT_DAYS, person_id])\n  idx <- idx[!person_id %chin% prior_ids]\n\n  # Continuous, FFS-observable enrollment across the full washout through time zero (no MA-only spans).\n  e <- merge(enroll, idx, by = \"person_id\")\n  ok <- e[enroll_start <= index_date - WASHOUT_DAYS &\n          enroll_end   >= index_date & !ma_only, unique(person_id)]\n\n  cohort <- idx[person_id %chin% ok]\n  cohort[, baseline_start := index_date - WASHOUT_DAYS]  # covariate window ends at time zero\n  cohort[, followup_start := index_date]                 # follow-up begins at time zero -> no immortal time\n  cohort[, .(person_id, index_date, baseline_start, followup_start)]\n}",
        "description": "Prospective (incident-user) cohort construction with data.table. Inputs mirror the Python version:\n  rx     : person_id, fill_date (Date), ndc, days_supply\n  enroll : person_id, enroll_start, enroll_end (Date), ma_only (logical)\nReturns one row per eligible new initiator with time zero and baseline-window bounds. Construction only; estimation\n(e.g., cmprsk::cuminc for competing-risk cumulative incidence) runs downstream on this frame.",
        "dependencies": [
          "data.table"
        ],
        "source_citations": [
          "ray-2003"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "%let washout = 365;\n%let study_ndcs = '00000-0000-00';  /* quoted, comma-separated NDC list defining the exposure */\n\n/* Candidate time zero = first study-exposure fill per person. */\nproc sql;\n  create table idx as\n    select person_id, min(fill_date) as index_date format=date9.\n    from work.rx\n    where ndc in (&study_ndcs)\n    group by person_id;\nquit;\n\n/* New-user restriction: exclude any prior study fill inside the washout window before time zero. */\nproc sql;\n  create table newuser as\n    select i.*\n    from idx i\n    where not exists (\n      select 1 from work.rx p\n      where p.person_id = i.person_id\n        and p.ndc in (&study_ndcs)\n        and p.fill_date <  i.index_date\n        and p.fill_date >= i.index_date - &washout\n    );\nquit;\n\n/* Continuous, FFS-observable enrollment across the full washout through time zero (no MA-only spans). */\nproc sql;\n  create table cohort as\n    select n.person_id,\n           n.index_date,\n           n.index_date - &washout as baseline_start format=date9.,  /* covariate window ends at time zero */\n           n.index_date            as followup_start  format=date9.   /* follow-up begins at time zero */\n    from newuser n\n    where exists (\n      select 1 from work.enroll e\n      where e.person_id = n.person_id\n        and e.ma_only = 0\n        and e.enroll_start <= n.index_date - &washout\n        and e.enroll_end   >= n.index_date\n    );\nquit;",
        "description": "Prospective (incident-user) cohort construction in SAS using PROC SQL. Required input datasets (post data-management):\n  work.rx     : person_id, fill_date (SAS date), ndc, days_supply\n  work.enroll : person_id, enroll_start, enroll_end (SAS dates), ma_only (0/1)\nProduces work.cohort: one row per eligible new initiator with time zero and the baseline window. This is cohort\nCONSTRUCTION; downstream you build covariates in [baseline_start, index_date] and estimate cumulative incidence with\ncompeting risks via PROC PHREG (eventcode=) / PROC LIFETEST (plots=cif). %let study_ndcs supplies the exposure NDC list.",
        "dependencies": [],
        "source_citations": [
          "ray-2003"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": "cohort-prospective-timeline.svg",
        "mermaid": null,
        "caption": "Data-grounded worked-example timeline (beginner layer), drawn to scale from worked_example.timeline_spec so the picture matches the numbers.",
        "alt_text": "Timeline for the worked example of cohort-prospective.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Pop[Source population in the data] --> Elig[Eligibility + continuous, exposure-observable enrollment<br/>across the full baseline/washout window]\n  Elig --> T0[Time zero = first exposure after drug-free washout<br/>assign exposure group at time zero]\n  T0 --> Base[Baseline covariates measured ONLY up to and including time zero<br/>no post-time-zero information]\n  Base --> Fup[Follow-up FORWARD from time zero<br/>count person-time at risk while observable]\n  Fup --> Out[Incident outcome ascertainment<br/>identical rules across exposure groups]\n  Fup --> Cen[Censor at disenrollment / death competing risk / end of data]\n  Out --> Est[Estimand: incidence rate, cumulative incidence with competing risks, hazard ratio]\n  Cen --> Est",
        "caption": "Construction of a prospective (incident-user) cohort. Enrollment makes prior non-use observable, time zero aligns follow-up at exposure, covariates are measured before time zero, and person-time accrues forward — the four rules whose violation produces immortal-time, selection, and prevalent-user bias.",
        "alt_text": "Flowchart from source population through eligibility and washout, time zero at first exposure, pre-time-zero covariate measurement, forward follow-up with censoring and competing risk, to the estimand.",
        "source_type": "illustrative",
        "source_citations": [
          "ray-2003"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "gantt\n  title Prospective cohort timeline for one incident user (claims)\n  dateFormat YYYY-MM-DD\n  axisFormat %b %Y\n  section Baseline\n  Continuous enrollment + washout (no prior study fill) :done, wash, 2023-01-01, 2023-12-31\n  section Time zero\n  First study fill -> exposure group assigned :milestone, t0, 2024-01-01, 0d\n  section Follow-up (forward)\n  Person-time at risk (forward from time zero) :active, fu, 2024-01-01, 365d\n  First incident outcome OR censor (disenroll/death/data end) :crit, cen, 2024-12-31, 1d",
        "caption": "Time-zero alignment for one initiator. Baseline is measured before the index fill and follow-up starts at the fill, so there is no immortal time and no covariate on the causal pathway.",
        "alt_text": "Gantt timeline showing a 365-day washout in 2023, time zero at the first fill on 2024-01-01, forward person-time at risk, and an outcome-or-censor endpoint at end of 2024.",
        "source_type": "illustrative",
        "source_citations": [
          "ray-2003"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "is_variant_of",
        "target_slug": "active-comparator-new-user",
        "notes": "The active-comparator new-user design is the standard pharmacoepidemiologic instantiation of a prospective cohort — it adds an active comparator and a drug-free washout to the forward-follow-up frame."
      },
      {
        "relation_type": "used_with",
        "target_slug": "target-trial-emulation",
        "notes": "A prospective-analytic cohort in RWD is operationalized as a target-trial emulation, which fixes eligibility, time zero, treatment, and follow-up a priori."
      },
      {
        "relation_type": "see_also",
        "target_slug": "immortal-time-bias-handling",
        "notes": "Starting follow-up at time zero (the exposure decision) rather than at an earlier landmark prevents the immortal time that misaligned cohorts manufacture."
      },
      {
        "relation_type": "used_with",
        "target_slug": "propensity-score-methods-psm-iptw",
        "notes": "Confounding in a non-randomized cohort is addressed by PS matching, IPTW, or overlap weighting on covariates measured before time zero."
      },
      {
        "relation_type": "alternative_to",
        "target_slug": "case-control",
        "notes": "A cohort samples on exposure and looks forward (yielding incidence and absolute risk); a case-control study samples on outcome and looks backward (efficient for rare outcomes)."
      }
    ],
    "aliases": [
      "prospective cohort",
      "incident-user cohort",
      "longitudinal cohort study",
      "forward-looking cohort",
      "follow-up study"
    ],
    "complexity": "foundational",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "cohort-retrospective",
    "name": "Retrospective Cohort Study Design",
    "short_definition": "An observational cohort design built entirely from already-recorded data, in which exposure status is assigned at a time-zero in the past and the cohort is then followed forward through the existing record to ascertain outcomes, so that both the exposure decision and the follow-up have already occurred by the time the study begins.",
    "long_description": "A **retrospective cohort** (also *historical* or *non-concurrent* cohort) follows the same exposure-to-outcome logic\nas a prospective cohort, but does so inside data that already exist. The investigator identifies a cohort by exposure\nstatus at a past **time-zero (index)**, then \"follows\" each person forward through the recorded data — claims,\nelectronic health records (EHR), registries, or linked sources — to observe outcomes, censoring, resource use, or\nother endpoints that have, in calendar time, already happened. The defining feature is *temporality of data capture*,\nnot the direction of inference: like any cohort, it moves exposure → outcome; unlike a prospective cohort, the data\nwere collected before the research question was posed and for reasons other than the study (billing, care delivery,\nsurveillance).\n\n**Core conceptual distinction — design vs estimand vs data.** Three things must be held apart and pre-specified.\n(1) *The design* fixes time-zero, eligibility, the exposure contrast, and the censoring rules. (2) *The estimand* —\nthe causal contrast (e.g., a per-protocol vs intention-to-treat-analogue hazard ratio, cumulative incidence, or rate\ndifference) and the target population — dictates the model, not the reverse. A retrospective cohort can support a\ncause-specific hazard, a subdistribution effect under competing risks, restricted mean survival time, or a rate ratio;\nthe design does not choose for you. (3) *The data* constrain which estimands are even identifiable. The modern,\ndefensible way to specify all three at once is **target-trial emulation**: write down the protocol of the randomized\ntrial you wish you could run (eligibility, treatment strategies, assignment, time-zero, outcome, causal contrast,\nanalysis), then build the retrospective cohort to emulate it. 
Retrospective cohorts that are not anchored to an\nexplicit time-zero and a clear estimand are where immortal time, prevalent-user bias, and post-baseline adjustment\nsilently enter.\n\n**Pros, cons, and trade-offs.**\n- **vs prospective cohort:** The retrospective cohort uses data that already exist, so it is fast, large, inexpensive\n  at the margin, and can deliver long follow-up the day the study starts — decisive when the outcome is rare or the\n  latency is long, or when prospective enrollment would be unethical or impractical. Cost: you are limited to what was\n  recorded; no biospecimens, no protocolized PROs, no adjudicated endpoints at fixed visits unless you add linkage or a\n  hybrid arm. **Prefer retrospective** for comparative safety/effectiveness, drug-utilization, and HCRU questions where\n  the constructs are codeable; **prefer prospective** when the exposure or outcome cannot be reconstructed from routine\n  data with acceptable misclassification.\n- **vs nested case-control / case-cohort within the same source:** The full retrospective cohort retains person-time\n  for every member, supports time-varying exposure and absolute-risk estimands, and is reusable for multiple outcomes.\n  Cost: it is more computationally and data-management intensive than sampling controls. **Prefer the full cohort**\n  when you need rates, cumulative incidence, or several endpoints; **prefer nested sampling** when an expensive\n  covariate (chart review, biomarker) must be collected on a subset.\n- **vs self-controlled designs (SCCS, case-crossover):** Self-controlled designs eliminate all time-fixed confounding\n  by using the person as their own control, which a between-person retrospective cohort cannot. Cost: they require a\n  transient exposure and an acute outcome and cannot estimate effects of stable exposures. 
**Prefer self-controlled**\n  for acute-on-transient questions (e.g., vaccine-fever); **prefer the cohort** for chronic therapies and absolute\n  risk.\n- **vs prevalent-user / cross-sectional analyses of the same data:** A new-user retrospective cohort aligns time-zero\n  at initiation and so avoids depletion of susceptibles and adjustment for post-initiation variables. Cost: smaller N\n  and a population skewed toward initiators. This is a *variant choice within* the retrospective cohort, not a\n  different data source.\n\n**When to use.** Comparative effectiveness or safety of treatments codeable in routine data; incidence, prevalence,\nand natural-history questions; HCRU and cost endpoints; long-latency or rare outcomes where prospective follow-up is\ninfeasible; and as the data-construction backbone of any target-trial emulation. The design is appropriate whenever a\ndefensible time-zero, an observable exposure contrast, and validated outcome and covariate definitions can be built\nfrom the source.\n\n**When NOT to use — and when it is actively misleading or dangerous.**\n- **No clean time-zero exists.** If you start follow-up at a point everyone in one arm has *already survived to*\n  (e.g., follow-up from a procedure date but classifying exposure by a prescription filled later), you manufacture\n  **immortal time** that spuriously favors the exposed. 
If you cannot place a single, arm-symmetric time-zero, do not\n  run the cohort as specified; restructure it or use a self-controlled design.\n- **Exposure or outcome cannot be reconstructed.** Over-the-counter drugs, inpatient-administered agents invisible in\n  pharmacy claims, free samples, and outcomes with no validated algorithm produce misclassification, often\n  differential, that no covariate adjustment fixes.\n- **Prevalent-user contrast with depletion of susceptibles.** Comparing current users to non-users on routine data\n  embeds survivor bias and adjustment for mediators; a naive prevalent-user cohort here is worse than no study.\n- **Person-time is unobservable or differentially missing.** In claims, Medicare Advantage (MA) enrollees generate no\n  fee-for-service (FFS) claims, so their exposures and outcomes are invisible; counting MA person-time as \"unexposed\n  and event-free\" biases rates. Likewise, EHR follow-up ends silently when a patient leaves the system.\n- **Competing risks ignored in an elderly or sick cohort.** When death competes with the outcome, a naive 1−KM that\n  censors deaths overstates cumulative incidence, and when death rates differ by exposure (common in older claims\n  populations) the between-arm contrast is distorted as well; the estimand (cause-specific vs subdistribution) must be\n  chosen deliberately.\n\n**Data-source operational depth.**\n- **Claims (FFS or commercial):** The substrate is eligibility, medical, and pharmacy files. Establish *observable\n  person-time* from enrollment spans, and require **continuous enrollment** (medical + pharmacy) across the full\n  lookback so that \"no prior event\" is genuine absence, not unobserved care. Exposure comes from pharmacy claims\n  (`ndc`, `fill_date`, `days_supply`) and procedures (CPT/HCPCS, ICD-10-PCS); diagnoses (ICD-10-CM) serve eligibility,\n  covariates, and validated outcome algorithms. 
*Failure modes and workarounds:* (a) **MA-only person-time lacks FFS\n  claims** — restrict to enrollees with the relevant benefit (Parts A/B/D, or commercial medical+pharmacy) and exclude\n  MA-only spans rather than treating them as event-free; (b) **adjudication/claims lag** truncates recent person-time —\n  apply a data-cutoff buffer or restrict to fully run-out claims; (c) **90-day mail-order and free samples** distort\n  `days_supply` and washout — model supply windows explicitly; (d) **death is often unobserved** in commercial claims —\n  link to a death index or treat disenrollment as potentially informative censoring.\n- **EHR:** Richer baseline (labs, vitals, severity, notes via NLP) and some outcomes, but **visit-driven, irregular\n  observation**: a patient who stops visiting is differentially lost, not event-free. Define observation windows and a\n  loss-to-follow-up rule (e.g., no encounter > N months) explicitly. Exposure is the *order/administration*, not the\n  dispensing — link to pharmacy fills to confirm the patient actually started. 
Hybridize with claims for complete\n  pharmacy and HCRU capture.\n- **Registry:** Strong, validated phenotypes and clinical endpoints (cancer stage, adjudicated events) but typically\n  incomplete longitudinal exposure; link to claims for the full fill history and to a death index for censoring.\n  Registry inclusion criteria bound generalizability.\n- **Linked claims–EHR–vital records:** Maximizes exposure completeness, baseline severity, and mortality, but\n  linkage selects only the linkable subset and creates order/fill/service date discrepancies that must be reconciled\n  *before* time-zero assignment; propagate linkage uncertainty into sensitivity analyses.\n\n**Worked claims example.** Question: among adults with type 2 diabetes initiating a second-generation sulfonylurea,\nwhat is the 2-year cumulative incidence of hospitalized heart failure, in a commercial + Medicare FFS database?\n(1) *Source population and eligibility:* age ≥ 18, ≥ 2 outpatient or 1 inpatient diabetes diagnosis (ICD-10-CM\nE11.x), and 365 days of continuous medical + pharmacy (or A/B/D) enrollment before the first qualifying fill, with all\nMA-only person-time excluded. (2) *Time-zero / index:* the date of the first sulfonylurea fill (`ndc` on `fill_date`)\nafter a clean 365-day washout with no prior sulfonylurea dispensing — this makes the cohort incident users and fixes a\nsingle, well-defined start of follow-up. (3) *Exposure assignment:* defined at index from the dispensed `ndc`; an\nas-treated variant would stitch `days_supply` into episodes with a grace period and censor at discontinuation. (4)\n*Baseline covariates:* measured only in the 365-day pre-index window (comorbidities, prior insulin, baseline HCRU,\ncomorbidity index) — never using any post-index information, to avoid conditioning on mediators. (5) *Follow-up and\noutcome:* from index to the first hospitalized HF event (a validated inpatient ICD-10-CM algorithm), applied\nidentically across the cohort. 
(6) *Censoring and competing events:* censor at disenrollment (from eligibility spans), end of data, or 2 years,\nwhichever comes first, and record death (linked index) as a competing event rather than as censoring. Because death\ncompetes with HF in this older population, report the **cumulative incidence function** (Aalen-Johansen estimator, with\nFine-Gray regression if covariate effects on it are needed) rather than a naive 1−KM. (7) *Sensitivity:*\nvary washout length (180 vs 365 days), redefine the grace period for the as-treated analysis, and add a\nnegative-control outcome to probe residual confounding.",
    "primary_category": "Study_Design",
    "tags": [
      "longitudinal",
      "cohort",
      "study-design",
      "claims",
      "ehr",
      "target-trial",
      "person-time"
    ],
    "applies_to_study_types": [
      "cohort_retrospective"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1093/aje/kwv254",
        "url": "https://doi.org/10.1093/aje/kwv254",
        "citation_text": "Hernán MA, Robins JM. Using big data to emulate a target trial when a randomized trial is not available. American Journal of Epidemiology. 2016;183(8):758-764.",
        "year": 2016,
        "authors_short": "Hernán & Robins",
        "notes": "The modern framing of the retrospective cohort — specify the target trial (eligibility, time-zero, strategies, estimand) first, then emulate it in existing data; the canonical defense against immortal time and prevalent-user bias."
      },
      {
        "role": "introduce",
        "doi": "10.1371/journal.pmed.1001885",
        "url": "https://doi.org/10.1371/journal.pmed.1001885",
        "citation_text": "Benchimol EI, Smeeth L, Guttmann A, et al. The REporting of studies Conducted using Observational Routinely-collected health Data (RECORD) statement. PLoS Medicine. 2015;12(10):e1001885.",
        "year": 2015,
        "authors_short": "Benchimol et al.",
        "notes": "Defines what a retrospective cohort built from routinely collected data must report — population derivation, codes, linkage, and observable person-time."
      },
      {
        "role": "demonstrate",
        "doi": "10.1136/bmj.k3532",
        "url": "https://doi.org/10.1136/bmj.k3532",
        "citation_text": "Langan SM, Schmidt SAJ, Wing K, et al. The reporting of studies conducted using observational routinely collected health data statement for pharmacoepidemiology (RECORD-PE). BMJ. 2018;363:k3532.",
        "year": 2018,
        "authors_short": "Langan et al.",
        "notes": "Pharmacoepidemiology extension showing, item by item, how an exposure-defined retrospective cohort in claims/EHR should operationalize and report washout, time-zero, exposure, and follow-up."
      },
      {
        "role": "use",
        "doi": "10.1002/pds.4297",
        "url": "https://doi.org/10.1002/pds.4297",
        "citation_text": "Berger ML, Sox H, Willke RJ, et al. Good practices for real-world data studies of treatment and/or comparative effectiveness: recommendations from the joint ISPOR-ISPE Special Task Force on real-world evidence in health care decision making. Pharmacoepidemiology and Drug Safety. 2017;26(9):1033-1039.",
        "year": 2017,
        "authors_short": "Berger et al.",
        "notes": "Joint ISPOR-ISPE good-practice recommendations governing the design, conduct, and credibility of retrospective cohort RWD studies used for regulatory and HTA decision-making."
      }
    ],
    "plain_language_summary": "A retrospective cohort study answers a cause-and-effect question (does taking drug A change a patient's risk of outcome B?) using health records that were already collected for other reasons, like insurance billing or routine care. The analyst picks a 'day zero' for each patient that sits in the past, splits people into groups by what treatment they were on at that moment, and then reads forward through the already-written record to see who had the outcome. The follow-up still runs forward in time, exactly like a study that enrolls patients and waits; the only difference is that everything has already happened by the time the researcher sits down. Its big payoff is speed and size, but it can only ever measure things the original record happened to capture.",
    "key_terms": [
      {
        "term": "retrospective",
        "definition": "The data were recorded before the research question was asked, so the whole study sits in the past even though the analyst runs it today."
      },
      {
        "term": "prospective cohort",
        "definition": "The contrasting design where researchers enroll patients now and collect new data forward in time as events unfold."
      },
      {
        "term": "time-zero (index date)",
        "definition": "Each patient's 'day zero,' the single calendar day when their group is assigned and follow-up starts; here it sits in the past."
      },
      {
        "term": "look-back period (washout)",
        "definition": "A stretch of record before time-zero that the analyst inspects to confirm the patient is a brand-new user and to measure their starting health."
      },
      {
        "term": "claims data",
        "definition": "Billing records an insurer keeps every time a patient fills a prescription or gets a service, reused here as a research dataset."
      },
      {
        "term": "censoring",
        "definition": "Stopping the clock on a patient when you can no longer see them, for example when they leave the insurance plan, so you don't pretend they stayed outcome-free."
      }
    ],
    "worked_example": {
      "scenario": "It is 2026 and we want to know whether adults who start a new diabetes pill go on to be hospitalized for heart failure. Instead of enrolling patients and waiting two years, we open an insurance claims database that already covers 2021 through early 2024. We follow one patient, person 5001, who first filled the new pill on 2022-01-01. Everything in this story already happened years ago; our job is just to read the record in the right order, looking back before that fill to confirm she is a new user, then looking forward to see if and when the outcome occurred.",
      "dataset": {
        "caption": "The kind of rows an analyst actually sees: one enrollment span telling us when the insurer could observe her, plus the dated events (a prior fill check, her first qualifying fill, and a later hospitalization).",
        "columns": [
          "person_id",
          "record_date",
          "record_type",
          "detail"
        ],
        "rows": [
          [
            5001,
            "2021-01-01",
            "enrollment_start",
            "continuous coverage begins"
          ],
          [
            5001,
            "2021-12-31",
            "lookback_check",
            "no diabetes-pill fill in prior 365 days -> new user"
          ],
          [
            5001,
            "2022-01-01",
            "index_fill",
            "first fill of the new diabetes pill (time-zero)"
          ],
          [
            5001,
            "2023-06-15",
            "outcome",
            "hospitalized for heart failure"
          ],
          [
            5001,
            "2024-03-01",
            "enrollment_end",
            "still covered through end of available data"
          ]
        ]
      },
      "steps": [
        "Set time-zero (day zero) to her first qualifying fill on 2022-01-01; this is the day she joins the new-user group, and it sits squarely in the past.",
        "Look BACK across the 365 days before time-zero (2021-01-01 to 2021-12-31) and confirm she had continuous coverage and no earlier fill of this drug class, so she truly is a brand-new user and we can measure her starting health from that window only.",
        "Look FORWARD from time-zero through the already-written record, watching for the outcome; she is hospitalized for heart failure on 2023-06-15.",
        "Count the follow-up time from day zero to the event: 2022-01-01 to 2023-06-15 is 530 days, about 17.4 months.",
        "If no event had appeared, we would have followed her until coverage ended or until a fixed 2-year (730-day) cap, then censored her there rather than assume she stayed event-free."
      ],
      "result": "Follow-up from time-zero (2022-01-01) to the heart-failure hospitalization (2023-06-15) = 530 days = 17.4 months. This single event would later be pooled with every other patient's person-time to estimate a rate; the design's whole job was to place day zero correctly and read look-back before it and follow-up after it.",
      "timeline_spec": {
        "title": "One retrospective-cohort patient: look-back sits before a past time-zero, follow-up reads forward through the existing record",
        "window": {
          "start": "2021-01-01",
          "end": "2024-01-01",
          "label": "Entire timeline already recorded before the 2026 analysis began"
        },
        "events": [
          {
            "label": "Person 5001 follow-up",
            "start": "2022-01-01",
            "length_days": 530,
            "quantity": "530 observed follow-up days to the outcome"
          }
        ],
        "spans": [
          {
            "kind": "washout",
            "start": "2021-01-01",
            "end": "2021-12-31",
            "label": "365-day look-back: confirm new user, measure baseline (pre-index only)"
          },
          {
            "kind": "followup",
            "start": "2022-01-01",
            "end": "2023-06-15",
            "label": "530 days of forward follow-up after time-zero"
          }
        ],
        "result": {
          "label": "Time-zero 2022-01-01 -> outcome 2023-06-15 = 530 follow-up days (17.4 months)",
          "value": 530
        },
        "caption": "Time-zero (2022-01-01) sits in the past. To its left is the 365-day look-back the analyst reads to confirm a new user and capture baseline health; to its right is forward follow-up through the already-recorded data, ending at the heart-failure event on day 530. The follow-up direction is forward, identical to a prospective study; only the recording happened earlier.",
        "alt_text": "Horizontal timeline from 2021 to early 2024. A shaded 365-day look-back block runs from 2021-01-01 to 2021-12-31, then a time-zero marker at 2022-01-01, then a longer forward follow-up block of 530 days ending at an outcome marker on 2023-06-15. A note states the whole timeline was already recorded before the 2026 analysis."
      }
    },
    "prerequisites": [
      "cohort-prospective",
      "time-zero-index-date-alignment-rwe",
      "washout-clean-lookback-period-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "New-user (incident-user) retrospective cohort with washout",
        "description": "Restrict to first exposure after a clean lookback/washout; time-zero at initiation. Removes immortal time, prevalent-user bias, and depletion of susceptibles, and lets baseline covariates be measured before any treatment effect.",
        "edge_cases": [
          "Requires continuous enrollment across the full washout, or \"new\" status is unverifiable; short washout misclassifies re-initiators as incident.",
          "Free samples and inpatient initiation can precede the first observed pharmacy fill, contaminating incident status."
        ],
        "data_source_notes": "claims: NDC lists + 180-365d with no prior dispensing of the class and continuous enrollment in lookback. See washout-clean-lookback-period-rwe, new-user-design, active-comparator-new-user."
      },
      {
        "name": "As-treated / on-treatment retrospective cohort",
        "description": "Exposure is time-varying after index, built from episode construction (days_supply stitching, grace periods, stockpiling, bridging, switching); censor or reclassify at discontinuation or switch.",
        "edge_cases": [
          "Differential discontinuation by exposure makes naive as-treated estimates susceptible to informative censoring; inverse-probability-of-censoring weighting is usually required."
        ],
        "data_source_notes": "claims: use exposure-episode-construction-rwe, grace-period-gap-rules-rwe, stockpiling-carryover-rules-rwe, restart-rechallenge-new-episode-rwe, inpatient-bridging-exposure-rwe."
      },
      {
        "name": "Prevalent-user / ever-exposed retrospective cohort",
        "description": "Include anyone with current or any prior exposure at entry; larger N but mixes treatment durations, embeds survivor bias, and depletes susceptibles.",
        "edge_cases": [
          "Cannot study early harms or initiation effects cleanly; baseline covariates may already be on the causal pathway."
        ],
        "data_source_notes": "claims: simpler (no washout) but reserve for long-term outcomes when incident users are too few; still enforce observable person-time and adjust for time-since-initiation where possible."
      },
      {
        "name": "Landmark retrospective cohort",
        "description": "Condition on event-free survival to a post-index landmark (e.g., 6 months) before classifying exposure or starting outcome follow-up, used for responder or sustained-exposure questions.",
        "edge_cases": [
          "Selecting on post-index status reintroduces immortal time and selection bias unless follow-up starts at the landmark."
        ],
        "data_source_notes": "See landmark-analysis. Anchor all comparisons at the landmark, not at index."
      },
      {
        "name": "Single-arm or external-control retrospective cohort",
        "description": "Construct a historical or concurrent comparator from the same or another source to contrast against a single-arm treated cohort (e.g., a trial extension or product registry).",
        "edge_cases": [
          "Confounding by indication, calendar-time drift, and transportability assumptions dominate; measured-covariate balancing plus quantitative bias analysis for unmeasured confounding is mandatory."
        ],
        "data_source_notes": "See single-arm-external-control, borrowing-historical-controls-bayesian-rwe, rare-disease-external-controls-rwe."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "cohort-prospective",
        "pros_of_this": "Uses existing data — fast, large, inexpensive, with long follow-up immediately available; reflects routine practice and heterogeneity; feasible when prospective follow-up is unethical or impractical.",
        "cons_of_this": "Limited to recorded constructs (no biospecimens, protocolized PROs, or adjudicated fixed-visit endpoints without linkage); routine-data coding/selection biases; cannot enforce protocol timing.",
        "when_to_prefer": "When the exposure and outcome are codeable, the outcome is rare or long-latency, or a large real-world sample is needed for subgroups and precision."
      },
      {
        "compared_to": "active-comparator-new-user",
        "pros_of_this": "Supplies the general retrospective-cohort scaffold (time-zero, washout, follow-up, censoring, pre-index covariates) and accommodates prevalent, as-treated, landmark, and external-control variants beyond a two-drug contrast.",
        "cons_of_this": "A generic retrospective cohort lacking the active-comparator + new-user restrictions is far more susceptible to confounding by indication and prevalent-user bias.",
        "when_to_prefer": "For incidence, natural-history, HCRU, and single-exposure questions; switch to ACNU for head-to-head comparative drug safety/effectiveness."
      },
      {
        "compared_to": "target-trial-emulation",
        "pros_of_this": "Provides the concrete data-construction backbone (cohort plumbing, observable person-time, censoring) on which an emulation is implemented.",
        "cons_of_this": "Without the explicit target-trial layer (protocol mapping, assignment, causal contrast), a raw retrospective cohort is where time-zero errors and self-inflicted bias enter.",
        "when_to_prefer": "Always specify the target trial first; implement it through the retrospective-cohort structures here."
      },
      {
        "compared_to": "claims-analysis",
        "pros_of_this": "Applies study design (time-zero, exposure definition, follow-up, censoring) on top of cleaned data.",
        "cons_of_this": "Depends entirely on correct low-level claims processing (enrollment derivation, clean claims, reversals, payer heterogeneity); invalid extracts invalidate any cohort.",
        "when_to_prefer": "Establish valid claims processing first, then layer the retrospective-cohort design rules."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Derive the cohort from eligibility + medical + pharmacy files. Enforce continuous enrollment across the full lookback (commonly 365d) and observable person-time in follow-up; exclude MA-only spans where FFS claims are absent. Time-zero = first qualifying claim (fill, dx, or px) after washout. Exposure from NDC/CPT mapped to arms or doses, with episode rules for as-treated. Outcomes from validated algorithms applied identically post-index. Covariates measured pre-index only. Censor at disenrollment (eligibility spans), death (if linked), end of data, or treatment change per SAP. Apply PS/hdPS for confounding and robust/clustered SEs for within-patient correlation; QC patient timelines, person-time and event frequencies by arm, and sensitivity to washout/lookback length.",
      "ehr": "Use problem lists, encounter diagnoses, orders, and NLP for eligibility, index, and outcomes; medication orders or administrations for exposure (link claims for dispensed fills). Irregular, visit-driven observation requires explicit observation windows and a loss-to-follow-up rule (e.g., no encounter > N months). Exploit richer baseline (labs, vitals, severity) for matching/covariates; hybridize with claims for complete pharmacy and HCRU capture.",
      "registry": "Registries supply validated phenotypes and clinical endpoints (e.g., cancer staging, adjudicated events) but often incomplete longitudinal exposure; link to claims for fills and HCRU and to a death index for censoring. Account for registry inclusion criteria in generalizability.",
      "linked": "Multi-source linkage (claims + EHR + registry + vital records) maximizes exposure completeness and outcome validity; harmonize time-zero and observable windows, reconcile order/fill/service date discrepancies before time-zero assignment, and propagate linkage selection and uncertainty into sensitivity analyses."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\n\nWASHOUT_DAYS = 365   # clean lookback that defines an incident (new) user\nMAX_FOLLOWUP = 730   # 2-year administrative cap\n\ndef build_retro_cohort(events: pd.DataFrame, enroll: pd.DataFrame) -> pd.DataFrame:\n    events = events.sort_values([\"person_id\", \"event_date\"])\n\n    # Time-zero = first qualifying event per person.\n    idx = (events.groupby(\"person_id\", as_index=False)\n                 .first()\n                 .rename(columns={\"event_date\": \"index_date\"}))\n\n    # New-user check: no qualifying event in the WASHOUT_DAYS before index.\n    prior = events.merge(idx[[\"person_id\", \"index_date\"]], on=\"person_id\")\n    had_prior = prior[(prior[\"event_date\"] < prior[\"index_date\"]) &\n                      (prior[\"event_date\"] >= prior[\"index_date\"] - pd.Timedelta(days=WASHOUT_DAYS))]\n    idx = idx[~idx[\"person_id\"].isin(had_prior[\"person_id\"])].copy()\n\n    # Continuous, FFS-OBSERVABLE enrollment spanning the full washout through index (no MA-only spans).\n    e = enroll.merge(idx[[\"person_id\", \"index_date\"]], on=\"person_id\")\n    covers_baseline = ((e[\"enroll_start\"] <= e[\"index_date\"] - pd.Timedelta(days=WASHOUT_DAYS)) &\n                       (e[\"enroll_end\"]   >= e[\"index_date\"]) &\n                       (~e[\"ma_only\"]))\n    eligible = e.loc[covers_baseline, \"person_id\"].unique()\n    cohort = idx[idx[\"person_id\"].isin(eligible)].copy()\n\n    # Observable follow-up END = first of: end of the FFS span covering index, or the admin cap.\n    post = enroll[(~enroll[\"ma_only\"])].merge(cohort[[\"person_id\", \"index_date\"]], on=\"person_id\")\n    post = post[(post[\"enroll_start\"] <= post[\"index_date\"]) & (post[\"enroll_end\"] >= post[\"index_date\"])]\n    span_end = post.groupby(\"person_id\")[\"enroll_end\"].max()\n    cohort = cohort.join(span_end.rename(\"ffs_end\"), on=\"person_id\")\n\n    cohort[\"baseline_start\"] = cohort[\"index_date\"] - 
pd.Timedelta(days=WASHOUT_DAYS)  # covariate window\n    admin_cap = cohort[\"index_date\"] + pd.Timedelta(days=MAX_FOLLOWUP)\n    cohort[\"censor_date\"] = cohort[[\"ffs_end\"]].assign(cap=admin_cap).min(axis=1)      # +death/outcome downstream\n    return cohort[[\"person_id\", \"index_date\", \"baseline_start\", \"censor_date\"]]",
        "description": "Retrospective cohort CONSTRUCTION (not estimation) from claims-style inputs. Required inputs (already cleaned and\nde-duplicated):\n  events : qualifying exposure/dx/px events -> person_id, event_date (datetime), event_type (e.g. 'SU' fill),\n           days_supply (int, optional for as-treated)\n  enroll : enrollment spans              -> person_id, enroll_start, enroll_end (datetime), ma_only (bool)\n                                            # ma_only spans have NO fee-for-service claims -> unobservable person-time\nReturns one row per eligible new-user with time-zero, the covariate window, and observable follow-up end. Build\ncovariates only from [baseline_start, index_date] and apply outcome/censoring rules identically downstream.",
        "dependencies": [
          "pandas"
        ],
        "source_citations": [
          "hernan-2016"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\nWASHOUT_DAYS <- 365L\nMAX_FOLLOWUP <- 730L\n\nbuild_retro_cohort <- function(events, enroll) {\n  setDT(events); setDT(enroll)\n  setorder(events, person_id, event_date)\n\n  # Time-zero = first qualifying event per person.\n  idx <- events[, .(index_date = event_date[1L]), by = person_id]\n\n  # New-user: drop anyone with a qualifying event in the washout window before index.\n  ev <- merge(events, idx, by = \"person_id\")\n  prior_ids <- unique(ev[event_date < index_date &\n                         event_date >= index_date - WASHOUT_DAYS, person_id])\n  idx <- idx[!person_id %chin% prior_ids]\n\n  # Continuous, FFS-observable enrollment across the full washout through index.\n  e <- merge(enroll, idx, by = \"person_id\")\n  ok <- e[enroll_start <= index_date - WASHOUT_DAYS &\n          enroll_end   >= index_date & !ma_only, unique(person_id)]\n  cohort <- idx[person_id %chin% ok]\n\n  # Observable follow-up end = end of FFS span covering index, capped at MAX_FOLLOWUP.\n  post <- merge(enroll[ma_only == FALSE], cohort, by = \"person_id\")\n  post <- post[enroll_start <= index_date & enroll_end >= index_date]\n  ffs  <- post[, .(ffs_end = max(enroll_end)), by = person_id]\n  cohort <- merge(cohort, ffs, by = \"person_id\", all.x = TRUE)\n\n  cohort[, baseline_start := index_date - WASHOUT_DAYS]\n  cohort[, censor_date := pmin(ffs_end, index_date + MAX_FOLLOWUP)]  # +death/outcome downstream\n  cohort[, .(person_id, index_date, baseline_start, censor_date)]\n}",
        "description": "Retrospective cohort construction with data.table; mirrors the Python version and produces the same cohort table.\nInputs:\n  events : person_id, event_date (Date), event_type, days_supply\n  enroll : person_id, enroll_start, enroll_end (Date), ma_only (logical)",
        "dependencies": [
          "data.table"
        ],
        "source_citations": [
          "hernan-2016"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "%let washout = 365;\n%let maxfup  = 730;\n\n/* Time-zero = first qualifying event per person. */\nproc sql;\n  create table idx as\n  select person_id, min(event_date) as index_date format=date9.\n  from work.events\n  group by person_id;\nquit;\n\n/* New-user restriction: exclude anyone with a qualifying event inside the washout before index. */\nproc sql;\n  create table newuser as\n  select i.*\n  from idx i\n  where not exists (\n    select 1 from work.events p\n    where p.person_id = i.person_id\n      and p.event_date <  i.index_date\n      and p.event_date >= i.index_date - &washout\n  );\nquit;\n\n/* Continuous, FFS-observable enrollment across the full washout through index (no MA-only spans). */\nproc sql;\n  create table eligible as\n  select n.person_id, n.index_date,\n         n.index_date - &washout as baseline_start format=date9.\n  from newuser n\n  where exists (\n    select 1 from work.enroll e\n    where e.person_id = n.person_id\n      and e.ma_only = 0\n      and e.enroll_start <= n.index_date - &washout\n      and e.enroll_end   >= n.index_date\n  );\nquit;\n\n/* Observable follow-up end = end of the FFS span covering index, capped at MAX_FOLLOWUP. */\nproc sql;\n  create table cohort as\n  select g.person_id, g.index_date, g.baseline_start,\n         min( max(e.enroll_end), g.index_date + &maxfup ) as censor_date format=date9.\n  from eligible g\n  inner join work.enroll e\n    on  e.person_id = g.person_id\n    and e.ma_only = 0\n    and e.enroll_start <= g.index_date\n    and e.enroll_end   >= g.index_date\n  group by g.person_id, g.index_date, g.baseline_start;\nquit;",
        "description": "Retrospective cohort construction in SAS using PROC SQL (cohort plumbing only; estimation lives on the analysis\npages). Required input datasets (post data-management):\n  work.events : person_id, event_date, event_type, days_supply\n  work.enroll : person_id, enroll_start, enroll_end, ma_only (0/1)\nProduces work.cohort with index_date, the baseline covariate window, and the observable censor date. Build covariates\nonly in [baseline_start, index_date] and apply outcome/censoring identically downstream.",
        "dependencies": [],
        "source_citations": [
          "hernan-2016"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": "cohort-retrospective-timeline.svg",
        "mermaid": null,
        "caption": "Time-zero (2022-01-01) sits in the past. To its left is the 365-day look-back the analyst reads to confirm a new user and capture baseline health; to its right is forward follow-up through the already-recorded data, ending at the heart-failure event on day 530. The follow-up direction is forward, identical to a prospective study; only the recording happened earlier.",
        "alt_text": "Horizontal timeline from 2021 to early 2024. A shaded 365-day look-back block runs from 2021-01-01 to 2021-12-31, then a time-zero marker at 2022-01-01, then a longer forward follow-up block of 530 days ending at an outcome marker on 2023-06-15. A note states the whole timeline was already recorded before the 2026 analysis.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  A[Source population in historical window<br/>continuous enrollment, observable person-time] --> B[Time-zero / index<br/>first qualifying event after washout]\n  B --> C[New-user restriction<br/>no qualifying event in washout lookback]\n  C --> D[Assign exposure at/before index<br/>new-user vs prevalent; dose; arm from NDC]\n  D --> E[Baseline covariates measured<br/>PRE-INDEX ONLY -> PS / hdPS]\n  E --> F[Follow-up from index<br/>validated outcome algorithm; update exposure if as-treated]\n  F --> G[Censor at disenrollment / death / end of data / switch / competing event]\n  G --> H[Analysis per estimand<br/>cause-specific vs subdistribution; rates; RMST]\n  H --> I[Sensitivity<br/>washout length, grace rules, negative control, quantitative bias]",
        "caption": "Operational flow for retrospective cohort construction and analysis in claims/EHR/registry data — baseline is measured strictly before index, follow-up and outcomes strictly after, with explicit censoring and an estimand-driven analysis.",
        "alt_text": "Flowchart from source population through time-zero, new-user restriction, exposure assignment, pre-index covariates, post-index follow-up, censoring, estimand-driven analysis, and sensitivity analyses.",
        "source_type": "illustrative",
        "source_citations": [
          "hernan-2016"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "gantt\n  title Retrospective cohort timeline for one new initiator (claims)\n  dateFormat YYYY-MM-DD\n  axisFormat %b %Y\n  section Already in the past\n  Continuous FFS enrollment + washout (no prior qualifying event) :done, wash, 2021-01-01, 2021-12-31\n  section Time zero\n  First qualifying event -> exposure assigned :milestone, t0, 2022-01-01, 0d\n  section Follow-up (also in the past)\n  Observable person-time, outcome ascertainment :active, fu, 2022-01-01, 720d\n  Censor at disenroll / death / data end / switch :crit, cen, 2023-12-21, 1d",
        "caption": "Although the analysis is run today, the entire timeline — washout, time-zero, and follow-up — has already occurred in the data. Baseline precedes index and follow-up follows it, so there is no immortal time.",
        "alt_text": "Gantt timeline showing a 2021 washout, time-zero at the first qualifying event on 2022-01-01, a roughly two-year follow-up window, and a censoring point in late 2023, all in the historical record.",
        "source_type": "illustrative",
        "source_citations": [
          "hernan-2016"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "is_variant_of",
        "target_slug": "cohort-prospective",
        "notes": "Both are cohort designs moving exposure to outcome; the retrospective version uses already-recorded data (efficient, large, immediate long follow-up) while the prospective version collects new data forward in time (protocol control, novel measures). Many retrospective cohorts emulate a prospective trial via target-trial thinking."
      },
      {
        "relation_type": "depends_on",
        "target_slug": "target-trial-emulation",
        "notes": "Retrospective cohorts are the primary vehicle for target-trial emulation; the protocol (eligibility, time-zero, strategies, assignment, follow-up, censoring, estimand) is operationalized through the data-construction patterns described here."
      },
      {
        "relation_type": "used_with",
        "target_slug": "active-comparator-new-user",
        "notes": "ACNU specializes the retrospective cohort with active-comparator and new-user restrictions for head-to-head drug comparisons; this entry supplies the general time-zero, washout, follow-up, censoring, and pre-index covariate scaffold."
      },
      {
        "relation_type": "requires",
        "target_slug": "continuous-enrollment-observable-time-rwe",
        "notes": "A valid retrospective cohort needs continuous enrollment across lookback and observable person-time in follow-up; otherwise absence of events is unobserved care, not true absence."
      },
      {
        "relation_type": "see_also",
        "target_slug": "person-time-denominator-construction-rwe",
        "notes": "Rates and cumulative incidence from a retrospective cohort require correct person-time denominators built from observable enrollment spans, excluding MA-only and unobservable time."
      },
      {
        "relation_type": "see_also",
        "target_slug": "claims-analysis",
        "notes": "Claims analysis provides the enrollment derivation, clean-claims handling, reversals, and payer-heterogeneity processing that must precede any valid retrospective cohort built from administrative data."
      },
      {
        "relation_type": "see_also",
        "target_slug": "exposure-episode-construction-rwe",
        "notes": "For as-treated or time-updated retrospective cohorts, exposure episode construction (index fill, grace/gap rules, stockpiling, bridging, switching, restart) is required to classify and update exposure during follow-up."
      },
      {
        "relation_type": "see_also",
        "target_slug": "immortal-time-bias-handling",
        "notes": "Retrospective cohorts are prone to immortal time when follow-up or exposure classification begins before the arm-symmetric time-zero; aligning time-zero at initiation or diagnosis prevents it."
      },
      {
        "relation_type": "see_also",
        "target_slug": "washout-clean-lookback-period-rwe",
        "notes": "A clean washout before index establishes incident exposure and lets baseline covariates be measured without depletion of susceptibles."
      },
      {
        "relation_type": "see_also",
        "target_slug": "high-dimensional-propensity-score-hdps-rwe",
        "notes": "hdPS is a standard confounding-adjustment tool inside large retrospective claims cohorts, empirically selecting hundreds of pre-index proxy covariates."
      }
    ],
    "aliases": [
      "retrospective cohort",
      "historical cohort",
      "non-concurrent cohort",
      "historical prospective cohort"
    ],
    "complexity": "foundational",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "comparative-effectiveness-research-cer-methods",
    "name": "Comparative Effectiveness Research (CER) Methods",
    "short_definition": "The methodological toolkit — encompassing target-trial-aligned design, confounding control (propensity scores, g-methods), quasi-experimental identification, and sensitivity analysis — for estimating the comparative benefits, harms, and value of alternative healthcare interventions for the same decision when randomization is unavailable.",
    "long_description": "**Comparative effectiveness research (CER) methods** are the set of design and analytic tools used to recover a\n*comparative causal estimand* — the effect of treatment strategy A versus a clinically reasonable alternative B for the\nsame decision — from real-world data when a head-to-head randomized trial cannot be run. CER is not a single estimator; it\nis the methods stack that sits above the specific designs and models in this catalog. This entry is the **umbrella for the\nwhole methodological enterprise** (when to reach for a propensity score versus an instrument versus a g-method versus a\nnatural experiment, and how to defend the answer for a clinical, regulatory, or HTA decision). Its sibling entry\n`cer-observational` is narrower: it catalogs the *observational study designs* (ACNU cohort, prevalent-new-user, nested\ncase-control, self-controlled designs) and how to pick among them. Read this entry for the analytic and inferential layer;\nread `cer-observational` for the design-selection layer.\n\n**Core conceptual distinction** (the core estimand distinction). Three distinctions organize all of CER and must be settled in the protocol\nbefore any code runs. (1) *Efficacy vs effectiveness*: an RCT estimates efficacy under protocol-enforced adherence in a\nselected population; CER estimates effectiveness — the effect of a strategy *as actually used* in routine care, with\nimperfect adherence, broad comorbidity, and longer horizons. These are different estimands, not noisy versions of the same\nnumber. (2) *Comparative vs absolute*: CER compares A vs B for one decision, because a drug-vs-nothing contrast in\nsecondary data is almost always fatally confounded by indication; the active comparator converts an unanswerable absolute\nquestion into an answerable comparative one. 
(3) *The causal estimand itself* must be named: ATE vs ATT/ATU (which the\nweighting or matching scheme implicitly targets — IPTW→ATE, 1:1 matching→ATT, overlap weights→ATO), and ITT-like\n(initiation/intention-to-treat) vs per-protocol/as-treated (which requires censoring at switch/discontinuation and\ninverse-probability-of-censoring weighting to handle informative censoring). Different choices imply different models and\ndifferent decision-relevance; conflating them is the most common silent error in applied CER.\n\n**The method-selection logic (the heart of CER).** The driving question is *what is the dominant threat to exchangeability,\nand is it measured?*\n- **Confounding is measured** → design out what you can (active comparator + new-user), then balance on measured covariates\n  with a **propensity score** (matching, IPTW, overlap weighting, stratification) or a **doubly robust** estimator (PS +\n  outcome model, e.g., AIPW/TMLE) that is consistent if *either* model is correct. 
With high-dimensional claims/EHR\n  proxies, use a **high-dimensional propensity score** to recover information from variables you did not pre-specify.\n- **Confounding is time-varying and affected by prior treatment** (a confounder that is also a mediator) → standard\n  regression and baseline PS are biased *whether or not* they adjust for it (adjusting blocks part of the treatment\n  effect, not adjusting leaves confounding); use **g-methods** (marginal structural models via IPTW,\n  g-estimation of structural nested models, or g-computation) within a target-trial frame, typically with clone-censor-weight\n  for sustained strategies.\n- **Unmeasured confounding is the dominant threat and a valid instrument exists** (physician/regional preference, formulary\n  rules, distance) → **instrumental variables / 2SLS / 2SRI**, accepting that the estimand is a local effect among\n  compliers, not the population ATE, and that exclusion-restriction violations are untestable.\n- **A policy or natural experiment created quasi-random variation in exposure** → **difference-in-differences** (parallel\n  trends), **regression discontinuity** (a sharp eligibility threshold), or **synthetic control** for a single treated unit.\nWhatever the route, CER discipline requires *pre-specification of a target trial*, *transparent reporting* (STaRT-RWE,\nHARPER, the ISPOR-ISPE good-practice recommendations), *fit-for-purpose data assessment*, and *quantitative sensitivity\nanalysis* (E-value, negative controls / empirical calibration, quantitative bias analysis) because residual confounding is\nnever excluded by design alone.\n\n**Pros, cons, and trade-offs.**\n- **vs the head-to-head RCT it emulates:** CER delivers larger, more representative populations, rare and long-term outcomes,\n  subgroups, and answers far faster and cheaper, and it captures real-world adherence and heterogeneity that efficacy trials\n  suppress. 
Cost: residual confounding (measured *and* unmeasured) is never excluded, the answer depends on data quality and\n  correct estimand specification, and regulators still treat a single observational CER study as weaker than a confirmatory\n  RCT. The RCT-DUPLICATE program showed observational CER *can* reproduce RCT effect estimates when design emulation is\n  rigorous and confounders are well measured — and fails predictably when they are not. **Prefer CER** when an RCT is\n  infeasible/unethical, generalizability is the question, or speed is decisive; combine with RCT evidence via transportability\n  or network meta-analysis when possible.\n- **vs naive observational comparisons (prevalent-user, \"ever vs never\", baseline regression only):** the CER stack\n  (active comparator, new-user, PS/g-methods, target trial) removes confounding by indication, immortal time, and\n  depletion-of-susceptibles that cripple naive analyses, and makes the estimate defensible. Cost: more complex, more\n  assumption-heavy, and slower — and still wrong if the assumptions fail, which is exactly why sensitivity analysis is\n  mandatory, not optional. **Prefer the CER stack** for essentially all non-randomized comparative questions.\n- **PS vs g-methods vs IV (within the stack):** PS methods are simplest and best when key confounders are measured and not\n  time-varying; g-methods are *necessary* (not merely better) when a covariate is both confounder and mediator of a sustained\n  strategy; IV is the only route to unmeasured-confounding identification but pays for it with a fragile, untestable exclusion\n  restriction and a compliers-only estimand. 
Reaching for IV when a good PS would do trades away precision and interpretability;\n  reaching for a baseline PS on a time-varying problem trades away validity.\n\n**When to use.** Reach for the CER stack whenever the decision is *comparative* (strategy A vs a clinically reasonable\nalternative B for the same indication and the same decision point), a head-to-head RCT is infeasible, unethical, too slow,\nor too narrow, and a defensible active comparator exists so the question is not secretly drug-vs-nothing. Within the stack,\nthe routing rule is mechanical: if the dominant confounders are measured and not time-varying, use a propensity score or a\ndoubly robust estimator (add hd-PS for claims/EHR proxies); if a confounder is also a mediator of a sustained strategy, use\ng-methods inside a target-trial frame; if unmeasured confounding dominates and a valid instrument exists, use IV; if a\npolicy or natural experiment created quasi-random variation, use DiD/RD/synthetic control. Use it for regulatory- or\nHTA-grade comparative safety/effectiveness and value questions in claims, EHR, registry, or linked data, and pre-specify\nthe target trial and the sensitivity analyses before any code runs.\n\n**When NOT to use — and when it is actively misleading or dangerous.**\n- **No clinically interchangeable comparator exists.** Forcing a comparator prescribed to systematically different patients\n  (a second-line agent against a first-line agent) re-introduces confounding by indication and channeling — the bias you\n  came to remove. 
If the honest question is drug-vs-no-treatment, the active-comparator machinery cannot answer it.\n- **The dominant confounder is unmeasured and no valid instrument exists.** A small E-value, or a negative-control association\n  strong enough to explain away the effect, is a signal to *not* report a causal estimate; report the bias-analysis bound instead.\n- **Severe positivity violation.** If one drug is reserved for the sickest patients, PS distributions separate, matching\n  discards most of the cohort, weights explode, and the surviving estimand maps to no clinically meaningful population.\n- **Time-zero / immortal-time misalignment.** Starting follow-up at diagnosis rather than initiation, or selecting on a\n  post-baseline event, manufactures bias before any model is fit; no downstream adjustment repairs it.\n- **g-methods on thin longitudinal data.** MSMs and clone-censor-weight demand correctly modeled time-varying processes and\n  positivity at *every* time point; on sparse measurement they substitute model dependence for the bias they removed.\n\n**Data-source operational depth.**\n- **Claims (FFS vs MA vs commercial):** the workhorse for comparative drug safety/effectiveness — large N, longitudinal\n  fills (NDC + `fill_date` + `days_supply`), diagnoses, procedures, and costs. Strengths: near-complete utilization and\n  pharmacy in fee-for-service. Failure modes and workarounds: **Medicare Advantage encounter data are incomplete** — MA-only\n  person-time lacks reliable FFS claims, so \"no prior fill\" can be missingness rather than a true washout; restrict to\n  enrollees with full Parts A/B/D (or a commercial medical+pharmacy benefit) and **exclude MA-only person-time**. Coding\n  drift (ICD-9→ICD-10, fee-schedule changes) breaks longitudinal outcome algorithms — version your code lists. No labs, no\n  severity, no *reason* for treatment, so channeling is invisible without linkage. 
**Differential competing risks by exposure\n  in elderly claims** (e.g., a comparator preferentially used in frailer patients raises competing mortality) biases naive\n  cause-specific or Kaplan-Meier outcome risk; model competing risks explicitly. 90-day mail order, sample fills, and\n  stockpiling distort `days_supply` and on-treatment windows.\n- **EHR:** richer covariates (labs, vitals, problem lists, notes/NLP for severity and SDoH) make effect-modification and\n  indication confirmation possible. But initiation is the *order/administration*, not the dispensing (link to pharmacy to\n  confirm the patient started), capture is visit-driven so a patient who leaves the system is differentially lost (treat\n  loss to follow-up as potentially informative), and care outside the network is invisible.\n- **Registry:** strongest for indication, disease severity, and adjudicated outcomes (cancer stage, device endpoints);\n  typically weak for full pharmacy exposure and complete utilization. Link to claims for fills/costs and to a death index\n  for censoring; excellent as the eligibility/severity backbone of a target-trial emulation.\n- **Linked claims–EHR–vital records:** the ideal substrate (EHR severity + claims completeness + reliable mortality) but\n  linkage introduces selection (only the linkable subset) and order/fill/service date discrepancies that must be reconciled\n  before time-zero assignment.\n\n**Worked claims example (head-to-head, FFS logic).** Question: 2-year risk of hospitalized heart failure with a\nsecond-generation sulfonylurea vs a DPP-4 inhibitor among adults with type 2 diabetes, in a commercial + Medicare FFS\ndatabase. (1) *Eligibility:* age ≥18, ≥2 outpatient or ≥1 inpatient diabetes diagnosis, and 365 days of continuous\nmedical + pharmacy enrollment (full A/B/D or commercial) before the first study fill — **exclude any MA-only person-time**\nso \"no prior fill\" is observed, not missing. 
(2) *Washout (new-user):* no fill of *any* sulfonylurea or DPP-4 inhibitor in\nthe 365-day lookback, making both arms incident users. (3) *Time zero:* the date of the first qualifying fill; assign the\narm from the NDC dispensed that day. (4) *Covariates:* measured only in the 365-day window up to and including time zero\n(comorbidities, prior insulin, baseline HCRU, HbA1c proxies), feeding a high-dimensional propensity score. (5) *Estimand\nand estimator:* target the ATE with stabilized IPTW and add an outcome model for double robustness (AIPW), or target the\nATT with 1:1 PS matching — name the choice in the SAP. (6) *Follow-up and competing risks:* from time zero to first\nvalidated HF hospitalization, censoring at disenrollment, end of data, and (as-treated) discontinuation (last `days_supply`\nend + a pre-specified grace period) or switch; treat **all-cause death as a competing risk** (it is differential by arm in\nthis older cohort) and report cause-specific HR alongside the Fine-Gray subdistribution effect, since they answer different\nquestions. (7) *Diagnostics and sensitivity:* standardized mean differences <0.1 after weighting/matching, an E-value for\nthe point estimate and confidence limit, a negative-control outcome to detect residual confounding, and sensitivity to\nwashout length and grace period.\n\n**Interpreting the output**\n\nIn the new-user active-comparator design comparing Drug A and Drug B (2-year commercial\nclaims), propensity-score balancing produces: Drug A 2-year cumulative incidence 4.2 per 100 patients;\nDrug B 6.8 per 100 patients; risk ratio 4.2/6.8 ≈ 0.62; absolute risk reduction 6.8 − 4.2 = 2.6 percentage points\nover 2 years.\n\n*(1) Formal interpretation.* The RR ≈ 0.62 is an effectiveness estimate in the real-world\ninitiator population — patients who received prescriptions in routine care, with their actual\ncontraindications, prior treatment failures, and adherence patterns intact. It is not an\nefficacy estimate from a trial's per-protocol population. 
The new-user restriction starts the\ncomparison at first dispensing, eliminating prevalent-user depletion-of-susceptibles bias. The\nactive-comparator restriction means both arms faced a similar prescribing threshold, attenuating\nhealthy-user confounding that would inflate apparent benefit if the comparator were untreated\npatients. Residual confounding by indication — Drug A may be preferred in lower-risk patients —\nremains the primary threat to validity and should be quantified via E-value or quantitative bias\nanalysis, alongside a negative-control outcome assessment.\n\n*(2) Practical interpretation.* A 2.6 pp absolute reduction over 2 years corresponds to\napproximately 26 fewer events per 1,000 treated patients. This is an actionable number for\ncoverage decisions, but carries an effectiveness caveat: the observed benefit includes the\ncontribution of partial adherence, switching, and provider management decisions that would not\nreplicate under controlled trial conditions. Pair the RR with a cause-specific and\nsubdistribution hazard ratio if death is a competing event in the population.",
    "primary_category": "Causal_Inference_Method",
    "tags": [
      "comparative-effectiveness",
      "cer",
      "real-world-evidence",
      "causal-inference",
      "propensity-scores",
      "g-methods",
      "instrumental-variables",
      "target-trial",
      "doubly-robust",
      "competing-risks",
      "sensitivity-analysis"
    ],
    "applies_to_study_types": [
      "active_comparator_new_user",
      "new_user",
      "cohort_retrospective",
      "cohort_prospective",
      "claims_analysis",
      "ehr_study",
      "pragmatic_trial",
      "registry_trial"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "linked",
      "registry"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1111/j.1524-4733.2009.00602.x",
        "url": "https://doi.org/10.1111/j.1524-4733.2009.00602.x",
        "citation_text": "Johnson ML, Crown W, Martin BC, Dormuth CR, Siebert U. Good research practices for comparative effectiveness research: analytic methods to improve causal inference from nonrandomized studies of treatment effects using secondary data sources: the ISPOR Good Research Practices for Retrospective Database Analysis Task Force Report-Part III. Value in Health. 2009;12(8):1062-1073.",
        "year": 2009,
        "authors_short": "Johnson et al.",
        "notes": "ISPOR Task Force statement of the analytic-methods layer of CER — design, confounding control, and sensitivity analysis to strengthen causal inference from secondary data."
      },
      {
        "role": "introduce",
        "doi": "10.1002/pds.4297",
        "url": "https://doi.org/10.1002/pds.4297",
        "citation_text": "Berger ML, Sox H, Willke RJ, et al. Good practices for real-world data studies of treatment and/or comparative effectiveness: recommendations from the joint ISPOR-ISPE Special Task Force on real-world evidence in health care decision making. Pharmacoepidemiol Drug Saf. 2017;26(9):1033-1039.",
        "year": 2017,
        "authors_short": "Berger et al.",
        "notes": "Joint ISPOR-ISPE good-practice recommendations covering study conduct, replicability, and transparency for comparative real-world evidence used in decision making."
      },
      {
        "role": "explain",
        "doi": "10.1016/j.jclinepi.2016.04.014",
        "url": "https://doi.org/10.1016/j.jclinepi.2016.04.014",
        "citation_text": "Hernán MA, Sauer BC, Hernández-Díaz S, Platt R, Shrier I. Specifying a target trial prevents immortal time bias and other self-inflicted injuries in observational analyses. J Clin Epidemiol. 2016;79:70-75.",
        "year": 2016,
        "authors_short": "Hernán et al.",
        "notes": "Explains the target-trial discipline that aligns each CER design/analytic choice (eligibility, time zero, estimand) with the head-to-head trial being emulated, preventing immortal-time and related self-inflicted biases."
      },
      {
        "role": "explain",
        "doi": "10.1097/EDE.0b013e3181a663cc",
        "url": "https://doi.org/10.1097/EDE.0b013e3181a663cc",
        "citation_text": "Schneeweiss S, Rassen JA, Glynn RJ, Avorn J, Mogun H, Brookhart MA. High-dimensional propensity score adjustment in studies of treatment effects using health care claims data. Epidemiology. 2009;20(4):512-522.",
        "year": 2009,
        "authors_short": "Schneeweiss et al.",
        "notes": "The hd-PS algorithm — empirically selecting and prioritizing claims/EHR proxy covariates to reduce confounding beyond pre-specified variables; a core confounding-control tool of claims-based CER."
      },
      {
        "role": "demonstrate",
        "doi": "10.1161/CIRCULATIONAHA.120.051718",
        "url": "https://doi.org/10.1161/CIRCULATIONAHA.120.051718",
        "citation_text": "Franklin JM, Patorno E, Desai RJ, et al. Emulating randomized clinical trials with nonrandomized real-world evidence studies: first results from the RCT DUPLICATE initiative. Circulation. 2021;143(10):1002-1013.",
        "year": 2021,
        "authors_short": "Franklin et al.",
        "notes": "Benchmarks observational CER against the RCTs it emulates across multiple trials; quantifies when rigorous design emulation reproduces randomized estimates and when it predictably fails (unmeasured confounding, comparator mismatch)."
      },
      {
        "role": "use",
        "doi": "10.57264/cer-2024-0101",
        "url": "https://doi.org/10.57264/cer-2024-0101",
        "citation_text": "Daigl M, Abogunrin S. Advancing the role of real-world evidence in comparative effectiveness research. J Comp Eff Res. 2024;13(7):e240101.",
        "year": 2024,
        "authors_short": "Daigl & Abogunrin",
        "notes": "Recent review and decision flowchart for selecting CER analytic approaches in RWE under current regulatory/HTA expectations; useful as an applied method-selection map."
      }
    ],
    "plain_language_summary": "Comparative effectiveness research (CER) asks which of two treatments works better in the real world — not under idealized trial conditions but among actual patients who may skip doses, have multiple illnesses, and stay on therapy for years. Instead of randomly assigning patients to drugs, researchers build a carefully designed study from insurance claims or medical records, using statistical tools to make the two groups as comparable as possible before measuring outcomes. The core insight is that comparing a drug to an active alternative patients are already receiving is far more answerable than asking whether the drug beats doing nothing.",
    "key_terms": [
      {
        "term": "effectiveness vs efficacy",
        "definition": "Efficacy is how well a drug works under the strict, controlled conditions of a clinical trial; effectiveness is how well it works in routine clinical practice with real patients."
      },
      {
        "term": "active comparator",
        "definition": "The second drug in a head-to-head comparison — a treatment that is prescribed for the same condition as the study drug, so both groups of patients are similar at the start."
      },
      {
        "term": "new-user design",
        "definition": "A study rule that counts only patients starting a drug for the first time during the study period, so researchers can observe the full treatment experience from day one."
      },
      {
        "term": "confounding by indication",
        "definition": "A distortion that occurs when the reason a doctor chose one drug over another is also related to the patient's risk of the outcome being studied, making the two groups incomparable before any analysis."
      },
      {
        "term": "propensity score",
        "definition": "A number between 0 and 1 that summarizes how likely a patient was to receive the study drug given everything measurable about them at baseline, used to create balanced comparison groups."
      }
    ],
    "worked_example": {
      "scenario": "A payer wants to know whether Drug A (a newer diabetes pill) leads to fewer hospitalizations for heart failure than Drug B (an older diabetes pill) among adults who are just starting one of these two drugs. No head-to-head trial has been run. Researchers build an active-comparator, new-user study from two years of commercial insurance claims.",
      "dataset": {
        "caption": "Five design decisions a CER analyst documents before running any code — one row per choice, contrasting effectiveness in the real world vs what an efficacy trial would do.",
        "columns": [
          "Design choice",
          "CER study (effectiveness)",
          "Clinical trial (efficacy)"
        ],
        "rows": [
          [
            "Who is included",
            "All adults starting Drug A or Drug B in the database, including those with kidney disease, heart disease, or obesity",
            "Narrow eligibility criteria; patients with comorbidities often excluded"
          ],
          [
            "Comparator",
            "Drug B (active comparator — another diabetes pill for the same indication)",
            "Placebo or no treatment"
          ],
          [
            "How groups are made comparable",
            "Propensity score balancing: researchers calculate each patient's probability of receiving Drug A vs B from 40+ measured characteristics, then weight the groups to look alike",
            "Random assignment makes groups comparable by design"
          ],
          [
            "Treatment in practice",
            "Patients take their medication as they choose; some skip doses, switch, or stop — this real-world behavior is kept in the analysis",
            "Patients follow a strict protocol; adherence is monitored and enforced"
          ],
          [
            "Main outcome",
            "Hospitalization for heart failure recorded in insurance claims over 2 years",
            "Lab-based surrogate endpoint measured over 6 months under trial conditions"
          ]
        ]
      },
      "steps": [
        "Step 1 — Define the question precisely: Drug A versus Drug B, in new users only (no one already on either drug in the prior 12 months), for adults aged 18 and older with a documented diabetes diagnosis.",
        "Step 2 — Assign each patient a start date (their first fill of either drug) and record all 40+ baseline characteristics measured in the 12 months before that start date.",
        "Step 3 — Calculate a propensity score for each patient: the predicted probability of having received Drug A given their age, sex, prior diagnoses, other medications, and health-care use.",
        "Step 4 — Use those scores to make the Drug A and Drug B groups statistically comparable, so confounding by indication is removed — the same way random assignment would in a trial.",
        "Step 5 — Follow all patients from their start date until they are hospitalized for heart failure, leave the insurance plan, or reach 730 days (2 years), whichever comes first.",
        "Step 6 — Compare the 2-year heart failure hospitalization rate between the two balanced groups and report the result as the real-world effectiveness difference."
      ],
      "result": "After propensity-score balancing, the Drug A group and Drug B group look similar on all measured baseline characteristics. The 2-year heart failure hospitalization rate is 4.2 per 100 patients in the Drug A arm versus 6.8 per 100 in the Drug B arm — a 2.6 percentage-point absolute reduction (relative risk 0.62). Because this is measured in routine care with real-world adherence and a broad patient population, it is an effectiveness estimate, not the efficacy estimate a placebo-controlled trial would produce."
    },
    "prerequisites": [
      "active-comparator-new-user",
      "target-trial-emulation",
      "propensity-score-methods-psm-iptw"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Propensity-based and doubly robust CER",
        "description": "Balance measured confounders with a propensity score (1:1/1:k matching with caliper, stabilized IPTW, overlap weighting, or stratification) or combine PS with an outcome model in a doubly robust estimator (AIPW, TMLE) that is consistent if either the PS or the outcome model is correct. With high-dimensional claims/EHR proxies, an hd-PS recovers confounding information from unspecified variables.",
        "edge_cases": [
          "Poor overlap (positivity violation) destabilizes weights and forces extrapolation; trim, restrict, or switch to overlap weights, and report the resulting (shifted) target population.",
          "The weighting/matching scheme silently fixes the estimand (IPTW→ATE, 1:1 matching→ATT, overlap→ATO); name it before fitting, not after.",
          "Doubly robust does not mean assumption-free — both models being mildly misspecified can still bias the estimate."
        ],
        "data_source_notes": "claims/EHR: standard for head-to-head drug CER. Report standardized mean differences and a Love plot pre/post balancing and the effective sample size. Tools: WeightIt/MatchIt/tmle (R), causallib/zEpid (Python), PROC PSMATCH / PROC CAUSALTRT (SAS).",
        "citations": [
          "johnson-2009",
          "schneeweiss-2009"
        ]
      },
      {
        "name": "G-methods for time-varying treatments",
        "description": "For sustained or dynamic strategies where a covariate is both confounder and mediator, use marginal structural models via IPTW, g-estimation of structural nested models, or parametric g-computation, typically inside a target-trial frame with clone-censor-weight for per-protocol estimands.",
        "edge_cases": [
          "Requires positivity at every time point and correctly modeled time-varying confounding; sparse longitudinal measurement substitutes model dependence for the bias removed.",
          "Per-protocol estimands need informative-censoring weights (IPCW) when discontinuation/switching is non-random."
        ],
        "data_source_notes": "Critical for treatment duration, switching, or chronic-therapy safety questions. See the dedicated marginal-structural-models, g-estimation, and clone-censor-weight entries for operational detail.",
        "citations": [
          "hernan-2016"
        ]
      },
      {
        "name": "Instrumental-variable and quasi-experimental CER",
        "description": "When unmeasured confounding dominates, use an instrument (physician/regional prescribing preference, formulary rules, distance) via 2SLS or 2SRI; when a policy or natural experiment created quasi-random variation, use difference-in-differences, regression discontinuity, or synthetic control.",
        "edge_cases": [
          "Weak instruments and exclusion-restriction violations (any direct instrument→outcome path) bias estimates; the exclusion restriction is untestable, so report falsification/over-identification checks.",
          "IV identifies a local (complier) average effect, not the population ATE; DiD requires parallel trends; RD requires no manipulation of the running variable at the threshold."
        ],
        "data_source_notes": "Useful when strong measured confounders are unavailable or when formulary/coverage changes or regional variation supply quasi-random exposure. Tools: ivreg/AER/fixest/rdrobust (R), linearmodels (Python).",
        "citations": [
          "johnson-2009",
          "daigl-2024"
        ]
      },
      {
        "name": "Target-trial-emulated CER with explicit sensitivity analysis",
        "description": "Pre-specify the protocol of the head-to-head trial (eligibility, strategies, assignment, time zero, outcome, estimand, analysis), map each element to the data, and pair the primary analysis with quantitative bias analysis (E-value, negative controls / empirical calibration, probabilistic bias analysis).",
        "edge_cases": [
          "Strategies must be well-defined and emulatable; grace periods and ITT-vs-per-protocol choices change the estimand.",
          "A negative-control outcome that moves with exposure is evidence to withhold a causal claim, not a nuisance to explain away."
        ],
        "data_source_notes": "The dominant framework for regulatory/HTA-grade CER (FDA, EMA, HTA). Pre-specify everything; report per STaRT-RWE/HARPER. See the target-trial-emulation and clone-censor-weight entries.",
        "citations": [
          "hernan-2016",
          "franklin-2021"
        ]
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "head-to-head randomized controlled trial",
        "pros_of_this": "Larger, more representative populations; rare and long-term outcomes, subgroups, and unethical/infeasible comparisons; faster and cheaper; captures real-world adherence and heterogeneity that efficacy trials suppress.",
        "cons_of_this": "Residual confounding (measured and unmeasured) is never excluded; depends on data quality and correct estimand specification; a single observational CER study is treated as weaker evidence than a confirmatory RCT.",
        "when_to_prefer": "When an RCT is infeasible/unethical, generalizability is the question, or speed is decisive — ideally combined with RCT evidence via transportability or network meta-analysis."
      },
      {
        "compared_to": "naive observational comparisons (prevalent-user, ever-vs-never, baseline regression)",
        "pros_of_this": "The CER stack (active comparator, new-user, PS/g-methods, target trial) removes confounding by indication, immortal time, and depletion of susceptibles, making the estimate defensible for decision making.",
        "cons_of_this": "More complex, slower, and assumption-heavy; still biased if assumptions fail, which is why quantitative sensitivity analysis is mandatory rather than optional.",
        "when_to_prefer": "Essentially all non-randomized comparative effectiveness or safety questions in RWE/pharmacoepidemiology."
      },
      {
        "compared_to": "instrumental-variable identification",
        "pros_of_this": "Propensity-score and g-methods are more precise, interpretable, and diagnosable when key confounders are measured; they target a population (ATE/ATT) rather than a compliers-only effect.",
        "cons_of_this": "They cannot address unmeasured confounding; if the dominant confounder is unmeasured and a valid instrument exists, PS/g-methods are biased while IV is not.",
        "when_to_prefer": "Use PS/g-methods when confounding is measured; reserve IV for unmeasured-confounding settings with a defensible instrument, accepting a fragile exclusion restriction and a local estimand."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Workhorse for comparative drug CER (large N, NDC + fill_date + days_supply, diagnoses, procedures, costs). Strengths: near-complete FFS utilization and pharmacy. Weaknesses/workarounds: exclude MA-only person-time (incomplete encounter data make washout/outcomes unobservable); version code lists across ICD-9→ICD-10 and fee-schedule changes; no labs/severity/reason-for-treatment, so confirm indication with diagnoses and use hd-PS; model death as a competing risk because it is differential by exposure in elderly cohorts.",
      "ehr": "Richer covariates (labs, vitals, problem lists, notes/NLP) enable indication confirmation and effect modification, but initiation is the order/administration (link to dispensing to confirm start), capture is visit-driven (loss to follow-up is potentially informative), and out-of-network care is invisible. Linkage to claims is the gold standard.",
      "registry": "Strong for indication, severity, and adjudicated outcomes; weak for full pharmacy exposure and utilization. Link to claims for fills/costs and to a death index for censoring; excellent as the eligibility/severity backbone of a target-trial emulation.",
      "linked": "Linked claims-EHR-vital records is the ideal substrate (severity + completeness + mortality) but introduces linkage selection (only the linkable subset) and order/fill/service date discrepancies to reconcile before time-zero assignment."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\nimport pandas as pd\nfrom sklearn.linear_model import LogisticRegression\n\ndef aipw_ate(df: pd.DataFrame, covs: list[str], horizon_days: int = 730) -> dict:\n    \"\"\"Doubly robust ATE (risk difference) for a binary outcome by `horizon_days`.\n\n    AIPW is consistent if EITHER the propensity model OR the outcome model is correct.\n    Death-without-event is treated as a censoring competing event: subjects who die\n    before the horizon without the outcome and with fu_time < horizon are censored, so\n    the contrast is the cause-specific (treatment-as-cause) risk difference. For a\n    policy-relevant cumulative-incidence (subdistribution) contrast, switch to a\n    Fine-Gray / Aalen-Johansen estimator (see the R and SAS blocks).\n    \"\"\"\n    d = df.copy()\n    # Outcome observed by the horizon (event before horizon AND followed long enough).\n    d[\"y\"] = ((d[\"event\"] == 1) & (d[\"fu_time\"] <= horizon_days)).astype(int)\n\n    X = d[covs].to_numpy()\n    a = d[\"treat\"].to_numpy()\n    y = d[\"y\"].to_numpy()\n\n    # --- Propensity model -> stabilized IPTW ---\n    ps = LogisticRegression(max_iter=1000).fit(X, a).predict_proba(X)[:, 1]\n    ps = np.clip(ps, 0.01, 0.99)                 # bound to tame extreme weights\n    p_treat = a.mean()\n    sw = np.where(a == 1, p_treat / ps, (1 - p_treat) / (1 - ps))  # stabilized weights\n\n    # --- Outcome model (fit within each arm to allow effect modification) ---\n    mu1 = LogisticRegression(max_iter=1000).fit(X[a == 1], y[a == 1]).predict_proba(X)[:, 1]\n    mu0 = LogisticRegression(max_iter=1000).fit(X[a == 0], y[a == 0]).predict_proba(X)[:, 1]\n\n    # --- AIPW influence-function estimates of E[Y^1] and E[Y^0] ---\n    psi1 = mu1 + (a / ps) * (y - mu1)\n    psi0 = mu0 + ((1 - a) / (1 - ps)) * (y - mu0)\n    rd = psi1.mean() - psi0.mean()\n    se = np.sqrt(np.var(psi1 - psi0, ddof=1) / len(d))  # influence-function SE\n\n    # --- Weighted covariate balance 
(standardized mean differences) ---\n    def wsmd(col):\n        t, c = d[a == 1], d[a == 0]\n        wt, wc = sw[a == 1], sw[a == 0]\n        mt = np.average(t[col], weights=wt); mc = np.average(c[col], weights=wc)\n        vt = np.average((t[col] - mt) ** 2, weights=wt)\n        vc = np.average((c[col] - mc) ** 2, weights=wc)\n        return (mt - mc) / np.sqrt((vt + vc) / 2 + 1e-12)\n\n    smd = {c: round(float(wsmd(c)), 3) for c in covs}\n    return {\"risk_diff\": float(rd), \"se\": float(se),\n            \"ci95\": (float(rd - 1.96 * se), float(rd + 1.96 * se)),\n            \"weighted_smd\": smd, \"max_abs_smd\": max(abs(v) for v in smd.values())}",
        "description": "Head-to-head CER estimation on a pre-built ACNU analytic table (one row per new initiator). Required input columns\n(already cohort-constructed via the new-user + active-comparator + time-zero rules; see active-comparator-new-user):\n  person_id   : patient id\n  treat       : 1 = study drug, 0 = active comparator (assigned from the NDC dispensed at index)\n  event       : 1 = validated outcome (e.g., HF hospitalization) during follow-up, else 0\n  fu_time     : follow-up days from index to event/censor\n  death       : 1 = died without the event (competing risk), else 0\n  x1..xk      : baseline covariates measured ONLY in [index_date - 365, index_date]\nProduces a doubly robust (AIPW) ATE on the risk-difference scale via stabilized IPTW + an outcome model, plus a\nweighted balance check. Use this AFTER cohort construction; do not adjust for any post-index variable.",
        "dependencies": [
          "pandas",
          "numpy",
          "scikit-learn"
        ],
        "source_citations": [
          "johnson-2009",
          "schneeweiss-2009"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(WeightIt); library(cobalt); library(survey); library(survival); library(cmprsk)\n\ncovs <- grep(\"^x\", names(dat), value = TRUE)\nf_ps <- reformulate(covs, response = \"treat\")\n\n## (1) Stabilized IPTW for the ATE -------------------------------------------\nw <- weightit(f_ps, data = dat, method = \"glm\", estimand = \"ATE\", stabilize = TRUE)\nprint(bal.tab(w, stats = \"mean.diffs\", thresholds = c(m = 0.1)))  # post-weighting SMDs\n\ndes  <- svydesign(ids = ~1, weights = ~w$weights, data = dat)\nfit  <- svyglm(event ~ treat, design = des, family = quasibinomial())  # ITT-like risk model\nprint(summary(fit))                                                    # log-OR for treat\n\n## (2) Competing-risks cumulative incidence (death as competing event) --------\n## status: 0 = censored, 1 = outcome, 2 = competing death\ndat$status <- with(dat, ifelse(event == 1, 1L, ifelse(death == 1, 2L, 0L)))\n\nci <- cuminc(ftime = dat$fu_time, fstatus = dat$status, group = dat$treat)\nprint(ci$Tests)   # Gray's test comparing arm-specific cumulative incidence of the outcome\n\nfg <- crr(ftime = dat$fu_time, fstatus = dat$status,\n          cov1 = model.matrix(~ treat, dat)[, -1, drop = FALSE], failcode = 1, cencode = 0)\nprint(summary(fg))  # Fine-Gray subdistribution hazard ratio for treat",
        "description": "Same head-to-head CER analytic table as the Python block. Two complementary estimands on one cohort:\n(1) stabilized-IPTW ATE on a binary outcome via survey-weighted GLM (with balance diagnostics), and\n(2) the competing-risks cumulative-incidence (subdistribution) contrast via Fine-Gray, because all-cause death is a\ndifferential competing event in elderly claims and the cause-specific HR and the cumulative-incidence effect answer\ndifferent questions. Input data frame `dat` has: treat (0/1), event, fu_time, death (competing), x1..xk baseline covs.",
        "dependencies": [
          "WeightIt",
          "cobalt",
          "survey",
          "survival",
          "cmprsk"
        ],
        "source_citations": [
          "johnson-2009",
          "franklin-2021"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* Composite competing-risks status: 0=censored, 1=outcome, 2=competing death. */\ndata analytic;\n  set work.analytic;\n  if event = 1 then crstatus = 1;\n  else if death = 1 then crstatus = 2;\n  else crstatus = 0;\nrun;\n\n/* (a) Inverse-probability-of-treatment weighting for the ATE, with balance output. */\nproc psmatch data=analytic region=allobs;\n  class treat;\n  psmodel treat(treated='1') = x1 x2 x3 x4;        /* baseline-window covariates only */\n  assess lps var=(x1 x2 x3 x4) / weight=atewgt;     /* standardized differences pre/post */\n  output out(obs=all)=wtd atewgt=ipw;\nrun;\n\n/* IPTW outcome model on the binary endpoint (ITT-like risk contrast). */\nproc genmod data=wtd descending;\n  class treat / param=ref ref='0';\n  weight ipw;\n  model event = treat / dist=binomial link=logit;\n  estimate 'Study vs Comparator (log OR)' treat 1 -1;\nrun;\n\n/* (b) Doubly robust (AIPW) ATE on the risk-difference scale. */\nproc causaltrt data=analytic method=aipw;\n  class treat;\n  psmodel treat(event='1') = x1 x2 x3 x4;\n  model event = x1 x2 x3 x4;                         /* outcome model -> double robustness */\nrun;\n\n/* (c) Competing-risks: Fine-Gray subdistribution HR + cause-specific cumulative incidence. */\nproc phreg data=analytic;\n  class treat (ref='0');\n  model fu_time*crstatus(0) = treat / eventcode=1 ties=efron;   /* Fine-Gray for the outcome */\n  hazardratio treat / diff=ref;\nrun;\n\nproc lifetest data=analytic plots=cif(test);\n  time fu_time*crstatus(0) / eventcode=1;            /* cumulative incidence accounting for death */\n  strata treat;\nrun;",
        "description": "Head-to-head CER estimation in SAS on the same analytic dataset. Required input WORK.ANALYTIC:\n  treat   (num 0/1, 1=study drug)   event (num 0/1)   fu_time (num, days)\n  death   (num 0/1, competing)      x1..xk (baseline covariates from the lookback window only)\nThree real estimators: (a) PROC PSMATCH builds an IPTW (ATE) weight with balance diagnostics; (b) PROC CAUSALTRT\nfits a doubly robust (AIPW) ATE; (c) PROC PHREG with EVENTCODE= gives the Fine-Gray subdistribution model and\nPROC LIFETEST plots the cause-specific cumulative incidence (death as a competing risk). Requires SAS/STAT 14.2+.",
        "dependencies": [],
        "source_citations": [
          "johnson-2009",
          "franklin-2021"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Q[Comparative question: strategy A vs B for the same decision] --> EST[Name the estimand<br/>ATE vs ATT/ATO; ITT vs per-protocol]\n  EST --> TT[Pre-specify the target trial<br/>eligibility, time zero, assignment, follow-up, outcome]\n  TT --> THREAT{Dominant threat to exchangeability?}\n  THREAT -->|Confounding measured,<br/>not time-varying| PS[Propensity score or doubly robust<br/>matching / IPTW / overlap / AIPW / hd-PS]\n  THREAT -->|Time-varying confounder<br/>that is also a mediator| G[G-methods<br/>MSM-IPTW / g-estimation / g-computation<br/>+ clone-censor-weight]\n  THREAT -->|Unmeasured confounding,<br/>valid instrument exists| IV[Instrumental variable<br/>2SLS / 2SRI -> complier effect]\n  THREAT -->|Policy / natural experiment| QE[Quasi-experimental<br/>DiD / RD / synthetic control]\n  PS --> SENS[Quantitative sensitivity analysis<br/>E-value, negative controls, QBA]\n  G --> SENS\n  IV --> SENS\n  QE --> SENS\n  SENS --> DEC[Decision-ready comparative estimate<br/>reported per STaRT-RWE / HARPER]",
        "caption": "CER method-selection logic. The dominant, identifiable threat to exchangeability — and whether it is measured — routes the analyst to a propensity-score/doubly-robust, g-method, instrumental-variable, or quasi-experimental estimator, all under a pre-specified target trial and mandatory sensitivity analysis.",
        "alt_text": "Decision flowchart that starts from a comparative question, names the estimand, pre-specifies a target trial, branches on the dominant threat to exchangeability into propensity-score, g-method, instrumental-variable, or quasi-experimental analysis, then converges on sensitivity analysis and a decision-ready estimate.",
        "source_type": "illustrative",
        "source_citations": [
          "johnson-2009"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  subgraph Unadjusted [Naive comparison]\n    T1[Treatment A vs B] -->|biased| Y1[Outcome]\n    C1[Confounder by indication] --> T1\n    C1 --> Y1\n  end\n  subgraph Adjusted [CER design + analysis]\n    T2[Treatment A vs B] --> Y2[Outcome]\n    C2[Measured confounders] -.balanced by PS/g-methods.-> T2\n    C2 --> Y2\n    U[Unmeasured confounding] -->|probed by E-value / negative controls| Y2\n    U --> T2\n  end",
        "caption": "Why CER methods are needed. Naively, confounding by indication opens a backdoor path from treatment to outcome; CER design and analysis block the measured backdoor (propensity scores / g-methods) and probe the residual unmeasured path with sensitivity analysis rather than assuming it away.",
        "alt_text": "Two side-by-side causal diagrams. The naive comparison shows a confounder opening a backdoor path between treatment and outcome. The adjusted CER diagram shows measured confounders balanced by propensity-score or g-methods and unmeasured confounding probed by E-values and negative controls.",
        "source_type": "illustrative",
        "source_citations": [
          "johnson-2009"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "part_of",
        "target_slug": "cer-observational",
        "notes": "This entry is the analytic/inferential methods umbrella; cer-observational is the narrower catalog of observational study designs (ACNU, prevalent-new-user, nested case-control, self-controlled) and how to choose among them."
      },
      {
        "relation_type": "used_with",
        "target_slug": "target-trial-emulation",
        "notes": "Target-trial emulation is the organizing framework that aligns every CER design and analytic choice with the head-to-head trial being emulated."
      },
      {
        "relation_type": "used_with",
        "target_slug": "active-comparator-new-user",
        "notes": "ACNU is the default design substrate for head-to-head comparative drug questions and the usual analytic core of a two-drug target-trial emulation."
      },
      {
        "relation_type": "used_with",
        "target_slug": "propensity-score-methods-psm-iptw",
        "notes": "Propensity-score matching/weighting is the most commonly applied confounding-control tool in CER when key confounders are measured."
      },
      {
        "relation_type": "used_with",
        "target_slug": "high-dimensional-propensity-score-hdps-rwe",
        "notes": "hd-PS recovers confounding information from unspecified claims/EHR proxies, strengthening measured-confounding control in database CER."
      },
      {
        "relation_type": "used_with",
        "target_slug": "marginal-structural-models-g-methods",
        "notes": "MSMs and other g-methods are required (not merely preferred) when a covariate is both confounder and mediator of a sustained treatment strategy."
      },
      {
        "relation_type": "used_with",
        "target_slug": "g-estimation-structural-nested-models",
        "notes": "G-estimation of structural nested models handles time-varying treatments and effect modification within the CER toolkit."
      },
      {
        "relation_type": "used_with",
        "target_slug": "clone-censor-weight-per-protocol",
        "notes": "Clone-censor-weight delivers per-protocol comparative estimands for sustained/dynamic strategies under grace-period eligibility ambiguity."
      },
      {
        "relation_type": "see_also",
        "target_slug": "instrumental-variables-pharmacoepi-rwe",
        "notes": "IV is the CER route to unmeasured-confounding identification when a valid instrument exists, at the cost of a fragile exclusion restriction and a compliers-only estimand."
      },
      {
        "relation_type": "see_also",
        "target_slug": "difference-in-differences-staggered-adoption-rwe",
        "notes": "DiD is the quasi-experimental CER tool when a policy or natural experiment creates parallel-trends-credible variation in exposure."
      },
      {
        "relation_type": "requires",
        "target_slug": "estimands-ate-att-intercurrent-events-rwe",
        "notes": "The comparative estimand (ATE vs ATT/ATO, ITT vs per-protocol, intercurrent-event handling) must be named before any CER estimator is chosen."
      },
      {
        "relation_type": "used_with",
        "target_slug": "competing-risks-cause-specific-fine-gray-rwe",
        "notes": "Death and other competing events are differential by exposure in elderly claims; CER outcome analyses must separate cause-specific and subdistribution estimands."
      },
      {
        "relation_type": "used_with",
        "target_slug": "e-value-sensitivity-analysis",
        "notes": "The E-value quantifies how strong unmeasured confounding would need to be to explain away a CER estimate."
      },
      {
        "relation_type": "used_with",
        "target_slug": "negative-control-outcomes-rwe",
        "notes": "Negative-control outcomes (and empirical calibration) detect residual systematic error in CER comparative estimates."
      },
      {
        "relation_type": "used_with",
        "target_slug": "cox-ph-regression",
        "notes": "Weighted or time-dependent Cox models are a core analytic engine for time-to-event comparative endpoints."
      },
      {
        "relation_type": "used_with",
        "target_slug": "predictive-and-causal-ml-models-rwe",
        "notes": "Causal ML (double ML, TMLE, causal forests) extends CER to high-dimensional confounding and treatment-effect heterogeneity."
      }
    ],
    "aliases": [
      "CER methods",
      "comparative effectiveness research",
      "comparative effectiveness analytic methods",
      "patient-centered outcomes research methods"
    ],
    "complexity": "advanced",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "competing-risks-cause-specific-fine-gray-rwe",
    "name": "Competing Risks (Cause-Specific Hazard, Cumulative Incidence, and Fine-Gray)",
    "short_definition": "A family of time-to-event methods for settings where a competing event (typically death, but also treatment switch, transplant, or revision) prevents or alters the meaning of the event of interest, built on three quantities the analyst must keep distinct -- the cause-specific hazard, the cumulative incidence function (CIF), and the Fine-Gray subdistribution hazard.",
    "long_description": "A **competing risk** is an event whose occurrence precludes the event of interest (EOI) from ever happening, or\nfundamentally changes its probability and interpretation: death before a hospitalization, next-line therapy before\ndocumented progression, transplant before dialysis-related death, prosthesis revision before a readmission. The defining\nfeature is that, once the competing event occurs, the patient is no longer *at risk* for the EOI in any meaningful sense —\na dead patient cannot be hospitalized. The single most consequential and most common error in real-world time-to-event\nanalysis is treating a competing event as if it were ordinary independent (administrative) censoring, which is what every\ndefault `survfit`/Kaplan-Meier/Cox call does unless you intervene.\n\n**Core estimand distinction — three quantities, three questions.**\n- **Cause-specific hazard.** Among patients still free of *both* the EOI and any competing event, the instantaneous rate\n  at which the EOI occurs. It answers an *etiologic/mechanistic* question (\"does the drug change the biology of\n  progression among those still able to progress?\"). Competing events are treated as censoring in the partial likelihood.\n  A standard Cox model with the competing event coded as censored estimates the cause-specific hazard ratio.\n- **Cumulative incidence function (CIF), a.k.a. subdistribution function.** The actual *probability* that a patient\n  experiences the EOI before any competing event by time t. This is the decision-relevant, patient- and payer-facing\n  number — the quantity that belongs in a label, an HTA dossier, or a risk calculator. It is estimated non-parametrically\n  by the Aalen-Johansen estimator. Crucially, CIF_EOI(t) + CIF_competing(t) + S(t) = 1, so the EOI incidence is bounded\n  by the room the competing event leaves it. 
The complement of Kaplan-Meier (1 − KM) **over-states** this probability\n  because it implicitly assumes patients removed by the competing event could still go on to have the EOI.\n- **Fine-Gray subdistribution hazard.** A proportional-hazards regression whose coefficient maps *monotonically* onto the\n  CIF: a subdistribution HR > 1 means a higher cumulative incidence. It achieves this by keeping subjects who experience\n  the competing event in an \"extended\" risk set (with decreasing weight over time) rather than removing them. The price is\n  interpretive: the subdistribution HR is **not** a biological rate and must never be described as \"the hazard among those\n  still able to have the event.\"\n\n**The trap that makes all three necessary.** The cause-specific HR and the subdistribution HR can point in *opposite*\ndirections for the same exposure. A drug that strongly reduces non-cancer mortality leaves more patients alive and at risk\nto progress; its cause-specific HR for progression may be ~1, while its subdistribution HR for progression is > 1 (more\nprogression *events accrue* simply because fewer people die first). Neither is \"wrong\" — they answer different questions.\nThis is why best practice (Austin & Fine 2017; Latouche et al.) is to **pre-specify the primary estimand in the protocol/SAP**\nand to **report the CIF curves plus both regression models** so reviewers can see the mechanism.\n\n**Pros, cons, and trade-offs.**\n- **vs. 1 − Kaplan-Meier / Cox with competing events censored (the naive default).** The competing-risks approach yields\n  *honest absolute risk*: it never lets the EOI incidence exceed the room left by competing mortality, so it avoids\n  overstating benefit (e.g., \"stroke prevention\" in a frail population that mostly dies of other causes) or harm. Cost:\n  more complex to code and to explain, and the numbers are typically *lower* than the 1−KM curves clinicians expect, which\n  generates pushback. 
**Prefer competing-risks** whenever a competing event is common (rule of thumb: cumulative competing\n  incidence > ~10%) or differs across exposure arms. A plain Cox cause-specific analysis is defensible *only* when the\n  competing event is rare and balanced and the question is genuinely about the conditional rate.\n- **Cause-specific Cox vs. Fine-Gray.** Cause-specific is the right tool for *etiology* and is unbiased for the\n  rate among those at risk; it is trivial to fit (any Cox engine) and its covariate effects on multiple causes combine\n  coherently. Fine-Gray is the right tool when the deliverable is the *probability* you will plot and quote, because its\n  coefficient is tied to that curve. The trade-off: Fine-Gray HRs lose mechanistic meaning and can be unstable when the\n  competing event dominates; cause-specific HRs require a separate standardization/Aalen-Johansen step to recover absolute\n  risk. Modern guidance: report cause-specific HRs for *each* cause **and** the CIF; add Fine-Gray when a single\n  covariate-adjusted statement about the EOI probability is the headline.\n- **vs. RMST / restricted mean time lost.** The CIF answers \"what fraction by time t\"; RMST-type summaries answer \"how\n  much event-free time\" in interpretable time units and can be extended to competing risks (mean time lost to each cause).\n  They are complements, often reported together for HTA.\n\n**When to use.** Any RWE time-to-event analysis in a population with non-trivial competing mortality or terminal\nintercurrent events: oncology (progression/discontinuation competing with death and next-line therapy), cardiology and\nnephrology in the elderly (non-CV death), device/procedure studies (revision, reoperation, death), transplant, and any\nHTA or label-supporting estimate of *absolute* incidence. 
It is the default whenever the deliverable is a probability and\ndeath is on the table.\n\n**When NOT to use — and when censoring the competing event is actively misleading.**\n- **Do not 1 − KM a competing event.** If you report a 1 − KM curve for hospitalization while ignoring that 18% of the\n  cohort died first, the curve is an artifact of an impossible counterfactual world and will overstate incidence — the more\n  so the higher and more differential the competing mortality. This is the single dangerous mistake the method exists to\n  prevent.\n- **Do not interpret a subdistribution HR as a rate.** Reporting \"the Fine-Gray HR shows the drug halves the hazard of\n  progression\" is wrong; it concerns the cumulative incidence, not a hazard among the at-risk.\n- **Do not Fine-Gray when the competing event itself is the scientific target of a mechanistic claim** — use\n  cause-specific models for each cause and let the CIFs carry the absolute story.\n- **Do not conflate informative administrative censoring with competing risks.** Disenrollment that depends on prognosis is\n  *dependent censoring*, handled by IPCW, not by recoding it as a competing event; mixing the two double-counts the\n  correction.\n\n**Data-source operational depth.**\n- **Claims (FFS vs. Medicare Advantage).** Build *mutually exclusive* first-event dates for the EOI and each competing\n  event on the same observable follow-up window, then code one event type per person from `min(eoi_date, competing_date,\n  censor_date)`. Death is the dominant competing risk in elderly/oncology RWE and is the hardest to get right: discharge\n  status (`disch_status`) captures only in-hospital death and undercounts out-of-hospital death — link to a mortality file\n  (Medicare enrollment/EDB date of death, SSDMF where still available, or state vital records). 
**Medicare Advantage\n  person-time lacks adjudicated FFS claims**, so both EOI events *and* deaths are differentially missing for MA enrollees;\n  either restrict to FFS A/B (plus D for exposure) or use an MA-aware death source, because differential ascertainment of\n  the competing event by arm is exactly what flips the cause-specific vs. subdistribution contrast. Disenrollment is\n  *administrative censoring*, handled separately (IPCW if informative), never recoded as a competing event.\n- **EHR.** Progression/recurrence dates frequently live in notes or tumor-registry linkage, not structured fields; death is\n  badly undercaptured because patients who die out-of-system simply stop appearing (loss to follow-up masquerading as\n  event-free survival). Visit-driven capture adds interval censoring on top of the competing-risk structure. Link to claims\n  and a death index before trusting any CIF.\n- **Registry.** Usually the best source for adjudicated cause-specific death, recurrence, and progression; still link to\n  claims for complete pharmacy exposure timing and out-of-system HCRU that competes.\n- **Linked claims–EHR–vital records.** The ideal substrate (severity + completeness + reliable mortality) but linkage\n  selection and order/fill/service date discrepancies must be reconciled before first-event coding, or the \"first event\"\n  can be assigned to the wrong cause.\n\n**Worked claims example.** Question: 12- and 24-month cumulative incidence of a first inpatient febrile-neutropenia (FN)\nadmission after initiating a new line of cytotoxic chemotherapy, comparing regimen A vs. regimen B, in a Medicare\nFFS oncology cohort. (1) **Eligibility:** age ≥66, a qualifying cancer diagnosis, and 365 days of continuous Part A/B\nenrollment before the index regimen (washout that also makes both arms incident users of that line). (2) **Index/time\nzero:** date of the first administration claim (J-code) for regimen A or B; assign the arm from that claim. 
(3)\n**Event of interest:** first inpatient admission with an FN diagnosis (`dx` in the validated FN code set) on a facility\nclaim, `event_type = 1`. (4) **Competing event:** all-cause death from the Medicare enrollment date-of-death field —\n*not* discharge status alone, which misses out-of-hospital deaths — `event_type = 2`. (5) **First-event coding:** for each\nperson, `fu_time = min(fn_date, death_date, censor_date) − index_date` and `event_type` is the cause that achieved that\nminimum; ties broken by a pre-specified rule. (6) **Censoring (`event_type = 0`):** disenrollment from FFS, transition to\nMA-only (FFS claims stop, so FN can no longer be observed — treat as administrative censoring, *not* as event-free), and\nend of data. (7) **Estimate:** arm-stratified Aalen-Johansen CIF reported at 12 and 24 months with 95% CIs; a 1 − KM curve\nfor FN would over-state incidence because the ~15–25% who die first cannot then be hospitalized. (8) **Regression:**\ncause-specific Cox (death censored) for the etiologic effect on FN rate *and* a Fine-Gray model for the effect on the FN\nprobability; if regimen A causes more early death, expect the two HRs to diverge, and the CIF curves explain why. (9)\n**Sensitivity:** vary the FN code set, the death source (enrollment field vs. claims-based), and tie-breaking, and check\nwhether differential competing mortality across arms drives any divergence.\n\n**Interpreting the output**\n\nFor febrile neutropenia (FN) in a chemotherapy trial: cause-specific HR = 0.68 (arm A vs B) and Fine-Gray subdistribution HR = 0.79 for the same endpoint in the same dataset.\n\n*Formal interpretation.* The cause-specific HR (0.68) removes patients who die before FN from the risk set at each event time and estimates how quickly FN accrues among survivors; it answers the etiologic question of whether arm A genuinely reduces the biological rate of FN. 
The Fine-Gray subdistribution HR (0.79) comes from a model that keeps all patients, including those who died, in a subdistribution risk set and models the cumulative incidence function directly; it answers the predictive question of whether, and in which direction, the absolute probability of ever experiencing FN differs on arm A. The two quantities diverge whenever competing mortality differs between arms. Neither is wrong; they answer different scientific questions.\n\n*Practical interpretation.* If arm A also causes more early death, dead patients cannot develop FN, which compresses the FN cumulative incidence on arm A even if the underlying biology is unchanged and would tend to pull the subdistribution HR below the cause-specific HR. In this example the subdistribution HR (0.79) is closer to 1 than the cause-specific HR (0.68), which is more consistent with similar or lower competing mortality on arm A, so inspect the death CIF by arm before explaining the gap. Report the Aalen-Johansen CIF alongside both HRs. For clinical decision-making about infection risk, use the Fine-Gray model and CIF; for mechanistic understanding of whether the drug suppresses neutropenia biology, use the cause-specific HR.",
    "primary_category": "Inferential_Statistics",
    "tags": [
      "competing-risks",
      "fine-gray",
      "subdistribution-hazard",
      "cause-specific-hazard",
      "cumulative-incidence-function",
      "aalen-johansen",
      "survival-analysis",
      "competing-mortality",
      "oncology-rwe",
      "hta"
    ],
    "applies_to_study_types": [
      "cohort_retrospective",
      "cohort_prospective",
      "claims_analysis",
      "ehr_study",
      "disease_registry",
      "target_trial_emulation"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1080/01621459.1999.10474144",
        "url": "https://doi.org/10.1080/01621459.1999.10474144",
        "citation_text": "Fine JP, Gray RJ. A proportional hazards model for the subdistribution of a competing risk. Journal of the American Statistical Association. 1999;94(446):496-509.",
        "year": 1999,
        "authors_short": "Fine & Gray",
        "notes": "The original subdistribution-hazards model that regresses the cumulative incidence function directly; the source of the \"Fine-Gray\" name and the extended-risk-set construction."
      },
      {
        "role": "explain",
        "doi": "10.1161/CIRCULATIONAHA.115.017719",
        "url": "https://doi.org/10.1161/CIRCULATIONAHA.115.017719",
        "citation_text": "Austin PC, Lee DS, Fine JP. Introduction to the analysis of survival data in the presence of competing risks. Circulation. 2016;133(6):601-609.",
        "year": 2016,
        "authors_short": "Austin, Lee & Fine",
        "notes": "The standard applied tutorial distinguishing Kaplan-Meier, the cumulative incidence function, cause-specific hazards, and the Fine-Gray subdistribution hazard, with clear guidance on when each is appropriate."
      },
      {
        "role": "demonstrate",
        "doi": "10.1093/aje/kwp107",
        "url": "https://doi.org/10.1093/aje/kwp107",
        "citation_text": "Lau B, Cole SR, Gange SJ. Competing risk regression models for epidemiologic data. American Journal of Epidemiology. 2009;170(2):244-256.",
        "year": 2009,
        "authors_short": "Lau, Cole & Gange",
        "notes": "Worked epidemiologic demonstration contrasting cause-specific and Fine-Gray regression on cohort data, including how the two coefficients diverge and how to interpret each for incidence vs. etiology."
      },
      {
        "role": "use",
        "doi": "10.1002/sim.7501",
        "url": "https://doi.org/10.1002/sim.7501",
        "citation_text": "Austin PC, Fine JP. Practical recommendations for reporting Fine-Gray model analyses for competing risk data. Statistics in Medicine. 2017;36(27):4391-4400.",
        "year": 2017,
        "authors_short": "Austin & Fine",
        "notes": "Reporting checklist for Fine-Gray analyses (report the CIF, both hazard scales, and avoid mislabeling the subdistribution HR as a rate) — directly operationalizable in an SAP and manuscript methods section."
      }
    ],
    "plain_language_summary": "Competing-risks analysis asks: what is the true probability that a patient experiences a specific bad outcome — like a severe infection — before something else (such as death) gets in the way first? Standard survival curves pretend that patients who die can still go on to have the infection later, which inflates the estimated risk. Competing-risks methods keep track of two types of outcomes at once and report the honest probability that the infection actually occurs. Two regression tools help: one focuses on the rate of the infection among patients who are still alive and infection-free (cause-specific hazard), and one ties its estimate directly to that honest probability curve (Fine-Gray model).",
    "key_terms": [
      {
        "term": "competing risk",
        "definition": "An event — most often death, but also treatment switch or transplant — that happens first and makes the outcome you are studying impossible or meaningless to observe afterward."
      },
      {
        "term": "cause-specific hazard",
        "definition": "The instantaneous rate at which the outcome of interest occurs among patients who have not yet had the outcome or any competing event; it answers an etiologic question about the biological process."
      },
      {
        "term": "subdistribution hazard (Fine-Gray)",
        "definition": "A regression coefficient that maps directly onto the cumulative incidence curve; a subdistribution hazard ratio greater than 1 means higher probability of the outcome, but it is not a biological rate."
      },
      {
        "term": "cumulative incidence function",
        "definition": "The actual probability that a patient experiences the outcome of interest before any competing event by a given point in time; this is the honest, decision-relevant number reported in health technology assessments."
      },
      {
        "term": "censoring",
        "definition": "When a patient leaves the study before having any event — for example because the data period ends — so we know only that they were event-free up to their last observed day."
      }
    ],
    "worked_example": {
      "scenario": "Five Medicare patients start a new chemotherapy regimen on Day 0. We want to know the 6-month (180-day) probability of a first febrile-neutropenia (FN) hospitalization. Some patients have the FN admission (event of interest); others die before any FN occurs (competing event); one is still event-free at day 180 (censored). We compare the naive Kaplan-Meier estimate — which wrongly treats death as ordinary censoring — with the correct cumulative incidence function that accounts for competing mortality.",
      "dataset": {
        "caption": "One-row-per-patient analytic table: follow-up time and outcome type for five Medicare FFS patients.",
        "columns": [
          "person_id",
          "fu_days",
          "event_type",
          "event_label"
        ],
        "rows": [
          [
            1001,
            45,
            1,
            "FN hospitalization"
          ],
          [
            1002,
            60,
            2,
            "Death (competing)"
          ],
          [
            1003,
            90,
            1,
            "FN hospitalization"
          ],
          [
            1004,
            120,
            2,
            "Death (competing)"
          ],
          [
            1005,
            180,
            0,
            "Censored (end of window)"
          ]
        ]
      },
      "steps": [
        "Order all five patients by their follow-up time: 45, 60, 90, 120, 180 days.",
        "At each event time, note how many patients are still in the risk set (have not yet had any event or been censored).",
        "Naive Kaplan-Meier ignores the event type and treats both FN and death as 'events' for a combined curve, or — the common error — treats death as censoring when estimating FN risk.",
        "If death is censored naively: at day 45 (patient 1001, FN), risk set = 5, KM drops by 1/5 = 0.20; at day 90 (patient 1003, FN), risk set appears to be 3 (patients 1002 and 1004 were wrongly 'censored'), KM drops by 1/3 = 0.33.",
        "Naive 1-minus-KM probability of FN by day 180 = 1 - (4/5)(2/3) = 1 - 0.533 = 0.467, or about 47%.",
        "The correct cumulative incidence function (CIF) uses the Aalen-Johansen estimator, which keeps competing deaths in the denominator of the risk set but does NOT let them count as FN events.",
        "At day 45 (FN): risk set = 5, CIF increases by (1/5) x current overall survival = 0.20 x 1.0 = 0.200.",
        "At day 60 (death): risk set = 4, no change to the FN CIF; overall survival drops to 3/4 x 0.80 = 0.600.",
        "At day 90 (FN): risk set = 3, CIF increases by (1/3) x 0.600 = 0.200; cumulative FN CIF = 0.200 + 0.200 = 0.400.",
        "At day 120 (death): no change to FN CIF; overall survival drops further. Patient 1005 censored at day 180."
      ],
      "result": "Correct cumulative incidence of FN by day 180 = 0.40 (40%). Naive 1-minus-KM = 0.47 (47%). The naive approach overstates FN risk by 7 percentage points because it pretends the two patients who died could still go on to have an FN admission — but dead patients cannot be hospitalized.",
      "timeline_spec": {
        "title": "Competing risks: FN hospitalization vs death over 180-day follow-up (5 patients)",
        "window": {
          "start_day": 0,
          "end_day": 180,
          "label": "180-day observation window (chemotherapy follow-up)"
        },
        "events": [
          {
            "label": "Pt 1001: FN admission (event of interest)",
            "start_day": 0,
            "end_day": 45,
            "marker": "FN hospitalization at day 45"
          },
          {
            "label": "Pt 1002: Death (competing event)",
            "start_day": 0,
            "end_day": 60,
            "marker": "Death at day 60"
          },
          {
            "label": "Pt 1003: FN admission (event of interest)",
            "start_day": 0,
            "end_day": 90,
            "marker": "FN hospitalization at day 90"
          },
          {
            "label": "Pt 1004: Death (competing event)",
            "start_day": 0,
            "end_day": 120,
            "marker": "Death at day 120"
          },
          {
            "label": "Pt 1005: Censored",
            "start_day": 0,
            "end_day": 180,
            "marker": "Censored at day 180 (end of window)"
          }
        ],
        "spans": [
          {
            "kind": "followup",
            "start_day": 0,
            "end_day": 45,
            "label": "Pt 1001 at risk"
          },
          {
            "kind": "covered",
            "start_day": 45,
            "end_day": 45,
            "label": "FN event (CIF increases to 0.20)"
          },
          {
            "kind": "followup",
            "start_day": 0,
            "end_day": 60,
            "label": "Pt 1002 at risk"
          },
          {
            "kind": "gap",
            "start_day": 60,
            "end_day": 60,
            "label": "Death: competing event (CIF unchanged, denominator shrinks)"
          },
          {
            "kind": "followup",
            "start_day": 0,
            "end_day": 90,
            "label": "Pt 1003 at risk"
          },
          {
            "kind": "covered",
            "start_day": 90,
            "end_day": 90,
            "label": "FN event (CIF increases to 0.40)"
          },
          {
            "kind": "followup",
            "start_day": 0,
            "end_day": 120,
            "label": "Pt 1004 at risk"
          },
          {
            "kind": "gap",
            "start_day": 120,
            "end_day": 120,
            "label": "Death: competing event (CIF unchanged)"
          },
          {
            "kind": "followup",
            "start_day": 0,
            "end_day": 180,
            "label": "Pt 1005 at risk then censored"
          }
        ],
        "result": {
          "label": "CIF (correct) = 0.40 vs naive 1-minus-KM = 0.47 at day 180",
          "cif_value": 0.4,
          "naive_km_value": 0.47,
          "overstatement": 0.07
        },
        "caption": "Each horizontal bar shows one patient's follow-up from Day 0 to their first event or censoring. Orange markers indicate febrile-neutropenia admissions (event of interest); red markers indicate deaths (competing events). The correct cumulative incidence function (CIF = 0.40) is lower than the naive Kaplan-Meier estimate (0.47) because deaths are kept in the denominator of the risk set without inflating the FN numerator.",
        "alt_text": "Five horizontal patient timelines from Day 0 to Day 180. Patients 1001 and 1003 end with orange FN-event markers at days 45 and 90. Patients 1002 and 1004 end with red death markers at days 60 and 120. Patient 1005 reaches day 180 with a censoring mark. Below the timelines, a stepped CIF curve rises to 0.40 while a dashed naive KM curve rises higher to 0.47, illustrating the overstatement when deaths are incorrectly treated as censored."
      }
    },
    "prerequisites": [
      "cumulative-incidence-risk-rwe",
      "cox-ph-regression",
      "attrition-and-loss-to-follow-up-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Cumulative incidence function (Aalen-Johansen) as the primary descriptive",
        "description": "Non-parametric estimate of the probability the event of interest occurs before any competing event, reported by arm at clinically relevant horizons (12/24/36 months) with confidence intervals. The decision-relevant absolute risk for HTA, labeling, and patient communication.",
        "edge_cases": [
          "When death is a competing event the CIF for the EOI is lower than 1 - KM; this is the honest number, not an error.",
          "With heavy ties (claims dates rounded to admission day) the Aalen-Johansen jumps can be sensitive to tie handling; pre-specify."
        ],
        "data_source_notes": "claims: define mutually exclusive first-event dates; death usually requires external mortality linkage rather than discharge status. Report CIF with pointwise CIs at fixed horizons."
      },
      {
        "name": "Cause-specific hazards (Cox) regression",
        "description": "Cox model for the EOI with all competing events coded as censored, estimating the cause-specific hazard ratio for the etiologic/mechanistic question among those still at risk.",
        "edge_cases": [
          "The effect is conditional on not yet having had a competing event; when competing mortality differs by arm this conditional contrast does not translate directly into absolute risk and must be paired with the CIF."
        ],
        "data_source_notes": "Use the same covariate set as a standard Cox; code competing events as a distinct censoring value so only the EOI counts as the event. Fit a parallel model for the competing event to interpret the full picture."
      },
      {
        "name": "Fine-Gray subdistribution hazards regression",
        "description": "Proportional-hazards model on the subdistribution that maps directly to the cumulative incidence scale; subjects with a competing event remain in a time-decaying extended risk set.",
        "edge_cases": [
          "The coefficient is a subdistribution HR, not a biological rate; never describe it as \"the hazard among those who could still have the event.\"",
          "Unstable or hard to interpret when the competing event dominates incidence; report cause-specific results alongside."
        ],
        "data_source_notes": "R: cmprsk::crr or riskRegression::FGR / finegray + coxph; Python: lifelines does not fit Fine-Gray directly (use survival::finegray weights or call R); SAS: PROC PHREG with eventcode= (SAS/STAT 14.2+, SAS 9.4 M3+)."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "1 - Kaplan-Meier or Cox with competing events treated as independent censoring",
        "pros_of_this": "Produces honest absolute incidence bounded by competing mortality; prevents overstating benefit or harm in populations with substantial competing death (elderly, advanced oncology, multimorbid).",
        "cons_of_this": "More complex to code and communicate; CIF values are lower than the 1 - KM curves clinicians expect and can draw pushback.",
        "when_to_prefer": "Whenever a competing event is common (cumulative competing incidence above roughly 10%) or differs by arm, or whenever the deliverable is an absolute probability."
      },
      {
        "compared_to": "Cause-specific hazards (Cox) regression",
        "pros_of_this": "The Fine-Gray coefficient maps directly to the cumulative incidence curve you plot and quote; one covariate-adjusted statement about EOI probability.",
        "cons_of_this": "Loses mechanistic/rate interpretation and can be unstable when the competing event dominates; cause-specific models are simpler and combine coherently across causes.",
        "when_to_prefer": "When a single adjusted statement about the EOI probability is the headline; use cause-specific models when the question is etiologic rate among those still at risk. Best practice reports both plus the CIF."
      },
      {
        "compared_to": "RMST / restricted mean time lost to each cause",
        "pros_of_this": "The CIF answers probability-by-time-t questions directly and is the standard absolute-risk currency for labels and HTA.",
        "cons_of_this": "Does not summarize \"event-free time gained\" in intuitive time units the way RMST/mean-time-lost does.",
        "when_to_prefer": "When the deliverable is a probability at a horizon rather than a time summary; the two are often reported together."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Create mutually exclusive first-event dates for the EOI and each competing event on the same observable window, then set one event type per person from min(eoi_date, competing_date, censor_date). Death is usually the dominant competing event; discharge status undercounts out-of-hospital death, so link to a mortality file. Restrict to FFS or use an MA-aware death source because Medicare Advantage person-time lacks adjudicated FFS claims and differentially hides both EOI events and deaths. Disenrollment is administrative censoring (IPCW if informative), not a competing event.",
      "ehr": "Progression, recurrence, and death dates often live in notes or registry linkage, not structured fields; out-of-system death is undercaptured and can masquerade as event-free survival. Visit gaps add interval censoring on top of the competing-risk structure. Link to claims and a death index before trusting any CIF.",
      "registry": "Often the best source for adjudicated cause-specific death, recurrence, and progression. Link to claims for complete pharmacy exposure timing and out-of-system HCRU events that compete.",
      "linked": "Linked claims-EHR-vital-records is the ideal substrate but linkage selection and order/fill/service date discrepancies must be reconciled before first-event coding, or the first event is assigned to the wrong cause."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\nfrom lifelines import AalenJohansenFitter, CoxPHFitter\n\nEOI = 1            # event of interest\nHORIZONS = [365, 730]   # 12 and 24 months, in days\n\ndef cif_by_arm(cohort: pd.DataFrame, horizons=HORIZONS) -> pd.DataFrame:\n    \"\"\"Aalen-Johansen CIF of the event of interest, per arm, at fixed horizons.\n\n    The CIF (not 1 - KM) is the honest absolute probability: competing events\n    are kept out of the EOI numerator AND keep their place in the denominator,\n    so EOI incidence can never exceed the room competing mortality leaves it.\n    \"\"\"\n    rows = []\n    for arm, g in cohort.groupby(\"arm\"):\n        ajf = AalenJohansenFitter(calculate_variance=True)\n        # event_of_interest=1 tells lifelines which code is the EOI; all other\n        # non-zero codes are handled as competing (not as censoring).\n        ajf.fit(g[\"fu_time\"], g[\"event_type\"], event_of_interest=EOI)\n        cif = ajf.cumulative_density_\n        ci = ajf.confidence_interval_\n        for h in horizons:\n            # step function: value at the last observed time <= horizon\n            at = cif.loc[cif.index <= h]\n            lo = ci.loc[ci.index <= h]\n            rows.append({\n                \"arm\": arm, \"horizon_days\": h,\n                \"cif_eoi\": float(at.iloc[-1, 0]) if len(at) else 0.0,\n                \"cif_lower\": float(lo.iloc[-1, 0]) if len(lo) else 0.0,\n                \"cif_upper\": float(lo.iloc[-1, 1]) if len(lo) else 0.0,\n            })\n    return pd.DataFrame(rows)\n\ndef cause_specific_cox(cohort: pd.DataFrame, covariates: list[str]) -> CoxPHFitter:\n    \"\"\"Cause-specific HR for the EOI: code competing events as censored (0),\n    EOI as the event (1). 
This is the etiologic rate among those still at risk,\n    NOT the probability scale -- pair it with cif_by_arm().\"\"\"\n    d = cohort.copy()\n    d[\"event\"] = (d[\"event_type\"] == EOI).astype(int)  # competing -> 0 (censored)\n    cph = CoxPHFitter()\n    cph.fit(d[[\"fu_time\", \"event\"] + covariates], duration_col=\"fu_time\", event_col=\"event\")\n    return cph",
        "description": "Arm-stratified Aalen-Johansen cumulative incidence from a claims-shaped, one-row-per-person analytic table.\nRequired input (already cleaned, mutually-exclusive first-event coding done upstream):\n  cohort : person_id, arm (str), fu_time (days from index to first event/censor, float),\n           event_type (int) where 0 = censored/administrative, 1 = event of interest,\n                                  2 = competing event (death / switch / revision)\nReports the CIF for the event of interest by arm at the requested horizons with bootstrap-free pointwise CIs from\nAalenJohansenFitter. lifelines does NOT fit Fine-Gray; for the subdistribution model use R (survival::finegray + coxph or\ncmprsk::crr) or SAS PROC PHREG eventcode=. Cause-specific Cox is shown for the etiologic contrast.",
        "dependencies": [
          "pandas",
          "lifelines"
        ],
        "source_citations": [
          "austin-2016"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(survival)\nlibrary(cmprsk)\n\n# event_type must be a factor with the censoring level first for finegray():\nd$event_f <- factor(d$event_type, levels = c(0, 1, 2),\n                    labels = c(\"censor\", \"eoi\", \"competing\"))\n\n## 1. Non-parametric CIF by arm (Aalen-Johansen). multi-state survfit handles competing risks.\naj <- survfit(Surv(fu_time, event_f) ~ arm, data = d)\nprint(summary(aj, times = c(365, 730)))   # 12- and 24-month CIF, with CIs\n\n## 2. Cause-specific Cox for the EOI: competing events are censored (status == \"eoi\" only).\ncs_eoi <- coxph(Surv(fu_time, event_f == \"eoi\") ~ arm + age + sex, data = d)\nsummary(cs_eoi)        # cause-specific HR (etiologic rate among those at risk)\n\n## 3. Fine-Gray subdistribution model via finegray() + coxph().\nfg_data <- finegray(Surv(fu_time, event_f) ~ ., data = d, etype = \"eoi\")\nfg <- coxph(Surv(fgstart, fgstop, fgstatus) ~ arm + age + sex,\n            weights = fgwt, data = fg_data)\nsummary(fg)            # subdistribution HR -> maps to the CIF scale, NOT a rate\n\n## 3b. Equivalent with cmprsk::crr (failcode = EOI, cencode = censoring code).\ncovs <- model.matrix(~ arm + age + sex, data = d)[, -1]\nfg2 <- crr(ftime = d$fu_time, fstatus = d$event_type,\n           cov1 = covs, failcode = 1, cencode = 0)\nsummary(fg2)",
        "description": "Cumulative incidence (Aalen-Johansen), cause-specific Cox, and the Fine-Gray subdistribution model on a one-row-per-person\nclaims table. Required input columns:\n  d$fu_time     numeric, days from index to first event or censor\n  d$event_type  factor/int: 0 = censored, 1 = event of interest, 2 = competing (death/switch)\n  d$arm         factor: treatment arm\n  plus baseline covariates (e.g., age, sex, comorbidity score)\nFine-Gray is fit two equivalent ways: survival::finegray (builds the weighted/extended risk set, then coxph) and\ncmprsk::crr. Always plot the CIF curves next to any subdistribution HR (Austin & Fine 2017).",
        "dependencies": [
          "survival",
          "cmprsk"
        ],
        "source_citations": [
          "lau-2009",
          "austin-2017"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* 1. Non-parametric CIF (Aalen-Johansen) by arm. plots=cif gives the\n   cumulative incidence for EACH non-zero event code; event=1 is the EOI,\n   event=2 is the competing event kept in the risk set for the CIF. */\nproc lifetest data=work.analytic plots=cif(test) timelist=365 730;\n   time fu_time*event(0) / eventcode=1;   /* CIF of the EOI at 12 and 24 mo */\n   strata arm;\nrun;\n\n/* 2. Cause-specific Cox for the EOI: censor BOTH admin (0) and competing (2),\n   so only event=1 contributes -- the etiologic rate among those at risk. */\nproc phreg data=work.analytic;\n   class arm (ref='0') sex (ref='0');\n   model fu_time*event(0 2) = arm age sex / ties=efron rl;\n   hazardratio arm / diff=ref;            /* cause-specific HR */\nrun;\n\n/* 3. Fine-Gray subdistribution model for the CIF of the EOI.\n   eventcode=1 switches PHREG to the subdistribution hazard; competing events\n   are retained in the (weighted) risk set rather than censored. The HR maps to\n   the cumulative incidence scale and is NOT a biological rate. */\nproc phreg data=work.analytic;\n   class arm (ref='0') sex (ref='0');\n   model fu_time*event(0) = arm age sex / eventcode=1 rl;\n   hazardratio arm / diff=ref;            /* subdistribution HR */\nrun;\n\n/* 4. Event-count and crude-rate audit table -- always produced alongside the\n   survival output so reviewers can see how many EOI vs competing events occurred. */\nproc sql;\n   create table event_summary as\n   select arm,\n          count(*)                                          as n,\n          sum(case when event=1 then 1 else 0 end)          as eoi_n,\n          sum(case when event=2 then 1 else 0 end)          as competing_n,\n          sum(case when event=0 then 1 else 0 end)          as censored_n,\n          sum(fu_time)/365.25                               as person_years\n   from work.analytic\n   group by arm;\nquit;",
        "description": "Cumulative incidence (PROC LIFETEST plots=cif), cause-specific Cox, and the Fine-Gray subdistribution model (PROC PHREG\neventcode=, SAS/STAT 14.2+ / SAS 9.4 M3+) on a one-row-per-person analytic dataset. Required input (work.analytic):\n  person_id   patient key\n  fu_time     days from index to first event or censor\n  event       0 = censored/administrative, 1 = event of interest, 2 = competing (death/switch)\n  arm         0/1 treatment indicator; age, sex, comorbidity covariates\nFirst-event coding (fu_time = min over EOI/competing/censor; event = the cause achieving the minimum) must be done in data\nmanagement before this step. Always pair the regression output with the CIF curves (Austin & Fine 2017).",
        "dependencies": [],
        "source_citations": [
          "austin-2017"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": "competing-risks-cause-specific-fine-gray-rwe-timeline.svg",
        "mermaid": null,
        "caption": "Each horizontal bar shows one patient's follow-up from Day 0 to their first event or censoring. Orange markers indicate febrile-neutropenia admissions (event of interest); red markers indicate deaths (competing events). The correct cumulative incidence function (CIF = 0.40) is lower than the naive Kaplan-Meier estimate (0.47) because deaths are kept in the denominator of the risk set without inflating the FN numerator.",
        "alt_text": "Five horizontal patient timelines from Day 0 to Day 180. Patients 1001 and 1003 end with orange FN-event markers at days 45 and 90. Patients 1002 and 1004 end with red death markers at days 60 and 120. Patient 1005 reaches day 180 with a censoring mark. Below the timelines, a stepped CIF curve rises to 0.40 while a dashed naive KM curve rises higher to 0.47, illustrating the overstatement when deaths are incorrectly treated as censored.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  T0[Time zero / index date<br/>at risk for the event of interest] -->|Event of interest first| EOI[EOI observed<br/>CIF for the EOI increases]\n  T0 -->|Competing event first<br/>death / switch / revision| Comp[No longer at risk for the EOI]\n  T0 -->|Neither by end of data<br/>disenroll / data end| Cen[Administrative censoring]\n  Comp --> Csh[Cause-specific Cox: treat as CENSORED]\n  Comp --> Fg[Fine-Gray: kept in extended risk set<br/>so it counts against the EOI subdistribution]\n  EOI --> Report[Report arm-stratified CIF + cause-specific HR + Fine-Gray HR]\n  Csh --> Report\n  Fg --> Report\n  style Comp fill:#fee2e2,stroke:#b91c1c\n  style Cen fill:#e0f2fe,stroke:#0369a1",
        "caption": "How the same competing event is handled differently by each method. Cause-specific models censor it (etiologic rate among those at risk); Fine-Gray keeps it in a time-decaying extended risk set so it lowers the EOI's room on the CIF; administrative censoring (disenrollment, end of data) is distinct from both.",
        "alt_text": "Flowchart from time zero branching into event of interest, competing event, or administrative censoring, showing that cause-specific models censor the competing event while Fine-Gray retains it in an extended risk set, and all paths feed a report of the CIF plus both hazard ratios.",
        "source_type": "illustrative",
        "source_citations": [
          "austin-2016"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Q{Primary estimand?} -->|Absolute probability of the EOI<br/>before a competing event| P[CIF + Fine-Gray subdistribution HR]\n  Q -->|Etiologic rate among those<br/>still at risk for the EOI| R[Cause-specific Cox<br/>competing event censored]\n  P --> Comm[Common rate < ~10% and balanced?]\n  R --> Comm\n  Comm -->|No: high or differential competing mortality| Both[Report CIF + cause-specific + Fine-Gray]\n  Comm -->|Yes| Simple[Standard Cox may suffice<br/>only with explicit justification]\n  Both --> Pre[Pre-specify primary estimand and model in the protocol/SAP]\n  Simple --> Pre",
        "caption": "Decision logic for choosing the primary tool. Pick the estimand first; report the CIF in essentially all RWE settings; a plain Cox is acceptable only when the competing event is rare, balanced, and explicitly justified.",
        "alt_text": "Decision flowchart starting from the primary estimand, routing absolute-probability questions to CIF plus Fine-Gray and etiologic-rate questions to cause-specific Cox, then to whether the competing event is rare and balanced, ending at pre-specifying the estimand and model in the protocol.",
        "source_type": "illustrative",
        "source_citations": [
          "austin-2017"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "see_also",
        "target_slug": "cox-ph-regression",
        "notes": "A standard Cox with the competing event coded as censored estimates the cause-specific hazard ratio but does not give the cumulative incidence that is usually decision-relevant."
      },
      {
        "relation_type": "see_also",
        "target_slug": "restricted-mean-survival-time-rmst",
        "notes": "RMST extends to competing risks as mean time lost to each cause; often reported alongside the CIF as a complementary, time-unit summary."
      },
      {
        "relation_type": "part_of",
        "target_slug": "estimands-ate-att-intercurrent-events-rwe",
        "notes": "Death or treatment switch as a competing or intercurrent event must be addressed in the estimand strategy (treatment policy vs while-on-treatment vs hypothetical) before any model is chosen."
      },
      {
        "relation_type": "used_with",
        "target_slug": "target-trial-emulation",
        "notes": "Target-trial protocols must pre-specify how competing events (death, switch, loss) are handled in follow-up and outcome definitions, including which cause is the EOI."
      },
      {
        "relation_type": "see_also",
        "target_slug": "therapeutic-area-specific-rwe-challenges-oncology",
        "notes": "Oncology RWE almost always involves competing risks (death, next-line therapy, progression) that change the meaning of progression-free and overall survival estimates from claims."
      },
      {
        "relation_type": "see_also",
        "target_slug": "attrition-and-loss-to-follow-up-rwe",
        "notes": "Death misclassified as attrition or treated as ordinary censoring biases survival, HCRU, and cost estimates; a death index distinguishes competing mortality from administrative loss to follow-up."
      }
    ],
    "aliases": [
      "Fine-Gray model",
      "Fine-Gray subdistribution hazard",
      "subdistribution hazard regression",
      "cause-specific hazard",
      "cumulative incidence function",
      "CIF",
      "Aalen-Johansen estimator",
      "competing risks analysis",
      "competing risk regression"
    ],
    "complexity": "advanced",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "complete-case-analysis-rwe",
    "name": "Complete-Case Analysis",
    "short_definition": "An analysis restricted to records with no missing values on any variable used by the model, discarding partially observed records; unbiased only under restrictive missingness mechanisms and generally less efficient than principled missing-data methods such as multiple imputation or weighting.",
    "long_description": "**Core idea.** A **complete-case analysis (CCA)** — also called listwise deletion — fits the target model using only the\nsubset of records that have *no* missing value on any variable the analysis touches (outcome, exposure, and every\ncovariate in the adjustment set). Records with even a single missing field are dropped entirely. It is the silent default\nof almost every regression routine: `lm`, `glm`, `coxph`, `PROC LOGISTIC`, and `statsmodels` all delete incomplete rows\nbefore estimating. The appeal is that it requires no extra modelling and yields valid standard errors *for the analysis\nthat was actually run*; the danger is that \"the analysis that was actually run\" silently answers a question about the\ncomplete-case subpopulation, not the target population, whenever missingness is informative.\n\n**The missingness taxonomy that governs validity.** Following Rubin's framework, data are **Missing Completely At Random\n(MCAR)** when the probability of being missing is independent of both observed and unobserved values; **Missing At Random\n(MAR)** when that probability depends only on *observed* data; and **Missing Not At Random (MNAR)** when it depends on the\nunobserved value itself. Under MCAR the complete cases are a simple random subsample, so CCA is unbiased for every\nestimand — it just throws away information and inflates variance. Under MAR, CCA is biased for descriptive quantities\n(means, prevalences) but can still be unbiased for **regression coefficients** in an important special case: when\nmissingness, conditional on the covariates in the model, is independent of the *outcome*. 
This is the result that makes\nCCA defensible far more often than the blunt \"valid only under MCAR\" slogan suggests — if records are missing because of\nthe covariate values themselves (e.g., a lab is ordered more often in sicker patients) but, given those covariates,\nmissingness does not depend on the outcome, the fitted coefficients are consistent even though MAR (not MCAR) holds.\nUnder MNAR, CCA is biased in general and no complete-data method recovers the truth without an explicit, untestable model\nfor the missingness mechanism.\n\n**Pros, cons, and trade-offs.**\n- **vs multiple imputation (`multiple-imputation-longitudinal-rwe`):** MI imputes plausible values under MAR and pools\n  estimates across imputations (Rubin's rules), recovering the information in partially observed records and giving\n  correct variance that accounts for imputation uncertainty. CCA discards that information. White & Carlin (2010) make\n  the precise comparison: when missingness is in *covariates* and the coefficient-validity condition above holds, CCA and\n  MI are both unbiased and CCA can even be *more* efficient than a poorly specified imputation model — but when\n  missingness is in the *outcome* alone, or when auxiliary variables predictive of the missing values exist, MI is more\n  efficient and more robust. **Prefer CCA** when the proportion missing is small, missingness is plausibly unrelated to\n  the outcome given covariates, and no strong auxiliary predictors exist; **prefer MI** when missingness is substantial,\n  auxiliaries are available, or descriptive (not just coefficient) targets matter.\n- **vs inverse-probability weighting / IPCW (`inverse-probability-of-censoring-weighting-rwe`):** Weighting upweights the\n  observed complete cases by the inverse of their estimated probability of being complete, correcting MAR selection\n  without imputing. 
It is consistent when the missingness model is correct, though generally not fully efficient, and avoids modelling the full\n  joint distribution. **Prefer weighting** when the missingness mechanism is easier to model than the outcome\n  distribution (e.g., monotone dropout in longitudinal follow-up); a missingness model and an outcome model can be\n  combined (augmented IPW / doubly robust) for protection against misspecification of either model and for better efficiency.\n- **vs available-case / pairwise deletion:** Available-case analysis uses all records that have the *specific* variables\n  needed for each sub-computation (e.g., each pairwise correlation), so different estimates rest on different samples and\n  a covariance matrix can fail to be positive definite. CCA at least keeps one coherent analytic sample. Neither solves\n  informative missingness.\n\n**When to use.** CCA is the honest, transparent default when (1) the fraction of records lost is small (a common rule of\nthumb is <5% with no obvious pattern, though this is not a guarantee), (2) the substantive interest is in a regression\ncoefficient and missingness is plausibly independent of the outcome given the modelled covariates, and (3) there are no\nstrong auxiliary variables that would let imputation recover information CCA throws away. It is also the right reference\nanalysis to *report alongside* MI or weighting: agreement between CCA and a principled method is reassuring, and material\ndisagreement is itself a diagnostic that missingness is informative.\n
Always report the number and characteristics of the\ndropped records and compare the complete-case sample to the full eligible sample on baseline covariates.\n\n**When NOT to use — and when it is actively misleading or dangerous.**\n- **Substantial or patterned missingness analyzed silently.** Letting the software delete 30% of rows without a word\n  converts a population question into a complete-case-subpopulation question and routinely biases effect estimates;\n  quoting the result as if it were the target estimand is the single most common and most dangerous misuse.\n- **Missingness driven by the outcome.** If a variable is missing *because* of the outcome (e.g., severity scores not\n  recorded for patients who died early), CCA is biased even for coefficients and the bias points in an unpredictable\n  direction. Switch to a method with an explicit missingness/censoring model.\n- **Descriptive targets under MAR.** Means, prevalences, and incidence among complete cases are biased under MAR even\n  when coefficients are not; do not report a complete-case prevalence as a population prevalence.\n- **Many covariates each with a little missing.** With p covariates each 5% missing and missingness roughly independent\n  across them, the complete-case fraction shrinks toward 0.95^p — a handful of variables can silently delete a third of\n  the sample. Count the realized analytic N, not the eligible N.\n\n**Data-source operational depth.**\n- **Claims:** Administrative variables (enrollment, dispensings, ICD/CPT codes) are rarely \"missing\" in the survey sense —\n  a value is either present or genuinely absent (a procedure that did not occur). Where missingness bites is in *linked\n  or supplemental* fields (lab results, BMI, smoking, race/ethnicity, income), which are present only for the subset with\n  that data feed. 
A CCA that requires a lab value implicitly restricts to patients who were tested (a strongly selected,\n  typically sicker, group), so the complete-case sample is not MCAR; document the testing/linkage selection explicitly.\n- **EHR:** Data are *informatively present*: labs and vitals are recorded because a clinician ordered them, so\n  \"missing\" carries information about the patient and the encounter. CCA on EHR covariates almost never satisfies MCAR\n  and frequently violates the coefficient-validity condition because ordering depends on the same clinical state that\n  drives the outcome. Treat informative presence as a structural feature, not random missingness.\n- **Registry:** Core registry fields (stage, histology, adjudicated events) are typically near-complete by protocol, so\n  CCA is more defensible for those; long-form questionnaire items and follow-up forms are where attrition and item\n  non-response accumulate. Report completeness by field and by calendar year.\n\n**Interpreting the output**\n\nConsider the ten-patient cohort where BMI is missing for patients 2, 5, and 7: all three are at or above the\ncohort mean age and all three had the outcome event. Complete-case analysis retains n = 7 (a 30% reduction in sample). In the\ncomplete-case subset the observed event rate is 2/7 ≈ 29%; in the full sample it is 5/10 = 50%. Mean\nage in the complete-case subset is 60.6 years versus 63.0 in the full sample.\n\n*(1) Formal statistical interpretation.* The complete-case event rate, a descriptive quantity, is unbiased only if data are MCAR,\nthat is, if missingness of BMI is independent of all other variables, observed and unobserved. The pattern\nhere clearly violates MCAR: the three patients missing BMI are older on average (mean 68.7 vs 60.6 years) and all had the event,\nmeaning the complete-case subset is not a simple random sample of the full cohort. Under MAR, where\nmissingness depends only on observed variables such as age, a correctly specified complete-case\noutcome model can still give unbiased coefficients if age is included in the model and missingness is\nindependent of the outcome given age; the SE is inflated by the\n30% sample reduction. Under MNAR (missingness depends on the missing BMI itself), the complete-case event rate\nis generally biased, and coefficients are protected only if missingness remains independent of the outcome\ngiven the modelled covariates.\n\n*(2) Practical interpretation for a decision-maker.* Discarding the three missing-BMI patients lowers the\nobserved event rate from 50% to 29% (a 21 percentage-point drop) and shifts the mean age downward, evidence of substantial\nselection bias in this small example. Reporting only the complete-case result without disclosing the\nmissingness pattern and mechanism assumption would mislead reviewers about the true event burden. Document\nthe fraction dropped, how completers differ from excluded patients, and which missingness mechanism\n(MCAR, MAR, or MNAR) the complete-case analysis assumes.",
    "primary_category": "Inferential_Statistics",
    "tags": [
      "missing-data",
      "complete-case-analysis",
      "listwise-deletion",
      "mcar",
      "mar",
      "multiple-imputation",
      "estimand"
    ],
    "applies_to_study_types": [
      "cohort_retrospective",
      "cohort_prospective",
      "claims_analysis",
      "ehr_study",
      "registry_linkage",
      "comparative_effectiveness"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked",
      "primary"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1002/sim.3944",
        "url": "https://doi.org/10.1002/sim.3944",
        "citation_text": "White IR, Carlin JB. Bias and efficiency of multiple imputation compared with complete-case analysis for missing covariate values. Statistics in Medicine. 2010;29(28):2920-2931.",
        "year": 2010,
        "authors_short": "White & Carlin",
        "notes": "The canonical head-to-head comparison establishing exactly when complete-case analysis is unbiased (and even more efficient) versus when multiple imputation is preferable, focusing on missing covariate values."
      },
      {
        "role": "explain",
        "doi": "10.1136/bmj.b2393",
        "url": "https://doi.org/10.1136/bmj.b2393",
        "citation_text": "Sterne JAC, White IR, Carlin JB, et al. Multiple imputation for missing data in epidemiological and clinical research: potential and pitfalls. BMJ. 2009;338:b2393.",
        "year": 2009,
        "authors_short": "Sterne et al.",
        "notes": "Practical guidance contrasting complete-case analysis with multiple imputation, including when CCA is acceptable and the auxiliary-variable and outcome-dependence conditions that decide between them."
      },
      {
        "role": "explain",
        "doi": "10.1093/oxfordjournals.aje.a117592",
        "url": "https://doi.org/10.1093/oxfordjournals.aje.a117592",
        "citation_text": "Greenland S, Finkle WD. A critical look at methods for handling missing covariates in epidemiologic regression analyses. American Journal of Epidemiology. 1995;142(12):1255-1264.",
        "year": 1995,
        "authors_short": "Greenland & Finkle",
        "notes": "Classic critique of missing-indicator and ad hoc fixes versus complete-case and model-based approaches in epidemiologic regression; clarifies the bias structure of listwise deletion."
      },
      {
        "role": "demonstrate",
        "doi": "10.1093/aje/kwr302",
        "url": "https://doi.org/10.1093/aje/kwr302",
        "citation_text": "Groenwold RHH, Donders ART, Roes KCB, Harrell FE, Moons KGM. Dealing with missing outcome data in randomized trials and observational studies. American Journal of Epidemiology. 2012;175(3):210-217.",
        "year": 2012,
        "authors_short": "Groenwold et al.",
        "notes": "Worked comparisons of complete-case analysis against imputation and weighting for missing outcomes, showing the direction and magnitude of bias under different mechanisms."
      },
      {
        "role": "use",
        "doi": "10.1037/1082-989x.7.2.147",
        "url": "https://doi.org/10.1037/1082-989x.7.2.147",
        "citation_text": "Schafer JL, Graham JW. Missing data: our view of the state of the art. Psychological Methods. 2002;7(2):147-177.",
        "year": 2002,
        "authors_short": "Schafer & Graham",
        "notes": "Authoritative review situating complete-case analysis within the MCAR/MAR/MNAR taxonomy and against modern likelihood- and imputation-based methods."
      }
    ],
    "plain_language_summary": "Complete-case analysis is a way of handling missing data by simply keeping only the patients who have every piece of information the analysis needs and throwing the rest out. For example, if your study requires age, treatment, and BMI for each patient, anyone missing BMI is dropped entirely before any numbers are calculated. This approach is fast and requires no extra work, but it can quietly distort your results: if the patients who are missing data are systematically different from those who are not, the group you end up analyzing no longer looks like the original study population.",
    "key_terms": [
      {
        "term": "Missing Completely At Random (MCAR)",
        "definition": "Missingness is pure chance — the probability that a value is missing has nothing to do with any characteristic of the patient, whether observed or not."
      },
      {
        "term": "Missing At Random (MAR)",
        "definition": "Whether a value is missing depends only on other information you already have recorded (for example, older patients are less likely to have a BMI recorded), not on the missing value itself."
      },
      {
        "term": "Missing Not At Random (MNAR)",
        "definition": "Whether a value is missing depends on the value that is missing — for instance, very high BMI patients decline to have their BMI recorded precisely because it is high."
      },
      {
        "term": "Complete-case sample",
        "definition": "The subset of patients who have no missing values on any variable the analysis uses; these are the only patients kept when complete-case analysis is applied."
      },
      {
        "term": "Listwise deletion",
        "definition": "Another name for complete-case analysis — every row (patient record) with even one missing value is deleted from the analysis list before estimation begins."
      },
      {
        "term": "Selection bias",
        "definition": "A distortion that arises when the patients included in an analysis are systematically different from the target population the study is meant to represent."
      }
    ],
    "worked_example": {
      "scenario": "A researcher is studying whether a new blood-pressure drug reduces the risk of a cardiovascular event over one year. The analysis needs three variables for each patient: whether they received the drug (treatment), their age, and their BMI. BMI is the problem: it comes from a clinic measurement, and not all patients had a clinic visit where it was recorded. Sicker, older patients were more likely to visit the clinic and therefore more likely to have BMI on file. The researcher applies complete-case analysis, which automatically drops every patient with a missing BMI.",
      "dataset": {
        "caption": "Ten-patient cohort — the raw data before any analysis. BMI is blank for patients who never had it measured.",
        "columns": [
          "patient_id",
          "age",
          "treatment",
          "bmi",
          "had_event"
        ],
        "rows": [
          [
            1,
            58,
            "drug",
            27.4,
            0
          ],
          [
            2,
            63,
            "drug",
            null,
            1
          ],
          [
            3,
            71,
            "placebo",
            31.2,
            1
          ],
          [
            4,
            55,
            "drug",
            24.8,
            0
          ],
          [
            5,
            69,
            "placebo",
            null,
            1
          ],
          [
            6,
            60,
            "drug",
            29.1,
            0
          ],
          [
            7,
            74,
            "placebo",
            null,
            1
          ],
          [
            8,
            52,
            "placebo",
            22.6,
            0
          ],
          [
            9,
            67,
            "drug",
            30.5,
            1
          ],
          [
            10,
            61,
            "placebo",
            26.3,
            0
          ]
        ]
      },
      "steps": [
        "Start with 10 patients. Identify which rows have a missing BMI: patients 2, 5, and 7 each have no BMI value recorded.",
        "Complete-case analysis drops all three patients with missing BMI. The analytic sample shrinks from 10 to 7 patients.",
        "Check who was dropped: patients 2, 5, and 7 have ages 63, 69, and 74 — all older than the group average. All three also had the event (had_event = 1). The remaining 7 patients have a mean age of 60.6 and an event rate of 2 out of 7 (29%).",
        "Compare to the full 10-patient group: mean age was 63.0 and event rate was 5 out of 10 (50%). The complete-case sample is younger and has a lower event rate than the original cohort.",
        "This difference is not random chance — the missingness was linked to older age, and older age was linked to having the event. Dropping those patients makes the drug look more effective and the population look healthier than it really is."
      ],
      "result": "Complete-case N = 7 (dropped 3 of 10 patients, 30% loss). Event rate in the complete-case sample = 2/7 = 29%, versus 5/10 = 50% in the full cohort. The 21 percentage-point gap is a bias introduced by dropping older, sicker patients whose BMI happened to be missing — a clear sign that missingness was not random (MCAR does not hold here)."
    },
    "prerequisites": [
      "missing-data-pattern-table-rwe",
      "multiple-imputation-longitudinal-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Listwise deletion (software default)",
        "description": "The analytic sample is implicitly defined as records non-missing on the outcome and every covariate the model references; estimation proceeds on that subset with conventional standard errors.",
        "edge_cases": [
          "Adding a covariate to the model can shrink the analytic sample and change the estimate purely through sample composition, not confounding control; report the realized N for each model specification.",
          "With many partially observed covariates the complete-case fraction can collapse far below intuition (approximately the product of per-variable completeness rates)."
        ],
        "data_source_notes": "ehr: requiring a lab/vital value silently restricts to tested patients (informative presence); state the testing selection and compare the complete-case sample to the full eligible cohort."
      },
      {
        "name": "Complete-case as reference analysis alongside MI/weighting",
        "description": "CCA is run deliberately as a transparent comparator to a principled missing-data analysis; concordance is reassuring and discordance flags informative missingness.",
        "edge_cases": [
          "Material divergence between CCA and MI is a diagnostic of MAR/MNAR, not a reason to silently pick the more favorable number; pre-specify which is primary.",
          "When the coefficient-validity condition holds, CCA may be more efficient than a misspecified imputation model, so a smaller CCA standard error does not by itself prove CCA is wrong."
        ],
        "data_source_notes": "claims/linked: report the count and baseline characteristics of records dropped for missing linked fields (labs, BMI, SES) so reviewers can judge the selection."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "multiple-imputation-longitudinal-rwe",
        "pros_of_this": "No imputation model to specify or misspecify; valid standard errors for the analysis run; can be more efficient than a poorly specified MI model when missingness is in covariates and independent of the outcome given them.",
        "cons_of_this": "Discards the information in partially observed records; biased under MAR for descriptive targets and under any outcome-dependent missingness; analytic N can collapse with many partially observed covariates.",
        "when_to_prefer": "Small fraction missing, missingness plausibly independent of the outcome given modelled covariates, and no strong auxiliary predictors of the missing values."
      },
      {
        "compared_to": "inverse-probability-of-censoring-weighting-rwe",
        "pros_of_this": "Requires no model for the probability of being observed; simpler and fully transparent.",
        "cons_of_this": "Cannot correct MAR selection; throws away observed records that weighting would recover and use efficiently.",
        "when_to_prefer": "When the missingness/selection mechanism is hard to model reliably and the coefficient-validity condition plausibly holds; otherwise weighting (or doubly robust AIPW) recovers efficiency and corrects selection."
      },
      {
        "compared_to": "Missing-indicator method",
        "pros_of_this": "Avoids the well-known residual confounding the missing-indicator approach introduces for confounders.",
        "cons_of_this": "Loses the records the indicator method retains.",
        "when_to_prefer": "Essentially always over the missing-indicator method for confounders, which Greenland & Finkle show is generally biased; prefer CCA or MI instead."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Administrative fields are present or genuinely absent rather than survey-missing; CCA bites on linked/supplemental fields (labs, BMI, smoking, race, income) available only for a selected subset. Requiring such a field restricts to a non-random subgroup; document the linkage/testing selection and report dropped-record characteristics.",
      "ehr": "Missingness is informatively present (labs/vitals recorded because ordered), so CCA rarely satisfies MCAR and often violates the coefficient-validity condition because ordering tracks the clinical state that drives the outcome; treat informative presence structurally rather than as random missingness.",
      "registry": "Protocol-collected core fields (stage, histology, adjudicated events) are near-complete and more amenable to CCA; questionnaire and follow-up items accrue item non-response and attrition. Report completeness by field and calendar year."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\nimport pandas as pd\nimport statsmodels.formula.api as smf\n\nrng = np.random.default_rng(20240601)\nn = 2000\n\n# Synthetic cohort: outcome depends on age and a lab value.\nage = rng.normal(65, 10, n)\nlab = 0.5 * (age - 65) / 10 + rng.normal(0, 1, n)        # standardized lab\nlogit = -1.0 + 0.04 * (age - 65) + 0.6 * lab\ny = rng.binomial(1, 1 / (1 + np.exp(-logit)))\ndf = pd.DataFrame({\"y\": y, \"age\": age, \"lab\": lab})\n\n# MAR: probability the lab is missing depends only on observed age (older -> more often tested,\n# i.e., less often missing). Given age, missingness is independent of the outcome.\np_missing = 1 / (1 + np.exp(2.0 - 0.05 * (df[\"age\"] - 65)))\ndf.loc[rng.random(n) < p_missing, \"lab\"] = np.nan\n\neligible_n = len(df)\ncc = df.dropna(subset=[\"y\", \"age\", \"lab\"])               # listwise deletion (CCA)\ncc_n = len(cc)\nprint(f\"eligible N = {eligible_n}; complete-case N = {cc_n}; \"\n      f\"complete-case fraction = {cc_n / eligible_n:.3f}\")\n\n# Complete-case logistic regression (the implicit default of glm on incomplete data).\ncca_fit = smf.logit(\"y ~ age + lab\", data=cc).fit(disp=0)\nprint(cca_fit.summary2().tables[1].loc[:, [\"Coef.\", \"Std.Err.\", \"P>|z|\"]])\n\n# Diagnostic: compare complete-case sample to the full eligible sample on covariates.\nprint(\"\\nMean age  - eligible vs complete-case:\",\n      round(df[\"age\"].mean(), 2), round(cc[\"age\"].mean(), 2))",
        "description": "Demonstrate complete-case analysis explicitly and contrast it with the full eligible sample. We construct a small\ncohort where a covariate (a lab value) is Missing At Random conditional on age, then fit a logistic regression two\nways: on all rows with the lab dropped by listwise deletion (the software default), reporting the realized N and the\ncomplete-case fraction. The point is operational transparency, not estimation magic: the realized analytic N and the\nselection it implies are what must be reported.",
        "dependencies": [
          "pandas",
          "numpy",
          "statsmodels"
        ],
        "source_citations": [
          "white-carlin-2010",
          "sterne-2009"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(mice)\nset.seed(20240601)\nn <- 2000\n\nage   <- rnorm(n, 65, 10)\nlab   <- 0.5 * (age - 65) / 10 + rnorm(n)\nlogit <- -1.0 + 0.04 * (age - 65) + 0.6 * lab\ny     <- rbinom(n, 1, plogis(logit))\ndf    <- data.frame(y = y, age = age, lab = lab)\n\n# MAR: lab missing with probability depending only on observed age.\np_missing <- plogis(2.0 - 0.05 * (age - 65))\ndf$lab[runif(n) < p_missing] <- NA\n\n# Document the missing-data pattern before deleting anything.\nprint(md.pattern(df, plot = FALSE))\n\ncc <- df[complete.cases(df), ]            # listwise deletion (complete-case analysis)\ncat(sprintf(\"eligible N = %d; complete-case N = %d; fraction = %.3f\\n\",\n            nrow(df), nrow(cc), nrow(cc) / nrow(df)))\n\n# Complete-case logistic regression; na.omit is the default but stated for clarity.\ncca_fit <- glm(y ~ age + lab, data = cc, family = binomial(), na.action = na.omit)\nprint(summary(cca_fit)$coefficients)\n\n# Diagnostic: covariate distribution, eligible vs complete-case.\ncat(\"Mean age eligible vs complete-case:\",\n    round(mean(df$age), 2), round(mean(cc$age), 2), \"\\n\")",
        "description": "Complete-case analysis in R via the (default) na.action = na.omit, made explicit. We simulate a cohort with a\ncovariate Missing At Random given age, report the realized complete-case sample size and fraction, and fit the\ncomplete-case logistic model. md.pattern() from the mice package summarizes the missing-data pattern so the deletion\nis documented rather than silent.",
        "dependencies": [
          "mice"
        ],
        "source_citations": [
          "white-carlin-2010",
          "sterne-2009"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* work.cohort has y (0/1), age, lab; lab is missing for a MAR-selected subset. */\n\n/* 1. Audit the missing-data pattern and per-variable missing counts (no imputation). */\nproc mi data=work.cohort nimpute=0;\n  var y age lab;\n  ods select MissPattern;\nrun;\n\nproc means data=work.cohort n nmiss;\n  var y age lab;\nrun;\n\n/* 2. Realized eligible vs complete-case sample sizes. */\nproc sql;\n  select count(*) as eligible_n\n  from work.cohort;\n  select count(*) as complete_case_n\n  from work.cohort\n  where y is not missing and age is not missing and lab is not missing;\nquit;\n\n/* 3. Complete-case logistic regression. PROC LOGISTIC deletes incomplete records by default;\n      stating it documents that the analysis is a complete-case analysis. */\nproc logistic data=work.cohort;\n  model y(event='1') = age lab;\n  title 'Complete-case logistic regression (listwise deletion)';\nrun;\n\n/* 4. Diagnostic: compare age distribution in the full vs complete-case sample. */\nproc means data=work.cohort mean;\n  var age;\n  title 'Mean age, full eligible sample';\nrun;\nproc means data=work.cohort mean;\n  where lab is not missing;\n  var age;\n  title 'Mean age, complete-case sample';\nrun;",
        "description": "Complete-case analysis in SAS. PROC LOGISTIC (like all SAS modelling PROCs) performs listwise deletion automatically,\ndropping any observation missing on a model variable. Here we make the deletion auditable: PROC MI with NIMPUTE=0\ntabulates the missing-data pattern and counts, then PROC SQL reports the eligible and complete-case Ns, and PROC\nLOGISTIC fits the complete-case model. The NMISS option on PROC MEANS confirms how many records each variable loses.",
        "dependencies": [],
        "source_citations": [
          "white-carlin-2010",
          "sterne-2009"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Elig[Eligible cohort<br/>full sample N] --> Miss{Any value missing on<br/>outcome, exposure, or a model covariate?}\n  Miss -->|No missing values| CC[Complete-case sample<br/>used for estimation]\n  Miss -->|>=1 value missing| Drop[Record deleted listwise<br/>information discarded]\n  CC --> Mech{Missingness mechanism?}\n  Mech -->|MCAR| Unb[Unbiased for all estimands<br/>efficiency loss only]\n  Mech -->|MAR, indep of outcome given covariates| Coef[Regression coefficients unbiased<br/>descriptive targets biased]\n  Mech -->|MAR otherwise / MNAR| Bias[Biased - prefer MI or weighting]",
        "caption": "Complete-case analysis deletes any record with a missing model value, then validity depends entirely on the missingness mechanism. CCA is unbiased under MCAR, unbiased for coefficients under the outcome-independence condition, and biased otherwise.",
        "alt_text": "Flowchart from an eligible cohort splitting into complete cases (kept) versus records with any missing value (deleted), then branching by missingness mechanism to unbiased, coefficient-only-unbiased, or biased outcomes.",
        "source_type": "illustrative",
        "source_citations": [
          "white-carlin-2010"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  Q[Choosing a missing-data strategy] --> Frac{Fraction missing small<br/>and no strong auxiliaries?}\n  Frac -->|Yes| Out{Missingness indep of outcome<br/>given covariates?}\n  Frac -->|No| MI[Multiple imputation<br/>or weighting]\n  Out -->|Yes| CCA[Complete-case analysis acceptable<br/>report dropped N and selection]\n  Out -->|No| MI\n  MI --> Ref[Also report CCA as reference<br/>discordance flags informative missingness]\n  CCA --> Ref",
        "caption": "Decision logic for when complete-case analysis is acceptable versus when multiple imputation or weighting is needed, and the recommendation to report CCA alongside the principal method as a diagnostic.",
        "alt_text": "Decision tree on fraction missing, auxiliaries, and outcome-independence of missingness leading to either complete-case analysis or multiple imputation/weighting, with both reported for comparison.",
        "source_type": "illustrative",
        "source_citations": [
          "white-carlin-2010"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "alternative_to",
        "target_slug": "multiple-imputation-longitudinal-rwe",
        "notes": "Complete-case analysis discards partially observed records; multiple imputation recovers their information under MAR with correct variance via Rubin's rules. White & Carlin (2010) give the exact conditions favouring each."
      },
      {
        "relation_type": "see_also",
        "target_slug": "missing-data-pattern-table-rwe",
        "notes": "A missing-data pattern table documents how many records and which variables drive the listwise deletion, making the complete-case sample auditable rather than silent."
      },
      {
        "relation_type": "alternative_to",
        "target_slug": "inverse-probability-of-censoring-weighting-rwe",
        "notes": "Weighting corrects MAR selection by upweighting observed records rather than deleting incomplete ones; preferred when the missingness/selection mechanism is easier to model than the full data distribution."
      },
      {
        "relation_type": "see_also",
        "target_slug": "attrition-and-loss-to-follow-up-rwe",
        "notes": "Attrition is a longitudinal form of missingness; complete-case analysis of only those with complete follow-up is biased when dropout is informative."
      },
      {
        "relation_type": "affects",
        "target_slug": "baseline-characteristics-and-covariate-balance-rwe",
        "notes": "Listwise deletion changes the analytic sample composition, so the table-one baseline characteristics of the complete-case sample can differ from the full eligible cohort and must be compared."
      }
    ],
    "aliases": [
      "complete-case analysis",
      "listwise deletion",
      "complete cases only",
      "CCA",
      "available-subject analysis"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta",
      "journal"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "composite-endpoint-construction-rwe",
    "name": "Composite Endpoint Construction",
    "short_definition": "The pre-specified rule that combines two or more component events (e.g., death, non-fatal MI, hospitalization) into a single outcome, defining each component's case definition, the event-date assignment (typically the earliest qualifying component), the deduplication window, and the analytic estimand (time-to-first-event, recurrent-event, hierarchical, or competing-risk) in real-world data.",
    "long_description": "A **composite endpoint** counts a patient as having \"the outcome\" when *any one* of several pre-specified component events\noccurs. The classic example is **3-point MACE** (all-cause death + non-fatal myocardial infarction + non-fatal stroke). The\ndefault analytic representation is **time-to-first-event**: the composite event date is the *earliest* date among the\npatient's qualifying components, and follow-up is censored at that first event. The deceptively simple `MIN(component_date)`\nhides every hard decision — what each component's case definition is, how each component is ascertained (and how *well*),\nhow same-window events are deduplicated, and which estimand the composite is meant to support.\n\n**Core conceptual / estimand distinction.** The construction rule and the estimand are separable choices that must both be\npre-specified, because they drive different models:\n- **Time-to-first-event (default).** Event = earliest component; analyze with cause-specific Cox or pooled logistic. Simple\n  and the FDA/regulatory default, but it (a) discards all information after the first event, (b) weights a soft component\n  (e.g., hospitalization) exactly the same as death, and (c) lets a frequent, low-severity, well-ascertained component\n  dominate the hazard ratio while a rare, high-severity component drives clinical meaning.\n- **Recurrent / total events (Andersen–Gill, LWYY, negative-binomial).** Counts all qualifying events per patient, not just\n  the first — appropriate when repeated hospitalizations carry information (e.g., heart-failure burden).\n- **Hierarchical / win-ratio (Pocock 2011) and restricted-mean approaches.** Compare patients pairwise on the most severe\n  component first, breaking ties on the next, so death is never outweighed by a hospitalization. 
Respects clinical priority\n  at the cost of an unfamiliar effect measure (win ratio) and no single survival curve.\n- **Cumulative incidence with death as a competing risk.** If a *non-fatal* component is the endpoint of interest and death\n  competes, the cause-specific hazard answers a different question than the subdistribution (Fine–Gray) cumulative\n  incidence. Naively treating death as censoring overstates the non-fatal component's cumulative incidence.\n\nTwo operational invariants make all of this auditable: **retain the composite event date AND the component identifier that\nproduced it.** Every component-level sensitivity analysis, every \"which piece drove the effect\" question, and every\ncompeting-risk re-analysis depends on knowing *which* component fired and *when*.\n\n**Pros, cons, and trade-offs.**\n- **vs analyzing each component separately:** A composite raises event counts and statistical power, shortens required\n  follow-up, and side-steps the multiplicity of testing several outcomes. Cost: it presumes the components share a common\n  treatment effect and similar clinical importance — assumptions that are *false more often than not* (Freemantle 2003;\n  Ferreira-González 2007). **Prefer the composite** only when components are biologically linked, of comparable severity,\n  and expected to move in the same direction; **always report the components individually alongside it.**\n- **vs hierarchical / win-ratio analysis:** Time-to-first-event is universally understood and maps to a hazard ratio and a\n  Kaplan–Meier curve. The win ratio (Pocock 2011) preserves the severity ordering death > MI > hospitalization, so a drug\n  that prevents deaths but causes extra admissions is not falsely rewarded. 
**Prefer the win ratio / hierarchical estimand**\n  when components differ sharply in severity or when a soft component dominates the count; **prefer time-to-first-event**\n  when components are of similar importance and regulators expect the conventional contrast.\n- **vs a single hard outcome (e.g., all-cause mortality):** A single hard endpoint is unambiguous, maximally ascertainable\n  from death indices, and immune to differential surveillance — but is often under-powered. The composite trades that\n  cleanliness for power. **Prefer the single hard endpoint** when it is adequately powered and when differential\n  ascertainment of soft components by arm is plausible.\n\n**When to use.** Comparative effectiveness or safety where a single hard endpoint is under-powered and components are\nclinically linked and of similar severity (cardiovascular, renal, oncologic composites); HTA models needing an\nevent-driven outcome; pre-specified regulatory endpoints (e.g., MACE in cardiovascular-outcomes-trial emulations). The\ncomposite definition, each component's case definition and data-source validation (PPV/sensitivity), the deduplication\nwindow, and the estimand must all be fixed in the protocol/SAP *before* programming.\n\n**When NOT to use — and when it is actively misleading or dangerous.**\n- **Components have opposing or markedly heterogeneous expected effects.** If a drug plausibly lowers MI but raises\n  hospitalization, the composite hazard ratio can hover near the null while masking real, opposite-direction component\n  effects. Require pre-specified, plausible per-component effect estimates before trusting any composite (Freemantle 2003).\n- **A soft, frequent component dominates the count but a hard component drives clinical meaning.** When 80% of composite\n  events are hospitalizations and 5% are deaths, the \"significant\" composite is essentially a hospitalization study wearing\n  a mortality costume. 
Always tabulate the component breakdown.\n- **Differential ascertainment by arm.** Composite event date = `MIN(component_date)`, so the *best-ascertained* component\n  systematically wins the event-date assignment. If one arm has more healthcare contact (e.g., a drug requiring frequent\n  labs or monitoring visits), its non-fatal, encounter-dependent components are detected earlier and more often — inflating\n  its composite rate for reasons that have nothing to do with biology. This is the dominant RWE-specific failure mode.\n- **Cross-study comparison with inconsistent definitions.** 3-point vs 4-point vs 5-point MACE, or different ICD code lists\n  and code-position rules, are not interchangeable; pooling or comparing them is comparing different outcomes.\n- **Death as a component AND a subdistribution interpretation for a non-fatal piece.** You cannot coherently treat death as\n  both part of the composite and a competing risk for a sub-component in the same framing — pick one estimand.\n\n**Data-source operational depth.**\n- **Claims (FFS):** Each component is an algorithm with its own validity. All-cause death is near-complete only when linked\n  to the SSA Death Master File / NDI (claims-based death proxies — e.g., discharge status, enrollment termination — miss\n  out-of-hospital deaths). Non-fatal MI and stroke are ICD-10-CM algorithms whose PPV depends on code position (primary\n  inpatient diagnosis ≈ high PPV; any-position or outpatient ≈ lower PPV) and whether a `1 inpatient OR 2 outpatient` rule is\n  applied. 
Failure modes: (a) **acute-event re-coding** — a single MI generates an index hospitalization, a transfer, and\n  follow-up outpatient claims; without a **deduplication / clean window** these become spurious \"recurrent\" events;\n  (b) **adjudication/claims lag** — late-arriving claims shift event dates and can move an event across the end-of-data\n  boundary; (c) **differential surveillance** by arm as above.\n- **Medicare Advantage-only person-time:** Encounter (non-fatal) claims are frequently incomplete or absent, while death is\n  still captured via linked mortality. The composite therefore *degrades into a near-death-only outcome* in MA segments —\n  silently changing the estimand by enrollment type. Restrict to FFS Parts A/B (and D for exposure) for the components that\n  rely on medical claims, or model enrollment type explicitly.\n- **EHR:** Non-fatal components come from problem lists, encounter diagnoses, labs, and notes; capture is visit-driven, so\n  a patient who seeks care outside the system has events that simply never appear (external-care leakage). Death is\n  notoriously incomplete in EHR alone — link to a death index. EHR can *sharpen* a component definition (troponin for MI,\n  imaging for stroke) where claims cannot.\n- **Registry / linked:** Registries often provide *adjudicated* components (the gold standard for the composite's clinical\n  pieces) but weak longitudinal completeness; link to claims for full follow-up and to a death index for the mortality\n  component. Linkage introduces selection (only the linkable subset) and date-discrepancy issues (claim service date vs\n  adjudicated event date) that must be reconciled before assigning the composite event date.\n\n**Worked claims example.** Endpoint: **3-point MACE** (all-cause death + non-fatal MI + non-fatal stroke) in an\nSGLT2-inhibitor vs DPP-4-inhibitor active-comparator new-user study in a commercial + Medicare FFS database with linked\nmortality. 
(1) **Component case definitions.** Non-fatal MI = an inpatient claim with ICD-10-CM `I21.x` in the *primary*\nposition (high-PPV rule; pre-specify a sensitivity analysis using any-position). Non-fatal stroke = inpatient `I63.x`\nprimary. Death = earliest date across the **mortality source hierarchy** (linked SSA/NDI date, then inpatient discharge\nstatus = expired, then enrollment-termination-for-death) — never a claims proxy alone. (2) **Continuous enrollment +\nwashout.** Require 365 days of continuous FFS A/B (and D for the exposure) before `index_date`; restrict component\nascertainment to FFS person-time so MA gaps do not silently drop non-fatal events. (3) **Deduplication / clean window.**\nAn inpatient MI plus a transfer and two follow-up office visits coded I21 within 30 days is *one* MI — collapse component\nclaims within a pre-specified acute-event window to the admission date. (4) **Event-date assignment.** Composite event\ndate = earliest qualifying component date; **store both the date and the component label.** Edge case: a stroke claim dated\nthree days before a death date — if the death is the index hospitalization's terminal event, count it as a single\ncomposite event at the stroke date with component = \"non-fatal stroke then death\" flagged; the time-to-first-event\nestimand counts only the first, but the component flag preserves the death for the competing-risk re-analysis. (5)\n**Follow-up & censoring.** From `index_date` to the composite event date, censoring at disenrollment, end of data, and (for\nas-treated) discontinuation. 
(6) **Analysis & sensitivity.** Primary = cause-specific Cox for time-to-first MACE; report\nthe **component breakdown** (how many deaths vs MI vs stroke), a **Fine–Gray** cumulative-incidence analysis when a single\nnon-fatal component is examined with death as a competing risk, the **win-ratio** as a severity-weighted alternative, and\nre-runs under the any-position MI rule and an alternative deduplication window.\n\n**Interpreting the output**. In the Patient 7741 example, an MI claim in principal position appears on day 100 from\nindex; a stroke claim appears on day 203 and a death record on day 288. The composite-endpoint algorithm fires on\nday 100 and labels the event component as \"non-fatal MI.\" Stroke and death are not counted as additional composite\nevents — only the first qualifying component triggers the time-to-event clock.\n\nFormal interpretation: the composite event date is day 100, and the component label is non-fatal MI. The\nhazard ratio estimated from this composite endpoint applies to the composite (MI, stroke, or death occurring\nfirst) — it does not apply to MI alone, stroke alone, or death alone. This is a consequential restriction:\na treatment that prevents MI but shifts events toward stroke will produce a composite HR near 1.0 even though\nthe clinical picture changed substantially. The composite is driven by whichever component accumulates events\nmost rapidly, often the softest (most frequently occurring) component; if most composite events are MI and few\nare stroke or death, the composite HR is essentially an MI HR, and calling it a \"MACE\" estimate is misleading.\n\nPractical interpretation: always report the component breakdown — number and proportion of composite events\nattributable to each component — alongside the primary composite HR. If one component dominates, pre-specify\ncomponent-specific analyses and consider whether the composite still addresses the clinical question. 
The\nwin-ratio analysis, which weights events by clinical severity, is a useful pre-specified sensitivity analysis\nwhen component heterogeneity is anticipated.",
    "primary_category": "Outcome_Measure",
    "tags": [
      "outcome_measure",
      "composite-endpoint",
      "mace",
      "time-to-first-event",
      "win-ratio",
      "competing-risks",
      "component-ascertainment",
      "outcome-algorithm-construction"
    ],
    "applies_to_study_types": [
      "claims_analysis",
      "ehr_study",
      "registry_linkage",
      "target_trial_emulation",
      "comparative_effectiveness"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1001/jama.289.19.2554",
        "url": "https://doi.org/10.1001/jama.289.19.2554",
        "citation_text": "Freemantle N, Calvert M, Wood J, Eastaugh J, Griffin C. Composite outcomes in randomized trials: greater precision but with greater uncertainty? JAMA. 2003;289(19):2554-2559.",
        "year": 2003,
        "authors_short": "Freemantle et al.",
        "notes": "Foundational caution that a composite is only interpretable when its components share a common direction and comparable clinical importance; the rationale for always reporting components individually."
      },
      {
        "role": "explain",
        "doi": "10.1136/bmj.c3920",
        "url": "https://doi.org/10.1136/bmj.c3920",
        "citation_text": "Cordoba G, Schwartz L, Woloshin S, Bae H, Gøtzsche PC. Definition, reporting, and interpretation of composite outcomes in clinical trials: systematic review. BMJ. 2010;340:c3920.",
        "year": 2010,
        "authors_short": "Cordoba et al.",
        "notes": "Empirical review showing widespread inconsistency in how composites are defined and reported, and how the most important component is often the least affected by treatment."
      },
      {
        "role": "explain",
        "doi": "10.1136/bmj.39136.682083.AE",
        "url": "https://doi.org/10.1136/bmj.39136.682083.AE",
        "citation_text": "Ferreira-González I, Permanyer-Miralda G, Domingo-Salvany A, et al. Problems with use of composite end points in cardiovascular trials: systematic review of randomised controlled trials. BMJ. 2007;334(7597):786.",
        "year": 2007,
        "authors_short": "Ferreira-González et al.",
        "notes": "Documents that composite effects are frequently driven by the least serious, most frequent component, with large gradients in importance across components."
      },
      {
        "role": "demonstrate",
        "doi": "10.1093/eurheartj/ehr352",
        "url": "https://doi.org/10.1093/eurheartj/ehr352",
        "citation_text": "Pocock SJ, Ariti CA, Collier TJ, Wang D. The win ratio: a new approach to the analysis of composite endpoints in clinical trials based on clinical priorities. European Heart Journal. 2012;33(2):176-182.",
        "year": 2012,
        "authors_short": "Pocock et al.",
        "notes": "The canonical hierarchical alternative that weights components by clinical severity (death before hospitalization) rather than counting all components equally in a time-to-first-event analysis."
      }
    ],
    "plain_language_summary": "A composite endpoint combines several serious health events — such as heart attack, stroke, and death — into a single outcome so that a study has enough events to detect a treatment effect. The rule is simple: a patient counts as having the outcome on the earliest date that any one of those events occurs, and follow-up stops there. Because the study records which specific event happened first, researchers can always go back and look at each event separately. One honest catch: if one event (say, a hospitalization) happens far more often than the others, it can dominate the result and make a treatment look effective even if it does nothing for the rarer but more serious events.",
    "key_terms": [
      {
        "term": "composite endpoint",
        "definition": "A single study outcome that is triggered the first time a patient experiences any one of several pre-specified serious events (for example, heart attack, stroke, or death)."
      },
      {
        "term": "component event",
        "definition": "One of the individual events (such as heart attack, stroke, or death) that together make up the composite endpoint."
      },
      {
        "term": "time-to-first-event",
        "definition": "The number of days from a patient's study start date until the earliest component event occurs; this is what the statistical model uses as the outcome."
      },
      {
        "term": "index date",
        "definition": "The patient's personal day zero in the study — usually the date they started the medication being studied — from which follow-up time is measured."
      },
      {
        "term": "follow-up",
        "definition": "The stretch of time a patient is watched for events, starting at the index date and ending when an outcome occurs, the patient leaves the study, or the data end."
      },
      {
        "term": "deduplication window",
        "definition": "A rule that collapses multiple claims for the same acute episode (for example, a hospital stay, a transfer, and several follow-up office visits all coded as the same heart attack) into a single event so it is not counted more than once."
      }
    ],
    "worked_example": {
      "scenario": "Margaret is a 68-year-old Medicare patient enrolled in an SGLT2-inhibitor study. Her index date — the day she fills her first prescription — is January 1, 2024. Researchers are tracking 3-point MACE: non-fatal heart attack (MI), non-fatal stroke, or death from any cause, whichever comes first. They follow Margaret until one of those events happens, she loses insurance coverage, or the data end on December 31, 2024. She experiences all three component events during the year, but only the earliest one defines her composite endpoint.",
      "dataset": {
        "caption": "Component-event claims for one patient (Margaret, person_id 7741). Each row is a qualifying event that passed its own code-based case definition — inpatient primary-diagnosis code for MI and stroke, linked mortality file for death.",
        "columns": [
          "person_id",
          "event_date",
          "component",
          "source"
        ],
        "rows": [
          [
            7741,
            "2024-04-10",
            "MI",
            "inpatient claim, I21.9 primary"
          ],
          [
            7741,
            "2024-07-22",
            "STROKE",
            "inpatient claim, I63.9 primary"
          ],
          [
            7741,
            "2024-10-15",
            "DEATH",
            "linked mortality file"
          ]
        ]
      },
      "steps": [
        "Start the clock at Margaret's index date: January 1, 2024 (day 0).",
        "List every qualifying component event inside her follow-up window, with the date it occurred: MI on April 10 (day 100), stroke on July 22 (day 204), death on October 15 (day 289).",
        "Apply the earliest-event rule: composite event date = the minimum of {April 10, July 22, October 15} = April 10.",
        "Margaret's composite outcome is recorded as: event = 1 (yes, she had the composite), time_to_composite = 100 days, composite_component = 'MI'.",
        "Follow-up is stopped at day 100 — the stroke and death still happened, but they occur after the composite has already fired and are not counted in a time-to-first-event analysis.",
        "The composite_component label 'MI' is kept alongside the date so analysts can later ask 'how many composite events were MIs vs strokes vs deaths?' and run sensitivity analyses on each component separately."
      ],
      "result": "Composite endpoint = MI; time-to-composite = 100 days (April 10, 2024). The stroke (day 204) and death (day 289) occurred after the composite fired and do not alter the primary outcome — though the component label stored with the record makes them available for secondary analyses.",
      "timeline_spec": {
        "title": "3-point MACE composite for one patient: earliest component event wins",
        "window": {
          "start": "2024-01-01",
          "end": "2024-12-31",
          "label": "Observation window: January 1 – December 31, 2024"
        },
        "events": [
          {
            "label": "MI (I21.9 primary, inpatient)",
            "start": "2024-04-10",
            "length_days": 1,
            "quantity": "day 100 — FIRST event → defines composite",
            "flag": "composite_trigger"
          },
          {
            "label": "Stroke (I63.9 primary, inpatient)",
            "start": "2024-07-22",
            "length_days": 1,
            "quantity": "day 204 — occurs after composite fired; not counted"
          },
          {
            "label": "Death (linked mortality file)",
            "start": "2024-10-15",
            "length_days": 1,
            "quantity": "day 289 — occurs after composite fired; not counted"
          }
        ],
        "spans": [
          {
            "kind": "followup",
            "start": "2024-01-01",
            "end": "2024-04-10",
            "label": "Active follow-up: 100 days"
          },
          {
            "kind": "unexposed",
            "start": "2024-04-10",
            "end": "2024-12-31",
            "label": "After composite event: follow-up closed (stroke and death not counted)"
          }
        ],
        "result": {
          "label": "Composite endpoint = MI on day 100; time-to-composite = 100 days",
          "value": 100
        },
        "caption": "Margaret's three component events are shown on the timeline. The heart attack (MI) on April 10 is the earliest — it fires the composite endpoint on day 100 and closes her follow-up. The stroke (day 204) and death (day 289) appear on the diagram to show they happened, but they do not change the composite result because follow-up stopped at the first event. The stored component label ('MI') is what allows analysts to break out the composite into its pieces later.",
        "alt_text": "A horizontal timeline from January 1 to December 31 2024 for one patient. The index date anchors the left end. A marker labeled 'MI day 100' on April 10 is flagged as the composite trigger. A marker labeled 'Stroke day 204' on July 22 and a marker labeled 'Death day 289' on October 15 appear to the right but are shown as inactive because follow-up closed at day 100. A shaded span from January 1 to April 10 is labeled 'Active follow-up: 100 days'."
      }
    },
    "prerequisites": [
      "outcome-algorithm-construction-rwe",
      "acute-event-deduplication-window-rwe",
      "mortality-source-hierarchy-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Time-to-first-event composite (default)",
        "description": "Composite event date = earliest qualifying component date; follow-up censored at that first event. Analyzed with cause-specific Cox or pooled logistic. Retain the component label that produced the event for component-level sensitivity analyses.",
        "edge_cases": [
          "A frequent, well-ascertained soft component (hospitalization) dominates the event count and the hazard ratio while a rare hard component (death) drives clinical meaning.",
          "Same-window component claims (admission + transfer + follow-up coding) must be deduplicated to one event before taking the minimum date."
        ],
        "data_source_notes": "claims: pre-specify code-position rules per component and a deduplication/clean window; EHR: visit-driven capture means the earliest detectable component, not the earliest true event, may be assigned."
      },
      {
        "name": "Recurrent / total-events composite",
        "description": "Counts all qualifying component events per patient (Andersen-Gill, LWYY, or negative-binomial) rather than only the first; appropriate when repeated events (e.g., heart-failure hospitalizations) carry information.",
        "edge_cases": [
          "Without a clean window, claims artifacts (transfers, coding of the same acute event) inflate the recurrent count.",
          "Death must be handled as a terminal event, not another recurrence."
        ],
        "data_source_notes": "claims: requires robust episode/clean-window logic; deduplication errors propagate directly into the event count."
      },
      {
        "name": "Hierarchical / win-ratio composite",
        "description": "Compares patients pairwise on the most severe component first, breaking ties on the next-most-severe, so death is never outweighed by a hospitalization (Pocock 2011). Produces a win ratio rather than a hazard ratio.",
        "edge_cases": [
          "Requires a defensible, pre-specified severity ordering of components.",
          "Censoring and unequal follow-up complicate pair comparisons; matched or unmatched win-ratio variants differ."
        ],
        "data_source_notes": "all sources: still depends on per-component case definitions and event dates; ascertainment differences distort the pairwise comparisons just as they do time-to-first-event."
      },
      {
        "name": "Single non-fatal component with death as competing risk",
        "description": "When one non-fatal component (e.g., HF hospitalization) is the endpoint and death competes, report cause-specific hazard and Fine-Gray subdistribution cumulative incidence rather than censoring death.",
        "edge_cases": [
          "Treating death as ordinary censoring overstates the non-fatal cumulative incidence, especially in elderly cohorts where competing mortality differs by arm."
        ],
        "data_source_notes": "claims: differential competing mortality by exposure is common in elderly populations; report both cause-specific and subdistribution estimates."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Analyzing each component as a separate outcome",
        "pros_of_this": "Higher event counts and power, shorter required follow-up, avoids multiplicity across several outcomes.",
        "cons_of_this": "Presumes components share treatment direction and clinical importance; a near-null composite can hide opposite-direction component effects.",
        "when_to_prefer": "Components are biologically linked, comparable in severity, and expected to move together; always report components individually alongside the composite."
      },
      {
        "compared_to": "Hierarchical / win-ratio analysis",
        "pros_of_this": "Time-to-first-event yields a familiar hazard ratio and Kaplan-Meier curve that regulators expect.",
        "cons_of_this": "Weights death and a soft hospitalization equally; a drug that prevents deaths but causes admissions is falsely penalized.",
        "when_to_prefer": "Components are of similar severity and the conventional contrast is required; switch to win-ratio when a soft component dominates the count."
      },
      {
        "compared_to": "A single hard endpoint (e.g., all-cause mortality)",
        "pros_of_this": "Power and shorter follow-up from combining events.",
        "cons_of_this": "Vulnerable to differential ascertainment of soft components by arm and to inconsistent cross-study definitions.",
        "when_to_prefer": "The single hard endpoint is under-powered and components are clinically linked; prefer the single endpoint when differential surveillance by arm is plausible."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Each component is its own algorithm with its own PPV/sensitivity. Use linked mortality (SSA/NDI) for death; apply code-position rules and a deduplication/clean window per non-fatal component. Composite event date = MIN(component date); store the component label. Restrict to FFS person-time for medical-claims components; MA-only segments degrade the composite toward a death-only outcome.",
      "ehr": "Non-fatal components are visit-driven (encounter diagnoses, labs, notes) and miss external-care events; death is incomplete without a linked index. EHR can sharpen component definitions (troponin, imaging) beyond what claims allow.",
      "registry": "Often provides adjudicated components (gold standard for the clinical pieces) but weak longitudinal completeness; link to claims for follow-up and to a death index for the mortality component.",
      "linked": "Ideal substrate (adjudicated/EHR severity + claims completeness + reliable mortality) but introduces linkage selection and event-date discrepancies (service date vs adjudicated date) that must be reconciled before assigning the composite event date."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\nimport numpy as np\n\nDEDUP_DAYS = 30  # acute-event clean window: claims of the SAME component within this window = one event\n\ndef build_composite(cohort: pd.DataFrame, comp_events: pd.DataFrame) -> pd.DataFrame:\n    ev = comp_events.merge(cohort[[\"person_id\", \"index_date\", \"fu_end\"]], on=\"person_id\")\n\n    # Keep only component events inside each patient's follow-up window.\n    ev = ev[(ev[\"event_date\"] > ev[\"index_date\"]) & (ev[\"event_date\"] <= ev[\"fu_end\"])]\n\n    # Deduplicate same-component acute episodes: a claim opens a new episode only if it falls more than\n    # DEDUP_DAYS after the LAST KEPT episode date (anchored, not chained off the immediately prior claim),\n    # matching the SAS step and the acute-event-deduplication-window concept.\n    ev = ev.sort_values([\"person_id\", \"component\", \"event_date\"])\n\n    def _keep_anchored(dates: pd.Series) -> pd.Series:\n        keep, last = [], None\n        for d in dates:\n            if last is None or (d - last).days > DEDUP_DAYS:\n                keep.append(True); last = d\n            else:\n                keep.append(False)\n        return pd.Series(keep, index=dates.index)\n\n    new_episode = ev.groupby([\"person_id\", \"component\"])[\"event_date\"].transform(_keep_anchored)\n    ev = ev[new_episode]\n\n    # Composite event = earliest qualifying component; preserve WHICH component fired.\n    ev = ev.sort_values([\"person_id\", \"event_date\"])\n    first = ev.groupby(\"person_id\").first().reset_index()\n    first = first.rename(columns={\"event_date\": \"composite_date\", \"component\": \"composite_component\"})\n\n    out = cohort.merge(first[[\"person_id\", \"composite_date\", \"composite_component\"]], on=\"person_id\", how=\"left\")\n    out[\"event\"] = out[\"composite_date\"].notna().astype(int)\n    # Time = composite event date if it occurred, else censoring at follow-up end.\n    end = 
out[\"composite_date\"].fillna(out[\"fu_end\"])\n    out[\"time_days\"] = (end - out[\"index_date\"]).dt.days\n    return out[[\"person_id\", \"index_date\", \"time_days\", \"event\", \"composite_component\"]]",
        "description": "Derive a time-to-first-event composite from claims-style component events. Required inputs (already cleaned and mapped to\ncomponent case definitions upstream):\n  cohort     : one row per patient -> person_id, index_date (datetime), fu_end (datetime: min of disenroll/death/data end)\n  comp_events: one row per qualifying component claim -> person_id, event_date (datetime),\n               component in {'DEATH','MI','STROKE'}  # each already passes its own code-position / validation rule\nDEDUP_DAYS collapses same-component claims of one acute event (transfers, follow-up coding) to the first claim date.\nReturns one row per patient with the composite event date, the WINNING component label, and the time-to-event/censoring\nindicator. Retaining `composite_component` is what makes every component-level and competing-risk sensitivity analysis\npossible downstream.",
        "dependencies": [
          "pandas",
          "numpy"
        ],
        "source_citations": [
          "freemantle-2003"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\nDEDUP_DAYS <- 30L\n\nbuild_composite <- function(cohort, comp_events) {\n  setDT(cohort); setDT(comp_events)\n  ev <- merge(comp_events, cohort[, .(person_id, index_date, fu_end)], by = \"person_id\")\n  ev <- ev[event_date > index_date & event_date <= fu_end]\n\n  # Deduplicate same-component acute episodes: a claim opens a new episode only if it falls more than\n  # DEDUP_DAYS after the LAST KEPT episode date (anchored, not chained off the immediately prior claim),\n  # matching the SAS step and the acute-event-deduplication-window concept.\n  setorder(ev, person_id, component, event_date)\n  keep_anchored <- function(dates) {\n    keep <- logical(length(dates)); last <- NA\n    for (i in seq_along(dates)) {\n      if (is.na(last) || as.integer(dates[i] - last) > DEDUP_DAYS) {\n        keep[i] <- TRUE; last <- dates[i]\n      }\n    }\n    keep\n  }\n  ev <- ev[, .SD[keep_anchored(event_date)], by = .(person_id, component)]\n\n  # Earliest qualifying component = composite event; keep the component label.\n  setorder(ev, person_id, event_date)\n  first <- ev[, .(composite_date = event_date[1L], composite_component = component[1L]), by = person_id]\n\n  out <- merge(cohort, first, by = \"person_id\", all.x = TRUE)\n  out[, event := as.integer(!is.na(composite_date))]\n  out[, end_date := fifelse(is.na(composite_date), fu_end, composite_date)]\n  out[, time_days := as.integer(end_date - index_date)]\n  out[, .(person_id, index_date, time_days, event, composite_component)]\n}",
        "description": "Time-to-first-event composite with data.table. Inputs mirror the Python version:\n  cohort     : person_id, index_date (Date), fu_end (Date)\n  comp_events: person_id, event_date (Date), component in {'DEATH','MI','STROKE'}\nOutput: one row per patient with composite time, event indicator, and the winning component label.",
        "dependencies": [
          "data.table"
        ],
        "source_citations": [
          "freemantle-2003"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "%let dedup = 30;\n\n/* Step 1: keep in-window events; deduplicate same-component episodes to the earliest claim within &dedup days. */\nproc sql;\n  create table ev as\n  select e.person_id, e.event_date, e.component, c.index_date, c.fu_end\n  from work.comp_events e\n  inner join work.cohort c on e.person_id = c.person_id\n  where e.event_date > c.index_date and e.event_date <= c.fu_end;\nquit;\n\nproc sort data=ev; by person_id component event_date; run;\ndata ev_dedup;\n  set ev; by person_id component event_date;\n  retain last_date;\n  /* Anchored rule: gap measured from the LAST KEPT episode (last_date updates only on output), matching\n     the Python/R helpers and the acute-event-deduplication-window concept. */\n  if first.component then last_date = .;\n  if last_date = . or (event_date - last_date) > &dedup then do;\n    last_date = event_date; output;\n  end;\nrun;\n\n/* Step 2: composite event = earliest qualifying component; keep WHICH component fired. */\nproc sort data=ev_dedup; by person_id event_date; run;\ndata composite;\n  merge work.cohort(in=c)\n        ev_dedup(keep=person_id event_date component rename=(event_date=composite_date component=composite_component));\n  by person_id;\n  if c;\n  if first.person_id;            /* earliest component only */\n  event = (composite_date ne .);\n  if event then end_date = composite_date; else end_date = fu_end;\n  time_days = end_date - index_date;\nrun;\n\n/* Step 3: time-to-first-event analysis (cumulative incidence + cause-specific hazard ratio). */\nproc lifetest data=composite plots=(cif);\n  time time_days*event(0);\n  strata arm;                    /* arm joined from cohort upstream */\nrun;\nproc phreg data=composite;\n  class arm(ref='COMPARATOR');\n  model time_days*event(0) = arm;   /* cause-specific composite hazard ratio */\nrun;\n\n/* Step 4: single non-fatal component (MI) as endpoint, death competing.\n   status: 1=MI, 2=death (competing), 0=censored. 
Fine-Gray subdistribution via eventcode=. */\ndata fg;\n  set composite;\n  if event=1 and composite_component='MI' then status=1;\n  else if event=1 and composite_component='DEATH' then status=2;\n  else status=0;\nrun;\nproc phreg data=fg;\n  class arm(ref='COMPARATOR');\n  model time_days*status(0) = arm / eventcode=1;   /* Fine-Gray: subdistribution HR for MI */\nrun;",
        "description": "Composite construction (PROC SQL) plus the analyses the estimand requires. Required inputs (post data-management):\n  work.cohort     : person_id, index_date, fu_end\n  work.comp_events: person_id, event_date, component ('DEATH'/'MI'/'STROKE')  # each already validated per component\nStep 1 deduplicates same-component acute episodes; Step 2 takes the earliest component as the composite event and keeps\nthe winning component; Step 3 runs time-to-first-event (PROC LIFETEST cumulative incidence + cause-specific PROC PHREG);\nStep 4 shows the Fine-Gray subdistribution model for the case where a non-fatal component (MI) is the endpoint and death\ncompetes (eventcode= on the status variable).",
        "dependencies": [],
        "source_citations": [
          "freemantle-2003"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": "composite-endpoint-construction-rwe-timeline.svg",
        "mermaid": null,
        "caption": "Margaret's three component events are shown on the timeline. The heart attack (MI) on April 10 is the earliest — it fires the composite endpoint on day 100 and closes her follow-up. The stroke (day 204) and death (day 289) appear on the diagram to show they happened, but they do not change the composite result because follow-up stopped at the first event. The stored component label ('MI') is what allows analysts to break out the composite into its pieces later.",
        "alt_text": "A horizontal timeline from January 1 to December 31 2024 for one patient. The index date anchors the left end. A marker labeled 'MI day 100' on April 10 is flagged as the composite trigger. A marker labeled 'Stroke day 204' on July 22 and a marker labeled 'Death day 289' on October 15 appear to the right but are shown as inactive because follow-up closed at day 100. A shaded span from January 1 to April 10 is labeled 'Active follow-up: 100 days'.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Comp[Per-component case definitions<br/>DEATH / MI / STROKE<br/>each with its own PPV + code-position rule] --> Dedup[Deduplicate same-component<br/>acute episodes within clean window]\n  Dedup --> Min[Composite event = EARLIEST component date<br/>store date AND winning component label]\n  Min --> Estimand{Estimand?}\n  Estimand -->|components similar severity| TTE[Time-to-first-event<br/>cause-specific Cox / pooled logistic]\n  Estimand -->|severity differs| Win[Hierarchical / win-ratio<br/>death before hospitalization]\n  Estimand -->|non-fatal piece, death competes| FG[Fine-Gray subdistribution<br/>+ cause-specific hazard]\n  Estimand -->|repeated events matter| Rec[Recurrent events<br/>Andersen-Gill / LWYY]\n  TTE --> Report[Always report component breakdown<br/>+ sensitivity on code-position and dedup window]\n  Win --> Report\n  FG --> Report\n  Rec --> Report",
        "caption": "From per-component case definitions to the composite event date to the estimand. The earliest-component rule means the best-ascertained component tends to win the event-date assignment, so component-level validation and the component breakdown are mandatory diagnostics.",
        "alt_text": "Flowchart from per-component case definitions through deduplication and the earliest-component rule to four estimand branches (time-to-first-event, win ratio, Fine-Gray, recurrent events), all feeding a mandatory component breakdown report.",
        "source_type": "illustrative",
        "source_citations": [
          "freemantle-2003"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "gantt\n  title One patient's component events and the composite event date (claims)\n  dateFormat YYYY-MM-DD\n  axisFormat %b %Y\n  section Follow-up\n  Index date -> follow-up start :milestone, idx, 2024-01-01, 0d\n  section Components\n  Non-fatal MI (I21 primary, inpatient) :crit, mi, 2024-04-10, 1d\n  Stroke claim (within MI clean window -> deduped) :done, st, 2024-04-25, 1d\n  Death (linked mortality) :active, dth, 2024-09-15, 1d\n  section Composite\n  Composite event = earliest = MI; component label retained :milestone, comp, 2024-04-10, 0d",
        "caption": "Time-to-first-event assignment for one patient. The MI is the earliest qualifying component, so the composite event date is the MI date and follow-up is censored there; the death is preserved as a component flag for the competing-risk re-analysis even though it does not count first.",
        "alt_text": "Gantt timeline showing an index date, a non-fatal MI in April 2024, a deduplicated stroke claim, and a death in September 2024, with the composite event assigned to the earliest event (the MI).",
        "source_type": "illustrative",
        "source_citations": [
          "freemantle-2003"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "is_variant_of",
        "target_slug": "outcome-algorithm-construction-rwe",
        "notes": "A composite endpoint is an outcome algorithm whose definition combines several component algorithms under an event-date and deduplication rule."
      },
      {
        "relation_type": "requires",
        "target_slug": "claims-outcome-algorithm-ppv-sensitivity-rwe",
        "notes": "Each component needs its own PPV/sensitivity validation; differential component validity is the main driver of differential ascertainment in the composite."
      },
      {
        "relation_type": "used_with",
        "target_slug": "competing-risks-cause-specific-fine-gray-rwe",
        "notes": "When a non-fatal component is examined with death as a competing event, the composite must be re-cast as a cause-specific and subdistribution analysis rather than censoring death."
      },
      {
        "relation_type": "used_with",
        "target_slug": "cumulative-incidence-risk-rwe",
        "notes": "Component-level and composite cumulative incidence functions are the correct absolute-risk summary in the presence of competing mortality."
      },
      {
        "relation_type": "used_with",
        "target_slug": "mortality-source-hierarchy-rwe",
        "notes": "The death component must be defined from a mortality source hierarchy (linked SSA/NDI before claims proxies) to avoid under-ascertainment that biases the composite toward non-fatal components."
      },
      {
        "relation_type": "used_with",
        "target_slug": "acute-event-deduplication-window-rwe",
        "notes": "A clean/deduplication window collapses transfer and follow-up coding of a single acute event so the composite does not double-count it."
      },
      {
        "relation_type": "used_with",
        "target_slug": "endpoint-adjudication-chart-review-rwe",
        "notes": "Adjudication or chart review of a component sample calibrates its PPV and supports the high-PPV code-position rules used in the composite."
      },
      {
        "relation_type": "alternative_to",
        "target_slug": "restricted-mean-survival-time-rmst",
        "notes": "RMST and win-ratio analyses are severity- or time-weighted alternatives to the equal-weight time-to-first-event treatment of composite components."
      }
    ],
    "aliases": [
      "composite endpoint",
      "composite outcome construction",
      "MACE",
      "major adverse cardiovascular events",
      "multi-component endpoint"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "confounding-by-indication-channeling",
    "name": "Confounding by Indication and Channeling Bias",
    "short_definition": "The dominant structural confounding in pharmacoepidemiology: the medical indication for a drug — and especially its severity — simultaneously determines who receives treatment and independently elevates (or lowers) the outcome risk, so that crude observational associations between drug use and outcomes reflect the underlying disease rather than any true causal drug effect; channeling is the specific variant in which newer drugs are preferentially prescribed to patients who are sicker, have failed prior therapy, or carry more comorbidity.",
    "long_description": "**Core definition and the backdoor path**\n\nConfounding by indication (CBI) is the systematic error that arises when the clinical reason a\ndrug is prescribed — the indication — is itself a determinant of the outcome under study. In a\ndirected acyclic graph, the indication and its severity open a backdoor path: Indication →\nTreatment and Indication → Outcome. Because healthcare database studies cannot randomize\npatients to treatment, patients who receive a given drug will always differ systematically from\nthose who do not, in exactly the ways prescribers consider clinically relevant. This is not a\ndata-quality flaw that more careful collection can eliminate — it is a structural consequence of\nhow medical prescribing works.\n\nThe aggravating feature in claims and EHR data is that *severity* within a diagnosis is\nsubstantially unmeasured. An ICD code for \"type 2 diabetes\" is recorded for a patient whose\nHbA1c is 6.5% on diet alone and for a patient with HbA1c of 11%, recurrent hospitalizations,\nand emerging neuropathy. The prescriber knows the difference; the claims database records\nneither. The same binary code masks a severity gradient that drives both treatment escalation\nand outcome risk. CBI is therefore the strongest and most difficult-to-fix confounding source in\npharmacoepidemiology, because the most important part of the confounder — the part that drives\nthe prescribing decision — is the part the database cannot see.\n\n**Channeling: the original Petri and Urquhart framing**\n\nChanneling is a specific, common manifestation of CBI in which a new drug is systematically\nprescribed to a different — and typically sicker — patient population than an established\ncomparator. 
Petri and Urquhart (1991) coined the term to describe how initial prescribers of a\nnew agent tend to be specialists who see the most severely ill, treatment-refractory patients:\nexactly the patients who would be expected to have worse outcomes regardless of the new drug's\ntrue efficacy or safety.\n\nThe COX-2 selective inhibitor era (celecoxib, rofecoxib, 1999–2004) is the canonical case study.\nCOX-2 inhibitors were approved and marketed as safer alternatives to traditional NSAIDs for\npatients at elevated gastrointestinal risk — patients with prior GI bleeding, peptic ulcer\nhistory, or concurrent anticoagulant use. Observational studies in the early post-market period\nfound higher cardiovascular event rates in COX-2 users than in NSAID users, partly because the\nchanneled patient population — older, more comorbid, more cardiovascularly vulnerable — was\nalready at higher baseline cardiovascular risk. Disentangling the true pharmacological\ncardiovascular signal from the channeling signal required careful active-comparator designs and\nrestriction analyses; both effects were real, but their relative magnitudes could not be\nestimated reliably without controlling the channeling.\n\n**Confounding by contraindication and the reverse direction**\n\nCBI operates in both directions. When an indication increases both treatment probability and\noutcome risk, treated patients appear sicker in the crude analysis (the drug appears harmful or\nless effective than it truly is). When a *contraindication* — a condition that makes the drug\ndangerous — drives withholding of treatment, the untreated patients carry the higher risk:\nanticoagulants are withheld from patients with high bleeding risk; ACE inhibitors are avoided\nin bilateral renal artery stenosis; beta-blockers were historically avoided in asthma. In these\nsettings the treated patients are the healthier ones, and the crude association flatters the\ndrug — the mirror image of the channeling pattern. 
Both directions of CBI are operationally\ninvisible in claims unless the contraindication is fully and consistently coded.\n\n**Direction-of-bias reasoning**\n\nThe direction of crude confounding depends on whether the indication/contraindication is a risk\nfactor or protective factor for the outcome. If the indication increases outcome risk (severe\nheart failure → digoxin → higher baseline mortality), treated patients have higher baseline risk\nand the crude drug-outcome association is confounded upward — the drug appears more harmful or\nless beneficial than it truly is. If the contraindication increases outcome risk and those\npatients are left untreated, the crude association is confounded downward — the drug appears\nmore protective than it truly is. Conducting this direction-of-bias reasoning before analysis\nis essential for interpreting whether residual confounding would inflate or deflate the effect\nestimate, and whether it favors or disfavors the study drug — a critical input to sensitivity\nanalysis and regulatory communication.\n\n**Why standard covariate adjustment fails: residual severity**\n\nIncluding diagnosis codes, comorbidity indices (Charlson, Elixhauser), and propensity scores\nin the outcome model attenuates CBI but rarely eliminates it. The fundamental problem is that\nthe severity signals driving prescribing decisions — HbA1c trajectory, spirometry values,\nfrailty score, symptom burden, the clinician's gestalt — are not recorded in claims. Even\nhigh-dimensional propensity scores (hdPS), which empirically screen hundreds of claims codes\nfor proxy confounders, can recover indirect severity signals (hospitalizations, emergency\nencounters, specialist contacts, polypharmacy) but cannot reconstruct continuous physiological\nmeasurements the prescriber used. 
Residual confounding by unmeasured severity therefore persists\nin essentially every pharmacoepidemiological cohort study, and its direction is predictable from\nthe indication structure. The claims-based frailty index (CFI) partially addresses this for\nelderly populations by aggregating frailty-adjacent codes into a validated composite, but it\nremains a proxy for the underlying physiological state, not a direct measure of it.\n\n**Design fixes, ranked by effectiveness**\n\nThe only complete fix is randomization. Among observational designs, the hierarchy is:\n(1) *Active-comparator, new-user (ACNU) design*: by restricting to initiators of two drugs\nused for the same indication, the indication itself cancels across arms. Both groups had the\nindication; the severity confounding that remains is the within-indication severity differential\nbetween prescribers' choices, which is typically far smaller than the cross-indication gap.\n(2) *Restriction to a single indication* with a narrow severity band reduces the gradient but\ndepends on having severity explicitly coded or measurable.\n(3) *Self-controlled designs* (SCCS, case-crossover): the patient acts as their own control\nacross time periods so stable confounders including the indication and baseline severity cancel.\nAppropriate for acute, transient exposures; cannot handle time-varying confounders including\ndisease progression.\n(4) *Instrumental variable / regression discontinuity*: when a valid instrument exists (physician\nprescribing preference, formulary threshold, geographic variation), IV methods achieve causal\nidentification even with unmeasured severity, at the cost of estimating only a local average\ntreatment effect and relying on strong IV assumptions that are only partly testable.\n(5) *Negative controls and E-value*: after the primary analysis, negative-control outcomes or\nexposures that share the indication-confounding structure without a causal path to the outcome\nmeasure the residual bias; the E-value 
quantifies the minimum unmeasured confounding strength\nneeded to explain away the observed association.\n\n**Pros, cons, and trade-offs**\n\nConfounding by indication is not a design choice — it is a structural feature of non-randomized\nprescribing. The trade-offs are among strategies for controlling it:\n\n- *Active-comparator new-user design*: the most powerful single structural fix. CBI cancels by\n  symmetry when both arms share the indication. Pros: eliminates indication-level confounding\n  with no modeling assumptions; also removes healthy-user bias and prevalent-user bias\n  simultaneously. Cons: answers a comparative (drug A vs drug B) question, not an absolute\n  (drug vs no drug) question; requires a clinically interchangeable comparator; residual\n  within-indication channeling may persist if prescribers differentially select one drug for\n  the sicker within-indication patients.\n- *Propensity score adjustment* (partial fix): addresses measured severity proxies and produces\n  an auditable balance diagnostic. Pros: flexible, standard, transparent. Cons: cannot control\n  unmeasured severity; a balanced PS on measured variables does not imply balance on unmeasured\n  severity gradients, and the appearance of balance can give false confidence.\n- *Restriction and eligibility narrowing*: tightening the cohort to a homogeneous severity\n  stratum reduces the confounding gradient. Pros: conceptually simple. Cons: reduces\n  generalizability; hard to implement in claims without direct severity measurement.\n- *Instrumental variable / RDD*: can achieve causal identification with unmeasured confounding.\n  Pros: removes both measured and unmeasured CBI when the instrument is valid. 
Cons: valid\n  instruments are rare; identifies LATE, not ATE; requires expert validation of exclusion\n  restriction.\n\n**When to use** (meaning: when to treat CBI as a primary design priority)\n\nCBI is a design priority in essentially every observational study of therapeutic drugs where the\ncomparison is not driven by an exogenous mechanism. Specifically, actively address CBI when:\n(a) the drug is prescribed for a condition that is itself a risk factor for the outcome — any\nstudy of diabetes drugs on cardiovascular outcomes, COPD drugs on respiratory hospitalizations,\nantihypertensives on stroke, anticoagulants on thromboembolic outcomes; (b) the study compares\na new drug to an older drug for the same indication (channeling toward newer agents for sicker\nor treatment-refractory patients is the default pattern at market entry); (c) the study uses a\nnon-user comparison group (CBI is strongest here because comparators have no indication at all).\nDefault to an active-comparator new-user design for all head-to-head drug comparisons; reserve\nnon-user designs for questions where no interchangeable comparator exists and quantify the CBI\nrisk explicitly with direction-of-bias reasoning and an E-value.\n\n**When NOT to use** (meaning: when CBI-specific adjustments are unnecessary or counterproductive)\n\n- *When the design already eliminates it structurally*: a self-controlled analysis comparing\n  on-treatment to off-treatment windows within the same patient eliminates stable confounders\n  including the indication. 
Additional CBI adjustment on top of this may introduce overadjustment bias\n  if time-varying severity is also a mediator.\n- *When the cohort is tightly restricted to a single, narrow indication stratum* with documented\n  within-stratum severity balance: the indication-level confounding is largely eliminated and\n  residual channeling within the stratum is the remaining concern.\n- *Do not adjust for post-baseline severity markers*: conditioning on outcomes of disease\n  severity measured after treatment initiation (HbA1c at month 6, post-index symptom score)\n  can block part of the causal path and induce collider bias. CBI adjustment must use pre-treatment\n  severity proxies only, measured in the baseline window.\n- *Do not interpret a balanced propensity score as eliminating CBI*: balance on measured\n  confounders does not rule out residual CBI from unmeasured severity. Always supplement with\n  direction-of-bias reasoning, a negative control, and an E-value in any non-randomized study.\n\n**Interpreting the output**\n\nIn the worked example: 1000 severe patients (60% treated, risk = 0.30) and 1000 mild patients\n(20% treated, risk = 0.10). The drug has no true causal effect (RR = 1.00 within each stratum),\nso the stratum-specific RRs are both 1.00. Pooling the strata gives 800 treated patients with\n180 + 20 = 200 events and 1200 untreated patients with 120 + 80 = 200 events. The crude treated\nrisk is 200 / 800 = 0.25, the crude untreated risk is 200 / 1200 ≈ 0.167, and the crude RR is\n0.25 / 0.167 = 1.50.\n\n*(1) Formal interpretation.* The crude RR of 1.50 is a confounded estimate. The Mantel-Haenszel\nstratum-adjusted RR equals 1.00, consistent with the drug having no causal effect within either\nseverity stratum. The 1.50 arises entirely because severe patients, who have both a higher\ntreatment probability (60%) and a higher outcome risk (30%), are overrepresented among the\ntreated in the pooled analysis. Severity is the confounder, and in claims data it is typically\nunmeasured: if it were perfectly controlled, the association would vanish. 
The direction of confounding is positive\n(crude RR > true RR), consistent with the indication pattern in which sicker patients are more\nlikely to receive treatment and face higher baseline risk. This is the mathematical signature of\nconfounding by indication in a 2-stratum model.\n\n*(2) Practical interpretation.* A study that does not adequately control for severity will report\nthat the drug increases risk by 50% — potentially triggering regulatory action or prescriber\nhesitation — when the drug has no true effect. The scale of bias (1.50 vs 1.00) is proportional\nto the severity gradient in both treatment probability (60% vs 20%) and outcome risk (30% vs\n10%). In real pharmacoepidemiological data, these gradients are typically larger and partly\nunmeasured, so residual confounding after standard covariate adjustment can be substantial and\nin a predictable direction. The design remedy — an active-comparator design restricting to the\nsame indication — removes this bias structurally rather than requiring it to be modeled away\nfrom partially observed data.",
    "primary_category": "Bias_Control",
    "tags": [
      "confounding-by-indication",
      "channeling",
      "channeling-bias",
      "pharmacoepidemiology",
      "bias",
      "severity-confounding",
      "cox-2",
      "active-comparator",
      "residual-confounding",
      "indication",
      "new-drugs"
    ],
    "applies_to_study_types": [
      "cohort_retrospective",
      "cohort_prospective",
      "active_comparator_new_user",
      "new_user",
      "claims_analysis",
      "ehr_study",
      "registry_study"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1002/sim.4780100409",
        "url": "https://doi.org/10.1002/sim.4780100409",
        "citation_text": "Petri H, Urquhart J. Channeling bias in the interpretation of drug effects. Statistics in Medicine. 1991;10(4):577-581.",
        "year": 1991,
        "authors_short": "Petri & Urquhart",
        "notes": "Coined the term \"channeling\" in pharmacoepidemiology: the systematic prescribing of new drugs preferentially to patients who are sicker, have failed prior therapy, or are otherwise at higher risk — exactly the patients with higher baseline outcome rates regardless of the drug. The COX-2 inhibitor experience became the canonical subsequent illustration of this framing."
      },
      {
        "role": "explain",
        "doi": "10.1097/MLR.0b013e3181dbebe3",
        "url": "https://doi.org/10.1097/MLR.0b013e3181dbebe3",
        "citation_text": "Brookhart MA, Stürmer T, Glynn RJ, Rassen J, Schneeweiss S. Confounding control in healthcare database research: challenges and potential approaches. Medical Care. 2010;48(6 Suppl):S114-120.",
        "year": 2010,
        "authors_short": "Brookhart et al.",
        "notes": "Comprehensive taxonomy of confounding in claims and EHR databases, including confounding by indication, channeling, and healthy-user bias. Reviews design solutions (ACNU, restriction, IV) and analytic approaches (PS methods, hdPS) with practical guidance on diagnosing which type of confounding dominates a given study."
      },
      {
        "role": "demonstrate",
        "doi": "10.1093/oxfordjournals.aje.a008895",
        "url": "https://doi.org/10.1093/oxfordjournals.aje.a008895",
        "citation_text": "Blais L, Ernst P, Suissa S. Confounding by indication and channeling over time: the risks of beta-2-agonists. American Journal of Epidemiology. 1996;144(12):1161-1169.",
        "year": 1996,
        "authors_short": "Blais et al.",
        "notes": "Demonstrates both channeling and confounding by indication operating simultaneously in a respiratory pharmacoepidemiology study of beta-2-agonists: sicker asthma patients received the newer long-acting agents, confounding naïve comparative estimates; stratified and restriction analyses recovered materially different effect estimates. A clear empirical illustration of the bias and the design-level remedies."
      },
      {
        "role": "explain",
        "doi": "10.1001/jama.2016.16435",
        "url": "https://doi.org/10.1001/jama.2016.16435",
        "citation_text": "Kyriacou DN, Lewis RJ. Confounding by indication in clinical research. JAMA. 2016;316(17):1818-1819.",
        "year": 2016,
        "authors_short": "Kyriacou & Lewis",
        "notes": "Accessible primer on confounding by indication for a clinical audience: defines the mechanism, illustrates it with clinical examples, and outlines the design and analytic approaches for control. Widely cited in regulatory and clinical contexts as a clear statement of when and why observational comparative effectiveness estimates can be misleading."
      }
    ],
    "plain_language_summary": "Confounding by indication happens when the very reason a doctor prescribes a drug — for example, treating severe diabetes — also makes the patient more likely to have bad outcomes, so a naive comparison makes the drug look harmful even if it has no true effect. Channeling is a related pattern where newer drugs get prescribed mainly to the sickest patients or those who failed older drugs, so studies comparing the new drug to the old one find worse outcomes in the new-drug group — not because the drug is worse, but because the patients were already sicker before they started it. In insurance claims data this bias is especially hard to fix because severity is only partly coded: the database knows a patient has diabetes, but not how severe or well-controlled it is. The most reliable design fix is to compare two drugs prescribed for the same condition, so that the indication and its severity are shared across both groups and cancel out.",
    "key_terms": [
      {
        "term": "confounding by indication",
        "definition": "When the medical reason a drug is prescribed is itself a cause of the outcome being studied, making the drug appear harmful or beneficial simply because sick people get the drug, not because of what the drug actually does."
      },
      {
        "term": "channeling",
        "definition": "The systematic tendency for newer or more specialized drugs to be prescribed preferentially to sicker patients or those who have already failed other treatments, creating an unfair comparison in observational studies."
      },
      {
        "term": "confounding by contraindication",
        "definition": "The reverse of channeling: a drug is withheld from the sickest patients because it is dangerous for them, making the treated group look artificially healthier than untreated patients in a database study."
      },
      {
        "term": "severity gradient",
        "definition": "The range from mild to severe illness within a single diagnosis category; because this gradient drives prescribing decisions but is often not recorded in claims data, it is the main unmeasured source of confounding by indication."
      },
      {
        "term": "active comparator",
        "definition": "A comparison group that takes a different drug for the same condition, so both groups share the same indication — making the indication cancel out as a source of confounding when the two groups are compared."
      },
      {
        "term": "direction-of-bias reasoning",
        "definition": "Working out in advance whether confounding by indication would make a drug look more harmful or more beneficial than it truly is, based on whether the indication itself increases or decreases the risk of the outcome."
      }
    ],
    "worked_example": {
      "scenario": "A claims-based cohort study asks whether a new oral anti-inflammatory drug increases the risk of a serious adverse event over one year. The database holds 2 000 patients: 1 000 with severe disease and 1 000 with mild disease. Physicians prescribe the new drug to 60% of severe patients (who are at highest need) but to only 20% of mild patients. The true pharmacological effect of the drug is null — within each severity stratum the drug neither increases nor decreases the adverse-event risk (stratum-specific RR = 1.00). The analyst first ignores severity and computes a crude RR, then stratifies by severity to recover the true RR.",
      "dataset": {
        "caption": "Four cells of a 2x2-by-stratum table: two severity strata (severe/mild), each split into treated vs untreated. Events and risks are exact by construction (no true drug effect).",
        "columns": [
          "stratum",
          "treatment_status",
          "patients",
          "events",
          "risk"
        ],
        "rows": [
          [
            "severe",
            "treated",
            600,
            180,
            0.3
          ],
          [
            "severe",
            "untreated",
            400,
            120,
            0.3
          ],
          [
            "mild",
            "treated",
            200,
            20,
            0.1
          ],
          [
            "mild",
            "untreated",
            800,
            80,
            0.1
          ]
        ]
      },
      "steps": [
        "Severe stratum (1 000 patients, 60% treated): treated group has 600 patients, untreated has 400. True risk = 0.30 for all severe patients regardless of treatment (RR = 1.00 by design). Events in treated: 600 * 0.30 = 180. Events in untreated: 400 * 0.30 = 120.",
        "Within-stratum check for severe patients: treated risk = 180 / 600 = 0.30; untreated risk = 120 / 400 = 0.30; RR_severe = 0.30 / 0.30 = 1.00. No drug effect.",
        "Mild stratum (1 000 patients, 20% treated): treated group has 200 patients, untreated has 800. True risk = 0.10 for all mild patients (RR = 1.00). Events in treated: 200 * 0.10 = 20. Events in untreated: 800 * 0.10 = 80.",
        "Within-stratum check for mild patients: treated risk = 20 / 200 = 0.10; untreated risk = 80 / 800 = 0.10; RR_mild = 0.10 / 0.10 = 1.00. No drug effect.",
        "Pool both strata (crude, ignoring severity): total treated = 600 + 200 = 800; total untreated = 400 + 800 = 1200; total treated events = 180 + 20 = 200; total untreated events = 120 + 80 = 200.",
        "Crude risk in treated = 200 / 800 = 0.25. Crude risk in untreated is approximately 200 / 1200 ≈ 0.167. Simplify the ratio: crude RR = 1200 / 800 = 1.50. (Cancel the 200s: (200 / 800) / (200 / 1200) = 1200 / 800 = 1.50.)",
        "The crude RR of 1.50 makes the drug look harmful (a 50% risk increase) even though both stratum-specific RRs equal 1.00. The entire inflation is confounding by indication: severe patients (60% treated, 30% risk) dominate the treated arm while mild patients (20% treated, 10% risk) dominate the untreated arm, so the crude treated group is on average sicker than the crude untreated group — an imbalance the drug did not create."
      ],
      "result": "Within-stratum results: RR_severe = 0.30 / 0.30 = 1.00; RR_mild = 0.10 / 0.10 = 1.00. Both stratum-specific RRs are null — no true drug effect in either severity group. Crude pooled result: treated risk = 200 / 800 = 0.25; crude RR = 1200 / 800 = 1.50. The crude RR of 1.50 is entirely spurious confounding by indication. Stratifying by severity (or using an active-comparator design where the indication cancels) restores the true RR = 1.00."
    },
    "prerequisites": [
      "dags-backdoor-criterion-drug-studies",
      "new-user-design",
      "propensity-score-methods-psm-iptw"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Channeling to new drugs (market-entry pattern)",
        "description": "At a new drug's market entry, prescribers — typically specialists at academic centers — treat the most complex, refractory, or severely ill patients first. Early post-marketing studies comparing the new agent to an established comparator inherit this channeling structure. Identification requires examining the new-drug user population's pre-index hospitalization rates, specialist contact, comorbidity burden, and prior-therapy failure history compared to the established-drug arm.",
        "edge_cases": [
          "Channeling attenuates over time as the drug enters mainstream prescribing. Restricting to the first one to two years post-approval captures the strongest channeling period; pooling across multiple market-years can mask the time-varying bias.",
          "Risk-management programs (REMS) explicitly restrict new drugs to specific patient populations, codifying the channeling structure in eligibility criteria — these must be reproduced exactly in the study cohort definition."
        ],
        "data_source_notes": "Claims: compare baseline hospitalization rates, Charlson/Elixhauser scores, specialist visit counts, and prior-line-of-therapy indicators between the new-drug arm and the comparator arm in the 365-day lookback; document the directional imbalance as the channeling evidence. EHR: clinical notes and structured severity fields (NYHA class, FEV1, ECOG) may directly capture the severity differential."
      },
      {
        "name": "Confounding by contraindication (reverse channeling)",
        "description": "A drug is systematically withheld from patients for whom it is contraindicated or risky — who are often the sickest patients. The treated group is healthier on average, and the crude association underestimates harms or overestimates benefits. Examples: withholding metformin in renal impairment, thiazolidinediones in heart failure, fluoroquinolones in patients with tendon-rupture risk.",
        "edge_cases": [
          "The contraindication may be partially captured by diagnosis codes but severity within the contraindicated condition (e.g., degree of renal impairment) is often unmeasured; lab values in EHR are the best proxy but are not available in claims.",
          "Restriction to patients without the contraindication (e.g., exclude eGFR < 30 using prior AKI / CKD codes) partially but not completely removes the bias."
        ],
        "data_source_notes": "Claims: exclude patients with codes for the contraindication in the baseline window; use lab results from linked EHR if available for a severity-calibrated exclusion. Explicitly document the exclusion as a contraindication restriction in the protocol."
      },
      {
        "name": "Indication-severity residual confounding after PS adjustment",
        "description": "Even after propensity score matching or weighting on observed covariates, residual CBI persists when the key severity dimension is poorly measured in claims. The workhorse diagnostic is the E-value: how strong would an unmeasured confounder (e.g., HbA1c trajectory, frailty) need to be to explain the observed association? Supplement with a negative-control outcome (a condition the drug cannot cause) to empirically bound the bias.",
        "edge_cases": [
          "A balanced Table 1 after PS weighting on measured covariates is *not* evidence that CBI is controlled; it is evidence that measured covariates are balanced. The direction-of-bias reasoning determines whether residual unmeasured severity is expected to inflate or deflate the estimate.",
          "Claims-based frailty index (CFBI) and prior hospitalization count are the strongest available severity proxies in older claims cohorts; include them in every PS model even if they are imperfect."
        ],
        "data_source_notes": "Claims: after PS adjustment, compute the E-value for the point estimate and its CI bound; run a negative-control outcome (e.g., accidental injury) through the identical pipeline to empirically measure residual bias. EHR: use structured clinical severity markers (HbA1c, eGFR, FEV1) as direct confounders rather than proxy codes — the most reliable approach when available."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "active-comparator-new-user",
        "pros_of_this": "Naming and quantifying CBI (E-value, negative controls, direction-of-bias) is possible even when no clinically interchangeable comparator exists and a non-user design is unavoidable; it provides an audit trail that the ACNU design cannot quantify.",
        "cons_of_this": "The ACNU design removes indication-level confounding structurally, without modeling assumptions, in a single design step; it is superior to CBI adjustment when a defensible active comparator exists.",
        "when_to_prefer": "Use CBI characterization (E-value, direction-of-bias reasoning, negative control) as a mandatory sensitivity element in all drug studies regardless of design; use it as the primary strategy only when no interchangeable active comparator exists."
      },
      {
        "compared_to": "propensity-score-methods-psm-iptw",
        "pros_of_this": "CBI framing motivates the design choice (ACNU vs non-user) that PS adjustment cannot fix; it guides which confounders matter most (severity proxies) and predicts the direction of residual bias.",
        "cons_of_this": "PS adjustment on measured severity proxies is the standard operational tool within a given design; CBI framing does not replace PS — it informs what the PS must adjust for and what it cannot fix.",
        "when_to_prefer": "Always use both: CBI framing as the conceptual scaffold that determines design choice and identifies what is unmeasured, PS adjustment as the operational tool for measured confounders, and E-value/negative-control as the residual-bias audit trail."
      },
      {
        "compared_to": "healthy-user-bias",
        "pros_of_this": "CBI operates in either direction (sicker patients treated or healthier patients treated), whereas healthy-user bias operates specifically in the direction of health-seeking behavior making treated patients appear healthier; CBI encompasses contraindication confounding that healthy-user framing does not directly address.",
        "cons_of_this": "Healthy-user bias specifically identifies the behavioral and access-related mechanisms (exercise, screening, medication adherence) that can be partially measured with claims utilization proxies; pure CBI framing does not provide these specific proxy measures.",
        "when_to_prefer": "Both biases frequently co-exist; CBI is the dominant concern for severity-driven prescribing decisions (e.g., insulin for severe diabetes), while healthy-user bias is the dominant concern for preventive-therapy adherence. Diagnose both with a negative-control outcome."
      },
      {
        "compared_to": "e-value-sensitivity-analysis",
        "pros_of_this": "CBI framing with direction-of-bias reasoning identifies the *expected* direction and approximate magnitude of residual confounding, giving interpretive context to the E-value.",
        "cons_of_this": "The E-value quantifies the unmeasured confounding strength needed to explain away the observed association — a formal, scale-free summary that CBI framing alone cannot provide.",
        "when_to_prefer": "Use CBI framing to determine whether an E-value is high or low relative to what the indication/severity structure could plausibly provide; use the E-value to communicate residual bias to regulators and reviewers quantitatively."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Severity within a diagnosis is the key unmeasured dimension. Build the most informative severity-proxy set available in the 365-day pre-index window: Charlson and Elixhauser comorbidity scores, prior hospitalization and ICU count, specialist contact (endocrinology, cardiology, pulmonology), polypharmacy count, prior-line-of-therapy failure codes (for channeling diagnoses), and a claims-based frailty index in elderly cohorts. Require continuous medical and pharmacy enrollment across the full baseline window (FFS Parts A/B/D or commercial medical+pharmacy) so that absence of prior-line codes reflects true treatment-naivety rather than unobserved care. For new-drug channeling, stratify analyses by calendar year of index date and check whether the severity imbalance between arms attenuates over time as the drug enters mainstream prescribing. After PS adjustment, always report an E-value and run at least one negative-control outcome.",
      "ehr": "Structured clinical severity fields (HbA1c, eGFR, FEV1, ECOG, NYHA class, BMI, systolic BP) are the best-available severity proxies and should be included directly in the confounder set rather than replaced by diagnostic codes. Visit-driven capture means that sicker patients generate more EHR records — more data is a proxy for higher severity itself. Lab values and vitals are often missing not at random (healthier patients have fewer measurements); use multiple imputation or sensitivity analyses under missing data scenarios. For severity dimensions recorded only in unstructured notes, NLP extraction of symptom burden can augment the structured feature set.",
      "registry": "Disease registries often record validated clinical severity scores (NYHA, GOLD, APACHE, SOFA) that claims cannot approach; use these as primary confounders. Link to claims for full medication history and to identify prior-line treatment failure. Adjudicated registry outcomes are less susceptible to differential severity-driven under-ascertainment than administrative code outcomes.",
      "linked": "Linked claims-EHR-registry datasets provide the best substrate: EHR severity fields + claims medication history + registry severity scores + vital-status mortality. Reconcile order/fill/service dates before assigning the index date. The linkable subset may itself be selected (more health-engaged or more severely ill depending on the linkage trigger), which is an additional form of selection confounding to document."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import math\nimport random\n\nrandom.seed(42)\n\n# ── Simulation parameters matching the worked example ──────────────────────────────\n# True drug effect: RR = 1.00 in both strata (no causal effect)\nstrata = [\n    {\"stratum\": \"severe\", \"n\": 1000, \"p_treat\": 0.60, \"base_risk\": 0.30},\n    {\"stratum\": \"mild\",   \"n\": 1000, \"p_treat\": 0.20, \"base_risk\": 0.10},\n]\n\n# Use exact counts from worked example rather than random draws for reproducibility.\n# severe: 600 treated (180 events), 400 untreated (120 events)\n# mild:   200 treated ( 20 events), 800 untreated ( 80 events)\ncells = [\n    {\"stratum\": \"severe\", \"treated\": 1, \"patients\": 600, \"events\": 180},\n    {\"stratum\": \"severe\", \"treated\": 0, \"patients\": 400, \"events\": 120},\n    {\"stratum\": \"mild\",   \"treated\": 1, \"patients\": 200, \"events\":  20},\n    {\"stratum\": \"mild\",   \"treated\": 0, \"patients\": 800, \"events\":  80},\n]\n\n# ── Stratum-specific RRs (should both equal 1.00) ──────────────────────────────────\nprint(\"=== Stratum-specific RRs ===\")\nfor stratum_name in (\"severe\", \"mild\"):\n    c = {c[\"treated\"]: c for c in cells if c[\"stratum\"] == stratum_name}\n    r_treat = c[1][\"events\"] / c[1][\"patients\"]\n    r_untr  = c[0][\"events\"] / c[0][\"patients\"]\n    rr      = r_treat / r_untr\n    print(f\"  {stratum_name}: treated risk = {r_treat:.4f}, \"\n          f\"untreated risk = {r_untr:.4f}, RR = {rr:.4f}\")\n\n# ── Crude (pooled, ignoring stratum) RR ───────────────────────────────────────────\ntot_treat_events = sum(c[\"events\"] for c in cells if c[\"treated\"] == 1)\ntot_treat_n      = sum(c[\"patients\"] for c in cells if c[\"treated\"] == 1)\ntot_untr_events  = sum(c[\"events\"] for c in cells if c[\"treated\"] == 0)\ntot_untr_n       = sum(c[\"patients\"] for c in cells if c[\"treated\"] == 0)\n\ncrude_r_treat = tot_treat_events / tot_treat_n   # = 200/800 = 0.25\ncrude_r_untr  = tot_untr_events  / 
tot_untr_n    # = 200/1200 ≈ 0.167\ncrude_rr      = crude_r_treat    / crude_r_untr   # = 1.50\n\nprint(\"\\n=== Crude (pooled) RR ===\")\nprint(f\"  Treated:   {tot_treat_events}/{tot_treat_n} = {crude_r_treat:.4f}\")\nprint(f\"  Untreated: {tot_untr_events}/{tot_untr_n} = {crude_r_untr:.6f}\")\nprint(f\"  Crude RR   = {crude_rr:.4f}  (expected: 1.50 — entirely from CBI)\")\n\n# ── Mantel-Haenszel RR (stratum-adjusted) ─────────────────────────────────────────\n# MH numerator term per stratum: (events_treat * n_untr) / total_in_stratum\n# MH denominator term: (events_untr  * n_treat) / total_in_stratum\nmh_num = 0.0\nmh_den = 0.0\nfor stratum_name in (\"severe\", \"mild\"):\n    c = {c[\"treated\"]: c for c in cells if c[\"stratum\"] == stratum_name}\n    n_total   = c[1][\"patients\"] + c[0][\"patients\"]\n    mh_num   += (c[1][\"events\"] * c[0][\"patients\"]) / n_total\n    mh_den   += (c[0][\"events\"] * c[1][\"patients\"]) / n_total\n\nmh_rr = mh_num / mh_den\nprint(\"\\n=== Mantel-Haenszel Adjusted RR ===\")\nprint(f\"  MH numerator   = {mh_num:.4f}\")\nprint(f\"  MH denominator = {mh_den:.4f}\")\nprint(f\"  MH RR          = {mh_rr:.4f}  (expected: 1.00 — true null effect)\")\nprint()\nprint(f\"Conclusion: Crude RR = {crude_rr:.2f} vs MH-adjusted RR = {mh_rr:.2f}.\")\nprint(\"The entire excess (1.50 vs 1.00) is confounding by indication.\")\nprint(\"Design fix: active-comparator new-user design restricts to the same indication,\")\nprint(\"making severity balance within-stratum rather than requiring stratification.\")",
        "description": "Simulate a confounding-by-indication dataset replicating the worked-example structure\n(1 000 severe + 1 000 mild patients; treatment probability 60% / 20%; outcome risk 0.30 / 0.10;\ntrue drug RR = 1.00 in both strata). Compute the crude RR and the Mantel-Haenszel\nstratum-adjusted RR to demonstrate how stratification recovers the true null effect.\nNo external dependencies.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "# ── Exact cell counts from the worked example ──────────────────────────────────────\n# Rows: treated (1), untreated (0); Columns: event (1), no event (0)\nsevere_2x2 <- matrix(\n  c(180, 420,   # treated:   180 events, 420 no-events\n    120, 280),  # untreated: 120 events, 280 no-events\n  nrow = 2, byrow = TRUE,\n  dimnames = list(treat = c(\"treated\",\"untreated\"), event = c(\"yes\",\"no\"))\n)\nmild_2x2 <- matrix(\n  c( 20, 180,   # treated:   20 events, 180 no-events\n     80, 720),  # untreated: 80 events, 720 no-events\n  nrow = 2, byrow = TRUE,\n  dimnames = list(treat = c(\"treated\",\"untreated\"), event = c(\"yes\",\"no\"))\n)\n\nrr_stratum <- function(m) {\n  r_treat <- m[\"treated\",   \"yes\"] / sum(m[\"treated\",   ])\n  r_untr  <- m[\"untreated\", \"yes\"] / sum(m[\"untreated\", ])\n  list(r_treat = r_treat, r_untr = r_untr, rr = r_treat / r_untr)\n}\n\ncat(\"=== Stratum-specific RRs ===\\n\")\nsev <- rr_stratum(severe_2x2)\nmld <- rr_stratum(mild_2x2)\ncat(sprintf(\"  Severe: treated risk = %.4f, untreated = %.4f, RR = %.4f\\n\",\n            sev$r_treat, sev$r_untr, sev$rr))\ncat(sprintf(\"  Mild:   treated risk = %.4f, untreated = %.4f, RR = %.4f\\n\",\n            mld$r_treat, mld$r_untr, mld$rr))\n\n# ── Crude (pooled) RR ──────────────────────────────────────────────────────────────\npooled <- severe_2x2 + mild_2x2\ncrude  <- rr_stratum(pooled)\ncat(sprintf(\"\\n=== Crude RR ===\\n  Treated risk = %.4f, Untreated = %.6f, Crude RR = %.4f\\n\",\n            crude$r_treat, crude$r_untr, crude$rr))\n\n# ── Manual Mantel-Haenszel RR ─────────────────────────────────────────────────────\nmh_term <- function(m) {\n  N <- sum(m)\n  list(num = m[\"treated\",\"yes\"] * sum(m[\"untreated\",]) / N,\n       den = m[\"untreated\",\"yes\"] * sum(m[\"treated\",]) / N)\n}\nts <- mh_term(severe_2x2)\ntm <- mh_term(mild_2x2)\nmh_rr <- (ts$num + tm$num) / (ts$den + tm$den)\ncat(sprintf(\"\\n=== Mantel-Haenszel Adjusted RR ===\\n  MH RR = %.4f 
(expected 1.00)\\n\", mh_rr))\n\n# ── E-value for the crude RR of 1.50 ─────────────────────────────────────────────\n# E-value formula (VanderWeele & Ding 2017): E = RR + sqrt(RR * (RR - 1))\nrr_obs  <- crude$rr\ne_value <- rr_obs + sqrt(rr_obs * (rr_obs - 1))\ncat(sprintf(\"\\n=== E-value for crude RR = %.2f ===\\n\", rr_obs))\ncat(sprintf(\"  E-value = %.4f + sqrt(%.4f * (%.4f - 1)) = %.4f\\n\",\n            rr_obs, rr_obs, rr_obs, e_value))\ncat(\"  Interpretation: an unmeasured confounder would need to be associated\\n\")\ncat(sprintf(\"  with both treatment and outcome by a factor of >=%.2f to explain\\n\", e_value))\ncat(\"  the crude RR of 1.50. Severity in claims (partially unmeasured) could\\n\")\ncat(\"  plausibly reach this threshold, supporting the CBI diagnosis.\\n\")\n\n# ── Conclusion ────────────────────────────────────────────────────────────────────\ncat(sprintf(\"\\nCrude RR = %.2f vs MH-adjusted RR = %.2f.\\n\", crude$rr, mh_rr))\ncat(\"The entire excess is confounding by indication — eliminated by stratification.\\n\")\ncat(\"Design fix: active-comparator new-user design shares the indication across arms.\\n\")",
        "description": "Replicate the worked-example arithmetic in R and compute the Mantel-Haenszel stratified RR\nusing a 2x2 array (epiR::epi.2by2 or manual computation). Demonstrates the crude vs\nadjusted discrepancy and how the E-value bounds the unmeasured-severity confounder needed to\nexplain the crude RR of 1.50. No external dependencies for the core arithmetic; epiR is\noptional for the MH interval.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  SEV[\"Indication severity\\n(e.g., HbA1c trajectory,\\nfrailty, prior failure)\\n— largely unmeasured\\nin claims\"]\n  TX[\"Drug treatment\\n(new or study drug)\"]\n  OUT[\"Outcome\\n(adverse event, hospitalization,\\nmortality)\"]\n  MED[\"Prescriber decision\\n(indication + severity)\"]\n  SEV -->|drives prescribing| MED\n  MED -->|channel to new/study drug| TX\n  SEV -->|direct biological risk| OUT\n  TX -->|true causal effect\\n(may be null)| OUT\n  style SEV fill:#ffcccc,stroke:#cc0000\n  style MED fill:#fff3cc,stroke:#ccaa00",
        "caption": "DAG of confounding by indication and channeling. Indication severity (red) opens a backdoor path to the outcome that does not pass through treatment. In claims data, this severity node is the unmeasured component — diagnosis codes are recorded, but the severity trajectory driving the prescribing decision is not. The active-comparator new-user design closes this path by ensuring both arms share the same indication node.",
        "alt_text": "Directed acyclic graph showing indication severity pointing to both the prescriber decision (which channels patients to the new drug) and the outcome directly. Treatment points to the outcome via the true causal effect. The backdoor path through severity is the source of confounding by indication.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Q[\"Is there a clinically\\ninterchangeable active\\ncomparator for the\\nsame indication?\"]\n  Q -->|\"Yes\"| ACNU[\"Active-comparator new-user design\\nIndication cancels across arms\\nCBI removed structurally\"]\n  Q -->|\"No\"| NONUSER[\"Non-user comparison\\nCBI likely — document it\"]\n  NONUSER --> DIR[\"Direction-of-bias reasoning:\\nDoes indication increase or\\ndecrease outcome risk?\"]\n  DIR --> PS[\"Propensity score on all available\\nseverity proxies (Charlson,\\nhospitalizations, frailty index,\\nprior-line-of-therapy)\"]\n  PS --> NC[\"Negative-control outcome:\\ndoes the 'drug effect' persist\\nfor an outcome it cannot cause?\"]\n  NC -->|\"Yes — residual bias\"| EVAL[\"Report E-value; calibrate;\\ndo not interpret causally\"]\n  NC -->|\"No — bias plausibly controlled\"| EST[\"Report estimate with E-value\\nand direction-of-bias caveat\"]\n  ACNU --> WITHIN[\"Within-design channeling check:\\nare severity proxies balanced\\nbetween arms?\"]\n  WITHIN -->|\"Imbalanced\"| PSACNU[\"PS or restriction within the\\nACNU cohort; report residual E-value\"]\n  WITHIN -->|\"Balanced\"| ESTACNU[\"Report comparative estimate\\nwith E-value sensitivity\"]",
        "caption": "Decision logic for addressing confounding by indication in an observational drug study. The ACNU design is the primary structural fix; when unavailable, the audit trail of direction-of-bias reasoning, severity proxy adjustment, negative-control falsification, and E-value reporting forms the minimum standard for a credible non-user comparison.",
        "alt_text": "Decision flowchart starting from whether a clinically interchangeable active comparator exists, branching to either the ACNU design (with a within-design channeling check) or a non-user comparison path requiring direction-of-bias reasoning, propensity score adjustment, negative-control falsification, and E-value reporting.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "see_also",
        "target_slug": "active-comparator-new-user",
        "notes": "The active-comparator new-user design is the primary structural fix for confounding by indication: restricting to initiators of two drugs for the same indication makes the indication cancel across arms, removing the main source of CBI without relying on modeling assumptions."
      },
      {
        "relation_type": "see_also",
        "target_slug": "healthy-user-bias",
        "notes": "Sibling bias: CBI makes treated patients appear sicker (indication drives treatment and outcome risk), while healthy-user bias makes treated patients appear healthier (health-seeking behavior drives both treatment uptake and lower risk). The two can co-exist and partially cancel, making a balanced Table 1 misleading."
      },
      {
        "relation_type": "see_also",
        "target_slug": "prevalent-user-bias",
        "notes": "Prevalent-user bias and CBI co-occur in studies that fail to restrict to incident initiators: prevalent users who survived the early high-risk window are a severity-selected survivor population, compounding severity-driven channeling with depletion-of-susceptibles."
      },
      {
        "relation_type": "see_also",
        "target_slug": "propensity-score-methods-psm-iptw",
        "notes": "PS adjustment on measured baseline covariates is the standard partial mitigation for CBI, but it cannot control unmeasured severity; balance on measured variables does not imply balance on the severity gradient driving prescribing decisions."
      },
      {
        "relation_type": "used_with",
        "target_slug": "e-value-sensitivity-analysis",
        "notes": "The E-value quantifies the minimum unmeasured-confounder strength needed to explain away the crude or PS-adjusted association; it is the mandatory audit-trail companion to any study with residual CBI, and direction-of-bias reasoning from CBI contextualizes whether the E-value is plausibly achievable by the indication severity structure."
      },
      {
        "relation_type": "used_with",
        "target_slug": "claims-based-frailty-index-rwe",
        "notes": "The claims-based frailty index aggregates frailty-adjacent codes (falls, weight loss, pressure ulcers, disability) into a severity proxy that partially captures the unmeasured severity gradient driving prescribing decisions in elderly claims cohorts; include it in the PS model whenever it is available and the cohort includes patients 65 or older."
      }
    ],
    "aliases": [
      "confounding by indication",
      "channeling bias",
      "indication bias",
      "prescribing channeling",
      "severity confounding",
      "confounding by contraindication"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "continuous-enrollment-observable-time-rwe",
    "name": "Continuous Enrollment and Observable Time",
    "short_definition": "The requirement that a person be under continuous data capture (enrolled in a health plan, or within an EHR/registry observation window) across each baseline and follow-up interval, so that the absence of a record can be interpreted as the absence of the event rather than as unobserved care.",
    "long_description": "**Continuous enrollment / observable time** is the data-availability requirement that underlies almost every other\noperational decision in claims, EHR, and registry research. A subject contributes valid **observable person-time** only\nduring intervals in which the data source is actually capturing that person's care. In claims, observability means active\nenrollment with the relevant benefit (medical *and* pharmacy, for the lines of business that flow claims); in EHR it means\nan open observation period at a site that records the encounters of interest; in registries it means an active follow-up\nstatus with the data elements being abstracted. Outside those intervals, the data are *silent*, and silence is not the\nsame as \"no event.\"\n\n**Core conceptual distinction — observability vs. occurrence.** The entire validity of count-based and time-to-event RWE\nrests on the assumption that, within observable time, **absence of a claim/record equals absence of the event**. Continuous\nenrollment is what makes that assumption defensible. It governs three separate things that beginners conflate: (1) the\n**baseline lookback** — you can only assert \"no prior diagnosis/treatment\" if the person was observable throughout the\nlookback (this is what a washout requires); (2) **follow-up person-time** — the denominator for rates and the time-at-risk\nfor survival models must be restricted to observable intervals, or you systematically miss events that occurred off-plan;\nand (3) **outcome ascertainment** — an outcome that happens during an unobserved gap is misclassified as a non-event. 
The\nconcept is therefore the data-observability *precondition* for the washout (`washout-clean-lookback-period-rwe`), the\nperson-time denominator (`person-time-denominator-construction-rwe`), and time-zero alignment\n(`time-zero-index-date-alignment-rwe`); those concepts assume continuous enrollment has already been established.\n\n**Pros, cons, and trade-offs.**\n- **vs. no enrollment requirement (use everyone with any claim):** Requiring continuous enrollment removes the most common\n  source of **differential outcome misclassification** in claims (events that fall in coverage gaps are coded as\n  non-events) and makes lookback-based exclusions honest. Cost: it shrinks the cohort, skews it toward the continuously\n  insured (typically healthier, more stably employed, or older Medicare-eligible), and can erode generalizability. **Prefer\n  a continuous-enrollment requirement** for any rate, incidence, or survival endpoint; relax it only for cross-sectional\n  prevalence on a fixed date where prior observability is irrelevant.\n- **vs. strict zero-gap enrollment:** A small, pre-specified gap tolerance (e.g., allow one gap of <=45 days, bridged by\n  assuming continuity) recovers people who churn briefly between plans or have administrative coverage lapses, increasing\n  power and representativeness. Cost: a tolerated gap is a window of true non-observability — events in it are still missed,\n  biasing rates downward — so the tolerance must be small relative to the outcome's detectability window and reported.\n  **Prefer a modest, transparent gap rule** over strict zero-gap when churn is common, but never tolerate gaps longer than\n  the induction/latency window of the outcome.\n- **vs. 
as-treated or registry-driven observability:** Continuous *plan* enrollment is the right observability frame for\n  claims; for EHR/registry the analogous frame is the **observation period** (`omop-observation-period-rwe`), which is\n  inferred from encounter density rather than an explicit enrollment field and is therefore softer and more error-prone.\n  **Prefer enrollment fields when available**; reconstruct observation periods only when no enrollment table exists, and\n  validate them against known-capture events (e.g., annual wellness visits).\n\n**When to use.** Any incidence/prevalence rate, any time-to-event analysis, any new-user/washout design, any cost or\nutilization measure expressed per person-time (PMPM/PPPM), and any safety study where a missed event is a false negative.\nBuild observable-time spans first, then derive baseline windows, time zero, and follow-up *inside* those spans.\n\n**When NOT to use / when it is actively misleading.**\n- **Immortal-time creation.** Conditioning cohort entry on a qualifying event that occurs after index (e.g., requiring\n  a fixed post-index enrollment minimum, or requiring a second fill to confirm exposure) guarantees that subjects were\n  event-free and observable from index to that event. If that interval is counted as exposed/at-risk, it manufactures\n  **immortal time bias** (`immortal-time-bias-handling`), the classic trap in procedure and adherence studies. Start\n  follow-up at time zero and let enrollment *censor*, never *select*, follow-up.\n- **Medicare Advantage (MA) person-time treated as observable.** MA encounter data are notoriously incomplete and, in many\n  research extracts, MA enrollees do not generate the fee-for-service (FFS) claims that downstream code assumes. Counting\n  MA-only spans as observable makes \"no event\" largely an artifact of missing claims: events and prior treatments simply\n  do not appear. 
Restrict to Parts A+B (and D for drug exposure) FFS, and **exclude MA-only person-time**; otherwise event\n  rates are biased downward and washout exclusions are falsely passed.\n- **Differential observability by exposure.** If one arm is enrolled/captured more completely than the other (e.g., a drug\n  requiring specialty-pharmacy enrollment, or a comparator concentrated in a churning Medicaid population), continuous\n  enrollment differs by arm and the resulting differential ascertainment mimics a treatment effect. Diagnose by comparing\n  enrollment duration and gap distributions across arms.\n- **Coverage gap that swallows the outcome.** For an acute, quickly fatal, or out-of-network-treated outcome (e.g.,\n  out-of-area MI, hospice care), enrollment may technically continue while the *capturing* benefit does not, or the event\n  is paid outside the observed plan. Continuous enrollment is necessary but not sufficient; pair it with a mortality source\n  hierarchy (`mortality-source-hierarchy-rwe`) and out-of-network capture checks.\n\n**Data-source operational depth.**\n- **Claims (commercial / Medicare FFS / Medicaid):** Observable time = enrollment spans with the right benefit. Require\n  *both* medical and pharmacy enrollment whenever exposure is a drug, because medical-only enrollees never generate\n  pharmacy claims and would falsely pass a drug washout. Reconcile the raw monthly eligibility table into continuous spans,\n  apply the gap rule, and intersect every analysis window with these spans. Failure modes: MA-only person-time lacking FFS\n  claims (above); plan switching that fragments enrollment within the same person; mid-month enrollment booleans that\n  overstate coverage; capitated/bundled arrangements where services are paid without itemized claims; and adjudication lag\n  at the end of data that mimics a coverage gap (truncate follow-up before the run-out window). 
In Medicaid, frequent\n  churn makes a strict zero-gap rule discard a large, non-random share of the cohort.\n- **EHR:** There is no enrollment field; observability is inferred from encounter activity (the OMOP observation period or a\n  \"first/last note\" window). A patient who seeks care elsewhere is **differentially lost** without any signal, so \"no\n  record of diagnosis X\" can mean off-system care, not absence. Define observation windows explicitly, prefer sites/systems\n  with high capture, link to claims to confirm continuity, and treat loss to follow-up as potentially informative\n  (`attrition-and-loss-to-follow-up-rwe`).\n- **Registry:** Observability = active follow-up status with the elements being abstracted; completeness varies by visit\n  schedule and site. Link to claims for interval health-care use between registry visits and to a death index to firm up\n  censoring. Registry \"no event\" between scheduled visits is interval-censored, not point-observed.\n- **Linked claims–EHR–vital records:** The strongest substrate — enrollment from claims gives true observability windows,\n  EHR adds severity, vital records firm up mortality — but linkage restricts to the linkable subset (selection) and creates\n  date discrepancies (enrollment span vs. encounter date vs. service date) that must be reconciled before any window is\n  intersected.\n\n**Worked claims example.** Question: 12-month incidence of hospitalized acute pancreatitis among new initiators of a GLP-1\nreceptor agonist in a commercial + Medicare FFS database. (1) **Build observable spans:** collapse the monthly eligibility\ntable into continuous enrollment spans requiring *both* medical and pharmacy benefit; exclude any MA-only months so that\nabsence of a claim is genuine. (2) **Apply a gap rule:** treat the person as continuously enrolled if any enrollment gap is\n<=45 days, bridging it as covered; a single longer gap truncates the span. 
(3) **Baseline observability / washout:** require\n365 days of continuous observable time before the first GLP-1 fill (`fill_date`) so the new-user (no prior GLP-1) and\nexclusion (no prior pancreatitis `dx` in any position) criteria are verifiable, not assumed. (4) **Time zero** = that first\nqualifying `fill_date`. (5) **Follow-up person-time:** accrue at-risk time from time zero only while the observable span is\nopen; end follow-up at the *earliest* of the outcome (first hospitalized pancreatitis: >=1 inpatient claim with the\nqualifying `dx` in the primary position) or a censoring event (disenrollment / end of the observable span, death from the\nmortality hierarchy, or 365 days), and stop follow-up before the claims run-out window to avoid mistaking adjudication lag\nfor a coverage gap. (6) **First-event coding:**\ncount only the first qualifying event per person; deduplicate same-episode inpatient claims. (7) **Diagnostics:** report the\nattrition funnel (continuous-enrollment requirement is typically the largest single exclusion), the enrollment-gap\ndistribution, person-time by arm, and a sensitivity analysis varying the gap tolerance (0, 30, 45 days) and the lookback\nlength to show the rate is not an artifact of the observability rule.",
    "primary_category": "Data_Quality_Assessment",
    "tags": [
      "continuous-enrollment",
      "observable-time",
      "person-time",
      "claims-enrollment",
      "medicare-ffs-vs-ma",
      "coverage-gap",
      "outcome-ascertainment",
      "pharmacoepidemiology"
    ],
    "applies_to_study_types": [
      "claims_analysis",
      "ehr_study",
      "registry_linkage",
      "comparative_effectiveness",
      "drug_utilization"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1016/j.jclinepi.2004.10.012",
        "url": "https://doi.org/10.1016/j.jclinepi.2004.10.012",
        "citation_text": "Schneeweiss S, Avorn J. A review of uses of health care utilization databases for epidemiologic research on therapeutics. Journal of Clinical Epidemiology. 2005;58(4):323-337.",
        "year": 2005,
        "authors_short": "Schneeweiss & Avorn",
        "notes": "Canonical statement of how enrollment, observability, and data continuity in claims databases govern valid exposure and outcome ascertainment for therapeutic epidemiology."
      },
      {
        "role": "explain",
        "doi": "10.1002/pds.2229",
        "url": "https://doi.org/10.1002/pds.2229",
        "citation_text": "Hall GC, Sauer B, Bourke A, Brown JS, Reynolds MW, LoCasale RJ. Guidelines for good database selection and use in pharmacoepidemiology research. Pharmacoepidemiology and Drug Safety. 2012;21(1):1-10.",
        "year": 2012,
        "authors_short": "Hall et al.",
        "notes": "Database selection and use guidance; specifies enrollment continuity and observable-time checks as core to fitness for purpose."
      },
      {
        "role": "explain",
        "doi": "10.1093/aje/kwm324",
        "url": "https://doi.org/10.1093/aje/kwm324",
        "citation_text": "Suissa S. Immortal time bias in pharmacoepidemiology. American Journal of Epidemiology. 2008;167(4):492-499.",
        "year": 2008,
        "authors_short": "Suissa",
        "notes": "Defines the dominant failure mode when observable/at-risk time is mis-specified - intervals guaranteed to be event-free and observable are wrongly counted as at-risk, manufacturing immortal time bias."
      },
      {
        "role": "use",
        "doi": "10.1007/s40471-015-0053-5",
        "url": "https://doi.org/10.1007/s40471-015-0053-5",
        "citation_text": "Lund JL, Richardson DB, Stürmer T. The active comparator, new user study design in pharmacoepidemiology: historical foundations and contemporary application. Current Epidemiology Reports. 2015;2(4):221-228.",
        "year": 2015,
        "authors_short": "Lund et al.",
        "notes": "Operationalizes continuous enrollment across the washout as the precondition that makes the new-user (no prior exposure) restriction verifiable in administrative data."
      }
    ],
    "plain_language_summary": "A claims database records events — doctor visits, hospital stays, filled prescriptions — only while a person is actively enrolled in a health plan. When enrollment lapses, the database goes silent: it cannot tell you whether the person had no events or simply had events that went unrecorded. Continuous enrollment is the rule that a patient must have an unbroken coverage record across every analysis window, so that 'no record of an event' can honestly be read as 'no event.' The catch is that requiring perfect, unbroken coverage shrinks your study group and can favor healthier, more stably employed people.",
    "key_terms": [
      {
        "term": "observable time",
        "definition": "The stretches of calendar time during which a patient is enrolled and the database is actually capturing their care — the only periods where a missing record safely means a missing event."
      },
      {
        "term": "coverage gap",
        "definition": "A span of days when a patient's health-plan enrollment has lapsed, so any care they receive during that stretch does not appear in the claims data."
      },
      {
        "term": "enrollment span",
        "definition": "A continuous block of time — with no gaps, or only very short bridged gaps — during which a patient holds active health-plan coverage with both medical and pharmacy benefits."
      },
      {
        "term": "washout period",
        "definition": "A required lookback window before a patient's study entry date, used to confirm they had no prior diagnosis or treatment being studied — valid only if the entire lookback falls inside observable time."
      },
      {
        "term": "person-time at risk",
        "definition": "The sum of calendar days a patient is both enrolled and counted in the study's follow-up — used as the denominator when calculating how often an event occurs per unit of time."
      }
    ],
    "worked_example": {
      "scenario": "Patient 1001 is enrolled in a commercial health plan for most of 2023, but their coverage lapses for 45 days in the fall. We want to count hospital admissions over a January 1 – December 31, 2023 observation window. We need to know which days are truly observable, how many person-days the patient contributes, and what happens to an event that falls inside the gap.",
      "dataset": {
        "caption": "Raw monthly enrollment rows for patient 1001; each row flags a calendar month with active coverage, even if coverage began partway through the month.",
        "columns": [
          "person_id",
          "elig_month",
          "medical",
          "pharmacy"
        ],
        "rows": [
          [
            1001,
            "2023-01-01",
            true,
            true
          ],
          [
            1001,
            "2023-02-01",
            true,
            true
          ],
          [
            1001,
            "2023-03-01",
            true,
            true
          ],
          [
            1001,
            "2023-04-01",
            true,
            true
          ],
          [
            1001,
            "2023-05-01",
            true,
            true
          ],
          [
            1001,
            "2023-06-01",
            true,
            true
          ],
          [
            1001,
            "2023-07-01",
            true,
            true
          ],
          [
            1001,
            "2023-08-01",
            true,
            true
          ],
          [
            1001,
            "2023-09-01",
            true,
            true
          ],
          [
            1001,
            "2023-11-01",
            true,
            true
          ],
          [
            1001,
            "2023-12-01",
            true,
            true
          ]
        ],
        "note": "October 2023 is missing: the patient had no coverage that month or during the first two weeks of November (the 2023-11-01 row flags the whole month even though coverage resumed on November 15). The patient was hospitalized on 2023-10-20, but no claim for that stay will ever appear in this plan's data because the patient was not enrolled when it occurred."
      },
      "steps": [
        "Collapse the monthly rows into continuous enrollment spans: January 1 through September 30 is one unbroken span (9 months = 273 days).",
        "October has no row, and although the 2023-11-01 row flags November as covered, coverage only resumes on November 15 (the re-enrollment date after a 45-day lapse), so October 1 – November 14 is a coverage gap of 31 + 14 = 45 days.",
        "The second enrollment span runs November 15 through December 31 = 16 + 31 = 47 days.",
        "Total observable days = 273 (first span) + 47 (second span) = 320 days.",
        "The hospitalization on October 20 falls inside the 45-day gap — no claim was submitted to this plan, so the database shows nothing; the event is invisible.",
        "An analyst who ignores the gap would see zero hospitalizations for patient 1001 over 365 days and would wrongly conclude the patient was event-free all year.",
        "The correct count is: 320 observable days contributed, 0 observed hospitalizations within observable time, and a note that 45 days were unobservable — the October 20 event cannot be classified."
      ],
      "result": {
        "label": "Observable days / Total window days",
        "value": "320 / 365 = 87.7% of the year was observable; the 45-day gap is genuine unobservable time, and the October 20 hospitalization is invisible to the study."
      },
      "timeline_spec": {
        "title": "Observable vs. unobservable time for one patient — enrollment gap hides a hospitalization",
        "window": {
          "start": "2023-01-01",
          "end": "2023-12-31",
          "label": "Observation window: 365 days"
        },
        "spans": [
          {
            "kind": "enrolled",
            "start": "2023-01-01",
            "end": "2023-09-30",
            "label": "Enrolled (observable): 273 days"
          },
          {
            "kind": "gap",
            "start": "2023-10-01",
            "end": "2023-11-14",
            "label": "Coverage gap (unobservable): 45 days — events here are INVISIBLE"
          },
          {
            "kind": "enrolled",
            "start": "2023-11-15",
            "end": "2023-12-31",
            "label": "Re-enrolled (observable): 47 days"
          }
        ],
        "events": [
          {
            "label": "Hospitalization (INVISIBLE — inside gap)",
            "date": "2023-10-20",
            "kind": "unexposed",
            "note": "Claim never submitted to this plan; appears nowhere in the database. A naive analyst counting 'no hospitalizations' over 365 days is mistaking unobservability for a non-event."
          }
        ],
        "result": {
          "label": "320 observable days / 365 total days = 87.7% observable; 1 event invisible",
          "value": 0.877
        },
          "caption": "Patient 1001's enrollment timeline. The orange band is the 45-day coverage gap; any event inside it is silent in the database. The hospitalization on October 20 falls squarely in the gap, so unless the analyst restricts follow-up to observable spans, those 45 days are wrongly counted as event-free person-time.",
        "alt_text": "Horizontal timeline from January 1 to December 31, 2023. A green bar spans January through September 30 labeled 'Enrolled (observable): 273 days'. An orange bar spans October 1 through November 14 labeled 'Coverage gap (unobservable): 45 days'. A second green bar spans November 15 through December 31 labeled 'Re-enrolled (observable): 47 days'. A red marker on October 20 inside the orange bar is labeled 'Hospitalization INVISIBLE'. Below the timeline, a results row reads '320 observable days out of 365 total'."
      }
    },
    "prerequisites": [
      "washout-clean-lookback-period-rwe",
      "person-time-denominator-construction-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Strict zero-gap continuous enrollment",
        "description": "Requires uninterrupted enrollment (with the relevant benefit) across the entire baseline and follow-up windows; any gap truncates observable time.",
        "edge_cases": [
          "Discards a large, often non-random share of churning populations (Medicaid, gig/seasonal commercial), eroding power and representativeness.",
          "Mid-month enrollment flags and end-of-data adjudication lag can masquerade as gaps and over-exclude."
        ],
        "data_source_notes": "claims: collapse the monthly eligibility table into spans and require span coverage of every analysis window; require medical AND pharmacy benefit for drug exposure."
      },
      {
        "name": "Gap-tolerant continuous enrollment (bridged)",
        "description": "Treats enrollment as continuous if every gap is at or below a pre-specified threshold (e.g., <=30 or <=45 days), bridging short gaps as covered to recover brief plan churn.",
        "edge_cases": [
          "A bridged gap is true non-observability; events within it are still missed, biasing rates downward, so the tolerance must be small relative to the outcome's detectability window.",
          "The threshold is a sensitivity-analysis parameter, not a fixed convention - report results across 0/30/45 days."
        ],
        "data_source_notes": "claims: implement by merging adjacent spans separated by <= threshold days; never bridge a gap longer than the outcome induction/latency window."
      },
      {
        "name": "Inferred observation period (EHR / registry)",
        "description": "No enrollment field exists; observability is reconstructed from encounter density (first/last activity, OMOP observation period) or active registry follow-up status.",
        "edge_cases": [
          "Off-system care is invisible, so \"no record\" can mean care elsewhere rather than absence of the event (differential loss).",
          "Capture varies by site, calendar time, and visit schedule; registry intervals between visits are interval-censored."
        ],
        "data_source_notes": "ehr/registry: validate inferred windows against known-capture events (annual visits); link to claims where possible to confirm continuity and interval health-care use."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "No enrollment requirement (any person with any claim)",
        "pros_of_this": "Removes differential outcome misclassification from coverage gaps and makes lookback-based exclusions honest; rates and time-at-risk denominators become interpretable.",
        "cons_of_this": "Shrinks the cohort and skews it toward the continuously insured, eroding generalizability.",
        "when_to_prefer": "Any incidence/prevalence, time-to-event, or per-person-time cost/utilization endpoint; relax only for cross-sectional prevalence on a fixed date."
      },
      {
        "compared_to": "Strict zero-gap enrollment",
        "pros_of_this": "A modest, transparent gap tolerance recovers brief plan churn, improving power and representativeness.",
        "cons_of_this": "A tolerated gap is genuine non-observability; events inside it are missed, biasing rates downward.",
        "when_to_prefer": "When churn is common and the outcome's detectability window exceeds the gap tolerance; report across gap thresholds."
      },
      {
        "compared_to": "Inferred EHR observation period",
        "pros_of_this": "Explicit enrollment fields give true, dated observability windows rather than encounter-inferred ones.",
        "cons_of_this": "Enrollment fields are unavailable in standalone EHR; inferred windows are softer and miss off-system care.",
        "when_to_prefer": "Whenever an enrollment table exists; reconstruct observation periods only when it does not, and validate them."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Observable time = enrollment spans with the relevant benefit (require medical AND pharmacy for drug exposure). Collapse the monthly eligibility table into continuous spans, apply the gap rule, and intersect every baseline/follow-up window with these spans. Exclude MA-only person-time where FFS claims do not flow; truncate follow-up before the claims run-out window so adjudication lag is not mistaken for a coverage gap.",
      "ehr": "No enrollment field - infer the observation period from encounter density and define windows explicitly. Treat off-system care as differential, potentially informative loss; link to claims to confirm continuity where possible.",
      "registry": "Observability = active follow-up status with the abstracted elements; completeness varies by visit schedule. Link to claims for interval utilization and to a death index for censoring; treat between-visit periods as interval-censored.",
      "linked": "Claims enrollment gives true observability windows, EHR adds severity, vital records firm up mortality; but linkage restricts to the linkable subset and creates enrollment/encounter/service date discrepancies that must be reconciled before intersecting any analysis window."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\nimport numpy as np\n\nLOOKBACK_DAYS = 365   # continuous observable time required before index\nGAP_TOLERANCE = 45    # bridge enrollment gaps <= this many days\nRUNOUT_DAYS   = 90    # claims adjudication run-out; do not follow patients into it\n\ndef build_observable_spans(elig: pd.DataFrame) -> pd.DataFrame:\n    \"\"\"Collapse monthly eligibility into continuous spans with BOTH medical+pharmacy\n    benefit, excluding MA-only months, and bridging gaps <= GAP_TOLERANCE days.\"\"\"\n    e = elig[elig[\"medical\"] & elig[\"pharmacy\"] & (~elig[\"ma_only\"])].copy()\n    e[\"start\"] = e[\"elig_month\"].dt.to_timestamp()                       # first day of month\n    e[\"end\"]   = (e[\"elig_month\"] + 1).dt.to_timestamp() - pd.Timedelta(days=1)  # last day\n    e = e.sort_values([\"person_id\", \"start\"])\n\n    # A new span begins when the gap from the prior covered month exceeds the tolerance.\n    prev_end = e.groupby(\"person_id\")[\"end\"].shift()\n    gap = (e[\"start\"] - prev_end).dt.days\n    e[\"new_span\"] = (gap.isna()) | (gap > GAP_TOLERANCE + 1)\n    e[\"span_id\"]  = e.groupby(\"person_id\")[\"new_span\"].cumsum()\n    spans = (e.groupby([\"person_id\", \"span_id\"])\n               .agg(span_start=(\"start\", \"min\"), span_end=(\"end\", \"max\"))\n               .reset_index())\n    return spans\n\ndef build_cohort(elig: pd.DataFrame, rx: pd.DataFrame, study_class: str,\n                 data_end: pd.Timestamp) -> pd.DataFrame:\n    spans = build_observable_spans(elig)\n\n    # Candidate index = first fill of the study drug class.\n    idx = (rx[rx[\"drug_class\"] == study_class]\n             .sort_values([\"person_id\", \"fill_date\"])\n             .groupby(\"person_id\", as_index=False).first()\n             .rename(columns={\"fill_date\": \"index_date\"}))[[\"person_id\", \"index_date\"]]\n\n    # Attach the observable span that CONTAINS the index date.\n    cand = idx.merge(spans, on=\"person_id\")\n   
 cand = cand[(cand[\"span_start\"] <= cand[\"index_date\"]) &\n                (cand[\"span_end\"]   >= cand[\"index_date\"])]\n\n    # Require LOOKBACK_DAYS of continuous observable time before index within that span.\n    cand[\"baseline_start\"] = cand[\"index_date\"] - pd.Timedelta(days=LOOKBACK_DAYS)\n    cand = cand[cand[\"span_start\"] <= cand[\"baseline_start\"]].copy()\n\n    # Follow-up is censored at the end of the observable span, the data run-out, never later.\n    admin_end = data_end - pd.Timedelta(days=RUNOUT_DAYS)\n    cand[\"fup_end\"] = cand[[\"span_end\"]].assign(admin=admin_end).min(axis=1)\n    return cand[[\"person_id\", \"index_date\", \"baseline_start\", \"fup_end\"]]",
        "description": "Build continuous observable-time spans from a monthly eligibility table, then derive each subject's baseline lookback,\ntime zero, and at-risk follow-up restricted to observable time. Required inputs (already cleaned, de-duplicated):\n  elig : monthly enrollment -> person_id, elig_month (period 'M'), medical (bool), pharmacy (bool), ma_only (bool)\n  rx   : pharmacy fills      -> person_id, fill_date (datetime), drug_class (str), days_supply (int)\nReturns one row per eligible new initiator with index_date, baseline_start, and the censoring date implied by the end\nof the observable span. Outcome ascertainment downstream must use [index_date, fup_end] intersected with this span.",
        "dependencies": [
          "pandas",
          "numpy"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\nLOOKBACK_DAYS <- 365L\nGAP_TOLERANCE <- 45L\nRUNOUT_DAYS   <- 90L\n\nbuild_observable_spans <- function(elig) {\n  e <- as.data.table(elig)[medical & pharmacy & !ma_only]\n  e[, start := elig_month]                                  # first day of month\n  e[, end   := seq(elig_month, by = \"month\", length.out = 2)[2] - 1, by = .I]  # last day of month\n  setorder(e, person_id, start)\n  # New span when the gap from the prior covered month exceeds the tolerance.\n  e[, prev_end := shift(end), by = person_id]\n  e[, gap := as.integer(start - prev_end)]\n  e[, new_span := is.na(gap) | gap > GAP_TOLERANCE + 1L]\n  e[, span_id := cumsum(new_span), by = person_id]\n  e[, .(span_start = min(start), span_end = max(end)), by = .(person_id, span_id)]\n}\n\nbuild_cohort <- function(elig, rx, study_class, data_end) {\n  spans <- build_observable_spans(elig)\n  rx <- as.data.table(rx)\n  setorder(rx, person_id, fill_date)\n  idx <- rx[drug_class == study_class, .(index_date = fill_date[1L]), by = person_id]\n\n  cand <- merge(idx, spans, by = \"person_id\", allow.cartesian = TRUE)\n  cand <- cand[span_start <= index_date & span_end >= index_date]            # span containing index\n  cand[, baseline_start := index_date - LOOKBACK_DAYS]\n  cand <- cand[span_start <= baseline_start]                                 # full lookback observable\n  admin_end <- data_end - RUNOUT_DAYS\n  cand[, fup_end := pmin(span_end, admin_end)]                              # censor at span end / run-out\n  cand[, .(person_id, index_date, baseline_start, fup_end)]\n}",
        "description": "Same logic with data.table. Inputs mirror the Python version:\n  elig : person_id, elig_month (Date, first of month), medical (logical), pharmacy (logical), ma_only (logical)\n  rx   : person_id, fill_date (Date), drug_class (character), days_supply (integer)\nReturns one row per eligible new initiator: index_date, baseline_start, fup_end (censoring at end of observable span).",
        "dependencies": [
          "data.table"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "%let lookback = 365;\n%let gaptol   = 45;\n%let runout   = 90;\n%let dataend  = '31DEC2024'd;\n\n/* Keep only months with BOTH benefits and exclude MA-only person-time; build month spans. */\ndata elig_keep;\n  set work.elig;\n  if medical=1 and pharmacy=1 and ma_only=0;\n  mstart = elig_month;\n  mend   = intnx('month', elig_month, 1) - 1;   /* last day of the eligibility month */\n  format mstart mend date9.;\nrun;\nproc sort data=elig_keep; by person_id mstart; run;\n\n/* Collapse months into continuous spans, bridging gaps <= &gaptol days. */\ndata spans;\n  set elig_keep; by person_id mstart;\n  retain span_start span_end;\n  prev_end = lag(mend);\n  if first.person_id then do; span_start=mstart; span_end=mend; end;\n  else if (mstart - prev_end) > (&gaptol + 1) then do;\n    output;                       /* close previous span */\n    span_start=mstart; span_end=mend;\n  end;\n  else span_end = max(span_end, mend);\n  if last.person_id then output;  /* close final span */\n  keep person_id span_start span_end;\n  format span_start span_end date9.;\nrun;\n\n/* Candidate index = first fill of the study drug class. */\nproc sql;\n  create table idx as\n  select person_id, min(fill_date) as index_date format=date9.\n  from work.rx where drug_class = \"STUDY\"\n  group by person_id;\nquit;\n\n/* Index must fall inside an observable span with the full lookback covered; censor at span end / run-out. */\nproc sql;\n  create table cohort as\n  select i.person_id,\n         i.index_date,\n         i.index_date - &lookback                  as baseline_start format=date9.,\n         min(s.span_end, &dataend - &runout)       as fup_end        format=date9.\n  from idx i\n  inner join spans s\n    on i.person_id = s.person_id\n   and s.span_start <= i.index_date\n   and s.span_end   >= i.index_date\n   and s.span_start <= i.index_date - &lookback;   /* full lookback observable within the span */\nquit;",
        "description": "Build continuous observable-time spans and the new-initiator cohort in SAS. Required inputs (post data-management):\n  work.elig : person_id, elig_month (date, first of month), medical (0/1), pharmacy (0/1), ma_only (0/1)\n  work.rx   : person_id, fill_date (date), drug_class (char), days_supply (num)\nMacro vars set the lookback, gap tolerance, run-out, and data-end date. Output work.cohort has one row per eligible new\ninitiator with index_date, baseline_start, and fup_end (follow-up censored at the end of the observable span / run-out).",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": "continuous-enrollment-observable-time-rwe-timeline.svg",
        "mermaid": null,
        "caption": "Patient 1001's enrollment timeline. The orange band is the 45-day coverage gap — any event inside it is silent in the database. The hospitalization on October 20 falls squarely in the gap and will be counted as zero events unless the analyst restricts follow-up to observable spans only.",
        "alt_text": "Horizontal timeline from January 1 to December 31, 2023. A green bar spans January through September 30 labeled 'Enrolled (observable): 273 days'. An orange bar spans October 1 through November 14 labeled 'Coverage gap (unobservable): 45 days'. A second green bar spans November 15 through December 31 labeled 'Re-enrolled (observable): 47 days'. A red marker on October 20 inside the orange bar is labeled 'Hospitalization INVISIBLE'. Below the timeline, a results row reads '320 observable days out of 365 total'.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Elig[Monthly eligibility table<br/>medical / pharmacy / MA flags] --> Drop[Drop MA-only months<br/>require medical AND pharmacy]\n  Drop --> Span[Collapse into continuous spans<br/>bridge gaps <= tolerance]\n  Span --> Look[Full lookback observable?<br/>span_start <= index - 365d]\n  Look -->|yes| T0[Time zero = first qualifying fill]\n  Look -->|no| Excl[Exclude: prior period unobservable<br/>washout cannot be verified]\n  T0 --> Fup[Accrue at-risk time only while span open<br/>censor at span end / death / run-out]\n  Fup --> Asc[Outcome ascertainment within observable time<br/>no record = no event]",
        "caption": "How continuous enrollment turns a raw eligibility table into valid observable time. Only person-time inside a continuous, both-benefit, non-MA-only span supports the assumption that the absence of a record equals the absence of the event.",
        "alt_text": "Flowchart from a monthly eligibility table through dropping MA-only months, collapsing spans with a gap tolerance, checking that the full lookback is observable, setting time zero, accruing at-risk time, and ascertaining outcomes only within observable time.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "gantt\n  title Observable time vs. silent time for one subject (claims)\n  dateFormat YYYY-MM-DD\n  axisFormat %b %Y\n  section Enrollment\n  Continuous medical + pharmacy (observable) :done, obs1, 2023-01-01, 2023-09-30\n  Coverage gap (silent - events MISSED)      :crit, gap, 2023-10-01, 45d\n  Re-enrolled (observable)                   :done, obs2, 2023-11-15, 2024-12-31\n  section Design windows\n  365d lookback / washout (must be observable) :active, lb, 2023-01-01, 2023-12-31\n  Time zero = first qualifying fill            :milestone, t0, 2024-01-01, 0d\n  Follow-up at-risk (observable only)          :active, fu, 2024-01-01, 270d",
        "caption": "Observable vs. silent time. The coverage gap is non-observability; any event in it is misclassified as a non-event. A 365-day lookback ending at time zero is only valid if the whole window is observable; here an early-2023 gap would invalidate it under a strict rule and require bridging or exclusion.",
        "alt_text": "Gantt chart showing a continuous enrollment span, a 45-day coverage gap during which events are missed, re-enrollment, the 365-day lookback window, time zero at the first fill, and the at-risk follow-up window restricted to observable time.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "requires",
        "target_slug": "washout-clean-lookback-period-rwe",
        "notes": "A washout/clean lookback is only valid if the subject was continuously observable across the entire lookback; continuous enrollment is the data-availability precondition the washout assumes."
      },
      {
        "relation_type": "produces",
        "target_slug": "person-time-denominator-construction-rwe",
        "notes": "Observable spans define the at-risk person-time that forms the denominator for rates and the time-at-risk for survival models."
      },
      {
        "relation_type": "used_with",
        "target_slug": "time-zero-index-date-alignment-rwe",
        "notes": "Time zero must fall inside an observable span with the required lookback covered; alignment and enrollment are set together."
      },
      {
        "relation_type": "see_also",
        "target_slug": "immortal-time-bias-handling",
        "notes": "Requiring post-index survival/enrollment to qualify the cohort manufactures immortal time; let enrollment censor, not select, follow-up."
      },
      {
        "relation_type": "see_also",
        "target_slug": "medicare-ffs-ma-commercial-claims-differences-rwe",
        "notes": "MA-only person-time often lacks FFS claims, so it must be excluded from observable time or rates are biased toward the null."
      },
      {
        "relation_type": "see_also",
        "target_slug": "attrition-and-loss-to-follow-up-rwe",
        "notes": "Disenrollment and EHR off-system care are forms of loss to follow-up that end observable time and may be informative."
      },
      {
        "relation_type": "see_also",
        "target_slug": "omop-observation-period-rwe",
        "notes": "In EHR/OMOP data with no enrollment field, the observation period is the inferred analogue of continuous enrollment."
      },
      {
        "relation_type": "part_of",
        "target_slug": "fit-for-purpose-data-assessment-rwe",
        "notes": "Demonstrable continuous-enrollment / observable-time capture is a core element of judging a data source fit for a given question."
      }
    ],
    "aliases": [
      "continuous enrollment",
      "observable time",
      "continuous health plan enrollment",
      "enrollment continuity requirement"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta",
      "pqa-cms"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "cost-benefit",
    "name": "Cost-Benefit Analysis (CBA)",
    "short_definition": "A full economic evaluation that values both the costs and the health (and non-health) consequences of an intervention in the same monetary units, expressing the result as incremental net monetary benefit or a benefit-cost ratio to inform allocation decisions, including across non-health sectors.",
    "long_description": "**Cost-benefit analysis (CBA)** is the only full economic evaluation that places *both* sides of the ledger in money. Costs\n(intervention, downstream medical and pharmacy, and — under a societal perspective — patient time, caregiver burden, and\nproductivity) and consequences (deaths averted, events avoided, symptom-days, QALYs, productivity restored) are each\nconverted to a common monetary scale. The decision rule is then internal to the analysis: an intervention is \"worth it\" if\nits **incremental net monetary benefit (NMB)** — monetized incremental benefit minus incremental cost — is positive, or\nequivalently if its **benefit-cost ratio (BCR)** exceeds 1. This is what separates CBA from cost-effectiveness analysis\n(CEA) and cost-utility analysis (CUA), which keep the outcome in natural units (life-years) or QALYs and import an *external*\nwillingness-to-pay (WTP) threshold to decide. CBA does not borrow a threshold; it bakes the valuation of health into the\nestimate itself.\n\n**Core conceptual distinction.** The CBA estimand is the incremental net monetary benefit of intervention vs comparator,\nNMB = (lambda x deltaE) - deltaC, where deltaE is the incremental effect in natural units, deltaC the incremental cost, and\nlambda the *monetary value assigned to one unit of that effect*. Algebraically this is identical to the net-benefit form of\nCEA — but the meaning of lambda is the crux. In CEA/CUA lambda is a decision-maker's threshold (e.g., a payer's cost-per-QALY\nceiling) and the analyst reports results *across* a range of lambda (the net-benefit / cost-effectiveness acceptability\ncurve). 
In CBA lambda is a *claim about the social value of health itself*, drawn from one of two valuation paradigms that\nmust be pre-specified and never silently mixed: the **human-capital approach** (value = lost market production: wages x days\nlost, plus discounted lifetime earnings for premature mortality) and the **welfarist / willingness-to-pay approach**\n(value = what people will pay to obtain the health gain, from contingent valuation, discrete-choice experiments, or a\nvalue-of-a-statistical-life-year, VSLY). The two can differ by an order of magnitude and embed different ethics; reporting\n\"a CBA\" without stating which is uninterpretable.\n\n**Pros, cons, and trade-offs.**\n- **vs cost-effectiveness analysis (CEA):** CEA's incremental cost-effectiveness ratio (ICER, deltaC/deltaE) is undefined or\n  unstable near deltaE = 0, flips sign across quadrants of the cost-effectiveness plane, and cannot be averaged or summed.\n  CBA's NMB is a single signed number on a linear scale: it has a defined variance, can be regressed, and can be added\n  across programs — which is precisely why budget holders comparing a vaccine program against a road-safety program reach\n  for it. Cost: CBA forces an explicit price on health that CEA sidesteps by leaving lambda to the decision-maker.\n- **vs cost-utility analysis (CUA):** CUA's QALY is itself a partial monetization-substitute — a preference-weighted health\n  unit accepted across HTA bodies (NICE, ICER, CADTH) without anyone \"pricing life.\" CBA can capture benefits the QALY\n  structurally misses (productivity, spillovers to education/justice/environment, process utility, option value), but at the\n  cost of contested valuation and weaker comparability with the existing HTA evidence base. 
**Prefer CUA** for within-health-\n  system reimbursement; **prefer CBA** only when the decision genuinely spans sectors or a monetary ROI is the explicit ask.\n- **vs cost-consequence analysis (CCA):** CCA lists disaggregated costs and each consequence in its own natural unit and\n  lets the decision-maker weight them. It is CBA with the final monetization step removed. CCA is honest when valuation is\n  too contested to defend, but it abdicates the synthesis CBA exists to provide and cannot rank dissimilar programs.\n- **The distributional blind spot (the central CBA failure mode):** because the human-capital approach values a day of a\n  high earner's restored productivity above a low earner's, and WTP rises with income, a naive CBA systematically favors\n  interventions concentrated in wealthier, working-age, employed populations and *understates* benefit to retirees,\n  children, the unemployed, and the disabled. This is not a rounding error — it can reverse a recommendation. CEA/CUA with\n  a fixed lambda per QALY avoid it by construction. Equity weighting, distributional CBA (DCBA), or simply defaulting to\n  CUA are the standard responses.\n\n**When to use.** Multi-sectoral or public-health decisions where the consequences are heterogeneous and a single monetary\nmetric is the only common denominator: vaccination and screening programs, occupational-health and workplace interventions,\ninjury prevention, environmental-health regulation, and any setting where a budget holder explicitly wants net monetary\nreturn or a BCR. 
CBA is also natural when a large share of the benefit is *non-health* (productivity, averted criminal-\njustice or special-education costs) that a QALY cannot carry.\n\n**When NOT to use — and when it is actively misleading or dangerous.**\n- **Standard payer reimbursement / HTA submission.** Most agencies require CEA/CUA with a cost-per-QALY against their own\n  threshold; a CBA monetizing the QALY internally answers a question they did not ask and will be rejected or, worse,\n  silently misread as a cost-per-QALY.\n- **When the population is non-working or income-heterogeneous and human-capital valuation is used.** A CBA of a pediatric,\n  geriatric, or disability intervention valued by lost wages will mechanically conclude the intervention has little benefit.\n  This is the single most dangerous misuse: the method *appears* rigorous while encoding a value judgment that the analyst\n  may not even realize they made.\n- **Double counting.** If patient out-of-pocket spending is already in the cost column and the *same* avoided spending also\n  appears inside a WTP figure (people partly pay to avoid costs they bear), the benefit is counted twice. The same trap\n  catches productivity counted both as an averted cost and as a monetized benefit, and morbidity counted both in QALYs and\n  in absenteeism.\n- **When deltaE itself is biased.** CBA inherits every confounding, immortal-time, and competing-risk problem of the\n  underlying effect estimate; a clean monetization layer on a confounded effect produces a confidently wrong NMB.\n\n**Data-source operational depth.** The cost side of a CBA *is* a healthcare-cost analysis (see healthcare-costs-pppm-pppy-pmpm);\nthe benefit side is an effect estimate (often from an active-comparator new-user cohort) plus an external valuation. 
Each\nsubstrate fails differently.\n- **Claims (FFS vs MA):** The natural source for the cost arm — allowed/paid amounts, place-of-service and medical/pharmacy\n  splits, PPPM/PPPY standardization on *exact* observed person-time, pre-specified outlier rules, and two-part/GLM modeling\n  of the cost distribution. Failure modes: **Medicare Advantage person-time carries no fee-for-service claim lines**, so MA\n  enrollees contribute outcomes but near-zero observed cost — including them deflates the cost arm and inflates NMB; restrict\n  cost estimation to FFS-observable person-time or model the gap. Capitated/bundled arrangements hide unit costs the same\n  way. Productivity is essentially absent from medical claims — at best a short-term-disability or absenteeism feed in\n  employer-sourced data; otherwise it must be *imputed* (clinical improvement -> assumed work-days restored x wage), which\n  is the largest source of CBA uncertainty and belongs in the PSA, not the base case alone.\n- **EHR:** Strong for the clinical effect, severity, and PROs that anchor benefit valuation, but holds charges/RVUs, not\n  paid amounts — do not treat EHR charges as cost. Link to claims for costing. Visit-driven capture means a patient who\n  leaves the system is differentially lost, biasing both the cost and effect arms.\n- **Registry:** Best for validated/adjudicated outcomes feeding the benefit estimate (e.g., cancer stage, MACE); weak for\n  complete cost capture. 
Link to claims for costs and to a death index so that **competing mortality**, which differs by\n  exposure in elderly populations, does not silently truncate downstream cost accrual differently across arms.\n- **Linked claims-EHR-registry-survey:** The ideal CBA substrate — registry/EHR effect + claims cost + a survey arm for\n  WTP/productivity calibration — but linkage selects the linkable subset and creates order/fill/service-date discrepancies\n  that must be reconciled before costs and effects are attributed to the same time window.\n\n**Worked claims-style example.** Question: does an employer-sponsored migraine prophylaxis program (vs no program) deliver\npositive net monetary benefit over 1 year, from a societal perspective, in a commercial + Medicare FFS database? (1)\nCohorts: program initiators and a propensity-matched non-program comparator, each with >=365 days continuous medical +\npharmacy enrollment before `index_date` and excluding MA-only person-time so paid amounts are observed. (2) Cost arm: sum\nall allowed amounts in the 365-day follow-up (medical + pharmacy), standardize to PPPM on exact person-time, apply a\npre-specified high-cost outlier rule, and model with a two-part/gamma GLM; deltaC = adjusted mean cost difference. (3)\nBenefit arm: the effect is migraine-days averted, derived from a comparative model of acute-medication fills (`days_supply`\non triptan/NSAID NDCs) and ED/office visits with a primary migraine diagnosis (first-event coding within a 30-day clean\nwindow) — note this is a *utilization-based proxy* for migraine-days, not a directly observed claims field, so its\nconversion to averted symptom-days is itself an assumption that must be stated. (4) Monetization: value each averted\nmigraine-day by the human-capital approach (mean daily wage x productivity fraction lost per migraine-day from the\nliterature) for the societal perspective, and run a WTP-per-day alternative as a scenario; never add both. 
Because the\nbenefit chain is fills/visits -> imputed migraine-days -> imputed work-days lost -> wage, every link is a PSA parameter,\nnot a fixed base-case input. (5) Net benefit: NMB = (lambda x migraine-days averted) - deltaC, discounting any benefit/cost\nstream beyond 1 year if the horizon is extended. (6) Uncertainty: PSA drawing deltaC from the GLM, the effect from its\nsampling distribution, and lambda from a wide distribution over the wage/WTP value; report the share of simulations with\nNMB > 0 (net-benefit acceptability) and a one-way tornado on lambda and the discount rate. (7) Perspective split: report\npayer (plan-paid only, no productivity) and societal (adds patient liability + productivity) side by side per CHEERS 2022,\nand check explicitly that out-of-pocket spending is not also embedded in the WTP figure.\n\n**Interpreting the output**\n\nThe fall-prevention worked example yields a net benefit of $120,000 and a benefit-cost ratio (BCR) of 1.60: the program costs\n$200,000 and generates $320,000 in monetized benefits (40 averted lost-workday injuries valued at $8,000 each by the human-capital approach),\nso NB = $320,000 − $200,000 = $120,000 and BCR = $320,000 / $200,000 = 1.60.\n\n*(1) Formal interpretation.* A positive net benefit means the total monetized value of the health gains exceeds\nthe total cost of the program: the intervention generates more value than it consumes when health is valued at the\nstated per-injury figure. A BCR above 1 conveys the same information in ratio form: each dollar spent\nreturns $1.60 in monetized benefit. Unlike the ICER, the CBA result is expressed entirely in currency, so the\ndecision rule is simply NB > 0 (or BCR > 1) rather than comparison with a separate threshold. However, the\nresult is only as credible as the unit value used to monetize health outcomes; this is the critical assumption\nthat should be varied in sensitivity analysis.\n\n*(2) Practical interpretation.* The program appears worthwhile at the assumed valuation. 
Analysts and decision-makers\nmust scrutinize what the benefit figure includes: were productivity losses valued at market wages or shadow prices?\nWere patient out-of-pocket savings included and, if so, were they also counted on the cost side? A CBA that\ndouble-counts or uses an implausibly high unit value can make almost any intervention look favorable. Reporting\nsensitivity of NB across a range of unit values (for example, $50,000 to $200,000 per QALY equivalent) is\nessential for transparent decision-making.",
    "primary_category": "Economic_Evaluation",
    "tags": [
      "monetary_valuation",
      "economic-evaluation",
      "net-monetary-benefit",
      "willingness-to-pay",
      "human-capital",
      "benefit-cost-ratio",
      "cost-benefit",
      "societal-perspective"
    ],
    "applies_to_study_types": [
      "cost_benefit"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1001/jama.2016.12195",
        "url": "https://doi.org/10.1001/jama.2016.12195",
        "citation_text": "Sanders GD, Neumann PJ, Basu A, et al. Recommendations for conduct, methodological practices, and reporting of cost-effectiveness analyses: Second Panel on Cost-Effectiveness in Health and Medicine. JAMA. 2016;316(10):1093-1103.",
        "year": 2016,
        "authors_short": "Sanders, Neumann et al.",
        "notes": "The authoritative US methods reference for full economic evaluation; defines the reference-case societal and health-care- sector perspectives, the impact inventory, and the treatment of productivity and monetized consequences that a CBA depends on."
      },
      {
        "role": "explain",
        "doi": "10.1093/oso/9780198529446.001.0001",
        "url": "https://doi.org/10.1093/oso/9780198529446.001.0001",
        "citation_text": "Drummond MF, Sculpher MJ, Torrance GW, O'Brien BJ, Stoddart GL. Methods for the Economic Evaluation of Health Care Programmes. 3rd ed. Oxford University Press; 2005.",
        "year": 2005,
        "authors_short": "Drummond et al.",
        "notes": "Canonical textbook distinguishing CBA from CEA/CUA/CCA and laying out human-capital vs willingness-to-pay monetization and the benefit-cost decision rule."
      },
      {
        "role": "explain",
        "doi": "10.1016/0167-6296(94)00044-5",
        "url": "https://doi.org/10.1016/0167-6296(94)00044-5",
        "citation_text": "Koopmanschap MA, Rutten FFH, van Ineveld BM, van Roijen L. The friction cost method for measuring indirect costs of disease. Journal of Health Economics. 1995;14(2):171-189.",
        "year": 1995,
        "authors_short": "Koopmanschap et al.",
        "notes": "Source of the friction-cost alternative to full human-capital valuation of lost production; central to the productivity- monetization choice in a societal CBA."
      },
      {
        "role": "demonstrate",
        "doi": "10.1016/j.jval.2021.11.1351",
        "url": "https://doi.org/10.1016/j.jval.2021.11.1351",
        "citation_text": "Husereau D, Drummond M, Augustovski F, et al. Consolidated Health Economic Evaluation Reporting Standards 2022 (CHEERS 2022) Statement. Value in Health. 2022;25(1):3-9.",
        "year": 2022,
        "authors_short": "Husereau et al.",
        "notes": "Reporting checklist that governs how perspective, valuation of consequences, discounting, and uncertainty in a CBA (or any full economic evaluation) must be transparently reported."
      }
    ],
    "plain_language_summary": "Cost-benefit analysis (CBA) answers the question: does this program return more money in benefits than it costs? It works by converting every consequence of the program — lives saved, injuries avoided, sick days prevented — into dollar values, then subtracting the program's total cost from those monetized benefits. If the result (called net benefit) is positive, the program is worth the investment; a benefit-cost ratio above 1.0 means every dollar spent generates more than a dollar back. The key difference from cost-effectiveness analysis is that CBA keeps everything in dollars, while cost-effectiveness analysis leaves the health outcome in its natural units (such as events avoided or life-years gained) and never assigns it a dollar value.",
    "key_terms": [
      {
        "term": "net benefit",
        "definition": "The total monetized benefit of a program minus its total cost; a positive number means the program generates more value than it consumes."
      },
      {
        "term": "benefit-cost ratio (BCR)",
        "definition": "Total monetized benefits divided by total costs; a BCR above 1.0 means each dollar spent returns more than one dollar in benefits."
      },
      {
        "term": "monetized benefit",
        "definition": "A health or social outcome (such as an injury avoided) that has been converted into a dollar amount so it can be compared directly with costs."
      },
      {
        "term": "cost-effectiveness analysis",
        "definition": "A related economic method that compares costs against health outcomes left in natural units (for example, dollars per hospitalization avoided) rather than converting those outcomes into dollars."
      }
    ],
    "worked_example": {
      "scenario": "A regional health department spends $200,000 to run a one-year workplace fall-prevention program at ten manufacturing plants. An analyst wants to know whether the program is worth the investment. To answer this using CBA, both the program cost and the benefits (injuries prevented) must be expressed in dollars. The analyst identifies 40 fewer lost-workday injuries in the treated plants compared to matched control plants during the same year. Each averted lost-workday injury is valued at $8,000 using published wage-replacement and medical-cost data (this is the human-capital approach to monetization). The analyst then computes net benefit and the benefit-cost ratio.",
      "dataset": {
        "caption": "Program inputs used to compute net benefit and BCR.",
        "columns": [
          "item",
          "value_dollars",
          "note"
        ],
        "rows": [
          [
            "Program cost (total)",
            200000,
            "Staff, training materials, site visits for 10 plants over 1 year"
          ],
          [
            "Injuries averted (count)",
            40,
            "Difference-in-differences estimate vs matched control plants"
          ],
          [
            "Dollar value per averted injury",
            8000,
            "Wage replacement + averted medical costs, human-capital approach"
          ],
          [
            "Total monetized benefit",
            320000,
            "40 injuries x $8,000 each"
          ]
        ]
      },
      "steps": [
        "Identify total program cost: $200,000.",
        "Count averted injuries: 40 fewer lost-workday injuries in the treated plants versus controls.",
        "Assign a dollar value to each averted injury using the human-capital approach: $8,000 per injury (wage replacement costs plus averted medical spending).",
        "Compute total monetized benefit: 40 injuries x $8,000 = $320,000.",
        "Compute net benefit: $320,000 (benefit) - $200,000 (cost) = $120,000.",
        "Compute benefit-cost ratio (BCR): $320,000 / $200,000 = 1.60.",
        "Note the contrast with cost-effectiveness analysis: a cost-effectiveness analyst would report '$5,000 per injury avoided' (i.e., $200,000 / 40) and let the decision-maker judge whether that cost per event is acceptable; CBA instead assigns each injury a dollar value and collapses everything to a single net-benefit number, so no external judgment about what one injury is worth is needed from the decision-maker."
      ],
      "result": "Net benefit = $120,000 (positive, so the program returns more than it costs). BCR = 1.60, meaning every $1.00 spent on the program returns $1.60 in monetized benefits. The program is worthwhile on both metrics."
    },
    "prerequisites": [
      "cost-effectiveness",
      "icer-net-monetary-benefit-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Human-capital monetization (productivity + premature mortality)",
        "description": "Value health gains by avoided lost market production: wages x days of absenteeism/presenteeism averted, plus discounted lifetime earnings for premature deaths averted. Standard for occupational and public-health CBA where labor-market effects dominate.",
        "edge_cases": [
          "Full lifetime-earnings valuation can vastly exceed the friction-cost alternative (which counts only the time to replace a worker); the choice can flip the sign of NMB.",
          "Mechanically undervalues retirees, children, and the unemployed (zero or low wage), creating equity-reversing bias.",
          "Sensitive to assumed real wage growth and the discount rate applied to future earnings streams."
        ],
        "data_source_notes": "claims: short-term-disability/absenteeism feeds where present (employer data); otherwise impute workdays restored from clinical improvement x external age/sex wage tables. Treat the imputed value as a PSA parameter, not a fixed input."
      },
      {
        "name": "Willingness-to-pay / welfarist monetization",
        "description": "Assign lambda from stated- or revealed-preference WTP for the specific outcome (contingent valuation, discrete-choice experiments) or from a value-of-a-statistical-life-year (VSLY) applied to QALYs/life-years gained.",
        "edge_cases": [
          "WTP estimates are highly sensitive to elicitation method, framing, and respondent income; a single point estimate is rarely defensible without a wide PSA distribution.",
          "Double-counting risk when patient out-of-pocket costs sit in the cost column and are also implicit in the WTP figure."
        ],
        "data_source_notes": "Sourced from the preference-elicitation literature or a primary survey arm, not from claims; calibrate to the study population's income/age and report the WTP source DOI and year in the SAP."
      },
      {
        "name": "Cost-consequence presentation (monetization withheld)",
        "description": "Report disaggregated costs and each consequence in its own natural/monetary unit without forcing a single lambda or net-benefit number; the decision-maker supplies the weights. Preferred when monetization is too contested to defend.",
        "edge_cases": [
          "Forfeits the single-number synthesis that is CBA's reason to exist; cannot rank dissimilar programs."
        ],
        "data_source_notes": "Pair the detailed cost breakdown (PPPM by place-of-service, all-cause vs attributable) from healthcare-costs-pppm-pppy-pmpm with natural-unit outcomes (events avoided, QALYs, life-years) in a transparent table with ranges."
      },
      {
        "name": "Perspective contrast (payer vs societal)",
        "description": "Run the same model under a narrow payer perspective (plan-paid medical + pharmacy only) and the reference-case societal perspective (adds patient out-of-pocket, time, caregiver burden, and productivity). Reporting both a health care sector and a societal reference case is the Second Panel recommendation; CHEERS 2022 requires the perspective to be stated and justified.",
        "edge_cases": [
          "Adding societal cost and benefit elements can reverse the sign of NMB relative to the payer view.",
          "Double-counting if patient OOP is in costs and also embedded in WTP; resolve before reporting."
        ],
        "data_source_notes": "claims: payer = plan-paid amounts on FFS-observable person-time; societal layers patient liability + externally valued productivity/time. State the perspective explicitly and use an impact inventory (Second Panel)."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "cost-effectiveness",
        "pros_of_this": "Yields a single signed net monetary benefit (or BCR) that is additive across programs and well defined even when the incremental effect is near zero, where CEA's ICER becomes unstable; supports cross-sector allocation in one metric.",
        "cons_of_this": "Requires explicit, contested monetization of health; results swing with the human-capital vs WTP choice and can favor high-income, working-age populations.",
        "when_to_prefer": "Multi-sectoral or public-health decisions, or when a budget holder explicitly wants a monetary ROI; CEA is preferred for within-health-system value questions with a stated cost-per-effect threshold."
      },
      {
        "compared_to": "cost-utility",
        "pros_of_this": "Can incorporate non-health benefits (productivity, education/justice spillovers, process utility) that the QALY structurally omits, and avoids reliance on an external QALY threshold.",
        "cons_of_this": "Forfeits the QALY's broad acceptance in HTA and its built-in side-step of explicitly pricing health; weaker comparability with the existing reimbursement evidence base.",
        "when_to_prefer": "CUA for payer/HTA reimbursement decisions in QALY space; CBA only when benefits genuinely span non-health sectors or a net monetary return is the explicit decision criterion."
      },
      {
        "compared_to": "icer-net-monetary-benefit-rwe",
        "pros_of_this": "CBA generalizes the net-benefit calculation to any monetized consequence (events, productivity, QALYs) via an internally justified lambda, rather than reporting NMB across an external threshold range as in CEA/CUA.",
        "cons_of_this": "Must defend the upstream valuation (lambda) as a social value claim; the NMB entry assumes lambda is an exogenous decision-maker threshold, sidestepping that burden.",
        "when_to_prefer": "Use the NMB/CEA net-benefit machinery when lambda is a payer threshold; use CBA when full monetization of diverse benefits is itself the policy-relevant, defensible step."
      },
      {
        "compared_to": "healthcare-costs-pppm-pppy-pmpm",
        "pros_of_this": "CBA consumes the entire cost-measurement, person-time standardization, attribution, outlier, and GLM machinery as its cost arm, then adds the benefit-valuation and net-benefit layer.",
        "cons_of_this": "Cannot proceed without a valid, transparent cost estimate; the cost entry is the prerequisite foundation, not a competitor.",
        "when_to_prefer": "Master the cost measurement first (that entry), then layer CBA valuation and the net-benefit synthesis."
      },
      {
        "compared_to": "probabilistic-sensitivity-analysis-hea-rwe",
        "pros_of_this": "CBA supplies the structural model and the specific high-uncertainty parameters (lambda, productivity imputation, discount rate) that make PSA essential rather than optional in this method.",
        "cons_of_this": "The PSA entry owns the simulation mechanics (parameter distributions, net-benefit acceptability curves); CBA must specify which parameters to vary and how to monetize.",
        "when_to_prefer": "Always embed a CBA in a full PSA and report net-benefit acceptability over lambda; never report a CBA on point estimates alone."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Cost arm follows healthcare-costs-pppm-pppy-pmpm exactly (allowed/paid amounts, all-cause vs attributable by dx/px/drug codes, PPPM/PPPY on exact person-time, place-of-service and medical/pharmacy splits, pre-specified outlier rule, two-part/GLM modeling). Restrict cost estimation to FFS-observable person-time and exclude MA-only spans where paid amounts are missing. Benefit arm links cohort outcomes to external monetary values (WTP per event/QALY or wage tables for productivity); impute productivity from utilization reduction + external data where no disability/absenteeism feed exists, and carry that imputation as a PSA parameter. Report payer and societal perspectives, NMB (or BCR) with a PSA-derived interval, and one-way sensitivity on lambda and the discount rate. Pre-specify valuation sources and perspective in the SAP.",
      "ehr": "Strong for the clinical effect, severity, and PROs that anchor benefit valuation; holds charges/RVUs, not paid amounts, so do not treat EHR charges as cost — link to claims for costing. Treat visit-driven loss to follow-up as potentially informative for both arms.",
      "registry": "Best for validated/adjudicated outcomes feeding the benefit estimate; weak for complete cost capture. Link to claims for costs and to a death index so exposure-differential competing mortality does not truncate cost accrual unevenly across arms.",
      "linked": "Ideal CBA substrate (registry/EHR effect + claims cost + survey arm for WTP/productivity calibration), but linkage selects the linkable subset and creates order/fill/service-date discrepancies that must be reconciled before costs and effects are attributed to the same time window."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\n\ndef cba_net_benefit(arm_summary: pd.DataFrame, valuation: dict) -> dict:\n    s = arm_summary.set_index(\"arm\")\n    # Incremental contrasts: intervention minus comparator.\n    delta_cost   = float(s.loc[\"intervention\", \"mean_cost\"]   - s.loc[\"comparator\", \"mean_cost\"])\n    delta_effect = float(s.loc[\"intervention\", \"mean_effect\"] - s.loc[\"comparator\", \"mean_effect\"])\n\n    lam = float(valuation[\"lambda_per_effect\"])      # human-capital wage value OR WTP/VSLY -- pre-specified, not mixed\n    monetized_benefit = lam * delta_effect            # money value of the incremental health/productivity gain\n\n    # DOUBLE-COUNTING GUARD: do not add a productivity benefit here if that same productivity already nets out\n    # inside delta_cost (e.g. averted disability payments in the cost column). Resolve before calling this function.\n    nmb = monetized_benefit - delta_cost              # incremental net monetary benefit; >0 favors intervention\n    # BCR convention: monetized benefit over *incremental* cost; only interpretable when delta_cost > 0.\n    bcr = monetized_benefit / delta_cost if delta_cost > 0 else float(\"nan\")\n\n    return {\n        \"perspective\": valuation[\"perspective\"],\n        \"delta_cost\": delta_cost,\n        \"delta_effect\": delta_effect,\n        \"monetized_benefit\": monetized_benefit,\n        \"incremental_nmb\": nmb,\n        \"benefit_cost_ratio\": bcr,\n        \"favors_intervention\": nmb > 0,\n    }",
        "description": "Deterministic CBA net-benefit and benefit-cost ratio from two model inputs (no toy data is fabricated inside the analysis;\nsupply these from your cohort + claims pipeline):\n  arm_summary : one row per arm -> arm in {'comparator','intervention'},\n                mean_cost (adjusted mean total cost from the two-part/GLM cost model, FFS-observable person-time only),\n                mean_effect (mean per-patient effect in natural units, scaled so that higher is better, e.g. events\n                avoided or QALYs, from the matched cohort)\n  valuation   : dict with lambda_per_effect (monetary value of one effect unit; human-capital OR WTP, pre-specified) and\n                perspective ('payer' or 'societal')\nReturns deltaC, deltaE, monetized benefit, incremental NMB, and BCR. Call twice (different cost inputs + lambda) to report\npayer and societal perspectives separately. Discount mean_cost/mean_effect upstream if the horizon exceeds one year\n(see discounting-costs-effects-rwe).",
        "dependencies": [
          "pandas"
        ],
        "source_citations": [
          "sanders-2016"
        ],
        "notes": ""
      },
      {
        "lang": "python",
        "code": "import numpy as np\nimport pandas as pd\n\ndef nmb_acceptability(cost_draws, effect_draws, lambda_grid) -> pd.DataFrame:\n    dc = np.asarray(cost_draws, dtype=float)\n    de = np.asarray(effect_draws, dtype=float)\n    if dc.shape != de.shape:\n        # Paired draws are required so the cost-effect correlation is preserved across the PSA.\n        raise ValueError(\"cost_draws and effect_draws must be paired (same shape) to preserve correlation\")\n\n    rows = []\n    for lam in lambda_grid:\n        nmb = lam * de - dc                       # vectorized incremental NMB across all PSA draws\n        rows.append({\n            \"lambda\": float(lam),\n            \"mean_nmb\": float(nmb.mean()),\n            \"p_nmb_positive\": float((nmb > 0).mean()),   # net-benefit acceptability at this lambda\n        })\n    return pd.DataFrame(rows)",
        "description": "Probabilistic CBA: net-benefit acceptability over a range of lambda. This is the headline CBA output because lambda\n(the monetary value of one effect unit) is usually the dominant uncertainty. Plot p_nmb_positive against lambda for the\nnet-benefit acceptability curve. Inputs:\n  cost_draws   : array-like of incremental cost draws (e.g., bootstrap replicates or posterior draws of delta_cost)\n  effect_draws : array-like of incremental effect draws (same length; draw jointly with cost_draws -- e.g. from one\n                 bootstrap of patients -- so the cost-effect correlation is preserved; independent draws misstate the\n                 spread of NMB, in a direction that depends on the sign of that correlation)\n  lambda_grid  : range of monetary values per effect unit to scan (e.g., wage uncertainty band or WTP credible range)\nReturns, for each lambda, the mean incremental NMB and the probability that incremental NMB exceeds zero.",
        "dependencies": [
          "numpy",
          "pandas"
        ],
        "source_citations": [
          "sanders-2016"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "cba_net_benefit <- function(arm_summary, lambda_per_effect, perspective) {\n  rownames(arm_summary) <- arm_summary$arm\n  delta_cost   <- arm_summary[\"intervention\", \"mean_cost\"]   - arm_summary[\"comparator\", \"mean_cost\"]\n  delta_effect <- arm_summary[\"intervention\", \"mean_effect\"] - arm_summary[\"comparator\", \"mean_effect\"]\n\n  monetized_benefit <- lambda_per_effect * delta_effect           # human-capital OR WTP value of the gain (pre-specified)\n  # Double-counting guard: exclude productivity here if it already nets out inside delta_cost.\n  nmb <- monetized_benefit - delta_cost                           # incremental net monetary benefit\n  bcr <- if (delta_cost > 0) monetized_benefit / delta_cost else NA_real_\n\n  list(perspective = perspective, delta_cost = delta_cost, delta_effect = delta_effect,\n       monetized_benefit = monetized_benefit, incremental_nmb = nmb,\n       benefit_cost_ratio = bcr, favors_intervention = nmb > 0)\n}\n\nnmb_acceptability <- function(cost_draws, effect_draws, lambda_grid) {\n  stopifnot(length(cost_draws) == length(effect_draws))           # paired draws preserve cost-effect correlation\n  do.call(rbind, lapply(lambda_grid, function(lam) {\n    nmb <- lam * effect_draws - cost_draws                        # vectorized NMB across PSA draws\n    data.frame(lambda = lam, mean_nmb = mean(nmb),\n               p_nmb_positive = mean(nmb > 0))\n  }))\n}",
        "description": "Deterministic CBA net benefit + BCR and a net-benefit acceptability scan over lambda. For a full probabilistic engine\n(net-benefit acceptability curves and expected value of information directly from joint cost-effect draws) use the BCEA\nor hesim packages; discount both streams (see discounting-costs-effects-rwe) before these functions if the horizon\nexceeds one year. Inputs:\n  arm_summary  : data.frame with arm ('comparator'/'intervention'), mean_cost (adjusted GLM cost), mean_effect (mean\n                 per-patient effect in natural units, scaled so that higher is better)\n  cost_draws, effect_draws : paired numeric vectors of PSA draws of incremental cost and effect (draw jointly, e.g. from a\n                 single bootstrap of patients, to preserve the cost-effect correlation)\n  lambda_grid  : numeric vector of monetary values per effect unit (human-capital or WTP)",
        "dependencies": [],
        "source_citations": [
          "sanders-2016"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Q[Decision question + perspective<br/>payer vs societal] --> COST[Cost arm: claims allowed amounts<br/>PPPM on exact person-time, GLM, outliers]\n  Q --> EFF[Effect arm: deltaE in natural units<br/>events / days / QALYs from matched cohort]\n  EFF --> VAL{Monetization choice<br/>pre-specified, never mixed}\n  VAL -->|human capital| HC[lambda = wage x productivity<br/>+ lifetime earnings for mortality]\n  VAL -->|willingness to pay| WTP[lambda = WTP / VSLY<br/>from CV / DCE / survey]\n  HC --> NB[NMB = lambda x deltaE - deltaC<br/>BCR = monetized benefit / deltaC]\n  WTP --> NB\n  COST --> NB\n  NB --> PSA[PSA over lambda, deltaC, deltaE, discount rate<br/>net-benefit acceptability + tornado]\n  PSA --> DEC[Decision-ready net monetary benefit<br/>reported payer AND societal]",
        "caption": "Operational CBA workflow. The cost arm is a healthcare-cost analysis; the effect arm is a comparative cohort estimate; the monetization choice (human-capital vs WTP) is pre-specified and defines the estimand before the net-benefit synthesis and PSA.",
        "alt_text": "Flowchart from decision question and perspective into a claims-based cost arm and a natural-unit effect arm, through a human-capital-or-WTP monetization decision, into net monetary benefit and benefit-cost ratio, then probabilistic sensitivity analysis and a perspective-stratified decision.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  A[deltaE uncertain<br/>effect estimate] --> NB[NMB distribution]\n  B[deltaC uncertain<br/>GLM cost difference] --> NB\n  C[lambda uncertain<br/>wage / WTP value] --> NB\n  NB --> P[Probability NMB greater than 0 at each lambda<br/>net-benefit acceptability]\n  C -. dominates spread .-> P",
        "caption": "Why PSA is non-optional in CBA. The monetary value of an outcome (lambda) is typically the dominant source of uncertainty, so the headline result is the probability that net monetary benefit is positive, not a single point estimate.",
        "alt_text": "Diagram showing incremental effect, incremental cost, and the monetary value lambda feeding a net-monetary-benefit distribution, with lambda dominating the spread, producing a probability that net benefit exceeds zero.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "alternative_to",
        "target_slug": "cost-effectiveness",
        "notes": "CEA keeps the outcome in natural units and imports an external cost-per-effect threshold; CBA monetizes the outcome internally and reports net monetary benefit. CEA's ICER is unstable near deltaE = 0 where CBA's NMB stays well defined."
      },
      {
        "relation_type": "alternative_to",
        "target_slug": "cost-utility",
        "notes": "CUA uses the QALY and an external cost-per-QALY threshold, avoiding explicit pricing of health; CBA can add non-health benefits the QALY omits but must defend a contested monetary valuation. Prefer CUA for HTA reimbursement."
      },
      {
        "relation_type": "see_also",
        "target_slug": "icer-net-monetary-benefit-rwe",
        "notes": "NMB is algebraically shared, but in CEA/CUA lambda is a decision-maker threshold reported across a range, whereas in CBA lambda is an internally justified social value of the outcome."
      },
      {
        "relation_type": "requires",
        "target_slug": "healthcare-costs-pppm-pppy-pmpm",
        "notes": "The CBA cost arm is a healthcare-cost analysis — PPPM standardization, attribution, outlier handling, and GLM cost modeling are prerequisites for a valid net-benefit estimate."
      },
      {
        "relation_type": "used_with",
        "target_slug": "probabilistic-sensitivity-analysis-hea-rwe",
        "notes": "PSA is essential because lambda, productivity imputation, and the discount rate are highly uncertain; report net-benefit acceptability over lambda rather than a point NMB."
      },
      {
        "relation_type": "used_with",
        "target_slug": "discounting-costs-effects-rwe",
        "notes": "Both the cost stream and the monetized benefit stream must be discounted consistently; differential timing (upfront cost vs later productivity gain) can drive the result."
      }
    ],
    "aliases": [
      "cost benefit",
      "Cost-Benefit Analysis",
      "CBA",
      "cost-benefit analysis",
      "benefit-cost analysis"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "hta",
      "fda"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "cost-effectiveness",
    "name": "Cost-effectiveness Analysis (CEA)",
    "short_definition": "A full economic evaluation that compares two or more interventions on both incremental cost and incremental health effect, summarized as an incremental cost-effectiveness ratio (cost per unit of effect, e.g. cost per life-year or QALY) and interpreted against a willingness-to-pay threshold.",
    "long_description": "**Cost-effectiveness analysis (CEA)** is a *full* economic evaluation: it compares at least two alternatives on both\nthe resources they consume and the health they produce, and expresses the result as an **incremental cost-effectiveness\nratio (ICER)** = (C_index - C_comparator) / (E_index - E_comparator), the extra cost to buy one additional unit of\neffect. When the effect unit is the **QALY**, the analysis is a cost-utility analysis (CUA); when it is a single\nnatural unit (life-years, events avoided), it is CEA in the narrow sense. A result is judged against a\n**willingness-to-pay (WTP) threshold** λ: the intervention is cost-effective if ICER < λ, or equivalently if the\n**net monetary benefit** NMB = λ·ΔE - ΔC > 0. CEA does *not* tell you whether something is affordable — that is a\nbudget-impact question — and it does *not* monetize health (that is cost-benefit analysis).\n\n**Core conceptual / estimand distinction.** The estimand is a *joint* contrast of two correlated quantities — the\ndifference in mean total cost and the difference in mean effect between strategies — under a stated **perspective**\n(payer vs health-system vs societal), **time horizon**, and **discount rate**. Three design forks govern everything\ndownstream. (1) **Trial/cohort-based (\"within-study\") CEA** estimates ΔC and ΔE directly from observed patient-level\ndata over the follow-up window; it carries low structural-assumption burden but is bounded by the data's horizon. (2)\n**Decision-analytic (model-based) CEA** — Markov cohort, partitioned-survival, or discrete-event simulation —\nextrapolates costs and effects to a lifetime horizon using transition probabilities and survival models, trading\ntransparency for the ability to answer the decision-maker's actual (lifetime) question. 
(3) The **ratio vs net-benefit**\nparameterization: ICERs are non-monotone and undefined when ΔE crosses zero (a negative ICER can mean dominant *or*\ndominated), so modern RWE work targets **NMB / net-benefit regression**, a linear quantity that admits covariate\nadjustment and ordinary uncertainty propagation. The decision rule is the same; the statistics behave far better.\n\n**The censored-cost problem (the defining methodological hazard in claims-based CEA).** Naively averaging observed\ntotal cost over patients is **biased downward** under administrative censoring, because patients with shorter observed\nfollow-up accrue less *recorded* cost even though their true cost over the full horizon is unknown. Standard survival\nmethods do **not** fix this: patients accrue cost at different and time-varying rates (spikes near initiation, relapse,\nand death), so censoring that is independent on the time scale becomes informative on the cost scale. Two consistent\nestimators solve it: **Lin's partitioned (interval) estimator** (Lin et al. 1997), which\npartitions follow-up into intervals and weights mean within-interval cost by the Kaplan-Meier survival probability, and\nthe **inverse-probability-of-censoring-weighted (IPCW) estimator** of Bang & Tsiatis (2000), which weights each\nfully observed cost history by the inverse of its probability of remaining uncensored. Either is mandatory whenever a\nnon-trivial fraction of person-time is administratively censored, i.e. essentially every claims study with a fixed data cut.\n\n**Pros, cons, and trade-offs (specific and comparative).**\n- **vs cost-minimization analysis (CMA):** CMA assumes equal effect and compares cost alone. Prefer CMA *only* when\n  equivalence in effect is already established (e.g. a biosimilar with non-inferiority shown); using CMA when effects\n  actually differ silently buries the health trade-off. 
CEA is the default when effects are uncertain or differ.\n- **vs cost-benefit analysis (CBA):** CBA monetizes health (WTP/QALY, friction-cost, human-capital), enabling\n  cross-sector comparison but importing controversial valuation choices. CEA keeps health in natural/QALY units and\n  defers the monetary trade-off to an explicit external threshold — usually preferred for HTA, where λ is set by the\n  payer (e.g. NICE £20k-£30k/QALY).\n- **vs budget-impact analysis (BIA):** BIA answers \"what will it cost the plan over 1-5 years,\" CEA answers \"is it\n  good value per unit of health.\" They are complements, not substitutes; an intervention can be cost-effective yet\n  unaffordable. Report both for coverage decisions.\n- **Within-study CEA vs decision-analytic model:** within-study CEA (Ramsey/ISPOR) is transparent and assumption-light\n  but truncated at the data horizon — fatal for chronic disease where most cost/QALY divergence happens after\n  follow-up ends. Markov / partitioned-survival models extrapolate to lifetime but the answer can be dominated by\n  extrapolation assumptions (parametric survival choice, transition probabilities) rather than the data. Prefer\n  within-study when the horizon captures the relevant divergence (acute conditions, short-course therapy); prefer a\n  model (or a hybrid: within-study costs/effects out to horizon, then extrapolation) for chronic disease.\n\n**When to use.** Comparing two or more interventions for the same indication where both cost and effect plausibly\ndiffer; HTA submissions and payer value dossiers; comparative-value evidence built on a defensible comparative-effect\nestimate (the ΔE should itself come from a sound design — e.g. 
active-comparator new-user with propensity adjustment —\nnot a naive prevalent-user contrast).\n\n**When NOT to use / when it is actively misleading.**\n- **Chronic disease with short claims follow-up and no extrapolation.** A 24-month within-study ICER for a therapy\n  whose survival and cost curves separate over a decade is not \"conservative\" — it is *wrong*, and its direction is\n  unpredictable. Either extrapolate explicitly or do not report an ICER.\n- **Perspective mismatch.** A payer-perspective CEA presented to a societal decision-maker (or vice versa) omits or\n  includes productivity/caregiver costs that can flip the conclusion. State the perspective and match it to the\n  decision-maker.\n- **No utility data → crude QALY proxies.** Claims rarely contain EQ-5D/utilities; mapping from diagnoses to utilities\n  or borrowing trial utilities introduces error that the ICER hides. If utilities are weak, report cost per\n  *clinical* event and flag the QALY as exploratory.\n- **Reporting a raw ICER when ΔE may cross zero.** With statistical noise around a small ΔE, the ICER is unstable and\n  its sign is uninterpretable. Use NMB/CEAC instead; never quote a point ICER without joint uncertainty.\n- **Immortal time / mis-timed cost accrual.** Assigning costs to an exposure arm before exposure begins (procedure\n  studies, post-index drug initiation) inflates one arm. Accrue cost only from a correctly aligned time zero, exactly\n  as for the effect outcome.\n\n**Data-source operational depth.**\n- **Claims (FFS or commercial):** the natural substrate for cost — paid/allowed amounts on medical + pharmacy claims\n  give total cost directly. Use **allowed** (not billed) amounts; standardize to a price year and inflate with a\n  medical-care index; decide all-cause vs disease-attributable cost up front. 
Failure modes: **Medicare Advantage and\n  capitated/bundled person-time lack itemized FFS claims**, so cost is missing-not-zero — restrict to enrollees with\n  full Parts A/B/D (or commercial medical+pharmacy) and exclude MA-only person-time. **Differential administrative\n  censoring by arm** biases naive mean cost (use Lin/Bang-Tsiatis). **Differential competing risk of death by exposure\n  in elderly cohorts** truncates cost accrual unequally — model cost over survival, not over raw person-time.\n- **EHR:** rich on effect/utility proxies (labs, vitals, notes for severity and outcomes) but **weak on cost** — chargemaster\n  or RVU-based costing is a poor stand-in for paid amounts and out-of-network/external care is invisible. Link to\n  claims for cost completeness; treat loss-to-follow-up (patient leaves the system) as informative censoring.\n- **Registry:** strong for adjudicated effect and severity, typically thin on resource use; link to claims for the\n  cost side and to a death index to firm up the survival weights the cost estimators depend on.\n- **Linked claims-EHR-vital-records:** the ideal substrate (EHR severity/utility + claims cost completeness + reliable\n  mortality for survival weighting), but linkage selection and order/fill/service-date discrepancies must be reconciled\n  before aligning cost accrual to time zero.\n\n**Worked claims example (with the censoring correction made visible).** Question: incremental cost per relapse avoided,\nindex biologic vs active comparator, over a 24-month health-system-perspective horizon in a commercial + Medicare FFS database\n(allowed amounts = plan-paid + patient cost-share; for a strict payer perspective sum paid amounts instead).\n(1) **Cohort & time zero:** active-comparator new-user design — first qualifying fill = `index_date`, arm from the\ndispensed NDC, 365-day continuous medical+pharmacy enrollment washout, MA-only person-time excluded. 
(2) **Cost\naccrual:** sum allowed amounts on all medical and pharmacy claims with `service_date` in (`index_date`,\n`index_date`+730], inflated to a common price year; accrue only from time zero. (3) **Effect:** time to first validated\nrelapse from the same time zero, identical outcome definition in both arms. (4) **The trap:** with a fixed data cut,\n~30% of patients are administratively censored before 24 months; their *observed* cost is artificially low, so a naive\narm-mean cost is biased downward and, because censoring differs by arm, the bias does not cancel in ΔC. (5) **Fix:**\nestimate mean 24-month cost per arm with Lin's KM-weighted partitioned estimator (or Bang-Tsiatis IPCW), discount both\ncost and effect at 3% annually, and form ΔC and ΔE on the corrected means. (6) **Decision & uncertainty:** compute the\nICER and NMB at λ; bootstrap the *joint* (ΔC, ΔE) at the patient level (resample with the censoring correction inside\neach replicate) to plot the cost-effectiveness plane and the **cost-effectiveness acceptability curve (CEAC)** across a\nrange of λ; report the probability of being cost-effective, not a bare point ICER.\n\n**Interpreting the output**\n\nThe Drug A vs Drug B worked example returns an ICER of $50,000 per QALY gained: Drug B costs $15,000 more per patient\n($42,000 vs $27,000) and produces 0.30 additional QALYs (0.70 vs 0.40) relative to Drug A, so\n$15,000 / 0.30 = $50,000/QALY.\n\n*(1) Formal interpretation.* The ICER is the ratio of *incremental* costs to *incremental* effects; it is not an\naverage cost per QALY for Drug B in isolation. An ICER of $50,000/QALY means every additional QALY purchased by\nswitching from Drug A to Drug B costs the payer $50,000. Because $50,000 is below the stated willingness-to-pay\nthreshold of $100,000/QALY, Drug B meets the cost-effectiveness criterion. The ICER is plotted on the cost-effectiveness\nplane (x-axis = ΔE, y-axis = ΔC); a northeast-quadrant result with ICER below the threshold line is the standard\ncost-effective finding. 
Dominance occurs when ΔC ≤ 0 and ΔE ≥ 0 with at least one inequality strict: the new intervention is no more\ncostly and at least as effective, and strictly better on at least one of the two, making ICER computation unnecessary.\n\n*(2) Practical interpretation.* In plain language: the plan gains the equivalent of one full year of perfect health for\neach $50,000 spent upgrading from Drug A to Drug B, well within what the plan has decided is worth paying. Two caveats are\nessential. First, the ICER says nothing about affordability: if 50,000 patients are eligible, the two-year budget impact is\n$750 million (50,000 × $15,000), a separate budget-impact question. Second, the ICER is an incremental quantity; quoting\nDrug B's average cost-per-QALY (total cost divided by total QALYs, ignoring what the comparator produces) is a common\nerror that conflates average and marginal efficiency.",
    "primary_category": "Economic_Evaluation",
    "tags": [
      "cost-effectiveness",
      "ICER",
      "net-monetary-benefit",
      "health-economics",
      "cost-utility",
      "censored-costs",
      "CEAC",
      "probabilistic-sensitivity-analysis",
      "HTA"
    ],
    "applies_to_study_types": [
      "cost_effectiveness"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1001/jama.2016.12195",
        "url": "https://doi.org/10.1001/jama.2016.12195",
        "citation_text": "Sanders GD, Neumann PJ, Basu A, et al. Recommendations for conduct, methodological practices, and reporting of cost-effectiveness analyses: Second Panel on Cost-Effectiveness in Health and Medicine. JAMA. 2016;316(10):1093-1103.",
        "year": 2016,
        "authors_short": "Sanders et al. (Second Panel)",
        "notes": "Authoritative reference case for CEA conduct — perspective (impact inventory), discounting, time horizon, and the reporting standard the field defaults to."
      },
      {
        "role": "explain",
        "doi": "10.2307/2533947",
        "url": "https://doi.org/10.2307/2533947",
        "citation_text": "Lin DY, Feuer EJ, Etzioni R, Wax Y. Estimating medical costs from incomplete follow-up data. Biometrics. 1997;53(2):419-434.",
        "year": 1997,
        "authors_short": "Lin et al.",
        "notes": "Establishes that naive mean cost is biased under censoring and gives the Kaplan-Meier-weighted partitioned (interval) estimator — foundational for claims-based cost estimation."
      },
      {
        "role": "explain",
        "doi": "10.1093/biomet/87.2.329",
        "url": "https://doi.org/10.1093/biomet/87.2.329",
        "citation_text": "Bang H, Tsiatis AA. Estimating medical costs with censored data. Biometrika. 2000;87(2):329-343.",
        "year": 2000,
        "authors_short": "Bang & Tsiatis",
        "notes": "Inverse-probability-of-censoring-weighted (IPCW) estimator for mean medical cost — the second canonical fix for censored costs and the basis of most modern implementations."
      },
      {
        "role": "explain",
        "doi": "10.1002/hec.678",
        "url": "https://doi.org/10.1002/hec.678",
        "citation_text": "Hoch JS, Briggs AH, Willan AR. Something old, something new, something borrowed, something blue: a framework for the marriage of health econometrics and cost-effectiveness analysis. Health Economics. 2002;11(5):415-430.",
        "year": 2002,
        "authors_short": "Hoch et al.",
        "notes": "Net-benefit regression — recasts CEA as a regression on net benefit so covariate adjustment and standard inference replace the ill-behaved ICER; the RWE-friendly framing."
      },
      {
        "role": "demonstrate",
        "doi": "10.1016/j.jval.2015.02.001",
        "url": "https://doi.org/10.1016/j.jval.2015.02.001",
        "citation_text": "Ramsey SD, Willke RJ, Glick H, et al. Cost-effectiveness analysis alongside clinical trials II — an ISPOR Good Research Practices Task Force report. Value in Health. 2015;18(2):161-172.",
        "year": 2015,
        "authors_short": "Ramsey et al. (ISPOR)",
        "notes": "Good-practice blueprint for patient-level (within-study) CEA — handling missing/censored cost and effect, sampling uncertainty, and the cost-effectiveness plane/CEAC."
      },
      {
        "role": "demonstrate",
        "doi": "10.1002/hec.985",
        "url": "https://doi.org/10.1002/hec.985",
        "citation_text": "Claxton K, Sculpher M, McCabe C, et al. Probabilistic sensitivity analysis for NICE technology assessment: not an optional extra. Health Economics. 2005;14(4):339-347.",
        "year": 2005,
        "authors_short": "Claxton et al.",
        "notes": "Argues PSA is mandatory for decision uncertainty in HTA and frames the CEAC/expected-value-of-information machinery used to summarize joint cost-effect uncertainty."
      },
      {
        "role": "explain",
        "doi": "10.1016/j.jval.2021.11.1351",
        "url": "https://doi.org/10.1016/j.jval.2021.11.1351",
        "citation_text": "Husereau D, Drummond M, Augustovski F, et al. Consolidated Health Economic Evaluation Reporting Standards 2022 (CHEERS 2022) statement. Value in Health. 2022;25(1):3-9.",
        "year": 2022,
        "authors_short": "Husereau et al.",
        "notes": "The reporting checklist reviewers expect — perspective, horizon, discounting, uncertainty, and structural assumptions must all be stated."
      }
    ],
    "plain_language_summary": "Cost-effectiveness analysis asks a simple question: for every extra unit of health you buy with a new treatment, how much more does it cost compared to the current standard? You measure both the extra cost and the extra health benefit for the new treatment, divide one by the other, and get a single number — the incremental cost-effectiveness ratio, or ICER — in dollars per life-year or dollars per quality-adjusted life-year gained. Decision makers then compare that number to a ceiling price they are willing to pay per unit of health: if the ICER falls below that ceiling, the treatment is considered a good use of money; if it exceeds it, the treatment is too expensive for the benefit it delivers.",
    "key_terms": [
      {
        "term": "QALY (quality-adjusted life-year)",
        "definition": "A single number that combines how long a patient lives with how healthy they feel during that time: one year in perfect health equals 1.0 QALY, while one year with a serious disability might equal 0.5 QALY."
      },
      {
        "term": "ICER (incremental cost-effectiveness ratio)",
        "definition": "The extra cost of a new treatment divided by the extra health it produces compared to the current option — for example, $50,000 per QALY gained."
      },
      {
        "term": "willingness-to-pay threshold",
        "definition": "The maximum a health system or payer is prepared to spend to gain one additional unit of health (such as one QALY); if the ICER is below this ceiling, the treatment is judged cost-effective."
      },
      {
        "term": "incremental cost",
        "definition": "The difference in total cost between the new treatment and the comparator — the extra dollars spent to use the new option."
      },
      {
        "term": "incremental effect",
        "definition": "The difference in health outcome between the new treatment and the comparator — the extra QALYs (or life-years) gained by using the new option."
      }
    ],
    "worked_example": {
      "scenario": "A health plan is deciding whether to cover Drug B, a newer therapy for a chronic condition, instead of Drug A, the current standard of care. An analyst pulls two-year totals from a clinical study: average cost per patient and average QALYs per patient for each treatment arm. The goal is to calculate the ICER and determine whether Drug B is cost-effective at the plan's willingness-to-pay threshold of $100,000 per QALY.",
      "dataset": {
        "caption": "Summary results per patient over a two-year follow-up — the two numbers an analyst would extract from a trial or observational cohort before computing the ICER.",
        "columns": [
          "treatment",
          "mean_total_cost_usd",
          "mean_qalys"
        ],
        "rows": [
          [
            "Drug A (comparator)",
            27000,
            0.4
          ],
          [
            "Drug B (new treatment)",
            42000,
            0.7
          ]
        ]
      },
      "steps": [
        "Compute incremental cost: $42,000 − $27,000 = $15,000. Drug B costs $15,000 more per patient over two years.",
        "Compute incremental effect: 0.70 − 0.40 = 0.30 QALYs. Drug B produces 0.30 additional QALYs per patient.",
        "Compute the ICER: $15,000 ÷ 0.30 QALYs = $50,000 per QALY gained.",
        "Compare to the willingness-to-pay threshold: $50,000 per QALY < $100,000 per QALY threshold."
      ],
      "result": "ICER = $50,000 per QALY gained. Because $50,000 is below the plan's $100,000 per QALY ceiling, Drug B is cost-effective — the plan gets an additional QALY for less than the maximum it is willing to pay."
    },
    "prerequisites": [
      "qaly-utility-mapping-rwe",
      "discounting-costs-effects-rwe",
      "icer-net-monetary-benefit-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Within-study (trial/cohort-based) CEA",
        "description": "Estimate ΔC and ΔE directly from observed patient-level data over the follow-up window, with censored-cost correction and joint bootstrap of (ΔC, ΔE).",
        "edge_cases": [
          "Horizon is bounded by the data; chronic-disease cost/effect divergence beyond follow-up is missed unless extrapolated.",
          "Administrative censoring biases naive mean cost — Lin partitioned or Bang-Tsiatis IPCW is required, not optional."
        ],
        "data_source_notes": "claims: sum allowed amounts by service window from time zero; correct for censoring; standardize to a price year. ehr: cost side usually requires linkage to claims."
      },
      {
        "name": "Decision-analytic (model-based) CEA",
        "description": "Extrapolate costs and effects to a lifetime horizon with a Markov cohort, partitioned-survival, or discrete-event model parameterized from RWE inputs (transition probabilities, survival curves, unit costs, utilities).",
        "edge_cases": [
          "The result can be driven by the parametric survival/extrapolation choice rather than the data — pre-specify and run structural sensitivity analyses.",
          "Double counting cost across mutually exclusive health states if state definitions overlap."
        ],
        "data_source_notes": "RWE feeds the model — survival/transition estimates from cohorts, unit costs from claims, utilities from EHR/registry or mapped from clinical scales."
      },
      {
        "name": "Net-benefit regression",
        "description": "Compute individual net benefit NB_i = λ·E_i - C_i and regress on arm plus covariates, yielding an adjusted incremental NMB with standard errors at each λ.",
        "edge_cases": [
          "Censored cost/effect must still be handled (IPCW within the regression) — the regression does not fix censoring by itself.",
          "Heteroscedastic, right-skewed net benefit may need robust SEs or bootstrap inference."
        ],
        "data_source_notes": "claims/EHR: enables covariate adjustment for residual confounding in the cost-effectiveness contrast, paralleling adjusted effect models."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Cost-minimization analysis (CMA)",
        "pros_of_this": "Does not assume equal effect; quantifies the cost-effect trade-off when outcomes differ or are uncertain.",
        "cons_of_this": "More data and uncertainty machinery required than a cost-only comparison.",
        "when_to_prefer": "Whenever effect equivalence is not already established; reserve CMA for proven non-inferiority/equivalence (e.g. biosimilars)."
      },
      {
        "compared_to": "Cost-benefit analysis (CBA)",
        "pros_of_this": "Keeps health in natural/QALY units and defers the monetary value of health to an explicit external WTP threshold rather than embedding a contested valuation.",
        "cons_of_this": "Cannot compare across sectors or against non-health investments without an external threshold; needs a decision-maker's λ.",
        "when_to_prefer": "HTA and payer value assessment where a threshold is available and monetizing health is undesirable."
      },
      {
        "compared_to": "Budget-impact analysis (BIA)",
        "pros_of_this": "Answers value per unit of health (efficiency), not just total spend.",
        "cons_of_this": "Says nothing about short-run affordability for a specific plan and population.",
        "when_to_prefer": "Use alongside BIA — CEA for value, BIA for affordability; coverage decisions need both."
      },
      {
        "compared_to": "Within-study CEA (no extrapolation)",
        "pros_of_this": "A decision-analytic model can reach the lifetime horizon the decision-maker actually cares about.",
        "cons_of_this": "Adds structural assumptions whose influence can swamp the empirical inputs.",
        "when_to_prefer": "Prefer a model (or hybrid) for chronic disease; prefer within-study when the data horizon captures the relevant cost/effect divergence."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Total cost = sum of allowed (not billed) amounts on medical + pharmacy claims accrued from time zero over the horizon; standardize to a price year and inflate with a medical-care index. Exclude Medicare Advantage / capitated person-time (itemized FFS cost missing-not-zero). Correct naive mean cost for administrative censoring (Lin partitioned or Bang-Tsiatis IPCW); the correction matters most when censoring differs by arm.",
      "ehr": "Strong for effect and utility proxies, weak for cost (chargemaster/RVU is a poor proxy and external care is invisible) — link to claims for the cost side. Treat patients leaving the system as informative censoring.",
      "registry": "Good for adjudicated effect and severity; thin on resource use. Link to claims for cost and to a death index to firm up the survival weights the censored-cost estimators rely on.",
      "linked": "Ideal substrate (EHR severity/utility + claims cost completeness + reliable mortality) but reconcile linkage selection and order/fill/service-date discrepancies before aligning cost accrual to time zero."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\nimport pandas as pd\n\ndef km_censoring_survival(obs_time, uncensored):\n    # KM estimate of G(t)=P(C>t), treating *censoring* as the event of interest.\n    # cens_event = 1 when the observation ends due to censoring (i.e. uncensored == 0).\n    cens_event = 1 - np.asarray(uncensored, dtype=float)\n    t = np.asarray(obs_time, dtype=float)\n    order = np.argsort(t)\n    t_sorted, d_sorted = t[order], cens_event[order]\n    n = len(t_sorted)\n    at_risk = n\n    G, surv = {}, 1.0\n    for i, (ti, di) in enumerate(zip(t_sorted, d_sorted)):\n        if di == 1:                      # a censoring \"event\" drops survival\n            surv *= (1.0 - 1.0 / at_risk)\n        G[ti] = surv\n        at_risk -= 1\n    # step function lookup: G at the largest event time <= queried time\n    times = np.array(sorted(G))\n    vals = np.array([G[x] for x in times])\n    def G_at(q):\n        idx = np.searchsorted(times, q, side=\"right\") - 1\n        return vals[idx] if idx >= 0 else 1.0\n    return np.vectorize(G_at)\n\ndef ipcw_mean_cost(arm_df):\n    # Bang-Tsiatis: average fully-observed costs reweighted by 1/G(event time).\n    G_at = km_censoring_survival(arm_df[\"obs_time\"], arm_df[\"uncensored\"])\n    obs = arm_df[arm_df[\"uncensored\"] == 1]\n    w = 1.0 / np.clip(G_at(obs[\"obs_time\"].to_numpy()), 1e-6, None)\n    return float(np.sum(w * obs[\"cost\"].to_numpy()) / len(arm_df))\n\ndef cea_point(df):\n    c_i = ipcw_mean_cost(df[df.arm == \"index\"])\n    c_c = ipcw_mean_cost(df[df.arm == \"comparator\"])\n    e_i = df.loc[df.arm == \"index\", \"effect\"].mean()\n    e_c = df.loc[df.arm == \"comparator\", \"effect\"].mean()\n    d_cost, d_eff = c_i - c_c, e_i - e_c\n    icer = d_cost / d_eff if d_eff != 0 else np.nan  # undefined / unstable when d_eff ~ 0\n    return d_cost, d_eff, icer\n\ndef cea_bootstrap(df, wtp_grid, n_boot=2000, seed=7):\n    # Resample patients WITHIN arm so the (delta_cost, delta_effect) pair stays 
jointly correlated.\n    rng = np.random.default_rng(seed)\n    idx = df[df.arm == \"index\"].reset_index(drop=True)\n    cmp_ = df[df.arm == \"comparator\"].reset_index(drop=True)\n    dC, dE = np.empty(n_boot), np.empty(n_boot)\n    for b in range(n_boot):\n        bi = idx.iloc[rng.integers(0, len(idx), len(idx))]\n        bc = cmp_.iloc[rng.integers(0, len(cmp_), len(cmp_))]\n        dC[b] = ipcw_mean_cost(bi) - ipcw_mean_cost(bc)\n        dE[b] = bi[\"effect\"].mean() - bc[\"effect\"].mean()\n    # CEAC: P(intervention cost-effective) = P(NMB > 0) at each willingness-to-pay lambda.\n    ceac = {lam: float(np.mean(lam * dE - dC > 0)) for lam in wtp_grid}\n    return dC, dE, ceac\n\nif __name__ == \"__main__\":\n    import sys\n    # Assumption: the one-row-per-patient analytic table (person_id, arm, obs_time, uncensored,\n    # cost, effect) is supplied as a CSV whose path is the first command-line argument.\n    df = pd.read_csv(sys.argv[1])\n    d_cost, d_eff, icer = cea_point(df)\n    dC, dE, ceac = cea_bootstrap(df, wtp_grid=[50_000, 100_000, 150_000])\n    print({\"delta_cost\": d_cost, \"delta_effect\": d_eff, \"icer\": icer,\n           \"nmb_at_100k\": 100_000 * d_eff - d_cost, \"ceac\": ceac})",
        "description": "Patient-level within-study CEA from a claims-style analytic table, with censored-cost correction, ICER/NMB, joint\nbootstrap, and a CEAC. Required input (one row per patient, post data-management; cost already accrued from time zero\nand discounted):\n  df: person_id, arm in {'index','comparator'},\n      obs_time   (years observed = min(true event/horizon, censoring time)),\n      uncensored (1 if cost fully observed to event/horizon, 0 if administratively censored),\n      cost       (observed discounted total cost over obs_time),\n      effect     (discounted effect over horizon, e.g. QALYs or events avoided; NaN where censored if effect needs\n                  the same KM/IPCW treatment as cost)\nUses Bang-Tsiatis IPCW for mean cost: weight each patient's fully-observed cost by 1/Prob(uncensored at their event\ntime), with the censoring survival curve estimated by Kaplan-Meier on the censoring indicator.",
        "dependencies": [
          "pandas",
          "numpy"
        ],
        "source_citations": [
          "bang-tsiatis-2000",
          "ramsey-2015"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(survival)\n\n# Bang-Tsiatis IPCW mean cost: KM on the censoring indicator, then reweight observed costs.\nipcw_mean_cost <- function(d) {\n  # event = censoring, so use (1 - uncensored) as the survfit event.\n  fit <- survfit(Surv(obs_time, 1 - uncensored) ~ 1, data = d)\n  G_at <- function(q) {\n    i <- findInterval(q, fit$time)\n    ifelse(i == 0, 1.0, fit$surv[i])\n  }\n  obs <- d[d$uncensored == 1, ]\n  w <- 1 / pmax(G_at(obs$obs_time), 1e-6)\n  sum(w * obs$cost) / nrow(d)\n}\n\ncea_point <- function(df) {\n  ci <- ipcw_mean_cost(df[df$arm == \"index\", ])\n  cc <- ipcw_mean_cost(df[df$arm == \"comparator\", ])\n  ei <- mean(df$effect[df$arm == \"index\"])\n  ec <- mean(df$effect[df$arm == \"comparator\"])\n  d_cost <- ci - cc; d_eff <- ei - ec\n  list(delta_cost = d_cost, delta_effect = d_eff,\n       icer = if (d_eff != 0) d_cost / d_eff else NA_real_)  # unstable near d_eff = 0\n}\n\ncea_bootstrap <- function(df, wtp_grid, n_boot = 2000) {\n  idx <- df[df$arm == \"index\", ]; cmp <- df[df$arm == \"comparator\", ]\n  dC <- dE <- numeric(n_boot)\n  for (b in seq_len(n_boot)) {\n    bi <- idx[sample(nrow(idx), replace = TRUE), ]   # resample within arm to keep the pair correlated\n    bc <- cmp[sample(nrow(cmp), replace = TRUE), ]\n    dC[b] <- ipcw_mean_cost(bi) - ipcw_mean_cost(bc)\n    dE[b] <- mean(bi$effect) - mean(bc$effect)\n  }\n  # CEAC: probability NMB > 0 across willingness-to-pay thresholds.\n  ceac <- sapply(wtp_grid, function(lam) mean(lam * dE - dC > 0))\n  names(ceac) <- wtp_grid\n  list(dC = dC, dE = dE, ceac = ceac)\n}\n\n# Net-benefit regression (Hoch et al.): adjust the incremental NMB for covariates at a fixed lambda.\nnb_regression <- function(df, lambda, covariates = c(\"age\", \"sex\")) {\n  df$nb <- lambda * df$effect - df$cost\n  f <- reformulate(c(\"arm\", covariates), response = \"nb\")\n  summary(lm(f, data = df))$coefficients  # 'arm' coefficient = adjusted incremental NMB\n}",
        "description": "Within-study CEA in R using survival::survfit for the IPCW censoring weights and a patient-level bootstrap for the\njoint (delta_cost, delta_effect) distribution and CEAC. Input mirrors the Python version:\n  df: data.frame(person_id, arm in {'index','comparator'}, obs_time, uncensored (0/1),\n                 cost (observed discounted), effect (discounted))",
        "dependencies": [
          "survival"
        ],
        "source_citations": [
          "bang-tsiatis-2000",
          "hoch-2002",
          "ramsey-2015"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Q[Decision question + perspective + time horizon + discount rate] --> Fork{Data horizon covers<br/>relevant cost/effect divergence?}\n  Fork -->|Yes acute / short course| WS[Within-study CEA<br/>patient-level dC and dE]\n  Fork -->|No chronic disease| MD[Decision-analytic model<br/>Markov / partitioned survival<br/>extrapolate to lifetime]\n  WS --> Cens[Censored-cost correction<br/>Lin partitioned or Bang-Tsiatis IPCW]\n  MD --> Cens\n  Cens --> Disc[Discount costs and effects]\n  Disc --> Param{Summary measure}\n  Param --> ICER[ICER = dC / dE<br/>caution: undefined near dE=0]\n  Param --> NMB[NMB = lambda*dE - dC<br/>net-benefit regression for adjustment]\n  ICER --> Unc[Joint uncertainty:<br/>bootstrap dC,dE -> CE plane -> CEAC -> EVPI]\n  NMB --> Unc\n  Unc --> Dec[Decision-ready value statement vs WTP threshold]",
        "caption": "CEA decision logic — choose within-study vs model-based by whether the data horizon captures the relevant divergence, correct censored costs, discount, summarize as ICER or NMB, and propagate joint uncertainty.",
        "alt_text": "Flowchart from decision question through the within-study versus decision-analytic fork, censored-cost correction, discounting, ICER/NMB summary, and joint uncertainty to a value statement.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "quadrantChart\n  title Cost-effectiveness plane (incremental cost vs incremental effect)\n  x-axis \"Less effective (dE<0)\" --> \"More effective (dE>0)\"\n  y-axis \"Cost saving (dC<0)\" --> \"More costly (dC>0)\"\n  quadrant-1 \"More costly, more effective: ICER vs WTP\"\n  quadrant-2 \"Dominated: costlier and worse\"\n  quadrant-3 \"Less costly, less effective: ICER vs WTP\"\n  quadrant-4 \"Dominant: cheaper and better\"\n  \"Index vs comparator (point estimate)\": [0.72, 0.70]\n  \"Bootstrap cloud edge\": [0.60, 0.58]",
        "caption": "The cost-effectiveness plane. The northeast (more costly, more effective) and southwest (cheaper, less effective) quadrants require comparison of the ICER to the willingness-to-pay threshold; the southeast quadrant is dominant and the northwest is dominated. Bootstrap replicates of (dE, dC) populate the plane and generate the CEAC.",
        "alt_text": "Four-quadrant cost-effectiveness plane with incremental effect on the x-axis and incremental cost on the y-axis, showing dominant, dominated, and trade-off regions and a point estimate.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "produces",
        "target_slug": "icer-net-monetary-benefit-rwe",
        "notes": "The ICER is the primary summary measure of a CEA; NMB is the linearized, better-behaved reparameterization used for inference and the CEAC."
      },
      {
        "relation_type": "see_also",
        "target_slug": "cost-utility",
        "notes": "Cost-utility analysis is CEA with the QALY as the effect unit; most HTA \"CEA\" submissions are technically CUAs."
      },
      {
        "relation_type": "alternative_to",
        "target_slug": "cost-minimization",
        "notes": "Cost-minimization compares cost alone under an assumption of equal effect — appropriate only when effect equivalence is established."
      },
      {
        "relation_type": "alternative_to",
        "target_slug": "cost-benefit",
        "notes": "Cost-benefit monetizes health and needs no external threshold; CEA keeps health in natural/QALY units and defers valuation to an explicit WTP threshold."
      },
      {
        "relation_type": "see_also",
        "target_slug": "budget-impact",
        "notes": "Budget impact answers affordability over a short horizon; report alongside CEA, which answers value per unit of health."
      },
      {
        "relation_type": "requires",
        "target_slug": "qaly-utility-mapping-rwe",
        "notes": "Cost-utility CEA needs utilities/QALYs; in claims these often come from mapping clinical data, a key source of uncertainty."
      },
      {
        "relation_type": "requires",
        "target_slug": "discounting-costs-effects-rwe",
        "notes": "Both costs and effects beyond year one must be discounted at the reference-case rate."
      },
      {
        "relation_type": "used_with",
        "target_slug": "markov-transition-probabilities-rwe",
        "notes": "Model-based CEA for chronic disease is commonly built as a Markov cohort parameterized from RWE transition probabilities."
      },
      {
        "relation_type": "used_with",
        "target_slug": "partitioned-survival-models-rwe",
        "notes": "Partitioned-survival models are a common oncology CEA structure feeding lifetime costs and effects."
      },
      {
        "relation_type": "used_with",
        "target_slug": "survival-extrapolation-hta-rwe",
        "notes": "Lifetime CEA requires extrapolating observed survival beyond the data horizon; the parametric choice drives the result."
      },
      {
        "relation_type": "requires",
        "target_slug": "healthcare-costs-pppm-pppy-pmpm",
        "notes": "Standardized per-member cost metrics from claims are a core input to the cost side of CEA."
      },
      {
        "relation_type": "see_also",
        "target_slug": "all-cause-vs-attributable-costs-rwe",
        "notes": "The all-cause vs disease-attributable cost decision must be made up front and stated, as it changes ΔC."
      },
      {
        "relation_type": "see_also",
        "target_slug": "cost-outlier-handling-rwe",
        "notes": "Right-skewed, outlier-heavy cost distributions require explicit handling before mean-cost contrasts."
      },
      {
        "relation_type": "used_with",
        "target_slug": "active-comparator-new-user",
        "notes": "A defensible CEA contrast rests on a defensible comparative-effect estimate; the active-comparator new-user design is the usual basis for the ΔE feeding the ICER."
      }
    ],
    "aliases": [
      "cost effectiveness",
      "cost-effectiveness analysis",
      "Cost-Effectiveness Analysis",
      "CEA",
      "cost per QALY analysis"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "cost-minimization",
    "name": "Cost-minimization Analysis (CMA)",
    "short_definition": "A full economic evaluation that compares only the costs of two or more interventions under the prior assumption that their health outcomes are equivalent, so the preferred option is simply the cheapest.",
    "long_description": "**Cost-minimization analysis (CMA)** is the special case of a comparative economic evaluation in\nwhich the analyst has already concluded that the interventions produce *equivalent* health outcomes,\nso the only quantity that distinguishes them is cost. The decision rule collapses to \"choose the\nlowest-cost option.\" CMA is therefore not a method for *establishing* equivalence; it is a costing\nexercise that is *licensed* by an equivalence conclusion reached elsewhere (a non-inferiority or\nequivalence trial, a comparative-effectiveness study, or a network meta-analysis). The substantive\noutput is the difference in total per-patient cost over a defined time horizon and perspective,\nwith its uncertainty — not an incremental cost-effectiveness ratio (ICER) and not a QALY.\n\n**Core conceptual / estimand distinction.** The estimand in CMA is the *difference in mean total\ncost per patient* between arms (ΔCost), expressed for a stated perspective (payer, health-system,\nsocietal) and horizon. This contrasts sharply with cost-effectiveness analysis (CEA: ΔCost per\nnatural unit of effect) and cost-utility analysis (CUA: ΔCost per QALY), where outcomes are\nexplicitly traded against cost in an ICER. CMA replaces that trade-off with a *binary equivalence\nassumption on effects*. The critical, often-missed point — formalized by Briggs & O'Brien — is that\noutcomes are estimated with uncertainty, so even when a trial *fails to reject* the null of no\ndifference in effect, that is not proof of equivalence; the cost-effectiveness plane should still be\ncharacterized jointly, because a non-significant effect difference combined with a cost difference\ncan still leave a wide, decision-relevant ICER distribution. 
CMA is defensible only when the\nequivalence margin is pre-specified and *met*, not merely when a superiority test is non-significant.\n\n**Pros, cons, and trade-offs.**\n- **vs cost-effectiveness / cost-utility analysis (CEA/CUA):** CMA is far simpler to compute and\n  communicate (no utility elicitation, no QALY modeling, no ICER/CEAC), and it sidesteps the\n  contested machinery of mapping outcomes to QALYs. Cost: it is valid *only* under demonstrated\n  equivalence; if effects in fact differ, CMA gives the wrong answer by construction (it can crown\n  the cheaper-but-worse option). **Prefer CMA** only after a credible equivalence/non-inferiority\n  result; otherwise run CEA/CUA and report the full joint distribution of cost and effect.\n- **vs reporting the full ICER with effects set near zero:** A modern critique (Briggs & O'Brien;\n  Dakin & Wordsworth) is that you should almost always carry the (uncertain) effect difference\n  through to the cost-effectiveness plane rather than assume it away, because doing so propagates\n  second-order uncertainty and avoids overconfident \"equivalent therefore cheapest\" claims. **Prefer\n  the joint CEA framing** when effect estimates are uncertain or the equivalence margin is soft; the\n  pure-CMA shortcut is justified mainly when equivalence is genuinely strong (e.g., bioequivalent\n  formulations, generic vs brand, identical molecule different setting of administration).\n- **vs budget-impact analysis (BIA):** CMA answers \"which interchangeable option is cheaper per\n  patient\" (efficiency-style, per-patient, often longer horizon); BIA answers \"what is the total\n  affordability impact on a given budget over 1–5 years\" (population-level, undiscounted, includes\n  uptake/market share). 
They are complementary, not substitutes: a CMA-favored switch can still have\n  a large positive budget impact at scale, and vice versa.\n- **vs cost-of-illness / burden studies:** CMA is comparative (two+ interventions); cost-of-illness\n  is descriptive (the economic burden of a condition). Do not conflate a one-arm cost description\n  with a CMA.\n\n**When to use.** Two (or more) interventions that are clinically interchangeable for the same\nindication and population, where a pre-specified equivalence or non-inferiority margin has been\n*met*: generic vs brand of the same molecule; biosimilar vs reference biologic after equivalence is\nestablished; the same drug delivered in a lower-cost setting (home infusion vs hospital infusion);\noral vs IV formulation of the same agent shown therapeutically equivalent. In real-world data, CMA\nis most credible when the equivalence claim rests on robust comparative-effectiveness evidence and\nthe only open question is which delivery/sourcing pathway costs the payer less.\n\n**When NOT to use — and when it is actively misleading or dangerous.**\n- **Equivalence has not been demonstrated; you only have a non-significant superiority test.** This\n  is the cardinal error. A wide confidence interval around a null effect difference does *not* license\n  CMA; choosing the cheaper option then risks adopting an inferior therapy. Decision rule: require a\n  *pre-specified equivalence/non-inferiority margin that the data clear*, or fall back to CEA/CUA.\n- **The interventions plausibly differ on a relevant outcome** (efficacy, key safety event, quality\n  of life, durability). Equivalence on the primary endpoint does not imply equivalence on the outcome\n  that drives value; suppressing the effect difference biases the decision.\n- **Different time horizons of benefit or downstream cost offsets** (e.g., a costlier drug that\n  reduces later hospitalizations). 
A pure same-window cost comparison misses the offset; this is a\n  CEA/CUA or full cost-consequence problem.\n- **Heterogeneous populations where equivalence holds on average but not in a key subgroup.** CMA on\n  the pooled population can be misleading; pre-specify subgroup equivalence.\n- **As a substitute for affordability analysis.** \"Cheaper per patient\" is not \"affordable in\n  aggregate\" — use BIA for budget questions.\n\n**Data-source operational depth.** CMA depends almost entirely on getting *comparable, complete*\ncost (resource-use × unit-cost) for each arm; the failure modes are the failure modes of real-world\ncosting.\n- **Claims (FFS):** Costs are usually proxied by *allowed amounts* (plan-paid + patient\n  cost-sharing), which approximate transaction prices, not provider cost or charges. Build total cost\n  over a fixed, equal observation window for both arms (e.g., 12 months post-index) and require\n  *continuous enrollment* across that window so person-time is fully observed. Failure modes:\n  (1) **Medicare Advantage / capitated person-time lacks itemized FFS claims**, so MA-only enrollees\n  have systematically incomplete encounter-level cost — restrict to FFS (Parts A/B/D or commercial\n  fee-for-service) and exclude MA-only person-time, or costs are differentially censored by arm.\n  (2) **Differential censoring / competing risk of death** truncates accrued cost; in elderly claims\n  populations one arm may have more deaths, so naive mean cost over observed time understates cost in\n  the higher-mortality arm — use cost over a common horizon with inverse-probability-of-censoring\n  weighting (the Bang–Tsiatis estimator) rather than complete-case means. 
(3) **Drug cost from claims\n  excludes manufacturer rebates**, so allowed-amount drug spend overstates net payer cost; for\n  brand-vs-generic CMA the rebate gap is decision-determining — use net-of-rebate prices where\n  available or flag the limitation.\n- **EHR:** Captures resource *use* (orders, administrations, LOS) but rarely true cost; you must\n  apply external unit costs or a cost-to-charge ratio, and capture is *visit-driven* — care delivered\n  outside the system is invisible, so a patient who seeks costly care elsewhere looks artificially\n  cheap. Differential out-of-system leakage by arm biases CMA. Prefer claims or linked data for the\n  cost layer.\n- **Registry:** Good for the equivalence/effectiveness side (adjudicated outcomes, severity) but\n  typically thin on itemized cost; link to claims for the cost component and to a death index to\n  handle the competing risk.\n- **Linked claims–EHR:** Best substrate — EHR confirms the equivalence-relevant clinical detail and\n  claims supply complete cost — but linkage selection (only the linkable subset) and date\n  reconciliation across order/fill/service dates must be handled before windowing cost.\n- **Immortal-time trap in procedure/switch studies:** If one arm is defined by surviving long enough\n  to undergo a procedure or switch, the time before that event is \"immortal\" and accrues cost\n  differently; align time zero at the comparable decision point in both arms.\n\n**Worked claims example.** Question: per-patient 12-month total cost of *generic vs brand* of the\nsame molecule among adults initiating chronic therapy, where bioequivalence (and thus outcome\nequivalence) is already accepted, so CMA is licensed. (1) **Cohort:** new initiators with a first\nqualifying pharmacy fill (`fill_date` = index_date); require a 365-day drug-free washout (no prior\nfill of the molecule, generic or brand) and 365 days of continuous FFS medical+pharmacy enrollment\nbefore index so prior use is truly absent. 
Assign arm from the formulation (`product_type` ∈\n{generic, brand}) of the index NDC. (2) **Observation window:** day 0 through day 365 post-index;\nrequire continuous FFS enrollment through the window or apply IPCW for those censored by\ndisenrollment/death so accrued cost reflects a *common* horizon. Exclude MA-only person-time\n(no itemized claims). (3) **Cost build:** sum allowed amounts across all claims (pharmacy + medical)\nin the window; for the index drug use *net-of-rebate* price if available. For generics rebates are\nnegligible, for brand they are material, and ignoring them is the dominant bias here. (4) **Effects:**\n*not estimated*; equivalence is assumed from bioequivalence, so document that assumption explicitly as\nthe warrant for CMA. (5) **Comparison:** difference in mean 12-month total cost (ΔCost) with a\nnonparametric bootstrap CI; adjust for baseline confounders (age, comorbidity, prior cost) via a\ncost regression or IPTW since real-world arms are not randomized. (6) **Sensitivity:** vary the\nrebate assumption, the horizon (6/12/24 months), the censoring handling (complete-case vs IPCW), and\nreport a budget-impact companion since \"cheaper per patient\" does not equal \"affordable at scale.\"\n\n**Interpreting the output**\n\nThe entry's structured worked example (brand IV vs generic oral of the same molecule, a separate\nillustration from the claims example above) finds that brand IV totals $1,700/patient/year and generic oral\ntotals $480/patient/year, a cost difference of $1,220/patient/year in favor of the generic oral formulation.\n\n*(1) Formal interpretation.* The CMA result is a cost difference, not an ICER. Under the prior assumption\nof demonstrated outcome equivalence, no effectiveness ratio needs to be computed: the cheaper option is preferred\nby definition. Here the generic oral saves $1,220 per patient per year. This conclusion is entirely conditional\non the equivalence assumption being valid: if the two treatments are not truly equivalent on effectiveness or\nsafety, the CMA is methodologically inappropriate and the cost difference is uninterpretable as a decision metric.\n\n*(2) Practical interpretation.* The generic oral is the preferred option at $1,220 less per patient per year,\nbut only if the equivalence evidence holds up. Analysts should explicitly cite the source of the equivalence\nclaim (a bioequivalence trial, a non-inferiority study, a head-to-head RCT) and note its confidence interval.\nA non-inferiority margin of, say, 5% in response rate is not the same as proven equivalence; if the true\ndifference lies anywhere in that margin the cost savings from the CMA could come at an unacknowledged health cost.\nSensitivity analyses should explore the cost difference under plausible ranges of rebate and treatment-duration\nassumptions, and a budget-impact companion should accompany the per-patient result.",
    "primary_category": "Health_Economic",
    "tags": [
      "cost-minimization",
      "economic-evaluation",
      "equivalence",
      "non-inferiority",
      "comparative-costs",
      "health-economics",
      "net-of-rebate",
      "budget"
    ],
    "applies_to_study_types": [
      "cost_minimization"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1002/hec.584",
        "url": "https://doi.org/10.1002/hec.584",
        "citation_text": "Briggs AH, O'Brien BJ. The death of cost-minimization analysis? Health Economics. 2001;10(2):179-184.",
        "year": 2001,
        "authors_short": "Briggs & O'Brien",
        "notes": "Canonical critique that defines CMA and argues a non-significant effect difference does not justify ignoring outcome uncertainty; the standard starting point for understanding when CMA is and is not appropriate."
      },
      {
        "role": "explain",
        "doi": "10.1002/hec.1812",
        "url": "https://doi.org/10.1002/hec.1812",
        "citation_text": "Dakin H, Wordsworth S. Cost-minimisation analysis versus cost-effectiveness analysis, revisited. Health Economics. 2013;22(1):22-34.",
        "year": 2013,
        "authors_short": "Dakin & Wordsworth",
        "notes": "Revisits the conditions under which a cost-minimization framing is legitimate and shows when the equivalence assumption is unsafe relative to a full CEA on the cost-effectiveness plane."
      },
      {
        "role": "use",
        "doi": "10.1016/j.jval.2013.02.002",
        "url": "https://doi.org/10.1016/j.jval.2013.02.002",
        "citation_text": "Husereau D, Drummond M, Petrou S, et al. Consolidated Health Economic Evaluation Reporting Standards (CHEERS) - Explanation and Elaboration: A Report of the ISPOR Health Economic Evaluation Publication Guidelines Good Reporting Practices Task Force. Value in Health. 2013;16(2):231-250.",
        "year": 2013,
        "authors_short": "Husereau et al.",
        "notes": "ISPOR reporting standard governing how an economic evaluation (including a CMA) should state perspective, horizon, costing methods, and the basis for any equivalence assumption."
      }
    ],
    "plain_language_summary": "Cost-minimization analysis is what you do when two treatments have already been shown to work equally well for the same patients: since the health results are a tie, you only compare what each one costs and pick the cheaper one. You add up every dollar each treatment requires per patient (the drug itself, giving it, and any monitoring it needs), then choose the lower total. The whole approach only makes sense if the 'works equally well' claim is genuinely true, so it has to come from solid earlier evidence, not just a study that failed to find a difference. If that equal-outcomes assumption is wrong, this method can hand you the cheaper-but-worse option.",
    "key_terms": [
      {
        "term": "equivalence assumption",
        "definition": "The pre-condition for this method: a prior conclusion that the two treatments produce the same health outcomes, so cost is the only thing left to compare."
      },
      {
        "term": "perspective",
        "definition": "Whose costs you count (for example the insurer/payer, the whole health system, or society), which must be fixed before adding anything up."
      },
      {
        "term": "time horizon",
        "definition": "The fixed length of time over which you total each treatment's costs, kept identical for both options so the comparison is fair."
      },
      {
        "term": "non-inferiority",
        "definition": "A study result showing a treatment is not meaningfully worse than its comparator by a margin set in advance; a strong such result is the kind of evidence that can justify assuming equal outcomes."
      },
      {
        "term": "net-of-rebate price",
        "definition": "A drug's price after manufacturer discounts the payer actually receives, which can be much lower than the list price and is the figure that should drive a brand-versus-generic cost comparison."
      }
    ],
    "worked_example": {
      "scenario": "An oral generic and an intravenous (IV) brand version of the same molecule have already been shown, in a prior non-inferiority study, to produce equivalent health outcomes for adults treating the same chronic condition. Because outcomes are a tie, we only need to compare cost per patient over a fixed 12-month horizon from a payer perspective. We assume equivalent outcomes here — that assumption is the precondition that licenses cost-minimization; without it this method does not apply.",
      "dataset": {
        "caption": "A small per-patient cost table an analyst would build for each arm over the 12-month horizon, broken into drug, administration, and monitoring components (payer perspective, net-of-rebate drug price).",
        "columns": [
          "arm",
          "cost_component",
          "annual_cost_usd"
        ],
        "rows": [
          [
            "brand_IV",
            "drug (net-of-rebate)",
            1200
          ],
          [
            "brand_IV",
            "administration (infusion visits)",
            320
          ],
          [
            "brand_IV",
            "monitoring (labs)",
            180
          ],
          [
            "generic_oral",
            "drug (net-of-rebate)",
            300
          ],
          [
            "generic_oral",
            "administration (self-administered)",
            0
          ],
          [
            "generic_oral",
            "monitoring (labs)",
            180
          ]
        ]
      },
      "steps": [
        "First confirm the warrant: outcomes are assumed equivalent from the prior non-inferiority study, so we are allowed to compare cost alone.",
        "Total cost per patient for the brand IV arm = drug 1200 + administration 320 + monitoring 180 = 1700 USD.",
        "Total cost per patient for the generic oral arm = drug 300 + administration 0 + monitoring 180 = 480 USD.",
        "The saving from choosing the cheaper arm = 1700 - 480 = 1220 USD per patient over the 12-month horizon."
      ],
      "result": "Brand IV total = 1700 USD/patient; generic oral total = 480 USD/patient. The generic oral option is cheaper by 1220 USD per patient per year, so under the assumed equal outcomes it is the preferred choice. Danger: if the equivalence assumption is actually false — say the IV version really does work better, or the generic differs on a safety or adherence outcome — then this method has crowned the cheaper-but-inferior option, because it never re-examines the outcomes once they are assumed equal.",
      "prerequisites": [
        "cost-effectiveness",
        "cer-observational"
      ]
    },
    "prerequisites": [],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Within-trial CMA (equivalence established in the same study)",
        "description": "Costs and the equivalence/non-inferiority outcome come from the same comparative study (RCT or comparative RWE), so the equivalence margin and the cost difference are estimated on the same population and horizon.",
        "edge_cases": [
          "A non-significant effect difference is not equivalence; the pre-specified margin must actually be met before CMA is licensed.",
          "Trial horizon may be too short to capture downstream cost offsets."
        ],
        "data_source_notes": "claims/linked: derive cost from allowed amounts over the trial window; confirm the equivalence endpoint from the clinical layer (EHR/registry) when linked."
      },
      {
        "name": "Decision-analytic / modeled CMA",
        "description": "Equivalence is taken from external evidence (NMA, prior trial) and per-patient costs are modeled over a longer horizon with resource-use and unit-cost inputs.",
        "edge_cases": [
          "Imported equivalence may not transport to the modeled population (effect-modification across settings).",
          "Discounting and horizon choices can flip the cheaper arm."
        ],
        "data_source_notes": "unit costs from fee schedules / micro-costing; resource use from claims or EHR; discount future costs per the reference-case rate."
      },
      {
        "name": "Net-of-rebate CMA (drug sourcing comparisons)",
        "description": "For generic-vs-brand or biosimilar-vs-reference comparisons, the cost difference hinges on net (post-rebate) drug price rather than list or allowed amount.",
        "edge_cases": [
          "Claims allowed amounts exclude confidential rebates, biasing brand cost upward.",
          "Patient cost-sharing differs from plan net cost; choose the perspective explicitly."
        ],
        "data_source_notes": "claims: substitute net-of-rebate WAC/ASP-based prices for the study drug; keep medical costs at allowed amounts."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Cost-effectiveness / cost-utility analysis (CEA/CUA)",
        "pros_of_this": "Simpler to compute and communicate; no QALY/utility modeling or ICER required; avoids contested outcome-to-QALY mapping.",
        "cons_of_this": "Valid only under demonstrated equivalence; if effects truly differ, CMA can select the cheaper-but-inferior option, and it discards outcome uncertainty.",
        "when_to_prefer": "Clinically interchangeable options where a pre-specified equivalence/non-inferiority margin has actually been met (e.g., generic vs brand, biosimilar vs reference)."
      },
      {
        "compared_to": "Reporting the full ICER on the cost-effectiveness plane",
        "pros_of_this": "Single, decision-ready cost difference; no need to characterize a joint cost-effect distribution.",
        "cons_of_this": "Ignores second-order uncertainty in the effect difference; can be overconfident when equivalence is soft (Briggs & O'Brien critique).",
        "when_to_prefer": "Equivalence is strong and uncontested; otherwise carry the uncertain effect through to the plane."
      },
      {
        "compared_to": "Budget-impact analysis (BIA)",
        "pros_of_this": "Per-patient efficiency-style comparison over a clinical horizon; isolates which option is cheaper per treated patient.",
        "cons_of_this": "Says nothing about aggregate affordability, uptake, or market share over a budget cycle.",
        "when_to_prefer": "The question is \"which interchangeable option is cheaper per patient,\" not \"what is the total spend impact on the plan.\""
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Proxy cost with allowed amounts over a fixed, equal post-index window with continuous FFS enrollment; exclude MA-only person-time (no itemized claims); handle death/disenrollment censoring with IPCW (Bang-Tsiatis) over a common horizon; use net-of-rebate prices for the study drug in sourcing comparisons.",
      "ehr": "Captures resource use, not true cost - apply external unit costs or cost-to-charge ratios; visit-driven capture means out-of-system care (differential by arm) is invisible; prefer claims or linked data for the cost layer.",
      "registry": "Strong for the equivalence/outcome side, thin on itemized cost; link to claims for costs and to a death index to handle the competing risk.",
      "linked": "Best substrate (clinical equivalence detail + complete cost) but reconcile linkage selection and order/fill/service date discrepancies before windowing cost."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\nimport pandas as pd\nimport statsmodels.formula.api as smf\n\ndef cma_cost_difference(cohort: pd.DataFrame, n_boot: int = 2000, seed: int = 1) -> dict:\n    # Only fully observed 12-month person-time (else use IPCW-weighted cost upstream).\n    df = cohort[cohort[\"fully_enrolled_12m\"]].copy()\n    df[\"brand\"] = (df[\"arm\"] == \"BRAND\").astype(int)\n\n    # Confounder-adjusted mean cost per arm: arms are NOT randomized in RWD.\n    # GLM with log link + Gamma family handles right-skewed, positive cost.\n    model = smf.glm(\n        \"total_cost_12m ~ brand + age + comorbidity_score + prior_cost_12m\",\n        data=df, family=__import__(\"statsmodels.api\", fromlist=[\"families\"]).families.Gamma(\n            link=__import__(\"statsmodels.api\", fromlist=[\"families\"]).families.links.Log()),\n    ).fit()\n\n    # Recycled-prediction (g-computation) ATE: predict everyone as brand vs as generic.\n    def adjusted_delta(fit, data):\n        d0 = data.assign(brand=0); d1 = data.assign(brand=1)\n        return fit.predict(d1).mean() - fit.predict(d0).mean()\n\n    delta = adjusted_delta(model, df)\n\n    rng = np.random.default_rng(seed)\n    boots = np.empty(n_boot)\n    for b in range(n_boot):\n        s = df.sample(frac=1.0, replace=True, random_state=int(rng.integers(1e9)))\n        fit_b = smf.glm(\n            \"total_cost_12m ~ brand + age + comorbidity_score + prior_cost_12m\",\n            data=s, family=model.family).fit()\n        boots[b] = adjusted_delta(fit_b, s)\n\n    lo, hi = np.percentile(boots, [2.5, 97.5])\n    return {\"delta_cost_brand_minus_generic\": float(delta),\n            \"ci95\": (float(lo), float(hi)),\n            \"decision\": \"choose GENERIC (lower cost, equivalent effect)\" if delta > 0\n                        else \"choose BRAND\"}",
        "description": "CMA on claims-style inputs: difference in mean 12-month total cost per patient under an assumed\nequivalence of effects, with confounder-adjusted means and a bootstrap CI. Required inputs\n(already cleaned, one row per eligible new initiator over a common, fully-enrolled window):\n  cohort : person_id, arm in {'GENERIC','BRAND'}, index_date, fully_enrolled_12m (bool),\n           total_cost_12m (float, allowed amounts incl. net-of-rebate study drug),\n           age, comorbidity_score, prior_cost_12m   # baseline confounders\nEquivalence of outcomes is assumed from external bioequivalence evidence (the warrant for CMA);\nno effect is estimated here. Restrict to fully observed person-time to avoid differential\ncensoring, or substitute IPCW-weighted costs upstream.",
        "dependencies": [
          "pandas",
          "numpy",
          "statsmodels"
        ],
        "source_citations": [
          "briggs-2001"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(boot)\n\ncma_delta <- function(cohort) {\n  df <- subset(cohort, fully_enrolled_12m)\n  df$brand <- as.integer(df$arm == \"BRAND\")\n\n  adj_delta <- function(data, idx) {\n    d <- data[idx, , drop = FALSE]\n    fit <- glm(total_cost_12m ~ brand + age + comorbidity_score + prior_cost_12m,\n               data = d, family = Gamma(link = \"log\"))\n    d1 <- d; d1$brand <- 1L\n    d0 <- d; d0$brand <- 0L\n    mean(predict(fit, d1, type = \"response\")) -\n      mean(predict(fit, d0, type = \"response\"))\n  }\n\n  b <- boot(df, adj_delta, R = 2000)\n  ci <- boot.ci(b, type = \"perc\")$percent[4:5]\n  list(delta_cost_brand_minus_generic = b$t0,\n       ci95 = ci,\n       decision = if (b$t0 > 0) \"choose GENERIC (cheaper, equivalent)\" else \"choose BRAND\")\n}",
        "description": "CMA cost difference in R mirroring the Python version: adjusted (g-computed) difference in mean\n12-month cost under assumed outcome equivalence, with a bootstrap CI. Input data.frame:\n  cohort : person_id, arm in {'GENERIC','BRAND'}, total_cost_12m, fully_enrolled_12m (logical),\n           age, comorbidity_score, prior_cost_12m\nA Gamma GLM with log link respects the skew and non-negativity of cost; recycled predictions give\nthe confounder-adjusted ATE on the natural (dollar) scale.",
        "dependencies": [
          "boot"
        ],
        "source_citations": [
          "briggs-2001"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* 1. Total 12-month cost per patient over the common, fully-enrolled window. */\nproc sql;\n  create table cost12 as\n  select c.person_id,\n         sum(cl.allowed_amt) as total_cost_12m\n  from work.cohort c\n  left join work.claims cl\n    on cl.person_id = c.person_id\n   and cl.service_date >= c.index_date\n   and cl.service_date <  c.index_date + 365\n  where c.fully_enrolled_12m = 1     /* else use IPCW-weighted cost */\n  group by c.person_id;\nquit;\n\nproc sql;\n  create table analytic as\n  select a.*, b.total_cost_12m\n  from work.cohort a inner join cost12 b on a.person_id = b.person_id;\nquit;\n\n/* 2. Confounder-adjusted cost difference (arms are not randomized in RWD).\n      Gamma family + log link handles right-skewed, positive cost. */\nproc genmod data=analytic;\n  class arm (ref='GENERIC');\n  model total_cost_12m = arm age comorbidity_score prior_cost_12m\n        / dist=gamma link=log;\n  lsmeans arm / diff ilink cl;       /* adjusted mean cost per arm + difference on $ scale */\nrun;",
        "description": "CMA in SAS: build equal-window total cost per arm from claims, then estimate a confounder-adjusted\ncost difference with a Gamma/log GLM (PROC GENMOD) - the standard model for skewed RWD cost.\nRequired inputs (post data-management):\n  work.cohort : person_id, arm ('GENERIC'/'BRAND'), index_date, fully_enrolled_12m (0/1),\n                age, comorbidity_score, prior_cost_12m\n  work.claims : person_id, service_date, allowed_amt   (net-of-rebate for the study drug)\nLSMEANS on the response scale (ilink) gives adjusted mean cost per arm; their difference is the CMA\nestimand. Restrict to fully observed person-time or weight by IPCW upstream to avoid differential\ncensoring.",
        "dependencies": [],
        "source_citations": [
          "briggs-2001"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Q[Comparative question:<br/>two interchangeable options, same indication] --> E{Pre-specified equivalence<br/>or non-inferiority margin MET?}\n  E -- No --> CEA[Do NOT use CMA<br/>Run CEA/CUA; report full<br/>cost-effectiveness plane]\n  E -- Yes --> P[Fix perspective + horizon<br/>payer / health-system / societal]\n  P --> C[Build equal-window total cost per arm<br/>allowed amounts, net-of-rebate drug,<br/>continuous FFS enrollment, IPCW for censoring]\n  C --> D[Confounder-adjusted delta-Cost + bootstrap CI<br/>effects assumed equal, not estimated]\n  D --> R[Choose lowest-cost option]\n  R --> B[Pair with budget-impact analysis<br/>cheaper-per-patient != affordable at scale]",
        "caption": "CMA decision logic. The equivalence gate is the load-bearing step - CMA is licensed only when a pre-specified equivalence/non-inferiority margin is met; otherwise the analysis must move to CEA/CUA on the full cost-effectiveness plane.",
        "alt_text": "Decision flowchart starting from a comparative question, gating on whether an equivalence margin is met, branching to CEA/CUA if not, and otherwise proceeding to perspective/horizon, equal- window cost build, adjusted cost difference, lowest-cost choice, and a budget-impact companion.",
        "source_type": "illustrative",
        "source_citations": [
          "briggs-2001"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "alternative_to",
        "target_slug": "cost-effectiveness",
        "notes": "CEA estimates cost per unit of effect; CMA is its special case under assumed outcome equivalence. When equivalence is not demonstrated, CEA is the correct alternative."
      },
      {
        "relation_type": "alternative_to",
        "target_slug": "cost-utility",
        "notes": "CUA estimates cost per QALY; choose it over CMA whenever outcomes (including quality of life) may differ between arms."
      },
      {
        "relation_type": "see_also",
        "target_slug": "budget-impact",
        "notes": "CMA gives per-patient cost; budget-impact analysis gives aggregate affordability over a budget cycle. Report them together - cheaper per patient is not the same as affordable at scale."
      },
      {
        "relation_type": "see_also",
        "target_slug": "icer-net-monetary-benefit-rwe",
        "notes": "When the equivalence assumption is soft, carry the uncertain effect difference onto the cost-effectiveness plane (ICER / net monetary benefit) rather than assuming it away."
      },
      {
        "relation_type": "requires",
        "target_slug": "healthcare-costs-pppm-pppy-pmpm",
        "notes": "CMA depends entirely on a defensible cost build; standardized per-member cost metrics and equal-window total-cost construction are the inputs to the cost difference."
      },
      {
        "relation_type": "see_also",
        "target_slug": "cer-observational",
        "notes": "CMA's equivalence warrant typically comes from a non-inferiority/equivalence comparative- effectiveness result; without it, CMA is not licensed."
      }
    ],
    "aliases": [
      "cost minimization",
      "cost minimisation analysis",
      "Cost-Minimization Analysis",
      "Cost-minimisation analysis",
      "CMA"
    ],
    "complexity": "foundational",
    "regulatory_relevance": [
      "hta",
      "ema"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "cost-of-illness",
    "name": "Cost-of-Illness (COI) Study",
    "short_definition": "An economic-burden study that estimates the total cost a disease imposes on a defined population over a defined period — typically direct medical, direct non-medical, and indirect (productivity) costs — without comparing competing interventions.",
    "long_description": "A **cost-of-illness (COI) study** measures the aggregate economic burden of a disease, condition, or risk\nfactor in a defined population and period. It answers \"how much does this disease cost society (or a payer)?\"\nrather than \"is treatment A worth more than treatment B?\". The total is conventionally decomposed into\n**direct medical costs** (inpatient, outpatient, pharmacy, procedures, devices), **direct non-medical costs**\n(transportation, paid caregiving, home modification), and **indirect costs** (lost productivity from morbidity,\npremature mortality, and informal caregiving, valued by the human-capital or friction-cost method). COI is a\n*descriptive economic-evaluation* method: it has no incremental contrast and no QALY, so it cannot establish\nvalue for money. Its outputs justify research prioritization, size a market or unmet need, and supply the\nbackground burden figure for downstream cost-effectiveness, budget-impact, and HTA submissions.\n\n**Core conceptual distinction**. Three independent design axes define a COI study, and conflating them is the\nmost common error. (1) *Prevalence-based vs incidence-based*: a **prevalence-based** COI sums all costs\nincurred by everyone with the disease during a calendar window (the standard for annual national-burden\nfigures and payer planning); an **incidence-based** COI follows new cases from onset and sums lifetime costs,\ndiscounted to present value (the right input for prevention/screening value, but it requires long longitudinal\nfollow-up). (2) *Gross (all-cause) vs net (attributable/excess)*: gross costing counts every dollar a patient\nspends; **net/attributable** costing subtracts the cost those patients would have incurred without the disease,\nusually via a matched non-diseased comparator — this is the methodologically defensible burden but demands a\ncausal-comparison design. 
(3) *Top-down vs bottom-up*: top-down apportions national aggregate spending to the\ndisease using attributable fractions; **bottom-up** builds the estimate from patient-level resource use × unit\ncosts (the natural approach in claims/EHR). The estimand must be stated explicitly: COI does *not* estimate a\ntreatment effect, and its \"attributable cost\" is an association unless an active-comparator/new-user or\nadjusted design is layered on.\n\n**Pros, cons, and trade-offs** (specific & comparative, naming the alternatives).\n- **vs cost-effectiveness / cost-utility analysis:** COI quantifies *burden*, not *value*; it has no comparator\n  intervention, no incremental cost-effectiveness ratio, and no health-outcome denominator. Cheaper and faster\n  to produce and useful for advocacy/prioritization, but it can never answer \"should we pay for this drug?\".\n  **Prefer COI** only to size the problem; **switch to CEA/CUA** the moment a decision involves choosing\n  between interventions.\n- **vs budget-impact analysis (BIA):** BIA is the affordability cousin — it projects net spending changes to a\n  *specific budget holder* over a short horizon under a new intervention's uptake. COI is intervention-agnostic\n  and societal/payer-wide. A COI total is often the denominator a BIA quotes against.\n- **Gross vs attributable costing:** all-cause/gross costs are trivial to compute from claims but\n  systematically overstate disease burden by including unrelated care; **attributable (excess) costing** via a\n  matched comparator is far more defensible but introduces confounding, matching, and competing-risk problems.\n- **Human-capital vs friction-cost for indirect costs:** human-capital values all foregone lifetime\n  productivity (large numbers, favored in the US literature); friction-cost values only the period until a\n  worker is replaced (smaller, favored in some European guidance). 
The choice can move the indirect-cost total\n  several-fold, so it must be pre-specified and reported transparently.\n\n**When to use**. Establishing the magnitude and distribution of disease burden for a payer, manufacturer, or\npolicymaker; generating the background-burden and natural-history cost inputs that feed a CEA/CUA or BIA;\nsupporting research-prioritization or market-sizing; quantifying the burden of an unmet need in a value\ndossier. Bottom-up, prevalence-based, payer-perspective COI is the workhorse for US claims-based burden studies.\n\n**When NOT to use — and when it is actively misleading or dangerous**.\n- **As evidence of an intervention's value.** A large COI total is routinely (mis)used to argue a therapy is\n  worth its price. It is not — burden says nothing about how much of that cost a treatment can avert at what\n  incremental cost. Presenting COI as cost-effectiveness is the field's signature abuse.\n- **Gross all-cause cost reported as \"the cost of the disease.\"** Without an attributable/excess design, the\n  figure includes care patients would have needed anyway and overstates burden, sometimes by multiples.\n- **Double counting.** Summing prevalence-based annual costs across multiple years, or adding indirect mortality\n  costs (already a lifetime value) to multi-year direct costs, inflates totals. 
Mixing top-down attributable\n  fractions with bottom-up patient costs double-counts the same care.\n- **Cross-study addition.** COI totals from studies with different perspectives, cost years (no inflation\n  adjustment), populations, or costing methods are not additive and must not be summed to a \"total national\n  burden.\"\n- **Immortal-time / survivorship artifacts in lifetime (incidence-based) COI.** Requiring survival to a landmark\n  to accrue costs, or annualizing costs only over observed (survivor) person-time, biases per-patient burden\n  downward in fatal diseases — costs that cluster near death are differentially censored.\n\n**Data-source operational depth**.\n- **Administrative claims (FFS, commercial, Medicaid):** the standard substrate for bottom-up burden. Cost =\n  adjudicated/allowed/paid amount, not charges; standardize across payers and inflate to a common cost year\n  (CPI medical-care component or GDP deflator). Require **continuous medical + pharmacy enrollment** across the\n  measurement window so per-patient-per-month (PPPM) and per-patient-per-year (PPPY) denominators reflect true\n  observable person-time, not enrollment gaps. *Failure modes:* **Medicare Advantage encounter records lack\n  reliable paid amounts** and MA-only person-time has no FFS claims — restrict national-burden estimates to FFS\n  Parts A/B/D with complete capture and never pool MA encounter costs with FFS paid amounts. Capitated/bundled\n  arrangements zero-out component costs. **Differential competing risk of death by disease severity** truncates\n  cost accrual in the sickest patients, biasing prevalence-based annual costs; annualize over observed\n  person-time (PPPM × 12) rather than forcing a full-year denominator. Carve-out behavioral/pharmacy benefits\n  and out-of-network care are invisible. 
Indirect (productivity) costs are absent from claims entirely and must\n  be modeled externally.\n- **EHR:** captures clinical detail and severity for attribution but **not the full cost of care delivered\n  elsewhere** (referrals, other systems); chargemaster prices are not economic costs, so apply\n  cost-to-charge ratios or a standardized fee schedule. Visit-driven capture means sicker, in-system patients\n  are over-represented.\n- **Registry:** strong for incident-case identification, severity, and outcomes (ideal denominator for\n  incidence-based lifetime COI) but weak for complete cost; link to claims for the full resource-use ledger and\n  to a death index for mortality-cost valuation.\n- **Linked claims–EHR–vital-records or claims–survey:** the ideal substrate (claims completeness + EHR severity\n  for attribution + survey/administrative wage data for indirect costs), but linkage selection (only the\n  linkable subset) and date reconciliation across sources must be handled before person-time and cost windows\n  are fixed.\n\n**Worked claims example (attributable, prevalence-based, payer perspective).** Question: the 1-year direct\nmedical burden of heart failure (HF) attributable to the disease among commercially insured + Medicare FFS\nadults. (1) *Cases:* adults with ≥1 inpatient HF diagnosis, or ≥2 outpatient HF diagnoses ≥30 days apart, in\n2023 (ICD-10 I50.x); index_date = first qualifying claim. (2) *Continuous enrollment:* require gap-free medical +\npharmacy enrollment over each person's observed time in the 2023 window (through death or disenrollment, if\nearlier) and a 365-day baseline for matching, excluding MA-only person-time so paid amounts are observed.\n(3) *Cost build:* sum allowed/paid amounts across inpatient, outpatient, professional,\nand pharmacy claims by person_id within the window; convert each person to **PPPM = total_cost / observed_member_months**\nand annualize to PPPY = PPPM × 12 so patients who die or disenroll mid-year are not penalized by a forced\n12-month denominator.\n
(4) *Attribution:* match each HF case 1:1 to a non-HF control on age, sex, region, index\ncalendar month, and a comorbidity score using only baseline-window data; **attributable cost = case PPPY −\nmatched-control PPPY**, which nets out the care these patients would have incurred regardless. (5) *Aggregation:*\npopulation burden = mean attributable PPPY × number of prevalent HF enrollees, with patient-level bootstrap CIs\nto reflect the right-skewed cost distribution; inflate all 2023 dollars to the reporting cost year. Report\nperspective (payer), cost components, the gross-vs-attributable split, and that productivity/informal-care costs\nare excluded.\n\n**Interpreting the output**\n\nA COI study for a single heart-failure patient from a societal perspective reports a per-patient total\nof $25,000: $19,750 in direct medical costs (79%) — comprising $14,200 inpatient, $3,150 pharmacy, and\n$2,400 outpatient — and $5,250 in indirect costs from lost productivity (21%).\n\n*(1) Formal interpretation.* The $25,000 per-patient total is an accounting aggregate, not a causal\ncounterfactual. Unless this figure was derived by subtracting matched-control costs, it includes\nspending that would have occurred regardless of heart failure (background comorbidity care, routine\npreventive visits). The 79% direct / 21% indirect split is perspective-dependent: the societal view\nincludes lost wages via the human-capital method; a payer-perspective analysis would exclude the\n$5,250 indirect component entirely, yielding $19,750. Each cost component carries its own measurement\nlimitations — inpatient costs from claims reflect allowed amounts, not societal resource consumption,\nand lost productivity requires wage-rate assumptions. 
The total is reported as a point estimate; the\nright-skewed cost distribution means the median patient cost is substantially lower.\n\n*(2) Practical interpretation.* The $25,000 COI figure sizes the problem for a disease-awareness or\nformulary argument. For an HTA submission, a reviewer will ask what the comparable cost is for a\nmatched non-HF patient — if that figure is not reported, the $25,000 cannot be attributed to heart\nfailure. Use this estimate to frame unmet need and to calibrate cost-offset claims, not as a standalone\ncausal burden statement.",
    "primary_category": "Economic_Evaluation",
    "tags": [
      "cost-of-illness",
      "disease-burden",
      "economic-evaluation",
      "direct-medical-costs",
      "indirect-costs",
      "attributable-cost",
      "prevalence-based",
      "claims"
    ],
    "applies_to_study_types": [
      "cost_of_illness"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.3350/cmh.2014.20.4.327",
        "url": "https://doi.org/10.3350/cmh.2014.20.4.327",
        "citation_text": "Jo C. Cost-of-illness studies: concepts, scopes, and methods. Clinical and Molecular Hepatology. 2014;20(4):327-337.",
        "year": 2014,
        "authors_short": "Jo",
        "notes": "Clear modern statement of COI concepts — prevalence- vs incidence-based, top-down vs bottom-up, and the three cost components — with the human-capital/friction-cost distinction for indirect costs."
      },
      {
        "role": "explain",
        "doi": "10.2165/11588380-000000000-00000",
        "url": "https://doi.org/10.2165/11588380-000000000-00000",
        "citation_text": "Larg A, Moss JR. Cost-of-illness studies: a guide to critical evaluation. PharmacoEconomics. 2011;29(8):653-671.",
        "year": 2011,
        "authors_short": "Larg & Moss",
        "notes": "Definitive critical-appraisal guide; lays out gross-vs-attributable costing, perspective, and the standard methodological pitfalls used here."
      },
      {
        "role": "explain",
        "doi": "10.1136/ip.6.3.177",
        "url": "https://doi.org/10.1136/ip.6.3.177",
        "citation_text": "Rice DP. Cost of illness studies: what is good about them? Injury Prevention. 2000;6(3):177-179.",
        "year": 2000,
        "authors_short": "Rice",
        "notes": "Foundational commentary on the legitimate uses and limits of COI from a pioneer of the method."
      },
      {
        "role": "demonstrate",
        "doi": "10.1016/j.vhri.2014.08.003",
        "url": "https://doi.org/10.1016/j.vhri.2014.08.003",
        "citation_text": "Greenberg D, Yodfat U. What are the challenges in conducting cost-of-illness studies? Value in Health Regional Issues. 2014;4:30-31.",
        "year": 2014,
        "authors_short": "Greenberg & Yodfat",
        "notes": "Concise demonstration of the practical challenges (attribution, perspective, double counting) that arise when operationalizing COI."
      }
    ],
    "plain_language_summary": "A cost-of-illness study adds up everything one disease costs over a set period for a defined group of people, then reports that total as a dollar figure. It counts direct medical costs (hospital stays, drugs, outpatient visits) and indirect costs (the wages a patient loses when illness keeps them out of work), but it does not compare two treatments and cannot tell you whether a drug is worth its price. Think of it as measuring the size of the problem, not the value of any fix. One honest caveat: it only captures the costs the data can see, so anything paid out of pocket or handled outside the records is missing.",
    "key_terms": [
      {
        "term": "direct medical costs",
        "definition": "Money spent on the actual care a patient receives, such as hospital stays, prescription drugs, and clinic visits."
      },
      {
        "term": "indirect costs",
        "definition": "The economic loss from a patient being too sick to work, usually counted as the value of the wages they miss."
      },
      {
        "term": "perspective",
        "definition": "Whose costs you are counting, such as a single insurer's payments versus the whole of society's, which changes which dollars get included."
      },
      {
        "term": "cost component",
        "definition": "One labeled bucket of spending, like hospital care or lost wages, that you add to the others to reach the grand total."
      }
    ],
    "worked_example": {
      "scenario": "We want the one-year cost of illness for a single patient with heart failure, seen from a societal view so both their medical bills and their lost wages count. An analyst pulls the dollars the patient generated in each spending bucket over the calendar year, labels each bucket as either a direct medical cost or an indirect cost, then adds them up and reports the total along with how much of it is medical care versus lost work.",
      "dataset": {
        "caption": "The per-patient annual dollars an analyst would assemble, one row per cost bucket.",
        "columns": [
          "cost_component",
          "type",
          "annual_cost"
        ],
        "rows": [
          [
            "inpatient_hospital",
            "direct",
            14200
          ],
          [
            "prescription_drugs",
            "direct",
            3150
          ],
          [
            "outpatient_visits",
            "direct",
            2400
          ],
          [
            "lost_productivity",
            "indirect",
            5250
          ]
        ]
      },
      "steps": [
        "List every cost bucket the patient generated in the year, with its dollar amount, exactly as the table shows.",
        "Tag each bucket as direct (care the patient received) or indirect (wages lost to illness): the three care buckets are direct, lost productivity is indirect.",
        "Add the three direct buckets: 14,200 + 3,150 + 2,400 = 19,750 in direct medical costs.",
        "The single indirect bucket stands alone: 5,250 in lost productivity.",
        "Add the direct and indirect subtotals to get the full cost of illness: 19,750 + 5,250 = 25,000."
      ],
      "result": "Total cost of illness = $25,000 per patient for the year. The direct/indirect split is $19,750 direct (79%) and $5,250 indirect (21%); the three direct buckets are $14,200 inpatient + $3,150 drugs + $2,400 outpatient = $19,750."
    },
    "prerequisites": [
      "hcru-healthcare-resource-utilization",
      "healthcare-costs-pppm-pppy-pmpm",
      "all-cause-vs-attributable-costs-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Prevalence-based, bottom-up, payer-perspective COI (claims)",
        "description": "Sum patient-level allowed/paid amounts for all prevalent cases in a calendar window; express as PPPM/PPPY over observed person-time and scale to the prevalent population. The workhorse for US payer burden studies.",
        "edge_cases": [
          "Mid-year death or disenrollment shrinks the denominator; use PPPM x 12 over observed member-months rather than a forced annual denominator.",
          "Right-skewed costs make the mean unstable; report medians and bootstrap CIs alongside the mean."
        ],
        "data_source_notes": "claims: require continuous medical+pharmacy enrollment; use paid/allowed amounts not charges; exclude MA-only person-time lacking FFS paid amounts; inflate to a common cost year."
      },
      {
        "name": "Incidence-based lifetime COI (longitudinal/linked)",
        "description": "Follow incident cases from onset and accumulate discounted lifetime costs; the correct input for prevention/screening value.",
        "edge_cases": [
          "Requires long follow-up or modeled extrapolation; costs cluster near death, so survivor-only annualization biases estimates downward.",
          "Must discount future costs to present value and report the discount rate."
        ],
        "data_source_notes": "registry/linked: identify incident date precisely; link to claims for the full cost ledger and to a death index to value end-of-life costs."
      },
      {
        "name": "Attributable (excess) cost via matched non-diseased comparator",
        "description": "Estimate disease-attributable cost as case minus matched-control cost, netting out background care the patients would have incurred anyway.",
        "edge_cases": [
          "Matching on baseline covariates leaves residual confounding by indication/severity; consider regression adjustment or a propensity-matched comparator.",
          "Differential mortality by disease status complicates per-patient cost over a fixed horizon; align observation windows and consider competing-risk-aware accrual."
        ],
        "data_source_notes": "claims/linked: build the comparator from the same enrollment frame and calendar time; match on age, sex, region, index month, and a comorbidity score from the baseline window only."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Cost-effectiveness / cost-utility analysis (CEA/CUA)",
        "pros_of_this": "Quantifies total societal/payer burden with no need for a comparator intervention or health-outcome denominator; faster and cheaper; sizes the problem and feeds downstream models.",
        "cons_of_this": "Has no incremental contrast and no QALY, so it cannot establish value for money or justify reimbursement; routinely misused as if it did.",
        "when_to_prefer": "When the question is the magnitude/distribution of burden or background-cost inputs, not the value of choosing one intervention over another."
      },
      {
        "compared_to": "Budget-impact analysis (BIA)",
        "pros_of_this": "Intervention-agnostic and payer/society-wide; provides the background-burden denominator a BIA quotes against.",
        "cons_of_this": "Does not project the affordability of a specific new intervention to a specific budget holder over a planning horizon.",
        "when_to_prefer": "When sizing total existing burden rather than the net spending change from adopting a new technology."
      },
      {
        "compared_to": "Gross (all-cause) costing",
        "pros_of_this": "Attributable/excess costing nets out background care and yields a defensible disease-specific burden.",
        "cons_of_this": "Requires a matched comparator and inherits confounding, matching, and competing-risk complications that gross costing avoids.",
        "when_to_prefer": "Whenever the headline figure is presented as the cost OF the disease rather than the cost of patients WITH the disease."
      }
    ],
    "implementation_notes_by_data_source": {},
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\nimport numpy as np\n\nWINDOW_START = pd.Timestamp(\"2023-01-01\")\nWINDOW_END   = pd.Timestamp(\"2023-12-31\")\n\ndef patient_cost(claims: pd.DataFrame, enroll: pd.DataFrame) -> pd.DataFrame:\n    # Keep only paid claims inside the measurement window; sum to person-level total cost.\n    c = claims[(claims[\"service_date\"] >= WINDOW_START) &\n               (claims[\"service_date\"] <= WINDOW_END)]\n    total = c.groupby(\"person_id\")[\"paid_amount\"].sum().rename(\"total_cost\")\n\n    # PPPM over OBSERVED member-months avoids penalizing mid-year death/disenrollment; annualize to PPPY.\n    out = enroll.set_index(\"person_id\").join(total).fillna({\"total_cost\": 0.0})\n    out = out[out[\"member_months_observed\"] > 0].copy()\n    out[\"pppm\"] = out[\"total_cost\"] / out[\"member_months_observed\"]\n    out[\"pppy\"] = out[\"pppm\"] * 12.0\n    return out.reset_index()[[\"person_id\", \"total_cost\", \"member_months_observed\", \"pppm\", \"pppy\"]]\n\ndef attributable_pppy(cases: pd.DataFrame, costs: pd.DataFrame,\n                      n_boot: int = 1000, seed: int = 1) -> dict:\n    # Attributable cost = case PPPY minus matched-control PPPY, paired within match_id.\n    df = cases.merge(costs[[\"person_id\", \"pppy\"]], on=\"person_id\", how=\"inner\")\n    wide = (df.pivot_table(index=\"match_id\", columns=\"is_case\", values=\"pppy\")\n              .rename(columns={1: \"case_pppy\", 0: \"ctrl_pppy\"})\n              .dropna())\n    diff = (wide[\"case_pppy\"] - wide[\"ctrl_pppy\"]).to_numpy()\n\n    # Patient-level (cluster) bootstrap over matched pairs to respect the right-skewed cost distribution.\n    rng = np.random.default_rng(seed)\n    n = len(diff)\n    boot = np.array([diff[rng.integers(0, n, n)].mean() for _ in range(n_boot)])\n    return {\n        \"mean_attributable_pppy\": float(diff.mean()),\n        \"ci95_low\":  float(np.percentile(boot, 2.5)),\n        \"ci95_high\": float(np.percentile(boot, 
97.5)),\n        \"n_pairs\": int(n),\n    }",
        "description": "Bottom-up, prevalence-based ATTRIBUTABLE cost from claims-style inputs. Required inputs (cleaned, de-duplicated):\n  claims : person_id, paid_amount (float), service_date (datetime), claim_type {'IP','OP','PROF','RX'}\n  cases  : person_id, index_date (datetime), is_case (1=disease, 0=matched control), match_id\n  enroll : person_id, member_months_observed (int, within the calendar window, medical+pharmacy, FFS-observable)\nReturns per-patient PPPM/PPPY and the mean disease-ATTRIBUTABLE PPPY (case minus matched control) with a\npatient-level bootstrap CI. Costs must already be inflated to a common cost year and exclude MA-only person-time.",
        "dependencies": [
          "pandas",
          "numpy"
        ],
        "source_citations": [
          "jo-2014",
          "larg-2011"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\n\nWINDOW_START <- as.Date(\"2023-01-01\")\nWINDOW_END   <- as.Date(\"2023-12-31\")\n\npatient_cost <- function(claims, enroll) {\n  setDT(claims); setDT(enroll)\n  c <- claims[service_date >= WINDOW_START & service_date <= WINDOW_END]\n  total <- c[, .(total_cost = sum(paid_amount)), by = person_id]\n\n  out <- merge(enroll, total, by = \"person_id\", all.x = TRUE)\n  out[is.na(total_cost), total_cost := 0]\n  out <- out[member_months_observed > 0]\n  out[, pppm := total_cost / member_months_observed]   # observed-time denominator\n  out[, pppy := pppm * 12]                              # annualize\n  out[, .(person_id, total_cost, member_months_observed, pppm, pppy)]\n}\n\nattributable_pppy <- function(cases, costs, n_boot = 1000L, seed = 1L) {\n  setDT(cases); setDT(costs)\n  df <- merge(cases, costs[, .(person_id, pppy)], by = \"person_id\")\n  # One case PPPY and one control PPPY per matched pair.\n  wide <- dcast(df, match_id ~ is_case, value.var = \"pppy\")\n  setnames(wide, c(\"0\", \"1\"), c(\"ctrl_pppy\", \"case_pppy\"))\n  wide <- wide[!is.na(case_pppy) & !is.na(ctrl_pppy)]\n  diff <- wide$case_pppy - wide$ctrl_pppy\n\n  set.seed(seed)\n  n <- length(diff)\n  boot <- replicate(n_boot, mean(diff[sample.int(n, n, replace = TRUE)]))\n  list(mean_attributable_pppy = mean(diff),\n       ci95_low  = unname(quantile(boot, 0.025)),\n       ci95_high = unname(quantile(boot, 0.975)),\n       n_pairs   = n)\n}",
        "description": "R version with data.table. Inputs mirror the Python version:\n  claims : person_id, paid_amount, service_date (Date), claim_type\n  cases  : person_id, index_date (Date), is_case (1/0), match_id\n  enroll : person_id, member_months_observed (integer, FFS-observable, medical+pharmacy)\nReturns per-patient PPPM/PPPY and the mean disease-attributable PPPY with a matched-pair bootstrap CI.",
        "dependencies": [
          "data.table"
        ],
        "source_citations": [
          "jo-2014",
          "larg-2011"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "%let win_start = '01JAN2023'd;\n%let win_end   = '31DEC2023'd;\n\n/* Person-level total paid cost within the measurement window. */\nproc sql;\n  create table total_cost as\n  select person_id, sum(paid_amount) as total_cost\n  from work.claims\n  where service_date between &win_start and &win_end\n  group by person_id;\nquit;\n\n/* PPPM over OBSERVED member-months (not a forced 12), then annualize to PPPY. */\nproc sql;\n  create table patient_cost as\n  select e.person_id,\n         coalesce(t.total_cost, 0) as total_cost,\n         e.member_months_observed,\n         calculated total_cost / e.member_months_observed as pppm,\n         (calculated pppm) * 12 as pppy\n  from work.enroll e\n  left join total_cost t on e.person_id = t.person_id\n  where e.member_months_observed > 0;\nquit;\n\n/* Pair each matched set into one case PPPY and one control PPPY, then take the difference. */\nproc sql;\n  create table pairs as\n  select c.match_id,\n         max(case when c.is_case = 1 then p.pppy end) as case_pppy,\n         max(case when c.is_case = 0 then p.pppy end) as ctrl_pppy\n  from work.cases c\n  inner join patient_cost p on c.person_id = p.person_id\n  group by c.match_id\n  having case_pppy is not null and ctrl_pppy is not null;\nquit;\n\ndata pairs; set pairs; attributable = case_pppy - ctrl_pppy; run;\n\n/* Mean disease-attributable PPPY; bootstrap externally (PROC SURVEYSELECT) for skew-robust CIs. */\nproc means data=pairs n mean median std clm maxdec=0;\n  var attributable;\nrun;",
        "description": "Bottom-up attributable-cost build in SAS. Required input datasets (post data-management):\n  work.claims : person_id, paid_amount, service_date, claim_type ('IP'/'OP'/'PROF'/'RX')\n  work.cases  : person_id, is_case (1/0), match_id\n  work.enroll : person_id, member_months_observed (FFS-observable medical+pharmacy person-time)\nProduces per-patient PPPM/PPPY and the mean disease-attributable PPPY (case minus matched control).\nCosts must be inflated to a common cost year and exclude MA-only person-time before this step.",
        "dependencies": [],
        "source_citations": [
          "jo-2014",
          "larg-2011"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Q[Burden question: cost OF the disease] --> A1{Prevalence-based<br/>or incidence-based?}\n  A1 -->|All cases in a window| Prev[Prevalence-based<br/>annual burden]\n  A1 -->|New cases, lifetime| Inc[Incidence-based<br/>discounted lifetime cost]\n  Prev --> A2{Gross all-cause<br/>or attributable?}\n  Inc --> A2\n  A2 -->|Patients WITH disease| Gross[Gross / all-cause cost<br/>overstates burden]\n  A2 -->|Cost OF disease| Attr[Attributable / excess<br/>vs matched comparator]\n  Attr --> A3{Cost components}\n  Gross --> A3\n  A3 --> DM[Direct medical] --> Agg[Aggregate to population<br/>x prevalent N, common cost year]\n  A3 --> DNM[Direct non-medical] --> Agg\n  A3 --> IND[Indirect: human-capital<br/>or friction-cost] --> Agg",
        "caption": "COI decision logic — pick the time frame (prevalence vs incidence), the costing basis (gross vs attributable), and the cost components before aggregating to the population. Each axis is an independent, pre-specified choice.",
        "alt_text": "Decision flowchart for a cost-of-illness study moving from burden question through prevalence-versus-incidence, gross-versus-attributable, and the three cost components to population aggregation.",
        "source_type": "illustrative",
        "source_citations": [
          "jo-2014"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  C[Adjudicated claims<br/>paid/allowed amounts] --> F[Continuous enrollment filter<br/>medical+pharmacy, FFS-observable]\n  F --> X[Exclude MA-only person-time<br/>no FFS paid amounts]\n  X --> S[Sum cost by person<br/>in calendar window]\n  S --> P[PPPM = cost / observed member-months<br/>PPPY = PPPM x 12]\n  P --> M[Match cases to non-diseased controls<br/>baseline covariates only]\n  M --> D[Attributable PPPY = case - control<br/>bootstrap CI for skewed costs]\n  D --> Pop[Scale to prevalent population<br/>inflate to common cost year]",
        "caption": "Claims data-flow for an attributable, prevalence-based COI. Observed-time denominators and MA-only exclusion prevent the two most common burden-estimation artifacts.",
        "alt_text": "Data-flow diagram from adjudicated claims through enrollment filtering, MA-only exclusion, person-level cost summation, PPPM/PPPY, matched-control attribution, and population scaling.",
        "source_type": "illustrative",
        "source_citations": [
          "larg-2011"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "see_also",
        "target_slug": "burden-of-disease-cost-of-illness",
        "notes": "Companion entry with deeper detail on HCRU/cost measurement, attribution, and standardization for cost-of-illness analyses."
      },
      {
        "relation_type": "produces",
        "target_slug": "healthcare-costs-pppm-pppy-pmpm",
        "notes": "COI burden is operationalized through PPPM/PPPY/PMPM cost metrics computed over observed person-time."
      },
      {
        "relation_type": "used_with",
        "target_slug": "hcru-healthcare-resource-utilization",
        "notes": "Bottom-up COI builds cost from resource-use counts multiplied by unit costs; HCRU is the upstream measurement."
      },
      {
        "relation_type": "used_with",
        "target_slug": "all-cause-vs-attributable-costs-rwe",
        "notes": "The gross-vs-attributable distinction is central to defensible COI; attributable costing nets out background care."
      },
      {
        "relation_type": "used_with",
        "target_slug": "cost-outlier-handling-rwe",
        "notes": "Right-skewed cost distributions require outlier handling (winsorization/trimming) and skew-robust inference for COI totals."
      },
      {
        "relation_type": "requires",
        "target_slug": "discounting-costs-effects-rwe",
        "notes": "Incidence-based lifetime COI must discount future costs to present value."
      },
      {
        "relation_type": "see_also",
        "target_slug": "budget-impact",
        "notes": "COI burden totals supply the background-burden denominator that budget-impact analyses quote against."
      },
      {
        "relation_type": "complements",
        "target_slug": "cost-effectiveness",
        "notes": "COI measures burden with no comparator; cost-effectiveness measures the value of choosing one intervention over another. COI cannot establish value for money."
      },
      {
        "relation_type": "used_with",
        "target_slug": "continuous-enrollment-observable-time-rwe",
        "notes": "Valid PPPM/PPPY denominators require continuous-enrollment observable-time definitions to avoid gap-driven bias."
      }
    ],
    "aliases": [
      "COI",
      "cost of illness study",
      "cost-of-illness analysis",
      "disease burden study",
      "economic burden of disease"
    ],
    "complexity": "foundational",
    "regulatory_relevance": [
      "hta",
      "ema"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "cost-outlier-handling-rwe",
    "name": "Cost Outlier Handling (Winsorization, Trimming, Robust/Two-Part Models)",
    "short_definition": "A pre-specified strategy for limiting the influence of extreme high-cost observations on real-world cost estimands using percentile winsorization, dollar-threshold trimming, and distribution-aware models (two-part, gamma-log GLM, quantile/robust regression) rather than untransformed means or OLS.",
    "long_description": "Healthcare cost data are right-skewed, heavy-tailed, and zero-inflated: a small fraction of patients (transplants, CAR-T,\nprolonged ICU stays, rare catastrophic complications) generate a disproportionate share of total spend, and a single\n$1–2M claimant can move an unadjusted mean, a per-member-per-month (PMPM) figure, or an OLS coefficient by tens of\npercent. **Cost outlier handling** is the pre-specified rule set that bounds the leverage of these observations so that\nthe reported estimand is stable across resamples, databases, and minor cohort-definition changes. It is an operational\nHEOR decision that belongs in the protocol and statistical analysis plan (SAP) *before* the data are seen, not a\npost-hoc reaction to an ugly histogram.\n\n**Core conceptual distinction**. Three distinct levers are conflated under \"outlier handling,\" and they change different\nthings. (1) *Winsorization* replaces values beyond a cap (e.g., the 99th percentile) with the cap value — it keeps the\nsample size and patient identity intact, shrinks the mean and especially the variance, and biases the estimated mean\n*downward* by construction. (2) *Trimming/exclusion* deletes observations beyond a threshold — it changes the target\npopulation and the estimand (you are now estimating cost in a non-catastrophic subpopulation), and should be treated as a\nredefinition of the cohort, not a cleaning step. (3) *Distribution-aware modeling* (two-part models, gamma or\ninverse-Gaussian GLM with a log link, quantile regression, robust M-estimation) keeps every dollar but down-weights the\ninfluence of the tail through the likelihood and link rather than by editing data. These are not interchangeable: a\nwinsorized mean answers \"what is mean cost if the most extreme spend is capped?\"; a gamma-GLM marginal mean answers\n\"what is expected cost given the full skewed distribution?\"; a trimmed mean answers a different population's question\nentirely. 
The estimand must name which one is primary. A separate, frequently confused choice is the *scale*: OLS on\nlog-cost requires retransformation (smearing or a heteroscedasticity-consistent correction) to recover the\narithmetic-mean cost that budget-impact and ICER calculations need, whereas a gamma-log GLM models the mean directly and\nsidesteps the retransformation problem.\n\n**Pros, cons, and trade-offs**.\n- **Percentile winsorization vs no handling (raw mean / OLS).** Winsorization buys reproducibility and finite-sample\n  variance reduction at the cost of a known, deliberate downward bias in the mean and an understated tail. Raw means are\n  unbiased in expectation but so unstable that two random halves of the same cohort can disagree by 20–40%, and OLS\n  coefficients are dominated by leverage points. **Prefer winsorization** as a transparent sensitivity layer, never as\n  the sole primary analysis for a mean that feeds a budget-impact model, because the capped mean understates true spend.\n- **Winsorization vs trimming/exclusion.** Winsorization preserves N and the estimand's population; trimming silently\n  redefines the population and discards the very patients (the most severe, the index-treatment failures) who often\n  drive the policy question. **Prefer winsorization or modeling** over trimming unless there is a documented data-error\n  rationale (e.g., a duplicate adjudication, an implausible negative or $10M keying error) — clinical extremity is not a\n  data error.\n- **Winsorization vs two-part / gamma-log GLM.** GLMs use all data, respect the cost scale, and give the\n  arithmetic-mean estimand directly, but they impose distributional/variance assumptions (the mean–variance relationship\n  of the gamma) that must be checked (modified Park test, Pregibon link test, Pearson/deviance residuals). Winsorization\n  is assumption-light and trivially explained to reviewers but biases the mean. 
**Prefer the two-part + gamma-log GLM**\n  for inferential/comparative cost analyses (incremental cost, adjusted PMPM) and reserve winsorization for descriptive\n  tables and as a robustness check on the GLM.\n- **GLM vs OLS-on-log-cost.** Log-OLS handles skew but forces retransformation; under heteroscedastic residuals (the\n  norm in cost data) Duan's smearing factor must itself be made covariate-specific or the back-transformed mean is\n  biased. The gamma-log GLM avoids this. **Prefer the GLM** unless the log-residuals are demonstrably homoscedastic.\n\n**When to use**. Any RWE/HEOR analysis whose endpoint is a dollar amount — total cost of care, disease-attributable\ncost, PMPM/PPPM/PPPY, incremental cost in a cost-effectiveness or budget-impact analysis, cost-of-illness burden — where\nthe empirical distribution is right-skewed and the conclusion could plausibly flip if one to a few claimants were\nreweighted. Pre-specify the primary handling (typically a two-part + gamma-log GLM for comparative work, or a clearly\nlabeled winsorized/raw descriptive mean) and at least two sensitivity rules (e.g., none / 95th / 99th winsor, or\nGLM with vs without tail winsorization).\n\n**When NOT to use — and when it is actively misleading or dangerous**.\n- **As silent data cleaning that the reader never sees.** Winsorizing or trimming without reporting the dollar caps, the\n  percent of patients affected, and the percent of total spend removed is the single most common HTA review flag; it can\n  turn a real cost difference into a manufactured null (or vice versa).\n- **Trimming high-cost patients before a comparative contrast when they cluster in one arm.** If the new therapy's\n  failures land in the ICU and you trim the tail, you delete the harm you were measuring. 
Differential tail mass by arm\n  means trimming selects on the outcome, which biases the contrast by design.\n- **Capping the mean that feeds a budget-impact model.** Payers need expected (arithmetic-mean) spend including the\n  tail; a winsorized mean systematically understates the budget and is the wrong estimand for that decision, however\n  appropriate it is for a stable descriptive table.\n- **Treating clinical extremity as error.** A verified $1.4M CAR-T-plus-complications episode is signal, not noise.\n  Outlier *rules* manage leverage; they do not license deletion of real, adjudicated spend.\n- **Winsorizing on a contaminated denominator.** If person-time is mismeasured (Medicare Advantage gaps, partial-year\n  enrollment), percentile caps computed on annualized cost can be driven by a denominator artifact, not true high spend.\n\n**Data-source operational depth**.\n- **Administrative claims (FFS vs MA).** Cost = adjudicated allowed/paid amount on each line, summed over the analytic\n  window. The dominant failure mode is **Medicare Advantage person-time**: MA encounter records do not carry reliable\n  paid amounts (capitated/bundled), so an MA member appears artificially low-cost and *deflates* the percentile caps and\n  the mean for the whole cohort. Restrict cost analyses to fee-for-service-observable person-time (Parts A/B with no MA months,\n  or commercial fully-insured/ASO with complete paid amounts) before computing any cap. Adjudication lag and claim\n  reversals create transient negatives and double counts, so net reversals and impose a claims-runout window before\n  freezing the denominator. Bundled/DRG and global-surgery payments mean a single line can legitimately be $80k; decide\n  whether the bundle or its shadow-priced components are the unit being winsorized. 
Annualize to PPPY only on\n  FFS-observable months and winsorize the annualized value, not the raw window total, when follow-up is incomplete.\n- **EHR.** Native EHR \"cost\" is usually charges or the chargemaster, not paid/allowed — charge-to-cost ratios vary 3–8×\n  across service lines, so winsorizing charges answers a question no payer asks. Link to claims for true paid amounts\n  before any outlier rule. Encounter-driven capture also makes external high-cost events (an out-of-network ICU\n  admission) invisible, which can masquerade as a low-cost patient and distort the tail in the *opposite* direction from\n  the MA problem.\n- **Registry.** Registries rarely carry complete spend but excel at clinical severity and stage; use them to\n  *adjudicate* whether a high-cost patient is clinically justified (retain) versus a coding/linkage artifact (correct),\n  and link to claims for the actual dollars and to a death index so end-of-life cost spikes are not censored away.\n- **Linked claims–EHR–registry.** The ideal substrate (EHR/registry severity to justify the tail + claims completeness\n  + mortality), but linkage selects the linkable subset and introduces date-discrepancy issues; a competing risk that\n  is differential by exposure — death is more common, and end-of-life is the costliest period, in the sicker arm — will\n  bias any naive mean or winsorized mean if mortality and its terminal-care costs are handled inconsistently across arms.\n\n**Worked claims example.** Question: 24-month all-cause and disease-attributable total cost of care after initiating\nDrug A vs Drug B in a commercial + Medicare FFS database. 
(1) *Denominator integrity:* require 12 months continuous,\nFFS-observable medical+pharmacy enrollment before `index_date` and follow each patient up to 24 months, censoring at\ndisenrollment, an MA-enrollment month (MA person-time is dropped because paid amounts are unreliable), death, or end of\ndata; apply a 3-month claims-runout window and net all reversals before summing `paid_amt`. (2) *Attribution first,\nthen outliers:* compute all-cause total cost and, separately, disease-attributable cost (claims with the qualifying\ndiagnosis in any position) — the outlier rule is applied to each resulting series independently, because a $900k\nunrelated trauma admission belongs in all-cause but not in attributable. (3) *Describe the tail:* the top 2% of\npatients (n≈40 of 2,000) hold 31% of all-cause spend; the 95th percentile cap = $148,200 and the 99th = $612,500.\n(4) *Primary descriptive analysis:* report the raw arithmetic mean (with its instability flagged) alongside a 99th-\npercentile-winsorized mean, stating that winsorization reduced the all-cause mean PPPY from $41,300 to $35,800 (−13.3%),\naffected 1.0% of patients, and removed 13.3% of total spend. (5) *Primary comparative analysis:* a two-part model — a\nlogistic model for the probability of any positive cost (near-universal here, so the second part dominates) and a gamma\nGLM with log link for positive cost — adjusted by the propensity score (1:1 matching or overlap weighting on baseline\ncovariates measured in the 12-month lookback), yielding an adjusted incremental cost with a bias-corrected bootstrap CI.\n(6) *Sensitivity:* repeat the GLM with the positive tail winsorized at the 99th and at the 95th percentile, and as\nOLS-on-log with covariate-specific smearing retransformation; report whether the sign and significance of the\nincremental cost are stable. 
(7) *Reporting (CHEERS):* state the primary estimand (adjusted arithmetic-mean incremental\ncost), the exact dollar caps, the percent of patients and spend affected, and the full sensitivity grid.\n\n**Interpreting the output**\n\nA cost analysis reports a raw arithmetic mean of $52,920 per patient. One patient (1 of 10, or 10% of\nthe sample) had a $500,000 catastrophic claim. After applying a 90th-percentile winsorization cap of\n$4,500, the winsorized mean drops to $3,370, a reduction of 93.6% from the unadjusted figure.\n\n*(1) Formal interpretation.* The raw mean ($52,920) and the winsorized mean ($3,370) estimate different\nquantities. The raw mean is an unbiased estimator of the true population mean under this cost\ndistribution, right tail included, but it is highly sensitive to a single catastrophic claimant\nwhose $500,000 cost accounts for about 94% of all spend in the sample ($500,000 of $529,200). The winsorized mean replaces values\nabove $4,500 with $4,500, reducing variance substantially but biasing the estimated mean downward by\nconstruction. Winsorization does not remove the patient; it changes the estimand to \"cost in a\npopulation where tail events are capped.\" The $500,000 claim is real spend, not a data error; trimming\nit entirely would change the study population, not just the statistic.\n\n*(2) Practical interpretation.* When presenting cost results to a payer, report both figures and state\nexplicitly what the winsorization rule did: \"Winsorizing at the 90th-percentile cap of $4,500 affected\n1 of 10 patients (10%) and reduced the mean by 93.6%. The cap was pre-specified in the SAP.\" This\ntransparency allows the reviewer to judge which estimate best matches their population's risk-pooling\nand avoids the appearance of cherry-picking.",
    "primary_category": "Health_Economic",
    "tags": [
      "health_economic",
      "cost-outlier-handling",
      "winsorization",
      "trimming",
      "two-part-models",
      "gamma-glm",
      "quantile-regression",
      "skewed-cost-data",
      "costs-vs-hcru"
    ],
    "applies_to_study_types": [
      "claims_analysis",
      "ehr_study",
      "registry_linkage",
      "comparative_effectiveness",
      "cost_effectiveness_analysis"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1002/hec.1653",
        "url": "https://doi.org/10.1002/hec.1653",
        "citation_text": "Mihaylova B, Briggs A, O'Hagan A, Thompson SG. Review of statistical methods for analysing healthcare resources and costs. Health Economics. 2011;20(8):897-916.",
        "year": 2011,
        "authors_short": "Mihaylova et al.",
        "notes": "Canonical review of methods for skewed, heavy-tailed cost data, contrasting raw means, transformation/retransformation, two-part models, and GLMs, and the conditions under which each is appropriate."
      },
      {
        "role": "explain",
        "doi": "10.1016/S0167-6296(01)00086-8",
        "url": "https://doi.org/10.1016/S0167-6296(01)00086-8",
        "citation_text": "Manning WG, Mullahy J. Estimating log models: to transform or not to transform? Journal of Health Economics. 2001;20(4):461-494.",
        "year": 2001,
        "authors_short": "Manning & Mullahy",
        "notes": "Decision framework for choosing between log-OLS (with retransformation) and gamma/GLM approaches based on the skewness and heteroscedasticity of cost residuals; the basis for preferring gamma-log GLM in skewed cost data."
      },
      {
        "role": "demonstrate",
        "doi": "10.1016/S0167-6296(98)00025-3",
        "url": "https://doi.org/10.1016/S0167-6296(98)00025-3",
        "citation_text": "Manning WG. The logged dependent variable, heteroscedasticity, and the retransformation problem. Journal of Health Economics. 1998;17(3):283-295.",
        "year": 1998,
        "authors_short": "Manning",
        "notes": "Demonstrates why naive back-transformation of log-cost models is biased under heteroscedasticity and how covariate-specific smearing corrects it; the technical justification for modeling the mean directly."
      },
      {
        "role": "demonstrate",
        "doi": "10.1016/j.jval.2021.11.1351",
        "url": "https://doi.org/10.1016/j.jval.2021.11.1351",
        "citation_text": "Husereau D, Drummond M, Augustovski F, et al. Consolidated Health Economic Evaluation Reporting Standards 2022 (CHEERS 2022) Statement. Value in Health. 2022;25(1):3-9.",
        "year": 2022,
        "authors_short": "Husereau et al.",
        "notes": "Reporting standard requiring transparent disclosure of cost-data handling, including outlier/winsorization rules, affected proportions, and sensitivity analyses, in health economic evaluations."
      }
    ],
    "plain_language_summary": "Healthcare cost data almost always includes a small number of patients with catastrophic bills — a transplant, a prolonged ICU stay, a rare disease therapy — that can drag the average cost sky-high and make it look wildly different from what most patients actually cost. Cost outlier handling is a pre-planned rule for deciding what to do with those extreme values before you run any analysis. The two most common tools are winsorization, which caps extreme costs at a chosen ceiling rather than deleting the patient, and trimming, which removes those patients entirely and studies a different, less-severe group. Choosing and documenting your rule in advance keeps your results stable and reviewable — a mean that swings 40% depending on whether one patient is included is not a trustworthy number.",
    "key_terms": [
      {
        "term": "winsorization",
        "definition": "Replacing any cost above a chosen ceiling (such as the 99th-percentile value) with that ceiling value, so the patient stays in the dataset but their cost no longer pulls the average to extremes."
      },
      {
        "term": "trimming",
        "definition": "Removing patients whose costs exceed a threshold from the analysis entirely, which changes which population you are studying — you are no longer describing all patients, only the non-catastrophic ones."
      },
      {
        "term": "right-skewed distribution",
        "definition": "A cost pattern where most patients have low-to-moderate costs but a small tail of patients has very high costs, pulling the average far above the amount a typical patient actually spends."
      },
      {
        "term": "percentile",
        "definition": "A cut-point that divides a ranked list of values: the 90th percentile is the value below which 90% of patients fall, so only the top 10% exceed it."
      },
      {
        "term": "arithmetic mean",
        "definition": "The ordinary average — add up all values and divide by the number of patients; it is highly sensitive to even one extremely large value."
      }
    ],
    "worked_example": {
      "scenario": "A researcher studying 10 patients who were hospitalized for a serious infection wants to report average 30-day all-cause costs. Nine patients had routine recoveries, but one patient developed a rare complication and required a prolonged ICU stay and a second surgery, resulting in a $500,000 bill. The researcher wants to know how much that single catastrophic case distorts the average — and what the average looks like after winsorizing at the 90th percentile.",
      "dataset": {
        "caption": "30-day all-cause costs for 10 patients from a claims database. Each row is one patient.",
        "columns": [
          "patient_id",
          "total_cost_usd"
        ],
        "rows": [
          [
            "P01",
            2800
          ],
          [
            "P02",
            3100
          ],
          [
            "P03",
            4200
          ],
          [
            "P04",
            2600
          ],
          [
            "P05",
            3900
          ],
          [
            "P06",
            4500
          ],
          [
            "P07",
            2300
          ],
          [
            "P08",
            3700
          ],
          [
            "P09",
            2100
          ],
          [
            "P10",
            500000
          ]
        ]
      },
      "steps": [
        "Add up all 10 costs: 2800 + 3100 + 4200 + 2600 + 3900 + 4500 + 2300 + 3700 + 2100 + 500000 = 529,200.",
        "Divide by 10 to get the raw arithmetic mean: 529,200 / 10 = $52,920. This is almost entirely driven by P10 — none of the other nine patients spent anywhere near that much.",
        "To winsorize at the 90th percentile, sort the 10 costs from lowest to highest: $2,100 / $2,300 / $2,600 / $2,800 / $3,100 / $3,700 / $3,900 / $4,200 / $4,500 / $500,000.",
        "Using the nearest-rank rule, the 90th percentile of 10 values is the 9th value in that sorted list: $4,500. That becomes the cap.",
        "Replace P10's cost ($500,000) with the cap ($4,500). The other nine patients are unchanged because all of their costs are at or below $4,500.",
        "Add up the 10 winsorized costs: 2800 + 3100 + 4200 + 2600 + 3900 + 4500 + 2300 + 3700 + 2100 + 4500 = 33,700.",
        "Divide by 10 to get the winsorized mean: 33,700 / 10 = $3,370."
      ],
      "result": "Raw mean = $52,920; winsorized mean (90th-percentile cap of $4,500) = $3,370. Winsorization reduced the mean by $49,550 — a 94% drop — by capping the single catastrophic patient at $4,500 instead of $500,000. One patient out of 10 (10%) was affected. The winsorized mean is a much better description of what a typical patient in this group costs, but the analyst must also report the raw mean and the cap rule so readers know the full picture."
    },
    "prerequisites": [
      "healthcare-costs-pppm-pppy-pmpm",
      "hcru-healthcare-resource-utilization",
      "health-economic-modeling-methods-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Percentile winsorization (95th / 99th / 99.5th)",
        "description": "Replace patient-level (or annualized PPPY) costs above the chosen empirical percentile with the percentile value, computed in the analytic cohort or a fixed reference population. 95th is aggressive (more bias, more stability); 99th/99.5th retain more legitimate high-cost signal. Sample size and patient identity are preserved.",
        "edge_cases": [
          "The mean is biased downward by construction; never present a winsorized mean as the budget-impact estimand without also reporting the uncapped mean.",
          "The dollar value of the cap is sample- and calendar-year-specific (price inflation, formulary/therapy changes); report it explicitly and recompute per analysis period.",
          "In small or rare-disease cohorts even the 99.5th cap can leave one influential point; consider modeling instead."
        ],
        "data_source_notes": "claims: compute the cap on FFS-observable annualized cost (drop MA months) AFTER attribution; report N and percent of patients affected and percent of total spend removed."
      },
      {
        "name": "Fixed-dollar threshold trimming or capping",
        "description": "Hard cap at a pre-chosen dollar amount (e.g., $100k, $250k per patient-year) or exclude patients above the threshold. Capping bounds leverage like winsorization; exclusion redefines the population.",
        "edge_cases": [
          "Exclusion changes the target population and estimand and should be described as a cohort redefinition, not data cleaning.",
          "Arbitrary and calendar-sensitive; a threshold sensible pre-CAR-T is far too low for modern oncology and too loose for primary care."
        ],
        "data_source_notes": "claims: justify thresholds against a real benchmark (e.g., stop-loss/reinsurance attachment points) and use only as a labeled sensitivity, not as the primary rule."
      },
      {
        "name": "Two-part model with gamma-log GLM (preferred for comparative inference)",
        "description": "Part 1: logistic/probit model for the probability of any positive cost in the window. Part 2: GLM with gamma (or inverse-Gaussian) family and log link on the positive-cost subsample, giving the arithmetic-mean cost directly and avoiding retransformation. Combine with PS matching/weighting for adjusted incremental cost.",
        "edge_cases": [
          "The mean–variance assumption of the gamma must be checked (modified Park test to select the family; Pregibon link test; Pearson/deviance residuals).",
          "When almost all patients have positive cost, the first part is near-degenerate and the gamma GLM alone suffices.",
          "Marginal/incremental effects require recycled-prediction or bootstrap standard errors, not the raw GLM coefficient."
        ],
        "data_source_notes": "claims: standard in modern HEOR; fit on the full PS-balanced sample, report both parts and the bootstrap CI for the adjusted incremental cost."
      },
      {
        "name": "Quantile / robust regression",
        "description": "Median (or upper-quantile) regression, or M-estimation, to estimate covariate effects on a tail-insensitive location parameter instead of the mean. Answers a different question (the typical patient, or a specified quantile) rather than the budget-relevant mean.",
        "edge_cases": [
          "The median ignores precisely the tail spend that drives payer budgets; useful as a complement, not a replacement for a mean-based budget-impact estimand.",
          "Upper-quantile (e.g., 90th) regression can characterize the high-cost segment directly when that is the policy target."
        ],
        "data_source_notes": "claims: pair median regression with a mean-based GLM so reviewers see both the typical and the expected cost."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "No explicit outlier handling (raw means or OLS)",
        "pros_of_this": "Reproducible across resamples and databases; finite-sample variance is controlled; leverage points cannot single-handedly drive coefficients.",
        "cons_of_this": "Winsorization/trimming introduces deliberate bias (winsor downward in the mean; trimming changes the population); modeling adds distributional assumptions.",
        "when_to_prefer": "Always specify a handling rule for skewed cost endpoints; reserve the raw mean only as a labeled, transparency comparator."
      },
      {
        "compared_to": "hcru-healthcare-resource-utilization",
        "pros_of_this": "Cost intensity (dollars per event) is what creates the heavy tail; explicit winsorization or two-part gamma-log GLMs target the monetary scale that count models ignore.",
        "cons_of_this": "More assumption-laden (distributional choices on positive costs) and more sensitive to payment-model and price changes than count endpoints.",
        "when_to_prefer": "For dollar endpoints use dedicated cost methods; for pure volume questions use Poisson/NB count models; report both when both matter."
      },
      {
        "compared_to": "all-cause-vs-attributable-costs-rwe",
        "pros_of_this": "Outlier handling is applied downstream of, and separately for, each attribution series, so an unrelated catastrophic admission can be retained in all-cause yet kept out of the attributable series.",
        "cons_of_this": "Loose attribution imports unrelated outliers into the attributable series; aggressive attribution can strip legitimate disease-driven tail spend.",
        "when_to_prefer": "Fix the attribution rule first, then apply the outlier rule to each resulting series and run joint sensitivities."
      },
      {
        "compared_to": "missing-data-trimming-winsorization-rwe",
        "pros_of_this": "Shares the winsorization/trimming toolkit; cost outliers frequently coincide with extreme propensity weights, so coordinated handling stabilizes the weighted mean.",
        "cons_of_this": "Weight trimming changes the population (positivity); cost winsorization changes the outcome distribution — the two act on different objects and must be reported separately.",
        "when_to_prefer": "Stabilize/trim weights first, then winsorize cost on the weighted data (or report both orders) and document the impact on the estimand."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Compute caps on fee-for-service-observable, attribution-resolved cost after netting reversals and applying a claims-runout window; drop Medicare Advantage person-time whose paid amounts are unreliable. Winsorize the annualized PPPY value when follow-up is incomplete, not the raw window total. Report the dollar caps, percent of patients affected, and percent of total spend removed under the primary and every sensitivity rule.",
      "ehr": "Native EHR cost is charges/chargemaster, not paid/allowed; link to claims for true paid amounts before any outlier rule. Encounter-driven capture misses out-of-network high-cost events, which can deflate the apparent tail.",
      "registry": "Use registry severity/stage to adjudicate whether a high-cost patient is clinically justified (retain) versus a linkage/coding artifact (correct); link to claims for dollars and to a death index so end-of-life spikes are not lost.",
      "linked": "Ideal substrate (severity to justify the tail + claims completeness + mortality), but linkage selection and date discrepancies must be reconciled, and differential end-of-life cost under a competing risk of death by arm must be handled identically across arms."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\nimport pandas as pd\nimport statsmodels.api as sm\nimport statsmodels.formula.api as smf\n\nCOVARS = [\"age\", \"female\", \"charlson\", \"prior_cost\"]  # measured in the baseline lookback only\n\ndef annualize(df: pd.DataFrame) -> pd.DataFrame:\n    df = df.copy()\n    df[\"pppy\"] = df[\"total_cost\"] * 365.25 / df[\"followup_days\"]\n    return df\n\ndef winsorize(s: pd.Series, upper_pct: float) -> tuple[pd.Series, float, dict]:\n    cap = np.quantile(s, upper_pct)                       # e.g. 0.99 -> 99th percentile dollar cap\n    wins = s.clip(upper=cap)\n    impact = {\n        \"cap\": float(cap),\n        \"pct_patients_affected\": float((s > cap).mean()),\n        \"pct_spend_removed\": float((s.sum() - wins.sum()) / s.sum()),\n        \"mean_raw\": float(s.mean()),\n        \"mean_winsorized\": float(wins.mean()),\n    }\n    return wins, cap, impact\n\ndef twopart_incremental_cost(df: pd.DataFrame, n_boot: int = 1000, seed: int = 1):\n    # Part 1: P(any cost) via weighted logistic; Part 2: gamma-log GLM on positive cost.\n    # Adjusted arithmetic-mean cost per arm = recycled prediction averaged over the cohort.\n    df = df.assign(any_cost=(df[\"pppy\"] > 0).astype(int))\n    rhs = \"arm + \" + \" + \".join(COVARS)\n\n    def fit_predict(d):\n        m1 = smf.glm(\"any_cost ~ \" + rhs, d, family=sm.families.Binomial(),\n                     freq_weights=d[\"ps_weight\"]).fit()\n        pos = d[d[\"pppy\"] > 0]\n        m2 = smf.glm(\"pppy ~ \" + rhs, pos, family=sm.families.Gamma(sm.families.links.Log()),\n                     freq_weights=pos[\"ps_weight\"]).fit()\n        mu = {}\n        for a in (\"A\", \"B\"):\n            d_a = d.assign(arm=a)\n            mu[a] = float(np.average(m1.predict(d_a) * m2.predict(d_a), weights=d[\"ps_weight\"]))\n        return mu[\"A\"] - mu[\"B\"]\n\n    point = fit_predict(df)\n    rng = np.random.default_rng(seed)\n    boot = [fit_predict(df.sample(frac=1, replace=True, 
random_state=int(rng.integers(1e9))))\n            for _ in range(n_boot)]\n    lo, hi = np.percentile(boot, [2.5, 97.5])\n    return {\"incremental_cost\": point, \"ci_low\": float(lo), \"ci_high\": float(hi)}\n\ndf = annualize(costs)\nfor pct in (0.95, 0.99):                                   # sensitivity grid on the descriptive mean\n    _, _, impact = winsorize(df[\"pppy\"], pct)\n    print(f\"winsor {int(pct*100)}th:\", impact)\nprint(twopart_incremental_cost(df))                       # primary comparative estimand (uncapped)",
        "description": "Winsorization, descriptive impact, and a two-part (logistic + gamma-log GLM) adjusted incremental cost on a\npatient-period cost table. Required input (one row per patient, already attribution-resolved and FFS-observable):\n  costs : person_id, arm ('A'/'B'), followup_days, total_cost (>=0, summed paid amounts over the window),\n          ps_weight (overlap/IPT weight from the upstream propensity model),\n          and the baseline covariate columns listed in COVARS.\nCosts are annualized to PPPY before winsorization so partial follow-up is comparable. No data are fabricated here;\nfit downstream on the PS-balanced sample produced by the cohort/PS step.",
        "dependencies": [
          "pandas",
          "numpy",
          "statsmodels"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(dplyr)\nlibrary(boot)\n\ncovars <- c(\"age\", \"female\", \"charlson\", \"prior_cost\")\nrhs    <- paste(c(\"arm\", covars), collapse = \" + \")\n\nannualize <- function(d) mutate(d, pppy = total_cost * 365.25 / followup_days)\n\nwinsorize_impact <- function(x, upper_pct) {\n  cap  <- quantile(x, upper_pct, names = FALSE)            # dollar cap at the chosen percentile\n  wins <- pmin(x, cap)\n  data.frame(percentile = upper_pct, cap = cap,\n             pct_patients_affected = mean(x > cap),\n             pct_spend_removed     = (sum(x) - sum(wins)) / sum(x),\n             mean_raw = mean(x), mean_winsorized = mean(wins))\n}\n\ntwopart_inc <- function(d) {\n  d$any_cost <- as.integer(d$pppy > 0)\n  m1  <- glm(reformulate(c(\"arm\", covars), \"any_cost\"),\n             family = binomial(), weights = ps_weight, data = d)\n  pos <- d[d$pppy > 0, ]\n  m2  <- glm(reformulate(c(\"arm\", covars), \"pppy\"),\n             family = Gamma(link = \"log\"), weights = pos$ps_weight, data = pos)\n  mu <- sapply(c(\"A\", \"B\"), function(a) {                  # recycled prediction per arm\n    da <- transform(d, arm = a)\n    weighted.mean(predict(m1, da, type = \"response\") * predict(m2, da, type = \"response\"),\n                  w = d$ps_weight)\n  })\n  unname(mu[\"A\"] - mu[\"B\"])\n}\n\nd <- annualize(costs)\nprint(do.call(rbind, lapply(c(0.95, 0.99), function(p) winsorize_impact(d$pppy, p))))\n\nbs <- boot(d, function(data, i) twopart_inc(data[i, ]), R = 1000)\nprint(boot.ci(bs, type = \"bca\"))                           # adjusted incremental cost CI",
        "description": "Winsorization impact table and a two-part (logistic + gamma-log GLM) adjusted incremental cost with bias-corrected\nbootstrap. Input mirrors the Python version:\n  costs : person_id, arm ('A'/'B'), followup_days, total_cost, ps_weight, and baseline covariates in `covars`.\nPPPY annualization happens before winsorization so partial follow-up is comparable across arms.",
        "dependencies": [
          "dplyr",
          "boot"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* Annualize to PPPY so partial follow-up is comparable before any cap. */\ndata costs2;\n  set work.costs;\n  pppy = total_cost * 365.25 / followup_days;\nrun;\n\n/* 95th and 99th percentile dollar caps on the analytic distribution. */\nproc univariate data=costs2 noprint;\n  var pppy;\n  output out=caps pctlpts=95 99 pctlpre=p_;\nrun;\n\n/* Winsorize at the 99th and quantify impact: % patients affected and % spend removed. */\ndata wins; if _n_=1 then set caps; set costs2;\n  cap99 = p_99;\n  pppy_w = min(pppy, cap99);\n  over   = (pppy > cap99);\nrun;\nproc sql;\n  select mean(over)                                  as pct_patients_affected,\n         (sum(pppy) - sum(pppy_w)) / sum(pppy)       as pct_spend_removed,\n         mean(pppy)                                  as mean_raw,\n         mean(pppy_w)                                as mean_winsorized\n  from wins;\nquit;\n\n/* Primary comparative estimand: gamma-log GLM (positive-cost part) on the PS-weighted sample. */\nproc genmod data=costs2;\n  where pppy > 0;\n  class arm (ref='B') female;\n  weight ps_weight;                                  /* overlap/IPT weight from the upstream PS model */\n  model pppy = arm age female charlson prior_cost / dist=gamma link=log;\n  lsmeans arm / diff exp cl;                         /* exp() -> arm-specific predicted mean cost */\nrun;\n\n/* Median (tail-insensitive) sensitivity via quantile regression. */\nproc quantreg data=costs2;\n  class arm;\n  model pppy = arm age charlson prior_cost / quantile=0.5;\nrun;",
        "description": "Percentile caps, a winsorization-impact summary, and a gamma-log GLM (the positive-cost part of the two-part model)\nwith PS weights. Required input (post data-management, one row per patient):\n  work.costs : person_id, arm ('A'/'B'), followup_days, total_cost, ps_weight, and baseline covariates.\nPROC GENMOD with DIST=GAMMA LINK=LOG estimates the arithmetic-mean cost on the natural scale (no retransformation);\nLSMEANS ... EXP gives arm-specific predicted means for the incremental contrast.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Raw[Patient-period cost table<br/>FFS-observable, attribution-resolved] --> Ann[Annualize to PPPY<br/>cost x 365.25 / followup_days]\n  Ann --> Desc[Describe tail: top 1-2% share,<br/>95th / 99th percentile dollar caps]\n  Desc --> Est{Estimand?}\n  Est -->|Descriptive mean<br/>for a table| Wins[Winsorize at 99th<br/>report cap, % patients, % spend]\n  Est -->|Comparative / budget<br/>arithmetic-mean cost| TP[Two-part model:<br/>logistic P_any_cost + gamma-log GLM<br/>on PS-balanced sample]\n  Wins --> Sens[Sensitivity grid:<br/>none / 95th / 99th]\n  TP --> Sens\n  Sens --> Report[Report per CHEERS:<br/>caps, affected %, full sensitivity]",
        "caption": "Decision flow for cost outlier handling. The estimand determines the tool — winsorization for a stable descriptive mean, a two-part gamma-log GLM for the arithmetic-mean comparative/budget estimand — and every rule is reported with its dollar cap and affected proportions.",
        "alt_text": "Flowchart from an FFS-observable cost table through PPPY annualization and tail description to an estimand decision splitting into winsorization versus a two-part gamma-log GLM, then a sensitivity grid and CHEERS reporting.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  A[Skewed cost distribution<br/>heavy right tail] --> B[Winsorize<br/>cap tail to percentile]\n  A --> C[Trim / exclude<br/>delete tail]\n  A --> D[Model the full distribution<br/>gamma-log GLM / quantile]\n  B --> B1[Keeps N and population<br/>mean biased DOWN]\n  C --> C1[Changes population<br/>and estimand]\n  D --> D1[Keeps all dollars<br/>down-weights leverage via likelihood]",
        "caption": "The three distinct levers for the cost tail and what each changes. Winsorization preserves the population but biases the mean downward; trimming redefines the population; distribution-aware modeling keeps every dollar and manages leverage through the model.",
        "alt_text": "Diagram contrasting winsorization, trimming, and distribution-aware modeling of a skewed cost distribution and the distinct effect of each on sample, population, and the mean.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "is_variant_of",
        "target_slug": "health-economic-modeling-methods-rwe",
        "notes": "Cost outlier handling is one operational component of the broader real-world cost-modeling method family."
      },
      {
        "relation_type": "used_with",
        "target_slug": "healthcare-costs-pppm-pppy-pmpm",
        "notes": "Outlier handling is required for stable PPPM/PPPY means and incremental cost; extreme spend distorts cost means, ratios, and regressions far more than it distorts most utilization counts."
      },
      {
        "relation_type": "used_with",
        "target_slug": "hcru-healthcare-resource-utilization",
        "notes": "High-utilization patients drive both count and cost outliers; pre-specify parallel handling (NB or winsorized rates for HCRU counts; winsorization plus two-part gamma-log GLM for dollars)."
      },
      {
        "relation_type": "see_also",
        "target_slug": "all-cause-vs-attributable-costs-rwe",
        "notes": "Fix the attribution rule first, then apply the outlier rule to each resulting series; an unrelated catastrophic cost may be retained in all-cause but excluded from the attributable series."
      },
      {
        "relation_type": "see_also",
        "target_slug": "missing-data-trimming-winsorization-rwe",
        "notes": "Shares the winsorization/trimming toolkit and sensitivity ethos; cost outliers frequently coincide with extreme propensity weights, so coordinate weight stabilization with cost winsorization and report both."
      },
      {
        "relation_type": "see_also",
        "target_slug": "burden-of-disease-cost-of-illness",
        "notes": "Bottom-up cost-of-illness estimates must handle outliers to avoid overstating per-patient or national burden from a handful of catastrophic cases."
      },
      {
        "relation_type": "see_also",
        "target_slug": "poisson-negative-binomial-count-models",
        "notes": "For the utilization counts underlying cost, NB models provide robustness to count overdispersion; dollar costs instead require two-part plus gamma-log GLM or explicit winsorization."
      },
      {
        "relation_type": "affects",
        "target_slug": "icer-net-monetary-benefit-rwe",
        "notes": "ICER and net-monetary-benefit calculations use arithmetic-mean costs and are highly sensitive to outlier handling; present winsorization-level sensitivities alongside the base-case ICER/NMB."
      }
    ],
    "aliases": [
      "cost outlier handling",
      "winsorization of costs",
      "trimming high-cost patients",
      "two-part cost models",
      "robust cost regression in RWE",
      "handling skewed healthcare cost data"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "hta",
      "fda",
      "pqa-cms"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "cost-utility",
    "name": "Cost-Utility Analysis (CUA)",
    "short_definition": "A form of full economic evaluation that compares two or more interventions on incremental cost per quality-adjusted life-year (QALY) gained, expressing health outcomes in a single preference-weighted, mortality-and-morbidity-combining metric so that disparate interventions can be ranked against a common cost-effectiveness threshold.",
    "long_description": "**Cost-utility analysis (CUA)** is the dominant form of full economic evaluation in HTA. It values an\nintervention's benefit as the **quality-adjusted life-year (QALY)** — life-years weighted by a utility (preference)\nscore anchored at 0 (dead) and 1 (full health) — and divides the *incremental* cost of one strategy over its\ncomparator by the *incremental* QALYs gained, yielding the **incremental cost-effectiveness ratio (ICER)**:\n`ICER = (Cost_A - Cost_B) / (QALY_A - QALY_B)`. The result is compared against a cost-effectiveness threshold\n(e.g., a willingness-to-pay of $100,000-$150,000/QALY in the US, or NICE's £20,000-£30,000/QALY) to inform\nreimbursement and coverage. CUA is the analysis NICE, CDA-AMC (formerly CADTH), and most HTA bodies require, because the QALY makes\na cancer drug, a hip replacement, and a statin commensurable on one axis.\n\n**Core conceptual distinction.** CUA is one branch of full economic evaluation, defined by *how it values the\noutcome*. (1) *vs cost-effectiveness analysis (CEA)*: a generic CEA expresses benefit in a natural clinical unit\n(life-years gained, mmHg lowered, events averted); CUA specifically uses the QALY, a *utility-weighted* outcome\nthat folds quality of life and survival into one number. Every CUA is a CEA, but not every CEA is a CUA. (2) *vs\ncost-benefit analysis (CBA)*: CBA monetizes health itself (willingness-to-pay, value of a statistical life),\nproducing a net monetary or benefit-cost ratio and permitting comparison across sectors, but forcing an explicit\ndollar value on a life-year that many find ethically fraught. (3) *vs cost-minimization analysis (CMA)*: CMA\napplies only when outcomes are demonstrably equivalent and reduces to comparing costs. The estimand of a CUA is an\n*incremental, comparative* quantity — never a standalone \"this drug costs $X/QALY\" without a named comparator. 
The\ndecision rule is best stated on **incremental net monetary benefit (INMB = λ·ΔQALY − ΔCost)** at threshold λ (adopt the new strategy when INMB > 0), which avoids the\nquadrant/sign pathologies of the ratio (a bare ICER is ambiguous when ΔQALY < 0, and a dominant strategy and a\ndominated one can produce identical negative ratios despite representing opposite decisions, because ΔCost and ΔQALY have opposite signs in both cases).\n\n**Pros, cons, and trade-offs** (CUA vs the other full-evaluation forms).\n- **vs plain CEA (clinical-unit outcome):** CUA's QALY captures morbidity *and* mortality and is comparable across\n  diseases, which is exactly why HTA bodies mandate it. Cost: the QALY depends on a utility elicitation\n  (EQ-5D/SF-6D/standard gamble/time trade-off) and a country-specific value set, so two analysts can get different\n  QALYs from the same clinical data; for a within-indication safety contrast where utilities barely move, a\n  natural-unit CEA can be cleaner and less assumption-laden. **Prefer CUA** whenever a payer/HTA decision spans\n  conditions or when quality of life is materially affected.\n- **vs cost-benefit analysis:** CUA sidesteps explicitly pricing a life-year and aligns with how health systems\n  actually budget (fixed envelope, opportunity cost in QALYs). Cost: it cannot compare health spending against\n  non-health investments and assumes a known threshold. **Prefer CUA** for within-health-system reimbursement;\n  reach for CBA only when cross-sectoral or societal monetary comparison is the genuine question.\n- **vs cost-minimization:** CMA is simpler but valid *only* under proven outcome equivalence; assuming equivalence\n  that does not hold silently biases toward the cheaper arm. **Prefer CUA** unless non-inferiority on the relevant\n  outcome is firmly established.\n\n**When to use** — decision rules. 
Reimbursement/coverage submissions to HTA bodies (NICE, CDA-AMC, PBAC, and the US Institute for Clinical and Economic Review) where a common\ncross-disease metric is required; any comparison in which an intervention trades survival against quality of life\n(oncology, palliative care, chronic disease) so a life-years-only CEA would understate or overstate value;\npopulating a decision-analytic (Markov/partitioned-survival) or trial-based model where utilities are measurable.\nUse the reference-case conventions of the relevant jurisdiction: stated perspective (healthcare-system vs\nsocietal), a lifetime or sufficiently long horizon, and **discounting of both costs and QALYs** (commonly 3% US,\n3.5% UK) so that future health and money are valued consistently.\n\n**When NOT to use — and when it is actively misleading or dangerous** (decision rules below).\n- **Outcomes are equivalent.** If equivalence (or non-inferiority on the relevant outcome) is established, ΔQALY is\n  close to zero and the ICER has a near-zero denominator dominated by noise;\n  use CMA. Reporting an ICER on a near-zero ΔQALY produces wild, uninterpretable ratios.\n- **ΔQALY is negative or near zero.** When the more effective-looking arm is actually worse or tied on QALYs, the\n  ICER changes sign and quadrant and becomes meaningless. 
Always decide on NMB / the cost-effectiveness plane,\n  never on the bare ratio, and present the cost-effectiveness acceptability curve (CEAC).\n- **No defensible utility data.** Mapping algorithms or borrowed utilities from a different population/severity can\n  dominate the result; if utilities are weak and the decision does not require cross-disease comparison, a\n  natural-unit CEA is more honest.\n- **Short horizon truncating downstream QALYs.** A within-trial CUA that stops at trial end omits lifetime QALY\n  differences and can reverse the decision; either model the horizon or state the limitation prominently.\n- **Cross-jurisdiction transfer without re-basing.** Unit costs, utility value sets, comparators, and thresholds\n  are country-specific; a UK reference-case ICER does not transfer to the US without re-costing and re-thresholding.\n\n**Data-source operational depth** across substrates. RWE increasingly feeds the cost and resource-use side of CUA, while utilities\noften come from instruments or the literature; each substrate has distinct failure modes.\n- **Claims (FFS):** Excellent for *direct medical cost* — paid amounts (not charges) on medical and pharmacy\n  claims map cleanly to resource use via `allowed_amount`/`paid_amount`, `fill_date`, `days_supply`. 
Failure modes:\n  claims carry **no utility/QoL data**, so the QALY must be imported from instruments or a mapping algorithm;\n  out-of-pocket, over-the-counter, and indirect (productivity) costs are invisible, biasing a societal-perspective\n  CUA; and **Medicare Advantage (MA) person-time lacks adjudicated FFS claims**, so MA enrollees show artifactually\n  near-zero medical cost — restrict the costing window to FFS (Parts A/B) + Part D enrollment or you will\n  underestimate cost differentially by plan type.\n- **EHR:** Captures clinical detail and sometimes PROs (which can yield utilities directly), but records *charges\n  or activity*, not *paid amounts*; cost must be approximated via cost-to-charge ratios or fee schedules, and\n  care delivered outside the system (the differential leakage problem) is missing — biasing cost downward\n  differentially for sicker, mobile patients.\n- **Registry:** Strong for adjudicated outcomes, disease severity, and often EQ-5D collection (utilities), but\n  typically thin on complete cost; link to claims for the full resource picture and to a death index to anchor\n  survival, which drives life-years and therefore QALYs.\n- **Linked claims–EHR–registry:** The ideal substrate — registry/EHR utilities + claims-complete cost + reliable\n  mortality — but linkage selection (only the linkable subset) and date discrepancies across order/fill/service\n  dates must be reconciled before per-period costing windows are drawn. **Censored cost is not missing-at-random**:\n  patients with longer follow-up accrue more cost, so naïvely averaging observed cost is biased; use\n  inverse-probability-of-censoring-weighted (Bang-Tsiatis) or Lin-style partitioned mean cost estimators.\n\n**Worked claims example.** Question: lifetime cost per QALY of a new oral oncolytic vs standard-of-care among\nadults with metastatic disease, costing side from a US commercial + Medicare FFS database, utilities from a\npublished EQ-5D study. 
(1) Cohort & enrollment: require continuous medical + Part D (or commercial pharmacy)\nenrollment with no MA-only person-time across each costing period, so paid amounts are observed rather than missing. (2)\nIndex & arm: `index_date` = first qualifying dispensing of the oral oncolytic (`fill_date`) or first\nstandard-of-care administration; assign arm from that claim. (3) Resource use & cost: sum `paid_amount` across all\nmedical + pharmacy claims in each follow-up period, **inflation-adjusted to a common cost year** and **discounted\nat 3%/yr** from index. Because patients are censored at disenrollment or data-end (death is a terminal event that\nends cost accrual, not a censoring event), do **not** average raw\nobserved cost — estimate mean lifetime cost per arm with an IPCW/partitioned estimator over monthly intervals so\nthe within-interval cost is reweighted for the probability of still being observed. (4) QALYs: derive life-years\nper arm from a survival model fit to time from index to death (claims death flag plus a linked death index), then\nweight each interval by the EQ-5D utility for that health state (on-treatment, progression, terminal),\ndiscounting QALYs at 3%/yr. (5) Incremental result: `ΔCost / ΔQALY` = ICER; report alongside NMB at λ =\n$100k and $150k/QALY, the cost-effectiveness plane, a CEAC from a probabilistic (Monte Carlo) sensitivity\nanalysis over cost, utility, and survival parameters, and one-way sensitivity (tornado) on discount rate, utility\nsource, and the costing-window definition. (6) Sensitivity: rerun with alternative MA-exclusion rules and\ncost-censoring methods, since both can move the ICER materially.\n\n**Interpreting the output**\n\nThe structured worked example for this entry (Drug A vs Drug B in plaque psoriasis) returns an ICER of $60,000 per QALY gained: Drug A costs $24,000 more and produces 0.40\nadditional QALYs over the comparator, so $24,000 / 0.40 = $60,000/QALY.\n\n*(1) Formal interpretation.* The ICER is the ratio of incremental cost to incremental QALYs — it is not Drug A's\naverage cost per QALY in isolation. 
Each additional QALY purchased by choosing Drug A over the comparator costs\nthe payer $60,000. Because $60,000 is below the stated $100,000/QALY threshold, Drug A is cost-effective at this\nwillingness-to-pay level. The QALY unit is what distinguishes CUA from a plain CEA using natural units: QALYs\ncombine length and quality of life into a single preference-weighted metric, enabling cross-disease comparisons.\nA utility weight of 1.0 represents full health; 0.0 represents a state equivalent to death; values between reflect\nthe preference-weighted fraction of a life-year in that health state.\n\n*(2) Practical interpretation.* The payer is paying $60,000 for each additional quality-adjusted year of life\ngained by switching from comparator to Drug A. This falls below the plan's threshold, so Drug A clears the\ncost-effectiveness bar. The utility weights powering the QALY calculation deserve scrutiny: the source instrument\n(EQ-5D, SF-6D, HUI), the value set (UK, US), and whether utilities were directly measured or mapped from a disease\ninstrument all affect the QALY magnitude and therefore the ICER. A sensitivity analysis varying the utility\nsource is a mandatory component of any CUA submission.",
    "primary_category": "Economic_Evaluation",
    "tags": [
      "QALY",
      "ICER",
      "cost-effectiveness",
      "net-monetary-benefit",
      "health-economics",
      "HTA",
      "discounting",
      "utility-weights"
    ],
    "applies_to_study_types": [
      "cost_utility"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1001/jama.2016.12195",
        "url": "https://doi.org/10.1001/jama.2016.12195",
        "citation_text": "Sanders GD, Neumann PJ, Basu A, et al. Recommendations for Conduct, Methodological Practices, and Reporting of Cost-Effectiveness Analyses: Second Panel on Cost-Effectiveness in Health and Medicine. JAMA. 2016;316(10):1093-1103.",
        "year": 2016,
        "authors_short": "Sanders et al.",
        "notes": "The authoritative US reference case for CUA/CEA conduct — perspective, the QALY as the recommended outcome, discounting of both costs and effects, and the impact-inventory framework."
      },
      {
        "role": "explain",
        "doi": "10.1056/NEJM197703312961304",
        "url": "https://doi.org/10.1056/NEJM197703312961304",
        "citation_text": "Weinstein MC, Stason WB. Foundations of cost-effectiveness analysis for health and medical practices. New England Journal of Medicine. 1977;296(13):716-721.",
        "year": 1977,
        "authors_short": "Weinstein & Stason",
        "notes": "Foundational derivation of incremental cost-effectiveness reasoning and the rationale for combining morbidity and mortality into a single quality-adjusted outcome."
      },
      {
        "role": "demonstrate",
        "doi": "10.1016/j.jval.2021.11.1351",
        "url": "https://doi.org/10.1016/j.jval.2021.11.1351",
        "citation_text": "Husereau D, Drummond M, Augustovski F, et al. Consolidated Health Economic Evaluation Reporting Standards 2022 (CHEERS 2022) Statement. Value in Health. 2022;25(1):3-9.",
        "year": 2022,
        "authors_short": "Husereau et al.",
        "notes": "The CHEERS 2022 checklist — the expected reporting standard for QALY-based economic evaluations including perspective, horizon, discounting, and uncertainty analysis."
      },
      {
        "role": "explain",
        "doi": "10.1016/j.jval.2016.02.017",
        "url": "https://doi.org/10.1016/j.jval.2016.02.017",
        "citation_text": "Woods B, Revill P, Sculpher M, Claxton K. Country-Level Cost-Effectiveness Thresholds: Initial Estimates and the Need for Further Research. Value in Health. 2016;19(8):929-935.",
        "year": 2016,
        "authors_short": "Woods et al.",
        "notes": "Empirical, opportunity-cost-based estimates of the cost-effectiveness threshold against which an ICER is judged — clarifies that the threshold is a health-system parameter, not an arbitrary constant."
      },
      {
        "role": "use",
        "doi": "10.1007/s40273-014-0193-3",
        "url": "https://doi.org/10.1007/s40273-014-0193-3",
        "citation_text": "Faria R, Gomes M, Epstein D, White IR. A Guide to Handling Missing Data in Cost-Effectiveness Analysis Conducted Within Randomised Controlled Trials. PharmacoEconomics. 2014;32(12):1157-1170.",
        "year": 2014,
        "authors_short": "Faria et al.",
        "notes": "Practical handling of missing and censored cost and utility data — directly relevant when building a CUA on incomplete claims/EHR follow-up."
      }
    ],
    "plain_language_summary": "Cost-utility analysis (CUA) is a specific type of cost-effectiveness analysis that measures the health benefit of a treatment using a single combined score called a QALY (quality-adjusted life-year), which captures both how long a patient lives and how good their quality of life is during that time. To compare two treatments, you compute the extra cost of the new treatment divided by the extra QALYs it produces; this ratio is called the ICER (incremental cost-effectiveness ratio), and payers ask whether that cost per QALY gained is worth it relative to a threshold. CUA is the preferred method for health technology assessment bodies worldwide precisely because QALYs let decision-makers compare a cancer drug against a diabetes drug on the same yardstick. Unlike a plain cost-effectiveness analysis that might count events averted or blood-pressure points lowered, CUA always uses the QALY as its outcome unit.",
    "key_terms": [
      {
        "term": "QALY",
        "definition": "A quality-adjusted life-year is one year of life weighted by how healthy that year is, where 1.0 means perfect health and 0.0 means a state equivalent to death, so a year spent at 0.7 utility counts as 0.7 QALYs."
      },
      {
        "term": "utility score",
        "definition": "A number between 0 and 1 that represents how desirable a given health state is, usually measured with a standardized questionnaire such as the EQ-5D; it is the weight applied to each year of life when calculating QALYs."
      },
      {
        "term": "ICER",
        "definition": "The incremental cost-effectiveness ratio is the extra cost of a new treatment divided by the extra QALYs it produces compared with the alternative, expressed as dollars per QALY gained."
      },
      {
        "term": "cost-effectiveness threshold",
        "definition": "A benchmark dollar-per-QALY value set by a health system or payer (for example, roughly $100,000 to $150,000 per QALY in the US) above which a treatment is considered too expensive relative to its health benefit."
      }
    ],
    "worked_example": {
      "scenario": "A health plan is deciding whether to add a new oral medication for moderate-to-severe plaque psoriasis to its formulary. The new drug (Drug A) costs $38,000 per year and produces an average of 0.82 QALYs per patient per year based on a clinical trial that collected EQ-5D scores. The current standard of care (Drug B) costs $14,000 per year and produces 0.42 QALYs per patient per year. We want to know the cost per QALY gained by switching to Drug A, and whether it clears the plan's $100,000-per-QALY threshold.",
      "dataset": {
        "caption": "Arm-level summary inputs an analyst would build from per-patient data before running the ICER calculation.",
        "columns": [
          "arm",
          "annual_cost_usd",
          "avg_qalys_per_year",
          "qaly_components_note"
        ],
        "rows": [
          [
            "Drug A (new)",
            38000,
            0.82,
            "1 year x utility 0.82 = 0.82 QALYs"
          ],
          [
            "Drug B (standard of care)",
            14000,
            0.42,
            "1 year x utility 0.42 = 0.42 QALYs"
          ]
        ]
      },
      "steps": [
        "Compute the QALY for each arm: QALYs = years of follow-up multiplied by the utility score. Drug A: 1 year x 0.82 = 0.82 QALYs. Drug B: 1 year x 0.42 = 0.42 QALYs.",
        "Compute incremental cost: delta_cost = cost of Drug A minus cost of Drug B = $38,000 - $14,000 = $24,000.",
        "Compute incremental QALYs: delta_QALY = QALYs for Drug A minus QALYs for Drug B = 0.82 - 0.42 = 0.40 QALYs.",
        "Compute the ICER: ICER = delta_cost / delta_QALY = $24,000 / 0.40 = $60,000 per QALY gained.",
        "Compare to the threshold: $60,000 per QALY is below the plan's $100,000-per-QALY threshold, so Drug A is considered cost-effective."
      ],
      "result": "ICER = $24,000 / 0.40 QALYs = $60,000 per QALY gained. Because $60,000 is below the $100,000-per-QALY threshold, Drug A is cost-effective at this willingness-to-pay level. Note that cost-utility analysis is simply cost-effectiveness analysis with the QALY as the outcome unit specifically; that precise use of utility-weighted life-years is what distinguishes CUA from a plain cost-effectiveness analysis that might count symptom-free days or disease events instead."
    },
    "prerequisites": [
      "cost-effectiveness",
      "hrqol",
      "qaly-utility-mapping-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Trial-based (within-study) CUA",
        "description": "Costs and utilities are collected alongside a randomized or observational study over the observed follow-up, and the incremental cost per QALY is estimated directly from patient-level data.",
        "edge_cases": [
          "Follow-up is finite, so lifetime QALY/cost differences accruing after study end are truncated; the ICER can reverse if the horizon is too short.",
          "Cost is right-censored and accrues with time-at-risk, so simple mean cost is biased; use IPCW (Bang-Tsiatis) or partitioned (Lin) estimators rather than averaging observed cost."
        ],
        "data_source_notes": "claims/EHR: per-patient cost from paid amounts over censoring-aware intervals; utilities from a collected instrument or a mapping algorithm."
      },
      {
        "name": "Model-based (decision-analytic) CUA",
        "description": "A Markov or partitioned-survival model extrapolates costs and QALYs over a lifetime horizon using transition probabilities, utilities, and unit costs assembled from multiple sources (RWE for resource use, instruments for utilities, trials for efficacy).",
        "edge_cases": [
          "Extrapolation beyond observed data is assumption-driven; survival-curve choice (exponential vs Weibull vs spline) can swing the ICER.",
          "Structural uncertainty (model form) is rarely captured by probabilistic sensitivity analysis and must be explored via scenario analysis."
        ],
        "data_source_notes": "RWE feeds resource-use, cost, and real-world survival inputs; utilities typically external."
      },
      {
        "name": "Net-monetary-benefit (NMB) reformulation",
        "description": "Rather than the ratio, compute NMB = lambda*QALY - Cost per arm at a stated willingness-to-pay lambda; the optimal arm is the one with highest NMB, and uncertainty is summarized by the cost-effectiveness acceptability curve over a range of lambda.",
        "edge_cases": [
          "Requires committing to a threshold (or a range); the decision can flip across plausible thresholds.",
          "Preferred when ICER is undefined or sign-ambiguous (ΔQALY near zero or negative)."
        ],
        "data_source_notes": "NMB is a deterministic transform of the same cost and QALY inputs; no extra data required."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Cost-effectiveness analysis with a natural clinical-unit outcome (CEA)",
        "pros_of_this": "The QALY combines morbidity and mortality into one preference-weighted metric that is comparable across diseases, which is why HTA bodies require it.",
        "cons_of_this": "Depends on utility elicitation and a country-specific value set, adding assumptions and analyst variability; for within-indication contrasts where utilities barely move, a natural-unit CEA can be cleaner.",
        "when_to_prefer": "Whenever a decision spans conditions or quality of life is materially affected by the intervention."
      },
      {
        "compared_to": "Cost-benefit analysis (CBA)",
        "pros_of_this": "Avoids explicitly pricing a life-year in dollars and aligns with fixed-budget, opportunity-cost reasoning used by health systems.",
        "cons_of_this": "Cannot compare health against non-health spending and presumes a known cost-effectiveness threshold.",
        "when_to_prefer": "Within-health-system reimbursement and coverage decisions."
      },
      {
        "compared_to": "Cost-minimization analysis (CMA)",
        "pros_of_this": "Does not assume outcome equivalence; correctly accounts for differences in effectiveness.",
        "cons_of_this": "More data-intensive (requires valid utility and effectiveness estimates) than simply comparing costs.",
        "when_to_prefer": "Whenever outcome equivalence is not firmly established; CMA is valid only under proven non-inferiority."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Cost = paid/allowed amounts (not charges) summed over censoring-aware follow-up intervals; inflation-adjust to a common cost year and discount. Claims carry no utilities, so import the QALY from an instrument or mapping algorithm. Restrict to FFS (Parts A/B) + Part D or commercial medical+pharmacy enrollment and exclude MA-only person-time, which lacks adjudicated cost.",
      "ehr": "Records charges/activity, not paid amounts; approximate cost via cost-to-charge ratios or fee schedules. Care delivered outside the system leaks (differentially for mobile/sicker patients). PROs, where collected, can yield utilities directly.",
      "registry": "Strong for adjudicated outcomes, severity, and sometimes EQ-5D (utilities); typically thin on complete cost. Link to claims for resource use and to a death index for survival/life-years.",
      "linked": "Ideal substrate (utilities + complete cost + mortality) but introduces linkage selection and order/fill/service date discrepancies; reconcile before drawing per-period costing windows. Censored cost is not MAR — use IPCW/partitioned mean-cost estimators."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\nimport pandas as pd\n\ndef cua_point(mean_cost: dict, mean_qaly: dict, ref: str, comp: str, wtp: float) -> dict:\n    # Deterministic incremental CUA for two arms (ref = new strategy, comp = comparator).\n    d_cost = mean_cost[ref] - mean_cost[comp]\n    d_qaly = mean_qaly[ref] - mean_qaly[comp]\n    icer = d_cost / d_qaly if d_qaly != 0 else np.nan  # undefined when no QALY difference\n    nmb_ref = wtp * mean_qaly[ref] - mean_cost[ref]\n    nmb_comp = wtp * mean_qaly[comp] - mean_cost[comp]\n    return {\"delta_cost\": d_cost, \"delta_qaly\": d_qaly, \"icer\": icer,\n            \"inmb\": nmb_ref - nmb_comp, \"preferred\": ref if nmb_ref > nmb_comp else comp}\n\ndef ceac(cost_draws: pd.DataFrame, qaly_draws: pd.DataFrame, ref: str, comp: str,\n         wtp_grid=np.arange(0, 200001, 5000)) -> pd.DataFrame:\n    # cost_draws / qaly_draws: columns are arm names, rows are PSA iterations (same length).\n    d_cost = cost_draws[ref] - cost_draws[comp]\n    d_qaly = qaly_draws[ref] - qaly_draws[comp]\n    rows = []\n    for wtp in wtp_grid:\n        inmb = wtp * d_qaly - d_cost                       # incremental NMB per iteration\n        rows.append({\"wtp\": wtp, \"p_ref_cost_effective\": float((inmb > 0).mean())})\n    return pd.DataFrame(rows)",
        "description": "Incremental cost-utility analysis from arm-level summaries plus probabilistic sensitivity analysis. Required\ninputs (produced upstream from censoring-aware per-patient cost and QALY estimation):\n  mean_cost / mean_qaly : dicts keyed by arm name (discounted, common cost-year)\n  cost_draws / qaly_draws : DataFrames of PSA draws, one column per arm, one row per iteration (e.g., from bootstrap or a fitted model)\ncua_point computes the ICER, the incremental net monetary benefit, and the preferred arm at a single willingness-to-pay value;\nceac computes the CEAC over a willingness-to-pay grid.",
        "dependencies": [
          "numpy",
          "pandas"
        ],
        "source_citations": [
          "sanders-2016"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "cua_point <- function(mean_cost, mean_qaly, ref, comp, wtp) {\n  d_cost <- mean_cost[[ref]] - mean_cost[[comp]]\n  d_qaly <- mean_qaly[[ref]] - mean_qaly[[comp]]\n  icer   <- if (d_qaly != 0) d_cost / d_qaly else NA_real_  # undefined when no QALY difference\n  nmb    <- function(a) wtp * mean_qaly[[a]] - mean_cost[[a]]\n  list(delta_cost = d_cost, delta_qaly = d_qaly, icer = icer,\n       inmb = nmb(ref) - nmb(comp),\n       preferred = if (nmb(ref) > nmb(comp)) ref else comp)\n}\n\nceac <- function(cost_draws, qaly_draws, ref, comp,\n                 wtp_grid = seq(0, 2e5, by = 5e3)) {\n  # [, name] column indexing works for both matrices and data.frames ([[name]] fails on a matrix)\n  d_cost <- cost_draws[, ref] - cost_draws[, comp]\n  d_qaly <- qaly_draws[, ref] - qaly_draws[, comp]\n  data.frame(\n    wtp = wtp_grid,\n    p_ref_cost_effective = vapply(wtp_grid, function(w) mean((w * d_qaly - d_cost) > 0), numeric(1))\n  )\n}",
        "description": "Incremental CUA with ICER, net monetary benefit, and a cost-effectiveness acceptability curve. Inputs mirror the\nPython version:\n  mean_cost / mean_qaly : named numeric vectors over arms (discounted, common cost-year)\n  cost_draws / qaly_draws : matrices/data.frames of PSA draws, one column per arm",
        "dependencies": [],
        "source_citations": [
          "sanders-2016"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "%let wtp_lo = 100000;   /* willingness-to-pay thresholds, $/QALY */\n%let wtp_hi = 150000;\n\n/* Arm-level discounted mean cost and QALYs. */\nproc means data=work.patient noprint;\n  class arm;\n  var cost_disc qaly_disc;\n  output out=work.arm_means(where=(_type_=1)) mean(cost_disc)=mean_cost mean(qaly_disc)=mean_qaly;\nrun;\n\n/* Incremental cost, incremental QALY, ICER, and NMB at each threshold. */\nproc sql;\n  create table work.cua as\n  select\n    (select mean_cost from work.arm_means where arm='NEW')\n      - (select mean_cost from work.arm_means where arm='SOC') as delta_cost,\n    (select mean_qaly from work.arm_means where arm='NEW')\n      - (select mean_qaly from work.arm_means where arm='SOC') as delta_qaly\n  from work.arm_means(obs=1);\nquit;\n\ndata work.cua_result;\n  set work.cua;\n  if delta_qaly ne 0 then icer = delta_cost / delta_qaly;  /* undefined when no QALY difference */\n  else icer = .;\n  inmb_lo = &wtp_lo * delta_qaly - delta_cost;             /* incremental net monetary benefit */\n  inmb_hi = &wtp_hi * delta_qaly - delta_cost;\n  ce_at_lo = (inmb_lo > 0);                                 /* new arm cost-effective at low threshold */\n  ce_at_hi = (inmb_hi > 0);\nrun;",
        "description": "Discounted incremental CUA and net-monetary-benefit decision from per-patient cost and QALY data. Required input\n(post data-management; cost already censoring-adjusted and discounted upstream, e.g., via an IPCW/partitioned\nmean-cost step):\n  work.patient : person_id, arm ('NEW'/'SOC'), cost_disc (discounted cost), qaly_disc (discounted QALYs)\nProduces arm means, the ICER, and the incremental NMB at two willingness-to-pay thresholds (wtp_lo, wtp_hi), with a\ncost-effectiveness flag at each threshold.",
        "dependencies": [],
        "source_citations": [
          "sanders-2016"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Persp[Perspective + lifetime horizon<br/>healthcare-system or societal] --> Res[Resource use & unit costs<br/>RWE paid amounts, censoring-aware]\n  Persp --> Eff[Survival + health states]\n  Eff --> Util[Utility weights per state<br/>EQ-5D / SF-6D value set]\n  Res --> Disc[Discount costs & QALYs<br/>e.g., 3%/yr]\n  Util --> Disc\n  Disc --> Inc[Incremental ΔCost and ΔQALY<br/>new vs comparator]\n  Inc --> Dec[ICER vs threshold λ<br/>and NMB = λ·ΔQALY − ΔCost]\n  Dec --> Unc[Uncertainty: cost-effectiveness plane,<br/>PSA → CEAC, one-way tornado]",
        "caption": "Operational CUA workflow — set perspective and horizon, assemble censoring-aware costs and utility-weighted QALYs, discount both, compute the incremental ICER/NMB against a threshold, and characterize uncertainty.",
        "alt_text": "Flowchart from perspective and horizon through resource use, survival and utility weights, discounting, incremental cost and QALY, the ICER/NMB decision rule, and uncertainty analysis.",
        "source_type": "illustrative",
        "source_citations": [
          "sanders-2016"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  A[ΔQALY > 0, ΔCost < 0<br/>SE quadrant: new arm DOMINATES] --> D{Decision}\n  B[ΔQALY > 0, ΔCost > 0<br/>NE quadrant: ICER vs λ] --> D\n  C[ΔQALY < 0, ΔCost > 0<br/>NW quadrant: DOMINATED, reject] --> D\n  E[ΔQALY < 0, ΔCost < 0<br/>SW quadrant: trade-off, decide on NMB] --> D\n  D --> R[Accept new arm iff NMB = λ·ΔQALY − ΔCost > 0]",
        "caption": "Cost-effectiveness plane logic. The bare ICER is only interpretable in the NE quadrant; in the other quadrants dominance or sign-ambiguity makes the ratio misleading, so the decision is made on net monetary benefit at threshold λ.",
        "alt_text": "Diagram of the four cost-effectiveness plane quadrants mapping incremental cost and QALY combinations to dominance, dominated, and trade-off cases, resolving to a net-monetary-benefit decision rule.",
        "source_type": "illustrative",
        "source_citations": [
          "sanders-2016"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "alternative_to",
        "target_slug": "cost-benefit",
        "notes": "CBA monetizes health outcomes (willingness-to-pay / value of a statistical life); CUA keeps benefit in QALYs and compares against a cost-effectiveness threshold. Choose CBA only for cross-sectoral/societal monetary comparison."
      },
      {
        "relation_type": "see_also",
        "target_slug": "discounting-costs-effects-rwe",
        "notes": "A reference-case CUA discounts both costs and QALYs (commonly 3% US, 3.5% UK) so future health and money are valued consistently; discount rate is a standard one-way sensitivity parameter."
      },
      {
        "relation_type": "see_also",
        "target_slug": "healthcare-costs-pppm-pppy-pmpm",
        "notes": "The cost side of a CUA is built from the same paid-amount resource-use measures; CUA values outcomes in QALYs rather than leaving cost as the endpoint."
      }
    ],
    "aliases": [
      "cost utility",
      "cost-utility analysis",
      "CUA",
      "cost per QALY analysis"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "hta",
      "ema"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "cox-ph-regression",
    "name": "Cox Proportional Hazards Regression",
    "short_definition": "A semiparametric time-to-event regression that estimates covariate hazard ratios via the partial likelihood while leaving the baseline hazard fully unspecified, serving as the default model for survival outcomes (death, discontinuation, progression, first event) in claims, EHR, and registry studies.",
    "long_description": "The **Cox proportional hazards (PH) model** (Cox 1972) specifies the hazard as h(t | X) = h0(t) · exp(β'X), where\nh0(t) is an arbitrary, unestimated baseline hazard and β are log hazard ratios. Its central trick is the **partial\nlikelihood**: by conditioning on the risk set at each observed event time, h0(t) cancels and the estimator depends\nonly on the *rank order* of event times, not their spacing. That is what makes Cox \"semiparametric\": no functional\nform is imposed on how risk evolves with time, only on how covariates multiply it. It is the workhorse of\npharmacoepidemiology and HEOR because it accommodates right-censoring, left-truncation (delayed entry), and\ntime-varying covariates with mature, regulator-accepted software. This entry covers the **standard** model; updating\nexposure or covariates inside the partial likelihood at each event time is its own concept (`standard-cox-time-dependent`).\n\n**Core conceptual distinction.** The quantity Cox estimates is a **conditional (covariate-specific) hazard ratio**,\nand three properties of it are routinely misunderstood. (1) *Conditional vs marginal, and non-collapsibility.* Even\nwith no confounding, the HR adjusted for a prognostic covariate is generally *not* equal to the crude HR, because the HR is\nnon-collapsible (Stensrud & Hernán 2020). Adding a strong predictor moves the HR away from the null even when that\npredictor is not a confounder, so \"the HR changed when I adjusted\" is not evidence of confounding. (2) *The HR is a\nratio of instantaneous rates among survivors,* so even under PH it mixes a true causal effect with selection: as the\nless-protected arm depletes its susceptibles faster, the surviving risk sets become non-comparable, and a constant true\neffect can produce a time-varying observed HR. 
(3) *A single HR is a weighted average over follow-up* when PH is\nviolated; the weights depend on the censoring distribution, making the number partly an artifact of study duration.\nFor decision-makers, an absolute or marginal summary — cumulative incidence, restricted mean survival time (RMST), or\na survival difference at a fixed horizon — is often the more interpretable estimand and should be pre-specified\nalongside (or instead of) the HR.\n\n**Pros, cons, and trade-offs.**\n- **vs fully parametric survival models (Weibull, exponential, generalized gamma):** Cox is robust because it never\n  commits to a shape for h0(t), and it is the easiest survival model to communicate. Cost: it is less efficient when a\n  parametric form is correct, it does not directly yield absolute survival curves (you need a Breslow baseline\n  estimate), and it cannot extrapolate beyond observed follow-up. **Prefer parametric** for HTA survival\n  extrapolation (`survival-extrapolation-hta-rwe`) and lifetime cost-effectiveness, where you must project past the\n  data.\n- **vs Poisson / negative-binomial rate models:** Cox uses exact event-time ordering and handles smoothly varying\n  baseline risk; Poisson assumes piecewise-constant rates and is simpler for aggregate person-time and recurrent\n  counts. **Prefer Poisson/NB** (`poisson-negative-binomial-count-models`) for incidence-rate reporting and recurrent\n  events when individual event timing is coarse.\n- **vs cause-specific / Fine-Gray competing-risks models:** standard Cox censoring a competing event silently targets\n  the *cause-specific* hazard and, if read as \"risk,\" overstates the cumulative incidence of the event of interest\n  whenever competing mortality differs by arm. 
**Prefer the explicit competing-risks framing**\n  (`competing-risks-cause-specific-fine-gray-rwe`) in elderly, oncology, and end-stage populations.\n- **vs g-methods (MSM/IPTW-weighted Cox, g-estimation, clone-censor-weight):** plain Cox cannot handle time-varying\n  confounders affected by prior treatment, and conditioning on post-baseline mediators induces collider bias.\n  **Prefer g-methods** (`marginal-structural-models-g-methods`, `g-estimation-structural-nested-models`,\n  `clone-censor-weight-per-protocol`) for sustained-strategy or per-protocol estimands under time-varying confounding;\n  Cox remains the analytic engine that runs *after* the weighting.\n\n**When to use.** A pre-specified time-to-event outcome with a defensible time zero; a comparative (often\nactive-comparator new-user) contrast where PH approximately holds or can be relaxed by stratification; ITT-like or\nbaseline-covariate-adjusted questions; and as the outcome model inside propensity-score or target-trial-emulation\npipelines after balancing. It is the right default when the audience expects an HR and an absolute summary (RMST,\ncumulative incidence) is reported alongside.\n\n**When NOT to use — and when it is actively misleading or dangerous.**\n- **Strong, structured PH violation (crossing or converging hazards).** A reported single HR near 1 can hide an early\n  harm that reverses to a late benefit. Do not paper over this; model time-by-covariate interaction, stratify the\n  baseline hazard, switch to RMST, or report time-segmented HRs. Stensrud & Hernán (2020) argue the relevant question\n  is usually not \"does PH hold?\" but \"what estimand do I actually want?\"\n- **Competing risks ignored.** Censoring death to study a non-fatal endpoint in an elderly claims cohort, then\n  interpreting 1 − KM as risk, overstates incidence when mortality is differential by arm — a classic, dangerous\n  error. 
Use cause-specific *and* subdistribution analyses.\n- **Time-varying confounding by indication.** When a post-baseline lab (eGFR, HbA1c) both responds to treatment and\n  drives the next treatment decision and the outcome, standard Cox cannot remove the bias; adjusting for it opens a\n  mediator/collider path. This is the g-methods boundary.\n- **Immortal time and misaligned time zero.** If follow-up starts before exposure can occur (e.g., \"ever had the\n  procedure\" with follow-up clocked from diagnosis), exposed person-time is inflated and the HR is biased toward\n  benefit. Fix the *design* (new-user, time zero at the index event), not the model.\n- **Informative censoring.** Loss to follow-up correlated with prognosis (sicker patients disenroll or switch)\n  violates the independent-censoring assumption; require IPCW or a competing-risks/sensitivity framing\n  (`attrition-and-loss-to-follow-up-rwe`).\n\n**Data-source operational depth.**\n- **Claims (FFS vs MA vs commercial):** time zero is the index fill (`fill_date`) or, for procedures, the\n  `service_date` on the CPT/HCPCS claim; outcomes are validated code algorithms (e.g., death via discharge status or\n  a linked mortality index, MI via 1-IP or 2-OP rules). The dominant failure mode is **enrollment-driven\n  observability**: Medicare Advantage encounter data are incomplete and MA-only person-time can lack the FFS claims\n  that define both exposure and outcomes — restrict to enrollees with continuous Parts A/B/D (or full commercial\n  medical+pharmacy) and exclude MA-only spans, or \"no event\" is unobserved censoring masquerading as survival. In\n  elderly claims, **competing risk of death differs by exposure**, so a cause-specific-only Cox misleads. 
Procedure\n  studies invite **immortal time** if the index date is not pinned to the actual CPT date; stockpiling and 90-day\n  mail-order distort `days_supply` and any on-treatment window.\n- **EHR:** the event is the *order/administration*, not the dispensing; link to fills to confirm initiation. Rich\n  labs/vitals enable severity adjustment but arrive at irregular times — naive last-observation-carried-forward into a\n  time-fixed Cox can bias, and a lab measured *because* the patient deteriorated is a collider. Visit-driven capture\n  makes loss to follow-up informative; define observation windows explicitly.\n- **Registry:** cleanest for staging, biomarkers, and adjudicated recurrence/progression (SEER-Medicare,\n  disease-specific), but typically weak for complete pharmacy exposure and full mortality — link to claims and a death\n  index. Often used to validate claims-based survival algorithms.\n- **Linked claims–EHR–vital records:** the ideal substrate (severity + completeness + reliable mortality), but\n  linkage selection and order/fill/service-date discrepancies must be reconciled before time-zero assignment, or\n  immortal time creeps back in at the seam.\n\n**Worked claims example.** Question: time to first hospitalized heart-failure event, second-generation sulfonylurea\nvs DPP-4 inhibitor, among adults with type 2 diabetes in a commercial + Medicare FFS database. (1) *Eligibility:*\nage ≥18, ≥2 diabetes diagnoses, and 365 days of continuous A/B/D (or commercial medical+pharmacy) enrollment with no\nMA-only spans before the first study fill. (2) *Washout / new-user:* no fill of any sulfonylurea or DPP-4 inhibitor\nin the 365-day lookback. (3) *Time zero:* the date of the first qualifying fill; assign `arm` from the dispensed NDC.\n(4) *Outcome and time:* `time_to_event` = days from `index_date` to the first validated HF hospitalization (1-IP\nprimary-position algorithm); `event = 1` at that date. 
(5) *Censoring:* at disenrollment, end of data, and the\ncompeting event of non-HF death — analyze death as a *competing risk*, not as administrative censoring, and report\nboth the cause-specific HR (from Cox) and the subdistribution HR / cumulative incidence. (6) *Adjustment:* fit\n`coxph(Surv(time_to_event, event) ~ arm + covariates)` on a propensity-matched set with baseline covariates measured\nonly in [index_date − 365, index_date]; test PH with weighted Schoenfeld residuals (Grambsch & Therneau 1994), and if\nthe HF hazards converge over follow-up, report RMST difference at 3 years alongside the HR rather than a single number.\n\n**Interpreting the output**\n\nA Cox model returns: adjusted HR = 0.75 (95% CI 0.60–0.94) for DPP-4 inhibitor vs sulfonylurea, time to first hospitalized heart-failure event, propensity-matched cohort.\n\n*Formal interpretation.* The estimated hazard ratio of 0.75 means that, at any instant during follow-up, patients in the DPP-4 group who are still event-free have, on average, 75% of the instantaneous rate of heart-failure hospitalization observed in the sulfonylurea group, conditional on the matched covariates. Under the proportional-hazards assumption this ratio is assumed constant across follow-up time; if Schoenfeld-residual tests indicate time-varying hazards, the summary HR should be supplemented or replaced by a restricted mean survival time difference or a time-split analysis.\n\n*Practical interpretation.* The data are consistent with a lower rate of heart-failure events on DPP-4 therapy, but the HR is not \"25% fewer hospitalizations\": it describes an instantaneous rate ratio among those still event-free, and its translation to absolute risk depends on the baseline hazard and length of follow-up. Pair this HR with an absolute measure — such as the RMST difference or cumulative incidence at 36 months — to communicate the magnitude of benefit in patient-meaningful terms.",
    "primary_category": "Inferential_Statistics",
    "tags": [
      "cox",
      "proportional-hazards",
      "survival-analysis",
      "time-to-event",
      "hazard-ratio",
      "partial-likelihood",
      "schoenfeld",
      "non-collapsibility",
      "claims",
      "competing-risks"
    ],
    "applies_to_study_types": [
      "new_user",
      "active_comparator_new_user",
      "cohort_retrospective",
      "cohort_prospective",
      "claims_analysis",
      "ehr_study"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1111/j.2517-6161.1972.tb00899.x",
        "url": "https://doi.org/10.1111/j.2517-6161.1972.tb00899.x",
        "citation_text": "Cox DR. Regression models and life-tables. Journal of the Royal Statistical Society Series B (Methodological). 1972;34(2):187-220.",
        "year": 1972,
        "authors_short": "Cox",
        "notes": "The original paper introducing the proportional hazards model and the partial likelihood that eliminates the unspecified baseline hazard. Foundation for all modern survival regression in medicine and epidemiology."
      },
      {
        "role": "explain",
        "doi": "10.1093/biomet/81.3.515",
        "url": "https://doi.org/10.1093/biomet/81.3.515",
        "citation_text": "Grambsch PM, Therneau TM. Proportional hazards tests and diagnostics based on weighted residuals. Biometrika. 1994;81(3):515-526.",
        "year": 1994,
        "authors_short": "Grambsch & Therneau",
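        "notes": "Derives the weighted (scaled) Schoenfeld residual test for proportional hazards that underlies cox.zph() in R and check_assumptions() in lifelines. The ASSESS PH option in PROC PHREG is a different diagnostic, based on the cumulative score-process (martingale-residual) method of Lin, Wei & Ying (1993). The Schoenfeld-based test is the standard tool for detecting non-PH."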
      },
      {
        "role": "explain",
        "doi": "10.1001/jama.2020.1267",
        "url": "https://doi.org/10.1001/jama.2020.1267",
        "citation_text": "Stensrud MJ, Hernán MA. Why test for proportional hazards? JAMA. 2020;323(14):1401-1402.",
        "year": 2020,
        "authors_short": "Stensrud & Hernán",
        "notes": "Clarifies non-collapsibility and the built-in selection in the hazard ratio, and argues for choosing the estimand (e.g., RMST, survival difference) rather than reflexively reporting an averaged HR under PH violation."
      },
      {
        "role": "demonstrate",
        "doi": "10.1093/aje/kwv254",
        "url": "https://doi.org/10.1093/aje/kwv254",
        "citation_text": "Hernán MA, Robins JM. Using big data to emulate a target trial when a randomized trial is not available. American Journal of Epidemiology. 2016;183(8):758-764.",
        "year": 2016,
        "authors_short": "Hernán & Robins",
        "notes": "Demonstrates Cox (often weighted or time-dependent) as the analytic engine inside target-trial emulation, with explicit time-zero alignment and bias prevention in large observational databases."
      }
    ],
    "plain_language_summary": "Cox proportional hazards regression is a method for comparing how quickly two groups reach an outcome, such as a first hospitalization, while accounting for the fact that patients are followed for different lengths of time and some never experience the outcome during the study. It produces a hazard ratio: a single number that says how much faster (or slower) the event arrives in one group relative to another, after adjusting for differences in age, disease severity, and other baseline factors. For example, a hazard ratio of 0.71 means the treated group reaches the event at 71% of the rate of the comparison group, or roughly a 29% lower rate at any given moment in follow-up. The model requires a clear 'day zero' for every patient and handles the reality that many patients leave the study early without ever having the event.",
    "key_terms": [
      {
        "term": "hazard ratio",
        "definition": "A number comparing how fast an event occurs in one group versus another at any given moment — a ratio below 1.0 means the event happens more slowly in the treated group."
      },
      {
        "term": "censoring",
        "definition": "What we call it when a patient's follow-up ends before they have the event — for example, they leave the insurance plan or the study ends — so we know only that the event had not happened yet up to that point."
      },
      {
        "term": "partial likelihood",
        "definition": "The mathematical trick Cox's method uses to estimate hazard ratios by asking, at each moment an event occurs, who in the still-event-free group was most likely to have that event — this lets the model work without assuming any particular shape for how background risk changes over time."
      },
      {
        "term": "time zero (index date)",
        "definition": "Each patient's personal 'day 0' — typically the date they first filled the study drug — from which all follow-up time is measured."
      },
      {
        "term": "proportional hazards assumption",
        "definition": "The model's core requirement that the ratio of event rates between the two groups stays roughly constant throughout follow-up — if one group's risk starts high and the other's catches up later, a single hazard ratio is misleading."
      }
    ],
    "worked_example": {
      "scenario": "A researcher wants to know whether patients newly starting a DPP-4 inhibitor reach their first hospitalized heart-failure event more slowly than patients newly starting a second-generation sulfonylurea. Eight patients are enrolled on the day they pick up their first prescription (time zero). Four are in the DPP-4 arm and four are in the sulfonylurea arm. The study tracks each patient until either they are hospitalized for heart failure (the event) or their insurance coverage ends and we can no longer observe them (censored). We want to estimate the hazard ratio comparing the two arms.",
      "dataset": {
        "caption": "One row per patient. time_to_event is the number of days from first fill until hospitalization or last observable day. event = 1 means hospitalization occurred; event = 0 means the patient left follow-up before the event (censored).",
        "columns": [
          "person_id",
          "arm",
          "time_to_event_days",
          "event",
          "outcome_label"
        ],
        "rows": [
          [
            1001,
            "DPP-4",
            420,
            0,
            "censored — coverage ended"
          ],
          [
            1002,
            "DPP-4",
            185,
            1,
            "hospitalized for HF on day 185"
          ],
          [
            1003,
            "DPP-4",
            365,
            0,
            "censored — study ended"
          ],
          [
            1004,
            "DPP-4",
            290,
            1,
            "hospitalized for HF on day 290"
          ],
          [
            2001,
            "sulfonylurea",
            112,
            1,
            "hospitalized for HF on day 112"
          ],
          [
            2002,
            "sulfonylurea",
            245,
            1,
            "hospitalized for HF on day 245"
          ],
          [
            2003,
            "sulfonylurea",
            330,
            0,
            "censored — coverage ended"
          ],
          [
            2004,
            "sulfonylurea",
            88,
            1,
            "hospitalized for HF on day 88"
          ]
        ]
      },
      "steps": [
        "Sort patients by when events occurred. The first event is patient 2004 (sulfonylurea, day 88).",
        "At day 88, all 8 patients were still in follow-up — Cox asks: given one event just happened, how likely was each person to be the one who had it? The model uses each patient's arm and covariates to answer.",
        "Repeat this at each subsequent event day (112, 185, 245, 290) — each time, the 'risk set' shrinks as patients either have their event or get censored.",
        "Patients 1001, 1003, and 2003 contribute follow-up time right up until they are censored but are not counted as events; their time is not wasted, because it still informs the model about who was at risk.",
        "After processing all event times, Cox combines the comparisons across every event moment using the partial likelihood to estimate a single hazard ratio for the DPP-4 arm versus the sulfonylurea arm.",
        "In a real study with thousands of patients and full covariate adjustment (age, sex, prior heart failure, kidney disease), this same logic produces the adjusted hazard ratio reported in the results."
      ],
      "result": "In the illustrative large-scale claims study described in this concept (type 2 diabetes patients, commercial + Medicare FFS database, DPP-4 inhibitor vs second-generation sulfonylurea, time to first hospitalized heart-failure event), the adjusted hazard ratio is 0.71 (95% CI 0.61–0.83), meaning patients on a DPP-4 inhibitor experienced hospitalized heart failure at a 29% lower rate at any given point in follow-up than patients on a sulfonylurea, after adjusting for baseline differences between the groups.",
      "timeline_spec": {
        "title": "Time-to-event data: DPP-4 inhibitor vs sulfonylurea, first hospitalized heart-failure event",
        "caption": "Each bar shows one patient's follow-up, measured in days from their first prescription fill (time zero = day 0). A closed circle marks a hospitalization event; an open tick mark indicates censoring (coverage ended or study closed before the event). DPP-4 patients are shown in the top panel; sulfonylurea patients in the bottom panel.",
        "alt_text": "Horizontal bar chart showing eight patients split into two treatment arms. Each bar extends from day 0 to the patient's last observed day. Four DPP-4 patients: patient 1001 bar ends at day 420 with a censored marker, patient 1002 ends at day 185 with an event marker, patient 1003 ends at day 365 with a censored marker, patient 1004 ends at day 290 with an event marker. Four sulfonylurea patients: patient 2001 ends at day 112 with an event marker, patient 2002 ends at day 245 with an event marker, patient 2003 ends at day 330 with a censored marker, patient 2004 ends at day 88 with an event marker. The sulfonylurea arm visually shows events clustering earlier than the DPP-4 arm.",
        "window": {
          "start_day": 0,
          "end_day": 430,
          "label": "Days from first prescription fill (time zero)"
        },
        "events": [
          {
            "person_id": 1001,
            "arm": "DPP-4",
            "start_day": 0,
            "end_day": 420,
            "quantity": "420 days followed",
            "marker": "censored",
            "label": "Pt 1001 — censored day 420"
          },
          {
            "person_id": 1002,
            "arm": "DPP-4",
            "start_day": 0,
            "end_day": 185,
            "quantity": "185 days followed",
            "marker": "event",
            "label": "Pt 1002 — HF hospitalization day 185"
          },
          {
            "person_id": 1003,
            "arm": "DPP-4",
            "start_day": 0,
            "end_day": 365,
            "quantity": "365 days followed",
            "marker": "censored",
            "label": "Pt 1003 — censored day 365"
          },
          {
            "person_id": 1004,
            "arm": "DPP-4",
            "start_day": 0,
            "end_day": 290,
            "quantity": "290 days followed",
            "marker": "event",
            "label": "Pt 1004 — HF hospitalization day 290"
          },
          {
            "person_id": 2001,
            "arm": "sulfonylurea",
            "start_day": 0,
            "end_day": 112,
            "quantity": "112 days followed",
            "marker": "event",
            "label": "Pt 2001 — HF hospitalization day 112"
          },
          {
            "person_id": 2002,
            "arm": "sulfonylurea",
            "start_day": 0,
            "end_day": 245,
            "quantity": "245 days followed",
            "marker": "event",
            "label": "Pt 2002 — HF hospitalization day 245"
          },
          {
            "person_id": 2003,
            "arm": "sulfonylurea",
            "start_day": 0,
            "end_day": 330,
            "quantity": "330 days followed",
            "marker": "censored",
            "label": "Pt 2003 — censored day 330"
          },
          {
            "person_id": 2004,
            "arm": "sulfonylurea",
            "start_day": 0,
            "end_day": 88,
            "quantity": "88 days followed",
            "marker": "event",
            "label": "Pt 2004 — HF hospitalization day 88"
          }
        ],
        "spans": [
          {
            "kind": "followup",
            "arm": "DPP-4",
            "label": "DPP-4 arm: 2 events in 4 patients (events at days 185, 290)"
          },
          {
            "kind": "followup",
            "arm": "sulfonylurea",
            "label": "Sulfonylurea arm: 3 events in 4 patients (events at days 88, 112, 245)"
          }
        ],
        "result": {
          "label": "Adjusted HR (DPP-4 vs sulfonylurea) = 0.71 — DPP-4 patients reached hospitalized heart failure at 29% lower rate at any given moment in follow-up",
          "value": 0.71
        }
      }
    },
    "prerequisites": [
      "cumulative-incidence-risk-rwe",
      "new-user-design",
      "competing-risks-cause-specific-fine-gray-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Stratified Cox (baseline hazard stratified by strong PH violators)",
        "description": "Allow a separate, unestimated baseline hazard within strata (e.g., cancer stage, line of therapy, HF class, or PS-matched-pair id) while sharing β across strata. The standard first remedy when a covariate violates PH but is not itself of interest.",
        "edge_cases": [
          "The stratifying variable's own effect is no longer estimated; do not stratify on the exposure of interest.",
          "Many thin strata (e.g., matched-pair strata) reduce efficiency and can drop strata with no events."
        ],
        "data_source_notes": "claims/oncology: stratify by stage or prior lines from registry linkage; EHR: stratify by site to absorb center-level baseline-risk heterogeneity.",
        "citations": [
          "grambsch-therneau-1994"
        ]
      },
      {
        "name": "Cox with time-by-covariate interaction (explicitly non-proportional effects)",
        "description": "Model a deliberately time-varying coefficient, e.g. β1·x + β2·x·g(t), so the log-HR for x changes over follow-up. Used when hazards cross or converge and a single HR is misleading.",
        "edge_cases": [
          "Requires the counting-process (start, stop, event) data layout to define the time transform.",
          "Interpretation shifts to a time-specific HR; pair with RMST or cumulative incidence for a decision-ready summary."
        ],
        "data_source_notes": "Common when an early treatment harm reverses to a late benefit (or vice versa) that the global PH test flags; report the HR trajectory, not one averaged number.",
        "citations": [
          "stensrud-hernan-2020"
        ]
      },
      {
        "name": "Weighted Cox (IPTW / overlap / SMR weights for a marginal HR)",
        "description": "Fit Cox on a propensity-weighted pseudo-population to target a marginal hazard ratio (ATE, ATT, or overlap-weighted estimand) rather than a covariate-conditional HR, using a robust (sandwich) variance.",
        "edge_cases": [
          "Extreme weights inflate variance; truncate or use overlap weights and report effective sample size and weight distribution.",
          "Naive model-based SEs are anti-conservative under weighting; cluster-robust variance is mandatory."
        ],
        "data_source_notes": "claims: high-dimensional PS from lookback-window proxies, then stabilized IPTW; combine with IPCW when as-treated censoring is informative."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "competing-risks-cause-specific-fine-gray-rwe",
        "pros_of_this": "Familiar, directly estimates the cause-specific hazard ratio (the etiologic/rate question) with one procedure; censoring competing events is mechanically simple.",
        "cons_of_this": "A cause-specific HR does not map to cumulative incidence when competing mortality differs by arm; interpreting 1 - KM as risk overstates the event probability.",
        "when_to_prefer": "When the question is about the rate among those still at risk (etiology); use Fine-Gray / cumulative-incidence framing for absolute risk and policy questions in populations with substantial competing death."
      },
      {
        "compared_to": "restricted-mean-survival-time-rmst",
        "pros_of_this": "Single scalar HR is compact and regulator-familiar; efficient when PH holds.",
        "cons_of_this": "Uninterpretable as a single number under PH violation and silently averaged over the censoring distribution; gives no absolute time benefit.",
        "when_to_prefer": "When PH approximately holds and the audience expects an HR; switch to RMST when hazards cross/converge or when a \"days of event-free survival gained\" summary is needed for value assessment."
      },
      {
        "compared_to": "marginal-structural-models-g-methods",
        "pros_of_this": "Simpler to specify, communicate, and defend; mature software; handles right-censoring and left-truncation natively.",
        "cons_of_this": "Cannot handle time-varying confounding affected by prior treatment; conditioning on post-baseline mediators induces collider bias and biased effect estimates.",
        "when_to_prefer": "Baseline-confounding, ITT-like or PS-adjusted contrasts; use MSM/g-estimation/CCW for sustained or dynamic strategies under time-varying confounding (Cox still runs after the weighting)."
      },
      {
        "compared_to": "poisson-negative-binomial-count-models",
        "pros_of_this": "Uses exact event-time ordering and a flexible (unspecified) baseline hazard; natural for first-event time-to-event endpoints.",
        "cons_of_this": "Less convenient for aggregate person-time incidence rates and recurrent counts; partial likelihood is heavier than a rate regression at very large N with coarse timing.",
        "when_to_prefer": "First-event survival with meaningful timing; prefer Poisson/NB for incidence-rate reporting and recurrent-event counts where timing is coarse or overdispersion dominates."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Build one row per subject (or counting-process start/stop intervals for time-varying terms). Time zero = first qualifying fill after washout, or the CPT/HCPCS service_date for procedures. Require continuous, FFS-observable enrollment over the lookback and follow-up; exclude MA-only person-time so \"no event\" is true survival, not unobserved censoring. Apply validated outcome algorithms and analyze death as a competing risk in elderly cohorts. Use robust variance when weighting; report number-at-risk, events by arm, median follow-up, and the PH test.",
      "ehr": "Use structured labs/vitals for time-varying severity, but treat labs ordered because of deterioration as potential colliders and avoid naive LOCF into a time-fixed model. Initiation = order/administration; link to fills to confirm. Handle informative loss to follow-up with IPCW or sensitivity analysis.",
      "registry": "Cleanest for stage, biomarkers, and adjudicated recurrence/progression; weak for complete pharmacy exposure and full mortality. Link to claims and a death index, and use the registry to validate claims-based endpoints.",
      "linked": "Linked claims-EHR-vital-records gives severity + completeness + reliable mortality, but reconcile order/fill/service-date discrepancies and linkage selection before assigning time zero to avoid reintroducing immortal time."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\nfrom lifelines import CoxPHFitter\n\n# cohort: analysis-ready table described in the header (NO toy data created here).\ncohort = pd.read_parquet(\"cohort.parquet\")\n\ncovariates = [\"arm\", \"age\", \"sex\", \"prior_hf\", \"ckd\", \"baseline_utilization\"]\nmodel_df = cohort[[\"time_to_event\", \"event\"] + covariates].dropna()\n\ncph = CoxPHFitter()\ncph.fit(model_df, duration_col=\"time_to_event\", event_col=\"event\",\n        robust=True)                      # robust SEs; required if weights/clustering are added later\nprint(cph.summary[[\"coef\", \"exp(coef)\", \"exp(coef) lower 95%\",\n                   \"exp(coef) upper 95%\", \"p\"]])   # exp(coef) = adjusted HR\n\n# Proportional-hazards check via weighted Schoenfeld residuals (Grambsch & Therneau 1994).\n# A small p-value for `arm` => PH violated for the exposure: stratify, add a time interaction,\n# or report RMST / cumulative incidence instead of a single averaged HR.\ncph.check_assumptions(model_df, p_value_threshold=0.05, show_plots=False)",
        "description": "Standard Cox PH on a cleaned, analysis-ready cohort (one row per subject; counting-process layout for time-varying\nterms is handled by standard-cox-time-dependent). Required input table `cohort` (already de-duplicated, with time\nzero, outcome, and baseline covariates resolved upstream):\n  person_id        : unique subject id\n  time_to_event    : days from index_date to event or censoring (>0)\n  event            : 1 = outcome event, 0 = censored (competing death is censoring here -> also run a competing-risks model)\n  arm              : 0/1 treatment indicator (or categorical), assigned at time zero\n  age, sex, ...    : baseline covariates measured only in [index_date - lookback, index_date]\nProduces the HR table and the weighted-Schoenfeld PH test; report an absolute summary (RMST) when PH fails.",
        "dependencies": [
          "pandas",
          "lifelines"
        ],
        "source_citations": [
          "grambsch-therneau-1994"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(survival)\n\n# cohort: analysis-ready data.frame described in the header.\ncohort <- readRDS(\"cohort.rds\")\ncohort$arm <- relevel(factor(cohort$arm), ref = \"0\")\n\nfit <- coxph(\n  Surv(time_to_event, event) ~ arm + age + sex + prior_hf + ckd + baseline_utilization,\n  data   = cohort,\n  ties   = \"efron\",        # Efron handling of tied event times (default; preferred over Breslow)\n  robust = TRUE            # robust variance; required once IPTW/IPCW weights are added\n)\nsummary(fit)               # exp(coef) = adjusted hazard ratio with 95% CI\n\n# Weighted Schoenfeld PH test + diagnostic plot (Grambsch & Therneau 1994).\nzph <- cox.zph(fit)\nprint(zph)                 # global + per-term p-values; p<0.05 for arm => PH violated\n# If PH fails for arm: strata(stage) for nuisance violators, or add tt(arm) for a time-varying\n# effect, or summarize with RMST: survival::rmean via summary(survfit(...), rmean=1095).",
        "description": "Standard Cox PH with the survival package on the same analysis-ready cohort. Required columns:\n  time_to_event (numeric days > 0), event (1 event / 0 censored), arm (factor),\n  and baseline covariates measured only in the pre-index lookback window.\ncoxph() supports left truncation via Surv(start, stop, event); use that layout for delayed entry or time-varying terms.",
        "dependencies": [
          "survival"
        ],
        "source_citations": [
          "grambsch-therneau-1994"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* Cumulative incidence by arm (and competing-risk CIF if a competing event is coded). */\nproc lifetest data=work.cohort plots=cif(test);\n  time time_to_event*event(0);\n  strata arm;\nrun;\n\n/* Primary Cox model: adjusted (cause-specific) hazard ratio with PH diagnostics. */\nproc phreg data=work.cohort;\n  class arm (ref='0') sex (ref='0');\n  model time_to_event*event(0) = arm age sex prior_hf ckd baseline_utilization / ties=efron rl;\n  hazardratio 'Treatment effect' arm / diff=ref;\n  assess ph / resample seed=123;          /* weighted Schoenfeld residual PH test (Grambsch-Therneau) */\n  output out=work.resid ressch=sch_arm;   /* per-event Schoenfeld residuals for the exposure */\nrun;\n\n/* Competing-risks (subdistribution) HR when a competing event exists (event=2 = competing death). */\nproc phreg data=work.cohort;\n  class arm (ref='0');\n  model time_to_event*event(0) = arm age sex / eventcode=1 ties=efron;  /* Fine-Gray */\n  hazardratio arm / diff=ref;\nrun;",
        "description": "Standard Cox PH in SAS with PROC PHREG on the analysis-ready cohort dataset work.cohort. Required variables:\n  person_id, time_to_event (>0), event (1=event,0=censored), arm (0/1 or categorical),\n  and baseline covariates from the pre-index lookback window only.\nPair with PROC LIFETEST for cumulative incidence and, when a competing event is present, PROC PHREG eventcode=\n(Fine-Gray) so absolute risk is not misread off 1 - KM.",
        "dependencies": [],
        "source_citations": [
          "grambsch-therneau-1994"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": "cox-ph-regression-timeline.svg",
        "mermaid": null,
        "caption": "Each bar shows one patient's follow-up, measured in days from their first prescription fill (time zero = day 0). A closed circle marks a hospitalization event; an open tick mark indicates censoring (coverage ended or study closed before the event). DPP-4 patients are shown in the top panel; sulfonylurea patients in the bottom panel.",
        "alt_text": "Horizontal bar chart showing eight patients split into two treatment arms. Each bar extends from day 0 to the patient's last observed day. Four DPP-4 patients: patient 1001 bar ends at day 420 with a censored marker, patient 1002 ends at day 185 with an event marker, patient 1003 ends at day 365 with a censored marker, patient 1004 ends at day 290 with an event marker. Four sulfonylurea patients: patient 2001 ends at day 112 with an event marker, patient 2002 ends at day 245 with an event marker, patient 2003 ends at day 330 with a censored marker, patient 2004 ends at day 88 with an event marker. The sulfonylurea arm visually shows events clustering earlier than the DPP-4 arm.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Q[Time-to-event question with a defensible time zero] --> PH{Proportional hazards<br/>plausible? Schoenfeld test}\n  PH -->|Yes, PH holds| CR{Competing event<br/>differential by arm?}\n  PH -->|No, hazards cross/converge| NPH[Stratify baseline OR time-by-covariate interaction<br/>report RMST / segmented HR]\n  CR -->|No| TVC{Time-varying confounding<br/>affected by prior treatment?}\n  CR -->|Yes| FG[Cause-specific Cox PLUS Fine-Gray / cumulative incidence]\n  TVC -->|No| COX[Standard or PS-weighted Cox<br/>report HR + absolute summary]\n  TVC -->|Yes| GM[g-methods: MSM-weighted Cox / g-estimation / clone-censor-weight]",
        "caption": "Decision logic for choosing standard Cox versus its alternatives. Standard Cox is appropriate only when PH is tenable, competing risks are not differentially distorting absolute risk, and confounding is baseline (not treatment-affected time-varying).",
        "alt_text": "Flowchart routing a time-to-event question through proportional-hazards, competing-risk, and time-varying confounding checks to standard Cox, non-PH remedies, Fine-Gray, or g-methods.",
        "source_type": "illustrative",
        "source_citations": [
          "stensrud-hernan-2020"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  U[Unmeasured healthy behaviors / frailty] -->|confounds| X[Treatment arm]\n  U -->|drives event hazard| T[Event time]\n  X --> T\n  U -.->|depletes faster in higher-risk arm,<br/>making risk sets non-comparable| RS[Surviving risk set over time]\n  RS -.->|induces time-varying observed HR<br/>even under constant true effect| T\n  style U fill:#ffcccc",
        "caption": "Two paths that bias and distort a Cox HR. The solid path is classic confounding by an unmeasured prognostic factor; the dashed path is the built-in selection of the hazard ratio, where differential depletion of susceptibles makes surviving risk sets non-comparable and can manufacture a time-varying HR from a constant true effect.",
        "alt_text": "DAG showing unmeasured frailty confounding treatment and event time, plus a depletion-of-susceptibles path that makes the surviving risk set non-comparable and induces a time-varying observed hazard ratio.",
        "source_type": "illustrative",
        "source_citations": [
          "stensrud-hernan-2020"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "is_variant_of",
        "target_slug": "standard-cox-time-dependent",
        "notes": "Pushing exposure/covariates inside the partial likelihood at each event time (counting-process layout) extends this standard model; see that concept for time-varying-covariate construction and immortal-time avoidance."
      },
      {
        "relation_type": "alternative_to",
        "target_slug": "restricted-mean-survival-time-rmst",
        "notes": "RMST gives an absolute, assumption-light \"event-free time gained\" summary that is preferable to a single HR when proportional hazards is violated."
      },
      {
        "relation_type": "alternative_to",
        "target_slug": "poisson-negative-binomial-count-models",
        "notes": "Rate models trade exact event timing for simpler aggregate person-time and recurrent-count handling; Cox keeps the unspecified baseline hazard and exact ordering."
      },
      {
        "relation_type": "used_with",
        "target_slug": "competing-risks-cause-specific-fine-gray-rwe",
        "notes": "Standard Cox censoring a competing event targets the cause-specific hazard; pair it with Fine-Gray / cumulative incidence so absolute risk is not overstated when competing mortality differs by arm."
      },
      {
        "relation_type": "used_with",
        "target_slug": "propensity-score-methods-psm-iptw",
        "notes": "Cox is the usual outcome model after PS matching or IPTW; use robust/cluster-robust variance on the weighted or matched set to target a marginal HR."
      },
      {
        "relation_type": "used_with",
        "target_slug": "target-trial-emulation",
        "notes": "Cox (possibly weighted or time-dependent) is the standard analytic engine inside a target-trial emulation after eligibility, time-zero alignment, and cloning/censoring/weighting."
      },
      {
        "relation_type": "used_with",
        "target_slug": "marginal-structural-models-g-methods",
        "notes": "An IPTW-weighted Cox is the fitting step of a marginal structural Cox model; g-methods supply the weights that handle time-varying confounding standard Cox cannot."
      },
      {
        "relation_type": "see_also",
        "target_slug": "g-estimation-structural-nested-models",
        "notes": "Use g-estimation/SNMs when effect modification by time-varying factors or blip effects is central and the PH / no-time-varying-confounding assumptions of standard Cox fail."
      },
      {
        "relation_type": "see_also",
        "target_slug": "clone-censor-weight-per-protocol",
        "notes": "CCW emulates sustained per-protocol strategies and then fits a weighted Cox; plain Cox on observed data can still carry time-zero or selection bias for those questions."
      },
      {
        "relation_type": "see_also",
        "target_slug": "immortal-time-bias-handling",
        "notes": "Misaligned time zero (follow-up starting before exposure can occur) inflates exposed person-time and biases the HR toward benefit; fix the design rather than the model."
      },
      {
        "relation_type": "see_also",
        "target_slug": "prevalent-user-bias",
        "notes": "Including prevalent users without proper time zero mixes early high-risk and later low-risk periods, attenuating HRs and inducing non-PH."
      },
      {
        "relation_type": "see_also",
        "target_slug": "healthy-user-bias",
        "notes": "Unmeasured healthy behaviors can bias the HR toward apparent benefit and induce non-PH if healthier patients remain on treatment longer."
      },
      {
        "relation_type": "see_also",
        "target_slug": "e-value-sensitivity-analysis",
        "notes": "Report an E-value on the HR scale to quantify how strong unmeasured confounding would need to be to explain away the observed association."
      },
      {
        "relation_type": "see_also",
        "target_slug": "attrition-and-loss-to-follow-up-rwe",
        "notes": "Informative censoring (loss to follow-up related to prognosis or treatment) violates independent censoring and requires IPCW or a competing-risks/sensitivity framing."
      },
      {
        "relation_type": "see_also",
        "target_slug": "marginal-effects-and-interpretation-of-inferential-statistics-rwe",
        "notes": "The Cox HR is conditional and non-collapsible; marginal absolute summaries (survival difference, RMST) are often more interpretable for decision-making."
      },
      {
        "relation_type": "part_of",
        "target_slug": "comparative-effectiveness-research-cer-methods",
        "notes": "One of the core analytic tools in the CER toolbox alongside PS methods, g-methods, and sensitivity analyses."
      }
    ],
    "aliases": [
      "cox model",
      "cox regression",
      "proportional hazards model",
      "proportional hazards regression",
      "semiparametric survival model",
      "partial likelihood model"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "cpt-procedure-coding",
    "name": "CPT Codes (HCPCS Level I)",
    "short_definition": "Current Procedural Terminology (CPT), maintained by the American Medical Association and adopted under HIPAA as HCPCS Level I, is the standard code set for identifying professional services and outpatient procedures on US claims; it appears on every physician (CMS-1500/837P) and outpatient facility (UB-04) claim but NOT on inpatient facility claims, where ICD-10-PCS governs procedure coding — the single most consequential setting rule for procedure ascertainment in RWE.",
    "long_description": "**CPT (Current Procedural Terminology)** is the procedure and service coding system maintained by the\nAmerican Medical Association (AMA) CPT Editorial Panel and adopted by the Department of Health and\nHuman Services under HIPAA as HCPCS Level I. It is the universal language of professional billing in\nthe United States: every evaluation and management (E/M) encounter, surgical service, diagnostic test,\nimaging study, and therapeutic infusion performed outside an inpatient hospital setting is identified\non the claim by a CPT code. Because reimbursement is tied to the code, the claim record reflects what\nwas billed (and paid), making CPT the most comprehensive, consistently captured procedure signal in\nUS administrative data.\n\n**Copyright and licensing constraint — read before using in any catalog, tool, or publication.**\nCPT is a copyrighted work of the American Medical Association. The AMA licenses CPT for commercial\nuse, redistribution, and derivative products; a valid AMA CPT license is required before reproducing\ncode descriptors in software, databases, publications, or data products. This is the reason that the\nOMOP Common Data Model's Athena vocabulary download portal requires a separate license-acceptance step\nfor the CPT4 vocabulary: even though OMOP maps CPT codes to standard concepts, the underlying\ndescriptor text remains AMA-copyrighted and cannot be redistributed without that license. Analysts\nbuilding phenotype libraries, open-source code lists, or research tools that display CPT descriptors\nmust obtain or operate under an AMA CPT license. Researchers who work within a licensed institutional\nenvironment (hospital system, payer, major analytics vendor) are typically covered by an enterprise\nlicense, but that coverage does not extend to public GitHub repositories or open publications that\nreproduce the descriptor text. 
This entry therefore describes the format and structure of CPT without\nreproducing licensed descriptor text; worked examples use numeric range logic rather than official\ncode labels.\n\n**Code format and category structure.** CPT codes are organized into four categories, each with a\ndistinct format that can be detected by regular expression:\n\n- **Category I** (the main procedural set): exactly five numeric digits (pattern `\\d{5}`). These are\n  the codes that drive the overwhelming majority of professional and outpatient facility claims.\n  They are organized into six sections: Evaluation and Management, Anesthesia, Surgery, Radiology,\n  Pathology and Laboratory, and Medicine. Within each section codes are grouped in ascending numeric\n  ranges that correspond roughly to anatomic system or service type. The range structure — not the\n  descriptors — is usable in open code without a license, and it is what enables range-based\n  filtering (e.g., identifying office/outpatient E/M visit lines from the 99202–99215 family by\n  numeric range alone).\n\n- **Category II** (performance measurement tracking): four digits followed by the letter \"F\"\n  (pattern `\\d{4}F`). These codes are supplemental and non-billable for payment; they document\n  quality measures, preventive care actions, and patient safety activities. Because they are not\n  required for reimbursement, their capture in administrative claims is uneven and they should not\n  be used as the primary source for procedure ascertainment. Their RWE value is primarily for\n  quality-of-care or pay-for-performance studies that use claims from payers who mandate their\n  reporting.\n\n- **Category III** (emerging technology, temporary): four digits followed by the letter \"T\"\n  (pattern `\\d{4}T`). These are time-limited codes for new procedures, technologies, and services\n  that lack sufficient evidence or utilization for Category I status. 
They are updated\n  semiannually (January and July) rather than annually. When a Category III code accumulates\n  sufficient evidence and volume, the AMA may promote it to a permanent Category I code; when\n  that happens, the Category III code is retired and both the old and new codes may appear in\n  historical claims during the transition window. RWE studies on novel devices or procedures\n  must anticipate this code-evolution pattern: a procedure identified only by its Category III\n  code will be systematically missed in claims before the code was created and after it was\n  replaced by a Category I code.\n\n- **PLA codes** (Proprietary Laboratory Analyses): four digits followed by the letter \"U\"\n  (pattern `\\d{4}U`). Introduced in 2018, these codes identify laboratory tests marketed under\n  a proprietary name. They are relevant to RWE studies on genomic testing, liquid biopsy, and\n  other advanced laboratory diagnostics.\n\n**Modifiers.** CPT codes accept two-character alphanumeric modifiers appended after the code\n(in claims data, typically stored in a separate modifier field). Modifiers convey clinically\nimportant information without altering the core code: they can indicate that only the professional\ncomponent of a service was provided (as opposed to the technical component), that a service was\ndistinct and separate from another service on the same date, that laterality applies (left side vs.\nright side), or that a bilateral procedure was performed. 
In RWE, failing to parse modifiers leads\nto two common errors: (1) double-counting a global service when both the professional and technical\ncomponents are billed separately with component modifiers, and (2) missing laterality when the\nstudy requires distinguishing which limb, eye, or ear was treated.\n\n**Where CPT appears on claims — the central setting rule for RWE.**\nThis is the most operationally consequential fact about CPT for real-world evidence:\n\n- **Professional (physician/supplier) claims** — CMS-1500 paper form / 837P electronic transaction,\n  field 24D: CPT or HCPCS codes appear on every service line. This is the primary source for\n  all physician office visits, specialist consultations, outpatient surgical procedures billed by\n  the surgeon, diagnostic services, infusion administration codes, and labs billed under the\n  physician fee schedule.\n\n- **Outpatient facility claims** — UB-04 paper form / 837I electronic transaction, field FL44\n  (revenue code line): CPT/HCPCS codes appear alongside revenue center codes on outpatient\n  hospital claims. A hospital outpatient department (HOPD) procedure generates both a revenue\n  code and a CPT code; the CPT code identifies the specific procedure while the revenue code\n  identifies the cost center.\n\n- **Inpatient facility claims** — UB-04 / 837I: **CPT codes do NOT appear on inpatient facility\n  claims.** Inpatient procedures are coded using ICD-10-PCS (International Classification of\n  Diseases, 10th Revision, Procedure Coding System), a wholly separate, public-domain system\n  maintained by CMS. Any RWE study that applies a CPT-only code list to facility claims will\n  systematically miss every inpatient facility record; an inpatient procedure then surfaces in\n  CPT only through the performing clinician's professional claim, when that claim is captured. 
For procedures that can be performed in either\n  inpatient or outpatient settings (cardiac catheterization, joint replacement, certain cancer\n  resections), a CPT-only code list captures the lower-acuity outpatient population while\n  dropping the higher-acuity inpatient population — a form of case-mix ascertainment bias that\n  can severely distort comparative effectiveness estimates.\n\n**Key RWE use cases.**\n\n1. **E/M visit ascertainment and utilization counting.** Office/outpatient E/M visits fall in a\n   well-known numeric range; place-of-service codes narrow to office, telehealth, or outpatient\n   settings. Counting unique dates rather than service lines avoids double-counting split-billing\n   (where E/M and a procedure on the same day both generate a line).\n\n2. **Drug administration coding (infusion and injection).** For oncology and specialty pharmacy\n   products, the CPT administration code (identifying the infusion service) and the HCPCS Level II\n   J-code (identifying the drug itself) appear as separate lines on the same claim. In RWE,\n   treating the J-code alone as evidence of drug exposure without the administration code may\n   miss routes of administration, and treating the administration code alone identifies the service\n   but not the specific agent. The correct approach for chemotherapy/biologic administration\n   studies is to require both the drug J-code and an administration CPT code on the same or\n   proximate claim.\n\n3. **Procedure ascertainment (outpatient and ambulatory surgery).** CPT identifies outpatient\n   surgical procedures in ambulatory surgery centers (ASC) and HOPDs precisely. Because the\n   CPT code is the payment trigger, capture is high for reimbursable procedures. 
Failure modes\n   include: (a) bilateral procedures coded with a modifier on a single code line rather than\n   two separate lines, so volume-based utilization counts undercount; (b) unlisted procedure\n   codes (Category I \"unlisted\" codes ending in specific suffixes) that aggregate heterogeneous\n   procedures and cannot be distinguished without chart review.\n\n4. **Laboratory test identification.** CPT lab codes identify the test that was billed (e.g.,\n   a specific panel or analyte), while LOINC codes identify the observable result returned by\n   the laboratory. In EHR-linked data, CPT appears on the order/billing side and LOINC on the\n   result side; claims data contains CPT but not LOINC for most tests. This means claims can\n   confirm a test was ordered and billed but cannot directly provide the result value — that\n   requires EHR linkage.\n\n**Relationship to HCPCS Level II.** CPT constitutes HCPCS Level I. HCPCS Level II is a parallel\ncode set maintained by CMS that fills gaps not addressed by CPT: drugs and biologics administered\nin clinical settings (J-codes), durable medical equipment (E-codes), ambulance services (A-codes),\nand supplies. The two levels are complementary: professional claims use both CPT and HCPCS Level II\ncodes on the same service line when appropriate. For drug administration studies in claims, J-codes\n(Level II) identify the drug and CPT codes (Level I) identify the administration service; analysts\nmust work with both levels simultaneously.\n\n**Relationship to SNOMED CT and LOINC.** CPT is a billing-oriented system optimized for\nreimbursement, not clinical representation. SNOMED CT provides a clinically rich, ontologically\nstructured representation of procedures with formal hierarchical relationships and laterality\nthat CPT lacks. 
LOINC provides standardized identifiers for laboratory observations and results.\nIn OMOP, CPT4 procedure codes are mapped to SNOMED standard concepts via the Athena vocabulary\nserver, enabling cross-database queries that are not possible on raw CPT alone. However, because\nCPT4 is AMA-copyrighted, downloading the CPT4 vocabulary from Athena requires an explicit\nlicense-acceptance step that is separate from the standard OMOP vocabulary download — the only\nmajor OMOP vocabulary that requires this step.\n\n**Annual update cycle and version management.** The AMA CPT Editorial Panel issues a new CPT\ncode set effective January 1 of each year (Category III codes also receive a July 1 semiannual\nupdate). Codes can be added, revised, or deleted in each update. In longitudinal RWE studies\nspanning multiple calendar years, a code that was not yet created in year one of the study window\ncannot appear in year-one claims, and a code deleted mid-study creates a truncated capture window.\nCode-list documentation for regulatory-grade RWE must specify which CPT version(s) were in effect\nduring the study window and how code changes were handled — particularly for Category III codes\ntransitioning to Category I.\n\n**Pros, cons, and trade-offs — specific and comparative.**\n\n- **vs ICD-10-PCS (inpatient procedure coding):** CPT covers all professional and outpatient facility\n  services with high payment-driven capture; ICD-10-PCS covers inpatient facility procedures with\n  clinical detail (approach, device, qualifier) that CPT lacks but is coded by facility coders\n  rather than the performing clinician. For a procedure performed in either setting, a complete\n  ascertainment requires both systems in a union code set. 
**Never use CPT alone** when the study\n  population includes hospitalizations for the procedure of interest.\n\n- **vs HCPCS Level II J-codes (drug identification):** CPT identifies the administration service;\n  J-codes identify the specific drug or biologic. The two are complements on infusion claims, not\n  substitutes. Choosing one and ignoring the other produces a systematically incomplete study\n  design for any infused specialty product.\n\n- **vs revenue center codes (outpatient facility billing):** Revenue codes are present on all\n  UB-04 facility claims and identify the cost center (e.g., operating room, observation, emergency\n  department). They are broader than CPT — a single revenue code covers many procedures — and\n  should be used to confirm care setting (outpatient vs. inpatient) or to detect services not billed\n  with a CPT code, not as a substitute for CPT in procedure identification.\n\n- **vs SNOMED CT (clinical concept representation):** SNOMED provides hierarchical procedure\n  semantics (a hip replacement IS-A joint replacement IS-A musculoskeletal procedure) enabling\n  queries that traverse the hierarchy without enumerating every leaf code. CPT is flat with no\n  formal hierarchy, so CPT-based phenotyping requires explicit enumeration of all relevant codes.\n  In OMOP-mapped data, SNOMED ancestors provide a shortcut; in raw claims, every CPT code of\n  interest must be listed explicitly.\n\n- **AMA copyright vs ICD-10-CM/PCS public domain:** ICD-10-CM and ICD-10-PCS are public-domain\n  code sets — their descriptor text can be reproduced freely in research, software, and publications.\n  CPT descriptors require an AMA license for reproduction. This licensing asymmetry has practical\n  consequences for open-source phenotype libraries, GitHub repositories, and journal publications:\n  CPT-based code lists can share the numeric codes but not the official descriptor text without\n  a license. 
**Prefer openly licensed alternatives where clinically equivalent** (e.g., SNOMED CT in\n  OMOP, which is not public domain but is free to use in SNOMED International member countries\n  such as the United States); when CPT is required, work within a licensed institutional environment.\n\n**When to use.**\n- When the study requires identifying professional services or outpatient procedures from US\n  claims data: E/M visits, outpatient surgeries, infusion administrations, laboratory orders,\n  diagnostic imaging, and any other billed service performed outside an inpatient facility.\n- When counting ambulatory utilization (office visits per person-year, procedure rates in\n  outpatient settings) in commercial, Medicare FFS, or Medicaid claims.\n- When drug exposure in a specialty infusion context requires linking the administration service\n  (CPT) to the specific agent (HCPCS J-code).\n- When building a complete procedure ascertainment code set that unions CPT (outpatient/professional)\n  with ICD-10-PCS (inpatient) to capture procedures regardless of care setting.\n- When working in OMOP and needing the CPT4 vocabulary for mapping to SNOMED standard concepts\n  (after completing the required Athena license-acceptance step).\n\n**When NOT to use — and when it is actively misleading or dangerous.**\n- **As the sole source for inpatient procedure ascertainment.** A CPT-only code list will return\n  zero results for inpatient facility claims (ICD-10-PCS governs there). Applying a CPT-only list to\n  inpatient facility data silently drops the hospitalized population, systematically excluding the\n  sicker, more complex patients. This is a selection bias toward healthier patients that can make a\n  procedure look safer or more effective than it is.\n- **When the coverage type is Medicare Advantage (managed care).** MA enrollees' claims are\n  encounter-based and historically less complete than FFS claims; procedure capture may be absent\n  or under-coded. 
Do not pool MA-only and FFS person-time for procedure rates.\n- **When reproducing CPT descriptor text without an AMA license.** Publishing, sharing, or\n  embedding official CPT descriptor text in a database, software tool, or public repository without\n  a valid AMA CPT license is a copyright violation. Use numeric ranges, SNOMED mappings, or plain\n  English descriptions of procedure families instead.\n- **When Category III codes govern the procedure of interest without awareness of the transition\n  timeline.** If a Category III code was promoted to Category I mid-study, using only one code\n  family will produce a time-truncated ascertainment that looks like a sudden change in utilization\n  when it is actually a coding-system change.\n- **As a substitute for LOINC in laboratory result analysis.** CPT identifies the billed lab test;\n  it cannot provide the result value or the measured analyte in a LOINC-structured way. Using CPT\n  to identify a lab order is correct; treating CPT as equivalent to a structured lab result is an\n  analytical error.",
    "primary_category": "Data_Standard",
    "tags": [
      "coding-system",
      "data-standard",
      "primitive",
      "procedures",
      "cpt",
      "hcpcs",
      "claims",
      "professional-claim",
      "outpatient",
      "e-and-m",
      "billing",
      "ama"
    ],
    "applies_to_study_types": [
      "claims_analysis",
      "cohort_retrospective",
      "new_user",
      "active_comparator_new_user",
      "ehr_study",
      "linked_data"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1136/neurintsurg-2014-011156",
        "url": "https://doi.org/10.1136/neurintsurg-2014-011156",
        "citation_text": "Hirsch JA, Leslie-Mazwi TM, Nicola GN, Bhatt DL, Jovin TG, Baxter BW, et al. Current procedural terminology; a primer. Journal of NeuroInterventional Surgery. 2015;7(4):309-312.",
        "year": 2015,
        "authors_short": "Hirsch et al.",
        "notes": "Accessible primer on CPT code structure, the AMA Editorial Panel process, and how CPT codes are created, revised, and deleted — the foundational reference for understanding CPT as a coding system before applying it in claims research."
      },
      {
        "role": "explain",
        "doi": "10.1111/jgs.15948",
        "url": "https://doi.org/10.1111/jgs.15948",
        "citation_text": "Berenson RA, Lazaroff AE. Time Is of the Essence: Solving Office Visit Coding Problems. Journal of the American Geriatrics Society. 2019;67(8):1568-1573.",
        "year": 2019,
        "authors_short": "Berenson & Lazaroff",
        "notes": "Explains the practical complexity of E/M visit CPT coding — time-based vs complexity-based documentation, the 99202-99215 family, and how coding decisions by clinicians translate into the claims data fields that RWE analysts consume."
      },
      {
        "role": "demonstrate",
        "doi": "10.1038/sj.clpt.6100249",
        "url": "https://doi.org/10.1038/sj.clpt.6100249",
        "citation_text": "Schneeweiss S. Developments in post-marketing comparative effectiveness research. Clinical Pharmacology and Therapeutics. 2007;82(2):143-156.",
        "year": 2007,
        "authors_short": "Schneeweiss",
        "notes": "Demonstrates how administrative claims data — including CPT-coded procedures and services — are operationalized in comparative effectiveness research: code-list construction, exposure and outcome ascertainment, and the validity assumptions underlying claims-based procedure identification."
      },
      {
        "role": "use",
        "doi": "10.1002/cpt.359",
        "url": "https://doi.org/10.1002/cpt.359",
        "citation_text": "Lin KJ, Schneeweiss S. Considerations for the analysis of longitudinal electronic health records linked to claims data to study the effectiveness and safety of drugs. Clinical Pharmacology and Therapeutics. 2016;100(2):147-159.",
        "year": 2016,
        "authors_short": "Lin & Schneeweiss",
        "notes": "Illustrates how CPT-coded procedure and service lines in claims data integrate with EHR data for drug effectiveness and safety studies — including the complementary role of CPT (billing precision) and EHR structured data (clinical granularity) for procedure ascertainment and confounder identification in linked analyses."
      },
      {
        "role": "use",
        "doi": null,
        "url": "https://www.ama-assn.org/practice-management/cpt",
        "citation_text": "American Medical Association. CPT (Current Procedural Terminology) Overview. AMA; 2024. Available at: https://www.ama-assn.org/practice-management/cpt",
        "year": 2024,
        "authors_short": "AMA",
        "notes": "Official AMA overview of CPT: code categories, the Editorial Panel process, annual update cycle, and licensing terms. Authoritative source for the copyright and licensing requirements that govern redistribution of CPT codes and descriptor text."
      },
      {
        "role": "use",
        "doi": null,
        "url": "https://www.cms.gov/medicare/coding-billing/healthcare-common-procedure-system",
        "citation_text": "Centers for Medicare and Medicaid Services. Healthcare Common Procedure Coding System (HCPCS). CMS; 2024. Available at: https://www.cms.gov/medicare/coding-billing/healthcare-common-procedure-system",
        "year": 2024,
        "authors_short": "CMS",
        "notes": "CMS reference page confirming the two-level HCPCS structure: Level I (CPT, AMA-maintained) and Level II (CMS-maintained). Establishes the HIPAA mandate for HCPCS use in Medicare and Medicaid claims processing — the regulatory basis for CPT's universal adoption in US claims."
      }
    ],
    "plain_language_summary": "CPT codes are five-digit numbers printed on every US doctor's bill and outpatient hospital charge that tell the insurer exactly which service was performed — an office visit, a blood test, a surgery, or a chemotherapy infusion. When researchers study what care patients received using insurance claims, they look up these codes to find who had a procedure or visit and when. The important catch is that CPT codes do not appear on regular hospital admission bills — hospitals use a completely different system called ICD-10-PCS for inpatient procedures — so researchers who rely only on CPT will miss every procedure done during a hospital stay.",
    "key_terms": [
      {
        "term": "HCPCS Level I",
        "definition": "The formal HIPAA name for CPT codes: 'Level I' means CPT codes (five digits, AMA-maintained), while 'Level II' refers to a separate set of codes maintained by CMS that cover drugs, equipment, and supplies not described by CPT."
      },
      {
        "term": "Evaluation and Management (E/M) visit",
        "definition": "A billed office or outpatient encounter between a clinician and patient; E/M visits make up the largest single category of CPT codes and are the primary way researchers count outpatient contact events in claims data."
      },
      {
        "term": "modifier",
        "definition": "A two-character code attached to a CPT code that tells the insurer something important about how the service was delivered — for example, that only one side of the body was treated, or that two separate services happened on the same day."
      },
      {
        "term": "professional claim",
        "definition": "The bill submitted by a physician or other clinician for their own services, filed on the CMS-1500 form; CPT codes appear here for every service line."
      },
      {
        "term": "Category III code",
        "definition": "A temporary CPT code (four digits plus the letter T) used for new or experimental procedures; if the procedure becomes common enough, the AMA replaces it with a permanent five-digit Category I code, which means historical claims data will show two different codes for the same procedure at different time points."
      },
      {
        "term": "ICD-10-PCS",
        "definition": "The separate, public-domain code system used on inpatient hospital bills to describe surgical and procedural services during a hospital stay; it is maintained by CMS, not the AMA, and its codes look nothing like CPT codes — confusing the two is one of the most common errors in claims-based procedure research."
      }
    ],
    "worked_example": {
      "scenario": "A health outcomes researcher wants to count how many adult patients in a commercial insurance database had an office or outpatient visit for diabetes management in a single calendar year, using professional claims (CMS-1500 / 837P). The goal is a simple utilization rate: unique patients with at least one qualifying visit, and total visits per 100 enrolled members. The researcher must identify the right CPT codes without using licensed descriptor text, and must decide how to count — by service line or by date — to avoid inflating the number.\n",
      "dataset": {
        "caption": "Professional claim lines for three patients in the calendar year. Each row is one service line on a CMS-1500 claim. A single visit can generate more than one line (e.g., the E/M visit code plus a separate procedure code on the same date).",
        "columns": [
          "person_id",
          "service_date",
          "cpt_code",
          "place_of_service",
          "paid_amount_usd"
        ],
        "rows": [
          [
            1001,
            "2023-03-14",
            "99213",
            "11",
            95.0
          ],
          [
            1001,
            "2023-03-14",
            "82947",
            "11",
            12.0
          ],
          [
            1001,
            "2023-09-05",
            "99214",
            "11",
            130.0
          ],
          [
            1002,
            "2023-06-20",
            "99212",
            "11",
            72.0
          ],
          [
            1003,
            "2023-01-10",
            "99203",
            "11",
            88.0
          ],
          [
            1003,
            "2023-01-10",
            "99203",
            "11",
            88.0
          ]
        ]
      },
      "steps": [
        "Identify the E/M office visit range: CPT codes 99202 through 99215 cover new and established patient office and outpatient E/M visits. A numeric range filter (cpt_code >= '99202' AND cpt_code <= '99215') captures this family without needing to reproduce any licensed descriptor text.",
        "Apply the place-of-service filter: place_of_service = '11' means the physician's office. This excludes E/M codes billed from emergency departments (23), hospitals (21), or telehealth (02/10), keeping only office visits.",
        "Count by unique person_id and service_date, not by row: person 1001 has two rows on 2023-03-14 (the E/M code 99213 and a glucose test code 82947 on the same day). That is one visit, not two. Deduplication: count distinct (person_id, service_date) pairs among qualifying E/M lines.",
        "Person 1003 has two identical rows on 2023-01-10 — likely a duplicate claim line or a billing resubmission. After deduplication by (person_id, service_date, cpt_code), this collapses to one qualifying visit.",
        "After deduplication, tally qualifying visit-dates per person: person 1001 has 2 (March 14 + September 5), person 1002 has 1 (June 20), person 1003 has 1 (January 10). Total unique qualifying visit-dates = 2 + 1 + 1 = 4.",
        "Count unique patients with at least one qualifying visit: all 3 patients qualify."
      ],
      "result": "3 unique patients had at least one office E/M visit (CPT 99202-99215, place-of-service 11) in the calendar year. Deduplicating by (person_id, service_date) yields 4 total visit-dates across the 3 patients (person 1001 contributed 2, persons 1002 and 1003 each contributed 1). The glucose test line (CPT 82947) is excluded because its code falls outside the 99202-99215 E/M range; the duplicate row for person 1003 collapses to 1 after deduplication: 4 qualifying visit-dates / 3 unique patients = 1.33 visits per patient in the year."
    },
    "prerequisites": [],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Professional-only vs union (professional + facility) code set",
        "description": "A professional-only CPT code set captures procedures from physician/supplier claims only; a union set also includes CPT codes from outpatient facility (UB-04) claims. The professional-only approach is simpler but misses hospital outpatient department procedures where the facility and professional components are billed separately — the professional claim may be present (surgeon billed the procedure) but the facility claim carries additional service lines.",
        "edge_cases": [
          "In an HOPD setting, the same procedure generates a facility UB-04 claim (with revenue codes and CPT in FL44) and a professional CMS-1500 claim (the surgeon's fee). Naively merging both without deduplication double-counts the procedure.",
          "ASC (ambulatory surgery center) claims are facility claims but use CPT codes in FL44 — they look like professional claims in code content but like facility claims in header fields."
        ],
        "data_source_notes": "claims: join carrier/physician file with outpatient facility file on (person_id, service_date) within a short window (same day or ±1 day) and deduplicate before counting events; flag the care setting from place-of-service or claim type field, not from the CPT code alone."
      },
      {
        "name": "Category I numeric-range filtering vs explicit code enumeration",
        "description": "For broadly defined procedure families (all E/M visits, all infusion administration codes), a numeric range filter on the five-digit code is often more robust and more license-safe than enumerating every individual code: the range captures all codes in the section without requiring the licensed descriptor text. For narrow procedure definitions (a specific surgical technique), explicit code enumeration with validation against a clinical consultant is more precise.",
        "edge_cases": [
          "A numeric range that is too broad will include retired or added codes across annual updates; specify the target CPT version and validate the range boundary against the update notes.",
          "Section boundaries can shift in major CPT restructurings (e.g., the 2021 E/M overhaul that retired 99201 and modified time-based documentation for 99202-99215); code that assumed the pre-2021 structure may silently mis-count visits after the update year."
        ],
        "data_source_notes": "claims: verify that CPT codes in the data are stored as 5-character strings with leading zeros preserved (e.g., '00100' not 100); some systems strip leading zeros from numeric fields, breaking range filters and exact-match lookups."
      },
      {
        "name": "Category III to Category I code transition management",
        "description": "When a Category III (####T) code is promoted to a Category I (five-digit) code mid-study, longitudinal analyses must use both codes — the T-code for earlier claims and the new five-digit code for later claims — to avoid a spurious utilization drop at the transition point. The AMA publishes crosswalk tables in the CPT changes book for each annual update.",
        "edge_cases": [
          "During the transition year, some providers will have switched to the new code while others still report the old T-code, creating heterogeneous coding in the same calendar-year data.",
          "The semiannual Category III update (July) creates mid-year transitions that split the data within a single calendar year."
        ],
        "data_source_notes": "claims: for any Category III code in the study, flag the transition date and search for both the T-code and its promoted Category I successor across the full study window."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "icd-10-pcs-procedure-coding",
        "pros_of_this": "CPT captures all professional and outpatient facility services with high payment-driven completeness; it is the only procedure signal on physician claims and is required for all reimbursable outpatient acts. Its five-digit numeric structure is compact and consistently formatted.",
        "cons_of_this": "CPT does not appear on inpatient facility claims (ICD-10-PCS governs there), is AMA-copyrighted (requiring a license for descriptor redistribution), and lacks the structural clinical detail (approach, device, qualifier) encoded in ICD-10-PCS.",
        "when_to_prefer": "Use CPT when the procedure of interest is primarily or exclusively performed in outpatient or ambulatory settings, or when studying professional services (E/M visits, infusion administration, imaging). Use ICD-10-PCS for inpatient procedures, or use both in a union code set when the procedure occurs in both settings."
      },
      {
        "compared_to": "hcpcs-level-ii-j-codes",
        "pros_of_this": "CPT covers the full breadth of professional and procedural services; HCPCS Level II fills narrow gaps (drugs, DME, supplies) that CPT does not address. For non-drug service identification, CPT is the correct and complete source.",
        "cons_of_this": "CPT does not identify specific drug products — HCPCS Level II J-codes are required for drug-level identification in infusion studies. CPT administration codes and J-codes are complements on the same claim, not alternatives.",
        "when_to_prefer": "Use CPT for service/procedure identification. Use HCPCS Level II (J-codes) for drug identification. Use both simultaneously for drug administration studies that need to link the specific agent to its administration event."
      },
      {
        "compared_to": "omop-standardized-vocabularies",
        "pros_of_this": "CPT is the native billing code in US claims — using it directly avoids the mapping error rate inherent in any vocabulary translation and ensures the code list matches exactly what payers accept for reimbursement.",
        "cons_of_this": "CPT in OMOP requires an Athena license-acceptance step for the CPT4 vocabulary; once mapped to SNOMED standard concepts, OMOP queries can use hierarchical ancestor logic that CPT's flat structure does not support. Raw CPT lists must enumerate every code explicitly.",
        "when_to_prefer": "Use CPT directly when working in raw claims data and when the code list is based on billing-validated code sets. Use OMOP SNOMED standard concepts when the analysis requires cross-database portability, hierarchical querying, or avoidance of the CPT4 license step."
      },
      {
        "compared_to": "revenue-center-codes",
        "pros_of_this": "CPT precisely identifies the specific procedure or service; revenue codes identify only the cost center (operating room, observation, emergency department), which is too broad for procedure-level ascertainment.",
        "cons_of_this": "Revenue codes appear on all UB-04 facility claims, including inpatient; CPT does not appear on inpatient UB-04 claims. Revenue codes are therefore the correct signal for setting identification (was this an inpatient vs outpatient encounter?), while CPT is the correct signal for procedure identification within the outpatient/professional setting.",
        "when_to_prefer": "Use CPT for procedure identification. Use revenue codes to confirm or restrict the care setting (e.g., require a specific revenue code to confirm the service was from an outpatient or ASC claim, not an inpatient episode)."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Store CPT codes as 5-character strings with leading zeros; verify this in your specific database before any range filter. Professional claims: join carrier/physician file and filter on CPT range in the procedure code field (typically called HCPCS or CPT field in Medicare data). Outpatient facility: join the institutional outpatient file and filter on FL44 (revenue/procedure line CPT). Exclude inpatient facility claims (they contain ICD-10-PCS, not CPT) by filtering on claim type. Deduplicate by (person_id, service_date, cpt_code) at minimum; for event counting, further deduplicate to (person_id, service_date) to avoid counting split-billed services as separate events. For E/M visit studies, restrict to place-of-service 11 (office) or appropriate telehealth codes; for HOPD studies, require a matching revenue code on the same claim. For drug administration, always join CPT administration codes to the concurrent HCPCS J-code lines.",
      "ehr": "EHR procedure tables may contain CPT codes from charge capture systems, but EHR-native procedures are more commonly represented as SNOMED codes in structured fields or as free-text in operative notes. When CPT codes appear in EHR data, they reflect the billing-documented procedure, not necessarily the full clinical record. Validate CPT completeness against billing records and note that procedures performed outside the health system will not appear in EHR data at all. For lab test identification, EHR billing systems use CPT on the order side while lab result tables use LOINC — do not conflate the two.",
      "linked": "In linked claims-EHR data, CPT codes from the claims side are the authoritative source for procedure billing; EHR procedure orders are the authoritative source for clinical procedure detail (laterality, surgeon, indication). Reconcile date discrepancies: the service date on the professional claim may differ from the encounter date in the EHR by one day due to billing submission timing. When CPT codes appear in both the claims and EHR data, prefer the claims-side CPT for reimbursement-based analyses (the billing record is definitive) and the EHR-side procedure record for clinical detail."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import re\nimport pandas as pd\nfrom typing import Optional\n\n# ── CPT format patterns ───────────────────────────────────────────────────────\n# Category I:   exactly 5 numeric digits (leading zeros preserved as strings)\n# Category II:  4 numeric digits + \"F\"\n# Category III: 4 numeric digits + \"T\"\n# PLA:          4 numeric digits + \"U\"\n_CAT_I   = re.compile(r\"^\\d{5}$\")\n_CAT_II  = re.compile(r\"^\\d{4}F$\")\n_CAT_III = re.compile(r\"^\\d{4}T$\")\n_PLA     = re.compile(r\"^\\d{4}U$\")\n# Modifier: 2-character alphanumeric suffix, stored separately in most datasets\n_MODIFIER = re.compile(r\"^[A-Z0-9]{2}$\")\n\ndef classify_cpt(code: str) -> Optional[str]:\n    \"\"\"Return CPT category string or None if the code does not match any known format.\"\"\"\n    if not isinstance(code, str):\n        return None\n    code = code.strip().upper()\n    if _CAT_I.match(code):\n        return \"Category_I\"\n    if _CAT_II.match(code):\n        return \"Category_II\"\n    if _CAT_III.match(code):\n        return \"Category_III\"\n    if _PLA.match(code):\n        return \"PLA\"\n    return None\n\ndef is_valid_cpt(code: str) -> bool:\n    \"\"\"True if the code string matches any valid CPT format.\"\"\"\n    return classify_cpt(code) is not None\n\n# ── E/M office visit range filter ────────────────────────────────────────────\n# Office and outpatient E/M visits: CPT 99202–99215 (post-2021 restructuring).\n# 99201 was retired effective 2021-01-01 (no longer a valid code after that date).\n# Range-based filtering avoids reproducing licensed descriptor text.\n_EM_OFFICE_MIN = 99202\n_EM_OFFICE_MAX = 99215\n\n# Place of service codes for office/outpatient E/M visits (CMS POS codes)\nOFFICE_POS       = {\"11\"}          # physician office\nTELEHEALTH_POS   = {\"02\", \"10\"}    # telehealth (distant site / patient home)\nOUTPATIENT_POS   = {\"22\", \"19\", \"49\"}  # outpatient hospital, off-campus HOPD, independent clinic\n\ndef 
flag_em_office_visit(cpt_code: str, place_of_service: str) -> bool:\n    \"\"\"\n    Return True if the claim line represents an office E/M visit.\n    Requires BOTH a code in the 99202-99215 range AND an office\n    place-of-service code (OFFICE_POS, i.e. POS 11). TELEHEALTH_POS and\n    OUTPATIENT_POS are defined above but not applied here; union them into\n    the check if those settings are in scope. The POS test excludes lines\n    billed from an ED (23), inpatient hospital (21), or SNF (31).\n    \"\"\"\n    if classify_cpt(cpt_code) != \"Category_I\":\n        return False\n    try:\n        n = int(cpt_code)\n    except ValueError:\n        return False\n    in_em_range = _EM_OFFICE_MIN <= n <= _EM_OFFICE_MAX\n    in_office_setting = str(place_of_service).strip() in OFFICE_POS\n    return in_em_range and in_office_setting\n\n# ── Deduplication: count unique visit dates, not service lines ─────────────\ndef count_unique_em_visits(\n    claims_df: pd.DataFrame,\n    person_col: str = \"person_id\",\n    date_col: str = \"service_date\",\n    cpt_col: str = \"cpt_code\",\n    pos_col: str = \"place_of_service\",\n) -> pd.DataFrame:\n    \"\"\"\n    From a professional claims DataFrame, return a summary of unique E/M office\n    visit dates per person.\n\n    Deduplication logic:\n      1. Keep only lines where flag_em_office_visit() is True.\n      2. 
Deduplicate to unique (person_id, service_date) pairs — a single date with\n         multiple qualifying CPT lines (e.g., E/M code plus a modifier variant) counts\n         as one visit, not multiple.\n\n    Returns a DataFrame with columns [person_id, em_visit_count].\n    \"\"\"\n    claims_df = claims_df.copy()\n    claims_df[cpt_col] = claims_df[cpt_col].astype(str).str.strip().str.upper()\n    claims_df[\"_em_flag\"] = claims_df.apply(\n        lambda r: flag_em_office_visit(str(r[cpt_col]), str(r[pos_col])), axis=1\n    )\n    em_lines = claims_df[claims_df[\"_em_flag\"]].copy()\n    # Unique visit dates: deduplicate on (person_id, service_date)\n    unique_visits = (\n        em_lines[[person_col, date_col]]\n        .drop_duplicates()\n        .groupby(person_col)\n        .size()\n        .reset_index(name=\"em_visit_count\")\n    )\n    return unique_visits\n\n# ── Category III transition guard ─────────────────────────────────────────\ndef warn_category_iii_codes(cpt_list: list[str]) -> list[str]:\n    \"\"\"\n    Given a list of CPT codes, identify any Category III (####T) codes and\n    return them with a warning. The caller should verify whether a Category I\n    successor code exists for the study window.\n    \"\"\"\n    cat_iii = [c for c in cpt_list if classify_cpt(c) == \"Category_III\"]\n    if cat_iii:\n        print(\n            f\"WARNING: {len(cat_iii)} Category III (emerging technology) code(s) detected: \"\n            f\"{cat_iii}. Verify whether a Category I successor code exists for any part \"\n            \"of the study window. 
Using only the T-code may produce time-truncated \"\n            \"ascertainment if a promotion occurred mid-study.\"\n        )\n    return cat_iii\n\n# ── Example usage ─────────────────────────────────────────────────────────\nif __name__ == \"__main__\":\n    # Synthetic professional claim lines (no licensed descriptor text)\n    sample = pd.DataFrame({\n        \"person_id\":        [1001, 1001, 1001, 1002, 1003, 1003],\n        \"service_date\":     [\"2023-03-14\", \"2023-03-14\", \"2023-09-05\",\n                             \"2023-06-20\", \"2023-01-10\", \"2023-01-10\"],\n        \"cpt_code\":         [\"99213\", \"82947\", \"99214\", \"99212\", \"99203\", \"99203\"],\n        \"place_of_service\": [\"11\", \"11\", \"11\", \"11\", \"11\", \"11\"],\n    })\n\n    # Validate all codes\n    sample[\"cpt_category\"] = sample[\"cpt_code\"].apply(classify_cpt)\n    print(\"Code classification:\")\n    print(sample[[\"cpt_code\", \"cpt_category\"]].to_string(index=False))\n\n    # Count unique E/M office visits per person\n    visit_summary = count_unique_em_visits(sample)\n    print(\"\\nUnique E/M office visit dates per person:\")\n    print(visit_summary.to_string(index=False))\n    # Expected: person 1001=2, 1002=1, 1003=1 (duplicate row deduped)",
        "description": "Format validation and range-based identification of CPT code categories and the E/M office visit family from professional claims. Uses only the numeric code format — no licensed descriptor text is embedded. All pattern matching is done against code format and numeric range, consistent with open research use.",
        "dependencies": ["pandas"],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(dplyr)\nlibrary(stringr)\n\n# ── CPT format classification ─────────────────────────────────────────────\n# Category I:   exactly 5 digits\n# Category II:  4 digits + F\n# Category III: 4 digits + T\n# PLA:          4 digits + U\nclassify_cpt <- function(code) {\n  code <- trimws(toupper(as.character(code)))\n  dplyr::case_when(\n    stringr::str_detect(code, \"^\\\\d{5}$\")  ~ \"Category_I\",\n    stringr::str_detect(code, \"^\\\\d{4}F$\") ~ \"Category_II\",\n    stringr::str_detect(code, \"^\\\\d{4}T$\") ~ \"Category_III\",\n    stringr::str_detect(code, \"^\\\\d{4}U$\") ~ \"PLA\",\n    TRUE                                    ~ NA_character_\n  )\n}\n\nis_valid_cpt <- function(code) !is.na(classify_cpt(code))\n\n# ── E/M office visit flag ─────────────────────────────────────────────────\n# Office/outpatient E/M visits: CPT 99202-99215 at place_of_service = \"11\"\n# 99201 retired 2021-01-01; range starts at 99202 for post-2021 data.\nEM_OFFICE_MIN <- 99202L\nEM_OFFICE_MAX <- 99215L\nOFFICE_POS    <- c(\"11\")   # physician office; add \"02\",\"10\" for telehealth if needed\n\nflag_em_office_visit <- function(cpt_code, place_of_service) {\n  cat <- classify_cpt(cpt_code)\n  n   <- suppressWarnings(as.integer(cpt_code))\n  is_em_range    <- !is.na(n) & n >= EM_OFFICE_MIN & n <= EM_OFFICE_MAX\n  is_office      <- trimws(as.character(place_of_service)) %in% OFFICE_POS\n  cat == \"Category_I\" & is_em_range & is_office\n}\n\n# ── Unique visit date count ───────────────────────────────────────────────\ncount_unique_em_visits <- function(df,\n                                   person_col = \"person_id\",\n                                   date_col   = \"service_date\",\n                                   cpt_col    = \"cpt_code\",\n                                   pos_col    = \"place_of_service\") {\n  df |>\n    dplyr::mutate(\n      .em_flag = flag_em_office_visit(.data[[cpt_col]], .data[[pos_col]])\n    ) |>\n    dplyr::filter(.em_flag) 
|>\n    dplyr::distinct(.data[[person_col]], .data[[date_col]]) |>\n    dplyr::count(.data[[person_col]], name = \"em_visit_count\")\n}\n\n# ── Category III transition warning ──────────────────────────────────────\nwarn_category_iii <- function(cpt_vec) {\n  # %in% (not ==) so codes with an NA category are dropped instead of returned as NA\n  cat3 <- unique(cpt_vec[classify_cpt(cpt_vec) %in% \"Category_III\"])\n  if (length(cat3) > 0) {\n    warning(\n      sprintf(\n        \"%d Category III (emerging technology) code(s) in code list: %s. \",\n        length(cat3), paste(cat3, collapse = \", \")\n      ),\n      \"Verify whether a Category I successor exists for any part of the study window.\",\n      call. = FALSE\n    )\n  }\n  invisible(cat3)\n}\n\n# ── Example ───────────────────────────────────────────────────────────────\nsample_claims <- tibble::tibble(\n  person_id        = c(1001L, 1001L, 1001L, 1002L, 1003L, 1003L),\n  service_date     = as.Date(c(\"2023-03-14\",\"2023-03-14\",\"2023-09-05\",\n                               \"2023-06-20\",\"2023-01-10\",\"2023-01-10\")),\n  cpt_code         = c(\"99213\",\"82947\",\"99214\",\"99212\",\"99203\",\"99203\"),\n  place_of_service = c(\"11\",\"11\",\"11\",\"11\",\"11\",\"11\")\n)\n\n# Classify codes\nsample_claims <- sample_claims |>\n  dplyr::mutate(cpt_category = classify_cpt(cpt_code))\nprint(sample_claims |> dplyr::select(cpt_code, cpt_category))\n\n# Count unique E/M office visits per person\nvisit_summary <- count_unique_em_visits(sample_claims)\nprint(visit_summary)\n# Expected: person 1001=2, person 1002=1, person 1003=1",
        "description": "CPT format validation and E/M office visit ascertainment in R, using only numeric range logic against the code string — no licensed descriptor text. Compatible with Medicare and commercial claims data where CPT codes are stored as character strings.",
        "dependencies": ["dplyr", "stringr", "tibble"],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [],
    "relations": [
      {
        "relation_type": "sibling",
        "target_slug": "hcpcs-level-ii-j-codes",
        "notes": "CPT is HCPCS Level I; HCPCS Level II fills gaps that CPT does not cover (drugs, biologics, durable medical equipment, supplies). Professional claims routinely carry both a CPT administration code and a HCPCS Level II J-code on infusion service lines — the two are complementary, not interchangeable."
      },
      {
        "relation_type": "sibling",
        "target_slug": "icd-10-pcs-procedure-coding",
        "notes": "ICD-10-PCS is the inpatient facility procedure coding counterpart to CPT: CPT covers professional and outpatient facility services; ICD-10-PCS covers inpatient facility procedures. A complete procedure ascertainment in claims requires both — CPT on professional and outpatient claims, ICD-10-PCS on inpatient UB-04 claims."
      },
      {
        "relation_type": "used_with",
        "target_slug": "cms-1500-professional-claim-fields",
        "notes": "CPT codes appear in field 24D of the CMS-1500 paper form and the equivalent 837P electronic transaction — the primary source of professional service records in US claims data."
      },
      {
        "relation_type": "used_with",
        "target_slug": "place-of-service-codes",
        "notes": "CPT codes must be interpreted alongside place-of-service (POS) codes to correctly classify the care setting: the same CPT code in an office (POS 11) vs an emergency department (POS 23) vs a telehealth encounter (POS 10) represents materially different care contexts."
      },
      {
        "relation_type": "used_with",
        "target_slug": "revenue-center-codes",
        "notes": "On outpatient facility (UB-04) claims, CPT codes appear in FL44 alongside revenue center codes. Revenue codes identify the cost center (operating room, observation, ED); CPT identifies the specific procedure. Both are required for a complete outpatient facility procedure record."
      },
      {
        "relation_type": "used_with",
        "target_slug": "omop-standardized-vocabularies",
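        "notes": "In OMOP, CPT4 is distributed through the Athena vocabulary service, and valid CPT4 codes are themselves standard concepts in the Procedure domain rather than source codes that must be mapped to SNOMED. Downloading CPT4 from Athena requires an extra licensing step (the concept names are reconstituted locally using a UMLS account), which most freely downloadable OMOP vocabularies do not."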
      },
      {
        "relation_type": "used_with",
        "target_slug": "medical-code-crosswalks-mappings",
        "notes": "CPT codes are crosswalked to SNOMED CT, HCPCS Level II, and historical ICD-9-CM procedure codes for longitudinal studies spanning the ICD transition or for cross-database analyses. The AMA publishes CPT-to-SNOMED mappings as part of licensed CPT products."
      },
      {
        "relation_type": "parent",
        "target_slug": "procedure-identification-and-measurement-in-claims-ehr",
        "notes": "CPT is the primary coding primitive for procedure identification in US professional and outpatient facility claims; the parent entry covers the full operational framework (multi-system union code sets, de-duplication, setting rules, validity) that governs how CPT and other code systems are assembled into a defensible procedure definition."
      },
      {
        "relation_type": "used_with",
        "target_slug": "claims-analysis",
        "notes": "CPT is one of the core code systems for claims-based research; the claims-analysis entry covers the broader architecture of claims data (enrollment, eligibility, claim types, data quality) within which CPT-based procedure ascertainment operates."
      },
      {
        "relation_type": "used_with",
        "target_slug": "drug-utilization",
        "notes": "For specialty infusion drugs, CPT administration codes (identifying the infusion service) and HCPCS Level II J-codes (identifying the specific drug) appear together on the same professional claim; drug utilization studies must work with both."
      }
    ],
    "aliases": [
      "CPT",
      "CPT-4",
      "Current Procedural Terminology",
      "HCPCS Level I"
    ],
    "complexity": "foundational",
    "regulatory_relevance": [],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "cross-sectional",
    "name": "Cross-Sectional Study",
    "short_definition": "An observational design that measures exposure and outcome status simultaneously in a population at a single point (or short window) in time, yielding prevalence rather than incidence and supporting descriptive and association — but not, in general, causal — inference.",
    "long_description": "A **cross-sectional study** takes a snapshot of a defined population at one calendar instant (point prevalence) or over a\nshort interval (period prevalence) and ascertains exposure and outcome *together*, with no inherent follow-up. Because both\nthe numerator (people with the condition or on treatment) and the denominator (the whole snapshot population) are read off\nthe same moment, the natural quantity is a **prevalence**, not an incidence rate. In RWE, the snapshot is built from a claims\nor EHR database by fixing a prevalence date, requiring observable enrollment spanning that date, and evaluating active\ntreatment status and condition status as-of-date from surrounding records. This makes cross-sectional designs the workhorse\nof disease-burden estimation, treated-prevalence reporting, HEDIS-style quality measurement, and feasibility/landscape\nanalyses — and a recurring trap when analysts mistake a prevalence association for a causal effect.\n\n**Core conceptual distinction** A cross-sectional study is defined by *simultaneous* ascertainment and the *prevalence*\nestimand, which separates it from every longitudinal design. (1) *Prevalence vs incidence*: the fundamental identity is\n**prevalence ≈ incidence × mean disease duration** (steady state and low prevalence; more exactly, prevalence odds =\nincidence × mean duration). A cross-sectional sample therefore cannot by itself recover an\nincidence rate, and it over-represents long-duration/chronic cases while under-representing short-duration cases — those who\nrecovered quickly or died before the snapshot are gone. (2) *Temporal ambiguity*: because exposure and outcome are measured\nat the same time, the design generally cannot establish that exposure preceded outcome, so **reverse causation** is not\nidentifiable from the data alone. (3) *Estimand for association*: the two legitimate measures are the **prevalence ratio\n(PR)** and the **prevalence odds ratio (POR)**. 
The PR is estimated by log-binomial or Poisson regression with robust\n(sandwich) standard errors; the POR by logistic regression. When the outcome is common (prevalence above roughly 10%) the\nPOR materially overstates the PR and should not be interpreted as a risk ratio — a frequent and avoidable error. Contrast\nthis with a **cohort** (incidence, person-time, temporal order preserved) and a **case-control** (sampling on outcome to\nestimate an odds ratio efficiently for rare disease); the cross-sectional design samples on neither exposure nor outcome but\nreads both off a fixed population at one time.\n\n**Pros, cons, and trade-offs**\n- **vs cohort (prospective or retrospective):** Cross-sectional is far cheaper and faster, needs no follow-up, and directly\n  yields prevalence and treated-prevalence — exactly what disease-burden and HTA epidemiology sections require. Cost: it\n  cannot estimate incidence, risk, or hazards, cannot establish temporal order, and is wide open to prevalence–incidence\n  (Neyman) bias. **Prefer cross-sectional** for \"how many / how treated, right now\" questions; **prefer a cohort** the moment\n  the question is \"what happens to people over time\" or \"does exposure cause the outcome.\"\n- **vs case-control:** Both are efficient single-pass designs, but case-control samples on the outcome (good for rare disease,\n  estimates an OR via incidence-density or cumulative sampling) whereas cross-sectional samples the whole population (good for\n  common conditions, estimates prevalence directly). Cost of cross-sectional: poor for rare outcomes (few cases in a snapshot)\n  and prone to including only survivors. 
**Prefer case-control** for rare disease etiology; **prefer cross-sectional** for\n  prevalence and for screening/diagnostic-test performance evaluated at one time.\n- **vs ecological:** Cross-sectional uses individual-level exposure and outcome, so it avoids the **ecological fallacy** that\n  afflicts group-level correlation studies. **Prefer cross-sectional** whenever individual records are available.\n- **vs repeated cross-sections / serial surveys:** A single snapshot gives a point estimate; stacking annual snapshots\n  describes population-level *trends* but still cannot follow individuals. **Prefer repeated cross-sections** for monitoring\n  prevalence over calendar time; a true cohort for within-person change.\n\n**When to use** Estimating point/period/annual **prevalence** of a disease or of treatment (treated prevalence); HEDIS and\nPQA-style denominator-based quality measurement; disease-burden and epidemiology inputs for HTA dossiers and budget-impact\nmodels; database **feasibility and landscape** scans (how many eligible patients exist as-of a date); evaluating\n**diagnostic/screening test performance** (sensitivity, specificity, PPV) at a single examination; describing the\ncross-sectional distribution of comorbidities, utilization, or SDoH in a population. Use it whenever the question is genuinely\nabout the *state of a population at a time*, not about *change or causation over time*.\n\n**When NOT to use — and when it is actively misleading or dangerous**\n- **Etiologic / causal claims about a serious or fatal outcome.** This is the signature danger. Through\n  **prevalence–incidence (Neyman) bias**, severe and rapidly fatal cases never reach the prevalence date, so the snapshot is\n  enriched for survivors. A protective-looking exposure–outcome association can be entirely an artifact of differential\n  survival into the cross-section. 
Never use a cross-sectional prevalence association to argue a treatment causes or prevents\n  a serious outcome.\n- **Any question requiring temporal order.** If reverse causation is plausible (e.g., does the drug cause the symptom, or does\n  the symptom prompt the drug?), the design cannot adjudicate it. Use a new-user cohort or target-trial emulation instead.\n- **Incidence, risk, or rate estimation.** Cross-sectional data cannot produce incidence rates, hazard ratios, or\n  cumulative-risk curves; forcing them is a category error.\n- **Rare outcomes.** A snapshot of even a large database may contain too few prevalent cases for stable estimation; a\n  case-control or registry design is more efficient.\n- **Reporting a POR as if it were a risk ratio when prevalence is high.** With common outcomes the POR exaggerates the PR;\n  report a PR (log-binomial/Poisson + robust SE) when a ratio of probabilities is what stakeholders will interpret.\n\n**Data-source operational depth**\n- **Claims (FFS vs MA vs commercial):** The snapshot population is everyone with observable, continuous enrollment spanning\n  the prevalence date. Treated status as-of-date has two materially different operationalizations: an **active-supply rule**\n  (a fill where `fill_date ≤ index_date ≤ fill_date + days_supply`, i.e., the days' supply still covers the snapshot) versus a\n  looser **\"any fill in the prior N days\"** rule; these yield different numerators and must be pre-specified and justified.\n  Failure modes: (a) **Medicare Advantage person-time is invisible in fee-for-service claims** — MA enrollees' fills and\n  encounters are not in the FFS feed, so they look untreated/undiagnosed and are silently dropped, biasing treated prevalence;\n  restrict the denominator to enrollees with the relevant benefit (A/B/D or commercial medical+pharmacy) and exclude MA-only\n  person-time. 
(b) **Differential disenrollment and survival to the prevalence date** operationalizes Neyman bias — patients\n  who died or churned out before the snapshot are absent, so prevalence reflects survivors. (c) Mail-order 90-day fills,\n  sample fills, and stockpiling distort `days_supply` and thus the active-supply window. (d) A diagnosis \"as-of-date\" usually\n  requires a coded condition in a lookback window (e.g., ≥1 inpatient or ≥2 outpatient codes in the prior 365 days), not\n  literally on the index date.\n- **EHR:** An EHR snapshot reflects **care-seeking, not population prevalence** — patients appear only when they have an\n  encounter, so the cross-section is selected on visiting the system and undercounts well or disengaged patients. Problem\n  lists, labs, and medication orders sharpen condition and severity at the snapshot, but \"active medication\" reflects what was\n  *ordered*, not necessarily *dispensed or taken*; link to pharmacy fills where possible. Define the observable window\n  explicitly so absence of a record means \"not observed sick,\" not \"absent.\"\n- **Registry:** Strong for adjudicated case status and severity at enrollment, supporting clean prevalence of well-defined\n  disease. Weakness: **case-ascertainment lag** means recently incident cases are under-counted at any snapshot, and registry\n  membership itself is a selection mechanism. Link to claims for full treatment exposure and to a death index to define who\n  was alive at the prevalence date.\n- **Linked claims–EHR–vital records:** The ideal substrate for cross-sectional burden work — EHR severity + claims\n  completeness + reliable vital status to fix the alive-and-enrolled denominator. 
Linkage introduces selection (only the\n  linkable subset) and date-discrepancy issues (order vs fill vs service date) that must be reconciled before the snapshot\n  date is applied.\n\n**Worked claims example (true point-prevalence calculation).** Question: point prevalence of statin treatment among adults\nwith type 2 diabetes in a commercial + Medicare FFS database **as of 2024-01-01**. (1) Prevalence date (index_date) =\n2024-01-01. (2) Denominator: persons with age ≥18 on the index date AND continuous, FFS-observable enrollment (medical +\npharmacy; exclude MA-only spans) covering the full [2023-01-01, 2024-01-01] window, AND a qualifying diabetes phenotype (≥1\ninpatient or ≥2 outpatient T2DM diagnoses in the 365-day lookback). (3) Numerator (treated prevalence, active-supply rule): of\nthe denominator, those with at least one statin pharmacy fill whose supply covers the index date — `fill_date ≤ 2024-01-01 ≤\nfill_date + days_supply`. (4) Point prevalence = numerator / denominator. (5) Report it as a proportion with an exact or\nrobust 95% CI; if comparing treated prevalence across subgroups, estimate a **prevalence ratio** via Poisson or log-binomial\nregression with robust standard errors (not a logistic odds ratio, because statin treatment is common). (6) Sensitivity\nanalyses: vary the numerator rule (active-supply vs any fill in prior 90/180 days), tighten/loosen the diabetes phenotype, and\nquantify how many otherwise-eligible patients were excluded as MA-only — the size of that exclusion is a direct read on\npotential selection bias in the treated-prevalence estimate.",
    "primary_category": "Study_Design",
    "tags": [
      "cross-sectional",
      "prevalence",
      "point-prevalence",
      "prevalence-ratio",
      "neyman-bias",
      "temporal-ambiguity",
      "disease-burden",
      "treated-prevalence"
    ],
    "applies_to_study_types": [
      "cross_sectional"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1038/sj.ebd.6400375",
        "url": "https://doi.org/10.1038/sj.ebd.6400375",
        "citation_text": "Levin KA. Study design III: cross-sectional studies. Evidence-Based Dentistry. 2006;7(1):24-25.",
        "year": 2006,
        "authors_short": "Levin",
        "notes": "Compact canonical definition of the cross-sectional design, its prevalence estimand, and its core limitations (temporal ambiguity, no incidence)."
      },
      {
        "role": "explain",
        "doi": "10.1136/bmj.g2276",
        "url": "https://doi.org/10.1136/bmj.g2276",
        "citation_text": "Sedgwick P. Cross sectional studies: advantages and disadvantages. BMJ. 2014;348:g2276.",
        "year": 2014,
        "authors_short": "Sedgwick",
        "notes": "Clear treatment of the strengths and limitations, including reverse causation and the inability to estimate incidence."
      },
      {
        "role": "demonstrate",
        "doi": "10.1371/journal.pmed.0040296",
        "url": "https://doi.org/10.1371/journal.pmed.0040296",
        "citation_text": "von Elm E, Altman DG, Egger M, Pocock SJ, Gøtzsche PC, Vandenbroucke JP; STROBE Initiative. The Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) statement. PLoS Medicine. 2007;4(10):e296.",
        "year": 2007,
        "authors_short": "von Elm et al.",
        "notes": "STROBE reporting checklist explicitly covers cross-sectional studies, including how prevalence and the analytic sample must be reported."
      }
    ],
    "plain_language_summary": "A cross-sectional study takes a single photograph of a group of people on one chosen day and asks, for each person, two yes/no questions at that same instant: do you have the condition, and are you on the treatment? Because you count the people who currently have the condition and divide by everyone in the snapshot, the number you get is a prevalence (the share who have it right now), not a count of new cases over time. The catch is the most important thing to remember: since both questions are answered at the same moment, the snapshot cannot tell you which came first, the condition or the treatment, so it cannot prove that one caused the other.",
    "key_terms": [
      {
        "term": "prevalence",
        "definition": "The share of a group who have a condition at one point in time, found by dividing the number of people with it by the total number of people you looked at."
      },
      {
        "term": "snapshot date",
        "definition": "The single calendar day you freeze the population on and read every person's condition and treatment status as-of that day."
      },
      {
        "term": "temporality",
        "definition": "Knowing which event happened first; a cross-sectional snapshot measures everything at once, so it cannot tell you whether the exposure or the outcome came first."
      },
      {
        "term": "reverse causation",
        "definition": "When the outcome actually drives the exposure rather than the other way around (for example, the symptom prompts the drug), which a single snapshot cannot rule out."
      },
      {
        "term": "incidence",
        "definition": "The rate of brand-new cases appearing over a stretch of time, which a one-day snapshot cannot measure because it never watches people across time."
      }
    ],
    "worked_example": {
      "scenario": "We freeze a small commercial health-plan population on a single day, 2024-01-01, and for each of six enrolled adults we read two yes/no facts as-of that exact day: does the person have type 2 diabetes (the condition), and does the person currently have a statin on hand (the exposure). From this one-day photograph we want the prevalence of diabetes in the group, and we want to be honest about what it cannot tell us.",
      "dataset": {
        "caption": "The snapshot table an analyst would build: one row per enrolled person, both statuses read as-of the single date 2024-01-01.",
        "columns": [
          "person_id",
          "has_condition",
          "has_exposure"
        ],
        "rows": [
          [
            1001,
            1,
            1
          ],
          [
            1002,
            1,
            0
          ],
          [
            1003,
            0,
            0
          ],
          [
            1004,
            1,
            1
          ],
          [
            1005,
            0,
            1
          ],
          [
            1006,
            1,
            0
          ]
        ]
      },
      "steps": [
        "Everyone in the table is enrolled and observable on the snapshot date 2024-01-01, so the denominator is all N = 6 people.",
        "Count the people whose has_condition is 1 (persons 1001, 1002, 1004, and 1006): that is 4 people with diabetes on the snapshot day.",
        "Prevalence = number with the condition divided by everyone in the snapshot = 4 / 6.",
        "Now look at the temporality caveat: persons 1001 and 1004 have both the condition and the statin, but the table only records that both are true on 2024-01-01. It does not record whether the diabetes was diagnosed before or after the statin was started, so we cannot say the statin came first or that it caused or prevented anything.",
        "We also cannot rule out reverse causation: having diabetes may have prompted the statin, rather than the statin influencing the diabetes, and a single snapshot can never separate those two stories."
      ],
      "result": "Prevalence of diabetes = 4 people with the condition / 6 people in the snapshot = 0.667 (about 67%). This is a valid description of the group's state on 2024-01-01, but because every value was read at the same instant we cannot establish temporality (which came first), so no causal claim about the statin and diabetes is supported."
    },
    "prerequisites": [
      "prevalence-point-period-annual-rwe",
      "continuous-enrollment-observable-time-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Point prevalence",
        "description": "Exposure and condition status measured as of a single calendar date (the prevalence date / index date). The default for treated-prevalence and disease-burden snapshots in claims and EHR.",
        "edge_cases": [
          "Requires observable enrollment spanning the exact index date: patients who enroll the day after are absent, as are those who died the day before (the survivor selection behind Neyman bias).",
          "The 'active treatment as-of-date' definition (active-supply window vs any recent fill) drives the numerator and must be pre-specified."
        ],
        "data_source_notes": "claims: active-supply rule fill_date <= index_date <= fill_date + days_supply, with continuous FFS enrollment across the date; EHR: status from problem list/active medication order at or near the date, recognizing care-seeking selection."
      },
      {
        "name": "Period prevalence",
        "description": "Condition or treatment status counted as present if it occurs anywhere within a defined interval (e.g., calendar year 2024); the standard for annual prevalence and many HEDIS/PQA denominator measures.",
        "edge_cases": [
          "Mixing incident and prevalent cases within the interval blurs the estimand; denominators must handle partial enrollment (person-time or member-month adjustment).",
          "Longer intervals inflate counts relative to point prevalence and are sensitive to enrollment churn."
        ],
        "data_source_notes": "claims: require a minimum continuous-enrollment fraction of the period; annualize with member-months when enrollment is incomplete."
      },
      {
        "name": "Lifetime / ever prevalence",
        "description": "Whether the condition or exposure has ever occurred up to the snapshot (e.g., ever-diagnosed, ever-treated).",
        "edge_cases": [
          "In survey data this is dominated by recall bias; in claims it is bounded by the lookback/observable history, so 'ever' is really 'ever within available data' and varies with enrollment length.",
          "Left-truncation makes long-enrolled members appear to have higher lifetime prevalence purely from longer observation."
        ],
        "data_source_notes": "claims/EHR: explicitly report the lookback length; never present database \"ever\" prevalence as true lifetime prevalence."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Cohort study (prospective or retrospective)",
        "pros_of_this": "Cheaper and faster, no follow-up required, directly yields prevalence and treated prevalence for burden and HTA inputs.",
        "cons_of_this": "Cannot estimate incidence, risk, or hazards; cannot establish temporal order; vulnerable to prevalence-incidence (Neyman) bias from differential survival into the snapshot.",
        "when_to_prefer": "Questions about the state of a population at a time (how many, how treated) rather than change or causation over time."
      },
      {
        "compared_to": "Case-control study",
        "pros_of_this": "Samples the whole population so prevalence is estimated directly; good for common conditions and one-time diagnostic-test evaluation.",
        "cons_of_this": "Inefficient for rare outcomes (few prevalent cases in a snapshot); includes only survivors to the prevalence date.",
        "when_to_prefer": "Common conditions, prevalence reporting, and screening/diagnostic accuracy; use case-control for rare-disease etiology."
      },
      {
        "compared_to": "Ecological study",
        "pros_of_this": "Uses individual-level exposure and outcome, avoiding the ecological fallacy inherent in group-level correlation.",
        "cons_of_this": "Requires individual records and adequate snapshot sample size.",
        "when_to_prefer": "Whenever individual-level data are available rather than only aggregate group rates."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Snapshot = persons with continuous, FFS-observable enrollment (medical + pharmacy) spanning the prevalence date; exclude MA-only person-time (invisible in FFS claims). Treated as-of-date via active-supply rule (fill_date <= index_date <= fill_date + days_supply) or a pre-specified recent-fill window. Condition as-of-date via a phenotype in the lookback (e.g., >=1 inpatient or >=2 outpatient codes). Report prevalence ratios via log-binomial regression or Poisson regression with robust SE, not logistic ORs, when the outcome is common.",
      "ehr": "Snapshot reflects care-seeking, not the population; status from problem list, labs, and active medication orders near the date, with the caveat that orders are not dispensings. Define the observable window so that a missing record is read as 'not observed to be sick' rather than 'condition absent'. Link to fills to confirm active treatment.",
      "registry": "Clean adjudicated case status and severity, but case-ascertainment lag under-counts recent incident cases at any snapshot, and registry membership is itself a selection mechanism. Link to claims for treatment and to a death index for the alive-at-date denominator.",
      "linked": "Ideal for burden work (EHR severity + claims completeness + vital status to fix the alive-and-enrolled denominator); reconcile order/fill/service date discrepancies and account for linkage selection before applying the snapshot date."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\nimport numpy as np\n\nINDEX_DATE = pd.Timestamp(\"2024-01-01\")   # prevalence (snapshot) date\nLOOKBACK_DAYS = 365                         # observable window for enrollment + phenotype\nTREATED_CLASS = \"STATIN\"                    # drug_class defining the treated numerator\n\ndef point_prevalence(enroll: pd.DataFrame, rx: pd.DataFrame, dx: pd.DataFrame) -> dict:\n    lookback_start = INDEX_DATE - pd.Timedelta(days=LOOKBACK_DAYS)\n\n    # Denominator base: continuous FFS-observable enrollment spanning the full lookback through the index date.\n    e = enroll[(enroll[\"enroll_start\"] <= lookback_start) &\n               (enroll[\"enroll_end\"]   >= INDEX_DATE) &\n               (~enroll[\"ma_only\"])]               # MA-only person-time is invisible in FFS claims -> exclude\n    denom_ids = set(e[\"person_id\"].unique())\n\n    # Condition phenotype as-of-date: >=1 inpatient OR >=2 outpatient codes in the lookback window.\n    d = dx[(dx[\"dx_date\"] >= lookback_start) & (dx[\"dx_date\"] <= INDEX_DATE) &\n           (dx[\"person_id\"].isin(denom_ids))]\n    ip = d[d[\"setting\"] == \"IP\"].groupby(\"person_id\").size()\n    op = d[d[\"setting\"] == \"OP\"].groupby(\"person_id\").size()\n    condition_ids = set(ip[ip >= 1].index) | set(op[op >= 2].index)\n    denom_ids &= condition_ids                     # restrict denominator to condition-positive persons\n\n    # Treated numerator (active-supply rule): a fill whose days_supply still covers the index date.\n    r = rx[(rx[\"drug_class\"] == TREATED_CLASS) & (rx[\"person_id\"].isin(denom_ids))].copy()\n    r[\"supply_end\"] = r[\"fill_date\"] + pd.to_timedelta(r[\"days_supply\"], unit=\"D\")\n    covered = r[(r[\"fill_date\"] <= INDEX_DATE) & (r[\"supply_end\"] >= INDEX_DATE)]\n    treated_ids = set(covered[\"person_id\"].unique())\n\n    n = len(denom_ids)\n    k = len(treated_ids)\n    prev = k / n if n else float(\"nan\")\n    se = np.sqrt(prev * (1 - prev) / n) if n else 
float(\"nan\")   # Wald SE; use exact CI for small n\n    return {\"denominator\": n, \"treated\": k, \"point_prevalence\": prev,\n            \"ci95\": (prev - 1.96 * se, prev + 1.96 * se)}",
        "description": "Point-prevalence (treated-prevalence) snapshot from claims-style inputs. Required inputs (cleaned, de-duplicated):\n  enroll : enrollment spans -> person_id, enroll_start, enroll_end, ma_only (bool)   # ma_only spans lack FFS claims\n  rx     : pharmacy fills    -> person_id, fill_date (datetime), drug_class, days_supply\n  dx     : diagnoses         -> person_id, dx_date (datetime), setting in {'IP','OP'}, code  # condition phenotype source\nThis is a SNAPSHOT, not a new-initiator cohort: status is read AS-OF a fixed prevalence date, not from first fills.\nReturns a dict with the denominator count (eligible, condition-positive, enrolled across the date), the as-of-date\ntreated count, the point-prevalence estimate, and a Wald 95% CI.",
        "dependencies": [
          "pandas",
          "numpy"
        ],
        "source_citations": [
          "levin-2006"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\n\nINDEX_DATE    <- as.Date(\"2024-01-01\")   # prevalence (snapshot) date\nLOOKBACK_DAYS <- 365L\nTREATED_CLASS <- \"STATIN\"\n\npoint_prevalence <- function(enroll, rx, dx) {\n  setDT(enroll); setDT(rx); setDT(dx)\n  lookback_start <- INDEX_DATE - LOOKBACK_DAYS\n\n  # Denominator base: continuous FFS-observable enrollment across the full lookback through index; drop MA-only spans.\n  denom <- enroll[enroll_start <= lookback_start & enroll_end >= INDEX_DATE & !ma_only,\n                  unique(person_id)]\n\n  # Condition phenotype as-of-date: >=1 inpatient OR >=2 outpatient codes in the lookback.\n  d  <- dx[dx_date >= lookback_start & dx_date <= INDEX_DATE & person_id %chin% denom]\n  ip <- d[setting == \"IP\", .N, by = person_id][N >= 1L, person_id]\n  op <- d[setting == \"OP\", .N, by = person_id][N >= 2L, person_id]\n  denom <- intersect(denom, union(ip, op))\n\n  # Treated numerator (active-supply rule): a fill whose supply still covers the index date.\n  r <- rx[drug_class == TREATED_CLASS & person_id %chin% denom]\n  r[, supply_end := fill_date + days_supply]\n  treated <- r[fill_date <= INDEX_DATE & supply_end >= INDEX_DATE, unique(person_id)]\n\n  n <- length(denom); k <- length(treated)\n  prev <- if (n) k / n else NA_real_\n  se   <- if (n) sqrt(prev * (1 - prev) / n) else NA_real_   # Wald; use exact CI for small n\n  list(denominator = n, treated = k, point_prevalence = prev,\n       ci95 = c(prev - 1.96 * se, prev + 1.96 * se))\n}",
        "description": "Point-prevalence snapshot with data.table; inputs mirror the Python version:\n  enroll : person_id, enroll_start (Date), enroll_end (Date), ma_only (logical)\n  rx     : person_id, fill_date (Date), drug_class, days_supply\n  dx     : person_id, dx_date (Date), setting in {'IP','OP'}, code\nReads exposure and condition status AS-OF a fixed prevalence date (snapshot logic), not first-fill logic.",
        "dependencies": [
          "data.table"
        ],
        "source_citations": [
          "levin-2006"
        ],
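        "notes": "%chin% requires character vectors; if person_id is stored as an integer or other non-character type, use %in% instead."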
      },
      {
        "lang": "sas",
        "code": "%let index_date  = '01JAN2024'd;   /* prevalence (snapshot) date */\n%let lookback    = 365;            /* observable window in days   */\n%let treated_cls = 'STATIN';\n\n/* Denominator base: continuous FFS-observable enrollment spanning the full lookback through index; no MA-only spans. */\nproc sql;\n  create table denom_base as\n  select distinct person_id\n  from work.enroll\n  where ma_only = 0\n    and enroll_start <= &index_date - &lookback\n    and enroll_end   >= &index_date;\nquit;\n\n/* Condition phenotype as-of-date: >=1 inpatient OR >=2 outpatient codes in the lookback window. */\nproc sql;\n  create table condition as\n  select person_id\n  from work.dx\n  where dx_date between &index_date - &lookback and &index_date\n    and person_id in (select person_id from denom_base)\n  group by person_id\n  having sum(setting='IP') >= 1 or sum(setting='OP') >= 2;\nquit;\n\n/* Treated numerator (active-supply rule): a fill whose days_supply still covers the index date. */\nproc sql;\n  create table analytic as\n  select c.person_id,\n         (case when exists (\n             select 1 from work.rx r\n             where r.person_id = c.person_id\n               and r.drug_class = &treated_cls\n               and r.fill_date <= &index_date\n               and r.fill_date + r.days_supply >= &index_date\n          ) then 1 else 0 end) as treated\n  from condition c;\nquit;\n\n/* Point prevalence with an exact binomial CI. */\nproc freq data=analytic;\n  tables treated / binomial(level='1') alpha=0.05;\nrun;\n\n/* Prevalence RATIO across a subgroup (example covariate: subgrp) via log-binomial + robust SE. */\nproc genmod data=analytic_with_subgrp descending;\n  class person_id subgrp(ref='0');\n  model treated = subgrp / dist=binomial link=log;   /* log link -> exp(beta) = prevalence ratio */\n  repeated subject=person_id / type=ind;             /* sandwich (robust) standard errors        */\n  estimate 'PR subgrp=1 vs 0' subgrp 1 -1 / exp;\nrun;",
        "description": "Point-prevalence snapshot and prevalence-ratio estimation in SAS. Required input datasets (post data-management):\n  work.enroll : person_id, enroll_start, enroll_end, ma_only (0/1)\n  work.rx     : person_id, fill_date, drug_class, days_supply\n  work.dx     : person_id, dx_date, setting ('IP'/'OP'), code\nBuilds the snapshot AS-OF a fixed prevalence date (not first-fill cohort logic). The closing PROC GENMOD computes a\nprevalence RATIO across a subgroup via log-binomial regression with robust SE (REPEATED) - the correct estimand when the\noutcome is common; do NOT report a logistic odds ratio as a risk ratio.",
        "dependencies": [],
        "source_citations": [
          "levin-2006"
        ],
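        "notes": "The closing PROC GENMOD reads analytic_with_subgrp, which this snippet does not create; it is assumed to be work.analytic joined to a person-level 0/1 covariate subgrp."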
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Inc[Incidence rate<br/>new cases / person-time] --> P[Prevalence at snapshot]\n  Dur[Mean disease duration] --> P\n  P -->|prevalence ~ incidence x duration| Snap[Cross-sectional sample reads PREVALENCE only]\n  Snap --> A[Valid: describe burden, treated prevalence, associations]\n  Snap --> B[Invalid: recover incidence or risk]\n  Severe[Severe / rapidly fatal cases<br/>die before snapshot] -.depleted.-> P\n  Short[Short-duration / recovered cases<br/>gone before snapshot] -.depleted.-> P\n  B --> Danger[Prevalence-incidence Neyman bias:<br/>survivor-enriched sample distorts causal claims]",
        "caption": "Why a cross-section yields prevalence, not incidence. Prevalence is incidence scaled by mean duration, so severe and short-duration cases are systematically under-sampled (Neyman bias) — making etiologic interpretation of a prevalence association dangerous.",
        "alt_text": "Diagram showing prevalence as incidence times duration, with severe and short-duration cases depleted before the snapshot, and a branch warning that recovering incidence or making causal claims is invalid.",
        "source_type": "illustrative",
        "source_citations": [
          "levin-2006"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Q[Research question] --> Time{Does it require<br/>temporal order or<br/>incidence over time?}\n  Time -->|Yes| Cohort[Use a cohort / target-trial emulation<br/>cross-sectional is wrong]\n  Time -->|No| Kind{What is the target estimand?}\n  Kind -->|Disease or treated PREVALENCE,<br/>burden, feasibility| CSok[Cross-sectional snapshot: appropriate]\n  Kind -->|Etiology of a serious outcome| Neyman[Avoid: Neyman bias + reverse causation<br/>prevalence association can be artifact]\n  Kind -->|Diagnostic / screening test<br/>performance at one exam| CSdx[Cross-sectional: appropriate with caveats]\n  Kind -->|Rare outcome etiology| CC[Use case-control instead]",
        "caption": "Decision logic for when a cross-sectional design answers the question. Descriptive burden and one-time diagnostic accuracy are good fits; incidence, temporal order, and serious-outcome etiology are not.",
        "alt_text": "Decision tree routing a research question to cross-sectional, cohort, or case-control depending on whether temporal order or incidence is required and what the target estimand is.",
        "source_type": "illustrative",
        "source_citations": [
          "levin-2006"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  LB[Lookback window<br/>-365 days .. index] --> ENR[Continuous FFS-observable enrollment<br/>spanning the whole window]\n  LB --> DX[Condition phenotype<br/>1 IP or 2 OP codes in lookback]\n  RX[Pharmacy fills<br/>fill_date + days_supply] --> SUPPLY{Supply covers<br/>the index date?}\n  ENR --> IDX[Prevalence date = index_date<br/>read status AS-OF this instant]\n  DX --> IDX\n  SUPPLY -->|yes| IDX\n  IDX --> NUMDEN[Denominator = enrolled + condition-positive<br/>Numerator = treated as-of-date]",
        "caption": "Operational point-prevalence assembly in claims. Status is read as-of a single index date using an active-supply rule for treatment and a lookback phenotype for the condition — snapshot logic, not first-fill cohort logic.",
        "alt_text": "Data-flow diagram showing a 365-day lookback feeding enrollment and condition checks, pharmacy supply covering the index date for treatment, and assembly of the numerator and denominator at the prevalence date.",
        "source_type": "illustrative",
        "source_citations": [
          "levin-2006"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "alternative_to",
        "target_slug": "cohort-retrospective",
        "notes": "A cohort follows people over time to estimate incidence and risk with temporal order preserved; cross-sectional reads prevalence at one instant and cannot recover those quantities."
      },
      {
        "relation_type": "alternative_to",
        "target_slug": "case-control",
        "notes": "Case-control samples on the outcome for efficient rare-disease estimation; cross-sectional samples the whole population at a time for prevalence of common conditions."
      },
      {
        "relation_type": "see_also",
        "target_slug": "prevalence-point-period-annual-rwe",
        "notes": "Point, period, and annual prevalence are the core estimands a cross-sectional design produces; that concept details the denominator/member-month mechanics."
      },
      {
        "relation_type": "see_also",
        "target_slug": "descriptive-epidemiology-rwe",
        "notes": "Cross-sectional snapshots are a primary engine of descriptive epidemiology (who, how many, how treated, as-of-when)."
      },
      {
        "relation_type": "see_also",
        "target_slug": "ecological",
        "notes": "Cross-sectional uses individual-level data and so avoids the ecological fallacy of group-level correlation studies."
      },
      {
        "relation_type": "used_with",
        "target_slug": "continuous-enrollment-observable-time-rwe",
        "notes": "The snapshot denominator requires continuous, observable enrollment spanning the prevalence date so that treated and condition status are real, not missing."
      },
      {
        "relation_type": "used_with",
        "target_slug": "survey-weights-complex-sampling-rwe",
        "notes": "Survey-based cross-sections (NHANES-style) require complex-sampling weights to produce population-representative prevalence estimates."
      },
      {
        "relation_type": "see_also",
        "target_slug": "selection-bias-sensitivity-analysis-rwe",
        "notes": "Prevalence-incidence (Neyman) bias is the dominant selection threat in cross-sectional designs; sensitivity analysis quantifies survivor-driven distortion."
      }
    ],
    "aliases": [
      "cross sectional study",
      "prevalence study",
      "point prevalence study",
      "period prevalence study",
      "prevalence survey"
    ],
    "complexity": "foundational",
    "regulatory_relevance": [
      "hta",
      "pqa-cms"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "cross-validation-and-overfitting-rwe",
    "name": "Cross-Validation, Overfitting, and Optimism",
    "short_definition": "The suite of internal validation techniques (k-fold cross-validation, stratified and group-stratified folds, repeated CV, and bootstrap optimism correction) used to obtain an honest estimate of a clinical prediction model's discrimination and calibration before any external data are consulted; overfitting produces a gap between apparent performance on training data and the expected performance on new patients from the same source population, and optimism is the formal name for that gap.",
    "long_description": "**What cross-validation and overfitting mean in clinical prediction**\n\nEvery prediction model — logistic regression for 30-day readmission, a gradient-boosted tree\nfor 1-year mortality, or regularized regression for hospitalization risk — is built by fitting\ncoefficients or splits to observed data. That process is inherently self-referential: the\nalgorithm adjusts its parameters to explain the training labels as well as possible, which means\nit also memorizes noise specific to those particular patients, specific to that particular\ncalendar window, and specific to that particular database. When you then evaluate the same model\non the same data it was trained on, you measure *apparent performance* — discrimination and\ncalibration that are systematically too good because the model is graded on an exam it has\nalready seen. Cross-validation and bootstrap optimism correction are the two main families of\nmethods that generate an honest, *internal* estimate of how the model would perform on new\npatients drawn from the same population — before any external data are touched.\n\n**Why apparent performance always flatters: the mechanics of overfitting**\n\nOverfitting has two mechanical causes in clinical prediction models. The first is *capacity\nrelative to events*: a logistic regression with 30 predictors fitted on 150 events (5 events per\nvariable, EPV = 5) has enough degrees of freedom to pick up idiosyncratic patterns in that\ntraining sample that will not replicate. The widely cited EPV ≥ 10 rule of thumb warns that\nmodels with fewer events per predictor overfit progressively more severely. 
The second cause is\nthe *optimization pressure* of fitting: even a correctly specified model with adequate EPV will\nproduce coefficients that are slightly too extreme — the estimated effect of each predictor is\npulled toward the direction that best explains the training outcomes, a phenomenon Steyerberg\n(2001) formalizes as the calibration slope shrinking below 1.0 on held-out data. The net effect\nis that reported AUC, c-statistic, and calibration metrics computed on training data always\noverstate how well the model will perform when presented with new patients.\n\n**The cross-validation menu**\n\n*Standard k-fold CV* (k = 5 or k = 10 by convention): the analysis dataset is randomly\npartitioned into k equal folds. In each of the k iterations, one fold serves as the test set\nand the model is re-trained from scratch on the remaining k − 1 folds; the performance metric\n(c-statistic, AUC, Brier score) is computed on the held-out fold. The k fold estimates are\naveraged. The k = 5 convention is well-supported by bias-variance analyses: k = 10 has lower\nbias but slightly higher variance; k = 2 (simple 50/50 split) is unacceptably noisy; leave-one-\nout CV is nearly unbiased but computationally expensive and can have high variance.\n\n*Stratified k-fold CV*: when the outcome event is rare (event rate < 5–10%), pure random\nsplitting of k folds can, by chance, leave one fold with zero events, making the test-set\nperformance undefined. Stratified splitting ensures each fold contains approximately the same\nevent rate as the full dataset. This is the default for any binary prediction model in claims or\nEHR data with a rare outcome.\n\n*Repeated k-fold CV*: to reduce the Monte Carlo variance introduced by the random fold\nassignment, the entire k-fold process is repeated R times (typically R = 5 or 10) with different\nrandom seeds and the R × k fold estimates are averaged. Repeated CV is the most stable internal\nestimate but requires R times as much fitting. 
It is preferred when n is small (< 500 events)\nand computational cost is manageable.\n\n*Bootstrap optimism correction* (Steyerberg/Harrell procedure): fit the model on the full\ndataset; record the apparent performance metric A. Then draw B bootstrap samples (typically\nB = 200) from the full dataset with replacement, fit the model to each bootstrap sample, and\nevaluate that bootstrap-fitted model on the original full dataset. The bootstrap models tend to\nperform better on their own sample (overfitting) than on the original data; the mean gap across\nbootstrap replications estimates the optimism O. The corrected estimate is A − O. The bootstrap\noptimism procedure uses all available data for model fitting in every replicate (no data are\nwithheld) and typically has lower variance than k-fold CV for small samples. In R, the\n`rms::validate` function implements this directly for logistic and Cox models.\n\n*Nested CV* (required when hyperparameters are tuned): a fundamental distinction between model\n*selection* (choosing which algorithm and hyperparameters to use) and model *evaluation*\n(estimating performance of the chosen model) is that performing both with the same CV loop\nproduces optimistic estimates. Varma & Simon (2006) showed that if you select hyperparameters\nby cross-validation and then report the CV score of the selected model, you have used the test\nfolds to guide selection — the estimate is biased upward. Nested CV uses an outer loop (k-fold)\nfor honest evaluation and an inner loop (inner k-fold or grid search) for hyperparameter\nselection entirely within the training folds. The outer fold estimate is then unbiased for the\nfull model-selection-plus-fitting pipeline, not just for a fixed algorithm.\n\n**RWE-specific data leakage catalog**\n\nData leakage is the most consequential threat to CV validity in claims and EHR studies because\nit corrupts not just apparent performance but also every cross-validation estimate. 
Leakage means\ninformation that would not be available at the time of prediction — or that violates the\npatient-independence assumption — flows from the test fold into the model.\n\n*Patient-level leakage* (claims): when a patient contributes multiple rows (multiple episodes,\nmultiple fill records, multiple encounters), simple random splitting of rows assigns the same\npatient to both training and test folds. The model implicitly learns patient-level idiosyncratic\npatterns from the training rows and then \"predicts\" those patterns in the test rows — a form of\nidentity memorization. The fix is `GroupKFold` (sklearn) or group-stratified splitting by\n`person_id`, ensuring each patient appears in exactly one fold. This is the most critical\nleakage mode in claims analysis and the one most often overlooked.\n\n*Temporal leakage* (prospective deployment): when the model will be deployed prospectively —\nscoring new patients at time T to predict outcomes that occur after T — a random split that\nmixes patients indexed in 2018, 2020, and 2022 allows future patients' data to inform training\non earlier patients (and vice versa). For prospective deployments, the only valid split is a\ntemporal one: train on the earlier period, validate on the later period. CV with random splits\nis appropriate only when the model will score a historical population, not when it will be\napplied forward in time.\n\n*Feature leakage* (post-index codes): any feature derived from after the prediction time-zero\n(the index date) is a form of leakage. Common examples: including the treatment's own dispensing\ncodes in a model predicting whether the patient will be treated; using hospitalization codes from\nthe outcome window as predictors; including post-index comorbidity codes in a \"baseline\" feature\nmatrix. Feature leakage produces models with apparently excellent discrimination that collapse\ncompletely at deployment. 
The solution is strict temporal feature engineering — every predictor\nmust be constructed from data in [index_date − lookback_window, index_date).\n\n*Preprocessing leakage*: many preprocessing steps — imputation of missing values, standardization\nof continuous features, feature selection by univariate association — should be fitted only on\ntraining data and then applied to test data. If imputation or standardization is fitted on the\nfull dataset before splitting, the test fold implicitly contains information derived from its own\nvalues, producing artificially low test-set error. In sklearn, the correct pattern is to wrap\nall preprocessing in a `Pipeline` so that `fit_transform` is called only on training folds and\n`transform` (without re-fitting) is called on test folds.\n\n*Site leakage* (multi-site EHR): when a multi-site dataset is split randomly, training and test\nsets may contain records from the same site. If the goal is to estimate performance at a new site\n(leave-one-site-out validation), use `GroupKFold` with site as the grouping variable rather than\npatient.\n\n**Internal versus external validation — the boundary**\n\nCross-validation — including all variants above — is a form of *internal* validation. It\nestimates expected performance on new patients drawn from the *same source population, same\ndatabase, same calendar window, and same eligibility criteria* as the development cohort. It does\nnot estimate performance in a different health system, a different geographic region, a different\nera, or a different patient population. Confirming that performance holds in a separate external\ndataset — a different health plan, a different EHR system, a later calendar period, a different\ncountry — is external validation and is covered in this catalog under\n`prediction-model-validation-recalibration-rwe`. 
A model that passes internal CV should be\nthought of as \"ready for external validation\" rather than \"ready for deployment.\"\n\n**Interpreting the output**\n\nThe concrete example throughout this entry: a logistic regression model for 1-year\nhospitalization in 500 Medicare patients achieves an apparent c-statistic of 0.85 on the data\nit was trained on, and 5-fold CV returns fold c-statistics of 0.74, 0.76, 0.78, 0.72, and 0.75,\nfor a CV mean of 0.75 and an optimism of 0.10.\n\n*(1) Formal interpretation.* The CV estimate of 0.75 approximates the expected discrimination on\nnew patients drawn from the same Medicare population, same database, and same calendar window\nunder the same eligibility criteria. It does NOT estimate performance on a different Medicare\nAdvantage plan, a different geographic region, or a later calendar era — those questions require\nexternal validation. The optimism of 0.10 quantifies overfitting: the model memorized patterns\nworth 0.10 c-statistic points that will not replicate. If any hyperparameters were tuned using\nthe same 5 folds (e.g., a regularization penalty selected by 5-fold CV on the same data), even\nthe 0.75 estimate is biased upward — nested CV is needed to produce an unbiased estimate for the\nfull training pipeline.\n\n*(2) Practical interpretation.* In plain language: expect about 0.75 discrimination, not 0.85,\nwhen this model scores next year's similar patients drawn from this same Medicare system. Whether\nit performs comparably on patients from a different health system, a different region, or a\ndifferent calendar era is unknown until an external validation study is completed. 
The 0.10\noptimism gap is a direct signal that the model has overfitted and that shrinkage or\nregularization should be considered before any external deployment.\n\n**Pros, cons, and trade-offs**\n\n*k-fold CV*:\n- Pros: computationally feasible for most models; k = 5 or 10 has well-characterized bias-\n  variance properties; stratified and grouped variants handle rare outcomes and multi-row\n  patients; nested CV extends to hyperparameter selection.\n- Cons: each fold trains on only (k − 1)/k of the data, so the estimate applies to a model\n  that is slightly underfit relative to the final model (this is the pessimism of CV); random\n  variation in fold assignment adds variance to the estimate, especially at small n.\n- When to prefer: moderate to large datasets (n ≥ several hundred events); when bootstrap\n  computation is prohibitive; whenever preprocessing must be validated end-to-end with a\n  Pipeline.\n\n*Bootstrap optimism correction*:\n- Pros: fits on the full dataset in each replicate so the corrected estimate applies to the\n  final model, not a slightly underfit k-fold version; lower variance than CV at small sample\n  sizes; directly estimates the shrinkage factor for coefficients.\n- Cons: computationally expensive (B = 200 full model fits); does not naturally extend to\n  nested hyperparameter selection without additional structure; less transparent to audiences\n  unfamiliar with bootstrap resampling.\n- When to prefer: small datasets (n < 200 events); when coefficient shrinkage factors are\n  needed; Cox and logistic models implemented in `rms::validate`.\n\n*Simple train-test split*:\n- Pros: fast; easy to explain; gives a single unbiased estimate if the split is pre-specified.\n- Cons: wastes data (typically 20–30% is withheld for testing and not used in training);\n  high variance from a single random partition; particularly unreliable at small n.\n- When to prefer: only when n is very large (n > 50,000 patients) and an end-to-end\n  evaluation 
pipeline is needed quickly as a sanity check; never as the primary internal\n  validation in a publication.\n\n**When to use**\n\nApply cross-validation or bootstrap optimism correction in every prediction model development\nstudy before reporting performance metrics:\n\n- Before presenting a c-statistic, AUC, or calibration metric from a prediction model, always\n  report the *internal CV estimate* alongside (or instead of) the apparent training-set estimate.\n- When the development sample has fewer than 200 events or EPV < 10, use bootstrap optimism\n  correction in preference to k-fold CV to reduce variance.\n- When the dataset contains multiple rows per patient (pharmacy fills, encounters, episodes),\n  always use GroupKFold by person_id to prevent patient-level leakage.\n- When the model will be deployed prospectively (scoring future patients), use a temporal split\n  rather than random CV to mimic the real deployment scenario.\n- When tuning hyperparameters (regularization penalty, tree depth, number of features), use\n  nested CV: inner loop for selection, outer loop for honest evaluation.\n- After internal validation passes, proceed to external validation\n  (`prediction-model-validation-recalibration-rwe`) before clinical or policy deployment.\n\n**When NOT to use — and active misuse patterns**\n\n- *Reporting apparent (training-set) performance as if it were validation performance*: the most\n  common misuse. A c-statistic computed on the same data used for fitting is not a validation\n  estimate. Always label any training-set metric explicitly as \"apparent\" and never cite it as\n  evidence of model generalizability.\n- *Random row-level splits when patients have multiple rows*: in claims or EHR datasets where\n  one patient contributes multiple rows, splitting randomly on rows assigns the same patient to\n  training and test. 
GroupKFold by person_id is mandatory, not optional, for such datasets.\n- *Using the CV test-fold estimates for model selection and then citing them as unbiased*: the\n  moment you use held-out fold performance to choose between models, those folds are no longer a\n  clean test set. Nested CV is required; selecting then citing the same folds is optimistic.\n- *Applying random CV splits to a prospective deployment question*: if the model will score\n  patients in real time, a random split that mixes future and past patients inflates the estimate\n  because future patterns inform the past. Use a temporal split to match the deployment context.\n- *Imputing or standardizing features before splitting*: fitting an imputer or scaler on the\n  full dataset and then splitting leaks test-fold statistics into training. All preprocessing\n  must live inside a Pipeline that is fitted only on training folds.\n- *Treating a good internal CV estimate as evidence of external validity*: a CV c-statistic of\n  0.75 certifies performance in the source population and database. It says nothing about\n  performance in a different health system, country, or era. Citing internal CV results in an\n  HTA submission as evidence of generalizability is a methodological misrepresentation.",
    "primary_category": "Machine_Learning_and_Predictive",
    "tags": [
      "machine-learning",
      "validation",
      "overfitting",
      "cross-validation",
      "internal-validation",
      "optimism",
      "data-leakage",
      "bootstrap",
      "nested-cv"
    ],
    "applies_to_study_types": [
      "cohort_retrospective",
      "claims_analysis",
      "ehr_study",
      "registry_study",
      "diagnostic_accuracy",
      "algorithm_validation",
      "pragmatic_trial"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1016/S0895-4356(01)00341-9",
        "url": "https://doi.org/10.1016/S0895-4356(01)00341-9",
        "citation_text": "Steyerberg EW, Harrell FE Jr, Borsboom GJ, Eijkemans MJ, Vergouwe Y, Habbema JD. Internal validation of predictive models: efficiency of some procedures for logistic regression analysis. Journal of Clinical Epidemiology. 2001;54(8):774-781.",
        "year": 2001,
        "authors_short": "Steyerberg et al.",
        "notes": "Definitive comparison of bootstrap optimism correction, leave-one-out CV, and 10-fold CV for internal validation of logistic regression models. Establishes that bootstrap optimism correction is most efficient at small n and that apparent performance systematically exceeds cross-validated performance by a calibration-slope-linked overfitting gap. The foundational reference for the \"optimism\" framing used throughout this entry."
      },
      {
        "role": "explain",
        "doi": "10.1186/1471-2105-7-91",
        "url": "https://doi.org/10.1186/1471-2105-7-91",
        "citation_text": "Varma S, Simon R. Bias in error estimation when using cross-validation for model selection. BMC Bioinformatics. 2006;7:91.",
        "year": 2006,
        "authors_short": "Varma & Simon",
        "notes": "Demonstrates via simulation that using the same CV loop for both hyperparameter selection and performance estimation produces optimistic error estimates — the nested CV requirement. The key reference for the \"selection bias\" that arises when test folds are re-used for model selection, and the rigorous justification for nested cross-validation."
      },
      {
        "role": "demonstrate",
        "doi": "10.1214/09-SS054",
        "url": "https://doi.org/10.1214/09-SS054",
        "citation_text": "Arlot S, Celisse A. A survey of cross-validation procedures for model selection. Statistics Surveys. 2010;4:40-79.",
        "year": 2010,
        "authors_short": "Arlot & Celisse",
        "notes": "Comprehensive theoretical survey of CV variants — leave-one-out, k-fold, repeated k-fold, hold-out, and model-selection CV — with bias-variance characterization of each. Provides the formal basis for the k = 5 or 10 convention and for the pessimism-of-CV argument. The authoritative reference for understanding when different CV designs are optimal."
      }
    ],
    "plain_language_summary": "A prediction model is trained on past patient data, but it almost always performs worse on new patients than it appears to on the data it already learned from. The cause is called overfitting, and the size of the performance gap is called optimism. Cross-validation is the standard fix: you repeatedly hide a portion of your patients from the model during training and then test it on those hidden patients, giving you an honest estimate of how it will perform on similar new patients before deploying the model. One critical rule in healthcare data is that you must keep each patient's records together in either training or testing, never splitting a single patient's claims across both sides; otherwise the model cheats by recognizing patients it has already seen.",
    "key_terms": [
      {
        "term": "training error vs test error",
        "definition": "Training error is how well the model fits the data it was built on (always too optimistic); test error is how well it performs on patients it has never seen (the honest estimate you actually need)."
      },
      {
        "term": "k-fold cross-validation",
        "definition": "A method that splits the dataset into k equal groups, trains the model k times (each time leaving one group out as the test set), and averages the k test-set performance scores to get an honest estimate."
      },
      {
        "term": "optimism",
        "definition": "The gap between a model's apparent performance on training data and its honest cross-validated performance; an optimism of 0.10 in c-statistic means the model looks 0.10 points better than it actually is on new patients."
      },
      {
        "term": "nested cross-validation",
        "definition": "A two-layer CV design where the inner layer tunes model settings (like the regularization penalty) and the outer layer honestly evaluates the whole training pipeline; needed whenever settings are chosen using the same data."
      },
      {
        "term": "data leakage",
        "definition": "Accidentally letting information that would not be available at prediction time — or from the same patient's other records — flow into the training process, making the model look far better than it really is."
      },
      {
        "term": "patient-level splitting",
        "definition": "Ensuring that all records belonging to one patient go entirely into training or entirely into testing, never split across both; required in claims and EHR data where one patient contributes many rows."
      }
    ],
    "worked_example": {
      "scenario": "A logistic regression model is built to predict 1-year hospitalization in a Medicare claims cohort of 500 patients, using 30 baseline features (comorbidity indicators, prior utilization counts). The model is fitted on all 500 patients and the apparent c-statistic — measured on the same data used for fitting — is 0.85. The team then applies 5-fold cross-validation, splitting the 500 patients into 5 groups of 100, training on 400 each time and testing on the 100 held-out patients, to get an honest estimate of how the model will perform on next year's similar Medicare patients.",
      "dataset": {
        "caption": "5-fold CV results. Each fold holds out 100 patients as the test set; the model is re-trained on the remaining 400 and the c-statistic is computed on the 100 held-out patients.",
        "columns": [
          "fold",
          "n_train",
          "n_test",
          "fold_c_stat"
        ],
        "rows": [
          [
            1,
            400,
            100,
            0.74
          ],
          [
            2,
            400,
            100,
            0.76
          ],
          [
            3,
            400,
            100,
            0.78
          ],
          [
            4,
            400,
            100,
            0.72
          ],
          [
            5,
            400,
            100,
            0.75
          ]
        ]
      },
      "steps": [
        "The apparent c-statistic (model evaluated on its own training data) = 0.85. This is optimistic because the model has already seen these patients.",
        "Sum the five held-out fold c-statistics: 0.74 + 0.76 + 0.78 + 0.72 + 0.75 = 3.75.",
        "CV mean c-statistic = 3.75 / 5 = 0.75. This is the honest internal estimate.",
        "Optimism = apparent c-statistic minus CV mean = 0.85 - 0.75 = 0.10. The model memorized patterns worth 0.10 c-statistic points that will not replicate on new patients.",
        "The corrected honest estimate (0.75) applies to new Medicare patients from this same database and time window. It does not tell us how the model will perform at a different health system or in a later year — that requires external validation."
      ],
      "result": "Apparent c = 0.85; sum of fold c-stats = 0.74 + 0.76 + 0.78 + 0.72 + 0.75 = 3.75; CV mean c = 3.75 / 5 = 0.75; optimism = 0.85 - 0.75 = 0.10. Honest internal estimate is 0.75."
    },
    "prerequisites": [
      "predictive-and-causal-ml-models-rwe",
      "descriptive-statistics"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Standard k-fold CV (k = 5 or 10)",
        "description": "The analysis dataset is randomly partitioned into k equal folds; the model is refitted k times using k − 1 folds for training and 1 fold for evaluation; the k performance estimates are averaged. k = 5 is preferred when fitting is expensive; k = 10 has lower bias at slightly higher variance. The gold standard for most prediction model development in RWE.",
        "edge_cases": [
          "When n is small (< 200 events), fold-to-fold variance is high and the mean estimate is noisy; prefer bootstrap optimism correction or repeated CV.",
          "k-fold CV is pessimistic: each model is trained on only (k-1)/k of the data, so the CV estimate underestimates performance of the final model trained on the full dataset by a small amount."
        ],
        "data_source_notes": "Claims and EHR: always use GroupKFold by person_id when the dataset has one row per encounter or fill (multiple rows per patient). Standard KFold is appropriate only for datasets that are already aggregated to one row per patient."
      },
      {
        "name": "Stratified k-fold CV",
        "description": "Folds are created so that the outcome event rate in each fold matches the overall rate. Mandatory for binary outcomes with event prevalence < 10% — pure random splitting risks empty or nearly-empty event strata in a fold.",
        "edge_cases": [
          "When both stratification (by outcome) and grouping (by patient) are needed simultaneously, use a stratified GroupKFold — sklearn's StratifiedGroupKFold — or custom splitting logic.",
          "For survival outcomes, stratify on the event indicator ignoring time; survival-calibrated CV is more complex and typically replaced by bootstrap optimism."
        ],
        "data_source_notes": "Claims: readmission, mortality, and rare procedure outcomes often have < 5% event rates and require stratified folds. EHR: same for rare diagnoses or incident events in short windows."
      },
      {
        "name": "Group k-fold CV (patient-level splitting)",
        "description": "Each patient's records are assigned to exactly one fold; no patient appears in both training and test sets. Implemented in sklearn via GroupKFold or StratifiedGroupKFold with groups = person_id. This is the correct default for any claims or EHR dataset where one patient contributes multiple rows.",
        "edge_cases": [
          "Patient group sizes vary; some folds may be larger than others if patients have very different numbers of records — the n_test per fold is approximate, not exact.",
          "When combining group splitting with stratification, ensure that patients are stratified by their outcome status, not individual rows."
        ],
        "data_source_notes": "Claims: pharmacy fills, encounters, episodes — any multi-row design requires GroupKFold. EHR: visit-level or lab-level datasets require the same. Failure to use group splitting in these datasets is the single most common leakage error in clinical ML publications."
      },
      {
        "name": "Bootstrap optimism correction",
        "description": "Fit the model on the full dataset; record apparent performance A. Draw B bootstrap samples (B ≥ 200) with replacement; refit the model on each sample, repeating every modeling step; evaluate each refitted model on both its own bootstrap sample and the original full dataset. The mean difference (bootstrap-sample performance minus original-data performance) is the optimism O. Corrected estimate = A − O. Implemented in R via rms::validate.",
        "edge_cases": [
          "Requires B full model fits (B = 200 minimum); computationally expensive for gradient boosting or neural-net models.",
          "Bootstrap does not naturally extend to nested hyperparameter selection; nested CV is simpler when tuning is required."
        ],
        "data_source_notes": "Best suited to regression models (logistic, Cox) implemented in rms where validate() handles the bootstrap resampling internally. For ML algorithms, k-fold CV is usually more practical."
      },
      {
        "name": "Nested cross-validation",
        "description": "Outer k-fold loop evaluates the full training pipeline; inner k-fold loop (or grid search) selects hyperparameters using only the training portion of each outer fold. The outer fold estimate is unbiased for the combined selection-plus-fitting procedure. Required whenever regularization penalties, tree depth, or other tuning parameters are selected from the data.",
        "edge_cases": [
          "Computationally expensive: with outer k = 5 and inner k = 5, requires 25 inner model fits per hyperparameter grid point, plus 5 outer refits at the selected setting.",
          "The outer CV estimate reflects the uncertainty of both the model fit and the hyperparameter selection step; it is typically more conservative (lower) than a flat CV estimate."
        ],
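        "data_source_notes": "Claims and EHR: required for LASSO/Ridge regularization (lambda selection), gradient boosting (learning rate, depth), and any sklearn GridSearchCV or RandomizedSearchCV workflow. Implement in sklearn by making the Pipeline the estimator of the search object (inner loop) and evaluating that search object in an outer loop; when the inner splitter is grouped, make sure the person_id groups reach the inner fit as well as the outer split."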
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "prediction-model-validation-recalibration-rwe",
        "pros_of_this": "Internal CV is available without any additional external dataset; it can be run during model development and iteratively improved; it quantifies overfitting and guides decisions about regularization, shrinkage, and model complexity.",
        "cons_of_this": "Internal CV does not test transportability to a different population, database, era, or geography. A model that passes 5-fold CV has only been tested against patients from the same source — it may fail catastrophically in a different health system.",
        "when_to_prefer": "Always run internal CV first as part of model development; then proceed to external validation before deployment. These are sequential steps, not alternatives."
      },
      {
        "compared_to": "predictive-and-causal-ml-models-rwe",
        "pros_of_this": "CV and optimism correction are model-agnostic evaluation tools that apply to any prediction algorithm; they ensure that the performance metrics you report are honest before any external test. Nested CV is particularly important for flexible ML algorithms with many tuning choices.",
        "cons_of_this": "CV is an evaluation framework, not a modeling strategy; it does not choose the algorithm or resolve any causal question. It must be paired with the correct modeling choice (predictive vs causal ML) for the scientific question.",
        "when_to_prefer": "Use CV evaluation for any predictive ML model; the cross-fitting mechanism inside DML/TMLE is structurally related but solves a different problem (debiasing the causal estimate, not estimating predictive performance)."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "The most common leakage failure in claims is row-level splitting when patients have multiple fill, encounter, or episode records. Always check whether the analysis unit is patient or row; if row, use GroupKFold with person_id. Temporal splits are mandatory if the model will score future patients. Feature engineering (ever/never flags, utilization counts, drug classes) must be recomputed inside each fold on training data only — do not pre-aggregate and then split. For rare outcomes (< 5% event rate), use StratifiedGroupKFold.",
      "ehr": "Multi-site EHR datasets introduce site-level patterns; consider GroupKFold by site when the goal is to estimate performance at a new site (leave-one-site-out). Within-site temporal drift (order-set changes, workflow changes) motivates temporal splits even for internal validation. Lab and vital preprocessing (imputation, normalization) must live inside a Pipeline that is fitted only on training folds to prevent preprocessing leakage.",
      "registry": "Registry datasets are typically one-row-per-patient and adjudicated, so standard k-fold CV is appropriate. Stratified CV is still needed for rare outcomes. Bootstrap optimism correction is preferred when the registry is small (< 200 events). Registry-derived models validated internally should still be externally validated on claims or EHR before deployment because the registry case mix differs from the broader population.",
      "linked": "Linked claims-EHR datasets may have multiple rows per patient from different sources; the patient-level splitting requirement applies with extra force. When linking introduces a selected (linkable) subset, internal CV estimates apply to the linkable population, not the full underlying population — a form of selection bias that CV cannot correct. Temporal splits are especially important when linkage completeness changes over calendar time."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\nimport pandas as pd\nfrom sklearn.linear_model import LogisticRegression\nfrom sklearn.model_selection import (\n    GroupKFold, StratifiedGroupKFold, cross_validate, GridSearchCV\n)\nfrom sklearn.pipeline import Pipeline\nfrom sklearn.preprocessing import StandardScaler\nfrom sklearn.impute import SimpleImputer\nfrom sklearn.metrics import roc_auc_score\n\n# ── Prepare data ──\nFEATURE_COLS = [c for c in df.columns if c not in (\"person_id\", \"outcome\")]\nX = df[FEATURE_COLS].values\ny = df[\"outcome\"].values\ngroups = df[\"person_id\"].values          # CRITICAL: patient-level splitting\n\n# ── Pattern 1: GroupKFold — each patient appears in ONE fold only ──\n# Use StratifiedGroupKFold when event rate is < 10% (stratifies by outcome AND groups by patient).\n# For event rate >= 10%, plain GroupKFold is fine.\nevent_rate = y.mean()\nif event_rate < 0.10:\n    cv_outer = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=42)\nelse:\n    cv_outer = GroupKFold(n_splits=5)\n\n# ── Pattern 2: Pipeline — all preprocessing fitted only on training folds ──\n# SimpleImputer + StandardScaler inside the Pipeline are NEVER fitted on the test fold.\npipe = Pipeline([\n    (\"impute\",  SimpleImputer(strategy=\"median\")),  # fit on train fold only\n    (\"scale\",   StandardScaler()),                  # fit on train fold only\n    (\"clf\",     LogisticRegression(max_iter=500, C=1.0)),\n])\n\n# ── Flat 5-fold CV: honest internal estimate ──\nresults = cross_validate(\n    pipe, X, y,\n    cv=cv_outer,\n    groups=groups,\n    scoring=\"roc_auc\",\n    return_train_score=True,\n)\ncv_mean   = results[\"test_score\"].mean()\napparent  = results[\"train_score\"].mean()   # average apparent AUC across folds (proxy)\noptimism  = apparent - cv_mean\n\nprint(f\"CV mean AUC (honest): {cv_mean:.3f}\")\nprint(f\"Apparent AUC (train): {apparent:.3f}\")\nprint(f\"Optimism:             {optimism:.3f}\")\n\n# ── Pattern 3: Nested CV — inner 
loop selects C, outer loop evaluates honestly ──\n# Without nested CV, selecting C by the same folds used for evaluation is biased upward.\nparam_grid = {\"clf__C\": [0.01, 0.1, 1.0, 10.0]}\n# Inner CV: grouped to respect patient-level integrity inside the training fold.\ncv_inner = GroupKFold(n_splits=3)\n\n# By default (no metadata routing), cross_validate forwards `groups` only to the OUTER\n# splitter, so a grouped inner splitter would receive no groups and raise. The outer loop\n# is therefore written out explicitly and the training-fold groups go to GridSearchCV.fit.\nnested_scores = []\nfor train_idx, test_idx in cv_outer.split(X, y, groups):\n    nested_gs = GridSearchCV(\n        pipe, param_grid,\n        cv=cv_inner,            # inner: hyperparameter selection\n        scoring=\"roc_auc\",\n        refit=True,\n    )\n    # Inner loop sees only the outer-training rows and their patient ids.\n    nested_gs.fit(X[train_idx], y[train_idx], groups=groups[train_idx])\n    # Outer: honest evaluation of the tuned pipeline on the held-out patients.\n    proba = nested_gs.predict_proba(X[test_idx])[:, 1]\n    nested_scores.append(roc_auc_score(y[test_idx], proba))\n\nnested_cv_mean = float(np.mean(nested_scores))\nprint(f\"Nested CV mean AUC (pipeline + C-selection): {nested_cv_mean:.3f}\")\n# nested_cv_mean estimates performance of the full train-then-tune-then-predict procedure.\n# It is typically lower than a flat CV estimate read off at the C that looked best on the same folds.",
        "description": "Correct k-fold cross-validation for a clinical prediction model in a Medicare claims-style\ndataset. Demonstrates three essential patterns:\n  (1) GroupKFold by person_id to prevent patient-level leakage in multi-row datasets;\n  (2) Pipeline wrapping imputation + scaling + logistic regression so preprocessing is\n      fitted only on training folds (no preprocessing leakage);\n  (3) Nested CV sketch where an inner GridSearchCV selects the regularization penalty C\n      and an outer grouped loop evaluates the tuned pipeline on held-out patients.\nRequired input (one row per encounter / episode; one patient may have multiple rows):\n  df : DataFrame with columns person_id, outcome (0/1), and baseline feature columns\n       (all features computed from pre-index data only, with no post-index codes).",
        "dependencies": [
          "scikit-learn",
          "numpy",
          "pandas"
        ],
        "source_citations": [
          "steyerberg-2001",
          "varma-2006"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(rms)\nlibrary(rsample)\nlibrary(pROC)\n\n# ── 1. Bootstrap optimism correction via rms::validate ──\n# lrm fits a logistic regression model; dd stores distribution parameters for rms.\ndd <- datadist(dat); options(datadist = \"dd\")\n\nformula_str <- paste(\"outcome ~\", paste(setdiff(names(dat), c(\"outcome\",\"person_id\")),\n                                        collapse = \" + \"))\nfit_lrm <- lrm(as.formula(formula_str), data = dat, x = TRUE, y = TRUE)\n\n# validate() implements bootstrap optimism correction.\n# B = 200 bootstrap samples; the default optimism-corrected Dxy (= 2*(C - 0.5)) is reported.\nset.seed(42)\nval <- validate(fit_lrm, method = \"boot\", B = 200)\n\n# C-statistic = (Dxy + 1) / 2\napparent_C  <- (val[\"Dxy\", \"index.orig\"] + 1) / 2\ncorrected_C <- (val[\"Dxy\", \"index.corrected\"] + 1) / 2\noptimism    <- apparent_C - corrected_C\n\ncat(sprintf(\"Apparent C (training):  %.3f\\n\", apparent_C))\ncat(sprintf(\"Corrected C (bootstrap):%.3f\\n\", corrected_C))\ncat(sprintf(\"Optimism:               %.3f\\n\", optimism))\n\n# ── 2. 
Manual k-fold CV with patient-level grouping (rsample::group_vfold_cv) ──\n# group_vfold_cv ensures every row of the same patient stays in one fold.\nset.seed(42)\nfolds <- group_vfold_cv(dat, group = \"person_id\", v = 5)\n\nauc_per_fold <- vapply(folds$splits, function(split) {\n  train <- analysis(split)\n  test  <- assessment(split)\n\n  # Re-train on the training fold only.\n  dd_train <- datadist(train); options(datadist = \"dd_train\")\n  fit_fold <- lrm(as.formula(formula_str), data = train, x = TRUE, y = TRUE)\n  pred <- predict(fit_fold, newdata = test, type = \"fitted\")\n\n  # c-statistic on the held-out fold. pROC::auc() returns a classed numeric, so convert\n  # it to a plain number; fixing levels and direction stops pROC from auto-flipping\n  # a fold whose AUC falls below 0.5.\n  as.numeric(auc(test$outcome, pred, levels = c(0, 1), direction = \"<\", quiet = TRUE))\n}, numeric(1))\n\ncat(sprintf(\"Fold AUCs: %s\\n\", paste(round(auc_per_fold, 3), collapse=\", \")))\ncat(sprintf(\"CV mean AUC: %.3f\\n\", mean(auc_per_fold)))\n# Apparent minus CV mean = optimism estimate from 5-fold CV.\ncat(sprintf(\"Optimism (apparent %.3f - CV %.3f) = %.3f\\n\",\n            apparent_C, mean(auc_per_fold), apparent_C - mean(auc_per_fold)))",
        "description": "Bootstrap optimism correction for a logistic regression model using rms::validate, plus\nmanual k-fold CV using rsample with patient-level GroupKFold grouping. Demonstrates:\n  (1) rms::validate for bootstrap optimism correction (the Steyerberg/Harrell procedure);\n  (2) rsample group_vfold_cv for GroupKFold-equivalent splitting by patient id;\n  (3) Apparent vs corrected c-statistic and optimism reporting.\nRequired input (one row per patient after aggregation; or use group_vfold_cv for multi-row):\n  dat : data.frame with outcome (0/1), person_id, and baseline predictor columns.",
        "dependencies": [
          "rms",
          "rsample",
          "pROC"
        ],
        "source_citations": [
          "steyerberg-2001",
          "varma-2006"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Start[\"Prediction model fitted\\non full training dataset\\n(apparent AUC = 0.85)\"] --> Q{\"Multi-row dataset?\\n(one patient, many rows)\"}\n  Q -->|Yes: claims / EHR encounters| GKF[\"GroupKFold by person_id\\n(NEVER random row split)\"]\n  Q -->|No: one row per patient| KF[\"StratifiedKFold if rare outcome\\nKFold otherwise\"]\n  GKF --> CV[\"5-fold CV loop:\\nTrain on 4 folds → Test on 1 fold\\nCompute AUC on held-out fold\"]\n  KF  --> CV\n  CV  --> Mean[\"Average fold AUCs\\nCV mean = 0.75\"]\n  Mean --> Opt[\"Optimism = Apparent - CV mean\\n= 0.85 - 0.75 = 0.10\"]\n  Opt --> Q2{\"Hyperparameters\\ntuned on same folds?\"}\n  Q2 -->|Yes| Nested[\"Nested CV required\\n(inner loop: tune, outer loop: evaluate)\"]\n  Q2 -->|No| Report[\"Report CV mean as honest\\ninternal estimate (0.75)\"]\n  Report --> Border[\"Internal-external boundary:\\nCV = same source population only\\nDifferent system/era → external validation\"]\n  Nested --> Report",
        "caption": "Decision flow for honest internal validation: GroupKFold enforces patient-level integrity; the CV loop produces the optimism-corrected estimate; nested CV is triggered when hyperparameters are tuned; the result is always an internal estimate bounded by source population.",
        "alt_text": "Flowchart from a fitted model through a decision on dataset structure (multi-row or single-row per patient), into GroupKFold or stratified KFold, through 5-fold CV averaging to the CV mean, computing optimism as apparent minus CV mean, then branching on whether hyperparameters were tuned (requiring nested CV), and noting the internal-external validation boundary.",
        "source_type": "illustrative",
        "source_citations": [
          "steyerberg-2001"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "requires",
        "target_slug": "predictive-and-causal-ml-models-rwe",
        "notes": "Parent concept: the distinction between predictive and causal ML, the EPV rule, and the overfitting risk of high-dimensional feature matrices are all introduced in the parent entry. Cross-validation is the diagnostic step that quantifies overfitting for any predictive ML model in the parent's taxonomy."
      },
      {
        "relation_type": "see_also",
        "target_slug": "prediction-model-validation-recalibration-rwe",
        "notes": "The natural successor step: internal CV certifies performance in the source population; external validation (the target entry) certifies transportability to a different database, site, or era and adds recalibration when calibration drifts. CV and external validation are sequential, not interchangeable."
      },
      {
        "relation_type": "used_with",
        "target_slug": "tree-based-ensembles-rwe",
        "notes": "Random forests, gradient boosting, and other tree ensembles require nested CV for honest evaluation because their hyperparameters (n_estimators, max_depth, learning_rate) are selected from the data; flat CV on these models without nested selection is optimistic. GroupKFold is especially critical for tree models on multi-row claims datasets."
      },
      {
        "relation_type": "used_with",
        "target_slug": "regularized-regression-lasso-ridge-rwe",
        "notes": "LASSO, Ridge, and Elastic Net require cross-validation to select the regularization penalty lambda; if the same folds are used both to choose lambda and to report the AUC, the reported AUC is optimistic. Nested CV is the correct protocol when lambda is tuned on claims or EHR data."
      },
      {
        "relation_type": "used_with",
        "target_slug": "roc-auc-discrimination-rwe",
        "notes": "The c-statistic (AUC) is the primary performance metric estimated by cross-validation in binary prediction models. CV provides the honest AUC; apparent AUC systematically overstates it. The worked example (apparent 0.85, CV 0.75) directly uses the AUC/c-statistic framework defined in that companion entry."
      },
      {
        "relation_type": "see_also",
        "target_slug": "brier-score-calibration-rwe",
        "notes": "Cross-validation can and should estimate calibration (Brier score, calibration slope) in addition to discrimination. The Brier score is more sensitive to overfitting than AUC because it penalizes both discrimination failure and miscalibrated predicted probabilities. Bootstrap optimism also corrects the calibration slope in rms::validate."
      }
    ],
    "aliases": [
      "k-fold CV",
      "overfitting",
      "optimism correction",
      "train-test split",
      "data leakage",
      "patient-level splitting",
      "bootstrap optimism",
      "nested cross-validation"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "cumulative-incidence-risk-rwe",
    "name": "Cumulative Incidence and Absolute Risk Estimation",
    "short_definition": "Estimation of the absolute probability that an event occurs by a fixed horizon, using the Kaplan-Meier complement when there are no competing events and the Aalen-Johansen cumulative incidence function (with Fine-Gray subdistribution regression) when competing events such as death are present.",
    "long_description": "**Cumulative incidence (absolute risk)** is the probability that the event of interest has occurred by a fixed\nfollow-up horizon t, given the subject was event-free at time zero. It is the quantity decision-makers actually want —\n\"what is a patient's 1-year risk of stroke on this drug?\" — and is the natural input to number-needed-to-treat,\nbenefit-risk tables, cost-effectiveness Markov traces, and HTA absolute-effect framing. The estimator you choose is the\nentire methodological content of this concept, and choosing wrong silently biases the headline number.\n\n**Core estimand distinction — and the canonical trap.** Three distinct quantities are routinely (and wrongly)\nconflated:\n- **Cause-specific hazard** — the instantaneous rate of the event among those still event-free *and* still at risk\n  (competing events remove subjects from the risk set). This is the *etiologic* quantity: how the exposure acts on the\n  biological process. A cause-specific Cox model answers \"does the drug change the rate of the event?\"\n- **Cumulative incidence function (CIF)** — the absolute probability of the event by time t, computed so that the CIFs\n  for all event types plus the event-free probability sum to 1. This is the *prognostic / decision* quantity. With no\n  competing risks it equals 1 − Kaplan-Meier (1−KM); with competing risks it is the **Aalen-Johansen** estimator, which\n  correctly treats competing events as a separate exit, not as censoring.\n- **Subdistribution hazard (Fine-Gray)** — a regression that targets the CIF directly by keeping subjects who\n  experienced a competing event in a modified (\"subdistribution\") risk set. 
Its hazard ratio describes the direction of\n  the covariate effect on *absolute risk* (not its size on the probability scale), which is what changes a benefit-risk\n  conclusion.\n\nThe single error reviewers hunt for: **using 1−KM for the event of interest when competing events are non-negligible.**\nKaplan-Meier censors the competing event (e.g., death) and so assumes that patients who died remained at risk and\n\"could have gone on to have the event.\" It therefore **overestimates** cumulative incidence, and the overestimation is\n*differential* whenever the competing-event rate differs by arm, exactly the situation in elderly claims populations\nwhere one drug's patients die faster. The sum of separate 1−KM curves across event types can exceed 1, which is the\ndiagnostic tell that the estimator is incoherent. Use Aalen-Johansen (descriptive) and Fine-Gray (regression) for\nabsolute risk; reserve cause-specific hazards for etiology. A complete competing-risks analysis usually reports *both*\nthe cause-specific and subdistribution models, because they answer different questions and can point in opposite\ndirections.\n\n**Pros, cons, and trade-offs.**\n- **1−Kaplan-Meier vs Aalen-Johansen CIF:** 1−KM is simpler, ubiquitous, and *correct only when the competing-event\n  rate is negligible*. Aalen-Johansen costs nothing extra to compute and is always coherent (curves sum to ≤1). **Prefer\n  Aalen-Johansen** for any absolute-risk statement when death or another terminal competing event is plausible (almost\n  all elderly or oncology RWE). Use plain 1−KM only when the competing risk is truly rare or when the estimand is a\n  composite that includes it.\n- **Cause-specific Cox vs Fine-Gray subdistribution:** the cause-specific HR is the cleaner *etiologic* effect and is\n  what you want for mechanism; the Fine-Gray subdistribution HR maps to the *absolute-risk* effect and is what you want\n  for prognosis and HTA. 
Cost: the subdistribution HR has a non-intuitive interpretation (it operates on a risk set that\n  retains subjects who already had a competing event) and is sensitive to the competing-event rate, so it is not\n  transportable across populations with different background mortality. **Report both**; do not pick one and call it\n  \"the\" effect.\n- **CIF/risk vs incidence rate (person-time):** a rate (events per person-year) is a hazard summary, not a probability,\n  and is not bounded by 1; it is the right currency when follow-up is short relative to risk or when you need a single\n  summary across variable follow-up. Cumulative incidence is the right currency for fixed-horizon absolute risk and\n  decision modeling. They are not interchangeable — see the incidence-rate concept.\n- **CIF vs RMST:** restricted mean survival time summarizes the whole curve up to a horizon as expected event-free time\n  and sidesteps proportional-hazards assumptions; CIF gives the point-in-time probability. RMST is often the better\n  single comparative summary when curves cross; CIF is what a clinician reads off for \"risk by year 1.\"\n\n**When to use.** Any fixed-horizon absolute-risk question in cohort RWE: 1-year risk of an adverse event, 5-year\nrecurrence, cumulative incidence of a utilization or cost event; the descriptive backbone of comparative safety\nstudies; the input layer for decision-analytic and HTA models that need absolute probabilities rather than hazard\nratios.\n\n**When NOT to use / when this is actively misleading.**\n- **Do not use 1−KM for cause-specific cumulative incidence when a competing event is common.** This is the dominant\n  failure mode and it is *systematic upward bias*, worsened differentially by arm. 
Switch to Aalen-Johansen.\n- **Do not interpret a Fine-Gray subdistribution HR as an etiologic effect** — a drug can have no effect on the\n  cause-specific hazard yet show a non-null subdistribution HR purely because it changes the competing (death) rate.\n- **Do not report a single estimator when competing risks exist.** A cause-specific HR near 1 with a Fine-Gray HR <1 is\n  not a contradiction; it usually means the exposure reduces the competing event. Reporting only one hides the story.\n- **Do not estimate cumulative incidence with informative/dependent censoring** (e.g., disenrollment driven by the\n  outcome) without addressing it — Aalen-Johansen and KM both assume censoring is non-informative within the modeled\n  structure.\n- **Immortal time.** If time zero is set at a landmark that requires surviving to receive a treatment/procedure, the\n  risk denominator is corrupted before any estimator runs; fix the cohort (see immortal-time-bias-handling), not the\n  formula.\n\n**Data-source operational depth.**\n- **Claims (FFS vs Medicare Advantage):** the at-risk denominator is *observed* person-time under continuous\n  enrollment, and the dominant competing risk in older cohorts is death. Two distinct failure modes: (1) **MA-only\n  person-time lacks adjudicable FFS claims**, so events and competing deaths are under-ascertained — restrict the risk\n  set to enrollees with Parts A/B (and D for drug exposure) and exclude MA-only spans rather than treating them as\n  event-free follow-up. (2) **Death is incompletely captured in claims alone** (disenrollment and death are\n  confounded); link to the Medicare/SSA death master or vital records, otherwise the competing event is misclassified\n  as censoring and the analysis silently collapses back to a 1−KM-like overestimate. 
Censor at disenrollment, end of\n  data, and horizon; treat death as a competing event, not a censor.\n- **EHR:** capture is encounter-driven, so both the event and the competing death can be missed when a patient leaves\n  the system — loss to follow-up is potentially informative and biases the CIF. Define the observation window\n  explicitly, link to an external death index, and report follow-up completeness by arm.\n- **Registry:** strong for adjudicated events and (often) vital status, which makes the competing-risk structure\n  trustworthy; typically weak for complete drug exposure and for out-of-registry events. Link to claims to firm up\n  exposure and non-registry outcomes.\n- **Linked claims–EHR–vital records:** the ideal substrate — EHR severity for risk adjustment, claims completeness for\n  person-time, and a reliable death index to model the competing risk — but linkage selects the linkable subset and\n  introduces order/fill/service-date discrepancies that must be reconciled before time zero and the at-risk clock are\n  set.\n\n**Worked claims example.** Question: 1-year cumulative incidence of ischemic stroke among adults ≥66 initiating\nanticoagulant A vs anticoagulant B in Medicare fee-for-service, where **death is a strong competing risk**. (1)\nEligibility: ≥66 years, ≥365 days continuous Parts A/B/D before the first qualifying fill, no anticoagulant fill in the\n365-day washout (incident users of both arms). (2) Time zero: the first qualifying `fill_date`; assign arm from the NDC\ndispensed that day. (3) At-risk clock: from time zero, person-time accrues only while continuously enrolled in FFS (no\nMA-only spans). (4) Event ascertainment: first inpatient ischemic-stroke diagnosis in the primary position within the\n365-day horizon. (5) **Competing event:** all-cause death from the linked death master file — coded as a *competing\nevent*, not as censoring. (6) Other censoring: FFS disenrollment, end of data, horizon. 
(7) Estimate the CIF with\nAalen-Johansen by arm; **also** compute the naive 1−KM and show it materially exceeds the CIF (e.g., 1−KM 6.1% vs CIF\n5.2% in the arm with higher background mortality), with a larger gap in the older/sicker arm — that gap is the\ndifferential bias 1−KM would inject into the headline. (8) Fit a Fine-Gray model for the adjusted subdistribution HR on\nabsolute stroke risk and a cause-specific Cox model for the etiologic rate, reporting both. (9) Sensitivity: vary the\nhorizon, the grace period for as-treated exposure, and a death-only-from-claims definition to bound the impact of\nincomplete mortality capture.\n\n**Interpreting the output**\n\nAn Aalen-Johansen cumulative incidence analysis returns: 12-month stroke incidence = 8.0% (95% CI 6.2–10.1%) in the treated group; naive 1−KM estimate = 9.3% — an overstatement of 1.3 percentage points driven by competing mortality.\n\n*Formal interpretation.* The Aalen-Johansen CIF of 8.0% is the estimated probability of experiencing stroke by 12 months in the presence of competing risks. It treats competing events (primarily non-stroke death) as events — not as censoring — so patients who die cannot later have a stroke. The naive 1−KM estimate of 9.3% treats competing deaths as censored observations, implicitly assuming that censored patients remain at risk and would eventually have strokes at the same rate as survivors. Under competing risks this is impossible, and 1−KM therefore overstates absolute stroke probability. The magnitude of overstatement grows with the competing event rate and the length of follow-up.\n\n*Practical interpretation.* The 1.3 percentage point difference between 1−KM (9.3%) and CIF (8.0%) is clinically meaningful at scale: applied to a population of 100,000 patients, it corresponds to 1,300 apparent strokes that do not exist. 
For regulatory benefit-risk submissions and payer-facing burden-of-disease reports, report the Aalen-Johansen CIF as the primary absolute risk estimate, pair it with a Fine-Gray regression for covariate-adjusted inference on absolute stroke risk, and report the cause-specific Cox model for the etiologic hazard.",
    "primary_category": "Descriptive_Epidemiology",
    "tags": [
      "cumulative-incidence",
      "absolute-risk",
      "competing-risks",
      "aalen-johansen",
      "kaplan-meier",
      "fine-gray",
      "subdistribution-hazard",
      "cause-specific-hazard",
      "survival-analysis"
    ],
    "applies_to_study_types": [
      "claims_analysis",
      "ehr_study",
      "registry_linkage",
      "comparative_effectiveness",
      "target_trial_emulation"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1093/aje/kwp107",
        "url": "https://doi.org/10.1093/aje/kwp107",
        "citation_text": "Lau B, Cole SR, Gange SJ. Competing risk regression models for epidemiologic data. American Journal of Epidemiology. 2009;170(2):244-256.",
        "year": 2009,
        "authors_short": "Lau, Cole & Gange",
        "notes": "Canonical epidemiologic entry point distinguishing cause-specific and subdistribution (cumulative incidence) approaches and when each is appropriate."
      },
      {
        "role": "explain",
        "doi": "10.1161/CIRCULATIONAHA.115.017719",
        "url": "https://doi.org/10.1161/CIRCULATIONAHA.115.017719",
        "citation_text": "Austin PC, Lee DS, Fine JP. Introduction to the analysis of survival data in the presence of competing risks. Circulation. 2016;133(6):601-609.",
        "year": 2016,
        "authors_short": "Austin, Lee & Fine",
        "notes": "Definitive applied explanation of why 1-Kaplan-Meier overestimates cumulative incidence under competing risks and how Aalen-Johansen and Fine-Gray fix it."
      },
      {
        "role": "explain",
        "doi": "10.1002/sim.2712",
        "url": "https://doi.org/10.1002/sim.2712",
        "citation_text": "Putter H, Fiocco M, Geskus RB. Tutorial in biostatistics: competing risks and multi-state models. Statistics in Medicine. 2007;26(11):2389-2430.",
        "year": 2007,
        "authors_short": "Putter, Fiocco & Geskus",
        "notes": "Rigorous biostatistical tutorial deriving the cause-specific hazard, the cumulative incidence function, and multi-state generalizations."
      },
      {
        "role": "demonstrate",
        "doi": "10.1080/01621459.1999.10474144",
        "url": "https://doi.org/10.1080/01621459.1999.10474144",
        "citation_text": "Fine JP, Gray RJ. A proportional hazards model for the subdistribution of a competing risk. Journal of the American Statistical Association. 1999;94(446):496-509.",
        "year": 1999,
        "authors_short": "Fine & Gray",
        "notes": "Original derivation of the subdistribution hazard model that regresses directly on the cumulative incidence function."
      }
    ],
    "plain_language_summary": "Cumulative incidence answers one practical question: out of every 100 patients who started the study healthy, how many had the event by a specific point in time? In the simplest case, where every patient is followed for the whole period, you count who had the event, divide by the number of patients in the group at the very beginning, and express it as a proportion, for example, '3 out of 8 patients had a stroke within one year.' One important complication: if some patients can die from another cause before getting the event, those deaths cannot simply be ignored, because a patient who has already died can never have a stroke. Counting them as though they might still get the event inflates the risk estimate.",
    "key_terms": [
      {
        "term": "time zero",
        "definition": "The starting point for each patient's follow-up clock — usually the date they first took the drug being studied or met the study entry criteria."
      },
      {
        "term": "censoring",
        "definition": "When a patient leaves the study before the observation window ends without having the event — for example, they dropped their insurance coverage — so you know the event had not happened yet but you cannot observe what happened afterward."
      },
      {
        "term": "competing event",
        "definition": "An outcome that permanently prevents the event of interest from ever occurring — death is the classic example, because a patient who has died can never later have a stroke."
      },
      {
        "term": "observation window",
        "definition": "The fixed length of time you follow every patient, such as 'one year from time zero'; patients who survive event-free to the end of the window are noted as censored at that date."
      },
      {
        "term": "Aalen-Johansen estimator",
        "definition": "A calculation method that correctly accounts for competing events (like death) when estimating cumulative incidence, ensuring the risk estimate is not inflated the way a simpler method would be."
      }
    ],
    "worked_example": {
      "scenario": "A researcher wants to know the 1-year cumulative incidence of ischemic stroke among 8 patients who started a new anticoagulant on the same day (time zero = day 0). Each patient is followed for up to 365 days. Some patients have a stroke (the event of interest), some die from another cause (a competing event — once dead, a patient cannot have a stroke), and some reach the end of follow-up without any event (censored). The question is: what fraction of the original 8 patients experienced a stroke within one year?",
      "dataset": {
        "caption": "One row per patient showing how long each was followed and what happened first. Status codes: 0 = censored (no event by end of follow-up), 1 = stroke (event of interest), 2 = death from another cause (competing event).",
        "columns": [
          "person_id",
          "follow_up_days",
          "status",
          "status_label"
        ],
        "rows": [
          [
            1001,
            90,
            1,
            "stroke"
          ],
          [
            1002,
            180,
            1,
            "stroke"
          ],
          [
            1003,
            270,
            1,
            "stroke"
          ],
          [
            1004,
            120,
            2,
            "death (competing event)"
          ],
          [
            1005,
            200,
            0,
            "censored (disenrolled)"
          ],
          [
            1006,
            300,
            0,
            "censored (disenrolled)"
          ],
          [
            1007,
            365,
            0,
            "censored (end of window)"
          ],
          [
            1008,
            365,
            0,
            "censored (end of window)"
          ]
        ]
      },
      "steps": [
        "Start with all 8 patients in the denominator — everyone who was at risk at time zero (day 0).",
        "Count patients who had a stroke (status = 1) by day 365: patients 1001, 1002, and 1003 — that is 3 strokes.",
        "Patient 1004 died from another cause at day 120 (status = 2). This is a competing event: once dead, this patient could never have had a stroke, so they are NOT counted in the stroke numerator.",
        "Patients 1005 and 1006 were censored before day 365 because they left the insurance plan. We know they had not had a stroke yet at that point, but we cannot observe them afterward.",
        "Patients 1007 and 1008 reached the end of the 365-day window with no stroke — they are also censored at day 365.",
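        "Cumulative incidence at 365 days = strokes observed / patients at risk at time zero = 3 / 8 = 0.375, or 37.5 per 100 patients. This simple proportion treats the two early-censored patients (1005 and 1006) as stroke-free through day 365; the Aalen-Johansen estimator, which accounts for their unobserved follow-up, gives about 0.406 for the same data."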
      ],
      "result": "Cumulative incidence at 1 year = 3 strokes / 8 patients at time zero = 0.375 (37.5%).",
      "timeline_spec": {
        "title": "1-year cumulative incidence of stroke — 8 patients, competing event (death) present",
        "window": {
          "start": "2023-01-01",
          "end": "2023-12-31",
          "label": "Denominator: 8 patients at risk at time zero"
        },
        "events": [
          {
            "label": "Pt 1001",
            "start": "2023-01-01",
            "length_days": 90,
            "quantity": "stroke at day 90"
          },
          {
            "label": "Pt 1002",
            "start": "2023-01-01",
            "length_days": 180,
            "quantity": "stroke at day 180"
          },
          {
            "label": "Pt 1003",
            "start": "2023-01-01",
            "length_days": 270,
            "quantity": "stroke at day 270"
          },
          {
            "label": "Pt 1004",
            "start": "2023-01-01",
            "length_days": 120,
            "quantity": "death (competing) at day 120"
          },
          {
            "label": "Pt 1005",
            "start": "2023-01-01",
            "length_days": 200,
            "quantity": "censored at day 200"
          },
          {
            "label": "Pt 1006",
            "start": "2023-01-01",
            "length_days": 300,
            "quantity": "censored at day 300"
          },
          {
            "label": "Pt 1007",
            "start": "2023-01-01",
            "length_days": 365,
            "quantity": "censored at day 365"
          },
          {
            "label": "Pt 1008",
            "start": "2023-01-01",
            "length_days": 365,
            "quantity": "censored at day 365"
          }
        ],
        "spans": [
          {
            "kind": "followup",
            "start": "2023-01-01",
            "end": "2023-04-01",
            "label": "Pt 1001: 90 days, then stroke"
          },
          {
            "kind": "followup",
            "start": "2023-01-01",
            "end": "2023-06-30",
            "label": "Pt 1002: 180 days, then stroke"
          },
          {
            "kind": "followup",
            "start": "2023-01-01",
            "end": "2023-09-28",
            "label": "Pt 1003: 270 days, then stroke"
          },
          {
            "kind": "followup",
            "start": "2023-01-01",
            "end": "2023-05-01",
            "label": "Pt 1004: 120 days, then death (competing event — NOT a stroke)"
          },
          {
            "kind": "gap",
            "start": "2023-07-20",
            "end": "2023-12-31",
            "label": "Pt 1005: censored at day 200, unobserved after"
          },
          {
            "kind": "gap",
            "start": "2023-10-28",
            "end": "2023-12-31",
            "label": "Pt 1006: censored at day 300, unobserved after"
          },
          {
            "kind": "followup",
            "start": "2023-01-01",
            "end": "2023-12-31",
            "label": "Pt 1007: full window, no event"
          },
          {
            "kind": "followup",
            "start": "2023-01-01",
            "end": "2023-12-31",
            "label": "Pt 1008: full window, no event"
          }
        ],
        "result": {
          "label": "3 strokes / 8 patients at time zero = cumulative incidence 0.375 (37.5%) at 1 year",
          "value": 0.375
        },
        "caption": "Each horizontal bar shows one patient's follow-up period from time zero (day 0, January 1) to their first event or end of observation. Three patients reached a stroke endpoint (solid bars ending in an event mark); one patient died from another cause before day 365 — a competing event that is tracked separately and not counted as a stroke; two patients were censored early when they left the insurance plan; and two patients completed the full year without any event. Cumulative incidence = 3 strokes ÷ 8 patients at risk at time zero = 0.375.",
        "alt_text": "Timeline diagram showing eight horizontal patient follow-up bars spanning from January 1 to up to December 31. Three bars end with a stroke event at days 90, 180, and 270 respectively. One bar ends at day 120 with a competing-event marker (death). Two bars end early at days 200 and 300 with a censored marker. Two bars run the full 365 days ending with a censored marker at the window boundary. A summary label reads: cumulative incidence equals 3 divided by 8 equals 0.375."
      }
    },
    "prerequisites": [
      "descriptive-epidemiology-rwe",
      "person-time-denominator-construction-rwe",
      "incidence-rate-calculation-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Kaplan-Meier complement (1-KM), no competing risks",
        "description": "Cumulative incidence estimated as one minus the Kaplan-Meier survival function for the single event of interest. Valid only when competing events are negligible or the estimand intentionally treats the competing event as censoring.",
        "edge_cases": [
          "Non-negligible competing events (death in elderly cohorts) make 1-KM overestimate risk; the bias is differential when competing-event rates differ by arm.",
          "Heavy late censoring inflates the tail of the curve and the variance of the horizon estimate."
        ],
        "data_source_notes": "claims: ensure censoring (disenrollment) is not driven by the outcome; if death is common, switch to Aalen-Johansen rather than censoring deaths."
      },
      {
        "name": "Aalen-Johansen cumulative incidence function (CIF)",
        "description": "Non-parametric estimator of the CIF for each event type under competing risks; the CIFs across event types plus the event-free probability sum to 1, so absolute risks are coherent.",
        "edge_cases": [
          "Requires a clean, mutually exclusive event-type coding at the first event (event of interest vs competing death vs censoring).",
          "Confidence intervals should be on a transformed scale (e.g., log-log) to stay within [0,1] at the horizon."
        ],
        "data_source_notes": "claims/linked: requires reliable capture of the competing event (death) via a linked death master; missing deaths collapse the estimator back toward 1-KM."
      },
      {
        "name": "Fine-Gray subdistribution-hazard regression",
        "description": "Regression targeting the CIF directly via the subdistribution hazard; the hazard ratio gives the direction of the covariate effect on absolute (cumulative) risk, though not its size on the probability scale, and the model is the natural regression input to benefit-risk and HTA.",
        "edge_cases": [
          "The subdistribution risk set retains subjects who had a competing event, so the HR is not an etiologic effect and is sensitive to the background competing-event rate.",
          "Not transportable across populations with different competing-event (mortality) rates; report alongside a cause-specific model."
        ],
        "data_source_notes": "claims: weight construction needs the same continuous-enrollment / competing-event capture as the descriptive CIF."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "1 minus Kaplan-Meier for the event of interest",
        "pros_of_this": "Aalen-Johansen CIF treats competing events as a distinct exit rather than as censoring, so absolute risk is not inflated and the CIFs across event types plus the event-free probability sum to 1.",
        "cons_of_this": "Requires reliable ascertainment of the competing event (usually death) and slightly more programming than a single KM curve.",
        "when_to_prefer": "Whenever a competing event (death, transplant, etc.) is non-negligible, especially in elderly or oncology claims where competing mortality differs by arm."
      },
      {
        "compared_to": "Cause-specific hazard (Cox) modeling",
        "pros_of_this": "Fine-Gray / CIF gives the absolute-risk effect that decision-makers and HTA bodies need; cause-specific gives the etiologic rate effect.",
        "cons_of_this": "The subdistribution HR has a non-intuitive risk set and is not transportable across populations with different competing-event rates.",
        "when_to_prefer": "Report both. Use the CIF/Fine-Gray for prognosis and absolute effects; use cause-specific hazards for mechanism and etiology."
      },
      {
        "compared_to": "Incidence rate (events per person-time)",
        "pros_of_this": "Cumulative incidence is a fixed-horizon probability bounded by 1, directly usable in NNT and decision models.",
        "cons_of_this": "A probability requires a defined horizon and a coherent at-risk/competing-risk structure; a rate summarizes variable follow-up in one number without a horizon.",
        "when_to_prefer": "Use cumulative incidence for absolute fixed-horizon risk; use rates when follow-up is short relative to risk or highly variable."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Person-time accrues only under continuous fee-for-service enrollment; death is the dominant competing event and must be captured from a linked death master, not treated as censoring. Exclude MA-only spans (no adjudicable FFS claims). Censor at disenrollment, end of data, and horizon.",
      "ehr": "Encounter-driven capture means both the event and competing death can be missed when a patient leaves the system; loss to follow-up is potentially informative. Define observation windows explicitly and link to an external death index.",
      "registry": "Strong for adjudicated events and vital status (trustworthy competing-risk structure) but weak for complete drug exposure and out-of-registry events; link to claims to firm up exposure and non-registry outcomes.",
      "linked": "Linked claims-EHR-vital-records is the ideal substrate (severity + person-time completeness + reliable mortality) but linkage selects the linkable subset and introduces order/fill/service-date discrepancies that must be reconciled before time zero and the at-risk clock are set."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\nfrom lifelines import KaplanMeierFitter, AalenJohansenFitter\n\nHORIZON_DAYS = 365\n\ndef absolute_risk_by_arm(df: pd.DataFrame) -> pd.DataFrame:\n    \"\"\"df: person_id, arm, fut_days, status in {0=censor, 1=stroke, 2=death(competing)}.\"\"\"\n    out = []\n    for arm, g in df.groupby(\"arm\"):\n        # 1 - Kaplan-Meier: treats the competing event (death) as CENSORING -> overestimates risk.\n        km = KaplanMeierFitter().fit(\n            g[\"fut_days\"], event_observed=(g[\"status\"] == 1).astype(int)\n        )\n        one_minus_km = 1.0 - float(km.predict(HORIZON_DAYS))\n\n        # Aalen-Johansen CIF for the event of interest (event_of_interest=1), death (2) competing.\n        aj = AalenJohansenFitter(calculate_variance=True).fit(\n            g[\"fut_days\"], event_observed=g[\"status\"], event_of_interest=1\n        )\n        cif = float(aj.predict(HORIZON_DAYS))  # CIF at the horizon (last estimate at or before day 365)\n\n        out.append({\"arm\": arm, \"n\": len(g),\n                    \"risk_1_minus_KM\": round(one_minus_km, 4),\n                    \"risk_AalenJohansen_CIF\": round(cif, 4),\n                    \"upward_bias_of_KM\": round(one_minus_km - cif, 4)})\n    return pd.DataFrame(out)\n\nif __name__ == \"__main__\":\n    # Toy single-arm input (the 8-patient worked example); replace with the real analysis table.\n    # In real data, the arm with higher competing mortality (status==2) shows a larger\n    # 1-KM minus CIF gap, so the bias 1-KM injects into the headline is differential.\n    df = pd.DataFrame({\n        \"person_id\": range(1001, 1009),\n        \"arm\": [\"A\"] * 8,\n        \"fut_days\": [90, 180, 270, 120, 200, 300, 365, 365],\n        \"status\": [1, 1, 1, 2, 0, 0, 0, 0],\n    })\n    print(absolute_risk_by_arm(df))",
        "description": "Aalen-Johansen CIF and 1-KM side by side from a claims-style analysis table (Fine-Gray is deferred to R or SAS; see the note below).\nRequired input (one row per subject, already cleaned):\n  df : person_id, arm ('A'/'B'), fut_days (time zero -> event/competing/censor, in days),\n       status in {0=censored, 1=event of interest (stroke), 2=competing event (death)},\n       plus baseline covariates (age, female, chads2, ...) measured in [index_date-365, index_date].\nDemonstrates that 1-KM (which censors deaths) overstates absolute risk relative to the Aalen-Johansen CIF.\nNOTE on Fine-Gray: there is no faithful, maintained Fine-Gray subdistribution model in mainstream Python\n(lifelines and scikit-survival do not implement the time-varying subdistribution risk set). Do the descriptive\nCIF in Python, and run the Fine-Gray regression via R cmprsk::crr (rpy2) or in SAS PROC PHREG eventcode= --\nboth shown in the R and SAS blocks below. Do not fake it with a single scalar IPCW weight on a Cox model.",
        "dependencies": [
          "pandas",
          "lifelines"
        ],
        "source_citations": [
          "austin-2016"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(survival)\nlibrary(cmprsk)\n\nHORIZON <- 365\n\nd$status <- factor(d$status, levels = c(\"censor\", \"stroke\", \"death\"))\n\n## Aalen-Johansen CIF by arm: survfit on a factor status returns per-state CIFs (P-states).\naj <- survfit(Surv(fut_days, status) ~ arm, data = d, id = person_id)\ns  <- summary(aj, times = HORIZON)\ncif_stroke <- s$pstate[, \"stroke\"]            # multistate survfit: pull CIF from $pstate, not $surv\n\n## 1 - Kaplan-Meier for stroke, CENSORING deaths -> overestimates absolute risk.\nkm <- survfit(Surv(fut_days, status == \"stroke\") ~ arm, data = d)\none_minus_km <- 1 - summary(km, times = HORIZON)$surv\n## Compare one_minus_km against the Aalen-Johansen 'stroke' CIF: KM is larger,\n## and the gap is wider in the arm with higher competing mortality.\n\n## Gray's test for equality of CIFs (cause = 1 = stroke; 2 = death is competing).\nd$ev <- as.integer(d$status) - 1L   # 0 = censor, 1 = stroke, 2 = death\ngray <- cuminc(ftime = d$fut_days, fstatus = d$ev, group = d$arm)\n\n## Fine-Gray subdistribution regression: HR on ABSOLUTE (cumulative) stroke risk.\nX  <- model.matrix(~ age + female + chads2 + arm, data = d)[, -1]\nfg <- crr(ftime = d$fut_days, fstatus = d$ev, cov1 = X, failcode = 1, cencode = 0)\nsummary(fg)   # exp(coef) = subdistribution hazard ratios\n\n## Cause-specific Cox for the ETIOLOGIC effect (deaths treated as censoring here, by design).\ncsh <- coxph(Surv(fut_days, status == \"stroke\") ~ age + female + chads2 + arm, data = d)\nsummary(csh)  # report alongside Fine-Gray; they answer different questions",
        "description": "Aalen-Johansen CIF (survfit), 1-KM comparison, and a Fine-Gray model (cmprsk / tidycmprsk).\nRequired input (one row per subject):\n  d : person_id, arm (factor 'A'/'B'), fut_days, status (factor: 'censor','stroke','death'),\n      baseline covariates (age, female, chads2). 'death' is the competing event.\nsurvival::survfit() with a multi-state (factor) status returns the Aalen-Johansen CIF directly.",
        "dependencies": [
          "survival",
          "cmprsk",
          "tidycmprsk"
        ],
        "source_citations": [
          "fine-1999",
          "lau-2009"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* Aalen-Johansen CIF by arm + Gray's test. failcode=1 picks stroke; death (2) is competing. */\nproc lifetest data=work.analysis plots=cif(test);\n  time fut_days * status(0) / failcode=1;   /* 0 = censored; failcode=1 = event of interest */\n  strata arm;                                /* Gray's test for equality of CIFs across arms */\nrun;\n\n/* Fine-Gray SUBDISTRIBUTION hazard model: HR on absolute (cumulative) stroke risk. */\nproc phreg data=work.analysis;\n  class arm (ref='B');\n  model fut_days * status(0) = arm age female chads2 / eventcode=1;  /* 1 = stroke; 2 treated as competing */\n  hazardratio arm;                            /* subdistribution HR */\nrun;\n\n/* CAUSE-SPECIFIC hazard for the etiologic effect: censor the competing death (status 0 and 2). */\nproc phreg data=work.analysis;\n  class arm (ref='B');\n  model fut_days * status(0 2) = arm age female chads2;  /* deaths censored -> cause-specific */\n  hazardratio arm;\nrun;\n\n/* Naive 1-KM for stroke (deaths censored) to SHOW the upward bias vs the CIF above. */\nproc lifetest data=work.analysis;\n  time fut_days * status(0 2);   /* only stroke counts as event; death censored -> overestimates risk */\n  strata arm;\nrun;",
        "description": "Aalen-Johansen CIF + Gray's test (PROC LIFETEST) and Fine-Gray regression (PROC PHREG eventcode=).\nRequired input dataset work.analysis (one row per subject, post data-management):\n  person_id, arm ('A'/'B'), fut_days (numeric), status (1=stroke, 2=death competing, 0=censor),\n  age female chads2 (baseline covariates measured in [index_date-365, index_date]).\nPROC LIFETEST with failcode= and plots=cif produces the Aalen-Johansen CIF and Gray's test;\nPROC PHREG with eventcode= fits the Fine-Gray subdistribution model. A separate PHREG with the\ncompeting event censored gives the cause-specific hazard for comparison.",
        "dependencies": [],
        "source_citations": [
          "fine-1999"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": "cumulative-incidence-risk-rwe-timeline.svg",
        "mermaid": null,
        "caption": "Each horizontal bar shows one patient's follow-up period from time zero (day 0, January 1) to their first event or end of observation. Three patients reached a stroke endpoint (solid bars ending in an event mark); one patient died from another cause before day 365, a competing event that is tracked separately and not counted as a stroke; two patients were censored early when they left the insurance plan; and two patients completed the full year without any event. Simple cumulative incidence = 3 strokes ÷ 8 patients at risk at time zero = 0.375. This simple proportion treats the two early-censored patients as stroke-free through day 365; an Aalen-Johansen estimate that accounts for their censoring would be somewhat higher.",
        "alt_text": "Timeline diagram showing eight horizontal patient follow-up bars spanning from January 1 to up to December 31. Three bars end with a stroke event at days 90, 180, and 270 respectively. One bar ends at day 120 with a competing-event marker (death). Two bars end early at days 200 and 300 with a censored marker. Two bars run the full 365 days ending with a censored marker at the window boundary. A summary label reads: cumulative incidence equals 3 divided by 8 equals 0.375.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Q[Estimand: absolute probability of the event by horizon t] --> CR{Is a competing event<br/>death, transplant, etc.<br/>non-negligible?}\n  CR -- No --> KM[1 - Kaplan-Meier is acceptable<br/>single event, deaths rare]\n  CR -- Yes --> AJ[Aalen-Johansen CIF<br/>treats competing event as a distinct exit]\n  AJ --> Both[Report BOTH model families]\n  Both --> FG[Fine-Gray subdistribution HR<br/>effect on ABSOLUTE risk -> prognosis / HTA]\n  Both --> CSH[Cause-specific Cox HR<br/>effect on the RATE -> etiology / mechanism]\n  KM -.->|deaths become common| Warn[1 - KM overestimates risk<br/>bias is differential when<br/>competing-event rate differs by arm]\n  Warn --> AJ",
        "caption": "Estimator decision logic. When a competing event is non-negligible, 1-Kaplan-Meier overestimates cumulative incidence; use the Aalen-Johansen CIF for absolute risk and report both Fine-Gray (absolute-risk effect) and cause-specific (etiologic effect) regressions.",
        "alt_text": "Decision flowchart asking whether a competing event is non-negligible; if no, 1-minus-Kaplan-Meier is acceptable; if yes, use the Aalen-Johansen cumulative incidence function and report both Fine-Gray and cause-specific hazard models.",
        "source_type": "illustrative",
        "source_citations": [
          "austin-2016"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  T0[Time zero<br/>first qualifying fill / index] --> RS[At-risk under continuous FFS enrollment]\n  RS --> E1[Ischemic stroke<br/>status = 1 event of interest]\n  RS --> E2[Death from linked death master<br/>status = 2 COMPETING event]\n  RS --> C[Disenroll / end of data / horizon<br/>status = 0 censored]\n  E2 -.->|if death is wrongly treated as censoring| Bias[Collapses to 1 - KM<br/>upward-biased risk]",
        "caption": "Event-coding flow for a claims cumulative-incidence analysis. Death is captured from a linked death index and coded as a competing event (status 2); coding it as censoring reintroduces the 1-Kaplan-Meier overestimate.",
        "alt_text": "Flow from time zero through the at-risk period to three mutually exclusive outcomes - stroke (event of interest), death (competing event from a linked death file), or censoring - with a warning that mislabeling death as censoring biases risk upward.",
        "source_type": "illustrative",
        "source_citations": [
          "lau-2009"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "is_variant_of",
        "target_slug": "descriptive-epidemiology-rwe",
        "notes": "Cumulative incidence is the absolute-risk member of the descriptive-epidemiology measure family."
      },
      {
        "relation_type": "used_with",
        "target_slug": "competing-risks-cause-specific-fine-gray-rwe",
        "notes": "The cause-specific hazard and Fine-Gray subdistribution models are the regression counterparts of the cumulative incidence function; report both for a complete competing-risks analysis."
      },
      {
        "relation_type": "alternative_to",
        "target_slug": "incidence-rate-calculation-rwe",
        "notes": "An incidence rate (events per person-time) is a hazard summary unbounded by 1; cumulative incidence is a fixed-horizon probability. They answer different questions and are not interchangeable."
      },
      {
        "relation_type": "see_also",
        "target_slug": "restricted-mean-survival-time-rmst",
        "notes": "RMST summarizes the whole survival curve to a horizon and is robust to crossing curves and non-proportional hazards; cumulative incidence gives the point-in-time absolute probability."
      },
      {
        "relation_type": "see_also",
        "target_slug": "survival-extrapolation-hta-rwe",
        "notes": "HTA models extrapolate cumulative incidence / survival beyond observed follow-up; the within-trial CIF is the anchor for that extrapolation."
      },
      {
        "relation_type": "requires",
        "target_slug": "person-time-denominator-construction-rwe",
        "notes": "The at-risk clock and competing-risk structure depend on correctly constructed observed person-time under continuous enrollment."
      },
      {
        "relation_type": "see_also",
        "target_slug": "immortal-time-bias-handling",
        "notes": "If time zero is set at a landmark that requires surviving to be treated, the risk denominator is corrupted before any estimator is applied; fix the cohort, not the formula."
      },
      {
        "relation_type": "used_with",
        "target_slug": "target-trial-emulation",
        "notes": "Cumulative incidence (CIF) is the standard absolute-risk endpoint reported from a target-trial emulation alongside hazard-ratio contrasts."
      }
    ],
    "aliases": [
      "cumulative incidence function",
      "absolute risk",
      "CIF",
      "incidence proportion",
      "Aalen-Johansen estimator"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "cure-models-mixture-cure",
    "name": "Cure Models (Mixture and Non-Mixture)",
    "short_definition": "Survival models that explicitly partition the population into a permanently \"cured\" subgroup who will never experience the event and an \"uncured\" subgroup whose event times follow a parametric latency distribution. The mixture cure model expresses overall survival as S(t) = pi + (1 − pi) × S_u(t), where pi is the cure fraction and S_u(t) is the latency survival function for susceptible patients; the non-mixture (promotion-time) alternative provides a competing-causes biological framing. Cure models are the standard method when the Kaplan-Meier curve shows a sustained plateau — the fingerprint of durable responders in immuno-oncology — but identifiability demands mature follow-up well beyond the plateau and background-mortality anchoring to be credible in HTA submissions.",
    "long_description": "**What cure models are and why they matter**\n\nStandard parametric survival models (Weibull, log-normal, log-logistic, and their relatives)\nassume that every patient will eventually experience the event if followed long enough. This\nassumption is wrong in a growing class of clinical settings: checkpoint-inhibitor\nimmuno-oncology, certain haematological malignancies, childhood acute lymphoblastic leukaemia,\nand some chronic-disease applications where a subgroup achieves durable biological remission or\nimmunological control and will never experience the event regardless of follow-up length. When\nthis two-population structure exists, forcing a standard model produces a misspecified\nextrapolation: a survival curve that asymptotes to zero and systematically underestimates\nmean survival in the treated arm. Cure models, also called bounded survival models or long-term\nsurvivor models, directly parameterize this structure by estimating both the size of the cured\nfraction and the event-time distribution among those who remain at risk.\n\n**The cure phenomenon: when the Kaplan-Meier plateau tells a story**\n\nThe canonical diagnostic is a sustained plateau in the Kaplan-Meier (KM) estimator at long\nfollow-up. If, after sufficient events have accumulated, the KM curve stabilizes at a level\nmaterially above zero and remains stable rather than continuing to decline, this is evidence\nthat a non-negligible fraction of the population may have escaped the event permanently. In\nimmuno-oncology this pattern was first observed consistently in melanoma and non-small-cell\nlung cancer trials evaluating PD-1/PD-L1 checkpoint inhibitors: five-year OS rates of 20–30%\nin populations where median OS with prior standard of care was under 12 months. 
The plateau\nis the fingerprint of the durable-responder subgroup.\n\nThree conditions strengthen the case for a genuine cure plateau rather than an artifact: (1)\nthe plateau is stable across multiple consecutive data cuts; (2) surviving patients are followed\nwell beyond the median event time; and (3) background all-cause mortality from age and\ncomorbidity begins to visibly erode the plateau at very long follow-up — consistent with\nrelative-survival cure models discussed below. When any of these conditions fails, treat the\napparent plateau with caution. The existing Kaplan-Meier entry covers the KM estimator\nmechanics; this entry focuses on what to do once a plateau is diagnosed.\n\n**The mixture cure model: S(t) = pi + (1 − pi) × S_u(t)**\n\nThe mixture cure model, introduced by Boag (1949) and formalized by Farewell (1982), decomposes\noverall population survival into two components. Let pi denote the cure fraction — the\nproportion of patients who will never experience the event — and let S_u(t) denote the\nconditional survival function for the uncured (susceptible) subgroup. Overall survival at time\nt is:\n\n  S(t) = pi + (1 − pi) × S_u(t)\n\nAs time approaches infinity, S_u(t) approaches 0 (every uncured patient eventually has the\nevent), so S(t) approaches pi. The long-run survival plateau equals the cure fraction exactly.\n\nThe model has two distinct parameter sets with two distinct interpretations. The cure\nprobability pi — and any covariates predicting it — is typically estimated through a logistic\nregression on baseline patient characteristics: logit(pi_i) = gamma_0 + gamma_1 x_i. The\nlatency survival S_u(t | z_i) is modeled separately with its own covariate vector z_i and a\nchosen parametric family (Weibull is the default; log-normal, log-logistic, or Royston-Parmar\nsplines are alternatives). 
This two-component structure means a treatment can increase pi\n(cure more patients) without changing the speed of uncured progression, change the latency\nwithout altering the cure fraction, or do both simultaneously: a clinical decomposition\nno single-hazard survival model can provide. The log-likelihood contributions for a mixture\ncure model with latency density f_u(t) and latency survival S_u(t) (for exponential latency,\nf_u(t) = lambda × exp(−lambda × t) and S_u(t) = exp(−lambda × t)) are:\n\n  For observed events: log[(1 − pi) × f_u(t)]\n  For censored observations: log[pi + (1 − pi) × S_u(t)]\n\n**The non-mixture (promotion-time) cure model**\n\nAn alternative parameterization replaces the mixture structure with a competing-cause\nframework. The promotion-time model (Yakovlev and Tsodikov 1996; Chen, Ibrahim, and Sinha 1999)\nsupposes each individual has a random number N of competing promotion events (e.g., metastatic\nfoci) drawn from a Poisson distribution; the event occurs at the minimum of N independent\npromotion times. When N = 0, the individual is cured. The survival function is\nS(t) = exp(−theta × F(t)), where F(t) is the CDF of a promotion-time distribution and theta\nis the Poisson mean. The cure fraction is exp(−theta).\n\nThe non-mixture model is biologically motivated for cancer applications where the latent\nnumber of metastatic foci drives the event time; it also admits a proportional-hazards\nreparameterization on the promotion-time scale, making it tractable in standard software. For\npractical HTA applications with limited follow-up, mixture and non-mixture models are often\nempirically indistinguishable; model selection criteria and biological justification should\nguide the choice. In most NICE submissions, the mixture cure model is the more common framing.\n\n**Identifiability and the credibility threshold**\n\nIdentifiability is the central methodological challenge of cure models. 
A cure fraction can\nonly be credibly estimated if observed follow-up substantially exceeds the time when uncured\nevents effectively cease, that is, if S_u(t) has approached zero in the observed window.\nIf many patients remain at risk in the long tail, the model cannot reliably distinguish a true\ncure fraction from a very slow uncured survival function. The practical implication for HTA\nsubmissions is significant: claiming a cure fraction of 20% from a two-year trial with a\nbarely flattened KM tail is not credible; claiming 30% from a five-year trial with a stable\nplateau, zero late events, and background-mortality anchoring is considerably more defensible.\n\nFormal identifiability tools include the Maller–Zhou (1992) sufficient-follow-up test, which\nassesses whether follow-up extends far enough beyond the largest observed event time to rule\nout an artifact plateau. Analysts should also inspect the smoothed hazard function: at long\nfollow-up, once the survivors are predominantly cured patients, the all-cause hazard should\nconverge toward the background (population life-table) hazard rather than toward zero. The cure\nfraction estimate should remain stable across consecutive data cuts; a cure fraction that\nshifts substantially from one interim data cut to the next is a sign of insufficient follow-up.\n\n**Relative-survival cure models and background-mortality anchoring**\n\nStandard mixture cure models estimated from all-cause survival do not account for background\nmortality. At long follow-up, even \"cured\" patients begin to die from unrelated causes,\ncausing the observed KM plateau to gradually erode. Relative-survival cure models address this\nby modeling excess mortality relative to expected population survival, allowing the cure\nfraction to be interpreted as the proportion of patients who experience no excess\ndisease-specific mortality: their mortality profile matches the general population's. 
This\nis the approach recommended when follow-up exceeds five years or when older, comorbid\ncohorts are analyzed where background mortality is non-trivial. The related concept on\nrelative net survival covers the relative-survival framework and population mortality table\nlinkage in detail.\n\n**HTA stakes: the cure fraction dominates lifetime QALY extrapolations**\n\nIn health technology assessment submissions where the comparator shows a declining KM curve\nand the treatment arm shows a plateau, the cure fraction is frequently the single most\ninfluential assumption in the economic model. NICE Technical Support Document 21 (2020)\non flexible methods for survival analysis specifically addresses cure models: it\nrecommends using them only when genuine clinical evidence for long-term survivors exists,\nrequires sensitivity analyses across alternative cure fractions, and flags the cure assumption\nas a focus area for ERG/EAG technical scrutiny. The stakes are high: the difference between\na 20% and a 30% assumed cure fraction in a lifetime cost-effectiveness model can shift the\nICER by tens of thousands of pounds per QALY, enough to change the decision from approved\nto rejected. Analysts must pre-specify the cure-model scenario in the analysis plan, report\nit alongside the full standard parametric candidate model set, and provide both clinical and\nstatistical justification for any assumed cure fraction.\n\n**Covariates on cure probability versus latency separately**\n\nOne of the most clinically informative features of the mixture cure model is the ability to\nspecify separate covariates for the cure fraction and the latency distribution. This means\nan analyst can ask: does biomarker B predict who is cured (enters the logistic regression on\npi), and does dose intensity predict how quickly uncured patients progress (enters the latency\nhazard)? 
Standard single-hazard survival regression forces both mechanisms into a single\ncoefficient set and cannot answer either question. Implementations in R (flexsurvcure, smcure)\nand the custom PROC NLMIXED approach in SAS both support separate covariate vectors.\n\n**RWE-specific caution: apparent plateaus as data-cutoff artifacts**\n\nIn real-world evidence settings — claims databases, EHR cohorts, and registry studies —\nobserved plateaus in rwPFS or rwOS frequently reflect data structure rather than biology.\nThree artifacts create false plateaus: (1) administrative censoring pile-up, where a large\nfraction of patients are censored at a common database end date, producing an abrupt flat\nsegment; (2) outcome ascertainment gaps, where deaths are under-captured in claims data so\nthat deceased patients appear as long-term survivors; and (3) short follow-up with sparse\nlate events, where the KM estimator lacks events to show a continuing decline. Before fitting\na cure model to RWE data, analysts should: (a) plot the censoring distribution separately from\nthe event distribution to detect pile-up; (b) restrict to patients enrolled early enough to\nhave genuinely long follow-up; (c) verify death ascertainment against a linked mortality\nsource (NDI, Social Security Death Index, or vital statistics) where available; and (d)\nconfirm the plateau does not coincide with the administrative data-cut date.\n\n**Pros, cons, and trade-offs**\n\nPros of mixture cure models: directly parameterize the two-population structure present in\nimmuno-oncology durable response; produce a long-run survival plateau that standard models\ncannot represent; decompose treatment effects into cure-fraction gain versus latency improvement;\ngenerate clinically interpretable quantities for HTA narrative; align with NICE TSD 21\nguidance for innovative immuno-oncology submissions; latency component is flexible (Weibull,\nspline, log-normal).\n\nCons: identifiability requires mature follow-up 
and a genuinely stable plateau — applying\ncure models to early or immature data inflates the estimated cure fraction dramatically and\nproduces unreliable extrapolations; the two-component structure doubles the number of\nparameters and inflates uncertainty in probabilistic sensitivity analysis; cure-fraction\nestimates are sensitive to the choice of latency distribution in the sparse tail; regulatory\nand HTA reviewers subject cure assumptions to intensive scrutiny.\n\nKey trade-offs versus standard parametric models: standard Weibull or log-normal models are\nsimpler, identifiable under any follow-up length, and easier to use in PSA — but they cannot\nrepresent a bounded long-run survival and will systematically underestimate mean survival in\npopulations with genuine durable responders. The cure model gains biological fidelity at the\ncost of identifiability requirements and two-component complexity. Report both sets of models;\ndo not present the cure model as the single primary analysis unless identifiability is well\nestablished.\n\n**When to use**\n\nUse a mixture or non-mixture cure model when: (1) the Kaplan-Meier curve shows a stable,\nsustained plateau materially above zero, confirmed stable across at least two consecutive data\ncuts; (2) clinical or biological evidence supports a \"cured\" or durable-response subgroup —\ncheckpoint-inhibitor immuno-oncology, BCR-ABL-targeted therapy in CML, early-stage childhood\nleukaemia; (3) follow-up is sufficiently mature that the plateau has been stable for at least\none to two years; (4) the HTA submission covers a lifetime horizon where the long-run survival\nplateau dominates the QALY calculation; and (5) the analysis plan pre-specifies the cure-model\nscenario. 
Cure models are particularly valuable in submissions to NICE, CDA-AMC, PBAC, and\nICER for innovative cancer therapies with immuno-oncology mechanisms.\n\n**When NOT to use — and when cure models are actively misleading**\n\nDo not use cure models when: (1) follow-up is immature (less than three to five years for most\noncology indications, or the KM curve has not clearly plateaued); (2) the apparent plateau\ncoincides with or is explained by data-cutoff censoring pile-up; (3) the Maller-Zhou\nidentifiability test indicates insufficient follow-up to distinguish a true cure fraction from\na very slow uncured tail; (4) the indication has no biological basis for permanent event\nprevention — most progressive neurodegenerative diseases, most cardiovascular endpoints in\nuncontrolled disease; (5) the cure fraction estimate changes substantially between consecutive\ndata cuts, indicating the model is tracking a moving artifact rather than a stable biological\nphenomenon. A cure model with wide confidence intervals on pi, or whose pi estimate requires\na logistic regression with only a handful of long-term survivors, should not be the primary\nanalysis model — report it as a sensitivity scenario alongside the standard parametric candidate\nset. Actively misleading use: claiming a cure fraction in RWE data whose apparent plateau\ncoincides exactly with the administrative database end date is a common error that can grossly\noverestimate long-term efficacy.\n\n**Interpreting the output**\n\nIn the worked example, a mixture cure model fit to a synthetic 50-patient immunotherapy trial\nestimates pi = 0.30 (30% cure fraction) and an exponential latency with S_u(24) = 0.50 (half\nof all uncured patients remain event-free at 24 months).\n\n(1) Formal interpretation. The overall survival at 24 months is S(24) = 0.30 + 0.70 * 0.50\n= 0.30 + 0.35 = 0.65. 
Sixty-five percent of the population is expected to be event-free at\n24 months: 30 percentage points because they are in the permanently cured subgroup, and 35\npercentage points because they are in the uncured subgroup (70% of total) but have not yet\nexperienced the event (S_u(24) = 0.50 means 50% of uncured individuals survive to 24 months).\nAs time approaches infinity, S(t) approaches pi = 0.30 — the long-run plateau. The cure\nfraction estimate and its 95% CI drive the long-run QALY extrapolation; the latency parameter\ndrives near-term QALY accumulation. The two components are separately interpreted and\nseparately interrogated by HTA technical reviewers.\n\n(2) Practical interpretation. Of 100 patients starting treatment, the model estimates 30 will\nnever progress or die from this cancer regardless of follow-up. The remaining 70 remain at\nrisk, and at 24 months approximately half of them (35 patients) are still event-free, giving\n65 event-free patients in total. For a lifetime cost-effectiveness model, the 30% cure fraction\nmeans the treated arm shows a non-zero long-run survival plateau — a qualitatively different\ncurve shape from any standard parametric model asymptoting to zero. If the cure fraction is\nraised from 30% to 40% in a deterministic sensitivity analysis, the entire tail of the survival\ncurve lifts by 10 percentage points, typically adding a material number of life-years gained\nand potentially shifting the ICER by tens of thousands of pounds per QALY.",
    "primary_category": "Inferential_Statistics",
    "tags": [
      "survival",
      "cure-models",
      "immuno-oncology",
      "KM-plateau",
      "long-term-survivors",
      "HTA",
      "time-to-event",
      "extrapolation",
      "mixture-model",
      "NICE-TSD-21",
      "cure-fraction",
      "latency-distribution",
      "oncology"
    ],
    "applies_to_study_types": [
      "cohort_retrospective",
      "cohort_prospective",
      "rct",
      "registry_study",
      "claims_analysis",
      "ehr_study",
      "active_comparator_new_user"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "primary",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1158/1078-0432.CCR-11-2859",
        "url": "https://doi.org/10.1158/1078-0432.CCR-11-2859",
        "citation_text": "Othus M, Barlogie B, LeBlanc ML, Crowley JJ. Cure models as a useful statistical tool for analyzing survival. Clinical Cancer Research. 2012;18(14):3731-3736.",
        "year": 2012,
        "authors_short": "Othus et al.",
        "notes": "Accessible applied introduction to cure models for oncology researchers; covers model formulation, identifiability, and the distinction between the mixture and non-mixture cure parameterizations. The canonical entry-point citation for this topic in the cancer statistics literature."
      },
      {
        "role": "explain",
        "doi": "10.1177/1536867X0700700304",
        "url": "https://doi.org/10.1177/1536867X0700700304",
        "citation_text": "Lambert PC. Modeling of the cure fraction in survival studies. The Stata Journal. 2007;7(3):351-375.",
        "year": 2007,
        "authors_short": "Lambert",
        "notes": "Comprehensive treatment of both mixture and non-mixture cure models with worked Stata examples; covers the log-likelihood derivation, covariate modeling on both cure fraction and latency, identifiability diagnostics, and relative-survival cure extensions. The standard implementation reference for analysts new to the topic."
      },
      {
        "role": "demonstrate",
        "doi": "10.32614/cran.package.flexsurvcure",
        "url": "https://doi.org/10.32614/cran.package.flexsurvcure",
        "citation_text": "Amdahl J. flexsurvcure: Flexible Parametric Cure Models. CRAN. 2017.",
        "year": 2017,
        "authors_short": "Amdahl",
        "notes": "The primary R package for mixture and non-mixture cure models with flexible parametric latency distributions (Weibull, log-normal, log-logistic, Royston-Parmar splines). Integrates with the flexsurv ecosystem for plotting, prediction, and AIC-based model comparison. The practical implementation tool for most HTA applications."
      }
    ],
    "plain_language_summary": "Cure models are survival analysis tools built for situations where some patients will truly never experience the event — they are permanently disease-free — while others remain at risk and will eventually progress or die. The model splits the population into these two groups, estimates the size of the \"cured\" fraction, and describes how quickly the at-risk group eventually has the event. The clearest signal that a cure model may be needed is when the survival curve (Kaplan-Meier plot) flattens and stays flat above zero for a long period, which is exactly what happens in immuno-oncology trials where a subset of patients achieve a durable treatment response. The critical catch is that if follow-up is too short, the flat region may simply be a data-collection cutoff rather than a real biological plateau, and the model will mistakenly classify patients as \"cured\" who just have not yet had the event.",
    "key_terms": [
      {
        "term": "cure fraction (pi)",
        "definition": "The proportion of patients who will never experience the event of interest, no matter how long they are followed — typically estimated through a logistic regression model on baseline patient characteristics."
      },
      {
        "term": "latency distribution",
        "definition": "The parametric survival distribution (such as Weibull or exponential) that describes how quickly the at-risk, not-cured subgroup eventually experiences the event."
      },
      {
        "term": "Kaplan-Meier plateau",
        "definition": "A flat section at the right tail of the survival curve that persists across a long follow-up period, suggesting that some patients will never have the event and the proportion surviving has stopped decreasing."
      },
      {
        "term": "identifiability",
        "definition": "Whether the observed data contain enough information to reliably separate the cure fraction from a very slowly declining uncured tail; cure models require long, mature follow-up to be identifiable."
      },
      {
        "term": "censored observation",
        "definition": "A patient whose event was not observed because the study ended or the patient left before the event occurred; in cure models, long-term censored patients provide the evidence for the cure fraction."
      }
    ],
    "worked_example": {
      "scenario": "A biostatistician is analyzing overall survival from a 50-patient single-arm trial of a PD-1 checkpoint inhibitor in metastatic melanoma. The Kaplan-Meier curve shows a stable plateau at approximately 30% beginning around month 30, with no events after that point across two consecutive data cuts. A mixture cure model is fitted using maximum likelihood: the model estimates a cure fraction pi = 0.30 and an exponential latency distribution for uncured patients with rate chosen so that the median uncured survival is 24 months, yielding S_u(24) = 0.50. The task is to compute the overall survival probability at 24 months and confirm the long-run plateau.",
      "dataset": {
        "caption": "Representative survival data for 10 of the 50 trial patients (synthetic). event_observed = 1 means death was recorded; event_observed = 0 means the patient was alive at last contact (last data cut). Patients still event-free beyond 30 months are the long-term survivors whose sustained follow-up provides the evidence for the cure fraction.",
        "columns": [
          "patient_id",
          "follow_up_months",
          "event_observed"
        ],
        "rows": [
          [
            "PT01",
            4,
            1
          ],
          [
            "PT02",
            9,
            1
          ],
          [
            "PT03",
            14,
            1
          ],
          [
            "PT04",
            19,
            1
          ],
          [
            "PT05",
            23,
            1
          ],
          [
            "PT06",
            26,
            0
          ],
          [
            "PT07",
            31,
            0
          ],
          [
            "PT08",
            36,
            0
          ],
          [
            "PT09",
            42,
            0
          ],
          [
            "PT10",
            48,
            0
          ]
        ]
      },
      "steps": [
        "The Kaplan-Meier curve for this trial shows a plateau at approximately 30% from month 30 onward with no further events across two consecutive data cuts, consistent with a subgroup of durable responders. This motivates a mixture cure model.",
        "The mixture cure model decomposes overall survival into two components: S(t) = pi + (1 - pi) * S_u(t), where pi is the cure fraction and S_u(t) is the survival function for uncured patients. Both components are estimated by maximum likelihood.",
        "Model estimation yields pi = 0.30. The proportion who are not cured is (1 - 0.30) = 0.70.",
        "The exponential latency distribution is chosen with rate lambda = 0.02888 per month. At t = 24 months the model estimates S_u(24) = 0.50, meaning half of all uncured patients remain event-free at 24 months. (For reference, the median uncured survival for an exponential with this rate is approximately 24 months, since the half-life of the exponential distribution equals ln(2) divided by the rate.) This value comes directly from the model fit.",
        "Compute the uncured-but-event-free component at 24 months: 0.70 * 0.50 = 0.35.",
        "Add the permanently cured component: 0.30 + 0.35 = 0.65.",
        "As follow-up grows without bound, S_u(t) approaches 0, and S(t) approaches pi = 0.30. The long-run survival plateau is the cure fraction itself."
      ],
      "result": "S(24) = pi + (1 - pi) * S_u(24) = 0.30 + 0.70 * 0.50 = 0.30 + 0.35 = 0.65. At 24 months, 65% of patients are expected to be event-free: 30% because they are permanently cured and 35% because they are in the uncured group but have not yet experienced the event. The long-run plateau is pi = 0.30."
    },
    "prerequisites": [
      "kaplan-meier-estimator",
      "weibull-distribution",
      "censoring-mechanisms-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Mixture cure with standard parametric latency (Weibull or log-normal)",
        "description": "The most common HTA implementation: exponential, Weibull, or log-normal latency estimated by maximum likelihood. The Weibull is the default starting point because it nests the exponential, offers both PH and AFT interpretations, and is easy to use in probabilistic sensitivity analysis (multivariate-normal draws of log-scale parameters). Log-normal is preferred when the uncured hazard is arc-shaped (rises then falls), which is common in some targeted therapy settings. Model selection between latency families is based on AIC/BIC over the uncured subpopulation only — not the full dataset.",
        "edge_cases": [
          "At short follow-up (< 3 years in most oncology indications), the Weibull latency tail is dominated by the prior and the cure fraction estimate can be poorly identified even when the KM appears flat. Report the 95% CI on pi alongside the point estimate.",
          "If the chosen latency distribution fits poorly (log-log plot non-linear, residuals systematic), consider log-logistic or the generalized gamma umbrella family before adding cure-model complexity."
        ],
        "data_source_notes": "Clinical trial data with planned long follow-up (5+ years) is the ideal setting. For registry data, verify that the event (death or progression) is fully adjudicated and not subject to administrative censoring pile-up before selecting a latency family."
      },
      {
        "name": "Mixture cure with flexible (Royston-Parmar spline) latency",
        "description": "When the uncured hazard is complex — non-monotone, with an early peak then plateau, as seen in some immunotherapy data — flexible parametric splines (Royston-Parmar restricted cubic splines) can be used as the latency distribution within the mixture cure framework. This is supported in flexsurvcure via dist = \"rcs\" and provides a more data-adaptive fit at the cost of additional knot-placement decisions and less constrained tail behavior. The flexible latency is more responsive to within-trial follow-up but can extrapolate implausibly if knots are near the data boundary.",
        "edge_cases": [
          "Spline knot placement in the latency distribution is arbitrary and can materially affect the tail; always run sensitivity analyses with different knot configurations.",
          "Combining a flexible latency with a cure fraction creates a high-dimensional parameter space; PSA requires careful Cholesky decomposition of the covariance matrix from the Hessian."
        ],
        "data_source_notes": "Best suited to mature clinical trial data with many late events. Avoid in sparse registry or claims data where the latency tail has few events to anchor the spline."
      },
      {
        "name": "Non-mixture (promotion-time) cure model",
        "description": "An alternative biological framing: S(t) = exp(−theta × F(t)), where theta is the Poisson mean of competing promotion events and F(t) is the promotion-time CDF. The cure fraction is exp(−theta). Implemented in R via flexsurvcure with mixture = FALSE or via the CRAN package \"CureRate.\" The promotion-time model admits a proportional-hazards reparameterization convenient for covariate modeling, and its biological interpretation (number of metastatic foci) is compelling for solid-tumor oncology. However, it is mathematically equivalent to many mixture cure models in the observable data, so model selection between the two is rarely data-driven — it is driven by biological prior knowledge.",
        "edge_cases": [
          "The non-mixture model's cure fraction exp(−theta) is constrained to be less than the mixture model's pi in some configurations; this can cause implausible constraints if theta is large.",
          "With covariates on theta via a log-linear link (log(theta_i) = alpha + beta x_i), the model becomes a frailty-type promotion-hazard model, which is more complex than the standard logistic cure-fraction model."
        ],
        "data_source_notes": "Primarily used in clinical trial and registry settings with biological rationale for the promotion-event framing (solid tumors with micrometastasis). Rarely used in claims or EHR data where the biological mechanism is less precisely known."
      },
      {
        "name": "Relative-survival cure model",
        "description": "Extends the mixture cure model to model excess mortality relative to expected background (population life-table) mortality. The cure fraction is estimated as the proportion of patients whose mortality matches the general population — they have zero excess hazard. This is the most defensible approach for indications with long follow-up (5+ years) where background mortality from age and comorbidity is non-trivial, particularly in older patient cohorts. The relative-survival cure model requires linkage to national life tables (e.g., ONS England and Wales, CDC WONDER in the US) to subtract the expected hazard. See the related concept on relative net survival for the full relative-survival framework.",
        "edge_cases": [
          "Life-table linkage requires age, sex, and calendar-year matching; inaccurate life-table linkage (e.g., wrong reference population) directly biases the excess hazard and the cure fraction.",
          "In some NICE submissions, the relative-survival cure model is favored by ERG/EAG because it anchors the long-run tail to general-population mortality rather than extrapolating the disease hazard freely."
        ],
        "data_source_notes": "Most naturally applied to linked registry or trial data with long-term follow-up and national mortality linkage. In claims data, linkage to NDI or SSDI provides the mortality anchor needed for the excess-hazard estimation."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "weibull-distribution",
        "pros_of_this": "The mixture cure model can represent a non-zero long-run survival plateau — a qualitatively different extrapolation shape from any standard Weibull or log-normal curve — and separately estimate cure-fraction effects versus latency effects of treatment.",
        "cons_of_this": "The Weibull is identifiable under any follow-up length and requires far fewer parameters; it is the appropriate primary model when follow-up is immature, identifiability is uncertain, or no clinical rationale for a cure fraction exists. Cure models demand mature data and a sustained KM plateau; the Weibull does not.",
        "when_to_prefer": "Prefer the cure model when the KM plateau is stable across consecutive data cuts and clinical evidence for durable responders exists. Prefer the Weibull for the primary analysis in early data cuts or when identifiability is in doubt; include the cure model as a pre-specified sensitivity scenario in either case."
      },
      {
        "compared_to": "survival-extrapolation-hta-rwe",
        "pros_of_this": "Cure models are one component of the survival extrapolation toolkit recommended in NICE TSD 21; when a cure fraction is clinically plausible, they produce a more biologically accurate long-run survival projection and a higher mean survival estimate than any standard parametric model that asymptotes to zero.",
        "cons_of_this": "The broader survival extrapolation workflow (TSD 14 six-model candidate set plus splines) provides the comparative context that cure models must sit within — presenting only a cure model without the standard candidate set is methodologically incomplete and likely to be flagged by ERG/EAG. Standard parametric models remain simpler, less assumption-dependent, and easier to subject to probabilistic sensitivity analysis.",
        "when_to_prefer": "Cure models are one scenario within the full TSD 14/TSD 21 extrapolation workflow, not a replacement for it. Always report the standard six-model candidate set alongside the cure model; use criteria from both in-sample fit and clinical plausibility to select the primary model."
      },
      {
        "compared_to": "kaplan-meier-estimator",
        "pros_of_this": "The mixture cure model extrapolates the survival curve to a lifetime horizon and provides a parametric estimate of the long-run cure fraction, both of which are inaccessible from a non-parametric KM estimator. It also quantifies the cure fraction and latency distribution separately, enabling covariate modeling.",
        "cons_of_this": "The KM estimator is assumption-free within the observed follow-up window and serves as the primary diagnostic — the empirical tool to assess whether a plateau is present, stable, and real before any parametric cure model is fitted. Cure model extrapolations should always be anchored to the KM curve over the observed period; a cure model that diverges visually from the KM within the data window is misspecified.",
        "when_to_prefer": "Use the KM estimator as the first step: inspect the curve visually, assess whether the plateau is genuine and stable, and verify that the cure model's predicted survival overlays the KM within the observed follow-up before trusting the tail extrapolation."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "rwPFS and rwOS plateaus in claims data frequently reflect administrative censoring pile-up (many patients censored at the same database end date) rather than biology. Before fitting a cure model, plot the censoring distribution by enrollment date and verify that late-censored patients are distributed throughout the enrollment window rather than concentrated at a single cutoff date. Death ascertainment in claims is incomplete — link to NDI, SSDI, or state vital statistics before interpreting a claims OS plateau as evidence for a cure fraction. Minimum recommended follow-up is 3–5 years; do not fit cure models to claims cohorts with less than 2 years of median follow-up.",
      "ehr": "EHR death records are subject to the same under-capture issues as claims; patients who stop attending the health system may appear as long-term survivors in EHR data simply because their subsequent events are unrecorded. Linkage to a mortality database is strongly recommended before interpreting an EHR OS plateau as evidence for a cure fraction. For rwPFS from EHR, assess whether scan/imaging protocols were consistent throughout follow-up — gaps in imaging can create apparent progression-free plateaus that are observational artifacts.",
      "registry": "Disease-specific registries with adjudicated endpoints and active long-term follow-up are the best RWE setting for cure model estimation. Mandatory reporting and prospective follow-up protocols reduce the under-capture and censoring-pile-up issues that affect claims and EHR data. Population-based cancer registries (e.g., SEER, NPCR, cancer registry networks linked to national mortality data) provide the event ascertainment and follow-up maturity needed for relative-survival cure model estimation.",
      "primary": "Clinical trial data with long planned follow-up is the canonical setting for cure model estimation. The mixture cure model is most credible when: (a) the trial protocol pre-specified a long follow-up (5+ years); (b) the KM plateau is stable across at least two consecutive pre-planned data cuts; and (c) the proportion of patients with very long follow-up is sufficient to anchor the cure fraction. When submitting cure model estimates to NICE or other HTA bodies, report the CI on pi, the plateau stability across data cuts, and the Maller-Zhou identifiability test result.",
      "linked": "Linked trial-registry or claims-mortality data is the ideal setting for relative-survival cure models: the trial or registry provides time-to-event data and the mortality linkage provides the background hazard for anchoring. Use age-sex-calendar-year-matched population life tables (ONS, CDC WONDER, Human Mortality Database) for the expected hazard. Report the cure fraction from both the standard mixture model (all-cause) and the relative-survival model (excess hazard) as a sensitivity analysis pair."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\nfrom scipy.optimize import minimize\n\ndef mixture_cure_loglik(params, t, event):\n    \"\"\"Negative log-likelihood for mixture cure model with exponential latency.\n\n    params[0] = logit(pi)  -- cure fraction on log-odds scale (unconstrained)\n    params[1] = log(lam)   -- exponential rate on log scale (ensures lam > 0)\n    t      : array of observed times (event or censoring)\n    event  : array of 0/1 indicators (1 = event observed, 0 = censored)\n    \"\"\"\n    logit_pi, log_lam = params\n    pi  = 1.0 / (1.0 + np.exp(-logit_pi))   # cure fraction in (0, 1)\n    lam = np.exp(log_lam)                     # positive exponential rate\n    S_u = np.exp(-lam * t)                    # S_u(t) = exp(-lambda * t)\n    f_u = lam * S_u                           # density: f_u(t) = lambda * exp(-lambda * t)\n    # Log-contributions per observation\n    ll_event  = np.log((1.0 - pi) * f_u + 1e-15)  # observed event\n    ll_censor = np.log(pi + (1.0 - pi) * S_u + 1e-15)  # censored\n    ll = np.sum(event * ll_event + (1.0 - event) * ll_censor)\n    return -ll  # scipy.minimize minimises, so return negative LL\n\n# ── Synthetic data: 35 uncured (events) + 15 cured (censored at 60 months) ──\nnp.random.seed(42)\nn_cured, n_uncured = 15, 35\nt_uncured = np.random.exponential(scale=24, size=n_uncured)  # median ~17 months\nt_cured   = np.full(n_cured, 60.0)                           # censored at 60 months\nt_all     = np.concatenate([t_uncured, t_cured])\nevent_all = np.concatenate([np.ones(n_uncured), np.zeros(n_cured)])\n\n# ── Fit via Nelder-Mead ──\ninit   = [0.0, np.log(1.0 / 24.0)]     # pi = 0.5, rate ≈ 1/24\nresult = minimize(mixture_cure_loglik, init,\n                  args=(t_all, event_all), method=\"Nelder-Mead\",\n                  options={\"xatol\": 1e-8, \"fatol\": 1e-8, \"maxiter\": 10000})\n\nlogit_pi_hat, log_lam_hat = result.x\npi_hat  = 1.0 / (1.0 + np.exp(-logit_pi_hat))\nlam_hat = 
np.exp(log_lam_hat)\n\nprint(f\"Converged: {result.success}  |  Iterations: {result.nit}\")\nprint(f\"Estimated cure fraction pi: {pi_hat:.3f}\")\nprint(f\"Estimated exponential rate lambda: {lam_hat:.4f}\")\nprint(f\"Estimated median uncured survival: {np.log(2)/lam_hat:.1f} months\")\n\n# ── Compute S(t) at 24 months using the worked-example formula ──\nt_eval = 24.0\nS_u_24 = np.exp(-lam_hat * t_eval)\nS_24   = pi_hat + (1.0 - pi_hat) * S_u_24\nprint(f\"\\nS_u(24 months) = {S_u_24:.3f}\")\nprint(f\"S(24 months)   = {pi_hat:.3f} + (1 - {pi_hat:.3f}) * {S_u_24:.3f} = {S_24:.3f}\")\nprint(f\"Long-run plateau (t -> infinity): S -> pi = {pi_hat:.3f}\")\n\n# ── Survival curve at selected time points ──\nt_grid = np.array([6, 12, 18, 24, 36, 48, 60])\nfor t in t_grid:\n    S_t = pi_hat + (1.0 - pi_hat) * np.exp(-lam_hat * t)\n    print(f\"  S({t:2d}) = {S_t:.3f}\")",
        "description": "lifelines does not implement cure models natively. This example sketches a mixture cure\nmodel with exponential latency fitted by maximum likelihood using scipy.optimize. The log-\nlikelihood is derived from first principles: observed events contribute log[(1-pi)*f_u(t)]\nand censored observations contribute log[pi + (1-pi)*S_u(t)]. Parameters are estimated on\nthe log-odds (cure fraction) and log (rate) scales to enforce constraints. This is the\nconceptual scaffold for understanding the likelihood structure; for production use, consider\nthe R flexsurvcure package or, in Python, the cure-survival or curesurv packages (verify\nmaintenance status before adopting).",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "# install.packages(c(\"flexsurvcure\", \"smcure\", \"survival\"))\nlibrary(flexsurvcure)\nlibrary(survival)\n\n# ── Synthetic dataset mirroring the worked example ──\nset.seed(42)\nn_cured   <- 15L; n_uncured <- 35L\nt_data    <- c(rexp(n_uncured, rate = 1/24),   # uncured: exponential, mean = 24 months\n               rep(60, n_cured))                # cured:   censored at 60 months\nevent_data <- c(rep(1L, n_uncured), rep(0L, n_cured))\ndf <- data.frame(time = t_data, status = event_data)\n\n# ── Mixture cure model: exponential latency, no covariates ──\n# mixture = TRUE  → S(t) = pi + (1-pi)*S_u(t)  [mixture parameterisation]\n# dist    = \"exp\" → exponential latency for uncured patients\nfit_mcm <- flexsurvcure(Surv(time, status) ~ 1, data = df,\n                         dist = \"exp\", mixture = TRUE)\nprint(fit_mcm)\n\n# ── Extract cure fraction and latency rate ──\npi_hat  <- plogis(coef(fit_mcm)[\"theta\"])    # theta is logit(pi) in flexsurvcure\nlam_hat <- exp(coef(fit_mcm)[\"rate\"])         # log-scale rate parameter\n\ncat(sprintf(\"Estimated cure fraction pi : %.3f\\n\", pi_hat))\ncat(sprintf(\"Latency rate lambda         : %.4f\\n\", lam_hat))\ncat(sprintf(\"Median uncured survival     : %.1f months\\n\", log(2) / lam_hat))\n\n# ── Compute S(24) using the worked-example formula ──\nS_u_24 <- exp(-lam_hat * 24)\nS_24   <- pi_hat + (1 - pi_hat) * S_u_24\ncat(sprintf(\"\\nS_u(24 months) = %.3f\\n\",  S_u_24))\ncat(sprintf(\"S(24 months)   = %.3f + %.3f * %.3f = %.3f\\n\",\n            pi_hat, (1 - pi_hat), S_u_24, S_24))\ncat(sprintf(\"Long-run plateau S(inf) = pi = %.3f\\n\", pi_hat))\n\n# ── Survival predictions at a grid of time points ──\nt_seq <- seq(0, 72, by = 6)\npred  <- summary(fit_mcm, t = t_seq, type = \"survival\")[[1]]\nprint(pred[, c(\"time\", \"est\", \"lcl\", \"ucl\")])\n\n# ── Non-mixture (promotion-time) cure model comparison ──\nfit_nmcm <- flexsurvcure(Surv(time, status) ~ 1, data = df,\n                          
dist = \"exp\", mixture = FALSE)\ncat(\"\\nNon-mixture model AIC:\", AIC(fit_nmcm),\n    \"  Mixture model AIC:\", AIC(fit_mcm), \"\\n\")\n\n# ── smcure: separate covariate formulas for cure fraction and latency ──\n# library(smcure)\n# Suppose we have a binary treatment indicator 'trt' and a continuous biomarker 'bm'\n# smcure(Surv(time, status) ~ trt,            # covariates on latency (PH or AFT)\n#        cureform = ~ trt + bm,               # separate logistic on pi\n#        data = df, model = \"ph\", nboot = 200)\n# The cureform argument controls which variables enter the logistic cure-fraction model;\n# the main formula controls the latency hazard. Both can overlap with different subsets.",
        "description": "Mixture cure model using the flexsurvcure package (CRAN), which extends flexsurv to support\nboth mixture (mixture = TRUE) and non-mixture (mixture = FALSE) cure parameterizations with\nany of flexsurv's distributional families as the latency. The smcure package is shown as a\nbrief alternative: it supports separate covariate formulas for the cure-fraction logistic\nregression and the latency proportional-hazards model. Both packages produce parameter\nestimates, confidence intervals, and survival curve predictions directly usable in HTA\neconomic models.",
        "dependencies": [
          "flexsurvcure",
          "survival"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* ── Create synthetic survival dataset ── */\ndata work.trial;\n  call streaminit(42);\n  do i = 1 to 50;\n    if i <= 35 then do;   /* Uncured: Exponential(mean=24) */\n      time   = rand('exponential') * 24;\n      status = 1;\n    end;\n    else do;              /* Cured: censored at 60 months  */\n      time   = 60;\n      status = 0;\n    end;\n    drop i;\n    output;\n  end;\nrun;\n\n/* ── Mixture cure model via PROC NLMIXED ──────────────────────────────\n   Log-likelihood per observation:\n     Event   (status=1): log[(1-pi) * lam * exp(-lam*t)]\n     Censored(status=0): log[pi + (1-pi) * exp(-lam*t)]\n   Parameters estimated on unconstrained scales:\n     logit_pi: logit of cure fraction pi\n     log_lam : log of exponential rate lambda                           */\nproc nlmixed data=work.trial;\n  /* Constrained-scale quantities */\n  pi  = exp(logit_pi)  / (1 + exp(logit_pi));    /* cure fraction in (0,1)   */\n  lam = exp(log_lam);                              /* exponential rate > 0     */\n  S_u = exp(-lam * time);                          /* S_u(t) = exp(-lambda*t)  */\n  f_u = lam * S_u;                                 /* f_u(t) = lambda*exp(...)  
*/\n  S   = pi + (1 - pi) * S_u;                      /* overall mixture survival  */\n\n  /* Log-likelihood contribution (GENERAL distribution in NLMIXED) */\n  if status = 1 then ll = log((1 - pi) * f_u + 1e-12);   /* event    */\n  else               ll = log(S + 1e-12);                  /* censored */\n  model time ~ general(ll);\n\n  /* Starting values (logit(0.3)=-0.847, log(1/24)=-3.178) */\n  parms logit_pi = -0.847  log_lam = -3.178;\n\n  /* Back-transformed estimates of clinical interest */\n  estimate 'Cure fraction pi'\n    exp(logit_pi) / (1 + exp(logit_pi));\n  estimate 'Latency rate lambda'\n    exp(log_lam);\n  estimate 'Median uncured survival (months)'\n    log(2) / exp(log_lam);\n  estimate 'S_u(24 months)'\n    exp(-exp(log_lam) * 24);\n  estimate 'Overall S(24 months)  [= pi + (1-pi)*S_u(24)]'\n    exp(logit_pi) / (1 + exp(logit_pi))\n    + (1 - exp(logit_pi) / (1 + exp(logit_pi))) * exp(-exp(log_lam) * 24);\nrun;\n\n/* ── NOTES ─────────────────────────────────────────────────────────────\n   To add covariates on the cure fraction, replace logit_pi with a linear\n   predictor: logit_pi = gamma0 + gamma1 * trt + ...; add the corresponding\n   PARMS entries. Similarly, covariates on the latency rate: replace log_lam\n   with a linear predictor.\n   PROC NLMIXED uses quasi-Newton optimisation with Hessian-based standard\n   errors; for complex models or sparse data, add TECHNIQUE=NRRIDG and\n   verify convergence with the HESSIAN option.\n   For the non-mixture (promotion-time) model, replace S with\n   S = exp(-theta * (1 - S_u)) and the event density with the corresponding\n   derivative; see Lambert (2007) Stata Journal for the non-mixture log-lik. */",
        "description": "Mixture cure model in SAS using PROC NLMIXED, which maximizes a user-specified log-likelihood.\nThe cure fraction is modeled on the logit scale (logit_pi parameter) and the exponential rate\non the log scale (log_lam parameter) to enforce constraints. The ESTIMATE statements compute\nback-transformed quantities (pi, lambda, median uncured survival, S_u at 24 months, and\noverall S at 24 months) on the original scale. For covariate models on the cure fraction,\nreplace logit_pi with a linear predictor. This approach generalises to any latency distribution\nby replacing the S_u and f_u expressions.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  subgraph Population[\"All patients at time zero\"]\n    direction TB\n    C[\"Cured subgroup<br/>proportion = pi<br/>Never experience the event<br/>S_cured(t) = 1 for all t\"]\n    U[\"Uncured subgroup<br/>proportion = 1 - pi<br/>Will eventually experience event<br/>S_u(t) → 0 as t → ∞\"]\n  end\n  Population --> S[\"Overall survival<br/>S(t) = pi + (1 − pi) × S_u(t)<br/>Long-run plateau = pi\"]\n  S --> HTA[\"HTA extrapolation<br/>Lifetime horizon QALY<br/>dominated by pi\"]\n  S --> ID[\"Identifiability check<br/>KM plateau stable across<br/>data cuts? Maller-Zhou test?\"]",
        "caption": "The mixture cure model decomposes the population into two latent subgroups: a cured fraction pi whose survival is permanently 1, and an uncured fraction (1 − pi) whose survival follows a parametric latency distribution S_u(t) that approaches 0 at long follow-up. The overall survival S(t) asymptotes to pi — the cure fraction — which dominates lifetime QALY calculations in HTA.",
        "alt_text": "Flowchart showing the mixture cure model structure: a population splits into cured and uncured subgroups, whose weighted combination produces the overall survival function S(t), which feeds into HTA extrapolation and identifiability checks.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "requires",
        "target_slug": "weibull-distribution",
        "notes": "The Weibull distribution is the default latency family for uncured patients in mixture cure models; understanding shape and scale parameterization is prerequisite to interpreting the latency component of any cure model fit."
      },
      {
        "relation_type": "requires",
        "target_slug": "censoring-mechanisms-rwe",
        "notes": "The cure fraction is identified through the long-term censored observations — patients who have not yet had the event — so correctly understanding what censoring means and when it is informative versus non-informative is foundational to cure model interpretation and identifiability assessment."
      },
      {
        "relation_type": "used_with",
        "target_slug": "survival-extrapolation-hta-rwe",
        "notes": "Cure models are one component of the NICE TSD 14/TSD 21 survival extrapolation workflow; they should always be presented alongside the standard six parametric candidate models and flexible spline alternatives, not as a standalone primary analysis."
      },
      {
        "relation_type": "see_also",
        "target_slug": "kaplan-meier-estimator",
        "notes": "The Kaplan-Meier plateau is the primary diagnostic for a cure phenomenon; the cure model should be anchored to and visually compared against the empirical KM curve over the observed follow-up window as a first-pass validation of the parametric fit."
      },
      {
        "relation_type": "see_also",
        "target_slug": "relative-net-survival",
        "notes": "Relative-survival cure models extend the mixture cure framework to model excess mortality relative to background (population life-table) hazard, enabling cure fraction estimation in settings with long follow-up where background mortality is non-trivial; the relative net survival concept covers the foundational framework."
      }
    ],
    "aliases": [
      "mixture cure model",
      "non-mixture cure model",
      "bounded survival model",
      "long-term survivor model",
      "promotion-time model",
      "cure fraction model",
      "MCM"
    ],
    "complexity": "advanced",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "dags-backdoor-criterion-drug-studies",
    "name": "DAGs and the Backdoor Criterion for Drug Studies",
    "short_definition": "Directed acyclic graphs encode causal assumptions about a drug-outcome question, and the backdoor criterion uses those assumptions to choose a covariate set that blocks all noncausal paths from exposure to outcome without conditioning on colliders or post-treatment mediators.",
    "long_description": "A **directed acyclic graph (DAG)** is a graph of nodes (variables) and directed edges (assumed direct\ncauses) with no directed cycles. In pharmacoepidemiology and RWE it is the tool that turns informal\nclinical reasoning about confounding, selection, and timing into an explicit, falsifiable model that\ndictates *what to adjust for* before any propensity score, regression, or g-method is fit. The DAG does\nnot estimate an effect; it tells you which adjustment set licenses an unbiased estimate under the drawn\nassumptions, and — just as importantly — which variables would *introduce* bias if conditioned on.\n\n**Core conceptual distinction.** The **backdoor criterion** (Pearl) states that a set Z is sufficient to\nidentify the effect of exposure A on outcome Y if (1) Z blocks every \"backdoor\" path — every path from A\nto Y that begins with an arrow *into* A — and (2) Z contains no descendant of A. Three structures behave\ndifferently and must not be confused. A **confounder** (a common cause of A and Y, e.g. disease severity)\nopens a backdoor path that you must *close* by conditioning. A **mediator** (a variable on the causal\npath A → M → Y, e.g. post-initiation adherence or on-treatment LDL) lies on the front-door path;\nconditioning on it removes part of the very effect you are estimating (total-effect bias) and can open a\ncollider path. A **collider** (a common *effect* of two variables, e.g. database inclusion caused by both\nhealthy-user behavior and the outcome) is naturally path-blocking; conditioning on it (or on its\ndescendant) *opens* a spurious path and induces collider-stratification / selection bias. The estimand\nmust be fixed first: a DAG that identifies a *total* effect (adjust confounders only) is different from\none identifying a *controlled direct* effect (which legitimately conditions on a mediator and then needs\nmediator-outcome confounders too). 
M-bias is the canonical trap — conditioning on a pre-exposure variable\nthat is itself a collider of two unmeasured causes can create bias where none existed.\n\n**Pros, cons, and trade-offs.**\n- **vs ad hoc \"adjust for everything available\" / change-in-estimate covariate screening:** A DAG-derived\n  adjustment set is principled and avoids over-adjustment. The \"throw in every baseline claims variable\"\n  habit and stepwise/change-in-estimate selection routinely include mediators and colliders, amplifying\n  rather than removing bias (Schisterman 2009). Cost: the DAG is only as good as its assumptions, which\n  are usually unverifiable; it offloads the hard problem onto subject-matter judgment. **Prefer the DAG**\n  whenever the candidate covariate list contains anything measured after, or plausibly affected by, the\n  exposure decision.\n- **vs disjunctive-cause / pre-exposure-covariate heuristics (VanderWeele):** Practical heuristics (\"adjust\n  for any pre-exposure cause of exposure or outcome\") are robust defaults when a full DAG is infeasible and\n  avoid M-bias in most realistic settings. Cost: they can over-adjust relative to a minimal sufficient set\n  and give no guidance on selection/collider structure or time-varying feedback. **Prefer an explicit DAG**\n  for selection bias, time-varying confounding, and when efficiency (minimal set) matters.\n- **vs propensity-score / IPTW methods directly:** The DAG and the PS are complementary, not competing — the\n  DAG decides *which* variables enter the PS; the PS decides *how* to balance them. A perfectly balanced PS\n  on the wrong variable set (e.g. a post-baseline mediator) is confidently biased. **Always draw the DAG\n  first**, then estimate.\n\n**When to use.** Draw the DAG at the protocol/SAP stage of every comparative drug, procedure, or policy\nstudy, before specifying the covariate set, the PS model, or the g-method. 
It is most valuable when (a) the\ncandidate covariate list mixes pre- and post-index variables; (b) selection into the cohort is non-trivial\n(continuous-enrollment, complete-case, or landmark restrictions); (c) treatment-confounder feedback is\npresent and you must decide between ordinary regression/PS and g-methods (MSMs, g-formula); or (d) you must\njustify to a regulator or HTA body *why* a given variable was or was not adjusted.\n\n**When NOT to use — and when it is actively misleading or dangerous.**\n- **As a substitute for data.** A DAG asserting \"no unmeasured confounding\" does not make it so. In claims,\n  severity and frailty are usually unmeasured; the DAG should *show* them as unadjusted common causes and\n  trigger negative-control or quantitative-bias analysis, not paper over them.\n- **When time ordering is wrong or implicit.** The single most common fatal DAG error in RWE is treating a\n  post-index variable (adherence, on-treatment biomarker, subsequent utilization) as if it were baseline.\n  If a node's timestamp relative to time zero is ambiguous, the DAG is dangerous: it will license adjusting\n  for a mediator or collider. Every node needs an explicit \"measured at/before/after time zero\" label.\n- **When the analyst conditions on a collider to \"increase precision.\"** Adjusting for a strong predictor\n  of the outcome that is a descendant of exposure (or a common effect of exposure and an outcome cause) can\n  look like it tightens the estimate while biasing it. Over-adjustment is not conservative.\n- **When a hand-drawn DAG is over-trusted.** A sparse DAG that omits a real common cause silently asserts\n  its absence. The graph encodes *your* assumptions; it cannot detect the arrow you forgot to draw. 
Use\n  implied conditional-independence tests against the data to falsify (not confirm) the structure.\n\n**Data-source operational depth.**\n- **Claims (FFS or commercial):** Severity, frailty, performance status, and prescriber preference are the\n  dominant confounders and are mostly *unmeasured*; draw them explicitly as latent common causes and rely on\n  high-dimensional proxies (prior diagnoses, procedures, drug classes, utilization counts in the lookback).\n  Failure modes: (1) **MA-only person-time** lacks fee-for-service claims, so a \"no prior diagnosis/fill\"\n  proxy is missingness, not absence — exclude MA-only spans or the unmeasured-confounding node becomes a\n  selection node too. (2) **Healthcare-utilization intensity** is a collider/proxy hybrid — it is caused by\n  severity *and* by access, and outcomes are only observed in those who interact with the system; adjusting\n  for total post-index utilization conditions on a descendant of exposure. (3) **Continuous-enrollment\n  requirements after index** condition on survival/observability, a collider of exposure and outcome\n  (informative censoring). (4) Differential **competing risks** (e.g. death in elderly claims cohorts) act\n  as a censoring node that differs by arm and must appear in the DAG.\n- **EHR:** Labs and vitals sharpen severity (an advantage over claims) but their *presence* is driven by the\n  visit/ordering process — the DAG must include the observation/measurement process node, or conditioning on\n  \"had a baseline HbA1c\" silently conditions on care-seeking. 
Loss to follow-up when a patient leaves the\n  network is potentially informative censoring (a collider on outcome and exposure-related health).\n- **Registry:** Clinical stage and biomarkers improve exchangeability but are often recorded *after* the\n  treatment decision; if a stage variable is post-decision it may be a mediator, not a confounder — the\n  timestamp determines the edge direction.\n- **Linked claims–EHR–vital records:** Best severity + completeness + mortality, but linkage itself is a\n  selection node (only the linkable subset), and order/fill/service-date discrepancies must be reconciled so\n  every node's position relative to time zero is correct before the backdoor set is read off the graph.\n\n**Worked claims example.** Question: incident acute kidney injury (AKI) under SGLT2 inhibitor vs DPP-4\ninhibitor among adults with type 2 diabetes in a commercial + Medicare FFS database. Nodes and timestamps:\nbaseline (measured in the 365-day lookback ending at the index `fill_date`) — `eGFR_proxy` (CKD diagnosis\ncodes), `diabetes_severity` (insulin use, HbA1c proxies), `prior_utilization`, `prescriber_preference`,\nplus *latent* `frailty`/`true_severity`; time zero — `treatment` (arm assigned from the NDC dispensed on the\nfirst qualifying fill after a 365-day washout of both classes); post-index — `on_treatment_volume_depletion`\n(a mediator: SGLT2 → volume depletion → AKI), `adherence`, `subsequent_utilization`; outcome — `AKI`.\nReading the backdoor criterion off the DAG: the sufficient adjustment set is {`eGFR_proxy`,\n`diabetes_severity`, `prior_utilization`, `prescriber_preference`} — all pre-index common causes, all\nmeasurable only in [`index_date` − 365, `index_date`] and fed into a high-dimensional propensity score.\nVariables that must be *excluded*: `on_treatment_volume_depletion` (mediator of the very effect of interest\n— adjusting for it estimates a controlled direct effect, not the total effect the safety question wants),\n`adherence` 
and `subsequent_utilization` (descendants of treatment → collider/mediator bias), and any\n*post-index* continuous-enrollment flag (conditions on observability, a collider of treatment and outcome).\nBecause `true_severity` and `frailty` are unmeasured common causes that remain open backdoor paths, the\nprotocol pre-specifies a negative-control outcome and a quantitative-bias analysis rather than claiming the\nadjustment set is complete. A DAG is useful only when time ordering is explicit; in RWE virtually every\nserious DAG error is adjusting for a post-index variable or conditioning on future observability.\n\n**Interpreting the output**\n\nIn the SGLT2 versus DPP-4 study, the DAG analysis produces: sufficient adjustment set =\n{eGFR_proxy, diabetes_severity, prior_utilization, prescriber_preference}; excluded from\nadjustment: on_treatment_volume_depletion (mediator), adherence (descendant of treatment),\nsubsequent_utilization (collider).\n\n*(1) Formal interpretation.* Under the drawn DAG, the four pre-index variables block every\nbackdoor path that runs through measured variables, without conditioning on any descendant of\nthe exposure. The mediator on_treatment_volume_depletion lies on the causal path SGLT2 →\nvolume_depletion → AKI; including it in the outcome or propensity model would yield a controlled\ndirect effect, not the total effect the safety question requires. The collider\nsubsequent_utilization receives arrows from both the exposure and the outcome; conditioning on\nit opens a spurious non-causal path and induces collider-stratification bias. The two unmeasured\nnodes, true_severity and frailty, leave backdoor paths that no measured covariate set can close;\nthe DAG therefore motivates quantitative bias analysis rather than a claim of complete\nconfounding control.\n\n*(2) Practical interpretation.* The DAG does not produce an effect estimate; it produces a\ndefensible covariate list. A reviewer can trace every variable in the propensity-score\nspecification back to its structural role: confounder (adjust), mediator (exclude), collider\n(exclude). Variables absent from the model were omitted for explicit causal reasons, not\noversight. That audit trail is how a DAG converts clinical judgment into a transparent,\ncontestable analytic choice rather than an ad hoc \"include everything in the claims\" list.",
    "primary_category": "Causal_Inference_Method",
    "tags": [
      "dag",
      "backdoor-criterion",
      "confounding",
      "collider-bias",
      "mediator",
      "selection-bias",
      "covariate-selection",
      "causal-inference"
    ],
    "applies_to_study_types": [
      "active_comparator_new_user",
      "cohort_retrospective",
      "claims_analysis",
      "ehr_study",
      "target_trial_emulation"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "linked",
      "registry",
      "multi-database"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1097/00001648-199901000-00008",
        "url": "https://doi.org/10.1097/00001648-199901000-00008",
        "citation_text": "Greenland S, Pearl J, Robins JM. Causal diagrams for epidemiologic research. Epidemiology. 1999;10(1):37-48.",
        "year": 1999,
        "authors_short": "Greenland et al.",
        "notes": "Foundational epidemiologic treatment of causal diagrams, the backdoor criterion, and adjustment-set logic."
      },
      {
        "role": "explain",
        "doi": "10.1097/EDE.0b013e3181a819a1",
        "url": "https://doi.org/10.1097/EDE.0b013e3181a819a1",
        "citation_text": "Schisterman EF, Cole SR, Platt RW. Overadjustment bias and unnecessary adjustment in epidemiologic studies. Epidemiology. 2009;20(4):488-495.",
        "year": 2009,
        "authors_short": "Schisterman et al.",
        "notes": "Defines over-adjustment (mediator) and unnecessary-adjustment bias with worked DAGs; the canonical warning against \"adjust for everything\" covariate selection."
      },
      {
        "role": "explain",
        "doi": "10.1186/1471-2288-8-70",
        "url": "https://doi.org/10.1186/1471-2288-8-70",
        "citation_text": "Shrier I, Platt RW. Reducing bias through directed acyclic graphs. BMC Medical Research Methodology. 2008;8:70.",
        "year": 2008,
        "authors_short": "Shrier & Platt",
        "notes": "Accessible, applied account of using DAGs and the backdoor criterion to choose a minimal sufficient adjustment set."
      },
      {
        "role": "demonstrate",
        "doi": "10.1097/01.ede.0000135174.63482.43",
        "url": "https://doi.org/10.1097/01.ede.0000135174.63482.43",
        "citation_text": "Hernán MA, Hernández-Díaz S, Robins JM. A structural approach to selection bias. Epidemiology. 2004;15(5):615-625.",
        "year": 2004,
        "authors_short": "Hernán et al.",
        "notes": "Demonstrates that selection bias is collider-stratification bias; directly relevant to RWE cohort-inclusion, continuous-enrollment, and informative-censoring nodes."
      },
      {
        "role": "use",
        "doi": "10.1016/j.jclinepi.2016.04.014",
        "url": "https://doi.org/10.1016/j.jclinepi.2016.04.014",
        "citation_text": "Hernán MA, Sauer BC, Hernández-Díaz S, Platt R, Shrier I. Specifying a target trial prevents immortal time bias and other self-inflicted injuries in observational analyses. J Clin Epidemiol. 2016;79:70-75.",
        "year": 2016,
        "authors_short": "Hernán et al.",
        "notes": "Operationalizes the timing assumptions a drug-study DAG encodes (time zero, no adjustment for post-index variables) into an analyzable target-trial protocol."
      }
    ],
    "plain_language_summary": "A DAG (directed acyclic graph) is a simple diagram where you draw every variable in your study as a box and draw arrows showing which variables cause which — it is a map of your causal assumptions, not a picture of your data. The backdoor criterion is a rule you apply to that map: it tells you exactly which variables you must adjust for to make a fair comparison between drug users and non-users, and — just as importantly — which variables you must leave alone because adjusting for them would actually introduce bias. The core insight is that some variables (confounders) open unfair comparison paths that you need to close by adjusting, while other variables (colliders) are naturally blocking a spurious path and will blow it open the moment you condition on them.",
    "key_terms": [
      {
        "term": "confounding",
        "definition": "A variable that causes both the drug use and the outcome, so patients who receive the drug already differ from those who do not in ways that also affect the outcome — creating a misleading comparison if you ignore it."
      },
      {
        "term": "backdoor path",
        "definition": "Any path in the DAG that flows from the exposure back through an arrow pointing into the exposure and then out to the outcome — a noncausal route that, if left open, makes the drug look better or worse than it really is."
      },
      {
        "term": "collider",
        "definition": "A variable that receives arrows from two other variables (a common effect rather than a common cause); it naturally blocks a spurious path, but conditioning on it opens that path and creates bias."
      },
      {
        "term": "adjustment set",
        "definition": "The specific group of variables you include as covariates in your analysis to block all backdoor paths without introducing new ones — the answer the backdoor criterion gives you."
      },
      {
        "term": "mediator",
        "definition": "A variable that sits on the causal chain between the drug and the outcome (the drug causes the mediator, which then causes the outcome); adjusting for it removes part of the drug's effect from your estimate."
      }
    ],
    "worked_example": {
      "scenario": "A study asks whether Drug A (an SGLT2 inhibitor) reduces the risk of hospitalization compared to Drug B (a DPP-4 inhibitor) in adults with type 2 diabetes. Before fitting any model, the team draws a small DAG with five variables: the drug assignment (Exposure), the hospitalization outcome (Outcome), a baseline disease-severity score (Confounder), a post-treatment lab change caused by Drug A (Mediator), and database enrollment status which is jointly caused by both drug use and the hospitalization outcome (Collider). The question is: which variables should enter the adjustment set, and which should be excluded?",
      "dataset": {
        "caption": "Node-and-edge table describing the small study DAG. Each row is one arrow in the diagram. Read 'from -> to' as 'from causes to'.",
        "columns": [
          "from",
          "to",
          "role_of_destination"
        ],
        "rows": [
          [
            "Disease Severity",
            "Drug Assignment (Exposure)",
            "Confounder — causes who gets the drug"
          ],
          [
            "Disease Severity",
            "Hospitalization (Outcome)",
            "Confounder — also directly affects the outcome"
          ],
          [
            "Drug Assignment (Exposure)",
            "Lab Change (Mediator)",
            "Mediator — drug causes the lab to change"
          ],
          [
            "Lab Change (Mediator)",
            "Hospitalization (Outcome)",
            "Mediator — lab change then affects hospitalization"
          ],
          [
            "Drug Assignment (Exposure)",
            "Database Enrollment (Collider)",
            "Collider — drug affects who stays enrolled"
          ],
          [
            "Hospitalization (Outcome)",
            "Database Enrollment (Collider)",
            "Collider — outcome also affects who stays enrolled"
          ],
          [
            "Drug Assignment (Exposure)",
            "Hospitalization (Outcome)",
            "Effect of interest — the arrow we want to measure"
          ]
        ]
      },
      "steps": [
        "Draw the DAG and list every path from Exposure to Outcome that does NOT go forward along the Exposure → Outcome arrow. The only such path here is: Exposure ← Disease Severity → Outcome. This is the backdoor path — it flows backward into Exposure first, then out to Outcome.",
        "Apply the backdoor criterion: we need a set of variables that blocks every backdoor path and contains no descendant of Exposure. Disease Severity sits on the one backdoor path and is measured before the drug is prescribed (it is not caused by Exposure), so adjusting for Disease Severity blocks that path. The adjustment set is {Disease Severity}.",
        "Check the Mediator (Lab Change). Lab Change is caused by Exposure — it is a descendant of Exposure and lies on the causal chain Exposure → Lab Change → Outcome. Adjusting for it would partially erase the drug's effect from the estimate, giving a biased (too-small) result for the total effect. Exclude it.",
        "Check the Collider (Database Enrollment). Database Enrollment is caused by both Exposure and Outcome. Right now it has two arrows pointing in with no path through it — it naturally blocks any spurious route. The moment you condition on it (say, by requiring everyone to have a follow-up visit), you open a spurious Exposure–Outcome association that did not exist before. Exclude it.",
        "Read off the final answer: adjust for Disease Severity only. Do not adjust for Lab Change (mediator) or Database Enrollment (collider)."
      ],
      "result": "The correct minimal adjustment set is {Disease Severity}. Adjusting for Disease Severity and nothing else closes the one backdoor path (Exposure ← Disease Severity → Outcome) and yields an unbiased estimate of the total effect of Drug A vs Drug B on hospitalization. Adding the Mediator would underestimate the drug's true benefit. Adding the Collider would manufacture a spurious association where none exists."
    },
    "prerequisites": [
      "active-comparator-new-user",
      "propensity-score-methods-psm-iptw"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Baseline confounding DAG",
        "description": "Maps pre-index common causes (indication, disease severity, frailty, comorbidity, utilization, prescriber preference, calendar time) and reads off the minimal sufficient adjustment set via the backdoor criterion.",
        "edge_cases": [
          "Unmeasured common causes (severity, frailty) must be drawn as latent nodes so the open backdoor path is visible and triggers negative controls or QBA."
        ],
        "data_source_notes": "claims: rely on high-dimensional proxies for latent severity/frailty; EHR: labs/vitals improve the set but require an observation-process node."
      },
      {
        "name": "Collider and selection DAG",
        "description": "Represents conditioning induced by database inclusion, post-index continuous enrollment, complete-case analysis, or surviving to a landmark, which can open spurious A-Y paths.",
        "edge_cases": [
          "Requiring continuous enrollment *after* index conditions on observability, a collider of exposure and outcome (informative censoring).",
          "M-bias - conditioning on a pre-exposure collider of two unmeasured causes induces bias where none existed."
        ],
        "data_source_notes": "claims: MA-only person-time and differential competing risks (death) act as selection/censoring nodes that differ by arm."
      },
      {
        "name": "Time-varying confounding DAG",
        "description": "Represents treatment-confounder feedback where prior treatment affects later covariates that affect future treatment and outcome, signalling that ordinary PS/regression is biased and g-methods are required.",
        "data_source_notes": "Use to decide between an initiation (ITT-like) contrast handled by baseline PS and a sustained/per-protocol estimand requiring MSMs or the g-formula."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Ad hoc \"adjust for everything\" or change-in-estimate covariate selection",
        "pros_of_this": "Yields a principled minimal sufficient set; avoids including mediators and colliders that amplify bias.",
        "cons_of_this": "Validity rests on unverifiable structural assumptions and subject-matter judgment; a missing arrow is an unstated assumption.",
        "when_to_prefer": "Whenever the candidate covariate list contains any variable measured after, or plausibly affected by, the exposure decision."
      },
      {
        "compared_to": "Disjunctive-cause / pre-exposure-covariate heuristic (VanderWeele)",
        "pros_of_this": "Handles selection/collider structure and time-varying feedback that simple heuristics ignore; identifies an efficient minimal set.",
        "cons_of_this": "More effort; a poorly specified DAG can be worse than a robust default that adjusts for all pre-exposure causes.",
        "when_to_prefer": "Selection-bias-prone designs, time-varying confounding, or when a minimal (efficient) adjustment set is needed."
      },
      {
        "compared_to": "Propensity-score / IPTW methods",
        "pros_of_this": "Decides which variables are admissible; a balanced PS on the wrong set is confidently biased.",
        "cons_of_this": "Does not itself estimate or balance anything; it is an upstream step.",
        "when_to_prefer": "Always draw the DAG first, then estimate with the PS/g-method it justifies."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Draw severity/frailty/prescriber-preference as latent common causes and use high-dimensional lookback proxies. Exclude MA-only person-time so unmeasured-proxy nodes do not double as selection nodes. Never adjust for post-index utilization, adherence, or a post-index continuous-enrollment flag - all are descendants of exposure (collider/mediator bias).",
      "ehr": "Add an explicit observation/measurement-process node; the presence of a baseline lab is driven by care-seeking, so conditioning on \"had a baseline value\" can condition on a collider. Treat loss to follow-up when a patient leaves the network as potentially informative censoring.",
      "registry": "A clinical-stage or biomarker node recorded after the treatment decision is a mediator, not a confounder; the timestamp determines the edge direction.",
      "linked": "Linkage is a selection node (only the linkable subset). Reconcile order/fill/service dates so each node's position relative to time zero is correct before reading the backdoor set off the graph."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import networkx as nx\nfrom dowhy import CausalModel\nimport pandas as pd\n\n# ---- Structural model (timestamps encoded in node names) -------------------------------------\n# Pre-index (measured in [index_date-365, index_date]): eGFR_proxy, diabetes_severity,\n#   prior_utilization, prescriber_preference. Latent (unmeasured in claims): true_severity, frailty.\n# Time zero: treatment. Post-index (descendants of treatment): volume_depletion (mediator),\n#   adherence, subsequent_utilization. Outcome: AKI.\nedges = [\n    (\"eGFR_proxy\", \"treatment\"),         (\"eGFR_proxy\", \"AKI\"),\n    (\"diabetes_severity\", \"treatment\"),  (\"diabetes_severity\", \"AKI\"),\n    (\"prior_utilization\", \"treatment\"),  (\"prior_utilization\", \"AKI\"),\n    (\"prescriber_preference\", \"treatment\"),\n    (\"true_severity\", \"diabetes_severity\"), (\"true_severity\", \"AKI\"),   # latent common cause\n    (\"frailty\", \"prior_utilization\"),       (\"frailty\", \"AKI\"),         # latent common cause\n    (\"treatment\", \"volume_depletion\"),   (\"volume_depletion\", \"AKI\"),   # mediator (front-door)\n    (\"treatment\", \"adherence\"),          (\"adherence\", \"AKI\"),          # descendant of treatment\n    (\"treatment\", \"subsequent_utilization\"),                            # collider feeder\n    (\"AKI\", \"subsequent_utilization\"),                                  # -> collider\n    (\"treatment\", \"AKI\"),                                               # effect of interest\n]\ng = nx.DiGraph(edges)\ngml = \"graph[directed 1 \" + \" \".join(\n    f'node[id \"{n}\"]' for n in g.nodes()\n) + \" \" + \" \".join(\n    f'edge[source \"{u}\" target \"{v}\"]' for u, v in g.edges()\n) + \"]\"\n\n# `df` is the analytic cohort (one row per new initiator) with the named columns; latent nodes\n# (true_severity, frailty) are intentionally absent because claims cannot measure them.\ndef select_adjustment_set(df: pd.DataFrame):\n    model = CausalModel(\n  
      data=df, treatment=\"treatment\", outcome=\"AKI\", graph=gml,\n        common_causes=None,\n    )\n    estimand = model.identify_effect(proceed_when_unidentifiable=False)\n    # Valid backdoor set = pre-index common causes only; descendants of treatment are excluded.\n    backdoor = estimand.get_backdoor_variables()\n    post_index = nx.descendants(g, \"treatment\")\n    inadmissible = sorted(post_index)            # must NOT enter the PS\n    latent_open = [n for n in (\"true_severity\", \"frailty\") if n not in backdoor]\n    print(\"Adjustment set for the propensity score:\", sorted(backdoor))\n    print(\"Inadmissible (post-index descendants):\", inadmissible)\n    print(\"Unmeasured open backdoor paths -> need negative control / QBA:\", latent_open)\n    return sorted(backdoor)",
        "description": "DAG-driven adjustment-set selection with dowhy/networkx. Input is the *structural model*, not data:\nencode the claims drug-study DAG (latent severity/frailty as unobserved common causes, baseline\nconfounders, treatment at time zero, post-index mediators/colliders, outcome). dowhy enumerates valid\nbackdoor adjustment sets and refuses sets that include descendants of treatment. The printed\nbackdoor_set is exactly the covariate list that should feed the high-dimensional propensity score;\neverything post-index is reported as inadmissible. No toy data is fabricated for the analysis.",
        "dependencies": [
          "dowhy",
          "networkx"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(dagitty)\n\n# Latent nodes flagged [latent]; exposure/outcome flagged so adjustmentSets() can solve the backdoor.\ng <- dagitty('dag {\n  eGFR_proxy           -> treatment   eGFR_proxy           -> AKI\n  diabetes_severity    -> treatment   diabetes_severity    -> AKI\n  prior_utilization    -> treatment   prior_utilization    -> AKI\n  prescriber_preference -> treatment\n  true_severity [latent] -> diabetes_severity   true_severity -> AKI\n  frailty       [latent] -> prior_utilization    frailty       -> AKI\n  treatment -> volume_depletion        volume_depletion      -> AKI   /* mediator */\n  treatment -> adherence               adherence             -> AKI   /* descendant */\n  treatment -> subsequent_utilization  AKI -> subsequent_utilization  /* collider */\n  treatment [exposure]                 AKI [outcome]\n  treatment -> AKI\n}')\n\n# Minimal sufficient set to identify the TOTAL effect of treatment on AKI (backdoor criterion).\nadjustmentSets(g, type = \"minimal\", effect = \"total\")\n# Variables that would bias the estimate if conditioned on:\nprint(setdiff(descendants(g, \"treatment\"), \"treatment\"))  # post-index mediators/colliders\n\n# Implied conditional independencies the DAG asserts, then falsify them against the cohort.\n# `cohort` is the analytic data frame (one row per initiator) holding the OBSERVED node columns.\nimpliedConditionalIndependencies(g)\nlocalTests(g, data = cohort, type = \"cis.loess\", R = 200)  # |estimate| far from 0 => DAG misspecified",
        "description": "Canonical DAG workflow with dagitty: declare the same drug-study DAG with explicit timing, read off\nthe minimal sufficient adjustment set via the backdoor criterion, enumerate the implied conditional\nindependencies the structure asserts, and *falsify* them against the analytic claims data frame with\nlocalTests(). A failed implied independence is evidence the drawn structure (or the adjustment set) is\nwrong - the only empirical check a DAG offers.",
        "dependencies": [
          "dagitty"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* DAG-derived minimal sufficient adjustment set (pre-index common causes ONLY). */\n%let adjset = eGFR_proxy diabetes_severity prior_utilization prescriber_preference;\n\n/* (1) Falsify an implied conditional independence: prescriber_preference _||_ AKI | rest of adjset.\n   A PARTIAL correlation far from 0 is evidence the DAG (or the adjustment set) is misspecified. */\nproc corr data=work.analytic;\n  var prescriber_preference AKI;\n  partial eGFR_proxy diabetes_severity prior_utilization;\nrun;\n\n/* (2) PS estimation uses ONLY the admissible set; post-index nodes (volume_depletion, adherence,\n   subsequent_utilization) are deliberately never referenced. SAS/STAT 14.2+ required. */\nproc psmatch data=work.analytic region=allobs;\n  class treatment;\n  psmodel treatment(treated='1') = &adjset;\n  match method=greedy(k=1) distance=lps caliper=0.2;\n  assess lps var=(&adjset) / plots=(boxplot);   /* standardized differences pre/post match */\n  output out(obs=match)=matched matchid=mid;\nrun;\n\n/* Outcome model on the matched set, again restricted to the DAG-justified covariates. */\nproc phreg data=matched;\n  class treatment;\n  model aki_time*aki_event(0) = treatment;      /* matched estimand; covariates already balanced */\n  strata mid;\nrun;",
        "description": "SAS does not draw DAGs; identification happens upstream (dowhy/dagitty above). SAS pulls real weight in\ntwo ways. (1) PROC CORR test of an implied conditional independence: the DAG above asserts\nprescriber_preference _||_ AKI | {eGFR_proxy, diabetes_severity, prior_utilization}; a large PARTIAL\ncorrelation falsifies the structure. (2) The DAG-derived adjustment set is hard-coded into a macro\nvariable that drives PROC PSMATCH and PROC PHREG, guaranteeing only admissible pre-index covariates\nenter estimation. Required input: work.analytic = one row per new initiator with treatment (1/0),\nthe AKI event/time, and all node columns measured in [index_date-365, index_date].",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  TS[\"true_severity (latent)\"]:::latent --> Sev[diabetes_severity]\n  TS --> Y[AKI outcome]\n  FR[\"frailty (latent)\"]:::latent --> Util[prior_utilization]\n  FR --> Y\n  Sev --> A[treatment - time zero]\n  Sev --> Y\n  Util --> A\n  Util --> Y\n  eGFR[eGFR_proxy] --> A\n  eGFR --> Y\n  Pref[prescriber_preference] --> A\n  A --> Y\n  classDef latent fill:#fee2e2,stroke:#b91c1c,stroke-dasharray:4 3;\n  classDef adjust fill:#dcfce7,stroke:#15803d;\n  class Sev,Util,eGFR,Pref adjust;",
        "caption": "Baseline-confounding DAG for SGLT2 vs DPP-4 on AKI. The backdoor criterion selects the green pre-index common causes {eGFR_proxy, diabetes_severity, prior_utilization, prescriber_preference} for the propensity score. The dashed red latent nodes (true severity, frailty) leave open backdoor paths that claims cannot close, so the protocol adds negative controls and quantitative-bias analysis.",
        "alt_text": "Directed acyclic graph showing latent severity and frailty as unmeasured common causes, four measured baseline confounders pointing into both treatment and AKI, and the treatment-to-AKI effect edge.",
        "source_type": "illustrative",
        "source_citations": [
          "greenland-1999"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  A[treatment - time zero] --> M[\"volume_depletion (mediator, post-index)\"]:::bad\n  M --> Y[AKI outcome]\n  A --> Adh[\"adherence (post-index)\"]:::bad\n  Adh --> Y\n  A --> SU[\"subsequent_utilization (collider)\"]:::bad\n  Y --> SU\n  A -. continuous-enrollment / cohort inclusion .-> Sel{{\"selection node (collider)\"}}:::bad\n  Y -.-> Sel\n  A --> Y\n  classDef bad fill:#fef9c3,stroke:#a16207;\n  classDef sel fill:#fee2e2,stroke:#b91c1c;\n  class Sel sel;",
        "caption": "Do-NOT-adjust DAG. Every yellow node is a descendant of treatment - the mediator volume_depletion (adjusting for it estimates a controlled direct effect, not the total effect), post-index adherence, and the collider subsequent_utilization (a common effect of treatment and AKI). Conditioning on post-index continuous enrollment / cohort inclusion (red selection node) opens a spurious treatment-AKI path - collider-stratification (selection) bias.",
        "alt_text": "Directed acyclic graph showing treatment causing a post-index mediator, adherence, and a collider that is also caused by the outcome, plus a selection node that is a common effect of treatment and outcome, illustrating variables that must not be adjusted for.",
        "source_type": "illustrative",
        "source_citations": [
          "schisterman-2009"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "used_with",
        "target_slug": "propensity-score-methods-psm-iptw",
        "notes": "The DAG defines the admissible measured-confounder set; the PS balances it. A balanced PS on a set containing a mediator or collider is confidently biased."
      },
      {
        "relation_type": "used_with",
        "target_slug": "target-trial-emulation",
        "notes": "The target-trial protocol operationalizes the timing assumptions the DAG encodes (time zero, no adjustment for post-index variables)."
      },
      {
        "relation_type": "see_also",
        "target_slug": "active-comparator-new-user",
        "notes": "An active-comparator new-user design removes the confounding-by-indication and immortal-time structures a drug-study DAG would otherwise have to model."
      },
      {
        "relation_type": "see_also",
        "target_slug": "healthy-user-bias",
        "notes": "Healthy-user/healthy-adherer pathways should be drawn explicitly before choosing proxies or running QBA."
      },
      {
        "relation_type": "see_also",
        "target_slug": "selection-bias-sensitivity-analysis-rwe",
        "notes": "Selection bias is collider-stratification bias; the DAG shows when cohort inclusion, continuous enrollment, or complete-case analysis induces it."
      }
    ],
    "aliases": [
      "causal DAG",
      "backdoor criterion",
      "graphical causal model",
      "covariate selection DAG",
      "d-separation"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "database-feasibility-attrition-funnel-rwe",
    "name": "Database Feasibility Assessment and Attrition Funnel",
    "short_definition": "A structured pre-protocol process that quantifies whether a candidate real-world data source can answer the study question and a CONSORT-style stepwise count of how many patients survive each eligibility, exposure, and follow-up criterion to reach the analyzable cohort.",
    "long_description": "A **database feasibility assessment** answers a binary, fit-for-purpose question before any comparative analysis is written:\ndoes this data source contain enough of the right patients, with enough observable person-time, exposure detail, covariate\ncapture, and ascertainable outcomes to estimate the target estimand with acceptable precision and credibility? The\n**attrition funnel** (also called a participant-flow or cohort-decrement diagram) is the auditable artifact that operationalizes\nthat answer: a stepwise table that starts from the full source population and reports, for each eligibility, exposure, washout,\nand follow-up criterion *in the order it is applied*, how many patients (and how much person-time) are retained and dropped,\nending at the analyzable N. It is the RWE analogue of the CONSORT flow diagram, and structured RWE reporting templates\n(STaRT-RWE, HARPER) require it precisely because the sequence and magnitude of losses is where most silent bias and most\nfatal feasibility problems live.\n\n**Core conceptual distinction**. Feasibility and attrition are two phases of the same fit-for-purpose discipline, and they\nmust not be collapsed. *Feasibility* is a go/no-go decision made on aggregate counts and metadata — often before licensing\na database — and is about *relevance* (does the source capture the population, exposure, and outcome?) and *reliability*\n(are those captured well enough to trust?). *Attrition* is the patient-level realization of the protocol's eligibility logic,\nand its job is transparency, not decision: it makes every inclusion/exclusion rule and its cost visible so reviewers can\njudge selection bias and generalizability. 
The critical, non-obvious point is that **attrition is order-dependent and the\norder must mirror the protocol's causal logic, not convenience.** Applying \"≥1 outcome-free baseline year\" before \"first\nqualifying exposure\" versus after changes both the denominator and *which patients* are excluded; applying continuous-enrollment\nrequirements that extend *after* time zero builds future information into eligibility and manufactures immortal time. The\nfunnel is therefore not bookkeeping — it is a specification of the study population, and a misordered funnel is a misspecified\nstudy. A feasibility assessment estimates the *expected* bottom of the funnel; the attrition table reports the *actual* one,\nand a large gap between them is itself a finding (usually unobserved person-time or stricter-than-anticipated capture).\n\n**Pros, cons, and trade-offs**.\n- **vs jumping straight to cohort construction with no documented feasibility step:** A formal feasibility assessment kills\n  doomed studies cheaply (before a six-figure data license or six months of programming) and forces pre-specification of the\n  expected analyzable N, event count, and exposure prevalence — numbers that anchor power and that regulators and HTA bodies\n  expect to see justified. Cost: it adds an upfront aggregate-counting phase and tempts teams to \"peek\" at the outcome\n  distribution, which can bias design choices. **Prefer the formal step** for any consequential comparative, safety, or\n  decision-grade analysis; a quick informal scan suffices only for hypothesis-generating descriptive work.\n- **vs an undocumented or post-hoc attrition narrative (\"we excluded some patients\"):** A pre-specified, ordered funnel\n  table is reproducible, lets reviewers re-derive each step, and exposes selection bias by showing *where* the cohort\n  collapses. Cost: more programming and diagnostic counts. 
**Always prefer the explicit funnel** — STaRT-RWE, HARPER,\n  RECORD, and FDA/EMA RWE guidance all treat it as mandatory.\n- **vs a single overall \"N excluded\" number:** The stepwise funnel localizes the loss (e.g., 70% lost at the continuous-enrollment\n  step signals a data-completeness problem, not a rare disease), which a single number hides. Cost: requires committing to and\n  defending a specific ordering. **Prefer the stepwise funnel**; reserve a collapsed count only for an abstract-level summary.\n\n**When to use**. Always, before finalizing any RWE protocol that will support a comparative, safety, utilization, cost, or\nregulatory/HTA decision: run a feasibility assessment to set go/no-go and to pre-specify expected N, person-time, exposure\nprevalence, and event counts; then implement the ordered attrition funnel as a required protocol/SAP deliverable and report\nit in the manuscript. Use it as the first diagnostic when a planned analyzable N comes in far below expectation, when comparing\ntwo candidate databases, and whenever a reviewer or regulator needs to audit how the study population was built.\n\n**When NOT to use — and when it is actively misleading or dangerous**.\n- **As a substitute for design.** A pristine funnel does not make a biased design valid; it documents selection, it does not\n  remove confounding by indication, immortal time, or outcome misclassification. 
Treating \"we showed the funnel\" as evidence\n  of validity is the dangerous failure.\n- **When the ordering is chosen to maximize N rather than to encode the causal logic.** Reordering steps to retain more\n  patients (e.g., applying outcome-free baseline last so it discards fewer) silently changes the estimand and the eligible\n  population; an order-optimized funnel is actively misleading.\n- **When continuous-enrollment or \"had a follow-up visit\" criteria reach past time zero.** Requiring future observability or\n  a future event to qualify for entry builds immortal time and survivor bias directly into eligibility — the funnel will look\n  clean while the design is fatally biased.\n- **When feasibility counts are read as effect estimates.** Aggregate cell counts run during feasibility are not adjusted,\n  are subject to small-cell suppression, and must never be promoted to results; using them to pick the database that gives the\n  \"best\" preliminary signal is outcome-dependent design.\n- **When low yield is rationalized away instead of investigated.** If 80% of patients are lost at one step, the right response\n  is to diagnose the data source (capture gap? code-list error? plan type?), not to relax the criterion until enough remain.\n\n**Data-source operational depth**.\n- **Administrative claims (FFS vs MA vs commercial):** The funnel's most consequential early step is almost always **continuous\n  enrollment / observable person-time** — absence of a claim means \"not observed,\" not \"did not occur,\" so washout and\n  outcome-free requirements are only valid over enrolled spans. Failure mode: **Medicare Advantage person-time lacks\n  fee-for-service claims** because care is capitated, so MA enrollees look event- and exposure-free; including MA-only spans\n  inflates the eligible N and biases rates downward. 
Workaround: restrict to enrollees with the relevant benefit (Parts A/B/D\n  or commercial medical+pharmacy) and exclude MA-only person-time, and report it as an explicit funnel step. Other failure\n  modes: claims adjudication lag and reversals near the data cut create artificial right-censoring (impose a runout/maturity\n  buffer); same-day duplicate and bundled/capitated claims distort counts; plan switching breaks enrollment continuity.\n- **EHR:** Capture is **encounter-driven**, so \"no record of X\" conflates a healthy patient, a patient who got care\n  out-of-network, and a patient who simply did not visit. The funnel must include an explicit \"minimum observability /\n  in-network activity\" step, and loss to follow-up must be treated as potentially informative (sicker patients leave the\n  system). External-care leakage and missing structured fields mean exposure/outcome capture is systematically incomplete;\n  where possible, validate against linked claims and quantify capture by site and calendar time.\n- **Registry:** Often complete and adjudicated for the index disease and key outcomes but typically **weak for full exposure\n  history and for events occurring outside the registry's remit.** The feasibility step must check enrollment criteria,\n  completeness, and the registry's catchment; the funnel should make linkage eligibility (who *can* be linked to claims/EHR\n  for follow-up) an explicit decrement, since the linkable subset is a selected population.\n- **Linked claims–EHR–vital records:** The richest substrate (severity + completeness + reliable mortality) but linkage\n  itself is a funnel step with selection: only the linkable subset enters, and order/fill/service date discrepancies between\n  sources must be reconciled before time-zero assignment. 
A common, dangerous error is **differential competing risks by\n  exposure in elderly claims populations**: if one arm is older/sicker, death (a competing event) differentially truncates\n  follow-up; the funnel and the downstream estimand must account for this rather than treating death as ordinary censoring.\n\n**Worked claims example.** Question: feasibility and cohort assembly for incident heart failure among new users of a study\noral antihyperglycemic in a commercial + Medicare FFS claims database (2016–2023). The pre-specified, ordered funnel:\n(1) Source population with ≥1 pharmacy claim for the study drug class: N = 412,000. (2) First fill of the study drug (index\ndate) on/after 2016-07-01 and with ≥365 days of database history available before index (left-truncation guard): retain\n318,400; drop 93,600 (early-period fills with insufficient lookback). (3) Continuous medical + pharmacy enrollment for the\nfull 365-day baseline through index, FFS-observable only (**exclude Medicare Advantage-only person-time**) so \"no prior fill\"\nand \"no prior HF\" are genuinely observed: retain 171,200; drop 147,200 (this large step is the feasibility signal: most loss\nis unobservable baseline, chiefly MA-only spans, not true ineligibility). (4) New-user washout: no fill of the study drug\n*or* its comparator class in the 365-day baseline (defines incident use): retain 138,900; drop 32,300 (prevalent users).\n(5) Outcome-free baseline: no validated HF diagnosis in the 365-day lookback (the outcome algorithm = ≥1 inpatient or ≥2\noutpatient HF `dx` codes ≥30 days apart): retain 124,500; drop 14,400 (prevalent HF). (6) Age ≥18 and indication confirmed\n(≥2 type-2 diabetes `dx` in baseline): retain 121,700; drop 2,800 (under 18 or indication not confirmed). **Analyzable cohort\nN = 121,700**, with an expected event count (from the feasibility incidence-rate scan) of ~4,100 incident HF events over a\nmedian 2.1 years of follow-up, which is comfortably\npowered.
Follow-up runs from index to first validated HF event, censoring at disenrollment, end of data minus a 90-day\nclaims-runout buffer, and (for an as-treated analysis) last `days_supply` end plus a grace period; death is handled as a\ncompeting event in this elderly-enriched cohort rather than as ordinary censoring. Note how step (3) dominates the attrition:\nhad MA-only person-time been left in, N would have looked far larger, but prior fills and HF diagnoses in those spans would\nhave gone unobserved, so recorded exposure and outcome rates would have been spuriously low.",
    "primary_category": "Data_Quality_Assessment",
    "tags": [
      "database-feasibility",
      "attrition-funnel",
      "cohort-flow-diagram",
      "fit-for-purpose",
      "start-rwe",
      "continuous-enrollment",
      "selection-bias",
      "regulatory-readiness"
    ],
    "applies_to_study_types": [
      "claims_analysis",
      "ehr_study",
      "registry_linkage",
      "target_trial_emulation",
      "comparative_effectiveness"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1136/bmj.m4856",
        "url": "https://doi.org/10.1136/bmj.m4856",
        "citation_text": "Wang SV, Pinheiro S, Hua W, et al. STaRT-RWE: structured template for planning and reporting on the implementation of real world evidence studies. BMJ. 2021;372:m4856.",
        "year": 2021,
        "authors_short": "Wang et al.",
        "notes": "Structured RWE planning/reporting template that mandates an explicit, ordered attrition flow and pre-specified feasibility counts; the canonical statement of why cohort-flow transparency is required."
      },
      {
        "role": "explain",
        "doi": "10.1002/pds.5507",
        "url": "https://doi.org/10.1002/pds.5507",
        "citation_text": "Wang SV, Pottegård A, Crown W, et al. HARmonized Protocol Template to Enhance Reproducibility of hypothesis evaluating real-world evidence studies on treatment effects: A good practices report of a joint ISPE/ISPOR task force. Pharmacoepidemiology and Drug Safety. 2022.",
        "year": 2023,
        "authors_short": "Wang et al.",
        "notes": "Protocol template (HARPER) that operationalizes feasibility and ordered attrition within a transparent, reproducible treatment-effect RWE protocol."
      },
      {
        "role": "demonstrate",
        "doi": "10.1371/journal.pmed.1001885",
        "url": "https://doi.org/10.1371/journal.pmed.1001885",
        "citation_text": "Benchimol EI, Smeeth L, Guttmann A, et al. The REporting of studies Conducted using Observational Routinely-collected health Data (RECORD) statement. PLoS Medicine. 2015;12(10):e1001885.",
        "year": 2015,
        "authors_short": "Benchimol et al.",
        "notes": "Reporting standard for routinely collected health data that requires participant-flow detail, code lists, and data-cleaning/linkage transparency — the items an attrition funnel makes auditable."
      },
      {
        "role": "use",
        "doi": "10.1002/pds.4297",
        "url": "https://doi.org/10.1002/pds.4297",
        "citation_text": "Berger ML, Sox H, Willke RJ, et al. Good practices for real-world data studies of treatment and/or comparative effectiveness: Recommendations from the joint ISPOR-ISPE Special Task Force on real-world evidence in health care decision making. Pharmacoepidemiology and Drug Safety. 2017;26(9):1033-1039.",
        "year": 2017,
        "authors_short": "Berger et al.",
        "notes": "Joint ISPOR-ISPE good-practice recommendations that treat a priori feasibility assessment and transparent attrition reporting as conditions for credible, decision-grade RWE."
      }
    ],
    "plain_language_summary": "Before building a study, researchers first check whether a claims or health records database actually contains enough of the right patients to answer their question — this is called a feasibility assessment. If the database looks promising, they next apply every eligibility rule one by one, recording how many patients survive each step, until only the qualifying group remains; that shrinking list of counts is the attrition funnel. The funnel matters because it makes every decision about who gets into a study visible, so readers can judge whether the final group is representative and where patients were lost. Running this process early can also save months of work by flagging a doomed study before expensive data are purchased or programmed.",
    "key_terms": [
      {
        "term": "feasibility assessment",
        "definition": "An upfront check — done on summary counts before any detailed analysis — asking whether a database has enough patients, enough recorded drug use, and enough observable time to answer the study question."
      },
      {
        "term": "attrition funnel",
        "definition": "A step-by-step table that starts with the full database population and shows how many patients remain after each eligibility rule is applied, ending at the group that will actually be analyzed."
      },
      {
        "term": "eligibility criterion",
        "definition": "A rule that a patient must meet to be included in the study — for example, having a specific diagnosis, being a certain age, or having uninterrupted insurance coverage during the observation window."
      },
      {
        "term": "continuous enrollment",
        "definition": "A requirement that a patient have unbroken insurance coverage across a defined period so that missing claims truly mean an event did not happen, not that the insurer simply did not see it."
      },
      {
        "term": "new-user washout",
        "definition": "A lookback period — typically 6 to 12 months — during which a patient must have had no use of the study drug; this ensures the analysis captures people starting the drug for the first time, not those already on it."
      },
      {
        "term": "analyzable cohort",
        "definition": "The group of patients who survive every eligibility rule and whose outcomes can be validly measured; the number at the bottom of the attrition funnel."
      }
    ],
    "worked_example": {
      "scenario": "A research team wants to study new users of a type-2 diabetes drug in a commercial claims database covering 2018 to 2023. Before programming anything, they run a feasibility check and then apply their eligibility rules one at a time to build an attrition funnel. The goal is to know the final analyzable group size and to see exactly where patients are lost along the way.",
      "dataset": {
        "caption": "Attrition funnel: patients surviving each eligibility criterion in order, from the full database to the analyzable cohort.",
        "columns": [
          "step",
          "criterion_applied",
          "n_remaining"
        ],
        "rows": [
          [
            1,
            "Full database: anyone ever enrolled (2018-2023)",
            850000
          ],
          [
            2,
            "Has at least 1 pharmacy fill for the study diabetes drug",
            112000
          ],
          [
            3,
            "First fill (index date) falls on or after 2019-01-01, leaving at least 365 days of database history before it",
            84000
          ],
          [
            4,
            "Unbroken medical and pharmacy insurance coverage for all 365 days before the index date",
            51000
          ],
          [
            5,
            "No fill of the study drug or its drug class in that same 365-day lookback period (new-user washout)",
            41000
          ],
          [
            6,
            "No type-2 diabetes diagnosis recorded before age 18 and patient is 18 or older at index date",
            39500
          ]
        ]
      },
      "steps": [
        "Step 1 is the starting point: every person who ever had a record in the database during the study years — 850,000 people.",
        "Step 2 filters to the 112,000 people who had at least one pharmacy fill for the study drug; everyone else is irrelevant to this question.",
        "Step 3 keeps only the 84,000 whose first fill falls late enough in the database that a full year of history exists before it; the 28,000 dropped here started the drug too early for a valid baseline to be constructed.",
        "Step 4 requires unbroken insurance coverage across that entire prior year so that a missing claim genuinely means no event, not just a gap in observation; 33,000 people are dropped here, which is the single largest loss and signals a real data-completeness challenge worth investigating.",
        "Step 5 removes the 10,000 people who had a fill of this drug class within the prior year, keeping only true new starters.",
        "Step 6 removes the 1,500 people under 18 or with a childhood-onset diabetes code, leaving the adult type-2 population.",
        "The final analyzable cohort is 39,500 patients — about 4.6% of the full database and 35% of those who ever filled the drug."
      ],
      "result": "The analyzable cohort is 39,500 patients (steps: 850,000 → 112,000 → 84,000 → 51,000 → 41,000 → 39,500). The largest single loss — 33,000 patients at the continuous enrollment step — is the most important feasibility signal: it tells the team that a substantial portion of drug users cannot be validly studied in this database because their baseline period is not fully observable. Running this funnel before purchasing additional data years or writing analysis code means the team can make a go/no-go decision and justify their expected sample size to reviewers and regulators."
    },
    "prerequisites": [
      "picots-framework-rwe",
      "continuous-enrollment-observable-time-rwe",
      "fit-for-purpose-data-assessment-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Aggregate feasibility scan (go/no-go)",
        "description": "Pre-license or pre-protocol counting of source population size, exposure prevalence, expected event counts, and covariate/observable-time availability on aggregate cells to decide whether a database can support the estimand with adequate precision.",
        "edge_cases": [
          "Small-cell suppression and privacy thresholds can hide low yield; report suppressed cells as a range, not zero.",
          "Outcome-distribution peeking during feasibility can bias design choices (case definition, follow-up) toward a desired signal."
        ],
        "data_source_notes": "claims: count enrollees with the relevant benefit and the indication before estimating analyzable N; EHR: estimate in-network observable person-time and capture completeness by site/calendar time."
      },
      {
        "name": "Patient-level ordered attrition funnel",
        "description": "Stepwise inclusion/exclusion applied in protocol-defined causal order, reporting N (and person-time) retained and dropped at each step from source population to analyzable cohort.",
        "edge_cases": [
          "Step ordering changes both the denominator and which patients are excluded; the order must encode causal logic, not maximize N.",
          "Continuous-enrollment or follow-up criteria that reach past time zero introduce immortal time and survivor bias."
        ],
        "data_source_notes": "claims: continuous-enrollment / FFS-observable step usually dominates loss and exposes MA-only and completeness problems; EHR: add an explicit minimum-observability step."
      },
      {
        "name": "Multi-database / network feasibility comparison",
        "description": "Running the same ordered funnel across candidate databases (or distributed network sites) to compare analyzable N, event counts, and where each source collapses, to select a source or pool sites.",
        "edge_cases": [
          "Apparent differences may reflect coding/capture conventions rather than true population differences; harmonize code lists first.",
          "Site-level heterogeneity in the dominant attrition step signals non-exchangeable data, not just sample-size variation."
        ],
        "data_source_notes": "linked/network: linkage eligibility is itself a funnel step and a selected subset; report it per source."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Cohort construction with no documented feasibility step",
        "pros_of_this": "Kills infeasible studies before data licensing/programming; forces pre-specification of expected N, event count, and exposure prevalence that anchor power and satisfy regulators/HTA.",
        "cons_of_this": "Adds an upfront aggregate-counting phase and risks outcome peeking that can bias design.",
        "when_to_prefer": "Any consequential comparative, safety, utilization, cost, or regulatory/HTA-grade analysis."
      },
      {
        "compared_to": "Undocumented or post-hoc attrition narrative",
        "pros_of_this": "Reproducible, auditable, and localizes selection bias by showing where the cohort collapses.",
        "cons_of_this": "More programming and diagnostic counting; requires defending a specific step ordering.",
        "when_to_prefer": "Always for decision-grade RWE; STaRT-RWE, HARPER, RECORD, and FDA/EMA guidance treat the funnel as mandatory."
      },
      {
        "compared_to": "A single overall \"N excluded\" figure",
        "pros_of_this": "Stepwise losses localize the problem (capture gap vs rare disease vs code-list error) that one number hides.",
        "cons_of_this": "Requires committing to and defending a specific ordering of criteria.",
        "when_to_prefer": "Whenever the audience needs to audit population construction; collapse only for an abstract-level summary."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Apply criteria in protocol-defined causal order; the continuous-enrollment / FFS-observable step usually dominates loss. Require continuous medical + pharmacy enrollment across baseline and exclude Medicare Advantage-only person-time so absence of claims means \"did not occur,\" not \"not observed.\" Impose a claims-runout/maturity buffer near the data cut to avoid artificial censoring from adjudication lag. Preserve raw and derived dates and report N + person-time retained/dropped per step.",
      "ehr": "Add an explicit minimum-observability / in-network-activity step; \"no record\" conflates healthy, out-of-network, and non-visiting patients. Treat loss to follow-up as potentially informative, quantify capture by site and calendar time, and validate exposure/outcome against linked claims where available.",
      "registry": "Check enrollment criteria, completeness, and catchment; outcomes outside the registry's remit and full exposure history are often missing. Make linkage eligibility for follow-up an explicit funnel decrement (linkable subset is selected).",
      "linked": "Linkage is a selecting funnel step; reconcile order/fill/service date discrepancies before time-zero assignment. Account for differential competing risks (e.g., death in elderly claims cohorts) rather than treating death as ordinary censoring."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\n\nWASHOUT = pd.Timedelta(days=365)   # baseline / new-user lookback\nHF_CODES = {\"I50\", \"I110\", \"I130\"} # validated HF dx prefixes (illustrative)\n\ndef attrition_funnel(rx, enroll, dx, study_start=\"2016-07-01\"):\n    steps = []\n    def record(label, ids):\n        steps.append({\"step\": label, \"n\": len(ids)})\n        return ids\n\n    # (1) Source: anyone with a study-drug-class fill\n    study = rx[rx[\"drug_class\"] == \"STUDY\"].sort_values([\"person_id\", \"fill_date\"])\n    ids = record(\"1. Has >=1 study-drug fill\", set(study[\"person_id\"]))\n\n    # (2) First fill = index; on/after study_start with >=365d history available before it\n    idx = study.groupby(\"person_id\")[\"fill_date\"].first().rename(\"index_date\")\n    idx = idx[idx >= pd.Timestamp(study_start)]\n    ids = record(\"2. Index on/after study start\", set(idx.index) & ids)\n\n    # (3) Continuous, FFS-observable enrollment across full baseline through index (exclude MA-only)\n    e = enroll.merge(idx, left_on=\"person_id\", right_index=True)\n    covers = e[(~e[\"ma_only\"]) &\n               (e[\"enroll_start\"] <= e[\"index_date\"] - WASHOUT) &\n               (e[\"enroll_end\"]   >= e[\"index_date\"])]\n    ids = record(\"3. Continuous FFS-observable baseline (no MA-only)\", set(covers[\"person_id\"]) & ids)\n\n    # (4) New-user washout: no STUDY or COMPARATOR fill in the 365d before index\n    idxm = idx.reset_index()\n    prior = rx.merge(idxm, on=\"person_id\")\n    prior_ids = set(prior[(prior[\"drug_class\"].isin([\"STUDY\", \"COMPARATOR\"])) &\n                          (prior[\"fill_date\"] <  prior[\"index_date\"]) &\n                          (prior[\"fill_date\"] >= prior[\"index_date\"] - WASHOUT)][\"person_id\"])\n    ids = record(\"4. 
New user (drug-free washout)\", ids - prior_ids)\n\n    # (5) Outcome-free baseline: no validated HF (>=1 IP or >=2 OP HF dx >=30d apart) in lookback\n    # Match the full HF_CODES prefixes (undotted codes assumed); truncating to 3 characters would also catch I11.9 (no HF)\n    hf = dx[dx[\"dx_code\"].str.match(\"|\".join(sorted(HF_CODES)), na=False)].merge(idxm, on=\"person_id\")\n    hf = hf[(hf[\"dx_date\"] < hf[\"index_date\"]) & (hf[\"dx_date\"] >= hf[\"index_date\"] - WASHOUT)]\n    ip = set(hf[hf[\"care_setting\"] == \"IP\"][\"person_id\"])\n    op = hf[hf[\"care_setting\"] == \"OP\"].sort_values([\"person_id\", \"dx_date\"])\n    op_span = op.groupby(\"person_id\")[\"dx_date\"].agg([\"min\", \"max\", \"count\"])\n    op_valid = set(op_span[(op_span[\"count\"] >= 2) &\n                           ((op_span[\"max\"] - op_span[\"min\"]) >= pd.Timedelta(days=30))].index)\n    ids = record(\"5. Outcome-free baseline (no prevalent HF)\", ids - ip - op_valid)\n\n    # Funnel table: N retained per step and N dropped relative to the previous step\n    funnel = pd.DataFrame(steps)\n    funnel[\"dropped\"] = funnel[\"n\"].shift(1) - funnel[\"n\"]\n    # Analyzable cohort: one row per retained person (columns: person_id, index_date)\n    cohort = idx.loc[idx.index.isin(ids)].reset_index()\n    cohort[\"baseline_start\"] = cohort[\"index_date\"] - WASHOUT\n    return funnel, cohort",
        "description": "Build an ordered claims attrition funnel and the analyzable cohort. Required inputs (cleaned, de-duplicated):\n  rx     : person_id, fill_date (datetime), drug_class in {'STUDY','COMPARATOR'}, days_supply\n  enroll : person_id, enroll_start, enroll_end, ma_only (bool)   # ma_only spans lack FFS claims -> not observable\n  dx     : person_id, dx_date (datetime), dx_code, care_setting in {'IP','OP'}\nCriteria are applied IN CAUSAL ORDER; each step records N retained/dropped so the returned table IS the funnel.\nCovariates and outcomes must later be measured only relative to the index_date produced here.",
        "dependencies": [
          "pandas"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\nWASHOUT <- 365L\nhf_prefix <- c(\"I50\", \"I11\", \"I13\")\n\nattrition_funnel <- function(rx, enroll, dx, study_start = as.Date(\"2016-07-01\")) {\n  setDT(rx); setDT(enroll); setDT(dx)\n  steps <- list(); rec <- function(label, ids) { steps[[length(steps)+1L]] <<- list(step=label, n=length(ids)); ids }\n\n  study <- rx[drug_class == \"STUDY\"][order(person_id, fill_date)]\n  ids <- rec(\"1. Has >=1 study-drug fill\", unique(study$person_id))\n\n  idx <- study[, .(index_date = fill_date[1L]), by = person_id][index_date >= study_start]\n  ids <- rec(\"2. Index on/after study start\", intersect(idx$person_id, ids))\n\n  e <- merge(enroll, idx, by = \"person_id\")\n  covers <- e[!ma_only & enroll_start <= index_date - WASHOUT & enroll_end >= index_date, unique(person_id)]\n  ids <- rec(\"3. Continuous FFS-observable baseline (no MA-only)\", intersect(covers, ids))\n\n  pr <- merge(rx, idx, by = \"person_id\")\n  prior_ids <- unique(pr[drug_class %chin% c(\"STUDY\",\"COMPARATOR\") &\n                         fill_date < index_date & fill_date >= index_date - WASHOUT, person_id])\n  ids <- rec(\"4. New user (drug-free washout)\", setdiff(ids, prior_ids))\n\n  hf <- merge(dx[substr(dx_code,1,3) %chin% hf_prefix], idx, by = \"person_id\")\n  hf <- hf[dx_date < index_date & dx_date >= index_date - WASHOUT]\n  ip <- unique(hf[care_setting == \"IP\", person_id])\n  op <- hf[care_setting == \"OP\"][order(person_id, dx_date)]\n  opv <- op[, .(n = .N, span = as.integer(max(dx_date) - min(dx_date))), by = person_id][n >= 2 & span >= 30, person_id]\n  ids <- rec(\"5. Outcome-free baseline (no prevalent HF)\", setdiff(ids, union(ip, opv)))\n\n  funnel <- rbindlist(steps)\n  funnel[, dropped := shift(n) - n]\n  cohort <- idx[person_id %chin% ids][, baseline_start := index_date - WASHOUT]\n  list(funnel = funnel, cohort = cohort)\n}",
        "description": "Ordered claims attrition funnel with data.table; inputs mirror the Python version:\n  rx     : person_id, fill_date (Date), drug_class in {'STUDY','COMPARATOR'}, days_supply\n  enroll : person_id, enroll_start, enroll_end, ma_only (logical)\n  dx     : person_id, dx_date (Date), dx_code, care_setting in {'IP','OP'}\nEach criterion is applied in causal order and appends one row to the funnel table.",
        "dependencies": [
          "data.table"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "%let washout = 365;\n%let start   = '01JUL2016'd;\n\nproc sql; create table funnel (step char(60), n num); quit;\n%macro count(lbl, ds);\n  proc sql; insert into funnel\n    select \"&lbl\", count(distinct person_id) from &ds; quit;\n%mend;\n\n/* (1) Source: any study-drug fill */\nproc sql; create table s1 as select distinct person_id from work.rx where drug_class='STUDY'; quit;\n%count(1. Has >=1 study-drug fill, s1)\n\n/* (2) First study fill = index; on/after study start */\nproc sql;\n  create table idx as\n  select person_id, min(fill_date) as index_date format=date9.\n  from work.rx where drug_class='STUDY' group by person_id\n  having calculated index_date >= &start;\nquit;\n%count(2. Index on/after study start, idx)\n\n/* (3) Continuous FFS-observable baseline through index (exclude MA-only spans) */\nproc sql;\n  create table s3 as\n  select i.person_id, i.index_date from idx i\n  where exists (select 1 from work.enroll e\n                where e.person_id=i.person_id and e.ma_only=0\n                  and e.enroll_start <= i.index_date - &washout\n                  and e.enroll_end   >= i.index_date);\nquit;\n%count(3. Continuous FFS-observable baseline (no MA-only), s3)\n\n/* (4) New-user washout: no STUDY/COMPARATOR fill in the 365d before index */\nproc sql;\n  create table s4 as\n  select c.* from s3 c\n  where not exists (select 1 from work.rx p\n                    where p.person_id=c.person_id\n                      and p.drug_class in ('STUDY','COMPARATOR')\n                      and p.fill_date <  c.index_date\n                      and p.fill_date >= c.index_date - &washout);\nquit;\n%count(4. 
New user (drug-free washout), s4)\n\n/* (5) Outcome-free baseline: >=1 IP or >=2 OP HF dx >=30d apart in lookback */\nproc sql;\n  create table prev_hf as\n  select c.person_id from s4 c\n  where exists (select 1 from work.dx d where d.person_id=c.person_id\n                  and d.care_setting='IP' and substr(d.dx_code,1,3) in ('I50','I11','I13')\n                  and d.dx_date < c.index_date and d.dx_date >= c.index_date - &washout)\n     /* first-to-last OP span >=30d implies >=2 OP dx >=30d apart; missing (no OP dx) compares false */\n     or (select max(d.dx_date) - min(d.dx_date) from work.dx d where d.person_id=c.person_id\n                  and d.care_setting='OP' and substr(d.dx_code,1,3) in ('I50','I11','I13')\n                  and d.dx_date < c.index_date and d.dx_date >= c.index_date - &washout) >= 30;\n  create table cohort as\n  select c.person_id, c.index_date, c.index_date - &washout as baseline_start format=date9.\n  from s4 c where c.person_id not in (select person_id from prev_hf);\nquit;\n%count(5. Outcome-free baseline (no prevalent HF), cohort)\n\n/* Funnel with per-step drops */\ndata funnel; set funnel; dropped = lag(n) - n; run;",
        "description": "Ordered claims attrition funnel via PROC SQL, materializing one work table per step and counting survivors.\nRequired inputs (post data-management):\n  work.rx     : person_id, fill_date, drug_class ('STUDY'/'COMPARATOR'), days_supply\n  work.enroll : person_id, enroll_start, enroll_end, ma_only (0/1)\n  work.dx     : person_id, dx_date, dx_code, care_setting ('IP'/'OP')\nThe macro %count appends N at each step to work.funnel so the table reproduces the published flow diagram.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  S[\"(1) Source population<br/>>=1 study-drug fill<br/>N = 412,000\"]\n  S -->|\"-93,600: index before study start<br/>or <365d history\"| A[\"(2) Index on/after study start<br/>N = 318,400\"]\n  A -->|\"-147,200: not FFS-observable<br/>(mostly MA-only baseline)\"| B[\"(3) Continuous FFS-observable baseline<br/>N = 171,200\"]\n  B -->|\"-32,300: prevalent users\"| C[\"(4) New user, drug-free washout<br/>N = 138,900\"]\n  C -->|\"-14,400: prevalent HF\"| D[\"(5) Outcome-free baseline<br/>N = 124,500\"]\n  D -->|\"-2,800: age <18 or indication unconfirmed\"| E[\"(6) Analyzable cohort<br/>N = 121,700<br/>~4,100 expected HF events\"]",
        "caption": "Ordered claims attrition funnel. Each criterion is applied in protocol-defined causal order; the dominant loss at step 3 (continuous FFS-observable baseline, largely Medicare Advantage-only person-time) is the key feasibility signal, not a sign of a rare disease.",
        "alt_text": "Top-down funnel from a 412,000-patient source population through index timing, continuous FFS-observable enrollment, new-user washout, and outcome-free baseline to a 121,700-patient analyzable cohort, with the number dropped at each step.",
        "source_type": "illustrative",
        "source_citations": [
          "wang-2021"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Q[\"Candidate data source for the estimand\"] --> P{\"Population captured<br/>with the indication?\"}\n  P -->|No| STOP1[\"Infeasible: wrong population\"]\n  P -->|Yes| X{\"Exposure ascertainable<br/>NDC/fill or order/admin?\"}\n  X -->|No| STOP2[\"Infeasible: exposure not captured\"]\n  X -->|Yes| O{\"Outcome ascertainable<br/>validated algorithm exists?\"}\n  O -->|No| STOP3[\"Infeasible / needs linkage or validation\"]\n  O -->|Yes| T{\"Enough observable person-time<br/>and expected events for power?\"}\n  T -->|No| STOP4[\"Underpowered: consider linkage,<br/>pooling, or different source\"]\n  T -->|Yes| GO[\"Feasible: proceed to ordered<br/>attrition funnel and protocol\"]",
        "caption": "Fit-for-purpose go/no-go logic that precedes the patient-level funnel. Relevance (population, exposure, outcome) and reliability/precision (observable person-time and expected event count) are checked on aggregate counts before licensing data or writing the comparative analysis.",
        "alt_text": "Decision flowchart checking whether a data source captures the population, exposure, and outcome and provides enough person-time and events, routing to infeasible stop nodes or to proceeding with the attrition funnel.",
        "source_type": "illustrative",
        "source_citations": [
          "berger-2017"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "part_of",
        "target_slug": "regulatory-readiness-rwe",
        "notes": "A documented feasibility assessment and ordered attrition funnel are core components of a regulatory-ready RWE protocol and submission package."
      },
      {
        "relation_type": "used_with",
        "target_slug": "target-trial-emulation",
        "notes": "Feasibility and the attrition funnel operationalize the eligibility criteria of the emulated trial and make the loss from source population to analyzable cohort auditable."
      },
      {
        "relation_type": "requires",
        "target_slug": "continuous-enrollment-observable-time-rwe",
        "notes": "The continuous-enrollment / observable-person-time step is usually the dominant and most consequential decrement in a claims attrition funnel and the key feasibility signal."
      },
      {
        "relation_type": "see_also",
        "target_slug": "attrition-and-loss-to-follow-up-rwe",
        "notes": "The entry funnel (eligibility losses) is distinct from post-baseline attrition / loss to follow-up; both must be reported and both can induce selection bias."
      },
      {
        "relation_type": "used_with",
        "target_slug": "claims-outcome-algorithm-ppv-sensitivity-rwe",
        "notes": "The outcome-free baseline step and the expected-event-count feasibility estimate depend on the validated outcome algorithm and its operating characteristics."
      },
      {
        "relation_type": "see_also",
        "target_slug": "washout-clean-lookback-period-rwe",
        "notes": "The washout/lookback definition determines the new-user step of the funnel and the baseline window in which eligibility and covariates are observed."
      }
    ],
    "aliases": [
      "attrition funnel",
      "cohort attrition diagram",
      "participant flow diagram",
      "feasibility assessment",
      "fit-for-purpose assessment",
      "CONSORT-style RWE flow diagram"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "descriptive-epidemiology-rwe",
    "name": "Descriptive Epidemiology in RWE",
    "short_definition": "The estimation of disease and exposure frequency in real-world data — incidence rates per unit of observed person-time, cumulative incidence (risk), prevalence, and age/sex-standardized rates — built on explicitly constructed denominators rather than on the comparative contrasts of causal analysis.",
    "long_description": "**Descriptive epidemiology in real-world data** answers \"how much, in whom, and when\" rather than \"does\nexposure cause the outcome.\" Its deliverables are frequency measures: the **incidence rate** (new events\ndivided by accrued person-time among those at risk), **cumulative incidence / risk** (the proportion\ndeveloping the event over a fixed horizon, properly accounting for competing events and censoring),\n**prevalence** (point, period, or annual — the proportion alive and meeting a state definition at or during\na window), and **standardized rates** (age/sex-adjusted via direct standardization to an external standard\npopulation, or indirectly via the SMR/SIR when stratum-specific rates are unstable). It is the quantitative\nbackbone of disease burden estimates, drug-utilization studies, natural-history characterization, label\nexpansion support, and the \"Table 1 + base rates\" that every comparative RWE protocol depends on. This entry\nis both the working reference for these measures and the hub for the operational child concepts that handle\neach piece in depth (linked under Relations).\n\n**Core conceptual distinction**. Three frequency measures answer three different questions and are not\ninterchangeable. (1) The **incidence rate** has person-time in the denominator (events per 1,000\nperson-years); it is the right measure when follow-up is censored, enrollment is dynamic, or the at-risk\nperiod varies across people — the dominant situation in claims and EHR. (2) **Cumulative incidence (risk)**\nis a probability bounded in [0,1] over a stated horizon; it requires either complete follow-up or a survival\nestimator (Kaplan–Meier complement, or the Aalen–Johansen / cumulative-incidence function when competing\nrisks such as death are present). Naively dividing events by the closed cohort overstates risk whenever\nanyone is censored early. 
(3) **Prevalence** is a snapshot of existing state, governed by both incidence and\nduration (prevalence ≈ incidence × average duration in steady state when prevalence is low); it must never be\nreported as a rate or a risk. **Standardization** is orthogonal to all three: it removes the\nconfounding-by-composition that makes crude rates misleading when comparing populations with different\nage/sex structures. Direct\nstandardization applies study stratum-specific rates to a fixed standard population; indirect\nstandardization (SMR/SIR) applies standard rates to the study population's person-time and is preferred when\nstudy stratum counts are too sparse for stable direct estimates. The choice of measure, horizon,\ncompeting-event handling, and standard population must be pre-specified in the estimand, not chosen after\nseeing the data.\n\n**Pros, cons, and trade-offs**.\n- **Incidence rate vs cumulative incidence:** The rate is robust to censoring and dynamic enrollment and is\n  the natural output of Poisson/person-time models, but it assumes a roughly constant hazard over the\n  interval and is not a probability a clinician can interpret directly. Cumulative incidence is directly\n  interpretable and is what payers and patients want, but it demands correct handling of competing risks and\n  censoring: a 1-minus-Kaplan–Meier \"risk\" overestimates the event probability when death is common (e.g.,\n  elderly cohorts), where the **cumulative incidence function** is mandatory. **Prefer the rate** for\n  surveillance, utilization, and when follow-up is heavily censored; **prefer the CIF** for fixed-horizon\n  clinical risk communication.\n- **Crude vs standardized rates:** Crude rates describe the actual burden in the population as it exists\n  (correct for resource planning); standardized rates enable fair across-population or across-time\n  comparison but describe a hypothetical population and can hide important effect modification. 
**Report\n  both**, and never compare crude rates across populations with different age/sex mixes.\n- **Direct vs indirect standardization (SMR/SIR):** Direct standardization yields rates comparable across\n  any pair of populations sharing the standard, but is unstable when study strata have few events; indirect\n  standardization (SMR/SIR) is stable with sparse data but only validly compares each study population to the\n  standard, not study populations to each other. **Prefer indirect** when stratum-specific study rates are\n  sparse or unstable.\n- **Descriptive vs causal framing:** Descriptive measures require no exchangeability or positivity\n  assumptions and are far less fragile than causal contrasts — but precisely because they make no causal\n  claim, a between-group descriptive difference (e.g., higher event rate in treated patients) must not be\n  read as a treatment effect. Confusing the two is the single most common misuse.\n\n**When to use**. Disease burden and natural-history characterization; drug- and procedure-utilization trends;\nincidence/prevalence for sample-size and feasibility planning; background/expected rates for safety-signal\ncontextualization (observed-vs-expected); HTA burden-of-illness sections; and the descriptive base rates and\nattrition tables that anchor any comparative RWE study. 
Use the **rate** whenever person-time is dynamic;\nuse **standardization** whenever you compare populations or calendar periods that differ in composition.\n\n**When NOT to use — and when it is actively misleading or dangerous**.\n- **Do not present descriptive between-group differences as causal.** A crude rate ratio between exposed and\n  unexposed groups carries every confounder; reporting it without the causal caveats invites readers to\n  infer effects the design cannot support.\n- **Do not compute risk by dividing events by the baseline cohort when follow-up is censored.** The fixed\n  denominator keeps counting people after they are censored, so their later events go unobserved and the\n  naive proportion understates risk. Separately, in the presence of competing risks, the 1−KM complement\n  overstates the event probability. Use person-time rates or the CIF.\n- **Do not compare crude rates across populations with different age/sex structure** — composition, not\n  biology, can drive the entire difference. Standardize first.\n- **Do not report prevalence as incidence or risk.** A point-prevalence \"rate\" conflates frequency with\n  duration; chronic, low-mortality conditions look common on prevalence and rare on incidence.\n- **Do not estimate rates on person-time you cannot observe.** Counting events while undercounting the\n  denominator (or vice versa) is the dominant failure mode in claims (see below) and can move a rate by an\n  order of magnitude.\n\n**Data-source operational depth**.\n- **Claims (FFS):** Denominator = sum of continuously enrolled, observable person-time with the relevant\n  benefit (Parts A/B for medical events, Part D for drug events); numerator = first qualifying event coded in\n  the claim stream during that observed time. Failure modes: (1) **Medicare Advantage gaps** — MA-only\n  enrollment does not generate FFS encounter claims, so person-time accrued under MA must be excluded from\n  the denominator or events will be missed while time is counted, deflating the rate; restrict to\n  A/B(/D)-eligible FFS person-time. 
(2) **Differential competing risks by exposure in the elderly** — death\n  competes with the event of interest and is captured incompletely in claims (use a death-index linkage); if\n  one group dies faster, naive risk estimates diverge from the CIF. (3) **Prevalent vs incident counting** —\n  without a washout/clean lookback, a recurrent or chronic condition's first *observed* code is mistaken for\n  a first *ever* event, inflating incidence at enrollment (\"look-back ramp\"). (4) **Immortal time in\n  procedure/initiation studies** — defining the at-risk start after a procedure that itself requires survival\n  builds guaranteed event-free time into the denominator. Workarounds: require continuous enrollment +\n  washout before time zero, anchor the denominator to observable benefit windows, and link to a mortality\n  source to handle the competing risk.\n- **EHR:** Capture is visit-driven and within-system. Denominator-at-risk is ambiguous — a patient who stops\n  visiting is not necessarily event-free, just unobserved — so rates computed on \"all patients with a record\"\n  are biased by informal disenrollment. Define an explicit observation period (e.g., activity-anchored\n  windows) and treat care delivered outside the network as missing, not absent. 
Phenotype the numerator with\n  a validated algorithm; raw single-code numerators overcount.\n- **Registry:** Strong for adjudicated numerators (validated incident cases, cancer stage) but the\n  denominator (catchment person-time, eligibility) is often the weak link; link to claims or census person-time\n  for valid rates, and verify completeness/ascertainment before reporting incidence.\n- **Linked claims–EHR–vital-records:** The ideal substrate — EHR/registry for numerator validity, claims for\n  complete observable person-time, vital records for the competing-risk denominator — but the rate must be\n  computed on the *linkable* subset, which is a selected population; report the linkage proportion and assess\n  selection before generalizing.\n\n**Worked example (claims).** Question: the 2022 annual **incidence rate** of hospitalized acute myocardial\ninfarction (AMI) among adults age ≥65 in U.S. Medicare fee-for-service. (1) Denominator population: enrollees\nwith continuous Parts A+B FFS enrollment (no MA-only months) at any point in 2022. (2) Person-time: for each\nperson, accrue observed days from the later of 2022-01-01 or FFS-enrollment start to the earliest of first\nAMI, death, switch to MA, disenrollment, or 2022-12-31; sum observed person-days and divide by 365.25 to get\nperson-years. (3) Numerator (incident, not prevalent): first inpatient claim with ICD-10-CM I21.x in the\nprimary position **and** a 12-month clean lookback with continuous enrollment and no prior I21.x/I22.x, so a\nreinfarction or a chronic-history code is not counted as incident. (4) Rate = events ÷ person-years × 1,000,\nwith an exact-Poisson 95% CI. (5) Because crude rates are not comparable across regions or years with\ndifferent age/sex mixes, also compute the **age-/sex-standardized** rate by direct standardization to the 2000\nU.S. 
standard million: estimate stratum-specific rates within 5-year age × sex cells, weight by the standard\npopulation proportions, and sum — yielding a single adjusted rate per 1,000 person-years whose CI uses a\nnormal-approximation (Wald) interval on the Dobson stratum-variance estimator. Report crude and standardized rates side by side, the person-time accounting, and the\nattrition funnel (eligible → continuously enrolled → clean-lookback → contributing person-time).\n\n**Interpreting the output**\n\nA Medicare FFS descriptive study of hospitalized AMI reports age-group-specific incidence rates of\n2.00 per 1,000 person-years (age 65–74), 4.50 per 1,000 person-years (age 75–84), and 5.20 per 1,000\nperson-years (age 85+), alongside a directly age-/sex-standardized rate for cross-year or cross-region\ncomparison.\n\n*(1) Formal interpretation.* Each rate is the count of incident AMI hospitalizations per 1,000 person-\nyears of FFS-enrolled observation time in that age stratum, with a clean 12-month lookback to exclude\nprevalent AMI. The denominator accrues person-days from FFS enrollment start (or January 1) to the\nfirst of: AMI event, death, MA switch, disenrollment, or December 31, then converts to person-years\n(days / 365.25). 
Rates are not risks: they are not bounded at 1, they are expressed per 1,000 person-\nyears of observed time, and they should not be compared to study-period cumulative incidence figures.\nThe standardized rate removes the confounding effect of age composition when comparing populations\nwith different age structures; the crude rates are required alongside it to show the actual distribution\nof events.\n\n*(2) Practical interpretation.* The 2.6-fold gradient from the 65–74 stratum (2.00) to the 85+ stratum\n(5.20) quantifies the age dependency of AMI incidence in this population and provides the baseline rate\ninputs needed to size a prevention trial, project budget impact, or calculate an externally transported\nNNT anchored to a local baseline. These are descriptive measures only — the rates do not estimate a\ncausal effect of age on AMI, and no confounding adjustment is implied.",
    "primary_category": "Descriptive_Epidemiology",
    "tags": [
      "descriptive-epidemiology",
      "incidence-rate",
      "cumulative-incidence",
      "prevalence",
      "standardization",
      "person-time",
      "disease-burden",
      "pharmacoepidemiology"
    ],
    "applies_to_study_types": [
      "claims_analysis",
      "ehr_study",
      "registry_linkage",
      "drug_utilization"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1371/journal.pmed.0040296",
        "url": "https://doi.org/10.1371/journal.pmed.0040296",
        "citation_text": "von Elm E, Altman DG, Egger M, Pocock SJ, Gøtzsche PC, Vandenbroucke JP. The Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) statement. PLoS Medicine. 2007;4(10):e296.",
        "year": 2007,
        "authors_short": "von Elm et al.",
        "notes": "Reporting backbone for descriptive observational studies, including how frequency measures, denominators, and participant flow must be specified and reported."
      },
      {
        "role": "explain",
        "doi": "10.1371/journal.pmed.1001885",
        "url": "https://doi.org/10.1371/journal.pmed.1001885",
        "citation_text": "Benchimol EI, Smeeth L, Guttmann A, et al. The REporting of studies Conducted using Observational Routinely-collected health Data (RECORD) statement. PLoS Medicine. 2015;12(10):e1001885.",
        "year": 2015,
        "authors_short": "Benchimol et al.",
        "notes": "Extends STROBE to claims/EHR, codifying how denominators, observable person-time, and code-based case definitions must be operationalized and disclosed in routinely collected data."
      },
      {
        "role": "explain",
        "doi": "10.1136/bmj.k3532",
        "url": "https://doi.org/10.1136/bmj.k3532",
        "citation_text": "Langan SM, Schmidt SA, Wing K, et al. The reporting of studies conducted using observational routinely collected health data statement for pharmacoepidemiology (RECORD-PE). BMJ. 2018;363:k3532.",
        "year": 2018,
        "authors_short": "Langan et al.",
        "notes": "Pharmacoepidemiology-specific guidance on exposure/outcome definitions and person-time that underpins drug-utilization and incidence reporting in claims."
      },
      {
        "role": "demonstrate",
        "doi": "10.1093/aje/kwh090",
        "url": "https://doi.org/10.1093/aje/kwh090",
        "citation_text": "Zou G. A modified Poisson regression approach to prospective studies with binary data. American Journal of Epidemiology. 2004;159(7):702-706.",
        "year": 2004,
        "authors_short": "Zou",
        "notes": "Modified Poisson with robust (sandwich) variance for directly estimating risk ratios of binary outcomes."
      },
      {
        "role": "use",
        "doi": "10.1002/pds.5083",
        "url": "https://doi.org/10.1002/pds.5083",
        "citation_text": "Suissa S, Dell'Aniello S. Time-related biases in pharmacoepidemiology. Pharmacoepidemiology and Drug Safety. 2020;29(9):1101-1110.",
        "year": 2020,
        "authors_short": "Suissa & Dell'Aniello",
        "notes": "Worked demonstration of how immortal time and person-time misclassification distort descriptive rates in real-world databases — the operational failure modes to design against."
      }
    ],
    "plain_language_summary": "Descriptive epidemiology answers three questions about a disease or condition in a population: who gets it, where it occurs, and when it happens — without asking whether any particular cause is responsible. In real-world data studies, analysts count events (such as hospitalizations or diagnoses) and divide by the total time the population was under observation to get a rate, or snapshot a moment in time to measure how common a condition currently is. A descriptive study never tests a hypothesis; it maps the terrain so future causal research knows where to look.",
    "key_terms": [
      {
        "term": "incidence rate",
        "definition": "The number of new events (such as diagnoses or hospitalizations) divided by the total time the at-risk population was observed, usually expressed as events per 1,000 person-years."
      },
      {
        "term": "person-years",
        "definition": "A unit that combines the number of people observed with how long each was followed — one person watched for two years contributes two person-years, the same as two people each watched for one year."
      },
      {
        "term": "prevalence",
        "definition": "The proportion of a population that has a condition at a specific point in time or during a defined period, reflecting how common the condition currently is rather than how fast it is arising."
      },
      {
        "term": "age-standardized rate",
        "definition": "A rate that has been mathematically adjusted so that differences in the age makeup of two populations cannot explain away a difference in rates, making comparisons between groups fair."
      },
      {
        "term": "denominator",
        "definition": "The total observed time or number of people at risk that sits beneath the event count in a rate calculation; getting this right is the most common challenge in real-world data."
      }
    ],
    "worked_example": {
      "scenario": "A health economist wants to describe the burden of acute heart attacks (acute myocardial infarction, AMI) among US Medicare fee-for-service enrollees in 2022, broken down by age group. She is not comparing a drug or intervention — she just wants to know how many heart attacks occur in each age band per year of follow-up. She pulls three summary rows from a claims analysis: the number of first-ever AMI hospitalizations and the total person-years of observation in each group.",
      "dataset": {
        "caption": "Summary of incident AMI hospitalizations and observed follow-up by age group, US Medicare fee-for-service 2022.",
        "columns": [
          "age_group",
          "incident_ami_events",
          "observed_person_years",
          "rate_per_1000_py"
        ],
        "rows": [
          [
            "65-74",
            8200,
            4100000,
            2.0
          ],
          [
            "75-84",
            14175,
            3150000,
            4.5
          ],
          [
            "85+",
            9100,
            1750000,
            5.2
          ]
        ]
      },
      "steps": [
        "For the 65-74 group: divide 8,200 events by 4,100,000 person-years, then multiply by 1,000 to express per 1,000 person-years: 8,200 / 4,100,000 x 1,000 = 2.00.",
        "For the 75-84 group: 14,175 / 3,150,000 x 1,000 = 4.50 per 1,000 person-years.",
        "For the 85+ group: 9,100 / 1,750,000 x 1,000 = 5.20 per 1,000 person-years.",
        "Read the pattern across rows: the rate climbs steadily with age, from 2.00 in the youngest group to 5.20 in the oldest — more than a 2.5-fold difference.",
        "Note what this analysis does NOT do: it does not compare a treated group to an untreated group, does not adjust for confounders, and makes no causal claim. It describes who (older adults) bears more burden, and by how much."
      ],
      "result": "Rates per 1,000 person-years: 65-74 = 2.00, 75-84 = 4.50, 85+ = 5.20. AMI incidence more than doubles from the youngest to the oldest age band. This is a descriptive finding — age is associated with higher rates, but this table alone cannot say why."
    },
    "prerequisites": [],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Incidence rate (events per person-time)",
        "description": "New events divided by accrued at-risk person-time; the default frequency measure when follow-up is censored or enrollment is dynamic. Requires a clean lookback so first observed = first incident.",
        "edge_cases": [
          "Recurrent/chronic conditions counted as incident at enrollment without a washout (look-back ramp).",
          "Constant-hazard assumption violated over long intervals; consider piecewise or shorter windows."
        ],
        "data_source_notes": "claims: denominator = continuously enrolled FFS person-time with the relevant benefit; exclude MA-only months. ehr: anchor person-time to an explicit, activity-defined observation period."
      },
      {
        "name": "Cumulative incidence / risk with competing risks",
        "description": "Probability of the event over a fixed horizon estimated with the cumulative incidence function (Aalen-Johansen) rather than 1-minus-Kaplan-Meier when death or other competing events are present.",
        "edge_cases": [
          "1-KM complement overstates risk when competing mortality is non-trivial (e.g., elderly cohorts).",
          "Differential censoring or competing risk by subgroup distorts naive proportions."
        ],
        "data_source_notes": "claims: link a death index so the competing risk is captured; otherwise risk is biased."
      },
      {
        "name": "Prevalence (point, period, annual)",
        "description": "Proportion of the denominator population meeting a state definition at a point, during a period, or over a calendar year; governed by incidence and duration, not a rate or a risk.",
        "edge_cases": [
          "Reporting prevalence as a rate or risk; conflating frequency with disease duration.",
          "Denominator must be the at-risk/observed population on the prevalence date, not all-ever-enrolled."
        ],
        "data_source_notes": "claims: require enrollment on (or across) the prevalence window so state is observable."
      },
      {
        "name": "Direct standardization",
        "description": "Apply study stratum-specific rates to a fixed external standard population (e.g., 2000 U.S. standard) to produce an age/sex-adjusted rate comparable across populations.",
        "edge_cases": [
          "Unstable when study strata have few events; CIs widen and adjusted rate becomes unreliable.",
          "Hides effect modification across strata; report stratum-specific rates alongside."
        ],
        "data_source_notes": "all sources: choose and report the standard population; the Python/R code reports a normal-approximation (Wald) CI on the Dobson variance, while PROC STDRATE offers gamma CIs."
      },
      {
        "name": "Indirect standardization (SMR / SIR)",
        "description": "Apply standard stratum rates to the study population's person-time to get expected events; SMR/SIR = observed / expected. Preferred when study stratum-specific rates are sparse.",
        "edge_cases": [
          "SMRs from different study populations are not directly comparable to each other, only to the standard.",
          "Requires an appropriate, contemporaneous reference rate set."
        ],
        "data_source_notes": "registry/claims: expected events computed on observed person-time by stratum."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Cumulative incidence / risk",
        "pros_of_this": "Person-time rates are robust to censoring and dynamic enrollment and are the natural output of Poisson models; valid when follow-up varies across people.",
        "cons_of_this": "Assume an approximately constant hazard over the interval and are not directly interpretable as a probability for clinicians or patients.",
        "when_to_prefer": "Surveillance, utilization, and any setting with heavy censoring or dynamic enrollment."
      },
      {
        "compared_to": "Crude (unadjusted) rates",
        "pros_of_this": "Standardized rates enable fair comparison across populations or calendar periods that differ in age/sex composition.",
        "cons_of_this": "Describe a hypothetical standard population, can mask effect modification, and become unstable with sparse strata (direct) or non-comparable across study populations (indirect).",
        "when_to_prefer": "Whenever comparing burden across populations or over time; report crude alongside."
      },
      {
        "compared_to": "Causal comparative contrasts",
        "pros_of_this": "Descriptive frequency requires no exchangeability/positivity assumptions and is far less fragile than confounding-prone causal estimates.",
        "cons_of_this": "Makes no causal claim; between-group descriptive differences carry all confounding and must not be read as treatment effects.",
        "when_to_prefer": "Burden, natural history, utilization, feasibility, and base-rate contextualization."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Denominator = sum of continuously enrolled, observable FFS person-time with the relevant benefit (A/B for medical, D for drug); exclude MA-only months where encounter claims are unavailable. Numerator = first qualifying coded event during observed time, with a clean lookback so first observed = first incident. Link a mortality source to handle the competing risk for cumulative incidence.",
      "ehr": "Capture is visit-driven and within-system. Define an explicit, activity-anchored observation period; a non-visiting patient is unobserved, not event-free. Phenotype numerators with validated algorithms rather than single raw codes.",
      "registry": "Strong for adjudicated numerators but weak for catchment person-time; link to claims/census for a valid denominator and verify case ascertainment/completeness before reporting incidence.",
      "linked": "Ideal substrate (numerator validity + complete person-time + competing-risk mortality), but rates are computed on the selected linkable subset; report linkage proportion and assess selection before generalizing."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\nimport numpy as np\nfrom scipy.stats import chi2\n\nSCALE = 1000.0  # report per 1,000 person-years\n\ndef standardized_rate(person_time: pd.DataFrame, events: pd.DataFrame,\n                      standard: pd.DataFrame) -> dict:\n    keys = [\"age_band\", \"sex\"]\n    pt = person_time.groupby(keys, as_index=False)[\"py\"].sum()\n    ev = events.groupby(keys, as_index=False).size().rename(columns={\"size\": \"events\"})\n    strata = pt.merge(ev, on=keys, how=\"left\").fillna({\"events\": 0})\n    strata[\"rate\"] = strata[\"events\"] / strata[\"py\"]            # stratum-specific rate\n\n    # Crude rate with exact-Poisson 95% CI on the total event count.\n    tot_ev, tot_py = strata[\"events\"].sum(), strata[\"py\"].sum()\n    crude = tot_ev / tot_py\n    lo = chi2.ppf(0.025, 2 * tot_ev) / 2 / tot_py if tot_ev > 0 else 0.0\n    hi = chi2.ppf(0.975, 2 * (tot_ev + 1)) / 2 / tot_py\n\n    # Direct standardization: weight stratum rates by the external standard population.\n    s = strata.merge(standard, on=keys, how=\"inner\")\n    std_rate = float(np.sum(s[\"std_weight\"] * s[\"rate\"]))\n    # Variance of the directly standardized rate (Dobson-style on stratum events).\n    var = float(np.sum((s[\"std_weight\"] ** 2) * s[\"events\"] / s[\"py\"] ** 2))\n    se = np.sqrt(var)\n    return {\n        \"crude_per_1000\": crude * SCALE,\n        \"crude_ci\": (lo * SCALE, hi * SCALE),\n        \"standardized_per_1000\": std_rate * SCALE,\n        \"standardized_ci\": ((std_rate - 1.96 * se) * SCALE,\n                            (std_rate + 1.96 * se) * SCALE),\n    }",
        "description": "Crude and age/sex-standardized incidence rate from claims-style inputs. Required tables (already cleaned):\n  person_time : person_id, age_band, sex, py (observed person-YEARS at risk after enrollment/washout rules)\n  events      : person_id, age_band, sex (one row per INCIDENT event during the observed at-risk window)\n  standard    : age_band, sex, std_weight (proportions summing to 1; e.g., 2000 U.S. standard million)\nPerson-time and events must already exclude MA-only months and apply the clean-lookback so first-observed\nequals first-incident. Returns crude and directly-standardized rates per 1,000 person-years.",
        "dependencies": [
          "pandas",
          "numpy",
          "scipy"
        ],
        "source_citations": [
          "benchimol-2015"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\nSCALE <- 1000\n\nstandardized_rate <- function(person_time, events, standard) {\n  setDT(person_time); setDT(events); setDT(standard)\n  pt <- person_time[, .(py = sum(py)), by = .(age_band, sex)]\n  ev <- events[, .(events = .N), by = .(age_band, sex)]\n  strata <- merge(pt, ev, by = c(\"age_band\", \"sex\"), all.x = TRUE)\n  strata[is.na(events), events := 0]\n  strata[, rate := events / py]                         # stratum-specific rate\n\n  tot_ev <- sum(strata$events); tot_py <- sum(strata$py)\n  crude <- tot_ev / tot_py\n  # Exact-Poisson 95% CI on the total count.\n  lo <- if (tot_ev > 0) qchisq(0.025, 2 * tot_ev) / 2 / tot_py else 0\n  hi <- qchisq(0.975, 2 * (tot_ev + 1)) / 2 / tot_py\n\n  s <- merge(strata, standard, by = c(\"age_band\", \"sex\"))\n  std_rate <- sum(s$std_weight * s$rate)                # direct standardization\n  v <- sum(s$std_weight^2 * s$events / s$py^2)          # variance (Dobson-style)\n  se <- sqrt(v)\n  list(crude_per_1000 = crude * SCALE,\n       crude_ci = c(lo, hi) * SCALE,\n       standardized_per_1000 = std_rate * SCALE,\n       standardized_ci = c(std_rate - 1.96 * se, std_rate + 1.96 * se) * SCALE)\n}",
        "description": "Crude and directly-standardized incidence rate with data.table. Inputs mirror the Python version:\n  person_time : person_id, age_band, sex, py (observed person-years after enrollment/washout rules)\n  events      : person_id, age_band, sex (one row per incident event in the observed window)\n  standard    : age_band, sex, std_weight (proportions summing to 1)\nReturns crude and standardized rates per 1,000 person-years with confidence intervals.",
        "dependencies": [
          "data.table"
        ],
        "source_citations": [
          "benchimol-2015"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* Directly standardized incidence rate per 1,000 person-years to an external standard population. */\nproc stdrate data=work.events_pt\n             refdata=work.standard\n             method=direct\n             stat=rate(mult=1000)\n             plots=none;\n  population event=events pyears=py;          /* study stratum events and person-time   */\n  reference  total=std_pop;                   /* standard population weights by stratum  */\n  strata age_band sex / stats;                /* age x sex strata + stratum-specific rates */\nrun;\n\n/* Indirect standardization (SMR): observed vs expected using standard stratum rates. */\nproc stdrate data=work.events_pt\n             refdata=work.standard\n             method=indirect\n             stat=rate(mult=1000)\n             plots=none;\n  population event=events pyears=py;\n  reference  event=std_events pyears=std_pop;  /* standard events + person-time -> reference rates */\n  strata age_band sex;\nrun;\n\n/* Person-time rate model: log-linear Poisson with an offset = log(person-years). */\ndata work.model; set work.events_pt; log_py = log(py); run;\nproc genmod data=work.model;\n  class age_band sex / param=ref;\n  model events = age_band sex / dist=poisson link=log offset=log_py;\n  estimate 'rate per 1000 PY (ref stratum)' intercept 1 / exp;  /* exp(beta) = rate     */\nrun;",
        "description": "Direct and indirect (SMR) age/sex standardization with PROC STDRATE, plus a person-time Poisson rate model.\nRequired input datasets (post data-management):\n  work.events_pt : age_band, sex, events (incident events), py (observed person-years at risk)\n                   built from continuously enrolled FFS person-time with a clean lookback; MA-only excluded.\n  work.standard  : age_band, sex, std_pop (counts or person-time of the external standard, e.g., 2000 US),\n                   std_events (reference event counts) so the indirect/SMR block can derive reference rates;\n                   used as the reference population (direct) or reference rate source (indirect/SMR).\nPROC STDRATE is the canonical descriptive-epi procedure; it returns crude and standardized rates with CIs.",
        "dependencies": [],
        "source_citations": [
          "benchimol-2015"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Q[Descriptive frequency question] --> K{What is being counted?}\n  K -->|New events over varying follow-up| RATE[Incidence RATE<br/>events / person-time]\n  K -->|Probability over a fixed horizon| RISK{Competing risks present?}\n  K -->|Existing state at a time/window| PREV[PREVALENCE<br/>point / period / annual]\n  RISK -->|Yes e.g. death| CIF[Cumulative incidence function<br/>Aalen-Johansen]\n  RISK -->|No| KM[1 - Kaplan-Meier]\n  RATE --> CMP{Compare across populations or time?}\n  CMP -->|Yes| STD{Study strata sparse?}\n  CMP -->|No| CRUDE[Report crude rate + exact-Poisson CI]\n  STD -->|No| DIR[Direct standardization to a standard population]\n  STD -->|Yes| IND[Indirect standardization SMR / SIR]",
        "caption": "Choosing the descriptive frequency measure (rate vs risk vs prevalence) and, for rates, whether and how to standardize when comparing populations.",
        "alt_text": "Decision tree mapping the descriptive question to incidence rate, cumulative incidence function or 1-minus-Kaplan-Meier, or prevalence, and then to crude, direct, or indirect standardization.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  Enr[Continuous FFS enrollment<br/>relevant benefit, no MA-only months] --> Wash[Clean lookback / washout<br/>so first observed = first incident]\n  Wash --> PT[Accrue observable person-time<br/>start -> first event / death / disenroll / data end]\n  PT --> Num[Numerator: first qualifying coded<br/>incident event in observed time]\n  Num --> Rate[Crude rate = events / person-years]\n  Rate --> Adj[Standardize by age x sex<br/>to external standard population]\n  Adj --> Out[Report crude + standardized rates,<br/>person-time accounting, attrition funnel]",
        "caption": "Claims data flow from observable enrollment through person-time and incident-event numerator to crude and standardized incidence rates.",
        "alt_text": "Left-to-right data flow from continuous enrollment and washout to person-time accrual, incident numerator, crude rate, standardization, and reporting.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "see_also",
        "target_slug": "person-time-denominator-construction-rwe",
        "notes": "Constructs the observable at-risk person-time that forms the denominator of every incidence rate."
      },
      {
        "relation_type": "see_also",
        "target_slug": "incidence-rate-calculation-rwe",
        "notes": "Operational detail for computing event-per-person-time rates and their confidence intervals."
      },
      {
        "relation_type": "see_also",
        "target_slug": "cumulative-incidence-risk-rwe",
        "notes": "Operational detail for risk over a fixed horizon, including the competing-risk cumulative incidence function."
      },
      {
        "relation_type": "see_also",
        "target_slug": "prevalence-point-period-annual-rwe",
        "notes": "Operational detail for point, period, and annual prevalence, governed by incidence and duration."
      },
      {
        "relation_type": "see_also",
        "target_slug": "direct-standardization-rwe",
        "notes": "Operational detail for age/sex-adjusted rates via direct standardization to a standard population."
      },
      {
        "relation_type": "see_also",
        "target_slug": "indirect-standardization-smr-sir-rwe",
        "notes": "Operational detail for SMR/SIR when study stratum-specific rates are too sparse for direct standardization."
      },
      {
        "relation_type": "used_with",
        "target_slug": "continuous-enrollment-observable-time-rwe",
        "notes": "Continuous-enrollment rules define the observable person-time denominator for valid rates and prevalence."
      },
      {
        "relation_type": "used_with",
        "target_slug": "competing-risks-cause-specific-fine-gray-rwe",
        "notes": "Competing-risk methods supply the cumulative incidence function needed for unbiased fixed-horizon risk."
      },
      {
        "relation_type": "see_also",
        "target_slug": "medicare-ffs-ma-commercial-claims-differences-rwe",
        "notes": "Explains why MA-only person-time must be excluded from claims denominators to avoid deflated rates."
      }
    ],
    "aliases": [
      "descriptive RWE",
      "epidemiologic rates",
      "population denominators",
      "incidence and prevalence estimation"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "descriptive-statistics",
    "name": "Descriptive Statistics",
    "short_definition": "The set of numerical summaries — measures of location (mean, median, mode), spread (standard deviation, IQR, range), and shape (skewness, kurtosis) — used to characterize a variable or a group of patients before any comparative or causal analysis is attempted. It is the arithmetic foundation that every Table 1, cost analysis, and epidemiologic rate computation rests upon.",
    "long_description": "**Descriptive statistics** answer a single question: \"what does this variable look like?\" They make\nno causal claim, test no hypothesis, and require no comparison group. Every downstream analytic step\n— from a Table 1 to a regression to an economic model — rests on a prior choice about how to summarize\nthe raw distribution, and wrong choices introduce errors that compound through the whole analysis. This\nentry covers the full toolkit: measures of location, spread, and shape; categorical variable summaries;\nthe Table 1 conventions used in real-world evidence (RWE) and health economics and outcomes research\n(HEOR); and the most consequential trade-offs, especially the mean-vs-median decision for right-skewed\nhealthcare cost and utilization data.\n\n**Core conceptual distinctions**\n\nThree questions about any variable answer different questions and require different summaries.\n\n*Where is it centered?* The **arithmetic mean** (sum divided by count) is sensitive to every value,\nincluding extreme ones. Pull one hospitalization cost to \\$500,000 and the mean jumps even if 99 patients\nhad modest bills. The **median** (middle value after sorting, or the average of the two middle values\nwhen n is even) is resistant: it reports what a typical patient experienced. The **mode** (most frequent\nvalue) is rarely used for continuous data but matters for discrete counts and categorical variables.\n\n*How spread out is it?* The **standard deviation (SD)** is the square root of the average squared\ndistance from the mean. Because it uses the mean as its anchor, it is equally sensitive to outliers. For\na normal distribution, mean ± 1 SD contains approximately 68% of values, mean ± 2 SD contains\napproximately 95%, and mean ± 3 SD contains approximately 99.7% — the **68-95-99.7 rule**. 
This heuristic\nfails completely on skewed data: a cost distribution with mean \\$12,000 and SD \\$28,000 implies a lower\nbound of mean − 1 SD = −\\$16,000, which is meaningless. When data are skewed, report the\n**interquartile range (IQR)**: the spread from the 25th percentile (Q1) to the 75th percentile (Q3).\nThe IQR describes the middle 50% of the distribution regardless of how extreme the tails are. The\n**range** (minimum to maximum) is sensitive to single extreme values and is usually relegated to a\nfootnote. The **coefficient of variation (CV)** = SD / mean expresses spread as a fraction of the mean\n— useful when comparing variability across variables measured on different scales.\n\n*What shape does it have?* **Skewness** describes asymmetry: a right-skewed (positive-skew) distribution\nhas a long right tail (mean > median), which is the default for healthcare costs, length of stay, and\nutilization counts. A left-skewed distribution has a long left tail. **Kurtosis** describes tail weight:\nhigh kurtosis (\"leptokurtic\") means more extreme values than a normal distribution would predict.\nConcretely, look at a histogram first and a boxplot second — these visual checks tell you whether the\nnormal approximation applies before you decide which summary to use.\n\n**The skewed-cost-data rule and the HEOR nuance**\n\nThe single most important application of this principle in RWE and HEOR is the following: healthcare\ncosts, length of stay, and service-utilization counts are almost always **right-skewed** because a small\nfraction of patients generate catastrophically large bills while the majority have modest expenditures.\nFor describing a typical patient's experience, report **median and IQR**. 
The mean inflates the central\ntendency estimate because a handful of \\$200,000 admissions pull it far above what most patients incur.\n\nHowever — and this is a genuinely important nuance for payers and budget impact — **the payer pays the\nmean, not the median.** If 1,000 patients are enrolled, the plan's total expenditure is 1,000 × mean,\nnot 1,000 × median. For budget impact analyses, cost-effectiveness models, and any setting that requires\ntotal cost estimation, the mean is the appropriate summary because the mean × count = total. For these\nanalyses, **report both the mean and median**: the median characterizes the typical patient, the mean\ndrives the budget math. When the two diverge substantially, the distribution is skewed enough that the\nchoice of summary matters and should be explained in the methods.\n\n**Categorical variables**\n\nFor binary or multi-category variables, the right summaries are **counts (n) and proportions (%)**.\nThe single most common error is ambiguous denominators: a \"30% rate of prior hospitalization\" is\nmeaningless without knowing 30% of what — the full cohort? the treated arm? patients with at least 12\nmonths of lookback? Report the denominator explicitly. **Missingness is its own category**: if 15% of\npatients lack a recorded HbA1c, that 15% should appear as \"missing: n (15%)\" in the table, not be\nsilently excluded from the denominator. 
Dropping patients with missing values from the denominator hides\nthe incompleteness and inflates the apparent proportions among those who remain.\n\n**Table 1 conventions in RWE/HEOR**\n\nThe standard reporting convention for a \"Table 1\" (baseline characteristics) is:\n- Approximately normally distributed continuous variables: **mean (SD)**\n- Skewed continuous variables (costs, LOS, utilization counts): **median [IQR]** — many journals use\n  square brackets for IQR to distinguish from the round-bracket mean (SD) convention\n- Categorical variables: **n (%)**\n- For cost variables specifically: consider reporting both mean (SD) and median [IQR] in the same row\n\n**Never report standard error of the mean (SEM) in Table 1.** The SEM (SD / √n) describes the precision of the\nsample mean as an estimate of the population mean, so it shrinks as sample size grows. The SD describes the spread of the\nactual patient population and does not depend on sample size. Table 1 is supposed to show what the\npatients look like; the SD does that. A 50,000-patient claims cohort with SD = \\$28,000 has SEM = \\$28,000 / √50,000 ≈ \\$125;\nit should report the SD, because an SEM of \\$125 falsely implies all patients had near-identical costs. This\nconfusion between SEM and SD is one of the most common errors in reporting of clinical and HEOR data.\n\n**Descriptive ≠ inferential: no p-values needed, and why that matters**\n\nDescriptive statistics describe. No p-value is needed to state that the mean age is 62 years or that\n35% of patients had a prior hospitalization. The long-running debate about whether to include p-values\nin Table 1 of a randomized trial (testing whether randomization succeeded) does not apply to RWE: in a\nnon-randomized study, a significant p-value on a baseline covariate is expected (treatment was chosen,\nnot assigned) and uninformative. Whether to worry about a 3-year age difference or a 0.5-year difference\ndepends on whether age is a confounder for the outcome of interest — a clinical judgment, not a\nsignificance test. 
In non-randomized RWE, **standardized mean differences (SMDs)** replace p-values for\nassessing whether groups are comparable (see baseline-characteristics-and-covariate-balance-rwe).\n\n**Pros, cons, and trade-offs — specific and comparative**\n\n- **Mean vs median:** The mean is the natural input to budget math (mean × n = total), and means are\n  required for most parametric statistical tests. The median is robust to outliers and is the appropriate\n  summary of a typical patient's experience when the distribution is skewed. *Use the mean* when the\n  downstream analysis requires totals or expected values; *use the median* when you want to characterize\n  what a typical patient experienced. For costs and utilization in RWE/HEOR, routinely report both.\n- **SD vs IQR:** The SD is the natural companion to the mean and is required when computing confidence\n  intervals under normality. The IQR does not assume any distributional shape and is the appropriate\n  spread measure for skewed data. *Use SD* for normal or near-normal data; *use IQR* for skewed data and\n  for all cost/LOS/utilization variables in RWE.\n- **SEM vs SD (the cardinal confusion):** The SEM is used in confidence intervals for the mean, not in\n  Table 1. Reporting SEM instead of SD makes patient populations appear artificially homogeneous because\n  SEM shrinks with larger samples. Always use SD in descriptive tables.\n- **Descriptive vs causal summaries:** Descriptive statistics are robust precisely because they make no\n  causal assumptions — but that is also their limitation. A table showing that treated patients had\n  higher average costs does not mean the treatment caused higher costs; confounding by indication is\n  present and untouched. Descriptive summaries are necessary, not sufficient, for causal inference.\n- **Pooling vs stratifying:** Summarizing a bimodal or multimodal distribution with a single mean hides\n  the structure. 
If men and women have systematically different costs, a pooled mean obscures what is\n  happening in each group. When descriptive summaries are used to characterize a heterogeneous\n  population, stratify on the key variable rather than pool.\n\n**When to use**\n\nDescriptive statistics should be the first analytic step in any study:\n- Characterize the study population in Table 1 (demographics, comorbidities, baseline HCRU, costs)\n  before any outcome model is fit.\n- Understand the distribution of each variable — mean, median, SD, IQR, min, max, histogram — before\n  deciding which analytical method is appropriate (parametric vs nonparametric, log-transformation vs\n  GLM with gamma link for costs, etc.).\n- Report frequencies and proportions for all categorical variables including missingness.\n- Provide context for quantitative findings (e.g., \"the median cost was \\$X versus \\$Y in the comparator\"\n  alongside an adjusted ratio).\n- Conduct feasibility and sample-size exploration: understanding effect sizes, event rates, and\n  variances from descriptive statistics feeds power calculations.\n\n**When NOT to use — and when descriptive summaries are actively misleading**\n\n- **Do not use descriptive between-group differences to support causal claims.** A mean cost difference\n  of \\$5,000 between treated and untreated groups in an observational study reflects confounding by\n  indication, severity of illness, and every other baseline difference — not necessarily the treatment\n  effect. Descriptive statistics set the stage; they do not answer the causal question.\n- **Do not report means on heavily right-skewed data without also reporting the median.** When the mean\n  is substantially higher than the median, readers who see only the mean will overestimate what a typical\n  patient spends. 
In cost data, report both.\n- **Do not report means on censored cost data without appropriate adjustment.** When follow-up time is\n  censored (as in most survival or time-to-event analyses with administrative censoring), a simple\n  arithmetic mean of observed costs underestimates the true mean cost over the full follow-up period.\n  Inverse probability of censoring weighting (IPCW) or interval-partitioned estimators are needed for\n  unbiased cost estimation (see healthcare-costs-pppm-pppy-pmpm).\n- **Do not pool a bimodal or multimodal distribution into one mean.** If a variable has two distinct\n  modes (for example, costs from a procedure-naive group and a procedure-heavy group), the mean of the\n  pooled distribution may correspond to no actual patient's experience and will mislead readers who\n  interpret it as a typical value. Stratify or visually inspect the histogram first.\n- **Do not confuse SEM with SD in Table 1.** This is one of the most common descriptive statistics errors in\n  clinical and HEOR publications, and it systematically understates population heterogeneity.\n- **Do not describe with ambiguous denominators.** Any percentage must specify what the denominator\n  is — total cohort, treated arm, patients with complete data — or the percentage is uninterpretable.\n\n**Data-source operational depth**\n\n- **Claims (FFS):** Most continuous variables of interest — costs, length of stay, service counts —\n  are right-skewed in claims data, so median [IQR] is the default. Cost variables require specifying\n  which cost basis is being reported (paid amount vs charged amount vs allowed amount) and which\n  services are included (medical only, pharmacy only, or combined). 
Costs in claims are often expressed\n  per-member-per-month (PMPM) or per-patient-per-year (PPPY) to normalize across enrollment durations.\n  Descriptive statistics on cost variables should always examine whether a small number of extremely\n  high-cost enrollees are driving the mean (truncation or winsorization may be needed before modeling).\n  Categorical variables such as diagnosis presence/absence are straightforward n (%) but the denominator\n  must reflect the observable enrollment window: a 12-month lookback requires 12 months of continuous\n  enrollment, and restricting to patients with the full lookback affects the n in the denominator.\n- **EHR:** Laboratory values, vitals, and clinical scores are often closer to normal than cost data,\n  making mean (SD) more appropriate — but always inspect for skewness and confirm via histogram. A\n  critical issue with EHR-derived descriptive statistics is **missingness**: a lab not drawn does not\n  mean the result is zero or normal; it means it was not observed. Reporting \"n = 4,200 patients with\n  HbA1c data (mean 8.1%)\" without noting that 3,800 patients lacked HbA1c data understates the\n  incompleteness of the measurement.\n- **Registry:** Registries often collect structured clinical variables (stage, performance status,\n  biomarkers) with fewer missing values than EHR or claims, but enrollment is selective. Descriptive\n  statistics from a voluntary registry describe the enrolled population, which may differ systematically\n  from all eligible patients.\n- **Primary data (surveys, trials):** The cleanest setting for descriptive statistics because the\n  denominator is defined by the study design. 
Report all randomized or consented subjects in the\n  denominator, with a CONSORT-style accounting of how the analysis set differs.\n- **Linked data:** When datasets are linked, report descriptive statistics both for the full eligible\n  population and for the linkable subset, to allow readers to assess linkage-induced selection bias.\n\n**Interpreting the output**\n\nConsider a Table 1 row for total annual healthcare costs across ten patients in a retrospective claims\ncohort. The analysis returns: mean = \\$11,750 (SD ≈ \\$27,566), median = \\$2,600 (IQR \\$1,650–\\$6,250), n = 10.\n\n*(1) Formal statistical interpretation.* The mean of \\$11,750 is the arithmetic average and is sensitive to\nthe single high-cost outlier (\\$90,000) that pulls it far above most patients' spending. The median of \\$2,600\nsplits the ranked distribution into equal halves and is resistant to that extreme value. The SD of ≈ \\$27,566\nis about 2.3 times the mean, signaling severe right-skew; in a skewed distribution the SD is a poor\nstandalone spread summary because it implies symmetry that does not exist. The IQR of \\$1,650–\\$6,250 captures\nthe middle 50% of observations and provides an interpretable spread measure that does not assume any\nparticular distributional shape.\n\n*(2) Practical interpretation for a decision-maker.* The \"average\" cost of \\$11,750 is driven almost entirely\nby one catastrophically ill patient; nine of the ten patients spent under \\$7,000. For describing what a\ntypical member costs, the median (\\$2,600) and IQR are the better summary. For budget modeling, the mean\nstill determines total spend (10 × \\$11,750 = \\$117,500), because the payer pays the mean, not the median. When comparing\ntreatment groups, report median and IQR to describe the typical patient and report the mean alongside, with explicit\nacknowledgment that it is sensitive to extreme values: a small number of complex cases can pull the group\naverage upward and obscure meaningful differences at the center of the distribution.",
    "primary_category": "Descriptive_Epidemiology",
    "tags": [
      "descriptive-statistics",
      "statistics",
      "primitive",
      "foundations",
      "mean",
      "median",
      "standard-deviation",
      "iqr",
      "table-1",
      "skewness",
      "summary-statistics",
      "heor"
    ],
    "applies_to_study_types": [
      "claims_analysis",
      "ehr_study",
      "registry_linkage",
      "drug_utilization",
      "cohort_retrospective",
      "new_user",
      "cost_analysis",
      "burden_of_illness"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "primary",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1016/j.ijnurstu.2014.09.006",
        "url": "https://doi.org/10.1016/j.ijnurstu.2014.09.006",
        "citation_text": "Lang TA, Altman DG. Basic statistical reporting for articles published in Biomedical Journals: The Statistical Analyses and Methods in the Published Literature or the SAMPL Guidelines. International Journal of Nursing Studies. 2015;52(1):5-9.",
        "year": 2015,
        "authors_short": "Lang & Altman",
        "notes": "The SAMPL guidelines establish reporting conventions for descriptive statistics in biomedical journals: mean (SD) for approximately normal variables, median (IQR) for skewed distributions, and n (%) for categorical variables -- exactly the Table 1 conventions used in RWE/HEOR."
      },
      {
        "role": "explain",
        "doi": "10.1136/bmj.310.6975.298",
        "url": "https://doi.org/10.1136/bmj.310.6975.298",
        "citation_text": "Altman DG, Bland JM. Statistics Notes: The normal distribution. BMJ. 1995;310(6975):298.",
        "year": 1995,
        "authors_short": "Altman & Bland",
        "notes": "Classic Statistics Notes piece explaining the normal distribution and the 68-95-99.7 rule, establishing why SD is the natural spread companion to the mean for symmetric data and when it fails for skewed distributions -- the theoretical anchor for mean-vs-median decision rules."
      },
      {
        "role": "explain",
        "doi": "10.1016/s0167-6296(01)00086-8",
        "url": "https://doi.org/10.1016/s0167-6296(01)00086-8",
        "citation_text": "Manning WG, Mullahy J. Estimating log models: to transform or not to transform? Journal of Health Economics. 2001;20(4):461-494.",
        "year": 2001,
        "authors_short": "Manning & Mullahy",
        "notes": "Seminal health economics paper on the consequences of skewed cost distributions -- the theoretical foundation for why means and medians diverge in healthcare cost data, why naive log-transformation creates retransformation bias, and why the payer-pays-the-mean principle governs budget impact projections."
      },
      {
        "role": "use",
        "doi": "10.1136/bmj.309.6960.996",
        "url": "https://doi.org/10.1136/bmj.309.6960.996",
        "citation_text": "Altman DG, Bland JM. Statistics Notes: Quartiles, quintiles, centiles, and other quantiles. BMJ. 1994;309(6960):996.",
        "year": 1994,
        "authors_short": "Altman & Bland",
        "notes": "Defines quartiles, the IQR, and centiles in plain terms, distinguishing them from means and SDs and explaining when each is the appropriate spread measure -- the methodological basis for the IQR-for-skewed-data reporting convention in Table 1."
      }
    ],
    "plain_language_summary": "Descriptive statistics are the arithmetic tools that summarize what a group of patients or a variable actually looks like: the average or typical value (mean and median), how spread out the values are (standard deviation and IQR), and what fraction belong to each category (counts and percentages). They are the foundation every Table 1 is built on and the first step in any data analysis. One key caution: describing a difference between two groups with descriptive statistics does not explain why the difference exists — that requires a separate causal analysis.",
    "key_terms": [
      {
        "term": "mean",
        "definition": "The arithmetic average — add up all values and divide by the number of observations; sensitive to extreme values because every value contributes equally."
      },
      {
        "term": "median",
        "definition": "The middle value when observations are sorted from lowest to highest; robust to extreme values because it depends only on rank, not magnitude."
      },
      {
        "term": "standard deviation",
        "definition": "A measure of spread that describes the typical distance between each individual value and the mean; for bell-shaped data roughly two-thirds of values fall within one standard deviation of the mean."
      },
      {
        "term": "interquartile range",
        "definition": "The spread from the 25th percentile to the 75th percentile of a distribution, capturing the middle half of the data; preferred for skewed variables like healthcare costs and length of stay."
      },
      {
        "term": "distribution",
        "definition": "The full pattern of values a variable takes and how frequently each value occurs; visualized with a histogram or boxplot before deciding which summary statistics are appropriate."
      },
      {
        "term": "skewness",
        "definition": "The degree of asymmetry in a distribution; right-skewed data have a long tail toward higher values (healthcare costs are almost always right-skewed), which pulls the mean above the median."
      }
    ],
    "worked_example": {
      "scenario": "A health economist is characterizing the baseline annual healthcare costs for 10 patients newly enrolled in a commercial claims database. She records the total paid amount (medical + pharmacy combined) for each patient in the 12 months before their index date. She wants to know which summary statistic best represents the typical patient's cost and which best supports a budget impact projection for a plan expecting to enroll 1,000 similar patients.\n",
      "dataset": {
        "caption": "Annual baseline healthcare costs for 10 patients (sorted), 12-month lookback in commercial claims. Patient 10 had a high-cost hospitalization; the other nine had unremarkable utilization.",
        "columns": [
          "patient_id",
          "annual_cost_usd"
        ],
        "rows": [
          [
            1001,
            1200
          ],
          [
            1002,
            1500
          ],
          [
            1003,
            1800
          ],
          [
            1004,
            2100
          ],
          [
            1005,
            2400
          ],
          [
            1006,
            2800
          ],
          [
            1007,
            3200
          ],
          [
            1008,
            4500
          ],
          [
            1009,
            8000
          ],
          [
            1010,
            90000
          ]
        ]
      },
      "steps": [
        "Compute the mean: sum all 10 values, then divide by 10. Sum = 1200 + 1500 + 1800 + 2100 + 2400 + 2800 + 3200 + 4500 + 8000 + 90000 = 117500. Mean = 117500 / 10 = 11750.",
        "Compute the median: with 10 values already sorted, the median is the average of the 5th and 6th values. The 5th value is 2400 and the 6th is 2800. Median = (2400 + 2800) / 2 = 2600.",
        "Note the gap: the mean ($11,750) is more than four times the median ($2,600). That gap signals strong right skew driven by the one $90,000 outlier patient.",
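        "Compute Q1 (25th percentile) and Q3 (75th percentile) for the IQR. This hand calculation takes Q1 as the average of the 2nd and 3rd sorted values and Q3 as the average of the 8th and 9th. Q1 = (1500 + 1800) / 2 = 1650. Q3 = (4500 + 8000) / 2 = 6250. IQR = 6250 - 1650 = 4600. Quartile conventions differ in small samples: the median-of-halves method gives Q1 = 1800 and Q3 = 4500, and the linear-interpolation default in pandas and R gives Q1 = 1875 and Q3 = 4175, so state the method used and expect software output to differ from these hand-computed values.",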
        "Interpret the two summaries: the median $2,600 with IQR $1,650 to $6,250 describes the typical patient experience accurately -- nine of ten patients spent between $1,200 and $8,000. The mean $11,750 is higher than what nine of the ten patients actually spent, so it is a poor description of a typical patient, but it is the correct input for budget projections: 1,000 enrolled patients would cost the plan approximately 1,000 x 11,750 = 11,750,000 dollars, not 1,000 x 2,600 = 2,600,000 dollars.",
        "Table 1 reporting convention: for this right-skewed cost variable, report median [IQR] to characterize the typical patient: $2,600 [$1,650-$6,250]. For the budget model, use the mean: $11,750."
      ],
      "result": "Mean cost = $11,750 (SD approximately $27,566); median cost = $2,600 [IQR $1,650-$6,250]. The large gap between mean and median confirms right skew. For Table 1 and clinical characterization, report the median: a typical patient spent $2,600. For budget impact across 1,000 patients, use the mean: projected total plan spend = 1,000 x 11,750 = 11,750,000 dollars. Both summaries are correct -- they answer different questions."
    },
    "prerequisites": [],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Mean and standard deviation (approximately normal data)",
        "description": "Report mean (SD) when the histogram is approximately symmetric and bell-shaped -- common for age, body weight, lab values within a healthy range, and Likert-scale survey items averaged across respondents. The SD is the natural companion to the mean and feeds directly into parametric confidence intervals and t-tests.",
        "edge_cases": [
          "The 68-95-99.7 rule applies only when data are approximately normal; on skewed data, mean minus one SD can be negative for variables bounded at zero (costs, counts).",
          "Inspect the histogram before defaulting to mean (SD); a bimodal distribution should be stratified, not summarized with a single mean."
        ],
        "data_source_notes": "claims: demographics (age, enrollment duration) are typically appropriate for mean (SD); cost, LOS, and utilization counts are not. EHR: labs and vitals are often near-normal within a disease cohort but inspect for clinical outliers."
      },
      {
        "name": "Median and IQR (skewed data)",
        "description": "Report median [IQR] when the distribution is right-skewed or has heavy tails -- the standard for healthcare costs, length of stay, and service-utilization counts. The median is robust to the extreme values that are structurally present in these distributions.",
        "edge_cases": [
          "When the distribution is bimodal, neither median nor mean may correspond to a typical patient; report the full distribution or stratify.",
          "The IQR is not a confidence interval -- it is a description of spread in the data, not uncertainty about a parameter."
        ],
        "data_source_notes": "claims: virtually all cost and utilization variables are right-skewed; default to median [IQR] and report mean (SD) additionally when budget projections are needed."
      },
      {
        "name": "Count and proportion (categorical variables)",
        "description": "Report n (%) for binary and multi-category variables. The denominator must be stated explicitly. Missingness must appear as its own category row rather than silently reducing the denominator.",
        "edge_cases": [
          "A percentage without a denominator is uninterpretable; always state n = denominator.",
          "If completeness varies by variable (as in EHR), each row's denominator may differ -- note when this is the case."
        ],
        "data_source_notes": "claims: presence/absence of diagnosis codes is the canonical categorical variable; the lookback window determines the denominator. EHR: missingness of labs and clinical variables is frequent and must be reported as missing n (%) not silently excluded."
      },
      {
        "name": "Both mean and median (cost variables in HEOR)",
        "description": "For healthcare cost variables in HEOR studies, report both mean (SD) and median [IQR] in the same table row or in a footnote. This convention acknowledges that median serves the clinical characterization purpose while mean serves the budget impact and cost-effectiveness modeling purpose.",
        "edge_cases": [
          "When mean and median are close (within ~10-20%), skew is mild and either alone is defensible.",
          "When mean substantially exceeds median (as is common), the methods section should explain that economic projections use the mean because total plan expenditure = mean x enrollment."
        ],
        "data_source_notes": "Standard practice in ISPOR-aligned HEOR publications for cost outcomes."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "inferential-statistics-foundations",
        "pros_of_this": "Descriptive statistics require no distributional assumptions, no hypothesis, and no comparison group; they describe the sample as it is, without inferring to a wider population.",
        "cons_of_this": "Descriptive statistics cannot support generalization, causal claims, or statements about statistical uncertainty -- those require inferential methods.",
        "when_to_prefer": "Always run first: descriptive statistics are the mandatory first step before any inferential analysis. Use inferential methods when the goal is to generalize from a sample or to estimate the uncertainty around an effect."
      },
      {
        "compared_to": "parametric-vs-nonparametric-tests",
        "pros_of_this": "Descriptive statistics produce the summaries that feed the choice between parametric and nonparametric tests -- the shape of the distribution (assessed descriptively) determines which test is valid.",
        "cons_of_this": "Descriptive statistics alone cannot determine statistical significance or quantify uncertainty in a parameter estimate.",
        "when_to_prefer": "Use descriptive statistics to characterize the distribution, then select parametric or nonparametric tests depending on whether the distributional assumptions (approximate normality) appear to hold."
      },
      {
        "compared_to": "descriptive-epidemiology-rwe",
        "pros_of_this": "Descriptive statistics cover the pure summarization of a variable (mean, median, SD, IQR, counts) and are agnostic to epidemiologic context. They apply to any variable in any dataset.",
        "cons_of_this": "Descriptive statistics do not address the epidemiologic frequency measures (incidence rates, cumulative incidence, prevalence, standardized rates) that require properly constructed denominators, person-time, and event counting in a specific population.",
        "when_to_prefer": "Use descriptive statistics to characterize variables and build Table 1. Use descriptive epidemiology methods when the goal is to estimate disease or exposure frequency in a defined population with explicit person-time denominators."
      },
      {
        "compared_to": "baseline-characteristics-and-covariate-balance-rwe",
        "pros_of_this": "Descriptive statistics are the arithmetic inputs to Table 1 (mean/SD or median/IQR and n/% for each covariate); they characterize the distribution of each baseline variable.",
        "cons_of_this": "A Table 1 produced from descriptive statistics alone does not assess whether two groups are exchangeable -- that requires standardized mean differences and the balance diagnostics covered in the baseline-characteristics concept.",
        "when_to_prefer": "Build Table 1 with descriptive statistics; extend it with SMDs and balance diagnostics when comparing treatment groups in a non-randomized study."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Costs, LOS, and service counts are almost always right-skewed in claims data. Default to median [IQR] for characterization and mean (SD) for budget projections. Verify the cost perspective (paid vs charged vs allowed) and the service scope (medical, pharmacy, or combined) before computing. Costs should be normalized to a common enrollment window (PMPM or PPPY) when enrollment durations vary. Missing cost data in claims is rare (if a service was billed, a paid amount exists), but ensure the enrollment window is complete before concluding a patient had zero costs.",
      "ehr": "Lab values, vitals, and clinical scores may be closer to normal than cost data, but inspect histograms before choosing mean vs median. Missing data is common and informative in EHR -- a missing HbA1c may indicate a less-engaged patient, not a normal value. Report missingness as n (%) in every descriptive table and do not suppress or silently reduce the denominator.",
      "registry": "Registries often collect structured clinical variables with higher completeness than EHR for the variables they are designed to capture. However, enrollment is voluntary and selective, so descriptive statistics describe the enrolled population, not all eligible patients. State the catchment and enrollment criteria when interpreting descriptive summaries.",
      "primary": "Survey-based and trial data have design-defined denominators. Report all enrolled subjects as the denominator with a clear accounting of how the analysis set (e.g., completers) differs. Skewness on cost or utilization items from surveys mirrors the pattern in claims.",
      "linked": "When reporting descriptive statistics on linked data, report for the full eligible population and the linked subset separately so readers can assess whether linkage-induced selection changes the distribution meaningfully."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\nimport pandas as pd\nfrom scipy.stats import skew\n\n\ndef describe_rwe(df: pd.DataFrame,\n                 continuous_cols: list[str],\n                 binary_cols: list[str],\n                 skew_threshold: float = 1.0) -> pd.DataFrame:\n    \"\"\"\n    Table 1-style descriptive statistics for RWE/HEOR datasets.\n\n    Continuous variables:\n      - If |skewness| < skew_threshold  -> mean (SD)  [approximately normal]\n      - If |skewness| >= skew_threshold -> median [IQR] AND mean (SD)  [HEOR convention for costs]\n    Binary variables: n (%)\n    Missing values are counted per variable in the n_missing column.\n    \"\"\"\n    rows = []\n    n_total = len(df)\n\n    for col in continuous_cols:\n        series = df[col].dropna()\n        n_obs = len(series)\n        n_miss = n_total - n_obs\n        sk = float(skew(series)) if n_obs > 2 else 0.0\n        mean_val = float(series.mean())\n        sd_val = float(series.std(ddof=1))\n        med_val = float(series.median())\n        # pandas quantile defaults to linear interpolation between order statistics\n        q1 = float(series.quantile(0.25))\n        q3 = float(series.quantile(0.75))\n\n        if abs(sk) < skew_threshold:\n            # Approximately normal -> report mean (SD) as primary\n            rows.append({\n                \"variable\": col,\n                \"format\": \"mean (SD)\",\n                \"primary\": f\"{mean_val:.1f} ({sd_val:.1f})\",\n                \"supplemental\": None,\n                \"n_obs\": n_obs,\n                \"n_missing\": n_miss,\n                \"skewness\": round(sk, 2),\n            })\n        else:\n            # Skewed -> report median [IQR] as primary; mean (SD) as supplemental for budget use\n            rows.append({\n                \"variable\": col,\n                \"format\": \"median [IQR]\",\n                \"primary\": f\"{med_val:.1f} [{q1:.1f}-{q3:.1f}]\",\n                \"supplemental\": f\"mean {mean_val:.1f} (SD {sd_val:.1f})\",\n                \"n_obs\": n_obs,\n                
\"n_missing\": n_miss,\n                \"skewness\": round(sk, 2),\n            })\n\n    for col in binary_cols:\n        series = df[col]\n        n_miss = int(series.isna().sum())\n        n_obs = n_total - n_miss\n        n_event = int(series.sum()) if n_miss < n_total else 0\n        pct = 100.0 * n_event / n_obs if n_obs > 0 else float(\"nan\")\n        rows.append({\n            \"variable\": col,\n            \"format\": \"n (%)\",\n            \"primary\": f\"{n_event} ({pct:.1f}%)\",\n            \"supplemental\": f\"denominator n={n_obs}\",\n            \"n_obs\": n_obs,\n            \"n_missing\": n_miss,\n            \"skewness\": None,\n        })\n\n    return pd.DataFrame(rows)\n\n\n# Example for a 10-patient cost dataset\ndf_example = pd.DataFrame({\n    \"annual_cost\": [1200, 1500, 1800, 2100, 2400, 2800, 3200, 4500, 8000, 90000],\n    \"age\":         [52, 61, 47, 73, 58, 66, 55, 70, 62, 68],\n    \"prior_hosp\":  [0, 0, 0, 1, 0, 0, 1, 0, 1, 1],\n})\nresult = describe_rwe(df_example,\n                      continuous_cols=[\"annual_cost\", \"age\"],\n                      binary_cols=[\"prior_hosp\"])\nprint(result.to_string(index=False))",
        "description": "Compute a descriptive statistics summary for continuous and binary variables, following\nTable 1 conventions for RWE/HEOR: mean (SD) for approximately normal variables, median [IQR] for\nskewed variables (detected by a skewness threshold), and n (%) for binary variables. Input is a\npandas DataFrame with one row per patient; continuous and binary column names are provided separately.\nReports both mean and median for any variable flagged as skewed (the standard HEOR convention for\ncost variables). Missing values are counted per variable in the n_missing column.",
        "dependencies": [
          "pandas",
          "numpy",
          "scipy"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(tableone)\nlibrary(dplyr)\nlibrary(e1071)  # skewness()\n\n# Identify skewed variables automatically (|skewness| >= 1.0 threshold)\ndetect_skewed <- function(df, continuous_cols, threshold = 1.0) {\n  Filter(function(col) {\n    vals <- df[[col]][!is.na(df[[col]])]\n    abs(e1071::skewness(vals)) >= threshold\n  }, continuous_cols)\n}\n\n# Build a Table 1 following RWE/HEOR conventions:\n#   - non_normal vector -> tableone uses median [IQR]\n#   - remaining continuous -> mean (SD)\n#   - categorical / binary -> n (%)\n#   - test = FALSE: never include p-values in a descriptive Table 1\nrwe_table1 <- function(df,\n                       continuous_cols,\n                       categorical_cols,\n                       strata_col = NULL,\n                       skew_threshold = 1.0) {\n\n  non_normal <- detect_skewed(df, continuous_cols, skew_threshold)\n  all_vars   <- c(continuous_cols, categorical_cols)\n\n  t1 <- CreateTableOne(\n    vars       = all_vars,\n    strata     = strata_col,\n    data       = df,\n    factorVars = categorical_cols,\n    test       = FALSE  # no p-values in descriptive RWE Table 1\n  )\n  print(t1,\n        nonnormal   = non_normal,  # auto median [IQR]\n        smd         = !is.null(strata_col),\n        showAllLevels = TRUE,\n        quote       = FALSE,\n        noSpaces    = TRUE)\n}\n\n# Example\ndf_example <- data.frame(\n  annual_cost = c(1200, 1500, 1800, 2100, 2400, 2800, 3200, 4500, 8000, 90000),\n  age         = c(52, 61, 47, 73, 58, 66, 55, 70, 62, 68),\n  prior_hosp  = factor(c(0, 0, 0, 1, 0, 0, 1, 0, 1, 1), labels = c(\"No\",\"Yes\"))\n)\nrwe_table1(df_example,\n           continuous_cols  = c(\"annual_cost\", \"age\"),\n           categorical_cols = c(\"prior_hosp\"))\n\n# Direct summary for verification\ncat(\"Mean cost:\", mean(df_example$annual_cost), \"\\n\")\ncat(\"Median cost:\", median(df_example$annual_cost), \"\\n\")\ncat(\"SD cost:\", sd(df_example$annual_cost), \"\\n\")\ncat(\"Q1:\", 
quantile(df_example$annual_cost, 0.25), \"\\n\")\ncat(\"Q3:\", quantile(df_example$annual_cost, 0.75), \"\\n\")",
        "description": "Descriptive statistics for RWE/HEOR Table 1 using base R and the tableone package. Reports mean (SD)\nfor approximately normal continuous variables, median [IQR] for skewed variables, and n (%) for\ncategorical variables. Uses tableone::CreateTableOne for the final formatted output with an explicit\nnon-normal list for skewed variables -- the standard R workflow for HEOR Table 1 generation.\nSkewness is assessed with e1071::skewness (a CRAN package, not base R); variables with |skewness| >= 1\nare passed to the nonnormal argument of print() so that tableone applies median [IQR] formatting.",
        "dependencies": [
          "tableone",
          "dplyr",
          "e1071"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* ---- Continuous variables: mean, SD, median, IQR ---- */\n/* PROC MEANS: mean, SD, min, max for approximately normal variables (e.g., age) */\nproc means data=work.cohort n mean std min max maxdec=1;\n  var age;\n  title \"Continuous variables: mean (SD) for approximately normal\";\nrun;\n\n/* PROC UNIVARIATE: median, Q1, Q3 for a skewed variable (costs, LOS, utilization) */\n/* A single name per OUTPUT keyword names only the first VAR variable, so this step */\n/* summarizes annual_cost; repeat it (or list one output name per variable) for     */\n/* los_days and n_outpatient_visits.                                                */\nproc univariate data=work.cohort noprint;\n  var annual_cost;\n  output out=work.cost_stats\n    n=n\n    mean=mean\n    std=std\n    median=median\n    q1=q1\n    q3=q3;\nrun;\n\n/* Print the descriptive output with IQR computed inline */\ndata work.cost_stats_iqr;\n  set work.cost_stats;\n  iqr = q3 - q1;\n  /* Table 1 label: median [Q1-Q3] */\n  median_iqr = catx(' ', strip(put(median, 8.0)),\n                     cats('[', strip(put(q1, 8.0)), '-', strip(put(q3, 8.0)), ']'));\nrun;\n\nproc print data=work.cost_stats_iqr label noobs;\n  var n mean std median iqr median_iqr;\n  title \"Skewed continuous variables: median [IQR] and mean (SD) -- RWE/HEOR convention\";\nrun;\n\n/* ---- Categorical variables: n (%) with explicit denominator ---- */\n/* PROC FREQ with / missing option reports missingness as its own row */\n/* NOPERCENT is deliberately not used: it would suppress the percent column */\nproc freq data=work.cohort;\n  tables prior_hosp sex region comorbidity_flag / missing nocum;\n  /* 'missing' option: missing values appear as an explicit row in the output    */\n  /* never suppress the missing category when building Table 1 for RWE datasets */\n  title \"Categorical variables: n (%) -- missing displayed explicitly\";\nrun;\n\n/* ---- Distribution check: skewness and kurtosis to decide mean vs median ---- */\nproc univariate data=work.cohort normal;\n  var annual_cost age;\n  histogram annual_cost / normal;\n  qqplot   annual_cost / normal;\n  title \"Distribution diagnostics: skewness, kurtosis, normality tests\";\nrun;",
        "description": "Descriptive statistics for RWE/HEOR Table 1 using PROC MEANS, PROC UNIVARIATE, and PROC FREQ.\nPROC MEANS provides mean, SD, min, max, and count. PROC UNIVARIATE provides the median and quartiles (Q1, Q3)\nvia the OUTPUT statement; the IQR is then computed as Q3 - Q1 in a DATA step. PROC FREQ provides n (%) for\ncategorical variables, with the MISSING option showing missing values as their own row. Together these three\nprocedures cover the complete Table 1 for an RWE\ndataset: use PROC MEANS for normal continuous variables, PROC UNIVARIATE for skewed variables (costs,\nLOS, utilization), and PROC FREQ for all binary and categorical variables including missingness.\nRequired input: work.cohort with one row per patient, containing continuous and categorical baseline\nvariables measured in the pre-index window.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Start[Variable to summarize] --> Shape{Inspect histogram and boxplot}\n  Shape -->|Approximately symmetric| ContNorm[Report mean and SD<br/>e.g. age, weight, labs]\n  Shape -->|Right-skewed<br/>costs, LOS, counts| ContSkew[Report median and IQR<br/>as primary]\n  ContSkew --> BudgetQ{Budget impact<br/>needed?}\n  BudgetQ -->|Yes| BothMM[Also report mean and SD<br/>mean x n = total plan cost]\n  BudgetQ -->|No| MedOnly[Median and IQR alone<br/>sufficient for Table 1]\n  Shape -->|Bimodal or multimodal| Stratify[Stratify or plot full distribution<br/>single mean is misleading]\n  Start --> CatVar{Categorical variable?}\n  CatVar -->|Yes| NPct[n and percent<br/>state denominator explicitly]\n  NPct --> Missing[Report missing as<br/>its own n and percent row]",
        "caption": "Decision tree for choosing the right descriptive statistic in RWE/HEOR. The distribution shape (assessed by histogram and boxplot) drives the choice: mean (SD) for approximately normal, median [IQR] for skewed (with mean added for budget math), n (%) for categorical.",
        "alt_text": "Flowchart showing a variable branching based on distribution shape: approximately symmetric to mean and SD, right-skewed to median and IQR with optional mean for budget impact, bimodal to stratification, and categorical to n (%) with explicit missingness reporting.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  subgraph \"Table 1 convention\"\n    A[\"Normal continuous<br/>e.g. age 62 years (SD 9.3)\"]\n    B[\"Skewed continuous<br/>e.g. cost $2,600 [$1,650-$6,250]\"]\n    C[\"Binary / categorical<br/>e.g. prior hosp: 40 (35%)\"]\n    D[\"Missing<br/>e.g. HbA1c missing: 18 (15%)\"]\n  end\n  A --> |Mean SD| Format1[\"mean (SD)\"]\n  B --> |Median IQR| Format2[\"median [IQR]\"]\n  B --> |Also needed for budget| Format3[\"mean (SD) as supplemental\"]\n  C --> |Count percent| Format4[\"n (%)\"]\n  D --> |Never suppress| Format5[\"n (%) of total cohort\"]",
        "caption": "Table 1 reporting conventions for RWE/HEOR: the format chosen for each row depends on the variable's distribution. Never use SEM in place of SD; never suppress missing values.",
        "alt_text": "Diagram mapping four variable types to their Table 1 format: normal continuous to mean (SD), skewed continuous to median [IQR] with optional mean (SD) for budget math, binary to n (%), and missing data to explicit n (%) rather than silent denominator reduction.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "used_with",
        "target_slug": "descriptive-epidemiology-rwe",
        "notes": "Descriptive statistics provide the variable-level summaries (mean, SD, median, IQR, n/%) that feed epidemiologic frequency measures: incidence rates, cumulative incidence, and prevalence all depend on correctly computed counts and person-time denominators built from the same raw data that descriptive statistics characterize."
      },
      {
        "relation_type": "used_with",
        "target_slug": "baseline-characteristics-and-covariate-balance-rwe",
        "notes": "Table 1 (baseline characteristics) is built entirely from descriptive statistics: mean (SD) or median [IQR] for each continuous covariate and n (%) for each categorical covariate. Balance diagnostics (SMDs) extend Table 1 for the comparative RWE setting; descriptive statistics are the foundation."
      },
      {
        "relation_type": "used_with",
        "target_slug": "healthcare-costs-pppm-pppy-pmpm",
        "notes": "Healthcare cost summaries illustrate the most important applied context for descriptive statistics: right-skewed cost distributions require median [IQR] for patient characterization but mean for budget projections -- the key HEOR nuance this entry explains."
      },
      {
        "relation_type": "see_also",
        "target_slug": "inferential-statistics-foundations",
        "notes": "Descriptive statistics describe the sample; inferential statistics generalize to a wider population or quantify uncertainty. The two are sequential: descriptive analysis always precedes inferential, and the distribution shape identified descriptively determines which inferential methods are appropriate."
      },
      {
        "relation_type": "see_also",
        "target_slug": "parametric-vs-nonparametric-tests",
        "notes": "The choice between parametric tests (which assume approximate normality) and nonparametric tests (which do not) is driven by the distribution shape assessed through descriptive statistics: histograms, skewness, and Q-Q plots."
      }
    ],
    "aliases": [
      "summary statistics",
      "mean",
      "median",
      "standard deviation",
      "IQR",
      "Table 1",
      "central tendency",
      "measures of spread",
      "descriptive summaries"
    ],
    "complexity": "foundational",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "diagnosis-phenotype-algorithm-1ip-2op-time-window-rwe",
    "name": "Diagnosis Phenotype Algorithm (1 IP / 2 OP, Time Window)",
    "short_definition": "A claims-based case-finding rule that classifies a patient as having a condition if they have at least one inpatient claim, or two outpatient claims on different service dates within a defined time window, carrying the relevant diagnosis code, whose performance (PPV, sensitivity, specificity) is condition-, position-, and data-source-specific and must be validated.",
    "long_description": "The **1 inpatient or 2 outpatient (1 IP / 2 OP) diagnosis phenotype** is the workhorse case-finding rule of\nadministrative-claims research. Because claims carry billing codes (ICD-9/10-CM) rather than clinical findings, a patient\nis classified as a case when they accumulate enough coded evidence to make a billing artifact an unlikely explanation:\na single inpatient claim with the diagnosis (one institutional encounter that was adjudicated and paid), OR two outpatient\nclaims with the diagnosis on **different service dates** within a pre-specified window. The single-IP arm leans on the\nhigher specificity of an admission diagnosis; the two-OP arm requires temporal corroboration so that a one-off \"rule-out\"\nor screening code does not by itself create a case. The window length encodes the disease's tempo: 7-30 days for acute\nevents (MI, ischemic stroke, PE), 180-365 days for chronic conditions (atrial fibrillation, diabetes, COPD). For\n*incident* (new-onset) phenotypes the rule is paired with a lookback **washout** (e.g., no qualifying code in the prior\n365 days of continuous enrollment); without it the algorithm captures prevalent, not incident, disease.\n\n**Core conceptual distinction**. The rule is a *measurement instrument with error*, not a diagnosis. Three knobs trade\npositive predictive value (PPV) against sensitivity. (1) *Setting / arm structure* — 1 IP OR 2 OP balances the\nspecificity of admissions against the sensitivity of outpatient capture; an IP-only rule raises PPV but misses\noutpatient-managed disease. (2) *Code position* — primary (principal) position reflects the reason for the encounter and\nis more specific; any-position captures comorbidity coding but is noisier and more vulnerable to copy-forward and\nrule-out codes. (3) *Window and washout* — the OP-to-OP window and the incident washout jointly determine whether you\nmeasure new onset, and how much delayed-second-visit loss you accept. 
The two failure directions are not symmetric:\n*false positives* (rule-out/screening codes coded as confirmed disease) dilute toward the null in a comparative study,\nwhile *differential* misclassification by exposure (e.g., new initiators are seen more often and coded more completely)\nbiases in an unpredictable direction. Never treat a code as a clinical diagnosis without validating the algorithm in your\nown data source and population, and pre-specify position, window, and washout in the protocol before looking at outcomes.\n\n**Pros, cons, and trade-offs**.\n- **vs a broad single-code rule (1 claim, any position, no window):** the 1 IP / 2 OP rule sharply cuts false positives\n  from isolated rule-out and screening codes and is widely validated to PPV >80% for many conditions. Cost: it loses the\n  few true cases who have only one outpatient code and die or disenroll before a second visit; it is more code to build.\n  **Prefer 1 IP / 2 OP** for almost any outcome or covariate that drives an effect estimate.\n- **vs a high-specificity rule (IP-only, primary position, plus procedure/lab/drug confirmation):** the standard rule has\n  higher sensitivity and is easier to harmonize across databases. Cost: lower PPV than a confirmation-augmented rule.\n  **Prefer the high-specificity variant** when a false positive is costly (e.g., a serious safety outcome that triggers a\n  label change) and you can tolerate missing milder, outpatient-managed cases.\n- **vs EHR / registry phenotypes (problem lists, NLP on notes, registry adjudication):** claims rules scale to millions\n  of patients and decades of history and are consistent across payers if harmonized. Cost: lower clinical specificity than\n  a chart-adjudicated or NLP-confirmed phenotype; they miss out-of-network and uncoded events. **Prefer claims** for\n  large comparative studies, and use linked EHR/registry as the *validation* gold standard rather than the primary\n  ascertainment source.\n\n**When to use**. 
Identifying outcomes, covariates, or cohort-entry diagnoses in administrative claims (Medicare FFS,\ncommercial) or claims-linked data, especially when (a) a published, validated 1 IP / 2 OP algorithm exists for your\ncondition and code era, and (b) a chart-review or registry-linked subset is available to estimate PPV (and ideally\nsensitivity) in *your* population. It is the default for outcome ascertainment in pharmacoepidemiologic safety and\ncomparative-effectiveness studies and for covariate construction feeding propensity or high-dimensional propensity\nscores.\n\n**When NOT to use — and when it is actively misleading or dangerous**.\n- **No validation is feasible and PPV is plausibly low or differential.** If the condition is heavily coded as\n  \"rule-out\" (chest pain coded as MI before troponin returns; \"screening for malignancy\" before biopsy) and you cannot\n  estimate PPV, an unvalidated rule can manufacture or erase an effect — quantitative bias analysis or a linked\n  validation subset is mandatory before the estimate is trusted.\n- **The window or washout is misaligned with the disease tempo.** A 30-day chronic-disease window or a too-short incident\n  washout converts prevalent cases into \"incident\" ones, importing immortal time and survivor effects into the analysis.\n- **MA-only person-time is treated as observed.** In Medicare Advantage, fee-for-service institutional and carrier claims\n  are largely absent in legacy claims products; \"no prior code\" in MA-only spans is *missingness*, not a clean washout,\n  and silently fabricates incident cases. 
Restrict the washout to FFS-observable (Parts A/B) enrollment.\n- **The algorithm differs across arms.** Applying a different position/window/setting rule, or a different data source, to\n  exposed vs comparator patients guarantees differential misclassification — the single most dangerous error here.\n\n**Data-source operational depth**.\n- **Claims (Medicare FFS / commercial):** build from institutional/facility claims (use the *discharge* date for the IP\n  arm) plus carrier/professional and outpatient-facility claims (use the *service* date for the OP arm). Deduplicate\n  same-day OP claims so two line items on one visit count as one. Require continuous medical enrollment across the entire\n  washout so absence of a prior code is observed, not unobserved. Failure modes: rule-out/screening codes inflate OP\n  false positives (mitigate with primary position or a confirmatory procedure/drug); bundled or interim claims and later\n  adjustments perturb counts (prefer final paid claims); **MA-only person-time lacks FFS claims** so washout and OP\n  counting break (exclude or flag MA-only spans); differential competing risks — in elderly claims, patients who die\n  before a second OP visit are missed for the 2-OP arm, and if death rates differ by exposure this is differential.\n- **EHR:** the IP/OP distinction maps to encounter type; the \"second OP claim\" can be a second encounter or one encounter\n  plus an *active* problem-list entry, and NLP on notes can confirm a coded diagnosis. Strength: clinical detail to\n  adjudicate. Weakness: visit-driven capture — sicker, more-engaged patients are coded more completely, and care that\n  leaks out of the system is invisible; link to claims for a complete encounter history.\n- **Registry:** disease registries (SEER for cancer, stroke/MI registries) are the validation gold standard for PPV and\n  for the cases the claims rule misses, and registry death data anchor competing-risk and censoring logic. 
Weakness:\n  incomplete pharmacy/encounter timing; link to claims to apply the 1 IP / 2 OP rule and to recover timing.\n- **Linked claims-EHR / claims-registry:** the ideal substrate (scale + adjudication), but probabilistic linkage error\n  can be *differential by severity* (sicker patients link more reliably), biasing sensitivity estimates; quantify\n  linkage error and reconcile service vs discharge vs registry diagnosis dates before assigning the index date.\n\n**Worked claims example (incident atrial fibrillation).** Question: incident AF ascertainment in a Medicare FFS +\ncommercial database, ICD-10 I48.x (atrial fibrillation/flutter). Algorithm: **1 IP discharge diagnosis (any position) OR\n2 OP claims on service dates ≥7 days apart within a 365-day window**, with a **365-day incident washout** (no I48.x in\nthe 365 days before the first qualifying claim) and continuous Part A/B (or commercial medical) enrollment across that\nwashout; same rule applied identically to every study arm. Suppose 5,000 patients have at least one I48.x code.\nApplying the rule: 1,200 qualify on a single inpatient discharge code; of the remainder, 2,800 have ≥2 OP codes ≥7 days\napart within 365 days (the other 1,000 outpatient-only patients have a single isolated code; many of these are\n\"rule-out\" or pre-test codes and are correctly *not* counted). That leaves 4,000 algorithm-positive patients. Applying\nthe 365-day incident washout drops 1,000 with a prior-year I48.x (prevalent), yielding **3,000 incident AF cases**;\nindex date = the discharge date (IP arm) or the second, confirming date of the qualifying OP pair (OP arm), so that no\nfollow-up accrues before the case definition is met. Production checks: exclude MA-only person-time before counting\n(otherwise \"no prior code\" is missingness and over-counts incident cases); flag that MA risk-adjustment HCC capture\nmakes AF look more prevalent in MA than in FFS, so a multi-database pooled estimate must report algorithm performance by\npayer and sensitivity-test the window (30 vs 90 vs 365 days) and position (any vs primary). Validate PPV in a\nchart-reviewed or registry-linked subset and report it with a 95% CI; payer-specific heterogeneity and differential\nascertainment by exposure are the dominant threats to validity, not the rule itself.\n\n**Interpreting the output**. The output of the 1 IP / 2 OP algorithm is a classified patient list: each patient\neither meets the rule or does not, with an assigned index date. In the Medicare AF example, Patient 2201 qualifies\nvia the 2-OP arm (two I48.x outpatient claims 42 days apart, both in primary position), with an index date at the\nsecond, confirming OP date. Patient 2202 qualifies via the 1-IP arm (one inpatient claim with I48.x as principal\ndiagnosis), with index date at discharge. After applying the 365-day incident washout, 3,000 patients remain as\nincident AF cases.\n\nFormal interpretation: meeting the 1 IP / 2 OP rule means a patient accumulated billing evidence that makes a\ndocumentation artifact an unlikely sole explanation. It is not a clinical confirmation of atrial fibrillation.\nThe rule's PPV (the fraction of flagged patients who truly had AF when charts were reviewed) must be reported\nalongside the result. Rule-out codes are the leading false-positive driver: an I48.x recorded on an inpatient claim to\njustify a rhythm monitor that came back negative can satisfy the 1-IP arm on its own. The 2-OP arm's requirement of two\nservice dates at least 7 days apart suppresses isolated rule-out codes; tightening or loosening that gap, or the\n365-day window, is a planned sensitivity analysis, not a post-hoc fix.\n\nPractical interpretation: report the algorithm's PPV and the window/position choices prominently in the methods,\nbecause those choices, not the underlying condition, largely determine the false-positive rate. For any\nregulatory or HTA submission, provide a chart-review validation substudy and a Wilson 95% confidence interval\non the PPV estimate.",
    "primary_category": "Outcome_Measure",
    "tags": [
      "phenotype",
      "diagnosis-algorithm",
      "claims-phenotyping",
      "1ip-2op",
      "ppv-validation",
      "outcome-algorithm",
      "misclassification",
      "data-quality",
      "claims",
      "medicare",
      "commercial"
    ],
    "applies_to_study_types": [
      "claims_analysis",
      "cohort_retrospective",
      "new_user",
      "active_comparator_new_user",
      "target_trial_emulation"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1002/pds.5690",
        "url": "https://doi.org/10.1002/pds.5690",
        "citation_text": "Djibo DA, Goldberg JF, McMahill-Walraven CN, et al. Validation of an ICD-10 case-finding algorithm for endometrial cancer in US insurance claims. Pharmacoepidemiology and Drug Safety. 2023.",
        "year": 2023,
        "authors_short": "Djibo et al.",
        "notes": "Explicit 1 inpatient or 2 outpatient (different service dates) case-finding algorithm validated against medical records, with PPV estimates and discussion of code-set breadth and time windows for a claims phenotype."
      },
      {
        "role": "explain",
        "doi": "10.1002/pds.2314",
        "url": "https://doi.org/10.1002/pds.2314",
        "citation_text": "Cutrona SL, Toh S, Iyer A, et al. Design for validation of acute myocardial infarction cases in Mini-Sentinel. Pharmacoepidemiology and Drug Safety. 2012;21(S1):274-281.",
        "year": 2012,
        "authors_short": "Cutrona et al.",
        "notes": "FDA Mini-Sentinel framework for developing and validating a claims-based outcome algorithm, including the rationale for inpatient-anchored case definitions, chart-review PPV estimation, and reporting performance across data partners."
      },
      {
        "role": "demonstrate",
        "doi": "10.1016/j.ahj.2004.02.013",
        "url": "https://doi.org/10.1016/j.ahj.2004.02.013",
        "citation_text": "Kiyota Y, Schneeweiss S, Glynn RJ, Cannuscio CC, Avorn J, Solomon DH. Accuracy of Medicare claims-based diagnosis of acute myocardial infarction: estimating positive predictive value on the basis of review of hospital records. American Heart Journal. 2004;148(1):99-104.",
        "year": 2004,
        "authors_short": "Kiyota et al.",
        "notes": "Classic chart-review validation of a Medicare claims-based MI diagnosis showing that a single inpatient (principal) code can be highly predictive, with PPV depending on code position and setting; a template for the IP arm of the rule."
      },
      {
        "role": "explain",
        "doi": "10.1097/01.mlr.0000182534.19832.83",
        "url": "https://doi.org/10.1097/01.mlr.0000182534.19832.83",
        "citation_text": "Quan H, Sundararajan V, Halfon P, et al. Coding algorithms for defining comorbidities in ICD-9-CM and ICD-10 administrative data. Medical Care. 2005;43(11):1130-1139.",
        "year": 2005,
        "authors_short": "Quan et al.",
        "notes": "Canonical, widely reused ICD-9-CM and ICD-10 code-set algorithms; foundational for harmonizing diagnosis code lists across code eras and databases, a prerequisite for any reproducible 1 IP / 2 OP phenotype."
      }
    ],
    "plain_language_summary": "A diagnosis phenotype algorithm is a rule that combs through billing records to decide whether a patient actually has a given condition, rather than just having a code that was entered to rule something out. The '1 inpatient OR 2 outpatient' version says a patient counts as a case if they were admitted to the hospital with the relevant billing code at least once, OR if a doctor's office or outpatient clinic billed that code on at least two separate visits that are spread far enough apart in time. The gap between the two outpatient visits and the overall time window both matter — without them, a single 'could this be condition X?' code from one afternoon visit would create a false case. Every such rule carries error: it will flag some patients who do not truly have the condition (false positives) and miss others who do (false negatives), so analysts always report how accurate the rule was when tested against actual medical records.",
    "key_terms": [
      {
        "term": "inpatient claim",
        "definition": "A billing record generated when a patient is formally admitted overnight (or longer) to a hospital; it carries discharge codes that reflect what the patient was treated for during the stay."
      },
      {
        "term": "outpatient claim",
        "definition": "A billing record from a visit where the patient was not admitted — a doctor's office, clinic, or same-day procedure — that carries the code the clinician used to describe the reason for the visit."
      },
      {
        "term": "ICD code",
        "definition": "A standardized numeric label (e.g., I48.11 for persistent atrial fibrillation) that clinicians and coders place on claims to describe a diagnosis; it is a billing label, not a clinical test result."
      },
      {
        "term": "case-finding window",
        "definition": "The span of calendar time during which both qualifying outpatient claims must appear; if the second claim arrives after this window closes, the patient does not meet the rule."
      },
      {
        "term": "positive predictive value (PPV)",
        "definition": "Among all patients the algorithm labels as cases, the fraction who truly have the condition when their medical record is reviewed; a higher PPV means fewer false positives."
      },
      {
        "term": "index date",
        "definition": "The date a patient officially 'enters' a study as a case — for the two-outpatient arm this is the date of the second qualifying claim, and all follow-up time is measured from that point forward."
      }
    ],
    "worked_example": {
      "scenario": "We are building an incident atrial fibrillation (AF) cohort from a Medicare claims database using ICD-10 code I48.x. The rule is: qualify as an AF case if you have (a) at least one inpatient discharge claim with I48.x, OR (b) at least two outpatient claims with I48.x on different service dates that are at least 7 days apart and both fall within a 365-day case-finding window. We look at two patients. Patient 2201 has only outpatient visits; we trace her two AF codes to see whether and when she qualifies. Patient 2202 was hospitalized with AF; his single inpatient discharge code qualifies him immediately.",
      "dataset": {
        "caption": "Raw claim rows an analyst would see after filtering to I48.x codes. claim_type: IP = inpatient discharge, OP = outpatient service. service_date for IP rows already carries the discharge date.",
        "columns": [
          "person_id",
          "claim_type",
          "service_date",
          "dx_code",
          "dx_position"
        ],
        "rows": [
          [
            2201,
            "OP",
            "2024-01-08",
            "I48.11",
            "primary"
          ],
          [
            2201,
            "OP",
            "2024-02-19",
            "I48.11",
            "primary"
          ],
          [
            2202,
            "IP",
            "2024-03-05",
            "I48.19",
            "primary"
          ]
        ]
      },
      "steps": [
        "Patient 2201 has no inpatient claim, so we check the outpatient arm.",
        "Her two outpatient claims are on 2024-01-08 and 2024-02-19 — these are different service dates, so they are not the same visit.",
        "Gap between the two dates: from January 8 to February 19 is 42 days (23 remaining days in January plus 19 days into February).",
        "42 days is ≥ 7 (the minimum gap required) and ≤ 365 (the window), so the pair qualifies.",
        "Patient 2201's index date is 2024-02-19 — the date of the second, confirming outpatient claim.",
        "Patient 2202 has one inpatient discharge claim on 2024-03-05; a single inpatient claim always qualifies under the IP arm.",
        "Patient 2202's index date is 2024-03-05 — the discharge date of that inpatient claim.",
        "If patient 2201 had only the January 8 claim and never returned, she would NOT be a case — one isolated outpatient code is not enough."
      ],
      "result": "Patient 2201: qualifies via the outpatient arm (2 OP claims, gap = 42 days, within 365-day window); index date = 2024-02-19. Patient 2202: qualifies via the inpatient arm (1 IP discharge claim); index date = 2024-03-05.",
      "timeline_spec": {
        "title": "1 IP / 2 OP phenotype — patient 2201 (outpatient arm, atrial fibrillation I48.11)",
        "caption": "Patient 2201 has two outpatient claims for atrial fibrillation. The gap between them is 42 days, which satisfies both the 7-day minimum and the 365-day window. The second claim date becomes her index date. A patient with only the first claim would not qualify.",
        "alt_text": "Horizontal timeline for patient 2201 showing a 365-day case-finding window from 2024-01-08 to 2025-01-07. Two short bars mark outpatient claims: Claim 1 on January 8 and Claim 2 on February 19. A span labeled '42-day gap (≥7 days required)' stretches between them. Both claims fall inside a span labeled 'Case-finding window (≤365 days)'. A result marker at February 19 reads 'Index date — patient qualifies'.",
        "window": {
          "start": "2024-01-08",
          "end": "2025-01-07",
          "label": "365-day case-finding window (both OP claims must fall within this span)"
        },
        "events": [
          {
            "label": "Claim 1 — OP visit (I48.11)",
            "start": "2024-01-08",
            "length_days": 1,
            "quantity": "1 outpatient claim"
          },
          {
            "label": "Claim 2 — OP visit (I48.11) → index date",
            "start": "2024-02-19",
            "length_days": 1,
            "quantity": "1 outpatient claim (confirming)"
          }
        ],
        "spans": [
          {
            "kind": "followup",
            "start": "2024-01-08",
            "end": "2025-01-07",
            "label": "Case-finding window (365 days)"
          },
          {
            "kind": "gap",
            "start": "2024-01-09",
            "end": "2024-02-18",
            "label": "42-day gap between claims (≥7 days required)"
          },
          {
            "kind": "covered",
            "start": "2024-02-19",
            "end": "2024-02-19",
            "label": "Index date (2nd qualifying OP date)"
          }
        ],
        "result": {
          "label": "Patient 2201 qualifies: 2 OP claims, gap = 42 days (≥7), both within 365-day window. Index date = 2024-02-19.",
          "value": "qualifies"
        }
      }
    },
    "prerequisites": [
      "claims-analysis",
      "continuous-enrollment-observable-time-rwe",
      "washout-clean-lookback-period-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "1 IP or 2 OP (acute, short window)",
        "description": "At least one inpatient discharge diagnosis OR two outpatient claims on different service dates within a short window (typically 7-30 days). Standard for acute events such as MI, ischemic stroke, and pulmonary embolism, where corroborating evidence accrues quickly.",
        "edge_cases": [
          "Rule-out or 'suspected' codes (e.g., chest pain coded as MI pre-troponin) inflate outpatient false positives.",
          "Same-day OP line items or IP-to-IP transfers double-count without deduplication; collapse to one event per visit/episode.",
          "Too-short a window misses true cases whose confirmatory second visit is delayed."
        ],
        "data_source_notes": "claims: use discharge_date for the IP arm and service_date for the OP arm; prefer final paid claims and deduplicate same-day OP claims. MA-only person-time lacks FFS institutional/carrier claims and should be excluded."
      },
      {
        "name": "1 IP or 2 OP (chronic/incident, long window + washout)",
        "description": "Same arm structure with a 180-365 day OP window and a lookback washout (commonly 365 days of continuous enrollment with no qualifying code) to ascertain incident, new-onset disease. Standard for AF, diabetes, and COPD.",
        "edge_cases": [
          "Without a washout, the rule captures prevalent rather than incident disease, importing immortal time and survivor effects.",
          "In short-enrollment commercial data, an insufficient lookback window forces a shorter washout and over-counts incident cases.",
          "MA risk-adjustment HCC capture raises apparent prevalence relative to FFS, distorting multi-payer pooled incidence."
        ],
        "data_source_notes": "claims: require continuous FFS-observable (Part A/B) or commercial medical enrollment across the full washout so absence of a prior code is observed, not missing; report incidence and algorithm performance by payer."
      },
      {
        "name": "High-specificity variant (position/setting restriction or confirmation)",
        "description": "Restrict to primary diagnosis position and/or a specific setting (IP-only, or ED + IP), or require a confirmatory procedure (e.g., ablation, biopsy), laboratory result, or drug fill, raising PPV at the cost of sensitivity.",
        "edge_cases": [
          "Higher PPV but misses milder, outpatient-managed cases and patients not yet treated.",
          "Primary-position and procedure coding practices differ across payers (risk-adjustment incentives in MA), reducing cross-database comparability."
        ],
        "data_source_notes": "EHR/registry: an active problem-list entry or NLP confirmation, or registry adjudication (e.g., SEER), can substitute for the second OP claim and provides the gold standard for PPV."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Broad single-code rule (one claim, any position, no time window)",
        "pros_of_this": "Sharply reduces false positives from isolated rule-out and screening codes; widely validated to PPV >80% for many conditions; temporal corroboration in the OP arm.",
        "cons_of_this": "Misses the small number of true cases with only one outpatient code who die or disenroll before a second visit; more implementation logic (windowing, deduplication, washout).",
        "when_to_prefer": "Almost any outcome, covariate, or cohort-entry diagnosis whose accuracy materially affects the effect estimate."
      },
      {
        "compared_to": "High-specificity rule (IP-only, primary position, procedure/lab/drug confirmation)",
        "pros_of_this": "Higher sensitivity and easier cross-database harmonization; captures outpatient-managed disease.",
        "cons_of_this": "Lower PPV than a confirmation-augmented rule; more vulnerable to rule-out codes in the OP arm.",
        "when_to_prefer": "When outpatient-managed disease must be captured and missed cases would bias the estimate, no reliable confirmatory data element exists, and the study needs to harmonize across databases without procedure/lab/drug linkage."
      },
      {
        "compared_to": "EHR or registry phenotypes (problem lists, NLP, adjudication)",
        "pros_of_this": "Scales to millions of patients and decades of history; consistent across payers if code lists are harmonized.",
        "cons_of_this": "Lower clinical specificity; misses out-of-network and uncoded events; payer coding differences add heterogeneity.",
        "when_to_prefer": "Large comparative studies where claims are the feasible source; use linked EHR/registry as the validation gold standard, not the primary ascertainment."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Build from institutional/facility claims (IP arm, discharge_date) and carrier/professional plus outpatient-facility claims (OP arm, service_date). Deduplicate same-day OP claims, prefer final paid claims, and require continuous medical enrollment across the washout. For incident phenotypes apply a clean-period washout (e.g., 365 days, any position). Exclude or flag MA-only person-time where FFS claims are unavailable, and report algorithm performance by database/payer.",
      "ehr": "Map IP/OP to encounter type; the second OP claim may be a second encounter or an active problem-list entry, and NLP can confirm a coded diagnosis. Visit-driven capture means sicker, more-engaged patients are coded more completely and out-of-system care is invisible; link to claims for a complete encounter history.",
      "registry": "Registry-adjudicated diagnoses (e.g., SEER) are the gold standard for PPV and for recovering cases the rule misses; registry death data anchor competing-risk and censoring logic. Link to claims to apply the 1 IP / 2 OP timing.",
      "linked": "Linked claims-EHR or claims-registry gives scale plus adjudication, but probabilistic linkage error can be differential by severity and bias sensitivity estimates; reconcile service/discharge/registry diagnosis dates before assigning the index date."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\nimport numpy as np\n\nTARGET_CODES = (\"I48\",)          # ICD-10 prefix match for atrial fibrillation/flutter (I48.x)\nOP_GAP_DAYS  = 7                 # minimum days between the two qualifying outpatient claims\nOP_WINDOW    = 365              # the two OP claims must fall within this many days of each other\nWASHOUT_DAYS = 365              # incident clean period: no qualifying code in the prior year\n\ndef build_phenotype(dx: pd.DataFrame, enroll: pd.DataFrame,\n                    any_position: bool = True) -> pd.DataFrame:\n    d = dx.copy()\n    d = d[d[\"dx_code\"].str.startswith(TARGET_CODES)]\n    if not any_position:\n        d = d[d[\"dx_position\"] == \"primary\"]\n    d = d.sort_values([\"person_id\", \"service_date\"])\n\n    # IP arm: a single inpatient claim qualifies; index = that discharge date.\n    ip = d[d[\"claim_type\"] == \"IP\"].groupby(\"person_id\", as_index=False)[\"service_date\"].min()\n    ip = ip.rename(columns={\"service_date\": \"ip_date\"})\n\n    # OP arm: ANY pair of DIFFERENT service dates OP_GAP_DAYS..OP_WINDOW apart (sliding pair, not\n    # anchored to the first code) so qualifying pairs after day 365 are not missed.\n    op = d[d[\"claim_type\"] == \"OP\"][[\"person_id\", \"service_date\"]].drop_duplicates()\n    pairs = op.merge(op, on=\"person_id\", suffixes=(\"_1\", \"_2\"))\n    gap = (pairs[\"service_date_2\"] - pairs[\"service_date_1\"]).dt.days\n    qual = pairs[(gap >= OP_GAP_DAYS) & (gap <= OP_WINDOW)]  # service_date_2 is the confirming code\n    op_idx = qual.groupby(\"person_id\", as_index=False)[\"service_date_2\"].min()  # earliest confirming date\n    op_idx = op_idx.rename(columns={\"service_date_2\": \"op_date\"})\n\n    cand = ip.merge(op_idx, on=\"person_id\", how=\"outer\")\n    cand[\"index_date\"] = cand[[\"ip_date\", \"op_date\"]].min(axis=1)             # earliest qualifying event\n\n    # Incident washout: drop anyone with a target code in [index - 
WASHOUT_DAYS, index).\n    prior = cand.merge(d[[\"person_id\", \"service_date\"]], on=\"person_id\")\n    in_wash = prior[(prior[\"service_date\"] < prior[\"index_date\"]) &\n                    (prior[\"service_date\"] >= prior[\"index_date\"] - pd.Timedelta(days=WASHOUT_DAYS))]\n    cand = cand[~cand[\"person_id\"].isin(in_wash[\"person_id\"])]\n\n    # Continuous, FFS-observable enrollment across the washout through index (exclude MA-only spans).\n    e = enroll.merge(cand[[\"person_id\", \"index_date\"]], on=\"person_id\")\n    e[\"covers\"] = ((e[\"enroll_start\"] <= e[\"index_date\"] - pd.Timedelta(days=WASHOUT_DAYS)) &\n                   (e[\"enroll_end\"]   >= e[\"index_date\"]) & (~e[\"ma_only\"]))\n    eligible = e.loc[e[\"covers\"], \"person_id\"].unique()\n\n    out = cand[cand[\"person_id\"].isin(eligible)].copy()\n    return out[[\"person_id\", \"index_date\"]].sort_values(\"person_id\")",
        "description": "1 IP / 2 OP incident phenotype construction from claims-style inputs. Required inputs (already cleaned, de-duplicated to\none row per claim line):\n  dx     : diagnosis claims -> person_id, claim_type in {'IP','OP'}, service_date (datetime; IP arm should already carry\n           the discharge date here), dx_code, dx_position ('primary'/'secondary')\n  enroll : continuous medical enrollment spans -> person_id, enroll_start, enroll_end, ma_only (bool)  # MA-only lacks FFS claims\nReturns one row per incident case with the index_date. Apply outcome/censoring rules identically to every arm downstream.",
        "dependencies": [
          "pandas",
          "numpy"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\n\nTARGET   <- \"^I48\"   # ICD-10 atrial fibrillation/flutter (I48.x)\nOP_GAP   <- 7L       # min days between the two qualifying OP claims\nOP_WIN   <- 365L     # the two OP claims within this many days\nWASHOUT  <- 365L     # incident clean period before the washout anchor\n\nbuild_phenotype <- function(dx, enroll, any_position = TRUE) {\n  setDT(dx); setDT(enroll)\n  d <- dx[grepl(TARGET, dx_code)]\n  if (!any_position) d <- d[dx_position == \"primary\"]\n  setorder(d, person_id, service_date)\n\n  # IP arm: single inpatient claim; index = earliest discharge date.\n  ip <- d[claim_type == \"IP\", .(ip_date = min(service_date)), by = person_id]\n\n  # OP arm: ANY pair of distinct service dates OP_GAP..OP_WIN apart (sliding pair, not anchored to the\n  # first code) so qualifying pairs after day 365 are not missed.\n  op <- unique(d[claim_type == \"OP\", .(person_id, service_date)])\n  pairs <- merge(op, op, by = \"person_id\", allow.cartesian = TRUE,\n                 suffixes = c(\"_1\", \"_2\"))\n  pairs[, gap := as.integer(service_date_2 - service_date_1)]\n  qual <- pairs[gap >= OP_GAP & gap <= OP_WIN]\n  setorder(qual, person_id, service_date_2, service_date_1)\n  # Earliest confirming date per person, plus the earliest first code paired with it.\n  op_idx <- qual[, .(op_first = service_date_1[1L], op_date = service_date_2[1L]), by = person_id]\n\n  cand <- merge(ip, op_idx, by = \"person_id\", all = TRUE)\n  cand[, index_date := pmin(ip_date, op_date, na.rm = TRUE)]\n\n  # Washout anchor = first code of the qualifying event: op_first when the OP arm sets the index,\n  # else the index date. Anchoring at the index itself would place the pair's own first code inside\n  # the washout and drop every OP-arm case.\n  cand[, anchor := fifelse(!is.na(op_date) & op_date == index_date, op_first, index_date)]\n  cand[, wash_start := anchor - WASHOUT]\n\n  # Incident washout: drop anyone with a target code in [anchor - WASHOUT, anchor).\n  pr <- merge(cand[, .(person_id, anchor, wash_start)], d[, .(person_id, service_date)], by = \"person_id\")\n  in_wash <- unique(pr[service_date < anchor & service_date >= wash_start, person_id])\n  cand <- cand[!person_id %in% in_wash]   # %in% works for integer or character ids\n\n  # Continuous FFS-observable enrollment across washout through index (no MA-only spans).\n  e <- merge(enroll, cand[, .(person_id, index_date, wash_start)], by = \"person_id\")\n  ok <- e[enroll_start <= wash_start &\n          enroll_end   >= index_date & !ma_only, unique(person_id)]\n\n  cand[person_id %in% ok, .(person_id, index_date)][order(person_id)]\n}",
        "description": "1 IP / 2 OP incident phenotype construction with data.table. Inputs mirror the Python version:\n  dx     : person_id, claim_type ('IP'/'OP'), service_date (Date; IP carries discharge date), dx_code, dx_position\n  enroll : person_id, enroll_start, enroll_end, ma_only (logical)\nReturns one row per incident case with index_date; apply outcome/censoring identically across arms downstream.",
        "dependencies": [
          "data.table"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "%let target   = I48;   /* ICD-10 atrial fibrillation/flutter prefix */\n%let op_gap   = 7;      /* min days between the two qualifying OP claims */\n%let op_win   = 365;    /* the two OP claims within this many days */\n%let washout  = 365;    /* incident clean period before the washout anchor */\n%let primary_only = 0;  /* 1 = restrict to principal-position codes */\n\n/* Restrict to target diagnosis codes (and optionally to primary position). */\nproc sql;\n  create table dxsub as\n  select person_id, claim_type, service_date, dx_position\n  from work.dx\n  where dx_code like \"&target.%\"\n    and (&primary_only = 0 or dx_position = 'primary');\nquit;\n\n/* IP arm: a single inpatient claim qualifies; index = earliest discharge date. */\nproc sql;\n  create table ip as\n  select person_id, min(service_date) as ip_date format=date9.\n  from dxsub where claim_type = 'IP'\n  group by person_id;\nquit;\n\n/* OP arm: ANY pair of distinct service dates &op_gap..&op_win apart (sliding pair, not anchored to the\n   first code) so qualifying pairs after day 365 are not missed. b is the earlier code, a the confirming code. */\nproc sql;\n  create table op as select distinct person_id, service_date\n  from dxsub where claim_type = 'OP';\n  create table op_pairs as\n  select a.person_id, b.service_date as first_date, a.service_date as confirm_date\n  from op a, op b\n  where a.person_id = b.person_id\n    and (a.service_date - b.service_date) >= &op_gap\n    and (a.service_date - b.service_date) <= &op_win;\n  create table op_min as\n  select person_id, min(confirm_date) as op_date format=date9.   /* earliest confirming code */\n  from op_pairs group by person_id;\n  create table op_idx as\n  select m.person_id, m.op_date, min(p.first_date) as op_first format=date9.   /* its earliest first code */\n  from op_min m inner join op_pairs p\n    on m.person_id = p.person_id and m.op_date = p.confirm_date\n  group by m.person_id, m.op_date;\nquit;\n\n/* Candidate index = earliest qualifying IP or OP event. Washout anchor = first code of that event:\n   op_first when the OP arm sets the index, else the index date. Anchoring at the index itself would\n   place the pair's own first code inside the washout and drop every OP-arm case. */\nproc sql;\n  create table cand as\n  select coalesce(i.person_id, o.person_id) as person_id,\n         min(i.ip_date, o.op_date) as index_date format=date9.,\n         case when o.op_date = min(i.ip_date, o.op_date) then o.op_first\n              else min(i.ip_date, o.op_date) end as anchor format=date9.\n  from ip i full join op_idx o on i.person_id = o.person_id;\nquit;\n\n/* Incident washout: exclude anyone with a target code in [anchor - &washout, anchor). */\nproc sql;\n  create table incident as\n  select c.* from cand c\n  where not exists (\n    select 1 from dxsub p\n    where p.person_id = c.person_id\n      and p.service_date <  c.anchor\n      and p.service_date >= c.anchor - &washout\n  );\nquit;\n\n/* Require continuous FFS-observable enrollment across the washout through index (no MA-only spans). */\nproc sql;\n  create table cases as\n  select n.person_id, n.index_date from incident n\n  where exists (\n    select 1 from work.enroll e\n    where e.person_id = n.person_id\n      and e.ma_only = 0\n      and e.enroll_start <= n.anchor - &washout\n      and e.enroll_end   >= n.index_date\n  );\nquit;",
        "description": "1 IP / 2 OP incident phenotype construction in SAS using PROC SQL. Required input datasets (post data-management,\none row per claim line):\n  work.dx     : person_id, claim_type ('IP'/'OP'), service_date (IP carries the discharge date), dx_code, dx_position\n  work.enroll : person_id, enroll_start, enroll_end, ma_only (0/1)\nOutput work.cases has one row per incident case with index_date. Set primary_only=1 to restrict to principal-position\ncodes (the high-specificity variant). Apply outcome/censoring rules identically to every arm downstream.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": "diagnosis-phenotype-algorithm-1ip-2op-time-window-rwe-timeline.svg",
        "mermaid": null,
        "caption": "Patient 2201 has two outpatient claims for atrial fibrillation. The gap between them is 42 days, which satisfies both the 7-day minimum and the 365-day window. The second claim date becomes her index date. A patient with only the first claim would not qualify.",
        "alt_text": "Horizontal timeline for patient 2201 showing a 365-day case-finding window from 2024-01-08 to 2025-01-07. Two short bars mark outpatient claims: Claim 1 on January 8 and Claim 2 on February 19. A span labeled '42-day gap (≥7 days required)' stretches between them. Both claims fall inside a span labeled 'Case-finding window (≤365 days)'. A result marker at February 19 reads 'Index date — patient qualifies'.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Pop[Enrollees with continuous medical enrollment<br/>FFS-observable across washout] --> Wash[Incident washout:<br/>no target code in prior 365 days]\n  Wash --> Arms{Case-finding rule}\n  Arms -->|1 inpatient discharge claim<br/>with target code| IP[IP arm qualifies<br/>index = discharge date]\n  Arms -->|2 outpatient claims, different<br/>service dates within window| OP[OP arm qualifies<br/>index = 2nd qualifying date]\n  Arms -->|Single isolated OP code<br/>e.g. rule-out / screening| Excl[Not a case]\n  IP --> Idx[Index date = earliest qualifying event]\n  OP --> Idx\n  Idx --> Valid[Validate PPV in chart-reviewed or<br/>registry-linked subset; report 95% CI]\n  Idx --> Sens[Apply identical rule to all arms;<br/>sensitivity on window / position / payer]",
        "caption": "Standard 1 IP / 2 OP incident diagnosis phenotype. The washout establishes incidence, the two-arm rule trades PPV against sensitivity, and performance must be validated and tested for differential misclassification by exposure or payer.",
        "alt_text": "Flowchart from enrolled population through incident washout to the inpatient and two-outpatient case-finding arms, index-date assignment, PPV validation, and sensitivity analyses.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "gantt\n  title 1 IP / 2 OP incident phenotype timeline (one patient, OP arm)\n  dateFormat YYYY-MM-DD\n  axisFormat %b %Y\n  section Washout\n  Continuous FFS enrollment, no target code :done, w, 2023-01-01, 2023-12-31\n  section Case-finding window\n  First outpatient target code :milestone, op1, 2024-01-05, 0d\n  Second outpatient code (>=7 days later, within 365d) :milestone, op2, 2024-02-10, 0d\n  Index date = second qualifying OP date :crit, idx, 2024-02-10, 1d",
        "caption": "Outpatient-arm timeline. The 365-day washout precedes time zero; the case is confirmed only when a second qualifying outpatient code appears at least seven days after the first, and the index date is that second date.",
        "alt_text": "Gantt timeline showing a 365-day washout in 2023, a first outpatient code in January 2024, and a second qualifying outpatient code in February 2024 setting the index date.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "part_of",
        "target_slug": "claims-analysis",
        "notes": "Diagnosis phenotypes are a core building block of nearly every claims-based cohort definition, outcome ascertainment, and covariate construction."
      },
      {
        "relation_type": "see_also",
        "target_slug": "outcome-algorithm-construction-rwe",
        "notes": "The 1 IP / 2 OP logic is the dominant pattern for outcome and covariate algorithms; the same validation, window, and payer-difference principles apply."
      },
      {
        "relation_type": "see_also",
        "target_slug": "medicare-ffs-ma-commercial-claims-differences-rwe",
        "notes": "Algorithm performance differs across FFS (payment-driven completeness), MA (encounter data plus risk-adjustment coding intensity), and commercial databases; MA-only person-time can break the washout."
      },
      {
        "relation_type": "see_also",
        "target_slug": "time-zero-index-date-alignment-rwe",
        "notes": "The phenotype index date must align with exposure time zero, the baseline covariate window, and follow-up start; misalignment creates immortal time or misclassified incidence."
      },
      {
        "relation_type": "see_also",
        "target_slug": "misclassification-bias-correction-rwe",
        "notes": "Imperfect PPV/sensitivity is a major source of outcome and covariate misclassification; quantitative bias analysis or validation in a linked subset is required, especially when error may be differential by exposure."
      },
      {
        "relation_type": "used_with",
        "target_slug": "high-dimensional-propensity-score-hdps-rwe",
        "notes": "Diagnosis codes selected by broad phenotypes are inputs to the hdPS for empirical confounding control; the 1 IP / 2 OP rule can define candidate covariates."
      },
      {
        "relation_type": "used_with",
        "target_slug": "active-comparator-new-user",
        "notes": "ACNU studies rely on validated diagnosis phenotypes to confirm the shared indication at baseline and to ascertain outcomes identically across arms."
      }
    ],
    "aliases": [
      "1 IP / 2 OP rule",
      "claims-based case-finding algorithm",
      "ICD-based diagnosis phenotype",
      "case-finding phenotype",
      "administrative diagnosis algorithm"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "diagnosis-position-and-qualifiers",
    "name": "Diagnosis Position, Type, and Qualifiers on Claims",
    "short_definition": "The semantic layer that governs every diagnosis field on institutional (UB-04) and professional (CMS-1500) claims: which field carries which clinical meaning, how field order encodes analytical priority (principal vs admitting vs secondary), and how the Present-on-Admission (POA) indicator separates comorbidities from complications — the primitive that all position-dependent phenotyping, outcome, and comorbidity algorithms depend on.",
    "long_description": "**Diagnosis fields on claims are not interchangeable billing slots — each carries a specific clinical and regulatory meaning, and choosing the wrong field, or ignoring position order, is the most common source of miscounted outcomes and inflated comorbidity scores in administrative-data research.**\n\nThis entry is a field-semantics primitive. It documents what every diagnosis field means before any phenotyping, outcome, or comorbidity algorithm is applied. The downstream methods that depend on these choices — the 1 IP / 2 OP phenotype (`diagnosis-phenotype-algorithm-1ip-2op-time-window-rwe`), the PPV/sensitivity trade-off (`claims-outcome-algorithm-ppv-sensitivity-rwe`), and the Elixhauser or Charlson comorbidity indices (`elixhauser-comorbidity-index-rwe`) — all inherit the semantic precision (or confusion) of the upstream field selection.\n\n**The institutional claim (UB-04) diagnosis field map**\n\nInstitutional claims — inpatient hospital stays, outpatient hospital visits, and emergency department encounters — are submitted on the UB-04 form. The key diagnosis fields are:\n\n*FL 67 — Principal diagnosis.* The condition established, **after study**, to be chiefly responsible for occasioning the admission. This is the UHDDS (Uniform Hospital Discharge Data Set) statutory definition. Two phrases matter enormously: \"after study\" means the principal diagnosis is assigned by professional coders **at discharge**, not at admission; it reflects the conclusion of the clinical workup, not the presenting complaint. \"Chiefly responsible for occasioning the admission\" means it is the dominant reason for the entire stay, which may shift as workup proceeds — a patient admitted for \"chest pain\" may be discharged with principal diagnosis \"acute myocardial infarction\" once troponin and ECG are reviewed. 
The principal diagnosis drives the **MS-DRG** (Medicare Severity Diagnosis Related Group), which determines inpatient payment, and therefore exerts strong **coding incentive pressure**: hospitals code the principal diagnosis to maximize reimbursement subject to MCC/CC (major complication or comorbidity / complication or comorbidity) rules. The sepsis-coding shifts of the 2010s are the canonical example of how DRG-optimized principal diagnosis assignment can change apparent incidence in claims databases even when true clinical incidence is unchanged. There is no \"principal diagnosis\" concept on professional (CMS-1500) claims — that distinction is unique to institutional billing.\n\n*FL 69 — Admitting diagnosis.* The working or suspected condition documented by the admitting physician **at the time of admission**, before the inpatient workup is complete. It is the clinical hypothesis — \"chest pain,\" \"rule out DVT,\" \"syncope\" — not the confirmed diagnosis. The admitting diagnosis can and often does differ materially from the principal discharge diagnosis: \"sepsis\" may be admitted as \"fever and hypotension,\" \"GERD\" as \"chest pain,\" \"AF\" as \"palpitations.\" For research applications, FL 69 is valuable for studying **diagnostic journeys, ED decision-making, and presentation patterns**, because it preserves the clinician's pre-workup hypothesis. It is largely unused for outcome ascertainment or cohort entry because it represents suspicion rather than confirmed disease.\n\n*FL 70 — Reason for Visit (outpatient/ED institutional claims).* The functional analog of FL 69 for **unscheduled outpatient encounters and ED visits** — the presenting complaint or chief reason the patient sought care. It is required on outpatient institutional claims and captures the reason-for-visit independent of the diagnoses ultimately coded. 
For research on ED utilization, return visits, or post-discharge care patterns, FL 70 preserves the presenting symptom that diagnosis codes may not.\n\n*FL 67A-Q and additional fields — Secondary diagnoses.* Up to 24 secondary diagnoses (sometimes more depending on the payer and extract format) appear in numbered positions (dx2 through dx25). These carry comorbid conditions, complications, findings, and any other conditions that affect patient care. **Truncation is a major database-comparability trap**: early legacy claims databases stored only 9 or 10 diagnosis fields total; modern CMS extracts support 25+; some commercial vendors store intermediate counts. A researcher who applies the Elixhauser algorithm to 9-field data will systematically undercount comorbidities relative to 25-field data, and multi-database pooled analyses must harmonize or sensitivity-test this limit.\n\n*External-cause codes (E-codes / ICD-10-CM Chapter 20, V/W/X/Y).* Injury mechanism, intent, and place of occurrence. These appear in secondary positions and are essential for injury epidemiology but are not used for condition-based outcome algorithms. Analysts building comorbidity scores should exclude E-codes from their scanning loops, as they are structured differently and may inflate false-positive flag rates for some conditions.\n\n**The professional claim (CMS-1500) diagnosis field map**\n\nProfessional claims from physicians, non-physician practitioners, labs, and outpatient services are submitted on the CMS-1500 form. The diagnosis field here is structurally different from the institutional form:\n\n*Item 21 — Diagnosis or nature of illness or injury.* Up to 12 diagnosis codes in positions A through L. There is **no statutory principal diagnosis** on professional claims — there is only a **first-listed** code (position A). 
The first-listed code is intended, by convention, to be the main reason for the visit, but it is not regulated by the UHDDS definition and does not drive DRG-based payment. Many research databases and vendors relabel position A as \"primary diagnosis\" regardless of claim type, which can mislead analysts into treating it as equivalent to the UHDDS principal diagnosis. It is not: a professional claim labeled \"dx1 = primary\" is a first-listed convention, while an institutional claim \"dx1 = principal\" is a UHDDS-defined post-discharge assignment. This ambiguity is the most common source of conceptual confusion in multi-database or mixed-claim-type studies.\n\n*Diagnosis pointer.* Each service line on the CMS-1500 has a diagnosis pointer that links the service to one of the Item 21 diagnosis codes. This is how payers adjudicate medical necessity — a procedure is covered only if linked to an appropriate diagnosis. For research, the diagnosis pointer means that diagnosis code utilization is driven partly by coverage rules, not purely by clinical documentation.\n\n**The \"discharge diagnosis\" — why it is a clinical concept, not a claims field**\n\nClinical records include a discharge summary with a \"discharge diagnosis\" or discharge problem list. On claims, this concept does not map to a single discrete field. The closest claims analog is the **principal diagnosis plus the full secondary diagnosis set** — together these represent the coder's final, discharge-time classification of all conditions treated or that affected care during the stay. When analysts refer to \"discharge diagnoses on claims,\" they mean the entire UB-04 diagnosis code set, recognized as assigned at the time of discharge. 
There is no separate pre-admission versus post-admission clinical-diagnosis field on the claim itself; that separation is accomplished through the POA indicator.\n\n**The Present-on-Admission (POA) indicator**\n\nThe POA indicator is the most analytically consequential qualifier on inpatient institutional claims. Mandated by the Deficit Reduction Act of 2005 and implemented for FY2008 inpatient prospective payment system (IPPS) discharges, the POA indicator is reported for **each diagnosis code** on the UB-04 and takes the following values:\n\n- **Y** (Yes): the condition was present at the time of inpatient admission\n- **N** (No): the condition was **not** present at the time of admission — it arose or was first identified during the inpatient stay (a hospital-acquired condition or complication)\n- **U** (Unknown): documentation insufficient to determine\n- **W** (Clinically undetermined): provider cannot clinically determine whether condition was present at admission\n- **1** (Exempt): the code is on the CMS POA exemption list (e.g., certain injury E-codes, status codes)\n\nThe POA indicator enables the analytically critical distinction between a **comorbidity** (POA = Y: condition the patient arrived with, relevant to risk adjustment) and a **complication** (POA = N: condition acquired in the hospital, relevant to safety and quality measurement). Without POA, an analyst cannot determine from the claim alone whether, say, a diagnosis of \"acute kidney injury\" during a cardiac surgery stay was a pre-existing condition or a procedure complication. 
The Hospital-Acquired Conditions (HAC) Reduction Program and hospital quality metrics depend on POA for this reason.\n\n*Research applications of POA.* For outcome ascertainment, POA = N on a secondary diagnosis is the claims-based signal for a **new event occurring during the stay**: the outcome happened in hospital, it was not present at admission, and therefore was likely caused or triggered by the hospitalization or intervention being studied. This is the key for constructing \"complication during index stay\" outcomes without requiring a separate post-discharge follow-up window. For comorbidity adjustment (Elixhauser, Charlson), a POA-aware implementation restricts the comorbidity flag to POA = Y codes, excluding POA = N complications that could inflate the apparent pre-admission comorbidity burden. The Liu et al. (2025) paper demonstrates that POA-aware Elixhauser measures outperform the naïve index for in-hospital mortality prediction.\n\n*POA data-quality caveats.* Reporting quality is uneven across hospitals and coding years. The 2008–2012 ramp-up period has higher rates of U and W values. Small or critical-access hospitals may have higher missing/unknown rates. Analysts should examine the distribution of POA indicator values for their study period and population, and consider sensitivity analyses excluding or imputing U/W. The exemption list (~70 conditions as of 2024, including ICD-10-CM codes Z and certain injury mechanism codes) removes POA requirements for a subset of codes — these appear as \"1\" and should not be treated as N.\n\n**Diagnosis code position as an algorithm design lever**\n\nEvery position-based algorithm trades **specificity (PPV)** against **sensitivity**:\n\n- *Principal-only (or primary-only on professional claims):* the most specific rule. Restricts to the code recorded as the primary driver of the encounter. Minimizes false positives from rule-out, screening, and incidental findings. 
Biased toward more severe or inpatient-managed disease. Susceptible to DRG-optimization coding shifts for inpatient principal. Reference: `claims-outcome-algorithm-ppv-sensitivity-rwe` for the full methodological treatment.\n\n- *Any-position:* maximizes sensitivity — the condition is captured wherever it appears. Vulnerable to rule-out codes, \"history of\" codes, and codes entered for billing-completion purposes that do not reflect an active clinical condition. Comorbidity indices like Elixhauser typically use any-position secondary diagnoses (historically excluding principal to avoid double-counting with the outcome).\n\n- *POA-aware principal + POA = N secondary:* the most analytically refined variant for outcome ascertainment during an inpatient stay. A condition counts as an outcome if it is either (a) principal diagnosis (the main reason for admission — though this raises the timing question of whether it pre-dated admission) or (b) a secondary code with POA = N (confirmed new during stay). This two-part logic is the basis for hospital-acquired complication outcomes.\n\n**Pros, cons, and trade-offs**\n\n- **Principal position (institutional) vs first-listed position (professional).** Principal is UHDDS-defined and post-discharge assigned — high specificity for the confirmed diagnosis but lags admission and subject to DRG pressure. First-listed on professional claims is a loose convention with no statutory backing — lower specificity but broadly available. Never conflate the two in mixed-claim analyses.\n\n- **Any-position vs principal-only for comorbidity scoring.** Any-position raises comorbidity counts but allows copy-forward, rule-out, and incidental codes to inflate scores. Principal-only understates comorbidity but is clean. 
Most validated comorbidity indices (Elixhauser, Charlson-Deyo ICD-10) use secondary-position codes for comorbidities to keep the principal position available for the study condition — a design choice that must be preserved when adapting index code lists.\n\n- **POA-aware vs naïve comorbidity.** POA-aware (restricting comorbidity to POA = Y) reduces confounding from in-hospital complications being coded as comorbidities. The cost is that POA data are unavailable pre-2008, unavailable in some commercial datasets, and have quality variation by hospital. A naïve (no POA filter) comorbidity score is comparable across all data eras and sources; a POA-aware score is more valid but less portable.\n\n- **Admitting vs principal diagnosis for cohort entry.** Admitting (FL 69) preserves the clinical presentation; principal reflects the confirmed diagnosis. Using admitting for cohort entry risks admission-diagnosis misclassification — a patient admitted for \"rule-out PE\" with admitting dx = PE codes who is ultimately discharged with a different principal will be included in a PE cohort even though PE was ruled out. Prefer principal for condition-based cohort entry.\n\n- **UB-04 vs CMS-1500 for multi-site analysis.** Multi-database or multi-setting analyses that pool institutional and professional claims must reconcile (a) the absence of a principal-diagnosis concept on professional claims, (b) different maximum diagnosis code counts, and (c) different diagnosis pointer mechanics. A standardized data model (OMOP CDM) maps both to a common `condition_occurrence` table with a type concept that preserves the original position and form type — but the analyst must still know that position 1 means different things for IP vs. 
professional claims.\n\n**When to use**\n\nApply this field-semantics layer as the starting point before building any diagnosis-based algorithm in institutional or professional claims:\n- Cohort entry criteria (prefer principal for confirmed inpatient diagnosis; note admitting available for presentation studies)\n- Outcome ascertainment on inpatient claims (consider POA = N secondary codes for hospital-acquired outcomes; use `claims-outcome-algorithm-ppv-sensitivity-rwe` for the PPV/sensitivity design)\n- Comorbidity scoring (restrict to secondary positions; apply POA = Y filter when available and study period is post-2008; document truncation limits of the extract)\n- Multi-database studies (harmonize position semantics before pooling; document institutional vs professional claim mix)\n- Any study where sepsis, AKI, infection, or another DRG-sensitive condition is the outcome — document awareness of coding incentive pressure on the principal diagnosis\n\n**When NOT to use — and when this is actively misleading or dangerous**\n\n- **Do not treat \"primary diagnosis\" as synonymous with \"principal diagnosis.\"** Most research database vendors label dx1 as \"primary\" regardless of claim type. On professional claims, dx1 is first-listed, not UHDDS principal. On institutional claims, dx1 may be principal, but the vendor's label must be confirmed against the data dictionary. Treating a first-listed professional-claim code as if it has principal-diagnosis semantics overstates specificity.\n\n- **Do not ignore position when constructing comorbidity indices.** Applying the Elixhauser or Charlson-Deyo code list to all diagnosis positions — including the principal — risks flagging the outcome or cohort-entry condition as a comorbidity, double-counting it in the score. 
The original algorithm designs explicitly excluded certain positions; always check the original validation design before deviating.\n\n- **Do not use admitting diagnosis (FL 69) for condition-based case-finding.** Admitting diagnosis is the pre-workup clinical hypothesis and will include rule-out codes and presentation symptoms that do not represent confirmed diagnoses. Using it for cohort entry inflates case counts with suspected-but-ruled-out disease.\n\n- **Do not apply POA-based logic to data earlier than FY2008** (discharges before October 1, 2007) or to data sources where POA reporting is not required (many commercial payers and some state all-payer claims databases do not require POA). Treat apparent POA = N in non-mandatory-reporting data as missing, not as confirmed hospital-acquired.\n\n- **Do not assume secondary positions are equivalent across database vendors.** A database that stores 9 total diagnosis codes will systematically undercount comorbidities and secondary events relative to a 25-code database. A sensitivity analysis that truncates at 9 codes and compares to full-field results is required for any multi-database study where comorbidity scoring or secondary-position event capture is part of the design.\n\n**Data-source operational depth**\n\n*Claims (Medicare FFS / commercial):* UB-04-derived institutional claims carry FL 67 (principal), FL 69 (admitting), FL 70 (reason for visit on outpatient), POA indicators per code (FY2008+ inpatient only), and up to 25 diagnosis codes in total (principal plus up to 24 secondary) in current CMS standard extract formats (MedPAR, IP, OP SAF files). Professional claims (carrier / Part B) carry up to 12 codes in CMS-1500 Item 21 positions with no principal-diagnosis semantics. ResDAC data dictionaries document exact field names by file type (`PRNCPAL_DGNS_CD`, `ADMTG_DGNS_CD`, `RSN_VISIT_CD`, `CLM_POA_IND_SW1` through `CLM_POA_IND_SW25`). 
Medicare Advantage encounter data nominally carry the same fields but POA reporting quality is lower and encounter data submission requirements differ — treat MA-source POA data with additional caution.\n\n*EHR:* The EHR problem list and discharge summary carry discharge diagnoses in a structured or semi-structured form, and the admitting diagnosis is typically a separate clinical field. EHR \"primary\" or \"principal\" labels are not necessarily mapped to the UHDDS definition unless the billing interface enforces it. NLP on discharge summaries can recover clinical-priority ordering but requires validation. The EHR analog of POA is typically a condition onset field (new vs chronic vs resolved), not a billing indicator.\n\n*Linked claims-EHR:* The ideal substrate for validating position-based algorithms — EHR clinical detail (onset, confirmed vs. suspected, attending documentation) can serve as the reference standard for whether a given position-and-qualifier combination correctly identifies the condition being studied.",
    "primary_category": "Unknown",
    "tags": [
      "coding-system",
      "data-standard",
      "primitive",
      "claims",
      "diagnosis",
      "ub-04",
      "cms-1500",
      "principal-diagnosis",
      "admitting-diagnosis",
      "poa-indicator",
      "present-on-admission",
      "comorbidity",
      "icd-10-cm",
      "data-quality",
      "medicare"
    ],
    "applies_to_study_types": [
      "claims_analysis",
      "cohort_retrospective",
      "new_user",
      "active_comparator_new_user",
      "target_trial_emulation",
      "ehr_study"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1111/1475-6773.12239",
        "url": "https://doi.org/10.1111/1475-6773.12239",
        "citation_text": "Goldman LE, Chu PW, Bacchetti P, et al. Effect of Present-on-Admission (POA) Reporting Accuracy on Hospital Performance Assessments Using Risk-Adjusted Mortality. Health Services Research. 2015;50(4):922-938.",
        "year": 2015,
        "authors_short": "Goldman et al.",
        "notes": "Empirical validation study demonstrating that POA indicator reporting accuracy materially affects hospital quality rankings, and that hospitals with higher rates of unknown/missing POA indicators have biased mortality assessments — the definitive demonstration that POA data quality must be examined before any position-based complication or outcome algorithm is trusted.\n"
      },
      {
        "role": "explain",
        "doi": "10.1097/01.mlr.0000182534.19832.83",
        "url": "https://doi.org/10.1097/01.mlr.0000182534.19832.83",
        "citation_text": "Quan H, Sundararajan V, Halfon P, et al. Coding algorithms for defining comorbidities in ICD-9-CM and ICD-10 administrative data. Medical Care. 2005;43(11):1130-1139.",
        "year": 2005,
        "authors_short": "Quan et al.",
        "notes": "Canonical ICD-9-CM and ICD-10 code-set comorbidity algorithms demonstrating how diagnosis position and code-list selection jointly determine comorbidity scores — the reference implementation that most downstream Elixhauser and Charlson-Deyo adaptations follow; shows explicitly that secondary-position codes are used for comorbidities to separate them from the principal-diagnosis outcome.\n"
      },
      {
        "role": "demonstrate",
        "doi": "10.1016/j.ahj.2004.02.013",
        "url": "https://doi.org/10.1016/j.ahj.2004.02.013",
        "citation_text": "Kiyota Y, Schneeweiss S, Glynn RJ, Cannuscio CC, Avorn J, Solomon DH. Accuracy of Medicare claims-based diagnosis of acute myocardial infarction: estimating positive predictive value on the basis of review of hospital records. American Heart Journal. 2004;148(1):99-104.",
        "year": 2004,
        "authors_short": "Kiyota et al.",
        "notes": "Chart-review validation of Medicare claims-based AMI showing that PPV varies substantially by diagnosis position: primary (principal) position achieves high PPV, while any-position capture substantially lowers it — the empirical basis for the position-choice trade-off between specificity and sensitivity in outcome algorithms.\n"
      },
      {
        "role": "use",
        "doi": null,
        "url": "https://www.cms.gov/medicare/payment/fee-for-service-providers/hospital-aquired-conditions-hac",
        "citation_text": "Centers for Medicare and Medicaid Services. Hospital-Acquired Conditions (HAC). CMS.gov. Accessed 2024.",
        "year": 2024,
        "authors_short": "CMS",
        "notes": "Official CMS page documenting the Hospital-Acquired Conditions reduction program, the regulatory basis for the POA indicator, the list of HAC conditions, and the payment policy consequences of POA = N coding — the authoritative source for the statutory definition and exemption list.\n"
      },
      {
        "role": "use",
        "doi": null,
        "url": "https://resdac.org/cms-data/variables/admitting-diagnosis-code",
        "citation_text": "Research Data Assistance Center (ResDAC). Admitting Diagnosis Code. ResDAC CMS Variable Documentation. Accessed 2024.",
        "year": 2024,
        "authors_short": "ResDAC",
        "notes": "ResDAC variable-level documentation for the admitting diagnosis code field in CMS administrative data files, confirming the field-level semantics and describing how the ICD-9/10-CM code provided at the time of admission by the physician differs from the principal diagnosis assigned at discharge.\n"
      }
    ],
    "plain_language_summary": "On a hospital or doctor's office billing form, the same diagnosis can appear in several different fields, and which field it lands in matters a great deal for research. The 'principal diagnosis' is the main reason a patient was admitted to the hospital, as determined by medical coders after the workup is complete — and it drives the hospital's payment. The 'admitting diagnosis' is what the doctor suspected when the patient first arrived, before test results came back. The 'Present-on-Admission' indicator is a yes/no flag attached to each diagnosis that tells analysts whether that condition was already present when the patient checked in, or whether it developed during the stay. Getting these distinctions wrong leads to miscounted outcomes, inflated comorbidity scores, or cohort entry criteria that accidentally include patients who were only ruled out for the condition, not confirmed with it.",
    "key_terms": [
      {
        "term": "principal diagnosis",
        "definition": "The condition a hospital coder assigns at discharge as the main reason for the entire inpatient stay, after all test results and workup are reviewed — it can differ from what the doctor suspected when the patient arrived."
      },
      {
        "term": "admitting diagnosis",
        "definition": "The working clinical hypothesis the admitting physician wrote down when the patient first arrived at the hospital, before any inpatient workup was complete; it represents suspicion, not confirmed disease."
      },
      {
        "term": "POA indicator",
        "definition": "A code attached to each diagnosis on an inpatient hospital bill that says whether the condition was Present On Admission (Y = yes, already there; N = no, new during the stay); when it is N, the condition is a potential hospital-acquired complication."
      },
      {
        "term": "first-listed diagnosis",
        "definition": "The diagnosis a physician lists first on a professional (doctor's office or clinic) billing form; there is no statutory rule governing which condition must go first, so it is not the same as the hospital principal diagnosis."
      },
      {
        "term": "secondary diagnosis",
        "definition": "Any diagnosis coded on a claim below the first position — comorbid conditions, complications, or other findings that affected care during the encounter but were not the main reason for the admission or visit."
      },
      {
        "term": "reason for visit",
        "definition": "The presenting complaint or chief reason a patient came to an outpatient or emergency department visit, recorded in a separate field on the hospital's outpatient billing form and useful for studying why patients seek care."
      }
    ],
    "worked_example": {
      "scenario": "A 68-year-old Medicare patient is admitted through the emergency department on 2024-03-10 with a chief complaint of confusion and low blood pressure. The admitting physician suspects sepsis and documents that on FL 69. Over the next two days, blood cultures confirm bacterial sepsis, but the chart also reveals that the patient has long-standing type 2 diabetes (present before admission) and develops acute kidney injury on day 2 of the stay. At discharge, the coder assigns: principal diagnosis = sepsis (the confirmed reason for the stay), secondary dx2 = type 2 diabetes (POA = Y, pre-existing), secondary dx3 = acute kidney injury (POA = N, new during stay). We want to show how three algorithms — any-position, principal-only, and principal-plus-POA=N — give three different answers about whether this claim contributes an AKI event, a sepsis event, and a diabetes comorbidity count.\n",
      "dataset": {
        "caption": "UB-04 diagnosis fields for a single inpatient stay. claim_type = IP (inpatient). Discharge date 2024-03-13. The POA indicator column is populated per diagnosis code.\n",
        "columns": [
          "field",
          "ub04_locator",
          "dx_code",
          "description",
          "poa_indicator"
        ],
        "rows": [
          [
            "principal_dx",
            "FL 67",
            "A41.9",
            "Sepsis, unspecified organism",
            "N/A (principal dx exempt from POA indicator)"
          ],
          [
            "admitting_dx",
            "FL 69",
            "R65.20",
            "Severe sepsis without septic shock (suspected at admit)",
            "N/A (admitting dx field only)"
          ],
          [
            "secondary_dx2",
            "FL 67A",
            "E11.9",
            "Type 2 diabetes mellitus without complications",
            "Y (present on admission)"
          ],
          [
            "secondary_dx3",
            "FL 67B",
            "N17.9",
            "Acute kidney injury, unspecified",
            "N (new during stay — not present on admission)"
          ],
          [
            "reason_for_visit",
            "FL 70",
            "R41.3",
            "Other amnesia / altered mental status",
            "N/A (reason for visit field)"
          ]
        ]
      },
      "steps": [
        "Algorithm A (any-position): scan all diagnosis fields for AKI (N17.9). Found in secondary_dx3 → AKI = 1. Scan for diabetes (E11.x) → found in secondary_dx2 → diabetes comorbidity = 1. Scan for sepsis (A41.x) → found in principal_dx → sepsis = 1. All three conditions flagged.",
        "Algorithm B (principal-only): scan only the principal diagnosis field (FL 67) for AKI → not found → AKI = 0. Sepsis → found → sepsis = 1. Diabetes not in principal position → diabetes comorbidity = 0. Result: this claim contributes 1 sepsis event, 0 AKI events, 0 diabetes comorbidities.",
        "Algorithm C (principal + POA=N secondary for outcomes; POA=Y secondary for comorbidities): sepsis in principal → sepsis = 1. AKI in secondary with POA = N → AKI = 1 (a hospital-acquired complication). Diabetes in secondary with POA = Y → diabetes comorbidity = 1 (pre-existing). Result: this claim contributes 1 sepsis event, 1 AKI outcome (hospital-acquired), 1 diabetes comorbidity.",
        "Count verification for Algorithm C: sepsis events = 1 (principal count); AKI events = 1 (secondary POA=N count); diabetes comorbidities = 1 (secondary POA=Y count). Total distinct flags = 3. expr = 1 + 1 + 1 = 3.",
        "Compare: the any-position algorithm (A) finds AKI but cannot tell you it was hospital-acquired. The principal-only algorithm (B) misses AKI entirely and misses the diabetes comorbidity. The POA-aware algorithm (C) correctly identifies AKI as new during stay and diabetes as pre-existing — the most analytically precise result, but requires post-2007 inpatient data with reliable POA reporting."
      ],
      "result": "Algorithm A (any-position): sepsis = 1, AKI = 1, diabetes comorbidity = 1 — detects all conditions but cannot distinguish comorbidity from complication. Algorithm B (principal-only): sepsis = 1, AKI = 0, diabetes = 0 — highest specificity for the primary condition, misses comorbidities and complications. Algorithm C (principal + POA-aware): sepsis = 1, AKI = 1 (hospital-acquired), diabetes comorbidity = 1 (pre-existing). Total flags Algorithm C = 1 + 1 + 1 = 3. The admitting diagnosis (R65.20 severe sepsis — suspected) and reason for visit (R41.3 altered mental status) are neither used for outcome ascertainment nor comorbidity scoring in any of the three algorithms; they are retained for presentation-pattern and diagnostic-journey studies only.\n",
      "timeline_spec": {
        "title": "Single inpatient stay — three algorithms, three answers on AKI, sepsis, and diabetes",
        "window": {
          "start": "2024-03-10",
          "end": "2024-03-13",
          "label": "Inpatient stay (admission 2024-03-10, discharge 2024-03-13)"
        },
        "events": [
          {
            "label": "Admission — admitting dx: R65.20 (suspected sepsis)",
            "start": "2024-03-10",
            "length_days": 1,
            "quantity": "FL 69 (admitting dx)"
          },
          {
            "label": "Day 2 — AKI develops (new during stay, POA = N)",
            "start": "2024-03-11",
            "length_days": 1,
            "quantity": "secondary dx3 N17.9, POA = N"
          },
          {
            "label": "Discharge — coders assign principal dx A41.9 (confirmed sepsis)",
            "start": "2024-03-13",
            "length_days": 1,
            "quantity": "FL 67 principal dx"
          }
        ],
        "spans": [
          {
            "kind": "covered",
            "start": "2024-03-10",
            "end": "2024-03-12",
            "label": "Pre-existing diabetes E11.9 present throughout (POA = Y)"
          },
          {
            "kind": "gap",
            "start": "2024-03-11",
            "end": "2024-03-12",
            "label": "AKI onset during stay (POA = N) — new event"
          }
        ],
        "result": {
          "label": "Algorithm C flags: sepsis = 1 (principal), AKI = 1 (secondary POA=N), diabetes = 1 (secondary POA=Y). Total = 3.",
          "value": 3
        }
      }
    },
    "prerequisites": [
      "claims-analysis",
      "icd-10-cm-diagnosis-coding"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Principal-only (institutional) / first-listed-only (professional)",
        "description": "Restrict outcome or cohort-entry ascertainment to the first code position. On institutional claims this is the UHDDS principal diagnosis — high specificity, DRG-aligned, discharge-assigned. On professional claims this is the first-listed code — a billing convention with no statutory priority guarantee.\n",
        "edge_cases": [
          "On institutional claims, DRG-optimization pressure can shift the condition assigned as principal over time (sepsis-coding shift, MCC/CC upcoding) — apparent incidence changes in principal-position data may reflect coding practice changes rather than true clinical trends.\n",
          "On professional claims, first-listed position assignment is physician-determined and varies by specialty and billing software; it is not equivalent to the inpatient principal diagnosis and should not be treated as such in multi-setting analyses.\n"
        ],
        "data_source_notes": "claims: use PRNCPAL_DGNS_CD for UB-04 institutional; use first Item 21 code for CMS-1500 professional. Never pool without documenting the semantic difference.\n"
      },
      {
        "name": "Any-position (all diagnosis fields)",
        "description": "Count the condition if it appears anywhere in the diagnosis code set. Maximum sensitivity for conditions that may be coded in any position. Vulnerable to rule-out, history-of, and incidental codes. Standard starting point for comorbidity scans before applying position-based refinements.\n",
        "edge_cases": [
          "\"History of\" codes (ICD-10 Z85.x and similar) and personal/family history codes can appear in any position and should typically be excluded from active-condition ascertainment algorithms.\n",
          "Copy-forward coding, where a diagnosis from a prior stay is carried into secondary positions of a new stay without re-assessment, can inflate any-position comorbidity counts in longitudinal analyses.\n"
        ],
        "data_source_notes": "claims: scan all dx positions up to the maximum stored by the specific extract (9 in legacy, 25 in current CMS standard); document the limit and sensitivity-test at the vendor's storage maximum.\n"
      },
      {
        "name": "POA-aware (comorbidity vs complication separation)",
        "description": "Apply POA = Y to identify comorbidities (present at admission) and POA = N to identify hospital-acquired conditions or complications (new during stay). Requires inpatient UB-04 data from FY2008 or later with POA fields populated.\n",
        "edge_cases": [
          "POA = U (unknown) and POA = W (clinically undetermined) are neither comorbidity nor complication; exclude or sensitivity-test their inclusion. Rates of U/W vary by hospital size, type, and coding year.\n",
          "CMS exemption codes (POA indicator = \"1\") are not subject to POA requirements — do not treat \"1\" as equivalent to \"N\"; omit from POA-based filtering.\n",
          "POA reporting is not required for Medicare Advantage encounter data or for most commercial payer submissions; availability must be confirmed by dataset and study period before relying on POA for any outcome definition.\n"
        ],
        "data_source_notes": "claims: POA indicator fields in MedPAR are CLM_POA_IND_SW1 through CLM_POA_IND_SW25. Examine frequency of Y/N/U/W/1 values for your study population and years before applying POA filters; periods or hospitals with >5% U/W may require sensitivity analysis.\n"
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Any-position outcome ascertainment",
        "pros_of_this": "Principal-only yields higher PPV — the diagnosed condition drove the encounter, not a background comorbidity or rule-out entry; clean numerators for ratio estimands.",
        "cons_of_this": "Misses conditions managed during the stay but coded in secondary positions; misses outpatient-managed disease entirely; on professional claims, first-listed is not guaranteed to be the primary clinical reason.",
        "when_to_prefer": "Use principal-only for outcome ascertainment on inpatient claims where high specificity is required and the condition of interest should be the admitted reason. Use any-position for comorbidity scanning and sensitivity analyses."
      },
      {
        "compared_to": "POA-unaware secondary-position comorbidity scoring",
        "pros_of_this": "POA-aware comorbidity (POA = Y only) correctly excludes in-hospital complications from the baseline comorbidity burden, reducing confounding by severity; Liu et al. (2025) show improved mortality prediction.",
        "cons_of_this": "Requires FY2008+ inpatient data with reliable POA reporting; not available for outpatient claims, professional claims, or most commercial datasets; U/W codes require a handling decision.",
        "when_to_prefer": "Use POA-aware comorbidity for inpatient Medicare FFS studies from 2008+ where complication-vs-comorbidity distinction matters for confounding or outcomes. Fall back to POA-naïve when POA data are unavailable or of poor quality."
      },
      {
        "compared_to": "Admitting diagnosis for condition-based cohort entry",
        "pros_of_this": "Principal diagnosis (post-workup, discharge-assigned) represents confirmed disease rather than clinical hypothesis; dramatically reduces false-positive cohort entry from rule-out encounters.",
        "cons_of_this": "Unavailable until the claim is finalized and adjudicated; not available on professional claims; can be influenced by DRG-optimization coding shifts.",
        "when_to_prefer": "Use principal for condition-based cohort entry on inpatient claims. Use admitting only for studies explicitly modeling diagnostic journeys or pre-workup presentation patterns."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "UB-04 institutional CMS files: principal dx = PRNCPAL_DGNS_CD; admitting dx = ADMTG_DGNS_CD; reason for visit = RSN_VISIT_CD (outpatient/ED files); POA indicators = CLM_POA_IND_SW1 through CLM_POA_IND_SW25 (inpatient files only, FY2008+). Secondary diagnoses: ICD_DGNS_CD1 through ICD_DGNS_CD25. CMS-1500 professional carrier files: up to 12 dx codes (ICD_DGNS_CD1-12), no principal-dx concept. Always confirm field names against the ResDAC data dictionary for the specific CMS file type and year of extract. Document the maximum number of diagnosis code fields stored by your vendor and sensitivity-test comorbidity counts at that truncation limit.\n",
      "ehr": "EHR problem lists and discharge summaries carry diagnoses with clinical priority labels (active, resolved, chronic, acute) and onset/encounter flags that partially correspond to POA semantics but are not standardized. The billing interface typically maps the attending physician's \"working diagnosis\" and discharge diagnoses to claim fields, but the clinical-to-billing mapping should be validated before treating EHR problem-list position as equivalent to UB-04 principal. NLP on discharge summaries can recover discharge diagnosis text with higher fidelity than structured fields for free-text heavy documentation environments.\n",
      "linked": "Linked claims-EHR studies are the ideal substrate for validating position-based algorithms: EHR clinical onset dates and confirmed-vs-suspected flags can serve as the reference standard against which POA = Y/N, principal vs secondary, and admitting vs principal claim fields are evaluated. Reconcile claim service dates with EHR encounter dates before counting conditions relative to a study time zero.\n"
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\n\n# Target condition code sets (examples — replace with your validated code lists)\nSEPSIS_CODES   = (\"A40\", \"A41\")   # ICD-10 sepsis family\nAKI_CODES      = (\"N17\",)         # ICD-10 acute kidney injury\nDIABETES_CODES = (\"E10\", \"E11\", \"E13\")  # ICD-10 diabetes mellitus\n\ndef flag_condition(dx: pd.Series, code_prefixes: tuple) -> pd.Series:\n    return dx.str.startswith(code_prefixes)\n\ndef build_position_flags(dx_long: pd.DataFrame) -> pd.DataFrame:\n    \"\"\"\n    For each stay, compute three position-based algorithm results per target condition.\n\n    Algorithm A (any_position): condition found anywhere in the claim.\n    Algorithm B (principal_only): condition found only in the principal dx field.\n    Algorithm C (poa_aware): outcome = principal OR secondary with POA=N;\n                              comorbidity = secondary with POA=Y.\n    \"\"\"\n    d = dx_long.copy()\n    d[\"is_principal\"]    = d[\"dx_position\"] == \"principal\"\n    d[\"is_secondary\"]    = d[\"dx_position\"] == \"secondary\"\n    d[\"poa_y\"]           = d[\"poa_indicator\"] == \"Y\"\n    d[\"poa_n\"]           = d[\"poa_indicator\"] == \"N\"\n\n    results = []\n    conditions = {\n        \"sepsis\":   SEPSIS_CODES,\n        \"aki\":      AKI_CODES,\n        \"diabetes\": DIABETES_CODES,\n    }\n    for cond_name, code_pfx in conditions.items():\n        d[f\"has_{cond_name}\"] = flag_condition(d[\"dx_code\"], code_pfx)\n        c = d[d[f\"has_{cond_name}\"]]\n\n        # Algorithm A: any position\n        algo_a = c.groupby(\"stay_id\")[\"person_id\"].first().rename(\"person_id\").reset_index()\n        algo_a[\"condition\"] = cond_name; algo_a[\"algorithm\"] = \"any_position\"; algo_a[\"flag\"] = 1\n\n        # Algorithm B: principal only\n        algo_b = c[c[\"is_principal\"]].groupby(\"stay_id\")[\"person_id\"].first().rename(\"person_id\").reset_index()\n        algo_b[\"condition\"] = cond_name; algo_b[\"algorithm\"] = 
\"principal_only\"; algo_b[\"flag\"] = 1\n\n        # Algorithm C: principal OR secondary with POA=N (outcome); secondary with POA=Y (comorbidity)\n        outcome_c = c[c[\"is_principal\"] | (c[\"is_secondary\"] & c[\"poa_n\"])]\n        algo_c_out = outcome_c.groupby(\"stay_id\")[\"person_id\"].first().rename(\"person_id\").reset_index()\n        algo_c_out[\"condition\"] = cond_name; algo_c_out[\"algorithm\"] = \"poa_aware_outcome\"; algo_c_out[\"flag\"] = 1\n\n        comorbidity_c = c[c[\"is_secondary\"] & c[\"poa_y\"]]\n        algo_c_cmb = comorbidity_c.groupby(\"stay_id\")[\"person_id\"].first().rename(\"person_id\").reset_index()\n        algo_c_cmb[\"condition\"] = cond_name; algo_c_cmb[\"algorithm\"] = \"poa_aware_comorbidity\"; algo_c_cmb[\"flag\"] = 1\n\n        results.extend([algo_a, algo_b, algo_c_out, algo_c_cmb])\n\n    out = pd.concat(results, ignore_index=True)\n    # Pivot to compare algorithm counts by condition\n    summary = out.groupby([\"condition\", \"algorithm\"])[\"stay_id\"].count().unstack(\"algorithm\", fill_value=0)\n    return summary\n\n# --- Worked example verification ---\n# Single stay from the worked example; note: admitting_dx and reason_for_visit rows are\n# excluded (they are not dx_position = principal or secondary in the long-form input)\ndx_example = pd.DataFrame([\n    {\"person_id\": 9001, \"stay_id\": \"S1\", \"dx_position\": \"principal\",  \"dx_code\": \"A41.9\", \"poa_indicator\": None},\n    {\"person_id\": 9001, \"stay_id\": \"S1\", \"dx_position\": \"secondary\",  \"dx_code\": \"E11.9\", \"poa_indicator\": \"Y\"},\n    {\"person_id\": 9001, \"stay_id\": \"S1\", \"dx_position\": \"secondary\",  \"dx_code\": \"N17.9\", \"poa_indicator\": \"N\"},\n])\nsummary = build_position_flags(dx_example)\nprint(summary)\n# Expected for stay S1:\n#   sepsis: any_position=1, principal_only=1, poa_aware_outcome=1, poa_aware_comorbidity=0\n#   aki:    any_position=1, principal_only=0, poa_aware_outcome=1, 
poa_aware_comorbidity=0\n#   diabetes: any_position=1, principal_only=0, poa_aware_outcome=0, poa_aware_comorbidity=1\n# Total poa_aware_outcome flags = 1 (sepsis) + 1 (aki) = 2\n# Total poa_aware_comorbidity flags = 1 (diabetes)\n# Grand total distinct algorithm-C flags = 2 + 1 = 3  -> matches worked example\n# Check the computed summary itself rather than a constant expression\ntotal_c = int(summary[\"poa_aware_outcome\"].sum() + summary[\"poa_aware_comorbidity\"].sum())\nassert total_c == 3, \"Algorithm C total flags must equal 3\"",
        "description": "Build three diagnosis-position flags for each inpatient claim: (1) principal_flag\n(condition in principal position), (2) secondary_any_flag (condition in any secondary\nposition), (3) secondary_poa_n_flag (condition in secondary with POA = N — new during\nstay). Also builds a POA-aware comorbidity count (secondary with POA = Y).\nDemonstrates the any-position vs principal-only vs POA-aware outcome count comparison\nfrom the worked example.\n\nRequired input (one row per diagnosis code, after unpivoting wide claims to long):\n  dx_long : person_id, stay_id, dx_position ('principal'/'secondary'), dx_code,\n            poa_indicator (Y/N/U/W/1/None), discharge_date (date)\nReturns summary_by_stay with counts for each of the three algorithms.",
        "dependencies": [
          "pandas"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\n\nSEPSIS_RE   <- \"^A4[01]\"\nAKI_RE      <- \"^N17\"\nDIABETES_RE <- \"^E1[013]\"\n\nbuild_position_flags <- function(dx_long) {\n  setDT(dx_long)\n  d <- dx_long[, .(\n    person_id, stay_id, dx_code,\n    is_principal    = dx_position == \"principal\",\n    is_secondary    = dx_position == \"secondary\",\n    poa_y           = poa_indicator == \"Y\",\n    poa_n           = poa_indicator == \"N\"\n  )]\n\n  conditions <- list(\n    sepsis   = SEPSIS_RE,\n    aki      = AKI_RE,\n    diabetes = DIABETES_RE\n  )\n\n  build_alg <- function(cond_re, cond_name) {\n    c <- d[grepl(cond_re, dx_code)]\n\n    # Algorithm A: any position\n    a <- c[, .(person_id = person_id[1L], algorithm = \"any_position\"), by = stay_id]\n    # Algorithm B: principal only\n    b <- c[is_principal == TRUE, .(person_id = person_id[1L], algorithm = \"principal_only\"), by = stay_id]\n    # Algorithm C: outcome = principal OR secondary-POA=N\n    c_out <- c[(is_principal) | (is_secondary & poa_n),\n               .(person_id = person_id[1L], algorithm = \"poa_aware_outcome\"), by = stay_id]\n    # Algorithm C: comorbidity = secondary-POA=Y\n    c_cmb <- c[is_secondary & poa_y,\n               .(person_id = person_id[1L], algorithm = \"poa_aware_comorbidity\"), by = stay_id]\n\n    out <- rbindlist(list(a, b, c_out, c_cmb), fill = TRUE)\n    out[, condition := cond_name]\n    out\n  }\n\n  res <- rbindlist(mapply(build_alg, conditions, names(conditions), SIMPLIFY = FALSE))\n  # Summary: count stays flagged per condition x algorithm\n  res[, .N, by = .(condition, algorithm)]\n}\n\n# --- Worked example verification ---\ndx_ex <- data.table(\n  person_id    = rep(9001L, 3),\n  stay_id      = rep(\"S1\", 3),\n  dx_code      = c(\"A41.9\", \"E11.9\", \"N17.9\"),\n  dx_position  = c(\"principal\", \"secondary\", \"secondary\"),\n  poa_indicator = c(NA_character_, \"Y\", \"N\")\n)\nsummary_tbl <- build_position_flags(dx_ex)\nprint(summary_tbl)\n# Expected: 
algorithm C total distinct flags = 1 (sepsis, poa_aware_outcome)\n#                                             + 1 (aki, poa_aware_outcome)\n#                                             + 1 (diabetes, poa_aware_comorbidity)\n#           = 3  (matches worked example)",
        "description": "R (data.table) version of the same three-algorithm position-flag builder. Inputs mirror\nthe Python version: dx_long with columns person_id, stay_id, dx_position, dx_code,\npoa_indicator (character Y/N/U/W/1/NA). Returns a summary table of stay counts per\ncondition and algorithm, enabling direct comparison of any-position vs principal-only\nvs POA-aware outcome counts.",
        "dependencies": [
          "data.table"
        ],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  UB04[\"UB-04 Institutional Claim<br/>(Hospital IP / OP / ED)\"]\n  CMS1500[\"CMS-1500 Professional Claim<br/>(Physician / Outpatient Practice)\"]\n\n  UB04 --> FL67[\"FL 67 — Principal Diagnosis<br/>UHDDS: confirmed at discharge<br/>drives DRG payment\"]\n  UB04 --> FL69[\"FL 69 — Admitting Diagnosis<br/>Pre-workup clinical hypothesis<br/>Use for presentation studies only\"]\n  UB04 --> FL70[\"FL 70 — Reason for Visit<br/>Presenting complaint (OP/ED only)<br/>Independent of coded diagnoses\"]\n  UB04 --> SECDX[\"FL 67A–Q — Secondary Diagnoses<br/>up to 24 codes, comorbidities and complications\"]\n  SECDX --> POA[\"POA Indicator per code<br/>Y = present at admission (comorbidity)<br/>N = new during stay (complication)<br/>U/W = unknown/undetermined<br/>1 = CMS exemption\"]\n  CMS1500 --> ITEM21[\"Item 21 — up to 12 diagnosis codes<br/>Position A = first-listed (convention only)<br/>No UHDDS principal dx concept\"]\n  ITEM21 --> DXPTR[\"Diagnosis Pointer per service line<br/>Links procedure to diagnosis for<br/>medical necessity adjudication\"]\n\n  FL67 --> ALG[\"Algorithm Design Choice\"]\n  SECDX --> ALG\n  POA --> ALG\n  ITEM21 --> ALG\n\n  ALG --> ANYPOS[\"Any-position:<br/>max sensitivity, lower PPV\"]\n  ALG --> PRINCONLY[\"Principal-only:<br/>max specificity, lower sensitivity\"]\n  ALG --> POAAWARE[\"POA-aware:<br/>comorbidity = secondary POA=Y<br/>complication = secondary POA=N\"]",
        "caption": "UB-04 and CMS-1500 diagnosis field map and the three algorithm design choices that flow from position and qualifier selection. Principal diagnosis is a regulatory, post-discharge UHDDS concept unique to the institutional claim; it does not exist on professional claims. POA indicators are available only for inpatient institutional claims from FY2008 onward.\n",
        "alt_text": "Flowchart showing UB-04 fields (principal FL67, admitting FL69, reason for visit FL70, secondary diagnoses with POA indicators) and CMS-1500 fields (Item 21 first-listed codes with diagnosis pointers), converging on three algorithm design choices: any-position, principal-only, and POA-aware.\n",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "part_of",
        "target_slug": "ub-04-institutional-claim-fields",
        "notes": "Principal (FL 67), admitting (FL 69), reason for visit (FL 70), and secondary diagnoses with POA indicators are all subfields of the UB-04 institutional claim; this entry documents their semantic distinctions in detail.\n"
      },
      {
        "relation_type": "part_of",
        "target_slug": "cms-1500-professional-claim-fields",
        "notes": "Item 21 first-listed and secondary diagnosis codes and the diagnosis pointer are the corresponding diagnosis fields on the CMS-1500 professional claim; there is no principal diagnosis concept on this form.\n"
      },
      {
        "relation_type": "used_with",
        "target_slug": "icd-10-cm-diagnosis-coding",
        "notes": "All UB-04 and CMS-1500 diagnosis positions are populated with ICD-9-CM or ICD-10-CM codes; the semantic rules governing those codes (specificity, combination codes, sequencing guidelines) interact with position-based algorithm design.\n"
      },
      {
        "relation_type": "used_with",
        "target_slug": "ms-drg-classification",
        "notes": "The principal diagnosis (FL 67) is the primary driver of MS-DRG assignment and therefore inpatient payment; DRG optimization creates systematic coding incentives that affect the reliability of principal-position data for epidemiologic inference.\n"
      },
      {
        "relation_type": "used_with",
        "target_slug": "diagnosis-phenotype-algorithm-1ip-2op-time-window-rwe",
        "notes": "The 1 IP / 2 OP phenotype rule operates on top of this primitive — it uses either principal or any-position codes depending on the variant chosen, and the position choice directly drives the PPV/sensitivity trade-off of the resulting phenotype.\n"
      },
      {
        "relation_type": "used_with",
        "target_slug": "claims-outcome-algorithm-ppv-sensitivity-rwe",
        "notes": "The position choice (principal-only vs any-position) is the central PPV-sensitivity algorithm design lever; this primitive defines what \"principal\" and \"any-position\" mean operationally so that the trade-off entry can focus on the measurement consequences.\n"
      },
      {
        "relation_type": "used_with",
        "target_slug": "elixhauser-comorbidity-index-rwe",
        "notes": "POA-aware Elixhauser scoring restricts comorbidity flags to secondary diagnoses with POA = Y, preventing in-hospital complications from inflating the pre-admission comorbidity burden used in risk adjustment.\n"
      },
      {
        "relation_type": "used_with",
        "target_slug": "algorithm-validation",
        "notes": "Position-based algorithm choices are primary targets for validation studies that compare algorithm-derived diagnoses against medical-record adjudicated truth; PPV and sensitivity should be reported by position variant.\n"
      }
    ],
    "aliases": [
      "principal diagnosis",
      "admitting diagnosis",
      "primary diagnosis",
      "POA indicator",
      "present on admission",
      "first-listed diagnosis",
      "reason for visit"
    ],
    "complexity": "foundational",
    "regulatory_relevance": [
      "fda",
      "cms",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "diagnostic-accuracy",
    "name": "Diagnostic Accuracy Study",
    "short_definition": "A study design that quantifies how well an index test or case-finding algorithm classifies disease status by comparing its results, in subjects drawn from a clinically relevant spectrum, against an independent reference standard, summarized as sensitivity, specificity, predictive values, and likelihood ratios.",
    "long_description": "A **diagnostic accuracy study** measures the agreement between an *index test* (a\nbiomarker, imaging read, clinical rule, or — in real-world evidence — a claims/EHR\ncase-finding algorithm) and a *reference standard* that is taken as the best available\nclassification of true disease status. Each subject contributes a 2x2 cross-classification\n(index positive/negative x reference positive/negative), and the design summarizes the\ntest's operating characteristics: **sensitivity** (Pr[index+ | disease+]),\n**specificity** (Pr[index- | disease-]), **positive and negative predictive value**\n(PPV/NPV), and the **positive/negative likelihood ratios** (LR+ = sens/(1-spec),\nLR- = (1-sens)/spec). It is the foundational design behind validating any RWE\nphenotype before that phenotype is trusted to define an exposure or outcome.\n\n**Core conceptual distinction.** The single most consequential teaching point is that\n**sensitivity and specificity are properties of the test conditional on true disease\nstatus and are (to first order) invariant to disease prevalence, whereas PPV and NPV\nare prevalence-dependent** and therefore travel poorly across populations. A claims\nalgorithm validated at 90% PPV in a high-prevalence specialty registry can collapse to\n50% PPV in a low-prevalence general population even though its sensitivity and specificity\nare unchanged — PPV = (sens x prev) / (sens x prev + (1-spec) x (1-prev)). Likelihood\nratios are the prevalence-stable bridge: they update pre-test odds to post-test odds via\nBayes (post-test odds = pre-test odds x LR), so a single LR set transports to any\nprevalence. 
The second distinction is the **estimand under the sampling scheme**: a\ncohort/cross-sectional sample (consecutive eligible subjects) identifies sensitivity,\nspecificity, PPV, and NPV directly; a **case-control (two-gate) sample** — selecting\nknown cases and known non-cases — identifies sensitivity and specificity but *not* PPV/NPV,\nbecause the case:non-case ratio is fixed by design rather than reflecting prevalence. A\nthird distinction separates a diagnostic accuracy study (test vs reference, no follow-up\nneeded) from a prognostic/predictive study (baseline test vs future outcome over time).\n\n**Pros, cons, and trade-offs.**\n- **vs validating a phenotype by PPV alone (chart-review of algorithm-positives only):**\n  The full diagnostic accuracy study estimates sensitivity and specificity, which a\n  PPV-only review cannot — you cannot detect under-capture (false negatives) by reviewing\n  only test-positives. Cost: estimating sensitivity requires sampling the *reference-positive*\n  or test-negative space, which is expensive when disease is rare. **Prefer the full study**\n  when the algorithm defines an outcome whose completeness matters (e.g., differential\n  misclassification across arms); **a focused high-PPV review** can suffice when the\n  algorithm only needs to confirm cases for a positive predictive purpose\n  (see claims-outcome-algorithm-ppv-sensitivity-rwe).\n- **vs misclassification bias correction (quantitative adjustment of the effect estimate):**\n  A diagnostic accuracy study *produces the bias parameters* (sensitivity/specificity, ideally\n  by arm) that misclassification-bias-correction and quantitative-bias-analysis methods consume.\n  The accuracy study is descriptive of the measurement; the correction propagates that\n  measurement uncertainty into the comparative estimate. 
**Use them together** — an accuracy\n  study without a downstream correction leaves known bias on the table.\n- **vs treating the index test as a gold standard (no validation):** Default RWE practice\n  often assumes the algorithm is perfect. That is defensible only for highly specific\n  procedure/anchor codes; for most diagnosis-based phenotypes, unvalidated use risks\n  non-differential bias toward the null (or, worse, differential bias of unknown direction).\n  **Prefer at least a validation substudy** whenever the algorithm is novel, transported to\n  a new data source, or central to the primary estimand.\n\n**When to use.** Validating a new claims/EHR case-finding algorithm before deploying it as\nan exposure or outcome definition; benchmarking a biomarker, score, or imaging read against a\nreference standard; supporting a HEDIS/PQA-style quality measure or an FDA fit-for-purpose\ndata assessment that requires documented outcome misclassification parameters; quantifying the\nbias parameters that a downstream quantitative bias analysis will use; comparing two competing\nalgorithms head-to-head on the same validation sample (paired design, McNemar-based comparison).\n\n**When NOT to use — and when it is actively misleading or dangerous.**\n- **No genuinely independent reference standard exists.** If the reference is built partly from\n  the same data feeding the index test (e.g., chart review by an adjudicator who can see the\n  billing codes the algorithm used), **incorporation bias** inflates apparent accuracy. The\n  reference must be ascertained blind to the index result.\n- **The reference standard is itself imperfect (no true gold standard).** Comparing an index\n  test to a noisy reference can make a *better* index test look *worse* when the two err on\n  different subjects; naive 2x2 accuracy is biased and an imperfect-reference (latent-class or\n  explicit-correction) model is required. 
Reporting raw sensitivity/specificity against a known-bad\n  reference is misleading.\n- **Reference verification depends on the index test result (verification / work-up bias).**\n  If only index-positive (or sicker-looking) subjects get the reference standard — the norm\n  when chart review is triggered by the algorithm firing — uncorrected sensitivity is\n  overestimated and specificity underestimated. This is the single most common, and most\n  dangerous, error in RWE phenotype validation; it requires a Begg-Greenes-type correction or a\n  two-stage design with known sampling probabilities.\n- **Spectrum mismatch.** Accuracy estimated in a referred, severe, or clinically extreme\n  sample (clear cases vs healthy controls) does not transport to the borderline,\n  early-stage, comorbid patients on whom the test is actually used (**spectrum bias**). A test\n  that looks excellent in a two-gate case-control sample can be useless in consecutive practice.\n- **Reporting PPV/NPV from a case-control sample, or transporting PPV across prevalences.**\n  Both are formally invalid; report sensitivity/specificity/LRs and re-derive predictive values\n  at the target prevalence instead.\n\n**Data-source operational depth.**\n- **Claims (FFS vs MA vs commercial):** The index test is the algorithm (e.g., 1 inpatient OR\n  2 outpatient diagnosis codes >=30 days apart within a defined window). The reference standard is\n  usually chart review obtained via linkage, so accuracy can be estimated only on the *linkable*\n  subset — a selection that must be argued not to be differential. 
Medicare Advantage encounter\n  data historically under-captures procedures and pharmacy relative to fee-for-service, so an\n  algorithm validated in FFS can have different sensitivity in MA person-time; validate within\n  benefit type and never pool blindly (see medicare-ffs-ma-commercial-claims-differences-rwe).\n  Code-set version drift (ICD-9 to ICD-10) silently changes sensitivity over calendar time.\n- **EHR:** A richer reference is available in-system (labs, pathology, notes via NLP), but\n  capture is visit-driven and fragmented across systems, so a \"false negative\" may be care\n  delivered elsewhere rather than true absence of disease — define the observation window and\n  treat out-of-system care as potential misclassification, not truth.\n- **Registry:** Often the *reference standard itself* (adjudicated cancer stage, confirmed MI),\n  making it ideal for validating claims/EHR algorithms — but registries capture a selected,\n  often more severe spectrum, so accuracy can be optimistic relative to community practice.\n- **Linked claims-EHR-registry:** The strongest substrate (algorithm from claims, reference\n  from registry/EHR, completeness from linkage) but introduces linkage selection and date\n  discrepancies between service, fill, and adjudication dates that must be reconciled before the\n  2x2 is built.\n\n**Worked claims example.** Goal: validate a claims algorithm for acute myocardial infarction\n(AMI) to use it as a study outcome. (1) Source and eligibility: adults with >=12 months of\ncontinuous Medicare FFS Parts A/B enrollment (so absence of a code is observed, not missing) and\nno AMI code in a 12-month clean washout, so captured events are incident. (2) Index test\n(algorithm): a principal-position ICD-10-CM I21.x on an inpatient claim with length of stay >=1 day —\na common high-specificity AMI rule. 
(3) Reference standard: hospital-chart review against the\nFourth Universal Definition of MI (troponin + clinical criteria), adjudicated by reviewers\n**blinded to the billing codes** to avoid incorporation bias. (4) Sampling: because chart pulls\nare costly, use a **two-stage stratified design** — review all (or a known fraction f1 of)\nalgorithm-positives to estimate PPV, AND a known fraction f2 of a *reference-positive sampling\nframe* built from troponin-lab-flagged admissions to estimate sensitivity; record f1 and f2 so\nestimates can be reweighted to the cohort. (5) Build the 2x2 on the linkable, chart-available\nsubset, **weighting by inverse sampling fraction** so cells reflect the source cohort, not the\nreview sample. (6) Report sensitivity and specificity (prevalence-stable) with exact\n(Clopper-Pearson) 95% CIs, plus PPV/NPV **at the cohort's own AMI prevalence**, and LR+ / LR-.\n(7) Flag verification bias explicitly: because the reference was preferentially obtained where\ntroponin was drawn (correlated with the index test firing), apply a Begg-Greenes correction and\nreport the corrected sensitivity/specificity. (8) Carry the resulting (sens, spec) — ideally\nestimated separately by exposure arm — into a misclassification-bias-correction step so the final\ncomparative effect is adjusted for outcome misclassification rather than reported as if the\nalgorithm were perfect.",
    "primary_category": "Study_Design",
    "tags": [
      "sensitivity",
      "specificity",
      "predictive-value",
      "likelihood-ratio",
      "algorithm-validation",
      "phenotype-validation",
      "verification-bias",
      "spectrum-bias",
      "reference-standard"
    ],
    "applies_to_study_types": [
      "diagnostic_accuracy"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1136/bmj.324.7336.539",
        "url": "https://doi.org/10.1136/bmj.324.7336.539",
        "citation_text": "Sackett DL, Haynes RB. The architecture of diagnostic research. BMJ. 2002;324(7336):539-541.",
        "year": 2002,
        "authors_short": "Sackett & Haynes",
        "notes": "Foundational framing of diagnostic research phases, the index-test-vs-reference-standard structure, and the distinction between accuracy and downstream clinical impact."
      },
      {
        "role": "explain",
        "doi": "10.1056/NEJM197810262991705",
        "url": "https://doi.org/10.1056/NEJM197810262991705",
        "citation_text": "Ransohoff DF, Feinstein AR. Problems of spectrum and bias in evaluating the efficacy of diagnostic tests. New England Journal of Medicine. 1978;299(17):926-930.",
        "year": 1978,
        "authors_short": "Ransohoff & Feinstein",
        "notes": "Classic exposition of spectrum bias — why sensitivity/specificity estimated in a case-control spectrum do not transport to consecutive practice."
      },
      {
        "role": "explain",
        "doi": "10.2307/2530820",
        "url": "https://doi.org/10.2307/2530820",
        "citation_text": "Begg CB, Greenes RA. Assessment of diagnostic tests when disease verification is subject to selection bias. Biometrics. 1983;39(1):207-215.",
        "year": 1983,
        "authors_short": "Begg & Greenes",
        "notes": "Derives the correction for verification (work-up) bias when reference ascertainment depends on the index test result — the dominant bias in claims phenotype validation."
      },
      {
        "role": "explain",
        "doi": "10.1136/bmj.329.7458.168",
        "url": "https://doi.org/10.1136/bmj.329.7458.168",
        "citation_text": "Deeks JJ, Altman DG. Diagnostic tests 4: likelihood ratios. BMJ. 2004;329(7458):168-169.",
        "year": 2004,
        "authors_short": "Deeks & Altman",
        "notes": "Concise treatment of likelihood ratios as the prevalence-stable bridge from pre-test to post-test probability via Bayes."
      },
      {
        "role": "demonstrate",
        "doi": "10.7326/0003-4819-155-8-201110180-00009",
        "url": "https://doi.org/10.7326/0003-4819-155-8-201110180-00009",
        "citation_text": "Whiting PF, Rutjes AWS, Westwood ME, et al. QUADAS-2: a revised tool for the quality assessment of diagnostic accuracy studies. Annals of Internal Medicine. 2011;155(8):529-536.",
        "year": 2011,
        "authors_short": "Whiting et al.",
        "notes": "Standard risk-of-bias instrument; its four domains (patient selection, index test, reference standard, flow/timing) map directly to spectrum, incorporation, and verification bias."
      },
      {
        "role": "use",
        "doi": "10.1136/bmj.h5527",
        "url": "https://doi.org/10.1136/bmj.h5527",
        "citation_text": "Bossuyt PM, Reitsma JB, Bruns DE, et al. STARD 2015: an updated list of essential items for reporting diagnostic accuracy studies. BMJ. 2015;351:h5527.",
        "year": 2015,
        "authors_short": "Bossuyt et al.",
        "notes": "Reporting standard requiring the 2x2 counts, sampling/flow diagram, and handling of indeterminate and missing reference results."
      }
    ],
    "plain_language_summary": "A diagnostic accuracy study asks a simple question: how often does a test (or a claims/EHR rule that flags who has a disease) get the right answer when you check it against a trusted reference, like a chart review? You sort every patient into one of four buckets — correctly flagged sick, wrongly flagged sick, correctly cleared, wrongly cleared — and count them in a 2x2 table. From those four counts you can describe the test in plain terms: how many truly-sick people it catches, and how many truly-healthy people it correctly clears. One honest caveat: a single 'percent correct' number can look great and still hide a test that misses almost everyone who is actually sick.",
    "key_terms": [
      {
        "term": "index test",
        "definition": "The thing being checked — here, a claims or EHR rule that decides whether a patient has the disease."
      },
      {
        "term": "reference standard",
        "definition": "The trusted answer you grade the index test against, such as a blinded chart review."
      },
      {
        "term": "2x2 confusion table",
        "definition": "A four-cell table that cross-tabs the test result (positive/negative) against the truth (disease/no disease)."
      },
      {
        "term": "sensitivity",
        "definition": "Among people who truly have the disease, the share the test correctly flags as positive."
      },
      {
        "term": "specificity",
        "definition": "Among people who truly do not have the disease, the share the test correctly clears as negative."
      },
      {
        "term": "class imbalance",
        "definition": "When one group (usually the healthy majority) hugely outnumbers the other, so a lazy test can look accurate just by siding with the big group."
      }
    ],
    "worked_example": {
      "scenario": "We built a claims rule to flag patients with a rare disease, and we want to know how good it is. We took 1,000 patients, ran the rule on each one (that is our index test), and also got a blinded chart review on every patient (that is our reference standard, treated as the truth). The disease is uncommon: only 50 of the 1,000 patients truly have it. We will sort all 1,000 into a 2x2 table, compute the overall accuracy, and then show why that single number can fool us.",
      "dataset": {
        "caption": "The 2x2 confusion table an analyst would build: index test (claims rule) result crossed against the reference standard (chart review). Each patient lands in exactly one cell, and the four cells sum to N = 1,000.",
        "columns": [
          "",
          "Reference: disease",
          "Reference: no disease",
          "Row total"
        ],
        "rows": [
          [
            "Index rule: positive",
            "TP = 40",
            "FP = 95",
            "135"
          ],
          [
            "Index rule: negative",
            "FN = 10",
            "TN = 855",
            "865"
          ],
          [
            "Column total",
            "50",
            "950",
            "1000"
          ]
        ]
      },
      "steps": [
        "Read the four cells. TP = 40 patients the rule flagged who truly had the disease. TN = 855 the rule cleared who truly did not. FP = 95 the rule wrongly flagged. FN = 10 truly-sick patients the rule missed.",
        "Confirm the table closes: TP + FP + FN + TN = 40 + 95 + 10 + 855 = 1000 = N. The 50 truly diseased (TP + FN) and 950 truly healthy (FP + TN) match the column totals.",
        "Compute overall accuracy = the share of patients the rule got right = (TP + TN) / N = (40 + 855) / 1000 = 895 / 1000 = 0.895.",
        "That 89.5% sounds strong, but the disease is rare — only 5% of patients have it (class imbalance). Picture a lazy rule that simply calls EVERY patient negative. It would be right on all 950 healthy patients and wrong only on the 50 sick ones, giving accuracy = 950 / 1000 = 0.95 — HIGHER than our real rule, while catching zero true cases.",
        "So accuracy alone hides the failure mode. Split it into the two numbers that do not get fooled by the rare/common mix. Sensitivity = among the truly sick, the share caught = TP / (TP + FN) = 40 / 50 = 0.80. Specificity = among the truly healthy, the share cleared = TN / (TN + FP) = 855 / 950 = 0.90.",
        "Now the picture is honest: the rule catches 80% of true cases and correctly clears 90% of non-cases. The lazy 'all-negative' rule would have specificity 1.00 but sensitivity 0.00 — and accuracy alone would have rewarded it. Always report sensitivity and specificity, not just percent correct."
      ],
      "result": "Overall accuracy = (TP + TN) / N = (40 + 855) / 1000 = 0.895 (89.5%). Sensitivity = 40 / 50 = 0.80 (80%); specificity = 855 / 950 = 0.90 (90%). The catch: because only 5% of patients are diseased, a useless rule that calls everyone negative scores accuracy = 950 / 1000 = 0.95 — beating our real rule on accuracy while having sensitivity = 0. Under class imbalance, accuracy is misleading; sensitivity and specificity are not."
    },
    "prerequisites": [
      "sensitivity-specificity-rwe",
      "ppv-npv-rwe",
      "algorithm-validation"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Cohort (single-gate) diagnostic accuracy study",
        "description": "A consecutive or random sample of eligible subjects receives both the index test and the reference standard; identifies sensitivity, specificity, PPV, and NPV directly at the sample's prevalence.",
        "edge_cases": [
          "Rare disease makes precise sensitivity estimation expensive (few reference-positives); enrich with a known sampling fraction and reweight rather than discarding the cohort structure.",
          "Indeterminate index or reference results must be reported, not silently dropped (STARD item)."
        ],
        "data_source_notes": "claims/EHR: consecutive admissions or encounters in the validation window with linkage to chart review; record continuous enrollment so test-negatives are true negatives, not unobserved care."
      },
      {
        "name": "Case-control (two-gate) diagnostic accuracy study",
        "description": "Known cases and known non-cases are sampled separately; valid for sensitivity and specificity but PPV/NPV are not identifiable because the case mix is fixed by design.",
        "edge_cases": [
          "Selecting clear cases vs healthy controls induces spectrum bias and optimistic accuracy.",
          "Do not report predictive values; re-derive them at the target prevalence from sens/spec or LRs."
        ],
        "data_source_notes": "Common when a registry supplies confirmed cases and a separate frame supplies controls; explicitly state that prevalence does not reflect any real population."
      },
      {
        "name": "Two-stage / verification-bias-corrected validation",
        "description": "When the reference standard is obtained only for a subset (often triggered by the index test), use known sampling probabilities and a Begg-Greenes-type correction (or inverse-probability weighting) to recover unbiased sensitivity/specificity.",
        "edge_cases": [
          "Requires that the verification mechanism be known/modelable; if verification depends on unmeasured factors beyond the index test, the correction is only partial."
        ],
        "data_source_notes": "claims: chart review is frequently pulled for algorithm-positives only — estimate sensitivity by separately sampling a reference-positive frame (e.g., lab-flagged events)."
      },
      {
        "name": "Imperfect-reference-standard accuracy",
        "description": "When no true gold standard exists, estimate accuracy with latent-class models or an explicit correction for the reference standard's own sensitivity/specificity.",
        "edge_cases": [
          "Latent-class models rest on conditional-independence assumptions between tests that are often violated; report sensitivity of conclusions to those assumptions."
        ],
        "data_source_notes": "Relevant when the \"reference\" is itself a claims-based or adjudication-based proxy rather than a definitive test."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "PPV-only validation of a claims/EHR algorithm",
        "pros_of_this": "Estimates sensitivity and specificity, so under-capture (false negatives) and completeness are quantified, not assumed.",
        "cons_of_this": "Estimating sensitivity requires sampling the reference-positive or test-negative space, which is costly when disease is rare.",
        "when_to_prefer": "When the algorithm defines an outcome whose completeness or differential capture across arms affects the primary estimand."
      },
      {
        "compared_to": "Misclassification bias correction of the comparative estimate",
        "pros_of_this": "Produces the bias parameters (sens/spec, ideally by arm) that the correction needs; it is the measurement step, not the adjustment step.",
        "cons_of_this": "Descriptive only — it does not, by itself, adjust the comparative effect estimate.",
        "when_to_prefer": "Always pair them; run the accuracy study first, then propagate its parameters through a quantitative bias analysis."
      },
      {
        "compared_to": "Treating the algorithm as a perfect gold standard",
        "pros_of_this": "Documents and bounds measurement error instead of assuming it away; avoids unquantified non-differential or differential misclassification.",
        "cons_of_this": "Requires a validation substudy with chart-review or linkage cost.",
        "when_to_prefer": "Whenever the algorithm is novel, transported to a new data source or calendar era, or central to the primary outcome/exposure."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Index test = the case-finding algorithm (e.g., 1 IP or 2 OP diagnosis codes >=30 days apart). Reference = chart review via linkage; accuracy is estimable only on the linkable subset. Validate within benefit type (FFS vs MA vs commercial) and never pool blindly; account for ICD-9/ICD-10 code-set drift over calendar time and require continuous enrollment so test-negatives are observed, not missing.",
      "ehr": "Richer in-system reference (labs, pathology, NLP on notes) but visit-driven capture means out-of-system care can masquerade as false negatives; define observation windows and reconcile cross-system fragmentation before classifying.",
      "registry": "Frequently serves as the reference standard (adjudicated stage/event); excellent for validating claims/EHR algorithms but captures a selected, often more severe spectrum, biasing accuracy optimistically relative to community practice.",
      "linked": "Algorithm from claims + reference from registry/EHR + completeness from linkage is the strongest substrate, but linkage selection and service/fill/adjudication date discrepancies must be reconciled before the 2x2 is constructed."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\nimport pandas as pd\nfrom scipy.stats import beta\n\ndef clopper_pearson(k: float, n: float, alpha: float = 0.05):\n    # Exact binomial CI for a proportion k/n (works on integer counts).\n    k, n = int(round(k)), int(round(n))\n    lo = 0.0 if k == 0 else beta.ppf(alpha / 2, k, n - k + 1)\n    hi = 1.0 if k == n else beta.ppf(1 - alpha / 2, k + 1, n - k)\n    return k / n if n else np.nan, lo, hi\n\ndef diagnostic_accuracy(cohort: pd.DataFrame) -> dict:\n    # Restrict to subjects with a reference result (verified subset).\n    v = cohort.dropna(subset=[\"ref_pos\"]).copy()\n    w = v[\"samp_wt\"].to_numpy()  # inverse-probability weights reweight to the source cohort\n\n    # Weighted 2x2 cells.\n    tp = float((w * ((v.index_pos == 1) & (v.ref_pos == 1))).sum())\n    fp = float((w * ((v.index_pos == 1) & (v.ref_pos == 0))).sum())\n    fn = float((w * ((v.index_pos == 0) & (v.ref_pos == 1))).sum())\n    tn = float((w * ((v.index_pos == 0) & (v.ref_pos == 0))).sum())\n\n    sens, sens_lo, sens_hi = clopper_pearson(tp, tp + fn)\n    spec, spec_lo, spec_hi = clopper_pearson(tn, tn + fp)\n    ppv,  ppv_lo,  ppv_hi  = clopper_pearson(tp, tp + fp)   # prevalence-dependent\n    npv,  npv_lo,  npv_hi  = clopper_pearson(tn, tn + fn)   # prevalence-dependent\n    prevalence = (tp + fn) / (tp + fp + fn + tn)\n    lr_pos = sens / (1 - spec) if spec < 1 else np.inf\n    lr_neg = (1 - sens) / spec if spec > 0 else np.inf\n    return {\n        \"cells\": {\"tp\": tp, \"fp\": fp, \"fn\": fn, \"tn\": tn},\n        \"prevalence\": prevalence,\n        \"sensitivity\": (sens, sens_lo, sens_hi),\n        \"specificity\": (spec, spec_lo, spec_hi),\n        \"ppv\": (ppv, ppv_lo, ppv_hi),\n        \"npv\": (npv, npv_lo, npv_hi),\n        \"lr_positive\": lr_pos,\n        \"lr_negative\": lr_neg,\n    }\n\ndef ppv_at_prevalence(sens: float, spec: float, prev: float) -> float:\n    # Re-derive PPV at any target prevalence (transport via Bayes); 
sens/spec are stable, PPV is not.\n    return (sens * prev) / (sens * prev + (1 - spec) * (1 - prev))",
        "description": "Diagnostic accuracy validation from claims-style inputs. Required inputs (already cleaned):\n  cohort : one row per validation subject -> person_id, index_pos (0/1 algorithm result),\n           ref_pos (0/1 reference-standard result; may be NA if not verified),\n           samp_wt (inverse sampling/verification probability; 1.0 for a full consecutive sample)\nStage 1 builds the (optionally weighted) 2x2 on verified subjects; Stage 2 computes\nsensitivity, specificity, PPV, and NPV with exact Clopper-Pearson 95% CIs, plus point\nestimates of the likelihood ratios. Use weighted cells when verification was a known fraction\nof the cohort (two-stage design). With non-unit weights the cells are rounded to integers\nbefore the CI is computed, so the intervals are approximate and can overstate precision.",
        "dependencies": [
          "pandas",
          "numpy",
          "scipy"
        ],
        "source_citations": [
          "sackett-2002",
          "deeks-2004"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "diagnostic_accuracy <- function(cohort) {\n  v <- cohort[!is.na(cohort$ref_pos), ]\n  w <- v$samp_wt  # inverse-probability weights reweight verified subset to the source cohort\n\n  tp <- sum(w * (v$index_pos == 1 & v$ref_pos == 1))\n  fp <- sum(w * (v$index_pos == 1 & v$ref_pos == 0))\n  fn <- sum(w * (v$index_pos == 0 & v$ref_pos == 1))\n  tn <- sum(w * (v$index_pos == 0 & v$ref_pos == 0))\n\n  ci <- function(k, n) {                # exact Clopper-Pearson CI on rounded counts\n    bt <- binom.test(round(k), round(n))\n    c(est = unname(bt$estimate), lo = bt$conf.int[1], hi = bt$conf.int[2])\n  }\n  sens <- ci(tp, tp + fn); spec <- ci(tn, tn + fp)\n  ppv  <- ci(tp, tp + fp); npv  <- ci(tn, tn + fn)   # PPV/NPV are prevalence-dependent\n  list(\n    cells       = c(tp = tp, fp = fp, fn = fn, tn = tn),\n    prevalence  = (tp + fn) / (tp + fp + fn + tn),\n    sensitivity = sens, specificity = spec, ppv = ppv, npv = npv,\n    lr_positive = sens[\"est\"] / (1 - spec[\"est\"]),\n    lr_negative = (1 - sens[\"est\"]) / spec[\"est\"]\n  )\n}\n\nppv_at_prevalence <- function(sens, spec, prev) {\n  (sens * prev) / (sens * prev + (1 - spec) * (1 - prev))\n}",
        "description": "Diagnostic accuracy validation in base R. Inputs mirror the Python version:\n  cohort : data.frame with person_id, index_pos (0/1), ref_pos (0/1 or NA), samp_wt (numeric)\nComputes weighted 2x2 cells, then sensitivity/specificity/PPV/NPV with exact binom.test CIs\nand likelihood ratios. ppv_at_prevalence() transports PPV to a target prevalence.",
        "dependencies": [],
        "source_citations": [
          "sackett-2002",
          "deeks-2004"
        ],
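        "notes": "binom.test on rounded weighted cells treats the weighted totals as if they were observed counts. When samp_wt differs from 1 the resulting intervals are too narrow; treat them as approximate and use a design-based or bootstrap variance for two-stage samples."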
      },
      {
        "lang": "sas",
        "code": "/* Stage 1: weighted 2x2 on the verified subset (ref_pos not missing). */\nproc sql;\n  create table cells as\n  select sum(samp_wt * (index_pos=1 and ref_pos=1)) as tp,\n         sum(samp_wt * (index_pos=1 and ref_pos=0)) as fp,\n         sum(samp_wt * (index_pos=0 and ref_pos=1)) as fn,\n         sum(samp_wt * (index_pos=0 and ref_pos=0)) as tn\n  from work.cohort\n  where ref_pos is not null;\nquit;\n\n/* Stage 2a: exact (Clopper-Pearson) CIs for sensitivity (within reference-positive stratum)\n   and specificity (within reference-negative stratum). WEIGHT applies the sampling weights. */\nproc freq data=work.cohort(where=(ref_pos=1)) order=data;\n  tables index_pos / binomial(level='1') alpha=0.05;   /* P(index+ | ref+) = sensitivity */\n  weight samp_wt;\n  exact binomial;\nrun;\n\nproc freq data=work.cohort(where=(ref_pos=0)) order=data;\n  tables index_pos / binomial(level='0') alpha=0.05;   /* P(index- | ref-) = specificity */\n  weight samp_wt;\n  exact binomial;\nrun;\n\n/* Stage 2b: predictive values (prevalence-dependent) and likelihood ratios from the 2x2. */\ndata accuracy;\n  set cells;\n  sensitivity = tp / (tp + fn);\n  specificity = tn / (tn + fp);\n  ppv         = tp / (tp + fp);\n  npv         = tn / (tn + fn);\n  prevalence  = (tp + fn) / (tp + fp + fn + tn);\n  lr_positive = sensitivity / (1 - specificity);\n  lr_negative = (1 - sensitivity) / specificity;\nrun;",
        "description": "Diagnostic accuracy validation in SAS. Required input dataset (post data-management):\n  work.cohort : person_id, index_pos (0/1 algorithm result), ref_pos (0/1 reference; . if\n                not verified), samp_wt (inverse sampling/verification probability; 1 if full sample)\nPROC SQL builds the weighted 2x2; PROC FREQ with the BINOMIAL(LEVEL= ) option and EXACT\nstatement yields exact (Clopper-Pearson) CIs for sensitivity and specificity computed within\nthe conditional row strata. Likelihood ratios are derived in a final data step.",
        "dependencies": [],
        "source_citations": [
          "sackett-2002",
          "deeks-2004"
        ],
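        "notes": "Exact binomial limits assume integer frequencies; with non-integer samp_wt, PROC FREQ may not report them, and weighted frequencies overstate the effective sample size. Treat the Stage 2a intervals as exact only for a full sample (samp_wt = 1)."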
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Src[Eligible subjects with the indication<br/>continuous enrollment / observable time] --> Idx[Apply index test<br/>= claims/EHR case-finding algorithm]\n  Idx --> Ref{Reference standard obtained?}\n  Ref -- All subjects (single-gate cohort) --> Tab[Cross-classify index x reference]\n  Ref -- Only a subset, often triggered by index+ --> VB[VERIFICATION / WORK-UP BIAS<br/>weight by known sampling probability<br/>or Begg-Greenes correction]\n  VB --> Tab\n  Tab --> M[Estimate sensitivity & specificity<br/>prevalence-stable, with exact CIs]\n  Tab --> P[Estimate PPV / NPV<br/>prevalence-DEPENDENT: report at cohort prevalence]\n  M --> LR[Likelihood ratios -> transport via Bayes]\n  M --> BC[Carry sens/spec into misclassification bias correction]",
        "caption": "Diagnostic accuracy workflow. The branch on how the reference standard is obtained is decisive — a partial, index-triggered reference induces verification bias and requires weighting or a Begg-Greenes correction before the 2x2 is interpreted.",
        "alt_text": "Flowchart from eligible subjects through index test and reference-standard ascertainment, splitting into a full-cohort path and a verification-bias-corrected subset path, then to sensitivity/specificity, predictive values, and likelihood ratios.",
        "source_type": "illustrative",
        "source_citations": [
          "begg-1983"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  subgraph TwoByTwo [2x2 against the reference standard]\n    TP[TP: index+ / ref+] --- FP[FP: index+ / ref-]\n    FN[FN: index- / ref+] --- TN[TN: index- / ref-]\n  end\n  TP --> SENS[Sensitivity = TP / TP+FN<br/>prevalence-stable]\n  FN --> SENS\n  TN --> SPEC[Specificity = TN / TN+FP<br/>prevalence-stable]\n  FP --> SPEC\n  TP --> PPV[PPV = TP / TP+FP<br/>prevalence-DEPENDENT]\n  FP --> PPV\n  TN --> NPV[NPV = TN / TN+FN<br/>prevalence-DEPENDENT]\n  FN --> NPV",
        "caption": "The 2x2 and its derived measures. Sensitivity and specificity condition on true status and are prevalence-stable; PPV and NPV condition on the test result and shift with prevalence, which is why predictive values do not transport across populations.",
        "alt_text": "Diagram of the 2x2 table cells TP, FP, FN, TN mapped to sensitivity, specificity, positive predictive value, and negative predictive value with stability annotations.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "is_variant_of",
        "target_slug": "algorithm-validation",
        "notes": "Algorithm validation is the RWE application of the diagnostic accuracy design — the index test is a claims/EHR phenotype and the reference standard is chart review or a registry."
      },
      {
        "relation_type": "see_also",
        "target_slug": "claims-outcome-algorithm-ppv-sensitivity-rwe",
        "notes": "The PPV/sensitivity trade-off concept operationalizes the design choice between narrow high-PPV and broad high-sensitivity algorithm definitions."
      },
      {
        "relation_type": "produces",
        "target_slug": "misclassification-bias-correction-rwe",
        "notes": "A diagnostic accuracy study yields the sensitivity/specificity (ideally by arm) that misclassification bias correction consumes to adjust the comparative effect estimate."
      },
      {
        "relation_type": "used_with",
        "target_slug": "quantitative-bias-analysis-toolkit-rwe",
        "notes": "Accuracy parameters and their CIs feed probabilistic bias analysis to propagate measurement uncertainty into the final estimate."
      },
      {
        "relation_type": "see_also",
        "target_slug": "diagnosis-phenotype-algorithm-1ip-2op-time-window-rwe",
        "notes": "The 1-inpatient / 2-outpatient time-window rule is a common index-test definition whose operating characteristics this design quantifies."
      },
      {
        "relation_type": "see_also",
        "target_slug": "outcome-algorithm-construction-rwe",
        "notes": "Outcome algorithm construction defines the index test; the diagnostic accuracy study validates it before deployment as a study outcome."
      },
      {
        "relation_type": "see_also",
        "target_slug": "ehr-phenotyping-algorithms-rwe",
        "notes": "EHR phenotypes are validated with the same accuracy framework, with labs/pathology/NLP serving as the in-system reference standard."
      },
      {
        "relation_type": "used_with",
        "target_slug": "endpoint-adjudication-chart-review-rwe",
        "notes": "Blinded chart-review adjudication is the usual reference standard; blinding to the index codes is required to avoid incorporation bias."
      }
    ],
    "aliases": [
      "diagnostic accuracy",
      "diagnostic test accuracy",
      "diagnostic accuracy study",
      "test validation study",
      "sensitivity and specificity study"
    ],
    "complexity": "foundational",
    "regulatory_relevance": [
      "fda",
      "ema",
      "pqa-cms"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "difference-in-differences-staggered-adoption-rwe",
    "name": "Difference-in-Differences with Staggered Adoption",
    "short_definition": "A quasi-experimental design that estimates the effect of a policy, formulary, benefit, or practice change by contrasting the pre-to-post outcome change in affected units against the contemporaneous change in unaffected units, with modern group-time estimators replacing two-way fixed effects when units adopt at different times and treatment effects are dynamic.",
    "long_description": "**Difference-in-differences (DiD)** identifies a causal effect from the differential timing of an\nintervention across units. The canonical 2x2 estimator takes a treated group and a comparison group, each\nobserved before and after the intervention, and forms ATT = (Y_treated,post - Y_treated,pre) -\n(Y_control,post - Y_control,pre). Subtracting the comparison group's change purges any time trend common to\nboth groups, so the surviving difference is attributed to the intervention. In RWE this is the workhorse for\n*group-level* shocks that individual-level designs cannot exploit: prior-authorization and step-therapy\nrollouts, formulary tier changes, Medicaid expansion, REMS, value-based contracts, telehealth parity laws,\nquality-measure thresholds (e.g., PQA/CMS Star measures), and reimbursement reforms.\n\n**Core conceptual distinction** — DiD identifies effects from *time variation in policy exposure*, not from\nindividual treatment choice; this is what separates it from propensity-score, instrumental-variable, and\ntarget-trial methods, which exploit cross-sectional variation in who gets treated. Two assumptions do the\nwork and are separable. (1) **Parallel trends**: absent the intervention, treated and comparison units would\nhave moved in parallel on the *additive* outcome scale — a counterfactual claim that pre-period data can\nonly fail to refute, never prove. (2) **No anticipation**: behavior does not change before implementation in\nresponse to the future policy (e.g., providers stockpiling fills before a PA rule takes effect). The estimand\nmatters more than novices expect. The 2x2 ATT is a single number; the **event-study (dynamic) estimand**\nATT(e) traces the effect at each lead/lag relative to adoption; the **group-time estimand** ATT(g,t)\n(Callaway-Sant'Anna) is the effect for cohort g at calendar time t, and the headline number is an explicit,\npositively-weighted average of these. 
The critical RWE pitfall: with **staggered adoption** (units adopt in\ndifferent years) and **heterogeneous, time-varying effects**, the naive two-way fixed-effects (TWFE)\ncoefficient on treated x post is a variance-weighted average of all possible 2x2 comparisons — including\n\"forbidden\" comparisons that use already-treated units as controls for later adopters — and can carry\n*negative weights*, so it can be the wrong sign even when every unit's true effect is positive\n(Goodman-Bacon decomposition). Default to group-time/event-study estimators, not TWFE, whenever adoption is\nstaggered.\n\n**Pros, cons, and trade-offs**\n- **vs target-trial emulation / active-comparator new-user:** DiD shines for *system-level* interventions\n  (a plan adds PA; a state expands Medicaid) where there is no individual \"exposure decision\" to emulate and\n  confounding by indication is irrelevant because everyone in a treated unit is exposed. Cost: it answers a\n  policy-level question (effect of the *rule*, not of *taking a drug*), needs a credible never/not-yet-treated\n  comparison, and is biased if parallel trends fails. **Prefer DiD** when the intervention is assigned to\n  groups over time; **prefer target-trial/ACNU** for individual drug-vs-drug effectiveness.\n- **vs interrupted time series (ITS) / single-group pre-post:** DiD adds a comparison group that absorbs\n  secular trends and concurrent shocks (a new guideline, a pandemic) that a single-group ITS would\n  misattribute to the policy. Cost: it requires a comparison group that is actually parallel; a bad control\n  is worse than none. **Prefer DiD** whenever a plausible unaffected comparison exists.\n- **vs instrumental-variables-pharmacoepi-rwe:** Both are quasi-experimental, but DiD leans on parallel\n  trends across units/time while IV leans on instrument relevance and exclusion. 
**Prefer DiD** when the\n  natural experiment is a timed policy shock; **prefer IV** when you have a credible instrument for\n  individual treatment (e.g., preference-based prescribing) but no clean before/after structure.\n- **TWFE vs modern estimators:** TWFE is fast and familiar but biased under staggered, heterogeneous\n  effects (negative weighting). Callaway-Sant'Anna, Sun-Abraham (interaction-weighted event study), and\n  de Chaisemartin-D'Haultfoeuille avoid forbidden comparisons. Cost: more covariates/structure and careful\n  choice of never-treated vs not-yet-treated controls. **Prefer modern estimators** for any staggered design.\n\n**When to use** — A discrete intervention is assigned to identifiable units (plans, states, providers,\nhospitals, counties) at one or more known dates; you can observe outcomes for those units before and after;\nand you have comparison units that are plausibly subject to the same secular forces but not the intervention.\nPanel structure (unit-by-period aggregates, or repeated cross-sections with stable composition) is required.\nEvent-study leads should be flat before adoption — inspect them, do not just test them.\n\n**When NOT to use — and when it is actively misleading or dangerous**\n- **Parallel trends is implausible and untestable.** If treated and comparison units were diverging before\n  the policy (e.g., plans that adopt PA were already cutting utilization), DiD attributes a pre-existing\n  trend to the intervention. Pre-trend non-rejection does NOT license the design — tests are\n  underpowered, and the bias of interest is in the post-period; sensitivity analysis over plausible\n  trend violations (Rambachan-Roth) is the honest answer, not a single placebo p-value.\n- **Naive TWFE under staggered adoption with dynamic effects.** This is the dangerous default: the\n  coefficient can flip sign relative to every true unit-level effect because of negative weights and\n  forbidden already-treated comparisons. 
Using one treated x post coefficient here is a methodological error,\n  not a simplification.\n- **Spillover / SUTVA violation.** If the comparison group is contaminated — a multi-state insurer applies a\n  PA rule to all its plans, providers practicing across a state border treat both populations, patients cross\n  contiguous counties, or a \"control\" formulary carves in the same restriction — the control change embeds a\n  partial treatment effect and DiD is attenuated or reversed.\n- **Compositional change.** When the *population* in a unit changes around the policy (new MA enrollees after\n  an open-enrollment shift, sicker patients churning out), an apparent effect is a risk-mix artifact. Require\n  a balanced panel of units, or model composition explicitly; do not compare unbalanced repeated\n  cross-sections without checking who entered and left.\n- **Anticipation.** If behavior shifts before the official start date, the \"pre\" period is already partly\n  treated; redefine time zero to the announcement or exclude the anticipation window.\n\n**Data-source operational depth**\n- **Claims (FFS vs MA):** Build a unit-by-month panel (plan-month, state-month, or provider-month) with a\n  *stable denominator* of continuously enrolled members so the outcome rate is not driven by enrollment\n  churn. 
The dominant failure mode is **Medicare Advantage person-time lacking FFS claims**: MA encounter\n  data are incomplete and under-coded relative to FFS, so a panel mixing FFS and MA enrollees will show\n  spurious utilization \"drops\" wherever MA share rises — restrict the denominator to FFS Parts A/B/D (or a\n  commercial benefit with complete pharmacy capture) and treat MA-only person-time as missing, not zero.\n  Cluster standard errors at the intervention unit (the plan/state), not the patient, because the policy is\n  assigned at the unit level; with **few treated clusters** (a handful of adopting states), conventional\n  cluster SEs are anti-conservative — use wild cluster bootstrap (Cameron-Gelbach-Miller) or randomization\n  inference. Watch **immortal time and differential coding** in procedure-based outcomes.\n- **EHR:** Site-level documentation and coding behavior can change around a policy (a quality measure\n  incentivizes coding a diagnosis), mimicking a real effect. Include site fixed effects, audit measurement\n  stability across the breakpoint, and prefer outcomes anchored to objective events (labs, dispensings) over\n  problem-list flags. 
Visit-driven capture means a patient who leaves the system is differentially lost.\n- **Registry:** Strong for adjudicated outcomes and severity but typically organized by patient, not by the\n  policy unit; aggregate to the intervention unit and confirm the registry catchment maps cleanly to the\n  treated/control geography.\n- **Linked claims-EHR-vital-records:** The ideal substrate (severity + claims completeness + mortality), but\n  linkage selection (only the linkable subset) can differ across treated and control units; verify linkage\n  rates are balanced across the breakpoint before trusting the contrast, and reconcile order/fill/service\n  dates before assigning a period.\n\n**Worked claims example.** Question: did a commercial plan's introduction of step therapy for branded\nDPP-4 inhibitors on 2024-01-01 reduce new DPP-4 initiations? Build a **plan-month panel**. Denominator:\nmembers aged >=18 with type 2 diabetes (>=2 diagnoses) and *continuous medical + pharmacy enrollment* across\nthe month, restricted to FFS-observable benefits (exclude MA-only and capitated person-time so a fill of\nzero means no fill, not missing data). Outcome rate = new DPP-4 initiations per 1,000 eligible\nmember-months, where a \"new initiation\" is a DPP-4 NDC fill (`fill_date`) with no DPP-4 fill in the prior\n365 days; collapse multiple same-day fills and count first-event per member per episode using `days_supply`\nto define the look-back. Treated units = plans that imposed step therapy on 2024-01-01; comparison =\nplans in the same insurer family with *no* DPP-4 step-therapy change (verified by formulary files, not\nassumed). For a clean single-date adoption, the 2x2 is ATT = (rate_treated,2024 - rate_treated,2023) -\n(rate_control,2024 - rate_control,2023). Before trusting it, plot 24 months of leads/lags: the treated and\ncontrol initiation rates should track in 2022-2023 (parallel pre-trend) with the gap opening only after\nJanuary 2024. 
If multiple plans adopted across 2023-2025 (staggered), abandon the single coefficient: fit a\nCallaway-Sant'Anna group-time ATT using *not-yet-treated* plans as controls for each adopting cohort,\naggregate to a dynamic event-study, and report the overall ATT as a group-size-weighted average of the\npost-adoption ATT(g,t). Cluster SEs at the plan; with only six treated plans, report a wild cluster bootstrap\np-value. Falsification: a negative-control outcome with\nno plausible link to DPP-4 step therapy (e.g., statin initiation) should show a null ATT; a placebo\nadoption date in 2022 should show no pre-effect.\n\n**Interpreting the output**\n\nUsing the worked example: treated plans had 42 initiations per 1,000 eligible members in 2023 and 28 in 2024\n(change = −14); comparison plans moved from 40 to 38 (change = −2). The DiD estimate is (−14) − (−2) = −12 initiations\nper 1,000 eligible members per year.\n\nFormal interpretation: The DiD estimate of −12 initiations per 1,000 eligible members is the average treatment\neffect on the treated (ATT): the portion of the treated plans' post-period decline attributable to the step-therapy\nrule, after subtracting the secular trend shared by both groups. In staggered-adoption settings where different\nplans adopt the rule at different calendar times, naive two-way fixed-effects estimation can produce a biased ATT\nbecause already-treated early adopters serve as implicit controls for late adopters; when effects differ across\ncohorts or grow over time, those comparisons can enter with negative weights. Callaway-Sant'Anna or Sun-Abraham\nestimators avoid these comparisons and aggregate group-specific ATTs with non-negative weights. The central\nuntestable assumption is parallel trends: absent the policy, treated and comparison plans\nwould have followed the same trajectory. An event-study plot showing no pre-period divergence is necessary but not\nsufficient evidence for this assumption.\n\nPractical interpretation: The step-therapy rule is associated with 12 fewer new DPP-4 initiations per 1,000\neligible plan members per year beyond the background decline both groups shared. The comparison group's\ndecline of 2 initiations per 1,000 captures the secular market trend and is subtracted out. This is an\nintent-to-treat-style policy effect at the plan level: it averages over all eligible members, including those who\nobtained clinical exemptions or other workarounds, and is not the effect of step therapy on an individual patient.",
    "primary_category": "Causal_Inference_Method",
    "tags": [
      "difference-in-differences",
      "did",
      "event-study",
      "staggered-adoption",
      "two-way-fixed-effects",
      "group-time-att",
      "parallel-trends",
      "policy-evaluation",
      "prior-authorization",
      "formulary"
    ],
    "applies_to_study_types": [
      "cohort_retrospective",
      "claims_analysis",
      "ehr_study",
      "multi_database",
      "cer_observational"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1001/jama.2014.16153",
        "url": "https://doi.org/10.1001/jama.2014.16153",
        "citation_text": "Dimick JB, Ryan AM. Methods for evaluating changes in health care policy: the difference-in-differences approach. JAMA. 2014;312(22):2401-2402.",
        "year": 2014,
        "authors_short": "Dimick & Ryan",
        "notes": "Concise clinical introduction to DiD for health-policy evaluation, including the parallel-trends assumption and the comparison-group logic."
      },
      {
        "role": "explain",
        "doi": "10.1146/annurev-publhealth-040617-013507",
        "url": "https://doi.org/10.1146/annurev-publhealth-040617-013507",
        "citation_text": "Wing C, Simon K, Bello-Gomez RA. Designing difference in difference studies: best practices for public health policy research. Annual Review of Public Health. 2018;39:453-469.",
        "year": 2018,
        "authors_short": "Wing et al.",
        "notes": "Applied best-practices tutorial covering parallel trends, event-study specification, inference, and common pitfalls in health-policy DiD."
      },
      {
        "role": "explain",
        "doi": "10.1016/j.jeconom.2020.12.001",
        "url": "https://doi.org/10.1016/j.jeconom.2020.12.001",
        "citation_text": "Callaway B, Sant'Anna PHC. Difference-in-differences with multiple time periods. Journal of Econometrics. 2021;225(2):200-230.",
        "year": 2021,
        "authors_short": "Callaway & Sant'Anna",
        "notes": "Defines the group-time average treatment effect ATT(g,t) and a doubly-robust estimator using never-treated or not-yet-treated controls; the modern reference for staggered designs."
      },
      {
        "role": "demonstrate",
        "doi": "10.1016/j.jeconom.2021.03.014",
        "url": "https://doi.org/10.1016/j.jeconom.2021.03.014",
        "citation_text": "Goodman-Bacon A. Difference-in-differences with variation in treatment timing. Journal of Econometrics. 2021;225(2):254-277.",
        "year": 2021,
        "authors_short": "Goodman-Bacon",
        "notes": "Decomposes the TWFE estimator into a weighted average of 2x2 comparisons, exposing the negative weights and forbidden already-treated comparisons that bias naive staggered DiD."
      },
      {
        "role": "use",
        "doi": "10.1016/j.jeconom.2020.09.006",
        "url": "https://doi.org/10.1016/j.jeconom.2020.09.006",
        "citation_text": "Sun L, Abraham S. Estimating dynamic treatment effects in event studies with heterogeneous treatment effects. Journal of Econometrics. 2021;225(2):175-199.",
        "year": 2021,
        "authors_short": "Sun & Abraham",
        "notes": "Interaction-weighted event-study estimator that recovers unbiased dynamic ATT(e) coefficients under heterogeneous, staggered treatment timing."
      }
    ],
    "plain_language_summary": "Difference-in-differences (DiD) answers the question: did this policy actually cause a change, or would things have shifted anyway? It compares how much an outcome changed before and after a policy in the group affected by the policy, then subtracts the same before-and-after change measured in a group that was not affected, leaving only the part of the change that the policy can take credit for. In real-world evidence it is most useful when an insurer, state, or health system rolls out a new rule (like requiring patients to try a cheaper drug first) and you want to know whether that rule changed prescribing. The method only works if the two groups would have kept moving in parallel had the policy never happened. Similar trends before the policy support that assumption, but they can never fully prove it.",
    "key_terms": [
      {
        "term": "parallel trends",
        "definition": "The assumption that, without the policy, the treated group and the comparison group would have changed by the same amount over time — meaning any difference you see after the policy must be caused by the policy, not by the two groups naturally drifting apart."
      },
      {
        "term": "treated group",
        "definition": "The plans, states, or providers that actually received the policy change (for example, the insurance plans that added a step-therapy rule)."
      },
      {
        "term": "comparison group",
        "definition": "Plans, states, or providers that did not receive the policy change during the study window, used to estimate what would have happened to the treated group without the policy."
      },
      {
        "term": "ATT",
        "definition": "Average Treatment effect on the Treated — the estimated causal effect of the policy specifically for the units that actually adopted it, expressed here as the DiD double-difference."
      },
      {
        "term": "staggered adoption",
        "definition": "When different units (plans, states, hospitals) adopt the same policy at different calendar times rather than all at once, requiring more careful methods than a simple before-and-after comparison."
      },
      {
        "term": "event study",
        "definition": "A way of plotting the DiD effect at each point in time before and after the policy; pre-period estimates should be near zero (consistent with parallel trends, though not proof of it), and post-period estimates show when the effect emerged and how large it was."
      }
    ],
    "worked_example": {
      "scenario": "A commercial insurer added a step-therapy rule for branded DPP-4 inhibitors to four of its plans on 2024-01-01, requiring patients to try a generic alternative first. We want to know whether that rule reduced new DPP-4 initiations. We observe two groups of plans: the four plans that adopted the rule (treated) and four comparable plans in the same insurer family that did not (comparison). The outcome is the new-initiation rate per 1,000 eligible members per year. We have one pre-period year (2023) and one post-period year (2024).",
      "dataset": {
        "caption": "Plan-year outcome table: new DPP-4 initiations per 1,000 eligible members. Each row is one group-period cell; values are averages across the plans in that group.",
        "columns": [
          "group",
          "period",
          "new_initiations_per_1000"
        ],
        "rows": [
          [
            "treated",
            "2023 (pre)",
            42
          ],
          [
            "treated",
            "2024 (post)",
            28
          ],
          [
            "comparison",
            "2023 (pre)",
            40
          ],
          [
            "comparison",
            "2024 (post)",
            38
          ]
        ]
      },
      "steps": [
        "Compute the treated group change: 28 minus 42 equals negative 14 (the treated plans saw 14 fewer initiations per 1,000 members after the rule).",
        "Compute the comparison group change: 38 minus 40 equals negative 2 (comparison plans also dropped slightly, capturing the background secular trend).",
        "Subtract the comparison change from the treated change to remove the secular trend: (negative 14) minus (negative 2) equals negative 12.",
        "The DiD estimate is negative 12 initiations per 1,000 members — the part of the treated group's drop that the step-therapy rule can take credit for, after stripping out the trend both groups shared."
      ],
      "result": "DiD = (28 - 42) - (38 - 40) = (-14) - (-2) = -12 new initiations per 1,000 eligible members. The step-therapy rule is associated with a reduction of 12 initiations per 1,000 members per year beyond any background trend.",
      "timeline_spec": {
        "title": "2x2 DiD: step-therapy rule and new DPP-4 initiations per 1,000 members",
        "window": {
          "start": "2023-01-01",
          "end": "2024-12-31",
          "label": "Two-year observation window"
        },
        "events": [
          {
            "label": "Step-therapy rule takes effect",
            "start": "2024-01-01",
            "length_days": 366,
            "quantity": "policy intervention (treated plans only)"
          }
        ],
        "spans": [
          {
            "kind": "unexposed",
            "start": "2023-01-01",
            "end": "2023-12-31",
            "label": "Pre-period: both groups observed (2023)"
          },
          {
            "kind": "exposed",
            "start": "2024-01-01",
            "end": "2024-12-31",
            "label": "Post-period: treated plans under step-therapy rule (2024)"
          }
        ],
        "result": {
          "label": "DiD = -12 initiations per 1,000 members",
          "value": -12
        }
      }
    },
    "prerequisites": [
      "cohort-retrospective",
      "comparative-effectiveness-research-cer-methods",
      "interrupted-time-series-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Canonical 2x2 / two-group two-period DiD",
        "description": "A single treated group and a single comparison group, each observed before and after a common adoption date; the treated x post interaction coefficient estimates the ATT. Appropriate only when adoption is simultaneous and the comparison group is credible.",
        "edge_cases": [
          "Valid only for a single shared adoption date; applying it to staggered timing collapses heterogeneous cohort effects into a biased average.",
          "With a single treated unit, inference rests on synthetic-control or randomization-inference logic, not cluster SEs."
        ],
        "data_source_notes": "claims: build a stable continuously-enrolled denominator so the rate reflects behavior, not enrollment churn; cluster at the intervention unit."
      },
      {
        "name": "Event-study (dynamic) DiD",
        "description": "Estimates relative-time coefficients ATT(e) at each lead and lag around adoption to inspect pre-trends and trace the post-period effect path; the standard falsification and dynamics tool.",
        "edge_cases": [
          "Pre-trend tests are underpowered; non-rejection does not prove parallel trends — pair with Rambachan-Roth sensitivity bounds.",
          "In TWFE form the event-study can contaminate leads/lags with other cohorts' effects; use Sun-Abraham interaction weighting."
        ],
        "data_source_notes": "Plot at least 12, and ideally 24, pre-period points; anticipation or a launch overlap will show as a nonzero lead."
      },
      {
        "name": "Staggered-adoption group-time ATT (Callaway-Sant'Anna / Sun-Abraham)",
        "description": "Estimates cohort-by-time effects ATT(g,t) using never-treated or not-yet-treated comparisons, then aggregates to dynamic, calendar, or overall summaries with positive weights only.",
        "edge_cases": [
          "Choice of never-treated vs not-yet-treated control changes the estimand and the available comparison window; pre-specify it.",
          "Requires enough never/not-yet-treated mass; if nearly all units eventually adopt, identification is thin."
        ],
        "data_source_notes": "claims/EHR networks: state/plan/site rollouts almost always need this rather than TWFE."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "target-trial-emulation",
        "pros_of_this": "Identifies the effect of a group-level rule (PA, formulary, benefit, expansion) where there is no individual exposure decision to emulate and confounding by indication is moot because all members of a treated unit are exposed.",
        "cons_of_this": "Answers a policy-level question, not an individual drug effect; requires a credible parallel comparison group and is biased if parallel trends fails.",
        "when_to_prefer": "The intervention is assigned to identifiable units over time and you observe outcomes before and after for treated and comparison units."
      },
      {
        "compared_to": "instrumental-variables-pharmacoepi-rwe",
        "pros_of_this": "Leans on a transparent before/after-by-group structure and a visible event-study rather than an untestable exclusion restriction.",
        "cons_of_this": "Parallel trends can be implausible and spillover can contaminate the comparison group; IV targets individual treatment effects DiD cannot.",
        "when_to_prefer": "The natural experiment is a timed policy shock to groups rather than an instrument for individual treatment choice."
      },
      {
        "compared_to": "comparative-effectiveness-research-cer-methods",
        "pros_of_this": "Provides a quasi-experimental identification strategy for system-level interventions that adjustment-based CER cannot, exploiting timing rather than measured-confounder control.",
        "cons_of_this": "Restricted to settings with group-level timing variation and credible comparison units; not a general-purpose individual-level CER tool.",
        "when_to_prefer": "The exposure of interest is a policy/practice change rather than an individual treatment choice among measured-confounder-controlled patients."
      }
    ],
    "implementation_notes_by_data_source": {},
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\nimport statsmodels.formula.api as smf\nfrom differences import ATTgt  # pip install differences\n\npanel = panel.copy()\npanel[\"first_treat_period\"] = panel[\"first_treat_period\"].fillna(0).astype(int)  # 0 == never treated\n\n# --- Modern staggered estimator: group-time ATT(g,t), never/not-yet-treated controls -----------------\natt = ATTgt(data=panel.set_index([\"unit_id\", \"period\"]), cohort_name=\"first_treat_period\")\natt.fit(formula=\"outcome_rate\", est_method=\"dr\")          # doubly-robust; add covariates via formula RHS\nprint(att.aggregate(\"event\"))    # dynamic event-study ATT(e): inspect that pre-period leads ~ 0\nprint(att.aggregate(\"simple\"))   # overall positively-weighted ATT\n\n# --- Naive TWFE for contrast ONLY: biased under staggered timing + heterogeneous effects ------------\npanel[\"treated_post\"] = ((panel[\"first_treat_period\"] > 0) &\n                         (panel[\"period\"] >= panel[\"first_treat_period\"])).astype(int)\ntwfe = smf.ols(\"outcome_rate ~ treated_post + C(unit_id) + C(period)\", data=panel).fit(\n    cov_type=\"cluster\", cov_kwds={\"groups\": panel[\"unit_id\"]})\nprint(twfe.params[\"treated_post\"])  # compare to att.aggregate('simple'); divergence flags TWFE bias",
        "description": "Staggered DiD on a unit-by-period panel using differences-in-differences (the Python port of\nCallaway-Sant'Anna) with a TWFE fallback. Required input (one row per intervention-unit per period,\nalready aggregated from member-month claims):\n  panel : unit_id (plan/state/provider), period (int year or year-month index),\n          first_treat_period (int; 0 or NaN for never-treated),\n          outcome_rate (new initiations per 1,000 eligible member-months),\n          denom (eligible continuously-enrolled FFS member-months for weighting/aggregation)\nCluster SEs at unit_id. Build outcome_rate from a stable continuously-enrolled FFS denominator upstream;\ndo NOT mix MA-only person-time (incomplete claims) into the denominator.",
        "dependencies": [
          "pandas",
          "differences",
          "statsmodels"
        ],
        "source_citations": [
          "callaway-santanna-2021"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(did); library(fixest); library(bacondecomp)\n\n# --- Callaway-Sant'Anna group-time ATT with not-yet-treated controls -------------------------------\natt <- att_gt(yname = \"outcome_rate\", tname = \"period\", idname = \"unit_id\",\n              gname = \"first_treat_period\", xformla = ~ baseline_rate + region,\n              control_group = \"notyettreated\", clustervars = \"unit_id\",\n              data = panel)\nsummary(aggte(att, type = \"dynamic\"))   # event-study: pre-period leads should be ~0 (parallel trends)\nsummary(aggte(att, type = \"simple\"))    # overall ATT (positive weights only)\n\n# --- Sun-Abraham interaction-weighted event study (equivalent dynamic estimand) --------------------\nes <- feols(outcome_rate ~ sunab(first_treat_period, period) | unit_id + period,\n            cluster = ~ unit_id, data = panel[panel$first_treat_period != 1, ])\niplot(es)\n\n# --- Goodman-Bacon: show how the naive TWFE coefficient decomposes (forbidden comparisons) ---------\nbgd <- bacon(outcome_rate ~ treated_post, data = panel, id_var = \"unit_id\", time_var = \"period\")\nprint(aggregate(estimate * weight ~ type, data = bgd, FUN = sum))",
        "description": "Staggered DiD with the did package (Callaway-Sant'Anna) plus a Sun-Abraham event study via fixest, and a\nGoodman-Bacon TWFE decomposition to diagnose weighting. Required input (unit-by-period panel):\n  panel : unit_id (int), period (int), first_treat_period (int; 0 = never treated),\n          outcome_rate, baseline_rate, region  (covariates measured pre-adoption)\nCluster at unit_id; outcome_rate built from a stable FFS-observable denominator upstream.",
        "dependencies": [
          "did",
          "fixest",
          "bacondecomp"
        ],
        "source_citations": [
          "callaway-santanna-2021",
          "goodman-bacon-2021"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* treated_post flag = unit is past its own adoption date (handles staggered timing). */\ndata panel;\n  set work.panel;\n  treated_post = (first_treat_period > 0 and period >= first_treat_period);\nrun;\n\n/* Naive TWFE: unit + period fixed effects, cluster-robust SE at the intervention unit.\n   Biased under staggered adoption with heterogeneous effects (Goodman-Bacon) -- diagnostic only. */\nproc sort data=panel; by unit_id; run;\nproc surveyreg data=panel;\n  cluster unit_id;                         /* cluster SE at the policy unit, not the patient */\n  class unit_id period;\n  model outcome_rate = treated_post unit_id period / solution;\n  ods select ParameterEstimates;\nrun;\n\n/* Manual group-time ATT(g,t): for each adopting cohort g and post-period t, contrast the cohort's\n   pre->post change against the not-yet-treated units' change over the same window. */\nproc sql noprint;\n  select distinct first_treat_period into :gs separated by ' '\n  from panel where first_treat_period > 0;\nquit;\n\n%macro att_gt;\n  data _att_all; length cohort 8 post_period 8 att 8; stop; run;\n  %let i = 1;\n  %do %while (%scan(&gs, &i) ne );\n    %let g = %scan(&gs, &i);\n    /* not-yet-treated comparison: never-treated OR adopt strictly after period t */\n    proc sql;\n      create table _cell as\n      select a.unit_id,\n             (a.outcome_rate - b.outcome_rate) as dchange,    /* post(t) minus pre(g-1) */\n             case when a.first_treat_period = &g then 1 else 0 end as treated\n      from panel a\n      inner join panel b\n        on a.unit_id = b.unit_id and b.period = &g - 1        /* base period = last pre-adoption */\n      where a.period >= &g\n        and (a.first_treat_period = &g\n             or a.first_treat_period = 0\n             or a.first_treat_period > a.period);\n    quit;\n    proc genmod data=_cell;\n      class unit_id;\n      model dchange = treated / dist=normal link=identity;\n      repeated subject=unit_id / type=ind;    
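               /* cluster-robust at the unit */\n      ods output GEEEmpPEst=_pe;\n    run;\n    /* _pe row for 'treated' is ATT(g,.) over the pooled post window; keep it and append for aggregation. */\n    data _pe_g;\n      set _pe;\n      where upcase(Parm) = 'TREATED';\n      cohort = &g;\n      att = Estimate;\n      keep cohort att;\n    run;\n    proc append base=_att_all data=_pe_g force; run;\n    %let i = %eval(&i + 1);\n  %end;\n%mend;\n%att_gt;",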
        "description": "DiD in SAS on a unit-by-period panel. Naive two-way fixed effects with cluster-robust SE via PROC\nSURVEYREG, plus a manual group-time ATT(g,t) loop (PROC SQL builds cohort-vs-not-yet-treated 2x2 cells;\nPROC GENMOD estimates each cell) because SAS has no built-in Callaway-Sant'Anna procedure. Required input:\n  work.panel : unit_id, period, first_treat_period (0 = never treated),\n               outcome_rate, denom (eligible FFS member-months)\nBuild outcome_rate from a stable continuously-enrolled FFS denominator upstream; exclude MA-only\nperson-time. NOTE: PROC GLM 'absorb' takes only ONE class effect and is the wrong tool for two-way FE.",
        "dependencies": [],
        "source_citations": [
          "callaway-santanna-2021"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": "difference-in-differences-staggered-adoption-rwe-timeline.svg",
        "mermaid": null,
        "caption": "Data-grounded worked-example timeline (beginner layer), drawn to scale from worked_example.timeline_spec so the picture matches the numbers.",
        "alt_text": "Timeline for the worked example of difference-in-differences-staggered-adoption-rwe.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Unit[Intervention units: plans / states / providers] --> Panel[Unit-by-period panel<br/>stable continuously-enrolled FFS denominator]\n  Panel --> Pre[Pre-period: treated vs comparison]\n  Panel --> Post[Post-period: treated vs comparison]\n  Pre --> Check{Parallel pre-trends?<br/>event-study leads ~ 0}\n  Check -->|No| Stop[Do NOT use DiD<br/>or model the trend / Rambachan-Roth bounds]\n  Check -->|Yes| Stag{Staggered adoption +<br/>heterogeneous effects?}\n  Stag -->|No| TwoBy[2x2 ATT = treated change - comparison change]\n  Stag -->|Yes| GT[Group-time ATT g,t<br/>Callaway-SantAnna / Sun-Abraham<br/>NOT naive TWFE]\n  TwoBy --> Infer[Cluster SE at unit;<br/>wild bootstrap if few clusters]\n  GT --> Infer\nstyle Stop fill:#fee2e2\nstyle GT fill:#dcfce7\nstyle Check fill:#fef9c3",
        "caption": "DiD decision logic for RWE policy evaluation. Parallel pre-trends gate the design; staggered adoption with heterogeneous effects forces a group-time estimator over naive two-way fixed effects; inference is clustered at the intervention unit.",
        "alt_text": "Flowchart from intervention units to a unit-by-period panel, a parallel-trends check that stops the analysis if violated, a staggered-adoption branch choosing between 2x2 ATT and group-time ATT, and a clustered-inference step.",
        "source_type": "illustrative",
        "source_citations": [
          "callaway-santanna-2021"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "graph LR\n  subgraph Pre[Pre-period]\n    TP[Treated pre]\n    CP[Comparison pre]\n  end\n  subgraph Post[Post-period]\n    TT[Treated post]\n    CT[Comparison post]\n  end\n  TP --> TT\n  CP --> CT\n  TT --> D1[Treated change]\n  TP --> D1\n  CT --> D2[Comparison change<br/>= secular trend]\n  CP --> D2\n  D1 --> ATT[ATT = treated change - comparison change]\n  D2 --> ATT\nstyle ATT fill:#dcfce7",
        "caption": "The 2x2 estimator. The comparison group's change estimates the secular trend that would have occurred without the intervention; subtracting it from the treated group's change yields the ATT.",
        "alt_text": "Diagram showing treated and comparison pre and post values, the treated change and comparison change, and the ATT as the difference of those two changes.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "alternative_to",
        "target_slug": "instrumental-variables-pharmacoepi-rwe",
        "notes": "Both are quasi-experimental; DiD relies on parallel trends across units and time, IV relies on instrument relevance and the exclusion restriction."
      },
      {
        "relation_type": "alternative_to",
        "target_slug": "target-trial-emulation",
        "notes": "DiD targets the effect of a group-level policy using timing variation; target-trial emulation targets an individual treatment strategy using cross-sectional confounder control."
      },
      {
        "relation_type": "see_also",
        "target_slug": "comparative-effectiveness-research-cer-methods",
        "notes": "DiD is a high-value quasi-experimental CER design for policy, formulary, and system-level interventions where individual-level adjustment cannot identify the effect."
      },
      {
        "relation_type": "see_also",
        "target_slug": "baseline-characteristics-and-covariate-balance-rwe",
        "notes": "Report pre-period comparability of treated and comparison units; do not rely on fixed effects alone to assert the groups were exchangeable before the intervention."
      }
    ],
    "aliases": [
      "DiD",
      "difference-in-differences",
      "staggered DiD",
      "event study",
      "two-way fixed effects",
      "group-time ATT",
      "policy evaluation"
    ],
    "complexity": "advanced",
    "regulatory_relevance": [
      "fda",
      "hta",
      "pqa-cms"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "direct-standardization-rwe",
    "name": "Direct Standardization",
    "short_definition": "A summarization method that reweights observed stratum-specific rates from a study population to a single external standard population distribution, producing a directly standardized rate (DSR) that is comparable across populations with different confounder/age-sex structures.",
    "long_description": "**Direct standardization** computes a weighted average of a study population's *own* stratum-specific event rates, using\nthe population *counts* (weights) from an external **standard population** rather than the study population's own counts.\nFormally, the directly standardized rate is DSR = Σ_k (w_k · r_k) / Σ_k w_k, where r_k = d_k / pt_k is the observed event\nrate in stratum k (deaths/events d_k over person-time pt_k in the *study* data) and w_k is the standard population's size\n(or person-time) in stratum k. The result answers a single counterfactual-flavored descriptive question: \"what would this\npopulation's overall rate be if it had the *standard* population's age/sex (or other confounder) distribution?\" Because two\ngroups standardized to the *same* standard share identical weights, their DSRs are directly comparable — the whole point of\nthe exercise. The most common weights are the 2000 US Standard Million (NCHS) and the WHO World Standard Population.\n\n**Core conceptual distinction** — Direct standardization is a *descriptive* reweighting, not a causal estimator. It removes\nconfounding by the *measured, categorized* stratifiers (typically age and sex) by holding the confounder distribution fixed\nat the standard — exactly the marginal-standardization logic, but executed nonparametrically within fully observed cells.\nThree things must be pre-specified in the estimand: (1) the standard population (it sets the target distribution and\ntherefore the numeric value — DSRs to different standards are *not* interchangeable); (2) the stratifiers and their cut\npoints (coarse age bands leave residual confounding; fine bands create sparse cells with unstable r_k); (3) the variance\nmethod, because for rare events the naive Wald interval undercovers and a gamma/Fay–Feuer interval is required. 
Direct\nstandardization estimates a rate (or a difference/ratio of two DSRs), *not* a hazard ratio or an exchangeability-based\ncausal contrast; conflating a DSR comparison with an adjusted treatment effect is a category error.\n\n**Pros, cons, and trade-offs** (named alternatives).\n- **vs indirect standardization / SMR:** Direct standardization requires *stable stratum-specific rates in the study\n  population itself* (enough events per cell). When the study population is small or events are rare, those r_k are noisy\n  and direct standardization is unstable — indirect standardization (the standardized morbidity/mortality ratio, SMR)\n  borrows rates from the standard and applies them to the study population's structure, needing only the total event count.\n  Critical caveat: two SMRs from different study populations are **not** comparable to each other because each uses its own\n  population as the weighting set; only DSRs (shared weights) are mutually comparable. **Prefer direct** when you must\n  compare ≥2 study groups and cells are well populated; **prefer indirect/SMR** when a group is small or strata are sparse.\n- **vs regression / model-based standardization (g-computation; Muller & MacLehose 2014):** Direct standardization is\n  nonparametric *within* strata but breaks down when cells are empty or you need to adjust for many continuous covariates.\n  Model-based standardization fits a regression (e.g., Poisson/logistic), predicts each subject's rate under the standard\n  covariate distribution, and averages — it borrows strength across sparse cells at the cost of model dependence and the\n  need to get the functional form right. 
**Prefer plain direct** for a few categorical stratifiers with full cells;\n  **prefer model-based** for many or continuous confounders.\n- **vs propensity-score weighting (IPTW / overlap weighting):** PS weighting targets a *causal* contrast (ATE/ATT) under\n  no-unmeasured-confounding and positivity; direct standardization targets a *descriptive* rate made comparable across\n  populations. They can numerically coincide in simple cases, but their assumptions and interpretations differ. Do not\n  present a DSR comparison as a confounding-controlled treatment effect when unmeasured confounding is the real threat.\n\n**When to use** — Comparing crude rates across populations (plans, regions, hospitals, calendar periods, treatment groups)\nwhose age/sex/severity mix differs, for burden-of-illness, surveillance, benchmarking (CMS plan/hospital performance), and\nHTA epidemiology inputs; whenever you need a single, transparent, externally comparable number and the stratifiers are\ncategorical with adequate cell counts; as a sanity-check descriptive layer beneath a formal causal analysis.\n\n**When NOT to use — and when it is actively misleading or dangerous**\n- **Sparse strata.** With few events per cell, r_k is unstable and naive variances undercover; a DSR built on cells with\n  0–2 events can swing wildly and its Wald CI is invalid. 
Use indirect/SMR or a gamma/Fay–Feuer interval, or collapse cells.\n- **Effect-measure modification across strata.** A single standardized summary hides qualitatively different stratum\n  rates; if the rate ratio reverses by age (e.g., a drug protective in the young, harmful in the old), the DSR comparison\n  is a misleading average and stratum-specific reporting is mandatory (this is the standardization analogue of Simpson's\n  paradox).\n- **Standard chosen to manufacture a result.** Because the value depends on the standard, picking a convenient one (or a\n  different standard per group) can flip a comparison — always use one shared, pre-registered standard.\n- **As a causal effect under unmeasured confounding.** Direct standardization only removes confounding by the *measured,\n  categorized* stratifiers. Presenting a DSR difference as a treatment effect when channeling, indication, or severity is\n  unbalanced is actively dangerous — it launders residual confounding into a clean-looking adjusted number.\n- **Differential denominator/numerator capture between the groups being compared** (see below): if one group's events or\n  person-time are systematically undercounted, the stratum rates are biased *before* any standardization, and standardizing\n  a biased rate just produces a precisely-wrong comparable number.\n\n**Data-source operational depth**\n- **Claims (FFS vs MA — the dominant trap):** The DSR's r_k = events / person-time is computed entirely from *observed*\n  claims. Medicare Advantage and other capitated/bundled arrangements do not generate complete fee-for-service inpatient\n  claims, so for MA-only person-time the numerator (e.g., HF hospitalizations) is undercounted while the denominator\n  (person-time from enrollment spans) is fully observed — deflating r_k. Standardizing two groups where one is FFS-rich and\n  the other MA-rich compares an honest rate to a censored one. 
Workaround: restrict to fully FFS-observable person-time\n  (Parts A/B for the outcome, plus D if drug exposure defines a group), exclude MA-only spans, and report the share of\n  person-time excluded. Require **continuous enrollment** to build clean denominator person-time; left-truncate at enrollment\n  start. **Differential competing risk of death by group** is acute in elderly claims: death removes person-time and\n  prevents the nonfatal event, so a sicker group's stratum rate of a nonfatal outcome can look *lower* — report cause-specific\n  vs all-cause framing and consider a competing-risks descriptive view alongside the DSR.\n- **EHR:** Visit-driven capture means person-time and events are observed only while the patient is active in the system;\n  a group that disenrolls or seeks care elsewhere has differential under-ascertainment by stratum/site. Define observation\n  windows explicitly and treat external-care leakage as informative.\n- **Registry:** Strong, adjudicated numerators and well-defined strata; the weak point is the *denominator* (catchment\n  person-time) and registry completeness, which must be validated before rates are trusted.\n- **Linked claims–EHR–vital records:** Best substrate for complete events (mortality from vital records fixes the\n  competing-risk denominator problem) and person-time, but linkage selects the linkable subset — confirm the standardized\n  comparison is not driven by who could be linked.\n\n**Worked claims example.** Question: is the age–sex-standardized rate of incident heart-failure (HF) hospitalization higher\nin Plan A than in Plan B when both are standardized to the 2000 US Standard Million? (1) **Cohort/denominator:** all members with ≥365 days of\ncontinuous Part A/B enrollment (drop MA-only spans so inpatient claims are observable); person-time accrues from the end of\nthat 365-day baseline (left-truncation) until the first HF event, death, disenrollment, or study end. 
(2) **First-event\ncoding (washout):** HF event = first inpatient claim with an HF dx (e.g., I50.x) in the principal position, with no\nHF dx in the 365-day washout, so each member contributes at most one *incident* event. (3) **Strata:** sex × age bands\n(18–24, then 10-year bands 25–34 … 75–84, and 85+) evaluated at the member's index date; confirm every cell has at least a\npre-set minimum number of events (collapse the oldest sparse bands if not). (4) **Stratum rates:** r_k = HF events_k / person-years_k within each plan. (5)\n**Standardize:** multiply each r_k by the 2000 US Standard Million weight w_k, sum, divide by Σw_k → DSR_A and DSR_B per\n100,000 person-years; report the standardized rate ratio DSR_A/DSR_B and difference, with **gamma-distribution confidence\nintervals** (Fay–Feuer) because several oldest-old cells are sparse. (6) **Diagnostics/sensitivity:** show the\nstratum-specific r_k side by side (to expose any age × plan interaction that the single DSR would mask), report\nperson-time excluded for MA-only spans, and re-run standardizing to the WHO World Standard to confirm the comparison's\n*direction* is not an artifact of the chosen standard. If Plan A enrolls sicker members with higher mortality, add an\nall-cause-death competing-event note so the nonfatal HF DSR is not misread.\n\n**Interpreting the output**\n\nConsider the worked example: Northville's directly standardized hospitalization rate is 30.5 per\n1,000 person-years, computed by applying the standard population weights (0.50, 0.35, 0.15) to\nits own age-specific rates (10.0, 30.0, 100.0) and summing.\n\nFormal interpretation: The DSR of 30.5 is the rate Northville would have if its population had\nexactly the age structure of the chosen standard population — it is a hypothetical, not an\nobserved rate. 
Its purpose is comparability: if Southdale's DSR computed with the same weights\nis 22.0, the gap of 8.5 per 1,000 person-years is not explained by different age distributions between the two regions,\nbecause both rates were calculated as if both regions had the same age mix. The DSR is not a\nforecast of any rate that will actually be observed; it is a summary measure\ndesigned to remove confounding by the stratifying variable. Choosing a different standard\npopulation will produce a different DSR for each region, but the direction of the comparison\nshould be robust; if it reverses, the regions' stratum-specific rates must cross (one region is\nhigher in some age bands and lower in others), a pattern that a single summary number cannot capture.\n\nPractical interpretation: Always pair the DSR with the stratum-specific rates from which it was\nbuilt. If Northville's oldest-stratum rate (65+: 100 per 1,000 py) is double Southdale's and that\nstratum drives the gap, a decision-maker needs that detail, not just the aggregate 30.5 vs 22.0.\nReport which standard population was used — the US 2000 Standard Million and the WHO World\nStandard produce different numerical DSRs for the same data — and confirm the comparison's\ndirection is not an artifact of the standard's age distribution. A DSR comparison is not a causal\nclaim: after removing age confounding, other unmeasured differences between regions (comorbidity,\nsocioeconomic status, care access) may explain any residual gap.",
    "primary_category": "Descriptive_Epidemiology",
    "tags": [
      "descriptive-epidemiology",
      "direct-standardization",
      "age-adjusted-rate",
      "rate-comparison",
      "standard-population",
      "burden-of-illness",
      "benchmarking"
    ],
    "applies_to_study_types": [
      "claims_analysis",
      "ehr_study",
      "registry_linkage",
      "comparative_effectiveness"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1159/000319591",
        "url": "https://doi.org/10.1159/000319591",
        "citation_text": "Tripepi G, Jager KJ, Dekker FW, Zoccali C. Stratification for confounding - part 2: direct and indirect standardization. Nephron Clinical Practice. 2010;116(4):c322-c325.",
        "year": 2010,
        "authors_short": "Tripepi et al.",
        "notes": "Clear, modern statement of direct vs indirect standardization as confounding-control summarization methods, including when each is preferred and why SMRs across populations are not mutually comparable."
      },
      {
        "role": "explain",
        "doi": "10.1093/ije/dyu029",
        "url": "https://doi.org/10.1093/ije/dyu029",
        "citation_text": "Muller CJ, MacLehose RF. Estimating predicted probabilities from logistic regression: different methods correspond to different target populations. International Journal of Epidemiology. 2014;43(3):962-970.",
        "year": 2014,
        "authors_short": "Muller & MacLehose",
        "notes": "Connects direct (nonparametric) standardization to model-based standardization, making explicit that the choice of standard population defines the target estimand and therefore the number reported."
      },
      {
        "role": "demonstrate",
        "doi": "10.1016/j.annepidem.2009.07.019",
        "url": "https://doi.org/10.1016/j.annepidem.2009.07.019",
        "citation_text": "Zhu D, Aaland M, Hollister R. Application of direct standardization technique in benchmarking trauma center performance. Annals of Epidemiology. 2009;19(9):658.",
        "year": 2009,
        "authors_short": "Zhu et al.",
        "notes": "Worked applied use of direct standardization for cross-institution benchmarking, illustrating the comparability that a shared standard population provides."
      }
    ],
    "plain_language_summary": "Direct standardization lets you fairly compare the overall disease rates of two populations — say, Plan A vs. Plan B — even when the two groups have very different age mixes. The trick is to ask: 'What would each group's rate look like if both had the exact same age structure?' You answer that by taking each group's own age-specific rates and applying a single shared set of age weights (called a standard population) to both, then summing the weighted rates into one comparable number. One honest caveat: this only removes differences you can see and categorize — if the groups differ on something you never measured, that difference is still baked in.",
    "key_terms": [
      {
        "term": "stratum-specific rate",
        "definition": "The event rate calculated within one slice of the population, such as people aged 65–74, computed as the number of events divided by the total time that group was observed."
      },
      {
        "term": "standard population",
        "definition": "An external reference group — for example, the 2000 US Standard Million published by the CDC — whose age (or age-sex) distribution is used as the shared set of weights when comparing two study groups."
      },
      {
        "term": "directly standardized rate (DSR)",
        "definition": "The single summary rate you get after weighting a population's own age-specific rates by the standard population's age distribution and summing the results."
      },
      {
        "term": "weight",
        "definition": "The share of the standard population that falls in one age stratum; weights for all strata sum to 1.0 (or to the total standard population count), so the weighted sum is a proper average."
      }
    ],
    "worked_example": {
      "scenario": "A health plan analyst wants to compare hospitalization rates between two small regions — Northville and Southdale — but Northville's enrollees are much older on average. If she just computes a single crude rate for each region, Northville will look worse mostly because it has more elderly members, not because its care is worse. She uses direct standardization to answer the fair question: 'What would each region's rate be if both had the same age structure?' She uses three age strata and a simplified standard population.",
      "dataset": {
        "caption": "Observed hospitalizations and person-years, by age group, for one region (Northville). The analyst builds an identical table for Southdale. The standard population weights come from a published reference table.",
        "columns": [
          "age_group",
          "events",
          "person_years",
          "stratum_rate_per_1000py",
          "standard_pop_weight"
        ],
        "rows": [
          [
            "18–44",
            12,
            1200,
            10.0,
            0.5
          ],
          [
            "45–64",
            18,
            600,
            30.0,
            0.35
          ],
          [
            "65+",
            20,
            200,
            100.0,
            0.15
          ]
        ]
      },
      "steps": [
        "Step 1 — compute each stratum's rate. Divide events by person-years and scale to per 1,000 person-years: age 18–44 → 12 / 1,200 × 1,000 = 10.0; age 45–64 → 18 / 600 × 1,000 = 30.0; age 65+ → 20 / 200 × 1,000 = 100.0.",
        "Step 2 — check the weights. The standard population weights are 0.50, 0.35, and 0.15. Confirm they sum to 1: 0.50 + 0.35 + 0.15 = 1.00. Good — they represent shares of the standard population, so the weighted sum will land in the same per-1,000 units as the stratum rates.",
        "Step 3 — multiply each stratum rate by its weight: 10.0 × 0.50 = 5.00; 30.0 × 0.35 = 10.50; 100.0 × 0.15 = 15.00.",
        "Step 4 — sum the weighted products: 5.00 + 10.50 + 15.00 = 30.50 hospitalizations per 1,000 person-years. That is Northville's directly standardized rate.",
        "Step 5 — repeat Steps 1–4 for Southdale using its own stratum rates but the exact same standard population weights (0.50, 0.35, 0.15). Because both regions used identical weights, the two DSRs are now on equal footing — any difference reflects true rate differences across strata, not a mismatch in how many elderly enrollees each region happened to have."
      ],
      "result": "Northville's directly standardized hospitalization rate = (10.0 × 0.50) + (30.0 × 0.35) + (100.0 × 0.15) = 5.00 + 10.50 + 15.00 = 30.50 per 1,000 person-years. If Southdale's DSR (computed the same way, same weights) came out to 22.0, you can now say Northville's standardized rate is 8.5 points higher — and that gap is not explained by age differences, because both rates were calculated as if the regions had the same age mix."
    },
    "prerequisites": [
      "descriptive-epidemiology-rwe",
      "incidence-rate-calculation-rwe",
      "indirect-standardization-smr-sir-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Age-sex direct standardization to a fixed external standard",
        "description": "The canonical form — stratify by age band and sex, compute observed stratum rates, and reweight to a published standard (2000 US Standard Million or WHO World Standard) so multiple groups become directly comparable.",
        "edge_cases": [
          "Sparse oldest-old cells produce unstable rates and invalid Wald intervals; use gamma/Fay-Feuer confidence intervals or collapse bands.",
          "Choice of standard changes the numeric DSR; a single pre-registered standard must be shared across all compared groups."
        ],
        "data_source_notes": "claims: build person-time from continuous-enrollment spans (left-truncate at baseline end), exclude MA-only person-time so inpatient events are observable; registry: validate catchment denominator before trusting rates."
      },
      {
        "name": "Multi-confounder direct standardization (extended stratifiers)",
        "description": "Adds stratifiers beyond age/sex (e.g., region, comorbidity tier, calendar year) when those differ between groups and confound the rate comparison.",
        "edge_cases": [
          "The cross-classified cell count grows multiplicatively, rapidly creating empty/sparse cells that direct standardization cannot handle - this is the practical switch-point to model-based standardization (g-computation).",
          "Effect-measure modification across the finer strata makes a single summary misleading; report stratum-specific rates."
        ],
        "data_source_notes": "ehr/linked: richer covariates enable finer strata but worsen visit-driven sparsity; confirm cells are populated before standardizing."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Indirect standardization (SMR)",
        "pros_of_this": "Directly standardized rates that share one standard are mutually comparable across any number of groups; requires no external reference rates.",
        "cons_of_this": "Requires stable stratum-specific rates in the study population itself; unstable when groups are small or events rare.",
        "when_to_prefer": "Comparing two or more study groups whose strata are adequately populated and you need cross-group comparability."
      },
      {
        "compared_to": "Regression / model-based standardization (g-computation)",
        "pros_of_this": "Fully nonparametric within strata; no functional-form assumptions; maximally transparent and easy to audit.",
        "cons_of_this": "Cannot handle empty cells, many stratifiers, or continuous confounders; borrows no strength across cells.",
        "when_to_prefer": "A small number of categorical stratifiers with adequate cell counts."
      },
      {
        "compared_to": "Propensity-score weighting (IPTW / overlap weighting)",
        "pros_of_this": "A descriptive, externally comparable rate with transparent weights; no exchangeability/positivity machinery needed for the descriptive question.",
        "cons_of_this": "Does not target a causal effect; only removes confounding by the measured, categorized stratifiers and is misleading if presented as an adjusted treatment effect.",
        "when_to_prefer": "Descriptive burden/benchmarking comparisons rather than confounding-controlled causal contrasts."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "r_k = events / person-time computed from observed claims. Build person-time from continuous-enrollment spans and left-truncate at baseline end; exclude Medicare Advantage-only person-time because capitated arrangements lack complete FFS inpatient claims, biasing the numerator downward. Watch differential competing risk of death by group in elderly cohorts (death removes person-time and prevents the nonfatal event).",
      "ehr": "Person-time and events are visit-driven; a group that leaves the system is differentially under-ascertained. Define observation windows explicitly and treat external-care leakage as informative under-capture.",
      "registry": "Strong adjudicated numerators and well-defined strata, but validate the catchment-population denominator and registry completeness before computing or comparing rates.",
      "linked": "Linked claims-EHR-vital-records gives complete events (mortality fixes the competing-risk denominator) and person-time, but linkage selects the linkable subset; confirm the comparison is not driven by linkage selection."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\nimport pandas as pd\nfrom scipy.stats import gamma\n\nPER = 100_000  # rate multiplier (per 100,000 person-years)\n\ndef directly_standardized_rate(strata: pd.DataFrame, std: pd.DataFrame, alpha: float = 0.05) -> pd.DataFrame:\n    # Align study stratum rates with standard-population weights on the shared stratum key.\n    m = strata.merge(std[[\"stratum\", \"std_pop\"]], on=\"stratum\", how=\"left\", validate=\"many_to_one\")\n    if m[\"std_pop\"].isna().any():\n        raise ValueError(\"Every study stratum must map to a standard-population weight.\")\n    W = m.groupby(\"group_id\")[\"std_pop\"].transform(\"sum\")  # total standard weight per group\n    m[\"w\"] = m[\"std_pop\"] / W                              # normalized weights sum to 1 within group\n    m[\"r_k\"] = m[\"events\"] / m[\"person_years\"]             # observed stratum rate\n\n    out = []\n    for gid, g in m.groupby(\"group_id\"):\n        dsr = float(np.sum(g[\"w\"] * g[\"r_k\"]))             # weighted average of stratum rates\n        # Fay-Feuer / gamma-distribution interval: stable when some cells have 0-2 events.\n        var = float(np.sum((g[\"w\"] ** 2) * g[\"events\"] / g[\"person_years\"] ** 2))\n        total_events = float(g[\"events\"].sum())\n        wmax = float((g[\"w\"] / g[\"person_years\"]).max())   # max weight contribution for the upper bound\n        if total_events == 0:\n            lo, hi = 0.0, -np.log(alpha / 2) * wmax\n        else:\n            lo = gamma.ppf(alpha / 2, a=dsr ** 2 / var, scale=var / dsr)\n            hi = gamma.ppf(1 - alpha / 2, a=(dsr + wmax) ** 2 / (var + wmax ** 2),\n                           scale=(var + wmax ** 2) / (dsr + wmax))\n        out.append({\"group_id\": gid, \"dsr_per_100k\": dsr * PER,\n                    \"lcl_per_100k\": lo * PER, \"ucl_per_100k\": hi * PER,\n                    \"events\": int(total_events)})\n    return pd.DataFrame(out)",
        "description": "Direct standardization with gamma (Fay-Feuer) confidence intervals for sparse cells. There is no single dominant Python\npackage, so the calculation is implemented explicitly. Required input (one row per study group x stratum, already built\nfrom claims/EHR person-time tables):\n  strata : group_id, stratum (e.g. age_band x sex), events (int), person_years (float)\n  std    : stratum, std_pop (standard-population weight w_k, e.g. 2000 US Standard Million counts)\nReturns one row per group: directly standardized rate per 100,000 PY and a gamma-distribution 95% CI valid for rare events.",
        "dependencies": [
          "pandas",
          "numpy",
          "scipy"
        ],
        "source_citations": [
          "tripepi-2010"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(epitools)\n\n# One group: returns crude rate, adjusted (directly standardized) rate, and gamma 95% CI.\ndsr_one <- function(count, pop, stdpop, per = 1e5) {\n  res <- ageadjust.direct(count = count, pop = pop, stdpop = stdpop)\n  data.frame(\n    dsr_per_100k = unname(res[\"adj.rate\"]) * per,\n    lcl_per_100k = unname(res[\"lci\"])      * per,\n    ucl_per_100k = unname(res[\"uci\"])      * per,\n    events       = sum(count)\n  )\n}\n\n# Compare groups: `strata` has columns group_id, stratum, events, person_years; `std` has stratum, std_pop.\nstandardize_groups <- function(strata, std) {\n  strata <- merge(strata, std, by = \"stratum\")             # attach weights on shared stratum key\n  strata <- strata[order(strata$group_id, strata$stratum), ]\n  do.call(rbind, lapply(split(strata, strata$group_id), function(g)\n    cbind(group_id = g$group_id[1],\n          dsr_one(g$events, g$person_years, g$std_pop))))\n}",
        "description": "Direct standardization in R using the canonical epitools::ageadjust.direct (Fay-Feuer gamma CI). Inputs are vectors\naligned by stratum for ONE study group; loop or split() over groups to compare them. Build count/pop/stdpop from the\nclaims person-time table first (continuous enrollment, MA-only person-time excluded).\n  count  : events per stratum (e.g. incident HF hospitalizations by age band x sex)\n  pop    : study person-years (or persons) per stratum  -> denominator\n  stdpop : standard-population weight per stratum (2000 US Standard Million)",
        "dependencies": [
          "epitools"
        ],
        "source_citations": [
          "tripepi-2010"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* Directly standardized event rate per 100,000 person-years, by group, with gamma CIs. */\nproc stdrate data=work.study refdata=work.std\n             method=direct\n             stat=rate(mult=100000)\n             plots=none;\n  population group=group event=event pyears=pyears;  /* study events & person-time per stratum */\n  reference  total=std_pop;                           /* standard-population weights w_k        */\n  strata age_sex / stats cl=gamma;                    /* stratum key; gamma CI for sparse cells  */\nrun;\n\n/* Compare the two standardized rates directly: standardized rate difference and ratio with CIs. */\nproc stdrate data=work.study refdata=work.std\n             method=direct\n             stat=rate(mult=100000)\n             effect=(diff ratio);\n  population group=group event=event pyears=pyears;\n  reference  total=std_pop;\n  strata age_sex / cl=gamma;\nrun;",
        "description": "Direct standardization with PROC STDRATE (SAS/STAT) - the dedicated procedure for this method. Required datasets (built\nfrom the claims person-time table; exclude MA-only spans, left-truncate person-time at baseline end):\n  work.study : group, age_sex (stratum key), event (count), pyears (person-time)   one row per group x stratum\n  work.std   : age_sex (stratum key), std_pop (2000 US Standard Million weight)     the standard population\nMETHOD=DIRECT reweights each group's observed stratum rates to work.std; CL=GAMMA gives Fay-Feuer intervals valid for\nsparse cells. EFFECT compares the two groups' standardized rates (difference and ratio with CIs).",
        "dependencies": [],
        "source_citations": [
          "tripepi-2010"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Q[Compare rates across groups with different age/sex mix] --> Cells{Stratum cells well populated?}\n  Cells -->|Yes, comparing several groups| Direct[Direct standardization<br/>shared standard -> comparable DSRs]\n  Cells -->|Small group / rare events| Indirect[Indirect standardization / SMR<br/>borrow standard rates]\n  Cells -->|Many or continuous confounders / empty cells| Model[Model-based standardization<br/>g-computation]\n  Direct --> Sparse{Any sparse cells?}\n  Sparse -->|Yes| Gamma[Use gamma / Fay-Feuer CI]\n  Sparse -->|No| Wald[Normal-approx CI acceptable]\n  Indirect --> NoCompare[Caution: SMRs across populations<br/>are NOT mutually comparable]",
        "caption": "Decision logic for choosing direct vs indirect vs model-based standardization, and the matching variance method.",
        "alt_text": "Decision tree starting from a rate comparison, branching to direct standardization, indirect/SMR, or model-based standardization based on cell counts and confounder structure, then to the appropriate confidence-interval method.",
        "source_type": "illustrative",
        "source_citations": [
          "tripepi-2010"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  PT[Claims person-time table<br/>continuous enrollment, MA-only excluded] --> R[Stratum rate r_k = events_k / person-time_k]\n  STD[Standard population<br/>weights w_k] --> MUL\n  R --> MUL[Multiply r_k x w_k per stratum]\n  MUL --> SUM[Sum over strata / sum of weights]\n  SUM --> DSR[Directly standardized rate DSR]\n  DSR --> CI[Gamma / Fay-Feuer CI for sparse cells]\n  CI --> CMP[Compare DSRs across groups<br/>same standard -> direct comparison]",
        "caption": "Data flow for a directly standardized rate from a claims person-time table through stratum rates, standard-population reweighting, summation, and a sparse-cell-valid confidence interval.",
        "alt_text": "Data-flow diagram from a claims person-time table to stratum-specific rates, multiplied by standard-population weights, summed into a directly standardized rate, with a gamma confidence interval and cross-group comparison.",
        "source_type": "illustrative",
        "source_citations": [
          "tripepi-2010"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "is_variant_of",
        "target_slug": "descriptive-epidemiology-rwe",
        "notes": "Direct standardization is a core rate-summarization technique within descriptive epidemiology."
      },
      {
        "relation_type": "alternative_to",
        "target_slug": "indirect-standardization-smr-sir-rwe",
        "notes": "Indirect standardization (SMR) borrows standard rates and is preferred for small or sparse study populations, but SMRs across populations are not mutually comparable the way directly standardized rates are."
      },
      {
        "relation_type": "see_also",
        "target_slug": "propensity-score-methods-psm-iptw",
        "notes": "PS weighting targets a causal contrast under exchangeability; direct standardization targets a comparable descriptive rate. They share marginal-standardization logic but differ in assumptions and interpretation."
      }
    ],
    "aliases": [
      "directly standardized rate",
      "DSR",
      "age-adjusted rate",
      "age-standardized rate",
      "direct method of standardization"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "hta",
      "pqa-cms"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "discounting-costs-effects-rwe",
    "name": "Discounting of Costs and Effects in Economic Evaluation",
    "short_definition": "Conversion of future costs and health effects (life-years, QALYs) to present value by applying a jurisdiction-specified annual discount rate, so that outcomes occurring at different times are made commensurable in a cost-effectiveness or budget-impact model.",
    "long_description": "**Discounting** restates costs and health effects that accrue in different future periods as their **present value (PV)** using\nan annual rate r, so a dollar (or QALY) in year t counts for less than the same quantity today. Under the standard constant\nexponential convention, a value C_t in year t is discounted by the factor 1/(1+r)^t; the PV of a stream is the sum of these\ndiscounted period values over the model horizon. The rationale is partly empirical (positive time preference, opportunity cost\nof capital) and partly normative (consistency, avoiding the paradox that, undiscounted, postponing any program indefinitely\nwould dominate). In RWE-based economic evaluation, discounting is not a statistical estimator — it is a **deterministic\ntransformation applied to the time-resolved cost and effect streams** that an RWE analysis produces (per-period claims costs,\nHCRU, survival-weighted QALYs from a partitioned-survival or Markov model). It must be pre-specified in the protocol/SAP and\nmatched to the reimbursement jurisdiction, because the rate and whether costs and effects share one rate materially move the\nICER.\n\n**Core conceptual distinction**. Three choices are separable and each changes the answer. (1) *Rate level*: the per-annum r\n(NICE 3.5%, US Second Panel reference-case 3%, ZIN/Netherlands 4% costs, PBAC 5%, CDA-AMC (formerly CADTH) 1.5%). (2) *Symmetry of rates*:\nwhether costs and effects are discounted at the **same** rate (uniform; the dominant convention and the US/NICE base case) or\nat **different** rates (differential discounting — the Netherlands discounts effects at 1.5% and costs at 4%, motivated by the\nGravelle & Smith rising-value-of-health argument that the consumption value of health rises over time). (3) *Functional\nform*: constant exponential (standard) vs. **declining / time-varying** schedules for very long horizons (UK Treasury Green\nBook declining rates; \"gamma discounting\") vs. 
**hyperbolic** discounting (descriptively closer to revealed preferences but\ntime-inconsistent and not used in reference cases). Discounting is conceptually distinct from **inflation adjustment**: you\nfirst deflate all costs to a common price year (real terms), *then* discount; conflating the two double-counts or omits the\ntime value of money. It is also distinct from **half-cycle correction** in Markov models, which addresses where within a cycle\nevents fall, not time preference.\n\n**Pros, cons, and trade-offs**.\n- **Uniform rate (costs = effects) vs. differential rate (effects < costs):** Uniform is simpler, is the NICE/US reference\n  case, and avoids the appearance of manipulating ICERs by lowering the effects rate. Differential discounting makes\n  long-horizon prevention and cure look more cost-effective and follows the Dutch/Belgian guideline logic, but it is\n  contested (Gravelle & Smith vs. Claxton et al.) and only defensible when the cost-effectiveness threshold itself is assumed\n  to grow. **Prefer uniform** unless the target HTA body mandates differential rates.\n- **Constant exponential vs. declining/time-varying:** Constant rates are standard and transparent; declining schedules\n  materially raise the PV of effects realized decades out (vaccines, gene therapy, prevention) and are sanctioned by some\n  treasuries for inter-generational horizons. **Prefer constant** for the reference case; present declining-rate results only\n  as a scenario, clearly labeled, when the horizon exceeds ~30 years.\n- **Exponential vs. hyperbolic:** Hyperbolic better describes individual behavior but produces preference reversals and is\n  rejected for reference-case evaluation; reserve it for behavioral/uptake sub-models, never for the primary ICER.\n- **Discounting QALYs vs. leaving effects undiscounted:** Some argue health should not be discounted at all. 
Leaving effects\n  undiscounted while discounting costs is internally inconsistent (the Keeler-Cretin \"paradox\") and inflates the value of\n  deferrable programs; **do not** present an undiscounted-effects base case for a reimbursement submission.\n\n**When to use**. Any cost-effectiveness, cost-utility, or cost-benefit model (and any budget-impact or cost-of-illness model\nwith a time horizon longer than one year) built on RWE inputs that span multiple years (lifetime Markov/partitioned-survival\nmodels, multi-year claims cost streams, extrapolated survival). Always apply the rate(s) and price year of the decision-making\njurisdiction, and always report the reference-case rate plus 0% and an upper-bound rate (e.g., 0%/3%/5% per the US Second\nPanel; NICE 1.5%/3.5%) as deterministic sensitivity analyses.\n\n**When NOT to use — and when it is actively misleading or dangerous**.\n- **Single-period / within-year analyses.** Discounting a 12-month claims cost comparison or a one-year budget-impact slice\n  adds nothing and can confuse reviewers; report nominal values when the horizon is ≤1 year.\n- **Wrong jurisdiction's rate.** Submitting a NICE appraisal with the US 3% rate (or vice versa) is a reference-case\n  violation that can invalidate the submission; the rate is not a free analyst choice.\n- **Discounting nominal (inflated) dollars.** If costs were not first deflated to a common price year, a real discount rate\n  is applied to a stream that still embeds price inflation, so the result is not a real PV (it is overstated when later-year\n  costs carry inflation relative to the price year); this is a silent, common error.\n- **Treating censored follow-up as zero future cost.** In RWE, claims/EHR follow-up is right-censored. Summing observed\n  discounted costs to the horizon as if censored patients incur nothing biases mean costs downward (informative censoring by\n  survival). 
Use a censoring-aware estimator (Bang-Tsiatis inverse-probability-of-censoring weighting, or a Lin partitioned\n  estimator that discounts within intervals) **before** applying the discount factor, not after.\n- **Differential discounting chosen to hit a threshold.** Switching from a uniform to a lower effects rate solely to push an\n  ICER under the willingness-to-pay threshold, without the jurisdiction mandating it, is analytically indefensible.\n\n**Data-source operational depth**.\n- **Claims (FFS vs. Medicare Advantage):** Costs are `paid_amt` (plan-paid + patient-paid, per the chosen perspective) summed\n  by service date into annual buckets relative to `index_date`. Failure mode: **MA-only person-time lacks adjudicated FFS\n  claims**, so a patient who switches to MA appears to incur $0 in later years — an artifact, not a real cost stream; restrict\n  cost accrual to FFS-observable enrollment and treat MA spans as censored, then carry that censoring into the PV estimator.\n  Bundled/capitated services and claims-adjudication lag distort the *timing* of costs and therefore the discount factor\n  applied — late-adjudicated claims for an early service must be attributed to the service year, not the payment year.\n- **EHR:** Encounter-driven capture means costs/charges are incomplete and out-of-system (\"leakage\") care is invisible;\n  charge-to-cost ratios are needed to convert charges to costs before discounting, and missing periods must be treated as\n  censored rather than zero.\n- **Registry:** Strong for adjudicated long-term effects (survival, disease milestones feeding QALYs) but typically weak for\n  full cost capture; link to claims for the cost stream and to a death index so the survival weights driving discounted QALYs\n  are right.\n- **Linked claims-EHR-vital records:** The ideal substrate for matching a discounted cost stream to a discounted survival/QALY\n  stream, but linkage selection and order/fill/service-date discrepancies must be 
reconciled so costs and effects are bucketed\n  relative to the *same* time origin before discounting.\n\n**Worked claims example.** Question: incremental cost-effectiveness of a new therapy vs. an active comparator in a\ncommercial + Medicare FFS cohort, modeled over a 20-year horizon with NICE-style r = 3.5% for both costs and effects\n(price year 2024). (1) For each `person_id`, define `year_from_index = floor((service_date - index_date)/365.25)` and sum\n`paid_amt` into annual cost buckets cost_0, cost_1, ... cost_19; convert every cost to 2024 dollars using a medical-care price\nindex *before* discounting. (2) Build the annual QALY stream: for each year alive (from KM/partitioned survival), multiply\nfractional time-alive in that year by the EQ-5D-mapped utility for that health state to get qaly_0 ... qaly_19. (3) Because\nfollow-up is right-censored at disenrollment/end-of-data, estimate **mean** annual costs and QALYs with\ninverse-probability-of-censoring weights (Bang-Tsiatis) so censored patients are not counted as $0/0-QALY after their last\nobserved year. (4) Discount each year: PV_cost = Σ_t cost_t/(1.035)^t and PV_qaly = Σ_t qaly_t/(1.035)^t for t = 0, ..., 19\n(so year 0 is undiscounted), applying the factor to the *mean* censoring-adjusted stream. (5) Repeat per arm, then\nICER = (PV_cost_new − PV_cost_comp)/(PV_qaly_new − PV_qaly_comp) and\ndiscounted NMB = λ·ΔPV_qaly − ΔPV_cost at λ = $100,000/QALY. 
(6) Sensitivity: re-run at r = 0% and r = 5% for both streams, and\n(if the model targets a Dutch submission) a differential scenario with effects at 1.5% and costs at 4%; report each sensitivity\nscenario as a separately labeled row, never in place of the reference-case headline result.\n\n**Interpreting the output**\n\nThe worked example discounts a two-year cost stream of $50,000 (Year 1) and $20,000 (Year 2) at 3% annually,\nproducing a present value of approximately $67,396, compared with an undiscounted total of $70,000.\n\n*(1) Formal interpretation.* The present-value calculation applies the factor 1/(1+r)^t to each period's cost\nand health gain. Year 1 costs are discounted by 1/1.03 ≈ 0.971 and Year 2 costs by 1/1.03² ≈ 0.943. The\n$2,604 reduction in present value over two years at 3% appears modest, but the compounding effect grows\nsubstantially over a 20- or 30-year model horizon: a QALY gained 20 years in the future at 3% discounting\nis worth only 1/1.03^20 ≈ 0.554 of a QALY gained today. Discounting applies to both cost and effect streams;\napplying it to costs but not to effects (or vice versa) is a methodological error. Jurisdiction-specific reference\ncases specify the rate: 3% (US Second Panel reference case), 3.5% (NICE reference case), or differential rates for\nsome European submissions (for example, the Netherlands).\n\n*(2) Practical interpretation.* The $2,604 present-value reduction over two years is small; discounting\nmatters most for chronic-disease models with long horizons. An intervention that prevents a cardiovascular event\nin year 10 is worth considerably less in present value than one preventing the same event next year, which is\nwhy long-horizon oncology and preventive-care models are more sensitive to the discount rate than acute-care models.\nAlways run the 0% and 5% rate sensitivity scenarios alongside the reference-case rate; in submissions where\nresults flip across this range, the discount-rate assumption should be highlighted as a key driver.",
    "primary_category": "Health_Economic",
    "tags": [
      "discounting",
      "present-value",
      "time-preference",
      "cost-effectiveness",
      "cost-utility",
      "qaly",
      "health-economic-modeling",
      "reference-case"
    ],
    "applies_to_study_types": [
      "claims_analysis",
      "ehr_study",
      "registry_linkage",
      "comparative_effectiveness",
      "cost_effectiveness_analysis"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1007/s40273-018-0672-z",
        "url": "https://doi.org/10.1007/s40273-018-0672-z",
        "citation_text": "Attema AE, Brouwer WBF, Claxton K. Discounting in economic evaluations. PharmacoEconomics. 2018;36(7):745-758.",
        "year": 2018,
        "authors_short": "Attema et al.",
        "notes": "Authoritative review of the theory, the constant-rate convention, uniform vs. differential discounting, and the cross-jurisdiction rate debate."
      },
      {
        "role": "explain",
        "doi": "10.1001/jama.2016.12195",
        "url": "https://doi.org/10.1001/jama.2016.12195",
        "citation_text": "Sanders GD, Neumann PJ, Basu A, et al. Recommendations for conduct, methodological practices, and reporting of cost-effectiveness analyses: Second Panel on Cost-Effectiveness in Health and Medicine. JAMA. 2016;316(10):1093-1103.",
        "year": 2016,
        "authors_short": "Sanders et al.",
        "notes": "US reference-case guidance specifying a 3% discount rate for both costs and effects with 0% and higher-rate sensitivity analyses."
      },
      {
        "role": "demonstrate",
        "doi": "10.1002/hec.1612",
        "url": "https://doi.org/10.1002/hec.1612",
        "citation_text": "Claxton K, Sculpher M, Culyer A, et al. Discounting and decision making in the economic evaluation of health-care technologies. Health Economics. 2011;20(1):2-15.",
        "year": 2011,
        "authors_short": "Claxton et al.",
        "notes": "Demonstrates how the choice of uniform vs. differential rates flows through to decisions, and the conditions under which differential discounting is justified."
      },
      {
        "role": "use",
        "doi": "10.1016/j.jval.2021.11.1351",
        "url": "https://doi.org/10.1016/j.jval.2021.11.1351",
        "citation_text": "Husereau D, Drummond M, Augustovski F, et al. Consolidated Health Economic Evaluation Reporting Standards 2022 (CHEERS 2022) statement. Value in Health. 2022;25(1):3-9.",
        "year": 2022,
        "authors_short": "Husereau et al.",
        "notes": "Reporting standard requiring explicit statement of the discount rate(s), price year, and time horizon in any published economic evaluation."
      }
    ],
    "plain_language_summary": "Discounting converts future costs and health gains into their equivalent value today, because a dollar (or a year of healthy life) promised in the future is worth less than the same thing right now. You divide each future amount by a factor that grows the further out in time you look, using a rate set by the health authority in your country. The result is called the present value, and all costs and health effects must be converted this way before you can fairly compare treatments whose benefits and costs fall at different points in time.",
    "key_terms": [
      {
        "term": "discount rate",
        "definition": "The annual percentage used to shrink a future dollar (or health gain) down to what it is worth today — for example, the US reference-case rate is 3% per year."
      },
      {
        "term": "present value",
        "definition": "The current-day worth of an amount that will be paid or received in the future, after applying the discount rate for every year between now and then."
      },
      {
        "term": "time horizon",
        "definition": "The total span of years over which a health economic model tracks costs and health effects — often a patient's lifetime for chronic diseases."
      },
      {
        "term": "QALY",
        "definition": "A quality-adjusted life-year: one year of life lived in perfect health, used as the standard unit for measuring health gains in economic evaluations."
      }
    ],
    "worked_example": {
      "scenario": "A new drug for a chronic condition has a startup infusion cost of $50,000 billed at the end of Year 1 and an annual maintenance cost of $20,000 billed at the end of Year 2. An analyst needs to report these two future costs as their present value today, using the US reference-case discount rate of 3% per year. The formula is PV = FV / (1 + r)^t, where FV is the future amount, r is the discount rate (0.03), and t is the number of years from today.",
      "dataset": {
        "caption": "Future cost stream for one patient, two years out. Each row is one cost event that has not yet occurred.",
        "columns": [
          "year_from_index",
          "label",
          "future_cost_usd"
        ],
        "rows": [
          [
            1,
            "Startup infusion",
            50000
          ],
          [
            2,
            "Year 2 maintenance",
            20000
          ]
        ]
      },
      "steps": [
        "Year 1 cost: divide $50,000 by (1.03)^1 = 1.03. Result: $50,000 / 1.03 = $48,543.69.",
        "Year 2 cost: divide $20,000 by (1.03)^2 = 1.0609. Result: $20,000 / 1.0609 = $18,851.92.",
        "Add the two present values: $48,543.69 + $18,851.92 = $67,395.61.",
        "Undiscounted total (ignoring time) would be $50,000 + $20,000 = $70,000.",
        "Discounting reduces the reported total by $2,604.39, reflecting that money paid in the future is worth less than money paid today."
      ],
      "result": "Present value of the two-year cost stream = $67,395.61, compared with an undiscounted total of $70,000. The $2,604.39 difference is modest over two years at 3%, but over a 20-year model horizon the gap between discounted and undiscounted costs (and health gains) becomes much larger and materially changes which treatment looks cost-effective."
    },
    "prerequisites": [
      "cost-effectiveness",
      "icer-net-monetary-benefit-rwe",
      "qaly-utility-mapping-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Constant exponential discounting (reference-case standard)",
        "description": "Single per-annum rate applied as 1/(1+r)^t to each future period for both costs and effects (e.g., NICE 3.5%, US Second Panel 3%). The default convention in nearly all HTA reference cases.",
        "edge_cases": [
          "Timing conventions (continuous-time, start-of-year, mid-year, or end-of-year discounting) yield slightly different factors; state which is used and apply it identically to both arms.",
          "Half-cycle correction in Markov models interacts with discounting timing and must be specified to avoid double-adjustment."
        ],
        "data_source_notes": "claims/EHR: bucket period costs to the service-date year (not the payment-date year) before applying the factor."
      },
      {
        "name": "Differential (dual-rate) discounting",
        "description": "Lower rate for health effects than for costs (e.g., Netherlands/ZIN 1.5% effects, 4% costs; historically Belgium), motivated by the Gravelle & Smith rising-value-of-health argument that the consumption value of health rises over time.",
        "edge_cases": [
          "Only internally consistent if the cost-effectiveness threshold is assumed to grow over time; otherwise it can be gamed to lower ICERs.",
          "Materially favors prevention, vaccines, and curative one-time therapies with long benefit horizons."
        ],
        "data_source_notes": "Apply only when the target jurisdiction mandates it; otherwise present as a clearly labeled scenario alongside the uniform base case."
      },
      {
        "name": "Declining / time-varying (gamma) discounting",
        "description": "Rate that falls over very long horizons (UK Treasury Green Book schedule; \"gamma discounting\") to avoid implausibly trivializing inter-generational outcomes.",
        "edge_cases": [
          "Reserve for horizons beyond ~30 years (e.g., gene therapy, public-health prevention); never the primary reference case for a standard appraisal.",
          "Requires explicit period-by-period factors rather than a single closed-form term."
        ],
        "data_source_notes": "Build an explicit vector of period discount factors and document the schedule source."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Uniform single-rate discounting of costs and effects",
        "pros_of_this": "Differential discounting (lower effects rate) raises the present value of long-horizon health gains, favoring prevention and curative therapies, and follows Dutch/Belgian guideline logic.",
        "cons_of_this": "Contested (Gravelle & Smith vs. Claxton et al.); defensible only if the threshold grows over time; can appear to manipulate the ICER if not jurisdiction-mandated.",
        "when_to_prefer": "Only when the target HTA body specifies differential rates (e.g., a Netherlands/ZIN submission)."
      },
      {
        "compared_to": "Constant exponential discounting",
        "pros_of_this": "Declining/time-varying schedules avoid trivializing benefits decades out and are sanctioned by some treasuries for inter-generational horizons.",
        "cons_of_this": "More complex, harder to communicate, and not the reference case in most HTA appraisals; risks cherry-picked schedules.",
        "when_to_prefer": "Very long (30+ year) horizons such as gene therapies or population prevention, presented as a labeled scenario."
      },
      {
        "compared_to": "Leaving health effects undiscounted while discounting costs",
        "pros_of_this": "Discounting both streams avoids the Keeler-Cretin paradox in which deferring any beneficial program indefinitely appears optimal.",
        "cons_of_this": "Some argue health should not be discounted on ethical grounds; mixing rates without justification is internally inconsistent.",
        "when_to_prefer": "Always discount effects in a reimbursement reference case; reserve undiscounted-effects results for ethical sensitivity discussion only."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Costs = paid_amt under the chosen perspective, summed by service-date year relative to index_date and deflated to a common price year before discounting. Exclude/treat-as-censored Medicare Advantage-only person-time where FFS claims are unavailable so later years are not falsely valued at $0. Attribute late-adjudicated claims to the service year, not the payment year.",
      "ehr": "Convert charges to costs with charge-to-cost ratios; treat out-of-system leakage and missing encounter periods as censored, not zero, before discounting.",
      "registry": "Strong for the long-term survival/effect stream that drives discounted QALYs; link to claims for the cost stream and to a death index so survival weights are correct.",
      "linked": "Reconcile order/fill/service dates to a single time origin so the discounted cost stream and discounted effect stream are bucketed to the same year before applying the factor; account for linkage selection."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\nimport numpy as np\n\nR_COST = 0.03      # US Second Panel reference-case rate for costs\nR_QALY = 0.03      # same rate for effects (set lower, e.g. 0.015, only for a mandated differential scenario)\nWTP    = 100_000   # willingness-to-pay threshold ($/QALY) for net monetary benefit\n\ndef discount_streams(annual: pd.DataFrame, r_cost=R_COST, r_qaly=R_QALY) -> pd.DataFrame:\n    df = annual.copy()\n    t = df[\"year_from_index\"].to_numpy()\n    df[\"pv_cost\"] = df[\"cost\"] / (1.0 + r_cost) ** t   # 1/(1+r)^t applied per period\n    df[\"pv_qaly\"] = df[\"qaly\"] / (1.0 + r_qaly) ** t\n    # Sum each person's discounted stream, then average across persons,\n    # so arms with different numbers of patients remain comparable.\n    per_person = df.groupby([\"arm\", \"person_id\"])[[\"pv_cost\", \"pv_qaly\"]].sum()\n    return (per_person.groupby(level=\"arm\")\n              .mean()\n              .rename(columns={\"pv_cost\": \"PV_cost\", \"pv_qaly\": \"PV_qaly\"}))\n\ndef icer_nmb(pv: pd.DataFrame, wtp=WTP) -> dict:\n    d_cost = pv.loc[\"NEW\", \"PV_cost\"] - pv.loc[\"COMPARATOR\", \"PV_cost\"]\n    d_qaly = pv.loc[\"NEW\", \"PV_qaly\"] - pv.loc[\"COMPARATOR\", \"PV_qaly\"]\n    icer = np.nan if np.isclose(d_qaly, 0) else d_cost / d_qaly  # undefined when no incremental effect\n    nmb_new  = wtp * pv.loc[\"NEW\", \"PV_qaly\"]        - pv.loc[\"NEW\", \"PV_cost\"]\n    nmb_comp = wtp * pv.loc[\"COMPARATOR\", \"PV_qaly\"] - pv.loc[\"COMPARATOR\", \"PV_cost\"]\n    return {\"d_cost\": d_cost, \"d_qaly\": d_qaly, \"ICER\": icer,\n            \"incremental_NMB\": nmb_new - nmb_comp}\n\n# `annual` is the long person-year input table specified in the description\npv = discount_streams(annual)\nresult = icer_nmb(pv)",
        "description": "Present-value discounting of per-person annual cost and QALY streams, then discounted ICER and NMB.\nRequired input (already cleaned, deflated to a common price year, and censoring-adjusted upstream):\n  annual : long table -> person_id, arm in {'NEW','COMPARATOR'}, year_from_index (0..H), cost (real $ that year), qaly (that year)\nNOTE on censoring: 'cost' and 'qaly' must already be mean/IPCW-adjusted (Bang-Tsiatis or Lin partitioned estimator) so that\nright-censored claims/EHR follow-up is not silently treated as $0 / 0 QALYs after the last observed year. Discounting is the\nfinal deterministic step applied to those means; it does NOT fix censoring bias.",
        "dependencies": [
          "pandas",
          "numpy"
        ],
        "source_citations": [
          "sanders-2016"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\n\nR_COST <- 0.03; R_QALY <- 0.03; WTP <- 100000  # reference-case rates; set R_QALY < R_COST only for a mandated differential run\n\ndiscount_streams <- function(annual, r_cost = R_COST, r_qaly = R_QALY) {\n  dt <- copy(as.data.table(annual))   # work on a copy so the caller's table is not modified by reference\n  dt[, pv_cost := cost / (1 + r_cost)^year_from_index]   # 1/(1+r)^t per period\n  dt[, pv_qaly := qaly / (1 + r_qaly)^year_from_index]\n  # Sum each person's discounted stream, then average across persons within arm\n  per_person <- dt[, .(pv_cost = sum(pv_cost), pv_qaly = sum(pv_qaly)), by = .(arm, person_id)]\n  per_person[, .(PV_cost = mean(pv_cost), PV_qaly = mean(pv_qaly)), by = arm]\n}\n\nicer_nmb <- function(pv, wtp = WTP) {\n  new  <- pv[arm == \"NEW\"]; cmp <- pv[arm == \"COMPARATOR\"]\n  d_cost <- new$PV_cost - cmp$PV_cost\n  d_qaly <- new$PV_qaly - cmp$PV_qaly\n  icer <- if (isTRUE(all.equal(d_qaly, 0))) NA_real_ else d_cost / d_qaly  # undefined when no incremental effect\n  inc_nmb <- (wtp * new$PV_qaly - new$PV_cost) - (wtp * cmp$PV_qaly - cmp$PV_cost)\n  list(d_cost = d_cost, d_qaly = d_qaly, ICER = icer, incremental_NMB = inc_nmb)\n}\n\npv     <- discount_streams(annual)\nresult <- icer_nmb(pv)",
        "description": "Present-value discounting and discounted ICER/NMB from per-person annual streams.\nInput 'annual' mirrors the Python version (deflated, censoring-adjusted upstream):\n  person_id, arm in {'NEW','COMPARATOR'}, year_from_index (0..H), cost, qaly",
        "dependencies": [
          "data.table"
        ],
        "source_citations": [
          "sanders-2016"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "%let rcost = 0.03;     /* reference-case discount rate for costs   */\n%let rqaly = 0.03;     /* reference-case discount rate for effects */\n%let wtp   = 100000;   /* willingness-to-pay threshold ($/QALY)    */\n\n/* Apply 1/(1+r)**t to each person-year cost and QALY value. */\ndata discounted;\n  set work.annual;\n  pv_cost = cost / (1 + &rcost) ** year_from_index;\n  pv_qaly = qaly / (1 + &rqaly) ** year_from_index;\nrun;\n\n/* Present value per person, then mean across persons within arm. */\nproc sql;\n  create table pv as\n  select arm,\n         mean(pv_cost_i) as PV_cost,\n         mean(pv_qaly_i) as PV_qaly\n  from (select arm, person_id,\n               sum(pv_cost) as pv_cost_i,\n               sum(pv_qaly) as pv_qaly_i\n        from discounted\n        group by arm, person_id)\n  group by arm;\nquit;\n\n/* Discounted incremental ICER and net monetary benefit. */\nproc sql;\n  create table result as\n  select (n.PV_cost - c.PV_cost)                         as d_cost,\n         (n.PV_qaly - c.PV_qaly)                         as d_qaly,\n         case when (n.PV_qaly - c.PV_qaly) = 0 then .\n              else (n.PV_cost - c.PV_cost) / (n.PV_qaly - c.PV_qaly)\n         end                                             as ICER,\n         ((&wtp * n.PV_qaly - n.PV_cost)\n          - (&wtp * c.PV_qaly - c.PV_cost))              as incremental_NMB\n  from (select * from pv where arm = 'NEW')        as n,\n       (select * from pv where arm = 'COMPARATOR') as c;\nquit;",
        "description": "Discounting of per-person-year claims-style cost and QALY streams in a DATA step, then discounted ICER/NMB.\nRequired input (deflated to a common price year and censoring-adjusted upstream):\n  work.annual : person_id, arm ('NEW'/'COMPARATOR'), year_from_index (0..H), cost, qaly\nThe DATA step applies the discrete annual factor 1/(1+r)**t to each person-year; PROC SQL aggregates to mean per-person present value by arm\nand derives the discounted ICER and incremental NMB. Set rqaly < rcost only for a jurisdiction-mandated differential scenario.",
        "dependencies": [],
        "source_citations": [
          "sanders-2016"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Q[Reimbursement jurisdiction of the decision] --> US[US Second Panel<br/>3% costs and effects<br/>0% and 5% sensitivity]\n  Q --> NICE[\"NICE England/Wales<br/>3.5% both<br/>1.5% in defined life-extending cases\"]\n  Q --> NL[\"ZIN Netherlands<br/>4% costs / 1.5% effects<br/>differential\"]\n  Q --> PBAC[PBAC Australia<br/>5% both]\n  Q --> CADTH[CADTH Canada<br/>1.5% both]\n  US --> Apply[\"Deflate costs to common price year<br/>then apply 1/(1+r)^t per period\"]\n  NICE --> Apply\n  NL --> Apply\n  PBAC --> Apply\n  CADTH --> Apply\n  Apply --> PV[Sum to PV costs and PV effects<br/>then ICER and NMB]",
        "caption": "Jurisdiction determines the discount rate(s) and symmetry; costs are deflated to a common price year, the period factor 1/(1+r)^t is applied, and present-value streams feed the ICER and net monetary benefit.",
        "alt_text": "Decision diagram mapping reimbursement jurisdiction to its discount-rate convention, then a shared step that deflates costs, applies the per-period discount factor, and sums to present-value costs and effects for the ICER and NMB.",
        "source_type": "illustrative",
        "source_citations": [
          "attema-2018"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  Y0[Year 0 cost C0<br/>factor 1.000] --> S[Sum of discounted period values = PV]\n  Y1[Year 1 cost C1<br/>factor 1/1.03 = 0.971] --> S\n  Y5[Year 5 cost C5<br/>factor 1/1.03^5 = 0.863] --> S\n  Y10[Year 10 cost C10<br/>factor 1/1.03^10 = 0.744] --> S\n  Y20[Year 20 cost C20<br/>factor 1/1.03^20 = 0.554] --> S\n  S --> Note[Same factors applied to the QALY stream at the effects rate]",
        "caption": "Per-period discount factors at r = 3% over a 20-year horizon. Each annual cost (and, separately, each annual QALY at the effects rate) is multiplied by 1/(1+r)^t and summed to present value; later years contribute progressively less.",
        "alt_text": "Flow diagram showing annual cost values at years 0, 1, 5, 10, and 20 multiplied by declining discount factors at a 3% rate and summed into a present value, with the same factors applied to the QALY stream.",
        "source_type": "illustrative",
        "source_citations": [
          "sanders-2016"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "is_variant_of",
        "target_slug": "health-economic-modeling-methods-rwe",
        "notes": "Discounting is one operational step within the broader RWE health-economic modeling method family."
      },
      {
        "relation_type": "used_with",
        "target_slug": "icer-net-monetary-benefit-rwe",
        "notes": "ICERs and net monetary benefit are computed on present-value (discounted) costs and effects, not nominal streams."
      },
      {
        "relation_type": "used_with",
        "target_slug": "markov-transition-probabilities-rwe",
        "notes": "Markov cohort traces accrue costs and QALYs by cycle; each cycle's values are discounted (with half-cycle correction specified separately) before summation."
      },
      {
        "relation_type": "used_with",
        "target_slug": "qaly-utility-mapping-rwe",
        "notes": "Utility-weighted life-years (QALYs) are discounted at the effects rate, which may differ from the costs rate under differential discounting."
      },
      {
        "relation_type": "used_with",
        "target_slug": "healthcare-costs-pppm-pppy-pmpm",
        "notes": "Multi-year cost streams (PPPM/PPPY/PMPM) are deflated to a common price year and then discounted at the jurisdiction's costs rate."
      },
      {
        "relation_type": "see_also",
        "target_slug": "hcru-healthcare-resource-utilization",
        "notes": "Longitudinal HCRU projections in Markov or partitioned-survival models are valued and discounted in parallel with costs."
      },
      {
        "relation_type": "see_also",
        "target_slug": "burden-of-disease-cost-of-illness",
        "notes": "Incidence-based cost-of-illness with long horizons requires discounting of future direct and indirect costs."
      }
    ],
    "aliases": [
      "discounting in economic evaluation",
      "present value adjustment in cost-effectiveness analysis",
      "time preference adjustment",
      "discount rate application",
      "discounting of costs and QALYs"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "hta",
      "ema",
      "fda"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "discrete-event-simulation-rwe",
    "name": "Discrete-Event Simulation Using RWE Inputs",
    "short_definition": "An individual-level, event-driven economic modeling method that advances each simulated patient one event at a time (treatment initiation, progression, adverse event, hospitalization, regimen switch, death) by sampling from time-to-event and cost distributions estimated from real-world claims, EHR, or registry data, used to project lifetime cost-effectiveness when history dependence, competing events, or resource constraints make cohort Markov or partitioned-survival structures untenable.",
    "long_description": "**Discrete-event simulation (DES)** is an individual-level health-economic modeling method in which each\nvirtual patient is tracked as a distinct entity carrying its own attributes (age, line of therapy, prior\nadverse-event history, accumulated cost and time-in-state). The patient's future is governed by a\n*future-event list*: the model samples a time to each competing next event from time-to-event\ndistributions (or hazard functions) that may depend on current state and history, processes the earliest\nevent in chronological order, updates the patient's attributes and accumulated cost/QALYs, then re-samples\ndownstream events. Between events the patient \"does nothing\" from the model's perspective, so DES avoids\nboth the fixed-cycle granularity and the memoryless (first-order Markov) restriction of cohort models.\n\nIn an RWE-parameterized DES, the inputs are estimated directly from longitudinal data using the same\nsurvival, multistate, and cost-regression methods catalogued elsewhere: time-to-progression or\ntime-to-next-treatment by line and subgroup (parametric or flexible-parametric survival fits),\nadverse-event hazards, post-progression survival, monthly paid amounts conditional on state and time\nsince the last event (gamma/log-link GLM, winsorized), and utility decrements per event. The simulation\npropagates those distributions — *with* their parameter uncertainty — over a lifetime horizon for a large\ncohort of synthetic patients sampled to match the target population's baseline covariate joint\ndistribution.\n\n**Core conceptual distinction.** DES is not an estimation method; it is a *projection structure* into\nwhich RWE-derived parameters are loaded. The estimand it produces is a comparative cost-effectiveness\ncontrast (incremental cost per QALY, NMB) under a counterfactual policy or technology, conditional on the\ncausal validity of the inputs. 
The structural innovation versus a cohort Markov (fixed states, memoryless\nper-cycle transitions) or a partitioned-survival model (PSM: area under independent PFS/OS curves) is that\nDES carries individual event history forward, so a patient who has already failed two lines, suffered a\ngrade-3 toxicity, and had a dose reduction can have a genuinely different hazard for the next event than a\nnewly treated patient — without enumerating dozens of \"tunnel\" states. It also handles competing events\nnatively (the first event scheduled wins; death automatically precludes later progression) and can model\nresource queues (infusion chairs, transplant organs) that cohort occupancy fractions cannot represent.\nThe price is computational cost (thousands of patients × replications × PSA draws), stochastic (Monte\nCarlo) output requiring many runs for stable means, and reduced transparency — the model is a program,\nnot an auditable transition matrix or three survival curves.\n\n**Pros, cons, and trade-offs** (naming the alternatives explicitly).\n- **vs. cohort Markov (markov-transition-probabilities-rwe):** DES represents time-in-state and\n  history-dependent hazards without the combinatorial explosion of tunnel states a Markov needs to relax\n  the memoryless assumption; it also captures resource contention and continuous time. Cost: a Markov\n  matrix is transparent, deterministic given its transitions, trivial to audit, and far cheaper to run.\n  **Prefer DES** when prior events (number of lines, a past AE, accumulating comorbidity) materially shift\n  downstream hazards and a faithful Markov would need an unmanageable state space.\n- **vs. 
partitioned-survival model (partitioned-survival-models-rwe):** DES makes post-progression\n  pathways, subsequent lines, AE branching, and path-dependent cost/HCRU *explicit*, rather than leaving\n  them as the residual area between independent PFS and OS curves (a PSM weakness that can produce\n  implausible state occupancy and ignores the PFS–OS dependence). Cost: a PSM is simpler, communicates in\n  two or three curves, and is often sufficient and preferred by HTA reviewers when the horizon is close to\n  observed follow-up. **Prefer DES** for complex multi-line oncology, transplant, or chronic-disease\n  pathways where independence of PFS/OS or memorylessness is clinically untenable.\n- **vs. direct extrapolation of observed costs/survival (no structural model):** \"what happened in the\n  data, blown out to 10 years\" cannot answer \"what if second-line uptake improved or the AE rate fell?\"\n  DES supplies the structural what-if while remaining parameterized entirely from RWE. Cost: it layers\n  structural and distributional assumptions on top of the data, which must be justified and validated.\n\n**When to use**\n- Diseases with complex, highly individual pathways: advanced oncology with multiple lines and AE-driven\n  branching; transplant or CAR-T pathways with waitlist/queue dynamics; chronic disease with accumulating\n  comorbidity and repeated hospitalizations.\n- When history dependence (a prior event changing the next hazard) or competing risks are first-order\n  drivers of cost-effectiveness and a Markov would need an unmanageable state space.\n- When the decision turns on capacity constraints, queues, or time-dependent resource use that cohort\n  occupancy fractions approximate poorly.\n- When rich patient-level RWE (ideally linked claims + EHR + registry) supports credible individual-level\n  time-to-event and cost distributions, and the audience (HTA body, payer) will accept a simulation.\n\n**When NOT to use — and when it is actively misleading 
or dangerous**\n- **Sparse or short RWE.** If follow-up is too short or events too rare to estimate the many time-to-event\n  distributions and effect modifiers stably, DES becomes an elaborate vehicle for unverifiable\n  extrapolation assumptions wearing the costume of patient-level realism. Prefer a parsimonious Markov/PSM\n  whose few assumptions are visible.\n- **The audience requires an auditable structure.** Many HTA submissions still prefer Markov or PSM\n  precisely because a reviewer can trace every number; a stochastic program is harder to interrogate.\n- **The dominant uncertainty is structural, not parametric.** If the open question is *which* states or\n  transitions exist, DES does not resolve it and can *hide* it inside code that looks precise — the most\n  dangerous failure mode, because Monte Carlo precision is mistaken for evidential strength.\n- **Confounded or immortal-time-contaminated inputs for the comparative effect.** DES propagates whatever\n  treatment effect you feed it. A hazard ratio estimated from a prevalent-user or naive cohort will be\n  reproduced with high precision across 10,000 patients — bias laundered through a sophisticated engine.\n  The comparative input must come from an active-comparator new-user / target-trial-emulated analysis, not\n  a raw RWE contrast.\n\n**Data-source operational depth**\n- **Claims (FFS or commercial):** Excellent for time-to-next-treatment, time-to-discontinuation,\n  time-to-hospitalization, and paid amounts (`days_supply`, NDC, place-of-service, paid columns).\n  Progression is *not* directly observed — use validated proxies (regimen change, new imaging + diagnosis\n  cluster) and acknowledge the misclassification. 
Two specific traps: (1) **Medicare Advantage person-time\n  lacks fee-for-service claims**, so a \"no event\" stretch in an MA enrollee can be unobserved care, not\n  true event-free time — restrict survival/cost estimation to enrollees with complete FFS (Parts A/B/D) or\n  commercial medical+pharmacy coverage and censor at MA switch; (2) **differential competing risks by\n  exposure in elderly claims** — if one arm's patients are older/sicker, they die sooner, truncating\n  observed progression and HCRU; estimate cause-specific or Fine-Gray hazards\n  (competing-risks-cause-specific-fine-gray-rwe) and feed *competing* event distributions to the\n  simulation rather than a single naive time-to-event.\n- **EHR:** Rich for progression timing (imaging, pathology, notes), ECOG/labs, and dose reductions, which\n  sharpen both the event hazards and the utility decrements. But observation is *visit-driven*: sicker\n  patients generate more encounters and therefore more recorded events, inflating apparent event rates.\n  Use inverse-intensity-of-visit weighting or a multistate model that respects irregular observation\n  before exporting hazards to the DES, and treat leaving the system as potentially informative censoring.\n- **Registry:** Gold standard for adjudicated progression, response, and vital status — the cleanest\n  source for the clinical event hazards — but typically limited in duration and to a selected population,\n  and usually missing complete cost/HCRU. Link to claims for the economic inputs.\n- **Linked claims–EHR–vital records:** The ideal substrate for a credible DES — claims for complete\n  longitudinal cost and utilization, EHR for clinical event timing and severity, registry/death index for\n  endpoints. 
Beware linkage selection (only the linkable subset is modeled) and order/fill/service date\n  discrepancies that must be reconciled before assigning each patient's event clock.\n- **Immortal time in procedure-defined states.** When a state is entered only by surviving to a procedure\n  (transplant, second-line start), the time from index to procedure is \"immortal.\" Model the procedure as\n  a *scheduled event with its own time-to-event hazard from index*, never as a baseline split — otherwise\n  the procedure arm inherits guaranteed survival time and the cost-effectiveness is biased toward it.\n\nPre-specify, before running: the event list; the distribution family and covariates/modifiers for each\ntime-to-event and cost process (and exactly which RWE analysis produced each); the history-dependence\nrules; resource modules if any; random-seed handling; the number of patients, replications, and PSA draws;\nand the summary measures (mean cost, QALYs, ICER, CEAC, EVPPI). Cross-validate intermediate outputs\nagainst the source data — simulated vs. observed distribution of lines of therapy, simulated vs. observed\ncumulative cost at 24 months, simulated vs. KM survival — before trusting any extrapolated result.\n\n**Worked claims/EHR example (advanced solid tumor, 1L → 2L → death with AE-driven history dependence).**\nPopulation: Medicare FFS + commercial adults initiating first-line systemic therapy for a solid tumor,\n2018–2022, with 365 days of continuous, FFS-observable enrollment before the index fill (NDC +\n`fill_date` + `days_supply`) and linked EHR for ECOG and labs. Index date = first qualifying fill;\ncontinuous-enrollment and washout rules establish a clean incident cohort, and MA-only person-time is\nexcluded so absence of claims means absence of events. 
From these data: (1) **time-to-progression on 1L**\n(proxy = regimen switch or new metastatic-workup cluster) fitted with a Weibull AFT model with covariates\nage, comorbidity score, and baseline ECOG; (2) **post-1L AE flag** (grade-3+ from diagnosis/procedure\nalgorithms) carried forward as an attribute; (3) **time-to-progression on 2L**, fitted with the *same*\nWeibull but including the prior-AE indicator — observed to shorten 2L TTP (history dependence the DES\nreproduces by reading each patient's `prior_ae` attribute); (4) **competing time-to-death** estimated as a\ncause-specific hazard so that, in the elderly arm, death correctly truncates further progression and cost;\n(5) **monthly cost** from a gamma GLM with log link on paid amounts conditional on current line, months\nsince line start, and a recent-hospitalization flag, winsorized at the 99th percentile. Ten thousand\nsynthetic patients are simulated to death or a 10-year horizon, discounting cost and QALYs at 3%. The\ncounterfactual technology improves 1L TTP (HR 0.75) and cuts the grade-3+ AE rate 30% (with linked cost\nand utility effects). Output: mean life-years, QALYs, lifetime cost, and the ICER vs. standard of care;\na 1,000-draw PSA over the joint parameter posterior yields the CEAC, and EVPPI flags 1L TTP and\npost-progression survival as the highest-value parameters for further RWE collection. Before reporting,\nthe simulated 24-month cumulative cost and the simulated 1L/2L sojourn-time distributions are checked\nagainst the observed claims to confirm the engine reproduces the data it was built from.\n\n**Interpreting the output**\n\nThe worked example simulates three patients through a sequence of line-of-therapy events. 
Patient A (no\nadverse event history) accumulates 3.0 life-years and 2.10 QALYs; Patient B (prior AE) accumulates 2.0\nlife-years and 1.22 QALYs; Patient C (early progression) accumulates 0.7 life-years and 0.53 QALYs.\nMean across three patients: 1.90 life-years and 1.28 QALYs.\n\n*(1) Formal interpretation.* DES outputs are individual-level simulated event pathways aggregated across\na large number of simulated patients (typically 1,000–100,000). Mean QALYs of 1.28 and mean life-years of\n1.90 are estimates of expected values under the model's parameterization. The difference in QALYs between\nPatient A (2.10) and Patient B (1.22) — a gap of 0.88 QALYs — illustrates history dependence: Patient B's\nprior adverse event triggered a utility decrement and altered subsequent event timing in a way that a Markov\nmodel, which has no memory of past events, could not represent without adding tunnel states. This history\ndependence is the defining advantage of DES over cohort Markov: each patient carries a full event history\nas attributes, so any downstream event can be conditioned on prior events without structural changes to the model.\n\n*(2) Practical interpretation.* When interpreting DES results, the analyst should examine both the aggregate\nmeans (the ICER input) and the simulated event-pathway distributions (sojourn times on each line of therapy,\nevent frequencies). If the simulated 24-month cumulative cost distribution does not match the observed claims\ndistribution, the model requires recalibration before the ICER can be trusted. DES results are stochastic —\nrunning the same model twice with a different random seed will produce slightly different means. Report the\nsimulation seed and confirm that the number of simulated patients is large enough that mean QALY and cost\nestimates have stabilized (typically checked by running at 500, 1,000, and 5,000 patients and confirming\nconvergence).",
    "primary_category": "Economic_Evaluation",
    "tags": [
      "discrete-event-simulation",
      "individual-level-simulation",
      "health-economic-modeling",
      "cost-effectiveness",
      "microsimulation",
      "time-to-event",
      "history-dependence",
      "oncology-modeling"
    ],
    "applies_to_study_types": [
      "cost_effectiveness_analysis",
      "economic_evaluation"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1177/0272989X12455462",
        "url": "https://doi.org/10.1177/0272989X12455462",
        "citation_text": "Karnon J, Stahl J, Brennan A, Caro JJ, Mar J, Moller J. Modeling using discrete event simulation: a report of the ISPOR-SMDM Modeling Good Research Practices Task Force-4. Medical Decision Making. 2012;32(5):701-711.",
        "year": 2012,
        "authors_short": "Karnon et al.",
        "notes": "Authoritative ISPOR-SMDM good-research-practices statement on when and how to build, parameterize, validate, and report DES models in health economic evaluation, including data-input and uncertainty requirements directly applicable to RWE parameterization."
      },
      {
        "role": "explain",
        "doi": "10.1111/j.1524-4733.2010.00775.x",
        "url": "https://doi.org/10.1111/j.1524-4733.2010.00775.x",
        "citation_text": "Caro JJ, Moller J, Getsios D. Discrete event simulation: the preferred technique for health economic evaluations? Value in Health. 2010;13(8):1056-1060.",
        "year": 2010,
        "authors_short": "Caro et al.",
        "notes": "Argues the case for individual-level event-driven modeling over cohort Markov, articulating the history-dependence and competing-event advantages that motivate DES when RWE shows complex pathways."
      },
      {
        "role": "demonstrate",
        "doi": "10.1007/s40273-014-0147-9",
        "url": "https://doi.org/10.1007/s40273-014-0147-9",
        "citation_text": "Karnon J, Haji Ali Afzali H. When to use discrete event simulation (DES) for the economic evaluation of health technologies? A review and critique of the costs and benefits of DES. PharmacoEconomics. 2014;32(6):547-558.",
        "year": 2014,
        "authors_short": "Karnon & Haji Ali Afzali",
        "notes": "A critical decision framework for choosing DES over simpler structures, with the central caution that DES is warranted only when its added flexibility resolves a feature (history dependence, queues) that simpler models cannot — directly relevant to the \"when NOT to use\" guidance here."
      },
      {
        "role": "explain",
        "doi": "10.1017/S0266462314000117",
        "url": "https://doi.org/10.1017/S0266462314000117",
        "citation_text": "Standfield L, Comans T, Scuffham P. Markov modeling and discrete event simulation in health care: a systematic comparison. International Journal of Technology Assessment in Health Care. 2014;30(2):165-172.",
        "year": 2014,
        "authors_short": "Standfield et al.",
        "notes": "Systematic comparison of Markov vs DES across published HTA models, showing DES is preferred when individual history and queuing drive cost-effectiveness while Markov remains the efficient, easily-validated default otherwise."
      }
    ],
    "plain_language_summary": "Discrete-event simulation (DES) is a way to model a patient's medical journey by tracking what happens to each individual person one event at a time — things like a cancer progressing, a treatment being switched, or a patient dying — and measuring the cost and health impact of each step. Instead of pushing everyone through the same calendar-based clock, the model fast-forwards each person to their next meaningful event and asks: what did that cost, and how many healthy days did it buy? This lets researchers compare two treatments over a lifetime, accounting for the fact that a patient who had a bad side effect on line 1 therapy tends to do worse on line 2 — something simpler models cannot easily represent.",
    "key_terms": [
      {
        "term": "discrete-event simulation",
        "definition": "A computer model that advances each virtual patient through a series of health events (e.g., progression, hospitalization, death) one at a time rather than in fixed monthly or yearly steps."
      },
      {
        "term": "entity",
        "definition": "The individual virtual patient being tracked in the simulation, who carries their own profile of characteristics such as age, treatment history, and past side effects."
      },
      {
        "term": "event",
        "definition": "A point in simulated time when something clinically meaningful happens to a patient, such as disease progression, a treatment switch, an adverse reaction, or death."
      },
      {
        "term": "history dependence",
        "definition": "The idea that what happens to a patient next can depend on what already happened to them — for example, having had a severe side effect on first-line therapy makes second-line therapy less effective."
      },
      {
        "term": "cost-effectiveness",
        "definition": "A way of asking whether the extra health benefit a new treatment provides is worth the extra money it costs, usually expressed as dollars per quality-adjusted life year (QALY) gained."
      },
      {
        "term": "QALY",
        "definition": "Quality-adjusted life year — one year of life lived in perfect health; a year in poorer health counts as a fraction, used to compare the value of different treatments."
      }
    ],
    "worked_example": {
      "scenario": "Imagine three virtual cancer patients each starting first-line (1L) chemotherapy. We want to trace what happens to each one through treatment events — and tally up how many years of healthy life and how many dollars each path costs. Patient A has an uncomplicated 1L course and dies of disease progression on 2L. Patient B has a grade-3 adverse event (AE) on 1L that shortens their subsequent 2L response. Patient C dies on 1L before ever reaching 2L. This tiny three-patient trace shows how DES captures individual pathways — including the history-dependence effect where Patient B's prior AE worsens their 2L outcome.",
      "dataset": {
        "caption": "Simulated event log: one row per event per patient (the kind of output a DES engine produces internally, based on time-to-event distributions estimated from real-world claims and EHR data).",
        "columns": [
          "person_id",
          "event",
          "time_years",
          "state_after_event",
          "monthly_cost_usd",
          "utility_weight"
        ],
        "rows": [
          [
            "P001",
            "1L start",
            0.0,
            "on_1L",
            8000,
            0.75
          ],
          [
            "P001",
            "progression (no prior AE)",
            1.5,
            "on_2L",
            9500,
            0.65
          ],
          [
            "P001",
            "death",
            3.0,
            "dead",
            0,
            0.0
          ],
          [
            "P002",
            "1L start",
            0.0,
            "on_1L",
            8000,
            0.75
          ],
          [
            "P002",
            "grade-3 AE on 1L",
            0.8,
            "on_1L_post_AE",
            11000,
            0.55
          ],
          [
            "P002",
            "progression (prior AE -> shorter 2L)",
            1.2,
            "on_2L",
            9500,
            0.5
          ],
          [
            "P002",
            "death",
            2.0,
            "dead",
            0,
            0.0
          ],
          [
            "P003",
            "1L start",
            0.0,
            "on_1L",
            8000,
            0.75
          ],
          [
            "P003",
            "death on 1L",
            0.7,
            "dead",
            0,
            0.0
          ]
        ]
      },
      "steps": [
        "For each patient, the simulation draws a time to each possible next event (progression, AE, death) from distributions estimated via real-world data, then schedules whichever event comes first.",
        "Patient A progresses at 1.5 years with no prior AE, moves to 2L, then dies at 3.0 years; life-years = 3.0, QALYs = (1.5 x 0.75) + (1.5 x 0.65) = 1.125 + 0.975 = 2.10.",
        "Patient B has a grade-3 AE at 0.8 years on 1L, so the model sets prior_ae = 1 on their record. They progress to 2L at 1.2 years, and the prior_ae attribute raises the 2L hazard, shortening their time on 2L to 0.8 years (versus 1.5 years for Patient A) and bringing death forward to 2.0 years; life-years = 2.0, QALYs = (0.8 x 0.75) + (0.4 x 0.55) + (0.8 x 0.50) = 0.60 + 0.22 + 0.40 = 1.22.",
        "Patient C dies on 1L at 0.7 years, never reaching 2L; life-years = 0.7, QALYs = 0.7 x 0.75 = 0.525, rounded to 0.53.",
        "A Markov cohort model would need separate tunnel states for on_1L_post_AE to capture this history effect; DES carries it naturally as a patient attribute."
      ],
      "result": "Mean life-years across 3 patients = (3.0 + 2.0 + 0.7) / 3 = 1.90 years. Mean QALYs = (2.10 + 1.22 + 0.53) / 3 = 1.28 QALYs. Patient B demonstrates history dependence: they accrue 0.88 fewer QALYs than Patient A (2.10 - 1.22 = 0.88), a downstream survival and quality-of-life effect the simulation captures through the prior_ae attribute, with no extra model states required."
    },
    "prerequisites": [
      "health-economic-modeling-methods-rwe",
      "markov-transition-probabilities-rwe",
      "partitioned-survival-models-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Pure next-event DES",
        "description": "Every process (progression, AE, death, cost accrual) is advanced by sampling a continuous time-to-event and processing the earliest scheduled event. Most efficient for long horizons with sparse events; the canonical DES form.",
        "edge_cases": [
          "Requires correct competing-risks handling — sample all candidate next-event times and take the minimum, rather than sampling a single marginal time-to-event that ignores competing death.",
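          "Numerical care is needed when a hazard depends on accumulated time-in-state: when an intervening event changes that hazard, the pending event time must be re-drawn conditional on the time already spent in the state (a left-truncated draw); an unconditional fresh draw is valid only for a constant, memoryless hazard."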
        ],
        "data_source_notes": "claims/EHR: feed parametric survival fits (Weibull/log-logistic/generalized gamma) or flexible-parametric spline hazards estimated per state and line, with history attributes as covariates."
      },
      {
        "name": "Hybrid DES with embedded Markov/state sub-models",
        "description": "Some sub-processes (e.g., a stable maintenance phase) are advanced in fixed cycles or as a small Markov chain inside each individual, while the main pathway uses next-event sampling. Trades a little efficiency for easier coding and validation of the routine portion.",
        "edge_cases": [
          "Mixing time scales (continuous next-event vs cyclic) risks double-counting or omitting cost in the cycle boundaries; reconcile accrual at every transition."
        ],
        "data_source_notes": "Useful when part of the pathway is genuinely memoryless (a stable chronic phase) and only the complex part needs full DES."
      },
      {
        "name": "Discrete-time individual microsimulation (fixed tick)",
        "description": "Each patient is stepped through fixed time increments (e.g., monthly) with per-tick event probabilities. Easier to code and audit than true next-event DES, at the cost of efficiency for long horizons and of half-cycle/tick-granularity approximation error.",
        "edge_cases": [
          "Tick length must be short relative to the fastest hazard or events get clustered at tick boundaries; apply a half-cycle correction."
        ],
        "data_source_notes": "Often the pragmatic choice when RWE supplies monthly cost and event rates directly."
      },
      {
        "name": "DES with capacity/queue modules",
        "description": "Adds shared resources (infusion chairs, operating rooms, transplant organs) so patients can wait in queues and outcomes depend on contention — a feature no cohort model can represent.",
        "edge_cases": [
          "Queue parameters (arrival rates, service times, capacity) are rarely fully observable in claims and often require supplementary operational data."
        ],
        "data_source_notes": "registry/operational data usually needed for service times and capacity; claims give arrival (initiation) rates."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "markov-transition-probabilities-rwe",
        "pros_of_this": "Captures arbitrary history dependence and individual pathways without combinatorial explosion of tunnel states; native competing risks; continuous time; can model capacity/queues and time-varying costs/utilities.",
        "cons_of_this": "Opaque (a stochastic program rather than a transition matrix); Monte Carlo output requires many replications for stable means; harder to debug, audit, and explain to clinical/HTA audiences.",
        "when_to_prefer": "When RWE shows strong time-in-state or event-history effects (prior AE, number of prior lines, accumulating comorbidity) that would require an unmanageable number of states in a Markov."
      },
      {
        "compared_to": "partitioned-survival-models-rwe",
        "pros_of_this": "Explicitly models post-progression pathways, subsequent lines, AE branching, and path-dependent cost/HCRU rather than leaving them as the residual area between independent PFS and OS curves; respects PFS-OS dependence and avoids implausible state occupancy.",
        "cons_of_this": "PSM is simpler, deterministic given the curves, and often preferred by HTA reviewers when the primary RWE evidence is two or three survival endpoints and the horizon is near observed follow-up.",
        "when_to_prefer": "Complex multi-line oncology, transplant, or chronic-disease pathways with repeated events where independence of PFS/OS or memoryless transitions is clinically untenable."
      },
      {
        "compared_to": "cascade-of-care-analysis-rwe",
        "pros_of_this": "Cascade is purely descriptive (observed stage-to-stage leakage); DES takes the cascade's uptake/persistence/response rates as inputs and projects forward-looking \"what if we closed the gap at stage 3?\" cost-effectiveness under alternative policies.",
        "cons_of_this": "Cascade requires no simulation assumptions and is immediately interpretable; DES layers structural and distributional assumptions on top of the same RWE.",
        "when_to_prefer": "When the deliverable is a forward-looking cost-effectiveness or budget-impact number under alternative policies or new interventions, rather than a retrospective gap analysis."
      }
    ],
    "implementation_notes_by_data_source": {
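      "claims": "Source of time-to-next-treatment, time-to-discontinuation, time-to-hospitalization, and paid costs (NDC, fill_date, days_supply, paid amounts). Progression is a proxy (regimen change / workup cluster) and is subject to misclassification. Restrict estimation to person-time with complete Medicare fee-for-service coverage (Parts A/B/D) or commercial medical+pharmacy coverage, and censor at a switch to Medicare Advantage (MA), because MA person-time lacks fee-for-service claims and \"no event\" can be unobserved care. Estimate cause-specific hazards (treating the competing event as censoring) for the event times the DES races against each other, so differential mortality by arm in elderly cohorts does not bias the simulated event stream; Fine-Gray subdistribution hazards describe cumulative incidence and are better suited to checking the simulated incidence curves than to driving the event-time draws.",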
      "ehr": "Best for progression timing, ECOG/labs, and dose reductions that drive both hazards and utility decrements. Visit-driven capture inflates event rates for sicker patients; apply inverse-intensity weighting or an irregular-observation multistate model before exporting hazards, and treat leaving the system as informative censoring.",
      "registry": "Gold standard for adjudicated progression, response, and vital status (cleanest clinical-event hazards) but limited duration/selected population and weak on cost/HCRU; link to claims for economic inputs.",
      "linked": "Ideal substrate (claims completeness + EHR severity + reliable mortality) but introduces linkage selection and order/fill/service date discrepancies that must be reconciled before assigning each patient's event clock; model procedure entry as a time-to-event from index to avoid immortal time."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import simpy\nimport numpy as np\nimport pandas as pd\n\nDISCOUNT = 0.03          # annual discount rate for cost and QALYs\nHORIZON_YEARS = 10.0\nP_AE_2L = 0.30           # RWE-estimated grade-3+ AE probability on starting 2L (PSA parameter)\n\ndef _weibull_time(rng, params, covars):\n    \"\"\"Draw a Weibull AFT event time (years). scale = exp(b0 + beta.x).\"\"\"\n    lin = params[\"scale_intercept\"] + sum(\n        params[\"beta\"].get(c, 0.0) * covars.get(c, 0.0) for c in params[\"beta\"]\n    )\n    scale = np.exp(lin)\n    return float(scale * rng.weibull(params[\"shape\"]))\n\ndef _monthly_cost(params, covars):\n    \"\"\"Expected monthly cost from a gamma GLM with log link.\"\"\"\n    lin = params[\"intercept\"] + sum(\n        params[\"beta\"].get(c, 0.0) * covars.get(c, 0.0) for c in params[\"beta\"]\n    )\n    return float(np.exp(lin))\n\ndef _disc(amount, t_years):\n    return amount / ((1.0 + DISCOUNT) ** t_years)\n\ndef patient_process(env, attrs, tte_params, cost_params, results, rng, p_ae_2l=P_AE_2L):\n    # ----- Line 1: compete progression vs death; earliest event wins -----\n    t_prog1 = _weibull_time(rng, tte_params[\"tt_prog_1l\"], attrs)\n    t_death = _weibull_time(rng, tte_params[\"tt_death\"], attrs)\n    t_to_first = min(t_prog1, t_death, HORIZON_YEARS - env.now)\n    yield env.timeout(t_to_first)\n    cost = _disc(_monthly_cost(cost_params[\"1L\"], attrs) * 12 * t_to_first, env.now)\n    qaly = _disc(attrs[\"util_1l\"] * t_to_first, env.now)\n\n    if env.now >= HORIZON_YEARS or t_death <= t_prog1:\n        results.append({\"person_id\": attrs[\"person_id\"], \"ly\": env.now,\n                        \"qaly\": qaly, \"cost\": cost})\n        return\n\n    # ----- Line 2: prior_ae attribute shifts the 2L hazard (history dependence) -----\n    attrs = {**attrs}                                  # progressed; carry an AE realization forward\n    attrs[\"prior_ae\"] = 1 if rng.random() < p_ae_2l else attrs[\"prior_ae\"]\n   
 t_prog2 = _weibull_time(rng, tte_params[\"tt_prog_2l\"], attrs)   # reads attrs['prior_ae']\n    t_death2 = _weibull_time(rng, tte_params[\"tt_death\"], attrs)\n    t_to_next = min(t_prog2, t_death2, HORIZON_YEARS - env.now)\n    yield env.timeout(t_to_next)\n    cost += _disc(_monthly_cost(cost_params[\"2L\"], attrs) * 12 * t_to_next, env.now)\n    qaly += _disc(attrs[\"util_2l\"] * t_to_next, env.now)\n\n    results.append({\"person_id\": attrs[\"person_id\"], \"ly\": env.now,\n                    \"qaly\": qaly, \"cost\": cost})\n\ndef run_des(baseline: pd.DataFrame, tte_params: dict, cost_params: dict,\n            seed: int = 1, p_ae_2l: float = P_AE_2L) -> pd.DataFrame:\n    env = simpy.Environment()\n    rng = np.random.default_rng(seed)\n    results: list[dict] = []\n    for _, row in baseline.iterrows():\n        env.process(patient_process(env, row.to_dict(), tte_params,\n                                    cost_params, results, rng, p_ae_2l))\n    # Run until every patient process finishes; each one stops itself at the horizon.\n    # env.run(until=HORIZON_YEARS) would halt before timeouts that expire exactly at the\n    # horizon are processed, silently dropping horizon-censored patients from results.\n    env.run()\n    return pd.DataFrame(results)",
        "description": "Production-shaped next-event DES for a 1L -> 2L -> death oncology model. This snippet contains NO toy\ndata generation; it consumes RWE-estimated parameters produced upstream by survival/cost regressions.\nRequired inputs (already fitted on the cleaned, FFS-observable cohort):\n  tte_params : dict of parametric survival parameters by transition, keyed as\n               ('tt_prog_1l'|'tt_prog_2l'|'tt_death'), each a dict with Weibull AFT params\n               {'shape': k, 'scale_intercept': b0, 'beta': {covariate: coef}}.\n               tt_prog_2l['beta'] must include 'prior_ae' (history dependence).\n  cost_params: gamma-GLM (log link) params for monthly cost by state:\n               {'1L': {...}, '2L': {...}}, each {'intercept', 'beta':{covar:coef}, 'shape'}.\n               Only the expected monthly cost is used; 'shape' is not read by the code.\n  baseline   : pandas DataFrame, one row per synthetic patient sampled to match the target population:\n               person_id, age, comorbidity_score, ecog, prior_ae (0 at start), util_1l, util_2l.\nEach sojourn's cost and QALYs are discounted as a single lump at the sojourn's end time, with annual\ncompounding at DISCOUNT (a coarse approximation to continuous discounting that over-discounts long\nsojourns); life-years ('ly') are undiscounted. Outcomes are recorded at the first of death, 2L\nprogression, or the horizon, so survival after 2L progression is not simulated in this sketch. Run many\npatients (and wrap in a PSA loop over draws of tte_params/cost_params) to obtain mean cost, QALYs, and the\nICER vs comparator.",
        "dependencies": [
          "simpy",
          "numpy",
          "pandas"
        ],
        "source_citations": [
          "karnon-2012"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(simmer)\nlibrary(data.table)\n\nDISCOUNT <- 0.03\nHORIZON_YEARS <- 10\nP_AE_2L <- 0.30   # RWE-estimated grade-3+ AE probability on starting 2L (PSA parameter)\n\nweibull_time <- function(params, covars) {\n  lin <- params$scale_intercept +\n    sum(vapply(names(params$beta),\n               function(c) params$beta[[c]] * (covars[[c]] %||% 0), numeric(1)))\n  scale <- exp(lin)\n  scale * rweibull(1, shape = params$shape, scale = 1)  # unit-scale draw, AFT-scaled\n}\n`%||%` <- function(a, b) if (is.null(a) || is.na(a)) b else a\n\nrun_des_r <- function(baseline, tte_params, cost_params, seed = 1L, p_ae_2l = P_AE_2L) {\n  set.seed(seed)\n  out <- vector(\"list\", nrow(baseline))\n  for (i in seq_len(nrow(baseline))) {\n    a <- as.list(baseline[i, ])\n    # Line 1: progression vs death compete\n    t_prog1 <- weibull_time(tte_params$tt_prog_1l, a)\n    t_death <- weibull_time(tte_params$tt_death,   a)\n    t1 <- min(t_prog1, t_death, HORIZON_YEARS)\n    disc1 <- 1 / (1 + DISCOUNT)^0\n    cost <- disc1 * exp(cost_params$`1L`$intercept) * 12 * t1\n    qaly <- disc1 * a$util_1l * t1\n    if (t_death <= t_prog1 || t1 >= HORIZON_YEARS) {\n      out[[i]] <- data.table(person_id = a$person_id, ly = t1, qaly = qaly, cost = cost)\n      next\n    }\n    # Line 2: prior_ae attribute shifts the 2L hazard (history dependence)\n    a$prior_ae <- if (runif(1) < p_ae_2l) 1 else a$prior_ae\n    t_prog2 <- weibull_time(tte_params$tt_prog_2l, a)   # uses a$prior_ae\n    t2 <- min(t_prog2, weibull_time(tte_params$tt_death, a), HORIZON_YEARS - t1)\n    disc2 <- 1 / (1 + DISCOUNT)^t1\n    cost <- cost + disc2 * exp(cost_params$`2L`$intercept) * 12 * t2\n    qaly <- qaly + disc2 * a$util_2l * t2\n    out[[i]] <- data.table(person_id = a$person_id, ly = t1 + t2,\n                           qaly = qaly, cost = cost)\n  }\n  rbindlist(out)\n}",
        "description": "Same RWE-parameterized next-event DES in R, written as a plain per-patient loop; simmer is loaded but its\ntrajectory API is not used here. As in the Python version, the time-to-event distributions are NOT toy:\nweibull_time() draws from flexsurv/survival fits passed in via tte_params (Weibull AFT). Required objects\n(built upstream from the FFS-observable cohort):\n  tte_params : list with $tt_prog_1l, $tt_prog_2l (includes a prior_ae coefficient), $tt_death,\n               each list(shape=, scale_intercept=, beta=named_numeric_of_covariate_coefs).\n  cost_params: list $`1L`, $`2L`, each list(intercept=, beta=named_numeric) for a gamma-log GLM; this\n               sketch uses only the intercept, so covariate effects on cost are ignored.\n  baseline   : data.frame, one row per synthetic patient (person_id, age, comorbidity_score, ecog,\n               prior_ae, util_1l, util_2l), sampled to match the target population.\nEach line's accrual is discounted as a lump at the start of that line, whereas the Python version discounts\nat the end of each sojourn, so the two versions will not match exactly.\nWrap run_des_r() in a PSA loop over draws of tte_params/cost_params for the CEAC.",
        "dependencies": [
          "simmer",
          "data.table"
        ],
        "source_citations": [
          "karnon-2012"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* 1L time-to-progression: Weibull AFT distribution feeding the DES tt_prog_1l draw. */\nproc lifereg data=work.surv (where=(line='1L'));\n  model time_years*event(0) = age comorbidity_score ecog / dist=weibull;\n  ods output ParameterEstimates=p_prog1l;\nrun;\n\n/* 2L time-to-progression with prior_ae => the history-dependence coefficient the DES reads. */\nproc lifereg data=work.surv (where=(line='2L'));\n  model time_years*event(0) = age comorbidity_score ecog prior_ae / dist=weibull;\n  ods output ParameterEstimates=p_prog2l;\nrun;\n\n/* Cause-specific death hazard for the DES latent-time sampler. A DES samples a latent\n   event time from each cause-specific hazard and races them, so each cause-specific hazard\n   must be fit by treating the competing event as a censoring event (standard Cox on the\n   cause of interest) -- NOT the Fine-Gray subdistribution hazard (eventcode=N in PROC PHREG),\n   which models a cumulative incidence and is the wrong object for a latent-time race.\n   Here eventcode(0 1)=2 censors progression (code 1) and the no-event code (0), keeping\n   death (code 2) as the modeled event. */\nproc phreg data=work.surv;\n  class line;\n  model time_years*eventcode(0 1) = age comorbidity_score ecog prior_ae;  /* cause-specific death */\n  ods output ParameterEstimates=p_death;\nrun;\n\n/* Monthly cost: gamma GLM with log link -> cost_params for each state. */\nproc genmod data=work.cost;\n  class line;\n  model month_cost = line months_since_line hosp_recent / dist=gamma link=log;\n  ods output ParameterEstimates=p_cost;\nrun;\n\n/* Export fitted parameters for the R/Python DES engine + PSA. 
*/\nproc export data=p_prog1l outfile=\"/proj/des/p_prog1l.csv\" dbms=csv replace; run;\nproc export data=p_prog2l outfile=\"/proj/des/p_prog2l.csv\" dbms=csv replace; run;\nproc export data=p_death  outfile=\"/proj/des/p_death.csv\"  dbms=csv replace; run;\nproc export data=p_cost   outfile=\"/proj/des/p_cost.csv\"   dbms=csv replace; run;",
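        "description": "SAS is rarely the engine for a complex next-event DES; in practice teams estimate the RWE inputs in SAS,\nthen export parameters to R/Python for the individual-level simulation and PSA. This block is the SAS\nhalf that genuinely fits: estimating the time-to-event distributions and the cost model that\nparameterize the DES. Required input datasets (post data-management, restricted to FFS-observable\nperson-time):\n  work.surv : one row per patient-transition -> person_id, line ('1L'/'2L'), time_years, event\n              (1=progression, 0=censored), eventcode (0=censored, 1=progression, 2=death for\n              competing risks), age, comorbidity_score, ecog, prior_ae\n  work.cost : one row per patient-month -> person_id, line, month_cost, months_since_line, hosp_recent\nOutput parameter estimates are written to datasets and exported (e.g., PROC EXPORT / ODS) for the\nsimulation engine. Two conversions are needed downstream: PROC LIFEREG reports the Weibull fit as an\nIntercept, covariate coefficients, and a Scale, so the sampler's shape is 1/Scale; and PROC PHREG returns\nCox log-hazard ratios without a baseline hazard, so the death model must be paired with a baseline-hazard\nestimate (or refit in PROC LIFEREG with progression treated as censoring) before it can supply the Weibull\ntt_death parameters the engine expects.",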
        "dependencies": [],
        "source_citations": [
          "karnon-2012"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  A[Synthetic patient: baseline attributes sampled<br/>to match target population from RWE] --> B[Sample competing next-event times<br/>from RWE-fitted distributions:<br/>progression, AE, death, cost accrual]\n  B --> C[Process EARLIEST event:<br/>update state + history attributes,<br/>accumulate discounted cost and QALYs]\n  C --> D{Death or horizon reached?}\n  D -->|No| B\n  D -->|Yes| E[Record patient outcomes:<br/>LY, QALY, lifetime cost, line sequence]\n  E --> F[Repeat over N patients x PSA draws:<br/>mean cost, QALY, ICER, CEAC, EVPPI]",
        "caption": "Core RWE-parameterized DES loop. Each synthetic patient carries its own history; competing event times are drawn from claims/EHR-fitted distributions (e.g., Weibull time-to-progression by line, cause-specific death hazard), and the earliest event is processed before re-sampling downstream events.",
        "alt_text": "Flowchart of individual-level next-event processing, outcome accumulation, and replication over patients and probabilistic-sensitivity-analysis draws for a discrete-event simulation.",
        "source_type": "illustrative",
        "source_citations": [
          "karnon-2012"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  subgraph DES[\"History dependence DES captures natively\"]\n    NewDx[1L initiation] --> Prog1[Progression on 1L]\n    Prog1 --> AE[Grade 3+ AE -> dose reduction / stop]\n    AE --> Prog2[Shorter time-to-progression on 2L<br/>via prior_ae attribute]\n  end\n  DES --> Markov[A cohort Markov would need many<br/>tunnel states to approximate this path]",
        "caption": "The kind of path dependence (a prior adverse event shortening subsequent time-to-progression and raising cost) that DES represents directly through patient attributes, but that a cohort Markov can only approximate with an explosion of tunnel states.",
        "alt_text": "Diagram of a path where a prior adverse event shortens later time-to-progression, illustrating history dependence that DES handles with attributes and Markov handles only with many states.",
        "source_type": "illustrative",
        "source_citations": [
          "karnon-2014"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "part_of",
        "target_slug": "health-economic-modeling-methods-rwe",
        "notes": "One of the three main structural families (with cohort Markov and partitioned-survival) for RWE-parameterized lifetime cost-effectiveness models."
      },
      {
        "relation_type": "alternative_to",
        "target_slug": "markov-transition-probabilities-rwe",
        "notes": "DES is the natural choice when the memoryless Markov property (or the number of tunnels required to relax it) becomes untenable given the RWE."
      },
      {
        "relation_type": "alternative_to",
        "target_slug": "partitioned-survival-models-rwe",
        "notes": "Preferred when post-progression pathways, subsequent therapy, and history-dependent costs/utilities are central and cannot be left implicit in the area between independent PFS and OS curves."
      },
      {
        "relation_type": "used_with",
        "target_slug": "treatment-patterns-lines-of-therapy",
        "notes": "Lines-of-therapy sequences, time on each line, and switch/augmentation probabilities estimated from RWE are direct inputs to the event-generation logic."
      },
      {
        "relation_type": "used_with",
        "target_slug": "cascade-of-care-analysis-rwe",
        "notes": "Cascade provides observed stage-specific uptake, persistence, and response rates that become the sampling distributions or modifiers in the DES."
      },
      {
        "relation_type": "produces",
        "target_slug": "probabilistic-sensitivity-analysis-hea-rwe",
        "notes": "DES is almost always run with full PSA (and often EVPPI/EVSI) because of the many correlated time-to-event and cost parameters estimated from RWE."
      },
      {
        "relation_type": "used_with",
        "target_slug": "competing-risks-cause-specific-fine-gray-rwe",
        "notes": "Competing risks are handled in DES by scheduling the first of multiple candidate events; the cause-specific hazards estimated from RWE supply those competing event-time distributions, while Fine-Gray cumulative-incidence estimates serve as validation targets for the simulated output."
      }
    ],
    "aliases": [
      "individual-level simulation",
      "patient-level simulation",
      "next-event simulation",
      "DES for HTA",
      "event-driven economic model",
      "individual patient simulation"
    ],
    "complexity": "advanced",
    "regulatory_relevance": [
      "hta",
      "fda",
      "ema"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "disease-registry",
    "name": "Disease Registry",
    "short_definition": "An organized, prospective system that uses observational methods to collect uniform, often adjudicated data on a defined population of patients sharing a particular disease or condition, to study natural history, outcomes, quality of care, or treatment effectiveness and safety.",
    "long_description": "A **disease (condition) registry** is a primary-data-collection observational study: investigators define a\npopulation by a *disease or clinical condition* (e.g., pulmonary arterial hypertension, idiopathic pulmonary\nfibrosis, a specific cancer) and prospectively enroll those patients at participating sites, capturing a\npredefined minimum dataset on a fixed schedule. This is the defining contrast with secondary-data RWD: claims and\nEHR are *repurposed* byproducts of billing and care, whereas a registry collects fit-for-purpose variables —\ndisease severity, adjudicated events, patient-reported outcomes — that claims simply do not contain. Scope note:\nthis entry covers *disease* registries. **Product/exposure registries** (anchored on a drug or device, the typical\nEU PASS or pregnancy-exposure registry) and **health-services registries** (anchored on an encounter or procedure)\nare close cousins with the same machinery but a different enrollment anchor; treat them as separate concepts.\n\n**Core conceptual distinction**. The registry is defined by its *enrollment anchor and data provenance*, not by an\nanalytic contrast. Three things distinguish it from a claims/EHR cohort. (1) *Anchor*: entry is the date a patient\nis enrolled at a participating site after meeting clinical inclusion criteria — typically incident (newly\ndiagnosed) or a mix of incident and prevalent patients, which must be declared because prevalent enrollment\nreintroduces left-truncation and survivor bias. (2) *Provenance*: variables are collected for the study, so\nseverity, stage, ejection fraction, and adjudicated endpoints are available, but completeness depends on site\nstaff and patient retention rather than a claim being filed. (3) *Sampling frame*: patients come from\n*participating sites*, which are rarely a probability sample of all care settings — academic and specialty centers\nare over-represented. 
The registry does not, by itself, deliver a causal estimand; it is a data platform on which\nnatural-history description, quality benchmarking, or a comparative analysis (often an active-comparator new-user\ncontrast nested within the registry, frequently with claims linkage for complete exposure and outcomes) is built.\n\n**Pros, cons, and trade-offs**.\n- **vs claims/EHR secondary-data cohorts:** the registry captures clinical depth claims cannot — disease severity,\n  functional class, adjudicated outcomes, PROs — and applies uniform definitions across sites. Cost: it is\n  expensive and slow, enrolls a non-random slice of patients seen at participating sites, and suffers active loss\n  to follow-up that passive claims do not (claims keep accruing as long as the person is enrolled in the plan).\n  **Prefer the registry** when the scientific question hinges on variables absent from administrative data.\n- **vs a randomized trial:** the registry reflects routine practice, includes patients trials exclude (elderly,\n  comorbid, pregnant), and supports long-term and rare-disease follow-up at scale. Cost: no randomization, so\n  confounding by indication and channeling remain; it answers *what happens in practice*, not the internally valid\n  average treatment effect. **Prefer the registry** for external validity, natural history, and rare diseases where\n  a trial is infeasible.\n- **vs a registry-claims linked study:** linkage to claims and a death index buys complete exposure (every fill,\n  every hospitalization, reliable mortality) on top of registry clinical depth. Cost: only the linkable subset is\n  analyzable (a new selection layer), and registry, claims, and vital-records dates must be reconciled. **Prefer\n  linkage** whenever exposure or outcome completeness is the binding constraint and consent/linkage is feasible.\n\n**When to use**. 
Natural-history and prognostic studies of a defined disease; rare-disease research where no other\nsource has adequate numbers; quality-of-care benchmarking across sites; regulatory commitments that require\nprimary data — FDA post-marketing requirements (PMRs), EMA post-authorization safety/efficacy studies (PASS/PAES),\nHTA managed-entry / outcomes-based agreements, and CMS Coverage with Evidence Development (CED). Use it when the\ndecision-grade variables (severity, adjudicated endpoints, PROs) do not exist in claims or EHR and uniform\nprospective capture across sites is worth the cost.\n\n**When NOT to use — and when it is actively misleading or dangerous**.\n- **When you need a representative population estimate but enrollment is by convenience.** Site-level and patient\n  consent selection make the registry a poor frame for incidence/prevalence or population-attributable burden;\n  presenting registry proportions as population rates is misleading.\n- **When sites cherry-pick patients.** Differential enrollment of healthier (or sicker) patients, or enrolling\n  only those who survive to a referral visit, inflates apparent benefit and biases survival — a covert\n  immortal-time/left-truncation problem if enrollment lags diagnosis.\n- **When the comparison is across registries or across calendar time without harmonization.** Secular drift in\n  capture, evolving diagnostic criteria, and site mix changes confound time trends; registry-to-registry contrasts\n  without common data definitions are unreliable.\n- **When loss to follow-up is differential by exposure or prognosis.** Patients who deteriorate may stop attending\n  the enrolling center; naive complete-case analysis then overstates benefit. 
Treat dropout as potentially\n  informative and quantify it.\n- **When the registry is used as a single-arm external control without a defensible comparator.** Comparing a new\n  drug's registry arm to historical registry patients invites confounding by indication and period effects unless\n  eligibility, time zero, and covariates are explicitly emulated.\n\n**Data-source operational depth**.\n- **Registry (primary):** Strongest for indication, disease severity/stage, adjudicated endpoints, and PROs;\n  enrollment date is the natural index. Failure modes: voluntary site participation and patient consent\n  (selection on both); differential loss to follow-up; secular drift as criteria and site mix change; gaps between\n  the registry data cutoff and current practice that make recent patients look incomplete. Workarounds: predefine a\n  minimum dataset and adjudication charter; restrict to consecutive (not convenience) enrollment where possible;\n  model site as a random effect; censor at last documented contact and conduct loss-to-follow-up sensitivity\n  analyses; restrict to overlapping enrollment calendar time for trend or cross-cohort work.\n- **Claims:** Used for *linkage* to a disease registry to complete exposure (NDC + `fill_date` + `days_supply`)\n  and outcomes (hospitalization, procedures, death). Failure modes specific to claims under linkage:\n  **Medicare Advantage-only person-time lacks fee-for-service claims**, so a registry enrollee who is MA-only\n  appears to have no fills/events — exclude MA-only person-time and require continuous Parts A/B (and D for drug\n  exposure). **Differential competing risks by exposure in elderly registry cohorts**: deaths captured by the\n  registry but not the claims feed (or vice versa) distort cause-specific event rates — link to a death index and\n  use a competing-risk framework, not naive Kaplan-Meier. 
**Immortal time in procedure-anchored sub-studies**:\n  time between enrollment and a later procedure must be classified as pre-procedure person-time, not silently\n  attributed to the post-procedure arm.\n- **EHR:** A registry can be partly auto-populated from a site's EHR, gaining labs and notes cheaply, but\n  visit-driven capture means patients who leave the system are differentially missing; provenance of each variable\n  (entered by abstractor vs pulled from EHR) should be tracked because the two have different error structures.\n- **Linked registry-claims-vital records:** The ideal substrate (clinical depth + exposure/outcome completeness +\n  reliable mortality), but only the linkable, consented subset is analyzable, and registry enrollment dates,\n  claims service dates, and vital-records death dates must be reconciled before time-zero and censoring assignment.\n\n**Worked claims-linked example.** Question: 3-year all-cause mortality and heart-failure hospitalization among\nnewly diagnosed pulmonary arterial hypertension (PAH) patients, using a PAH disease registry (REVEAL-style) linked\nto Medicare. (1) Eligibility: registry-confirmed incident PAH (right-heart-catheterization-adjudicated, enrolled\nwithin 90 days of diagnosis to limit left-truncation), age ≥65, and successful registry→Medicare linkage. (2)\nContinuous-enrollment / observability window: require continuous Medicare Parts A and B (plus Part D if PAH-drug\nexposure is studied) for the 12 months before the registry enrollment date; after enrollment, follow patients only\nwhile that coverage continues (disenrollment is a censoring event in step 6, not an exclusion criterion, which\navoids conditioning eligibility on future information), and **exclude any MA-only person-time** because\nfee-for-service claims (and therefore HF hospitalizations and fills) are not observable then. 
(3) Index date (time zero): registry enrollment date; baseline covariates (WHO functional\nclass, 6-minute walk, hemodynamics) come from the registry visit, comorbidities from the 12-month claims lookback.\n(4) Outcome ascertainment from *both* sources: HF hospitalization from a claims inpatient stay with a qualifying\nprimary diagnosis, and the registry-adjudicated clinical-worsening event; flag and tabulate disagreement (registry\nevent with no claim, or claim with no registry capture) rather than silently trusting one. (5) Death: take the\nearliest of registry-recorded death and the linked death-index date; treat death as a competing risk for HF\nhospitalization (Fine-Gray / cumulative incidence), not as censoring. (6) Censor at the minimum of last documented\nregistry contact, Medicare disenrollment (or MA switch), death, and study end; run a loss-to-follow-up sensitivity\nanalysis comparing those lost vs retained, because PAH patients who deteriorate may stop attending the enrolling\ncenter.",
    "primary_category": "Study_Design",
    "tags": [
      "disease-registry",
      "natural-history",
      "primary-data-collection",
      "patient-registry",
      "registry-claims-linkage",
      "post-marketing-requirement",
      "coverage-with-evidence-development",
      "pharmacoepidemiology"
    ],
    "applies_to_study_types": [
      "disease_registry"
    ],
    "data_sources": [
      "registry",
      "claims",
      "ehr",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1377/hlthaff.2011.0762",
        "url": "https://doi.org/10.1377/hlthaff.2011.0762",
        "citation_text": "Larsson S, Lawyer P, Garellick G, Lindahl B, Lundström M. Use of 13 disease registries in 5 countries demonstrates the potential to use outcome data to improve health care's value. Health Affairs. 2012;31(1):220-227.",
        "year": 2012,
        "authors_short": "Larsson et al.",
        "notes": "Frames the disease registry as a fit-for-purpose outcomes platform across diseases and countries, and why primary registry data deliver value that administrative data cannot."
      },
      {
        "role": "explain",
        "doi": "10.1371/journal.pone.0183667",
        "url": "https://doi.org/10.1371/journal.pone.0183667",
        "citation_text": "Hoque DME, Kumari V, Hoque M, Ruseckaite R, Romero L, Evans SM. Impact of clinical registries on quality of patient care and clinical outcomes: a systematic review. PLOS ONE. 2017;12(9):e0183667.",
        "year": 2017,
        "authors_short": "Hoque et al.",
        "notes": "Systematic review of how clinical/disease registries affect care quality and outcomes; clarifies the mechanisms (benchmarking, feedback) and the methodological limits (selection, completeness) of registry evidence."
      },
      {
        "role": "demonstrate",
        "doi": "10.1161/CIRCULATIONAHA.109.898122",
        "url": "https://doi.org/10.1161/CIRCULATIONAHA.109.898122",
        "citation_text": "Benza RL, Miller DP, Gomberg-Maitland M, et al. Predicting survival in pulmonary arterial hypertension: insights from the Registry to Evaluate Early and Long-term PAH Disease Management (REVEAL). Circulation. 2010;122(2):164-172.",
        "year": 2010,
        "authors_short": "Benza et al.",
        "notes": "Canonical disease-registry analysis (the REVEAL PAH registry used in the worked example) — prognostic modeling of survival from prospectively collected, adjudicated registry data."
      },
      {
        "role": "use",
        "doi": null,
        "url": "https://www.ncbi.nlm.nih.gov/books/NBK562426/",
        "citation_text": "Gliklich RE, Leavy MB, Dreyer NA, editors. Registries for Evaluating Patient Outcomes: A User's Guide. 4th ed. Rockville (MD): Agency for Healthcare Research and Quality (US); 2020.",
        "year": 2020,
        "authors_short": "Gliklich et al.",
        "notes": "The standard operational reference (AHRQ) for designing, governing, and analyzing patient/disease registries, including minimum datasets, data quality, and linkage."
      }
    ],
    "plain_language_summary": "A disease registry is a study that signs up patients who all share one disease (say, pulmonary arterial hypertension) at participating clinics, then follows them over time and records the same set of clinical details for everyone on a fixed schedule. Unlike billing records, which only capture what someone paid for, a registry deliberately collects things like disease stage, lab biomarkers, and carefully reviewed outcomes, so it can describe how a disease unfolds and how patients fare. The catch: it only sees patients at the clinics that chose to take part (often big academic centers), so it is rich in clinical detail but not a representative snapshot of every patient with the disease.",
    "key_terms": [
      {
        "term": "registry",
        "definition": "An organized study that enrolls patients with a shared disease and prospectively collects the same set of clinical fields on each of them over time."
      },
      {
        "term": "claims",
        "definition": "Records generated when care is billed to insurance; they show what services and drugs were paid for, but not clinical detail like disease stage or lab values."
      },
      {
        "term": "adjudicated outcome",
        "definition": "An outcome (like a hospitalization or worsening event) that trained reviewers confirm using a uniform definition, rather than just accepting a billing code at face value."
      },
      {
        "term": "participating sites",
        "definition": "The specific clinics that agreed to enroll patients into the registry; patients seen elsewhere never enter the data."
      },
      {
        "term": "enrollment date",
        "definition": "The day a patient joins the registry at a site after meeting the disease criteria; it serves as the patient's day-zero for follow-up."
      }
    ],
    "worked_example": {
      "scenario": "We have three newly diagnosed pulmonary arterial hypertension (PAH) patients enrolled in a PAH disease registry. We want to see what a registry record actually looks like and why it captures clinical detail that an insurance claims table never would. Each registry row carries the patient's diagnosis date, disease stage, a biomarker, which treatment arm they are on, and a reviewer-confirmed outcome.",
      "dataset": {
        "caption": "A few rows from a disease-registry table: clinical fields collected on purpose for the study, the same way at every site.",
        "columns": [
          "patient_id",
          "diagnosis_date",
          "stage",
          "biomarker",
          "protocol_arm",
          "outcome"
        ],
        "rows": [
          [
            2001,
            "2023-02-14",
            "WHO FC III",
            "NT-proBNP 1450 pg/mL",
            "endothelin-receptor antagonist",
            "clinical worsening"
          ],
          [
            2002,
            "2023-04-03",
            "WHO FC II",
            "NT-proBNP 320 pg/mL",
            "PDE5 inhibitor",
            "stable"
          ],
          [
            2003,
            "2023-05-21",
            "WHO FC IV",
            "NT-proBNP 3100 pg/mL",
            "combination therapy",
            "death"
          ]
        ]
      },
      "steps": [
        "Each row is one enrolled patient. The registry recorded their diagnosis date and disease severity (WHO functional class, where higher is sicker) at enrollment, on a fixed schedule, using the same definitions at every site.",
        "It also stored a biomarker (NT-proBNP, a blood marker that rises with heart strain) and which drug arm the patient is on. These are exactly the decision-grade clinical variables a billing claims table does not contain.",
        "The outcome column holds a reviewer-confirmed event, not a raw billing code: patient 2001 worsened, 2002 stayed stable, 2003 died. A claims table would at best show a hospital stay or a death payment flag, with no stage or biomarker to explain it.",
        "Because only these three patients were seen at participating sites, the table is detailed but not a representative count of all PAH patients; sicker patients at academic centers can be over-represented."
      ],
      "result": "The registry rows tell you that of three incident PAH patients, the two sicker ones at enrollment (WHO FC III and IV, with high NT-proBNP of 1450 and 3100) went on to worsen or die, while the milder patient (WHO FC II, NT-proBNP 320) stayed stable. That severity-to-outcome story is visible only because the registry deliberately captured stage and biomarker, detail claims data lacks, and it applies only to patients enrolled at the participating sites."
    },
    "prerequisites": [
      "cohort-prospective",
      "claims-analysis"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Incident (inception) disease registry",
        "description": "Enrolls patients at or near initial diagnosis, giving an inception cohort with a well-defined zero point for natural-history and prognostic analyses.",
        "edge_cases": [
          "If enrollment lags diagnosis, patients who die before reaching the enrolling center are missed, creating left-truncation/immortal-time bias; restrict to enrollment within a short window of diagnosis.",
          "Incidence definitions drift as diagnostic criteria evolve; record the diagnostic vintage."
        ],
        "data_source_notes": "registry: anchor index on diagnosis-confirmed enrollment; claims linkage: require continuous enrollment around the diagnosis date to confirm incident (no prior disease-defining claims)."
      },
      {
        "name": "Prevalent (cross-sectional inception mix) disease registry",
        "description": "Enrolls patients at varying disease durations, maximizing sample size and feasibility in rare diseases.",
        "edge_cases": [
          "Survivor bias and left-truncation: long-duration survivors are over-represented, biasing survival and prognostic estimates unless duration is modeled (e.g., delayed entry / left-truncated risk sets).",
          "Mixing incident and prevalent patients without a duration covariate confounds time-since-onset with calendar effects."
        ],
        "data_source_notes": "registry: capture and adjust for disease duration at entry; use left-truncated survival models with entry at enrollment."
      },
      {
        "name": "Registry-claims (or registry-EHR) linked study",
        "description": "Links registry clinical depth to administrative data for complete exposure, outcomes, and mortality; the dominant design for comparative effectiveness/safety nested in a registry.",
        "edge_cases": [
          "Only the consented, linkable subset is analyzable — a selection layer distinct from registry enrollment itself.",
          "Date reconciliation across registry enrollment, claims service, and vital-records death is required before assigning time zero and censoring; MA-only person-time must be excluded."
        ],
        "data_source_notes": "claims: NDC + fill_date + days_supply for exposure, inpatient claims for events; exclude MA-only spans; link to a death index and use competing-risk methods in elderly cohorts."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Secondary-data claims/EHR cohort",
        "pros_of_this": "Captures disease severity, adjudicated endpoints, and PROs absent from administrative data; uniform prospective definitions across sites.",
        "cons_of_this": "Expensive and slow; non-random site/patient selection; active loss to follow-up that passive claims do not have.",
        "when_to_prefer": "When decision-grade variables (severity, adjudicated outcomes, PROs) are not present in claims or EHR and uniform prospective capture is worth the cost."
      },
      {
        "compared_to": "Randomized controlled trial",
        "pros_of_this": "Reflects routine practice and includes trial-excluded patients; supports long-term and rare-disease follow-up; high external validity.",
        "cons_of_this": "No randomization, so confounding by indication and channeling persist; cannot deliver an internally valid average treatment effect on its own.",
        "when_to_prefer": "Natural history, rare diseases, long-term outcomes, and external-validity questions where a trial is infeasible or unethical."
      },
      {
        "compared_to": "Registry-claims linked study",
        "pros_of_this": "Registry alone avoids the consent/linkage selection layer and date-reconciliation burden.",
        "cons_of_this": "Exposure and outcome ascertainment are incomplete without claims (missed fills, out-of-network hospitalizations, unrecorded deaths).",
        "when_to_prefer": "When the registry's own capture is adequate and linkage is infeasible; otherwise linkage is preferred when exposure/outcome completeness binds."
      }
    ],
    "implementation_notes_by_data_source": {
      "registry": "Predefine a minimum dataset and adjudication charter; record enrollment anchor (incident vs prevalent) and disease duration; model site as a random effect; censor at last documented contact and run loss-to-follow-up sensitivity analyses.",
      "claims": "Used for linkage to complete exposure and outcomes. Exclude Medicare Advantage-only person-time (no FFS claims); require continuous Parts A/B (+D for drug exposure); link to a death index and treat death as a competing risk in elderly cohorts; avoid immortal time in procedure-anchored sub-studies.",
      "ehr": "May auto-populate registry fields (labs, notes) cheaply, but visit-driven capture loses patients who leave the system; track per-variable provenance (abstracted vs EHR-pulled) because error structures differ.",
      "linked": "Reconcile registry enrollment, claims service, and vital-records death dates before assigning time zero and censoring; only the consented, linkable subset is analyzable, which is a distinct selection layer to characterize."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\nimport numpy as np\n\nBASELINE_DAYS = 365            # claims lookback for comorbidities / incident confirmation\nINCIDENT_WINDOW = 90           # max days from diagnosis to enrollment (limit left-truncation)\nSTUDY_END = pd.Timestamp(\"2020-12-31\")\n\ndef build_registry_cohort(reg, enroll, events, death):\n    # 1) Inception restriction: incident patients enrolled close to diagnosis.\n    reg = reg.copy()\n    reg[\"dx_to_enroll\"] = (reg[\"enroll_date\"] - reg[\"dx_date\"]).dt.days\n    coh = reg[reg[\"incident_flag\"] & (reg[\"dx_to_enroll\"].between(0, INCIDENT_WINDOW))].copy()\n    coh[\"t0\"] = coh[\"enroll_date\"]\n    coh[\"baseline_start\"] = coh[\"t0\"] - pd.Timedelta(days=BASELINE_DAYS)\n\n    # 2) Observability: continuous, FFS-observable claims (no MA-only) across baseline through t0.\n    e = enroll.merge(coh[[\"person_id\", \"t0\", \"baseline_start\"]], on=\"person_id\")\n    e[\"covers\"] = ((e[\"enroll_start\"] <= e[\"baseline_start\"]) &\n                   (e[\"enroll_end\"]   >= e[\"t0\"]) &\n                   (~e[\"ma_only\"]))                     # MA-only person-time lacks FFS claims\n    linkable = e.loc[e[\"covers\"], \"person_id\"].unique()\n    coh = coh[coh[\"person_id\"].isin(linkable)].copy()   # consented + linkable subset only\n\n    # 3) Death from the earliest of registry death and death-index date (reconcile sources).\n    d = death.groupby(\"person_id\")[\"death_date\"].min().rename(\"dx_death\")\n    coh = coh.merge(d, on=\"person_id\", how=\"left\")\n\n    # 4) Censoring = min(last registry contact, claims disenroll/MA switch, death, study end).\n    disenroll = (enroll[~enroll[\"ma_only\"]].groupby(\"person_id\")[\"enroll_end\"].max()\n                                           .rename(\"ffs_end\"))\n    coh = coh.merge(disenroll, on=\"person_id\", how=\"left\")\n    coh[\"censor_date\"] = coh[[\"last_contact_date\", \"ffs_end\", \"dx_death\"]].min(axis=1)\n    coh[\"censor_date\"] = 
coh[\"censor_date\"].clip(upper=STUDY_END)\n\n    # 5) Outcome from BOTH sources within follow-up; flag disagreement rather than trusting one.\n    ev = events[events[\"event_flag\"]].merge(coh[[\"person_id\", \"t0\", \"censor_date\"]], on=\"person_id\")\n    ev = ev[(ev[\"service_date\"] >= ev[\"t0\"]) & (ev[\"service_date\"] <= ev[\"censor_date\"])]\n    claims_evt = ev.groupby(\"person_id\")[\"service_date\"].min().rename(\"claims_event_date\")\n    coh = coh.merge(claims_evt, on=\"person_id\", how=\"left\")\n    coh[\"registry_event\"] = coh[\"reg_event_date\"].between(coh[\"t0\"], coh[\"censor_date\"])\n    coh[\"claims_event\"]   = coh[\"claims_event_date\"].notna()\n    coh[\"event_disagree\"] = coh[\"registry_event\"] ^ coh[\"claims_event\"]   # source disagreement to tabulate\n    return coh[[\"person_id\", \"site_id\", \"t0\", \"baseline_start\", \"censor_date\",\n                \"registry_event\", \"claims_event\", \"event_disagree\", \"dx_death\"]]",
        "description": "Disease-registry cohort construction with claims linkage (Study_Design, not estimation). Required inputs\n(cleaned, de-duplicated):\n  reg    : registry enrollment  -> person_id, enroll_date (datetime), incident_flag (bool), dx_date (datetime),\n           site_id, last_contact_date (datetime), reg_event_date (datetime or NaT)  # adjudicated outcome\n  enroll : claims enrollment spans -> person_id, enroll_start, enroll_end, part_d (bool), ma_only (bool)\n  events : claims outcomes      -> person_id, service_date (datetime), event_flag (e.g. HF hospitalization)\n  death  : death index          -> person_id, death_date (datetime)\nReturns one analysis-ready row per linkable, incident registry enrollee with time zero, censoring date, and\noutcome from BOTH registry and claims (disagreement flagged). Build covariates only from [baseline_start, t0].",
        "dependencies": [
          "pandas",
          "numpy"
        ],
        "source_citations": [
          "benza-2010"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\nBASELINE_DAYS   <- 365L\nINCIDENT_WINDOW <- 90L\nSTUDY_END       <- as.Date(\"2020-12-31\")\n\nbuild_registry_cohort <- function(reg, enroll, events, death) {\n  setDT(reg); setDT(enroll); setDT(events); setDT(death)\n\n  # 1) Inception restriction: incident, enrolled close to diagnosis.\n  coh <- reg[incident_flag &\n             as.integer(enroll_date - dx_date) %between% c(0L, INCIDENT_WINDOW)]\n  coh[, t0 := enroll_date]\n  coh[, baseline_start := t0 - BASELINE_DAYS]\n\n  # 2) Observability: continuous FFS-observable claims (no MA-only) across baseline through t0.\n  e <- merge(enroll, coh[, .(person_id, t0, baseline_start)], by = \"person_id\")\n  linkable <- e[enroll_start <= baseline_start & enroll_end >= t0 & !ma_only,\n                unique(person_id)]\n  coh <- coh[person_id %chin% linkable]\n\n  # 3) Death = earliest of registry/death-index; 4) censor = min(last contact, FFS end, death, study end).\n  d  <- death[, .(dx_death = min(death_date)), by = person_id]\n  ff <- enroll[!ma_only, .(ffs_end = max(enroll_end)), by = person_id]\n  coh <- merge(coh, d,  by = \"person_id\", all.x = TRUE)\n  coh <- merge(coh, ff, by = \"person_id\", all.x = TRUE)\n  coh[, censor_date := pmin(last_contact_date, ffs_end, dx_death, STUDY_END, na.rm = TRUE)]\n\n  # 5) Outcome from BOTH sources; flag disagreement.\n  ev <- merge(events[event_flag == TRUE], coh[, .(person_id, t0, censor_date)], by = \"person_id\")\n  ev <- ev[service_date >= t0 & service_date <= censor_date,\n           .(claims_event_date = min(service_date)), by = person_id]\n  coh <- merge(coh, ev, by = \"person_id\", all.x = TRUE)\n  coh[, registry_event := reg_event_date >= t0 & reg_event_date <= censor_date]\n  coh[, claims_event   := !is.na(claims_event_date)]\n  coh[, event_disagree := xor(registry_event %in% TRUE, claims_event)]\n  coh[, .(person_id, site_id, t0, baseline_start, censor_date,\n          registry_event, claims_event, event_disagree, dx_death)]\n}",
        "description": "Disease-registry cohort construction with claims linkage using data.table. Inputs mirror the Python version:\n  reg    : person_id, enroll_date (Date), incident_flag (logical), dx_date (Date), site_id,\n           last_contact_date (Date), reg_event_date (Date or NA)\n  enroll : person_id, enroll_start, enroll_end, part_d (logical), ma_only (logical)\n  events : person_id, service_date (Date), event_flag (logical)\n  death  : person_id, death_date (Date)",
        "dependencies": [
          "data.table"
        ],
        "source_citations": [
          "benza-2010"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "%let baseline = 365;   /* claims lookback (days)                         */\n%let incwin   = 90;    /* max diagnosis->enrollment gap (days)           */\n%let studyend = '31DEC2020'd;\n\n/* 1) Inception restriction: incident patients enrolled near diagnosis. */\nproc sql;\n  create table coh0 as\n  select person_id, site_id, last_contact_date, reg_event_date,\n         enroll_date as t0 format=date9.,\n         enroll_date - &baseline as baseline_start format=date9.\n  from work.reg\n  where incident_flag = 1\n    and (enroll_date - dx_date) between 0 and &incwin;\nquit;\n\n/* 2) Observability: continuous FFS-observable claims (no MA-only) across baseline through t0. */\nproc sql;\n  create table coh1 as\n  select c.*\n  from coh0 c\n  where exists (\n    select 1 from work.enroll e\n    where e.person_id = c.person_id\n      and e.ma_only = 0                       /* MA-only person-time lacks FFS claims */\n      and e.enroll_start <= c.baseline_start\n      and e.enroll_end   >= c.t0\n  );\nquit;\n\n/* 3) Death = earliest registry/death-index date; FFS end = latest non-MA span end. */\nproc sql;\n  create table coh2 as\n  select c.*,\n         (select min(d.death_date)  from work.death  d where d.person_id = c.person_id) as dx_death format=date9.,\n         (select max(e.enroll_end)  from work.enroll e\n            where e.person_id = c.person_id and e.ma_only = 0)                          as ffs_end  format=date9.\n  from coh1 c;\nquit;\n\n/* 4) Censor = min(last contact, FFS end, death, study end). */\ndata coh3;\n  set coh2;\n  censor_date = min(last_contact_date, ffs_end, dx_death, &studyend);\n  format censor_date date9.;\nrun;\n\n/* 5) First claims outcome within follow-up; flag registry/claims disagreement. 
*/\nproc sql;\n  create table claims_evt as\n  select v.person_id, min(v.service_date) as claims_event_date format=date9.\n  from work.events v\n    inner join coh3 c on v.person_id = c.person_id\n  where v.event_flag = 1\n    and v.service_date between c.t0 and c.censor_date\n  group by v.person_id;\nquit;\n\nproc sql;\n  create table work.cohort as\n  select c.person_id, c.site_id, c.t0, c.baseline_start, c.censor_date, c.dx_death,\n         (c.reg_event_date between c.t0 and c.censor_date)            as registry_event,\n         (e.claims_event_date is not null)                           as claims_event,\n         ((c.reg_event_date between c.t0 and c.censor_date) ne\n          (e.claims_event_date is not null))                         as event_disagree\n  from coh3 c left join claims_evt e on c.person_id = e.person_id;\nquit;",
        "description": "Disease-registry cohort construction with claims linkage in SAS (PROC SQL / data step; Study_Design, not\nestimation). Required input datasets (post data-management):\n  work.reg    : person_id, enroll_date, incident_flag (0/1), dx_date, site_id, last_contact_date, reg_event_date\n  work.enroll : person_id, enroll_start, enroll_end, part_d (0/1), ma_only (0/1)\n  work.events : person_id, service_date, event_flag (0/1)\n  work.death  : person_id, death_date\nAll dates are SAS date values. Produces work.cohort: one row per incident, linkable enrollee with time zero,\ncensoring date, and outcome from both the registry and claims (disagreement flagged for QC).",
        "dependencies": [],
        "source_citations": [
          "benza-2010"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Sites[Participating sites<br/>academic / specialty centers] --> Consent[Patient consent + inclusion criteria met]\n  Consent --> Enroll[Registry enrollment = index/time zero<br/>minimum dataset captured]\n  Enroll --> Adj[Adjudication of endpoints<br/>uniform definitions across sites]\n  Adj --> Link[Linkage to claims + death index<br/>consented, linkable subset only]\n  Link --> Recon[Reconcile registry / claims / vital-records dates<br/>exclude MA-only person-time]\n  Recon --> Analysis[Analysis-ready cohort<br/>outcome from registry AND claims, disagreement flagged]\n  Analysis --> Sens[Sensitivity: loss-to-follow-up, site random effect,<br/>competing-risk death, calendar overlap]",
        "caption": "Data flow for a disease registry with claims/vital-records linkage — from site participation and consent through enrollment (time zero), adjudication, linkage, date reconciliation, and an analysis set that uses both registry and claims outcome capture.",
        "alt_text": "Flowchart from participating sites and patient consent through registry enrollment, endpoint adjudication, claims and death-index linkage, date reconciliation, and a final analysis cohort with sensitivity analyses.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  Site[Site participation] --> Enroll((Enrolled in registry))\n  Consent[Patient consent / linkage] --> Enroll\n  Severity[Disease severity] --> Enroll\n  Severity --> Outcome[Outcome]\n  Site --> Ascert[Outcome ascertainment]\n  Enroll --> Ascert\n  Ascert --> Outcome\n  classDef collider fill:#ffe0e0,stroke:#c00;\n  class Enroll collider;",
        "caption": "Selection-bias DAG. Enrollment is a collider opened by conditioning on the analyzed registry sample — site participation and patient consent both feed enrollment, and site participation also affects how completely outcomes are ascertained, so conditioning on being in the registry can induce a spurious site/consent-to-outcome association. Disease severity is a common cause of both enrollment and outcome.",
        "alt_text": "Directed acyclic graph showing registry enrollment as a collider influenced by site participation, patient consent, and disease severity, with site participation and severity also pointing to outcome ascertainment and the outcome.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "used_with",
        "target_slug": "active-comparator-new-user",
        "notes": "Comparative effectiveness/safety questions are often answered by nesting an active-comparator new-user contrast inside a disease registry (with claims linkage for complete exposure), combining registry clinical depth with a defensible time-zero design."
      },
      {
        "relation_type": "used_with",
        "target_slug": "target-trial-emulation",
        "notes": "A registry provides the eligibility and baseline-severity data to emulate a target trial; explicit time zero, eligibility, and covariate windows guard against immortal time and confounding by indication."
      },
      {
        "relation_type": "requires",
        "target_slug": "propensity-score-methods-psm-iptw",
        "notes": "Because registry treatment groups are not randomized, propensity-score or weighting adjustment on registry-measured confounders is the standard balancing step for any within-registry comparative analysis."
      },
      {
        "relation_type": "see_also",
        "target_slug": "immortal-time-bias-handling",
        "notes": "Lag between diagnosis and registry enrollment, and procedure-anchored sub-studies, both create immortal time/left-truncation that must be handled by restricting the enrollment window and classifying pre-event person-time."
      }
    ],
    "aliases": [
      "patient registry",
      "clinical registry",
      "condition registry",
      "disease/condition registry"
    ],
    "complexity": "foundational",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "disease-risk-score-rwe",
    "name": "Disease Risk Scores",
    "short_definition": "A single summary confounder score - the model-predicted baseline risk of the outcome (the prognostic analogue of the propensity score) - fit in an unexposed or historical population, then applied to everyone so that stratifying, matching, or adjusting on this one number controls many confounders at once; especially useful when exposure is rare but the outcome is common, or when a newly launched drug has no exposure history to model.",
    "long_description": "A **disease risk score (DRS)** collapses a long list of baseline confounders into a single number: each\npatient's **predicted probability (or hazard) of the outcome in the absence of the exposure of interest**.\nWhere a propensity score models the *exposure* (probability of getting the drug given covariates), a DRS\nmodels the *outcome* (probability of the event given covariates) - it is, in Hansen's phrase, the\n**prognostic analogue of the propensity score**. You fit an outcome-prediction model, read off each person's\nbaseline risk, and then control confounding by **stratifying, matching, or regression-adjusting on that one\nscore** instead of on the raw covariates. Two patients with the same DRS have the same baseline prognosis, so\ncomparing the exposed and the unexposed *within* a DRS stratum removes the confounding those covariates carried.\n\n**Core conceptual distinction - where the model is fit.** This is the decision that makes or breaks a DRS.\n- **Unexposed-only (reference-population) DRS:** Fit the outcome model among the unexposed (the comparator or\n  background population), then score everyone, including the exposed. This is the original Hansen construction.\n  It cannot bake the exposure effect into the score, so it does not \"adjust away\" a real effect - but it\n  assumes the covariate-outcome relationships estimated in the unexposed transport to the exposed.\n- **Full-cohort DRS:** Fit the outcome model in the whole cohort with an exposure term included, then set the\n  exposure term to \"unexposed\" for everyone when predicting. Statistically efficient, but it borrows the\n  exposed patients' outcomes to estimate the score, risking a subtle form of overfitting-to-the-effect and\n  constraining the exposure-outcome relationship to the model's form.\n- **Historical-period DRS:** Fit the model in a *historical* cohort from before the drug existed (or before a\n  formulary change), then apply it to the current cohort. 
This is the move that rescues studies of **new market\n  entrants** - a just-launched drug has no historical users, so you cannot build a propensity model that\n  separates its users from comparators, but you *can* build an outcome model in the pre-launch era and carry it\n  forward. The cost is calendar drift: secular changes in coding, treatment, and baseline risk can make a\n  historical score miscalibrated for the present.\n\n**Pros, cons, and trade-offs** (specific and comparative).\n- **vs propensity-score-methods-psm-iptw (the main alternative):** The classic asymmetry is exposure\n  prevalence vs outcome frequency. A propensity score is hard to estimate well when **exposure is rare** (few\n  exposed patients to model who gets the drug) but easy when the outcome is rare; a DRS is the mirror image -\n  it is hard to estimate when the **outcome is rare** (few events to fit the risk model) but thrives when the\n  outcome is **common** and exposure is rare. So **prefer a DRS for a rare exposure with a common outcome**;\n  prefer a propensity score for a common exposure with a rare outcome. The other decisive DRS advantage is the\n  **new-drug / no-exposure-history** case above, where a propensity model literally cannot be fit.\n- **vs high-dimensional-propensity-score-hdps-rwe:** hdPS is a propensity-side, data-adaptive way to surface\n  hundreds of proxy confounders from claims; a DRS is an outcome-side single summary. They are not rivals - you\n  can build a **high-dimensional disease risk score** with the same proxy-selection machinery on the outcome\n  model. The trade-off is the same exposure-vs-outcome-frequency one, plus DRS's reliance on getting the\n  *outcome* model's functional form right.\n- **vs traditional multivariable outcome regression:** A DRS is essentially a *two-stage* version of outcome\n  regression - fit the covariate-outcome model once, then condition the exposure contrast on its single\n  output. 
The practical pay-offs are a clean separation of the confounder model from the effect estimate\n  (you can inspect DRS overlap before ever looking at the exposure effect, reducing the temptation to fish),\n  parsimony when covariates vastly outnumber events, and a transparent diagnostic (DRS distribution by\n  exposure). In large simulations the three approaches perform comparably when correctly specified;\n  DRS's edge is operational, not a magic bias reducer.\n\n**When to use.** Reach for a DRS when (1) the **exposure is rare and the outcome is common**, so there is\nplenty of outcome signal to fit a risk model but too few exposed to fit a stable propensity model;\n(2) you are studying a **newly launched drug or new market entrant** with no historical exposure to model -\nfit the outcome model in the pre-launch/unexposed era and carry it forward; (3) you have **multiple exposures\nor comparators** sharing one outcome - one DRS can be reused across exposure contrasts, whereas each contrast\nneeds its own propensity model; (4) you want a confounder summary you can **lock down and inspect** before\nunblinding the exposure effect.\n\n**When NOT to use - and when it is actively misleading.**\n- **Rare outcome.** With few events, the outcome model is unstable and overfit; the DRS inherits that noise.\n  Here a propensity score (modeling the abundant exposure signal) is the better tool. Do not force a DRS onto\n  a rare-event study just to avoid a propensity model.\n- **Outcome-model misspecification.** A DRS is only as good as the risk model behind it. If the functional\n  form is wrong (omitted interactions, wrong link, miscaptured non-linearity), residual confounding leaks\n  through even after you stratify on the score - and unlike a propensity score, you cannot diagnose it by\n  checking covariate balance across exposure groups, because the DRS is not built to balance covariates by\n  exposure. 
Check DRS overlap and the risk model's calibration explicitly instead.\n- **Fitting in the wrong population (the most dangerous trap).** If you fit a *full-cohort* DRS with the\n  exposed included and let the model see the exposure's effect, the score can partially absorb a true exposure\n  effect, biasing the contrast toward the null. And if you fit an *unexposed-only* or *historical* DRS in a\n  population whose covariate-outcome relationships do not transport to the exposed (different era, different\n  case mix, calendar drift in coding), the score is miscalibrated and residual confounding returns. Always\n  pre-specify the fitting population, and sanity-check the historical or unexposed model's calibration in the\n  target cohort.\n\n**Data-source operational depth.** In **claims** the DRS outcome model is typically a logistic or Cox model\nbuilt from baseline diagnoses, procedures, drugs, demographics, and prior utilization in the pre-index window;\nthe standard immortal-time and look-back hygiene of any new-user design applies (covariates must be measured\n*before* the index date, never after). In **EHR** richer predictors (labs, vitals, smoking, BMI) usually make\nthe outcome model far better calibrated than a claims-only DRS, which is the whole game for a method that lives\nor dies by outcome-model quality. **Registry** data can supply adjudicated outcomes and a clean reference\ncohort for fitting. **Linked** claims-EHR is the ideal substrate: EHR for the predictors that calibrate the\nrisk model, claims for complete capture of the (common) outcome events and follow-up.\n\n**Interpreting the output**\n\nUsing the worked example above: the crude comparison yields a risk difference of 0.33 − 0.17 = 0.16 (treated\npatients appear to have higher MI risk). 
DRS stratification collapses this to a pooled risk difference of 0.00.\n\nFormal interpretation: The crude risk difference of 0.16 reflects confounding by baseline prognosis — treated\npatients were concentrated in the high-risk DRS band (score ≈ 0.35), where MI rates were naturally elevated.\nWithin each DRS stratum, the treated–untreated risk difference is 0.50 − 0.50 = 0.00 in the high-risk band and\n0 − 0 = 0.00 in the low-risk band. The DRS-stratified pooled risk difference of 0.00 estimates the treatment\neffect conditional on baseline prognostic risk — among patients for whom comparable baseline prognosis exists in\nboth treatment arms. This is a valid causal estimate under the assumption that the DRS outcome model is correctly\nspecified in the unexposed reference population and that no unmeasured confounding remains after conditioning on\nbaseline risk strata. Unlike propensity score methods, DRS adjustment does not require a well-specified exposure\nmodel, but it does require a well-calibrated outcome model.\n\nPractical interpretation: Once patients with similar underlying heart attack risk are compared within the same\nstratum, the drug shows no excess MI risk — the apparent harm in the crude analysis was entirely driven by sicker\npatients being disproportionately in the treated group. The DRS adjustment is especially useful when exposure is\nrare and a propensity score model would be unstable, because the risk model is estimated in the more numerous\nunexposed population.",
    "primary_category": "Causal_Inference_Method",
    "tags": [
      "disease-risk-score",
      "prognostic-score",
      "confounder-summary",
      "outcome-model",
      "propensity-score-analogue",
      "rare-exposure",
      "new-market-entrant",
      "stratification",
      "confounding-control",
      "claims",
      "ehr"
    ],
    "applies_to_study_types": [
      "new_user",
      "active_comparator_new_user",
      "cohort_retrospective",
      "claims_analysis",
      "ehr_study",
      "comparative_effectiveness"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1093/biomet/asn004",
        "url": "https://doi.org/10.1093/biomet/asn004",
        "citation_text": "Hansen BB. The prognostic analogue of the propensity score. Biometrika. 2008;95(2):481-488.",
        "year": 2008,
        "authors_short": "Hansen",
        "notes": "Introduces the disease (prognostic) risk score as the formal outcome-side analogue of the propensity score, establishing the balancing-score theory that justifies stratifying, matching, or adjusting on a single predicted-baseline-risk summary."
      },
      {
        "role": "explain",
        "doi": "10.1177/0962280208092347",
        "url": "https://doi.org/10.1177/0962280208092347",
        "citation_text": "Arbogast PG, Ray WA. Use of disease risk scores in pharmacoepidemiologic studies. Statistical Methods in Medical Research. 2009;18(1):67-80.",
        "year": 2009,
        "authors_short": "Arbogast & Ray",
        "notes": "The canonical pharmacoepi how-to - when DRS beats a propensity score (rare exposure, common outcome, new agents with no exposure history), how to choose the fitting population, and the overfitting/transportability pitfalls of full-cohort vs unexposed-only scores."
      },
      {
        "role": "demonstrate",
        "doi": "10.1093/aje/kwr143",
        "url": "https://doi.org/10.1093/aje/kwr143",
        "citation_text": "Arbogast PG, Ray WA. Performance of disease risk scores, propensity scores, and traditional multivariable outcome regression in the presence of multiple confounders. American Journal of Epidemiology. 2011;174(5):613-620.",
        "year": 2011,
        "authors_short": "Arbogast & Ray",
        "notes": "Head-to-head simulation showing disease risk scores, propensity scores, and outcome regression perform comparably when correctly specified but that DRS is the practical choice when confounders greatly outnumber exposed patients - the empirical backing for the rare-exposure rule of thumb."
      },
      {
        "role": "use",
        "doi": "10.1093/aje/kwv269",
        "url": "https://doi.org/10.1093/aje/kwv269",
        "citation_text": "Desai RJ, Glynn RJ, Wang S, Gagne JJ. Performance of disease risk score matching in nested case-control studies: a simulation study. American Journal of Epidemiology. 2016;183(10):949-957.",
        "year": 2016,
        "authors_short": "Desai et al.",
        "notes": "Applied evaluation of DRS matching in nested case-control designs, characterizing when the score recovers valid effect estimates and how outcome-model misspecification and the fitting population degrade performance."
      }
    ],
    "plain_language_summary": "A disease risk score (DRS) is a single number that captures how likely a patient is to have the outcome based on their baseline health, before considering the drug being studied. You build it by fitting a model that predicts the outcome from background characteristics (usually in patients who did not take the drug), then giving every patient a score. Comparing treated and untreated patients who share the same score controls many confounders at once - and it works even when very few people took the drug or the drug is brand new, situations where the more common propensity score struggles. The catch: the score is only as good as the outcome model behind it, so a wrong model or one fit in the wrong group of patients quietly leaves confounding behind.\n",
    "key_terms": [
      {
        "term": "disease risk score",
        "definition": "A patient's model-predicted chance of having the outcome based on their baseline characteristics, before accounting for the treatment being studied."
      },
      {
        "term": "prognostic score",
        "definition": "Another name for a disease risk score - a single number summarizing baseline outcome risk (prognosis)."
      },
      {
        "term": "outcome model",
        "definition": "A statistical model (often logistic or survival) that predicts the study outcome from baseline characteristics."
      },
      {
        "term": "baseline risk",
        "definition": "How likely the outcome is for a patient given their characteristics at the start, ignoring the treatment under study."
      },
      {
        "term": "reference population",
        "definition": "The group of untreated (unexposed) or historical patients in which the disease risk score model is fit."
      },
      {
        "term": "stratification",
        "definition": "Splitting patients into groups that share a similar score, then comparing treated and untreated patients within each group."
      }
    ],
    "worked_example": {
      "scenario": "A drug launches and only a handful of patients take it in its first year, but the outcome we care about (a heart attack, MI) is common. A propensity score would struggle because exposure is so rare, so we use a disease risk score instead. We fit a model that predicts MI from baseline characteristics among the untreated patients, give every patient a baseline-risk score, then split patients into a high-risk and a low-risk band and compare treated versus untreated MI rates within each band. We have 12 patients: 6 treated and 6 untreated.\n",
      "dataset": {
        "caption": "One row per patient. drs is the model-predicted baseline MI risk; mi_event is 1 if the patient had an MI in follow-up.",
        "columns": [
          "person_id",
          "exposed",
          "diabetes",
          "prior_mi",
          "drs",
          "mi_event"
        ],
        "rows": [
          [
            1,
            1,
            1,
            1,
            0.35,
            1
          ],
          [
            2,
            1,
            1,
            1,
            0.35,
            1
          ],
          [
            3,
            1,
            1,
            1,
            0.35,
            0
          ],
          [
            4,
            1,
            1,
            1,
            0.35,
            0
          ],
          [
            5,
            0,
            1,
            1,
            0.35,
            1
          ],
          [
            6,
            0,
            1,
            1,
            0.35,
            0
          ],
          [
            7,
            1,
            0,
            0,
            0.1,
            0
          ],
          [
            8,
            1,
            0,
            0,
            0.1,
            0
          ],
          [
            9,
            0,
            0,
            0,
            0.1,
            0
          ],
          [
            10,
            0,
            0,
            0,
            0.1,
            0
          ],
          [
            11,
            0,
            0,
            0,
            0.1,
            0
          ],
          [
            12,
            0,
            0,
            0,
            0.1,
            0
          ]
        ]
      },
      "steps": [
        "The outcome model fit among the untreated gives each patient a baseline MI risk - a patient with diabetes and a prior MI scores higher (about 0.35) than a patient with neither (about 0.10); these predicted risks are the drs column.",
        "Crude (unadjusted) comparison ignoring the score - treated MI risk = 2/6 = 0.33, untreated MI risk = 1/6 = 0.17, so the crude risk difference = 0.33 - 0.17 = 0.16, making the drug look harmful.",
        "Split on the DRS into a high-risk band (drs 0.35, patients 1-6) and a low-risk band (drs 0.10, patients 7-12).",
        "High-risk band - treated MI risk = 2/4 = 0.50, untreated MI risk = 1/2 = 0.50, so the within-band risk difference = 0.50 - 0.50 = 0.00.",
        "Low-risk band - treated MI risk = 0/2 = 0, untreated MI risk = 0/4 = 0, so the within-band risk difference is also 0.",
        "Pool the two bands weighted by their size (6 of 12 each, weight 0.5) - pooled DRS-adjusted risk difference = 0.5 * 0.00 + 0.5 * 0.00 = 0.00."
      ],
      "result": "The crude risk difference of 0.16 made the drug look harmful, but that was confounding - the treated patients were concentrated in the high baseline-risk band. After stratifying on the disease risk score, the pooled risk difference is 0.00, showing no effect once baseline prognosis is matched.",
      "timeline_spec": {
        "title": "Disease risk score fit in the pre-launch unexposed period, then applied to the launch-year cohort",
        "window": {
          "start": "2018-01-01",
          "end": "2021-12-31",
          "label": "Fit the DRS model in the historical unexposed period, apply it at launch"
        },
        "events": [
          {
            "label": "Fit DRS outcome model in unexposed cohort",
            "start": "2018-01-01",
            "length_days": 1095,
            "quantity": "3-year fit window"
          },
          {
            "label": "Drug launches; cohort entry and DRS scoring",
            "start": "2021-01-01",
            "length_days": 1,
            "quantity": "score everyone"
          },
          {
            "label": "Follow treated and untreated for MI",
            "start": "2021-01-01",
            "length_days": 365,
            "quantity": "12-month follow-up"
          }
        ],
        "spans": [
          {
            "kind": "unexposed",
            "start": "2018-01-01",
            "end": "2020-12-31",
            "label": "Historical unexposed period: baseline-risk model fit here"
          },
          {
            "kind": "followup",
            "start": "2021-01-01",
            "end": "2021-12-31",
            "label": "Both arms followed for MI; compare within DRS strata"
          }
        ],
        "result": {
          "label": "Crude RD 0.16 (confounded) collapses to DRS-stratified RD 0.00",
          "value": 0.0
        }
      }
    },
    "prerequisites": [
      "logistic-regression-for-binary-outcomes",
      "propensity-score-methods-psm-iptw",
      "dags-backdoor-criterion-drug-studies"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Unexposed-only (reference-population) DRS",
        "description": "Fit the outcome-prediction model among the unexposed comparator group only, then score everyone (exposed included). This is Hansen's original construction; because the exposed never enter the fit, the score cannot absorb a real exposure effect, but it assumes covariate-outcome relationships from the unexposed transport to the exposed.",
        "edge_cases": [
          "If the unexposed differ systematically from the exposed in case mix, the model is miscalibrated for the exposed and residual confounding returns.",
          "Sparse outcome events in the unexposed subset make the risk model unstable even when the outcome is common overall."
        ],
        "data_source_notes": "claims: build predictors from the pre-index look-back window only. ehr: labs and vitals sharply improve calibration of the unexposed risk model."
      },
      {
        "name": "Full-cohort DRS (exposure-term-included)",
        "description": "Fit the outcome model in the entire cohort with an exposure indicator in the model, then predict everyone's risk with the exposure term set to unexposed. More efficient (uses all events) but lets the model see the exposure effect, risking partial absorption of a true effect into the score.",
        "edge_cases": [
          "With a strong exposure effect, the score can bias the exposure contrast toward the null (overfitting-to-the-effect).",
          "The exposure-outcome relationship is constrained to the model's parametric form, which may be wrong."
        ],
        "data_source_notes": "claims/ehr: keep the exposure term out of the predicted score by zeroing it at prediction; document that the fit used exposed outcomes."
      },
      {
        "name": "Historical-period DRS (pre-launch / pre-policy)",
        "description": "Fit the outcome model in a historical cohort from before the exposure existed (or before a formulary or guideline change), then carry it forward to score the current cohort. The key enabler for new market entrants with no exposure history to model on the propensity side.",
        "edge_cases": [
          "Calendar drift - secular changes in coding, treatment, and baseline risk - miscalibrates a historical score for the present cohort.",
          "Diagnostic-code or formulary changes between eras break predictor definitions; re-map codes before applying."
        ],
        "data_source_notes": "claims: a multi-year pre-launch window gives ample events for a common outcome. linked: re-validate calibration of the historical model in a recent slice before trusting it."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "propensity-score-methods-psm-iptw",
        "pros_of_this": "Estimable when exposure is rare (no need to model who gets the drug) and when a newly launched drug has no exposure history; one outcome-side score can be reused across multiple exposure contrasts that share the same outcome.",
        "cons_of_this": "Requires the outcome to be common enough to fit a stable risk model and depends entirely on correct outcome-model specification; you cannot diagnose residual confounding by checking covariate balance the way you can with a propensity score.",
        "when_to_prefer": "Prefer a DRS when the exposure is rare and the outcome is common, or when there is no exposure history to build a propensity model; prefer a propensity score when the exposure is common and the outcome is rare."
      },
      {
        "compared_to": "high-dimensional-propensity-score-hdps-rwe",
        "pros_of_this": "Summarizes confounding on the outcome side with a single interpretable baseline-risk number, avoiding the need to model a rare exposure even after proxy expansion.",
        "cons_of_this": "Inherits the rare-outcome weakness (few events to fit the risk model) and the proxy-selection machinery must be pointed at the outcome model, which is more sensitive to misspecification than balance-driven hdPS.",
        "when_to_prefer": "Use a high-dimensional DRS when exposure is rare and events are plentiful; use hdPS when the exposure is common and events are scarce."
      },
      {
        "compared_to": "logistic-regression-for-binary-outcomes",
        "pros_of_this": "Separates the confounder model from the effect estimate (fit the risk model first, inspect DRS overlap, then condition the exposure contrast on one score), which is more parsimonious than cramming dozens of confounders plus exposure into one regression when covariates outnumber events.",
        "cons_of_this": "Adds a two-stage construction and its own assumptions (fitting population, transportability) on top of an ordinary outcome regression; when correctly specified the two give very similar answers.",
        "when_to_prefer": "Prefer the DRS framing when confounders greatly outnumber exposed patients or you want a lockable, inspectable confounder summary; a single multivariable regression suffices when data are ample and balanced."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Build the outcome-risk model from baseline diagnoses, procedures, dispensings, demographics, and prior utilization measured strictly in the pre-index look-back window. Confirm the outcome is common enough for a stable fit; a claims-only DRS is limited by the absence of labs and vitals, so calibration is the main risk.",
      "ehr": "Add labs, vitals, BMI, and smoking to the risk model - these usually make the EHR DRS far better calibrated than a claims DRS, which matters because the method lives or dies by outcome-model quality. Watch for missing predictors and informative measurement.",
      "registry": "Use adjudicated outcomes and a clean reference cohort to fit the risk model; registries are a strong source for the historical or unexposed fitting population, but link to claims for complete follow-up.",
      "linked": "The ideal substrate - EHR for the predictors that calibrate the risk model, claims for complete capture of the common outcome and follow-up time. Re-check calibration of any historical model in a recent slice before applying."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\nimport pandas as pd\nimport statsmodels.api as sm\n\ndef disease_risk_score(df, outcome, exposure, covariates, n_strata=5):\n    # 1) Fit the outcome model among the UNEXPOSED only (no exposure term) -> baseline risk model.\n    ref = df[df[exposure] == 0]\n    drs_model = sm.Logit(ref[outcome], sm.add_constant(ref[covariates])).fit(disp=0)\n\n    # 2) Score EVERYONE (exposed included) with that model: DRS = predicted baseline risk.\n    X_all = sm.add_constant(df[covariates], has_constant=\"add\")\n    df = df.assign(drs=np.asarray(drs_model.predict(X_all)))\n\n    # 3) Stratify on the DRS and take a stratum-weighted exposed-vs-unexposed risk difference.\n    df = df.assign(stratum=pd.qcut(df[\"drs\"], n_strata, labels=False, duplicates=\"drop\"))\n    rows = []\n    for s, g in df.groupby(\"stratum\"):\n        e = g.loc[g[exposure] == 1, outcome]\n        u = g.loc[g[exposure] == 0, outcome]\n        if len(e) and len(u):                       # both arms present (DRS positivity)\n            rows.append({\"stratum\": s, \"n\": len(g), \"rd\": e.mean() - u.mean()})\n    strata = pd.DataFrame(rows)\n    w = strata[\"n\"] / strata[\"n\"].sum()\n    pooled_rd = float((w * strata[\"rd\"]).sum())\n    return pooled_rd, strata, drs_model",
        "description": "Fit a disease risk score (baseline outcome risk) among the UNEXPOSED, score everyone with it, stratify on the\nDRS, and return a stratum-size-weighted risk difference between exposed and unexposed. Required input (one row\nper patient, covariates measured before the index date):\n  df : person_id, exposed (0/1), <covariates...>, outcome (0/1)",
        "dependencies": [
          "pandas",
          "numpy",
          "statsmodels"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "disease_risk_score <- function(df, outcome, exposure, covariates, n_strata = 5) {\n  # 1) Fit the outcome (baseline risk) model among the UNEXPOSED only.\n  ref  <- df[df[[exposure]] == 0, ]\n  form <- as.formula(paste(outcome, \"~\", paste(covariates, collapse = \" + \")))\n  drs_model <- glm(form, data = ref, family = binomial())\n\n  # 2) Score EVERYONE with that model: DRS = predicted baseline risk.\n  df$drs <- predict(drs_model, newdata = df, type = \"response\")\n\n  # 3) Stratify on the DRS (quantile cut) and pool within-stratum risk differences.\n  brks <- quantile(df$drs, probs = seq(0, 1, length.out = n_strata + 1), na.rm = TRUE)\n  df$stratum <- cut(df$drs, breaks = unique(brks), include.lowest = TRUE, labels = FALSE)\n  agg <- do.call(rbind, lapply(split(df, df$stratum), function(g) {\n    e <- g[g[[exposure]] == 1, outcome]\n    u <- g[g[[exposure]] == 0, outcome]\n    if (length(e) > 0 && length(u) > 0)            # both arms present (DRS positivity)\n      data.frame(stratum = g$stratum[1], n = nrow(g), rd = mean(e) - mean(u))\n    else NULL\n  }))\n  w <- agg$n / sum(agg$n)\n  list(pooled_rd = sum(w * agg$rd), strata = agg, model = drs_model)\n}",
        "description": "Same disease-risk-score workflow in base R: fit the baseline-risk logistic model among the unexposed, predict\neach patient's DRS, cut into strata on the score, and pool the within-stratum risk differences by stratum size.\nInput:\n  df : data.frame with exposed (0/1), covariates, outcome (0/1)",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* 1) Fit the DRS (baseline outcome risk) model among the UNEXPOSED only and store it. */\nproc logistic data=work.cohort outmodel=drs_fit noprint;\n  where exposed = 0;\n  model mi_event(event='1') = diabetes prior_mi age;\nrun;\n\n/* 2) Score EVERYONE with that model -> drs = predicted baseline risk. */\nproc logistic inmodel=drs_fit noprint;\n  score data=work.cohort out=scored(rename=(p_1=drs));\nrun;\n\n/* 3) Stratify on the DRS (quintiles). */\nproc rank data=scored out=ranked groups=5;\n  var drs;\n  ranks drs_stratum;\nrun;\n\n/* 4) Within-stratum exposed/unexposed risks and risk difference, then pool by stratum size. */\nproc sql;\n  create table strata as\n  select drs_stratum,\n         count(*) as n,\n         mean(case when exposed=1 then mi_event end) as risk_exp,\n         mean(case when exposed=0 then mi_event end) as risk_unexp,\n         calculated risk_exp - calculated risk_unexp as rd\n  from ranked\n  group by drs_stratum\n  having nmiss(calculated risk_exp, calculated risk_unexp) = 0;   /* both arms present */\n\n  select sum(n*rd) / sum(n) as pooled_rd\n  from strata;\nquit;",
        "description": "Disease-risk-score workflow in SAS. PROC LOGISTIC fits the baseline-risk model among the unexposed and stores\nit; a second PROC LOGISTIC scores the full cohort; PROC RANK forms DRS strata; PROC SQL computes within-stratum\nexposed/unexposed risks and the stratum-size-weighted pooled risk difference. Input:\n  work.cohort : person_id, exposed (0/1), diabetes, prior_mi, age, mi_event (0/1)",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  A[Whole cohort<br/>exposed + unexposed<br/>baseline covariates] --> B[Take UNEXPOSED only]\n  B --> C[Fit outcome model<br/>risk of event given covariates<br/>no exposure term]\n  C --> D[Score EVERYONE<br/>DRS = predicted baseline risk]\n  D --> E{Both arms present<br/>across the DRS range?}\n  E -- No --> F[DRS positivity problem<br/>exposed concentrated in one risk band]\n  E -- Yes --> G[Stratify / match / adjust on DRS]\n  G --> H[Compare exposed vs unexposed<br/>within DRS strata, then pool]",
        "caption": "The disease-risk-score workflow - fit the outcome (baseline-risk) model in the unexposed, score every patient, check that exposed and unexposed overlap across the DRS range, then compare within strata. The score summarizes many confounders into one number.",
        "alt_text": "Flowchart starting from the whole cohort, restricting to the unexposed to fit a baseline-risk outcome model, scoring everyone to get a disease risk score, checking overlap, and comparing exposed versus unexposed within DRS strata.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  subgraph PS[Propensity score]\n    P1[Models the EXPOSURE<br/>P treatment given covariates] --> P2[Hard if exposure is RARE<br/>or drug just launched]\n  end\n  subgraph DRS[Disease risk score]\n    D1[Models the OUTCOME<br/>baseline risk given covariates] --> D2[Hard if outcome is RARE<br/>thrives if outcome COMMON]\n  end\n  P2 --> Pick{Rare exposure +<br/>common outcome?}\n  D2 --> Pick\n  Pick -- Yes --> Use[Prefer DRS]\n  Pick -- No, rare outcome --> UsePS[Prefer propensity score]",
        "caption": "Choosing between a propensity score and a disease risk score. The two are mirror images - a propensity score models who gets exposed, a DRS models who gets the outcome - so the right tool depends on whether the exposure or the outcome is the rarer, harder-to-model quantity.",
        "alt_text": "Side-by-side comparison showing the propensity score models the exposure and struggles with rare exposure, while the disease risk score models the outcome and struggles with rare outcomes, with a decision node preferring the disease risk score when exposure is rare and the outcome is common.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": "disease-risk-score-rwe-timeline.svg",
        "mermaid": null,
        "caption": "A disease risk score fit in the pre-launch unexposed period and applied at launch; the crude risk difference of 0.16 (confounded) collapses to a DRS-stratified risk difference of 0.00 once baseline prognosis is matched.",
        "alt_text": "Timeline showing a three-year historical unexposed period where the baseline-risk model is fit, the drug launch and scoring point, and a 12-month follow-up period in which treated and untreated patients are compared within disease risk score strata.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "alternative_to",
        "target_slug": "propensity-score-methods-psm-iptw",
        "notes": "A disease risk score is the outcome-side analogue of a propensity score; prefer the DRS when exposure is rare and the outcome is common, or when a newly launched drug has no exposure history to model on the propensity side."
      },
      {
        "relation_type": "see_also",
        "target_slug": "high-dimensional-propensity-score-hdps-rwe",
        "notes": "The same data-adaptive proxy-confounder selection used for hdPS can be applied to the outcome model to build a high-dimensional disease risk score; they trade off on whether exposure or outcome is the rarer quantity."
      },
      {
        "relation_type": "used_with",
        "target_slug": "active-comparator-new-user",
        "notes": "A DRS is typically built inside an active-comparator new-user design - covariates are measured in the pre-index look-back window so the baseline-risk model is free of post-index information."
      },
      {
        "relation_type": "requires",
        "target_slug": "dags-backdoor-criterion-drug-studies",
        "notes": "The covariates entering the outcome model must close the backdoor paths between exposure and outcome; a DRS only controls the confounding captured by the variables fed into the risk model."
      },
      {
        "relation_type": "see_also",
        "target_slug": "logistic-regression-for-binary-outcomes",
        "notes": "The DRS is usually the predicted probability from a logistic (or Cox) outcome model; a DRS is a two-stage reframing of multivariable outcome regression that separates the confounder model from the effect estimate."
      },
      {
        "relation_type": "see_also",
        "target_slug": "baseline-characteristics-and-covariate-balance-rwe",
        "notes": "Because a DRS is not built to balance covariates across exposure groups, you must inspect covariate balance and DRS overlap explicitly rather than assume the score balanced the confounders."
      },
      {
        "relation_type": "see_also",
        "target_slug": "empirical-calibration-negative-controls-rwe",
        "notes": "Negative-control calibration helps detect the residual confounding a DRS leaves behind when the outcome model is misspecified or fit in a non-transportable population."
      }
    ],
    "aliases": [
      "disease risk score",
      "DRS",
      "prognostic score",
      "prognostic analogue of the propensity score",
      "multivariate confounder score",
      "disease risk index"
    ],
    "complexity": "advanced",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "distributional-cost-effectiveness-analysis-rwe",
    "name": "Distributional Cost-Effectiveness Analysis",
    "short_definition": "An equity-informative extension of cost-effectiveness analysis that estimates how a health technology's net health benefit is distributed across socially relevant subgroups (by socioeconomic status, race/ethnicity, geography), then summarizes the resulting health distribution with a social welfare function that trades total health against its equality using an inequality-aversion (Atkinson) parameter - making the equity-efficiency trade-off explicit rather than averaging it away.",
    "long_description": "Standard cost-effectiveness analysis (CEA) and cost-utility analysis (CUA) ask one question: does the\nintervention buy more health (QALYs) than the next-best use of the same money? They answer it for the *average*\npatient and are deliberately blind to *who* gains and *who* loses. **Distributional cost-effectiveness analysis\n(DCEA)** keeps the efficiency question but adds a second one society actually cares about: **how is that net health\nbenefit distributed across groups we have equity concerns about** - the most-deprived vs the least-deprived fifth of\nan area-deprivation index, Black vs White patients, rural vs urban populations - and **is the program worth adopting\nonce we weight gains to the worse-off more heavily than gains to the better-off?** DCEA does this by (1) estimating\nthe *baseline* distribution of lifetime health (quality-adjusted life expectancy) across subgroups, (2) estimating\nthe *net health benefit* each subgroup gets from the intervention - health gained minus health displaced elsewhere by\nthe opportunity cost of spending - (3) adding the net benefit onto the baseline to get the *post-intervention*\ndistribution, and (4) collapsing each distribution to a single number with a **social welfare function** that embeds\nan **inequality-aversion parameter** (the Atkinson epsilon). The decision then turns on the *equity-weighted* health,\nnot the unweighted total.\n\n**Core conceptual machinery.** Three pieces must be pre-specified. (1) *The equity-relevant subgroups and the\nbaseline distribution*: you cannot weight a distribution you have not measured, so DCEA needs subgroup-specific\nbaseline health (e.g., quality-adjusted life expectancy by deprivation quintile). 
(2) *The opportunity cost and its\nincidence*: a fixed budget means funding a new program displaces health somewhere else; that displaced health\n(cost divided by the cost-effectiveness threshold k) must be *subtracted*, and crucially the displacement may fall on\ndifferent groups than the gains - a program that helps the deprived but is financed by cuts that also hit the\ndeprived can be equity-neutral or worse. (3) *The social welfare function and the inequality-aversion parameter*: the\n**Atkinson equally distributed equivalent (EDE)** is the level of health which, if everyone had it equally, would be\njudged exactly as good as the actual unequal distribution. As epsilon rises from 0 (pure efficiency, EDE = mean) the\nEDE bends below the mean and the analysis pays more for equality. The headline output is the **equity-weighted net\nbenefit** = change in EDE, set against the **unweighted net benefit** = change in the mean. When the two point the\nsame way the decision is easy; when they conflict you are on the **equity-efficiency trade-off plane** and the\ndecision is a value judgment DCEA makes transparent rather than hides.\n\n**Two flavors that differ in data demand.** *Equity impact analysis* maps where the gains and losses land - the\npure distributional picture - without committing to a single inequality-aversion number; it answers who benefits.\n*Equity-weighted (full) DCEA* goes the extra step of running the social welfare function at one or more epsilon\nvalues to produce an adopt/reject verdict. You can report the first without the second, and reasonable analysts\noften do, because choosing epsilon is contestable.\n\n**Pros, cons, and trade-offs** (specific and comparative).\n- **vs cost-effectiveness / cost-utility analysis (the parent):** DCEA *contains* CEA as the special case epsilon = 0\n  (no inequality aversion - the EDE collapses to the mean and you are back to maximizing total QALYs). 
Its value-add\n  is making the distribution and the equity weighting explicit, so a program that improves average health while\n  widening the health gap can be flagged rather than rubber-stamped. **Prefer plain CEA/CUA** when there is no\n  credible equity concern, no subgroup heterogeneity in effect or baseline health, or no data to characterize the\n  distribution; the extra machinery then adds assumptions without changing the answer.\n- **vs subgroup CEA (running a separate ICER per group):** Subgroup CEA tells you the cost-effectiveness *within*\n  each group but never combines them into a single equity-aware verdict and ignores the opportunity-cost incidence\n  across groups. DCEA's social welfare step is exactly the aggregation subgroup CEA refuses to do. **Prefer subgroup\n  CEA** only as a descriptive precursor.\n- **vs the aggregate DCEA shortcut:** Full DCEA needs subgroup-specific effectiveness, costs, and baseline health -\n  often unavailable. The **aggregate DCEA** shortcut combines the overall (non-distributional) cost-effectiveness\n  result with *external* data on how the disease and the opportunity cost are socially distributed to approximate the\n  equity impact at far lower data cost. 
**Prefer aggregate DCEA** for early or data-poor appraisals; **prefer full\n  DCEA** when subgroup RWD can support patient-level distributional estimates.\n\n**When to use.** When equity is an explicit objective of the decision-maker (many HTA bodies now ask for it); when\nthe intervention plausibly has different effectiveness, uptake, adherence, or baseline risk across deprivation,\nrace/ethnicity, or geography; when the *financing* of the program (and thus its opportunity cost) lands\ndisproportionately on a group; when two options have similar ICERs but different distributional footprints and the\ntie should be broken on equity; and when real-world data let you estimate subgroup-specific effectiveness and costs\nthat a trial's narrow population cannot.\n\n**When NOT to use - and when it is actively misleading.**\n- **No real distributional signal.** If baseline health, effect, and costs are genuinely homogeneous across groups,\n  DCEA's EDE moves in lockstep with the mean and the equity weighting is decorative - it manufactures an appearance\n  of equity analysis while changing nothing. Do not dress a standard CUA in DCEA clothing to satisfy a checkbox.\n- **Subgroups built from biased real-world data.** DCEA inherits every confounding and measurement problem of the\n  underlying RWE: if the subgroup-specific effectiveness estimates are confounded (deprived patients differ in\n  severity, adherence, and competing risks), the distribution you weight is wrong, and a confident equity verdict\n  rests on a biased input. Race/ethnicity and deprivation are often mismeasured or missing in claims, so the\n  subgroups themselves can be misclassified.\n- **Ignoring the opportunity-cost incidence.** Counting only *who gains* and assuming the displaced health is spread\n  evenly (or ignoring it) systematically flatters programs financed by cutting services the deprived rely on. 
The\n  net-benefit subtraction and its group incidence are not optional - omitting them is the most common way DCEA\n  overstates pro-equity impact.\n- **Treating epsilon as a fact.** The inequality-aversion parameter is a social value, not an estimate. Reporting a\n  single epsilon as if it were measured, with no sensitivity analysis across the plausible range, hides the very\n  value judgment DCEA exists to surface. Always sweep epsilon and show where the verdict flips.\n\n**Data-source operational depth.** DCEA's distinctive demand is *subgroup-specific everything*: effectiveness, costs,\nbaseline health, and a credible equity stratifier, ideally all on the same population. Real-world data are where\nthose subgroup estimates can actually be built (trials rarely carry deprivation or race with enough power), but the\nequity stratifier and the subgroup effect estimates are exactly the fields most fragile in routine data. The\naggregate-DCEA shortcut exists precisely because the full subgroup matrix is so often unavailable.\n\n**Worked intuition.** Split a population into a more-deprived and a less-deprived half with baseline quality-adjusted\nlife expectancies of 45 and 90 QALYs. A program delivers 17 QALYs of health to the deprived half and 2 to the\naffluent half; financing it displaces 4 QALYs of health (cost / threshold), falling 2 on each half. Net benefit is\nthus +15 to the deprived half and 0 to the affluent half. Average health rises from 67.5 to 75 QALYs (+7.5, the\nordinary CEA signal), but because the gains are pro-poor, the Atkinson EDE (epsilon = 2) rises from 60 to 72 - an\nequity-weighted net benefit of +12, larger than the +7.5 unweighted gain. A more efficient alternative that pushed\naverage health higher while routing the gains to the already-healthy could show the opposite: a bigger mean uplift\nbut a smaller (or negative) EDE change. 
That divergence is the entire point of DCEA.\n\n**Interpreting the output**\n\nConsider the worked example: the program produces an unweighted average gain of 7.5 QALYs but an\nequity-weighted net benefit of +12 EDE-QALYs, and the Atkinson inequality index falls from 0.111 to 0.04.\n\n*Formal interpretation.* The EDE-QALY gain of 12 is the change in the equally distributed equivalent\n(EDE) health, where the EDE is the level of equal health that the social welfare function (with\ninequality-aversion parameter epsilon = 2) regards as exactly as good as the actual unequal\ndistribution; here it rises from 60 to 72. The gain exceeds the average gain of 7.5 because the\nprogram's benefits are concentrated in the worse-off group, and the Atkinson social welfare function\npenalizes inequality, so reducing it earns an equity premium of 4.5. The Atkinson index of 0.111\nbefore the intervention means that 11.1% of average health could be given up in exchange for a\nperfectly equal distribution with no loss of social welfare; after the program that share falls to\n4%. These numbers are conditional on the choice of epsilon: a higher inequality-aversion parameter\nwould widen the gap between 12 and 7.5; a lower one would narrow it toward the ordinary CEA signal.\nThe choice of epsilon is a value judgment that must be made explicit and varied in sensitivity\nanalysis.\n\n*Practical interpretation.* The equity-efficiency trade-off DCEA reveals is that a more efficient\nalternative routing gains to the already-healthy could produce a larger average QALY gain but a\nsmaller EDE improvement. Report both the unweighted QALY gain and the equity-weighted EDE gain\nside by side, with at minimum two epsilon values (e.g., 1 and 2), so decision-makers can see\nhow much of the apparent advantage depends on the inequality-aversion assumption. A DCEA verdict\nof \"favorable\" can reverse to \"unfavorable\" when epsilon changes; that instability is the\noutput, not a limitation to hide.",
    "primary_category": "Economic_Evaluation",
    "tags": [
      "distributional-cost-effectiveness",
      "health-equity",
      "equity-efficiency-tradeoff",
      "atkinson-index",
      "inequality-aversion",
      "social-welfare-function",
      "equally-distributed-equivalent",
      "subgroup-analysis",
      "opportunity-cost",
      "health-disparities",
      "qaly"
    ],
    "applies_to_study_types": [
      "cost_effectiveness",
      "cost_utility",
      "claims_analysis",
      "ehr_study",
      "registry_linkage",
      "comparative_effectiveness"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1177/0272989X15583266",
        "url": "https://doi.org/10.1177/0272989X15583266",
        "citation_text": "Asaria M, Griffin S, Cookson R. Distributional cost-effectiveness analysis: a tutorial. Medical Decision Making. 2016;36(1):8-19.",
        "year": 2016,
        "authors_short": "Asaria et al.",
        "notes": "The foundational step-by-step tutorial that defines DCEA - estimating the baseline health distribution, the subgroup net health benefits, and the Atkinson social welfare function with an inequality-aversion parameter - and is the methodological anchor for this concept."
      },
      {
        "role": "explain",
        "doi": "10.1016/j.jval.2016.11.027",
        "url": "https://doi.org/10.1016/j.jval.2016.11.027",
        "citation_text": "Cookson R, Mirelman AJ, Griffin S, Asaria M, Dawkins B, Norheim OF, Verguet S, Culyer AJ. Using cost-effectiveness analysis to address health equity concerns. Value in Health. 2017;20(2):206-212.",
        "year": 2017,
        "authors_short": "Cookson et al.",
        "notes": "Positions DCEA within the broader family of equity-informative economic evaluation, explains the equity impact vs equity-weighted (trade-off) distinction, and the role of the social welfare function in surfacing the equity-efficiency trade-off."
      },
      {
        "role": "demonstrate",
        "doi": "10.1016/j.jval.2019.03.006",
        "url": "https://doi.org/10.1016/j.jval.2019.03.006",
        "citation_text": "Love-Koh J, Cookson R, Gutacker N, Patton T, Griffin S. Aggregate distributional cost-effectiveness analysis of health technologies. Value in Health. 2019;22(5):518-526.",
        "year": 2019,
        "authors_short": "Love-Koh et al.",
        "notes": "Develops the aggregate DCEA shortcut - combining an overall cost-effectiveness result with external data on the social distribution of disease and opportunity cost - for appraisals that cannot support full patient-level subgroup estimation."
      },
      {
        "role": "use",
        "doi": "10.1016/j.jval.2022.06.011",
        "url": "https://doi.org/10.1016/j.jval.2022.06.011",
        "citation_text": "Meunier A, Longworth L, Kowal S, Ramagopalan S, Love-Koh J, Griffin S. Distributional cost-effectiveness analysis of health technologies: data requirements and challenges. Value in Health. 2023;26(1):60-63.",
        "year": 2023,
        "authors_short": "Meunier et al.",
        "notes": "Catalogs the concrete data demands of DCEA - subgroup-specific effectiveness, costs, baseline health, and a credible equity stratifier - and the practical challenges of meeting them from routine real-world data sources."
      }
    ],
    "plain_language_summary": "Distributional cost-effectiveness analysis (DCEA) asks not just whether a health program is good value on average, but who gains and who loses - splitting the population into groups we have fairness concerns about (such as poorer vs richer areas, or by race) and tracking how much health each group ends up with. It adds up health the program creates, subtracts the health lost elsewhere because the money had to come from somewhere, and then scores the result with a dial called inequality aversion: turn the dial up and gains to the worse-off count for more. The payoff is a single honest picture of the trade-off between making total health bigger and making it more equal, instead of averaging that tension away.\n",
    "key_terms": [
      {
        "term": "QALY",
        "definition": "A quality-adjusted life year - one year of life in full health, used as the common unit for the health a program creates or displaces."
      },
      {
        "term": "equity subgroup",
        "definition": "A slice of the population defined by a fairness-relevant characteristic such as area deprivation, race or ethnicity, or rural versus urban location."
      },
      {
        "term": "net health benefit",
        "definition": "For each subgroup, the health a program adds minus the health lost elsewhere because the budget that funded it could no longer pay for other care."
      },
      {
        "term": "opportunity cost",
        "definition": "The health given up somewhere else when money is spent on this program - estimated here as the program's cost divided by the cost-per-QALY threshold."
      },
      {
        "term": "inequality aversion (Atkinson epsilon)",
        "definition": "A dial that says how much extra a society values health gains to the worse-off; at zero it ignores the distribution, and higher values weight the worst-off groups more heavily."
      },
      {
        "term": "equally distributed equivalent (EDE)",
        "definition": "The single equal level of health that would be judged exactly as good as the actual unequal spread of health across groups - lower than the average when health is unequal."
      }
    ],
    "worked_example": {
      "scenario": "A health system is deciding whether to fund a new program and wants an equity-aware answer, not just an average one. It splits the population into two equal halves - a more-deprived half and a less-deprived half - whose baseline quality-adjusted life expectancies are 45 and 90 QALYs. The program delivers health to both halves but is paid for out of a fixed budget, so some health is displaced elsewhere. We compute the ordinary average gain and the equity-weighted gain (using an inequality-aversion dial of epsilon = 2) and compare them.\n",
      "dataset": {
        "caption": "The subgroup inputs an analyst would assemble for a two-group DCEA (one row per equity subgroup).",
        "columns": [
          "subgroup",
          "population_share",
          "baseline_qaly",
          "health_gained_qaly",
          "opp_cost_displaced_qaly"
        ],
        "rows": [
          [
            "more_deprived",
            0.5,
            45,
            17,
            2
          ],
          [
            "less_deprived",
            0.5,
            90,
            2,
            2
          ]
        ]
      },
      "steps": [
        "Find the opportunity-cost displacement from the budget constraint. The program costs $80,000 and the threshold is $20,000 per QALY, so health displaced elsewhere = $80,000 / $20,000 = 4 QALYs, split 2 and 2 across the halves.",
        "Net health benefit per subgroup = health gained minus displacement. More-deprived = 17 - 2 = 15 QALYs; less-deprived = 2 - 2 = 0 QALYs.",
        "Post-intervention health = baseline plus net benefit. More-deprived = 45 + 15 = 60; less-deprived = 90 + 0 = 90.",
        "Ordinary CEA signal = change in average health. Baseline mean = (45 + 90) / 2 = 67.5; post mean = (60 + 90) / 2 = 75; unweighted net benefit = 75 - 67.5 = 7.5 QALYs.",
        "Equity-weighting uses the Atkinson equally distributed equivalent (EDE); at epsilon = 2 the EDE is the harmonic mean = 2 divided by the sum of reciprocals. Baseline EDE = 2 / (1/45 + 1/90) = 60; post EDE = 2 / (1/60 + 1/90) = 72.",
        "Equity-weighted net benefit = change in EDE = 72 - 60 = 12 QALYs, which is larger than the 7.5 unweighted gain because the program's health went mostly to the worse-off half.",
        "Inequality also fell. Atkinson index = 1 minus EDE over mean; before = 1 - 60 / 67.5 = 0.111, after = 1 - 72 / 75 = 0.04 - the health gap narrowed."
      ],
      "result": "The program raises average health by 7.5 QALYs but, because the gains are pro-poor, its equity-weighted net benefit is +12 EDE-QALYs (the equity premium is 12 - 7.5 = 4.5), and the Atkinson inequality index falls from 0.111 to 0.04. A more efficient alternative that routed gains to the already-healthy half could show a bigger mean uplift but a smaller EDE change - the equity-efficiency trade-off DCEA exists to reveal.",
      "timeline_spec": {
        "title": "A one-year DCEA appraisal - baseline distribution to equity-weighted decision",
        "window": {
          "start": "2024-01-01",
          "end": "2024-12-01",
          "label": "DCEA appraisal from baseline distribution to equity-weighted verdict"
        },
        "events": [
          {
            "label": "Baseline subgroup health",
            "start": "2024-01-01",
            "length_days": 60,
            "quantity": "deprived 45, affluent 90 QALYs"
          },
          {
            "label": "Subgroup health gained",
            "start": "2024-03-01",
            "length_days": 60,
            "quantity": "deprived +17, affluent +2 QALYs"
          },
          {
            "label": "Opportunity cost displaced",
            "start": "2024-05-01",
            "length_days": 30,
            "quantity": "4 QALYs forgone, 2 each"
          },
          {
            "label": "Net health benefit",
            "start": "2024-06-01",
            "length_days": 60,
            "quantity": "deprived +15, affluent 0"
          },
          {
            "label": "Atkinson EDE computed",
            "start": "2024-08-01",
            "length_days": 60,
            "quantity": "EDE 60 then 72"
          },
          {
            "label": "Equity-weighted decision",
            "start": "2024-10-01",
            "length_days": 52,
            "quantity": "EW net benefit +12 QALYs"
          }
        ],
        "spans": [
          {
            "kind": "exposed",
            "start": "2024-01-01",
            "end": "2024-06-30",
            "label": "Distribution and net-benefit construction"
          },
          {
            "kind": "followup",
            "start": "2024-07-01",
            "end": "2024-12-01",
            "label": "Equity weighting and decision"
          }
        ],
        "result": {
          "label": "Equity-weighted net benefit +12 EDE-QALYs vs +7.5 unweighted",
          "value": 12
        }
      }
    },
    "prerequisites": [
      "cost-effectiveness",
      "cost-utility",
      "qaly-utility-mapping-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Full DCEA with an Atkinson social welfare function",
        "description": "The complete method - estimate the baseline health distribution across equity subgroups, add subgroup-specific net health benefit (health gained minus opportunity-cost displacement), then evaluate the equally distributed equivalent (EDE) under an Atkinson inequality-aversion parameter epsilon, sweeping epsilon to show where the adopt/reject verdict flips. Produces both the equity-weighted net benefit (change in EDE) and the unweighted net benefit (change in the mean).",
        "edge_cases": [
          "Epsilon is a social value judgment, not an estimate - report results over a range (e.g., 0 to ~10) together with the tipping point, never a single value presented as fact.",
          "The opportunity-cost displacement may fall on different subgroups than the gains; if displaced services disproportionately serve worse-off groups, assuming even or zero displacement biases the verdict toward apparent pro-equity impact."
        ],
        "data_source_notes": "linked/registry: needed for subgroup-specific baseline health and effectiveness on one population. claims/ehr: supply subgroup costs and effect estimates but the equity stratifier (deprivation, race/ethnicity) is often missing or mismeasured."
      },
      {
        "name": "Aggregate DCEA (shortcut for data-poor appraisals)",
        "description": "Approximates the equity impact without patient-level subgroup estimation by combining the overall (non-distributional) cost-effectiveness result with external data on how the disease burden and the opportunity cost are socially distributed. Far lower data cost; suited to early HTA or settings lacking subgroup RWD.",
        "edge_cases": [
          "The external distributional weights are borrowed from population statistics, so a mismatch between the appraisal population and the reference population biases the imputed distribution.",
          "It cannot detect subgroup-specific effect modification the overall result averages away."
        ],
        "data_source_notes": "any: uses published/registry distributions of disease and opportunity cost. Does not require linked patient-level subgroup effectiveness, which is why it is the default when full DCEA data are absent."
      },
      {
        "name": "Equity impact analysis without inequality-aversion weighting",
        "description": "Reports only the distributional picture - where the net health benefit and the opportunity-cost displacement land across subgroups - without running a social welfare function or committing to an epsilon. Answers who gains and who loses, leaving the equity-efficiency value judgment to the decision-maker.",
        "edge_cases": [
          "Without a social welfare function there is no single adopt/reject verdict; two readers can draw opposite conclusions from the same distribution.",
          "Still requires the opportunity-cost incidence to avoid counting only gains."
        ],
        "data_source_notes": "claims/ehr/registry: the minimum is subgroup-specific net health benefit; no epsilon means no need to defend an inequality-aversion value, which is why some analysts stop here."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "cost-effectiveness",
        "pros_of_this": "Makes the distribution of health gains and losses across equity subgroups explicit and lets an inequality-aversion parameter trade total health against its equality, flagging programs that raise average health while widening the gap - which a single ICER cannot see.",
        "cons_of_this": "Requires subgroup-specific effectiveness, costs, and baseline health plus a contestable inequality-aversion value, adding assumptions and data demands that plain CEA avoids; with homogeneous subgroups it adds nothing.",
        "when_to_prefer": "When equity is an explicit decision objective and there is credible subgroup heterogeneity in effect, baseline health, or opportunity-cost incidence; use plain CEA when there is no distributional signal or no data to measure one."
      },
      {
        "compared_to": "cost-utility",
        "pros_of_this": "Keeps the QALY-based net-benefit logic of CUA but distributes the net benefit across socially relevant groups and evaluates an equity-weighted (equally distributed equivalent) outcome rather than an aggregate QALY total.",
        "cons_of_this": "Inherits all of CUA's utility-mapping and threshold assumptions and adds a social welfare function and an epsilon, so the verdict now also depends on a value judgment about inequality aversion.",
        "when_to_prefer": "When the same QALY framework should answer not just how much health but how it is shared; use ordinary CUA when the decision-maker has no equity objective."
      },
      {
        "compared_to": "icer-net-monetary-benefit-rwe",
        "pros_of_this": "Replaces a single population net monetary/health benefit with an equity-weighted net benefit (change in the Atkinson EDE), so the headline number already reflects the distribution rather than averaging it away.",
        "cons_of_this": "The equity-weighted net benefit is not a market quantity and depends on epsilon and the chosen subgroups, so it is less directly comparable across appraisals than a standard NMB at a fixed threshold.",
        "when_to_prefer": "When the decision should reward pro-equity distributions; use plain ICER/NMB when only aggregate value for money matters and a transparent, comparable single number is the priority."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Claims can supply subgroup-specific costs and event rates and, via beneficiary ZIP or county linked to an area-deprivation index, a geographic equity stratifier - but individual-level socioeconomic status and race/ethnicity are frequently absent or imputed, and baseline quality-adjusted life expectancy by subgroup is not in claims at all and must be borrowed from life tables and external utility data. Best for the cost and resource-use side of subgroup net benefit.",
      "ehr": "EHR can carry race/ethnicity, some social determinants, and clinical severity that sharpen subgroup definitions and effectiveness estimates, but coverage of the equity stratifier is uneven and often missing-not-at-random, and EHR rarely spans the full population needed for a baseline health distribution. Strong for subgroup effectiveness, weak for population-level baseline.",
      "registry": "Disease and population registries can provide adjudicated subgroup outcomes and sometimes linked socioeconomic indicators, giving the cleanest subgroup-specific effectiveness; baseline health distributions still typically come from national life tables and survey utilities linked in.",
      "linked": "The ideal substrate for full DCEA - claims for subgroup costs and resource use, EHR/registry for subgroup effectiveness and the equity stratifier, and linked survey/life-table data for subgroup baseline quality-adjusted life expectancy and the opportunity-cost incidence; reconcile differing population denominators before combining."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import math\nfrom typing import Sequence\n\ndef atkinson_ede(health: Sequence[float], shares: Sequence[float], epsilon: float) -> float:\n    # Equally distributed equivalent: the equal level of health judged as good as the actual distribution.\n    # epsilon = 0 -> mean (pure efficiency); larger epsilon weights the worse-off more.\n    # Health levels must be strictly positive (required by the log and the negative powers).\n    tot = sum(shares)\n    w = [s / tot for s in shares]\n    if abs(epsilon - 1.0) < 1e-9:                      # limiting case: weighted geometric mean\n        return math.exp(sum(wi * math.log(hi) for wi, hi in zip(w, health)))\n    s = sum(wi * hi ** (1.0 - epsilon) for wi, hi in zip(w, health))\n    return s ** (1.0 / (1.0 - epsilon))\n\ndef atkinson_index(health, shares, epsilon):\n    mean = sum(s * h for s, h in zip(shares, health)) / sum(shares)\n    return 1.0 - atkinson_ede(health, shares, epsilon) / mean\n\ndef dcea(baseline, shares, health_gained, opp_cost_displaced, epsilon):\n    # Net health benefit per subgroup = health gained - opportunity-cost displacement.\n    nhb  = [g - d for g, d in zip(health_gained, opp_cost_displaced)]\n    post = [b + n for b, n in zip(baseline, nhb)]\n    mean0 = sum(s * h for s, h in zip(shares, baseline)) / sum(shares)\n    mean1 = sum(s * h for s, h in zip(shares, post)) / sum(shares)\n    ede0  = atkinson_ede(baseline, shares, epsilon)\n    ede1  = atkinson_ede(post, shares, epsilon)\n    return {\n        \"net_health_benefit\": nhb,\n        \"unweighted_net_benefit\": round(mean1 - mean0, 4),   # ordinary CEA signal\n        \"equity_weighted_net_benefit\": round(ede1 - ede0, 4),\n        \"atkinson_index_before\": round(atkinson_index(baseline, shares, epsilon), 4),\n        \"atkinson_index_after\": round(atkinson_index(post, shares, epsilon), 4),\n    }\n\nif __name__ == \"__main__\":\n    # Two equal halves: more-deprived (45 QALYs) and less-deprived (90 QALYs).\n    baseline = [45.0, 90.0]\n    shares   = [0.5, 0.5]\n    gained   = [17.0, 2.0]    # 
health the program delivers to each half\n    opp_cost = [2.0, 2.0]     # displaced health: total cost / threshold = 4 QALYs, split evenly\n    print(dcea(baseline, shares, gained, opp_cost, epsilon=2.0))\n    # -> unweighted +7.5, equity-weighted +12.0, Atkinson index 0.1111 -> 0.04",
        "description": "Compute the Atkinson equally distributed equivalent (EDE) of a health distribution and the equity-weighted net\nbenefit of a program for a two-subgroup population. Inputs are per-subgroup health levels (quality-adjusted life\nexpectancy) and population shares. The program adds health gained and subtracts the opportunity-cost displacement\n(program cost / cost-effectiveness threshold), each allocated across subgroups, to give post-intervention health.\nReturns the unweighted net benefit (change in mean), the equity-weighted net benefit (change in EDE), and the\nAtkinson inequality index before and after.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "atkinson_ede <- function(health, shares, epsilon) {\n  w <- shares / sum(shares)\n  if (abs(epsilon - 1) < 1e-9) {                 # limiting case: weighted geometric mean\n    return(exp(sum(w * log(health))))\n  }\n  s <- sum(w * health^(1 - epsilon))\n  s^(1 / (1 - epsilon))\n}\n\natkinson_index <- function(health, shares, epsilon) {\n  mean_h <- sum(shares * health) / sum(shares)\n  1 - atkinson_ede(health, shares, epsilon) / mean_h\n}\n\ndcea <- function(baseline, shares, health_gained, opp_cost_displaced, epsilon) {\n  nhb  <- health_gained - opp_cost_displaced     # net health benefit per subgroup\n  post <- baseline + nhb\n  mean0 <- sum(shares * baseline) / sum(shares)\n  mean1 <- sum(shares * post)     / sum(shares)\n  ede0  <- atkinson_ede(baseline, shares, epsilon)\n  ede1  <- atkinson_ede(post,     shares, epsilon)\n  list(\n    net_health_benefit          = nhb,\n    unweighted_net_benefit      = round(mean1 - mean0, 4),   # ordinary CEA signal\n    equity_weighted_net_benefit = round(ede1 - ede0, 4),\n    atkinson_index_before       = round(atkinson_index(baseline, shares, epsilon), 4),\n    atkinson_index_after        = round(atkinson_index(post,     shares, epsilon), 4)\n  )\n}\n\n# Two equal halves: more-deprived (45 QALYs) and less-deprived (90 QALYs).\nbaseline <- c(45, 90); shares <- c(0.5, 0.5)\ngained   <- c(17, 2)            # health delivered to each half\nopp_cost <- c(2, 2)             # displaced health: total cost / threshold = 4 QALYs, split evenly\nprint(dcea(baseline, shares, gained, opp_cost, epsilon = 2))\n# -> unweighted +7.5, equity-weighted +12, Atkinson index 0.1111 -> 0.04",
        "description": "Same two-subgroup DCEA in base R: an Atkinson equally distributed equivalent (EDE) function, the Atkinson inequality\nindex, and a dcea() that subtracts the opportunity-cost displacement from health gained to form subgroup net health\nbenefit, then contrasts the change in mean (unweighted) with the change in EDE (equity-weighted). Inputs are subgroup\nhealth (quality-adjusted life expectancy) and population shares; epsilon is the inequality-aversion parameter.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Base[Baseline health distribution<br/>by equity subgroup<br/>deprived 45, affluent 90 QALYs] --> Gain[Subgroup health gained<br/>from the program]\n  Gain --> Opp[Subtract opportunity-cost<br/>displacement = cost / threshold<br/>allocated across subgroups]\n  Opp --> NHB[Net health benefit<br/>per subgroup]\n  NHB --> Post[Post-intervention<br/>health distribution]\n  Post --> SWF{Social welfare function<br/>Atkinson EDE at epsilon}\n  Base --> SWF\n  SWF --> Mean[Unweighted net benefit<br/>= change in mean health]\n  SWF --> EDE[Equity-weighted net benefit<br/>= change in EDE]\n  Mean --> Plane{Do they agree?}\n  EDE --> Plane\n  Plane -- Yes --> Easy[Clear adopt/reject]\n  Plane -- No --> Trade[Equity-efficiency trade-off<br/>sweep epsilon to find the tipping point]",
        "caption": "The DCEA pipeline - build the baseline subgroup health distribution, add subgroup net health benefit (gains minus opportunity-cost displacement) to get the post-intervention distribution, then collapse both with an Atkinson social welfare function to compare the unweighted (mean) and equity-weighted (EDE) net benefit on the equity-efficiency plane.",
        "alt_text": "Flowchart from a baseline subgroup health distribution through net health benefit construction to an Atkinson social welfare function that yields both an unweighted and an equity-weighted net benefit, branching on whether they agree.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": "distributional-cost-effectiveness-analysis-rwe-timeline.svg",
        "mermaid": null,
        "caption": "A DCEA appraisal over one year - baseline subgroup health, subgroup gains, opportunity-cost displacement, net health benefit, then the Atkinson equally distributed equivalent rising from 60 to 72 QALYs for an equity-weighted net benefit of +12 versus the +7.5 unweighted gain.",
        "alt_text": "Timeline of a distributional cost-effectiveness analysis across one year showing the analytic stages and the equally distributed equivalent increasing from 60 to 72 QALYs.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "is_variant_of",
        "target_slug": "cost-effectiveness",
        "notes": "DCEA contains cost-effectiveness analysis as the special case of zero inequality aversion (epsilon = 0), where the equally distributed equivalent equals the mean and the verdict reduces to maximizing total health."
      },
      {
        "relation_type": "is_variant_of",
        "target_slug": "cost-utility",
        "notes": "DCEA keeps the QALY-based net-benefit logic of cost-utility analysis but distributes the net benefit across equity subgroups and evaluates an equity-weighted equally distributed equivalent rather than an aggregate QALY total."
      },
      {
        "relation_type": "tradeoff_with",
        "target_slug": "icer-net-monetary-benefit-rwe",
        "notes": "DCEA replaces a single population net benefit at a fixed threshold with an equity-weighted net benefit (change in the Atkinson EDE) that depends on epsilon and the chosen subgroups - more equity-aware but less directly comparable across appraisals."
      },
      {
        "relation_type": "requires",
        "target_slug": "qaly-utility-mapping-rwe",
        "notes": "The baseline subgroup health distribution and the subgroup net health benefit are measured in QALYs, so DCEA depends on the same utility mapping that supplies the quality weights."
      },
      {
        "relation_type": "used_with",
        "target_slug": "sdoh-social-determinants-of-health",
        "notes": "The equity subgroups (deprivation, race/ethnicity, geography) are defined from social-determinant stratifiers; their measurement quality directly drives whether the distribution DCEA weights is credible."
      },
      {
        "relation_type": "used_with",
        "target_slug": "discounting-costs-effects-rwe",
        "notes": "Baseline quality-adjusted life expectancy and lifetime net health benefit are discounted streams, so DCEA inherits the discounting choices made for costs and effects."
      },
      {
        "relation_type": "see_also",
        "target_slug": "budget-impact",
        "notes": "The opportunity-cost displacement central to DCEA is the health forgone elsewhere under a fixed budget - the same affordability constraint a budget-impact analysis quantifies in spending terms."
      }
    ],
    "aliases": [
      "DCEA",
      "distributional cost-effectiveness analysis",
      "equity-informative economic evaluation",
      "equity-weighted cost-effectiveness analysis",
      "equity impact analysis",
      "health equity economic evaluation"
    ],
    "complexity": "advanced",
    "regulatory_relevance": [
      "hta",
      "ema"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "dose-titration-rwe",
    "name": "Dose Titration / Up-Titration to Target Dose",
    "short_definition": "Operational construction, from real-world dosing data, of a per-patient dose trajectory that distinguishes the titration period (the early phase in which a drug is started low and adjusted upward toward a target/maintenance dose) from the stable maintenance phase — yielding time-to-target, dose intensity during titration, and reached-vs-not-reached-target variables, while avoiding the immortal time, titration-speed confounding, and treat-to-target reverse causation that static index-dose definitions create.",
    "long_description": "Many drugs are not started at their effective dose. The clinician begins low and walks the dose **upward** over\ndays to months — to find the tolerable maximum, to track a moving target (a lab, a symptom), or to ramp through a\nknown tolerability barrier — before settling at a **maintenance dose** that is then held. Insulin is titrated to a\nfasting-glucose target; antihypertensives are up-titrated until blood pressure is controlled; levothyroxine is\nadjusted to a TSH target; antidepressants are stepped up for efficacy or down for side effects; GLP-1 agonists such\nas semaglutide follow a fixed escalation schedule to limit nausea; warfarin is dosed to an INR target.\n**Dose-titration construction** turns a stream of pharmacy fills (or EHR med orders / administrations) into a\ntrajectory in which each interval carries the dose actually being taken *at that point*, and labels which intervals\nbelong to the titration ramp versus the stable maintenance plateau. As with all time-varying exposure work, this is\nthe analysis, not a preprocessing step: a static \"index dose\" pulled from the first fill mislabels the entire ramp.\n\n**Core conceptual distinction.** Dose here is a *time-varying* quantity with **two regimes**, and three things must\nbe separated and pre-specified. (1) *The titration period vs the maintenance phase*: the titration period runs from\ninitiation until the dose stabilizes (no further sustained increase for a declared confirmation window, e.g. two\nconsecutive refills at the same daily dose); the maintenance phase is the stable plateau that follows. (2) *What\ncounts as \"the target/maintenance dose\"*: it may be a guideline target (the labeled maintenance dose), a\n*patient-specific* stable dose (whatever dose the patient holds), or a *physiologic* target defined by a downstream\nmeasurement (INR 2-3, TSH in range) rather than the milligrams themselves. These are different variables and answer\ndifferent questions. 
(3) *The titration-derived estimands*: **time-to-target dose** (a duration from initiation to\nfirst reaching/stabilizing at target), **dose intensity during titration** (cumulative or average daily dose over\nthe ramp, or the slope of the ramp), and **reached vs did not reach target** (a binary, or competing-risk, endpoint\nwhere discontinuation and death compete with reaching target). A *slow titrator* and a *fast titrator* who both end\nat 30 mg have identical index doses and identical maintenance doses but completely different ramps — and that ramp is\noften the exposure of interest (tolerability, early effectiveness) or the confounder that wrecks a naive analysis.\n\n**Pros, cons, and trade-offs** (specific and comparative).\n- **vs time-updated-exposures-cumulative-dose-rwe (the parent family):** Titration construction is a *specialization*\n  of time-varying exposure that adds the regime split (ramp vs plateau) and the titration-specific estimands\n  (time-to-target, ramp slope, reached-target). It inherits the same long-format machinery — episode stitching,\n  grace periods, lagging — but layers a *change-point* on top. **Prefer the general cumulative-dose framing** when\n  you only need a running dose/burden and the ramp itself is not of interest; **prefer titration framing** when the\n  ramp's *speed, completion, or intensity* is the exposure, the effect modifier, or the confounder.\n- **vs a static index-dose definition (first-fill dose held forward):** A static index dose is simple and immune to\n  look-ahead, but for a titrated drug it is almost always *wrong* — it labels the patient by their lowest, starting\n  dose and ignores that most of the early follow-up was spent ramping. 
Held forward as a fixed\n  covariate it yields gross exposure misclassification, and the tempting repair of relabeling the patient by the\n  dose eventually reached manufactures **immortal time** (the patient must survive long enough to up-titrate).\n  **Prefer static index dose** only when dose genuinely does not change (fixed-dose combinations) or when an\n  intention-to-treat-by-starting-dose estimand is explicitly wanted.\n- **vs treatment-patterns-lines-of-therapy / switch-add-on-augmentation-rwe:** LOT and switch/augmentation logic\n  operate at the level of *which drug(s)*; titration operates *within a single drug* at the level of *how much*.\n  A dose increase of the same molecule is titration, not a new line and not augmentation. They are complementary\n  layers: a patient can be on LOT1, never switch, and still have a rich titration trajectory. **Do not** encode a\n  dose increase as a switch — it inflates switching rates and erases the titration signal.\n\n**When to use.** Any question where the dose *changes by design* over early follow-up and that change matters:\ntolerability and persistence during a known-difficult ramp (GLP-1 nausea, antidepressant start-up); time-to-control\nin treat-to-target diseases (insulin to fasting glucose, antihypertensives to BP, levothyroxine to TSH, warfarin to\nINR); dose-intensity or ramp-slope as an effect modifier of effectiveness or safety; characterizing the share of\nreal-world initiators who ever reach the guideline target dose (clinical inertia studies); and defining a clean\n\"stable maintenance dose\" index date for a downstream comparative analysis that should start at plateau, not at the\nchaotic ramp.\n\n**When NOT to use — and when it is actively misleading or dangerous.**\n- **Immortal time during titration.** If \"reached target dose\" or \"high maintenance dose\" is used as a *baseline*\n  group label, every patient in that group had to survive — and stay observed and adherent — through the entire\n  ramp to qualify. 
Follow-up that begins at initiation but classifies by a dose reached later makes the ramp period\n  *immortal*: events during titration cannot occur in the high-dose group by construction, fabricating a survival\n  advantage for higher doses. Classify dose as **time-varying** (low until the increase actually happens), or use a\n  **landmark** at a fixed post-initiation time, or **clone-censor-weight** the titration strategies.\n- **Confounding by titration speed.** Why a patient titrates fast or slow is rarely random: it tracks tolerability,\n  baseline severity, visit frequency, and clinician aggressiveness. Comparing fast vs slow titrators, or high vs low\n  maintenance dose, without adjusting for the *reasons* dose was changed confounds the dose effect with the\n  indication for changing it. The ramp slope is both an exposure and a marker of the latent process driving it.\n- **Treat-to-target reverse causation (the most dangerous trap).** In treat-to-target dosing the dose is increased\n  *because the patient is not responding* — sicker patients get pushed to higher doses. A naive analysis then finds\n  that higher doses associate with worse outcomes, reading the *consequence* of poor control as a *cause*. This is\n  confounding by indication operating through a feedback loop: the time-varying confounder (the lab/symptom) responds\n  to the drug AND drives the next dose change. A standard time-dependent model that simply adjusts for the current\n  lab is biased, because that lab is also a mediator of earlier doses and conditioning on it blocks part of the\n  drug's effect; only **g-methods (IPTW marginal structural models, g-estimation)**\n  recover the causal effect when dose is titrated to a measured target.\n\n**Data-source operational depth.**\n- **Claims (FFS):** There is no \"dose\" field. Daily dose is *inferred* as `NDC strength x dispensed quantity /\n  days_supply` (e.g., 30 tablets of a 10 mg NDC over 30 days = 10 mg/day). 
A dose increase shows up as a fill of a\n  higher-strength NDC, or more tablets of the same strength, or a shorter days_supply for the same quantity — all of\n  which must be normalized to mg/day before a \"titration step\" can be detected. Real failure modes: (a)\n  **pill-splitting and titration packs / starter kits** break the strength x quantity arithmetic (a 30-count \"dose pack\"\n  that steps 0.25 -> 1.0 mg over a month has one NDC but several daily doses); (b) **Medicare Advantage / capitated\n  person-time lacks FFS fill claims**, freezing the inferred dose and hiding titration steps — exclude MA-only spans;\n  (c) **90-day mail order and stockpiling** blur the timing of a dose change; (d) the *physiologic* target (INR, TSH,\n  fasting glucose) is **invisible in claims** — you can see the milligrams change but not the lab that drove it, so\n  treat-to-target confounding cannot be adjusted in claims-only data.\n- **EHR:** Dose can come from the medication *order* + structured **sig** (\"take 1 tablet twice daily, increase to 2\n  tablets after 1 week\"), the **MAR** (administrations, the true dose for inpatient/infusional drugs), or the\n  e-prescribe feed. EHR's advantage is that the *target itself is observable* (labs, vitals, INR, TSH), making it the\n  substrate where treat-to-target feedback can actually be modeled — but sigs are often free-text and titration\n  instructions often live in a single order's text, so the dose change is documented without a new order. 
Parse the sig, and prefer the MAR for\n  administered drugs.\n- **Registry:** Protocol or disease registries often capture target attainment and dose adjustments prospectively\n  (the cleanest source for \"reached target\"), but dose *holds and down-titrations* are frequently noted only in\n  unstructured text, overstating the dose actually taken; link to claims for fill completeness.\n- **Linked claims-EHR-registry:** The ideal substrate — claims for fill completeness and dose timing, EHR for the\n  target lab that drove each step (so g-methods can model the feedback), registry for adjudicated target attainment.\n  Order/fill/administration date discrepancies must be reconciled before any titration-step date is set.\n\n**Worked claims example.** Question: time-to-target-dose and titration-period dose intensity for an oral drug with a\n10 -> 20 -> 30 mg up-titration to a 30 mg maintenance target, FFS claims, one patient. (1) **Build daily dose per\nfill:** `daily_dose = NDC_strength_mg x quantity / days_supply`. Fill on day 0: 30 tablets of 10 mg over 30 days ->\n10 mg/day. Fill on day 28: 30 tablets of 20 mg over 30 days -> 20 mg/day. Fill on day 56: 30 tablets of 30 mg over\n30 days -> 30 mg/day. Fill on day 86 and day 116: 30 mg/day each. (2) **Detect titration steps:** daily dose rises\n10 -> 20 -> 30, so days 0-55 are the titration ramp. (3) **Define stable/maintenance dose:** require two consecutive\nfills at the same daily dose with no further increase; 30 mg first appears on day 56 and repeats on days 86 and 116,\nso 30 mg is confirmed stable as of day 56. (4) **Time-to-target:** target = 30 mg label maintenance; first reached\non day 56, so **time-to-target = 56 days** and the **titration period = days 0-55 (56 days)**. (5) **Dose intensity\nduring titration:** average daily dose over the ramp = (10 mg x 28 days + 20 mg x 28 days)/56 days = (280 + 560)/56 =\n**15 mg/day**, i.e. 
50% of target — the patient spent the ramp well below the maintenance dose, which a static\n\"index dose = 10 mg\" and a static \"maintenance dose = 30 mg\" both misrepresent. (6) **Analysis hygiene:** classify\ndose as time-varying (10 mg on [0,28), 20 mg on [28,56), 30 mg from 56); if comparing maintenance-dose levels for an\noutcome, do **not** label baseline by the day-56 dose (immortal time) — use a landmark or g-method, and if the dose\nsteps were driven by an unobserved-in-claims target (e.g., glucose), state that treat-to-target confounding is\nunadjustable in claims-only data and requires linked labs.",
    "primary_category": "Exposure_Definition",
    "tags": [
      "dose-titration",
      "up-titration",
      "treat-to-target",
      "time-to-target-dose",
      "dose-intensity",
      "maintenance-dose",
      "time-varying-exposure",
      "immortal-time",
      "confounding-by-indication",
      "claims",
      "ehr"
    ],
    "applies_to_study_types": [
      "new_user",
      "active_comparator_new_user",
      "cohort_retrospective",
      "claims_analysis",
      "ehr_study",
      "drug_utilization",
      "target_trial_emulation"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.2337/diacare.26.11.3080",
        "url": "https://doi.org/10.2337/diacare.26.11.3080",
        "citation_text": "Riddle MC, Rosenstock J, Gerich J; Insulin Glargine 4002 Study Investigators. The treat-to-target trial: randomized addition of glargine or human NPH insulin to oral therapy of type 2 diabetic patients. Diabetes Care. 2003;26(11):3080-3086.",
        "year": 2003,
        "authors_short": "Riddle et al.",
        "notes": "The canonical algorithmic treat-to-target titration paradigm - a forced fasting-glucose-driven up-titration of basal insulin to a target - that defines the clinical process this concept measures in real-world data."
      },
      {
        "role": "explain",
        "doi": "10.1002/pds.1357",
        "url": "https://doi.org/10.1002/pds.1357",
        "citation_text": "Suissa S. Immortal time bias in observational studies of drug effects. Pharmacoepidemiology and Drug Safety. 2007;16(3):241-249.",
        "year": 2007,
        "authors_short": "Suissa",
        "notes": "Explains why classifying patients by a dose reached later (e.g., \"reached target dose\") while measuring follow-up from initiation fabricates immortal person-time across the titration ramp - the central trap of titration analysis."
      },
      {
        "role": "explain",
        "doi": "10.1002/sim.3701",
        "url": "https://doi.org/10.1002/sim.3701",
        "citation_text": "Sylvestre MP, Abrahamowicz M. Flexible modeling of the cumulative effects of time-dependent exposures on the hazard. Statistics in Medicine. 2009;28(27):3437-3453.",
        "year": 2009,
        "authors_short": "Sylvestre & Abrahamowicz",
        "notes": "Provides the time-varying-dose modeling framework (weighted cumulative exposure) underlying any titration analysis where the dose changes over follow-up and the timing/intensity of the ramp shapes risk."
      },
      {
        "role": "demonstrate",
        "doi": "10.1002/pds.5701",
        "url": "https://doi.org/10.1002/pds.5701",
        "citation_text": "Kelly TL, Salter A, Pratt NL. The weighted cumulative exposure method and its application to pharmacoepidemiology: a narrative review. Pharmacoepidemiology and Drug Safety. 2024;33(1):e5701.",
        "year": 2023,
        "authors_short": "Kelly et al.",
        "notes": "Applied review showing how time-varying dose histories (the output of titration construction) are modeled in pharmacoepidemiology, with practical guidance on windows, weighting, and the dense exposure histories titration requires."
      },
      {
        "role": "use",
        "doi": "10.1007/s13300-019-0581-y",
        "url": "https://doi.org/10.1007/s13300-019-0581-y",
        "citation_text": "Overgaard RV, Delff PH, Petri KCC, Anderson TW, Flint A, Ingwersen SH. Population pharmacokinetics of semaglutide for type 2 diabetes. Diabetes Therapy. 2019;10(2):649-662.",
        "year": 2019,
        "authors_short": "Overgaard et al.",
        "notes": "Concrete example of a protocol-defined dose-escalation schedule (semaglutide 0.25 -> 0.5 -> 1.0 mg) whose fixed titration steps map directly to NDC-strength changes in real-world dispensing data."
      }
    ],
    "plain_language_summary": "Dose titration is when a drug is started low and the dose is walked upward over days or months until it reaches a steady \"maintenance\" dose or hits a target (like a blood-pressure or blood-sugar goal). In real-world data there is no dose column, so you rebuild each day's dose from the prescription details and then split a patient's history into the early \"ramp up\" period and the later stable period. The big traps: a patient labeled by the high dose they eventually reached had to survive long enough to get there (a fake survival advantage), and sicker patients often get pushed to higher doses, so higher doses can look harmful when they are really just a marker of being sicker.\n",
    "key_terms": [
      {
        "term": "titration period",
        "definition": "The early stretch of treatment during which the dose is being adjusted (usually raised) before it settles at a steady level."
      },
      {
        "term": "maintenance dose",
        "definition": "The steady dose a patient stays on once the dose stops changing."
      },
      {
        "term": "target dose",
        "definition": "The dose the clinician is aiming for - either a guideline maintenance dose or whatever dose hits a clinical goal like a lab value."
      },
      {
        "term": "dose intensity",
        "definition": "How high the dose is on average over a period - here, the average daily dose during the up-titration ramp."
      },
      {
        "term": "days_supply",
        "definition": "How many days one filled prescription is meant to last (e.g., a 30-day fill of one tablet a day)."
      },
      {
        "term": "NDC strength",
        "definition": "The milligrams per tablet/unit recorded on a pharmacy claim, used with quantity and days_supply to infer the daily dose."
      }
    ],
    "worked_example": {
      "scenario": "One patient starts an oral drug that is meant to be up-titrated 10 -> 20 -> 30 mg, with 30 mg as the maintenance target. We see five pharmacy fills over four months in a claims table. There is no dose field, so we infer the daily dose from each fill, find where the dose stabilizes, and compute how long it took to reach the 30 mg target and how intense dosing was during the ramp.\n",
      "dataset": {
        "caption": "The raw rows an analyst would see in a claims pharmacy table (one row per fill).",
        "columns": [
          "person_id",
          "fill_date",
          "drug",
          "strength_mg",
          "quantity",
          "days_supply"
        ],
        "rows": [
          [
            2001,
            "2023-01-01",
            "exampledrug",
            10,
            30,
            30
          ],
          [
            2001,
            "2023-01-29",
            "exampledrug",
            20,
            30,
            30
          ],
          [
            2001,
            "2023-02-26",
            "exampledrug",
            30,
            30,
            30
          ],
          [
            2001,
            "2023-03-28",
            "exampledrug",
            30,
            30,
            30
          ],
          [
            2001,
            "2023-04-27",
            "exampledrug",
            30,
            30,
            30
          ]
        ]
      },
      "steps": [
        "Infer daily dose for each fill = strength_mg x quantity / days_supply. Fill 1 = 10*30/30 = 10 mg/day; fill 2 = 20 mg/day; fills 3-5 = 30 mg/day.",
        "The dose rises 10 -> 20 -> 30, so the first two fills (days 0-55) are the titration ramp.",
        "A \"stable maintenance dose\" needs two consecutive fills at the same dose with no later increase. 30 mg first appears on day 56 and repeats on days 86 and 116, so 30 mg is confirmed stable starting day 56.",
        "Target = 30 mg label maintenance; first reached on day 56, so time-to-target = 56 days and the titration period is days 0-55 (56 days long).",
        "Titration dose intensity = time-weighted average daily dose over the ramp = (10 mg x 28 days + 20 mg x 28 days) / 56 days = (280 + 560) / 56 = 15 mg/day, which is 50% of the 30 mg target."
      ],
      "result": "Reached the 30 mg target on day 56; titration period = 56 days; titration-period dose intensity = 15 mg/day (50% of target). A static \"index dose = 10 mg\" and \"maintenance dose = 30 mg\" both hide that the patient spent the first 56 days well below target.",
      "timeline_spec": {
        "title": "One patient up-titrating 10 to 20 to 30 mg over a 56-day ramp to a 30 mg target dose",
        "window": {
          "start": "2023-01-01",
          "end": "2023-05-27",
          "label": "Observation: initiation through stable maintenance"
        },
        "events": [
          {
            "label": "Fill 1",
            "start": "2023-01-01",
            "length_days": 28,
            "quantity": "10 mg/day"
          },
          {
            "label": "Fill 2",
            "start": "2023-01-29",
            "length_days": 28,
            "quantity": "20 mg/day"
          },
          {
            "label": "Fill 3 (target reached)",
            "start": "2023-02-26",
            "length_days": 30,
            "quantity": "30 mg/day"
          },
          {
            "label": "Fill 4",
            "start": "2023-03-28",
            "length_days": 30,
            "quantity": "30 mg/day"
          },
          {
            "label": "Fill 5",
            "start": "2023-04-27",
            "length_days": 30,
            "quantity": "30 mg/day"
          }
        ],
        "spans": [
          {
            "kind": "exposed",
            "start": "2023-01-01",
            "end": "2023-02-25",
            "label": "Titration period: 56 days, intensity 15 mg/day"
          },
          {
            "kind": "followup",
            "start": "2023-02-26",
            "end": "2023-05-27",
            "label": "Maintenance phase at 30 mg target"
          }
        ],
        "result": {
          "label": "Reached 30 mg target on day 56; titration period = 56 days",
          "value": 56
        }
      }
    },
    "prerequisites": [
      "time-updated-exposures-cumulative-dose-rwe",
      "exposure-episode-construction-rwe",
      "pdc-proportion-of-days-covered"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Schedule-based titration (fixed protocol ramp)",
        "description": "The drug has a labeled, fixed up-titration schedule (e.g., semaglutide 0.25 -> 0.5 -> 1.0 mg every 4 weeks; GLP-1 anti-nausea ramps). Titration steps are detected as the dose moving through the expected schedule; \"on schedule\" vs \"delayed/skipped\" steps are derivable.",
        "edge_cases": [
          "Titration packs / starter kits carry one NDC but several daily doses across the month, breaking strength x quantity arithmetic.",
          "A patient who stalls at an intermediate dose never reaches the labeled target - a competing outcome to plateau."
        ],
        "data_source_notes": "claims: detect via successive higher-strength NDC fills; map the dose sequence to the known schedule. EHR: the sig text usually states the schedule explicitly."
      },
      {
        "name": "Treat-to-target titration (physiologic target)",
        "description": "Dose is adjusted until a measured target is met (insulin to fasting glucose, antihypertensive to BP, levothyroxine to TSH, warfarin to INR). \"Target\" is the lab/vital, not the milligrams; the dose path is a response to it.",
        "edge_cases": [
          "The target measurement is the time-varying confounder AND a mediator - a standard time-dependent model is biased; use g-methods.",
          "In claims the target lab is invisible, so treat-to-target confounding is unadjustable without linked EHR labs."
        ],
        "data_source_notes": "EHR/linked: the target lab is observable, enabling IPTW/MSM modeling of the feedback. claims-only: you see the dose change but not the reason for it."
      },
      {
        "name": "Tolerability-driven (max tolerated dose) titration",
        "description": "Dose is pushed upward until side effects or a ceiling stop it; the stable dose is the patient-specific maximum tolerated dose, not a guideline target. Reached-vs-not and ramp slope index tolerability.",
        "edge_cases": [
          "Down-titrations and dose holds (only sometimes captured) make the \"stable\" dose ambiguous; require a confirmation window.",
          "Slow titration may reflect intolerance (informative), not non-adherence - do not lump them."
        ],
        "data_source_notes": "EHR registry: holds/reductions may live only in notes, overstating received dose. claims: a strength drop signals a down-titration but the reason is unobserved."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "time-updated-exposures-cumulative-dose-rwe",
        "pros_of_this": "Adds the regime split (titration ramp vs maintenance plateau) and titration-specific estimands - time-to-target, ramp slope, reached-target - that a plain running cumulative-dose variable does not express.",
        "cons_of_this": "More moving parts (change-point detection, a stable-dose confirmation rule) than a simple running sum; inherits all of the parent family's episode-construction and lagging requirements on top.",
        "when_to_prefer": "When the speed, completion, or intensity of the up-titration is itself the exposure, effect modifier, or confounder; use the general cumulative-dose framing when only total burden matters and the ramp is incidental."
      },
      {
        "compared_to": "switch-add-on-augmentation-rwe",
        "pros_of_this": "Correctly attributes a dose increase of the same molecule to titration rather than to a switch or add-on, preventing inflated switching/augmentation rates and preserving the within-drug dose signal.",
        "cons_of_this": "Requires reliable dose inference (strength x quantity / days_supply) to tell a dose step apart from a regimen change; titration packs and pill-splitting make this harder than detecting a new NDC entirely.",
        "when_to_prefer": "Whenever the change is in the amount of the same drug; reserve switch/add-on logic for changes in which drug(s) the patient is taking."
      },
      {
        "compared_to": "standard-cox-time-dependent",
        "pros_of_this": "Supplies the time-varying dose trajectory (low during the ramp, target at plateau) that prevents the immortal time created by classifying patients on a later-reached maintenance dose at baseline.",
        "cons_of_this": "A standard time-dependent model on the titrated dose is still biased under treat-to-target feedback; the construction alone does not fix reverse causation.",
        "when_to_prefer": "Use a plain time-dependent model only when dose changes are not driven by a measured target affected by the drug; escalate to g-methods (IPTW/MSM) when titration is treat-to-target."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "No dose field exists - infer daily dose as NDC strength x dispensed quantity / days_supply and normalize across strengths before detecting steps. A dose increase appears as a higher-strength NDC, more units of the same strength, or a shorter days_supply. Exclude Medicare Advantage-only person-time (frozen inferred dose hides steps); handle titration packs and pill-splitting explicitly; remember the physiologic target driving the steps is invisible, so treat-to-target confounding cannot be adjusted in claims-only data.",
      "ehr": "Dose comes from the order + structured sig (which may state the whole ramp in one order's text), the MAR (best for administered/infused drugs), or the e-prescribe feed. The target lab/vital (glucose, BP, TSH, INR) is observable, making EHR the substrate where treat-to-target feedback can be modeled with g-methods. Parse free-text sigs for titration instructions.",
      "registry": "Often captures target attainment and dose adjustments prospectively (cleanest \"reached target\"), but holds and down-titrations may be noted only in free text, overstating received dose; link to claims for fill completeness.",
      "linked": "Ideal substrate - claims for fill timing/completeness, EHR labs for the target that drove each step (enabling IPTW/MSM), registry for adjudicated attainment. Reconcile order/fill/administration date discrepancies before setting titration-step dates."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\n\nSTABLE_FILLS = 2     # consecutive fills at the same daily dose to confirm a stable maintenance dose\nTARGET_MG    = 30.0  # label/guideline maintenance target (set per drug; None = patient-specific stable dose)\n\ndef build_dose_trajectory(fills: pd.DataFrame) -> pd.DataFrame:\n    f = fills.sort_values([\"person_id\", \"fill_date\"]).copy()\n    # 1) Infer daily dose from claims arithmetic and normalize across NDC strengths.\n    f[\"daily_dose\"] = (f[\"strength_mg\"] * f[\"quantity\"] / f[\"days_supply\"]).round(3)\n\n    out = []\n    for pid, g in f.groupby(\"person_id\", sort=False):\n        g = g.reset_index(drop=True)\n        day0 = g.loc[0, \"fill_date\"]\n        doses = g[\"daily_dose\"].tolist()\n        dates = g[\"fill_date\"].tolist()\n\n        # 2) Identify the first dose that is sustained (no later increase) = stable maintenance dose.\n        stable_dose, stable_date = None, None\n        for i in range(len(doses)):\n            run = [d for d in doses[i:] if d >= doses[i]]            # never drops below at/after i\n            same = sum(1 for d in doses[i:] if d == doses[i])\n            no_later_increase = all(d <= doses[i] for d in doses[i:])\n            if same >= STABLE_FILLS and no_later_increase:\n                stable_dose, stable_date = doses[i], dates[i]\n                break\n\n        target = TARGET_MG if TARGET_MG is not None else stable_dose\n        reached = stable_dose is not None and target is not None and stable_dose >= target\n        target_date = stable_date if reached else None\n        ttt = (target_date - day0).days if target_date is not None else None\n        titration_days = ttt if ttt is not None else (dates[-1] - day0).days\n\n        # 3) Dose intensity during titration = time-weighted mean daily dose over the ramp.\n        ramp = g[g[\"fill_date\"] < (target_date if target_date is not None else dates[-1])]\n        if len(ramp):\n            dur = 
ramp[\"days_supply\"].clip(lower=1)\n            intensity = float((ramp[\"daily_dose\"] * dur).sum() / dur.sum())\n        else:\n            intensity = float(doses[0])\n\n        out.append({\n            \"person_id\": pid,\n            \"stable_dose_mg\": stable_dose,\n            \"target_mg\": target,\n            \"reached_target\": bool(reached),\n            \"time_to_target_days\": ttt,\n            \"titration_period_days\": titration_days,\n            \"titration_dose_intensity_mg\": round(intensity, 2),\n        })\n    return pd.DataFrame(out)",
        "description": "Construct a per-patient dose trajectory from dosing rows, identify the titration period vs maintenance plateau, and\nflag time-to-target. Required input (cleaned, de-duplicated, one row per fill):\n  fills : person_id, fill_date (datetime64), strength_mg (float), quantity (float), days_supply (int)\nDaily dose is inferred as strength_mg * quantity / days_supply. A \"stable maintenance dose\" is the first daily dose that\nis sustained for STABLE_FILLS consecutive fills with no later increase. Returns one row per person with the titration\nperiod length, time-to-target (label target dose), reached_target flag, and titration-window dose intensity.",
        "dependencies": [
          "pandas"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\n\nSTABLE_FILLS <- 2L     # consecutive same-dose fills confirming a stable maintenance dose\nTARGET_MG    <- 30     # label/guideline target; set NA_real_ for a patient-specific stable dose\n\nbuild_dose_trajectory <- function(fills) {\n  setDT(fills); setorder(fills, person_id, fill_date)\n  fills[, daily_dose := round(strength_mg * quantity / days_supply, 3)]\n\n  one_person <- function(g) {\n    doses <- g$daily_dose; dates <- g$fill_date; day0 <- dates[1L]\n    stable_dose <- NA_real_; stable_date <- as.Date(NA)\n    for (i in seq_along(doses)) {\n      tail_d <- doses[i:length(doses)]\n      same   <- sum(tail_d == doses[i])\n      if (same >= STABLE_FILLS && all(tail_d <= doses[i])) {   # sustained, no later increase\n        stable_dose <- doses[i]; stable_date <- dates[i]; break\n      }\n    }\n    target  <- if (is.na(TARGET_MG)) stable_dose else TARGET_MG\n    reached <- !is.na(stable_dose) && !is.na(target) && stable_dose >= target\n    tdate   <- if (reached) stable_date else as.Date(NA)\n    ttt     <- if (!is.na(tdate)) as.integer(tdate - day0) else NA_integer_\n    tit_days <- if (!is.na(ttt)) ttt else as.integer(dates[length(dates)] - day0)\n    end     <- if (!is.na(tdate)) tdate else dates[length(dates)]\n    ramp    <- g[fill_date < end]\n    intensity <- if (nrow(ramp))\n      sum(ramp$daily_dose * pmax(ramp$days_supply, 1L)) / sum(pmax(ramp$days_supply, 1L)) else doses[1L]\n    list(stable_dose_mg = stable_dose, target_mg = target, reached_target = reached,\n         time_to_target_days = ttt, titration_period_days = tit_days,\n         titration_dose_intensity_mg = round(intensity, 2))\n  }\n  fills[, one_person(.SD), by = person_id]\n}",
        "description": "Same dose-trajectory construction in data.table: infer daily dose from strength x quantity / days_supply, find the first\nsustained (non-increasing thereafter) daily dose as the stable maintenance dose, and compute time-to-target, the\nreached-target flag, the titration-period length, and the time-weighted titration dose intensity. Input:\n  fills : person_id, fill_date (Date), strength_mg (numeric), quantity (numeric), days_supply (integer)",
        "dependencies": [
          "data.table"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "%let stable_fills = 2;   /* consecutive same-dose fills to confirm a stable maintenance dose */\n%let target_mg    = 30;  /* label/guideline target dose                                       */\n\n/* 1) Infer daily dose from claims arithmetic. */\nproc sql;\n  create table dosed as\n  select person_id, fill_date, days_supply,\n         round(strength_mg * quantity / days_supply, 0.001) as daily_dose\n  from work.fills;\nquit;\nproc sort data=dosed; by person_id fill_date; run;\n\n/* 2) For each person, find the first dose sustained with no later increase, then summarize. */\ndata traj(keep=person_id stable_dose_mg target_mg reached_target\n                time_to_target_days titration_period_days titration_dose_intensity_mg);\n  length stable_dose_mg time_to_target_days titration_period_days\n         titration_dose_intensity_mg 8 reached_target 3;\n  do until (last.person_id);\n    set dosed; by person_id;\n    i + 1;\n    array dd[1000] _temporary_;  array dt[1000] _temporary_;  array ds[1000] _temporary_;\n    dd[i] = daily_dose; dt[i] = fill_date; ds[i] = days_supply;\n  end;\n  n = i; day0 = dt[1]; stable_dose_mg = .; stable_idx = .;\n  do a = 1 to n while (stable_idx = .);          /* first sustained, non-increasing dose */\n    same = 0; no_inc = 1;\n    do b = a to n;\n      if dd[b] = dd[a] then same + 1;\n      if dd[b] > dd[a] then no_inc = 0;\n    end;\n    if same >= &stable_fills and no_inc then do; stable_dose_mg = dd[a]; stable_idx = a; end;\n  end;\n  target_mg = &target_mg;\n  reached_target = (stable_dose_mg ne . and stable_dose_mg >= target_mg);\n  if reached_target then do;\n    time_to_target_days = dt[stable_idx] - day0; end_idx = stable_idx;\n  end;\n  else do; time_to_target_days = .; end_idx = n; end;\n  titration_period_days = ifn(reached_target, time_to_target_days, dt[n] - day0);\n  /* 3) Time-weighted titration dose intensity over the ramp (fills before target). 
*/\n  num = 0; den = 0;\n  do c = 1 to (end_idx - 1);\n    w = max(ds[c], 1); num + dd[c]*w; den + w;\n  end;\n  titration_dose_intensity_mg = ifn(den > 0, round(num/den, 0.01), dd[1]);\n  i = 0;\n  output;\nrun;",
        "description": "Dose-trajectory construction in SAS. PROC SQL infers daily dose (strength x quantity / days_supply); a sorted DATA step\nwith RETAIN walks each patient's fills, and a second pass confirms the first sustained, non-increasing daily dose as the\nstable maintenance dose to derive time-to-target, reached_target, titration-period length, and titration dose intensity.\nInput (post data-management):\n  work.fills : person_id, fill_date (SAS date), strength_mg, quantity, days_supply",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Fill[Next fill<br/>strength_mg, quantity, days_supply] --> DD[daily_dose =<br/>strength x quantity / days_supply]\n  DD --> Cmp{Daily dose<br/>higher than<br/>prior plateau?}\n  Cmp -- Yes --> Ramp[Titration step<br/>still in ramp / not stable yet]\n  Cmp -- No --> Same{Same dose for<br/>>= STABLE_FILLS fills<br/>AND no later increase?}\n  Same -- No --> Ramp\n  Same -- Yes --> Stab[Stable maintenance dose<br/>confirmed here]\n  Stab --> Tgt{Stable dose<br/>>= target dose?}\n  Tgt -- Yes --> Reached[reached_target = 1<br/>time-to-target = this date - day0]\n  Tgt -- No --> NotReached[reached_target = 0<br/>plateaued below target<br/>clinical inertia / intolerance]",
        "caption": "Per-fill logic that infers daily dose from claims, walks the up-titration ramp, confirms a stable maintenance dose (sustained with no later increase), and decides whether the patient reached the target dose - the basis for time-to-target and reached-vs-not endpoints.",
        "alt_text": "Decision flowchart computing daily dose from strength, quantity, and days_supply, detecting titration steps, confirming a stable maintenance dose, and flagging whether the target dose was reached.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "gantt\n  title One patient up-titrating 10 to 20 to 30 mg over a 56-day ramp to a 30 mg target\n  dateFormat YYYY-MM-DD\n  axisFormat %d %b\n  section Titration ramp\n  10 mg/day (fill 1) :crit, t1, 2023-01-01, 28d\n  20 mg/day (fill 2) :crit, t2, 2023-01-29, 28d\n  section Maintenance\n  30 mg/day reached - target (fill 3) :done, m1, 2023-02-26, 30d\n  30 mg/day confirmed stable (fill 4) :active, m2, 2023-03-28, 30d\n  30 mg/day (fill 5) :active, m3, 2023-04-27, 30d",
        "caption": "A single patient's dose rising in steps. A static index dose (10 mg) and a static maintenance dose (30 mg) both hide the 56-day ramp; the titration-period dose intensity is 15 mg/day (50% of target), and time-to-target = 56 days.",
        "alt_text": "Gantt timeline showing daily dose stepping from 10 mg to 20 mg over the first 56 days then plateauing at the 30 mg target dose, distinguishing the titration ramp from the maintenance phase.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": "dose-titration-rwe-timeline.svg",
        "mermaid": null,
        "caption": "One patient's dose rising in steps from 10 mg to 20 mg to the 30 mg target over a 56-day titration period, followed by the stable maintenance phase; titration-period dose intensity = 15 mg/day, time-to-target = 56 days.",
        "alt_text": "Stepped timeline of daily dose increasing 10 to 20 to 30 mg across fills, with the titration period and maintenance phase shaded and the time-to-target annotated.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "is_variant_of",
        "target_slug": "time-updated-exposures-cumulative-dose-rwe",
        "notes": "Dose titration is a specialization of time-varying exposure construction that adds the titration-ramp vs maintenance-plateau regime split and the time-to-target, ramp-slope, and reached-target estimands on top of the running dose history."
      },
      {
        "relation_type": "used_with",
        "target_slug": "exposure-episode-construction-rwe",
        "notes": "The dose trajectory is built on stitched exposure episodes; episode construction supplies the gap-handled fill intervals on which daily dose and titration steps are computed."
      },
      {
        "relation_type": "used_with",
        "target_slug": "grace-period-gap-rules-rwe",
        "notes": "Grace-period and gap rules decide whether a delayed refill at a higher strength is a continued titration step or a new episode, directly shaping where the titration period ends."
      },
      {
        "relation_type": "tradeoff_with",
        "target_slug": "switch-add-on-augmentation-rwe",
        "notes": "A dose increase of the same molecule is titration, not a switch or add-on; conflating the two inflates switching rates and erases the titration signal. Reserve switch/add-on logic for changes in which drug is taken."
      },
      {
        "relation_type": "see_also",
        "target_slug": "pdc-proportion-of-days-covered",
        "notes": "Adherence during the ramp (PDC) interacts with titration - poor coverage during titration can stall or restart the up-titration and confounds time-to-target."
      },
      {
        "relation_type": "see_also",
        "target_slug": "treatment-patterns-lines-of-therapy",
        "notes": "LOT operates across drugs (which agents, in what order); titration operates within a single drug (how much). They are complementary layers of a treatment-pattern description."
      },
      {
        "relation_type": "requires",
        "target_slug": "marginal-structural-models-g-methods",
        "notes": "When dose is titrated to a measured target affected by the drug (insulin to glucose, warfarin to INR), a standard time-dependent model is biased by treat-to-target reverse causation; the same trajectory must be analyzed with IPTW/MSM."
      },
      {
        "relation_type": "see_also",
        "target_slug": "immortal-time-bias-handling",
        "notes": "Classifying patients by a maintenance dose reached later, while measuring follow-up from initiation, makes the titration ramp immortal; time-varying dose, landmarking, or clone-censor-weight address it."
      }
    ],
    "aliases": [
      "dose titration",
      "up-titration",
      "dose escalation",
      "titration to target",
      "treat-to-target dosing",
      "time-to-target dose",
      "maintenance dose attainment",
      "dose ramp"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "drug-utilization",
    "name": "Drug Utilization Study",
    "short_definition": "A descriptive pharmacoepidemiologic study that quantifies the marketing, prescribing, dispensing, and use of medicines in a defined population using standardized measures (utilization volume, prevalence/incidence of use, treatment duration, adherence, and prescribing quality) rather than estimating a comparative treatment effect.",
    "long_description": "A **drug utilization study (DUS)** describes *who uses which drugs, how much, for how long, and how appropriately* in a real-world population — it is the descriptive-epidemiology workhorse of pharmacoepidemiology, not a causal design. The WHO defines drug utilization research as \"the marketing, distribution, prescription, and use of drugs in a society, with special emphasis on the resulting medical, social, and economic consequences.\" Outputs are population- or patient-level quantities (defined daily doses dispensed per 1,000 inhabitants per day, prevalence/incidence of treated patients, mean treatment duration, proportion of days covered, polypharmacy counts, off-label or guideline-discordant use), not adjusted hazard ratios. A DUS answers *what is happening* with a medicine in practice; it deliberately stops short of attributing outcomes to exposure.\n\n**Core conceptual distinction**. A DUS is *descriptive*; an active-comparator new-user cohort, a self-controlled case series, or a target-trial emulation is *analytic/causal*. The distinction is the estimand. A DUS targets a **frequency, rate, or distribution** (e.g., the DDD/1000/day of high-potency opioids in 2023, the 1-year persistence of GLP-1 initiators, the share of NSAID users with a concurrent gastroprotective agent). A causal study targets a **contrast between counterfactual outcomes under two interventions** and therefore must control confounding, align time zero, and handle informative censoring. The two also differ in the **unit of measurement**: aggregate DUS works in person-time and standardized dose units (ATC/DDD) and tolerates open cohorts; individual-level DUS works in dispensing episodes per person and demands the same enrollment/washout hygiene a cohort study does, because a \"new user\" or \"persistence\" measure is only valid if prior dispensing is observable. 
A further subdivision (WHO/Wettermark) is **drug statistics** (volume/cost, often aggregate sales or claims tallies) vs **quality-of-use indicators** (DU90%, the proportion of prescribing within the 90% most-used drugs; potentially inappropriate medication rates by Beers/STOPP; adherence). Confusing a DUS frequency for a causal effect — \"patients on drug A had more events, so drug A is harmful\" — is the single most common misreading and is exactly what the descriptive framing forbids.\n\n**Pros, cons, and trade-offs**\n- **vs an analytic cohort (e.g., active-comparator new-user):** A DUS is cheaper, faster, needs no comparator, and directly answers surveillance, market-access, formulary, and stewardship questions (\"is uptake growing? is prescribing concentrated? are we hitting the guideline target?\"). Cost: it makes **no causal claim** and is uninterpretable as effectiveness/safety evidence. **Prefer a DUS** when the question is genuinely descriptive (volume, patterns, quality, equity of access); **escalate to a cohort** the moment the question becomes \"does the drug change the outcome?\"\n- **Aggregate (ATC/DDD) vs individual-level dispensing measures:** Aggregate DDD/1000/day is internationally comparable, robust to enrollment gaps, and ideal for cross-country or trend monitoring; it cannot describe duration, adherence, switching, or per-patient burden. Individual-level measures (prevalence of treated persons, persistence, PDC, polypharmacy) are far richer but require longitudinal, person-linked data with observable enrollment, and the DDD may misrepresent the dose actually prescribed (pediatrics, renal dosing, titrated drugs). 
**Prefer DDD** for system-level trends and benchmarking; **prefer individual-level** for adherence, duration, and patient-burden questions.\n- **Prevalence vs incidence (new-user) of use:** Prevalence (any use in a period) is simple and captures total burden but mixes long-term and starting users and is dominated by chronic prevalent users. Incidence of use (first dispensing after a drug-free washout) isolates initiation, supports uptake/diffusion analysis, and is the prerequisite for any persistence or new-user analytic follow-on. **Prefer incidence** whenever initiation, diffusion, or downstream causal work is the goal — and accept that it requires a washout and continuous look-back.\n\n**When to use**. Post-marketing surveillance of uptake and prescribing; formulary and budget-impact inputs (volumes, costs, market share); antimicrobial and opioid stewardship metrics; medication-safety quality indicators (PIM rates, drug–drug interaction prevalence, DU90% profiling); equity-of-access and disparity description; characterizing a treated population *before* designing a comparative study; regulatory drug-utilization studies requested by FDA/EMA to contextualize a safety signal or quantify off-label use. A DUS is also the natural home for PQA/CMS-style adherence measures (PDC ≥80%) reported descriptively across plans.\n\n**When NOT to use — and when it is actively misleading or dangerous**\n- **As effectiveness or safety evidence.** Comparing crude event rates between users of two drugs in a DUS is confounded by indication and channeling; presenting such a comparison as a treatment effect is the dangerous failure mode. 
If the question is comparative, use an active-comparator new-user cohort or a self-controlled design — not a DUS.\n- **Trend interpretation across a coding or formulary break.** A \"drop\" in utilization that coincides with an ICD-10 transition, a new ATC/DDD assignment, a formulary tier change, or a database vendor switch is a measurement artifact, not a behavior change. Never narrate a trend without auditing denominator and coding stability over the window.\n- **DDD-based dosing inferences in populations where DDD is wrong.** In pediatrics, renal/hepatic impairment, oncology, and titrated drugs, DDD/1000/day does not reflect treated patients or true exposure; reporting \"X% of the population on a therapeutic dose\" from DDDs there is misleading.\n- **Prevalence/persistence on data with unobservable enrollment.** Computing \"new use\" or \"1-year persistence\" without a continuous-enrollment requirement counts gaps in observation as drug-free time or as discontinuation — fabricating both incident users and non-persistence.\n- **Unstable small-area or subgroup rates.** Stratified DDD/1000/day or PIM rates in small denominators are noisy; presenting ranked league tables of providers/regions without stability checks invites spurious \"outliers.\"\n\n**Data-source operational depth**\n- **Administrative claims (FFS):** The default substrate. Exposure = the pharmacy claim (`fill_date` + `days_supply` + NDC mapped to ATC; quantity for DDD conversion). Numerator = treated persons or dispensings; denominator = continuously enrolled person-time. Failure modes: **Medicare Advantage (MA-only) person-time lacks fee-for-service claims**, so MA enrollees appear as non-users and as artificially \"new\" users when they switch to FFS — restrict to enrollees with Part A/B/D (or commercial medical+pharmacy) and exclude MA-only spans, or you will undercount use and inflate incidence. 
90-day mail-order and stockpiling distort `days_supply`-based duration/PDC; free samples and inpatient-administered drugs are invisible in pharmacy claims, undercounting initiation. Cash/discount-card fills (insulin, generics) bypass adjudication entirely.\n- **EHR:** Captures *prescribing/ordering* (the clinical decision) but not whether the patient filled or took the drug — primary non-adherence (an order never dispensed) is only visible with linked pharmacy data. Strong for indication, dose, labs, and the numerator's clinical context; weak for the denominator because care delivered outside the system is missing and patients who leave are differentially lost. Medication-reconciliation and home-med lists are inconsistently structured. Use EHR for prescribing-quality indicators, claims for filled-utilization volume, and linked data for the full prescribe→fill→persist cascade.\n- **Registry:** Strong for a well-defined disease denominator, severity, and indication, but typically incomplete for full pharmacy exposure; link to claims for dispensing and to a death/disenrollment index so that person-time and discontinuation are not confounded with loss to follow-up.\n- **Linked claims–EHR–vital records:** The richest substrate (orders + fills + clinical context + reliable censoring), but linkage selects the linkable subset and introduces order/fill/service-date discrepancies that must be reconciled before defining incident use or duration. **Differential competing risks by exposure** matter even descriptively: in elderly claims, persistence/PDC denominators must treat death and disenrollment as competing events, or a drug used in sicker patients will look falsely \"non-persistent\" simply because those patients die during the supply.\n\n**Worked claims example.** Question: among commercially insured + Medicare-FFS adults, what is the 2023 *incidence of metformin initiation* and the *1-year persistence* of initiators? 
(1) Denominator hygiene: keep only persons with ≥365 days of continuous medical + pharmacy enrollment before their 2023 at-risk time begins, and **exclude any MA-only person-time** (FFS claims unobservable there). (2) Identify dispensings: pharmacy claims with an NDC mapping to ATC A10BA02 (metformin), each with `fill_date` and `days_supply`. (3) Incident use: a person's first 2023 metformin `fill_date` with **no metformin fill in the prior 365-day washout** (continuous enrollment makes this absence real, not missing). Incidence = incident users ÷ at-risk person-years (persons enrolled and metformin-free at the start of their at-risk window). (4) Persistence/PDC: for each initiator, build an exposure timeline by stitching consecutive fills (`fill_date` → `fill_date + days_supply`), allowing a pre-specified grace period (e.g., 30 days) to bridge refill timing; **discontinuation** = first gap exceeding the grace period; **persistence** = no such gap before day 365. PDC = covered days ÷ 365 (cap stockpiled overlap at 1 day of coverage per calendar day). (5) Censoring and competing risks: censor at end of data, and treat death and disenrollment as competing events rather than as discontinuation, truncating the PDC denominator at the event date, so that mortality in a sicker subgroup is not miscounted as non-persistence. (6) Report incidence per 1,000 person-years and persistence as a Kaplan–Meier (or cumulative-incidence) curve with CIs, plus sensitivity analyses on washout length (180 vs 365 d) and grace period (15/30/60 d). 
Note this is purely descriptive: it characterizes *use*, and any comparison of metformin vs an alternative on a clinical outcome would require a separate analytic (active-comparator new-user) design.\n\n**Interpreting the output**\n\nAn individual-level DUS for patient Maria shows 3 metformin fills with 90 total days_supply, a 4-day\ngap between the second and third fill (March 16–19), and a switch to sitagliptin beginning May 1.\nAt the system level, the 2023 incidence of metformin initiation is reported per 1,000 person-years,\nand 1-year persistence is presented as a Kaplan–Meier (or cumulative-incidence) curve that treats\ndeath and disenrollment as competing events.\n\n*(1) Formal interpretation.* Maria's 3-fill, 90-day record is an individual exposure trajectory, not\nan effectiveness or safety observation. The 4-day gap falls within a pre-specified 30-day grace period\nand does not count as discontinuation, but its days are still uncovered: her PDC over the 94 days from\nthe first fill to the end of the third fill's supply (Jan 15 to Apr 18) is 90/94 ≈ 0.96. The May 1 switch\nto sitagliptin is a class switch observable in pharmacy claims; the clinical reason (tolerability,\nglycemic inadequacy) is not captured. Population incidence is the rate per 1,000 at-risk person-years\nin an FFS-enrolled, washout-confirmed cohort; MA-only members are excluded because absent FFS claims\ncannot confirm drug-free washout. No causal inference is made: the DUS reports patterns of use, not\nwhat the drug does.\n\n*(2) Practical interpretation.* The 4-day gap and subsequent class switch flag a potential adherence or\ntolerability signal that a stewardship team could investigate further. The system-level persistence\ncurve is the input a payer needs to project the proportion of initiators still on therapy at 6 and 12\nmonths, a key parameter for budget-impact models. If the question becomes whether the switch improved\noutcomes, an active-comparator new-user cohort study is required, not the DUS.",
    "primary_category": "Descriptive_Epidemiology",
    "tags": [
      "drug-utilization",
      "descriptive-epidemiology",
      "pharmacoepidemiology",
      "prescribing-patterns",
      "ddd",
      "atc",
      "adherence",
      "prevalence-incidence-of-use",
      "prescribing-quality"
    ],
    "applies_to_study_types": [
      "drug_utilization"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1002/pds.5490",
        "url": "https://doi.org/10.1002/pds.5490",
        "citation_text": "Rasmussen L, Wettermark B, Steinke D, Pottegård A. Core concepts in pharmacoepidemiology: Measures of drug utilization based on individual-level drug dispensing data. Pharmacoepidemiology and Drug Safety. 2022;31(10):1015-1026.",
        "year": 2022,
        "authors_short": "Rasmussen et al.",
        "notes": "Canonical methods statement defining and operationalizing individual-level drug-utilization measures (prevalence, incidence of use, treatment duration, adherence, drug combinations, prescribing-quality indicators) from dispensing data."
      },
      {
        "role": "explain",
        "doi": "10.1002/pds.1063",
        "url": "https://doi.org/10.1002/pds.1063",
        "citation_text": "Hallas J. Drug utilization statistics for individual-level pharmacy dispensing data. Pharmacoepidemiology and Drug Safety. 2005;14(7):455-463.",
        "year": 2005,
        "authors_short": "Hallas",
        "notes": "Foundational treatment of how to derive utilization statistics (incident/prevalent users, treatment episodes, waiting-time distributions) from individual-level dispensing records."
      },
      {
        "role": "demonstrate",
        "doi": "10.1002/pds.5776",
        "url": "https://doi.org/10.1002/pds.5776",
        "citation_text": "Gillies MB, Schaffer AL, Buckley NA, et al. Medicine utilization studies in Australian individual-level dispensing data: A blinded, multi-center replicated analysis. Pharmacoepidemiology and Drug Safety. 2024;33(1):e5776.",
        "year": 2024,
        "authors_short": "Gillies et al.",
        "notes": "Multi-center replication showing how operational choices (washout, episode construction, denominators) drive drug-utilization estimates from the same dispensing data — concrete evidence for pre-specifying definitions."
      }
    ],
    "plain_language_summary": "A drug utilization study describes how a medicine is actually used by a group of patients in the real world: who takes it, how much they take, how long they stay on it, and how appropriately it is prescribed. It answers \"what is happening with this drug?\" by counting things like the number of prescription fills, the total supply dispensed, and whether patients switch to something else. It deliberately stops there: it never claims the drug caused a good or bad outcome, because that would need a comparison group it does not have. It also can't see medicine that never goes through a pharmacy, like free samples or doses given in the hospital.",
    "key_terms": [
      {
        "term": "days_supply",
        "definition": "How many days one filled prescription is meant to last (for example, a 30-day fill)."
      },
      {
        "term": "fill_date",
        "definition": "The calendar date a patient actually picked up (filled) a prescription at the pharmacy."
      },
      {
        "term": "pharmacy claim",
        "definition": "A single billing record created when a prescription is filled, listing the drug, the date, and the days_supply."
      },
      {
        "term": "switch",
        "definition": "When a patient stops one drug and starts a different drug for the same condition."
      },
      {
        "term": "gap",
        "definition": "A stretch of days when a patient had no medicine on hand because the previous supply ran out before the next fill."
      },
      {
        "term": "descriptive study",
        "definition": "A study that summarizes and counts what is happening, without trying to prove that one thing caused another."
      }
    ],
    "worked_example": {
      "scenario": "Maria is one commercially insured adult who started metformin (a common diabetes pill) in early 2023. We have her pharmacy fill records for the whole year. We are not comparing her to anyone or testing whether metformin worked. We just want to describe her use of the drug: how many times she filled it, how many days of supply she received, whether she ever ran short, and whether she eventually moved to a different medicine.",
      "dataset": {
        "caption": "The raw rows an analyst would actually see in a claims pharmacy table for this one patient.",
        "columns": [
          "person_id",
          "fill_date",
          "drug",
          "days_supply"
        ],
        "rows": [
          [
            2001,
            "2023-01-15",
            "metformin",
            30
          ],
          [
            2001,
            "2023-02-14",
            "metformin",
            30
          ],
          [
            2001,
            "2023-03-20",
            "metformin",
            30
          ],
          [
            2001,
            "2023-05-01",
            "sitagliptin",
            30
          ]
        ]
      },
      "steps": [
        "Each 30-day fill covers its fill_date plus the next 29 days, so Fill A covers Jan 15-Feb 13 and Fill B covers Feb 14-Mar 15.",
        "Fill B's supply runs out on Mar 15, but the next metformin fill (Fill C) isn't picked up until Mar 20, leaving a 4-day gap (Mar 16-19) with no metformin on hand.",
        "Fill C covers Mar 20-Apr 18; after that there are no more metformin fills, so her metformin use is 3 fills totaling 30 + 30 + 30 = 90 days_supply.",
        "On May 1 she fills sitagliptin, a different diabetes drug, with no further metformin: this is a switch off metformin onto a new medicine.",
        "The whole summary is descriptive. It says what Maria did with the drug; it does not say metformin caused her to switch or that one drug is better."
      ],
      "result": "Utilization summary for person 2001: 3 metformin fills, 90 total days_supply, one 4-day coverage gap (Mar 16-19), and a switch to sitagliptin on 2023-05-01.",
      "timeline_spec": {
        "title": "One patient's 2023 metformin fills, with a coverage gap and a switch (descriptive utilization)",
        "window": {
          "start": "2023-01-01",
          "end": "2023-12-31",
          "label": "Observation window: full 2023 calendar year"
        },
        "events": [
          {
            "label": "Fill A (metformin)",
            "start": "2023-01-15",
            "length_days": 30,
            "quantity": "30 days_supply"
          },
          {
            "label": "Fill B (metformin)",
            "start": "2023-02-14",
            "length_days": 30,
            "quantity": "30 days_supply"
          },
          {
            "label": "Fill C (metformin)",
            "start": "2023-03-20",
            "length_days": 30,
            "quantity": "30 days_supply"
          },
          {
            "label": "Fill D (sitagliptin - switch)",
            "start": "2023-05-01",
            "length_days": 30,
            "quantity": "30 days_supply"
          }
        ],
        "spans": [
          {
            "kind": "exposed",
            "start": "2023-01-15",
            "end": "2023-03-15",
            "label": "60 covered days (Fills A + B, back-to-back)"
          },
          {
            "kind": "gap",
            "start": "2023-03-16",
            "end": "2023-03-19",
            "label": "4-day gap (no metformin on hand)"
          },
          {
            "kind": "exposed",
            "start": "2023-03-20",
            "end": "2023-04-18",
            "label": "30 covered days (Fill C)"
          }
        ],
        "result": {
          "label": "3 metformin fills, 90 total days_supply, one 4-day gap, switch to sitagliptin on 2023-05-01",
          "value": 90
        }
      }
    },
    "prerequisites": [
      "descriptive-epidemiology-rwe",
      "exposure-episode-construction-rwe",
      "continuous-enrollment-observable-time-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Aggregate ATC/DDD utilization (DDD/1000/day)",
        "description": "Volume of a drug expressed in WHO defined daily doses per 1,000 inhabitants (or enrollees) per day, the internationally standardized metric for system-level utilization and cross-country/trend comparison.",
        "edge_cases": [
          "DDD is an assumed average maintenance adult dose for the main indication; it misrepresents true exposure in pediatrics, renal/hepatic dosing, oncology, and titrated drugs.",
          "WHO reassigns DDDs periodically; a step in a trend can be an ATC/DDD revision rather than a behavior change."
        ],
        "data_source_notes": "claims: convert dispensed quantity x strength to DDDs via NDC->ATC mapping; denominator = enrolled person-time, not census population, so report DDD/1000 enrollees/day to avoid mixing the two denominators."
      },
      {
        "name": "Incidence of use (new-user / waiting-time approach)",
        "description": "First dispensing after a drug-free washout (incident users) per at-risk person-time, or Hallas-style waiting-time distribution to estimate the incident vs prevalent split and a data-driven washout length.",
        "edge_cases": [
          "Too-short washout misclassifies returning chronic users as incident; too-long washout shrinks the at-risk pool and the usable study window.",
          "Seasonal or as-needed drugs have long real gaps; a fixed washout over- or under-counts initiation."
        ],
        "data_source_notes": "claims: require continuous enrollment across the full washout so the absence of prior fills is observed, not missing; exclude MA-only person-time where FFS pharmacy claims are unavailable."
      },
      {
        "name": "Treatment duration, persistence, and adherence (PDC/MPR)",
        "description": "Per-patient exposure timelines built by stitching fills (fill_date to fill_date+days_supply) with a grace period; outputs persistence (time to first excess gap) and PDC/MPR over a fixed denominator window.",
        "edge_cases": [
          "Stockpiling/oversupply inflates MPR above 1.0 (use PDC, which caps coverage at one day per calendar day).",
          "Death and disenrollment are competing events; ignoring them counts a deceased patient's uncovered days as non-persistence."
        ],
        "data_source_notes": "claims: 90-day mail-order and sample fills distort days_supply; PQA/CMS adherence measures use PDC>=0.8 over the measurement year with explicit handling of hospitalizations and partial fills."
      },
      {
        "name": "Prescribing-quality indicators (DU90%, PIM rates, polypharmacy)",
        "description": "Quality-of-use metrics such as DU90% (the number of drugs accounting for 90% of utilization, with guideline adherence assessed within that segment), potentially inappropriate medication rates (Beers/STOPP), polypharmacy counts, and concurrent-interaction prevalence.",
        "edge_cases": [
          "Indicator validity depends on an up-to-date, jurisdiction-appropriate criteria set (Beers vs STOPP/START differ).",
          "Small-denominator provider/region profiling produces unstable rankings; report with shrinkage or stability checks."
        ],
        "data_source_notes": "ehr: best for the clinical context behind quality indicators (indication, labs); claims: best for counting concurrent dispensings and computing population-level rates."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Active-comparator new-user cohort (analytic/causal design)",
        "pros_of_this": "Cheaper, faster, comparator-free; directly answers descriptive surveillance, market-access, stewardship, and quality questions about volume, patterns, and appropriateness of use.",
        "cons_of_this": "Makes no causal claim; crude between-drug outcome comparisons in a DUS are confounded by indication and must not be read as effectiveness or safety evidence.",
        "when_to_prefer": "When the question is genuinely descriptive (how much, by whom, how long, how appropriately) rather than 'does the drug change the outcome?'"
      },
      {
        "compared_to": "Aggregate ATC/DDD utilization metrics",
        "pros_of_this": "Individual-level measures describe duration, adherence, switching, polypharmacy, and per-patient burden that aggregate DDDs cannot.",
        "cons_of_this": "Require longitudinal, person-linked data with observable enrollment, and are more sensitive to washout and episode-construction choices; less internationally comparable than DDD/1000/day.",
        "when_to_prefer": "For adherence, persistence, duration, and patient-burden questions, or as the precursor to an analytic cohort."
      },
      {
        "compared_to": "Prevalence-based utilization summaries",
        "pros_of_this": "Incidence (new-user) isolates initiation and diffusion, avoids domination by long-term prevalent users, and is the prerequisite for persistence and any downstream causal analysis.",
        "cons_of_this": "Requires a drug-free washout and continuous look-back, shrinking the usable cohort and window relative to simple period prevalence.",
        "when_to_prefer": "When initiation, uptake/diffusion, or new-user follow-up is the goal rather than total treated burden."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Exposure = pharmacy claim (NDC->ATC + fill_date + days_supply + quantity/strength for DDD). Require continuous medical+pharmacy enrollment over the look-back so absence of fills is observed, not missing; exclude Medicare Advantage-only person-time where FFS claims are unavailable. Watch 90-day mail-order, stockpiling, samples, and inpatient-administered drugs distorting days_supply and undercounting use.",
      "ehr": "Captures prescribing/ordering, not filling or taking; primary non-adherence is invisible without linked pharmacy data. Best for indication, dose, and the clinical context of prescribing-quality indicators; denominator is incomplete because out-of-system care is missing and leavers are differentially lost.",
      "registry": "Strong, well-defined disease denominator and indication/severity but usually incomplete pharmacy exposure; link to claims for dispensing and to a death/disenrollment index so person-time is not confounded with loss to follow-up.",
      "linked": "Orders + fills + clinical context + reliable censoring is the richest substrate, but linkage selects the linkable subset and creates order/fill/service-date discrepancies that must be reconciled before defining incident use or duration."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\nimport numpy as np\n\nWASHOUT_DAYS = 365\nGRACE_DAYS = 30\nTARGET_ATC = \"A10BA02\"  # metformin\n\ndef _enrolled_days(enroll: pd.DataFrame, start, end) -> pd.Series:\n    # Continuous, FFS-observable enrolled days per person within [start, end] (exclude MA-only spans).\n    e = enroll[~enroll[\"ma_only\"]].copy()\n    lo = e[\"enroll_start\"].clip(lower=start)\n    hi = e[\"enroll_end\"].clip(upper=end)\n    e[\"days\"] = (hi - lo).dt.days.clip(lower=0)\n    return e.groupby(\"person_id\")[\"days\"].sum()\n\ndef period_prevalence(rx, enroll, atc, start, end):\n    # Prevalence of use = persons with >=1 qualifying fill in [start, end] / persons with any enrolled time in window.\n    used = rx[(rx[\"atc\"] == atc) & rx[\"fill_date\"].between(start, end)][\"person_id\"].nunique()\n    denom = (_enrolled_days(enroll, start, end) > 0).sum()\n    return used / denom\n\ndef incidence_of_use(rx, enroll, atc, start, end):\n    # Incident user = first qualifying fill in [start, end] with no fill of the same ATC in the prior WASHOUT_DAYS,\n    # and continuous FFS-observable enrollment across that washout. 
Rate per 1000 enrolled person-years (approximates at-risk time).\n    drug = rx[rx[\"atc\"] == atc].sort_values([\"person_id\", \"fill_date\"])\n    first_in = drug[drug[\"fill_date\"].between(start, end)].groupby(\"person_id\")[\"fill_date\"].min()\n    prior = drug.merge(first_in.rename(\"idx\"), on=\"person_id\")\n    had_prior = prior[(prior[\"fill_date\"] < prior[\"idx\"]) &\n                      (prior[\"fill_date\"] >= prior[\"idx\"] - pd.Timedelta(days=WASHOUT_DAYS))][\"person_id\"].unique()\n    cand = first_in[~first_in.index.isin(had_prior)]\n    # Washout must be observed per person: one non-MA span has to cover [idx - WASHOUT_DAYS, idx].\n    # (Summing enrolled days over the whole window would pass people whose enrollment began after the washout started.)\n    spans = enroll[~enroll[\"ma_only\"]].merge(cand.rename(\"idx\").reset_index(), on=\"person_id\")\n    wash_ok = spans[(spans[\"enroll_start\"] <= spans[\"idx\"] - pd.Timedelta(days=WASHOUT_DAYS)) &\n                    (spans[\"enroll_end\"] >= spans[\"idx\"])][\"person_id\"].unique()\n    incident = cand[cand.index.isin(wash_ok)]\n    pyears = _enrolled_days(enroll, start, end).sum() / 365.25\n    return 1000.0 * len(incident) / pyears, incident\n\ndef ddd_per_1000_per_day(rx, enroll, atc, start, end):\n    # WHO aggregate metric: total DDDs dispensed in window / (enrolled person-days) * 1000.\n    ddds = rx[(rx[\"atc\"] == atc) & rx[\"fill_date\"].between(start, end)][\"ddd_dispensed\"].sum()\n    person_days = _enrolled_days(enroll, start, end).sum()\n    return 1000.0 * ddds / person_days\n\ndef one_year_pdc(rx, initiators_idx, atc):\n    # PDC over 365 days from each initiator's index date; coverage capped at one day per calendar day (no stockpile credit).\n    drug = rx[rx[\"atc\"] == atc]\n    out = {}\n    for pid, idx in initiators_idx.items():\n        fills = drug[drug[\"person_id\"] == pid].sort_values(\"fill_date\")\n        fills = fills[fills[\"fill_date\"].between(idx, idx + pd.Timedelta(days=364))]\n        covered = np.zeros(365, dtype=bool)\n        for fd, ds in zip(fills[\"fill_date\"], fills[\"days_supply\"]):\n            s = (fd - idx).days\n            covered[max(s, 0):min(s + int(ds), 365)] = True  # union -> caps overlap (stockpiling)\n        out[pid] = covered.mean()\n    return pd.Series(out, name=\"pdc\")",
        "description": "Core drug-utilization measures from claims-style dispensing data. Required inputs (cleaned, de-duplicated):\n  rx     : pharmacy dispensings -> person_id, fill_date (datetime), atc (str), ndc, days_supply (int), ddd_dispensed (float)\n           # ddd_dispensed = quantity * strength / WHO_DDD for that ATC, precomputed during data management\n  enroll : continuous enrollment -> person_id, enroll_start, enroll_end, ma_only (bool)  # ma_only spans lack FFS claims\nComputes period prevalence of use, incidence of new use (365d washout), DDD/1000 enrollees/day, and 1-year PDC.\nAll measures are DESCRIPTIVE; no causal contrast is implied.",
        "dependencies": [
          "pandas",
          "numpy"
        ],
        "source_citations": [
          "rasmussen-2022",
          "hallas-2005"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\nWASHOUT_DAYS <- 365L\nTARGET_ATC <- \"A10BA02\"\n\nenrolled_days <- function(enroll, start, end) {\n  e <- enroll[ma_only == FALSE]\n  e[, days := pmax(0L, as.integer(pmin(enroll_end, end) - pmax(enroll_start, start)))]\n  e[, .(days = sum(days)), by = person_id]\n}\n\nperiod_prevalence <- function(rx, enroll, atc, start, end) {\n  target <- atc  # local alias: inside rx[...] the bare name atc resolves to the column\n  used  <- uniqueN(rx[atc == target & fill_date %between% c(start, end), person_id])\n  denom <- nrow(enrolled_days(enroll, start, end)[days > 0])\n  used / denom\n}\n\nincidence_of_use <- function(rx, enroll, atc, start, end) {\n  target <- atc\n  drug <- rx[atc == target][order(person_id, fill_date)]\n  first_in <- drug[fill_date %between% c(start, end),\n                   .(idx = min(fill_date)), by = person_id]\n  prior <- merge(drug, first_in, by = \"person_id\")\n  had_prior <- unique(prior[fill_date < idx &\n                            fill_date >= idx - WASHOUT_DAYS, person_id])\n  cand <- first_in[!person_id %in% had_prior]  # %in% also works for integer ids\n  # Washout must be observed per person: one non-MA span has to cover [idx - WASHOUT_DAYS, idx].\n  spans <- merge(enroll[ma_only == FALSE], cand, by = \"person_id\")\n  wash <- unique(spans[enroll_start <= idx - WASHOUT_DAYS & enroll_end >= idx, person_id])\n  incident <- cand[person_id %in% wash]\n  pyears <- sum(enrolled_days(enroll, start, end)$days) / 365.25\n  list(rate_per_1000_py = 1000 * nrow(incident) / pyears, initiators = incident)\n}\n\nddd_per_1000_per_day <- function(rx, enroll, atc, start, end) {\n  target <- atc\n  ddds <- rx[atc == target & fill_date %between% c(start, end), sum(ddd_dispensed)]\n  pdays <- sum(enrolled_days(enroll, start, end)$days)\n  1000 * ddds / pdays\n}",
        "description": "Core drug-utilization measures with data.table. Inputs mirror the Python version:\n  rx     : person_id, fill_date (Date), atc, ndc, days_supply (int), ddd_dispensed (double)\n  enroll : person_id, enroll_start (Date), enroll_end (Date), ma_only (logical)  # ma_only spans lack FFS claims\nReturns period prevalence, incidence of new use (365d washout, per 1000 PY), and DDD/1000 enrollees/day.",
        "dependencies": [
          "data.table"
        ],
        "source_citations": [
          "rasmussen-2022",
          "hallas-2005"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "%let washout = 365;\n%let atc = A10BA02;            /* metformin */\n%let win_start = '01JAN2023'd;\n%let win_end   = '31DEC2023'd;\n\n/* FFS-observable enrolled person-days in the window (exclude MA-only spans). */\nproc sql;\n  create table pdays as\n  select person_id,\n         sum( max(0, (min(enroll_end, &win_end) - max(enroll_start, &win_start)) ) ) as days\n  from work.enroll\n  where ma_only = 0\n  group by person_id;\nquit;\n\n/* Period prevalence of use = users in window / persons with any enrolled time.\n   PROC SQL needs a FROM clause; SELECT DISTINCT over pdays collapses the scalar result to one row. */\nproc sql;\n  select distinct\n         (select count(distinct person_id) from work.rx\n            where atc = \"&atc\" and &win_start <= fill_date <= &win_end)\n         / (select count(*) from pdays where days > 0) as period_prevalence\n  from pdays;\nquit;\n\n/* Incident users: first 2023 fill with no same-ATC fill in the prior washout, washout fully enrolled. */\nproc sql;\n  create table first_in as\n  select person_id, min(fill_date) as idx format=date9.\n  from work.rx where atc = \"&atc\" and &win_start <= fill_date <= &win_end\n  group by person_id;\n\n  create table incident as\n  select f.person_id, f.idx\n  from first_in f\n  where not exists (select 1 from work.rx p\n                    where p.person_id = f.person_id and p.atc = \"&atc\"\n                      and p.fill_date <  f.idx\n                      and p.fill_date >= f.idx - &washout)\n    and exists      (select 1 from work.enroll e\n                    where e.person_id = f.person_id and e.ma_only = 0\n                      and e.enroll_start <= f.idx - &washout\n                      and e.enroll_end   >= f.idx);\nquit;\n\n/* Incidence per 1000 person-years and aggregate DDD/1000 enrollees/day. */\nproc sql;\n  select distinct\n         1000 * (select count(*) from incident)\n              / ( (select sum(days) from pdays) / 365.25 )            as incidence_per_1000_py,\n         1000 * (select sum(ddd_dispensed) from work.rx\n                   where atc = \"&atc\" and &win_start <= fill_date <= &win_end)\n              / (select sum(days) from pdays)                          as ddd_per_1000_per_day\n  from pdays;\nquit;\n\n/* Optional: age-standardize utilization rates to a reference population.\n   Assumed inputs (not built above): work.rx_by_age with age_group, users, person_years;\n   work.std_pop with age_group and ref_total (a hypothetical column name for the standard population size). */\nproc stdrate data=work.rx_by_age refdata=work.std_pop\n             method=direct stat=rate(mult=1000);\n  population event=users total=person_years;\n  reference  total=ref_total;\n  strata age_group;\nrun;",
        "description": "Core drug-utilization measures in SAS from claims-style dispensing data. Required inputs (post data-management):\n  work.rx     : person_id, fill_date, atc, ndc, days_supply, ddd_dispensed (qty*strength/WHO_DDD)\n  work.enroll : person_id, enroll_start, enroll_end, ma_only (0/1)  # ma_only spans lack FFS claims\nProduces period prevalence, incidence of new use (365d washout, per 1000 PY) via PROC SQL/data step, and\nDDD/1000 enrollees/day. PROC STDRATE can age-standardize utilization rates when a standard population is supplied.",
        "dependencies": [],
        "source_citations": [
          "rasmussen-2022",
          "hallas-2005"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": "drug-utilization-timeline.svg",
        "mermaid": null,
        "caption": "Data-grounded worked-example timeline (beginner layer), drawn to scale from worked_example.timeline_spec so the picture matches the numbers.",
        "alt_text": "Timeline for the worked example of drug-utilization.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Q[Question type?] -->|How much / by whom / how long / how appropriately| DUS[Drug utilization study DESCRIPTIVE]\n  Q -->|Does the drug change the outcome?| CAUS[Analytic / causal design - NOT a DUS]\n  DUS --> AGG{Aggregate or per-patient?}\n  AGG -->|System volume, cross-country trend| DDD[ATC/DDD per 1000 per day]\n  AGG -->|Duration, adherence, polypharmacy| IND[Individual-level dispensing measures]\n  IND --> PI{Prevalent or incident?}\n  PI -->|Total treated burden| PREV[Period prevalence of use]\n  PI -->|Initiation / diffusion / persistence| INC[Incidence of use: washout + continuous enrollment]\n  INC --> EPI[Build exposure episodes -> persistence / PDC with grace period + competing risks]",
        "caption": "Decision logic for a drug utilization study - first separate descriptive from causal questions, then choose the aggregate (ATC/DDD) vs individual-level measure and the prevalent vs incident framing.",
        "alt_text": "Flowchart routing a question to a descriptive drug-utilization study versus an analytic design, then to aggregate DDD versus individual-level measures and to prevalence versus incidence of use.",
        "source_type": "illustrative",
        "source_citations": [
          "rasmussen-2022"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  RX[Pharmacy dispensings<br/>NDC to ATC, fill_date, days_supply, ddd_dispensed] --> DM[Data management:<br/>map NDC to ATC, exclude MA-only person-time]\n  ENR[Continuous enrollment spans] --> DM\n  DM --> NUM[Numerator: treated persons / dispensings / DDDs]\n  DM --> DEN[Denominator: FFS-observable enrolled person-time]\n  NUM --> M1[Prevalence of use]\n  NUM --> M2[Incidence of new use<br/>365d washout]\n  NUM --> M3[DDD/1000 enrollees/day]\n  M2 --> M4[Episodes -> persistence / PDC<br/>grace period, death/disenroll as competing risks]\n  DEN --> M1\n  DEN --> M2\n  DEN --> M3",
        "caption": "Data flow from raw dispensings and enrollment to the four core utilization measures, highlighting the numerator/denominator construction and the MA-only exclusion that protects against missingness-driven artifacts.",
        "alt_text": "Data-flow diagram from pharmacy dispensings and enrollment spans through data management into prevalence, incidence, DDD per 1000 per day, and persistence/PDC measures.",
        "source_type": "illustrative",
        "source_citations": [
          "rasmussen-2022"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "see_also",
        "target_slug": "active-comparator-new-user",
        "notes": "A drug utilization study characterizes use descriptively and often profiles the treated population before an active-comparator new-user cohort is built to answer the comparative-effect question a DUS cannot."
      },
      {
        "relation_type": "requires",
        "target_slug": "new-user-design",
        "notes": "Incidence-of-use and persistence measures depend on the new-user (incident-user) restriction - a drug-free washout over continuously observed enrollment - to define initiation validly."
      },
      {
        "relation_type": "see_also",
        "target_slug": "immortal-time-bias-handling",
        "notes": "Even descriptive procedure/initiation studies can embed immortal time if person-time before the first qualifying fill is misattributed; align time at the index dispensing."
      }
    ],
    "aliases": [
      "drug utilization study",
      "drug utilisation study",
      "DUS",
      "drug utilization research",
      "medicine utilization study",
      "prescribing pattern study"
    ],
    "complexity": "foundational",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta",
      "pqa-cms"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "e-value-sensitivity-analysis",
    "name": "E-value Sensitivity Analysis",
    "short_definition": "A quantitative-bias-analysis metric giving the minimum strength of association (on the risk-ratio scale) that an unmeasured confounder would need with both the exposure and the outcome, conditional on the measured covariates, to fully explain away an observed effect estimate or to move its confidence interval to the null.",
    "long_description": "The **E-value** (VanderWeele & Ding, 2017) is a one-number summary of how robust an observational association is to\nunmeasured confounding. It is the minimum strength of association — expressed as a risk ratio — that a hypothetical\nunmeasured confounder would need to have *both* with the exposure and with the outcome, over and above the measured\ncovariates already adjusted for, to reduce the observed effect to the null (the point-estimate E-value) or to move the\nconfidence-interval limit nearest the null across the null (the CI E-value). It is computed entirely from the reported\neffect estimate and its interval; it requires no new data, no specification of which confounder is missing, and no\ndistributional assumptions about that confounder. It rests on the Ding & VanderWeele (2016) bounding factor, which shows\nthat the maximum factor by which confounding can inflate (or deflate) an observed risk ratio is\nB = (RR_AU * RR_UY) / (RR_AU + RR_UY − 1), and inverts that bound to the symmetric solution\nE = RR_obs + sqrt(RR_obs * (RR_obs − 1)) for RR_obs ≥ 1 (apply to 1/RR_obs for protective effects).\n\n**Core conceptual distinction.** The E-value is a *bias-analysis descriptor*, not an estimator and not a bias correction.\nIt does not change the point estimate, does not identify the missing confounder, and does not assert that such a confounder\nexists. It answers exactly one counterfactual question: \"Given everything I already adjusted for, how strong would the\n*worst-case* remaining confounder have to be to overturn this result?\" Two further distinctions matter. (1) *Point-estimate\nvs CI E-value*: the point E-value bounds the estimate, but the policy-relevant claim (\"the CI no longer excludes the null\")\nuses the CI E-value, which is always smaller and always the more honest robustness number — report both. (2) *Scale of the\ninput*: the bounding factor is defined for the risk ratio. 
A hazard ratio for a rare outcome (or short follow-up) and an odds ratio for a rare outcome are used\ndirectly as approximate RRs; an odds ratio for a common outcome is converted with the approximation RR ≈ sqrt(OR), and a\ncommon-outcome hazard ratio needs its own conversion, before the bound is applied; risk differences and continuous outcomes\nrequire their own transformations. Feeding a non-rare OR or\nHR in as if it were an RR overstates robustness.\n\n**Pros, cons, and trade-offs.**\n- **vs negative-control outcomes/exposures:** The E-value gives a *quantitative bound* that any reader can compute and\n  interpret without naming a specific confounder, whereas negative controls *detect the presence and direction* of residual\n  bias empirically but require a valid control variable. The E-value's cost is that it is worst-case and uninformative\n  about whether a confounder that strong is *plausible*; it cannot, by itself, tell you the result is biased. **Prefer the\n  E-value** as the routine, always-reportable summary; **prefer negative controls** when you can construct credible ones and\n  want empirical evidence (not a bound) about residual confounding. They are complements, not substitutes.\n- **vs full probabilistic / Monte-Carlo quantitative bias analysis (QBA, e.g. the Lash/Fox/Greenland array and\n  bias-parameter approach):** The E-value needs no bias parameters and produces one transparent number; full QBA produces a\n  *distribution* of bias-adjusted estimates under explicit, defensible assumptions about the prevalence and strength of the\n  confounder. **Prefer the E-value** for communicability and as a screen; **prefer full QBA** when stakeholders need a\n  corrected estimate with uncertainty, or when external information about the likely confounder exists (e.g., a validation\n  substudy or a literature RR for the suspected confounder).\n- **vs empirical calibration:** Empirical calibration uses a large set of negative-control associations to recalibrate\n  p-values and intervals for systematic error; it is data-hungry and study-specific. 
The E-value is universal and\n  distribution-free but addresses only confounding magnitude, not the full error distribution.\n- **Critical-appraisal caveat (Sjölander, 2022; Ioannidis and others):** the E-value is neither uniformly \"optimistic\" nor\n  \"pessimistic.\" Because it is reported on the RR scale and assumes a single binary confounder at its worst, it can be\n  misread as a high bar when the true number of weak confounders acting jointly is what matters. Do not interpret a large\n  E-value as proof of causation, nor a small one as proof of bias.\n\n**When to use.** Report an E-value alongside *every* primary and key secondary comparative estimate from an observational\nRWE analysis once measured confounding has been controlled (PS matching, IPTW, overlap weighting, high-dimensional PS, or\nmultivariable regression). It is most useful where confounding by indication is the chief residual threat, where a single\nunmeasured factor (frailty, disease severity, socioeconomic status, smoking) is the obvious candidate, and where decision-\nmakers (FDA, EMA, HTA bodies) expect a transparent robustness statement. Anchor the interpretation: compare the E-value to\nthe observed RRs of the *strongest measured* confounders — if the E-value is smaller than associations you already\nadjusted for, the result is fragile.\n\n**When NOT to use — and when it is actively misleading or dangerous.**\n- **As a substitute for design.** An E-value cannot rescue a study with no active comparator, immortal time, depletion of\n  susceptibles, or a prevalent-user cohort. Reporting a large E-value on a structurally biased estimate launders bad design\n  into false reassurance — the single most dangerous misuse.\n- **When the dominant threat is not confounding.** The bounding factor addresses unmeasured *confounding* only. 
It says\n  nothing about selection bias, differential outcome misclassification, immortal-time bias, or informative censoring, yet a\n  reported E-value invites readers to believe \"robustness\" has been demonstrated. Never let an E-value stand in for\n  addressing those biases directly.\n- **On the wrong scale.** Applying the RR-scale formula to a non-rare OR or a non-rare HR inflates the E-value and\n  overstates robustness. Convert first.\n- **When the CI already includes the null.** The CI E-value is then 1 by construction — there is nothing to explain away —\n  and reporting it as evidence of anything is meaningless.\n- **As a threshold.** There is no \"E-value > X means causal\" rule. Treating a cutoff as a pass/fail gate misrepresents a\n  continuous worst-case descriptor as a hypothesis test.\n\n**Data-source operational depth.**\n- **Claims (FFS vs MA):** The E-value is computed from the adjusted estimate, so its credibility inherits every limitation\n  of the upstream claims cohort. State explicitly which confounders were measured (comorbidities via ICD, prior utilization,\n  concomitant NDC classes) and which were not (frailty, BMI, smoking, SES, disease severity). A standing failure mode:\n  Medicare Advantage enrollees lack fee-for-service claims, so \"no prior fill\" can be missingness rather than a true\n  washout — if MA-only person-time leaks into the cohort, exposure and confounder ascertainment are differentially\n  incomplete and the *input* RR is already biased, making the E-value a bound on the wrong number. Restrict to\n  fully-observable FFS (Parts A/B/D) person-time before trusting any robustness claim.\n- **Differential competing risks in elderly claims:** In older claims populations, the comparator arms can have different\n  mortality (a competing risk for non-fatal outcomes). 
If the primary estimate ignores this, the E-value bounds a\n  cause-specific or naive estimate that is itself distorted; compute the E-value on the estimand you actually report\n  (subdistribution vs cause-specific) and say so.\n- **Immortal time in procedure/initiation studies:** When time zero is misaligned (e.g., follow-up begins at diagnosis but\n  exposure is a later procedure), immortal-time bias contaminates the HR. An E-value on that HR is bias analysis on top of\n  bias — fix the time-zero structure first, then bound.\n- **EHR:** Richer measured covariates (labs, vitals, NLP-derived severity) typically reduce residual confounding and lower\n  the E-value the analysis \"needs,\" but visit-driven, informative-presence capture means sicker patients are seen more,\n  creating selection the E-value does not address. Note residual unmeasured factors (unrecorded lifestyle, out-of-system\n  care).\n- **Registry / linked:** Registries strengthen severity and adjudicated outcomes (smaller plausible residual confounding);\n  linked claims–EHR–vital-records is the strongest substrate but introduces linkage selection. The E-value is reported the\n  same way; what changes is how plausibly large a residual confounder could be, which is the interpretation, not the\n  arithmetic.\n\n**Worked claims example.** Question: incident heart failure with antihypertensive class A vs class B among adults with\nhypertension in a commercial + Medicare FFS database, using an active-comparator new-user cohort. Eligibility requires\n365 days of continuous medical + pharmacy enrollment before the first qualifying fill (FFS-observable only — MA-only\nperson-time excluded so \"no prior fill\" is a real washout), with the arm assigned from the NDC dispensed at time zero and\n`days_supply` used to build on-treatment follow-up. 
After 1:1 propensity-score matching on baseline covariates measured in\nthe [index_date − 365, index_date] window (standardized differences < 0.1), the Cox model gives an adjusted\n**HR = 0.78 (95% CI 0.65–0.93)** for class A vs B. Because incident HF over the follow-up is reasonably rare, treat the HR\nas an approximate RR. The point-estimate **E-value = 1.88** (= 1/0.78 + sqrt((1/0.78)·(1/0.78 − 1))): an unmeasured\nconfounder would have to be associated with both treatment choice and HF by a risk ratio of at least 1.88 *each*, beyond the\nmatched covariates, to fully explain the apparent benefit. The **CI E-value = 1.36** for the upper limit 0.93: a confounder\nassociated by 1.36 with both would suffice to move the interval to include the null. Now anchor it: the strongest measured\nconfounder in the matched cohort, baseline chronic kidney disease, is associated with HF by roughly RR 1.5 — *larger* than\nthe 1.36 CI E-value. The honest reading is that a single residual confounder no stronger than CKD (e.g., frailty or\nsocioeconomic disadvantage that channels prescribing) could erase statistical significance, so the finding is only modestly\nrobust and the manuscript should foreground negative-control outcomes and, ideally, a probabilistic bias analysis rather\nthan rest on the E-value alone. Report both numbers (point 1.88, CI 1.36) with the list of covariates already controlled.\n\n**Interpreting the output**\n\nA study reports RR = 1.8 (95% CI 1.3–2.5). The E-value analysis yields: point-estimate\nE-value = 3.0; confidence-interval E-value (for the lower limit 1.3) ≈ 1.92.\n\n*(1) Formal interpretation.* The point E-value of 3.0 means that an unmeasured confounder\nwould need to be associated with both the exposure and the outcome by a risk ratio of at\nleast 3.0 — jointly — to fully explain away the observed RR of 1.8. 
A confounder with\none association below 3.0 could still do so, but only if its other association were\ncorrespondingly stronger, and neither association can fall below the observed RR of 1.8. The CI E-value of 1.92 is the minimum joint association\nstrength needed to shift the lower confidence limit to 1.0, rendering the result no longer\ndistinguishable from the null at the conventional threshold. The E-value does not state the\nprobability that confounding explains the result; it is a bounding argument about the\nminimum strength required for a single unmeasured binary confounder.\n\n*(2) Practical interpretation.* An E-value is judged against the known confounders in the\ndata. In the antihypertensive example (HR = 0.78, point E-value ≈ 1.88, CI E-value ≈ 1.36),\nthe strongest measured confounder, CKD, is associated with HF by an RR of approximately 1.5,\nwhich exceeds the CI E-value of 1.36. This means a single\nunmeasured confounder no stronger than CKD could erase statistical significance; the finding\nis only modestly robust. An E-value is one input to the overall bias assessment, not a\nsubstitute for a negative-control outcome or a full probabilistic bias analysis.",
    "primary_category": "Bias_Control",
    "tags": [
      "sensitivity_analysis",
      "unmeasured_confounding",
      "quantitative_bias_analysis",
      "e_value",
      "robustness",
      "bounding_factor"
    ],
    "applies_to_study_types": [
      "new_user",
      "active_comparator_new_user",
      "cohort_retrospective",
      "cohort_prospective",
      "claims_analysis",
      "ehr_study",
      "pragmatic_trial"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "linked",
      "registry",
      "primary"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.7326/M16-2607",
        "url": "https://doi.org/10.7326/M16-2607",
        "citation_text": "VanderWeele TJ, Ding P. Sensitivity analysis in observational research: introducing the E-value. Annals of Internal Medicine. 2017;167(4):268-274.",
        "year": 2017,
        "authors_short": "VanderWeele & Ding",
        "notes": "Defines the E-value, gives the point-estimate and confidence-interval versions, and the OR/HR-to-RR conversions; the canonical reference, now cited many thousands of times across epidemiology and pharmacoepidemiology."
      },
      {
        "role": "explain",
        "doi": "10.1097/EDE.0000000000000457",
        "url": "https://doi.org/10.1097/EDE.0000000000000457",
        "citation_text": "Ding P, VanderWeele TJ. Sensitivity analysis without assumptions. Epidemiology. 2016;27(3):368-377.",
        "year": 2016,
        "authors_short": "Ding & VanderWeele",
        "notes": "Derives the assumption-free bounding factor B = (RR_AU·RR_UY)/(RR_AU+RR_UY−1) that the E-value inverts; the theoretical foundation establishing why the bound requires no distributional assumptions about the confounder."
      },
      {
        "role": "explain",
        "doi": "10.1001/jama.2018.21554",
        "url": "https://doi.org/10.1001/jama.2018.21554",
        "citation_text": "Haneuse S, VanderWeele TJ, Arterburn D. Using the E-value to assess the potential effect of unmeasured confounding in observational studies. JAMA. 2019;321(6):602-603.",
        "year": 2019,
        "authors_short": "Haneuse et al.",
        "notes": "JAMA Guide to Statistics and Methods on interpreting E-values; stresses reporting alongside the estimate and against measured-confounder strength, and warns against using it as a substitute for design-based confounding control."
      },
      {
        "role": "demonstrate",
        "doi": "10.1093/aje/kwz063",
        "url": "https://doi.org/10.1093/aje/kwz063",
        "citation_text": "Trinquart L, Erlinger AL, Petersen JM, Fox M, Galea S. Applying the E value to assess the robustness of epidemiologic fields of inquiry to unmeasured confounding. American Journal of Epidemiology. 2019;188(6):1174-1180.",
        "year": 2019,
        "authors_short": "Trinquart et al.",
        "notes": "Applies E-values systematically across many published associations, illustrating both routine reporting and the pitfall of treating the metric as proof that residual confounding is absent."
      },
      {
        "role": "use",
        "doi": "10.1093/ije/dyac018",
        "url": "https://doi.org/10.1093/ije/dyac018",
        "citation_text": "Sjölander A. Are E-values too optimistic or too pessimistic? Both and neither! International Journal of Epidemiology. 2022;51(2):355-363.",
        "year": 2022,
        "authors_short": "Sjölander",
        "notes": "Critical appraisal clarifying what the worst-case single-confounder framing does and does not deliver; essential for avoiding over- or under-interpretation when reporting E-values in regulatory or HTA submissions."
      }
    ],
    "plain_language_summary": "The E-value asks one question about an observational result: how strong would a confounder you forgot to measure have to be to wipe out the association you found? It turns your reported risk ratio into a single number on the same risk-ratio scale, so you can say 'a hidden confounder would need to be linked to both the treatment and the outcome by at least this much to explain my result away.' A bigger E-value means the finding is harder to overturn. It is just a what-if benchmark, not proof that such a confounder exists, and it only speaks to confounding, not to other problems like selection or measurement error.",
    "key_terms": [
      {
        "term": "risk ratio (RR)",
        "definition": "How many times more likely the outcome is in the exposed group than the unexposed group; an RR of 2.0 means twice the risk."
      },
      {
        "term": "unmeasured confounder",
        "definition": "A factor that pushes both who gets the treatment and who gets the outcome, but that you never recorded, so you could not adjust for it."
      },
      {
        "term": "the null",
        "definition": "The no-effect value, which on the risk-ratio scale is exactly 1.0 (the treatment makes no difference)."
      },
      {
        "term": "confidence interval (CI)",
        "definition": "The plausible range around your estimate; if it does not cross 1.0, the result is 'statistically significant.'"
      },
      {
        "term": "point-estimate E-value",
        "definition": "The confounder strength needed to drag your main estimate all the way back to 1.0 (the null)."
      },
      {
        "term": "CI E-value",
        "definition": "The smaller confounder strength needed only to stretch your confidence interval until it touches 1.0, erasing significance."
      }
    ],
    "worked_example": {
      "scenario": "An analyst ran an observational cohort study and, after adjusting for every confounder they could measure, found that a drug exposure was associated with a higher risk of a rare adverse event: adjusted risk ratio RR = 1.8 with a 95% confidence interval of 1.3 to 2.5. A reviewer asks the obvious skeptic's question: 'Could a single confounder you didn't measure (say, disease severity) be faking this whole association?' The E-value answers it as one number. We compute two: one for the point estimate (1.8) and one for the confidence-interval limit nearest the null (1.3).",
      "dataset": {
        "caption": "The analyst does not need patient-level rows here. The E-value is computed entirely from the already-reported summary statistic and its interval, so the 'data' is one line of model output.",
        "columns": [
          "quantity",
          "value",
          "scale",
          "direction"
        ],
        "rows": [
          [
            "adjusted point estimate",
            1.8,
            "risk ratio",
            "harmful (RR > 1)"
          ],
          [
            "95% CI lower limit",
            1.3,
            "risk ratio",
            "limit nearest the null"
          ],
          [
            "95% CI upper limit",
            2.5,
            "risk ratio",
            "limit farther from the null"
          ],
          [
            "the null",
            1.0,
            "risk ratio",
            "no effect"
          ]
        ]
      },
      "steps": [
        "The estimate is harmful (RR = 1.8, which is above 1.0), so we use the formula straight away: E = RR + sqrt(RR * (RR - 1)).",
        "Point-estimate E-value: plug in RR = 1.8. First the product inside the root: 1.8 * (1.8 - 1) = 1.8 * 0.8 = 1.44. The square root of 1.44 is exactly 1.2. So E = 1.8 + 1.2 = 3.0.",
        "Read that out loud: an unmeasured confounder would need to be associated with BOTH the exposure AND the outcome by a risk ratio of at least 3.0 each, on top of everything already adjusted for, to pull the estimate all the way back to 1.0.",
        "CI E-value: the significance claim depends on the interval limit closest to 1.0, which is the lower limit 1.3 (since the estimate is harmful). Apply the same formula to 1.3: 1.3 * (1.3 - 1) = 1.3 * 0.3 = 0.39, and sqrt(0.39) = 0.6245. So E = 1.3 + 0.6245 = 1.92.",
        "Interpret the pair: it takes a confounder of strength 3.0 to erase the estimate, but only 1.92 to erase statistical significance. The CI E-value is always the smaller, more honest number, so report both.",
        "Final gut-check: compare these to the strongest confounder you already measured. If a confounder you adjusted for was itself associated with the outcome by an RR around 2, then a residual confounder of strength 1.92 is entirely plausible, and the result is only modestly robust."
      ],
      "result": "Point-estimate E-value = 1.8 + sqrt(1.8 * 0.8) = 1.8 + sqrt(1.44) = 1.8 + 1.2 = 3.0. CI E-value (using the lower limit 1.3) = 1.3 + sqrt(1.3 * 0.3) = 1.3 + sqrt(0.39) = 1.3 + 0.62 = 1.92. Report both: a confounder must reach RR 3.0 with both exposure and outcome to nullify the estimate, but only RR 1.92 to make the confidence interval include the null."
    },
    "prerequisites": [],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "E-value for the point estimate",
        "description": "Minimum joint confounder-exposure and confounder-outcome risk ratio needed to reduce the observed effect to the null (RR=1), computed as E = RR + sqrt(RR*(RR-1)) for RR>=1 and applied to 1/RR for protective effects.",
        "edge_cases": [
          "For protective effects (RR<1) the E-value is computed on the reciprocal scale and interpreted symmetrically.",
          "A non-rare odds ratio or hazard ratio must be converted toward the risk-ratio scale (e.g., RR approx sqrt(OR) for rare outcomes) before applying the formula; using the raw OR/HR overstates the E-value."
        ],
        "data_source_notes": "Computed from the adjusted point estimate alone. The R EValue package and the VanderWeele online calculator implement the conversions; verify the input scale matches the formula.",
        "citations": [
          "vanderweele-2017"
        ]
      },
      {
        "name": "E-value for the confidence-interval limit",
        "description": "The strength of confounding needed to move the confidence-interval limit nearest the null across the null; it is always smaller than the point-estimate E-value and is the appropriate robustness number for a significance claim.",
        "edge_cases": [
          "When the CI already includes the null the CI E-value equals 1 and conveys no information.",
          "Use the upper limit for a protective estimate and the lower limit for a harmful estimate (the limit closest to the null)."
        ],
        "data_source_notes": "Preferred for robustness statements in regulatory and HTA contexts; report it together with the point E-value and the list of covariates already adjusted for.",
        "citations": [
          "vanderweele-2017"
        ]
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "negative-control-outcomes-rwe",
        "pros_of_this": "Gives a quantitative worst-case bound computable from the reported estimate alone, with no need to identify a specific confounder or construct a valid control variable; trivially communicable to non-epidemiologists.",
        "cons_of_this": "Worst-case and silent on whether a confounder that strong is plausible; cannot detect that a result is actually biased, nor identify or correct the offending confounder.",
        "when_to_prefer": "As the routine, always-reportable robustness summary after measured confounding control; pair with negative controls (empirical detection) rather than choosing between them."
      },
      {
        "compared_to": "quantitative-bias-analysis-toolkit-rwe",
        "pros_of_this": "Requires no bias parameters and yields a single transparent number, making it an immediate screen for fragility.",
        "cons_of_this": "Produces only a bound, not a bias-corrected estimate with uncertainty; ignores selection bias and misclassification entirely.",
        "when_to_prefer": "When communicability and a quick robustness screen are the goal; escalate to full probabilistic QBA when a corrected estimate or external bias-parameter information is available."
      },
      {
        "compared_to": "empirical-calibration-negative-controls-rwe",
        "pros_of_this": "Universal and distribution-free; needs no large bank of negative-control associations.",
        "cons_of_this": "Addresses only confounding magnitude, not the full distribution of systematic error that empirical calibration recalibrates.",
        "when_to_prefer": "When a calibration set of negative controls is unavailable or the question is specifically \"how strong a single confounder is needed,\" rather than recalibrating the whole error distribution."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Compute the E-value on the PS-adjusted (matched/weighted) or regression-adjusted estimate; state the measured covariates and the unmeasured residual candidates (frailty, SES, severity, smoking). Restrict to FFS-observable person- time and exclude Medicare Advantage-only spans so the input estimate is not itself biased by incomplete claims; on the risk-ratio scale convert non-rare ORs/HRs before applying the formula.",
      "ehr": "Richer measured covariates (labs, vitals, NLP severity) usually lower the plausible residual confounding and the E-value needed, but informative-presence/visit-driven capture creates selection the E-value does not address; report residual unmeasured factors and treat informative loss to follow-up separately.",
      "registry": "Strong indication/severity and adjudicated outcomes shrink the plausible residual confounder; the arithmetic is unchanged but the interpretation (how large a confounder could plausibly be) tightens.",
      "linked": "Linked claims-EHR-vital-records is the strongest substrate for the input estimate; linkage selection is a separate bias the E-value does not bound, so report it alongside."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import math\n\ndef _ev(rr: float) -> float:\n    # Bounding-factor E-value for a single risk ratio; symmetric for protective effects.\n    if rr < 1.0:\n        rr = 1.0 / rr\n    return rr + math.sqrt(rr * (rr - 1.0))\n\ndef _to_rr(est: float, scale: str, rare_outcome: bool = True) -> float:\n    # Convert the reported estimate toward the risk-ratio scale the bound is defined on.\n    if scale == \"RR\" or scale == \"HR\" and rare_outcome:\n        return est                       # HR ~ RR when the outcome is rare / follow-up short\n    if scale == \"OR\" and rare_outcome:\n        return math.sqrt(est)            # RR ~ sqrt(OR) for a rare outcome\n    raise ValueError(\"Provide an RR, or a rare-outcome HR/OR; otherwise convert first.\")\n\ndef e_value(estimate: float, ci_low: float, ci_high: float,\n            scale: str = \"HR\", rare_outcome: bool = True) -> dict:\n    \"\"\"E-values for the point estimate and the CI limit nearest the null.\n\n    estimate/ci_low/ci_high: the adjusted effect and its 95% CI on `scale`.\n    scale: 'RR', 'HR', or 'OR' (HR/OR require rare_outcome=True for the approximation).\n    \"\"\"\n    rr  = _to_rr(estimate, scale, rare_outcome)\n    lo  = _to_rr(ci_low,  scale, rare_outcome)\n    hi  = _to_rr(ci_high, scale, rare_outcome)\n    ev_point = _ev(rr)\n    if lo <= 1.0 <= hi:                  # CI already crosses the null on the RR scale\n        ev_ci = 1.0\n    else:                                # use the limit closest to the null\n        limit = hi if rr < 1.0 else lo\n        ev_ci = _ev(limit)\n    return {\"rr_scale_estimate\": rr, \"e_value_point\": ev_point, \"e_value_ci\": ev_ci}\n\n# Worked claims example: PS-matched antihypertensive A vs B on incident HF (rare outcome).\n# Adjusted Cox HR = 0.78 (95% CI 0.65-0.93)  ->  point E-value 1.88, CI E-value 1.36.\nres = e_value(0.78, 0.65, 0.93, scale=\"HR\", rare_outcome=True)\nprint(f\"E-value (point) = {res['e_value_point']:.2f}\")\nprint(f\"E-value 
(CI)    = {res['e_value_ci']:.2f}\")",
        "description": "Compute point-estimate and confidence-interval E-values from a fitted relative-effect estimate.\nRequired input: a single comparative estimate with its 95% CI from an already-adjusted RWE model\n(e.g., a Cox HR from a PS-matched active-comparator new-user cohort), plus the scale of that\nestimate. For 'OR' or 'HR' the rare-outcome approximation toward the risk-ratio scale is applied\nbefore the bounding factor (VanderWeele & Ding 2017). Returns both E-values; report both.",
        "dependencies": [],
        "source_citations": [
          "vanderweele-2017",
          "ding-2016"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(EValue)\n\n# Worked claims example: PS-matched antihypertensive A vs B on incident HF.\n# Adjusted Cox HR = 0.78 (95% CI 0.65-0.93); incident HF is rare over follow-up.\nhr_est <- 0.78; hr_lo <- 0.65; hr_hi <- 0.93\n\n# Returns RR-scale estimate, then point-estimate and CI-limit E-values (lower/upper).\nev_hr <- evalues.HR(est = hr_est, lo = hr_lo, hi = hr_hi, rare = TRUE)\nprint(ev_hr)\n#            point     lower    upper\n# RR          0.78     0.65     0.93\n# E-values    1.88     1.36       NA     <- point E-value 1.88, CI E-value 1.36\n\n# Logistic example (rare outcome): adjusted OR and its CI from PROC/glm output.\n# ev_or <- evalues.OR(est = 1.45, lo = 1.10, hi = 1.92, rare = TRUE)\n\n# Presentation contour: combinations of RR_AU and RR_UY that would explain the estimate.\nbias_plot(rr = hr_est, xmax = 4)",
        "description": "Compute E-values with the canonical EValue package (Mathur, Smith, Ding, VanderWeele). Use the\nscale-specific helper that matches the model: evalues.HR() for a Cox hazard ratio (declare whether\nthe outcome is rare), evalues.OR() for logistic odds ratios, evalues.RR() for risk ratios. Input is\nthe adjusted estimate and its 95% CI from a PS-matched/weighted RWE model. bias_plot() draws the\njoint confounder-exposure / confounder-outcome contour for presentation.",
        "dependencies": [
          "EValue"
        ],
        "source_citations": [
          "vanderweele-2017",
          "ding-2016"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* One row carrying the adjusted estimate from the fitted RWE model.\n   Worked claims example: PS-matched Cox HR = 0.78 (95% CI 0.65-0.93), rare outcome. */\ndata est_in;\n  length scale $3;\n  estimate = 0.78; ci_low = 0.65; ci_high = 0.93; scale = 'HR'; rare = 1;\nrun;\n\ndata evalues;\n  set est_in;\n\n  /* Convert to the risk-ratio scale the bounding factor is defined on. */\n  if scale = 'RR' or (scale = 'HR' and rare = 1) then do;\n    rr = estimate; rr_lo = ci_low; rr_hi = ci_high;\n  end;\n  else if scale = 'OR' and rare = 1 then do;            /* RR ~ sqrt(OR) for a rare outcome */\n    rr = sqrt(estimate); rr_lo = sqrt(ci_low); rr_hi = sqrt(ci_high);\n  end;\n  else do;\n    put 'ERROR: non-rare OR/HR must be converted to the RR scale first.'; stop;\n  end;\n\n  /* Symmetric bounding-factor E-value for a single risk ratio. */\n  _r = rr; if _r < 1 then _r = 1/_r;\n  e_value_point = _r + sqrt(_r*(_r - 1));\n\n  /* CI E-value uses the interval limit nearest the null; 1 if the CI already crosses it. */\n  if rr_lo <= 1 <= rr_hi then e_value_ci = 1;\n  else do;\n    if rr < 1 then _lim = rr_hi; else _lim = rr_lo;\n    if _lim < 1 then _lim = 1/_lim;\n    e_value_ci = _lim + sqrt(_lim*(_lim - 1));\n  end;\n\n  keep rr e_value_point e_value_ci;     /* expect point 1.88, CI 1.36 */\nrun;\n\nproc print data=evalues noobs; run;",
        "description": "Compute the point-estimate and CI-limit E-values from an adjusted relative effect in a DATA step\n(no specialized PROC computes E-values; PROC PHREG/LOGISTIC produce the input estimate and CI).\nRequired input: a one-row dataset `est_in` with the adjusted estimate, its 95% CI, and the scale\n('RR','HR','OR'); HR/OR are converted toward the risk-ratio scale for a rare outcome before the\nbounding factor is applied (VanderWeele & Ding 2017).",
        "dependencies": [],
        "source_citations": [
          "vanderweele-2017",
          "ding-2016"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  U((Unmeasured<br/>confounder U)) -- RR_AU --> A[Exposure A]\n  U -- RR_UY --> Y[Outcome Y]\n  A -. observed adjusted effect .-> Y\n  subgraph Bound[\"E-value bound (VanderWeele & Ding)\"]\n    N[\"Both arrows must each reach<br/>RR >= E-value to nullify the result\"]\n  end\n  U --- N",
        "caption": "The E-value bounds the joint strength of the two confounding arrows. An unmeasured confounder U must be associated with exposure A (RR_AU) and outcome Y (RR_UY) each by at least the E-value, beyond the measured covariates, to explain away the observed adjusted A-Y association.",
        "alt_text": "DAG with an unmeasured confounder U pointing to exposure A and outcome Y, annotated that both arrow strengths must each reach the E-value on the risk-ratio scale to nullify the observed effect.",
        "source_type": "illustrative",
        "source_citations": [
          "vanderweele-2017"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Est[Adjusted estimate + 95% CI<br/>from PS-matched / weighted RWE model] --> Scale{Scale of estimate?}\n  Scale -->|RR| RR[Use directly]\n  Scale -->|HR, rare outcome| RR\n  Scale -->|OR, rare outcome| Conv[RR approx sqrt OR]\n  Scale -->|non-rare OR/HR| Stop[Convert first - do not use raw]\n  RR --> Calc[E = RR + sqrt RR RR-1]\n  Conv --> Calc\n  Calc --> Two[Report point E-value AND CI E-value]\n  Two --> Anchor[Compare to RR of strongest MEASURED confounder]\n  Anchor -->|E-value > measured RRs| Robust[Result relatively robust]\n  Anchor -->|E-value <= measured RRs| Fragile[Result fragile: foreground negative controls + probabilistic QBA]",
        "caption": "Decision logic for computing and interpreting an E-value. The interpretation step - anchoring against the strength of confounders already adjusted for - is what turns the worst-case bound into an actionable robustness statement.",
        "alt_text": "Flowchart from an adjusted estimate through scale selection and conversion to the bounding-factor formula, then to reporting both E-values and anchoring them against measured confounder strengths to judge robustness or fragility.",
        "source_type": "illustrative",
        "source_citations": [
          "vanderweele-2017"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "see_also",
        "target_slug": "negative-control-outcomes-rwe",
        "notes": "Complementary sensitivity tool. Negative controls empirically detect the presence and direction of residual bias; the E-value quantifies how strong unmeasured confounding would have to be to explain the result."
      },
      {
        "relation_type": "alternative_to",
        "target_slug": "quantitative-bias-analysis-toolkit-rwe",
        "notes": "The E-value is a parameter-free, single-number screen; full probabilistic bias analysis returns a bias-corrected estimate with uncertainty under explicit bias parameters. Escalate to QBA when a corrected estimate is needed."
      },
      {
        "relation_type": "see_also",
        "target_slug": "empirical-calibration-negative-controls-rwe",
        "notes": "Empirical calibration recalibrates the full distribution of systematic error from many negative controls; the E-value bounds only confounding magnitude and needs no calibration set."
      },
      {
        "relation_type": "see_also",
        "target_slug": "dags-backdoor-criterion-drug-studies",
        "notes": "DAGs identify plausible unmeasured confounders and justify the measured adjustment set; the E-value then bounds how strong any remaining backdoor path would have to be."
      },
      {
        "relation_type": "used_with",
        "target_slug": "propensity-score-methods-psm-iptw",
        "notes": "The E-value is computed on the PS-matched or weighted effect estimate; it summarizes robustness of the result that remains after measured confounding is balanced."
      }
    ],
    "aliases": [
      "E-value",
      "VanderWeele E-value",
      "unmeasured confounding sensitivity bound",
      "E-value bounding factor"
    ],
    "complexity": "foundational",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "ecog-performance-status-score-rwe",
    "name": "ECOG Performance Status Score",
    "short_definition": "A 0-5 clinician-rated oncology functional-status scale, from fully active to dead, used for trial eligibility, prognosis, treatment selection, confounding control, and external-control comparability.",
    "long_description": "ECOG Performance Status is a concise oncology functional-status score. It describes a patient's ability to work, ambulate, perform self-care, and remain out of bed. ECOG 0 means fully active; ECOG 5 means dead. In oncology RWE, ECOG is often a decisive prognostic variable and eligibility criterion, especially for external controls, treatment sequencing studies, and comparative effectiveness analyses.\n\nThe measurement problem is that ECOG is unevenly captured in routine care. It may appear as a structured field, a trial/registry variable, a clinician note phrase, or not at all. Claims data generally do not contain ECOG. EHR extraction can use structured oncology flowsheets or NLP, but missingness is often informative because sicker patients and oncology centers may document performance status differently.\n\nTreat ECOG as a score with provenance. Store the numeric value, source field or note text, assessment date, assessor/source, and relationship to index date. For baseline confounding control, define an allowed window before or near treatment start and report missingness by study arm. Do not impute ECOG into claims-only data as if it were observed.\n\n**Pros, cons, and trade-offs.** ECOG is one of the most clinically meaningful oncology covariates because it summarizes functional status, prognosis, treatment tolerance, and trial eligibility in one familiar score. The trade-off is measurement quality. In routine EHR data it can be missing, stale, copied forward, buried in notes, or recorded after treatment decisions. NLP can recover values, but it creates validation and provenance requirements. Collapsing ECOG into 0-1 versus 2+ improves stability but loses clinically important gradients.\n\n**When to use.** Use ECOG when oncology treatment selection, external-control comparability, eligibility transportability, baseline prognosis, or utility mapping depends on functional status and the score is observed or validly abstracted. 
Use a pre-specified baseline window and keep post-index ECOG separate when it may be a mediator or outcome.\n\n**When NOT to use - and when it is actively misleading.** Do not treat ECOG as observable in claims-only data. Do not fill missing ECOG with a favorable value or infer it from treatment receipt without labeling the result as a proxy. It is actively misleading to use post-progression ECOG for baseline confounding control or to compare EHR sites without reporting documentation missingness.",
    "primary_category": "Outcome_Measure",
    "tags": [
      "ecog",
      "performance-status",
      "oncology",
      "functional-status",
      "external-controls",
      "prognostic-factor",
      "ehr",
      "registry"
    ],
    "applies_to_study_types": [
      "oncology_rwe",
      "single_arm_external_control",
      "comparative_effectiveness",
      "registry_study",
      "ehr_study"
    ],
    "data_sources": [
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": null,
        "url": "https://ecog-acrin.org/resources/ecog-performance-status/",
        "citation_text": "ECOG-ACRIN Cancer Research Group. ECOG Performance Status Scale.",
        "year": 2026,
        "authors_short": "ECOG-ACRIN",
        "notes": "Official ECOG-ACRIN scale page and attribution instructions."
      },
      {
        "role": "explain",
        "doi": null,
        "url": "https://training.seer.cancer.gov/followup/dataset/ecog-performance-status.html",
        "citation_text": "National Cancer Institute SEER Training Modules. ECOG Performance Status.",
        "year": 2023,
        "authors_short": "NCI SEER",
        "notes": "NCI training reference for ECOG performance status categories."
      },
      {
        "role": "introduce",
        "doi": "10.1097/00000421-198212000-00014",
        "url": "https://pubmed.ncbi.nlm.nih.gov/7165009/",
        "citation_text": "Oken MM, Creech RH, Tormey DC, Horton J, Davis TE, McFadden ET, Carbone PP. Toxicity and response criteria of the Eastern Cooperative Oncology Group. American Journal of Clinical Oncology. 1982;5(6):649-655.",
        "year": 1982,
        "authors_short": "Oken et al.",
        "notes": "Original ECOG toxicity and response criteria publication cited by ECOG-ACRIN."
      }
    ],
    "plain_language_summary": "ECOG is a 0 to 5 oncology score for how well a patient is functioning. Lower is better. It is crucial in cancer studies because two patients with the same diagnosis can have very different prognosis if one is ECOG 0 and the other is ECOG 3.",
    "key_terms": [
      {
        "term": "Performance status",
        "definition": "A clinician assessment of functional capacity and daily activity limitations."
      },
      {
        "term": "ECOG 0",
        "definition": "Fully active without restriction."
      },
      {
        "term": "ECOG 2",
        "definition": "Ambulatory and capable of self-care but unable to work; up and about more than 50 percent of waking hours."
      },
      {
        "term": "ECOG missingness",
        "definition": "Absence of a documented ECOG score, often informative because documentation varies by site, severity, and data source."
      }
    ],
    "worked_example": {
      "scenario": "An external-control oncology study needs baseline ECOG at treatment start. A patient has ECOG 0 documented 75 days before index, ECOG 1 on the index date, and ECOG 2 after progression. The protocol uses the nearest value from 30 days before through 7 days after index.",
      "dataset": {
        "caption": "Candidate ECOG values around index.",
        "columns": [
          "date_relative_to_index",
          "ecog_value",
          "source",
          "eligible_for_baseline"
        ],
        "rows": [
          [
            -75,
            0,
            "oncology note",
            "no; outside window"
          ],
          [
            0,
            1,
            "structured oncology field",
            "yes; index-date value inside window"
          ],
          [
            178,
            2,
            "oncology note",
            "no; post-index deterioration"
          ]
        ]
      },
      "steps": [
        "Define the allowed baseline window before extracting values.",
        "Prefer structured ECOG on or closest before index if multiple values qualify.",
        "Exclude post-progression ECOG from baseline adjustment because it may be a mediator or outcome.",
        "Report missingness and extraction source in Table 1."
      ],
      "result": "Baseline ECOG = 1. The ECOG 2 value is retained as follow-up deterioration evidence, not baseline confounding control."
    },
    "prerequisites": [],
    "index_definitions": [
      {
        "name": "ECOG 0",
        "definition": "Fully active; able to carry on all pre-disease performance without restriction.",
        "source": "ECOG-ACRIN",
        "use": "Trial eligibility, baseline prognosis, and Table 1 covariate.",
        "notes": "Best functional status category."
      },
      {
        "name": "ECOG 1",
        "definition": "Restricted in physically strenuous activity but ambulatory and able to do light or sedentary work.",
        "source": "ECOG-ACRIN",
        "use": "Common eligibility category in oncology trials and external-control matching.",
        "notes": "Often grouped with ECOG 0 as ECOG 0-1."
      },
      {
        "name": "ECOG 2",
        "definition": "Ambulatory and capable of all self-care but unable to work; up and about more than 50 percent of waking hours.",
        "source": "ECOG-ACRIN",
        "use": "Prognostic adjustment and subgrouping.",
        "notes": "Trial eligibility often changes at ECOG 2."
      },
      {
        "name": "ECOG 3",
        "definition": "Capable of limited self-care; confined to bed or chair more than 50 percent of waking hours.",
        "source": "ECOG-ACRIN",
        "use": "Prognostic stratification and clinical-context adjustment.",
        "notes": "Often underrepresented in trials."
      },
      {
        "name": "ECOG 4",
        "definition": "Completely disabled; cannot carry on self-care; totally confined to bed or chair.",
        "source": "ECOG-ACRIN",
        "use": "Severe functional impairment marker.",
        "notes": "Rarely eligible for interventional trials."
      },
      {
        "name": "ECOG 5",
        "definition": "Dead.",
        "source": "ECOG-ACRIN",
        "use": "Terminal category in the original scale.",
        "notes": "In RWE, death is usually modeled as an outcome rather than carried as a baseline ECOG state."
      }
    ],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Structured ECOG field",
        "description": "Numeric ECOG captured in registry, oncology flowsheet, or trial/EHR structured field.",
        "edge_cases": [
          "Values can be copied forward.",
          "Date may reflect entry date rather than assessment date."
        ],
        "data_source_notes": "Preferred source when timing is reliable."
      },
      {
        "name": "NLP-derived ECOG",
        "description": "ECOG extracted from clinician notes using rules, NLP, or abstraction.",
        "edge_cases": [
          "Negated or hypothetical ECOG statements.",
          "Ambiguous performance language without numeric score."
        ],
        "data_source_notes": "Requires validation and source-text provenance."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Karnofsky Performance Status",
        "use_ecog_when": "Oncology trials, registries, or clinical notes document ECOG directly.",
        "use_karnofsky_when": "The source uses Karnofsky consistently and ECOG is absent.",
        "notes": "Crosswalks are approximate; avoid pretending converted scores are exact."
      }
    ],
    "implementation_notes_by_data_source": {
      "ehr": "Extract source, date, and method; compare structured and note-derived values when both exist.",
      "registry": "Strong source when collected consistently, but timing relative to treatment start still matters.",
      "claims": "Claims do not contain observed ECOG; only validated proxies should be used and labeled as proxies."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\n\ndef baseline_ecog(index, ecog, lookback_days=30, lookforward_days=7):\n    e = ecog.copy()\n    e[\"ecog_value\"] = pd.to_numeric(e[\"ecog_value\"], errors=\"coerce\")\n    e = e[e[\"ecog_value\"].between(0, 5, inclusive=\"both\")]\n    e = e.merge(index[[\"person_id\", \"index_date\"]], on=\"person_id\", how=\"inner\")\n    e[\"days_from_index\"] = (e[\"assessment_date\"] - e[\"index_date\"]).dt.days\n    e = e[e[\"days_from_index\"].between(-lookback_days, lookforward_days)]\n    e[\"abs_days\"] = e[\"days_from_index\"].abs()\n    e = e.sort_values([\"person_id\", \"abs_days\", \"days_from_index\"])\n    first = e.groupby(\"person_id\", as_index=False).first()\n    return index.merge(\n        first[[\"person_id\", \"ecog_value\", \"assessment_date\", \"source\", \"days_from_index\"]],\n        on=\"person_id\",\n        how=\"left\",\n    )",
        "description": "Select baseline ECOG from structured or abstracted EHR/registry rows. Inputs:\n  index : person_id, index_date\n  ecog  : person_id, assessment_date, ecog_value, source\nThe example uses the nearest value in a pre-specified baseline window and rejects impossible scores.",
        "dependencies": [
          "pandas"
        ],
        "source_citations": [
          "oken-1982",
          "ecog-acrin-scale"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\n\nbaseline_ecog <- function(index, ecog, lookback_days = 30L, lookforward_days = 7L) {\n  setDT(index); setDT(ecog)\n  e <- copy(ecog)\n  e[, ecog_value := suppressWarnings(as.numeric(ecog_value))]\n  e <- e[ecog_value >= 0 & ecog_value <= 5]\n  e <- merge(e, index[, .(person_id, index_date)], by = \"person_id\")\n  e[, days_from_index := as.integer(assessment_date - index_date)]\n  e <- e[days_from_index >= -lookback_days & days_from_index <= lookforward_days]\n  e[, abs_days := abs(days_from_index)]\n  setorder(e, person_id, abs_days, days_from_index)\n  first <- e[, .SD[1], by = person_id]\n  merge(index, first[, .(person_id, ecog_value, assessment_date, source, days_from_index)],\n        by = \"person_id\", all.x = TRUE)\n}",
        "description": "R/data.table version for selecting baseline ECOG in a reproducible window.",
        "dependencies": [
          "data.table"
        ],
        "source_citations": [
          "oken-1982",
          "ecog-acrin-scale"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "%let lookback_days = 30;\n%let lookforward_days = 7;\n\nproc sql;\n  create table ecog_window as\n  select e.person_id, i.index_date, e.assessment_date, e.ecog_value, e.source,\n         e.assessment_date - i.index_date as days_from_index,\n         abs(e.assessment_date - i.index_date) as abs_days\n  from work.ecog e inner join work.index i on e.person_id = i.person_id\n  where 0 <= e.ecog_value <= 5\n    and calculated days_from_index between -&lookback_days and &lookforward_days;\nquit;\n\nproc sort data=ecog_window;\n  by person_id abs_days days_from_index;\nrun;\n\ndata baseline_ecog;\n  set ecog_window;\n  by person_id;\n  if first.person_id then output;\n  keep person_id ecog_value assessment_date source days_from_index;\nrun;\n\nproc sql;\n  create table ecog_analysis as\n  select i.*, b.ecog_value, b.assessment_date as ecog_date, b.source as ecog_source,\n         b.days_from_index as ecog_days_from_index\n  from work.index i left join baseline_ecog b on i.person_id = b.person_id;\nquit;",
        "description": "SAS pattern for selecting baseline ECOG from structured or abstracted rows.\nInputs:\n  work.index person_id index_date\n  work.ecog  person_id assessment_date ecog_value source",
        "dependencies": [],
        "source_citations": [
          "oken-1982",
          "ecog-acrin-scale"
        ],
        "notes": ""
      }
    ],
    "diagrams": [],
    "relations": [
      {
        "relation_type": "see_also",
        "target_slug": "nlp-clinical-text-extraction-rwe",
        "notes": "ECOG is commonly extracted from oncology notes when structured fields are missing."
      },
      {
        "relation_type": "used_with",
        "target_slug": "single-arm-external-control",
        "notes": "ECOG imbalance is a major threat to oncology external-control validity."
      },
      {
        "relation_type": "see_also",
        "target_slug": "qaly-utility-mapping-rwe",
        "notes": "ECOG categories may be used as clinical-state inputs for utility mapping in oncology models."
      }
    ],
    "aliases": [
      "ECOG",
      "ECOG score",
      "ECOG PS",
      "Eastern Cooperative Oncology Group score",
      "ECOG performance status",
      "ECOG performance score",
      "performance status score"
    ],
    "complexity": "foundational",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "ecological",
    "name": "Ecological (Aggregate) Study",
    "short_definition": "An observational design in which exposure, outcome, and covariates are measured and analyzed at the level of a group (area, time period, or population) rather than the individual, so the unit of analysis is the aggregate and inference about individual-level effects is vulnerable to ecological (cross-level) bias.",
    "long_description": "An **ecological (aggregate) study** correlates a group-level summary of exposure with a group-level summary of\noutcome — across geographic areas, time periods, institutions, or population strata — without observing the\njoint exposure–outcome distribution within any group. The classic form regresses an area's disease *rate* on the\narea's *prevalence* or *intensity* of exposure (e.g., county opioid dispensing per 1,000 enrollees vs county\noverdose-hospitalization rate). Because no individual is ever linked to both their own exposure and their own\noutcome, the design is cheap, fast, and able to study exposures that barely vary within a population (air\npollution, policy, drug-formulary coverage, taxes) — but the contrast it estimates is between *groups*, and\ntransporting that contrast to *individuals* is the entire methodological problem.\n\n**Core conceptual distinction.** The estimand of an ecological regression is the slope of group-mean outcome on\ngroup-mean exposure; it equals the individual-level causal effect *only* under restrictive conditions that almost\nnever hold in observational data. The **ecological fallacy** (Robinson 1950) is the inference error of treating\nthe group-level slope as if it were the individual-level effect. 
Greenland & Robins formalized why the two\ndiverge: **(1) cross-level confounding** — area composition (age, race, deprivation, comorbidity mix) is\ncorrelated with both the aggregate exposure and the aggregate outcome and cannot be controlled by adjusting for\ngroup means alone; **(2) effect-measure modification within groups** — if the individual exposure effect varies\nwith a within-area covariate, the group-level slope is a non-causal average that depends on the *within-area\nexposure variance*, which is invisible in aggregate data (this is **specification bias**); and **(3) non-linear\nwithin-group exposure–response** — averaging a curved individual relationship before regressing creates\naggregation bias even with no confounding. The result: the ecological slope can be biased in magnitude, attenuated,\ninflated, or *sign-reversed* relative to the individual effect. This is fundamentally different from a\ncohort/case-control study, where the unit is the person and the joint distribution is observed.\n\n**Pros, cons, and trade-offs.**\n- **vs individual-level cohort (e.g., cohort-retrospective):** Ecological is orders of magnitude cheaper, needs\n  only published/aggregate tabulations, and can estimate effects of exposures with no individual variation (a\n  state policy, a national formulary change). Cost: it cannot in general recover the individual causal effect; it\n  is exposed to cross-level confounding that individual data would let you adjust away. 
**Prefer ecological** only\n  for genuinely group-level (contextual) exposures or for *hypothesis generation* — never as the primary design\n  when individual-level data are obtainable and the question is about individual risk.\n- **vs cross-sectional (individual-level):** Both are snapshots, but the cross-sectional study measures the\n  person, so it can estimate individual associations subject to prevalence/incidence and reverse-causation caveats;\n  the ecological study trades that for population coverage and exposure contrast. **Prefer cross-sectional** when\n  individual exposure–outcome pairing matters; prefer ecological for ecologic (contextual) effects or area-level\n  surveillance.\n- **vs multilevel / hierarchical models on linked data:** If you can obtain even a *sub-sample* of individual\n  records, a hybrid/semi-ecological design (Wakefield) or a multilevel model that combines aggregate margins with\n  individual data dramatically reduces ecological bias by pinning down within-area variance and effect\n  modification. **Prefer the hybrid** whenever any individual-level data exist; pure ecological is the fallback\n  when none do.\n\n**When to use.** (1) The exposure is intrinsically contextual and the *causal question is about the group-level\neffect* — air quality, alcohol/tobacco taxes, drug-coverage policy, ICU staffing ratios, vaccination coverage\nand herd effects. (2) Rapid hypothesis generation and surveillance from routinely published aggregates (CMS\ncounty tables, CDC WONDER, cancer-registry area rates) before committing to an individual-level study. (3) Studying\nexposures with negligible within-population individual variation, where an individual-level cohort would have no\nexposure contrast at all. 
(4) Evaluating natural experiments / policy roll-outs at the population level, ideally\nupgraded to difference-in-differences or interrupted time series with multiple periods.\n\n**When NOT to use — and when it is actively misleading or dangerous.**\n- **The question is about individual risk and individual data are available.** Reporting an ecological slope as a\n  personal relative risk is the textbook ecological fallacy and can invert the true effect (Robinson's literacy\n  example; the \"ecological correlation\" between area immigrant share and literacy reversed at the individual level).\n- **Strong cross-level confounding by composition.** Area socioeconomic deprivation, age structure, and\n  race/ethnicity drive both aggregate exposure and aggregate outcome; adjusting for area means does not remove\n  confounding that operates within areas. Diagnose by asking whether the confounder varies *within* areas — if so,\n  aggregate adjustment is inadequate.\n- **Within-area exposure variance is large.** The larger the spread of individual exposure inside each group, the\n  worse the specification/aggregation bias; ecological analysis is least biased when groups are internally\n  homogeneous in exposure.\n- **Few groups / small-area instability.** With a handful of areas the regression is underpowered and dominated by\n  leverage points; with tiny denominators rates are unstable and CMS-style small-cell suppression (counts <11)\n  biases the aggregate numerators non-randomly.\n- **Spatial/temporal dependence is ignored.** Neighboring areas and adjacent periods are correlated; naive OLS\n  standard errors are anticonservative. Require cluster-robust or CAR/spatial-error models.\n\n**Data-source operational depth.**\n- **Administrative claims (Medicare/commercial):** Cells are built by aggregating individual claims to\n  area×period units (county-quarter, HSA-year). 
Numerators (events) and exposure intensity (e.g., sum of\n  `days_supply` per 1,000 enrollees) must share an identical, fully-observed denominator. Failure modes:\n  **MA-vs-FFS denominator drift** — Medicare Advantage encounter capture is incomplete and the MA share varies by\n  county and rises over time, so a county-quarter exposure or event *rate* can move purely because the FFS\n  fraction changed, not because behavior changed; restrict to FFS Parts A/B/D person-time and compute\n  enrollment-weighted denominators. **Differential migration/enrollment churn** redistributes person-time across\n  cells. **Coding-intensity and access differences** across areas masquerade as exposure or outcome differences.\n  **Small-cell suppression** in public CMS aggregates (counts <11 redacted) censors numerators non-randomly,\n  deflating rates in sparse rural counties.\n- **EHR / health-system data:** Aggregation is to facility or catchment area, but the captured population is the\n  *visiting* population, not the resident population, so the denominator is ill-defined; out-of-network care is\n  invisible and differential by area. Use only when a stable catchment denominator (e.g., a closed integrated\n  system) is defensible.\n- **Registries (disease/cancer):** Often the *source* of area-level outcome rates and the strongest substrate\n  (adjudicated outcomes, defined catchment). 
Weak for exposure; must be paired with an external exposure\n  aggregate, and the two denominators (registry catchment vs exposure-source population) must be reconciled or\n  the rates are non-comparable.\n- **Linked / external aggregates (Census, CDC WONDER, AQS pollution monitors):** Enable contextual covariate\n  adjustment (deprivation index, age structure) but introduce **misaligned geographies and periods** — pollution\n  monitors at point locations vs ZIP outcomes, ACS 5-year estimates vs single-year rates — requiring areal\n  interpolation that adds its own error.\n\n**Worked claims example.** Question: is higher community use of a long-acting opioid associated with the rate of\nopioid-overdose hospitalization? Substrate: 100% Medicare FFS Parts A/B/D, county × calendar-quarter cells.\n(1) Denominator: county-quarter FFS enrollee-quarters of person-time, excluding any MA-only person-time (so the\nrate is not distorted by county-varying MA penetration). (2) Exposure intensity per cell: sum of `days_supply`\nacross all Part D fills with the drug's NDC list, divided by enrollee-years, expressed per 1,000 enrollees.\n(3) Outcome rate per cell: count of inpatient stays (MedPAR) with a principal/secondary overdose `dx` code in the\nquarter, divided by the same person-time, per 100,000. (4) Suppress and flag cells with <11 events or <50\nenrollees (CMS rule) and decide a priori whether to drop or pool them — *do not* let suppression silently zero the\nnumerator. (5) Ecological regression: weighted least squares of the overdose rate on opioid `days_supply`\nintensity, weights = person-time, adjusting for county age structure, ADI/dual-eligible share, and quarter fixed\neffects, with county-clustered (or CAR) standard errors. 
(6) The fitted slope is a *county-level* contrast: a\npositive slope does **not** establish that the opioid-using individuals are the ones being hospitalized — high-use\ncounties may simply be older/sicker, and within a county the overdoses may occur disproportionately among people\n*without* a fill. State the result as ecological, treat it as hypothesis-generating, and follow with an\nindividual-level new-user cohort or a multilevel model on a linked sub-sample before any causal claim.",
    "primary_category": "Study_Design",
    "tags": [
      "ecological-study",
      "aggregate-data",
      "ecological-fallacy",
      "cross-level-bias",
      "group-level",
      "contextual-effects",
      "descriptive-epidemiology",
      "spatial-epidemiology"
    ],
    "applies_to_study_types": [
      "ecological"
    ],
    "data_sources": [
      "claims",
      "registry",
      "ehr",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1146/annurev.pu.16.050195.000425",
        "url": "https://doi.org/10.1146/annurev.pu.16.050195.000425",
        "citation_text": "Morgenstern H. Ecologic studies in epidemiology: concepts, principles, and methods. Annual Review of Public Health. 1995;16:61-81.",
        "year": 1995,
        "authors_short": "Morgenstern",
        "notes": "Foundational synthesis of ecological design types (multiple-group, time-trend, mixed), the components of ecological bias, and when group-level inference is defensible."
      },
      {
        "role": "introduce",
        "doi": "10.1093/oxfordjournals.aje.a117069",
        "url": "https://doi.org/10.1093/oxfordjournals.aje.a117069",
        "citation_text": "Greenland S, Robins J. Invited commentary: ecologic studies—biases, misconceptions, and counterexamples. American Journal of Epidemiology. 1994;139(8):747-760.",
        "year": 1994,
        "authors_short": "Greenland & Robins",
        "notes": "The canonical formal statement of cross-level (ecological) bias—confounding, effect modification, and aggregation/specification bias—and why the group-level slope generally does not equal the individual effect."
      },
      {
        "role": "explain",
        "doi": "10.1093/ije/30.6.1343",
        "url": "https://doi.org/10.1093/ije/30.6.1343",
        "citation_text": "Greenland S. Ecologic versus individual-level sources of bias in ecologic estimates of contextual health effects. International Journal of Epidemiology. 2001;30(6):1343-1350.",
        "year": 2001,
        "authors_short": "Greenland",
        "notes": "Distinguishes ecologic-level from individual-level bias sources when the target is a genuine contextual (group-level) effect—clarifies the cases where ecological designs are and are not appropriate."
      },
      {
        "role": "demonstrate",
        "doi": "10.1111/j.1467-985x.2004.02046.x",
        "url": "https://doi.org/10.1111/j.1467-985x.2004.02046.x",
        "citation_text": "Wakefield J. Ecological inference for 2 × 2 tables (with discussion). Journal of the Royal Statistical Society Series A. 2004;167(3):385-445.",
        "year": 2004,
        "authors_short": "Wakefield",
        "notes": "Formal treatment of ecological inference and the identifiability problem, including hybrid designs that combine aggregate data with individual sub-samples to mitigate ecological bias."
      }
    ],
    "plain_language_summary": "An ecological study compares whole groups instead of individual people. You take a summary number for each group (say, the percent of a state's adults who smoke) and line it up against another summary number for the same group (say, that state's lung-cancer deaths per 100,000 people), then look at whether the two move together across the groups. It is cheap and fast because it only needs published totals, never a record that links one person's exposure to that same person's outcome. The main limitation follows directly from that shortcut: a pattern that holds across groups does not tell you what is happening inside any group, so reading a group-level correlation as if it described individuals is a classic error called the ecological fallacy.",
    "key_terms": [
      {
        "term": "ecological fallacy",
        "definition": "The mistake of assuming that a relationship seen between group averages also holds for the individuals inside those groups."
      },
      {
        "term": "aggregate data",
        "definition": "Summary numbers for a whole group (a count, a percent, a rate) rather than one row per person."
      },
      {
        "term": "unit of analysis",
        "definition": "The 'thing' each row of your data represents; in an ecological study it is a group (a state or county), not a person."
      },
      {
        "term": "rate per 100,000",
        "definition": "How many events happened for every 100,000 people in the group, which lets you compare groups of different sizes fairly."
      },
      {
        "term": "correlation",
        "definition": "A number describing whether two measures tend to rise and fall together across the groups you are comparing."
      }
    ],
    "worked_example": {
      "scenario": "You want to know whether smoking is linked to lung cancer, but you only have published state-level totals, not records on individual people. For five states you pull two summary numbers each: the percent of adults who smoke and the lung-cancer death rate per 100,000 residents. You line them up across the five states and look at whether they move together. Watch what this can and cannot tell you.",
      "dataset": {
        "caption": "One row per state (the group), not per person. These are the only numbers an ecological analyst has here.",
        "columns": [
          "state",
          "pct_adults_smoke",
          "lung_cancer_deaths_per_100k"
        ],
        "rows": [
          [
            "State A",
            15,
            30
          ],
          [
            "State B",
            20,
            40
          ],
          [
            "State C",
            25,
            50
          ],
          [
            "State D",
            30,
            60
          ],
          [
            "State E",
            35,
            70
          ]
        ]
      },
      "steps": [
        "Each row is a whole state. There is no person in this table whose own smoking status sits next to their own cause of death; the link between an individual's exposure and their individual outcome was destroyed when the data were summed into state totals.",
        "Scan the two number columns across the five states: as the smoking percent climbs 15, 20, 25, 30, 35, the death rate climbs in lockstep 30, 40, 50, 60, 70. Every 5-point rise in smoking percent lines up with a 10-per-100k rise in the death rate.",
        "So at the group level the two move together perfectly: states with more smoking have more lung-cancer deaths. Because the five points lie exactly on a straight line, the group-level correlation is perfectly positive (r = 1).",
        "Here is the trap. This table cannot tell you that the smokers are the ones dying. In State E, the extra deaths could fall mostly on non-smokers, or the states could differ in age, air quality, or screening in ways the totals hide. The within-state link between a person's smoking and that person's outcome is simply not in the data.",
        "To claim 'a person who smokes has higher lung-cancer risk,' you would need individual records that pair each person's smoking status with their own outcome, not state averages."
      ],
      "result": "Across the five states the group-level correlation is perfect and positive: a 5-point rise in adult smoking percent tracks a 10-per-100k rise in the lung-cancer death rate. But this is a statement about states, not people. You cannot conclude from these group averages that smoking individuals are the ones who die of lung cancer; inferring that individual-level link from group-level numbers is the ecological fallacy."
    },
    "prerequisites": [
      "cross-sectional",
      "cohort-retrospective"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Multiple-group (cross-sectional) ecological",
        "description": "Exposure and outcome summaries compared across many areas (counties, states, hospitals) at one time point or pooled over a fixed period; the most common form, regressing area outcome rate on area exposure prevalence/intensity.",
        "edge_cases": [
          "Few groups give an underpowered, leverage-driven regression dominated by outliers.",
          "Cross-level confounding by area composition (age, deprivation, race/ethnicity) cannot be removed by adjusting for group means alone."
        ],
        "data_source_notes": "claims: build area×period cells with identical, fully-observed denominators; restrict to FFS person-time to avoid MA-penetration distortion of rates."
      },
      {
        "name": "Time-trend (longitudinal) ecological",
        "description": "A single population followed over multiple periods, correlating the time series of aggregate exposure with the time series of aggregate outcome; upgradeable to interrupted time series for a policy/launch.",
        "edge_cases": [
          "Secular trends and shared time-varying confounders (coding changes, diagnostic drift) confound the exposure–outcome correlation.",
          "Serial autocorrelation invalidates naive standard errors; use Newey-West (HAC) standard errors or model the error process explicitly (e.g., regression with ARIMA errors)."
        ],
        "data_source_notes": "Watch coding-system transitions (ICD-9 to ICD-10) and benefit/policy changes that shift event capture independent of true risk."
      },
      {
        "name": "Mixed / multilevel (panel) ecological",
        "description": "Area×time panel analyzed with fixed or random effects, so each area serves partly as its own control and unmeasured time-invariant area confounders are differenced out.",
        "edge_cases": [
          "Random-effects estimates are biased if the area effects correlate with exposure; check with a Hausman test and prefer fixed effects when in doubt.",
          "Spatial dependence across areas requires CAR/spatial-error specification, not just clustering."
        ],
        "data_source_notes": "linked: merge Census/ACS contextual covariates and reconcile geography/period alignment before modeling."
      },
      {
        "name": "Hybrid / semi-ecological (Wakefield) design",
        "description": "Aggregate data combined with an individual-level sub-sample (or known within-area exposure variance) to identify within-group effect modification and sharply reduce ecological bias.",
        "edge_cases": [
          "Requires at least partial individual data or strong external information on within-area distributions; identifiability is fragile when neither is available."
        ],
        "data_source_notes": "linked: a small individual-level cohort drawn from the same population can anchor the aggregate margins."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Individual-level retrospective cohort",
        "pros_of_this": "Far cheaper and faster, uses only aggregate tabulations, and can study contextual exposures with no individual variation (policy, formulary, pollution).",
        "cons_of_this": "Cannot in general recover the individual causal effect; exposed to cross-level confounding and aggregation bias that individual data would let you adjust away.",
        "when_to_prefer": "The exposure is genuinely group-level, or for rapid hypothesis generation/surveillance when individual data are unavailable."
      },
      {
        "compared_to": "Cross-sectional (individual-level)",
        "pros_of_this": "Broad population coverage and strong exposure contrast across areas; can use routinely published aggregates.",
        "cons_of_this": "Loses the individual joint exposure–outcome distribution, so individual associations are not identified.",
        "when_to_prefer": "When the causal target is a contextual effect rather than an individual one, or no individual exposure–outcome pairing is available."
      },
      {
        "compared_to": "Multilevel / hybrid design on linked data",
        "pros_of_this": "No need to obtain any individual records; minimal data-acquisition burden.",
        "cons_of_this": "Without within-area variance or an individual sub-sample, within-group effect modification and aggregation bias remain unidentified.",
        "when_to_prefer": "Only when no individual-level data of any kind can be obtained; otherwise the hybrid design is generally less biased."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Aggregate individual claims to area×period cells with one shared, fully-observed denominator (FFS enrollee person-time). Compute exposure as summed days_supply per 1,000 enrollee-years and outcomes as event counts per 100,000 enrollee-years; exclude MA-only person-time so rates are not driven by county-varying MA penetration. Honor small-cell suppression (<11) explicitly.",
      "ehr": "Aggregate to facility/catchment, but the denominator is the visiting (not resident) population and out-of-network care is invisible; use only with a stable closed-system catchment.",
      "registry": "Strong source of area-level outcome rates with defined catchment; weak for exposure—pair with an external exposure aggregate and reconcile the two denominators before computing rates.",
      "linked": "Merge Census/ACS/pollution-monitor contextual covariates for adjustment; reconcile misaligned geographies and periods (areal interpolation adds error) before the ecological regression."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\nimport numpy as np\nimport statsmodels.formula.api as smf\n\nOPIOID_NDCS = set(study_ndc_list)   # curated NDC list for the long-acting opioid\nMIN_CELL_EVENTS = 11                 # CMS small-cell suppression threshold\n\ndef to_quarter(s):\n    return s.dt.to_period(\"Q\").astype(str)\n\ndef build_ecological_cells(rx, events, denom, ctx):\n    # Exposure numerator: total days_supply of the study drug per county-quarter.\n    rx = rx[rx[\"ndc\"].isin(OPIOID_NDCS)].copy()\n    rx[\"quarter\"] = to_quarter(rx[\"fill_date\"])\n    exp = (rx.groupby([\"county_fips\", \"quarter\"])[\"days_supply\"]\n             .sum().reset_index(name=\"ds_total\"))\n\n    # Outcome numerator: count of overdose inpatient stays per county-quarter.\n    events = events.copy()\n    events[\"quarter\"] = to_quarter(events[\"admit_date\"])\n    out = (events.groupby([\"county_fips\", \"quarter\"]).size()\n                 .reset_index(name=\"n_events\"))\n\n    # One shared, fully-observed FFS denominator (MA-only person-time already excluded upstream).\n    cells = (denom.merge(exp, on=[\"county_fips\", \"quarter\"], how=\"left\")\n                  .merge(out, on=[\"county_fips\", \"quarter\"], how=\"left\")\n                  .merge(ctx, on=[\"county_fips\", \"quarter\"], how=\"left\"))\n    cells[[\"ds_total\", \"n_events\"]] = cells[[\"ds_total\", \"n_events\"]].fillna(0)\n\n    # CMS small-cell suppression: drop unstable cells rather than letting them read as rate 0.\n    cells = cells[cells[\"n_events\"] >= MIN_CELL_EVENTS].copy()\n\n    enrollee_years = cells[\"enrollee_quarters\"] / 4.0\n    cells[\"exp_per_1k\"] = cells[\"ds_total\"] / enrollee_years * 1_000      # exposure intensity\n    cells[\"rate_per_100k\"] = cells[\"n_events\"] / enrollee_years * 100_000  # outcome rate\n    return cells\n\ndef fit_ecological(cells):\n    # Person-time-weighted WLS with quarter fixed effects and county-clustered SEs.\n    m = smf.wls(\n        
\"rate_per_100k ~ exp_per_1k + pct_age_ge75 + adi_score + pct_dual + C(quarter)\",\n        data=cells, weights=cells[\"enrollee_quarters\"],\n    ).fit(cov_type=\"cluster\", cov_kwds={\"groups\": cells[\"county_fips\"]})\n    # m.params['exp_per_1k'] is a COUNTY-LEVEL slope, NOT an individual effect (ecological fallacy risk).\n    return m",
        "description": "Build county-quarter ecological cells from individual claims and fit a person-time-weighted ecological\nregression. Required inputs (already cleaned, de-duplicated, FFS-only person-time):\n  rx     : Part D fills          -> person_id, fill_date (datetime), county_fips, ndc, days_supply\n  events : overdose inpatient    -> person_id, admit_date (datetime), county_fips  (one row per qualifying stay)\n  denom  : FFS person-time       -> county_fips, quarter (str 'YYYYQn', matching to_quarter output), enrollee_quarters, ma_only_excluded (bool)\n  ctx    : area covariates       -> county_fips, quarter (str 'YYYYQn'), pct_age_ge75, adi_score, pct_dual\nOPIOID_NDCS is the curated study NDC list. The quarter keys must be strings in all inputs, otherwise the merges on\n(county_fips, quarter) will not match. Cells with <11 events are suppressed (CMS rule); the regression\nestimates a COUNTY-LEVEL slope and must be reported as ecological, not as an individual relative risk.",
        "dependencies": [
          "pandas",
          "numpy",
          "statsmodels"
        ],
        "source_citations": [
          "greenland-robins-1994"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\nlibrary(lmtest)\nlibrary(sandwich)\n\nOPIOID_NDCS <- study_ndc_list   # curated NDC list\nMIN_CELL_EVENTS <- 11L          # CMS small-cell suppression\n\nqtr <- function(d) paste0(data.table::year(d), \"Q\", data.table::quarter(d))\n\nbuild_ecological_cells <- function(rx, events, denom, ctx) {\n  setDT(rx); setDT(events); setDT(denom); setDT(ctx)\n\n  rx <- rx[ndc %chin% OPIOID_NDCS]\n  rx[, quarter := qtr(fill_date)]\n  exp <- rx[, .(ds_total = sum(days_supply)), by = .(county_fips, quarter)]\n\n  events[, quarter := qtr(admit_date)]\n  out <- events[, .(n_events = .N), by = .(county_fips, quarter)]\n\n  cells <- Reduce(function(a, b) merge(a, b, by = c(\"county_fips\", \"quarter\"), all.x = TRUE),\n                  list(denom, exp, out, ctx))\n  cells[is.na(ds_total), ds_total := 0]\n  cells[is.na(n_events), n_events := 0]\n  cells <- cells[n_events >= MIN_CELL_EVENTS]          # suppress unstable cells\n\n  cells[, enrollee_years := enrollee_quarters / 4]\n  cells[, exp_per_1k := ds_total / enrollee_years * 1000]\n  cells[, rate_per_100k := n_events / enrollee_years * 1e5]\n  cells[]\n}\n\nfit_ecological <- function(cells) {\n  fit <- lm(rate_per_100k ~ exp_per_1k + pct_age_ge75 + adi_score + pct_dual + factor(quarter),\n            data = cells, weights = enrollee_quarters)\n  # County-clustered SEs; exp_per_1k is a COUNTY-LEVEL slope (ecological), not an individual effect.\n  ct <- coeftest(fit, vcov = vcovCL, cluster = ~ county_fips)\n  list(fit = fit, coeftest = ct)\n}",
        "description": "County-quarter ecological cell construction and person-time-weighted regression in R. Inputs mirror the\nPython version:\n  rx     : person_id, fill_date (Date), county_fips, ndc, days_supply\n  events : person_id, admit_date (Date), county_fips\n  denom  : county_fips, quarter (chr 'YYYYQn'), enrollee_quarters\n  ctx    : county_fips, quarter, pct_age_ge75, adi_score, pct_dual\nlmtest/sandwich give county-clustered SEs; the exp_per_1k coefficient is a county-level slope, not an\nindividual relative risk.",
        "dependencies": [
          "data.table",
          "lmtest",
          "sandwich"
        ],
        "source_citations": [
          "greenland-robins-1994"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* Exposure numerator: total study-drug days_supply per county-quarter. */\nproc sql;\n  create table exp as\n  select r.county_fips,\n         cats(year(r.fill_date), 'Q', qtr(r.fill_date)) as quarter length=6,\n         sum(r.days_supply) as ds_total\n  from work.rx r\n  where r.ndc in (select ndc from work.opioid_ndc)\n  group by r.county_fips, calculated quarter;\n\n/* Outcome numerator: overdose inpatient stays per county-quarter. */\n  create table out as\n  select e.county_fips,\n         cats(year(e.admit_date), 'Q', qtr(e.admit_date)) as quarter length=6,\n         count(*) as n_events\n  from work.events e\n  group by e.county_fips, calculated quarter;\n\n/* One shared FFS denominator; left-join exposure, outcome, and contextual covariates. */\n  create table cells as\n  select d.county_fips, d.quarter, d.enrollee_quarters,\n         coalesce(x.ds_total, 0)  as ds_total,\n         coalesce(o.n_events, 0)  as n_events,\n         c.pct_age_ge75, c.adi_score, c.pct_dual\n  from work.denom d\n  left join exp x on d.county_fips = x.county_fips and d.quarter = x.quarter\n  left join out o on d.county_fips = o.county_fips and d.quarter = o.quarter\n  left join work.ctx c on d.county_fips = c.county_fips and d.quarter = c.quarter;\nquit;\n\n/* Suppress unstable cells and build rates per person-time. */\ndata cells;\n  set cells;\n  if n_events < 11 then delete;                 /* CMS small-cell suppression */\n  enrollee_years = enrollee_quarters / 4;\n  exp_per_1k     = ds_total / enrollee_years * 1000;\n  rate_per_100k  = n_events / enrollee_years * 100000;\nrun;\n\n/* Person-time-weighted ecological regression with county-clustered (GEE) SEs. 
*/\nproc genmod data=cells;\n  class county_fips quarter;\n  weight enrollee_years;\n  model rate_per_100k = exp_per_1k pct_age_ge75 adi_score pct_dual quarter\n        / dist=normal link=identity;\n  repeated subject=county_fips / type=ind;       /* robust (sandwich) SEs clustered on county */\n  /* exp_per_1k estimate is a COUNTY-level slope, NOT an individual effect (ecological fallacy). */\nrun;",
        "description": "County-quarter ecological cell construction (PROC SQL) and a person-time-weighted ecological regression\n(PROC GENMOD with cluster-robust SEs via REPEATED). Required input datasets (post data-management, FFS-only):\n  work.rx     : person_id, fill_date, county_fips, ndc, days_supply\n  work.events : person_id, admit_date, county_fips\n  work.denom  : county_fips, quarter (char 'YYYYQn'), enrollee_quarters\n  work.ctx    : county_fips, quarter, pct_age_ge75, adi_score, pct_dual\nwork.opioid_ndc holds the curated study NDC list. Cells with <11 events are suppressed (CMS rule). The\nexp_per_1k coefficient is a COUNTY-level slope (ecological), not an individual relative risk.",
        "dependencies": [],
        "source_citations": [
          "greenland-robins-1994"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Ind[\"Individual claims (person-level)<br/>person_id, fill_date, days_supply, dx, county_fips\"] --> Agg[\"Aggregate to area x period cells<br/>county x quarter\"]\n  Agg --> Num[\"Exposure numerator: sum(days_supply)/1k enrollees<br/>Outcome numerator: event count/100k\"]\n  Denom[\"Shared FFS person-time denominator<br/>(exclude MA-only person-time)\"] --> Num\n  Num --> Supp[\"Apply small-cell suppression (events < 11)\"]\n  Supp --> Reg[\"Person-time-weighted ecological regression<br/>+ area covariates, quarter FE, clustered/CAR SEs\"]\n  Reg --> Slope[\"County-level slope = ECOLOGICAL estimand<br/>NOT the individual effect\"]",
        "caption": "Data flow from individual claims to area-period cells to an ecological regression. Joint individual exposure-outcome information is destroyed at aggregation, which is why the fitted slope is a group-level contrast, not an individual effect.",
        "alt_text": "Flowchart showing individual claims aggregated to county-quarter cells, numerators divided by a shared FFS denominator, small-cell suppression, a weighted ecological regression, and a county-level slope labelled as the ecological estimand rather than the individual effect.",
        "source_type": "illustrative",
        "source_citations": [
          "greenland-robins-1994"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  X[\"Individual exposure (Xi)\"] --> Y[\"Individual outcome (Yi)\"]\n  C[\"Area composition / context<br/>(age, deprivation, race-mix)\"] --> Xbar[\"Group-mean exposure (X-bar)\"]\n  C --> Ybar[\"Group-mean outcome (Y-bar)\"]\n  X --> Xbar\n  Y --> Ybar\n  M[\"Within-area effect modifier<br/>(unobserved in aggregate)\"] -. \"modifies X to Y\" .-> Y\n  Xbar -- \"ecological slope (what you estimate)\" --> Ybar\n  X -- \"individual effect (what you want)\" --> Y",
        "caption": "Why the ecological slope diverges from the individual effect. Area composition C is a cross-level confounder of the group means, and a within-area effect modifier M (invisible in aggregate data) makes the group-level slope a non-causal average that can be attenuated, inflated, or sign-reversed relative to the individual exposure-outcome effect.",
        "alt_text": "Directed graph showing individual exposure causing individual outcome, area composition confounding both group means, and an unobserved within-area effect modifier, illustrating cross-level bias between the ecological slope and the individual effect.",
        "source_type": "illustrative",
        "source_citations": [
          "greenland-robins-1994"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "see_also",
        "target_slug": "descriptive-epidemiology-rwe",
        "notes": "Ecological correlations are a staple of descriptive/area-level surveillance; both summarize populations rather than estimate individual causal effects."
      },
      {
        "relation_type": "alternative_to",
        "target_slug": "cross-sectional",
        "notes": "Both are snapshot designs, but the cross-sectional study measures individuals (joint exposure-outcome observed) while the ecological study measures groups, trading individual identifiability for population coverage."
      },
      {
        "relation_type": "alternative_to",
        "target_slug": "cohort-retrospective",
        "notes": "An individual-level cohort answers similar questions without ecological bias when individual data exist; ecological is the fallback for contextual exposures or aggregate-only data."
      },
      {
        "relation_type": "used_with",
        "target_slug": "direct-standardization-rwe",
        "notes": "Area/period rates should be age- and composition-standardized before being contrasted, so the ecological comparison is not driven by demographic mix."
      },
      {
        "relation_type": "see_also",
        "target_slug": "cluster-randomized",
        "notes": "Cluster-randomized trials assign the intervention at the group level but usually measure outcomes on individuals and analyze them with clustering accounted for; ecological studies both measure and analyze at the group level, with no individual linkage."
      },
      {
        "relation_type": "see_also",
        "target_slug": "difference-in-differences-staggered-adoption-rwe",
        "notes": "Time-trend ecological designs evaluating policy roll-outs are best upgraded to difference-in-differences or interrupted time series with multiple pre/post periods and controls."
      }
    ],
    "aliases": [
      "ecologic study",
      "aggregate study",
      "aggregate-data study",
      "group-level study",
      "correlational study"
    ],
    "complexity": "foundational",
    "regulatory_relevance": [
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "ehr-phenotyping-algorithms-rwe",
    "name": "EHR Phenotyping Algorithms",
    "short_definition": "Rule-based or statistical algorithms that combine structured EHR fields (diagnoses, procedures, labs, medication orders, vitals, encounters) and sometimes unstructured notes into a reproducible, validated computable definition of a disease, condition, or outcome for a defined cohort and time window.",
    "long_description": "An **EHR phenotyping algorithm** (computable phenotype) is the operational rule that turns raw clinical\ndata into a binary (or graded) statement that a specific patient has — or does not have — a condition of\ninterest, as of a specific date. In RWE it is the machinery behind cohort entry criteria, exposure proxies,\ncovariates, and especially **outcome ascertainment**: the place where a study's internal validity is most\noften won or lost. A phenotype is not a code list; it is a code list *plus* logic (counts, positions,\ntime windows, confirmatory labs/medications, exclusions) *plus* a measured operating characteristic\n(PPV, sensitivity, specificity) against a reference standard.\n\n**Core conceptual distinction** — two things are separable and must not be conflated. (1) *The definition*\n— the computable rule (e.g., \"≥1 inpatient OR ≥2 outpatient ICD codes ≥30 days apart, plus a confirming\nHbA1c ≥6.5% or an antidiabetic dispensing\"). (2) *The performance* — how that rule behaves against a\nvalidated reference (chart review, adjudication, registry, or a trusted linked source), expressed as\npositive predictive value (PPV), sensitivity, and specificity in the *target* population and era. The\nestimand the phenotype feeds also matters: a phenotype used to *count incident outcome events* needs high\nPPV and a defensible *incident-date* rule, whereas a phenotype used to *define a denominator/cohort* trades\noff sensitivity (completeness) against PPV (purity). The same code list can be an excellent outcome\ndefinition and a poor cohort definition. 
Crucially, error in a phenotype is **measurement error /\nmisclassification**, and its direction matters: nondifferential outcome misclassification of a binary\noutcome biases a risk ratio toward the null in expectation, but *differential* misclassification by exposure (e.g., the\nexposed are surveilled more, so their outcomes are coded more completely) can bias in either direction and\nis not \"conservative.\"\n\n**Pros, cons, and trade-offs**\n- **vs a single-code \"any diagnosis\" definition:** A multi-component phenotype (counts + positions +\n  confirmatory lab/Rx + exclusions) typically raises PPV substantially and is defensible to FDA/EMA/HTA reviewers.\n  Cost: more programming, more judgment-dependent thresholds, and reduced sensitivity (you discard true\n  cases that lack the confirmatory element). **Prefer the richer rule** for any consequential safety,\n  effectiveness, or regulatory-grade outcome; reserve the single-code rule for hypothesis generation.\n- **vs NLP / machine-learning phenotyping (e.g., PheNorm, APHRODITE,\n  high-throughput models):** ML/NLP phenotypes can recover cases that structured codes miss (notes mention\n  a condition never coded) and scale across sites. Cost: they are harder to transport, harder to explain to\n  a regulator, and can encode site-specific documentation artifacts; a model trained where the exposed are\n  documented more will *manufacture* differential misclassification. **Prefer transparent rule-based\n  phenotypes** for regulatory submissions and multi-site networks (PheKB-style portable logic); reserve\n  NLP/ML for conditions poorly captured in structured data, and validate them per-site.\n- **vs trusting an adjudicated registry/endpoint directly:** Adjudicated outcomes are the reference standard\n  and remove ascertainment ambiguity. Cost: registries are expensive, incomplete for exposure/follow-up,\n  and may not cover the full source population. 
**Prefer the registry** as the validation gold standard and\n  for adjudicated outcomes; use the EHR/claims phenotype for the scalable cohort and follow-up, ideally\n  *anchored* to a validation substudy that estimates PPV/sensitivity for bias correction.\n\n**When to use** — any RWE study where the condition, exposure proxy, covariate, or outcome must be derived\nfrom routinely collected EHR/claims data; when no adjudicated source exists for the full cohort; when you\nneed a portable, reproducible definition across sites or databases; and whenever a quantitative bias\nanalysis will use measured PPV/sensitivity to correct for misclassification. A phenotype is mandatory when\nthe endpoint is the study's primary outcome — pair it with a validation substudy and a pre-specified\nincident-event date rule.\n\n**When NOT to use — and when it is actively misleading or dangerous**\n- **No validation and no plausible reference.** Reporting an effect estimate from an unvalidated phenotype\n  with unknown PPV is uninterpretable; if you cannot estimate operating characteristics in the study era\n  and population, the algorithm is a black box, not a measurement.\n- **Differential misclassification by exposure is plausible and unaddressed.** If the exposed are seen,\n  tested, or documented more (surveillance/detection bias — e.g., a drug requiring lab monitoring), a\n  phenotype that depends on testing will over-ascertain outcomes in the exposed and *fabricate* an\n  association. This is the dangerous case: the bias is not toward the null and balance tables will not\n  reveal it. Use exposure-blind ascertainment, negative-control outcomes, and quantitative bias analysis.\n- **Borrowing a phenotype across eras or data models without re-validation.** ICD-9→ICD-10 transition,\n  coding-incentive changes (e.g., risk-adjustment coding inflation), a new EHR vendor, or a new\n  care-delivery model can move PPV by tens of points. 
A phenotype validated in one database/year is a\n  hypothesis, not a fact, in another.\n- **High-sensitivity cohort rule used as a high-PPV outcome rule (or vice versa).** Using a loose \"any\n  code\" denominator rule to *count events* inflates the numerator with false positives; using a strict\n  confirmatory rule to define a *denominator* discards real patients and distorts incidence.\n\n**Data-source operational depth**\n- **Claims (FFS vs Medicare Advantage):** Only billed encounters generate codes, so phenotypes ride on\n  reimbursement behavior, not clinical truth. Confirmatory **labs and vitals are usually absent** from\n  claims (no result values), so you lean on the 1-inpatient / 2-outpatient (1-IP/2-OP) rules plus drug\n  dispensings as proxies. Failure modes: **MA-only person-time often lacks complete FFS-style encounter\n  claims** in older research extracts, so a \"no diagnosis\" patient may be unobserved rather than truly\n  negative — restrict to enrollees with full medical+pharmacy benefit or treat MA spans as missing.\n  **Rule-out codes** (a diagnosis billed to justify a test that came back negative) inflate false positives,\n  which is why the 2-OP-codes-≥30-days-apart rule exists. Claims lag, reversals, and bundled/capitated\n  services drop or delay codes; require continuous enrollment so absence of a code is informative.\n- **EHR:** Richer (labs with values, vitals, problem lists, orders, notes) so confirmatory logic\n  (HbA1c ≥6.5%, eGFR, a positive culture) and NLP become possible — a genuine advantage over claims. But\n  capture is **encounter-driven and leaky**: care delivered outside the system (an ER visit at another\n  hospital, an outside lab) is invisible, so a patient who \"leaves\" looks event-free. Site workflow\n  variation (who codes, which template, structured vs free-text) makes a single rule behave differently\n  per site; problem lists are notoriously stale (conditions never resolved). 
Define observation windows\n  explicitly and treat loss to follow-up as potentially informative.\n- **Registry:** Strongest for adjudicated, clinically rich case definitions (cancer stage, MI by universal\n  definition) — the natural reference standard — but typically incomplete for full exposure history and\n  longitudinal follow-up, and limited to the registered population. Use it to *validate* the EHR/claims\n  phenotype and to supply adjudicated outcomes; link to claims for fills and to a death index for censoring.\n- **Linked claims–EHR–registry/vital-records:** The ideal substrate: EHR labs/notes sharpen the rule,\n  claims fill capture gaps (outside care that was billed), and registry/death records adjudicate. Cost:\n  linkage selects the linkable subset (selection bias), and order/result/service-date discrepancies must be\n  reconciled before assigning an incident-event date. In the **elderly**, differential **competing risks**\n  (death by exposure group) interact with phenotyping: a group that dies sooner has less opportunity to\n  accrue the coded outcome, so naive cause-specific counts understate true burden — model competing risks\n  explicitly when the phenotype feeds an incidence estimand.\n\n**Worked claims example (incident type 2 diabetes outcome phenotype + PPV/sensitivity validation).**\nGoal: ascertain *incident* T2DM as an outcome during follow-up in a commercial+Medicare FFS database, then\nvalidate it. (1) **Code logic:** a case requires ≥1 *inpatient* T2DM diagnosis (ICD-10 E11.x in any\nposition) **OR** ≥2 *outpatient* T2DM diagnoses on **different dates ≥30 days apart** (the 1-IP/2-OP rule;\nthe 30-day separation suppresses single \"rule-out\" codes), **OR** ≥1 outpatient T2DM diagnosis **plus** a\ndispensing of a non-metformin antidiabetic within 60 days. 
(2) **Incident-date rule:** event date = the date of the\n*first* code in the qualifying set (for the 2-OP rule, the earlier of the two codes), and incidence requires\na clean lookback — **365 days of continuous medical+pharmacy\nenrollment with no T2DM code and no antidiabetic dispensing** before that date (washout makes the case\nincident, not prevalent; MA-only spans in the lookback are treated as unobserved, not as \"no disease\").\n(3) **Exposure-blind ascertainment:** apply the identical rule and lookback to both treatment arms so any\nsurveillance differences are not amplified. (4) **Validation substudy:** sample N=200 algorithm-positive\ncharts; chart review (reference standard) confirms 170 true cases → **PPV = 170/200 = 0.85**. Separately,\namong a sample with adjudicated/linked truth you find the algorithm flagged 170 of 200 true cases →\n**sensitivity = 170/200 = 0.85** (specificity comes from the true negatives in the same sample). (5) **Bias\ncorrection:** with PPV and sensitivity, correct the observed counts (e.g., true positives ≈ observed\npositives × PPV; total true cases ≈ true positives / sensitivity) and propagate uncertainty via a\nquantitative bias analysis. (6)\n**Sensitivity analyses:** vary the OP-code window (30 vs 90 days), require vs drop the confirmatory Rx,\nre-estimate PPV in the ICD-10 era separately, and run a negative-control outcome to detect residual\ndifferential ascertainment.\n\n**Interpreting the output.** The output of an EHR phenotyping algorithm is an algorithm-positive cohort\nwith documented operating characteristics. In the T2DM example, chart review of 200 sampled\nalgorithm-positive patients confirmed 170 true cases, yielding PPV = 0.85. This means an estimated 85 out of\nevery 100 patients flagged by this rule genuinely had incident T2DM in the study population and era — 15 were\nfalse positives driven by rule-out codes or documentation noise.\n\nFormal interpretation: the algorithm produces a cohort of incident-T2DM candidates where PPV = 0.85\n(95% CI approximately 0.80 to 0.90, from the 200-chart substudy). 
Sensitivity — the share of all true T2DM cases in the database\nthat the algorithm captured — was separately estimated at 0.85 in a gold-standard-linked subsample.\nThese two statistics are conceptually independent: a PPV of 0.85 from chart review of flagged cases\ntells you nothing directly about sensitivity, which additionally requires a sample whose true status is\nknown regardless of algorithm status (including algorithm-negative patients). Nondifferential false\npositives (a PPV below 1.0) typically bias risk-ratio estimates\ntoward the null; differential misclassification (if the exposed are surveilled more) can bias in\neither direction.\n\nPractical interpretation: before proceeding to any comparative analysis, apply bias correction\n(true positives ≈ observed positives × PPV) and quantify residual uncertainty via a Monte Carlo\nbias analysis. For regulatory or HTA submissions, document the PPV substudy design (sampling frame,\nreviewer credentials, adjudication rules) alongside the algorithm code in the SAP appendix.",
    "primary_category": "Outcome_Measure",
    "tags": [
      "computable-phenotype",
      "outcome-ascertainment",
      "case-definition",
      "ppv-sensitivity-validation",
      "code-list",
      "misclassification",
      "ehr-phenotyping",
      "claims-algorithm"
    ],
    "applies_to_study_types": [
      "claims_analysis",
      "ehr_study",
      "registry_linkage",
      "target_trial_emulation",
      "comparative_effectiveness"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1136/amiajnl-2013-001935",
        "url": "https://doi.org/10.1136/amiajnl-2013-001935",
        "citation_text": "Shivade C, Raghavan P, Fosler-Lussier E, et al. A review of approaches to identifying patient phenotype cohorts using electronic health records. Journal of the American Medical Informatics Association. 2014;21(2):221-230.",
        "year": 2014,
        "authors_short": "Shivade et al.",
        "notes": "Canonical taxonomy of EHR phenotyping approaches (rule-based, statistical, NLP) and the data, logic, and validation components that define a computable phenotype."
      },
      {
        "role": "explain",
        "doi": "10.1371/journal.pmed.1001885",
        "url": "https://doi.org/10.1371/journal.pmed.1001885",
        "citation_text": "Benchimol EI, Smeeth L, Guttmann A, et al. The REporting of studies Conducted using Observational Routinely-collected health Data (RECORD) Statement. PLoS Medicine. 2015;12(10):e1001885.",
        "year": 2015,
        "authors_short": "Benchimol et al.",
        "notes": "Reporting standard requiring transparent code lists, algorithm logic, and validation for studies using routinely collected health data."
      },
      {
        "role": "explain",
        "doi": "10.1136/bmj.h5527",
        "url": "https://doi.org/10.1136/bmj.h5527",
        "citation_text": "Bossuyt PM, Reitsma JB, Bruns DE, et al. STARD 2015: an updated list of essential items for reporting diagnostic accuracy studies. BMJ. 2015;351:h5527.",
        "year": 2015,
        "authors_short": "Bossuyt et al.",
        "notes": "Diagnostic-accuracy reporting framework underpinning how a phenotype's PPV, sensitivity, and specificity should be estimated and reported against a reference standard."
      },
      {
        "role": "demonstrate",
        "doi": "10.1136/amiajnl-2012-000896",
        "url": "https://doi.org/10.1136/amiajnl-2012-000896",
        "citation_text": "Newton KM, Peissig PL, Kho AN, et al. Validation of electronic medical record-based phenotyping algorithms: results and lessons learned from the eMERGE network. Journal of the American Medical Informatics Association. 2013;20(e1):e147-e154.",
        "year": 2013,
        "authors_short": "Newton et al.",
        "notes": "Multi-site validation of rule-based EHR phenotypes showing how PPV varies by condition and site and why local re-validation is required for transportability."
      },
      {
        "role": "demonstrate",
        "doi": "10.1093/jamia/ocv202",
        "url": "https://doi.org/10.1093/jamia/ocv202",
        "citation_text": "Kirby JC, Speltz P, Rasmussen LV, et al. PheKB: a catalog and workflow for creating electronic phenotype algorithms for transportability. Journal of the American Medical Informatics Association. 2016;23(6):1046-1052.",
        "year": 2016,
        "authors_short": "Kirby et al.",
        "notes": "Describes a shared library and workflow for portable, version-controlled computable phenotypes, the standard for reproducible transportable phenotyping logic."
      }
    ],
    "plain_language_summary": "An EHR phenotyping algorithm is a set of rules that scans a patient's electronic health record — their diagnosis codes, lab results, and prescriptions — and decides whether that patient counts as having a specific condition for a study. Instead of trusting a single billing code (which may have been entered just to justify a test), a phenotype combines multiple pieces of evidence so the definition is more accurate. Because no algorithm is perfect, researchers always check their rules against a small sample of real patient charts to measure how often the algorithm is right — a number called the positive predictive value.",
    "key_terms": [
      {
        "term": "phenotype",
        "definition": "In data research, a phenotype is a precise, computer-readable definition of who has a disease or condition, built from codes, lab values, and other data fields in a health record."
      },
      {
        "term": "ICD-10 code",
        "definition": "A standardized billing code that a clinician enters to describe a diagnosis — for example, E11.9 means type 2 diabetes without complications."
      },
      {
        "term": "positive predictive value (PPV)",
        "definition": "The share of patients the algorithm flags as having the condition who truly have it when their charts are checked by hand — a measure of how precise the rule is."
      },
      {
        "term": "rule-based phenotype",
        "definition": "A phenotype built from explicit if-then logic: a patient qualifies if they meet a fixed combination of codes, thresholds, or time requirements, with no machine learning involved."
      },
      {
        "term": "probabilistic phenotype",
        "definition": "A phenotype that uses a statistical model to score each patient's likelihood of having the condition, combining many weak signals rather than hard cutoffs."
      },
      {
        "term": "chart review",
        "definition": "Manual inspection of a patient's actual clinical records by a clinician to verify whether a condition was truly present — the gold standard used to validate an algorithm."
      }
    ],
    "worked_example": {
      "scenario": "A researcher wants to identify patients newly diagnosed with type 2 diabetes in a hospital's EHR. Rather than accepting any single billing code at face value, the team writes a rule-based phenotype that requires three things to be true at the same time: (1) at least one ICD-10 code for type 2 diabetes on record, (2) a hemoglobin A1c lab result at or above 6.5 percent, and (3) a prescription for a diabetes medication. Five patients are in the candidate pool. The table below shows which criteria each patient meets and whether the algorithm flags them.",
      "dataset": {
        "caption": "Candidate patients and which phenotype criteria each meets (Y = yes, N = no).",
        "columns": [
          "person_id",
          "has_T2DM_code",
          "HbA1c_ge_6.5pct",
          "diabetes_rx",
          "flagged_by_algorithm"
        ],
        "rows": [
          [
            "P001",
            "Y",
            "Y",
            "Y",
            "Y"
          ],
          [
            "P002",
            "Y",
            "N",
            "Y",
            "N"
          ],
          [
            "P003",
            "N",
            "Y",
            "Y",
            "N"
          ],
          [
            "P004",
            "Y",
            "Y",
            "N",
            "N"
          ],
          [
            "P005",
            "Y",
            "Y",
            "Y",
            "Y"
          ]
        ]
      },
      "steps": [
        "The algorithm requires ALL THREE criteria: a type 2 diabetes ICD-10 code AND an HbA1c lab result at or above 6.5 percent AND an active diabetes prescription. This is rule-based phenotyping — the logic is written out as explicit if-then conditions with no statistical model involved.",
        "P001 has all three: a diabetes code, a qualifying HbA1c, and a diabetes drug — flagged.",
        "P002 has a diabetes code and a prescription but the HbA1c is below 6.5 percent, so the lab criterion fails — not flagged.",
        "P003 has a qualifying HbA1c and a prescription but no diabetes code at all — not flagged.",
        "P004 has a diabetes code and a qualifying HbA1c but no prescription on record — not flagged.",
        "P005 meets all three criteria — flagged.",
        "Result: 2 patients (P001, P005) are flagged out of 5 candidates.",
        "A probabilistic approach would instead train a model to score each patient from 0 to 1 and pick a cutoff — useful when conditions rarely appear in structured codes and notes must be read — but harder to explain to a regulator than the transparent rule above.",
        "To validate the rule, the team randomly selects the 2 flagged patients (and perhaps some non-flagged ones) and has a clinician read their actual charts. If both flagged patients are confirmed as true type 2 diabetics, the PPV = 2 true positives / 2 flagged = 1.00, meaning the algorithm was correct every time it fired. With more patients, a realistic PPV might be 0.85, meaning 15 percent of flags are false alarms — which the research team would report and could correct for mathematically."
      ],
      "result": "2 of 5 candidates are flagged (P001 and P005). Both meet all three criteria. If a clinician confirms both are true type 2 diabetics, PPV = 2/2 = 1.00 in this small sample. In a realistic study with hundreds of flagged patients, PPV is estimated from a chart-review sample of around 200 and typically falls between 0.80 and 0.90 for a well-designed multi-criterion rule."
    },
    "prerequisites": [
      "outcome-algorithm-construction-rwe",
      "diagnosis-phenotype-algorithm-1ip-2op-time-window-rwe",
      "claims-outcome-algorithm-ppv-sensitivity-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Rule-based structured-code phenotype (1-IP / 2-OP)",
        "description": "Deterministic logic on diagnosis/procedure codes — e.g., one inpatient code or two outpatient codes separated by a minimum interval, with optional position and setting requirements.",
        "edge_cases": [
          "Rule-out codes billed to justify a test inflate false positives; the 2-OP-codes-≥30-days-apart rule exists to suppress them.",
          "ICD-9-to-ICD-10 transition changes code granularity and prevalence; validate eras separately.",
          "Single inpatient code may be a comorbidity coded for risk adjustment rather than the index condition."
        ],
        "data_source_notes": "claims: rely on code counts/positions and continuous enrollment (no lab values available); EHR: cross-check against problem list and encounter type."
      },
      {
        "name": "Lab- or medication-confirmed phenotype",
        "description": "Adds a confirmatory result threshold (e.g., HbA1c ≥6.5%, positive culture, eGFR cutoff) or a disease-specific dispensing to raise PPV beyond codes alone.",
        "edge_cases": [
          "Labs/results are usually absent from claims, so this variant is EHR- or linked-data-dependent.",
          "Confirmatory testing is itself exposure-dependent (monitored drugs), risking differential ascertainment.",
          "Medication confirmation fails for off-label or shared-indication drugs (metformin for PCOS, beta-blockers for many indications)."
        ],
        "data_source_notes": "ehr/linked: use result values and order timing; claims: substitute disease-specific drug dispensings as a proxy for the missing lab."
      },
      {
        "name": "NLP / statistical (high-throughput) phenotype",
        "description": "Uses note text features or supervised/weakly-supervised models (e.g., PheNorm, APHRODITE) to classify phenotype status, recovering cases never captured in structured codes.",
        "edge_cases": [
          "Models encode site-specific documentation behavior and transport poorly without per-site recalibration.",
          "Harder to explain and defend to regulators than transparent rules.",
          "Can manufacture differential misclassification if documentation density differs by exposure."
        ],
        "data_source_notes": "ehr: requires note access and a labeled reference set; validate operating characteristics at every deployment site."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Single-code \"any diagnosis\" definition",
        "pros_of_this": "Multi-component logic (counts, positions, confirmatory lab/Rx, exclusions) markedly raises PPV and is defensible for regulatory or HTA use.",
        "cons_of_this": "More programming and judgment-dependent thresholds, and lower sensitivity because true cases lacking the confirmatory element are dropped.",
        "when_to_prefer": "Any consequential safety, effectiveness, utilization, or regulatory-grade outcome; reserve single-code rules for hypothesis generation."
      },
      {
        "compared_to": "NLP / machine-learning phenotyping",
        "pros_of_this": "Transparent, portable, version-controllable logic that regulators and multi-site networks can audit and reproduce.",
        "cons_of_this": "Misses cases captured only in free text and cannot scale judgment across heterogeneous note styles.",
        "when_to_prefer": "Regulatory submissions and multi-database networks; use NLP/ML for conditions poorly captured in structured fields, with per-site validation."
      },
      {
        "compared_to": "Direct use of an adjudicated registry endpoint",
        "pros_of_this": "Scales to the full source population and full follow-up at low marginal cost.",
        "cons_of_this": "Carries measurement error with unknown direction unless validated; registry adjudication is the true reference standard.",
        "when_to_prefer": "When no adjudicated source covers the whole cohort; anchor to a registry/chart-review validation substudy for bias correction."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Codes ride on reimbursement, not clinical truth. Lab result values are absent, so lean on 1-IP/2-OP rules plus disease-specific dispensings; require continuous medical+pharmacy enrollment so a missing code is informative, and exclude or flag MA-only person-time that lacks complete FFS-style encounter claims. Guard against rule-out codes and ICD-9/ICD-10 drift.",
      "ehr": "Richer (lab values, vitals, orders, notes) enabling confirmatory logic and NLP, but capture is encounter-driven and leaks outside care. Treat stale problem lists and site workflow variation as threats; define observation windows and treat loss to follow-up as potentially informative.",
      "registry": "Strongest as the adjudicated reference standard and for clinically graded case definitions; incomplete for full exposure/follow-up. Use to validate the EHR/claims phenotype and supply adjudicated outcomes; link out for fills and mortality.",
      "linked": "Ideal substrate (EHR labs + claims completeness + registry/death adjudication) but introduces linkage selection and order/result/service-date discrepancies that must be reconciled before assigning an incident-event date; model competing risks where death differs by exposure."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\nimport numpy as np\n\nT2DM = (\"E11\",)                 # ICD-10 prefix for type 2 diabetes\nWASHOUT_DAYS = 365              # disease-free + continuous-enrollment lookback -> incident, not prevalent\nOP_GAP_DAYS = 30               # two outpatient codes must be >= this far apart (suppress rule-out codes)\nRX_CONFIRM_DAYS = 60          # confirmatory dispensing window after a single OP code\n\ndef is_t2dm(code: str) -> bool:\n    return code.startswith(T2DM)\n\ndef build_t2dm_phenotype(dx: pd.DataFrame, rx: pd.DataFrame,\n                         enroll: pd.DataFrame) -> pd.DataFrame:\n    dx = dx[dx[\"icd_code\"].map(is_t2dm)].sort_values([\"person_id\", \"dx_date\"])\n\n    # Rule A: >=1 inpatient T2DM code.\n    ip = dx[dx[\"setting\"] == \"IP\"].groupby(\"person_id\")[\"dx_date\"].min()\n\n    # Rule B: >=2 outpatient T2DM codes on dates >= OP_GAP_DAYS apart.\n    op = dx[dx[\"setting\"] == \"OP\"]\n    def second_op_date(g):\n        d = g[\"dx_date\"].sort_values().to_numpy()\n        for i in range(1, len(d)):\n            if (d[i] - d[0]) >= np.timedelta64(OP_GAP_DAYS, \"D\"):\n                return pd.Timestamp(d[i])\n        return pd.NaT\n    op_qual = op.groupby(\"person_id\").apply(second_op_date).dropna()\n\n    # Rule C: >=1 OP code + a confirming non-metformin antidiabetic dispensing within RX_CONFIRM_DAYS.\n    first_op = op.groupby(\"person_id\")[\"dx_date\"].min()\n    rxc = rx[rx[\"drug_class\"] == \"ANTIDIABETIC_NONMET\"]\n    cm = first_op.reset_index().merge(rxc, on=\"person_id\", how=\"inner\")\n    cm = cm[(cm[\"fill_date\"] >= cm[\"dx_date\"]) &\n            (cm[\"fill_date\"] <= cm[\"dx_date\"] + pd.Timedelta(days=RX_CONFIRM_DAYS))]\n    rx_qual = cm.groupby(\"person_id\")[\"dx_date\"].min()\n\n    # Earliest qualifying date across rules = candidate incident event date.\n    cand = pd.concat([ip.rename(\"d\"), op_qual.rename(\"d\"), rx_qual.rename(\"d\")])\n    event = 
cand.groupby(level=0).min().rename(\"event_date\").reset_index()\n    event.columns = [\"person_id\", \"event_date\"]\n\n    # Incidence: clean 365d lookback with no prior T2DM code, plus continuous non-MA enrollment.\n    # The prose rule also requires no prior antidiabetic dispensing; that check is not implemented here.\n    prior_dx = dx.merge(event, on=\"person_id\")\n    had_prior = prior_dx[(prior_dx[\"dx_date\"] < prior_dx[\"event_date\"]) &\n                         (prior_dx[\"dx_date\"] >= prior_dx[\"event_date\"] -\n                          pd.Timedelta(days=WASHOUT_DAYS))][\"person_id\"].unique()\n    e = enroll.merge(event, on=\"person_id\")\n    covered = e[(e[\"enroll_start\"] <= e[\"event_date\"] - pd.Timedelta(days=WASHOUT_DAYS)) &\n                (e[\"enroll_end\"] >= e[\"event_date\"]) & (~e[\"ma_only\"])][\"person_id\"].unique()\n\n    positive = event[event[\"person_id\"].isin(covered) &\n                     ~event[\"person_id\"].isin(had_prior)].copy()\n    positive[\"algo_positive\"] = True\n    return positive[[\"person_id\", \"event_date\", \"algo_positive\"]]\n\ndef validate(positive: pd.DataFrame, gold: pd.DataFrame) -> dict:\n    # PPV: among algorithm-positive charts reviewed, fraction that are true cases.\n    flagged = gold.merge(positive[[\"person_id\"]], on=\"person_id\", how=\"left\",\n                         indicator=True)\n    flagged[\"algo_positive\"] = flagged[\"_merge\"] == \"both\"\n    tp = int(((flagged[\"algo_positive\"]) & (flagged[\"true_case\"])).sum())\n    fp = int(((flagged[\"algo_positive\"]) & (~flagged[\"true_case\"])).sum())\n    fn = int(((~flagged[\"algo_positive\"]) & (flagged[\"true_case\"])).sum())\n    tn = int(((~flagged[\"algo_positive\"]) & (~flagged[\"true_case\"])).sum())\n    return {\n        \"ppv\": tp / (tp + fp) if (tp + fp) else np.nan,\n        \"sensitivity\": tp / (tp + fn) if (tp + fn) else np.nan,\n        \"specificity\": tn / (tn + fp) if (tn + fp) else np.nan,\n        \"n_reviewed\": tp + fp + fn + tn,\n    }",
        "description": "Rule-based incident-outcome phenotype from claims-style inputs, then PPV/sensitivity from a validation\nsubstudy. Required inputs (cleaned, de-duplicated):\n  dx     : diagnosis claims -> person_id, dx_date (datetime), icd_code (str), setting in {'IP','OP'}\n  rx     : pharmacy fills   -> person_id, fill_date (datetime), drug_class (e.g. 'ANTIDIABETIC_NONMET')\n  enroll : enrollment spans -> person_id, enroll_start, enroll_end, ma_only (bool)  # ma_only lacks FFS claims\n  gold   : validation truth -> person_id, true_case (bool)  # chart review / adjudicated reference standard\nReturns one row per algorithm-positive person with their incident event_date, plus a validation table of\nPPV and sensitivity. Apply the SAME logic and lookback to every exposure arm to keep ascertainment blind.",
        "dependencies": [
          "pandas",
          "numpy"
        ],
        "source_citations": [
          "shivade-2014",
          "newton-2013"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\nWASHOUT_DAYS    <- 365L\nOP_GAP_DAYS     <- 30L\nRX_CONFIRM_DAYS <- 60L\n\nbuild_t2dm_phenotype <- function(dx, rx, enroll) {\n  setDT(dx); setDT(rx); setDT(enroll)\n  dx <- dx[grepl(\"^E11\", icd_code)][order(person_id, dx_date)]\n\n  # Rule A: >=1 inpatient code.\n  ip <- dx[setting == \"IP\", .(d = min(dx_date)), by = person_id]\n\n  # Rule B: >=2 outpatient codes >= OP_GAP_DAYS apart (earliest qualifying second date).\n  op <- dx[setting == \"OP\"]\n  op_qual <- op[, {\n    d <- sort(dx_date)\n    hit <- if (length(d) > 1L) d[which(d - d[1L] >= OP_GAP_DAYS)] else as.Date(character())\n    .(d = if (length(hit)) min(hit) else as.Date(NA))\n  }, by = person_id][!is.na(d)]\n\n  # Rule C: >=1 OP code + non-metformin antidiabetic fill within RX_CONFIRM_DAYS.\n  first_op <- op[, .(dx_date = min(dx_date)), by = person_id]\n  rxc <- rx[drug_class == \"ANTIDIABETIC_NONMET\"]\n  cm  <- merge(first_op, rxc, by = \"person_id\")\n  rx_qual <- cm[fill_date >= dx_date & fill_date <= dx_date + RX_CONFIRM_DAYS,\n                .(d = min(dx_date)), by = person_id]\n\n  cand  <- rbindlist(list(ip, op_qual, rx_qual))\n  event <- cand[, .(event_date = min(d)), by = person_id]\n\n  # Incidence: clean 365d lookback + continuous non-MA enrollment through event date.\n  pd <- merge(dx, event, by = \"person_id\")\n  had_prior <- unique(pd[dx_date < event_date &\n                         dx_date >= event_date - WASHOUT_DAYS, person_id])\n  e <- merge(enroll, event, by = \"person_id\")\n  covered <- unique(e[enroll_start <= event_date - WASHOUT_DAYS &\n                      enroll_end   >= event_date & !ma_only, person_id])\n\n  pos <- event[person_id %chin% covered & !person_id %chin% had_prior]\n  pos[, algo_positive := TRUE]\n  pos[, .(person_id, event_date, algo_positive)]\n}\n\nvalidate_phenotype <- function(positive, gold) {\n  setDT(gold)\n  gold[, algo_positive := person_id %chin% positive$person_id]\n  tp <- gold[algo_positive  & true_case,  
.N]\n  fp <- gold[algo_positive  & !true_case, .N]\n  fn <- gold[!algo_positive & true_case,  .N]\n  tn <- gold[!algo_positive & !true_case, .N]\n  list(ppv = tp / (tp + fp), sensitivity = tp / (tp + fn),\n       specificity = tn / (tn + fp), n_reviewed = tp + fp + fn + tn)\n}",
        "description": "Rule-based incident T2DM phenotype + PPV/sensitivity validation with data.table. Inputs mirror the\nPython version:\n  dx     : person_id, dx_date (Date), icd_code (chr), setting in {'IP','OP'}\n  rx     : person_id, fill_date (Date), drug_class (chr)\n  enroll : person_id, enroll_start, enroll_end, ma_only (logical)\n  gold   : person_id, true_case (logical)  # adjudicated / chart-review reference standard",
        "dependencies": [
          "data.table"
        ],
        "source_citations": [
          "shivade-2014",
          "newton-2013"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "%let washout = 365; %let opgap = 30; %let rxwin = 60;\n\n/* Restrict to T2DM (ICD-10 E11.x). */\ndata t2dm; set work.dx; where substr(icd_code,1,3) = 'E11'; run;\n\n/* Rule A: earliest inpatient T2DM code. */\nproc sql;\n  create table ruleA as\n  select person_id, min(dx_date) as d format=date9.\n  from t2dm where setting='IP' group by person_id;\nquit;\n\n/* Rule B: two outpatient codes >= &opgap days apart -> earliest qualifying second date. */\nproc sql;\n  create table ruleB as\n  select a.person_id, min(b.dx_date) as d format=date9.\n  from t2dm a join t2dm b\n    on a.person_id=b.person_id and a.setting='OP' and b.setting='OP'\n   and b.dx_date >= a.dx_date + &opgap\n  group by a.person_id;\nquit;\n\n/* Rule C: an OP code with a non-metformin antidiabetic fill within &rxwin days. */\nproc sql;\n  create table ruleC as\n  select d.person_id, min(d.dx_date) as d format=date9.\n  from (select person_id, min(dx_date) as dx_date from t2dm where setting='OP'\n        group by person_id) d\n  join work.rx r\n    on d.person_id=r.person_id and r.drug_class='ANTIDIABETIC_NONMET'\n   and r.fill_date between d.dx_date and d.dx_date + &rxwin\n  group by d.person_id;\nquit;\n\n/* Earliest qualifying date across rules = candidate incident event date. */\nproc sql;\n  create table event as\n  select person_id, min(d) as event_date format=date9.\n  from (select * from ruleA union all select * from ruleB union all select * from ruleC)\n  group by person_id;\nquit;\n\n/* Incidence: clean 365d lookback (no prior T2DM code) + continuous non-MA enrollment. 
*/\n/* The washout is anchored on the first T2DM code per person (first_dx), not on event_date: a Rule B\n   event_date always follows its own first OP code, so an event_date-anchored lookback would reject\n   any Rule B case with a T2DM code in the &washout days before the second code. Enrollment covering\n   [first_dx - &washout, event_date] gives an observable, code-free lookback. */\nproc sql;\n  create table phenotype as\n  select e.person_id, e.event_date\n  from event e\n  inner join (select person_id, min(dx_date) as first_dx from t2dm group by person_id) f\n    on e.person_id=f.person_id\n  where exists (select 1 from work.enroll en\n          where en.person_id=e.person_id and en.ma_only=0\n            and en.enroll_start <= f.first_dx - &washout\n            and en.enroll_end   >= e.event_date);\nquit;\n\n/* Validation: cross-tab algorithm status vs reference standard -> PPV, sensitivity, specificity. */\nproc sql;\n  create table val as\n  select g.person_id, g.true_case,\n         (p.person_id is not null) as algo_positive\n  from work.gold g left join phenotype p on g.person_id=p.person_id;\nquit;\n\nproc sql;\n  select sum(algo_positive=1 and true_case=1)\n           / sum(algo_positive=1)                 as PPV format=6.3,\n         sum(algo_positive=1 and true_case=1)\n           / sum(true_case=1)                     as Sensitivity format=6.3,\n         sum(algo_positive=0 and true_case=0)\n           / sum(true_case=0)                     as Specificity format=6.3\n  from val;\nquit;",
        "description": "Rule-based incident T2DM phenotype + PPV/sensitivity validation in Base SAS / PROC SQL. Required input\ndatasets (post data-management):\n  work.dx     : person_id, dx_date, icd_code, setting ('IP'/'OP')\n  work.rx     : person_id, fill_date, drug_class\n  work.enroll : person_id, enroll_start, enroll_end, ma_only (0/1)\n  work.gold   : person_id, true_case (0/1)   /* adjudicated / chart-review reference standard */\nProduces work.phenotype (algorithm-positive incident cases with event_date) and prints PPV / sensitivity\n/ specificity. PROC FREQ with the 2x2 cross-tab gives the same operating characteristics for reporting.",
        "dependencies": [],
        "source_citations": [
          "shivade-2014",
          "newton-2013"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Src[Routinely collected data<br/>dx codes, labs, orders, Rx, notes] --> Logic{Phenotype logic}\n  Logic -->|>=1 IP code| Pos[Algorithm-positive]\n  Logic -->|>=2 OP codes >=30d apart| Pos\n  Logic -->|>=1 OP code + confirming Rx/lab| Pos\n  Pos --> Inc{Clean 365d lookback<br/>+ continuous non-MA enrollment?}\n  Inc -->|yes| Event[Incident case<br/>event_date = first qualifying code]\n  Inc -->|no| Prev[Prevalent / unobservable -> exclude]\n  Event --> Val[Validation substudy vs reference standard]\n  Val --> Perf[PPV / sensitivity / specificity]\n  Perf --> QBA[Quantitative bias analysis<br/>correct misclassified counts]",
        "caption": "From raw data to a validated incident-outcome phenotype. Logic raises PPV, the lookback enforces incidence, and a validation substudy supplies the operating characteristics needed for bias correction.",
        "alt_text": "Flowchart showing diagnosis/lab/medication data entering phenotype logic, an incidence lookback check, and a validation substudy producing PPV, sensitivity, and specificity feeding a quantitative bias analysis.",
        "source_type": "illustrative",
        "source_citations": [
          "shivade-2014"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  ND[Nondifferential<br/>misclassification] --> Null[Binary risk ratio<br/>biased toward null<br/>predictable, conservative]\n  DM[Differential by exposure<br/>e.g. surveillance/monitoring] --> Any[Bias in EITHER direction<br/>can fabricate an association<br/>balance tables won't show it]\n  Any --> Fix[Exposure-blind ascertainment<br/>negative-control outcomes<br/>quantitative bias analysis]",
        "caption": "Why the direction of phenotype error matters. Nondifferential error attenuates; differential error by exposure is the dangerous case and is not conservative.",
        "alt_text": "Diagram contrasting nondifferential misclassification biasing toward the null with differential misclassification biasing in either direction and the mitigations for the latter.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "is_variant_of",
        "target_slug": "outcome-algorithm-construction-rwe",
        "notes": "EHR phenotyping is the EHR/structured-data instance of the broader outcome-algorithm construction family."
      },
      {
        "relation_type": "used_with",
        "target_slug": "claims-outcome-algorithm-ppv-sensitivity-rwe",
        "notes": "A phenotype is only interpretable with measured operating characteristics; PPV/sensitivity estimation is the required validation companion."
      },
      {
        "relation_type": "see_also",
        "target_slug": "diagnosis-phenotype-algorithm-1ip-2op-time-window-rwe",
        "notes": "The 1-inpatient / 2-outpatient time-window rule is the canonical structured-code instantiation of an EHR/claims phenotype."
      },
      {
        "relation_type": "used_with",
        "target_slug": "algorithm-validation",
        "notes": "Chart-review or adjudicated validation against a reference standard establishes the phenotype's PPV, sensitivity, and specificity."
      },
      {
        "relation_type": "used_with",
        "target_slug": "misclassification-bias-correction-rwe",
        "notes": "Measured PPV/sensitivity feed quantitative bias analysis to correct effect estimates for outcome misclassification."
      },
      {
        "relation_type": "see_also",
        "target_slug": "negative-control-outcomes-rwe",
        "notes": "Negative-control outcomes help detect differential ascertainment (surveillance/detection bias) that a phenotype may introduce."
      }
    ],
    "aliases": [
      "computable phenotype",
      "EHR phenotyping algorithm",
      "electronic phenotyping",
      "case-finding algorithm",
      "computable case definition"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "ehr-study",
    "name": "EHR-Based Study",
    "short_definition": "An observational study that defines exposure, covariates, and outcomes from electronic health record data (orders, encounters, problem lists, labs, vitals, and clinical notes) generated during routine care rather than from billing transactions.",
    "long_description": "An **EHR-based study** is a non-interventional study whose entire data substrate is the electronic\nhealth record — provider orders and medication administrations, encounter and visit records, problem\nlists, structured laboratory and vital-sign values, and free-text clinical notes — captured as a\nby-product of clinical care. It is a *study design type* in the sense of \"what generated the data and\ntherefore what biases are baked in,\" not an estimation method: the design choice dictates how time\nzero is defined, how the cohort is observed, and which threats to validity dominate. The analytic\nmethods layered on top (new-user/active-comparator restriction, propensity scores, Cox models) are\nthe same as in any cohort study; what is distinctive is that the EHR data-generating process — driven\nby *clinical encounters*, not *insurance enrollment* — reshapes nearly every operational decision.\n\n**Core conceptual distinction.** The defining contrast is **EHR vs administrative claims** as the\nsource of person-time and ascertainment. Claims are a *transaction* stream tied to insurance\nenrollment: a clean enrollment table tells you exactly when a person is observable, pharmacy claims\ncapture dispensings with `days_supply`, and capture is near-complete *within the enrolled window* but\nblind to anything outside it (cash purchases, care under a different plan, clinical detail). EHR is an\n*encounter* stream tied to where a patient seeks care: it carries rich clinical granularity (labs,\nvitals, problem lists, notes, severity) that claims never see, but it has **no enrollment table** —\na patient is \"observable\" only when they show up, and silence is ambiguous (healthy? died? went\nelsewhere?). Three primitives shift accordingly. 
(1) *Exposure* in EHR is the medication **order** or\n**administration**, which is an intent or in-hospital event, not proof the patient filled and took the\ndrug; confirming initiation usually requires linkage to pharmacy fills. (2) *Observability* must be\n*constructed* from encounter activity (e.g., at least one visit in a trailing window) rather than read\noff enrollment spans. (3) *Phenotyping* — turning raw EHR signals into an exposure/outcome/covariate —\nis a research artifact in itself, combining structured codes (ICD, SNOMED, RxNorm, LOINC) with\nproblem-list curation and, increasingly, NLP over notes, and must be validated against a reference\nstandard (typically chart review) to estimate its PPV and sensitivity. The estimand is unchanged from\nany cohort study (e.g., a comparative hazard ratio or risk difference), but its *transportability* is\nconditioned on the catchment of the health system that produced the records.\n\n**Pros, cons, and trade-offs** (named against the real alternatives).\n- **vs claims-based studies:** EHR wins decisively on *clinical depth* — measured labs (eGFR, HbA1c,\n  LDL), vitals (BP, BMI), disease severity, smoking/functional status, and outcome detail (tumor\n  stage, ejection fraction) that claims can only proxy. This directly improves confounding control and\n  enables outcome phenotypes claims cannot support. **Cons:** EHR has *no complete picture of\n  utilization* — care delivered outside the network is invisible, there is no enrollment denominator,\n  and pharmacy capture is order-level (intent), not dispensing-level. **Prefer EHR** when the question\n  hinges on clinical measurements or fine-grained outcomes; **prefer claims** when complete drug\n  exposure, full utilization, and a defined denominator are paramount; **prefer linked EHR-claims**\n  when you can get both.\n- **vs registries:** Registries give adjudicated, protocol-defined outcomes and curated severity for a\n  specific disease, but are narrow and expensive. 
EHR is broader and cheaper but its variables are\n  care-driven and non-adjudicated (informed presence: sicker patients generate more data). **Prefer\n  registries** for adjudicated endpoints; **prefer EHR** for breadth and rapid feasibility.\n- **vs linked claims-EHR-vital-records:** Linkage is the ideal substrate (EHR severity + claims\n  completeness + reliable death) but it costs a *selection* layer (only the linkable subset) and\n  date-reconciliation between order, fill, and service dates. **Prefer linkage** when mortality or\n  complete exposure is decisive and a linkable cohort is available.\n\n**When to use.** The exposure or outcome requires clinical detail that claims cannot supply (lab-based\nkidney/glycemic/lipid endpoints, BP/BMI/severity, imaging or pathology findings, in-hospital\nmedication administration); you need outcome phenotypes built from labs/notes; you are doing rapid\nfeasibility, hypothesis generation, or signal detection within a defined health system; or you have\nlinkage that lets EHR severity ride on top of claims completeness.\n\n**When NOT to use — and when it is actively misleading or dangerous.**\n- **The question needs complete drug exposure or a denominator.** Drug-utilization, adherence\n  (PDC/MPR), or incidence-rate questions are *dangerous* in stand-alone EHR: orders are not fills,\n  out-of-network fills are invisible, and there is no enrollment denominator — you will misclassify\n  exposure and cannot compute person-time correctly. Use claims (or link to them).\n- **Informed-presence / surveillance bias is differential by exposure.** If one drug's patients are\n  monitored more intensely, their outcomes are detected more often purely because they are *seen* more\n  — a spurious association. 
EHR makes this worse than claims because ascertainment is visit-driven.\n- **Loss to follow-up is informative and differential.** A patient who improves, dies, or transfers\n  care simply stops generating records; treating that silence as continued event-free follow-up\n  induces immortal-time-like and survivorship distortions. If you cannot anchor censoring (e.g., with a\n  linked death index), event rates are uninterpretable.\n- **A single-system EHR is being used to answer a population question.** The catchment defines the\n  population; effect estimates may not transport beyond that health system's case-mix and practice\n  patterns.\n- **Unvalidated phenotypes.** Deploying an ICD- or NLP-based outcome definition without PPV/sensitivity\n  estimates against a reference standard bakes in misclassification of unknown magnitude and\n  direction, which may be differential by exposure.\n\n**Data-source operational depth.**\n- **Claims (FFS or MA/commercial):** Used here as the *comparator* substrate and as the linkage target.\n  Strength: complete dispensing (`days_supply`), full cross-provider utilization within enrollment, a\n  real denominator. Failure modes: **MA encounter data are often incomplete** relative to Medicare FFS\n  (capitated providers underreport), so MA-only person-time can masquerade as low utilization; cash\n  purchases and drug samples are invisible; coding intensity varies by payer. 
Workaround when linked: restrict to\n  FFS (Parts A/B/D) for exposure/utilization completeness and use the EHR only for clinical detail.\n- **EHR:** Strength: labs, vitals, problem lists, notes, severity, in-hospital administrations.\n  Failure modes: (a) **No enrollment table** — construct observability from encounter activity\n  (e.g., ≥1 encounter in the prior 365 days) and accept that \"new user\" really means \"first *observed*\n  order\"; (b) **orders ≠ dispensings** — prefer linked fills to confirm initiation; (c)\n  **informed-presence / differential surveillance** — adjust for visit frequency or use negative\n  controls; (d) **fragmented records** across systems — patients seen elsewhere look like gaps;\n  (e) **competing risks and death** — out-of-hospital deaths are frequently uncaptured, so link to a\n  death index before trusting censoring. In elderly/oncology cohorts, **differential competing risks\n  by exposure** (e.g., one arm with sicker patients dying of other causes) demand a Fine-Gray or\n  cause-specific framing, not naive Kaplan-Meier on the EHR alone.\n- **Registry:** Adjudicated outcomes and curated severity for a defined disease; weak for complete\n  pharmacy exposure and general comorbidity. 
Link to claims for fills and to a death index for\n  censoring.\n- **Linked claims-EHR-vital-records:** Ideal but introduces linkage *selection* (only the linkable\n  subset, which differs systematically) and order/fill/service **date discrepancies** that must be\n  reconciled before time-zero assignment; an order on 2024-01-03 with a fill on 2024-01-06 cannot both\n  be time zero.\n\n**Worked EHR example.** Question: incident acute kidney injury (AKI) after initiating an SGLT2\ninhibitor in adults with type 2 diabetes, using a multi-hospital EHR linked to pharmacy fills.\n(1) *Observability (the claims-style enrollment substitute):* require ≥1 outpatient encounter in the\n365 days before the index order, so the patient is genuinely \"in\" the system and a prior order would\nhave been seen. (2) *Exposure / time zero:* the first *order* of an SGLT2 inhibitor with **no prior\norder of any SGLT2 inhibitor in the 365-day lookback** (new-user restriction on observed orders);\nrequire a linked pharmacy `fill_date` within 30 days of the order to confirm the patient actually\nstarted — index_date = the order date, but only for confirmed starters. (3) *Baseline phenotype &\ncovariates:* measured only in `[index_date - 365, index_date]` — most recent eGFR and HbA1c from the\nstructured lab table (a clinical variable claims cannot supply), BP/BMI from vitals, diabetes\ncomplications and comorbidities from ICD diagnoses + problem list, and an explicit **visit-count**\ncovariate to blunt informed-presence bias. (4) *Outcome phenotype:* AKI defined by a validated rule —\na ≥1.5× rise in serum creatinine within 7 days (KDIGO lab criterion) OR an AKI diagnosis code at an\nencounter — with the PPV of the rule estimated against chart review before it is trusted. 
(5)\n*Follow-up & censoring:* from time zero to first AKI, censoring at the last observed encounter + a\ngrace window (the EHR equivalent of disenrollment), at linked death (out-of-hospital deaths come from\nthe death index, not the EHR), and at end of data; for an as-treated analysis, censor at the end of\nthe last confirmed fill's `days_supply` + grace. (6) *Analysis & diagnostics:* PS-adjust on the\nbaseline window, fit a cause-specific or Fine-Gray model treating death as a competing risk, and run\nsensitivity analyses on the observability window length, the order-to-fill confirmation requirement,\nand a negative-control outcome to surface residual surveillance bias.",
    "primary_category": "Study_Design",
    "tags": [
      "ehr",
      "electronic-health-records",
      "secondary-data",
      "phenotyping",
      "informed-presence",
      "routinely-collected-data",
      "study-design",
      "pharmacoepidemiology"
    ],
    "applies_to_study_types": [
      "ehr_study"
    ],
    "data_sources": [
      "ehr",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1136/amiajnl-2012-001145",
        "url": "https://doi.org/10.1136/amiajnl-2012-001145",
        "citation_text": "Hripcsak G, Albers DJ. Next-generation phenotyping of electronic health records. Journal of the American Medical Informatics Association. 2013;20(1):117-121.",
        "year": 2013,
        "authors_short": "Hripcsak & Albers",
        "notes": "Foundational statement that EHR variables are care-driven artifacts requiring deliberate phenotyping rather than ready-made covariates, framing the central methodological problem of EHR-based research."
      },
      {
        "role": "explain",
        "doi": "10.1371/journal.pmed.1001885",
        "url": "https://doi.org/10.1371/journal.pmed.1001885",
        "citation_text": "Benchimol EI, Smeeth L, Guttmann A, et al. The REporting of studies Conducted using Observational Routinely-collected health Data (RECORD) Statement. PLOS Medicine. 2015;12(10):e1001885.",
        "year": 2015,
        "authors_short": "Benchimol et al.",
        "notes": "Reporting standard for studies using routinely-collected health data (claims and EHR); defines the operational reporting items — population, codes, observability — that an EHR study must specify."
      },
      {
        "role": "explain",
        "doi": "10.1136/bmj.k3532",
        "url": "https://doi.org/10.1136/bmj.k3532",
        "citation_text": "Langan SM, Schmidt SA, Wing K, et al. The reporting of studies conducted using observational routinely collected health data statement for pharmacoepidemiology (RECORD-PE). BMJ. 2018;363:k3532.",
        "year": 2018,
        "authors_short": "Langan et al.",
        "notes": "Pharmacoepidemiology extension of RECORD; specifies exposure-definition, washout, and new-user reporting that translate the EHR order-vs-dispensing distinction into a defensible protocol."
      },
      {
        "role": "demonstrate",
        "doi": "10.1097/MLR.0b013e31829b1d48",
        "url": "https://doi.org/10.1097/MLR.0b013e31829b1d48",
        "citation_text": "Bayley KB, Belnap T, Savitz L, et al. Challenges in using electronic health record data for CER: experience of 4 learning organizations and solutions applied. Medical Care. 2013;51(8 Suppl 3):S80-S86.",
        "year": 2013,
        "authors_short": "Bayley et al.",
        "notes": "Applied account from four delivery systems of the concrete failure modes of EHR data for comparative effectiveness research (missingness, fragmentation, capture) and the workarounds used."
      }
    ],
    "plain_language_summary": "An EHR-based study answers a clinical question using the electronic health record a hospital or clinic fills in while caring for patients — the doctor's orders, the clinic visits, the lab and vital-sign results, and the typed notes. Its big advantage over insurance billing data is that it can see actual measurements claims never carry, like a kidney-function lab or a blood pressure. Its big catch is that the record only exists when the patient shows up: a written drug order is just intent, not proof the patient ever filled or took it, and any care that happens outside this health system is simply invisible.",
    "key_terms": [
      {
        "term": "electronic health record (EHR)",
        "definition": "The clinical record a care team builds during visits — orders, encounters, labs, vitals, and notes — rather than the bills a health plan processes."
      },
      {
        "term": "order",
        "definition": "A prescription a clinician writes in the EHR; it shows the doctor intended a drug, not that a pharmacy ever dispensed it or the patient took it."
      },
      {
        "term": "fill (dispensing)",
        "definition": "The pharmacy event where the drug is actually handed to the patient — the proof of treatment that an order alone does not give you."
      },
      {
        "term": "encounter",
        "definition": "A recorded contact with the health system, such as an office visit or hospital stay; the EHR only adds data when an encounter happens."
      },
      {
        "term": "phenotype",
        "definition": "A rule that turns raw EHR signals (codes, lab values, note text) into a yes/no label for an exposure, outcome, or condition, and must be checked against a chart review before you trust it."
      },
      {
        "term": "loss to follow-up",
        "definition": "When a patient stops appearing in the data — here it can mean recovered, died, or just moved their care elsewhere, and the record can't tell you which."
      }
    ],
    "worked_example": {
      "scenario": "Meet patient 4021, a 58-year-old with type 2 diabetes seen at one hospital system. Below is a slice of her EHR for a single day in March. We want to read these mixed rows the way an analyst would and see exactly what the EHR shows — and what it quietly hides compared with insurance claims.",
      "dataset": {
        "caption": "A handful of raw EHR rows for one patient on one day, mixing the different record types an analyst actually pulls from separate EHR tables into one view.",
        "columns": [
          "patient_id",
          "date",
          "record_type",
          "detail",
          "value"
        ],
        "rows": [
          [
            4021,
            "2024-03-12",
            "encounter",
            "outpatient endocrinology visit",
            "completed"
          ],
          [
            4021,
            "2024-03-12",
            "lab",
            "eGFR (kidney function)",
            "62 mL/min/1.73m2"
          ],
          [
            4021,
            "2024-03-12",
            "lab",
            "HbA1c",
            "8.4 %"
          ],
          [
            4021,
            "2024-03-12",
            "vital",
            "blood pressure",
            "138/86 mmHg"
          ],
          [
            4021,
            "2024-03-12",
            "order",
            "empagliflozin 10 mg (SGLT2 inhibitor)",
            "ordered"
          ],
          [
            4021,
            "2024-03-12",
            "note",
            "clinician note: counseled on starting new diabetes medication",
            "free text"
          ]
        ]
      },
      "steps": [
        "Read the rows by type: one encounter anchors the day, two labs and one vital give measured clinical detail, one order records a prescribing decision, and one note is typed free text.",
        "Notice what claims could never give you here: an eGFR of 62, an HbA1c of 8.4%, and a blood pressure of 138/86 are real measurements — insurance billing data would at best show a diagnosis code, never the number.",
        "Now read the order carefully: 'empagliflozin ordered' means the clinician intended to start the drug, but there is no pharmacy fill row on this day — an order is not a fill, so we cannot yet say she actually began taking it.",
        "To confirm she truly started, you would need a linked pharmacy fill within a short window after 2024-03-12; without it, treating this order as a started treatment would overcount real initiation.",
        "Finally, picture the weeks after this visit with no new rows. In claims that gap could mean she stopped care; in the EHR it could also just mean she filled the drug at an outside pharmacy and saw a doctor in another system — care this hospital's EHR cannot see, so she can look lost to follow-up while doing perfectly fine."
      ],
      "result": "For patient 4021 on 2024-03-12 the EHR shows a real endocrinology visit with measured kidney function (eGFR 62), glycemic control (HbA1c 8.4%), and blood pressure (138/86), plus a clinician's order to start an SGLT2 inhibitor and a supporting note — clinical depth that insurance claims simply do not carry. The key limitation: that order is only intent, not a confirmed fill, and because the EHR records data only when this system sees the patient, any later fills or visits made elsewhere are invisible, so a silent stretch in the record can be mistaken for the patient disappearing or going untreated."
    },
    "prerequisites": [],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Structured-code EHR phenotype",
        "description": "Exposure/outcome/covariates defined from structured EHR fields only — ICD/SNOMED diagnoses, RxNorm orders, LOINC labs, and the problem list — without note text. Fast and reproducible.",
        "edge_cases": [
          "Diagnosis codes on EHR encounters are entered for clinical (not billing) reasons and may be less complete or less specific than the corresponding claim; validate PPV against chart review.",
          "Problem lists accumulate stale/duplicate entries and rarely carry resolution dates, blurring \"active disease at baseline.\""
        ],
        "data_source_notes": "ehr: map local codes to standard vocabularies (OMOP CDM) for portability; confirm lab units and reference ranges before applying numeric thresholds."
      },
      {
        "name": "NLP-augmented EHR phenotype",
        "description": "Structured signals supplemented by natural-language processing of clinical notes to capture concepts that are documented but never coded (smoking, functional status, symptom severity, negation).",
        "edge_cases": [
          "NLP introduces its own misclassification (negation, family history, templated boilerplate) that is often differential across sites and time.",
          "Note availability is itself informed-presence — sicker, more-engaged patients have more text, biasing concept capture."
        ],
        "data_source_notes": "ehr: report NLP system, training corpus, and validation PPV/sensitivity; treat NLP-derived variables as measured-with-error, not gold standard."
      },
      {
        "name": "Linked EHR-claims new-user cohort",
        "description": "EHR orders define candidate initiation, linked pharmacy claims confirm the dispensing and supply complete cross-provider exposure, and a death index anchors censoring.",
        "edge_cases": [
          "Order date, fill date, and service date disagree; pre-specify which defines time zero (typically the confirmed fill on/after the order) before any analysis.",
          "The linkable subset differs systematically from the unlinkable remainder; report linkage rate and compare baseline characteristics."
        ],
        "data_source_notes": "linked: restrict utilization/exposure logic to FFS-observable person-time; use EHR strictly for clinical detail (labs, vitals, severity)."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Claims-based study",
        "pros_of_this": "Supplies measured clinical detail (labs, vitals, severity, problem list, in-hospital administrations) that claims can only proxy, improving confounding control and enabling lab-based outcome phenotypes.",
        "cons_of_this": "No enrollment denominator and no complete cross-provider utilization; medication orders are intent, not dispensings; care outside the network is invisible.",
        "when_to_prefer": "The exposure or outcome depends on clinical measurements or fine-grained endpoints unavailable in billing data."
      },
      {
        "compared_to": "Registry-based study",
        "pros_of_this": "Broader, cheaper, and faster to assemble; covers many conditions and the full clinical record rather than a single curated disease.",
        "cons_of_this": "Outcomes are non-adjudicated and care-driven (informed presence), severity is documented inconsistently, and capture is fragmented across systems.",
        "when_to_prefer": "Breadth, rapid feasibility, or signal detection matter more than adjudicated, protocol-defined endpoints."
      },
      {
        "compared_to": "Linked claims-EHR-vital-records study",
        "pros_of_this": "Simpler governance and faster access using a single health system's EHR; no cross-source date reconciliation.",
        "cons_of_this": "Loses dispensing-level exposure, a defined denominator, and reliable mortality; effect estimates may not transport beyond the system's catchment.",
        "when_to_prefer": "Linkage is unavailable or the question is internal to a single delivery system and does not hinge on complete exposure or death."
      }
    ],
    "implementation_notes_by_data_source": {
      "ehr": "No enrollment table: construct observability from encounter activity (e.g., >=1 encounter in the trailing 365 days). Exposure = order/administration, not dispensing; prefer linked fills to confirm initiation. Baseline labs/vitals from structured tables; validate phenotypes (PPV) against a reference standard. Treat loss to follow-up and informed presence as potentially differential; adjust for visit frequency and use negative controls.",
      "claims": "Used as comparator/linkage substrate. Complete dispensings (days_supply) and a real denominator within enrollment, but MA encounter data are often incomplete vs FFS; restrict utilization/exposure logic to FFS-observable person-time when linked.",
      "registry": "Adjudicated outcomes and curated severity for a defined disease; weak for complete pharmacy exposure. Link to claims for fills and to a death index for censoring.",
      "linked": "Ideal substrate (EHR severity + claims completeness + reliable death) but introduces linkage selection and order/fill/service date discrepancies that must be reconciled before time-zero assignment."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\n\nLOOKBACK_DAYS = 365   # observability + drug-free lookback (EHR substitute for continuous enrollment)\nFILL_GRACE    = 30    # days after the order within which a linked fill confirms true initiation\n\ndef build_ehr_new_user_cohort(orders, encounters, fills):\n    orders = orders.sort_values([\"person_id\", \"order_date\"])\n\n    # Candidate index = first STUDY-drug order per person.\n    study = orders[orders[\"drug_class\"] == \"STUDY\"]\n    idx = (study.groupby(\"person_id\", as_index=False)\n                .first()\n                .rename(columns={\"order_date\": \"index_date\"})\n                [[\"person_id\", \"index_date\"]])\n\n    # Observability: require >=1 encounter in the LOOKBACK window before index (no enrollment table exists).\n    enc = encounters.merge(idx, on=\"person_id\")\n    observable = enc[(enc[\"encounter_date\"] < enc[\"index_date\"]) &\n                     (enc[\"encounter_date\"] >= enc[\"index_date\"] - pd.Timedelta(days=LOOKBACK_DAYS))]\n    idx = idx[idx[\"person_id\"].isin(observable[\"person_id\"])].copy()\n\n    # New-user: no STUDY order in the LOOKBACK window before index (first *observed* order).\n    prior = study.merge(idx, on=\"person_id\")\n    prior_in_lb = prior[(prior[\"order_date\"] < prior[\"index_date\"]) &\n                        (prior[\"order_date\"] >= prior[\"index_date\"] - pd.Timedelta(days=LOOKBACK_DAYS))]\n    idx = idx[~idx[\"person_id\"].isin(prior_in_lb[\"person_id\"])].copy()\n\n    # Confirm initiation: a linked fill of the same class within FILL_GRACE days of the order.\n    f = fills[fills[\"drug_class\"] == \"STUDY\"].merge(idx, on=\"person_id\")\n    confirmed = f[(f[\"fill_date\"] >= f[\"index_date\"]) &\n                  (f[\"fill_date\"] <= f[\"index_date\"] + pd.Timedelta(days=FILL_GRACE))]\n    cohort = idx[idx[\"person_id\"].isin(confirmed[\"person_id\"])].copy()\n\n    cohort[\"baseline_start\"] = cohort[\"index_date\"] - 
pd.Timedelta(days=LOOKBACK_DAYS)\n    return cohort[[\"person_id\", \"index_date\", \"baseline_start\"]]",
        "description": "EHR new-user cohort construction with the claims-style enrollment table REPLACED by encounter-derived\nobservability. Required inputs (already cleaned and de-duplicated):\n  orders     : medication orders     -> person_id, order_date (datetime), drug_class in {'STUDY'}\n  encounters : visits/encounters     -> person_id, encounter_date (datetime)\n  fills      : linked pharmacy fills  -> person_id, fill_date (datetime), drug_class, days_supply\nReturns one row per confirmed new initiator with index_date (= order date) and the baseline window.\nBuild covariates/phenotypes only from [baseline_start, index_date]; apply identical outcome/censoring\nrules downstream. Note: \"new user\" here means first *observed* order, since EHR has no enrollment span.",
        "dependencies": [
          "pandas"
        ],
        "source_citations": [
          "hripcsak-2013"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\nLOOKBACK_DAYS <- 365L\nFILL_GRACE    <- 30L\n\nbuild_ehr_new_user_cohort <- function(orders, encounters, fills) {\n  setDT(orders); setDT(encounters); setDT(fills)\n  setorder(orders, person_id, order_date)\n\n  study <- orders[drug_class == \"STUDY\"]\n  idx <- study[, .(index_date = order_date[1L]), by = person_id]\n\n  # Observability: >=1 encounter in the lookback window before index (EHR has no enrollment span).\n  enc <- merge(encounters, idx, by = \"person_id\")\n  obs_ids <- unique(enc[encounter_date < index_date &\n                        encounter_date >= index_date - LOOKBACK_DAYS, person_id])\n  idx <- idx[person_id %chin% obs_ids]\n\n  # New-user: no STUDY order in the lookback window before index (first observed order).\n  study <- merge(study, idx, by = \"person_id\")\n  prior_ids <- unique(study[order_date < index_date &\n                            order_date >= index_date - LOOKBACK_DAYS, person_id])\n  idx <- idx[!person_id %chin% prior_ids]\n\n  # Confirm initiation via a linked fill within the grace window after the order.\n  f <- merge(fills[drug_class == \"STUDY\"], idx, by = \"person_id\")\n  conf_ids <- unique(f[fill_date >= index_date &\n                       fill_date <= index_date + FILL_GRACE, person_id])\n\n  cohort <- idx[person_id %chin% conf_ids]\n  cohort[, baseline_start := index_date - LOOKBACK_DAYS]\n  cohort[, .(person_id, index_date, baseline_start)]\n}",
        "description": "EHR new-user cohort construction with data.table. Inputs mirror the Python version:\n  orders     : person_id, order_date (Date), drug_class in {'STUDY'}\n  encounters : person_id, encounter_date (Date)\n  fills      : person_id, fill_date (Date), drug_class, days_supply\nObservability is built from encounters (no enrollment table); index = first observed STUDY order\nconfirmed by a linked fill within the grace window.",
        "dependencies": [
          "data.table"
        ],
        "source_citations": [
          "hripcsak-2013"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "%let lookback = 365;   /* observability + drug-free lookback (EHR enrollment substitute) */\n%let grace    = 30;    /* order-to-fill confirmation window */\n\n/* Candidate index = first STUDY-drug order per person. */\nproc sql;\n  create table idx as\n  select person_id, min(order_date) as index_date format=date9.\n  from work.orders\n  where drug_class = 'STUDY'\n  group by person_id;\nquit;\n\n/* Observability: require >=1 encounter in the lookback window before index. */\nproc sql;\n  create table observable as\n  select i.*\n  from idx i\n  where exists (\n    select 1 from work.encounters e\n    where e.person_id = i.person_id\n      and e.encounter_date <  i.index_date\n      and e.encounter_date >= i.index_date - &lookback\n  );\nquit;\n\n/* New-user: no prior STUDY order inside the lookback window (first *observed* order). */\nproc sql;\n  create table newuser as\n  select o.*\n  from observable o\n  where not exists (\n    select 1 from work.orders p\n    where p.person_id = o.person_id\n      and p.drug_class = 'STUDY'\n      and p.order_date <  o.index_date\n      and p.order_date >= o.index_date - &lookback\n  );\nquit;\n\n/* Confirm true initiation with a linked fill within the grace window after the order. */\nproc sql;\n  create table cohort as\n  select n.person_id, n.index_date,\n         n.index_date - &lookback as baseline_start format=date9.\n  from newuser n\n  where exists (\n    select 1 from work.fills f\n    where f.person_id = n.person_id\n      and f.drug_class = 'STUDY'\n      and f.fill_date >= n.index_date\n      and f.fill_date <= n.index_date + &grace\n  );\nquit;",
        "description": "EHR new-user cohort construction in SAS via PROC SQL on EHR primitives (orders, encounters, problem\nlist, linked fills) -- there is no enrollment table, so observability is derived from encounters.\nRequired input datasets (post data-management):\n  work.orders     : person_id, order_date, drug_class ('STUDY')\n  work.encounters : person_id, encounter_date\n  work.fills      : person_id, fill_date, drug_class, days_supply\n  work.problist   : person_id, condition_code, noted_date   (for downstream baseline phenotyping; not read by this code)\nOutput: work.cohort with person_id, index_date, baseline_start. Build covariates only from\n[baseline_start, index_date]; validate the outcome phenotype PPV before fitting the outcome model.",
        "dependencies": [],
        "source_citations": [
          "hripcsak-2013"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Src[\"Health-system EHR population\"] --> Obs[\"Observability: >=1 encounter in trailing 365d<br/>(replaces the missing enrollment table)\"]\n  Obs --> Order[\"First STUDY-drug ORDER<br/>(intent, not a dispensing)\"]\n  Order --> NU[\"New-user: no prior order in 365d lookback<br/>(first *observed* order)\"]\n  NU --> Conf[\"Confirm initiation via linked fill<br/>within grace window\"]\n  Conf --> T0[\"Time zero = index order date\"]\n  T0 --> Base[\"Baseline phenotype + covariates<br/>labs / vitals / problem list, validate PPV\"]\n  Base --> Fup[\"Follow-up: outcome phenotype<br/>censor at last encounter+grace / linked death / data end\"]\n  Fup --> Sens[\"Sensitivity: observability window, order-to-fill rule,<br/>visit-frequency adjustment, negative control\"]",
        "caption": "Operational EHR study flow. Because the EHR has no enrollment table, observability is constructed from encounter activity; the medication order is intent and must be confirmed by a linked fill before it can serve as time zero.",
        "alt_text": "Flowchart from EHR population through encounter-based observability, first medication order, new-user restriction, linked-fill confirmation, time zero, baseline phenotyping, follow-up with phenotyped outcomes and death-index censoring, and sensitivity analyses.",
        "source_type": "illustrative",
        "source_citations": [
          "hripcsak-2013"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  subgraph EHR[\"EHR (encounter-driven)\"]\n    E1[\"Rich clinical detail:<br/>labs, vitals, notes, severity\"]\n    E2[\"Order = intent (not dispensing)\"]\n    E3[\"No enrollment denominator;<br/>silence is ambiguous\"]\n  end\n  subgraph CLM[\"Claims (transaction-driven)\"]\n    C1[\"Complete dispensings (days_supply)\"]\n    C2[\"Full utilization within enrollment\"]\n    C3[\"Defined denominator; little clinical detail\"]\n  end\n  E1 -. \"linkage closes the gaps\" .-> C1\n  E3 -. \"claims supply denominator + fills\" .-> C2",
        "caption": "EHR vs claims as data-generating processes. EHR carries clinical depth but lacks a denominator and dispensing-level exposure; claims supply completeness and a denominator but little clinical detail. Linkage combines the strengths.",
        "alt_text": "Side-by-side comparison of EHR (encounter-driven, rich clinical detail, orders as intent, no denominator) and claims (transaction-driven, complete dispensings, full utilization, defined denominator), with linkage arrows connecting their complementary strengths.",
        "source_type": "illustrative",
        "source_citations": [
          "bayley-2013"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "alternative_to",
        "target_slug": "claims-analysis",
        "notes": "Claims and EHR are the two dominant secondary-data substrates; claims give complete dispensings and a denominator, EHR gives clinical detail. Choose by whether the question hinges on exposure completeness or clinical measurement, or link the two."
      },
      {
        "relation_type": "see_also",
        "target_slug": "medicare-ffs-ma-commercial-claims-differences-rwe",
        "notes": "When EHR is linked to claims, payer heterogeneity carries over - MA encounter data are often incomplete vs Medicare FFS, so restrict utilization/exposure logic to FFS-observable person-time."
      },
      {
        "relation_type": "used_with",
        "target_slug": "diagnosis-phenotype-algorithm-1ip-2op-time-window-rwe",
        "notes": "EHR exposure/outcome/covariate phenotypes combine structured codes, problem lists, and NLP; claims-style 1 IP / 2 OP rules are often used to validate or anchor EHR phenotypes when claims are linked."
      },
      {
        "relation_type": "used_with",
        "target_slug": "high-dimensional-propensity-score-hdps-rwe",
        "notes": "EHR adds clinical dimensions (labs, vitals, NLP concepts) to the hdPS candidate-covariate pool well beyond billing codes; linkage to claims strengthens pharmacy and utilization proxies."
      },
      {
        "relation_type": "see_also",
        "target_slug": "registry-trial",
        "notes": "Registries supply adjudicated, protocol-defined outcomes that EHR (non-adjudicated, care-driven) lacks; EHR offers far greater breadth and speed."
      }
    ],
    "aliases": [
      "EHR-based study",
      "electronic health record study",
      "EHR cohort study",
      "secondary use of EHR data"
    ],
    "complexity": "foundational",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "elixhauser-comorbidity-index-rwe",
    "name": "Elixhauser Comorbidity Measures / Index",
    "short_definition": "A set of 30–31 binary comorbidity flags defined from administrative diagnosis codes, used either as individual covariates or collapsed into a single weighted score (van Walraven points or the AHRQ index) to risk-adjust and confounder-adjust RWE analyses.",
    "long_description": "The **Elixhauser comorbidity measures** are a broader alternative to the Charlson index. Elixhauser et al. (1998)\nidentified **30 comorbidity categories** (later 31) that independently affected hospital length of stay, charges,\nand in-hospital mortality, and — crucially — argued they should be entered as **individual indicator variables**\nrather than forced into a single number, so the outcome model can learn each condition's own effect. Each\ncategory is a **binary flag** (present / absent) ascertained from ICD diagnosis codes over a baseline window;\nthe canonical claims crosswalk is **Quan (2005)** for ICD-9 and ICD-10, and AHRQ publishes a maintained version\nthrough its **Healthcare Cost and Utilization Project (HCUP)** software. Two routes collapse the 31 flags into a\nsingle score when a scalar is needed: the **van Walraven (2009) point system**, which assigns each condition an\ninteger weight (some **negative**, e.g., obesity) calibrated to in-hospital mortality so the sum can range below\nzero; and the **AHRQ Elixhauser Comorbidity Index** (Moore 2017), which provides separate mortality and\nreadmission weight sets. Like the CCI, Elixhauser is a **covariate-construction method** for confounding/risk\nadjustment, not a study design or outcome.\n\n**Core conceptual distinctions.** (1) *Flags vs score*: the native form is 31 indicators (maximum resolution,\nmany degrees of freedom); the van Walraven/AHRQ point sum is a parsimonious scalar (loses condition-level\ninformation but stabilizes sparse models and eases reporting). Choosing one is a bias-variance decision. (2)\n*Which weight set*: van Walraven mortality points, AHRQ mortality index, and AHRQ readmission index give\ndifferent numbers — the target outcome should guide the choice, and it must be pre-specified. 
(3) *Exclusion\nrules*: Elixhauser deliberately **excludes conditions that are the primary reason for admission** (so a\ncomorbidity is not confused with the index event) and de-duplicates overlapping categories (e.g.,\nuncomplicated vs complicated hypertension/diabetes collapse to the more severe form). (4) *Negative weights*:\nunlike Charlson's all-positive 1–6 scale, van Walraven points include negative values, so a sicker-looking\nraw flag count does not always mean a higher score — the point sum is the quantity to model, not the flag count.\n\n**Pros, cons, and trade-offs** (named against the alternatives).\n- **vs the Charlson Comorbidity Index:** Elixhauser covers more conditions (31 vs 17) and, entered as\n  indicators or via the AHRQ index, typically predicts **in-hospital mortality and readmission** better; Charlson\n  is more parsimonious, mortality-oriented, and more universally reported. **Prefer Elixhauser** when condition\n  resolution or maximal in-hospital prediction matters; **prefer Charlson** for a compact, comparable summary.\n- **Indicators vs single point score:** keeping 31 flags lets the model fit condition-specific effects but\n  spends degrees of freedom and can be unstable or perfectly separated in small cohorts; the van Walraven/AHRQ\n  score collapses to one stable covariate at the cost of condition-level detail. **Use the score** in sparse\n  data or when a scalar covariate suffices; **use indicators** when you have the events to support them and care\n  which conditions drive risk.\n- **vs a high-dimensional propensity score (hdPS):** Elixhauser is a fixed, transparent, clinically meaningful\n  set; hdPS empirically screens hundreds of codes and may capture proxies Elixhauser omits, but is data-driven\n  and harder to interpret. 
They are complementary.\n- **van Walraven vs AHRQ weights:** van Walraven is a single mortality point system; AHRQ ships distinct\n  mortality and readmission weights and is version-dated to the ICD-10/CCSR coding era. The choice changes the\n  score and the maintenance burden, and must be reported with the software version.\n\n**When to use.** As a richer baseline comorbidity adjustment in comparative cohort, case-control, and\nhealth-services RWE; as a propensity-score input; as a case-mix adjuster in length-of-stay, readmission, cost,\nand in-hospital mortality models (its original derivation targets); and whenever the analysis needs more\ncondition resolution than the Charlson 17 provide. It is the standard comorbidity adjustment in hospital\nadministrative-data and HCUP-based research.\n\n**When NOT to use — and when it is actively misleading or dangerous.**\n- **Counting raw flags as if they were a score.** Because van Walraven points include negative weights, the\n  number of flags is **not** the risk score; summing flags (ignoring the weights and the negatives) misranks\n  patients. Model the weighted score or the indicators, never an unweighted flag count presented as severity.\n- **Including the index condition as a comorbidity.** If the condition under study (or the admission reason) is\n  left in the comorbidity set, you adjust away the exposure or the outcome. Apply Elixhauser's exclusion rules\n  relative to **your** index event, not just the generic admission-diagnosis exclusion.\n- **Mismatched weight set and outcome.** Using readmission weights to adjust a mortality model (or vice versa)\n  imports the wrong calibration. 
Match the AHRQ/van Walraven weight set to the modeled outcome, or use\n  indicators if no weight set fits.\n- **Unequal lookback or coding intensity across groups.** As with any claims comorbidity measure, more baseline\n  person-time or higher coding intensity inflates flags; require equal continuous-enrollment windows and be\n  cautious comparing across payers (FFS vs Medicare Advantage vs commercial) with different coding behavior.\n- **Treating it as frailty or functional status.** Elixhauser counts coded diseases; it does not measure\n  dependency, frailty, or function, which a claims-based frailty index targets directly.\n\n**Data-source operational depth.** In **claims** the 31 flags are set by the Quan/AHRQ ICD crosswalk over a\nfixed baseline window across inpatient, outpatient, and physician files; apply the present-on-admission /\nindex-condition exclusions and the severity collapses, then either keep indicators or apply the chosen weight\nset. AHRQ's HCUP software encodes the maintained code lists, exclusion logic, and weights and is the\nrecommended reference implementation. In **EHR**, problem lists and encounter diagnoses add detail but capture\nonly in-system care, understating out-of-system comorbidity. **Linked claims–EHR** maximizes capture at the\ncost of linkage selection. 
Cross-payer comparisons of either flags or scores require explicit caution because\ncoding intensity and enrollment continuity differ systematically across data sources.\n\n**Interpreting the output**\n\nIn the worked example, a patient with liver disease (van Walraven weight +11), lymphoma (+9), CHF\n(+7), renal failure (+5), and obesity (−4) receives a van Walraven summary score of 28.\n\n*(1) Formal interpretation.* The van Walraven summary score aggregates the 30 Elixhauser condition\nflags using weights derived from a logistic regression predicting in-hospital mortality, where some\nconditions carry negative weights (e.g., obesity, drug abuse, depression) because they were associated\nwith lower mortality in that development cohort, likely due to coding and selection patterns. A score\nof 28 places this patient in a high-mortality-risk stratum within the development data. Critically,\nthe raw flag count (here: 5 conditions flagged) is a different quantity from the van Walraven score\n(here: 28) — a patient with five low-weight conditions may score lower than a patient with one\nhigh-weight condition. Elixhauser's 30-flag structure is broader than Charlson's 17 conditions and\nbetter captures the heterogeneous comorbidity profiles of hospitalized administrative-data populations.\n\n*(2) Practical interpretation.* Use the van Walraven score as a covariate or for propensity-score\nadjustment, not as a clinical severity classification for individual patients — the weights reflect\npopulation-level mortality associations in the development sample, not individual prognosis. Decide\nbefore analysis whether to use the flag vector, the weighted summary score, or both; negative weights\nmean the summary score can decrease as comorbidities accumulate, which should be communicated\nexplicitly to clinical stakeholders who may expect a monotone severity measure. 
As with CCI, the score\nis only comparable across data sources when coding intensity and enrollment continuity are similar.",
    "primary_category": "Bias_Control",
    "tags": [
      "elixhauser",
      "comorbidity-index",
      "van-walraven",
      "ahrq-hcup",
      "risk-adjustment",
      "confounding-control",
      "claims",
      "covariate-construction"
    ],
    "applies_to_study_types": [
      "cohort_retrospective",
      "comparative_effectiveness",
      "health_services_research"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1097/00005650-199801000-00004",
        "url": "https://doi.org/10.1097/00005650-199801000-00004",
        "citation_text": "Elixhauser A, Steiner C, Harris DR, Coffey RM. Comorbidity measures for use with administrative data. Med Care. 1998;36(1):8-27.",
        "year": 1998,
        "authors_short": "Elixhauser et al.",
        "notes": "Original derivation of the 30 comorbidity categories and the case for entering them as individual indicators."
      },
      {
        "role": "explain",
        "doi": "10.1097/01.mlr.0000182534.19832.83",
        "url": "https://doi.org/10.1097/01.mlr.0000182534.19832.83",
        "citation_text": "Quan H, Sundararajan V, Halfon P, et al. Coding algorithms for defining comorbidities in ICD-9-CM and ICD-10 administrative data. Med Care. 2005;43(11):1130-1139.",
        "year": 2005,
        "authors_short": "Quan et al.",
        "notes": "The standard ICD-9/ICD-10 code crosswalk for the Elixhauser (and Charlson) comorbidities."
      },
      {
        "role": "explain",
        "doi": "10.1097/MLR.0b013e31819432e5",
        "url": "https://doi.org/10.1097/MLR.0b013e31819432e5",
        "citation_text": "van Walraven C, Austin PC, Jennings A, Quan H, Forster AJ. A modification of the Elixhauser comorbidity measures into a point system for hospital death using administrative data. Med Care. 2009;47(6):626-633.",
        "year": 2009,
        "authors_short": "van Walraven et al.",
        "notes": "Derived the single point score (with negative weights) that collapses the 31 flags into one mortality-calibrated number."
      },
      {
        "role": "demonstrate",
        "doi": "10.1097/MLR.0000000000000735",
        "url": "https://doi.org/10.1097/MLR.0000000000000735",
        "citation_text": "Moore BJ, White S, Washington R, Coenen N, Elixhauser A. Identifying increased risk of readmission and in-hospital mortality using hospital administrative data: the AHRQ Elixhauser Comorbidity Index. Med Care. 2017;55(7):698-705.",
        "year": 2017,
        "authors_short": "Moore et al.",
        "notes": "The maintained AHRQ index with distinct mortality and readmission weight sets, the current HCUP reference."
      }
    ],
    "plain_language_summary": "The Elixhauser measures are a checklist of about 31 chronic conditions, each turned into a simple yes/no flag from a patient's diagnosis codes. Unlike the Charlson index, the original idea was to keep all the flags separate so a model can learn what each condition does on its own. When you need a single number instead, you add up published points for the conditions a patient has — and unusually, some conditions (like obesity) carry negative points, so the total can even dip below zero. Researchers use these flags or the point total to level the playing field between treatment groups, so outcome differences are not just because one group was sicker. The honest caveats: it only sees conditions that got coded, you must leave out the condition you are actually studying, and because some weights are negative, the number of flags is not the same as the risk score.",
    "key_terms": [
      {
        "term": "comorbidity flag",
        "definition": "A yes/no marker for whether a patient has a given condition, set by whether qualifying diagnosis codes appear in the baseline window."
      },
      {
        "term": "van Walraven point score",
        "definition": "A way to collapse the 31 flags into one number by adding published integer weights, some of which are negative, calibrated to in-hospital death."
      },
      {
        "term": "AHRQ Elixhauser index",
        "definition": "The maintained version of the measures from AHRQ's HCUP program, shipping separate weight sets for predicting mortality and readmission."
      },
      {
        "term": "index-condition exclusion",
        "definition": "The rule that you must not count the very condition you are studying (or the admission reason) as a comorbidity, so you do not adjust away your exposure or outcome."
      }
    ],
    "worked_example": {
      "scenario": "We need a single van Walraven comorbidity score for one hospitalized patient. The AHRQ/Quan code algorithm flags five Elixhauser conditions over the baseline window; we look up each condition's van Walraven point weight — including one negative weight — and add them to get the score that enters the risk model.",
      "dataset": {
        "caption": "The Elixhauser conditions flagged for one patient, with each condition's van Walraven point weight.",
        "columns": [
          "comorbidity",
          "van_walraven_weight"
        ],
        "rows": [
          [
            "liver_disease",
            11
          ],
          [
            "lymphoma",
            9
          ],
          [
            "congestive_heart_failure",
            7
          ],
          [
            "renal_failure",
            5
          ],
          [
            "obesity",
            -4
          ]
        ]
      },
      "steps": [
        "List the flagged Elixhauser conditions and read each one's van Walraven point weight from the table, keeping the sign (obesity is negative).",
        "Confirm the index condition is excluded: none of these five is the reason for the admission under study, so all five count.",
        "Add the weights, respecting the negative one: 11 + 9 + 7 + 5 - 4 = 28.",
        "Note that the patient has five flags but the score is driven by the weights, not the count — the negative obesity weight pulls the total down."
      ],
      "result": "van Walraven Elixhauser score = 11 + 9 + 7 + 5 - 4 = 28. The same code algorithm, exclusion rules, and weight set are applied identically to every patient; the score (not the five-flag count) is the covariate carried into the model."
    },
    "prerequisites": [
      "diagnosis-phenotype-algorithm-1ip-2op-time-window-rwe",
      "continuous-enrollment-observable-time-rwe",
      "baseline-characteristics-and-covariate-balance-rwe"
    ],
    "index_definitions": [
      {
        "name": "Original Elixhauser comorbidity measures",
        "definition": "Thirty administrative-data comorbidity categories designed to be entered as separate indicator variables for hospital outcomes.",
        "source": "Elixhauser et al. 1998",
        "use": "Native high-resolution comorbidity flag set.",
        "notes": "The original recommendation was to model indicators rather than force a single score."
      },
      {
        "name": "Quan Elixhauser ICD-9/ICD-10 algorithms",
        "definition": "ICD-9-CM and ICD-10 coding algorithms for Elixhauser comorbidities in administrative data, published alongside Charlson mappings.",
        "source": "Quan et al. 2005",
        "use": "Crosswalk for coded claims/EHR administrative data.",
        "notes": "Use when a baseline window spans ICD-9/ICD-10 eras or when a study needs an auditable code-map source."
      },
      {
        "name": "van Walraven Elixhauser point score",
        "definition": "Mortality-calibrated integer point system that collapses Elixhauser flags into a single score; some weights are negative.",
        "source": "van Walraven et al. 2009",
        "use": "Scalar comorbidity summary when sparse models or concise reporting are needed.",
        "notes": "Do not interpret the raw number of flags as the score; apply the published weights."
      },
      {
        "name": "AHRQ Elixhauser Comorbidity Index",
        "definition": "HCUP-maintained Elixhauser index with outcome-specific mortality and readmission weight sets for hospital administrative data.",
        "source": "Moore et al. 2017 / AHRQ HCUP",
        "use": "Maintained reference implementation for HCUP-style administrative data.",
        "notes": "Pin the software/version and match mortality versus readmission weights to the modeled outcome."
      }
    ],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Individual indicators (native Elixhauser)",
        "description": "Enter all 31 comorbidity flags as separate covariates so the outcome model estimates each condition's own effect; the original recommended use.",
        "edge_cases": [
          "Rare conditions can cause separation or unstable estimates in small cohorts; consider grouping or switching to a point score.",
          "Spends many degrees of freedom — confirm the event count supports the number of indicators."
        ],
        "data_source_notes": "claims/ehr: ensure each flag uses the same lookback and the index-condition exclusion relative to the studied event."
      },
      {
        "name": "van Walraven point score",
        "description": "Collapse the flags into one integer score using van Walraven's mortality-calibrated weights, some negative; the workhorse scalar summary.",
        "edge_cases": [
          "Negative weights mean the flag count and the score can diverge; model the score, never the unweighted count.",
          "Calibrated to in-hospital mortality — may be suboptimal for other outcomes."
        ],
        "data_source_notes": "any: state the weight set; the points are a published integer lookup table."
      },
      {
        "name": "AHRQ Elixhauser Comorbidity Index (HCUP)",
        "description": "AHRQ's maintained version with separate mortality and readmission weight sets and version-dated ICD-10/CCSR code lists; recommended reference implementation.",
        "edge_cases": [
          "Match the weight set (mortality vs readmission) to the modeled outcome; mismatched weights misrank risk.",
          "Pin the software/version date because code lists and weights are updated over time."
        ],
        "data_source_notes": "claims: use the HCUP software code lists and exclusion logic rather than hand-rolled maps where possible."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Charlson Comorbidity Index",
        "pros_of_this": "Covers 31 conditions and, as indicators or the AHRQ index, predicts in-hospital mortality and readmission better; more condition-level resolution.",
        "cons_of_this": "Less parsimonious and less universally reported; negative weights and exclusion rules add complexity.",
        "when_to_prefer": "When you need condition-level detail or maximal in-hospital prediction rather than a single comparable summary."
      },
      {
        "compared_to": "Single van Walraven/AHRQ point score",
        "pros_of_this": "Keeping 31 indicators lets the model learn condition-specific effects.",
        "cons_of_this": "Spends many degrees of freedom and can be unstable or separated in sparse data.",
        "when_to_prefer": "When the event count is large and which conditions drive risk is itself of interest."
      },
      {
        "compared_to": "High-dimensional propensity score (hdPS)",
        "pros_of_this": "Fixed, transparent, clinically interpretable comorbidity set requiring no data-driven selection.",
        "cons_of_this": "Omits proxies for confounding that an empirical code screen could capture.",
        "when_to_prefer": "When interpretability and pre-specification matter, or as a complement entered alongside an hdPS."
      }
    ],
    "implementation_notes_by_data_source": {},
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\n\nLOOKBACK_DAYS = 365\n\n# condition -> ICD-prefix regex (ILLUSTRATIVE subset of the 31 Elixhauser categories)\nELIX = {\n    \"chf\":           r\"^(I50|428)\",\n    \"renal_failure\": r\"^(N1[789]|585|586|I120|I131)\",\n    \"liver_disease\": r\"^(K70|K71[3-5]|K72|K73|K74|571|070)\",\n    \"lymphoma\":      r\"^(C8[1-5]|C96|200|20[1-2])\",\n    \"obesity\":       r\"^(E66|2780)\",\n}\n# van Walraven points (note negatives); ILLUSTRATIVE subset\nVW_WEIGHTS = {\"chf\": 7, \"renal_failure\": 5, \"liver_disease\": 11, \"lymphoma\": 9, \"obesity\": -4}\n\ndef elixhauser(diags: pd.DataFrame, base: pd.DataFrame) -> pd.DataFrame:\n    df = diags.merge(base[[\"person_id\", \"index_date\"]], on=\"person_id\", how=\"inner\")\n    win = df[(df[\"dx_date\"] < df[\"index_date\"]) &\n             (df[\"dx_date\"] >= df[\"index_date\"] - pd.Timedelta(days=LOOKBACK_DAYS))]\n\n    flags = {}\n    for pid, g in win.groupby(\"person_id\"):\n        codes = g[\"code\"].astype(str)\n        flags[pid] = {c: bool(codes.str.match(rx).any()) for c, rx in ELIX.items()}\n\n    rows = []\n    excl = base.set_index(\"person_id\")[\"index_condition\"].to_dict()\n    for pid in base[\"person_id\"]:\n        f = flags.get(pid, {c: False for c in ELIX})\n        if excl.get(pid) in f:                 # index-condition exclusion\n            f[excl[pid]] = False\n        vw = sum(VW_WEIGHTS[c] for c, on in f.items() if on)\n        rows.append({\"person_id\": pid, **{f\"elix_{c}\": int(f[c]) for c in ELIX},\n                     \"vw_score\": vw, \"n_flags\": sum(f.values())})\n    return pd.DataFrame(rows)",
        "description": "Build Elixhauser comorbidity flags from long-format claims diagnoses and collapse them to a van Walraven point score. Inputs:\n  diags : person_id, code (ICD string stored without decimal points, e.g. I120 rather than I12.0), dx_date (datetime)\n  base  : person_id, index_date (datetime), index_condition (str key into ELIX, or None)\nThe ELIX map and VW_WEIGHTS are an ILLUSTRATIVE subset of the 31 categories; substitute the full Quan/AHRQ code lists and the\ncomplete van Walraven weight table. The studied index condition is excluded per patient.",
        "dependencies": [
          "pandas"
        ],
        "source_citations": [
          "quan-2005",
          "vanwalraven-2009"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\n\nLOOKBACK_DAYS <- 365L\n\nelix_rx <- c(   # ILLUSTRATIVE subset\n  chf           = \"^(I50|428)\",\n  renal_failure = \"^(N1[789]|585|586|I120|I131)\",\n  liver_disease = \"^(K70|K71[3-5]|K72|K73|K74|571|070)\",\n  lymphoma      = \"^(C8[1-5]|C96|200|20[1-2])\",\n  obesity       = \"^(E66|2780)\"\n)\nvw_weights <- c(chf = 7L, renal_failure = 5L, liver_disease = 11L, lymphoma = 9L, obesity = -4L)\n\nelixhauser <- function(diags, base) {\n  setDT(diags); setDT(base)\n  df <- merge(diags, base[, .(person_id, index_date)], by = \"person_id\")\n  win <- df[dx_date < index_date & dx_date >= index_date - LOOKBACK_DAYS]\n\n  flag_one <- function(codes) sapply(elix_rx, function(rx) any(grepl(rx, codes)))\n  fl <- win[, as.list(flag_one(as.character(code))), by = person_id]\n\n  out <- merge(base[, .(person_id, index_condition)], fl, by = \"person_id\", all.x = TRUE)\n  for (c in names(elix_rx)) out[is.na(get(c)), (c) := FALSE]\n  # index-condition exclusion\n  for (c in names(elix_rx)) out[index_condition == c, (c) := FALSE]\n  out[, vw_score := Reduce(`+`, Map(function(c) get(c) * vw_weights[[c]], names(elix_rx)))]\n  out[, n_flags  := Reduce(`+`, lapply(names(elix_rx), function(c) as.integer(get(c))))]\n  out[]\n}",
        "description": "R/data.table version. Inputs mirror the Python version:\n  diags : person_id, code (character ICD), dx_date (Date)\n  base  : person_id, index_date (Date), index_condition (character or NA)\nReplace the illustrative ELIX map and VW weights with the full Quan/AHRQ lists.",
        "dependencies": [
          "data.table"
        ],
        "source_citations": [
          "quan-2005",
          "vanwalraven-2009"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "%let lookback = 365;\n\n/* Keep diagnoses inside the lookback window. ORDER BY is required because the\n   BY-group DATA step below needs input sorted by person_id. */\nproc sql;\n  create table win as\n  select d.person_id, d.code, b.index_date, b.index_condition\n  from work.diags d inner join work.base b on d.person_id = b.person_id\n  where d.dx_date < b.index_date and d.dx_date >= b.index_date - &lookback\n  order by d.person_id;\nquit;\n\n/* One row per person: a flag is set if any in-window code matches its prefix list */\ndata flags;\n  set win;\n  by person_id;\n  retain chf renal liver lymphoma obesity 0;\n  if first.person_id then do; chf=0; renal=0; liver=0; lymphoma=0; obesity=0; end;\n  length idx $20; idx = index_condition;\n  c = cats(code);\n  if prxmatch(\"/^(I50|428)/\", c)                          then chf=1;\n  if prxmatch(\"/^(N1[789]|585|586|I120|I131)/\", c)        then renal=1;\n  if prxmatch(\"/^(K70|K71[3-5]|K72|K73|K74|571|070)/\", c) then liver=1;\n  if prxmatch(\"/^(C8[1-5]|C96|200|20[1-2])/\", c)          then lymphoma=1;\n  if prxmatch(\"/^(E66|2780)/\", c)                         then obesity=1;\n  if last.person_id then output;\n  keep person_id idx chf renal liver lymphoma obesity;\nrun;\n\ndata elixhauser;\n  set flags;\n  /* index-condition exclusion */\n  if idx = \"chf\"      then chf=0;\n  if idx = \"renal\"    then renal=0;\n  if idx = \"liver\"    then liver=0;\n  if idx = \"lymphoma\" then lymphoma=0;\n  if idx = \"obesity\"  then obesity=0;\n  /* van Walraven points (note negative obesity weight) */\n  vw_score = chf*7 + renal*5 + liver*11 + lymphoma*9 + obesity*(-4);\n  n_flags  = chf + renal + liver + lymphoma + obesity;\nrun;",
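        "description": "SAS build of the Elixhauser flags and van Walraven score. Inputs:\n  work.diags : person_id, code (ICD), dx_date\n  work.base  : person_id, index_date, index_condition (character key: chf, renal, liver, lymphoma, or obesity; or blank)\nIllustrative code prefixes and weights; substitute the full Quan/AHRQ crosswalk and weight table. Excludes\nthe studied index condition before scoring. Persons with no diagnosis in the lookback window are absent from the\noutput; left-join work.base back and set missing flags to 0 if every cohort member is needed.",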
        "dependencies": [],
        "source_citations": [
          "quan-2005",
          "vanwalraven-2009"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  D[Baseline-window claims<br/>equal lookback, continuous enrollment] --> A[Apply Quan/AHRQ ICD crosswalk<br/>set 31 comorbidity flags]\n  A --> X[Exclude the studied index condition<br/>and admission reason]\n  X --> C{Need a scalar?}\n  C -->|no| I[Keep 31 indicators<br/>condition-level model]\n  C -->|yes| W[Apply weight set<br/>van Walraven or AHRQ mortality/readmission]\n  W --> S[Single point score<br/>negatives allowed]\n  I --> U[Use as covariates / PS input / Table 1]\n  S --> U",
        "caption": "Elixhauser construction — set the 31 flags from claims, apply the index-condition exclusion, then either keep indicators for maximal resolution or collapse to a weighted score (van Walraven/AHRQ) matched to the modeled outcome.",
        "alt_text": "Flowchart from baseline-window claims through the Elixhauser crosswalk and index-condition exclusion to a choice between keeping 31 indicators or collapsing to a weighted point score.",
        "source_type": "illustrative",
        "source_citations": [
          "vanwalraven-2009"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "see_also",
        "target_slug": "charlson-comorbidity-index-rwe",
        "notes": "The more parsimonious 17-condition mortality-weighted index; Charlson trades Elixhauser's resolution for comparability and a single number."
      },
      {
        "relation_type": "see_also",
        "target_slug": "claims-based-frailty-index-rwe",
        "notes": "Frailty indices capture functional dependency that diagnosis-counting Elixhauser flags miss; the measures adjust for different risk axes."
      },
      {
        "relation_type": "used_with",
        "target_slug": "propensity-score-methods-psm-iptw",
        "notes": "Elixhauser indicators or the van Walraven/AHRQ score are standard covariates entering propensity-score models."
      },
      {
        "relation_type": "complements",
        "target_slug": "high-dimensional-propensity-score-hdps-rwe",
        "notes": "Elixhauser is a fixed transparent comorbidity set; hdPS empirically screens hundreds of codes. The two are often combined."
      },
      {
        "relation_type": "requires",
        "target_slug": "diagnosis-phenotype-algorithm-1ip-2op-time-window-rwe",
        "notes": "Setting each comorbidity flag uses the same inpatient/outpatient diagnosis rules as phenotype algorithms."
      },
      {
        "relation_type": "requires",
        "target_slug": "continuous-enrollment-observable-time-rwe",
        "notes": "Flag counts and scores are only comparable inside an equal continuous-enrollment lookback across groups."
      },
      {
        "relation_type": "used_with",
        "target_slug": "baseline-characteristics-and-covariate-balance-rwe",
        "notes": "Elixhauser flags and scores are reported in Table 1 to demonstrate baseline comparability between exposure groups."
      }
    ],
    "aliases": [
      "Elixhauser comorbidity measures",
      "Elixhauser index",
      "van Walraven score",
      "AHRQ Elixhauser Comorbidity Index",
      "Elixhauser-van Walraven",
      "Quan-Elixhauser Comorbidity Index",
      "Quan Elixhauser comorbidity algorithm",
      "Quan ICD Elixhauser",
      "HCUP Elixhauser index"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "empirical-calibration-negative-controls-rwe",
    "name": "Empirical Calibration with Negative Controls",
    "short_definition": "A quantitative-bias method that fits an empirical null distribution to the effect estimates of many known-null negative-control pairs (and, for interval calibration, synthetic positive controls), then uses that distribution to recalibrate the p-value and confidence interval of the target estimate for residual systematic error left by the study design.",
    "long_description": "**Empirical calibration** treats the effect estimates from a large panel of *negative controls* (exposure-outcome pairs\nfor which the true relative effect is known, or assumed, to be null) as draws from the systematic-error distribution that\nthe study's design, data source, and analytic pipeline impose on *every* estimate, including the target. Instead of using\none or two negative controls as pass/fail falsification tests, it pools tens to hundreds of them, fits an **empirical null**\n(typically a Gaussian on the log-rate-ratio scale with mean `mu` and SD `sigma`, estimated by maximum likelihood while\npropagating each control's own standard error), and then evaluates the target estimate against that empirical null rather\nthan against the theoretical null, which assumes systematic error with mean 0 and variance 0. The recalibrated p-value is the\ntail probability of the target's log estimate under the fitted systematic-error distribution; the recalibrated confidence\ninterval additionally uses *synthetic positive controls* (negative controls into which a known relative risk has been\ninjected) to learn how bias and its variance scale across the true effect size, then widens and shifts the interval\naccordingly (Schuemie 2014; Schuemie 2018).\n\n**Core conceptual distinction.** A conventional p-value asks \"how surprising is this estimate if the true effect is null\n*and the only error is random sampling*?\" Empirical calibration replaces the second clause: it asks \"how surprising is this\nestimate relative to the estimates the same pipeline produces for things we know are null?\" The quantity being modeled is\nnot the target effect but the *systematic error* of the procedure, `theta_hat - log(RR_true)`, the estimated log effect minus\nthe true log effect. Two calibrated quantities must be kept distinct and pre-specified. 
(1) The **calibrated p-value** tests the null using the empirical null `N(mu, sigma^2)`; it\ncorrects for the *mean and spread of residual confounding/measurement bias* but assumes that bias does not vary with the\ntrue effect size. (2) The **calibrated confidence interval** uses a *systematic error model* fit to both negative and\npositive controls, allowing `mu` and `sigma` to depend on the true log-RR, and yields an interval with the nominal\n*empirical* coverage (e.g., 95% of calibrated CIs cover the true effect across the control panel). Calibration recalibrates\n*inference*; it does not change the point estimate and is not a substitute for a sound design — it diagnoses and partially\ncorrects the residual error a good design leaves behind, and it inflates uncertainty rather than manufacturing precision.\n\n**Pros, cons, and trade-offs.**\n- **vs single/few negative controls (falsification tests):** the standard practice of running one or two negative controls\n  and eyeballing whether they are \"null enough\" has no calibrated decision rule and no way to translate an observed control\n  bias into corrected inference for the target. Empirical calibration turns the same controls into a quantitative,\n  pre-specifiable correction with an interpretable empirical null. Cost: it needs *many* credible controls (rule of thumb\n  ≥20-50; CI calibration needs positive controls too), and it assumes the controls share the target's systematic-error\n  structure. **Prefer empirical calibration** in high-throughput claims/EHR or network studies where a control library can\n  be curated; **prefer a handful of falsification controls** when only a few defensible controls exist.\n- **vs E-value / array-based sensitivity analysis:** the E-value asks how strong an *unmeasured confounder* would have to\n  be to explain away the estimate; it is a hypothetical stress test requiring no control data. 
Empirical calibration\n  instead *measures* the net residual bias the pipeline actually exhibits on nulls. Cost: the E-value is model-free and\n  needs no controls but says nothing about measurement error, misclassification, or selection that calibration captures\n  jointly; calibration captures the *aggregate* of all such biases but cannot attribute them to a specific source.\n- **vs probabilistic bias analysis (PBA) / bias-parameter QBA:** PBA places priors on specific bias parameters\n  (sensitivity/specificity, confounder prevalence, RR) and propagates them. Empirical calibration is *data-driven and\n  parameter-free*: it learns the net bias from observed nulls instead of eliciting bias priors. Cost: PBA can model biases\n  for which no negative controls exist and can decompose by mechanism; calibration cannot model a bias the controls do not\n  share and gives no mechanistic decomposition. They are complements, not substitutes — calibration for net residual bias\n  you can observe, PBA for named biases you must assume.\n\n**When to use.** High-throughput observational effect-estimation where the same design/analysis is applied uniformly across\nmany outcomes or exposures (OHDSI-style network studies, signal screening, large claims/EHR safety studies); whenever a\ncredible library of ≥20-50 negative controls can be assembled with reliable phenotype algorithms and adequate counts; when\nreviewers/regulators want residual systematic error quantified rather than merely asserted; and when the design is already\ndefensible (active comparator, new-user, washout) so that calibration is correcting a *small* residual rather than rescuing\na broken comparison. 
Use CI calibration (with positive controls) whenever you report intervals, not just significance.\n\n**When NOT to use — and when it is actively misleading or dangerous.**\n- **Too few or contaminated controls.** With <20 controls the empirical null is estimated with large uncertainty and the\n  correction is unstable; if some \"negative\" controls are in fact non-null (the exposure really does affect them), the\n  null is biased and calibration *under-corrects or shifts the wrong way*. Curating the control panel is the load-bearing\n  assumption — a panel chosen for convenience can be worse than no calibration.\n- **Controls do not share the target's systematic-error structure.** If negative controls are drawn from different\n  outcome-detection pathways (e.g., controls are inpatient-coded events but the target is an outpatient-coded event with\n  different surveillance), the empirical null estimates the wrong bias distribution. Calibration silently licenses a\n  contaminated comparison — this is the dangerous failure: a tidy \"calibrated\" CI that imports the bias of irrelevant\n  controls.\n- **As a substitute for design.** Calibration cannot fix confounding by indication, immortal time, or a wrong comparator;\n  if the design is broken the empirical null will be wide and off-center, and a calibrated CI that still excludes the null\n  is more likely to reflect a non-shared bias than a real effect. 
Calibration is a final-mile diagnostic, not a rescue.\n- **Differential bias by true effect size without positive controls.** Using p-value calibration alone (assuming bias is\n  constant across effect sizes) when bias actually scales with the effect will produce miscalibrated intervals; CI\n  calibration with synthetic positive controls is required, and synthetic positives can themselves be unrealistic if the\n  injection mechanism does not resemble a true causal effect.\n\n**Data-source operational depth.**\n- **Administrative claims (FFS):** the natural home — large, uniform, and amenable to high-throughput control panels built\n  from condition concept sets. Failure modes: (i) **Medicare Advantage / capitated person-time lacks complete FFS claims**,\n  so control-outcome counts are differentially undercounted for MA enrollees; restrict the control analyses to the same\n  FFS-observable person-time used for the target, or the empirical null will reflect missingness rather than bias. (ii)\n  **Differential competing risks by exposure in elderly claims** (a comparator with higher background mortality removes\n  person-time before the control outcome can be coded) shifts negative-control rate ratios away from null in a\n  direction the target shares — this is *good* (calibration captures it) only if the controls experience the same competing\n  risk; pick controls with similar event timing. (iii) Low-count controls produce unstable per-control standard errors;\n  require a minimum cell count and use the MLE that weights by precision rather than treating each control equally.\n- **EHR:** site-specific coding, referral patterns, and visit-driven capture make systematic error *heterogeneous across\n  sites*. A single pooled empirical null can be dominated by one site's idiosyncratic miscoding. 
**Calibrate within\n  database/site and meta-analyze the calibrated estimates**, or include site as a stratifier in the control analyses.\n  Note/lab-derived phenotypes change negative-control specificity, so a control that is \"null\" in claims may not be in EHR.\n- **Registry:** strong outcome adjudication makes individual controls cleaner but the *number* of curatable controls is\n  usually small, undermining the empirical-null fit; registries are better as the outcome layer in a linked design than as\n  a standalone calibration substrate.\n- **Linked claims-EHR-registry:** best of both — EHR/registry for clean control phenotypes and claims for complete\n  person-time — but linkage selection means the calibration panel must be drawn from the *same linkable subset* as the\n  target, and order/fill/service date discrepancies must be reconciled before counting control events.\n\n**Worked claims example.** Question: is initiation of drug A vs active comparator B associated with acute myocardial\ninfarction among adults with the shared indication, in a commercial + Medicare FFS database, using a new-user active-comparator\ndesign with high-dimensional PS adjustment. (1) **Target analysis:** build the cohort with 365-day continuous FFS-observable\nenrollment and washout, time zero at first qualifying fill, and estimate the AMI hazard ratio with a Cox model on the\nPS-matched set — say `HR = 1.30`, 95% CI 1.05-1.61, p = 0.017. (2) **Negative-control panel:** select ~50 outcomes with no\nplausible causal link to either drug (e.g., ingrowing nail, contusion, cataract — each a validated concept set with adequate\ncounts) and run the *identical* pipeline (same washout, same FFS-observable person-time excluding MA-only spans, same PS\nmodel, same Cox specification) for each, yielding 50 pairs of `(log_hr, se)`. 
(3) **Fit the empirical null:** by MLE, the\ncontrols center at `mu = 0.08` (log scale) with extra dispersion `sigma = 0.10` beyond sampling error, which is evidence of mild\nresidual confounding biasing estimates ~8% upward. (4) **Calibrate the p-value:** evaluate the target's `log(1.30)=0.262`\nagainst `N(0.08, 0.10^2 + se_target^2)`; the calibrated p rises from 0.017 to ~0.08 and is no longer \"significant\" once the\npipeline's own null behavior is accounted for. (5) **Synthetic positive controls + CI calibration:** inject known HRs\n(1.5, 2, 4) into negative controls by adding the corresponding extra outcomes, fit the systematic-error model across true\neffect sizes, and recompute the interval: calibrated 95% CI ≈ 0.92-1.84. (6) **Report both** the uncalibrated and calibrated\ninference and the empirical-null parameters as a diagnostic; a `mu` far from 0 or large `sigma` is a flag that the design,\nnot the calibration, needs revisiting.\n\n**Interpreting the output**\n\nFrom the worked example: naive HR = 1.30 (95% CI 1.05–1.61, p = 0.017). The negative-control panel yields\nempirical null N(mu = 0.08, sigma = 0.10). Calibrated p ≈ 0.08; calibrated 95% CI ≈ 0.92–1.84.\n\n*(1) Formal interpretation.* The calibrated p-value is the probability of observing a log-HR as extreme\nas log(1.30) = 0.262 under the empirical null distribution N(0.08, 0.10² + se_target²), which models\nthe study pipeline's observed behavior when the true effect is zero. It is *not* a frequentist p-value\nover hypothetical replications of a fixed data-generating process; it is inference benchmarked against\nthis design's own null-control behavior. The calibrated CI is similarly re-centered (shifted by the estimated bias)\nand widened to absorb the pipeline's systematic error; it is *not* a conventional Wald interval. 
The\nempirical null parameters (mu = 0.08, sigma = 0.10) are themselves estimated with uncertainty from a finite\ncontrol panel, so both calibrated quantities carry imprecision from that estimation step.\n\n*(2) Practical interpretation.* The naive analysis declared statistical significance (p = 0.017), but\nafter accounting for the pipeline's tendency to over-estimate by ≈ 0.08 on the log-HR scale (about 8% on the\nHR scale) and its extra dispersion beyond sampling variability, the calibrated p rises to ≈ 0.08 and the CI\nwidens to include the null. A regulator or payer should treat the calibrated result, not the naive one, as the\nstudy's actual inferential claim. A large mu or sigma in the empirical null is itself diagnostic: it\nsignals that the design or covariate adjustment needs reconsidering, because calibration cannot rescue a flawed design.",
    "primary_category": "Bias_Control",
    "tags": [
      "empirical-calibration",
      "negative-controls",
      "positive-controls",
      "systematic-error",
      "p-value-calibration",
      "confidence-interval-calibration",
      "quantitative-bias-analysis",
      "ohdsi"
    ],
    "applies_to_study_types": [
      "claims_analysis",
      "ehr_study",
      "multi_database",
      "active_comparator_new_user"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1002/sim.5925",
        "url": "https://doi.org/10.1002/sim.5925",
        "citation_text": "Schuemie MJ, Ryan PB, DuMouchel W, Suchard MA, Madigan D. Interpreting observational studies: why empirical calibration is needed to correct p-values. Statistics in Medicine. 2014;33(2):209-218.",
        "year": 2014,
        "authors_short": "Schuemie et al.",
        "notes": "Original statement of empirical calibration of p-values using an empirical null fit to many negative controls in observational healthcare data."
      },
      {
        "role": "explain",
        "doi": "10.1073/pnas.1708282114",
        "url": "https://doi.org/10.1073/pnas.1708282114",
        "citation_text": "Schuemie MJ, Hripcsak G, Ryan PB, Madigan D, Suchard MA. Empirical confidence interval calibration for population-level effect estimation studies in observational healthcare data. Proceedings of the National Academy of Sciences. 2018;115(11):2571-2577.",
        "year": 2018,
        "authors_short": "Schuemie et al.",
        "notes": "Extends calibration to confidence intervals via a systematic-error model fit to negative and synthetic positive controls; defines empirical coverage as the evaluation target."
      },
      {
        "role": "explain",
        "doi": "10.1097/EDE.0b013e3181d61eeb",
        "url": "https://doi.org/10.1097/EDE.0b013e3181d61eeb",
        "citation_text": "Lipsitch M, Tchetgen Tchetgen E, Cohen T. Negative controls: a tool for detecting confounding and bias in observational studies. Epidemiology. 2010;21(3):383-388.",
        "year": 2010,
        "authors_short": "Lipsitch et al.",
        "notes": "Conceptual foundation for negative controls as detectors of shared confounding/bias, which empirical calibration operationalizes at scale."
      },
      {
        "role": "demonstrate",
        "doi": null,
        "url": "https://ohdsi.github.io/EmpiricalCalibration/",
        "citation_text": "OHDSI EmpiricalCalibration R package — reference documentation and vignettes for p-value and confidence-interval calibration.",
        "year": 2024,
        "authors_short": "OHDSI",
        "notes": "Maintained reference implementation of the methods (fitNull, fitSystematicErrorModel, calibrateP, calibrateConfidenceInterval)."
      }
    ],
    "plain_language_summary": "Empirical calibration checks whether a study's analysis pipeline is already producing systematically wrong answers even for drug-outcome pairs where the true answer is known to be null, and then uses that pattern of wrongness to adjust the final result. You pick many outcome measures that the drug in question could not plausibly cause -- for example, ingrown toenail or cataract -- run the identical analysis on each of them, and collect all those estimates. If they cluster away from the true null, the pipeline itself carries a hidden bias; the method measures that bias and recalculates your p-value and confidence interval accordingly, making the result more honest. A key caveat is that this works only when you have at least 20 of those known-null pairs and they all share the same sources of error as your real question.",
    "key_terms": [
      {
        "term": "negative control outcome",
        "definition": "A health event -- such as an ingrown toenail or a broken arm from trauma -- that has no plausible biological link to the drug being studied, so the true effect is null (no association) and any non-null estimate reflects error rather than causation."
      },
      {
        "term": "empirical null distribution",
        "definition": "The bell-curve-shaped pattern of where negative-control estimates actually land across many known-null pairs; if they cluster above or below zero, the study pipeline is carrying systematic bias that can be measured and corrected."
      },
      {
        "term": "systematic bias",
        "definition": "A consistent shift in estimated effects caused by imperfect study design, data quirks, or residual confounding -- bias that pushes ALL estimates in the same direction, not just random noise."
      },
      {
        "term": "calibrated p-value",
        "definition": "A recalculated p-value that judges how surprising the real finding is relative to the center and spread of the negative-control estimates, rather than assuming the pipeline is perfectly unbiased."
      },
      {
        "term": "confidence interval",
        "definition": "A range of values around the main estimate that is intended to capture the true effect a stated percentage of the time (usually 95%); after calibration, this range is widened and re-centered to reflect actual pipeline bias."
      },
      {
        "term": "hazard ratio",
        "definition": "A ratio of event rates between two groups over follow-up; a hazard ratio of 1.30 means that, at any given moment, people in the treated group who have not yet had the event experience it at a rate 30% higher than those in the comparison group."
      }
    ],
    "worked_example": {
      "scenario": "A team runs a new-user active-comparator study asking whether Drug A raises the risk of heart attack (acute myocardial infarction, AMI) compared with Drug B in a claims database. Their main result is a hazard ratio of 1.30 with a naive p-value of 0.017 -- apparently statistically significant. Before trusting that finding, they run the identical analysis pipeline on eight negative-control outcomes: conditions like ingrown toenail, contusion, and cataract that Drug A cannot plausibly cause. If the pipeline were unbiased, every negative-control hazard ratio should hover near 1.0 (log scale: near 0). Instead, the team finds that all eight cluster above 1.0, suggesting the pipeline systematically inflates estimates by roughly 0.08 on the log scale (about 8 percent on the hazard-ratio scale). Calibrating the AMI result against this observed bias pattern changes the conclusion.",
      "dataset": {
        "caption": "Negative-control estimates from running the identical pipeline on eight known-null outcomes. Log HR near 0 would be expected if there were no bias; values above 0 reveal systematic upward drift.",
        "columns": [
          "outcome",
          "estimated_log_hr",
          "standard_error",
          "naive_p"
        ],
        "rows": [
          [
            "ingrown toenail",
            "0.07",
            "0.12",
            "0.56"
          ],
          [
            "traumatic contusion",
            "0.09",
            "0.11",
            "0.41"
          ],
          [
            "cataract",
            "0.06",
            "0.13",
            "0.64"
          ],
          [
            "sebaceous cyst",
            "0.10",
            "0.14",
            "0.48"
          ],
          [
            "sprain of ankle",
            "0.08",
            "0.10",
            "0.42"
          ],
          [
            "conjunctivitis",
            "0.09",
            "0.13",
            "0.49"
          ],
          [
            "inguinal hernia",
            "0.07",
            "0.12",
            "0.56"
          ],
          [
            "dental abscess",
            "0.08",
            "0.11",
            "0.47"
          ]
        ]
      },
      "steps": [
        "Step 1 -- Spot the pattern: every single negative-control log HR is positive (range 0.06 to 0.10) even though the true effect for each should be 0. No single estimate is individually significant, but all eight pointing in the same direction is unlikely to be random noise alone.",
        "Step 2 -- Fit the empirical null: maximum-likelihood estimation across all eight pairs gives a null distribution centered at mu = 0.08 (log scale) with a spread of sigma = 0.10. This means the pipeline tends to inflate log hazard ratios by about 0.08 on average, with modest extra variability.",
        "Step 3 -- Interpret the empirical null: a naive analysis assumes the pipeline's null is centered at 0 with no extra spread. The empirical null says the real center is 0.08, not 0 -- so the bar the AMI result must clear is higher.",
        "Step 4 -- Calibrate the AMI p-value: the AMI log HR is log(1.30) = 0.262. Instead of asking how far 0.262 is from 0 (the naive test), calibration asks how far 0.262 is from 0.08 in units of the combined uncertainty (sampling error plus the sigma = 0.10 from the empirical null). The adjusted distance is smaller, and the calibrated p-value rises from 0.017 to approximately 0.08 -- no longer below the conventional 0.05 threshold.",
        "Step 5 -- Calibrate the confidence interval: the interval is re-centered by subtracting the empirical bias (0.08) and widened to account for the extra spread (sigma = 0.10). The naive 95% CI of 1.05 to 1.61 becomes approximately 0.92 to 1.84 after calibration -- now spanning 1.0, consistent with no effect."
      ],
      "result": "Naive result: HR = 1.30, 95% CI 1.05-1.61, p = 0.017 (appears significant). Empirical null from 8 negative controls: mu = 0.08, sigma = 0.10 (the pipeline shifts estimates upward by ~0.08 on the log scale, about 8% on the HR scale). Calibrated result: p = 0.08 (no longer significant), calibrated 95% CI approximately 0.92-1.84 (now includes 1.0). After calibration, the apparent AMI signal cannot be distinguished from the systematic upward drift the pipeline imposes on every estimate; the data are compatible with no true drug effect."
    },
    "prerequisites": [
      "negative-control-outcomes-rwe",
      "quantitative-bias-analysis-toolkit-rwe",
      "e-value-sensitivity-analysis"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "P-value calibration (empirical null)",
        "description": "Fit a Gaussian empirical null N(mu, sigma^2) to the negative-control (log effect, SE) pairs by MLE, then report the target's two-sided tail probability under that null. Corrects for the mean and spread of residual bias but assumes bias does not vary with the true effect size.",
        "edge_cases": [
          "Fewer than ~20 credible controls makes mu/sigma unstable; the correction inherits large uncertainty.",
          "A contaminated (non-null) control biases mu and shifts the calibrated p in the wrong direction."
        ],
        "data_source_notes": "claims: run controls on the identical FFS-observable person-time and analytic pipeline as the target; EHR: fit within site to avoid one site's miscoding dominating the null."
      },
      {
        "name": "Confidence-interval calibration (systematic error model)",
        "description": "Fit a systematic-error model to negative AND synthetic positive controls so that bias mean and SD may depend on the true effect size, then recalibrate the interval to achieve nominal empirical coverage across the control panel.",
        "edge_cases": [
          "Synthetic positive controls injected by adding outcomes can be unrealistic if the injection does not mimic a true causal mechanism (e.g., ignores time-varying hazard).",
          "Requires enough positive controls spanning the relevant effect-size range to estimate the slope of bias on true effect."
        ],
        "data_source_notes": "Positive-control injection must respect competing risks and follow-up structure of the source data; in claims, inject extra outcomes within observed FFS person-time only."
      },
      {
        "name": "Leave-one-out diagnostic calibration",
        "description": "Calibrate each negative control using the empirical null fit from the remaining controls; the fraction of leave-one-out calibrated CIs that cover the (null) truth estimates empirical coverage and validates the calibration before it is applied to the target.",
        "edge_cases": [
          "Coverage far below nominal signals a misspecified null or contaminated/heterogeneous controls."
        ],
        "data_source_notes": "Standard pre-registration diagnostic in OHDSI network studies before unblinding the target estimate."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Single or few negative controls used as falsification tests",
        "pros_of_this": "Pools many controls into a quantitative, pre-specifiable empirical null and translates observed control bias into corrected p-values and intervals with an interpretable decision rule.",
        "cons_of_this": "Needs many credible controls (≥20-50; positive controls for CI calibration) that share the target's systematic-error structure.",
        "when_to_prefer": "High-throughput claims/EHR or multi-database effect-estimation where a control library can be curated."
      },
      {
        "compared_to": "E-value / array-based sensitivity analysis",
        "pros_of_this": "Measures the net residual bias the pipeline actually exhibits on nulls (confounding + measurement + selection jointly) rather than positing a hypothetical confounder.",
        "cons_of_this": "Requires control data and gives an aggregate, non-attributable bias; the E-value needs no controls and is model-free.",
        "when_to_prefer": "When a credible negative-control panel exists and reviewers want measured, not hypothetical, residual error."
      },
      {
        "compared_to": "Probabilistic / bias-parameter quantitative bias analysis",
        "pros_of_this": "Data-driven and parameter-free — learns net bias from observed nulls instead of eliciting bias priors.",
        "cons_of_this": "Cannot model a bias the controls do not share and offers no mechanistic decomposition by bias source.",
        "when_to_prefer": "For net residual bias observable through negative controls; combine with PBA for named biases that lack controls."
      }
    ],
    "implementation_notes_by_data_source": {},
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\nfrom scipy.optimize import minimize\nfrom scipy.stats import norm\n\ndef fit_empirical_null(log_rr: np.ndarray, se: np.ndarray) -> tuple[float, float]:\n    # MLE of mu, sigma for theta_i ~ N(mu, sigma^2) observed with sampling noise:\n    # log_rr_i ~ N(mu, sigma^2 + se_i^2).  (Schuemie 2014, eq. for the empirical null.)\n    def neg_ll(p):\n        mu, log_sigma = p\n        var = np.exp(2 * log_sigma) + se ** 2\n        return 0.5 * np.sum(np.log(2 * np.pi * var) + (log_rr - mu) ** 2 / var)\n    start = [float(np.mean(log_rr)), np.log(np.std(log_rr, ddof=1) + 1e-6)]\n    res = minimize(neg_ll, start, method=\"Nelder-Mead\")\n    mu, log_sigma = res.x\n    return float(mu), float(np.exp(log_sigma))\n\ndef calibrate_p(target_log_rr: float, target_se: float,\n                mu: float, sigma: float) -> float:\n    # Two-sided tail probability of the target under the empirical null,\n    # adding the target's own sampling variance to the null variance.\n    sd = np.sqrt(sigma ** 2 + target_se ** 2)\n    z = (target_log_rr - mu) / sd\n    return float(2 * norm.sf(abs(z)))\n\ndef calibrate_ci(target_log_rr: float, target_se: float,\n                 mu: float, sigma: float, level: float = 0.95) -> tuple[float, float]:\n    # Gaussian-approx calibrated interval: re-center by the null mean and widen by\n    # the systematic-error SD. 
(Use the OHDSI systematic-error model for the full,\n    # effect-size-dependent version with positive controls.)\n    sd = np.sqrt(sigma ** 2 + target_se ** 2)\n    crit = norm.ppf(1 - (1 - level) / 2)\n    lo, hi = (target_log_rr - mu) - crit * sd, (target_log_rr - mu) + crit * sd\n    return float(np.exp(lo)), float(np.exp(hi))\n\n# Usage on a curated control panel + a target estimate:\n# mu, sigma = fit_empirical_null(controls[\"log_rr\"].to_numpy(), controls[\"se\"].to_numpy())\n# cal_p = calibrate_p(target_log_rr, target_se, mu, sigma)\n# cal_lo, cal_hi = calibrate_ci(target_log_rr, target_se, mu, sigma)",
        "description": "Empirical p-value and (Gaussian-approx) CI calibration from negative-control estimates. Required input table\n(one row per negative control, produced by running the IDENTICAL target pipeline on each control outcome):\n  controls : neg_control_id, log_rr (float, log of the estimated rate/hazard/odds ratio), se (float, its standard error)\nProvide the target estimate as target_log_rr and target_se on the same scale. fit_empirical_null returns the\nmaximum-likelihood empirical null (mu, sigma) marginalizing each control's sampling variance; calibrate_p and\ncalibrate_ci evaluate the target against it. For full systematic-error-model CI calibration with positive controls,\nuse the OHDSI EmpiricalCalibration R package (see R block).",
        "dependencies": [
          "numpy",
          "scipy"
        ],
        "source_citations": [
          "schuemie-2014"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(EmpiricalCalibration)\n\n# --- p-value calibration: empirical null from negative controls only ---\nnull <- fitNull(negatives$logRr, negatives$seLogRr)          # MLE of (mean, sd) of the null\ncal_p <- calibrateP(null, target_logRr, target_seLogRr)      # two-sided calibrated p-value\n\n# --- confidence-interval calibration: systematic-error model (negatives + positives) ---\nmodel <- fitSystematicErrorModel(\n  logRr    = c(negatives$logRr,    positives$logRr),\n  seLogRr  = c(negatives$seLogRr,  positives$seLogRr),\n  trueLogRr = c(rep(0, nrow(negatives)), positives$trueLogRr)\n)\ncal_ci <- calibrateConfidenceInterval(target_logRr, target_seLogRr, model)\n# cal_ci$logRr, cal_ci$logLb95Rr, cal_ci$logUb95Rr  -> exponentiate for RR scale\n\n# --- diagnostic: empirical-null plot and leave-one-out coverage on the controls ---\nplotCalibrationEffect(negatives$logRr, negatives$seLogRr)\n# evaluateCiCalibration cross-validates (leave-one-out by default) over all controls\ncoverage <- evaluateCiCalibration(\n  logRr     = c(negatives$logRr,   positives$logRr),\n  seLogRr   = c(negatives$seLogRr, positives$seLogRr),\n  trueLogRr = c(rep(0, nrow(negatives)), positives$trueLogRr)\n)  # coverage at the 95% interval width should be close to 0.95",
        "description": "Production p-value AND confidence-interval calibration with the maintained OHDSI EmpiricalCalibration package.\nRequired inputs:\n  negatives : data.frame with logRr, seLogRr   (one row per negative control, from the identical target pipeline)\n  positives : data.frame with logRr, seLogRr, trueLogRr  (synthetic positive controls; trueLogRr = injected effect)\nfitNull gives the empirical null for p-value calibration; fitSystematicErrorModel uses negatives + positives so the\ncalibrated CI accounts for bias that varies with the true effect size. Always run the leave-one-out coverage check\n(computeTraditionalCi vs calibrateConfidenceInterval over the controls) before trusting the target calibration.",
        "dependencies": [
          "EmpiricalCalibration"
        ],
        "source_citations": [
          "schuemie-2014",
          "schuemie-2018"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* MLE of the empirical null: theta_i ~ N(mu, sigma^2) observed with sampling noise se_i. */\nproc nlmixed data=work.negctrl;\n  parms mu 0 logsigma -2;                         /* sigma = exp(logsigma) keeps it positive */\n  sigma2 = exp(2*logsigma);\n  totvar = sigma2 + se*se;                         /* null variance + per-control sampling variance */\n  ll = -0.5*( log(2*constant('pi')*totvar) + (log_rr - mu)**2 / totvar );\n  model log_rr ~ general(ll);\n  ods output ParameterEstimates=nullfit;\nrun;\n\n/* Pull the fitted null parameters into macro variables. */\nproc sql noprint;\n  select Estimate into :mu        trimmed from nullfit where Parameter='mu';\n  select Estimate into :logsigma  trimmed from nullfit where Parameter='logsigma';\nquit;\n\n/* Calibrate the target p-value against the empirical null (add target sampling variance). */\ndata calibrated;\n  mu = &mu;  sigma = exp(&logsigma);\n  sd = sqrt(sigma**2 + (&target_se)**2);\n  z  = (&target_log_rr - mu) / sd;\n  calibrated_p = 2*(1 - cdf('NORMAL', abs(z)));\n  put 'Empirical null: mu=' mu ' sigma=' sigma '  calibrated p=' calibrated_p;\nrun;",
        "description": "Empirical p-value calibration in SAS via PROC NLMIXED (maximum-likelihood fit of the empirical null) after the\ntarget pipeline has been run on every negative control. Required input dataset:\n  work.negctrl : neg_control_id, log_rr, se   (one row per negative control)\nplus macro variables &target_log_rr and &target_se for the target estimate. PROC NLMIXED maximizes the marginal\nlog-likelihood log_rr_i ~ N(mu, sigma^2 + se_i^2); the data step then evaluates the target against the fitted null.\nConfidence-interval calibration with positive controls requires the systematic-error model (slope of bias on true\neffect) and is most practical via the OHDSI EmpiricalCalibration R package; SAS handles the p-value case natively.",
        "dependencies": [],
        "source_citations": [
          "schuemie-2014"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Design[Fixed study design + analytic pipeline<br/>new-user, active comparator, PS, Cox] --> Target[Run on TARGET<br/>log_rr_target, se_target]\n  Design --> Panel[Run identical pipeline on<br/>each NEGATIVE control outcome]\n  Panel --> Pairs[~20-50 control estimates<br/>log_rr_i, se_i  truth = null]\n  Pairs --> Null[Fit empirical null by MLE<br/>theta ~ N mu, sigma^2 + se^2]\n  Null --> Diag{Leave-one-out coverage<br/>≈ 95%?}\n  Diag -- no --> Fix[Null misspecified or controls<br/>contaminated/heterogeneous: revisit panel/design]\n  Diag -- yes --> CalP[Calibrated p:<br/>tail of target under N mu, sigma]\n  Pos[Synthetic POSITIVE controls<br/>inject known true RR] --> SEM[Systematic-error model<br/>bias varies with true effect]\n  Panel --> SEM\n  SEM --> CalCI[Calibrated CI:<br/>re-centered + widened for residual bias]\n  Target --> CalP\n  Target --> CalCI",
        "caption": "Empirical calibration data flow. The same pipeline that produces the target estimate produces the negative-control estimates; their distribution is the empirical null used to recalibrate the target's p-value, while synthetic positive controls feed a systematic-error model for confidence-interval calibration. Leave-one-out coverage gates the calibration.",
        "alt_text": "Flowchart showing the fixed pipeline run on both the target and many negative controls, fitting an empirical null, a leave-one-out coverage gate, and a systematic-error model from positive controls feeding calibrated p-values and confidence intervals.",
        "source_type": "illustrative",
        "source_citations": [
          "schuemie-2014",
          "schuemie-2018"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  A[Theoretical null<br/>mean 0, no systematic error] -->|conventional p, CI| B[Often falsely precise<br/>residual bias ignored]\n  C[Empirical null<br/>mean mu, SD sigma from controls] -->|calibrated p, CI| D[Inference reflects the bias<br/>the pipeline actually shows]\n  C -.diagnostic.-> E[mu far from 0 or large sigma<br/>= fix the DESIGN, not the calibration]",
        "caption": "Conceptual contrast. Conventional inference assumes the only error is random sampling; empirical calibration benchmarks the target against the systematic-error distribution observed on known nulls, and the fitted null doubles as a design diagnostic.",
        "alt_text": "Two-branch diagram contrasting the theoretical null with the empirical null and noting that a large fitted bias mean or spread indicates a design problem.",
        "source_type": "illustrative",
        "source_citations": [
          "schuemie-2014"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "is_variant_of",
        "target_slug": "quantitative-bias-analysis-toolkit-rwe",
        "notes": "A data-driven member of the quantitative-bias-analysis family that learns net residual systematic error from observed negative controls rather than from elicited bias parameters."
      },
      {
        "relation_type": "requires",
        "target_slug": "negative-control-outcomes-rwe",
        "notes": "The empirical null is fit to a curated panel of negative-control outcomes run through the identical target pipeline."
      },
      {
        "relation_type": "used_with",
        "target_slug": "negative-control-exposures-rwe",
        "notes": "Negative-control exposures can contribute to the empirical null when they share the target's confounding structure."
      },
      {
        "relation_type": "alternative_to",
        "target_slug": "e-value-sensitivity-analysis",
        "notes": "The E-value posits a hypothetical unmeasured confounder; empirical calibration measures the net residual bias the pipeline actually exhibits on nulls."
      },
      {
        "relation_type": "see_also",
        "target_slug": "unmeasured-confounding-probabilistic-bias-analysis-rwe",
        "notes": "Probabilistic bias analysis models named biases via priors; calibration learns net bias from data. Use together for biases with and without available controls."
      }
    ],
    "aliases": [
      "empirical null calibration",
      "calibrated p-values",
      "calibrated confidence intervals",
      "empirical calibration",
      "systematic error calibration"
    ],
    "complexity": "advanced",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "endpoint-adjudication-chart-review-rwe",
    "name": "Endpoint Adjudication and Chart Review",
    "short_definition": "A blinded, multi-reviewer process that confirms whether candidate events flagged by a claims/EHR algorithm meet a pre-specified clinical case definition, yielding positive predictive value, sensitivity, and inter-rater reliability that are then used to bias-correct the observed event rate.",
    "long_description": "**Endpoint adjudication and chart review** is the operational step that converts an algorithm-identified\n*candidate* event (e.g., an inpatient claim with a primary-position ICD-10 acute MI code) into a *confirmed*\nendpoint by having trained clinical reviewers compare the underlying source documents — discharge summaries,\ncath/ECG/troponin results, imaging, progress notes, death certificates — against a written, pre-specified case\ndefinition (Universal Definition of MI, NINDS/Sentinel stroke criteria, KDIGO AKI staging, Bradford-Hill-style\ncausality for adverse events). The product is not a single number but a *measurement-error model*: positive\npredictive value (PPV) of the algorithm, sensitivity/specificity when algorithm-negative records are also\nreviewed, and inter-rater reliability (Cohen's or weighted kappa). Those parameters are then propagated forward\nto correct the crude event rate and to quantify how much residual outcome misclassification could move the\neffect estimate.\n\n**Core conceptual distinction**. Three things are being separated and must not be conflated. (1) *Algorithm\nidentification vs adjudicated truth*: the claims/EHR algorithm is a screening test; adjudication is the\nreference standard. The deliverable is the operating characteristics of the algorithm against that standard, not\na re-counted event total. (2) *Validation vs correction*: estimating PPV/sensitivity (validation) is distinct\nfrom using those estimates to recover an unbiased incidence or hazard (bias correction / quantitative bias\nanalysis). A study that reports \"PPV = 0.84\" but then analyzes the *uncorrected* algorithm-positive events as if\nthey were all true has done validation and ignored correction. (3) *Adjudication vs phenotyping*: a phenotype\nalgorithm assigns a label deterministically from structured data; adjudication is a human reference-standard\nprocess layered on top of (a sample of) those labels. 
The estimand the adjudication serves is whatever the\nparent study targets (cause-specific incidence, an as-treated hazard ratio, a decision-model transition\nprobability) — adjudication does not change the estimand, it de-biases the outcome variable that feeds it. The\ncardinal design requirement is that reviewers be **blinded to exposure/arm**, because differential\nmisclassification by exposure is the one error that adjudication is supposed to remove and unblinded\nadjudication can manufacture.\n\n**Pros, cons, and trade-offs**.\n- **vs accepting the raw algorithm (no adjudication):** Adjudication gives a defensible, quantified outcome\n  definition and is the only way to detect and bound *differential* outcome misclassification, which biases\n  relative effects in unpredictable directions. Cost: chart retrieval is expensive and slow, requires data-use\n  agreements that many claims vendors cannot grant, and shrinks the analyzable sample to the linkable/chart-\n  available subset. **Prefer adjudication** for any regulatory- or HTA-grade safety/effectiveness endpoint and\n  for any outcome whose algorithm PPV is unknown or known to be modest (<~0.80).\n- **vs full chart review of every event:** Sampling (all positives, or a stratified sample) buys feasibility at\n  the cost of sampling variance in PPV/sensitivity; Wilson or exact CIs and an explicit sampling frame are\n  mandatory. **Prefer a stratified probability sample** when event counts are large; reserve census review for\n  rare/serious endpoints.\n- **vs single-reviewer abstraction:** A two-independent-reviewer-plus-tiebreaker committee with a written\n  charter produces a reportable kappa and removes idiosyncratic single-reader bias. Cost: roughly double the\n  abstraction effort. 
**Prefer the committee** whenever the case definition involves judgment (MI subtype,\n  stroke vs TIA, drug causality); single-reader review is acceptable only for near-mechanical confirmations.\n- **vs probabilistic/QBA-only correction with no chart review:** If charts are simply unavailable, external\n  validation parameters from the literature can feed a quantitative bias analysis, but borrowed PPV/sensitivity\n  from a *different* population/database is itself an assumption (transportability) and is weaker than in-study\n  adjudication. **Prefer in-study adjudication**; fall back to QBA with sensitivity ranges only when retrieval\n  is impossible.\n\n**When to use**. A primary or key secondary safety/effectiveness endpoint identified from claims or EHR whose\nalgorithm operating characteristics are unknown or modest; FDA/EMA submissions and HTA dossiers where outcome\nvalidity must be demonstrated; any comparison where differential outcome capture by exposure is plausible (e.g.,\nmore surveillance in one arm); composite endpoints where component-level confirmation changes the count.\n\n**When NOT to use — and when it is actively misleading or dangerous**\n- **Unblinded adjudication.** If reviewers can see the exposure/arm, adjudication can *create* the differential\n  misclassification it was meant to remove — the single most dangerous failure. 
Blind the packet; log and audit\n  blinding breaks.\n- **Adjudicating only algorithm-positives and calling the algorithm \"validated.\"** Reviewing only positives\n  gives PPV but says nothing about sensitivity; you cannot bias-correct incidence or rule out missed events.\n  Claiming a validated algorithm from positives alone is misleading — you must sample algorithm-negatives (the\n  expensive step everyone skips) to estimate sensitivity.\n- **Treating adjudicated cases as error-free ground truth.** Propagating a point PPV without its CI, and\n  without the kappa-quantified reviewer disagreement, understates uncertainty in the corrected rate.\n- **Transporting one site's/database's PPV as a correction factor in a different population.** PPV depends on\n  prevalence and coding practice; a PPV from a tertiary cardiology cohort will overstate confirmation in a\n  general claims population.\n- **Cherry-picking borderline cases or letting reviewers re-adjudicate after seeing the analysis.** Both break\n  the pre-specification that makes the metric interpretable.\n\n**Data-source operational depth**.\n- **Administrative claims (FFS):** Candidate events come from diagnosis/procedure codes with position and\n  setting (primary-position inpatient ICD-10 I21.x for acute MI). Failure modes: claims contain no clinical\n  detail, so adjudication *requires* linkage to charts/EHR, which depends on a data-use agreement the vendor\n  often does not hold — retrieval is the binding constraint, not analysis. Outpatient/ED-coded events frequently\n  lack a discharge summary to confirm. Claims lag and adjustment mean the event packet must be assembled after\n  run-out. Deceased patients' charts are often unobtainable, biasing the adjudicable sample away from fatal\n  events. 
**Medicare Advantage (MA) person-time has no FFS claims**, so MA enrollees are invisible to the\n  algorithm and unavailable for adjudication — restrict person-time to FFS (Parts A/B) or treat MA spans as\n  unobserved, never as event-free.\n- **EHR:** Richer source documents, but **external-care leakage** means the adjudicator sees only the events\n  that happened inside the network; an MI treated at a competing hospital is both unflagged and unadjudicable,\n  deflating sensitivity differentially by how \"loyal\" patients are. Encounter-driven capture means structured\n  fields may be empty even when the note documents the event — abstract from notes, and define the observation\n  window explicitly.\n- **Registry:** Often supplies adjudicated outcomes already, but **registry adjudication criteria may differ\n  from your protocol's case definition** (e.g., a registry's \"stroke\" includes TIA); re-adjudicate or document\n  the criteria mismatch (a transportability problem), don't blindly inherit the registry label.\n- **Linked claims–EHR–vital records:** The strongest substrate — claims completeness + EHR clinical detail +\n  a death index to confirm fatal endpoints — but linkage selects the linkable subset and introduces date\n  discrepancies (claim service date vs note date vs death date) that must be reconciled before the event date\n  is fixed.\n\n**Worked claims example.** Endpoint: hospitalized acute MI in a commercial + Medicare FFS database. (1)\n*Candidate identification*: an inpatient claim with primary-position ICD-10 `I21.x` and length of stay >=1 day\nflags a candidate; require continuous medical enrollment in the 365 days before the index and exclude MA-only\nperson-time so the algorithm can actually see hospitalizations. (2) *De-duplication*: collapse transfers and\nsame-episode readmissions within a 30-day acute window to one event (see acute-event de-duplication). 
(3)\n*Sampling*: 5,000 algorithm-positive candidate MIs; draw a stratified random sample of 500 (stratified by age,\nsex, and fatal/non-fatal) for chart pull, plus a 250-record sample of algorithm-*negative* hospitalizations to\nestimate sensitivity. (4) *Packet + charter*: assemble discharge summary, serial troponins, ECG, and any\ncatheterization report into a packet **stripped of any exposure/drug information**; two cardiologists\nindependently classify each against the Fourth Universal Definition of MI; a third adjudicator breaks ties. (5)\n*Metrics*: of 500 positives, 420 confirmed -> PPV = 420/500 = 0.84 (Wilson 95% CI 0.81–0.87); inter-rater\nCohen's kappa = 0.78 before tiebreak; of 250 negatives, 12 were true MIs missed by the algorithm (a sampled\nfalse-negative fraction of 12/250 = 4.8%). Scaling that fraction to the N_neg algorithm-negative hospitalizations\nin the sampling frame gives the estimated false negatives, and the estimated sensitivity ~ true_positives /\n(true_positives + estimated false negatives), with true_positives ~ 5,000 x 0.84 = 4,200. (6) *Bias correction*:\nthe crude count of 5,000 algorithm-positive events over the FFS person-time is multiplied by the\nsampling-weighted PPV and divided by the estimated sensitivity to recover the corrected incidence. (7) *QBA*:\nvary PPV across its CI and sensitivity across plausible bounds (and, for the comparative analysis, vary\nPPV/sensitivity *separately by arm*) to map how much residual outcome misclassification would change the hazard\nratio; if a credible differential-misclassification scenario flips the conclusion, the unadjudicated estimate is\nnot trustworthy.\n\n**Interpreting the output**\n\nFrom the worked claims example above: 500 algorithm-positive candidates reviewed; 420 confirmed as true\nMI. PPV = 420 / 500 = 0.84 (Wilson 95% CI 0.81–0.87). Inter-rater Cohen's kappa = 0.78 before tiebreak.\nOf 250 algorithm-negative records sampled, 12 were adjudicated true MIs (false negatives).\n\n*(1) Formal interpretation.* PPV = 0.84 means approximately 16% of algorithm-flagged events are false\npositives; the crude algorithm-positive event count overstates confirmed MIs by ≈ 19%. 
Kappa = 0.78\nindicates substantial pre-tiebreak inter-rater agreement (Landis-Koch band 0.61-0.80, just below the\nalmost-perfect threshold of 0.81). The Wilson CI 0.81–0.87 reflects sampling uncertainty in the PPV estimate itself;\nthe true adjudication PPV in the full cohort is unknown and assumed to match the validation sample's\nunder the transportability assumption (same site mix, chart availability, and reviewer training). Under\nnon-differential misclassification (PPV equal across arms), unconfirmed events dilute the comparative\nHR toward 1.0; under differential misclassification (PPV differs by arm), bias direction is\nunpredictable and arm-specific adjudication subsamples are required to identify it.\n\n*(2) Practical interpretation.* A study reporting 5,000 algorithm-positive MIs over cohort follow-up\nshould correct the event count to ≈ 4,200 confirmed MIs (5,000 × 0.84) for rate and cost calculations,\nbefore adding back the MIs the algorithm missed (the sensitivity correction). For the comparative HR, the key\nquestion is whether PPV = 0.84 holds equally in both treatment arms; if PPV differs by arm (say, 0.88 in the\nmore intensively monitored arm vs 0.80 in the other), the lower-PPV arm carries proportionally more\nfalse-positive events and the hazard ratio is biased in a direction that depends on which arm that is. This is\nan operationally realistic concern in open-label observational settings.",
    "primary_category": "Outcome_Measure",
    "tags": [
      "outcome-validation",
      "endpoint-adjudication",
      "chart-review",
      "clinical-events-committee",
      "positive-predictive-value",
      "inter-rater-reliability",
      "outcome-misclassification",
      "blinded-adjudication"
    ],
    "applies_to_study_types": [
      "claims_analysis",
      "ehr_study",
      "registry_linkage",
      "comparative_effectiveness",
      "target_trial_emulation"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1186/cvm-2-4-187",
        "url": "https://doi.org/10.1186/cvm-2-4-187",
        "citation_text": "Mahaffey KW, Harrington RA, Akkerhuis M, et al. Disagreements between central clinical events committee and site investigator assessments of myocardial infarction endpoints in an international clinical trial: review of the PURSUIT study. Current Controlled Trials in Cardiovascular Medicine. 2001;2(4):187-194.",
        "year": 2001,
        "authors_short": "Mahaffey et al.",
        "notes": "Foundational empirical demonstration that endpoint adjudication by a blinded central committee materially reclassifies events relative to site/algorithm assignment, motivating adjudication as a method."
      },
      {
        "role": "introduce",
        "doi": "10.1002/pds.2314",
        "url": "https://doi.org/10.1002/pds.2314",
        "citation_text": "Cutrona SL, Toh S, Iyer A, et al. Design for validation of acute myocardial infarction cases in Mini-Sentinel. Pharmacoepidemiology and Drug Safety. 2012;21(Suppl 1):274-281.",
        "year": 2012,
        "authors_short": "Cutrona et al.",
        "notes": "Canonical claims-based outcome validation design (Mini-Sentinel/Sentinel) specifying chart sampling, blinded adjudication, and PPV estimation for an algorithm-identified endpoint."
      },
      {
        "role": "explain",
        "doi": "10.1136/bmj.h5527",
        "url": "https://doi.org/10.1136/bmj.h5527",
        "citation_text": "Bossuyt PM, Reitsma JB, Bruns DE, et al. STARD 2015: an updated list of essential items for reporting diagnostic accuracy studies. BMJ. 2015;351:h5527.",
        "year": 2015,
        "authors_short": "Bossuyt et al.",
        "notes": "Reporting standard for the reference-standard comparison that underlies PPV, sensitivity, and specificity in an outcome-validation substudy."
      },
      {
        "role": "explain",
        "doi": "10.11613/BM.2012.031",
        "url": "https://doi.org/10.11613/BM.2012.031",
        "citation_text": "McHugh ML. Interrater reliability: the kappa statistic. Biochemia Medica. 2012;22(3):276-282.",
        "year": 2012,
        "authors_short": "McHugh",
        "notes": "Practical reference for Cohen's/weighted kappa and its interpretation, the standard inter-rater reliability metric for multi-reviewer adjudication."
      },
      {
        "role": "demonstrate",
        "doi": "10.1371/journal.pmed.1001885",
        "url": "https://doi.org/10.1371/journal.pmed.1001885",
        "citation_text": "Benchimol EI, Smeeth L, Guttmann A, et al. The REporting of studies Conducted using Observational Routinely-collected health Data (RECORD) statement. PLoS Medicine. 2015;12(10):e1001885.",
        "year": 2015,
        "authors_short": "Benchimol et al.",
        "notes": "Reporting guidance requiring transparent code lists and outcome-validation methods for routinely collected data, the framework in which adjudication results are documented."
      }
    ],
    "plain_language_summary": "Endpoint adjudication is the process of having trained clinicians read a patient's actual medical records and decide whether an event the computer algorithm flagged — for example, a hospitalization coded as a heart attack — truly meets the study's definition of that event. The clinicians review source documents (discharge summaries, lab results, ECGs) against a pre-written case definition, and they do so without knowing which drug the patient received, so their judgment cannot be swayed by that knowledge. The result is a confirmed event count that is more accurate than the raw algorithm output, along with a measure of how often the two reviewers agreed — which tells you how reliable the confirmation process itself is.",
    "key_terms": [
      {
        "term": "adjudication committee",
        "definition": "A small panel of clinicians who independently read patient records and vote on whether a candidate event meets the study's case definition; disagreements are resolved by a third reviewer."
      },
      {
        "term": "positive predictive value (PPV)",
        "definition": "The fraction of events flagged by the algorithm that turn out to be real events after chart review — for example, PPV 0.84 means 84 out of every 100 flagged events are confirmed as true."
      },
      {
        "term": "Cohen's kappa",
        "definition": "A number between 0 and 1 measuring how much two reviewers agree beyond what pure chance would produce; values above 0.60 are generally considered substantial agreement."
      },
      {
        "term": "blinded review",
        "definition": "The practice of hiding each patient's drug or treatment assignment from the reviewers so their decisions cannot be influenced by knowing which group the patient was in."
      },
      {
        "term": "case definition",
        "definition": "A written, pre-specified set of clinical criteria a candidate event must meet to be counted as a confirmed outcome — for example, the Fourth Universal Definition of Myocardial Infarction."
      }
    ],
    "worked_example": {
      "scenario": "A claims-based study of heart attack risk has a computer algorithm that scans hospital claims and flags any stay with a primary-position ICD-10 code of I21.x (acute MI) as a candidate heart attack. The study team pulls the medical records for 10 candidate events and sends each record, stripped of any drug information, to two cardiologists who independently classify each as confirmed or not confirmed. We want to measure how well the two reviewers agree and summarize that agreement with percent agreement and Cohen's kappa.",
      "dataset": {
        "caption": "Ten candidate MI events reviewed independently by Reviewer 1 and Reviewer 2. Y = confirmed; N = not confirmed.",
        "columns": [
          "event_id",
          "reviewer_1",
          "reviewer_2",
          "final_call"
        ],
        "rows": [
          [
            "E01",
            "Y",
            "Y",
            "confirmed"
          ],
          [
            "E02",
            "Y",
            "Y",
            "confirmed"
          ],
          [
            "E03",
            "Y",
            "Y",
            "confirmed"
          ],
          [
            "E04",
            "Y",
            "Y",
            "confirmed"
          ],
          [
            "E05",
            "Y",
            "Y",
            "confirmed"
          ],
          [
            "E06",
            "Y",
            "Y",
            "confirmed"
          ],
          [
            "E07",
            "Y",
            "Y",
            "confirmed"
          ],
          [
            "E08",
            "Y",
            "N",
            "tiebreak -> confirmed"
          ],
          [
            "E09",
            "N",
            "N",
            "not confirmed"
          ],
          [
            "E10",
            "N",
            "N",
            "not confirmed"
          ]
        ]
      },
      "agreement_table": {
        "caption": "2x2 table of reviewer calls before tiebreak (n = 10 candidate events)",
        "header_row": [
          "",
          "R2 confirms (Y)",
          "R2 does not confirm (N)",
          "Row total"
        ],
        "rows": [
          [
            "R1 confirms (Y)",
            "7",
            "1",
            "8"
          ],
          [
            "R1 does not confirm (N)",
            "0",
            "2",
            "2"
          ],
          [
            "Column total",
            "7",
            "3",
            "10"
          ]
        ]
      },
      "steps": [
        "Count agreements: both reviewers said Y for events E01-E07 (7 cells); both said N for E09 and E10 (2 cells). Total agreements = 7 + 2 = 9 out of 10 events.",
        "Percent agreement = 9 / 10 = 0.90 (90%).",
        "To compute Cohen's kappa, first find each reviewer's marginal rates. Reviewer 1 said Y 8 times (E01-E07 plus E08) out of 10, so P(R1 = Y) = 8/10 = 0.80. Reviewer 2 said Y 7 times (E01-E07) out of 10, so P(R2 = Y) = 7/10 = 0.70.",
        "Expected agreement by chance alone: P(both Y by chance) = 0.80 x 0.70 = 0.56. P(both N by chance) = 0.20 x 0.30 = 0.06. Expected agreement P_e = 0.56 + 0.06 = 0.62.",
        "Kappa = (observed agreement - expected agreement) / (1 - expected agreement) = (0.90 - 0.62) / (1 - 0.62) = 0.28 / 0.38 = 0.74 (rounded to two decimal places).",
        "Event E08 was a disagreement (R1 confirmed, R2 did not); a third cardiologist reviewed it and confirmed it, so the final call is confirmed.",
        "Final confirmed event count = 8 out of 10 candidates. PPV for this small sample = 8 / 10 = 0.80. The full study (500 reviewed events, 420 confirmed) yielded PPV = 420/500 = 0.84 and kappa = 0.78 — consistent with this illustration."
      ],
      "result": "Percent agreement = 9/10 = 0.90. Cohen's kappa = (0.90 - 0.62) / (1 - 0.62) = 0.28 / 0.38 = 0.74. Both figures indicate substantial reviewer agreement. Eight of 10 candidate events were confirmed after tiebreak, giving a PPV of 0.80 for this small sample."
    },
    "prerequisites": [
      "claims-outcome-algorithm-ppv-sensitivity-rwe",
      "outcome-algorithm-construction-rwe",
      "diagnosis-phenotype-algorithm-1ip-2op-time-window-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Census adjudication of all algorithm-positive events",
        "description": "Every candidate event flagged by the algorithm is sent to blinded multi-reviewer adjudication; yields PPV with no sampling variance but is feasible only for rare/serious endpoints.",
        "edge_cases": [
          "Chart retrieval failures are non-random (deceased patients, out-of-network care), so even a census of positives has an unadjudicable subset that must be reported and characterized.",
          "Provides no sensitivity estimate because algorithm-negatives are never reviewed."
        ],
        "data_source_notes": "claims: requires a chart-linkage data-use agreement; outpatient-coded events may have no retrievable source document."
      },
      {
        "name": "Stratified sample with algorithm-negative review (PPV and sensitivity)",
        "description": "Probability sample of algorithm-positives (often stratified by age/sex/fatality/site) plus a sample of algorithm-negatives, enabling both PPV and sensitivity with proper survey-weighted CIs.",
        "edge_cases": [
          "Reviewing algorithm-negatives is costly and frequently omitted, leaving sensitivity unknown and incidence uncorrectable for missed events.",
          "Strata with few events yield wide Wilson/exact intervals; pre-specify minimum cell sizes."
        ],
        "data_source_notes": "claims/EHR: define the sampling frame on observable FFS/in-network person-time only; weight PPV/sensitivity back to the source population."
      },
      {
        "name": "Two-reviewer committee with tiebreaker and written charter",
        "description": "Two reviewers independently classify each packet blinded to exposure; disagreements resolved by a third adjudicator; reports Cohen's/weighted kappa and pre-tiebreak agreement.",
        "edge_cases": [
          "Judgment-heavy definitions (MI subtype, stroke vs TIA, drug causality) drive most disagreement; charter must pre-specify the operational thresholds.",
          "Reviewer drift over a long study requires periodic re-calibration against gold cases."
        ],
        "data_source_notes": "all sources: blind the packet to drug/exposure fields; log and audit any blinding breaks."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Accepting the raw claims/EHR algorithm without adjudication",
        "pros_of_this": "Quantifies and bounds outcome misclassification (especially differential misclassification by exposure) and produces a regulatory/HTA-defensible outcome definition.",
        "cons_of_this": "Expensive and slow; depends on chart-linkage agreements and shrinks the analyzable sample to the chart-available subset.",
        "when_to_prefer": "Primary/key safety or effectiveness endpoints, modest or unknown algorithm PPV, or any comparison where surveillance/capture could differ by arm."
      },
      {
        "compared_to": "Full census chart review of every event",
        "pros_of_this": "Feasible at scale; a stratified probability sample with survey weights recovers population PPV and sensitivity at a fraction of the cost.",
        "cons_of_this": "Introduces sampling variance requiring Wilson/exact CIs and an explicit, pre-specified sampling frame.",
        "when_to_prefer": "Large event counts; reserve census review for rare or fatal endpoints."
      },
      {
        "compared_to": "Single-reviewer chart abstraction",
        "pros_of_this": "Two-reviewer-plus-tiebreaker committee yields a reportable kappa and removes idiosyncratic single-reader bias on judgment-dependent definitions.",
        "cons_of_this": "Roughly doubles abstraction effort and adds tiebreak overhead.",
        "when_to_prefer": "Whenever the case definition requires clinical judgment (MI subtype, stroke vs TIA, causality)."
      },
      {
        "compared_to": "Quantitative bias analysis using external (literature) validation parameters only",
        "pros_of_this": "In-study adjudication measures PPV/sensitivity in the actual study population, avoiding the transportability assumption of borrowed parameters.",
        "cons_of_this": "Requires charts; impossible when retrieval is barred.",
        "when_to_prefer": "Whenever charts are obtainable; fall back to external-parameter QBA only when they are not."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Algorithm flags candidates from coded events (position + setting); adjudication requires chart linkage under a data-use agreement. Restrict to FFS-observable person-time, exclude MA-only spans, and expect non-random retrieval failure for deceased and out-of-network patients.",
      "ehr": "Source documents are richer but external-care leakage hides events treated outside the network, deflating sensitivity differentially; abstract from notes when structured fields are empty and define the observation window explicitly.",
      "registry": "May supply pre-adjudicated outcomes, but registry case definitions can differ from the study protocol; re-adjudicate or explicitly document the criteria mismatch (transportability).",
      "linked": "Linked claims-EHR-vital-records is the strongest substrate (coding completeness + clinical detail + death confirmation) but introduces linkage selection and order/service/death date discrepancies that must be reconciled before fixing the event date."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\nimport pandas as pd\nfrom statsmodels.stats.proportion import proportion_confint\nfrom sklearn.metrics import cohen_kappa_score\n\ndef adjudication_metrics(adj: pd.DataFrame, n_algo_positive_total: int) -> dict:\n    pos = adj[adj[\"algo_positive\"]]\n\n    # PPV among algorithm-positives, with a Wilson 95% interval (handles small/extreme counts).\n    n_pos = len(pos)\n    n_conf = int(pos[\"confirmed\"].sum())\n    ppv = n_conf / n_pos\n    ppv_lo, ppv_hi = proportion_confint(n_conf, n_pos, alpha=0.05, method=\"wilson\")\n\n    # Inter-rater reliability on the two independent pre-tiebreak calls.\n    kappa = cohen_kappa_score(pos[\"reviewer1_call\"].astype(int),\n                              pos[\"reviewer2_call\"].astype(int))\n\n    # Survey-weighted PPV when sampling fractions differ by stratum.\n    w = pos[\"samp_weight\"]\n    ppv_weighted = float((pos[\"confirmed\"] * w).sum() / w.sum())\n\n    # PPV-only bias correction: expected number of TRUE events among all algorithm-positives.\n    # (Corrects for false positives; does NOT recover events the algorithm missed -- that needs sensitivity\n    #  from a reviewed sample of algorithm-negatives.)\n    corrected_true_events = n_algo_positive_total * ppv_weighted\n\n    return {\n        \"n_reviewed_positive\": n_pos,\n        \"ppv\": round(ppv, 4),\n        \"ppv_wilson_95ci\": (round(ppv_lo, 4), round(ppv_hi, 4)),\n        \"ppv_weighted\": round(ppv_weighted, 4),\n        \"cohen_kappa\": round(kappa, 4),\n        \"corrected_true_event_count\": round(corrected_true_events, 1),\n    }",
        "description": "Outcome-validation metrics from a completed blinded adjudication. Required inputs (one row per sampled\ncandidate event, already chart-reviewed and de-duplicated to the episode level):\n  adj : person_id, event_id, algo_positive (bool), confirmed (bool),\n        reviewer1_call (bool), reviewer2_call (bool), stratum (str), samp_weight (float)\n'confirmed' is the post-tiebreak final classification; reviewer1/2_call are the independent pre-tiebreak\ncalls used for kappa. samp_weight is the inverse sampling fraction within stratum (1.0 for a census).\nReturns PPV with a Wilson 95% CI, Cohen's kappa, and a PPV-only bias-corrected event count.",
        "dependencies": [
          "pandas",
          "numpy",
          "scipy",
          "statsmodels"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(dplyr)\n\nadjudication_metrics <- function(adj, n_algo_positive_total) {\n  pos <- adj %>% filter(algo_positive)\n\n  n_pos  <- nrow(pos)\n  n_conf <- sum(pos$confirmed)\n  ppv    <- n_conf / n_pos\n\n  # Exact (Clopper-Pearson) PPV interval -- preferred when confirmed counts are small or near 0/1.\n  ci <- binom.test(n_conf, n_pos)$conf.int\n\n  # Cohen's kappa on the two independent pre-tiebreak reviewer calls.\n  kappa <- psych::cohen.kappa(cbind(as.integer(pos$reviewer1_call),\n                                    as.integer(pos$reviewer2_call)))$kappa\n\n  # Survey-weighted PPV (inverse sampling fraction by stratum).\n  ppv_weighted <- sum(pos$confirmed * pos$samp_weight) / sum(pos$samp_weight)\n\n  # PPV-only bias correction: expected true events among all algorithm-positives.\n  corrected_true_events <- n_algo_positive_total * ppv_weighted\n\n  list(\n    n_reviewed_positive   = n_pos,\n    ppv                   = round(ppv, 4),\n    ppv_exact_95ci        = round(ci, 4),\n    ppv_weighted          = round(ppv_weighted, 4),\n    cohen_kappa           = round(kappa, 4),\n    corrected_true_events = round(corrected_true_events, 1)\n  )\n}",
        "description": "Outcome-validation metrics in R. Inputs mirror the Python version:\n  adj : data.frame with person_id, event_id, algo_positive (logical), confirmed (logical),\n        reviewer1_call (logical), reviewer2_call (logical), stratum (chr), samp_weight (numeric)\nReturns PPV with an exact (Clopper-Pearson) 95% CI, Cohen's kappa, weighted PPV, and a PPV-corrected count.",
        "dependencies": [
          "dplyr",
          "psych"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* PPV among algorithm-positives, with exact (Clopper-Pearson) and Wilson 95% intervals. */\nproc freq data=work.adj (where=(algo_positive=1));\n  tables confirmed / binomial(level='1') alpha=0.05 cl=(wilson exact);\n  output out=ppv_out binomial;\nrun;\n\n/* Inter-rater reliability: Cohen's kappa on the two independent pre-tiebreak reviewer calls. */\nproc freq data=work.adj (where=(algo_positive=1));\n  tables reviewer1_call*reviewer2_call / agree;\n  output out=kappa_out kappa;\nrun;\n\n/* Survey-weighted PPV when sampling fractions differ by stratum (inverse-sampling weights). */\nproc sql;\n  create table ppv_weighted as\n  select sum(confirmed * samp_weight) / sum(samp_weight) as ppv_weighted\n  from work.adj\n  where algo_positive = 1;\nquit;\n\n/* PPV-only bias correction: expected true events among ALL algorithm-positives (&n_algo_pos). */\n/* Does not recover algorithm-missed events -- that requires sensitivity from reviewed negatives. */\ndata corrected;\n  if _n_ = 1 then set ppv_weighted;\n  corrected_true_events = &n_algo_pos * ppv_weighted;\nrun;",
        "description": "Outcome-validation metrics in SAS. Required input dataset (one row per sampled, chart-reviewed candidate):\n  work.adj : person_id event_id algo_positive (0/1) confirmed (0/1)\n             reviewer1_call (0/1) reviewer2_call (0/1) stratum samp_weight\nPROC FREQ / BINOMIAL gives PPV with exact and Wilson intervals; PROC FREQ / AGREE gives Cohen's kappa from\nthe two independent pre-tiebreak reviewer calls. The final PPV-corrected count is computed in a data step.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Algo[Claims/EHR algorithm flags candidate events<br/>e.g., primary-position inpatient ICD-10 I21.x] --> Dedup[De-duplicate transfers/readmissions<br/>to the episode level]\n  Dedup --> Sample[Sampling frame on observable FFS/in-network time<br/>stratified sample of positives + sample of negatives]\n  Sample --> Packet[Assemble source-document packet<br/>BLINDED to exposure/arm]\n  Packet --> R1[Reviewer 1<br/>independent classification]\n  Packet --> R2[Reviewer 2<br/>independent classification]\n  R1 --> Agree{Agree?}\n  R2 --> Agree\n  Agree -- Yes --> Final[Final classification<br/>confirmed / not confirmed]\n  Agree -- No --> Tie[Third adjudicator<br/>tiebreak]\n  Tie --> Final\n  Final --> Metrics[Metrics: PPV + Wilson CI,<br/>sensitivity from reviewed negatives, Cohen's kappa]\n  Metrics --> Correct[Bias-correct incidence + QBA<br/>vary PPV/sensitivity, separately by arm]",
        "caption": "Blinded multi-reviewer endpoint adjudication workflow, from algorithm-flagged candidates through de-duplication, sampling, blinded double review with tiebreaker, to operating characteristics and bias correction.",
        "alt_text": "Flowchart showing candidate events from a claims/EHR algorithm being de-duplicated, sampled, assembled into blinded packets, independently classified by two reviewers with a tiebreaker, and turned into PPV, sensitivity, kappa, and a bias-corrected rate.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  subgraph Truth[Adjudicated reference standard]\n    TP[True events]\n    FN[Missed events]\n  end\n  subgraph Algo[Algorithm output]\n    AP[Algorithm-positive]\n    AN[Algorithm-negative]\n  end\n  AP --> PPVbox[Review positives -> PPV<br/>true positives / all flagged]\n  AN --> SENSbox[Review a SAMPLE of negatives -> sensitivity<br/>detects FN the algorithm missed]\n  PPVbox --> Corr[Corrected incidence =<br/>f PPV, sensitivity, sampling weights]\n  SENSbox --> Corr\n  Corr --> Decision{Differential by arm?}\n  Decision -- Yes --> Danger[Differential misclassification:<br/>relative effect biased -> QBA required]\n  Decision -- No --> NonDiff[Non-differential:<br/>bias toward the null, still correct the rate]",
        "caption": "Why reviewing only positives is insufficient. PPV (from positives) corrects false positives but not missed events; sensitivity (from a sampled review of algorithm-negatives) is required to recover missed events and to detect exposure-differential misclassification.",
        "alt_text": "Logic diagram contrasting algorithm-positive review (yielding PPV) with sampled algorithm-negative review (yielding sensitivity), combining into a corrected incidence and a check for differential misclassification by arm.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "is_variant_of",
        "target_slug": "outcome-algorithm-construction-rwe",
        "notes": "Adjudication is the human reference-standard layer that validates and corrects an outcome algorithm built within this parent family."
      },
      {
        "relation_type": "used_with",
        "target_slug": "claims-outcome-algorithm-ppv-sensitivity-rwe",
        "notes": "Adjudication produces the PPV/sensitivity that this concept estimates and uses for bias correction."
      },
      {
        "relation_type": "used_with",
        "target_slug": "quantitative-bias-analysis-toolkit-rwe",
        "notes": "Adjudicated PPV/sensitivity (and their uncertainty) feed quantitative bias analysis to bound residual outcome misclassification."
      },
      {
        "relation_type": "used_with",
        "target_slug": "misclassification-bias-correction-rwe",
        "notes": "The operating characteristics from adjudication are the inputs to outcome-misclassification correction."
      },
      {
        "relation_type": "see_also",
        "target_slug": "algorithm-validation",
        "notes": "Chart-review adjudication is the reference-standard step of an outcome/phenotype validation study."
      },
      {
        "relation_type": "see_also",
        "target_slug": "diagnostic-accuracy",
        "notes": "PPV, sensitivity, specificity from adjudication are diagnostic-accuracy metrics of the algorithm against the adjudicated standard."
      },
      {
        "relation_type": "see_also",
        "target_slug": "composite-endpoint-construction-rwe",
        "notes": "Composite endpoints typically require component-level adjudication before the components are combined."
      },
      {
        "relation_type": "see_also",
        "target_slug": "ehr-phenotyping-algorithms-rwe",
        "notes": "Phenotype algorithms assign labels deterministically; adjudication is the human validation layered on a sample of those labels."
      }
    ],
    "aliases": [
      "endpoint adjudication",
      "clinical events committee adjudication",
      "chart review outcome validation",
      "adjudication committee",
      "blinded endpoint adjudication"
    ],
    "complexity": "advanced",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "equivalence-noninferiority-testing",
    "name": "Equivalence and Non-Inferiority Testing",
    "short_definition": "A family of hypothesis tests that determine whether a new or alternative treatment is \"close enough\" to an established comparator — specifically, not worse by more than a pre-specified, clinically meaningful margin — using confidence intervals and the Two One-Sided Tests (TOST) procedure rather than a standard superiority p-value.",
    "long_description": "**The fundamental error: \"p > 0.05\" does not establish equivalence**\n\nThe most consequential misconception in clinical and real-world research is interpreting\na non-significant superiority test as evidence of equivalence or non-inferiority. When\na study reports \"no significant difference\" (p > 0.05), all that has been demonstrated\nis that the data are statistically compatible with the null hypothesis of zero difference.\nBut those same data may also be compatible — given the trial's precision — with differences\nof 3, 5, or 10 percentage points that would be highly clinically meaningful. This is the\n*absence of evidence* fallacy: failing to find evidence of a difference is not evidence\nof the absence of a meaningful difference.\n\nAn underpowered, poorly executed, or too-brief study can easily return p > 0.05 even when\nthe test drug is substantially inferior. Declaring equivalence from that result licenses a\npotentially harmful switch. Equivalence and non-inferiority (NI) testing require a\nfundamentally different inferential architecture: the analyst must pre-specify a *margin* —\nthe largest clinically acceptable difference — before data collection, and then demonstrate\nformally, using a confidence interval, that the true difference almost certainly falls within\nthat margin. The evidentiary burden is placed on ruling out unacceptable inferiority, not\nmerely on failing to detect superiority.\n\n**TOST mechanics: two one-sided tests**\n\nThe Two One-Sided Tests (TOST) procedure formalizes equivalence and NI testing. 
For\nequivalence testing (symmetric, two-sided), let δ be the true difference in outcome risk,\ntest minus comparator, so that δ > 0 means the test drug is worse:\n\n- H₀₁: δ ≤ −Δ (test drug is better than comparator by Δ or more)\n- H₀₂: δ ≥ +Δ (test drug is worse than comparator by Δ or more)\n- Equivalence is declared when *both* null hypotheses are rejected at level α.\n\nRejecting both H₀₁ and H₀₂ is mathematically equivalent to showing that the two-sided\n(1 − 2α) confidence interval for δ lies entirely within the interval (−Δ, +Δ). With\nα = 0.05 for each one-sided test, this means the *90% CI* must be contained within the\nequivalence margins. At α = 0.025 per one-sided test (the common regulatory standard),\nthe two-sided 95% CI must fall within the margins.\n\nFor non-inferiority testing (one-sided):\n\n- H₀: δ ≥ Δ (test drug is worse by the NI margin or more)\n- Hₐ: δ < Δ (test drug is non-inferior)\n- NI is declared when the upper bound of the one-sided (1 − α) CI, which is also the\n  upper bound of the two-sided (1 − 2α) CI, falls strictly below the NI margin Δ.\n\nThe CI interpretation is unambiguous: if the upper bound of the two-sided 95% CI for the risk\ndifference (test minus comparator, positive = test is worse) is below Δ, NI is demonstrated\nat one-sided α = 0.025. If the entire 95% CI lies below zero (the test drug has statistically\nlower risk than the comparator), superiority is also demonstrated. If the entire CI falls\nwithin (−Δ, +Δ), formal equivalence is established.\n\n**NI margin selection: the M1/M2 framework**\n\nChoosing the NI margin is the most consequential and contested step. Two frameworks converge:\n\n1. *Clinical relevance.* The margin should represent the largest clinically acceptable\n   difference: the amount by which the test treatment could be worse while remaining\n   acceptable to patients and prescribers given its other attributes (cost, tolerability,\n   route). This is a judgment requiring clinical expert input.\n\n2. 
*Preserving a fraction of active-comparator effect (M1/M2 logic).* Codified in ICH E10\n   and FDA NI guidance, this approach proceeds in two steps. M1 is the effect of the\n   active comparator over placebo, estimated from a meta-analysis of historical\n   placebo-controlled trials and usually taken conservatively (for example, from the\n   confidence limit closest to no effect). M2 is the largest loss of that effect that is\n   clinically acceptable. It is set by requiring the test drug to preserve a stated\n   fraction of M1, typically 50% or more for serious conditions, so that\n   M2 = (1 − fraction preserved) × M1. The NI margin is M2, and M2 can never exceed M1.\n\nIn plain terms: if the reference drug reduces stroke risk by 10 percentage points compared\nto placebo (M1 = 10 pp), and clinicians require the test drug to preserve at least half of\nthat benefit, the allowable loss is (1 − 0.50) × 10 pp = 5 pp = M2, so the NI margin is\n5 pp. The test drug may be at most 5 pp worse on absolute stroke risk. Had clinicians\nrequired 60% preservation instead, the margin would shrink to 4 pp.\n\n**ITT and per-protocol analysis roles are reversed in NI trials**\n\nIn superiority trials, intent-to-treat (ITT) analysis is the primary analysis. Protocol\ndeviations (patients who switched drugs, stopped treatment, or were misclassified) dilute\nthe observed treatment effect toward zero. Dilution toward zero is conservative for a\nsuperiority claim (makes rejection of the null harder), so ITT protects against\nfalse-positive superiority findings.\n\nIn NI trials, the logic reverses entirely. Non-compliance, dropouts, contamination, and\nprotocol deviations also dilute effects toward zero, and in an NI trial a difference near\nzero is what demonstrates NI. A sloppily executed trial can therefore produce spurious NI\nresults: the test drug appears \"no different\" from the comparator precisely because the\ntrial was too poorly conducted to detect anything. ITT is no longer the conservative\nanalysis in NI settings; it can be actively anti-conservative.\n\nPer-protocol analysis, which restricts to patients with full adherence and no major protocol\nviolations, becomes *co-primary* in NI trials. 
Both ITT and per-protocol analyses must\nindependently show NI for the conclusion to be credible. If ITT shows NI but per-protocol\ndoes not, the likely explanation is that sloppy execution, not the drug's actual\nperformance, is driving the ITT result.\n\n**Assay sensitivity**\n\nAssay sensitivity is the implicit assumption that the trial could have detected a meaningful\ndifference if one existed. For the M1/M2 logic to be valid, the historical evidence for M1\nmust be credible, and the trial conditions must be sufficiently similar to those historical\ntrials. If assay sensitivity is doubtful (for example, the trial enrolled a low-risk\npopulation where even a highly effective drug shows little absolute benefit), the NI\ndemonstration is uninterpretable. A trial that cannot distinguish an effective drug from an\ninactive one cannot establish NI.\n\n**RWE applications**\n\n*Biosimilar and formulary-switch comparisons.* Claims data are used to compare a biosimilar's\nreal-world effectiveness against the reference biologic. The NI framework formalizes the\nquestion: is the biosimilar worse than the reference by more than the pre-specified margin\non outcomes such as hospitalization or disease flare? Without an explicit NI margin, an\nunderpowered comparison that \"shows no difference\" (p > 0.05) risks falsely licensing a\nharmful switch.\n\n*Cost-minimization prerequisites.* Cost-minimization analysis (CMA) licenses comparing only\ncosts, but only after equivalence on outcomes has been formally demonstrated. The NI or\nequivalence analysis supplies that evidentiary prerequisite. 
A non-significant superiority\nresult does not provide this license; it may reflect nothing more than insufficient power.\n\n*Target-trial NI emulations.* When emulating a target NI trial in real-world claims or\nlinked data, confounding toward the null is the dominant hazard, and it is *not* reassuring.\nResidual confounding by indication can make a genuinely inferior therapy appear non-inferior\nbecause the \"test\" therapy is given to systematically healthier patients. Active-comparator\nnew-user designs, propensity-score weighting, and negative-control analyses are all essential.\nUnlike in superiority emulations, where confounding toward the null is conservative, in NI\nemulations it produces a false-positive result.\n\n**Pros, cons, and trade-offs**\n\n*NI/equivalence testing vs. standard superiority testing:*\n- Pros: explicit, pre-specified acceptability threshold; CI-based conclusion with direct\n  clinical interpretation; satisfies regulatory requirements for drug approvals and biosimilar\n  designations; protects against licensing inferior drugs that happen to show p > 0.05 in\n  underpowered trials.\n- Cons: the margin must be pre-specified and clinically justified, because post-hoc margin\n  setting is outcome-driven analysis and invalid; sample sizes are typically larger than for\n  an equivalent-power superiority test; assay sensitivity is an untestable assumption that\n  must be argued from context.\n- When to prefer: whenever the goal is to demonstrate that a treatment is \"good enough\"\n  relative to an established comparator, not definitively superior.\n\n*Equivalence (symmetric, two-sided) vs. non-inferiority (one-sided):*\n- Equivalence requires the test drug to be neither much worse nor much better (the entire CI\n  within ±Δ); used for bioequivalence (generic vs. 
brand, where superior efficacy in one\n  direction is also undesirable) and for formally symmetric comparisons.\n- NI requires only that the test drug not be worse by more than Δ; it allows the test drug\n  to be arbitrarily better. Used for most biosimilar, formulary-switch, and alternative-\n  formulation comparisons.\n\n*ITT vs. per-protocol in NI settings:*\n- Pros of per-protocol as co-primary: gives a cleaner signal of the intrinsic drug effect,\n  uncontaminated by non-adherence; is the more stringent (harder to show NI) analysis.\n- Cons of relying on per-protocol alone: selection bias from excluding non-adherent patients;\n  the excluded patients may be informatively different from adherent patients.\n- Requirement: both must agree for an NI conclusion to be credible.\n\n**When to use**\n\n- When the scientific question is whether a new or alternative treatment is clinically\n  acceptable relative to an established one, not whether it is superior.\n- When a pre-specified NI margin can be justified from clinical reasoning and the M1/M2\n  framework using historical evidence.\n- For biosimilar vs. reference biologic comparisons, generic vs. 
brand comparisons, oral vs.\n  IV formulations of the same molecule, or lower-cost setting-of-care alternatives.\n- As the evidentiary prerequisite for a cost-minimization analysis.\n- In target-trial NI emulations for biosimilar uptake, formulary switches, or care-setting\n  changes — with rigorous confounding control and explicit recognition that confounding\n  toward the null is anti-conservative.\n\n**When NOT to use — and when it is actively misleading**\n\n- *After seeing a non-significant superiority p-value, without pre-specifying the margin.*\n  Retroactively defining a margin to make an existing non-significant result \"look like NI\"\n  is outcome-driven analysis and is statistically invalid.\n- *When assay sensitivity cannot be established.* If the trial or RWE study is too small,\n  the patient population too healthy, or follow-up too short to detect real differences, any\n  NI conclusion is meaningless.\n- *In confounded RWE analyses without rigorous design.* Confounding toward the null\n  masquerades as NI and produces false-positive NI findings; propensity-score methods and\n  active-comparator designs are mandatory.\n- *When the margin is set to whatever makes the study pass.* A margin chosen to ensure\n  success rather than to reflect clinical acceptability produces misleading evidence.\n- *To satisfy a superiority requirement.* An NI demonstration does not substitute for a\n  superiority endpoint when that is the regulatory or clinical requirement.\n- *When per-protocol and ITT analyses disagree.* Concordance is required; disagreement\n  signals that sloppy execution or differential dropout is driving the ITT NI result.\n\n**Interpreting the output**\n\nIn the worked example, 80 events occurred among 2,000 test-drug initiators (risk = 0.040)\nand 72 events occurred among 2,000 comparator initiators (risk = 0.036). The observed risk\ndifference is 0.004 (0.4 percentage points, test minus comparator). 
The 95% CI, computed\nby normal approximation, spans approximately −0.8 to 1.6 percentage points. The pre-specified\nNI margin is 2.0 percentage points.\n\n*(1) Formal interpretation.* The estimand is the marginal risk difference in 30-day stroke\nor TIA risk between test-drug and comparator initiators in the matched cohort. The NI null\nhypothesis is H₀: δ ≥ 2.0 percentage points (test drug inferior by at least the margin).\nThe observed risk difference of 0.4 pp has a 95% CI upper bound of 1.6 pp. Because 1.6 < 2.0,\nwe reject H₀ and declare non-inferiority at the α = 0.025 one-sided level (equivalently, using\nthe upper bound of the two-sided 95% CI). This CI has the repeated-sampling interpretation:\nin 95% of identically designed studies, the interval constructed this way would contain the\ntrue risk difference. All values in this interval are below the NI margin, so the data provide\nno support for clinically unacceptable inferiority. The CI crosses zero (lower bound −0.8 pp),\nso superiority of the test drug is not established.\n\n*(2) Practical interpretation.* The test drug is non-inferior to the comparator on 30-day\nstroke/TIA risk: the most pessimistic plausible result (1.6 pp excess risk) still falls within\nthe pre-agreed acceptable range (up to 2.0 pp). A decision-maker can read this as: even in\nthe worst statistical case supported by these data, the test drug causes at most 1.6 extra\nstrokes or TIAs per 100 patients — and the clinical team agreed before seeing any data that\n2.0 extra events per 100 would still be acceptable given other benefits (cost, tolerability,\nroute). The result does not mean the drugs are identical; it means the data cannot rule out\na difference as large as 1.6 pp. Whether 1.6 pp matters for a particular formulary decision\nis a clinical and policy question, not a statistical one.",
    "primary_category": "Inferential_Statistics",
    "tags": [
      "equivalence-testing",
      "non-inferiority",
      "TOST",
      "NI-margin",
      "M1-M2",
      "biosimilar",
      "confidence-interval",
      "per-protocol",
      "assay-sensitivity",
      "hypothesis-testing",
      "clinical-trial",
      "formulary-switch"
    ],
    "applies_to_study_types": [
      "rct",
      "cohort_retrospective",
      "active_comparator_new_user",
      "claims_analysis",
      "ehr_study",
      "registry_study"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "primary",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1007/s11606-010-1513-8",
        "url": "https://doi.org/10.1007/s11606-010-1513-8",
        "citation_text": "Walker E, Nowacki AS. Understanding equivalence and noninferiority testing. Journal of General Internal Medicine. 2011;26(2):192-196.",
        "year": 2011,
        "authors_short": "Walker & Nowacki",
        "notes": "Accessible, clinical-audience primer that clearly explains why p > 0.05 does not establish equivalence, how to select the NI margin, and how to read CI-based NI conclusions. The standard reference for introducing these concepts in HEOR and clinical settings."
      },
      {
        "role": "explain",
        "doi": "10.1007/BF01068419",
        "url": "https://doi.org/10.1007/BF01068419",
        "citation_text": "Schuirmann DJ. A comparison of the Two One-Sided Tests Procedure and the Power Approach for assessing the equivalence of average bioavailability. Journal of Pharmacokinetics and Biopharmaceutics. 1987;15(6):657-680.",
        "year": 1987,
        "authors_short": "Schuirmann",
        "notes": "The foundational paper introducing the TOST procedure for bioequivalence testing; establishes the mathematical equivalence between TOST and the confidence interval containment criterion that is now the regulatory standard for NI and equivalence testing worldwide."
      },
      {
        "role": "use",
        "doi": "10.1001/jama.2012.87802",
        "url": "https://doi.org/10.1001/jama.2012.87802",
        "citation_text": "Piaggio G, Elbourne DR, Pocock SJ, Evans SJ, Altman DG; CONSORT Group. Reporting of noninferiority and equivalence randomized trials: extension of the CONSORT 2010 statement. JAMA. 2012;308(24):2594-2604.",
        "year": 2012,
        "authors_short": "Piaggio et al.",
        "notes": "CONSORT extension covering the 13 additional items required for transparent reporting of NI and equivalence trials: pre-specification of the margin and its justification, reporting of both ITT and per-protocol results, and explicit discussion of assay sensitivity. The reporting standard for any NI or equivalence study."
      }
    ],
    "plain_language_summary": "Equivalence and non-inferiority testing answer a different question from the usual hypothesis test: instead of asking \"is treatment A better than B?\", they ask \"is treatment A no worse than B by more than a clinically meaningful amount?\" To do this, analysts pre-specify a margin — the largest difference that would still leave the new treatment acceptable — and then use a confidence interval to show that the true difference almost certainly falls within that margin. A critical warning: when a standard study simply \"fails to find a difference\" (p > 0.05), that is not the same as demonstrating equivalence, because an underpowered study can also fail to detect a large and harmful difference. These methods are required for drug approvals involving biosimilars, generic substitutions, and formulary switches where treatments are assumed to work similarly.",
    "key_terms": [
      {
        "term": "non-inferiority margin",
        "definition": "The largest amount by which the test treatment can be worse than the comparator and still be considered clinically acceptable; it is set in advance by the study team before any data are collected."
      },
      {
        "term": "TOST (two one-sided tests)",
        "definition": "A method that runs two separate one-sided hypothesis tests simultaneously — one ruling out being too much worse, one ruling out being too much better — and declares equivalence only when both pass."
      },
      {
        "term": "confidence interval upper bound",
        "definition": "The highest plausible value for the true difference between the two treatments, given the study data; in non-inferiority testing, non-inferiority is demonstrated when this upper limit stays below the pre-specified margin."
      },
      {
        "term": "per-protocol analysis",
        "definition": "An analysis restricted to patients who actually followed the study protocol correctly; in non-inferiority trials this is a co-primary analysis because patients who drifted from protocol push results toward \"no difference,\" which can falsely support non-inferiority."
      },
      {
        "term": "assay sensitivity",
        "definition": "The ability of a trial to detect a meaningful difference between treatments if one truly exists; non-inferiority logic breaks down if the trial was so poorly designed or powered that it could not have distinguished an effective drug from an ineffective one."
      },
      {
        "term": "M1/M2",
        "definition": "A regulatory framework for setting the non-inferiority margin: M1 is the whole effect of the reference drug over placebo, estimated from prior studies; M2 is the largest loss of that effect judged clinically acceptable, chosen as a fraction of M1 so that the new drug preserves the rest. The NI margin equals M2."
      }
    ],
    "worked_example": {
      "scenario": "A health-outcomes analyst emulates a non-inferiority target trial comparing a new oral anticoagulant (test drug) against an established one (comparator) on 30-day stroke or transient ischemic attack (TIA) among adults initiating therapy in commercial claims data. The pre-specified non-inferiority margin is 2.0 percentage points (absolute risk difference), set using M1/M2 logic: the reference drug reduces 30-day stroke risk by approximately 4 pp vs. placebo (M1), and clinicians require the test drug to preserve at least half of that benefit, giving M2 = 2.0 pp as the margin. Both arms have 2,000 matched, continuously-enrolled patients followed from index fill.",
      "dataset": {
        "caption": "Summary of the non-inferiority analysis. Each row gives one arm's 30-day event count, sample size, and observed stroke/TIA risk. Both arms were matched on age, sex, comorbidities, and prior anticoagulant use.",
        "columns": [
          "arm",
          "n_patients",
          "events_30d",
          "event_risk_30d"
        ],
        "rows": [
          [
            "test_drug",
            2000,
            80,
            0.04
          ],
          [
            "comparator",
            2000,
            72,
            0.036
          ]
        ]
      },
      "steps": [
        "Confirm the pre-specified NI margin: Δ = 2.0 percentage points = 0.020 in proportion units. This margin was fixed before data collection using the M1/M2 framework.",
        "Compute risk in each arm: risk in test arm = 80/2000 = 0.040 (4.0% over 30 days); risk in comparator arm = 72/2000 = 0.036 (3.6%).",
        "Compute the risk difference: event count difference = 80 - 72 = 8; risk difference = 8/2000 = 0.004 (0.4 percentage points, test drug minus comparator — positive means the test drug had slightly more events, so there is a small observed penalty).",
        "The SE of the risk difference by normal approximation is approximately sqrt( 0.040*0.960/2000 + 0.036*0.964/2000) ≈ 0.0060. The 95% CI is approximately (0.004 - 1.96*0.0060, 0.004 + 1.96*0.0060) ≈ (-0.008, 0.016) in proportions, or (-0.8, 1.6) percentage points.",
        "NI decision: the 95% CI upper bound is approximately 0.016 (1.6 pp). The pre-specified NI margin is 0.020 (2.0 pp). Because the upper bound of 1.6 pp is less than the margin of 2.0 pp, the data provide no statistical support for clinically unacceptable inferiority and non-inferiority is demonstrated.",
        "Note what NI does NOT establish: the 95% CI (approximately -0.8 to 1.6 pp) includes zero, so superiority of the test drug (i.e., definitively fewer events, which would require the upper bound to fall below zero) is not established by these data. The result is NI only, not superiority."
      ],
      "result": "risk in test arm = 80/2000 = 0.040; risk in comparator arm = 72/2000 = 0.036; risk difference = 8/2000 = 0.004 (0.4 percentage points); 95% CI ≈ (-0.8 to 1.6) percentage points; CI upper bound 1.6 pp is below the NI margin of 2.0 pp; non-inferiority is demonstrated. Superiority is not established because the CI includes zero (its upper bound is above zero)."
    },
    "prerequisites": [
      "inferential-statistics-foundations",
      "risk-ratio-and-risk-difference",
      "two-sample-t-test"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Non-inferiority testing (one-sided)",
        "description": "Tests whether the test treatment is not worse than the comparator by more than the NI margin Δ on a single primary endpoint. NI is demonstrated when the upper bound of the two-sided (1 − 2α) CI (equivalently the one-sided (1 − α) CI) falls below Δ. The most common design for biosimilar approvals, formulary-switch evidence, and alternative-formulation comparisons.",
        "edge_cases": [
          "The margin must be stated in the same metric as the primary endpoint: absolute risk difference for binary outcomes, mean difference for continuous outcomes. Switching metrics after unblinding is outcome-driven and invalid.",
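          "If the NI margin is set at or above M1 (the comparator's entire effect over placebo), a test drug that is no better than placebo can still \"pass\" the NI test, which is the argument for pre-specifying Δ well below M1. Separately, when the entire CI falls between 0 and Δ, the test drug is statistically worse than the comparator yet still non-inferior; both facts should be reported."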
        ],
        "data_source_notes": "Claims: use absolute risk difference (risk of the primary outcome in the post-index window per arm); require continuous enrollment to avoid differential censoring; apply active-comparator new-user design to minimize confounding by indication. EHR/registry: adjudicated outcomes preferred; link to pharmacy data for exposure definition."
      },
      {
        "name": "Equivalence testing (two-sided TOST)",
        "description": "Tests whether the test and comparator treatments are equivalent — neither substantially better nor substantially worse — by requiring the entire two-sided (1 − 2α) CI to fall within the symmetric equivalence interval (−Δ, +Δ). Stronger claim than NI; used for bioequivalence (PK parameters must not differ by more than 20%) and for comparisons where superiority in either direction is undesirable.",
        "edge_cases": [
          "Bioequivalence uses a 90% CI and a geometric-mean-ratio scale (PK parameters are typically log-normal); the 80–125% acceptance interval is symmetric on the log scale (ln 0.80 ≈ −0.223, ln 1.25 ≈ +0.223), not a ±20% band on the absolute difference scale used for clinical NI.",
          "Equivalence is harder to demonstrate than NI (requires both bounds within Δ), so it requires a larger sample size for a given power."
        ],
        "data_source_notes": "Primary PK data from crossover bioequivalence studies. For clinical equivalence in RWE, both ITT and per-protocol must place the full CI within ±Δ; any crossing of a bound is disqualifying."
      },
      {
        "name": "RWE target-trial NI emulation",
        "description": "Emulation of a target NI trial using linked real-world data (claims-EHR). The analyst specifies the NI margin in the target trial protocol, operationalizes the exposure and outcome definitions as in a protocol, applies active-comparator new-user design, and uses propensity-score methods to achieve approximate exchangeability. Both ITT-equivalent (all initiators regardless of adherence) and per-protocol-equivalent (adherent, continuous users) analyses are primary.",
        "edge_cases": [
          "Confounding toward the null is anti-conservative in NI emulations — residual confounding that makes the test drug look \"no different\" produces spurious NI. Negative-control outcome analyses are essential to detect this bias direction.",
          "Assay sensitivity is unverifiable in RWE; argue it from external evidence that the reference drug shows a real effectiveness signal in the same population and endpoint."
        ],
        "data_source_notes": "Linked claims-EHR preferred: EHR supplies clinical precision for the NI endpoint definition (adjudicated events); pharmacy claims supply complete, accurate exposure timing for the per-protocol adherence restriction."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "superiority testing (standard two-sided hypothesis test)",
        "pros_of_this": "Explicitly pre-specifies the acceptability threshold; CI upper bound has a direct clinical interpretation; satisfies regulatory standards for biosimilar and alternative-formulation approvals; protects against falsely licensing inferior drugs on the basis of low power.",
        "cons_of_this": "Requires a clinically grounded, pre-specified margin; larger sample size than the equivalent superiority test; assay sensitivity is an untestable assumption that must be defended; margin selection is contested and susceptible to sponsor influence.",
        "when_to_prefer": "Whenever the goal is \"good enough relative to an established treatment\" rather than definitively better; when NI or equivalence is the regulatory or formulary standard."
      },
      {
        "compared_to": "cost-minimization",
        "pros_of_this": "Provides the formal statistical evidence of outcome similarity that licenses a subsequent cost-minimization analysis; makes the equivalence assumption explicit and testable rather than assumed.",
        "cons_of_this": "NI/equivalence testing is an outcome analysis, not a cost analysis; CMA must be run separately as the next step.",
        "when_to_prefer": "Always run NI/equivalence first; cost-minimization is only valid after this step confirms that the treatments produce clinically equivalent outcomes."
      },
      {
        "compared_to": "target-trial-emulation",
        "pros_of_this": "Provides a structured statistical decision rule for the NI question within a target-trial framework; the margin operationalizes the clinical question precisely.",
        "cons_of_this": "In RWE NI emulations, confounding toward the null is anti-conservative rather than conservative, reversing the usual protective property of the emulation framework; requires extra vigilance and sensitivity analysis.",
        "when_to_prefer": "Use together: the target-trial framework supplies the design discipline; the NI testing framework supplies the inferential criterion."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Build matched or IPTW-weighted active-comparator new-user cohorts; define the primary endpoint (binary event in a fixed post-index window) the same way in both arms; use risk difference (not log-relative risk) to match the absolute-scale NI margin; handle death/disenrollment with IPCW to avoid differential censoring. Per-protocol restriction: require continuous drug supply (days_supply coverage) through the follow-up window in both arms. Watch for confounding toward the null — it is anti-conservative in NI.",
      "ehr": "Best for adjudicated outcomes (stroke, MI, hospitalization confirmed by clinical documentation); link to pharmacy claims for precise exposure timing and per-protocol adherence restriction. Out-of-network events are invisible, introducing informative censoring that can bias the risk difference.",
      "registry": "Pre-adjudicated endpoints (response, remission, progression) yield cleaner outcome definitions than claims for NI comparisons; combine with pharmacy/claims for exposure and cost data. Use death registries to handle the competing risk.",
      "linked": "Optimal substrate: EHR/registry supplies clinical precision for the NI endpoint; pharmacy claims supply complete exposure timing; linkage selection must be confirmed to be non-differential by arm to avoid biasing the risk difference."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\nfrom scipy import stats\nfrom statsmodels.stats.weightstats import ttost_ind\n\n# ── 1. NI test on a binary risk difference ──────────────────────────────────────\n# Worked example: 80/2000 vs 72/2000; NI margin = 0.020 (2.0 percentage points)\n\ndef ni_risk_diff(events_test, n_test, events_comp, n_comp,\n                 ni_margin, alpha=0.025):\n    \"\"\"Non-inferiority test for a risk difference via the CI approach.\n\n    H0: delta >= ni_margin  (test drug inferior by at least the margin)\n    Ha: delta <  ni_margin  (test drug non-inferior)\n\n    NI demonstrated when upper bound of two-sided (1-2*alpha) CI < ni_margin,\n    equivalently when the one-sided z-test p-value < alpha.\n    \"\"\"\n    p1 = events_test / n_test    # test drug risk\n    p2 = events_comp / n_comp    # comparator risk\n    rd = p1 - p2                 # risk difference (positive = test has more events)\n    se = np.sqrt(p1*(1-p1)/n_test + p2*(1-p2)/n_comp)\n\n    # Upper bound of two-sided (1-2*alpha) CI — the NI decision boundary\n    z_crit = stats.norm.ppf(1 - alpha)   # 1.96 for alpha=0.025\n    ci_upper = rd + z_crit * se\n    ci_lower = rd - z_crit * se\n\n    # One-sided z for H0: delta = ni_margin\n    z_ni = (ni_margin - rd) / se\n    p_one_sided = stats.norm.sf(z_ni)\n\n    ni_achieved = ci_upper < ni_margin\n\n    return {\n        \"risk_test\":      round(p1, 6),\n        \"risk_comp\":      round(p2, 6),\n        \"risk_diff\":      round(rd, 6),\n        \"se\":             round(se, 6),\n        \"ci_bounds\":      (round(ci_lower, 6), round(ci_upper, 6)),\n        \"ni_margin\":      ni_margin,\n        \"z_stat\":         round(z_ni, 4),\n        \"p_one_sided\":    round(p_one_sided, 4),\n        \"ni_demonstrated\": ni_achieved,\n    }\n\nresult = ni_risk_diff(80, 2000, 72, 2000, ni_margin=0.020, alpha=0.025)\nprint(\"NI result:\", result)\n# risk_diff = 0.004 (0.4 pp); ci_bounds ≈ (-0.008, 0.016);\n# ci_upper 0.016 < margin 
0.020 → ni_demonstrated = True\n\n# ── 2. Equivalence (TOST) on a continuous outcome ───────────────────────────────\n# ttost_ind(x1, x2, low, upp) tests two nulls, H0a: delta <= low and H0b: delta >= upp.\n# Equivalence (low < delta < upp) is declared when max(p_low, p_high) < alpha.\nrng = np.random.default_rng(42)\nx_test = rng.normal(loc=5.0, scale=2.0, size=200)\nx_comp = rng.normal(loc=5.1, scale=2.0, size=200)\nequiv_margin = 1.0  # raw-unit equivalence half-width\n\n# ttost_ind returns a 3-tuple: (pvalue, (t1, pv1, df1), (t2, pv2, df2)).\n# Unpack it as one scalar plus two 3-tuples, not as five scalars.\ntost_result = ttost_ind(\n    x_test, x_comp,\n    low=-equiv_margin, upp=equiv_margin,\n    usevar=\"unequal\",   # Welch correction\n)\n# tost_p: overall TOST p-value (max of the two one-sided p-values)\n# (t_low, p_low, df_low): lower-bound test; (t_high, p_high, df_high): upper-bound test\ntost_p, (t_low, p_low, df_low), (t_high, p_high, df_high) = tost_result\nprint(f\"\\nTOST p-value (overall): {tost_p:.4f}\")\nprint(f\"  Lower H0 (delta <= -{equiv_margin}): t={t_low:.3f}, p={p_low:.4f}\")\nprint(f\"  Upper H0 (delta >= +{equiv_margin}): t={t_high:.3f}, p={p_high:.4f}\")\nprint(f\"Equivalence demonstrated (alpha=0.05): {tost_p < 0.05}\")\n# tost_p is the binding (larger) one-sided p-value; both sub-tests must reject for equivalence.",
        "description": "Non-inferiority and equivalence testing in Python. Two analyses are shown:\n(1) NI test on a binary risk difference using a normal-approximation z-test — the\nstandard for proportions — applied to the worked example (80/2000 vs 72/2000, margin\n0.020). NI is demonstrated when the upper bound of the 95% two-sided CI is below the\nmargin (equivalently, the one-sided p-value for H0: delta >= margin is < 0.025).\n(2) Equivalence (TOST) on a continuous outcome using statsmodels.stats.weightstats.ttost_ind,\nwhich directly implements the two one-sided t-tests and returns both one-sided p-values;\nequivalence holds when the larger of the two p-values is < alpha.",
        "dependencies": [
          "numpy",
          "scipy",
          "statsmodels"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "# ── 1. NI test on a binary risk difference (manual CI) ──────────────────────────\nevents_test <- 80L;  n_test <- 2000L\nevents_comp <- 72L;  n_comp <- 2000L\nni_margin   <- 0.020   # 2.0 percentage points\n\np1 <- events_test / n_test   # 0.040\np2 <- events_comp / n_comp   # 0.036\nrd <- p1 - p2                # 0.004\n\nse <- sqrt(p1*(1-p1)/n_test + p2*(1-p2)/n_comp)\nz_crit <- qnorm(0.975)       # 1.96 for two-sided 95% CI (alpha = 0.025 per side)\nci_upper <- rd + z_crit * se\nci_lower <- rd - z_crit * se\n\ncat(sprintf(\"Risk difference: %.4f (%.2f pp)\\n\",   rd,       rd * 100))\ncat(sprintf(\"95%% CI: (%.4f, %.4f) = (%.2f, %.2f) pp\\n\",\n            ci_lower, ci_upper, ci_lower * 100, ci_upper * 100))\ncat(sprintf(\"NI margin: %.3f (%.1f pp)\\n\",         ni_margin, ni_margin * 100))\ncat(sprintf(\"NI demonstrated: %s\\n\",\n            ifelse(ci_upper < ni_margin, \"YES\", \"NO\")))\n# ci_upper ≈ 0.016 (1.6 pp) < 0.020 (2.0 pp) → NI demonstrated\n\n# ── 2. Equivalence TOST on a continuous outcome (TOSTER) ────────────────────────\nlibrary(TOSTER)\nset.seed(42)\nx_test <- rnorm(200, mean = 5.0, sd = 2.0)\nx_comp <- rnorm(200, mean = 5.1, sd = 2.0)\nequiv_margin <- 1.0   # raw-unit equivalence half-width\n\n# t_TOST(): two one-sided Welch t-tests (assumes TOSTER >= 0.4, which exports t_TOST)\ntost_res <- t_TOST(\n  x = x_test, y = x_comp,\n  low_eqbound  = -equiv_margin,\n  high_eqbound =  equiv_margin,\n  eqbound_type = \"raw\",\n  alpha        = 0.05,\n  var.equal    = FALSE    # Welch correction\n)\nprint(tost_res)\n# tost_res$TOST holds three rows: the ordinary t-test, then the lower and upper\n# one-sided tests. The binding TOST p-value is the larger of the two one-sided p-values.\np_tost <- max(tost_res$TOST$p.value[2:3])\ncat(sprintf(\"Equivalence demonstrated (alpha = 0.05): %s\\n\",\n            ifelse(p_tost < 0.05, \"YES\", \"NO\")))\n\n# ── 3. Risk ratio from a 2x2 table using epitools (context only) ────────────────\nlibrary(epitools)\n# epitools expects the reference row first and the event column second.\ntab <- matrix(\n  c(n_comp - events_comp, events_comp,\n    n_test - events_test, events_test),\n  nrow = 2, byrow = TRUE,\n  dimnames = list(c(\"comparator\",\"test_drug\"), c(\"no_event\",\"event\"))\n)\nrr_out <- riskratio(tab, method = \"wald\")\n# rr_out$measure gives the risk ratio (test vs comparator) with its Wald CI.\n# riskratio() does not return a risk difference: the NI decision on the absolute\n# scale uses ci_upper from section 1.\ncat(\"\\nRisk ratio from epitools:\\n\")\nprint(rr_out$measure)",
        "description": "Non-inferiority and equivalence testing in R. Three approaches are shown:\n(1) Manual CI approach for binary NI (always available, no package required), which\nmirrors the worked example exactly.\n(2) TOSTER::t_TOST() for equivalence on continuous outcomes, a widely used R function\nfor TOST that reports both one-sided tests alongside the ordinary t-test.\n(3) epitools::riskratio() for context: reports the risk ratio and its Wald CI from a\n2x2 table. It does not return a risk difference, so the NI decision on the absolute\nscale uses the CI from (1).",
        "dependencies": [
          "TOSTER",
          "epitools"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* ── Worked-example data: 80/2000 vs 72/2000 ─────────────────────────────────── */\n/* Expand from counts to patient-level rows for PROC FREQ.                        */\ndata work.ni_pts;\n  length arm $12;\n  do i = 1 to 2000;\n    arm = 'test_drug';\n    event = (i <= 80);   /* first 80 observations are events */\n    output;\n  end;\n  do i = 1 to 2000;\n    arm = 'comparator';\n    event = (i <= 72);\n    output;\n  end;\n  drop i;\nrun;\n\n/* ── 1. NI test on binary risk difference ─────────────────────────────────────── */\n/* RISKDIFF NONINF tests H0: p1 - p2 <= -margin, i.e. it treats higher            */\n/* proportions as better. For an adverse event, apply it to the no-event column:  */\n/*   ORDER=DATA -> row 1 = test_drug; column 1 = event=1, column 2 = event=0.     */\n/*   COLUMN=2   -> p = P(no event), so p1 - p2 = -(risk difference in events).    */\n/* H0 then reads risk_diff(events) >= 0.02, matching the worked example.          */\n/* MARGIN=0.02 sets Δ = 2.0 percentage points.                                    */\n/* CL=SCORE uses the score (Miettinen-Nurminen) CI, preferred for small risks.    */\n/* ALPHA=0.025 gives a one-sided 0.025 test; the NONINF limits are reported at    */\n/* 100(1 - 2*alpha)% = 95%.                                                       */\nproc freq data=work.ni_pts order=data;\n  tables arm * event / riskdiff(column=2 noninf margin=0.02 cl=score) alpha=0.025;\n  /* NI is demonstrated when the one-sided p-value < 0.025, equivalently when the */\n  /* lower 95% limit of the no-event difference is above -0.02.                   */\nrun;\n\n/* ── 2. Equivalence TOST on a continuous outcome ─────────────────────────────── */\n/* Simulate data: 200 patients per arm, means 5.0 and 5.1, SD = 2.0.             */\ndata work.cont_ni;\n  call streaminit(42);\n  length arm $12;\n  do i = 1 to 400;\n    if i <= 200 then arm = 'test_drug';\n    else             arm = 'comparator';\n    value = rand('normal', (i <= 200)*5.0 + (i > 200)*5.1, 2.0);\n    output;\n  end;\n  drop i;\nrun;\n\n/* TOST(-1.0, 1.0): equivalence margins are -1.0 and +1.0 in raw units.          */\n/* SAS reports two one-sided t-tests; the TOST p-value is max(p_lower, p_upper).  */\n/* Equivalence demonstrated when TOST p < 0.05.                                   */\nproc ttest data=work.cont_ni tost(-1.0, 1.0) alpha=0.05;\n  class arm;\n  var value;\n  /* Output includes the equivalence analysis table with both one-sided tests,    */\n  /* the binding p-value, and an assessment of equivalent or not equivalent.      */\nrun;",
        "description": "Non-inferiority and equivalence testing in SAS. Two procedures cover the two main\nendpoint types:\n(1) PROC FREQ with RISKDIFF NONINF for binary outcomes (risk difference NI test) —\napplied to the worked example. The NONINF option outputs the one-sided p-value and\nupper CI directly, and the MARGIN= sub-option sets Δ.\n(2) PROC TTEST with TOST for continuous outcomes — SAS 9.4+ provides the TOST option\ndirectly, which reports both one-sided t-statistics, p-values, and the TOST conclusion.\nBoth require the margin and alpha to be pre-specified.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  M[Pre-specify NI margin Δ<br/>before data collection<br/>using M1/M2 logic] --> D[Collect data;<br/>compute risk difference δ<br/>test minus comparator<br/>and 95% CI]\n  D --> U{Is CI upper bound<br/>below margin Δ?}\n  U -- \"Yes: upper bound < Δ\" --> NI[NI demonstrated]\n  U -- \"No: upper bound ≥ Δ\" --> FAIL[NI NOT demonstrated<br/>data compatible with<br/>unacceptable inferiority]\n  NI --> L{Is CI lower bound<br/>also above −Δ?}\n  L -- \"Yes: full CI within ±Δ\" --> EQ[Equivalence demonstrated<br/>full CI within ±Δ]\n  L -- \"No: CI crosses −Δ\" --> NIO[NI only; not formal equivalence]\n  NI --> S{Is CI upper bound<br/>below zero?}\n  S -- \"Yes: CI entirely negative\" --> SUP[Superiority also demonstrated]\n  S -- \"No: upper bound ≥ 0\" --> NIONLY[NI only; superiority not established]\n  FAIL --> INC[Conduct per-protocol<br/>analysis; check assay<br/>sensitivity; report inconclusive]",
        "caption": "CI-based decision logic for NI and equivalence testing. The NI margin Δ must be pre-specified before data collection; all branches follow from where the 95% CI bounds fall relative to Δ and zero.",
        "alt_text": "Decision flowchart starting with pre-specified NI margin, proceeding to CI computation, then branching on whether the upper CI bound is below the margin (NI demonstrated or not), and further branching on whether the CI is entirely within ±Δ (equivalence), entirely below zero (superiority, because the difference is test minus comparator event risk), or reaches zero or above (NI only).",
        "source_type": "illustrative",
        "source_citations": [
          "walker-2011"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  subgraph Cases[\"CI position relative to margin Δ = 2.0 pp\"]\n    A[\"Case 1 NI demonstrated<br/>CI = [-0.8, 1.6]<br/>upper 1.6 < 2.0 ✓\"] --> AR[NI demonstrated<br/>not superior]\n    B[\"Case 2 Superiority + NI<br/>CI = [0.5, 1.8]<br/>lower > 0, upper < 2.0\"] --> BR[Both demonstrated]\n    C[\"Case 3 NI NOT demonstrated<br/>CI = [0.5, 2.5]<br/>upper 2.5 > 2.0 ✗\"] --> CR[Inconclusive for NI]\n    D[\"Case 4 p > 0.05 trap<br/>CI = [-3.0, 4.0]<br/>wide CI, p-value n.s.\"] --> DR[Cannot claim NI:<br/>CI too wide]\n  end",
        "caption": "Four cases illustrating how CI position determines the NI conclusion. Case 4 is the \"p > 0.05 trap\" — a non-significant test does not establish NI when the CI is wide enough to include large harmful differences.",
        "alt_text": "Four parallel cases showing different CI positions relative to a 2.0 pp NI margin, demonstrating the variety of NI, superiority, inconclusive, and \"p > 0.05 trap\" outcomes.",
        "source_type": "illustrative",
        "source_citations": [
          "walker-2011"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "used_with",
        "target_slug": "cost-minimization",
        "notes": "Cost-minimization analysis requires demonstrated equivalence on outcomes as its evidentiary prerequisite; equivalence or NI testing supplies that prerequisite. A non-significant superiority test does not license CMA."
      },
      {
        "relation_type": "requires",
        "target_slug": "inferential-statistics-foundations",
        "notes": "NI and equivalence testing build on confidence interval theory, one-sided hypothesis tests, and the distinction between statistical significance and clinical meaningfulness; these foundations must be in place first."
      },
      {
        "relation_type": "see_also",
        "target_slug": "two-sample-t-test",
        "notes": "The TOST procedure runs two one-sided t-tests for continuous outcomes (or two one-sided z-tests for proportions); understanding the two-sample t-test is a prerequisite for understanding TOST mechanics."
      },
      {
        "relation_type": "used_with",
        "target_slug": "risk-ratio-and-risk-difference",
        "notes": "The NI margin for binary endpoints is typically expressed on the absolute risk difference scale; risk-ratio and risk-difference methods supply the point estimate and CI that are compared against the margin."
      },
      {
        "relation_type": "see_also",
        "target_slug": "target-trial-emulation",
        "notes": "Target-trial emulation can operationalize NI questions in RWE; however, confounding toward the null is anti-conservative in NI emulations — the opposite of the usual conservative property in superiority emulations."
      }
    ],
    "aliases": [
      "non-inferiority testing",
      "equivalence testing",
      "TOST",
      "two one-sided tests",
      "NI trial",
      "bioequivalence testing"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "erisa-self-insured-health-plans-rwe",
    "name": "ERISA and Self-Insured Employer Health Plans in RWE",
    "short_definition": "Employer-sponsored health plan arrangements governed by ERISA, including self-insured plans where the employer bears claim risk and plan design, reporting, carve-outs, and state-regulatory exposure differ from fully insured plans.",
    "long_description": "ERISA shapes US employer-sponsored health benefits and therefore shapes commercial claims data. In a fully insured plan, an insurer bears the claim risk subject to state insurance regulation. In a self-insured plan, the employer generally bears the claim risk and may use an administrator or insurer for network and claims operations. ERISA preemption and reporting rules make self-insured employer plans operationally different from fully insured products.\n\nFor RWE, self-insured status matters because plan design, covered benefits, network breadth, formulary design, carve-outs, stop-loss arrangements, and claims completeness can differ. A large self-insured employer may carve out pharmacy, behavioral health, fertility, or specialty benefits to separate vendors. A commercial database sourced from one administrator may therefore miss services handled by another vendor.\n\nAnalysts should not treat commercial claims as a homogeneous payer channel. ERISA/self-insured plan structure can drive selection, observability, benefit generosity, and job-related enrollment churn. Employer and plan identifiers, if available, should be used carefully for clustering, subgroup diagnostics, and benefit-completeness checks.\n\n**Pros, cons, and trade-offs.** ERISA and self-insured plan metadata help explain why two commercial claims cohorts can have different benefit completeness, treatment access, cost sharing, and enrollment churn. This is valuable for data-fitness assessment and payer-channel interpretation. The trade-off is that ERISA status is legal and administrative context, not a patient-level clinical exposure. 
It can be hard to observe directly in vendor data, and plan funding may be confounded with employer size, industry, geography, workforce composition, and benefit generosity.\n\n**When to use.** Use ERISA/self-insured plan information when plan funding, administrator structure, state mandate exposure, carve-outs, stop-loss arrangements, or employer-level clustering can affect exposures, outcomes, costs, or missing benefit channels. It is most useful in commercial claims studies where employer/plan metadata are available.\n\n**When NOT to use - and when it is actively misleading.** Do not use ERISA as if it were a clinical characteristic of the patient. Do not assume every commercial plan is fully insured or every self-insured plan has complete integrated medical and pharmacy data. It is actively misleading to compare plan groups without checking whether carved-out benefits or administrator splits change what the dataset can observe.",
    "primary_category": "Data_Quality_Assessment",
    "tags": [
      "erisa",
      "self-insured",
      "employer-sponsored-insurance",
      "commercial-claims",
      "benefit-design",
      "plan-funding"
    ],
    "applies_to_study_types": [
      "claims_analysis",
      "health_policy",
      "comparative_effectiveness",
      "budget_impact"
    ],
    "data_sources": [
      "claims"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": null,
        "url": "https://www.govinfo.gov/content/pkg/USCODE-2024-title29/html/USCODE-2024-title29-chap18.htm",
        "citation_text": "United States Code. Title 29, Chapter 18: Employee Retirement Income Security Program.",
        "year": 2024,
        "authors_short": "U.S. Code",
        "notes": "Official statutory text for ERISA provisions."
      },
      {
        "role": "explain",
        "doi": null,
        "url": "https://www.govinfo.gov/app/details/CMR-L1-00186737",
        "citation_text": "U.S. Department of Labor. Annual Report on Self-Insured Group Health Plans March 2024. GovInfo.",
        "year": 2024,
        "authors_short": "DOL",
        "notes": "Federal report describing self-insured group health plan funding and filing characteristics."
      },
      {
        "role": "explain",
        "doi": null,
        "url": "https://www.govinfo.gov/content/pkg/GAOREPORTS-HEHS-95-167/html/GAOREPORTS-HEHS-95-167.htm",
        "citation_text": "U.S. General Accounting Office. Employer-Based Health Plans: Issues, Trends, and Challenges Posed by ERISA.",
        "year": 1995,
        "authors_short": "GAO",
        "notes": "Historical policy context on ERISA and employer plan regulation."
      }
    ],
    "plain_language_summary": "ERISA is the federal law behind many employer health plans. In self-insured plans, employers carry the claim risk and often use vendors to administer benefits. That can make commercial claims incomplete or different across employers.",
    "key_terms": [
      {
        "term": "Fully insured plan",
        "definition": "Employer plan where an insurer bears the claim risk and generally falls under state insurance regulation."
      },
      {
        "term": "Self-insured plan",
        "definition": "Employer plan where the employer bears claim risk and typically contracts with administrators or vendors."
      },
      {
        "term": "ERISA preemption",
        "definition": "Federal ERISA rules can preempt many state laws for employer benefit plans, especially self-insured plans."
      }
    ],
    "worked_example": {
      "scenario": "A commercial database includes two employer groups. Employer A is fully insured with integrated medical and pharmacy claims. Employer B is self-insured under ERISA with medical claims administered by one vendor and pharmacy carved out to a PBM not present in the extract.",
      "dataset": {
        "caption": "Plan funding and benefit observability by employer group.",
        "columns": [
          "employer_group",
          "funding_type",
          "medical_claims",
          "pharmacy_claims",
          "rwe_implication"
        ],
        "rows": [
          [
            "A",
            "fully insured",
            "present",
            "present",
            "exposure and outcomes observable"
          ],
          [
            "B",
            "self-insured",
            "present",
            "missing (carved out)",
            "pharmacy exposure/adherence invalid unless PBM feed is linked"
          ]
        ]
      },
      "steps": [
        "Use plan metadata to identify funding arrangement and administrators.",
        "Profile required benefit channels before building exposure and cost endpoints.",
        "Stratify or exclude employer groups with missing carved-out services for affected analyses.",
        "Cluster standard errors or sensitivity-test by employer/plan when benefit design drives treatment choice."
      ],
      "result": "Employer B can contribute to medical-outcome analyses but not pharmacy-exposure or adherence analyses unless the carved-out PBM data are added."
    },
    "prerequisites": [],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Fully insured commercial plan",
        "description": "Insurer bears risk and typically administers integrated claim functions.",
        "edge_cases": [
          "Vendor extracts can still omit services.",
          "State mandates may influence coverage."
        ],
        "data_source_notes": "Often more uniform within carrier extracts, but still plan-specific."
      },
      {
        "name": "Self-insured ERISA plan",
        "description": "Employer bears claim risk and may use separate administrators or carve-out vendors.",
        "edge_cases": [
          "Benefit feeds can be split.",
          "State insurance rules may not apply the same way."
        ],
        "data_source_notes": "Requires employer/plan-level completeness checks."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Payer channel only",
        "use_erisa_plan_metadata_when": "Plan funding, benefit design, or carve-outs affect exposure, outcomes, or costs.",
        "use_payer_channel_when": "Only broad payer stratification is available and the limitation is acknowledged.",
        "notes": "\"Commercial\" is too broad for many RWE data-fitness decisions."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Use employer, plan, product, funding, and administrator fields when available; missing metadata should be treated as a data-quality limitation."
    },
    "implementations": [],
    "diagrams": [],
    "relations": [
      {
        "relation_type": "see_also",
        "target_slug": "medicare-ffs-ma-commercial-claims-differences-rwe",
        "notes": "Commercial claims heterogeneity is partly driven by employer plan funding and benefit structure."
      },
      {
        "relation_type": "used_with",
        "target_slug": "benefit-carve-outs-medical-pharmacy-rwe",
        "notes": "Self-insured employers commonly use carved-out benefit vendors."
      }
    ],
    "aliases": [
      "ERISA",
      "self-insured plan",
      "self-funded plan",
      "employer-sponsored insurance",
      "employer health plan",
      "ASO plan",
      "administrative services only plan"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "estimand-analysis-traceability-rwe",
    "name": "Estimand-to-Analysis Traceability",
    "short_definition": "A documentation discipline that maps each attribute of a pre-specified estimand (treatment, population, endpoint, intercurrent-event strategy, summary measure) to the exact operational data rule, code module, diagnostic, and sensitivity analysis that implements it, so the number produced answers the causal question that was asked.",
    "long_description": "**Estimand-to-analysis traceability** is the connective tissue between a causal question and the number a study reports. The\nICH E9(R1) estimand framework decomposes the target quantity into five attributes — *treatment conditions*, *population*,\n*endpoint (variable)*, *intercurrent-event (ICE) strategy*, and *population-level summary measure*. Traceability is the\nrequirement that every one of those attributes be carried, without slippage, through cohort construction, follow-up, outcome\nascertainment, the statistical model, and the sensitivity plan — and that the mapping be written down (a traceability matrix)\nso an author, reviewer, or regulator can audit \"the estimate answers the estimand.\" It is not itself a study design or an\nestimator; it is the governance layer that prevents the silent substitution of a *different* estimand that observational data\nmake convenient. In RWE this discipline matters more than in trials, because the data were collected for billing or care, not\nfor the question, so the gap between \"what we wanted to estimate\" and \"what the code actually computed\" is wider and easier to\ncross unnoticed.\n\n**Core conceptual distinction**. The thing to keep separate is the *estimand* (the target quantity, defined before any data are\ntouched) from the *estimator* (the statistical procedure) and the *estimate* (the realized number). Traceability is the\nbookkeeping that binds the three so they refer to the same target. 
The most consequential RWE failures are estimand drift:\nfollow-up that starts at diagnosis rather than at the treatment decision silently changes the *population* attribute (it admits\nimmortal person-time); censoring at treatment discontinuation without weighting changes the *ICE strategy* from a treatment-policy\nto a hypothetical/while-on-treatment estimand; reporting a hazard ratio when the question is about absolute risk in the presence\nof competing death changes the *summary measure* from a cumulative-incidence contrast to a cause-specific rate that no patient\nexperiences. Each is a defensible analysis of *some* estimand — just not the one the protocol named. Traceability makes the\nsubstitution visible by forcing the matrix cell \"ICE strategy = treatment policy\" to point at the specific code that does NOT\ncensor at switching, and the diagnostic that proves it.\n\n**Pros, cons, and trade-offs**.\n- **vs an undocumented single-rule implementation** (the common default — write the SAS, get a number): traceability is far more\n  transparent, reproducible, and defensible for FDA/EMA/HTA submission, and it catches estimand drift before it reaches a\n  reviewer. Cost: it is overhead — a matrix to build and maintain, more diagnostics, more sensitivity runs. Prefer it for any\n  consequential comparative-effectiveness, safety, utilization, cost, or regulatory-grade analysis.\n- **vs the target-trial-emulation protocol alone**: a target-trial protocol *is* an estimand specification (eligibility,\n  treatment strategies, assignment, follow-up, outcome, causal contrast). Traceability is the layer *beneath* it that maps each\n  protocol element to executable code and a check. The protocol says \"treatment-policy estimand\"; the traceability matrix proves\n  the code implemented it. Use both — the protocol is the spine, the matrix is the audit trail. 
They are complements, not\n  substitutes.\n- **vs leaving estimand choice implicit in the SAP's model section**: putting the estimand only in the model paragraph invites\n  the reader to reverse-engineer the question from the procedure (PROC PHREG ⇒ \"they must want a hazard ratio\"), which is exactly\n  backwards and hides ICE handling. A standalone matrix forces the question first. Cost: redundancy with the SAP; resolve by\n  making the matrix a SAP appendix rather than a parallel document.\n\n**When to use**. Any hypothesis-evaluating RWE study intended for a regulator, payer, or HTA body; any study where the estimand\nhas a non-trivial ICE strategy (treatment switching, discontinuation, death as a competing event, rescue medication); any\nmulti-programmer or multi-database study where drift between sites is plausible; and as a precondition for pre-registration,\nbecause you cannot pre-specify what you cannot trace. Build the matrix at protocol/SAP time, before programming, and freeze it\nwith the analysis plan.\n\n**When NOT to use — and when it is actively misleading or dangerous**. For an exploratory, hypothesis-generating descriptive\nscan with no single target quantity, a full estimand matrix is ceremony that buys nothing — say so rather than manufacture a\nfake estimand. The dangerous failure mode is *traceability theater*: a matrix that is filled in after the analysis to rationalize\nwhatever the code happened to compute. A post-hoc matrix that maps \"ICE = treatment policy\" to code that in fact censored at\nswitching is worse than no matrix — it launders estimand drift with the appearance of rigor and will mislead a reviewer who\ntrusts the document. The discipline is only protective if the matrix is written before the code and the diagnostics are run to\n*confirm* each cell, not to decorate it. 
Equally, a matrix that is internally consistent but built on an estimand that the data\ncannot support (e.g., a while-on-treatment estimand in a claims source where treatment episodes cannot be reconstructed from\n`days_supply`) gives false comfort: traceability guarantees the estimate matches the estimand, not that the estimand is\nidentifiable from the source.\n\n**Data-source operational depth**. Traceability is where data-source pathology becomes estimand-specific, because each attribute\nfails differently by source.\n- **Claims (FFS vs MA vs commercial):** The *population* and *follow-up* attributes hinge on continuous, observable enrollment.\n  Medicare Advantage enrollees do not generate the fee-for-service claims used to ascertain exposure and outcomes, so MA-only\n  person-time is unobserved, not event-free; a \"treatment policy from first fill\" estimand silently becomes \"treatment policy\n  among the FFS-observable,\" a different population. Fix: require continuous Parts A/B/D (or commercial medical+pharmacy) across\n  the window and exclude MA-only spans, and record that exclusion in the population cell. The *treatment* attribute (treatment\n  policy vs while-on-treatment) is operationalized through `days_supply` + a grace period; 90-day mail-order fills and\n  free samples distort episodes built from `days_supply` and quietly shift a while-on-treatment estimand. 
The *endpoint* attribute for any\n  elderly/comorbid cohort must contend with **differential competing risks** — if death rates differ by arm, a cause-specific\n  Cox HR and a Fine-Gray subdistribution contrast diverge, and only one matches a \"cumulative incidence\" summary measure.\n- **EHR:** The *endpoint* and *follow-up* attributes are encounter-driven; a patient who leaves the system is differentially\n  lost, so a \"treatment policy through end of follow-up\" estimand requires an explicit observation-window definition and\n  informative-censoring handling, not the implicit \"last seen\" that the data offer. The *treatment* attribute starts at the\n  order/administration, not the dispense — trace which.\n- **Registry / linked:** Registries are strong for the *endpoint* (adjudicated events, disease stage) and *population* (severity)\n  but weak for *treatment* (incomplete pharmacy). Linkage to claims fills the treatment cell and to a death index firms up the\n  competing-risk and censoring cells — but linkage selection narrows the *population* attribute to the linkable subset, which the\n  matrix must state. **Immortal-time** in procedure/device studies is an estimand-population failure: if eligibility requires\n  surviving to a procedure date, the matrix's \"time zero\" cell must align follow-up start with the treatment decision, or the\n  population silently excludes early deaths.\n\n**Worked example (claims traceability matrix).** Question: comparative risk of hospitalization for heart failure (HHF) with an\nSGLT2 inhibitor vs a DPP-4 inhibitor among adults with type 2 diabetes in a Medicare FFS + commercial database; target estimand\n= the treatment-policy effect on cumulative incidence of HHF at 2 years, accounting for death as a competing event. 
The matrix\nbinds each ICH E9(R1) attribute to one operational rule, one code module, one diagnostic, and one pre-specified sensitivity:\n(1) **Treatment** = initiation of SGLT2i vs DPP-4i, treatment-policy strategy → arm assigned from the NDC on the first qualifying\n`fill_date`; the as-treated alternative would stitch `days_supply` into episodes with a 30-day grace period — code module\n`build_episodes()`; diagnostic = distribution of `days_supply` and gap lengths by arm; sensitivity = re-run as while-on-treatment\nwith IPC weighting. (2) **Population** = incident users with T2D and 365 days of continuous Parts A/B/D (no MA-only span) →\n`apply_enrollment()`; diagnostic = attrition funnel and % MA-only excluded; sensitivity = relax washout to 180 days. (3)\n**Endpoint** = first HHF (validated 1-inpatient HF claim in the primary position) → `flag_hhf()`; diagnostic = PPV against a\nchart-validated subset; sensitivity = broaden to any-position HF. (4) **Intercurrent events** = death is a competing event\n(handled by the subdistribution model, NOT censored), switching handled by the treatment-policy strategy (ignored) →\n`set_competing_risk()`; diagnostic = cumulative-incidence-function plot of HHF and death by arm; sensitivity = cause-specific Cox\nto show the gap. (5) **Summary measure** = 2-year cumulative-incidence difference and ratio with a Fine-Gray subdistribution\nhazard ratio after overlap weighting → `fit_finegray()` (e.g., `PROC PHREG ... eventcode=` / `survival::finegray`); diagnostic =\nstandardized mean differences <0.1 post-weighting; sensitivity = E-value for the SHR. Every cell that a reviewer might suspect of\ndrift — \"did they censor at switching?\" \"did they treat death as censoring?\" — resolves to a named module and a check, which is\nthe entire point of the matrix.",
    "primary_category": "Framework_Standard",
    "tags": [
      "estimand",
      "estimand-to-estimator-mapping",
      "traceability-matrix",
      "ich-e9-r1",
      "intercurrent-events",
      "regulatory-readiness",
      "reproducibility",
      "target-trial"
    ],
    "applies_to_study_types": [
      "claims_analysis",
      "ehr_study",
      "registry_linkage",
      "target_trial_emulation",
      "comparative_effectiveness"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1093/aje/kwv254",
        "url": "https://doi.org/10.1093/aje/kwv254",
        "citation_text": "Hernán MA, Robins JM. Using Big Data to Emulate a Target Trial When a Randomized Trial Is Not Available. American Journal of Epidemiology. 2016;183(8):758-764.",
        "year": 2016,
        "authors_short": "Hernán & Robins",
        "notes": "Canonical statement that observational analysis must begin by specifying the target trial (estimand), with each protocol element mapped to an analytic decision — the conceptual basis for estimand-to-analysis traceability."
      },
      {
        "role": "explain",
        "doi": "10.1002/pds.5507",
        "url": "https://doi.org/10.1002/pds.5507",
        "citation_text": "Wang SV, Pottegård A, Crown W, et al. HARmonized Protocol Template to Enhance Reproducibility of hypothesis evaluating real-world evidence studies on treatment effects: a good practices report of a joint ISPE/ISPOR task force. Pharmacoepidemiology and Drug Safety. 2023;32(1):44-55.",
        "year": 2023,
        "authors_short": "Wang et al.",
        "notes": "The HARPER template operationalizes traceability — it forces each design and analysis decision to be pre-specified and documented so RWE treatment-effect studies are auditable and reproducible."
      },
      {
        "role": "demonstrate",
        "doi": "10.1136/bmj.k182",
        "url": "https://doi.org/10.1136/bmj.k182",
        "citation_text": "Hernán MA. How to estimate the effect of treatment duration on survival outcomes using observational data. BMJ. 2018;360:k182.",
        "year": 2018,
        "authors_short": "Hernán",
        "notes": "Worked demonstration of mapping a precise estimand (effect of treatment duration) to the analysis, showing how an imprecise estimand produces immortal-time and other drift — traceability in action on a single attribute."
      },
      {
        "role": "explain",
        "doi": "10.1002/pds.5639",
        "url": "https://doi.org/10.1002/pds.5639",
        "citation_text": "Austin PC. Differences in target estimands between different propensity score-based weights. Pharmacoepidemiology and Drug Safety. 2023;32(11):1224-1233.",
        "year": 2023,
        "authors_short": "Austin",
        "notes": "Shows that the choice of weighting estimator (ATE/ATT/ATO) implies a different target population — a concrete reason the summary-measure and population cells of the traceability matrix must record the estimand the estimator actually targets."
      },
      {
        "role": "use",
        "doi": null,
        "url": "https://database.ich.org/sites/default/files/E9-R1_Step4_Guideline_2019_1203.pdf",
        "citation_text": "ICH. E9(R1) Addendum on Estimands and Sensitivity Analysis in Clinical Trials to the Guideline on Statistical Principles for Clinical Trials. 2019.",
        "year": 2019,
        "authors_short": "ICH",
        "notes": "Defines the five estimand attributes (treatment, population, endpoint, intercurrent-event strategy, summary measure) and the aligned sensitivity-analysis terminology that the traceability matrix is built around."
      }
    ],
    "plain_language_summary": "Before running a single line of code, a researcher writes out the precise question they want to answer: who the patients are, which treatments are compared, what outcome counts, what happens when patients switch drugs mid-study, and how to summarize the result. That written question is called the target question (the estimand). Traceability means every analysis choice made afterward can be traced back to one of those five parts of the question, so that the final number demonstrably answers the question that was asked. Without this discipline, it is easy to drift: the code quietly answers a slightly different question. For example, it might stop following patients when they switch drugs even though the plan was to keep counting them in their original group. Traceability makes that drift visible before results are shared.",
    "key_terms": [
      {
        "term": "target question (estimand)",
        "definition": "The precise causal question a study is designed to answer, written down and locked before any data are touched, specifying exactly which patients, treatments, outcome, complicating events, and summary statistic are in scope."
      },
      {
        "term": "intercurrent event",
        "definition": "Something that happens after a patient enters a study and complicates how you count their outcome — for example, the patient switches to a different drug, stops treatment entirely, or dies before the outcome of interest occurs."
      },
      {
        "term": "traceability matrix",
        "definition": "A table, written before programming begins, that connects each of the five parts of the target question to the specific data rule, code function, and check that implements it in the analysis."
      },
      {
        "term": "estimand drift",
        "definition": "The silent substitution that occurs when the code or data rules actually implement a different question than the one the study pre-specified, without anyone noticing."
      },
      {
        "term": "summary measure",
        "definition": "The single number used to answer the question across the whole study population — for example, the difference in two-year risk between treatment arms, or the ratio of event rates."
      },
      {
        "term": "operational rule",
        "definition": "The exact, reproducible data instruction that turns a conceptual question attribute into a computable flag or variable in the claims or EHR dataset."
      }
    ],
    "worked_example": {
      "scenario": "A researcher wants to know: among adults newly starting an SGLT2 inhibitor versus a DPP-4 inhibitor for type 2 diabetes in Medicare claims, what is the difference in two-year cumulative risk of hospitalization for heart failure, treating death as a competing event and ignoring any later drug switching? Before touching data, they write this question out as five attributes and then trace each attribute to a specific analysis choice. The table below shows that trace. No numbers are fabricated here — the point is the logic of the mapping, not a specific effect size.",
      "dataset": {
        "caption": "Traceability matrix: five attributes of the pre-specified target question, each mapped to one operational rule and one analysis choice. This table is written before any code is run.",
        "columns": [
          "Attribute",
          "What the attribute pins down",
          "Operational rule written in the analysis plan",
          "Analysis choice it drives"
        ],
        "rows": [
          [
            "Population",
            "Who counts as a study patient",
            "Adults with type 2 diabetes who fill their first-ever SGLT2i or DPP-4i prescription after 365 days of continuous insurance enrollment, with no prior fill of either drug class",
            "The cohort-entry code requires a 365-day clean lookback and a confirmed diabetes diagnosis before the first fill; patients already on either drug are excluded"
          ],
          [
            "Treatment",
            "Which exposures are compared and how they are defined",
            "Arm assigned by the drug class on the index fill date; later switching is ignored (treatment-policy approach — the patient stays in the arm they started in)",
            "The code reads the NDC code on the first qualifying fill and locks the arm; no re-assignment logic runs when switching occurs in follow-up"
          ],
          [
            "Endpoint",
            "What outcome is measured and how it is recognized in the data",
            "First inpatient hospitalization with heart failure coded in the primary diagnosis position; the patient must be admitted, not just seen in the ER",
            "The outcome flag requires one inpatient claim with a qualifying ICD-10 code in position 1; ER visits and observation stays do not trigger the flag"
          ],
          [
            "Handling of intercurrent events",
            "What to do when a patient switches drugs or dies before a heart-failure hospitalization occurs",
            "Death is a competing event and is NOT treated as ordinary censoring; switching is handled by the treatment-policy approach above (stay in original arm through the full two-year window)",
            "The analysis uses a competing-risks model so that the risk of heart failure and the risk of death are estimated simultaneously; patients who die are not simply dropped from the denominator"
          ],
          [
            "Summary measure",
            "How the result is expressed as a single comparable number",
            "Two-year cumulative incidence difference and ratio between arms, estimated after balancing the two groups on measured confounders using overlap weighting",
            "The model reports absolute risk at two years for each arm and their difference; it does not report a hazard ratio, because the pre-specified question is about cumulative risk, not instantaneous rate"
          ]
        ]
      },
      "steps": [
        "Write the five attributes before opening any data file. Each attribute locks one dimension of the question; leaving any attribute unspecified means the code will make that choice invisibly.",
        "For each attribute, write the operational rule in plain language specific enough that two programmers would produce the same result. Vague rules such as 'new users' or 'first fill' are not operational until the exact lookback period and exclusion criteria are stated.",
        "For each attribute, identify which analysis choice it forces. The population attribute determines the cohort-entry query; the intercurrent-event attribute determines whether the outcome model censors at death or treats it as a competing event. These are not independent decisions — they are forced by the pre-specified question.",
        "Lock the table before writing code. After the analysis runs, check each row: does the code actually implement the operational rule? A row where the code diverges from the rule signals estimand drift — fix the code, not the question.",
        "Run the pre-specified sensitivity analyses for any attribute that involved a judgment call. For example, re-run the population attribute with a 180-day lookback to verify results do not change materially; re-run the intercurrent-event attribute using a cause-specific Cox model to show how much the competing-risks handling matters."
      ],
      "result": "The completed matrix produces a fully specified question: the two-year absolute risk difference in heart-failure hospitalization between new SGLT2i and DPP-4i users in Medicare, among incident users with 365 days of prior enrollment, assigning arms at the index fill, treating death as a competing event, estimated as a cumulative-incidence contrast after overlap weighting. Every word in that sentence maps to one row of the table. A reviewer can audit the number by checking whether the code implements each row — that auditability is the entire purpose of the discipline."
    },
    "prerequisites": [
      "picots-framework-rwe",
      "estimands-ate-att-intercurrent-events-rwe",
      "target-trial-emulation"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Estimand traceability matrix (attribute-by-attribute)",
        "description": "A table with one row per ICH E9(R1) estimand attribute (treatment, population, endpoint, intercurrent-event strategy, summary measure) and columns for the operational rule, code module, confirming diagnostic, and pre-specified sensitivity analysis. Frozen with the SAP before programming.",
        "edge_cases": [
          "Intercurrent-event rows are the highest-risk cells (treatment-policy vs while-on-treatment vs hypothetical vs composite vs principal-stratum) and the most often left implicit in the model section.",
          "A matrix completed after analysis (\"traceability theater\") rationalizes drift instead of preventing it."
        ],
        "data_source_notes": "claims: population/follow-up cells must encode continuous enrollment and MA-only exclusion; EHR: endpoint cell must encode encounter-driven capture and informative censoring; linked: population cell must record linkage selection."
      },
      {
        "name": "Protocol-element-to-code crosswalk (target-trial emulation)",
        "description": "Variant aligned to a target-trial protocol — each protocol element (eligibility, treatment strategies, assignment, time zero, outcome, causal contrast) is mapped to its emulation code and diagnostic, with the explicit estimand named.",
        "edge_cases": [
          "Time-zero and eligibility elements are where immortal time and selection silently re-enter; the crosswalk must show follow-up starting at the treatment decision.",
          "Grace-period and dynamic-strategy elements may require clone-censor-weight, which the crosswalk must point to rather than a naive censoring rule."
        ],
        "data_source_notes": "claims: assignment is the index NDC; outcome and censoring rules must be identical across arms and traced to the same modules."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Undocumented single-rule implementation",
        "pros_of_this": "Transparent, reproducible, and auditable for regulatory/HTA review; catches estimand drift (immortal time, mis-specified intercurrent-event handling, wrong summary measure) before it reaches a reviewer.",
        "cons_of_this": "Documentation and diagnostic overhead; requires building and maintaining the matrix and running confirmatory checks and sensitivity analyses for each attribute.",
        "when_to_prefer": "Any consequential comparative-effectiveness, safety, utilization, cost, or regulatory-grade RWE analysis."
      },
      {
        "compared_to": "Target-trial-emulation protocol alone",
        "pros_of_this": "Adds the executable audit trail beneath the protocol — proves the code implemented the named estimand, not just that the protocol declared it.",
        "cons_of_this": "Redundant with the protocol if maintained as a separate document; best kept as a SAP appendix.",
        "when_to_prefer": "Always pair them; the protocol is the estimand spine and the matrix is its verification layer."
      },
      {
        "compared_to": "Estimand stated only in the SAP model section",
        "pros_of_this": "Forces the causal question to drive the procedure rather than letting the reader infer the question from the procedure; surfaces intercurrent-event handling explicitly.",
        "cons_of_this": "Some duplication with the SAP narrative.",
        "when_to_prefer": "Whenever the intercurrent-event strategy or summary measure is non-trivial (switching, discontinuation, competing death, rescue therapy)."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Trace the population and follow-up attributes to continuous-enrollment logic and MA-only exclusion; trace the treatment attribute to days_supply/grace-period episode construction; trace the endpoint attribute to a validated algorithm and to competing-risk handling when death rates differ by arm.",
      "ehr": "Trace the treatment attribute to order/administration (not dispense); trace the endpoint and follow-up attributes to an explicit observation window and informative-censoring handling because capture is encounter-driven.",
      "registry": "Trace the endpoint and population attributes to adjudication and severity fields; trace the treatment attribute to linked claims because registry pharmacy capture is incomplete.",
      "linked": "Record linkage selection in the population cell and reconcile order/fill/service date discrepancies before assigning time zero in the follow-up cell."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\n\n# One row per ICH E9(R1) estimand attribute. Fill from the protocol/SAP BEFORE programming.\n# Worked example: treatment-policy effect on 2-year cumulative incidence of HF hospitalization,\n# SGLT2i vs DPP-4i, Medicare FFS + commercial claims, death as a competing event.\nREQUIRED_ATTRIBUTES = [\n    \"treatment\", \"population\", \"endpoint\", \"intercurrent_events\", \"summary_measure\",\n]\n\nmatrix = pd.DataFrame([\n    {\"attribute\": \"treatment\",\n     \"operational_rule\": \"Arm = NDC on first qualifying fill_date; treatment-policy (ignore later switching)\",\n     \"module\": \"build_cohort.assign_arm\",\n     \"diagnostic\": \"days_supply + gap-length distribution by arm\",\n     \"sensitivity\": \"as-treated: build_episodes(grace_days=30) + IPC weighting\"},\n    {\"attribute\": \"population\",\n     \"operational_rule\": \"Incident T2D users, 365d continuous Parts A/B/D, exclude MA-only person-time\",\n     \"module\": \"build_cohort.apply_enrollment\",\n     \"diagnostic\": \"attrition funnel; % excluded for MA-only / non-continuous\",\n     \"sensitivity\": \"washout 180d vs 365d\"},\n    {\"attribute\": \"endpoint\",\n     \"operational_rule\": \"First HF hospitalization: 1 inpatient HF claim in primary position\",\n     \"module\": \"outcomes.flag_hhf\",\n     \"diagnostic\": \"PPV vs chart-validated subset\",\n     \"sensitivity\": \"any-position HF dx\"},\n    {\"attribute\": \"intercurrent_events\",\n     \"operational_rule\": \"Death = competing event (NOT censored); switching = ignored (treatment policy)\",\n     \"module\": \"outcomes.set_competing_risk\",\n     \"diagnostic\": \"cumulative-incidence-function plot of HF and death by arm\",\n     \"sensitivity\": \"cause-specific Cox to quantify the divergence\"},\n    {\"attribute\": \"summary_measure\",\n     \"operational_rule\": \"2y cumulative-incidence difference + Fine-Gray subdistribution HR, overlap-weighted\",\n     \"module\": 
\"analysis.fit_finegray\",       # e.g. survival::finegray / PROC PHREG eventcode=\n     \"diagnostic\": \"standardized mean differences < 0.1 post-weighting\",\n     \"sensitivity\": \"E-value for the subdistribution HR\"},\n])\n\ndef validate_traceability(m: pd.DataFrame) -> list[str]:\n    \"\"\"Return audit failures: untraced attributes or empty implementation cells.\"\"\"\n    problems = []\n    missing = set(REQUIRED_ATTRIBUTES) - set(m[\"attribute\"])\n    if missing:\n        problems.append(f\"estimand attributes with no row (untraced): {sorted(missing)}\")\n    for _, row in m.iterrows():\n        for col in (\"operational_rule\", \"module\", \"diagnostic\", \"sensitivity\"):\n            if not str(row.get(col, \"\")).strip():\n                problems.append(f\"attribute '{row['attribute']}' has empty {col}\")\n    # The intercurrent-event cell is the highest-drift risk; require it to name a strategy.\n    ice = m.loc[m[\"attribute\"] == \"intercurrent_events\", \"operational_rule\"]\n    strategies = (\"treatment policy\", \"while-on-treatment\", \"hypothetical\", \"composite\", \"principal\")\n    if ice.empty or not any(s in ice.iloc[0].lower() for s in strategies):\n        problems.append(\"intercurrent_events row does not name an ICH E9(R1) ICE strategy\")\n    return problems\n\nissues = validate_traceability(matrix)\nassert not issues, \"Traceability matrix failed audit:\\n\" + \"\\n\".join(issues)",
        "description": "Programmable estimand traceability matrix. This is a framework artifact, not an estimator: it encodes — as data — the binding\nbetween each ICH E9(R1) estimand attribute and the operational rule, code module, diagnostic, and sensitivity analysis that\nimplements it, then validates that no attribute is left untraced and flags the high-risk intercurrent-event cell. Each\n`module` string names the actual analysis function/PROC in your codebase (e.g., build_episodes, fit_finegray, PROC PHREG\neventcode=) so a reviewer can jump from estimand attribute to executable code. Run this against the SAP-frozen matrix in CI\nso an attribute can never silently lose its implementation.",
        "dependencies": [
          "pandas"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(tibble)\n\nrequired_attributes <- c(\"treatment\", \"population\", \"endpoint\",\n                         \"intercurrent_events\", \"summary_measure\")\n\n# Worked example: treatment-policy effect on 2-year cumulative incidence of HF hospitalization,\n# SGLT2i vs DPP-4i, Medicare FFS + commercial claims, death as a competing event.\nmatrix <- tibble::tribble(\n  ~attribute,            ~operational_rule,                                                                   ~module,                    ~diagnostic,                                   ~sensitivity,\n  \"treatment\",           \"Arm = NDC on first qualifying fill_date; treatment-policy (ignore switching)\",      \"assign_arm\",               \"days_supply + gap-length distribution by arm\",\"as-treated: build_episodes(grace=30) + IPC weights\",\n  \"population\",          \"Incident T2D users, 365d continuous A/B/D, exclude MA-only person-time\",            \"apply_enrollment\",         \"attrition funnel; % MA-only excluded\",        \"washout 180d vs 365d\",\n  \"endpoint\",            \"First HF hospitalization: 1 inpatient HF claim in primary position\",                \"flag_hhf\",                 \"PPV vs chart-validated subset\",               \"any-position HF dx\",\n  \"intercurrent_events\", \"Death = competing event (NOT censored); switching ignored (treatment policy)\",     \"set_competing_risk\",       \"CIF plot of HF and death by arm\",             \"cause-specific Cox to quantify divergence\",\n  \"summary_measure\",     \"2y cumulative-incidence difference + Fine-Gray subdistribution HR, overlap-wtd\",    \"fit_finegray\",             \"standardized mean differences < 0.1\",         \"E-value for subdistribution HR\"\n)\n\nvalidate_traceability <- function(m) {\n  problems <- character(0)\n  missing <- setdiff(required_attributes, m$attribute)\n  if (length(missing) > 0)\n    problems <- c(problems, paste(\"untraced estimand attributes:\", paste(missing, collapse = 
\", \")))\n  for (col in c(\"operational_rule\", \"module\", \"diagnostic\", \"sensitivity\")) {\n    empty <- m$attribute[!nzchar(trimws(m[[col]]))]\n    if (length(empty) > 0)\n      problems <- c(problems, paste0(\"empty \", col, \" for: \", paste(empty, collapse = \", \")))\n  }\n  ice <- m$operational_rule[m$attribute == \"intercurrent_events\"]\n  strategies <- c(\"treatment policy\", \"while-on-treatment\", \"hypothetical\", \"composite\", \"principal\")\n  if (length(ice) == 0 || !any(vapply(strategies, function(s) grepl(s, tolower(ice)), logical(1))))\n    problems <- c(problems, \"intercurrent_events row names no ICH E9(R1) ICE strategy\")\n  problems\n}\n\nissues <- validate_traceability(matrix)\nif (length(issues) > 0) stop(\"Traceability matrix failed audit:\\n\", paste(issues, collapse = \"\\n\"))",
        "description": "Estimand traceability matrix as a tidy data frame plus an auditor. Mirrors the Python artifact: each ICH E9(R1) attribute is\nbound to its operational rule, the analysis module/PROC that implements it, the confirming diagnostic, and the pre-specified\nsensitivity. Source this in the analysis pipeline so a frozen, SAP-approved matrix is verified before any estimation runs and\nevery attribute remains traceable to executable code.",
        "dependencies": [
          "tibble"
        ],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Q[Causal question] --> E[Estimand: 5 ICH E9 R1 attributes]\n  E --> A1[Treatment conditions]\n  E --> A2[Population]\n  E --> A3[Endpoint]\n  E --> A4[Intercurrent-event strategy]\n  E --> A5[Summary measure]\n  A1 --> R[Operational rule per attribute]\n  A2 --> R\n  A3 --> R\n  A4 --> R\n  A5 --> R\n  R --> C[Named code module / PROC]\n  C --> D[Confirming diagnostic]\n  D --> S[Pre-specified sensitivity analysis]\n  S --> Est[Estimate that provably matches the estimand]\n  D -. fails .-> Drift[Estimand drift detected: revise code, not the question]",
        "caption": "Traceability binds each estimand attribute to an operational rule, a named code module, a confirming diagnostic, and a sensitivity analysis. A failing diagnostic signals estimand drift, which is fixed in the code rather than by quietly redefining the question.",
        "alt_text": "Flowchart from a causal question to a five-attribute estimand, each attribute mapped to an operational rule, code module, diagnostic, and sensitivity analysis, ending in an estimate that matches the estimand, with a failure branch for detected estimand drift.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  subgraph Trial[Target trial protocol]\n    T1[Eligibility]\n    T2[Treatment strategies]\n    T3[Time zero / assignment]\n    T4[Outcome]\n    T5[Causal contrast]\n  end\n  subgraph Code[Emulation in RWE data]\n    E1[Continuous enrollment + washout]\n    E2[Index NDC + days_supply episodes]\n    E3[Follow-up starts at treatment decision]\n    E4[Validated outcome algorithm]\n    E5[Weighted competing-risk model]\n  end\n  T1 --> E1\n  T2 --> E2\n  T3 --> E3\n  T4 --> E4\n  T5 --> E5",
        "caption": "Protocol-element-to-code crosswalk for a target-trial emulation. Each trial protocol element resolves to a specific operational rule in the real-world data, so the emulated estimate is traceable to the named estimand.",
        "alt_text": "Two-column diagram mapping target-trial protocol elements (eligibility, treatment strategies, time zero, outcome, causal contrast) to their real-world emulation code (enrollment and washout, index fill and episodes, follow-up start at the treatment decision, validated outcome algorithm, weighted competing-risk model).",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "used_with",
        "target_slug": "estimands-ate-att-intercurrent-events-rwe",
        "notes": "That concept defines the estimand attributes (ATE/ATT, intercurrent-event strategies) that this traceability discipline maps to operational rules and code."
      },
      {
        "relation_type": "used_with",
        "target_slug": "target-trial-emulation",
        "notes": "A target-trial protocol is the estimand spine; the traceability matrix is the executable audit trail proving the emulation code implements the named protocol elements."
      },
      {
        "relation_type": "part_of",
        "target_slug": "study-protocol-or-sap-elements",
        "notes": "The traceability matrix is best maintained as a frozen SAP appendix, binding each protocol/SAP decision to its implementation and check."
      },
      {
        "relation_type": "part_of",
        "target_slug": "regulatory-readiness-rwe",
        "notes": "Estimand-to-analysis traceability is a core component of regulatory readiness, not a variant of it — it produces the audit trail regulators and HTA bodies require."
      },
      {
        "relation_type": "used_with",
        "target_slug": "picots-framework-rwe",
        "notes": "PICOTS scopes the question; the traceability matrix carries that scoped question through to executable analysis and diagnostics."
      },
      {
        "relation_type": "used_with",
        "target_slug": "time-zero-index-date-alignment-rwe",
        "notes": "Time-zero alignment is the population/follow-up cell of the matrix; mis-specifying it is the most common estimand drift (immortal time)."
      },
      {
        "relation_type": "see_also",
        "target_slug": "competing-risks-cause-specific-fine-gray-rwe",
        "notes": "A concrete estimand-to-model decision the matrix must record - cause-specific vs subdistribution hazards correspond to different summary measures."
      }
    ],
    "aliases": [
      "estimand-to-estimator mapping",
      "estimand traceability matrix",
      "analytic traceability matrix",
      "estimand attribute decomposition",
      "estimand-to-analysis crosswalk"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "estimands-ate-att-intercurrent-events-rwe",
    "name": "Estimands (ATE/ATT) and Intercurrent Events in RWE",
    "short_definition": "The precise causal target of an analysis, defined per ICH E9(R1) by five attributes (population, treatment, endpoint, intercurrent-event strategy, summary measure), distinguishing the average treatment effect in the whole population (ATE) from the effect in the treated (ATT), with explicit rules for handling post-baseline intercurrent events such as discontinuation, switching, and death.",
    "long_description": "An **estimand** is the exact causal question an analysis answers, fixed *before* any estimator is chosen. The ICH E9(R1)\naddendum (2019) requires five attributes to be specified jointly: (1) **population**, (2) **treatment (intervention)\nstrategy**, (3) **endpoint**, (4) the **strategy for each intercurrent event (ICE)**, and (5) the **population-level\nsummary measure** (risk difference, risk ratio, hazard ratio, restricted mean survival time difference). In RWE this\ndiscipline matters more than in trials, because the ICEs that E9(R1) was written for — treatment discontinuation,\nswitching, dose change, rescue therapy, death, loss of plan enrollment — occur constantly in claims and EHR data and are\nroutinely caused by the same patient factors that drive the outcome. Two analysts running \"the same study\" on the same\ndatabase can produce materially different hazard ratios solely because they made different, usually implicit, ICE choices.\n\n**Core estimand distinction.** Two axes must be pinned down and are independent. (1) *Which population?* The **ATE** is\nE[Y(1) − Y(0)] over the whole eligible population; the **ATT** is E[Y(1) − Y(0) | A=1], the effect among those who\nactually initiated the index drug; the **ATC** is the symmetric quantity in the comparator group. These coincide only\nwhen the effect is homogeneous across treated and untreated — almost never true under channeling. The estimator silently\nencodes the choice: 1:1 PS matching and SMR weighting target the ATT; stabilized IPTW, g-computation standardized over\nthe full cohort, TMLE, and overlap weighting (which targets the ATO, a re-weighted overlap population) target ATE-type\nquantities. 
(2) *Which ICE strategy?* E9(R1) names five — **treatment policy** (ignore the ICE, follow everyone\nregardless, the ITT analogue), **while-on-treatment** (attribute outcomes only while exposed; censor at the ICE),\n**hypothetical** (the effect had the ICE not occurred; requires g-methods or IPCW because naive censoring is informative),\n**composite** (the ICE is folded into an unfavorable endpoint, e.g. discontinuation-or-death = failure), and **principal\nstratum** (the effect in the latent subset who would adhere). The estimand is the question; the g-method, weighting, or\nclone-censor-weight scheme is the answer that targets it. Reporting a hazard ratio without naming the population and the\nICE strategy is, under E9(R1), an incomplete result.\n\n**Pros, cons, and trade-offs** (specific and comparative, naming the alternatives).\n- **vs target-trial-emulation:** The estimand framework is the specification step that target-trial emulation\n  operationalizes — each E9(R1) ICE strategy maps to a concrete cloning/censoring/weighting rule in the emulated protocol.\n  Cost: the estimand is only the question; you still need a valid identification strategy and exchangeability. **Always**\n  write the estimand before choosing an estimator; the two are complementary, not alternatives.\n- **vs cox-ph-regression as conventionally reported:** Estimand thinking forces clarity on whether a reported HR is\n  marginal (ATE/ATT) or conditional, and which ICE strategy its censoring rules imply. A vanilla `coxph` HR with patients\n  censored at discontinuation is a *while-on-treatment, conditional* estimand whether or not the analyst intended it. 
Cost:\n  targeting a marginal ATE/ATT requires standardization or weighting on top of Cox, not just the model coefficient; under\n  non-proportional hazards the HR is not a clean summary at all and RMST or risk differences are preferred.\n- **vs g-estimation-structural-nested-models:** G-estimation is the natural engine for *hypothetical* and dynamic-regime\n  estimands under time-varying ICEs and time-varying confounding. Cost: it is harder to specify, communicate to reviewers,\n  and debug than a treatment-policy ATE from IPTW or matching. **Prefer** plain treatment-policy estimands unless the\n  decision genuinely hinges on \"what if patients had stayed on therapy.\"\n- **vs an unspecified/default analysis:** Naming the estimand exposes mismatches that otherwise hide — most commonly\n  reporting an ATT (from matching) when the policy question is population-level (a formulary or coverage decision for all\n  eligible patients), which is an ATE question. The structured estimand is more work upfront and pays for itself at review.\n\n**When to use** (clear decision rules). Always, as the first step of any comparative RWE protocol or SAP, and especially when: ICEs are frequent\nand prognostic (oncology switching at progression, adherence-driven discontinuation in chronic disease); the same dataset\nwill support multiple analyses that must be distinguished; the study informs a regulatory or HTA decision where E9(R1)\nlanguage is now expected; or treatment effects are heterogeneous so ATE and ATT diverge and the audience needs to know\nwhich they are getting.\n\n**When NOT to use — and when it is actively misleading or dangerous** (clear decision rules). The estimand framework is never inappropriate to\napply, but specific *choices within it* are dangerous. 
(1) **A hypothetical strategy when the ICE process is driven by\nunmeasured confounders** is the classic trap: censoring at discontinuation (or weighting for it) assumes no unmeasured\ncommon cause of discontinuation and outcome. When sick patients stop therapy *because* they are deteriorating, the\nestimate of the while-on-treatment / hypothetical estimand is biased toward benefit; the analysis manufactures efficacy. Assess this by the\nplausibility of the no-unmeasured-confounding-for-censoring assumption, not by model fit. (2) **Reporting an ATT for a\npopulation-level policy question** systematically answers the wrong question; if one drug is reserved for a sicker\nsubgroup, the ATT describes that subgroup, not the population the decision-maker governs. (3) **A while-on-treatment\nestimand for a long-latency outcome** (e.g. cancer incidence) discards exactly the person-time when the outcome occurs and\nis rarely defensible. (4) **Treating death as ordinary censoring** in any non-mortality estimand treats patients who died as still at risk\nand overstates the cumulative incidence of the non-fatal outcome; death must be handled as an ICE (composite, or competing-risk framing). (5) Naming\nfive attributes but then letting the estimator default (e.g. matching) silently override the stated population is worse\nthan no framework, because it provides false assurance.\n\n**Data-source operational depth** (claims vs EHR vs registry vs linked).\n- **Claims (FFS vs Medicare Advantage):** ICEs are inferred from utilization, never observed directly. Discontinuation is\n  a gap beyond `days_supply` + grace; switching is the first fill of the comparator class; death may come from an inpatient\n  discharge disposition or a linked vital-records/Master Beneficiary Summary file. 
Failure mode: **MA-only person-time\n  lacks FFS claims**, so a \"discontinuation\" can be unobserved capitated care, fabricating an ICE — restrict to enrollees\n  with complete A/B/D (or commercial medical+pharmacy) and exclude MA-only spans before defining any ICE. **Differential\n  competing risks by exposure in the elderly** mean the drug arm with higher background mortality looks artificially better\n  on a non-fatal endpoint if death is censored rather than treated as an ICE.\n- **EHR:** Orders and medication lists overstate true exposure (a stopped drug lingers on the list), so the *while-on-\n  treatment* and *hypothetical* estimands are especially fragile; link to dispensing to confirm the ICE. Loss to follow-up\n  is visit-driven and informative — a patient who deteriorates may exit the system, masquerading as administrative\n  censoring and biasing a hypothetical estimand. NLP-extracted discontinuation reasons (toxicity, progression, cost) are\n  needed to choose between composite and hypothetical strategies but are themselves missing-not-at-random.\n- **Registry:** Often the cleanest source for endpoint timing, adjudicated death, and *reasons* for treatment change —\n  ideal for validating claims-based ICE definitions — but typically incomplete for full pharmacy exposure between visits.\n- **Linked claims–EHR–vital records:** The ideal substrate (EHR severity + claims completeness + reliable mortality for\n  the death ICE), but linkage selection and order/fill/service-date discrepancies must be reconciled before ICE dates are\n  fixed, or **immortal time** is introduced (e.g. counting a procedure study's pre-procedure survival toward the treated\n  arm).\n\n**Worked claims example (oncology, where ICE strategy decides the answer).** Question: overall survival with first-line\nBRAF+MEK inhibitor vs anti-PD-1 immunotherapy in adults with metastatic melanoma, in a linked commercial + Medicare FFS\ndatabase. 
(1) *Eligibility/population:* adults with a metastatic-melanoma diagnosis and 365 days of continuous A/B/D (or\ncommercial medical+pharmacy) enrollment before the first qualifying systemic-therapy fill/administration; exclude MA-only\nperson-time so absence of subsequent claims is real. (2) *Treatment strategy:* initiate the index regimen at `index_date`,\narm assigned from the NDC/HCPCS on that date. (3) *Endpoint:* all-cause death (linked vital records) within 36 months. (4)\n*The decisive ICE — switching at progression:* a large fraction of BRAF+MEK initiators switch to immunotherapy at\nprogression. Under **treatment policy** (follow everyone to death/disenrollment/end-of-data regardless of switch) the\ncontrast describes the *initiation strategy* and is the policy-relevant estimand for a first-line coverage decision; under\na **hypothetical** strategy (effect had no switch occurred) you must IPCW-censor at the switch fill, modeling the\nswitch hazard from time-varying covariates — and you must defend that no unmeasured factor drives both switching and death.\n(5) *The death ICE:* death is the endpoint here, so no special handling; for a non-fatal endpoint (e.g. time-to-next-\ntreatment) death would be a competing-risk ICE, not censoring. (6) *Estimator → population:* IPTW on the high-dimensional\nbaseline PS standardized over the whole cohort targets the **ATE** (the formulary question); 1:1 PS matching would instead\ntarget the **ATT** among BRAF+MEK initiators. 
(7) *Reporting:* present both treatment-policy and hypothetical estimands as\npre-specified primary/secondary, not as a single \"sensitivity analysis,\" because they answer different questions, each\nwith its summary measure (RMST difference at 36 months alongside the HR, since proportional hazards is implausible with a\ndelayed immunotherapy effect).\n\n**Interpreting the output**\n\nIn the Drug A versus Drug B kidney-event study (200 per arm, 12-month follow-up), the\nanalysis reports four estimand-specific results: treatment-policy ATE risk difference =\n−3.0 percentage points (9.0% vs 12.0%); hypothetical ATE ≈ −3.6 pp; while-on-treatment\nATT = −3.6 pp; composite (switch-or-event) RD = +6.0 pp (26.0% vs 20.0%).\n\n*(1) Formal interpretation.* Each number answers a different causal question. The\ntreatment-policy ATE (−3.0 pp) estimates the effect of assignment to Drug A in the full\neligible population, including periods after discontinuation or switching (the ITT\nanalogue). The while-on-treatment ATT (−3.6 pp) estimates the effect among patients who\nactually initiated Drug A, restricted to time they remained on therapy. The composite\nestimate (+6.0 pp) is opposite in sign because switches in the Drug A arm (40 vs 20 on\nDrug B), counted as events, outweigh the pharmacological benefit during the observation\nwindow. None of these four numbers is the \"right\" answer; each answers a different policy or\nregulatory question, and they diverge precisely because Drug A and Drug B initiators differ in\nadherence behavior and discontinuation rates.\n\n*(2) Practical interpretation.* A payer asking \"what happens to my members assigned Drug A?\"\nneeds the treatment-policy ATE. A clinician asking \"does Drug A work while the patient\ntakes it?\" needs the while-on-treatment ATT. An HTA body assessing adherence-adjusted\nlong-run impact may need the hypothetical estimand, estimated with g-computation or IPCW. Reporting\nonly one estimate without naming the estimand and its intercurrent-event strategy creates an\ninterpretability gap that cannot be resolved post hoc.",
    "primary_category": "Causal_Inference_Method",
    "tags": [
      "estimand",
      "ICH-E9-R1",
      "ATE",
      "ATT",
      "ATC",
      "intercurrent-events",
      "treatment-policy",
      "while-on-treatment",
      "hypothetical-strategy",
      "g-computation",
      "IPTW",
      "IPCW",
      "causal-inference",
      "oncology-rwe",
      "claims"
    ],
    "applies_to_study_types": [
      "new_user",
      "active_comparator_new_user",
      "cohort_retrospective",
      "claims_analysis",
      "ehr_study",
      "target_trial_emulation"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1136/bmj-2023-076316",
        "url": "https://doi.org/10.1136/bmj-2023-076316",
        "citation_text": "Kahan BC, Hindley J, Edwards M, Cro S, Morris TP. The estimands framework: a primer on the ICH E9(R1) addendum. BMJ. 2024;384:e076316.",
        "year": 2024,
        "authors_short": "Kahan et al.",
        "notes": "Accessible authoritative primer on the five-attribute estimand framework and the five intercurrent-event strategies of the ICH E9(R1) addendum; the clearest single entry point for specifying an estimand before choosing an estimator."
      },
      {
        "role": "explain",
        "doi": "10.1002/pst.2196",
        "url": "https://doi.org/10.1002/pst.2196",
        "citation_text": "Li X, Wang Y, Chen J, Lu N, Song C, Tiwari R, Xu Y, Yue LQ. Estimands in observational studies: some considerations beyond ICH E9(R1). Pharmaceutical Statistics. 2022;21(5):1085-1097.",
        "year": 2022,
        "authors_short": "Li et al.",
        "notes": "Extends the E9(R1) estimand framework to observational/RWE settings, addressing how intercurrent events and confounding interact when there is no randomization — directly motivates the ATE-vs-ATT and ICE-strategy choices made here."
      },
      {
        "role": "explain",
        "doi": "10.1093/aje/kwv254",
        "url": "https://doi.org/10.1093/aje/kwv254",
        "citation_text": "Hernán MA, Robins JM. Using big data to emulate a target trial when a randomized trial is not available. American Journal of Epidemiology. 2016;183(8):758-764.",
        "year": 2016,
        "authors_short": "Hernán & Robins",
        "notes": "Shows how protocol-level (target-trial) thinking forces explicit estimand specification, including intercurrent events, and maps each strategy to a g-method for identification in observational data."
      },
      {
        "role": "demonstrate",
        "doi": "10.1080/19466315.2023.2259829",
        "url": "https://doi.org/10.1080/19466315.2023.2259829",
        "citation_text": "Chen J, Scharfstein D, Wang H, et al. Estimands in real-world evidence studies. Statistics in Biopharmaceutical Research. 2023;16(4):427-446.",
        "year": 2023,
        "authors_short": "Chen et al.",
        "notes": "Worked operationalization of E9(R1) estimands in RWE, contrasting treatment-policy and hypothetical strategies and the population (ATE/ATT) targeted by different estimators in observational data."
      },
      {
        "role": "use",
        "doi": "10.1093/aje/kwq439",
        "url": "https://doi.org/10.1093/aje/kwq439",
        "citation_text": "Funk MJ, Westreich D, Wiesen C, Stürmer T, Brookhart MA, Davidian M. Doubly robust estimation of causal effects. American Journal of Epidemiology. 2011;173(7):761-767.",
        "year": 2011,
        "authors_short": "Funk et al.",
        "notes": "Practical estimator (augmented IPW / doubly robust) for targeting marginal ATE/ATT under a treatment-policy estimand with high-dimensional confounding; underpins the g-computation and IPTW code below."
      }
    ],
    "plain_language_summary": "An estimand is the precise question a study is designed to answer — it forces you to say exactly which patients you are studying, what treatment comparison you are making, and how you will handle events that happen after treatment starts (like a patient switching drugs or dying). Researchers distinguish two main targets: the average treatment effect (ATE) asks what would happen if every eligible patient in the population received the drug versus the comparator, while the average treatment effect in the treated (ATT) asks the narrower question of what happens among the patients who actually chose that drug. Pinning down the estimand before any analysis prevents two analysts from unknowingly answering different questions with the same dataset.",
    "key_terms": [
      {
        "term": "estimand",
        "definition": "The exact causal question a study is designed to answer, stated before any analysis begins, including which patients count, what comparison is made, and how post-treatment disruptions are handled."
      },
      {
        "term": "ATE (average treatment effect)",
        "definition": "The effect of the treatment if it were given to every eligible patient in the full population — the right target for a coverage or formulary decision that applies to everyone."
      },
      {
        "term": "ATT (average treatment effect in the treated)",
        "definition": "The effect of the treatment among the patients who actually received it — answers how well the drug worked for the people who chose it, not for everyone who could have."
      },
      {
        "term": "intercurrent event",
        "definition": "Something that happens after treatment starts and can disrupt the original treatment plan — common examples are stopping the drug early, switching to a different drug, or dying before the outcome is measured."
      },
      {
        "term": "treatment policy strategy",
        "definition": "An intercurrent-event rule that says: ignore the disruption and follow every patient to the end of the study regardless, mirroring an intention-to-treat design."
      },
      {
        "term": "hypothetical strategy",
        "definition": "An intercurrent-event rule that asks what the outcome would have been if the disruption had never occurred — requires special statistical methods because you cannot simply drop patients who switched."
      }
    ],
    "worked_example": {
      "scenario": "A health-plan researcher is comparing a new oral diabetes drug (Drug A) versus metformin (Drug B) on the one-year risk of a serious kidney event. The cohort has 400 new users: 200 started Drug A and 200 started Drug B. After starting, some patients in each arm switch to the other drug or stop entirely — these are the intercurrent events. The researcher must decide: (1) which population defines the effect, and (2) what rule to apply to the switchers.",
      "dataset": {
        "caption": "Summary of the 400-patient cohort after one year of follow-up. Each row is one arm; columns show who switched and who had the kidney event.",
        "columns": [
          "arm",
          "n_started",
          "n_switched_away",
          "kidney_event_if_stayed",
          "kidney_event_if_switched"
        ],
        "rows": [
          [
            "Drug A",
            200,
            40,
            "12 of 160 who stayed (7.5%)",
            "6 of 40 who switched (15.0%)"
          ],
          [
            "Drug B",
            200,
            20,
            "20 of 180 who stayed (11.1%)",
            "4 of 20 who switched (20.0%)"
          ]
        ]
      },
      "steps": [
        "Step 1 — Choose the population (ATE vs ATT). For ATE, the question is: what would happen to all 400 patients if everyone took Drug A vs if everyone took Drug B? For ATT, the question is narrower: what would have happened to the 200 Drug A initiators specifically if they had taken Drug B instead?",
        "Step 2 — Choose the intercurrent-event strategy. ICH E9(R1) defines five strategies; the four used here are: (a) Treatment policy: count every patient's outcome regardless of switching (like ITT); (b) Hypothetical: estimate the outcome as if no one had switched (requires special weighting methods); (c) While-on-treatment: count only the outcome that occurred before switching, ignore the rest; (d) Composite: treat switching as itself a bad outcome, so switching and a kidney event both count as failures. The fifth, principal stratum, is not illustrated in this example.",
        "Step 3 — Apply treatment policy (the simplest and most common strategy). Drug A arm: 12 + 6 = 18 kidney events out of 200 patients = 9.0% risk. Drug B arm: 20 + 4 = 24 kidney events out of 200 patients = 12.0% risk. Treatment-policy ATE risk difference = 9.0% minus 12.0% = -3.0 percentage points (Drug A lower).",
        "Step 4 — Contrast with the ATT. The ATT focuses on the 200 Drug A initiators only. Under a hypothetical estimand (had no one switched), we attribute the 12 events among the 160 stayers but must estimate what the 40 switchers would have experienced on Drug A. Using the stayer rate as a crude proxy: 40 x 0.075 = 3 projected events. Drug A risk = (12 + 3) / 200 = 7.5%. Counterfactual Drug B risk for these same 200 patients (using the Drug B stayer rate) = 200 x 0.111 = 22.2 projected events = 11.1%. ATT risk difference = 7.5% minus 11.1% = -3.6 percentage points. This is larger in magnitude than the treatment-policy ATE because the higher event rates seen in switchers are replaced by stayer rates. Note that this step changes both the population (ATE to ATT) and the ICE strategy (treatment policy to hypothetical), and the stayer-rate proxy is valid only if switching is unrelated to kidney risk.",
        "Step 5 — Recognize what each estimand answers. ATE (-3.0 pp) answers the formulary question: should the health plan cover Drug A for all eligible diabetic members? ATT (-3.6 pp) answers the effectiveness question: for patients who chose Drug A, did it help them? Different questions, different numbers, same dataset."
      ],
      "result": "Chosen estimand: treatment-policy ATE, because the study informs a population-level coverage decision. Risk difference = -3.0 percentage points (9.0% vs 12.0%), meaning Drug A is associated with 3 fewer kidney events per 100 patients over one year when every initiator is followed regardless of switching. The ATT would have given -3.6 pp, which is a subtly different answer to a subtly different question — and reporting the wrong one for the policy decision would overstate or understate the benefit for the intended audience.",
      "ice_strategy_contrast": {
        "caption": "The four intercurrent-event strategies applied to the same 400-patient cohort. Each strategy produces a numerically and conceptually different answer.",
        "columns": [
          "strategy",
          "what_you_do_with_switchers",
          "drug_a_risk",
          "drug_b_risk",
          "risk_difference",
          "what_question_it_answers"
        ],
        "rows": [
          [
            "Treatment policy",
            "Count their outcomes — stay in the analysis",
            "9.0% (18/200)",
            "12.0% (24/200)",
            "-3.0 pp",
            "Effect of starting Drug A vs Drug B (initiation decision)"
          ],
          [
            "Hypothetical (no switching)",
            "Statistically re-weight as if they never switched",
            "~7.5% (estimated)",
            "~11.1% (estimated)",
            "~-3.6 pp",
            "Effect if everyone had stayed on their starting drug"
          ],
          [
            "While-on-treatment",
            "Censor them (remove from risk set at the moment they switch)",
            "7.5% (12/160)",
            "11.1% (20/180)",
            "-3.6 pp (unadjusted)",
            "Effect during the period of actual exposure only"
          ],
          [
            "Composite",
            "Count switching itself as an event alongside the kidney outcome (each patient counted once)",
            "26.0% (12+40)/200",
            "20.0% (20+20)/200",
            "+6.0 pp",
            "Net clinical benefit including tolerability (staying on drug)"
          ]
        ]
      }
    },
    "prerequisites": [
      "dags-backdoor-criterion-drug-studies",
      "new-user-design",
      "propensity-score-methods-psm-iptw"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Treatment policy strategy (ITT analogue)",
        "description": "Follow all initiators to the endpoint regardless of any intercurrent event (discontinuation, switch, dose change); death before a non-fatal endpoint still needs its own rule because it cannot be followed through. Targets the effect of the initiation decision; the most defensible default for coverage and formulary questions.",
        "edge_cases": [
          "High switching in oncology makes the contrast reflect a treatment sequence rather than the index drug alone.",
          "Dilutes the on-drug effect when discontinuation is common; not a flaw but a property of the question."
        ],
        "data_source_notes": "claims: follow to death/disenrollment/end-of-data; requires only continuous, FFS-observable enrollment, no episode reconstruction. Define the strategy window explicitly (e.g. initiate-and-follow regardless of later fills)."
      },
      {
        "name": "While-on-treatment strategy",
        "description": "Attribute outcomes only while the patient remains on the index treatment; censor at discontinuation or switch. Yields a conditional, on-drug effect.",
        "edge_cases": [
          "Informative censoring when patients discontinue because the event is imminent biases toward benefit.",
          "Inappropriate for long-latency outcomes that occur after typical exposure ends."
        ],
        "data_source_notes": "claims: needs precise episode end = last fill + days_supply + grace; EHR orders/medication lists overstate exposure and must be confirmed against dispensing."
      },
      {
        "name": "Hypothetical strategy (had the ICE not occurred)",
        "description": "Effect that would be seen if the intercurrent event (e.g. switching, discontinuation) had not happened. Requires g-computation, IPCW, or g-estimation; naive censoring is invalid.",
        "edge_cases": [
          "Identification fails outright if there is any unmeasured common cause of the ICE and the outcome.",
          "In RWE usually implemented via clone-censor-weight or IPCW with rich time-varying covariates."
        ],
        "data_source_notes": "Demands time-varying covariates to model the ICE hazard credibly; claims often lack the granularity (labs, performance status) that makes the no-unmeasured-confounding-for-censoring assumption plausible."
      },
      {
        "name": "Composite strategy",
        "description": "Fold the intercurrent event into an unfavorable endpoint (e.g. discontinuation-or-death = failure). Suits net-clinical-benefit questions and sidesteps the censoring-bias problem.",
        "edge_cases": [
          "Can be overly conservative for a pure efficacy question by penalizing benign discontinuations.",
          "Requires clinical justification that the ICE is genuinely unfavorable."
        ],
        "data_source_notes": "claims: straightforward — any gap beyond grace, switch, or death counts as the composite event; robust because it avoids modeling the ICE process."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "target-trial-emulation",
        "pros_of_this": "The estimand framework is the specification layer that target-trial emulation operationalizes; each ICE strategy maps to a concrete cloning/censoring/weighting rule in the emulated protocol.",
        "cons_of_this": "The estimand is only the question — it still requires a valid identification strategy and exchangeability to be answerable.",
        "when_to_prefer": "Always specify the estimand first; it precedes and complements (does not replace) the emulation."
      },
      {
        "compared_to": "cox-ph-regression",
        "pros_of_this": "Estimand thinking makes explicit whether a reported HR is marginal (ATE/ATT) or conditional and which ICE strategy its censoring rules encode.",
        "cons_of_this": "Targeting a marginal ATE/ATT needs standardization or weighting on top of Cox; under non-proportional hazards the HR is not a clean summary and RMST/risk differences are preferred.",
        "when_to_prefer": "Whenever the audience needs a population-level effect or a transparent statement of the ICE handling behind a hazard ratio."
      },
      {
        "compared_to": "g-estimation-structural-nested-models",
        "pros_of_this": "G-estimation is the natural engine for hypothetical and dynamic-regime estimands under time-varying ICEs and time-varying confounding.",
        "cons_of_this": "Harder to specify, communicate, and debug than a treatment-policy ATE from IPTW or matching.",
        "when_to_prefer": "Only when the decision genuinely hinges on a hypothetical or dynamic-regime question; otherwise prefer a plain treatment-policy estimand."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "ICEs are inferred from utilization. Discontinuation = gap beyond days_supply + grace; switch = first fill of the comparator class; death from inpatient discharge disposition or a linked vital-records/MBSF file. Restrict to continuous FFS-observable enrollment and exclude MA-only person-time before defining any ICE, or unobserved capitated care fabricates intercurrent events. Build long-format person-time for time-varying ICE/IPCW models.",
      "ehr": "Orders and medication lists overstate exposure, so while-on-treatment and hypothetical estimands are fragile; confirm against dispensing. Loss to follow-up is visit-driven and informative — treat administrative censoring as potentially related to deterioration. NLP discontinuation reasons inform composite-vs-hypothetical choice but are missing-not-at-random.",
      "registry": "Cleanest source for endpoint timing, adjudicated death, and reasons for treatment change; use to validate claims-based ICE definitions. Usually incomplete for between-visit pharmacy exposure — link to claims for the fill history.",
      "linked": "Ideal substrate (EHR severity + claims completeness + reliable mortality for the death ICE) but reconcile order/fill/service-date discrepancies before fixing ICE dates to avoid introducing immortal time."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\nimport pandas as pd\nimport statsmodels.api as sm\nimport statsmodels.formula.api as smf\n\nX = [\"X1\", \"X2\", \"X3\"]  # replace with the full baseline confounder set\n\n# ---- Block 1: treatment-policy ATE and ATT via g-computation (standardization) ----\n# Fit an outcome model that includes treatment A and confounders, then standardize\n# predictions setting A=1 and A=0 over the chosen population (whole cohort -> ATE; A==1 -> ATT).\ndef gcomp_ate_att(df, outcome=\"Y\", treat=\"A\", covars=X):\n    f = f\"{outcome} ~ {treat} + \" + \" + \".join(covars)\n    m = smf.glm(f, data=df, family=sm.families.Binomial()).fit()\n    d1, d0 = df.copy(), df.copy()\n    d1[treat], d0[treat] = 1, 0\n    p1, p0 = m.predict(d1), m.predict(d0)\n    ate = (p1 - p0).mean()                      # population: whole cohort\n    treated = df[treat] == 1\n    att = (p1[treated] - p0[treated]).mean()    # population: the treated\n    return ate, att\n\ndef boot_ci(df, fn, B=1000, seed=1):\n    rng = np.random.default_rng(seed)\n    idx = np.arange(len(df))\n    est = np.array([fn(df.iloc[rng.choice(idx, len(df), replace=True)]) for _ in range(B)])\n    return est.mean(axis=0), np.percentile(est, [2.5, 97.5], axis=0)\n\nate, att = gcomp_ate_att(df)\n(m_ate_att, ci) = boot_ci(df, lambda d: np.array(gcomp_ate_att(d)))\nprint(f\"Treatment-policy ATE risk diff = {ate:+.3f}  95% CI [{ci[0,0]:+.3f}, {ci[1,0]:+.3f}]\")\nprint(f\"Treatment-policy ATT risk diff = {att:+.3f}  95% CI [{ci[0,1]:+.3f}, {ci[1,1]:+.3f}]\")\n\n# ---- Block 2: hypothetical (had-no-switch) estimand via stabilized IPCW ----\n# Expand to person-interval (long) format; censor at the ICE (discontinue_t) and re-weight\n# so the censored intervals stand in for the population that would have continued.\n# Requires time-varying covariates Lt; here baseline X are carried forward as a minimal example.\ndef to_long(df, interval=30):\n    rows = []\n    for _, r in df.iterrows():\n        stop = 
int(np.ceil(r.time_to_event / interval))\n        ice = r.discontinue_t if not pd.isna(r.discontinue_t) else np.inf\n        for k in range(stop):\n            t1 = min((k + 1) * interval, r.time_to_event)   # end of interval k\n            censored = (ice <= t1)                      # ICE during this interval -> artificially censored\n            event = int(r.event and (t1 >= r.time_to_event) and not censored)\n            rows.append({**r[[\"person_id\", \"A\"] + X].to_dict(),\n                         \"k\": k, \"event\": event, \"cens\": int(censored)})\n            if censored:\n                break\n    return pd.DataFrame(rows)\n\nlng = to_long(df)\n# Stabilized inverse-probability-of-censoring weights from the censoring (ICE) hazard.\ncm_num = smf.glm(\"cens ~ A + k\", data=lng, family=sm.families.Binomial()).fit()\ncm_den = smf.glm(\"cens ~ A + k + \" + \" + \".join(X), data=lng,\n                 family=sm.families.Binomial()).fit()\nlng[\"sw\"] = (1 - cm_num.predict(lng)) / (1 - cm_den.predict(lng))\nlng[\"sw\"] = lng.groupby(\"person_id\")[\"sw\"].cumprod().clip(upper=20)  # cumulative, truncate extremes\n# Weighted pooled-logistic hazard model, fit only on intervals that were not artificially censored\n# (the outcome is unobserved in the interval where the ICE occurs).\nunc = lng[lng[\"cens\"] == 0]\nhz = smf.glm(\"event ~ A + k + \" + \" + \".join(X), data=unc,\n             family=sm.families.Binomial(), freq_weights=unc[\"sw\"]).fit()\n# Standardize over every person and every interval up to a common horizon, so the risk does not\n# depend on each person's observed (possibly censored) follow-up length.\nbase = lng.drop_duplicates(\"person_id\")[[\"person_id\"] + X]\ngrid = base.merge(pd.DataFrame({\"k\": np.arange(lng[\"k\"].max() + 1)}), how=\"cross\")\ndef std_risk(a):\n    g = grid.assign(A=a)\n    surv = (1 - hz.predict(g)).groupby(g[\"person_id\"]).prod()   # per-person survival to the horizon\n    return 1 - surv.mean()\nrisk1, risk0 = std_risk(1), std_risk(0)\nprint(f\"Hypothetical (no-ICE) ATE risk diff = {risk1 - risk0:+.3f}\")",
        "description": "ATE vs ATT under two intercurrent-event strategies on one cohort. Required input (one row per new initiator,\npost-data-management):\n  person_id      : patient id\n  arm            : 'STUDY' / 'COMPARATOR' (label only)\n  A              : 1 if index drug, 0 if comparator\n  Y              : 1/0 binary endpoint for the treatment-policy g-computation\n  X1..Xk         : baseline confounders measured in [index_date-365, index_date]\n  time_to_event  : follow-up time (days) for the survival/hypothetical estimand\n  event          : 1 if endpoint, 0 if censored\n  discontinue_t  : time of the index ICE (switch/discontinuation), NaN if none\nBlock 1: treatment-policy ATE and ATT by outcome-regression g-computation (standardization).\nBlock 2: hypothetical (had-no-switch) estimand via stabilized IPCW + weighted pooled-logistic hazard.",
        "dependencies": [
          "pandas",
          "numpy",
          "statsmodels"
        ],
        "source_citations": [
          "funk-2011"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(marginaleffects)\nlibrary(ipw)\nlibrary(survival)\n\nXrhs <- \"X1 + X2 + X3\"  # replace with the full baseline confounder set\n\n## ---- Block 1: treatment-policy ATE and ATT via g-computation ----\nfit <- glm(as.formula(paste(\"Y ~ A +\", Xrhs)), family = binomial, data = df)\n\nate <- avg_comparisons(fit, variables = \"A\")                       # population: whole cohort\natt <- avg_comparisons(fit, variables = \"A\",\n                       newdata = subset(df, A == 1))               # population: the treated\nprint(ate); print(att)\n\n## ---- Block 2: hypothetical (had-no-switch) estimand via stabilized IPCW ----\n## Expand to person-interval long format, with a censoring indicator at the ICE time.\nlong <- survSplit(Surv(time_to_event, event) ~ ., data = df,\n                  cut = seq(30, max(df$time_to_event), by = 30), episode = \"k\")\nlong$cens <- as.integer(with(long, !is.na(discontinue_t) & discontinue_t <= time_to_event))\n## Keep intervals up to and including the one containing the ICE; no event is attributed to that interval.\nlong <- long[is.na(long$discontinue_t) | long$tstart < long$discontinue_t, ]\nlong$event[long$cens == 1] <- 0\n\n## Stabilized inverse-probability-of-censoring weights (numerator ~ A only; denominator ~ A + confounders).\n## ipw evaluates these arguments non-standardly, so the formulas are written out literally (keep in sync with Xrhs).\nw <- ipwtm(exposure = cens, family = \"binomial\", link = \"logit\",\n           numerator = ~ A,\n           denominator = ~ A + X1 + X2 + X3,\n           id = person_id, tstart = tstart, timevar = k,\n           type = \"first\", data = long)\nlong$ipcw <- pmin(w$ipw.weights, 20)                              # truncate extreme weights\n\n## Weighted Cox on uncensored intervals targets the hypothetical-strategy hazard ratio (robust SE for the weights).\ncox_hyp <- coxph(Surv(tstart, time_to_event, event) ~ A + X1 + X2 + X3,\n                 data = subset(long, cens == 0), weights = ipcw, robust = TRUE, id = person_id)\nsummary(cox_hyp)",
        "description": "ATE vs ATT under two intercurrent-event strategies on one cohort. Inputs mirror the Python version:\n  df : data.frame with person_id, A (1/0), Y (0/1), X1..Xk, time_to_event, event (0/1), discontinue_t (NA if none).\nBlock 1: marginaleffects::avg_comparisons() standardizes a fitted outcome model to the whole cohort (ATE) or to the\ntreated (ATT). Block 2: ipw::ipwtm builds stabilized IPCW for the hypothetical (had-no-switch) estimand, fed to a\nweighted Cox model.",
        "dependencies": [
          "marginaleffects",
          "ipw",
          "survival"
        ],
        "source_citations": [
          "funk-2011"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* ---- Block 1: treatment-policy ATE and ATT (PROC CAUSALTRT, AIPW doubly robust) ---- */\nproc causaltrt data=work.cohort method=aipw att;   /* att => also report effect in the treated */\n  class A;\n  psmodel A(ref='0') = X1 X2 X3;                    /* propensity model: baseline confounders only */\n  model Y = X1 X2 X3;                               /* outcome model: doubly robust augmentation */\nrun;                                                /* output: POM, ATE risk difference, and ATT */\n\n/* ---- Block 2: hypothetical (had-no-switch) estimand: stabilized IPTW x IPCW, weighted PHREG ---- */\n/* (a) Stabilized treatment weights from the baseline propensity score. */\nproc logistic data=work.cohort descending noprint;\n  model A = X1 X2 X3;\n  output out=ps p=ps;\nquit;\nproc sql; select mean(A) into :pA from work.cohort; quit;   /* marginal P(A=1) for stabilization */\ndata sw_trt;\n  set ps;\n  if A=1 then sw_trt = &pA / ps;\n  else        sw_trt = (1-&pA) / (1-ps);\nrun;\n\n/* (b) Stabilized inverse-probability-of-censoring weights from the ICE (switch) hazard, by interval. */\nproc logistic data=work.long descending noprint;       /* denominator: ICE hazard given confounders */\n  model cens = A X1 X2 X3 k;\n  output out=cden p=pc_den;\nquit;\nproc logistic data=work.long descending noprint;       /* numerator: ICE hazard given A and time only */\n  model cens = A k;\n  output out=cnum p=pc_num;\nquit;\ndata ipcw;\n  merge cden cnum(keep=person_id k pc_num);\n  by person_id k;\n  retain ipcw_cum;\n  if first.person_id then ipcw_cum = 1;\n  ipcw_cum = ipcw_cum * ((1-pc_num)/(1-pc_den));        /* cumulative uncensored-probability ratio */\n  ipcw = ipcw_cum;\nrun;\n\n/* (c) Combine weights, truncate, and fit the weighted outcome model = hypothetical-strategy effect. 
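*/\n/* sw_trt has one row per person (no k), so attach it by person_id first, then add the interval-level IPCW. */\n/* All inputs are assumed sorted by person_id (and by k where present). */\ndata analytic;\n  merge work.long(in=a) sw_trt(keep=person_id sw_trt);\n  by person_id;\n  if a;\nrun;\ndata analytic;\n  merge analytic(in=a) ipcw(keep=person_id k ipcw);\n  by person_id k;\n  if a;\n  w = min(sw_trt * ipcw, 20);                           /* truncate extreme combined weights */\nrun;\nproc phreg data=analytic covsandwich(aggregate);        /* robust SE for the estimated weights */\n  class A (ref='0');\n  model (tstart, tstop)*event(0) = A X1 X2 X3 / ties=efron;\n  weight w;\n  id person_id;\n  hazardratio A / diff=ref;                             /* hypothetical (no-ICE) hazard ratio */\nrun;",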
        "description": "ATE/ATT under treatment policy via PROC CAUSALTRT, then the hypothetical (had-no-switch) estimand via stabilized\nIPTW x IPCW in a weighted PROC PHREG. Required input datasets (post data-management):\n  work.cohort : person_id, A (1/0), arm, Y (0/1), X1-Xk, time_to_event, event (1/0), discontinue_t (. if none)\n  work.long   : person-interval expansion of work.cohort with tstart, tstop, k, cens (1 if ICE this interval),\n                event, A, X1-Xk  (build with a DATA step before this block)\nPROC CAUSALTRT requires SAS/STAT 15.1+. POM gives the potential-outcome means (ATE); ATT requests the effect in\nthe treated.",
        "dependencies": [],
        "source_citations": [
          "funk-2011"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Pop[Target population<br/>eligible new initiators] --> Treat[Treatment strategy<br/>initiate drug A vs comparator B]\n  Treat --> ICE[Intercurrent event after baseline<br/>discontinuation / switch / dose change / death]\n  ICE --> Strat{ICE strategy}\n  Strat -->|Follow regardless| TP[Treatment policy<br/>ITT analogue]\n  Strat -->|On-drug only| WOT[While on treatment<br/>censor at ICE]\n  Strat -->|Had ICE not occurred| HYP[Hypothetical<br/>IPCW / g-methods]\n  Strat -->|ICE = bad outcome| COMP[Composite endpoint]\n  TP --> Summ[Summary measure<br/>RD / RR / HR / RMST diff]\n  WOT --> Summ\n  HYP --> Summ\n  COMP --> Summ\n  Summ --> Est[Well-defined estimand<br/>+ choose estimator that targets the population]\nstyle Strat fill:#e6f3ff",
        "caption": "ICH E9(R1) estimand construction. The five attributes (population, treatment, endpoint, intercurrent-event strategy, summary measure) must be fixed jointly; different paths through the ICE box produce different numerical targets from the same data.",
        "alt_text": "Flowchart from target population and treatment strategy through the intercurrent-event box (treatment policy, while on treatment, hypothetical, composite) to a summary measure and a well-defined estimand.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Q[Policy/clinical question] --> P{Whole-population<br/>decision?}\n  P -->|Yes coverage/formulary| ATE[Target ATE<br/>IPTW / g-computation / TMLE / overlap]\n  P -->|No effect in initiators| ATT[Target ATT<br/>1:1 PS matching / SMR weighting]\n  ATE --> I{ICEs frequent<br/>and prognostic?}\n  ATT --> I\n  I -->|No| TPol[Treatment-policy estimand<br/>follow regardless]\n  I -->|Yes, question is on-drug effect| H{Unmeasured confounders<br/>of the ICE process?}\n  H -->|Plausibly none| Hyp[Hypothetical estimand<br/>IPCW / g-estimation]\n  H -->|Likely yes| Comp[Composite or treatment-policy<br/>avoid informative censoring]\nstyle P fill:#e6f3ff\nstyle H fill:#ffe6e6",
        "caption": "Estimand decision logic for RWE. Choose the population (ATE vs ATT) from the decision being informed, then the ICE strategy from how frequent/prognostic the events are and whether the censoring-confounding assumption is credible.",
        "alt_text": "Decision flowchart selecting ATE versus ATT by whether the question is population-level, then selecting the intercurrent-event strategy based on event frequency and the plausibility of unmeasured confounding of the ICE process.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "part_of",
        "target_slug": "target-trial-emulation",
        "notes": "Target-trial emulation begins with explicit estimand specification; the cloning/censoring/weighting steps are chosen to match the stated population and ICE strategy."
      },
      {
        "relation_type": "used_with",
        "target_slug": "clone-censor-weight-per-protocol",
        "notes": "Clone-censor-weight is the standard way to emulate a specific treatment strategy (and therefore a specific ICE handling) for a per-protocol or hypothetical estimand in observational data."
      },
      {
        "relation_type": "used_with",
        "target_slug": "g-estimation-structural-nested-models",
        "notes": "G-estimation is especially powerful for hypothetical and dynamic-regime estimands under time-varying ICEs and time-varying confounding."
      },
      {
        "relation_type": "used_with",
        "target_slug": "propensity-score-methods-psm-iptw",
        "notes": "The estimator encodes the population — matching/SMR weighting targets the ATT while stabilized IPTW targets the ATE; choosing it is choosing the estimand's population."
      },
      {
        "relation_type": "used_with",
        "target_slug": "predictive-and-causal-ml-models-rwe",
        "notes": "Causal ML (TMLE, double ML) flexibly targets a pre-specified ATE/ATT under high-dimensional confounding while respecting the chosen ICE strategy."
      },
      {
        "relation_type": "see_also",
        "target_slug": "cox-ph-regression",
        "notes": "A reported HR implies an estimand (marginal vs conditional, and an ICE strategy via its censoring rules); under non-proportional hazards RMST or risk differences may better match the target."
      },
      {
        "relation_type": "see_also",
        "target_slug": "restricted-mean-survival-time-rmst",
        "notes": "RMST difference is an estimand summary measure that avoids the proportional-hazards assumption and is often pre-specified alongside the HR."
      },
      {
        "relation_type": "see_also",
        "target_slug": "competing-risks-cause-specific-fine-gray-rwe",
        "notes": "Death as an intercurrent event for a non-fatal endpoint is a competing risk; the cause-specific vs subdistribution choice is part of the estimand, not just an analytic detail."
      },
      {
        "relation_type": "see_also",
        "target_slug": "therapeutic-area-specific-rwe-challenges-oncology",
        "notes": "Oncology is where estimand-ICE mismatch is most consequential — switching at progression makes treatment-policy and hypothetical estimands diverge sharply."
      }
    ],
    "aliases": [
      "estimand framework",
      "ICH E9(R1) estimands",
      "intercurrent events",
      "treatment policy vs hypothetical strategy",
      "ATE vs ATT",
      "average treatment effect in the treated"
    ],
    "complexity": "advanced",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "exact-methods-sparse-data-rwe",
    "name": "Exact and Penalized-Likelihood Methods for Sparse Data",
    "short_definition": "A family of small-sample inference methods - exact conditional tests/intervals and penalized (Firth) likelihood - that replace asymptotic maximum-likelihood inference when cells are rare or covariate-outcome separation makes ordinary odds ratios and Wald confidence intervals unstable, infinite, or badly biased.",
    "long_description": "**Exact and penalized-likelihood methods for sparse data** are the small-sample toolkit for the situation that pervades\nreal-world safety and subgroup work: a rare exposure crossed with a rare outcome, a stratified table with several\nzero cells, or a logistic/Cox model in which one covariate level perfectly predicts the outcome. In these regimes the\nordinary maximum-likelihood (ML) estimator and the chi-square/Wald machinery that surround it are not merely imprecise -\nthey are *wrong in a structured way*: the ML odds ratio is biased away from the null in finite samples, and under\n**separation** the ML estimate diverges to ±infinity with a Wald standard error that explodes, so the confidence\ninterval and p-value become meaningless. Two distinct repairs exist. **Exact methods** (Fisher's exact test and its\nr x c generalization; exact conditional logistic regression) compute the inference from the permutation distribution of\nthe sufficient statistic conditional on the margins, requiring no large-sample approximation. **Penalized-likelihood\nmethods** (Firth's correction) add a Jeffreys-prior penalty to the score equation that removes the leading O(1/n) bias\nand *always yields finite estimates and profile-likelihood intervals even under complete separation*.\n\n**Core conceptual distinction.** The two families answer the same question with different machinery and different\nestimands of convenience. (1) *Exact conditional inference* conditions on the observed margins (the row/column totals,\nor the nuisance parameters in conditional logistic regression) and enumerates the exact distribution of the test\nstatistic; its p-values and confidence intervals are guaranteed to have at-or-below-nominal type-I error in any sample\nsize, at the cost of conservatism (discreteness gaps) and heavy computation as tables grow. 
(2) *Firth penalized\nlikelihood* keeps the unconditional model, shrinks the estimate toward the null by an amount that vanishes as n grows,\nand returns a finite odds ratio with reduced first-order bias and a profile-penalized-likelihood interval. The estimand is the\nsame conditional odds ratio you intended to report; the methods differ in *how* they handle the absence of asymptotic\nsupport. A third, often-overlooked option - **adding a Bayesian/data-augmentation prior or weakly-informative\npenalization** - generalizes Firth and is what Greenland recommends when sparse-data bias is suspected but separation\nis not literal. Crucially, none of these is a \"different model\": they are inference repairs for the *same*\ncausal/associational target, chosen because the asymptotics that justify Wald/score inference have failed.\n\n**Pros, cons, and trade-offs.**\n- **vs ordinary (asymptotic) ML logistic/Cox regression with Wald intervals:** Exact and Firth methods give valid,\n  usable inference (finite estimates for Firth; possibly one-sided intervals for exact) where ML gives infinite\n  estimates, undefined Wald SEs, and anticonservative p-values. Cost: exact\n  inference can be markedly *conservative* (true coverage well above 95%, p-values inflated) and computationally\n  explosive for large or many-covariate models; Firth is fast and well-calibrated but the penalty shrinks toward the\n  null, so it is not ideal when you specifically need an unbiased estimate of a *large* true effect. 
**Prefer ML** only\n  when every relevant cell is comfortably non-sparse (a common rule of thumb is >=8-10 events per parameter and no\n  zero cells); otherwise prefer Firth as the default repair and reserve exact for regulatory tables where guaranteed\n  type-I control is required.\n- **vs Firth penalized likelihood (when choosing among the repairs):** Exact conditional logistic is the gold standard\n  for guaranteed coverage in a single small 2x2 or matched stratified table and is the standard for *matched*\n  case-control designs; Firth scales to multivariable models, time-to-event (Firth-corrected Cox), and is the\n  pragmatic choice when several covariates must be adjusted. **Prefer exact** for small, low-dimensional, regulator-\n  facing contingency analyses; **prefer Firth** for adjusted multivariable sparse models and any model with a\n  continuous covariate (where exact enumeration is intractable).\n- **vs collapsing categories / dropping the sparse stratum / \"+0.5 to every cell\" (Haldane-Anscombe):** Naive fixes\n  discard information or introduce an arbitrary, sample-size-dependent bias. The ad-hoc 0.5 continuity correction is a\n  crude special case of penalization and performs poorly in multivariable settings. **Prefer Firth/exact** over any\n  hand-edited cell counts in a deliverable.\n- **vs Bayesian logistic with a weakly-informative prior:** Conceptually adjacent (Firth *is* a Jeffreys-prior MAP);\n  a normal/Cauchy prior gives the analyst explicit control of shrinkage and full posterior intervals. 
**Prefer the\n  Bayesian route** when you want to encode genuine prior information or report a posterior; **prefer Firth** when you\n  want a frequentist, prior-free default that referees recognize.\n\n**When to use.** Rare outcomes (e.g., adjudicated serious adverse events, rare cancers) crossed with a small or rare\nexposure group; subgroup/interaction analyses that thin the data into near-empty cells; matched case-control analyses\n(conditional logistic) with few discordant pairs; any logistic or Cox model where the software reports a \"quasi-\ncomplete separation\" / \"did not converge\" warning or returns an odds ratio in the thousands with an enormous SE;\npost-marketing safety signal tables with zero events in one arm. Diagnose first: tabulate every covariate against the\noutcome, count events per parameter, and look for empty or near-empty cells before fitting.\n\n**When NOT to use - and when it is actively misleading or dangerous.**\n- **Do not use exact methods as a fishing expedition for \"significance.\"** Their conservatism means a non-significant\n  exact p-value with wide intervals is often the *honest* answer that the data cannot support a conclusion; reporting a\n  Firth point estimate without its (wide) interval to manufacture precision is the dangerous failure mode.\n- **Sparse-data bias is not a small-sample-only problem.** Greenland's central warning: a large overall N with a rare\n  outcome and many adjustment covariates can still be sparse, and asymptotic ML can be biased *away from the null* even\n  when nothing looks small in the row totals. 
Trusting the converged ML odds ratio here is the trap.\n- **Firth shrinks toward the null.** If the true effect is genuinely large and well-supported, Firth will understate\n  it; do not use it as a universal default when the data are actually rich.\n- **Exact does not fix confounding, selection, or measurement error.** A perfectly valid exact p-value on a confounded\n  contingency table is precisely confident nonsense. These are *inference* repairs, not *design* repairs - they belong\n  downstream of an active-comparator new-user design and confounding control, never as a substitute for them.\n- **Separation can signal a real data problem,** e.g., an exclusion criterion accidentally encoded as a covariate, or\n  a post-baseline variable on the causal pathway. Investigate the separating variable clinically before penalizing it\n  away.\n\n**Data-source operational depth.**\n- **Claims (FFS vs MA):** Rare-event safety counts are exquisitely sensitive to person-time completeness. Medicare\n  Advantage enrollees lack fee-for-service claims, so an arm built from MA-only person-time can show *zero* events not\n  because none occurred but because they were never observed - a structural zero that exact/Firth methods will dutifully\n  analyze as if it were real. Restrict to A/B/D (or commercial medical+pharmacy) enrollees and exclude MA-only spans\n  before counting. Claims-lag and run-out also manufacture spurious zeros in the most recent calendar period; freeze a\n  claims-maturity cutoff. **Differential competing risks by exposure in an elderly claims cohort** can empty the event\n  cell in the frailer arm (they die of something else first) - a sparse cell that is really a competing-risk artifact,\n  not a protective effect.\n- **EHR:** Encounter-driven capture means a \"zero\" outcome in a subgroup may reflect patients who left the system, not\n  true absence of events; sparse cells are often *missingness* cells. 
Confirm denominators with explicit observation\n  windows and consider linkage before declaring a structural zero.\n- **Registry:** Adjudicated, complete event capture makes registry zeros more trustworthy than claims zeros, but rare\n  strata still need exact/Firth inference; weak pharmacy exposure capture means the exposure cell can be undercounted.\n- **Linked claims-EHR-vital records:** Linkage provides the completeness needed to distinguish a true zero from an unobserved zero, but\n  it selects the linkable subset and can itself thin small strata - report the linkage denominator alongside the\n  sparse table.\n\n**Worked claims example.** Question: incidence of a rare adjudicated serious adverse event (e.g., agranulocytosis) in\nnew initiators of a niche second-line antithyroid drug (STUDY) vs an active comparator (COMP) among adults, in a\ncommercial + Medicare FFS database. (1) Cohort: active-comparator new-user design - >=365 days of continuous A/B/D (or\ncommercial medical+pharmacy) enrollment with no fill of either drug in the lookback, exclude MA-only person-time so a\nzero event count cannot be an observation artifact, index_date = first qualifying fill, arm assigned from the dispensed\nNDC. (2) Outcome: first inpatient/ED claim with the qualifying dx code in the first-listed position within an\non-treatment window (last days_supply end + 30-day grace). (3) Suppose the crude 2x2 is STUDY 3/412 vs COMP 0/903.\nThe zero in the comparator arm produces quasi-complete separation (every comparator patient is a non-event, while the\nSTUDY arm contains both events and non-events): ordinary `PROC LOGISTIC` returns an infinite odds-ratio\nestimate with a \"quasi-complete separation\" warning and a Wald 95% CI of (0, infinity) - uninterpretable. (4) Repair:\nexact conditional logistic (the `EXACT` statement in `PROC LOGISTIC`) gives an exact p-value and a guaranteed-coverage\nexact CI; because the conditional MLE is itself infinite when a cell is zero, the reported point estimate is a\nmedian-unbiased estimate and the upper confidence limit is infinite, so the lower limit carries the information. 
Firth (the `FIRTH` option on the `MODEL` statement) gives a finite penalized OR with a profile-likelihood\nCI for the adjusted model that also includes age, sex, and baseline comorbidity count - covariates that ordinary ML\ncould not estimate alongside the separating exposure. (5) Report *both* the point estimate and the (wide) interval, and\nstate explicitly that the comparator arm contributed zero events: the honest conclusion is a signal worth following,\nnot a precise effect size.\n\n**Interpreting the output**\n\nConsider the 2×2 table above: 3 events in 412 study-drug patients, 0 events in 903 comparator patients.\nOrdinary logistic regression fails (infinite OR, Wald CI 0 to ∞). Exact conditional inference returns an exact\np-value and an exact CI whose upper limit is infinite. The crude Firth-penalized estimate, which for a single 2×2\ntable coincides with adding 0.5 to every cell, is OR ≈ (3.5 × 903.5) / (0.5 × 409.5) ≈ 15.4, reported with a wide\nprofile-likelihood CI.\n\n*(1) Formal statistical interpretation.* The penalized OR ≈ 15.4 is a shrunken point estimate, not the conditional\nMLE (which is infinite here). The exact CI is derived from the conditional distribution of the 2×2 table, holding the\nmarginal totals fixed, and has guaranteed coverage at or above the nominal level regardless of sample size, unlike\nWald intervals that rely on large-sample normal approximations. The width of the interval directly reflects the\ninformation content of the data: with only three events and a zero cell, the data cannot bound the odds ratio from\nabove and are compatible with a wide range of true odds ratios.\nThe exact p-value is the probability, under the null of no association, of observing a configuration at\nleast as extreme as the one seen; it is not the probability that the drug is truly harmful.\n\n*(2) Practical interpretation for a decision-maker.* The result is a pharmacovigilance signal, not a precise\neffect estimate. An OR of ≈ 15 sounds alarming, but the wide CI honestly communicates that the true effect\ncould plausibly be much smaller. The appropriate response is a larger follow-up study to narrow the interval,\nnot a formulary decision based on a three-event table. 
Report both the point estimate and the full interval\nso readers understand both the signal and the uncertainty.",
    "primary_category": "Inferential_Statistics",
    "tags": [
      "inferential_statistics",
      "exact-methods",
      "sparse-data",
      "penalized-likelihood",
      "firth-correction",
      "separation",
      "conditional-logistic",
      "rare-events",
      "small-sample-inference"
    ],
    "applies_to_study_types": [
      "claims_analysis",
      "ehr_study",
      "registry_linkage",
      "comparative_effectiveness",
      "pharmacovigilance"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1080/01621459.1983.10477989",
        "url": "https://doi.org/10.1080/01621459.1983.10477989",
        "citation_text": "Mehta CR, Patel NR. A network algorithm for performing Fisher's exact test in r x c contingency tables. Journal of the American Statistical Association. 1983;78(382):427-434.",
        "year": 1983,
        "authors_short": "Mehta & Patel",
        "notes": "Foundational computational algorithm that made exact inference for sparse r x c contingency tables tractable; the engine behind modern exact-test software (StatXact, SAS EXACT)."
      },
      {
        "role": "explain",
        "doi": "10.1093/biomet/80.1.27",
        "url": "https://doi.org/10.1093/biomet/80.1.27",
        "citation_text": "Firth D. Bias reduction of maximum likelihood estimates. Biometrika. 1993;80(1):27-38.",
        "year": 1993,
        "authors_short": "Firth",
        "notes": "Introduces the penalized-likelihood (Jeffreys-prior) correction that removes the O(1/n) bias of ML and underlies all \"Firth logistic/Cox\" implementations used for sparse-data and separation problems."
      },
      {
        "role": "demonstrate",
        "doi": "10.1002/sim.1047",
        "url": "https://doi.org/10.1002/sim.1047",
        "citation_text": "Heinze G, Schemper M. A solution to the problem of separation in logistic regression. Statistics in Medicine. 2002;21(16):2409-2419.",
        "year": 2002,
        "authors_short": "Heinze & Schemper",
        "notes": "Shows Firth's penalized likelihood always yields finite estimates and profile-penalized-likelihood intervals under complete or quasi-complete separation; the canonical applied reference for the logistf/firth implementations."
      },
      {
        "role": "use",
        "doi": "10.1136/bmj.i1981",
        "url": "https://doi.org/10.1136/bmj.i1981",
        "citation_text": "Greenland S, Mansournia MA, Altman DG. Sparse data bias: a problem hiding in plain sight. BMJ. 2016;352:i1981.",
        "year": 2016,
        "authors_short": "Greenland et al.",
        "notes": "Practical demonstration that sparse-data bias inflates ML odds ratios away from the null even with large total N, and that penalization/exact methods are the appropriate remedy in epidemiologic practice."
      }
    ],
    "plain_language_summary": "When a study table has a zero in one cell — say, no patients in the comparison group experienced the outcome — the standard formula for an odds ratio divides by zero and gives an infinite answer, which is meaningless. Exact methods (such as Fisher's exact test and exact logistic regression) compute the answer from the actual counts directly, without relying on any large-sample approximation, so they always produce a finite, valid result. A related tool called Firth penalized regression adds a small mathematical adjustment that pulls the estimate away from infinity, making it especially useful when several variables are in the model at once. Both approaches are needed in real-world evidence studies that track rare outcomes, such as a serious side effect that occurred in only a handful of patients.",
    "key_terms": [
      {
        "term": "sparse data",
        "definition": "A table or dataset where one or more cells have very few observations — often zero — typically because the outcome is rare, the exposure is uncommon, or both."
      },
      {
        "term": "exact test",
        "definition": "A statistical test that calculates a p-value or confidence interval by counting all possible arrangements of the observed data, rather than assuming the data follow a bell-curve approximation."
      },
      {
        "term": "zero cell",
        "definition": "A cell in a frequency table where no events were observed in that combination of exposure and outcome, causing standard formulas that involve division or logarithms of that count to break down."
      },
      {
        "term": "odds ratio",
        "definition": "A number that compares the odds of an outcome in one group to the odds in another group; an odds ratio above 1 suggests the outcome is more common in the first group."
      },
      {
        "term": "Wald confidence interval",
        "definition": "The most common type of confidence interval, calculated by multiplying a standard error by 1.96 and adding or subtracting from the estimate; it requires a well-behaved, finite standard error to be valid."
      },
      {
        "term": "Firth penalized regression",
        "definition": "A modified version of logistic regression that adds a small mathematical penalty to prevent estimates from running off to infinity when zero cells or near-separation cause standard regression to fail."
      }
    ],
    "worked_example": {
      "scenario": "A claims-based safety study asks whether a niche antithyroid drug (STUDY arm) is associated with a rare blood disorder called agranulocytosis compared with an active comparator drug (COMP arm). After building a new-user cohort with continuous enrollment confirmed throughout follow-up, analysts count adjudicated agranulocytosis events: 3 events among 412 STUDY initiators and 0 events among 903 COMP initiators. The goal is a valid odds ratio and confidence interval comparing the two arms.",
      "dataset": {
        "caption": "Adjudicated event counts by treatment arm — the 2x2 table an analyst would construct from the analytic dataset.",
        "columns": [
          "arm",
          "events",
          "non_events",
          "total"
        ],
        "rows": [
          [
            "STUDY",
            3,
            409,
            412
          ],
          [
            "COMP",
            0,
            903,
            903
          ]
        ]
      },
      "steps": [
        "Step 1 — Attempt the standard odds ratio formula: OR = (events_STUDY / non_events_STUDY) / (events_COMP / non_events_COMP) = (3 / 409) / (0 / 903). The denominator is 0 / 903 = 0, so the division is undefined — the standard formula returns infinity.",
        "Step 2 — Attempt a Wald confidence interval: the Wald method requires taking the natural log of each cell count. ln(0) is negative infinity, so the standard error is undefined and the confidence interval cannot be computed. Statistical software will issue a quasi-complete separation warning and may print OR = infinity with CI = (0, infinity).",
        "Step 3 — Apply Fisher's exact test (exact conditional inference): instead of using a formula, the exact method enumerates all possible ways to arrange 3 total events across 1,315 patients while keeping the row and column totals fixed. From that exact distribution it computes a finite exact conditional odds ratio and a guaranteed-coverage 95% confidence interval. The result is a finite number (approximately 15.4) with a wide but interpretable confidence interval, and an exact p-value.",
        "Step 4 — Interpret honestly: the confidence interval is wide because the data are genuinely sparse — only 3 events total. The honest conclusion is that the STUDY arm shows a signal worth monitoring, not a precisely measured effect size. Both the point estimate and the wide interval must be reported, along with the fact that the COMP arm had zero events."
      ],
      "result": "Standard ML odds ratio: undefined (division by zero, software returns infinity). Exact conditional odds ratio: approximately 15.4 (exact 95% CI is finite and guaranteed-coverage but wide, reflecting only 3 total events across 1,315 patients). The exact method gives a valid, reportable answer where the standard formula fails entirely."
    },
    "prerequisites": [
      "logistic-regression-for-binary-outcomes",
      "incidence-rate-calculation-rwe",
      "active-comparator-new-user"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Exact conditional inference (Fisher exact / exact conditional logistic)",
        "description": "Inference from the permutation distribution of the sufficient statistic conditional on the margins (or nuisance parameters). Guaranteed at-or-below-nominal type-I error in any sample size; the standard for small matched case-control tables and regulator-facing 2x2/rxc analyses.",
        "edge_cases": [
          "Discreteness makes exact tests conservative (true coverage above 95%); mid-p adjustments reduce conservatism but lose the guarantee.",
          "Computation explodes with table size and number of covariates; continuous covariates make exact enumeration intractable.",
          "A median-unbiased exact estimate is reported when the conditional MLE is infinite (one margin gives a one-sided interval)."
        ],
        "data_source_notes": "claims/EHR: ensure zeros are true structural zeros (full enrollment, mature claims, no MA-only person-time) before trusting an exact zero-cell analysis."
      },
      {
        "name": "Firth penalized likelihood (logistic and Cox)",
        "description": "Adds a Jeffreys-prior penalty to the score equations, removing leading-order bias and guaranteeing finite estimates and profile-penalized-likelihood intervals even under separation. Scales to multivariable and time-to-event models.",
        "edge_cases": [
          "Shrinks estimates toward the null - understates genuinely large, well-supported effects.",
          "Profile-likelihood (not Wald) intervals must be reported; Wald intervals on Firth estimates are misleading.",
          "For prediction/probabilities, FLIC/FLAC variants correct Firth's intercept/predicted-probability distortion."
        ],
        "data_source_notes": "Preferred when several covariates must be adjusted alongside a separating exposure that ordinary ML cannot fit."
      },
      {
        "name": "Bayesian / data-augmentation penalization",
        "description": "Weakly-informative (normal/Cauchy) priors or augmenting the data with prior pseudo-observations; generalizes Firth and yields full posterior intervals with analyst-controlled shrinkage.",
        "edge_cases": [
          "Choice of prior scale is a substantive decision that must be pre-specified and reported.",
          "Sensitivity to the prior should be shown when the data are very sparse."
        ],
        "data_source_notes": "Useful when genuine external safety information exists to inform the prior in a rare-event analysis."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Ordinary asymptotic ML logistic/Cox with Wald inference",
        "pros_of_this": "Valid, finite inference under zero cells and separation where ML gives infinite estimates, undefined Wald standard errors, and anticonservative p-values; removes finite-sample odds-ratio bias.",
        "cons_of_this": "Exact methods can be conservative and computationally heavy; Firth shrinks toward the null, understating genuinely large effects.",
        "when_to_prefer": "Whenever cells are rare/empty, events-per-parameter is low, or software reports separation/non-convergence."
      },
      {
        "compared_to": "Ad-hoc fixes (drop sparse stratum, collapse categories, add 0.5 to every cell)",
        "pros_of_this": "Principled, reproducible, regulator-defensible inference rather than an arbitrary, sample-size-dependent continuity correction or loss of information.",
        "cons_of_this": "Requires specialized procedures and (for exact) more computation than editing cell counts.",
        "when_to_prefer": "Any deliverable - never hand-edit cell counts or drop strata to force convergence in a reported analysis."
      },
      {
        "compared_to": "Exact conditional logistic (when choosing among the repairs)",
        "pros_of_this": "Firth scales to multivariable and time-to-event models and to continuous covariates where exact enumeration is intractable, and is computationally fast.",
        "cons_of_this": "Firth lacks the exact method's guaranteed coverage; exact is the standard for small matched case-control tables and regulator-facing low-dimensional analyses.",
        "when_to_prefer": "Adjusted multivariable sparse models, Firth-corrected Cox, or any model with a continuous covariate."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Confirm that zero/sparse cells are true structural zeros - require continuous FFS-observable enrollment, exclude Medicare Advantage-only person-time (no FFS claims), and freeze a claims-maturity cutoff so claims-lag does not manufacture zeros. Watch for differential competing risks emptying the event cell in the frailer arm.",
      "ehr": "Encounter-driven capture means a sparse-cell \"zero\" can be loss to follow-up rather than a true absent event; pin explicit observation windows and consider linkage before treating a zero as structural.",
      "registry": "Adjudicated complete outcome capture makes registry zeros more trustworthy, but pharmacy exposure is often undercounted; rare strata still require exact/Firth inference.",
      "linked": "Linked claims-EHR-vital records best distinguishes true from unobserved zeros, but linkage selects the linkable subset and can thin small strata - report the linkage denominator alongside the sparse table."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\nimport numpy as np\nfrom scipy.stats import fisher_exact\nfrom firthlogist import FirthLogisticRegression  # pip install firthlogist\n\n# --- (a) Crude 2x2: exact conditional inference (no large-sample approximation) ---\ntab = pd.crosstab(df[\"arm\"], df[\"event\"]).reindex(\n    index=[\"STUDY\", \"COMP\"], columns=[1, 0])           # rows = arm, cols = event yes/no\nodds_ratio, p_exact = fisher_exact(tab.values, alternative=\"two-sided\")\n# For a sparse zero-cell table, prefer an exact CI implementation (e.g., scipy.stats.contingency\n# odds_ratio(kind='conditional').confidence_interval) which returns finite, guaranteed-coverage bounds:\nfrom scipy.stats.contingency import odds_ratio as exact_or\nres = exact_or(tab.values, kind=\"conditional\")\nci_low, ci_high = res.confidence_interval(confidence_level=0.95)\nprint(f\"Exact conditional OR={res.statistic:.3f}  95% CI=({ci_low:.3f}, {ci_high:.3f})  p={p_exact:.4f}\")\n\n# --- (b) Adjusted model: Firth penalized logistic (finite under separation) ---\nX = df[[\"age\", \"sex\", \"comorbid_n\"]].copy()\nX[\"study\"] = (df[\"arm\"] == \"STUDY\").astype(int)        # exposure indicator\ny = df[\"event\"].to_numpy()\nfl = FirthLogisticRegression(skip_lrt=False)           # profile-penalized-likelihood inference\nfl.fit(X.to_numpy(), y)\nsummary = pd.DataFrame({\n    \"term\": list(X.columns),\n    \"coef\": fl.coef_,\n    \"OR\": np.exp(fl.coef_),\n    \"ci_low\": np.exp(fl.ci_[:, 0]),                    # profile-likelihood CI, NOT Wald\n    \"ci_high\": np.exp(fl.ci_[:, 1]),\n    \"pval\": fl.pvals_,\n})\nprint(summary)",
        "description": "Sparse-data inference on a claims-derived analytic table. Required input (one row per person, post cohort\nconstruction in an active-comparator new-user design):\n  df : person_id, arm in {'STUDY','COMP'}, event (0/1), age, sex (0/1), comorbid_n (int)\nReports (a) Fisher's exact OR + exact CI for the crude 2x2 and (b) Firth-penalized logistic for the adjusted model,\nwhich stays finite even when one arm has zero events (complete separation). Report profile-likelihood, not Wald, CIs.",
        "dependencies": [
          "pandas",
          "scipy",
          "statsmodels",
          "firthlogist"
        ],
        "source_citations": [
          "heinze-2002",
          "greenland-2016"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(exact2x2)\nlibrary(logistf)\n\n# --- (a) Crude 2x2: exact conditional test + guaranteed-coverage CI ---\ntab <- with(df, table(factor(arm, levels = c(\"STUDY\", \"COMP\")),\n                      factor(event, levels = c(1, 0))))   # rows = arm, cols = event yes/no\nex <- exact2x2(tab, tsmethod = \"central\")                 # central -> matches exact CI convention\nprint(ex)                                                 # exact OR, exact 95% CI, exact p-value\n\n# --- (b) Adjusted model: Firth penalized logistic (finite under separation) ---\ndf$study <- as.integer(df$arm == \"STUDY\")                 # exposure indicator\nfit <- logistf(event ~ study + age + sex + comorbid_n,\n               data = df, pl = TRUE)                       # pl = profile penalized likelihood\nsummary(fit)                                              # coefficients, profile-likelihood CIs, p-values\nexp(cbind(OR = coef(fit), confint(fit)))                  # odds ratios with PL confidence limits",
        "description": "Same two analyses in R on the analytic table. Inputs mirror the Python version:\n  df : person_id, arm ('STUDY'/'COMP'), event (0/1), age, sex (0/1), comorbid_n (int)\nexact2x2 gives a guaranteed-coverage exact CI for the crude table; logistf gives Firth-penalized estimates with\nprofile-penalized-likelihood CIs for the adjusted model under separation.",
        "dependencies": [
          "exact2x2",
          "logistf"
        ],
        "source_citations": [
          "heinze-2002",
          "greenland-2016"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* (a) Crude 2x2: exact conditional inference for a sparse / zero-cell table. */\nproc freq data=work.analytic order=data;\n  tables study*event / or;                  /* odds ratio + asymptotic 95% CI                  */\n  exact or;                                  /* exact conditional OR CI + Fisher exact p-value  */\nrun;\n\n/* (b) Exact conditional logistic regression (gold standard for the small unadjusted/low-dim model). */\nproc logistic data=work.analytic exactonly;\n  class sex / param=ref;\n  model event(event='1') = study;           /* low-dimensional: continuous covariates make EXACT intractable */\n  exact study / estimate=both;              /* exact parameter estimate + exact CI for the exposure          */\nrun;\n\n/* (c) Firth penalized logistic for the ADJUSTED model - finite under quasi-complete separation. */\nproc logistic data=work.analytic;\n  class sex / param=ref;\n  model event(event='1') = study age sex comorbid_n\n        / firth clparm=pl clodds=pl;         /* Firth penalty; profile-likelihood (not Wald) intervals */\nrun;",
        "description": "Exact conditional logistic and Firth penalized logistic in SAS/STAT. Required input dataset (post cohort\nconstruction):\n  work.analytic : person_id, study (1=STUDY,0=COMP), event (1/0), age, sex (0/1), comorbid_n\nPROC FREQ ... / EXACT OR gives the exact conditional odds ratio and exact CI for the crude table; PROC LOGISTIC with\nthe EXACT statement performs exact conditional logistic; the FIRTH option performs penalized-likelihood logistic that\nstays finite under separation. Report profile-likelihood (CLPARM=PL), not Wald, intervals.",
        "dependencies": [],
        "source_citations": [
          "heinze-2002",
          "greenland-2016"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  A[Fit intended logistic / Cox model on the analytic table] --> B{Diagnostics:<br/>zero cells? events-per-parameter low?<br/>separation / non-convergence warning?}\n  B -->|No: cells non-sparse, model converges| C[Ordinary asymptotic ML<br/>Wald or LR inference is valid]\n  B -->|Yes| D{What kind of sparsity?}\n  D -->|Small low-dimensional<br/>or matched case-control table| E[Exact conditional inference<br/>Fisher exact / exact conditional logistic<br/>guaranteed coverage]\n  D -->|Adjusted multivariable<br/>or continuous covariate<br/>or time-to-event| F[Firth penalized likelihood<br/>finite estimate + profile-likelihood CI]\n  D -->|Want posterior / external prior info| G[Bayesian / data-augmentation<br/>weakly-informative prior]\n  E --> H[Report point estimate AND wide interval;<br/>state which arm had zero events]\n  F --> H\n  G --> H",
        "caption": "Decision logic for sparse-data inference. The branch point is the diagnostic, not the data size alone - a large N with a rare outcome and many covariates can still be sparse and require a repair.",
        "alt_text": "Flowchart from fitting the intended model through sparsity diagnostics to a choice among ordinary ML, exact conditional inference, Firth penalized likelihood, or a Bayesian prior, ending in honest reporting of the estimate and interval.",
        "source_type": "illustrative",
        "source_citations": [
          "greenland-2016"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  Z[Zero / sparse event cell in a claims arm] --> Q{Is the zero structural<br/>or unobserved?}\n  Q -->|MA-only person-time<br/>no FFS claims| M1[Artifact: exclude MA-only spans<br/>before counting]\n  Q -->|Immature / lagged claims| M2[Artifact: apply claims-maturity cutoff]\n  Q -->|Differential competing risk<br/>frailer arm dies first| M3[Artifact: model competing risk;<br/>do not read as protective]\n  Q -->|EHR loss to follow-up| M4[Artifact: pin observation window;<br/>consider linkage]\n  Q -->|Full enrollment, mature, complete capture| TRUE[True structural zero:<br/>apply exact / Firth inference]",
        "caption": "Before applying exact/Firth methods to a zero-event cell, classify the zero. Most sparse cells in claims are observation artifacts, not true absences; analyzing an artifactual zero as real produces confident nonsense.",
        "alt_text": "Decision flow distinguishing artifactual zero cells (Medicare Advantage person-time, immature claims, competing risks, EHR loss to follow-up) from a true structural zero that warrants exact or Firth inference.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "used_with",
        "target_slug": "active-comparator-new-user",
        "notes": "Exact/Firth methods are the small-sample inference layer applied after an active-comparator new-user cohort is built, when the rare outcome or rare exposure thins the analytic table."
      },
      {
        "relation_type": "see_also",
        "target_slug": "propensity-score-methods-psm-iptw",
        "notes": "PS adjustment controls confounding; exact/Firth repair the inference when the resulting strata or weighted analysis are sparse. They address different failures and are complementary, not substitutes."
      }
    ],
    "aliases": [
      "exact methods for sparse data",
      "Firth penalized likelihood",
      "Firth logistic regression",
      "exact logistic regression",
      "exact conditional logistic regression",
      "sparse data bias",
      "small-sample inference"
    ],
    "complexity": "advanced",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "exposure-episode-construction-rwe",
    "name": "Exposure Episode Construction",
    "short_definition": "The algorithm that stitches a person's sequence of pharmacy dispensings (fill_date + days_supply) into continuous treatment episodes by applying a grace period for permissible gaps, a stockpiling/carry-over rule for early refills, and end-of-episode logic, yielding the on-treatment time windows used for exposure classification and follow-up.",
    "long_description": "**Exposure episode construction** is the operational bridge between raw dispensing records and an analyzable exposure\nvariable. A claims database does not contain \"treatment episodes\"; it contains discrete fills, each carrying a `fill_date`\nand a `days_supply`. The analyst must decide when a person is *on* therapy and when they are *off* it, because real refill\nbehavior is irregular: people refill early (stockpiling), refill late, take drug holidays, are hospitalized, switch within\nclass, and stop. Episode construction is the rule set — grace period, carry-over/stockpiling rule, inpatient bridging,\nand restart logic — that collapses a fill history into one or more `[episode_start, episode_end]` intervals. Those\nintervals then drive new-user/washout definitions, on-treatment (\"as-treated\") risk windows, persistence and adherence\nmetrics (PDC/MPR), and time-varying exposure in causal models. Get the rules wrong and every downstream estimate inherits\nexposure misclassification that is frequently *differential* by outcome.\n\n**Core conceptual distinction.** Three rules, each independently consequential, define an episode. (1) *Grace period* (the\npermissible gap): the maximum number of days between the projected supply-end of one fill and the start of the next fill\nthat still counts as continuous therapy. A short grace (0–15d) splits a single course into many short episodes and\nmanufactures spurious discontinuations; a long grace (≥90d) glues distinct courses together and carries exposure status\nfar past the last pill — exactly the over-counting that produces immortal-time–like bias in an as-treated analysis.\nGardarsdottir et al. showed episode counts and durations swing materially with this single parameter. 
(2) *Stockpiling /\ncarry-over*: when a refill arrives before the prior supply is exhausted (overlap), do you cap exposure at the supply you\ncould plausibly have taken (cap-at-1, no carry-over) or roll the surplus forward (carry-over), shifting the projected\nsupply-end later? Carry-over is realistic for hoarders but inflates persistence if oversupply reflects gaming or\n90-day-mail switches. (3) *End-of-episode / restart*: a gap exceeding the grace closes the episode at the last projected\nsupply-end; the next fill opens a *new* episode (and may be a \"restart\"/\"rechallenge\"). The estimand you are serving\ndictates the rules: an **intention-to-treat / first-episode** contrast cares mainly about episode *start*; an\n**as-treated / per-protocol** contrast lives and dies by the grace period and carry-over rule because they define the\non-treatment window over which person-time and events are counted. Episode construction is therefore a *measurement*\ndecision that encodes an *estimand* — it must be pre-specified, not tuned to the result.\n\n**Pros, cons, and trade-offs** (specific and comparative, naming the alternatives).\n- **Episode construction (refill-gap stitching) vs. fixed-window \"ever-exposed after index\":** Fixed-window approaches\n  (e.g., \"exposed for 365 days after first fill regardless of refills\") are trivial to code and avoid gap parameters, but\n  they assign exposure to people who stopped after one fill — pure immortal-time/exposure misclassification.\n  Episode construction tracks actual supply and discontinuation. **Prefer episode construction** whenever the outcome can\n  occur during gaps or after stopping (most safety and many effectiveness questions).\n- **Episode construction vs. 
\"current-use\" single-fill exposure (e.g., exposed only for days_supply of the last fill,\n  no grace):** Single-fill current-use is the strictest and least biased toward over-counting, but it fragments therapy\n  and discards person-time around normal refill lateness, throwing away power and creating artificial\n  discontinuation/restart events that can themselves be modeled into bias. **Prefer a modest grace period** (commonly\n  30–60d, or half the typical days_supply) over zero-grace current-use for chronic therapies.\n- **Carry-over (stockpiling) vs. cap-at-1 (no carry-over):** Carry-over honors realistic hoarding and avoids spurious\n  gaps when patients refill 90-day supplies early, but it can extend episodes long past true use and inflate PDC/MPR\n  above 1.0 if not capped. Cap-at-1 is conservative and is the default for adherence metrics; carry-over is defensible\n  for persistence when oversupply is genuine. **Name the rule in the SAP** and run it as a sensitivity analysis.\n- **vs. handing the problem to a black-box vendor \"treatment episode\" table:** Convenient, but the vendor's hidden grace\n  and carry-over defaults may not match your estimand; you cannot reproduce or defend a number you did not construct.\n\n**When to use.** Any drug-exposure RWE study in dispensing/claims or linked EHR-pharmacy data where exposure is\ntime-varying or where on-treatment follow-up, persistence, adherence (PDC/MPR), discontinuation, switching, or\nrestart/rechallenge must be measured. 
It is the prerequisite step for as-treated risk-window construction, for new-user\nwashout checks (the washout is itself an \"absence-of-episode\" check), and for time-varying exposure in marginal\nstructural models / pooled logistic regression.\n\n**When NOT to use — and when it is actively misleading or dangerous.**\n- **When the estimand is purely initiation (ITT-like first-line strategy).** If you only need \"initiated A vs B at time\n  zero\" and you carry exposure forward by protocol regardless of refills, elaborate gap stitching adds nothing and a\n  mis-tuned grace can *create* bias by censoring people you intended to follow. Use it only to detect switching/crossover,\n  not to define the primary contrast.\n- **When days_supply is unreliable.** For injectables, samples, inpatient-administered drugs, titrated/PRN regimens, or\n  fields known to be defaulted to 30, the supply-end projection is fiction and episode boundaries are noise. A long grace\n  on a 90-day-mail population glued to a short grace on a retail population produces *differential* misclassification by\n  channel.\n- **When the chosen grace introduces immortal time.** Carrying exposure status past the last fill while events accrue\n  means deaths/outcomes during the \"phantom\" supply are attributed to exposure — an as-treated mirror of immortal-time\n  bias. If the outcome itself causes discontinuation (e.g., the drug is stopped at the event), a long grace silently\n  moves the event inside the exposed window.\n- **When competing risks differ by exposure in an elderly claims cohort.** Episodes that end at death must be handled as\n  competing events, not as administrative discontinuation, or persistence is overstated in the arm with lower mortality.\n\n**Data-source operational depth.**\n- **Administrative claims (FFS):** The natural substrate — pharmacy claims carry `fill_date`, `days_supply`, `quantity`,\n  and NDC. 
Project supply-end = `fill_date + days_supply`; stitch fills within the grace; close episodes at the first gap\n  exceeding grace. Failure modes: (a) **Medicare Advantage / capitated person-time lacks FFS pharmacy and medical\n  claims** — a \"gap\" may be unobserved enrollment, not a true drug holiday, so require continuous Part D (or commercial\n  pharmacy) coverage across the episode and treat MA-only spans as non-observable, not as off-treatment. (b) **90-day\n  mail-order and stockpiling** make early refills routine; without a carry-over decision, overlapping supply double-counts\n  person-time. (c) `days_supply` defaulted to 30 for some pharmacies/products corrupts supply-end.\n- **EHR:** Exposure is the *prescription order* (or e-prescribing record), not a confirmed dispensing — primary\n  non-adherence (written but never filled) inflates apparent initiation; link to pharmacy fills where possible.\n  Medication-administration records (MAR) exist only for the inpatient setting, and care delivered outside the system is\n  invisible, so episodes break artificially when a patient gets refills at an outside pharmacy.\n- **Registry:** Often captures treatment *lines* or regimens with adjudicated start/stop dates but rarely complete refill\n  granularity; link to claims to reconstruct supply-based episodes, and reconcile registry-recorded stop dates against the\n  last projected supply-end.\n- **Linked claims–EHR–registry:** Best substrate (orders + fills + clinical context + mortality for competing-risk\n  handling) but introduces date-discrepancy reconciliation (order date vs. fill date vs. administration date) and\n  linkage selection that must precede episode construction.\n\n**Inpatient bridging (a recurring real-world failure mode).** During a hospitalization, outpatient pharmacy fills stop\nbecause drugs are administered inpatient and bundled into the facility claim. 
A naive algorithm sees a gap and closes the\nepisode mid-stay, then opens a spurious \"restart\" at discharge — manufacturing discontinuations and immortal time. The\nfix is to *bridge*: suspend the gap clock (or extend supply-end) across inpatient days identified from facility claims so\nthe episode spans the admission.\n\n**Worked claims example.** Question: persistence on a once-daily oral anticoagulant among adults with atrial\nfibrillation in a commercial + Medicare FFS database. Inputs: `rx (person_id, fill_date, days_supply, ndc)` restricted to\nthe drug-class NDC list, plus `enroll` spans with continuous Part D / pharmacy benefit. Rules (pre-specified in the SAP):\ngrace period = 30 days; stockpiling = carry-over allowed but supply-end capped so cumulative overlap never exceeds the\ncurrent fill's `days_supply` (cap-at-1 for adherence; carry-over sensitivity for persistence); inpatient days bridged.\nAlgorithm: (1) sort fills by person and date; (2) project `supply_end = fill_date + days_supply`; (3) for each subsequent\nfill, if `fill_date <= prior_supply_end + 30`, it belongs to the *same* episode and (under carry-over) the running\nsupply-end advances; if `fill_date > prior_supply_end + 30`, close the episode at `prior_supply_end` and open a new one;\n(4) censor open episodes at disenrollment, death, or end of data. Output: one row per episode with `episode_start`,\n`episode_end`, `n_fills`, `total_days_supply`, and a `gap_days` flag. Persistence = time from first `episode_start` to the\nend of the first episode; PDC over a fixed denominator window = covered days within the window ÷ window length. Sensitivity\nanalyses: re-run with grace ∈ {0, 15, 60, 90} and with/without carry-over, and report how median persistence and the\ndiscontinuation rate move — the Gardarsdottir result in your own data.",
    "primary_category": "Exposure_Definition",
    "tags": [
      "exposure-episode-construction",
      "treatment-episode",
      "grace-period",
      "stockpiling",
      "days-supply",
      "persistence",
      "as-treated",
      "pharmacoepidemiology"
    ],
    "applies_to_study_types": [
      "claims_analysis",
      "ehr_study",
      "registry_linkage",
      "target_trial_emulation"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1016/j.jclinepi.2009.07.001",
        "url": "https://doi.org/10.1016/j.jclinepi.2009.07.001",
        "citation_text": "Gardarsdottir H, Souverein PC, Egberts TCG, Heerdink ER. Construction of drug treatment episodes from drug-dispensing histories is influenced by the gap length. Journal of Clinical Epidemiology. 2010;63(4):422-427.",
        "year": 2010,
        "authors_short": "Gardarsdottir et al.",
        "notes": "The canonical methods paper showing that the assumed permissible gap (grace period) directly determines the number, length, and timing of constructed treatment episodes, and hence downstream persistence and exposure classification."
      },
      {
        "role": "explain",
        "doi": "10.1002/pds.4372",
        "url": "https://doi.org/10.1002/pds.4372",
        "citation_text": "Pazzagli L, Linder M, Zhang M, Vago E, Stang P, Myers D, Andersen M, Bahmanyar S. Methods for time-varying exposure related problems in pharmacoepidemiology: an overview. Pharmacoepidemiology and Drug Safety. 2018;27(2):148-160.",
        "year": 2018,
        "authors_short": "Pazzagli et al.",
        "notes": "Frames episode construction within the broader problem of time-varying exposure, clarifying how on-treatment window definitions interact with immortal time and exposure misclassification in the estimated effect."
      },
      {
        "role": "demonstrate",
        "doi": "10.1016/s0895-4356(96)00268-5",
        "url": "https://doi.org/10.1016/s0895-4356(96)00268-5",
        "citation_text": "Steiner JF, Prochazka AV. The assessment of refill compliance using pharmacy records: methods, validity, and applications. Journal of Clinical Epidemiology. 1997;50(1):105-116.",
        "year": 1997,
        "authors_short": "Steiner & Prochazka",
        "notes": "Foundational treatment of refill-based exposure measurement (days_supply, gaps, overlap/stockpiling) that underpins episode and adherence (PDC/MPR) construction from pharmacy records."
      }
    ],
    "plain_language_summary": "Exposure episode construction is how a researcher turns a list of prescription fills into continuous stretches of time when a patient actually had medication on hand. Because people refill early, refill late, or take breaks from their drug, the raw fill dates alone do not tell you when someone was truly on therapy — you have to stitch them together using rules: a short gap between one fill running out and the next fill arriving is treated as continuous use, while a long gap closes that stretch and the next fill starts a fresh one. The result is a set of date ranges — episodes — that define when each patient was exposed, which every downstream analysis (persistence, adherence, as-treated follow-up) depends on. The biggest caveat is that the rules you choose — especially how large a gap you allow before splitting an episode — materially change how many episodes you count and how long they last, so those choices must be set in advance and tested in sensitivity analyses.",
    "key_terms": [
      {
        "term": "fill_date",
        "definition": "The calendar date on which a patient's prescription was dispensed at the pharmacy — the starting day of that supply."
      },
      {
        "term": "days_supply",
        "definition": "How many days one dispensed prescription is designed to last (e.g., 30 means the patient received 30 days' worth of pills)."
      },
      {
        "term": "grace period",
        "definition": "The maximum number of days between one fill running out and the next fill arriving that is still treated as uninterrupted therapy rather than a break."
      },
      {
        "term": "treatment episode",
        "definition": "A single continuous stretch of time during which a patient is considered to be actively on a drug, bounded by the first fill date and the projected last day the final fill covers."
      },
      {
        "term": "cap-at-1 rule",
        "definition": "A carry-over rule that prevents an early refill from double-counting days already covered — the covered days from a new fill only extend past the point where the prior supply ran out."
      }
    ],
    "worked_example": {
      "scenario": "Patient 1001 is taking rivaroxaban (an oral anticoagulant) and has five pharmacy fills over about seven months in 2024. We want to identify how many continuous treatment episodes she had and how many total days she was covered. We use a grace period of 30 days (a gap up to 30 days is still treated as continuous use) and the cap-at-1 rule (an early refill does not double-count days she already had pills for).",
      "dataset": {
        "caption": "Raw pharmacy fills for patient 1001 — exactly the rows an analyst sees in a claims pharmacy table.",
        "columns": [
          "person_id",
          "fill_date",
          "drug",
          "days_supply"
        ],
        "rows": [
          [
            1001,
            "2024-01-05",
            "rivaroxaban",
            30
          ],
          [
            1001,
            "2024-02-01",
            "rivaroxaban",
            30
          ],
          [
            1001,
            "2024-03-10",
            "rivaroxaban",
            30
          ],
          [
            1001,
            "2024-06-01",
            "rivaroxaban",
            30
          ],
          [
            1001,
            "2024-07-01",
            "rivaroxaban",
            30
          ]
        ]
      },
      "steps": [
        "Fill A (Jan 5, 30 days) covers Jan 5–Feb 3. The last day with pills on hand is Feb 3.",
        "Fill B arrives Feb 1 — three days before Fill A runs out. It overlaps, so there is no gap. Under the cap-at-1 rule, the new fill only adds coverage starting from where Fill A ends: last covered day advances to Mar 1 (Feb 4 + 26 remaining days of Fill B, i.e., Feb 1 + 29 = Mar 1 is later than Feb 3, so Mar 1 wins). Still Episode 1.",
        "Fill C arrives Mar 10. Fill B's supply runs until Mar 1. The gap is Mar 2–Mar 9 — 8 days. 8 ≤ 30 (the grace period), so this is still continuous. Last covered day advances to Apr 7 (Mar 10 + 29). Still Episode 1.",
        "Episode 1 closes at Apr 7 (the last day Fill C covers). Episode 1 spans Jan 5–Apr 7 = 94 covered days.",
        "Fill D arrives Jun 1. The gap from Apr 8 through May 31 is 54 days. 54 > 30 (exceeds the grace period), so Episode 1 is definitively closed and a NEW Episode 2 opens on Jun 1. Last covered day = Jun 30.",
        "Fill E arrives Jul 1. Gap = Jul 1 − Jun 30 = 1 day. 1 ≤ 30, so still Episode 2. Last covered day advances to Jul 30 (Jul 1 + 29). Episode 2 closes at Jul 30.",
        "Episode 2 spans Jun 1–Jul 30 = 60 covered days.",
        "Total exposed days = 94 (Episode 1) + 60 (Episode 2) = 154 days across 2 episodes."
      ],
      "result": "2 treatment episodes; 154 total exposed days. Episode 1: Jan 5 – Apr 7 (94 days, 3 fills joined by gaps ≤ 30 days). Episode 2: Jun 1 – Jul 30 (60 days, 2 fills joined by a 1-day gap). The 54-day gap between Apr 8 and May 31 exceeded the 30-day grace and split the record into two distinct episodes.",
      "timeline_spec": {
        "title": "Two treatment episodes for one rivaroxaban patient (grace = 30 days, cap-at-1)",
        "window": {
          "start": "2024-01-05",
          "end": "2024-07-30",
          "label": "Observation window: Jan 5 – Jul 30, 2024"
        },
        "events": [
          {
            "label": "Fill A",
            "start": "2024-01-05",
            "length_days": 30,
            "quantity": "30 days_supply"
          },
          {
            "label": "Fill B (early refill, 3-day overlap)",
            "start": "2024-02-01",
            "length_days": 30,
            "quantity": "30 days_supply"
          },
          {
            "label": "Fill C (8-day gap, within grace)",
            "start": "2024-03-10",
            "length_days": 30,
            "quantity": "30 days_supply"
          },
          {
            "label": "Fill D (new episode starts)",
            "start": "2024-06-01",
            "length_days": 30,
            "quantity": "30 days_supply"
          },
          {
            "label": "Fill E (1-day gap, within grace)",
            "start": "2024-07-01",
            "length_days": 30,
            "quantity": "30 days_supply"
          }
        ],
        "spans": [
          {
            "kind": "covered",
            "start": "2024-01-05",
            "end": "2024-04-07",
            "label": "Episode 1: 94 covered days (Fills A+B+C merged)"
          },
          {
            "kind": "gap",
            "start": "2024-04-08",
            "end": "2024-05-31",
            "label": "54-day gap → exceeds 30-day grace → new episode"
          },
          {
            "kind": "covered",
            "start": "2024-06-01",
            "end": "2024-07-30",
            "label": "Episode 2: 60 covered days (Fills D+E merged)"
          }
        ],
        "result": {
          "label": "2 episodes | 94 + 60 = 154 total exposed days",
          "value": 154
        },
        "caption": "Three fills close enough together (gaps of 0 and 8 days, both under the 30-day grace) merge into one 94-day episode. A 54-day gap then exceeds the grace, closing Episode 1 and opening Episode 2 when the patient refills in June. Episode 2's two fills are separated by only 1 day and merge into a 60-day episode.",
        "alt_text": "Horizontal timeline from January to July 2024. Five fill bars are shown. Fills A, B, and C are connected by a green 'Episode 1' span covering Jan 5 to Apr 7 (94 days). A red gap bar covers Apr 8 to May 31 (54 days) labeled 'gap exceeds grace.' Fills D and E are connected by a second green 'Episode 2' span covering Jun 1 to Jul 30 (60 days)."
      }
    },
    "prerequisites": [
      "continuous-enrollment-observable-time-rwe",
      "grace-period-gap-rules-rwe",
      "stockpiling-carryover-rules-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Fixed grace period (refill-gap stitching)",
        "description": "Fills are joined into one episode whenever the next fill_date falls within the prior projected supply-end plus a fixed grace (commonly 30-60 days, or one-half the modal days_supply); a longer gap closes the episode and the next fill starts a new one.",
        "edge_cases": [
          "A grace that is too short fragments a single course into multiple episodes and fabricates discontinuations/restarts.",
          "A grace that is too long carries exposure status past the last pill, attributing post-stopping person-time and events to exposure (immortal-time-like over-counting in as-treated analyses)."
        ],
        "data_source_notes": "claims: grace operates on fill_date + days_supply; require continuous pharmacy enrollment so a gap is a true holiday, not unobserved MA/capitated person-time."
      },
      {
        "name": "Carry-over (stockpiling allowed) vs. cap-at-1",
        "description": "When a refill overlaps remaining supply, carry-over rolls the surplus forward (advancing supply-end), whereas cap-at-1 discards surplus so covered days never exceed calendar days.",
        "edge_cases": [
          "Carry-over can inflate persistence and push PDC/MPR above 1.0 when early refills reflect 90-day-mail switching or gaming rather than genuine hoarding.",
          "Cap-at-1 is conservative and standard for adherence metrics; report both as sensitivity analyses."
        ],
        "data_source_notes": "claims: detect overlap from fill_date < prior supply_end; 90-day mail-order routinely creates benign overlap that must not be read as nonpersistence."
      },
      {
        "name": "Inpatient bridging",
        "description": "Suspend the gap clock (or extend supply-end) across hospitalization days identified from facility claims, so an admission does not split an episode or create a spurious discharge \"restart\".",
        "edge_cases": [
          "Inpatient-administered drugs are bundled in facility claims and produce no outpatient fill, so an unbridged algorithm sees a false gap.",
          "Over-bridging long admissions can mask genuine discontinuation that occurred during the stay."
        ],
        "data_source_notes": "claims: identify inpatient spans from institutional/facility claims; EHR: use admission/discharge dates and inpatient MAR to bridge."
      },
      {
        "name": "Restart / rechallenge as a new episode",
        "description": "A gap exceeding the grace closes the episode; a later fill of the same drug opens a distinct new episode that can be analyzed as a restart/rechallenge rather than continuous use.",
        "edge_cases": [
          "Distinguishing a true restart from late refill lateness depends entirely on the grace threshold.",
          "Outcomes that themselves cause discontinuation will appear just before the gap, biasing as-treated windows if exposure is carried forward."
        ],
        "data_source_notes": "claims: a new episode_start after a closed episode flags a restart; link to diagnoses to interpret why therapy stopped."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Fixed-window ever-exposed (carry exposure for a set period after index regardless of refills)",
        "pros_of_this": "Tracks actual supply and discontinuation, so person-time during drug holidays and after stopping is correctly classified as unexposed.",
        "cons_of_this": "Requires choosing grace and carry-over parameters and reliable days_supply; more code and more assumptions.",
        "when_to_prefer": "Outcomes that can occur during gaps or after stopping; any as-treated/per-protocol or time-varying exposure analysis."
      },
      {
        "compared_to": "Zero-grace single-fill current-use exposure",
        "pros_of_this": "A modest grace preserves person-time around normal refill lateness, avoids artificial discontinuations, and retains power.",
        "cons_of_this": "Introduces a permissible-gap parameter whose value materially changes episode counts and durations (Gardarsdottir effect).",
        "when_to_prefer": "Chronic maintenance therapies where refills are routinely a few days to weeks late."
      },
      {
        "compared_to": "Carry-over (stockpiling allowed) episode rule",
        "pros_of_this": "Cap-at-1 is conservative, bounds covered days at calendar days, and is the standard for adherence (PDC/MPR).",
        "cons_of_this": "Cap-at-1 can understate true exposure for genuine hoarders; carry-over better reflects real consumption when oversupply is real.",
        "when_to_prefer": "Use cap-at-1 for adherence metrics; use carry-over (as sensitivity) for persistence when oversupply is plausibly genuine."
      }
    ],
    "implementation_notes_by_data_source": {},
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\n\nGRACE = 30  # permissible gap (days) between supply-end and next fill that still counts as continuous\n\ndef build_episodes(rx: pd.DataFrame, grace: int = GRACE, carry_over: bool = False) -> pd.DataFrame:\n    rx = rx.sort_values([\"person_id\", \"fill_date\"]).copy()\n    out = []\n    for pid, g in rx.groupby(\"person_id\", sort=False):\n        ep_id = 0\n        ep_start = None          # first fill_date of the open episode\n        supply_end = None        # running projected end of available supply\n        n_fills = 0\n        total_supply = 0\n        last_close = None        # supply_end of the just-closed episode, for gap reporting\n        for _, row in g.iterrows():\n            f, dsup = row[\"fill_date\"], int(row[\"days_supply\"])\n            proj_end = f + pd.Timedelta(days=dsup)\n            if ep_start is None:\n                ep_start, supply_end = f, proj_end\n                n_fills, total_supply = 1, dsup\n                continue\n            # Within grace of the running supply-end -> same episode.\n            if f <= supply_end + pd.Timedelta(days=grace):\n                if carry_over:\n                    # Roll surplus forward: extend from the later of (current supply-end, this fill).\n                    supply_end = max(supply_end, f) + pd.Timedelta(days=dsup)\n                else:\n                    # Cap-at-1: no double counting of overlapping days.\n                    supply_end = max(supply_end + pd.Timedelta(days=dsup), proj_end)\n                n_fills += 1\n                total_supply += dsup\n            else:\n                # Gap exceeds grace -> close current episode, open a new one.\n                out.append((pid, ep_id, ep_start, supply_end, n_fills, total_supply,\n                            (f - supply_end).days))\n                ep_id += 1\n                ep_start, supply_end = f, proj_end\n                n_fills, total_supply = 1, dsup\n        if ep_start is not 
None:\n            out.append((pid, ep_id, ep_start, supply_end, n_fills, total_supply, None))\n    return pd.DataFrame(out, columns=[\"person_id\", \"episode_id\", \"episode_start\",\n                                      \"episode_end\", \"n_fills\", \"total_days_supply\", \"gap_days\"])",
        "description": "Construct continuous treatment episodes from claims-style pharmacy fills. Required input (already cleaned,\nde-duplicated, and restricted to the target drug-class NDCs):\n  rx : one row per fill -> person_id, fill_date (datetime64), days_supply (int)\nOutput: one row per episode -> person_id, episode_id, episode_start, episode_end, n_fills, total_days_supply, gap_days.\nGRACE is the permissible gap; carry_over=True rolls surplus supply forward (stockpiling), False caps at calendar days.\nApply enrollment/death/end-of-data censoring to episode_end downstream; bridge inpatient spans before calling this if\nfacility claims are available.",
        "dependencies": [
          "pandas"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\n\nbuild_episodes <- function(rx, grace = 30L, carry_over = FALSE) {\n  setDT(rx)\n  setorder(rx, person_id, fill_date)\n\n  one_person <- function(fd, ds) {\n    n <- length(fd)\n    ep <- integer(n); ep_start <- as.Date(rep(NA, n)); supply_end <- as.Date(rep(NA, n))\n    cur_ep <- 0L; cur_start <- fd[1]; cur_end <- fd[1] + ds[1]\n    ep[1] <- 0L; ep_start[1] <- cur_start; supply_end[1] <- cur_end\n    if (n >= 2L) for (i in 2:n) {\n      if (fd[i] <= cur_end + grace) {                       # within grace -> same episode\n        cur_end <- if (carry_over) max(cur_end, fd[i]) + ds[i]\n                   else max(cur_end + ds[i], fd[i] + ds[i]) # cap-at-1: no overlap double-count\n      } else {                                              # gap exceeds grace -> new episode\n        cur_ep <- cur_ep + 1L; cur_start <- fd[i]; cur_end <- fd[i] + ds[i]\n      }\n      ep[i] <- cur_ep; ep_start[i] <- cur_start; supply_end[i] <- cur_end\n    }\n    data.table(episode_id = ep, episode_start = ep_start, supply_end = supply_end)\n  }\n\n  ep <- rx[, one_person(fill_date, days_supply), by = person_id]\n  rx[, c(\"episode_id\", \"episode_start\", \"episode_end\") :=\n       .(ep$episode_id, ep$episode_start, ep$supply_end)]\n\n  out <- rx[, .(episode_start = episode_start[1],\n                episode_end   = max(episode_end),\n                n_fills       = .N,\n                total_days_supply = sum(days_supply)),\n            by = .(person_id, episode_id)]\n  setorder(out, person_id, episode_id)\n  out[, gap_days := as.integer(shift(episode_start, type = \"lead\") - episode_end),\n      by = person_id]\n  out[]\n}",
        "description": "Construct treatment episodes with data.table by-group processing. Input mirrors the Python version:\n  rx : person_id, fill_date (Date), days_supply (integer), restricted to the target drug-class NDCs.\nOutput: person_id, episode_id, episode_start, episode_end, n_fills, total_days_supply, gap_days.\ngrace = permissible gap (days); carry_over rolls surplus supply forward (stockpiling) when TRUE, caps when FALSE.",
        "dependencies": [
          "data.table"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "%let grace     = 30;   /* permissible gap (days) */\n%let carryover = 0;    /* 1 = roll surplus forward (stockpiling); 0 = cap-at-1 */\n\nproc sort data=work.rx; by person_id fill_date; run;\n\ndata work.episodes(keep=person_id episode_id episode_start episode_end\n                        n_fills total_days_supply gap_days);\n  set work.rx;\n  by person_id;\n  retain episode_id episode_start supply_end n_fills total_days_supply prev_end;\n  format episode_start episode_end supply_end date9.;\n\n  proj_end = fill_date + days_supply;\n\n  if first.person_id then do;                       /* open the first episode */\n    episode_id = 0; episode_start = fill_date; supply_end = proj_end;\n    n_fills = 1; total_days_supply = days_supply; prev_end = .;\n  end;\n  else if fill_date <= supply_end + &grace then do; /* within grace -> same episode */\n    if &carryover then supply_end = max(supply_end, fill_date) + days_supply;\n    else               supply_end = max(supply_end + days_supply, proj_end); /* cap-at-1 */\n    n_fills + 1; total_days_supply + days_supply;\n  end;\n  else do;                                          /* gap exceeds grace -> emit + reopen */\n    gap_days = .;                                    /* gap on the row that closes follows below */\n    episode_end = supply_end;\n    prev_end = supply_end; output;\n    episode_id + 1; episode_start = fill_date; supply_end = proj_end;\n    n_fills = 1; total_days_supply = days_supply;\n    gap_days = fill_date - prev_end;                 /* gap that triggered the new episode */\n  end;\n\n  if last.person_id then do;                        /* flush the open episode */\n    episode_end = supply_end; output;\n  end;\nrun;",
        "description": "Construct treatment episodes in a single DATA step using RETAIN + BY-group processing (the natural SAS idiom for\ngap-then-collapse). Required input (cleaned, de-duplicated, target drug-class only):\n  work.rx : person_id, fill_date (SAS date), days_supply (numeric), sorted by person_id fill_date.\nOutput work.episodes : person_id, episode_id, episode_start, episode_end, n_fills, total_days_supply, gap_days.\nSet &grace to the permissible gap; toggle &carryover (1=stockpiling, 0=cap-at-1). Censor episode_end to enrollment/\ndeath/end-of-data downstream; bridge inpatient spans before this step if facility claims are available.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": "exposure-episode-construction-rwe-timeline.svg",
        "mermaid": null,
        "caption": "Three fills close enough together (gaps of 0 and 8 days, both under the 30-day grace) merge into one 94-day episode. A 54-day gap then exceeds the grace, closing Episode 1 and opening Episode 2 when the patient refills in June. Episode 2's two fills are separated by only 1 day and merge into a 60-day episode.",
        "alt_text": "Horizontal timeline from January to July 2024. Five fill bars are shown. Fills A, B, and C are connected by a green 'Episode 1' span covering Jan 5 to Apr 7 (94 days). A red gap bar covers Apr 8 to May 31 (54 days) labeled 'gap exceeds grace.' Fills D and E are connected by a second green 'Episode 2' span covering Jun 1 to Jul 30 (60 days).",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Start[Sorted fills for one person<br/>fill_date + days_supply] --> Proj[Project supply_end = fill_date + days_supply]\n  Proj --> Q{Next fill_date <= supply_end + grace?}\n  Q -->|Yes: within grace| Carry{Carry-over rule?}\n  Carry -->|Stockpiling on| Roll[Roll surplus forward<br/>advance supply_end] --> Same[Same episode: n_fills++]\n  Carry -->|Cap-at-1| Cap[No overlap double-count<br/>extend supply_end] --> Same\n  Q -->|No: gap exceeds grace| Close[Close episode at last supply_end]\n  Close --> New[Open NEW episode = restart/rechallenge]\n  Same --> Q\n  New --> Q\n  Close --> Bridge[[If gap falls inside an inpatient span:<br/>bridge instead of closing]]",
        "caption": "Decision logic for collapsing a refill history into treatment episodes via the grace period, the stockpiling/ carry-over rule, and inpatient bridging.",
        "alt_text": "Flowchart showing for each fill whether the next fill is within the grace period (same episode, with carry-over or cap-at-1 supply update) or beyond it (close the episode and open a new one), with an inpatient-bridging branch.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "gantt\n  title One person's fills collapsed into two episodes (grace = 30d)\n  dateFormat YYYY-MM-DD\n  axisFormat %b %Y\n  section Episode 1\n  Fill 1 supply (90d) :done, f1, 2024-01-01, 90d\n  Fill 2 supply (90d, refilled within grace) :active, f2, 2024-04-05, 90d\n  section Gap > grace\n  Drug holiday (> 30d after supply-end) :crit, gap, 2024-07-04, 75d\n  section Episode 2 (restart)\n  Fill 3 supply (90d) :done, f3, 2024-09-17, 90d",
        "caption": "A late-but-within-grace refill keeps fills 1 and 2 in one episode; a gap exceeding the 30-day grace closes episode 1 and the later fill opens episode 2 as a restart.",
        "alt_text": "Gantt timeline showing two 90-day fills joined into episode 1 because the second refill is within the grace period, a gap longer than the grace, and a third fill that starts a second episode.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "requires",
        "target_slug": "continuous-enrollment-observable-time-rwe",
        "notes": "A gap in fills is only a true drug holiday when pharmacy enrollment is continuous across it; otherwise the gap is unobserved person-time and must not be classified as off-treatment."
      },
      {
        "relation_type": "part_of",
        "target_slug": "grace-period-gap-rules-rwe",
        "notes": "The permissible-gap (grace) rule is the single most consequential parameter of episode construction."
      },
      {
        "relation_type": "part_of",
        "target_slug": "stockpiling-carryover-rules-rwe",
        "notes": "The carry-over vs. cap-at-1 decision governs how overlapping (early) refills extend an episode."
      },
      {
        "relation_type": "used_with",
        "target_slug": "inpatient-bridging-exposure-rwe",
        "notes": "Hospitalization suspends outpatient fills; bridging prevents an admission from falsely splitting an episode."
      },
      {
        "relation_type": "produces",
        "target_slug": "as-treated-risk-window-construction-rwe",
        "notes": "Constructed episodes define the on-treatment intervals over which as-treated person-time and events are counted."
      },
      {
        "relation_type": "produces",
        "target_slug": "restart-rechallenge-new-episode-rwe",
        "notes": "A gap exceeding the grace closes one episode and opens the next, which is analyzed as a restart/rechallenge."
      },
      {
        "relation_type": "see_also",
        "target_slug": "switch-add-on-augmentation-rwe",
        "notes": "Switching or augmentation during an episode requires multi-drug episode logic beyond single-drug gap stitching."
      },
      {
        "relation_type": "see_also",
        "target_slug": "washout-clean-lookback-period-rwe",
        "notes": "A new-user washout is operationally an absence-of-episode check in the lookback window."
      },
      {
        "relation_type": "used_with",
        "target_slug": "exposure-lag-induction-latency-window-rwe",
        "notes": "Induction/latency lags are applied to constructed episodes when the exposure cannot plausibly cause the outcome immediately."
      }
    ],
    "aliases": [
      "exposure episodes",
      "treatment episodes",
      "drug treatment episodes",
      "drug episode construction",
      "treatment episode construction",
      "treatment-episode-construction"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta",
      "pqa-cms"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "exposure-lag-induction-latency-window-rwe",
    "name": "Exposure Lag, Induction, and Latency Windows",
    "short_definition": "An exposure-definition rule that maps calendar exposure dates onto the biologically relevant at-risk window using a lag (exclusion) period to remove protopathic/reverse-causation time, an induction period that is the minimum delay from cause to effect, and a latency period that is the delay from effect onset to clinical detection.",
    "long_description": "**Exposure lag, induction, and latency windows** are the three distinct time parameters that translate a raw\nexposure date (a pharmacy `fill_date`, an `index_date`, a procedure date) into the calendar interval during which that\nexposure could plausibly *cause* and during which an outcome could plausibly be *detected*. They are routinely\nconflated, but they are biologically and operationally separable, and confusing them produces some of the most common\nself-inflicted biases in pharmacoepidemiology. The parameters belong in the protocol estimand and code list **before\nany programming**, because each one changes who is at risk, when person-time accrues, and which events are counted.\n\n**Core conceptual distinction** — three parameters, three different jobs.\n(1) *Lag (exclusion / blanking) period*: a span of follow-up immediately after exposure start during which outcomes\nare **not** attributed to the exposure. Its purpose is to remove **protopathic bias / reverse causation** — the\noutcome (or its prodrome) was already brewing and *caused* the prescription, so the drug looks falsely associated\n(Tamim 2007). Operationally, a lag *drops the first L days of follow-up*. (2) *Induction period*: the **minimum**\nbiological time that must elapse between a causal exposure and the outcome it can produce; exposure occurring within\nthe induction interval before an event *cannot* have caused that event and should not count as etiologically relevant\nexposure (Rothman 1981). Operationally, induction *shifts the start of the at-risk window forward* — current exposure\nstatus is `[exposure_start + induction, ...]`. (3) *Latency period*: the time from biological effect onset to clinical\ndetection/recording; long latency means an event recorded today maps to exposure years ago, so latency *widens the\nrelevant exposure lookback* (e.g., unopposed estrogen and endometrial cancer require a multi-year window). 
The\nestimand distinction matters: with a lag you are estimating a per-protocol effect on outcomes that can be caused\n*after* the blanking window; with an induction window you are testing a specific etiologic hypothesis (effects only\nappear after a minimum delay). The two are not interchangeable: an induction window that is too short or too long\nmisclassifies etiologically relevant exposure and tends to bias toward the null, whereas a lag that is too short leaves\nprotopathic bias in place and tends to bias away from it.\n\n**Pros, cons, and trade-offs**\n- **vs no time-window adjustment (raw exposure = at-risk on the fill date):** Lag/induction/latency rules remove\n  protopathic bias and biologically implausible \"instant-cause\" attributions that an unadjusted current-use definition\n  builds in. Cost: they discard person-time and events, reducing power, and every threshold is a judgment call that\n  must be defended and varied in sensitivity analysis. **Prefer a lag/induction rule** whenever reverse causation or a\n  known biological delay is plausible (almost all chronic-disease and oncology questions, and any acute outcome that\n  can prompt the prescription, e.g., GI symptoms before an NSAID, dyspnea before a respiratory drug).\n- **vs immortal-time fixes (the symmetric error):** A lag period drops early follow-up to *avoid* counting events;\n  done carelessly it can re-create **immortal time** by guaranteeing survival through the lag for those who are\n  classified as exposed, biasing toward benefit (Suissa 2008). The correct implementation applies the lag identically\n  to both arms and to the time axis, not to eligibility. **Prefer** explicitly modeling the lag as a delayed-entry /\n  left-truncation interval so the immortal interval is excluded from *both* numerator and denominator.\n- **vs a single fixed window for everything:** A one-size window (e.g., \"current use = days_supply only\") is simple and\n  reproducible but biologically wrong when the cause-effect delay differs from the supply duration. 
Cost of the\n  nuanced approach: more programming, more diagnostics, and the need to pre-specify induction/latency by hypothesis.\n  **Prefer the nuanced rule** for regulatory- or HTA-grade safety/effectiveness work; the simple rule may suffice for\n  purely descriptive utilization counts.\n\n**When to use** — apply an explicit lag/induction/latency specification when: (a) the outcome can plausibly cause the\nexposure (protopathic bias) — apply a lag; (b) there is a known minimum biological delay from exposure to effect —\napply an induction window; (c) the disease has a long detectable-preclinical phase or long etiologic latency\n(cancers, fibrosis, cumulative-dose toxicities) — widen the exposure lookback to the latency horizon; (d) you are\nbuilding a regulatory-grade comparative safety study where a reviewer (FDA Sentinel, EMA) will probe whether early\nevents are reverse-causation artifacts. In all cases pre-specify the primary window and a grid of sensitivity windows.\n\n**When NOT to use — and when it is actively misleading or dangerous**\n- **Acute, immediate-onset effects (anaphylaxis, acute injection-site reaction, hypoglycemia, day-1 bleeding from an\n  anticoagulant).** Imposing an induction lag here *removes the very events the drug causes* and biases toward the\n  null. Do not lag an outcome whose causal mechanism is immediate.\n- **Using a lag to \"clean up\" a noisy signal without a reverse-causation rationale.** A lag chosen post hoc to make a\n  safety signal disappear is data dredging; the lag must be motivated by biology and fixed in the protocol.\n- **Lag applied to eligibility rather than to the time axis.** Requiring patients to survive (and stay enrolled)\n  through the lag to be \"exposed\" manufactures immortal time and a spurious protective effect (Suissa 2008). 
This is\n  the most dangerous failure mode.\n- **Long-latency cancer outcomes with short data windows.** If the database covers 2–3 years but the etiologic latency\n  is 5–15 years, you cannot capture the relevant exposure; the study is uninformative and a non-null finding is more\n  likely contamination (recent exposure, surveillance bias) than a true late effect.\n\n**Data-source operational depth**\n- **Claims (FFS):** Exposure is the pharmacy fill (`fill_date` + `days_supply`); the at-risk window is\n  `[fill_date + lag_days, fill_date + days_supply + carryover]`, and for an induction hypothesis events within\n  `induction_days` of `fill_date` are not attributed. Failure modes: claim-adjudication lag and reversed claims mean\n  `fill_date` is not the ingestion date; **Medicare Advantage (MA)-only person-time lacks fee-for-service claims**, so\n  an apparently exposure-free lag interval can be unobserved time rather than true non-exposure — restrict to A/B/D\n  (or commercial medical+pharmacy) enrollees and exclude MA-only spans. Diagnosis dates carry their own coding lag, so\n  a \"first GI bleed\" date may trail the clinical event, shrinking an apparent protopathic signal.\n- **EHR:** Exposure is the *order/administration*, not the fill, and the outcome is **encounter-driven** — a long\n  latency window straddles periods where the patient may not have visited, so an event \"detected\" at a late visit is\n  really an event that occurred earlier (latency is partly a recording artifact here). Confirm starts against linked\n  dispensing; treat between-visit gaps as informative when defining lag follow-up.\n- **Registry:** Adjudicated outcomes with clean event dates are a strength (good for induction tests), but registry\n  **adjudication lag** can correlate with exposure (sicker, more-treated patients are reviewed faster), distorting the\n  apparent induction interval; check whether adjudication date vs event date differs by arm. 
Registries are usually\n  weak for complete exposure history — link to claims to populate a latency-length lookback.\n- **Linked claims–EHR–vital records:** Best substrate for long-latency questions (EHR severity + claims completeness +\n  death index), but order/fill/service-date discrepancies must be reconciled before choosing the exposure anchor, and\n  differential **competing risk of death** by exposure (common in elderly claims) censors long latency windows\n  unequally — use cause-specific or subdistribution handling rather than naive censoring.\n\n**Worked claims example (protopathic lag + induction).** Question: does a newly initiated NSAID raise the rate of\nhospitalized upper-GI bleed in a commercial + Medicare FFS database? Naive \"current use = NSAID days_supply\" overcounts\nbecause GI discomfort from an *existing* bleed prompts the NSAID (or its co-prescribed PPI workup), i.e. protopathic\nbias. Specification: (1) Eligibility: ≥365 days continuous A/B/D (or commercial medical+pharmacy) enrollment before the\nfirst NSAID fill (`index_date`); exclude MA-only person-time so the washout and lag are observed, not missing.\n(2) New-user washout: no NSAID fill in the 365-day lookback. (3) Lag: count person-time and bleed events only from\n`index_date + 30` (the 30-day blank period removes bleeds that were prodromal to the prescription — Tamim 2007).\n(4) Exposure window: a fill contributes at-risk time over `[fill_date + 30, fill_date + days_supply + 14]` (14-day\ncarryover/grace for stockpiling); overlapping fills are stitched into a continuous episode. (5) Outcome: first\ninpatient upper-GI-bleed claim (validated dx in primary position), the *event date* taken as admission date to limit\ncoding lag. (6) Censor at disenrollment, death, end of data, and exposure-episode end + grace. (7) Sensitivity: rerun\nwith lag = 0, 14, 60, 90 days and report how the rate ratio moves — a signal that only exists at lag 0 is the\nprotopathic-bias fingerprint. 
Contrast this with a *long-latency* variant (unopposed estrogen → endometrial cancer):\nthere the relevant change is not a 30-day lag but a multi-year **induction and latency window** (count only exposure\n5+ years before the cancer date) and a lookback long enough to capture it; a 2-year claims extract cannot answer that\nquestion.",
    "primary_category": "Exposure_Definition",
    "tags": [
      "exposure-definition",
      "exposure-episode-construction",
      "lag-time",
      "induction-period",
      "latency-period",
      "protopathic-bias",
      "reverse-causation",
      "pharmacoepidemiology"
    ],
    "applies_to_study_types": [
      "claims_analysis",
      "ehr_study",
      "registry_linkage",
      "target_trial_emulation",
      "comparative_effectiveness"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1093/oxfordjournals.aje.a113189",
        "url": "https://doi.org/10.1093/oxfordjournals.aje.a113189",
        "citation_text": "Rothman KJ. Induction and latent periods. American Journal of Epidemiology. 1981;114(2):253-259.",
        "year": 1981,
        "authors_short": "Rothman",
        "notes": "Foundational definition separating the induction period (cause to effect) from the latent period (effect onset to detection); the conceptual basis for all exposure-window specification."
      },
      {
        "role": "explain",
        "doi": "10.1002/pds.1360",
        "url": "https://doi.org/10.1002/pds.1360",
        "citation_text": "Tamim H, Monfared AAB, LeLorier J. Application of lag-time into exposure definitions to control for protopathic bias. Pharmacoepidemiology and Drug Safety. 2007;16(3):250-258.",
        "year": 2007,
        "authors_short": "Tamim et al.",
        "notes": "Operationalizes the lag (exclusion) period in claims exposure definitions to remove protopathic bias / reverse causation; the practical how-to for the lag parameter."
      },
      {
        "role": "explain",
        "doi": "10.1093/aje/kwm324",
        "url": "https://doi.org/10.1093/aje/kwm324",
        "citation_text": "Suissa S. Immortal time bias in pharmacoepidemiology. American Journal of Epidemiology. 2008;167(4):492-499.",
        "year": 2008,
        "authors_short": "Suissa",
        "notes": "Shows how mishandled time windows (applying a lag/wait period to eligibility rather than the time axis) manufacture immortal time and spurious protective effects; the cautionary counterpart to lagging."
      }
    ],
    "plain_language_summary": "When a drug is prescribed, there is a short period right after the prescription where any disease event was probably already brewing before the drug could have caused it — so researchers deliberately skip that early window and start counting only after it ends; this skipped window is called the lag period. Beyond the lag, there is also a minimum biological delay — the induction period — before a drug can actually cause a disease event, so events happening too soon after a prescription start are not counted as caused by the drug. A third parameter, the latency period, accounts for the gap between when a disease actually begins in the body and when it shows up in a medical record or database. Together, these three time rules help researchers count only the events that could genuinely have been caused by the drug, rather than events that were already happening for other reasons.",
    "key_terms": [
      {
        "term": "lag period",
        "definition": "The chunk of time immediately after a prescription start that researchers skip over when counting disease events, because events in that window were probably already developing before the drug could have caused anything."
      },
      {
        "term": "induction period",
        "definition": "The minimum biological time that must pass between taking a drug and the drug being able to cause a disease event; events that occur before this minimum delay are not attributed to the drug."
      },
      {
        "term": "latency period",
        "definition": "The delay between when a disease actually begins in the body and when a doctor diagnoses it and it appears in the data; for slow-developing diseases like cancer, this can be years."
      },
      {
        "term": "index date",
        "definition": "The patient's day-zero in a study — usually the date of their very first fill of the drug being studied."
      },
      {
        "term": "at-risk window",
        "definition": "The stretch of calendar time, after the lag and induction periods have been excluded, during which a disease event would be attributed to the drug exposure."
      },
      {
        "term": "protopathic bias",
        "definition": "A distortion that makes a drug look harmful when, in fact, early symptoms of the disease had already prompted the prescription before the drug had any time to act."
      }
    ],
    "worked_example": {
      "scenario": "Patient 2001 starts a new NSAID (naproxen) on 2023-01-01 for pain. Researchers want to know whether NSAID use raises the rate of hospitalized stomach bleeding. Because stomach discomfort from an existing bleed can prompt a doctor to prescribe an NSAID in the first place, any bleed recorded in the first 30 days after the prescription is suspicious — it may have been the reason for the prescription, not a consequence of it. The study therefore applies a 30-day lag: follow-up and event counting start on day 31 (2023-01-31). The drug also needs a minimum biological dwell time to cause mucosal damage, so an induction period of 30 days is also applied: the at-risk window does not open until 30 days after the fill date. Because both the lag and the induction period are 30 days here, the at-risk window opens on 2023-01-31. The fill covers 90 days of supply, so the exposure episode runs through 2023-04-01 (day 91). The combined at-risk window is therefore 2023-01-31 to 2023-04-01, a span of 61 days. A bleed recorded on 2023-02-15 falls inside that window and is counted; a bleed recorded on 2023-01-20 falls inside the lag and is not counted.",
      "dataset": {
        "caption": "Claims pharmacy and event rows for patient 2001 as an analyst would see them.",
        "columns": [
          "person_id",
          "fill_date",
          "drug",
          "days_supply",
          "event_date",
          "event_type"
        ],
        "rows": [
          [
            2001,
            "2023-01-01",
            "naproxen",
            90,
            "2023-01-20",
            "upper-GI bleed (in lag — NOT counted)"
          ],
          [
            2001,
            "2023-01-01",
            "naproxen",
            90,
            "2023-02-15",
            "upper-GI bleed (in at-risk window — counted)"
          ]
        ]
      },
      "steps": [
        "Fill date is 2023-01-01 (the index date); the 90-day supply means the exposure episode runs from 2023-01-01 through 2023-04-01.",
        "Apply the 30-day lag: the first 30 days of follow-up (2023-01-01 through 2023-01-30) are blanked out; any event in this window is treated as potentially protopathic and excluded.",
        "Apply the 30-day induction period: because the lag and induction period are both 30 days and start from the same fill date, the at-risk window opens on the same date — 2023-01-31.",
        "The at-risk window runs from 2023-01-31 to 2023-04-01, a span of 61 days (31 Jan through 1 Apr inclusive).",
        "The bleed on 2023-01-20 is on day 19 after the fill — inside the 30-day lag — so it is not attributed to the NSAID.",
        "The bleed on 2023-02-15 is on day 45 after the fill — past both the lag and the induction period — so it falls inside the at-risk window and is counted as a candidate outcome."
      ],
      "result": "Lagged at-risk window = 2023-01-31 to 2023-04-01 (61 days). The bleed on 2023-01-20 is excluded (inside lag). The bleed on 2023-02-15 is the attributable event: it falls on day 45, which is after the 30-day lag and after the 30-day induction period, inside the 61-day at-risk window.",
      "timeline_spec": {
        "title": "Lag and induction period for one NSAID user (30-day lag, 30-day induction, 90-day supply)",
        "window": {
          "start": "2023-01-01",
          "end": "2023-04-01",
          "label": "Exposure episode: 90-day naproxen supply"
        },
        "events": [
          {
            "label": "Drug start (index date)",
            "start": "2023-01-01",
            "length_days": 1,
            "quantity": "Day 0 — fill_date"
          },
          {
            "label": "Bleed A (day 19 — in lag, NOT counted)",
            "start": "2023-01-20",
            "length_days": 1,
            "quantity": "Event inside lag"
          },
          {
            "label": "Bleed B (day 45 — counted)",
            "start": "2023-02-15",
            "length_days": 1,
            "quantity": "Attributable event"
          }
        ],
        "spans": [
          {
            "kind": "unexposed",
            "start": "2023-01-01",
            "end": "2023-01-30",
            "label": "30-day lag / induction period (blanked — events NOT counted)"
          },
          {
            "kind": "exposed",
            "start": "2023-01-31",
            "end": "2023-04-01",
            "label": "At-risk window: 61 days (lag + induction cleared)"
          }
        ],
        "result": {
          "label": "At-risk window = 61 days; Bleed A excluded (day 19, inside lag); Bleed B counted (day 45, inside at-risk window)",
          "value": 61
        },
        "caption": "After a 30-day lag and 30-day induction period from the 2023-01-01 fill date, the at-risk window opens on 2023-01-31 and closes with the end of the 90-day supply on 2023-04-01. The day-19 bleed falls inside the blanked lag interval and is excluded; the day-45 bleed falls inside the at-risk window and is counted as an attributable event.",
        "alt_text": "Horizontal timeline starting 2023-01-01. A shaded block labeled 30-day lag and induction period covers 2023-01-01 through 2023-01-30, with a marker for Bleed A on 2023-01-20 shown as excluded. An open block labeled at-risk window runs from 2023-01-31 to 2023-04-01, with a marker for Bleed B on 2023-02-15 shown as counted."
      }
    },
    "prerequisites": [
      "exposure-episode-construction-rwe",
      "new-user-design",
      "time-zero-index-date-alignment-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Lag (exclusion / blanking) period for protopathic bias",
        "description": "Drop the first L days of follow-up after exposure start so prodromal events that prompted the prescription are not attributed to the drug. Applied to the time axis identically in all arms, not to eligibility.",
        "edge_cases": [
          "Choosing L post hoc to attenuate a signal (data dredging) — L must be biology-motivated and pre-specified.",
          "Applying the lag to eligibility (requiring survival through L to be 'exposed') re-creates immortal time bias.",
          "Outcome coding lag can already blunt protopathic signal; use event/admission date, not claim paid date."
        ],
        "data_source_notes": "claims: shift follow-up start to fill_date + L; keep raw and derived dates. ehr: anchor on order/administration date and account for encounter-driven outcome capture during the lag."
      },
      {
        "name": "Induction period (minimum cause-to-effect delay)",
        "description": "Exposure within the induction interval before an event cannot have caused it; current-exposure status starts at exposure_start + induction. Used to test specific etiologic-delay hypotheses.",
        "edge_cases": [
          "Misspecified (too long) induction window biases toward the null by discarding true causal exposure.",
          "Inappropriate for acute immediate-onset effects (anaphylaxis, day-1 bleeding) where induction is effectively zero."
        ],
        "data_source_notes": "registry: adjudication lag can correlate with exposure and distort the apparent induction interval — compare adjudication vs event dates by arm."
      },
      {
        "name": "Latency-driven exposure lookback (long-latency outcomes)",
        "description": "For outcomes with long etiologic/detection latency (cancers, fibrosis), widen the exposure lookback to the latency horizon and count remote exposure, not just current use.",
        "edge_cases": [
          "Database history shorter than the latency horizon makes the question unanswerable; non-null findings likely reflect contamination or surveillance bias.",
          "Differential competing risk of death censors long windows unequally across exposure groups."
        ],
        "data_source_notes": "linked: link registry/EHR to long claims history and a death index; use cause-specific or subdistribution handling for the competing risk of death over long latency windows."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "No time-window adjustment (current use = exposure on the fill date)",
        "pros_of_this": "Removes protopathic bias and biologically implausible instant-cause attributions; aligns counted events with the cause-effect delay.",
        "cons_of_this": "Discards person-time and events (lower power); every threshold is a judgment call requiring sensitivity analysis.",
        "when_to_prefer": "Whenever reverse causation or a known biological delay is plausible — essentially all chronic-disease, oncology, and acute-prodrome questions, and any regulatory-grade safety study."
      },
      {
        "compared_to": "Immortal-time-bias handling (the symmetric error)",
        "pros_of_this": "A lag correctly applied to the time axis removes reverse-causation time without inventing survival advantage.",
        "cons_of_this": "A lag carelessly applied to eligibility re-creates immortal time and a spurious protective effect.",
        "when_to_prefer": "Model the lag as delayed entry / left truncation so the immortal interval is excluded from both numerator and denominator."
      },
      {
        "compared_to": "A single fixed window (days_supply only) for all outcomes",
        "pros_of_this": "Captures hypothesis-specific cause-effect delay and detection latency instead of forcing supply duration to stand in for biology.",
        "cons_of_this": "More programming, more diagnostics, and per-hypothesis specification of induction/latency.",
        "when_to_prefer": "Regulatory- or HTA-grade safety/effectiveness work; the fixed window may suffice only for descriptive utilization counts."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "At-risk window = [fill_date + lag_days, fill_date + days_supply + carryover]; for induction hypotheses do not attribute events within induction_days of fill_date. fill_date is not the adjudication date (claim lag, reversals); exclude Medicare Advantage-only person-time where FFS claims are missing so the lag/washout is observed, not absent.",
      "ehr": "Anchor on order/administration, not dispensing; outcomes are encounter-driven, so long latency windows partly measure recording delay — confirm starts against linked fills and treat between-visit gaps as informative.",
      "registry": "Clean adjudicated event dates aid induction tests, but adjudication lag can correlate with exposure (sicker patients reviewed faster) and distort the apparent induction interval; registries are weak for full exposure history.",
      "linked": "Best for long-latency questions (severity + completeness + mortality); reconcile order/fill/service dates before choosing the exposure anchor and use cause-specific/subdistribution handling for differential competing-risk death over long windows."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\nimport numpy as np\n\nLAG_DAYS       = 30   # blank early follow-up to remove protopathic/reverse-causation events (Tamim 2007)\nCARRYOVER_DAYS = 14   # stockpiling/grace appended to days_supply\nINDUCTION_DAYS = 30   # exposure within this delay before an event cannot have caused it (Rothman 1981)\n\ndef build_lagged_exposure(rx: pd.DataFrame, events: pd.DataFrame, fup: pd.DataFrame) -> pd.DataFrame:\n    rx = rx.sort_values([\"person_id\", \"fill_date\"]).copy()\n\n    # Exposure episode = [fill_date, fill_date + days_supply + carryover]; stitch overlapping fills per person.\n    rx[\"epi_start\"] = rx[\"fill_date\"]\n    rx[\"epi_end\"]   = rx[\"fill_date\"] + pd.to_timedelta(rx[\"days_supply\"] + CARRYOVER_DAYS, unit=\"D\")\n    rx[\"prev_end\"]  = rx.groupby(\"person_id\")[\"epi_end\"].shift().fillna(pd.Timestamp.min)\n    rx[\"new_block\"] = (rx[\"epi_start\"] > rx[\"prev_end\"]).astype(int)\n    rx[\"block\"]     = rx.groupby(\"person_id\")[\"new_block\"].cumsum()\n    epi = (rx.groupby([\"person_id\", \"block\"])\n             .agg(epi_start=(\"epi_start\", \"min\"), epi_end=(\"epi_end\", \"max\"))\n             .reset_index())\n\n    # Apply the LAG to the time axis: at-risk follow-up starts at index_date + LAG (delayed entry),\n    # so person-time and events in the immortal/protopathic interval are excluded from BOTH num and denom.\n    epi = epi.merge(fup, on=\"person_id\", how=\"inner\")\n    epi[\"risk_start\"] = np.maximum(epi[\"epi_start\"], epi[\"index_date\"] + pd.Timedelta(days=LAG_DAYS))\n    epi[\"risk_end\"]   = np.minimum(epi[\"epi_end\"], epi[\"fup_end\"])\n    epi = epi[epi[\"risk_end\"] > epi[\"risk_start\"]]\n    epi[\"person_days\"] = (epi[\"risk_end\"] - epi[\"risk_start\"]).dt.days\n\n    # Attribute an event only if it falls in a lagged at-risk window AND >= INDUCTION_DAYS after the contributing fill.\n    ev = events.merge(epi, on=\"person_id\", how=\"inner\")\n    ev[\"attributable\"] = 
((ev[\"event_date\"] >= ev[\"risk_start\"]) &\n                          (ev[\"event_date\"] <= ev[\"risk_end\"]) &\n                          (ev[\"event_date\"] >= ev[\"epi_start\"] + pd.Timedelta(days=INDUCTION_DAYS)))\n    attrib = ev.groupby(\"person_id\")[\"attributable\"].max().astype(int)\n\n    out = (epi.groupby(\"person_id\")[\"person_days\"].sum()\n              .to_frame(\"person_days_at_risk\")\n              .join(attrib.rename(\"attributable_event\"), how=\"left\").fillna({\"attributable_event\": 0}))\n    return out.reset_index()",
        "description": "Apply a protopathic LAG and an INDUCTION window to claims exposure, then build at-risk person-time and attributable\nevents. Required inputs (cleaned, de-duplicated):\n  rx      : person_id, fill_date (datetime64), days_supply (int)              # NSAID fills, new users only\n  events  : person_id, event_date (datetime64)                               # first inpatient UGI bleed (admission date)\n  fup     : person_id, index_date, fup_end (datetime64)                      # index = first fill; fup_end = censor date\nReturns one row per person with the lagged at-risk interval, induction-eligible event flag, and person-days at risk.",
        "dependencies": [
          "pandas",
          "numpy"
        ],
        "source_citations": [
          "tamim-2007",
          "rothman-1981"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\nLAG_DAYS <- 30L; CARRYOVER_DAYS <- 14L; INDUCTION_DAYS <- 30L\n\nbuild_lagged_exposure <- function(rx, events, fup) {\n  setDT(rx); setDT(events); setDT(fup)\n  setorder(rx, person_id, fill_date)\n\n  # Exposure episodes = [fill_date, fill_date + days_supply + carryover], stitched across overlapping fills.\n  rx[, epi_start := fill_date]\n  rx[, epi_end   := fill_date + days_supply + CARRYOVER_DAYS]\n  rx[, prev_end  := shift(epi_end), by = person_id]\n  rx[, new_block := as.integer(is.na(prev_end) | epi_start > prev_end)]\n  rx[, block     := cumsum(new_block), by = person_id]\n  epi <- rx[, .(epi_start = min(epi_start), epi_end = max(epi_end)), by = .(person_id, block)]\n\n  # Apply lag to the time axis (delayed entry at index_date + LAG); clip to follow-up.\n  epi <- merge(epi, fup, by = \"person_id\")\n  epi[, risk_start := pmax(epi_start, index_date + LAG_DAYS)]\n  epi[, risk_end   := pmin(epi_end, fup_end)]\n  epi <- epi[risk_end > risk_start]\n  epi[, person_days := as.integer(risk_end - risk_start)]\n\n  # Event counts only inside a lagged window and at least INDUCTION_DAYS after the contributing fill.\n  ev <- merge(events, epi, by = \"person_id\", allow.cartesian = TRUE)\n  ev[, attributable := as.integer(event_date >= risk_start & event_date <= risk_end &\n                                  event_date >= epi_start + INDUCTION_DAYS)]\n  attrib <- ev[, .(attributable_event = max(attributable)), by = person_id]\n\n  pd <- epi[, .(person_days_at_risk = sum(person_days)), by = person_id]\n  merge(pd, attrib, by = \"person_id\", all.x = TRUE)[is.na(attributable_event), attributable_event := 0L][]\n}",
        "description": "LAG + INDUCTION exposure construction with data.table. Inputs mirror the Python version:\n  rx     : person_id, fill_date (Date), days_supply (integer)\n  events : person_id, event_date (Date)               # first UGI bleed admission date\n  fup    : person_id, index_date (Date), fup_end (Date)\nReturns person-level lagged person-days at risk and an attributable-event flag.",
        "dependencies": [
          "data.table"
        ],
        "source_citations": [
          "tamim-2007",
          "rothman-1981"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "%let lag = 30; %let carry = 14; %let induction = 30;\n\n/* Exposure episodes = [fill_date, fill_date + days_supply + carry]; stitch overlapping fills per person. */\nproc sort data=work.rx; by person_id fill_date; run;\ndata epi;\n  set work.rx; by person_id;\n  retain epi_start epi_end;\n  _start = fill_date; _end = fill_date + days_supply + &carry;\n  if first.person_id or _start > epi_end then do;          /* gap -> new episode */\n    if not first.person_id then output;\n    epi_start = _start; epi_end = _end;\n  end;\n  else epi_end = max(epi_end, _end);                       /* overlap -> extend */\n  if last.person_id then output;\n  keep person_id epi_start epi_end;\n  format epi_start epi_end date9.;\nrun;\n\n/* Apply the LAG to the time axis (delayed entry at index_date + lag); clip to follow-up. */\nproc sql;\n  create table risk as\n  select e.person_id, e.epi_start, e.epi_end,\n         max(e.epi_start, f.index_date + &lag) as risk_start format=date9.,\n         min(e.epi_end, f.fup_end)             as risk_end   format=date9.\n  from epi e inner join work.fup f on e.person_id = f.person_id\n  having calculated risk_end > calculated risk_start;\nquit;\n\n/* Attribute an event only inside a lagged window AND >= induction days after the contributing fill. */\nproc sql;\n  create table person as\n  select r.person_id,\n         sum(r.risk_end - r.risk_start) as person_days_at_risk,\n         max(case when ev.event_date between r.risk_start and r.risk_end\n                   and ev.event_date >= r.epi_start + &induction then 1 else 0 end)\n           as attributable_event\n  from risk r left join work.events ev on r.person_id = ev.person_id\n  group by r.person_id;\nquit;\n\n/* Lagged incidence rate (events per person-year) via Poisson with a log person-time offset. 
*/\ndata person; set person; ln_pt = log(person_days_at_risk/365.25); run;\nproc genmod data=person;\n  model attributable_event = / dist=poisson link=log offset=ln_pt;\n  estimate 'log rate' intercept 1;\nrun;",
        "description": "LAG + INDUCTION exposure windows and a lagged incidence rate in SAS via PROC SQL / data step / PROC GENMOD.\nRequired input datasets (post data-management):\n  work.rx     : person_id, fill_date, days_supply       (NSAID fills, new users)\n  work.events : person_id, event_date                   (first UGI bleed admission date)\n  work.fup    : person_id, index_date, fup_end          (index = first fill; fup_end = censor)\nPROC GENMOD fits a Poisson model for the lagged rate with log person-time offset.",
        "dependencies": [],
        "source_citations": [
          "tamim-2007",
          "rothman-1981"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": "exposure-lag-induction-latency-window-rwe-timeline.svg",
        "mermaid": null,
        "caption": "After a 30-day lag and 30-day induction period from the 2023-01-01 fill date, the at-risk window opens on 2023-01-31 and closes with the end of the 90-day supply on 2023-04-01. The day-19 bleed falls inside the blanked lag interval and is excluded; the day-45 bleed falls inside the at-risk window and is counted as an attributable event.",
        "alt_text": "Horizontal timeline starting 2023-01-01. A shaded block labeled 30-day lag and induction period covers 2023-01-01 through 2023-01-30, with a marker for Bleed A on 2023-01-20 shown as excluded. An open block labeled at-risk window runs from 2023-01-31 to 2023-04-01, with a marker for Bleed B on 2023-02-15 shown as counted.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "timeline\n  title Lag vs Induction vs Latency on one patient's time axis\n  Exposure start (fill_date / index) : Drug ingested - biological clock starts\n  Lag period (e.g. 30d) : Blank follow-up - events here are protopathic / reverse-causation, not counted\n  Induction window : Minimum cause-to-effect delay - earliest a true caused event can appear\n  Caused event occurs (subclinical) : Disease biologically present but not yet detected\n  Latency period : Delay from onset to clinical detection / coding\n  Event recorded in data : What the database actually observes",
        "caption": "The three windows act at different points on the same timeline. Lag blanks the early protopathic interval, induction sets the earliest a true effect can appear, and latency separates biological onset from the recorded event date.",
        "alt_text": "Timeline from exposure start through a lag period, an induction window, subclinical event onset, a latency period, and the recorded event in data.",
        "source_type": "illustrative",
        "source_citations": [
          "rothman-1981"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Q[Outcome can plausibly cause the exposure?] -->|Yes| Lag[Apply LAG: drop first L days of follow-up<br/>on the TIME AXIS, identically in all arms]\n  Q -->|No| Acute{Effect is immediate / acute?}\n  Acute -->|Yes| NoLag[No induction lag<br/>count day-1 events]\n  Acute -->|No| Ind[Apply INDUCTION window:<br/>at-risk starts at exposure + induction]\n  Lag --> Long{Long etiologic/detection latency?}\n  Ind --> Long\n  NoLag --> Long\n  Long -->|Yes| Lat[Widen exposure lookback to latency horizon<br/>require data history >= latency]\n  Long -->|No| Cur[Current-use window = days_supply + carryover]\n  Lat --> Sens[Sensitivity: vary lag/induction grid;<br/>lag-applied-to-eligibility check for immortal time]\n  Cur --> Sens",
        "caption": "Decision logic for choosing lag, induction, and latency parameters, with the immortal-time guardrail (apply the lag to the time axis, never to eligibility).",
        "alt_text": "Decision flowchart routing from reverse-causation and acuteness questions to lag, induction, current-use, and latency-lookback choices, ending in sensitivity analyses.",
        "source_type": "illustrative",
        "source_citations": [
          "tamim-2007",
          "suissa-2008"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "is_variant_of",
        "target_slug": "exposure-episode-construction-rwe",
        "notes": "Lag/induction/latency windows are time-parameter refinements layered onto the underlying exposure-episode construction (days_supply stitching, grace periods)."
      },
      {
        "relation_type": "see_also",
        "target_slug": "immortal-time-bias-handling",
        "notes": "A lag applied to eligibility rather than the time axis re-creates immortal time; the correct fix is delayed entry / left truncation excluding the lag from both numerator and denominator."
      },
      {
        "relation_type": "used_with",
        "target_slug": "time-updated-exposures-cumulative-dose-rwe",
        "notes": "Induction and latency windows define which past exposure is etiologically relevant when accumulating time-updated dose for long-latency outcomes."
      },
      {
        "relation_type": "used_with",
        "target_slug": "negative-control-exposures-rwe",
        "notes": "Negative-control outcomes/exposures help detect residual protopathic bias or surveillance artifacts that a chosen lag/latency window fails to remove."
      },
      {
        "relation_type": "used_with",
        "target_slug": "new-user-design",
        "notes": "Lag and induction windows are applied within a new-user cohort so the exposure clock starts cleanly at first fill rather than mid-treatment."
      }
    ],
    "aliases": [
      "exposure lag period",
      "lag-time exposure definition",
      "induction period",
      "latency period",
      "induction and latent period",
      "exposure time-window definition"
    ],
    "complexity": "advanced",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "external-adjustment-validation-substudy-bias-correction-rwe",
    "name": "External Adjustment and Validation-Substudy Bias Correction",
    "short_definition": "A quantitative bias analysis approach that estimates bias parameters (sensitivity, specificity, PPV, or unmeasured-confounder strength) from an internal validation substudy or a transportable external source and uses them to correct a main-study RWE estimate for measurement error or unmeasured confounding the analytic dataset cannot resolve on its own.",
    "long_description": "**External adjustment and validation-substudy bias correction** is the empirical wing of quantitative bias analysis (QBA).\nGeneric QBA asks \"what would the estimate become *if* sensitivity were 0.78 and specificity 0.97?\" — the bias parameters are\nhypothesized. External adjustment asks the harder, more defensible question: \"can those parameters be *estimated* from real\ndata, and do they *transport* to my analytic cohort?\" The bias parameters come from one of two places: an **internal\nvalidation substudy** (a subset of the main cohort that receives gold-standard ascertainment — chart review of a claims\nendpoint algorithm, registry adjudication of outcomes or stage, EHR enrichment with labs/BMI/smoking) or an **external\nsource** (published validation parameters, a separate linked database, a survey such as NHANES used to characterize an\nunmeasured confounder). The correction itself can be deterministic (a single bias-corrected point estimate), probabilistic\n(bias parameters drawn from distributions to produce a simulation interval that blends random and systematic error), or\nfully Bayesian (the validation data enter as a likelihood). Propensity-score calibration (Stürmer 2005) is the canonical\nconfounding variant: a validation subset that *does* measure the missing confounders is used to recalibrate an error-prone\npropensity score estimated from the claims-only covariates in the full cohort.\n\n**Core conceptual distinction**. The defining feature is that bias parameters are *data-derived and conditional on\ntransportability*, not assumed. This separates the method on two axes. (1) *vs. 
assumption-driven QBA (simple/probabilistic\nbias analysis):* external adjustment replaces a prior over sensitivity/specificity or confounder strength with an empirical\nestimate plus its sampling uncertainty — but it inherits a new, often dominant, assumption: that the validation sample is\n*exchangeable* with the main study for the parameter being borrowed. (2) *Measurement-error correction vs.\nunmeasured-confounding correction:* for a misclassified binary outcome/exposure the targets are sensitivity, specificity (or\nPPV/NPV); for residual confounding the targets are the prevalence of the unmeasured confounder by exposure arm and its\nassociation with the outcome (or, equivalently, the calibration relationship between the error-prone and gold-standard\npropensity scores). The estimand must be pinned down first: a corrected risk ratio, a corrected risk difference, or a\nconfounding-adjusted hazard ratio are different quantities and require different correction algebra. A corrected estimate is\nnever automatically \"truer\" — it is the main estimate viewed through a lens whose focal length was set by the validation\ndata.\n\n**Pros, cons, and trade-offs**.\n- **vs. misclassification-bias-correction-rwe (assumed sens/spec):** External adjustment anchors the sensitivity/specificity\n  in observed chart-review or registry adjudication rather than expert guesses, and propagates the validation sample's own\n  sampling error. Cost: it requires that validation parameters transport (same code sets, calendar period, care setting,\n  case mix) and that the two-phase sampling design supports the parameter you need — reviewing only algorithm-positives\n  yields PPV, not sensitivity. **Prefer external adjustment** whenever a feasible validation sample exists; fall back to\n  assumed-parameter bias analysis only when no validation data are obtainable.\n- **vs. 
unmeasured-confounding-probabilistic-bias-analysis-rwe (assumed confounder distribution):** External adjustment can\n  *measure* the prevalence and effect of the missing confounder in a linked subset (e.g., smoking/BMI/HbA1c from EHR), or\n  calibrate the propensity score directly. Cost: the validation covariates may still omit the *true* driver of confounding,\n  and a small or non-representative validation sample can amplify rather than reduce error. **Prefer it** when a claims-only\n  study can link a subset to richer clinical data; otherwise probabilistic bias analysis with literature-based priors is the\n  honest alternative.\n- **vs. simply adjusting in the full cohort (no QBA):** When the confounder or true outcome is measured for *everyone*, you\n  adjust directly and this method is unnecessary. External adjustment is specifically for the two-phase situation — cheap,\n  incomplete information on all, expensive gold-standard information on a sample. Cost: more moving parts and an extra\n  transportability assumption that reviewers will probe hard.\n\n**When to use**. A main RWE analysis runs on claims/EHR where (a) a key outcome or exposure is defined by an algorithm of\nimperfect, *estimable* accuracy, or (b) a strong confounder is unmeasured in the full cohort but available in a linkable\nsubset or external survey; and you can either field an internal validation substudy or defend a transportable external\nparameter. It is also the principled way to incorporate an algorithm-validation study (PPV/sensitivity for MI, stroke,\nbleeding, cancer progression proxies) into the effect estimate rather than merely citing it as a limitation. For regulatory\nand HTA submissions it is the mechanism that turns a hand-waved \"potential residual confounding\" paragraph into a quantified\nsensitivity result. 
Pre-specify the validation sampling frame (FFS-complete only), the gold-standard definition, whether\nparameters will be pooled or arm-specific, and the transportability argument before any correction is run.\n\n**When NOT to use — and when it is actively misleading or dangerous**.\n- **The validation sample is not exchangeable with the analytic cohort.** Validation done in one integrated health system,\n  then applied to a multi-payer national claims study, silently imports that system's coding behavior and case mix. A\n  corrected estimate presented as definitive here is *more* misleading than the uncorrected one, because it launders a\n  transportability assumption as an empirical fact. Diagnose by comparing the validation frame's payer/calendar/severity\n  distribution to the main cohort.\n- **Differential misclassification corrected with non-differential parameters.** If sensitivity/specificity differ by\n  treatment arm (surveillance bias: one drug triggers more imaging, so its outcomes are better captured) but you apply a\n  single pooled sens/spec, the correction can move the estimate the *wrong* direction. 
You must estimate arm-specific\n  parameters — which requires reviewing cases and non-cases within each arm, often infeasible.\n- **Sparse validation strata.** A 2×2 chart-review table with a near-zero cell produces unstable, sometimes negative,\n  corrected risks; deterministic correction without uncertainty propagation will report a spuriously precise wrong number.\n- **Correcting confounding with covariates that do not actually predict treatment or outcome.** Propensity-score calibration\n  using a validation covariate that is irrelevant adds noise and false reassurance; the calibration must materially shift\n  the score to matter.\n- **The real question requires a different design.** External adjustment patches an estimate; it does not rescue a cohort\n  with no defensible comparator or with immortal time built into time zero.\n\n**Data-source operational depth**.\n- **Claims (FFS vs. Medicare Advantage):** The validation substudy is usually chart review of an endpoint algorithm. PPV is\n  cheap (sample `dx`-positive patients and adjudicate); sensitivity is expensive (you must adjudicate algorithm-*negative*\n  patients to find missed cases). Critical failure mode: Medicare Advantage and capitated person-time do not generate\n  complete FFS claims, so an algorithm \"negative\" can be *missingness*, not a true non-case — validate only on enrollees\n  with complete A/B/D (or commercial medical+pharmacy) capture, and report which payer segment the validation covers versus\n  the main cohort. Stratify validation parameters by `days_supply`/route, age, site, and calendar period when differential\n  capture is plausible.\n- **EHR:** Chart-review or structured-plus-notes phenotyping can estimate algorithm performance and supply the unmeasured\n  confounder (labs, BMI, smoking, ECOG). 
Failure modes: *outside-care leakage* (events occurring at a non-network facility\n  are absent from the chart, depressing apparent sensitivity in a way that does not transport to a claims cohort with\n  complete capture) and *chart-availability bias* (patients with reviewable notes are sicker/more engaged than the average\n  cohort member, so the validation subset is non-random).\n- **Registry:** Registry linkage gives gold-standard outcome status, stage, or severity with high clinical specificity, but\n  registries enroll a selected, often academic-center, population — high internal accuracy, weak transportability to the\n  full treated population, and linkage eligibility/match failure create their own selection bias on top of the validation\n  selection.\n- **Linked claims–EHR–registry:** The ideal substrate for a two-phase design (claims for everyone, gold standard on the\n  linkable subset), but the linkable subset is itself a non-random sample; model the linkage probability and check that\n  bias-parameter transportability holds across linkable and non-linkable strata, not just within the linked subset.\n\n**Worked claims example.** A commercial + Medicare FFS study estimates a stroke risk ratio of RR = 0.72 (8.0% vs. 11.1%) for\nDrug A vs. Drug B, with stroke ascertained by an inpatient ICD-10 `dx` algorithm. To correct for outcome misclassification,\na validation substudy samples charts from the *same* FFS-complete enrollees (continuous A/B enrollment, no MA-only\nperson-time, same 2021–2023 calendar window as the main cohort) and stratifies on *gold-standard truth* so that\nsensitivity and specificity are identified directly: among 200 chart-confirmed true strokes the algorithm flags 156\n(sensitivity = 156/200 = 0.78, FN = 44), and among 200 chart-confirmed true non-strokes the algorithm correctly clears\n194 (specificity = 194/200 = 0.97, FP = 6). 
(Stratifying instead on *algorithm* status would identify PPV/NPV, which\ncannot enter the formula below without back-solving through the unknown true prevalence.) The non-differential corrected risk for each arm is\n`(observed_risk + spec − 1) / (sens + spec − 1)`: Drug A → `(0.080 + 0.970 − 1) / (0.780 + 0.970 − 1) = 0.0667`, Drug B →\n`(0.111 + 0.970 − 1) / 0.750 = 0.1080`, so corrected RR = `0.0667 / 0.1080 = 0.62`. Because non-differential outcome\nmisclassification biased the RR *toward* the null, the corrected estimate is further from 1.0, which is the expected direction. The\nreviewer's first two questions are answered up front: (1) the validation frame is the *same* payer segment and calendar\nwindow as the analytic cohort (transportability defended, not assumed), and (2) the 2×2 was sampled on gold-standard\ntruth (true cases and true non-cases), so sensitivity and specificity (not just PPV/NPV) are identified. The point correction is then re-run probabilistically, drawing\nsens ~ Beta(157, 45) and spec ~ Beta(195, 7) over many iterations, to report a simulation interval that combines the\nvalidation sample's uncertainty with the main study's random error (the latter by also resampling each arm's observed risk). Report the full validation 2×2, the exact correction\nformula, and both the point-corrected estimate and the interval. If surveillance differed by arm (Drug A patients imaged\nmore), the entire correction would be re-estimated with arm-specific Beta draws; if that were infeasible, the honest\nreport is a sensitivity *range*, not a single corrected RR.\n\n**Interpreting the output**\n\nFrom the worked example (unmeasured smoking confounding): naive RR = 0.63. Validation substudy yields\nsmoking prevalence 30% in the Drug A arm, 60% in the Drug B arm; smoker RR for the outcome ≈ 2.5. Bias\nfactor (Drug B relative to Drug A) ≈ 1.90 / 1.45 ≈ 1.31. 
Corrected RR ≈ 0.63 × 1.31 ≈ 0.82.\n\n*(1) Formal interpretation.* The corrected RR 0.82 is conditional on two assumptions: (a) the bias\nparameters from the validation substudy (smoking prevalence and its effect on the outcome) transport\nto the main cohort — same payer segment, calendar window, and case mix are cited in the worked example\nas the transportability defense; and (b) smoking is the dominant unmeasured confounder and the\ncorrection formula is correctly specified. If either assumption fails, the corrected estimate inherits\nthe error. The probabilistic extension propagates validation sampling uncertainty (Beta priors drawn\nfrom the 2×2 table) into a simulation interval, which is not a confidence interval but a band\nconditional on the stated bias model and validation frame.\n\n*(2) Practical interpretation.* Drug A's apparent 37% lower risk (naive RR 0.63) shrinks to roughly\n18% lower risk (corrected RR ≈ 0.82) after accounting for the smoking imbalance identified in the\nvalidation substudy. The corrected estimate is not automatically \"truer\" than the naive one — its\ncredibility depends entirely on whether the validation sample's patients resemble the main cohort. If\nthe substudy over-sampled younger patients among whom smoking prevalence differs, the correction may\nover- or under-adjust; that uncertainty should be reported alongside the corrected point estimate.",
    "primary_category": "Bias_Control",
    "tags": [
      "external-adjustment",
      "validation-substudy",
      "chart-review",
      "propensity-score-calibration",
      "measurement-error",
      "misclassification",
      "quantitative-bias-analysis",
      "two-phase-sampling"
    ],
    "applies_to_study_types": [
      "active_comparator_new_user",
      "claims_analysis",
      "ehr_study",
      "registry_linkage",
      "target_trial_emulation",
      "comparative_effectiveness"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1097/MLR.0b013e318070c045",
        "url": "https://doi.org/10.1097/MLR.0b013e318070c045",
        "citation_text": "Stürmer T, Schneeweiss S, Avorn J, Glynn RJ. Adjustments for unmeasured confounders in pharmacoepidemiologic database studies using external information. Medical Care. 2007;45(10 Suppl 2):S158-S165.",
        "year": 2007,
        "authors_short": "Stürmer et al.",
        "notes": "Pharmacoepidemiology framework for using validation and external information to adjust database studies for unmeasured confounding; the canonical statement of external adjustment in claims research."
      },
      {
        "role": "explain",
        "doi": "10.1093/aje/kwi192",
        "url": "https://doi.org/10.1093/aje/kwi192",
        "citation_text": "Stürmer T, Schneeweiss S, Avorn J, Glynn RJ. Adjusting effect estimates for unmeasured confounding with validation data using propensity score calibration. American Journal of Epidemiology. 2005;162(3):279-289.",
        "year": 2005,
        "authors_short": "Stürmer et al.",
        "notes": "Introduces propensity score calibration — a validation subset that measures the missing confounders recalibrates an error-prone claims-only propensity score in the full cohort."
      },
      {
        "role": "explain",
        "doi": "10.1002/pds.1200",
        "url": "https://doi.org/10.1002/pds.1200",
        "citation_text": "Schneeweiss S. Sensitivity analysis and external adjustment for unmeasured confounders in epidemiologic database studies of therapeutics. Pharmacoepidemiology and Drug Safety. 2006;15(5):291-303.",
        "year": 2006,
        "authors_short": "Schneeweiss",
        "notes": "Working framework and algebra for external adjustment of unmeasured confounding using survey/validation parameters, with the array-of-corrections presentation favored by regulators."
      },
      {
        "role": "demonstrate",
        "doi": "10.1093/ije/dyu149",
        "url": "https://doi.org/10.1093/ije/dyu149",
        "citation_text": "Lash TL, Fox MP, MacLehose RF, Maldonado G, McCandless LC, Greenland S. Good practices for quantitative bias analysis. International Journal of Epidemiology. 2014;43(6):1969-1985.",
        "year": 2014,
        "authors_short": "Lash et al.",
        "notes": "Consensus good-practice guidance covering deterministic, probabilistic, and validation-data-anchored bias correction, including reporting standards reviewers expect."
      }
    ],
    "plain_language_summary": "When a large database study is missing an important risk factor — like smoking status — the effect estimate can be distorted even after adjusting for everything else in the data. External adjustment fixes this by running a small, intensive side study (the validation substudy) on a sample of patients where that missing factor is actually measured, then using what you learned there to mathematically correct the main study's number. The catch is that the side study's patients must be similar enough to the main cohort for the correction to hold, and the corrected answer is only as trustworthy as that similarity assumption.",
    "key_terms": [
      {
        "term": "validation substudy",
        "definition": "A small, intensive study nested inside the main analysis where researchers go beyond the database — through chart review or additional data linkage — to measure something (like smoking, BMI, or true disease status) that the main database does not capture."
      },
      {
        "term": "bias correction",
        "definition": "A mathematical adjustment that uses information from the validation substudy to shift the main study's raw estimate closer to what it would have been if the missing factor had been measured for everyone."
      },
      {
        "term": "unmeasured confounder",
        "definition": "A risk factor for the outcome that is related to which treatment patients received, but is absent from the database — if ignored, it can make one treatment look better or worse than it really is."
      },
      {
        "term": "external adjustment",
        "definition": "Using bias parameters — numbers that describe how strong and how unevenly distributed the missing factor is — from a separate data source to correct an effect estimate in the main study."
      },
      {
        "term": "transportability",
        "definition": "The assumption that the patients in the validation substudy are similar enough to the main cohort that the correction calculated from the substudy also applies to everyone else in the main study."
      }
    ],
    "worked_example": {
      "scenario": "A claims database study of 2,000 patients (1,000 on Drug A, 1,000 on Drug B) finds that Drug A patients have fewer cardiovascular events than Drug B patients. But the claims data have no smoking information, and smoking is known to raise cardiovascular risk. A validation substudy chart-reviews 200 patients (100 from each drug arm) to measure smoking status and then uses those numbers to correct the main study estimate.",
      "dataset": {
        "caption": "Main study — observed event counts by arm (smoking unmeasured)",
        "columns": [
          "arm",
          "n_patients",
          "n_events",
          "observed_event_rate"
        ],
        "rows": [
          [
            "Drug A",
            1000,
            100,
            "10% (0.10)"
          ],
          [
            "Drug B",
            1000,
            160,
            "16% (0.16)"
          ]
        ]
      },
      "steps": [
        "Compute the naive risk ratio: 0.10 / 0.16 = 0.63. Drug A looks 37% lower risk, but smoking is not yet accounted for.",
        "Run the validation substudy: chart-review 100 Drug A patients and 100 Drug B patients. Result: in the Drug A arm, 30 of 100 are smokers (30%); in the Drug B arm, 60 of 100 are smokers (60%). Smokers have 2.5 times the cardiovascular event rate of non-smokers (from the substudy).",
        "Smoking is far more common in Drug B patients (60%) than in Drug A patients (30%), so part of Drug B's excess risk reflects smoking rather than the drug, which made Drug A look more protective than it really is.",
        "Apply the external adjustment formula to compute a bias factor, the ratio of the smoking-driven risk inflation in the Drug B arm to that in the Drug A arm: [0.60 x (2.5 - 1) + 1] / [0.30 x (2.5 - 1) + 1] = [0.90 + 1] / [0.45 + 1] = 1.90 / 1.45 = 1.31.",
        "Multiply the naive RR by the bias factor: 0.63 x 1.31 = 0.82 (from the unrounded values 0.625 x 1.3103), the corrected estimate.",
        "Interpretation: after accounting for the smoking imbalance between arms, Drug A still shows lower risk, but the protective effect shrinks from 37% to 18%. The uncorrected number overstated the benefit."
      ],
      "result": "Naive RR = 0.63; bias factor from substudy = 1.31; corrected RR = 0.63 x 1.31 = 0.82. Smoking was more common in Drug B patients and raises cardiovascular risk 2.5-fold, so it was artificially inflating Drug A's apparent benefit; correction moves the estimate toward the null.",
      "summary_table": {
        "caption": "Before and after correction",
        "columns": [
          "",
          "Value"
        ],
        "rows": [
          [
            "Naive RR (smoking ignored)",
            "0.63"
          ],
          [
            "Smoking prevalence, Drug A arm (substudy)",
            "30%"
          ],
          [
            "Smoking prevalence, Drug B arm (substudy)",
            "60%"
          ],
          [
            "Effect of smoking on outcome (substudy)",
            "RR 2.5x"
          ],
          [
            "Bias factor",
            "1.31"
          ],
          [
            "Corrected RR",
            "0.82"
          ]
        ]
      }
    },
    "prerequisites": [
      "quantitative-bias-analysis-toolkit-rwe",
      "unmeasured-confounding-probabilistic-bias-analysis-rwe",
      "propensity-score-methods-psm-iptw"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Internal validation substudy (two-phase sampling)",
        "description": "A subset of the analytic cohort receives gold-standard adjudication (chart review, registry linkage, EHR enrichment) so bias parameters are estimated within the same source population. Phase 1 = cheap, complete claims covariates on everyone; phase 2 = expensive gold standard on a sample.",
        "edge_cases": [
          "Sampling only algorithm-positive patients identifies PPV alone; NPV, sensitivity, and specificity all require that algorithm-negatives are also reviewed.",
          "The validation sample must be drawn so parameters transport to the main cohort; stratified sampling by arm/age/site is needed if misclassification is differential.",
          "Near-zero cells in the validation 2x2 produce unstable or out-of-range corrected risks; propagate uncertainty rather than reporting a deterministic point estimate."
        ],
        "data_source_notes": "claims: chart review of FFS-complete enrollees only (MA-only person-time lacks the claims to define a true non-case); ehr: structured-plus-notes phenotyping, but beware outside-care leakage and chart-availability bias."
      },
      {
        "name": "External validation / external adjustment",
        "description": "Published validation parameters, a separate database, or a population survey (e.g., NHANES for an unmeasured confounder) are imported to correct the main study when no internal validation sample exists.",
        "edge_cases": [
          "Code sets, calendar time, care setting, payer mix, and disease severity can all break transportability of the borrowed parameter.",
          "Best presented as a sensitivity analysis or an array of corrections across a plausible parameter range unless the source population is clearly exchangeable."
        ],
        "data_source_notes": "claims: literature-based PPV/sensitivity for MI, stroke, bleeding, or progression proxies; confounding: survey-based prevalence and effect of smoking/BMI applied via external adjustment formulas."
      },
      {
        "name": "Propensity-score calibration",
        "description": "A validation subset that measures the unmeasured confounders is used to estimate the relationship between an error-prone (claims-only) propensity score and a gold-standard score, then recalibrate the error-prone score in the full cohort.",
        "edge_cases": [
          "The validation covariate must predict treatment or outcome enough to shift the score; an irrelevant covariate adds noise and false reassurance.",
          "Calibration assumes a surrogacy/transportability condition between the validation and main samples; it can amplify error when the validation sample is small or non-representative."
        ],
        "data_source_notes": "Especially useful for linked claims-EHR subsets where labs, frailty, BMI, smoking, or disease severity are missing in claims but present for a linkable subset."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "misclassification-bias-correction-rwe",
        "pros_of_this": "Anchors sensitivity/specificity/PPV in observed chart review or registry adjudication and propagates the validation sample's sampling error, rather than relying on assumed parameters.",
        "cons_of_this": "Requires transportability of validation parameters and a two-phase design that identifies the parameter actually needed (PPV is not sensitivity).",
        "when_to_prefer": "Whenever a feasible internal or transportable external validation sample exists to estimate algorithm performance."
      },
      {
        "compared_to": "unmeasured-confounding-probabilistic-bias-analysis-rwe",
        "pros_of_this": "Can measure the prevalence and effect of the missing confounder in a linked subset, or calibrate the propensity score directly, instead of assuming a confounder distribution.",
        "cons_of_this": "Validation covariates may still omit the true confounding driver; a small or non-representative validation sample can amplify error.",
        "when_to_prefer": "When a claims-only study can link a subset to richer clinical data (labs, severity, lifestyle factors)."
      },
      {
        "compared_to": "quantitative-bias-analysis-toolkit-rwe",
        "pros_of_this": "Provides the empirical inputs (data-derived bias parameters) that the general QBA machinery then propagates.",
        "cons_of_this": "Adds a study-design and data-linkage burden (the validation substudy) beyond running the QBA arithmetic on assumed parameters.",
        "when_to_prefer": "When the goal is a defensible, data-anchored correction rather than an assumption-driven sensitivity sweep."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Field the validation substudy on FFS-complete enrollees (continuous A/B/D or commercial medical+pharmacy) so an algorithm-negative is a true non-case, not MA-only missingness. Review both algorithm-positive and algorithm-negative patients if sensitivity is required. Stratify bias parameters by treatment arm, age, site, route, or calendar period when differential misclassification is plausible. Match the validation frame's payer segment and calendar window to the main cohort and report that mapping.",
      "ehr": "Use chart review or structured-plus-notes phenotyping to estimate algorithm performance and to supply unmeasured confounders (labs, BMI, smoking, severity). Explicitly handle outside-care leakage (events at non-network facilities depress apparent sensitivity) and chart-availability bias (reviewable patients differ from the cohort average).",
      "registry": "Registry adjudication gives high clinical specificity and gold-standard stage/severity, but the registry population is selected; high internal accuracy does not guarantee transportability to the full treated population, and linkage eligibility/match failure add selection bias.",
      "linked": "Linked claims-EHR-registry is the ideal two-phase substrate (claims for all, gold standard on the linkable subset), but the linkable subset is non-random; model the linkage probability and verify bias-parameter transportability across linkable and non-linkable strata."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\nimport pandas as pd\n\ndef correct_rr(main: pd.DataFrame, val: dict, n_iter: int = 50000, seed: int = 42):\n    \"\"\"Probabilistic non-differential (or arm-specific) outcome-misclassification correction.\n\n    main : columns arm, n_events, n_total (main-study observed events by arm).\n    val  : either one dict {'tp','fp','tn','fn'} (non-differential, shared across arms)\n           or {'A': {...}, 'B': {...}} (differential, arm-specific 2x2 chart-review counts).\n    Returns the corrected risk-ratio distribution (length n_iter); take percentiles for the interval.\n    \"\"\"\n    rng = np.random.default_rng(seed)\n    counts = main.set_index(\"arm\")\n\n    def draw_obs(arm):  # observed risk ~ Beta(events+1, non-events+1): main-study random error\n        events, total = counts.loc[arm, \"n_events\"], counts.loc[arm, \"n_total\"]\n        return rng.beta(events + 1, total - events + 1, n_iter)\n\n    def draw(v):  # Beta(events+1, non-events+1) for sens and spec from validation 2x2 counts\n        sens = rng.beta(v[\"tp\"] + 1, v[\"fn\"] + 1, n_iter)\n        spec = rng.beta(v[\"tn\"] + 1, v[\"fp\"] + 1, n_iter)\n        return sens, spec\n\n    def corrected(obs_risk, sens, spec):\n        r = (obs_risk + spec - 1.0) / (sens + spec - 1.0)\n        return np.clip(r, 0.0, 1.0)  # truncate impossible draws from sparse validation cells\n\n    if all(k in val for k in (\"tp\", \"fp\", \"tn\", \"fn\")):           # non-differential: pooled sens/spec\n        sens, spec = draw(val)\n        true_a = corrected(draw_obs(\"A\"), sens, spec)\n        true_b = corrected(draw_obs(\"B\"), sens, spec)\n    else:                                                         # differential: arm-specific parameters\n        sa, pa = draw(val[\"A\"]); sb, pb = draw(val[\"B\"])\n        true_a = corrected(draw_obs(\"A\"), sa, pa)\n        true_b = corrected(draw_obs(\"B\"), sb, pb)\n\n    return true_a / true_b\n\n# main = pd.DataFrame({\"arm\": [\"A\", \"B\"], \"n_events\": [80, 111], \"n_total\": [1000, 1000]})\n# rr = correct_rr(main, val={'tp':156,'fp':6,'tn':194,'fn':44})\n# print(np.percentile(rr, [2.5, 50, 97.5]))  # interval blending validation + main-study random error",
        "description": "Probabilistic outcome-misclassification correction anchored to a validation substudy. Required inputs (post\ndata-management):\n  main : one row per arm -> arm in {'A','B'}, n_events (algorithm-positive outcomes), n_total (denominator)\n  val  : the 2x2 validation table from a gold-standard-stratified chart review of FFS-complete enrollees ->\n         tp (true cases the algorithm flagged), fn (true cases the algorithm missed),\n         tn (true non-cases the algorithm cleared), fp (true non-cases the algorithm flagged)\nSensitivity = tp/(tp+fn), specificity = tn/(tn+fp). Pre-specify the validation sampling frame (FFS-complete only) and\nwhether parameters are pooled or arm-specific. For DIFFERENTIAL misclassification pass an arm-keyed dict of `val` tables\nand draw arm-specific Beta parameters. Each arm's observed risk is also redrawn from Beta(n_events+1, n_total-n_events+1)\nso the interval carries main-study random error as well as validation uncertainty. Report both the deterministic point\ncorrection and the simulation interval. The worked claims case (RR 0.72 -> 0.62) corresponds to obs_a=0.080, obs_b=0.111,\nsens~Beta(157,45), spec~Beta(195,7).",
        "dependencies": [
          "numpy",
          "pandas"
        ],
        "source_citations": [
          "sturmer-2007",
          "lash-2014"
        ],
        "notes": ""
      },
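      {
        "lang": "python",
        "code": "import numpy as np\n\ndef external_adjust_rr(rr_obs, p_conf_a, p_conf_b, rr_cd):\n    \"\"\"Externally adjust an observed RR (arm A vs. arm B) for one unmeasured binary confounder.\n\n    rr_obs   : observed (confounded) risk ratio, arm A vs. arm B\n    p_conf_a : confounder prevalence in arm A (validation or external data)\n    p_conf_b : confounder prevalence in arm B\n    rr_cd    : confounder-outcome risk ratio, assumed equal in both arms\n    Returns (confounding_rr, rr_adjusted); inputs may be scalars or equal-length arrays.\n    \"\"\"\n    p_a, p_b, rr_cd = (np.asarray(x, dtype=float) for x in (p_conf_a, p_conf_b, rr_cd))\n    # Ratio of confounder-driven risk inflation in arm A to that in arm B.\n    confounding_rr = (p_a * (rr_cd - 1.0) + 1.0) / (p_b * (rr_cd - 1.0) + 1.0)\n    # Dividing by the confounding RR equals multiplying by its reciprocal.\n    return confounding_rr, np.asarray(rr_obs, dtype=float) / confounding_rr\n\n# Hypothetical inputs, for illustration only:\n# bias, rr_adj = external_adjust_rr(rr_obs=0.625, p_conf_a=0.30, p_conf_b=0.60, rr_cd=2.5)\n# bias is about 0.76 and rr_adj is about 0.82",
        "description": "A minimal sketch, not a validated implementation, of deterministic external adjustment for a single unmeasured\nbinary confounder. The function name and argument names (external_adjust_rr, rr_obs, p_conf_a, p_conf_b, rr_cd) are\nillustrative assumptions, not part of any library. Assumes the confounder is binary, its effect on the outcome is the same\nin both arms (no effect modification), and no other unmeasured confounding remains. Passing arrays of sampled prevalences\nand confounder-outcome RRs in place of scalars gives a simple probabilistic version; how those draws are generated is left\nto the analyst.",
        "dependencies": [
          "numpy"
        ],
        "source_citations": [
          "schneeweiss-2006"
        ],
        "notes": ""
      },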
      {
        "lang": "r",
        "code": "correct_rr <- function(main, val, n_iter = 50000L, seed = 42L) {\n  set.seed(seed)\n  ev <- setNames(main$n_events, main$arm)\n  n  <- setNames(main$n_total, main$arm)\n\n  # Observed risk ~ Beta(events+1, non-events+1): carries the main-study random error.\n  obs_a <- rbeta(n_iter, ev[[\"A\"]] + 1, n[[\"A\"]] - ev[[\"A\"]] + 1)\n  obs_b <- rbeta(n_iter, ev[[\"B\"]] + 1, n[[\"B\"]] - ev[[\"B\"]] + 1)\n\n  # Sensitivity ~ Beta(tp+1, fn+1); specificity ~ Beta(tn+1, fp+1) from the validation 2x2.\n  sens <- rbeta(n_iter, val$tp + 1, val$fn + 1)\n  spec <- rbeta(n_iter, val$tn + 1, val$fp + 1)\n\n  corrected <- function(obs_risk, se, sp) {\n    r <- (obs_risk + sp - 1) / (se + sp - 1)\n    pmin(pmax(r, 0), 1)                       # truncate out-of-range draws from sparse cells\n  }\n\n  true_a <- corrected(obs_a, sens, spec)\n  true_b <- corrected(obs_b, sens, spec)\n  true_a / true_b\n}\n\n# rr <- correct_rr(\n#   main = data.frame(arm = c(\"A\", \"B\"), n_events = c(80, 111), n_total = c(1000, 1000)),\n#   val  = list(tp = 156, fp = 6, tn = 194, fn = 44))\n# quantile(rr, c(.025, .5, .975))",
        "description": "R version of the validation-anchored probabilistic misclassification correction. Inputs mirror the Python version:\n  main : data.frame with columns arm ('A'/'B'), n_events, n_total\n  val  : named list tp/fp/tn/fn of chart-review counts (FFS-complete enrollees only)\nEach arm's observed risk is redrawn from Beta(n_events+1, n_total-n_events+1) alongside the sens/spec draws.\nReturns the corrected risk-ratio simulation distribution; take quantiles for the interval. As written, the function pools\nsens/spec across arms (non-differential). For differential misclassification, draw arm-specific sens/spec from\narm-specific validation counts inside the function before correcting each arm's risk, as the Python version does.",
        "dependencies": [
          "stats"
        ],
        "source_citations": [
          "sturmer-2007",
          "lash-2014"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "%let niter = 50000;\n%let tp = 156; %let fp = 6; %let tn = 194; %let fn = 44;   /* validation 2x2 (FFS-complete enrollees only) */\n\n/* Observed arm-specific counts pulled from the main study (not hardcoded). */\nproc sql noprint;\n  select n_events, n_total into :ev_a, :n_a from work.main where arm='A';\n  select n_events, n_total into :ev_b, :n_b from work.main where arm='B';\nquit;\n\ndata sim;\n  call streaminit(42);\n  do i = 1 to &niter;\n    /* Observed risks redrawn from Beta(events+1, non-events+1): main-study random error. */\n    obs_a = rand('BETA', &ev_a + 1, &n_a - &ev_a + 1);\n    obs_b = rand('BETA', &ev_b + 1, &n_b - &ev_b + 1);\n    /* Sensitivity and specificity drawn from their validation-sample posteriors. */\n    sens = rand('BETA', &tp + 1, &fn + 1);\n    spec = rand('BETA', &tn + 1, &fp + 1);\n    denom = sens + spec - 1;\n    /* Quantitative-bias correction of each arm's risk; truncate out-of-range draws from sparse cells. */\n    true_a = min(max((obs_a + spec - 1) / denom, 0), 1);\n    true_b = min(max((obs_b + spec - 1) / denom, 0), 1);\n    if true_b > 0 then corrected_rr = true_a / true_b;\n    else corrected_rr = .;   /* undefined when the truncated arm-B risk is 0 */\n    output;\n  end;\nrun;\n\n/* Simulation interval folding validation uncertainty into the main-study random error. */\nproc univariate data=sim noprint;\n  var corrected_rr;\n  output out=ci pctlpts=2.5 50 97.5 pctlpre=p;\nrun;\n\nproc print data=ci; run;   /* p2_5 = lower, p50 = median corrected RR, p97_5 = upper */",
        "description": "SAS Monte Carlo validation-anchored misclassification correction (DATA step; no PROC IML required). Inputs\n(post data-management):\n  work.main : arm ('A'/'B'), n_events, n_total            (main-study observed events by arm)\n  macro vars TP/FP/TN/FN = validation 2x2 counts from chart review of FFS-complete enrollees only\nPROC SQL pulls the observed arm counts from work.main (no hardcoded risks). Each iteration redraws the observed arm risks\nfrom Beta(n_events+1, n_total-n_events+1), draws sens ~ Beta(TP+1, FN+1) and spec ~ Beta(TN+1, FP+1) via RAND('BETA'),\ncorrects each arm's risk, and forms the bias-adjusted RR distribution; PROC UNIVARIATE returns the 2.5/50/97.5 percentile\nsimulation interval. For differential misclassification, supply arm-specific TP/FP/TN/FN and draw within each arm. Always\nexport the validation 2x2 and the macro values used.",
        "dependencies": [],
        "source_citations": [
          "sturmer-2007",
          "lash-2014"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Main[Main claims/EHR cohort<br/>error-prone outcome or unmeasured confounder] --> Obs[Observed effect estimate]\n  Val[Validation source:<br/>internal substudy / external study / survey] --> Params[Bias parameters<br/>sens, spec, PPV or confounder strength]\n  Params --> Transport{Transportable to main cohort?<br/>same payer / calendar / case mix?}\n  Transport -->|yes, exchangeable| Det[Data-anchored correction]\n  Transport -->|uncertain| Prob[Probabilistic / array of corrections<br/>over a plausible range]\n  Obs --> Det\n  Obs --> Prob\n  Det --> Corr[Corrected estimate + interval]\n  Prob --> Corr\n  style Val fill:#ecfeff\n  style Transport fill:#fef9c3\n  style Corr fill:#dcfce7",
        "caption": "External adjustment uses validation data to anchor QBA bias parameters; transportability of those parameters to the analytic cohort is the central, reviewer-tested assumption. When transportability is uncertain, present a probabilistic interval or an array of corrections rather than a single \"corrected\" point estimate.",
        "alt_text": "Flowchart from a main RWE cohort and a validation source to estimated bias parameters, a transportability decision, deterministic versus probabilistic correction, and a corrected estimate with interval.",
        "source_type": "illustrative",
        "source_citations": [
          "schneeweiss-2006"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  subgraph Phase1[Phase 1 - cheap, everyone]\n    All[Full claims cohort<br/>algorithm-based outcome on all N]\n  end\n  subgraph Phase2[Phase 2 - expensive, sampled]\n    Pos[Sample algorithm-POSITIVE<br/>adjudicate -> TP, FP -> PPV]\n    Neg[Sample algorithm-NEGATIVE<br/>adjudicate -> TN, FN -> sensitivity]\n  end\n  All --> Pos\n  All --> Neg\n  Pos --> Est[Estimate sens / spec / PPV<br/>with sampling uncertainty]\n  Neg --> Est\n  Est --> Apply[Apply to Phase-1 estimate<br/>arm-specific if differential]",
        "caption": "Two-phase validation-substudy design. Reviewing only algorithm-positives (Phase 2 top) identifies PPV; sensitivity and specificity require also reviewing algorithm-negatives (Phase 2 bottom). Sampling both, within FFS-complete enrollees, is what makes the correction identifiable.",
        "alt_text": "Two-phase sampling diagram showing a full cohort feeding a sampled chart review of algorithm-positive and algorithm-negative patients to estimate sensitivity, specificity, and PPV that are applied back to the full-cohort estimate.",
        "source_type": "illustrative",
        "source_citations": [
          "sturmer-2005"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "is_variant_of",
        "target_slug": "quantitative-bias-analysis-toolkit-rwe",
        "notes": "The validation/external-data child of QBA; it supplies data-derived bias parameters that the general QBA machinery propagates."
      },
      {
        "relation_type": "used_with",
        "target_slug": "misclassification-bias-correction-rwe",
        "notes": "Validation data estimate the sensitivity, specificity, PPV, or NPV that the misclassification correction formulas require."
      },
      {
        "relation_type": "used_with",
        "target_slug": "unmeasured-confounding-probabilistic-bias-analysis-rwe",
        "notes": "Validation or external data anchor the unmeasured-confounder prevalence and outcome association used in the probabilistic correction."
      },
      {
        "relation_type": "used_with",
        "target_slug": "propensity-score-methods-psm-iptw",
        "notes": "Propensity-score calibration recalibrates an error-prone claims-only propensity score using a validation subset that measures the missing confounders."
      },
      {
        "relation_type": "see_also",
        "target_slug": "algorithm-validation",
        "notes": "Algorithm validation supplies the empirical PPV/sensitivity inputs that external adjustment folds into the effect estimate rather than citing only as a limitation."
      },
      {
        "relation_type": "see_also",
        "target_slug": "e-value-sensitivity-analysis",
        "notes": "E-value gives the minimum confounding strength needed to explain away the result; external adjustment can measure a confounder in a linked subset and produce a data-anchored corrected estimate."
      },
      {
        "relation_type": "see_also",
        "target_slug": "linked-data",
        "notes": "Linked claims-EHR-registry data create the two-phase validation subset, but linkage selection must be assessed before treating parameters as transportable."
      }
    ],
    "aliases": [
      "external adjustment",
      "validation-substudy correction",
      "propensity-score calibration",
      "two-phase sampling bias correction",
      "external adjustment for unmeasured confounding",
      "record-level validation correction"
    ],
    "complexity": "advanced",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "f1-score-precision-recall-rwe",
    "name": "F1 Score, Precision, and Recall",
    "short_definition": "Confusion-matrix metrics for a binary classifier or computable phenotype, where precision is the positive predictive value (TP/(TP+FP)), recall is sensitivity (TP/(TP+FN)), and the F1 score is their harmonic mean; under heavy outcome imbalance the precision-recall curve is preferred over the ROC curve.",
    "long_description": "**Precision**, **recall**, and the **F1 score** summarize a binary classifier (or a computable phenotype / outcome\nalgorithm) entirely from its 2x2 confusion matrix at a chosen decision threshold. Let TP, FP, FN, TN be the true positives,\nfalse positives, false negatives, and true negatives. Then **precision = TP / (TP + FP)** — identical to the **positive\npredictive value (PPV)**, the probability that a flagged record truly has the outcome — and **recall = TP / (TP + FN)** —\nidentical to **sensitivity**, the probability that a true case is flagged. The **F1 score** is the harmonic mean of the two,\nF1 = 2 * precision * recall / (precision + recall) = 2*TP / (2*TP + FP + FN). The harmonic mean (rather than the arithmetic\nmean) is used because it punishes imbalance between precision and recall: a classifier with precision 0.95 and recall 0.05\nhas arithmetic mean 0.50 but F1 = 0.095, correctly reflecting that it is nearly useless. The general **F-beta** score,\nF_beta = (1 + beta^2) * precision * recall / (beta^2 * precision + recall), weights recall beta times as heavily as\nprecision; F2 (beta=2) favors recall (miss few true cases), F0.5 (beta=0.5) favors precision (few false alarms).\n\n**Core conceptual distinction.** Precision, recall, and F1 are all **threshold-specific point estimates** — they describe\none operating point, not a classifier's whole ranking ability. To characterize behavior across thresholds you trace a\n**precision-recall (PR) curve** (precision on the y-axis against recall on the x-axis as the decision threshold sweeps), the\nnatural analogue of the ROC curve. The single most important property for RWE: **none of precision, F1, or the PR curve uses\nthe true-negative count (TN)**, whereas the ROC curve's specificity = TN/(TN+FP) and the false-positive rate do. 
When the\noutcome is rare — the norm for adverse events, incident cancers, or phenotypes in claims/EHR, where prevalence may be\n<1% — TN dominates the denominator of specificity, so an enormous absolute number of false positives barely moves the\nfalse-positive rate and the ROC curve (and its AUC) looks deceptively excellent. The PR curve, by ignoring TN, exposes that\nthe same classifier may flag ten false positives for every true case (precision 0.09). Saito & Rehmsmeier (2015) and Davis &\nGoadrich (2006) formalize this: a curve that dominates in ROC space also dominates in PR space, but PR space *visually\nresolves* differences in the high-imbalance regime that ROC space compresses against the y-axis.\n\n**Pros, cons, and trade-offs.**\n- **vs ROC / AUC (`roc-auc-discrimination-rwe`):** Precision/recall/F1 and the PR curve are threshold- and\n  prevalence-aware — precision answers \"of the records this algorithm flags, what fraction are real?\", which is exactly the\n  chart-review and downstream-analytic burden a pharmacoepi team cares about. ROC/AUC are prevalence-*invariant* (they depend\n  only on the score ranking, not on case mix), which makes AUC portable across populations but blind to the false-positive\n  flood under rare outcomes. **Prefer PR/F1** when the positive class is rare and the cost of false positives is real (manual\n  adjudication, biased downstream rate estimates); **prefer ROC/AUC** when comparing intrinsic discrimination across cohorts\n  with different prevalence, or when both error types are roughly symmetric.\n- **vs raw accuracy:** Accuracy = (TP+TN)/N is actively misleading under imbalance — a classifier that calls every record\n  \"no outcome\" achieves 99% accuracy when prevalence is 1%, while having recall 0 and undefined precision. F1 cannot be\n  gamed this way. 
**Never report accuracy alone for a rare outcome.**\n- **vs Matthews correlation coefficient (MCC) / balanced accuracy:** MCC uses all four cells and is symmetric in the two\n  classes, so it is more robust when you care about both positive and negative prediction (Chicco & Jurman, 2020). F1 is\n  asymmetric — it ignores TN and treats the positive class as the focus — which is exactly right for case-finding but can\n  flatter a classifier that handles the positive class well while misclassifying most negatives. **Prefer MCC** when both classes matter symmetrically;\n  **prefer F1** when the task is finding the rare positives.\n- **vs PPV/sensitivity reported separately (`claims-outcome-algorithm-ppv-sensitivity-rwe`):** F1 collapses two numbers into\n  one for model selection, but it hides the operating-point trade-off the validation literature insists on reporting. For a\n  phenotype headed into an analytic cohort, report precision (PPV) and recall (sensitivity) *separately* with confidence\n  intervals; use F1 only as a tie-breaker or hyperparameter-selection objective.\n\n**When to use.** Tuning and selecting a computable phenotype, outcome algorithm, or ML risk classifier where the positive\nclass is rare and false positives carry real cost (adjudication labor, attenuated downstream effect estimates from outcome\nmisclassification). Use the PR curve and its area (average precision) to compare candidate algorithms across all thresholds;\nuse F1 (or F-beta with a justified beta) as a single scalar objective for cross-validated model selection; use precision at\na fixed recall (or recall at a fixed precision) when the deployment threshold is dictated by an operational constraint\n(e.g., \"we can chart-review 200 flags\").\n\n**When NOT to use — and when it is actively misleading or dangerous.**\n- **Comparing classifiers across populations with different prevalence.** Precision and F1 *change with prevalence* even when\n  the classifier is identical, because PPV depends on base rate. 
Comparing F1 across two databases with different outcome\n  prevalence confounds algorithm quality with case mix; use AUC or report prevalence alongside.\n- **As the only validation metric for a phenotype entering a causal analysis.** A high F1 does not tell you whether\n  misclassification is *differential* by exposure — the property that biases comparative estimates. F1 is a marginal accuracy\n  summary, not a bias diagnostic; pair it with separate PPV/sensitivity and a quantitative-bias-analysis correction.\n- **When the negative class genuinely matters.** If a false negative and a false positive are both clinically costly and you\n  care about correctly classifying non-cases, F1's blindness to TN understates the problem; use MCC or balanced accuracy.\n- **Reporting a single F1 without the threshold or the PR curve.** F1 at an unstated threshold is uninterpretable, and\n  cherry-picking the threshold that maximizes F1 on the same data that estimates it overstates performance; choose the\n  threshold on training/validation folds and report the held-out F1 with its CI.\n\n**Data-source operational depth.**\n- **Claims (FFS):** The \"truth\" labels for precision/recall almost always come from chart review on a sampled subset, so\n  precision (PPV) is estimated directly from the reviewed flags but recall (sensitivity) requires sampling *true cases* from\n  an independent reference (a registry, linked EHR, or adjudication) — you cannot estimate recall from claims alone because\n  the FN cell is unobserved. Restrict the confusion matrix to FFS-observable person-time; Medicare Advantage enrollees\n  generate no FFS claims, so apparent \"negatives\" in MA-only spans are unobserved, not true negatives; counting them pads TN\n  and hides missed cases, which overstates recall and every TN-based metric (precision itself does not use negatives). 
Outcome prevalence in claims is typically <2%, so report the PR curve, not the ROC curve.\n- **EHR:** Richer features (labs, vitals, NLP-derived concepts) usually raise both precision and recall, but encounter-driven\n  ascertainment means a true case who seeks care out-of-system is an unobservable false negative; recall is therefore an\n  overestimate unless an external reference standard captures out-of-system events. Define an \"active in system\" requirement\n  before counting negatives.\n- **Registry / linked:** Adjudicated registry outcomes provide the cleanest reference standard for both precision and recall,\n  making linked claims-EHR-registry the strongest substrate for honest PR estimation; linkage selects the linkable subset, so\n  report the linkage rate and check that the validation sample's prevalence matches the analytic cohort's.\n\n**Worked example.** A claims-based computable phenotype for incident heart failure is applied to 100,000 FFS-observable\nperson-records with true prevalence ~1.5% (1,500 true cases). At the chosen threshold the algorithm flags 2,000 records, of\nwhich 1,200 are confirmed by linked-EHR adjudication (TP=1,200, FP=800), and 300 true cases are missed (FN=300). Then\n**precision (PPV) = 1200/(1200+800) = 0.60**, **recall (sensitivity) = 1200/(1200+300) = 0.80**, and **F1 = 2*0.60*0.80 /\n(0.60+0.80) = 0.96/1.40 = 0.686**. The ROC view flatters this algorithm: with TN = 100,000 - 1,200 - 800 - 300 = 97,700, the\nfalse-positive rate is only 800/(800+97,700) = 800/98,500 = 0.008, so the ROC curve hugs the top-left corner and AUC looks\nnear-perfect — yet two in five flagged records are false, which wastes 40% of the adjudication effort and introduces outcome\nmisclassification. The PR curve makes that visible. 
If the\ndownstream analysis cannot tolerate the false positives, slide the threshold up (typically raising precision and lowering\nrecall) and report F0.5 to formalize the precision preference; if missing true cases is the dominant cost, slide down and\nreport F2.\n\n**Interpreting the output**\n\nIn the small worked example, TP = 80, FP = 20, FN = 40 yields precision = 0.800, recall = 0.667, and\nF1 = 0.727. In the large-scale example (TP = 1,200, FP = 800, FN = 300), F1 = 0.686.\n\n*(1) Formal interpretation.* F1 is the harmonic mean of precision (= TP / (TP + FP)) and recall\n(= TP / (TP + FN)). The harmonic mean penalizes imbalance: a model with precision 0.99 but recall 0.01\nearns an F1 near 0.02, exposing its failure despite near-perfect avoidance of false positives. True negatives\ndo not enter the F1 formula — this is deliberate. When the negative class vastly outnumbers the positive\nclass (as in rare-disease phenotyping), accuracy is dominated by the TN count and can exceed 99% even\nfor a model that flags nothing at all; F1 and the PR curve see through that illusion because they are\ncomputed only over the positive-class predictions and ground-truth positives. An F1 of 0.686 in the\nlarge example reflects that 40% of flags are false (precision 0.60) even though recall is high (0.80).\n\n*(2) Practical interpretation.* When evaluating a phenotyping algorithm for use as an RWE outcome,\nboth error types matter but often unequally. If false positives (misclassified non-cases used as\noutcomes) drive attenuation bias, prioritize precision and consider threshold increases or F0.5.\nIf false negatives (missed cases causing outcome under-ascertainment) are more damaging, prioritize\nrecall and consider F2. Always plot the full precision-recall curve across thresholds rather than\nreporting a single F1 at one cut-point — the area under the PR curve (AUCPR) captures classifier\nperformance more honestly than ROC-AUC under class imbalance.",
    "primary_category": "Machine_Learning_and_Predictive",
    "tags": [
      "f1-score",
      "precision-recall",
      "positive-predictive-value",
      "sensitivity",
      "class-imbalance",
      "phenotype-validation",
      "model-selection",
      "f-beta"
    ],
    "applies_to_study_types": [
      "claims_analysis",
      "ehr_study",
      "registry_linkage",
      "diagnostic_accuracy",
      "prediction_model",
      "safety_surveillance"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1371/journal.pone.0118432",
        "url": "https://doi.org/10.1371/journal.pone.0118432",
        "citation_text": "Saito T, Rehmsmeier M. The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS One. 2015;10(3):e0118432.",
        "year": 2015,
        "authors_short": "Saito & Rehmsmeier",
        "notes": "The canonical argument and worked demonstration for why precision-recall curves dominate ROC curves under heavy class imbalance; shows how the ROC curve's reliance on the true-negative count hides false-positive floods when the positive class is rare."
      },
      {
        "role": "explain",
        "doi": "10.1145/1143844.1143874",
        "url": "https://doi.org/10.1145/1143844.1143874",
        "citation_text": "Davis J, Goadrich M. The relationship between precision-recall and ROC curves. Proceedings of the 23rd International Conference on Machine Learning (ICML). 2006:233-240.",
        "year": 2006,
        "authors_short": "Davis & Goadrich",
        "notes": "Establishes the formal correspondence between PR space and ROC space - a curve dominating in one dominates in the other - and why interpolation in PR space must be done correctly (linear interpolation overstates average precision)."
      },
      {
        "role": "explain",
        "doi": "10.1093/bioinformatics/16.5.412",
        "url": "https://doi.org/10.1093/bioinformatics/16.5.412",
        "citation_text": "Baldi P, Brunak S, Chauvin Y, Andersen CAF, Nielsen H. Assessing the accuracy of prediction algorithms for classification: an overview. Bioinformatics. 2000;16(5):412-424.",
        "year": 2000,
        "authors_short": "Baldi et al.",
        "notes": "Overview of confusion-matrix metrics (precision, recall, F-measures, correlation coefficients) and their interrelationships; clarifies which metrics ignore the true-negative cell and the consequences for imbalanced problems."
      },
      {
        "role": "demonstrate",
        "doi": "10.1186/s12864-019-6413-7",
        "url": "https://doi.org/10.1186/s12864-019-6413-7",
        "citation_text": "Chicco D, Jurman G. The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genomics. 2020;21(1):6.",
        "year": 2020,
        "authors_short": "Chicco & Jurman",
        "notes": "Demonstrates concrete confusion matrices where F1 (and accuracy) flatter a classifier that performs poorly on the minority class; motivates reporting MCC alongside F1 when both classes matter symmetrically."
      },
      {
        "role": "use",
        "doi": "10.1186/1471-2105-12-77",
        "url": "https://doi.org/10.1186/1471-2105-12-77",
        "citation_text": "Robin X, Turck N, Hainard A, Tiberti N, Lisacek F, Sanchez JC, Muller M. pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics. 2011;12:77.",
        "year": 2011,
        "authors_short": "Robin et al.",
        "notes": "Widely used R package whose coords() function extracts precision, recall, and other confusion-matrix metrics at any threshold and supports thresholded comparison of classifiers in the same workflow that produces PR/ROC curves."
      }
    ],
    "plain_language_summary": "When a computer algorithm scans health records to flag patients who have a particular disease, it will sometimes flag a healthy patient by mistake (a false positive) and sometimes miss a true patient (a false negative). Precision answers 'of everyone the algorithm flagged, what fraction actually had the disease?' and recall answers 'of everyone who truly had the disease, what fraction did the algorithm catch?' The F1 score blends these two rates into a single number by taking their harmonic mean — a formula that gives a low score whenever either rate is poor, so you can't game it by doing well on only one side. It runs from 0 (useless) to 1 (perfect), and is the preferred single-number summary when the outcome is rare, like most diseases in a large insurance claims database.",
    "key_terms": [
      {
        "term": "confusion matrix",
        "definition": "A 2-by-2 table that tallies every patient into one of four buckets: correctly flagged as sick (true positive), wrongly flagged as sick (false positive), correctly cleared as healthy (true negative), or wrongly cleared as healthy (false negative)."
      },
      {
        "term": "true positive (TP)",
        "definition": "A patient the algorithm correctly flagged as having the outcome — they were flagged and they truly had it."
      },
      {
        "term": "false positive (FP)",
        "definition": "A patient the algorithm wrongly flagged — they were flagged but did not actually have the outcome."
      },
      {
        "term": "false negative (FN)",
        "definition": "A patient the algorithm missed — they truly had the outcome but were not flagged."
      },
      {
        "term": "harmonic mean",
        "definition": "A way of averaging two numbers that gives a low result whenever either number is low, unlike a regular average which can stay high even if one number is near zero."
      }
    ],
    "worked_example": {
      "scenario": "A research team builds a claims-based algorithm to identify patients who developed a serious drug reaction in a 300-patient sample. A physician reviews all 300 charts and confirms the truth. The algorithm flags 100 patients as cases. Of those 100 flagged patients, 80 truly had the drug reaction (TP=80) and 20 did not (FP=20). Of the 200 patients the algorithm did not flag, 40 truly had the reaction and were missed (FN=40) and 160 were correctly cleared (TN=160). The team wants to know how well the algorithm performs.",
      "dataset": {
        "caption": "2x2 confusion matrix: adjudicated truth (rows) versus algorithm flag (columns) for 300 patients.",
        "columns": [
          "",
          "Algorithm says: CASE",
          "Algorithm says: NOT A CASE",
          "Row total"
        ],
        "rows": [
          [
            "Truth: CASE (drug reaction present)",
            "TP = 80",
            "FN = 40",
            "120 true cases"
          ],
          [
            "Truth: NOT A CASE",
            "FP = 20",
            "TN = 160",
            "180 true non-cases"
          ],
          [
            "Column total",
            "100 flagged",
            "200 not flagged",
            "300 patients"
          ]
        ]
      },
      "steps": [
        "Precision = TP / (TP + FP) = 80 / (80 + 20) = 80 / 100 = 0.800. This means 80 out of every 100 patients the algorithm flags truly had the drug reaction; 1 in 5 flags is a false alarm.",
        "Recall = TP / (TP + FN) = 80 / (80 + 40) = 80 / 120 = 0.667. This means the algorithm caught 80 of the 120 true cases; 1 in 3 true cases was missed.",
        "F1 = 2 × precision × recall / (precision + recall) = 2 × 0.800 × 0.667 / (0.800 + 0.667) = 1.067 / 1.467 = 0.727.",
        "Cross-check with the shortcut formula: F1 = 2×TP / (2×TP + FP + FN) = 160 / (160 + 20 + 40) = 160 / 220 = 0.727. Same answer.",
        "Because the harmonic mean punishes imbalance, an algorithm that had precision 1.0 but recall 0.0 would score F1 = 0, even though its regular average would be 0.5 — this is why F1 can't be fooled by doing well on only one side."
      ],
      "result": "Precision = 0.800 (80 of 100 flags are real cases). Recall = 0.667 (80 of 120 true cases caught). F1 = 0.727 (harmonic mean of the two). The algorithm is reasonably precise but misses one-third of true cases; whether that trade-off is acceptable depends on the cost of missing a case versus the cost of a false alarm."
    },
    "prerequisites": [
      "sensitivity-specificity-rwe",
      "diagnostic-accuracy",
      "claims-outcome-algorithm-ppv-sensitivity-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "F1 at a fixed operating threshold",
        "description": "A single scalar at one decision threshold, F1 = 2*TP/(2*TP+FP+FN), used as a tie-breaking objective for model selection or to report performance at the threshold a deployment constraint dictates.",
        "edge_cases": [
          "F1 is undefined when TP+FP+FN = 0 (no positive predictions and no true positives); report it as 0 by convention or flag the degenerate case rather than dividing by zero.",
          "Choosing the threshold that maximizes F1 on the same data used to estimate it overstates performance; select the threshold on training/validation folds and report held-out F1 with a confidence interval."
        ],
        "data_source_notes": "claims: recall requires sampling true cases from an independent reference standard because the false-negative cell is unobservable from claims alone; precision is estimable directly from adjudicated flags."
      },
      {
        "name": "Precision-recall curve and average precision (area under PR curve)",
        "description": "The PR curve traces precision against recall as the threshold sweeps; its summary area (average precision, AP) is the threshold-free analogue of AUC and the preferred discrimination summary under heavy imbalance.",
        "edge_cases": [
          "The baseline (no-skill) precision equals the outcome prevalence, so PR curves and AP are not comparable across populations with different prevalence; report prevalence alongside AP.",
          "Linear interpolation between PR points overstates AP (Davis & Goadrich); use the step/average-precision estimator that sums precision weighted by the change in recall."
        ],
        "data_source_notes": "ehr/linked: AP is most trustworthy when truth labels come from adjudication on a sample whose prevalence matches the analytic cohort."
      },
      {
        "name": "F-beta (recall- or precision-weighted)",
        "description": "F_beta = (1+beta^2)*precision*recall/(beta^2*precision+recall); F2 weights recall more (miss few true cases), F0.5 weights precision more (few false alarms), making the cost asymmetry explicit.",
        "edge_cases": [
          "beta must be justified by the relative cost of false negatives vs false positives in the specific decision; an unstated beta is uninterpretable.",
          "As beta -> infinity F-beta -> recall and as beta -> 0 F-beta -> precision, so extreme beta values collapse the metric back to a single error type."
        ],
        "data_source_notes": "claims: F0.5 is common when downstream adjudication capacity is the binding constraint (false positives are expensive); F2 when an undercounted outcome would bias a safety signal."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "roc-auc-discrimination-rwe",
        "pros_of_this": "Threshold- and prevalence-aware; precision (PPV) directly quantifies the adjudication burden and downstream misclassification a flagged-record analysis incurs, and the PR curve resolves false-positive floods that ROC space hides under rare outcomes.",
        "cons_of_this": "Precision and F1 vary with outcome prevalence even for an identical classifier, so they are not portable across populations with different base rates.",
        "when_to_prefer": "When the positive class is rare and false positives are operationally costly; use ROC/AUC instead to compare intrinsic discrimination across cohorts with different prevalence."
      },
      {
        "compared_to": "claims-outcome-algorithm-ppv-sensitivity-rwe",
        "pros_of_this": "Collapses precision and recall into one scalar suitable as a cross-validated model-selection objective and a tie-breaker between candidate algorithms.",
        "cons_of_this": "Hides the operating-point trade-off the validation literature requires; a single F1 obscures whether misclassification is differential by exposure, which is what biases causal estimates.",
        "when_to_prefer": "Use F1 for hyperparameter/model selection; report PPV and sensitivity separately (with CIs) for a phenotype that will enter an analytic cohort."
      },
      {
        "compared_to": "diagnostic-accuracy",
        "pros_of_this": "Focuses on the rare positive class and ignores the true-negative cell, which is exactly the case-finding view a phenotype developer needs when negatives vastly outnumber positives.",
        "cons_of_this": "Ignoring true negatives understates performance problems when correctly classifying non-cases is itself clinically or analytically important.",
        "when_to_prefer": "When the task is finding rare positives; prefer balanced accuracy or MCC when both classes matter symmetrically."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Build the confusion matrix on FFS-observable person-time only; drop Medicare Advantage-only spans (no claims) so apparent negatives are not unobserved person-time misread as true negatives. Estimate precision (PPV) directly from adjudicated flags, but estimate recall (sensitivity) by sampling true cases from an independent reference standard because the false-negative cell is unobservable from claims alone; report the PR curve (not ROC) given <2% prevalence.",
      "ehr": "Encounter-driven ascertainment makes out-of-system true cases unobservable false negatives, so recall is an overestimate unless an external reference captures out-of-system events; require an \"active in system\" definition before counting negatives. Richer features typically lift both precision and recall.",
      "registry": "Adjudicated registry outcomes are the cleanest reference standard for both precision and recall; use a registry- or linkage-defined gold standard and confirm the validation sample's outcome prevalence matches the analytic cohort's so AP and F1 transfer.",
      "linked": "Linked claims-EHR-registry is the strongest substrate for honest PR estimation; linkage selects the linkable subset, so report the linkage rate and check for prevalence drift between the linked validation sample and the full cohort."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\nfrom sklearn.metrics import (precision_score, recall_score, f1_score, fbeta_score,\n                             precision_recall_curve, average_precision_score,\n                             confusion_matrix)\n\n# Worked example: rare outcome (~1.5% prevalence). Reconstruct labels matching the\n# confusion matrix TP=1200, FP=800, FN=300, TN=97700 at the chosen threshold.\ny_true  = np.r_[np.ones(1200), np.zeros(800), np.ones(300), np.zeros(97700)].astype(int)\ny_pred  = np.r_[np.ones(1200), np.ones(800),  np.zeros(300), np.zeros(97700)].astype(int)\n\ntn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()\nprec = precision_score(y_true, y_pred)            # = PPV  = TP/(TP+FP) = 0.60\nrec  = recall_score(y_true, y_pred)               # = sens = TP/(TP+FN) = 0.80\nf1   = f1_score(y_true, y_pred)                   # harmonic mean        = 0.686\nf2   = fbeta_score(y_true, y_pred, beta=2)        # recall-weighted\nf05  = fbeta_score(y_true, y_pred, beta=0.5)      # precision-weighted\nprint(f\"TP={tp} FP={fp} FN={fn} TN={tn}\")\nprint(f\"precision(PPV)={prec:.3f} recall(sens)={rec:.3f} F1={f1:.3f} F2={f2:.3f} F0.5={f05:.3f}\")\n\n# PR curve from continuous scores: average precision is the imbalance-aware summary.\n# The no-skill baseline equals prevalence (here ~0.015), unlike ROC's 0.5 baseline.\nrng = np.random.default_rng(0)\nscore = np.where(y_true == 1, rng.beta(2.5, 2.0, size=y_true.size),\n                              rng.beta(2.0, 6.0, size=y_true.size))\np, r, thr = precision_recall_curve(y_true, score)\nap = average_precision_score(y_true, score)\nprint(f\"average precision (area under PR curve) = {ap:.3f}; \"\n      f\"no-skill baseline = prevalence = {y_true.mean():.3f}\")",
        "description": "Compute precision, recall, F1, and F-beta at a threshold and the precision-recall curve with average\nprecision (area under PR curve) using scikit-learn. Inputs: y_true (0/1 array of adjudicated labels) and\ny_score (predicted probabilities from the phenotype/classifier). Demonstrates the rare-outcome confusion\nmatrix from the worked example and shows that average precision (not ROC AUC) is the imbalance-aware summary.",
        "dependencies": [
          "numpy",
          "scikit-learn"
        ],
        "source_citations": [
          "saito-2015",
          "davis-2006"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(yardstick)\nlibrary(PRROC)\n\n# Worked example confusion matrix: TP=1200, FP=800, FN=300, TN=97700.\ntruth <- factor(c(rep(\"yes\",1200), rep(\"no\",800), rep(\"yes\",300), rep(\"no\",97700)),\n                levels = c(\"yes\",\"no\"))\npred  <- factor(c(rep(\"yes\",1200), rep(\"yes\",800), rep(\"no\",300), rep(\"no\",97700)),\n                levels = c(\"yes\",\"no\"))\ndf <- data.frame(truth = truth, pred = pred)\n\n# event_level='first' so \"yes\" (the rare positive) is the event of interest.\nprecision(df, truth, pred, event_level = \"first\")   # PPV  = 0.60\nrecall(df,    truth, pred, event_level = \"first\")   # sens = 0.80\nf_meas(df,    truth, pred, event_level = \"first\")   # F1   = 0.686\nf_meas(df,    truth, pred, event_level = \"first\", beta = 2)    # F2  (recall-weighted)\nf_meas(df,    truth, pred, event_level = \"first\", beta = 0.5)  # F0.5 (precision-weighted)\n\n# PR curve area from continuous scores; baseline precision = prevalence, not 0.5.\ny    <- ifelse(truth == \"yes\", 1L, 0L)\nset.seed(0)\nscore <- ifelse(y == 1, rbeta(length(y), 2.5, 2.0), rbeta(length(y), 2.0, 6.0))\npr <- pr.curve(scores.class0 = score[y == 1], scores.class1 = score[y == 0], curve = TRUE)\ncat(\"area under PR curve (average precision) =\", round(pr$auc.integral, 3), \"\\n\")",
        "description": "Precision, recall, F1, F-beta, and the precision-recall curve in R. yardstick (tidymodels) computes the\npoint metrics from a factor of truth and predicted classes; PRROC computes the PR curve area correctly\n(it uses the non-linear interpolation Davis & Goadrich recommend) from continuous scores. Inputs mirror\nthe Python version: adjudicated labels and predicted scores from the phenotype.",
        "dependencies": [
          "yardstick",
          "PRROC"
        ],
        "source_citations": [
          "saito-2015",
          "robin-2011"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* Scored validation set from the worked example: TP=1200, FP=800, FN=300, TN=97700. */\ndata scored;\n  do i = 1 to 1200; truth = 1; pred = 1; output; end;   /* TP */\n  do i = 1 to  800; truth = 0; pred = 1; output; end;   /* FP */\n  do i = 1 to  300; truth = 1; pred = 0; output; end;   /* FN */\n  do i = 1 to 97700; truth = 0; pred = 0; output; end;  /* TN */\n  drop i;\nrun;\n\n/* 2x2 table; OUTPUT the cell counts so we can derive precision/recall/F1 exactly. */\nproc freq data=scored;\n  tables truth*pred / out=cells outpct norow nocol nopercent;\nrun;\n\nproc sql noprint;\n  select sum(case when truth=1 and pred=1 then count else 0 end),\n         sum(case when truth=0 and pred=1 then count else 0 end),\n         sum(case when truth=1 and pred=0 then count else 0 end)\n    into :tp, :fp, :fn\n  from cells;\nquit;\n\ndata metrics;\n  tp = &tp; fp = &fp; fn = &fn;\n  precision = tp / (tp + fp);                 /* = PPV  = 0.60 */\n  recall    = tp / (tp + fn);                 /* = sens = 0.80 */\n  f1   = 2*precision*recall / (precision + recall);             /* = 0.686 */\n  beta = 2;\n  fbeta = (1 + beta**2)*precision*recall / (beta**2*precision + recall);  /* F2 */\nrun;\n\nproc print data=metrics noobs; var precision recall f1 fbeta; run;",
        "description": "Precision, recall, and F1 from a scored validation dataset in SAS. PROC FREQ builds the 2x2 confusion\nmatrix (PPV = precision, sensitivity = recall are produced directly by the pvalue/sensspec options on the\nbinary agreement table), and a short DATA step assembles F1 and F-beta. Input: work.scored with the\nadjudicated truth (1/0) and the thresholded prediction (1/0). No invented PROCs; PROC FREQ is base/STAT.",
        "dependencies": [],
        "source_citations": [
          "saito-2015",
          "baldi-2000"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  M[2x2 confusion matrix<br/>at one threshold] --> TP[TP]\n  M --> FP[FP]\n  M --> FN[FN]\n  M --> TN[TN]\n  TP --> P[Precision = PPV<br/>TP/&#40;TP+FP&#41;]\n  FP --> P\n  TP --> R[Recall = Sensitivity<br/>TP/&#40;TP+FN&#41;]\n  FN --> R\n  P --> F[F1 = harmonic mean<br/>2&#183;P&#183;R/&#40;P+R&#41;]\n  R --> F\n  TN -. ignored by precision/recall/F1 .-> Note[TN unused -> insensitive to<br/>false-positive flood under imbalance]",
        "caption": "How precision, recall, and F1 are built from the confusion matrix. Precision (PPV) and recall (sensitivity) each use three cells; F1 is their harmonic mean. None of the three uses the true-negative cell, which is why they expose the false-positive burden that ROC-based metrics hide when the positive class is rare.",
        "alt_text": "Diagram showing the four confusion-matrix cells feeding precision and recall, which combine into the F1 harmonic mean, with the true-negative cell marked as ignored by all three metrics.",
        "source_type": "illustrative",
        "source_citations": [
          "saito-2015"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Q[Evaluate a binary classifier / phenotype] --> Imb{Positive class rare?<br/>e.g. prevalence < 5%}\n  Imb -->|No, roughly balanced| ROC[ROC curve + AUC<br/>or balanced accuracy / MCC]\n  Imb -->|Yes, heavily imbalanced| PR[Precision-recall curve<br/>+ average precision]\n  PR --> Cost{Cost asymmetry?}\n  Cost -->|False positives costly<br/>adjudication burden| F05[Report F0.5 + precision at fixed recall]\n  Cost -->|False negatives costly<br/>undercounted outcome| F2[Report F2 + recall at fixed precision]\n  Cost -->|Symmetric| F1[Report F1 + PR curve]\n  ROC --> Comp[Comparing across populations<br/>with different prevalence?]\n  Comp -->|Yes| UseAUC[Use AUC - prevalence-invariant]",
        "caption": "Choosing among precision, recall, F1, F-beta, and ROC/AUC. Under heavy imbalance use the precision-recall curve; pick F-beta to encode whether false positives or false negatives dominate the decision cost; reserve ROC/AUC for prevalence-invariant comparison across populations.",
        "alt_text": "Decision tree from classifier evaluation through a rare-class check to either ROC/AUC for balanced or cross-population comparison, or the precision-recall curve with F0.5, F1, or F2 chosen by the cost asymmetry.",
        "source_type": "illustrative",
        "source_citations": [
          "saito-2015"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "alternative_to",
        "target_slug": "roc-auc-discrimination-rwe",
        "notes": "Precision-recall/F1 and ROC/AUC are competing discrimination summaries; PR/F1 are prevalence-aware and preferred under heavy imbalance, while ROC/AUC are prevalence-invariant and preferred for cross-population comparison of intrinsic ranking."
      },
      {
        "relation_type": "complements",
        "target_slug": "claims-outcome-algorithm-ppv-sensitivity-rwe",
        "notes": "F1 condenses PPV (precision) and sensitivity (recall) into one scalar for model selection; the validation concept insists those two be reported separately with CIs for a phenotype entering an analytic cohort."
      },
      {
        "relation_type": "see_also",
        "target_slug": "diagnostic-accuracy",
        "notes": "Diagnostic-accuracy studies estimate the same confusion-matrix quantities (sensitivity = recall, PPV = precision) against a reference standard; F1 simply combines two of them."
      },
      {
        "relation_type": "used_with",
        "target_slug": "prediction-model-validation-recalibration-rwe",
        "notes": "Precision/recall/F1 quantify discrimination at an operating point; prediction-model validation pairs them with calibration and overall measures (Brier score) for a complete performance assessment."
      },
      {
        "relation_type": "used_with",
        "target_slug": "ehr-phenotyping-algorithms-rwe",
        "notes": "EHR phenotyping algorithms are tuned and reported with precision, recall, and F1 against an adjudicated reference, typically using the precision-recall curve because the positive class is rare."
      },
      {
        "relation_type": "complements",
        "target_slug": "algorithm-validation",
        "notes": "Algorithm validation generates the adjudicated truth labels (the confusion matrix) from which precision, recall, and F1 are computed."
      }
    ],
    "aliases": [
      "F1 score",
      "F-measure",
      "precision and recall",
      "F-beta score",
      "precision-recall curve",
      "average precision"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "journal",
      "pqa-cms"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "federated-distributed-network-analysis",
    "name": "Federated and Distributed Network Analysis",
    "short_definition": "An analytic architecture in which patient-level records remain at each participating site under their own governance, and only privacy-preserving aggregates — site-level effect estimates, stratified risk-set tables, or gradient contributions — travel to a coordinating center, enabling multi-database evidence generation without pooling protected health information.",
    "long_description": "**Federated and distributed network analysis** is the enabling machinery that makes a\nmulti-database study physically possible when patient-level protected health information (PHI)\ncannot cross data-custodian boundaries. The governance constraint is the starting point: under\nHIPAA in the United States, GDPR in the European Union, and analogous statutes in Japan, Canada,\nand the UK, a health data custodian — an insurer, a hospital system, a national statistics\nagency — cannot ship row-level patient records to a coordinating team without a data-use\nagreement that may take years to negotiate or may never be approved. The architectural response\nis to move the analysis to the data rather than moving the data to the analysis. A coordinating\ncenter distributes a study package (code, concept-set definitions, analytic parameters) to each\nparticipating site; each site executes the package on its local data and returns only a small,\nprivacy-preserving summary — a table of counts and person-time, a vector of log hazard ratios,\nor a gradient contribution to a distributed optimizer. No individual patient row ever crosses a\ncustodian boundary.\n\nThis entry covers the analytic machinery — the spectrum of federation approaches, the flagship\nregulatory and academic networks, heterogeneity handling, and the characteristic failure modes.\nThe multi-database entry covers the study-design layer: when to use a network rather than a\nsingle source, how the estimand changes across a network, and how to interpret network-level\nconclusions.\n\n**The federation spectrum.**\nFederation is a family of approaches ordered by what crosses the firewall and how much\nstatistical work happens locally versus centrally.\n\n(1) *Common protocol with site-run analyses.* Each site implements the study independently from\na prospectively harmonized protocol document. 
The coordinating center receives pre-specified\nestimates — a log hazard ratio and its standard error — from each site and combines them by\ninverse-variance-weighted meta-analysis. Easiest to operationalize for sites without shared\ninfrastructure; hardest to guarantee truly identical phenotyping and covariate capture.\n\n(2) *Common data model with distributed code execution.* Each site maps its raw data to a\nstandardized schema — the OMOP CDM or the FDA Sentinel SCDM — and runs byte-identical code\ndistributed by the coordinating center. The coordinating center receives aggregated tables\n(stratum-specific event counts, person-time, PS-adjusted incidence rates). This is the OHDSI\nand Sentinel operating model. The CDM enforces structural portability; vocabulary standardization\n(SNOMED, RxNorm, LOINC) is what makes a concept set mean the same thing at every site.\n\n(3) *Meta-analysis of site-specific model estimates.* Each site fits its own regression model\nagainst its locally held data and shares only the estimated coefficient and its variance. The\ncoordinating center pools across sites by fixed-effect or random-effects meta-analytic weights.\nComputationally transparent and auditable; requires each site to fit and validate its own model.\nThis is the approach detailed in the worked example below.\n\n(4) *True federated estimation via gradient or score sharing.* The optimizer for a global model\niterates in rounds: each site computes its local gradient contribution (or score function, or\nlog-likelihood piece) on its own data and submits it to the coordinating center; the center\naggregates gradients, updates the global model parameter, and returns the updated parameter to\nall sites for the next iteration. Distributed Cox score aggregation achieves mathematical\nequivalence to a single pooled IPD analysis; federated gradient descent does the same for\nmachine-learning models. 
Infrastructure and communication overhead are substantially higher.\n\n**Sentinel System: FDA active surveillance.**\nThe FDA Sentinel System is the regulatory flagship of the CDM-plus-distributed-code model.\nLaunched in 2008 in response to the FDA Amendments Act of 2007 and continuously expanded,\nSentinel operates across more than 60 data partners (commercial claims plans, Medicare FFS\nextracts, and integrated-delivery systems) whose data are mapped to the Sentinel Common Data\nModel (SCDM). A coordinating center (the Sentinel Operations Center, hosted at the Harvard\nPilgrim Health Care Institute) distributes analytic queries as SAS programs;\neach data partner executes the program locally and returns only cell-level counts and\nperson-time. The PRISM (Post-Licensure Rapid Immunization Safety Monitoring) vaccine-safety\nprogram and the ARIA (Active Risk Identification and Analysis) system enable self-service active\nsurveillance: a regulatory scientist can issue a standardized\nsafety query and receive pooled results spanning more than 500 million person-years of\nlongitudinal data, with all PHI remaining at each partner. Platt et al. (2018) describe how the\nSentinel Initiative evolved from a pilot to a national pharmacovigilance resource.\n\n**OHDSI network and LEGEND.**\nThe Observational Health Data Sciences and Informatics (OHDSI) collaborative is the leading\nacademic network using the OMOP CDM plus distributed code execution. Hripcsak et al. (2015)\narticulate the opportunities that arise when databases independently governed by hundreds of\ninstitutions map to a single CDM and share analytic code without sharing data. The LEGEND\n(Large-scale Evidence Generation and Evaluation across a Network of Databases) initiative\ndemonstrates federated network analysis at scale: Suchard et al. (2019) ran a systematic\ncomparison of all first-line antihypertensive drug classes across nine databases covering\n4.9 million patients. 
Every site ran the same OHDSI CohortMethod package\nagainst its local OMOP CDM instance; results were meta-analyzed across sites with empirical\ncalibration using negative controls — drug-outcome pairs with no plausible causal effect — to\ndetect and adjust for residual systematic error. LEGEND is the proof of concept that federated\nanalysis can produce large-scale, reproducible, calibrated comparative effectiveness evidence\nat a speed no single site or literature review could approach.\n\n**EHDEN and DARWIN EU.**\nIn Europe, the European Health Data Evidence Network (EHDEN) and the European Medicines\nAgency's DARWIN EU (Data Analysis and Real World Interrogation Network) replicate the\nOMOP-based federated model under GDPR constraints. EHDEN has mapped records from hundreds of\nmillions of European patients across EHRs and administrative databases to OMOP CDM and certifies\ndata partners for study execution. DARWIN EU provides the EMA with a governed access mechanism\nfor OMOP network studies to support regulatory decisions, functioning as the European analog to\nthe FDA Sentinel System.\n\n**Heterogeneity across sites — signal, not noise.**\nWhen site-specific estimates diverge — as measured by I-squared, tau-squared, or a forest plot\n— the appropriate response is investigation, not suppression. Between-site heterogeneity\nusually reflects one of three sources: (a) real biological or population heterogeneity (the\ndrug genuinely works differently in the older, higher-comorbidity Medicare population than in\nthe younger commercially insured population); (b) measurement heterogeneity (a hospital-based\nsite captures inpatient drug administrations invisible in claims-only sites, producing a\ndifferent effective exposure definition despite identical code); or (c) ETL heterogeneity (the\nsame OMOP concept maps to different source codes at two sites because their ETLs were\nimplemented differently). 
Random-effects meta-analysis acknowledges between-study variance (τ²)\nrather than forcing all heterogeneity into sampling error; but even a random-effects pool is a\nweighted average of potentially incomparable estimates. Distinguishing biological from\nmeasurement and ETL heterogeneity requires cohort characterization output — covariate means by\nsite, concept-set coverage checks — not just the effect estimates themselves.\n\n**Reproducibility wins.**\nA versioned study package is the key advantage of the CDM-plus-distributed-code model over ad\nhoc harmonization. The same code, same concept-set definitions, and same analytic parameters\nrun at every site; the only difference is the underlying patient population and the local ETL.\nEvery site's output can be compared against the same specification. OHDSI study packages\nstored in versioned GitHub repositories and tagged to specific releases provide an audit trail\nfrom code to estimate that satisfies emerging regulatory expectations for RWE reproducibility.\nA literature-based meta-analysis combining studies that each used slightly different washout\nperiods, comparator definitions, and covariate sets cannot achieve this level of methodological\ncontrol.\n\n**Failure modes.**\nTwo failure patterns recur across networks regardless of architecture tier.\n\n*Semantic drift — same code, different meanings.* When a data partner's ETL maps a concept\nincorrectly — excluding drug salt forms that should be included, or mapping an ICD code to a\nSNOMED concept that excludes a clinically important subtype — the code runs without error but\nthe effective exposure or outcome definition silently differs from every other site. Semantic\ndrift is invisible from the coordinating center and surfaces, if at all, as an anomalous\nsite-specific estimate with no obvious data-quality flag. 
The defense is mandatory cohort\ncharacterization (OHDSI's DataQualityDashboard, concept-set coverage reports) before any\nestimate is interpreted.\n\n*Small-cell suppression breaking estimates.* Most data partners apply a minimum-cell-count rule\n(commonly n < 11) before sharing any aggregate, to prevent re-identification. In sparse strata\n— uncommon outcomes in a minority exposure arm — suppression can remove enough cells to make a\nstratum-adjusted estimate impossible or to bias surviving strata. Cell suppression is a privacy\nrequirement, not an analytic choice; the study design must pre-specify what happens when a site\ncannot contribute a stratum, and the coordinating center must report how many site-strata were\nsuppressed.\n\n**Pros, cons, and trade-offs.**\n*Vs centralized pooled individual-level analysis:* Federated analysis preserves patient privacy\nand partner autonomy, satisfies HIPAA and GDPR without a multi-party data-use agreement, and\noperates at network scale within weeks rather than years. The cost is a constraint on the\nestimator: only quantities that can be assembled from site-level aggregates or gradient\ncontributions are available. Standard survival, logistic, and Poisson regressions can be\nexpressed in distributed form; some flexible individual-level methods requiring cross-fitted\niteration are harder without full IPD. Prefer federated analysis whenever direct PHI-sharing\nagreements are unavailable, which is the practical default in nearly every real-world network.\n*Vs ad hoc harmonization across sites:* A CDM enforces a structural contract — table names,\ncolumn types, standardized vocabulary — that ensures the same code executes portably. Without\nthe CDM, each site must write its own extraction code from a protocol document, reintroducing\nthe between-site implementation divergence the network is designed to prevent. 
Ad hoc\nharmonization is appropriate only for a small number of sites where deep data knowledge and\nbilateral agreements make bespoke extraction preferable to full CDM investment.\n*Vs literature-based meta-analysis:* A federated network with one common protocol, one code\nbase, and one set of concept definitions removes most between-study methodological heterogeneity\nand prospectively avoids publication bias. A literature meta-analysis combines studies that each\nused slightly different eligibility criteria, washout definitions, and confounder sets; the\nnetwork's consistency is its central epistemic advantage.\n\n**When to use.** Use federated network analysis when: (a) the question requires a multi-database\ndesign (rare outcomes, external validity, regulatory mandate; see the multi-database entry);\n(b) site-level PHI cannot be pooled under available data-use agreements, which is the default\nassumption in US and European regulatory networks; (c) the study will run within a governed\nnetwork infrastructure (OHDSI, Sentinel, DARWIN EU) where CDM mapping and code-distribution\nmachinery already exist; or (d) reproducibility and transparency of the analytic code across\nsites are explicit regulatory or scientific requirements. 
The LEGEND and Sentinel models are the\nappropriate templates for large-scale pharmacoepidemiologic questions in these contexts.\n\n**When NOT to use.** Do not apply the federated label to an analysis that does not actually\noperate under the architecture it claims: if a study pools de-identified data under a shared\ndata-use agreement, it is a pooled analysis, with different governance and analytic properties.\nAvoid federated approaches when: (a) the required network infrastructure (CDM, code-distribution\npipeline, site QC process) does not exist and cannot be established within the study timeline,\nbecause ad hoc harmonization without a CDM reintroduces between-site divergence; (b) the estimator\nrequires individual-level iteration that cannot be faithfully approximated from aggregates, and\nthe analytic need is strong enough to justify the legal burden of a pooled DUA; or (c)\nsmall-cell suppression across sparse strata will remove so much data that the residual estimate\nis more biased than a well-designed single-site analysis with adequate power. Federated analysis\nis not a panacea: the governance advantage is real but the analytic constraints are real too.\n\n**Interpreting the output.** The worked example below yields a pooled log-HR of 0.12 from three\nsites (HRs of approximately 1.22, 1.11, and 1.00) combined by inverse-variance weighting.\n\n*Formal interpretation:* Under the fixed-effect assumption (that a single true log-hazard ratio\nunderlies all sites and observed variation is sampling error), the pooled inverse-variance\nweighted log-HR is 0.12. The variance of the pooled estimate equals 1 divided by the total\nweight (1/10 = 0.10), giving a standard error of sqrt(0.10) ≈ 0.316. A 95% confidence interval\non the log-HR scale spans approximately 0.12 ± 1.96 × 0.316, that is (-0.50, 0.74); on the HR\nscale the interval is approximately (0.61, 2.10). 
The HR is the ratio of the instantaneous event rate among those\nstill at risk in the treated group to that in the comparator group, averaged across sites\nin proportion to their precision. Under the repeated-sampling interpretation, a CI that excludes\n1.0 would reject the null at the 5% level; a CI that includes 1.0 is consistent with a range\nof effects including no effect. It does not mean the null is true.\n\n*Practical interpretation:* Across the three data partners, patients on the study drug had an\napproximately 13% higher event rate per unit of follow-up time compared to patients on the\ncomparator drug (pooled HR ≈ 1.13). No site's estimate favors the study drug (two HRs are above\n1.0 and one equals 1.00), which provides some reassurance that no single site's data anomaly is\ndriving the result. However, the pooled confidence interval includes 1.0, and the site-specific\nHRs range from 1.00 to 1.22; a decision-maker should ask whether\nthat spread represents genuine differences across patient populations or differences in how the\ndrug or outcome is captured at each site before acting on the pooled number. In a real network\nstudy, the random-effects tau-squared and I-squared would be reported alongside the fixed-effect\nsummary to flag whether the fixed-effect model's assumption of a single true HR is defensible.",
    "primary_category": "Study_Design",
    "tags": [
      "federated-learning",
      "distributed-network",
      "sentinel",
      "ohdsi",
      "legend",
      "darwin-eu",
      "ehden",
      "common-data-model",
      "privacy-preserving",
      "pharmacovigilance",
      "drug-safety-surveillance",
      "meta-analysis",
      "network-study"
    ],
    "applies_to_study_types": [
      "multi_database"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1056/NEJMp1809643",
        "url": "https://doi.org/10.1056/NEJMp1809643",
        "citation_text": "Platt R, Brown JS, Robb M, et al. The FDA Sentinel Initiative — An Evolving National Resource. New England Journal of Medicine. 2018;379(22):2091-2093.",
        "year": 2018,
        "authors_short": "Platt et al.",
        "notes": "Describes the evolution of the FDA Sentinel System from a pilot distributed-query pharmacovigilance network into a national resource spanning more than 500 million person-years of longitudinal data, operating without PHI leaving each data partner."
      },
      {
        "role": "explain",
        "doi": "10.3233/978-1-61499-564-7-574",
        "url": "https://doi.org/10.3233/978-1-61499-564-7-574",
        "citation_text": "Hripcsak G, Duke JD, Shah NH, et al. Observational Health Data Sciences and Informatics (OHDSI): Opportunities for Observational Researchers. Studies in Health Technology and Informatics. 2015;216:574-578.",
        "year": 2015,
        "authors_short": "Hripcsak et al.",
        "notes": "Foundational statement of the OHDSI collaborative's architecture: a global network of independently governed databases mapped to the OMOP CDM that share analytic code rather than patient data, enabling federated observational research at scale."
      },
      {
        "role": "demonstrate",
        "doi": "10.1016/S0140-6736(19)32317-7",
        "url": "https://doi.org/10.1016/S0140-6736(19)32317-7",
        "citation_text": "Suchard MA, Schuemie MJ, Krumholz HM, et al. Comprehensive comparative effectiveness and safety of first-line antihypertensive drug classes: a systematic, multinational, large-scale analysis. Lancet. 2019;394(10211):1816-1826.",
        "year": 2019,
        "authors_short": "Suchard et al.",
        "notes": "LEGEND antihypertensive analysis — the flagship OHDSI federated network study. A single versioned OHDSI study package ran across nine databases; site estimates were combined with empirical calibration using negative controls, demonstrating federated analysis producing large-scale, reproducible, bias-corrected comparative effectiveness evidence."
      }
    ],
    "plain_language_summary": "Federated network analysis is a way to study a drug or treatment across many hospitals or insurance databases simultaneously without ever moving patient records between them. Instead of combining all the data in one place, a single computer program travels to each database, runs locally on that site's own data, and returns only a small summary number — for example, how much faster or slower a particular health event occurred in one group of patients compared to another. Those site-level summaries are then combined mathematically, giving more influence to results from larger or more precise sites. This approach lets researchers draw on the scale of many data sources while respecting the privacy rules that prevent sharing patients' protected health information across organizational boundaries.",
    "key_terms": [
      {
        "term": "federated analysis",
        "definition": "An analytic approach where the computer code travels to the data at each site rather than the data being moved to a central server, so individual patient records never cross institutional boundaries."
      },
      {
        "term": "inverse-variance weighting",
        "definition": "A method of combining estimates from multiple sites that gives more influence to results from sites with smaller uncertainty, producing a weighted average that reflects precision."
      },
      {
        "term": "common data model (CDM)",
        "definition": "A shared blueprint that specifies how every participating database should name its tables and columns so that the same analysis code runs identically at every site without custom rewriting."
      },
      {
        "term": "negative controls",
        "definition": "Drug-outcome pairs that researchers know have no real causal relationship, used to test whether an analysis method is producing spurious results due to bias rather than true effects."
      },
      {
        "term": "cell suppression",
        "definition": "The practice of withholding any count smaller than a minimum threshold (commonly fewer than 11 patients) before sharing aggregate results, to prevent re-identification of individuals from small groups."
      },
      {
        "term": "study package",
        "definition": "A versioned bundle of code, variable definitions, and parameter files that is distributed to every site so each one runs exactly the same analysis on its own local data."
      }
    ],
    "worked_example": {
      "scenario": "A research team wants to know whether a new anticoagulant drug is associated with a higher rate of major bleeding events compared to a standard anticoagulant. They run a harmonized study package across three data partners — a commercial claims database, a Medicare claims extract, and a large integrated hospital network. Each site fits a Cox proportional hazards model on its own data and returns only the log hazard ratio and a measure of how precise that estimate is (the inverse-variance weight, which equals 1 divided by the squared standard error). The coordinating center then combines the three site estimates using inverse-variance weighting to produce a single pooled answer.",
      "dataset": {
        "caption": "Summary table returned by each data partner. log_HR is each site's estimate of the log hazard ratio for the new drug vs the standard drug; W is the inverse-variance weight (larger W means the site's estimate is more precise). No patient-level rows are shared.",
        "columns": [
          "site",
          "log_HR",
          "W"
        ],
        "rows": [
          [
            "Site A (commercial claims)",
            0.2,
            4
          ],
          [
            "Site B (Medicare FFS)",
            0.1,
            4
          ],
          [
            "Site C (integrated-delivery EHR)",
            0.0,
            2
          ]
        ]
      },
      "steps": [
        "Step 1: Sum the inverse-variance weights across all three sites: 4 + 4 + 2 = 10",
        "Step 2: Multiply each site log-HR by its weight. Site A contribution: 4 * 0.2 = 0.8",
        "Step 3: Site B contribution: 4 * 0.1 = 0.4",
        "Step 4: Site C contribution: 2 * 0.0 = 0.0",
        "Step 5: Add the weighted contributions: 0.8 + 0.4 + 0.0 = 1.2",
        "Step 6: Divide by the total weight to get the pooled log-HR: 1.2 / 10 = 0.12",
        "Step 7: Convert the pooled log-HR to a hazard ratio. exp(0.12) is approximately 1.13, meaning patients on the new drug had about 13% higher event rate per unit of follow-up time than those on the standard drug. The standard error of the pooled estimate is sqrt(1/10), approximately 0.316, and the full 95% confidence interval should be computed before drawing any conclusions."
      ],
      "result": "Pooled log-HR = 1.2 / 10 = 0.12; pooled HR is approximately 1.13. Across all three data partners, patients on the new anticoagulant experienced approximately 13% higher rate of major bleeding per unit of follow-up time compared with the standard drug. The three site-specific estimates (log-HRs of 0.2, 0.1, and 0.0) point in the same direction, which is reassuring, but their spread suggests further investigation of whether the difference reflects genuine population heterogeneity or differences in how bleeding events are captured at each site."
    },
    "prerequisites": [
      "multi-database",
      "meta-analysis-obs",
      "omop-cdm-method-patterns-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "CDM-based distributed code execution (OHDSI / Sentinel model)",
        "description": "Each site maps its data to a shared CDM (OMOP or Sentinel SCDM) and runs byte-identical code distributed as a versioned study package. Only aggregate tables (event counts, person-time, PS-stratum summaries) return to the coordinating center. Maximizes analytic reproducibility and allows the same package to execute across dozens of sites with minimal per-site customization. This is the dominant model for OHDSI network studies and FDA Sentinel queries.",
        "edge_cases": [
          "A CDM enforces table schema but not data capture quality; a concept that is structurally present can be empty or differentially recorded. DataQualityDashboard checks are essential before interpreting any cross-site result.",
          "Vocabulary mapping completeness varies by site ETL; audit concept-set coverage (proportion of source codes captured by the standard concept set) at each partner before trusting cross-site comparability."
        ],
        "data_source_notes": "Claims: DRUG_EXPOSURE / CONDITION_OCCURRENCE populated from adjudicated claims; EHR: same tables but from orders and administrations — semantically different despite identical schema. Concept-set coverage may differ because source-code mappings were applied differently."
      },
      {
        "name": "Meta-analysis of site-specific model estimates",
        "description": "Each site fits its own regression model (Cox, Poisson, logistic) locally and returns only the estimated log coefficient and its standard error. The coordinating center combines with inverse-variance-weighted fixed-effect or DerSimonian-Laird random-effects meta-analysis. Requires no shared CDM — the protocol specifies the model and variable definitions, sites implement locally. Auditable and transparent; the main risk is that undetected between-site differences in variable implementation are invisible in the returned estimates.",
        "edge_cases": [
          "Sites may use different baseline covariate sets if the CDM is absent; the pooled estimate then combines estimates from non-comparable models.",
          "Random-effects tau-squared inflates when heterogeneity is measurement-driven rather than biological; always investigate I² > 25% before publishing a pooled summary."
        ],
        "data_source_notes": "Heterogeneous source systems (claims-only vs claims+EHR vs registry) may produce non-comparable estimates even when the protocol is identical; report forest plots and heterogeneity statistics, and investigate any site-specific outlier."
      },
      {
        "name": "Federated gradient or score-function sharing",
        "description": "Sites iteratively share gradient contributions or Cox score functions with a coordinating center; the center aggregates and returns updated model parameters. Achieves near-exact equivalence to pooled IPD analysis without sharing row-level data. Requires substantially more infrastructure (iterative communication rounds, secure aggregation) and is currently used primarily in research settings for federated machine-learning and distributed Cox regression.",
        "edge_cases": [
          "Communication overhead scales with model complexity and number of iterations; large models or many sites require efficient aggregation infrastructure.",
          "Differential privacy noise added to protect gradients can degrade model accuracy; the privacy-utility tradeoff must be calibrated per study."
        ],
        "data_source_notes": "Most applicable when sites have sufficient compute and network infrastructure for iterative communication; less suited to Sentinel-style queries where sites return only static counts."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Centralized pooled individual-level analysis (IPD)",
        "pros_of_this": "Preserves patient privacy and partner autonomy; satisfies HIPAA and GDPR without multi-party PHI-transfer agreements; operationally feasible at network scale within months rather than years.",
        "cons_of_this": "Constrained to estimators expressible from site-level aggregates or gradients; some flexible individual-level methods requiring cross-fitted iteration require full IPD.",
        "when_to_prefer": "Whenever PHI-pooling agreements are unavailable — the practical default in US and European regulatory networks."
      },
      {
        "compared_to": "Ad hoc harmonization without a CDM",
        "pros_of_this": "A CDM enforces a structural contract ensuring the same code runs identically at every site; eliminates between-site implementation divergence in table names, variable definitions, and vocabulary.",
        "cons_of_this": "Substantial up-front ETL investment per site; ETL quality varies and must be audited.",
        "when_to_prefer": "When more than a handful of sites participate and the study will be replicated or updated; the CDM investment pays off rapidly across multiple studies."
      },
      {
        "compared_to": "Literature-based meta-analysis of published observational studies",
        "pros_of_this": "One prospectively harmonized protocol, code base, and concept-set definition eliminates between-study methodological heterogeneity and prospectively prevents publication bias.",
        "cons_of_this": "Requires a live governed network rather than a desk review; takes more operational effort than searching existing literature.",
        "when_to_prefer": "Whenever you control the analysis across contributing data sources; fall back to published meta-analysis only when network access is impossible."
      }
    ],
    "implementation_notes_by_data_source": {},
    "implementations": [
      {
        "lang": "python",
        "code": "import math\nfrom typing import Sequence\n\n\ndef ivw_fixed(\n    log_hrs: Sequence[float],\n    weights: Sequence[float],\n) -> dict:\n    \"\"\"\n    Inverse-variance weighted fixed-effect pool of site log-HR estimates.\n\n    Parameters\n    ----------\n    log_hrs : log hazard ratio from each site's local Cox / Poisson model\n    weights : inverse-variance weight per site  (W_i = 1 / SE_i ** 2)\n\n    Returns\n    -------\n    dict: pooled_log_hr, pooled_hr, pooled_se, ci_lo_hr, ci_hi_hr\n    \"\"\"\n    W_total = sum(weights)\n    pooled_log_hr = sum(w * lh for w, lh in zip(weights, log_hrs)) / W_total\n    pooled_var = 1.0 / W_total\n    pooled_se = math.sqrt(pooled_var)\n    z = 1.96\n    return {\n        \"pooled_log_hr\": pooled_log_hr,\n        \"pooled_hr\": math.exp(pooled_log_hr),\n        \"pooled_se\": pooled_se,\n        \"ci_lo_hr\": math.exp(pooled_log_hr - z * pooled_se),\n        \"ci_hi_hr\": math.exp(pooled_log_hr + z * pooled_se),\n    }\n\n\ndef ivw_random_dl(\n    log_hrs: Sequence[float],\n    weights: Sequence[float],\n) -> dict:\n    \"\"\"\n    DerSimonian-Laird random-effects meta-analysis with heterogeneity statistics.\n\n    Returns\n    -------\n    dict: pooled_log_hr, pooled_hr, pooled_se, ci_lo_hr, ci_hi_hr, tau2, I2_pct, Q\n    \"\"\"\n    k = len(log_hrs)\n    if k < 2:\n        # With a single site the DL denominator below is zero and tau2 is undefined.\n        raise ValueError(\"DerSimonian-Laird pooling needs at least two site estimates\")\n    W = list(weights)\n    W_total = sum(W)\n    theta_fe = sum(w * lh for w, lh in zip(W, log_hrs)) / W_total\n\n    # Cochran Q\n    Q = sum(w * (lh - theta_fe) ** 2 for w, lh in zip(W, log_hrs))\n    W2_total = sum(w ** 2 for w in W)\n\n    # DL tau-squared (floored at zero)\n    tau2 = max(0.0, (Q - (k - 1)) / (W_total - W2_total / W_total))\n\n    # Random-effects weights and pooled estimate\n    W_re = [1.0 / (1.0 / w + tau2) for w in W]\n    W_re_total = sum(W_re)\n    theta_re = sum(w * lh for w, lh in zip(W_re, log_hrs)) / W_re_total\n    se_re = math.sqrt(1.0 / W_re_total)\n    I2 = max(0.0, 100.0 * (Q - (k - 1)) / Q) if Q > 0 else 0.0\n    z = 
1.96\n    return {\n        \"pooled_log_hr\": theta_re,\n        \"pooled_hr\": math.exp(theta_re),\n        \"pooled_se\": se_re,\n        \"ci_lo_hr\": math.exp(theta_re - z * se_re),\n        \"ci_hi_hr\": math.exp(theta_re + z * se_re),\n        \"tau2\": tau2,\n        \"I2_pct\": I2,\n        \"Q\": Q,\n    }\n\n\nif __name__ == \"__main__\":\n    # Worked example: 3-site network study (weights 4, 4, 2; log-HRs 0.2, 0.1, 0.0)\n    log_hrs = [0.2, 0.1, 0.0]\n    weights = [4, 4, 2]\n\n    fe = ivw_fixed(log_hrs, weights)\n    print(f\"Fixed-effect pooled log-HR : {fe['pooled_log_hr']:.4f}\")\n    print(f\"Fixed-effect pooled HR     : {fe['pooled_hr']:.4f}\")\n    print(f\"95% CI for HR              : ({fe['ci_lo_hr']:.4f}, {fe['ci_hi_hr']:.4f})\")\n\n    re = ivw_random_dl(log_hrs, weights)\n    print(f\"\\nRE (DL) pooled HR : {re['pooled_hr']:.4f}\")\n    print(f\"I-squared         : {re['I2_pct']:.1f}%\")\n    print(f\"tau-squared       : {re['tau2']:.4f}\")",
        "description": "Inverse-variance weighted fixed-effect and DerSimonian-Laird random-effects meta-analysis\nof site-level log hazard ratio estimates from a federated network study. Input: list of\nlog-HR values and corresponding inverse-variance weights (W_i = 1 / SE_i^2) returned by\neach data partner. No patient-level data are involved — only the small summary vectors that\ncrossed the firewall.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(metafor)\n\n# Site-level estimates returned by each data partner.\n# W_i = 1/SE_i^2  =>  SE_i = 1/sqrt(W_i)\nsites <- data.frame(\n  site   = c(\"Site A (commercial claims)\",\n             \"Site B (Medicare FFS)\",\n             \"Site C (integrated-delivery EHR)\"),\n  log_hr = c(0.2, 0.1, 0.0),\n  weight = c(4,   4,   2)\n)\nsites$se <- 1 / sqrt(sites$weight)   # recover SE from inverse-variance weight\n\n# ------ Fixed-effect meta-analysis ------\nm_fe <- rma(yi = log_hr, sei = se, data = sites, method = \"FE\")\ncat(sprintf(\n  \"Fixed-effect pooled HR: %.4f  (95%% CI %.4f-%.4f)\\n\",\n  exp(coef(m_fe)), exp(m_fe$ci.lb), exp(m_fe$ci.ub)\n))\n\n# ------ Random-effects (DerSimonian-Laird) ------\nm_re <- rma(yi = log_hr, sei = se, data = sites, method = \"DL\")\ncat(sprintf(\n  \"RE pooled HR: %.4f  I^2 = %.1f%%  tau^2 = %.4f  Q p-value = %.3f\\n\",\n  exp(coef(m_re)), m_re$I2, m_re$tau2, m_re$QEp\n))\n\n# ------ Forest plot (HR scale) ------\nforest(\n  m_re,\n  slab      = sites$site,\n  transf    = exp,\n  xlab      = \"Hazard Ratio (new drug vs standard)\",\n  header    = c(\"Data Partner\", \"HR [95% CI]\"),\n  refline   = 1,\n  main      = \"Federated network meta-analysis — bleeding risk\"\n)\n# Large I^2 (> 25%) would prompt investigation of measurement or ETL heterogeneity\n# before releasing the pooled estimate; small I^2 supports the fixed-effect model.",
        "description": "Fixed-effect and DerSimonian-Laird random-effects meta-analysis using the metafor package,\napplied to site-level log hazard ratios from a federated network study. Input data are the\nper-site log-HR and inverse-variance weight (W_i = 1/SE_i^2). A forest plot summarises\nsite and pooled estimates. No patient-level data are involved.",
        "dependencies": [
          "metafor"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* Input: site-level log-HR and inverse-variance weight from each data partner */\ndata site_estimates;\n  input site $ 1-32 log_hr weight;\n  se = 1 / sqrt(weight);   /* recover SE from W_i = 1/SE_i^2 */\n  label log_hr=\"Log hazard ratio\" weight=\"Inverse-variance weight (W=1/SE^2)\";\n  datalines;\nSiteA_commercial_claims          0.2  4\nSiteB_Medicare_FFS               0.1  4\nSiteC_integrated_delivery_EHR    0.0  2\n;\nrun;\n\nproc iml;\n  use site_estimates;\n  read all var {log_hr weight se} into data;   /* columns: log_hr, weight, se */\n  log_hr = data[,1];\n  weight = data[,2];\n  se_v   = data[,3];\n  k      = nrow(log_hr);\n\n  /* --- Fixed-effect inverse-variance weighted pool --- */\n  W_total    = weight[+];               /* 4 + 4 + 2 = 10               */\n  pooled_lhr = (weight # log_hr)[+] / W_total;  /* 1.2 / 10 = 0.12    */\n  pooled_var = 1 / W_total;\n  pooled_se  = sqrt(pooled_var);\n  z95        = quantile(\"Normal\", 0.975);\n  ci_lo_lhr  = pooled_lhr - z95 * pooled_se;\n  ci_hi_lhr  = pooled_lhr + z95 * pooled_se;\n\n  /* --- Cochran Q and I-squared --- */\n  Q    = (weight # (log_hr - pooled_lhr)##2)[+];\n  df_Q = k - 1;\n  p_Q  = 1 - probchi(Q, df_Q);\n  if Q > 0 then I2 = max(0, 100 * (Q - df_Q) / Q);   /* guard against Q = 0 */\n  else I2 = 0;\n\n  /* --- DerSimonian-Laird tau-squared --- */\n  W2_sum = (weight##2)[+];\n  tau2   = max(0, (Q - df_Q) / (W_total - W2_sum / W_total));\n\n  /* --- Random-effects weights and pooled estimate --- */\n  W_re       = 1 / (1/weight + tau2);\n  W_re_total = W_re[+];\n  theta_re   = (W_re # log_hr)[+] / W_re_total;\n  se_re      = sqrt(1 / W_re_total);\n  ci_lo_re   = theta_re - z95 * se_re;\n  ci_hi_re   = theta_re + z95 * se_re;\n\n  /* --- Print results --- */\n  print \"=== Fixed-effect result ===\";\n  print pooled_lhr[label=\"Pooled log-HR (FE)\"];\n  print (exp(pooled_lhr))[label=\"Pooled HR (FE)\"];\n  print (exp(ci_lo_lhr) || exp(ci_hi_lhr))[label=\"95% CI for HR (FE)\"];\n\n  print \"=== Heterogeneity ===\";\n  print 
Q[label=\"Cochran Q\"];  print p_Q[label=\"p-value for Q\"];\n  print I2[label=\"I-squared (%)\"];  print tau2[label=\"tau-squared (DL)\"];\n\n  print \"=== Random-effects result (DL) ===\";\n  print theta_re[label=\"Pooled log-HR (RE)\"];\n  print (exp(theta_re))[label=\"Pooled HR (RE)\"];\n  print (exp(ci_lo_re) || exp(ci_hi_re))[label=\"95% CI for HR (RE)\"];\nquit;",
        "description": "SAS/IML implementation of inverse-variance weighted fixed-effect meta-analysis and\nDerSimonian-Laird random-effects heterogeneity statistics, applied to site-level log\nhazard ratios from a federated network study. A DATA step creates the input table; PROC IML\nperforms all matrix operations and prints the pooled HR with 95% CI and Cochran Q / I-squared.\nNo patient-level data are involved — only the small summary table that each site returned.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Pkg[Coordinating center distributes<br/>versioned study package<br/>code + concept sets + parameters]\n  Pkg --> S1[Site A — commercial claims<br/>OMOP CDM or SCDM]\n  Pkg --> S2[Site B — Medicare FFS<br/>OMOP CDM or SCDM]\n  Pkg --> S3[Site C — integrated EHR<br/>OMOP CDM or SCDM]\n  S1 --> R1[Run package locally<br/>fit site Cox model<br/>return log-HR + SE]\n  S2 --> R2[Run package locally<br/>fit site Cox model<br/>return log-HR + SE]\n  S3 --> R3[Run package locally<br/>fit site Cox model<br/>return log-HR + SE]\n  R1 -->|aggregates only<br/>no PHI| CC[Coordinating center]\n  R2 -->|aggregates only<br/>no PHI| CC\n  R3 -->|aggregates only<br/>no PHI| CC\n  CC --> IVW[Inverse-variance weighted pool<br/>fixed-effect or random-effects DL]\n  IVW --> Het{I-squared large?}\n  Het -->|Yes| Inv[Investigate measurement<br/>ETL or population heterogeneity]\n  Het -->|No| Out[Forest plot + pooled HR + 95% CI]",
        "caption": "Federated network analysis workflow. A versioned study package travels to each data partner; patient records never leave the site. Only log-HR and SE aggregates cross the firewall. Between-site heterogeneity is investigated before any pooled estimate is communicated.",
        "alt_text": "Flowchart from coordinating center distributing a study package to three data partners, each running the package locally and returning only log-HR and SE, then the center performing inverse-variance weighted pooling with a heterogeneity check.",
        "source_type": "illustrative",
        "source_citations": [
          "10.1056/NEJMp1809643"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  subgraph Spectrum[Federation spectrum — ordered by analytic power]\n    P1[\"(1) Common protocol<br/>site-run analyses<br/>share published estimates\"]\n    P2[\"(2) CDM + distributed code<br/>OHDSI / Sentinel model<br/>share aggregate tables\"]\n    P3[\"(3) Meta-analysis of<br/>site model estimates<br/>share log-HR + SE\"]\n    P4[\"(4) Gradient / score sharing<br/>federated estimation<br/>iterative rounds\"]\n    P1 --> P2 --> P3 --> P4\n  end\n  P1 -.->|lower infrastructure<br/>higher heterogeneity risk| P4\n  P4 -.->|higher infrastructure<br/>closest to pooled IPD| P1",
        "caption": "The federation spectrum from simple common-protocol studies to true federated estimation. Each tier trades implementation complexity for closer approximation of a pooled IPD analysis.",
        "alt_text": "Horizontal flowchart showing four tiers of federated analysis from common protocol to CDM-plus-distributed-code to site-model-estimate meta-analysis to gradient sharing, annotated with increasing analytic power and infrastructure requirements.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "requires",
        "target_slug": "multi-database",
        "notes": "Federated analysis is the architectural mechanism that operationalizes a multi-database study design when patient-level data cannot leave each site; the multi-database entry covers the study-design layer this entry implements."
      },
      {
        "relation_type": "used_with",
        "target_slug": "omop-cdm-method-patterns-rwe",
        "notes": "The OMOP CDM is the enabling substrate for the CDM-plus-distributed-code tier of federated analysis; vocabulary standardization is what makes concept sets portable across sites."
      },
      {
        "relation_type": "used_with",
        "target_slug": "meta-analysis-obs",
        "notes": "Site-level effect estimates from a federated network study are combined using inverse-variance weighted fixed-effect or random-effects meta-analysis; between-site heterogeneity statistics (I-squared, tau-squared, forest plot) are a mandatory output."
      },
      {
        "relation_type": "see_also",
        "target_slug": "empirical-calibration-negative-controls-rwe",
        "notes": "LEGEND and OHDSI network studies use negative controls — drug-outcome pairs with no true causal effect — at each site to detect and correct for systematic bias in the federated estimates before meta-analytic pooling."
      },
      {
        "relation_type": "see_also",
        "target_slug": "tokenization-privacy-preserving-record-linkage-rwe",
        "notes": "Tokenization is the complementary privacy-preserving technology for the linkage side of the data governance problem; federated analysis solves the pooling constraint while tokenization enables within-site or cross-custodian record linkage without transmitting clear-text identifiers."
      }
    ],
    "aliases": [
      "federated analysis",
      "distributed network analysis",
      "distributed data network",
      "federated network study",
      "distributed research network analysis",
      "federated learning healthcare"
    ],
    "complexity": "advanced",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "fhir-interoperability-rwe",
    "name": "FHIR and Healthcare Interoperability for RWE",
    "short_definition": "HL7 FHIR (Fast Healthcare Interoperability Resources) is the REST/JSON standard for exchanging clinical data between EHRs, payers, and research systems — not an analysis format but an exchange protocol whose mandatory adoption under US federal interoperability rules (21st Century Cures Act, CMS Interoperability Rule) now makes it a practical on-ramp for prospective RWE collection, patient-mediated data access, payer claims feeds, and EHR-to-research pipelines via bulk FHIR $export.",
    "long_description": "**What FHIR is — and is not**\n\nHL7 FHIR (Fast Healthcare Interoperability Resources, pronounced \"fire\") is a standard published by\nHealth Level Seven International (HL7) that defines how clinical data are *exchanged* over modern web\nAPIs. FHIR structures clinical information as self-contained JSON or XML documents called **resources**\n— Patient, Condition, MedicationRequest, Observation, DiagnosticReport, Encounter, and roughly 150\nothers — each with a fixed schema and each retrievable via standard HTTP calls (GET /fhir/r4/Patient/P001).\nFHIR is *not* an analysis model: it makes no promises about completeness, it allows many fields to be\noptional, and the same clinical concept can appear in different resource types depending on how a site\nconfigured its system. Its job is exchange; what the analyst does with the data after receiving them is\na separate problem that requires an ETL step and an analytic data model.\n\n**Core FHIR resources for RWE**\n\nFour resource types cover most RWE-relevant clinical data:\n\n- **Patient** — demographics: birth date, sex, address (useful for area-level SES linkage), identifiers.\n- **Condition** — diagnoses coded in ICD-10-CM or SNOMED CT, with a `clinicalStatus`, a\n  `verificationStatus` (confirmed vs differential vs refuted), and an onset date. Not every recorded\n  condition is a confirmed clinical finding — an analyst must filter on status fields and validate\n  phenotype definitions against a reference standard before treating Condition resources as case-defining.\n- **MedicationRequest** — prescription orders including the RxNorm-coded drug concept, the `authoredOn`\n  date, and (in the `dispenseRequest` block) the intended supply duration. This is an order, not proof\n  of dispensing; confirming a fill still requires a MedicationDispense resource or a linked pharmacy\n  claim. 
Using `authoredOn` as a fill date is a common and often necessary approximation.\n- **Observation** — laboratory results (LOINC codes), vitals, smoking status, and social history. A\n  single patient's labs return as a bundle of Observation resources, one per LOINC code per encounter\n  date, each with a `valueQuantity` or `valueCodeableConcept` and a `status` field. Keep only\n  Observations whose `status` is `final` or `amended` before trusting the value.\n\nA minimal MedicationRequest resource illustrating the key fields an exposure pipeline must parse:\n\n```json\n{\n  \"resourceType\": \"MedicationRequest\",\n  \"id\": \"med-001\",\n  \"status\": \"active\",\n  \"intent\": \"order\",\n  \"medicationCodeableConcept\": {\n    \"coding\": [{ \"system\": \"http://www.nlm.nih.gov/research/umls/rxnorm\",\n                 \"code\": \"860975\",\n                 \"display\": \"metformin 500 mg oral tablet\" }]\n  },\n  \"subject\": { \"reference\": \"Patient/P001\" },\n  \"authoredOn\": \"2023-01-05\",\n  \"dispenseRequest\": { \"expectedSupplyDuration\": { \"value\": 30, \"unit\": \"days\" } }\n}\n```\n\n**Regulatory tailwind: why FHIR is now mandatory in the US**\n\nTwo US regulations made FHIR APIs a legal requirement and created the infrastructure RWE can tap:\n\n1. *21st Century Cures Act (2016) and ONC Interoperability Rule (2020)*: EHR developers must expose\n   patient data via a standardized FHIR R4 API. The **US Core Implementation Guide** specifies which\n   USCDI (United States Core Data for Interoperability) data classes must be supported — medications,\n   conditions, allergies, lab results, vitals, demographics, and care-team members. As of 2023, all\n   ONC-certified EHRs must meet US Core v3.1.1 or later.\n2. *CMS Interoperability and Patient Access Rule (2020, effective 2021)*: Medicare Advantage, Medicaid,\n   CHIP, and qualified health plan issuers on the federally facilitated exchanges must expose member\n   claims data via a FHIR R4 API; CMS points to the **CARIN Blue Button** implementation guide\n   (CARIN Consumer Directed Payer Data Exchange) as the recommended format for the claims payload. 
This is the first mandatory, standardized payer-side\n   claims API — giving patients and authorized apps machine-readable access to their own payer data\n   including drug fills, inpatient encounters, and outpatient services.\n\nTogether these rules mean that by 2024, nearly every US EHR and major payer exposes FHIR APIs,\ncreating a standardized infrastructure that did not exist before 2021. For RWE, this matters because\nthe data-access negotiation no longer has to start from scratch at each site — the API exists, the\nauthentication model (SMART on FHIR OAuth 2.0) is standardized, and the data structure is specified.\n\n**Why FHIR matters to an RWE analyst today**\n\nFHIR enters RWE practice through four channels:\n\n1. *Patient-mediated data access via SMART on FHIR*: Patients can authorize a research app (using the\n   SMART on FHIR OAuth 2.0 flow) to pull their own EHR or payer data and donate it to a study. This\n   enables hybrid prospective-retrospective designs — participants share their existing records at\n   enrollment without requiring a site-level data-use agreement for each health system the patient\n   has visited.\n2. *Payer claims via CARIN Blue Button*: A patient can authorize a research or care-management app to\n   retrieve their claims history directly from their insurer. The feed is structurally similar to a\n   traditional claims extract — encounters, drug fills, diagnoses as ExplanationOfBenefit resources —\n   but patient-authorized rather than requiring a payer contract.\n3. *EHR bulk data ($export)*: FHIR's asynchronous bulk-export operation returns population-level data\n   as flat **NDJSON** (newline-delimited JSON) files, one file per resource type, with one resource per\n   line. A researcher granted bulk authorization can download a cohort's MedicationRequest, Condition,\n   and Observation files and parse them directly into analytic tables — closer to a traditional database\n   extract than to patient-by-patient API polling.\n4. 
*Prospective and decentralized trials*: FHIR-native electronic data capture (EDC) platforms use\n   MedicationRequest and Observation resources to auto-populate case report forms from EHR data,\n   reducing transcription error and enabling near-real-time safety monitoring in decentralized or\n   virtual study designs.\n\n**FHIR vs OMOP — the honest comparison**\n\nFHIR and OMOP address different problems and are complementary, not competing. FHIR is an **exchange\nstandard**: it defines how data travel from system A to system B in a format each system already speaks.\nOMOP is an **analysis model**: it defines how data are reorganized into a person-centric,\nvocabulary-standardized schema optimized for running the same SQL query across dozens of databases.\n\nThe typical research pipeline is linear:\n\n`Source EHR/Payer → FHIR API → ETL → OMOP CDM (or flat analytic tables) → Analysis`\n\nThis means most RWE analysts working on retrospective cohort studies never interact with FHIR directly.\nTheir data have already been ETL'd into OMOP or a similar analytic CDM by the time they write their\nfirst query. FHIR becomes analyst-visible when: (a) the team is building a prospective or hybrid\ncollection pipeline, (b) data are being ingested via patient-mediated donation, or (c) the study uses\ntokenized linkage where FHIR is the transport for de-identified record exchange.\n\nFHIR-to-OMOP mapping follows a domain-routing logic parallel to native OMOP ETL: MedicationRequest\nmaps to DRUG_EXPOSURE, Condition to CONDITION_OCCURRENCE, Observation to MEASUREMENT or OBSERVATION.\nCoding systems carried in FHIR line up with OMOP vocabularies: RxNorm (drugs) and LOINC (labs) are\nOMOP standard vocabularies, while ICD-10-CM (conditions) is a source vocabulary that the ETL maps to\nstandard SNOMED CT concepts, so the vocabulary translation step is straightforward when codes are\npresent. 
The complication is that FHIR permits multiple coding systems in a single resource\nand the ETL must apply a priority ordering to select the standard concept when multiple codings compete.\n\n**The optionality problem: \"FHIR-compatible\" does not mean \"analyzable\"**\n\nUS Core profiles narrow FHIR's optionality but do not eliminate it. A MedicationRequest is valid\nFHIR R4 without a `dispenseRequest.expectedSupplyDuration` value, the field from which an analytic\ndays-supply column is derived. A Condition resource is valid without an onset date. Two APIs\ncan both be US Core compliant while returning very different field population rates for the same data\nelement. Receiving a FHIR feed does not guarantee analytic-quality data — the same data-completeness\nchallenges that plague claims and EHR data persist inside FHIR, wrapped in JSON. The analytic validity\nof a FHIR-fed pipeline depends on the field population rates of the specific source system and must be\nevaluated empirically (field-level missingness reports, value-set conformance checks) before any\nanalytic conclusions are drawn.\n\n**Pros, cons, and trade-offs**\n\n*Pros of FHIR for RWE*: Mandatory regulatory adoption creates a standardized, legally guaranteed\naccess pathway that did not exist before 2021 — especially for patient-mediated and payer-side data.\nREST/JSON is the lingua franca of modern software, lowering the technical barrier versus bespoke\ndatabase extracts or HL7 v2 feeds. SMART on FHIR OAuth enables patient consent and data donation at\nscale without a site-level DUA for every health system. Bulk $export makes population-level extraction\nfeasible without per-patient API looping. FHIR coding (RxNorm, SNOMED, LOINC) maps cleanly into\nOMOP vocabularies, reducing the vocabulary harmonization burden in ETL.\n\n*Cons and limitations*: FHIR is a point-of-exchange format, not an analysis model — data arrive\nheterogeneous and require explicit ETL into an analytic structure. 
The optionality of FHIR profiles\nmeans \"FHIR-available\" is not \"analysis-ready.\" Bulk $export requires group-level authorization and a\ndata-use agreement, just like any secondary data source. Access to bulk operations is not universal —\nmany EHR FHIR implementations support only single-patient queries authorized through SMART App Launch,\nnot group or bulk export. For the majority of retrospective RWE analysts, FHIR is entirely invisible\nplumbing.\n\n**When to use**\n\nUse FHIR-based data pipelines when: (a) building a prospective or hybrid prospective-retrospective\nstudy that will collect EHR or payer data at participant enrollment; (b) designing a patient-mediated\ndata-sharing protocol (SMART on FHIR, Apple Health Records donation); (c) consuming payer claims via\nCARIN Blue Button for a patient-level cohort study; (d) extracting a research cohort from an EHR\nsystem that exposes a bulk $export endpoint; (e) building a tokenized linkage pipeline that uses FHIR\nas the transport for de-identified feeds; (f) evaluating data feasibility from a source with a FHIR\nAPI before a full DUA is negotiated. For prospective registry-style collection, FHIR-native EDC\nreduces transcription burden and enables automated safety signal extraction.\n\n**When NOT to use — and common misconceptions**\n\n- *\"We're using FHIR so the data-quality problem is solved.\"* FHIR defines structure, not completeness.\n  A FHIR-compliant feed can have missing supply durations, absent onset dates, and inconsistent lab\n  units. Field-population checks are mandatory before analysis.\n- *\"I can run my OMOP analysis directly on FHIR resources.\"* FHIR and OMOP have different schemas,\n  coding conventions, and temporal structures. An explicit ETL is required; do not skip it.\n- *\"Patient-authorized FHIR data are complete for drug exposure.\"* A patient's payer FHIR feed covers\n  only that insurer's claims — out-of-pocket fills, OTC drugs, and care under another plan are\n  invisible. 
The completeness ceiling is the enrollment scope of the authorizing payer.\n- *\"All FHIR APIs support bulk $export.\"* Bulk export is a capability layered on FHIR, not guaranteed\n  by FHIR R4 conformance alone. Many EHR FHIR APIs support only patient-level access through SMART App\n  Launch (one patient at a time); bulk access requires separate negotiation.\n- *Using repeated individual patient-level API calls as a population extraction strategy*: polling\n  thousands of patient records one at a time creates rate-limit pressure, audit-log burdens, and\n  re-identification risk. Prefer bulk $export with appropriate authorization for population-scale work.\n\n**Interpreting the output**\n\nIn the worked example, parsing three FHIR MedicationRequest resources for patient P001 yields three\nrows in an analytic exposure table — one row per resource — each carrying the RxNorm code (860975),\nthe authoredOn prescription date used as fill_date, and the days_supply extracted from\n`dispenseRequest.expectedSupplyDuration.value`. Total days_supply across three rows = 90.\n\n*(1) Formal interpretation.* Each analytic row is a 1:1 extract of one FHIR MedicationRequest\nresource: no aggregation, no period-collapsing, and no assumption about whether the drug was actually\ndispensed or taken. The `authoredOn` field is the prescriber order date — the closest proxy to a\npharmacy fill date available in the MedicationRequest resource but not equivalent to a dispensing\ndate. The days_supply sum of 90 represents cumulative *intended* treatment duration across three\nsequential orders; gaps between orders (visible in the date sequence) are not yet evaluated. 
Adherence\ncalculation requires a downstream analytic step (PDC or MPR) that applies a treatment window and\ndenominator.\n\n*(2) Practical interpretation.* When a data engineer delivers three exposure rows with the same RxNorm\ncode on dates roughly 30 days apart, the analyst should verify before proceeding: (a) that `days_supply`\nwas actually populated in all rows — it is optional in FHIR and the pipeline must document what happens\nwhen it is null; (b) that `intent` equals \"order\" not \"proposal\" or \"plan\"; (c) whether a\ncorresponding MedicationDispense or pharmacy claim confirming the fill is available. These verifications\nare the FHIR-pipeline analog of source-data audit and should be specified in the statistical analysis\nplan before the pipeline is built, not discovered during analysis.",
    "primary_category": "Data_Standard",
    "tags": [
      "fhir",
      "interoperability",
      "data-standard",
      "ehr",
      "claims",
      "rest-api",
      "smart-on-fhir",
      "bulk-export",
      "ndjson",
      "omop",
      "rxnorm",
      "us-core",
      "cures-act",
      "decentralized-trials",
      "hl7"
    ],
    "applies_to_study_types": [
      "cohort_prospective",
      "cohort_retrospective",
      "ehr_study",
      "claims_analysis",
      "registry_study",
      "descriptive_analysis"
    ],
    "data_sources": [
      "ehr",
      "claims",
      "registry",
      "primary",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1093/jamia/ocv189",
        "url": "https://doi.org/10.1093/jamia/ocv189",
        "citation_text": "Mandel JC, Kreda DA, Mandl KD, Kohane IS, Ramoni RB. SMART on FHIR: a standards-based, interoperable apps platform for electronic health records. Journal of the American Medical Informatics Association. 2016;23(5):899-908.",
        "year": 2016,
        "authors_short": "Mandel et al.",
        "notes": "The foundational paper introducing SMART on FHIR, the open OAuth 2.0 authorization framework that allows third-party and research apps to connect to any compatible EHR and retrieve patient data with consent. Essential for understanding patient-mediated data access in prospective and hybrid RWE study designs."
      },
      {
        "role": "explain",
        "doi": "10.2196/35724",
        "url": "https://doi.org/10.2196/35724",
        "citation_text": "Vorisek CN, Lehne M, Klopfenstein SAI, et al. Fast Healthcare Interoperability Resources (FHIR) for interoperability in health research: systematic review. JMIR Medical Informatics. 2022;10(7):e35724.",
        "year": 2022,
        "authors_short": "Vorisek et al.",
        "notes": "Systematic review of FHIR use cases in health research, covering FHIR-based data extraction, clinical trial support, and integration with analytic environments. Documents the current state of FHIR adoption across research settings and provides an evidence base for when FHIR-based pipelines add value versus traditional database extract approaches."
      }
    ],
    "plain_language_summary": "FHIR (pronounced \"fire\") is the modern web standard that hospitals, insurers, and apps use to share patient data with each other — each piece of clinical information (a diagnosis, a prescription, a lab result) is packaged as a small JSON document called a \"resource\" and delivered over a standard internet connection. US federal rules finalized in 2020, with compliance beginning in 2021, now require certified EHR systems and many insurers to offer FHIR connections, which means researchers can reach prescription histories, lab results, and payer claims in a structured, machine-readable way without starting from scratch at every site. For an RWE analyst, FHIR is most relevant when designing prospective data-collection pipelines, patient-mediated data-sharing studies, or decentralized trials — for most retrospective studies it is background plumbing that the data-engineering team has already handled before the records reach your desk. One honest caveat: receiving data in FHIR format does not guarantee the fields you need are populated — completeness still varies by source and must always be checked.",
    "key_terms": [
      {
        "term": "FHIR resource",
        "definition": "A self-contained JSON document representing one clinical concept — one patient, one prescription, one lab result — that can be sent over the internet using standard web tools."
      },
      {
        "term": "REST API",
        "definition": "A web-based interface that lets software request data using standard internet calls, rather than requiring a direct database connection; FHIR uses REST APIs to deliver clinical data between systems."
      },
      {
        "term": "US Core profile",
        "definition": "A set of rules that narrows the broad FHIR standard to the specific data fields US payers and hospitals are legally required to provide under federal interoperability regulations."
      },
      {
        "term": "FHIR bulk export ($export)",
        "definition": "A FHIR operation that downloads data for an entire patient population at once as flat files with one record per line, rather than pulling records one patient at a time — the mode used for population-scale research cohort extraction."
      },
      {
        "term": "SMART on FHIR",
        "definition": "An open login standard that lets a patient-facing or research app connect to any compatible hospital or insurer system and retrieve data with the patient's permission, without needing a custom integration built separately for each site."
      },
      {
        "term": "NDJSON",
        "definition": "\"Newline-Delimited JSON\" — a file format where each line is one complete JSON record; FHIR bulk export delivers data this way so each line is one resource (one prescription, one diagnosis, etc.)."
      }
    ],
    "worked_example": {
      "scenario": "A data engineer is building an exposure-extraction pipeline for a new-user cohort study on metformin initiation. The study site exposes a FHIR R4 bulk $export endpoint. For patient P001, the engineer retrieves three MedicationRequest resources from the NDJSON export file, extracts four fields from each resource, and assembles a three-row analytic exposure table. The goal is to confirm that the 1-to-1 resource-to-row mapping is exact and to sum the intended days of supply across all three rows.",
      "dataset": {
        "caption": "Four fields extracted from three FHIR MedicationRequest resources for patient P001. The rxnorm_code comes from medicationCodeableConcept.coding where system equals the RxNorm URI; authoredOn is the prescription order date; days_supply comes from dispenseRequest.expectedSupplyDuration.value. RxNorm code 860975 = metformin 500 mg oral tablet (illustrative).",
        "columns": [
          "resource_id",
          "authoredOn",
          "rxnorm_code",
          "days_supply"
        ],
        "rows": [
          [
            "med-001",
            "2023-01-05",
            "860975",
            30
          ],
          [
            "med-002",
            "2023-02-06",
            "860975",
            30
          ],
          [
            "med-003",
            "2023-03-09",
            "860975",
            30
          ]
        ]
      },
      "steps": [
        "Parse each NDJSON line as a MedicationRequest JSON object; confirm resourceType equals \"MedicationRequest\" and status is \"active\" or \"completed\" before extracting any fields. Resources with any other status, such as \"cancelled\" or \"entered-in-error\", are excluded.",
        "Extract the RxNorm code from medicationCodeableConcept.coding by selecting the entry whose system equals \"http://www.nlm.nih.gov/research/umls/rxnorm\"; all three resources carry code 860975 (metformin 500 mg oral tablet).",
        "Extract the prescription date from the authoredOn field; the three dates are 2023-01-05, 2023-02-06, and 2023-03-09 — these are order dates, not confirmed dispensing dates.",
        "Extract the intended supply duration from dispenseRequest.expectedSupplyDuration.value; all three resources carry a value of 30 days. Note that this field is optional in FHIR — if absent, days_supply is null and must be imputed or the row flagged for review.",
        "Write one analytic exposure row per MedicationRequest resource; 3 resources produce 3 rows because the mapping is exactly 1-to-1 at extraction — no aggregation or deduplication is applied.",
        "Sum total intended days of exposure: 30 + 30 + 30 = 90 days. This is the raw maximum intended exposure across three sequential orders, not an adherence measure."
      ],
      "result": "3 MedicationRequest resources produce exactly 3 analytic exposure rows (1-to-1 mapping confirmed). RxNorm code 860975 appears in all three rows. Total days_supply = 30 + 30 + 30 = 90 days of intended exposure for patient P001. The gaps between the end of each 30-day supply and the next order date (Feb 4 to Feb 5, 2 days; Mar 8, 1 day) are visible in the date sequence but are not yet evaluated; adherence calculation is a downstream analytic step.",
      "timeline_spec": {
        "title": "Three FHIR MedicationRequest resources for patient P001 (metformin 500 mg, Q1 2023)",
        "window": {
          "start": "2023-01-01",
          "end": "2023-04-30",
          "label": "Observation window: Jan-Apr 2023"
        },
        "events": [
          {
            "label": "MedReq med-001",
            "start": "2023-01-05",
            "length_days": 30,
            "quantity": "30 days_supply"
          },
          {
            "label": "MedReq med-002",
            "start": "2023-02-06",
            "length_days": 30,
            "quantity": "30 days_supply"
          },
          {
            "label": "MedReq med-003",
            "start": "2023-03-09",
            "length_days": 30,
            "quantity": "30 days_supply"
          }
        ],
        "spans": [
          {
            "kind": "exposed",
            "start": "2023-01-05",
            "end": "2023-02-03",
            "label": "30-day supply (med-001)"
          },
          {
            "kind": "gap",
            "start": "2023-02-04",
            "end": "2023-02-05",
            "label": "2-day gap"
          },
          {
            "kind": "exposed",
            "start": "2023-02-06",
            "end": "2023-03-07",
            "label": "30-day supply (med-002)"
          },
          {
            "kind": "gap",
            "start": "2023-03-08",
            "end": "2023-03-08",
            "label": "1-day gap"
          },
          {
            "kind": "exposed",
            "start": "2023-03-09",
            "end": "2023-04-07",
            "label": "30-day supply (med-003)"
          }
        ],
        "result": {
          "label": "3 resources to 3 rows; total days_supply = 30 + 30 + 30 = 90",
          "value": 90
        }
      }
    },
    "prerequisites": [
      "ehr-study",
      "omop-standardized-vocabularies"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Patient-mediated SMART on FHIR (individual data pull)",
        "description": "A patient authorizes a research or care-management app via SMART on FHIR OAuth 2.0 to retrieve their individual records from one or more EHR or payer systems. The app issues FHIR REST calls (GET /MedicationRequest?patient=P001) and receives patient-specific resource bundles. Suitable for hybrid study designs where participants share their existing records at enrollment without requiring a site-level data-use agreement for every health system the patient has visited.",
        "edge_cases": [
          "Authorization covers only the systems the patient explicitly connects; care at other institutions or fills at out-of-network pharmacies are invisible — plan for informative missingness.",
          "Individual-patient FHIR APIs are rate-limited by the source system (treat 60-120 requests per minute as an illustrative range and confirm the actual limit with the vendor); extracting all resource types for a large cohort via individual queries is slow. Prefer bulk $export for populations beyond a few hundred patients."
        ],
        "data_source_notes": "EHR and payer FHIR APIs accessed through a patient-authorized SMART on FHIR app launch. The resource types returned depend on the US Core profile implemented by the system; confirm which resource types are supported before designing the extraction schema and the analytic table structure."
      },
      {
        "name": "Bulk FHIR $export (population-level NDJSON)",
        "description": "A researcher with group-level FHIR authorization issues an asynchronous bulk export request (for example, GET /fhir/Group/{group-id}/$export?_type=MedicationRequest,Condition,Observation with the Prefer: respond-async header), polls the returned status URL until the export completes, and downloads flat NDJSON files, one or more per resource type, each line containing one resource. This is the research-scale equivalent of a traditional database extract and the primary mode for retrospective cohort extraction from FHIR-enabled EHR systems.",
        "edge_cases": [
          "Bulk $export is not universally implemented; many EHR FHIR R4 APIs support only individual-patient SMART on FHIR queries, not group or population-level export. Verify capability before designing the pipeline architecture.",
          "NDJSON files can be very large for multi-year health system exports; process them one line at a time (Python file iteration with json.loads per line, R jsonlite::stream_in on a file connection) rather than loading the full file into memory.",
          "Export scope is limited to the authorized patient group; patients who received care elsewhere have incomplete records, as with any single-system EHR extract."
        ],
        "data_source_notes": "EHR FHIR bulk export requires group-level authorization and typically a data-use agreement with the health system's research office. Resulting NDJSON files map cleanly to relational tables (one row per line) after field extraction and null-handling."
      },
      {
        "name": "CARIN Blue Button 2.0 (payer claims via FHIR API)",
        "description": "Patients or authorized apps retrieve payer claims data via the CARIN Blue Button 2.0 FHIR API, which CMS-regulated payers must offer under the CMS Interoperability and Patient Access rule. The feed returns ExplanationOfBenefit resources covering inpatient, outpatient, and pharmacy claims. Drug claims appear in ExplanationOfBenefit.item entries coded in NDC or RxNorm; Coverage resources define the enrollment period.",
        "edge_cases": [
          "Coverage is payer-specific; a patient enrolled in multiple plans must authorize each payer separately to obtain a complete claim history across all payers.",
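          "The CMS mandate applies to Medicare Advantage organizations, Medicaid and CHIP programs and managed care plans, and Qualified Health Plan issuers on the federally facilitated exchanges; employer-sponsored commercial plans, including self-insured plans, are generally outside its scope. Confirm plan type before assuming API availability.",
          "ExplanationOfBenefit drug items may carry NDC codes rather than RxNorm; map NDC to RxNorm using the NDC mappings published with RxNorm by the NLM (for example, via RxNav) or the OHDSI vocabulary before running drug-exposure analysis."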
        ],
        "data_source_notes": "Payer FHIR APIs under CARIN Blue Button require patient authorization. The ExplanationOfBenefit resource is the functional equivalent of a claim line; extract service dates, drug codes, and quantity fields to reconstruct an analytic exposure table equivalent to a pharmacy claims extract."
      },
      {
        "name": "FHIR-to-OMOP ETL pipeline",
        "description": "Converting FHIR resources into OMOP CDM tables via an ETL process following the OHDSI FHIR-to-OMOP specification: MedicationRequest to DRUG_EXPOSURE, Condition to CONDITION_OCCURRENCE, Observation to MEASUREMENT or OBSERVATION. RxNorm, ICD-10-CM, and LOINC codes carried in FHIR resources resolve to OMOP standard concepts via the OHDSI vocabulary with minimal translation overhead.",
        "edge_cases": [
          "FHIR allows medicationReference (a pointer to a separate Medication resource) instead of medicationCodeableConcept (an inline code); the ETL must resolve references before coding lookup.",
          "Optional FHIR elements (dispenseRequest.expectedSupplyDuration, onset date, lab unit) may be absent. The ETL must document null handling explicitly: do not silently drop rows or impute without a documented rule.",
          "OBSERVATION_PERIOD in OMOP must be derived from the FHIR data (span of Encounter dates or Coverage period); FHIR has no enrollment table equivalent and the derivation logic must be specified and validated before cohort construction."
        ],
        "data_source_notes": "Any FHIR-enabled source feeding an OMOP CDM instance. Tooling includes OHDSI OMOP on FHIR and site-specific ETL scripts. Validate with OHDSI Achilles or DataQualityDashboard after ETL to confirm mapping completeness before running analytic queries."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "omop-cdm-method-patterns-rwe",
        "pros_of_this": "FHIR APIs are available directly from source systems (EHRs, payers) without a separate ETL step, enabling real-time or near-real-time data access for prospective designs and patient-mediated data collection across sites that would otherwise require bespoke per-site database extracts.",
        "cons_of_this": "FHIR is an exchange format, not an analysis model; running RWE analyses directly on FHIR resources requires building the analytic translation layer that OMOP CDM already provides — standardized vocabularies, CONCEPT_ANCESTOR hierarchies, and OBSERVATION_PERIOD constructs are essential for retrospective cohort studies and are absent from raw FHIR.",
        "when_to_prefer": "Prefer FHIR-based pipelines for prospective, patient-mediated, or real-time data collection; prefer OMOP CDM for retrospective multi-database network studies where standardization and portability are paramount. The ideal pipeline combines both."
      },
      {
        "compared_to": "ehr-study",
        "pros_of_this": "FHIR APIs provide a standardized, federally mandated access pathway to EHR data that reduces reliance on bespoke per-site database extracts; SMART on FHIR enables patient-authorized access without requiring traditional data-use agreements for each health system.",
        "cons_of_this": "FHIR APIs expose only data the source system has structured and consented to share via the standard profile; free-text clinical notes, unstructured imaging reports, and granular visit-level detail beyond US Core-mandated fields may not be accessible via the FHIR API even if present in the underlying EHR database.",
        "when_to_prefer": "Prefer FHIR-based EHR extraction when standardized multi-site access is needed; prefer direct EHR database access (or NLP pipelines) when full clinical depth including unstructured data is required for outcome phenotyping or confounder capture."
      },
      {
        "compared_to": "tokenization-privacy-preserving-record-linkage-rwe",
        "pros_of_this": "FHIR provides a standardized transport for individual patient records; when combined with tokenization, FHIR feeds enable privacy-preserving record linkage across payers and EHRs without exposing direct identifiers — the transport standard and the privacy technique are complementary layers, not alternatives.",
        "cons_of_this": "FHIR alone does not provide linkage capability — it delivers individual records to authorized systems but matching patients across data sources requires a separate tokenization protocol. FHIR transport without a privacy-preserving linkage layer creates re-identification risk when combining records from multiple sources.",
        "when_to_prefer": "Use FHIR as the transport mechanism for tokenized record exchange and tokenization (PPRL) as the linking key. Neither replaces the other in a multi-source RWE pipeline."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Payer FHIR APIs (CARIN Blue Button 2.0) deliver claims as ExplanationOfBenefit resources. Drug claims appear in item.productOrService coded in NDC or RxNorm; map NDC to RxNorm using the NDC mappings published with RxNorm by the NLM (for example, via RxNav) or the OHDSI vocabulary before running drug-exposure analyses. Service dates in item.servicedDate correspond to claim line dates. Coverage resources define the enrollment period; use Coverage.period to define observation windows analogous to an enrollment denominator in traditional claims analysis.",
      "ehr": "EHR FHIR APIs (US Core R4) expose MedicationRequest (orders, not confirmed dispenses), Condition, Observation (labs, vitals), and Encounter resources. MedicationRequest.authoredOn is the order date, not the dispensing date; for confirmed fills look for a linked MedicationDispense resource (often absent in ambulatory settings). Filter Observation resources on status (final, amended) and LOINC code to isolate specific lab tests. Confirm US Core profile version supported by the site before finalizing extraction logic.",
      "registry": "Disease registries increasingly expose FHIR endpoints for prospective data capture and federated queries. FHIR-based EDC uses MedicationRequest and Observation resources to pre-populate case report forms from EHR data. Validate that the registry FHIR implementation follows an established profile (US Core, mCODE for oncology) before relying on field population rates assumed from the US Core specification.",
      "primary": "Decentralized trial platforms and research apps using SMART on FHIR for participant data donation follow the same REST pattern as EHR APIs. Participant data quality varies with the source EHR system's FHIR implementation maturity; run field-population checks on days_supply, onset dates, and lab units before including FHIR-sourced primary data in the analytic dataset.",
      "linked": "Tokenized linkage pipelines use FHIR as the transport for de-identified record bundles. A FHIR Patient resource carries the privacy-preserving tokenized identifier, and linked resources (claims encounters, EHR observations) follow in the same bundle. The FHIR transport is format-agnostic to the linkage algorithm; PPRL keys can be carried in Patient.identifier entries with a system URI that identifies the tokenization scheme used."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import json\nimport pandas as pd\nfrom io import StringIO\n\n# ── Simulated NDJSON bulk export: 3 MedicationRequest resources (one per line) ──\nndjson_data = (\n    '{\"resourceType\":\"MedicationRequest\",\"id\":\"med-001\",\"status\":\"active\",\"intent\":\"order\",'\n    '\"medicationCodeableConcept\":{\"coding\":[{\"system\":\"http://www.nlm.nih.gov/research/umls/rxnorm\",'\n    '\"code\":\"860975\",\"display\":\"metformin 500 mg oral tablet\"}]},'\n    '\"subject\":{\"reference\":\"Patient/P001\"},\"authoredOn\":\"2023-01-05\",'\n    '\"dispenseRequest\":{\"expectedSupplyDuration\":{\"value\":30,\"unit\":\"days\"}}}\\n'\n    '{\"resourceType\":\"MedicationRequest\",\"id\":\"med-002\",\"status\":\"active\",\"intent\":\"order\",'\n    '\"medicationCodeableConcept\":{\"coding\":[{\"system\":\"http://www.nlm.nih.gov/research/umls/rxnorm\",'\n    '\"code\":\"860975\",\"display\":\"metformin 500 mg oral tablet\"}]},'\n    '\"subject\":{\"reference\":\"Patient/P001\"},\"authoredOn\":\"2023-02-06\",'\n    '\"dispenseRequest\":{\"expectedSupplyDuration\":{\"value\":30,\"unit\":\"days\"}}}\\n'\n    '{\"resourceType\":\"MedicationRequest\",\"id\":\"med-003\",\"status\":\"active\",\"intent\":\"order\",'\n    '\"medicationCodeableConcept\":{\"coding\":[{\"system\":\"http://www.nlm.nih.gov/research/umls/rxnorm\",'\n    '\"code\":\"860975\",\"display\":\"metformin 500 mg oral tablet\"}]},'\n    '\"subject\":{\"reference\":\"Patient/P001\"},\"authoredOn\":\"2023-03-09\",'\n    '\"dispenseRequest\":{\"expectedSupplyDuration\":{\"value\":30,\"unit\":\"days\"}}}\\n'\n)\n\nRXNORM_SYSTEM = \"http://www.nlm.nih.gov/research/umls/rxnorm\"\n\ndef extract_rxnorm_code(resource):\n    codings = resource.get(\"medicationCodeableConcept\", {}).get(\"coding\", [])\n    for c in codings:\n        if c.get(\"system\") == RXNORM_SYSTEM:\n            return c.get(\"code\")\n    return None\n\ndef extract_days_supply(resource):\n    return (resource\n            
.get(\"dispenseRequest\", {})\n            .get(\"expectedSupplyDuration\", {})\n            .get(\"value\"))\n\nrows = []\nfor line in StringIO(ndjson_data):\n    line = line.strip()\n    if not line:\n        continue\n    resource = json.loads(line)\n    if resource.get(\"resourceType\") != \"MedicationRequest\":\n        continue\n    if resource.get(\"status\") not in (\"active\", \"completed\"):\n        continue\n    person_id = resource[\"subject\"][\"reference\"].split(\"/\")[-1]\n    # authoredOn is the order date, not a confirmed dispensing date\n    rows.append({\n        \"person_id\":   person_id,\n        \"resource_id\": resource[\"id\"],\n        \"order_date\":  resource.get(\"authoredOn\"),\n        \"rxnorm_code\": extract_rxnorm_code(resource),\n        \"days_supply\": extract_days_supply(resource),\n    })\n\nexposure_df = pd.DataFrame(rows)\nprint(exposure_df.to_string(index=False))\n# person_id resource_id order_date rxnorm_code  days_supply\n#      P001     med-001 2023-01-05      860975           30\n#      P001     med-002 2023-02-06      860975           30\n#      P001     med-003 2023-03-09      860975           30\n\n# sum() skips missing values, so a null days_supply does not break the total\ntotal_days = int(exposure_df[\"days_supply\"].sum())\nprint(f\"Resource count: {len(exposure_df)}\")   # 3\nprint(f\"Total days_supply: {total_days}\")       # 90\nassert len(exposure_df) == 3, \"Expected 3 rows from 3 MedicationRequest resources\"\nassert total_days == 90, \"Expected 30 + 30 + 30 = 90\"",
        "description": "Parse FHIR MedicationRequest resources from a simulated NDJSON bulk-export string into a\npandas DataFrame. Extracts RxNorm code, prescription date (authoredOn), and intended\ndays_supply from each resource. Handles the optional dispenseRequest block gracefully.\nDemonstrates the 3-resource to 3-row mapping from the worked example and asserts the\narithmetic. No FHIR-specific libraries required — standard json parsing is sufficient\nfor field extraction from well-structured resources.",
        "dependencies": [
          "pandas"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(jsonlite)\n\n# ── Simulated NDJSON lines: one JSON object per element ──\nndjson_lines <- c(\n  paste0('{\"resourceType\":\"MedicationRequest\",\"id\":\"med-001\",\"status\":\"active\",',\n         '\"intent\":\"order\",\"medicationCodeableConcept\":{\"coding\":[{\"system\":',\n         '\"http://www.nlm.nih.gov/research/umls/rxnorm\",\"code\":\"860975\",',\n         '\"display\":\"metformin 500 mg oral tablet\"}]},\"subject\":{\"reference\":',\n         '\"Patient/P001\"},\"authoredOn\":\"2023-01-05\",\"dispenseRequest\":{',\n         '\"expectedSupplyDuration\":{\"value\":30,\"unit\":\"days\"}}}'),\n  paste0('{\"resourceType\":\"MedicationRequest\",\"id\":\"med-002\",\"status\":\"active\",',\n         '\"intent\":\"order\",\"medicationCodeableConcept\":{\"coding\":[{\"system\":',\n         '\"http://www.nlm.nih.gov/research/umls/rxnorm\",\"code\":\"860975\",',\n         '\"display\":\"metformin 500 mg oral tablet\"}]},\"subject\":{\"reference\":',\n         '\"Patient/P001\"},\"authoredOn\":\"2023-02-06\",\"dispenseRequest\":{',\n         '\"expectedSupplyDuration\":{\"value\":30,\"unit\":\"days\"}}}'),\n  paste0('{\"resourceType\":\"MedicationRequest\",\"id\":\"med-003\",\"status\":\"active\",',\n         '\"intent\":\"order\",\"medicationCodeableConcept\":{\"coding\":[{\"system\":',\n         '\"http://www.nlm.nih.gov/research/umls/rxnorm\",\"code\":\"860975\",',\n         '\"display\":\"metformin 500 mg oral tablet\"}]},\"subject\":{\"reference\":',\n         '\"Patient/P001\"},\"authoredOn\":\"2023-03-09\",\"dispenseRequest\":{',\n         '\"expectedSupplyDuration\":{\"value\":30,\"unit\":\"days\"}}}')\n)\n\nRXNORM_SYSTEM <- \"http://www.nlm.nih.gov/research/umls/rxnorm\"\n\nextract_rxnorm_code <- function(resource) {\n  codings <- resource[[\"medicationCodeableConcept\"]][[\"coding\"]]\n  if (is.null(codings)) return(NA_character_)\n  rxnorm_rows <- codings[codings$system == RXNORM_SYSTEM, ]\n  if (nrow(rxnorm_rows) == 0) 
return(NA_character_)\n  rxnorm_rows$code[1]\n}\n\nextract_days_supply <- function(resource) {\n  val <- resource[[\"dispenseRequest\"]][[\"expectedSupplyDuration\"]][[\"value\"]]\n  if (is.null(val)) return(NA_real_)\n  as.numeric(val)\n}\n\nrows <- lapply(ndjson_lines, function(line) {\n  resource <- fromJSON(line, simplifyVector = TRUE)\n  if (resource$resourceType != \"MedicationRequest\") return(NULL)\n  if (!resource$status %in% c(\"active\", \"completed\")) return(NULL)\n  person_id <- sub(\"Patient/\", \"\", resource$subject$reference)\n  # authoredOn is the order date, not a confirmed dispensing date\n  data.frame(\n    person_id    = person_id,\n    resource_id  = resource$id,\n    order_date   = resource$authoredOn,\n    rxnorm_code  = extract_rxnorm_code(resource),\n    days_supply  = extract_days_supply(resource),\n    stringsAsFactors = FALSE\n  )\n})\n\nexposure_df <- do.call(rbind, Filter(Negate(is.null), rows))\nprint(exposure_df)\n#   person_id resource_id order_date rxnorm_code days_supply\n# 1      P001     med-001 2023-01-05      860975          30\n# 2      P001     med-002 2023-02-06      860975          30\n# 3      P001     med-003 2023-03-09      860975          30\n\ntotal_days <- sum(exposure_df$days_supply, na.rm = TRUE)\ncat(sprintf(\"Resource count: %d\\n\", nrow(exposure_df)))  # 3\ncat(sprintf(\"Total days_supply: %d\\n\", total_days))      # 90\nstopifnot(nrow(exposure_df) == 3)\nstopifnot(total_days == 90)",
        "description": "Parse FHIR MedicationRequest resources from NDJSON using jsonlite. Extracts RxNorm code,\nauthoredOn date, and days_supply from each resource into a data frame. Mirrors the Python\nworked example: 3 resources to 3 rows. For production use with large NDJSON bulk-export\nfiles, replace lapply over in-memory strings with jsonlite::stream_in() on a file\nconnection, e.g. file(path), and pass a handler function so that multi-GB files are\nprocessed page by page rather than loaded into memory.",
        "dependencies": [
          "jsonlite"
        ],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  subgraph SRC[\"Source Systems (US FHIR Mandates)\"]\n    EHR[\"EHR\\n(ONC-certified FHIR R4)\"]\n    PAY[\"Payer\\n(CMS CARIN mandate)\"]\n  end\n  subgraph FHIR[\"FHIR Exchange APIs\"]\n    SMART[\"SMART on FHIR\\nindividual patient pull\\npatient-authorized\"]\n    BULK[\"Bulk dollar-export\\npopulation NDJSON\\nresearch authorization\"]\n    CARIN[\"CARIN Blue Button 2.0\\nclaims as\\nExplanationOfBenefit\"]\n  end\n  subgraph ETL[\"ETL / Transform\"]\n    PARSE[\"Parse resources\\nExtract RxNorm/ICD/LOINC\\nNull-handle optional fields\"]\n  end\n  subgraph ANA[\"Analytic Layer\"]\n    OMOP[\"OMOP CDM\\nmulti-site portable SQL\"]\n    FLAT[\"Flat analytic tables\\nsingle-site study\"]\n  end\n  EHR --> SMART\n  EHR --> BULK\n  PAY --> CARIN\n  SMART --> PARSE\n  BULK --> PARSE\n  CARIN --> PARSE\n  PARSE --> OMOP\n  PARSE --> FLAT",
        "caption": "FHIR as transport layer between source systems and the analytic destination. FHIR APIs (SMART on FHIR for individual access, bulk $export for populations, CARIN Blue Button for payer claims) feed an ETL that produces OMOP CDM or flat analytic tables. FHIR is the exchange format; OMOP is the analysis model.",
        "alt_text": "Flowchart showing EHR and payer systems connecting through SMART on FHIR, bulk export, and CARIN Blue Button APIs to an ETL layer that produces OMOP CDM or flat analytic tables.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "see_also",
        "target_slug": "omop-standardized-vocabularies",
        "notes": "OMOP standardized vocabularies are the analysis-side coding standard that FHIR data must be mapped to before RWE analysis; RxNorm, SNOMED CT, and LOINC codes carried in FHIR resources resolve to OMOP standard concepts via the OHDSI vocabulary ETL step."
      },
      {
        "relation_type": "see_also",
        "target_slug": "omop-cdm-method-patterns-rwe",
        "notes": "OMOP CDM is the analysis destination for most FHIR-to-research pipelines; FHIR provides the exchange transport and OMOP provides the standardized analytic structure. The typical pipeline is FHIR source to ETL to OMOP CDM to analysis SQL."
      },
      {
        "relation_type": "used_with",
        "target_slug": "ehr-study",
        "notes": "EHR-based studies increasingly access source data through FHIR bulk $export for retrospective cohorts and SMART on FHIR for patient-mediated designs; FHIR is the modern transport layer for EHR data in research pipelines replacing bespoke database extracts."
      },
      {
        "relation_type": "see_also",
        "target_slug": "tokenization-privacy-preserving-record-linkage-rwe",
        "notes": "FHIR is used as the transport format for tokenized record-linkage feeds; a FHIR Patient resource carries the privacy-preserving identifier enabling multi-source linkage without exposing direct identifiers. FHIR transport and PPRL are complementary layers."
      },
      {
        "relation_type": "see_also",
        "target_slug": "cdisc-sdtm-adam-rwe",
        "notes": "CDISC SDTM and ADaM are the submission-side data standards for regulatory dossiers; FHIR is the exchange-side standard for EHR and payer data transfer. A study may ingest via FHIR, analyze in OMOP, and submit in SDTM — three distinct standards serving three distinct roles in the evidence-generation pipeline."
      },
      {
        "relation_type": "see_also",
        "target_slug": "rxnorm-drug-terminology",
        "notes": "RxNorm is the drug coding standard carried in FHIR MedicationRequest resources under the medicationCodeableConcept field; understanding RxNorm concept granularity and hierarchy is required to correctly extract and interpret medication data from FHIR-sourced pipelines."
      }
    ],
    "aliases": [
      "HL7 FHIR",
      "Fast Healthcare Interoperability Resources",
      "FHIR R4",
      "SMART on FHIR",
      "CARIN Blue Button",
      "FHIR bulk export",
      "FHIR interoperability"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "firth-penalized-regression-rwe",
    "name": "Firth Penalized Regression",
    "short_definition": "A penalized-likelihood method that adds the Jeffreys-prior penalty to the score equations to remove first-order (O(1/n)) bias in logistic and Cox models and to produce finite, well-defined estimates when the maximum likelihood estimate diverges under separation or sparse/rare events.",
    "long_description": "**Firth penalized regression** modifies the likelihood of a logistic or Cox model by adding a penalty term equal to one-half\nthe log-determinant of the Fisher information (the Jeffreys invariant prior). Maximizing this penalized log-likelihood shifts\nthe score equations by a term that exactly cancels the leading O(1/n) bias of the ordinary maximum likelihood estimate (MLE).\nThe practical consequence in real-world evidence (RWE) is twofold: (1) coefficients are bias-reduced in small or imbalanced\nsamples, and (2) — crucially for pharmacoepidemiology — estimates remain **finite and unique even under complete or quasi-complete\nseparation**, the situation where an exposure or covariate perfectly (or nearly perfectly) predicts the outcome and the ordinary\nMLE diverges to ±infinity with an infinite standard error. Firth was conceived as a bias-reduction device (Firth 1993); Heinze\nand Schemper (2002) recognized it as the principled fix for separation, which is the reason it appears constantly in rare-outcome\ncomparative-safety analyses.\n\n**Core conceptual distinction**. Firth is a *finite-sample / sparse-data* correction, not an alternative estimand. It targets the\nsame conditional log–odds ratio or log–hazard ratio as ordinary logistic/Cox regression; it does not change what is being estimated,\nonly how it is estimated when the data are too thin for the asymptotic MLE to behave. Distinguish it from three neighbours that are\noften conflated with it. (1) *Exact logistic regression* conditions on sufficient statistics and gives exact small-sample inference,\nbut is computationally intractable for the multi-covariate, high-dimensional confounder sets typical of claims studies — Firth is the\nscalable approximation that keeps continuous covariates and many terms. 
(2) *Ridge/LASSO* penalize the size of coefficients toward\nzero to manage variance/selection; Firth penalizes via the information matrix to remove *bias* and guarantee finiteness, and (with\nthe appropriate profile-penalized-likelihood interval) it is invariant to predictor scaling in a way ridge is not. (3) *Exact/Poisson\nor rare-event corrections* (e.g., King–Zeng) address a related rare-events problem but with a different correction; Firth is the more\ngeneral, model-internal solution. A subtle but important caveat: standard Firth biases the *intercept* (and therefore predicted\nprobabilities) toward 1/2, so for absolute-risk prediction in rare outcomes use **FLIC** (Firth logistic regression with intercept\ncorrection) or **FLAC**; for the comparative log–odds ratio that drives most RWE, plain Firth is appropriate.\n\n**Pros, cons, and trade-offs** (specific and comparative).\n- **vs ordinary (unpenalized) logistic/Cox MLE:** Firth always returns a finite estimate and a usable confidence interval under\n  separation, where the MLE returns ±infinity, an absurd odds ratio (e.g., OR = 4.7e8), or a model that silently fails to converge.\n  Even without separation it has smaller mean-squared error at low events-per-variable (EPV). 
Cost: it is a *shrinkage* estimator, so\n  in large, well-separated-from-the-boundary samples it adds a touch of bias toward the null relative to the (then-unbiased) MLE, and\n  Wald intervals are unreliable — you must use **penalized profile-likelihood** intervals and penalized likelihood-ratio tests, not\n  Wald z/SE.\n- **vs exact logistic regression:** Firth scales to realistic claims confounder sets and continuous covariates and is far faster;\n  exact methods give guaranteed small-sample coverage but become infeasible beyond a handful of categorical predictors.\n- **vs dropping/collapsing the offending variable or using a non-informative prior in a Bayesian model:** Firth keeps the variable in\n  the model and is reproducible and deterministic; the Bayesian data-augmentation prior of Greenland and Mansournia (a weakly\n  informative prior on the log-OR) is an excellent, often preferable alternative when you want a transparently chosen prior or\n  posterior interval, and is conceptually close to Firth.\n- **vs ridge/LASSO penalization:** Firth does not select variables and does not require cross-validating a tuning parameter; choose\n  ridge/LASSO when the goal is high-dimensional prediction or selection, Firth when the goal is a (near-)unbiased, finite effect\n  estimate under sparsity.\n\n**When to use**. Use Firth when the analytic model has **low EPV** (a common rule of thumb is EPV < 10, with risk continuing below\n~20), or when you observe **separation or a zero cell** in the exposure-by-outcome (or covariate-by-outcome) table — the canonical\nRWE trigger is a rare adverse event with a cell of zero events in one treatment arm or a fully-adjusted model that will not converge.\nPre-specify it in the SAP as the *primary* estimator for rare safety outcomes, or as the *fallback* estimator invoked by a written\nrule (\"if any exposure-stratum has <5 events or the model fails to converge, switch to Firth penalized likelihood with profile\nintervals\"). 
It applies to logistic models, conditional logistic models, and Cox proportional-hazards models (penalized partial\nlikelihood), so it covers both binary-outcome and time-to-event RWE.\n\n**When NOT to use — and when it is actively misleading or dangerous**.\n- **Do not report Firth Wald confidence intervals or Wald p-values.** Under the very sparsity that motivates Firth, the Wald\n  SE-based interval is badly miscalibrated and can hide or exaggerate effects; always use the penalized profile-likelihood interval\n  and penalized LR test. Reporting a Firth point estimate with a default software Wald CI is the most common dangerous error.\n- **Do not use plain Firth when you need calibrated absolute risks / predicted probabilities** (e.g., a risk model or\n  standardized/marginal risk difference): the intercept is pulled toward 0.5, inflating predicted probabilities for rare outcomes.\n  Use FLIC/FLAC instead.\n- **It does not fix confounding, selection bias, immortal time, or misclassification.** A separation problem caused by a structurally\n  impossible combination (e.g., an outcome that *defines* the exposure) is a design/data error; Firth will dutifully estimate a\n  finite—but meaningless—coefficient. Diagnose *why* the cell is empty before penalizing it away.\n- **It is not a substitute for adequate person-time.** If the rare event is rare because follow-up is too short or the cohort too\n  small to answer the question, Firth makes the model run but does not manufacture information; the interval will (correctly) be wide.\n- **For very large samples far from separation,** there is no advantage and a small null-ward shrinkage cost; use ordinary MLE.\n\n**Data-source operational depth**.\n- **Claims (FFS vs Medicare Advantage):** Rare safety events with low counts per arm are the dominant Firth use case here. 
The danger\n  is *spurious* separation created by data structure rather than biology: if the cohort includes **Medicare Advantage person-time**,\n  encounter (medical) claims are incomplete because MA plans are capitated, so an outcome that depends on FFS diagnosis/procedure\n  claims can show **zero events in a subgroup simply because the claims were never adjudicated to the database** — Firth will then\n  \"stabilize\" a separation that is pure missingness. Restrict to Parts A/B/D FFS enrollees (or the commercial pharmacy+medical\n  benefit) before concluding a zero cell is real. Watch **differential competing risks**: in elderly claims cohorts, death competes\n  with the event of interest and can differ by exposure, so a near-empty cell in a Cox model may reflect competing mortality, not a\n  protective effect — model the competing risk explicitly and apply Firth to the cause-specific or Fine–Gray model rather than reading\n  the penalized HR as a risk difference. Confirm the rare outcome with continuous enrollment and a clean washout so \"no event\" is\n  truly observed absence, not unobserved follow-up.\n- **EHR:** Encounter-driven capture means a \"zero events\" subgroup can be patients who left the health system, not patients who did\n  not have the event; loss to follow-up is informative and can manufacture quasi-separation. Use linked claims or a death index to\n  confirm absence of the outcome before penalizing. Site/center effects with few events per site frequently trigger separation in\n  conditional/stratified models — Firth is well suited there, but report the profile interval.\n- **Registry:** Adjudicated outcomes are higher quality, but rare-disease registries are *small by construction*, so low EPV and\n  separation are routine; Firth is often the appropriate primary estimator. 
Beware registries with eligibility tied to the outcome\n  (e.g., a treatment registry that enrolls at the time of the event), which can create structural separation.\n- **Linked claims–EHR–registry:** The richest substrate, but linkage selection and date-reconciliation (order vs fill vs service\n  date) can drop events differentially and produce artificial zero cells; reconcile dates and confirm the linkable subset is not\n  differentially missing the outcome before treating a sparse cell as biological.\n\n**Worked claims example.** Question: 1-year risk of a rare hepatotoxicity hospitalization with new-use of **drug A vs active comparator\ndrug B** for the same indication, in a commercial + Medicare **FFS** database. Cohort: new users (no fill of A or B in a 365-day\ncontinuous-enrollment washout), index_date = first qualifying `fill_date` (NDC), arm assigned from that fill. Follow-up from index to\nthe first inpatient claim with the hepatotoxicity ICD-10 code in the primary position, censoring at disenrollment, death, end of data,\nor 365 days. Suppose the 2×2 is: drug A 6 events / 4,210 initiators; drug B **0 events** / 1,905 initiators. Ordinary logistic\nregression adjusting for age, sex, baseline liver disease (Charlson/Elixhauser proxies from the 365-day lookback), and a\nhigh-dimensional PS decile is **separated** — the comparator arm has zero events, so the MLE odds ratio diverges (software reports an\nOR in the millions with an infinite SE). Switch to the pre-specified Firth estimator: it returns a finite adjusted log-OR with a\n**penalized profile-likelihood 95% CI** (e.g., OR ≈ 5.9, 95% PL CI 0.7–260) and a penalized LR p-value, honestly reflecting that with\nzero comparator events the data are compatible with a wide range of effects. 
Before reporting, verify the zero cell is real: confirm\nno drug-B initiator with MA-only person-time was misclassified as event-free, check that competing death did not remove drug-B\npatients before they could have the event, and run a sensitivity analysis using the Greenland–Mansournia data-augmentation prior to\nshow the conclusion is not an artifact of the penalty choice.\n\n**Interpreting the output**\n\nConsider the hepatotoxicity signal above: drug A with 6 liver-injury events in 4,210 initiators versus drug B\nwith 0 events in 1,905 initiators. The Firth-adjusted model returns OR ≈ 5.9 with a 95% profile-likelihood CI\nof approximately 0.7–260 and a penalized LR p-value.\n\n*(1) Formal statistical interpretation.* The penalized OR of ≈ 5.9 is the Firth maximum penalized likelihood\nestimate; it is systematically smaller than the MLE would be if computable (the Jeffreys prior pulls extreme\nestimates toward one). The profile-likelihood CI — not the Wald interval — must be reported because Wald\nintervals rely on the normal approximation of the log-OR, which fails completely under separation. The CI\nof ≈ 0.7–260 is wide because the data contain essentially no comparator-arm information: values of the true\nOR anywhere in this range are compatible with the observed data.\n\n*(2) Practical interpretation for a decision-maker.* The Firth OR of ≈ 5.9 is a finite, honest signal that\ndrug A may carry elevated liver-injury risk relative to drug B, but the interval spanning 0.7 to 260 means\nthe data alone cannot rule out a null association or a very large one. This result justifies a safety\nfollow-up in a larger database — it does not support a label change or a formulary restriction on its own.\nDo not interpret the OR without its full CI; the point estimate is not stable enough to act on in isolation.",
    "primary_category": "Inferential_Statistics",
    "tags": [
      "inferential_statistics",
      "penalized-likelihood",
      "firth-penalized-regression",
      "separation",
      "rare-events",
      "sparse-data-bias",
      "small-sample",
      "logistic-regression",
      "cox-regression"
    ],
    "applies_to_study_types": [
      "claims_analysis",
      "ehr_study",
      "registry_linkage",
      "comparative_effectiveness",
      "drug_safety"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1093/biomet/80.1.27",
        "url": "https://doi.org/10.1093/biomet/80.1.27",
        "citation_text": "Firth D. Bias reduction of maximum likelihood estimates. Biometrika. 1993;80(1):27-38.",
        "year": 1993,
        "authors_short": "Firth",
        "notes": "Original derivation of the penalized score (Jeffreys-prior penalty) that removes the O(1/n) bias of the MLE; the theoretical foundation of the method."
      },
      {
        "role": "demonstrate",
        "doi": "10.1002/sim.1047",
        "url": "https://doi.org/10.1002/sim.1047",
        "citation_text": "Heinze G, Schemper M. A solution to the problem of separation in logistic regression. Statistics in Medicine. 2002;21(16):2409-2419.",
        "year": 2002,
        "authors_short": "Heinze & Schemper",
        "notes": "Reframes Firth's penalty as the principled fix for separation/sparse data in logistic regression and motivates penalized profile-likelihood inference; the basis of the logistf implementation."
      },
      {
        "role": "explain",
        "doi": "10.1136/bmj.i1981",
        "url": "https://doi.org/10.1136/bmj.i1981",
        "citation_text": "Greenland S, Mansournia MA, Altman DG. Sparse data bias: a problem hiding in plain sight. BMJ. 2016;352:i1981.",
        "year": 2016,
        "authors_short": "Greenland et al.",
        "notes": "Shows how sparse-data bias inflates odds/rate ratios away from the null in routine epidemiologic analyses and presents penalization (Firth) and weakly-informative priors as the remedies; essential context for when and why to penalize."
      },
      {
        "role": "explain",
        "doi": "10.1016/0895-4356(95)00510-2",
        "url": "https://doi.org/10.1016/0895-4356(95)00510-2",
        "citation_text": "Concato J, Peduzzi P, Holford TR, Feinstein AR. Importance of events per independent variable in proportional hazards analysis I. Background, goals, and general strategy. Journal of Clinical Epidemiology. 1995;48(12):1495-1501.",
        "year": 1995,
        "authors_short": "Concato et al.",
        "notes": "Establishes the events-per-variable framework that quantifies when small-sample bias threatens proportional-hazards/logistic estimates and thus when a Firth correction is indicated."
      }
    ],
    "plain_language_summary": "Firth penalized regression is a statistical method that fixes a specific failure of ordinary logistic regression: when every single person with a rare exposure has the outcome (or none does), the standard model breaks down and spits out an impossibly huge effect estimate. Firth adds a small mathematical correction to the model that pulls the estimate back to a finite, usable number while keeping it as accurate as the thin data allow. The trade-off is honest: when the data are this sparse, the resulting confidence interval will still be very wide, because the data genuinely cannot pin down the true effect.",
    "key_terms": [
      {
        "term": "separation",
        "definition": "A data condition where a variable perfectly predicts the outcome in one direction, for example every single exposed patient has the event and no unexposed patient does, causing ordinary logistic regression to produce an infinite odds ratio."
      },
      {
        "term": "penalized likelihood",
        "definition": "A modified version of the standard model-fitting procedure that adds a small penalty term to prevent the estimates from drifting to impossible values when the data are very sparse."
      },
      {
        "term": "Firth",
        "definition": "A specific penalty, proposed by statistician David Firth in 1993, that uses information about how uncertain the data are (the Fisher information) to stabilize estimates under separation or very rare events."
      },
      {
        "term": "profile-likelihood confidence interval",
        "definition": "A type of confidence interval calculated by re-fitting the model many times to find where the evidence becomes too weak, rather than using a simple formula based on the standard error; this is the only valid CI to report after Firth estimation."
      },
      {
        "term": "events-per-variable (EPV)",
        "definition": "A rough check of whether a model has enough outcome events to reliably estimate each predictor; fewer than about 10 events per predictor is the warning zone where Firth correction is often needed."
      }
    ],
    "worked_example": {
      "scenario": "A pharmacoepidemiology team is studying a rare liver injury (hepatotoxicity) in patients who start either Drug A or Drug B. They pull one year of follow-up from a claims database and build a simple 2x2 table to check whether ordinary logistic regression will work before running the full adjusted model. Drug A has 6 patients who experienced the liver injury out of 4,210 starters. Drug B has 0 patients who experienced the liver injury out of 1,905 starters. This is complete separation: the Drug B column has a zero, which means every case came from Drug A. Ordinary logistic regression has no finite answer for this table.",
      "dataset": {
        "caption": "2x2 event table from the raw claims cohort: rows are treatment arms, columns are outcome status.",
        "columns": [
          "arm",
          "had_liver_injury",
          "no_liver_injury",
          "total_starters"
        ],
        "rows": [
          [
            "Drug A",
            6,
            4204,
            4210
          ],
          [
            "Drug B",
            0,
            1905,
            1905
          ]
        ]
      },
      "steps": [
        "Step 1 — Check for separation: scan the table for any zero cell in the outcome columns. Drug B has 0 events. This is complete separation; the ordinary maximum-likelihood odds ratio is undefined (software will return OR in the millions or refuse to converge).",
        "Step 2 — Confirm the zero is real: verify Drug B starters were enrolled continuously, had a clean washout with no prior liver-injury codes, and that all person-time is from fee-for-service claims where events would be captured. If the zero is an artifact of missing data, fix the data first.",
        "Step 3 — Apply Firth penalization: instead of maximizing the ordinary log-likelihood, the software maximizes a penalized version that adds a small Jeffreys-prior term. This shifts the score equation just enough to produce a finite estimate.",
        "Step 4 — Read the result: Firth returns an adjusted OR of approximately 5.9. This means Drug A starters had roughly 6 times the odds of liver injury compared to Drug B starters, after adjustment.",
        "Step 5 — Report only the profile-likelihood confidence interval (e.g., 95% PL CI: 0.7 to 260), not a Wald interval. The enormous width honestly reflects that with zero Drug B events the data cannot rule out a small or a very large true effect."
      ],
      "result": "Firth-adjusted OR = 5.9 (95% profile-likelihood CI 0.7 to 260). The ordinary maximum-likelihood OR is undefined (infinite) because Drug B has zero events. Firth stabilizes the estimate to a finite value; the wide confidence interval correctly reflects the limits of 6 total events across 6,115 patients."
    },
    "prerequisites": [
      "logistic-regression-for-binary-outcomes",
      "cumulative-incidence-risk-rwe",
      "exact-methods-sparse-data-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Firth logistic regression (binary outcome)",
        "description": "Penalized-likelihood logistic regression for a binary RWE outcome (e.g., 1-year occurrence of a rare adverse event). Returns a finite adjusted odds ratio under separation; inference via penalized profile-likelihood intervals and penalized LR tests, not Wald.",
        "edge_cases": [
          "Zero events in one exposure arm (complete separation) or a single covariate level perfectly predicting the outcome (quasi-separation).",
          "Default software reports a Wald CI that is invalid here; the profile-likelihood CI must be requested explicitly."
        ],
        "data_source_notes": "claims: confirm a zero cell is real (FFS-only person-time, true washout) before attributing it to biology rather than missingness."
      },
      {
        "name": "Firth Cox proportional-hazards (time-to-event)",
        "description": "Firth penalty applied to the partial likelihood for a rare time-to-event outcome with few events per covariate; returns a finite, bias-reduced log-hazard ratio when the unpenalized Cox model is monotone-likelihood / non-converging.",
        "edge_cases": [
          "Monotone likelihood (an exposure level with no events) makes the ordinary Cox HR diverge; Firth yields a finite penalized HR.",
          "In elderly claims cohorts, competing death can empty a cell — apply Firth to the cause-specific or Fine-Gray model, not a naive Cox model, to avoid misreading competing risk as protection."
        ],
        "data_source_notes": "registry/linked: small adjudicated cohorts routinely have monotone likelihood; report penalized profile-likelihood HR intervals."
      },
      {
        "name": "FLIC / FLAC (intercept- and added-covariate-corrected Firth)",
        "description": "Firth with intercept correction (FLIC) or Firth with added covariate (FLAC) to remove the bias Firth induces in predicted probabilities; use when the deliverable is calibrated absolute risk or a standardized/marginal risk difference rather than a conditional odds ratio.",
        "edge_cases": [
          "Plain Firth shrinks predicted probabilities toward 0.5, overstating absolute risk for rare outcomes; FLIC/FLAC restore calibration while keeping finiteness."
        ],
        "data_source_notes": "Use whenever the RWE estimand is an absolute or standardized risk, common in HTA/value dossiers."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Ordinary (unpenalized) logistic / Cox maximum likelihood",
        "pros_of_this": "Returns finite, unique estimates under separation/monotone likelihood and has lower mean-squared error at low EPV.",
        "cons_of_this": "A shrinkage estimator with slight null-ward bias in large non-separated samples; Wald CIs are invalid, so penalized profile-likelihood intervals and LR tests are mandatory.",
        "when_to_prefer": "Low EPV (<10, caution <20), any zero cell or convergence failure, rare safety outcomes pre-specified or invoked by a written fallback rule."
      },
      {
        "compared_to": "Exact logistic / conditional exact methods",
        "pros_of_this": "Scales to many covariates and continuous predictors typical of claims confounder adjustment; far faster.",
        "cons_of_this": "Approximate rather than exact small-sample coverage.",
        "when_to_prefer": "Realistic high-dimensional confounder sets where exact methods are computationally infeasible."
      },
      {
        "compared_to": "Bayesian weakly-informative prior on the log-OR (Greenland-Mansournia data augmentation)",
        "pros_of_this": "Deterministic, no prior to elicit, available in standard frequentist software.",
        "cons_of_this": "The Jeffreys penalty is an implicit prior the analyst does not choose; a transparently specified weakly-informative prior may be preferable and gives posterior intervals.",
        "when_to_prefer": "When a fully reproducible frequentist estimate is required and an explicitly chosen prior is not needed; use the prior as a sensitivity analysis."
      },
      {
        "compared_to": "Ridge / LASSO penalization",
        "pros_of_this": "No tuning parameter to cross-validate; targets bias and finiteness rather than coefficient size or selection; effectively scale-invariant for inference.",
        "cons_of_this": "Does not perform variable selection or high-dimensional shrinkage.",
        "when_to_prefer": "Goal is a near-unbiased finite effect estimate under sparsity, not prediction or selection."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Rare-event safety analyses are the main use case; before treating a zero cell as real, restrict to FFS (Parts A/B/D) or commercial medical+pharmacy person-time so the empty cell is observed absence rather than MA-capitated missingness, and confirm continuous enrollment + washout. Always report penalized profile-likelihood intervals.",
      "ehr": "Encounter-driven capture can manufacture quasi-separation when patients leave the system; confirm outcome absence with linked claims or a death index. Site/center strata with few events suit conditional Firth models.",
      "registry": "Small adjudicated cohorts routinely have low EPV and monotone likelihood; Firth is often the primary estimator. Beware eligibility defined by the outcome, which creates structural separation.",
      "linked": "Linkage selection and order/fill/service-date reconciliation can drop events differentially and create artificial zero cells; reconcile dates and verify the linkable subset is not differentially missing the outcome before penalizing."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\nfrom firthlogist import FirthLogisticRegression\n\n# analytic : one row per cohort member with `arm`, `event`, and pre-index confounders (no toy data created here)\nconfounders = [\"arm\", \"age\", \"female\", \"baseline_liver\", \"ps_decile\"]\nX = analytic[confounders].to_numpy()\ny = analytic[\"event\"].to_numpy()\n\n# Detect the separation that makes ordinary logistic regression diverge: any exposure arm with zero events.\ncell_counts = analytic.groupby(\"arm\")[\"event\"].agg([\"sum\", \"size\"])\nprint(\"events / n by arm:\\n\", cell_counts)  # a 0 in the 'sum' column => complete separation\n\nfl = FirthLogisticRegression(\n    skip_lrt=False,        # request penalized likelihood-ratio tests, not Wald\n    test_vars=0,           # index of `arm` -> profile-likelihood inference for the exposure effect\n)\nfl.fit(X, y)\n\n# Penalized profile-likelihood CI for the adjusted log-odds-ratio of `arm` (the exposure of interest).\nimport numpy as np\nor_arm = np.exp(fl.coef_[0])\nci_low, ci_high = np.exp(fl.ci_[0])     # ci_ holds penalized profile-likelihood limits\nprint(f\"Adjusted OR (drug vs comparator) = {or_arm:.2f} \"\n      f\"(95% profile-likelihood CI {ci_low:.2f} to {ci_high:.2f}); \"\n      f\"penalized LR p = {fl.pvals_[0]:.4f}\")",
        "description": "Firth logistic regression for a rare binary RWE outcome. Required input: an analytic table with one row per\ncohort member produced by upstream cohort construction, containing:\n  person_id  : member id\n  arm        : 1 = study drug, 0 = active comparator (assigned from the index NDC)\n  event      : 1 if the rare outcome (e.g., hepatotoxicity hospitalization) occurred in follow-up, else 0\n  age, female, baseline_liver, ps_decile, ... : confounders measured only in the pre-index window\nUses the `firthlogist` package, which fits Firth's penalized likelihood and reports penalized\nprofile-likelihood confidence intervals and penalized LR p-values (NOT Wald). Wald CIs are invalid under\nseparation, which is the situation that motivates Firth in the first place.",
        "dependencies": [
          "firthlogist",
          "pandas"
        ],
        "source_citations": [
          "heinze-schemper-2002"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(logistf)\n\n## Diagnose separation: a zero in any exposure-by-outcome cell breaks ordinary glm().\nprint(table(analytic$arm, analytic$event))\n\n## Firth penalized logistic regression; profile-likelihood CIs + penalized LR tests are the default.\nfit <- logistf(event ~ arm + age + female + baseline_liver + factor(ps_decile),\n               data = analytic)\nsummary(fit)                       # coef, profile-likelihood 95% CI, penalized LR p-values\nexp(cbind(OR = coef(fit),          # adjusted odds ratios with penalized profile-likelihood limits\n          `2.5%`  = fit$ci.lower,\n          `97.5%` = fit$ci.upper))\n\n## Time-to-event variant: Firth penalty on the Cox partial likelihood (monotone-likelihood fix).\nlibrary(coxphf)\ncox_fit <- coxphf(Surv(fu_days, event) ~ arm + age + female + baseline_liver,\n                  data = analytic)\nsummary(cox_fit)                   # penalized HR with profile-likelihood CI for time-to-event outcomes",
        "description": "Firth logistic regression with the canonical `logistf` package (Heinze & Schemper). Required input:\na data.frame `analytic` with one row per cohort member containing `arm` (1/0), `event` (1/0), and\npre-index confounders. `logistf` returns penalized profile-likelihood confidence intervals and\npenalized LR tests by default -- the correct inference under separation. For time-to-event RWE,\n`coxphf` applies the same Firth penalty to the Cox partial likelihood (shown second).",
        "dependencies": [
          "logistf",
          "coxphf"
        ],
        "source_citations": [
          "heinze-schemper-2002"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* Surface the separation: a zero events cell in one arm is the trigger for Firth. */\nproc freq data=work.analytic;\n  tables arm*event / norow nocol nopercent;\nrun;\n\n/* Firth penalized LOGISTIC regression for a rare binary outcome.                       */\n/* CLODDS=PL gives profile-likelihood odds-ratio CIs (NOT the invalid Wald CIs);        */\n/* CLPARM=PL gives profile-likelihood parameter CIs. Inference is penalized-likelihood. */\nproc logistic data=work.analytic;\n  class female(ref='0') ps_decile / param=ref;\n  model event(event='1') = arm age female baseline_liver ps_decile\n        / firth clodds=pl clparm=pl;\n  oddsratio arm;                  /* adjusted OR for drug vs comparator, profile-likelihood CI */\nrun;\n\n/* Firth penalized COX model for the time-to-event version (monotone-likelihood fix).   */\nproc phreg data=work.analytic;\n  class female(ref='0') / param=ref;\n  model fu_days*event(0) = arm age female baseline_liver / firth risklimits=pl;\n  hazardratio arm;                /* penalized HR with profile-likelihood CI */\nrun;",
        "description": "Firth penalized regression in SAS/STAT. Required input dataset work.analytic with one row per cohort\nmember: arm (1/0), event (1/0), fu_days (follow-up days), and pre-index confounders. The FIRTH option\nis built into PROC LOGISTIC (binary outcome) and PROC PHREG (time-to-event); CLODDS=PL / the PLCONV +\nprofile-likelihood machinery and the LRT-based tests give valid inference under separation. PROC FREQ\nfirst surfaces the empty cell that makes the ordinary MLE diverge.",
        "dependencies": [],
        "source_citations": [
          "heinze-schemper-2002"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Fit[Fit fully-adjusted logistic / Cox model on the RWE cohort] --> Q{Separation, zero cell,<br/>or EPV < ~10?}\n  Q -- No --> MLE[Use ordinary maximum likelihood<br/>report Wald or LR inference]\n  Q -- Yes --> Real{Is the empty cell real?<br/>FFS-only person-time, true washout,<br/>not competing death or MA missingness}\n  Real -- No --> Fix[Fix the data/design problem first<br/>restrict to FFS, model competing risk, reconcile linkage]\n  Real -- Yes --> Firth[Fit Firth penalized likelihood<br/>logistic / Cox / cause-specific]\n  Firth --> Inf[Report penalized PROFILE-LIKELIHOOD CIs<br/>and penalized LR tests -- never Wald]\n  Inf --> Est{Estimand = conditional OR/HR<br/>or absolute / standardized risk?}\n  Est -- OR / HR --> Plain[Plain Firth is appropriate]\n  Est -- Absolute risk --> FLIC[Use FLIC / FLAC<br/>intercept-corrected Firth for calibration]",
        "caption": "Decision logic for invoking Firth in an RWE analysis. Firth is triggered by separation or low events-per-variable, but only after confirming the sparse cell is genuine rather than an artifact of missing FFS claims, competing death, or linkage dropout; inference must be profile-likelihood, and the variant (plain Firth vs FLIC/FLAC) is chosen by the estimand.",
        "alt_text": "Flowchart starting from fitting an adjusted model, branching on whether separation or low EPV is present, then on whether the empty cell is real, into a Firth fit with profile-likelihood inference and a final choice between plain Firth and FLIC/FLAC by estimand.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  MLE[\"Ordinary MLE log-likelihood<br/>diverges under separation<br/>(estimate to +/- infinity)\"] -->|\"add Jeffreys penalty<br/>+ 0.5 * log det I(beta)\"| Pen[\"Penalized log-likelihood\"]\n  Pen --> Score[\"Modified score equation<br/>cancels O(1/n) bias\"]\n  Score --> Out[\"Finite, unique, bias-reduced<br/>OR / HR estimate\"]\n  Out --> CI[\"Inference by penalized<br/>profile likelihood (not Wald)\"]",
        "caption": "How the Firth penalty works. Adding one-half the log-determinant of the Fisher information (the Jeffreys prior) to the log-likelihood modifies the score equation so that the leading O(1/n) bias cancels and the estimate stays finite even when the ordinary MLE diverges; valid inference comes from the penalized profile likelihood.",
        "alt_text": "Left-to-right diagram showing the diverging MLE log-likelihood plus the Jeffreys penalty yielding a penalized log-likelihood, a bias-cancelling modified score equation, a finite bias-reduced estimate, and profile-likelihood inference.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "used_with",
        "target_slug": "active-comparator-new-user",
        "notes": "Firth is the standard fallback estimator for rare-outcome safety contrasts in an ACNU cohort when a treatment arm has zero or very few events and the adjusted logistic/Cox model would otherwise not converge."
      },
      {
        "relation_type": "see_also",
        "target_slug": "longitudinal-outcomes-modeling-rwe",
        "notes": "Firth applies the same penalized-likelihood idea to the Cox partial likelihood used in longitudinal time-to-event RWE models when events are sparse."
      }
    ],
    "aliases": [
      "Firth penalized likelihood",
      "Firth's bias reduction",
      "penalized maximum likelihood (Firth)",
      "Firth logistic regression",
      "Firth-corrected Cox regression"
    ],
    "complexity": "advanced",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "fisher-exact-test",
    "name": "Fisher's Exact Test",
    "short_definition": "A significance test for 2×2 (and larger) contingency tables that computes the p-value by summing hypergeometric probabilities over all table arrangements at least as extreme as the observed one — conditioning on the row and column totals — making it exact and valid even when expected cell counts are very small or zero; in RWE its principal domain is the sparse safety table (rare adverse events, small exposed cohorts) where the large-sample chi-square approximation breaks down, though the test's conservatism under full conditioning means unconditional exact tests or the mid-p correction are often the superior choice.",
    "long_description": "**What Fisher's Exact Test does and why \"exact\" matters**\n\nFisher's Exact Test answers one question: given that we observe a 2×2 table with fixed\nmarginal totals, how often would we see this arrangement — or one even more lopsided —\nif there were truly no association between the row and column variables?\n\nThe test works by conditioning on the marginal totals (both row totals and both column\ntotals are treated as fixed), which reduces the problem to a single hypergeometric\ndistribution. Under the null hypothesis of no association, the upper-left cell count\nfollows a hypergeometric distribution with parameters determined by the four marginals.\nThe exact p-value is the sum of hypergeometric probabilities for all tables with those\nsame marginals that are at least as extreme (have upper-left cell count as small or\nsmaller) as the observed table.\n\nBecause the p-value is computed directly from the hypergeometric PMF rather than from a\nchi-square approximation, it is \"exact\" in the sense that it does not depend on an\nasymptotic argument. The type-I error rate is guaranteed not to exceed the nominal alpha\nfor any table size — including tables with n = 10 patients or with one or more cells\ncontaining zeros. This guarantee is precisely what Fisher's test offers that the Pearson\nchi-square test cannot.\n\n**The hypergeometric probability in detail**\n\nA 2×2 table has four cells denoted as follows: a (row 1 / col 1), b (row 1 / col 2),\nc (row 2 / col 1), d (row 2 / col 2). The row totals are R1 = a + b and R2 = c + d;\nthe column totals are C1 = a + c and C2 = b + d; the grand total is N = R1 + R2.\n\nUnder the null hypothesis of no association, with all margins fixed, the probability of\nobserving exactly a events in the upper-left cell is the hypergeometric PMF:\n\n  P(X = a) = C(R1, a) × C(R2, C1 - a) / C(N, C1)\n\nwhere C(n, k) = n! / [k! × (n-k)!] is the binomial coefficient. 
The p-value for a\none-sided test (testing whether the upper-left cell count is smaller than expected) is\nthe sum of this probability over all values of a from 0 up to the observed value. The\nstandard two-sided p-value sums over all tables whose hypergeometric probability does not\nexceed that of the observed table; this coincides with doubling the one-sided value when\nthe hypergeometric distribution is symmetric but can differ otherwise.\n\n**Why conditioning on margins is controversial**\n\nRonald Fisher's original justification for conditioning on both margins rested on the\n\"Lady Tasting Tea\" experiment, where both the row and column totals are genuinely fixed\nby design. In most medical and epidemiological studies this is not the case: only the\nrow totals (sample sizes per group) are fixed, not the column totals (event counts). When\nonly one margin is fixed, conditioning on the other margin discards information about the\npopulation proportion of events: it treats a quantity that could distinguish the null\nfrom the alternative as nuisance.\n\nThis has a practical consequence: Fisher's test is conservative. Its actual type-I error\nrate is strictly below the nominal alpha for most discrete tables, because the\nhypergeometric distribution is discrete and the exact tail probability rarely hits alpha\nexactly. 
Conservatism means lower power than you could have if you used an unconditional\nexact test (Barnard's test) or the mid-p correction.\n\n**The conservatism critique and the mid-p alternative**\n\nThe conservatism of Fisher's test is well-established in the statistical literature.\nLydersen, Fagerland, and Laake (2009) conducted a comprehensive simulation comparing\neighteen tests for 2×2 tables and reached the conclusion that Fisher's test is often\nnot the best choice: its actual size is far below the nominal level in many scenarios\n(sometimes less than half of alpha), leading to a systematic loss of power that is\nparticularly damaging in small studies where power is already scarce.\n\nThe mid-p correction addresses this by computing:\n\n  mid-p = (1/2) × P(X = a_observed) + P(X < a_observed)\n\nThis is not a valid frequentist p-value in the strict sense (it does not guarantee\nconservative type-I error control), but it brings the actual size much closer to the\nnominal level while remaining more conservative than the chi-square approximation.\nLydersen et al. (2009) recommend the mid-p correction as a practical improvement over\nthe standard Fisher p-value for most small-sample 2×2 tables. Unconditional exact tests\n(Barnard's test, Boschloo's test) are theoretically superior in terms of power, but\ncomputationally heavier and not universally implemented.\n\n**The zero-cell problem and its consequences for odds ratios**\n\nIn sparse safety tables — a rare adverse event occurring in 0 of 10 exposed patients\nversus 0 of 50 controls, or 1 of 100 versus 0 of 200 — the conditional odds ratio (OR)\nis undefined: zero cells make the cross-product ratio 0/0 or a/0. Fisher's test can\nstill produce a valid p-value in these situations because the hypergeometric calculation\ndoes not require the OR to be defined, but it cannot produce a finite OR or OR-based\nconfidence interval.\n\nThree common remedies for the zero-cell OR problem:\n\n1. 
Continuity correction (add 0.5 to each cell): produces a finite OR but the choice\n   of 0.5 is arbitrary and the resulting CI is not exact.\n2. Exact conditional CI (computed from the hypergeometric distribution): available from\n   fisher.test in R and PROC FREQ EXACT FISHER in SAS; this CI is exact in the\n   conditional sense but inherits Fisher's conservatism.\n3. Firth penalized logistic regression: the modern recommended approach for sparse or\n   zero-cell tables when an adjusted OR is needed. Firth's method adds a Jeffreys-prior\n   penalty to the log-likelihood, which removes the separation problem and produces\n   finite point estimates and profile likelihood CIs without any ad-hoc correction.\n   When the analysis must adjust for covariates (as is nearly always the case in\n   observational RWE), exact logistic regression or Firth logistic regression is the\n   appropriate next step.\n\n**RWE and safety surveillance applications**\n\nFisher's Exact Test appears in RWE most often in three contexts:\n\n*Spontaneous pharmacovigilance and signal tables*: Disproportionality tables in\nregulatory safety databases often involve rare suspected adverse drug reactions; the\nMedDRA preferred term × drug pairing table may have single-digit cell counts even in\nlarge spontaneous reporting databases. Fisher's test is one of several signal-detection\nmetrics, alongside the reporting odds ratio and proportional reporting ratio.\n\n*Small clinical cohorts and registry subgroups*: A pragmatic trial or registry study\nin a rare disease may enroll 20 to 50 patients per arm. Serious adverse events in these\ncohorts produce 2×2 tables with expected counts below 5, violating the chi-square\napproximation. 
Fisher's test or the mid-p correction is appropriate here, with the\ncaveat that power is limited and the absence of a statistically significant finding\ncannot be interpreted as proof of safety.\n\n*Post-market pharmacoepidemiology safety tables*: Early post-approval surveillance of\na newly approved drug may cover only a few thousand exposed patients. If the event of\ninterest is rare (background rate < 1 per 1,000), observed counts may be 0, 1, or 2\neven in large studies. Fisher's test handles these counts; TreeScan and maxSPRT add\nsequential analysis for repeated looks.\n\n**Scale point: when Fisher's test is computationally unnecessary**\n\nIn large administrative claims databases — 100,000 to 10,000,000 person-years of\nfollow-up — expected cell counts for even moderately rare events (incidence rate 0.1 per\n100) will be in the hundreds or thousands per cell. At that scale the chi-square\napproximation is excellent, Fisher's test offers no advantage, and the computational\ncost (combinatorial for large N) is unnecessary. Fisher's test is the right tool when\nexpected counts are small, not when the database is large. 
An analyst working in Medicare\nclaims with 500,000 exposed patients comparing a cardiovascular event rate of 5% does\nnot need Fisher's test.\n\n**Pros, cons, and trade-offs**\n\nPros:\n- Exact: the type-I error rate is guaranteed not to exceed alpha under the null regardless\n  of table size; valid for any n, including cell counts of 5 or 0.\n- Widely implemented: scipy.stats.fisher_exact, R's fisher.test, and SAS PROC FREQ\n  EXACT FISHER are available in the standard Python, R, and SAS statistical toolchains.\n- Handles zero cells: produces a valid p-value even when one cell contains zero events,\n  unlike chi-square, whose large-sample approximation is unreliable in sparse tables.\n- Provides exact conditional CI for the OR: fisher.test in R returns the conditional\n  maximum-likelihood OR estimate and an exact CI alongside the p-value.\n\nCons:\n- Conservative: actual type-I error is below nominal alpha due to discreteness; power\n  is lower than unconditional tests or mid-p correction in the same scenarios.\n- Does not directly estimate an effect size: the test produces a p-value; the conditional\n  OR and CI are a separate output and have no finite, nonzero point estimate when cells\n  contain zeros.\n- Conditions on both margins: theoretically appropriate only when both margins are fixed\n  by design; for observational 2×2 tables this assumption is debatable and leads to the\n  conservatism described above.\n- Computationally expensive for large N: the exact calculation becomes slow for tables\n  with very large marginal totals, though this is rarely a practical problem because\n  Fisher's test is only needed when N is small.\n- Cannot adjust for confounders: like chi-square, Fisher's test produces an unadjusted\n  association measure; when confounders are present, route to exact logistic regression\n  or Firth logistic regression.\n\n**When to use**\n\nUse Fisher's Exact Test when:\n\n- Any expected cell count in a 2×2 (or larger) contingency table is below 5; the\n  standard rule of thumb is that all 
expected counts must be ≥ 5 for the chi-square\n  approximation to be reliable.\n- One or more observed cells contain zero events; Fisher's test computes a valid p-value\n  even here, unlike chi-square.\n- The study involves a rare adverse event in a small cohort (n < 50 per group); adverse\n  event tables in early-phase trials, rare disease registries, or post-market signals.\n- You need a conservative test where type-I error is guaranteed not to exceed alpha,\n  such as regulatory submission tables where false-positive signals carry high cost.\n- As the \"exact\" comparator alongside chi-square in a sensitivity analysis to confirm\n  that the chi-square approximation holds at your observed table size.\n- In descriptive pharmacovigilance tables where a p-value is requested by convention,\n  understanding that statistical significance in a safety signal table is only one input\n  to a multidimensional signal assessment.\n\nConsider the mid-p correction when you use Fisher's test in a setting where power\nmatters and you accept that the actual type-I error rate will sit closer to the nominal\nalpha than under the conservative Fisher p-value, and may slightly exceed it in some\ndiscrete configurations.\n\n**When NOT to use**\n\nDo not use Fisher's Exact Test when:\n\n- All expected cell counts are ≥ 5 and n per group is adequate: chi-square is the\n  correct choice; Fisher's conservatism wastes power unnecessarily.\n- The design is paired: a 2×2 table comparing pre-post classifications, matched pairs,\n  or before-after assessments on the same patients requires McNemar's test (or its\n  exact variant), not Fisher's test. 
Applying Fisher's test to paired data ignores the\n  within-patient correlation and produces invalid inference.\n- An adjusted odds ratio is needed: Fisher's test is an unadjusted marginal test; for\n  observational data with confounders, route to logistic regression, and when cell\n  counts are sparse or zero cells are present, use exact logistic regression or Firth\n  penalized logistic regression.\n- The table is larger than 2×2 with ordered categories: ordinal × ordinal associations\n  are better assessed with the Mantel-Haenszel trend test or ordinal logistic regression;\n  Fisher's test for r×c tables loses information about the ordering.\n- The study is large (n >> 1,000 per group) with common events: at this scale chi-square\n  is fine; Fisher's test provides no accuracy benefit and misleads by association with\n  \"small-sample rigor\" in a context where it is not needed.\n- A relative risk or risk difference is the target estimand: Fisher's test is built\n  around the conditional OR; for cohort studies where RR or RD is more directly\n  interpretable, compute the RR/RD directly with a confidence interval.\n\n**Implementation note across languages**\n\nAll three implementations below show the same 2×2 safety table (3 adverse events in 10\nexposed vs 0 in 10 unexposed), compute the exact p-value, report the conditional OR and\nCI, and apply the mid-p correction by hand. This mirrors the worked example in the\nbeginner layer so the analyst can trace from manual arithmetic to production code.\n\n**Interpreting the output**\n\nIn the worked example, 3 of 10 patients in the drug arm experienced a rash and 0 of 10\nin the placebo arm did. The one-sided exact p-value (the hypergeometric probability of\nobserving 3 or more events in the exposed arm given fixed marginals of 10 exposed, 10\nunexposed, and 3 total events) is 2/19 (approximately 0.105). The two-sided p-value is\n4/19 (approximately 0.211). 
The mid-p one-sided value, which halves the probability of the\nobserved table before accumulating the tail, is approximately 0.053.\n\n*(1) Formal interpretation.* Under the null hypothesis of no association and conditioning on\nboth sets of marginal totals, the exact probability of an arrangement as extreme or more\nextreme than the one observed is approximately 0.11 (one-sided) or 0.21 (two-sided). Both\nexceed the conventional alpha = 0.05 threshold, so the test does not reject the null. The\nmid-p correction brings the one-sided value to approximately 0.053, still marginally above\n0.05. Fisher's conservatism (which arises from conditioning on both margins) makes it the\nmost cautious of the available exact tests; the non-rejection does not rule out a real\nadverse-event signal but reflects the severe limitation of the sample size.\n\n*(2) Practical interpretation.* Three rash events in 10 exposed patients versus zero in 10\nunexposed patients is a directionally concerning imbalance that does not achieve conventional\nstatistical significance in a study of 20 total patients. Fisher's exact test is the\nappropriate choice here because the expected rash counts (1.5 per arm) are well below 5,\nmaking the chi-square approximation unreliable. A non-significant result in a study this\nsmall cannot be interpreted as evidence of no rash risk: the study simply lacks the power to\ndetect or rule out signals of this magnitude. Safety surveillance for rare adverse events\nrequires much larger exposed populations and sequential monitoring methods before a signal\ncan be confirmed or excluded.",
    "primary_category": "Inferential_Statistics",
    "tags": [
      "statistics",
      "primitive",
      "hypothesis-testing",
      "categorical-data",
      "rare-events",
      "exact-test",
      "contingency-table",
      "safety-surveillance",
      "pharmacovigilance"
    ],
    "applies_to_study_types": [
      "cohort_retrospective",
      "cohort_prospective",
      "rct",
      "cross_sectional",
      "claims_analysis",
      "ehr_study",
      "registry_study",
      "pharmacovigilance",
      "descriptive_analysis"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "primary",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1002/sim.3531",
        "url": "https://doi.org/10.1002/sim.3531",
        "citation_text": "Lydersen S, Fagerland MW, Laake P. Recommended tests for association in 2x2 tables. Statistics in Medicine. 2009;28(7):1159-1175.",
        "year": 2009,
        "authors_short": "Lydersen et al.",
        "notes": "Comprehensive simulation comparing eighteen tests for 2x2 tables. Key finding: Fisher's test is often not the best choice due to conservatism (actual type-I error far below nominal alpha), and the mid-p correction is recommended as a practical improvement. This is the definitive reference for understanding when Fisher's test loses power relative to alternatives in RWE safety tables."
      },
      {
        "role": "explain",
        "doi": "10.1002/sim.2832",
        "url": "https://doi.org/10.1002/sim.2832",
        "citation_text": "Campbell I. Chi-squared and Fisher-Irwin tests of two-by-two tables with small sample recommendations. Statistics in Medicine. 2007;26(19):3661-3675.",
        "year": 2007,
        "authors_short": "Campbell",
        "notes": "Derives the small-sample rule for when to prefer Fisher's exact test over Pearson chi-square, and when neither is ideal. Discusses the Fisher-Irwin (two-sided) variant and the Yates continuity correction in context. Directly addresses the threshold at which the chi-square approximation fails and Fisher becomes the practical default."
      },
      {
        "role": "demonstrate",
        "doi": "10.1002/sim.1047",
        "url": "https://doi.org/10.1002/sim.1047",
        "citation_text": "Heinze G, Schemper M. A solution to the problem of separation in logistic regression. Statistics in Medicine. 2002;21(16):2409-2419.",
        "year": 2002,
        "authors_short": "Heinze & Schemper",
        "notes": "Introduces Firth penalized logistic regression as the solution to complete and quasi-complete separation — the situation that arises when one or more cells of a sparse 2x2 table contain zero events and the standard logistic regression OR is undefined. Essential for RWE analysts who need an adjusted estimate when Fisher's unadjusted test is not sufficient."
      },
      {
        "role": "use",
        "doi": "10.1093/biomet/80.1.27",
        "url": "https://doi.org/10.1093/biomet/80.1.27",
        "citation_text": "Firth D. Bias reduction of maximum likelihood estimates. Biometrika. 1993;80(1):27-38.",
        "year": 1993,
        "authors_short": "Firth",
        "notes": "Original derivation of the Jeffreys-prior penalized likelihood estimator (Firth's method). Cited here as the theoretical basis for the Firth logistic regression recommended as the adjusted replacement for Fisher's test in sparse observational tables where confounders must be controlled."
      }
    ],
    "plain_language_summary": "Fisher's Exact Test is a way to ask whether two groups differ in how often an event occurs, specifically designed for situations where the numbers are too small to trust the usual chi-square shortcut. It works by counting every possible way the data could have been arranged if there were truly no difference, and then figuring out how rare the actual result is by exact arithmetic on those counts — no approximations needed. In real-world evidence studies it most often appears in safety tables for rare adverse events, where one or more cells in the table may contain zero or single-digit counts. A key limitation: because it is deliberately cautious, it misses real effects more often than necessary, so researchers sometimes use a small correction called mid-p to get closer to the right answer without losing the exactness guarantee.",
    "key_terms": [
      {
        "term": "hypergeometric distribution",
        "definition": "The probability distribution that describes how many successes you get when drawing a fixed number of items without replacement from a population split into two types — exactly the situation Fisher's test uses to figure out how likely your 2×2 table is under the null hypothesis."
      },
      {
        "term": "conditioning on margins",
        "definition": "The assumption that the row totals and column totals in a 2×2 table are treated as fixed constants; Fisher's test builds its probability calculation on this assumption, which makes the test valid but also more conservative than tests that let the column totals vary."
      },
      {
        "term": "exact p-value",
        "definition": "A p-value computed by adding up the true probabilities of every table as extreme or more extreme than the one observed, rather than using a bell-curve approximation; \"exact\" means the stated alpha level is never exceeded, even in tiny samples."
      },
      {
        "term": "mid-p correction",
        "definition": "A small adjustment to Fisher's p-value that counts only half of the probability of the observed table itself, bringing the test closer to the right rejection rate without making it anti-conservative; recommended by many statisticians when power is a concern."
      },
      {
        "term": "conservatism",
        "definition": "The property of a statistical test whose actual chance of a false positive is below the stated threshold (e.g., truly 2% when you set alpha = 5%); conservative tests are safe but sacrifice power, meaning they miss real effects more often than they should."
      },
      {
        "term": "zero cell",
        "definition": "A cell in a 2×2 table that contains no events (count = 0); Fisher's test can still compute a p-value in this case, but the odds ratio becomes undefined and requires special methods like Firth logistic regression to estimate."
      }
    ],
    "worked_example": {
      "scenario": "A safety monitor is reviewing a phase II trial in a rare disease. Ten patients received the new drug and ten received placebo. Three patients in the drug arm experienced a serious rash; zero patients in the placebo arm did. The expected counts are too small for chi-square (expected events in placebo cell = 1.5), so the monitor computes Fisher's Exact Test by hand to decide whether the imbalance is unlikely under the null hypothesis of no drug effect on rash.",
      "dataset": {
        "caption": "2x2 adverse-event table. Rows = treatment arm; columns = rash (yes/no). Row totals and column totals (margins) are used as the fixed inputs to Fisher's calculation.",
        "columns": [
          "arm",
          "rash_yes",
          "rash_no",
          "row_total"
        ],
        "rows": [
          [
            "Drug",
            3,
            7,
            10
          ],
          [
            "Placebo",
            0,
            10,
            10
          ],
          [
            "col_total",
            3,
            17,
            20
          ]
        ]
      },
      "steps": [
        "Label the cells: a = 3 (drug, rash), b = 7 (drug, no rash), c = 0 (placebo, rash), d = 10 (placebo, no rash). Row totals: R1 = a+b = 3+7 = 10, R2 = c+d = 0+10 = 10. Column totals: C1 = a+c = 3+0 = 3, C2 = b+d = 7+10 = 17. Grand total N = 20.",
        "Under the null hypothesis, a follows a hypergeometric distribution with parameters N=20, C1=3, R1=10. The possible values of a (upper-left cell) given these margins are 0, 1, 2, and 3 (cannot exceed min(R1,C1) = min(10,3) = 3).",
        "Compute binomial coefficients needed. C(10,k) = 10!/(k! x (10-k)!) and C(10,3-k) for k = 0, 1, 2, 3. Also C(20,3) = 20!/(3! x 17!) = (20x19x18)/(3x2x1) = 6840/6 = 1140.",
        "For a=0: P(X=0) = C(10,0) x C(10,3) / C(20,3) = 1 x 120 / 1140 = 120/1140. C(10,0)=1, C(10,3)=10!/(3!x7!)=(10x9x8)/(3x2x1)=720/6=120. So P(X=0)=120/1140.",
        "For a=1: P(X=1) = C(10,1) x C(10,2) / C(20,3) = 10 x 45 / 1140 = 450/1140. C(10,1)=10, C(10,2)=10!/(2!x8!)=(10x9)/2=45. So P(X=1)=450/1140.",
        "For a=2: P(X=2) = C(10,2) x C(10,1) / C(20,3) = 45 x 10 / 1140 = 450/1140.",
        "For a=3: P(X=3) = C(10,3) x C(10,0) / C(20,3) = 120 x 1 / 1140 = 120/1140.",
        "Verify probabilities sum to 1: (120+450+450+120)/1140 = 1140/1140 = 1. Correct.",
        "The one-sided p-value (testing whether drug arm has MORE rash than placebo, i.e., whether a >= 3 is unusual) equals P(X=3) = 120/1140 = 2/19 which equals approximately 0.1053. This is the probability of observing a=3 or anything more extreme (there is nothing more extreme given max(a)=3).",
        "The two-sided p-value sums all tables with P <= P(observed). P(X=3)=120/1140. P(X=0)=120/1140. These are equal, so two-sided p = P(X=0)+P(X=3) = 120/1140+120/1140 = 240/1140 = 4/19 which equals approximately 0.2105.",
        "The mid-p correction for the one-sided test gives mid-p = (1/2) x P(X=3) + P(X>3). P(X>3) = 0 (no values above 3 are possible). Half of P(X=3) is (1/2) x (120/1140): numerator halves to 60, so mid-p = 60/1140 = 1/19 which equals approximately 0.0526. This is borderline significant at alpha=0.05, compared to the conservative Fisher p of 120/1140 = 2/19 which equals approximately 0.1053."
      ],
      "result": "C(20,3) = 1140. P(X=0) = 120/1140, P(X=1) = 450/1140, P(X=2) = 450/1140, P(X=3) = 120/1140. Sum = 1140/1140 = 1. One-sided Fisher p-value = P(X=3) = 120/1140 = 2/19 which equals approximately 0.1053. Two-sided p = 240/1140 = 4/19 which equals approximately 0.2105. Mid-p one-sided = 60/1140 = 1/19 which equals approximately 0.0526. With only 10 patients per arm the test has very low power; neither the exact p-value nor the mid-p reaches conventional significance after two-sided correction, illustrating why rare-event detection in small cohorts requires sequential designs or pooled post-market data rather than a single-study test."
    },
    "prerequisites": [
      "inferential-statistics-foundations",
      "parametric-vs-nonparametric-tests",
      "chi-square-test"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Two-sided Fisher exact (standard clinical use)",
        "description": "The default output of scipy.stats.fisher_exact (alternative=\"two-sided\"), R's fisher.test, and SAS PROC FREQ EXACT FISHER. Sums hypergeometric probabilities for all tables with P <= P(observed). This is the appropriate test when there is no a priori directional hypothesis — the standard for safety tables in regulatory submissions and clinical trial adverse event reporting.",
        "edge_cases": [
          "The two-sided p-value is not simply twice the one-sided p-value for discrete distributions; for asymmetric tables the two definitions can diverge. Always specify alternative='two-sided' explicitly to avoid confusion.",
          "When the observed table lies in the centre of the hypergeometric distribution, the two-sided p-value may appear to increase relative to one-sided; this is correct behaviour of the exact test, not a software error."
        ],
        "data_source_notes": "Safety tables in claims and EHR: compute observed and expected counts per arm; check all expected counts before deciding between chi-square and Fisher."
      },
      {
        "name": "One-sided Fisher exact (directional safety signal)",
        "description": "Appropriate when the direction of the association is specified in advance — for example, when testing whether the drug arm has a HIGHER rate of a specific adverse event than comparator. scipy.stats.fisher_exact(alternative=\"greater\"), R fisher.test(alternative=\"greater\"), SAS PROC FREQ with EXACT FISHER (computes both; choose the right tail). Used in some pharmacovigilance workflows and spontaneous reporting disproportionality tables.",
        "edge_cases": [
          "One-sided tests double the power compared with two-sided but require pre-specified direction; choosing the direction after seeing the data inflates type-I error.",
          "Regulatory guidance (FDA, EMA) typically requires two-sided tests for primary safety endpoints; one-sided may be acceptable for secondary signal detection."
        ],
        "data_source_notes": "Spontaneous reporting databases: one-sided Fisher is used alongside reporting odds ratio and proportional reporting ratio as a disproportionality signal; note that spontaneous databases are not population-based and confounding by indication is severe."
      },
      {
        "name": "Mid-p corrected Fisher test",
        "description": "The mid-p p-value = (1/2) x P(X = observed) + P(X more extreme than observed) under the hypergeometric null. Not available in base R or SAS as a single option but computable from the hypergeometric PMF (dhyper in R). Lydersen et al. (2009) recommend this as the default over the conservative standard Fisher p-value when power is a concern and the strict type-I guarantee is not required.",
        "edge_cases": [
          "Mid-p is not a valid frequentist p-value: actual type-I error can marginally exceed alpha in some discrete configurations. In regulatory submissions prefer the conservative Fisher p-value; use mid-p for exploratory analyses.",
          "The mid-p CI (Agresti-Min or similar) is also not guaranteed to maintain the nominal coverage level; treat it as approximate."
        ],
        "data_source_notes": "Research settings with small n and power constraints; academic publications where the reviewer community accepts the mid-p convention (common in infectious disease epidemiology and some environmental health journals)."
      },
      {
        "name": "Exact conditional odds ratio and confidence interval",
        "description": "R's fisher.test returns the conditional maximum-likelihood OR (the value of OR that maximises the noncentral hypergeometric likelihood given the observed margins) alongside the exact two-sided CI. This is distinct from the simple cross-product OR = (a x d) / (b x c) and is undefined when a cell is zero; in that case the function returns OR = 0 or Inf with a one-sided CI. SAS PROC FREQ with EXACT FISHER also outputs exact ORs. This exact CI inherits the conservatism of Fisher's p-value (actual coverage exceeds nominal), so it is wider than the Woolf log-OR CI used for large-sample tables.",
        "edge_cases": [
          "When any cell is zero, the exact conditional OR sits on the boundary: 0 when a = 0 or d = 0, Inf when b = 0 or c = 0. Report the finite confidence limit as a one-sided bound and note that Firth logistic regression provides a finite point estimate.",
          "Do not confuse the conditional MLE OR from fisher.test with the cross-product OR; they differ, especially at small n. Use the conditional MLE OR when reporting Fisher test results."
        ],
        "data_source_notes": "Small n clinical trials, rare disease registries: always report the exact CI alongside the p-value; a p-value alone does not communicate effect magnitude."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "chi-square-test",
        "pros_of_this": "Fisher's exact test is valid for any sample size including very small tables and tables with zero cells; chi-square requires all expected counts to be at least 5 and returns unreliable results with sparse cells.",
        "cons_of_this": "Fisher's test is conservative (actual type-I error below alpha) and loses power relative to chi-square at adequate sample sizes; at large n with common events, chi-square is more powerful and computationally trivial.",
        "when_to_prefer": "Use Fisher's test when any expected count is below 5 or any observed count is zero; use chi-square when all expected counts are at least 5 and n is adequate."
      },
      {
        "compared_to": "mcnemar-test",
        "pros_of_this": "Fisher's test is appropriate for two independent samples; it handles any 2x2 table of independent observations with small expected counts.",
        "cons_of_this": "McNemar's test (including its exact variant) is required whenever the two rows of the 2x2 table represent matched or paired observations; applying Fisher to paired data ignores the within-pair correlation and produces incorrect inference.",
        "when_to_prefer": "Use Fisher's test for independent groups; use McNemar (or exact McNemar) for matched pairs, pre-post designs, and case-control studies with individual matching."
      },
      {
        "compared_to": "logistic-regression-for-binary-outcomes",
        "pros_of_this": "Fisher's test is simpler, requires no model-building, and provides an exact p-value without distributional assumptions; appropriate when only the marginal association is of interest and no covariate adjustment is required.",
        "cons_of_this": "Logistic regression (especially Firth penalized regression) adjusts for confounders, handles both complete and quasi-complete separation in sparse tables, and produces an adjusted OR with CI; Fisher's test is unadjusted and inappropriate when confounding is present, which is nearly always the case in observational RWE.",
        "when_to_prefer": "Use Fisher's test only for unadjusted descriptive tables, randomized contrasts, or as a sensitivity check alongside a logistic regression primary analysis; use Firth logistic regression as the primary method for adjusted ORs in sparse observational tables."
      },
      {
        "compared_to": "parametric-vs-nonparametric-tests",
        "pros_of_this": "Fisher's exact test is the specific operationalization for binary outcomes in small sparse tables; it is the most well-known and widely implemented exact test for the 2x2 setting.",
        "cons_of_this": "The parent entry covers the broader landscape of when to choose chi-square vs Fisher vs other categorical tests and contextualizes this choice within the full parametric-nonparametric decision tree; Fisher's exact test in isolation does not cover paired, multi-group, or continuous outcome settings.",
        "when_to_prefer": "Fisher's exact test is the specific tool for small independent 2x2 tables; refer to the parent entry for the full decision tree across outcome types and designs."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "In pharmacy and medical claims, binary safety outcomes (hospitalization for a specific diagnosis, emergency visit for a condition) in small post-approval cohorts are the primary use case. Compute observed and expected counts per arm; if any expected count is below 5, apply Fisher's test. For large claims cohorts (> 10,000 exposed), chi-square is usually adequate, provided the outcome is common enough that every expected count is at least 5. When reporting Fisher results from claims, always include the crude OR with exact CI alongside the p-value; a p-value alone is uninformative for regulatory or HTA audiences.",
      "ehr": "EHR-based safety tables for rare conditions, post-procedure complications, or adverse drug events in single-institution cohorts may have small per-arm event counts even with thousands of patients in the denominator. Apply Fisher when expected counts are below 5. Note that EHR data are subject to variable capture intensity; zero-cell events in an EHR table may reflect incomplete documentation rather than true absence.",
      "registry": "Disease registries for rare conditions are a natural domain for Fisher's test. Adjudicated events with small counts per treatment arm benefit from the exact guarantee. Include the exact conditional OR and 95% CI from fisher.test alongside the p-value; when covariate adjustment is needed (almost always), run Firth logistic regression as the primary analysis and present Fisher as the unadjusted sensitivity check.",
      "primary": "Phase I and II trials with n < 50 per arm: Fisher's test is the standard for adverse event tables. Pre-specify the test in the statistical analysis plan and report two-sided p-values. Mid-p is acceptable as a secondary analysis in the supplementary material; do not substitute it for the primary p-value in regulatory submissions.",
      "linked": "Linked data combining claims, EHR, and registry typically yield large cohorts where chi-square is adequate. Fisher's test may still appear in subgroup analyses (rare disease subpopulations, pediatric strata, early post-approval subsets) or when examining a very rare adverse event across the full linked cohort."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "from scipy import stats\n\n# ── Safety table: drug vs placebo, rash yes/no ──────────────────────────────────\n# Rows: [Drug, Placebo]; Columns: [Rash, No_Rash]\ntable = [[3, 7],   # Drug:    3 rash, 7 no rash\n         [0, 10]]  # Placebo: 0 rash, 10 no rash\n\n# ── 1. Standard two-sided Fisher exact test ──────────────────────────────────────\nodds_ratio, p_two_sided = stats.fisher_exact(table, alternative=\"two-sided\")\nprint(f\"Odds ratio (cross-product):   {odds_ratio:.3f}\")\nprint(f\"Two-sided exact p-value:       {p_two_sided:.4f}\")\n\n# One-sided: testing whether drug arm has HIGHER rash rate\n_, p_one_sided = stats.fisher_exact(table, alternative=\"greater\")\nprint(f\"One-sided exact p-value:       {p_one_sided:.4f}\")\n\n# ── 2. Mid-p correction (manual, from hypergeometric PMF) ─────────────────────────\n# Under the null: X ~ Hypergeometric(N=20, K=3 rash events total, n=10 drug arm)\n# scipy.stats.hypergeom(M, n, N): M=population, n=success states, N=draws\nN_total = 20   # grand total\nK_events = 3   # total rash events (column 1 total)\nn_drug = 10    # drug arm size (row 1 total)\na_obs = 3      # observed drug-arm rash count\n\nrv = stats.hypergeom(N_total, K_events, n_drug)\np_at_obs = rv.pmf(a_obs)          # P(X = 3)\np_more_extreme = rv.sf(a_obs)     # P(X > 3) = 0 (max is 3)\nmid_p_one_sided = 0.5 * p_at_obs + p_more_extreme\nprint(f\"\\nMid-p one-sided:               {mid_p_one_sided:.4f}\")\nprint(f\"P(X=3) from hypergeom PMF:     {p_at_obs:.4f}\")\n\n# ── 3. Hypergeometric PMF for all possible a values ──────────────────────────────\nprint(\"\\nHypergeometric probability table:\")\nfor a in range(0, K_events + 1):\n    p = rv.pmf(a)\n    print(f\"  P(X={a}) = C(10,{a}) x C(10,{K_events-a}) / C(20,{K_events}) = {p:.6f}\")\n\n# ── 4. 
Scale check: chi-square on a larger table ──────────────────────────────────\n# In claims with 500 drug / 500 placebo and 5% event rate, chi-square is fine\nlarge_table = [[25, 475], [15, 485]]  # 5% vs 3% events, n=500 per arm\nchi2, chi2_p, dof, expected = stats.chi2_contingency(large_table, correction=False)\n_, fisher_p_large = stats.fisher_exact(large_table)\nprint(f\"\\nLarge-sample (n=500/arm):\")\nprint(f\"  Expected min cell:  {expected.min():.1f}  (>> 5: chi-square valid)\")\nprint(f\"  Chi-square p:       {chi2_p:.4f}\")\nprint(f\"  Fisher exact p:     {fisher_p_large:.4f}  (nearly identical; chi-square sufficient)\")",
        "description": "Fisher's Exact Test using scipy.stats.fisher_exact. Shows the 2x2 safety table from\nthe worked example (3/10 vs 0/10 adverse events), computes the two-sided and one-sided exact p-values,\nreports the sample (cross-product) odds ratio that scipy returns (not the conditional MLE; it is inf here\nbecause one cell is zero), and manually computes the mid-p correction from the hypergeometric PMF using\nscipy.stats.hypergeom. Also compares Fisher with chi-square on a larger table. No external dependencies beyond scipy.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "# ── Safety table: drug vs placebo, rash yes/no ──────────────────────────────────\ntab <- matrix(c(3, 0, 7, 10), nrow = 2,\n              dimnames = list(Arm = c(\"Drug\", \"Placebo\"),\n                              Rash = c(\"Yes\", \"No\")))\ncat(\"Observed table:\\n\"); print(tab)\ncat(\"Expected counts (under null):\\n\")\nprint(chisq.test(tab, correct = FALSE)$expected)\n\n# ── 1. Two-sided Fisher exact (standard clinical report) ──────────────────────────\nft2 <- fisher.test(tab, alternative = \"two.sided\")\ncat(\"\\nTwo-sided Fisher exact test:\\n\")\nprint(ft2)\n# NOTE: ft2$estimate is the conditional MLE OR (differs from cross-product at small n)\n# ft2$conf.int is the exact conditional 95% CI for this OR\n\n# ── 2. One-sided (testing drug > placebo for rash) ───────────────────────────────\nft1 <- fisher.test(tab, alternative = \"greater\")\ncat(\"\\nOne-sided Fisher exact (drug > placebo):\\n\")\ncat(sprintf(\"  p-value: %.4f\\n\", ft1$p.value))\n\n# ── 3. Mid-p correction (manual from dhyper) ─────────────────────────────────────\n# X ~ Hypergeometric(m=3 total rash, n=17 total no-rash, k=10 drug arm)\n# dhyper(x, m, n, k): P(X=x) where m = K_events, n = N-K_events, k = n_drug\nm_events  <- 3   # total rash events = column 1 total\nn_no_event <- 17  # total no-rash = column 2 total\nk_drug    <- 10  # drug arm size\na_obs     <- 3   # observed drug-arm rash count\n\np_at_obs    <- dhyper(a_obs, m_events, n_no_event, k_drug)\np_more_extreme <- phyper(a_obs, m_events, n_no_event, k_drug, lower.tail = FALSE)\nmid_p_one_sided <- 0.5 * p_at_obs + p_more_extreme\ncat(sprintf(\"\\nMid-p one-sided: %.4f (vs conservative Fisher one-sided: %.4f)\\n\",\n            mid_p_one_sided, ft1$p.value))\n\n# ── 4. 
PMF table for all possible a values ────────────────────────────────────────\ncat(\"\\nHypergeometric PMF for all a = 0..3:\\n\")\nfor (a in 0:3) {\n  cat(sprintf(\"  P(X=%d) = %.6f\\n\", a, dhyper(a, m_events, n_no_event, k_drug)))\n}\n\n# ── 5. Scale point: large table where chi-square is adequate ─────────────────────\nlarge_tab <- matrix(c(25, 15, 475, 485), nrow = 2)\nexp_min <- min(chisq.test(large_tab, correct = FALSE)$expected)\nchi_p   <- chisq.test(large_tab, correct = FALSE)$p.value\nfish_p  <- fisher.test(large_tab)$p.value\ncat(sprintf(\"\\nLarge n=500/arm: expected min = %.1f; chi2 p = %.4f; Fisher p = %.4f\\n\",\n            exp_min, chi_p, fish_p))\ncat(\"All expected counts >> 5: chi-square is appropriate; Fisher adds nothing.\\n\")",
        "description": "Fisher's Exact Test in base R. Demonstrates fisher.test with both two-sided and one-sided\nalternatives, extracts the conditional maximum-likelihood OR and exact confidence interval,\ncomputes mid-p manually from dhyper, and shows the scale-point comparison where chi-square\nsuffices for large expected counts. Uses the same safety table as the Python implementation.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* ── Create the safety dataset ─────────────────────────────────────────────── */\ndata work.safety;\n  input arm $ rash $ count;\n  datalines;\nDrug    Yes  3\nDrug    No   7\nPlacebo Yes  0\nPlacebo No  10\n;\nrun;\n\n/* ── 1. PROC FREQ: chi-square, Fisher exact, expected counts, OR + exact CI ── */\nproc freq data=work.safety order=data;\n  weight count;\n  tables arm * rash / chisq fisher expected relrisk;\n  /* CHISQ:    Pearson chi-square and Yates continuity-corrected chi-square       */\n  /* FISHER:   Fisher exact test (one-sided and two-sided p-values)               */\n  /* EXPECTED: Prints expected cell frequencies — check all are >= 5 before       */\n  /*           relying on chi-square; here expected(Placebo,Yes) = 1.5 -> FAIL    */\n  /* RELRISK:  Risk ratio and OR with Wald CI (note: MLE OR undefined if a=0;     */\n  /*           SAS will warn; use EXACT FISHER OR option for exact conditional CI) */\nrun;\n\n/* ── 2. EXACT conditional OR and CI from PROC FREQ ─────────────────────────── */\nproc freq data=work.safety order=data;\n  weight count;\n  tables arm * rash / fisher;\n  exact fisher or;\n  /* OR option with EXACT: outputs the conditional MLE OR and exact 95% CI        */\n  /* This is the OR from the noncentral hypergeometric distribution, same as R's  */\n  /* fisher.test()$estimate — not the cross-product OR                            */\nrun;\n\n/* ── 3. 
Mid-p correction via the hypergeometric PMF ────────────────────────── */\n/* SAS does not output mid-p directly; compute from the hypergeometric PMF       */\ndata work.midp;\n  /* Hypergeometric: X ~ Hyper(N=20, K=3, n=10)                                 */\n  N_total  = 20;   /* grand total                                                */\n  K_events = 3;    /* total events (column 1 sum)                                */\n  n_drug   = 10;   /* drug arm size (row 1 total)                                */\n  a_obs    = 3;    /* observed drug-arm event count                              */\n\n  /* P(X = a_obs) from SAS hypergeometric PMF function                           */\n  p_at_obs      = pdf('Hyper', a_obs, N_total, K_events, n_drug);\n  /* P(X > a_obs) = 1 - CDF(a_obs) = upper tail                                 */\n  p_more_extreme = 1 - cdf('Hyper', a_obs, N_total, K_events, n_drug);\n  mid_p_one_sided = 0.5 * p_at_obs + p_more_extreme;\n\n  put 'P(X=3) = ' p_at_obs 6.4;\n  put 'Mid-p one-sided = ' mid_p_one_sided 6.4;\nrun;\n\n/* ── 4. 
Firth logistic regression: adjusted OR for sparse tables ─────────────  */\n/* When a covariate (e.g., age, baseline severity) must be adjusted for, and zero */\n/* cells preclude standard logistic regression, use FIRTH option in PROC LOGISTIC */\n/* Here we add a simulated age covariate for illustration                        */\ndata work.safety_indiv;\n  set work.safety;\n  /* Expand count to individual records                                           */\n  do i = 1 to count;\n    rash_bin = (rash = 'Yes');\n    arm_bin  = (arm = 'Drug');\n    /* Simulate a confounder: mean age 55 for drug, 60 for placebo              */\n    if arm = 'Drug'    then age = 55 + rannor(42) * 5;\n    else                    age = 60 + rannor(42) * 5;\n    output;\n  end;\n  drop i;\nrun;\n\nproc logistic data=work.safety_indiv;\n  model rash_bin(event='1') = arm_bin age / firth clodds=pl;\n  /* FIRTH:     Penalized likelihood (Jeffreys prior); gives a finite OR even     */\n  /*            when separation is complete or quasi-complete (zero cells)        */\n  /* CLODDS=PL: Profile-likelihood CI for the OR; without it SAS reports Wald CIs */\n  /* Outputs:   point estimate, profile-likelihood CI, p-value                    */\nrun;",
        "description": "PROC FREQ with EXACT FISHER and CHISQ options for the 2x2 safety table. Also shows\nhow to extract expected counts to check the chi-square validity rule, and demonstrates\nPROC LOGISTIC with the FIRTH option for the zero-cell adjusted OR as the recommended\nnext step when covariate adjustment is needed in a sparse table.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  A[2x2 contingency table] --> B{All expected\\ncounts >= 5?}\n  B -- Yes --> C[Chi-square test\\nwith effect estimate\\nRD or RR or OR plus CI]\n  B -- No --> D{Any cell = 0?}\n  D -- No --> E[Fisher exact test\\ntwo-sided\\nplus exact conditional OR plus CI]\n  D -- Yes --> F{Need adjusted\\nestimate?}\n  E --> G{Power a concern?}\n  G -- Yes --> H[Consider mid-p correction\\nas secondary analysis]\n  G -- No --> I[Report Fisher p-value\\nplus OR plus exact CI]\n  F -- No --> J[Fisher exact test\\nreport one-sided CI as bound\\nnote OR undefined]\n  F -- Yes --> K[Firth penalized\\nlogistic regression\\nor exact logistic]\n  K --> L[Finite adjusted OR\\nprofile likelihood CI\\nconfounding controlled]",
        "caption": "Decision tree for 2x2 table analysis. Fisher's exact domain is sparse tables with small expected counts; Firth logistic regression is the path when adjustment for confounders is needed in sparse or zero-cell tables.",
        "alt_text": "Flowchart showing that if all expected counts are at least 5 use chi-square; if any expected count is below 5 go to Fisher exact; if any cell is zero and adjustment is needed use Firth logistic regression; if adjustment is not needed report Fisher with a one-sided bound for the OR.",
        "source_type": "illustrative",
        "source_citations": [
          "lydersen-2009"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "requires",
        "target_slug": "parametric-vs-nonparametric-tests",
        "notes": "The parent entry covers the full decision tree for choosing between chi-square and Fisher exact by expected cell count, and contextualizes Fisher within the broader parametric-nonparametric landscape; this entry is a true primitive child of that parent."
      },
      {
        "relation_type": "requires",
        "target_slug": "inferential-statistics-foundations",
        "notes": "Null hypothesis testing, p-values, alpha, and type-I error must be understood before the hypergeometric tail probability interpretation of Fisher's exact p-value becomes meaningful."
      },
      {
        "relation_type": "see_also",
        "target_slug": "chi-square-test",
        "notes": "The large-sample counterpart: Pearson chi-square uses a distributional approximation valid when all expected counts are at least 5; Fisher exact is the preferred alternative when that threshold is not met. The two tests address the same hypothesis but with different validity conditions."
      },
      {
        "relation_type": "see_also",
        "target_slug": "mcnemar-test",
        "notes": "The paired analog: when the two rows of the 2x2 table represent matched pairs or repeated measures on the same patients, McNemar's test (or exact McNemar) is required; Fisher's test assumes independence between rows and is inappropriate for paired data."
      },
      {
        "relation_type": "see_also",
        "target_slug": "logistic-regression-for-binary-outcomes",
        "notes": "The adjusted path: exact logistic regression and Firth penalized logistic regression handle sparse 2x2 tables while controlling for confounders; when observational RWE requires an adjusted OR in a setting with small cell counts or zero cells, these methods replace Fisher's unadjusted test as the primary analysis."
      }
    ],
    "aliases": [
      "Fisher-Irwin test",
      "exact test for 2x2 tables",
      "Fisher's exact probability test",
      "hypergeometric test",
      "exact contingency table test"
    ],
    "complexity": "foundational",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "fit-for-purpose-data-assessment-rwe",
    "name": "Fit-for-Purpose Data Assessment",
    "short_definition": "A structured, pre-protocol process that judges whether a candidate real-world data source is relevant (captures the population, exposure, outcome, confounders, and follow-up the question requires) and reliable (accurate, complete, traceable, and consistently curated) enough to answer one specific regulatory or HTA question.",
    "long_description": "**Fit-for-purpose (FFP) data assessment** is the gatekeeping step that decides, *before any analysis is programmed*,\nwhether a particular real-world data source can credibly answer a particular question. It is not a generic \"data quality\nscore\" and it is not transferable: a database can be fit for purpose for a comparative drug-utilization study and entirely\nunfit for a comparative mortality study run on the same patients. The assessment is organized around two axes made\ncanonical by the FDA RWD guidance and operationalized by Gatto et al.'s Structured Process to Identify Fit-for-Purpose\nData (SPIFD): **relevance** — does the source contain the population, exposure, outcome, key confounders, and follow-up\nduration the estimand demands? — and **reliability** — are those elements accurate, complete, traceable, and produced by\na stable, documented data-curation (ETL) process? The output is a documented *go / no-go / go-with-mitigations* verdict\nfor one PICOTS-defined question, plus the specific sensitivity analyses that will probe the most judgment-dependent\nthresholds.\n\n**Core conceptual distinction**. FFP assessment sits upstream of, and is distinct from, three things it is often confused\nwith. (1) *vs database feasibility / attrition counting*: a feasibility funnel tells you **how many** patients survive\neach eligibility step; FFP tells you **whether the surviving cohort and its variables mean what the protocol needs them\nto mean**. Feasibility is a necessary input to FFP, not a substitute. (2) *vs algorithm/outcome validation*: validation\nestimates the operating characteristics (PPV, sensitivity) of one variable; FFP integrates those characteristics with\nrelevance and curation evidence into a question-level decision. 
(3) *vs a global data-quality grade*: question-agnostic\ngrading (completeness %, conformance checks) is a reliability input but cannot, by itself, declare fitness, because\nfitness is defined relative to a specific estimand. The decisive output of FFP is therefore not a number but a defensible\n*decision tied to a question*, with the residual risks named and mitigated.\n\n**Pros, cons, and trade-offs** (specific & comparative, naming the alternatives).\n- **vs proceeding straight to analysis on a convenient database:** FFP forces relevance/reliability to be argued in\n  protocol language before code is written, which is exactly what FDA and EMA reviewers expect and what prevents an\n  expensive study from being rejected for an avoidable data limitation (e.g., MA-only person-time with no fee-for-service\n  claims, no death linkage for a mortality endpoint). Cost: it is front-loaded work that delays the first results and\n  requires data-provenance documentation the vendor may not readily supply. **Prefer FFP** for any regulatory-grade or\n  HTA-facing study.\n- **vs a one-time, question-agnostic data-quality scorecard:** a reusable scorecard is cheap and comparable across\n  databases, but it systematically over- or under-states fitness because it ignores the estimand — a source with 99%\n  completeness on labs is still unfit for a question that turns on outpatient mortality. FFP is question-specific and\n  therefore more defensible, at the price of being non-transferable and needing re-execution for each new question.\n  **Prefer the scorecard only** as a pre-screen to shortlist databases, then run FFP on the finalists.\n- **vs multi-database replication as the primary safeguard:** running the analysis in several databases is powerful\n  against database-specific artifacts, but it is reactive and expensive, and concordant-but-wrong results across sources\n  that share a structural flaw (e.g., all lack reliable cause-of-death) give false reassurance. 
FFP is proactive and\n  cheaper. **Use both** when stakes are high: FFP to qualify each source, replication to test robustness.\n\n**When to use**. Any regulatory submission (FDA RWE program, EMA), HTA dossier, or post-authorization safety/effectiveness\nstudy; the data-selection step of a target-trial emulation, where each trial component (eligibility, treatment strategies,\nassignment, outcome, follow-up, estimand) must be checked against what the source can actually capture; any high-consequence\ncomparative-effectiveness, safety, utilization, or cost analysis where a data limitation could change the conclusion.\nExecute FFP once the PICOTS and estimand are fixed but before locking the analytic protocol, so that the verdict can still\nredirect the study to a different source, a narrower question, or a linkage strategy.\n\n**When NOT to use — and when it is actively misleading or dangerous**.\n- **When the question is not yet specified.** Running FFP against a vague aim produces a meaningless verdict; fitness is\n  undefined without an estimand. Fix PICOTS first.\n- **When the assessment is treated as a checkbox.** A FFP memo that recites \"relevant and reliable\" without source-level\n  evidence (provenance, code-list hit rates, missingness by site/time/arm, linkage denominators) is *more* dangerous than\n  none, because it manufactures false confidence and is exactly the artifact a regulator will probe. 
The danger is\n  laundering an unfit source through a process veneer.\n- **When a fatal relevance gap is rationalized into a \"mitigation.\"** If the outcome is out-of-hospital cardiac death and\n  the source has no death-index linkage, no sensitivity analysis rescues it — that is a no-go, not a go-with-mitigations.\n  Mitigations are for *measurable, bounded* uncertainty (a quantitative-bias analysis for a validated-but-imperfect\n  outcome algorithm), not for structurally absent data.\n- **When reliability is assumed because the database is large or familiar.** Size is not accuracy; a marquee claims\n  database can still drop fee-for-service claims for Medicare Advantage enrollees, lag adjudication, reverse claims, or\n  bundle services — all of which silently corrupt the very variables the study depends on.\n\n**Data-source operational depth**.\n- **Administrative claims (FFS vs MA vs commercial):** Relevance strengths are exposure (NDC + `fill_date` +\n  `days_supply`) and healthcare utilization/cost; weaknesses are clinical severity, labs, vitals, and cause of death.\n  Critical reliability failure modes: **Medicare Advantage person-time lacks fee-for-service claims** — encounter data\n  are incomplete and inconsistently submitted, so an MA enrollee can look like a non-user or a non-utilizer purely from\n  missingness; restrict to enrollees with the relevant benefit (A/B/D, or commercial medical+pharmacy) and **exclude\n  MA-only person-time** unless complete encounter data are demonstrated. Other failure modes: adjudication lag and claim\n  reversals (right-censor with a data-maturity buffer), bundled/capitated services that hide individual procedures,\n  plan-switching that breaks continuous enrollment, and sample/mail-order fills that distort `days_supply`. 
Differential\n  competing risks matter: in elderly claims, death competes with the outcome and may be captured only via the Medicare\n  enrollment database, not the claim stream — verify the mortality source before trusting any time-to-event endpoint.\n- **EHR:** Relevance strengths are labs, vitals, problem lists, and clinician notes (severity, indication); the dominant\n  reliability problem is **encounter-driven, network-bounded capture** — a patient who seeks care outside the system is\n  differentially unobserved (\"leakage\"), so absence of a record is ambiguous (no event vs cared-for elsewhere). Structured\n  fields are often sparse or entered inconsistently across sites; note availability varies by visit type. Linkage to\n  claims is the standard fix for completeness, and explicit observation windows plus loss-to-follow-up handling are\n  mandatory.\n- **Registry:** Relevance strength is adjudicated, clinically rich outcomes and disease staging (e.g., cancer registries);\n  weakness is incomplete longitudinal pharmacy exposure and follow-up. Reliability turns on enrollment eligibility,\n  case-ascertainment completeness, adjudication rules, and reporting lag. Almost always requires linkage to claims (for\n  exposure/utilization) and to a death index (for mortality).\n- **Linked claims–EHR–vital-records:** The ideal substrate (EHR severity + claims completeness + reliable mortality) but\n  linkage introduces **selection** (only the linkable subset, which may differ systematically) and **date-reconciliation**\n  problems across order, fill, and service dates that must be resolved before time-zero assignment. Report the linkage\n  denominator and compare linked vs unlinkable patients.\n\n**Worked example (claims-style logic).** Question (PICOTS fixed): among adults ≥18 with type-2 diabetes, does initiating a\nGLP-1 receptor agonist vs a DPP-4 inhibitor change 3-point MACE risk over 2 years? 
Candidate source: a commercial +\nMedicare fee-for-service claims database. (1) **Relevance — population:** confirm ≥2 T2D diagnoses are codeable and the age\nband is present. **Exposure:** both classes are identifiable by NDC with `fill_date` and `days_supply`, so new-user status\n(no prior fill in a 365-day washout) and on-treatment episodes are constructible. **Outcome:** 3-point MACE = nonfatal MI +\nnonfatal stroke + cardiovascular death; the MI/stroke components are claims-codeable, but **CV death requires a death\nsource** — plain claims give an end-of-enrollment date, not a cause, so without National Death Index or Medicare-enrollment\ndeath linkage the outcome is only partially ascertainable: a *relevance gap*, not a reliability nuance. **Confounders:**\nHbA1c and BMI (key confounders and potential effect modifiers) are largely absent in claims — note this as a candidate for linkage or quantitative\nbias analysis. **Follow-up:** 2 years requires continuous A/B/D (or commercial) enrollment; check the median observable\nfollow-up against the 2-year horizon. (2) **Reliability:** obtain ETL/provenance documentation and refresh date; right-censor\nwith a 3-month maturity buffer for adjudication lag; **exclude MA-only person-time** because fee-for-service claims are\nmissing there; profile missingness of `days_supply` and date fields by calendar quarter and by arm; verify the mortality\nsource completeness against expected age-specific rates. (3) **Verdict:** *go-with-mitigations* — fit for nonfatal MACE\ncomponents and exposure; **conditionally fit for fatal MACE only if death-index linkage is secured**; otherwise restrict the\nendpoint to nonfatal MACE or escalate to a linked source. 
(4) **Pre-specified sensitivity analyses targeting the\njudgment-dependent thresholds:** vary the washout (180 vs 365 days), the data-maturity buffer (1 vs 3 vs 6 months), the MACE\nalgorithm definition (and apply a PPV-based quantitative bias analysis), and the MA-exclusion rule, reporting cohort counts\nand the estimate's stability at each step.",
    "primary_category": "Framework_Standard",
    "tags": [
      "fit-for-purpose",
      "data-relevance",
      "data-reliability",
      "spifd",
      "regulatory-readiness",
      "data-quality-assessment",
      "target-trial",
      "picots"
    ],
    "applies_to_study_types": [
      "claims_analysis",
      "ehr_study",
      "registry_linkage",
      "target_trial_emulation",
      "comparative_effectiveness"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1002/cpt.2466",
        "url": "https://doi.org/10.1002/cpt.2466",
        "citation_text": "Gatto NM, Campbell UB, Rubinstein E, Jaksa A, Mattox P, Mo J, Reynolds RF. The Structured Process to Identify Fit-For-Purpose Data: A Data Feasibility Assessment Framework. Clinical Pharmacology & Therapeutics. 2022;111(1):122-134.",
        "year": 2022,
        "authors_short": "Gatto et al.",
        "notes": "SPIFD — the canonical structured, question-specific process for assessing real-world data relevance and reliability before protocol lock."
      },
      {
        "role": "explain",
        "doi": "10.1002/cpt.2883",
        "url": "https://doi.org/10.1002/cpt.2883",
        "citation_text": "Gatto NM, Vititoe SE, Rubinstein E, Reynolds RF, Campbell UB. A Structured Process to Identify Fit-for-Purpose Study Design and Data to Generate Valid and Transparent Real-World Evidence for Regulatory Uses. Clinical Pharmacology & Therapeutics. 2023;113(6):1235-1239.",
        "year": 2023,
        "authors_short": "Gatto et al.",
        "notes": "SPIFD2 — extends the framework to jointly assess study design and data, integrating estimand and target-trial thinking with the fit-for-purpose data verdict."
      },
      {
        "role": "explain",
        "doi": null,
        "url": "https://www.fda.gov/media/152503/download",
        "citation_text": "FDA. Real-World Data: Assessing Electronic Health Records and Medical Claims Data To Support Regulatory Decision-Making for Drug and Biological Products. Guidance for Industry. 2024.",
        "year": 2024,
        "authors_short": "FDA",
        "notes": "Defines the relevance and reliability constructs that fit-for-purpose assessment operationalizes for claims and EHR sources (no DOI; stable FDA landing page)."
      },
      {
        "role": "demonstrate",
        "doi": "10.1210/endrev/bnab007",
        "url": "https://doi.org/10.1210/endrev/bnab007",
        "citation_text": "Schneeweiss S, Patorno E. Conducting Real-world Evidence Studies on the Clinical Outcomes of Diabetes Treatments. Endocrine Reviews. 2021;42(5):658-690.",
        "year": 2021,
        "authors_short": "Schneeweiss & Patorno",
        "notes": "Worked, therapeutic-area application of data relevance/reliability judgments (exposure capture, outcome validity, confounder availability, mortality ascertainment) for regulatory-grade RWE."
      },
      {
        "role": "use",
        "doi": "10.1093/aje/kwv254",
        "url": "https://doi.org/10.1093/aje/kwv254",
        "citation_text": "Hernán MA, Robins JM. Using Big Data to Emulate a Target Trial When a Randomized Trial Is Not Available. American Journal of Epidemiology. 2016;183(8):758-764.",
        "year": 2016,
        "authors_short": "Hernán & Robins",
        "notes": "The target-trial framing that fit-for-purpose assessment serves — each trial component must be checkable against what the source can actually capture before emulation is credible."
      }
    ],
    "plain_language_summary": "Before designing any real-world study, researchers must ask one question: does the database we are considering actually contain what we need to answer our specific question? Fit-for-purpose data assessment is the structured process that answers that question by checking two things: relevance (does the source capture the right patients, the right drug, the right outcome, and enough follow-up time?) and reliability (are those records accurate, complete, and produced by a trustworthy data process?). The output is a single, defensible verdict — go, no-go, or go only if certain gaps are addressed — tied to that one question, not to the database in general. A database can be perfectly fit for counting prescriptions and completely unfit for measuring whether patients die from a heart attack, even when both studies use the exact same patients.",
    "key_terms": [
      {
        "term": "relevance",
        "definition": "Whether a data source actually captures the patients, drug exposure, outcome, important background characteristics, and length of follow-up that a specific study question requires."
      },
      {
        "term": "reliability",
        "definition": "Whether the records in a data source are accurate, sufficiently complete, and produced by a documented, stable process that analysts can audit and regulators can trust."
      },
      {
        "term": "PICOTS",
        "definition": "A structured way to specify a research question: Population, Intervention (or exposure), Comparator, Outcome, Timing, and Setting — fitness cannot be judged until PICOTS is fixed."
      },
      {
        "term": "ETL process",
        "definition": "The Extract-Transform-Load pipeline a data vendor runs to turn raw source records (hospital systems, pharmacy claims) into the research database an analyst receives; undocumented or unstable ETL is a reliability red flag."
      },
      {
        "term": "go-with-mitigations verdict",
        "definition": "A fit-for-purpose decision that says the source can be used but only with specific pre-planned adjustments and sensitivity analyses to address bounded, measurable data gaps."
      },
      {
        "term": "death-index linkage",
        "definition": "Connecting a claims or EHR database to an external death registry (such as the National Death Index) so that the cause and date of a patient's death can be confirmed — without this, fatal outcomes cannot be reliably counted."
      }
    ],
    "worked_example": {
      "scenario": "A research team wants to study whether a GLP-1 receptor agonist (a diabetes drug) reduces the risk of serious heart events compared with a DPP-4 inhibitor (another diabetes drug) over two years. The primary outcome is called 3-point MACE: nonfatal heart attack, nonfatal stroke, or cardiovascular death. Before writing a single line of analysis code, the team runs a fit-for-purpose assessment on a commercial plus Medicare fee-for-service claims database. The table below lists each assessment dimension, what the team checks, and whether it passes or raises a concern.",
      "dataset": {
        "caption": "Fit-for-purpose checklist for a GLP-1 vs DPP-4 MACE study in a commercial plus Medicare FFS claims database",
        "columns": [
          "Dimension",
          "Check",
          "What the analyst looks for",
          "Verdict"
        ],
        "rows": [
          [
            "Relevance",
            "Population",
            "Are adults with type-2 diabetes identifiable using diagnosis codes?",
            "PASS"
          ],
          [
            "Relevance",
            "Exposure capture",
            "Are both drug classes recorded by prescription fill date and days supply so new-user status can be defined?",
            "PASS"
          ],
          [
            "Relevance",
            "Nonfatal outcome",
            "Can heart attack and stroke be identified using validated hospital diagnosis codes?",
            "PASS"
          ],
          [
            "Relevance",
            "Fatal outcome",
            "Is there a linked death source that records cause of death for cardiovascular deaths?",
            "CONCERN — death linkage not confirmed"
          ],
          [
            "Relevance",
            "Key confounders",
            "Are HbA1c (blood sugar control) and BMI recorded in the claims?",
            "CONCERN — largely absent in claims"
          ],
          [
            "Relevance",
            "Follow-up duration",
            "Can most patients be observed continuously for the full two-year horizon?",
            "PASS — median observable follow-up exceeds 2 years"
          ],
          [
            "Reliability",
            "Medicare Advantage person-time",
            "Are Medicare Advantage enrollees excluded or is complete encounter data confirmed? MA-only records lack fee-for-service claims and can make patients look like non-users.",
            "CONCERN — MA-only person-time must be excluded"
          ],
          [
            "Reliability",
            "Adjudication lag",
            "Are the most recent months of data complete, or do recent claims still need time to be processed and paid?",
            "CONCERN — a 3-month maturity buffer is required"
          ],
          [
            "Reliability",
            "Days supply completeness",
            "Is the days-supply field populated and plausible (1 to 180 days) for the study drugs?",
            "PASS"
          ],
          [
            "Reliability",
            "Data provenance",
            "Has the vendor provided documentation of how the database is built and updated?",
            "PASS — documentation available"
          ]
        ]
      },
      "steps": [
        "Work through the relevance checks first: confirm that the source can capture each piece of the PICOTS question before checking data quality.",
        "The nonfatal MACE components (heart attack, stroke) are identifiable by hospital diagnosis codes — these checks pass.",
        "Cardiovascular death is a fatal outcome and requires a cause-of-death source such as the National Death Index; plain claims only record when enrollment ended, not why the patient died — this is a relevance gap, not a quality nuance.",
        "HbA1c and BMI are clinical measurements rarely captured in claims, so the team notes them as a confounder gap requiring either a linked EHR or a sensitivity analysis.",
        "Move to reliability: Medicare Advantage enrollees in the database may appear to have no prescription fills simply because their insurer does not submit fee-for-service claims — including their person-time would silently corrupt the exposure measure, so it must be excluded.",
        "The maturity buffer check confirms that very recent claims are incomplete due to adjudication lag; censor the data three months before the extraction date.",
        "Summarize the verdict: the source is fit for the nonfatal MACE components and for exposure measurement; it is not fit for the full 3-point MACE endpoint unless death-index linkage is secured."
      ],
      "result": "Verdict: GO WITH MITIGATIONS for nonfatal MACE (heart attack + stroke); CONDITIONAL NO-GO for cardiovascular death until death-index linkage is confirmed. Key gap: fatal outcome ascertainment. Required mitigations before analysis: (1) exclude Medicare Advantage-only person-time, (2) apply a 3-month data-maturity censor, (3) secure death-index linkage or restrict the primary endpoint to nonfatal MACE. Pre-specified sensitivity analyses: vary the washout length (180 vs 365 days), vary the maturity buffer (1 vs 3 vs 6 months), test the MACE diagnosis algorithm with and without a PPV-based correction, and report cohort counts under each MA-exclusion rule."
    },
    "prerequisites": [
      "picots-framework-rwe",
      "database-feasibility-attrition-funnel-rwe",
      "continuous-enrollment-observable-time-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "SPIFD relevance-then-reliability assessment (regulatory)",
        "description": "Two-stage, question-specific evaluation per the FDA/Gatto framework — first confirm the source can capture the PICOTS-defined population, exposure, outcome, confounders, and follow-up (relevance); then confirm those elements are accurate, complete, and produced by a documented, stable ETL process (reliability); conclude with a go/no-go/mitigations verdict for that one question.",
        "edge_cases": [
          "A source can pass relevance for one endpoint (nonfatal MI) and fail it for a sibling endpoint (CV death) in the same study; the verdict must be endpoint-specific.",
          "Reliability evidence (provenance, ETL versioning, refresh date) is frequently undocumented by vendors and must be requested explicitly; absence of documentation is itself a finding."
        ],
        "data_source_notes": "claims: profile MA-only person-time, adjudication lag, code-list hit rates, and mortality source. ehr: quantify network leakage and structured-field sparsity. registry/linked: report linkage denominators and selection."
      },
      {
        "name": "Pre-screen scorecard then targeted FFP",
        "description": "A cheap, question-agnostic data-quality scorecard (completeness, conformance, plausibility, code coverage) is applied across many candidate databases to shortlist, after which the full question-specific fit-for-purpose process is run only on the finalists.",
        "edge_cases": [
          "A high scorecard grade can mask an estimand-specific fatal gap (e.g., excellent lab completeness but no death linkage), so the scorecard must never be treated as the fitness decision.",
          "Scorecard thresholds set generically may exclude a source that is actually fit for a low-data-demand question (a simple drug-utilization count)."
        ],
        "data_source_notes": "Use the scorecard for comparability across sources; reserve relevance/reliability judgments for the targeted FFP on the finalists."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Proceeding straight to analysis on a convenient database",
        "pros_of_this": "Surfaces fatal relevance/reliability gaps (MA-only person-time, missing death linkage, unvalidated outcomes) in protocol language before code is written, matching FDA/EMA expectations and avoiding avoidable rejection.",
        "cons_of_this": "Front-loaded effort that delays first results and requires data-provenance documentation vendors may not readily provide.",
        "when_to_prefer": "Any regulatory-grade or HTA-facing study, or any analysis where a data limitation could change the conclusion."
      },
      {
        "compared_to": "A one-time, question-agnostic data-quality scorecard",
        "pros_of_this": "Question-specific and tied to the estimand, so the fitness verdict is defensible to a reviewer rather than a generic completeness statistic.",
        "cons_of_this": "Non-transferable; must be re-executed for each new question and cannot be compared cleanly across databases.",
        "when_to_prefer": "When the deliverable is a go/no-go decision for a specific submission; use the scorecard only as a pre-screen."
      },
      {
        "compared_to": "Multi-database replication as the primary safeguard",
        "pros_of_this": "Proactive and cheaper; prevents a structurally flawed source from entering the study in the first place.",
        "cons_of_this": "Does not test database-specific artifacts the way replication does; concordant-but-wrong results across sources sharing a flaw can still mislead.",
        "when_to_prefer": "As the qualifying step for each source; pair with replication for robustness on high-stakes questions."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Relevance: exposure via NDC + fill_date + days_supply; utilization/cost strong; clinical severity, labs, and cause of death weak. Reliability: exclude MA-only person-time (fee-for-service claims missing), buffer for adjudication lag and reversals, verify the mortality source (enrollment database vs claim stream), and profile code-list hit rates and date missingness by calendar time and arm.",
      "ehr": "Relevance: labs, vitals, problem lists, notes strong for severity/indication. Reliability: encounter-driven, network-bounded capture means leakage makes absence ambiguous; structured fields sparse across sites. Link to claims for completeness; define observation windows and treat loss to follow-up as informative.",
      "registry": "Relevance: adjudicated outcomes and staging strong; longitudinal exposure/follow-up weak. Reliability: check enrollment eligibility, case-ascertainment completeness, adjudication rules, reporting lag; link to claims for exposure and to a death index for mortality.",
      "linked": "Ideal substrate (severity + completeness + mortality) but introduces linkage selection (linkable subset may differ) and order/fill/service date discrepancies that must be reconciled before time-zero assignment. Report the linkage denominator and compare linked vs unlinkable patients."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\nimport numpy as np\n\nREQUIRED_FOLLOWUP_DAYS = 730   # 2-year estimand horizon\nEXPOSURE_NDCS = {...}          # NDC set for the study + comparator drug classes\nOUTCOME_DX = {...}             # validated MI/stroke code set for nonfatal MACE\n\ndef ffp_claims_profile(enroll, rx, dx, death):\n    out = {}\n\n    # --- RELIABILITY: MA-only person-time lacks fee-for-service claims -> must be excludable ---\n    person_days = (enroll[\"enroll_end\"] - enroll[\"enroll_start\"]).dt.days.clip(lower=0)\n    ma_only_days = person_days[enroll[\"plan_type\"].eq(\"MA\")].sum()\n    out[\"pct_person_time_MA_only\"] = 100 * ma_only_days / max(person_days.sum(), 1)\n    out[\"pct_enrollees_with_AB_D_benefit\"] = 100 * enroll.groupby(\"person_id\")[\"ab_d\"].max().mean()\n\n    # --- RELEVANCE: follow-up duration vs the estimand horizon ---\n    span = (enroll.groupby(\"person_id\")\n                  .apply(lambda g: (g[\"enroll_end\"].max() - g[\"enroll_start\"].min()).days))\n    out[\"median_observable_followup_days\"] = float(span.median())\n    out[\"pct_with_full_horizon\"] = 100 * (span >= REQUIRED_FOLLOWUP_DAYS).mean()\n\n    # --- RELEVANCE: exposure capture (NDC + days_supply usable?) 
---\n    exp = rx[rx[\"ndc\"].isin(EXPOSURE_NDCS)]\n    out[\"pct_exposed_with_valid_days_supply\"] = 100 * exp[\"days_supply\"].between(1, 180).mean()\n    out[\"n_exposed_persons\"] = exp[\"person_id\"].nunique()\n\n    # --- RELEVANCE: nonfatal-outcome code-list hit rate ---\n    out[\"n_persons_with_outcome_code\"] = dx.loc[dx[\"dx_code\"].isin(OUTCOME_DX), \"person_id\"].nunique()\n\n    # --- RELEVANCE/RELIABILITY: fatal endpoint depends on a real death source w/ cause ---\n    out[\"pct_deaths_with_known_source\"] = 100 * death[\"death_source\"].notna().mean()\n    out[\"pct_deaths_with_cause\"] = 100 * death[\"cause_of_death\"].notna().mean()  # ~0 => CV death NOT ascertainable\n\n    # --- RELIABILITY: field missingness and fill volume (data-maturity / lag signal) ---\n    # A missing fill_date has no quarter and is dropped by groupby, so report it overall;\n    # then profile days_supply missingness and fill counts by fill-date quarter.\n    out[\"pct_fill_date_missing\"] = 100 * rx[\"fill_date\"].isna().mean()\n    rx_q = rx.assign(q=rx[\"fill_date\"].dt.to_period(\"Q\"))\n    out[\"days_supply_missing_by_quarter\"] = (rx_q[\"days_supply\"].isna()\n                                             .groupby(rx_q[\"q\"]).mean().mul(100).round(2).to_dict())\n    out[\"n_fills_by_quarter\"] = rx_q.groupby(\"q\").size().to_dict()  # thinning recent quarters = adjudication lag\n    return pd.Series(out)\n\n# profile = ffp_claims_profile(enroll, rx, dx, death)\n# Verdict logic: fatal MACE is fit ONLY if pct_deaths_with_cause is non-trivial; otherwise restrict to nonfatal MACE\n# or escalate to a death-index-linked source. MA-only person-time must be excluded before any time-to-event analysis.",
        "description": "Fit-for-purpose profiling for a candidate claims source against a fixed question. This does what the SPIFD process\ndoes operationally: quantify the relevance/reliability evidence a reviewer will demand. It is profiling/feasibility\ncode, not an estimation step. Required inputs (post data-management):\n  enroll : person_id, enroll_start, enroll_end, plan_type in {'FFS','MA','COMMERCIAL'}, ab_d (bool: has A/B/D or\n           commercial medical+pharmacy benefit)\n  rx     : person_id, fill_date (datetime), ndc, days_supply\n  dx     : person_id, dx_date (datetime), dx_code, code_set ('ICD10CM' etc.)\n  death  : person_id, death_date (datetime), death_source ('NDI','ENROLLMENT_DB', None), cause_of_death (nullable)\nReturns a one-row-per-check profile that feeds the relevance/reliability verdict.",
        "dependencies": [
          "pandas",
          "numpy"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\nREQUIRED_FOLLOWUP_DAYS <- 730L\nEXPOSURE_NDCS <- c(...)   # study + comparator NDC set\nOUTCOME_DX    <- c(...)   # validated nonfatal MACE code set\n\nffp_claims_profile <- function(enroll, rx, dx, death) {\n  setDT(enroll); setDT(rx); setDT(dx); setDT(death)\n  out <- list()\n\n  # RELIABILITY: MA-only person-time lacks fee-for-service claims\n  enroll[, pdays := pmax(as.integer(enroll_end - enroll_start), 0L)]\n  out$pct_person_time_MA_only <- 100 * enroll[plan_type == \"MA\", sum(pdays)] / max(enroll[, sum(pdays)], 1L)\n  out$pct_enrollees_with_AB_D <- 100 * mean(enroll[, .(any_abd = any(ab_d)), by = person_id]$any_abd)\n\n  # RELEVANCE: observable follow-up vs the 2-year horizon\n  span <- enroll[, .(d = as.integer(max(enroll_end) - min(enroll_start))), by = person_id]$d\n  out$median_observable_followup_days <- median(span)\n  out$pct_with_full_horizon <- 100 * mean(span >= REQUIRED_FOLLOWUP_DAYS)\n\n  # RELEVANCE: exposure capture\n  exp <- rx[ndc %in% EXPOSURE_NDCS]\n  out$pct_exposed_valid_days_supply <- 100 * mean(exp$days_supply >= 1 & exp$days_supply <= 180)\n  out$n_exposed_persons <- uniqueN(exp$person_id)\n\n  # RELEVANCE: nonfatal-outcome hit rate\n  out$n_persons_with_outcome_code <- uniqueN(dx[dx_code %in% OUTCOME_DX, person_id])\n\n  # RELEVANCE/RELIABILITY: fatal endpoint depends on a death source carrying cause\n  out$pct_deaths_with_known_source <- 100 * mean(!is.na(death$death_source))\n  out$pct_deaths_with_cause <- 100 * mean(!is.na(death$cause_of_death))  # ~0 => CV death NOT ascertainable\n  out\n}\n# Verdict: fatal MACE fit only if pct_deaths_with_cause is non-trivial; else restrict to nonfatal MACE or escalate\n# to a death-index-linked source. Exclude MA-only person-time before any time-to-event analysis.",
        "description": "Fit-for-purpose profiling for a candidate claims source (data.table). Inputs mirror the Python version:\n  enroll : person_id, enroll_start, enroll_end (Date), plan_type ('FFS'/'MA'/'COMMERCIAL'), ab_d (logical)\n  rx     : person_id, fill_date (Date), ndc, days_supply\n  dx     : person_id, dx_date (Date), dx_code, code_set\n  death  : person_id, death_date (Date), death_source ('NDI'/'ENROLLMENT_DB'/NA), cause_of_death (nullable)\nReturns a named list of relevance/reliability metrics feeding the fit-for-purpose verdict.",
        "dependencies": [
          "data.table"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "%let horizon = 730;  /* 2-year estimand follow-up requirement */\n\n/* RELIABILITY: share of person-time that is MA-only (lacks fee-for-service claims). */\nproc sql;\n  create table reliab_ma as\n  select 100 * sum(case when plan_type='MA' then max(intck('day',enroll_start,enroll_end),0) else 0 end)\n             / max(sum(max(intck('day',enroll_start,enroll_end),0)),1) as pct_person_time_MA_only\n  from work.enroll;\nquit;\n\n/* RELEVANCE: observable follow-up per person vs the horizon. */\nproc sql;\n  create table fu as\n  select person_id, intck('day', min(enroll_start), max(enroll_end)) as obs_days\n  from work.enroll group by person_id;\nquit;\nproc sql;\n  create table relev_fu as\n  select median(obs_days) as median_obs_followup_days,\n         100*mean(obs_days >= &horizon) as pct_with_full_horizon\n  from fu;\nquit;\n\n/* RELEVANCE: exposure capture and nonfatal-outcome code hit counts. */\nproc sql;\n  create table relev_exp as\n  select count(distinct r.person_id) as n_exposed_persons,\n         100*mean(r.days_supply between 1 and 180) as pct_valid_days_supply\n  from work.rx r inner join work.exp_ndc e on r.ndc = e.ndc;\n  create table relev_out as\n  select count(distinct d.person_id) as n_persons_with_outcome_code\n  from work.dx d inner join work.out_dx o on d.dx_code = o.dx_code;\nquit;\n\n/* RELEVANCE/RELIABILITY: can a FATAL endpoint be ascertained? (cause-of-death presence) */\nproc sql;\n  create table relev_death as\n  select 100*mean(death_source is not missing) as pct_deaths_with_source,\n         100*mean(cause_of_death is not missing) as pct_deaths_with_cause /* ~0 => CV death NOT ascertainable */\n  from work.death;\nquit;\n\n/* RELIABILITY: date-field missingness / data-maturity signal by calendar quarter. */\nproc freq data=work.rx;\n  format fill_date yyq6.;\n  tables fill_date / missing nocum;  /* watch for thinning recent quarters = adjudication lag */\nrun;",
        "description": "Fit-for-purpose feasibility/profiling for a candidate claims source in SAS (PROC SQL counts + PROC FREQ missingness).\nRequired input datasets (post data-management):\n  work.enroll : person_id, enroll_start, enroll_end (SAS dates), plan_type ('FFS'/'MA'/'COMMERCIAL'), ab_d (0/1)\n  work.rx     : person_id, fill_date, ndc, days_supply\n  work.dx     : person_id, dx_date, dx_code\n  work.death  : person_id, death_date, death_source, cause_of_death (missing if absent)\n  work.exp_ndc: ndc          (study + comparator NDC list)\n  work.out_dx : dx_code      (validated nonfatal MACE code list)\nProduces the relevance/reliability metrics a reviewer expects in the fit-for-purpose memo.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Q[PICOTS + estimand fixed<br/>one specific question] --> REL{Relevance:<br/>can the source capture<br/>population, exposure, outcome,<br/>confounders, follow-up?}\n  REL -->|fatal gap, e.g. no death linkage<br/>for a mortality endpoint| NOGO[NO-GO<br/>wrong source for this question]\n  REL -->|yes / partial| RELI{Reliability:<br/>accurate, complete, traceable?<br/>documented stable ETL?}\n  RELI -->|exclude MA-only person-time,<br/>verify mortality source,<br/>profile missingness + lag| VERDICT{Fit-for-purpose verdict}\n  VERDICT -->|all elements adequate| GO[GO]\n  VERDICT -->|bounded, measurable uncertainty| MIT[GO WITH MITIGATIONS<br/>+ pre-specified sensitivity analyses]\n  VERDICT -->|structural gap rationalized| NOGO\n  MIT --> SENS[Sensitivity: washout length,<br/>data-maturity buffer, algorithm PPV/QBA,<br/>MA-exclusion rule]",
        "caption": "SPIFD-style fit-for-purpose decision flow. Relevance is judged first against the fixed estimand; a structural relevance gap is a no-go that no mitigation rescues. Reliability is judged next, and only bounded, measurable uncertainty qualifies for a go-with-mitigations verdict tied to pre-specified sensitivity analyses.",
        "alt_text": "Decision flowchart from a fixed PICOTS question through a relevance gate and a reliability gate to a go / go-with-mitigations / no-go fit-for-purpose verdict with a sensitivity-analysis plan.",
        "source_type": "illustrative",
        "source_citations": [
          "gatto-2022"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  subgraph Relevance[What the question needs]\n    P[Population] --- E[Exposure] --- O[Outcome] --- C[Confounders] --- F[Follow-up]\n  end\n  Relevance --> Claims[Claims:<br/>+ exposure, utilization, cost<br/>- severity, labs, cause of death]\n  Relevance --> EHR[EHR:<br/>+ labs, vitals, notes, severity<br/>- network leakage, sparse fields]\n  Relevance --> Registry[Registry:<br/>+ adjudicated outcomes, staging<br/>- longitudinal exposure/follow-up]\n  Claims --> Linked[Linked claims-EHR-death index:<br/>severity + completeness + mortality<br/>- linkage selection + date reconciliation]\n  EHR --> Linked\n  Registry --> Linked",
        "caption": "Relevance-by-source matrix. No single source is universally fit; each trades a relevance strength against a structural weakness, and linkage is the standard route to close the gaps at the cost of selection and date-reconciliation problems.",
        "alt_text": "Diagram mapping the question's data needs onto the relevance strengths and weaknesses of claims, EHR, registry, and linked data sources.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "part_of",
        "target_slug": "regulatory-readiness-rwe",
        "notes": "Fit-for-purpose data assessment is a core component of overall regulatory readiness, not a variant of it."
      },
      {
        "relation_type": "used_with",
        "target_slug": "target-trial-emulation",
        "notes": "Each target-trial component (eligibility, strategies, assignment, outcome, follow-up, estimand) must be checked against what the source can actually capture before emulation is credible."
      },
      {
        "relation_type": "used_with",
        "target_slug": "database-feasibility-attrition-funnel-rwe",
        "notes": "The feasibility funnel supplies the patient counts at each eligibility step that fit-for-purpose assessment interprets against the estimand; feasibility is an input, not a substitute."
      },
      {
        "relation_type": "used_with",
        "target_slug": "picots-framework-rwe",
        "notes": "Fitness is undefined without a fixed question; PICOTS specifies the population, exposure, outcome, timing, and setting that the relevance check is run against."
      },
      {
        "relation_type": "used_with",
        "target_slug": "continuous-enrollment-observable-time-rwe",
        "notes": "Observable-time and continuous-enrollment requirements determine whether follow-up duration and exposure capture meet the estimand's relevance demands."
      },
      {
        "relation_type": "used_with",
        "target_slug": "medicare-ffs-ma-commercial-claims-differences-rwe",
        "notes": "The MA-vs-FFS distinction is the dominant claims reliability failure mode the assessment must screen for (MA-only person-time lacks fee-for-service claims)."
      },
      {
        "relation_type": "see_also",
        "target_slug": "algorithm-validation",
        "notes": "Outcome/exposure algorithm operating characteristics (PPV, sensitivity) are a key reliability input the assessment integrates into a question-level verdict."
      },
      {
        "relation_type": "see_also",
        "target_slug": "estimand-analysis-traceability-rwe",
        "notes": "The fit-for-purpose verdict is defined relative to the estimand; traceability links the data decision back to the target quantity."
      },
      {
        "relation_type": "see_also",
        "target_slug": "generalizability-transportability-external-validity-rwe",
        "notes": "Relevance of the captured population to the target decision population is the bridge between fitness and external validity."
      }
    ],
    "aliases": [
      "Fit-for-purpose data assessment for real-world evidence",
      "fit-for-use data assessment",
      "data relevance and reliability assessment",
      "SPIFD",
      "fit-for-purpose RWD assessment"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "g-computation-parametric-g-formula",
    "name": "G-Computation and the Parametric G-Formula",
    "short_definition": "A causal estimation method that fits an outcome regression model on observed data, generates counterfactual predictions for every patient under treatment and under no treatment, and averages those predictions over the study population to obtain a marginal (population-averaged) risk difference, risk ratio, or rate ratio; the time-varying generalization — the parametric g-formula — simulates the joint evolution of confounders and treatment forward under hypothetical strategies, handling treatment-confounder feedback where IPTW-estimated marginal structural models are the dual approach.",
    "long_description": "**The standardization idea — fit, predict, average**\n\nG-computation (also called model-based standardization or outcome regression standardization)\nestimates the causal effect of a treatment by exploiting the counterfactual structure embedded\nin any outcome regression model. The procedure has three steps: (1) fit an outcome model —\nlogistic for binary endpoints, Poisson or negative-binomial for counts, Cox for time-to-event,\nlinear for continuous scores — on the observed cohort, including treatment and all measured\nconfounders as predictors; (2) duplicate the dataset twice, setting treatment = 1 in the first\ncopy and treatment = 0 in the second, and generate model predictions for every patient under\neach assignment regardless of what they actually received; (3) average the predicted outcomes\nunder treatment across all patients, average the predicted outcomes under no treatment, and\ncontrast the two averages as a risk difference, risk ratio, or rate ratio. The result is the\npopulation-averaged causal effect estimate — the difference in average outcomes you would\nexpect if, contrary to fact, the entire cohort had been treated versus if none had been.\n\nThe intuition maps directly onto direct standardization or post-stratification: segment the\ncohort into covariate cells, estimate the expected outcome in each cell under each treatment\nfrom the fitted model, and re-weight by the cell sizes in the target population. G-computation\nperforms this standardization continuously over the full covariate distribution of the outcome\nmodel rather than over a small number of discrete strata. 
The \"G\" in g-computation refers to\nRobins's general product formula for the joint density of a time-varying treatment history,\nthe same mathematical foundation underlying the entire g-methods family: IPTW-estimated\nmarginal structural models (MSMs), g-estimation of structural nested models, and the parametric\ng-formula.\n\n**Why g-computation returns marginal effects on any scale, and how it resolves noncollapsibility**\n\nLogistic regression reports a conditional odds ratio (OR): the effect within strata of the\nadjustment covariates, on the odds scale. Noncollapsibility means that even in the complete\nabsence of confounding, the marginal OR is not a weighted average of the stratum-specific\n(conditional) ORs; the conditional OR changes as more outcome-predictive covariates enter the\nmodel, not because the truth changed but because the reference point of \"holding covariates\nfixed\" shifted. The fully-adjusted OR from a logistic\nmodel cannot be directly converted to a population-level risk difference or risk ratio without\nadditional steps, and comparing ORs across studies or populations requires careful accounting\nfor the baseline risk. The same noncollapsibility problem affects Cox hazard ratios: a\nfully-adjusted HR changes as the covariate set expands, even without additional confounding\ncontrol, and is therefore not directly comparable to the marginal effect a payer needs.\n\nG-computation solves both problems by standardizing over the observed covariate distribution:\nit produces a marginal risk difference or risk ratio from the same logistic or Cox model that\nwould otherwise report only a conditional OR or HR. The marginal risk difference is\nscale-commensurable with baseline risk, directly interpretable for number-needed-to-treat (NNT)\ncalculations, and appropriate input for cost-effectiveness models that require absolute event\nprobabilities. 
The analyst need never report the OR as the primary causal effect estimate —\ng-computation extracts the population-relevant quantity from the same fitted model.\n\n**Inference via the bootstrap**\n\nG-computation is a plug-in estimator: point estimates are computed by substituting fitted\nmodel parameters into the standardization formula. Analytic standard errors via the delta\nmethod exist for simple parametric models but are rarely used in RWE applications. The\nnonparametric bootstrap is the standard approach because it (a) propagates model-fitting\nuncertainty into the final interval without assumptions about the sampling distribution of the\nmarginal contrast, (b) works identically for any outcome scale — RD, RR, NNT — without\nre-deriving variance formulas, and (c) accommodates multi-model or survival-model g-computation\nsettings where analytic variances are intractable. The procedure resamples the cohort with\nreplacement, re-fits the outcome model, and re-applies the standardization in each resample;\nthe 2.5th and 97.5th percentiles of the bootstrap distribution form the 95% confidence\ninterval. For large datasets (n > 100,000), 500 to 1,000 resamples are typically sufficient;\nfor small samples, bias-corrected and accelerated (BCa) intervals should be preferred over the\nnaive percentile method to improve coverage.\n\n**The time-varying generalization: the parametric g-formula**\n\nThe point-treatment version above assumes treatment is decided once at baseline. When treatment\nis time-varying — a patient starts, stops, escalates, or switches over follow-up — and\npost-baseline health measures are simultaneously confounders of future treatment and mediators\nof past treatment (treatment-confounder feedback), neither standard regression nor\npoint-treatment g-computation can give an unbiased causal estimate. 
Conditioning on a\npost-baseline variable that is on the causal path from past treatment to the outcome blocks\npart of the true effect; omitting it leaves future treatment confounded. No regression\nspecification escapes both horns.\n\nThe **parametric g-formula** (Robins 1986) resolves this by modeling the joint distribution\nof the time-varying confounders, treatment, and outcome at each interval, and then Monte Carlo\nsimulating the counterfactual history forward under each hypothetical treatment strategy. At\neach time step, simulated covariate values are drawn from the fitted covariate models\nconditional on the simulated — not the observed — history, generating potential-outcome\ntrajectories under the target strategy without the confounding present in the natural course.\nThe result is the full distribution of the counterfactual outcome under the hypothetical\nstrategy, including absolute cumulative incidence curves and risk differences at any follow-up\nhorizon. The parametric g-formula is uniquely suited to dynamic treatment regimes (e.g., \"treat\nwhenever LDL exceeds 130 mg/dL\"), a capability that IPTW-estimated MSMs handle less naturally.\n\n**Relationship to IPW-MSMs and doubly-robust methods**\n\nG-computation and IPTW-estimated MSMs are dual approaches to the same causal estimand:\ng-computation models the outcome process (the conditional distribution of Y given treatment\nand confounders), while IPTW models the treatment process (the probability of observed\ntreatment given confounders). Each is consistent if its respective model is correctly\nspecified; neither has double robustness. 
The comparison with TMLE is sharper: TMLE is\nformally g-computation plus a targeting (fluctuation) step driven by the propensity score so\nthe final estimate solves the efficient influence-curve equation and achieves double robustness.\nIn practice, g-computation with a well-specified outcome model and bootstrap inference is a\ntransparent, defensible primary analysis; TMLE or AIPW is the natural doubly-robust\nsensitivity check, especially when high-dimensional confounding motivates Super Learner\nnuisance fits.\n\n**G-null paradox**\n\nUnder a sharp null (treatment truly has no effect) with treatment-confounder feedback, a\nparametric g-formula can be mathematically guaranteed to be misspecified — the g-null paradox.\nWhen the null is plausible and feedback is present, g-estimation of structural nested models is\nrobust to this paradox; IPTW-MSMs and TMLE are unaffected because they model the treatment\nprocess rather than the covariate evolution conditionally on treatment.\n\n**Identifying assumptions**\n\nG-computation requires three assumptions that data cannot confirm: (1) **exchangeability** —\nno unmeasured confounding; all variables that jointly predict treatment assignment and the\noutcome are measured and included in the outcome model; (2) **positivity** — every patient in\nthe target population has a positive probability of receiving each treatment level conditional\non measured covariates; and (3) **consistency** — the treatment in the data corresponds to a\nwell-defined, replicable intervention with no interference between patients. Modeling does not\nbuy identification: a g-computation estimate from a confounded observational dataset is a\nbiased quantity, not a causal effect. 
The identifying assumptions are identical to those for\nIPTW, TMLE, and any other standard observational causal method — the approach to estimation\nchanges, but the assumptions needed to claim causation do not.\n\n**Pros, cons, and trade-offs**\n\n*Pros:* G-computation is the most flexible route to marginal effects on any scale from any\noutcome model — risk difference, risk ratio, NNT, expected count difference — without\nadditional software or re-derived variance formulas. It naturally handles treatment-covariate\ninteractions because predictions are generated under each patient's own covariate profile, not\nat artificially fixed covariate means. The parametric g-formula handles dynamic strategies and\ntreatment-confounder feedback, producing absolute cumulative incidence curves that IPTW-MSMs\ncannot easily generate. Bootstrap CIs are straightforward and generalizable to any contrast.\nThe output — a marginal risk difference — is the number a payer, HTA body, or clinical\nguideline committee needs for formulary and value decisions.\n\n*Cons:* G-computation is consistent only if the outcome model is correctly specified — there\nis no robustness fallback if the model is wrong, unlike doubly-robust estimators such as TMLE.\nThe parametric g-formula requires fitting separate models for each time-varying confounder,\nmultiplying model-specification risk with each additional variable. Bootstrap inference is\ncomputationally expensive for large cohorts or complex g-formula runs. G-computation is less\nfamiliar than propensity-score matching in some regulatory and payer review contexts.\n\n*Trade-offs vs IPTW:* G-computation avoids positivity diagnostics and extreme-weight\ntruncation because it extrapolates from the outcome model rather than reweighting. However,\nthis also means it extrapolates into covariate regions where no patients were treated — a risk\nif the model is misspecified in those regions. 
Use IPTW as a cross-check when the treatment\nprocess is well-modeled but the outcome model is uncertain; use g-computation when the outcome\nmodel is credible and extreme weights would be a concern.\n\n**When to use**\n\nUse g-computation when the primary deliverable is a marginal risk difference, risk ratio, or\nrate ratio from an observational cohort: comparative effectiveness research where absolute\nbenefit estimates are required for formulary or guideline decisions; target-trial emulation\nwhere counterfactual risks under sustained strategies must be reported on the absolute scale;\nany analysis where logistic or Poisson regression is the underlying model but the conditional\nOR or IRR is not the quantity the audience needs. Use the parametric g-formula for time-varying\ntreatment with confounder feedback, dynamic treatment rules, or when absolute cumulative\nincidence curves under competing strategies are required.\n\n**When NOT to use**\n\nDo not use point-treatment g-computation when treatment is time-varying with treatment-confounder\nfeedback — that setting requires the parametric g-formula or an IPTW-MSM. Do not interpret a\ng-computation confidence interval from a misspecified or confounded outcome model as a valid\ncausal uncertainty range — model diagnostics and doubly-robust sensitivity analyses are\nmandatory before claiming causal inference. Do not omit the outcome model specification,\ncovariate set, and target population from reporting — these choices define the estimand and\nmust be pre-specified. 
Do not use g-computation to rescue analyses with substantial unmeasured\nconfounding; modeling does not buy causal identification.\n\n**Interpreting the output**\n\nIn the worked example, the g-computation risk difference is 0.18 - 0.28 = -0.10.\n\n*(1) Formal interpretation.* The g-computation estimand is the population-averaged causal risk\ndifference — the average difference in 1-year stroke probability if, contrary to fact, the\nentire cohort had received the drug versus if none had received it, averaging over the observed\ndistribution of disease severity in the cohort. The estimate of -0.10 is a marginal\n(population-averaged) effect, not the conditional log-odds ratio that the underlying logistic\nmodel produces within each severity stratum. It is consistent under three untestable\nassumptions: exchangeability (no unmeasured confounding given severity), positivity (all\npatients had a non-zero probability of receiving each assignment), and consistency (the drug is\nwell-defined). A bootstrap 95% confidence interval means that across repeated hypothetical\nsamples from the same data-generating process, the constructed intervals would contain the true\nmarginal risk difference in 95% of replications.\n\n*(2) Practical interpretation.* If this blood-pressure drug were given to every patient in the\ncohort instead of none, the outcome model predicts that average 1-year stroke risk would fall\nfrom 28% to 18% — an absolute reduction of 10 percentage points. Translating to\nnumber-needed-to-treat: treating 10 patients prevents, on average, one stroke. This number is\ndirectly usable in a budget-impact model or cost-effectiveness analysis in a way that a\nlogistic OR cannot be without additional computation. The causal claim is conditional on the\nassumption that disease severity captures all confounding between drug assignment and stroke\nrisk in this dataset.",
    "primary_category": "Causal_Inference_Method",
    "tags": [
      "causal-inference",
      "g-computation",
      "standardization",
      "marginal-effect",
      "counterfactual",
      "parametric-g-formula",
      "time-varying",
      "bootstrap",
      "noncollapsibility",
      "g-methods"
    ],
    "applies_to_study_types": [
      "cohort_retrospective",
      "cohort_prospective",
      "new_user",
      "active_comparator_new_user",
      "claims_analysis",
      "ehr_study",
      "comparative_effectiveness"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "primary",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1093/aje/kwq472",
        "url": "https://doi.org/10.1093/aje/kwq472",
        "citation_text": "Snowden JM, Rose S, Mortimer KM. Implementation of G-Computation on a Simulated Data Set: Demonstration of a Causal Inference Technique. American Journal of Epidemiology. 2011;173(7):731-738.",
        "year": 2011,
        "authors_short": "Snowden, Rose & Mortimer",
        "notes": "Step-by-step implementation of g-computation on simulated data, covering the predict-and- average algorithm and its connection to the broader g-methods family. The standard pedagogical reference for applied epidemiologists learning to run g-computation in practice."
      },
      {
        "role": "explain",
        "doi": "10.1016/0270-0255(86)90088-6",
        "url": "https://doi.org/10.1016/0270-0255(86)90088-6",
        "citation_text": "Robins JM. A new approach to causal inference in mortality studies with a sustained exposure period: Application to control of the healthy worker survivor effect. Mathematical Modelling. 1986;7(9-12):1393-1512.",
        "year": 1986,
        "authors_short": "Robins",
        "notes": "The foundational paper introducing the g-computation formula — the general product formula for counterfactual distributions — as the mathematical basis for the entire g-methods family, including the parametric g-formula and its time-varying extensions."
      },
      {
        "role": "demonstrate",
        "doi": "10.1097/EDE.0000000000000160",
        "url": "https://doi.org/10.1097/EDE.0000000000000160",
        "citation_text": "Keil AP, Edwards JK, Richardson DB, Naimi AI, Cole SR. The Parametric g-Formula for Time-to-event Data: Towards Intuition with a Worked Example. Epidemiology. 2014;25(6):889-897.",
        "year": 2014,
        "authors_short": "Keil et al.",
        "notes": "Worked example of the parametric g-formula applied to time-to-event data, walking through Monte Carlo simulation of covariate and treatment histories under hypothetical strategies; the key reference for the time-varying extension of point-treatment g-computation."
      }
    ],
    "plain_language_summary": "G-computation estimates how much a treatment changes an outcome by fitting a statistical model that predicts the outcome from patient characteristics and treatment status, then using that model to predict each patient's result twice — once as if treated and once as if untreated — and averaging across everyone in the study. The difference between the two averages is a treatment effect expressed as a risk difference or risk ratio on a plain probability scale, a number a clinician or payer can act on directly. This predict-everyone-twice-and-average approach sidesteps a problem specific to logistic regression where odds ratios cannot be simply averaged across patient groups or turned directly into risk reductions, even when there is no confounding at all. Confidence intervals come from bootstrapping — repeating the whole procedure hundreds of times on resampled data to quantify the spread of plausible answers.",
    "key_terms": [
      {
        "term": "counterfactual outcome",
        "definition": "The outcome a patient would have experienced under a treatment they did not actually receive; g-computation predicts these unobserved values from a fitted model and averages them to estimate the population-level treatment effect."
      },
      {
        "term": "marginal effect",
        "definition": "A treatment effect averaged over the whole study population rather than estimated within a specific subgroup or by holding other patient characteristics fixed; what g-computation produces, in contrast to the conditional odds or hazard ratio a regression model reports by default."
      },
      {
        "term": "standardization (model-based)",
        "definition": "The step in g-computation that averages model predictions across the observed distribution of patient characteristics, so the final estimate reflects the actual mix of patients in the study rather than an artificial reference patient."
      },
      {
        "term": "noncollapsibility",
        "definition": "A mathematical property of odds ratios and hazard ratios where the population-level value differs from the within-stratum value even when there is no confounding; g-computation bypasses noncollapsibility by producing risk differences and risk ratios directly from the averaged predictions."
      },
      {
        "term": "treatment-confounder feedback",
        "definition": "A cycle where a post-baseline health measure (such as a lab value) is both changed by past treatment and predictive of future treatment decisions; the parametric g-formula handles this by simulating covariate histories forward rather than conditioning on the observed values."
      },
      {
        "term": "g-null paradox",
        "definition": "A technical limitation where the parametric g-formula is mathematically guaranteed to be misspecified when treatment truly has no effect but treatment-confounder feedback is present; g-estimation of structural nested models is robust to this issue."
      }
    ],
    "worked_example": {
      "scenario": "A pharmacoepidemiologist wants to estimate the causal effect of a new blood-pressure drug on 1-year stroke risk in a cohort of 1,000 patients. Patients have one of two disease-severity profiles: mild (stratum A, 60% of the cohort) and severe (stratum B, 40%). A logistic regression outcome model is fit on the observed data with treatment and severity as predictors. The analyst applies g-computation: predict each patient's 1-year stroke probability under treatment (treated=1) and under no treatment (treated=0), then average both sets of predictions over the full cohort and contrast the averages.",
      "dataset": {
        "caption": "Stratum-level summary of fitted 1-year stroke risks from the logistic outcome model, by disease severity. These are the model-predicted risks that g-computation standardizes over the cohort; the 60%/40% split reflects the observed stratum proportions.",
        "columns": [
          "stratum",
          "pct_of_cohort",
          "fitted_risk_treated",
          "fitted_risk_untreated"
        ],
        "rows": [
          [
            "A (mild)",
            0.6,
            0.1,
            0.2
          ],
          [
            "B (severe)",
            0.4,
            0.3,
            0.4
          ]
        ]
      },
      "steps": [
        "Step 1 — Fit the outcome model: run logistic regression of stroke (1=yes, 0=no) on treatment (1=yes, 0=no) and severity. The fitted model produces a predicted stroke probability for every patient under any treatment value.",
        "Step 2 — Set everyone to treated: copy the dataset and set treated=1 for every patient. Score with the fitted model. Stratum A patients (mild) get predicted risk 0.10; stratum B patients (severe) get 0.30.",
        "Step 3 — Set everyone to untreated: copy the dataset and set treated=0 for every patient. Score with the fitted model. Stratum A gets 0.20; stratum B gets 0.40.",
        "Step 4 — Standardize the treated risks over the cohort proportions: 0.6*0.10 + 0.4*0.30 = 0.06 + 0.12 = 0.18",
        "Step 5 — Standardize the untreated risks over the cohort proportions: 0.6*0.20 + 0.4*0.40 = 0.12 + 0.16 = 0.28",
        "Step 6 — Contrast the standardized risks: 0.18 - 0.28 = -0.10",
        "Step 7 — Bootstrap for the 95% CI: resample the full cohort with replacement 1,000 times, re-fit the outcome model and re-apply Steps 2-6 in each resample; the 2.5th and 97.5th percentiles of the resulting distribution of risk differences give the confidence interval."
      ],
      "result": "standardized_risk_treated = 0.6*0.10 + 0.4*0.30 = 0.18. standardized_risk_untreated = 0.6*0.20 + 0.4*0.40 = 0.28. RD = 0.18 - 0.28 = -0.10. The drug is estimated to lower 1-year stroke risk by 10 percentage points on average across the cohort — the population-averaged marginal risk difference, not the conditional odds ratio that the logistic model reports by default."
    },
    "prerequisites": [
      "dags-backdoor-criterion-drug-studies",
      "propensity-score-methods-psm-iptw",
      "marginal-effects-and-interpretation-of-inferential-statistics-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Point-treatment g-computation (binary outcome)",
        "description": "The standard application: fit a logistic regression outcome model on the observed cohort, duplicate the data twice with treatment fixed to 1 and 0, predict response probabilities for every patient under both assignments, and average. Produces marginal risk difference and risk ratio with bootstrap confidence intervals. Implemented in R via stdReg (Zetterqvist and Sjolander) or by-hand predict-and-average in Python, R, or SAS.",
        "edge_cases": [
          "Near-positivity violations (covariate profiles observed under only one treatment) cause the outcome model to extrapolate; inspect overlap of the covariate distributions across treatment arms before trusting the standardization.",
          "Treatment-covariate interactions in the outcome model affect predictions asymmetrically across strata; include all clinically plausible interactions or use a doubly-robust estimator (TMLE with Super Learner) as a sensitivity check."
        ],
        "data_source_notes": "Claims: index the cohort at treatment initiation (new-user design), define the outcome window, and include key baseline diagnoses, fills, and utilization as confounders in the outcome model. EHR: labs and vitals directly measurable as confounders; handle informative missingness before fitting. Registry: richest covariate set; verify whether treatment is adjudicated or self-reported before treating as binary."
      },
      {
        "name": "Survival outcome g-computation (RMST or standardized risk at fixed horizon)",
        "description": "When the outcome is time-to-event, fit a Cox model or parametric survival model including treatment and confounders; at a fixed time horizon, compute the survival probability for every patient under treatment and under no treatment, and average to obtain the marginal survival probability or cumulative incidence under each strategy. The marginal restricted mean survival time (RMST) difference is the preferred summary when proportional hazards may not hold.",
        "edge_cases": [
          "Competing events (e.g., death before the primary event) must be handled as cause-specific or subdistribution outcomes in the model before standardizing.",
          "Informative censoring requires an inverse-probability-of-censoring weight (IPCW) step before fitting the outcome model, or an IPCW-weighted g-computation estimator."
        ],
        "data_source_notes": "Claims: time-to-event is constructed from the first outcome diagnosis or procedure code after the index date; censoring at disenrollment or end of data requires IPCW if informative. Linked vital records improve mortality ascertainment and competing-event handling."
      },
      {
        "name": "Parametric g-formula for time-varying treatment",
        "description": "The full parametric g-formula: fit separate regression models for each time-varying confounder and the outcome; Monte Carlo simulate the entire covariate-and-treatment history under each hypothetical strategy by drawing from these models at each step, conditional on the simulated (not observed) history; average simulated outcomes to estimate the counterfactual risk under each strategy. Implemented in R via gfoRmula (McGrath et al. 2020). Uniquely suited to dynamic regimes and cumulative incidence contrasts.",
        "edge_cases": [
          "The g-null paradox: under a sharp null with treatment-confounder feedback, the parametric g-formula is mathematically guaranteed to be misspecified; prefer g-estimation of structural nested models when the null is plausible.",
          "Monte Carlo variability from the simulation adds to bootstrap uncertainty; use at least 10,000 Monte Carlo draws per bootstrap resample to ensure simulation error is negligible."
        ],
        "data_source_notes": "Linked claims-EHR data is the strongest substrate: claims supply the complete fill history for the time-varying treatment; EHR labs and vitals supply the time-varying confounders that drive clinical decisions. Claims-only g-formula analyses often violate sequential exchangeability if the key confounders (LDL, HbA1c, eGFR) are not observed."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "propensity-score-methods-psm-iptw",
        "pros_of_this": "G-computation models the outcome process directly and does not require positivity diagnostics or extreme-weight truncation; it naturally accommodates non-linear treatment-covariate interactions via the outcome model and produces the marginal effect on any scale (RD, RR, NNT) without additional computation.",
        "cons_of_this": "G-computation has no robustness fallback — if the outcome model is wrong, the estimate is biased with no correction; IPTW is biased only if the propensity model is wrong, offering a different direction of robustness. Extrapolation into covariate regions with low overlap is hidden inside the outcome model rather than flagged by weight diagnostics.",
        "when_to_prefer": "Prefer g-computation when the outcome model is credible, treatment-covariate interactions are important, and extreme weights would destabilize the IPTW estimate; prefer IPTW when the treatment assignment process is well-modeled and positivity can be verified."
      },
      {
        "compared_to": "targeted-maximum-likelihood-estimation-rwe",
        "pros_of_this": "G-computation is simpler to implement and explain — it requires only one outcome model and produces results without the targeting step, the clever covariate, or cross-fitting. It is transparent and audit-friendly for regulatory submissions that require methodological clarity.",
        "cons_of_this": "TMLE is doubly robust — consistent if either the outcome or the propensity model is correct — and achieves the efficient influence-curve variance bound, giving valid confidence intervals even when machine-learning nuisance models are used. G-computation is consistent only under a correct outcome model and is not semiparametrically efficient.",
        "when_to_prefer": "Use g-computation as the primary analysis when the outcome model is well-specified; use TMLE as a doubly-robust sensitivity check, especially when high-dimensional confounding motivates Super Learner nuisance estimation."
      },
      {
        "compared_to": "marginal-structural-models-g-methods",
        "pros_of_this": "For time-varying treatment, the parametric g-formula is more efficient than IPTW-MSMs, handles dynamic regimes naturally, and produces absolute cumulative incidence curves without relying on stable weights. For point-treatment analyses, g-computation is simpler and more transparent than a weighted model.",
        "cons_of_this": "The parametric g-formula requires modeling every time-varying confounder explicitly (not just the treatment model) and is subject to the g-null paradox; IPTW-MSMs need only a treatment and censoring model and are unaffected by the g-null paradox.",
        "when_to_prefer": "Use the parametric g-formula for dynamic regimes and absolute-risk contrasts, or when near-positivity violations make IPTW weights unstable; use IPTW-MSMs for static sustained strategies when the treatment model is trusted and weight diagnostics look acceptable."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Define the index date (first qualifying fill in a new-user design), construct the binary outcome from the first diagnosis or procedure code in the follow-up window, and assemble baseline confounders from the 365-day lookback. The outcome model should include the high-dimensional covariate bank (prior diagnoses, prior fills, utilization) as predictors. G-computation on claims data is fast — the bottleneck is the bootstrap loop, which scales linearly with cohort size and bootstrap iterations. For large claims cohorts (n > 100,000), 500 resamples are usually sufficient; parallelize with joblib (Python) or future (R).",
      "ehr": "EHR labs, vitals, and problem-list data provide richer confounders than claims and make exchangeability more defensible; include them in the outcome model directly. Handle informative missingness (lab values observed only at clinic visits) via multiple imputation or a missingness indicator before fitting. For time-varying outcomes, consider a survival-outcome g-computation at a fixed horizon rather than a binary short-term flag.",
      "registry": "Adjudicated outcomes and scheduled-visit structure make the outcome model more reliable and covariate missingness more predictable. Use registry data as the outcome model base and link to claims for complete treatment exposure, resolving discordant dates by source hierarchy (claims fill date takes precedence over self-reported registry use).",
      "primary": "Randomized or prospectively collected primary data with full covariate ascertainment is the ideal substrate for g-computation, as exchangeability is most credible. At small n (< 500), the bootstrap CI is sensitive to resample size; use BCa or studentized bootstrap intervals rather than the naive percentile method to improve coverage.",
      "linked": "Linked claims-EHR-vital-records provides the strongest confounder set and outcome ascertainment for g-computation. The outcome model can incorporate EHR labs alongside claims-derived utilization variables; vital status from linked death records improves competing-risk handling. Report linkage selection (propensity of being linked) as a potential source of residual bias."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\nimport pandas as pd\nimport statsmodels.formula.api as smf\n\ndef g_computation(df: pd.DataFrame, formula: str, n_boot: int = 1000, seed: int = 42) -> dict:\n    \"\"\"\n    Point-treatment g-computation with bootstrap 95% CI.\n\n    Parameters\n    ----------\n    df      : DataFrame with outcome, treatment indicator, and confounders\n    formula : statsmodels formula, e.g. 'event ~ treated + severity'\n    n_boot  : number of bootstrap resamples (500+ for large datasets)\n    seed    : random seed for reproducibility\n\n    Returns\n    -------\n    dict with risk_treated, risk_untreated, RD, RR, CI_95\n    \"\"\"\n    rng = np.random.default_rng(seed)\n\n    def _one_run(data: pd.DataFrame):\n        m = smf.logit(formula, data=data).fit(disp=False)\n        d1 = data.copy(); d1[\"treated\"] = 1   # everyone treated\n        d0 = data.copy(); d0[\"treated\"] = 0   # everyone untreated\n        r1 = m.predict(d1).mean()              # marginal risk under treatment\n        r0 = m.predict(d0).mean()              # marginal risk under no treatment\n        return r1, r0, r1 - r0\n\n    r1, r0, rd = _one_run(df)\n\n    boot_rds = []\n    n = len(df)\n    for _ in range(n_boot):\n        idx = rng.integers(0, n, size=n)\n        _, _, rd_b = _one_run(df.iloc[idx].reset_index(drop=True))\n        boot_rds.append(rd_b)\n\n    ci = np.percentile(boot_rds, [2.5, 97.5])\n    rr = r1 / r0 if r0 > 0 else float(\"nan\")\n    return {\"risk_treated\": r1, \"risk_untreated\": r0, \"RD\": rd, \"RR\": rr, \"CI_95\": ci.tolist()}\n\n# ── Synthetic cohort: severity confounds treatment and stroke outcome ──\nrng0 = np.random.default_rng(7)\nn = 400\nseverity = rng0.binomial(1, 0.40, n)                               # 40% severe\np_treat  = 1 / (1 + np.exp(-(-0.5 + 1.2 * severity)))             # severe patients more likely treated\ntreated  = rng0.binomial(1, p_treat)\np_event  = 1 / (1 + np.exp(-(-2.5 + 1.5 * treated + 2.0 * severity)))\nevent  
  = rng0.binomial(1, p_event)\ndf = pd.DataFrame({\"event\": event, \"treated\": treated, \"severity\": severity})\n\nresult = g_computation(df, \"event ~ treated + severity\", n_boot=1000)\nprint(f\"Risk (treated):    {result['risk_treated']:.3f}\")\nprint(f\"Risk (untreated):  {result['risk_untreated']:.3f}\")\nprint(f\"Risk difference:   {result['RD']:.3f}\")\nprint(f\"Risk ratio:        {result['RR']:.3f}\")\nprint(f\"95% bootstrap CI:  ({result['CI_95'][0]:.3f}, {result['CI_95'][1]:.3f})\")\n\n# For the time-varying parametric g-formula with treatment-confounder feedback, use\n# R's gfoRmula package (McGrath et al. 2020). Python ports exist but gfoRmula is the\n# production-grade implementation; see marginal-structural-models-g-methods entry.",
        "description": "G-computation for a binary outcome using a logistic outcome model (statsmodels) with\nnonparametric bootstrap confidence intervals. The function fits the outcome model, scores two\ncounterfactual datasets (all treated, all untreated), computes the marginal risk difference\nand risk ratio, and bootstraps the CI. A synthetic cohort illustrates the setup: severity\n(0=mild, 1=severe) confounds treatment assignment and stroke risk, matching the worked example\nstratum structure.",
        "dependencies": [
          "numpy",
          "pandas",
          "statsmodels"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "# ── Synthetic cohort: severity (0=mild, 1=severe) confounds treatment and stroke outcome ──\nset.seed(7)\nn        <- 400\nseverity <- rbinom(n, 1, 0.40)\ntreated  <- rbinom(n, 1, plogis(-0.5 + 1.2 * severity))\nevent    <- rbinom(n, 1, plogis(-2.5 + 1.5 * treated + 2.0 * severity))\ndf       <- data.frame(event = event, treated = treated, severity = severity)\n\n# ── Manual g-computation: fit, score two counterfactual datasets, average ──\ngcomp_once <- function(data) {\n  m  <- glm(event ~ treated + severity, data = data, family = binomial)\n  d1 <- data; d1$treated <- 1          # everyone treated\n  d0 <- data; d0$treated <- 0          # everyone untreated\n  r1 <- mean(predict(m, newdata = d1, type = \"response\"))\n  r0 <- mean(predict(m, newdata = d0, type = \"response\"))\n  c(risk_treated = r1, risk_untreated = r0, RD = r1 - r0, RR = r1 / r0)\n}\npoint_est <- gcomp_once(df)\n\n# ── Bootstrap 95% percentile CI ──\nB        <- 1000\nboot_rds <- replicate(B, gcomp_once(df[sample(nrow(df), replace = TRUE), ])[\"RD\"])\nci       <- quantile(boot_rds, c(0.025, 0.975))\n\ncat(sprintf(\"Risk (treated):   %.3f\\n\",  point_est[\"risk_treated\"]))\ncat(sprintf(\"Risk (untreated): %.3f\\n\",  point_est[\"risk_untreated\"]))\ncat(sprintf(\"Risk difference:  %.3f  95%% CI (%.3f, %.3f)\\n\",\n            point_est[\"RD\"], ci[1], ci[2]))\ncat(sprintf(\"Risk ratio:       %.3f\\n\",  point_est[\"RR\"]))\n\n# ── One-liner with stdReg (marginal risks + bootstrap CI via stdGlm) ──\n# library(stdReg)\n# fit <- glm(event ~ treated + severity, data = df, family = binomial)\n# std <- stdGlm(fit = fit, data = df, X = \"treated\")\n# summary(std, CI.type = \"plain\")   # reports marginal risks and RD with bootstrap CI\n\n# ── Parametric g-formula for time-varying treatment (treatment-confounder feedback) ──\n# library(gfoRmula)   # McGrath et al. 
2020 — Patterns 1(3):100008\n# See marginal-structural-models-g-methods entry for a full gfoRmula example with\n# time-varying confounder models and dynamic treatment strategies.",
        "description": "G-computation by hand (predict-and-average) with bootstrap confidence intervals, plus a\none-liner using the stdReg package (Zetterqvist and Sjolander). Uses the same synthetic\nseverity-confounded cohort as the Python implementation. For the time-varying parametric\ng-formula with treatment-confounder feedback, the gfoRmula package (McGrath et al. 2020,\nPatterns 1(3):100008) provides the production implementation and is referenced in\nmarginal-structural-models-g-methods.",
        "dependencies": [
          "stdReg"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* ── Create synthetic cohort: severity (0=mild, 1=severe) confounds treatment and stroke ── */\ndata work.cohort;\n  call streaminit(7);\n  do id = 1 to 400;\n    severity = rand('bernoulli', 0.40);\n    treated  = rand('bernoulli', 1/(1+exp(-(-0.5 + 1.2*severity))));\n    event    = rand('bernoulli', 1/(1+exp(-(-2.5 + 1.5*treated + 2.0*severity))));\n    output;\n  end;\nrun;\n\n/* ── Step 1: fit the outcome model ── */\nproc logistic data=work.cohort descending noprint;\n  model event = treated severity;\n  output out=pred_obs p=pred_obs;\nrun;\n\n/* ── Step 2: create counterfactual datasets ── */\ndata c1; set work.cohort; treated = 1; run;   /* everyone treated   */\ndata c0; set work.cohort; treated = 0; run;   /* everyone untreated */\n\n/* ── Step 3: score both counterfactual datasets with the fitted model ── */\nproc logistic data=work.cohort descending noprint;\n  model event = treated severity;\n  score data=c1 out=pred1(rename=(P_1=pred_treated))   nosummary;\n  score data=c0 out=pred0(rename=(P_1=pred_untreated)) nosummary;\nrun;\n\n/* ── Step 4: average and contrast ── */\ndata combined;\n  merge pred1(keep=pred_treated) pred0(keep=pred_untreated);\nrun;\nproc means data=combined mean noprint;\n  var pred_treated pred_untreated;\n  output out=margins mean=risk_treated risk_untreated;\nrun;\ndata result; set margins;\n  risk_difference = risk_treated - risk_untreated;\n  risk_ratio      = risk_treated / risk_untreated;\n  put \"Risk (treated):    \" risk_treated   6.4;\n  put \"Risk (untreated):  \" risk_untreated 6.4;\n  put \"Risk difference:   \" risk_difference 7.4;\n  put \"Risk ratio:        \" risk_ratio 6.4;\nrun;\n\n/* ── Bootstrap 95% CI (1000 resamples via PROC SURVEYSELECT + macro) ── */\nproc surveyselect data=work.cohort out=boot_samples\n    method=urs n=400 rep=1000 seed=42 noprint;\nrun;\n\n%macro gcomp_boot;\n  data boot_rds; length rd 8; stop; run;\n  %do b = 1 %to 1000;\n    data b_&b; set 
boot_samples(where=(Replicate=&b)); run;\n    proc logistic data=b_&b descending noprint;\n      model event = treated severity;\n      score data=c1 out=bpred1(rename=(P_1=bt)) nosummary;\n      score data=c0 out=bpred0(rename=(P_1=bu)) nosummary;\n    run;\n    data _tmp; merge bpred1(keep=bt) bpred0(keep=bu); rd = bt - bu; run;\n    proc means data=_tmp mean noprint; var rd;\n      output out=_r mean=rd;\n    run;\n    data boot_rds; set boot_rds _r(keep=rd); run;\n  %end;\n  proc univariate data=boot_rds noprint; var rd;\n    output out=ci pctlpts=2.5 97.5 pctlpre=p;\n  run;\n  proc print data=ci noobs label; run;\n%mend gcomp_boot;\n/* %gcomp_boot;   <- uncomment to run (1000 resamples; may take several minutes) */",
        "description": "G-computation in SAS using PROC LOGISTIC to fit the outcome model and SCORE to predict\ntwo counterfactual datasets, then PROC MEANS to compute marginal risks and the risk\ndifference. Bootstrap confidence intervals use PROC SURVEYSELECT (with-replacement\nresampling) with a macro loop re-applying the g-computation steps across resamples. Uses\nthe same severity-confounded cohort structure as the Python and R implementations.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  A[\"Fit outcome model on observed data<br/>Y ~ treated + confounders<br/>(logistic / Cox / Poisson)\"] --> B & C\n  B[\"Copy dataset<br/>set treated = 1 for all\"] --> D[\"Score: predict each patient's outcome<br/>under treated = 1<br/>(counterfactual prediction)\"]\n  C[\"Copy dataset<br/>set treated = 0 for all\"] --> E[\"Score: predict each patient's outcome<br/>under treated = 0<br/>(counterfactual prediction)\"]\n  D --> F[\"Average predictions across cohort<br/>E[Y1] = mean of treated predictions\"]\n  E --> G[\"Average predictions across cohort<br/>E[Y0] = mean of untreated predictions\"]\n  F --> H[\"Contrast<br/>RD = E[Y1] - E[Y0]<br/>RR = E[Y1] / E[Y0]\"]\n  G --> H\n  H --> I[\"Bootstrap 95% CI<br/>repeat 1000 times on resampled cohort<br/>2.5th and 97.5th percentile of RD distribution\"]",
        "caption": "The g-computation algorithm for a binary outcome. The outcome model is fit once on observed data; two counterfactual scored datasets (all treated, all untreated) are averaged and contrasted to produce the marginal risk difference and ratio; bootstrap resamples propagate model-fitting uncertainty into the confidence interval.",
        "alt_text": "Flowchart from fitting an outcome model to two parallel branches — scoring everyone as treated and scoring everyone as untreated — then averaging predictions in each branch and contrasting to produce the marginal risk difference and ratio, followed by a bootstrap loop for the confidence interval.",
        "source_type": "illustrative",
        "source_citations": [
          "snowden-2011"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "see_also",
        "target_slug": "marginal-structural-models-g-methods",
        "notes": "IPTW-estimated MSMs and g-computation are dual causal approaches: g-computation models the outcome process to obtain marginal effects, while IPTW models the treatment process; both target the same counterfactual estimand and are best used as mutual cross-checks. The parametric g-formula is the time-varying extension of g-computation within the MSM/g-methods family."
      },
      {
        "relation_type": "see_also",
        "target_slug": "targeted-maximum-likelihood-estimation-rwe",
        "notes": "TMLE is g-computation augmented by a targeting (fluctuation) step that achieves double robustness and semiparametric efficiency; use TMLE as the doubly-robust sensitivity check for a g-computation primary analysis, particularly when confounding is high-dimensional."
      },
      {
        "relation_type": "see_also",
        "target_slug": "marginal-effects-and-interpretation-of-inferential-statistics-rwe",
        "notes": "G-computation is the disciplined implementation of the standardization approach described in the marginal effects entry, producing the population-averaged marginal effect that bypasses the noncollapsibility of conditional odds ratios and hazard ratios."
      },
      {
        "relation_type": "used_with",
        "target_slug": "bootstrap-resampling-methods",
        "notes": "The nonparametric bootstrap is the standard inference approach for g-computation, propagating outcome-model uncertainty into the confidence interval without analytic variance derivation; BCa intervals are preferred at small n."
      },
      {
        "relation_type": "requires",
        "target_slug": "dags-backdoor-criterion-drug-studies",
        "notes": "G-computation requires a causal DAG to justify the covariate set in the outcome model; the backdoor criterion determines which confounders must be included to block all non-causal paths between treatment and outcome."
      },
      {
        "relation_type": "see_also",
        "target_slug": "propensity-score-methods-psm-iptw",
        "notes": "Propensity-score IPTW is the dual estimator to g-computation for the same causal estimand; agreement between the two strengthens causal inference, divergence reveals which model specification is load-bearing."
      }
    ],
    "aliases": [
      "g-computation",
      "parametric g-formula",
      "model-based standardization",
      "outcome standardization",
      "g-formula",
      "causal standardization"
    ],
    "complexity": "advanced",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "g-estimation-structural-nested-models",
    "name": "G-Estimation of Structural Nested Models",
    "short_definition": "A semiparametric g-method that estimates the parameters of a structural nested model (the per-interval \"blip\" effect of a time-varying treatment conditional on covariate and treatment history) by solving an estimating equation that exploits conditional independence between the treatment-removed counterfactual outcome and observed treatment given the past, yielding consistent causal effects even when time-varying confounders are themselves affected by prior treatment.",
    "long_description": "**G-estimation of structural nested models (SNMs)** is one of the three core g-methods James Robins\ndeveloped to handle the specific failure that defeats ordinary regression: a time-varying confounder\nthat is simultaneously a *consequence* of prior treatment and a *cause* of future treatment and of the\noutcome (treatment-confounder feedback). When LDL falls because a patient took a statin, and that low\nLDL then drives the decision to keep prescribing, conditioning on LDL in a time-dependent Cox model\nblocks part of the treatment effect and opens a collider path; *not* conditioning on it leaves\nconfounding. No single regression can be right. G-methods break the impasse by modeling the treatment\nprocess and the counterfactual outcome separately. Where the g-formula plugs in a fully parameterized\noutcome model and marginal structural models (MSMs) reweight to a pseudo-population, g-estimation\ntargets a **structural nested model**: a parameterization of the *incremental* (blip) effect of the\ntreatment received in interval k on the outcome, conditional on the observed history through k.\n\n**Core estimand distinction.** A structural nested mean model (SNMM) parameterizes\nγ(k, h; ψ) = E[Y(ā_k, 0) − Y(ā_{k-1}, 0) | H_k = h, A_k] — the contrast, within the stratum defined by\nhistory H_k, between receiving the observed treatment in interval k and then nothing afterward, versus\nreceiving nothing from k onward. For survival, the structural nested (accelerated) failure time model\nand the rank-preserving structural failure time model (RPSFTM) parameterize how treatment in each\ninterval compresses or expands counterfactual survival time. The defining trick of g-estimation: under\nno-unmeasured-confounding-at-each-interval, the *treatment-removed* (blipped-down) counterfactual\nH(ψ) is conditionally independent of treatment A_k given history H_k. 
You find the ψ that makes the\ndata obey that independence by solving the estimating equation\nU(ψ) = Σ_k [A_k − E(A_k | H_k)] · H(ψ) · q(H_k) = 0, where E(A_k | H_k) comes from a fitted treatment\n(propensity) model. This is the crucial contrast with MSMs: SNMs estimate a **conditional** effect that\ncan directly carry effect modification by *time-varying, post-baseline* covariates (e.g., how the\nbenefit of continuing a drug depends on current disease activity); MSMs estimate a marginal or\nbaseline-modified effect and integrate that interaction away. The estimand is naturally a\n**per-protocol / sustained-strategy** effect — \"the effect of taking treatment as assigned versus\nnever\" — which maps cleanly onto a target-trial per-protocol question.\n\n**Pros, cons, and trade-offs.**\n- **vs marginal structural models with IPTW (marginal-structural-models-g-methods):** G-estimation\n  inherits a partial double-robustness / null-preservation property — under the sharp null of no effect\n  at any interval, it gives a valid test even if the blip form is wrong — and it does not require\n  *positivity* as strictly as IPTW, because no extreme weights are formed; an SNM can therefore be far\n  more efficient and stable when some covariate strata deterministically predict treatment. It uniquely\n  estimates effect modification by *time-varying* covariates. Cost: the blip function and the treatment\n  model must be specified; the estimating equation is solved numerically (grid search or root-finding),\n  interpretation of blip parameters is harder to communicate than a hazard ratio, and off-the-shelf\n  software is thin. 
**Prefer g-estimation/SNMs** when effect modification by post-baseline factors is the\n  scientific target, when positivity is borderline and IPTW weights blow up, or when you want the\n  null-robustness of a test.\n- **vs the parametric g-formula:** Both handle treatment-confounder feedback, but the g-formula requires\n  correctly modeling the full joint density of all time-varying covariates and the outcome (the\n  g-null paradox lurks when these are misspecified), whereas g-estimation only needs the treatment model\n  plus the blip. **Prefer the g-formula** when you need absolute risks/survival curves under multiple\n  hypothetical regimes; **prefer g-estimation** when the contrast of interest is a single\n  sustained-strategy effect and you distrust a fully parametric outcome model.\n- **vs time-dependent / weighted Cox (standard-cox-time-dependent, cox-ph-regression):** A naive\n  time-dependent Cox model that conditions on the time-varying confounder is *biased* under feedback\n  (it blocks the mediated effect and conditions on a collider); g-estimation is consistent. Cost: a much\n  higher technical and communication barrier, and parameters that are structural-model coefficients, not\n  plug-and-play hazard ratios. **Prefer standard methods** only when the time-varying confounder is not\n  itself affected by prior treatment — then there is no feedback and ordinary adjustment suffices.\n- **vs clone-censor-weight per-protocol (clone-censor-weight-per-protocol):** CCW is a design-forward\n  way to get the same sustained-strategy estimand (clone, censor at deviation, weight for informative\n  censoring); g-estimation is a modeling-forward route to it. 
**Prefer CCW** for transparency and easy\n  communication of dynamic/grace-period strategies; **prefer g-estimation** for efficiency and for\n  direct parameterization of effect modification.\n\n**When to use.** Sustained or repeatedly-decided pharmacotherapy where (1) a time-varying covariate\n(LDL, HbA1c, eGFR, disease-activity score, prior adverse event, adherence) lies on the causal pathway\n*and* drives subsequent treatment, (2) the question is a per-protocol / \"as-assigned-sustained\" effect,\nand (3) you suspect the effect is modified by an evolving clinical factor. It is the natural estimation\nengine inside a target-trial emulation when the per-protocol estimand involves time-varying treatment.\n\n**When NOT to use — and when it is actively misleading or dangerous.**\n- **No treatment-confounder feedback exists.** If the time-varying covariate is not affected by prior\n  treatment, g-estimation is needless machinery; a correctly adjusted time-dependent Cox or an ACNU +\n  PS analysis is simpler and equally valid. Reaching for g-estimation here trades clarity for nothing.\n- **The blip and treatment models are both badly specified.** Double robustness is partial, not magic:\n  away from the null, a wrong blip *and* a wrong propensity model yield a confidently wrong, smooth\n  answer with no warning. 
A \"clean\" point estimate from g-estimation is *not* evidence of correctness —\n  this is its most dangerous failure mode, because nothing crashes.\n- **Severe non-positivity in the treatment model.** If E(A_k | H_k) is ≈0 or ≈1 in occupied strata, the\n  estimating equation is uninformative there and the solution is driven by a few influential\n  person-intervals; the apparent stability (no extreme weights to flag, unlike IPTW) hides the problem.\n- **Discrete-time approximation of a continuous decision is too coarse.** If treatment is re-decided\n  daily but you bin into 90-day intervals, you misclassify the timing of the blip and of confounder\n  feedback; the structural model no longer corresponds to any real intervention.\n- **The outcome model / blip is being chosen by fit to the data.** Tuning the blip until the answer\n  looks plausible invalidates the null-robustness and inference. The SNM form must be pre-specified from\n  subject-matter knowledge.\n\n**Data-source operational depth.**\n- **Claims (FFS or commercial):** You must build a **long-format, discrete-time person-interval table**\n  (one row per person per month/quarter) with a time-varying treatment indicator stitched from\n  `fill_date` + `days_supply` (with explicit stockpiling/grace rules), time-varying confounders proxied\n  from diagnoses/procedures/utilization, and the per-interval propensity model E(A_k | H_k). Failure\n  modes: (1) **Medicare Advantage person-time lacks FFS claims**, so \"no fill\" can be missingness rather\n  than non-treatment, corrupting both the treatment indicator and the confounder history that drive the\n  estimating equation — restrict to enrollees with full A/B/D (or commercial pharmacy benefit) and\n  exclude MA-only spans. 
(2) **Lab values are largely absent in pure claims**, so the very\n  feedback-confounders that motivate g-estimation (LDL, HbA1c) are unobserved; without lab linkage you\n  cannot satisfy the no-unmeasured-confounding-at-each-interval assumption and the method is being run on\n  a fiction. (3) **Differential informative censoring** (disenrollment, death) by exposure requires IPCW\n  layered onto g-estimation; in elderly cohorts competing mortality is heavy and must be handled, not\n  ignored. (4) **Immortal time** sneaks in if interval 0 is mis-set before the first fill.\n- **EHR:** Supplies the time-varying labs/vitals that make the feedback structure measurable — its key\n  advantage over claims — but visit-driven, irregular measurement means H_k is observed only when the\n  patient shows up, inducing measurement timing that must be carried-forward or modeled; out-of-system\n  care leaves exposure history incomplete, biasing the treatment model. Prefer EHR *linked to claims* so\n  the fill history is complete and the confounder history is rich.\n- **Registry:** Protocolized repeated measures of treatment, severity, and adjudicated outcomes are an\n  excellent substrate for the H_k history and are often used to validate claims-based g-estimation; the\n  gap is usually complete pharmacy exposure (link to claims) and death (link to a vital-records index).\n- **Linked claims–EHR–vital records:** The ideal substrate — claims completeness for exposure, EHR labs\n  for the feedback confounders, vital records for censoring — but linkage selection (only the linkable\n  subset) and order/fill/service date discrepancies must be reconciled before interval boundaries and\n  time zero are fixed.\n\n**Worked claims example.** Question: the per-protocol effect of *sustained* high-intensity statin\ntherapy versus *never* on first myocardial infarction over 24 months, in adults initiating a statin in\na linked Medicare FFS + lab-feed cohort, where time-varying LDL and a 
statin-associated myalgia event\ndrive discontinuation (classic treatment-confounder feedback). (1) Build a discrete-time table at\n**monthly** resolution: one row per `person_id` per month from the index fill. (2) Time-varying treatment\n`A_k` = covered by high-intensity statin in month k, derived from `fill_date` + `days_supply` with a\n30-day grace period and stockpiling carried forward. (3) History `H_k` = baseline covariates (age, sex,\nprior CVD, diabetes) plus time-varying LDL (most recent lab carried forward), a myalgia/myopathy\ndiagnosis flag, and prior-month treatment `A_{k-1}` — exactly the post-baseline factors a time-dependent\nCox would mishandle. (4) Fit the **per-interval treatment model** E(A_k | H_k) by pooled logistic\nregression over person-months. (5) Specify an SNMM blip linear in treatment with an LDL interaction\nterm to let the effect be modified by current LDL, then **g-estimate** ψ by solving\nΣ_k [A_k − Ê(A_k|H_k)]·H(ψ) = 0 over a grid (or by root-finding), bootstrapping persons for confidence\nintervals. (6) Layer **IPCW** for informative censoring at disenrollment and apply a competing-risks /\ndeath handling for the elderly; restrict to full A/B/D person-time and drop MA-only months so \"no fill\"\nis a true non-treatment, not missingness. 
(7) Report the sustained-vs-never effect and its modification\nby LDL, with sensitivity analyses on grace period, interval width, blip form, and a negative-control\noutcome to probe residual confounding, never tuning the blip to the result.\n\n**Interpreting the output**\n\nUsing the simplified 12-month worked example (not the 24-month claims example above): the g-estimated blip\nparameter is psi ≈ −0.04 per treated month, representing the causal effect of one interval of sustained statin\ntherapy on 12-month heart attack probability.\n\nFormal interpretation: The g-estimate of psi ≈ −0.04 is the structural nested model blip-function parameter:\nthe per-interval causal effect of treatment on the subsequent outcome, estimated by finding the value of psi at\nwhich the treatment residual (observed treatment minus predicted treatment probability from the propensity nuisance\nmodel) is uncorrelated with the blipped-down outcome (observed outcome adjusted to remove the posited treatment\neffect). Unlike the marginal coefficient from an MSM, psi is a conditional structural parameter: it describes the\naverage treatment effect among patients who share the same observed treatment and covariate history, including\nthe time-varying LDL trajectory. It is valid under sequential exchangeability (at each month, no unmeasured\nconfounding of statin receipt given the full observed history), a correctly specified model for the probability\nof statin receipt given that history, and consistency: \"one month of high-intensity statin\" corresponds to a\nwell-defined, replicable intervention. Confidence intervals derive from a bootstrap or the variance of the\nestimating equation, not from a parametric model formula.\n\nPractical interpretation: Under these assumptions, each month of sustained high-intensity statin therapy is\nestimated to reduce 12-month heart attack probability by approximately 4 percentage points, after accounting for\nthe LDL feedback loop that defeats standard regression. A standard regression that adjusts for monthly LDL would\nblock the very pathway (statin lowers LDL, lower LDL prevents heart attacks) that constitutes the drug's mechanism\nof benefit; g-estimation avoids this over-adjustment by modeling the treatment decision rather than conditioning\non the intermediate confounder in the outcome model.",
    "primary_category": "Causal_Inference_Method",
    "tags": [
      "g-estimation",
      "structural-nested-models",
      "snmm",
      "rpsftm",
      "g-methods",
      "time-varying-confounding",
      "treatment-confounder-feedback",
      "blip-function",
      "semiparametric",
      "per-protocol",
      "dynamic-treatment-regimes"
    ],
    "applies_to_study_types": [
      "new_user",
      "active_comparator_new_user",
      "cohort_retrospective",
      "cohort_prospective",
      "claims_analysis",
      "ehr_study",
      "pragmatic_trial"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "linked",
      "registry"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1214/14-STS493",
        "url": "https://doi.org/10.1214/14-STS493",
        "citation_text": "Vansteelandt S, Joffe M. Structural nested models and G-estimation: the partially realized promise. Statistical Science. 2014;29(4):707-731.",
        "year": 2014,
        "authors_short": "Vansteelandt & Joffe",
        "notes": "The definitive modern review of SNMs and g-estimation; traces the method from Robins' 1986 healthy-worker-survivor work, derives the estimating equation, explains the double-robustness/ null-preservation property, and contrasts SNMs with MSMs and the g-formula. The single best reference for understanding what g-estimation buys and where it has been underused."
      },
      {
        "role": "explain",
        "doi": "10.1093/ije/dyw323",
        "url": "https://doi.org/10.1093/ije/dyw323",
        "citation_text": "Naimi AI, Cole SR, Kennedy EH. An introduction to g methods. International Journal of Epidemiology. 2017;46(2):756-762.",
        "year": 2017,
        "authors_short": "Naimi et al.",
        "notes": "Accessible side-by-side tutorial on the three g-methods (g-formula, IPTW-MSM, g-estimation of SNMs); pinpoints when g-estimation is the right choice — effect modification by time-varying factors — and gives the intuition for the estimating equation with a worked simple example. (Crossref issued date 2016; print citation 2017;46(2).)"
      },
      {
        "role": "demonstrate",
        "doi": "10.1097/00001648-200009000-00011",
        "url": "https://doi.org/10.1097/00001648-200009000-00011",
        "citation_text": "Robins JM, Hernán MA, Brumback B. Marginal structural models and causal inference in epidemiology. Epidemiology. 2000;11(5):550-560.",
        "year": 2000,
        "authors_short": "Robins et al.",
        "notes": "Foundational g-methods paper that frames the time-varying-confounding-with-feedback problem and positions g-estimation of SNMs alongside MSMs; essential for the conceptual placement of SNMs within the g-methods family and the precise estimand each method targets."
      },
      {
        "role": "use",
        "doi": "10.1080/01621459.2012.682532",
        "url": "https://doi.org/10.1080/01621459.2012.682532",
        "citation_text": "Picciotto S, Hernán MA, Page JH, Young JG, Robins JM. Structural nested cumulative failure time models to estimate the effects of interventions. Journal of the American Statistical Association. 2012;107(499):886-900.",
        "year": 2012,
        "authors_short": "Picciotto et al.",
        "notes": "Operational g-estimation of structural nested cumulative failure time models with time-varying exposure and informative censoring (healthy-worker-survivor setting); a concrete template for fitting and interpreting blip parameters and for the IPCW layering that real cohorts require."
      }
    ],
    "plain_language_summary": "G-estimation of structural nested models is a method for measuring how much a treatment that changes over time actually caused an outcome, in situations where a patient's lab value or disease marker both responds to earlier treatment and influences whether the doctor continues that treatment. Standard regression fails here because adjusting for that marker blocks part of the treatment effect while not adjusting leaves confounding unsolved. G-estimation gets around this by modeling the treatment decision process separately, then working backward to find the treatment effect that makes the adjusted outcome statistically independent of the treatment decision. It requires careful pre-specification of how the effect is structured and is harder to implement than conventional regression, but it is one of the few methods that can give a correct answer in this setting.",
    "key_terms": [
      {
        "term": "time-varying confounding",
        "definition": "A situation where a patient characteristic (like a lab value) changes over the study period, affects whether the patient receives treatment at each point in time, and also influences the final outcome — making it both a confounder and something that can be affected by prior treatment."
      },
      {
        "term": "treatment-confounder feedback",
        "definition": "The specific problem where earlier treatment changes the very lab value or marker that then drives the doctor's next treatment decision, so the confounder and the treatment are caught in a loop across time."
      },
      {
        "term": "structural nested model",
        "definition": "A mathematical description of the size of the treatment effect at each point in time, defined as the difference between what the outcome would have been if the patient continued treatment from that point forward versus if they had stopped — built up interval by interval rather than all at once."
      },
      {
        "term": "g-estimation",
        "definition": "The numerical procedure that finds the treatment effect by searching for the value that, after mathematically removing the treatment effect from the outcome, makes the adjusted outcome uncorrelated with the treatment decision at each time point."
      },
      {
        "term": "blip function",
        "definition": "The specific form of the treatment effect at one time interval — how much treated patients' outcomes shifted compared to what they would have been had they stopped treatment at that moment — which the analyst pre-specifies before looking at the data."
      },
      {
        "term": "estimating equation",
        "definition": "A formula set equal to zero whose solution is the treatment effect estimate; in g-estimation it says that once the right treatment effect is removed from the outcome, the residual outcome should be uncorrelated with the treatment decision."
      }
    ],
    "worked_example": {
      "scenario": "A cardiologist starts 200 patients on a statin and follows them for 12 monthly intervals to see who has a heart attack. Each month, the patient's LDL cholesterol is measured. The statin lowers LDL, but low LDL also makes the cardiologist more likely to keep prescribing. We want to know the true effect of sustained statin therapy on heart attack risk. A standard regression that controls for monthly LDL would be wrong — here is why, and how g-estimation fixes it.",
      "dataset": {
        "caption": "Three monthly snapshots for one patient illustrating how LDL sits between prior treatment and future treatment.",
        "columns": [
          "month",
          "statin_taken",
          "ldl_measured",
          "heart_attack_by_month_12"
        ],
        "rows": [
          [
            1,
            1,
            142,
            0
          ],
          [
            2,
            1,
            118,
            0
          ],
          [
            3,
            0,
            135,
            0
          ]
        ]
      },
      "steps": [
        "WHY ORDINARY REGRESSION FAILS: In month 2, LDL dropped from 142 to 118 because the statin taken in month 1 worked. That lower LDL then influenced whether the doctor prescribed the statin again in month 3. LDL in month 2 is simultaneously a consequence of month-1 treatment and a cause of month-3 treatment and the eventual outcome.",
        "If we add monthly LDL as a covariate in a time-dependent regression, we block the path 'statin lowers LDL, lower LDL prevents heart attacks' — we would be subtracting out part of the very benefit we are trying to measure.",
        "If we instead leave LDL out to avoid blocking that path, the regression is confounded because sicker patients (higher LDL) get more aggressive treatment, which distorts the treatment-outcome comparison.",
        "Neither choice is correct with a single regression model. This is the core problem g-estimation is designed to solve.",
        "HOW G-ESTIMATION RECOVERS THE EFFECT: Instead of adjusting for LDL in an outcome model, g-estimation focuses on the treatment decision. At each month, it fits a model asking: given everything known about this patient at this moment (age, sex, prior LDL, prior treatment), what was the probability the doctor would prescribe the statin? The difference between the actual prescription (1 or 0) and that predicted probability is the treatment residual.",
        "G-estimation then proposes a candidate value for the statin effect — say, psi = -0.04, meaning each month of treatment reduces 12-month heart attack risk by 0.04 on the probability scale.",
        "It uses that candidate to mathematically remove the treatment effect from the observed outcomes, creating a blipped-down outcome: what each patient's outcome would have looked like if no one had ever received any treatment.",
        "If psi is the correct treatment effect, that blipped-down outcome should be statistically uncorrelated with the treatment residuals computed in step 5 — because once you remove the treatment effect, the leftover variation in outcomes should no longer track who got treated.",
        "The algorithm tests candidate values of psi — conceptually sweeping across a range — and finds the one value where the correlation between treatment residuals and blipped-down outcomes equals zero. That is the g-estimate.",
        "The result is a consistent estimate of the per-protocol effect of sustained statin therapy versus never treating, even though LDL was both affected by earlier statin use and drove later statin decisions."
      ],
      "result": "The g-estimated blip psi represents the causal effect of one interval of statin treatment on heart attack risk, adjusted for the full time-varying confounder history in a way that neither blocks the mediated pathway nor leaves confounding unaddressed. In the statin cohort the method would report something like psi = -0.04 per treated month (with a bootstrap confidence interval), interpreted as: each month of sustained high-intensity statin therapy reduces 12-month heart attack probability by approximately 4 percentage points, after correctly accounting for the LDL feedback loop that defeats standard regression."
    },
    "prerequisites": [
      "dags-backdoor-criterion-drug-studies",
      "marginal-structural-models-g-methods",
      "standard-cox-time-dependent"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Structural nested mean model (SNMM) for continuous/binary outcomes",
        "description": "Parameterizes the conditional mean blip — the effect of treatment in interval k, given history H_k, with the rest of the treatment course set to a reference (usually zero). G-estimation solves the estimating equation using a fitted per-interval treatment model; effect modification by time-varying H_k enters the blip directly.",
        "edge_cases": [
          "Double robustness is partial and only at the null; away from the null, joint misspecification of blip and treatment model yields silent bias with no diagnostic warning.",
          "Requires a correctly specified per-interval propensity model E(A_k | H_k); positivity in occupied strata is necessary for the estimating equation to be informative."
        ],
        "data_source_notes": "claims/EHR: ideal for longitudinal biomarkers (LDL, HbA1c) or QoL scores where the confounder is also the outcome's antecedent; needs lab linkage so the feedback confounder is observed. R: gesttools / DTRreg fit SNMMs via g-estimation.",
        "citations": [
          "vansteelandt-2014",
          "naimi-2017"
        ]
      },
      {
        "name": "Rank-preserving / structural nested failure time model (RPSFTM) for survival",
        "description": "Models the accelerated-failure-time blip — how treatment in each interval expands or compresses counterfactual survival time — and g-estimates the acceleration factor so that the back-transformed counterfactual time is independent of observed treatment given history.",
        "edge_cases": [
          "Sensitive to the treatment-free survival/blip form; the common \"common treatment effect\" assumption can be implausible across heterogeneous patients.",
          "Administrative censoring must be re-censored on the counterfactual scale; informative censoring needs IPCW."
        ],
        "data_source_notes": "claims: standard for time-varying cumulative-dose or switching regimens in drug safety and occupational epi (e.g., Picciotto metalworking-fluid mortality). R: rpsftm package for the trial-style version; custom estimating equations for the full observational SNFTM.",
        "citations": [
          "picciotto-2012",
          "vansteelandt-2014"
        ]
      },
      {
        "name": "G-estimation with IPCW and/or multivariate (multi-drug, dose) blips",
        "description": "Extends the estimating equation with inverse-probability-of-censoring weights for informative loss to follow-up, and/or vector-valued blips for multiple concurrent treatments or dose levels per interval.",
        "edge_cases": [
          "Positivity must hold for every treatment combination, which is demanding for polypharmacy/sequential regimens.",
          "Competing risks (death in elderly claims) interact with censoring and must be modeled, not ignored."
        ],
        "data_source_notes": "Relevant to sequential lines of therapy and dose titration; usually paired with target-trial cloning/censoring for a clean per-protocol estimand.",
        "citations": [
          "picciotto-2012"
        ]
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "marginal-structural-models-g-methods",
        "pros_of_this": "Directly parameterizes effect modification by time-varying covariates; partial null-robustness gives a valid test of the sharp null even under blip misspecification; avoids the extreme inverse-probability weights and the strict positivity demands of IPTW, often improving efficiency and stability.",
        "cons_of_this": "Harder to specify (blip + treatment model), solved numerically, and harder to communicate than a hazard ratio; double robustness is partial and silent when violated away from the null; much less software support than IPTW.",
        "when_to_prefer": "When effect modification by post-baseline time-varying factors is the scientific target, when IPTW positivity/weights are problematic, or when null-robust testing is valued. Many analyses report both g-estimation and MSM results."
      },
      {
        "compared_to": "standard-cox-time-dependent",
        "pros_of_this": "Consistent under time-varying confounding with treatment-confounder feedback, where a naive time-dependent Cox model that conditions on the affected confounder is biased (blocks mediation, conditions on a collider).",
        "cons_of_this": "Far higher technical and communication barrier; outputs are structural-model parameters, not plug-and-play hazard ratios; requires a per-interval treatment model and pre-specified blip form.",
        "when_to_prefer": "Sustained/time-varying exposures with time-varying confounders affected by prior treatment. If the confounder is NOT affected by prior treatment (no feedback), a correctly adjusted Cox model is simpler and equally valid."
      },
      {
        "compared_to": "clone-censor-weight-per-protocol",
        "pros_of_this": "Modeling-forward route to the same sustained-strategy estimand; can be more efficient and directly parameterizes effect modification rather than relying on subgroup weighting.",
        "cons_of_this": "Less transparent than the clone/censor/weight design; numerical estimation and blip interpretation are less intuitive for non-statisticians and reviewers.",
        "when_to_prefer": "When efficiency or explicit effect-modification parameters matter more than the communicational clarity of the clone-censor-weight design."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Build a long-format discrete-time person-interval table from fills (fill_date + days_supply with grace/stockpiling rules); proxy time-varying confounders from diagnoses/procedures/utilization. Fit E(A_k | H_k) by pooled logistic over person-intervals and g-estimate the blip. Restrict to full A/B/D (or commercial pharmacy) person-time and exclude MA-only spans so \"no fill\" is true non-treatment, not missingness. Pure claims usually lack the lab confounders that motivate the method — link to labs/EHR or the no-unmeasured-confounding-per-interval assumption fails.",
      "ehr": "Supplies the time-varying labs/vitals that make feedback measurable, but irregular visit-driven measurement means H_k is observed only at encounters (carry-forward or model the timing) and out-of-system exposure is missed. Prefer EHR linked to claims for complete fill history.",
      "registry": "Protocolized repeated measures of treatment/severity/adjudicated outcomes give an excellent H_k history and validate claims-based g-estimation; link to claims for full pharmacy exposure and to a vital-records index for censoring/death.",
      "linked": "Ideal substrate (claims completeness + EHR labs + vital-records censoring); reconcile order/fill/service date discrepancies and account for linkage selection before fixing interval boundaries and time zero."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\nimport pandas as pd\nimport statsmodels.formula.api as smf\nfrom scipy.optimize import brentq\n\nTX_FORMULA = \"A_k ~ age + sex + ldl_k + myalgia_k + A_prev + k\"\n\ndef g_estimate_snmm(df: pd.DataFrame, outcome_col: str = \"Y\") -> float:\n    \"\"\"Return psi: the per-interval additive blip on a standardized SNMM.\n\n    df is long (one row per person-interval). A_k is the interval treatment; outcome_col is the\n    person-level final outcome repeated across that person's rows.\n    \"\"\"\n    d = df.copy()\n\n    # (1) Per-interval treatment (propensity) model E(A_k | H_k), pooled over person-intervals.\n    tx = smf.logit(TX_FORMULA, data=d).fit(disp=False)\n    d[\"a_resid\"] = d[\"A_k\"] - tx.predict(d)          # A_k - Ehat(A_k | H_k)\n\n    # Treatment dose carried by each person = sum of A_k; blip removes psi per treated interval.\n    person = d.groupby(\"person_id\").agg(dose=(\"A_k\", \"sum\"),\n                                        Y=(outcome_col, \"first\")).reset_index()\n    d = d.merge(person[[\"person_id\", \"dose\"]], on=\"person_id\")\n\n    def estimating_eq(psi: float) -> float:\n        # H(psi) = blipped-down outcome; under correct psi it is independent of A_k given H_k,\n        # i.e. 
the residual-weighted sum is zero in expectation.\n        h = d[outcome_col] - psi * d[\"dose\"]\n        return float(np.sum(d[\"a_resid\"] * h))\n\n    # Solve U(psi) = 0 by bracketing root-find (widen bracket if needed for your outcome scale).\n    return brentq(estimating_eq, -50.0, 50.0)\n\ndef bootstrap_ci(df: pd.DataFrame, n_boot: int = 500, seed: int = 1) -> tuple[float, float]:\n    \"\"\"Person-level bootstrap of psi (resample clusters, not rows).\"\"\"\n    rng = np.random.default_rng(seed)\n    ids = df[\"person_id\"].unique()\n    est = []\n    for _ in range(n_boot):\n        draw = rng.choice(ids, size=len(ids), replace=True)\n        # Relabel each draw with a fresh id so a person sampled twice forms two clusters;\n        # otherwise the per-person dose sum in g_estimate_snmm would be double-counted.\n        boot = pd.concat([df[df[\"person_id\"] == i].assign(person_id=j)\n                          for j, i in enumerate(draw)], ignore_index=True)\n        est.append(g_estimate_snmm(boot))\n    return float(np.percentile(est, 2.5)), float(np.percentile(est, 97.5))",
        "description": "G-estimation of a one-parameter structural nested mean model on a discrete-time person-interval table.\nRequired input (already constructed and de-duplicated), one row per person per interval k:\n  df: person_id, k (interval index), A_k (0/1 treated this interval),\n      Y (final continuous/standardized outcome, repeated on each person's rows),\n      plus history columns used in the treatment model:\n      age, sex, ldl_k (carried-forward lab), myalgia_k, A_prev (A_{k-1}), and prior treatment cumcount.\nThe function (1) fits the per-interval propensity model E(A_k | H_k) by pooled logistic regression,\n(2) forms the blipped-down outcome H(psi) = Y - psi * (sum of A_k over k), and (3) finds psi solving\nthe estimating equation sum_k [A_k - Ehat(A_k|H_k)] * H(psi) = 0. Bootstrap persons for inference.\nEffect modification is added by interacting psi with a time-varying modifier in H(psi).",
        "dependencies": [
          "pandas",
          "numpy",
          "statsmodels",
          "scipy"
        ],
        "source_citations": [
          "vansteelandt-2014",
          "naimi-2017"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(gesttools)\nlibrary(data.table)\n\n# gesttools expects an ordered long data.frame keyed by id and time, with outcome and exposure named.\ndat <- FormatData(\n  data       = as.data.frame(df),\n  idvar      = \"person_id\",\n  timevar    = \"k\",\n  An         = \"A_k\",                 # time-varying exposure\n  Cn         = NA,                    # add a censoring indicator name for IPCW\n  outcomevar = \"Y\",\n  varying    = c(\"ldl_k\", \"myalgia_k\", \"A_prev\")\n)\n\nfit <- gestSingle(\n  data          = dat,\n  idvar         = \"person_id\",\n  timevar       = \"k\",\n  Yn            = \"Y\",\n  An            = \"A_k\",\n  Cn            = NA,\n  outcomemodels = list(\"Y ~ A_k + age + sex + ldl_k\"),  # blip; add A_k:ldl_k for effect modification\n  propensitymodel = \"A_k ~ age + sex + ldl_k + myalgia_k + A_prev + k\",\n  type          = 1                    # 1 = SNMM blip constant over time\n)\nprint(fit$psi)        # the g-estimated blip parameter(s)\n\n# --- Transparent fallback: solve sum_k (A_k - Ehat) * (Y - psi*dose) = 0 by hand ---\nsetDT(df)\ndf[, a_resid := A_k - predict(glm(A_k ~ age + sex + ldl_k + myalgia_k + A_prev + k,\n                                  family = binomial, data = df), type = \"response\")]\ndf[, dose := sum(A_k), by = person_id]\nU <- function(psi) df[, sum(a_resid * (Y - psi * dose))]\npsi_hat <- uniroot(U, interval = c(-50, 50))$root",
        "description": "G-estimation of an SNMM with gesttools on the same long-format person-interval table. Input columns:\n  person_id, k (interval index), A_k (0/1), outcome Y, and history (age, sex, ldl_k, myalgia_k, A_prev).\ngesttools::gestSingle fits the per-interval propensity model and solves the estimating equation for the\nblip; a hand-rolled root-finder is shown beneath as a transparent fallback that mirrors the Python code.",
        "dependencies": [
          "gesttools",
          "data.table"
        ],
        "source_citations": [
          "vansteelandt-2014",
          "picciotto-2012"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* Step 1: per-interval treatment (propensity) model E(A_k | H_k); output predicted prob phat. */\nproc logistic data=work.pi descending noprint;\n  class sex / param=ref;\n  model A_k = age sex ldl_k myalgia_k A_prev k;\n  output out=work.ps p=phat;            /* phat = Ehat(A_k=1 | H_k) */\nrun;\n\n/* Person-level treatment dose = sum of A_k, merged back; a_resid = A_k - phat. */\nproc sql;\n  create table work.gest as\n  select s.*, s.A_k - s.phat as a_resid, d.dose\n  from work.ps s\n  inner join (select person_id, sum(A_k) as dose from work.pi group by person_id) d\n    on s.person_id = d.person_id;\nquit;\n\n/* Step 2: solve U(psi) = sum (A_k - phat) * (Y - psi*dose) = 0 by bisection. */\nproc iml;\n  use work.gest;  read all var {a_resid Y dose};  close work.gest;\n  start U(psi) global(a_resid, Y, dose);\n    return( sum( a_resid # (Y - psi * dose) ) );\n  finish;\n  lo = -50; hi = 50;                    /* widen for your outcome scale */\n  do iter = 1 to 200 until (hi - lo < 1e-6);\n    mid = (lo + hi)/2;\n    if U(lo)*U(mid) <= 0 then hi = mid; else lo = mid;\n  end;\n  psi_hat = (lo + hi)/2;\n  print psi_hat[label=\"G-estimated SNMM blip\"];\n  /* Person-level bootstrap (resample person_id, not rows) gives the CI; loop omitted for brevity. */\nquit;",
        "description": "G-estimation of a one-parameter SNMM in SAS, solving the estimating equation directly. Required input\n(post data-management), long format, one row per person-interval:\n  work.pi : person_id, k, A_k (0/1), Y (person-level outcome repeated), age sex ldl_k myalgia_k A_prev\nStep 1 fits the per-interval propensity model E(A_k | H_k) with PROC LOGISTIC and outputs the residual\nA_k - phat. Step 2 (PROC IML) brackets and bisects psi to solve U(psi)=sum (A_k-phat)*(Y-psi*dose)=0.\nFor a packaged route, Robins-group macros (%gestcox / %gformula) implement g-estimation of structural\nnested (failure time / mean) models; this PROC code shows the underlying estimator transparently.",
        "dependencies": [],
        "source_citations": [
          "vansteelandt-2014",
          "picciotto-2012"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  U((U unmeasured)) -.-> L0[L0 baseline confounder]\n  U -.-> L1[L1 time-varying confounder]\n  L0 --> A0[A0 treatment t0]\n  A0 --> L1\n  L0 --> L1\n  L1 --> A1[A1 treatment t1]\n  A0 --> A1\n  L0 --> Y[Y outcome]\n  L1 --> Y\n  A0 --> Y\n  A1 --> Y",
        "caption": "Treatment-confounder feedback that motivates g-estimation. L1 is caused by prior treatment A0 and also confounds A1 and Y. Conditioning on L1 in a time-dependent regression blocks the A0->L1->Y pathway and conditions on a collider; not conditioning leaves confounding. G-estimation models the treatment process E(A_k|H_k) and solves the estimating equation instead, giving a consistent estimate.",
        "alt_text": "Causal DAG with baseline confounder L0 and time-varying confounder L1 (caused by treatment A0), treatments A0 and A1, outcome Y, and an unmeasured U affecting L0 and L1, illustrating treatment- confounder feedback.",
        "source_type": "illustrative",
        "source_citations": [
          "vansteelandt-2014"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Spec[Pre-specify SNM blip form + estimand<br/>from subject-matter knowledge]\n  Spec --> Long[Build long person-interval table<br/>A_k, time-varying H_k, outcome]\n  Long --> Tx[Fit per-interval treatment model<br/>Ehat A_k given H_k]\n  Tx --> Blip[Form blipped-down outcome H psi]\n  Blip --> Solve[Solve U psi = sum A_k - Ehat times H psi = 0]\n  Solve --> Boot[Bootstrap persons for inference]\n  Boot --> Sens[Sensitivity: blip form, interval width,<br/>grace period, IPCW, negative control]",
        "caption": "G-estimation workflow. The blip and treatment model are pre-specified, the estimating equation is solved numerically for the structural parameter, and inference plus sensitivity analyses follow; the blip is never tuned to the result.",
        "alt_text": "Flowchart from pre-specifying the blip and estimand, building a long person-interval table, fitting the per-interval treatment model, forming the blipped-down outcome, solving the estimating equation, bootstrapping, and running sensitivity analyses.",
        "source_type": "illustrative",
        "source_citations": [
          "vansteelandt-2014"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "part_of",
        "target_slug": "marginal-structural-models-g-methods",
        "notes": "G-estimation of SNMs is one of the three core g-methods (with the g-formula and IPTW-MSMs), developed together by Robins to handle time-varying confounding with treatment-confounder feedback."
      },
      {
        "relation_type": "used_with",
        "target_slug": "target-trial-emulation",
        "notes": "G-estimation is a natural estimation engine for the per-protocol estimand of a target-trial emulation when the protocol involves a sustained, time-varying treatment strategy."
      },
      {
        "relation_type": "alternative_to",
        "target_slug": "clone-censor-weight-per-protocol",
        "notes": "Both target the sustained-strategy / per-protocol effect; CCW is design-forward (clone, censor, weight) while g-estimation is modeling-forward (solve the estimating equation for the blip)."
      },
      {
        "relation_type": "alternative_to",
        "target_slug": "standard-cox-time-dependent",
        "notes": "A time-dependent Cox model that conditions on a confounder affected by prior treatment is biased under feedback; g-estimation is consistent but more complex and yields structural-model parameters."
      },
      {
        "relation_type": "see_also",
        "target_slug": "time-updated-exposures-cumulative-dose-rwe",
        "notes": "Constructing the time-varying treatment indicator and cumulative dose is the upstream data step that feeds the per-interval treatment model and the blip."
      },
      {
        "relation_type": "see_also",
        "target_slug": "landmark-analysis",
        "notes": "Landmark analysis is a simpler fix for fixed post-baseline classification; g-estimation is the general approach when the exposure and confounders evolve continuously with feedback."
      },
      {
        "relation_type": "used_with",
        "target_slug": "predictive-and-causal-ml-models-rwe",
        "notes": "Causal ML (double ML/TMLE) and g-estimation are complementary semiparametric approaches; ML can flexibly fit the treatment/outcome nuisance models that g-estimation requires."
      },
      {
        "relation_type": "see_also",
        "target_slug": "instrumental-variables-pharmacoepi-rwe",
        "notes": "IV addresses unmeasured confounding via an instrument; g-estimation assumes no-unmeasured-confounding-per-interval but handles measured time-varying confounding with feedback."
      }
    ],
    "aliases": [
      "g-estimation",
      "structural nested models",
      "structural nested mean models",
      "SNMM",
      "RPSFTM",
      "structural nested failure time models",
      "Robins g-estimation",
      "blip function estimation"
    ],
    "complexity": "advanced",
    "regulatory_relevance": [
      "fda",
      "ema"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "gamma-distribution",
    "name": "Gamma Distribution for Cost and Skewed Outcomes",
    "short_definition": "The gamma distribution is a continuous probability model for strictly positive, right-skewed outcomes whose variance grows as the square of the mean — the empirical signature of healthcare cost data. A generalized linear model (GLM) with a gamma variance function and log link directly estimates the mean cost as a function of covariates on the natural dollar scale, producing exponentiated coefficients that are mean cost ratios and requiring no back-transformation. It is the modern default for healthcare cost regression in real-world evidence and HEOR, replacing log-OLS with smearing correction for primary inference on mean costs.",
    "long_description": "**What the gamma distribution is and why cost data fits it**\n\nThe gamma distribution is a two-parameter continuous probability distribution defined on strictly\npositive real values (0, ∞). Parameterised by a shape parameter α (alpha) and a scale parameter\nθ (theta), its mean is μ = αθ and its variance is Var(Y) = αθ² = μ²/α. This last expression is\nthe key: the variance grows as the *square* of the mean. Equivalently, the coefficient of\nvariation CV = SD/μ = 1/√α is a constant that does not depend on the mean level. As average\ncosts rise — because sicker patients with more comorbidities drive both higher spending and\nproportionally higher spread — the variance rises with them. This is precisely the structural\npattern that empirical healthcare cost data shows in claims, EHR, and registry datasets.\nOrdinary least squares (OLS) imposes none of this; the gamma distribution captures it by design.\n\nThe gamma family includes several familiar special cases. When α = 1 the gamma reduces to the\nexponential distribution. The chi-squared distribution with k degrees of freedom is a gamma with\nα = k/2 and θ = 2. The inverse-Gaussian (Wald) distribution is a related model with variance\ngrowing as μ³ — appropriate for very heavy-tailed costs when the Modified Park test (see below)\nreturns λ̂ near 3 rather than 2.\n\n**Why OLS fails for cost data**\n\nOrdinary least squares makes two structural assumptions that healthcare cost distributions\nroutinely violate. First, OLS assumes homoscedastic residuals — that the residual variance is\nconstant across the range of fitted values. In cost data the opposite is true: patients with high\npredicted costs show far greater spread than low-cost patients, a textbook case of\n*heteroscedasticity*. Second, OLS imposes no non-negativity constraint; for patients with\nfavourable covariates, a fitted OLS model can predict negative costs, which is not a meaningful\nquantity. 
Third, the heavy right tail of cost distributions (a small fraction of catastrophically\nexpensive patients) inflates the OLS estimate of the mean and renders confidence intervals\nunreliable at any sample size.\n\nThe most common fix — logging the outcome and running OLS on log(cost) — removes\nheteroscedasticity and skewness from the model but creates a new problem: log-OLS estimates\nE[log(Y)|X], not E[Y|X]. Back-transforming by exponentiating the fitted values recovers the\ngeometric mean, not the arithmetic mean. Because arithmetic means aggregate to population\ntotals — which is exactly what a payer budget-impact model or HTA submission requires — the\ngeometric mean is the wrong estimand. Duan's (1983) smearing estimator applies a nonparametric\ncorrection factor to recover the arithmetic mean, but this works correctly only when the\nlog-scale residuals are themselves homoscedastic, which can also fail. The gamma GLM eliminates\nboth the retransformation step and the residual-heteroscedasticity risk.\n\n**The gamma GLM with log link as the modern default**\n\nA GLM with a gamma variance function and a log link models E[Y|X] directly as:\n\n    E[Y|X] = μ = exp(β₀ + β₁X₁ + … + βₚXₚ)\n\nwith variance function Var(Y|X) = φ · μ² where φ = 1/α is the dispersion parameter. The\nlog link ensures that all fitted values are strictly positive regardless of the covariate values,\nand the mean is estimated on the original dollar scale without any retransformation step.\n\nThe gamma distributional assumption is used for efficiency (optimal weighting of observations)\nrather than as a hard requirement for consistency. Manning and Mullahy (2001) show that the\ngamma GLM estimator of E[Y|X] is consistent even when the variance function is misspecified,\nprovided the mean function (log link) is correctly specified. Sandwich (robust) standard errors\nprotect inference against variance misspecification. 
This robustness property makes the gamma\nGLM with log link a reliable workhorse rather than a fragile distributional bet.\n\n**Interpreting the output**\n\nConsider a gamma GLM with log link comparing an index treatment to a comparator on total annual\nhealthcare cost, adjusting for age, sex, Charlson Comorbidity Index, and baseline cost. The\nmodel returns a treatment coefficient of 0.405 with 95% CI [0.18, 0.63].\n\n*(1) Formal statistical interpretation.* exp(0.405) = 1.50. This is the model-estimated ratio\nof the mean total cost for treated patients to the mean total cost for comparator patients,\nholding the four adjustment covariates fixed at their observed values. The confidence interval\n[exp(0.18), exp(0.63)] = [1.20, 1.88] means that values of the true mean cost ratio between\n1.20 and 1.88 are compatible with the observed data at the conventional significance threshold —\nit does NOT mean there is a 95% probability that the true ratio lies in this interval. A\nfrequentist CI is not a credible interval; it is a property of the procedure across hypothetical\nrepeated samples.\n\n*(2) Practical interpretation for a decision-maker.* \"After adjusting for patient age, sex,\ncomorbidity burden, and prior-year cost, treated patients' expected annual costs are\napproximately 50% higher than comparator patients' expected costs, with data consistent with\n20% to 88% higher costs. 
This is an adjusted association from observational data; whether it\nreflects a causal cost increment attributable to the treatment depends on the extent to which\nthe measured covariates capture the full confounding structure — in particular, residual\nconfounding by disease severity or unmeasured indication factors may still be present.\"\n\nNote what the coefficient does NOT tell you: it says nothing about median costs, it says nothing\nabout absolute dollar differences (to get those, multiply the reference group's mean cost by\n0.50, the ratio minus one), and it does not by itself establish causality, because the design is\nobservational.\n\n**The Modified Park test for choosing the variance family**\n\nManning and Mullahy (2001) proposed the Modified Park test to guide selection among variance\nfamilies. The test fits a preliminary model for the mean (commonly a gamma GLM with log link),\ncomputes the squared residuals on the raw cost scale, regresses log(squared residuals) on\nlog(fitted mean), and examines the slope coefficient λ̂:\n\n- λ̂ ≈ 0: constant variance; OLS on levels is appropriate.\n- λ̂ ≈ 1: variance proportional to μ; Poisson GLM or quasi-Poisson.\n- λ̂ ≈ 2: variance proportional to μ²; gamma GLM is the natural choice.\n- λ̂ ≈ 3: variance proportional to μ³; inverse-Gaussian GLM.\n\nIn practice, healthcare cost data in pharmacy and medical claims typically returns λ̂ in the\nrange 1.5–2.5, landing firmly in the gamma zone. The Poisson GLM with log link is a useful\nalternative when λ̂ is near 1 or when zeros are present (the Poisson distribution is defined at\nzero, making it more flexible than the gamma for zero-inclusive distributions when used as a\nquasi-likelihood estimator). The Modified Park test has limited power in small samples and should\nbe combined with residual plots (plot raw residuals against fitted values) for a fuller\ndistributional diagnosis.\n\n**The zero problem**\n\nThe gamma distribution is defined on strictly positive values — it assigns zero probability to\nY = 0. 
In real claims datasets, patients with zero total cost during a measurement window are\ncommon: newly enrolled members who used no services, patients with only well-visit care outside\nthe study window, or patients whose costs fall below the deductible and are not reflected in\nclaims. Applying a gamma GLM directly to a zero-inclusive distribution yields biased estimates\nbecause the zeros are structurally incompatible with the model's support.\n\nThe standard solution for zero-heavy cost outcomes is a *two-part model* (hurdle model): Part 1\nfits a logistic regression for P(Y > 0 | X), and Part 2 fits a gamma GLM for E[Y | Y > 0, X].\nThe overall mean is E[Y|X] = P(Y > 0|X) × E[Y | Y > 0, X]. When only a small fraction of\npatients have zero costs (< 5–10%), the bias from ignoring zeros is modest and a gamma GLM with\na small constant added (e.g., Y + 0.01) or a Poisson quasi-likelihood approach may be\nacceptable as a sensitivity analysis. For serious analysis of zero-heavy cost distributions, the\ntwo-part model is the principled choice and is discussed further in the\nhealthcare-costs-pppm-pppy-pmpm concept.\n\n**Pros, cons, and trade-offs**\n\n*Pros:*\n- Directly estimates E[Y|X] on the natural dollar scale; exponentiated coefficients are mean\n  cost ratios with no retransformation step, retransformation bias, or smearing correction.\n- The variance function V(μ) = φμ² matches the empirical pattern of claims-based cost data\n  (rising spread with rising mean; approximately constant coefficient of variation).\n- Point estimates are consistent under mild variance misspecification when the log link is\n  correctly specified (Manning & Mullahy 2001); sandwich SEs protect inference.\n- Accommodates covariate adjustment, inverse-probability-of-treatment weighting (IPTW),\n  stratification, and marginal standardisation in a single GLM framework.\n- Implemented natively in statsmodels (Python), base R, and SAS PROC GENMOD — no specialty\n  packages required.\n- 
Marginal standardisation (averaging model-predicted costs under counterfactual treatment\n  assignments) is straightforward and produces absolute risk/cost differences as well as ratios.\n\n*Cons:*\n- Does not handle zeros — requires a two-part model for cost distributions with a non-trivial\n  zero mass.\n- Exponentiated coefficients are mean *ratios*, not mean *differences* in dollars; to report\n  absolute cost increments, multiply by the reference group's marginal mean.\n- Standard asymptotic standard errors assume correct variance specification; use sandwich SEs\n  as the default in observational cost analyses.\n- Gamma GLM can overfit in very small samples (< 30 per group); bootstrapped confidence\n  intervals are advisable when n is small.\n- Modified Park test has low power at small n; lambda_hat estimates are noisy and should be\n  combined with residual plots, not used as the sole criterion.\n\n**When to use**\n\nUse the gamma GLM with log link when:\n- The outcome is strictly positive continuous — per-patient total cost, medical charges, length\n  of stay expressed in days when always > 0, per-member-per-month cost.\n- The distribution is right-skewed with variance that increases with the mean — the standard\n  pattern for pharmacy and medical claims costs.\n- The target estimand is the *mean* cost (not the median), because means aggregate to population\n  totals needed for budget-impact modelling, value-framework submissions, and payer negotiations.\n- Covariate adjustment for measured confounders, IPTW weighting from a propensity model, or\n  standardisation across a reference population is required alongside the cost regression.\n- The Modified Park test returns λ̂ near 2, or a histogram/QQ plot of log-residuals shows\n  approximately homoscedastic spread consistent with a gamma variance function.\n- The dataset has few or no zero-cost patients (< 5–10%); if zeros are common, a two-part\n  model wrapper around the gamma GLM is the preferred 
architecture.\n\n**When NOT to use**\n\n- *Zeros in the outcome:* if 10% or more of patients have zero total cost in the study window,\n  the gamma GLM will produce biased estimates and must be replaced by or wrapped in a two-part\n  model. Applying the gamma to a zero-inclusive dataset without adjustment is a methodological\n  error, not a minor approximation.\n- *Bounded outcomes:* proportions (0–1), adherence rates, and other outcomes bounded above\n  require beta regression, not a gamma GLM. The gamma is unbounded above and its variance\n  structure is incorrect for bounded scales.\n- *Count outcomes:* hospital admissions, emergency department visits, prescription fills, and\n  other discrete count outcomes belong in Poisson or negative-binomial GLMs. The gamma is a\n  model for continuous quantities, not integers.\n- *When the estimand is median cost:* if the policy question concerns the patient in the middle\n  of the cost distribution rather than the population mean, quantile regression (specifically\n  median regression) is the appropriate tool. Gamma GLM targets the mean, not the median.\n- *When causal language is required but confounding is uncontrolled:* a gamma GLM on raw\n  observational data estimates an adjusted association, not a causal effect. Pair with\n  propensity score weighting, matching, or g-methods, and be explicit that the coefficient is an\n  association conditional on measured covariates.\n- *Small n with extreme outliers:* in cohorts of fewer than 20 patients per group, a single\n  catastrophically expensive patient can dominate the GLM fit. Bootstrap confidence intervals\n  and a sensitivity analysis with the outlier removed or winsorized are advisable. 
This is\n  distinct from the general robustness of the gamma GLM at moderate to large sample sizes.\n- *Log-normal comparison:* when a log-normal model fits better (the log-residuals are normally\n  distributed and homoscedastic), the log-OLS + smearing approach and the gamma GLM can give\n  similar point estimates but different standard errors; see the log-normal-distribution sibling\n  entry for the transform-vs-GLM debate.",
    "primary_category": "Inferential_Statistics",
    "tags": [
      "statistics",
      "primitive",
      "distributions",
      "cost-analysis",
      "glm",
      "gamma",
      "log-link",
      "healthcare-costs",
      "retransformation",
      "heteroscedasticity"
    ],
    "applies_to_study_types": [
      "cohort_retrospective",
      "cohort_prospective",
      "cross_sectional",
      "claims_analysis",
      "ehr_study",
      "registry_study",
      "active_comparator_new_user",
      "descriptive_analysis"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "primary",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1016/S0167-6296(01)00086-8",
        "url": "https://doi.org/10.1016/S0167-6296(01)00086-8",
        "citation_text": "Manning WG, Mullahy J. Estimating log models: to transform or not to transform? Journal of Health Economics. 2001;20(4):461-494.",
        "year": 2001,
        "authors_short": "Manning & Mullahy",
        "notes": "The foundational paper establishing the gamma GLM with log link as the preferred estimator for healthcare costs, introducing the Modified Park test for variance family selection and demonstrating analytically why the retransformation bias of log-OLS is avoidable. Required reading for any HEOR analyst building a cost regression model."
      },
      {
        "role": "explain",
        "doi": "10.1258/1355819042250249",
        "url": "https://doi.org/10.1258/1355819042250249",
        "citation_text": "Barber J, Thompson S. Multiple regression of cost data: use of generalised linear models. Journal of Health Services Research & Policy. 2004;9(4):197-204.",
        "year": 2004,
        "authors_short": "Barber & Thompson",
        "notes": "Accessible practical guide to applying GLMs for cost data, comparing gamma GLM to log-OLS and other approaches with worked examples. Particularly useful for analysts transitioning from log-transformation habits to the GLM framework, and for reporting in clinical and health policy journals that expect accessible methodology sections."
      },
      {
        "role": "demonstrate",
        "doi": "10.1023/a:1012597123667",
        "url": "https://doi.org/10.1023/a:1012597123667",
        "citation_text": "Blough DK, Ramsey SD. Using generalized linear models to assess medical care costs. Health Services and Outcomes Research Methodology. 2000;1(2):185-202.",
        "year": 2000,
        "authors_short": "Blough & Ramsey",
        "notes": "Applied demonstration of gamma GLMs for medical care cost data in managed care populations, including practical guidance on model diagnostics, choice of link function, and interpretation of coefficients in a health economics context. Bridges the statistical theory in Manning & Mullahy to the operational practice of running cost regressions in claims data."
      },
      {
        "role": "use",
        "doi": "10.1080/01621459.1983.10478017",
        "url": "https://doi.org/10.1080/01621459.1983.10478017",
        "citation_text": "Duan N. Smearing estimate: a nonparametric retransformation method. Journal of the American Statistical Association. 1983;78(383):605-610.",
        "year": 1983,
        "authors_short": "Duan",
        "notes": "The original paper introducing the smearing estimator that corrects the retransformation bias when exponentiating fitted values from a log-OLS model. Understanding Duan's correction and its assumptions (homoscedastic log-residuals) clarifies exactly why the gamma GLM with log link is superior: it renders the smearing step unnecessary by targeting E[Y|X] directly."
      }
    ],
    "plain_language_summary": "The gamma distribution is a statistical model for positive, right-skewed numbers like healthcare costs, where the spread of costs grows proportionally as the average cost grows. Fitting a gamma generalized linear model (GLM) with a log link lets an analyst estimate how patient characteristics or treatment multiply the average expected cost, producing results directly on the dollar scale — no mathematical back-transformation required. It is the modern standard for healthcare cost regression in real-world evidence studies, replacing the older practice of log-transforming costs and running ordinary regression, which recovers the geometric mean rather than the arithmetic mean that budget models and payers actually need. One important limitation: the gamma model does not accommodate patients with zero cost, so datasets with many zero-cost patients need a two-part model instead.",
    "key_terms": [
      {
        "term": "shape parameter",
        "definition": "A number (alpha) that controls how peaked or spread out the gamma distribution is; a larger shape parameter means costs cluster more tightly around the mean, while a smaller value means the distribution has a heavier tail."
      },
      {
        "term": "scale parameter",
        "definition": "A number (theta) that stretches or compresses the gamma distribution along the cost axis; together with the shape parameter it sets the mean and variance of the cost distribution."
      },
      {
        "term": "coefficient of variation",
        "definition": "The standard deviation divided by the mean (SD/mean); for a gamma distribution this ratio is constant regardless of the mean level, which matches the empirical observation that higher-cost patient groups show proportionally higher variability in costs."
      },
      {
        "term": "log link",
        "definition": "The mathematical function in the GLM that connects the regression equation (which can produce any number) to the mean cost (which must be positive) by exponentiating the linear predictor, ensuring all predicted costs are greater than zero."
      },
      {
        "term": "retransformation problem",
        "definition": "The bias that arises when you log-transform costs, fit a regression, then exponentiate the results to get back to dollars — the exponentiated fitted value gives the geometric mean, not the arithmetic mean that budget models need, unless a separate smearing correction is applied."
      },
      {
        "term": "heteroscedasticity",
        "definition": "A condition where the spread (variance) of the outcome differs across the range of predicted values; healthcare costs are heteroscedastic because expensive patients are also more variable in cost than inexpensive patients."
      }
    ],
    "worked_example": {
      "scenario": "A health outcomes analyst is studying total medical costs in a 12-month follow-up window for eight patients: four who received an index drug (treated) and four who received a comparator (untreated). The analyst wants to compute the mean cost for each group, confirm the mean ratio, and understand how a gamma GLM with log link represents this relationship so they can interpret the model coefficient correctly.",
      "dataset": {
        "caption": "Total 12-month medical cost (USD) per patient. The treated patients' values are exactly twice the untreated patients' values, making the mean ratio 2.0 and the coefficient of variation identical across groups — a clean illustration of the gamma CV property.",
        "columns": [
          "patient_id",
          "group",
          "total_cost_usd"
        ],
        "rows": [
          [
            "P1",
            "untreated",
            1000
          ],
          [
            "P2",
            "untreated",
            1500
          ],
          [
            "P3",
            "untreated",
            2000
          ],
          [
            "P4",
            "untreated",
            3500
          ],
          [
            "P5",
            "treated",
            2000
          ],
          [
            "P6",
            "treated",
            3000
          ],
          [
            "P7",
            "treated",
            4000
          ],
          [
            "P8",
            "treated",
            7000
          ]
        ]
      },
      "steps": [
        "Sum untreated costs: 1000+1500+2000+3500 = 8000. Mean untreated cost = 8000/4 = 2000.",
        "Sum treated costs: 2000+3000+4000+7000 = 16000. Mean treated cost = 16000/4 = 4000.",
        "Mean cost ratio (treated to untreated) = 4000/2000 = 2.0. Treated patients cost twice as much on average.",
        "In a gamma GLM with log link, the treatment coefficient represents the log of this mean ratio. The natural log of 2.0 is approximately 0.693, so the model returns a treatment coefficient near 0.693. Exponentiating confirms: exp(0.693) is approximately 2.0.",
        "Coefficient of variation check: for the untreated group, the deviations from the mean of 2000 are -1000, -500, 0, and +1500. The sample variance is ((-1000)^2+(-500)^2+0^2+1500^2)/(4-1) = (1000000+250000+0+2250000)/3 = 3500000/3, giving SD approximately 1080. CV = 1080/2000 approximately 0.54. For the treated group, all values are exactly double, so SD doubles to approximately 2160, mean doubles to 4000, and CV = 2160/4000 approximately 0.54. The CV is the same in both groups, exactly as the gamma variance function predicts.",
        "If the true coefficient were exactly ln(2) = 0.693, and if we reported it rounded to 0.405 in a real study (because of covariate adjustment shifting the estimate), then exp(0.405) = 1.50 — the model estimates treated patients cost about 50% more on average after covariate adjustment. This illustrates that the exp() of any log-link coefficient is always a mean cost ratio, never a mean cost difference in dollars."
      ],
      "result": "Mean untreated cost = 8000/4 = 2000. Mean treated cost = 16000/4 = 4000. Mean ratio = 4000/2000 = 2.0. In a gamma GLM with log link the treatment coefficient is approximately ln(2) near 0.693, and exp(0.693) is approximately 2.0, meaning treated patients' expected costs are twice as high. The coefficient of variation is approximately 0.54 in both groups, consistent with the gamma variance assumption that CV is constant across mean levels."
    },
    "prerequisites": [
      "descriptive-statistics",
      "inferential-statistics-foundations",
      "generalized-linear-models"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [],
    "tradeoffs": [
      {
        "compared_to": "log-normal-distribution",
        "pros_of_this": "Gamma GLM with log link targets E[Y|X] directly on the dollar scale; no smearing correction or retransformation step is needed. Coefficients are mean cost ratios interpretable without back-transformation. Robust to mild variance misspecification (Manning & Mullahy 2001).",
        "cons_of_this": "When the true data-generating process is log-normal, log-OLS with smearing can be slightly more efficient. Diagnostic tools (QQ plots, Modified Park test) are needed to distinguish gamma from log-normal; the two models give similar point estimates but different SEs and slightly different predictions in the tails.",
        "when_to_prefer": "Prefer gamma GLM when the Modified Park test returns lambda_hat near 2, or when avoiding the retransformation step is a priority for transparent reporting. Prefer log-normal when QQ plots of log-residuals show normality and heteroscedasticity in log-residuals is absent."
      },
      {
        "compared_to": "mann-whitney-u-test",
        "pros_of_this": "Gamma GLM produces a mean cost ratio and absolute mean difference — the quantities that payers, HTA bodies, and budget modellers need. The Mann-Whitney U test produces a test of stochastic dominance and a Hodges-Lehmann shift estimate, neither of which translates to budget impact.",
        "cons_of_this": "Mann-Whitney requires no distributional assumptions and is valid for any sample size; it is simpler to report in clinical journals and may be the primary analysis in small pilot studies where GLM assumptions cannot be verified.",
        "when_to_prefer": "Use gamma GLM as the primary analysis whenever mean costs are the decision-relevant estimand; use Mann-Whitney as a sensitivity check or when n is very small and GLM estimation is unstable."
      },
      {
        "compared_to": "healthcare-costs-pppm-pppy-pmpm",
        "pros_of_this": "Gamma GLM adds covariate adjustment, propensity score weighting, and formal inference to the PPPM/PPPY descriptive framework; it is the natural analytic step after computing per-patient cost totals.",
        "cons_of_this": "PPPM/PPPY totals are interpretable without a model and can be reported as simple means with bootstrap CIs; the gamma GLM requires more statistical infrastructure and model-checking.",
        "when_to_prefer": "Use gamma GLM whenever a causal or adjusted comparison is the goal; use PPPM/PPPY descriptively for Table 1, trend plots, and budget-impact inputs that do not require covariate adjustment."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Compute per-patient total cost or PPPM/PPPY totals as the outcome. Check the proportion of patients with zero cost in the measurement window before fitting the gamma GLM; if > 10%, consider a two-part model. Include baseline-period cost as a covariate to absorb cost heterogeneity. Run the Modified Park test on the full cost variable. Report sandwich (robust) standard errors as the default because claims cost distributions are routinely heteroscedastic beyond the gamma variance function.",
      "ehr": "EHR cost data typically captures facility and professional charges rather than paid amounts; clarify whether the outcome is charges, allowed amounts, or paid costs, as the distributional shape differs. Patient cost derived from EHR often has a higher zero-visit rate than claims (patients without documented encounters) — verify the zero prevalence and use a two-part model if needed. Lab and resource-utilisation costs from EHR are typically less skewed than claims total costs.",
      "registry": "Disease registries often capture high-acuity patients with low zero-cost rates and well-characterised comorbidities, making the gamma GLM appropriate. Validate that registry cost variables represent total cost rather than disease-specific cost; misspecified cost windows create artificial zeros.",
      "primary": "Prospective studies with cost diaries or administrative linkage can generate clean cost distributions suitable for gamma GLM. Small sample sizes in primary studies (n < 50 per group) require bootstrapped confidence intervals alongside the asymptotic GLM SEs; the Modified Park test has low power at small n.",
      "linked": "Linked claims-EHR-registry datasets with large n are the most favourable setting for the gamma GLM: the Modified Park test has adequate power, marginal standardisation is precise, and covariate adjustment for rich confounder sets is feasible. Report both the model-based mean ratio and the bootstrap-based mean difference for decision-maker audiences who need absolute dollar estimates alongside relative estimates."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\nimport statsmodels.api as sm\n\n# ── Simulate cost data: binary treatment (0=untreated, 1=treated) ──────────────────────────\nrng = np.random.default_rng(seed=0)\nn = 400\ntreatment = np.repeat([0, 1], n // 2)                       # 200 untreated, 200 treated\nage       = rng.normal(60, 10, n)\n# True model: log(E[cost]) = 7.0 + 0.405 * treatment + 0.02 * age\nmu_true = np.exp(7.0 + 0.405 * treatment + 0.02 * age)\nphi     = 0.5                                                # dispersion; shape = 1/phi = 2\ncost    = rng.gamma(shape=1.0 / phi, scale=mu_true * phi)   # E[cost] = mu_true\n\n# ── Design matrix ─────────────────────────────────────────────────────────────────────────\nX = sm.add_constant(np.column_stack([treatment, age]))       # [intercept, treatment, age]\n\n# ── Fit gamma GLM with log link ───────────────────────────────────────────────────────────\nfit = sm.GLM(\n    cost, X,\n    family=sm.families.Gamma(link=sm.families.links.Log())\n).fit(cov_type=\"HC3\")                                        # sandwich SEs by default\n\nprint(fit.summary())\n\n# ── Mean ratios: exp(coef) and 95% CI on ratio scale ─────────────────────────────────────\nnames = fit.model.exog_names\ncoefs = fit.params\ncis   = fit.conf_int()\nprint(\"\\nMean ratios (exp of coefficients):\")\nfor nm, b, lo, hi in zip(names, coefs, cis.iloc[:, 0], cis.iloc[:, 1]):\n    print(f\"  {nm:12s}  ratio={np.exp(b):.3f}  95% CI [{np.exp(lo):.3f}, {np.exp(hi):.3f}]\")\n\n# ── Marginal standardisation: mean predicted cost under each treatment assignment ─────────\nX_trt = sm.add_constant(np.column_stack([np.ones(n),  age]))\nX_unt = sm.add_constant(np.column_stack([np.zeros(n), age]))\nmean_trt = fit.predict(X_trt).mean()\nmean_unt = fit.predict(X_unt).mean()\nprint(f\"\\nMarginalised mean cost (treated):   ${mean_trt:,.0f}\")\nprint(f\"Marginalised mean cost (untreated): ${mean_unt:,.0f}\")\nprint(f\"Marginalised mean ratio: {mean_trt / 
mean_unt:.3f}\")\nprint(f\"Marginalised mean difference: ${mean_trt - mean_unt:,.0f}\")\n\n# ── Modified Park test: guide variance family selection ──────────────────────────────────\n# Step 1: preliminary log-OLS to get fitted values on the log scale\nlog_cost = np.log(cost)\nb_ols, _, _, _ = np.linalg.lstsq(X, log_cost, rcond=None)\nlog_fitted = X @ b_ols\nlog_sq_resid = np.log((log_cost - log_fitted) ** 2)\nvalid = np.isfinite(log_sq_resid)\n# Step 2: regress log(squared residuals) on log(fitted) -> slope = lambda_hat\npark_X      = sm.add_constant(log_fitted[valid])\npark_result = sm.OLS(log_sq_resid[valid], park_X).fit()\nlambda_hat  = park_result.params[1]\nprint(f\"\\nModified Park test: lambda_hat = {lambda_hat:.2f}\")\nprint(\"  ~0 -> OLS on levels  |  ~1 -> Poisson  |  ~2 -> Gamma  |  ~3 -> Inverse-Gaussian\")",
        "description": "Fits a gamma GLM with log link using statsmodels, demonstrates the Modified Park test for\nvariance family selection, and computes marginal standardisation to recover absolute mean\ncost differences alongside the exponentiated mean-ratio coefficients. All steps use the same\nsimulated binary-treatment cost dataset so output can be traced to the model specification.",
        "dependencies": [
          "statsmodels",
          "numpy"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "set.seed(0)\nn         <- 400\ntreatment <- rep(0:1, each = n / 2)\nage       <- rnorm(n, mean = 60, sd = 10)\n\n# True model: log(E[cost]) = 7.0 + 0.405 * treatment + 0.02 * age\nmu_true <- exp(7.0 + 0.405 * treatment + 0.02 * age)\nphi     <- 0.5           # dispersion; shape = 1/phi = 2\ncost    <- rgamma(n, shape = 1 / phi, scale = mu_true * phi)  # E[cost] = mu_true\n\ndf <- data.frame(cost = cost, treatment = treatment, age = age)\n\n# ── Fit gamma GLM with log link ───────────────────────────────────────────────────────────\nfit <- glm(cost ~ treatment + age, data = df, family = Gamma(link = \"log\"))\nsummary(fit)\n\n# ── Mean ratios: exp(coef) and 95% CI on ratio scale ─────────────────────────────────────\ncat(\"\\nMean ratios (exp of coefficients):\\n\")\nround(exp(coef(fit)), 3)\nround(exp(confint(fit)), 3)   # profile-likelihood CIs; use sandwich SEs for robustness\n\n# ── Marginal standardisation ──────────────────────────────────────────────────────────────\ndf_trt <- df; df_trt$treatment <- 1\ndf_unt <- df; df_unt$treatment <- 0\nmean_trt <- mean(predict(fit, newdata = df_trt, type = \"response\"))\nmean_unt <- mean(predict(fit, newdata = df_unt, type = \"response\"))\ncat(sprintf(\"\\nMarginalised mean cost (treated):   $%.0f\\n\", mean_trt))\ncat(sprintf(\"Marginalised mean cost (untreated): $%.0f\\n\", mean_unt))\ncat(sprintf(\"Marginalised mean ratio:      %.3f\\n\", mean_trt / mean_unt))\ncat(sprintf(\"Marginalised mean difference: $%.0f\\n\", mean_trt - mean_unt))\n\n# ── Modified Park test ────────────────────────────────────────────────────────────────────\nlog_cost   <- log(cost)\nb_ols      <- coef(lm(log_cost ~ treatment + age, data = df))\nlog_fitted <- cbind(1, treatment, age) %*% b_ols\nlog_sq_res <- log((log_cost - as.numeric(log_fitted))^2)\nvalid      <- is.finite(log_sq_res)\npark_fit   <- lm(log_sq_res[valid] ~ log_fitted[valid])\nlambda_hat <- coef(park_fit)[2]\ncat(sprintf(\"\\nModified Park test: lambda_hat = 
%.2f\\n\", lambda_hat))\ncat(\"  ~0 OLS  |  ~1 Poisson  |  ~2 Gamma  |  ~3 Inverse-Gaussian\\n\")",
        "description": "Fits a gamma GLM with log link in base R, reports exponentiated mean-ratio coefficients with\nconfidence intervals, performs marginal standardisation for absolute cost differences, and\nimplements the Modified Park test to confirm the gamma variance family. Uses the same\nsimulation parameters as the Python implementation for cross-language reproducibility.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* ── Create simulated cost dataset ── */\ndata work.cost_data;\n  call streaminit(0);\n  do i = 1 to 400;\n    treatment = (i > 200);            /* 0 = untreated (i=1-200), 1 = treated (i=201-400) */\n    age       = rand('normal', 60, 10);\n    mu_true   = exp(7.0 + 0.405 * treatment + 0.02 * age);\n    phi       = 0.5;                  /* dispersion; shape = 1/phi = 2 */\n    cost      = rand('gamma', 1 / phi) * (mu_true * phi);\n    log_cost  = log(cost);\n    output;\n  end;\n  keep i treatment age cost log_cost;\nrun;\n\n/* ── Gamma GLM with log link ─────────────────────────────────────────────────────────── */\nproc genmod data=work.cost_data;\n  class treatment (ref='0');\n  model cost = treatment age / dist=gamma link=log;\n  /* ESTIMATE statement: exponentiated coefficient = mean cost ratio treated vs untreated  */\n  estimate 'Treatment mean ratio' treatment 1 -1 / exp alpha=0.05;\n  /* LSMEANS with ILINK: back-transforms predicted LS means from log to dollar scale       */\n  lsmeans treatment / ilink cl;\nrun;\n/* Read the output:\n   - 'Exp(Estimate)' row in the ESTIMATE table = mean cost ratio (e.g. 
1.50)\n   - 95% CI on ratio scale = [Exp(Lower), Exp(Upper)]\n   - LSMEANS ILINK 'Mean' column = model-predicted mean cost per treatment arm in dollars\n   - 'Scale' parameter in GENMOD output = estimated dispersion phi (= 1/alpha)             */\n\n/* ── Modified Park test: confirm gamma variance family ──────────────────────────────────  */\n/* Step 1: log-OLS preliminary model                                                        */\nproc reg data=work.cost_data noprint;\n  model log_cost = treatment age;\n  output out=work.park_prep predicted=log_fitted residual=log_ols_resid;\nrun;\ndata work.park_data;\n  set work.park_prep;\n  log_sq_resid = log(log_ols_resid ** 2);   /* log of squared OLS residual on log scale  */\n  if not missing(log_sq_resid) and log_sq_resid ne .;\nrun;\n/* Step 2: regress log(sq resid) on log(fitted) -- slope = lambda_hat                      */\nproc reg data=work.park_data;\n  model log_sq_resid = log_fitted;\n  title 'Modified Park test: slope ~2 supports Gamma variance family';\n  /* lambda_hat ~0: OLS | ~1: Poisson | ~2: Gamma | ~3: Inverse-Gaussian               */\nrun;",
        "description": "Fits a gamma GLM with log link using PROC GENMOD, produces exponentiated treatment estimates\nas mean cost ratios, back-transforms least-squares means to the dollar scale, and runs the\nModified Park test as an auxiliary OLS regression using PROC REG. Comments explain the key\noptions and output rows an analyst needs to check.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Q[Positive continuous outcome\\ne.g. total annual cost] --> Z{Does the outcome\\nhave zeros?}\n  Z -->|Many zeros\\n>10 pct| TP[\"Two-part model:\\nLogistic for P(Y>0)\\n+ Gamma GLM for E(Y|Y>0)\"]\n  Z -->|No or few zeros\\n<5 pct| P[Modified Park test\\nor residual plot]\n  P -->|lambda near 2| G[\"Gamma GLM\\nlog link\\n(recommended default)\"]\n  P -->|lambda near 1| PO[\"Poisson GLM\\nor quasi-Poisson\\n(handles zeros)\"]\n  P -->|lambda near 0| OLS[\"OLS on levels\\n(rare for cost data)\"]\n  P -->|lambda near 3| IG[\"Inverse-Gaussian GLM\\n(very heavy tail)\"]\n  G --> OUT[\"Output:\\nexp(coef) = mean cost ratio\\nNo retransformation needed\"]",
        "caption": "Decision path from a positive continuous cost outcome to the appropriate GLM variance family. The Modified Park test guides the choice; gamma is the standard for healthcare costs.",
        "alt_text": "Flowchart: positive continuous outcome splits on zero prevalence into a two-part model path or a Modified Park test path, which branches to gamma, Poisson, OLS, or inverse-Gaussian GLM depending on the estimated lambda_hat value.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "requires",
        "target_slug": "generalized-linear-models",
        "notes": "The gamma GLM is a specific member of the GLM family; understanding the GLM framework (link function, variance function, iteratively reweighted least squares estimation) is a prerequisite for interpreting gamma model output, diagnosing misspecification, and choosing among the Poisson, gamma, and inverse-Gaussian alternatives via the Modified Park test."
      },
      {
        "relation_type": "used_with",
        "target_slug": "healthcare-costs-pppm-pppy-pmpm",
        "notes": "The gamma GLM is the standard regression engine for per-patient-per-month and per-patient- per-year cost metrics; once PPPM/PPPY totals are constructed, the gamma GLM with log link is used to compare them across cohorts with covariate adjustment and to produce mean-ratio estimates for budget-impact and value submissions."
      },
      {
        "relation_type": "see_also",
        "target_slug": "cost-outlier-handling-rwe",
        "notes": "Extreme cost outliers affect GLM coefficient stability, especially in small samples; outlier handling strategies (winsorization, truncation, two-part models for structural zeros) directly interact with the choice of variance family and should be pre-specified alongside the gamma GLM specification in the statistical analysis plan."
      },
      {
        "relation_type": "see_also",
        "target_slug": "log-normal-distribution",
        "notes": "The log-normal distribution and the gamma distribution are the two main alternatives for right-skewed cost data; when log-residuals are approximately normally distributed and homoscedastic, log-OLS with Duan smearing is a competitor to the gamma GLM. The log-normal targets a different quantity (geometric mean on the log scale vs arithmetic mean on the original scale) and requires the retransformation correction that the gamma GLM avoids."
      },
      {
        "relation_type": "see_also",
        "target_slug": "mann-whitney-u-test",
        "notes": "The Mann-Whitney U test is sometimes applied to cost data as a \"distribution-free\" alternative, but it tests stochastic dominance (rank ordering) rather than mean cost differences, making it unsuitable as the primary analysis when the budget-impact estimand is the mean. The gamma GLM produces a mean-ratio estimate with confidence interval interpretable for decision-making; Mann-Whitney should be reserved for sensitivity checks or descriptive comparisons."
      }
    ],
    "aliases": [
      "gamma GLM",
      "gamma regression",
      "gamma family",
      "GLM gamma log link",
      "gamma generalized linear model",
      "gamma cost model"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "hta",
      "fda",
      "ema"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "gee-population-average-models-rwe",
    "name": "GEE Population-Average (Marginal) Models",
    "short_definition": "A semiparametric regression method that estimates marginal (population-average) covariate effects for correlated/clustered outcomes by solving generalized estimating equations with a working correlation structure and robust sandwich variance, consistent for the mean model even when the correlation is misspecified.",
    "long_description": "**Generalized estimating equations (GEE)** extend generalized linear models to correlated data —\nrepeated measures within a patient, encounters within a provider, or members within a household — by\nmodeling the *marginal* mean of the outcome as a function of covariates while treating the\nwithin-cluster correlation as a nuisance. You specify (1) a link and variance function for the\nmarginal mean (e.g., logit + binomial, log + Poisson, identity + Gaussian), (2) a *working*\ncorrelation structure (independence, exchangeable, AR(1), unstructured), and (3) the robust\n(\"sandwich\") variance estimator. The estimating equations are solved without a full likelihood, so\nGEE is semiparametric: the point estimate of the regression coefficients is consistent for the\nmarginal mean model *even if the working correlation is wrong*, and the sandwich variance delivers\nvalid standard errors as the number of clusters grows. This makes GEE the workhorse for\npopulation-average questions in claims, EHR, and registry data, where each beneficiary contributes\nmany correlated observations and you want a single contrast that describes the average effect across\nthe population.\n\n**Core conceptual distinction (marginal vs conditional estimand).** This is the line every reviewer\nchecks first. GEE estimates a *population-average* (marginal) effect: how the *average* outcome in\nthe population shifts when a covariate changes, averaging over all clusters. A generalized linear\n*mixed* model (GLMM, e.g., a random-intercept logistic model) estimates a *subject-specific*\n(conditional) effect: how an *individual* cluster's outcome shifts, holding its random effect fixed.\nFor linear (identity) and log links these two targets coincide (the link is collapsible), so a GEE\nrate ratio and a mixed-model rate ratio estimate the same number. 
For the logit link they do **not**:\nbecause the logistic function is non-collapsible, the population-average log-odds-ratio is attenuated\ntoward the null relative to the subject-specific one by an approximate shrinkage factor of\n(c^2 σ_b^2 + 1)^(-1/2) with c = 16√3/(15π) ≈ 0.588, so c^2 ≈ 0.346 (Zeger, Liang & Albert 1988). A 0.7 subject-specific log-OR\nwith a large random-intercept variance can become a materially smaller marginal log-OR — same data,\ndifferent estimand, both correct for their question. Three further pillars: (i) *consistency vs\nefficiency* — the working correlation choice does not bias the marginal coefficients, it only affects\nefficiency; the sandwich is what protects you, so an honest default is \"independence working\ncorrelation + sandwich SE\" unless you have a strong, exogenous reason to model the correlation. (ii)\n*Missing-data regime* — ordinary GEE is consistent only under MCAR (missing completely at random);\nunder MAR you must use weighted GEE (inverse-probability-of-observation weights) or move to a\nlikelihood-based mixed model, which is valid under MAR. (iii) *Time-varying covariate feedback* — the\nPepe–Anderson (1994) caveat: when a time-varying covariate is endogenous (affected by prior outcome,\ni.e., there is feedback), GEE with a *non-independence* working correlation is biased for the\nmarginal effect; the independence working correlation with the sandwich variance is the safe choice,\nand for time-varying confounders on the causal pathway you should leave GEE entirely for marginal\nstructural models / g-methods.\n\n**Interpreting the output**\n\nConsider a Poisson GEE fit to quarterly hospitalization counts (SGLT2 vs DPP-4 initiators),\nwith exchangeable working correlation and the sandwich variance. 
The model returns an\nadjusted hospitalization rate ratio of 0.82 (95% CI 0.71–0.94) for the SGLT2 arm.\n\nFormal interpretation: The GEE coefficient exp(beta_arm) = 0.82 is a population-average\n(marginal) rate ratio — it compares the average hospitalization rate across all SGLT2\ninitiators to the average rate across all DPP-4 initiators, averaging over between-patient\nheterogeneity rather than conditioning on it. A mixed model asks a different, subject-specific\nquestion, but for this log-link model the two answers coincide: a random-intercept Poisson\nGLMM fit to the same data would target the same rate ratio, because the random intercept\nshifts only the intercept on the marginal scale. The marginal-versus-conditional gap appears\nwith the non-collapsible logit link (and with random slopes on the exposure); had the outcome\nbeen binary, the GLMM odds ratio would sit further from 1 than the GEE odds ratio, and the two\nwould not be interchangeable.\n\nPractical interpretation: On average across the study population, SGLT2 initiators\nwere hospitalized at an 18% lower rate per quarter than DPP-4 initiators after covariate\nadjustment. The sandwich variance is what makes the 95% CI honest despite the working\ncorrelation structure being only approximately correct. For policy, budget-impact, and\nHTA purposes — \"what is the effect on hospitalization burden if we switch this\npopulation?\" — the marginal GEE rate ratio is the directly relevant quantity. For\npredicting an individual patient's trajectory, a mixed model is the appropriate tool.\n\n**Pros, cons, and trade-offs** (specific and comparative).\n- **vs GLMM / mixed-effects models (random intercepts/slopes):** GEE targets the marginal estimand,\n  makes no distributional assumption on cluster effects, is robust to correlation misspecification,\n  and is computationally trivial. Cost: it gives no cluster-level (subject-specific) inference, is\n  *less efficient* than a correctly specified GLMM, and — critically — is only MCAR-valid, whereas a\n  mixed model is MAR-valid and recovers individual trajectories. 
**Prefer GEE** when the policy/HTA\n  question is \"what is the average effect in the population\" (e.g., average reduction in\n  hospitalizations under drug A vs B); **prefer a GLMM** when you need subject-specific prediction,\n  have dropout that depends on observed data (MAR), or want to model heterogeneity in slopes.\n- **vs MMRM (mixed model for repeated measures):** MMRM is a Gaussian likelihood-based model with an\n  unstructured within-patient covariance across visits, the regulatory default for continuous\n  longitudinal endpoints in RCTs, and MAR-valid. GEE is more flexible across non-Gaussian outcomes\n  (binary, count) and gives a marginal interpretation, but is anticonservative with few clusters and\n  MCAR-limited. **Prefer MMRM** for a continuous endpoint with monotone MAR dropout; **prefer GEE**\n  for binary/count marginal effects.\n- **vs cluster-robust (\"sandwich\") SE on an ordinary GLM:** A GLM fit under independence with\n  cluster-robust SEs is *algebraically the special case* of GEE with an independence working\n  correlation. GEE adds value only when you model a non-trivial working correlation for efficiency;\n  otherwise the two are the same estimator. **Prefer the explicit GEE machinery** when you want\n  exchangeable/AR(1) efficiency gains or weighted (IPW) GEE for MAR.\n- **vs marginal structural models (MSM) / g-methods:** Both target marginal effects, but MSMs handle\n  *time-varying confounding affected by prior treatment* via inverse-probability-of-treatment\n  weighting — exactly the setting where plain GEE is biased. **Never** use plain GEE for a sustained\n  treatment effect with feedback; use an IPTW-weighted GEE (which *is* the MSM fitting step) instead.\n\n**When to use** (decision rules). 
Population-average comparative effectiveness, safety, or utilization questions on\ncorrelated outcomes: repeated lab values (HbA1c), recurrent events / counts (hospitalizations,\nED visits) modeled per period, repeated binary status (controlled vs uncontrolled), clustered cross-\nsections (patients within practices), or paired/eye-level/limb-level data. Use GEE when the estimand\nis the average effect, the number of independent clusters is reasonably large (rule of thumb ≥ ~40),\nand missingness is plausibly MCAR or you can build observation weights for MAR. It is the natural\noutcome-model engine for a population-average treatment contrast after propensity-score weighting.\n\n**When NOT to use — and when it is actively misleading or dangerous.**\n- **Few clusters (< ~30–40).** The sandwich variance is consistent only as the number of clusters\n  grows; with few clusters it is *anticonservative* (SEs too small, type-I error inflated). Apply a\n  bias-corrected sandwich (Mancl–DeRouen 2001 or Kauermann–Carroll) or a small-sample df correction —\n  do not report the naive robust SE.\n- **Endogenous time-varying covariate with feedback.** If a covariate is affected by prior outcome\n  (e.g., dose titrated in response to last period's value), GEE with exchangeable/AR(1) is biased\n  (Pepe–Anderson). 
Use independence + sandwich at minimum, or move to an MSM/g-method if the covariate\n  is a time-varying confounder on the causal pathway.\n- **Informative cluster size.** When the number of observations per cluster is associated with the\n  outcome (e.g., sicker patients generate more encounters), standard GEE silently up-weights large\n  clusters and estimates a *cluster-size-weighted* marginal effect, not the per-subject one; cluster-\n  weighted GEE (CWGEE) is required to recover the intended estimand.\n- **Non-ignorable (MNAR) dropout.** Neither GEE nor weighted GEE rescues you; sensitivity analysis\n  under explicit MNAR assumptions is needed.\n- **The actual question is subject-specific.** If you want an individual patient's predicted\n  trajectory or to quantify between-patient heterogeneity, a marginal model answers the wrong\n  question — fit a GLMM.\n\n**Data-source operational depth** (claims vs EHR vs registry vs linked).\n- **Claims (FFS vs MA):** Repeated observations are naturally generated by period-level aggregation\n  (e.g., quarterly hospitalization counts per `person_id`). The dominant failure mode is that\n  cluster size is itself a function of observability: **Medicare Advantage person-time lacks adjudicated\n  FFS claims**, so an MA-only quarter shows zero events not because none occurred but because they are\n  unobserved — pool MA and FFS person-time and you bias the marginal rate downward differentially by\n  plan. Restrict the analytic window to FFS-observable person-time (Parts A/B, or commercial\n  medical+pharmacy) and treat plan-switching as censoring, not a true zero. Continuous-enrollment and\n  washout rules must be applied *per analysis period*, and the correlation across a patient's periods\n  is exactly what GEE handles — but only if the periods are real (an unenrolled gap is missing, not a\n  correlated zero). 
Watch **immortal time** when the cluster is built around a procedure: person-time\n  before the qualifying procedure is event-free by construction.\n- **EHR:** Repeated measures are *encounter-driven*, so observation times are themselves informative —\n  sicker patients are measured more often, inducing informative cluster size and MAR/MNAR\n  observation. Model the visit process (IPW-GEE with weights for being observed) or you get a measured-\n  when-sick bias. External-care leakage means a patient's \"improvement\" may be a visit elsewhere; site\n  workflow drives systematic missingness by clinic.\n- **Registry:** Scheduled visits give cleaner balanced repeated measures and adjudicated outcomes, but\n  enrollment is selective and follow-up completeness varies by site — report per-wave completeness and\n  consider weighting to the source population for transportability.\n- **Linked claims–EHR–registry:** Best substrate (EHR/registry severity + claims completeness for the\n  event count denominator), but linkage selects the linkable subset and the period boundaries from the\n  three sources must be reconciled before clusters are defined; date discrepancies create spurious\n  within-cluster correlation if a single real event is counted in two adjacent periods.\n\n**Worked claims example.** Question: does initiating an SGLT2 inhibitor vs a DPP-4 inhibitor change the\n*population-average* rate of all-cause hospitalization over the first year among new initiators with\ntype 2 diabetes in a 100% Medicare FFS sample? (1) Build the cohort with an active-comparator, new-user\ndesign (365-day continuous A/B enrollment washout, first qualifying fill = `index_date`, arm from the\ndispensed NDC). (2) Split each patient's first year into four 90-day **periods**, keeping only periods\nfully covered by FFS enrollment (drop MA-only and post-disenrollment periods so a zero is a true zero,\nnot unobserved). 
(3) For each `person_id` × period, count inpatient stays from facility claims by\nrevenue/bill-type, and record `offset = log(person-days observed in the period)` to handle partial\nperiods. (4) Fit a marginal Poisson GEE: `hosp_count ~ arm + period + arm:period + age + sex +\nbaseline_comorbidity`, link = log, distribution = Poisson, **subject = person_id**, **working\ncorrelation = exchangeable** (periods within a patient are positively correlated), with the **sandwich\nvariance**. With the interaction in the model, `exp(arm)` is the population-average hospitalization\nrate ratio in the reference period, and the `arm:period` terms test whether the marginal effect\nchanges over the year; drop the interaction to report a single year-long rate ratio. (5) Because\ndisenrollment that depends on a patient's observed history violates the MCAR condition that\nunweighted GEE needs, refit with **inverse-probability-of-observation weights** (weighted GEE) for\nthe disenrollment process and compare. (6) Because the cluster count is large but cluster *size*\n(number of observed periods) is associated with frailty, run a cluster-weighted GEE sensitivity\nanalysis; and because the working-correlation choice is a nuisance, confirm the exchangeable and\nindependence fits give similar point estimates (both are consistent for the same marginal effect\nbut are not numerically identical in general, so a material gap points to a misspecified mean model\nor covariate feedback).",
    "primary_category": "Inferential_Statistics",
    "tags": [
      "inferential_statistics",
      "gee",
      "population-average-models",
      "marginal-models",
      "working-correlation",
      "sandwich-variance",
      "longitudinal-outcomes",
      "clustered-data"
    ],
    "applies_to_study_types": [
      "claims_analysis",
      "ehr_study",
      "registry_linkage",
      "comparative_effectiveness",
      "cluster_randomized"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1093/biomet/73.1.13",
        "url": "https://doi.org/10.1093/biomet/73.1.13",
        "citation_text": "Liang KY, Zeger SL. Longitudinal data analysis using generalized linear models. Biometrika. 1986;73(1):13-22.",
        "year": 1986,
        "authors_short": "Liang & Zeger",
        "notes": "The foundational GEE paper. Introduces the estimating equations, working correlation structures, and the robust sandwich variance that is consistent under correlation misspecification."
      },
      {
        "role": "explain",
        "doi": "10.2307/2531734",
        "url": "https://doi.org/10.2307/2531734",
        "citation_text": "Zeger SL, Liang KY, Albert PS. Models for longitudinal data: a generalized estimating equation approach. Biometrics. 1988;44(4):1049-1060.",
        "year": 1988,
        "authors_short": "Zeger et al.",
        "notes": "Formalizes the marginal (population-average) vs subject-specific (conditional) distinction and the non-collapsibility attenuation factor for logit-link GEE vs random-effects models."
      },
      {
        "role": "explain",
        "doi": "10.1097/EDE.0b013e3181caeb90",
        "url": "https://doi.org/10.1097/EDE.0b013e3181caeb90",
        "citation_text": "Hubbard AE, Ahern J, Fleischer NL, Van der Laan M, Lippman SA, Jewell N, Bruckner T, Satariano WA. To GEE or not to GEE: comparing population average and mixed models for estimating the associations between neighborhood risk factors and health. Epidemiology. 2010;21(4):467-474.",
        "year": 2010,
        "authors_short": "Hubbard et al.",
        "notes": "The canonical applied-epidemiology argument for choosing GEE (marginal) over mixed models when the population-average estimand is the target; clarifies interpretation and robustness trade-offs."
      },
      {
        "role": "demonstrate",
        "doi": "10.1080/03610919408813210",
        "url": "https://doi.org/10.1080/03610919408813210",
        "citation_text": "Pepe MS, Anderson GL. A cautionary note on inference for marginal regression models with longitudinal data and general correlated response data. Communications in Statistics - Simulation and Computation. 1994;23(4):939-951.",
        "year": 1994,
        "authors_short": "Pepe & Anderson",
        "notes": "Shows that with endogenous time-varying covariates GEE using a non-independence working correlation is biased; the independence working correlation with the sandwich variance is the safe default."
      },
      {
        "role": "use",
        "doi": "10.1111/j.0006-341X.2001.00126.x",
        "url": "https://doi.org/10.1111/j.0006-341X.2001.00126.x",
        "citation_text": "Mancl LA, DeRouen TA. A covariance estimator for GEE with improved small-sample properties. Biometrics. 2001;57(1):126-134.",
        "year": 2001,
        "authors_short": "Mancl & DeRouen",
        "notes": "Bias-corrected sandwich variance for the few-clusters regime where the naive robust SE is anticonservative; the standard fix when the number of clusters is small."
      }
    ],
    "plain_language_summary": "GEE (Generalized Estimating Equations) is a statistical method for answering the question: on average across the whole population, how does an outcome differ between two groups, when each person contributes multiple measurements over time? Instead of treating each measurement as if it came from a different person, GEE acknowledges that readings from the same person tend to be similar to each other, and it accounts for that similarity when calculating the result. The answer it gives describes the average shift in the population, not any single patient's trajectory, which is exactly what payers, regulators, and health technology assessors usually need.",
    "key_terms": [
      {
        "term": "clustered or correlated data",
        "definition": "Data where multiple observations belong to the same unit, such as several lab values from the same patient, making those observations more similar to each other than to observations from a different patient."
      },
      {
        "term": "population-average effect",
        "definition": "The average difference in outcome between two groups across the entire study population, as opposed to the predicted change for one specific individual."
      },
      {
        "term": "subject-specific effect",
        "definition": "The predicted change in outcome for a particular individual patient, holding that patient's own characteristics constant; a generalized linear mixed model targets this, not GEE."
      },
      {
        "term": "working correlation",
        "definition": "The assumed pattern of similarity among repeated measurements within one person that GEE uses to be more efficient; the final estimate is valid even if this assumed pattern turns out to be wrong."
      },
      {
        "term": "sandwich variance",
        "definition": "A method for calculating standard errors and confidence intervals that remains valid even when the working correlation assumption is incorrect, which is why GEE results are considered robust."
      },
      {
        "term": "marginal model",
        "definition": "Another name for a population-average model: one that describes outcomes averaged across the population rather than modeled within each individual."
      }
    ],
    "worked_example": {
      "scenario": "A researcher wants to know whether patients with type 2 diabetes who start Drug A have fewer hospitalizations per quarter than those who start Drug B, on average across the population. Four patients each contribute up to four quarterly observations. Because the same patient appears in multiple rows, those rows are correlated and ordinary regression would understate the uncertainty. GEE accounts for the within-person correlation and returns a single population-average rate comparison.",
      "dataset": {
        "caption": "Quarterly hospitalization counts per patient. Each row is one patient-quarter. Two patients per arm, two quarters each (a minimal illustration).",
        "columns": [
          "person_id",
          "quarter",
          "arm",
          "hospitalizations"
        ],
        "rows": [
          [
            1001,
            1,
            "Drug A",
            0
          ],
          [
            1001,
            2,
            "Drug A",
            0
          ],
          [
            1002,
            1,
            "Drug A",
            1
          ],
          [
            1002,
            2,
            "Drug A",
            0
          ],
          [
            2001,
            1,
            "Drug B",
            1
          ],
          [
            2001,
            2,
            "Drug B",
            1
          ],
          [
            2002,
            1,
            "Drug B",
            2
          ],
          [
            2002,
            2,
            "Drug B",
            1
          ]
        ]
      },
      "steps": [
        "Identify the clusters: person_id is the cluster. Rows sharing a person_id are correlated because they come from the same person.",
        "Calculate the raw per-arm average to get an intuition: Drug A patients had (0+0+1+0) = 1 total hospitalization across 4 patient-quarters, an average of 0.25 per quarter. Drug B patients had (1+1+2+1) = 5 total, an average of 1.25 per quarter.",
        "GEE is told that observations within the same person are likely correlated (exchangeable working correlation: any two quarters from the same person are assumed equally similar). This is the analyst's modeling choice for efficiency, not a requirement for validity.",
        "GEE solves its estimating equations across all clusters simultaneously, using the sandwich variance so that the standard errors are valid regardless of whether the exchangeable assumption is exactly right.",
        "The result is a population-average rate ratio: the average quarterly hospitalization rate in Drug A patients divided by the average rate in Drug B patients, interpreted as the contrast you would see if you could intervene on the whole population."
      ],
      "result": "GEE population-average rate ratio: Drug A vs Drug B = 0.25 / 1.25 = 0.20. Interpreted: on average across the population, patients on Drug A have about one-fifth the quarterly hospitalization rate of patients on Drug B. This is a population-level statement, not a prediction for any specific patient."
    },
    "prerequisites": [
      "longitudinal-outcomes-modeling-rwe",
      "cluster-robust-standard-errors-rwe",
      "poisson-negative-binomial-count-models"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Independence working correlation + sandwich (robust default)",
        "description": "Assumes within-cluster independence in the working model but corrects inference with the robust sandwich variance. Equivalent to a GLM with cluster-robust SEs. Safe and consistent under correlation misspecification and required when time-varying covariates may be endogenous.",
        "edge_cases": [
          "Some efficiency is sacrificed versus a correctly specified exchangeable/AR(1) structure when correlation is strong.",
          "With few clusters the sandwich is anticonservative; apply the Mancl-DeRouen or Kauermann-Carroll bias correction."
        ],
        "data_source_notes": "claims/EHR: the default when periods per patient are unbalanced or when time-varying exposure may respond to prior outcome (Pepe-Anderson)."
      },
      {
        "name": "Exchangeable / AR(1) / unstructured working correlation",
        "description": "Models the within-cluster correlation for efficiency. Exchangeable suits unordered clusters (members within practice); AR(1) suits equally spaced repeated measures; unstructured suits few, balanced time points.",
        "edge_cases": [
          "Biased for the marginal effect if a time-varying covariate is endogenous (use independence instead).",
          "Unstructured needs enough clusters per correlation parameter or estimation is unstable."
        ],
        "data_source_notes": "EHR/registry: AR(1) for scheduled longitudinal labs; exchangeable for patients within site."
      },
      {
        "name": "Weighted GEE (IPW-GEE for MAR observation/dropout)",
        "description": "Inverse-probability-of-observation weights make GEE valid under MAR rather than MCAR. The same machinery, with IPTW weights, is the fitting step of a marginal structural model.",
        "edge_cases": [
          "Requires a correctly specified observation/treatment model; extreme weights inflate variance (truncate and report).",
          "Does not rescue MNAR dropout."
        ],
        "data_source_notes": "EHR: weight for the encounter/visit process so estimates are not measured-when-sick; claims: weight for differential disenrollment."
      },
      {
        "name": "Cluster-weighted GEE (CWGEE) for informative cluster size",
        "description": "Re-weights so each cluster contributes equally, recovering the per-subject marginal effect when cluster size is associated with the outcome.",
        "edge_cases": [
          "Standard GEE estimates a cluster-size-weighted effect that differs from the per-subject effect under informative cluster size."
        ],
        "data_source_notes": "claims/EHR: use when sicker patients generate more observations (more periods/encounters)."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Generalized linear mixed models (random-effects / GLMM)",
        "pros_of_this": "Targets the marginal (population-average) estimand, makes no distributional assumption on cluster effects, and is robust to within-cluster correlation misspecification via the sandwich variance.",
        "cons_of_this": "No subject-specific inference, less efficient than a correctly specified GLMM, and valid only under MCAR (unweighted), whereas a likelihood-based GLMM is valid under MAR.",
        "when_to_prefer": "The policy/HTA question is the average effect across the population, the outcome is non-Gaussian (binary/count), and missingness is MCAR or can be addressed with observation weights."
      },
      {
        "compared_to": "MMRM (mixed model for repeated measures)",
        "pros_of_this": "Handles binary and count outcomes with a marginal interpretation; not tied to a Gaussian likelihood.",
        "cons_of_this": "Anticonservative with few clusters and MCAR-limited, whereas MMRM is MAR-valid and the regulatory default for continuous longitudinal endpoints.",
        "when_to_prefer": "Marginal effects on binary/count repeated outcomes; use MMRM for a continuous endpoint with monotone MAR dropout."
      },
      {
        "compared_to": "Cluster-robust SE on an ordinary GLM",
        "pros_of_this": "Allows a non-independence working correlation for efficiency gains and supports IPW-GEE for MAR; the explicit estimating-equation framework generalizes cleanly.",
        "cons_of_this": "With an independence working correlation GEE is algebraically identical to a GLM with cluster-robust SEs, so the extra machinery adds nothing.",
        "when_to_prefer": "When exchangeable/AR(1) efficiency or weighted GEE for MAR is wanted; otherwise the GLM + cluster-robust SE is equivalent and simpler."
      },
      {
        "compared_to": "Marginal structural models / g-methods",
        "pros_of_this": "Simpler to specify and defend when there is no time-varying confounding affected by prior treatment.",
        "cons_of_this": "Plain GEE is biased for a sustained-treatment effect with time-varying confounder feedback; it cannot adjust for treatment-confounder feedback.",
        "when_to_prefer": "Outcomes with exogenous covariates only; for time-varying confounding affected by prior treatment, use IPTW-weighted GEE (an MSM)."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Build period-level clusters per person_id; keep only FFS-observable person-time so a zero count is a true zero, not unobserved MA person-time. Use a log(person-days) offset for partial periods, exchangeable or AR(1) working correlation across a patient's periods, and the sandwich variance. Watch immortal time when clusters are built around a procedure.",
      "ehr": "Observation times are encounter-driven and informative; model the visit process with IPW-GEE or risk measured-when-sick bias. Account for external-care leakage and site-level workflow missingness.",
      "registry": "Scheduled visits give balanced repeated measures and adjudicated outcomes, but enrollment is selective; report per-wave completeness and consider weighting for transportability.",
      "linked": "Reconcile period boundaries across claims/EHR/registry before defining clusters so one real event is not double-counted across adjacent periods; linkage selects the linkable subset."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\nimport pandas as pd\nimport statsmodels.api as sm\nimport statsmodels.formula.api as smf\nfrom statsmodels.genmod.cov_struct import Exchangeable, Independence\n\ndef fit_pa_poisson_gee(panel: pd.DataFrame):\n    panel = panel.sort_values([\"person_id\", \"period\"]).copy()\n    panel[\"log_obs_days\"] = np.log(panel[\"obs_days\"])  # offset for partial/uneven periods\n\n    # Marginal Poisson GEE: subject = person_id (the cluster), exchangeable working correlation.\n    # The robust (sandwich) covariance is statsmodels' default for GEE.\n    model = smf.gee(\n        formula=\"hosp_count ~ C(arm, Treatment('DPP4')) + C(period) + age + sex + baseline_comorbidity\",\n        groups=\"person_id\",\n        data=panel,\n        offset=panel[\"log_obs_days\"].values,\n        cov_struct=Exchangeable(),\n        family=sm.families.Poisson(),\n    )\n    res = model.fit()  # res.cov_type is 'robust' (sandwich) by default\n\n    # Population-average rate ratio for the active arm vs the comparator, with sandwich CI.\n    params, ci = res.params, res.conf_int()\n    arm_term = [t for t in params.index if t.startswith(\"C(arm\")][0]\n    rr = np.exp(params[arm_term])\n    rr_lo, rr_hi = np.exp(ci.loc[arm_term, 0]), np.exp(ci.loc[arm_term, 1])\n\n    # Sensitivity: independence working correlation should give the SAME point estimate (only SE differs).\n    rr_indep = np.exp(model.__class__(\n        model.endog, model.exog, groups=panel[\"person_id\"],\n        offset=panel[\"log_obs_days\"].values,\n        cov_struct=Independence(), family=sm.families.Poisson(),\n    ).fit().params[list(params.index).index(arm_term)])\n\n    return {\"rate_ratio\": rr, \"ci95\": (rr_lo, rr_hi),\n            \"rate_ratio_independence\": rr_indep, \"summary\": res.summary()}",
        "description": "Population-average Poisson GEE on period-level claims counts using statsmodels. Required input\n(one row per person_id x period, already restricted to FFS-observable person-time):\n  panel : person_id (cluster id), period (0..3), hosp_count (int events in the period),\n          obs_days (person-days observed in the period, for the offset),\n          arm ('SGLT2'/'DPP4'), age, sex, baseline_comorbidity\nThe exp() of the 'arm' coefficient is the population-average hospitalization rate ratio; the\nsandwich (robust) covariance is the default in GEE. Exchangeable working correlation models the\npositive within-patient correlation across periods without affecting consistency of the mean model.",
        "dependencies": [
          "pandas",
          "numpy",
          "statsmodels"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(geepack)\n\nfit_pa_poisson_gee <- function(panel) {\n  panel <- panel[order(panel$person_id, panel$period), ]\n  panel$arm <- relevel(factor(panel$arm), ref = \"DPP4\")\n\n  # Marginal Poisson GEE: id = cluster, exchangeable working correlation, log offset for exposure time.\n  fit <- geeglm(\n    hosp_count ~ arm + factor(period) + age + sex + baseline_comorbidity,\n    id          = person_id,\n    data        = panel,\n    family      = poisson(link = \"log\"),\n    corstr      = \"exchangeable\",\n    offset      = log(panel$obs_days),\n    std.err     = \"san.se\"            # robust sandwich SE (the GEE default)\n  )\n\n  # Population-average rate ratio for the active arm and its Wald 95% CI from the sandwich SE.\n  s   <- summary(fit)$coefficients\n  arm <- grep(\"^arm\", rownames(s), value = TRUE)[1]\n  est <- s[arm, \"Estimate\"]; se <- s[arm, \"Std.err\"]\n  rr  <- exp(est); ci <- exp(est + c(-1.96, 1.96) * se)\n\n  # Robustness check: independence working correlation gives the same point estimate.\n  fit_indep <- update(fit, corstr = \"independence\")\n  list(rate_ratio = rr, ci95 = ci,\n       rate_ratio_independence = exp(coef(fit_indep)[arm]), fit = fit)\n}",
        "description": "Population-average Poisson GEE with geepack::geeglm on the same period-level panel. Inputs:\n  panel : person_id, period, hosp_count, obs_days, arm (factor, ref 'DPP4'),\n          age, sex, baseline_comorbidity\ngeeglm reports the robust (sandwich) SE by default; std.err='san.se' makes it explicit. Data MUST\nbe ordered by the cluster id before fitting so the working correlation is applied correctly.",
        "dependencies": [
          "geepack"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* Log person-time offset for partial/uneven periods. */\ndata work.panel;\n  set work.panel;\n  log_obs_days = log(obs_days);\nrun;\n\n/* GENMOD must read the data sorted by the subject (cluster) id for the REPEATED statement. */\nproc sort data=work.panel; by person_id period; run;\n\n/* Marginal (population-average) Poisson GEE. The REPEATED statement is what makes GENMOD fit GEE. */\nproc genmod data=work.panel;\n  class person_id arm(ref='DPP4') sex period / param=ref;\n  model hosp_count = arm period age sex baseline_comorbidity\n        / dist=poisson link=log offset=log_obs_days;\n  repeated subject=person_id / type=exch corrw modelse;  /* exchangeable working corr; print empirical + model SE */\n  estimate 'PA rate ratio SGLT2 vs DPP4' arm 1 / exp;     /* exp() gives the population-average rate ratio */\nrun;\n\n/* Robustness: independence working correlation should reproduce the point estimate (only SE changes). */\nproc genmod data=work.panel;\n  class person_id arm(ref='DPP4') sex period / param=ref;\n  model hosp_count = arm period age sex baseline_comorbidity\n        / dist=poisson link=log offset=log_obs_days;\n  repeated subject=person_id / type=ind;\n  estimate 'PA rate ratio (independence)' arm 1 / exp;\nrun;",
        "description": "Population-average Poisson GEE in SAS via PROC GENMOD with a REPEATED statement (this is the SAS\nGEE engine; PROC GLIMMIX fits the subject-specific GLMM, the contrast concept, not GEE). Required\ninput dataset (one row per person_id x period, FFS-observable person-time only):\n  work.panel : person_id, period, hosp_count, obs_days, arm ('SGLT2'/'DPP4'),\n               age, sex, baseline_comorbidity\nThe DIST=POISSON LINK=LOG with OFFSET handles the rate model; REPEATED SUBJECT=person_id\nTYPE=EXCH requests the exchangeable working correlation and the empirical (sandwich) covariance,\nwhich is the GEE default in GENMOD.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Q[Correlated / clustered outcome:<br/>repeated measures, recurrent events, clustered cross-section] --> E{What is the estimand?}\n  E -->|Population-average<br/>marginal effect| F{Time-varying covariate<br/>affected by prior outcome?}\n  E -->|Subject-specific effect,<br/>individual trajectory, heterogeneity| GLMM[Generalized linear mixed model<br/>PROC GLIMMIX / lme4 / MixedLM]\n  F -->|No feedback| GEE[GEE marginal model<br/>working corr + sandwich SE]\n  F -->|Yes: time-varying confounding| MSM[Marginal structural model<br/>IPTW-weighted GEE / g-methods]\n  GEE --> M{Missingness regime?}\n  M -->|MCAR| OK[Unweighted GEE valid]\n  M -->|MAR| IPW[Weighted GEE<br/>inverse-prob-of-observation]\n  M -->|MNAR| SENS[MNAR sensitivity analysis]\n  GEE --> C{Number of clusters?}\n  C -->|Large| ROB[Naive sandwich SE OK]\n  C -->|Small < ~40| BC[Bias-corrected sandwich<br/>Mancl-DeRouen / Kauermann-Carroll]",
        "caption": "Decision logic for marginal GEE vs subject-specific GLMM vs marginal structural models, plus the missing-data and small-sample branches that determine which variance estimator and weighting are valid.",
        "alt_text": "Decision tree starting from a correlated outcome, branching on estimand (marginal versus subject-specific), time-varying covariate feedback (GEE versus MSM), missingness regime (MCAR/MAR/MNAR), and number of clusters (naive versus bias-corrected sandwich).",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  subgraph IND[Independence]\n    i1[\"1 0 0<br/>0 1 0<br/>0 0 1\"]\n  end\n  subgraph EXC[Exchangeable]\n    e1[\"1 r r<br/>r 1 r<br/>r r 1\"]\n  end\n  subgraph AR1[\"AR(1)\"]\n    a1[\"1 r r^2<br/>r 1 r<br/>r^2 r 1\"]\n  end\n  subgraph UNS[Unstructured]\n    u1[\"1 r12 r13<br/>r12 1 r23<br/>r13 r23 1\"]\n  end\n  IND -->|safe default;<br/>robust to feedback| USE[Working correlation choice<br/>affects efficiency, not consistency]\n  EXC -->|unordered clusters,<br/>members within site| USE\n  AR1 -->|equally spaced<br/>repeated measures| USE\n  UNS -->|few balanced<br/>time points| USE",
        "caption": "The four common working correlation structures and when each fits. The choice changes efficiency, not the consistency of the marginal mean model; the sandwich variance is what guarantees valid inference.",
        "alt_text": "Four 3-by-3 working correlation matrices (independence, exchangeable, AR(1), unstructured) with annotations on when each is appropriate, all feeding into the note that the choice affects efficiency not consistency.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "is_variant_of",
        "target_slug": "longitudinal-outcomes-modeling-rwe",
        "notes": "GEE is the marginal / population-average branch of the longitudinal-outcomes modeling family."
      },
      {
        "relation_type": "alternative_to",
        "target_slug": "mixed-effects-models-longitudinal-rwe",
        "notes": "GEE targets the marginal estimand and is MCAR-valid and robust to correlation misspecification; mixed models target the subject-specific estimand and are MAR-valid. For logit links the two coefficients differ by a non-collapsibility shrinkage factor."
      },
      {
        "relation_type": "alternative_to",
        "target_slug": "mmrm-repeated-measures-rwe",
        "notes": "MMRM is the Gaussian mixed-model default for continuous longitudinal endpoints (MAR-valid); GEE generalizes to binary/count marginal effects but is anticonservative with few clusters."
      },
      {
        "relation_type": "used_with",
        "target_slug": "cluster-robust-standard-errors-rwe",
        "notes": "The robust sandwich variance is the GEE default; GEE with an independence working correlation is algebraically a GLM with cluster-robust standard errors."
      },
      {
        "relation_type": "used_with",
        "target_slug": "poisson-negative-binomial-count-models",
        "notes": "GEE supplies the marginal (population-average) version of Poisson/NB count regression for correlated counts via the log link and a working correlation."
      },
      {
        "relation_type": "used_with",
        "target_slug": "multiple-imputation-longitudinal-rwe",
        "notes": "Because unweighted GEE is only MCAR-valid, multiple imputation or weighted GEE is used to obtain valid marginal inference under MAR."
      },
      {
        "relation_type": "alternative_to",
        "target_slug": "marginal-structural-models-g-methods",
        "notes": "For time-varying confounding affected by prior treatment, plain GEE is biased; IPTW-weighted GEE (a marginal structural model) is required instead."
      }
    ],
    "aliases": [
      "GEE",
      "generalized estimating equations",
      "population-average models",
      "marginal models",
      "GEE marginal models"
    ],
    "complexity": "advanced",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "generalizability-transportability-external-validity-rwe",
    "name": "Generalizability, Transportability, and External Validity",
    "short_definition": "A family of methods and diagnostics for deciding whether an internally valid causal effect (or prediction model) estimated in one population applies to a named decision/target population, using target-population specification, effect-measure-modifier assessment, inverse-odds-of-sampling weighting or standardization over the modifier distribution, and sensitivity to non-overlap and unmeasured modifiers.",
    "long_description": "**Internal validity** asks whether the effect estimate is unbiased *for the study population* — it is about confounding, selection, measurement, and time-zero alignment within the sample. **External validity** asks a logically separate question: would that same effect (or model performance) hold in a *target population* that differs in age, frailty, renal function, disease severity, line of therapy, payer mix, geography, calendar time, or care setting? A study can be perfectly internally valid and still be useless for the decision at hand. Real-world data make this acute: claims databases are large but unrepresentative — a commercial-claims cohort is younger, healthier, and more renally intact than the Medicare population an HTA or payer actually needs an answer for. Large N does not buy representativeness, and no amount of within-study confounding control fixes a population mismatch.\n\n**Core estimand distinction**. Three terms are used loosely but mean different things, and the difference drives the method. (1) *Generalizability (sampling)*: the study sample is a (possibly biased) subset **nested within** the target population — you reweight the study to look like the whole. (2) *Transportability*: the target is **external** to and disjoint from the study sample — you carry the effect to a population you never sampled, which requires an explicit covariate \"bridge\" and stronger graphical assumptions (Pearl–Bareinboim do-calculus / selection diagrams). (3) *Applicability*: a qualitative or mixed judgment used when target covariate data are too incomplete for a formal weight. The transport target estimand is the effect in the *target* population (e.g., the target-ATE), which differs from the study-ATE precisely when there is **effect-measure modification on the chosen scale** by a covariate whose distribution differs between source and target. 
The identification conditions are: (a) conditional exchangeability of the *sampling/selection* indicator with the potential outcomes given measured effect modifiers (S ⊥ Y(a) | V), i.e., all modifiers that differ across populations are measured; (b) positivity of selection — every covariate stratum present in the target has nonzero representation in the study (no empty source cells); and (c) consistency/no-interference. Note (a) is about effect *modifiers*, not every prognostic factor — balancing pure prognostic variables that do not modify the effect on the analysis scale is unnecessary and only inflates variance.\n\n**Pros, cons, and trade-offs**\n- **vs. simply reporting the internally valid study-ATE and asserting it \"should generalize\":** Formal transport names the target, makes the S ⊥ Y(a) | V assumption explicit, and quantifies how far the answer moves. Cost: it requires target-population covariate data on the effect modifiers and is only as good as the modifier list. **Prefer transport/weighting** whenever the decision population is known to differ on a plausible modifier; **do not** substitute it for honest acknowledgment when the needed modifier is unmeasured.\n- **Inverse-odds-of-sampling weights (IOSW) vs. g-computation/standardization:** IOSW (weight = P(target)/P(study) for the modifiers) is model-light on the outcome but unstable when source coverage of the target is thin (extreme weights, effective-sample-size collapse). Outcome standardization / g-computation fits an outcome model in the study and averages predictions over the *target* covariate distribution — efficient and stable but biased if the outcome model is misspecified, and it extrapolates silently into target regions with no study support. **Doubly robust transport (augmented IOSW / TMLE)** combines both. **Prefer IOSW** for transparency and a single, auditable weight; **prefer standardization or DR** when weights are extreme or efficiency matters.\n- **vs. 
baseline-characteristics-and-covariate-balance comparison:** A Table-1 comparison of study vs. target is a necessary *diagnostic* (it reveals which modifiers differ and whether there is overlap) but is not itself an estimator — it cannot deliver a target-population effect, only flag that one is needed.\n- **vs. prediction-model external validation/recalibration:** Transport of a *causal effect* and transport of *model performance* (discrimination, calibration) are different targets; the same population mismatch degrades both, but the remedies (reweighting/standardization vs. recalibration/refit) differ.\n\n**When to use**. (1) The estimate will inform a decision for a population that demonstrably differs from the study sample (HTA submission targeting all-comers or Medicare; payer coverage for a frailer enrollee mix; label expansion to an under-sampled subgroup). (2) A plausible **effect-measure modifier** (renal function, age, severity, prior lines) is distributed differently in source and target. (3) You have, or can build, target-population covariate data on those modifiers (an external reference cohort, a target registry, or a second database used only for the modifier distribution). 
(4) Generalizing an RCT (narrow, protocolized) to routine-care RWD, or transporting an RWD result across databases/countries.\n\n**When NOT to use — and when it is actively misleading or dangerous**\n- **The needed effect modifier is unmeasured in the study, the target, or both.** Reweighting on the *measured* covariates while a strong unmeasured modifier differs gives a confidently wrong target estimate with a falsely narrow CI — more dangerous than an honest \"not transportable.\" If advanced CKD modifies the SGLT2i–heart-failure effect and the commercial source has almost no CKD-4 person-time, no weight can manufacture that information.\n- **Positivity (selection) violation — the target has strata the study lacks.** Dual-eligible frail nonagenarians with CKD-4 may be essentially absent from commercial claims; IOSW then produces a few enormous weights, the effective sample size collapses, and the \"target effect\" is driven by a handful of unrepresentative patients. Diagnose with the weight distribution and the standardized population-overlap, not just the point estimate.\n- **Scale-dependence ignored.** Effect modification is scale-dependent: an effect homogeneous on the risk-ratio scale can be heterogeneous on the risk-difference scale and vice versa. Transporting on the wrong scale, or assuming \"no modification\" without checking the decision-relevant scale, biases the target estimate.\n- **The estimand itself does not transport** because the *intervention* differs (different formulary, dosing, monitoring, adherence in the target setting) — a structural difference no covariate weight repairs.\n- **Pure prognostic factors mistaken for modifiers.** Weighting on everything that differs (rather than on modifiers) needlessly destroys precision and can worsen positivity, without reducing transport bias.\n\n**Data-source operational depth**.\n- **Claims (FFS vs. MA):** The biggest hidden trap is the source/target enrollment substrate. 
Medicare-Advantage enrollees do not generate fee-for-service claims, so an FFS-only study cohort silently *omits* MA person-time; if you then transport to \"all Medicare,\" the target modifier distribution drawn from FFS denominators is itself unrepresentative of MA beneficiaries (who are systematically healthier at enrollment). Require both arms of the transport — source effect estimate and target reference distribution — to be built on comparable, completely observed enrollment (continuous A/B/D for FFS; full commercial medical+pharmacy benefit), and never mix MA-only and FFS person-time when computing the target covariate distribution. Renal/frailty modifiers must be proxied from diagnosis/procedure codes and lab-adjacent claims (dialysis HCPCS, CKD stage ICD-10 N18.x), which are noisier in the target than in a labs-rich source.\n- **EHR:** Site mix dominates transport. A model or effect estimated at academic referral centers transports poorly to community/safety-net settings that differ in severity, coding intensity, and capture. Visit-driven observation means the target \"population\" implied by an EHR is really a population *of encounters*; patients who leave the system are differentially absent, so the modifier distribution you weight to is conditioned on continued contact. Reconcile this before using EHR as either source or target reference.\n- **Registry:** Registries over-represent treated, specialty-care, or academically managed patients, so a registry is a poor stand-in for a *community* target distribution even though it is excellent for severity/staging modifiers. 
Use registry severity to define the modifiers, but draw the target *distribution* from a population-representative source.\n- **Linked claims–EHR–registry:** The ideal substrate for measuring modifiers (EHR/registry severity + claims completeness), but the linkable subset is itself selected; the transport target must be the population you can actually act on, and the linkage-selected sample may not be it. Also watch **differential competing risks**: in elderly claims, death competes with the outcome differently by exposure and by population, so an apparent transport gap can be a competing-risk artifact unless the target effect is defined on a competing-risk-aware estimand (cause-specific vs. subdistribution).\n\n**Worked claims example (transporting a commercial-claims effect to a Medicare FFS target).** Question: does the SGLT2-inhibitor vs. DPP-4-inhibitor effect on hospitalized heart failure, estimated in working-age commercial claims, apply to the older, renally-impaired Medicare FFS population an HTA cares about?\n(1) *Source cohort (S=1):* active-comparator new-user design in a commercial database — adults 18–64, type 2 diabetes (≥2 dx), 365-day continuous medical+pharmacy enrollment and 365-day drug-free washout before the first SGLT2i/DPP-4i fill (`fill_date`, NDC, `days_supply`); index_date = that first fill; follow-up to first validated HF hospitalization, censoring at disenrollment, death, and data end. Estimate the internally valid source effect with a high-dimensional propensity score.\n(2) *Target reference (S=0):* a Medicare FFS sample with continuous Parts A/B/D (exclude MA-only person-time, which carries no FFS claims), restricted to the *decision* population — age ≥65, including CKD stage 3–4 (ICD-10 N18.3/N18.4) and dual-eligible beneficiaries. 
From this sample take covariates only (no outcomes needed): age band, sex, eGFR/CKD-stage proxy, frailty proxy (claims-based frailty index), prior insulin, baseline HF risk markers — i.e., the candidate **effect modifiers**, measured in a 365-day baseline window.\n(3) *Stack and model selection:* stack source (S=1) and target (S=0) rows; fit P(S=1 | V) by logistic regression on the modifiers. For each study patient compute the inverse-odds-of-sampling weight w = P(S=0|V)/P(S=1|V) = (1−p̂)/p̂.\n(4) *Stabilize and diagnose:* truncate w at the 1st/99th percentiles, inspect the weight histogram and the effective sample size, and confirm population overlap on every modifier stratum present in the target (positivity). Empty or near-empty source cells for CKD-4 frail dual-eligibles are a stop sign, not a number to truncate away.\n(5) *Estimate the target effect:* fit a weighted pooled logistic / Cox or a weighted marginal-structural model for the SGLT2i-vs-DPP-4i contrast using w (optionally combined with the within-study PS weight), yielding the target-population (Medicare-FFS) absolute risk difference and hazard ratio. 
Cross-check with outcome standardization (g-computation averaged over the target modifier distribution) — agreement is reassuring; divergence signals weight instability or outcome-model extrapolation.\n(6) *Sensitivity:* vary the modifier list, report results with/without the CKD stratum, run a tipping-point analysis for an unmeasured modifier, and present the study-ATE and target-ATE side by side so the decision-maker sees the transport gap explicitly rather than a single laundered number.\n\n**Interpreting the output**\n\nIn the diabetes-pill trial transportability analysis, reweighting the trial population (70%\nage 45–64, 30% age 65+) to the Medicare target (30% age 45–64, 70% age 65+) yields: naive\ntrial ARR = 10.2 pp; transported (target-population) ARR = 7.8 pp; transport gap = 2.4 pp.\n\n*(1) Formal interpretation.* The naive 10.2 pp overstates the expected benefit in the Medicare\npopulation because younger patients (age 45–64), who predominate in the trial, experience a\nlarger absolute risk reduction (12 pp) than older patients (6 pp), who predominate in the\ntarget. The transported estimate of 7.8 pp applies inverse-odds-of-sampling weights that\nup-weight the 65+ stratum to match the Medicare distribution. The 2.4 pp transport gap is\nthe consequence of effect-measure modification by age: if the treatment effect were\nhomogeneous across age groups, the two estimates would be identical. Positivity is required —\nevery stratum of the target population must have some representation in the trial, or the\nreweighting extrapolates outside the data.\n\n*(2) Practical interpretation.* A payer covering a predominantly elderly population who uses\nthe trial ARR of 10.2 pp to project absolute benefit will overestimate cost-effectiveness\nrelative to the transported 7.8 pp. This translates to material differences in budget-impact\nmodels and ICER calculations. 
Transportability analysis is not optional when the coverage\ntarget differs demographically from the study population — the direction and magnitude of\nthe gap depend on which effect modifiers are distributed differently between source and target.",
    "primary_category": "Causal_Inference_Method",
    "tags": [
      "generalizability",
      "transportability",
      "external-validity",
      "target-population",
      "inverse-odds-of-sampling-weights",
      "standardization",
      "effect-measure-modification",
      "selection-diagram",
      "positivity",
      "hta"
    ],
    "applies_to_study_types": [
      "cohort_retrospective",
      "cohort_prospective",
      "pragmatic_trial",
      "registry_trial",
      "claims_analysis",
      "ehr_study",
      "linked_data",
      "multi_database"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked",
      "multi-database"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1111/j.1467-985X.2010.00673.x",
        "url": "https://doi.org/10.1111/j.1467-985X.2010.00673.x",
        "citation_text": "Stuart EA, Cole SR, Bradshaw CP, Leaf PJ. The use of propensity scores to assess the generalizability of results from randomized trials. Journal of the Royal Statistical Society Series A. 2011;174(2):369-386.",
        "year": 2011,
        "authors_short": "Stuart et al.",
        "notes": "Foundational applied framework — defines the sampling-propensity-score logic for reweighting a study sample to a target population and the diagnostics for overlap."
      },
      {
        "role": "explain",
        "doi": "10.1093/aje/kwx164",
        "url": "https://doi.org/10.1093/aje/kwx164",
        "citation_text": "Westreich D, Edwards JK, Lesko CR, Stuart E, Cole SR. Transportability of trial results using inverse odds of sampling weights. American Journal of Epidemiology. 2017;186(8):1010-1014.",
        "year": 2017,
        "authors_short": "Westreich et al.",
        "notes": "The operational engine — derives the inverse-odds-of-sampling weight w=(1-p)/p and the conditions (effect-modifier exchangeability, positivity) for valid transport."
      },
      {
        "role": "demonstrate",
        "doi": "10.1093/aje/kwq084",
        "url": "https://doi.org/10.1093/aje/kwq084",
        "citation_text": "Cole SR, Stuart EA. Generalizing evidence from randomized clinical trials to target populations: the ACTG 320 trial. American Journal of Epidemiology. 2010;172(1):107-115.",
        "year": 2010,
        "authors_short": "Cole & Stuart",
        "notes": "Canonical worked example reweighting a trial to a U.S. target population — the template most applied RWE transport analyses follow."
      },
      {
        "role": "demonstrate",
        "doi": "10.1214/14-STS486",
        "url": "https://doi.org/10.1214/14-STS486",
        "citation_text": "Pearl J, Bareinboim E. External validity: from do-calculus to transportability across populations. Statistical Science. 2014;29(4):579-595.",
        "year": 2014,
        "authors_short": "Pearl & Bareinboim",
        "notes": "Formal selection-diagram / do-calculus theory that says when a target effect is identifiable and which covariates must bridge source and target."
      },
      {
        "role": "use",
        "doi": "10.1186/s12916-023-02779-w",
        "url": "https://doi.org/10.1186/s12916-023-02779-w",
        "citation_text": "Van Calster B, Steyerberg EW, Wynants L, van Smeden M. There is no such thing as a validated prediction model. BMC Medicine. 2023;21:70.",
        "year": 2023,
        "authors_short": "Van Calster et al.",
        "notes": "Argues model validity is population- and context-specific — the prediction-side analogue of effect transport, motivating recalibration in each deployment target."
      }
    ],
    "plain_language_summary": "A clinical trial or observational study always runs in a specific group of people, but the results need to apply to a different, often broader, population — for example, the older and sicker patients a payer or health agency actually covers. These methods ask whether an effect found in the study group would still hold in that real-world target group, and if the two groups differ in ways that change how well the treatment works, the methods re-weight the study results to give an estimate that fits the target population. The honest caveat: if an important difference between the groups was never measured, no re-weighting can fix it.",
    "key_terms": [
      {
        "term": "generalizability",
        "definition": "The study participants are a subset drawn from the larger target population, and the goal is to re-weight the study results to represent that whole population."
      },
      {
        "term": "transportability",
        "definition": "The target population is entirely separate from the study — you are carrying the effect estimate across to a group that was never sampled, which requires stronger assumptions about what was measured."
      },
      {
        "term": "external validity",
        "definition": "A broader term for whether a study finding applies outside the original study sample, covering both generalizability and transportability."
      },
      {
        "term": "effect modifier",
        "definition": "A characteristic (such as age or kidney function) that changes how large the treatment effect is — if this characteristic is distributed differently in the study and the target, the effect estimates will differ too."
      },
      {
        "term": "inverse-odds-of-sampling weight",
        "definition": "A number assigned to each study participant that up-weights people who look like the target population and down-weights people who are over-represented in the study, so the weighted study group resembles the target."
      },
      {
        "term": "positivity",
        "definition": "The requirement that every type of patient present in the target population also has some representation in the study — without this, there is no information to re-weight from."
      }
    ],
    "worked_example": {
      "scenario": "A trial of a new diabetes pill runs in a commercial health-plan population that skews young: 70% of participants are age 45-64 and 30% are age 65+. The evidence will be used to support coverage for a Medicare population that skews older: 30% are 45-64 and 70% are 65+. Researchers know from the trial data that the pill works better in younger patients (a 12-percentage-point absolute risk reduction) than in older patients (a 6-percentage-point reduction). Simply reporting the trial's overall result — which reflects the young-heavy trial mix — will overstate the benefit for the Medicare decision. The goal is to re-weight the trial result to match the Medicare age mix.",
      "dataset": {
        "caption": "Trial results by age group and the age distribution in the target (Medicare) population.",
        "columns": [
          "age_group",
          "trial_share_pct",
          "target_share_pct",
          "absolute_risk_reduction_pct"
        ],
        "rows": [
          [
            "45-64",
            70,
            30,
            12
          ],
          [
            "65+",
            30,
            70,
            6
          ]
        ]
      },
      "steps": [
        "Compute the naive (unweighted) trial estimate: weight each group by its share in the trial. (0.70 x 12) + (0.30 x 6) = 8.4 + 1.8 = 10.2 percentage points.",
        "Compute the transported (target-weighted) estimate: weight each group by its share in the Medicare target instead. (0.30 x 12) + (0.70 x 6) = 3.6 + 4.2 = 7.8 percentage points.",
        "The transport gap is 10.2 - 7.8 = 2.4 percentage points — the naive trial result overstates the benefit for the Medicare population because the trial enrolled more younger patients who respond better.",
        "The transported estimate of 7.8 percentage points is the number a payer or HTA body should use when evaluating coverage for its Medicare enrollees."
      ],
      "result": "Naive trial ARR = 10.2 pp (reflects trial age mix 70/30 young/old); Transported ARR = 7.8 pp (reflects Medicare age mix 30/70 young/old); Transport gap = 2.4 pp. The transported estimate is lower because the Medicare population has more older patients who benefit less."
    },
    "prerequisites": [
      "estimands-ate-att-intercurrent-events-rwe",
      "baseline-characteristics-and-covariate-balance-rwe",
      "picots-framework-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Inverse-odds-of-sampling weights (IOSW)",
        "description": "Stack the study sample (S=1) with a target-population sample (S=0, covariates only), model P(S=1 | effect modifiers), and weight study participants by w=(1-p)/p so the weighted study resembles the target on the modifier distribution; then fit the treatment-effect model under these weights.",
        "edge_cases": [
          "Weights explode and effective sample size collapses when target strata are sparse in the study (positivity failure); truncation hides the problem rather than fixing it.",
          "Only effect modifiers need enter the weight model; including pure prognostic factors inflates variance without reducing transport bias."
        ],
        "data_source_notes": "claims: build the target distribution from a completely observed FFS sample (continuous A/B/D), never MA-only person-time, with modifiers proxied from diagnosis/procedure codes."
      },
      {
        "name": "Outcome standardization / g-computation to the target",
        "description": "Fit an outcome model in the study sample and average its predicted potential-outcome contrasts over the TARGET covariate distribution to obtain the target-population effect; more efficient than IOSW and stable under thin overlap, but relies on a correctly specified (and possibly extrapolating) outcome model.",
        "edge_cases": [
          "Silently extrapolates into target covariate regions with no study support — check overlap before trusting the standardized estimate."
        ],
        "data_source_notes": "Requires the target covariate joint distribution (or marginals plus a copula/Monte-Carlo step) rather than individual target records."
      },
      {
        "name": "Doubly robust transport (augmented IOSW / TMLE)",
        "description": "Combine the sampling-weight model and the outcome model so the target effect is consistent if EITHER is correctly specified; preferred when weights are moderately unstable and efficiency matters.",
        "edge_cases": [
          "Still fails if a strong effect modifier is unmeasured — double robustness protects against model misspecification, not against an incomplete modifier set."
        ]
      },
      {
        "name": "Qualitative applicability assessment",
        "description": "Used when target covariate data on the modifiers are unavailable; explicitly compare setting, care pathways, coding, route, formulary, and outcome validity, and state directionally why the source effect may over- or under-state the target.",
        "data_source_notes": "The honest fallback when a formal weight would be built on an incomplete modifier set; document the unmeasured modifiers as a stated limitation."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "baseline-characteristics-and-covariate-balance-rwe",
        "pros_of_this": "Delivers an actual target-population effect estimate, not just a study-vs-target comparison table; forces explicit identification assumptions (S ⊥ Y(a) | modifiers, positivity).",
        "cons_of_this": "Needs target covariate data on the effect modifiers; cannot recover a modifier that is absent from study or target.",
        "when_to_prefer": "When a decision population differs from the study sample on a plausible effect modifier and you can obtain the target modifier distribution."
      },
      {
        "compared_to": "prediction-model-validation-recalibration-rwe",
        "pros_of_this": "Transports a causal effect (counterfactual contrast), with reweighting/standardization targeting the modifier distribution.",
        "cons_of_this": "Does not address model discrimination/calibration; predictive transport needs recalibration or refit, not sampling weights.",
        "when_to_prefer": "When the deliverable is a comparative treatment effect rather than a risk-prediction tool."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Build BOTH the source effect and the target reference distribution on completely observed enrollment (continuous A/B/D for FFS; full commercial medical+pharmacy); exclude MA-only person-time, which carries no FFS claims and silently distorts the target distribution. Proxy renal/frailty effect modifiers from diagnosis/procedure codes (CKD ICD-10 N18.x, dialysis HCPCS, claims-based frailty index) and watch differential competing risk by death in the elderly.",
      "ehr": "Site mix and visit-driven capture dominate transport; the implied target is a population of encounters conditioned on continued contact. Reconcile severity and coding intensity between source and target sites before weighting.",
      "registry": "Excellent for measuring severity/staging modifiers but over-represents treated/specialty patients; use registry severity to DEFINE modifiers but draw the target DISTRIBUTION from a population-representative source.",
      "linked": "Best substrate for measuring modifiers (EHR/registry severity + claims completeness) but the linkable subset is selected; confirm the linkage-selected sample is the population you can actually act on."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\nimport pandas as pd\nimport statsmodels.formula.api as smf\n\nMODIFIERS = \"age + female + ckd_stage + frailty_index + prior_insulin\"\n\ndef transport_weights(stacked: pd.DataFrame, trunc=(0.01, 0.99)) -> pd.DataFrame:\n    # P(S=1 | modifiers): probability of being in the STUDY given the effect modifiers.\n    sel = smf.logit(f\"S ~ {MODIFIERS}\", data=stacked).fit(disp=False)\n    p_study = sel.predict(stacked)\n\n    study = stacked.loc[stacked[\"S\"] == 1].copy()\n    p = p_study[stacked[\"S\"] == 1].clip(1e-6, 1 - 1e-6)\n    # Inverse odds of sampling: weight study toward the target distribution. w = P(S=0)/P(S=1).\n    study[\"iosw\"] = (1 - p) / p\n\n    # Stabilize: truncate extreme weights; report effective sample size as a positivity diagnostic.\n    lo, hi = study[\"iosw\"].quantile(list(trunc))\n    study[\"iosw\"] = study[\"iosw\"].clip(lo, hi)\n    ess = study[\"iosw\"].sum() ** 2 / (study[\"iosw\"] ** 2).sum()\n    print(f\"n study={len(study)}  effective n={ess:.0f}  \"\n          f\"weight range=[{study['iosw'].min():.2f}, {study['iosw'].max():.2f}]\")\n    return study\n\ndef target_effect(study: pd.DataFrame) -> float:\n    # Weighted outcome model => marginal log-odds-ratio in the TARGET population.\n    # Use robust (sandwich) SE because weights induce within-person correlation.\n    fit = smf.logit(\"Y ~ A\", data=study).fit(disp=False)               # crude study contrast\n    wfit = smf.wls(\"Y ~ A\", data=study, weights=study[\"iosw\"]).fit()    # quick risk-difference check\n    marg = smf.glm(\"Y ~ A\", data=study, family=__import__(\"statsmodels.api\",\n                   fromlist=[\"families\"]).families.Binomial(),\n                   freq_weights=study[\"iosw\"]).fit(cov_type=\"HC0\")\n    return float(marg.params[\"A\"])  # target-population log-OR for treatment A",
        "description": "Inverse-odds-of-sampling-weight (IOSW) transport of a study effect to a target population.\nRequired input (already cleaned): a single stacked person-level table `stacked` with\n  person_id   : id\n  S           : 1 = study sample (has treatment A and outcome Y), 0 = target reference (covariates only)\n  A           : treatment indicator (only required where S==1)\n  Y           : outcome (only required where S==1)\n  <modifiers> : effect modifiers that differ across populations, measured in the baseline window\n                e.g. age, female, ckd_stage, frailty_index, prior_insulin\nReturns the study rows with a stabilized transport weight `iosw` ready to feed a weighted\noutcome model; weights are diagnosed (ESS, overlap) before any effect is reported.",
        "dependencies": [
          "pandas",
          "numpy",
          "statsmodels"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\nlibrary(survey)\nMODIFIERS <- c(\"age\", \"female\", \"ckd_stage\", \"frailty_index\", \"prior_insulin\")\n\ntransport_iosw <- function(stacked) {\n  setDT(stacked)\n  f <- as.formula(paste(\"S ~\", paste(MODIFIERS, collapse = \" + \")))\n  sel <- glm(f, data = stacked, family = binomial())\n  stacked[, p_study := predict(sel, type = \"response\")]\n\n  study <- stacked[S == 1]\n  study[, p_study := pmin(pmax(p_study, 1e-6), 1 - 1e-6)]\n  study[, iosw := (1 - p_study) / p_study]                 # inverse odds of sampling\n  q <- quantile(study$iosw, c(0.01, 0.99))\n  study[, iosw := pmin(pmax(iosw, q[1]), q[2])]            # stabilize extreme weights\n  ess <- sum(study$iosw)^2 / sum(study$iosw^2)             # positivity diagnostic\n  message(sprintf(\"study n=%d  effective n=%.0f\", nrow(study), ess))\n\n  # Weighted outcome model => target-population effect with design-based robust SE.\n  des <- svydesign(ids = ~1, weights = ~iosw, data = study)\n  svyglm(Y ~ A, design = des, family = quasibinomial())\n}\n\n# g-computation cross-check: fit outcome model in study, average over TARGET modifiers.\ntransport_gcomp <- function(stacked) {\n  setDT(stacked)\n  study  <- stacked[S == 1]\n  target <- stacked[S == 0]\n  om <- glm(reformulate(c(\"A\", MODIFIERS), \"Y\"), data = study, family = binomial())\n  t1 <- copy(target)[, A := 1]; t0 <- copy(target)[, A := 0]\n  mean(predict(om, t1, type = \"response\")) -\n    mean(predict(om, t0, type = \"response\"))   # target-population risk difference\n}",
        "description": "IOSW transport with the survey package (design-based robust SEs) plus a g-computation\ncross-check. Input `stacked` mirrors the Python version:\n  S (1=study with A,Y / 0=target covariates-only), A, Y, and the effect modifiers.\nReports the target-population effect from the weighted outcome model and from\nstandardization of a study-fitted outcome model over the target modifier distribution.",
        "dependencies": [
          "survey",
          "data.table"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* 1. Selection model: P(S=1 | effect modifiers). */\nproc logistic data=work.stacked noprint;\n  model S(event='1') = age female ckd_stage frailty_index prior_insulin;\n  output out=ps p=p_study;\nrun;\n\n/* 2. Inverse-odds-of-sampling weight on the study rows; stabilize extremes. */\nproc sql noprint;\n  select pct1, pct99 into :lo, :hi\n  from (select pctl(1, calculated iosw) as pct1, pctl(99, calculated iosw) as pct99\n        from (select case when S=1 then (1-min(max(p_study,1e-6),1-1e-6))\n                                      /  min(max(p_study,1e-6),1-1e-6) end as iosw\n              from ps where S=1));\nquit;\n\ndata study_w;\n  set ps;\n  if S = 1;\n  p = min(max(p_study, 1e-6), 1 - 1e-6);\n  iosw = (1 - p) / p;                         /* inverse odds of sampling */\n  if iosw < &lo then iosw = &lo;              /* truncate at 1st/99th pctl */\n  if iosw > &hi then iosw = &hi;\nrun;\n\n/* 3a. Weighted target-population effect with design-based robust SE. */\nproc surveylogistic data=study_w;\n  weight iosw;\n  model Y(event='1') = A;                     /* marginal target-population log-OR */\nrun;\n\n/* 3b. Doubly robust cross-check (target ATE on the modifier set). */\nproc causaltrt data=study_w method=aipw;\n  psmodel A(ref='0') = age female ckd_stage frailty_index prior_insulin;\n  model   Y(ref='0') = age female ckd_stage frailty_index prior_insulin;\nrun;\n\n/* 3c. Direct-standardization alternative: study event rates standardized to the\n       TARGET modifier distribution (PROC STDRATE), useful when weights are unstable. */\nproc stdrate data=work.stacked(where=(S=1)) refdata=work.stacked(where=(S=0))\n             method=direct stat=risk;\n  population group=A event=Y total=_one_;\n  strata age female ckd_stage frailty_index prior_insulin / stats;\nrun;",
        "description": "IOSW transport in SAS with a weighted marginal effect, plus a CAUSALTRT cross-check.\nRequired input dataset work.stacked (post data-management):\n  person_id, S (1=study with A & Y, 0=target covariates only), A, Y,\n  and the effect modifiers age female ckd_stage frailty_index prior_insulin.\nPROC SURVEYLOGISTIC gives the design-based robust SE for the weighted target effect;\nPROC CAUSALTRT gives a doubly robust target ATE on the study rows for comparison.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Study[\"Study sample S=1<br/>internally valid effect of A on Y\"] --> Mods[\"Identify EFFECT MODIFIERS V<br/>(not all prognostic factors)\"]\n  Target[\"Target population S=0<br/>covariate distribution only\"] --> Mods\n  Mods --> Sel[\"Selection model P(S=1 | V)\"]\n  Sel --> Choose{\"Overlap / positivity on V?\"}\n  Choose -->|\"adequate overlap\"| Wt[\"IOSW w=(1-p)/p<br/>or standardize over target V\"]\n  Choose -->|\"target strata absent in study\"| Stop[\"STOP: positivity violated<br/>do NOT transport\"]\n  Wt --> Eff[\"Target-population effect<br/>(target-ATE)\"]\n  Eff --> Sens[\"Sensitivity: unmeasured modifier,<br/>scale, weight truncation\"]\n  style Stop fill:#fee2e2,stroke:#dc2626\n  style Eff fill:#dcfce7,stroke:#16a34a",
        "caption": "Transport workflow. The decision hinges on effect-modifier identification and on positivity (every target covariate stratum represented in the study); absent overlap, reweighting fabricates an answer rather than estimating one.",
        "alt_text": "Flowchart from internally valid study effect and target covariate distribution through effect-modifier identification, a selection model, a positivity check that can stop the analysis, inverse-odds weighting or standardization, the target effect, and sensitivity analyses.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  S((\"S<br/>selection /<br/>sampling\")) --> V[\"V: effect modifiers<br/>(age, CKD, frailty)\"]\n  V --> A[\"A: treatment\"]\n  A --> Y[\"Y: outcome\"]\n  V --> Y\n  S -. \"transport valid iff<br/>S ⊥ Y(a) | V\" .-> Y\n  style S fill:#dbeafe,stroke:#2563eb\n  style V fill:#fef9c3,stroke:#ca8a04",
        "caption": "Selection diagram for transportability (Pearl–Bareinboim). The square node S marks where source and target differ; the target effect is identifiable when every modifier on a path that differs by S is measured in V, i.e., S ⊥ Y(a) | V.",
        "alt_text": "Causal diagram with a selection node S pointing into the effect-modifier set V, V pointing to treatment A and outcome Y, A pointing to Y, and a dashed conditional-independence annotation S independent of potential outcomes given V.",
        "source_type": "illustrative",
        "source_citations": [
          "pearl-2014"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "used_with",
        "target_slug": "picots-framework-rwe",
        "notes": "PICOTS specifies the target population and setting to which the evidence must apply — the named target that transport reweights or standardizes toward."
      },
      {
        "relation_type": "part_of",
        "target_slug": "estimands-ate-att-intercurrent-events-rwe",
        "notes": "The estimand's population attribute distinguishes the source-ATE from the target-ATE; transport changes the population, not the treatment contrast."
      },
      {
        "relation_type": "see_also",
        "target_slug": "prediction-model-validation-recalibration-rwe",
        "notes": "Prediction-model performance must be externally validated and recalibrated in the deployment target — the predictive analogue of causal-effect transport."
      },
      {
        "relation_type": "see_also",
        "target_slug": "baseline-characteristics-and-covariate-balance-rwe",
        "notes": "Study-vs-target covariate comparison is the diagnostic that reveals which effect modifiers differ and whether overlap supports transport."
      }
    ],
    "aliases": [
      "external validity",
      "applicability",
      "transportability",
      "generalizability",
      "transport weights",
      "inverse odds of sampling weights",
      "target population",
      "sampling weights"
    ],
    "complexity": "advanced",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "generalized-linear-models",
    "name": "Generalized Linear Models (GLM)",
    "short_definition": "A unified regression framework that extends ordinary least squares to non-normal outcomes by pairing a distributional family (which encodes the variance structure) with a link function (which determines the scale on which the linear predictor acts and thus how coefficients are interpreted) — the combination of family and link is the single most consequential modelling decision an analyst makes for binary, count, and cost outcomes in real-world evidence and health economics research.",
    "long_description": "**What a GLM is: three components that click together**\n\nOrdinary least squares (OLS) regression assumes that the outcome Y is continuous, has\nconstant variance, and that residuals are approximately normally distributed. Most outcomes\nin real-world evidence (RWE) and health economics and outcomes research (HEOR) violate at\nleast one of those assumptions: binary endpoints (hospitalization, response, mortality) are\nBernoulli-distributed; healthcare costs are right-skewed with a spike at zero and variance\nthat grows with the mean; prescription fill counts are non-negative integers. Forcing these\noutcomes into an OLS framework produces systematically biased predictions (fitted\nprobabilities outside [0,1] for binary outcomes; negative predicted costs) and invalid\nstandard errors.\n\nA Generalized Linear Model (GLM) solves this by specifying three components explicitly and\nlinking them together:\n\n1. *The random component (distributional family)*: The assumed conditional distribution of\n   Y given the covariates — Binomial, Poisson, Negative Binomial, Gamma, Gaussian, and\n   inverse Gaussian are the families used most commonly in HEOR. The family determines the\n   mean-variance relationship. Gaussian assumes Var(Y) = σ² (constant). Poisson assumes\n   Var(Y) = μ (variance equals mean). Negative Binomial adds an overdispersion parameter k\n   so Var(Y) = μ + μ²/k. Gamma assumes Var(Y) = φμ² (variance proportional to squared mean,\n   appropriate for right-skewed positive costs). Binomial assumes Var(Y) = μ(1-μ)/n.\n\n2. *The systematic component (linear predictor)*: η = β₀ + β₁X₁ + β₂X₂ + … + βₚXₚ. The\n   same ordinary weighted sum of predictors, including continuous variables, indicators,\n   interactions, and offset terms (log of person-time for rate models), that appears in OLS.\n\n3. *The link function*: g(μ) = η. The link function maps the expected value of Y (μ) to the\n   linear predictor (η). 
The choice of link determines the scale on which the additive linear\n   predictor operates and therefore the estimand the model's coefficients represent. The model\n   is estimated by iteratively reweighted least squares (IRLS), the algorithm Nelder and\n   Wedderburn (1972) showed solves the maximum likelihood score equations for the entire GLM\n   family with a single unified procedure.\n\n**Canonical vs chosen links, and why they are independent decisions**\n\nEach distributional family has a *canonical* link: the link for which the IRLS algorithm\nsimplifies and the score equations have closed-form sufficient statistics. Logit is canonical\nfor Binomial; log is canonical for Poisson; inverse (reciprocal) is canonical for Gamma;\nidentity is canonical for Gaussian. Canonical links generally give the best numerical\nstability and the fastest convergence.\n\nHowever, the link is a *separate decision* from the family. An analyst fitting a binomial\nGLM can choose logit (canonical; produces ORs), log (produces RRs but can produce predicted\nprobabilities above 1 for common outcomes), or identity (produces risk differences; requires\nconstrained estimation to keep predictions in [0,1]). Choosing a non-canonical link is\nlegitimate and often scientifically preferable when the effect measure on the canonical scale\n(an OR) is less meaningful for decision-making than the alternative (an RR or RD).\n\nThe two decisions (family and link) are made for different reasons and analysts too often\nconflate them. The *family is a statement about variance structure*: choosing Gamma asserts\nthat the variance of the outcome grows with the square of the mean, which is the right\nstructural assumption for healthcare costs. The *link is a statement about effect-measure\nscale*: choosing a log link asserts that the covariate acts multiplicatively on the mean\n(a one-unit change multiplies the mean by exp(β)), which may be the clinically or\neconomically natural scale regardless of the family. 
You can run a Gamma family with log link\n(the standard for costs), a Gamma family with identity link (less common), or a Gamma family\nwith inverse link (the canonical; typically least useful for HEOR). The family and link\nchoices are orthogonal.\n\n**Interpreting the output**\n\nThe link function is the master key for coefficient interpretation. Consider a single binary\ncovariate (exposed vs unexposed) with estimated coefficient β = 0.693 (approximately ln 2)\nand read that same number under three different links for a binomial outcome. (Fitted to the\nsame data, the three links would return different values of β; the value is held fixed here\nonly to isolate how the link changes its meaning.)\n\n*Identity link* (linear predictor = μ directly):\n- Formal interpretation: β = 0.693 is the conditional additive difference in the probability\n  of the event for exposed vs unexposed patients, holding all other covariates fixed. The\n  95% CI on β is a CI on the risk difference. The scale is the original probability scale.\n- Practical interpretation: \"The probability of the event is 0.693 higher in absolute terms,\n  approximately 69 percentage points higher, for exposed than for unexposed patients.\"\n  (If both groups have plausible baseline probabilities, the identity-scale estimate is the\n  most direct input to number-needed-to-treat or budget-impact calculations.)\n\n*Log link* (linear predictor = log μ):\n- Formal interpretation: exp(0.693) ≈ 2.0. The exponentiated coefficient is the conditional\n  ratio of expected outcomes (for Binomial: the risk ratio; for Poisson/NB: the rate ratio;\n  for Gamma: the cost ratio or mean utilization ratio). 
The CI on β maps, after\n  exponentiation, to a CI on the multiplicative ratio.\n- Practical interpretation: \"Exposed patients have approximately twice the risk (rate, mean\n  cost) of the outcome compared with unexposed patients, holding covariates fixed.\" The log\n  link is the most common choice in HEOR because multiplication is a natural model for how\n  disease exposure, drug effects, and comorbidity burden combine.\n\n*Logit link* (linear predictor = log(μ/(1-μ))):\n- Formal interpretation: exp(0.693) ≈ 2.0. The exponentiated coefficient is the conditional\n  odds ratio — the ratio of the odds of the event in exposed vs unexposed patients, holding\n  covariates fixed. Critically, this is NOT a risk ratio; it approximates the RR only when\n  the baseline outcome probability is below approximately 10%.\n- Practical interpretation: \"The odds of the event are approximately twice as high for\n  exposed patients as for unexposed patients.\" Odds ratios are non-collapsible: the marginal\n  OR (across the whole population) will differ from the conditional OR (within strata), even\n  without confounding, which creates communication challenges when presenting to payers and\n  HTA bodies who think in terms of risks and rates rather than odds.\n\nThe choice of link should therefore be driven by the effect measure required for\ndecision-making, not by estimation convenience alone. Risk differences feed into\nnumber-needed-to-treat and budget-impact models. Risk ratios and rate ratios are natural\ninputs to cost-effectiveness models. Odds ratios are required when the study design only\nidentifies odds (case-control) or when a meta-analytic literature reports ORs.\n\n**The RWE family/link decision table**\n\n- *Binary endpoint (hospitalization, mortality, treatment response)*: Binomial family is\n  the right family. 
For logistic regression (OR): logit link, which is canonical and converges\n  reliably except under complete or quasi-complete separation; it produces ORs\n  (non-collapsible; use marginal effects or recycled predictions to get RDs).\n  For risk ratios: log link, which can fail to converge for common outcomes; use COPY or\n  starting-value strategies, or report Poisson log-linear estimates with robust SE as a\n  pragmatic RR (the \"modified Poisson\" approach). For risk differences: identity link, with\n  constrained fitting required; use with caution at extreme probabilities.\n- *Count and rate outcomes (hospitalizations per year, prescriptions, office visits)*:\n  Poisson family with log link and an offset (log of person-time or denominator). When\n  counts show more variability than Poisson allows (a ubiquitous finding in claims data),\n  use Negative Binomial family with log link, or quasi-Poisson to adjust SE without\n  changing the mean model.\n- *Healthcare costs and positive continuous skewed outcomes*: Gamma family with log link is\n  the modern standard (Barber and Thompson 2004). The log link on the Gamma family targets\n  the conditional mean on the original dollar scale via exp(linear predictor); the cost\n  ratio exp(β) is the natural summary. Manning and Mullahy (2001) showed that the\n  appropriate family choice depends on the shape of the conditional variance: use the\n  modified Park test (regress the log of squared residuals on log of fitted values) to\n  diagnose whether Poisson, Gamma, inverse Gaussian, or OLS on log-transformed costs is\n  most appropriate. The estimated slope indicates the variance function: approximately 0\n  points to Gaussian (constant variance), 1 to Poisson-like, 2 to Gamma, and 3 to inverse\n  Gaussian. A slope near 2, which supports the Gamma family, is the usual finding for\n  pharmaceutical cost data.\n- *Zero-heavy semicontinuous costs (many patients with $0 spend)*: Neither OLS nor a\n  single GLM adequately handles a mass at zero combined with a continuous right-skewed\n  positive distribution. 
The standard approach is a two-part (hurdle) model: a logistic\n  or probit model for the probability of any spend, and a Gamma GLM (or a log-normal model)\n  for the conditional mean among spenders. A single GLM is not the right tool here.\n\n**Quasi-likelihood and robust standard errors**\n\nWhen the chosen distributional family does not fully match the data's variance structure,\nfor example when counts are overdispersed relative to Poisson, quasi-likelihood\n(Wedderburn 1974) extends the GLM framework by specifying only the mean-variance\nrelationship (V(μ)) rather than the full distribution. Quasi-Poisson and quasi-binomial\nkeep the mean model intact but multiply the standard errors by the square root of a\ndispersion factor, estimated as the Pearson chi-square statistic divided by the residual\ndegrees of freedom, inflating SE when data are overdispersed. This is\nsimpler to implement than switching to Negative Binomial but does not support AIC-based\nmodel comparison (AIC is undefined for quasi-likelihoods). The alternative for robust\ninference without distributional assumptions is the sandwich (Huber-White) variance\nestimator, which is available in all three major statistical languages and is widely used in\nHEOR when the variance assumption cannot be fully verified.\n\n**Deviance, AIC, and model checking**\n\nDeviance is the GLM analogue of residual sum of squares: it measures the discrepancy\nbetween the fitted model and the saturated model. For Poisson and grouped Binomial models,\nthe residual deviance divided by its (n - p) degrees of freedom should be approximately 1.0\nfor a well-specified family; a residual deviance/df\nratio substantially above 1.0 signals overdispersion. AIC (Akaike Information Criterion)\nallows comparison of GLMs with the same family but different linear predictors or link\nfunctions; it cannot compare models with different families (e.g., Poisson vs Gamma). 
For\nresidual diagnostics, the concepts in regression-diagnostics provide the Pearson and\ndeviance residual checks that confirm family-link adequacy.\n\n**Pros, cons, and trade-offs**\n\n*Pros*:\n- Unified framework: one fitting algorithm (IRLS), one software call (glm/PROC GENMOD),\n  handles binary, count, and continuous skewed outcomes with the same code pattern, making\n  the framework accessible and results comparable across outcome types.\n- Effect-measure flexibility: the analyst controls the estimand (OR, RR, RD, rate ratio,\n  cost ratio) by choosing the link, rather than being forced into the effect measure implied\n  by a transformation.\n- Correct variance structure: with a suitable family and link, predictions respect the\n  support of the outcome (non-negative costs, bounded probabilities) and SEs are calibrated\n  to the assumed variance shape.\n- Natural accommodation of offsets and weights for propensity-score analyses,\n  standardization, and rate models.\n\n*Cons and limitations*:\n- Family misspecification is silent: a GLM fits even when the family is wrong; only\n  diagnostic plots and the Park test reveal poor fit. Analysts who never check residuals\n  produce plausible-looking but miscalibrated results.\n- Non-canonical links can fail to converge, especially the log link (used in place of the\n  canonical logit) for binary outcomes with common events. 
Convergence issues are a practical barrier in production pipelines.\n- The conditional mean is the target estimand; for decision-making that requires marginal\n  (population-average) effects, recycled predictions or average marginal effects must be\n  computed post-estimation.\n- A single GLM is not the right tool for semicontinuous cost distributions (mass at zero\n  plus right-skewed positive values); two-part models or Tweedie distributions are needed.\n- Interpretation of the logit-link OR for common outcomes is routinely misunderstood as a\n  risk ratio, which inflates perceived effect sizes and misleads decision-makers.\n\n**When to use**\n\n- *Binary outcomes*: whenever logistic regression (OR) or a modified Poisson approach (RR)\n  or risk difference estimation is the target; GLMs cover all three via link choice.\n- *Healthcare cost outcomes*: Gamma GLM with log link as the primary analysis whenever the\n  target estimand is the mean cost difference or cost ratio; supported by the modified Park\n  test if the family assumption needs to be checked.\n- *Count and utilization outcomes*: Poisson with log link and offset as the starting model;\n  move to Negative Binomial or quasi-Poisson if overdispersion is detected.\n- *Covariate-adjusted analyses*: GLMs accommodate propensity-score weighting (via the\n  weight option), interaction terms, and spline adjustments within the same linear predictor\n  structure used for unadjusted models.\n- *When an interpretable effect measure on a specific scale (RR, cost ratio) is required*:\n  the log link directly provides the exponentiated coefficient as the ratio of means.\n\n**When NOT to use**\n\n- *Censored time-to-event outcomes*: survival models (Cox proportional hazards, parametric\n  accelerated failure time, Fine-Gray) are the correct framework; a GLM on the event\n  indicator ignores the censoring mechanism and produces biased estimates of event\n  probability over the entire follow-up period.\n- *Repeated 
measures, clustered, or longitudinal outcomes without accounting for\n  correlation*: a single GLM assumes independent observations. Correlated data require\n  generalized estimating equations (GEE), mixed-effects GLMs, or frailty models to account\n  for within-cluster dependence.\n- *Zero-heavy semicontinuous cost distributions*: when a substantial fraction of patients\n  have zero costs and the rest have a continuous right-skewed distribution, a single-part\n  GLM underestimates zero probability or miscalibrates the positive tail. Two-part models\n  (logistic for any spend, Gamma for conditional mean among spenders) are the correct tool.\n- *When an OLS linear predictor on the outcome scale is adequate and the CLT protects\n  standard errors*: in large balanced datasets where outcomes are approximately symmetric\n  and the target is a mean difference, OLS with robust SE is simpler and fully valid; a GLM\n  adds complexity without changing the estimate meaningfully.\n- *As a substitute for causal identification*: a GLM adjusts for measured confounders but\n  does not handle unmeasured confounding, informative censoring, or treatment-selection\n  bias. Route those problems to propensity-score weighting, IV methods, or g-computation\n  before reaching for a GLM family/link upgrade.",
    "primary_category": "Inferential_Statistics",
    "tags": [
      "statistics",
      "regression",
      "glm",
      "link-functions",
      "generalized-linear-models",
      "binomial",
      "poisson",
      "gamma",
      "quasi-likelihood",
      "exponential-family",
      "PROC-GENMOD",
      "statsmodels",
      "cost-analysis",
      "count-models",
      "risk-difference",
      "risk-ratio",
      "odds-ratio"
    ],
    "applies_to_study_types": [
      "cohort_retrospective",
      "cohort_prospective",
      "rct",
      "cross_sectional",
      "active_comparator_new_user",
      "claims_analysis",
      "ehr_study",
      "registry_study",
      "descriptive_analysis"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "primary",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.2307/2344614",
        "url": "https://doi.org/10.2307/2344614",
        "citation_text": "Nelder JA, Wedderburn RWM. Generalized linear models. Journal of the Royal Statistical Society Series A. 1972;135(3):370-384.",
        "year": 1972,
        "authors_short": "Nelder & Wedderburn",
        "notes": "The paper that unified the GLM family — Gaussian, Binomial, Poisson, Gamma, inverse Gaussian — under a single maximum-likelihood framework solved by iteratively reweighted least squares. Every GLM implementation in R (glm), Python (statsmodels), and SAS (PROC GENMOD) is a direct descendant of this algorithm."
      },
      {
        "role": "explain",
        "doi": "10.1093/biomet/61.3.439",
        "url": "https://doi.org/10.1093/biomet/61.3.439",
        "citation_text": "Wedderburn RWM. Quasi-likelihood functions, generalized linear models, and the Gauss-Newton method. Biometrika. 1974;61(3):439-447.",
        "year": 1974,
        "authors_short": "Wedderburn",
        "notes": "Introduces quasi-likelihood as an extension of the GLM that specifies only the mean-variance relationship rather than a full distributional family. Underpins quasi-Poisson and quasi-binomial, the go-to tools for overdispersed count and binary data in claims-based HEOR."
      },
      {
        "role": "demonstrate",
        "doi": "10.1016/s0167-6296(01)00086-8",
        "url": "https://doi.org/10.1016/s0167-6296(01)00086-8",
        "citation_text": "Manning WG, Mullahy J. Estimating log models: to transform or not to transform? Journal of Health Economics. 2001;20(4):461-494.",
        "year": 2001,
        "authors_short": "Manning & Mullahy",
        "notes": "Provides the modified Park test and the theoretical basis for choosing between log-OLS, Poisson, Gamma, and inverse-Gaussian specifications for healthcare cost data. Essential reading for any analyst deciding which GLM family to apply to cost outcomes; directly motivates the Gamma-log recommendation for most pharmaceutical cost analyses."
      },
      {
        "role": "use",
        "doi": "10.1258/1355819042250249",
        "url": "https://doi.org/10.1258/1355819042250249",
        "citation_text": "Barber J, Thompson S. Multiple regression of cost data: use of generalised linear models. Journal of Health Services Research and Policy. 2004;9(4):197-204.",
        "year": 2004,
        "authors_short": "Barber & Thompson",
        "notes": "Empirical demonstration that GLMs (particularly Gamma with log link) outperform OLS on raw or log-transformed costs for healthcare cost regression in terms of bias, coverage, and mean squared error. The practical justification for Gamma-log as the default in pharmacoeconomic cost modelling."
      }
    ],
    "plain_language_summary": "A Generalized Linear Model (GLM) is a flexible regression framework that lets analysts pick both the shape of the outcome distribution (called the \"family\") and the scale on which the treatment effect is measured (called the \"link function\"), so the model fits binary events, count data, and skewed healthcare costs with equal ease. The family choice answers \"how does variance grow with the mean?\" while the link choice answers \"should the effect be measured as a difference, a ratio of means, or an odds ratio?\" — these are two separate decisions even though most textbooks bundle them together. Getting both right is the single most consequential statistical modelling choice in an HEOR analysis, because the same dataset will give a risk difference, a risk ratio, or an odds ratio depending solely on which link you specify.",
    "key_terms": [
      {
        "term": "link function",
        "definition": "A mathematical transformation applied to the outcome's expected value so that the transformed value can be modelled as a straight-line combination of predictors; for example, the logit link transforms a probability into a log-odds so it can range from negative to positive infinity."
      },
      {
        "term": "family",
        "definition": "The assumed probability distribution for the outcome in a GLM; the family determines how variance grows with the mean (e.g., Binomial for 0/1 events, Gamma for right-skewed costs, Poisson for counts)."
      },
      {
        "term": "linear predictor",
        "definition": "The weighted sum of predictor variables (β₀ + β₁X₁ + … + βₚXₚ) that sits inside a GLM; the link function connects this sum to the outcome's expected value."
      },
      {
        "term": "canonical link",
        "definition": "The default, numerically most stable link for a given family (logit for Binomial, log for Poisson and Gamma, identity for Gaussian); using a non-canonical link is valid but may cause convergence problems."
      },
      {
        "term": "deviance",
        "definition": "The GLM measure of model fit, analogous to residual sum of squares in OLS; a residual deviance divided by degrees of freedom much larger than 1.0 signals that the assumed family underestimates the real variability in the data."
      },
      {
        "term": "quasi-likelihood",
        "definition": "An extension of the GLM that keeps the mean model but inflates standard errors to account for extra variability (overdispersion) without committing to a fully specified distributional family; quasi-Poisson and quasi-binomial are the most common examples."
      }
    ],
    "worked_example": {
      "scenario": "A claims-based cohort study compares a binary outcome (30-day hospital readmission) between 100 patients who received a new care coordination program (exposed) and 100 patients who received standard care (unexposed). The analyst wants to estimate three different effect measures from the same 2x2 event table — the risk difference (RD), the risk ratio (RR), and the odds ratio (OR) — by fitting the same Binomial GLM with three different link functions: identity, log, and logit. This illustrates that the choice of link function, not the data, determines which effect measure falls out of the regression coefficient.",
      "dataset": {
        "caption": "Summary 2x2 table: readmission events out of 100 patients per group. The individual-level dataset has 200 rows (one per patient) with outcome = 1 (readmitted) or 0 (not readmitted).",
        "columns": [
          "group",
          "events",
          "total",
          "proportion"
        ],
        "rows": [
          [
            "exposed",
            30,
            100,
            0.3
          ],
          [
            "unexposed",
            15,
            100,
            0.15
          ]
        ]
      },
      "steps": [
        "Compute group proportions. Exposed: 30/100 = 0.30. Unexposed: 15/100 = 0.15.",
        "Identity link (linear predictor acts directly on the probability μ). The regression coefficient for the exposed indicator is the additive difference in proportions: RD = 0.30 - 0.15 = 0.15. Interpretation: exposed patients have a 15 percentage-point higher absolute probability of readmission.",
        "Log link (linear predictor acts on log μ). The regression coefficient for exposed is log(RR), so the exponentiated coefficient is the risk ratio: RR = 0.30/0.15 = 2.0. Interpretation: exposed patients are twice as likely to be readmitted.",
        "Logit link (linear predictor acts on log(μ/(1-μ))). Compute odds in each group. Odds exposed = 30/70. Odds unexposed = 15/85. OR = (30/70) / (15/85) = (30*85) / (70*15) = 2550/1050 = 2.43. Interpretation: the odds of readmission are 2.43 times higher for exposed patients. Note that 2.43 ≠ 2.0 (the RR); odds ratios overstate risk ratios when the baseline rate is not rare (here 15% is not rare).",
        "All three models use the same 200-observation dataset and the same linear predictor (intercept + exposed indicator). The link function alone determines whether the coefficient is a RD, log(RR), or log(OR). The decision should be made based on which effect measure the decision-maker needs, not on which link converges more easily."
      ],
      "result": "Identity link: RD = 0.30 - 0.15 = 0.15 (15 pp higher absolute risk for exposed group). Log link: RR = 0.30/0.15 = 2.0 (exposed have twice the readmission rate). Logit link: OR = 30*85/(70*15) = 2550/1050 = 2.43 (odds 2.43x higher for exposed). The three models share the same data; the coefficient scale changes with the link."
    },
    "prerequisites": [
      "ols-linear-regression",
      "inferential-statistics-foundations"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [],
    "tradeoffs": [],
    "implementation_notes_by_data_source": {},
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\nimport statsmodels.api as sm\n\n# ── Dataset: 200 patients, binary readmission outcome ──────────────────────────\n# Exposed group: 30 events in 100 patients\n# Unexposed group: 15 events in 100 patients\noutcome  = np.array([1]*30 + [0]*70 + [1]*15 + [0]*85)\nexposed  = np.array([1]*100 + [0]*100)\nX        = sm.add_constant(exposed.astype(float))\n\n# ── 1. Binomial + identity link → Risk Difference ──────────────────────────────\ntry:\n    m_id = sm.GLM(outcome, X,\n                  family=sm.families.Binomial(link=sm.families.links.Identity())\n                  ).fit(method=\"newton\")\n    rd = m_id.params[\"x1\"]\n    ci = m_id.conf_int().loc[\"x1\"]\n    print(f\"Identity link  RD  = {rd:.4f}  95% CI [{ci[0]:.4f}, {ci[1]:.4f}]\")\n    # Expected: RD ≈ 0.15 (= 0.30 - 0.15)\nexcept Exception as e:\n    print(f\"Identity link did not converge: {e}  -- use constrained estimation\")\n\n# ── 2. Binomial + log link → Risk Ratio ────────────────────────────────────────\n# Log-link binomials can fail for high event rates; add starting values or use\n# modified Poisson (Poisson + log + robust SE) as a pragmatic RR estimator.\ntry:\n    m_log = sm.GLM(outcome, X,\n                   family=sm.families.Binomial(link=sm.families.links.Log())\n                   ).fit(method=\"bfgs\", start_params=[np.log(0.15), np.log(2)])\n    rr = np.exp(m_log.params[\"x1\"])\n    ci_rr = np.exp(m_log.conf_int().loc[\"x1\"])\n    print(f\"Log link       RR  = {rr:.4f}  95% CI [{ci_rr[0]:.4f}, {ci_rr[1]:.4f}]\")\n    # Expected: RR = 2.0\nexcept Exception as e:\n    print(f\"Log-binomial did not converge: {e}\")\n    # Fallback: modified Poisson (Poisson + log + robust SE) — widely used for RR\n    m_mp = sm.GLM(outcome, X,\n                  family=sm.families.Poisson(link=sm.families.links.Log())\n                  ).fit(cov_type=\"HC3\")\n    rr_mp = np.exp(m_mp.params[\"x1\"])\n    print(f\"Modified Poisson RR = {rr_mp:.4f}  (Poisson + log + 
robust SE)\")\n\n# ── 3. Binomial + logit link → Odds Ratio (canonical link) ─────────────────────\nm_logit = sm.GLM(outcome, X,\n                 family=sm.families.Binomial(link=sm.families.links.Logit())\n                 ).fit()\nlog_or = m_logit.params[1]          # position 1 = exposed (X is a NumPy array)\nor_est = np.exp(log_or)\nci_or  = np.exp(m_logit.conf_int()[1])\nprint(f\"Logit link     OR  = {or_est:.4f}  95% CI [{ci_or[0]:.4f}, {ci_or[1]:.4f}]\")\n# Expected: OR = (30*85)/(70*15) = 2550/1050 ≈ 2.43\n\n# ── 4. Poisson + log link → Rate Ratio (count/rate outcomes) ───────────────────\n# Simulate per-patient utilization counts (Poisson by construction; real claims\n# counts are usually overdispersed, so use quasi-Poisson or NB in practice)\nrng    = np.random.default_rng(42)\ncounts = rng.poisson(lam=np.where(exposed == 1, 3.0, 1.5))  # rate ratio = 2.0\noffset = np.log(np.ones(200))                                # log(person-time)\nX_off  = sm.add_constant(exposed.astype(float))\nm_pois = sm.GLM(counts, X_off,\n                family=sm.families.Poisson(link=sm.families.links.Log()),\n                offset=offset).fit()\nprint(f\"Poisson+log    Rate ratio = {np.exp(m_pois.params[1]):.4f}\")\n\n# ── 5. Gamma + log link → Cost Ratio (healthcare costs) ────────────────────────\ncosts  = rng.gamma(shape=2, scale=np.where(exposed == 1, 6000, 4000))\nm_gam  = sm.GLM(costs, sm.add_constant(exposed.astype(float)),\n                family=sm.families.Gamma(link=sm.families.links.Log())).fit()\nprint(f\"Gamma+log      Cost ratio = {np.exp(m_gam.params[1]):.4f}\")\nprint(f\"  (Gamma residual deviance / df = {m_gam.deviance / m_gam.df_resid:.3f})\")\n# For a Gamma GLM, deviance/df estimates the dispersion (roughly 1/shape, so about\n# 0.5 here), not 1.0; assess the family with the modified Park test or residual plots.\n\n# ── 6. 
Quasi-Poisson (overdispersion correction) ────────────────────────────────\n# Scale SEs by sqrt(Pearson χ²/df); does not change point estimates\nm_quasi = sm.GLM(counts, X_off,\n                 family=sm.families.Poisson(link=sm.families.links.Log()),\n                 offset=offset).fit(scale=\"X2\")   # X2 = Pearson chi-square dispersion\nprint(f\"Quasi-Poisson SE scale factor: {np.sqrt(m_quasi.scale):.3f}\")",
        "description": "Demonstrates the family/link grid using statsmodels GLM on the same 200-patient binary\noutcome dataset from the worked example (30/100 exposed events, 15/100 unexposed). Shows\nidentity (RD), log (RR), and logit (OR) links for Binomial; adds Poisson + log for count\ndata and Gamma + log for cost data. Quasi-families are shown via scale parameter\nadjustment. No external dependencies beyond numpy and statsmodels.",
        "dependencies": [
          "numpy",
          "statsmodels",
          "pandas"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "# ── Dataset: 200 patients, binary readmission outcome ──────────────────────────\noutcome <- c(rep(1L, 30), rep(0L, 70), rep(1L, 15), rep(0L, 85))\nexposed <- c(rep(1L, 100), rep(0L, 100))\ndf      <- data.frame(outcome = outcome, exposed = exposed)\n\n# ── 1. Binomial + identity link → Risk Difference ──────────────────────────────\nm_id <- tryCatch(\n  glm(outcome ~ exposed, data = df, family = binomial(link = \"identity\")),\n  error = function(e) NULL\n)\nif (!is.null(m_id)) {\n  cat(\"Identity link  RD =\", coef(m_id)[\"exposed\"], \"\\n\")\n  cat(\"               95% CI:\", confint(m_id)[\"exposed\", ], \"\\n\")\n  # Expected: ≈ 0.15\n} else {\n  cat(\"Identity-binomial did not converge; use Gaussian GLM for approximate RD.\\n\")\n}\n\n# ── 2. Binomial + log link → Risk Ratio ────────────────────────────────────────\nm_log <- tryCatch(\n  glm(outcome ~ exposed, data = df, family = binomial(link = \"log\"),\n      start = c(log(0.15), log(2))),   # starting values improve convergence\n  error = function(e) NULL\n)\nif (!is.null(m_log)) {\n  cat(\"Log link       RR =\", exp(coef(m_log)[\"exposed\"]), \"\\n\")\n} else {\n  # Modified Poisson: Poisson + log + sandwich SE — pragmatic RR estimator\n  m_mp <- glm(outcome ~ exposed, data = df, family = poisson(link = \"log\"))\n  library(sandwich)\n  se_mp <- sqrt(diag(sandwich(m_mp)))[\"exposed\"]\n  rr_mp <- exp(coef(m_mp)[\"exposed\"])\n  cat(\"Modified Poisson  RR =\", rr_mp, \" robust SE =\", se_mp, \"\\n\")\n}\n\n# ── 3. Binomial + logit link → Odds Ratio (canonical; always converges) ────────\nm_logit <- glm(outcome ~ exposed, data = df, family = binomial(link = \"logit\"))\ncat(\"Logit link     OR =\", exp(coef(m_logit)[\"exposed\"]), \"\\n\")\ncat(\"               95% CI:\", exp(confint(m_logit)[\"exposed\", ]), \"\\n\")\n# Expected: OR = (30*85)/(70*15) = 2550/1050 = 2.43\n\n# ── 4. 
Poisson + log link → Rate Ratio (counts with offset) ───────────────────\nset.seed(42)\ncounts <- rpois(200, lambda = ifelse(exposed == 1, 3.0, 1.5))\ndf$counts <- counts\ndf$lpt    <- 0   # log(person-time) = log(1) = 0 per patient\nm_pois <- glm(counts ~ exposed + offset(lpt), data = df,\n              family = poisson(link = \"log\"))\ncat(\"Poisson+log    Rate ratio =\", exp(coef(m_pois)[\"exposed\"]), \"\\n\")\n\n# Overdispersion check: residual deviance / df\ncat(\"  Residual deviance/df =\", deviance(m_pois) / df.residual(m_pois), \"\\n\")\n# If >> 1, switch to quasi-Poisson or Negative Binomial\n\n# ── 5. Quasi-Poisson (overdispersion-corrected SEs) ────────────────────────────\nm_quasi <- glm(counts ~ exposed + offset(lpt), data = df,\n               family = quasipoisson(link = \"log\"))\ncat(\"Quasi-Poisson  Rate ratio =\", exp(coef(m_quasi)[\"exposed\"]),\n    \" (SE inflation factor:\", sqrt(summary(m_quasi)$dispersion), \")\\n\")\n\n# ── 6. Gamma + log link → Cost Ratio (healthcare costs) ────────────────────────\nset.seed(1)\ncosts    <- rgamma(200, shape = 2, rate = 2 / ifelse(exposed == 1, 12000, 8000))\ndf$costs <- costs\nm_gam    <- glm(costs ~ exposed, data = df, family = Gamma(link = \"log\"))\ncat(\"Gamma+log      Cost ratio =\", exp(coef(m_gam)[\"exposed\"]), \"\\n\")\ncat(\"  Residual deviance/df =\", deviance(m_gam) / df.residual(m_gam), \"\\n\")\n\n# ── 7. AIC comparison across links (same family, same data) ────────────────────\n# AIC is valid for comparing models with the SAME family but different links/predictors.\n# It CANNOT compare Binomial vs Poisson (different likelihoods).\ncat(\"AIC logit:\", AIC(m_logit), \" AIC log (if converged):\",\n    if (!is.null(m_log)) AIC(m_log) else NA, \"\\n\")",
        "description": "Side-by-side family/link grid using base R glm() on the same 200-patient binary outcome\ndataset. Demonstrates identity (RD), log (RR), and logit (OR) links for Binomial; Poisson\nwith log and offset for counts; Gamma + log for costs; quasi-families for overdispersion.\nSandwich SEs via the sandwich package are shown for the modified Poisson approach.",
        "dependencies": [
          "sandwich"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* ── Create 200-patient binary outcome dataset ─────────────────────────────── */\ndata work.glm_demo;\n  do i = 1 to 200;\n    if i <= 100 then exposed = 1; else exposed = 0;\n    /* 30 events in exposed (i=1-30), 15 events in unexposed (i=101-115) */\n    if i <= 30 then outcome = 1;\n    else if 100 < i <= 115 then outcome = 1;\n    else outcome = 0;\n    output;\n  end;\nrun;\n\n/* ── 1. Binomial + logit link → Odds Ratio (canonical; always converges) ────── */\nproc genmod data=work.glm_demo;\n  model outcome(event='1') = exposed / dist=binomial link=logit;\n  estimate 'Exposed OR' exposed 1 / exp;\n  /* EXP option: reports exp(estimate) and exp(lower, upper CI) */\n  /* Expected: OR = (30*85)/(70*15) = 2550/1050 = 2.43         */\nrun;\n\n/* ── 2. Binomial + log link → Risk Ratio ─────────────────────────────────────  */\n/* May show \"Warning: Estimated G matrix is not positive definite\" for common    */\n/* outcomes. Add NOINT or constrain starting values if needed.                   */\nproc genmod data=work.glm_demo;\n  model outcome(event='1') = exposed / dist=binomial link=log;\n  estimate 'Exposed RR' exposed 1 / exp;\n  /* Expected: RR = 0.30/0.15 = 2.0                                             */\nrun;\n\n/* ── 3. Binomial + identity link → Risk Difference ───────────────────────────  */\nproc genmod data=work.glm_demo;\n  model outcome(event='1') = exposed / dist=binomial link=identity;\n  estimate 'Exposed RD' exposed 1;\n  /* Expected: RD = 0.30 - 0.15 = 0.15                                          */\nrun;\n\n/* ── 4. 
Poisson + log link + OFFSET → Rate Ratio (counts) ───────────────────  */\ndata work.glm_count;\n  set work.glm_demo;\n  /* Simulate utilization counts per unit of person-time:                       */\n  /* mean 3.0 per patient if exposed, 1.5 if unexposed (rate ratio = 2.0)       */\n  call streaminit(42);\n  count = rand('POISSON', 1.5 + 1.5 * exposed);\n  log_pt = 0;     /* log(person-time); 1 unit per patient, so log(1) = 0        */\nrun;\nproc genmod data=work.glm_count;\n  model count = exposed / dist=poisson link=log offset=log_pt;\n  estimate 'Rate ratio' exposed 1 / exp;\n  /* Overdispersion check: Pearson chi-square / df; if > 1.2, switch to NEGBIN  */\nrun;\n\n/* ── 5. Negative Binomial + log link → Rate Ratio (overdispersed counts) ────  */\nproc genmod data=work.glm_count;\n  model count = exposed / dist=negbin link=log offset=log_pt;\n  estimate 'NB rate ratio' exposed 1 / exp;\n  /* DIST=NEGBIN estimates the dispersion parameter k; Var = mu + k*mu^2        */\nrun;\n\n/* ── 6. Gamma + log link → Cost Ratio (healthcare costs) ─────────────────────  */\ndata work.glm_cost;\n  set work.glm_demo;\n  call streaminit(1);\n  cost = rand('GAMMA', 2) * (8000 + 4000 * exposed);   /* synthetic right-skewed costs */\nrun;\nproc genmod data=work.glm_cost;\n  model cost = exposed / dist=gamma link=log;\n  estimate 'Cost ratio' exposed 1 / exp;\n  /* Unscaled Pearson chi-square / df estimates the Gamma dispersion (about     */\n  /* 1/shape, so roughly 0.5 here), not 1.0; check the family with the modified */\n  /* Park test or residual plots.                                               */\nrun;\n\n/* ── 7. Quasi-likelihood via SCALE=PEARSON (overdispersion-corrected SEs) ────  */\nproc genmod data=work.glm_count;\n  model count = exposed / dist=poisson link=log offset=log_pt scale=pearson;\n  estimate 'Quasi-Poisson rate ratio' exposed 1 / exp;\n  /* SCALE=PEARSON: inflates SEs by sqrt(Pearson chi-square/df)                 */\n  /* Point estimates unchanged; only SEs and p-values are affected              */\nrun;",
        "description": "Family/link grid using PROC GENMOD for the same 200-patient binary outcome dataset.\nDemonstrates DIST=BINOMIAL with LINK=LOGIT (OR), LINK=LOG (RR), and LINK=IDENTITY (RD).\nPROC GENMOD DIST=POISSON for count/rate data with OFFSET. DIST=GAMMA LINK=LOG for costs.\nThe ESTIMATE statement with EXP option produces exponentiated coefficient CIs directly.\nType 3 Wald tests and the Pearson goodness-of-fit chi-square are used for diagnostics.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Q[What type of outcome?] --> Bin[Binary 0/1<br/>event/no event]\n  Q --> Count[Count or rate<br/>e.g. hospitalizations/year]\n  Q --> Cost[Right-skewed positive<br/>e.g. healthcare costs]\n  Q --> Cont[Approximately symmetric<br/>continuous outcome]\n  Bin --> OR[\"Effect measure needed?<br/>Odds ratio → logit link<br/>(canonical, always converges)\"]\n  Bin --> RR[\"Risk ratio → log link<br/>(may need starting values<br/>or modified Poisson)\"]\n  Bin --> RD[\"Risk difference → identity link<br/>(constrained estimation)\"]\n  Count --> Pois[\"Poisson + log + offset<br/>(check deviance/df ≈ 1)\"]\n  Pois --> Over{Overdispersed?}\n  Over -->|Yes| NB[\"Negative Binomial + log<br/>or quasi-Poisson\"]\n  Over -->|No| Pois\n  Cost --> Park[\"Modified Park test<br/>(Manning & Mullahy 2001)\"]\n  Park --> Gamma[\"Gamma + log link<br/>(most cost data: Park coef ≈ 2)\"]\n  Park --> IG[\"Inverse Gaussian + log<br/>(Park coef ≈ 3)\"]\n  Park --> Zero{\"Mass at zero?\"}\n  Zero -->|Yes| TwoPart[\"Two-part model<br/>(logistic + Gamma/lognormal)\"]\n  Zero -->|No| Gamma\n  Cont --> OLS[\"OLS (Gaussian + identity)<br/>with robust SE\"]",
        "caption": "Decision tree for selecting the GLM family and link function by outcome type in RWE/HEOR. The Park test guides cost family selection; overdispersion diagnostics guide count model upgrading from Poisson to Negative Binomial.",
        "alt_text": "Flowchart branching on outcome type (binary, count/rate, cost, continuous) into the appropriate GLM family and link function, with diagnostic decision points for overdispersion (Poisson to NB/quasi) and zero inflation (single GLM to two-part model).",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "see_also",
        "target_slug": "ols-linear-regression",
        "notes": "OLS is the Gaussian-family identity-link special case of the GLM framework; GLMs exist precisely to extend OLS to non-Gaussian outcomes. Reading OLS first provides the linear predictor and least-squares intuition that GLM builds on."
      },
      {
        "relation_type": "see_also",
        "target_slug": "gamma-distribution",
        "notes": "The Gamma distributional family is the standard choice for right-skewed positive cost outcomes; understanding the Gamma distribution's shape and variance structure clarifies why Gamma + log link is the appropriate GLM specification for healthcare cost regression."
      },
      {
        "relation_type": "see_also",
        "target_slug": "poisson-distribution",
        "notes": "The Poisson distributional family underlies GLMs for count and rate outcomes; the Poisson mean-variance relationship (Var = μ) is the baseline against which overdispersion is diagnosed and motivates upgrades to Negative Binomial or quasi-Poisson."
      },
      {
        "relation_type": "see_also",
        "target_slug": "negative-binomial-distribution",
        "notes": "The Negative Binomial family adds a free dispersion parameter to the Poisson GLM to model overdispersed count data (common in claims-based utilization studies where Var >> μ); it is the default upgrade from Poisson when residual deviance/df exceeds approximately 1.2."
      },
      {
        "relation_type": "see_also",
        "target_slug": "binomial-distribution-logit-link",
        "notes": "The Binomial family with logit link is logistic regression — the most commonly used GLM in clinical research. This entry covers the logit link as one option within the broader Binomial GLM; the binomial-distribution-logit-link entry provides deep coverage of the logistic regression special case."
      },
      {
        "relation_type": "see_also",
        "target_slug": "logistic-regression-for-binary-outcomes",
        "notes": "Logistic regression is the Binomial GLM with logit link; this relation connects the broader GLM framework to the deep-dive on logistic regression, including convergence diagnostics, separation, and c-statistic interpretation not covered at the GLM level."
      }
    ],
    "aliases": [
      "GLM",
      "link function",
      "exponential family regression",
      "PROC GENMOD",
      "generalized linear model",
      "glm()",
      "statsmodels GLM",
      "logit link",
      "log link",
      "identity link",
      "Poisson GLM",
      "Gamma GLM",
      "binomial GLM",
      "quasi-likelihood",
      "IRLS"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "grace-period-gap-rules-rwe",
    "name": "Grace Period and Permissible Gap Rules",
    "short_definition": "The pre-specified maximum allowable gap between consecutive dispensings (or administrations/visits) before exposure is considered discontinued and an exposure episode is closed, determining persistence, on-treatment person-time, and exposure misclassification.",
    "long_description": "A **grace period (permissible gap) rule** is the operational decision that converts a sequence of discrete dispensing\nrecords into continuous *exposure episodes*. Each fill confers a covered window of length `days_supply` starting at\n`fill_date`. When the next fill arrives before that window expires (allowing for stockpiling) the episode continues; when\nthe gap between the run-out date of one fill and the start of the next exceeds the grace period, the episode is **closed**\nand the patient is classified as having discontinued at the run-out date (often plus a fraction of the grace period). The\ngrace period is therefore not a nuisance parameter — it simultaneously defines who is persistent, how much on-treatment\nperson-time each patient contributes, and where exposure is misclassified. Gardarsdottir et al. showed empirically that the\nnumber, length, and count of constructed treatment episodes is *directly governed* by the gap length chosen, which is why\nthe rule must be specified in the protocol and varied in sensitivity analysis rather than left to a default.\n\n**Core conceptual distinction.** The grace period is distinct from, but interacts with, three neighboring constructs. (1)\n*Grace period vs days_supply / stockpiling rule*: `days_supply` sets the nominal covered window; the **stockpiling\n(carry-over) rule** decides whether early refills bank surplus supply forward; the **grace period** is the slack added\n*after* the (possibly carried-over) supply runs out before discontinuation is declared. 
(2) *Permissible gap vs episode-gap\n(new-episode) threshold*: a short grace period bridges within-episode interruptions; a separate, usually longer, threshold\ngoverns whether a return after a long gap starts a *new* episode (re-initiation/rechallenge) versus continuing the old one.\n(3) *Grace period as exposure-window definition vs as a censoring/lag device*: in an as-treated analysis the grace period\nextends the on-treatment window (and thus attributes more outcomes to the drug); in an intention-to-treat/first-line\nanalysis the grace period mainly affects the *persistence* endpoint, not outcome attribution. The estimand must state which.\nThe fundamental trade-off is bias-directional: **too short** a grace period creates spurious discontinuations and gaps\n(artifactual non-persistence, and immortal-time/exposure misclassification if outcomes occurring during the \"gap\" are\ntreated as unexposed), while **too long** a grace period carries non-adherent person-time as \"exposed,\" diluting both\nbenefit and harm toward the null and inflating apparent persistence.\n\n**Pros, cons, and trade-offs** (vs the specific alternatives):\n- **vs a fixed default gap (e.g., 30 days for everyone):** A drug-class-calibrated, refill-pattern-informed grace period\n  (e.g., 50% of typical `days_supply`, or an empirically derived gap distribution) reduces differential misclassification\n  across drugs with different refill cadence (90-day mail-order vs 30-day retail) and different half-lives. Cost: more\n  programming and a defensible justification per drug. **Prefer a calibrated rule** for any comparative or regulatory study\n  where two arms have different supply patterns.\n- **vs no grace period (strict run-out = discontinuation):** A grace period prevents counting routine refill timing\n  variation (weekend pickups, mail-order lag, sample/free supply) as discontinuation. Cost: it lengthens on-treatment\n  person-time and can mask true early stopping. 
**Prefer a grace period** in essentially all persistence and as-treated\n  analyses; **prefer strict run-out only** for conservative safety signal detection where over-attributing exposure is the\n  worse error.\n- **vs an as-treated analysis with informative-censoring weights:** A grace period plus simple censoring at episode end is\n  transparent and defensible but ignores that discontinuation is rarely random. Cost: differential discontinuation by arm\n  biases naive as-treated estimates. **Prefer adding inverse-probability-of-censoring weights** (or an ITT/first-line\n  contrast) when discontinuation is common and arm-dependent; the grace period alone does not fix informative censoring.\n\n**When to use**. Whenever discrete fills/administrations must be assembled into exposure episodes: persistence and\ntime-to-discontinuation endpoints; PDC/MPR denominators that depend on episode boundaries; as-treated and per-protocol\nexposure windows in comparative effectiveness/safety; defining \"current use\" for self-controlled or nested case-control\ndesigns; and any utilization/cost analysis where treatment-episode length drives the outcome. Specify the grace period, the\nstockpiling rule, and the new-episode threshold together, and pre-register at least one alternative for sensitivity.\n\n**When NOT to use — and when it is actively misleading or dangerous**.\n- **Do not let a single grace period span arms with different refill cadence** (e.g., a once-daily oral with 90-day\n  mail-order vs a titrated agent with 30-day retail fills). A common absolute gap (say 30 days) is lenient for the\n  30-day drug and strict for the 90-day drug, producing *differential* exposure misclassification that biases the\n  comparison in an unpredictable direction. 
Calibrate per drug or express the gap as a proportion of `days_supply`.\n- **Do not apply a refill-based grace period to drugs not captured as outpatient pharmacy claims.** Clinician-administered\n  infusions/injectables (J-codes on the medical claim), inpatient-dispensed drugs, and 340B/white-bagged products have no\n  `days_supply`; a pharmacy grace-period rule silently classifies all such person-time as unexposed.\n- **Do not treat outcomes occurring inside the grace period as unexposed in an as-treated analysis** — that is a recipe for\n  immortal-time-style misclassification. The grace-period window must be classified as exposed (or the analysis must use a\n  time-varying exposure with an explicit lag), not retro-assigned based on whether a refill eventually appeared.\n- **Do not use a long grace period when the question is early discontinuation or first-fill-only (primary non-adherence).**\n  A long grace masks exactly the signal of interest.\n\n**Data-source operational depth**.\n- **Claims (FFS):** The native substrate — `person_id`, `fill_date`, `days_supply`, NDC. Require continuous\n  medical + pharmacy enrollment across the episode so an apparent gap is a true treatment gap, not unobserved person-time.\n  Failure modes: **Medicare Advantage (MA-only) person-time lacks adjudicated FFS pharmacy claims**, so a \"gap\" can be pure\n  missingness — restrict to Part D / commercial pharmacy-benefit enrollment and exclude MA-only spans before applying the\n  rule. 
90-day mail-order and 100-day supplies inflate `days_supply`; samples, $4 generics, and cash purchases are invisible\n  and manufacture false gaps; same-day duplicate/reversed claims and partial fills distort run-out dates (de-duplicate and\n  net out reversals first); claims adjudication lag at the end of the data window truncates the last episode.\n- **EHR:** Exposure is the *order* or *medication-administration record*, not a dispensing, so a true grace period is hard\n  to define — an active order may persist long after the patient stopped taking the drug, and **external-care leakage**\n  (fills at pharmacies outside the system) creates phantom gaps. Prefer linkage to pharmacy claims to anchor `days_supply`;\n  if relying on orders, treat the gap rule as an order-duration assumption and report its sensitivity.\n- **Registry / linked:** Registries adjudicate persistence/discontinuation clinically but rarely capture every fill; link\n  to claims for the dispensing history and to a death index so that the last fill before death is not misread as\n  discontinuation. In **elderly claims cohorts, death is a competing risk for discontinuation**: if one arm has higher\n  mortality, naive persistence comparisons are biased unless death is handled (competing-risks or censoring at death with a\n  stated estimand). Linkage adds selection (linkable subset only) and order/fill/service-date discrepancies that must be\n  reconciled before episodes are built.\n\n**Worked claims example.** Question: 12-month persistence on a once-daily oral DPP-4 inhibitor in a commercial + Medicare\nFFS database, with the grace period as the key tuning parameter. (1) Eligibility: ≥365 days continuous medical + pharmacy\n(Part A/B/D or commercial) enrollment before the first fill; exclude MA-only person-time so gaps are observed. (2) Index =\nfirst DPP-4 fill (`fill_date`); de-duplicate same-day claims and net out reversals so each fill has one clean\n`days_supply`. 
(3) Build the covered window: each fill covers `[fill_date, fill_date + days_supply)`; apply a stockpiling\nrule that shifts a refill's start to the prior run-out date when it arrives early (cap carry-over at, e.g., 30 days to\navoid unbounded banking). (4) Apply the grace period: walk fills in order; if `next_fill_date - prior_run_out_date <=\ngrace`, the episode continues; otherwise close the episode and set the **discontinuation date = prior_run_out_date** (some\nprotocols add grace/2). (5) Primary grace = 30 days; pre-specified sensitivity at 15, 60, and 90 days, plus a\nproportional rule (grace = 0.5 × `days_supply`) so the 30-day-retail and 90-day-mail subgroups are treated comparably. (6)\nPersistence = time from index to discontinuation; censor at disenrollment, **death** (competing risk — report both\ncause-specific and Aalen-Johansen cumulative incidence), and end of data. Diagnostics: distribution of constructed episode\nlengths and gap lengths by grace value, % discontinued at each grace value, and the gap-length-vs-episode-count curve\n(per Gardarsdottir) to show the result's sensitivity to the rule.",
    "primary_category": "Exposure_Definition",
    "tags": [
      "exposure_definition",
      "grace-period",
      "permissible-gap",
      "treatment-episode-construction",
      "persistence",
      "days-supply",
      "exposure-misclassification",
      "stockpiling"
    ],
    "applies_to_study_types": [
      "claims_analysis",
      "ehr_study",
      "registry_linkage",
      "target_trial_emulation",
      "comparative_effectiveness"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1016/j.jclinepi.2009.07.001",
        "url": "https://doi.org/10.1016/j.jclinepi.2009.07.001",
        "citation_text": "Gardarsdottir H, Souverein PC, Egberts TCG, Heerdink ER. Construction of drug treatment episodes from drug-dispensing histories is influenced by the gap length. Journal of Clinical Epidemiology. 2010;63(4):422-427.",
        "year": 2010,
        "authors_short": "Gardarsdottir et al.",
        "notes": "Empirically demonstrates that the permissible gap length governs the number and duration of constructed treatment episodes; the canonical reference for why the grace-period rule must be pre-specified and varied in sensitivity analysis."
      },
      {
        "role": "explain",
        "doi": "10.1111/j.1524-4733.2007.00213.x",
        "url": "https://doi.org/10.1111/j.1524-4733.2007.00213.x",
        "citation_text": "Cramer JA, Roy A, Burrell A, Fairchild CJ, Fuldeore MJ, Ollendorf DA, Wong PK. Medication compliance and persistence: terminology and definitions. Value in Health. 2008;11(1):44-47.",
        "year": 2008,
        "authors_short": "Cramer et al.",
        "notes": "ISPOR consensus terminology distinguishing persistence (treatment-episode duration, governed by the permissible gap) from compliance, anchoring the vocabulary the grace-period rule operationalizes."
      },
      {
        "role": "explain",
        "doi": "10.1002/pds.1230",
        "url": "https://doi.org/10.1002/pds.1230",
        "citation_text": "Andrade SE, Kahler KH, Frech F, Chan KA. Methods for evaluation of medication adherence and persistence using automated databases. Pharmacoepidemiology and Drug Safety. 2006;15(8):565-574.",
        "year": 2006,
        "authors_short": "Andrade et al.",
        "notes": "Catalogues the database operations (days_supply, gaps, refill timing) used to construct persistence and adherence measures from automated pharmacy data, including the role of allowable gaps."
      },
      {
        "role": "demonstrate",
        "doi": "10.1186/1472-6963-12-155",
        "url": "https://doi.org/10.1186/1472-6963-12-155",
        "citation_text": "Vollmer WM, Xu M, Feldstein A, Smith D, Waterbury A, Rand C. Comparison of pharmacy-based measures of medication adherence. BMC Health Services Research. 2012;12:155.",
        "year": 2012,
        "authors_short": "Vollmer et al.",
        "notes": "Applied comparison showing how gap-handling and refill-stitching choices change pharmacy-based adherence/persistence estimates in the same population."
      }
    ],
    "plain_language_summary": "A grace period (also called a permissible gap) is a pre-set number of days that a researcher allows to pass between the day one prescription fill runs out and the start of the next fill before deciding that a patient has actually stopped taking the drug. When the gap is no longer than the grace period, the treatment episode stays open: the interruption is treated as a brief delay at the pharmacy, not a real stop. When the gap is longer than the grace period, the episode is closed and the patient is recorded as having discontinued treatment at the point their last supply ran out. Choosing the grace period is one of the most consequential decisions in any study that tracks how long patients stay on a drug, because a shorter window labels more patients as quitters while a longer window can make poor adherers look persistent.",
    "key_terms": [
      {
        "term": "fill_date",
        "definition": "The calendar date a patient picked up (or received) a prescription, as recorded in the pharmacy or insurance claims data."
      },
      {
        "term": "days_supply",
        "definition": "The number of days one filled prescription is intended to last — for example, a 30-day fill or a 90-day mail-order supply."
      },
      {
        "term": "run-out date",
        "definition": "The last day a patient is expected to have pills on hand from a single fill, calculated as fill_date plus days_supply minus one day."
      },
      {
        "term": "exposure episode",
        "definition": "A continuous stretch of time during which a patient is treated as being on the drug, built by stitching together consecutive fills that arrive close enough together to satisfy the grace period rule."
      },
      {
        "term": "discontinuation",
        "definition": "The point at which a patient is recorded as having stopped the drug — set at the run-out date of the last fill when a gap exceeds the grace period."
      },
      {
        "term": "sensitivity analysis",
        "definition": "A repeat of the main analysis with alternative grace period values (e.g., 15, 60, and 90 days alongside a 30-day primary) to show whether the study's conclusions depend on which rule was chosen."
      }
    ],
    "worked_example": {
      "scenario": "Patient 2001 is prescribed a once-daily cholesterol-lowering tablet. We want to know how long she stayed on the drug during 2024. She fills the prescription three times. Our grace period is 30 days, meaning: if the next fill arrives within 30 days of the previous fill running out, the episode continues; if the gap exceeds 30 days, we record discontinuation at the prior run-out date and the later fill opens a new, separate episode.",
      "dataset": {
        "caption": "Raw pharmacy claims rows for patient 2001 — exactly what an analyst would see in a claims database.",
        "columns": [
          "person_id",
          "fill_date",
          "drug",
          "days_supply"
        ],
        "rows": [
          [
            2001,
            "2024-01-01",
            "atorvastatin",
            30
          ],
          [
            2001,
            "2024-02-11",
            "atorvastatin",
            30
          ],
          [
            2001,
            "2024-05-01",
            "atorvastatin",
            30
          ]
        ]
      },
      "steps": [
        "Fill A (Jan 1, 30-day supply) runs out on Jan 30. The patient has pills from Jan 1 through Jan 30.",
        "Fill B arrives on Feb 11. The gap from Jan 31 (first pill-free day) to Feb 10 is 11 days — well within the 30-day grace period.",
        "Because 11 days ≤ 30-day grace, the episode stays open. Fill B covers Feb 11 through Mar 11. Episode 1 now spans Jan 1 through Mar 11 (71 calendar days), with 60 days of actual pill coverage and an 11-day gap that was bridged.",
        "Fill C arrives on May 1. The gap from Mar 12 (first pill-free day after Fill B) to Apr 30 is 50 days — longer than the 30-day grace period.",
        "Because 50 days > 30-day grace, Episode 1 is closed. Discontinuation is dated at Mar 11 (the run-out of the last fill in that episode). Fill C opens a new Episode 2 running May 1 through May 30.",
        "Result: Episode 1 ends Mar 11 (71-day span; 60 covered days; 11-day bridged gap). Episode 2 starts May 1 (30 covered days; 50-day gap ended the prior episode)."
      ],
      "result": {
        "label": "Episode 1: Jan 1 – Mar 11 (71-day span, 60 pill-covered days, 11-day gap bridged). Episode 2: May 1 – May 30 (30-day span). Gap of 50 days between Mar 12 and Apr 30 exceeded the 30-day grace and closed Episode 1.",
        "value": {
          "episode_1_span_days": 71,
          "episode_1_pill_covered_days": 60,
          "bridged_gap_days": 11,
          "episode_ending_gap_days": 50,
          "grace_period_days": 30,
          "gap_a_bridged": true,
          "gap_b_bridged": false,
          "episode_2_span_days": 30
        }
      },
      "timeline_spec": {
        "title": "Grace period in action: 30-day rule bridges a short gap, closes episode on a long gap (patient 2001, atorvastatin)",
        "window": {
          "start": "2024-01-01",
          "end": "2024-05-30",
          "label": "Observation window: Jan 1 – May 30, 2024 (151 calendar days)"
        },
        "events": [
          {
            "label": "Fill A",
            "start": "2024-01-01",
            "length_days": 30,
            "quantity": "30 days_supply"
          },
          {
            "label": "Fill B",
            "start": "2024-02-11",
            "length_days": 30,
            "quantity": "30 days_supply"
          },
          {
            "label": "Fill C (new episode)",
            "start": "2024-05-01",
            "length_days": 30,
            "quantity": "30 days_supply"
          }
        ],
        "spans": [
          {
            "kind": "covered",
            "start": "2024-01-01",
            "end": "2024-01-30",
            "label": "Fill A: 30 covered days"
          },
          {
            "kind": "gap",
            "start": "2024-01-31",
            "end": "2024-02-10",
            "label": "Gap A: 11 days ≤ 30-day grace → BRIDGED (episode continues)"
          },
          {
            "kind": "covered",
            "start": "2024-02-11",
            "end": "2024-03-11",
            "label": "Fill B: 30 covered days"
          },
          {
            "kind": "gap",
            "start": "2024-03-12",
            "end": "2024-04-30",
            "label": "Gap B: 50 days > 30-day grace → EPISODE ENDS (discontinuation dated Mar 11)"
          },
          {
            "kind": "covered",
            "start": "2024-05-01",
            "end": "2024-05-30",
            "label": "Fill C: 30 covered days (new Episode 2)"
          }
        ],
        "result": {
          "label": "Episode 1: Jan 1 – Mar 11 (11-day gap bridged). Gap B = 50 days > grace → episode closed. Episode 2: May 1 – May 30.",
          "value": {
            "episode_1_covered_days": 60,
            "episode_1_bridged_gap_days": 11,
            "episode_ending_gap_days": 50,
            "episode_2_covered_days": 30
          }
        },
        "caption": "Two fills stitched into one episode when their gap (11 days) is within the 30-day grace period; a third fill starts a new episode because the gap (50 days) exceeds it. The grace period value is the single dial that decides whether an interruption is a bump in the road or the end of treatment.",
        "alt_text": "Horizontal timeline for patient 2001 showing Fill A (Jan 1–Jan 30, covered), an 11-day bridged gap (Jan 31–Feb 10, shaded light), Fill B (Feb 11–Mar 11, covered), a 50-day episode-ending gap (Mar 12–Apr 30, shaded dark with discontinuation marker at Mar 11), and Fill C opening a new episode (May 1–May 30, covered)."
      }
    },
    "prerequisites": [
      "exposure-episode-construction-rwe",
      "continuous-enrollment-observable-time-rwe",
      "pdc-proportion-of-days-covered"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Fixed absolute grace period",
        "description": "A single gap threshold in days (e.g., 30 or 60 days) applied uniformly after each fill's covered window (optionally after stockpiled carry-over) before the episode is closed and discontinuation is dated.",
        "edge_cases": [
          "Lenient for short-supply (30-day retail) drugs and strict for 90-day mail-order, creating differential misclassification when arms differ in refill cadence.",
          "Sensitive to outlier days_supply values (100-day fills, partial fills) that shift run-out dates."
        ],
        "data_source_notes": "claims: after de-duplicating same-day claims and reversals, derive the first uncovered day as fill_date + days_supply (the last covered day, i.e. the run-out date, is one day earlier); require continuous pharmacy-benefit enrollment so gaps are observed, not MA-only missingness."
      },
      {
        "name": "Proportional (supply-relative) grace period",
        "description": "Grace expressed as a fraction of days_supply (e.g., 50% of the fill's supply), so drugs with longer supplies get proportionally longer slack — reduces differential misclassification across refill cadences.",
        "edge_cases": [
          "Very small days_supply (titration packs, partial fills) yields implausibly short grace; floor the value.",
          "Mixed 30-/90-day supplies within one patient make the effective rule vary fill-to-fill; document the policy."
        ],
        "data_source_notes": "claims/EHR: requires reliable days_supply; for EHR orders without days_supply this rule is not directly computable and must fall back to an order-duration assumption."
      },
      {
        "name": "Stockpiling (carry-over) grace period",
        "description": "An early refill's start is shifted to the prior run-out date, so surplus supply is banked forward before grace is applied and the gap clock starts only after accumulated supply is exhausted (carry-over is usually capped to bound stockpiling).",
        "edge_cases": [
          "Unbounded carry-over makes nearly everyone look persistent; cap accumulated surplus (e.g., 30-90 days).",
          "Inpatient stays or hospice during a covered window inflate apparent oversupply; consider pausing the clock."
        ],
        "data_source_notes": "claims: sort fills by fill_date and shift each start to max(fill_date, prior_run_out); cap the banked surplus and reset at episode boundaries."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "No grace period (strict run-out = discontinuation)",
        "pros_of_this": "Avoids classifying routine refill-timing variation (weekend pickups, mail-order lag, samples) as discontinuation; produces realistic persistence and on-treatment person-time.",
        "cons_of_this": "Lengthens on-treatment time and can mask genuinely early stopping; over-attributes exposed person-time in safety analyses.",
        "when_to_prefer": "Persistence and as-treated analyses; use strict run-out only for conservative safety signal detection where over-attributing exposure is the costlier error."
      },
      {
        "compared_to": "Fixed default gap applied uniformly across drugs",
        "pros_of_this": "A calibrated or proportional grace reduces differential exposure misclassification between arms with different refill cadence (90-day mail-order vs 30-day retail) and half-lives.",
        "cons_of_this": "Requires per-drug justification and more programming; harder to communicate than a single number.",
        "when_to_prefer": "Comparative or regulatory studies where the two arms differ in supply patterns or formulation."
      },
      {
        "compared_to": "As-treated analysis with informative-censoring weights",
        "pros_of_this": "A grace period with simple episode-end censoring is transparent and easy to defend.",
        "cons_of_this": "Ignores that discontinuation is rarely random; differential discontinuation by arm biases naive as-treated estimates regardless of the grace value chosen.",
        "when_to_prefer": "When discontinuation is uncommon or non-differential; otherwise add IPCW or use an ITT/first-line contrast."
      }
    ],
    "implementation_notes_by_data_source": {},
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\n\nGRACE_DAYS = 30      # permissible gap after run-out before the episode is closed\nCARRYOVER_CAP = 30   # max banked surplus from early refills (set 0 to disable stockpiling)\n\ndef build_episodes(rx: pd.DataFrame) -> pd.DataFrame:\n    rx = rx.sort_values([\"person_id\", \"fill_date\"]).copy()\n    out = []\n    for pid, g in rx.groupby(\"person_id\", sort=False):\n        ep_start = run_out = None\n        n_fills = 0\n        for fd, ds in zip(g[\"fill_date\"], g[\"days_supply\"]):\n            if ep_start is None:\n                ep_start, run_out, n_fills = fd, fd + pd.Timedelta(days=int(ds)), 1\n                continue\n            gap = (fd - run_out).days\n            if gap <= GRACE_DAYS:\n                # continue episode; stockpile early supply forward, capped to bound banking\n                surplus = max(0, -gap)\n                start = run_out if surplus > 0 else fd\n                extra = min(surplus, CARRYOVER_CAP)\n                run_out = max(run_out, start) + pd.Timedelta(days=int(ds) - (surplus - extra))\n                n_fills += 1\n            else:\n                out.append((pid, ep_start, run_out, run_out, n_fills))  # discontinuation = run_out\n                ep_start, run_out, n_fills = fd, fd + pd.Timedelta(days=int(ds)), 1\n        if ep_start is not None:\n            out.append((pid, ep_start, run_out, run_out, n_fills))\n    return pd.DataFrame(out, columns=[\"person_id\", \"episode_start\", \"run_out\",\n                                      \"discontinuation_date\", \"n_fills\"])",
        "description": "Build exposure episodes from claims-style pharmacy fills using a grace period with optional stockpiling carry-over.\nRequired input (cleaned, de-duplicated, reversals netted out):\n  rx : person_id, fill_date (datetime64), days_supply (int)   # one row per dispensing\nReturns one row per exposure episode: person_id, episode_start, run_out, discontinuation_date, n_fills. In this code\nrun_out = fill_date + days_supply is the exclusive end of coverage (the first uncovered day), so the gap equals the number\nof uncovered days; subtract one day to report the last covered day. Discontinuation is dated at this run_out of the last\nfill in the episode; vary GRACE_DAYS in sensitivity analysis (Gardarsdottir).\nOutcomes/person-time must classify the [start, discontinuation) window as exposed; do not retro-assign the grace window.",
        "dependencies": [
          "pandas"
        ],
        "source_citations": [
          "gardarsdottir-2010"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\nGRACE_DAYS <- 30L\nCARRYOVER_CAP <- 30L\n\nbuild_episodes <- function(rx) {\n  setDT(rx); setorder(rx, person_id, fill_date)\n  rx[, {\n    ep_start <- run_out <- NA; nfill <- 0L\n    res <- list()\n    for (i in seq_len(.N)) {\n      fd <- fill_date[i]; ds <- as.integer(days_supply[i])\n      if (is.na(ep_start)) { ep_start <- fd; run_out <- fd + ds; nfill <- 1L; next }\n      gap <- as.integer(fd - run_out)\n      if (gap <= GRACE_DAYS) {\n        surplus <- max(0L, -gap)\n        start   <- if (surplus > 0L) run_out else fd\n        extra   <- min(surplus, CARRYOVER_CAP)\n        run_out <- max(run_out, start) + (ds - (surplus - extra))\n        nfill   <- nfill + 1L\n      } else {\n        res[[length(res) + 1L]] <- list(ep_start, run_out, run_out, nfill)\n        ep_start <- fd; run_out <- fd + ds; nfill <- 1L\n      }\n    }\n    res[[length(res) + 1L]] <- list(ep_start, run_out, run_out, nfill)\n    rbindlist(lapply(res, function(x) data.table(episode_start = x[[1]],\n      run_out = x[[2]], discontinuation_date = x[[3]], n_fills = x[[4]])))\n  }, by = person_id]\n}",
        "description": "Episode construction with a permissible-gap rule using data.table. Input mirrors the Python version:\n  rx : person_id, fill_date (Date), days_supply (integer)   # one row per dispensing, cleaned/de-duplicated\nReturns one episode per row with discontinuation dated at the last fill's run_out, where run_out = fill_date + days_supply\nis the first uncovered day (subtract one day for the last covered day). Note that setDT/setorder modify rx by reference.\nGRACE_DAYS is the tuning parameter to vary in sensitivity analysis; CARRYOVER_CAP bounds stockpiling of early refills.",
        "dependencies": [
          "data.table"
        ],
        "source_citations": [
          "gardarsdottir-2010"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "%let grace = 30;   /* permissible gap in days */\n%let cap   = 30;   /* max banked surplus from early refills */\n\nproc sort data=work.rx; by person_id fill_date; run;\n\ndata work.episodes(keep=person_id episode_start run_out discontinuation_date n_fills);\n  set work.rx; by person_id fill_date;\n  format episode_start run_out discontinuation_date date9.;\n  retain episode_start run_out n_fills;\n\n  if first.person_id then do;\n    episode_start = fill_date; run_out = fill_date + days_supply; n_fills = 1;\n  end;\n  else do;\n    gap = fill_date - run_out;\n    if gap <= &grace then do;                 /* continue episode */\n      surplus = max(0, -gap);\n      start   = ifn(surplus > 0, run_out, fill_date);\n      extra   = min(surplus, &cap);\n      run_out = max(run_out, start) + (days_supply - (surplus - extra));\n      n_fills + 1;\n    end;\n    else do;                                  /* gap exceeds grace: close prior episode */\n      discontinuation_date = run_out; output;\n      episode_start = fill_date; run_out = fill_date + days_supply; n_fills = 1;\n    end;\n  end;\n\n  if last.person_id then do;                  /* flush final episode */\n    discontinuation_date = run_out; output;\n  end;\nrun;",
        "description": "Episode construction with a permissible-gap rule in a single sorted data step. Required input (post data-management,\nde-duplicated, reversals removed):\n  work.rx : person_id, fill_date (SAS date), days_supply (numeric)   # one row per dispensing\nOutput work.episodes has one row per episode with discontinuation_date = run_out of the last fill, where run_out =\nfill_date + days_supply is the first uncovered day (subtract one day for the last covered day). Change &grace\nto run the sensitivity analysis; &cap bounds stockpiling carry-over from early refills.",
        "dependencies": [],
        "source_citations": [
          "gardarsdottir-2010"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": "grace-period-gap-rules-rwe-timeline.svg",
        "mermaid": null,
        "caption": "Two fills stitched into one episode when their gap (11 days) is within the 30-day grace period; a third fill starts a new episode because the gap (50 days) exceeds it. The grace period value is the single dial that decides whether an interruption is a bump in the road or the end of treatment.",
        "alt_text": "Horizontal timeline for patient 2001 showing Fill A (Jan 1–Jan 30, covered), an 11-day bridged gap (Jan 31–Feb 10, shaded light), Fill B (Feb 11–Mar 11, covered), a 50-day episode-ending gap (Mar 12–Apr 30, shaded dark with discontinuation marker at Mar 11), and Fill C opening a new episode (May 1–May 30, covered).",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  F[Pharmacy fills: person_id, fill_date, days_supply] --> C[Clean: de-duplicate same-day claims, net out reversals]\n  C --> W[Covered window per fill = fill_date .. fill_date + days_supply]\n  W --> S{Next fill before run_out?}\n  S -- Yes --> SP[Stockpile: shift start to run_out, cap carry-over] --> W\n  S -- No --> G{Gap = next_fill - run_out <= grace?}\n  G -- Yes --> CONT[Continue episode] --> W\n  G -- No --> CLOSE[Close episode; discontinuation = run_out]\n  CLOSE --> NEW[Next fill starts a NEW episode]\n  CLOSE --> SENS[Sensitivity: vary grace 15/30/60/90 and proportional rule]",
        "caption": "Decision logic that turns discrete fills into exposure episodes. The grace period is the single switch that decides whether an interruption is bridged (episode continues) or closes the episode and dates discontinuation.",
        "alt_text": "Flowchart from pharmacy fills through cleaning, covered-window construction, stockpiling, the grace-period gap test, and episode closure with a sensitivity-analysis branch.",
        "source_type": "illustrative",
        "source_citations": [
          "gardarsdottir-2010"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "gantt\n  title Grace period decides discontinuation date (one patient, 30-day grace)\n  dateFormat YYYY-MM-DD\n  axisFormat %b\n  section Fill 1\n  Covered (30-day supply) :done, f1, 2024-01-01, 30d\n  section Gap A (12d <= grace)\n  Bridged - episode continues :active, ga, 2024-01-31, 12d\n  section Fill 2\n  Covered (30-day supply) :done, f2, 2024-02-12, 30d\n  section Gap B (45d > grace)\n  Exceeds grace - discontinuation dated at run-out :crit, gb, 2024-03-13, 1d\n  section Re-initiation\n  New episode after long gap :f3, 2024-04-27, 30d",
        "caption": "A 12-day gap (<=30-day grace) is bridged and the episode continues; a 45-day gap (>grace) closes the episode and dates discontinuation at the prior run-out, with the later return treated as a new episode.",
        "alt_text": "Gantt timeline showing two covered fill windows, a short bridged gap, a long gap exceeding the grace period that triggers discontinuation, and a re-initiation episode.",
        "source_type": "illustrative",
        "source_citations": [
          "gardarsdottir-2010"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "is_variant_of",
        "target_slug": "exposure-episode-construction-rwe",
        "notes": "The grace-period (permissible-gap) rule is the core parameter of exposure-episode construction; it determines where one episode ends and the next begins."
      },
      {
        "relation_type": "used_with",
        "target_slug": "pdc-proportion-of-days-covered",
        "notes": "PDC's covered-days numerator and observation-window denominator depend on the episode boundaries set by the grace period and stockpiling rule."
      },
      {
        "relation_type": "used_with",
        "target_slug": "mpr-medication-possession-ratio",
        "notes": "MPR is computed over an exposure interval whose start and end are fixed by the grace-period/episode rules."
      },
      {
        "relation_type": "affects",
        "target_slug": "persistence-time-to-discontinuation",
        "notes": "The discontinuation date — and thus time-to-discontinuation — is defined by the grace period, so a longer grace yields longer persistence and fewer discontinuation events."
      },
      {
        "relation_type": "see_also",
        "target_slug": "restart-rechallenge-new-episode-rwe",
        "notes": "A separate, usually longer, gap threshold decides whether a return after a long gap is a new episode (re-initiation) rather than a continuation bridged by the grace period."
      },
      {
        "relation_type": "see_also",
        "target_slug": "inpatient-bridging-exposure-rwe",
        "notes": "Inpatient stays interrupt outpatient dispensing; bridging rules pause or extend the grace-period clock so hospitalizations are not misread as treatment gaps."
      },
      {
        "relation_type": "affects",
        "target_slug": "immortal-time-bias-handling",
        "notes": "Classifying outcomes occurring inside the grace window as unexposed, or retro-assigning exposure based on whether a refill later appeared, induces immortal-time-style misclassification."
      }
    ],
    "aliases": [
      "grace period",
      "permissible gap",
      "allowable gap",
      "gap rule",
      "treatment-episode gap",
      "refill gap"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta",
      "pqa-cms"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "group-based-trajectory-models-lca",
    "name": "Group-Based Trajectory Models and Latent Class Analysis",
    "short_definition": "Finite mixture methods that represent a heterogeneous population as a blend of k latent subgroups with distinct longitudinal patterns (GBTM) or cross-sectional covariate profiles (LCA); used in RWE to identify discrete adherence phenotypes, multimorbidity clusters, and treatment-pattern subgroups that a single summary statistic cannot reveal.",
    "long_description": "**What GBTM and LCA actually are**\n\nGroup-based trajectory modeling (GBTM) and latent class analysis (LCA) are finite mixture\nmethods that represent a heterogeneous population as a blend of k latent subpopulations,\neach with its own distinct pattern. GBTM, developed by Nagin, is specifically designed for\nrepeated-measures longitudinal outcomes: it estimates a small number of trajectory groups —\neach described by a polynomial or logistic curve over time — and assigns each patient to the\ngroup whose trajectory best matches their observed sequence. LCA is the cross-sectional\nanalogue: it identifies clusters of patients based on profiles of binary or categorical\nindicators measured at a single point in time (for example, presence or absence of ten\ncomorbidity flags from claims). Both models belong to the same family — finite mixture\nmodels — and share the same core idea: the population is not homogeneous, and summarizing\nit with a single mean discards the heterogeneity that may be the most policy-relevant\nfeature of the data.\n\n**The finite mixture idea**\n\nA finite mixture model with k components posits that each observation is drawn from one of\nk distributions, each weighted by a class-membership probability (the mixing proportion\npi_k, with the sum of all pi_k equal to 1). Neither the class membership nor the\nclass-specific trajectories are observed — both are estimated from the data simultaneously\nby maximum likelihood or Bayesian inference. The likelihood function sums the probability\nof the observed data under each class, weighted by its mixing proportion. Estimation\ntypically uses the EM algorithm: the E-step assigns soft class memberships (posterior\nprobabilities), and the M-step updates the class-specific parameters and mixing\nproportions. 
GBTM applies this framework to repeated measures of a continuous, count, or binary\noutcome: the canonical software, PROC TRAJ in SAS, offers censored-normal, Poisson,\nzero-inflated Poisson, and logistic (binary) outcome models, chosen by outcome type. LCA\napplies the same framework to cross-sectional multivariate categorical data.\n\n**GBTM for medication adherence: richer than a single PDC cutoff**\n\nThe Franklin et al. (2013) argument for GBTM over a single PDC scalar is direct: a PDC\nthreshold of 0.80 collapses the entire longitudinal adherence trajectory to a pass/fail,\ndiscarding when the patient was adherent, whether they restarted after a gap, and how\nsteeply adherence declined over time. GBTM instead identifies trajectory shapes, for\nexample four groups in a statin cohort: consistent adherers (high flat PDC throughout the\nyear), slow decliners (high early but falling month by month), rapid discontinuers (high\nat the index date then abrupt drop at month two or three), and intermittent users (cycling\non and off with repeated gaps). These groups differ not only in their 12-month PDC totals\nbut in the pattern of exposure, which can drive very different clinical outcomes and is\ninvisible to the scalar. For study design, this means that conditioning a\ndownstream outcome analysis on trajectory-group membership can recover treatment-effect\nheterogeneity across adherence phenotypes that a single PDC-adjusted regression cannot.\n\n**LCA for cross-sectional latent subtypes**\n\nLCA applies the same finite mixture framework to a patient's profile of binary claim flags\nmeasured at baseline, for example ten comorbidity indicators from the Charlson index,\ndrug-class exposure flags, and care-utilization categories. The model asks: what is the\nsmallest number of latent patient types that can explain the observed co-occurrence\npatterns among these flags?\n
A four-class solution might identify a metabolic-syndrome\ncluster (diabetes, hypertension, dyslipidemia co-occurring), a cardiovascular disease\ncluster (prior MI, stroke, heart failure), a low-burden cluster (few comorbidities), and\nan oncology cluster (cancer, anemia, opioid use). This LCA-derived phenotype may be a\nmore powerful confounder-adjustment variable than any single summary score, because it\ncaptures the joint distributional structure of comorbidities rather than a weighted sum.\n\n**Model selection: BIC, entropy, minimum class size, and the reification trap**\n\nChoosing k is the most consequential analytic decision in any mixture model. The standard\ntoolkit: (1) Bayesian Information Criterion (BIC): fit models with k = 2, 3, 4, and more\nclasses and pick the k where BIC stops improving. Lower is better under the usual\nconvention based on minus twice the log-likelihood, whereas PROC TRAJ reports BIC on the\nlog-likelihood scale, where values closer to zero are better; in either case look for an\nelbow rather than chasing small gains at ever-larger k. (2) Entropy: a summary of how\ncleanly the model separates classes (ranges 0 to 1; values at or above 0.80 indicate\ndistinct, well-separated groups). Low entropy means posterior probabilities are diffuse\n(every patient has moderate probability of belonging to multiple classes), which undermines\nthe interpretability of any class-specific estimate. (3) Minimum class size: classes\ncontaining fewer than five percent of the sample are often artifacts of overfitting rather\nthan scientifically credible groups; in a cohort of 1000, a class of 15 patients warrants\nskepticism. (4) Interpretability: does the extra class tell a new clinical story, or does\nit merely split a prior class into near-identical twins? The best-fitting model by BIC may\nnot be the most scientifically useful.\n\nThe reification trap is the central intellectual hazard of mixture modeling.\n
Because\nPROC TRAJ or poLCA will always produce a clean k-class solution with interpretable labels,\nit is tempting to conclude that the classes are real, discovered entities — biological\nsubtypes, clinical phenotypes, or adherence archetypes that exist independent of the\nmodel. They are not. The classes are the model's best finite approximation to a continuous\ndistribution of heterogeneity, chosen jointly by the EM algorithm and the analyst's choice\nof k, link function, and indicator set. A different dataset, a different gap definition,\nor a different analyst's choice of k will produce different classes. Explicit\nacknowledgment of this uncertainty is not optional in any rigorous manuscript: the classes\nare model constructs, not discovered diseases.\n\n**Posterior class probabilities and classify-analyze bias**\n\nAfter the EM algorithm converges, each patient has a vector of posterior probabilities —\nthe estimated probability of belonging to each class given their observed trajectory. The\ntempting shortcut is modal assignment: assign each patient to their most probable class\nand then treat class membership as a known, perfectly measured variable in a downstream\nregression. This introduces classify-analyze bias: it ignores the uncertainty in class\nmembership and typically underestimates standard errors and produces biased class-specific\nestimates. The theoretically correct alternative is the three-step approach (Vermunt 2010;\nBakk and Vermunt 2016): first, fit the mixture model; second, assign classes by modal\nassignment; third, correct the downstream regression for classification error using the\naverage posterior probabilities as a misclassification matrix. 
In practice, the three-step\ncorrection materially changes estimates when entropy is low (below 0.70); when entropy\nexceeds 0.90, modal assignment is approximately unbiased.\n\n**Growth mixture models vs GBTM**\n\nThe growth mixture model (GMM) adds random effects within each class, allowing\nindividual-level deviations around the class-average trajectory, whereas GBTM constrains\neach class to a fixed (deterministic) trajectory shape with no within-class residual\nvariation beyond measurement error. GMMs are far more difficult to identify (non-convergence\nand class-collapse are common) and require substantially larger samples; GBTM's simpler\nfixed-trajectory-per-class assumption often produces more stable and reproducible results at\ntypical RWE sample sizes of 500 to 5000 patients.\n\n**Downstream use: trajectory group as exposure and the immortal-time hazard**\n\nA common downstream use is to treat the trajectory class as an exposure variable in a Cox\nor logistic model, for example asking whether rapid discontinuers have higher\nhospitalization risk than consistent adherers. This creates a latent immortal-time\nproblem: the trajectory class is defined using outcome-free follow-up during the trajectory\nwindow (for example, months 1 through 12 post-index). If a patient dies or is hospitalized\nin month 6, they cannot complete a 12-month trajectory; survivors who complete it are\ninherently selected to be alive and event-free for the duration. Conditioning on\ntrajectory-class membership therefore implicitly conditions on the future, biasing any\nhazard estimate for events that occur during the trajectory window.\n
The only clean solution is to end the\ntrajectory window before the outcome window begins (for example, months 1 through 6 for\ntrajectory definition, then months 7 through 24 for outcome follow-up), ensuring that\ntrajectory-group membership is fully determined before the outcome period starts.\n\n**Pros, cons, and trade-offs**\n\nGBTM and LCA pros: they reveal heterogeneity invisible to single-summary statistics; class\nlabels are interpretable to clinical and health-economics audiences; they handle binary,\ncount, and censored-normal longitudinal outcomes (GBTM) or multivariate binary\ncross-sectional profiles (LCA); and they integrate naturally with downstream\neffect-modification analyses when trajectory group is an effect modifier.\n\nCons: classes are model constructs, so results depend on k, link function, indicator\nselection, and sample; the reification trap is endemic; modal assignment ignores\nclassification uncertainty; Python has no widely adopted GBTM implementation (scikit-learn's\nGaussianMixture approximates LCA-style clustering on cross-sectional data but does not model\nlongitudinal trajectories); convergence failures are common at high k values.\n\nTrade-offs vs a single PDC scalar: GBTM recovers trajectory shape at the cost of model\ndependence; a PDC threshold of 0.80 is reproducible, regulation-endorsed (CMS/PQA), and\ntransparent but collapses heterogeneity. Both should be reported when the research question\nspans how much and what pattern.\n\nTrade-offs vs mixed-effects models: mixed-effects models treat adherence heterogeneity as a\ncontinuous random effect (each patient has their own random intercept and slope), whereas\nGBTM imposes a discrete class structure.\n
Mixed-effects models are the appropriate primary\nmodel when the goal is a population-average or subject-specific trajectory mean; GBTM is\nthe appropriate model when the goal is to identify and label discrete subpopulations with\ndistinct behavioral phenotypes.\n\n**When to use**\n\nUse GBTM when the research question concerns the distribution of longitudinal adherence\npatterns (rather than a single mean), when defining adherence phenotypes as effect\nmodifiers or confounders in a downstream outcome regression (provided the immortal-time\nhazard is avoided by pre-specifying non-overlapping trajectory and outcome windows), and\nwhen payer or manufacturer audiences require patient segmentation with interpretable labels.\nUse LCA when the goal is to identify multimorbidity phenotypes from claims or EHR\ncomorbidity flags as a supplement to Charlson or Elixhauser summary scores, or when\nbinary-indicator co-occurrence patterns are the target of description rather than a\nweighted sum.\n\n**When NOT to use**\n\nDo not use GBTM as the primary adherence endpoint in a regulatory submission or payer\ndossier where a reproducible PDC rate is required. Do not allow k to be chosen post-hoc\nto make the result look clinically meaningful; the number of classes must be pre-specified\nor determined by objective criteria (BIC, entropy, and minimum class size) with the final\nchoice locked before examining downstream outcomes. Do not report low-entropy solutions\n(entropy below 0.70) as if classes were cleanly distinct without applying three-step\ncorrections. Do not use GBTM as the only analysis of heterogeneity when a continuous\ntreatment-effect modifier (for example, continuous PDC as an interaction term) would answer\nthe same question with fewer assumptions.\n
Critically, do not define trajectory groups over\na window that overlaps the outcome window; classes defined by future data create implicit\nimmortal time and biased hazard estimates in any downstream survival analysis.\n\n**Interpreting the output**\n\nIn the worked example, a 3-class GBTM fit to 1000 statin users returns three trajectory\ngroups: consistent adherers (n = 500, mean 12-month PDC = 0.90), moderate adherers\n(n = 300, mean PDC = 0.60), and rapid discontinuers (n = 200, mean PDC = 0.20).\n\n(1) Formal interpretation. The model-implied overall population mean PDC is the\nclass-share-weighted average: (0.5 * 0.9) + (0.3 * 0.6) + (0.2 * 0.2) = 0.45 + 0.18 + 0.04\n= 0.67. This matches a naive cohort-level PDC of 0.67, but the GBTM reveals that this\nsingle number masks a trimodal distribution. The consistent-adherer class (50% of\npatients) drives a PDC cluster centered near 0.90, well above the conventional 0.80\nadherence threshold, while the rapid-discontinuer class (20%) drives a cluster near 0.20.\nThe moderate-adherer class (30%) sits below the 0.80 threshold at a mean of 0.60.\nClass membership uncertainty is quantified by the entropy statistic: if entropy is at or\nabove 0.80, modal class assignments are approximately valid for downstream analyses; if\nentropy is below 0.70, three-step corrections should be applied before class membership\nis used as a covariate.\n\n(2) Practical interpretation. A single average PDC of 0.67 for the cohort would suggest\nthe population is modestly non-adherent as a group. The GBTM tells a more nuanced story\nfor a payer or formulary decision-maker: half the cohort is highly adherent (mean PDC\nnear 0.90) and would likely benefit from continued access to the drug; one in five patients\nstopped within the first few months and represents an opportunity for targeted\nadherence-support intervention rather than access restriction.\n
Reporting only the 0.67 average\nobscures both the high-adherers who should not face access barriers and the rapid\ndiscontinuers who need a different kind of support. The trajectory classes, despite being\nmodel constructs rather than biological discoveries, provide a clinically actionable\nsegmentation that a scalar cannot.",
    "primary_category": "Inferential_Statistics",
    "tags": [
      "trajectory-analysis",
      "latent-class",
      "adherence",
      "mixture-model",
      "subgroup-analysis",
      "longitudinal",
      "heterogeneity",
      "pdc",
      "pharmacoepidemiology",
      "clustering"
    ],
    "applies_to_study_types": [
      "cohort_retrospective",
      "cohort_prospective",
      "claims_analysis",
      "ehr_study",
      "registry_study",
      "descriptive_analysis"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1146/annurev.clinpsy.121208.131413",
        "url": "https://doi.org/10.1146/annurev.clinpsy.121208.131413",
        "citation_text": "Nagin DS, Odgers CL. Group-based trajectory modeling in clinical research. Annual Review of Clinical Psychology. 2010;6:109-138.",
        "year": 2010,
        "authors_short": "Nagin & Odgers",
        "notes": "Foundational review of GBTM methodology and applications in clinical and behavioral research; covers model selection (BIC, entropy, minimum class size), the reification trap, posterior probability interpretation, and the distinction between GBTM and growth mixture models. Essential reading before applying PROC TRAJ or lcmm in RWE."
      },
      {
        "role": "demonstrate",
        "doi": "10.1097/MLR.0b013e3182984c1f",
        "url": "https://doi.org/10.1097/MLR.0b013e3182984c1f",
        "citation_text": "Franklin JM, Shrank WH, Pakes J, et al. Group-based trajectory models: a new approach to classifying and predicting long-term medication adherence. Medical Care. 2013;51(9):789-796.",
        "year": 2013,
        "authors_short": "Franklin et al.",
        "notes": "Landmark application of GBTM to medication adherence in claims data; directly compares GBTM trajectory classes against single PDC scalar and demonstrates the additional predictive value of trajectory shape for downstream clinical outcomes. The core methodological justification for preferring GBTM over PDC in adherence phenotyping."
      }
    ],
    "plain_language_summary": "Group-based trajectory models and latent class analysis are statistical tools that look at a group of patients and automatically discover hidden subgroups — clusters of people who follow the same pattern over time or share the same combination of health conditions. Instead of summarizing everyone's medication use as one average number, these models reveal that the average may mask very different groups: for example, half the patients taking their statin consistently, 30% tapering off month by month, and 20% stopping within weeks of starting. The number of clusters is chosen by comparing how well models with two, three, or four groups fit the data, using a scoring rule called BIC. The key honesty caveat is that the groups are mathematical constructs produced by the model, not biological disease subtypes discovered in nature — two analysts with slightly different choices may identify different groups from the same patients.",
    "key_terms": [
      {
        "term": "latent class",
        "definition": "A hidden subgroup that the model infers from patterns in the data but that is never directly observed — patients are not labeled by class in the raw claims file; the model assigns them based on their observed pattern."
      },
      {
        "term": "trajectory group",
        "definition": "One of the k subpopulations in a GBTM, each described by its own average time-path of medication use (for example, a group that starts high and declines steeply each month)."
      },
      {
        "term": "BIC (Bayesian Information Criterion)",
        "definition": "A score that balances how well a model fits the data against how complex the model is, used to choose the number of classes k — lower BIC favors the better model, and analysts look for the k where the BIC stops improving."
      },
      {
        "term": "posterior probability",
        "definition": "The model's estimated probability that a given patient belongs to each class, computed after fitting the model — a patient might have a 0.80 probability of being a consistent adherer and a 0.20 probability of being a moderate adherer."
      },
      {
        "term": "reification trap",
        "definition": "The mistake of treating a model's latent classes as if they are real, discovered biological entities; in fact, they are mathematical approximations that depend on the analyst's choices and will change if the data or model specification changes."
      },
      {
        "term": "entropy",
        "definition": "A summary statistic ranging from 0 to 1 that measures how cleanly the model separates patients into classes — values at or above 0.80 mean most patients are assigned to one class with high confidence, while low values mean assignments are uncertain."
      }
    ],
    "worked_example": {
      "scenario": "A pharmacoepidemiologist fits a 3-class GBTM to 1000 new statin users followed for 12 months after their first fill. Each patient's 12 monthly binary adherence indicators (yes/no for having statin supply on hand that month) are the input. The model returns three trajectory groups with estimated class proportions of 50%, 30%, and 20% and class-specific mean 12-month PDC values of 0.90, 0.60, and 0.20 respectively. The analyst wants to verify that the class-specific means reproduce the cohort-level mean PDC observed in the claims data.",
      "dataset": {
        "caption": "Class-level summary from the fitted 3-class GBTM (1000 statin users, first 12 months post-index). Each row is one trajectory class; patient counts and mean PDC are model-derived.",
        "columns": [
          "class_label",
          "n_patients",
          "class_share",
          "mean_pdc_12mo"
        ],
        "rows": [
          [
            "consistent_adherers",
            500,
            0.5,
            0.9
          ],
          [
            "moderate_adherers",
            300,
            0.3,
            0.6
          ],
          [
            "rapid_discontinuers",
            200,
            0.2,
            0.2
          ]
        ]
      },
      "steps": [
        "Convert class shares to patient counts. Consistent adherers: 0.5 * 1000 = 500 patients. Moderate adherers: 0.3 * 1000 = 300 patients. Rapid discontinuers: 0.2 * 1000 = 200 patients. Check: 500 + 300 + 200 = 1000 (total cohort size confirmed).",
        "Compute each class's contribution to the overall weighted mean PDC. Consistent-adherer contribution: 0.5 * 0.9 = 0.45. Moderate-adherer contribution: 0.3 * 0.6 = 0.18. Rapid-discontinuer contribution: 0.2 * 0.2 = 0.04.",
        "Sum the three contributions to get the overall weighted mean PDC: 0.45 + 0.18 + 0.04 = 0.67.",
        "Interpretation: a naive cohort-level PDC of 0.67 is below the conventional 0.80 adherence threshold, which would suggest the cohort is non-adherent as a whole. The GBTM reveals that this single number hides a trimodal reality: 500 patients (50%) have a mean PDC near 0.90 (above the threshold), 300 patients (30%) have a mean PDC of 0.60 (below it), and 200 patients (20%) have a mean PDC of 0.20 (far below it). These three groups face very different clinical trajectories and warrant different intervention strategies."
      ],
      "result": "Class sizes: 0.5 * 1000 = 500 consistent adherers, 0.3 * 1000 = 300 moderate adherers, 0.2 * 1000 = 200 rapid discontinuers. Weighted contributions: 0.5 * 0.9 = 0.45, 0.3 * 0.6 = 0.18, 0.2 * 0.2 = 0.04. Overall weighted mean PDC: 0.45 + 0.18 + 0.04 = 0.67. The cohort-level mean of 0.67 obscures a trimodal adherence distribution that the three trajectory classes make explicit."
    },
    "prerequisites": [
      "pdc-proportion-of-days-covered",
      "mixed-effects-models-longitudinal-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "GBTM for binary monthly adherence indicators (logistic link)",
        "description": "The most common RWE application: each patient contributes a sequence of monthly 0/1 adherence flags (1 if any supply on hand that month, 0 otherwise) and PROC TRAJ fits a logistic trajectory curve per class. This is the Franklin et al. (2013) design. Trajectory groups are characterized by their logistic intercept and slope — a flat high-probability curve (consistent adherers), a steeply declining curve (rapid discontinuers), or an intermediate declining curve (slow decliners). In lcmm, the hlme function with a binomial family handles binary outcomes.",
        "edge_cases": [
          "Short follow-up (fewer than 6 months) limits the ability to distinguish declining from flat trajectories; at minimum 9 to 12 monthly observations are recommended.",
          "Informative censoring (patients who discontinue enrollment are systematically different from those who remain) biases trajectory estimates; sensitivity analyses with inverse-probability-of-censoring weights are appropriate."
        ],
        "data_source_notes": "Claims: construct a monthly binary coverage flag from fill_date and days_supply using the same PDC daily-coverage logic; each patient contributes 12 rows (one per month) as the input to PROC TRAJ or lcmm. Align the start of the trajectory window to the index fill, not the enrollment date."
      },
      {
        "name": "LCA for cross-sectional multimorbidity phenotyping",
        "description": "Fit LCA to a patient-level binary matrix of k comorbidity flags measured in the 12 months before index date. Each row is a patient; each column is one comorbidity indicator (Charlson components, or drug-class flags). The model finds latent patient types whose joint flag profiles differ maximally. Use poLCA in R or equivalent; test k = 2 through 5 classes. Report class-conditional item probabilities (the probability of each flag given class membership) as the primary characterization of each class.",
        "edge_cases": [
          "Rare comorbidities (prevalence below 5%) add noise to LCA and may cause class-collapse; consider grouping related conditions (e.g., any prior CV event) before fitting.",
          "LCA with more than 10 binary indicators becomes computationally demanding and may require sparse-data regularization or dimensionality reduction before fitting."
        ],
        "data_source_notes": "Claims: derive the binary flag matrix from diagnosis codes in the baseline window. EHR: lab-value thresholds (HbA1c above 7, LDL above 130) can augment ICD-based flags to create a richer indicator set."
      },
      {
        "name": "Trajectory class as effect modifier in an outcome model",
        "description": "After fitting the GBTM and assigning classes (using modal assignment or three-step correction), include trajectory class as a covariate or interaction term in a Cox proportional-hazards or logistic model for a downstream outcome. The trajectory window must end before the outcome window begins to avoid immortal-time bias. Report hazard ratios or odds ratios by class with 95% confidence intervals; specify whether modal assignment or three-step correction was used and report entropy to justify the choice.",
        "edge_cases": [
          "If entropy is below 0.70, three-step corrections (Bakk and Vermunt 2016) are required; modal assignment without correction will produce biased class-specific estimates.",
          "Trajectory classes are not randomized; residual confounding by indication is expected (consistent adherers may differ from rapid discontinuers on unmeasured disease severity). Propensity-score adjustment within the downstream model is often necessary."
        ],
        "data_source_notes": "Claims: end the trajectory window (months 1-12) at the same date the outcome window begins (month 13 onward); compute trajectory class from the first-year data only, then follow patients from month 13 for the primary outcome."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "pdc-proportion-of-days-covered",
        "pros_of_this": "GBTM recovers the shape of the adherence trajectory over time, revealing distinct behavioral phenotypes (consistent, declining, intermittent) invisible to a single PDC scalar; enables effect-modification analyses by adherence phenotype rather than a binary above/below 0.80 classification.",
        "cons_of_this": "GBTM classes are model-dependent constructs that vary with k, indicator selection, and sample; a PDC rate is reproducible, regulation-endorsed (CMS/PQA), and transparent. GBTM requires specialized software (PROC TRAJ, lcmm) and substantially more analytic effort than computing a column-level mean.",
        "when_to_prefer": "Prefer GBTM when the research question is about adherence phenotype heterogeneity and its modifying effect on outcomes; prefer PDC for regulatory reporting, quality measures, and any setting requiring a single reproducible adherence summary."
      },
      {
        "compared_to": "mixed-effects-models-longitudinal-rwe",
        "pros_of_this": "GBTM provides interpretable, labeled discrete subgroups that clinical and payer audiences can act on directly; it does not require assumptions about random-effect distributions and avoids the conditional-vs-marginal estimand ambiguity of mixed models.",
        "cons_of_this": "Mixed-effects models treat heterogeneity as a continuous random effect — a more parsimonious and statistically efficient representation when the heterogeneity is truly continuous; GBTM forces a discrete class structure onto what may be a smooth distribution, and the choice of k introduces analyst degrees of freedom not present in a random-effects framework.",
        "when_to_prefer": "Prefer GBTM when discrete phenotypes with interpretable labels are needed for downstream segmentation or effect-modification analysis; prefer mixed-effects models when the goal is a population-average or subject-specific trajectory estimate with covariate adjustment."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Construct a monthly binary adherence flag per patient from fill_date and days_supply using the same daily-coverage logic as PDC (union rule: any fill covering a day marks it covered, then collapse to a monthly 0/1). Each patient contributes one row per month to the PROC TRAJ or lcmm input. Align all patients to the index fill date so month 1 is universal across patients. Cap the trajectory window before the outcome window; never overlap the two. For LCA, derive the binary comorbidity matrix from ICD-10 codes in a 12-month baseline window using standard Charlson or Elixhauser definitions.",
      "ehr": "EHR-derived adherence trajectories require constructing prescription-days-supply from e-prescribing records or dispensing logs; the logic is identical to claims but missingness is more common (patients who fill outside the health system are invisible). Supplement with refill-confirmation calls or lab-value trajectories (HbA1c, LDL) as a validation check. For LCA, EHR allows richer indicator sets including lab-value thresholds and vital-sign flags that are unavailable in claims.",
      "registry": "Disease registries with longitudinal follow-up (e.g., oncology, inflammatory disease) often record treatment episodes and dosing modifications directly; GBTM on treatment- intensity or dose-level trajectories is a natural application. Class assignments can then be linked to patient-reported outcomes or adjudicated events recorded in the registry without the immortal-time hazard because the registry outcome is recorded prospectively after enrollment.",
      "linked": "Linked claims-EHR-registry datasets allow GBTM on claims-derived adherence trajectories with outcome validation from EHR or registry adjudication. The richest design: define trajectory classes from claims (months 1-12), validate class-level clinical profiles using EHR labs at 6 and 12 months, then follow to registry-adjudicated events from month 13 onward."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "# ── LCA-style clustering with sklearn GaussianMixture ──\n# NOTE: Approximates cross-sectional LCA on binary indicators.\n# For longitudinal GBTM, use R (lcmm) or SAS (PROC TRAJ).\n\nimport numpy as np\nfrom sklearn.mixture import GaussianMixture\n\n# ── Worked example: verify weighted mean PDC arithmetic ──\n# 1000 patients, 3 classes with shares 0.50 / 0.30 / 0.20\nclass_shares   = np.array([0.50, 0.30, 0.20])\nclass_pdc_mean = np.array([0.90, 0.60, 0.20])\n\n# Weighted overall mean PDC\n# 0.5 * 0.9 = 0.45; 0.3 * 0.6 = 0.18; 0.2 * 0.2 = 0.04\ncontributions = class_shares * class_pdc_mean\n# contributions = [0.45, 0.18, 0.04]\noverall_mean_pdc = contributions.sum()\n# 0.45 + 0.18 + 0.04 = 0.67\nprint(f\"Contributions: {contributions}\")\nprint(f\"Overall weighted mean PDC: {overall_mean_pdc:.2f}\")  # 0.67\n\n# Class patient counts from shares (n=1000)\nn_total = 1000\nn_per_class = class_shares * n_total  # [500, 300, 200]\nprint(f\"Patients per class: {n_per_class.astype(int)}\")\n\n# ── LCA-style GaussianMixture on simulated comorbidity indicators ──\n# 5 binary flags: diabetes, hypertension, dyslipidemia, heart_failure, prior_MI\nrng = np.random.default_rng(42)\ntrue_class = rng.choice([0, 1, 2], p=[0.50, 0.30, 0.20], size=1000)\nclass_probs = {\n    0: [0.70, 0.80, 0.85, 0.10, 0.08],  # metabolic cluster\n    1: [0.30, 0.50, 0.35, 0.65, 0.55],  # cardiovascular cluster\n    2: [0.10, 0.15, 0.12, 0.05, 0.03],  # low-burden cluster\n}\nX = np.array([\n    [rng.binomial(1, class_probs[c][j]) for j in range(5)]\n    for c in true_class\n], dtype=float)\n\n# ── BIC comparison across k = 2, 3, 4 ──\nprint(\"\\nBIC by k (lower = better; look for elbow):\")\nbic_results = {}\nfitted_models = {}\nfor k in range(2, 5):\n    gm = GaussianMixture(\n        n_components=k, covariance_type=\"full\",\n        n_init=5, random_state=42\n    )\n    gm.fit(X)\n    bic_results[k] = gm.bic(X)\n    fitted_models[k] = gm\n    print(f\"  k={k}: 
BIC={bic_results[k]:.1f}\")\n\n# ── Fit k=3 and extract posterior probabilities ──\ngm3 = fitted_models[3]\npost_probs  = gm3.predict_proba(X)   # shape (1000, 3), rows sum to 1\nmodal_class = gm3.predict(X)         # modal assignment (ignores uncertainty)\n\n# ── Entropy: classification quality ──\n# Entropy = 1 + (1 / (n * ln(k))) * sum_ij [p_ij * ln(p_ij)]\neps = 1e-10\nn, k = post_probs.shape\nentropy = 1 + np.sum(post_probs * np.log(post_probs + eps)) / (n * np.log(k))\nprint(f\"\\nEntropy (k=3): {entropy:.3f}  (>=0.80 = good separation)\")\nprint(\"If entropy < 0.70, apply 3-step correction (Bakk & Vermunt 2016) for inference.\")\n\n# ── Class mixing proportions ──\nprint(\"\\nEstimated class proportions (k=3):\")\nfor i, w in enumerate(gm3.weights_):\n    n_cls = np.sum(modal_class == i)\n    print(f\"  Class {i}: pi={w:.3f}  ({n_cls} patients by modal assignment)\")\n\n# ── Mean simulated PDC by modal class ──\n# Simulate 12-month PDC from known class-level means\npdc_true_means = {0: 0.90, 1: 0.60, 2: 0.20}\npdc = np.array([\n    rng.beta(a=pdc_true_means[c] * 10, b=(1 - pdc_true_means[c]) * 10)\n    for c in true_class\n])\nprint(\"\\nMean 12-month PDC by modal class assignment:\")\nfor i in range(3):\n    mask = modal_class == i\n    print(f\"  Class {i}: mean PDC = {pdc[mask].mean():.3f}\")",
        "description": "LCA-style finite mixture clustering using sklearn GaussianMixture on cross-sectional\ncomorbidity indicator data. Compares BIC across k = 2, 3, 4 classes and extracts\nposterior probabilities, modal assignments, and entropy. Also demonstrates the weighted\nmean PDC arithmetic from the worked example. NOTE: sklearn GaussianMixture fits a\nGaussian mixture to continuous or binary-coded data and is an approximation to\ncategorical LCA. For rigorous longitudinal GBTM with binary or count monthly outcomes,\nuse R (lcmm, flexmix) or SAS (PROC TRAJ). Python lacks a production-grade GBTM\nimplementation as of 2025.",
        "dependencies": [
          "numpy",
          "scikit-learn"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "# ── Group-Based Trajectory Modeling in R ──\n# install.packages(c(\"lcmm\", \"poLCA\"))\nlibrary(lcmm)\nlibrary(poLCA)\n\n# ── 1. Simulate longitudinal adherence data (1000 patients, 12 months) ──\nset.seed(42)\nn <- 1000\ntrue_cls <- sample(1:3, size = n, prob = c(0.50, 0.30, 0.20), replace = TRUE)\nmonths    <- 1:12\n\nsim_data <- do.call(rbind, lapply(seq_len(n), function(pid) {\n  cls <- true_cls[pid]\n  mu  <- c(0.90, 0.60, 0.20)[cls]\n  pdc_vals <- pmin(pmax(rnorm(12, mean = mu, sd = 0.07), 0), 1)\n  data.frame(id = pid, month = months, pdc = pdc_vals)\n}))\n\n# ── 2. Fit GBTM with k = 1, 2, 3 classes (hlme: heterogeneous linear mixed model) ──\n# hlme fits a linear mixed model with ng latent classes.\n# mixture = ~month: class-specific intercept and slope on time.\n# gridsearch() wraps the ng >= 2 call: it runs `rep` short fits from random starts\n# generated from m1 and finishes the best one, which guards against local maxima.\n\nm1 <- hlme(pdc ~ month, subject = \"id\", ng = 1, data = sim_data)\n\nm2 <- gridsearch(\n  hlme(pdc ~ month, subject = \"id\", ng = 2, mixture = ~month, data = sim_data),\n  rep = 20, maxiter = 10, minit = m1)\n\nm3 <- gridsearch(\n  hlme(pdc ~ month, subject = \"id\", ng = 3, mixture = ~month, data = sim_data),\n  rep = 30, maxiter = 10, minit = m1)\n\n# ── 3. Compare models by BIC (lower = better) and entropy ──\ncat(\"Model comparison: BIC and entropy\\n\")\nsummarytable(m1, m2, m3, which = c(\"G\", \"loglik\", \"AIC\", \"BIC\", \"entropy\"))\n# Entropy >= 0.80: good class separation\n# Entropy < 0.70: apply 3-step correction before downstream regression\n\n# ── 4. Extract class assignments and posterior probabilities (k=3) ──\n# m3$pprob: data.frame with columns id, class (modal), prob1, prob2, prob3\npp3 <- m3$pprob\n\ncat(\"\\nModal class proportions (k=3):\\n\")\nprint(prop.table(table(pp3$class)))\n# Expected: ~0.50 / 0.30 / 0.20 (class numbering is arbitrary)\n\n# Relative entropy computed directly from the posterior probabilities\npmat <- as.matrix(pp3[, c(\"prob1\", \"prob2\", \"prob3\")])\nentropy3 <- 1 + sum(pmat * log(pmat + 1e-10)) / (nrow(pmat) * log(3))\ncat(sprintf(\"Entropy (k=3): %.3f\\n\", entropy3))\n\n# ── 5. 
Mean 12-month PDC by assigned class ──\npdcs <- aggregate(pdc ~ id, data = sim_data, FUN = mean)\npdcs$class <- pp3$class[match(pdcs$id, pp3$id)]\n\ncat(\"\\nMean 12-month PDC by modal trajectory class:\\n\")\nclass_means <- aggregate(pdc ~ class, data = pdcs, FUN = mean)\nprint(class_means)\n# Expected: class means near 0.90, 0.60, 0.20 (in class-sorted order)\n\n# ── 6. Verify worked-example weighted mean arithmetic ──\n# Class shares: 0.50, 0.30, 0.20  |  PDC means: 0.90, 0.60, 0.20\nshares <- c(0.50, 0.30, 0.20)\npdcs_e <- c(0.90, 0.60, 0.20)\ncontrib <- shares * pdcs_e          # 0.45, 0.18, 0.04\noverall <- sum(contrib)             # 0.67\ncat(sprintf(\"\\nWeighted mean PDC: %.2f  (expected 0.67)\\n\", overall))\n\n# ── 7. Cross-sectional LCA with poLCA ──\n# poLCA requires indicators coded as 1, 2, ... (not 0/1)\nset.seed(42)\nsim_lca <- data.frame(\n  diabetes     = sample(1:2, 500, replace = TRUE, prob = c(0.40, 0.60)),\n  hypertension = sample(1:2, 500, replace = TRUE, prob = c(0.50, 0.50)),\n  dyslipidemia = sample(1:2, 500, replace = TRUE, prob = c(0.45, 0.55)),\n  hf           = sample(1:2, 500, replace = TRUE, prob = c(0.15, 0.85)),\n  prior_mi     = sample(1:2, 500, replace = TRUE, prob = c(0.12, 0.88))\n)\nf_lca <- cbind(diabetes, hypertension, dyslipidemia, hf, prior_mi) ~ 1\n\nlca_bic <- sapply(2:4, function(k) {\n  fit <- poLCA(f_lca, data = sim_lca, nclass = k, verbose = FALSE, nrep = 5)\n  fit$bic\n})\nnames(lca_bic) <- paste0(\"k=\", 2:4)\ncat(\"\\nLCA BIC by k (lower = better):\\n\")\nprint(lca_bic)\n\nbest_k <- which.min(lca_bic) + 1\nlca_final <- poLCA(f_lca, data = sim_lca,\n                   nclass = best_k, verbose = FALSE, nrep = 10)\ncat(sprintf(\"\\nSelected LCA k=%d. Class mixing proportions:\\n\", best_k))\nprint(lca_final$P)\n# Class-conditional item probabilities reveal the comorbidity profile of each class",
        "description": "Longitudinal GBTM using lcmm::hlme (linear mixed model with latent classes)\nand cross-sectional LCA using poLCA. Covers BIC comparison across k, entropy\ncomputation, posterior probability extraction, modal class assignment, and mean PDC\nby class. flexmix is a general-purpose alternative for non-Gaussian outcomes (not shown).",
        "dependencies": [
          "lcmm",
          "poLCA"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* ── PROC TRAJ: Canonical GBTM in SAS ────────────────────────────────────────\n   Requires the PROC TRAJ add-on procedure from\n   https://www.andrew.cmu.edu/user/bjones/traj.htm\n   Install: follow the platform-specific instructions on that page.\n   Reference: Jones BL, Nagin DS, Roeder K. Sociol Methods Res. 2001;29:374-393.\n   ────────────────────────────────────────────────────────────────────────── */\n\n/* ── Step 1: Simulate 1000 patients x 12 monthly adherence indicators ── */\ndata work.adherence_long;\n  call streaminit(42);\n  do patient_id = 1 to 1000;\n    u = rand('uniform');\n    if      u < 0.50 then true_cls = 1;   /* consistent adherer  */\n    else if u < 0.80 then true_cls = 2;   /* moderate adherer    */\n    else                   true_cls = 3;  /* rapid discontinuer  */\n\n    do month = 1 to 12;\n      if true_cls = 1 then p_adhere = 0.90;\n      else if true_cls = 2 then\n        p_adhere = max(0.70 - (month - 1) * 0.025, 0.35);\n      else\n        p_adhere = max(0.80 - (month - 1) * 0.09, 0.05);\n      adhere = rand('bernoulli', p_adhere);\n      output;\n    end;\n  end;\nrun;\n\n/* ── Step 2: Reshape to wide (one row per patient, columns m1-m12) ── */\nproc transpose data = work.adherence_long\n               out  = work.adh_wide (rename=(_name_=_col_))\n               prefix = m;\n  by patient_id true_cls;\n  id month;\n  var adhere;\nrun;\n\n/* PROC TRAJ expects one time variable per measurement occasion (t1-t12) */\ndata work.adh_wide;\n  set work.adh_wide;\n  array t{12} t1-t12;\n  do i = 1 to 12;\n    t{i} = i;\n  end;\n  drop i;\nrun;\n\n/* ── Step 3: Fit PROC TRAJ: 3-group logistic trajectory model ───────── */\n/* MODEL LOGIT  : binary monthly adherence flag                           */\n/* ORDER 1 1 1  : linear (intercept + time slope) trajectory per group    */\n/* NGROUPS 3    : request 3 latent trajectory groups                      */\n/* INDEP t1-t12 : time variables, one per measurement occasion            */\n/* OUT=     : posterior probabilities and modal group per patient         */\n/* OUTSTAT= : parameter estimates (used by the TRAJPLOT macro)            */\nproc traj\n    data    = work.adh_wide\n    out     = work.traj_out\n    outstat = work.traj_stat\n    outplot = 
work.traj_plot;\n  id      patient_id;\n  var     m1-m12;                 /* 12 binary monthly outcomes         */\n  indep   t1-t12;                 /* time variables: month number       */\n  model   logit;                  /* logistic link for binary outcomes  */\n  ngroups 3;\n  order   1 1 1;                  /* linear trajectory per group        */\nrun;\n/* To select optimal k, re-run with ngroups 2 and ngroups 4 and compare  */\n/* the BIC printed in the PROC TRAJ output. PROC TRAJ reports            */\n/* BIC = log(L) - 0.5*k*log(N), so the largest (closest to zero) value   */\n/* indicates the better fit.                                             */\n/* Also check AvePPmax (average maximum posterior probability) as an      */\n/* entropy analogue: values >= 0.70 indicate adequate separation.        */\n\n/* ── Step 4: Parameter estimates ── */\nproc print data = work.traj_stat noobs;\n  title \"PROC TRAJ parameter estimates: 3-group logistic trajectory\";\n  /* Log-likelihood and BIC appear in the PROC TRAJ listing output       */\nrun;\n\n/* ── Step 5: Class sizes and mean PDC by assigned class ── */\nproc freq data = work.traj_out;\n  tables group / nocum;\n  title \"Modal trajectory class frequencies (n=1000)\";\n  /* Expected: roughly 500 / 300 / 200; group numbering is arbitrary     */\nrun;\n\n/* Compute mean 12-month PDC per patient from the long file */\nproc means data = work.adherence_long noprint;\n  by patient_id;\n  var adhere;\n  output out = work.pdc_pt (drop=_type_ _freq_) mean = pdc_12mo;\nrun;\n\n/* Merge with trajectory class assignment */\nproc sort data = work.traj_out; by patient_id; run;\nproc sort data = work.pdc_pt;   by patient_id; run;\n\ndata work.class_pdc;\n  merge work.traj_out (keep = patient_id group)\n        work.pdc_pt;\n  by patient_id;\nrun;\n\nproc means data = work.class_pdc n mean std;\n  class group;\n  var   pdc_12mo;\n  title \"Mean 12-month PDC by GBTM trajectory class\";\n  /* Expected from the simulated probabilities: ~0.90 (consistent),     */\n  /* ~0.56 (moderate), ~0.34 (rapid discontinuer); numbering arbitrary   */\nrun;\n\n/* ── Step 6: Verify 
worked-example weighted mean arithmetic ── */\n/* 0.5 * 0.9 = 0.45; 0.3 * 0.6 = 0.18; 0.2 * 0.2 = 0.04                */\n/* 0.45 + 0.18 + 0.04 = 0.67                                             */\ndata work.check;\n  class1_share = 0.50; class1_pdc = 0.90; contrib1 = class1_share * class1_pdc;\n  class2_share = 0.30; class2_pdc = 0.60; contrib2 = class2_share * class2_pdc;\n  class3_share = 0.20; class3_pdc = 0.20; contrib3 = class3_share * class3_pdc;\n  overall_mean_pdc = contrib1 + contrib2 + contrib3;\n  /* overall_mean_pdc = 0.45 + 0.18 + 0.04 = 0.67                       */\nrun;\nproc print data = work.check noobs;\n  title \"Weighted mean PDC verification (expected 0.67)\";\nrun;",
        "description": "PROC TRAJ, the canonical GBTM implementation developed by Jones, Nagin, and Roeder.\nDemonstrates a three-class logistic-link trajectory model for binary monthly adherence\nindicators, with notes on re-fitting at k = 2 and k = 4 for BIC-based model selection,\nposterior class probability extraction, and mean PDC by assigned class. PROC TRAJ must be\ndownloaded from https://www.andrew.cmu.edu/user/bjones/traj.htm and installed as a\nuser-defined procedure before running.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  A[Raw patient data<br/>longitudinal or cross-sectional] --> B{Outcome type?}\n  B --> |\"Binary/count over time<br/>(monthly adherence flags)\"| C[GBTM<br/>PROC TRAJ / lcmm]\n  B --> |\"Cross-sectional binary<br/>comorbidity profile\"| D[LCA<br/>poLCA / GaussianMixture]\n  C --> E[Fit k = 2, 3, 4 classes<br/>EM algorithm]\n  D --> E\n  E --> F[Select k by BIC + entropy<br/>+ minimum class size<br/>+ interpretability]\n  F --> G{Entropy >= 0.80?}\n  G --> |Yes| H[Modal class assignment<br/>approximately unbiased]\n  G --> |No entropy < 0.70| I[Three-step correction<br/>required for inference]\n  H --> J[Downstream outcome model<br/>trajectory class as covariate]\n  I --> J\n  J --> K{Trajectory window overlaps<br/>outcome window?}\n  K --> |Yes| L[STOP: immortal-time bias<br/>Separate windows required]\n  K --> |No| M[Report class-specific estimates<br/>with entropy and class sizes<br/>Note reification trap]",
        "caption": "GBTM and LCA analysis workflow. The reification trap applies at every stage — classes are model constructs, not discovered biological entities.",
        "alt_text": "Flowchart from raw patient data through GBTM or LCA fitting, k selection by BIC and entropy, modal assignment or three-step correction, and downstream outcome modelling, with decision points for entropy threshold and immortal-time hazard check.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "used_with",
        "target_slug": "pdc-proportion-of-days-covered",
        "notes": "PDC is the primary adherence measure that GBTM trajectory classes characterize and extend; each class's mean 12-month PDC is the key output quantity, and the weighted mean across classes recovers the cohort-level PDC. GBTM reveals the heterogeneity that a single PDC scalar collapses."
      },
      {
        "relation_type": "see_also",
        "target_slug": "persistence-time-to-discontinuation",
        "notes": "Persistence captures time to first qualifying gap; GBTM captures the full trajectory shape including restarts after gaps. Rapid-discontinuer trajectory classes often correspond to short persistence times, but GBTM can also identify intermittent users who restart after gaps — a pattern persistence cannot represent."
      },
      {
        "relation_type": "see_also",
        "target_slug": "mixed-effects-models-longitudinal-rwe",
        "notes": "Mixed-effects models model adherence heterogeneity as a continuous random effect (each patient has their own random intercept and slope); GBTM imposes a discrete class structure. The two approaches are rivals for the same heterogeneity question — choose by whether discrete labels or a continuous distribution is more appropriate for the research question."
      },
      {
        "relation_type": "see_also",
        "target_slug": "immortal-time-bias-handling",
        "notes": "Trajectory groups defined by future data (e.g., a 12-month trajectory window that overlaps the outcome window) create implicit immortal time. The only remedy is to separate trajectory and outcome windows; this is one of the most common and consequential errors in GBTM-based pharmacoepidemiology."
      },
      {
        "relation_type": "see_also",
        "target_slug": "charlson-comorbidity-index-rwe",
        "notes": "LCA applied to comorbidity indicators is an alternative to the Charlson weighted sum for multimorbidity phenotyping; it captures joint co-occurrence patterns that the additive Charlson index ignores, at the cost of model dependence and the reification trap."
      }
    ],
    "aliases": [
      "GBTM",
      "group-based trajectory model",
      "Nagin trajectory model",
      "latent class analysis",
      "LCA",
      "finite mixture model"
    ],
    "complexity": "advanced",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "hazard-ratio-interpretation",
    "name": "The Hazard Ratio as an Effect Measure",
    "short_definition": "The hazard ratio (HR) is a relative effect measure from time-to-event analysis that compares the instantaneous rate of an event in one group to another at each moment during follow-up, among those still event-free; it is produced by Cox regression but is non-collapsible, estimand-ambiguous under non-proportional hazards due to the depletion of susceptibles, and must always be paired with an absolute companion measure (RMST difference, landmark cumulative risk, or median survival difference) for clinical decisions, HTA submissions, and cost-effectiveness models.",
    "long_description": "**The hazard as an instantaneous rate**\n\nThe hazard at time t, written h(t), is the instantaneous rate at which events occur at\ntime t among subjects who have survived to t: the conditional probability of an event in\na short interval starting at t, divided by the length of that interval.\nFormally h(t) = lim_{Δt→0} P(t ≤ T < t+Δt | T ≥ t) / Δt. It is not\na probability bounded between 0 and 1; it is a rate, and it is defined only among\nindividuals who are still event-free at that moment. The hazard ratio HR = h_A(t) /\nh_B(t) compares these instantaneous rates between two groups. Under the proportional\nhazards (PH) assumption (this ratio is constant across all t), Cox regression\nsummarizes the entire follow-up as a single dimensionless number. When PH fails, the\nCox partial likelihood produces a weighted average of time-varying, period-specific HRs,\nwith weights that depend on the censoring distribution and on the size of the risk set\nat each event time. The resulting single-number summary then depends on study duration\nand dropout pattern, not just the underlying biology.\n\n**Hazard ≠ risk: the measure that matters for decisions**\n\nRisk is a cumulative probability: the chance the event occurs by a stated time horizon.\nHazard is an instantaneous rate among survivors. They are linked through the survival\nfunction S(t) = exp(−∫₀ᵗ h(u) du), and cumulative incidence (risk) equals 1 − S(t).\nUnder proportional hazards with rare events over a short window, the HR approximates the\nrisk ratio (RR). For common events or long follow-up they can diverge substantially. The HR also\ndiffers from the aggregate rate ratio in the person-time sense, which is\n(events_A / person-time_A) ÷ (events_B / person-time_B); that equals the HR only when\nhazard is constant within each arm throughout follow-up. 
The practical implication:\nthe HR cannot be inserted into a Markov state-transition model, a budget-impact\ncalculation, or a cost-per-QALY denominator without first converting it to an absolute\nsurvival curve via the Breslow baseline estimator. Always report the RMST difference,\nlandmark cumulative incidence difference, or median survival difference alongside the HR.\n\n**The Hernán critique: depletion of susceptibles and estimand ambiguity**\n\nHernán (2010) in \"The Hazards of Hazard Ratios\" articulated the most important\nconceptual trap in survival analysis: even in a perfectly randomized trial — no\nconfounding, no informative censoring — a single reported HR is causally ambiguous when\nthe proportional hazards assumption does not hold. The mechanism is depletion of\nsusceptibles. Individuals most vulnerable to the event fail early, and disproportionately\nso in the higher-hazard arm. As time passes, the surviving risk sets in both arms become\nself-selected, and the two groups are no longer exchangeable even under perfect\nrandomization. A genuine constant treatment effect can therefore produce an observed HR\nthat drifts over follow-up as the composition of the risk sets changes. Immuno-oncology\nprovides the canonical example: checkpoint inhibitors often produce a delayed Kaplan-Meier\ncurve separation — no early benefit, then strong late benefit. A single averaged HR from\na log-rank test or Cox model blends the null early period with the effective late period,\nand the weights on each period vary by how long patients were followed and how many were\ncensored. A study with more late follow-up produces a numerically different HR for the\nsame biology than an otherwise identical study with shorter follow-up. 
This is Hernán's\ncentral point: the HR is partly a function of study design, not just the causal effect.\nStensrud and Hernán (2020) extend this to argue that the right question is not \"does PH\nhold?\" but \"what is the estimand I actually want?\" — cumulative incidence difference at\nyear 2? Mean event-free months over 3 years? — and to pre-specify it before data\nanalysis begins, reporting the HR only as a supplementary relative summary alongside an\nabsolute measure that is invariant to censoring patterns.\n\n**Noncollapsibility of the hazard ratio**\n\nThe HR shares with the odds ratio the property of noncollapsibility: even when a\ncovariate is not a confounder — it is balanced between arms and affects only the outcome\n— adjusting for it changes the HR. The conditional (adjusted) HR is generally farther\nfrom the null than the marginal (unadjusted) HR when the covariate is prognostic; this\ndirection holds under most standard hazard structures but is not algebraically guaranteed\nfor all combinations of hazard shape and covariate distribution. Three practical\nimplications follow. (1) A covariate-adjusted Cox HR is typically larger in magnitude\nthan the log-rank HR from an unadjusted Kaplan-Meier analysis — this is not evidence of\nconfounding, it is the arithmetic consequence of noncollapsibility. (2) An\nHR from a propensity-score-matched or IPTW-weighted analysis estimates a marginal\nquantity, while the Cox HR is conditional on the covariate pattern in the model; they\nanswer different questions and should not be compared across papers without acknowledging\nthis distinction. 
(3) In an RCT, adding a strong prognostic baseline covariate to the\nCox model will move the HR for treatment even when the covariate is perfectly balanced.\nThis shift is the expected consequence of noncollapsibility (the adjusted analysis is\nalso typically more efficient), not an indication of imbalance or error.\n\n**Interpreting the output**\n\nConsider a study reporting HR = 0.75 (95% CI 0.60–0.94) for a new treatment versus\nstandard care, from a Cox model with baseline covariate adjustment.\n\nFormal interpretation: At any given instant during follow-up, among patients who have\nnot yet had the event, the instantaneous rate of the event is 25% lower in the treated\ngroup than in the comparator group. This is NOT a statement that there are 25% fewer\nevents in total. It is NOT a risk ratio, which compares cumulative probabilities of the\nevent by a fixed horizon. It is NOT a time ratio: 1/0.75 = 1.33 does not mean the treated\narm takes 1.33 times longer to reach the event. Under PH the 0.75 is a valid\nsummary across all time points. Under non-PH it is a weighted average of period-specific\nHRs, where the weights depend on the censoring distribution, which makes the number partly\nan artifact of study duration and dropout pattern. Additionally, the built-in depletion\nof susceptibles means that even under PH the late-period HR reflects a causally different\nrisk-set composition than the early-period HR. The 95% CI of 0.60–0.94 is a\nrepeated-sampling interval: under the same data-generating process, approximately 95% of\nintervals constructed this way would contain the true HR. This CI excludes 1.0, so the\nresult is statistically significant at the two-sided 5% level, but the interval is not a\nstatement that the true HR lies in this range with 95% probability.\n\nPractical interpretation: Treated patients experience events at a 25% slower rate while\nthey remain event-free. 
The absolute summaries for this study's 2-year follow-up complete the picture:\nthe RMST difference was 2.8 months (treated patients gained 2.8 additional event-free\nmonths on average), and the risk difference at 24 months was approximately 6 fewer\nevents per 100 patients in the treated arm. A payer or HTA reviewer cannot act on\n\"HR 0.75\" alone; the absolute benefit determines whether the treatment is cost-effective.\nAlways pair the HR with the RMST difference, landmark risk difference, or median survival\ndifference. Treat this pairing as standard practice: HTA bodies such as NICE base their\nappraisals on absolute survival gains, and the ICH E9(R1) estimand framework adopted by\nEMA and FDA expects the population-level summary measure to be specified in advance.\n\n**RWE-specific interpretation challenges**\n\nTwo issues in observational data compound the standard warnings. First, informative\ncensoring: when patients disenroll from the health plan before the event for reasons\nrelated to their prognosis, they are removed from the risk set. The\ndirection of bias depends on which arm experiences more prognosis-related censoring and\non the true direction of the treatment effect. For example, if sicker treated patients\ndisenroll due to adverse effects, their removal inflates apparent survival in the treated\narm and biases the HR away from the null (toward apparent benefit). Conversely, if sicker\nuntreated patients disenroll first, the HR can be biased toward the null. There is no\nuniversally safe direction to assume; the bias must be assessed case by case. Inverse\nprobability of censoring weighting (IPCW) can correct this, provided the censoring model\nis correctly specified, by upweighting patients who remain under observation to stand in\nfor comparable patients who were censored for prognosis-related reasons; the\nresulting HR corresponds to the counterfactual world where all patients remained\nobservable. Second, time-varying exposures: the HR from a time-fixed baseline-exposure\nCox model reflects initial treatment assignment, not sustained treatment. 
When patients\nswitch, discontinue, or titrate their regimen, a counting-process layout (start, stop,\nevent intervals) with exposure updated at each interval is required for an as-treated or\nper-protocol analysis. Naive time-fixed analysis in the presence of switching produces\nan HR that blends on-treatment and off-treatment periods — estimand-ambiguous for the\nsame reason as Hernán's depletion-of-susceptibles argument. For the related trap of\ninflated exposed person-time from misaligned time zero, see immortal-time-bias-handling.\n\n**Pros, cons, and trade-offs**\n\nPros of the HR as an effect measure: compact and dimensionless; directly estimated by\nCox regression with adjustment for baseline covariates and time-varying factors; the\nstandard language of survival analysis familiar to FDA, EMA, and clinical journal\nreviewers; efficient when PH holds; can be stratified, time-segmented, or marginalized\nvia IPTW to handle specific violations; has mature, validated software across all\nthree major analytic languages.\n\nCons: Non-collapsible — conditional and marginal HRs are not comparable across studies\nwith different covariate profiles. Estimand-ambiguous under non-PH — the value depends\non censoring distribution and study duration, not just the causal effect. Not directly\ninterpretable as a probability or risk. Cannot be used in decision models without\nconverting to survival curves. The depletion-of-susceptibles mechanism makes the single\nHR causally murky in long follow-up studies, particularly in oncology. A large relative\nimprovement (HR = 0.50) in a very rare event may translate to trivially small absolute\nbenefit; the HR alone conceals this.\n\nTrade-offs vs. RMST difference: RMST requires no PH assumption, is absolute (event-free\nmonths gained), and is directly interpretable by patients, payers, and HTA bodies. Cost:\nlower power than log-rank and Cox under PH; requires pre-specifying the horizon; less\nfamiliar to some regulatory reviewers. 
Best practice is to pre-specify both and report\nboth. Trade-offs vs. landmark risk difference or risk ratio at a fixed time: risk at a\npre-specified landmark is a direct probability, natural for patient communication,\nbudget-impact models, and Markov model inputs. Cost: discards information about when during\nfollow-up events occur; sensitive to landmark choice.\n\n**When to use**\n\nUse the HR, always paired with an absolute summary, when: (a) Cox regression is the\npre-specified analytic method for a time-to-event endpoint with a defensible time zero;\n(b) the audience expects a relative effect on the hazard scale, as in regulatory\nsubmissions and pharmacoepidemiology journals; (c) proportional hazards approximately\nholds (log-minus-log survival curves are roughly parallel, Schoenfeld residual test is\nnon-significant for the treatment variable); and (d) a companion absolute measure (RMST\ndifference, cumulative incidence difference at a landmark, or median survival difference)\nis pre-specified and reported alongside it. 
The HR remains a valid descriptive summary\nunder mild PH violations, interpreted as a weighted average of period-specific hazard\nratios rather than a period-invariant constant.\n\n**When NOT to use**\n\nDo not use the HR as the sole effect measure when: (a) hazards cross or converge: a\nPH violation visible on a log-log survival plot or flagged by a significant Schoenfeld\ntest means the single averaged HR is actively misleading; report RMST difference or\ntime-segmented HRs instead; (b) the audience is patients or non-specialist clinicians who will\ninterpret it as a percentage of events or a probability, when it is neither; (c) the result\nfeeds directly into an HTA cost-effectiveness model requiring transition probabilities or\nstate-occupancy times, in which case convert to survival curves and absolute risks first;\n(d) competing risks are substantial and differential by arm, as in elderly or end-stage\npopulations: the cause-specific HR from Cox does not map directly onto cumulative incidence\n(treating competing events as censoring overstates it) and\nmust be paired with a subdistribution HR and cumulative incidence function; (e) the study\nuses time-varying exposures without a counting-process layout, so the resulting number is a\nblend of on- and off-treatment periods with no clear causal interpretation.",
    "primary_category": "Inferential_Statistics",
    "tags": [
      "statistics",
      "effect-measures",
      "survival-analysis",
      "hazard-ratio",
      "cox",
      "proportional-hazards",
      "noncollapsibility",
      "rmst",
      "interpretation"
    ],
    "applies_to_study_types": [
      "cohort_retrospective",
      "cohort_prospective",
      "active_comparator_new_user",
      "new_user",
      "rct",
      "claims_analysis",
      "ehr_study",
      "registry_study"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "primary",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1097/EDE.0b013e3181c1ea43",
        "url": "https://doi.org/10.1097/EDE.0b013e3181c1ea43",
        "citation_text": "Hernán MA. The hazards of hazard ratios. Epidemiology. 2010;21(1):13-15.",
        "year": 2010,
        "authors_short": "Hernán",
        "notes": "The canonical critique of the hazard ratio as an effect measure. Explains depletion of susceptibles, period-specific HRs, and why a single averaged HR under non-proportional hazards is causally ambiguous even in a perfectly randomized trial. Essential reading for any analyst who reports or interprets a Cox model result."
      },
      {
        "role": "explain",
        "doi": "10.1001/jama.2020.1267",
        "url": "https://doi.org/10.1001/jama.2020.1267",
        "citation_text": "Stensrud MJ, Hernán MA. Why test for proportional hazards? JAMA. 2020;323(14):1401-1402.",
        "year": 2020,
        "authors_short": "Stensrud & Hernán",
        "notes": "Extends the Hernán critique: argues the clinically relevant question is not whether PH holds but which estimand to pre-specify. Covers noncollapsibility of the conditional HR versus the marginal HR, and routes analysts toward RMST and cumulative incidence as pre-specifiable, causally interpretable alternatives. Also cited in cox-ph-regression."
      },
      {
        "role": "use",
        "doi": "10.1200/JCO.2014.55.2208",
        "url": "https://doi.org/10.1200/JCO.2014.55.2208",
        "citation_text": "Uno H, Claggett B, Tian L, Inoue E, Gallo P, Miyata T, Schrag D, Takeuchi M, Uyama Y, Zhao L, Skali H, Solomon S, Jacobus S, Hughes M, Packer M, Wei LJ. Moving beyond the hazard ratio in quantifying the between-group difference in survival analysis. Journal of Clinical Oncology. 2014;32(22):2380-2385.",
        "year": 2014,
        "authors_short": "Uno et al.",
        "notes": "Proposes RMST difference, landmark risk differences, and other absolute measures as primary companions to the HR, with motivating examples from oncology trials where non-PH makes the HR misleading. The survRM2 R package implements the Uno et al. RMST estimation method."
      },
      {
        "role": "demonstrate",
        "doi": "10.1186/1471-2288-13-152",
        "url": "https://doi.org/10.1186/1471-2288-13-152",
        "citation_text": "Royston P, Parmar MK. Restricted mean survival time: an alternative to the hazard ratio for the design and analysis of randomized trials with a time-to-event outcome. BMC Medical Research Methodology. 2013;13:152.",
        "year": 2013,
        "authors_short": "Royston & Parmar",
        "notes": "Provides the theoretical and practical foundation for RMST-based trial design and analysis as an alternative to the log-rank test and Cox HR when proportional hazards cannot be assumed. Includes sample-size formulas and worked examples comparing RMST to the HR in trials with delayed treatment effects."
      }
    ],
    "plain_language_summary": "A hazard ratio (HR) compares how quickly an event — like a death, hospitalization, or disease progression — arrives in one treatment group versus another at each moment that both groups still have patients who have not yet had the event. An HR of 2.0 means the exposed group has events arriving at twice the speed of the comparison group at any given instant; it is NOT the same as saying there are twice as many events overall or that patients face twice the lifetime risk. Because the HR is calculated only among those still event-free at each moment, its meaning can shift over time as the riskiest patients leave the study first, making a single number potentially misleading for long follow-up. Always pair the HR with an absolute measure — such as the difference in average event-free months or the difference in event probability at a specific time point — so that decision-makers can judge whether the relative improvement translates to a meaningful real-world benefit.",
    "key_terms": [
      {
        "term": "hazard",
        "definition": "The instantaneous rate at which an event occurs at a specific moment in time, measured only among people who have not yet had the event — think of it as \"how fast is the event arriving right now for those still waiting.\""
      },
      {
        "term": "instantaneous rate",
        "definition": "A measure of how quickly something happens at a precise moment, expressed as events per unit of time (e.g., 0.02 events per month), not as a cumulative count or probability over a whole period."
      },
      {
        "term": "proportional hazards",
        "definition": "The assumption that the ratio of event rates between two groups stays roughly the same throughout the entire follow-up period; when this breaks down, a single hazard ratio is an average that can mask very different early versus late effects."
      },
      {
        "term": "depletion of susceptibles",
        "definition": "As a study continues, the people most likely to have the event fail first and leave the risk pool, so the groups being compared gradually become non-comparable — this makes the hazard ratio's meaning shift over time even in a perfectly run trial."
      },
      {
        "term": "period-specific effect",
        "definition": "The hazard ratio calculated within a defined time window (e.g., months 0–6 vs. months 6–24); when early and late HRs differ markedly, reporting them separately is more informative than a single averaged number."
      },
      {
        "term": "RMST",
        "definition": "Restricted Mean Survival Time — the average number of event-free months a patient experiences up to a pre-chosen horizon (e.g., 24 months); an absolute, assumption- light alternative to the HR that translates directly into clinical and economic terms."
      }
    ],
    "worked_example": {
      "scenario": "A researcher compares two groups — a new drug (exposed) versus standard care (unexposed) — in a registry study tracking time to a composite cardiovascular event. Aggregate person-time data are available for each arm. Under a constant-hazard assumption (rates are stable over follow-up), the hazard ratio can be computed directly from the event rates, illustrating exactly what the HR means and why it is not a risk ratio. Median event-free time under constant hazard is shown descriptively to illustrate how the HR relates to the time-ratio measure used in accelerated failure time models.",
      "dataset": {
        "caption": "Group-level person-time summary for two study arms under a constant-hazard assumption. Each row is one treatment arm. Person-months is the total follow-up accumulated by all patients in that arm; hazard_per_pm is the event rate per person-month.",
        "columns": [
          "arm",
          "events",
          "person_months",
          "hazard_per_pm"
        ],
        "rows": [
          [
            "Exposed (new drug)",
            40,
            2000,
            0.02
          ],
          [
            "Unexposed (standard care)",
            20,
            2000,
            0.01
          ]
        ]
      },
      "steps": [
        "Compute the exposed hazard rate: 40 / 2000 = 0.02 events per person-month. This means that at any given month, among those still event-free, the new-drug arm has an event rate of 0.02 per person-month.",
        "Compute the unexposed hazard rate: 20 / 2000 = 0.01 events per person-month.",
        "Compute the hazard ratio (exposed vs. unexposed): 0.02 / 0.01 = 2.0. The exposed group has events at twice the instantaneous rate of the unexposed group at any given moment, among those still event-free at that moment.",
        "Interpretation check — this is NOT a risk ratio: the HR of 2.0 does not mean twice as many events will occur overall. Risk over a finite window depends on the hazard AND the length of the window. For 12 months under constant hazard, cumulative risk in the exposed arm ≈ 1 − exp(−0.02 × 12) ≈ 0.213, and in the unexposed arm ≈ 1 − exp(−0.01 × 12) ≈ 0.113. The risk ratio is 0.213 / 0.113 ≈ 1.88, not 2.0. The HR and the risk ratio agree only for rare events over short windows.",
        "Median event-free time under constant hazard (exponential model): the median is ln(2) divided by the hazard. For the exposed arm this is approximately 34.7 months; for the unexposed arm approximately 69.3 months. The ratio of medians is 34.7 / 69.3 ≈ 0.50. Note that the time ratio (0.50) is the reciprocal of the HR (2.0) under the exponential model — this illustrates how accelerated failure time models report the same comparison on a different scale.",
        "Always pair the HR with an absolute measure. In this study, the RMST difference at 24 months is the area under the survival curve for the unexposed arm minus the exposed arm over 0 to 24 months. With the exposed arm at higher hazard, the unexposed arm accumulates more event-free time — reported as 'patients on standard care had on average X more event-free months over 2 years.' The HR of 2.0 alone tells a clinician nothing about how many months of life or event-free time are at stake."
      ],
      "result": "Exposed hazard = 40 / 2000 = 0.02 per person-month. Unexposed hazard = 20 / 2000 = 0.01 per person-month. HR = 0.02 / 0.01 = 2.0. Interpretation: the exposed group has events at twice the instantaneous rate of the unexposed group at each moment among survivors. This is not a cumulative risk ratio (which would be approximately 1.88 over 12 months under constant hazard), not a time ratio (which would be 0.50 = 1/HR under exponential hazard), and not a statement about total event counts. Always pair with an absolute measure (RMST difference, landmark risk difference, or median survival difference) to convey the magnitude of benefit in clinically and economically actionable units.",
      "timeline_spec": {
        "title": "Person-time and hazard by study arm: constant-hazard HR illustration",
        "window": {
          "start_day": 0,
          "end_day": 720,
          "label": "Months of follow-up (population view: 2000 person-months per arm)"
        },
        "events": [
          {
            "label": "Exposed arm: 40 events in 2000 person-months (hazard = 0.02/month)",
            "start_day": 0,
            "length_days": 720,
            "quantity": "hazard = 0.02 per month"
          },
          {
            "label": "Unexposed arm: 20 events in 2000 person-months (hazard = 0.01/month)",
            "start_day": 0,
            "length_days": 720,
            "quantity": "hazard = 0.01 per month"
          }
        ],
        "spans": [
          {
            "kind": "exposed",
            "start_day": 0,
            "end_day": 720,
            "label": "2000 person-months of follow-up, exposed arm"
          },
          {
            "kind": "unexposed",
            "start_day": 0,
            "end_day": 720,
            "label": "2000 person-months of follow-up, unexposed arm"
          }
        ],
        "result": {
          "label": "HR = 0.02 / 0.01 = 2.0 (exposed events at twice the instantaneous rate)",
          "value": 2.0
        }
      }
    },
    "prerequisites": [
      "censoring-mechanisms-rwe",
      "cox-ph-regression",
      "poisson-distribution"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Period-specific (time-segmented) hazard ratios",
        "description": "When PH is violated and the HR changes over follow-up — common in immuno-oncology with delayed treatment effect, or in vaccine trials with waning immunity — the single averaged HR is misleading. Reporting HRs within pre-specified time segments (e.g., months 0-6, 6-18, 18+) is more informative than a single pooled estimate. In Cox this is modeled via a treatment-by-log-time interaction or a piecewise-constant baseline hazard. The RMST difference over the full horizon should accompany the segmented HRs to provide the absolute summary.",
        "edge_cases": [
          "Pre-specify the time cutpoints; post-hoc selection of segments that maximize apparent benefit is data dredging.",
          "Each segment requires an adequate number of events; thin late-time segments have wide CIs."
        ],
        "data_source_notes": "Claims and registry: use counting-process intervals with time-updated segment indicators. EHR: visit-driven capture can thin the late-period risk set; report number at risk at each segment boundary."
      },
      {
        "name": "Marginal (population-averaged) hazard ratio from IPTW-weighted Cox",
        "description": "The standard Cox HR is conditional on covariate values in the model; a propensity-score weighted (IPTW) Cox model targets a marginal HR that is closer to the population-averaged effect. Under noncollapsibility, the marginal HR is typically closer to the null than the conditional HR for the same data, though the exact difference depends on the hazard structure and covariate distribution. Reporting which estimand is targeted (conditional vs. marginal) is essential when comparing across studies or meta-analyzing.",
        "edge_cases": [
          "Extreme weights must be stabilized or truncated; report the effective sample size and weight distribution diagnostics alongside the marginal HR.",
          "Even the marginal HR remains subject to Hernán's depletion-of-susceptibles argument under non-PH; pair with the marginal RMST difference."
        ],
        "data_source_notes": "Claims: high-dimensional propensity score from lookback-window proxies, stabilized IPTW, then robust variance in the weighted Cox model. Combine with IPCW when as-treated censoring is informative."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "restricted-mean-survival-time-rmst",
        "pros_of_this": "The HR is compact, familiar to regulatory reviewers and clinical journals, efficient under PH, and directly produced by the Cox partial likelihood with covariate adjustment. A single number conveys the relative effect across the entire follow-up period.",
        "cons_of_this": "Uninterpretable as a single summary under non-PH; depends on censoring distribution and study duration; does not convey absolute magnitude of benefit; noncollapsible. RMST is assumption-light, absolute, and directly comparable across studies with the same horizon.",
        "when_to_prefer": "Use the HR when PH approximately holds and a relative summary on the hazard scale is required by the audience; pre-specify the RMST difference alongside it. Switch to RMST as the primary estimand when hazards cross, converge, or there is a strong a priori expectation of delayed treatment effect."
      },
      {
        "compared_to": "cox-ph-regression",
        "pros_of_this": "This concept clarifies what the HR output of Cox regression means and does not mean, including its interpretation traps, noncollapsibility, and alternatives. It explicitly decouples the effect measure from the fitting procedure.",
        "cons_of_this": "cox-ph-regression covers the full fitting workflow, PH checking, data preparation, and all operational decisions (time zero, censoring, competing risks) that must precede extracting the HR.",
        "when_to_prefer": "Read cox-ph-regression for model-fitting decisions; read this entry to understand what the fitted HR means, its interpretation constraints, and when an absolute companion measure should replace it as the primary effect estimate."
      },
      {
        "compared_to": "accelerated-failure-time-models",
        "pros_of_this": "The HR is more familiar in clinical and regulatory settings; Cox is nonparametric regarding the baseline hazard shape; the HR has a direct relationship to the log-rank test statistic under PH.",
        "cons_of_this": "AFT models produce a time ratio (how much longer/shorter treated patients survive on average), which is often more interpretable to patients and clinicians than a hazard ratio; under the exponential model the time ratio equals 1/HR. For non-Weibull hazard shapes, AFT provides a fully parametric alternative that does not require PH.",
        "when_to_prefer": "Use the HR when the audience expects Cox/HR and PH is defensible; prefer AFT or an explicit time-ratio report when the absolute time gain is the clinically relevant quantity and a parametric hazard shape can be justified."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Extract the HR from the HAZARDRATIO statement in PROC PHREG (SAS) or from exp(coef) in the Cox summary (R/Python). Report number at risk, events per arm, median follow-up, and the Schoenfeld PH test alongside the HR. In elderly claims cohorts, death is a competing risk — always report the cause-specific HR from Cox alongside the subdistribution HR and cumulative incidence function so absolute event risk is not overstated. Pair the HR with the RMST difference at a clinically meaningful horizon (typically 1, 2, or 3 years) derived from the Breslow baseline survival estimate or Kaplan-Meier curves by arm.",
      "ehr": "Same HR extraction workflow. EHR cohorts often have shorter and more variable follow-up due to visit-driven capture; report the number at risk at each landmark and justify the RMST horizon based on the observed maximum follow-up across both arms. If informative censoring is suspected (sicker patients disenroll or are lost to follow-up), apply IPCW and report both the naïve and weighted HR with interpretation of what each estimates.",
      "registry": "Disease registries with adjudicated endpoints provide clean event times. Pair the HR with RMST especially in oncology registries where non-proportional hazards from IO agents are common. Verify that the registry's event ascertainment window matches the Cox censoring definition; mismatches can induce pseudo-informative censoring.",
      "primary": "RCT and primary study HRs are the most interpretable because randomization controls confounding. Still apply the Hernán depletion-of-susceptibles critique when follow-up is long or PH is questionable. Report the RMST difference as the primary absolute estimand in submissions to NICE and other HTA bodies that require absolute measures.",
      "linked": "Linked claims-EHR-registry datasets allow rich covariate adjustment. The conditional HR from a fully adjusted Cox model and the marginal HR from IPTW-weighted Cox are both valid and should be reported together with an explicit statement of which estimand each targets. Pair with RMST and landmark risk differences for absolute actionability."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\nimport numpy as np\nfrom lifelines import CoxPHFitter, KaplanMeierFitter\n\n# cohort: analysis-ready DataFrame (see cox-ph-regression for prep steps).\n# arm=1 is treated, arm=0 is control; all covariates are baseline-only.\ncohort = pd.read_parquet(\"cohort.parquet\")\n\n# ── Step 1: Fit Cox and extract the hazard ratio with 95% CI ──\ncovariates = [\"arm\", \"age\", \"sex\", \"prior_event\", \"comorbidities\"]\ncph = CoxPHFitter()\ncph.fit(cohort[[\"time_to_event\", \"event\"] + covariates],\n        duration_col=\"time_to_event\", event_col=\"event\", robust=True)\n\narm_row = cph.summary.loc[\"arm\"]\nhr     = arm_row[\"exp(coef)\"]\nhr_lo  = arm_row[\"exp(coef) lower 95%\"]\nhr_hi  = arm_row[\"exp(coef) upper 95%\"]\np_val  = arm_row[\"p\"]\n\nprint(f\"HR = {hr:.3f}  (95% CI {hr_lo:.3f}–{hr_hi:.3f},  p = {p_val:.4f})\")\ndirection = \"lower\" if hr < 1 else \"higher\"\nprint(f\"Formal: at any instant, the event rate in the treated arm is \"\n      f\"{abs(1 - hr) * 100:.1f}% {direction} among those still event-free.\")\nprint(\"CAUTION: HR is not a risk ratio, not a time ratio, not '% fewer events overall.'\")\n\n# ── Step 2: PH check — if significant, the single HR is a weighted average ──\ncph.check_assumptions(cohort[[\"time_to_event\", \"event\"] + covariates],\n                      p_value_threshold=0.05, show_plots=False)\n# Small p for 'arm' => PH violated; report RMST as primary and HR as secondary.\n\n# ── Step 3: RMST companion (absolute measure; no PH assumption required) ──\nt_star = 24  # pre-specified horizon in same time units as time_to_event\n\ndef rmst_km(times, events, horizon):\n    \"\"\"Area under the KM survival curve from 0 to horizon (trapezoidal rule).\"\"\"\n    kmf = KaplanMeierFitter().fit(times, events)\n    sf  = kmf.survival_function_.reset_index()\n    sf.columns = [\"t\", \"s\"]\n    sf  = sf[sf[\"t\"] <= horizon].copy()\n    if sf[\"t\"].max() < horizon:\n        last_s = 
sf[\"s\"].iloc[-1]\n        sf = pd.concat([sf, pd.DataFrame({\"t\": [horizon], \"s\": [last_s]})],\n                       ignore_index=True)\n    return float(np.trapz(sf[\"s\"], sf[\"t\"]))\n\nrmst_trt  = rmst_km(cohort.loc[cohort.arm == 1, \"time_to_event\"],\n                     cohort.loc[cohort.arm == 1, \"event\"], t_star)\nrmst_ctrl = rmst_km(cohort.loc[cohort.arm == 0, \"time_to_event\"],\n                     cohort.loc[cohort.arm == 0, \"event\"], t_star)\nrmst_diff = rmst_trt - rmst_ctrl\nprint(f\"\\nRMST difference at {t_star} months: {rmst_diff:.2f} months event-free gained\")\nprint(\"(Positive = treated arm had more event-free time; no PH assumption needed.)\")",
        "description": "Extract the hazard ratio and 95% CI from a fitted lifelines CoxPHFitter model, then\ncompute the RMST companion measure using the Kaplan-Meier survival curves for each arm.\nThis code focuses on correctly reading the HR out of the model summary and pairing it\nwith an absolute measure — fitting details are in cox-ph-regression. Assumes cohort is\nan analysis-ready DataFrame with columns: time_to_event (numeric, >0), event (0/1),\narm (0/1 treatment indicator), and baseline covariates.",
        "dependencies": [
          "lifelines",
          "pandas",
          "numpy"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(survival)\nlibrary(survRM2)  # install.packages(\"survRM2\") for Uno et al. RMST method\n\n# cohort: analysis-ready data.frame (see cox-ph-regression for prep).\ncohort <- readRDS(\"cohort.rds\")\ncohort$arm <- relevel(factor(cohort$arm), ref = \"0\")\n\n# ── Step 1: Fit Cox and extract the HR with 95% CI for the treatment variable ──\nfit <- coxph(\n  Surv(time_to_event, event) ~ arm + age + sex + prior_event + comorbidities,\n  data   = cohort,\n  ties   = \"efron\",\n  robust = TRUE\n)\nci_tab <- summary(fit)$conf.int\nhr     <- ci_tab[\"arm1\", \"exp(coef)\"]\nhr_lo  <- ci_tab[\"arm1\", \"lower .95\"]\nhr_hi  <- ci_tab[\"arm1\", \"upper .95\"]\np_arm  <- summary(fit)$coefficients[\"arm1\", \"Pr(>|z|)\"]\ncat(sprintf(\"HR = %.3f  (95%% CI %.3f–%.3f,  p = %.4f)\\n\", hr, hr_lo, hr_hi, p_arm))\ncat(\"Formal: at any instant, treated patients have events at\",\n    round(abs(1 - hr) * 100, 1),\n    ifelse(hr < 1, \"% lower\", \"% higher\"), \"rate among those still event-free.\\n\")\ncat(\"CAUTION: HR is not a risk ratio, not a time ratio, not '% fewer events overall.'\\n\")\n\n# ── Step 2: PH check via weighted Schoenfeld residuals (Grambsch & Therneau 1994) ──\nzph <- cox.zph(fit)\nprint(zph)  # small p for arm => PH violated; treat HR as a weighted average summary\n\n# ── Step 3: RMST companion via survRM2 (RMST difference + 95% CI, no PH needed) ──\nt_star <- 24  # pre-specified horizon in same time units as time_to_event\nrmst_res <- rmst2(\n  time  = cohort$time_to_event,\n  status = cohort$event,\n  arm   = as.integer(as.character(cohort$arm)),  # must be 0/1 integer\n  tau   = t_star,\n  alpha = 0.05\n)\nprint(rmst_res)\n# rmst_res$RMST.arm1.rmst - rmst_res$RMST.arm0.rmst gives the RMST difference\n# Positive = treated arm accumulated more event-free time up to t_star",
        "description": "Extract the hazard ratio from a coxph model summary using exp(coef), then compute the\nRMST companion via the survRM2 package (Uno et al. 2014 method, which provides the RMST\ndifference with 95% CI). Focuses on reading and reporting the HR correctly, not on\nmodel-fitting workflow. Assumes cohort is a data.frame with time_to_event, event, arm,\nand baseline covariates.",
        "dependencies": [
          "survival",
          "survRM2"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* ── Step 1: Cox model — extract HR via HAZARDRATIO statement (preferred over exp(beta)) ── */\nproc phreg data=work.cohort;\n  class arm (ref='0') sex (ref='0');\n  model time_to_event*event(0) = arm age sex prior_event comorbidities\n        / ties=efron rl;           /* RL: confidence limits on the hazard ratio       */\n  hazardratio 'Treatment effect' arm / diff=ref cl=wald;\n    /* HAZARDRATIO produces: Hazard Ratio, Lower 95% CI, Upper 95% CI                */\n    /* Formal: at any instant, the ratio of event rates in arm=1 vs arm=0 among those */\n    /*         still event-free. Not a risk ratio. Not a time ratio.                  */\n  assess ph / resample seed=42;   /* Schoenfeld PH test; if arm p<0.05 => non-PH    */\nrun;\n\n/* ── Step 2: RMST companion via PROC RMSTREG (SAS/STAT 14.1+) ── */\n/* tau: pre-specified horizon in same units as time_to_event                         */\n%let tau = 24;  /* months */\nproc rmstreg data=work.cohort tau=&tau.;\n  model time_to_event*event(0) = arm age sex prior_event comorbidities;\n  /* The arm coefficient = RMST difference (treated minus control) up to tau.        */\n  /* Positive coefficient = treated arm has more event-free time.                    */\n  /* No PH assumption is required; this is an absolute, horizon-specific estimand.   */\nrun;\n\n/* ── Step 3: KM curves and RMST by arm via PROC LIFETEST (if PROC RMSTREG unavailable) ── */\nproc lifetest data=work.cohort\n              plots=survival(cl)    /* KM curves with 95% CI bands                  */\n              rmst(tau=&tau.);      /* per-arm RMST; difference requires manual step */\n  time time_to_event*event(0);\n  strata arm;\nrun;\n/* To get the RMST difference and CI from PROC LIFETEST output:                     */\n/* subtract the per-arm RMST estimates and use the delta method or bootstrap CI.     */",
        "description": "Extract the hazard ratio using PROC PHREG with the HAZARDRATIO statement (explicit and\nunambiguous), then compute RMST via PROC RMSTREG (SAS/STAT 14.1+) or PROC LIFETEST with\nthe RMST option. Focuses on correctly reporting the HR and its companion absolute measure.\nDataset work.cohort has one row per patient with: time_to_event (>0), event (1=event,\n0=censored), arm (0/1), and baseline covariates.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Q[You have a Cox HR — now what?] --> PH{PH holds?<br/>Schoenfeld test<br/>non-significant,<br/>KM curves parallel}\n  PH -->|Yes| PAIR[Report HR + absolute companion<br/>RMST diff, landmark risk diff,<br/>or median survival diff]\n  PH -->|No — hazards cross/converge| SEG[Report time-segmented HRs<br/>by pre-specified period<br/>AND RMST as primary estimand]\n  PAIR --> AUD{Audience?}\n  AUD -->|Regulatory reviewer / journal| HR_LED[\"Lead with HR in results table;<br/>absolute companion in same sentence\"]\n  AUD -->|HTA body / payer| ABS_LED[\"Lead with absolute measure<br/>HR is secondary; absolute drives<br/>cost-effectiveness calculation\"]\n  AUD -->|Patients / clinicians| ABS_ONLY[\"Translate to absolute benefit:<br/>'X fewer events per 100 patients<br/>over 2 years' or 'Y extra event-free<br/>months on average'\"]\n  SEG --> RMST_PRI[\"RMST difference is the primary number;<br/>time-segmented HRs explain the pattern\"]",
        "caption": "Decision tree for reporting the hazard ratio. The key fork is whether PH holds; the key routing after that is audience — regulatory bodies need both HR and absolute, HTA bodies lead with absolute, and patient communication requires translation to event-count or event-free time language.",
        "alt_text": "Flowchart starting from \"You have a Cox HR — now what?\" branching on whether proportional hazards holds. If yes, route to reporting HR plus an absolute companion, then further branch by audience (regulatory, HTA/payer, patients). If no (crossing/converging hazards), route to time-segmented HRs with RMST as the primary estimand.",
        "source_type": "illustrative",
        "source_citations": [
          "hernan-2010",
          "stensrud-hernan-2020"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "requires",
        "target_slug": "cox-ph-regression",
        "notes": "Cox proportional hazards regression is the model that produces the hazard ratio; this concept owns interpretation of the output, while cox-ph-regression owns fitting, PH checking, and operational data preparation. Read cox-ph-regression first to understand how the HR is estimated before reading this entry on what it means and does not mean."
      },
      {
        "relation_type": "requires",
        "target_slug": "censoring-mechanisms-rwe",
        "notes": "The HR's meaning changes when censoring is informative — sicker patients leaving the study biases the apparent hazard in the arm where they disenroll. Understanding the types of censoring (administrative, loss to follow-up, competing events) is prerequisite to correctly interpreting any reported hazard ratio."
      },
      {
        "relation_type": "see_also",
        "target_slug": "restricted-mean-survival-time-rmst",
        "notes": "The RMST difference is the primary assumption-light absolute companion to the HR. Under non-proportional hazards — where the single HR is a weighted average that depends on censoring — the RMST difference is the recommended primary estimand. Always report one alongside the other."
      },
      {
        "relation_type": "see_also",
        "target_slug": "accelerated-failure-time-models",
        "notes": "AFT models report a time ratio (how much longer/shorter treated patients survive) rather than a hazard ratio. Under the exponential model, the time ratio equals 1/HR. AFT is preferred when the absolute time gain is the clinically relevant quantity and a parametric hazard shape can be specified or justified from prior data."
      },
      {
        "relation_type": "see_also",
        "target_slug": "immortal-time-bias-handling",
        "notes": "Misaligned time zero inflates exposed person-time and biases the HR toward apparent benefit; this is one of the most common interpretation errors in observational survival analyses. Fix the design (new-user, time zero at index event) before interpreting the HR."
      }
    ],
    "aliases": [
      "HR",
      "hazard ratio",
      "instantaneous rate ratio",
      "relative hazard",
      "log hazard ratio"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "hcpcs-level-ii-j-codes",
    "name": "HCPCS Level II Codes and J-Codes",
    "short_definition": "A federal coding system maintained by CMS that assigns a five-character alphanumeric code (one letter + four digits) to every non-physician service, supply, and drug billed outside of a physician visit, with the J-code family (J + 4 digits) specifically covering drugs administered by a provider — the essential data element for identifying medical-benefit (buy-and-bill) drug exposure in claims.",
    "long_description": "**HCPCS Level II** (Healthcare Common Procedure Coding System Level II) is the federal alphanumeric\ncoding standard maintained by the Centers for Medicare and Medicaid Services (CMS) for services,\nsupplies, and drugs that are not covered by CPT (Level I). Every code is exactly five characters:\none letter (A–V) followed by four digits. CMS updates the system quarterly, not annually like CPT,\nwhich means codes for newly approved drugs can appear within months of FDA clearance — but also\nmeans code lists used in analyses go stale and must be refreshed against the quarterly CMS release\nfiles at each study update.\n\n**Format, families, and scope.** The letter determines the broad service category. Key families for\nRWE and HEOR work include:\n\n- **J codes (J0001–J9999):** Drugs administered other than by the oral route — the family used for\n  nearly all provider-administered (infused, injected, or instilled) drugs. The J9000–J9999 sub-range\n  historically covers antineoplastic agents specifically. Example public codes from CMS payment files:\n  J9271 (pembrolizumab, per 1 mg), J9305 (pemetrexed, per 10 mg), J9000 (doxorubicin, per 10 mg).\n- **Q codes (Q0000–Q9999):** CMS temporary codes for services and drugs awaiting permanent code\n  assignment; oncology biologics newly approved by the FDA frequently launch here before receiving a\n  permanent J-code. 
Q codes behave identically to J-codes in claims data — service date, billing units,\n  NDC field — but are more likely to be combined with J3490 (Not Otherwise Classified, parenteral drug)\n  or J3590 (unclassified biological) during the pre-permanent-code window.\n- **C codes (C1000–C9999):** Hospital outpatient prospective payment system (OPPS) codes; newly\n  approved drugs in the hospital outpatient setting are assigned C-codes until they receive a permanent\n  J or Q.\n- **A codes (A0000–A9999):** Ambulance, medical and surgical supplies, administrative, miscellaneous.\n- **B codes (B4000–B9999):** Enteral and parenteral nutrition supplies.\n- **E codes (E0100–E9999):** Durable medical equipment (DME) — wheelchairs, walkers, oxygen.\n- **G codes (G0000–G9999):** CMS temporary codes for procedures and quality measures not in CPT, used\n  by CMS demonstration projects, quality programs (MIPS), and some cancer screening services.\n\n**Core conceptual distinction — the medical-benefit / pharmacy-benefit split.** This is the single\nmost consequential distinction in RWE drug-exposure ascertainment, and HCPCS Level II is the key data\nelement on the medical-benefit side.\n\nProvider-administered drugs — those infused or injected in a physician office, hospital outpatient\ndepartment (HOPD), or infusion center — are purchased by the provider, administered to the patient,\nand billed to the payer under the medical benefit. In Medicare, this is Part B. In commercial insurance,\nit is the medical benefit file. 
These administrations appear as HCPCS J- or Q-code lines on the\nCMS-1500 (professional) or UB-04 (institutional) claim, with a service date, a billing-unit count,\nand (required for Medicaid rebate purposes since the Deficit Reduction Act of 2005 and extended by later payer policy) an\nNDC field that should identify the specific product administered.\n\nSelf-administered drugs — pills, self-injected biologics, or oral oncologics dispensed at retail or\nspecialty pharmacies — are billed under the pharmacy benefit (Part D for Medicare, or a separate\npharmacy benefit for commercial) using an 11-digit National Drug Code (NDC), a fill date, and a\ndays_supply. They do not appear on medical claims and have no HCPCS code in most contexts.\n\nThe practical consequence: **an RWE study that queries only pharmacy claims will miss 100% of\nIV chemotherapy, most infused immunotherapy (e.g., pembrolizumab, nivolumab, atezolizumab), all\nIV biologics for RA (tocilizumab, infliximab), IV bone agents, and any other provider-administered\ndrug.** Conversely, a study that queries only medical claims for oncology drug exposure will miss\noral targeted therapies (erlotinib, osimertinib, ibrutinib, capecitabine) dispensed as Part D fills.\nComplete exposure ascertainment for oncology, rheumatology, neurology (natalizumab), and most\nbiologic-heavy therapeutic areas requires querying both files.\n\n**Billing units and the dose computation — the most common error in HCPCS pharmacoepi.**\nEach HCPCS code has a written descriptor that specifies the unit of measure for one billing unit.\nThe billing unit is often not a single milligram and is almost never one vial. 
Examples from public CMS\nAnnual Physician Fee Schedule files:\n\n- J9271 (pembrolizumab): \"injection, 1 mg\" → 1 unit = 1 mg\n- J9305 (pemetrexed): \"injection, per 10 mg\" → 1 unit = 10 mg\n- J0800 (corticotropin): \"up to 40 USP units\" → 1 unit = 40 USP units\n- J1745 (infliximab): \"10 mg\" → 1 unit = 10 mg\n\nThe claim shows `units_billed` (the HCPCS units field). To recover the administered dose:\n\n`administered_dose_mg = units_billed × descriptor_mg_per_unit`\n\nThis sounds simple but has several failure modes that are pervasive in the literature:\n\n1. **Analysts conflate billing units with milligrams.** For J9305 (pemetrexed, per 10 mg), 50 billed\n   units means 500 mg, not 50 mg. The descriptor must be consulted, not assumed.\n2. **The descriptor quantity changes between code revisions.** A code retired and replaced (common\n   in the quarterly update cycle) may have a different per-unit amount than its predecessor, breaking\n   cross-year dose comparisons.\n3. **Providers sometimes bill whole vials rather than exact dose.** Because vial sizes are\n   standardized and dose is weight-based, the billed units can reflect vial wastage accounting, not\n   the true per-patient dose. This introduces heteroskedastic measurement error in dose across\n   practice settings.\n4. 
**Units on the same service date summing to more than one standard dose** can represent split\n   billing, combination therapy, or data error — a pre-analysis quality check is essential.\n\n**The NOC-code under-ascertainment window for newly launched drugs.** When a new drug receives FDA\napproval, it typically lacks a permanent HCPCS code for weeks to months (sometimes 6–18 months).\nDuring this period, providers bill using a \"not otherwise classified\" (NOC) code:\n\n- J3490 — Unclassified drugs\n- J3590 — Unclassified biologics\n- C9399 — Unclassified drugs or biologics, hospital outpatient (OPPS/C-APC setting)\n\nUnder NOC billing, the specific drug identity is conveyed only in the NDC field on the claim line —\na field that is often missing, incorrectly formatted, or populated inconsistently across payers and\npractice settings. As a result, **NOC-period administrations are routinely missed or misclassified\nin RWE studies that identify drugs solely by their permanent HCPCS code.** For a drug with a 12-month\nNOC window (not unusual for new immuno-oncology agents), the entire first year of real-world use is\nunder-counted, which can bias adoption curves, first-line utilization rates, and even comparative\nsafety/effectiveness analyses that use first-to-market date as an anchoring variable. The analyst\nmust (a) include J3490, J3590, and C9399 in all drug-identification queries, (b) parse the NDC field\non NOC-coded lines, and (c) document and quantify the NOC window as a study limitation.\n\n**CMS ASP NDC-to-HCPCS crosswalk.** CMS publishes a quarterly Average Sales Price (ASP) Drug Pricing\nFile that maps NDCs to the HCPCS code under which the drug is reimbursed. This crosswalk is the\nauthoritative forward-link from NDC → HCPCS and is the mechanism by which OMOP Drug domain ingestion\nof provider-administered drugs can be harmonized. 
Analysts building NDC-to-HCPCS crosswalks for their\nown cohort work should start from the CMS ASP file, then supplement with the CanMED-HCPCS (the NCI\ntool that lists all HCPCS codes for oncology medications by therapeutic category, validated against\nCMS HCPCS Indices 2012–2018 and commercially available drug databases). The crosswalk runs in both\ndirections: the NDC field on a J3490/C9399 line identifies the product; the J-code on a medical claim\nidentifies the product when the NDC is missing.\n\n**No days_supply — the exposure-duration modeling challenge.** Because provider-administered drugs\nhave no `days_supply` field (the drug was consumed at the point of service), analysts cannot use the\nstandard pharmacy-claims approach of forward-filling days of coverage from each fill. Instead, exposure\nduration must be modeled from the administration cadence — typically the regimen-specific dosing\ninterval (e.g., every 3 weeks for pembrolizumab, every 4 weeks for denosumab) applied between claim\ndates. This requires clinical knowledge of the dosing schedule, introduces assumption-dependence that\nmust be pre-specified in the SAP, and fails for off-schedule administrations (dose delays, early\ndiscontinuation between claims). Approaches include (a) fixed-window persistence (did the next\nadministration occur within a pre-specified grace period?), (b) regimen-specific cycle modeling, and\n(c) linkage to prescription orders in an EHR to confirm intended versus actual cadence.\n\n**Revenue center code pairing.** On UB-04 institutional claims, the drug J-code appears on a line\nalongside revenue center code 0636 (pharmacy, drugs requiring detailed coding) or occasionally 0250 (pharmacy general).\nThis pairing allows researchers to distinguish the drug charge from the administration charge\n(revenue center 0335 for chemotherapy administration, 0260 for IV therapy). On professional/CMS-1500\nclaims (carrier file), the J-code appears directly without a revenue center code. 
Analysts must handle\nboth claim types.\n\n**OMOP Drug domain integration.** The OMOP CDM Drug domain ingests HCPCS codes through the Drug\nvocabulary, mapping each J-code to a standard concept in the RxNorm or SNOMED hierarchy via the\nOMOP vocabulary's NDC-to-HCPCS crosswalk and a manually curated HCPCS-to-drug mapping. This\nallows HCPCS-identified drug exposures to be harmonized with NDC-identified pharmacy-fill exposures\ninto a single `drug_exposure` table. However, mapping fidelity varies: NOC codes map to \"drug\nunspecified\" and require NDC-level disambiguation, and newly approved drugs may lack a OMOP concept\nuntil the next quarterly vocabulary release.\n\n**Pros, cons, and trade-offs — specific and comparative.**\n\n- **vs NDC-based pharmacy claims for drug identification:** HCPCS/J-codes capture what NDC pharmacy\n  claims cannot — every provider-administered drug. For infused oncologics and biologics, J-codes are\n  the only reliable source. Cost: no days_supply, no dispensing details, potential billing-unit\n  interpretation error, and the NOC window. **Prefer J-codes** for all provider-administered drugs;\n  **prefer NDCs** for oral and self-administered drugs; require both for complete ascertainment.\n- **vs EHR medication administration records (MARs):** MARs contain the actual dose infused, the\n  infusion date, and the nurse-charted start/stop time — far more precise for dose than billing units.\n  Cost: MARs are system-specific (usually a single health system), may not capture medications given\n  elsewhere, and cannot provide population-level denominators. **Prefer J-code claims** for\n  population-representative pharmacoepi; **prefer MARs linked to J-codes** when precise dose and\n  duration of infusion are needed.\n- **vs CMS-1500 specialty billing (CPT procedure codes for drug administration):** CPT\n  administration codes (96413 for chemotherapy infusion, 96415 for each additional hour, etc.) 
appear\n  alongside J-codes on the same claim but identify the administration service, not the drug. The CPT\n  code cannot identify *which* drug was given — only that a drug was administered. **The J-code is\n  required for drug identity; CPT is required for administration setting and complexity.**\n- **vs revenue center 0636 alone:** Revenue center 0636 on a UB-04 indicates that a drug requiring\n  detailed coding was billed but does not identify the drug. J-codes on the same revenue line provide the drug\n  identity. **Do not use revenue center alone as a drug identifier;** require the J-code.\n\n**When to use.** Use HCPCS Level II J/Q-codes whenever you are (a) identifying provider-administered\ndrug exposure in medical claims (chemotherapy, infused biologics, IV antibiotics, IV bone agents,\nIV immunotherapy, infused enzyme replacement); (b) computing administered dose from billed units\nmultiplied by the descriptor amount; (c) ascertaining exposure completeness by combining J-codes for\npermanent codes with J3490/J3590/C9399 + NDC for the NOC window; (d) building an NDC-HCPCS\ncrosswalk for harmonized exposure ascertainment; (e) validating the medical-benefit arm of a combined\nmedical + pharmacy claims drug utilization study; or (f) constructing OMOP drug_exposure records\nfrom Part B medical claims. HCPCS J-codes are the default exposure-identification primitive for any\nRWE study involving infused oncologics, biologics, or provider-administered specialty drugs.\n\n**When NOT to use — and when HCPCS-based ascertainment is actively misleading or dangerous.**\n\n- **As the sole drug identifier for therapeutic areas with both oral and IV agents.** A study of\n  \"pembrolizumab use\" that queries only J9271 is complete only from January 2016 onward (when J9271 took\n  effect; earlier administrations were billed under temporary or NOC codes). Studies of combination regimens that include an oral agent (erlotinib, capecitabine,\n  ibrutinib) will systematically miss the oral arm unless Part D pharmacy claims are added. 
**This is\n  the most dangerous failure mode in oncology pharmacoepi.**\n- **When the payer does not submit medical claims at the line level.** Medicare Advantage (MA)\n  enrollees' medical claims are encounter-based rather than fully adjudicated fee-for-service. J-code\n  granularity is frequently absent or unreliable in MA encounter data. Restricting to FFS-observable\n  enrollment (Part A/B/D or commercial medical + pharmacy benefit) before building any J-code cohort\n  is mandatory; MA-only person-time should be excluded.\n- **When billing-unit counts are used as milligram doses without consulting the descriptor.** The\n  resulting dose will be wrong by the factor of the descriptor amount (e.g., 10× error for pemetrexed\n  if units are treated as milligrams). This error propagates silently into dose-response analyses.\n- **When the NOC window is ignored for newly approved drugs.** A study that starts its observation\n  window in the year a drug received permanent J-code assignment will miss all administrations billed\n  under J3490/J3590 in the prior NOC period. For fast-adopting drugs, this can mean missing the\n  majority of real-world first-year use.\n- **When J-codes are used without pairing with the CPT administration code or place of service** to\n  distinguish an infusion given at a physician office (CPT 96413 + J-code, place of service 11) from\n  one given at a hospital outpatient department (same J-code, revenue center 0636, place of service\n  22). Site of service matters for cost, access, and safety analyses.\n\n**Data-source operational depth.**\n\n- **Medicare FFS (Part B carrier and outpatient files):** The primary home of J-codes. Carrier file\n  contains professional claims (CMS-1500 equivalent); outpatient file contains institutional claims\n  (UB-04 equivalent) from hospital outpatient departments. Both contain the `hcfa_mtus_cnt` (units)\n  and the procedure code (J-code). 
The NDC field is available on the carrier file (as the `lne_ndc_cd`\n  on the line item) and on the outpatient revenue center line. MA enrollees lack these adjudicated\n  claim details — **exclude MA-only person-time** or supplement with MA encounter-data J-codes after\n  validating completeness. For complete oncology drug capture, also query the DME file for any oral\n  drugs billed under HCPCS (unusual but not impossible for clinical trial patients).\n- **Commercial claims (MarketScan, Optum, IQVIA):** J-codes appear in the medical (outpatient and\n  inpatient professional) claims tables. Column names vary by vendor; units may appear as `quantity`,\n  `units`, or `submit_charge_units`. Verify units interpretation with vendor documentation — some\n  vendors normalize to \"quantity dispensed\" rather than \"HCPCS billing units,\" which can differ.\n  NDC availability on medical claim lines varies by vendor and payer; it is more consistent in\n  MarketScan than in some Optum products.\n- **EHR (Epic, Cerner, Allscripts):** EHR medication administration records capture the exact dose\n  infused, the NDC, and the infusion date — more accurate for dose than billing units. However, EHR\n  does not capture administrations at other facilities. For multi-site analyses, link EHR\n  administration records to medical claims to verify J-code billing completeness, especially for\n  340B-discounted drugs where billing may differ from acquisition pricing.\n- **OMOP CDM:** HCPCS J-codes are mapped into the Drug domain via the OMOP vocabulary's HCPCS\n  standard concepts. Query `drug_exposure` for both `drug_source_concept_id` (the raw J-code concept)\n  and the standard `drug_concept_id` (the harmonized RxNorm/SNOMED mapping). NOC codes (J3490,\n  J3590) will map to a generic unclassified drug concept; supplement by querying `drug_source_value`\n  for the NDC string associated with those lines.",
    "primary_category": "Data_Standard",
    "tags": [
      "coding-system",
      "data-standard",
      "primitive",
      "drugs",
      "procedures",
      "claims",
      "medical-benefit",
      "provider-administered",
      "oncology",
      "biologics",
      "buy-and-bill",
      "J-code",
      "HCPCS",
      "exposure-definition"
    ],
    "applies_to_study_types": [
      "claims_analysis",
      "cohort_retrospective",
      "drug_utilization",
      "comparative_effectiveness",
      "cost_analysis",
      "target_trial_emulation"
    ],
    "data_sources": [
      "claims",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1093/jncimonographs/lgz034",
        "url": "https://doi.org/10.1093/jncimonographs/lgz034",
        "citation_text": "Rivera DR, Lam CJK, Enewold L, Petkov VI, Tran Q, Brennan S, Dickie L, McNeel TS, Noone AM, Ohm B, White DP, Warren JL, Mariotto AB, Penberthy L. Development and Utility of the Observational Research in Oncology Toolbox: Cancer Medications Enquiry Database-Healthcare Common Procedure Coding System (HCPCS). JNCI Monographs. 2020;2020(55):39-45.",
        "year": 2020,
        "authors_short": "Rivera et al.",
        "notes": "NCI-developed CanMED-HCPCS resource; provides the definitive evidence-based method for systematically identifying provider-administered oncology drugs in medical claims using HCPCS codes, validated against CMS HCPCS Indices 2012-2018. Demonstrates the approach in SEER-Medicare colorectal cancer data; the canonical introduce citation for HCPCS-based oncology drug ascertainment."
      },
      {
        "role": "explain",
        "doi": "10.1002/pds.4934",
        "url": "https://doi.org/10.1002/pds.4934",
        "citation_text": "Zhang J, Haynes K, Mendelsohn AB, Marshall J, Barr CE, McDermott C, Brown J, Kline A, Kenney J, King KJ, Holmes C, Yeung K, Barron J, Yun H, Lockhart CM. Capture of biologic and biosimilar dispensings in a consortium of U.S.-based claims databases: Utilization of national drug codes and Healthcare Common Procedure Coding System modifiers in medical claims. Pharmacoepidemiology and Drug Safety. 2020;29(7):778-785.",
        "year": 2020,
        "authors_short": "Zhang et al.",
        "notes": "Describes the use of NDCs and HCPCS modifiers in medical claims across multiple payer databases to capture biologic and biosimilar administrations. Directly addresses the medical-benefit vs pharmacy-benefit split, NDC field usage, and HCPCS code identification across claims databases. Essential methodological reference for multi-database HCPCS ascertainment."
      },
      {
        "role": "use",
        "doi": "",
        "url": "https://www.cms.gov/medicare/coding-billing/healthcare-common-procedure-system",
        "citation_text": "Centers for Medicare and Medicaid Services. Healthcare Common Procedure Coding System (HCPCS). CMS.gov. Accessed 2024.",
        "year": 2024,
        "authors_short": "CMS",
        "notes": "Official CMS HCPCS landing page — the authoritative source for HCPCS Level II code files, quarterly release schedules, and updates. All HCPCS Level II code lists used in research must be sourced and version-dated from this page."
      },
      {
        "role": "use",
        "doi": "",
        "url": "https://www.cms.gov/medicare/coding-billing/healthcare-common-procedure-system/quarterly-update",
        "citation_text": "Centers for Medicare & Medicaid Services. HCPCS Quarterly Update. CMS.gov. Accessed 2024.",
        "year": 2024,
        "authors_short": "CMS",
        "notes": "CMS quarterly HCPCS update page — source of the quarterly release files that must be used to stay current with code additions, revisions, and deletions throughout the year, including new J-code and Q-code assignments for newly approved drugs."
      }
    ],
    "plain_language_summary": "HCPCS Level II J-codes are the billing codes that hospitals and doctor offices use when they give a patient a drug by injection or infusion during a visit — think IV chemotherapy, infused immunotherapy, or a biologic given in a clinic. Each code is one letter (J, Q, or C) followed by four numbers, and each code's description specifies exactly how many milligrams (or other units) one billed unit represents. Because these codes appear on medical claims rather than pharmacy prescriptions, any study that wants to see all the cancer drugs or biologics a patient received must look at both the medical side (J-codes) and the pharmacy side (prescription fills) — using only one source will miss a large share of real-world drug use.",
    "key_terms": [
      {
        "term": "J-code",
        "definition": "A five-character billing code starting with the letter J, used to report a drug that was given to a patient by a provider (for example, infused IV chemotherapy or an injected biologic) rather than dispensed at a pharmacy for the patient to take at home."
      },
      {
        "term": "billing units",
        "definition": "The number of units a provider reports on a claim for a single drug administration; each unit equals the dose amount written in the code's official description (for example, one unit of J9305 equals 10 mg of pemetrexed, so 50 units means 500 mg was given)."
      },
      {
        "term": "buy-and-bill",
        "definition": "The payment arrangement in which a doctor or clinic purchases a drug from a supplier, administers it to the patient, and then bills the insurer for both the drug cost and the administration service; this is how most infused oncology and biologic drugs are paid for under the medical benefit."
      },
      {
        "term": "NOC code",
        "definition": "A \"not otherwise classified\" billing placeholder (J3490 for drugs, J3590 for biologics) used when a newly approved drug does not yet have its own permanent J-code; during this period the specific drug can only be identified by looking at the NDC (National Drug Code) field on the same claim line, which is often missing or incomplete."
      },
      {
        "term": "medical vs pharmacy benefit",
        "definition": "Medical benefit covers services and drugs administered by a provider during a visit (billed with HCPCS codes on a medical claim); pharmacy benefit covers drugs dispensed at a pharmacy for the patient to take home (billed with NDC codes on a pharmacy claim). Most cancer infusions are medical-benefit; most pills are pharmacy-benefit."
      },
      {
        "term": "descriptor amount",
        "definition": "The official quantity stated in a HCPCS code's written description that defines how much drug one billing unit represents (for example, J9271 says \"per 1 mg,\" so one billing unit equals 1 mg of pembrolizumab)."
      }
    ],
    "worked_example": {
      "scenario": "An analyst is studying real-world pembrolizumab dosing in a commercial insurance database. Pembrolizumab (Keytruda) is an infused immunotherapy — it is given IV in a clinic, billed under the medical benefit as HCPCS J9271, where one billing unit equals 1 mg. The standard approved dose is 200 mg every 3 weeks. The analyst wants to verify that the billed units on a small sample of claim lines translate to the expected doses, and then flag any lines that look anomalous. The table below shows five claim lines from three patients, as they would appear in a medical claims outpatient or carrier file.\n",
      "dataset": {
        "caption": "Five medical claim lines for pembrolizumab (HCPCS J9271, descriptor \"injection, 1 mg\"). The administered_dose_mg column is the target — computed as units_billed × 1 mg.",
        "columns": [
          "claim_id",
          "person_id",
          "service_date",
          "hcpcs_code",
          "units_billed",
          "administered_dose_mg"
        ],
        "rows": [
          [
            "C001",
            3001,
            "2023-03-01",
            "J9271",
            200,
            200
          ],
          [
            "C002",
            3001,
            "2023-03-22",
            "J9271",
            200,
            200
          ],
          [
            "C003",
            3002,
            "2023-04-05",
            "J9271",
            100,
            100
          ],
          [
            "C004",
            3002,
            "2023-04-26",
            "J9271",
            200,
            200
          ],
          [
            "C005",
            3003,
            "2023-05-10",
            "J9271",
            200,
            200
          ]
        ]
      },
      "steps": [
        "Look up the descriptor for J9271 in the CMS HCPCS file: 'pembrolizumab, injection, 1 mg.' This means 1 billing unit = 1 mg. Administered dose (mg) = units_billed × 1 mg.",
        "Patient 3001, claim C001: 200 units × 1 mg = 200 mg. This matches the standard 200 mg flat dose. Patient 3001, claim C002: same calculation, 200 mg, administered 21 days later — consistent with the 3-week cycle.",
        "Patient 3002, claim C003: 100 units × 1 mg = 100 mg. The standard dose is 200 mg, so 100 mg is half the expected amount. Flag for review — possible weight-based dosing (2 mg/kg for a 50 kg patient), a split vial billed on a second line (check for a companion line on the same date), or a data entry error.",
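        "Patient 3002, claim C004: 200 mg, 21 days after C003, consistent with a next cycle at standard dose. Weight-based dosing would normally give a similar dose at every cycle, so the jump from 100 mg to 200 mg makes split billing or an incomplete claim line on C003 more plausible than weight-based dosing. Keep C003 flagged until a companion line or chart data resolves it.",
        "Patient 3003, claim C005: 200 mg, the standard dose. Service date is 10 May 2023; there is no prior claim in the dataset for this patient. J9271 had been a permanent code for years by 2023, so an NOC-window explanation does not apply. Instead, check whether the patient enrolled shortly before this date or had earlier administrations outside the extracted sample before treating this as the first cycle.",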
        "Now contrast with J9305 (pemetrexed, descriptor 'injection, per 10 mg'). If a claim showed 50 units of J9305, the administered dose would be 50 × 10 mg = 500 mg — NOT 50 mg. Treating billing units as milligrams without checking the descriptor would introduce a 10× underestimate of the dose for pemetrexed."
      ],
      "result": "administered_dose_mg = units_billed × descriptor_amount_per_unit. For J9271: 200 units × 1 = 200 mg. For J9305: 50 units × 10 = 500 mg. Billing units are not milligrams; always look up the descriptor before computing dose.",
      "timeline_spec": {
        "title": "Pembrolizumab Q3W administration cycles for three patients: J9271 billing units to dose",
        "window": {
          "start": "2023-03-01",
          "end": "2023-05-31",
          "label": "92-day observation window"
        },
        "events": [
          {
            "label": "Pt 3001 Cycle 1 (200 mg)",
            "start": "2023-03-01",
            "length_days": 1,
            "quantity": "200 units × 1 mg = 200 mg"
          },
          {
            "label": "Pt 3001 Cycle 2 (200 mg)",
            "start": "2023-03-22",
            "length_days": 1,
            "quantity": "200 units × 1 mg = 200 mg"
          },
          {
            "label": "Pt 3002 Cycle 1 (100 mg, flagged for review)",
            "start": "2023-04-05",
            "length_days": 1,
            "quantity": "100 units × 1 mg = 100 mg"
          },
          {
            "label": "Pt 3002 Cycle 2 (200 mg)",
            "start": "2023-04-26",
            "length_days": 1,
            "quantity": "200 units × 1 mg = 200 mg"
          },
          {
            "label": "Pt 3003 Cycle 1 (200 mg)",
            "start": "2023-05-10",
            "length_days": 1,
            "quantity": "200 units × 1 mg = 200 mg"
          }
        ],
        "spans": [
          {
            "kind": "exposed",
            "start": "2023-03-01",
            "end": "2023-03-22",
            "label": "Pt 3001: 21-day Q3W interval (expected)"
          },
          {
            "kind": "exposed",
            "start": "2023-04-05",
            "end": "2023-04-26",
            "label": "Pt 3002: 21-day Q3W interval (expected)"
          }
        ],
        "result": {
          "label": "Dose = units × 1 mg/unit; Q3W cadence confirmed from service-date spacing (21 days)",
          "value": 200
        }
      }
    },
    "prerequisites": [],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Permanent J-code identification (post-code assignment)",
        "description": "Once a drug has a permanent HCPCS J-code, identification is straightforward — query the procedure code field in the medical claims table for the J-code and extract `units_billed`. Dose = `units_billed` × descriptor amount per unit. Most major oncology drugs approved before 2018 have stable permanent J-codes.",
        "edge_cases": [
          "Verify the descriptor has not changed across the study window (CMS can revise the per-unit amount at a quarterly update); cross-year dose comparisons must account for this.",
          "The same J-code may appear on multiple claim lines on the same date (split billing across payers or across claim types); sum lines for total administered dose."
        ],
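        "data_source_notes": "Medicare carrier line file: `HCPCS_CD` + `LINE_SRVC_CNT` (service units; `CARR_LINE_MTUS_CNT` holds the miles/time/units count). Medicare outpatient revenue center file: `HCPCS_CD` + `REV_CNTR_UNIT_CNT`. Variable names differ across file versions, so confirm them against the ResDAC data dictionary. Commercial: vendor-specific column names."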
      },
      {
        "name": "NOC-code identification (pre-permanent-code window)",
        "description": "Query for J3490, J3590, or C9399 on the procedure code field, then parse the NDC field on the same claim line to identify the specific drug. The NDC must then be crosswalked to the drug of interest using an NDC master list (e.g., the FDA NDC Directory, RxNorm, or the CMS ASP Drug Pricing NDC lookup table). Document the NOC window date range as a study limitation.",
        "edge_cases": [
          "NDC field completeness on medical claims varies by payer and vendor from <40% to >90%; always report the share of NOC lines with parseable NDCs in your data quality section.",
          "The same drug may appear under both J3490 and C9399 during the NOC window depending on whether the claim originated from a professional or institutional (HOPD) setting."
        ],
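        "data_source_notes": "Medicare carrier line file: `LINE_NDC_CD` (line NDC code). Medicare outpatient revenue center file: `REV_CNTR_IDE_NDC_UPC_NUM` holds the NDC, with `REV_CNTR_NDC_QTY` and `REV_CNTR_NDC_QTY_QLFR_CD` giving the NDC quantity and its unit qualifier. Confirm variable names against the ResDAC data dictionary for your file version. Commercial: NDC field presence is vendor and payer dependent."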
      },
      {
        "name": "Combined medical + pharmacy benefit ascertainment",
        "description": "For complete drug exposure across both benefit types, merge J-code-identified medical claims with NDC-identified pharmacy claims (Part D or commercial pharmacy file). The CMS ASP NDC-HCPCS crosswalk links the two systems. Avoid double-counting drugs that appear in both files (e.g., if a drug is dispensed via specialty pharmacy AND billed under Part B on the same date).",
        "edge_cases": [
          "Oral oncologics (capecitabine, erlotinib) are almost entirely pharmacy-benefit; verify that the medical-benefit J-code list does not inadvertently include these unless they are covered as Part B oral equivalents (a narrow exception in Medicare).",
          "340B-discounted drug administrations may appear in medical claims without a corresponding NDC, because providers sometimes omit the NDC to avoid ASP price adjustment calculations."
        ],
        "data_source_notes": "Requires access to both the medical claims file (J-codes) and the pharmacy claims file (NDCs + days_supply). In OMOP, query drug_exposure with drug_type_concept_id to distinguish pharmacy dispensings from EHR administrations from medical-claim drug records."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "ndc-national-drug-code",
        "pros_of_this": "Captures all provider-administered drugs (infused biologics, IV chemotherapy) that never appear in pharmacy claims; also available when days_supply is not applicable because the drug is consumed at the point of service.",
        "cons_of_this": "No days_supply, so exposure duration must be modeled from administration cadence; billing-unit to dose conversion requires consulting the code descriptor; NOC-window under-ascertainment for newly approved drugs.",
        "when_to_prefer": "Use J-codes for IV/infused drugs; use NDCs for oral and self-administered drugs; use both for complete ascertainment in oncology or biologic-heavy therapeutic areas."
      },
      {
        "compared_to": "cpt-procedure-coding",
        "pros_of_this": "J-codes identify the specific drug administered; CPT codes identify the administration service (infusion type, duration, drug class) but not the drug name or dose.",
        "cons_of_this": "J-codes alone do not identify the administration setting, complexity, or duration; CPT codes are needed to distinguish a simple IV push from a prolonged infusion for HCRU analyses.",
        "when_to_prefer": "Always pair J-codes (for drug identity) with CPT 96413-series codes (for administration type and complexity) in any oncology claims analysis."
      },
      {
        "compared_to": "revenue-center-codes",
        "pros_of_this": "J-codes on UB-04 claims appear alongside revenue center 0636, but J-codes identify the drug; revenue center 0636 alone only identifies that a pharmacy item was dispensed, not which drug.",
        "cons_of_this": "Revenue center codes provide site-of-service and cost-center context that J-codes alone do not supply; both fields are needed for complete UB-04 analysis.",
        "when_to_prefer": "Use J-codes as the drug identifier; use revenue center codes to identify the cost center and distinguish drug charges from administration charges in cost analyses."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Medical claims (carrier and outpatient files): query procedure code (`hcpcs_cd` in Medicare carrier, `rev_cntr_hcpcs_cd` in Medicare outpatient) for J- or Q-codes. Extract billing units (`lne_mtus_cnt` in carrier, `rev_cntr_unit_cnt` in outpatient). Multiply by descriptor amount for dose. Include J3490, J3590, C9399 and parse `lne_ndc_cd` for the NOC window. Require continuous Part A/B/D enrollment (exclude MA-only spans) to ensure observability.",
      "ehr": "EHR medication administration records (MARs) capture the exact infused dose, the NDC, and the order. They do not use HCPCS codes internally, but HCPCS codes can be matched to MAR records via service date + drug product after linking EHR to claims. Use the MAR for dose accuracy when available; use claims J-codes for population-level coverage.",
      "linked": "Linked claims-EHR: use J-codes from claims for ascertainment of who received treatment; use MAR dose from EHR for exact dose when precision matters. The crosswalk is by person + service date + drug product (NDC or drug name). Discrepancies between billed units and administered MAR dose are common and should be audited before using either source alone for dose analyses."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import re\nfrom dataclasses import dataclass, field\nfrom typing import Optional\n\n# ------------------------------------------------------------------ format validation\nHCPCS_PATTERN = re.compile(r\"^[A-V]\\d{4}$\")\nJ_CODE_PATTERN = re.compile(r\"^J\\d{4}$\")\nNOC_CODES = {\"J3490\", \"J3590\", \"C9399\"}\n\ndef is_valid_hcpcs(code: str) -> bool:\n    \"\"\"Return True if code matches the 1-letter + 4-digit HCPCS Level II format.\"\"\"\n    return bool(HCPCS_PATTERN.match(code.strip().upper()))\n\ndef is_j_code(code: str) -> bool:\n    \"\"\"Return True if code is a J-family drug code (J0001-J9999).\"\"\"\n    return bool(J_CODE_PATTERN.match(code.strip().upper()))\n\ndef is_noc(code: str) -> bool:\n    \"\"\"Return True if code is a Not-Otherwise-Classified placeholder requiring NDC lookup.\"\"\"\n    return code.strip().upper() in NOC_CODES\n\ndef j_code_family(code: str) -> str:\n    \"\"\"Classify J-code into antineoplastic (J9000-J9999) vs other drug (J0001-J8999).\"\"\"\n    code = code.strip().upper()\n    if not is_j_code(code):\n        return \"not_j_code\"\n    num = int(code[1:])\n    if 9000 <= num <= 9999:\n        return \"antineoplastic_J9\"\n    return \"other_drug_J0_J8\"\n\n# ------------------------------------------------------------------ dose computation\n# Descriptor lookup: map HCPCS code -> mg per billing unit.\n# ANALYST MUST BUILD THIS FROM CMS HCPCS RELEASE FILES.\n# Sample entries for illustration only — verify against the current quarterly release.\nDESCRIPTOR_MG_PER_UNIT: dict[str, float] = {\n    \"J9271\": 1.0,    # pembrolizumab, per 1 mg\n    \"J9305\": 10.0,   # pemetrexed, per 10 mg\n    \"J9000\": 10.0,   # doxorubicin hydrochloride, per 10 mg\n    \"J1745\": 10.0,   # infliximab, per 10 mg\n    \"J0800\": 40.0,   # corticotropin, up to 40 USP units (units, not mg)\n}\n\ndef compute_dose(hcpcs_code: str, units_billed: float,\n                 descriptor_table: dict[str, float] = DESCRIPTOR_MG_PER_UNIT\n    
             ) -> Optional[float]:\n    \"\"\"\n    Compute administered dose from billing units * descriptor_mg_per_unit.\n\n    Parameters\n    ----------\n    hcpcs_code     : HCPCS code string (e.g. 'J9305')\n    units_billed   : value from the HCPCS units field on the claim line\n    descriptor_table: analyst-built lookup {hcpcs_code: mg_per_billing_unit}\n\n    Returns\n    -------\n    administered dose in the descriptor's unit (usually mg), or None if code not found.\n    \"\"\"\n    code = hcpcs_code.strip().upper()\n    mg_per_unit = descriptor_table.get(code)\n    if mg_per_unit is None:\n        return None  # code not in lookup; requires manual descriptor review\n    return units_billed * mg_per_unit\n\n# ------------------------------------------------------------------ claim-line processing\n@dataclass\nclass ClaimLine:\n    claim_id: str\n    person_id: str\n    service_date: str\n    hcpcs_code: str\n    units_billed: float\n    ndc: Optional[str] = None  # populated for NOC-coded lines when available\n\n@dataclass\nclass ProcessedLine:\n    claim_id: str\n    person_id: str\n    service_date: str\n    hcpcs_code: str\n    units_billed: float\n    ndc: Optional[str]\n    is_valid: bool\n    is_j_code: bool\n    is_noc: bool\n    j_family: str\n    administered_dose: Optional[float]\n    # \"ok\" | \"noc_no_ndc\" | \"noc_ndc_present\" | \"descriptor_missing\" | \"invalid_code\"\n    dose_flag: str = \"\"\n\ndef process_line(line: ClaimLine,\n                 descriptor_table: dict[str, float] = DESCRIPTOR_MG_PER_UNIT\n                 ) -> ProcessedLine:\n    code = line.hcpcs_code.strip().upper()\n    valid = is_valid_hcpcs(code)\n    j = is_j_code(code)\n    noc = is_noc(code)\n    family = j_code_family(code) if j else \"not_j_code\"\n\n    dose = None\n    flag = \"\"\n    if not valid:\n        flag = \"invalid_code\"\n    elif noc:\n        flag = \"noc_no_ndc\" if not line.ndc else \"noc_ndc_present\"\n        # dose cannot be computed from a NOC code (J3490/J3590/C9399) alone; requires NDC-based descriptor lookup\n    else:\n        dose = compute_dose(code, line.units_billed, descriptor_table)\n        flag = \"ok\" if dose is not None else \"descriptor_missing\"\n\n    return ProcessedLine(\n        claim_id=line.claim_id,\n        person_id=line.person_id,\n        service_date=line.service_date,\n        hcpcs_code=code,\n        units_billed=line.units_billed,\n        ndc=line.ndc,\n        is_valid=valid,\n        is_j_code=j,\n        is_noc=noc,\n        j_family=family,\n        administered_dose=dose,\n        dose_flag=flag,\n    )\n\n# ------------------------------------------------------------------ example\nif __name__ == \"__main__\":\n    lines = [\n        ClaimLine(\"C001\", \"3001\", \"2023-03-01\", \"J9271\", 200),\n        ClaimLine(\"C002\", \"3001\", \"2023-03-22\", \"J9271\", 200),\n        ClaimLine(\"C003\", \"3002\", \"2023-04-05\", \"J9305\", 50),   # pemetrexed, 50 units = 500 mg\n        ClaimLine(\"NOC1\", \"3003\", \"2023-04-10\", \"J3490\", 1, ndc=\"00310094630\"),\n    ]\n    for cl in lines:\n        pl = process_line(cl)\n        print(f\"{pl.claim_id}: {pl.hcpcs_code} | family={pl.j_family} \"\n              f\"| units={pl.units_billed} | dose={pl.administered_dose} | flag={pl.dose_flag}\")\n    # Output (J3490 matches the J-code pattern, so it is classified as other_drug_J0_J8):\n    # C001: J9271 | family=antineoplastic_J9 | units=200 | dose=200.0 | flag=ok\n    # C002: J9271 | family=antineoplastic_J9 | units=200 | dose=200.0 | flag=ok\n    # C003: J9305 | family=antineoplastic_J9 | units=50 | dose=500.0 | flag=ok\n    # NOC1: J3490 | family=other_drug_J0_J8 | units=1 | dose=None | flag=noc_ndc_present",
        "description": "Validates HCPCS Level II code format, separates J-code lines from procedure claims, computes administered dose from billing units using a caller-supplied descriptor lookup table, and flags NOC codes (J3490, J3590, C9399) for supplemental NDC-based identification. The descriptor lookup table must be built by the analyst from the CMS HCPCS Release files; a sample for illustrative purposes is embedded in the code.\n",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(dplyr)\n\n# ------------------------------------------------------------------ format validation\nis_valid_hcpcs <- function(code) {\n  grepl(\"^[A-V]\\\\d{4}$\", toupper(trimws(code)))\n}\n\nis_j_code <- function(code) {\n  grepl(\"^J\\\\d{4}$\", toupper(trimws(code)))\n}\n\nNOC_CODES <- c(\"J3490\", \"J3590\", \"C9399\")\nis_noc <- function(code) toupper(trimws(code)) %in% NOC_CODES\n\nj_code_family <- function(code) {\n  code <- toupper(trimws(code))\n  num <- as.integer(substr(code, 2, 5))\n  case_when(\n    !is_j_code(code)       ~ \"not_j_code\",\n    num >= 9000             ~ \"antineoplastic_J9\",\n    TRUE                    ~ \"other_drug_J0_J8\"\n  )\n}\n\n# ------------------------------------------------------------------ descriptor table\n# ANALYST MUST BUILD FROM CMS HCPCS RELEASE FILES. Sample entries only.\ndescriptor_table <- tibble::tribble(\n  ~hcpcs_code, ~mg_per_unit, ~drug_name,\n  \"J9271\",     1.0,          \"pembrolizumab\",\n  \"J9305\",     10.0,         \"pemetrexed\",\n  \"J9000\",     10.0,         \"doxorubicin\",\n  \"J1745\",     10.0,         \"infliximab\"\n)\n\n# ------------------------------------------------------------------ dose computation\ncompute_dose <- function(claims_df, descriptor_df) {\n  # claims_df must have columns: hcpcs_code, units_billed\n  # returns claims_df with administered_dose_mg and dose_flag added\n  claims_df |>\n    mutate(hcpcs_code = toupper(trimws(hcpcs_code))) |>\n    left_join(descriptor_df, by = \"hcpcs_code\") |>\n    mutate(\n      is_valid   = is_valid_hcpcs(hcpcs_code),\n      is_j_code  = is_j_code(hcpcs_code),\n      is_noc     = is_noc(hcpcs_code),\n      j_family   = j_code_family(hcpcs_code),\n      administered_dose_mg = dplyr::if_else(\n        is_noc | !is_valid, NA_real_,\n        units_billed * mg_per_unit\n      ),\n      dose_flag  = dplyr::case_when(\n        !is_valid            ~ \"invalid_code\",\n        is_noc               ~ 
\"noc_requires_ndc_lookup\",\n        is.na(mg_per_unit)   ~ \"descriptor_missing\",\n        TRUE                 ~ \"ok\"\n      )\n    )\n}\n\n# ------------------------------------------------------------------ example\nclaims <- tibble::tribble(\n  ~claim_id, ~person_id, ~service_date, ~hcpcs_code, ~units_billed,\n  \"C001\",    \"3001\",     \"2023-03-01\",  \"J9271\",     200,\n  \"C002\",    \"3001\",     \"2023-03-22\",  \"J9271\",     200,\n  \"C003\",    \"3002\",     \"2023-04-05\",  \"J9305\",      50,  # 50 * 10 = 500 mg pemetrexed\n  \"NOC1\",    \"3003\",     \"2023-04-10\",  \"J3490\",       1\n)\n\nresult <- compute_dose(claims, descriptor_table)\nprint(result[, c(\"claim_id\", \"hcpcs_code\", \"j_family\",\n                  \"units_billed\", \"administered_dose_mg\", \"dose_flag\")])\n# Illustrative layout (J3490 matches the J-code pattern, so j_family is other_drug_J0_J8):\n# claim_id  hcpcs_code  j_family           units_billed  administered_dose_mg  dose_flag\n# C001      J9271       antineoplastic_J9  200           200                   ok\n# C002      J9271       antineoplastic_J9  200           200                   ok\n# C003      J9305       antineoplastic_J9   50           500                   ok\n# NOC1      J3490       other_drug_J0_J8     1            NA                   noc_requires_ndc_lookup",
        "description": "Validates HCPCS code format using regex, classifies J-codes by family, computes administered dose from billing units and a descriptor table, and flags NOC-coded lines for NDC-based supplemental identification. The descriptor table must be constructed by the analyst from CMS HCPCS Release files; sample entries are included for illustration.\n",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  MED[\"Medical Claim\\n(CMS-1500 / UB-04)\"]\n  PHARM[\"Pharmacy Claim\\n(Part D / pharmacy benefit)\"]\n  JCODE[\"J-code or Q-code\\n(provider-administered drug)\\ne.g. J9271 pembrolizumab\"]\n  NDC[\"NDC + days_supply\\n(self-administered drug)\\ne.g. 00310094630 erlotinib\"]\n  NOC[\"NOC code line\\n(J3490 / J3590 / C9399)\\n→ parse NDC field\"]\n  DOSE[\"Administered dose\\n= units_billed × descriptor_mg_per_unit\"]\n  XWALK[\"CMS ASP NDC–HCPCS crosswalk\\n(quarterly)\"]\n  MED --> JCODE\n  MED --> NOC\n  PHARM --> NDC\n  JCODE --> DOSE\n  NOC --> |\"NDC field present?\"| DOSE\n  NDC --> XWALK\n  XWALK --> JCODE",
        "caption": "HCPCS Level II J-codes sit exclusively on medical claims; NDCs sit on pharmacy claims. Complete drug ascertainment requires both. NOC-coded lines (J3490/J3590/C9399) bridge the period before a permanent J-code is assigned and rely on the NDC field for drug identity. The CMS ASP NDC-HCPCS crosswalk links the two systems.",
        "alt_text": "Flowchart showing medical claims producing J-codes or NOC codes and pharmacy claims producing NDCs, with J-codes computing administered dose from units times descriptor amount, and the CMS ASP crosswalk linking NDCs to J-codes.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "sibling_of",
        "target_slug": "cpt-procedure-coding",
        "notes": "CPT (HCPCS Level I) codes procedure services; HCPCS Level II J-codes identify the specific drug administered. On a medical claim they co-appear: CPT 96413 (chemotherapy infusion administration) + J9271 (pembrolizumab). Level I is maintained by the AMA; Level II by CMS."
      },
      {
        "relation_type": "counterpart_of",
        "target_slug": "ndc-national-drug-code",
        "notes": "NDC codes identify drugs dispensed on pharmacy claims (self-administered, pharmacy benefit, days_supply); J-codes identify drugs administered by providers on medical claims (infused, buy-and-bill, no days_supply). The CMS ASP NDC-HCPCS crosswalk (quarterly) links the two. Complete oncology and biologic exposure ascertainment requires both."
      },
      {
        "relation_type": "used_with",
        "target_slug": "revenue-center-codes",
        "notes": "On UB-04 institutional claims, J-codes appear on revenue center lines alongside revenue center code 0636 (drugs requiring detailed coding) or 0250 (pharmacy general). Revenue center 0335 codes the chemotherapy infusion administration service; the J-code on the drug line identifies the drug."
      },
      {
        "relation_type": "used_with",
        "target_slug": "ub-04-institutional-claim-fields",
        "notes": "UB-04 is the claim form used by hospitals and HOPDs where most IV chemotherapy and biologic infusions occur. J-codes appear at the revenue center line level on the UB-04, alongside units billed and the NDC field."
      },
      {
        "relation_type": "used_with",
        "target_slug": "cms-1500-professional-claim-fields",
        "notes": "CMS-1500 is the professional claim form used by physician offices for provider-administered drug billing. J-codes appear at the procedure line level alongside units billed."
      },
      {
        "relation_type": "used_with",
        "target_slug": "medical-code-crosswalks-mappings",
        "notes": "The CMS ASP NDC-HCPCS Drug Pricing File (quarterly) is the authoritative NDC-to-J-code crosswalk. The NCI CanMED-HCPCS provides a validated list of oncology J-codes for SEER-Medicare and claims-based studies. Both crosswalks must be version-dated to the study window."
      },
      {
        "relation_type": "used_with",
        "target_slug": "omop-standardized-vocabularies",
        "notes": "HCPCS Level II codes are ingested into the OMOP Drug domain via the HCPCS standard concept vocabulary, mapped to RxNorm or SNOMED drug concepts. NOC codes (J3490, J3590) map to unclassified drug concepts and require NDC-level supplementation for drug identity in OMOP-based analyses."
      },
      {
        "relation_type": "see_also",
        "target_slug": "drug-utilization",
        "notes": "Drug utilization studies of provider-administered drugs (infused oncologics, biologics) are built on HCPCS J-code lines from medical claims; J-codes are the primary exposure primitive for any DUS targeting the medical-benefit drug population."
      },
      {
        "relation_type": "see_also",
        "target_slug": "treatment-patterns-lines-of-therapy",
        "notes": "Lines-of-therapy algorithms for oncology must query both J-codes (medical-benefit IV drugs) and NDCs (pharmacy-benefit oral drugs); J-code omission causes systematic under-counting of IV-only and combination regimens and inflates apparent watch-and-wait rates."
      },
      {
        "relation_type": "see_also",
        "target_slug": "procedure-identification-and-measurement-in-claims-ehr",
        "notes": "Provider-administered drug administration is jointly identified by a J-code (the drug) and a CPT procedure code (the administration service type); procedure identification methods apply to the CPT administration codes that always accompany J-code drug lines."
      },
      {
        "relation_type": "see_also",
        "target_slug": "claims-analysis",
        "notes": "HCPCS J-codes are a fundamental primitive within the broader administrative claims analysis framework; all claims-analysis guidance on observability, time zero, enrollment, and Medicare Advantage exclusion applies when building J-code drug cohorts."
      }
    ],
    "aliases": [
      "HCPCS",
      "HCPCS Level II",
      "J-code",
      "J-codes",
      "Q-codes"
    ],
    "complexity": "foundational",
    "regulatory_relevance": [],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "hcru-healthcare-resource-utilization",
    "name": "Healthcare Resource Utilization (HCRU)",
    "short_definition": "The quantification of the volume and type of healthcare services a patient uses over a defined observation window — hospitalizations and length of stay, emergency department visits, ambulatory/physician encounters, procedures, tests, and pharmacy fills — operationalized from administrative claims, EHR, or registry data and typically standardized to person-time (per-patient-per-month/year) for comparison.",
    "long_description": "Healthcare resource utilization (HCRU) is a **count/volume outcome**, not a dollar outcome. It enumerates the\nservices a patient consumes (inpatient admissions, ICU days, ED visits, office visits, infusions, procedures,\npharmacy fills) over an observation window, using standardized coding: ICD diagnosis/procedure codes, CPT/HCPCS,\nUB-04 revenue codes, CMS place-of-service (POS) codes, DRGs, and NDCs. HCRU is the substrate from which costs,\nburden-of-disease estimates, and budget-impact inputs are later built — but it is measured and modeled in its own\nunits (events, days, fills) before any monetization.\n\n**Core conceptual distinction**. Three orthogonal design choices define any HCRU measure, and they must be\npre-specified in the estimand because each maps to a different model. (1) *Setting / place of service*: inpatient\n(POS 21, UB-04 revenue 0100–0219, with DRGs and LOS), ED (POS 23, revenue 045x), on-campus outpatient (POS 22),\noffice (POS 11), SNF/home health/hospice, and pharmacy (NDC). Inpatient HCRU is low-frequency but cost-dominant;\noffice HCRU is high-frequency and cheap — collapsing them hides where burden accrues. (2) *Attribution*:\n**all-cause** (every claim in the window — correct for total budget and for capturing off-target harms) vs\n**disease-specific/attributable** (claims carrying a qualifying diagnosis in the primary, or any, position, or a\nvalidated algorithm) vs **incremental** (the difference vs a matched comparator). (3) *Endpoint form*: a binary\n\"any use\" indicator (proportion with ≥1 admission), a **count** (number of visits), or a **rate** standardized to\nperson-time. These are not interchangeable: a binary endpoint uses logistic/log-binomial regression; a count or\nrate uses Poisson or, far more often, negative binomial with an `offset(log(person_time))` because utilization\ndata are over-dispersed and zero-inflated. 
Reporting a mean count without the denominator or the distribution is\nuninterpretable when follow-up varies by arm.\n\n**Pros, cons, and trade-offs**.\n- **vs healthcare-costs-pppm-pppy-pmpm (monetized utilization):** HCRU counts isolate the *driver* of burden\n  (which setting, which service) and are robust to price variation, negotiated rates, and the absence of true\n  paid amounts — a hospitalization is a hospitalization regardless of the contract. Cost: counts ignore intensity\n  (a 1-day and a 14-day admission are both \"1 admission\"), so they understate the economic gap between arms.\n  **Prefer HCRU counts** when describing service mix or identifying high-utilization phenotypes, and report them\n  *alongside* costs, never instead of them.\n- **vs poisson-negative-binomial-count-models (the analytic layer):** HCRU is the outcome those models consume;\n  naive comparison of mean counts across arms with unequal follow-up or over-dispersion gives wrong standard\n  errors. **Prefer NB with a person-time offset** over Poisson whenever the variance exceeds the mean (the norm).\n- **vs cost-outlier-handling-rwe:** Raw HCRU counts are the input that must be inspected *before* deciding on\n  winsorization or robust models; a handful of frequent-ED super-utilizers can dominate an unadjusted mean.\n  **Prefer** examining the count distribution first, then choosing a count model or trimming rule.\n\n**When to use**. 
Describing disease burden and service mix; identifying high-cost/high-utilization patients and\nthe settings that drive them; building the volume inputs for cost, budget-impact, and cost-of-illness analyses;\ncomparing utilization between treatment arms in comparative-effectiveness or safety studies; and PQA/CMS-style\npayer reporting where PPPM utilization is a standard deliverable.\n\n**When NOT to use — and when it is actively misleading or dangerous**.\n- **When the question is economic value and you report counts alone.** Counts treat a brief observation stay and a\n  prolonged ICU admission as equal; a therapy that shortens LOS without reducing admission *count* looks null on\n  crude HCRU yet saves real money. Always pair counts with LOS/days and with costs.\n- **When follow-up differs by arm and you compare raw means.** Longer-followed patients accrue more events\n  mechanically; failing to standardize to person-time (PPPM/PPPY) or to use a person-time offset manufactures a\n  spurious difference. This is the single most common HCRU error in payer decks.\n- **When person-time is unobservable.** In Medicare Advantage (MA-only) enrollment, encounter data are incomplete\n  and fee-for-service (FFS) claims are absent, so \"zero utilization\" is missingness, not true non-use. Counting MA\n  person-time in the denominator while events are under-captured biases every rate downward — dangerous when\n  arms differ in MA mix.\n- **When attribution is unvalidated.** Disease-specific HCRU built on rule-out codes or on diagnoses that also\n  flag comorbid or screening encounters over-attributes; conversely, requiring a primary-position code\n  under-attributes when the condition is coded secondarily. 
Use a validated algorithm and report the all-cause\n  figure for context.\n- **When competing risks differ by exposure.** In elderly claims cohorts, the arm with higher mortality\n  accumulates *less* downstream HCRU simply because patients die — a survivorship artifact that makes the more\n  harmful drug look \"lower utilization.\" Account for death as a competing event, not as censoring that inflates\n  rates.\n\n**Data-source operational depth**.\n- **Claims (FFS commercial or Medicare A/B/D):** The strongest substrate for standardized events. Inpatient via\n  UB-04 revenue codes + DRG; outpatient/procedures via CPT/HCPCS + POS; pharmacy via NDC on the Part D / pharmacy\n  file. *Failure modes:* (a) **MA-only person-time lacks FFS claims** — restrict to enrollees with the relevant\n  medical+pharmacy benefit and exclude MA-only spans, or you divide real events by inflated, unobservable\n  denominators; (b) **site-of-service shift** — a procedure moving from inpatient to hospital-outpatient changes\n  the POS bucket without changing total volume, so trend analyses must collapse sites or model them jointly;\n  (c) **bundled/episode payments (BPCI, CJR, oncology bundles)** — line-level HCRU is often still observable\n  inside the episode window, but the single bundled payment decouples observed counts from paid cost, so cost\n  attribution needs allowed amounts or shadow pricing; (d) **immortal time in procedure studies** — starting\n  follow-up at diagnosis but defining \"treated\" by a later procedure guarantees the treated arm survives to be\n  treated, inflating its pre-procedure HCRU; align time zero to the procedure or use a landmark.\n- **EHR:** Captures clinically real events (visits attended, labs ordered) and severity that claims lack, but is\n  **visit-driven** — sicker, more engaged patients generate more observed encounters, biasing utilization upward,\n  and a patient who leaves the system is differentially lost. 
EHR rarely sees out-of-network or cross-system care,\n  so totals under-capture. Best linked to claims for a complete denominator.\n- **Registry:** Protocol-driven capture of disease-specific HCRU (specialist visits, defined tests) with strong\n  severity/staging, but incomplete for all-cause or primary-care use. Excellent for *validating* claims-based HCRU\n  algorithms when linked.\n- **Linked claims–EHR–registry–vital records:** The ideal substrate (EHR severity + claims completeness +\n  reliable mortality for competing-risk handling), but linkage selects the linkable subset and introduces\n  date-discrepancy issues (order vs fill vs service dates) that must be reconciled before binning person-time.\n\n**Worked claims example.** Question: all-cause and diabetes-attributable HCRU in the first year after initiating a\nnew GLP-1 agonist, by place of service, among adults in a commercial + Medicare FFS database. (1) **Cohort &\nobservable time:** require 365 days of continuous medical + pharmacy enrollment before the index fill (washout +\nbaseline) and follow from the index date forward; **exclude MA-only spans** so absence of claims is true\nnon-utilization, not missingness. (2) **Person-time:** for each patient, accrue enrolled days from index to the\nearliest of disenrollment, death, +365 days, or end of data; convert to person-months = days / 30.44. A patient\nenrolled 200 days contributes 6.57 person-months — and contributes to the denominator even with zero events.\n(3) **Event counting by POS:** inpatient admissions = distinct admission stays after collapsing same-day\ntransfers and bridging stays into one episode (revenue 0100–0219 / POS 21); ED = POS 23 / revenue 045x not\nfollowed by an inpatient admission on the same/next day (else it rolls into the admission); office = POS 11\nCPT E/M; pharmacy = distinct NDC fills. 
(4) **Attribution:** all-cause = every event; diabetes-attributable =\nevents with a type-2 diabetes ICD-10 code (E11.x) in any position, reported alongside the all-cause figure.\n(5) **Standardize:** PPPM rate = total events across the cohort / total person-months; e.g., 1,820 inpatient\nadmissions over 38,400 person-months = 0.0474 admissions PPPM ≈ 0.57 PPPY. (6) **Model:** negative-binomial\nregression of the event count with `offset(log(person_months))` and arm + baseline covariates, because the\ninpatient count is over-dispersed and zero-inflated; treat death as a competing event when comparing arms with\ndifferential mortality. (7) **Sensitivity:** repeat with a primary-position-only attribution, a 30-day vs 7-day\nED-to-admission rollup window, and FFS-only vs FFS+observable-MA cohorts to bound the denominator assumption.\n\n**Interpreting the output**\n\nA study of GLP-1 initiators reports all-cause HCRU in the first year across 2.8 total person-years:\ninpatient admissions 1.07 per patient per year (PPPY), ED visits 2.50 PPPY, and outpatient encounters\n6.07 PPPY. These three rates are the standardized event counts per full year of observed patient-time.\n\n*(1) Formal interpretation.* The 1.07 inpatient admissions PPPY means that, across the observed\nperson-time, patients averaged just over one hospital admission per year of follow-up; the denominator\nis 2.8 person-years of observed follow-up, not a headcount of patients, so patients with partial follow-up are weighted\nby their observed time. These are all-cause rates: every admission regardless of diagnosis is counted.\nThe rates are not adjusted for baseline differences across arms; any between-arm comparison would\nrequire regression with a person-time offset and covariate adjustment. 
Counts treat a 1-day and a 14-day\nstay as one admission each — LOS data must accompany the count to reflect severity.\n\n*(2) Practical interpretation.* For a budget-impact team, the outpatient rate (6.07 visits PPPY) is\nthe highest-volume driver of utilization, while the inpatient rate (1.07 PPPY) likely dominates costs\ndespite its lower volume. Reporting all three rates with their person-time base allows a payer to\nmultiply each rate by its unit cost and project total expected spend per patient per year — which is\nwhy the HCRU table is the standard companion to any dollar-denominated cost estimate.",
    "primary_category": "Outcome_Measure",
    "tags": [
      "hcru",
      "utilization",
      "claims",
      "place-of-service",
      "pppm-pppy",
      "count-models",
      "attribution",
      "burden-of-disease"
    ],
    "applies_to_study_types": [
      "drug_utilization",
      "cohort_retrospective",
      "active_comparator_new_user",
      "new_user",
      "claims_analysis",
      "cost_of_illness",
      "cost_effectiveness"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1146/annurev.publhealth.20.1.125",
        "url": "https://doi.org/10.1146/annurev.publhealth.20.1.125",
        "citation_text": "Diehr P, Yanez D, Ash A, Hornbrook M, Lin DY. Methods for analyzing health care utilization and costs. Annual Review of Public Health. 1999;20:125-144.",
        "year": 1999,
        "authors_short": "Diehr et al.",
        "notes": "Foundational review of the statistical structure of utilization and cost data (skewness, zero-inflation, person-time standardization) and the two-part and count-model approaches they motivate."
      },
      {
        "role": "explain",
        "doi": "10.1002/hec.1653",
        "url": "https://doi.org/10.1002/hec.1653",
        "citation_text": "Mihaylova B, Briggs A, O'Hagan A, Thompson SG. Review of statistical methods for analysing healthcare resources and costs. Health Economics. 2011;20(8):897-916.",
        "year": 2011,
        "authors_short": "Mihaylova et al.",
        "notes": "Systematic review of methods for over-dispersed, zero-heavy, skewed HCRU and cost outcomes (Poisson/NB, two-part, GLM with log link, censoring-aware approaches); the standard reference for choosing the analytic model."
      },
      {
        "role": "explain",
        "doi": "10.2147/CEOR.S205597",
        "url": "https://doi.org/10.2147/CEOR.S205597",
        "citation_text": "Schroeder M, Ko C, Lee XHM, et al. Comparison of methods to estimate disease-related cost and healthcare resource utilization for autoimmune diseases in administrative claims databases. ClinicoEconomics and Outcomes Research. 2019;11:597-609.",
        "year": 2019,
        "authors_short": "Schroeder et al.",
        "notes": "Compares all-cause, diagnosis-on-claim, and matched-control attribution strategies for disease-specific HCRU/cost in claims, quantifying how the attribution choice drives the estimate."
      },
      {
        "role": "demonstrate",
        "doi": "10.1097/BRS.0000000000003572",
        "url": "https://doi.org/10.1097/BRS.0000000000003572",
        "citation_text": "Spears CA, Hodges SE, Kiyani M, et al. Health care resource utilization and management of chronic, refractory low back pain in the United States. Spine. 2020;45(20):E1333-E1341.",
        "year": 2020,
        "authors_short": "Spears et al.",
        "notes": "Applied claims study quantifying HCRU by service category for a chronic condition; a concrete template for POS-stratified, all-cause-plus-attributable HCRU reporting."
      }
    ],
    "plain_language_summary": "Healthcare Resource Utilization (HCRU) counts how often patients use medical services — hospital stays, emergency room visits, doctor's office visits, and prescription fills — over a defined observation window. Because patients are not all watched for the same length of time, the raw counts are divided by how long each patient was observed, producing a rate such as \"admissions per patient per year\" that makes different groups fairly comparable. HCRU answers the question: how much health care did these patients actually consume, and in which settings? One honest limitation: HCRU counts events but does not measure how severe each event was — a two-day hospital stay and a two-week stay both count as one admission.",
    "key_terms": [
      {
        "term": "person-time",
        "definition": "The total calendar time that all patients in a study were actually observed and at risk — for example, 3 patients each watched for 1 year contribute 3 person-years of observation time."
      },
      {
        "term": "per-patient-per-year (PPPY)",
        "definition": "A rate that expresses how many events occurred for every full year a single average patient was observed, calculated by dividing total events by total observed years across all patients."
      },
      {
        "term": "place of service",
        "definition": "The setting where a healthcare encounter happened — for example, inpatient hospital, emergency department, or physician office — recorded on insurance claims using standardized codes."
      },
      {
        "term": "all-cause utilization",
        "definition": "Counting every healthcare event a patient had during the observation window, regardless of which medical condition prompted it."
      },
      {
        "term": "index date",
        "definition": "A patient's personal day zero in a study — the anchor date, such as the day they first filled a new prescription or received a diagnosis, from which follow-up time is measured forward."
      }
    ],
    "worked_example": {
      "scenario": "A researcher wants to describe how often patients with type 2 diabetes used the hospital, emergency room, and outpatient clinic in the year after starting a new medication. Three patients are enrolled. Patient P-001 was observed for the full 365-day follow-up year. Patient P-002 was also observed the full year. Patient P-003 enrolled later and was observed for only 292 days before the study ended. The analyst needs to report utilization rates that are fair to compare across patients, even though one patient was watched for less time.",
      "dataset": {
        "caption": "Raw event counts per patient during each patient's individual observation window",
        "columns": [
          "person_id",
          "inpatient_admissions",
          "er_visits",
          "outpatient_visits",
          "observed_days"
        ],
        "rows": [
          [
            "P-001",
            2,
            4,
            10,
            365
          ],
          [
            "P-002",
            0,
            1,
            3,
            365
          ],
          [
            "P-003",
            1,
            2,
            4,
            292
          ]
        ]
      },
      "steps": [
        "Convert each patient's observed days to observed years by dividing by 365. P-001: 365 / 365 = 1.000 year. P-002: 365 / 365 = 1.000 year. P-003: 292 / 365 = 0.800 year.",
        "For each patient, divide their event count by their observed years to get a per-patient-per-year (PPPY) rate for each setting. This annualizes the counts so a patient watched for less than a year is not penalized for having fewer raw events.",
        "P-001 inpatient PPPY: 2 admissions / 1.000 year = 2.00 admissions/year. ER PPPY: 4 / 1.000 = 4.00 visits/year. Outpatient PPPY: 10 / 1.000 = 10.00 visits/year.",
        "P-002 inpatient PPPY: 0 / 1.000 = 0.00. ER PPPY: 1 / 1.000 = 1.00 visit/year. Outpatient PPPY: 3 / 1.000 = 3.00 visits/year.",
        "P-003 inpatient PPPY: 1 / 0.800 = 1.25 admissions/year. ER PPPY: 2 / 0.800 = 2.50 visits/year. Outpatient PPPY: 4 / 0.800 = 5.00 visits/year.",
        "Notice why annualizing matters: P-003 had only 4 outpatient visits, fewer raw events than P-001's 10, yet her annualized rate (5.00/year) is higher than P-002's (3.00/year). Without dividing by observed time, raw counts would unfairly make P-003 look like the lightest user simply because she was watched for less time.",
        "To report a single cohort-level rate, sum all events and divide by total observed years. Inpatient: (2 + 0 + 1) = 3 admissions / (1.000 + 1.000 + 0.800) = 2.800 total person-years = 1.07 admissions PPPY. ER: (4 + 1 + 2) = 7 visits / 2.800 = 2.50 visits PPPY. Outpatient: (10 + 3 + 4) = 17 visits / 2.800 = 6.07 visits PPPY."
      ],
      "result": "Per-patient-per-year utilization rates — P-001: 2.00 inpatient, 4.00 ER, 10.00 outpatient. P-002: 0.00 inpatient, 1.00 ER, 3.00 outpatient. P-003: 1.25 inpatient, 2.50 ER, 5.00 outpatient. Cohort-level rates (all 3 patients, 2.800 total person-years): 1.07 inpatient admissions PPPY, 2.50 ER visits PPPY, 6.07 outpatient visits PPPY. All arithmetic derived from raw counts and observed days in the table above."
    },
    "prerequisites": [
      "incidence-rate-calculation-rwe",
      "person-time-denominator-construction-rwe",
      "continuous-enrollment-observable-time-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "All-cause HCRU",
        "description": "Every event in the observation window regardless of reason, counted by setting; the correct basis for total burden, budget impact, and capturing off-target/adverse-event utilization.",
        "edge_cases": [
          "Dilutes a disease-specific signal with unrelated care (injuries, preventive visits) when the question is condition-specific.",
          "Sensitive to differential mortality (competing risk) — the higher-mortality arm accrues fewer downstream events."
        ],
        "data_source_notes": "claims: count all inpatient/ED/outpatient/pharmacy claims in the window; report by POS for interpretability. ehr: under-captures out-of-network care, so all-cause totals are partial.",
        "citations": [
          "diehr-1999"
        ]
      },
      {
        "name": "Disease-specific / attributable HCRU",
        "description": "Events linked to the index condition via a qualifying diagnosis (primary or any position) or a validated claims algorithm; or incremental utilization vs a matched comparator.",
        "edge_cases": [
          "Over-attribution from rule-out codes, screening encounters, or comorbidity coding.",
          "Under-attribution when the condition is coded only in a secondary position and a primary-position rule is used."
        ],
        "data_source_notes": "claims: filter on the condition ICD set (e.g., E11.x for T2DM) in the chosen position, or use a published algorithm; always report alongside the all-cause figure. registry: cleanest for attribution and useful to validate the claims algorithm.",
        "citations": [
          "schroeder-2019"
        ]
      },
      {
        "name": "PPPM / PPPY standardized rates",
        "description": "Events divided by accrued person-time (per patient per month or per year) so cohorts with unequal follow-up are comparable; zero-utilization person-time stays in the denominator.",
        "edge_cases": [
          "Censoring at death, disenrollment, or data end must be applied to person-time, not dropped.",
          "MA-only person-time inflates the denominator because FFS events are unobserved."
        ],
        "data_source_notes": "claims: person-months = enrolled days / 30.44 from index to earliest of disenroll/death/end; PPPM = total events / total person-months. Standard PQA/CMS payer deliverable.",
        "citations": [
          "diehr-1999"
        ]
      },
      {
        "name": "Place-of-service stratified HCRU",
        "description": "Separate counts/rates for inpatient (with LOS/ICU days), ED, outpatient, SNF/home health/hospice, and pharmacy using POS codes, UB-04 revenue codes, and claim type.",
        "edge_cases": [
          "Site-of-service shift (inpatient to hospital-outpatient) moves volume between buckets without changing the total.",
          "ED-to-inpatient rollup window choice changes whether an ED visit is counted separately or folded into an admission."
        ],
        "data_source_notes": "claims: UB-04 revenue codes + CMS POS + CPT modifiers; pharmacy on a separate NDC file. The POS split is what reveals cost drivers (rare inpatient stays vs frequent cheap office visits)."
      },
      {
        "name": "Medical vs pharmacy benefit HCRU",
        "description": "Partition into medical (facility/professional claims, including provider-administered J-code drugs) vs pharmacy (dispensed retail/specialty NDC), to isolate drug burden and avoid double counting.",
        "edge_cases": [
          "Infused/provider-administered drugs appear under the medical benefit (HCPCS J-code + NDC) and may also appear in pharmacy — deduplicate before summing."
        ],
        "data_source_notes": "claims: join medical and pharmacy files on person/date using benefit indicators; specialty pharmacy is the dominant driver in oncology/autoimmune."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "healthcare-costs-pppm-pppy-pmpm",
        "pros_of_this": "Counts isolate the driver and setting of burden, are robust to price/contract variation, and need no paid-amount data; POS and medical/pharmacy splits show where utilization accrues.",
        "cons_of_this": "Ignore intensity (LOS, ICU days) and economic value; bundled/episode payments decouple observed counts from paid cost.",
        "when_to_prefer": "Describing service mix, identifying high-utilization phenotypes, or supplying volume inputs to cost models; report counts together with costs, not instead of them."
      },
      {
        "compared_to": "poisson-negative-binomial-count-models",
        "pros_of_this": "HCRU is the raw outcome; descriptive counts/rates are model-agnostic and transparent to payers.",
        "cons_of_this": "Naive mean-count comparisons with unequal follow-up or over-dispersion give wrong inference.",
        "when_to_prefer": "Report descriptive PPPM/PPPY first; move to NB-with-offset (over Poisson) for any comparative or adjusted estimate where variance exceeds the mean."
      },
      {
        "compared_to": "cost-outlier-handling-rwe",
        "pros_of_this": "Raw HCRU counts are the input that must be inspected before any trimming decision.",
        "cons_of_this": "Super-utilizers (frequent ED, long LOS) can dominate unadjusted means and derived costs.",
        "when_to_prefer": "Examine the count distribution first, then choose a count model or a pre-specified winsorization/robust rule."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Strongest for standardized events — inpatient via UB-04 revenue codes/DRG, outpatient/procedures via CPT/HCPCS + POS, pharmacy via NDC. Require continuous enrollment so absence of claims is real, and exclude MA-only person-time where FFS claims are unavailable. Collapse same-day transfers and ED-to-inpatient sequences before counting. For bundled payments, line-level HCRU is often present but cost needs allowed amounts/shadow pricing. Always report by POS and benefit type, and treat death as a competing event.",
      "ehr": "Captures attended visits, labs, and severity that claims lack, but is visit-driven (upward bias for engaged, sicker patients) and misses out-of-network care. Define observation windows explicitly and treat loss to follow-up as potentially informative; link to claims for a complete denominator.",
      "registry": "Protocol-driven, disease-specific HCRU with strong staging/severity; incomplete for all-cause or primary care. Excellent for validating claims-based HCRU algorithms when linked.",
      "linked": "Linked claims-EHR-registry-vital-records gives severity + completeness + reliable mortality for competing-risk handling, but introduces linkage selection and order/fill/service date discrepancies that must be reconciled before binning person-time."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\nimport pandas as pd\nimport patsy\nimport statsmodels.api as sm\n\nDAYS_PER_MONTH = 30.44\n\ndef attributable(dx_series: pd.Series, code_prefixes=(\"E11\",)) -> pd.Series:\n    # True if any diagnosis on the claim starts with a condition prefix (any-position attribution).\n    return dx_series.fillna(\"\").str.upper().str.contains(\"|\".join(code_prefixes), regex=True)\n\ndef hcru_rates(claims: pd.DataFrame, cohort: pd.DataFrame, attributable_only: bool = False) -> pd.DataFrame:\n    c = claims.merge(cohort[[\"person_id\", \"index_date\", \"enroll_end\", \"arm\"]], on=\"person_id\", how=\"inner\")\n    # Keep only events inside the follow-up window [index_date, enroll_end].\n    c = c[(c[\"service_date\"] >= c[\"index_date\"]) & (c[\"service_date\"] <= c[\"enroll_end\"])]\n    if attributable_only:\n        c = c[attributable(c[\"dx1\"])]\n\n    # Person-months per patient (denominator includes zero-utilization patients).\n    pt = cohort.copy()\n    pt[\"person_months\"] = ((pt[\"enroll_end\"] - pt[\"index_date\"]).dt.days.clip(lower=0)) / DAYS_PER_MONTH\n\n    # Event counts by setting per patient, then merge onto the full cohort (fill non-utilizers with 0).\n    counts = (c.groupby([\"person_id\", \"claim_type\"]).size()\n                .unstack(fill_value=0).reset_index())\n    out = pt.merge(counts, on=\"person_id\", how=\"left\").fillna(0)\n\n    # Cohort-level PPPM / PPPY by setting.\n    settings = [s for s in [\"inpatient\", \"ed\", \"outpatient\", \"pharmacy\"] if s in out.columns]\n    total_pm = out[\"person_months\"].sum()\n    rates = {s: out[s].sum() / total_pm for s in settings}              # events per person-month\n    rates = pd.DataFrame({\"setting\": settings,\n                          \"pppm\": [rates[s] for s in settings],\n                          \"pppy\": [rates[s] * 12 for s in settings]})\n    return out, rates\n\n# --- Negative-binomial model for the inpatient count with a log person-time offset 
---\n# Use the discrete NB MLE so the dispersion (alpha) is ESTIMATED from the data, matching\n# R's MASS::glm.nb and SAS PROC GENMOD dist=negbin (not fixed at alpha=1.0).\npatient_level, cohort_rates = hcru_rates(claims, cohort)\npatient_level[\"log_pt\"] = np.log(patient_level[\"person_months\"].replace(0, np.nan))\nmodel_df = patient_level.dropna(subset=[\"log_pt\"])\ny, X = patsy.dmatrices(\"inpatient ~ C(arm) + age + sex\", data=model_df, return_type=\"dataframe\")\nnb = sm.NegativeBinomial(y, X, offset=model_df[\"log_pt\"]).fit()  # estimates alpha by MLE\nprint(cohort_rates)\nprint(nb.summary())          # exp(coef) on arm = adjusted rate ratio of inpatient HCRU; alpha = dispersion",
        "description": "Computes all-cause and POS-stratified HCRU rates (PPPM/PPPY) and fits a negative-binomial count model with a\nperson-time offset. Required inputs (already cleaned, de-duplicated, MA-only spans excluded):\n  claims : person_id, service_date (datetime), pos (CMS place-of-service str), revenue_code (str),\n           dx1..dxN (ICD-10 lists/str), claim_type ('inpatient'/'ed'/'outpatient'/'pharmacy')\n           -- same-day transfers already collapsed into one inpatient stay; ED-followed-by-admission already\n              rolled into the admission upstream.\n  cohort : person_id, index_date (datetime), arm, enroll_end (datetime = min of disenroll/death/data-end/365d),\n           + baseline covariates measured in [index_date-365, index_date]\nPerson-time accrues from index_date to enroll_end; zero-utilization patients keep their person-time.",
        "dependencies": [
          "pandas",
          "numpy",
          "statsmodels",
          "patsy"
        ],
        "source_citations": [
          "mihaylova-2011",
          "diehr-1999"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\nlibrary(MASS)\nDAYS_PER_MONTH <- 30.44\n\nhcru_rates <- function(claims, cohort, attributable_only = FALSE, code_prefix = \"^E11\") {\n  setDT(claims); setDT(cohort)\n  c <- merge(claims, cohort[, .(person_id, index_date, enroll_end, arm)], by = \"person_id\")\n  c <- c[service_date >= index_date & service_date <= enroll_end]\n  if (attributable_only) c <- c[grepl(code_prefix, toupper(dx1))]\n\n  # Person-months per patient; zero-utilization patients retain their denominator.\n  pt <- cohort[, person_months := as.numeric(pmax(enroll_end - index_date, 0)) / DAYS_PER_MONTH]\n\n  counts <- dcast(c[, .N, by = .(person_id, claim_type)],\n                  person_id ~ claim_type, value.var = \"N\", fill = 0)\n  out <- merge(pt, counts, by = \"person_id\", all.x = TRUE)\n  settings <- intersect(c(\"inpatient\",\"ed\",\"outpatient\",\"pharmacy\"), names(out))\n  for (s in settings) out[is.na(get(s)), (s) := 0]\n\n  total_pm <- sum(out$person_months)\n  rates <- data.table(setting = settings,\n                      pppm = sapply(settings, function(s) sum(out[[s]]) / total_pm))\n  rates[, pppy := pppm * 12]\n  list(patient_level = out, rates = rates)\n}\n\nres <- hcru_rates(claims, cohort)\nprint(res$rates)\n\n# Negative-binomial rate model: exp(coef) for arm = adjusted inpatient rate ratio.\nd <- res$patient_level[person_months > 0]\nnb <- glm.nb(inpatient ~ arm + age + sex + offset(log(person_months)), data = d)\nsummary(nb)",
        "description": "All-cause / POS HCRU PPPM-PPPY and a negative-binomial rate model (MASS::glm.nb) with a person-time offset.\nInputs mirror the Python version:\n  claims : person_id, service_date (Date), pos, revenue_code, dx1, claim_type\n           in {'inpatient','ed','outpatient','pharmacy'} (transfers/ED-rollups already collapsed)\n  cohort : person_id, index_date (Date), arm, enroll_end (Date), age, sex",
        "dependencies": [
          "data.table",
          "MASS"
        ],
        "source_citations": [
          "mihaylova-2011"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* Per-patient person-months; non-utilizers keep their denominator (LEFT JOIN below restores zeros). */\nproc sql;\n  create table pt as\n  select c.person_id, c.arm, c.age, c.sex, c.index_date, c.enroll_end,\n         max(0, (c.enroll_end - c.index_date)) / 30.44 as person_months\n  from work.cohort c;\nquit;\n\n/* Event counts by setting, restricted to the follow-up window [index_date, enroll_end].\n   Set attributable=1 to require a T2DM code (E11) on the claim (any-position attribution). */\n%let attributable = 0;\nproc sql;\n  create table evt as\n  select cl.person_id, cl.claim_type, count(*) as n_events\n  from work.claims cl\n       inner join work.cohort co on cl.person_id = co.person_id\n  where cl.service_date between co.index_date and co.enroll_end\n    %if &attributable = 1 %then %do; and upcase(cl.dx1) like 'E11%' %end;\n  group by cl.person_id, cl.claim_type;\nquit;\n\n/* Wide patient-level table with zero-filled inpatient/ed/outpatient/pharmacy counts. */\nproc transpose data=evt out=evt_w(drop=_name_) prefix=cnt_;\n  by person_id;\n  id claim_type;\n  var n_events;\nrun;\n\ndata analytic;\n  merge pt(in=a) evt_w;\n  by person_id;\n  if a;\n  array cn cnt_:;\n  do over cn; if cn = . then cn = 0; end;\n  if person_months > 0 then log_pt = log(person_months);\nrun;\n\n/* Cohort-level PPPM / PPPY by setting (all-cause). 
*/\nproc sql;\n  create table rates as\n  select 'inpatient' as setting length=12,\n         sum(cnt_inpatient)/sum(person_months) as pppm,\n         calculated pppm * 12 as pppy\n  from analytic\n  union all\n  select 'ed' as setting, sum(cnt_ed)/sum(person_months) as pppm, calculated pppm * 12 as pppy from analytic\n  union all\n  select 'outpatient' as setting, sum(cnt_outpatient)/sum(person_months) as pppm, calculated pppm * 12 as pppy from analytic\n  union all\n  select 'pharmacy' as setting, sum(cnt_pharmacy)/sum(person_months) as pppm, calculated pppm * 12 as pppy from analytic;\nquit;\n\n/* Negative-binomial rate model for inpatient HCRU with a log person-time offset.\n   exp(estimate) on arm = adjusted rate ratio; dist=negbin handles over-dispersion vs Poisson. */\nproc genmod data=analytic;\n  class arm (ref='0') sex / param=ref;\n  model cnt_inpatient = arm age sex / dist=negbin link=log offset=log_pt;\n  estimate 'arm rate ratio' arm 1 -1 / exp;\nrun;",
        "description": "All-cause / POS HCRU person-time, PPPM-PPPY aggregation (PROC SQL), and a negative-binomial rate model\n(PROC GENMOD, dist=negbin, log person-time offset). Required input datasets (post data-management):\n  work.claims : person_id, service_date, pos, revenue_code, dx1, claim_type\n                ('inpatient'/'ed'/'outpatient'/'pharmacy'); transfers & ED-to-admission rollups collapsed.\n  work.cohort : person_id, index_date, enroll_end (= min disenroll/death/data-end/index+365), arm, age, sex.\nenroll_end is computed upstream so person-time already reflects censoring at death/disenrollment.",
        "dependencies": [],
        "source_citations": [
          "mihaylova-2011"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  RAW[Adjudicated claims for enrolled person-time<br/>MA-only spans excluded] --> SPLIT{Benefit / file}\n  SPLIT -->|Medical: facility + professional| MED[POS + UB-04 revenue + CPT/HCPCS]\n  SPLIT -->|Pharmacy: NDC file| RX[NDC fills]\n  MED --> POS[Setting buckets:<br/>inpatient with LOS / ED / outpatient / SNF-HH-hospice]\n  POS --> ATTR{Attribution rule}\n  RX --> ATTR\n  ATTR -->|All-cause| AC[Count every event]\n  ATTR -->|Disease-specific| DS[Filter on condition Dx / algorithm]\n  AC --> PT[Person-time denominator<br/>enrolled days to disenroll/death/end / 30.44]\n  DS --> PT\n  PT --> RATE[PPPM / PPPY rates by setting]\n  RATE --> MODEL[Negative-binomial model<br/>offset = log person-time; death as competing event]",
        "caption": "HCRU data flow from adjudicated claims through the medical/pharmacy split, place-of-service bucketing, all-cause vs disease-specific attribution, person-time denominator construction, and the count-model layer.",
        "alt_text": "Flowchart showing claims split into medical and pharmacy, bucketed by place of service, filtered by all-cause or disease-specific attribution, divided by person-time into PPPM/PPPY rates, and fed to a negative-binomial model.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "gantt\n  title HCRU observation window for one patient (claims, monthly bins)\n  dateFormat YYYY-MM-DD\n  axisFormat %b %Y\n  section Baseline\n  Continuous enrollment + covariate lookback :done, base, 2023-01-01, 2023-12-31\n  section Time zero\n  Index event / first qualifying fill :milestone, t0, 2024-01-01, 0d\n  section Follow-up (person-time accrues monthly)\n  Observed person-time -> denominator :active, ft, 2024-01-01, 240d\n  Inpatient admission (counts once; LOS tracked) :crit, ip, 2024-03-10, 6d\n  Censor at disenroll / death / data end :milestone, cen, 2024-08-28, 0d",
        "caption": "Person-time accrues monthly from time zero to the censoring point (disenrollment, death, or data end); each event is counted in its setting, and the accrued person-months form the PPPM/PPPY denominator.",
        "alt_text": "Gantt timeline showing baseline enrollment in 2023, an index event on 2024-01-01, follow-up person-time accruing through August 2024 with one inpatient admission, and a censoring point.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "used_with",
        "target_slug": "healthcare-costs-pppm-pppy-pmpm",
        "notes": "HCRU counts are the volume inputs to cost; events times unit costs (or allowed amounts) yield PPPM/PPPY costs. Report HCRU and costs together with POS and medical/pharmacy splits."
      },
      {
        "relation_type": "used_with",
        "target_slug": "poisson-negative-binomial-count-models",
        "notes": "Poisson/NB with a log person-time offset is the primary model for HCRU count endpoints; use NB for the over-dispersion and zero-inflation typical of utilization data."
      },
      {
        "relation_type": "see_also",
        "target_slug": "all-cause-vs-attributable-costs-rwe",
        "notes": "The all-cause vs disease-specific (vs incremental) attribution decision applies identically to HCRU counts and to costs and must be pre-specified for both."
      },
      {
        "relation_type": "see_also",
        "target_slug": "person-time-denominator-construction-rwe",
        "notes": "PPPM/PPPY rates require correctly accrued person-time censored at disenrollment/death/data-end; the denominator construction is shared with incidence-rate methods."
      },
      {
        "relation_type": "see_also",
        "target_slug": "medicare-ffs-ma-commercial-claims-differences-rwe",
        "notes": "MA-only person-time lacks complete FFS claims, biasing HCRU rates downward; restrict to observable benefit types before computing denominators."
      },
      {
        "relation_type": "see_also",
        "target_slug": "continuous-enrollment-observable-time-rwe",
        "notes": "Continuous enrollment makes absence of claims a true zero rather than missingness, a prerequisite for valid all-cause HCRU."
      },
      {
        "relation_type": "see_also",
        "target_slug": "cost-outlier-handling-rwe",
        "notes": "Super-utilizers create outliers in HCRU counts and derived costs; pre-specify winsorization, robust models, or retention for both."
      },
      {
        "relation_type": "see_also",
        "target_slug": "competing-risks-cause-specific-fine-gray-rwe",
        "notes": "Differential mortality reduces downstream HCRU in the higher-risk arm; treat death as a competing event rather than censoring that inflates rates."
      },
      {
        "relation_type": "see_also",
        "target_slug": "procedure-identification-and-measurement-in-claims-ehr",
        "notes": "Procedure-based HCRU shares coding (CPT/HCPCS, revenue codes) and the immortal-time risk of defining treated status by a downstream procedure."
      },
      {
        "relation_type": "see_also",
        "target_slug": "treatment-patterns-lines-of-therapy",
        "notes": "Treatment patterns (persistence, line of therapy) predict and are associated with HCRU levels."
      },
      {
        "relation_type": "part_of",
        "target_slug": "burden-of-disease-cost-of-illness",
        "notes": "HCRU is a core component of burden-of-disease and cost-of-illness estimates."
      },
      {
        "relation_type": "used_with",
        "target_slug": "health-economic-modeling-methods-rwe",
        "notes": "HCRU measurement and attribution are foundational inputs to RWE-informed economic models (BIA, CEA, Markov)."
      }
    ],
    "aliases": [
      "HCRU",
      "healthcare resource utilization",
      "healthcare resource use",
      "health care utilization",
      "resource use",
      "medical service utilization"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "hta",
      "pqa-cms"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "health-economic-modeling-methods-rwe",
    "name": "Health Economic Modeling Methods Using RWE",
    "short_definition": "The family of decision-analytic modeling methods (cohort Markov, partitioned-survival, discrete-event simulation) used to translate real-world evidence on transitions, costs, utilities, and survival into lifetime cost-effectiveness estimates for HTA, including the rate-to-probability, extrapolation, discounting, and probabilistic-sensitivity machinery that connects RWE inputs to an ICER.",
    "long_description": "**Health economic (HE) modeling using real-world evidence** is the practice of parameterizing and running a\ndecision-analytic model — most often a cohort-state-transition (Markov) model, a partitioned-survival model (PSM), or a\ndiscrete-event simulation (DES) — from real-world data so that observed event rates, resource use, costs, and survival are\nprojected over a lifetime horizon into an incremental cost-effectiveness ratio (ICER) and net monetary benefit (NMB). The\nmodel is the scaffold; RWE supplies the numbers that fill it. Done well, the model inherits the strengths of RWE (real\npopulations, real costs, long follow-up) and disciplines its weaknesses (selection, censoring, unmeasured confounding)\nthrough structural assumptions, extrapolation methods, and probabilistic sensitivity analysis (PSA). Done badly, it\nlaunders confounded or right-censored RWE into a single deterministic ICER that looks authoritative and is wrong.\n\n**Core conceptual distinction.** The decision that does the work is *which model structure the available RWE can actually\nsupport*, and that choice flows from the evidence, not the other way around. (1) *Cohort Markov* assumes the population is\nfully described by mutually exclusive health states with memoryless transition probabilities per cycle; it is the natural\nfit when RWE yields incidence-density transition rates between clinically meaningful states (e.g., stable -> progressed ->\ndead) and the Markov assumption is tenable after enough states/tunnels are added. (2) *Partitioned-survival* sidesteps\ntransition probabilities entirely: it derives state membership from independently modeled, extrapolated survival curves\n(e.g., overall survival and progression-free survival from a cohort), so it lives or dies on **survival extrapolation\nbeyond observed follow-up** and on the internal coherence of the curves (PFS must never exceed OS). 
(3) *DES* drops the\nMarkov memorylessness restriction and models time-to-event and individual histories explicitly — the right tool when\nRWE shows strong time- or history-dependence (e.g., transplant waitlists, repeated hospitalizations) that a cohort Markov\ncannot represent without an explosion of tunnel states. The estimand is always *incremental* (treatment vs comparator over\na stated horizon and perspective), and three transformations are non-negotiable and routinely botched: converting\nRWE **rates to per-cycle probabilities** (p = 1 - exp(-rate * cycle_length); never use rate as a probability), applying a\n**half-cycle correction** so events are not all charged to cycle ends, and **discounting** costs and effects to present\nvalue at the jurisdiction's rate (commonly 3.0% or 3.5%). The single-transition formula p = 1 - exp(-rate * cycle_length)\nis exact only when a state has one exit; when a state has multiple competing exits, applying it independently per exit\nmisallocates probability (the per-exit probabilities can sum past 1 and ignore the chance of a competing event pre-empting\nthe modeled one). The correct multi-exit conversion uses the matrix exponential of the rate (generator) matrix,\nP = exp(QΔt) (Welton & Ades 2005), which is the formally consistent map from competing cause-specific rates to a\nper-cycle transition matrix.\n\n**Pros, cons, and trade-offs.**\n- **vs trial-data-only economic models:** RWE-parameterized models capture real-world adherence, off-protocol resource use,\n  comparator mix, and costs in the target health system, and can extend follow-up far beyond a trial's horizon. 
Cost: RWE\n  transition rates and survival are confounded and informatively censored in ways a randomized trial is not, so the model\n  inherits that bias unless the upstream RWE is itself causal (active-comparator new-user design, PS/IPTW, g-methods).\n  **Prefer RWE inputs** for costs, utilities in routine care, and long-horizon survival; **prefer trial inputs** (or RWE\n  explicitly de-confounded) for the comparative treatment effect.\n- **Cohort Markov vs partitioned-survival:** Markov forces an explicit, clinically interpretable transition structure and\n  handles competing risks and multiple absorbing states cleanly, but requires defensible per-cycle transition probabilities\n  and can need many tunnel states to defeat the memoryless assumption. PSM is simpler to fit from Kaplan-Meier-style RWE\n  survival and communicates well, but its independent curve extrapolation can produce incoherent state occupancy and is\n  highly sensitive to the parametric survival family chosen. **Prefer Markov** when transitions are the natural evidence and\n  competing risks matter; **prefer PSM** when survival curves are the primary RWE deliverable and the horizon is not far\n  beyond observed data — and always cross-check PSM against a state-transition model (NICE TSD19 expectation).\n- **Deterministic vs probabilistic (PSA):** A single deterministic ICER hides parameter uncertainty and is uninterpretable\n  for a reimbursement decision. PSA propagates input distributions (Beta/Dirichlet for probabilities, Gamma/log-normal for\n  costs, Beta for utilities) into a distribution of ICERs and a cost-effectiveness acceptability curve (CEAC). 
**Always run\n  PSA** for an HTA submission; deterministic one-way analysis is a supplement (tornado), never the headline.\n\n**When to use.** Building or critiquing a cost-effectiveness/cost-utility model for an HTA submission (NICE, ICER-US,\nCDA-AMC (formerly CADTH), G-BA) where lifetime outcomes must be projected from finite RWE; estimating budget impact or value when transitions,\ncosts, or survival are best observed in claims/EHR/registry data rather than a trial; updating a trial-based model with\nreal-world comparator survival, real-world costs, or real-world treatment patterns; or running value-of-information /\nscenario analyses that require a transparent, re-runnable model.\n\n**When NOT to use — and when it is actively misleading or dangerous.**\n- **The comparative effect is confounded and you treat it as causal.** Plugging naive RWE hazard ratios (prevalent users,\n  drug-vs-non-user, immortal-time-contaminated cohorts) into the relative-effect parameter produces a precise, fully\n  discounted ICER built on a biased treatment effect — the most dangerous failure because PSA will *not* reveal it (PSA\n  propagates parameter uncertainty, not structural confounding). Fix it upstream with an active-comparator new-user design\n  and proper confounding control before the number ever reaches the model.\n- **Extrapolation far beyond observed follow-up with no anchor.** When 18 months of registry survival is projected to a\n  40-year horizon, the parametric family (exponential vs Weibull vs generalized gamma vs spline) drives the result more than\n  the data do. 
Reporting one curve as if it were known is misleading; show the family, justify it with hazard plots and\n  external/general-population mortality constraints, and treat the choice as structural uncertainty.\n- **Forcing a Markov model onto strongly history-dependent RWE.** If real-world risk depends on time-in-state or prior\n  events, a memoryless cohort Markov misstates occupancy; either add tunnel states (and accept the parameter burden) or move\n  to DES. Pretending the Markov assumption holds is a structural error PSA cannot detect.\n- **A deterministic point ICER presented to a payer.** Without PSA/CEAC, the decision-maker cannot see decision uncertainty;\n  several HTA bodies will reject the submission outright.\n\n**Data-source operational depth.**\n- **Claims (FFS vs MA vs commercial):** Excellent for resource use and costs (PMPM/PMPY by health state), reasonable for\n  coded events that mark transitions, weak for clinical severity and true progression. Transition *rates* are estimated as\n  events / person-time within a continuously enrolled cohort, then converted to per-cycle probabilities. 
Failure modes:\n  (i) **Medicare Advantage (MA) encounter records lack adjudicated paid amounts**, so cost parameters built on MA person-time\n  are unreliable: restrict costing to FFS (Parts A/B/D) person-time or use validated cost-to-charge bridges; (ii)\n  **differential competing risks by exposure in elderly claims**: if one arm's patients die of other causes faster, treating\n  those deaths as censoring (1-minus-KM) overstates the cumulative probability of the modeled event, so estimate a\n  cause-specific rate for every exit (death included) and derive transitions from cumulative-incidence / competing-risk\n  methods, not 1-minus-KM; (iii) **immortal time in procedure or progression studies** inflates survival\n  in the treated state if state-entry is mis-timed relative to the index date.\n- **EHR:** Strong for the clinical variables that define states and progression (labs, stage, ECOG, problem lists) and for\n  utility-relevant severity, but visit-driven and porous: patients who leave the system are differentially censored, so\n  transition rates and survival are biased unless loss to follow-up is modeled and linkage to a death index firms up the\n  absorbing state. Costs are usually charges, not paid amounts; convert before using as economic inputs.\n- **Registry:** Best source for adjudicated progression/death and disease severity that drive transitions and survival\n  extrapolation, but typically thin on full cost capture and on out-of-registry care. 
Link to claims for complete costs and\n  resource use and to vital records to close out mortality.\n- **Linked claims-EHR-vital records:** The ideal substrate (EHR severity to define states, claims for paid costs and\n  person-time, vital records for the absorbing state), but linkage selection (only the linkable subset) and date\n  reconciliation across order/fill/service/death dates must be resolved before transitions and time zero are assigned, or\n  the model is parameterized on a non-representative, mis-timed cohort.\n\n**Worked claims example.** Question: lifetime cost-utility of a new heart-failure (HF) therapy vs standard care, modeled as\na 3-state cohort Markov (Stable -> Hospitalized-HF -> Dead) with a 1-month cycle and 20-year horizon, discounted at 3%.\n(1) Cohort: adults with >=2 HF diagnoses and 365 days of continuous A/B enrollment before `index_date`; restrict costing to\nFFS person-time (drop MA-only spans where paid amounts are missing). (2) Transitions from RWE: count first\nhospitalized-HF events and deaths and divide by person-months at risk to get monthly cause-specific *rates*; because\nStable has two competing exits (Hospitalized, Dead), convert the rate matrix to a per-cycle transition matrix with the\nmatrix exponential P = exp(QΔt) rather than applying p = 1 - exp(-rate) per exit, so the Stable->Dead and\nStable->Hospitalized transitions share the correct at-risk denominator and the row sums to 1.\n(3) Costs: aggregate paid amounts from `claim_lines` to\nPMPM by state (Stable PMPM from outpatient + pharmacy; Hospitalized-HF from the DRG-paid inpatient claim plus the\nindex-month outpatient costs), inflation-adjusted to a common base year. (4) Utilities: assign a health-state utility weight to\neach state (e.g., Stable 0.80, Hospitalized 0.60 for the event month) from an EQ-5D mapping, and multiply it by the cycle\nlength in years (1/12) to get the QALYs accrued per cycle. 
(5) Effect: take the\ntreatment's hospitalization and mortality hazard ratios from an **active-comparator new-user, PS-weighted** RWE analysis\n(not a naive contrast), and apply them to the standard-care transition *rates* before converting to per-cycle\nprobabilities (hazard ratios act on rates, not on probabilities). (6) Run the trace with a half-cycle\ncorrection, discount costs and QALYs monthly, and compute incremental cost, incremental QALYs, the ICER, and NMB at a\n$100,000/QALY threshold. (7) PSA: redraw transition probabilities from Dirichlet, costs from Gamma, utilities from Beta,\nand the hazard ratios from their sampling distribution across 5,000 iterations; summarize the CEAC across willingness-to-pay\nthresholds. Report per CHEERS 2022, stating the perspective, horizon, discount rate, and structural assumptions explicitly.\n\n**Interpreting the output**\n\nThe worked example produces an ICER of $50,000/QALY: the treatment arm costs $72,000 and yields 5.20 QALYs;\nthe standard-care arm costs $52,000 and yields 4.80 QALYs. Incremental cost = $20,000; incremental QALYs = 0.40;\nICER = $20,000 / 0.40 = $50,000/QALY, below the $100,000/QALY threshold\n(NMB = $100,000 x 0.40 - $20,000 = $20,000).\n\n*(1) Formal interpretation.* This result rests on the full health-economic modeling chain: the\nright model type (cohort Markov, partitioned-survival, or DES) was selected based on clinical pathway\ncomplexity; transition probabilities were sourced from RWE and adjusted for confounding; costs were applied\nper state and discounted; QALYs were calculated with utility weights from mapping or direct measurement;\nand PSA characterized decision uncertainty. The ICER of $50,000/QALY is the ratio of incremental quantities\nfrom both arms; it is not the average cost per QALY for the treatment arm alone. 
A CEAC from the PSA\ncompletes the interpretation by quantifying the probability that the treatment is cost-effective across a\nrange of willingness-to-pay values.\n\n*(2) Practical interpretation.* The $50,000/QALY ICER is a summary number, not a finding in itself: the reader\nshould ask which model structure generated it and what the key RWE inputs were. A Markov model parameterized\nwith biased RWE transition probabilities will produce a confident-looking ICER that is systematically wrong;\na DES that misrepresents event sequencing will similarly mislead. The model structure, data sources, and\nkey assumptions must be reported transparently alongside the ICER so decision-makers can assess how much\nstructural uncertainty (which the PSA does not capture) surrounds the headline number. Consult the\nsibling entries (Markov, DES, PSA, discounting, scenario analysis) for the mechanics that produced this output.",
    "primary_category": "Economic_Evaluation",
    "tags": [
      "health-economic-modeling",
      "cost-effectiveness",
      "cost-utility",
      "markov-model",
      "partitioned-survival",
      "decision-analytic-model",
      "probabilistic-sensitivity-analysis",
      "hta"
    ],
    "applies_to_study_types": [
      "claims_analysis",
      "ehr_study",
      "registry_linkage",
      "target_trial_emulation"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1016/j.jval.2012.06.012",
        "url": "https://doi.org/10.1016/j.jval.2012.06.012",
        "citation_text": "Caro JJ, Briggs AH, Siebert U, Kuntz KM. Modeling good research practices--overview: a report of the ISPOR-SMDM Modeling Good Research Practices Task Force-1. Value in Health. 2012;15(6):796-803.",
        "year": 2012,
        "authors_short": "Caro et al.",
        "notes": "Foundational ISPOR-SMDM statement defining decision-analytic modeling good practice, model-type selection (Markov, partitioned-survival, DES), and the structural/parameter/uncertainty backbone these RWE-fed models must satisfy."
      },
      {
        "role": "explain",
        "doi": "10.1016/j.jval.2017.08.3019",
        "url": "https://doi.org/10.1016/j.jval.2017.08.3019",
        "citation_text": "Berger ML, Sox H, Willke RJ, et al. Good practices for real-world data studies of treatment and/or comparative effectiveness: recommendations from the joint ISPOR-ISPE special task force on real-world evidence in health care decision making. Value in Health. 2017;20(8):1003-1008.",
        "year": 2017,
        "authors_short": "Berger et al.",
        "notes": "Joint ISPOR-ISPE good-practice guidance for the RWD studies that supply transition, cost, and effect parameters to these models; sets the bar for credible real-world inputs."
      },
      {
        "role": "explain",
        "doi": "10.1177/0272989X05282637",
        "url": "https://doi.org/10.1177/0272989X05282637",
        "citation_text": "Welton NJ, Ades AE. Estimation of Markov chain transition probabilities and rates from fully and partially observed data: uncertainty propagation, evidence synthesis, and model calibration. Medical Decision Making. 2005;25(6):633-645.",
        "year": 2005,
        "authors_short": "Welton & Ades",
        "notes": "Canonical source for the rate-to-probability conversion via the matrix exponential P = exp(QΔt), which correctly handles states with multiple competing exits where the per-exit 1 - exp(-rate) formula misallocates probability."
      },
      {
        "role": "explain",
        "doi": "10.1016/j.jval.2021.11.1351",
        "url": "https://doi.org/10.1016/j.jval.2021.11.1351",
        "citation_text": "Husereau D, Drummond M, Augustovski F, et al. Consolidated Health Economic Evaluation Reporting Standards 2022 (CHEERS 2022) statement: updated reporting guidance for health economic evaluations. Value in Health. 2022;25(1):3-9.",
        "year": 2022,
        "authors_short": "Husereau et al.",
        "notes": "Current reporting standard for HE evaluations; mandates explicit statement of perspective, horizon, discount rate, model structure, and uncertainty handling for the models described here."
      },
      {
        "role": "demonstrate",
        "doi": "10.1177/0272989X12472398",
        "url": "https://doi.org/10.1177/0272989X12472398",
        "citation_text": "Latimer NR. Survival analysis for economic evaluations alongside clinical trials--extrapolation with patient-level data: inconsistencies, limitations, and a practical guide. Medical Decision Making. 2013;33(6):743-754.",
        "year": 2013,
        "authors_short": "Latimer",
        "notes": "NICE DSU TSD14 source; the practical guide to the survival-extrapolation step that drives partitioned-survival and Markov-survival inputs when RWE follow-up is shorter than the model horizon."
      }
    ],
    "plain_language_summary": "A health economic model is a mathematical structure that uses real-world data to project what happens to a group of patients over many years and what those outcomes cost — information no single study can supply on its own. The most common type is a Markov model, which sorts patients into a small number of health states (for example, Stable, Hospitalized, and Dead), estimates how fast patients move between those states using data from claims or registries, and attaches a dollar cost and a quality-of-life weight to each state. Running the model forward in time produces an incremental cost-effectiveness ratio (ICER) — the extra cost per extra year of good health gained by one treatment compared with another — which is the number health technology assessment bodies use to decide whether a therapy is worth paying for.",
    "key_terms": [
      {
        "term": "decision model",
        "definition": "A mathematical structure that simulates what happens to a group of patients over time under different treatment options, using estimated rates, costs, and quality-of-life values as inputs."
      },
      {
        "term": "Markov model",
        "definition": "A type of decision model that divides patients into mutually exclusive health states and moves them between states cycle by cycle using fixed transition probabilities, like moving tokens on a board each month."
      },
      {
        "term": "health state",
        "definition": "A distinct clinical category — for example Stable, Hospitalized, or Dead — that a patient occupies at any given cycle of the model; each state carries its own cost and quality-of-life score."
      },
      {
        "term": "ICER",
        "definition": "Incremental cost-effectiveness ratio: the additional cost of the new treatment divided by the additional years of good health it produces, expressed as dollars per quality-adjusted life year (QALY)."
      },
      {
        "term": "QALY",
        "definition": "Quality-adjusted life year: one year of life multiplied by a 0-to-1 quality weight, where 1.0 is perfect health and 0 is a state equivalent to death."
      },
      {
        "term": "transition probability",
        "definition": "The chance that a patient moves from one health state to another in a single model cycle, estimated from real-world event rates in claims or registry data."
      }
    ],
    "worked_example": {
      "scenario": "A researcher wants to know whether a new heart-failure pill is worth its cost compared with standard care. She builds a 3-state Markov model — Stable, Hospitalized, Dead — with monthly cycles over 10 years. She estimates transition probabilities from a Medicare claims cohort, assigns per-state costs from paid claims amounts, and assigns per-state quality-of-life weights from a published EQ-5D mapping. The new drug reduces hospitalization and death rates by 20% but costs $500 extra per month. She wants to know the ICER.",
      "dataset": {
        "caption": "Model structure: one row per health state with the inputs RWE supplies",
        "columns": [
          "health_state",
          "monthly_transition_to_hospitalized",
          "monthly_transition_to_dead",
          "monthly_cost_usd",
          "qaly_weight_per_month"
        ],
        "rows": [
          [
            "Stable",
            "0.012 (from RWE rate: 12 events / 1,000 person-months)",
            "0.004 (from RWE rate: 4 deaths / 1,000 person-months)",
            "$950",
            "0.067 (= 0.80 annual QALY / 12)"
          ],
          [
            "Hospitalized",
            "N/A — exits only to Stable or Dead",
            "0.060 (from RWE rate: 60 deaths / 1,000 person-months)",
            "$18,500",
            "0.050 (= 0.60 annual QALY / 12)"
          ],
          [
            "Dead",
            "N/A",
            "N/A (absorbing state)",
            "$0",
            "0.000"
          ]
        ]
      },
      "steps": [
        "Start with 1,000 simulated patients in the Stable state at month 0.",
        "Each month, apply transition probabilities: about 12 Stable patients move to Hospitalized, 4 move to Dead, and the rest stay Stable.",
        "Patients in Hospitalized either recover to Stable, die, or remain (recovery rate ~0.55/month from claims).",
        "Multiply how many patients are in each state each month by that state's cost and QALY weight, then sum across all 120 months (10 years).",
        "Discount future costs and QALYs at 3% per year so that a dollar spent next year counts slightly less than one spent today.",
        "Run the model twice: once with standard-care transition probabilities and once with the new drug's probabilities (both hospitalization and death rates multiplied by 0.80), adding $500/month drug cost in the treatment arm.",
        "Subtract: incremental cost = total treatment cost minus total standard-care cost; incremental QALYs = total treatment QALYs minus total standard-care QALYs.",
        "Divide incremental cost by incremental QALYs to get the ICER."
      ],
      "result": "Standard-care arm: discounted total cost ~$52,000, discounted total QALYs ~4.80. Treatment arm: discounted total cost ~$72,000, discounted total QALYs ~5.20. Incremental cost = $20,000; incremental QALYs = 0.40. ICER = $20,000 / 0.40 = $50,000 per QALY — below the common $100,000/QALY threshold, so the treatment looks cost-effective. Note: the transition probabilities come from RWE; if the drug effect (the 20% rate reduction) were taken from a biased observational comparison instead of a carefully designed study, this ICER would be wrong even though all the arithmetic is correct."
    },
    "prerequisites": [
      "cost-effectiveness",
      "cost-utility",
      "markov-transition-probabilities-rwe",
      "qaly-utility-mapping-rwe",
      "icer-net-monetary-benefit-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Cohort state-transition (Markov) model from RWE transition rates",
        "description": "Mutually exclusive health states with per-cycle transition probabilities derived from RWE incidence-density rates (rate-to-probability conversion), run as a cohort trace with half-cycle correction and discounting.",
        "edge_cases": [
          "Memorylessness fails when real-world risk depends on time-in-state or prior events; add tunnel states or move to DES.",
          "Using a cause-specific rate as if it were a marginal probability double-counts at-risk time under competing risks."
        ],
        "data_source_notes": "claims: events / person-months at risk give the cause-specific rate, then P = exp(QΔt) on the rate (generator) matrix for states with competing exits (p = 1 - exp(-rate) only for a single-exit state); restrict costing to FFS person-time where paid amounts exist."
      },
      {
        "name": "Partitioned-survival model (PSM) from extrapolated RWE survival curves",
        "description": "State occupancy derived from independently modeled OS and PFS curves rather than transition probabilities; requires parametric survival extrapolation beyond observed follow-up.",
        "edge_cases": [
          "Independent extrapolation can yield incoherent occupancy (PFS exceeding OS) and is highly sensitive to the parametric family.",
          "NICE TSD19 expects a state-transition cross-check; a lone PSM is increasingly contested."
        ],
        "data_source_notes": "registry/EHR: adjudicated progression and death define the curves; constrain long-horizon hazards with general-population mortality."
      },
      {
        "name": "Probabilistic sensitivity analysis (PSA) over RWE-derived parameters",
        "description": "Input parameters are assigned distributions (Dirichlet/Beta for transitions, Gamma/log-normal for costs, Beta for utilities, sampling distribution for hazard ratios) and propagated to a distribution of ICERs and a CEAC.",
        "edge_cases": [
          "PSA propagates parameter uncertainty only; it cannot reveal structural confounding in a biased RWE treatment effect.",
          "Correlated transitions should be drawn jointly (Dirichlet) rather than as independent Betas."
        ],
        "data_source_notes": "Parameter variances come from the RWE itself (event counts, cost SEs, effect-estimate confidence intervals)."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Trial-data-only economic models",
        "pros_of_this": "Capture real-world adherence, comparator mix, routine-care costs, and longer follow-up in the target health system.",
        "cons_of_this": "Inherit RWE confounding and informative censoring unless upstream inputs are explicitly de-confounded; relative treatment effect is the highest-risk parameter to source from naive RWE.",
        "when_to_prefer": "For costs, real-world utilities, treatment patterns, and long-horizon survival; pair with a causal design for the comparative effect."
      },
      {
        "compared_to": "Partitioned-survival model",
        "pros_of_this": "Cohort Markov gives an explicit, clinically interpretable transition structure and handles competing risks and multiple absorbing states cleanly.",
        "cons_of_this": "Requires defensible per-cycle transition probabilities and may need many tunnel states to satisfy memorylessness.",
        "when_to_prefer": "When transitions are the natural RWE evidence and competing risks matter; PSM when survival curves are the primary deliverable and the horizon is near observed data."
      },
      {
        "compared_to": "Deterministic (point-estimate) analysis",
        "pros_of_this": "PSA propagates input uncertainty into a distribution of ICERs and a CEAC, making decision uncertainty visible and meeting HTA expectations.",
        "cons_of_this": "More computationally and specification-intensive; can be misread as capturing structural uncertainty, which it does not.",
        "when_to_prefer": "Always for an HTA submission; deterministic one-way (tornado) analysis is a supplement, not the headline."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Estimate transitions as events / person-months at risk in a continuously enrolled cohort, then convert to a per-cycle transition matrix via the matrix exponential P = exp(QΔt) for states with competing exits (p = 1 - exp(-rate) is exact only for a single-exit state); aggregate paid amounts to PMPM by state and restrict costing to FFS person-time where paid amounts exist (MA encounter records lack adjudicated paid amounts); use competing-risks transitions, not 1-minus-KM.",
      "ehr": "Use labs/stage/ECOG and problem lists to define states and progression; model loss to follow-up because visit-driven capture differentially censors patients who leave the system; convert charges to paid amounts before using as cost inputs; link to a death index to firm up the absorbing state.",
      "registry": "Best for adjudicated progression/death and severity that drive transitions and survival extrapolation; link to claims for complete costs and to vital records to close out mortality; constrain long-horizon survival with general-population mortality.",
      "linked": "EHR severity + claims paid costs + vital-records mortality is the ideal substrate, but linkage selection and order/fill/service/death date reconciliation must be resolved before transitions and time zero are assigned."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\n\nCYCLE_LEN = 1 / 12          # 1-month cycle expressed in years\nHORIZON_M = 240             # 20-year horizon in monthly cycles\nDISCOUNT_ANNUAL = 0.03      # 3% per year for costs and effects\nWTP = 100_000.0             # willingness-to-pay per QALY\nSTATES = [\"stable\", \"hosp\", \"dead\"]\n\ndef rate_to_prob(rate, t=1.0):\n    # Single-transition conversion: EXACT ONLY for a state with one exit.\n    # With multiple competing exits this misallocates probability (per-exit\n    # values can sum past 1 and ignore competing-event pre-emption); use the\n    # matrix-exponential conversion below instead.\n    return 1.0 - np.exp(-rate * t)\n\ndef _expm(M, terms=40):\n    # Dependency-free matrix exponential via scaling-and-squaring + Taylor\n    # series, so P = exp(Q*dt) stays runnable on numpy alone.\n    norm = np.max(np.sum(np.abs(M), axis=1))\n    s = max(0, int(np.ceil(np.log2(norm + 1e-12))))\n    A = M / (2 ** s)\n    E = np.eye(M.shape[0]); term = np.eye(M.shape[0])\n    for k in range(1, terms):\n        term = term @ A / k\n        E = E + term\n    for _ in range(s):\n        E = E @ E\n    return E\n\ndef transition_matrix(rates, t=1.0):\n    # Correct multi-exit conversion: build the cause-specific generator\n    # (rate/intensity) matrix Q with off-diagonal rates and row sums of 0,\n    # then P = exp(Q*t) (Welton & Ades 2005). 
This is row-stochastic by\n    # construction and respects competing risks; 'dead' is absorbing.\n    r_sh = rates[\"stable_to_hosp\"]\n    r_sd = rates[\"stable_to_dead\"]\n    r_hd = rates[\"hosp_to_dead\"]\n    r_hs = rates[\"hosp_to_stable\"]\n    Q = np.array([\n        [-(r_sh + r_sd), r_sh,           r_sd],   # from stable\n        [r_hs,           -(r_hs + r_hd), r_hd],   # from hosp\n        [0.0,            0.0,            0.0 ],    # from dead (absorbing)\n    ])\n    P = _expm(Q * t)\n    return P\n\ndef apply_treatment(rates, hr):\n    # Apply hazard ratios to the comparator's event rates for the treated arm.\n    r = dict(rates)\n    r[\"stable_to_hosp\"] *= hr[\"hosp\"]\n    r[\"stable_to_dead\"] *= hr[\"dead\"]\n    r[\"hosp_to_dead\"]   *= hr[\"dead\"]\n    return r\n\ndef run_arm(rates, cost_pmpm, utility, extra_cost_pmpm=0.0):\n    P = transition_matrix(rates)\n    trace = np.zeros((HORIZON_M + 1, 3))\n    trace[0] = [1.0, 0.0, 0.0]                       # everyone starts Stable\n    for t in range(HORIZON_M):\n        trace[t + 1] = trace[t] @ P\n    # Half-cycle correction: average start/end occupancy within each cycle.\n    occ = (trace[:-1] + trace[1:]) / 2.0\n    disc_m = (1 + DISCOUNT_ANNUAL) ** (-(np.arange(HORIZON_M) * CYCLE_LEN))\n    cost_vec = np.array([cost_pmpm[\"stable\"] + extra_cost_pmpm,\n                         cost_pmpm[\"hosp\"]   + extra_cost_pmpm, 0.0])\n    qaly_vec = np.array([utility[\"stable\"], utility[\"hosp\"], 0.0]) * CYCLE_LEN\n    total_cost = float((occ @ cost_vec * disc_m).sum())\n    total_qaly = float((occ @ qaly_vec * disc_m).sum())\n    return total_cost, total_qaly\n\ndef evaluate(rates, hr, cost_pmpm, utility, tx_cost_pmpm):\n    c0, q0 = run_arm(rates, cost_pmpm, utility)                          # standard care\n    c1, q1 = run_arm(apply_treatment(rates, hr), cost_pmpm, utility,     # treatment\n                     extra_cost_pmpm=tx_cost_pmpm)\n    d_cost, d_qaly = c1 - c0, q1 - q0\n    icer = d_cost 
/ d_qaly if d_qaly != 0 else np.inf\n    nmb = WTP * d_qaly - d_cost\n    return {\"d_cost\": d_cost, \"d_qaly\": d_qaly, \"icer\": icer, \"nmb\": nmb}\n\ndef psa(rates, hr, cost_pmpm, utility, tx_cost_pmpm,\n        n_iter=5000, wtp_grid=np.arange(0, 200_001, 10_000), seed=42):\n    # Draw the Stable-row transitions jointly (Dirichlet), costs (Gamma),\n    # utilities (Beta), and hazard ratios (log-normal). The concentration (200),\n    # cost CV (0.2), utility precision (50), and log-HR SD (0.10) are\n    # placeholders; replace them with RWE-implied values.\n    rng = np.random.default_rng(seed)\n    prob_ce = {w: 0 for w in wtp_grid}\n    # First-exit probabilities from Stable under competing risks:\n    # [stay, exit to hosp, exit to dead].\n    lam = rates[\"stable_to_hosp\"] + rates[\"stable_to_dead\"]\n    p_exit = 1.0 - np.exp(-lam)\n    s = np.array([1.0 - p_exit,\n                  p_exit * rates[\"stable_to_hosp\"] / lam,\n                  p_exit * rates[\"stable_to_dead\"] / lam])\n    for _ in range(n_iter):\n        s_draw = rng.dirichlet(s * 200 + 1e-6)\n        r = dict(rates)\n        # Competing-risks back-conversion: total exit rate from P(stay), split in\n        # proportion to the drawn exit probabilities (not -log(1 - p) per exit).\n        lam_draw = -np.log(s_draw[0])\n        r[\"stable_to_hosp\"] = lam_draw * s_draw[1] / (1.0 - s_draw[0])\n        r[\"stable_to_dead\"] = lam_draw * s_draw[2] / (1.0 - s_draw[0])\n        cd = {k: rng.gamma(shape=25.0, scale=v / 25.0)                  # Gamma, mean v, CV = 0.2 (shape = 1/CV^2)\n              for k, v in cost_pmpm.items()}\n        ud = {k: rng.beta(v * 50, (1 - v) * 50) for k, v in utility.items()}\n        hd = {k: float(np.exp(rng.normal(np.log(v), 0.10))) for k, v in hr.items()}\n        res = evaluate(r, hd, cd, ud, tx_cost_pmpm)\n        for w in wtp_grid:\n            if (w * res[\"d_qaly\"] - res[\"d_cost\"]) > 0:\n                prob_ce[w] += 1\n    return {w: prob_ce[w] / n_iter for w in wtp_grid}                   # CEAC",
        "description": "Three-state cohort Markov cost-utility model (Stable -> Hospitalized-HF -> Dead) parameterized from RWE, with\nrate-to-probability conversion, half-cycle correction, discounting, ICER/NMB, and a PSA loop producing a CEAC.\nRequired input parameter tables (already estimated from a continuously enrolled, FFS-costed cohort):\n  rates_per_month : dict of monthly RWE incidence-density rates among the at-risk state, e.g.\n                    {'stable_to_hosp': 0.012, 'stable_to_dead': 0.004, 'hosp_to_dead': 0.06, 'hosp_to_stable': 0.55}\n  hr              : treatment hazard ratios from an ACTIVE-COMPARATOR NEW-USER, PS-weighted analysis (not a naive contrast)\n                    applied to the standard-care transition rates, e.g. {'hosp': 0.78, 'dead': 0.85}\n  cost_pmpm       : paid-amount PMPM by state, e.g. {'stable': 950.0, 'hosp': 18500.0}  # hosp = event-month cost\n  utility         : per-cycle QALY weight by state, e.g. {'stable': 0.80, 'hosp': 0.60}\n  tx_cost_pmpm    : incremental monthly drug cost in the treated arm\nOutputs incremental cost, incremental QALYs, ICER, NMB at the WTP threshold, and a PSA-derived CEAC.",
        "dependencies": [
          "numpy"
        ],
        "source_citations": [
          "caro-2012",
          "latimer-2013"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "CYCLE_LEN <- 1 / 12          # 1-month cycle in years\nHORIZON_M <- 240L            # 20-year horizon in monthly cycles\nDISCOUNT_ANNUAL <- 0.03\nWTP <- 100000\n\n# Single-transition conversion: EXACT ONLY for a state with one exit. With\n# multiple competing exits this misallocates probability; use the matrix\n# exponential of the generator (transition_matrix) instead.\nrate_to_prob <- function(rate, t = 1) 1 - exp(-rate * t)\n\n# Dependency-free matrix exponential (scaling-and-squaring + Taylor series),\n# so P = exp(Q*t) stays runnable on base R alone.\nexpm_base <- function(M, terms = 40L) {\n  norm <- max(rowSums(abs(M)))\n  s <- max(0L, as.integer(ceiling(log2(norm + 1e-12))))\n  A <- M / (2 ^ s)\n  n <- nrow(M); E <- diag(n); term <- diag(n)\n  for (k in seq_len(terms - 1L)) { term <- (term %*% A) / k; E <- E + term }\n  for (i in seq_len(s)) E <- E %*% E\n  E\n}\n\ntransition_matrix <- function(rates, t = 1) {\n  # Correct multi-exit conversion: build the cause-specific generator (rate)\n  # matrix Q (off-diagonal rates, row sums 0), then P = exp(Q*t)\n  # (Welton & Ades 2005). Row-stochastic by construction; respects competing\n  # risks; 'dead' is absorbing. 
Rows: from stable / hosp / dead.\n  r_sh <- rates[[\"stable_to_hosp\"]]; r_sd <- rates[[\"stable_to_dead\"]]\n  r_hd <- rates[[\"hosp_to_dead\"]];   r_hs <- rates[[\"hosp_to_stable\"]]\n  Q <- matrix(c(-(r_sh + r_sd), r_sh,           r_sd,\n                r_hs,           -(r_hs + r_hd), r_hd,\n                0,              0,              0),\n              nrow = 3, byrow = TRUE)\n  expm_base(Q * t)\n}\n\napply_treatment <- function(rates, hr) {\n  rates[[\"stable_to_hosp\"]] <- rates[[\"stable_to_hosp\"]] * hr[[\"hosp\"]]\n  rates[[\"stable_to_dead\"]] <- rates[[\"stable_to_dead\"]] * hr[[\"dead\"]]\n  rates[[\"hosp_to_dead\"]]   <- rates[[\"hosp_to_dead\"]]   * hr[[\"dead\"]]\n  rates\n}\n\nrun_arm <- function(rates, cost_pmpm, utility, extra_cost_pmpm = 0) {\n  P <- transition_matrix(rates)\n  trace <- matrix(0, nrow = HORIZON_M + 1, ncol = 3)\n  trace[1, ] <- c(1, 0, 0)                     # all start Stable\n  for (t in seq_len(HORIZON_M)) trace[t + 1, ] <- trace[t, ] %*% P\n  occ <- (trace[-(HORIZON_M + 1), ] + trace[-1, ]) / 2          # half-cycle correction\n  disc <- (1 + DISCOUNT_ANNUAL) ^ (-(seq_len(HORIZON_M) - 1) * CYCLE_LEN)\n  cost_vec <- c(cost_pmpm[[\"stable\"]] + extra_cost_pmpm,\n                cost_pmpm[[\"hosp\"]]   + extra_cost_pmpm, 0)\n  qaly_vec <- c(utility[[\"stable\"]], utility[[\"hosp\"]], 0) * CYCLE_LEN\n  list(cost = sum((occ %*% cost_vec) * disc),\n       qaly = sum((occ %*% qaly_vec) * disc))\n}\n\nevaluate <- function(rates, hr, cost_pmpm, utility, tx_cost_pmpm) {\n  sc <- run_arm(rates, cost_pmpm, utility)\n  tx <- run_arm(apply_treatment(rates, hr), cost_pmpm, utility, tx_cost_pmpm)\n  d_cost <- tx$cost - sc$cost; d_qaly <- tx$qaly - sc$qaly\n  list(d_cost = d_cost, d_qaly = d_qaly,\n       icer = d_cost / d_qaly, nmb = WTP * d_qaly - d_cost)\n}\n\npsa_ceac <- function(rates, hr, cost_pmpm, utility, tx_cost_pmpm,\n                     n_iter = 5000, wtp_grid = seq(0, 2e5, 1e4)) {\n  ce <- numeric(length(wtp_grid))\n  for (i 
in seq_len(n_iter)) {\n    rd <- rates\n    rd[[\"stable_to_hosp\"]] <- -log(1 - rbeta(1, 12, 988))     # Beta from event/person-time counts\n    rd[[\"stable_to_dead\"]] <- -log(1 - rbeta(1, 4, 996))\n    cd <- list(stable = rgamma(1, 25, scale = cost_pmpm[[\"stable\"]] / 25),\n               hosp   = rgamma(1, 25, scale = cost_pmpm[[\"hosp\"]]   / 25))\n    ud <- list(stable = rbeta(1, utility[[\"stable\"]] * 50, (1 - utility[[\"stable\"]]) * 50),\n               hosp   = rbeta(1, utility[[\"hosp\"]]   * 50, (1 - utility[[\"hosp\"]])   * 50))\n    hd <- list(hosp = exp(rnorm(1, log(hr[[\"hosp\"]]), 0.10)),\n               dead = exp(rnorm(1, log(hr[[\"dead\"]]), 0.10)))\n    res <- evaluate(rd, hd, cd, ud, tx_cost_pmpm)\n    ce <- ce + ((wtp_grid * res$d_qaly - res$d_cost) > 0)\n  }\n  data.frame(wtp = wtp_grid, prob_cost_effective = ce / n_iter)         # CEAC\n}",
        "description": "Vectorized three-state cohort Markov cost-utility model parameterized from RWE, with rate-to-probability conversion,\nhalf-cycle correction, discounting, ICER/NMB, and a PSA loop yielding a cost-effectiveness acceptability curve (CEAC).\nInputs mirror the Python version (named lists/vectors estimated from a continuously enrolled, FFS-costed cohort):\n  rates_per_month : monthly RWE rates c(stable_to_hosp, stable_to_dead, hosp_to_dead, hosp_to_stable)\n  hr              : treatment hazard ratios c(hosp, dead) from an active-comparator new-user, PS-weighted analysis\n  cost_pmpm       : paid-amount PMPM by state c(stable, hosp)\n  utility         : per-cycle QALY weights c(stable, hosp)\n  tx_cost_pmpm    : incremental monthly drug cost in the treated arm",
        "dependencies": [],
        "source_citations": [
          "caro-2012",
          "latimer-2013"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  RWE[RWE inputs from claims / EHR / registry / linked] --> Q1{What evidence does the RWE deliver?}\n  Q1 -->|Coded transitions + person-time<br/>competing risks matter| MK[Cohort Markov<br/>rate -> per-cycle probability]\n  Q1 -->|Survival curves OS / PFS<br/>horizon near observed data| PS[Partitioned-survival<br/>extrapolate curves]\n  Q1 -->|Strong time- / history-dependence| DE[Discrete-event simulation]\n  MK --> EFF{Comparative effect source?}\n  PS --> EFF\n  DE --> EFF\n  EFF -->|Naive RWE contrast| WARN[STOP: confound first via<br/>active-comparator new-user + PS/IPTW]\n  EFF -->|De-confounded HR| BUILD[Apply half-cycle correction + discounting]\n  BUILD --> UNC[PSA: distributions -> ICER + NMB + CEAC]\n  UNC --> REP[Report per CHEERS 2022:<br/>perspective, horizon, discount, structure]",
        "caption": "Decision logic for choosing an HE model structure from the RWE actually available and for routing the comparative treatment effect through proper confounding control before it enters the model.",
        "alt_text": "Flowchart from RWE inputs to a model-structure choice (Markov, partitioned-survival, or DES), a checkpoint that blocks naive confounded treatment effects, then half-cycle correction, discounting, PSA, and CHEERS-compliant reporting.",
        "source_type": "illustrative",
        "source_citations": [
          "caro-2012"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "stateDiagram-v2\n  [*] --> Stable\n  Stable --> Hospitalized: claims event rate -> per-cycle prob via matrix exp P = exp(Q dt)\n  Hospitalized --> Stable: recovery rate from claims\n  Stable --> Dead: cause-specific mortality (competing risk)\n  Hospitalized --> Dead: in/post-hospital mortality\n  Dead --> [*]\n  note right of Stable\n    Costs = paid-amount PMPM (FFS person-time)\n    Utility = EQ-5D-mapped per-cycle QALY weight\n  end note",
        "caption": "Three-state cohort Markov model with each transition labeled by the RWE source that parameterizes it; Dead is absorbing and the Stable/Hospitalized transitions are estimated with a competing-risks formulation.",
        "alt_text": "State-transition diagram with Stable, Hospitalized, and absorbing Dead states, annotated with the claims-derived rates, PMPM costs, and EQ-5D utilities that populate the model.",
        "source_type": "illustrative",
        "source_citations": [
          "caro-2012"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "requires",
        "target_slug": "survival-extrapolation-hta-rwe",
        "notes": "Survival extrapolation supplies the long-horizon OS/PFS curves that partitioned-survival and Markov-survival inputs depend on when RWE follow-up is shorter than the model horizon."
      },
      {
        "relation_type": "requires",
        "target_slug": "partitioned-survival-models-rwe",
        "notes": "A specific model structure within this family; occupancy is derived from independently extrapolated survival curves."
      },
      {
        "relation_type": "requires",
        "target_slug": "markov-transition-probabilities-rwe",
        "notes": "Supplies the per-cycle transition probabilities (rate-to-probability conversion) that drive the cohort Markov structure described here."
      },
      {
        "relation_type": "requires",
        "target_slug": "qaly-utility-mapping-rwe",
        "notes": "Provides the per-state utility weights (e.g., EQ-5D mapping) that turn state occupancy into QALYs."
      },
      {
        "relation_type": "requires",
        "target_slug": "discounting-costs-effects-rwe",
        "notes": "Discounting of costs and effects to present value is a mandatory step in running any model in this family."
      },
      {
        "relation_type": "requires",
        "target_slug": "probabilistic-sensitivity-analysis-hea-rwe",
        "notes": "PSA propagates RWE parameter uncertainty through the model to an ICER distribution and CEAC."
      },
      {
        "relation_type": "produces",
        "target_slug": "icer-net-monetary-benefit-rwe",
        "notes": "The ICER and net monetary benefit are the headline outputs computed from the model trace."
      },
      {
        "relation_type": "used_with",
        "target_slug": "cost-outlier-handling-rwe",
        "notes": "Cost inputs to the model require outlier handling before aggregation to PMPM by state."
      },
      {
        "relation_type": "used_with",
        "target_slug": "all-cause-vs-attributable-costs-rwe",
        "notes": "Whether state costs are all-cause or disease-attributable is a defining modeling choice for the cost parameters."
      },
      {
        "relation_type": "used_with",
        "target_slug": "healthcare-costs-pppm-pppy-pmpm",
        "notes": "Per-member-per-month cost measurement and standardization feed the per-state cost parameters of the model."
      },
      {
        "relation_type": "requires",
        "target_slug": "active-comparator-new-user",
        "notes": "The comparative treatment effect entering the model should come from a de-confounded design; a naive RWE contrast yields a precise but biased ICER that PSA cannot reveal."
      },
      {
        "relation_type": "see_also",
        "target_slug": "hcru-healthcare-resource-utilization",
        "notes": "HCRU measurement is the primary input to the cost parameters of the model."
      },
      {
        "relation_type": "see_also",
        "target_slug": "burden-of-disease-cost-of-illness",
        "notes": "Burden/cost-of-illness estimates provide context and starting cost/utilization inputs for modeling."
      },
      {
        "relation_type": "see_also",
        "target_slug": "cost-utility",
        "notes": "Cost-utility analysis is the evaluation type these QALY-based models are most often built to deliver."
      },
      {
        "relation_type": "see_also",
        "target_slug": "cost-effectiveness",
        "notes": "Cost-effectiveness analysis is the broader evaluation framework within which these models operate."
      }
    ],
    "aliases": [
      "HE modeling",
      "health economic modeling",
      "decision-analytic modeling with RWE",
      "economic model inputs from RWE",
      "HTA modeling methods",
      "cost-effectiveness modeling with real-world evidence"
    ],
    "complexity": "advanced",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "healthcare-costs-pppm-pppy-pmpm",
    "name": "Healthcare Costs (PPPM, PPPY, PMPM)",
    "short_definition": "The measurement, attribution, and time-standardization of healthcare expenditures from claims or linked data into per-patient-per-month/year (PPPM/PPPY) or per-member-per-month (PMPM) rates, with explicit cost basis, perspective, place-of-service decomposition, and modeling of the zero-inflated, right-skewed cost distribution.",
    "long_description": "Healthcare cost measurement converts the resource use captured in administrative data into a monetary outcome that\npayers, HTA bodies, and budget-impact models can act on. The recurring deliverable is a **standardized cost rate** —\ntotal (or attributable) dollars divided by accrued patient-time — expressed as **PPPM** (per-patient-per-month),\n**PPPY** (per-patient-per-year), or **PMPM** (per-member-per-month). The metric looks arithmetically trivial; the\ndefensibility lives entirely in the numerator definition (which dollars, which claims, which perspective) and the\ndenominator definition (whose time, observed how).\n\n**Core conceptual distinction**. Three choices are doing the work and they are separable. (1) *Cost basis*: the\n**allowed amount** (the negotiated price the plan and patient together owe) is the standard valuation of \"cost to the\nsystem\"; the **paid amount** (plan liability only) gives the payer perspective; **charges** (billed amounts) are\ninflated and payer-specific and should not be used as cost without a cost-to-charge ratio. (2) *Attribution*:\n**all-cause** sums every claim in the window, while **disease-attributable** restricts to claims carrying the index\ndiagnosis (and **incremental** subtracts a matched-control cost to recover the causal increment) — see\nall-cause-vs-attributable-costs-rwe. (3) *Denominator population and clock*: **PPPM/PPPY** divide by the\n*observed patient-time of the analytic cohort* (patients who entered the study), whereas **PMPM** divides plan\nspending by *all enrolled member-months*, including the many members with zero cost. PMPM is a plan-budget statistic;\nPPPM is a patient-burden statistic. 
They are not interchangeable, and conflating them is a common error in dossiers.\nThe estimand must name all three choices before any code runs.\n\n**Pros, cons, and trade-offs**.\n- **vs HCRU counts (hcru-healthcare-resource-utilization):** Costs collapse intensity, setting, and price into one\n  comparable dollar metric — one prolonged ICU stay outweighs many cheap office visits in a way an event count never\n  shows, and dollars map directly to payer and HTA decisions. Cost: dollars are hostage to negotiated rates,\n  geography, and calendar-year price inflation, so they transport poorly across payers and countries where raw counts\n  travel fine. **Prefer costs** for value/budget arguments; always report the paired HCRU table to explain *why* costs\n  moved.\n- **vs raw totals (no standardization):** PPPM/PPPY correctly handle the variable follow-up that censoring, death, and\n  disenrollment force on claims cohorts; a fixed-denominator mean silently penalizes short-followed patients. Cost:\n  a rate divided by tiny person-time explodes for a patient with one expensive event in one observed month, so person-\n  time must be derived carefully and decedents handled explicitly.\n- **vs count models (poisson-negative-binomial-count-models):** Cost models (two-part logit+GLM, gamma log-link) carry\n  price and service-mix that pure counts miss. Cost: cost distributions are zero-inflated and heavily right-skewed, so\n  naive OLS or log-OLS-with-retransformation is biased (Manning & Mullahy; Buntin & Zaslavsky), and the two-part/GLM\n  machinery is heavier than a Poisson/NB rate model. Report both families for a complete picture.\n\n**When to use**. Any analysis whose decision currency is money: budget-impact models, cost-of-illness and burden\nstudies, the cost arm of a cost-effectiveness or cost-utility analysis, formulary and value dossiers, and comparative\ncost contrasts between treatment arms. 
Use PPPM/PPPY when the question is per-patient economic burden over variable\nfollow-up; use PMPM when the question is plan-level spend across an enrolled population (including non-users).\n\n**When NOT to use — and when it is actively misleading or dangerous**.\n- **A fixed denominator on censored data.** Dividing by a constant 12 months when patients are followed 1–12 months\n  biases the rate and breaks between-arm comparability if follow-up differs by arm. Always use observed person-time.\n- **PMPM reported as if it were patient burden.** Folding thousands of zero-cost members into the denominator makes a\n  high-cost disease look cheap; quoting PMPM for an affected-patient narrative understates burden by orders of\n  magnitude. Match the denominator to the claim being made.\n- **Charges used as cost.** Chargemaster amounts overstate true cost severalfold and vary by facility; reporting them\n  as cost is indefensible to a payer reviewer.\n- **Immortal time in the numerator/denominator of procedure studies.** If patients must survive to a procedure to\n  enter, counting pre-procedure cost-free time as observed person-time inflates the denominator and deflates the rate\n  (see time-zero-index-date-alignment-rwe).\n- **Trimming the outcome before the primary model without a sensitivity plan.** A handful of catastrophic cases\n  (CAR-T, transplant, prolonged ICU) dominate the mean; deleting them silently changes the estimand and the population\n  (see cost-outlier-handling-rwe). Pre-specify winsorization/capping/robust models instead.\n\n**Data-source operational depth**.\n- **US claims (FFS or commercial):** The native substrate. Use `allowed_amount` for system cost and `paid_plan` for\n  payer perspective; decompose patient liability into `copay + coinsurance + deductible`. 
Real failure modes:\n  *adjudication lag* — recent months are under-reported until claims mature, so the final months of any extract must be\n  censored or the rate trends spuriously downward; *reversals/voids* — un-netted reversal lines double-count spend, so\n  net by `claim_id` or filter on final claim status; *coordination of benefits (COB)* — when a secondary payer exists,\n  `paid_plan` is partial and `allowed_amount` is the safer resource valuation; *MA-only person-time* — Medicare\n  Advantage encounter data lack adjudicated FFS dollar amounts, so MA-only spans contribute observable time but no\n  valid cost numerator and must be excluded (or shadow-priced) from cost denominators to avoid a differentially\n  deflated rate (see medicare-ffs-ma-commercial-claims-differences-rwe); *bundled/episode payments* — under DRG, BPCI-A,\n  or CJR-style episodes a single payment covers many services and individual line dollars may be zeroed, so HCRU counts\n  inside the episode remain observable but dollar attribution requires external unit costs or shadow pricing.\n- **EHR:** Carries charges or chargemaster estimates, not adjudicated cost; usable for resource counts but requires a\n  cost-to-charge ratio or linkage to claims for credible costing. 
Visit-driven capture also means a patient who leaves\n  the system is differentially lost, distorting both numerator and person-time.\n- **Registry:** Strong for clinical outcomes and severity but rarely captures complete cost; link to claims for the\n  fill/medical dollar history and to a death index to firm up the person-time clock.\n- **Linked claims–EHR–vital records:** The ideal substrate (severity + dollar completeness + reliable mortality), but\n  linkage introduces selection (only the linkable subset) and date discrepancies between service, fill, and adjudication\n  that must be reconciled before assigning the cost window.\n\n**Differential competing risks by exposure.** In elderly or comparative cohorts, the arm with higher mortality accrues\n*less* downstream person-time and fewer post-event claims; a naive PPPM can therefore look *lower* in the sicker arm\npurely because death truncated their cost accrual. Decedent terminal-care spikes pull the other way. Pre-specify how\ndeath is handled (censor vs. count terminal costs) and report cost trajectories, not just a pooled rate.\n\n**Worked claims example.** A commercially insured adult (`person_id` 1001) initiates total knee arthroplasty (TKA) at\n`index_date` 2025-01-15 and is observed through disenrollment on 2025-05-30 — **4.5 observed months**. The claims\nextract carries the inpatient TKA facility line (POS 21, CPT 27447, ICD M17.11, `allowed` $28,500), inpatient\nprofessional lines, outpatient orthopedic follow-ups (POS 11, CPT 99213, ICD Z47.1), a lab panel, an infused biologic\nbilled on the medical benefit (J-code + administration, `allowed` $1,250), and two oral analgesic pharmacy fills\n(NDC, `clm_type` = pharmacy). Summing all lines gives an **all-cause allowed** of $30,847.50, of which $30,672.50 is\nmedical (99.4%) and $175.00 is pharmacy; the inpatient stay alone is ~$28,780. 
The *disease-attributable* total applies a\ndiagnosis-position filter on M17/Z47, which acts only on the diagnosis-bearing medical claims. It therefore drops both the\nunrelated hypertension PCP visit (ICD I10, allowed $168.00) and, because pharmacy claims carry no diagnosis code and so can\nnever match an ICD filter, the two analgesic pharmacy fills (allowed $175.00), yielding **disease-attributable allowed** of\n$30,504.50 ($30,847.50 − $168.00 − $175.00). Capturing disease-related pharmacy would require a separate NDC-list step;\na pure ICD filter cannot. The standardized **all-cause PPPM** is $30,847.50 / 4.5 = **$6,855**; the **PPPY** (annualizing the same\nobserved rate) is $6,855 × 12 = **$82,260**, a figure only meaningful as a rate because the patient was not observed a\nfull year. The same patient contributes 4.5 *member-months* to a **PMPM** denominator that also sums the full-year,\nnear-zero-cost months of every other enrollee, so the plan PMPM is far smaller, illustrating why the denominator must\nmatch the question. In production, person-time comes from enrollment spans (sum of eligible days / 30.437), reversals\nare netted by `claim_id`, the unmatured final month is dropped, and a pre-specified outlier rule (e.g., 99th-percentile\nwinsorization) is applied before the two-part/gamma model.\n\n**Interpreting the output**\n\nA study reports that 3 commercially insured patients newly diagnosed with migraine had a mean all-cause\nallowed PPPM of $200.00 ($2,400.00 PPPY) over 20 observed member-months and $4,000 in total allowed\nspend. 
A 90th-percentile winsorization sensitivity analysis caps one high-cost month at the 90th-percentile\nvalue; the winsorized mean PPPM drops to $150.00 (−25%), illustrating how a single high-cost month\nreshapes the estimate.\n\n*(1) Formal interpretation.* The all-cause allowed PPPM of $200.00 is total allowed spend divided by total\nobserved member-months ($4,000 / 20), i.e., the person-time-weighted mean monthly cost per patient; the PPPY of\n$2,400.00 is the same rate annualized (×12) and is interpretable only as a rate, because two of the three patients\nwere followed for less than a full year. This is the allowed amount (plan plus patient liability), the system-cost\nvaluation; it is not charges, and it is not the plan-paid amount net of patient liability. Because cost\ndistributions are right-skewed, the mean typically sits above the median and is driven by\nhigh-cost months; the mean governs budget projections while the median describes the typical patient.\nNo causal comparison is made; this is a descriptive within-cohort mean.\n\n*(2) Practical interpretation.* For a payer or formulary team, the $200.00 PPPM signals what a newly diagnosed\nmigraine patient costs on average each month of follow-up at allowed rates. If your plan's negotiated rates\ndiffer, or if you are comparing across payers using charges, the number is not portable. The winsorized\nsensitivity shows that capping a single high-cost month lowers the headline by 25% (equivalently, the unwinsorized\nmean is a third higher than the winsorized one), a reason to pre-specify the outlier rule before submitting to an\nHTA body.",
    "primary_category": "Health_Economic",
    "tags": [
      "costs",
      "pppm",
      "pppy",
      "pmpm",
      "claims",
      "heor",
      "place-of-service",
      "medical-pharmacy-split",
      "two-part-model",
      "gamma-glm",
      "cost-outliers",
      "person-time"
    ],
    "applies_to_study_types": [
      "cost_of_illness",
      "budget_impact",
      "cost_effectiveness",
      "cost_utility",
      "drug_utilization",
      "cohort_retrospective",
      "claims_analysis"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "linked",
      "registry"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1146/annurev.publhealth.20.1.125",
        "url": "https://doi.org/10.1146/annurev.publhealth.20.1.125",
        "citation_text": "Diehr P, Yanez D, Ash A, Hornbrook M, Lin DY. Methods for analyzing health care utilization and costs. Annual Review of Public Health. 1999;20:125-144.",
        "year": 1999,
        "authors_short": "Diehr et al.",
        "notes": "Foundational review of statistical methods for administrative utilization and cost data; establishes why cost distributions (zero inflation, right skew, high-cost tail) defeat ordinary regression and motivates two-part and transformation approaches."
      },
      {
        "role": "explain",
        "doi": "10.1016/S0167-6296(01)00086-8",
        "url": "https://doi.org/10.1016/S0167-6296(01)00086-8",
        "citation_text": "Manning WG, Mullahy J. Estimating log models: to transform or not to transform? Journal of Health Economics. 2001;20(4):461-494.",
        "year": 2001,
        "authors_short": "Manning & Mullahy",
        "notes": "Central reference for choosing between log-OLS (with retransformation) and GLM (gamma, log link) for skewed cost outcomes; the basis for preferring GLM when heteroscedasticity and heavy tails are present."
      },
      {
        "role": "explain",
        "doi": "10.1016/j.jhealeco.2003.10.005",
        "url": "https://doi.org/10.1016/j.jhealeco.2003.10.005",
        "citation_text": "Buntin MB, Zaslavsky AM. Too much ado about two-part models and transformation? Comparing methods of modeling Medicare expenditures. Journal of Health Economics. 2004;23(3):525-542.",
        "year": 2004,
        "authors_short": "Buntin & Zaslavsky",
        "notes": "Simulation comparison of one-part GLM, two-part, and transformation models on realistic Medicare cost distributions; practical guidance on when the extra structure of two-part models earns its keep."
      },
      {
        "role": "demonstrate",
        "doi": "10.1186/s10194-022-01448-2",
        "url": "https://doi.org/10.1186/s10194-022-01448-2",
        "citation_text": "Irimia P, Garcia-Azorin D, Nunez M, et al. Persistence, use of resources and costs in patients under migraine preventive treatment: the PERSEC study. Journal of Headache and Pain. 2022;23:78.",
        "year": 2022,
        "authors_short": "Irimia et al. (PERSEC)",
        "notes": "Real-world claims-style application reporting PPPY costs stratified by treatment persistence with medical vs. pharmacy and resource-use breakdowns — a worked template for per-patient-per-year cost reporting."
      },
      {
        "role": "use",
        "doi": "10.18553/jmcp.2024.30.11.1276",
        "url": "https://doi.org/10.18553/jmcp.2024.30.11.1276",
        "citation_text": "Cleveland NK, et al. Dose escalation of biologics in biologic-naive patients with Crohn disease: outcomes from the ODESSA-CD study. Journal of Managed Care & Specialty Pharmacy. 2024;30(11):1276.",
        "year": 2024,
        "authors_short": "Cleveland et al.",
        "notes": "Representative US managed-care claims study reporting index-drug and all-cause expenditures as PPPM; typical of how standardized per-month cost metrics appear in HEOR dossiers."
      }
    ],
    "plain_language_summary": "PPPM (per-patient-per-month) and PPPY (per-patient-per-year) are the standard ways to report how much healthcare costs per patient when patients are not all followed for the same length of time — you divide the total dollars a patient (or group) ran up by the number of months they were actually observed. This lets you fairly compare costs across patients who enrolled at different times or dropped out early. PMPM (per-member-per-month) is a related number used by health plans: instead of dividing by the time of sick patients alone, it divides total plan spending by the enrolled months of every member — including the many who had no claims at all — so PMPM is a plan-budget measure, not a patient-burden measure.",
    "key_terms": [
      {
        "term": "member-month",
        "definition": "One patient enrolled and observable for one calendar month; a patient followed for 5 months contributes 5 member-months."
      },
      {
        "term": "allowed amount",
        "definition": "The negotiated price that the health plan and patient together owe a provider — the standard dollar amount used to represent 'cost to the system' in claims data."
      },
      {
        "term": "claims data",
        "definition": "Administrative records that a health plan creates every time a member gets a service billed to insurance, containing the date, type of service, diagnosis codes, and dollar amounts."
      },
      {
        "term": "observed follow-up",
        "definition": "The actual span of time a patient was enrolled and generating claims in the dataset — it varies across patients because people join or leave a health plan at different times."
      },
      {
        "term": "annualize",
        "definition": "Convert a monthly rate to an annual one by multiplying by 12; PPPY = PPPM × 12, expressing the same rate as a per-year figure."
      }
    ],
    "worked_example": {
      "scenario": "A researcher wants to know the average monthly healthcare cost for patients newly diagnosed with migraine headache. Three patients are enrolled in the study but each is observed for a different number of months — one enrolled late, one disenrolled early, and one was present all year. Simply averaging their total costs would unfairly penalize the short-followed patients. Instead, the researcher divides total costs by total observed member-months to compute PPPM, then multiplies by 12 to get PPPY.",
      "dataset": {
        "caption": "One row per patient showing their total allowed costs and the number of months they were actually enrolled and observable.",
        "columns": [
          "person_id",
          "total_allowed_cost",
          "observed_member_months"
        ],
        "rows": [
          [
            1001,
            600.0,
            3
          ],
          [
            1002,
            2400.0,
            12
          ],
          [
            1003,
            1000.0,
            5
          ]
        ]
      },
      "steps": [
        "Add up the total allowed costs across all three patients: $600 + $2,400 + $1,000 = $4,000.",
        "Add up the total observed member-months across all three patients: 3 + 12 + 5 = 20 member-months.",
        "Divide total costs by total member-months to get PPPM: $4,000 ÷ 20 = $200.00 per patient per month.",
        "Multiply PPPM by 12 to annualize it: $200.00 × 12 = $2,400.00 per patient per year (PPPY).",
        "Notice why this works: patient 1001 spent $600 over 3 months ($200/month), patient 1002 spent $2,400 over 12 months ($200/month), and patient 1003 spent $1,000 over 5 months ($200/month) — each has the same underlying rate, and the PPPM correctly recovers that instead of being dragged up by the patient with the most total dollars."
      ],
      "result": "PPPM = $4,000 / 20 member-months = $200.00 per patient per month. PPPY = $200.00 × 12 = $2,400.00 per patient per year. Because we divided by each patient's actual observed time rather than a fixed 12 months, short-followed patients do not distort the average."
    },
    "prerequisites": [
      "claims-analysis",
      "hcru-healthcare-resource-utilization",
      "continuous-enrollment-observable-time-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "All-cause PPPM / PPPY",
        "description": "Sum of all medical and pharmacy allowed (or paid) amounts in the observation window, divided by observed patient-months or patient-years.",
        "edge_cases": [
          "Captures unrelated comorbidity care; appropriate for total budget impact but overstates disease-specific burden.",
          "Sensitive to follow-up length and to the unmatured final months of the extract."
        ],
        "data_source_notes": "claims: sum allowed across inpatient/outpatient/professional/pharmacy; standardize by exact observable person-time from enrollment spans, not a fixed denominator.",
        "citations": [
          "diehr-1999"
        ]
      },
      {
        "name": "Disease-attributable / incremental cost",
        "description": "Restrict the numerator to claims carrying the index diagnosis (any vs. primary position), or subtract a matched-control cost to recover the incremental (causal) increment.",
        "edge_cases": [
          "Attribution is sensitive to diagnosis-code position; incremental requires good matching to avoid residual confounding."
        ],
        "data_source_notes": "claims: filter by ICD code set or use matched-cohort subtraction; see all-cause-vs-attributable-costs-rwe for code-position sensitivity and diagnostics.",
        "citations": [
          "irimia-2022"
        ]
      },
      {
        "name": "PMPM (plan/payer denominator)",
        "description": "Sum of plan-paid costs for the population divided by total enrolled member-months over the calendar window, including members with zero cost.",
        "edge_cases": [
          "A high-cost patient enrolled only one month contributes one member-month of weight; mid-month starts/ends and dual coverage must follow the eligibility data dictionary."
        ],
        "data_source_notes": "claims: build a member-month table from eligibility files (one row per member per eligible month) or sum eligible span-days / 30.437; do not reuse the analytic-cohort person-time denominator.",
        "citations": [
          "cleveland-2024"
        ]
      },
      {
        "name": "Place-of-service and medical/pharmacy split",
        "description": "Decompose costs by inpatient/ED/outpatient/SNF/pharmacy using POS, revenue codes, and bill type, and partition medical benefit vs. pharmacy benefit (provider-administered J-code drugs vs. NDC fills).",
        "edge_cases": [
          "Site-of-care migration (e.g., infusion moving from inpatient to outpatient) shifts dollars between buckets with no net change; provider-administered drugs straddle the medical/pharmacy boundary."
        ],
        "data_source_notes": "claims: join medical and pharmacy files; map revenue/POS/CPT/HCPCS; essential for explaining cost drivers and for specialty-drug budget models. See procedure-identification-and-measurement-in-claims-ehr."
      },
      {
        "name": "Two-part / gamma-GLM cost model",
        "description": "Logistic model for any-cost (the probability mass at zero) plus a gamma GLM with log link on positive cost, optionally with an offset of log(person-time) when modeling cumulative totals as a rate.",
        "edge_cases": [
          "Many structural zeros and extreme skew; clustered or GEE standard errors needed for repeated measures; retransformation bias if a log-OLS shortcut is used instead of a GLM."
        ],
        "data_source_notes": "claims: aggregate lines to one row per patient, attach baseline covariates, model both parts; report predicted incremental cost via recycled-prediction/margins.",
        "citations": [
          "manning-2001",
          "buntin-2004"
        ]
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "hcru-healthcare-resource-utilization",
        "pros_of_this": "Integrates intensity, setting, and price into a single dollar metric directly usable for payer, HTA, and budget decisions.",
        "cons_of_this": "Hostage to negotiated rates, geography, and price inflation; transports poorly across payers and countries relative to raw event counts.",
        "when_to_prefer": "When the decision currency is money; always pair with the HCRU table to explain which resources drove the cost difference."
      },
      {
        "compared_to": "poisson-negative-binomial-count-models",
        "pros_of_this": "Captures price and service-mix that event counts miss (one long ICU stay vs. many cheap visits).",
        "cons_of_this": "Zero-inflated, heavily skewed outcome needs two-part or gamma GLM; count models are simpler and more transportable for pure volume questions.",
        "when_to_prefer": "When economic impact is the primary interest; report count models alongside for utilization volume."
      },
      {
        "compared_to": "burden-of-disease-cost-of-illness",
        "pros_of_this": "Granular, standardized (all-cause vs. attributable, PPPM/PPPY) and supports comparative, intervention- specific decisions.",
        "cons_of_this": "Full societal burden (productivity, caregiver, indirect costs) needs data beyond claims.",
        "when_to_prefer": "Comparative or intervention-specific value demonstration; use COI for broad disease-level resource allocation and advocacy."
      },
      {
        "compared_to": "cost-outlier-handling-rwe",
        "pros_of_this": "Provides the full measurement, standardization, and modeling pipeline in which an outlier rule is chosen and interpreted.",
        "cons_of_this": "The dedicated outlier entry supplies the precise trimming/winsorization variants, diagnostics, and code; this entry is not a substitute.",
        "when_to_prefer": "Use the outlier entry for the trimming/winsorization mechanics and this entry for pipeline integration."
      },
      {
        "compared_to": "medicare-ffs-ma-commercial-claims-differences-rwe",
        "pros_of_this": "Defines costing rules once the extract is in hand; clean allowed/paid in FFS, member liability components, and standardization.",
        "cons_of_this": "The payer-differences entry covers how MA capitation/encounter data, coding intensity, and commercial contracting distort apparent cost before any costing rule applies.",
        "when_to_prefer": "Multi-payer or MA-heavy studies should resolve payer heterogeneity first, then apply consistent costing with payer-specific caveats (exclude/shadow-price MA-only person-time)."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Use allowed amounts for cost-to-the-system or paid for the payer perspective; decompose patient liability (copay + coinsurance + deductible). Keep medical and pharmacy files separate (J-code + administration vs. NDC fills) and standardize by exact observable person-time from enrollment spans. Net reversals by claim_id, drop unmatured final months, prefer allowed when COB makes plan-paid partial, and exclude MA-only person-time from cost denominators. Pre-specify attribution, cost basis, person-time derivation, and the outlier rule in the SAP.",
      "ehr": "Carries charges or chargemaster estimates rather than adjudicated cost; apply a cost-to-charge ratio or link to claims for credible costing, and treat system departure as informative loss of person-time.",
      "registry": "Strong for clinical outcomes and severity, weak for complete cost; link to claims for the dollar history and to a death index for the person-time clock.",
      "linked": "Linked claims-EHR-vital-records is the ideal substrate (severity + dollar completeness + mortality) but introduces linkage selection and service/fill/adjudication date discrepancies to reconcile before assigning the cost window."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\nimport numpy as np\n\nDAYS_PER_MONTH = 30.437          # mean calendar month\nATTRIB_ICD = (\"M17\", \"Z47\")      # disease-attributable code prefixes (study-specific)\n\ndef observed_months(enroll: pd.DataFrame) -> pd.Series:\n    # Sum eligible, FFS-observable days per person (MA-only spans excluded), convert to months.\n    ffs = enroll.loc[~enroll[\"ma_only\"]].copy()\n    ffs[\"days\"] = (ffs[\"enroll_end\"] - ffs[\"enroll_start\"]).dt.days + 1\n    return ffs.groupby(\"person_id\")[\"days\"].sum() / DAYS_PER_MONTH\n\ndef patient_cost_rates(claims: pd.DataFrame, enroll: pd.DataFrame) -> pd.DataFrame:\n    claims = claims.copy()\n    claims[\"oop\"] = claims[[\"copay\", \"coinsurance\", \"deductible\"]].sum(axis=1)\n    is_med = claims[\"clm_type\"].eq(\"medical\")\n    is_attrib = claims[\"icd\"].str.startswith(ATTRIB_ICD)\n\n    g = claims.groupby(\"person_id\")\n    pat = pd.DataFrame({\n        \"allowed_allcause\": g[\"allowed\"].sum(),\n        \"plan_paid\":        g[\"paid_plan\"].sum(),\n        \"patient_oop\":      g[\"oop\"].sum(),\n        \"medical_allowed\":  claims.loc[is_med].groupby(\"person_id\")[\"allowed\"].sum(),\n        \"pharm_allowed\":    claims.loc[~is_med].groupby(\"person_id\")[\"allowed\"].sum(),\n        \"allowed_attrib\":   claims.loc[is_attrib].groupby(\"person_id\")[\"allowed\"].sum(),\n    }).fillna(0.0)\n\n    pat[\"fup_months\"] = observed_months(enroll).reindex(pat.index)\n    # Guard against near-zero person-time producing exploding rates.\n    valid = pat[\"fup_months\"] > 0\n    pat.loc[valid, \"pppm_allcause\"] = pat[\"allowed_allcause\"] / pat[\"fup_months\"]\n    pat.loc[valid, \"pppy_allcause\"] = pat[\"pppm_allcause\"] * 12\n    pat.loc[valid, \"pppm_attrib\"]   = pat[\"allowed_attrib\"] / pat[\"fup_months\"]\n    return pat.reset_index()",
        "description": "Line-level claims -> patient-level standardized cost rates. Required inputs (cleaned, reversals already netted):\n  claims : person_id, claim_id, service_date, clm_type ('medical'/'pharmacy'), pos, icd,\n           allowed, paid_plan, copay, coinsurance, deductible\n  enroll : person_id, enroll_start, enroll_end, ma_only (bool)   # ma_only spans carry no valid cost numerator\nReturns one row per patient with all-cause / attributable totals, medical-pharmacy split, patient OOP, and PPPM/PPPY\ncomputed on observed (non-MA-only) person-time. Feed the result to the two-part / gamma model below.",
        "dependencies": [
          "pandas",
          "numpy"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "python",
        "code": "import numpy as np\nimport statsmodels.api as sm\nimport statsmodels.formula.api as smf\n\ndef two_part_cost_model(pat):\n    pat = pat.copy()\n    pat[\"any_cost\"] = (pat[\"total_allowed\"] > 0).astype(int)\n    pat[\"log_fup\"] = np.log(pat[\"fup_months\"])\n\n    # Part 1: any cost vs. none (the structural zeros).\n    m1 = smf.logit(\"any_cost ~ age + female + charlson + treated\", data=pat).fit(disp=0)\n\n    # Part 2: positive cost as a rate via gamma GLM with log link + person-time offset.\n    pos = pat.loc[pat[\"total_allowed\"] > 0]\n    m2 = smf.glm(\"total_allowed ~ age + female + charlson + treated\",\n                 data=pos, offset=pos[\"log_fup\"],\n                 family=sm.families.Gamma(link=sm.families.links.Log())).fit()\n\n    # Recycled-prediction incremental PPPM: E[cost | treated=1] - E[cost | treated=0].\n    def expected_pppm(t):\n        d = pat.assign(treated=t, log_fup=0.0)               # offset 0 -> per-month rate\n        p_any = m1.predict(d)\n        mu = m2.predict(d, offset=np.zeros(len(d)))\n        return float((p_any * mu).mean())\n\n    incremental_pppm = expected_pppm(1) - expected_pppm(0)\n    return m1, m2, incremental_pppm",
        "description": "Two-part cost model on the patient-level table from the prior step. Input `pat` must contain:\n  total_allowed, fup_months, plus baseline covariates (age, female, charlson) and a binary `treated` exposure.\nPart 1 models the probability of any cost (the zero mass); Part 2 models positive cost with a gamma GLM and a\nlog(person-time) offset so the coefficient on `treated` is a cost-rate (PPPM) contrast. Incremental cost is the\ndifference of recycled predictions across the counterfactual `treated` settings.",
        "dependencies": [
          "pandas",
          "numpy",
          "statsmodels"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\nDAYS_PER_MONTH <- 30.437\nATTRIB <- \"^M17|^Z47\"\n\npatient_cost_rates <- function(claims, enroll, cov) {\n  setDT(claims); setDT(enroll); setDT(cov)\n  claims[, oop := copay + coinsurance + deductible]\n\n  pat <- claims[, .(\n    allowed_allcause = sum(allowed),\n    plan_paid        = sum(paid_plan),\n    patient_oop      = sum(oop),\n    medical_allowed  = sum(allowed[clm_type == \"medical\"]),\n    pharm_allowed    = sum(allowed[clm_type == \"pharmacy\"]),\n    allowed_attrib   = sum(allowed[grepl(ATTRIB, icd)])\n  ), by = person_id]\n\n  # Observed FFS person-time (MA-only spans excluded).\n  ft <- enroll[ma_only == FALSE]\n  ft[, days := as.numeric(enroll_end - enroll_start) + 1]\n  pt <- ft[, .(fup_months = sum(days) / DAYS_PER_MONTH), by = person_id]\n\n  pat <- merge(pat, pt, by = \"person_id\")\n  pat <- pat[fup_months > 0]\n  pat[, pppm_allcause := allowed_allcause / fup_months]\n  pat[, pppy_allcause := pppm_allcause * 12]\n  pat[, pppm_attrib   := allowed_attrib / fup_months]\n  merge(pat, cov, by = \"person_id\")\n}\n\ntwo_part_cost_model <- function(pat) {\n  pat$any_cost <- as.integer(pat$total_allowed > 0)\n  m1 <- glm(any_cost ~ age + female + charlson + treated,\n            family = binomial, data = pat)\n  pos <- pat[pat$total_allowed > 0, ]\n  # Gamma GLM with log link; log(person-time) offset makes the contrast a PPPM rate.\n  m2 <- glm(total_allowed ~ age + female + charlson + treated + offset(log(fup_months)),\n            family = Gamma(link = \"log\"), data = pos)\n  list(part1 = m1, part2 = m2)\n}",
        "description": "Line-level claims -> patient-level PPPM/PPPY with data.table, then a two-part (logit + gamma GLM) cost model.\nInputs mirror the Python version:\n  claims : person_id, claim_id, service_date (Date), clm_type ('medical'/'pharmacy'), pos, icd,\n           allowed, paid_plan, copay, coinsurance, deductible\n  enroll : person_id, enroll_start (Date), enroll_end (Date), ma_only (logical)\n  cov    : person_id + age, female, charlson, treated (0/1)",
        "dependencies": [
          "data.table"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* 1. Observed FFS person-time per patient (MA-only spans carry no valid cost numerator). */\nproc sql;\n  create table ptime as\n  select person_id,\n         sum((enroll_end - enroll_start) + 1) / 30.437 as fup_months\n  from work.enroll\n  where ma_only = 0\n  group by person_id;\nquit;\n\n/* 2. Roll claim lines up to one row per patient: all-cause, attributable, med/pharm split, OOP. */\nproc sql;\n  create table pat as\n  select c.person_id,\n         sum(c.allowed)                                              as allowed_allcause,\n         sum(c.paid_plan)                                            as plan_paid,\n         sum(c.copay + c.coinsurance + c.deductible)                 as patient_oop,\n         sum(case when c.clm_type='medical'  then c.allowed else 0 end) as medical_allowed,\n         sum(case when c.clm_type='pharmacy' then c.allowed else 0 end) as pharm_allowed,\n         sum(case when c.icd like 'M17%' or c.icd like 'Z47%'\n                  then c.allowed else 0 end)                         as allowed_attrib,\n         t.fup_months\n  from work.claims c\n  inner join ptime t on c.person_id = t.person_id\n  where t.fup_months > 0\n  group by c.person_id, t.fup_months;\nquit;\n\n/* 3. Analytic file: attach covariates, derive PPPM, any-cost flag, and log person-time offset. */\ndata analytic;\n  merge pat(in=a) work.cov(in=b);\n  by person_id;\n  if a and b;\n  pppm_allcause = allowed_allcause / fup_months;\n  any_cost      = (allowed_allcause > 0);\n  log_fup       = log(fup_months);\nrun;\n\n/* 4. Descriptive standardized cost by arm. */\nproc means data=analytic mean median std maxdec=0;\n  class treated;\n  var pppm_allcause allowed_allcause fup_months;\nrun;\n\n/* 5. Part 1 of the two-part model: probability of any cost (the structural zeros). */\nproc logistic data=analytic;\n  class treated (ref='0') female (ref='0') / param=ref;\n  model any_cost(event='1') = treated age female charlson;\nrun;\n\n/* 6. 
Part 2: positive cost as a rate via gamma GLM, log link, log person-time offset. */\nproc genmod data=analytic;\n  where allowed_allcause > 0;\n  class treated (ref='0') female (ref='0') / param=ref;\n  model allowed_allcause = treated age female charlson\n        / dist=gamma link=log offset=log_fup;\n  estimate 'treated vs untreated (log cost-rate)' treated 1 -1;\nrun;",
        "description": "Line-level claims -> patient-level standardized costs (PROC SQL), then the canonical two-part cost model\n(PROC LOGISTIC for any-cost + PROC GENMOD gamma/log with a log person-time offset for positive cost) and\ndescriptive PPPM by arm (PROC MEANS). Required inputs (post data-management):\n  work.claims : person_id, claim_id, service_date, clm_type, pos, icd,\n                allowed, paid_plan, copay, coinsurance, deductible\n  work.enroll : person_id, enroll_start, enroll_end, ma_only (0/1)\n  work.cov    : person_id, age, female, charlson, treated (0/1)",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Pop[Analytic cohort + perspective + time horizon] --> Basis[Cost basis<br/>allowed = system / paid_plan = payer / charges x CCR only]\n  Basis --> Attr[Attribution<br/>all-cause vs disease-attributable vs incremental]\n  Attr --> Split[Place-of-service + medical/pharmacy split<br/>POS + revenue code + bill type; J-code vs NDC]\n  Split --> Num[Numerator: net reversals, drop unmatured months,<br/>prefer allowed under COB]\n  Num --> Den[Denominator: observed person-time<br/>exclude MA-only spans; handle death/disenrollment]\n  Den --> Rate{Which standardized rate?}\n  Rate -->|patient burden| PPPM[PPPM / PPPY<br/>cohort patient-time]\n  Rate -->|plan budget| PMPM[PMPM<br/>all enrolled member-months]\n  PPPM --> Out[Outlier rule<br/>winsorize / cap / robust + sensitivity]\n  Out --> Model[Descriptive + two-part logit/GLM + gamma log-link]\n  Model --> Report[Report by arm, POS, attribution<br/>paired HCRU table + currency/year]\n  PMPM --> Report",
        "caption": "End-to-end cost-measurement pipeline. The estimand fixes cost basis, attribution, and the denominator (PPPM/PPPY for patient burden vs. PMPM for plan budget) before numerator cleaning, person-time derivation, outlier handling, and two-part/gamma modeling.",
        "alt_text": "Flowchart from analytic cohort through cost basis, attribution, place-of-service and medical/pharmacy split, numerator cleaning, person-time denominator, choice of PPPM/PPPY vs PMPM, outlier handling, two-part modeling, and reporting.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  Lines[Raw claim lines<br/>allowed, paid_plan, copay/coins/ded, pos, icd, clm_type] --> Net[Net reversals by claim_id<br/>drop unmatured final months]\n  Net --> Roll[Roll up to one row per patient<br/>+ medical/pharmacy + attributable]\n  Roll --> PT[Join enrollment<br/>observed FFS person-time]\n  PT --> Q{Person-time > 0 and not MA-only?}\n  Q -->|yes| Rate[Divide -> PPPM / PPPY]\n  Q -->|no| Drop[Exclude or shadow-price<br/>no valid cost numerator]\n  Rate --> Mod[Two-part / gamma GLM<br/>offset = log person-time]",
        "caption": "Low-level claims-to-rate data flow, emphasizing reversal netting, person-time gating (MA-only exclusion), and the log person-time offset that turns the gamma GLM contrast into a PPPM rate.",
        "alt_text": "Data-flow diagram from raw claim lines through reversal netting, patient roll-up, enrollment join, person-time gating, rate calculation, and two-part gamma modeling.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "used_with",
        "target_slug": "hcru-healthcare-resource-utilization",
        "notes": "HCRU events by setting are the resource building blocks and the explanatory drivers of cost; always report a paired HCRU table with the same attribution and windows to show which resources moved the dollars."
      },
      {
        "relation_type": "see_also",
        "target_slug": "all-cause-vs-attributable-costs-rwe",
        "notes": "Supplies the operational attribution rules, diagnosis-code-position sensitivity, and incremental-matching methods; this entry applies those definitions inside the standardization and modeling pipeline."
      },
      {
        "relation_type": "see_also",
        "target_slug": "cost-outlier-handling-rwe",
        "notes": "High-cost cases dominate cost means, PPPM rates, and GLM coefficients; the dedicated entry holds the winsorization/trimming/cap mechanics, diagnostics, and numeric examples that this pipeline consumes."
      },
      {
        "relation_type": "see_also",
        "target_slug": "claims-analysis",
        "notes": "Provides the upstream plumbing (enrollment, phenotyping, reversals, adjudication lag, bundling, COB) that must be correct before any valid cost measurement; this entry is the cost-specific layer on top."
      },
      {
        "relation_type": "see_also",
        "target_slug": "medicare-ffs-ma-commercial-claims-differences-rwe",
        "notes": "Payer type drives cost-basis validity (transparent allowed/paid in FFS; encounter/capitation in MA needs shadow pricing; contributor-specific commercial rates); MA-only person-time must be excluded or shadow-priced."
      },
      {
        "relation_type": "used_with",
        "target_slug": "poisson-negative-binomial-count-models",
        "notes": "Count models give adjusted utilization rates for the events that generate cost; cost analyses use two-part or gamma GLM for the continuous, zero-inflated, skewed dollar outcome. Report both families."
      },
      {
        "relation_type": "used_with",
        "target_slug": "burden-of-disease-cost-of-illness",
        "notes": "Standardized per-patient costs (PPPM/PPPY), all-cause vs. attributable, and POS/medical-pharmacy breakdowns are the core quantitative components of burden-of-disease and cost-of-illness studies."
      },
      {
        "relation_type": "part_of",
        "target_slug": "health-economic-modeling-methods-rwe",
        "notes": "Cost measurement, standardization, and modeling are a component of the broader health-economic modeling family (alongside discounting, PSA, ICER/NMB, and partitioned survival)."
      },
      {
        "relation_type": "requires",
        "target_slug": "continuous-enrollment-observable-time-rwe",
        "notes": "Valid PPPM/PPPY denominators require precise observable/enrolled time with gaps, death/disenrollment censoring, and left-truncation handled correctly."
      },
      {
        "relation_type": "requires",
        "target_slug": "time-zero-index-date-alignment-rwe",
        "notes": "Time zero fixes both the cost numerator window (post-index spend) and the person-time denominator; misalignment reintroduces immortal time into procedure-based cost studies."
      },
      {
        "relation_type": "see_also",
        "target_slug": "procedure-identification-and-measurement-in-claims-ehr",
        "notes": "Accurate CPT/HCPCS (J-codes), revenue codes, bill types, and NDC identification is the technical foundation for place-of-service stratification and the medical vs. pharmacy cost split."
      },
      {
        "relation_type": "see_also",
        "target_slug": "descriptive-epidemiology-rwe",
        "notes": "Cost means, medians, IQR, percent high-cost, and concentration (Lorenz) curves are descriptive-epidemiology outputs that precede and inform the modeled incremental cost."
      },
      {
        "relation_type": "see_also",
        "target_slug": "missing-data-trimming-winsorization-rwe",
        "notes": "Cost outliers co-occur with structural missingness (death, disenrollment); outlier and missing-data decisions must be coordinated and jointly stress-tested."
      }
    ],
    "aliases": [
      "PPPM costs",
      "PPPY costs",
      "PMPM costs",
      "per-patient-per-month costs",
      "per-member-per-month costs",
      "all-cause costs",
      "attributable costs",
      "COI costs"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "hta",
      "pqa-cms"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "healthy-user-bias",
    "name": "Healthy User Bias",
    "short_definition": "A confounding bias in which initiators of (or adherers to) preventive or chronic therapy are systematically more health-seeking — better diet, exercise, screening, adherence, socioeconomic status, and access — than comparators, so their lower event rates are attributed to the drug rather than to the unmeasured behaviors that travel with treatment.",
    "long_description": "**Healthy user bias** (and its sibling **healthy adherer bias**) is a specific form of confounding\nby unmeasured behavior that dominates observational studies of preventive and chronic therapies —\nstatins, antihypertensives, vaccines, bisphosphonates, hormone therapy, antidiabetic drugs. People\nwho initiate and persist with these therapies are not a random sample of the indicated population:\nthey attend preventive visits, get screened, exercise, eat better, are wealthier and more educated,\nand adhere to *everything* (including placebo). Because almost none of those behaviors are recorded\nin claims, the treated arm carries a hidden health advantage that masquerades as drug efficacy. The\ncanonical falsification is Dormuth's: statin *adherers* had markedly fewer **motor-vehicle accidents**\nand fewer unrelated illnesses than non-adherers — an effect no pharmacology can explain, so it must\nbe the adherer, not the statin, who is healthy.\n\n**Core conceptual distinction.** Healthy user bias is not a single bias but a family separable along\ntwo axes. (1) *Healthy initiator vs healthy adherer*: the first is selection **into** treatment\n(who starts), the second is selection **among** the treated (who keeps taking it) — adherers are\neven healthier than initiators, which is why per-protocol/as-treated analyses are more biased than\nintention-to-treat ones for preventive drugs. (2) *Healthy user vs confounding by indication /\nchanneling*: confounding by indication makes the treated look **sicker** (they had the indication);\nhealthy user bias makes them look **healthier** (they are health-seeking). The two can coexist and\npartially cancel, which is dangerous because a \"balanced\" Table 1 can hide both. 
Healthy user bias\nis also entangled with **prevalent-user bias** (prevalent users are survivors who tolerated and kept\ntaking the drug — a depletion-of-susceptibles selection that is itself a healthy-survivor effect)\nand with **immortal time bias** (defining exposure by future adherence guarantees the exposed\nsurvive to accrue it). The estimand matters: the bias inflates the comparative hazard/odds/rate\nratio toward apparent benefit, and is worst for *absolute* and *cause-specific* effects on outcomes\nplausibly tied to lifestyle (cardiovascular, all-cause mortality) and milder for biologically\nspecific outcomes the behaviors cannot touch.\n\n**Pros, cons, and trade-offs** — Healthy user bias is a problem to *control*, so the trade-offs are\namong the mitigation strategies, each named against its alternatives:\n- **Active-comparator new-user design vs new-user-vs-nonuser:** the single most effective structural\n  fix. Comparing two drugs for the *same* indication means both arms cleared the same health-seeking\n  and access thresholds, so most healthy-user imbalance cancels. Cost: it answers a narrower\n  head-to-head question and needs a clinically interchangeable comparator. **Prefer it** whenever\n  such a comparator exists; a non-user comparator should be a last resort.\n- **High-dimensional propensity score (hdPS) vs hand-picked covariates:** proxy measurement is the\n  only lever in claims, and hdPS empirically recovers proxies for health-seeking (screening codes,\n  influenza vaccination, preventive visits) that analysts forget. Cost: it can select instruments or\n  colliders and demands diagnostics. **Prefer hdPS** over a short investigator list in claims.\n- **Negative-control outcomes / empirical calibration vs trusting a single point estimate:** a\n  falsification outcome the drug cannot affect (accidents, screening uptake) directly measures\n  residual healthy-user bias. 
Cost: a valid negative control sharing the confounding structure is\n  hard to find and the calibration shifts the CI. **Always** carry at least one.\n- **E-value / quantitative bias analysis vs ignoring the residual:** an E-value states how strong\n  the unmeasured health-seeking confounder would have to be to explain the result. Cost: it is a\n  what-if, not a correction. **Prefer it** as a mandatory reporting element, not a fix.\n\n**When to use** — i.e., when to actively design and analyze against healthy user bias: any study of\na *preventive* or long-term chronic therapy where the comparison is exposed vs unexposed or\nadherent vs non-adherent; any adherence/persistence effectiveness analysis (PDC/MPR as exposure);\nany vaccine effectiveness study in non-randomized data; any study whose outcome (mortality, CV\nevents, fractures, cancer) is plausibly affected by the same behaviors that drive treatment uptake.\nIn these settings, default to an active-comparator new-user frame, rich proxy adjustment, and at\nleast one negative-control outcome.\n\n**When NOT to use — and when it is actively misleading or dangerous** — \"Use\" here means worrying\nabout / adjusting for healthy user bias; the danger is over- or under-correcting.\n- **Do not condition on post-baseline \"healthy\" markers.** Adjusting for follow-up adherence,\n  on-treatment LDL, or post-index screening is a **collider/mediator** error that manufactures or\n  amplifies bias. Healthy-user proxies must be measured strictly in the pre-index window.\n- **Do not treat adherence as a clean exposure.** Modeling PDC→outcome and \"adjusting it away\"\n  cannot work: adherence is a marker of the unmeasured health-seeking, so the adherent–nonadherent\n  contrast is biased *by definition*. 
The Dormuth accident result is the proof; an as-treated\n  estimate here is actively misleading.\n- **Do not assume an active comparator removes all of it.** If one drug is the health-conscious\n  patient's preference (e.g., a newer, marketed, \"premium\" agent), residual healthy-user channeling\n  persists — diagnose with balance tables and a negative control before trusting the contrast.\n- **It is overkill** for a biologically specific acute outcome the behaviors cannot plausibly move\n  (e.g., a within-class active-comparator study of an idiosyncratic drug reaction); forcing\n  elaborate proxy adjustment there adds noise and collider risk without addressing a real bias.\n\n**Data-source operational depth**\n- **Claims (FFS vs MA vs commercial):** healthy-user proxies are *utilization* surrogates — counts\n  of age/sex-appropriate screenings (mammography, colonoscopy/FOBT, PSA, bone-density DXA),\n  influenza/pneumococcal vaccination, well/preventive visit E&M codes, distinct preventive drug\n  classes — measured in a fixed pre-index lookback under continuous enrollment. Failure modes:\n  **Medicare Advantage person-time lacks fee-for-service claims**, so a beneficiary can appear to\n  have *zero* screenings simply because the encounter was capitated, not because they are not\n  health-seeking — restrict to FFS Parts A/B/D or commercial medical+pharmacy, or the proxy is\n  differential missingness. **Influenza vaccines and screenings often migrate to pharmacies/retail\n  clinics or are bundled**, so NDC/CPT capture is incomplete and uneven by plan. In the elderly,\n  the sickest patients drop out to hospice/nursing-facility benefits, inducing **differential\n  competing risks by exposure** that mimic a healthy-user advantage. 
**Immortal time** sneaks in\n  when adherence is the exposure or when index is set at diagnosis rather than first fill.\n- **EHR:** richer — structured vitals (BMI, blood-pressure control), labs (LDL, HbA1c, lipid\n  panels), smoking status, and notes (exercise, diet via NLP) give *objective* health-seeking\n  proxies claims cannot. But capture is **visit-driven**: a health-seeking patient generates more\n  encounters, so \"more data = healthier\" is itself the confounder, and patients who leave the\n  network are differentially and informatively lost. Residual confounding remains even with vitals.\n- **Registry:** strong on disease severity and adjudicated outcomes but typically blind to\n  preventive utilization and lifestyle; link to claims for screening/vaccination history and to a\n  death index so a healthy-survivor pattern is not an artifact of incomplete mortality capture.\n- **Linked claims–EHR–vital records:** the best substrate — EHR lifestyle proxies + claims\n  completeness + reliable death — but the linkable subset is itself selected (often more\n  health-engaged), and order/fill/service-date discrepancies must be reconciled before time zero.\n\n**Worked claims example (with falsification).** Question: does long-term bisphosphonate use reduce\nhip fracture in women ≥65 in a Medicare FFS + commercial database? A naive new-user-vs-nonuser\ncohort risks textbook healthy-user bias (bisphosphonate initiators get DXA scans, attend\ngynecology/primary care, and exercise). Build it defensibly: (1) require **365 days continuous FFS\nA/B/D (or commercial medical+pharmacy) enrollment** before time zero so absence of prior fills and\nof screenings is observed, not MA-masked. 
(2) Time zero = first bisphosphonate `fill_date`\n(`days_supply` defines exposure episodes); washout = no bisphosphonate fill in the prior 365 days.\n(3) In the **pre-index** window only, construct a *health-seeking index*: count distinct preventive\nservices — screening CPT/HCPCS (mammography 77067, FOBT/colonoscopy, bone-density DXA 77080),\ninfluenza vaccine (CPT 90686 / NDC), and well-visit/preventive E&M (99381–99397) — plus distinct\npreventive drug classes and number of outpatient visits. (4) Enter this index, comorbidity scores,\nand area-level deprivation (ZIP-linked ADI) into a high-dimensional propensity score; match 1:1 on\nthe logit-PS (caliper 0.2 SD) and confirm standardized differences <0.1. (5) **Falsify:** run the\nidentical pipeline on a **negative-control outcome the drug cannot prevent** — e.g., motor-vehicle\nor fall-unrelated *accident* hospitalizations, or screening uptake itself. If the \"protective\"\nhazard ratio persists for accidents, residual healthy-user bias remains and the fracture estimate\nis not trustworthy; calibrate the fracture estimate against the negative-control distribution and\nreport an E-value for how strong an unmeasured health-seeking confounder would have to be to\nnullify it. First-event coding: take the *first* qualifying fracture in follow-up, applying an\nacute-event deduplication window so a transfer or readmission is not double-counted.\n\n**Interpreting the output**\n\nIn the statin-adherence cohort (100 adherent, 100 non-adherent initiators), the study reports:\nhospitalization rate 12% (adherent) vs 22% (non-adherent), crude RD = −10 pp; accidental\ninjury rate 4% vs 11%, crude RD = −7 pp; preventive visits mean 3.2 vs 1.1.\n\n*(1) Formal interpretation.* The hospitalization RD of −10 pp conflates pharmacological benefit\nwith the healthy-user bias. 
Adherent patients differ from non-adherent patients not only in\nstatin exposure but in broader health behaviors — the 3.2 versus 1.1 preventive-visit contrast\nand the disparate accident rate provide the falsification signal. Accidental injury is a\nnegative-control outcome: statins have no plausible causal mechanism for preventing accidents.\nThe 7 pp accident-rate gap therefore represents the portion of the outcome differential\nattributable to confounding by health-seeking behavior, not to the drug. A confounder strong\nenough to produce a 7 pp gap in a negative-control outcome may explain a comparable portion of\nthe 10 pp hospitalization estimate.\n\n*(2) Practical interpretation.* A naive comparison of adherent versus non-adherent patients\nwill systematically overestimate treatment benefit across most drug classes because adherence\nis a downstream marker of health engagement. The −10 pp hospitalization difference cannot be\nattributed to statins without accounting for the −7 pp accident difference that statins cannot\nexplain. Negative-control outcomes and falsification tests are the principal tools for\ndetecting and bounding this effect in routine RWE, and should be pre-specified rather than\nrun post hoc when a healthy-user mechanism is plausible.",
    "primary_category": "Bias_Control",
    "tags": [
      "bias",
      "healthy-user",
      "healthy-adherer",
      "preventive-therapy",
      "confounding",
      "channeling",
      "adherence",
      "negative-control",
      "claims"
    ],
    "applies_to_study_types": [
      "new_user",
      "active_comparator_new_user",
      "cohort_retrospective",
      "claims_analysis",
      "ehr_study"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1007/s11606-010-1609-1",
        "url": "https://doi.org/10.1007/s11606-010-1609-1",
        "citation_text": "Shrank WH, Patrick AR, Brookhart MA. Healthy user and related biases in observational studies of preventive interventions: a primer for physicians. Journal of General Internal Medicine. 2011;26(5):546-550.",
        "year": 2011,
        "authors_short": "Shrank et al.",
        "notes": "Canonical primer naming and decomposing healthy-user, healthy-adherer, and related selection biases in preventive-therapy studies, with the proxy-adjustment and design countermeasures."
      },
      {
        "role": "explain",
        "doi": "10.1093/aje/kwm070",
        "url": "https://doi.org/10.1093/aje/kwm070",
        "citation_text": "Brookhart MA, Patrick AR, Dormuth C, et al. Adherence to lipid-lowering therapy and the use of preventive health services: an investigation of the healthy user effect. American Journal of Epidemiology. 2007;166(3):348-354.",
        "year": 2007,
        "authors_short": "Brookhart et al.",
        "notes": "Empirically links adherence to lipid-lowering drugs with preventive-service use, demonstrating the mechanism by which the healthy user effect arises and why utilization proxies are needed."
      },
      {
        "role": "demonstrate",
        "doi": "10.1161/circulationaha.108.824151",
        "url": "https://doi.org/10.1161/circulationaha.108.824151",
        "citation_text": "Dormuth CR, Patrick AR, Shrank WH, et al. Statin adherence and risk of accidents: a cautionary tale. Circulation. 2009;119(15):2051-2057.",
        "year": 2009,
        "authors_short": "Dormuth et al.",
        "notes": "The definitive falsification — statin adherers had fewer motor-vehicle and workplace accidents and unrelated illnesses, an effect no pharmacology explains; the template for negative-control falsification of healthy-adherer bias."
      },
      {
        "role": "demonstrate",
        "doi": "10.1016/j.jclinepi.2005.12.012",
        "url": "https://doi.org/10.1016/j.jclinepi.2005.12.012",
        "citation_text": "Glynn RJ, Schneeweiss S, Wang PS, Levin R, Avorn J. Selective prescribing led to overestimation of the benefits of lipid-lowering drugs. Journal of Clinical Epidemiology. 2006;59(8):819-828.",
        "year": 2006,
        "authors_short": "Glynn et al.",
        "notes": "Shows selective (healthy-user) prescribing inflates apparent benefit of lipid-lowering drugs and quantifies how design choices and proxy adjustment attenuate the bias."
      }
    ],
    "plain_language_summary": "Healthy user bias happens when the people who start or keep taking a preventive medicine — statins, blood-pressure pills, vaccines — are already healthier in ways a dataset can never fully see: they exercise, eat better, go to the doctor regularly, and follow through on every health recommendation, not just the drug. Because those hidden advantages lower their risk of bad outcomes, a study comparing them to non-users or non-adherent patients will make the drug look more protective than it really is. The single clearest proof is that statin adherents also have fewer car accidents — a result no pill can explain, so the advantage belongs to the person, not the medicine.",
    "key_terms": [
      {
        "term": "adherent patient",
        "definition": "A patient who consistently fills and takes their medication as prescribed over a defined treatment period."
      },
      {
        "term": "active comparator",
        "definition": "A comparison group that takes a different drug for the same condition, so both groups had to pass the same health-seeking threshold to get treated at all — reducing the hidden health advantage between groups."
      },
      {
        "term": "negative-control outcome",
        "definition": "A health event the drug being studied cannot biologically cause or prevent — such as car accidents — used to check whether an apparent benefit is really just evidence of a healthier patient group."
      },
      {
        "term": "confounding by unmeasured behavior",
        "definition": "When a hidden behavior — like exercising or attending routine check-ups — simultaneously predicts who gets treated and who stays healthier, making the treatment look responsible for an outcome it did not actually cause."
      },
      {
        "term": "health-seeking behavior",
        "definition": "The pattern of actions — attending preventive visits, getting screened, filling prescriptions, eating well — that marks a person as proactively engaged in their own health, independent of any single treatment."
      }
    ],
    "worked_example": {
      "scenario": "Imagine a claims database study asking whether patients who adhere to a cholesterol-lowering statin have fewer hospitalizations than those who do not. The analyst splits 200 statin initiators into two groups based on how consistently they refilled their prescription over one year: 100 who refilled regularly (adherent) and 100 who did not (non-adherent). To test whether any outcome difference is really caused by the drug, the analyst also looks at a placebo-marker: emergency department visits for accidental injury — something a statin cannot prevent.",
      "dataset": {
        "caption": "One-year follow-up outcomes for 100 adherent vs 100 non-adherent statin initiators (illustrative counts).",
        "columns": [
          "group",
          "n_patients",
          "hospitalization_rate",
          "accidental_injury_rate",
          "preventive_visits_past_year"
        ],
        "rows": [
          [
            "Adherent (high refill rate)",
            100,
            "12%",
            "4%",
            "mean 3.2 visits"
          ],
          [
            "Non-adherent (low refill rate)",
            100,
            "22%",
            "11%",
            "mean 1.1 visits"
          ]
        ]
      },
      "steps": [
        "Hospitalizations differ: 12 events in the adherent group versus 22 in the non-adherent group — a 10-percentage-point gap that looks like a drug benefit.",
        "But accidental injuries also differ: 4 events versus 11 — a 7-percentage-point gap the statin cannot explain, because a pill does not prevent car crashes.",
        "The adherent group also averaged 3.2 preventive doctor visits in the prior year, versus only 1.1 for the non-adherent group — showing the adherent patients were already more health-engaged before any outcome difference could accumulate.",
        "Because the same people who refilled their statin also exercised, attended more check-ups, and wore seatbelts, their lower hospitalization rate reflects those clustered healthy behaviors, not just the drug.",
        "The accident-rate gap is the tell: it reveals that a hidden health advantage — not the statin — is the main reason the adherent group looks better across all outcomes."
      ],
      "result": "The adherent group had 10 fewer hospitalizations per 100 patients (12% vs 22%), but also 7 fewer accidental injuries per 100 patients (4% vs 11%) — an effect no cholesterol drug can produce. This spurious association shows healthy behaviors are clustering with adherence. The lesson: never trust an adherent-vs-non-adherent comparison as a measure of drug benefit; use an active comparator (another drug for the same condition) so both groups start from a comparable level of health-seeking."
    },
    "prerequisites": [
      "new-user-design",
      "active-comparator-new-user",
      "negative-control-outcomes-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Healthy-initiator proxy via preventive-service utilization (claims)",
        "description": "Build a pre-index health-seeking index from counts of age/sex-appropriate screenings (mammography, colonoscopy/FOBT, PSA, DXA), influenza/pneumococcal vaccination, well/preventive E&M visits, and distinct preventive drug classes, all measured under continuous FFS enrollment in a fixed lookback; enter into a (high-dimensional) propensity score.",
        "edge_cases": [
          "Over-adjustment if a screening is on the causal path to the outcome (e.g., screening detection advancing diagnosis date).",
          "MA-only or capitated person-time makes screenings invisible, turning the proxy into differential missingness rather than a true zero."
        ],
        "data_source_notes": "claims: CPT/HCPCS screening codes + influenza NDC/CPT + 99381-99397 well-visit E&M; require FFS A/B/D or commercial medical+pharmacy across the lookback. ehr: prefer structured vitals/labs/smoking and NLP lifestyle as objective proxies."
      },
      {
        "name": "Healthy-adherer contrast (adherent vs non-adherent) — to be avoided as an exposure",
        "description": "Comparing high-PDC to low-PDC patients on outcomes; adherence is a marker of unmeasured health-seeking, so this contrast is biased by definition and is used as a *falsification* device, not an effect estimate.",
        "edge_cases": [
          "Per-protocol/as-treated estimates inherit and amplify the bias relative to intention-to-treat.",
          "PDC must exclude post-baseline time; defining exposure by future adherence injects immortal time."
        ],
        "data_source_notes": "claims: compute PDC from fill_date + days_supply with stockpiling carry-over, censoring at death/disenrollment; never adjust the contrast away — falsify it with a negative control such as accident hospitalizations."
      },
      {
        "name": "Negative-control-outcome falsification",
        "description": "Run the identical analytic pipeline on an outcome the exposure cannot biologically affect (accidents, screening uptake, unrelated infections) to measure residual healthy-user bias and to empirically calibrate the primary estimate.",
        "edge_cases": [
          "A valid negative control must share the healthy-user confounding structure without a true causal path; accidents are the field standard following Dormuth."
        ],
        "data_source_notes": "claims/ehr: code the negative-control event with the same enrollment, washout, and first-event rules as the primary outcome so any apparent effect is attributable to confounding."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Active-comparator new-user design",
        "pros_of_this": "Healthy user bias is the *problem*; naming and measuring it (proxies, negative controls, E-values) lets you quantify residual bias even when a clean comparator is unavailable.",
        "cons_of_this": "Proxy adjustment is always incomplete in claims; an active-comparator design removes most of the bias structurally and should be the first choice when an interchangeable comparator exists.",
        "when_to_prefer": "When no clinically interchangeable active comparator exists and a non-user contrast is unavoidable, so explicit proxy adjustment plus falsification is the only available defense."
      },
      {
        "compared_to": "Prevalent-user bias",
        "pros_of_this": "Captures behavioral selection into and persistence with preventive care, including among incident users, not only the survivor selection of prevalent cohorts.",
        "cons_of_this": "Harder to measure directly than prevalent-user status (no fill-history duration to key on); requires behavioral proxies or external data.",
        "when_to_prefer": "Preventive/chronic-therapy studies where health-seeking behavior, not merely treatment persistence, drives the spurious benefit."
      },
      {
        "compared_to": "E-value / quantitative bias analysis",
        "pros_of_this": "Design and proxy measurement can attenuate the actual bias rather than only describing its potential magnitude.",
        "cons_of_this": "Cannot prove the residual is zero; an E-value/QBA is still needed to bound what unmeasured health-seeking could do.",
        "when_to_prefer": "Always report an E-value alongside the design and proxy adjustments; treat it as a mandatory sensitivity element, not a substitute for design."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Construct health-seeking proxies (screening CPT/HCPCS, influenza NDC/CPT, well-visit E&M, preventive drug-class counts) strictly in the pre-index lookback under continuous FFS A/B/D or commercial medical+pharmacy enrollment; exclude MA-only person-time so absent screenings are true zeros, not capitated missingness. Feed proxies into a high-dimensional propensity score and run a negative-control outcome (e.g., accident hospitalizations) through the identical pipeline.",
      "ehr": "Use structured vitals (BMI, BP control), labs (LDL, HbA1c), smoking status, and NLP-extracted lifestyle as objective proxies; remember capture is visit-driven (health-seeking patients generate more data) and loss to follow-up is informative.",
      "registry": "Strong on severity and adjudicated outcomes but blind to preventive utilization; link to claims for screening/vaccination history and to a death index to rule out a healthy-survivor artifact from incomplete mortality.",
      "linked": "Linked claims-EHR-vital-records is ideal (lifestyle + completeness + mortality) but the linkable subset is itself health-engaged; reconcile order/fill/service dates before time zero."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\nimport numpy as np\n\nLOOKBACK = 365  # pre-index window for health-seeking proxies and continuous-enrollment check\n\ndef health_seeking_cohort(index_df, med, enroll):\n    \"\"\"index_df: person_id, index_date (first study fill). Returns one row/person with proxies.\"\"\"\n    # Require continuous, FFS-observable enrollment across the whole lookback through index.\n    e = enroll.merge(index_df[[\"person_id\", \"index_date\"]], on=\"person_id\")\n    e[\"covers\"] = ((e[\"enroll_start\"] <= e[\"index_date\"] - pd.Timedelta(days=LOOKBACK)) &\n                   (e[\"enroll_end\"]   >= e[\"index_date\"]) & (~e[\"ma_only\"]))\n    eligible = e.loc[e[\"covers\"], \"person_id\"].unique()\n    cohort = index_df[index_df[\"person_id\"].isin(eligible)].copy()\n\n    # Pre-index claims window only: [index_date - LOOKBACK, index_date)\n    m = med.merge(cohort[[\"person_id\", \"index_date\"]], on=\"person_id\")\n    pre = m[(m[\"service_date\"] >= m[\"index_date\"] - pd.Timedelta(days=LOOKBACK)) &\n            (m[\"service_date\"] <  m[\"index_date\"])]\n\n    def _count(flag):\n        c = (pre[pre[\"code_type\"] == flag].groupby(\"person_id\")[\"code\"]\n                .nunique().rename(f\"n_{flag.lower()}\"))\n        return cohort[\"person_id\"].map(c).fillna(0).astype(int)\n\n    cohort[\"n_screen\"]    = _count(\"SCREEN\")    # distinct screening procedures (mammo, DXA, FOBT...)\n    cohort[\"n_vaccine\"]   = _count(\"VACCINE\")   # influenza/pneumococcal\n    cohort[\"n_wellvisit\"] = _count(\"WELLVISIT\") # preventive E&M 99381-99397\n    # Composite health-seeking index = sum of distinct preventive touchpoints (a hdPS proxy).\n    cohort[\"health_seeking_index\"] = (cohort[\"n_screen\"] + cohort[\"n_vaccine\"] +\n                                      cohort[\"n_wellvisit\"])\n    return cohort\n\ndef negative_control_rate(index_df, med, enroll, fu_days=365, control=\"ACCIDENT\"):\n    \"\"\"First post-index negative-control 
event (drug cannot cause it) -> residual-bias check.\"\"\"\n    cohort = health_seeking_cohort(index_df, med, enroll)\n    nc = med[med[\"code_type\"] == control].merge(\n        cohort[[\"person_id\", \"index_date\"]], on=\"person_id\")\n    nc = nc[(nc[\"service_date\"] > nc[\"index_date\"]) &\n            (nc[\"service_date\"] <= nc[\"index_date\"] + pd.Timedelta(days=fu_days))]\n    first_event = nc.groupby(\"person_id\")[\"service_date\"].min()  # acute-event dedup: first only\n    cohort[\"nc_event\"] = cohort[\"person_id\"].isin(first_event.index).astype(int)\n    # If exposed (treated) initiators have FEWER accidents than comparators, healthy-user bias remains.\n    return cohort",
        "description": "Build a pre-index health-seeking proxy index and a propensity-score-ready cohort from claims, then\nrun a negative-control-outcome falsification. Required inputs (cleaned, de-duplicated):\n  rx     : person_id, fill_date (datetime), drug_class ('STATIN'/'OTHER'/...), days_supply\n  med    : person_id, service_date (datetime), code (CPT/HCPCS/E&M), code_type ('SCREEN'/'VACCINE'/'WELLVISIT'/'ACCIDENT'/'OUTCOME')\n  enroll : person_id, enroll_start, enroll_end, ma_only (bool)   # ma_only person-time lacks FFS claims\nindex_date is supplied as the first qualifying study fill (built upstream by the new-user design).\nProxies are measured ONLY in [index_date - LOOKBACK, index_date) to avoid collider/mediator bias.",
        "dependencies": [
          "pandas",
          "numpy"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\nLOOKBACK <- 365L\n\nhealth_seeking_cohort <- function(index_df, med, enroll) {\n  setDT(index_df); setDT(med); setDT(enroll)\n\n  # Continuous FFS-observable enrollment across the lookback through index (no MA-only spans).\n  e <- merge(enroll, index_df[, .(person_id, index_date)], by = \"person_id\")\n  ok <- e[enroll_start <= index_date - LOOKBACK &\n          enroll_end   >= index_date & !ma_only, unique(person_id)]\n  cohort <- index_df[person_id %chin% ok]\n\n  # Pre-index claims only: [index_date - LOOKBACK, index_date)\n  m <- merge(med, cohort[, .(person_id, index_date)], by = \"person_id\")\n  pre <- m[service_date >= index_date - LOOKBACK & service_date < index_date]\n\n  cnt <- function(flag) pre[code_type == flag,\n                            .(n = uniqueN(code)), by = person_id]\n  for (f in c(\"SCREEN\", \"VACCINE\", \"WELLVISIT\")) {\n    col <- paste0(\"n_\", tolower(f))\n    cohort[cnt(f), (col) := i.n, on = \"person_id\"]\n    cohort[is.na(get(col)), (col) := 0L]\n  }\n  # Composite health-seeking index = distinct preventive touchpoints (a hdPS proxy).\n  cohort[, health_seeking_index := n_screen + n_vaccine + n_wellvisit]\n  cohort[]\n}\n\nnegative_control_rate <- function(index_df, med, enroll, fu_days = 365L, control = \"ACCIDENT\") {\n  cohort <- health_seeking_cohort(index_df, med, enroll)\n  nc <- merge(med[code_type == control], cohort[, .(person_id, index_date)], by = \"person_id\")\n  nc <- nc[service_date > index_date & service_date <= index_date + fu_days]\n  first <- nc[, .(ev = min(service_date)), by = person_id]   # first event only (acute dedup)\n  cohort[, nc_event := as.integer(person_id %chin% first$person_id)]\n  cohort[]  # fewer accidents in the treated arm => residual healthy-user bias\n}",
        "description": "Pre-index health-seeking proxy index and negative-control falsification with data.table.\nInputs mirror the Python version:\n  index_df : person_id, index_date (Date)      # first qualifying study fill from the new-user design\n  med      : person_id, service_date (Date), code, code_type ('SCREEN'/'VACCINE'/'WELLVISIT'/'ACCIDENT')\n  enroll   : person_id, enroll_start, enroll_end, ma_only (logical)\nProxies are measured only in [index_date - LOOKBACK, index_date).",
        "dependencies": [
          "data.table"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "%let lookback = 365;\n\n/* Continuous, FFS-observable enrollment across the full lookback through index (no MA-only spans). */\nproc sql;\n  create table cohort as\n  select i.person_id, i.index_date\n  from work.index_df i\n  where exists (\n    select 1 from work.enroll e\n    where e.person_id = i.person_id and e.ma_only = 0\n      and e.enroll_start <= i.index_date - &lookback\n      and e.enroll_end   >= i.index_date\n  );\nquit;\n\n/* Pre-index health-seeking proxies: distinct preventive touchpoints in [index-365, index). */\nproc sql;\n  create table proxies as\n  select c.person_id, c.index_date,\n         count(distinct case when m.code_type='SCREEN'    then m.code end) as n_screen,\n         count(distinct case when m.code_type='VACCINE'   then m.code end) as n_vaccine,\n         count(distinct case when m.code_type='WELLVISIT' then m.code end) as n_wellvisit\n  from cohort c\n  left join work.med m\n    on  m.person_id = c.person_id\n    and m.service_date >= c.index_date - &lookback\n    and m.service_date <  c.index_date\n  group by c.person_id, c.index_date;\nquit;\n\ndata proxies;\n  set proxies;\n  health_seeking_index = n_screen + n_vaccine + n_wellvisit;  /* hdPS proxy for health-seeking */\nrun;\n\n/* Negative-control outcome: first post-index ACCIDENT (drug cannot cause it) within follow-up. 
*/\nproc sql;\n  create table nc as\n  select c.person_id, min(m.service_date) as nc_date format=date9.\n  from cohort c\n  inner join work.med m\n    on  m.person_id = c.person_id and m.code_type = 'ACCIDENT'\n    and m.service_date >  c.index_date\n    and m.service_date <= c.index_date + 365\n  group by c.person_id;                 /* first event only = acute-event dedup */\nquit;\n\ndata analytic;                          /* join proxies + arm + baseline covariates (work.cov) upstream */\n  merge proxies(in=a) nc(in=b);\n  by person_id;\n  nc_event = b;                         /* 1 if a negative-control accident occurred */\nrun;\n\n/* PS on health-seeking proxies + covariates; persistence of a protective HR for nc_event => residual bias. */\nproc psmatch data=work.analytic_with_arm region=allobs;\n  class arm;\n  psmodel arm(treated='STATIN') = health_seeking_index n_screen n_vaccine n_wellvisit\n                                  /* + comorbidity, age, sex, area-deprivation covariates */;\n  match method=greedy(k=1) distance=lps caliper=0.2;     /* 0.2 SD of logit-PS */\n  assess lps var=(health_seeking_index) / plots=(boxplot);\n  output out(obs=match)=matched matchid=mid;\nrun;",
        "description": "Pre-index health-seeking proxy index, negative-control falsification, and PS modeling in SAS.\nRequired input datasets (post data-management):\n  work.index_df : person_id, index_date            (first qualifying study fill; new-user design upstream)\n  work.med      : person_id, service_date, code, code_type ('SCREEN'/'VACCINE'/'WELLVISIT'/'ACCIDENT')\n  work.enroll   : person_id, enroll_start, enroll_end, ma_only (0/1)\nPROC PSMATCH requires SAS/STAT 14.2+; confirm post-match standardized differences <0.1 before any\noutcome model. Proxies are measured strictly before index_date to avoid collider/mediator bias.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  U[\"Unmeasured health-seeking<br/>(diet, exercise, SES, access, adherence)\"]\n  Tx[\"Preventive-drug initiation<br/>/ adherence\"]\n  Y[\"Outcome<br/>(mortality, CV event, fracture)\"]\n  NC[\"Negative-control outcome<br/>(accidents, screening uptake)\"]\n  U -->|selects into treatment| Tx\n  U -->|lowers event risk| Y\n  U -->|also lowers| NC\n  Tx -->|true drug effect| Y\n  Tx -.->|no biological path| NC\n  style U fill:#ffcccc,stroke:#cc0000\n  style NC fill:#cce5ff,stroke:#3366cc",
        "caption": "Healthy user bias as a backdoor path. Unmeasured health-seeking (red) confounds the treatment-outcome relationship. A negative-control outcome (blue) shares the same backdoor from health-seeking but has no causal path from treatment, so any apparent treatment effect on it measures the residual confounding directly.",
        "alt_text": "DAG showing unmeasured health-seeking pointing to treatment, the outcome, and a negative-control outcome, with a true drug-outcome arrow but a dashed (absent) drug-to-negative-control arrow.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Q[Preventive / chronic therapy question] --> C{Clinically interchangeable<br/>active comparator exists?}\n  C -->|Yes| ACNU[Active-comparator new-user design<br/>structurally removes most healthy-user bias]\n  C -->|No| NU[New-user vs non-user<br/>residual healthy-user bias likely]\n  NU --> P[Build pre-index health-seeking proxies<br/>screenings, vaccines, well-visits -> hdPS]\n  ACNU --> P\n  P --> NCO[Run a negative-control outcome<br/>e.g., accidents, screening uptake]\n  NCO --> R{Apparent effect on<br/>negative control?}\n  R -->|Yes| BIAS[Residual healthy-user bias:<br/>calibrate estimate, report E-value, do not trust]\n  R -->|No| OK[Report primary estimate<br/>with E-value sensitivity]",
        "caption": "Decision logic for designing and analyzing against healthy user bias — prefer an active-comparator new-user structure, adjust for pre-index health-seeking proxies, and falsify with a negative-control outcome before trusting the primary estimate.",
        "alt_text": "Decision flowchart from research question through active-comparator choice, proxy construction, negative-control falsification, and the trust/no-trust decision.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "see_also",
        "target_slug": "active-comparator-new-user",
        "notes": "An active-comparator new-user design is the primary structural defense against healthy user bias because both arms cleared the same health-seeking and access thresholds."
      },
      {
        "relation_type": "see_also",
        "target_slug": "prevalent-user-bias",
        "notes": "Frequently co-occurs; prevalent users are healthy survivors who tolerated and persisted with therapy (depletion of susceptibles), a healthy-survivor variant of the same selection."
      },
      {
        "relation_type": "see_also",
        "target_slug": "new-user-design",
        "notes": "New-user restriction reduces but does not eliminate healthy user bias; combine with rich proxy adjustment and negative controls."
      },
      {
        "relation_type": "used_with",
        "target_slug": "negative-control-outcomes-rwe",
        "notes": "Negative-control outcomes (e.g., accidents, screening uptake) directly measure residual healthy-user bias and enable empirical calibration — the Dormuth falsification template."
      },
      {
        "relation_type": "used_with",
        "target_slug": "high-dimensional-propensity-score-hdps-rwe",
        "notes": "hdPS empirically recovers utilization proxies for health-seeking (screening, vaccination, preventive visits) that investigator-selected covariate lists omit."
      },
      {
        "relation_type": "used_with",
        "target_slug": "e-value-sensitivity-analysis",
        "notes": "E-value quantifies how strong the unmeasured health-seeking confounder would have to be to explain away the apparent benefit."
      },
      {
        "relation_type": "see_also",
        "target_slug": "immortal-time-bias-handling",
        "notes": "Defining exposure by future adherence (healthy-adherer contrasts) injects immortal time; align time zero at first fill to avoid compounding the biases."
      },
      {
        "relation_type": "see_also",
        "target_slug": "pdc-proportion-of-days-covered",
        "notes": "Adherence (PDC/MPR) is a marker of unmeasured health-seeking, so adherent-vs-nonadherent contrasts are biased by definition and should be falsified, not adjusted away."
      },
      {
        "relation_type": "affects",
        "target_slug": "cox-ph-regression",
        "notes": "Unmeasured health-seeking biases hazard ratios toward apparent protection for preventive therapies and can induce non-proportional hazards as healthier patients persist longer."
      },
      {
        "relation_type": "affects",
        "target_slug": "logistic-regression-for-binary-outcomes",
        "notes": "Healthy users have lower event probabilities for non-treatment reasons, biasing odds ratios protective in studies of preventive medications."
      }
    ],
    "aliases": [
      "healthy user bias",
      "healthy-user effect",
      "healthy adherer bias",
      "healthy adherer effect",
      "healthy initiator bias"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "high-dimensional-propensity-score-hdps-rwe",
    "name": "High-Dimensional Propensity Score (hdPS)",
    "short_definition": "A data-adaptive confounding-control algorithm that empirically generates and prioritizes hundreds of pre-exposure claims/EHR codes as proxies for unmeasured confounders, then collapses them into a single propensity score used for matching, weighting, or stratification.",
    "long_description": "The **high-dimensional propensity score (hdPS)** automates the selection of confounder proxies in large healthcare databases.\nInvestigator-specified propensity scores depend on a short list of variables a human thought to measure; in claims and EHR data\nthe strongest confounders (frailty, disease severity, functional status, access, health-seeking behavior, socioeconomic position)\nare never coded directly. hdPS (Schneeweiss et al. 2009) exploits the fact that the *thousands* of diagnosis, procedure, drug,\nand utilization codes recorded before exposure are noisy **proxies** for those latent factors, and lets the data rank which\nproxies to include. The canonical algorithm has seven steps: (1) specify the cohort, exposure (typically a new-user, active-comparator\ncontrast), outcome, and a pre-exposure lookback (commonly 365 days); (2) within each data **dimension** — inpatient diagnoses,\noutpatient diagnoses, procedures, dispensed drugs, and utilization counts — identify the most prevalent codes; (3) for each\ncandidate code, generate up to three binary recurrence covariates (once / sporadic / frequent relative to the cohort median);\n(4) rank every candidate by its potential confounding impact using the **Bross multiplicative bias formula**, which combines the\ncovariate's prevalence in the exposed and unexposed and its marginal association with the outcome; (5) select the top *k* per\ndimension (Schneeweiss's default selects ~200 across dimensions); (6) fit a logistic PS on the selected proxies plus force-included\ninvestigator confounders (age, sex, calendar time, indication); (7) apply the PS for matching, weighting, stratification, or\nSMR, then assess balance, trim non-overlap, and run the outcome model with robust variance.\n\n**Core estimand distinction.** hdPS is a *covariate-selection and PS-construction* method, not an estimator — it does not itself\ndefine a target estimand. 
The estimand is fixed by how the resulting score is *used*: 1:1 nearest-neighbor matching on the logit-PS\ntargets the **ATT** (effect in the treated); IPTW targets the **ATE**; overlap (Li) weights target the **ATO** (effect in the\nequipoise population, with exact finite-sample balance); stratification/SMR weighting targets an ATT-like quantity in the standard\npopulation. This must be pre-specified — switching from IPTW to matching after seeing imbalance silently changes the population\nthe result generalizes to. Critically, hdPS only reduces confounding bias *for confounders that have proxies in the data*; it has\nno leverage on confounders with no measured footprint, and it does not address selection bias, measurement error, or model\nmisspecification of the outcome stage.\n\n**Pros, cons, and trade-offs.**\n- **vs investigator-specified PS / multivariable regression:** hdPS empirically surfaces proxies a human would never enumerate\n  (e.g., frequency of ophthalmology visits proxying diabetic retinopathy severity), and in plasmode simulations and the small-sample\n  work of Rassen et al. (2011) it reduces residual confounding and improves balance on measured covariates. Cost: the selection is\n  opaque, it can pull in **instruments** (predict exposure, not outcome — these *amplify* residual bias and inflate variance) or\n  **colliders/mediators** if the lookback leaks post-exposure information, and the selected set is database- and era-specific.\n  **Prefer hdPS** in large claims/EHR studies where measured confounding is known to be incomplete; **keep an investigator PS** as\n  a transparent reference analysis.\n- **vs instrumental-variable (IV) methods:** hdPS stays inside the measured data and needs no valid instrument (instruments are rare\n  and nearly impossible to verify in pharmacoepi). 
Cost: hdPS cannot touch confounding with *no* proxy in the data, whereas a valid\n  IV can — but IV targets a different (local average treatment) effect under strong, untestable assumptions. **Prefer hdPS** when the\n  database is proxy-rich and no credible instrument exists.\n- **vs disease-risk scores (DRS) / doubly-robust (TMLE, AIPW):** hdPS is a single-model exposure-side approach; DRS models the outcome\n  side and doubly-robust methods combine both for protection against one-sided misspecification. hdPS-selected covariates are routinely\n  *fed into* TMLE/super-learner pipelines (Schneeweiss 2017 frames hdPS as one screening option among LASSO and high-dimensional DRS).\n  **Prefer plain hdPS** for transparent, regulator-facing comparative-safety work; **escalate** to doubly-robust when both stages risk\n  misspecification and the team can defend the added complexity.\n- **vs negative controls / quantitative bias analysis (QBA):** these are complements, not substitutes. hdPS reduces bias up front;\n  negative-control outcomes and E-value/QBA *detect and bound* the residual bias hdPS cannot remove. A defensible analysis uses both.\n\n**When to use.** Large administrative claims, EHR, or linked databases with a new-user, active-comparator design where (a) the outcome\nis reasonably common, (b) thousands of pre-exposure codes are available as proxies, and (c) key clinical confounders are unmeasured or\npoorly captured. It is the de facto standard confounding-control layer in FDA Sentinel and distributed-network safety studies, and is\nmost valuable precisely when investigator covariate lists are thin.\n\n**When NOT to use — and when it is actively misleading or dangerous.**\n- **Small cohorts or rare outcomes.** With few exposed or few events, the Bross ranking is unstable, PS separation/positivity violations\n  appear, and selecting hundreds of sparse covariates overfits. 
Below ~25–50 events the algorithm can *worsen* bias; fall back to a\n  parsimonious investigator PS or exact matching (Rassen 2011 documents the small-sample failure mode and proposed remedies).\n- **The lookback can leak post-exposure information.** If candidate codes are drawn from a window that includes or straddles the index\n  date, hdPS will happily select **mediators or colliders** (e.g., early treatment toxicities), inducing collider-stratification bias and\n  over-adjustment — the bias gets *built into* the score. Hard-stop candidate extraction strictly before time zero.\n- **Strong instruments are abundant.** In settings dominated by exposure-predictive-but-outcome-irrelevant codes (formulary, provider, or\n  plan markers), unconstrained selection amplifies residual confounding and variance. Screen for, and exclude, near-instruments.\n- **Cross-database transport.** A proxy set tuned to one payer/era does not transport; re-running the algorithm per database is mandatory,\n  not optional. Treating a frozen code list as portable produces silently biased estimates in the new source.\n\n**Data-source operational depth.**\n- **Claims (FFS):** The natural habitat — fee-for-service Medicare and commercial claims pay per service, so diagnosis/procedure/NDC\n  capture is dense and the proxies are rich. Build dimensions explicitly from claim type (facility/inpatient dx and procedures,\n  professional/outpatient dx and procedures, Part D / pharmacy NDCs, DME). Use `service_date` to enforce a strictly pre-index window.\n  Failure mode: codes on **rule-out** claims inflate apparent comorbidity; the once/sporadic/frequent recurrence coding partially\n  mitigates this.\n- **Claims (Medicare Advantage):** MA-only person-time lacks complete FFS claims (capitated encounters are under-reported for non-risk\n  services), so candidate prevalence is distorted and \"absence of a code\" is missingness, not a true negative. 
Conversely, **HCC\n  risk-adjustment coding is enriched** under MA, so a handful of HCC-driving diagnoses can dominate selection. Exclude MA-only spans, or\n  treat MA as a separate stratum/dimension and never pool raw candidate prevalences across FFS and MA.\n- **EHR:** Add clinical dimensions — labs (LOINC + abnormal-flag), vitals, problem-list entries, NLP-derived concepts — which often\n  out-proxy billing codes for severity. But EHR is **visit-driven**: encounter frequency is simultaneously a powerful proxy and a\n  potential collider if care-seeking is itself affected by the (pre-index) disease, and a patient who leaves the system is differentially\n  unobserved. Link to claims for complete pharmacy/procedure history before trusting the proxy set.\n- **Linked claims–EHR–registry:** The richest substrate (registry staging/biomarkers + EHR labs + claims completeness), but linkage\n  selects the linkable subset and introduces order/fill/service date discrepancies that must be reconciled before the pre-index window\n  is fixed. **Differential competing risks** matter in elderly claims cohorts: if one arm has higher background mortality, naive\n  hdPS-adjusted cause-specific analyses can mislead — pair the score with an appropriate competing-risks outcome model.\n\n**Worked claims example.** Question: incident diabetic ketoacidosis with SGLT2 inhibitors vs DPP-4 inhibitors among adults with type 2\ndiabetes in a 100% Medicare FFS (Parts A/B/D) + commercial database. (1) Cohort: age ≥18, ≥2 T2DM diagnoses, **365 days continuous\nFFS-observable enrollment** before the first study fill (`fill_date`), new-user washout = no prior SGLT2i/DPP-4 NDC in that window;\nexclude MA-only person-time. Index date = first qualifying fill; arm = `ndc` dispensed that day. (2) Force-include investigator\nconfounders: age, sex, calendar year, baseline HbA1c-proxy, prior insulin, diabetes-complication count. 
(3) Candidate generation,\nstrictly in `[index_date − 365, index_date)`: in each dimension take the 200 most prevalent codes; for each, create once/sporadic/frequent\nrecurrence covariates → ~3,000 candidates. (4) Bross ranking: for a candidate such as **\"frequent (>median) ophthalmology E/M visits\"**\n(CPT 9920x/9921x), with prevalence ~0.18 among DPP-4 initiators vs ~0.11 among SGLT2i initiators and an outcome relative risk ~1.6, the\nmultiplicative bias term is (0.11 × 0.6 + 1)/(0.18 × 0.6 + 1) ≈ 0.96, which departs from 1 more than most candidates' terms, so this\nretinopathy/severity proxy ranks near the top; a near-instrument like \"mail-order pharmacy\nflag\" (predicts arm, not DKA) ranks low and is dropped. (5) Select top *k*=50 per dimension (~250 covariates), fit the PS by logistic\nregression on selected + forced terms. (6) Apply 1:1 logit-PS matching with a 0.2-SD caliper (estimand = ATT), confirm standardized\nmean differences <0.1 on selected *and* forced covariates, trim non-overlap. (7) Outcome model: Cox or pooled logistic on the matched\nset with robust variance; censor at disenrollment, death, end of data, and (as-treated) last `days_supply` end + grace period. (8)\nSensitivity: vary the total number of selected covariates (100/250/500), the dimension split, the recurrence cut, an investigator-PS reference, and a negative-control\noutcome (e.g., influenza vaccination) to detect residual confounding.\n\n**Interpreting the output**\n\nConsider the SGLT2i versus DPP-4i DKA study above. After incorporating the selected proxy codes (for example frequent\noffice visits, diabetic retinopathy, and insulin fills) alongside investigator-specified covariates in the hdPS,\nsuppose the analysis reports HR = 1.38 (95% CI 1.08–1.75) for diabetic ketoacidosis.\n\nFormal interpretation: In the hdPS-matched cohort, the instantaneous rate of DKA was 38% higher\namong SGLT2i initiators than among their matched DPP-4i initiators (HR 1.38, 95% CI 1.08–1.75). Because SGLT2i initiators were\nmatched 1:1 on the score, this estimates the ATT; an IPTW analysis would instead target the ATE in a weighted pseudo-population. 
The central identification assumption is no unmeasured\nconfounding given all included covariates — both investigator-specified variables and the empirically selected\nproxy codes. The proxy codes are data-driven surrogates for disease severity and care-seeking behavior not directly\ncoded; they reduce but cannot eliminate residual confounding from factors leaving no trace in pre-treatment claims\npatterns. Positivity must also hold: every patient must have had a plausible probability of initiating either\ndrug class.\n\nPractical interpretation: After accounting for measurable differences in diabetes burden — using both clinical\ncovariates and the claims-derived proxies that hdPS surfaced — SGLT2i users developed DKA at a 38% higher rate.\nThe proxy codes captured dimensions of disease burden invisible to a standard propensity model. However, proxies\nare associative signals, not the unmeasured confounders themselves; bias from unmeasured indication factors may\npersist. A sensitivity analysis varying k (100, 250, 500 proxy codes) and a negative-control outcome test should\naccompany the primary estimate to assess residual confounding.",
    "primary_category": "Causal_Inference_Method",
    "tags": [
      "hdps",
      "high-dimensional-propensity-score",
      "empirical-covariate-selection",
      "residual-confounding",
      "proxy-variables",
      "bross-bias-formula",
      "claims-data",
      "confounding-control",
      "sentinel"
    ],
    "applies_to_study_types": [
      "claims_analysis",
      "cohort_retrospective",
      "new_user",
      "active_comparator_new_user",
      "multi_database",
      "target_trial_emulation"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "linked",
      "multi-database"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1097/EDE.0b013e3181a663cc",
        "url": "https://doi.org/10.1097/EDE.0b013e3181a663cc",
        "citation_text": "Schneeweiss S, Rassen JA, Glynn RJ, Avorn J, Mogun H, Brookhart MA. High-dimensional propensity score adjustment in studies of treatment effects using health care claims data. Epidemiology. 2009;20(4):512-522.",
        "year": 2009,
        "authors_short": "Schneeweiss et al.",
        "notes": "Foundational paper defining the seven-step hdPS algorithm, the dimension structure, recurrence coding, and Bross-based covariate prioritization for empirically selecting code proxies of unmeasured confounders in claims data."
      },
      {
        "role": "explain",
        "doi": "10.1097/EDE.0000000000000581",
        "url": "https://doi.org/10.1097/EDE.0000000000000581",
        "citation_text": "Schneeweiss S. Automated data-adaptive analytics for electronic healthcare data to study causal treatment effects. Variable selection for confounding adjustment in high-dimensional covariate spaces when analyzing healthcare databases. Epidemiology. 2017;28(2):237-248.",
        "year": 2017,
        "authors_short": "Schneeweiss",
        "notes": "Situates hdPS within the broader family of high-dimensional variable-selection strategies (Bross prioritization, LASSO, high-dimensional disease-risk scores, super learner) and discusses instrument/collider risks and when each screen is appropriate."
      },
      {
        "role": "explain",
        "doi": "10.1002/pds.5566",
        "url": "https://doi.org/10.1002/pds.5566",
        "citation_text": "Rassen JA, Blin P, Kloss S, Neugebauer RS, Platt RW, Pottegård A, Schneeweiss S, Toh S. High-dimensional propensity scores for empirical covariate selection in secondary database studies: planning, implementation, and reporting. Pharmacoepidemiol Drug Saf. 2023;32(2):93-106.",
        "year": 2023,
        "authors_short": "Rassen et al.",
        "notes": "Contemporary ISPE-aligned implementation and reporting guidance — dimension specification, parameter choices (k, recurrence, lookback), pre-specification, balance diagnostics, and transparency requirements for regulator-facing hdPS analyses."
      },
      {
        "role": "demonstrate",
        "doi": "10.1093/aje/kwr001",
        "url": "https://doi.org/10.1093/aje/kwr001",
        "citation_text": "Rassen JA, Glynn RJ, Brookhart MA, Schneeweiss S. Covariate selection in high-dimensional propensity score analyses of treatment effects in small samples. Am J Epidemiol. 2011;173(12):1404-1413.",
        "year": 2011,
        "authors_short": "Rassen et al.",
        "notes": "Empirical evaluation of hdPS performance as exposed/event counts shrink, documenting the small-sample failure mode (unstable ranking, overfitting) and proposing exposure-based and zero-cell remedies — the key citation for when NOT to use hdPS."
      }
    ],
    "plain_language_summary": "The high-dimensional propensity score (hdPS) is a method that automatically scans hundreds of diagnosis, procedure, and drug codes recorded in a patient's claims history BEFORE they start a new medication, and uses those codes as stand-ins (proxies) for health factors the data never directly measured -- like how sick a patient really is or how often they seek care. Instead of a researcher hand-picking a short list of confounders, the algorithm ranks every candidate code by how differently it appears between the two treatment groups AND how strongly it relates to the outcome, then builds a single summary score from the top-ranked codes. That score is then used to match or weight patients so the two comparison groups look more similar on all those hidden health differences -- though hdPS cannot fix confounding from factors that left no footprint at all in the claims data.",
    "key_terms": [
      {
        "term": "high-dimensional propensity score",
        "definition": "A single number (0 to 1) summarizing how likely a patient is to have received the study drug rather than the comparator, calculated from hundreds of automatically selected pre-treatment claims codes rather than a short investigator-chosen list."
      },
      {
        "term": "proxy confounder",
        "definition": "A code in the data (such as a frequent ophthalmology visit) that is not the unmeasured factor itself (like diabetic eye disease severity) but tracks closely enough with it to partially stand in for it when adjusting for confounding."
      },
      {
        "term": "Bross bias formula",
        "definition": "The ranking formula hdPS uses to score each candidate code: codes that appear more often in one treatment group AND are associated with the outcome get a higher score and are more likely to be selected."
      },
      {
        "term": "dimension",
        "definition": "One category of claims codes that hdPS searches separately -- for example, inpatient diagnoses, outpatient diagnoses, filled prescriptions, and procedures are each their own dimension."
      },
      {
        "term": "confounding by indication",
        "definition": "The problem where sicker patients are more likely to receive one drug over another, so any apparent difference in outcomes between groups may reflect the underlying illness rather than the drug itself."
      }
    ],
    "worked_example": {
      "scenario": "A researcher is comparing SGLT2 inhibitors vs DPP-4 inhibitors for risk of diabetic ketoacidosis in Medicare claims. She cannot directly measure disease severity or how actively a patient manages their diabetes -- those factors are never coded. She runs hdPS on the 365-day pre-treatment claims window to let the data surface proxy codes for those unmeasured factors. The table below shows five candidate codes the algorithm evaluates, their prevalence in each treatment group, and whether they get selected.",
      "dataset": {
        "caption": "Five candidate proxy codes evaluated by hdPS from the pre-treatment lookback window (365 days before first fill). Prevalence = share of patients in that arm who had the code at least once. Outcome RR = how much more common diabetic ketoacidosis is among patients with vs without that code.",
        "columns": [
          "candidate_code",
          "description",
          "prev_sglt2i",
          "prev_dpp4i",
          "outcome_rr",
          "bross_score",
          "selected"
        ],
        "rows": [
          [
            "CPT 99213 (frequent)",
            "Outpatient office visit -- frequent user (more than median)",
            0.38,
            0.52,
            1.7,
            "HIGH",
            "Yes -- strong proxy for care-seeking and disease severity"
          ],
          [
            "ICD E11.319 (once)",
            "Diabetic retinopathy, unspecified -- any occurrence",
            0.11,
            0.18,
            1.6,
            "HIGH",
            "Yes -- proxy for longstanding diabetes complications"
          ],
          [
            "NDC insulin glargine (once)",
            "Long-acting insulin fill -- any occurrence",
            0.24,
            0.41,
            1.8,
            "HIGH",
            "Yes -- proxy for more advanced diabetes requiring insulin"
          ],
          [
            "mail-order pharmacy flag",
            "Filled any script via mail-order pharmacy",
            0.29,
            0.31,
            1.05,
            "LOW",
            "No -- predicts which drug arm but not the outcome (near-instrument)"
          ],
          [
            "ICD Z71.89 (once)",
            "Encounter for other specified counseling -- any occurrence",
            0.06,
            0.07,
            1.02,
            "LOW",
            "No -- equal prevalence and no outcome link; adds noise"
          ]
        ]
      },
      "steps": [
        "For each candidate code, count how many patients in each arm had it at least once, sporadically (at or above the group median count), or frequently (above the median) during the 365-day lookback -- these are the three recurrence covariates per code.",
        "Apply the Bross formula: a code scores high when (a) its prevalence differs noticeably between the SGLT2i and DPP-4i groups AND (b) patients who have the code get diabetic ketoacidosis at a meaningfully higher rate than those who do not.",
        "Frequent outpatient visits (CPT 99213), diabetic retinopathy diagnosis (E11.319), and insulin use all have both unequal prevalence between arms AND a raised outcome RR -- so their Bross scores are high and they are selected as proxy confounders.",
        "The mail-order pharmacy flag has unequal prevalence (it predicts which drug was chosen) but an outcome RR near 1.0 -- it is a near-instrument that would amplify bias if included, so it is dropped.",
        "The counseling code has equal prevalence and no outcome link -- it is uninformative and dropped.",
        "The top-ranked codes across all dimensions (up to ~50 per dimension) are combined with investigator-specified confounders (age, sex, calendar year, diabetes severity) to fit a logistic regression predicting treatment arm.",
        "Each patient receives a propensity score from that model; patients are then matched 1-to-1 on those scores so each SGLT2i user is paired with a similarly-scored DPP-4i user, balancing both the selected proxies and the directly measured covariates."
      ],
      "result": "Three proxy codes (frequent office visits, retinopathy diagnosis, insulin use) are selected because they satisfy both Bross criteria; two codes are dropped (one near-instrument, one uninformative). The final hdPS model incorporates these proxies alongside forced investigator confounders, producing a propensity score used for 1:1 matching to estimate the ATT (the treatment effect in patients similar to those who actually initiated SGLT2i)."
    },
    "prerequisites": [
      "propensity-score-methods-psm-iptw",
      "dags-backdoor-criterion-drug-studies",
      "active-comparator-new-user"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Dimension-balanced hdPS (recommended default)",
        "description": "Select a fixed top-k per dimension (e.g., 50 inpatient dx, 50 outpatient dx, 50 procedures, 50 drugs, 50 utilization) so a high-volume noisy dimension such as outpatient diagnoses cannot dominate the selected set.",
        "edge_cases": [
          "Procedure- or rare-disease-focused studies may have too few candidates in some dimensions to fill the quota.",
          "A fixed quota can still admit weak proxies within a sparse dimension; pair with a minimum prevalence threshold."
        ],
        "data_source_notes": "claims: define dimensions by claim type (facility IP vs professional OP vs pharmacy NDC vs DME). In MA encounter data, HCC-driving diagnoses are over-represented; separate or down-weight them so risk-adjustment coding does not crowd out clinical proxies."
      },
      {
        "name": "Investigator-forced + hdPS hybrid",
        "description": "Force key measured confounders (age, sex, calendar time, indication and severity markers) into the PS model regardless of empirical rank, and let hdPS add proxies on top. This is the standard production configuration.",
        "edge_cases": [
          "Forced variables can be collinear with empirically selected proxies, inflating PS-model variance; check VIF/separation."
        ],
        "data_source_notes": "Preserves face validity and regulatory defensibility while still capturing data-driven proxies for unmeasured factors."
      },
      {
        "name": "ML-screened / doubly-robust hdPS",
        "description": "Replace or augment the Bross ranking with LASSO, random-forest importance, or a high-dimensional disease-risk score, and/or feed hdPS-selected covariates into TMLE/AIPW for double robustness (per Schneeweiss 2017).",
        "edge_cases": [
          "Outcome-adaptive selection risks data leakage; require cross-fitting or sample-splitting to separate selection from estimation."
        ],
        "data_source_notes": "Emerging practice; document the screen, the cross-fitting scheme, and the final estimand explicitly."
      },
      {
        "name": "Distributed / multi-database hdPS",
        "description": "Run candidate generation and selection locally in each data partner (FDA Sentinel pattern), sharing only summary statistics or the selected covariate matrix rather than patient-level data.",
        "edge_cases": [
          "Selected proxy sets differ by site/era and do not transport; harmonize vocabularies (e.g., OMOP concept sets) but allow site-specific selection."
        ],
        "data_source_notes": "multi-database: privacy-preserving; balance and bias diagnostics must be produced and reviewed per partner before pooling effects."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "investigator-specified PS or multivariable regression",
        "pros_of_this": "Empirically surfaces hundreds of proxy covariates for unmeasured confounders (severity, frailty, access, health-seeking) that no investigator list enumerates; improves measured-covariate balance and reduces residual bias in simulations and applied studies.",
        "cons_of_this": "Selection is opaque; can admit instruments (bias amplification) or colliders if the lookback leaks post-exposure data; database- and era-specific; computationally heavier; still requires a strong design to address non-confounding biases.",
        "when_to_prefer": "Large claims/EHR/linked databases where measured confounding is incomplete and pre-exposure codes are rich. Keep an investigator PS as a transparent reference analysis."
      },
      {
        "compared_to": "instrumental-variable (IV) methods",
        "pros_of_this": "Stays within the measured data and needs no valid instrument, which are rare and effectively unverifiable in pharmacoepi.",
        "cons_of_this": "Cannot address confounding with no proxy footprint in the data; IV can, but targets a different (local average) effect under strong untestable assumptions.",
        "when_to_prefer": "When the database is proxy-rich and no credible, exclusion-restriction-satisfying instrument is available."
      },
      {
        "compared_to": "disease-risk score (DRS) or doubly-robust (TMLE/AIPW)",
        "pros_of_this": "A single transparent exposure-side score that regulators readily accept; hdPS-selected covariates can also seed DRS/TMLE pipelines.",
        "cons_of_this": "Models only the exposure side, so it offers no protection against outcome-model misspecification the way doubly-robust methods do.",
        "when_to_prefer": "Transparent comparative-safety/effectiveness work; escalate to doubly-robust when both stages risk misspecification."
      },
      {
        "compared_to": "negative controls or quantitative bias analysis alone",
        "pros_of_this": "hdPS is a primary tool that reduces confounding up front rather than only detecting or bounding it after the fact.",
        "cons_of_this": "Does not quantify the residual bias from confounders lacking proxies; must be paired with negative controls and QBA/E-value.",
        "when_to_prefer": "Always combine — hdPS for reduction, negative controls and QBA for detection and bounding."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Define dimensions explicitly by claim type (facility IP dx/procedures, professional OP dx/procedures, Part D/pharmacy NDC, DME). Extract candidates strictly with service_date in the pre-index window; never include the index date. Exclude MA-only person-time (FFS claims unavailable) and handle HCC-enriched MA coding separately. Report the exact dimensions, k, recurrence cuts, lookback, and software version.",
      "ehr": "Add lab (LOINC + abnormal flag), vitals, problem-list, and NLP-concept dimensions alongside billing codes. Treat encounter-frequency proxies cautiously — strong but potential colliders if care-seeking responds to pre-index disease. Link to claims for complete pharmacy/procedure capture.",
      "linked": "Registry staging, biomarkers, and external labs add powerful proxy dimensions; reconcile order/fill/service date discrepancies and account for linkage selection before fixing the pre-index window. Watch differential competing risks (mortality) by arm in elderly cohorts.",
      "multi-database": "Harmonize vocabularies (e.g., OMOP concept sets) but run candidate generation and selection per database (distributed hdPS); selected proxy sets do not transport across sites or eras. Produce and review balance diagnostics per partner before pooling effects."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\nimport pandas as pd\nfrom sklearn.linear_model import LogisticRegression\n\nLOOKBACK_DAYS = 365\nTOP_N_PREVALENT = 200     # most prevalent codes screened per dimension (Schneeweiss step 2)\nTOP_K_PER_DIM   = 50      # covariates retained per dimension after Bross ranking (step 5)\n\ndef _recurrence_covariates(codes, cohort):\n    # Keep only pre-index codes inside the lookback window.\n    c = codes.merge(cohort[[\"person_id\", \"index_date\"]], on=\"person_id\")\n    c = c[(c[\"code_date\"] < c[\"index_date\"]) &\n          (c[\"code_date\"] >= c[\"index_date\"] - pd.Timedelta(days=LOOKBACK_DAYS))]\n    counts = c.groupby([\"dimension\", \"person_id\", \"code\"]).size().rename(\"n\").reset_index()\n\n    feats = []\n    for dim, dim_df in counts.groupby(\"dimension\"):\n        prev = dim_df.groupby(\"code\")[\"person_id\"].nunique().sort_values(ascending=False)\n        for code in prev.head(TOP_N_PREVALENT).index:\n            sub = dim_df[dim_df[\"code\"] == code]\n            med = sub[\"n\"].median()\n            # once / sporadic(>=median) / frequent(>median): the three hdPS recurrence covariates.\n            for label, mask in ((\"once\", sub[\"n\"] >= 1),\n                                (\"sporadic\", sub[\"n\"] >= max(med, 1)),\n                                (\"frequent\", sub[\"n\"] > med)):\n                ids = set(sub.loc[mask, \"person_id\"])\n                feats.append((f\"{dim}|{code}|{label}\", ids))\n    return feats\n\ndef _bross_bias(flag, exposed, outcome):\n    # Multiplicative confounding-bias multiplier (Bross/Schneeweiss): prevalence of the covariate\n    # in exposed (P_C1) vs unexposed (P_C0) and the covariate-outcome relative risk (RR_CD).\n    p_c1 = flag[exposed == 1].mean()\n    p_c0 = flag[exposed == 0].mean()\n    o1, o0 = outcome[flag == 1].mean(), outcome[flag == 0].mean()\n    rr_cd = (o1 / o0) if o0 > 0 else 1.0\n    bias = ((p_c1 * (rr_cd - 1) + 1) / (p_c0 * (rr_cd - 1) + 1))\n    
return abs(np.log(bias)) if bias > 0 else 0.0\n\ndef fit_hdps(cohort, codes, forced=None):\n    cohort = cohort.set_index(\"person_id\")\n    exposed = cohort[\"exposed\"]; outcome = cohort[\"outcome\"]\n    ranked = []  # (dimension, name, |log bias|, indicator series)\n    for name, ids in _recurrence_covariates(codes, cohort.reset_index()):\n        flag = pd.Series(cohort.index.isin(ids).astype(int), index=cohort.index)\n        if flag.nunique() < 2:\n            continue\n        ranked.append((name.split(\"|\")[0], name,\n                       _bross_bias(flag, exposed, outcome), flag))\n\n    selected = []\n    for dim in {r[0] for r in ranked}:\n        dim_cands = sorted((r for r in ranked if r[0] == dim), key=lambda r: r[2], reverse=True)\n        selected.extend(dim_cands[:TOP_K_PER_DIM])\n\n    X = pd.concat({name: flag for _, name, _, flag in selected}, axis=1)\n    if forced is not None:\n        X = X.join(forced)                      # age, sex, calendar year, indication, severity\n    ps = LogisticRegression(max_iter=1000, C=1.0).fit(X, exposed).predict_proba(X)[:, 1]\n    cohort[\"ps\"] = ps\n    return cohort[[\"exposed\", \"outcome\", \"ps\"]], X",
        "description": "hdPS candidate generation, Bross-style prioritization, and PS estimation from claims-style long tables. Required inputs\n(post data-management, all codes recorded strictly before index_date):\n  cohort : person_id, index_date (datetime), exposed (1=study arm, 0=comparator), outcome (1/0 over follow-up)\n  codes  : person_id, code, dimension in {'ip_dx','op_dx','proc','rx','util'}, code_date (datetime)\nForce-included investigator confounders are passed separately and concatenated before fitting. Returns the per-patient\nselected covariate matrix and the fitted propensity score; downstream matching/weighting fixes the estimand.",
        "dependencies": [
          "pandas",
          "numpy",
          "scikit-learn"
        ],
        "source_citations": [
          "schneeweiss-2009",
          "rassen-2023"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\nLOOKBACK_DAYS  <- 365L\nTOP_N_PREVALENT <- 200L\nTOP_K_PER_DIM   <- 50L\n\nbross_bias <- function(flag, exposed, outcome) {\n  p_c1 <- mean(flag[exposed == 1]); p_c0 <- mean(flag[exposed == 0])\n  o1 <- mean(outcome[flag == 1]);   o0 <- mean(outcome[flag == 0])\n  rr_cd <- if (o0 > 0) o1 / o0 else 1\n  bias  <- (p_c1 * (rr_cd - 1) + 1) / (p_c0 * (rr_cd - 1) + 1)\n  if (bias > 0) abs(log(bias)) else 0\n}\n\nfit_hdps <- function(cohort, codes, forced = NULL) {\n  setDT(cohort); setDT(codes)\n  c <- codes[cohort[, .(person_id, index_date, exposed, outcome)], on = \"person_id\"]\n  c <- c[code_date < index_date & code_date >= index_date - LOOKBACK_DAYS]\n  cnt <- c[, .(n = .N), by = .(dimension, person_id, code, exposed, outcome)]\n\n  # Screen the most prevalent codes per dimension, then build recurrence covariates.\n  feats <- list()\n  for (d in unique(cnt$dimension)) {\n    dd <- cnt[dimension == d]\n    prev <- dd[, .(np = uniqueN(person_id)), by = code][order(-np)][seq_len(min(.N, TOP_N_PREVALENT))]\n    for (cd in prev$code) {\n      sub <- dd[code == cd]; med <- median(sub$n)\n      for (lab in c(\"once\", \"sporadic\", \"frequent\")) {\n        ids <- switch(lab,\n          once     = sub[n >= 1, person_id],\n          sporadic = sub[n >= max(med, 1), person_id],\n          frequent = sub[n > med, person_id])\n        feats[[paste(d, cd, lab, sep = \"|\")]] <- list(dim = d, ids = unique(ids))\n      }\n    }\n  }\n\n  ex <- cohort$exposed; ou <- cohort$outcome; pid <- cohort$person_id\n  ranked <- rbindlist(lapply(names(feats), function(nm) {\n    flag <- as.integer(pid %in% feats[[nm]]$ids)\n    if (length(unique(flag)) < 2) return(NULL)\n    data.table(dim = feats[[nm]]$dim, name = nm,\n               score = bross_bias(flag, ex, ou))\n  }))\n  keep <- ranked[order(-score), head(.SD, TOP_K_PER_DIM), by = dim]$name\n\n  X <- as.data.table(lapply(keep, function(nm) as.integer(pid %in% feats[[nm]]$ids)))\n  setnames(X, 
keep)\n  if (!is.null(forced)) X <- cbind(X, forced)   # age, sex, calendar year, indication, severity\n  fit <- glm(ex ~ ., data = cbind(ex = ex, X), family = binomial())\n  cohort[, ps := predict(fit, type = \"response\")]\n  cohort[, .(person_id, exposed, outcome, ps)]\n}",
        "description": "hdPS in R using data.table for candidate generation and the Bross prioritization, then glm for the PS. Inputs mirror the\nPython version:\n  cohort : person_id, index_date (Date), exposed (0/1), outcome (0/1)\n  codes  : person_id, code, dimension in {'ip_dx','op_dx','proc','rx','util'}, code_date (Date)\nThe pharmacoepi 'autoCovariateSelection' or Sentinel/Aetion implementations wrap this same logic for production; this\nself-contained version exposes the algorithm. Estimand is set later by the matching/weighting step on cohort$ps.",
        "dependencies": [
          "data.table"
        ],
        "source_citations": [
          "schneeweiss-2009",
          "rassen-2023"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "%let lookback = 365;\n%let topk = 50;   /* covariates kept per dimension after Bross ranking */\n\n/* Recurrence covariates from strictly pre-index codes: one binary per code (>=1 occurrence shown). */\nproc sql;\n  create table cand as\n  select c.dimension, c.code, c.person_id, k.exposed, k.outcome\n  from work.codes c\n       inner join work.cohort k on c.person_id = k.person_id\n  where c.code_date < k.index_date\n    and c.code_date >= k.index_date - &lookback\n  group by c.dimension, c.code, c.person_id, k.exposed, k.outcome;\nquit;\n\n/* Bross multiplicative bias per candidate: prevalence in exposed vs unexposed x covariate-outcome RR. */\nproc sql;\n  create table bross as\n  select dimension, code,\n         mean(exposed) as p_any,\n         /* P(covariate) within exposure strata */\n         sum(exposed)/ (select sum(exposed) from work.cohort)            as p_c1,\n         (count(*)-sum(exposed))/\n            ((select count(*)-sum(exposed) from work.cohort))            as p_c0,\n         mean(outcome)                                                   as o_c1\n  from cand\n  group by dimension, code\n  having calculated p_c1 > 0 and calculated p_c0 > 0;\nquit;\n\ndata ranked;\n  set bross;\n  rr_cd = max(o_c1, 0.001) / 0.05;                 /* outcome RR vs cohort baseline risk (supply baseline) */\n  bias  = (p_c1*(rr_cd-1)+1) / (p_c0*(rr_cd-1)+1);\n  logbias = abs(log(max(bias, 1e-6)));\nrun;\n\n/* Top-k per dimension -> selected covariate list. */\nproc rank data=ranked out=rk descending ties=low groups=10000;\n  by dimension; var logbias; ranks rnk;\nrun;\nproc sql;\n  create table selected as select dimension, code from rk where rnk < &topk;\nquit;\n\n/* Pivot selected codes to one indicator column per code, join investigator-forced covariates,\n   then estimate the PS and 1:1 match (ATT). Macro-build the wide design from `selected` upstream. 
*/\nproc logistic data=work.analytic descending;     /* analytic = cohort + selected indicators + work.forced */\n  class exposed (ref='0');\n  model exposed = <selected_indicators> age sex cal_year indication severity;\n  output out=ps_out p=ps;\nrun;\n\nproc psmatch data=ps_out region=allobs;\n  class exposed;\n  psmodel exposed(treated='1');                   /* uses ps from the LOGISTIC step above */\n  match method=greedy(k=1) distance=lps caliper=0.2;\n  assess lps var=(age severity) / plots=(boxplot stddiff);  /* require post-match SMD < 0.1 */\n  output out(obs=match)=matched matchid=mid;\nrun;",
        "description": "hdPS in SAS. SAS has no native hdPS PROC, so candidate generation and the Bross ranking are done in PROC SQL, the PS is\nfit with PROC LOGISTIC, and balancing/matching uses PROC PSMATCH (SAS/STAT 14.2+). Required inputs (post data-management):\n  work.cohort : person_id, index_date, exposed (1/0), outcome (1/0)\n  work.codes  : person_id, code, dimension ('ip_dx'/'op_dx'/'proc'/'rx'/'util'), code_date\n  work.forced : person_id + investigator confounders (age sex cal_year indication severity)\n1:1 logit-PS matching below targets the ATT; switch to IPTW/overlap weights if the ATE/ATO is the pre-specified estimand.",
        "dependencies": [],
        "source_citations": [
          "schneeweiss-2009",
          "rassen-2023"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  U[Unmeasured confounder<br/>severity / frailty / access] -->|causes| E[Exposure: drug A vs B]\n  U -->|causes| Y[Outcome]\n  U -.->|leaves a footprint in| P[Pre-index code proxies<br/>dx / proc / rx / utilization]\n  P -->|hdPS adjusts on these| Adj((blocks back-door<br/>U -> E ... U -> Y))\n  I[Instrument<br/>formulary / plan marker] -->|predicts only| E\n  I -.->|if wrongly selected| Amp[Bias amplification + variance inflation]\n  C[Post-index code<br/>mediator / collider] -.->|if lookback leaks| Over[Collider / over-adjustment bias]",
        "caption": "Why hdPS works and how it fails. Measured pre-index codes are proxies for the unmeasured confounder U, so adjusting on them partially blocks the U->E and U->Y back-door path. Selecting instruments (predict exposure only) amplifies residual bias; leaking post-index mediators/colliders into the lookback induces over-adjustment. Both are the dominant hdPS failure modes.",
        "alt_text": "DAG showing an unmeasured confounder causing exposure and outcome, pre-index codes as proxies that hdPS adjusts on, an instrument that amplifies bias if selected, and a post-index collider that causes over-adjustment if the lookback leaks.",
        "source_type": "illustrative",
        "source_citations": [
          "schneeweiss-2009"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Start[Comparative-safety question in a large healthcare database] --> Q1{Rich pre-index codes<br/>as confounder proxies?}\n  Q1 -->|No| Inv[\"Use investigator-specified PS / DRS\"]\n  Q1 -->|Yes| Q2{\"Enough exposed and events?<br/>(>~50 events/arm)\"}\n  Q2 -->|No| Small[\"hdPS unstable -> parsimonious PS or exact matching<br/>Rassen 2011\"]\n  Q2 -->|Yes| Q3{Candidate window strictly pre-index?}\n  Q3 -->|No| Fix[\"Re-extract: hard-stop before time zero<br/>avoid mediators / colliders\"]\n  Q3 -->|Yes| Q4{Screen out near-instruments?}\n  Q4 -->|No| Drop[Remove exposure-only predictors<br/>avoid bias amplification]\n  Q4 -->|Yes| Run[\"Run dimension-balanced hdPS<br/>force key confounders -> PS\"]\n  Run --> Use{Estimand?}\n  Use -->|ATT| M[1:1 logit-PS matching]\n  Use -->|ATE| W[IPTW]\n  Use -->|ATO| O[Overlap weights]\n  M --> Diag[\"Balance SMD<0.1 + negative controls + E-value\"]\n  W --> Diag\n  O --> Diag",
        "caption": "Decision logic for applying hdPS. Gate on proxy richness, sample size, a strictly pre-index candidate window, and instrument screening before running; the chosen weighting/matching step fixes the estimand (ATT/ATE/ATO), and residual bias is checked with balance, negative controls, and E-values.",
        "alt_text": "Decision tree from a comparative-safety question through checks on proxy richness, sample size, pre-index window, and instrument screening to running dimension-balanced hdPS and selecting an estimand via matching or weighting.",
        "source_type": "illustrative",
        "source_citations": [
          "schneeweiss-2009",
          "rassen-2011"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "used_with",
        "target_slug": "active-comparator-new-user",
        "notes": "hdPS is the standard confounding-control layer applied after ACNU cohort construction, balancing measured and proxy confounders on the strictly pre-index covariate window."
      },
      {
        "relation_type": "used_with",
        "target_slug": "propensity-score-methods-psm-iptw",
        "notes": "hdPS constructs the score; PS matching/IPTW/overlap weighting then uses it, and that choice fixes the estimand (ATT/ATE/ATO)."
      },
      {
        "relation_type": "used_with",
        "target_slug": "target-trial-emulation",
        "notes": "hdPS is the usual analytic adjustment engine inside head-to-head target-trial emulations in administrative data."
      },
      {
        "relation_type": "alternative_to",
        "target_slug": "instrumental-variables-pharmacoepi-rwe",
        "notes": "IV can address confounding with no measured proxy but needs a valid instrument and targets a local average effect; hdPS stays within measured proxies and needs no instrument."
      },
      {
        "relation_type": "used_with",
        "target_slug": "empirical-calibration-negative-controls-rwe",
        "notes": "Negative-control outcomes/exposures and empirical calibration detect and correct the residual bias hdPS cannot remove (confounders with no proxy footprint)."
      },
      {
        "relation_type": "used_with",
        "target_slug": "e-value-sensitivity-analysis",
        "notes": "After hdPS, the E-value quantifies how strong unmeasured confounding lacking any data proxy would have to be to nullify the result."
      },
      {
        "relation_type": "see_also",
        "target_slug": "missing-data-trimming-winsorization-rwe",
        "notes": "High-dimensional selection produces extreme propensity scores; trimming/Winsorizing non-overlap is essential before the outcome model."
      },
      {
        "relation_type": "see_also",
        "target_slug": "medicare-ffs-ma-commercial-claims-differences-rwe",
        "notes": "Candidate prevalences and selected proxies differ by payer (FFS completeness vs MA HCC-enriched coding vs commercial variability); exclude MA-only person-time and tune dimensions per payer."
      },
      {
        "relation_type": "see_also",
        "target_slug": "diagnosis-phenotype-algorithm-1ip-2op-time-window-rwe",
        "notes": "Diagnosis phenotypes generate the diagnosis-dimension candidate covariates; their coding granularity and payer sensitivity feed hdPS."
      },
      {
        "relation_type": "see_also",
        "target_slug": "fit-for-purpose-data-assessment-rwe",
        "notes": "hdPS depends on a database with rich, granular, complete pre-exposure coding; fit-for-purpose assessment should confirm that depth before relying on hdPS."
      },
      {
        "relation_type": "see_also",
        "target_slug": "unmeasured-confounding-probabilistic-bias-analysis-rwe",
        "notes": "Probabilistic bias analysis bounds the residual confounding from factors with no proxy that hdPS cannot adjust for."
      }
    ],
    "aliases": [
      "hdPS",
      "high-dimensional propensity score",
      "empirical propensity score",
      "automated covariate selection propensity score",
      "proxy-variable confounding adjustment"
    ],
    "complexity": "advanced",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "hospitalization-transfer-collapse-rwe",
    "name": "Hospitalization and Transfer Collapse",
    "short_definition": "An outcome-construction rule that stitches contiguous inpatient, observation, transfer, and facility claims into a single coherent hospitalization episode, so that an inter-hospital transfer or an observation-to-inpatient conversion is counted as one admission rather than several.",
    "long_description": "**Hospitalization and transfer collapse** is the operational rule set that turns raw facility claims into analyzable\nhospitalization *episodes*. A single clinical admission rarely appears as a single record: a patient may arrive through\nthe emergency department, be placed in observation (an outpatient status), convert to inpatient at the second midnight,\nbe transferred to a tertiary center for a procedure, and discharged from a third provider — generating three or four\nseparate facility claims under three different provider IDs. Without an explicit collapse rule, a naive count treats\neach claim as a distinct hospitalization, inflating event counts, shortening apparent length of stay (LOS), corrupting\nreadmission denominators, and mis-timing the \"first hospitalization\" used as time zero or as the outcome event. The\nrule must be written into the protocol and statistical analysis plan *before* programming, because every downstream\nquantity — event rate, time-to-first-event, 30-day readmission, per-episode cost — inherits its boundaries.\n\n**Core conceptual distinction**. The unit of analysis is the *episode*, not the *claim*. Two decisions define the\nepisode boundary and must be pre-specified and separable. (1) *The same-stay merge (gap rule)*: two facility claims\nbelong to the same episode when the second admit date falls within a small gap of the prior discharge date. The\ncanonical choice is a 0-to-1-day gap (admit on the day of, or the day after, discharge), which captures bed-to-bed\ninter-hospital transfers and observation-to-inpatient conversions; the gap is the single most consequential tunable\nparameter and must be reported and varied in sensitivity analyses. 
(2) *What counts as a transfer*: a claim with\ndischarge status `02` (transferred to another short-term acute hospital) followed by an admission elsewhere is a\ntransfer to be collapsed *even though the provider ID differs* — collapsing only within the same provider would split\nevery transferred patient. The estimand consequence is sharp: if the outcome is \"hospitalization,\" the collapsed\nepisode is the event; if the outcome is \"30-day readmission,\" the *index episode* must first be collapsed so that the\ntransfer leg is not itself miscounted as the readmission. Whether acute care is extended to post-acute care (SNF/IRF/\nLTCH) is a *different* definition — standard acute-hospitalization outcomes stop at acute discharge, whereas bundled-\npayment (e.g., BPCI) episodes deliberately include the post-acute tail. State which you are using.\n\n**Pros, cons, and trade-offs**.\n- **vs counting each facility claim as one hospitalization (no collapse):** Collapsing prevents double-counting of\n  transfers and observation conversions, stabilizes LOS, and gives correct readmission denominators. Cost: it requires\n  bill-type, revenue-code, and discharge-status logic plus de-duplication of facility vs professional claims, and the\n  gap threshold is a judgment that must be defended. **Prefer collapse** for any consequential utilization, safety,\n  cost, or regulatory-grade analysis; raw claim counts are defensible only for the crudest descriptive cut.\n- **vs admission-date-only deduplication (drop exact duplicate admit dates):** Date-only dedup removes literal\n  duplicates but leaves transfers (different admit dates) split and silently keeps professional claims that share the\n  facility admit date. 
**Prefer the full gap + transfer + bill-type rule** whenever transfers or observation stays are\n  non-trivial in the population (true in nearly all elderly, oncology, and tertiary-referral cohorts).\n- **vs a generous gap (e.g., 7 or 30 days) to define \"continuous care\":** A wide gap merges genuinely distinct\n  admissions and *erases readmissions you intended to measure* — actively misleading for any readmission endpoint.\n  **Prefer a 0-1 day gap** for the same-stay episode and handle longer-horizon constructs (e.g., 30-day readmission)\n  as a separate, explicitly named layer on top of collapsed episodes.\n\n**When to use**. Any analysis in which hospitalization is an outcome, a censoring event, time zero, or a cost driver in\nclaims, EHR encounter, or linked data — comparative effectiveness/safety with a hospitalization endpoint, readmission\nmeasurement, healthcare resource utilization, and episode-of-care costing. It is essential whenever inter-hospital\ntransfers or observation stays are plausible (Medicare and tertiary-referral populations) because those are precisely\nthe records a naive count fractures.\n\n**When NOT to use — and when it is actively misleading or dangerous**.\n- **When the policy/clinical question is at the claim or facility level**, not the episode level (e.g., hospital-level\n  transfer rates, facility billing audits). Collapsing destroys the unit you care about.\n- **A gap threshold wide enough to swallow true readmissions.** Using a 30-day \"continuous care\" merge while the\n  endpoint *is* 30-day readmission is self-defeating — you collapse the very events you are counting. This is the most\n  dangerous misuse.\n- **Collapsing across MA-only person-time.** In Medicare Advantage, institutional encounter records are incomplete or\n  absent, so the absence of an intervening claim is *missingness*, not a true gap; collapsing then fuses unrelated\n  stays or fabricates one long episode. 
Restrict episode construction to fee-for-service Part A person-time.\n- **Unconditional acute→post-acute merging for an acute-hospitalization endpoint.** Folding a SNF transfer into the\n  acute episode inflates LOS and cost and mis-defines the outcome unless the protocol's estimand is explicitly the\n  bundled episode.\n\n**Data-source operational depth**.\n- **Claims (Medicare FFS / commercial):** The facility (institutional) claim is the unit; key on it and discard the\n  professional (Part B / carrier) claims that mirror the same admission, or you will count the admission twice.\n  Inpatient acute claims carry bill type `011x`; observation appears on a `013x` outpatient claim via revenue center\n  `0762` or HCPCS `G0378`/`G0379`; emergency department care via revenue center `0450`/`0451`. Use `admit_date`,\n  `discharge_date`, `provider_id`, and `discharge_status` to apply the gap and transfer logic. *Failure modes:*\n  Medicare Advantage drops institutional FFS claims, so MA-only person-time yields false-negative hospitalizations and\n  spurious gaps — restrict to enrollees with FFS Part A across the relevant window. Claims adjudication lag and\n  reversals/replacements mean the same stay can appear as an original plus a void/replacement pair; de-duplicate on\n  the latest accepted version before collapsing. Interim (`0112`/`0113`/`0114`) bills for long stays split one\n  admission into multiple records that must be merged on overlapping/contiguous dates regardless of gap. Differential\n  competing risks matter: in elderly comparative cohorts, an exposure that raises mortality lowers the *opportunity*\n  for a later transfer leg, so collapse choices interact with how death is handled.\n- **EHR (encounter tables):** Episodes are built from `ADT` (admission-discharge-transfer) events, not bills. 
A\n  bed-movement transfer *within* a system is an intra-encounter event (already one visit), whereas a transfer to an\n  external facility leaves the system and is captured only if linked claims are available — external care leakage\n  causes truncated episodes. Observation status lives in a flag or order, not a bill type; site workflow variation in\n  how observation is recorded is a major source of misclassification.\n- **Registry:** Usually adjudicates the clinical admission well but lacks the full facility-claim trail; link to\n  claims to recover transfers and observation legs, and report the linkage-eligible denominator.\n- **Linked claims–EHR:** The ideal substrate (encounter detail + claim completeness), but admit/discharge timestamps\n  disagree across sources by hours-to-days; reconcile to a single canonical clock before applying a 0-1-day gap, or\n  the discrepancy alone will fabricate or destroy merges.\n\n**Worked claims example.** Goal: count hospitalizations and 30-day all-cause readmissions after a drug index date in\nMedicare FFS. One patient generates four institutional claims: (A) `013x` observation, rev `0762`, 2024-03-01 to\n2024-03-02 at provider 100; (B) `011x` inpatient, 2024-03-02 to 2024-03-05 at provider 100, `discharge_status=02`\n(transferred); (C) `011x` inpatient, 2024-03-05 to 2024-03-10 at provider 200; (D) `011x` inpatient, 2024-03-18 to\n2024-03-21 at provider 200. Step 1 — restrict to FFS Part A person-time covering 2024-03; drop any MA span. Step 2 —\nkeep facility claims only; discard the carrier claims that echo B-D. Step 3 — sort by `admit_date` and compute the gap\nto the prior `discharge_date`: A→B gap = 0 (obs converts to inpatient same day), B→C gap = 0 (acute transfer,\ndifferent provider, discharge_status 02), C→D gap = 8 days. 
With a 1-day same-stay rule, A+B+C collapse into **one\nepisode** (start 2024-03-01, end 2024-03-10, LOS 9 days, transfer_count = 1), and D is a **second, distinct episode**.\nStep 4: the outcome \"first hospitalization\" = the collapsed episode starting 2024-03-01; the readmission clock starts\nat its discharge (2024-03-10), so D (admit 2024-03-18, 8 days later) counts as a **30-day readmission**. A naive\nper-claim count would have reported four hospitalizations, a 5-day maximum LOS, and would have miscounted the transfer\nleg C as the \"readmission\": three errors the collapse rule prevents.\n\n**Interpreting the output**. After applying the transfer-collapse rule to the four facility claims (observation\nMar 1–2, inpatient at provider 100 Mar 2–5, transfer to provider 200 Mar 5–10, and a second admission Mar 18–21),\nthe algorithm produces two collapsed episodes. Episode 1 runs from March 1 through March 10 (9-day LOS, one\ntransfer), capturing the continuous care arc. Episode 2 begins March 18, eight days after Episode 1 discharge,\nand qualifies as a 30-day readmission.\n\nFormal interpretation: collapse rules create one episode from multiple claims by applying two criteria, a\nsame-day observation-to-inpatient conversion rule and a 1-day transfer-gap rule. Without collapse, the algorithm\nwould attribute four separate hospitalizations to this patient, start the readmission clock at the discharge of an\nintermediate leg (for example 2024-03-05) rather than at the final discharge of the collapsed episode (2024-03-10),\nand inflate the count of index stays in readmission denominators. Failure to collapse transfers is not conservative:\neach transfer leg is counted both as an extra index stay and as an apparent same-day readmission, so the numerator\nand the denominator are both inflated and the direction of the bias in the readmission rate cannot be assumed.\n\nPractical interpretation: always document whether the collapse rule was applied before computing any hospitalization\ncount, LOS, or readmission outcome. CMS readmission measures treat a transfer as part of a single index stay, so\nanalyses meant to align with them should collapse transfers; where the protocol specifies collapse, omitting it is a\nprotocol deviation. Report the number of multi-claim episodes and the\ntransfer count per episode as a data-quality check; a high transfer fraction signals coding patterns that may\ndiffer across sites or payers and should be explored in sensitivity analyses.",
    "primary_category": "Outcome_Measure",
    "tags": [
      "outcome_measure",
      "hospitalization-episode",
      "transfer-collapse",
      "observation-stay",
      "readmission",
      "length-of-stay",
      "claims-construction",
      "bill-type"
    ],
    "applies_to_study_types": [
      "claims_analysis",
      "ehr_study",
      "registry_linkage",
      "comparative_effectiveness",
      "target_trial_emulation"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1056/NEJMsa0803563",
        "url": "https://doi.org/10.1056/NEJMsa0803563",
        "citation_text": "Jencks SF, Williams MV, Coleman EA. Rehospitalizations among patients in the Medicare fee-for-service program. New England Journal of Medicine. 2009;360(14):1418-1428.",
        "year": 2009,
        "authors_short": "Jencks et al.",
        "notes": "Foundational Medicare claims analysis that defines hospitalization and rehospitalization episodes from facility claims, treating inter-hospital transfers as part of the index stay rather than separate admissions."
      },
      {
        "role": "explain",
        "doi": "10.12788/jhm.3038",
        "url": "https://doi.org/10.12788/jhm.3038",
        "citation_text": "Sheehy AM, Shi F, Kind AJH. Identifying observation stays in Medicare data: policy implications of a definition. Journal of Hospital Medicine. 2019;14(2):96-100.",
        "year": 2019,
        "authors_short": "Sheehy et al.",
        "notes": "Shows how revenue-code and bill-type choices determine whether an observation stay is captured and how it relates to a same-stay inpatient conversion - the core operational ambiguity in the obs-to-inpatient collapse."
      },
      {
        "role": "demonstrate",
        "doi": "10.1371/journal.pmed.1001885",
        "url": "https://doi.org/10.1371/journal.pmed.1001885",
        "citation_text": "Benchimol EI, Smeeth L, Guttmann A, et al. The REporting of studies Conducted using Observational Routinely-collected health Data (RECORD) Statement. PLoS Medicine. 2015;12(10):e1001885.",
        "year": 2015,
        "authors_short": "Benchimol et al.",
        "notes": "Reporting standard requiring transparent code lists and algorithm definitions; mandates that episode-construction rules (bill types, gap thresholds, transfer handling) be specified and validated."
      }
    ],
    "plain_language_summary": "When a patient moves between facilities during one illness — for example, from an emergency department to Hospital A and then to Hospital B — insurance records create a separate claim for each stop. Transfer collapse is the rule that recognizes all those claims as one continuous stay and stitches them into a single count. Without this step, a researcher counting hospitalizations would report three admissions when there was really only one, which would inflate event rates and distort any readmission clock that is supposed to start at the end of the stay.",
    "key_terms": [
      {
        "term": "facility claim",
        "definition": "The insurance bill submitted by a hospital or facility for a patient's stay, as opposed to a physician's separate professional bill for services rendered during the same stay."
      },
      {
        "term": "episode",
        "definition": "A single, continuous period of hospital care that may span multiple facilities and is treated as one event in the analysis."
      },
      {
        "term": "discharge status",
        "definition": "A code on a hospital bill that says where the patient went when they left — for example, code 02 means the patient was transferred to another acute-care hospital."
      },
      {
        "term": "gap rule",
        "definition": "The decision that two consecutive facility bills belong to the same stay when the second admission date falls within 0 or 1 calendar day of the previous discharge date."
      },
      {
        "term": "readmission",
        "definition": "A new, separate hospital admission that occurs after a prior stay has fully ended — typically measured within 30 days of the prior discharge."
      }
    ],
    "worked_example": {
      "scenario": "A Medicare patient is admitted on March 1, 2024 through the emergency department at Hospital A and placed under observation status overnight. On March 2 she converts to full inpatient status at Hospital A. On March 5 she is transferred — same illness — to Hospital B, a tertiary center with a specialist. Hospital B discharges her on March 10. Then, eight days later on March 18, she is admitted again to Hospital B for a new complication and stays until March 21. The insurance database shows four separate facility claims. The research question is: how many hospitalizations did this patient have, and did she have a 30-day readmission?",
      "dataset": {
        "caption": "Raw facility claims for one patient as they appear in a Medicare database — four rows, four different bill dates.",
        "columns": [
          "person_id",
          "admit_date",
          "discharge_date",
          "provider_id",
          "bill_type",
          "discharge_status"
        ],
        "rows": [
          [
            "7001",
            "2024-03-01",
            "2024-03-02",
            "ProvA",
            "013x",
            "01"
          ],
          [
            "7001",
            "2024-03-02",
            "2024-03-05",
            "ProvA",
            "011x",
            "02"
          ],
          [
            "7001",
            "2024-03-05",
            "2024-03-10",
            "ProvB",
            "011x",
            "01"
          ],
          [
            "7001",
            "2024-03-18",
            "2024-03-21",
            "ProvB",
            "011x",
            "01"
          ]
        ]
      },
      "steps": [
        "Sort the four claims by admit date: Mar 1, Mar 2, Mar 5, Mar 18.",
        "Check claim 1 to claim 2: Claim 1 discharged Mar 2; Claim 2 admitted Mar 2 — gap is 0 days, within the 0-to-1-day rule, so merge them into the same stay.",
        "Check claim 2 to claim 3: Claim 2 discharged Mar 5; Claim 3 admitted Mar 5 — gap is 0 days AND discharge_status on Claim 2 is 02 (transferred), so merge Claim 3 into the same stay even though it is at a different hospital.",
        "The merged stay spans Mar 1 through Mar 10 — one episode, length of stay 9 days, one transfer.",
        "Check claim 3 to claim 4: Claim 3 (last leg of episode 1) discharged Mar 10; Claim 4 admitted Mar 18 — gap is 8 days, well beyond the 1-day rule, so Claim 4 starts a new, separate episode.",
        "Episode 2 spans Mar 18 through Mar 21 — 3 days, at Hospital B.",
        "The readmission clock starts at the discharge of Episode 1 (Mar 10). Episode 2 is admitted Mar 18, which is 8 days after discharge — inside the 30-day window, so it counts as a 30-day readmission.",
        "Final count: 2 episodes (not 4 claims), 1 readmission. A raw claim count would have reported 4 hospitalizations and would have incorrectly called the transfer leg (Claim 3) the readmission."
      ],
      "result": "2 collapsed episodes from 4 raw claims. Episode 1: Mar 1–Mar 10, LOS 9 days, 1 inter-hospital transfer. Episode 2: Mar 18–Mar 21, LOS 3 days. Episode 2 is a 30-day readmission (8 days after Episode 1 discharge). Raw claim count would have over-reported by 2 admissions.",
      "timeline_spec": {
        "title": "Transfer collapse: 4 facility claims become 2 episodes for one Medicare patient",
        "window": {
          "start": "2024-03-01",
          "end": "2024-03-21",
          "label": "Observation window: Mar 1 to Mar 21 (21 days shown)"
        },
        "events": [
          {
            "label": "Claim A: Observation stay, ProvA (bill 013x)",
            "start": "2024-03-01",
            "length_days": 1,
            "quantity": "1-day observation"
          },
          {
            "label": "Claim B: Inpatient, ProvA (discharge_status 02 = transferred)",
            "start": "2024-03-02",
            "length_days": 3,
            "quantity": "3-day inpatient"
          },
          {
            "label": "Claim C: Inpatient, ProvB (transfer-in leg)",
            "start": "2024-03-05",
            "length_days": 5,
            "quantity": "5-day inpatient"
          },
          {
            "label": "Claim D: Inpatient, ProvB (new admission, 8-day gap)",
            "start": "2024-03-18",
            "length_days": 3,
            "quantity": "3-day inpatient"
          }
        ],
        "spans": [
          {
            "kind": "covered",
            "start": "2024-03-01",
            "end": "2024-03-10",
            "label": "Episode 1 (collapsed): 9 days, Claims A+B+C"
          },
          {
            "kind": "gap",
            "start": "2024-03-10",
            "end": "2024-03-18",
            "label": "8-day gap between episodes (true readmission window)"
          },
          {
            "kind": "covered",
            "start": "2024-03-18",
            "end": "2024-03-21",
            "label": "Episode 2 (new admission): 3 days, Claim D"
          }
        ],
        "result": {
          "label": "4 raw claims collapsed to 2 episodes; Episode 2 is a 30-day readmission (8 days post-discharge)",
          "value": 2
        },
        "caption": "Each horizontal bar is one facility claim. Claims A, B, and C are stitched together into Episode 1 by the 0-day gap and the transfer discharge status on Claim B. Claim D is separated by an 8-day gap and becomes Episode 2, which is also a 30-day readmission. A naive count of bars would report 4 hospitalizations.",
        "alt_text": "Timeline showing four facility claim bars. The first three bars (March 1, March 2, and March 5) are visually grouped under a single Episode 1 span covering March 1 to March 10. An 8-day gap follows. The fourth bar (March 18 to March 21) stands alone as Episode 2. A readmission label marks the gap between the two episodes."
      }
    },
    "prerequisites": [
      "outcome-algorithm-construction-rwe",
      "continuous-enrollment-observable-time-rwe",
      "claims-analysis"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Same-stay gap collapse (0-1 day rule)",
        "description": "Merge two facility claims into one episode when the later admit_date is within 0-1 calendar days of the prior discharge_date, capturing observation-to-inpatient conversions and bed-to-bed inter-hospital transfers.",
        "edge_cases": [
          "A 0-day gap (same-day discharge and re-admit) can be a true transfer or an unrelated same-day return; split on a different major diagnostic category if the protocol requires it.",
          "Interim bills for very long stays produce overlapping/contiguous date ranges that must merge regardless of the gap."
        ],
        "data_source_notes": "claims: gap computed on facility admit_date/discharge_date after de-duplicating reversal/replacement claims; widening the gap beyond 1 day risks erasing genuine readmissions."
      },
      {
        "name": "Transfer-aware collapse across providers",
        "description": "Collapse across different provider IDs when discharge_status=02 (transferred to another acute hospital) links the legs, so an inter-hospital transfer counts as one episode rather than two admissions.",
        "edge_cases": [
          "Discharge_status is sometimes miscoded; corroborate with the 0-1-day gap before merging across providers.",
          "Transfers to post-acute settings (SNF/IRF/LTCH) are excluded from an acute-hospitalization episode but included in a bundled (BPCI-style) episode - the variant chosen changes LOS and cost."
        ],
        "data_source_notes": "claims: union inpatient (011x) with observation (013x, rev 0762 / G0378-G0379); EHR: external transfers are visible only via linked claims, intra-system transfers are already one encounter."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Counting each facility claim as one hospitalization (no collapse)",
        "pros_of_this": "Prevents double-counting of transfers and observation conversions; yields correct length-of-stay, event counts, and readmission denominators.",
        "cons_of_this": "Requires bill-type, revenue-code, and discharge-status logic plus facility-vs-professional de-duplication; the gap threshold is a defensible judgment that must be reported and varied.",
        "when_to_prefer": "Any consequential utilization, safety, cost, or regulatory-grade analysis in populations with non-trivial transfers or observation stays."
      },
      {
        "compared_to": "Admission-date-only deduplication",
        "pros_of_this": "Captures transfers (different admit dates) and observation conversions that date-only dedup leaves split, and forces removal of professional claims sharing the facility admit date.",
        "cons_of_this": "More logic and more code-list maintenance than a one-line dedup.",
        "when_to_prefer": "Whenever inter-hospital transfers or observation stays occur, which is nearly all elderly, oncology, and tertiary-referral cohorts."
      },
      {
        "compared_to": "Wide-gap continuous-care merge (7-30 days)",
        "pros_of_this": "Captures clinically linked re-stays for an episode-of-care costing question where that is the estimand.",
        "cons_of_this": "Merges genuinely distinct admissions and erases readmissions - actively misleading for any readmission endpoint.",
        "when_to_prefer": "Only for explicit episode-of-care/bundled-cost questions; never when readmission is the outcome."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Key on the facility (institutional) claim and discard mirror professional claims. Inpatient = bill type 011x; observation = 013x with rev 0762 or HCPCS G0378/G0379; ED = rev 0450/0451. De-duplicate reversal/replacement claims to the latest accepted version, restrict to FFS Part A person-time (exclude MA-only spans where institutional claims are missing), then collapse on a 0-1-day gap plus discharge_status=02 transfers.",
      "ehr": "Build episodes from ADT events, not bills. Intra-system bed-to-bed transfers are one encounter; external transfers are captured only with linked claims (external-care leakage truncates episodes). Observation status is a flag/order, not a bill type, and is subject to site workflow variation.",
      "registry": "Adjudicates the clinical admission but lacks the facility-claim trail; link to claims to recover transfer and observation legs and report the linkage-eligible denominator.",
      "linked": "Reconcile admit/discharge timestamps across claims and EHR to a single canonical clock before applying the gap rule; sub-day discrepancies alone can fabricate or destroy merges."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\nimport numpy as np\n\nGAP_DAYS = 1  # 0-1 day same-stay merge: captures obs->inpatient and bed-to-bed transfers\n\ndef collapse_episodes(fac: pd.DataFrame) -> pd.DataFrame:\n    f = fac.copy()\n    f[\"is_inpatient\"] = f[\"bill_type\"].str.startswith(\"011\")\n    f[\"is_observation\"] = f[\"bill_type\"].str.startswith(\"013\")\n    # Keep only acute inpatient + observation facility claims for an acute-hospitalization endpoint.\n    f = f[f[\"is_inpatient\"] | f[\"is_observation\"]].sort_values([\"person_id\", \"admit_date\", \"discharge_date\"])\n\n    # Running maximum prior discharge within person guards against nested/overlapping interim bills.\n    f[\"prev_discharge\"] = (f.groupby(\"person_id\")[\"discharge_date\"]\n                            .transform(lambda s: s.cummax().shift()))\n    gap = (f[\"admit_date\"] - f[\"prev_discharge\"]).dt.days\n    prior_was_transfer = (f.groupby(\"person_id\")[\"discharge_status\"].shift() == \"02\")\n\n    # New episode starts when the gap exceeds GAP_DAYS AND the prior leg was not an acute transfer.\n    new_episode = (f[\"prev_discharge\"].isna()) | ((gap > GAP_DAYS) & (~prior_was_transfer))\n    f[\"episode_id\"] = new_episode.groupby(f[\"person_id\"]).cumsum()\n\n    ep = (f.groupby([\"person_id\", \"episode_id\"])\n            .agg(episode_start=(\"admit_date\", \"min\"),\n                 episode_end=(\"discharge_date\", \"max\"),\n                 n_legs=(\"admit_date\", \"size\"),\n                 n_providers=(\"provider_id\", \"nunique\"),\n                 opened_in_observation=(\"is_observation\", \"first\"))\n            .reset_index())\n    ep[\"los_days\"] = (ep[\"episode_end\"] - ep[\"episode_start\"]).dt.days\n    ep[\"transfer_count\"] = (ep[\"n_providers\"] - 1).clip(lower=0)\n    return ep",
        "description": "Collapse facility claims into hospitalization episodes. Required input (already de-duplicated to the latest accepted\nversion, professional claims removed, and restricted to FFS Part A person-time):\n  fac : facility claims -> person_id, admit_date (datetime), discharge_date (datetime), provider_id,\n        bill_type (e.g. '0111','0131'), discharge_status (str, '02' = acute transfer)\nReturns one row per episode with start/end, LOS, transfer count, and whether an observation stay opened it. Apply the\nsame GAP_DAYS in sensitivity analyses; build readmission as a separate layer on the returned episodes.",
        "dependencies": [
          "pandas",
          "numpy"
        ],
        "source_citations": [
          "jencks-2009"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\nGAP_DAYS <- 1L  # 0-1 day same-stay merge\n\ncollapse_episodes <- function(fac) {\n  f <- as.data.table(fac)\n  f[, is_inpatient   := startsWith(bill_type, \"011\")]\n  f[, is_observation := startsWith(bill_type, \"013\")]\n  f <- f[is_inpatient | is_observation]\n  setorder(f, person_id, admit_date, discharge_date)\n\n  # Running max prior discharge handles overlapping interim bills; shift gives the previous leg's end.\n  f[, prev_discharge := shift(cummax(as.integer(discharge_date))), by = person_id]\n  f[, gap := as.integer(admit_date) - prev_discharge]\n  f[, prior_transfer := shift(discharge_status) == \"02\", by = person_id]\n\n  f[, new_episode := is.na(prev_discharge) | (gap > GAP_DAYS & !(prior_transfer %in% TRUE))]\n  f[, episode_id := cumsum(new_episode), by = person_id]\n\n  ep <- f[, .(episode_start = min(admit_date),\n              episode_end   = max(discharge_date),\n              n_legs        = .N,\n              n_providers   = uniqueN(provider_id),\n              opened_in_observation = first(is_observation)),\n          by = .(person_id, episode_id)]\n  ep[, los_days := as.integer(episode_end - episode_start)]\n  ep[, transfer_count := pmax(n_providers - 1L, 0L)]\n  ep[]\n}",
        "description": "Collapse facility claims into hospitalization episodes (data.table). Input mirrors the Python version, already\nde-duplicated, professional claims removed, FFS Part A only:\n  fac : person_id, admit_date (Date), discharge_date (Date), provider_id, bill_type (chr), discharge_status (chr)",
        "dependencies": [
          "data.table"
        ],
        "source_citations": [
          "jencks-2009"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "%let gap = 1;  /* 0-1 day same-stay merge */\n\n/* Keep acute inpatient + observation facility claims only. */\nproc sql;\n  create table fac2 as\n  select person_id, admit_date, discharge_date, provider_id, bill_type, discharge_status,\n         (substr(bill_type,1,3)='011') as is_inpatient,\n         (substr(bill_type,1,3)='013') as is_observation\n  from work.fac\n  where substr(bill_type,1,3) in ('011','013')\n  order by person_id, admit_date, discharge_date;\nquit;\n\n/* Assign episode_id: new episode when gap to running prior discharge > &gap and prior leg was not a transfer. */\ndata episodes_legs;\n  set fac2;\n  by person_id;\n  retain run_discharge episode_id prev_status;\n  lag_discharge = run_discharge;\n  lag_status    = prev_status;\n  if first.person_id then do;\n    episode_id = 1; run_discharge = .; lag_discharge = .;\n  end;\n  else do;\n    gap = admit_date - lag_discharge;\n    if gap > &gap and lag_status ne '02' then episode_id + 1;  /* keep merging acute transfers */\n  end;\n  run_discharge = max(run_discharge, discharge_date);  /* running max guards overlapping interim bills */\n  prev_status   = discharge_status;\nrun;\n\n/* Roll up legs to one row per episode. */\nproc sql;\n  create table episodes as\n  select person_id, episode_id,\n         min(admit_date)     as episode_start format=date9.,\n         max(discharge_date) as episode_end   format=date9.,\n         max(discharge_date) - min(admit_date) as los_days,\n         count(*)                          as n_legs,\n         count(distinct provider_id)       as n_providers,\n         max(is_observation)               as opened_in_observation\n  from episodes_legs\n  group by person_id, episode_id;\nquit;",
        "description": "Collapse facility claims into hospitalization episodes in SAS. Required input dataset (post data-management:\nde-duplicated to latest accepted version, professional claims removed, restricted to FFS Part A person-time):\n  work.fac : person_id, admit_date, discharge_date (SAS dates), provider_id, bill_type, discharge_status\nPROC SQL keeps acute inpatient (011x) and observation (013x) facility claims; the DATA step assigns episode_id with a\nRETAINed running discharge and LAG of the transfer flag, then PROC SQL/SUMMARY rolls up to one row per episode.",
        "dependencies": [],
        "source_citations": [
          "jencks-2009"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": "hospitalization-transfer-collapse-rwe-timeline.svg",
        "mermaid": null,
        "caption": "Each horizontal bar is one facility claim. Claims A, B, and C are stitched together into Episode 1 by the 0-day gap and the transfer discharge status on Claim B. Claim D is separated by an 8-day gap and becomes Episode 2, which is also a 30-day readmission. A naive count of bars would report 4 hospitalizations.",
        "alt_text": "Timeline showing four facility claim bars. The first three bars (March 1, March 2, and March 5) are visually grouped under a single Episode 1 span covering March 1 to March 10. An 8-day gap follows. The fourth bar (March 18 to March 21) stands alone as Episode 2. A readmission label marks the gap between the two episodes.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Raw[\"Facility claims: inpatient 011x, observation 013x rev 0762/G0378-9, ED 0450/1\"] --> Dedup[\"Drop professional claims; de-duplicate reversal/replacement to latest version\"]\n  Dedup --> FFS{\"FFS Part A person-time?\"}\n  FFS -- \"No, MA-only\" --> Missing[\"Treat as missing - do NOT collapse; absence of claim is not a true gap\"]\n  FFS -- Yes --> Sort[\"Sort by person_id, admit_date\"]\n  Sort --> Gap{\"Gap to prior discharge within 1 day OR prior discharge_status is 02?\"}\n  Gap -- Yes --> Merge[\"Same episode: obs-to-inpatient conversion or acute transfer\"]\n  Gap -- No --> NewEp[\"New episode\"]\n  Merge --> Episode[\"Episode: start, end, LOS, transfer_count\"]\n  NewEp --> Episode\n  Episode --> Readmit[\"Separate layer: 30-day readmission from episode discharge\"]",
        "caption": "Episode-construction logic. Restrict to FFS Part A person-time, de-duplicate, then collapse on a 0-1-day gap or an acute-transfer discharge status before building any readmission measure on top of the episodes.",
        "alt_text": "Flowchart from raw facility claims through de-duplication, fee-for-service restriction, gap and transfer logic, episode roll-up, and a separate readmission layer.",
        "source_type": "illustrative",
        "source_citations": [
          "jencks-2009"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "gantt\n  title Collapsing four claims into two episodes (one patient)\n  dateFormat YYYY-MM-DD\n  axisFormat %b %d\n  section Episode 1 (collapsed)\n  A observation prov 100 (rev 0762)        :done, a, 2024-03-01, 1d\n  B inpatient prov 100 (discharge_status 02) :done, b, 2024-03-02, 3d\n  C inpatient prov 200 (transfer leg)        :done, c, 2024-03-05, 5d\n  section 30-day readmission window\n  Clock starts at episode-1 discharge        :milestone, m, 2024-03-10, 0d\n  section Episode 2 (distinct)\n  D inpatient prov 200 (8-day gap)           :crit, d, 2024-03-18, 3d",
        "caption": "Three claims (observation -> inpatient -> inter-hospital transfer) collapse into one 9-day episode; the fourth claim, 8 days after discharge, is a distinct episode and a 30-day readmission. A naive count would report four hospitalizations and miscount the transfer leg as the readmission.",
        "alt_text": "Gantt chart showing an observation stay, an inpatient stay with transfer status, and a transfer-in stay merging into one episode, followed by a separate readmission after an eight-day gap.",
        "source_type": "illustrative",
        "source_citations": [
          "jencks-2009"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "is_variant_of",
        "target_slug": "outcome-algorithm-construction-rwe",
        "notes": "Hospitalization-episode collapse is a specific outcome-construction algorithm within the broader family of claims-based outcome definition."
      },
      {
        "relation_type": "used_with",
        "target_slug": "claims-outcome-algorithm-ppv-sensitivity-rwe",
        "notes": "The collapse rule defines the candidate event; its positive predictive value and sensitivity against an adjudicated reference should be validated like any claims outcome algorithm."
      },
      {
        "relation_type": "requires",
        "target_slug": "continuous-enrollment-observable-time-rwe",
        "notes": "Restricting to FFS Part A continuous person-time is prerequisite, because MA-only spans lack institutional claims and produce spurious gaps that corrupt the collapse."
      },
      {
        "relation_type": "affects",
        "target_slug": "competing-risks-cause-specific-fine-gray-rwe",
        "notes": "Death precludes later hospitalization legs, so episode boundaries interact with how the competing risk of mortality is modeled in time-to-hospitalization analyses."
      },
      {
        "relation_type": "see_also",
        "target_slug": "attrition-and-loss-to-follow-up-rwe",
        "notes": "Disenrollment or transfer out of the data ends observable person-time and can truncate an episode, mimicking a completed discharge."
      }
    ],
    "aliases": [
      "hospitalization episode construction",
      "transfer collapse",
      "inpatient stay stitching",
      "observation-to-inpatient collapse",
      "hospital episode definition"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta",
      "pqa-cms"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "hrqol",
    "name": "Health-Related Quality of Life (HRQoL) Measurement",
    "short_definition": "The measurement of a patient's self-reported physical, mental, and social functioning using generic or disease-specific instruments, which—when a preference-based value set is applied—yields health-state utilities on a dead(0)-to-full-health(1) scale that feed quality-adjusted life-year (QALY) accrual.",
    "long_description": "**Health-related quality of life (HRQoL)** is an *outcome measure*, not a study design: it is\nthe operationalized assessment of how disease and treatment affect a patient's physical,\npsychological, and social functioning, captured by validated patient-reported instruments and\nscored as either a descriptive *profile*, a non-preference *index*, or a preference-based\n*utility*. The distinction among these three score types is the single most consequential\ndecision in HRQoL work, because only preference-based utilities — anchored at 0 (a state\nequivalent to death) and 1 (full health), with values below 0 permitted for states worse than\ndeath — can be multiplied by survival time to produce QALYs and thus enter a cost-utility model.\n\n**Core conceptual distinction.** Three things are routinely conflated and must be separated.\n(1) *Instrument type*: **generic** instruments (EQ-5D-3L/5L, SF-6D, HUI3, AQoL) measure a common\nhealth construct comparable across diseases and are required by most HTA reference cases;\n**disease-specific** instruments (EORTC QLQ-C30, FACT-G, KDQOL, MLHFQ) are more responsive to\ncondition-specific change but are not directly comparable across indications and usually cannot\nbe valued for QALYs without a mapping algorithm. (2) *Score type*: a **profile** reports\nmultiple domain scores (e.g., the five EQ-5D dimensions, or PROMIS domain T-scores); an **index**\ncollapses domains into one number with arbitrary (non-preference) weights; a **utility** uses a\nsocietal *value set* elicited by time-trade-off or standard-gamble so the number carries\ncardinal, QALY-ready meaning. A PROMIS Global Health T-score of 50 is **not** a utility and must\nnever be treated as one. 
(3) *Value set*: the same EQ-5D-5L health-state vector \"21243\" maps to\nvery different utilities under the US (Pickard 2019), England (Devlin 2018), or a country-specific\nvalue set; the value set is an analytic assumption, not a property of the patient, and it\nmaterially moves utilities, incremental QALYs, and the resulting ICER.\n\n**Pros, cons, and trade-offs** (specific & comparative, naming the alternatives).\n- **Generic preference-based (EQ-5D) vs disease-specific profile (e.g., EORTC QLQ-C30, FACT-G):**\n  EQ-5D is QALY-ready, cross-condition comparable, and the NICE/ICER reference-case default; it is\n  often *insensitive* to clinically meaningful change in narrow conditions (vision, hearing, mental\n  health, oncology symptom burden) and shows ceiling effects in mild disease. Disease-specific\n  profiles are more responsive but cannot feed a cost-utility analysis without mapping. **Prefer\n  EQ-5D** when the deliverable is a QALY for HTA; **add a disease-specific instrument** when\n  responsiveness or a clinical/labeling endpoint is the goal, and pre-specify which is primary.\n- **Direct utility measurement vs mapping (crosswalk) from a disease-specific or claims-derived\n  measure:** Direct EQ-5D collection is preferred when feasible. **Mapping** (e.g., QLQ-C30→EQ-5D,\n  or comorbidity→EQ-5D catalogs such as Sullivan/Ghushchyan) is a fallback that adds prediction\n  error, compresses variance, and is only valid within the estimation sample's case mix — it is\n  the *only* route to utilities when the substrate is administrative claims with no PRO at all. See\n  `qaly-utility-mapping-rwe`. **Prefer direct measurement**; reserve mapping for retrospective data\n  or instruments that were never valued.\n- **Mean change vs responder/MID analysis:** A mean change in utility is efficient and feeds\n  QALYs directly but is hard for clinicians to interpret and can be driven by a few patients. 
A\n  **responder analysis** against an anchor-based minimally important difference (MID) is\n  interpretable and aligns with labeling but discards information and is sensitive to the MID\n  threshold and to missingness. Report both; pre-specify the MID and its derivation.\n- **EQ-5D-3L vs 5L:** 5L reduces ceiling effects and improves discrimination but is not\n  interchangeable with 3L; the value set and the version must match, and a 3L→5L crosswalk\n  introduces its own error. **Prefer 5L** for new studies with a native 5L value set.\n\n**When to use** (clear decision rules). Use HRQoL utility measurement when the analytic question\nrequires a QALY: cost-utility analysis, HTA submission, or any comparison where treatments trade\nlength of life against quality of life (oncology, end-stage organ disease, chronic symptomatic\nconditions). Use a disease-specific HRQoL profile when the endpoint is responsiveness to a\ntherapy's symptomatic benefit, a regulatory PRO labeling claim, or longitudinal symptom tracking.\nCollect HRQoL at fixed, protocol-defined assessment times (not visit-driven) whenever QALY accrual\nby area-under-the-curve is intended, so the time axis is interpretable.\n\n**When NOT to use — and when it is actively misleading or dangerous** (clear decision rules).\n- **Do not compute a QALY from a non-preference instrument.** Treating a PROMIS T-score, an SF-36\n  summary, or an unweighted index as a utility produces numbers with no cardinal meaning and an\n  uninterpretable ICER. This is a category error, not a minor approximation.\n- **Do not mix value sets within an analysis** or silently switch from the reference-case value\n  set; the comparison becomes value-set-driven rather than treatment-driven.\n- **Do not analyze HRQoL change while ignoring death.** In any population with non-trivial\n  mortality, dropout is dominated by death and disease progression — i.e., **missing not at\n  random (MNAR)**. 
A complete-case mean change in utility is then biased *upward* in the sicker\n  arm because the patients who died (utility effectively 0) are silently dropped. For QALYs, death\n  must be modeled as an absorbing state with utility 0; for HRQoL change, MMRM (`mmrm-repeated-measures-rwe`)\n  or pattern-mixture/reference-based multiple imputation (`multiple-imputation-longitudinal-rwe`)\n  is required, and a \"dead = worst state\" sensitivity analysis is standard.\n- **Do not pool proxy and self-report uncritically.** Caregiver/proxy ratings systematically\n  differ from patient self-report (often rating observable physical domains lower and psychological\n  domains differently); proxy use that is differential by arm or disease severity is a confounder.\n- **Do not ignore mode and recall effects** (paper vs ePRO, 1-week vs 4-week recall): mixing modes\n  or recall periods within a study introduces measurement artifact that can swamp a true effect.\n\n**Data-source operational depth** (each with real failure modes + workarounds).\n- **Pure administrative claims (FFS or commercial):** HRQoL is **not measurable** — there is no PRO\n  in a claim. The only workaround is a **mapping/utility-catalog** approach: derive a comorbidity or\n  diagnosis profile from claims and attach published EQ-5D decrements (e.g., Sullivan/Ghushchyan\n  catalogs). Failure modes: the catalog's case mix must resemble yours; it captures average, not\n  individual, utility and cannot detect within-patient change. Treat results as a population-average\n  approximation only, never as a measured PRO.\n- **Claims-linked PRO (e.g., SEER-MHOS, the CMS Health Outcomes Survey for Medicare Advantage):**\n  A genuine RWE substrate where a validated HRQoL instrument (the VR-12, a generic *profile* measure — not\n  preference-based, so it does not yield QALY-ready utilities without a mapping algorithm) is fielded and\n  linkable to claims and tumor registries. 
Failure modes: **MA-only person-time lacks fee-for-service\n  claims**, so resource use and some outcomes are unobservable for exactly the enrollees who have\n  the PRO; survey non-response is differential by health status (sicker patients respond less),\n  biasing the sample toward healthier respondents. Workarounds: survey weights and non-response adjustment\n  (`survey-weights-complex-sampling-rwe`), and restricting cost/utilization analyses to person-time\n  with complete claims observability.\n- **EHR-embedded PROs (e.g., Epic/MyChart PROMIS rollouts, ambulatory oncology ePRO):** PRO capture\n  is *visit-driven and discretionary*, so completion is differential: sicker, more\n  engaged, or symptomatic patients are over-represented, and patients who leave the health system are lost to follow-up.\n  Assessment timing is irregular, breaking AUC-based QALY accrual. Workarounds: define fixed\n  assessment windows, model completion as potentially informative, and prefer linkage to confirm\n  survival so that death-related missingness is not mistaken for loss to follow-up.\n- **Disease registries and trial/registry extensions:** The strongest substrate for protocolized\n  HRQoL: fixed assessment schedule, adjudicated disease status, and (with a linked death index)\n  an absorbing death state for QALYs. Failure modes: registries skew toward academic centers and\n  consenting patients (selection), and pharmacy/cost completeness is usually weak, so link to\n  claims for resource use and to a death index (`mortality-source-hierarchy-rwe`) for censoring.\n\n**Worked example (claims-linked PRO).** Question: mean change in EQ-5D-5L utility and 12-month\nQALY accrual after incident metastatic disease, comparing two systemic regimens, in a\ncancer registry linked to Medicare claims with EQ-5D-5L administered at index, 3, 6, 9, and 12\nmonths. 
(1) **Cohort:** incident metastatic diagnosis (first qualifying diagnosis date = index),\n≥365 days continuous A/B/D enrollment before index so baseline comorbidity (for risk adjustment\nand mapping fallback) is observable; restrict to non-MA person-time so claims-based covariates and\nresource use are real, not missing. (2) **Exposure/arm:** regimen from the first qualifying\nHCPCS/J-code administration after index (`infused-biologic-administration-capture-rwe`). (3)\n**HRQoL scoring:** convert each patient's five EQ-5D-5L item responses (mobility, self-care,\nusual activities, pain/discomfort, anxiety/depression, each 1–5) into the health-state vector,\nthen apply the **US Pickard (2019)** value set as the reference case to obtain a utility at each\nassessment; pre-register that a **death = 0** absorbing rule applies and that any assessment after\nthe death date is set to 0. (4) **Change model:** MMRM with utility change from baseline as the\nresponse, fixed effects for arm, time (categorical), arm×time, and baseline utility, an\nunstructured covariance, restricted maximum likelihood — valid under MAR within the modeled\ncovariates and arms. (5) **QALY accrual:** integrate utility over time by the trapezoidal rule\nbetween assessments, carrying utility to 0 from the death date (so person-time after death\ncontributes no QALYs), then discount within-year at the reference-case rate\n(`discounting-costs-effects-rwe`). (6) **Missing-data sensitivity:** because dropout is dominated\nby death/progression (MNAR), repeat the QALY calculation under reference-based (jump-to-reference)\nmultiple imputation and under a \"dead = worst observed state\" assumption; report how incremental\nQALYs move. 
(7) **Value-set sensitivity:** re-derive utilities under the England (Devlin 2018)\nvalue set to show how much of the incremental QALY is value-set-driven versus treatment-driven.\n\n**Interpreting the output**\n\nConsider the four-patient COPD worked example: mean EQ-5D-5L utility is 2.96 / 4 = 0.74,\nderived from the US Pickard (2019) value set. Patient P001 accrues 0.82 QALYs over one year;\npatient P003, who died at 6 months with utility 0.55, accrues 0.55 × 0.5 = 0.275 QALYs.\n\nFormal interpretation: A utility of 0.74 means that, on average, this cohort's health states are\nvalued at 74% of perfect health by the general population represented in the value set, not by\nthe patients themselves, whose own valuations may differ (patients living with a condition often\nvalue their own state higher than members of the public asked to imagine it). The utility\nis anchored to 0 = dead and 1 = perfect health under the chosen value set; the same five-dimension\nresponse vector would yield a different utility under the England EQ-5D-5L Devlin (2018) set or the\nCanadian Xie (2016) set. Country-specific value sets are not interchangeable, and any comparison\nof utilities across studies must first confirm that the same value set was applied. The QALY\narithmetic here holds each patient's enrollment utility constant over follow-up, an approximation\nthat becomes less defensible as intervals lengthen or health changes rapidly.\n\nPractical interpretation: A mean utility of 0.74 is hard to interpret in isolation; differences\nbetween arms or over time should be judged against a minimally important difference (MID). For the\nEQ-5D, published MID estimates in COPD are approximately 0.08 (van Reenen 2018); a between-arm\ndifference smaller than that threshold is unlikely to be patient-meaningful even if statistically\nsignificant. 
For cost-effectiveness\nmodeling, the utility feeds QALYs, which feed the ICER denominator; if the value set choice\nmoves the utility from 0.74 to 0.80, the incremental QALY gain changes, and so does the ICER —\nalways run a value-set sensitivity analysis before submitting to any HTA body.",
    "primary_category": "Outcome_Measure",
    "tags": [
      "hrqol",
      "patient-reported-outcome",
      "utility",
      "EQ-5D",
      "QALY",
      "value-set",
      "cost-utility",
      "preference-based"
    ],
    "applies_to_study_types": [
      "cohort_prospective",
      "registry_trial",
      "cer_observational"
    ],
    "data_sources": [
      "registry",
      "ehr",
      "linked",
      "claims"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1016/0168-8510(96)00822-6",
        "url": "https://doi.org/10.1016/0168-8510(96)00822-6",
        "citation_text": "Brooks R. EuroQol: the current state of play. Health Policy. 1996;37(1):53-72.",
        "year": 1996,
        "authors_short": "Brooks",
        "notes": "Foundational statement of the EuroQol/EQ-5D descriptive system and the preference-based valuation logic that underpins health-state utilities and the QALY."
      },
      {
        "role": "explain",
        "doi": "10.1016/j.jval.2019.02.009",
        "url": "https://doi.org/10.1016/j.jval.2019.02.009",
        "citation_text": "Pickard AS, Law EH, Jiang R, Pullenayegum E, Shaw JW, Xie F, Oppe M, Boye KS, Chapman RH, Gong CL, Balch AH, Busschbach JJV. United States Valuation of EQ-5D-5L Health States Using an International Protocol. Value in Health. 2019;22(8):931-941.",
        "year": 2019,
        "authors_short": "Pickard et al.",
        "notes": "The US EQ-5D-5L value set; the reference-case crosswalk from item responses to utilities for US analyses and the canonical illustration that the value set is an analytic choice."
      },
      {
        "role": "explain",
        "doi": "10.1002/hec.3564",
        "url": "https://doi.org/10.1002/hec.3564",
        "citation_text": "Devlin NJ, Shah KK, Feng Y, Mulhern B, van Hout B. Valuing health-related quality of life: An EQ-5D-5L value set for England. Health Economics. 2018;27(1):7-22.",
        "year": 2018,
        "authors_short": "Devlin et al.",
        "notes": "The England EQ-5D-5L value set; used here to show how country value-set choice moves utilities, incremental QALYs, and the ICER for the same health states."
      }
    ],
    "plain_language_summary": "Health-related quality of life (HRQoL) measurement captures how much a disease or treatment affects a patient's daily functioning — their ability to move, care for themselves, and feel well — by asking them to fill out a short questionnaire. When a scoring method called a value set is applied to those answers, the result is a single number called a utility that runs from 0 (a health state as bad as being dead) to 1 (perfect health). That utility score is the building block for calculating QALYs, which combine how long a patient lives with how well they live during that time. One important caveat: not every quality-of-life questionnaire produces a utility — some produce profile scores or index numbers that look similar but cannot be used to calculate QALYs.",
    "key_terms": [
      {
        "term": "utility",
        "definition": "A single number between 0 and 1 that represents how good or bad a health state is, anchored so that 0 equals being dead and 1 equals perfect health; it is the only type of quality-of-life score that can be multiplied by time to calculate a QALY."
      },
      {
        "term": "EQ-5D",
        "definition": "A widely used, five-question patient questionnaire that asks about mobility, self-care, usual activities, pain, and anxiety; a value set converts the answers into a utility score for use in economic analyses."
      },
      {
        "term": "QALY",
        "definition": "Quality-adjusted life year — one year of life lived in perfect health (utility = 1); if a patient lives a full year at a utility of 0.70, they accrue 0.70 QALYs for that year."
      },
      {
        "term": "value set",
        "definition": "A published lookup table, derived from surveys of the general public, that converts a set of EQ-5D questionnaire responses into a utility number; different countries have different value sets, so the same questionnaire responses can produce different utility scores depending on which table is used."
      }
    ],
    "worked_example": {
      "scenario": "A registry study enrolls four patients with chronic obstructive pulmonary disease (COPD). At enrollment each patient completes the EQ-5D-5L questionnaire, and the research team applies the US value set (Pickard 2019) to convert each patient's five item responses into a utility score. The team wants to know the mean utility across the group and to illustrate how one patient's utility feeds a QALY calculation when survival time is also known.",
      "dataset": {
        "caption": "EQ-5D-5L utility scores at enrollment for four COPD patients. Each utility was derived from that patient's five questionnaire responses using the US value set. Survival time records how long the patient was observed (patients who died have a time less than 1.0 year).",
        "columns": [
          "person_id",
          "eq5d_utility",
          "survival_years",
          "status"
        ],
        "rows": [
          [
            "P001",
            0.82,
            1.0,
            "alive at 1 year"
          ],
          [
            "P002",
            0.68,
            1.0,
            "alive at 1 year"
          ],
          [
            "P003",
            0.55,
            0.5,
            "died at 6 months"
          ],
          [
            "P004",
            0.91,
            1.0,
            "alive at 1 year"
          ]
        ]
      },
      "steps": [
        "The utility scale runs from 0 to 1: a utility of 0.82 (P001) means that the general public values that patient's health state as 82% as good as perfect health; a utility of 0.55 (P003) reflects more severe impairment.",
        "Mean utility across the four patients: add the four scores (0.82 + 0.68 + 0.55 + 0.91 = 2.96) and divide by 4, giving a mean of 0.74.",
        "To convert a utility into QALYs, multiply the utility by the number of years spent in that health state: P001 was alive for 1.0 year at utility 0.82, so P001 accrued 0.82 x 1.0 = 0.82 QALYs.",
        "P003 died at 6 months (0.5 years), and from the moment of death the utility is set to 0 by rule — death accrues no further QALYs. P003 accrued 0.55 x 0.5 = 0.275 QALYs before death, then 0 x 0.5 = 0.00 QALYs after death, for a total of 0.275 QALYs over the 1-year window.",
        "The QALY gap between P001 and P003 over one year is 0.82 - 0.275 = 0.545 QALYs, illustrating that both lower utility and shorter survival reduce a patient's QALY accrual."
      ],
      "result": "Mean EQ-5D utility across the four patients = 2.96 / 4 = 0.74. QALY examples: P001 (alive, utility 0.82) accrues 0.82 x 1.0 = 0.82 QALYs in one year; P003 (died at 6 months, utility 0.55) accrues 0.55 x 0.5 = 0.275 QALYs — the remaining 0.5 years contribute 0 QALYs because death is treated as a utility of 0."
    },
    "prerequisites": [
      "pro-rwe",
      "cost-utility"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Preference-based generic utility (EQ-5D-5L with a value set)",
        "description": "Score the five EQ-5D-5L item responses into a health-state vector, then apply a societal value set (e.g., US Pickard 2019, England Devlin 2018) to obtain a QALY-ready utility on the dead(0)-to-full-health(1) scale, with values <0 for states worse than death.",
        "edge_cases": [
          "Value-set choice (US vs England vs country-specific) materially changes utilities and the ICER; pre-specify one as the reference case and report the others as sensitivity analyses.",
          "3L and 5L versions and their value sets are not interchangeable; a 3L->5L crosswalk adds error.",
          "States worse than death produce negative utilities that must be retained, not truncated at 0."
        ],
        "data_source_notes": "registry/linked-PRO: collect raw item responses, not pre-computed utilities, so the value set can be applied and varied analytically; claims-only: not measurable, requires a mapping/utility-catalog fallback."
      },
      {
        "name": "Disease-specific HRQoL profile with mapping to utilities",
        "description": "Use a responsive condition-specific instrument (EORTC QLQ-C30, FACT-G, KDQOL) for the primary clinical/labeling endpoint, then apply a published mapping (crosswalk) algorithm to obtain EQ-5D-equivalent utilities for the cost-utility model.",
        "edge_cases": [
          "Mapping compresses variance and adds prediction error; valid only within the estimation sample's case mix and disease severity range.",
          "Disease-specific scores are not comparable across indications and cannot feed QALYs directly."
        ],
        "data_source_notes": "Prefer co-administering EQ-5D alongside the disease-specific instrument so direct utilities are available and mapping is a sensitivity analysis, not the primary route."
      },
      {
        "name": "Responder / minimally important difference (MID) analysis",
        "description": "Classify each patient as a responder versus non-responder against an anchor-based MID on the HRQoL score, in addition to (not instead of) the mean change used for QALYs.",
        "edge_cases": [
          "Results are sensitive to the chosen MID threshold and its derivation (anchor- vs distribution-based).",
          "Dichotomization discards information and is sensitive to differential missingness by arm."
        ],
        "data_source_notes": "Pre-specify the MID and its source; report alongside the continuous mean-change estimate to support both interpretability and QALY accrual."
      },
      {
        "name": "Claims-derived utility via comorbidity catalog (mapping when no PRO exists)",
        "description": "When the substrate is administrative claims with no PRO, attach published EQ-5D decrements to a claims-derived diagnosis/comorbidity profile (e.g., Sullivan/Ghushchyan catalogs) to approximate a population-average utility.",
        "edge_cases": [
          "Captures average, not individual, utility and cannot detect within-patient change over time.",
          "Validity depends on the catalog's underlying population resembling the analytic cohort."
        ],
        "data_source_notes": "claims: this is the only route to a utility-like number; flag explicitly as an approximation, never as a measured PRO, and bound the cost-utility result with its uncertainty."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Disease-specific HRQoL profile (e.g., EORTC QLQ-C30, FACT-G)",
        "pros_of_this": "Generic preference-based utilities (EQ-5D) are QALY-ready, cross-condition comparable, and the HTA reference-case default.",
        "cons_of_this": "Often insensitive to clinically meaningful change in narrow conditions and prone to ceiling effects in mild disease.",
        "when_to_prefer": "When the deliverable is a QALY for cost-utility analysis or HTA submission; add a disease-specific instrument when responsiveness or a labeling claim is the goal."
      },
      {
        "compared_to": "Mapping/crosswalk from a disease-specific or claims-derived measure",
        "pros_of_this": "Direct EQ-5D measurement avoids the prediction error and variance compression that mapping introduces and is valid outside any single estimation sample.",
        "cons_of_this": "Requires prospective PRO collection; impossible in retrospective claims-only data.",
        "when_to_prefer": "Whenever direct utility collection is feasible; reserve mapping for legacy instruments or administrative data with no PRO."
      },
      {
        "compared_to": "Non-preference index or domain score (e.g., PROMIS T-score, SF-36 summary)",
        "pros_of_this": "A preference-based utility carries cardinal, QALY-ready meaning anchored at dead=0 and full-health=1.",
        "cons_of_this": "Requires a valued instrument and value set; a profile/index is cheaper and sometimes more responsive but cannot feed a QALY.",
        "when_to_prefer": "Always when a QALY or cost-utility result is required; a non-preference score must never be substituted for a utility."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "HRQoL is not directly measurable. The only route to a utility is a mapping/comorbidity-catalog approximation attaching published EQ-5D decrements to a claims-derived diagnosis profile; report as population-average and flag as an approximation, not a measured PRO.",
      "ehr": "PROs (e.g., PROMIS in MyChart) are captured visit-driven and discretionarily, so completion is differential and timing irregular. Define fixed assessment windows, model completion as potentially informative, and link to a death index to distinguish death from loss to follow-up.",
      "registry": "Strongest substrate for protocolized HRQoL with a fixed assessment schedule and adjudicated disease status; link to claims for resource use and to a death index so death is an absorbing utility-0 state for QALYs.",
      "linked": "Claims-linked PRO (e.g., SEER-MHOS / CMS HOS) enables HRQoL plus utilization, but MA-only person-time lacks fee-for-service claims and survey non-response is differential by health status; apply survey weights and restrict cost analyses to claims-observable person-time."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\nimport numpy as np\n\nITEMS = [\"mob\", \"sc\", \"ua\", \"pd\", \"ad\"]  # EQ-5D-5L: mobility, self-care, usual activities, pain, anxiety\n\ndef state_string(df: pd.DataFrame) -> pd.Series:\n    # 5-digit health-state key, e.g. responses (2,1,2,4,3) -> \"21243\"\n    return df[ITEMS].astype(int).astype(str).agg(\"\".join, axis=1)\n\ndef attach_utility(pro: pd.DataFrame, vset: pd.DataFrame) -> pd.DataFrame:\n    pro = pro.copy()\n    pro[\"state\"] = state_string(pro)\n    pro = pro.merge(vset.rename(columns={\"utility\": \"u\"}), on=\"state\", how=\"left\")\n    if pro[\"u\"].isna().any():       # every valid 1..5 combination must be in the value set\n        raise ValueError(\"Unmapped EQ-5D-5L state(s); check item ranges and value-set coverage.\")\n    return pro\n\ndef qalys(pro: pd.DataFrame, enroll: pd.DataFrame) -> pd.DataFrame:\n    # Order assessments, append a death assessment at utility 0 so post-death time accrues no QALYs.\n    df = pro.merge(enroll[[\"person_id\", \"death_date\"]], on=\"person_id\", how=\"left\")\n    df = df.sort_values([\"person_id\", \"assess_date\"])\n    df.loc[df[\"death_date\"].notna() & (df[\"assess_date\"] >= df[\"death_date\"]), \"u\"] = 0.0\n\n    out = []\n    for pid, g in df.groupby(\"person_id\"):\n        g = g.sort_values(\"assess_date\")\n        d = enroll.loc[enroll.person_id == pid, \"death_date\"].iloc[0]\n        if pd.notna(d) and d > g[\"assess_date\"].max():\n            g = pd.concat([g, pd.DataFrame({\"assess_date\": [d], \"u\": [0.0]})], ignore_index=True)\n        t_years = (g[\"assess_date\"] - g[\"assess_date\"].iloc[0]).dt.days / 365.25\n        trapezoid = getattr(np, \"trapezoid\", np.trapz)  # np.trapz renamed in NumPy 2.0\n        qaly = trapezoid(g[\"u\"].to_numpy(), t_years.to_numpy())   # trapezoidal AUC of utility x time\n        out.append({\"person_id\": pid, \"qaly\": qaly})\n    return pd.DataFrame(out)",
        "description": "Score EQ-5D-5L item responses into health-state utilities using a value-set lookup, then accrue\nQALYs over fixed assessment times with a death=0 absorbing rule. Required inputs (post data\nmanagement):\n  pro     : person_id, assess_date (datetime), mob, sc, ua, pd, ad   # each item integer 1..5\n  enroll  : person_id, index_date, death_date (NaT if alive)\n  vset    : DataFrame keyed by the 5-digit state string -> utility (e.g., US Pickard 2019)\nQALYs use trapezoidal integration of utility x time (years) between assessments, carrying utility\nto 0 from death. Substitute the England value set to run the value-set sensitivity analysis.",
        "dependencies": [
          "pandas",
          "numpy"
        ],
        "source_citations": [
          "pickard-2019"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\nITEMS <- c(\"mob\", \"sc\", \"ua\", \"pd\", \"ad\")\n\nqalys <- function(pro, enroll, vset) {\n  setDT(pro); setDT(enroll); setDT(vset)\n  pro[, state := do.call(paste0, lapply(.SD, function(x) as.integer(x))), .SDcols = ITEMS]\n  pro <- merge(pro, vset, by = \"state\", all.x = TRUE)\n  if (anyNA(pro$utility)) stop(\"Unmapped EQ-5D-5L state(s); check item ranges / value set.\")\n\n  pro <- merge(pro, enroll[, .(person_id, death_date)], by = \"person_id\", all.x = TRUE)\n  setorder(pro, person_id, assess_date)\n  pro[!is.na(death_date) & assess_date >= death_date, utility := 0]  # post-death utility = 0\n\n  pro[, {\n    d <- enroll[person_id == .BY$person_id, death_date]\n    u <- utility; a <- assess_date\n    if (!is.na(d) && d > max(a)) { a <- c(a, d); u <- c(u, 0) }      # absorb to 0 at death\n    ord <- order(a); a <- a[ord]; u <- u[ord]\n    t <- as.numeric(a - a[1]) / 365.25\n    .(qaly = sum(diff(t) * (head(u, -1) + tail(u, -1)) / 2))         # trapezoidal AUC\n  }, by = person_id]\n}",
        "description": "EQ-5D-5L utility scoring and QALY accrual in R. Inputs mirror the Python version:\n  pro    : person_id, assess_date (Date), mob, sc, ua, pd, ad  (items integer 1..5)\n  enroll : person_id, index_date, death_date (NA if alive)\n  vset   : data.frame(state = \"21243\", utility = ...)  # value set (US Pickard 2019)\ndeath=0 is enforced as an absorbing state; QALYs are the trapezoidal AUC of utility over years.",
        "dependencies": [
          "data.table"
        ],
        "source_citations": [
          "pickard-2019"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* Build the 5-digit health-state key and look up the reference-case utility. */\nproc sql;\n  create table scored as\n  select p.person_id, p.assess_date, p.visit,\n         cats(put(p.mob,1.), put(p.sc,1.), put(p.ua,1.),\n              put(p.pd,1.),  put(p.ad,1.)) as state length=5,\n         v.utility\n  from work.pro p\n  left join work.vset v\n    on calculated state = v.state;\nquit;\n\n/* Every valid 1..5 combination must resolve; unmapped states are a coding/value-set error. */\nproc sql;\n  select count(*) as n_unmapped from scored where utility is missing;\nquit;\n\n/* Enforce death=0 as an absorbing state before any change/QALY analysis. */\nproc sql;\n  create table util as\n  select s.*, case when e.death_date is not null and s.assess_date >= e.death_date\n                   then 0 else s.utility end as u\n  from scored s left join work.enroll e on s.person_id = e.person_id;\nquit;\n\n/* Change from baseline (visit 0) for the MMRM response. */\nproc sql;\n  create table chg as\n  select a.person_id, a.visit, (a.u - b.u) as dutil\n  from util a\n  join (select person_id, u as u from util where visit = 0) b\n    on a.person_id = b.person_id\n  where a.visit > 0;\nquit;\n\n/* Attach the treatment arm (from cohort construction) to each change record. */\nproc sql;\n  create table chg_with_arm as\n  select c.person_id, c.visit, c.dutil, h.arm\n  from work.chg c\n  join work.cohort h on c.person_id = h.person_id;   /* work.cohort: person_id, arm */\nquit;\n\n/* MMRM: utility change ~ arm time arm*time, unstructured within-patient covariance, REML. */\nproc mixed data=work.chg_with_arm method=reml;\n  class person_id arm visit;\n  model dutil = arm visit arm*visit / ddfm=kr solution;\n  repeated visit / subject=person_id type=un;\n  lsmeans arm*visit / diff cl;          /* adjusted mean utility change by arm and time */\nrun;",
        "description": "EQ-5D-5L utility scoring via a value-set lookup (PROC SQL), then repeated-measures analysis of\nutility change from baseline with PROC MIXED (MMRM, valid under MAR). Required inputs:\n  work.pro    : person_id, assess_date, visit (0,3,6,9,12), mob sc ua pd ad (items 1..5)\n  work.vset   : state (char5), utility   # e.g. US Pickard 2019 value set\n  work.enroll : person_id, death_date\n  work.cohort : person_id, arm           # treatment arm from cohort construction\nFor QALYs, integrate scored utilities by the trapezoidal rule with a death=0 absorbing state\n(a DATA step over sorted assessments, analogous to the Python/R code); the MMRM below models the\nHRQoL change endpoint that accompanies the QALY result.",
        "dependencies": [],
        "source_citations": [
          "pickard-2019"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Resp[EQ-5D-5L item responses<br/>5 dimensions, each 1-5] --> State[Health-state vector<br/>e.g. 21243]\n  State --> VS{Value set<br/>analytic choice}\n  VS -->|US Pickard 2019| UtilUS[Utility on dead=0..full=1]\n  VS -->|England Devlin 2018| UtilUK[Utility on dead=0..full=1]\n  UtilUS --> AUC[QALY = AUC of utility x time<br/>death = absorbing utility 0]\n  UtilUK --> AUC\n  AUC --> Disc[Discount within-horizon] --> CUA[Cost-utility model / ICER]",
        "caption": "From EQ-5D item responses to QALYs. The same health-state vector yields different utilities under different value sets, so the value set is an analytic assumption that propagates into the ICER.",
        "alt_text": "Flowchart showing EQ-5D-5L item responses mapped to a health-state vector, valued by a chosen value set into a utility, integrated over time into QALYs with death as an absorbing zero state, then discounted and fed to a cost-utility model.",
        "source_type": "illustrative",
        "source_citations": [
          "pickard-2019",
          "devlin-2018"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Q{Is a QALY / cost-utility result required?} -->|No| Prof[Use disease-specific profile<br/>for responsiveness / labeling]\n  Q -->|Yes| Gen{Is a generic preference-based<br/>instrument responsive enough?}\n  Gen -->|Yes| EQ[Collect EQ-5D-5L directly<br/>apply reference-case value set]\n  Gen -->|No| DS[Collect disease-specific instrument]\n  DS --> Map{Direct EQ-5D also collected?}\n  Map -->|Yes| EQ\n  Map -->|No| Cross[Map to EQ-5D utilities<br/>flag prediction error]\n  Prof --> Done[Pre-specify primary endpoint + MID]\n  EQ --> Done\n  Cross --> Done",
        "caption": "Instrument-selection decision logic. The first question is whether a QALY is needed; only then does the generic-vs-disease-specific and direct-vs-mapping choice follow.",
        "alt_text": "Decision tree starting from whether a QALY is required, branching through generic versus disease-specific instrument choice and direct measurement versus mapping to utilities.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "produces",
        "target_slug": "qaly-utility-mapping-rwe",
        "notes": "HRQoL utilities are the per-period weights integrated over survival to produce QALYs; mapping algorithms convert disease-specific or claims-derived measures into the EQ-5D utilities used here."
      },
      {
        "relation_type": "used_with",
        "target_slug": "cost-utility",
        "notes": "Preference-based HRQoL utilities are the effectiveness input (the QALY) of a cost-utility analysis; non-preference HRQoL scores cannot serve this role."
      },
      {
        "relation_type": "used_with",
        "target_slug": "mmrm-repeated-measures-rwe",
        "notes": "MMRM is the standard model for repeated HRQoL change-from-baseline endpoints, valid under MAR within the modeled arms and covariates."
      },
      {
        "relation_type": "requires",
        "target_slug": "multiple-imputation-longitudinal-rwe",
        "notes": "HRQoL dropout is typically MNAR (death, progression); reference-based multiple imputation and pattern-mixture analyses are needed to bound the QALY/change estimate."
      },
      {
        "relation_type": "used_with",
        "target_slug": "discounting-costs-effects-rwe",
        "notes": "Accrued QALYs are discounted to present value at the reference-case rate before entering the ICER."
      },
      {
        "relation_type": "see_also",
        "target_slug": "pro-rwe",
        "notes": "HRQoL is a preference-based subset of patient-reported outcomes; PRO-RWE covers the broader landscape of self-reported endpoints including non-utility instruments."
      },
      {
        "relation_type": "see_also",
        "target_slug": "survey-weights-complex-sampling-rwe",
        "notes": "Claims-linked PRO surveys (e.g., SEER-MHOS / CMS HOS) require survey weights and non-response adjustment because completion is differential by health status."
      }
    ],
    "aliases": [
      "HRQoL",
      "health-related quality of life",
      "health state utility",
      "preference-based outcome",
      "EQ-5D utility"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "icd-10-cm-diagnosis-coding",
    "name": "ICD-10-CM Diagnosis Codes",
    "short_definition": "The US clinical modification of the World Health Organization's ICD-10 system, maintained by NCHS, that encodes diagnoses and health conditions on every US claim and encounter record; adopted for HIPAA-covered transactions on 2015-10-01, replacing ICD-9-CM, with approximately 70,000 billable codes organized in alphanumeric hierarchies that encode condition, laterality, encounter type, and episode stage.",
    "long_description": "**ICD-10-CM** (International Classification of Diseases, 10th Revision, Clinical Modification) is the\nUS-specific diagnosis coding system that appears on every reimbursement claim, hospital encounter record,\nand structured EHR problem list in the United States. It is maintained by the National Center for Health\nStatistics (NCHS) within CDC, which adapts the WHO ICD-10 base to US clinical practice and adds granularity\nnot present in the international version. CMS enforces its use for all HIPAA-covered electronic transactions.\nThe transition from ICD-9-CM took effect on 2015-10-01 (federal fiscal year 2016), and the code set is\nupdated annually, with new versions effective each October 1 (and, since fiscal year 2023, a mid-year\nApril update for emerging conditions). Any research code list must specify which fiscal-year version it\ntargets, because codes added or retired between versions can silently expand or shrink a phenotype.\n\n**Code structure and hierarchy.** Every ICD-10-CM code begins with a letter, followed by two digits, making\na three-character **category** (e.g., M20 for acquired deformities of fingers and toes). The category is a\nheader: it is NOT a billable code and will be rejected if submitted on a claim. Subcategories extend the\ncategory with a decimal point and one or more additional characters (M20.0 = deformity of finger(s)),\nand the full billable code reaches its highest level of specificity, typically four to seven characters\n(M20.001 = deformity of right index finger). Claims data frequently stores codes in **flat format**\n(no decimal point): \"M20001\" means M20.001. A prefix search on the flat code \"M20\" captures the entire\ncategory including all subcategories and billable codes beneath it; a search on \"M200\" narrows to\nsubcategory M20.0. 
This distinction matters critically for phenotype construction: an exact-match list\nof only billable codes will never accidentally match a non-billable header, but a startswith filter on\na three-character category string will capture every descendant code.\n\n**Seventh-character extensions, placeholders, and laterality.** Many code categories require a mandatory\nseventh character that encodes episode of care: A (initial encounter, meaning the patient is actively\nreceiving treatment for the condition), D (subsequent encounter, for routine care after the active phase),\nand S (sequela, for late effects). Fracture codes, wound codes, and injury codes follow this scheme\nrigorously — a patient with a broken wrist will have an \"A\" code at the emergency visit, \"D\" codes at\nfollow-up, and potentially an \"S\" code if a complication persists. When the code structure requires a\nseventh character but the code is fewer than six characters, the letter \"X\" fills the intervening\npositions as a placeholder (e.g., S52.001A has seven characters without a placeholder, while T14.91XA\nuses X as the sixth character before the A). Laterality — left, right, bilateral, unspecified — is\nencoded in the fifth or sixth character of many musculoskeletal, ophthalmologic, neurologic, and\nother codes. This is a major gain over ICD-9-CM, which had no laterality encoding; RWE studies on\northopedic conditions, stroke, or cancer can now distinguish sides in claims data without NLP or chart\nreview. Failing to account for laterality positions (e.g., treating \"M20.001\" and \"M20.002\" as\nseparate conditions rather than right and left variants of the same condition) is a common phenotype\nconstruction error.\n\n**Scale and public-domain status.** ICD-10-CM contains approximately 70,000+ diagnosis codes across its\nannual update cycles, compared with roughly 14,000 in ICD-9-CM. 
The code set is in the **public domain**:\nNCHS distributes the tabular list, index, and guidelines freely, and researchers may reproduce code\nlists in publications and protocols without licensing restrictions. This contrasts sharply with CPT\n(Current Procedural Terminology), which is copyrighted by the American Medical Association and requires\na license to reproduce. ICD-10-PCS, the companion procedure coding system used by inpatient facilities,\nis similarly public domain and maintained by CMS — it is a wholly separate system from ICD-10-CM and\nshould not be confused with it.\n\n**Relationship to related systems.** ICD-10-CM is a US clinical modification of the WHO ICD-10\ninternational base; the WHO updates the international version on its own cycle, and NCHS selectively\nincorporates those updates while adding US-specific codes. ICD-9-CM (retired for US claims on\n2015-09-30) maps to ICD-10-CM via the General Equivalence Mappings (GEMs) produced by CMS and NCHS,\nbut GEMs are approximate: many ICD-9-CM codes map to multiple ICD-10-CM codes (and vice versa), and\nsome mappings have no clean equivalent. Studies spanning the 2015 transition must handle the code-system\nbreak explicitly. SNOMED CT is a clinical terminology used in EHR problem lists and clinical decision\nsupport; it has a maintained mapping to ICD-10-CM that allows EHR-sourced diagnoses to be translated\nto claim-compatible codes, but the mapping is not one-to-one. In OMOP CDM, ICD-10-CM appears as a\nsource vocabulary; the OMOP ETL maps source ICD-10-CM codes to SNOMED standard concepts via the\nconcept_relationship table, so concept sets built in OMOP ATLAS operate on SNOMED ancestors, not on\nraw ICD-10-CM prefixes.\n\n**RWE and phenotype construction implications.** ICD-10-CM codes appear on both institutional claims\n(UB-04 / facility form, fields FL67–FL67Q for principal and secondary diagnoses) and professional\nclaims (CMS-1500 / 837P, boxes 21.A–L). 
A patient with heart failure will generate ICD-10-CM codes\non both the hospital bill and the cardiologist's office visit. Phenotype algorithms must specify\nwhich claim types are searched, which diagnosis positions count (principal/primary vs any position),\nand whether they use flat or decimal format. The standard pattern for RWE code-list construction\nis: (1) obtain the NCHS tabular list for the target fiscal year; (2) identify the relevant category\ncodes (three-character headers) and all billable descendants; (3) express the list in flat format\nfor claims matching; (4) decide whether to use exact-match (only specific billable codes) or\nprefix-match (entire category). Prefix-matching on a category like \"I50\" captures all heart-failure\ncodes including I50.1, I50.20, I50.21, I50.22, I50.30, I50.31, I50.32, I50.41, I50.42, I50.43,\nI50.810, I50.811, I50.812, I50.813, I50.814, I50.9, and any future codes added under that branch —\nwhich is both its strength (forward-compatible) and its risk (captures future codes the researcher\nmay not have reviewed). Pre-specified, version-dated exact lists are more reproducible for regulatory\nsubmissions; prefix-based lists are more practical for exploratory work.\n\nThe **2015 ICD-9-to-ICD-10 transition** is the most consequential structural break in US claims\nresearch. Studies with data spanning October 1, 2015 must either restrict to a post-transition\nperiod, restrict to a pre-transition period, or use GEMs crosswalks to harmonize codes across the\nbreak while acknowledging GEMs imprecision as a source of misclassification. Trend studies of\nincidence or prevalence that include 2014 and 2016 data will show apparent code-driven discontinuities\nthat have nothing to do with the underlying clinical reality. 
Some conditions gained specificity\n(e.g., lateralized fractures), some lost apparent specificity due to code reorganization, and some\nrare conditions got new dedicated codes that will appear to spike from zero in 2015.\n\n**Pros, cons, and trade-offs.**\n- **vs ICD-9-CM (for legacy data linkage):** ICD-10-CM has greater clinical granularity (laterality,\n  episode type, ~5x more codes), is the only option for claims post-2015, and aligns with international\n  coding. The cost is the 2015 break in trend continuity and the need for GEMs crosswalks in multi-era\n  studies. **Prefer ICD-10-CM** for all post-2015 research; use GEMs with explicit sensitivity analyses\n  to bridge the transition for longitudinal studies.\n- **vs SNOMED CT (in EHR phenotyping):** ICD-10-CM is billing-optimized, reimbursement-driven, and\n  may over- or under-code relative to clinical truth (rule-out codes, upcoding). SNOMED CT is\n  clinically precise, hierarchy-rich, and designed for EHR documentation but does not appear on US\n  claims. In linked claims-EHR datasets, SNOMED provides the gold-standard clinical label while\n  ICD-10-CM provides population-scale coverage. **Prefer ICD-10-CM** when working with claims at\n  scale; **prefer SNOMED** when clinical precision in EHR data is the priority.\n- **vs CPT / ICD-10-PCS for procedure capture:** ICD-10-CM encodes diagnoses only; procedures on\n  inpatient claims use ICD-10-PCS, and procedures on professional claims use CPT/HCPCS Level II.\n  Mixing procedure systems is a common error; always confirm which claim type and which code field\n  you are querying.\n- **vs unstructured clinical notes (NLP):** Code-based phenotyping at scale is fast, reproducible,\n  and auditable; NLP adds sensitivity for conditions that are documented but not coded, and can\n  distinguish \"rule-out\" from confirmed diagnoses. The cost of NLP is computational complexity\n  and system-specific training. 
**Prefer ICD-10-CM code lists** as the primary phenotyping layer;\n  add NLP when positive predictive value of codes alone is known to be inadequate.\n\n**When to use.**\n- As the primary diagnosis identifier in any US claims-based RWE study (outcomes, covariates,\n  cohort-entry diagnoses, comorbidities, contraindications).\n- As the source vocabulary for phenotype algorithms in OMOP CDM, where ICD-10-CM codes map to\n  SNOMED standard concepts via the ETL.\n- For constructing Elixhauser or Charlson comorbidity indices from administrative data (use the\n  Quan 2005 ICD-10 adaptation or the updated Quan 2011 weights).\n- When building code lists for multi-database studies on OMOP or Sentinel: define the ICD-10-CM\n  codes in flat format, document the fiscal year version, and test against the NCHS tabular list.\n- For any analysis spanning 2015 onward on US payers (commercial, Medicare, Medicaid).\n\n**When NOT to use — and when ICD-10-CM coding is actively misleading or dangerous.**\n- **As a direct proxy for clinical confirmation.** A diagnosis code means a clinician (or their\n  coder) billed for that diagnosis — not that the diagnosis was verified by lab, imaging, or\n  chart review. Rule-out codes (chest pain evaluated for MI), screening codes, and historical\n  codes can appear in any diagnosis position. Pre-specify position (principal only vs any) and\n  validate positive predictive value in the target population.\n- **For procedure identification.** ICD-10-CM is a diagnosis system; using it to find surgical\n  procedures or imaging studies will fail — use ICD-10-PCS (inpatient facility claims) or CPT/\n  HCPCS Level II (professional/outpatient claims).\n- **Across the 2015 ICD-9/ICD-10 transition without harmonization.** Trend studies using the same\n  code list on both sides of October 1, 2015 will see artifactual breaks driven by coding change,\n  not epidemiologic reality. 
GEMs crosswalks are required, and their imprecision must be acknowledged.\n- **Without version-locking the code list.** Annual updates add and retire codes. A phenotype that\n  was valid for the FY2019 code set may silently under-count when applied to FY2024 data if new\n  codes were added under the same category. Lock code lists to a specific fiscal year and re-audit\n  when updating.\n- **When Medicare Advantage enrollees' codes cannot be validated.** HCC risk-adjustment incentives\n  in Medicare Advantage drive higher coding intensity than fee-for-service, so code-based\n  comorbidity counts will be systematically elevated in MA patients — a confounder for any\n  code-count covariate. Sensitivity analyses stratified by plan type are essential.\n- **For non-US data.** ICD-10-CM is a US-specific modification. International studies use national\n  variants (ICD-10-CA in Canada, ICD-10-AM in Australia, ICD-10-GM in Germany) that differ in\n  code structure, extension logic, and update cycles.\n\n**Data-source operational depth.**\n- **Medicare FFS claims:** Diagnosis codes appear in the MedPAR principal diagnosis field and\n  secondary diagnosis fields (up to 25 additional), on carrier/professional claims (up to 12\n  diagnosis fields), and on outpatient facility claims. Always confirm the fiscal-year code version\n  against the claim service date. Part A covers inpatient, skilled nursing, and hospice; Part B\n  covers outpatient and professional. Diagnosis position on the professional claim (pointer field)\n  indicates which diagnosis justifies each service line.\n- **Commercial claims (MarketScan, Optum, etc.):** Same UB-04 and CMS-1500 field structure.\n  Benefit design affects which services generate claims: carved-out behavioral health or specialty\n  pharmacy may have no claim record. 
Diagnosis codes reflect the treating provider's coding\n  practice, which varies by specialty and region.\n- **EHR:** ICD-10-CM codes appear on encounter diagnoses (billing), problem lists (active/chronic\n  conditions), and referral orders. Problem-list entries may be historical and not encounter-driven.\n  NLP on clinical notes can supplement or correct coded diagnoses. The OMOP ETL maps source\n  ICD-10-CM to SNOMED standard concepts; always confirm mapping completeness for your target\n  conditions.\n- **Registry / linked data:** Disease registries typically adjudicate diagnoses independently of\n  ICD-10-CM codes, but registry submissions often include the billing code as a data element.\n  Linking registry diagnoses to claims-based ICD-10-CM code lists requires reconciling registry\n  inclusion criteria with code-based algorithms — they will not be identical.",
    "primary_category": "Data_Standard",
    "tags": [
      "coding-system",
      "data-standard",
      "primitive",
      "icd-10",
      "diagnosis-codes",
      "claims",
      "phenotyping",
      "rwe-infrastructure",
      "billing-codes",
      "nchs",
      "cms",
      "hipaa",
      "code-transition"
    ],
    "applies_to_study_types": [
      "claims_analysis",
      "ehr_study",
      "cohort_retrospective",
      "new_user",
      "active_comparator_new_user",
      "target_trial_emulation",
      "multi_database",
      "systematic_review"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.3174/ajnr.A4696",
        "url": "https://doi.org/10.3174/ajnr.A4696",
        "citation_text": "Hirsch JA, Nicola G, McGinty G, et al. ICD-10: History and Context. AJNR American Journal of Neuroradiology. 2016;37(4):596-599.",
        "year": 2016,
        "authors_short": "Hirsch et al.",
        "notes": "Concise history and context of the ICD-10 system, the US adoption timeline, and the clinical rationale for the transition from ICD-9-CM; published in a specialty journal at the moment of the 2015-10-01 US implementation."
      },
      {
        "role": "explain",
        "doi": "10.1097/01.mlr.0000182534.19832.83",
        "url": "https://doi.org/10.1097/01.mlr.0000182534.19832.83",
        "citation_text": "Quan H, Sundararajan V, Halfon P, et al. Coding algorithms for defining comorbidities in ICD-9-CM and ICD-10 administrative data. Medical Care. 2005;43(11):1130-1139.",
        "year": 2005,
        "authors_short": "Quan et al.",
        "notes": "Foundational methods paper that translated the Elixhauser comorbidity coding algorithms from ICD-9-CM to ICD-10 administrative data, demonstrating the practical challenges and structural differences between the two code systems for RWE phenotyping; the Quan ICD-10 algorithms remain the standard comorbidity coding reference for claims-based research."
      },
      {
        "role": "use",
        "doi": "10.1371/journal.pmed.1001885",
        "url": "https://doi.org/10.1371/journal.pmed.1001885",
        "citation_text": "Benchimol EI, Smeeth L, Guttmann A, et al. The REporting of studies Conducted using Observational Routinely-collected health Data (RECORD) Statement. PLOS Medicine. 2015;12(10):e1001885.",
        "year": 2015,
        "authors_short": "Benchimol et al.",
        "notes": "Reporting standard for studies using routinely collected health data, including administrative claims; requires explicit documentation of the diagnosis coding system version, code lists, and the rationale for phenotype definitions -- the operational layer where ICD-10-CM version-locking and position specification must be reported."
      },
      {
        "role": "use",
        "doi": null,
        "url": "https://www.cms.gov/medicare/coding-billing/icd-10-codes",
        "citation_text": "Centers for Medicare and Medicaid Services. ICD-10 Code Sets. CMS.gov. Accessed 2026.",
        "year": 2026,
        "authors_short": "CMS",
        "notes": "Official CMS landing page for ICD-10-CM and ICD-10-PCS code files, GEMs crosswalk downloads, annual update notices, and implementation guidance for HIPAA-covered transactions; the authoritative source for current code-set downloads and transition documentation."
      },
      {
        "role": "use",
        "doi": null,
        "url": "https://www.cdc.gov/nchs/icd/icd-10-cm/index.html",
        "citation_text": "National Center for Health Statistics. ICD-10-CM. CDC NCHS. Accessed 2026.",
        "year": 2026,
        "authors_short": "NCHS/CDC",
        "notes": "NCHS official ICD-10-CM page with annual tabular lists, alphabetic indexes, coding guidelines, addenda, and the FY update schedule; the primary source for the code set content that CMS distributes for HIPAA compliance."
      }
    ],
    "plain_language_summary": "ICD-10-CM is the standardized code system that US doctors and hospitals use to label every diagnosis on an insurance claim or hospital record -- each condition gets a unique alphanumeric code, like I50.9 for unspecified heart failure. Researchers studying health outcomes use these codes to identify which patients have a disease, because codes appear consistently across millions of records nationwide. The system replaced an older version called ICD-9-CM on October 1, 2015, so studies that span that date must account for the change in code format. A key practical rule is that codes in claims data usually appear without a decimal point, so \"I509\" in the data file means I50.9 in the official code book.",
    "key_terms": [
      {
        "term": "billable code",
        "definition": "A diagnosis code that has reached its most specific level and can legally appear on a claim; three-character category codes like I50 are header codes, not billable, and will be rejected by a payer."
      },
      {
        "term": "flat format",
        "definition": "The way diagnosis codes are stored in most US claims files, with the decimal point removed; \"I509\" in the data represents the published code I50.9."
      },
      {
        "term": "category code",
        "definition": "The three-character root of an ICD-10-CM code (for example, M20) that groups related conditions; it is a header and is not itself billable or submittable on a claim."
      },
      {
        "term": "seventh-character extension",
        "definition": "A required letter at the end of certain ICD-10-CM codes that records whether the patient is at an initial encounter (A), a follow-up encounter (D), or experiencing a late effect or sequela (S) of the condition."
      },
      {
        "term": "GEMs crosswalk",
        "definition": "The General Equivalence Mappings published by CMS and NCHS that translate between ICD-9-CM and ICD-10-CM codes; the mappings are approximate and some conditions map to multiple codes in the other system."
      },
      {
        "term": "fiscal-year code version",
        "definition": "The annual edition of the ICD-10-CM code set, effective each October 1, that adds new codes and retires old ones; a phenotype code list must be verified against the version that was active when the claims were submitted."
      }
    ],
    "worked_example": {
      "scenario": "A researcher is building a claims-based cohort of patients with rheumatoid arthritis (RA) using commercial insurance data from 2022. She obtains the NCHS FY2022 ICD-10-CM tabular list, identifies the RA category M05 (seropositive RA) and M06 (other RA), and needs to confirm that the flat codes in her claims file match the expected billable codes -- and that she is not accidentally including non-billable header codes. The table below shows a small excerpt of claims rows and the codes as they appear in the raw data field.\n",
      "dataset": {
        "caption": "Sample medical claim rows from a commercial database showing the diagnosis code field (DX1) in flat format (no decimal), claim type, and service date. The researcher must verify which rows represent billable RA codes versus header codes.\n",
        "columns": [
          "claim_id",
          "person_id",
          "service_date",
          "claim_type",
          "DX1_flat",
          "DX1_decoded",
          "billable"
        ],
        "rows": [
          [
            "C001",
            1001,
            "2022-03-15",
            "professional",
            "M0500",
            "M05.00",
            "yes"
          ],
          [
            "C002",
            1001,
            "2022-06-10",
            "professional",
            "M0500",
            "M05.00",
            "yes"
          ],
          [
            "C003",
            1002,
            "2022-04-01",
            "professional",
            "M05",
            "M05 (category header)",
            "no"
          ],
          [
            "C004",
            1003,
            "2022-09-20",
            "professional",
            "M0610",
            "M06.10",
            "yes"
          ],
          [
            "C005",
            1003,
            "2022-11-05",
            "professional",
            "M0610",
            "M06.10",
            "yes"
          ]
        ]
      },
      "steps": [
        "Insert the decimal back into each flat code to verify it against the NCHS tabular list: M0500 becomes M05.00 (seropositive RA, unspecified site) -- a valid billable code. M05 alone (claim C003) is the three-character category header and is not billable.",
        "Apply the standard 1-inpatient-or-2-outpatient phenotype rule: person 1001 has two professional claims with M05.00 on different service dates (2022-03-15 and 2022-06-10), so they qualify as an RA case. Person 1002 has only one claim and it carries a non-billable header code, so they do NOT qualify. Person 1003 has two professional claims with billable M06.10 on different dates (2022-09-20 and 2022-11-05), so they qualify.",
        "Count qualifying RA cases using only billable codes and the 2-outpatient rule: 2 cases qualify (persons 1001 and 1003); 1 person does not qualify (person 1002 has only a non-billable header code). Cases identified = 2 out of 3 persons screened.",
        "Confirm prefix-match coverage: a startswith filter on flat prefix 'M05' in the DX1_flat field would match C001, C002, and C003; a filter on exact billable codes ['M0500', 'M0610', ...] would match C001, C002, C004, C005 but not C003. The exact-match approach correctly excludes the non-billable header while the prefix match includes it -- use exact lists for phenotypes submitted to regulators."
      ],
      "result": "Cases = 2 out of 3 persons screened: persons 1001 and 1003 each have 2 professional claims with distinct billable ICD-10-CM RA codes on different service dates. Person 1002 is excluded because their single claim carries a non-billable category header (M05) that would be rejected by a payer and does not constitute confirmed coding."
    },
    "prerequisites": [],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Exact-match billable-code list (regulatory / reproducible)",
        "description": "The phenotype is defined as a pre-specified, version-dated list of fully billable ICD-10-CM codes (no category headers, no wildcards). Each code is validated against the NCHS tabular list for the target fiscal year. This approach is maximally reproducible, audit-ready, and will not silently expand if new codes are added under a category in a future update.",
        "edge_cases": [
          "A code added in a later fiscal year will not be captured unless the list is re-audited and updated; studies spanning multiple fiscal years may have differential capture.",
          "ICD-10-CM codes with X placeholder characters (e.g., T14.91XA) must be stored with the X in the flat format or they will not match."
        ],
        "data_source_notes": "claims: use the DX1 through DXn fields in flat format; confirm that your ETL preserved the full 7-character code including trailing X placeholders and seventh-character extensions. EHR/OMOP: source codes are mapped to SNOMED standard concepts; re-run the ICD-10-CM list through the concept_relationship table to identify all OMOP concept IDs."
      },
      {
        "name": "Prefix (startswith) filter on category or subcategory",
        "description": "All codes whose flat representation begins with a specified 3-7 character prefix are included. A prefix of \"I50\" captures all heart failure codes; \"I500\" narrows to subcategory I50.0. This approach is forward-compatible (new codes added under the prefix are automatically included) and easier to write, but it may include non-billable headers if applied to a three-character prefix alone and will capture future codes the researcher has not reviewed.",
        "edge_cases": [
          "A three-character prefix applied to a field that stores only billable codes is safe; applied to a field that may contain header submissions (some EHR source data), it will match the invalid header.",
          "Prefix searches do not automatically exclude the X placeholder positions -- \"S52X\" as a prefix would not match S52.001A. Always test prefix logic against the full tabular list."
        ],
        "data_source_notes": "claims: most relational databases allow LIKE 'I50%' or Python str.startswith('I50') on the flat-format field. Confirm the field does not left-pad or right-pad codes with spaces."
      },
      {
        "name": "GEMs-crosswalked multi-era phenotype (ICD-9-CM + ICD-10-CM)",
        "description": "For studies spanning the 2015-10-01 transition, ICD-9-CM codes are mapped to ICD-10-CM (or vice versa) using the CMS/NCHS General Equivalence Mappings. GEMs provide forward maps (ICD-9-to-10) and backward maps (ICD-10-to-9). Many codes have approximate, one-to-many, or no mappings. The recommended approach is to apply era-appropriate code lists and report sensitivity analyses at the transition boundary.",
        "edge_cases": [
          "Some ICD-9-CM codes have no equivalent in ICD-10-CM (genuine clinical distinction added); some ICD-10-CM codes have no ICD-9-CM predecessor (new conditions, new granularity). The GEMs flag these as approximate or no-map.",
          "Condition prevalence may appear to change at the 2015 boundary due to coding granularity, not disease epidemiology; this can be tested with negative-control conditions that should not change."
        ],
        "data_source_notes": "claims: the ICD version is identifiable from the service date (pre/post 2015-10-01); some payers also include an ICD version indicator field. OMOP: source codes are preserved in visit_occurrence and condition_occurrence alongside their standard SNOMED mapping, so ICD-version queries can be restricted by source_concept vocabulary_id."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "icd-9-cm-legacy-coding",
        "pros_of_this": "Greater granularity (~70k vs ~14k codes), laterality encoding, encounter-type extensions (A/D/S), and alignment with contemporary WHO ICD-10; the only option for US claims from October 2015 onward.",
        "cons_of_this": "Creates a structural break in trend studies spanning the 2015 transition; GEMs crosswalks are approximate; longer codes increase storage and matching complexity.",
        "when_to_prefer": "Use ICD-10-CM for all research using data from October 1, 2015 onward; use GEMs with explicit sensitivity analyses for longitudinal studies crossing the transition date."
      },
      {
        "compared_to": "snomed-ct-terminology",
        "pros_of_this": "ICD-10-CM codes appear directly on US claims and are universally available in administrative data without an EHR linkage or NLP step; the code set is stable and auditable.",
        "cons_of_this": "SNOMED CT has a richer clinical hierarchy, distinguishes confirmed from suspected diagnoses, and is the standard terminology in EHR clinical decision support; ICD-10-CM is billing-driven and cannot distinguish rule-out codes from confirmed diagnoses at the code level.",
        "when_to_prefer": "Use ICD-10-CM for population-scale claims phenotyping; use SNOMED CT for EHR-based phenotyping where clinical precision is required; use the SNOMED-to-ICD-10-CM mapping for cross-system harmonization."
      },
      {
        "compared_to": "omop-concept-set-development-rwe",
        "pros_of_this": "Raw ICD-10-CM code lists are directly interpretable without a CDM transformation and can be applied to any claims file regardless of whether it has been converted to OMOP.",
        "cons_of_this": "OMOP concept sets leverage the SNOMED hierarchy to capture all descendants of a concept with a single ancestor node, which is more robust to code versioning than flat ICD-10-CM lists; OMOP also harmonizes across international code systems.",
        "when_to_prefer": "Use raw ICD-10-CM lists for single-database claims analysis; use OMOP concept sets for multi-database or international distributed network studies where harmonization is required."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Store and filter codes in flat format (no decimal). Always confirm that your ETL preserved the full code including X placeholders and seventh-character extensions. Lock the phenotype code list to the fiscal-year version active on the service date. Distinguish principal diagnosis (DX1 or principal_dx) from secondary diagnosis fields when specifying position. On institutional claims, use the discharge date for the inpatient arm of the 1IP/2OP algorithm. On professional claims, use the service date. Exclude non-billable three-character headers from exact-match lists; include them only if deliberately using prefix matching. Document ICD version (9 vs 10) based on the service date field: pre-2015-10-01 data uses ICD-9-CM.",
      "ehr": "In OMOP CDM, ICD-10-CM source codes appear in condition_occurrence.source_concept_id (mapped via vocabulary_id = 'ICD10CM'); standard concepts are SNOMED (standard_concept = 'S'). Build phenotypes on SNOMED ancestors for portability, but validate that the ETL mapping is complete for your target conditions. In raw EHR extracts, diagnoses may appear as encounter diagnoses (billing) or problem-list entries (may be historical); specify which source you are querying.",
      "registry": "Registry diagnoses are typically adjudicated independently; confirm whether the registry data element stores ICD-10-CM codes or registry-specific classification. For linkage to claims, reconcile registry inclusion criteria against the ICD-10-CM code list -- they will not be identical and the difference is itself a sensitivity analysis.",
      "linked": "In linked claims-EHR datasets, confirm that ICD-10-CM codes from the claims side are not overwritten by EHR encounter diagnosis codes during linkage. Both should be preserved as separate fields. The OMOP ETL preserves source codes alongside standard concept IDs, enabling source-level validation after standardization."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import re\nimport pandas as pd\n\n# ICD-10-CM code format: letter + 2 digits + optional decimal + up to 4 more characters\n# Flat format (claims): letter + 2 digits + up to 4 chars, NO decimal, up to 7 chars total\nICD10CM_FLAT_PATTERN = re.compile(r'^[A-Z]\\d{2}[A-Z0-9]{0,4}$')\nICD10CM_DECIMAL_PATTERN = re.compile(r'^[A-Z]\\d{2}(\\.[A-Z0-9]{1,4})?$')\n\ndef to_flat(code: str) -> str:\n    \"\"\"Remove decimal point from a decoded ICD-10-CM code to get claims-storage format.\"\"\"\n    return code.strip().upper().replace('.', '')\n\ndef to_decimal(flat: str) -> str:\n    \"\"\"Insert decimal after the third character of a flat ICD-10-CM code.\n\n    Note: only valid for codes with >3 characters. Three-character codes\n    are category headers (non-billable) and have no decimal in the published tabular list.\n    \"\"\"\n    flat = flat.strip().upper()\n    if len(flat) <= 3:\n        return flat  # return as-is; do NOT add a decimal to a category header\n    return flat[:3] + '.' 
+ flat[3:]\n\ndef is_valid_flat(code: str) -> bool:\n    \"\"\"Return True if code matches the ICD-10-CM flat format (1 letter + 2 digits + up to 4 chars).\"\"\"\n    return bool(ICD10CM_FLAT_PATTERN.match(code.strip().upper()))\n\ndef is_billable_flat(code: str, billable_set: set) -> bool:\n    \"\"\"Return True if the flat code is in the pre-specified billable code set.\n\n    The billable_set should contain only codes at their highest specificity,\n    obtained from the NCHS tabular list for the target fiscal year.\n    Non-billable three-character headers (e.g., 'I50', 'M05') must NOT be in billable_set.\n    \"\"\"\n    return code.strip().upper() in billable_set\n\ndef filter_by_exact_list(df: pd.DataFrame,\n                          dx_col: str,\n                          billable_codes: set,\n                          strip_whitespace: bool = True) -> pd.DataFrame:\n    \"\"\"Filter a claims DataFrame to rows where dx_col is in the exact billable code set.\n\n    Pitfall: claims ETLs sometimes pad codes with trailing spaces. strip_whitespace=True\n    mitigates this. The billable_codes set should use the same flat format as the data column.\n    \"\"\"\n    col = df[dx_col].str.strip().str.upper() if strip_whitespace else df[dx_col].str.upper()\n    return df[col.isin(billable_codes)].copy()\n\ndef filter_by_prefix(df: pd.DataFrame,\n                     dx_col: str,\n                     prefixes: list,\n                     strip_whitespace: bool = True) -> pd.DataFrame:\n    \"\"\"Filter claims to rows where dx_col starts with any of the given flat-format prefixes.\n\n    Pitfall: a 3-character prefix like 'I50' will match the non-billable header 'I50'\n    if it appears in the data. Validate that your data source only stores billable codes,\n    OR use exact lists for regulatory submissions. 
A prefix like 'I502' targets\n    subcategory I50.2 (systolic heart failure) and all of its billable descendants.\n    \"\"\"\n    col = df[dx_col].str.strip().str.upper() if strip_whitespace else df[dx_col].str.upper()\n    mask = col.apply(lambda c: any(c.startswith(p.upper()) for p in prefixes))\n    return df[mask].copy()\n\n# --- Example usage ---\n# Heart failure phenotype (FY2022 subset for illustration)\nHF_BILLABLE = {\n    'I501', 'I5020', 'I5021', 'I5022', 'I5030', 'I5031', 'I5032',\n    'I5040', 'I5041', 'I5042', 'I5043', 'I50810', 'I50811', 'I50812',\n    'I50813', 'I50814', 'I5082', 'I5083', 'I5084', 'I5089', 'I509'\n}\n\nclaims = pd.DataFrame({\n    'claim_id': ['C01', 'C02', 'C03', 'C04'],\n    'person_id': [1001, 1001, 1002, 1003],\n    'DX1': ['I509 ', 'I509', 'I50', 'I5022'],  # C01 has trailing space; C03 is non-billable header\n    'service_date': ['2022-03-01', '2022-06-15', '2022-04-01', '2022-07-10']\n})\n\nhf_exact = filter_by_exact_list(claims, 'DX1', HF_BILLABLE)\n# Result: C01 (I509, after strip), C02 (I509), C04 (I5022) -- C03 (I50 header) excluded\n# person_ids with >= 2 qualifying claims on different dates: 1001 (C01 + C02) -> HF case\n\nhf_prefix = filter_by_prefix(claims, 'DX1', ['I50'])\n# Result: C01, C02, C03 (the header!), C04 -- prefix matching picks up the non-billable header\n# Demonstrates why exact-match is preferred for regulatory submissions\n\nprint(f\"Exact-match HF claims: {len(hf_exact)}\")   # 3 claims\nprint(f\"Prefix-match HF claims: {len(hf_prefix)}\")  # 4 claims (includes non-billable header)",
        "description": "Validate ICD-10-CM code format, convert between flat and decimal representations, and safely filter a claims DataFrame using an exact billable-code list or a category prefix. Pitfalls covered: non-billable three-character headers, X placeholders, missing seventh-character extensions, and silent non-match due to whitespace padding in the data field.",
        "dependencies": [
          "pandas",
          "re"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(dplyr)\nlibrary(stringr)\n\n# ---- Format validation helpers ----\nicd10cm_flat_pattern <- \"^[A-Z]\\\\d{2}[A-Z0-9]{0,4}$\"\nicd9cm_flat_pattern  <- \"^\\\\d{3}[0-9A-Z]{0,2}$|^[VEve]\\\\d{2}[0-9A-Z]{0,2}$\"\n\nis_valid_flat_icd10 <- function(code) {\n  str_detect(str_to_upper(str_trim(code)), icd10cm_flat_pattern)\n}\n\n# ---- Flat <-> decimal conversion ----\nto_decimal <- function(flat_code) {\n  # For codes longer than 3 characters, insert decimal after position 3.\n  # For 3-character category headers, return as-is (they have no decimal in the tabular list).\n  flat_code <- str_to_upper(str_trim(flat_code))\n  ifelse(nchar(flat_code) > 3,\n         paste0(substr(flat_code, 1, 3), \".\", substr(flat_code, 4, nchar(flat_code))),\n         flat_code)\n}\n\nto_flat <- function(decimal_code) {\n  str_to_upper(str_remove_all(str_trim(decimal_code), \"\\\\.\"))\n}\n\n# ---- ICD version detection from service date ----\n# US claims: ICD-9-CM used before 2015-10-01; ICD-10-CM from 2015-10-01 onward\nicd_version <- function(service_date) {\n  # service_date: Date or character in ISO format\n  cutoff <- as.Date(\"2015-10-01\")\n  ifelse(as.Date(service_date) < cutoff, \"ICD-9-CM\", \"ICD-10-CM\")\n}\n\n# ---- Exact-match filter (preferred for regulatory submissions) ----\nfilter_exact <- function(df, dx_col, billable_codes) {\n  # billable_codes: character vector of flat-format codes at highest specificity\n  # Pitfall: strip whitespace and upper-case before matching to avoid ETL padding issues\n  df %>%\n    filter(str_to_upper(str_trim(.data[[dx_col]])) %in% str_to_upper(billable_codes))\n}\n\n# ---- Prefix (startswith) filter (use with caution -- see notes in description) ----\nfilter_prefix <- function(df, dx_col, prefixes) {\n  # Pitfall: a 3-char prefix matches non-billable headers if they appear in the data.\n  # Prefer 4+ char prefixes to stay within a subcategory.\n  pattern <- paste0(\"^(\", paste(str_to_upper(prefixes), collapse = 
\"|\"), \")\")\n  df %>%\n    filter(str_detect(str_to_upper(str_trim(.data[[dx_col]])), pattern))\n}\n\n# ---- Example usage ----\n# Rheumatoid arthritis phenotype (seropositive M05, other M06, select subcategories)\nra_codes <- c(\n  \"M0500\", \"M0501\", \"M0502\", \"M0503\", \"M0504\", \"M0505\", \"M0506\", \"M0509\",\n  \"M0510\", \"M0511\", \"M0512\", \"M0513\", \"M0514\", \"M0515\", \"M0516\", \"M0519\",\n  \"M0600\", \"M0601\", \"M0602\", \"M0603\", \"M0604\", \"M0605\", \"M0606\", \"M0609\",\n  \"M0610\", \"M0611\", \"M0612\", \"M0613\", \"M0614\", \"M0615\", \"M0616\", \"M0619\"\n)\n\nclaims <- data.frame(\n  claim_id = c(\"C001\", \"C002\", \"C003\", \"C004\", \"C005\"),\n  person_id = c(1001L, 1001L, 1002L, 1003L, 1003L),\n  DX1 = c(\"M0500\", \"M0500\", \"M05\", \"M0610\", \"M0610\"),  # C003 is non-billable header\n  service_date = as.Date(c(\"2022-03-15\", \"2022-06-10\", \"2022-04-01\", \"2022-09-20\", \"2022-11-05\")),\n  stringsAsFactors = FALSE\n)\n\n# Check ICD version (all post-2015 -> ICD-10-CM)\nclaims$icd_version <- icd_version(claims$service_date)\n\n# Exact-match filter (excludes the non-billable header \"M05\")\nra_exact <- filter_exact(claims, \"DX1\", ra_codes)\n# -> C001, C002, C004, C005 (4 claims; C003 with \"M05\" header excluded)\n\n# 2-outpatient case-finding (two qualifying claims on different dates)\nra_cases <- ra_exact %>%\n  group_by(person_id) %>%\n  summarise(\n    n_claims = n(),\n    n_dates  = n_distinct(service_date),\n    qualifies = n_dates >= 2,\n    .groups = \"drop\"\n  ) %>%\n  filter(qualifies)\n# Result: person 1001 (2 dates), person 1003 (2 dates) -> 2 RA cases\n\ncat(\"RA cases identified:\", nrow(ra_cases), \"\\n\")  # 2\ncat(\"Codes validated as flat ICD-10-CM:\\n\")\nprint(sapply(ra_codes[1:4], is_valid_flat_icd10))   # all TRUE\ncat(\"Category header 'M05' valid as billable flat code:\", is_valid_flat_icd10(\"M05\"), \"\\n\") # FALSE",
        "description": "Validate ICD-10-CM codes, convert flat to decimal format, and filter a claims data frame using an exact billable-code list or prefix matching. Pitfalls covered: non-billable category headers, trailing whitespace from ETL padding, case sensitivity in matching, and the GEMs version-identification pattern using service date to determine ICD-9 vs ICD-10.",
        "dependencies": [
          "dplyr",
          "stringr"
        ],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [],
    "relations": [
      {
        "relation_type": "is_variant_of",
        "target_slug": "icd-9-cm-legacy-coding",
        "notes": "ICD-10-CM replaced ICD-9-CM for US claims on 2015-10-01; studies crossing this date must use GEMs crosswalks or restrict to a single code era."
      },
      {
        "relation_type": "see_also",
        "target_slug": "icd-10-pcs-procedure-coding",
        "notes": "ICD-10-PCS is the companion inpatient procedure coding system maintained by CMS; it is wholly separate from ICD-10-CM and encodes surgical and diagnostic procedures on institutional (UB-04) claims, not diagnoses."
      },
      {
        "relation_type": "see_also",
        "target_slug": "diagnosis-position-and-qualifiers",
        "notes": "ICD-10-CM codes appear in principal and secondary diagnosis positions on claims; the position (principal vs any) materially affects phenotype specificity and must be pre-specified in every algorithm using ICD-10-CM codes."
      },
      {
        "relation_type": "used_with",
        "target_slug": "diagnosis-phenotype-algorithm-1ip-2op-time-window-rwe",
        "notes": "The 1-inpatient-or-2-outpatient phenotype algorithm is the workhorse application of ICD-10-CM codes in RWE; the ICD-10-CM code list is the input, and position, window, and washout rules determine whether the coded evidence constitutes a case."
      },
      {
        "relation_type": "used_with",
        "target_slug": "claims-analysis",
        "notes": "ICD-10-CM is the primary diagnosis coding substrate for all US administrative claims analysis; every claims-based phenotype, covariate, and outcome algorithm is built on ICD-10-CM code lists (for data from October 2015 onward)."
      },
      {
        "relation_type": "used_with",
        "target_slug": "omop-concept-set-development-rwe",
        "notes": "In OMOP CDM, ICD-10-CM codes are source vocabulary entries that the ETL maps to SNOMED standard concepts; concept sets in OMOP ATLAS operate on SNOMED ancestors, so ICD-10-CM code lists must be translated through the concept_relationship table to build OMOP-compatible phenotypes."
      },
      {
        "relation_type": "used_with",
        "target_slug": "claims-outcome-algorithm-ppv-sensitivity-rwe",
        "notes": "Phenotypes built from ICD-10-CM codes have condition-specific positive predictive value and sensitivity that depend on code position, window, and data source; validation against a reference standard is required before treating ICD-10-CM-based case-finding as clinical truth."
      },
      {
        "relation_type": "see_also",
        "target_slug": "medical-code-crosswalks-mappings",
        "notes": "GEMs crosswalks map between ICD-9-CM and ICD-10-CM; the SNOMED-to-ICD-10-CM map and OMOP vocabulary concept_relationship table enable cross-system phenotype harmonization."
      }
    ],
    "aliases": [
      "ICD-10",
      "ICD-10-CM",
      "ICD10",
      "ICD-10 diagnosis codes",
      "International Classification of Diseases 10th Revision Clinical Modification"
    ],
    "complexity": "foundational",
    "regulatory_relevance": [
      "fda",
      "cms",
      "hipaa"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "icd-10-pcs-procedure-coding",
    "name": "ICD-10-PCS Inpatient Procedure Codes",
    "short_definition": "A seven-character alphanumeric coding system maintained by CMS for reporting surgical and therapeutic procedures performed during a hospital inpatient stay. Every character encodes a specific clinical axis (section, body system, root operation, body part, approach, device, qualifier), so codes are constructed from a hierarchical table rather than selected from a flat list — making the system multi-axial, fully extensible, and the standard source for inpatient procedure data in US administrative claims.",
    "long_description": "**ICD-10-PCS** (International Classification of Diseases, 10th Revision, Procedure Coding System)\nis the US standard for reporting inpatient hospital procedures. It replaced ICD-9-CM Volume 3 on\nOctober 1, 2015. CMS maintains the system and releases annual updates each October 1. The code set\nis in the public domain. As of recent fiscal years the system contains approximately 78,000+ valid\ncodes; the exact count grows with each annual update as new procedures, devices, and qualifiers are\nincorporated.\n\n**The seven-axis architecture (Medical and Surgical section).** Every ICD-10-PCS code is exactly\nseven characters long — alphanumeric, using digits 0-9 and letters A-Z with I and O excluded to\nprevent confusion with 1 and 0. Each character position encodes one and only one clinical axis:\n\n- **Position 1 — Section**: The broad type of procedure. Section 0 is Medical and Surgical (the\n  dominant section for RWE), covering open and minimally invasive operative procedures. Other sections\n  include Obstetrics (1), Placement (2), Administration (3), Measurement and Monitoring (4), Imaging\n  (B), Nuclear Medicine (C), Radiation Therapy (D), Mental Health (G), Substance Abuse Treatment (H),\n  New Technology (X), and several others. Section X (New Technology) is clinically significant: new\n  surgical approaches, devices, and biologics that do not fit existing tables are assigned here, and\n  RWE analysts studying novel procedures or implants should always include relevant Section X codes.\n\n- **Position 2 — Body System**: Within Medical and Surgical, the anatomical system involved. 
Values\n  include 0 Central Nervous System and Cranial Nerves, 1 Peripheral Nervous System, 2 Heart and Great\n  Vessels, 3 Upper Arteries, 4 Lower Arteries, 5 Upper Veins, 6 Lower Veins, 7 Lymphatic and Hemic,\n  8 Eye, 9 Ear Nose Sinus, B Respiratory, C Mouth and Throat, D Gastrointestinal, F Hepatobiliary\n  and Pancreas, G Endocrine, H Skin and Breast, J Subcutaneous Tissue and Fascia, K Muscles, L\n  Tendons, M Bursae and Ligaments, N Head and Facial Bones, P Upper Bones, Q Lower Bones, R Upper\n  Joints, S Lower Joints, T Urinary, U Female Reproductive, V Male Reproductive, W Anatomical\n  Regions (General), X Anatomical Regions (Upper Extremities), Y Anatomical Regions (Lower Extremities).\n\n- **Position 3 — Root Operation**: The objective of the procedure — this is the most analytically\n  consequential axis. Root operations have precise, non-intuitive definitions that often differ from\n  everyday clinical language. A few critical examples for RWE:\n  - **Resection (T)**: Cutting out or off, without replacement, ALL of a body part. A total knee\n    arthroplasty is coded as a Replacement, not Resection.\n  - **Excision (B)**: Cutting out or off, without replacement, a PORTION of a body part. Used for\n    partial resections, biopsies, debridements.\n  - **Replacement (R)**: Putting in or on a biological or synthetic substitute that physically takes\n    the place of all or a portion of a body part. Total knee replacement = Replacement; the body part\n    value specifies the joint; position 6 (Device) specifies the prosthesis type.\n  - **Repair (Q)**: Restoring, to the extent possible, a body part to its normal anatomic structure\n    and function — the default root operation when no other is more specific.\n  - **Bypass (1)**: Altering the route of passage of the contents of a tubular body part (e.g., coronary\n    artery bypass graft). 
The critical RWE issue: a \"bypass\" in ICD-10-PCS means specific anatomical\n    rerouting — not a general colloquial \"bypass.\"\n  - **Fusion (G)**: Joining together portions of an articular body part, rendering the articular body\n    part immobile. Spinal fusion codes rely on approach (position 5) and device (position 6), so an\n    analyst building a \"spinal fusion\" cohort must specify which variants are in scope (posterior,\n    anterior, interbody cage, and so on).\n  - **Other key root operations**: Dilation (7), Drainage (9), Extirpation (C — removing solid\n    matter), Fragmentation (F), Inspection (J), Occlusion (L), Reattachment (M), Release (N),\n    Removal (P — for removing devices), Reposition (S), Restriction\n    (V), Revision (W — correcting a malfunctioning device), Supplement (U), Transfer (X), Transplant (Y).\n\n- **Position 4 — Body Part**: The specific anatomical site operated upon, defined within the body\n  system. For Lower Joints (body system S), values include Lumbar Vertebral Joint (0), Lumbosacral\n  Joint (3), Sacrococcygeal Joint (5), Coccygeal Joint (6), Sacroiliac Joints (7/8), Hip Joints\n  (9/B), Knee Joints (C/D), Ankle Joints (F/G), Tarsal Joints (H/J), and so forth.\n\n- **Position 5 — Approach**: The technique used to reach the operative site. Open (0), Percutaneous\n  (3), Percutaneous Endoscopic (4), Via Natural or Artificial Opening (7), Via Natural or Artificial\n  Opening Endoscopic (8), External (X). Approach is an essential RWE dimension: a laparoscopic\n  versus open approach for the same procedure may have different safety profiles, lengths of stay,\n  and costs — and ICD-10-PCS distinguishes them explicitly in position 5.\n\n- **Position 6 — Device**: Any material or object left in or on the body at the end of the procedure.\n  No Device (Z) means nothing was left. 
Device values include drains, synthetic substitutes, autologous\n  tissue substitutes, bone grafts, implants, pacemakers, and many others. For joint replacements,\n  position 6 identifies the prosthesis material (metal-on-polyethylene, ceramic-on-ceramic, etc.).\n\n- **Position 7 — Qualifier**: An additional attribute that further specifies the procedure. Common\n  values: Diagnostic (X — as in a diagnostic excision, which is a biopsy), No Qualifier (Z), All\n  (N), Atrial (6), Ventricular (7), and many procedure-specific values. For joint replacements, the\n  qualifier records cemented versus uncemented fixation.\n\n**How codes are constructed, not selected.** Unlike ICD-10-CM (where codes are looked up in a tabular\nlist) or CPT (where codes are selected from a flat numbered list with hierarchical groupings), ICD-10-PCS\ncodes are built from a set of tables. Each table specifies one combination of section + body system +\nroot operation, and the coder reads across to pick one value per remaining position. This means the\nsystem is genuinely combinatorial: for a given root operation on a given body system, all clinically\nmeaningful combinations of body part, approach, device, and qualifier generate valid codes. There is no\nmaster list of 78,000 codes to memorize; coders and researchers navigate tables. RWE analysts who\nattempt to define a procedure cohort by building a \"list of codes\" without reading the applicable table\ndefinitions — especially the root operation definitions — risk systematic misclassification. The root\noperation is the place where ICD-10-PCS most often surprises clinicians and analysts who assume the code\ndescribes what was done rather than what was precisely intended.\n\n**Coding coverage and scope: inpatient facility only.** This is the single most important scoping\nconstraint for RWE. ICD-10-PCS codes appear only on UB-04/837I claims (hospital inpatient facility\nbills), specifically in the procedure code fields (FL 74: principal procedure code and date; FL 74a-74e:\nother procedure codes and dates). 
They are NEVER used on:\n\n- Physician/professional claims (CMS-1500/837P) — those use CPT and HCPCS\n- Hospital outpatient facility claims (also UB-04, but the procedure fields use CPT/HCPCS)\n- Ambulatory surgery center (ASC) claims — CPT/HCPCS\n- Part B administered drug/biologic claims — HCPCS J-codes\n\nThe result is a sharp population-of-care segmentation. Total knee arthroplasty performed as an inpatient\nadmission generates an ICD-10-PCS code on the facility claim. The same procedure performed at an\noutpatient surgery center generates CPT 27447 on the ASC facility claim and on the surgeon's professional\nclaim — no ICD-10-PCS. As the shift toward outpatient surgical care has accelerated, an algorithm using\nICD-10-PCS alone misses a growing share of real cases and retains only the sicker-appearing inpatient\nsubgroup. Any procedure definition relying exclusively on ICD-10-PCS will miss all outpatient facility\ncases and all professional claims, systematically under-counting and biasing the identified cohort toward\nlonger, costlier inpatient stays.\n\nComplete procedure ascertainment in US claims almost always requires a union of ICD-10-PCS (inpatient\nfacility) + CPT/HCPCS (physician and outpatient facility) + revenue center codes (outpatient facility\ntype-of-service confirmation). The relative contribution of ICD-10-PCS versus CPT changes over time as\nprocedure migration from inpatient to outpatient settings continues; analysts should report the fraction\ncaptured by each stream as a sensitivity diagnostic.\n\n**ICD-10-PCS vs. ICD-9-CM Volume 3.** The predecessor, ICD-9-CM Volume 3, used three- or four-digit numeric\ncodes (a two-digit category followed by one or two decimal digits). Its coverage was incomplete, its specificity was lower, and\nits hierarchical structure was not systematically multi-axial — procedure categorization was often\ninconsistent across anatomic areas. ICD-10-PCS brought standardized multi-axial logic, explicit approach\nand device coding, and a larger code space. 
The transition on 2015-10-01 created a coding discontinuity:\ntime-series analyses crossing October 2015 must account for the change, and ICD-9-to-ICD-10-PCS\ncrosswalks (GEMs — General Equivalence Mappings — provided by CMS) are imperfect because many ICD-9-CM\nVol 3 codes map to multiple ICD-10-PCS codes and vice versa. For any cohort or outcome algorithm that\nspans the transition date, the crosswalk mapping uncertainty must be quantified and reported.\n\n**Relationship to MS-DRGs.** Medicare Severity Diagnosis Related Groups (MS-DRGs) are assigned by the\nMedicare grouper software based on the combination of the principal diagnosis (ICD-10-CM), secondary\ndiagnoses, and procedures (ICD-10-PCS). The presence or absence of a \"surgical\" ICD-10-PCS code — and\nspecifically which major diagnostic category and surgical hierarchy it triggers — determines whether a\ndischarge is classified into a surgical MS-DRG versus a medical MS-DRG. Surgical MS-DRGs generally carry\nhigher payment weights. RWE analysts working with DRG-based cost data or severity-adjustment that uses DRG must\nunderstand that the ICD-10-PCS code drives this classification: an error in the PCS code can change the\nDRG assignment and distort cost comparisons.\n\n**Relationship to OMOP CDM.** In the Observational Medical Outcomes Partnership (OMOP) common data model,\nICD-10-PCS source codes are mapped to the Procedure domain using standard concepts from the SNOMED-CT\nprocedure hierarchy or the OMOP standard concept set for procedures. In the procedure_occurrence table,\nthe ICD-10-PCS character string is preserved in procedure_source_value and its source concept in\nprocedure_source_concept_id; procedure_concept_id holds the mapped standard concept. Analysts building\nprocedure cohorts in OMOP should query procedure_occurrence using procedure_concept_id unless they have\nspecific reason to query by source code, and should check the standard_concept flag in the vocabulary\nrelease in use rather than assuming the standard target is always a SNOMED concept. 
OMOP's concept mapping for ICD-10-PCS is generally good for\ncommon surgical procedures but may lag for new technology (Section X) codes that have been recently added\nto the ICD-10-PCS tables.\n\n**Pros, cons, and trade-offs** — specific and comparative.\n\n- **vs CPT procedure codes (professional and outpatient facility claims):** CPT is axis-free — each\n  code is a discrete concept with a single textual definition, hierarchically grouped but not\n  combinatorially constructed. CPT covers ALL sites of care for professional billing and outpatient\n  facility billing; ICD-10-PCS covers only inpatient facility billing. ICD-10-PCS provides explicit\n  approach, device, and qualifier axes that CPT must encode via add-on codes or modifiers; for\n  inpatient procedures, ICD-10-PCS is more granular. **Prefer the union of both** for any\n  comprehensive procedure definition. Never use ICD-10-PCS alone when the procedure population spans\n  inpatient and outpatient settings.\n\n- **vs ICD-9-CM Volume 3 (the predecessor for pre-October-2015 data):** ICD-9-CM Vol 3 is less\n  specific, less consistently structured, and uses a different character count (4 digits) with a\n  different hierarchy. For longitudinal analyses crossing October 2015, both systems are needed.\n  GEMs crosswalks exist but carry mapping uncertainty — the ICD-9 code for \"total knee arthroplasty\"\n  does not always map 1:1 to a single ICD-10-PCS code combination. **Prefer ICD-10-PCS for\n  post-2015 data**; when crossing the transition date, document the crosswalk approach, its\n  ambiguity, and validate against procedure counts before and after the transition.\n\n- **vs HCPCS Level II (J-codes and device/supply codes):** HCPCS Level II covers drugs administered\n  in clinical settings (J-codes), durable medical equipment, and some procedures not captured in\n  CPT. It appears on institutional outpatient and professional claims. 
It is not a procedure coding\n  system in the ICD-10-PCS sense — it does not describe the operative act. **Use HCPCS** when the\n  research question concerns drug administration (e.g., infused biologics) or device supply; use\n  ICD-10-PCS when the research question concerns the inpatient surgical act itself.\n\n- **vs revenue center codes (UB-04 revenue codes):** Revenue codes describe the type of service and\n  hospital cost center (e.g., 0360 = operating room, 0481 = cardiac catheterization lab), not the specific procedure.\n  They complement CPT/HCPCS for outpatient facility procedure identification and are often used to\n  confirm that a procedure was performed in the operating room as a specificity filter. **Use revenue\n  codes alongside** ICD-10-PCS and CPT/HCPCS; do not use them as the primary procedure identifier.\n\n**When to use.**\n- When building a procedure-based cohort from hospital inpatient claims or inpatient facility UB-04\n  data: any surgical or invasive procedure performed during a hospital stay (cardiac surgery, joint\n  replacement, major abdominal or thoracic surgery, organ transplant, spine surgery, etc.).\n- When the research question is specific to inpatient care — length of stay, inpatient costs, hospital\n  charges, discharge disposition, perioperative complications — where the inpatient setting is itself\n  part of the phenotype.\n- When you need the approach, device, and qualifier axes for the specific clinical question (e.g.,\n  open versus laparoscopic, cemented versus uncemented joint implant, autologous versus synthetic graft).\n- When studying MS-DRG assignment, hospital payment, or hospital-based resource utilization where\n  ICD-10-PCS codes directly drive the DRG grouper.\n- As part of a union code set (with CPT/HCPCS) for comprehensive multi-setting procedure ascertainment\n  in comparative effectiveness or safety studies.\n- When using OMOP CDM: query procedure_occurrence by standard SNOMED concept mapped from ICD-10-PCS\n  source codes, 
and validate source capture rates.\n\n**When NOT to use — and when it is actively misleading or dangerous.**\n\n- **As the sole procedure identifier when the study population includes outpatient or ambulatory\n  cases.** Since at least the late 2010s, procedures historically performed inpatient (e.g., total\n  knee arthroplasty, laparoscopic cholecystectomy, many cardiac catheterizations) are increasingly\n  performed in outpatient settings. An ICD-10-PCS-only algorithm captures a diminishing and\n  non-random fraction of all cases — the fraction that is still inpatient, which skews toward more\n  comorbid, higher-risk patients. The result is not merely incomplete; it is a biased sample.\n  **Always combine with CPT and HCPCS** for population-representative procedure ascertainment.\n\n- **When relying on the code description rather than the root operation definition.** The written\n  description of an ICD-10-PCS code (e.g., \"Excision of Left Knee Joint, Open Approach\") tells you\n  the character values but not why Excision was chosen over Repair or Replacement. A code list built\n  by matching text descriptions instead of reading the root operation definitions systematically\n  includes wrong codes and excludes the right ones. For example, total knee arthroplasties use\n  root operation Replacement (R), and ICD-10-PCS descriptions are written in root-operation\n  vocabulary rather than clinical vocabulary. Researchers who search descriptions for clinical terms\n  such as \"arthroplasty\" or \"knee replacement\" without checking the root operation and body-part\n  tables can miss qualifying codes recorded under other body-part values and pull in codes from\n  adjacent root operations.\n\n- **For pre-October 2015 inpatient data.** ICD-10-PCS did not exist in US data before FY2016. 
Any\n  dataset covering discharges before October 1, 2015 uses ICD-9-CM Volume 3 for inpatient procedures.\n  Applying an ICD-10-PCS code list to data from before October 2015 returns zero matches.\n\n- **When crossing the October 2015 ICD-9-CM-to-ICD-10-PCS transition date without documenting the\n  crosswalk.** GEMs are directional and imperfect. An analysis of procedure rates over a period\n  spanning October 2015 may show a spurious step-change at the transition that reflects coding\n  system change, not a true trend in procedure volume. Always test for and disclose the coding\n  transition artifact in sensitivity analyses.\n\n- **When the procedure of interest is predominantly outpatient and was never routinely performed\n  as an inpatient admission.** Colonoscopy, upper endoscopy, most dermatologic procedures, most\n  outpatient ophthalmologic procedures — these rarely appear in inpatient facility claims and will\n  have essentially no ICD-10-PCS representation in a typical claims database. Using ICD-10-PCS\n  for these is not just incomplete; it returns near-zero counts regardless of the true rate.\n\n- **When building a positional pattern for prefix matching without also testing specificity.**\n  A broad prefix (e.g., all codes beginning with 0SR — replacement of lower joints) captures knee\n  replacements but also replacements of the hip, ankle, and toe joints and of individual joint\n  surfaces. Always validate that the prefix includes the intended body parts (position 4) and root\n  operation (position 3) before finalizing the code set.\n\n**Data-source operational depth.**\n\n- **Medicare FFS (Parts A/B/D) inpatient claims:** ICD-10-PCS codes appear on the inpatient SAF\n  (Standard Analytical File) and the inpatient LDS (Limited Data Set) in the procedure code\n  fields (e.g., ICD_PRCDR_CD1 through ICD_PRCDR_CD25; exact names vary by file layout, so check the data dictionary). 
The principal procedure\n  (clinically most significant procedure, not necessarily first chronologically) is in field 1;\n  up to 24 additional procedures follow. Date fields (ICD_PRCDR_DT_1 through ICD_PRCDR_DT_25) record the\n  service date for each procedure. For patients with Medicare Advantage (Part C) coverage, FFS\n  inpatient claims are absent, the same limitation as for any other procedure coding in MA-covered\n  members. Restrict inpatient-procedure denominators to FFS-observable, Parts A/B-enrolled person-time.\n\n- **Commercial claims (employer-sponsored and ACA marketplace):** ICD-10-PCS codes appear in the\n  inpatient institutional claim file, equivalent to the Medicare inpatient SAF. The field names\n  vary by data vendor (Optum, Merative/MarketScan, IQVIA, FAIR Health, etc.) but the content is\n  structurally the same: up to ~25 procedure code slots on the UB-04-equivalent claim. Commercial\n  data for discharges on or after October 1, 2015 use ICD-10-PCS for inpatient facility procedures;\n  earlier discharges use ICD-9-CM Vol 3.\n\n- **All-payer inpatient discharge databases (HCUP NIS, SID, KID):** The Healthcare Cost and\n  Utilization Project databases use ICD-10-PCS for discharges beginning in Q4 2015 (exact transition\n  quarter varies by state for the SIDs). The NIS (National Inpatient Sample) is the largest US\n  all-payer inpatient database (~7 million unweighted discharges/year, roughly 35 million when\n  weighted) and is the primary substrate\n  for ICD-10-PCS-based population-level inpatient procedure research.\n\n- **EHR:** ICD-10-PCS is a billing code, not a clinical code, so it does not natively appear in\n  EHR clinical documentation. However, after billing is complete, ICD-10-PCS codes from the\n  facility bill may be attached to the encounter record in EHR administrative modules. Structured\n  EHR procedure tables more commonly use CPT (for professional billing integration) or SNOMED-CT\n  (for clinical documentation). 
If ICD-10-PCS codes are required from EHR data, obtain them from\n  the linked billing extract, not from the clinical procedure table.\n\n- **OMOP CDM:** ICD-10-PCS source codes are mapped to procedure_occurrence via the standard concept\n  mapping. Use concept_relationship where relationship_id = 'Maps to' to translate ICD-10-PCS\n  source concepts to SNOMED standard concepts. The vocabulary_id for ICD-10-PCS source concepts\n  is 'ICD10PCS'. When building concept sets in ATLAS or via SQL, filter on vocabulary_id =\n  'ICD10PCS' for source-code queries or join through the standard concept hierarchy for portable\n  queries across sites using different procedure coding systems.",
    "primary_category": "Data_Standard",
    "tags": [
      "coding-system",
      "data-standard",
      "primitive",
      "procedures",
      "icd-10-pcs",
      "inpatient",
      "claims-coding",
      "code-list-development",
      "data-infrastructure"
    ],
    "applies_to_study_types": [
      "claims_analysis",
      "comparative_effectiveness",
      "drug_utilization",
      "registry_linkage",
      "ehr_study"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1016/j.jamcollsurg.2013.04.029",
        "url": "https://doi.org/10.1016/j.jamcollsurg.2013.04.029",
        "citation_text": "Utter GH, Cox GL, Owens PL, Romano PS. Challenges and Opportunities with ICD-10-CM/PCS: Implications for Surgical Research Involving Administrative Data. Journal of the American College of Surgeons. 2013;217(3):564-574.",
        "year": 2013,
        "authors_short": "Utter et al.",
        "notes": "Foundational surgical research perspective on ICD-10-CM/PCS that explains the multi-axial architecture, root operation semantics, and the implications of the coding system transition for building procedure cohorts and quality measures from administrative claims data."
      },
      {
        "role": "explain",
        "doi": "10.1111/1475-6773.12981",
        "url": "https://doi.org/10.1111/1475-6773.12981",
        "citation_text": "Utter GH, Cox GL, Atolagbe O, Owens PL. Conversion of the Agency for Healthcare Research and Quality's Quality Indicators from ICD-9-CM to ICD-10-CM/PCS: The Process, Results, and Implications for Users. Health Services Research. 2018;53(Suppl 3):5072-5102.",
        "year": 2018,
        "authors_short": "Utter et al.",
        "notes": "Documents the systematic approach to converting AHRQ quality indicators — including surgical procedure definitions — from ICD-9-CM Volume 3 to ICD-10-PCS, directly illustrating the GEMs crosswalk complexity, root operation mapping challenges, and specificity changes at the coding transition that affect RWE studies crossing October 2015."
      },
      {
        "role": "demonstrate",
        "doi": "10.1097/ta.0000000000001592",
        "url": "https://doi.org/10.1097/ta.0000000000001592",
        "citation_text": "Utter GH, Schuster KM, Miller PR, Mowery NT. The capacity of ICD-10-CM/PCS to characterize surgical care. Journal of Trauma and Acute Care Surgery. 2017;83(5):913-921.",
        "year": 2017,
        "authors_short": "Utter et al.",
        "notes": "Empirically demonstrates how ICD-10-PCS captures surgical care detail — approach, device, body part granularity — compared with ICD-9-CM Vol 3, and identifies coding gaps and over-specification relevant to building trauma and acute-care surgical cohorts from inpatient claims."
      },
      {
        "role": "use",
        "doi": "10.1016/j.sapharm.2022.07.014",
        "url": "https://doi.org/10.1016/j.sapharm.2022.07.014",
        "citation_text": "Bhatt DL, Thornton JD. Change in opioid-related inpatient discharges after the ICD-9 to ICD-10 transition in Texas. Research in Social and Administrative Pharmacy. 2023;19(3):473-478.",
        "year": 2023,
        "authors_short": "Bhatt & Thornton",
        "notes": "A concrete RWE application of the ICD-9-CM-to-ICD-10-PCS transition problem — shows how coding-system changes at October 2015 can appear as apparent trend shifts in inpatient procedure and diagnosis rates, illustrating why transition-date sensitivity analyses are mandatory in longitudinal ICD-10-PCS studies."
      },
      {
        "role": "use",
        "doi": null,
        "url": "https://www.cms.gov/medicare/coding-billing/icd-10-codes",
        "citation_text": "Centers for Medicare and Medicaid Services. ICD-10 Procedure Coding System (ICD-10-PCS). CMS Medicare Coding and Billing. Updated annually. Accessed 2026.",
        "year": 2026,
        "authors_short": "CMS",
        "notes": "Official CMS source for the annual ICD-10-PCS code files, tables, reference manuals, and update documentation. The ICD-10-PCS Reference Manual and the tabular file (PCS tables) are the authoritative source for root operation definitions, code construction rules, and the complete set of valid codes for any fiscal year."
      }
    ],
    "plain_language_summary": "Every surgery or procedure performed on a hospital inpatient in the United States is described with a seven-character code where each character answers a specific question about the procedure — what type, which body system, what the surgeon did, which body part, how they got there, what was left in, and any additional detail. These ICD-10-PCS codes appear only on the hospital's inpatient facility bill, not on the surgeon's separate bill or on any outpatient record, so researchers who look only at ICD-10-PCS will miss all procedures done in outpatient surgery centers or clinics. The code system has been in use since October 2015 and replaced an older, less detailed system that used shorter numeric codes.",
    "key_terms": [
      {
        "term": "root operation",
        "definition": "The third character of an ICD-10-PCS code, which defines the precise objective of the procedure using a strict technical definition — for example, Resection means removing all of a body part while Excision means removing only a portion, a distinction that determines which code applies even when the clinical note uses the same word for both."
      },
      {
        "term": "UB-04",
        "definition": "The standardized claim form (also called the 837I electronic transaction) that hospitals submit to payers for inpatient and outpatient facility services; ICD-10-PCS procedure codes appear only in the procedure code fields of the UB-04 inpatient claim."
      },
      {
        "term": "multi-axial code",
        "definition": "A code where each character position independently encodes a different clinical dimension, so the full meaning is the combination of all seven positions rather than a single memorized label."
      },
      {
        "term": "GEMs (General Equivalence Mappings)",
        "definition": "CMS-provided translation tables that map ICD-9-CM Volume 3 procedure codes to the closest ICD-10-PCS equivalents and vice versa; they are imperfect because many codes do not have a one-to-one match across the two systems."
      },
      {
        "term": "MS-DRG",
        "definition": "Medicare Severity Diagnosis Related Group — the payment category a hospital discharge is assigned to based on the principal diagnosis and procedures; ICD-10-PCS codes directly drive whether a discharge is classified as a surgical DRG (which pays more) or a medical DRG."
      },
      {
        "term": "Section X (New Technology)",
        "definition": "A dedicated ICD-10-PCS section for recently developed procedures, devices, and biologics that do not fit the existing Medical and Surgical tables; analysts studying novel implants or therapies must always check Section X for relevant codes."
      }
    ],
    "worked_example": {
      "scenario": "A health outcomes researcher wants to identify all patients in a Medicare FFS inpatient claims database who received a total knee replacement (TKR) during 2022. She knows from reading the ICD-10-PCS tables that a TKR maps to root operation Replacement (R), body system Lower Joints (R), and she wants to understand how to decompose the code, build the search pattern, and then confirm she is capturing the right set of body parts. She uses the publicly documented example code 0SRC0JZ (Replacement of Right Knee Joint with Synthetic Substitute, Open Approach) to walk through the logic, then writes the positional regex that finds all knee joint replacements regardless of laterality, device, or qualifier.\n",
      "dataset": {
        "caption": "Seven-axis decomposition of ICD-10-PCS code 0SRC0JZ — total knee replacement example",
        "columns": [
          "Position",
          "Character",
          "Axis name",
          "Value meaning"
        ],
        "rows": [
          [
            1,
            "0",
            "Section",
            "Medical and Surgical"
          ],
          [
            2,
            "S",
            "Body System",
            "Lower Joints"
          ],
          [
            3,
            "R",
            "Root Operation",
            "Replacement (putting in a substitute that physically takes the place of all or part of a body part)"
          ],
          [
            4,
            "C",
            "Body Part",
            "Right Knee Joint"
          ],
          [
            5,
            "0",
            "Approach",
            "Open"
          ],
          [
            6,
            "J",
            "Device",
            "Synthetic Substitute"
          ],
          [
            7,
            "Z",
            "Qualifier",
            "No Qualifier"
          ]
        ]
      },
      "steps": [
        "Position 1 = 0 (zero): Medical and Surgical section. All common inpatient surgeries live here.",
        "Position 2 = S: Lower Joints body system. Both hips, knees, ankles, and foot joints are in body system R (Lower Joints); note the section character is 0, so the full prefix is 0S.",
        "Position 3 = R: Root Operation Replacement. ICD-10-PCS defines Replacement as 'putting in or on biological or synthetic material that physically takes the place of all or a portion of a body part.' A total knee arthroplasty — where the natural knee surfaces are removed and a prosthesis is seated — fits this definition. Root operation Repair (Q) would be wrong because nothing is replaced; root operation Supplement (U) would be wrong because the natural joint surface is not retained.",
        "Position 4 = C: Right Knee Joint. Left Knee Joint = D, Right Knee Joint = C. A study of bilateral TKRs needs both C and D in position 4.",
        "Position 5 = 0: Open approach. A minimally invasive or robotic-assisted approach with an open cavity would still typically be coded Open if the joint cavity is opened. Percutaneous Endoscopic (4) applies to fully arthroscopic work.",
        "Position 6 = J: Synthetic Substitute. Other device values include 6 (Autologous Tissue Substitute) and L (Nonautologous Tissue Substitute). Cemented vs uncemented implants are NOT distinguished in position 6 in base ICD-10-PCS — both are J.",
        "Position 7 = Z: No Qualifier. For many joint replacements the qualifier is Z.",
        "To find ALL knee replacements (right and left, any device, any qualifier), build a regex that fixes positions 1-3 (0SR) and position 4 as C or D, allowing any value in positions 5-7. The regex ^0SR[CD][0-9A-HJ-NP-Z]{3}$ matches all valid knee replacement codes in that combinatorial space and excludes non-knee body parts by pinning position 4.",
        "expr = total valid 7-character codes in the result set where positions 1-3 are 0SR and position 4 is C or D. The number of valid combinations is determined by the table: typically 2 body parts x available approaches x available devices x available qualifiers. For illustration, if there are 1 approach, 5 device values, and 1 qualifier value, the count = 2 * 1 * 5 * 1 = 10 codes — each representing a clinically distinct combination."
      ],
      "result": "The 7-character string 0SRC0JZ decodes as: Open total knee replacement of the Right Knee Joint using a Synthetic Substitute. The positional regex ^0SR[CD][0-9A-HJ-NP-Z]{3}$ captures all knee joint replacements regardless of laterality (C=right, D=left), device type, and qualifier. A researcher using this pattern instead of a manually curated flat list guarantees that newly added table combinations in annual ICD-10-PCS updates are automatically included — a key advantage of the multi-axial architecture over flat code lists.",
      "timeline_spec": {
        "title": "ICD-10-PCS code axis decomposition for 0SRC0JZ (total knee replacement)",
        "window": {
          "start": "2015-10-01",
          "end": "2015-10-07",
          "label": "ICD-10-PCS effective date: Oct 1, 2015"
        },
        "events": [
          {
            "label": "Position 1: Section (0=Med/Surg)",
            "start": "2015-10-01",
            "length_days": 1,
            "quantity": "char 1"
          },
          {
            "label": "Position 2: Body System (S=Lower Joints)",
            "start": "2015-10-02",
            "length_days": 1,
            "quantity": "char 2"
          },
          {
            "label": "Position 3: Root Op (R=Replacement)",
            "start": "2015-10-03",
            "length_days": 1,
            "quantity": "char 3"
          },
          {
            "label": "Position 4: Body Part (C=Rt Knee)",
            "start": "2015-10-04",
            "length_days": 1,
            "quantity": "char 4"
          },
          {
            "label": "Position 5: Approach (0=Open)",
            "start": "2015-10-05",
            "length_days": 1,
            "quantity": "char 5"
          },
          {
            "label": "Position 6: Device (J=Synthetic)",
            "start": "2015-10-06",
            "length_days": 1,
            "quantity": "char 6"
          },
          {
            "label": "Position 7: Qualifier (Z=None)",
            "start": "2015-10-07",
            "length_days": 1,
            "quantity": "char 7"
          }
        ],
        "spans": [
          {
            "kind": "covered",
            "start": "2015-10-01",
            "end": "2015-10-07",
            "label": "7-axis code 0SRC0JZ"
          }
        ],
        "result": {
          "label": "0SRC0JZ = Open Replacement, Right Knee Joint, Synthetic Substitute — 7 axes, 7 characters",
          "value": 7
        }
      }
    },
    "prerequisites": [
      "claims-analysis",
      "procedure-identification-and-measurement-in-claims-ehr"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Positional-prefix match (table-aware code set construction)",
        "description": "Build the code set by specifying fixed values for positions 1-3 (section, body system, root operation) and ranges for positions 4-7, rather than enumerating every valid code. This ensures that newly added codes in annual updates are automatically captured without manual review, provided the root operation and body system intent is correctly fixed. The regex must exclude I and O from any character-range specification (the letters I and O are never valid in ICD-10-PCS).",
        "edge_cases": [
          "A prefix that is too broad (e.g., all of body system S = all Lower Joints replacements) will include hip, ankle, and toe joint replacements alongside knee — always validate the body part character (position 4) against the table.",
          "The New Technology section (X) uses a different positional structure — position 2 is anatomical group but position 3 is the device/substance/technology, not root operation. Do not apply Medical and Surgical positional logic to Section X codes."
        ],
        "data_source_notes": "claims: apply the regex to the ICD_PRCDR_CD fields across all 25 procedure slots on the inpatient claim; do not restrict to principal procedure only unless the study design requires the clinically predominant procedure. OMOP: query vocabulary_id = ICD10PCS in concept table, then join through Maps_to relationship to standard SNOMED concepts for portable cross-site queries."
      },
      {
        "name": "Root-operation-specific cohort (avoiding clinical language shortcuts)",
        "description": "For each target procedure, identify the correct root operation by reading the ICD-10-PCS Reference Manual definition, not by matching the clinical term in the code description. Then build the code set from that root operation's table. Validate the result against a reference chart or operative-note review in a sample.",
        "edge_cases": [
          "Joint arthroplasty (commonly called \"replacement\") uses root operation Replacement (R) when the natural joint surface is removed and a prosthesis seated; Supplement (U) when an additional device is added to enhance function without full removal; Repair (Q) when no device is placed and the joint is restored to its normal structure. Using the word \"replacement\" in a text search misses Supplement and incorrectly includes Repair.",
          "Colostomy creation = Bypass (1) in ICD-10-PCS because the intestinal route is altered. A search for \"colostomy\" that looks for Resection or Excision of the colon will miss the index procedure."
        ],
        "data_source_notes": "claims: after building the root-operation-anchored code list, always count distinct patients per body part and approach value as a validity check — an unexpected predominance of Percutaneous approach for a procedure normally done open is a signal of miscoding or list error."
      },
      {
        "name": "Longitudinal study crossing the ICD-9/ICD-10 transition (October 2015)",
        "description": "For analyses spanning fiscal years before and after October 1, 2015, build parallel code sets — ICD-9-CM Vol 3 for pre-transition discharges and ICD-10-PCS for post-transition — and apply the appropriate set to each discharge based on admission date. Document GEMs crosswalk uncertainty, test for a coding-transition artifact in procedure rates at the boundary, and include a sensitivity analysis restricted to the post-transition period.",
        "edge_cases": [
          "Some discharges in Q4 2015 (October-December 2015) carry ICD-10-PCS even though the fiscal year started in October; the transition was calendar-date-based (Oct 1, 2015), not fiscal-year-based.",
          "GEMs are not symmetric — the forward GEM (ICD-9 to ICD-10) and backward GEM (ICD-10 to ICD-9) may resolve the same ambiguity differently. Choose a direction and apply it consistently."
        ],
        "data_source_notes": "HCUP NIS: the transition date within the NIS varies by data-year release and state SID contribution; consult the NIS documentation for the specific quarter when ICD-10-PCS codes begin appearing in the data."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "cpt-procedure-coding",
        "pros_of_this": "ICD-10-PCS encodes approach, device, and qualifier as explicit independent axes, enabling direct interrogation of surgical technique without add-on codes or modifiers; it is the only code system that appears on inpatient facility claims.",
        "cons_of_this": "ICD-10-PCS is restricted to inpatient facility billing; CPT covers all settings (professional, outpatient facility, ASC) and all payer types, and is therefore far broader in population coverage for procedure ascertainment.",
        "when_to_prefer": "Use ICD-10-PCS when the study is explicitly limited to inpatient hospitalizations, or when approach and device axes matter and the cases are predominantly inpatient. Always pair with CPT for complete cross-setting ascertainment."
      },
      {
        "compared_to": "icd-9-cm-legacy-coding",
        "pros_of_this": "ICD-10-PCS has a larger, more granular code space, explicit multi-axial structure, and standardized approach/device/qualifier encoding absent in ICD-9-CM Vol 3.",
        "cons_of_this": "ICD-10-PCS is unavailable for discharges before October 1, 2015; ICD-9-CM Vol 3 is required for pre-transition data, and GEMs crosswalks introduce mapping uncertainty for longitudinal analyses.",
        "when_to_prefer": "Use ICD-10-PCS for post-October-2015 inpatient data; use ICD-9-CM Vol 3 for pre-transition data; build dual code sets with documented crosswalk logic for longitudinal analyses."
      },
      {
        "compared_to": "procedure-identification-and-measurement-in-claims-ehr",
        "pros_of_this": "ICD-10-PCS is the specific inpatient facility coding system — understanding its axis structure enables construction of correct, comprehensive, table-aware inpatient code sets.",
        "cons_of_this": "Procedure identification as a method encompasses the full multi-system union (CPT + HCPCS + ICD-10-PCS + revenue codes), deduplication, and time-zero logic — ICD-10-PCS alone is only one input stream.",
        "when_to_prefer": "Use this entry (ICD-10-PCS) to understand the coding system primitive and construct the inpatient component of a code set; use the procedure identification entry for the full multi-stream assembly method."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "ICD-10-PCS codes appear in the inpatient institutional claims file only. Field names vary by vendor: Medicare SAF uses ICD_PRCDR_CD_1 through _25 with paired ICD_PRCDR_DT_1 through _25; commercial vendors use equivalent arrays. Query all 25 slots — not just the principal procedure (slot 1) — unless the study design requires only the clinically dominant procedure. For union code-set assembly with CPT: the inpatient facility claim carries ICD-10-PCS; the same admission may generate a separate professional claim (CMS-1500/837P) carrying CPT — join on person_id + service date window to confirm the professional claim is for the same event. Always restrict inpatient procedure denominators to FFS Parts A/B-enrolled person-time; exclude MA-only months.",
      "ehr": "ICD-10-PCS codes do not appear in clinical EHR documentation natively. If ICD-10-PCS codes are needed from an EHR-sourced dataset, retrieve them from the billing extract linked to the encounter (the facility bill submitted to the payer) rather than from clinical procedure tables, which typically use CPT or SNOMED. Validate that the billing extract covers the inpatient stays of interest and is linked at the encounter level.",
      "linked": "In linked claims-EHR or claims-registry datasets, ICD-10-PCS codes from the facility claim and CPT codes from the professional claim must be reconciled to a single procedure event using an acute deduplication window and the earliest service date. Date discrepancies between the facility bill (admission date or procedure date) and the professional bill (service date) of 1-2 days are normal billing artifacts. The operative note or registry record provides clinical ground truth for PPV validation of the code set."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import re\n\n# ICD-10-PCS character alphabet: digits 0-9 and uppercase letters except I and O.\n_VALID_CHAR = re.compile(r'^[0-9A-HJ-NP-Z]{7}$')\n\n# Medical & Surgical section position labels (section 0 only).\n_MS_AXES = [\"Section\", \"Body System\", \"Root Operation\",\n            \"Body Part\", \"Approach\", \"Device\", \"Qualifier\"]\n\n# Selected root operation values for the Medical & Surgical section (section 0).\nROOT_OPS = {\n    \"0\": \"Alteration\",       \"1\": \"Bypass\",         \"2\": \"Change\",\n    \"3\": \"Control\",          \"4\": \"Creation\",        \"5\": \"Destruction\",\n    \"6\": \"Detachment\",       \"7\": \"Dilation\",        \"8\": \"Division\",\n    \"9\": \"Drainage\",         \"B\": \"Excision\",        \"C\": \"Extirpation\",\n    \"D\": \"Extraction\",       \"F\": \"Fragmentation\",   \"G\": \"Fusion\",\n    \"H\": \"Insertion\",        \"J\": \"Inspection\",      \"K\": \"Map\",\n    \"L\": \"Occlusion\",        \"M\": \"Reattachment\",    \"N\": \"Release\",\n    \"P\": \"Removal\",          \"Q\": \"Repair\",          \"R\": \"Replacement\",\n    \"S\": \"Reposition\",       \"T\": \"Resection\",       \"U\": \"Supplement\",\n    \"V\": \"Restriction\",      \"W\": \"Revision\",        \"X\": \"Transfer\",\n    \"Y\": \"Transplantation\",\n}\n\n\ndef validate_pcs_code(code: str) -> bool:\n    \"\"\"Return True if code is a syntactically valid ICD-10-PCS string.\n\n    Rules: exactly 7 characters; only digits 0-9 and uppercase A-Z\n    excluding I and O (which are omitted to avoid confusion with 1 and 0).\n    This is a format check only — it does not confirm the code appears in\n    the official CMS PCS tables for any specific fiscal year.\n    \"\"\"\n    return bool(_VALID_CHAR.match(str(code).upper()))\n\n\ndef decompose_pcs_code(code: str) -> dict:\n    \"\"\"Decompose a Medical & Surgical section (section 0) ICD-10-PCS code.\n\n    Returns a dict mapping each axis name to its character value.\n    
Raises ValueError for codes that are not 7 valid characters or not\n    in section 0 (Medical & Surgical).\n    \"\"\"\n    code = str(code).upper()\n    if not validate_pcs_code(code):\n        raise ValueError(\n            f\"Invalid ICD-10-PCS code: {code!r}. \"\n            \"Must be 7 characters using 0-9 and A-Z except I and O.\"\n        )\n    if code[0] != \"0\":\n        raise ValueError(\n            f\"Code {code!r} is not Medical & Surgical (section 0). \"\n            \"Position 1 must be '0' for this decomposition.\"\n        )\n    result = {axis: char for axis, char in zip(_MS_AXES, code)}\n    # Add root operation label when recognised.\n    root_char = code[2]\n    if root_char in ROOT_OPS:\n        result[\"Root Operation Label\"] = ROOT_OPS[root_char]\n    return result\n\n\ndef build_pcs_code_set(\n    section: str,\n    body_system: str,\n    root_operation: str,\n    body_parts: list[str] | None = None,\n) -> re.Pattern:\n    \"\"\"Return a compiled regex that matches all ICD-10-PCS codes for the given\n    combination of section, body system, and root operation, optionally filtered\n    to a list of body part character values (position 4).\n\n    The pattern fixes positions 1-3 and, if body_parts is supplied, anchors\n    position 4 to those values; positions 5-7 are unconstrained (any valid\n    ICD-10-PCS character). 
Use this to build a table-aware code set that\n    automatically covers new fiscal-year additions without manual enumeration.\n\n    Example — all Open knee joint replacements (right and left):\n        build_pcs_code_set(\"0\", \"S\", \"R\", body_parts=[\"C\", \"D\"])\n        matches: 0SR[CD][0-9A-HJ-NP-Z]{3}\n    \"\"\"\n    for label, val in ((\"section\", section), (\"body_system\", body_system),\n                       (\"root_operation\", root_operation)):\n        val = str(val).upper()\n        if not _VALID_CHAR.match(val + \"000000\"):  # pad to 7 for regex test\n            raise ValueError(f\"Invalid PCS character for {label}: {val!r}\")\n    prefix = (str(section).upper()\n              + str(body_system).upper()\n              + str(root_operation).upper())\n    if body_parts:\n        valid_bp = [c.upper() for c in body_parts if _VALID_CHAR.match(c + \"000000\")]\n        if not valid_bp:\n            raise ValueError(\"No valid body part characters provided.\")\n        bp_class = \"[\" + \"\".join(valid_bp) + \"]\"\n    else:\n        bp_class = \"[0-9A-HJ-NP-Z]\"\n    # Positions 5-7: any valid ICD-10-PCS character (excludes I and O).\n    tail = \"[0-9A-HJ-NP-Z]{3}\"\n    return re.compile(f\"^{prefix}{bp_class}{tail}$\")\n\n\n# ── Worked example ──────────────────────────────────────────────────────────\nif __name__ == \"__main__\":\n    # Validate and decompose 0SRC0JZ (total knee replacement, right, open, synthetic).\n    code = \"0SRC0JZ\"\n    print(f\"Valid: {validate_pcs_code(code)}\")   # -> True\n    axes = decompose_pcs_code(code)\n    for axis, val in axes.items():\n        print(f\"  {axis}: {val}\")\n\n    # Build a regex for all knee joint replacements (right=C, left=D).\n    tkr_pattern = build_pcs_code_set(\"0\", \"S\", \"R\", body_parts=[\"C\", \"D\"])\n    test_codes = [\"0SRC0JZ\", \"0SRD0JZ\", \"0SRC049\", \"0LRB0JZ\"]  # last = not knee\n    for c in test_codes:\n        print(f\"  {c}: {'MATCH' if tkr_pattern.match(c) else 
'no match'}\")\n    # Expected: 0SRC0JZ=MATCH, 0SRD0JZ=MATCH, 0SRC049=MATCH (format-valid and knee-prefixed; the\n    # regex checks structure only, not membership in the CMS tables), 0LRB0JZ=no match\n    # (section 0 body system L = Tendons, not Lower Joints).",
        "description": "ICD-10-PCS structural validation and code-set construction utilities for RWE analysts.\nThree capabilities: (1) validate a code string against the ICD-10-PCS character alphabet\n(7 characters, no I or O, regex gate); (2) decompose a code into its seven named axes for\nthe Medical and Surgical section; (3) build a table-aware positional code set for a target\nprocedure given fixed values for positions 1-3 and an optional position-4 filter. The\nimplementations assume post-October-2015 inpatient claims data.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "# ICD-10-PCS character alphabet: 0-9 and A-Z except I and O.\n.PCS_VALID <- \"^[0-9A-HJ-NP-Z]{7}$\"\n\n.MS_AXES <- c(\"Section\", \"Body_System\", \"Root_Operation\",\n              \"Body_Part\", \"Approach\", \"Device\", \"Qualifier\")\n\n.ROOT_OPS <- c(\n  \"0\"=\"Alteration\",    \"1\"=\"Bypass\",        \"2\"=\"Change\",\n  \"3\"=\"Control\",       \"4\"=\"Creation\",       \"5\"=\"Destruction\",\n  \"6\"=\"Detachment\",    \"7\"=\"Dilation\",       \"8\"=\"Division\",\n  \"9\"=\"Drainage\",      \"B\"=\"Excision\",       \"C\"=\"Extirpation\",\n  \"D\"=\"Extraction\",    \"F\"=\"Fragmentation\",  \"G\"=\"Fusion\",\n  \"H\"=\"Insertion\",     \"J\"=\"Inspection\",     \"K\"=\"Map\",\n  \"L\"=\"Occlusion\",     \"M\"=\"Reattachment\",   \"N\"=\"Release\",\n  \"P\"=\"Removal\",       \"Q\"=\"Repair\",         \"R\"=\"Replacement\",\n  \"S\"=\"Reposition\",    \"T\"=\"Resection\",       \"U\"=\"Supplement\",\n  \"V\"=\"Restriction\",   \"W\"=\"Revision\",        \"X\"=\"Transfer\",\n  \"Y\"=\"Transplantation\"\n)\n\n#' Validate ICD-10-PCS code format.\n#' @param code Character scalar or vector.\n#' @return Logical. TRUE = syntactically valid; does not confirm FY table membership.\nvalidate_pcs_code <- function(code) {\n  grepl(.PCS_VALID, toupper(as.character(code)))\n}\n\n#' Decompose a Medical & Surgical section (section 0) ICD-10-PCS code into its seven axes.\n#' @param code Single 7-character ICD-10-PCS code.\n#' @return Named character vector with one element per axis plus Root_Operation_Label.\ndecompose_pcs_code <- function(code) {\n  code <- toupper(as.character(code))\n  if (!validate_pcs_code(code))\n    stop(\"Invalid ICD-10-PCS code: \", code,\n         \". 
Must be 7 characters from [0-9A-HJ-NP-Z].\")\n  if (substr(code, 1, 1) != \"0\")\n    stop(\"Code \", code, \" is not Medical & Surgical (section 0).\")\n  chars <- strsplit(code, \"\")[[1]]\n  result <- setNames(chars, .MS_AXES)\n  root_char <- chars[3]\n  if (root_char %in% names(.ROOT_OPS))\n    result[\"Root_Operation_Label\"] <- .ROOT_OPS[[root_char]]\n  result\n}\n\n#' Build a positional regex matching all ICD-10-PCS codes for a given\n#' section + body_system + root_operation, optionally filtered to specific body parts.\n#'\n#' @param section    1-character section value (e.g., \"0\" for Medical & Surgical).\n#' @param body_system 1-character body system value (e.g., \"S\" for Lower Joints).\n#' @param root_op    1-character root operation value (e.g., \"R\" for Replacement).\n#' @param body_parts Optional character vector of position-4 values to match.\n#' @return A regex pattern string (pass to grepl / regexpr).\nbuild_pcs_regex <- function(section, body_system, root_op, body_parts = NULL) {\n  any_valid <- \"[0-9A-HJ-NP-Z]\"\n  prefix <- paste0(toupper(section), toupper(body_system), toupper(root_op))\n  # Guard against multi-character or out-of-alphabet inputs that would corrupt the pattern.\n  if (length(prefix) != 1L || !grepl(paste0(\"^\", any_valid, \"{3}$\"), prefix))\n    stop(\"section, body_system, and root_op must each be one valid ICD-10-PCS character.\")\n  if (!is.null(body_parts) && length(body_parts) > 0) {\n    body_parts <- toupper(body_parts)\n    if (!all(grepl(paste0(\"^\", any_valid, \"$\"), body_parts)))\n      stop(\"body_parts must be single valid ICD-10-PCS characters.\")\n    bp_chars <- paste0(body_parts, collapse = \"\")\n    bp_class  <- paste0(\"[\", bp_chars, \"]\")\n  } else {\n    bp_class <- any_valid\n  }\n  paste0(\"^\", prefix, bp_class, any_valid, \"{3}$\")\n}\n\n## Worked example ----------------------------------------------------------\ncode <- \"0SRC0JZ\"\ncat(\"Valid:\", validate_pcs_code(code), \"\\n\")\nprint(decompose_pcs_code(code))\n\n# Regex for all knee joint replacements (Right Knee = C, Left Knee = D)\ntkr_regex <- build_pcs_regex(\"0\", \"S\", \"R\", body_parts = c(\"C\", \"D\"))\ncat(\"TKR regex:\", tkr_regex, \"\\n\")\ntest_codes <- c(\"0SRC0JZ\", \"0SRD0JZ\", \"0SRC049\", \"0LRB0JZ\")\ncat(\"Match results:\\n\")\nprint(data.frame(code = test_codes, match = grepl(tkr_regex, test_codes)))",
        "description": "ICD-10-PCS code validation, axis decomposition, and positional code-set construction in R.\nMirrors the Python implementation. Three exported functions: validate_pcs_code(),\ndecompose_pcs_code(), and build_pcs_regex(). Requires base R only; no dependencies.\nDesigned for use with data.table or dplyr claims data pipelines.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Claim[Hospital Inpatient Claim UB-04 / 837I]\n  Claim --> PCS[ICD-10-PCS codes\\nFL74 principal + up to 24 additional]\n  Claim --> CM[ICD-10-CM diagnosis codes\\nFL67 principal + secondary]\n  PCS --> DRG[MS-DRG Grouper\\nPCS drives surgical vs medical DRG assignment]\n  CM --> DRG\n  PCS --> Seven[7-character alphanumeric code\\nno I or O]\n  Seven --> P1[Pos 1: Section\\n0=Med/Surg, X=New Technology]\n  Seven --> P2[Pos 2: Body System\\nS=Lower Joints, 2=Heart...]\n  Seven --> P3[Pos 3: Root Operation\\nR=Replacement, T=Resection, B=Excision...]\n  Seven --> P4[Pos 4: Body Part\\nC=Right Knee, D=Left Knee...]\n  Seven --> P5[Pos 5: Approach\\n0=Open, 4=Percutaneous Endoscopic...]\n  Seven --> P6[Pos 6: Device\\nJ=Synthetic Substitute, Z=No Device...]\n  Seven --> P7[Pos 7: Qualifier\\nZ=No Qualifier, X=Diagnostic...]\n  PCS --> Note[\"ICD-10-PCS: inpatient facility ONLY\\nCPT/HCPCS for professional + outpatient\"]",
        "caption": "ICD-10-PCS in the inpatient billing stream. The seven-axis code appears in the UB-04 procedure fields, feeds the MS-DRG grouper alongside ICD-10-CM diagnoses, and is restricted to inpatient facility claims only — CPT and HCPCS handle professional and outpatient procedure coding.",
        "alt_text": "Flowchart showing the hospital inpatient UB-04 claim generating both ICD-10-PCS procedure codes and ICD-10-CM diagnosis codes, which feed the MS-DRG grouper. The ICD-10-PCS code branches into its seven character positions with example values at each position.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  subgraph Inpatient[\"Hospital Inpatient Stay\"]\n    PCS_Code[\"ICD-10-PCS\\n0SRC0JZ\\nReplacement, Rt Knee,\\nOpen, Synthetic\"]\n  end\n  subgraph Outpatient[\"Outpatient / ASC\"]\n    CPT_Fac[\"CPT 27447\\nTotal knee arthroplasty\\nFacility claim\"]\n    CPT_Pro[\"CPT 27447\\nTotal knee arthroplasty\\nProfessional claim\"]\n  end\n  Inpatient -->|\"Only stream with ICD-10-PCS\"| PCS_Code\n  Outpatient --> CPT_Fac\n  Outpatient --> CPT_Pro\n  PCS_Code --> Warning[\"RWE risk: PCS-only algorithm\\nmisses ALL outpatient cases\\n→ biased toward sicker patients\"]\n  CPT_Fac --> Union[\"Complete ascertainment:\\nICD-10-PCS + CPT + HCPCS\\n+ revenue codes (union)\"]\n  CPT_Pro --> Union\n  PCS_Code --> Union",
        "caption": "The care-setting trap. The same total knee arthroplasty generates ICD-10-PCS on an inpatient facility claim but CPT codes on professional and outpatient facility claims. An ICD-10-PCS-only algorithm captures inpatient cases only, systematically missing the growing outpatient share and biasing the study population toward patients requiring hospitalization.",
        "alt_text": "Two-column flowchart. Left column shows inpatient stay generating ICD-10-PCS code 0SRC0JZ. Right column shows outpatient or ASC setting generating CPT 27447 on both facility and professional claims. Both converge into a union code set for complete ascertainment. A warning box highlights that ICD-10-PCS alone misses all outpatient cases.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "used_with",
        "target_slug": "procedure-identification-and-measurement-in-claims-ehr",
        "notes": "ICD-10-PCS is the inpatient facility coding system primitive that feeds the union code set in multi-stream procedure identification. The procedure identification entry covers the full assembly method (union + deduplication + time-zero logic); this entry covers the code system itself."
      },
      {
        "relation_type": "used_with",
        "target_slug": "claims-analysis",
        "notes": "ICD-10-PCS codes appear in the inpatient institutional claims file; claims analysis is the primary context in which these codes are queried, linked across files, and assembled into procedure cohorts."
      },
      {
        "relation_type": "see_also",
        "target_slug": "icd-10-cm-diagnosis-coding",
        "notes": "Sibling coding system. ICD-10-CM assigns diagnosis codes on both inpatient and outpatient claims; ICD-10-PCS assigns procedure codes on inpatient facility claims only. Both appear on the same UB-04 inpatient claim form and jointly drive MS-DRG assignment."
      },
      {
        "relation_type": "see_also",
        "target_slug": "icd-9-cm-legacy-coding",
        "notes": "ICD-9-CM Volume 3 is the direct predecessor of ICD-10-PCS for inpatient procedure coding, replaced on October 1, 2015. Longitudinal analyses crossing the transition date require both systems and GEMs crosswalk documentation."
      },
      {
        "relation_type": "see_also",
        "target_slug": "cpt-procedure-coding",
        "notes": "CPT is the complementary procedure coding system for professional and outpatient facility claims. Complete procedure ascertainment requires ICD-10-PCS for inpatient cases and CPT for outpatient and professional cases."
      },
      {
        "relation_type": "see_also",
        "target_slug": "hcpcs-level-ii-j-codes",
        "notes": "HCPCS Level II codes cover drugs administered in clinical settings and some services not covered by CPT; they appear on outpatient facility and professional claims alongside CPT and are part of the complete multi-system union code set for procedure ascertainment."
      },
      {
        "relation_type": "see_also",
        "target_slug": "ms-drg-classification",
        "notes": "ICD-10-PCS codes are the primary input that determines surgical DRG assignment via the CMS MS-DRG grouper; a procedure's ICD-10-PCS code directly drives the DRG and hence the inpatient payment amount."
      },
      {
        "relation_type": "see_also",
        "target_slug": "omop-standardized-vocabularies",
        "notes": "In the OMOP CDM, ICD-10-PCS source codes are mapped to SNOMED standard concepts in the Procedure domain; analysts querying OMOP data should use the standard_concept_id rather than source codes for portability across sites using different source procedure coding systems."
      },
      {
        "relation_type": "see_also",
        "target_slug": "medical-code-crosswalks-mappings",
        "notes": "GEMs (General Equivalence Mappings) are the primary ICD-10-PCS crosswalk — mapping ICD-9-CM Vol 3 to ICD-10-PCS and vice versa — and are a concrete example of the crosswalk complexity covered in the medical code crosswalks entry."
      }
    ],
    "aliases": [
      "ICD-10-PCS",
      "PCS",
      "ICD-10 Procedure Coding System",
      "inpatient procedure codes",
      "PCS procedure codes"
    ],
    "complexity": "foundational",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "icd-9-cm-legacy-coding",
    "name": "ICD-9-CM Legacy Diagnosis and Procedure Codes",
    "short_definition": "The US clinical modification of the WHO's Ninth Revision of the International Classification of Diseases, used for diagnosis (Volumes 1-2) and inpatient procedure coding (Volume 3) in all US HIPAA-covered healthcare transactions from 1979 through 30 September 2015; any RWE study whose data window touches pre-October-2015 encounters must handle ICD-9-CM code lists alongside their ICD-10 successors.",
    "long_description": "**ICD-9-CM** (International Classification of Diseases, 9th Revision, Clinical Modification)\nwas the mandatory coding system for all US HIPAA-covered electronic health transactions from\nthe early 1980s through 30 September 2015. On 1 October 2015, the US transitioned to ICD-10-CM\nfor diagnoses and ICD-10-PCS for inpatient procedures. Because the overwhelming majority of\nadministrative claims databases extend back well before that date, every analyst working with\npre-October-2015 data — trend analyses, interrupted time series, long lookback windows, or any\ncohort that accrued members before the transition — will encounter ICD-9-CM codes in the raw\ndata and must handle them correctly.\n\n**Code structure and format.** ICD-9-CM has three volumes. Volumes 1-2 are the tabular and\nalphabetic index of **diagnosis codes**: a 3-to-5-character string whose core is a 3-digit\nnumeric chapter (001-999) with up to two additional decimal digits of specificity. In raw\nclaims files the decimal point is almost never stored: the code \"25000\" in the `DGNS_CD`\nfield means 250.00 (diabetes mellitus, type 2, not stated as uncontrolled), not the integer\n25000. Analysts must normalize between the flat and the decimal representation before any\nmatching. Volumes 1-2 also contain two supplementary classifications that use letter prefixes:\n**V codes** (V01-V91), which capture factors influencing health status and contact with health\nservices (e.g., V58.x = aftercare following surgery; V45.x = postsurgical state), and **E codes**\n(E000-E999), which classify the external cause of injury or poisoning. 
E codes require their\nown format validation because, once the \"E\" prefix is dropped, their numeric range overlaps with the 3-digit diagnosis categories.\nIn total, ICD-9-CM contains approximately 14,000 diagnosis codes, far fewer and far less\nspecific than ICD-10-CM's 70,000+: there is no laterality (no left vs right), no encounter\ntype (no \"initial encounter\" vs \"subsequent encounter\" vs \"sequela\"), and no seventh-character\nextension. This specificity gap is not just aesthetic; it is an accuracy limitation that\nvalidated phenotype algorithms must account for.\n\n**Volume 3 procedure codes.** ICD-9-CM Volume 3 is the inpatient procedure coding system used\nexclusively on UB-04 institutional claims (i.e., facility/hospital claims). Codes are 2 to\n4 digits (2-digit category plus up to 2 decimal digits), stored without the decimal in claims\ndata: \"8154\" = 81.54 (total knee replacement). Volume 3 covered procedures across all body\nsystems and was the only procedure coding system on inpatient institutional claims until it was\nreplaced by ICD-10-PCS on 1 October 2015. Volume 3 codes do **not** appear on professional\n(physician) claims; those used CPT/HCPCS Level II codes throughout. When building procedure\nphenotypes from inpatient claims before October 2015, analysts must include Volume 3 codes\nalongside CPT codes sourced from line-item professional claims.\n\n**Why a current RWE analyst still needs ICD-9-CM.** ICD-9-CM was retired on 1 October 2015,\nbut the historical data remain. Any study with one of the following design characteristics\nrequires era-aware code-list management:\n\n1. **Pre/post transition data:** A cohort accruing members between 2012 and 2018 will have\n   index dates on both sides of the October 2015 cutoff. Baseline covariates for a 2014\n   index date require ICD-9-CM code lists; the outcome window may span into ICD-10-CM era\n   records. 
Applying only ICD-10-CM code lists will silently drop all pre-October 2015\n   events, deflating baseline comorbidity counts and creating apparent outcome gaps.\n2. **Trend and interrupted time series analyses:** ITS models that include pre-October-2015 data\n   can show an artifactual discontinuity at the transition: the apparent rate of many conditions\n   changes not because the underlying incidence changed but because ICD-9-CM and ICD-10-CM\n   have different levels of specificity, and because clinicians and coders adjusted their\n   behavior during the transition. This is a form of measurement artifact that must be\n   modeled (e.g., as a step indicator at the transition) or the ITS estimator will absorb\n   the coding change into the intervention effect estimate.\n3. **Lookback windows for prevalent conditions:** An index date between October 2015 and\n   September 2016 with a 12-month baseline lookback places part of the baseline diagnosis\n   history in the ICD-9-CM era, and longer lookbacks push the affected index dates later\n   (a 24-month lookback reaches the ICD-9-CM era for index dates through September 2017).\n   Omitting the ICD-9-CM codes for diabetes, hypertension, heart\n   failure, or cancer from the lookback query will undercount comorbidity burden.\n4. **Algorithm transfer:** A validated ICD-9-CM phenotype algorithm (e.g., the Quan 2005\n   comorbidity adaptation of the Elixhauser index) does not automatically transfer to\n   ICD-10-CM. The General Equivalence Mappings (GEMs) produced by CMS provide approximate,\n   often one-to-many crosswalks between the two systems, but GEMs are not a substitute for\n   re-validation because (a) the code structures differ fundamentally, (b) some ICD-9-CM\n   codes have no direct ICD-10-CM equivalent, and (c) clinical coding practices shifted\n   at transition. Algorithm developers must re-derive and re-validate phenotype definitions\n   separately for each era.\n\n**Pros, cons, and trade-offs: specific and comparative.**\n\n- **vs ICD-10-CM/PCS (post-October 2015):** ICD-10-CM offers roughly 5× more diagnosis codes,\n  laterality, encounter type, and higher granularity. 
ICD-10-PCS is far more specific than\n  Volume 3 for inpatient procedures. The accuracy advantage of ICD-10-CM is real but\n  heterogeneous — some conditions (e.g., hypertension, type 2 diabetes) were already\n  well-specified in ICD-9-CM; others (e.g., fractures, injuries, complications) gain\n  enormously from laterality and encounter type. **Practical trade-off:** ICD-9-CM data\n  cover a much longer historical period and are therefore irreplaceable for long-term\n  trend studies, natural history research, and analyses requiring large person-time before\n  a new drug's approval.\n- **vs CPT/HCPCS for procedures:** CPT codes cover both inpatient and outpatient procedures\n  on professional claims and were unchanged by the 2015 transition. For outpatient procedures\n  in the pre-2015 era, an analyst uses CPT from professional claims exactly as in the\n  post-2015 era. ICD-9-CM Volume 3 adds inpatient procedural coding on institutional claims\n  only; after 2015 that role is filled by ICD-10-PCS. **Practical trade-off:** A procedure\n  code list for inpatient procedures in a multi-era study needs both Volume 3 codes for\n  pre-October 2015 and ICD-10-PCS codes for post-October 2015; CPT on professional claims\n  is available and consistent across both eras.\n- **vs OMOP SNOMED standard concepts:** OMOP maps ICD-9-CM source codes to SNOMED standard\n  concepts so that queries can be written against the standard vocabulary rather than\n  era-specific code lists. 
This is the correct approach when the data infrastructure\n  supports it, but it depends on the OMOP mapping tables being current and complete; for\n  rare codes or V/E code combinations, manual inspection of the source-to-standard mapping\n  is warranted.\n\n**When to use.** Use ICD-9-CM code lists whenever:\n- Any part of the study window (index date, baseline window, or follow-up) falls before\n  1 October 2015.\n- The lookback window for a 2015-2017 cohort extends into the pre-transition era.\n- An interrupted time series or trend analysis has a pre-period that includes claims\n  from before October 2015.\n- You are applying or adapting a published phenotype algorithm that was originally validated\n  against ICD-9-CM data (Charlson, Elixhauser, Quan comorbidity adaptations, most\n  published AHRQ Patient Safety Indicators).\n- You are comparing diagnoses or procedures across historical eras to assess natural history,\n  drug safety, or secular trend.\n\n**When NOT to use, and when ICD-9-CM is actively misleading or dangerous.**\n- **Post-October 2015 data only:** For a study whose entire baseline, enrollment, and follow-up fall\n  on or after 1 October 2015, ICD-9-CM codes should not appear in raw claims. Building ICD-9-CM\n  code lists is wasted effort that introduces confusion and the risk of accidentally mixing\n  code systems.\n- **As a substitute for re-validation:** Mapping ICD-9-CM codes via GEMs and treating the\n  result as a validated ICD-10-CM phenotype is not re-validation. GEMs are approximate\n  and asymmetric (the forward and backward maps are not inverses of each other). Use GEMs\n  as a starting point, then empirically evaluate the derived code list in the data of\n  interest.\n- **When applied without era-conditioning:** A query that applies an ICD-9-CM code list\n  to a post-transition claim or an ICD-10-CM code list to a pre-transition claim will return zero\n  matching rows for that era, which looks identical to a true absence of the condition.\n  The failure is silent. 
Always condition code lists on the claim's service date (discharge date for inpatient stays) relative\n  to the transition cutoff.\n- **For external-cause coding (E codes) without recognizing their format:** E codes\n  (E000-E999) overlap numerically with the 3-digit diagnosis categories if the \"E\" prefix is\n  dropped during extraction. A process that strips the first character from all codes\n  will misclassify E codes. Always validate that E and V codes are preserved with their\n  prefixes in the extract.\n\n**Data-source operational depth.**\n- **Medicare FFS (MedPAR/institutional + carrier/professional):** ICD-9-CM diagnosis codes\n  appear in `DGNS_CD_1` through `DGNS_CD_25` on both inpatient (MedPAR) and outpatient\n  (carrier/outpatient) claims up to the 30 September 2015 service date. Volume 3 procedure\n  codes appear in `ICD_PRCDR_CD_1` through `ICD_PRCDR_CD_6` on MedPAR (inpatient) only.\n  Admissions that straddle the transition date (admitted before 1 October 2015, discharged\n  on or after it) are a boundary case: inpatient claims were generally coded according to\n  the discharge date, so treat discharge date as the coding-system determinant and check\n  any code-version indicator the file provides. Principal\n  diagnosis is in position 1; secondary diagnoses in positions 2+. The \"present on\n  admission\" (POA) indicator distinguishes pre-admission from hospital-acquired diagnoses\n  on inpatient records and is important for outcome ascertainment and patient safety studies.\n- **Commercial claims (MarketScan, Optum):** Same ICD-9-CM field structure; the transition\n  date applies uniformly. Some vendors retain the decimal point in the stored code; verify\n  the format contract for the specific dataset before writing matching logic.\n- **EHR problem lists and encounter diagnoses:** EHR records may carry ICD-9-CM codes on\n  historical problem list entries even after the transition, because problems entered before\n  October 2015 were not necessarily back-coded to ICD-10-CM. 
Cross-sectional extracts of\n  EHR diagnosis data therefore require era-aware handling: use service/encounter date, not\n  extract date, to determine the applicable coding system.\n- **Registry linkage:** Disease registries that link to claims (SEER-Medicare, tumor\n  registries) retain the ICD-9-CM codes from the linked claims records. SEER tumor site\n  coding uses ICD-O-3 (not ICD-9-CM), but cause-of-death and comorbidity data in the linked\n  claims file are subject to the same era logic.\n\n**Relationship to GEMs and crosswalk mapping.** CMS published the General Equivalence\nMappings (GEMs) to facilitate the transition. The forward GEM maps each ICD-9-CM code to\none or more ICD-10-CM codes; the backward GEM maps ICD-10-CM codes back to ICD-9-CM. Most\nICD-9-CM codes map to multiple ICD-10-CM codes (one-to-many), and many ICD-10-CM codes\nmap to multiple ICD-9-CM codes (many-to-one in the backward direction). GEMs are published\non the CMS website and are the standard starting point for cross-era code translation, but\nthey are approximate and should be reviewed clinically for each phenotype of interest.",
    "primary_category": "Data_Standard",
    "tags": [
      "coding-system",
      "data-standard",
      "primitive",
      "icd-9",
      "icd-9-cm",
      "claims",
      "diagnosis-codes",
      "procedure-codes",
      "volume-3",
      "transition",
      "era-aware"
    ],
    "applies_to_study_types": [
      "claims_analysis",
      "cohort_retrospective",
      "interrupted_time_series",
      "new_user",
      "active_comparator_new_user",
      "ehr_study",
      "linked_data_study"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1111/j.1475-6773.2005.00444.x",
        "url": "https://doi.org/10.1111/j.1475-6773.2005.00444.x",
        "citation_text": "O'Malley KJ, Cook KF, Price MD, Wildes KR, Hurdle JF, Ashton CM. Measuring diagnoses: ICD code accuracy. Health Services Research. 2005;40(5 Pt 2):1620-1639.",
        "year": 2005,
        "authors_short": "O'Malley et al.",
        "notes": "Foundational evaluation of ICD-9-CM diagnosis code accuracy in administrative data, demonstrating systematic variation in positive predictive value and sensitivity across code types, facilities, and data sources — the canonical reference establishing why ICD code lists require empirical validation rather than face-value acceptance."
      },
      {
        "role": "explain",
        "doi": "10.1097/01.mlr.0000182534.19832.83",
        "url": "https://doi.org/10.1097/01.mlr.0000182534.19832.83",
        "citation_text": "Quan H, Sundararajan V, Halfon P, Fong A, Burnand B, Luthi JC, Saunders LD, Beck CA, Feasby TE, Ghali WA. Coding algorithms for defining comorbidities in ICD-9-CM and ICD-10 administrative data. Medical Care. 2005;43(11):1130-1139.",
        "year": 2005,
        "authors_short": "Quan et al.",
        "notes": "Provides the ICD-9-CM code lists for the Charlson and Elixhauser comorbidity indices that remain the most widely used comorbidity adjustment tools in claims research; demonstrates the algorithm-specification and era-mapping challenge that every analyst who uses pre-2015 data must resolve."
      },
      {
        "role": "use",
        "doi": "10.1097/00005650-199801000-00004",
        "url": "https://doi.org/10.1097/00005650-199801000-00004",
        "citation_text": "Elixhauser A, Steiner C, Harris DR, Coffey RM. Comorbidity measures for use with administrative data. Medical Care. 1998;36(1):8-27.",
        "year": 1998,
        "authors_short": "Elixhauser et al.",
        "notes": "Defines the 30-condition Elixhauser comorbidity index entirely from ICD-9-CM codes; widely used as a covariate adjustment strategy in comparative effectiveness and outcomes research and therefore a canonical example of a real-world applied ICD-9-CM code list."
      },
      {
        "role": "use",
        "doi": null,
        "url": "https://www.cms.gov/medicare/coding-billing/icd-10-codes/icd-9-cm-diagnosis-procedure-codes-abbreviated-and-full-code-titles",
        "citation_text": "Centers for Medicare and Medicaid Services. ICD-9-CM Diagnosis and Procedure Codes: Abbreviated and Full Code Titles. CMS.gov. Accessed 2026-06-12.",
        "year": 2015,
        "authors_short": "CMS",
        "notes": "Official CMS archive of the complete ICD-9-CM diagnosis and procedure (Volume 3) code tables used in all US HIPAA-covered transactions through 30 September 2015; the authoritative source for code lookup, range validation, and code-list construction."
      },
      {
        "role": "use",
        "doi": null,
        "url": "https://www.cms.gov/medicare/coding-billing/icd-10-codes/icd-10-cm-icd-10-pcs-gem-archive",
        "citation_text": "Centers for Medicare and Medicaid Services. ICD-10-CM / ICD-10-PCS GEM Archive (General Equivalence Mappings). CMS.gov. Accessed 2026-06-12.",
        "year": 2016,
        "authors_short": "CMS GEMs",
        "notes": "CMS archive of forward and backward General Equivalence Mappings between ICD-9-CM and ICD-10-CM/PCS; the starting point for cross-era code-list translation in studies spanning the October 2015 transition."
      }
    ],
    "plain_language_summary": "ICD-9-CM is the older US diagnosis and procedure coding system that was used on all hospital and insurance billing records from the early 1980s until 30 September 2015, when the US switched to ICD-10-CM and ICD-10-PCS. Every diagnosis code in a pre-2015 insurance claim is an ICD-9-CM code — a short number like 250.00 meaning type 2 diabetes — and researchers studying anything that happened before October 2015 must include those older codes in their search lists alongside the newer ICD-10 codes. Leaving out the ICD-9-CM codes is one of the most common silent errors in claims research: the database returns no result, which looks exactly like a patient who did not have the condition, when in fact the analyst simply searched with the wrong code set for that time period.",
    "key_terms": [
      {
        "term": "ICD-9-CM diagnosis code",
        "definition": "A 3-to-5-character number (e.g., 250.00 for type 2 diabetes) stored without the decimal point in raw claims data, used to label every diagnosis billed on a US insurance claim before 1 October 2015."
      },
      {
        "term": "Volume 3 procedure code",
        "definition": "A 2-to-4-digit ICD-9-CM code (e.g., 81.54 for total knee replacement) that describes a surgical or therapeutic procedure performed during a hospital stay; it appeared only on inpatient hospital claims and was replaced by ICD-10-PCS in October 2015."
      },
      {
        "term": "V code",
        "definition": "An ICD-9-CM supplementary code beginning with the letter V (e.g., V58.x for aftercare) that records why a patient contacted healthcare when no active disease code applies; used for vaccination visits, screening exams, and follow-up care."
      },
      {
        "term": "E code",
        "definition": "An ICD-9-CM supplementary code beginning with the letter E (E000-E999) that records the external cause of an injury or poisoning, such as a fall or a motor vehicle accident; important for injury research and safety studies."
      },
      {
        "term": "GEMs (General Equivalence Mappings)",
        "definition": "Translation tables published by CMS that map each ICD-9-CM code to one or more ICD-10-CM codes (and vice versa), providing a starting point for converting code lists across the 2015 transition, though they are approximate and not a substitute for re-validating a phenotype algorithm."
      },
      {
        "term": "Era-aware code dispatch",
        "definition": "The programming practice of applying ICD-9-CM code lists to claims with service dates before 1 October 2015 and ICD-10-CM code lists to claims on or after that date, ensuring the correct coding system is used for each record."
      }
    ],
    "worked_example": {
      "scenario": "A pharmacoepidemiology team is studying hospitalization rates for diabetes-related complications in a commercial claims database from January 2014 through December 2017. This window spans the ICD-9-CM-to-ICD-10-CM transition (1 October 2015). The team wants to count all inpatient hospitalizations with a primary diagnosis of type 2 diabetes with uncontrolled hyperglycemia. Before the transition this is ICD-9-CM 250.02 (diabetes mellitus, type 2, uncontrolled); after the transition the nearest ICD-10-CM equivalent is E11.65 (type 2 diabetes mellitus with hyperglycemia). If the team queries only ICD-10-CM E11.65, they will miss all events from 2014 through September 2015; if they query only ICD-9-CM 250.02, they will miss all events from October 2015 through December 2017.\n",
      "dataset": {
        "caption": "Monthly inpatient hospitalization counts from a claims database, 2014-2017. \"Pre\" = service date before 1 Oct 2015 (ICD-9-CM era); \"Post\" = service date on or after 1 Oct 2015 (ICD-10-CM era). Counts shown for three query strategies.\n",
        "columns": [
          "Year",
          "Era",
          "ICD-9-CM only (250.02)",
          "ICD-10-CM only (E11.65)",
          "Dual-era (correct)"
        ],
        "rows": [
          [
            2014,
            "Pre",
            120,
            0,
            120
          ],
          [
            "2015 Jan-Sep",
            "Pre",
            90,
            0,
            90
          ],
          [
            "2015 Oct-Dec",
            "Post",
            0,
            28,
            28
          ],
          [
            2016,
            "Post",
            0,
            112,
            112
          ],
          [
            2017,
            "Post",
            0,
            115,
            115
          ]
        ]
      },
      "steps": [
        "The study window covers 48 months (Jan 2014 - Dec 2017). The ICD-9-CM era covers 21 months (Jan 2014 - Sep 2015); the ICD-10-CM era covers 27 months (Oct 2015 - Dec 2017).",
        "ICD-9-CM-only query total = 120 + 90 + 0 + 0 + 0 = 210 hospitalizations, all from the pre-transition era; zero events are captured after Oct 2015.",
        "ICD-10-CM-only query total = 0 + 0 + 28 + 112 + 115 = 255 hospitalizations, all from the post-transition era; zero events are captured before Oct 2015.",
        "Dual-era (correct) query total = 120 + 90 + 28 + 112 + 115 = 465 hospitalizations, because the correct ICD-9-CM code is applied to pre-transition claims and the correct ICD-10-CM code is applied to post-transition claims.",
        "The ICD-9-CM-only strategy captures 210 / 465 = 0.45 of all true events — missing 255 events (all of 2016-2017 and Q4 2015).",
        "The ICD-10-CM-only strategy captures 255 / 465 = 0.55 of all true events — missing 210 events (all of 2014 through September 2015).",
        "Neither single-era query is usable for a rate trend. The ICD-9-CM-only rate would appear to drop to zero in October 2015 — a step artifact at the coding transition, not a true clinical change."
      ],
      "result": "Correct dual-era total = 210 + 255 = 465 hospitalizations over 48 months. ICD-9-CM-only captures 210 / 465 = 0.45 (45%) of true events. ICD-10-CM-only captures 255 / 465 = 0.55 (55%) of true events. Using a single-era code list in a multi-era study introduces a measurement artifact equivalent to ignoring 45-55% of the outcome events depending on which era is omitted.\n",
      "timeline_spec": {
        "title": "ICD-9-CM vs ICD-10-CM coding eras in a 2014-2017 study window",
        "window": {
          "start": "2014-01-01",
          "end": "2017-12-31",
          "label": "48-month study window (Jan 2014 - Dec 2017)"
        },
        "events": [
          {
            "label": "ICD-9-CM era (Volumes 1-2 diagnosis, Volume 3 procedure)",
            "start": "2014-01-01",
            "length_days": 638,
            "quantity": "21 months of ICD-9-CM coded claims"
          },
          {
            "label": "ICD-10-CM/PCS era (diagnosis + inpatient procedure)",
            "start": "2015-10-01",
            "length_days": 822,
            "quantity": "27 months of ICD-10-CM coded claims"
          }
        ],
        "spans": [
          {
            "kind": "exposed",
            "start": "2014-01-01",
            "end": "2015-09-30",
            "label": "ICD-9-CM era: 210 events (use code 250.02)"
          },
          {
            "kind": "unexposed",
            "start": "2015-10-01",
            "end": "2017-12-31",
            "label": "ICD-10-CM era: 255 events (use code E11.65)"
          }
        ],
        "result": {
          "label": "Dual-era correct total: 465 events; single-era queries miss 45-55%",
          "value": 465
        }
      }
    },
    "prerequisites": [
      "claims-analysis",
      "diagnosis-phenotype-algorithm-1ip-2op-time-window-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Diagnosis code — flat vs decimal representation",
        "description": "Raw claims files store ICD-9-CM diagnosis codes without the decimal point (e.g., \"25000\"), while tabular references and published code lists use the decimal form (\"250.00\"). Matching requires normalizing to a single canonical form. The standard approach is to strip the decimal from reference lists before matching against raw data, or to insert a decimal after the third character of raw codes before matching against reference lists. Either is correct; the critical requirement is consistency within the pipeline. V codes and E codes have the same storage convention (decimal omitted in raw data) but require their alpha prefix to be preserved.\n",
        "edge_cases": [
          "Some commercial vendors retain the decimal in stored codes; verify the format documentation for each dataset before building the matching logic.",
          "Three-character codes (e.g., 410 for acute MI, without 5th digit) are valid ICD-9-CM codes and must not be padded with trailing zeros when the padded form is a different, more specific code."
        ],
        "data_source_notes": "claims: Medicare stores codes in the flat 5-character format (DGNS_CD fields on MedPAR and carrier claims). Commercial databases vary; always read the data dictionary."
      },
      {
        "name": "Volume 3 inpatient procedure codes vs CPT",
        "description": "Before 1 October 2015, inpatient institutional (facility) claims used ICD-9-CM Volume 3 procedure codes (e.g., 81.54 = total knee replacement); professional claims used CPT codes for both inpatient and outpatient procedures throughout. After the transition, institutional claims switched to ICD-10-PCS. A complete inpatient procedure phenotype in a multi-era study needs three sources: Volume 3 on pre-2015 institutional claims, CPT on professional claims from both eras, and ICD-10-PCS on post-2015 institutional claims.\n",
        "edge_cases": [
          "Procedure codes from professional claims for the same inpatient stay may be present in the carrier file even when the institutional claim uses Volume 3 codes; avoid double-counting the same procedure event.",
          "Volume 3 codes for some procedures have no ICD-10-PCS equivalent and vice versa; GEMs for procedures are less complete than for diagnoses."
        ],
        "data_source_notes": "Medicare: Volume 3 codes in ICD_PRCDR_CD_1 through ICD_PRCDR_CD_6 on MedPAR (inpatient) only, up to service dates before 1 Oct 2015. ICD-10-PCS replaces them in the same fields from that date forward."
      },
      {
        "name": "E and V code handling",
        "description": "E codes (E000-E999) classify external causes of injury or poisoning; V codes (V01-V91) classify factors influencing health status and contacts with health services. Both use alpha prefixes in their stored form (the E or V character followed by digits). A code extraction pipeline that treats all DGNS_CD values as purely numeric will silently drop E and V codes, and a pipeline that strips the first character will misclassify them as numeric diagnosis codes. For studies involving injuries, poisonings, screening visits, or postsurgical states, E and V codes are often the primary phenotype.\n",
        "edge_cases": [
          "E codes occupy the range E000-E999, which overlaps with V codes if the prefix is mishandled. Validate the prefix character before matching.",
          "Some databases store E codes in dedicated supplementary diagnosis fields rather than in the standard DGNS_CD sequence; check the data model documentation."
        ],
        "data_source_notes": "claims: E codes appear in supplementary diagnosis fields on Medicare inpatient and outpatient claims. The HCUP databases and some commercial vendors preserve E codes in dedicated supplemental fields."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "icd-10-cm-diagnosis-coding",
        "pros_of_this": "ICD-9-CM is the only coding system present in pre-October 2015 US claims data; no alternative exists for historical research. It covers a much longer calendar window (1979-2015) than any ICD-10-CM data can provide.",
        "cons_of_this": "Far fewer codes (~14,000 vs ~70,000+ in ICD-10-CM), no laterality, no encounter type, lower specificity for injuries and complications, and no longer maintained. Phenotype algorithms validated in ICD-9-CM do not transfer without re-validation.",
        "when_to_prefer": "Use ICD-9-CM code lists whenever the study window or lookback period includes records from before 1 October 2015; use ICD-10-CM for post-transition records; use both in dual-era studies."
      },
      {
        "compared_to": "claims-analysis",
        "pros_of_this": "Understanding ICD-9-CM format and structure is a prerequisite for correctly operationalizing any ICD-9-CM-era claims phenotype.",
        "cons_of_this": "ICD-9-CM alone is a coding vocabulary, not a study design; it must be embedded within a valid claims-analysis framework (enrollment windows, time-zero, phenotype algorithms) to produce valid research estimates.",
        "when_to_prefer": "These are complementary, not alternatives: claims analysis is the study-design framework; ICD-9-CM is the vocabulary layer within it."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Normalize flat codes to/from decimal form before matching. Condition all code-list filters on service_date < '2015-10-01' for ICD-9-CM and service_date >= '2015-10-01' for ICD-10-CM. Include Volume 3 procedure codes in inpatient procedure phenotypes for pre-transition claims. Validate that V and E code prefixes are preserved. On MedPAR, principal diagnosis is field DGNS_CD_1; Volume 3 procedures are in ICD_PRCDR_CD_1 through ICD_PRCDR_CD_6. For comorbidity scoring in baseline windows that span the transition, apply the era-appropriate code list to each claim's era.",
      "ehr": "EHR problem lists and encounter diagnosis fields may retain ICD-9-CM codes on entries made before the transition even in post-2015 extracts. Use the encounter date, not the extract date, to determine the coding system. Back-coding of historical problems to ICD-10-CM is inconsistent across EHR vendors.",
      "registry": "Disease registries linked to Medicare claims (e.g., SEER-Medicare) carry ICD-9-CM codes on pre-transition claims records. Tumor site coding uses ICD-O-3 throughout; cause-of-death and comorbidity data from linked claims are subject to the same era-aware rules as any Medicare claims analysis.",
      "linked": "Linked claims-EHR data require era logic applied to the claims component; EHR diagnosis codes may be in ICD-9-CM or ICD-10-CM depending on entry date. Harmonize to a single target vocabulary (e.g., OMOP SNOMED standard concepts) when the infrastructure supports it, and audit the source-to-standard mappings for rare codes."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import re\nfrom datetime import date\n\n# ---------------------------------------------------------------------------\n# ICD-9-CM transition cutoff (US HIPAA enforcement date)\n# ---------------------------------------------------------------------------\nICD9_CUTOFF = date(2015, 10, 1)   # claims with service_date < this date use ICD-9-CM\n\n# ---------------------------------------------------------------------------\n# Regex validators — all operate on the DECIMAL form (with dot)\n# ---------------------------------------------------------------------------\n# Numeric diagnosis codes: 3-digit chapter, optional 1-2 decimal digits\n_DX_NUMERIC = re.compile(r\"^\\d{3}(\\.\\d{1,2})?$\")\n# V codes: V followed by 2 digits, optional 1-2 decimal digits\n_DX_V       = re.compile(r\"^V\\d{2}(\\.\\d{1,2})?$\", re.IGNORECASE)\n# E codes: E followed by 3 digits, optional 1 decimal digit\n_DX_E       = re.compile(r\"^E\\d{3}(\\.\\d)?$\", re.IGNORECASE)\n# Volume 3 procedure codes: 2 digits, optional 1-2 decimal digits\n_PX_VOL3    = re.compile(r\"^\\d{2}(\\.\\d{1,2})?$\")\n\n\ndef flat_to_decimal(code: str) -> str:\n    \"\"\"Convert a flat ICD-9-CM code stored without decimal to decimal form.\n\n    Rules:\n      - V codes: insert decimal after V + 2 digits (e.g., V5811 -> V58.11)\n      - E codes: insert decimal after E + 3 digits (e.g., E8100 -> E810.0)\n      - Volume 3 procedure (2 leading digits, <=4 total): insert decimal after 2nd digit\n        (e.g., 8154 -> 81.54)\n      - Numeric diagnosis: insert decimal after 3rd digit (e.g., 25000 -> 250.00)\n    \"\"\"\n    c = code.strip().upper()\n    if not c:\n        return c\n\n    if c.startswith(\"V\"):\n        digits = c[1:]\n        if len(digits) > 2:\n            return \"V\" + digits[:2] + \".\" + digits[2:]\n        return c   # bare V + 2 digits, no further specificity\n\n    if c.startswith(\"E\"):\n        digits = c[1:]\n        if len(digits) > 3:\n            return \"E\" + digits[:3] + 
\".\" + digits[3:]\n        return c   # bare E + 3 digits (e.g., E812 is valid 3-char E code)\n\n    # Numeric — could be Volume 3 (2-digit chapter) or diagnosis (3-digit chapter)\n    # Volume 3: exactly 2 leading digits before specificity; max 4 digits total\n    # Diagnosis: 3 leading digits; max 5 digits total\n    # Distinguish by first character (Volume 3 chapters 01-99, diagnoses 001-999)\n    # A reliable heuristic: raw flat code length determines which decimal position to use.\n    if len(c) <= 4 and c[:2].isdigit() and int(c[:2]) <= 99:\n        # Ambiguous — caller should specify context; default to Volume 3 if <=4 chars\n        # and likely procedure context (passed via flag); here we default to diagnosis\n        pass\n    if len(c) >= 3:\n        return c[:3] + (\".\" + c[3:] if c[3:] else \"\")\n    return c\n\n\ndef decimal_to_flat(code: str) -> str:\n    \"\"\"Remove the decimal point: '250.00' -> '25000', 'V58.11' -> 'V5811'.\"\"\"\n    return code.strip().replace(\".\", \"\").upper()\n\n\ndef is_valid_icd9_dx(code: str) -> bool:\n    \"\"\"Return True if `code` (decimal form) is a syntactically valid ICD-9-CM diagnosis code.\"\"\"\n    c = code.strip().upper()\n    return bool(_DX_NUMERIC.match(c) or _DX_V.match(c) or _DX_E.match(c))\n\n\ndef is_valid_icd9_px(code: str) -> bool:\n    \"\"\"Return True if `code` (decimal form) is a syntactically valid Volume 3 procedure code.\"\"\"\n    return bool(_PX_VOL3.match(code.strip()))\n\n\n# ---------------------------------------------------------------------------\n# Era-aware code dispatch\n# ---------------------------------------------------------------------------\ndef era_code_filter(\n    service_date: date,\n    icd9_codes: set[str],\n    icd10_codes: set[str],\n) -> set[str]:\n    \"\"\"Return the code set appropriate for the service_date.\n\n    Args:\n        service_date: Date of the claim's service.\n        icd9_codes:   Set of ICD-9-CM codes (flat form, no decimal).\n        
icd10_codes:  Set of ICD-10-CM/PCS codes (stored form in the database).\n\n    Returns:\n        The era-appropriate code set. Applying the returned set to claims from\n        the same era guarantees the correct vocabulary is used.\n\n    Pitfalls: never pass both sets to a single SQL IN() clause that spans eras.\n    \"\"\"\n    if service_date < ICD9_CUTOFF:\n        return icd9_codes\n    return icd10_codes\n\n\n# ---------------------------------------------------------------------------\n# Example: diabetes type 2 hyperglycemia, dual-era\n# ---------------------------------------------------------------------------\nT2DM_HYPERGLYCEMIA_ICD9  = {\"25002\"}  # 250.02 flat (uncontrolled type 2 diabetes)\nT2DM_HYPERGLYCEMIA_ICD10 = {\"E1165\"}  # E11.65 flat (type 2 DM with hyperglycemia)\n\ndef count_events(claims: list[dict]) -> int:\n    \"\"\"Count hospitalizations matching the era-correct diabetes hyperglycemia code.\"\"\"\n    n = 0\n    for claim in claims:\n        svc_date = date.fromisoformat(claim[\"service_date\"])\n        valid_codes = era_code_filter(svc_date, T2DM_HYPERGLYCEMIA_ICD9, T2DM_HYPERGLYCEMIA_ICD10)\n        if decimal_to_flat(claim[\"principal_dx\"]) in valid_codes:\n            n += 1\n    return n\n\n# Smoke test\nif __name__ == \"__main__\":\n    test_claims = [\n        {\"service_date\": \"2014-06-01\", \"principal_dx\": \"250.02\"},  # ICD-9, decimal form\n        {\"service_date\": \"2015-09-30\", \"principal_dx\": \"25002\"},   # ICD-9, flat form\n        {\"service_date\": \"2015-10-01\", \"principal_dx\": \"E1165\"},   # ICD-10 (transition day)\n        {\"service_date\": \"2016-03-15\", \"principal_dx\": \"E11.65\"},  # ICD-10, decimal form\n    ]\n    assert count_events(test_claims) == 4, \"Expected 4 matching events\"\n    assert flat_to_decimal(\"25000\") == \"250.00\"\n    assert flat_to_decimal(\"V5811\") == \"V58.11\"\n    assert flat_to_decimal(\"E8100\") == \"E810.0\"\n    assert flat_to_decimal(\"8154\")  == 
\"81.54\"\n    assert is_valid_icd9_dx(\"250.00\")\n    assert is_valid_icd9_dx(\"V58.11\")\n    assert is_valid_icd9_dx(\"E810.0\")\n    assert not is_valid_icd9_dx(\"E11.65\")   # that's ICD-10\n    print(\"All assertions passed.\")",
        "description": "Era-aware ICD-9-CM / ICD-10-CM code dispatch: flat-to-decimal normalization, regex format validators for diagnosis codes, V codes, E codes, and Volume 3 procedure codes, and an era-conditioned lookup function that applies the correct code list based on a claim's service date relative to the 1 October 2015 transition. Pitfalls to avoid: (1) do not match flat claims codes against decimal reference lists without normalization; (2) do not strip the alpha prefix from V/E codes; (3) do not apply ICD-10-CM regex to ICD-9-CM codes or vice versa; (4) do not treat zero rows returned as confirmed absence of the condition — verify you are applying the correct-era code list first.\n",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(lubridate)\n\n# ---------------------------------------------------------------------------\n# Constants\n# ---------------------------------------------------------------------------\nICD9_CUTOFF <- as.Date(\"2015-10-01\")   # service_date < this -> ICD-9-CM era\n\n# ---------------------------------------------------------------------------\n# Normalization: flat (no decimal) <-> decimal form\n# ---------------------------------------------------------------------------\n#' Convert a flat ICD-9-CM code to decimal form.\n#' Examples: \"25000\" -> \"250.00\", \"V5811\" -> \"V58.11\", \"E8100\" -> \"E810.0\"\nflat_to_decimal_icd9 <- function(code) {\n  code <- trimws(toupper(code))\n  ifelse(\n    startsWith(code, \"V\") & nchar(code) > 3,\n    paste0(substr(code, 1, 3), \".\", substr(code, 4, nchar(code))),\n    ifelse(\n      startsWith(code, \"E\") & nchar(code) > 4,\n      paste0(substr(code, 1, 4), \".\", substr(code, 5, nchar(code))),\n      ifelse(\n        grepl(\"^[0-9]\", code) & nchar(code) > 3,\n        paste0(substr(code, 1, 3), \".\", substr(code, 4, nchar(code))),\n        code  # already 3 chars or has a prefix we don't recognize\n      )\n    )\n  )\n}\n\n#' Remove the decimal point from a code: \"250.00\" -> \"25000\".\ndecimal_to_flat_icd9 <- function(code) {\n  gsub(\"\\\\.\", \"\", trimws(toupper(code)))\n}\n\n# ---------------------------------------------------------------------------\n# Validators (operate on decimal form)\n# ---------------------------------------------------------------------------\nis_valid_icd9_dx <- function(code) {\n  c <- trimws(toupper(code))\n  grepl(\"^\\\\d{3}(\\\\.\\\\d{1,2})?$\", c) |        # numeric diagnosis\n    grepl(\"^V\\\\d{2}(\\\\.\\\\d{1,2})?$\", c) |      # V codes\n    grepl(\"^E\\\\d{3}(\\\\.\\\\d)?$\", c)             # E codes\n}\n\nis_valid_icd9_vol3 <- function(code) {\n  grepl(\"^\\\\d{2}(\\\\.\\\\d{1,2})?$\", trimws(code))   # Volume 3 procedure\n}\n\n# 
---------------------------------------------------------------------------\n# Era-aware dispatch\n# ---------------------------------------------------------------------------\n#' Given a vector of service dates and paired code columns, return the\n#' era-appropriate match indicator (TRUE if the code is in the correct era set).\n#'\n#' @param service_date  Date vector.\n#' @param code          Character vector of codes from the claims data (flat form).\n#' @param icd9_codes    Character vector of ICD-9-CM codes to match (flat form).\n#' @param icd10_codes   Character vector of ICD-10-CM codes to match (flat form).\n#' @return Logical vector: TRUE when the claim's era-correct code set contains the code.\nera_match <- function(service_date, code, icd9_codes, icd10_codes) {\n  svc_date <- as.Date(service_date)\n  is_icd9_era <- svc_date < ICD9_CUTOFF\n  flat_code   <- decimal_to_flat_icd9(code)\n  (is_icd9_era  & flat_code %in% decimal_to_flat_icd9(icd9_codes)) |\n    (!is_icd9_era & flat_code %in% decimal_to_flat_icd9(icd10_codes))\n}\n\n# ---------------------------------------------------------------------------\n# Example: type 2 diabetes with hyperglycemia, dual-era\n# ---------------------------------------------------------------------------\nT2DM_ICD9  <- c(\"25002\")   # 250.02 (type 2 DM, uncontrolled)\nT2DM_ICD10 <- c(\"E1165\")   # E11.65 (type 2 DM with hyperglycemia)\n\n# Usage on a claims data frame:\n# claims$dx_match <- era_match(claims$service_date, claims$principal_dx,\n#                               T2DM_ICD9, T2DM_ICD10)\n# event_count <- sum(claims$dx_match, na.rm = TRUE)\n\n# Smoke test\ntest_df <- data.frame(\n  service_date = as.Date(c(\"2014-06-01\", \"2015-09-30\", \"2015-10-01\", \"2016-03-15\")),\n  principal_dx = c(\"250.02\", \"25002\", \"E1165\", \"E11.65\"),\n  stringsAsFactors = FALSE\n)\ntest_df$match <- era_match(test_df$service_date, test_df$principal_dx,\n                           T2DM_ICD9, 
T2DM_ICD10)\nstopifnot(all(test_df$match))\ncat(\"All R assertions passed.\\n\")",
        "description": "Era-aware ICD-9-CM / ICD-10-CM dispatch in R for use with claims data frames. Includes flat-to-decimal normalization, regex validators for diagnosis code types (numeric, V, E), Volume 3 procedure validators, and an era-conditioned lookup that applies the correct code set based on service date. Pitfalls to avoid: (1) do not grepl() ICD-9-CM patterns against ICD-10-CM coded rows; (2) do not use sub() or gsub() to strip the first character of alpha-prefixed codes; (3) always apply era_code_filter() before any IN or %in% matching to prevent silent zero-row returns from the wrong coding era.\n",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  A[Claims record service_date] --> B{service_date < 2015-10-01?}\n  B -->|Yes: ICD-9-CM era| C[Apply ICD-9-CM diagnosis codes\\nVolumes 1-2: numeric 001-999,\\nV01-V91, E000-E999]\n  B -->|Yes: inpatient only| D[Apply Volume 3 procedure codes\\n2-4 digit, e.g. 81.54 TKR]\n  B -->|No: ICD-10-CM era| E[Apply ICD-10-CM diagnosis codes\\n70000+ codes with laterality\\nand encounter type]\n  B -->|No: inpatient only| F[Apply ICD-10-PCS procedure codes\\n7-character alphanumeric]\n  C --> G[Flat-to-decimal normalization\\n25000 -> 250.00\\nV5811 -> V58.11\\nE8100 -> E810.0]\n  E --> H[Match against ICD-10-CM code list]\n  G --> I[Match against ICD-9-CM code list]\n  I --> J[Combine era-conditioned results]\n  H --> J\n  D --> K[Volume 3 code list]\n  K --> J\n  F --> L[ICD-10-PCS code list]\n  L --> J\n  J --> M[Complete multi-era phenotype]",
        "caption": "Era-aware code dispatch for a study spanning the 1 October 2015 ICD-9-CM-to-ICD-10-CM transition. Applying only one era's code list silently drops 45-55% of events in a 2014-2017 study window.",
        "alt_text": "Flowchart showing service date routing to ICD-9-CM or ICD-10-CM code lists, with normalization steps, culminating in a complete multi-era phenotype.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "see_also",
        "target_slug": "claims-analysis",
        "notes": "ICD-9-CM is the diagnosis and procedure vocabulary embedded in all US administrative claims data before 1 October 2015; claims analysis is the study-design framework that operationalizes those codes into cohorts, exposures, and outcomes."
      },
      {
        "relation_type": "used_with",
        "target_slug": "diagnosis-phenotype-algorithm-1ip-2op-time-window-rwe",
        "notes": "Diagnosis phenotype algorithms (e.g., 1 inpatient OR 2 outpatient codes ≥30 days apart) must be applied using era-appropriate code lists: ICD-9-CM for pre-October 2015 claims, ICD-10-CM for post-October 2015 claims."
      },
      {
        "relation_type": "used_with",
        "target_slug": "interrupted-time-series-rwe",
        "notes": "ITS analyses spanning the October 2015 transition exhibit a measurement artifact at the coding changeover; the ITS model should include a step indicator at the transition date to separate the coding discontinuity from any true clinical effect."
      },
      {
        "relation_type": "used_with",
        "target_slug": "algorithm-validation",
        "notes": "Phenotype algorithms validated in ICD-9-CM data do not transfer automatically to ICD-10-CM; the GEMs provide a starting point for code-list translation but require empirical re-validation in post-transition data."
      },
      {
        "relation_type": "used_with",
        "target_slug": "misclassification-bias-correction-rwe",
        "notes": "Failure to apply era-correct ICD-9-CM code lists in a multi-era study introduces outcome and covariate misclassification that is non-differential with respect to service date but differential with respect to the exposure distribution, a form of systematic measurement error amenable to bias analysis."
      },
      {
        "relation_type": "used_with",
        "target_slug": "omop-cdm-method-patterns-rwe",
        "notes": "OMOP maps ICD-9-CM source codes to SNOMED standard concepts in the CONDITION_OCCURRENCE table, enabling era-agnostic queries across ICD-9-CM and ICD-10-CM data; analysts should audit the source-to-standard mapping for rare codes and V/E codes."
      },
      {
        "relation_type": "see_also",
        "target_slug": "procedure-identification-and-measurement-in-claims-ehr",
        "notes": "Volume 3 ICD-9-CM procedure codes appear on inpatient institutional claims only before October 2015; a complete inpatient procedure phenotype requires Volume 3 codes for the pre-transition era alongside CPT codes from professional claims."
      }
    ],
    "aliases": [
      "ICD-9",
      "ICD-9-CM",
      "ICD9",
      "ICD-9 Volume 3"
    ],
    "complexity": "foundational",
    "regulatory_relevance": [
      "fda",
      "cms"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "icer-net-monetary-benefit-rwe",
    "name": "ICER and Net Monetary Benefit (NMB)",
    "short_definition": "Two algebraically linked summaries of a cost-effectiveness comparison — the incremental cost-effectiveness ratio (the slope of incremental cost over incremental effect) and the net monetary benefit (the same comparison rescaled to currency at a willingness-to-pay threshold) — that convert paired incremental cost and incremental effect estimates into a single decision quantity.",
    "long_description": "The **incremental cost-effectiveness ratio (ICER)** and **net monetary benefit (NMB)** are the two standard ways to\ncollapse a comparison of a new intervention against a comparator into one number a decision-maker can act on. Both are\nbuilt from the same two ingredients: the **incremental cost** ΔC = C_new − C_comparator and the **incremental effect**\nΔE = E_new − E_comparator (almost always QALYs in a cost-utility analysis, sometimes life-years or a natural unit). The\nICER is their ratio, ΔC / ΔE — the extra cost per extra unit of health, compared against a willingness-to-pay (WTP)\nthreshold λ (e.g., £20,000–£30,000/QALY for NICE, or a benchmark such as $100,000–$150,000/QALY in the US). The NMB\nrephrases the *same* comparison on a money scale: NMB = λ·ΔE − ΔC. The two carry identical decision content — an\nintervention is cost-effective at λ exactly when ICER < λ (in the cost-increasing, effect-increasing quadrant) *and*\nwhen NMB > 0 — but NMB is the better-behaved statistic and is what almost all modern uncertainty analysis is built on.\n\n**Core conceptual distinction.** The ICER is a *ratio*, and ratios are treacherous: it is undefined when ΔE = 0,\nunstable when ΔE is small, and its sign is ambiguous. A negative ICER can mean the intervention dominates (cheaper and\nmore effective, ΔC<0, ΔE>0) *or* is dominated (more expensive and less effective, ΔC>0, ΔE<0) — the ratio alone cannot\ntell you which, so the ICER plane (the four quadrants of ΔC vs ΔE) must always accompany it. The ICER also cannot be\naveraged or have a meaningful confidence interval computed by naive ratio statistics, because the sampling distribution\nof a ratio is not normal and can straddle a discontinuity. 
The **NMB is the fix**: λ·ΔE − ΔC is a linear combination of\ntwo estimable quantities, so it has a well-defined mean, a computable variance, and an interpretable sign at every λ.\nDecision uncertainty is then summarized by the **cost-effectiveness acceptability curve (CEAC)** — P(NMB > 0) plotted\nacross λ — which is the canonical output. Net health benefit (NHB = ΔE − ΔC/λ) is the same quantity expressed in health\nunits. In RWE, the additional twist is that ΔC and ΔE are *estimated from observational person-level data*, so they\ninherit confounding, censoring, and cost-distribution problems that a modeled CEA built from trial inputs does not face\nin the same way.\n\n**Pros, cons, and trade-offs** (specific and comparative).\n- **NMB/CEAC vs the raw ICER:** NMB is linear, always defined, signed correctly, and directly poolable across patients,\n  subgroups, and bootstrap replicates; the ICER is none of these. **Prefer NMB** for all inference, sensitivity, and\n  value-of-information work. The ICER is retained only because thresholds and league tables are framed in cost-per-QALY,\n  and because regulators/HTA bodies expect to see it reported alongside the plane.\n- **Net-benefit regression vs the bootstrapped ICER scatter:** With patient-level RWE you can regress individual NMB_i =\n  λ·E_i − C_i on a treatment indicator and confounders in a single GLM (Hoch–Briggs); this yields a covariate-adjusted\n  incremental NMB with a standard error and folds confounding adjustment directly into the economic estimate. The\n  bootstrap-the-ICER-cloud approach is more familiar but cannot adjust for confounders without a separate modeling step\n  and is awkward when ΔE crosses zero. 
**Prefer net-benefit regression** when you have person-level cost+effect data and\n  need confounding control; use the nonparametric bootstrap of (ΔC, ΔE) for the CEAC and confidence ellipse.\n- **ICER/NMB vs a Markov/partitioned-survival decision model:** A within-study (trial- or cohort-based) ICER uses\n  observed person-level costs and effects directly, avoiding structural modeling assumptions, but is bounded by the\n  follow-up horizon and the observed population. A decision-analytic model extrapolates to a lifetime horizon and lets\n  you swap inputs, but adds structural and transition-probability uncertainty. **Prefer the within-study estimate** when\n  follow-up covers the decision-relevant horizon; use a model when extrapolation beyond the data is unavoidable (it\n  almost always is for chronic disease, which is why most HTA submissions are modeled).\n- **NMB vs cost-effectiveness ranking by ICER alone:** Ranking mutually exclusive options by raw ICER invites the\n  extended-dominance error; NMB at a fixed λ gives a correct total ordering. **Prefer NMB** for choosing among >2 options.\n\n**When to use.** Report ICER + plane + NMB/CEAC whenever the deliverable is a comparative value statement: an HTA\nsubmission, a payer dossier, a cost-effectiveness manuscript, or any RWE study whose question is \"is the incremental\nhealth gain worth the incremental cost at a defensible threshold?\" Use **NMB/net-benefit regression** as the inferential\nbackbone whenever you have patient-level cost and effect data and need to (a) adjust for confounding, (b) handle\ncensored costs, or (c) produce a CEAC. 
Use the **ICER** as the headline number when the audience reasons in\ncost-per-QALY and a threshold comparison is the decision rule.\n\n**When NOT to use — and when it is actively misleading or dangerous.**\n- **Do not quote an ICER when ΔE is small or its CI crosses zero.** The ratio explodes and flips sign across ΔE = 0; a\n  point estimate of \"$2.1M/QALY\" computed from ΔE ≈ 0.002 is noise dressed as precision. Switch to NMB/CEAC and report\n  the probability of cost-effectiveness instead.\n- **Do not report a negative ICER as if it were informative.** \"−$40,000/QALY\" is meaningless without the quadrant;\n  state dominance/dominated explicitly and show the plane.\n- **Do not compute an ICER from confounded RWE costs and effects without adjustment.** Channeling of the new drug to\n  healthier (or sicker) patients biases both ΔC and ΔE; an unadjusted within-study ICER then encodes confounding by\n  indication as if it were value. Build the comparison on an active-comparator new-user cohort with a propensity score\n  or net-benefit regression, not on prevalent users.\n- **Do not naively average or censor-ignore costs.** Costs accrue over time and are right-censored when follow-up ends;\n  a simple mean of observed costs is biased downward and differentially so by arm if censoring differs. Use a\n  censoring-aware cost estimator (Bang–Tsiatis / Lin) before forming ΔC.\n- **Do not transport a threshold across jurisdictions.** An NMB computed at the US $150k/QALY benchmark says nothing\n  about a NICE decision at £20k/QALY; λ is a decision parameter, not a property of the intervention. 
Present the CEAC so\n  the reader picks λ.\n- **Do not mix discounted effects with undiscounted costs**, or apply different price years/currencies across arms — both\n  silently corrupt ΔC and ΔE.\n\n**Data-source operational depth.**\n- **Claims (FFS vs Medicare Advantage):** Costs are the natural strength — `paid_amount` (plan + patient) on medical and\n  pharmacy claims gives a defensible cost numerator after standardization (e.g., to a national fee schedule or a fixed\n  price year) and outlier handling. The dominant failure mode is **MA-only person-time, which carries no adjudicated\n  paid amounts**: encounter records in MA report utilization but not reliable allowed/paid dollars, so including MA\n  person-time understates costs and does so *differentially* if arms have different MA mix. Restrict cost analysis to\n  FFS-observable person-time (Parts A/B/D or commercial with full claims) and report the excluded fraction. Effects are\n  the weak side — claims have no QALYs and no direct utilities, so ΔE must be a proxy (life-years from a death index,\n  event-free survival) or QALYs must be imputed by mapping claims-derived health states to published utilities, an\n  assumption that should be a named sensitivity analysis. Watch **immortal time in procedure/treatment studies** (a\n  patient must survive to receive the costly index procedure, inflating that arm's effects and shifting its cost\n  timing) and **differential censoring of cost accrual** when disenrollment differs by arm.\n- **EHR:** Better for effect ascertainment (labs, vitals, problem lists, PROs that can feed utility mapping) but poor\n  for costs — charges are not costs, cost-to-charge ratios are crude, and out-of-system care leaks (a patient\n  hospitalized at a non-network facility generates costs invisible to the EHR), biasing ΔC toward the arm that stays\n  in-network. 
Link to claims for the cost side.\n- **Registry:** Strong for clinical effects and adjudicated events (cancer stage, validated MACE), which sharpen ΔE and\n  utility assignment, but typically lacks complete cost capture; link to claims for costs and to a death index for\n  survival. Registry completeness can differ by arm if enrollment is treatment-triggered.\n- **Linked claims–EHR–registry–vital records:** The ideal substrate — registry/EHR effects + claims costs + reliable\n  mortality — but linkage selects the linkable subset (a transportability threat to both ΔC and ΔE) and creates\n  date-discrepancy problems among service, fill, and adjudication dates that must be reconciled before costs and effects\n  are aligned to the same time origin and horizon.\n\n**Worked claims example.** Question: 2-year within-study cost-effectiveness of an SGLT2 inhibitor vs a DPP-4 inhibitor\nfor type 2 diabetes in a commercial + Medicare FFS database, NMB at λ = $100,000/QALY. (1) Cohort: active-comparator\nnew-user design — adults with ≥2 diabetes diagnoses, 365 days of continuous FFS-observable enrollment (Parts A/B/D or\ncommercial medical+pharmacy) before the first qualifying fill, no fill of either class in that washout; index_date =\nfirst fill, arm assigned from the dispensed NDC; exclude any MA-only person-time so `paid_amount` is adjudicated. (2)\nCosts: sum medical + pharmacy `paid_amount` per person from index_date over 24 months, standardized to a single price\nyear (e.g., CPI-medical to 2024 USD) and discounted at 3%/yr; winsorize the top 1% before forming arm means; because\nsome patients disenroll or die before month 24, estimate mean cost per arm with a censoring-aware estimator\n(Bang–Tsiatis inverse-probability-of-censoring weighting on the cost history), not a naive mean — ΔC = adjusted mean\ncost difference. 
(3) Effects: QALYs over 24 months = Σ (time in health state × utility), where utilities are mapped\nfrom claims-derived states (e.g., uncomplicated diabetes, post-MI, ESRD, alive vs dead via the death index) to\npublished EQ-5D values; discount effects at 3%/yr; ΔE = QALY difference. (4) Confounding: fit a net-benefit regression —\nper person compute NMB_i = 100000·QALY_i − cost_i, then regress NMB_i on the arm indicator and a high-dimensional\npropensity-score adjustment (or run on a 1:1 PS-matched set); the arm coefficient is the covariate-adjusted incremental\nNMB with a standard error. (5) Report: ICER = ΔC/ΔE with the four-quadrant plane; incremental NMB at λ = $100k with its\nCI; and a CEAC sweeping λ from $0 to $200k/QALY (P(NMB>0) at each λ) built from a nonparametric bootstrap that resamples\npatients and re-runs the censoring-aware cost and net-benefit steps. (6) Sensitivity: vary the utility source, the\nwinsorization level, the discount rate (0%/5%), and the cost horizon, and run a negative-control analysis to probe\nresidual confounding.\n\n**Interpreting the output**\n\nThe worked example produces ICER = $40,000/QALY and NMB = $30,000 at λ = $100,000/QALY: Drug A adds $20,000 in\nincremental cost (ΔC = $80,000 − $60,000) and 0.5 incremental QALYs (ΔE = 1.8 − 1.3), so\nICER = $20,000 / 0.5 = $40,000/QALY and NMB = $100,000 × 0.5 − $20,000 = $50,000 − $20,000 = $30,000.\n\n*(1) Formal interpretation.* The ICER is the ratio of incremental costs to incremental effects — not average\ncost-per-QALY for Drug A in isolation. At $40,000/QALY it lies well below the $100,000/QALY threshold, so the point\nestimate indicates cost-effectiveness at that threshold. The NMB reframes this ratio as a linear quantity: NMB = λ·ΔE − ΔC. A positive NMB means the\nmonetary value of the additional health gain (λ × ΔE) exceeds the additional cost. 
Crucially, when ΔE > 0, NMB > 0 and ICER < λ\nare algebraically equivalent (divide λ·ΔE − ΔC > 0 by ΔE), so they agree on the cost-effectiveness verdict; when\nΔE < 0 the inequality reverses, which is why the quadrant must accompany the ratio. The practical advantage of NMB\nis that it is a linear function of uncertain parameters, making bootstrap confidence intervals and regression on\nNMB straightforward, whereas the ICER has an unstable distribution when ΔE is near zero.\n\n*(2) Practical interpretation.* Drug A is cost-effective: per patient, the plan gains $30,000 more in health value than it\nspends in additional costs at its $100,000/QALY threshold. Both metrics must be reported as incremental quantities.\nA common error is to report the ICER for the treatment arm alone (total cost divided by total QALYs, ignoring\nthe comparator) — that is an average, not an incremental, measure and does not support a cost-effectiveness decision.",
    "primary_category": "Economic_Evaluation",
    "tags": [
      "icer",
      "net-monetary-benefit",
      "net-benefit-regression",
      "cost-effectiveness-acceptability-curve",
      "willingness-to-pay",
      "cost-utility-analysis",
      "qaly",
      "health-technology-assessment"
    ],
    "applies_to_study_types": [
      "claims_analysis",
      "ehr_study",
      "registry_linkage",
      "comparative_effectiveness",
      "cost_effectiveness_analysis"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1177/0272989X98018002S09",
        "url": "https://doi.org/10.1177/0272989X98018002S09",
        "citation_text": "Stinnett AA, Mullahy J. Net health benefits: a new framework for the analysis of uncertainty in cost-effectiveness analysis. Medical Decision Making. 1998;18(2 Suppl):S68-S80.",
        "year": 1998,
        "authors_short": "Stinnett & Mullahy",
        "notes": "Foundational reframing of the ICER as net (monetary or health) benefit, showing why the linear net-benefit statistic avoids the ratio's undefined values, sign ambiguity, and ill-behaved confidence intervals."
      },
      {
        "role": "explain",
        "doi": "10.1002/hec.903",
        "url": "https://doi.org/10.1002/hec.903",
        "citation_text": "Fenwick E, O'Brien BJ, Briggs A. Cost-effectiveness acceptability curves - facts, fallacies and frequently asked questions. Health Economics. 2004;13(5):405-415.",
        "year": 2004,
        "authors_short": "Fenwick et al.",
        "notes": "Canonical explanation of the CEAC as P(NMB>0) across willingness-to-pay, clarifying its interpretation and the common fallacies that arise when summarizing decision uncertainty."
      },
      {
        "role": "demonstrate",
        "doi": "10.1002/hec.678",
        "url": "https://doi.org/10.1002/hec.678",
        "citation_text": "Hoch JS, Briggs AH, Willan AR. Something old, something new, something borrowed, something blue: a framework for the marriage of health econometrics and cost-effectiveness analysis. Health Economics. 2002;11(5):415-430.",
        "year": 2002,
        "authors_short": "Hoch et al.",
        "notes": "Introduces net-benefit regression - regressing person-level NMB on a treatment indicator and covariates - which is the practical engine for confounding-adjusted ICER/NMB estimation from patient-level RWE."
      },
      {
        "role": "explain",
        "doi": "10.1016/j.jval.2021.11.1351",
        "url": "https://doi.org/10.1016/j.jval.2021.11.1351",
        "citation_text": "Husereau D, Drummond M, Augustovski F, et al. Consolidated Health Economic Evaluation Reporting Standards 2022 (CHEERS 2022) statement: updated reporting guidance for health economic evaluations. Value in Health. 2022;25(1):3-9.",
        "year": 2022,
        "authors_short": "Husereau et al.",
        "notes": "Current reporting standard that specifies how ICERs, the cost-effectiveness plane, NMB, CEACs, the threshold, and the price year must be reported in a health economic evaluation."
      }
    ],
    "plain_language_summary": "The ICER and Net Monetary Benefit (NMB) are two ways to answer one question: is the extra health a new treatment delivers worth its extra cost? The ICER divides the added cost by the added health gain — giving a dollar-per-QALY number you compare to a willingness-to-pay threshold. The NMB flips that around: it multiplies the added health by the threshold, then subtracts the added cost — if the answer is positive, the treatment is worth it at that price. When the new treatment improves health, both give the same verdict, but NMB is easier to work with statistically, so analysts typically rely on it for uncertainty and adjustment work and report the ICER alongside it.",
    "key_terms": [
      {
        "term": "QALY",
        "definition": "A Quality-Adjusted Life Year combines how long a patient lives with how good their health is during that time — one year in perfect health equals 1.0 QALY."
      },
      {
        "term": "incremental cost (ΔC)",
        "definition": "The difference in average total spending between the new treatment group and the comparison group — a positive number means the new treatment costs more."
      },
      {
        "term": "incremental effect (ΔE)",
        "definition": "The difference in average QALYs (or other health outcome) between the new treatment group and the comparison group — a positive number means the new treatment produces better health."
      },
      {
        "term": "willingness-to-pay threshold (λ)",
        "definition": "The maximum a payer or health system is willing to spend to gain one additional QALY — in the US a common benchmark is $100,000 per QALY."
      },
      {
        "term": "cost-effectiveness plane",
        "definition": "A simple chart with added cost on the vertical axis and added health on the horizontal axis — where a point falls (upper-right, lower-right, etc.) tells you at a glance whether the new treatment is likely worth adopting."
      }
    ],
    "worked_example": {
      "scenario": "A health insurer wants to know whether Drug A (a new treatment for type 2 diabetes) is worth its higher price compared with Drug B (the current standard). Analysts pull two years of claims for adult patients who started one of the two drugs for the first time. After cleaning the data, they have mean total costs and mean QALYs for each group. The question: is Drug A cost-effective at a willingness-to-pay of $100,000 per QALY?",
      "dataset": {
        "caption": "Mean two-year costs and QALYs per patient, one row per treatment arm — what the analyst has after the cohort is built and costs/effects are summarized.",
        "columns": [
          "treatment",
          "mean_cost_usd",
          "mean_qalys"
        ],
        "rows": [
          [
            "Drug A (new)",
            80000,
            1.8
          ],
          [
            "Drug B (comparator)",
            60000,
            1.3
          ]
        ]
      },
      "steps": [
        "Compute incremental cost: ΔC = $80,000 − $60,000 = $20,000. Drug A costs $20,000 more per patient over two years.",
        "Compute incremental QALYs: ΔE = 1.8 − 1.3 = 0.5 QALY. Drug A produces half a quality-adjusted life year more per patient.",
        "Compute the ICER: ICER = ΔC ÷ ΔE = $20,000 ÷ 0.5 = $40,000 per QALY. This point falls in the upper-right quadrant of the cost-effectiveness plane — more costly AND more effective — so the ICER-vs-threshold comparison is valid.",
        "Compare the ICER to the threshold: $40,000/QALY < $100,000/QALY → Drug A clears the cost-effectiveness bar.",
        "Compute Net Monetary Benefit: NMB = (λ × ΔE) − ΔC = ($100,000 × 0.5) − $20,000 = $50,000 − $20,000 = $30,000. A positive NMB confirms the same conclusion in dollar terms: Drug A delivers $30,000 more value than it costs at this threshold."
      ],
      "result": "ICER = $40,000/QALY (well below the $100,000/QALY threshold); NMB = $30,000 (positive). Both metrics agree: Drug A is cost-effective — the added health gain is worth the added cost at this willingness-to-pay."
    },
    "prerequisites": [],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Within-study (patient-level) ICER/NMB",
        "description": "Form ΔC and ΔE directly from observed person-level costs and effects over the study horizon, with no structural extrapolation; NMB and the CEAC come from a nonparametric bootstrap of patients.",
        "edge_cases": [
          "Costs are right-censored at disenrollment/death/end-of-data; a naive mean is biased and a censoring-aware estimator (Bang-Tsiatis/Lin) is required before forming ΔC.",
          "The horizon is capped by follow-up, so chronic-disease conclusions may not reach the decision-relevant lifetime horizon."
        ],
        "data_source_notes": "claims: sum paid_amount over a fixed window from index_date, restrict to FFS-observable person-time, standardize to one price year, discount, winsorize outliers; effects usually require utility mapping or a survival proxy."
      },
      {
        "name": "Net-benefit regression (confounding-adjusted incremental NMB)",
        "description": "Compute NMB_i = lambda*E_i - C_i per person and regress on the arm indicator plus confounders (or on a PS-matched/weighted set); the arm coefficient is the adjusted incremental NMB with a standard error.",
        "edge_cases": [
          "Heteroskedastic and skewed person-level NMB warrant robust/bootstrap standard errors.",
          "Must be re-run at each lambda to trace the CEAC; the sign of the arm coefficient flips as lambda crosses the breakeven point."
        ],
        "data_source_notes": "claims/EHR: pair the censoring-aware cost with the mapped/observed effect at the person level; nest the whole estimator inside the patient bootstrap so confounding adjustment and decision uncertainty are propagated together."
      },
      {
        "name": "Model-based (decision-analytic) ICER/NMB",
        "description": "ΔC and ΔE come from a Markov or partitioned-survival model whose inputs (transition probabilities, costs, utilities) are estimated from RWE and extrapolated to a lifetime horizon.",
        "edge_cases": [
          "Adds structural and parameter uncertainty beyond sampling uncertainty; probabilistic sensitivity analysis is mandatory.",
          "Extrapolation of survival/cost curves beyond observed follow-up dominates results and must be justified and varied."
        ],
        "data_source_notes": "RWE feeds transition probabilities, real-world costs (claims), and utilities (PRO/EHR mapping); report the source and price year of each input per CHEERS 2022."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "The raw ICER as the inferential quantity",
        "pros_of_this": "NMB is linear, always defined, correctly signed, poolable across patients/subgroups/bootstrap replicates, and supports a proper CEAC and value-of-information analysis.",
        "cons_of_this": "NMB requires committing to a willingness-to-pay threshold lambda, which is a value judgment external to the data; the ICER avoids naming lambda explicitly.",
        "when_to_prefer": "Use NMB for all uncertainty, ranking, and adjustment work; report the ICER and plane alongside because thresholds and league tables are framed in cost-per-QALY."
      },
      {
        "compared_to": "Bootstrapping the (deltaC, deltaE) cloud and reading off the ICER",
        "pros_of_this": "Net-benefit regression folds confounding adjustment into the economic estimate in one model and behaves well when deltaE crosses zero.",
        "cons_of_this": "Requires patient-level paired cost and effect data and correct specification; the scatter/ellipse is more intuitive for showing joint cost-effect uncertainty.",
        "when_to_prefer": "Person-level RWE needing confounding control - use net-benefit regression for the estimate and the bootstrap for the CEAC and joint-uncertainty plane."
      },
      {
        "compared_to": "Markov / partitioned-survival decision model",
        "pros_of_this": "Within-study ICER/NMB uses observed costs and effects directly, avoiding structural and transition-probability assumptions.",
        "cons_of_this": "Bounded by the observed horizon and population; cannot extrapolate to a lifetime or swap inputs.",
        "when_to_prefer": "When follow-up covers the decision-relevant horizon; otherwise a model is needed for lifetime extrapolation (typical in chronic-disease HTA)."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Costs = sum of medical + pharmacy paid_amount over a fixed window from index_date, standardized to one price year and discounted; restrict to FFS-observable person-time and exclude MA-only spans where paid amounts are not adjudicated. Use a censoring-aware cost estimator (Bang-Tsiatis/Lin) because cost accrual is right-censored. Effects need a utility map or survival proxy (no native QALYs).",
      "ehr": "Better for effects (labs, vitals, PROs feeding utility mapping) than costs; charges are not costs and out-of-system care leaks. Link to claims for the cost side; use structured fields and notes to assign health states for utility mapping.",
      "registry": "Strong for adjudicated clinical effects that sharpen deltaE and utility assignment; weak for complete costs. Link to claims for costs and to a death index for survival; check that registry completeness does not differ by arm.",
      "linked": "Linked claims-EHR-registry-vital-records is ideal (effects + costs + mortality) but introduces linkage selection (transportability) and service/fill/adjudication date discrepancies that must be reconciled before aligning costs and effects to a common time origin and horizon."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\nimport pandas as pd\n\ndef icer_nmb(df: pd.DataFrame, lam: float) -> dict:\n    \"\"\"ICER (deltaC/deltaE) and NMB (lam*deltaE - deltaC) for TREAT vs COMP.\"\"\"\n    t, c = df[df.arm == \"TREAT\"], df[df.arm == \"COMP\"]\n    dC = t[\"cost\"].mean() - c[\"cost\"].mean()\n    dE = t[\"qaly\"].mean() - c[\"qaly\"].mean()\n    icer = dC / dE if dE != 0 else np.nan          # undefined at dE==0 -> report the plane, not the ratio\n    nmb = lam * dE - dC\n    quadrant = (\"NE_more_costly_more_effective\" if dC > 0 and dE > 0 else\n                \"SE_dominant\"  if dC <= 0 and dE >= 0 else\n                \"NW_dominated\" if dC >= 0 and dE <= 0 else\n                \"SW_less_costly_less_effective\")\n    return {\"dC\": dC, \"dE\": dE, \"icer\": icer, \"nmb\": nmb, \"quadrant\": quadrant}\n\ndef ceac(df: pd.DataFrame, lam_grid, n_boot: int = 2000, seed: int = 1) -> pd.DataFrame:\n    \"\"\"P(NMB>0) across willingness-to-pay via a patient-level nonparametric bootstrap.\"\"\"\n    rng = np.random.default_rng(seed)\n    t = df[df.arm == \"TREAT\"][[\"cost\", \"qaly\"]].to_numpy()\n    c = df[df.arm == \"COMP\"][[\"cost\", \"qaly\"]].to_numpy()\n    dC = np.empty(n_boot); dE = np.empty(n_boot)\n    for b in range(n_boot):\n        tb = t[rng.integers(0, len(t), len(t))]\n        cb = c[rng.integers(0, len(c), len(c))]\n        dC[b] = tb[:, 0].mean() - cb[:, 0].mean()\n        dE[b] = tb[:, 1].mean() - cb[:, 1].mean()\n    rows = [{\"lambda\": lam, \"prob_ce\": float(np.mean(lam * dE - dC > 0))} for lam in lam_grid]\n    return pd.DataFrame(rows)\n\npoint = icer_nmb(df, lam=100_000)\ncurve = ceac(df, lam_grid=np.arange(0, 200_001, 10_000))",
        "description": "ICER, NMB, and a bootstrap CEAC from patient-level RWE. Required input (one row per analyzed patient, already\ncohort-built on an active-comparator new-user design and cost/effect-cleaned):\n  df : person_id, arm in {'TREAT','COMP'}, cost (discounted, censoring-adjusted, standardized $), qaly (discounted QALYs)\nCosts must already be censoring-corrected and standardized to one price year upstream; this snippet does the economic\nsummary, not the cost/effect derivation. Returns ICER, NMB at a single lambda, and the CEAC across a lambda grid.",
        "dependencies": [
          "numpy",
          "pandas"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "# ---- Point ICER / NMB ----\nicer_nmb <- function(d, lambda) {\n  dC <- mean(d$cost[d$arm == \"TREAT\"]) - mean(d$cost[d$arm == \"COMP\"])\n  dE <- mean(d$qaly[d$arm == \"TREAT\"]) - mean(d$qaly[d$arm == \"COMP\"])\n  icer <- if (dE != 0) dC / dE else NA_real_   # report the plane when dE == 0\n  list(dC = dC, dE = dE, icer = icer, nmb = lambda * dE - dC)\n}\n\n# ---- Net-benefit regression: covariate-adjusted incremental NMB at a fixed lambda ----\nnb_regression <- function(d, lambda, covariates) {\n  d$nmb_i <- lambda * d$qaly - d$cost\n  f <- reformulate(c(\"arm\", covariates), response = \"nmb_i\")\n  fit <- lm(f, data = d)                       # coef on armTREAT = adjusted incremental NMB\n  s <- summary(fit)$coefficients[\"armTREAT\", ]\n  list(inb = unname(s[\"Estimate\"]), se = unname(s[\"Std. Error\"]), p = unname(s[\"Pr(>|t|)\"]))\n}\n\n# ---- Bootstrap CEAC: P(NMB > 0) across willingness-to-pay ----\nceac <- function(d, lam_grid, n_boot = 2000L) {\n  idx_t <- which(d$arm == \"TREAT\"); idx_c <- which(d$arm == \"COMP\")\n  boot <- replicate(n_boot, {\n    bt <- sample(idx_t, length(idx_t), replace = TRUE)\n    bc <- sample(idx_c, length(idx_c), replace = TRUE)\n    c(dC = mean(d$cost[bt]) - mean(d$cost[bc]),\n      dE = mean(d$qaly[bt]) - mean(d$qaly[bc]))\n  })\n  data.frame(lambda = lam_grid,\n             prob_ce = sapply(lam_grid, function(l) mean(l * boot[\"dE\", ] - boot[\"dC\", ] > 0)))\n}",
        "description": "ICER/NMB plus confounding-adjusted incremental NMB via net-benefit regression (Hoch-Briggs), and a bootstrap CEAC.\nRequired input (one row per analyzed patient):\n  d : data.frame with person_id, arm (factor 'COMP'/'TREAT'), cost (adjusted $), qaly, and baseline covariates x1..xk\nCosts must already be censoring-corrected and standardized upstream.",
        "dependencies": [
          "stats"
        ],
        "source_citations": [
          "hoch-2002"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "%let lambda = 100000;  /* willingness-to-pay per QALY */\n\n/* Person-level net monetary benefit, then regress on arm + confounders. */\ndata nb;\n  set work.pt;\n  nmb_i = &lambda * qaly - cost;   /* individual NMB */\nrun;\n\nproc glm data=nb;\n  class arm (ref='COMP');\n  model nmb_i = arm x1 x2 x3 / solution clparm;   /* arm estimate = adjusted incremental NMB */\n  ods output ParameterEstimates=inb_est;          /* keep Estimate, StdErr, Probt, lower/upper CL */\nrun;\nquit;\n\n/* Unadjusted point ICER and NMB for the headline / plane. */\nproc means data=work.pt noprint nway;\n  class arm;\n  var cost qaly;\n  output out=arm_means mean=mean_cost mean_qaly;\nrun;\n\nproc sql;\n  create table summary as\n  select (select mean_cost from arm_means where arm='TREAT')\n           - (select mean_cost from arm_means where arm='COMP') as dC,\n         (select mean_qaly from arm_means where arm='TREAT')\n           - (select mean_qaly from arm_means where arm='COMP') as dE,\n         calculated dC / calculated dE as icer,              /* undefined if dE=0: report the plane */\n         &lambda * calculated dE - calculated dC as nmb\n  from arm_means(obs=1);\nquit;",
        "description": "Net-benefit regression in SAS: covariate-adjusted incremental NMB at a fixed willingness-to-pay. Required input dataset\n(one row per analyzed patient, post cost/effect derivation):\n  work.pt : person_id, arm ('TREAT'/'COMP'), cost (adjusted, standardized $), qaly, baseline covariates x1-xk\nSet &lambda to the willingness-to-pay threshold; the arm parameter estimate is the adjusted incremental NMB. Re-run\nacross a lambda grid (e.g., in a macro %do loop) to trace the CEAC. PROC GLM gives a model-based SE; use bootstrap\nSEs when person-level NMB is skewed. The closing PROC MEANS / PROC SQL step gives the unadjusted point ICER and NMB.",
        "dependencies": [],
        "source_citations": [
          "hoch-2002"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Cohort[Active-comparator new-user cohort<br/>TREAT vs COMP, one time origin] --> Cost[Per-person cost<br/>paid_amount, FFS-only, standardized,<br/>discounted, censoring-adjusted]\n  Cohort --> Eff[Per-person effect<br/>QALYs via utility mapping<br/>or survival proxy, discounted]\n  Cost --> Delta[Incremental cost deltaC<br/>and incremental effect deltaE]\n  Eff --> Delta\n  Delta --> ICER[ICER = deltaC / deltaE<br/>plot on 4-quadrant plane]\n  Delta --> NMB[\"NMB = lambda*deltaE - deltaC<br/>adjusted via net-benefit regression\"]\n  NMB --> CEAC[Bootstrap CEAC<br/>P NMB greater than 0 across lambda]\n  ICER --> Decide{Cost-effective at lambda?}\n  NMB --> Decide\n  CEAC --> Decide",
        "caption": "From an active-comparator new-user cohort to the decision quantities. Per-person costs and effects yield deltaC and deltaE, which feed both the ICER (with its plane) and the NMB; net-benefit regression adjusts the NMB for confounders and the bootstrap produces the CEAC.",
        "alt_text": "Flowchart from a cohort through per-person cost and effect derivation to incremental cost and effect, then to the ICER on its plane, the NMB via net-benefit regression, and the bootstrap CEAC, all feeding the cost-effectiveness decision.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "quadrantChart\n  title Cost-effectiveness plane TREAT vs COMP\n  x-axis Less effective --> More effective\n  y-axis Less costly --> More costly\n  quadrant-1 NE trade-off ICER vs lambda\n  quadrant-2 NW dominated reject\n  quadrant-3 SW trade-off ICER vs lambda\n  quadrant-4 SE dominant adopt\n  Typical new drug: [0.75, 0.75]\n  Dominated option: [0.25, 0.7]\n  Dominant option: [0.8, 0.25]",
        "caption": "The four quadrants of the cost-effectiveness plane. The ICER's sign is uninformative on its own - a negative ratio arises in both the SE (dominant) and NW (dominated) quadrants - so the plane must always accompany the ratio; only in the NE and SW quadrants does the ICER-vs-threshold comparison decide adoption.",
        "alt_text": "Cost-effectiveness plane with incremental effect on the x-axis and incremental cost on the y-axis, showing the dominant southeast quadrant, dominated northwest quadrant, and the two trade-off quadrants where the ICER is compared to the willingness-to-pay threshold.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "is_variant_of",
        "target_slug": "health-economic-modeling-methods-rwe",
        "notes": "ICER/NMB are the summary decision metrics produced by the broader health-economic modeling family."
      },
      {
        "relation_type": "used_with",
        "target_slug": "cost-utility",
        "notes": "Cost-utility analysis (cost per QALY) is the evaluation type whose result is most often expressed as an ICER and NMB."
      },
      {
        "relation_type": "used_with",
        "target_slug": "cost-effectiveness",
        "notes": "When the effect is a natural unit (life-years, events avoided) rather than QALYs, the same ICER/NMB machinery summarizes the cost-effectiveness analysis."
      },
      {
        "relation_type": "requires",
        "target_slug": "qaly-utility-mapping-rwe",
        "notes": "In claims/EHR RWE without native utilities, QALYs for the deltaE term must be obtained by mapping derived health states to published utility values."
      },
      {
        "relation_type": "requires",
        "target_slug": "discounting-costs-effects-rwe",
        "notes": "Both costs and effects entering ICER/NMB must be discounted to present value at the same rate before deltaC and deltaE are formed."
      },
      {
        "relation_type": "used_with",
        "target_slug": "healthcare-costs-pppm-pppy-pmpm",
        "notes": "Standardized per-member cost measures, after outlier handling, supply the cost numerator (deltaC) of the ICER and the cost term of NMB."
      },
      {
        "relation_type": "used_with",
        "target_slug": "hcru-healthcare-resource-utilization",
        "notes": "HCRU patterns drive and explain the cost differences that determine deltaC."
      },
      {
        "relation_type": "see_also",
        "target_slug": "all-cause-vs-attributable-costs-rwe",
        "notes": "Choosing all-cause vs attributable/incremental costs directly determines the ICER numerator and the cost term of NMB."
      },
      {
        "relation_type": "see_also",
        "target_slug": "cost-outlier-handling-rwe",
        "notes": "ICER and NMB are highly sensitive to cost-outlier handling; prespecified winsorization and sensitivities are required."
      },
      {
        "relation_type": "see_also",
        "target_slug": "budget-impact",
        "notes": "Budget-impact analysis answers affordability over a short horizon and complements the value question that ICER/NMB addresses."
      }
    ],
    "aliases": [
      "ICER",
      "incremental cost-effectiveness ratio",
      "net monetary benefit",
      "NMB",
      "net health benefit",
      "cost-effectiveness acceptability curve"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "hta",
      "fda",
      "ema"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "immortal-time-bias-handling",
    "name": "Immortal Time Bias Handling",
    "short_definition": "A family of design and analytic strategies for preventing or correcting the spurious protective association that arises when person-time during which the outcome could not occur (because the patient had to survive event-free to receive the exposure that later defines their group) is misclassified as exposed follow-up.",
    "long_description": "Immortal time is a span of follow-up during which, by the study's own definition, the outcome cannot occur. It\narises when group assignment depends on a future event (typically the first prescription fill) but follow-up is\nstarted earlier (at diagnosis, hospitalization, surgery, or cohort entry). Every patient classified as \"exposed\"\n*had to survive event-free* from the earlier time-zero until their qualifying fill; that survival is guaranteed\nby construction, not earned by the drug. When this period is attributed to the exposed group as event-free\nperson-time — or, equivalently, when patients who die before filling are pushed into the unexposed group —\nthe exposed rate is mechanically deflated and the hazard ratio is biased toward benefit. The bias is large and\npredictable: in the canonical Lévesque example, statins appeared to cut diabetes progression by roughly 30% (HR ~0.69) purely\nfrom time misclassification, collapsing to the null (HR ~0.99) once exposure was modeled as time-varying.\n\n**Core conceptual distinction.** Immortal time bias is fundamentally about *when* person-time is counted and to\n*which* arm, not about *who* is in the cohort — it is a temporal-classification error, distinct from confounding.\nTwo remedies operate on different levers and must not be conflated. (1) *Design prevention* aligns three clocks\nat a single time zero — eligibility is met, the treatment strategy is assigned, and follow-up starts — so no\nunexposed-but-immortal interval can exist (the new-user / target-trial solution). 
(2) *Analytic correction* keeps\nthe misaligned data but stops crediting immortal time to the exposed arm: time-dependent (time-varying) exposure\nmodeling lets a patient contribute unexposed person-time until the fill date and exposed person-time thereafter;\nlandmark analysis classifies exposure as of a fixed landmark and discards earlier follow-up; clone-censor-weight\nassigns immortal time to *every* compatible strategy and censors-with-weighting when behavior diverges. The\nestimand differs across these: design prevention and clone-censor-weight target an initiation/sustained-strategy\ncontrast (ITT-like or per-protocol), time-dependent Cox targets the instantaneous effect of current exposure\nstatus, and landmark targets the effect among those who survived to the landmark — interpretations that are not\ninterchangeable and must be pre-specified.\n\n**Pros, cons, and trade-offs.**\n- **Design prevention (new-user + target-trial time-zero) vs analytic correction.** Prevention is transparent,\n  requires no modeling assumption about the exposure-time relationship, and removes the bias at the root; it is\n  the regulator-preferred default. Cost: it can shrink the cohort to true initiators with an observable washout\n  and cannot rescue an already-extracted dataset whose time zero was set at diagnosis. **Prefer prevention**\n  whenever you control cohort construction and the question is about treatment *initiation*.\n- **vs `standard-cox-time-dependent` (time-varying exposure).** Time-varying Cox is the standard analytic fix and\n  is exactly right when exposure genuinely starts mid-follow-up (e.g., a procedure or device implanted during an\n  index hospitalization). Cost: it answers an as-treated, current-status question and is vulnerable to\n  time-varying confounding affected by prior exposure (where g-methods, not Cox, are required). 
**Prefer it** when\n  exposure timing is intrinsically post-baseline and a current-status estimand is the goal.\n- **vs `landmark-analysis`.** Landmarking is simple, communicates well, and sidesteps the modeling of exposure\n  timing. Cost: it discards events before the landmark, is sensitive to landmark choice, and conditions on\n  survival to the landmark (a selected population). **Prefer it** for a quick, robustness-oriented sensitivity\n  analysis or when exposure is reliably ascertained only after a fixed interval.\n- **vs `clone-censor-weight-per-protocol`.** Cloning correctly handles \"treatment within a grace period\" rules\n  that create eligibility-time ambiguity and emulates sustained strategies. Cost: it is the most complex to\n  specify, defend, and weight. **Prefer it** when the protocol genuinely requires a grace period or a dynamic\n  per-protocol estimand that simpler fixes distort.\n\n**When to use.** Apply immortal-time handling to *every* longitudinal claims/EHR cohort where group membership is\ndetermined by an event that occurs after the start of observation: ever-vs-never drug comparisons, \"responders\"\nvs \"non-responders,\" transplant vs waitlist, surgery vs medical management, adherent vs non-adherent, and any\ndesign where a post-baseline prescription, procedure, or laboratory result defines the arm. The default move is\ndesign prevention via a new-user cohort with time zero at initiation; reserve analytic correction for questions\nwhere exposure is *intrinsically* acquired during follow-up.\n\n**When NOT to use — and when it is actively misleading or dangerous.** Do not bolt an analytic correction onto a\nfundamentally answerable design question — if the real question is the effect of *initiating* a drug, the clean\nfix is a new-user design, and forcing a time-varying model onto a time-fixed ever/never definition can mask the\nproblem while leaving residual misclassification. 
It is actively dangerous to (1) treat a time-varying exposure\nCox model as a remedy for *confounding* — it fixes only the temporal error, and an as-treated current-status\nestimand can re-introduce healthy-adherer bias and confounding by indication; (2) apply standard time-varying\nCox when exposure influences later confounders that themselves affect both subsequent treatment and the outcome —\nthis treatment-confounder feedback requires marginal structural models / g-methods, and naive adjustment is\nbiased in either direction; (3) choose a landmark *after* inspecting the data, which invites a landmark that\nflatters the hypothesis; or (4) \"correct\" immortal time while ignoring competing risks, so that a deflated event\ncount is misread as benefit when it is differential mortality. A correction that changes the estimand without\nacknowledging the change is worse than the original bias, because it looks rigorous.\n\n**Data-source operational depth.**\n- **Claims (FFS):** The classic substrate for immortal time. Group-defining events are pharmacy fills (NDC +\n  `fill_date` + `days_supply`) or procedure codes; the trap is starting follow-up at a `dx`-coded index\n  (diagnosis, MI, cancer) and assigning the arm by a *subsequent* fill. Failure mode: any patient who dies before\n  the qualifying fill is structurally absent from the exposed arm. Workaround: set time zero at the qualifying\n  fill (new-user) or, if exposure is intrinsically post-baseline, build a time-varying exposure indicator that\n  flips on at `fill_date` and lets the patient accrue unexposed person-time before it.\n- **Medicare Advantage (MA) vs FFS:** MA enrollees lack complete fee-for-service claims, so a \"no prior fill\"\n  washout can be missingness rather than a true drug-free period, and exposure start dates can be unobserved —\n  manufacturing apparent immortal time from invisible early fills. 
Workaround: restrict to enrollees with\n  observable Parts A/B/D (or commercial medical+pharmacy) and exclude MA-only person-time before time-zero\n  assignment.\n- **EHR:** Better clinical timestamps for \"start of risk\" (order/administration dates, problem-list onset), but\n  dispensing is often incomplete; a patient who fills outside the system looks unexposed during truly exposed\n  time. Visit-driven capture also makes loss to follow-up informative. Workaround: link to pharmacy claims to\n  confirm initiation and define observation windows explicitly.\n- **Registry / linked:** Registries capture the triggering event and adjudicated outcomes well but often miss\n  full drug exposure; transplant and oncology registries are textbook immortal-time settings (time on the\n  waitlist before transplant is immortal for the transplanted group). In elderly claims, differential competing\n  risk of death by exposure status compounds the bias — a deflated exposed event count partly reflects survival\n  selection. Workaround: link to a death index, model the transplant/exposure as time-varying from its actual\n  date, and report cause-specific and subdistribution effects so competing mortality is not silently absorbed.\n\n**Worked claims example.** Question: do oral anticoagulants reduce stroke after a new atrial-fibrillation\ndiagnosis? A naive analysis sets time zero at the first AF `dx` claim and labels anyone with a subsequent DOAC\nfill (NDC list) as \"anticoagulated.\" Patients who stroke or die in the weeks between diagnosis and their first\nfill cannot be in the DOAC arm — that pre-fill interval is immortal, and the DOAC arm gains guaranteed event-free\nperson-time, biasing the HR below 1. *Correct design fix:* require 365 days of continuous A/B/D enrollment with no\nprior oral anticoagulant fill, set time zero at the first qualifying DOAC fill (new-user), and pair with an active\ncomparator (e.g., warfarin initiators) so both arms share a time zero at initiation. 
*Correct analytic fix when\nthe initiation cohort is not feasible:* keep time zero at the AF diagnosis but model anticoagulation as a\ntime-varying covariate — each person contributes UNEXPOSED person-time from `index_date` (AF dx) until\n`fill_date`, then EXPOSED person-time from `fill_date` to the earliest of stroke, death, disenrollment, or end of\ndata. Concretely, split each person's follow-up at `fill_date` into two records: record 1 spans\n[`index_date`, `fill_date`) with `exposed=0`, record 2 spans [`fill_date`, `exit_date`) with `exposed=1`; the\nevent flag attaches only to the interval containing the stroke. Fit Cox on this (start, stop, event, exposed)\nstructure. In data like this, a naive HR well below 1 (for example ~0.6) can move toward ~1.0 once the immortal\ninterval is returned to unexposed person-time — the same diagnostic signature Lévesque demonstrated for statins.\n\n**Interpreting the output**\n\nIn the illustrative DOAC cohort, with numbers modeled on Lévesque et al., fixed-exposure analysis assigns the\n30-day pre-treatment window to the \"treated\" category. The result before correction: naive HR ≈ 0.69\n(an apparent 31% reduction in the stroke hazard). After re-coding exposure as time-varying and attributing the\npre-treatment window to the unexposed risk set: corrected HR ≈ 0.99.\n\n*(1) Formal interpretation.* The naive HR of 0.69 arises because Patient 1001 and similarly\ncoded patients cannot experience an event during the days between cohort entry and first fill\n— they must survive to receive treatment. This immortal person-time, when attributed to the\ntreated denominator, deflates the event rate in the treated arm and produces a spurious\napparent benefit. 
The corrected analysis (HR 0.99) reassigns that window to the unexposed risk\nset via a counting-process (start, stop, status) layout with a time-varying exposure indicator.\nThe gap between 0.69 and 0.99, a difference of about 0.30 on the hazard-ratio scale, is entirely an\nartifact of time-zero misalignment, not a pharmacological effect.\n\n*(2) Practical interpretation.* An HR of 0.69 would suggest a clinically meaningful reduction in\nstroke risk; an HR of 0.99 indicates essentially no effect. The difference between a publishable\nfinding and a null result is a 30-day coding decision. Any RWE study that defines treatment by\n\"ever filled a prescription\" after cohort entry, or links treatment start to a dispensing date\nthat lags the index date, must demonstrate that its time-zero alignment eliminates immortal\nperson-time before the HR can be interpreted.",
    "primary_category": "Bias_Control",
    "tags": [
      "bias",
      "immortal-time",
      "time-zero",
      "survivor-treatment-selection-bias",
      "time-varying-exposure",
      "new-user",
      "target-trial",
      "pharmacoepidemiology"
    ],
    "applies_to_study_types": [
      "new_user",
      "active_comparator_new_user",
      "cohort_retrospective",
      "cohort_prospective",
      "claims_analysis",
      "ehr_study",
      "drug_utilization"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1093/aje/kwm324",
        "url": "https://doi.org/10.1093/aje/kwm324",
        "citation_text": "Suissa S. Immortal time bias in pharmacoepidemiology. American Journal of Epidemiology. 2008;167(4):492-499.",
        "year": 2008,
        "authors_short": "Suissa",
        "notes": "Foundational taxonomy of immortal time bias in database pharmacoepidemiology; defines the mechanism and catalogues the design patterns (cohort entry vs first-exposure misalignment) that create it."
      },
      {
        "role": "explain",
        "doi": "10.1016/j.jclinepi.2016.04.014",
        "url": "https://doi.org/10.1016/j.jclinepi.2016.04.014",
        "citation_text": "Hernán MA, Sauer BC, Hernández-Díaz S, Platt R, Shrier I. Specifying a target trial prevents immortal time bias and other self-inflicted injuries in observational analyses. Journal of Clinical Epidemiology. 2016;79:70-75.",
        "year": 2016,
        "authors_short": "Hernán et al.",
        "notes": "Shows that explicit target-trial specification (eligibility, assignment, and follow-up all anchored to a single time zero) prevents immortal time by design rather than patching it analytically."
      },
      {
        "role": "demonstrate",
        "doi": "10.1136/bmj.b5087",
        "url": "https://doi.org/10.1136/bmj.b5087",
        "citation_text": "Lévesque LE, Hanley JA, Kezouh A, Suissa S. Problem of immortal time bias in cohort studies: example using statins for preventing progression of diabetes. BMJ. 2010;340:b5087.",
        "year": 2010,
        "authors_short": "Lévesque et al.",
        "notes": "Quantifies the bias and the fix in a real claims cohort — the naive time-fixed HR (~0.69, apparent benefit) moves to the null (~0.99) once exposure is modeled as time-varying; the standard worked teaching example."
      },
      {
        "role": "explain",
        "doi": "10.1093/aje/kwv254",
        "url": "https://doi.org/10.1093/aje/kwv254",
        "citation_text": "Hernán MA, Robins JM. Using big data to emulate a target trial when a randomized trial is not available. American Journal of Epidemiology. 2016;183(8):758-764.",
        "year": 2016,
        "authors_short": "Hernán & Robins",
        "notes": "Frames immortal time as one of the self-inflicted injuries that disciplined target-trial emulation removes; grounds the design-first remedy in real comparative-effectiveness practice."
      },
      {
        "role": "use",
        "doi": null,
        "url": "https://encepp.europa.eu/encepp-toolkit/methodological-guide_en",
        "citation_text": "ENCePP Guide on Methodological Standards in Pharmacoepidemiology (Revision 11, 2023). Chapter on biases and study design — immortal time, protopathic, and latency biases and their prevention via time-zero definition and new-user approaches.",
        "year": 2023,
        "authors_short": "ENCePP",
        "notes": "Regulatory-facing methodological standard; codifies time-zero alignment and new-user design as the expected immortal-time safeguards in EU pharmacoepidemiology protocols."
      }
    ],
    "plain_language_summary": "When researchers track who survives an illness, they sometimes accidentally count a stretch of time before a patient ever started treatment as if the patient were already being treated — giving the treatment group a head start of guaranteed event-free days that it did not earn. This pre-treatment survival window (the period between study entry and a patient's first pill, during which the patient must stay alive just to be classified as a treated person) inflates the apparent protection of the drug. The fix is to start counting a patient's treatment time only when the first pill is actually dispensed — or to label those earlier days honestly as untreated time.",
    "key_terms": [
      {
        "term": "cohort entry date",
        "definition": "The calendar day a patient officially enters the study — for example, the date of a new atrial fibrillation diagnosis — which kicks off follow-up."
      },
      {
        "term": "pre-treatment survival window",
        "definition": "The stretch of days between cohort entry and a patient's very first treatment fill, during which the patient must stay alive (and event-free) simply to become eligible to be called 'treated'; this window is sometimes called immortal time because no outcome can happen there by the study's own rules."
      },
      {
        "term": "index date",
        "definition": "A patient's personal 'day zero' — the anchor date from which follow-up time and all subsequent intervals are measured."
      },
      {
        "term": "person-time",
        "definition": "The total amount of calendar time a group of patients is followed and at risk; it is the denominator when computing event rates (for example, strokes per 100 person-years)."
      },
      {
        "term": "time-varying exposure",
        "definition": "A study design where a patient's treatment status is allowed to change during follow-up — switching from 'untreated' to 'treated' at the exact date the first prescription is filled — rather than being locked in at cohort entry."
      },
      {
        "term": "hazard ratio",
        "definition": "A summary number comparing the event rate in a treated group to the event rate in an untreated group at any given moment; a value below 1.0 suggests the treated group has a lower rate, but only if the comparison is fairly constructed."
      }
    ],
    "worked_example": {
      "scenario": "A researcher wants to know whether blood-thinners (DOACs) prevent stroke in patients newly diagnosed with atrial fibrillation (AF). She pulls claims data and finds every patient whose first AF diagnosis code appeared in 2024. She labels anyone who later filled a DOAC prescription as 'treated' and everyone else as 'untreated', then compares stroke rates from the AF diagnosis date forward. Patient 1001 was diagnosed on 2024-01-10 and picked up her first DOAC on 2024-02-09 — exactly 30 days later. The naive analysis credits those 30 days to the treated group. But a patient who suffers a fatal stroke on 2024-01-25 — before ever filling a prescription — is forced into the untreated group. The treated group never loses anyone in those first 30 days; the untreated group absorbs all early deaths. The treated group's stroke rate looks artificially low.",
      "dataset": {
        "caption": "Claims records for one patient showing the AF diagnosis date (cohort entry) and the first DOAC fill. No stroke occurred during the pre-treatment window; stroke happened on 2024-04-15.",
        "columns": [
          "person_id",
          "event_date",
          "event_type",
          "drug",
          "days_supply"
        ],
        "rows": [
          [
            1001,
            "2024-01-10",
            "AF diagnosis (cohort entry)",
            null,
            null
          ],
          [
            1001,
            "2024-02-09",
            "First DOAC fill",
            "apixaban",
            30
          ],
          [
            1001,
            "2024-03-11",
            "Second DOAC fill",
            "apixaban",
            30
          ],
          [
            1001,
            "2024-04-15",
            "Stroke (outcome event)",
            null,
            null
          ]
        ]
      },
      "steps": [
        "Count the days from AF diagnosis to the first DOAC fill: 2024-02-09 minus 2024-01-10 = 30 days. These 30 days are the pre-treatment survival window.",
        "Biased approach — mislabel the window: credit all 30 days to the 'treated' arm. Patient 1001 is recorded as treated from day 1. The treated arm gains 30 days of guaranteed stroke-free time it did not earn through the drug.",
        "Correct approach — label the window honestly: patient 1001 contributes 30 days as 'untreated' (days 0–29) and then switches to 'treated' on day 30 (the fill date) and contributes 65 more treated days until the stroke on day 95 (2024-04-15 minus 2024-01-10).",
        "Under the biased approach the treated arm's stroke-free person-time includes those 30 stolen days, pushing the hazard ratio below 1.0 even if the drug does nothing.",
        "Under the correct approach those 30 days are returned to the untreated column, and the hazard ratio moves toward the true value — in real statin and DOAC studies this correction has erased apparent benefits that turned out to be pure arithmetic artifacts."
      ],
      "result": {
        "label": "Pre-treatment survival window = 30 days (2024-01-10 to 2024-02-08). Biased treated person-time from cohort entry = 95 days. Correct untreated person-time = 30 days; correct treated person-time = 65 days. Returning those 30 days to the untreated column is the entire correction.",
        "value": 30
      },
      "timeline_spec": {
        "title": "Pre-treatment survival window for one AF patient — biased vs corrected time classification",
        "window": {
          "start": "2024-01-10",
          "end": "2024-04-15",
          "label": "95-day follow-up: AF diagnosis to stroke"
        },
        "events": [
          {
            "label": "AF diagnosis (cohort entry)",
            "start": "2024-01-10",
            "length_days": 1,
            "quantity": "index date"
          },
          {
            "label": "First DOAC fill (apixaban)",
            "start": "2024-02-09",
            "length_days": 30,
            "quantity": "30 days_supply"
          },
          {
            "label": "Second DOAC fill (apixaban)",
            "start": "2024-03-11",
            "length_days": 30,
            "quantity": "30 days_supply"
          },
          {
            "label": "Stroke (outcome)",
            "start": "2024-04-15",
            "length_days": 1,
            "quantity": "event"
          }
        ],
        "spans": [
          {
            "kind": "unexposed",
            "start": "2024-01-10",
            "end": "2024-02-08",
            "label": "30-day pre-treatment survival window (honest label: untreated)"
          },
          {
            "kind": "exposed",
            "start": "2024-02-09",
            "end": "2024-04-14",
            "label": "65 days of genuine treated follow-up"
          },
          {
            "kind": "gap",
            "start": "2024-01-10",
            "end": "2024-02-08",
            "label": "BIASED label (wrong): these 30 days mis-credited to treated arm"
          }
        ],
        "result": {
          "label": "30 stolen days returned to untreated arm; treated person-time correctly = 65 days, not 95",
          "value": 30
        },
        "caption": "Patient 1001's timeline. The 30-day gap between AF diagnosis and first DOAC fill is the pre-treatment survival window. The biased analysis mislabels it as treated time (orange), manufacturing a stroke-free head start. The corrected analysis labels it as untreated time (blue) and starts the treated clock only at the fill date.",
        "alt_text": "Horizontal timeline for patient 1001 running from 2024-01-10 to 2024-04-15. A 30-day span from cohort entry to the first DOAC fill is shaded blue and labeled 'pre-treatment survival window (untreated)' in the corrected analysis, and highlighted orange with an annotation 'wrongly credited to treated arm' in the biased analysis. From the first fill on 2024-02-09 to the stroke on 2024-04-15 a 65-day green span is labeled 'genuine treated follow-up'. A vertical marker at 2024-04-15 is labeled 'stroke (outcome event)'."
      }
    },
    "prerequisites": [
      "time-zero-index-date-alignment-rwe",
      "new-user-design",
      "continuous-enrollment-observable-time-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Misclassified pre-exposure person-time (classic)",
        "description": "Follow-up begins at an early event (diagnosis, MI, surgery) for everyone, but the arm is assigned by a later first fill; all pre-fill person-time is credited to the exposed group, and patients who die before filling are forced into the unexposed group.",
        "edge_cases": [
          "Patients who die or have the outcome before the qualifying fill can never become exposed and are all classified unexposed, deflating the exposed rate.",
          "Short gaps between index and fill still bias if the cohort is large and the early-event rate is high."
        ],
        "data_source_notes": "Common in claims when cohort entry is a diagnosis/procedure code and exposure is a subsequent pharmacy claim. Fix: set time zero at the qualifying fill (new-user) or model exposure as time-varying from fill_date.",
        "citations": [
          "suissa-2008"
        ]
      },
      {
        "name": "Waiting-time / latency immortal period (procedures and transplant)",
        "description": "Time accrued while awaiting an intervention (e.g., on the transplant waitlist, or between hospital admission and an in-hospital procedure) is immortal for the group that ultimately receives it, because survival to the intervention is a precondition of membership.",
        "edge_cases": [
          "Transplant-vs-waitlist comparisons that ignore waiting time show spurious transplant benefit.",
          "In-hospital procedure studies where length of stay before the procedure is credited to the procedure arm."
        ],
        "data_source_notes": "Model the intervention as time-varying from its actual date; in registry/claims, link to a death index because competing mortality during the waiting period is differential by ultimate group.",
        "citations": [
          "levesque-2010"
        ]
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "standard-cox-time-dependent",
        "pros_of_this": "Design prevention (new-user time-zero) removes immortal time at the root with no modeling assumption about the exposure-time relationship and yields a clean initiation estimand.",
        "cons_of_this": "Requires control of cohort construction and an observable washout; cannot rescue a dataset whose time zero was already set at diagnosis, where time-varying Cox is the appropriate fix.",
        "when_to_prefer": "Prefer design prevention when the question is about treatment initiation and you build the cohort; prefer time-varying Cox when exposure is intrinsically acquired during follow-up."
      },
      {
        "compared_to": "landmark-analysis",
        "pros_of_this": "Avoids discarding pre-landmark events and is not sensitive to an arbitrary landmark choice; estimates the full-follow-up effect rather than the effect among landmark survivors.",
        "cons_of_this": "Requires correct modeling of exposure timing (time-varying records); landmarking is simpler to explain and a useful robustness check.",
        "when_to_prefer": "Prefer time-varying handling for the primary analysis; reserve landmarking for sensitivity or when exposure is reliably ascertained only after a fixed interval."
      },
      {
        "compared_to": "clone-censor-weight-per-protocol",
        "pros_of_this": "New-user time-zero and time-varying Cox are far simpler to specify and defend for a static initiation or current-status estimand.",
        "cons_of_this": "Cannot cleanly handle grace-period eligibility ambiguity or sustained dynamic strategies, where cloning correctly assigns immortal time to all compatible strategies and weights for divergence.",
        "when_to_prefer": "Prefer the simpler handling unless the protocol genuinely requires a grace period or a dynamic per-protocol estimand."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Identify the early diagnosis/procedure date and the first qualifying fill (NDC + fill_date). Either set time zero at the fill (new-user) or split follow-up at fill_date into unexposed and exposed intervals for a time-varying model. Require continuous FFS-observable enrollment across the washout; exclude MA-only person-time where fills are unobserved.",
      "ehr": "Use order/administration timestamps for exposure start; link to pharmacy dispensing to confirm the patient actually initiated. Define observation windows explicitly and treat visit-driven loss to follow-up as potentially informative.",
      "registry": "Capture both the triggering event and the dated intervention; waitlist/transplant and oncology registries are high-risk for immortal time. Link to a death index because competing mortality during the waiting period is differential by ultimate group.",
      "linked": "Linked claims-EHR-vital-records gives the most reliable exposure dates and mortality, but order/fill/ service date discrepancies must be reconciled before assigning time zero or splitting follow-up."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\nfrom lifelines import CoxTimeVaryingFitter\n\ndef build_time_varying(cohort: pd.DataFrame, rx: pd.DataFrame) -> pd.DataFrame:\n    # First exposure fill per person; persons absent from rx never become exposed.\n    first_fill = (rx.sort_values([\"person_id\", \"fill_date\"])\n                    .groupby(\"person_id\", as_index=False)[\"fill_date\"].first()\n                    .rename(columns={\"fill_date\": \"fill_date\"}))\n    df = cohort.merge(first_fill, on=\"person_id\", how=\"left\")\n\n    # Days from the early time zero (index_date) to each transition, on a single clock.\n    df[\"t_exit\"] = (df[\"exit_date\"] - df[\"index_date\"]).dt.days\n    df[\"t_fill\"] = (df[\"fill_date\"] - df[\"index_date\"]).dt.days\n    # Exposure starting at/after exit contributes nothing exposed; clamp into [0, t_exit].\n    df[\"t_fill\"] = df[\"t_fill\"].clip(lower=0, upper=df[\"t_exit\"])\n\n    rows = []\n    for r in df.itertuples(index=False):\n        exposed_ever = pd.notna(r.fill_date) and r.t_fill < r.t_exit\n        if not exposed_ever:\n            # Entire follow-up is unexposed (includes patients who exited before filling).\n            rows.append((r.person_id, 0, r.t_exit, 0, r.event))\n        else:\n            # Unexposed interval [0, t_fill): never carries the event.\n            rows.append((r.person_id, 0, r.t_fill, 0, 0))\n            # Exposed interval [t_fill, t_exit): carries the event if one occurred.\n            rows.append((r.person_id, r.t_fill, r.t_exit, 1, r.event))\n    out = pd.DataFrame(rows, columns=[\"person_id\", \"start\", \"stop\", \"exposed\", \"event\"])\n    return out[out[\"stop\"] > out[\"start\"]]\n\ntv = build_time_varying(cohort, rx)\nctv = CoxTimeVaryingFitter()\nctv.fit(tv, id_col=\"person_id\", event_col=\"event\",\n        start_col=\"start\", stop_col=\"stop\")\nctv.print_summary()  # HR for time-varying 'exposed' is immune to immortal time bias",
        "description": "Time-varying exposure correction for immortal time using lifelines' (start, stop] counting-process format.\nRequired inputs (already cleaned, one row per person unless noted):\n  cohort : person_id, index_date (datetime; early time zero, e.g., AF dx), exit_date (datetime; min of event/\n           death/disenroll/data-end), event (1 if outcome at exit_date else 0)\n  rx     : person_id, fill_date (datetime)  # first qualifying exposure fill per person (NDC pre-filtered)\nOutput: long table where each person contributes an UNEXPOSED interval [index_date, fill_date) and an EXPOSED\ninterval [fill_date, exit_date); the event attaches only to the interval containing exit_date. Patients who exit\nbefore filling contribute only unexposed time (the immortal interval is correctly returned to the unexposed\ngroup). Feed the result to lifelines CoxTimeVaryingFitter.",
        "dependencies": [
          "pandas",
          "lifelines"
        ],
        "source_citations": [
          "levesque-2010"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\nlibrary(survival)\n\nbuild_and_fit <- function(cohort, rx) {\n  setDT(cohort); setDT(rx)\n  setorder(rx, person_id, fill_date)\n  first_fill <- rx[, .(fill_date = fill_date[1L]), by = person_id]\n  d <- merge(cohort, first_fill, by = \"person_id\", all.x = TRUE)\n\n  # Single clock in days from the early time zero (index_date).\n  d[, `:=`(t_exit = as.integer(exit_date - index_date),\n           t_fill = as.integer(fill_date - index_date))]\n  d[, t_fill := pmin(pmax(t_fill, 0L), t_exit)]   # clamp into [0, t_exit]\n\n  # Base record: full follow-up, event at exit.\n  base <- d[, .(person_id, t_exit, event)]\n  base[, `:=`(tstart = 0L)]\n  tv <- tmerge(base, base, id = person_id,\n               death = event(t_exit, event))\n  # Add the time-dependent exposure: switches to 1 at the fill date (NA = never exposed).\n  tv <- tmerge(tv, d[!is.na(fill_date) & t_fill < t_exit],\n               id = person_id, exposed = tdc(t_fill))\n  tv$exposed[is.na(tv$exposed)] <- 0L\n\n  coxph(Surv(tstart, tstop, death) ~ exposed, data = tv)\n}\n\nfit_tv <- build_and_fit(cohort, rx)\nsummary(fit_tv)  # exposed HR free of immortal time; contrast with naive ever/never coxph",
        "description": "Time-varying exposure correction with survival::tmerge + coxph (the reference R idiom for counting-process\ndata). Inputs mirror the Python version:\n  cohort : person_id, index_date (Date), exit_date (Date), event (0/1)\n  rx     : person_id, fill_date (Date)   # first qualifying exposure fill per person\ntmerge splits each person's follow-up at the fill date, creating a time-dependent 'exposed' covariate that is 0\nbefore the fill and 1 after; the event is carried only on the interval containing exit. Compare coxph on this\nstructure with a naive time-fixed coxph to expose the immortal-time bias.",
        "dependencies": [
          "data.table",
          "survival"
        ],
        "source_citations": [
          "levesque-2010"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* Time-varying exposure: EXPOSED becomes 1 only after the first fill (fill_days). */\n/* PHREG evaluates the programming statement at each event time, so immortal      */\n/* pre-fill person-time is automatically credited as UNEXPOSED.                   */\nproc phreg data=work.surv;\n  model fu_days*event(0) = exposed / risklimits ties=efron;\n  /* . (missing) fill_days => never exposed => exposed stays 0 for all of follow-up */\n  if fill_days ne . and fu_days >= fill_days then exposed = 1;\n  else exposed = 0;\nrun;\n\n/* Naive (biased) comparison: ever-exposed treated as a baseline (time-fixed) covariate. */\n/* The HR here is biased toward benefit because pre-fill immortal time is credited to     */\n/* the exposed arm; the gap between this HR and the time-varying HR is the immortal-time   */\n/* bias (cf. Lévesque 2010: ~0.69 naive vs ~0.99 corrected).                              */\ndata work.naive;\n  set work.surv;\n  ever_exposed = (fill_days ne .);\nrun;\n\nproc phreg data=work.naive;\n  model fu_days*event(0) = ever_exposed / risklimits ties=efron;\nrun;",
        "description": "Time-varying exposure correction in SAS via PROC PHREG programming statements (no manual record-splitting\nneeded). Required input dataset (post data-management), one row per person:\n  work.surv : person_id, fu_days  (follow-up days from the early time zero to event/censor),\n              event (1=outcome, 0=censored),\n              fill_days (days from time zero to first qualifying exposure fill; . if never exposed)\nThe programming statements recompute a time-dependent EXPOSED indicator at each event time: a subject is exposed\nonly from their fill_days onward, so all pre-fill (immortal) risk time is correctly modeled as unexposed. Fit the\nnaive time-fixed model alongside (ever-exposed flag) to quantify the bias the correction removes.",
        "dependencies": [],
        "source_citations": [
          "levesque-2010"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": "immortal-time-bias-handling-timeline.svg",
        "mermaid": null,
        "caption": "Patient 1001's timeline. The 30-day gap between AF diagnosis and first DOAC fill is the pre-treatment survival window. The biased analysis mislabels it as treated time (orange), manufacturing a stroke-free head start. The corrected analysis labels it as untreated time (blue) and starts the treated clock only at the fill date.",
        "alt_text": "Horizontal timeline for patient 1001 running from 2024-01-10 to 2024-04-15. A 30-day span from cohort entry to the first DOAC fill is shaded blue and labeled 'pre-treatment survival window (untreated)' in the corrected analysis, and highlighted orange with an annotation 'wrongly credited to treated arm' in the biased analysis. From the first fill on 2024-02-09 to the stroke on 2024-04-15 a 65-day green span is labeled 'genuine treated follow-up'. A vertical marker at 2024-04-15 is labeled 'stroke (outcome event)'.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  T0[Time zero set at DIAGNOSIS<br/>follow-up starts here for all] --> IMM[Immortal interval<br/>diagnosis to first fill<br/>outcome cannot occur:<br/>patient must survive to fill]\n  IMM --> FILL[First qualifying fill<br/>arm assigned = EXPOSED]\n  FILL --> RISK[Genuine at-risk exposed time]\n  IMM -. naively credited to .-> EXP[EXPOSED arm<br/>gains guaranteed<br/>event-free person-time<br/>HR biased below 1]\n  FILL -. patients who die<br/>before filling .-> UNEXP[forced into UNEXPOSED arm]",
        "caption": "How immortal time arises. Person-time between an early time zero (diagnosis) and the group-defining first fill cannot contain the outcome by construction; crediting it to the exposed arm deflates the exposed rate and biases the hazard ratio toward apparent benefit.",
        "alt_text": "Flow diagram showing the interval from diagnosis to first fill as immortal time that is naively credited to the exposed arm, while patients who die before filling are forced into the unexposed arm.",
        "source_type": "illustrative",
        "source_citations": [
          "suissa-2008"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Q{Is the question about<br/>treatment INITIATION,<br/>and do you control<br/>cohort construction?} -->|Yes| NU[Design prevention:<br/>new-user cohort, time zero<br/>at first fill, active comparator]\n  Q -->|No: exposure is intrinsically<br/>acquired during follow-up| TV{Does exposure affect<br/>later confounders that<br/>also affect treatment + outcome?}\n  TV -->|No| COX[Time-varying exposure Cox<br/>split follow-up at fill date]\n  TV -->|Yes treatment-confounder feedback| GM[g-methods / MSM<br/>not naive time-varying Cox]\n  NU --> SENS[Sensitivity: landmark analysis,<br/>washout length, grace period,<br/>negative-control outcome]\n  COX --> SENS\n  Q -->|Grace period / dynamic<br/>sustained strategy required| CCW[Clone-censor-weight<br/>per-protocol emulation]\n  CCW --> SENS",
        "caption": "Remediation decision logic. Prevent by design when you can; correct analytically (time-varying Cox) when exposure is intrinsically post-baseline; escalate to g-methods under treatment-confounder feedback or to clone-censor-weight when a grace period or dynamic strategy is required.",
        "alt_text": "Decision flowchart routing among new-user design prevention, time-varying Cox correction, g-methods for treatment-confounder feedback, and clone-censor-weight for grace-period or dynamic strategies, all feeding a shared sensitivity-analysis step.",
        "source_type": "illustrative",
        "source_citations": [
          "hernan-2016"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "see_also",
        "target_slug": "target-trial-emulation",
        "notes": "The primary modern framework for preventing immortal time by design through explicit eligibility, assignment, and follow-up all anchored to a single time zero."
      },
      {
        "relation_type": "see_also",
        "target_slug": "new-user-design",
        "notes": "New-user restriction with follow-up beginning at initiation is the most effective practical prevention of immortal time in database pharmacoepidemiology."
      },
      {
        "relation_type": "used_with",
        "target_slug": "standard-cox-time-dependent",
        "notes": "Time-varying (time-dependent) exposure Cox is the standard analytic correction when exposure is intrinsically acquired during follow-up; it credits pre-fill immortal time as unexposed person-time."
      },
      {
        "relation_type": "used_with",
        "target_slug": "landmark-analysis",
        "notes": "Landmarking classifies exposure as of a fixed landmark and discards earlier follow-up; a simpler alternative correction and a useful sensitivity analysis."
      },
      {
        "relation_type": "used_with",
        "target_slug": "clone-censor-weight-per-protocol",
        "notes": "Clone-censor-weight assigns immortal time to all compatible strategies and weights for divergence; the correct handling under grace periods or dynamic sustained strategies."
      },
      {
        "relation_type": "see_also",
        "target_slug": "prevalent-user-bias",
        "notes": "A closely related time-related bias that often co-occurs with immortal time in ever/never analyses of chronic therapies; new-user time-zero alignment addresses both."
      },
      {
        "relation_type": "affects",
        "target_slug": "cox-ph-regression",
        "notes": "Misaligned time zero lengthens exposed event-free person-time and biases the HR toward benefit; the fix is time-varying exposure modeling or design prevention before fitting Cox."
      },
      {
        "relation_type": "affects",
        "target_slug": "poisson-negative-binomial-count-models",
        "notes": "Crediting immortal person-time to the exposed denominator lowers the exposed rate, biasing rate ratios toward apparent benefit."
      },
      {
        "relation_type": "affects",
        "target_slug": "logistic-regression-for-binary-outcomes",
        "notes": "If the binary outcome window includes time before exposure can occur, exposed individuals are guaranteed event-free over part of the window, producing a spuriously protective odds ratio."
      }
    ],
    "aliases": [
      "immortal time bias",
      "survivor treatment selection bias",
      "time-window bias",
      "time-zero misalignment",
      "Suissa bias"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "incidence-rate-calculation-rwe",
    "name": "Incidence Rate Calculation",
    "short_definition": "Estimation of the number of incident events divided by the person-time actually at risk, with explicit rules for first-event vs recurrent counting, at-risk window construction, and competing events.",
    "long_description": "An **incidence rate (IR)** is events divided by **person-time at risk**: IR = D / PT, where D is the count of qualifying\nevents and PT is the summed time during which each person was both observable and eligible to have a *first* (or *next*)\ncountable event. It is a dynamic measure with units of events per person-time (e.g., per 1,000 person-years, PY) and an\ninstantaneous-hazard interpretation — unlike a cumulative incidence (risk), which is a unitless proportion over a fixed\nhorizon. In real-world data the IR is rarely \"wrong\" because of the arithmetic; it is wrong because the **denominator** is\nbuilt from person-time that the person was not actually at risk for, or the **numerator** counts events the design did not\nintend to count. Getting the at-risk clock right is the entire job.\n\n**Core conceptual distinction.** Three choices must be pre-specified in the estimand and they are separable. (1) *First-event\nvs recurrent rate*: a first-event IR removes each person from the denominator at their first event (PT accrues only until the\nfirst event); a recurrent-event IR keeps accruing PT and counts subsequent events, which requires a washout/refractory rule\nbetween events and a variance estimator that respects within-person clustering (e.g., robust/Andersen-Gill). (2) *At-risk\ndefinition*: PT must be intersected with the windows in which the event is biologically and administratively possible\n(continuously observed, not already post-event for a first-event rate, not during an exclusion window). (3) *Competing\nevents*: when death (or another terminal event) precludes the outcome, the **cause-specific hazard rate** (censor at the\ncompeting event) and the **subdistribution/cumulative incidence** answer different questions — a naive cause-specific IR is\nfine as a *rate* but does not translate into the *risk* a patient experiences when the competing event is common. 
Reporting\na cause-specific IR and then narrating it as a \"risk\" is the most frequent interpretive error.\n\n**Pros, cons, and trade-offs.**\n- **vs cumulative incidence / risk (`cumulative-incidence-risk-rwe`):** The IR uses all person-time, handles staggered entry\n  and variable follow-up natively, and is the right summary when follow-up is censored or heterogeneous. Cost: it assumes a\n  roughly constant hazard within the window it summarizes; if the hazard is strongly time-varying (e.g., a peri-procedural\n  spike), a single IR averages over it and misleads. **Prefer the IR** for sparse events with variable follow-up; **prefer\n  cumulative incidence** when a fixed-horizon, patient-facing probability is the question and competing risks are non-trivial.\n- **vs crude proportion (cases / N enrolled):** A proportion ignores differential follow-up and inflates or deflates with\n  administrative censoring; the IR is unbiased for the hazard under non-informative censoring. Cost: more programming and a\n  correct at-risk clock. **Prefer the IR** whenever follow-up length differs across people, which is nearly always true in\n  claims.\n- **vs standardized rates (`direct-standardization-rwe`, `indirect-standardization-smr-sir-rwe`):** A crude IR is fine for a\n  single homogeneous group; comparing crude IRs across populations with different age/sex/comorbidity mixes is confounded.\n  Direct/indirect standardization or a Poisson model with an offset removes that. 
**Prefer standardization or modeling** for\n  any cross-group comparison; reserve the crude IR for within-stratum description.\n\n**When to use.** Describing the occurrence of a new-onset outcome (incident MI, first HF hospitalization, first malignancy)\nin a cohort with staggered entry and censoring; computing background/expected rates for safety signal detection\n(observed-vs-expected); building the event counts and offsets that feed a Poisson/negative-binomial rate model; reporting\nper-1,000-PY rates with exact Poisson confidence intervals in a regulatory or HTA dossier.\n\n**When NOT to use — and when it is actively misleading or dangerous.**\n- **The hazard is far from constant over the summary window.** A single IR over 5 years that lumps a sharp early peri-exposure\n  spike with a flat tail is uninterpretable; either split person-time into clinically meaningful intervals (piecewise rates)\n  or model time explicitly. Quoting one IR here is actively misleading.\n- **Competing risks are common and you narrate the rate as a risk.** In an elderly or oncology cohort where death is frequent,\n  a cause-specific IR for a non-fatal outcome does not equal the probability a patient experiences it; converting it to a\n  \"risk\" overstates patient-facing probability. Switch to a cumulative incidence function (see\n  `competing-risks-cause-specific-fine-gray-rwe`).\n- **Person-time includes time the person was not observable or not at risk.** Counting follow-up after disenrollment, after\n  the first event in a first-event rate, or during MA-only spans where claims are absent fabricates denominator and biases the\n  IR downward. 
This is the claims analogue of immortal-time bias (see `immortal-time-bias-handling`).\n- **Recurrent events counted as if independent.** Treating multiple events per person as independent Poisson observations\n  understates variance and produces falsely narrow CIs; use robust/clustered variance.\n\n**Data-source operational depth.**\n- **Claims (FFS):** PT is the intersection of continuous medical enrollment with the at-risk window; the event date is the\n  service/admission date of the first qualifying diagnosis (with code position and care setting specified — inpatient primary\n  dx is more specific than any-position outpatient). Failure modes: **Medicare Advantage** enrollees generate no\n  fee-for-service claims, so MA-only person-time is unobserved — including it in the denominator silently inflates PT and\n  deflates the IR; restrict to Parts A/B (and D if drug exposure matters) and **drop MA-only spans**. Enrollment gaps,\n  claims-adjudication lag near the data cut, and same-day duplicate/reversed claims all distort counts.\n- **EHR:** Event ascertainment is encounter-driven, so a patient who seeks care elsewhere (\"leakage\") contributes apparent\n  event-free person-time that is actually unobserved — differential leakage by exposure biases the IR. Define an explicit\n  \"active in system\" requirement (e.g., ≥1 encounter per year) and treat loss to follow-up as potentially informative;\n  structured problem-list/lab capture sharpens the case definition over claims.\n- **Registry:** Strong for adjudicated incident events (e.g., cancer registry first primaries) but typically lacks complete\n  follow-up for censoring; link to claims for enrollment/death to build correct person-time, and check registry completeness\n  and reporting lag by calendar year.\n- **Linked claims–EHR–vital records:** The ideal substrate — EHR specificity + claims completeness + a death index that makes\n  the competing-risk denominator honest. 
Cost: linkage selects the linkable subset, and order/service/registry date\n  discrepancies must be reconciled before assigning event dates and closing person-time.\n\n**Worked claims example.** Question: incidence rate of first hospitalized heart failure (HF) among adult new initiators of a\ndrug in a commercial + Medicare FFS database, 2019–2023, per 1,000 PY. (1) *Eligibility and clock start*: index date = first\nqualifying fill; require 365 days of continuous medical + pharmacy enrollment before index (FFS-observable, no MA-only spans)\nand no prior HF diagnosis in that 365-day washout, which makes the event *incident*, not prevalent. (2) *At-risk window\nstart*: person-time begins the day after index. (3) *Event*: first inpatient claim with HF as the principal diagnosis (a more\nspecific definition than any-position outpatient); the event date is the admission date. (4) *Censoring / clock stop*:\nperson-time ends at the earliest of first HF event, disenrollment, death, or end of study (2023-12-31). Because death is a\ncompeting event, we report the **cause-specific** rate (censoring at death) and, separately, a cumulative incidence function so\nthe rate is not misread as a risk. (5) *Denominator construction*: for each person, PT = (min(stop dates) − at-risk start) in\nyears, summing only continuous FFS-observed time and stopping at the first event (first-event rate). (6) *Worked numbers*:\nsuppose 47 first HF hospitalizations accrue over 12,840 person-years. IR = 47 / 12,840 = 0.00366 per PY = **3.66 per 1,000\nPY**. An exact (Poisson) 95% CI uses the gamma/chi-square inversion on the count, with person-time expressed in thousands of\nPY (12,840 PY = 12.840 thousand PY): lower = 0.5·χ²(0.025, 2·47)/12.840, upper = 0.5·χ²(0.975, 2·48)/12.840, giving roughly\n**2.69–4.87 per 1,000 PY**. 
(7) *Sensitivity*: re-run with an any-position\noutpatient+inpatient HF definition, with the competing-risk death handled as a subdistribution event, and with a stricter\n730-day enrollment/washout, and check how far the IR and its CI move; large shifts signal a fragile case definition or at-risk clock.\n\n**Interpreting the output**\n\nAn incidence rate calculation returns: 20.0 per 1,000 person-years (95% CI 14.7–26.6 per 1,000 PY) from 47 events over 2,350 person-years of follow-up.\n\n*Formal interpretation.* The incidence rate is a rate, not a risk: it expresses the number of new events per unit of person-time and, unlike a probability, is not bounded by 1 (a rate can exceed 1 event per person-year); the per-1,000 PY rescaling is purely for readability. The exact Poisson 95% CI is derived from the gamma (chi-square) distribution on the event count, not from the normal approximation, and is the appropriate interval when counts are modest. The rate is cause-specific: patients who die from a competing cause are censored at that point, so the denominator includes their at-risk time only up to the competing event. Where competing mortality differs by group, a cause-specific IR alone is insufficient to describe absolute disease burden; pair it with a cumulative incidence function estimated by Aalen-Johansen.\n\n*Practical interpretation.* The 20 per 1,000 PY rate means that, on average, 2 events are expected per 100 fully observed patient-years of follow-up; it does not by itself mean that any individual has a 2% risk by 12 months. Converting to a 1-year risk requires the transformation 1 − exp(−rate × time), which holds only under a constant rate and no competing events; here it gives 1 − exp(−0.020) ≈ 1.98%. In populations with heavy competing risks or variable follow-up, the IR and the 12-month cumulative incidence may diverge substantially; both should be reported to give a complete epidemiological picture.",
    "primary_category": "Descriptive_Epidemiology",
    "tags": [
      "incidence-rate",
      "person-time",
      "descriptive-epidemiology",
      "first-event",
      "recurrent-events",
      "competing-risks",
      "poisson",
      "background-rate"
    ],
    "applies_to_study_types": [
      "claims_analysis",
      "ehr_study",
      "registry_linkage",
      "comparative_effectiveness",
      "safety_surveillance"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1093/ije/dys049",
        "url": "https://doi.org/10.1093/ije/dys049",
        "citation_text": "Pearce N. Classification of epidemiological study designs. International Journal of Epidemiology. 2012;41(2):393-397.",
        "year": 2012,
        "authors_short": "Pearce",
        "notes": "Clarifies the rate/risk/person-time taxonomy and why incidence rates use person-time denominators rather than counts of people."
      },
      {
        "role": "explain",
        "doi": "10.1371/journal.pmed.0040296",
        "url": "https://doi.org/10.1371/journal.pmed.0040296",
        "citation_text": "von Elm E, Altman DG, Egger M, Pocock SJ, Gotzsche PC, Vandenbroucke JP. The Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) statement: guidelines for reporting observational studies. PLoS Medicine. 2007;4(10):e296.",
        "year": 2007,
        "authors_short": "von Elm et al.",
        "notes": "Reporting standard requiring explicit statement of follow-up, person-time, and how incident events were ascertained."
      },
      {
        "role": "demonstrate",
        "doi": "10.1093/aje/kwm324",
        "url": "https://doi.org/10.1093/aje/kwm324",
        "citation_text": "Suissa S. Immortal time bias in pharmacoepidemiology. American Journal of Epidemiology. 2008;167(4):492-499.",
        "year": 2008,
        "authors_short": "Suissa",
        "notes": "Demonstrates how mis-allocated person-time (time counted before a person is truly at risk) corrupts incidence-rate estimates and how correct at-risk denominators fix it."
      },
      {
        "role": "use",
        "doi": "10.2307/2530643",
        "url": "https://doi.org/10.2307/2530643",
        "citation_text": "Greenland S, Robins JM. Estimation of a common effect parameter from sparse follow-up data. Biometrics. 1985;41(1):55-68.",
        "year": 1985,
        "authors_short": "Greenland & Robins",
        "notes": "Foundational treatment of person-time rate estimation and standardization with sparse follow-up data, underpinning exact/stratified incidence-rate analysis."
      }
    ],
    "plain_language_summary": "An incidence rate counts how many new events happened and divides that by the total time the people in your study were actually being watched and could still have the event. The bottom of the fraction is not a head-count of patients; it is the summed-up time each person was followed (call it the at-risk clock), so someone watched for two years counts twice as much as someone watched for one. You report it as something like \"3.7 first heart-failure hospitalizations per 1,000 person-years.\" The whole job is getting that at-risk clock right: stop counting a person's time the moment they have the event, leave the study, or die, because counting time they were no longer at risk quietly makes the rate look lower than it really is.",
    "key_terms": [
      {
        "term": "person-time",
        "definition": "The total amount of follow-up time you add up across everyone in the study, so each person counts only for the stretch they were actually being observed and could still have the event."
      },
      {
        "term": "at-risk clock",
        "definition": "For one person, the stretch of calendar time when they were observable and still eligible to have a first event; it starts when follow-up begins and stops at the event, study end, leaving the data, or death."
      },
      {
        "term": "incident event",
        "definition": "A brand-new occurrence of the outcome (a first heart-failure hospitalization), as opposed to one the patient already had before follow-up started."
      },
      {
        "term": "censored",
        "definition": "A person whose follow-up ends without the event being seen, because they left the data, the study ended, or they died, so their clock simply stops where it is."
      },
      {
        "term": "person-years",
        "definition": "Person-time expressed in years; for example 1,460 person-days divided by 365 is about 4.0 person-years."
      }
    ],
    "worked_example": {
      "scenario": "We follow four adults who just started a drug and want the incidence rate of their first hospitalized heart failure (HF) over a 2022-2023 observation window. Each person enters on a different date (staggered entry) and is followed until the earliest of their first HF hospitalization, leaving the data, or the study end on 2023-12-31. We add up everyone's follow-up time to build the denominator, count the HF events for the numerator, and divide.",
      "dataset": {
        "caption": "One row per patient: when their at-risk clock started, when it stopped, the resulting follow-up length, and whether the stop was an HF event (1) or a censoring (0).",
        "columns": [
          "person_id",
          "entry_date",
          "exit_date",
          "follow_up_days",
          "event"
        ],
        "rows": [
          [
            1001,
            "2022-01-01",
            "2023-01-01",
            365,
            1
          ],
          [
            1002,
            "2022-01-01",
            "2022-07-02",
            182,
            0
          ],
          [
            1003,
            "2022-04-01",
            "2023-04-01",
            365,
            1
          ],
          [
            1004,
            "2022-07-01",
            "2023-12-31",
            548,
            0
          ]
        ]
      },
      "steps": [
        "Find each person's follow-up time = exit_date minus entry_date in days: P1001 = 365, P1002 = 182, P1003 = 365, P1004 = 548 days.",
        "P1001 and P1003 stop because of an HF event (event = 1); P1002 stops early because they left the data and P1004 stops at the study end (both event = 0, censored), so their clocks just freeze where they are.",
        "Add up all the follow-up time to get total person-time: 365 + 182 + 365 + 548 = 1,460 person-days, which is 1,460 / 365 = about 4.0 person-years.",
        "Count the events for the numerator: 2 HF hospitalizations (P1001 and P1003).",
        "Divide events by total person-time, then rescale to a friendly denominator like 1,000 person-years."
      ],
      "result": "IR = 2 events / 1,460 person-days = 2 / 4.0 person-years = 0.50 events per person-year = 500 per 1,000 person-years. (That number is sky-high only because this teaching cohort has just 4 people followed briefly; a real HF rate from thousands of patients lands nearer 3-4 per 1,000 person-years. The mechanics of summing person-time and dividing are identical at any size.)",
      "timeline_spec": {
        "title": "Person-time accounting for a first-event HF incidence rate across four staggered-entry patients",
        "window": {
          "start": "2022-01-01",
          "end": "2023-12-31",
          "label": "Denominator: summed follow-up = 1,460 person-days (~4.0 person-years)"
        },
        "events": [
          {
            "label": "P1001 follow-up",
            "start": "2022-01-01",
            "length_days": 365,
            "quantity": "365 days at risk"
          },
          {
            "label": "P1002 follow-up",
            "start": "2022-01-01",
            "length_days": 182,
            "quantity": "182 days at risk"
          },
          {
            "label": "P1003 follow-up",
            "start": "2022-04-01",
            "length_days": 365,
            "quantity": "365 days at risk"
          },
          {
            "label": "P1004 follow-up",
            "start": "2022-07-01",
            "length_days": 548,
            "quantity": "548 days at risk"
          }
        ],
        "spans": [
          {
            "kind": "followup",
            "start": "2022-01-01",
            "end": "2023-01-01",
            "label": "P1001: 365 days, ends in HF event"
          },
          {
            "kind": "exposed",
            "start": "2023-01-01",
            "end": "2023-01-01",
            "label": "P1001 event (event = 1)"
          },
          {
            "kind": "followup",
            "start": "2022-01-01",
            "end": "2022-07-02",
            "label": "P1002: 182 days, censored (left data)"
          },
          {
            "kind": "followup",
            "start": "2022-04-01",
            "end": "2023-04-01",
            "label": "P1003: 365 days, ends in HF event"
          },
          {
            "kind": "exposed",
            "start": "2023-04-01",
            "end": "2023-04-01",
            "label": "P1003 event (event = 1)"
          },
          {
            "kind": "followup",
            "start": "2022-07-01",
            "end": "2023-12-31",
            "label": "P1004: 548 days, censored at study end"
          }
        ],
        "result": {
          "label": "2 events / 1,460 person-days (~4.0 person-years) = 500 per 1,000 person-years",
          "value": 500.0
        },
        "caption": "Four patients enter on different dates and each contributes a follow-up bar; P1001 and P1003 end in an HF event (numerator = 2), while P1002 and P1004 are censored. The four bar lengths (365 + 182 + 365 + 548 = 1,460 person-days) are the denominator the rate divides into.",
        "alt_text": "Timeline over 2022-2023 showing four horizontal follow-up bars of 365, 182, 365, and 548 days starting on different dates; two bars (P1001, P1003) end in an event marker and two (P1002, P1004) end censored, with the summed person-time of 1,460 person-days forming the rate denominator."
      }
    },
    "prerequisites": [
      "person-time-denominator-construction-rwe",
      "continuous-enrollment-observable-time-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "First-event incidence rate",
        "description": "Each person contributes person-time only until their first qualifying event; the event removes them from the at-risk denominator. The default for new-onset outcomes (incident MI, first malignancy, first HF hospitalization).",
        "edge_cases": [
          "A prior-event washout is required so a prevalent case is not counted as incident; the lookback must be FFS-observable.",
          "The event date (admission vs discharge vs claim-paid date) materially shifts person-time near the boundary."
        ],
        "data_source_notes": "claims: define the event by code, position, and care setting (inpatient principal dx is more specific); stop person-time at the admission/service date of the first qualifying claim."
      },
      {
        "name": "Recurrent-event incidence rate",
        "description": "Person-time keeps accruing after an event and subsequent events are counted, with a refractory/washout window between countable events (e.g., 30 days) so a single clinical episode is not double-counted.",
        "edge_cases": [
          "Within-person events are correlated; naive Poisson SEs are too small and the resulting CIs too narrow, so use robust/clustered (Andersen-Gill) variance.",
          "The refractory window length is a judgment call that should be varied in sensitivity analysis."
        ],
        "data_source_notes": "claims: collapse same-episode claims (e.g., transfers, readmissions within the refractory window) into one event before counting."
      },
      {
        "name": "Cause-specific rate under competing risks",
        "description": "For a non-fatal outcome with a frequent terminal competing event (death), person-time is censored at the competing event and the cause-specific hazard rate is reported alongside a cumulative incidence function.",
        "edge_cases": [
          "Reporting the cause-specific IR as a patient-facing risk overstates probability when the competing event is common.",
          "Differential mortality by exposure makes naive cross-group IR comparisons misleading; report CIFs."
        ],
        "data_source_notes": "claims/linked: a reliable death date (Part A inpatient death flag or linked vital records) is required for an honest competing-risk denominator."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Cumulative incidence (risk) over a fixed horizon",
        "pros_of_this": "Uses all person-time, handles staggered entry and censoring natively, and gives an instantaneous-hazard summary appropriate for sparse events with variable follow-up.",
        "cons_of_this": "Assumes a roughly constant hazard within the summarized window; averages over strongly time-varying hazards and does not by itself give a patient-facing probability under competing risks.",
        "when_to_prefer": "When follow-up is variable or censored and the rate (hazard) is the question; switch to cumulative incidence for fixed-horizon patient-facing probabilities with non-trivial competing risks."
      },
      {
        "compared_to": "Crude proportion (cases divided by enrolled N)",
        "pros_of_this": "Unbiased for the hazard under non-informative censoring; correctly credits each person only for time at risk.",
        "cons_of_this": "Requires building a correct at-risk clock and more programming than counting heads.",
        "when_to_prefer": "Whenever follow-up length differs across people, which is the norm in claims and EHR data."
      },
      {
        "compared_to": "Standardized rate or Poisson rate model",
        "pros_of_this": "Simple, transparent, and exactly interpretable within a single homogeneous stratum.",
        "cons_of_this": "Crude IRs are confounded by population mix when compared across groups with different age/sex/comorbidity distributions.",
        "when_to_prefer": "Within-stratum description; use direct/indirect standardization or a Poisson model with an offset for any cross-group comparison."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Person-time = continuous FFS-observable medical enrollment intersected with the at-risk window; drop Medicare Advantage-only spans (no FFS claims) to avoid fabricated denominator. Event = first qualifying claim by code, position, and care setting; close person-time at the earliest of event, disenrollment, death, and study end.",
      "ehr": "Ascertainment is encounter-driven; require demonstrable activity in-system (e.g., an annual encounter) so apparent event-free time is not unobserved leakage, and treat loss to follow-up as potentially informative.",
      "registry": "Strong for adjudicated incident events but usually incomplete for censoring; link to claims/vital records for enrollment and death, and account for registry completeness and reporting lag by calendar year.",
      "linked": "Linked claims-EHR-vital-records gives EHR specificity, claims completeness, and a death index for an honest competing-risk denominator; reconcile order/service/registry date discrepancies before closing person-time."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\nimport numpy as np\nfrom scipy.stats import chi2\n\nSTUDY_END = pd.Timestamp(\"2023-12-31\")\nPT_SCALE = 1000.0  # report per 1,000 person-years\n\ndef incidence_rate(cohort: pd.DataFrame, enroll: pd.DataFrame, events: pd.DataFrame) -> dict:\n    # First qualifying event per person.\n    first_event = (events.sort_values([\"person_id\", \"event_date\"])\n                         .groupby(\"person_id\")[\"event_date\"].first()\n                         .rename(\"first_event_date\"))\n    c = cohort.merge(first_event, on=\"person_id\", how=\"left\")\n\n    # FFS-observable enrollment end per person (MA-only spans contribute no at-risk time).\n    ffs = enroll[~enroll[\"ma_only\"]].copy()\n    last_ffs = ffs.groupby(\"person_id\")[\"enroll_end\"].max().rename(\"ffs_end\")\n    c = c.merge(last_ffs, on=\"person_id\", how=\"left\")\n\n    # Clock stops at the earliest of event, death, disenrollment (FFS end), study end.\n    stop = c[[\"first_event_date\", \"death_date\", \"ffs_end\"]].copy()\n    stop[\"study_end\"] = STUDY_END\n    c[\"stop_date\"] = stop.min(axis=1)\n\n    # Event counts only if the first event is the stopping reason (first-event rate).\n    c[\"event\"] = (c[\"first_event_date\"].notna() &\n                  (c[\"first_event_date\"] == c[\"stop_date\"])).astype(int)\n\n    # Person-time in years; guard against negative spans from data errors.\n    c[\"pt_years\"] = (c[\"stop_date\"] - c[\"atrisk_start\"]).dt.days.clip(lower=0) / 365.25\n\n    D = int(c[\"event\"].sum())\n    PT = float(c[\"pt_years\"].sum())\n    rate = D / PT\n\n    # Exact Poisson (gamma/chi-square inversion) CI; handles D == 0 cleanly.\n    lo = chi2.ppf(0.025, 2 * D) / 2 / PT if D > 0 else 0.0\n    hi = chi2.ppf(0.975, 2 * (D + 1)) / 2 / PT\n    return {\n        \"events\": D, \"person_years\": PT,\n        \"rate_per_1000_py\": rate * PT_SCALE,\n        \"ci_low_per_1000_py\": lo * PT_SCALE,\n        \"ci_high_per_1000_py\": hi * 
PT_SCALE,\n    }",
        "description": "First-event incidence rate with exact Poisson CI from claims-style inputs. Required inputs (cleaned, de-duplicated):\n  enroll : continuous medical enrollment spans -> person_id, enroll_start, enroll_end, ma_only (bool; True = no FFS claims)\n  events : qualifying outcome claims          -> person_id, event_date (datetime), source ('inpatient_primary' etc.)\n  cohort : one row per eligible person         -> person_id, atrisk_start (datetime; day after index, post-washout),\n                                                  death_date (datetime or NaT)\nSTUDY_END caps administrative follow-up. Person-time is summed only over FFS-observable enrollment up to the first event,\ndeath, disenrollment, or study end. Returns the rate per 1,000 PY with an exact (Poisson) 95% CI.",
        "dependencies": [
          "pandas",
          "numpy",
          "scipy"
        ],
        "source_citations": [
          "greenland-robins-1985"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\nSTUDY_END <- as.Date(\"2023-12-31\")\n\nincidence_rate <- function(cohort, enroll, events) {\n  setDT(cohort); setDT(enroll); setDT(events)\n\n  setorder(events, person_id, event_date)\n  first_event <- events[, .(first_event_date = event_date[1L]), by = person_id]\n  ffs_end <- enroll[ma_only == FALSE, .(ffs_end = max(enroll_end)), by = person_id]\n\n  c <- merge(cohort, first_event, by = \"person_id\", all.x = TRUE)\n  c <- merge(c, ffs_end, by = \"person_id\", all.x = TRUE)\n\n  # Clock stops at earliest of event, death, FFS disenrollment, study end.\n  c[, stop_date := pmin(first_event_date, death_date, ffs_end, STUDY_END, na.rm = TRUE)]\n  c[, event := as.integer(!is.na(first_event_date) & first_event_date == stop_date)]\n  c[, pt_years := pmax(as.numeric(stop_date - atrisk_start), 0) / 365.25]\n\n  D  <- sum(c$event)\n  PT <- sum(c$pt_years)\n  pt <- poisson.test(D, T = PT, conf.level = 0.95)  # exact Poisson CI\n  list(events = D, person_years = PT,\n       rate_per_1000_py     = (D / PT) * 1000,\n       ci_low_per_1000_py   = pt$conf.int[1] * 1000,\n       ci_high_per_1000_py  = pt$conf.int[2] * 1000)\n}",
        "description": "First-event incidence rate with exact Poisson CI, data.table. Inputs mirror the Python version:\n  enroll : person_id, enroll_start, enroll_end (Date), ma_only (logical)\n  events : person_id, event_date (Date)\n  cohort : person_id, atrisk_start (Date), death_date (Date or NA)\npoisson.test() supplies the exact (Poisson) confidence interval for the count given total person-time.",
        "dependencies": [
          "data.table"
        ],
        "source_citations": [
          "greenland-robins-1985"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "%let study_end = '31DEC2023'd;\n\n/* First qualifying event per person; close the at-risk clock at the earliest stop reason. */\nproc sql;\n  create table rate_input as\n  select c.person_id,\n         c.atrisk_start,\n         min(e.event_date) as first_event_date format=date9.,\n         min( coalesce(calculated first_event_date, &study_end),\n              coalesce(c.death_date,  &study_end),\n              coalesce(c.ffs_end,     &study_end),\n              &study_end ) as stop_date format=date9.\n  from work.cohort c\n  left join work.events e\n    on e.person_id = c.person_id\n  group by c.person_id, c.atrisk_start, c.death_date, c.ffs_end;\nquit;\n\ndata rate_input;\n  set rate_input;\n  /* First-event rate: event counts only if the event is the stopping reason. MA-only time was already excluded upstream. */\n  event   = (first_event_date ne . and first_event_date = stop_date);\n  pt_years = max((stop_date - atrisk_start), 0) / 365.25;\nrun;\n\n/* Crude incidence rate with an exact Poisson 95% CI. */\nproc stdrate data=rate_input\n             stat=rate\n             plots=none;\n  population event=event pyears=pt_years;\nrun;\n\n/* Equivalent Poisson rate regression with a log(person-time) offset (e.g., for covariate-adjusted rates). */\ndata rate_input; set rate_input; log_pt = log(pt_years); run;\nproc genmod data=rate_input;\n  model event = / dist=poisson link=log offset=log_pt;\n  estimate 'log rate' intercept 1;\nrun;",
        "description": "Person-time construction and incidence-rate estimation in SAS. Required input datasets (post data-management):\n  work.cohort : person_id, atrisk_start, death_date (or .), and (joined in) ffs_end = last FFS-observable enrollment date\n  work.events : person_id, event_date, source       (first qualifying outcome claim already identified upstream)\n  work.strata : person_id + age_grp, sex            (for standardized/stratified rates via PROC STDRATE)\nPROC STDRATE produces crude and (with a reference population) directly standardized rates with exact Poisson CIs;\nPROC GENMOD fits the equivalent log-linear Poisson rate model with a person-time offset. PROC STDRATE requires SAS/STAT.",
        "dependencies": [],
        "source_citations": [
          "greenland-robins-1985"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": "incidence-rate-calculation-rwe-timeline.svg",
        "mermaid": null,
        "caption": "Four patients enter on different dates and each contributes a follow-up bar; P1001 and P1003 end in an HF event (numerator = 2), while P1002 and P1004 are censored. The four bar lengths (365 + 182 + 365 + 548 = 1,460 person-days) are the denominator the rate divides into.",
        "alt_text": "Timeline over 2022-2023 showing four horizontal follow-up bars of 365, 182, 365, and 548 days starting on different dates; two bars (P1001, P1003) end in an event marker and two (P1002, P1004) end censored, with the summed person-time of 1,460 person-days forming the rate denominator.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  Start[At-risk start<br/>day after index, post-washout] --> Span[Continuous FFS-observable<br/>enrollment span]\n  Span --> Drop[Remove MA-only time<br/>no FFS claims = unobserved]\n  Drop --> Stop{Earliest stop reason}\n  Stop -->|First qualifying event| Ev[Count event<br/>person-time ends at event date]\n  Stop -->|Death competing event| De[Censor; report CIF separately]\n  Stop -->|Disenrollment| Di[Censor at FFS end]\n  Stop -->|Study end| En[Administrative censor]\n  Ev --> PT[Sum person-time across cohort]\n  De --> PT\n  Di --> PT\n  En --> PT\n  PT --> Rate[IR = events / person-time<br/>per 1,000 PY + exact Poisson CI]",
        "caption": "Person-time accounting for a first-event incidence rate in claims. The denominator is observable enrollment intersected with the at-risk window, with MA-only time removed; the clock stops at the earliest of event, death, disenrollment, or study end.",
        "alt_text": "Flow diagram from at-risk start through a continuous enrollment span, removal of MA-only time, an earliest-stop decision (event, death, disenrollment, study end), summation of person-time, and the incidence-rate calculation.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Q[Outcome of interest] --> CR{Competing terminal event<br/>e.g., death common?}\n  CR -->|No| IR[Cause-specific incidence rate<br/>= risk-translatable; report IR per 1,000 PY]\n  CR -->|Yes| Want{What is the question?}\n  Want -->|Etiologic hazard / rate| CS[Cause-specific rate<br/>censor at competing event]\n  Want -->|Patient-facing probability| CIF[Cumulative incidence function<br/>do NOT read the cause-specific IR as a risk]\n  CS --> Warn[Naive cross-group IR comparison<br/>confounded if mortality differs by group]",
        "caption": "Decision logic for choosing the right incidence quantity under competing risks. A cause-specific rate answers the etiologic/hazard question; a cumulative incidence function answers the patient-facing probability question and must be used when the competing event is common.",
        "alt_text": "Decision tree distinguishing cause-specific incidence rate from the cumulative incidence function depending on whether a competing terminal event is common and whether the question is the hazard or a patient-facing probability.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "is_variant_of",
        "target_slug": "descriptive-epidemiology-rwe",
        "notes": "Incidence rate calculation is a core descriptive-epidemiology measure specialized to person-time denominators and incident-event numerators."
      },
      {
        "relation_type": "requires",
        "target_slug": "person-time-denominator-construction-rwe",
        "notes": "A correct incidence rate depends entirely on the at-risk person-time denominator; denominator construction is the upstream step."
      },
      {
        "relation_type": "alternative_to",
        "target_slug": "cumulative-incidence-risk-rwe",
        "notes": "Incidence rate (hazard, per person-time) vs cumulative incidence (risk, a fixed-horizon proportion) answer different questions; choose based on whether a hazard or a patient-facing probability is needed."
      },
      {
        "relation_type": "used_with",
        "target_slug": "direct-standardization-rwe",
        "notes": "Crude incidence rates are standardized (or modeled) before comparing across populations with different age/sex/ comorbidity mixes."
      },
      {
        "relation_type": "see_also",
        "target_slug": "competing-risks-cause-specific-fine-gray-rwe",
        "notes": "When a terminal competing event is common, report a cumulative incidence function rather than reading a cause-specific rate as a risk."
      },
      {
        "relation_type": "see_also",
        "target_slug": "immortal-time-bias-handling",
        "notes": "Counting person-time during which a person was not yet at risk is the immortal-time-bias failure mode that most often corrupts incidence rates."
      }
    ],
    "aliases": [
      "incidence rate",
      "incidence density",
      "person-time rate",
      "events per person-year",
      "incidence rate per 1000 person-years"
    ],
    "complexity": "foundational",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "indirect-standardization-smr-sir-rwe",
    "name": "Indirect Standardization, SMR, and SIR",
    "short_definition": "A summary measure that compares the events observed in a study cohort with the events expected if stratum-specific (usually age-, sex-, and calendar-period-specific) reference rates had applied to the cohort's own person-time, expressed as a standardized mortality ratio (SMR) or standardized incidence ratio (SIR).",
    "long_description": "**Indirect standardization** answers a single, narrow question: did this cohort experience more (or fewer) events than\nwe would expect if it had the same stratum-specific event rates as a reference population? It does so by applying external\nreference rates (age-, sex-, calendar-period-, and possibly race-specific mortality or incidence rates from a registry,\ncensus-linked vital-statistics file, or the general population) to the cohort's own observed person-time within each\nstratum, summing those products to get the **expected** count E, and dividing the **observed** count O by E. The result is\nthe **standardized mortality ratio (SMR = O/E)** for death outcomes or the **standardized incidence ratio (SIR = O/E)** for\nincident-disease outcomes. An SMR of 1.45 means the cohort had 45% more events than its age/sex/period composition would\npredict from reference rates.\n\n**Core conceptual distinction.** Indirect standardization is the mirror image of *direct* standardization, and the two\nanswer different questions. *Direct* standardization applies the *cohort's* stratum-specific rates to a fixed *standard*\npopulation to produce a single age-adjusted rate that is comparable across any number of groups standardized to the same\nstandard. *Indirect* standardization applies the *reference's* stratum-specific rates to the *cohort's* person-time and\nneeds only the total observed count plus stratum-specific person-time from the cohort — not stratum-specific cohort rates.\nThis makes it the method of choice when cohort strata are sparse (rare exposures, small subgroups), because it borrows the\nstable reference rates rather than estimating unstable cohort-specific rates. 
The price is the **non-comparability caveat**:\ntwo SMRs computed against the same reference are each interpretable as O/E versus that reference, but they are NOT directly\ncomparable to each other unless the two cohorts have identical stratum (age/sex) distributions — because each is weighted by\nits own person-time distribution. The estimand is a *ratio of the cohort's event rate to a counterfactual rate the cohort\nwould have had under reference rates*, conditional on the standardization strata; it is descriptive of excess/deficit\nversus an external benchmark, not a confounding-adjusted causal contrast between two treatment arms.\n\n**Pros, cons, and trade-offs.**\n- **vs direct standardization (age-adjusted rates):** Indirect standardization is more stable and usable when the cohort is\n  small or strata are sparse (it needs no cohort-specific stratum rates), and it is the natural form when comparing a single\n  cohort to an external general-population benchmark. Cost: SMRs are not mutually comparable across cohorts with different\n  age/sex structures, whereas directly standardized rates are. **Prefer indirect** for single-cohort-vs-population\n  comparisons and sparse data; **prefer direct** when you must rank or compare several exposed groups.\n- **vs a confounding-adjusted internal comparator (e.g., active-comparator new-user cohort with PS weighting):** The SMR/SIR\n  is far simpler and requires only external rates, with no need for an internal unexposed arm. Cost: it controls only for\n  the standardization variables (age, sex, period) and inherits *all* residual differences between the cohort and the\n  reference population (healthy-worker effect, selection, ascertainment, secular drift). 
It cannot answer \"drug A vs drug B.\"\n  **Prefer SMR/SIR** for benchmarking event burden against population norms; **prefer an adjusted internal comparison** for\n  comparative effectiveness or safety where confounding by indication is the threat.\n- **vs a crude rate ratio:** Indirect standardization removes confounding by the standardization variables (typically the\n  strongest confounders, age and sex). Cost: residual confounding by everything not in the strata, and the comparability\n  limitation above. **Prefer SMR/SIR** over a crude O/E whenever the cohort and reference differ in age/sex/period mix.\n\n**When to use.** Benchmarking a defined cohort's mortality or incidence against general-population or registry rates\n(e.g., excess cancer incidence in an occupational or immunosuppressed cohort; excess mortality in a disease registry);\nsignal detection for elevated event burden when no suitable internal comparator exists; small cohorts or rare-stratum\nsettings where direct standardization is too unstable; regulatory/safety contexts where an external population rate is the\naccepted reference (e.g., observed-vs-expected analyses in registries).\n\n**When NOT to use — and when it is actively misleading or dangerous.**\n- **As a comparative treatment effect.** An SMR/SIR is not a confounding-adjusted contrast between exposures. Using it to\n  claim drug A is \"safer than\" drug B because A's SMR is lower than B's SMR is a classic error: the two SMRs are weighted by\n  different person-time distributions and are not comparable. If you need a treatment comparison, build an internal\n  active-comparator cohort.\n- **When the reference rates come from a different calendar period or population than the cohort's person-time.** Secular\n  decline in cardiovascular mortality, or rising cancer-screening-driven incidence, will inflate or deflate the SMR/SIR\n  purely as an artifact. 
Match reference rates to the cohort's calendar years and demographic definitions.\n- **When the cohort is selected on health (healthy-worker / healthy-volunteer effect).** Employed or enrolled cohorts are\n  systematically healthier than the general population, biasing SMRs below 1 for reasons that have nothing to do with the\n  exposure. Indirect standardization does nothing to fix this.\n- **When stratum-specific reference rates are unavailable for the cohort's full strata** (e.g., very old ages, rare\n  race/ethnicity cells), forcing rate borrowing or collapsing that introduces residual confounding by the collapsed\n  variable.\n- **Ranking several exposed subgroups by their SMRs.** This is the Breslow–Day non-comparability trap: only direct\n  standardization (or a formal SMR ratio with caution) supports cross-group comparison.\n\n**Data-source operational depth.**\n- **Claims (FFS vs Medicare Advantage):** Person-time must be built from continuous-enrollment spans, and the denominator\n  must exactly match the source of the numerator. The dominant failure mode is **Medicare Advantage-only person-time**:\n  MA encounters are incompletely reported in FFS claims files, so events (the SMR numerator) are undercounted while\n  person-time (the denominator) may still accrue, deflating the SMR. Restrict to enrollees with full Parts A/B (and D for\n  drug exposures) and exclude MA-only person-time, or use a data source with complete MA capture. Diagnosis-based outcome\n  ascertainment also differs from the registry-based ascertainment underlying most reference rates, creating\n  numerator/denominator definitional mismatch.\n- **EHR:** Visit-driven capture means person-time is hard to define (when does follow-up end for a patient who simply stops\n  coming?), and events occurring outside the system (death at home, care at another health system) are missed, biasing\n  SMRs downward. 
Link to a death index (e.g., NDI) before computing mortality SMRs; otherwise the observed count is\n  incomplete relative to the population-based reference.\n- **Registry (cancer, disease):** The natural home for SIRs — incident events are adjudicated and the reference incidence\n  rates (e.g., SEER) are population-based and stratum-specific. Watch registry completeness and the lag between diagnosis\n  and registration, and ensure the SIR's reference rates cover the same geography and years as the registry's catchment.\n- **Linked claims–EHR–registry–vital-records:** The ideal substrate: registry-adjudicated events and reference rates,\n  claims/EHR person-time, and vital-records death. Linkage selection (only the linkable subset) and date-discrepancy issues\n  (diagnosis date vs registration date vs claim date) must be reconciled before assigning events to person-time strata.\n- **Competing risks (especially in elderly claims cohorts):** Applying all-cause-mortality reference rates to a cohort with\n  a differential competing-risk profile (e.g., high cardiovascular mortality removing people before the cancer of interest)\n  distorts the expected count. For cause-specific SMRs, ensure both the observed events and the reference rates are\n  cause-specific, and recognize that differential competing risks across compared cohorts further break SMR comparability.\n\n**Worked example (claims-style SIR).** Question: is incident colorectal cancer (CRC) elevated among adults who initiated a\nspecific immunosuppressant, versus the U.S. general population? (1) Cohort: first fill of the drug = index date; require\n365 days of continuous Parts A/B/D FFS enrollment before index (no MA-only spans) so person-time and events are\nobservable; exclude anyone with a prior CRC diagnosis. 
(2) Person-time: from index to the first of incident CRC (≥1\ninpatient or ≥2 outpatient CRC diagnoses ≥30 days apart, the validated claims definition), disenrollment, death, or end of\ndata — accrued within age (5-year bands) × sex × calendar-year strata. (3) Expected count: for each stratum, multiply the\ncohort's accrued person-years by the matching SEER age/sex/period CRC incidence rate, then sum across strata. Suppose the\ncohort accrues 18,432 person-years; the stratum-by-stratum sum of (person-years × SEER rate) gives E = 32.5 expected CRC\ncases, while O = 47 cases are observed. (4) SIR = 47 / 32.5 = 1.45. (5) Exact Poisson 95% CI (Ulm's chi-square formula):\ntreating O as a Poisson count, the lower limit is 0.5·χ²(0.025, 2·47)/32.5 = 1.06 and the upper is 0.5·χ²(0.975, 2·48)/32.5 = 1.92,\nso SIR 1.45 (95% CI 1.06–1.92) — a statistically significant excess incidence. (6) Sensitivity: vary the CRC algorithm (1 vs 2\nclaims), align SEER years exactly to the cohort's accrual years to rule out secular drift, and report the\nhealthy-initiator caveat (initiators may differ from the general population on screening and comorbidity) since indirect\nstandardization adjusts only for age, sex, and period.\n\n**Interpreting the output**\n\nConsider the mortality worked example: observed deaths O = 120, expected deaths E = 100.0, yielding\nSMR = 120 / 100.0 = 1.20.\n\nFormal interpretation: An SMR of 1.20 means the cohort experienced 20% more deaths than would\nbe predicted if its members had died at the same age- and sex-specific rates as the reference\npopulation. Expected deaths are computed by applying reference rates to the cohort's own\nstratum-specific person-time — this is indirect standardization, meaning each stratum is\nweighted by the cohort's own exposure structure, not by a common external standard. Because\ndifferent cohorts contribute different age-sex-time mixes, their SMRs are not mutually\ncomparable: an SMR of 1.20 in one cohort and 1.30 in another does not necessarily mean the second\ncohort has higher mortality, because the two denominators weight the age strata differently. SMRs are\ninternally valid for comparing a given cohort to the reference, but cross-cohort comparisons of\nSMRs require that both cohorts have similar stratum-specific person-time distributions — an\nassumption that should be checked, not assumed.\n\nPractical interpretation: With O = 120 and E = 100.0, the exact Poisson 95% CI is approximately\n0.99–1.43. The point estimate suggests 20% excess mortality relative to the general-population\nbenchmark, but the interval narrowly includes 1, so the excess is not statistically significant at\nthe 5% level. Before attributing any excess to the condition under study, consider three alternative\nexplanations: the healthy-initiator or healthy-worker effect (cohort members may be systematically\nsicker or healthier than the general population at baseline, independent of the exposure); secular\ntrends if SEER or vital-statistics reference years are misaligned with the cohort's accrual years;\nand measurement differences if death ascertainment differs between the cohort and the reference\nsource. The SMR answers \"how much more?\" — it does not answer \"why?\" or \"is this causal?\".",
    "primary_category": "Descriptive_Epidemiology",
    "tags": [
      "indirect-standardization",
      "standardized-mortality-ratio",
      "standardized-incidence-ratio",
      "smr",
      "sir",
      "observed-vs-expected",
      "descriptive-epidemiology",
      "person-time"
    ],
    "applies_to_study_types": [
      "claims_analysis",
      "ehr_study",
      "registry_linkage",
      "comparative_effectiveness"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1136/jech.38.1.85",
        "url": "https://doi.org/10.1136/jech.38.1.85",
        "citation_text": "Liddell FDK. Simple exact analysis of the standardised mortality ratio. Journal of Epidemiology and Community Health. 1984;38(1):85-88.",
        "year": 1984,
        "authors_short": "Liddell",
        "notes": "Foundational treatment of the SMR as an observed-versus-expected ratio with exact Poisson inference; the reference for the O/E framework and its confidence interval."
      },
      {
        "role": "explain",
        "doi": "10.1371/journal.pmed.0040296",
        "url": "https://doi.org/10.1371/journal.pmed.0040296",
        "citation_text": "von Elm E, Altman DG, Egger M, Pocock SJ, Gotzsche PC, Vandenbroucke JP. The Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) statement. PLoS Medicine. 2007;4(10):e296.",
        "year": 2007,
        "authors_short": "von Elm et al.",
        "notes": "Reporting standard requiring explicit description of how rates and standardized measures were derived, the reference source, and the strata used."
      },
      {
        "role": "demonstrate",
        "doi": "10.1093/oxfordjournals.aje.a115507",
        "url": "https://doi.org/10.1093/oxfordjournals.aje.a115507",
        "citation_text": "Ulm K. A simple method to calculate the confidence interval of a standardized mortality ratio. American Journal of Epidemiology. 1990;131(2):373-375.",
        "year": 1990,
        "authors_short": "Ulm",
        "notes": "Gives the exact (Byar-type) Poisson confidence interval for an SMR/SIR used in the worked example."
      },
      {
        "role": "use",
        "doi": "10.1093/jnci/djq238",
        "url": "https://doi.org/10.1093/jnci/djq238",
        "citation_text": "Friedman DL, Whitton J, Leisenring W, et al. Subsequent neoplasms in 5-year survivors of childhood cancer: the Childhood Cancer Survivor Study. Journal of the National Cancer Institute. 2010;102(14):1083-1095.",
        "year": 2010,
        "authors_short": "Friedman et al.",
        "notes": "Canonical applied use of standardized incidence ratios — observed subsequent neoplasms versus SEER general-population rates applied to cohort person-time."
      }
    ],
    "plain_language_summary": "Indirect standardization answers one question: did this group of patients have more (or fewer) deaths or new diagnoses than you would expect for people with the same age and sex mix? You take published rates from a reference population — say, national death statistics — apply those rates to your study group's own mix of patients and follow-up time, and add up the events you would have predicted. Dividing the events you actually counted by that predicted number gives the Standardized Mortality Ratio (SMR) or Standardized Incidence Ratio (SIR): a number greater than 1 means excess events, less than 1 means a deficit. One honest caveat: because each group's result is weighted by its own patient mix, you cannot directly compare two SMRs from two different groups — the method is built for comparing one group against a population benchmark, not against each other.",
    "key_terms": [
      {
        "term": "person-time",
        "definition": "The total amount of follow-up accumulated by all patients in a study — for example, 500 patients each followed for 2 years contributes 1,000 person-years."
      },
      {
        "term": "reference rate",
        "definition": "An event rate drawn from an external source (such as national vital statistics or a cancer registry) that represents what happens in the general population within a given age group, sex, and time period."
      },
      {
        "term": "expected events",
        "definition": "The number of events predicted to occur in the study group if, and only if, the reference population's rates had applied — calculated by multiplying each stratum's person-time by the matching reference rate and summing."
      },
      {
        "term": "observed events",
        "definition": "The actual count of events (deaths, diagnoses, hospitalizations) that occurred among study participants during follow-up."
      },
      {
        "term": "standardized mortality ratio (SMR)",
        "definition": "Observed deaths divided by expected deaths; a value of 1.20 means the cohort had 20% more deaths than the reference population's rates would predict for a group with the same age and sex distribution."
      },
      {
        "term": "standardized incidence ratio (SIR)",
        "definition": "The same calculation as the SMR but applied to new disease diagnoses (incidence) rather than deaths."
      }
    ],
    "worked_example": {
      "scenario": "Researchers want to know whether adults with a rare inflammatory condition die at a higher rate than the general U.S. population. They assembled 3,000 patients from an insurance claims database and followed each person from their first diagnosis until they died, left the insurance plan, or the study ended. Follow-up time was divided into four strata by age group and sex. The researchers then obtained the matching national mortality rates from vital-statistics tables and asked: how many deaths would we expect if these 3,000 patients had died at the same rates as the U.S. population? They counted 120 deaths in the cohort. The calculation below shows how to arrive at the expected count and the final SMR.",
      "dataset": {
        "caption": "Study cohort person-time by stratum alongside the matching national reference mortality rates. Each row is one age-and-sex stratum; the analyst would build this by merging the cohort's person-time table with the downloaded vital-statistics rate file.",
        "columns": [
          "stratum_age",
          "stratum_sex",
          "cohort_person_years",
          "reference_rate_per_person_year",
          "expected_events"
        ],
        "rows": [
          [
            "40–54",
            "Female",
            1000,
            0.04,
            40.0
          ],
          [
            "40–54",
            "Male",
            800,
            0.025,
            20.0
          ],
          [
            "55–69",
            "Female",
            600,
            0.05,
            30.0
          ],
          [
            "55–69",
            "Male",
            400,
            0.025,
            10.0
          ]
        ]
      },
      "steps": [
        "For each stratum, multiply the cohort's person-years by the reference rate: stratum '40–54 Female' contributes 1,000 × 0.040 = 40.0 expected deaths.",
        "Repeat for every stratum: '40–54 Male' gives 800 × 0.025 = 20.0; '55–69 Female' gives 600 × 0.050 = 30.0; '55–69 Male' gives 400 × 0.025 = 10.0.",
        "Sum all four expected-event values: E = 40.0 + 20.0 + 30.0 + 10.0 = 100.0 expected deaths.",
        "Count the observed deaths in the cohort: O = 120.",
        "Divide observed by expected: SMR = O / E = 120 / 100.0 = 1.20."
      ],
      "result": "SMR = 1.20. The cohort experienced 20% more deaths than would be predicted if they had died at the same age- and sex-specific rates as the general U.S. population. Because the result exceeds 1.0, this signals excess mortality relative to the population benchmark — though indirect standardization alone cannot explain why (the difference could reflect the disease itself, unmeasured lifestyle factors, or differences in care)."
    },
    "prerequisites": [
      "descriptive-epidemiology-rwe",
      "incidence-rate-calculation-rwe",
      "person-time-denominator-construction-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Standardized mortality ratio (SMR)",
        "description": "Observed deaths divided by deaths expected from age/sex/period-specific reference mortality rates applied to the cohort's person-time. Used for total or cause-specific mortality benchmarking.",
        "edge_cases": [
          "Cause-specific SMRs require cause-specific reference rates and correct death-certificate/cause coding; misclassified cause biases the numerator.",
          "All-cause SMRs in cohorts with differential competing risks distort the expected count."
        ],
        "data_source_notes": "claims/EHR: link to a death index (NDI) for complete mortality ascertainment before trusting the observed count."
      },
      {
        "name": "Standardized incidence ratio (SIR)",
        "description": "Observed incident events divided by events expected from age/sex/period-specific reference incidence rates; the natural form for cancer or disease registries with population-based incidence references (e.g., SEER).",
        "edge_cases": [
          "Requires that the cohort be free of prior disease at baseline so events are truly incident, matching the reference incidence rate definition.",
          "Reference incidence rates must cover the same geography, race/ethnicity, and calendar years as the cohort."
        ],
        "data_source_notes": "registry: ensure registration completeness and lag; claims: use a validated incident-disease algorithm (e.g., >=1 inpatient or >=2 outpatient codes) and exclude prevalent cases in the lookback."
      },
      {
        "name": "Calendar-period-stratified standardization",
        "description": "Adds calendar period (and sometimes race/ethnicity) to the age/sex strata so that reference rates track the cohort's actual follow-up years, removing secular-trend artifacts.",
        "edge_cases": [
          "Sparse cells when many strata are crossed; may require collapsing periods, which reintroduces secular bias."
        ],
        "data_source_notes": "Align the reference-rate file's calendar years exactly to the cohort's person-time accrual years."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Direct standardization (age-adjusted rates)",
        "pros_of_this": "Stable with sparse cohort strata and small cohorts; needs only total observed events plus stratum-specific person-time, not stratum-specific cohort rates; natural for single-cohort-vs-population benchmarking.",
        "cons_of_this": "SMRs/SIRs from cohorts with different age/sex distributions are not mutually comparable (Breslow-Day caveat); directly standardized rates are.",
        "when_to_prefer": "Benchmarking one cohort against an external reference, or when cohort strata are too sparse for stable direct standardization."
      },
      {
        "compared_to": "Confounding-adjusted internal comparator (active-comparator cohort with PS weighting)",
        "pros_of_this": "Requires no internal unexposed arm; uses readily available external population rates; simple and transparent.",
        "cons_of_this": "Adjusts only for the standardization variables and inherits all residual cohort-vs-reference differences (healthy-worker effect, selection, ascertainment, secular drift); cannot estimate a drug-vs-drug effect.",
        "when_to_prefer": "Benchmarking event burden against population norms when no suitable internal comparator exists."
      },
      {
        "compared_to": "Crude (unadjusted) rate ratio",
        "pros_of_this": "Removes confounding by the standardization variables (typically the strongest confounders, age and sex).",
        "cons_of_this": "Residual confounding by everything outside the strata; comparability limitation across cohorts persists.",
        "when_to_prefer": "Whenever the cohort and reference differ materially in age/sex/period composition."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Build person-time from continuous-enrollment spans and match the numerator source to the denominator. Exclude Medicare Advantage-only person-time (MA encounters are incompletely reported in FFS files, deflating the SMR/SIR). Use a validated outcome algorithm and exclude prevalent cases in the lookback for SIRs.",
      "ehr": "Visit-driven capture makes person-time and out-of-system events incomplete; link to a death index (NDI) for mortality SMRs and define observation-window closure explicitly.",
      "registry": "Natural home for SIRs with population-based reference incidence (e.g., SEER); check registration completeness, diagnosis-to-registration lag, and geographic/period alignment of reference rates.",
      "linked": "Registry-adjudicated events plus claims/EHR person-time plus vital-records death is ideal, but reconcile linkage selection and date discrepancies (diagnosis vs registration vs claim date) before assigning events to strata."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\nfrom scipy.stats import chi2\n\nSTRATA = [\"stratum_age\", \"stratum_sex\", \"stratum_year\"]\n\ndef smr_sir(ptime: pd.DataFrame, events: pd.DataFrame, ref: pd.DataFrame, alpha: float = 0.05) -> dict:\n    # Expected events per stratum = cohort person-time * matching reference rate.\n    expected = (ptime.merge(ref, on=STRATA, how=\"left\")\n                      .assign(expected=lambda d: d[\"person_years\"] * d[\"ref_rate\"]))\n    if expected[\"ref_rate\"].isna().any():\n        raise ValueError(\"Reference rate missing for some cohort strata; \"\n                         \"do not collapse silently - resolve the gap explicitly.\")\n    E = expected[\"expected\"].sum()\n    O = int(events[\"observed\"].sum())\n    ratio = O / E\n    # Exact (Byar/Ulm) Poisson confidence interval for the observed count, scaled by E.\n    lo = chi2.ppf(alpha / 2, 2 * O) / 2 / E if O > 0 else 0.0\n    hi = chi2.ppf(1 - alpha / 2, 2 * (O + 1)) / 2 / E\n    return {\"observed\": O, \"expected\": round(E, 2),\n            \"smr_sir\": round(ratio, 3), \"ci_low\": round(lo, 3), \"ci_high\": round(hi, 3)}",
        "description": "Indirect standardization (SMR/SIR) from claims/registry-style inputs. Required tables (already cleaned):\n  ptime : cohort person-time by stratum -> stratum_age (5-yr band), stratum_sex, stratum_year, person_years\n  events: cohort observed events by stratum -> stratum_age, stratum_sex, stratum_year, observed\n  ref   : external reference rates -> stratum_age, stratum_sex, stratum_year, ref_rate (events per person-year)\nExpected = sum over strata of (person_years * ref_rate). Returns the overall SMR/SIR with an exact Poisson 95% CI.",
        "dependencies": [
          "pandas",
          "scipy"
        ],
        "source_citations": [
          "liddell-1984",
          "ulm-1990"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(epitools)\n\n# Path 1: canned indirect standardization. `count`/`pop` are the COHORT vectors;\n# `stdcount`/`stdpop` are the REFERENCE vectors used to derive expected events.\nres <- ageadjust.indirect(count = cohort_events, pop = cohort_pyears,\n                          stdcount = ref_events, stdpop = ref_pyears)\nprint(res$sir)   # observed, expected, SMR/SIR (= \"sir\" element), and CI\n\n# Path 2: explicit exact-Poisson CI (Byar/Ulm), independent of the package.\nO <- sum(cohort_events)\nE <- sum(cohort_pyears * ref_rate)          # ref_rate = reference rate per stratum\nsmr <- O / E\nci_low  <- if (O > 0) qchisq(0.025, 2 * O)       / 2 / E else 0\nci_high <- qchisq(0.975, 2 * (O + 1)) / 2 / E\ncat(sprintf(\"O=%d  E=%.2f  SMR/SIR=%.3f  95%% CI %.3f-%.3f\\n\", O, E, smr, ci_low, ci_high))",
        "description": "Indirect standardization (SMR/SIR) in R. Two paths are shown: (1) epitools::ageadjust.indirect, the standard\ncanned routine, and (2) an explicit exact-Poisson CI matching the worked example. Inputs:\n  count    : observed events per stratum (cohort)\n  pop      : cohort person-time per stratum\n  rate     : reference event rate per stratum (events per person-time unit)\n  stdcount : reference events per stratum (used by ageadjust.indirect to form expected)\n  stdpop   : reference population/person-time per stratum",
        "dependencies": [
          "epitools"
        ],
        "source_citations": [
          "liddell-1984",
          "ulm-1990"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "proc stdrate data=work.study      /* cohort: observed events + person-time by stratum */\n             refdata=work.ref      /* reference: events + person-time -> reference rates */\n             method=indirect\n             stat=rate\n             effect=ratio          /* request the SMR/SIR (observed/expected ratio) */\n             plots=all;\n  population event=Event pyears=PYears;   /* cohort numerator + denominator */\n  reference  event=Event pyears=PYears;   /* reference numerator + denominator */\n  strata stratum_age stratum_sex stratum_year;   /* standardization variables */\nrun;\n\n/* PROC STDRATE prints Observed, Expected, the SMR/SIR, and its exact (Poisson) 95% CI.\n   For a stand-alone exact CI on a precomputed O and E (Byar/Ulm), e.g. O=47, E=32.5: */\ndata smr_ci;\n  O = 47; E = 32.5;\n  smr     = O / E;\n  ci_low  = (O > 0) * cinv(0.025, 2*O)       / 2 / E;\n  ci_high = cinv(0.975, 2*(O + 1)) / 2 / E;\nrun;",
        "description": "Indirect standardization (SMR/SIR) with PROC STDRATE - the SAS procedure built for this method. Required datasets:\n  work.study : one row per stratum for the COHORT -> stratum vars (age, sex, year), Event=observed, PYears=person-time\n  work.ref   : one row per stratum for the REFERENCE -> same stratum vars, Event=reference events, PYears=reference\n               person-time (PROC STDRATE derives reference rates internally)\nMETHOD=INDIRECT EFFECT=RATIO yields the SMR/SIR with exact Poisson confidence limits; PLOTS=ALL renders the\nstratum contributions to the expected count.",
        "dependencies": [],
        "source_citations": [
          "liddell-1984",
          "ulm-1990"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Cohort[Study cohort person-time<br/>by age x sex x calendar period] --> Apply[Apply external REFERENCE<br/>stratum-specific rates]\n  Ref[Reference rates<br/>SEER / vital statistics / general population] --> Apply\n  Apply --> Expected[Expected events E<br/>= sum of person-years x reference rate]\n  Cohort --> Observed[Observed events O<br/>counted in the cohort]\n  Observed --> Ratio[SMR or SIR = O / E]\n  Expected --> Ratio\n  Ratio --> CI[Exact Poisson 95% CI<br/>Byar / Ulm]\n  CI --> Interp[Interpret as excess/deficit vs reference;<br/>NOT comparable across cohorts with<br/>different age/sex distributions]",
        "caption": "Indirect standardization computes expected events by applying external reference rates to the cohort's own stratum-specific person-time, then forms the SMR/SIR as observed over expected with an exact Poisson interval.",
        "alt_text": "Flowchart showing cohort person-time and external reference rates combining into an expected count, the observed count, their ratio (SMR/SIR), an exact Poisson confidence interval, and the non-comparability interpretation caveat.",
        "source_type": "illustrative",
        "source_citations": [
          "liddell-1984"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Q[What do you want to estimate?] --> Bench{Benchmark one cohort<br/>vs a population rate?}\n  Bench -- No, compare two treatments --> Internal[Use an internal comparator<br/>active-comparator new-user + PS]\n  Bench -- Yes --> Rank{Need to compare/rank<br/>several exposed groups?}\n  Rank -- Yes --> Direct[Use DIRECT standardization<br/>age-adjusted comparable rates]\n  Rank -- No --> Sparse{Sparse strata or<br/>small cohort?}\n  Sparse -- Yes --> Indirect[Use INDIRECT standardization<br/>SMR / SIR vs reference]\n  Sparse -- No --> Either[Direct or indirect both valid;<br/>indirect for single-cohort benchmarking]",
        "caption": "Decision logic for choosing indirect standardization (SMR/SIR) versus direct standardization versus an internal confounding-adjusted comparator.",
        "alt_text": "Decision tree distinguishing benchmarking a single cohort against population rates (indirect standardization) from comparing multiple groups (direct standardization) and from comparing treatments (internal adjusted comparator).",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "is_variant_of",
        "target_slug": "descriptive-epidemiology-rwe",
        "notes": "Indirect standardization is a descriptive-epidemiology summary measure (observed vs expected) within that family."
      },
      {
        "relation_type": "alternative_to",
        "target_slug": "active-comparator-new-user",
        "notes": "When the question is comparative treatment effect rather than benchmarking against population rates, an internal active-comparator new-user cohort is the appropriate design; an SMR/SIR cannot adjust for confounding by indication."
      }
    ],
    "aliases": [
      "SMR",
      "SIR",
      "standardized mortality ratio",
      "standardized incidence ratio",
      "indirect standardization",
      "observed versus expected analysis"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "inferential-statistics-foundations",
    "name": "Inferential Statistics Foundations",
    "short_definition": "The core machinery that lets an analyst draw conclusions about a population from a sample — point estimates, standard errors, confidence intervals, hypothesis tests, p-values, type I/II errors, and statistical power — and the critical distinctions between statistical significance, clinical relevance, and the particular failure modes of inference in large observational databases.",
    "long_description": "**What statistical inference is, and what it is not**\n\nStatistical inference is the formal process of using data from a **sample** to reason about\nan unobserved **population**. A sample is whatever we can actually see; the population is every\nperson, fill, or episode we would like to say something about. Because we cannot observe\neveryone, every estimate we produce carries **sampling variability** — the fact that a different\nsample from the same population would give a slightly different answer. Inferential methods\nquantify that variability so that consumers of evidence can judge how much weight to place on a\nnumber. Crucially, this machinery assumes the only source of error is sampling; in observational\ndata, **bias** — confounding, selection, measurement error — usually dominates, a point so\nimportant for RWE that it is returned to throughout this entry.\n\n**Point estimates and the standard error vs standard deviation distinction**\n\nA **point estimate** is the single best-guess value for a quantity of interest computed from the\ndata: a mean, a proportion, a risk difference, a hazard ratio. It summarises the sample in one\nnumber, but says nothing by itself about uncertainty. Uncertainty is quantified by the\n**standard error (SE)**, which is the standard deviation of the point estimate's sampling\ndistribution — in other words, how spread out the estimate would be if we repeated the study\nmany times. 
The SE is not the same as the standard deviation (SD) of the raw data, and\nconfusing the two is one of the most common errors in applied work.\n\nThe **standard deviation** describes variability *between individual observations* in the\nsample — it answers \"how different are patients from one another?\" The **standard error**\ndescribes variability of the *summary statistic* — it answers \"how different would our mean\nbe if we repeated the study?\" For a sample mean the relationship is SE = SD / √n, which makes\nexplicit that the SE shrinks as n grows while the SD does not: with more data the *estimate*\nbecomes more precise even though patients remain just as heterogeneous as before. The\ndistinction matters every time a CI or test statistic is built: the SE is what\ngoes in the denominator, not the SD.\n\n**Confidence intervals — correct interpretation and the practical RWE reading**\n\nA **95% confidence interval (CI)** is a range constructed by a procedure that, over many\nrepeated samples, would contain the true parameter value in 95% of cases. This is the\nfrequentist *long-run coverage* interpretation. The correct reading: \"the data are compatible\nwith any value inside this interval under the assumptions of the model.\" The common\nmisreading: \"there is a 95% probability the true value is inside this interval\" — that\nstatement assigns probability to the parameter, which in frequentist statistics is a fixed\n(if unknown) constant, not a random variable.\n\nFor RWE reviewers, the practical reading is: the CI is a **range of effect sizes compatible\nwith the data**; any value inside it cannot be ruled out on the evidence alone. A wide CI says\nthe study is imprecise: it cannot distinguish a negligible effect from a large one. A narrow CI is\ninformative whether or not it excludes the null, because it rules out every effect outside a small range. 
The **width of the CI** is driven by two factors: (1) sample size — larger samples\nshrink the SE and thereby narrow the interval; and (2) the variance in the outcome — noisier\ndata widen the interval regardless of n. In claims databases, n can reach millions, so CIs can\nbecome narrow enough to declare statistical significance for effects so small they have no\nclinical meaning whatsoever.\n\n**Hypothesis testing machinery**\n\nThe Neyman–Pearson framework formalises a decision: either reject a **null hypothesis** H₀ or\nfail to reject it. The null is typically \"no effect\" (difference = 0, ratio = 1). The\n**alternative hypothesis** H₁ specifies the direction or range of effects considered\nmeaningful. The analyst computes a **test statistic** — a number that measures how far the\nobserved data are from what the null predicts, scaled by the SE: z = (estimate − null value) / SE.\nLarge absolute values of z are unlikely under H₀.\n\nThe **p-value** is the probability of observing a test statistic at least as extreme as the one\ncomputed, *if the null hypothesis were true and all model assumptions held*. A small p-value\nmeans the data would be surprising if the null were true; it does **not** mean:\n\n- the null is false (it is not the probability that H₀ is true),\n- the effect is large or clinically important,\n- the result will replicate,\n- the analysis is free of bias, or\n- any particular model assumption is satisfied.\n\nThe threshold **α = 0.05** is a convention borrowed from mid-twentieth-century experimental\nscience, not a law of nature. 
A p-value of 0.049 and a p-value of 0.051 are not meaningfully\ndifferent; treating the threshold as a hard gate between \"significant\" and \"not significant\" is called\n**dichotomania** and is a primary source of irreproducible findings.\n\n**Type I and Type II errors, power, and multiplicity**\n\nA **Type I error** (false positive, α) occurs when we reject a null hypothesis that is actually\ntrue: we declare an effect when there is none. The conventional α = 0.05 means we accept a 5%\nchance of this error in a single test when the null is true. A **Type II error** (false negative, β) occurs when we\nfail to reject a null that is false: we miss a real effect. **Statistical power** (1 − β) is\nthe probability of correctly detecting a real effect of a specified size; 80% or 90% power is\nthe conventional target, meaning we accept a 10–20% chance of missing the effect.\n\n**Multiplicity** is the inflation of the Type I error rate when many hypotheses are tested\nsimultaneously. With 20 independent tests of true nulls at α = 0.05, we expect one false positive by chance\nalone, and the probability of at least one is 1 − 0.95²⁰ ≈ 64%. In pharmacoepidemiological safety surveillance, where dozens of outcomes are screened\nacross many drugs, multiplicity corrections (Bonferroni, Benjamini–Hochberg false discovery\nrate) or sequential probability ratio methods (maxSPRT) are essential. Pre-specification of\nthe primary hypothesis and secondary hypotheses is the cleanest guard.\n\n**Statistical significance is not clinical relevance — the large-database RWE trap**\n\nThis distinction is perhaps the most important applied lesson in RWE statistics. In a database\nwith n = 2,000,000 patients split evenly between two arms, a true mean difference of 0.01 standard\ndeviations (clinically negligible) has SE ≈ 0.0014 standard deviations and z ≈ 7, which will\nproduce a p-value far below 0.001 and a 95% CI that excludes zero by a comfortable margin.\nThe result is \"statistically significant\" in the strict sense — unlikely under the null — but\ncompletely unimportant for clinical or policy decisions. 
Conversely, a study that is\nunderpowered may miss a clinically meaningful effect. **Effect size plus CI width is always what\nmatters; the p-value gauges how compatible the data are with the null, not how important the effect is.**\n\nThe ASA's 2016 statement (Wasserstein & Lazar) and its 2019 follow-up formalise this principle:\ndo not use p < 0.05 as the sole arbiter of evidence; report and interpret the effect size and\nits CI; and consider the entire distribution of compatible effects, not just whether the null is\nexcluded. The estimation-first culture now favoured by major journals and regulatory bodies\n(FDA, EMA) pre-specifies estimands and target effect measures, then reports them with intervals,\nrather than framing the study as a significance test.\n\n**Pros, cons, and trade-offs**\n\n- **Confidence intervals vs p-values only.** CIs carry the same reject/fail-to-reject information as the test (null\n  value outside the interval = significant at the matching α) but add the effect size and the range of compatible\n  values, which a bare p-value hides. There is no disadvantage to reporting CIs; there is a\n  large disadvantage to reporting only the p-value. **Prefer CIs** as the primary reporting\n  form for all RWE; always accompany them with the point estimate and its units.\n- **Two-sided vs one-sided tests.** A two-sided test asks whether the effect differs from the\n  null in either direction (α = 0.05 split as 2.5% per tail); a one-sided test specifies the\n  direction in advance and uses the full 5% in one tail, giving a lower critical z (1.645 vs\n  1.96). One-sided tests are only justified when a result in the opposite direction would have\n  no consequences; regulatory submissions almost always require two-sided tests.\n- **Frequentist vs Bayesian inference.** Frequentist inference (CIs, p-values) reports what the\n  data are compatible with under repeated sampling; it does not produce a probability statement\n  about the parameter. 
Bayesian inference produces a posterior probability distribution over the\n  parameter given the data and a prior, and is natural for decision-analytic settings (HTA,\n  value of information). Frequentist methods dominate regulatory RWE submissions; Bayesian\n  approaches are growing in HTA and adaptive trial settings. The foundations described here are\n  the frequentist primitives that underlie nearly all RWE analysis pipelines.\n- **Unadjusted vs model-based inference.** A two-sample t-test is valid only if the groups are\n  exchangeable (no confounding). In observational data they almost never are, so the raw\n  inferential machinery must be embedded in a regression or weighting framework that accounts\n  for confounders before the SE and CI are meaningful. A p-value from an unadjusted comparison\n  in an unbalanced observational cohort is not a valid test of the exposure–outcome null.\n\n**When to use**\n\nInferential statistics foundations apply to every quantitative analysis that goes beyond\ndescribing the sample at hand: any study reporting a treatment effect, a rate, a risk, or a\ncomparison across groups needs point estimates, SEs, and CIs. In RWE specifically:\ncomparative effectiveness and safety studies; descriptive analyses where rates will be compared\nor transported to another population; feasibility and power calculations (always using the\nestimation/precision framing in addition to the power framing); and any primary analysis\nfeeding an HTA submission or regulatory review. Hypothesis testing in its strict sense is\nappropriate when a go/no-go decision is pre-specified (a non-inferiority margin, a safety\nrule-out threshold). 
The estimation frame (point estimate + CI) is appropriate for everything\nelse.\n\n**When NOT to use — and when it is actively misleading or dangerous**\n\n- **p-values as variable selection or balance checks.** Selecting confounders to include in a\n  model because they have p < 0.05 in a univariate screen (or because a balance test after\n  propensity weighting is \"significant\") is a methodological error: it conflates statistical\n  significance with confounding importance, biases the resulting estimates, and has no\n  theoretical justification. Use subject-matter knowledge and standardised mean differences for\n  balance assessment, not hypothesis tests.\n- **Significance chasing and dichotomania.** Stopping an analysis when p < 0.05 and reporting\n  only those outcomes with significant results — or adjusting analytical choices until\n  significance is achieved — is p-hacking, which inflates the false-positive rate beyond α\n  regardless of what the individual test threshold is.\n- **Post-hoc power calculations.** Computing power from the observed effect size after a null\n  result has been obtained is circular: observed power is a deterministic function of the\n  observed p-value and tells the reader nothing beyond what the p-value already communicated.\n  After the study is complete, report the observed effect size, its CI, and the smallest\n  effect the study could have detected at the target power — not the \"observed power.\"\n- **Treating statistical inference as a substitute for bias analysis.** A tight CI around a\n  biased estimate is precisely wrong — more data makes the wrong answer more certain. In\n  observational databases, confounding, selection bias, and measurement error often dominate\n  sampling variability, especially for large n. 
A p-value below any threshold cannot rescue\n  a design that fails to control for confounding; and a significant result in a large\n  administrative database says only that sampling error can be ruled out, not that the effect\n  is causal. This is the bridge to the catalog's confounding-control and bias-analysis entries.\n\n**Data-source operational depth**\n\n- **Claims (FFS commercial / Medicare FFS):** With millions of person-rows, nearly any\n  nonzero effect will be \"significant\" — the practical question in claims analysis is always\n  effect size and CI width, not p-value. The effective sample size for adjusted analyses is\n  far smaller than the nominal n because of continuous-enrollment requirements, washout erosion,\n  and IPTW design effects (effective n ≈ nominal n / (1 + CV²(weights))). Report the effective\n  n alongside the point estimate and CI. In Medicare FFS, differential administrative censoring\n  by exposure arm — from higher death rates in frailer patients or from Medicare Advantage\n  leakage — means the standard survival-analysis SE assumptions may be violated; the SE and CI\n  must account for this.\n- **EHR:** Visit-driven capture creates informative missingness that the standard complete-case\n  SE does not account for. Multiple imputation or inverse-probability-of-observation weighting\n  is required before the SE is valid. Clustering within sites (multi-site EHR networks) inflates\n  the SE; a naive pooled-dataset SE that ignores site clustering understates uncertainty.\n- **Registry:** Typically smaller and more selective than claims; the precision framing (CI\n  half-width) is usually the binding constraint rather than hypothesis-test power. 
Non-random\n  enrollment in registries means the SE addresses only sampling variability within the enrolled\n  group, not representativeness of the broader population — a narrow CI in a highly selected\n  registry may not transport.\n- **Linked claims–EHR–vital records:** Linkage selection (only the linkable subset is analysed)\n  changes the population to which the SE and CI apply. The SE is correct for the linked subset;\n  transportability to the unlinked remainder requires additional assumptions.\n\n**Interpreting the output**\n\nConsider a two-arm study comparing antihypertensive regimens in a claims-based cohort: 10 patients per arm,\nobserved mean SBP reductions of 12.0 mmHg (Arm A) and 8.0 mmHg (Arm B). The analysis returns: mean\ndifference = 4.0 mmHg, SE = 1.0, 95% CI [2.04, 5.96], z = 4.0, p < 0.001. The interval is the\nnormal-approximation form, 4.0 ± 1.96 × 1.0; with only 10 patients per arm, a t-based interval (18 degrees\nof freedom, critical value 2.101) would be slightly wider, [1.90, 6.10].\n\n*(1) Formal statistical interpretation.* The point estimate of 4.0 mmHg is the observed difference in mean\nSBP reduction. The 95% CI [2.04, 5.96] is produced by a procedure that, if the study were repeated under\nidentical conditions many times, would contain the true mean difference in approximately 95% of those\nreplications; values of the true difference between 2.04 and 5.96 mmHg are compatible with the observed\ndata at the 5% significance level. The p-value < 0.001 is the probability — under the null hypothesis that\nthe true difference is exactly zero — of observing a difference at least as large as 4.0 mmHg in absolute\nvalue; it is not the probability that the null hypothesis is true.\n\n*(2) Practical interpretation for a decision-maker.* Arm A reduced systolic blood pressure by roughly 4 more\nmillimeters of mercury than Arm B, and the entire plausible range (2.0–6.0 mmHg) falls on the side favoring\nArm A. 
Whether a 4 mmHg difference crosses the threshold of clinical importance depends on each patient's\nbaseline risk, comorbidities, and tolerability — statistical significance alone does not establish that the\ndifference is large enough to change treatment decisions.",
    "primary_category": "Inferential_Statistics",
    "tags": [
      "statistics",
      "primitive",
      "foundations",
      "inferential-statistics",
      "confidence-interval",
      "p-value",
      "hypothesis-testing",
      "standard-error",
      "type-i-error",
      "type-ii-error",
      "statistical-power",
      "effect-size",
      "significance",
      "estimation",
      "rwe"
    ],
    "applies_to_study_types": [
      "new_user",
      "active_comparator_new_user",
      "cohort_retrospective",
      "claims_analysis",
      "ehr_study",
      "registry_study",
      "target_trial_emulation",
      "systematic_review"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "primary",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1080/00031305.2016.1154108",
        "url": "https://doi.org/10.1080/00031305.2016.1154108",
        "citation_text": "Wasserstein RL, Lazar NA. The ASA Statement on p-Values: Context, Process, and Purpose. The American Statistician. 2016;70(2):129-133.",
        "year": 2016,
        "authors_short": "Wasserstein & Lazar",
        "notes": "The American Statistical Association's formal statement on p-values — defines what the p-value is, catalogs its six-principle misuse taxonomy, and grounds the estimation-first reporting culture now required in RWE protocols and regulatory submissions."
      },
      {
        "role": "explain",
        "doi": "10.1007/s10654-016-0149-3",
        "url": "https://doi.org/10.1007/s10654-016-0149-3",
        "citation_text": "Greenland S, Senn SJ, Rothman KJ, et al. Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations. European Journal of Epidemiology. 2016;31(4):337-350.",
        "year": 2016,
        "authors_short": "Greenland et al.",
        "notes": "Canonical reference for the 25 most common misinterpretations of p-values, CIs, and power in epidemiological research; directly applicable to RWE reviewers encountering significance claims in claims-database studies."
      },
      {
        "role": "demonstrate",
        "doi": "10.1038/d41586-019-00857-9",
        "url": "https://doi.org/10.1038/d41586-019-00857-9",
        "citation_text": "Amrhein V, Greenland S, McShane B. Scientists rise up against statistical significance. Nature. 2019;567(7748):305-307.",
        "year": 2019,
        "authors_short": "Amrhein et al.",
        "notes": "Nature comment signed by over 800 scientists calling for abandonment of statistical-significance dichotomy in favour of effect sizes and CIs; captures the estimation-first shift directly relevant to RWE reporting standards."
      },
      {
        "role": "use",
        "doi": "10.1080/00031305.2019.1583913",
        "url": "https://doi.org/10.1080/00031305.2019.1583913",
        "citation_text": "Wasserstein RL, Schirm AL, Lazar NA. Moving to a World Beyond 'p < 0.05'. The American Statistician. 2019;73(sup1):1-19.",
        "year": 2019,
        "authors_short": "Wasserstein et al.",
        "notes": "The ASA's 2019 follow-up elaborating the ATOM principles (Accept uncertainty, be Thoughtful, Open, Modest) and providing guidance on moving inference practice toward estimation, interval reporting, and pre-specified estimands — directly operational for RWE study protocols."
      }
    ],
    "plain_language_summary": "Inferential statistics is the toolkit for moving from data we can see — a sample of patients — to conclusions about the broader world we cannot see, the whole population. It works by measuring how much a summary number (like a mean difference between two treatments) would bounce around if the study were repeated many times, and using that bounciness to build a range of plausible values called a confidence interval. In large real-world databases with millions of patients, this toolkit can make even tiny, clinically meaningless differences look \"statistically significant,\" so the effect size and confidence interval always matter more than the p-value alone.",
    "key_terms": [
      {
        "term": "population vs sample",
        "definition": "The population is every patient (or event) the study aims to say something about; the sample is the subset actually observed — inference is the act of reasoning from the second to the first."
      },
      {
        "term": "standard error",
        "definition": "How much the point estimate (like a mean difference) would vary across repeated studies of the same size — it shrinks as sample size grows, unlike the standard deviation of individual patient values, which stays roughly constant."
      },
      {
        "term": "confidence interval",
        "definition": "A range of effect sizes that a study's data are compatible with; a 95% CI is built by a procedure that would include the true value in 95% of repeated experiments, not a probability statement that the true value is inside this specific interval."
      },
      {
        "term": "p-value",
        "definition": "The probability of seeing data at least as extreme as what was observed if the null hypothesis (no effect) were true — a small p-value means the data are surprising under the null, but does not measure the size of the effect or the probability that the null is true."
      },
      {
        "term": "null hypothesis",
        "definition": "The default assumption that there is no effect or no difference between groups — hypothesis testing asks whether the data are inconsistent enough with this assumption to warrant rejecting it."
      },
      {
        "term": "statistical power",
        "definition": "The probability that a study will correctly detect a real effect of a given size; low power means a study might miss a true effect, producing a false-negative result."
      }
    ],
    "worked_example": {
      "scenario": "A hospital quality team compares systolic blood pressure (SBP, in mmHg) reductions over 12 weeks for two antihypertensive drugs — Drug A (n = 10 patients) and Drug B (n = 10 patients). Drug A produces a mean reduction of 12 mmHg; Drug B produces a mean reduction of 8 mmHg. In both groups the variance of individual reductions is 5 mmHg². The team wants the point estimate of the difference, the standard error, a 95% confidence interval, and a test statistic to decide whether the difference is statistically significant.\n",
      "dataset": {
        "caption": "Summary statistics for a two-group blood-pressure reduction study (10 patients per arm).",
        "columns": [
          "group",
          "n",
          "mean_reduction_mmHg",
          "variance_mmHg2"
        ],
        "rows": [
          [
            "Drug A",
            10,
            12.0,
            5.0
          ],
          [
            "Drug B",
            10,
            8.0,
            5.0
          ]
        ]
      },
      "steps": [
        "Point estimate: difference in mean reductions = 12.0 - 8.0 = 4.0 mmHg (Drug A reduces SBP by 4 mmHg more than Drug B on average).",
        "Standard error of the difference: SE = sqrt(variance_A/n_A + variance_B/n_B) = sqrt(5.0/10 + 5.0/10) = sqrt(0.5 + 0.5) = sqrt(1.0) = 1.0 mmHg. Note that this SE is about the estimate (the mean difference), not about individual patients — it tells us how much the 4.0 estimate would bounce if we ran the study again.",
        "95% confidence interval: multiply the SE by the critical value 1.96 to get the margin. Margin = 1.96 * 1.0 = 1.96 mmHg. Lower bound = 4.0 - 1.96 = 2.04 mmHg; upper bound = 4.0 + 1.96 = 5.96 mmHg. Interpretation: the data are compatible with a true Drug A advantage of anywhere from about 2 to 6 mmHg — zero is not inside the interval.",
        "Test statistic: z = (estimate - null_value) / SE = (4.0 - 0) / 1.0 = 4.0 / 1.0 = 4.0. A z of 4.0 is far into the tail of the standard normal; the two-sided p-value is < 0.001.",
        "Clinical interpretation: the 4 mmHg difference is statistically significant (p < 0.001), but whether it is clinically meaningful depends on context — a 4 mmHg difference in a high-risk population may matter; in a low-risk one it may not. With only 10 patients per arm the CI spans 4 mmHg (2.04 to 5.96), which is moderately wide. In a claims database with 50,000 patients per arm, the CI might narrow to 3.8 to 4.2 mmHg — still significant but the same clinical question applies."
      ],
      "result": "Point estimate = 4.0 mmHg; SE = 1.0 mmHg; 95% CI = [2.04, 5.96] mmHg; z = 4.0; p < 0.001. The interval excludes zero, so the result is statistically significant at alpha = 0.05. The effect size (4 mmHg) and CI width (about 4 mmHg) together characterise the finding — the p-value alone does not.",
      "timeline_spec": {
        "title": "Two-arm comparison: point estimate, SE, and 95% CI on the difference in mean SBP reduction",
        "window": {
          "start": "2024-01-01",
          "end": "2024-12-31",
          "label": "12-week follow-up per patient"
        },
        "events": [
          {
            "label": "Drug A mean: 12 mmHg reduction",
            "start": "2024-01-01",
            "length_days": 84,
            "quantity": "n=10"
          },
          {
            "label": "Drug B mean: 8 mmHg reduction",
            "start": "2024-01-01",
            "length_days": 84,
            "quantity": "n=10"
          }
        ],
        "spans": [
          {
            "kind": "covered",
            "start": "2024-01-01",
            "end": "2024-03-25",
            "label": "12-week observation window"
          }
        ],
        "result": {
          "label": "Difference = 4.0 mmHg (95% CI 2.04-5.96)",
          "value": 4.0
        }
      }
    },
    "prerequisites": [
      "descriptive-statistics"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Two-sample mean comparison (z or t, equal variances)",
        "description": "The canonical two-group comparison. When population variances are known or n is large (≥30 per group), use the z-statistic with SE = sqrt(s1²/n1 + s2²/n2). For small samples with unknown variances, use the t-statistic with pooled or Welch's SE. The mechanics are identical; only the reference distribution (standard normal vs t with df degrees of freedom) changes.",
        "edge_cases": [
          "For paired data (same patients measured twice), the SE is computed on the within-person difference, not on each group's mean; using the two-sample SE on paired data wastes power.",
          "When variances are unequal across groups, Welch's t-test (separate variance estimates) is more reliable than the pooled-variance t-test."
        ],
        "data_source_notes": "claims: with large n, z and t are indistinguishable; the binding question is whether the groups are exchangeable (no confounding) before interpreting the CI as a treatment-effect interval."
      },
      {
        "name": "Proportion comparison and risk difference",
        "description": "For binary outcomes (event/no-event), the point estimate is a risk difference (p1 - p2) or risk ratio (p1/p2). The SE of a risk difference is sqrt(p1(1-p1)/n1 + p2(1-p2)/n2). For rare outcomes, the log-risk-ratio or log-odds-ratio scale may be more stable. In large databases the CI on an absolute risk difference is the most useful single number for clinical and payer audiences.",
        "edge_cases": [
          "For very rare outcomes (expected events < 5), normal-approximation CIs are unreliable; exact binomial or Poisson-based CIs are preferred.",
          "Relative measures (RR, OR) compress the scale for common outcomes, making small absolute differences look large; always report the absolute risk difference alongside."
        ],
        "data_source_notes": "claims: baseline risk may differ markedly by arm in unadjusted analyses; the unadjusted risk difference is a valid descriptive statistic but not a causal estimate."
      },
      {
        "name": "Survival / time-to-event inference (log-rank, HR with CI)",
        "description": "For time-to-event outcomes, the log-rank test tests equality of survival curves; the Cox model produces a hazard ratio (HR) with a Wald-based CI. The HR is a ratio of instantaneous event rates; its SE is derived from the observed events (Schoenfeld formula), making the number of events — not patients — the binding currency.",
        "edge_cases": [
          "The log-rank test and Cox HR assume proportional hazards; when curves cross (non-PH), the test is low-powered and the HR is a time-averaged summary that may not represent any clinically meaningful time point.",
          "Administrative censoring that is differential by arm (e.g., one arm has higher disenrollment) violates the non-informative censoring assumption and biases the SE."
        ],
        "data_source_notes": "claims: death is both a competing event and a censoring mechanism; the SE of the cause-specific HR is valid only under non-informative censoring, which requires that mortality rates do not differ by exposure arm after conditioning on confounders."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "marginal-effects-and-interpretation-of-inferential-statistics-rwe",
        "pros_of_this": "Inferential foundations (SE, CI, p-value) are the building blocks; this entry is the prerequisite layer that makes the advanced marginal-effects entry legible.",
        "cons_of_this": "Foundational inference assumes the comparison is valid (no unmeasured confounding); marginal effects via g-computation or IPTW are needed to make a regression estimate estimand-aligned and population-averaged.",
        "when_to_prefer": "Master the foundations first; apply marginal-effects methods when reporting decision-facing RWE where the conditional OR/HR is insufficient."
      },
      {
        "compared_to": "sample-size-power-precision-rwe",
        "pros_of_this": "Foundations explain what power and precision mean conceptually and why they differ; the sibling entry operationalises the calculation for actual RWE protocol planning.",
        "cons_of_this": "This entry does not cover the design-correction factors (IPTW effective n, confounder variance inflation, competing-risk attrition) needed for a defensible RWE power calculation — those live in sample-size-power-precision-rwe.",
        "when_to_prefer": "Use this entry to understand the concepts; use sample-size-power-precision-rwe to build the protocol section."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "In large claims databases (n > 500k) statistical significance is nearly automatic for any nonzero effect; always report the point estimate and CI half-width as the primary evidence. Effective n after IPTW weighting is n / (1 + CV²(weights)) — use this for power and precision statements, not the raw row count.",
      "ehr": "Clustering within sites or providers inflates the SE; use cluster-robust standard errors or mixed models. Missing-data patterns are informative — complete-case SEs understate uncertainty.",
      "registry": "Small n and selective enrollment mean CIs are wide; the precision framing (what half-width is achievable?) is the right planning frame. Non-random enrollment limits transportability regardless of CI width.",
      "linked": "The SE is valid for the linked subset; report that denominator explicitly so reviewers understand the population to which the CI applies."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\nfrom scipy import stats\n\n# --- From summary statistics (known n, mean, variance) ---\nn_a, n_b = 10, 10\nmean_a, mean_b = 12.0, 8.0\nvar_a, var_b = 5.0, 5.0     # variance of individual observations (SD^2), NOT SE^2\n\ndiff = mean_a - mean_b                          # point estimate of mean difference\nse   = np.sqrt(var_a / n_a + var_b / n_b)       # SE of the difference: sqrt(s²/n + s²/n)\n# Note: se != sqrt(var_a) or sqrt(var_b) — SD of the raw data is sqrt(var), SE is SD/sqrt(n)\n\nz    = diff / se                                # z-statistic (large n; use t for small n)\nci_lo = diff - 1.96 * se                        # 95% CI lower bound\nci_hi = diff + 1.96 * se                        # 95% CI upper bound\np_val = 2 * (1 - stats.norm.cdf(abs(z)))        # two-sided p-value\n\nprint(f\"Point estimate : {diff:.2f} mmHg\")\nprint(f\"SD of raw data : {np.sqrt(var_a):.3f} mmHg  (variability between individual patients)\")\nprint(f\"SE of estimate : {se:.3f} mmHg  (variability of the mean difference across repeated studies)\")\nprint(f\"95% CI         : [{ci_lo:.2f}, {ci_hi:.2f}] mmHg\")\nprint(f\"z-statistic    : {z:.2f}\")\nprint(f\"p-value        : {p_val:.4f}\")\n\n# --- From individual-level data (Welch's t-test, recommended when variances may differ) ---\nrng = np.random.default_rng(42)\ngroup_a = rng.normal(loc=mean_a, scale=np.sqrt(var_a), size=n_a)\ngroup_b = rng.normal(loc=mean_b, scale=np.sqrt(var_b), size=n_b)\nt_stat, p_ttest = stats.ttest_ind(group_a, group_b, equal_var=False)  # Welch's t-test\nci_ttest = stats.t.interval(0.95, df=len(group_a)+len(group_b)-2,\n                            loc=group_a.mean()-group_b.mean(),\n                            scale=stats.sem(group_a-group_b))\nprint(f\"\\nWelch t-test on simulated data: t={t_stat:.2f}, p={p_ttest:.4f}\")\nprint(f\"t-test 95% CI: [{ci_ttest[0]:.2f}, {ci_ttest[1]:.2f}]\")",
        "description": "Two-sample comparison end-to-end in Python: compute the point estimate (mean difference),\nstandard error, 95% confidence interval, z-statistic, and two-sided p-value from summary\nstatistics. Also demonstrates the SE vs SD distinction explicitly. Uses scipy.stats for\nthe t-test on individual-level data and shows both approaches so the connection is clear.",
        "dependencies": [
          "numpy",
          "scipy"
        ],
        "source_citations": [
          "wasserstein-2016"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "# --- From summary statistics ---\nn_a <- 10L; n_b <- 10L\nmean_a <- 12.0; mean_b <- 8.0\nvar_a  <- 5.0;  var_b  <- 5.0   # variance of raw observations (SD^2), NOT the SE\n\ndiff   <- mean_a - mean_b                      # point estimate\nse_raw <- sqrt(var_a / n_a + var_b / n_b)      # SE of mean difference: sqrt(s^2/n + s^2/n)\n# SD of raw data vs SE of estimate:\ncat(\"SD of individual observations:\", sqrt(var_a), \"mmHg\\n\")\ncat(\"SE of mean-difference estimate:\", se_raw, \"mmHg  (shrinks with n; SD does not)\\n\")\n\nz      <- diff / se_raw\nci_lo  <- diff - 1.96 * se_raw\nci_hi  <- diff + 1.96 * se_raw\np_z    <- 2 * pnorm(-abs(z))\n\ncat(sprintf(\"Point estimate : %.2f mmHg\\n\", diff))\ncat(sprintf(\"SE             : %.3f mmHg\\n\", se_raw))\ncat(sprintf(\"95%% CI         : [%.2f, %.2f] mmHg\\n\", ci_lo, ci_hi))\ncat(sprintf(\"z-statistic    : %.2f\\n\", z))\ncat(sprintf(\"p-value (z)    : %.4f\\n\\n\", p_z))\n\n# --- From individual-level data using base R t.test ---\nset.seed(42)\ngroup_a <- rnorm(n_a, mean = mean_a, sd = sqrt(var_a))\ngroup_b <- rnorm(n_b, mean = mean_b, sd = sqrt(var_b))\ntt      <- t.test(group_a, group_b, var.equal = FALSE)   # Welch's t-test\ncat(\"t.test result:\\n\"); print(tt)\n\n# --- Large-database demonstration: the 'significant but tiny' RWE trap ---\nn_large  <- 500000L                     # half a million patients per arm (realistic claims n)\ndiff_tiny <- 0.05                       # 0.05 mmHg true difference — clinically negligible\nse_large  <- sqrt(var_a / n_large + var_b / n_large)\nz_large   <- diff_tiny / se_large\np_large   <- 2 * pnorm(-abs(z_large))\nci_large  <- c(diff_tiny - 1.96 * se_large, diff_tiny + 1.96 * se_large)\ncat(sprintf(\"\\n[Large-n demo] Diff = %.3f mmHg, SE = %.5f, z = %.1f, p = %.2e\\n\",\n            diff_tiny, se_large, z_large, p_large))\ncat(sprintf(\"95%% CI = [%.4f, %.4f] mmHg  -> significant but trivially small\\n\",\n            ci_large[1], 
ci_large[2]))\ncat(\"Lesson: in claims databases p < 0.05 is nearly automatic; report effect size + CI.\\n\")",
        "description": "Two-sample comparison in R: point estimate, SE (and the SD vs SE distinction), 95% CI,\nt-statistic, and p-value using base R t.test. Also illustrates how a large-n database\nproduces a significant p-value for a clinically tiny effect, demonstrating why effect size\nand CI width are the primary evidence metrics in RWE.",
        "dependencies": [
          "stats"
        ],
        "source_citations": [
          "wasserstein-2016",
          "greenland-2016"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* --- From summary statistics (DATA step) --- */\ndata summary_calc;\n  n_a = 10; n_b = 10;\n  mean_a = 12.0; mean_b = 8.0;\n  var_a  = 5.0;  var_b  = 5.0;   /* variance of raw observations, NOT the SE */\n\n  diff   = mean_a - mean_b;                   /* point estimate = 4.0 */\n  sd_raw = sqrt(var_a);                        /* SD of raw data: 2.236 mmHg */\n  se     = sqrt(var_a/n_a + var_b/n_b);        /* SE of estimate: sqrt(0.5+0.5)=1.0 mmHg */\n  z      = diff / se;                           /* z = 4.0 / 1.0 = 4.0 */\n  ci_lo  = diff - 1.96 * se;                   /* 95% CI lower = 4.0 - 1.96 = 2.04 */\n  ci_hi  = diff + 1.96 * se;                   /* 95% CI upper = 4.0 + 1.96 = 5.96 */\n  p_val  = 2 * (1 - probnorm(abs(z)));          /* two-sided p-value */\n\n  put \"Point estimate = \" diff \" mmHg\";\n  put \"SD (raw data)  = \" sd_raw \" mmHg  [between-patient variability; does NOT shrink with n]\";\n  put \"SE (estimate)  = \" se    \" mmHg  [variability of the mean diff; shrinks with n]\";\n  put \"95% CI         = [\" ci_lo \", \" ci_hi \"] mmHg\";\n  put \"z-statistic    = \" z;\n  put \"p-value        = \" p_val;\nrun;\n\n/* --- From individual-level data using PROC TTEST (Satterthwaite / Welch default) --- */\n/* Simulate individual-level data for the example */\ndata indiv;\n  call streaminit(42);\n  do i = 1 to 10; sbp_reduction = rand('Normal', 12.0, sqrt(5.0)); group = 'A'; output; end;\n  do i = 1 to 10; sbp_reduction = rand('Normal',  8.0, sqrt(5.0)); group = 'B'; output; end;\nrun;\n\nproc ttest data=indiv;\n  class group;\n  var sbp_reduction;\n  /* Default: Satterthwaite (Welch) t-test; use pooled option for equal-variance assumption */\nrun;\n\n/* --- Large-n demonstration: claims-database 'significant but tiny' trap --- */\n%macro large_n_demo(n=500000, diff_tiny=0.05, var=5.0);\n  data _null_;\n    n         = &n;\n    diff      = &diff_tiny;\n    se_large  = sqrt(2 * &var / n);   /* SE with equal n and equal 
variance */\n    z_large   = diff / se_large;\n    p_large   = 2 * (1 - probnorm(abs(z_large)));\n    ci_lo     = diff - 1.96 * se_large;\n    ci_hi     = diff + 1.96 * se_large;\n    put \"=== Claims-database large-n demo (n=\" n \") ===\";\n    put \"True diff = \" diff \" mmHg (clinically negligible)\";\n    put \"SE        = \" se_large;\n    put \"z         = \" z_large;\n    put \"p-value   = \" p_large;\n    put \"95% CI    = [\" ci_lo \",\" ci_hi \"] -> p<0.05 but effect is trivially small\";\n    put \"Lesson: report effect size + CI width, not just p-value.\";\n  run;\n%mend large_n_demo;\n%large_n_demo;",
        "description": "Two-sample comparison in SAS using PROC TTEST (individual-level data) and a DATA step\ncalculation from summary statistics, illustrating the SE vs SD distinction and producing\nthe 95% CI, t-statistic, and p-value. A macro demonstrates the large-n trap relevant to\nclaims databases.",
        "dependencies": [],
        "source_citations": [
          "wasserstein-2016"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  S[Sample data<br/>n observations] --> PE[Point estimate<br/>e.g. mean difference = 4.0 mmHg]\n  S --> SE[Standard error<br/>SE = sqrt&#40;s²/n_A + s²/n_B&#41; = 1.0]\n  PE --> CI[95% Confidence interval<br/>estimate ± 1.96 × SE = 2.04 to 5.96]\n  PE --> TS[Test statistic<br/>z = estimate / SE = 4.0]\n  TS --> PV[p-value<br/>P&#40;|z| ≥ 4.0 | H₀ true&#41; < 0.001]\n  CI --> INT[Interpretation<br/>range of effects compatible with data<br/>zero is not inside → reject H₀]\n  PV --> INT\n  INT --> ES[Effect size + CI width<br/>is what matters for RWE decisions<br/>not p-value alone]",
        "caption": "The inferential statistics pipeline from sample data to point estimate, SE, CI, test statistic, and p-value. In large claims databases (n > 500,000), the SE is tiny and p-values are near zero for any nonzero effect — the CI width and effect size are the primary evidence metrics.",
        "alt_text": "Flowchart from sample data through point estimate and standard error to confidence interval and test statistic, then to p-value and a final emphasis that effect size and CI width drive RWE decisions.",
        "source_type": "illustrative",
        "source_citations": [
          "wasserstein-2016"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  T[True state of the world]\n  T -->|H₀ is TRUE| A[Reject H₀<br/>Type I error α<br/>false positive]\n  T -->|H₀ is TRUE| B[Fail to reject H₀<br/>Correct decision<br/>1 - α]\n  T -->|H₀ is FALSE| C[Reject H₀<br/>Correct decision<br/>Power = 1 - β]\n  T -->|H₀ is FALSE| D[Fail to reject H₀<br/>Type II error β<br/>false negative]",
        "caption": "The four outcomes of a hypothesis test. Power (1 − β) is the probability of detecting a real effect; in large observational databases power is rarely the binding constraint — the concern is Type I errors inflated by multiplicity and the presence of bias rather than sampling error.",
        "alt_text": "Decision matrix showing the four combinations of true state (null true or false) and test decision (reject or not), labeling Type I error alpha, correct null retention at 1-alpha, power 1-beta, and Type II error beta.",
        "source_type": "illustrative",
        "source_citations": [
          "greenland-2016"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "requires",
        "target_slug": "descriptive-statistics",
        "notes": "Descriptive statistics (mean, variance, SD, distribution shape) are the inputs to every inferential calculation; the SE is SD / sqrt(n), and understanding the distinction between SD and SE requires first having a firm grasp of SD from the descriptive layer."
      },
      {
        "relation_type": "see_also",
        "target_slug": "parametric-vs-nonparametric-tests",
        "notes": "Parametric tests (t-test, z-test) rest on the distributional assumptions summarised here; non-parametric alternatives make fewer assumptions about the underlying distribution at the cost of some power — choose based on sample size and distributional plausibility."
      },
      {
        "relation_type": "produces",
        "target_slug": "marginal-effects-and-interpretation-of-inferential-statistics-rwe",
        "notes": "Inferential foundations (SE, CI, p-value) are what the marginal-effects entry builds on; understanding that the CI around an AME must be computed by bootstrap rather than the model's printed SE requires first understanding what an SE is and why it matters."
      },
      {
        "relation_type": "used_with",
        "target_slug": "estimand-analysis-traceability-rwe",
        "notes": "Pre-specifying the estimand (target quantity, population, effect measure) before analysis is the RWE analogue of pre-registering a null hypothesis — it prevents the multiplicity inflation and p-hacking that unspecified inferential targets invite."
      },
      {
        "relation_type": "see_also",
        "target_slug": "sample-size-power-precision-rwe",
        "notes": "Sample-size and power calculations operationalise the type I/II error and precision concepts introduced here into feasibility statements for RWE protocols, including the design-correction factors (IPTW effective n, competing risks) that inflate required events beyond the naive formula."
      },
      {
        "relation_type": "see_also",
        "target_slug": "comparative-effectiveness-research-cer-methods",
        "notes": "Every CER analysis reports an effect estimate with a CI; the inferential foundations here underpin the correct interpretation of those estimates and the distinction between statistical significance and clinical relevance that is central to CER reporting."
      },
      {
        "relation_type": "see_also",
        "target_slug": "estimands-ate-att-intercurrent-events-rwe",
        "notes": "The estimand framework specifies what effect the CI is an interval for; a CI is only interpretable relative to a clearly pre-specified estimand and its population."
      }
    ],
    "aliases": [
      "hypothesis testing",
      "confidence interval",
      "p-value",
      "statistical significance",
      "standard error",
      "type I error",
      "statistical power",
      "point estimate",
      "null hypothesis",
      "test statistic",
      "alpha level",
      "beta error",
      "two-sided test"
    ],
    "complexity": "foundational",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "infused-biologic-administration-capture-rwe",
    "name": "Infused Biologic Administration Capture",
    "short_definition": "An exposure-definition method that ascertains infused (physician-administered, buy-and-bill) biologic use from medical-claim HCPCS J-codes/Q-codes and administration CPT codes on the administration date, then derives exposure intervals from the label dosing schedule rather than from a pharmacy days_supply.",
    "long_description": "**Infused biologic administration capture** is the operational problem of measuring exposure to physician-administered\nbiologics — infliximab, rituximab, vedolizumab, abatacept IV, tocilizumab IV, ocrelizumab, natalizumab — in real-world\ndata. These drugs are almost never dispensed through the pharmacy benefit. Under the U.S. \"buy-and-bill\" model they are\nacquired by the provider, administered in an infusion suite, hospital outpatient department (HOPD), or home, and billed to\nthe **medical** benefit (Medicare Part B, or the commercial medical claim) as a HCPCS Level II **J-code** for the drug\n(e.g., J1745 infliximab, J3262 tocilizumab IV, J0129 abatacept IV, J3380 vedolizumab) plus an **administration CPT code**\n(96365/96366 IV infusion, 96413/96415 chemotherapy-style infusion, 96401 SC). The unit of capture is therefore a\n*medical-claim line on the administration date*, not a pharmacy fill — which makes most oral-drug exposure machinery\n(days_supply stitching, refill-gap rules, drug-era logic keyed on dispensing) the wrong default.\n\n**Core conceptual distinction.** An infused biologic exposure has no `days_supply`; the duration of one administration's\ncoverage is fixed by the *product label's dosing interval*, not by a pharmacist-entered field. Infliximab loads at weeks\n0, 2, 6 then maintains q8w; rituximab gives 1000 mg ×2 separated by 14 days then re-treats q24w; vedolizumab loads 0/2/6\nthen q8w. The correct exposure interval is built as `next_expected_date = administration_date + label_interval_days`, with\na grace period for real-world timing variability, and a *gap* is declared when the observed inter-administration interval\nexceeds roughly 1.5× the label interval — that is the operational definition of discontinuation for an infused agent. 
This\nis the conceptual opposite of `persistence-time-to-discontinuation` for oral drugs, which keys off the end of `days_supply`.\nThe dose actually delivered comes from the **units billed on the J-code** (each J-code defines a unit of milligrams, e.g.,\nJ1745 = 10 mg infliximab), which must be read together with the **JW/JZ discarded-drug modifiers** and weight-based dosing,\nbecause naively summing units across lines double-counts wastage and mis-scales weight-based regimens.\n\n**Pros, cons, and trade-offs.**\n- **vs a pharmacy-fill / Part D-only exposure definition:** Capturing J-codes on the medical claim is the *only* way to\n  see infused biologics at all — a Part D / NDC-only algorithm has ~0% sensitivity for buy-and-bill agents and will silently\n  drop the entire infused arm, producing a cohort that looks like a self-injectable-only population. Cost: medical claims\n  lack `days_supply`, require label-driven interval logic, and force you to reconcile drug J-codes with separate\n  administration CPT codes. **Prefer J-code capture** for any analysis of IV/infused biologics; reserve NDC/Part D logic for\n  self-injected or white-bagged product that flows through the pharmacy benefit.\n- **vs an \"ever exposed\" J-code flag (presence only):** A binary flag is robust and easy, but it discards dose, schedule,\n  persistence, and time-varying exposure. Building intervals from the label schedule supports `time-updated-exposures-cumulative-dose-rwe`,\n  on-treatment risk windows, and per-protocol estimands. Cost: interval construction is sensitive to grace-period and\n  gap-multiplier choices, which must be pre-specified and varied in sensitivity analyses.\n- **vs treating each infusion as an independent point exposure:** Point exposures are simplest but cannot represent\n  continuous on-treatment time between scheduled infusions, so they understate exposed person-time and mishandle the loading\n  phase. 
**Prefer interval construction** when the estimand is a rate or a hazard over on-treatment time.\n\n**When to use.** Comparative effectiveness, safety, persistence, dose, or cost analyses of infused biologics in claims,\nEHR medication-administration records (MAR), or linked data; any study where the exposure or comparator is a buy-and-bill agent;\nbuilding the exposure spine of a target-trial emulation whose strategies are infused regimens. It is mandatory whenever an\ninfused agent appears on *either* side of a comparison, including originator-vs-biosimilar and IV-vs-SC formulation studies.\n\n**When NOT to use, and when it is actively misleading or dangerous.**\n- **Self-injectable-only comparisons** (e.g., adalimumab pen vs etanercept syringe) belong to NDC/Part D fill logic; forcing\n  J-code logic there finds nothing and the cohort collapses.\n- **Pure inpatient bundled administrations.** Inpatient biologic doses are usually rolled into a DRG and *not* separately\n  billed as a J-code, so an outpatient J-code algorithm misses inpatient bridging doses (see `inpatient-bridging-exposure-rwe`).\n  Declaring a \"treatment gap\" across a hospitalization when the drug was given inpatient is a false discontinuation and is\n  actively misleading for persistence and immortal-time analyses.\n- **Biosimilar-era exposure without code reconciliation.** Infliximab biosimilars carry distinct Q-codes (Q5103 infliximab-dyyb,\n  Q5104 infliximab-abda, Q5121 infliximab-axxq). A code list built before a biosimilar launch will register apparent\n  discontinuation at the originator→biosimilar switch date. 
Decide *by design* whether to pool originator+biosimilars into one\n  molecule or analyze them as distinct exposures, and document the switch handling.\n- **Naive unit summation for dose.** Summing J-code units without removing JW (discarded) lines and without weight scaling\n  inflates cumulative-dose estimates; for a renal/oncology outcome modified by dose this can manufacture a spurious\n  dose-response.\n\n**Data-source operational depth.**\n- **Claims (FFS Medicare or commercial):** Exposure = medical-claim line with a drug J-code/Q-code on `admin_date`, ideally\n  co-occurring with an administration CPT and a plausible place-of-service (POS 11 office, 19/22 HOPD, 12 home). Require\n  continuous **medical** enrollment (Part B, not just Part D) across baseline and follow-up so absence of a J-code is a real\n  no-treatment period, not unobserved benefit. *Failure modes:* (1) **MA-only person-time** lacks complete FFS encounter/claim\n  submission — \"no infusion\" can be missingness; restrict to FFS Parts A/B (and D for any white-bagged product) or to commercial\n  plans with complete medical capture. (2) **JW/JZ discarded-drug modifiers** create extra lines that double-count dose if not\n  netted out. (3) **Claims adjudication lag and reversals** make the most recent quarters look like false discontinuation —\n  impose a data-maturity buffer. (4) **Differential competing risks** (death before a scheduled infusion) in elderly claims can\n  masquerade as discontinuation unless death/disenrollment are handled as censoring/competing events.\n- **EHR (MAR / order data):** The administration record (MAR `administered` event), not the order, is the exposure; orders\n  without a matching administration over-capture intended-but-not-given doses. 
EHR adds weight, BSA, lab severity, and the\n  actual milligrams hung — superior dose fidelity — but **external-care leakage** (an infusion given at an outside center)\n  is invisible, biasing persistence downward for mobile patients. Link to claims to recover out-of-system infusions.\n- **Registry:** Strong for indication, disease activity, and adjudicated outcomes; typically weak/incomplete for every\n  administration date. Link to claims for the full infusion history and to a death index for censoring.\n- **Linked claims–EHR–registry:** The ideal substrate (claims completeness + EHR dose fidelity + registry severity) but\n  introduces linkage selection and **date-discrepancy** problems (order date vs MAR date vs claim service date vs claim paid\n  date) that must be reconciled to a single `admin_date` before interval construction.\n\n**Worked claims example.** Question: 12-month persistence on infliximab (originator + biosimilars) among adults with Crohn's\ndisease initiating IV induction in a commercial + Medicare FFS database. (1) Build the drug code list: J1745 (originator) plus\nQ5103/Q5104/Q5121 (biosimilars), each defined in 10-mg units; pool them into a single `infliximab` molecule. (2) Keep medical\nclaim lines with those codes co-occurring with an administration CPT (96365/96413) and drop lines carrying the **JW** modifier\nbefore counting units. (3) Require ≥365 days continuous **medical** enrollment (FFS Parts A/B or commercial medical) before and\nafter the first administration; exclude MA-only person-time so an absent J-code means no infusion, not missing data. (4) Index =\nfirst qualifying administration date. 
(5) Reconstruct the schedule: loading at weeks 0, 2, 6, then maintenance q8w; for each\nadministration set `next_expected_date = admin_date + (14 or 56 days)` per loading/maintenance phase, add a 28-day grace period,\nand declare **discontinuation** when the next administration is absent beyond `1.5 ×` the expected maintenance interval (i.e.,\nmore than 1.5 × 56 = 84 days with no infusion, which here coincides with 56 + 28). (6) Before calling a gap a discontinuation,\ncheck for an **inpatient bridging dose** (hospitalization spanning the expected window with no outpatient J-code) and for an\noriginator↔biosimilar **switch** on the gap date. (7) Censor at disenrollment, end of data, and the data-maturity buffer;\nestimate persistence with a Kaplan–Meier curve that also censors at death, and with a Fine–Gray model that instead treats\ndeath as a competing risk for the time-to-discontinuation outcome.",
    "primary_category": "Exposure_Definition",
    "tags": [
      "exposure-definition",
      "physician-administered-drugs",
      "buy-and-bill",
      "j-codes",
      "hcpcs",
      "infused-biologics",
      "biosimilars",
      "pharmacoepidemiology"
    ],
    "applies_to_study_types": [
      "claims_analysis",
      "ehr_study",
      "registry_linkage",
      "target_trial_emulation",
      "comparative_effectiveness"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1186/ar3471",
        "url": "https://doi.org/10.1186/ar3471",
        "citation_text": "Curtis JR, Baddley JW, Yang S, et al. Derivation and preliminary validation of an administrative claims-based algorithm for the effectiveness of medications for rheumatoid arthritis. Arthritis Research & Therapy. 2011;13(5):R155.",
        "year": 2011,
        "authors_short": "Curtis et al.",
        "notes": "Canonical claims-based biologic-effectiveness algorithm that operationalizes infused-biologic adherence as a minimum recommended number of infusions (administration count vs the label schedule) rather than a pharmacy days_supply."
      },
      {
        "role": "explain",
        "doi": "10.1371/journal.pmed.1001885",
        "url": "https://doi.org/10.1371/journal.pmed.1001885",
        "citation_text": "Benchimol EI, Smeeth L, Guttmann A, et al. The REporting of studies Conducted using Observational Routinely-collected health Data (RECORD) Statement. PLoS Medicine. 2015;12(10):e1001885.",
        "year": 2015,
        "authors_short": "Benchimol et al.",
        "notes": "Reporting standard requiring transparent code lists (J-codes/CPT), data-source linkage, and validation for routinely-collected-data exposure definitions such as J-code-based infusion capture."
      },
      {
        "role": "demonstrate",
        "doi": "10.1002/pds.70020",
        "url": "https://doi.org/10.1002/pds.70020",
        "citation_text": "Li J, Gianfrancesco MA, Yazdany J, et al. Agreement of Medication Information Derived From EHR Data Compared to Medicare Insurance Claims: An Analysis of Biologic Disease-Modifying Antirheumatic Drugs. Pharmacoepidemiology and Drug Safety. 2024;33(10):e70020.",
        "year": 2024,
        "authors_short": "Li et al.",
        "notes": "Empirical agreement/disagreement between EHR-captured and Medicare-claims-captured biologic DMARD exposure, quantifying the source-specific misclassification that motivates J-code capture and claims-EHR linkage."
      }
    ],
    "plain_language_summary": "Some medications are injected or infused by a nurse or doctor in a clinic rather than picked up at a pharmacy. To track when a patient received one of these infusions in insurance records, researchers look for a billing code on the medical claim called a J-code, which appears on the date the infusion was given. Because there is no days_supply field on a medical claim the way there is on a pharmacy fill, analysts instead use the drug's official dosing schedule to estimate how long each infusion should cover the patient until the next one is due. A gap in coverage is declared only when the patient goes too long without returning for their next scheduled infusion.",
    "key_terms": [
      {
        "term": "J-code",
        "definition": "A billing code on a medical insurance claim that identifies a specific drug administered by a clinician; each J-code also defines how many milligrams one billed unit represents (for example, J1745 = 10 mg of infliximab)."
      },
      {
        "term": "medical benefit",
        "definition": "The part of health insurance that covers services provided by a doctor or clinic, such as infusions given in a hospital outpatient center or office; distinct from the pharmacy benefit, which covers drugs a patient picks up at a retail or specialty pharmacy."
      },
      {
        "term": "days_supply",
        "definition": "A field on a pharmacy fill record that tells analysts how many days one dispensed prescription is intended to last; infused biologics have no days_supply because they are administered in a clinic, not dispensed."
      },
      {
        "term": "dosing interval",
        "definition": "The number of days the drug label specifies between planned infusion appointments; for infliximab maintenance the interval is 56 days (every 8 weeks)."
      },
      {
        "term": "grace period",
        "definition": "Extra days added to the dosing interval to allow for real-world scheduling flexibility; a common choice is 28 days, so a patient is still considered on-treatment if they arrive a few weeks late."
      },
      {
        "term": "discontinuation",
        "definition": "The point at which an analyst declares a patient has stopped treatment, defined here as failing to receive the next infusion before the grace period expires."
      }
    ],
    "worked_example": {
      "scenario": "A 45-year-old with Crohn's disease starts infliximab on January 6, 2025. The drug label calls for a loading phase of three infusions at weeks 0, 2, and 6, then maintenance infusions every 8 weeks (56 days). An analyst has 360 days of continuous medical-benefit enrollment (January 6 through December 31, 2025) and wants to know how many days this patient was on-treatment and whether they discontinued. There is no pharmacy fill record anywhere because infliximab is infused in a clinic and billed to the medical benefit as J-code J1745.",
      "dataset": {
        "caption": "Five medical claim lines from the infusion suite. Each row is one J-code administration. There is no days_supply column because infused drugs have none.",
        "columns": [
          "person_id",
          "admin_date",
          "hcpcs",
          "cpt_admin",
          "units",
          "jw_modifier"
        ],
        "rows": [
          [
            1001,
            "2025-01-06",
            "J1745",
            "96365",
            100,
            false
          ],
          [
            1001,
            "2025-01-20",
            "J1745",
            "96365",
            100,
            false
          ],
          [
            1001,
            "2025-02-17",
            "J1745",
            "96365",
            100,
            false
          ],
          [
            1001,
            "2025-04-14",
            "J1745",
            "96365",
            100,
            false
          ],
          [
            1001,
            "2025-06-09",
            "J1745",
            "96365",
            100,
            false
          ]
        ]
      },
      "steps": [
        "Each row is a single infusion administration. The jw_modifier column is false on every row, meaning no line is a discarded-drug (wastage) line, so all five administrations count as real exposure events.",
        "J-code J1745 = 10 mg of infliximab per billed unit. With 100 units each visit, each infusion delivers 1,000 mg. Dose is not needed to build the coverage timeline, but it serves as a plausibility check: 1,000 mg corresponds to, for example, 10 mg/kg in a 100 kg patient.",
        "Classify infusions by sequence. The first three (Jan 6, Jan 20, Feb 17) are loading infusions. The label spacing in the loading phase is 14 days (wk 0 to wk 2) and then 28 days (wk 2 to wk 6); as a simplification, this example applies the 14-day interval to every loading infusion. Adding a 28-day grace period, each loading infusion keeps the patient covered for 42 days (14 + 28). Loading coverage ends 42 days after Feb 17, which is March 31.",
        "The fourth and fifth infusions (Apr 14, Jun 9) are maintenance at 56-day intervals. Adding the 28-day grace period, each maintenance infusion covers 84 days. Maintenance coverage from Apr 14 runs through Jul 7; coverage from Jun 9 runs through Sep 1, which is later, so the combined maintenance span runs Apr 14 through Sep 1 (141 days).",
        "There is a 13-day gap between the end of loading coverage (Mar 31) and the start of maintenance coverage (Apr 14), spanning April 1 through April 13. The gap arises because the 42-day loading coverage is shorter than the label's 56-day spacing between the week-6 and week-14 infusions. Feb 17 to Apr 14 is exactly 56 days, well under the 84-day discontinuation threshold, so this is a scheduling artifact rather than discontinuation.",
        "After Jun 9 no sixth infusion appears. The discontinuation threshold is 1.5 times the 56-day maintenance interval = 84 days after the last infusion (Jun 9 + 84 days = Sep 1). Because no infusion arrives by Sep 1, discontinuation is declared on that date. The patient is off-treatment from Sep 2 through Dec 31 (121 days).",
        "On-treatment days = loading coverage (85 days: Jan 6 to Mar 31) + maintenance coverage (141 days: Apr 14 to Sep 1) = 226 days. Uncovered days = 13-day mid-gap + 121 post-discontinuation days = 134 days. Check: 226 + 134 = 360, which matches the window length."
      ],
      "result": "5 infusions captured via J-code J1745; on-treatment (covered) for 226 of 360 window days (63%); discontinuation declared September 1, 2025 after no sixth infusion arrived within the 84-day grace threshold.",
      "timeline_spec": {
        "title": "Infliximab infusion exposure for one Crohn's patient (label-schedule intervals, 360-day window)",
        "caption": "Five J-code infusions (loading at weeks 0, 2, 6 then maintenance q8w). Coverage intervals are inferred from the dosing schedule plus a 28-day grace period. No days_supply exists; the dosing interval replaces it. A 13-day gap appears between loading and maintenance; discontinuation is declared on September 1 after no sixth infusion arrives within 1.5 times the 56-day maintenance interval.",
        "alt_text": "Horizontal timeline from January 6 to December 31, 2025. Five vertical tick marks represent J-code infusion administrations on January 6, January 20, February 17, April 14, and June 9. Three shaded loading-phase bars span January 6 through March 31. A narrow unlabeled gap runs April 1 through April 13. Two overlapping maintenance-phase bars span April 14 through September 1. A dashed discontinuation marker appears on September 1. The remainder of the window through December 31 is unshaded.",
        "window": {
          "start": "2025-01-06",
          "end": "2025-12-31",
          "label": "Denominator: 360-day observation window (continuous medical enrollment)"
        },
        "events": [
          {
            "label": "Infusion 1 (wk 0 loading)",
            "start": "2025-01-06",
            "length_days": 42,
            "quantity": "J1745 infusion, 100 units (1,000 mg)"
          },
          {
            "label": "Infusion 2 (wk 2 loading)",
            "start": "2025-01-20",
            "length_days": 42,
            "quantity": "J1745 infusion, 100 units (1,000 mg)"
          },
          {
            "label": "Infusion 3 (wk 6 loading)",
            "start": "2025-02-17",
            "length_days": 42,
            "quantity": "J1745 infusion, 100 units (1,000 mg)"
          },
          {
            "label": "Infusion 4 (q8w maintenance)",
            "start": "2025-04-14",
            "length_days": 84,
            "quantity": "J1745 infusion, 100 units (1,000 mg)"
          },
          {
            "label": "Infusion 5 (q8w maintenance)",
            "start": "2025-06-09",
            "length_days": 84,
            "quantity": "J1745 infusion, 100 units (1,000 mg)"
          }
        ],
        "spans": [
          {
            "kind": "exposed",
            "start": "2025-01-06",
            "end": "2025-03-31",
            "label": "85 covered days (loading phase, union of 3 intervals)"
          },
          {
            "kind": "gap",
            "start": "2025-04-01",
            "end": "2025-04-13",
            "label": "13-day scheduling gap (loading to maintenance transition)"
          },
          {
            "kind": "exposed",
            "start": "2025-04-14",
            "end": "2025-09-01",
            "label": "141 covered days (maintenance phase)"
          },
          {
            "kind": "unexposed",
            "start": "2025-09-02",
            "end": "2025-12-31",
            "label": "121 uncovered days after discontinuation (no 6th infusion)"
          }
        ],
        "result": {
          "label": "226 on-treatment days / 360 window days = 63% covered; discontinuation declared 2025-09-01",
          "value": 0.628
        }
      }
    },
    "prerequisites": [
      "claims-analysis",
      "exposure-episode-construction-rwe",
      "grace-period-gap-rules-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Presence-only J-code exposure flag",
        "description": "Binary ever/never exposed based on any qualifying drug J-code/Q-code with an administration CPT during the observation window. Maximally robust to schedule misspecification.",
        "edge_cases": [
          "Discards dose, schedule, and persistence; cannot support on-treatment or time-varying estimands.",
          "JW-modifier (discarded) lines still count as \"present\" unless explicitly excluded."
        ],
        "data_source_notes": "claims: require medical enrollment so absence is observed; EHR: prefer MAR administered events over orders."
      },
      {
        "name": "Label-schedule exposure intervals",
        "description": "Each administration opens an exposure interval of label_interval_days + grace; gaps beyond ~1.5x the maintenance interval define discontinuation. Distinguishes loading (e.g., 0/2/6 wk) from maintenance (e.g., q8w).",
        "edge_cases": [
          "Originator-to-biosimilar switch (J-code to Q-code) can read as a false gap unless molecules are pooled.",
          "Inpatient bridging doses are bundled (not separately J-coded) and create spurious gaps across hospitalizations.",
          "Real-world infusions drift early/late; grace period and gap multiplier must be pre-specified and varied."
        ],
        "data_source_notes": "claims: net out JW lines and reconcile service vs paid dates; linked: reconcile order/MAR/claim dates to a single admin_date before interval construction."
      },
      {
        "name": "Dose-aware cumulative exposure",
        "description": "Cumulative milligrams from J-code units (each J-code = a defined mg unit), netting JW-discarded units and scaling weight/BSA-based regimens, to support dose-response and cumulative-dose analyses.",
        "edge_cases": [
          "Naive unit summation double-counts discarded drug and mis-scales weight-based dosing, manufacturing spurious dose-response.",
          "Unit-to-mg conversion differs by J-code and must be maintained per code/year."
        ],
        "data_source_notes": "claims: read units + JW/JZ modifiers; EHR: prefer the milligrams actually administered from the MAR."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Pharmacy-fill / Part D-only (NDC + days_supply) exposure definition",
        "pros_of_this": "Captures buy-and-bill infused biologics at all (J-codes on the medical claim); a Part D/NDC-only rule has near-zero sensitivity for infused agents and silently drops the infused arm.",
        "cons_of_this": "Medical claims lack days_supply; requires label-driven interval logic and reconciliation of drug J-codes with separate administration CPT codes and JW/JZ modifiers.",
        "when_to_prefer": "Any analysis where the exposure or comparator is an IV/infused biologic billed under the medical benefit."
      },
      {
        "compared_to": "Presence-only J-code flag",
        "pros_of_this": "Recovers dose, schedule, persistence, and time-varying exposure; supports on-treatment and per-protocol estimands.",
        "cons_of_this": "Sensitive to grace-period and gap-multiplier specification, which must be pre-specified and stress-tested.",
        "when_to_prefer": "Rate/hazard estimands over on-treatment time, persistence, and cumulative-dose questions."
      },
      {
        "compared_to": "Treating each infusion as an independent point exposure",
        "pros_of_this": "Represents continuous on-treatment person-time between scheduled infusions and handles the loading phase.",
        "cons_of_this": "Adds interval-construction assumptions (grace, gap) not needed for a pure point-exposure model.",
        "when_to_prefer": "When the estimand requires exposed person-time rather than per-event contrasts."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Exposure = medical-claim line carrying a drug J-code/Q-code on admin_date, ideally co-occurring with an administration CPT and a plausible place-of-service. Require continuous medical (Part B / commercial medical) enrollment, not just Part D, so absent J-codes are observed no-treatment; exclude MA-only person-time. Net out JW-discarded units before dose counting; impose a data-maturity buffer for adjudication lag/reversals; handle death/disenrollment as censoring or competing events.",
      "ehr": "Use the MAR administered event (not the order) as the exposure; capture milligrams, weight/BSA for dose fidelity. External-care leakage hides outside infusions and biases persistence downward; link to claims to recover them.",
      "registry": "Strong for indication, disease activity, and adjudicated outcomes; usually incomplete for every administration date. Link to claims for the full infusion history and to a death index for censoring.",
      "linked": "Claims completeness + EHR dose fidelity + registry severity is ideal, but reconcile order/MAR/service/paid dates to a single admin_date and account for linkage selection (only the linkable subset) before interval construction."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\nimport numpy as np\n\n# Pool originator + biosimilars into one molecule (example: IV infliximab).\nCODE_LIST = {\"J1745\", \"Q5103\", \"Q5104\", \"Q5121\"}\nUNIT_MG   = {\"J1745\": 10, \"Q5103\": 10, \"Q5104\": 10, \"Q5121\": 10}  # mg per billed unit\nADMIN_CPT = {\"96365\", \"96366\", \"96413\", \"96415\"}                  # IV infusion administration\nLOAD_GAP_DAYS = 14    # loading phase target interval (weeks 0,2,6)\nMAINT_GAP_DAYS = 56   # maintenance target interval (q8w)\nGRACE_DAYS = 28       # real-world timing slack\nGAP_MULTIPLIER = 1.5  # discontinuation when next infusion absent beyond MULTIPLIER x maintenance interval\n\ndef build_infusion_intervals(med: pd.DataFrame, enroll: pd.DataFrame) -> pd.DataFrame:\n    # 1) Keep on-molecule administrations with a real administration CPT; drop JW (discarded) lines for dose/event logic.\n    m = med[med[\"hcpcs\"].isin(CODE_LIST) & med[\"cpt_admin\"].isin(ADMIN_CPT) & (~med[\"jw_modifier\"])].copy()\n\n    # 2) Collapse multiple same-day lines (e.g., two vials) into one administration; sum mg actually delivered.\n    m[\"dose_mg\"] = m[\"units\"] * m[\"hcpcs\"].map(UNIT_MG)\n    adm = (m.groupby([\"person_id\", \"admin_date\"], as_index=False)\n             .agg(dose_mg=(\"dose_mg\", \"sum\")))\n\n    # 3) Keep only administrations that fall inside a medical enrollment span that is not MA-only.\n    e = enroll[~enroll[\"ma_only\"]]\n    adm = adm.merge(e, on=\"person_id\")\n    adm = adm[(adm[\"enroll_start\"] <= adm[\"admin_date\"]) & (adm[\"enroll_end\"] >= adm[\"admin_date\"])]\n    adm = adm.drop(columns=[\"enroll_start\", \"enroll_end\", \"ma_only\"]).drop_duplicates([\"person_id\", \"admin_date\"])\n\n    # 4) Order administrations and classify loading vs maintenance by infusion sequence (first 3 = loading 0/2/6 wk).\n    #    Simplification: the 14-day loading interval is applied to all three loading infusions, although the label\n    #    spacing is 14 days, then 28 days, then 56 days to the first maintenance dose.\n    adm = adm.sort_values([\"person_id\", \"admin_date\"])\n    adm[\"seq\"] = adm.groupby(\"person_id\").cumcount()\n    adm[\"target_gap\"] = np.where(adm[\"seq\"] < 3, LOAD_GAP_DAYS, MAINT_GAP_DAYS)\n\n    # 5) Exposure interval = [admin_date, admin_date + target_gap + grace]; gap to the next infusion drives discontinuation.\n    adm[\"interval_end\"] = adm[\"admin_date\"] + pd.to_timedelta(adm[\"target_gap\"] + GRACE_DAYS, unit=\"D\")\n    adm[\"next_admin\"] = adm.groupby(\"person_id\")[\"admin_date\"].shift(-1)\n    adm[\"gap_to_next\"] = (adm[\"next_admin\"] - adm[\"admin_date\"]).dt.days\n    thresh = MAINT_GAP_DAYS * GAP_MULTIPLIER\n    # Note: the last observed administration has no successor, so it is flagged here by construction; apply censoring\n    # (disenrollment, death, end of data, data-maturity buffer) downstream before treating it as a true discontinuation.\n    adm[\"discontinued\"] = adm[\"next_admin\"].isna() | (adm[\"gap_to_next\"] > thresh)\n    return adm[[\"person_id\", \"admin_date\", \"seq\", \"dose_mg\", \"interval_end\", \"next_admin\", \"discontinued\"]]",
        "description": "Build label-schedule exposure intervals for an infused biologic from claims-style inputs. Required inputs (cleaned,\nde-duplicated, one row per administration line):\n  med    : medical drug-administration claims -> person_id, admin_date (datetime), hcpcs (J/Q code str),\n           cpt_admin (administration CPT str), units (int), jw_modifier (bool; True = discarded/wasted line),\n           place_of_service (str)\n  enroll : medical-benefit enrollment spans -> person_id, enroll_start, enroll_end, ma_only (bool)  # MA-only lacks FFS claims\nCODE_LIST pools originator + biosimilar codes into one molecule; UNIT_MG maps each code to mg/unit. Returns one\nexposure interval per administration plus a discontinuation flag when the next infusion is absent beyond the grace window.",
        "dependencies": [
          "pandas",
          "numpy"
        ],
        "source_citations": [
          "curtis-2011"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\n\ncode_list  <- c(\"J1745\", \"Q5103\", \"Q5104\", \"Q5121\")              # originator + biosimilars pooled\nunit_mg    <- c(J1745 = 10, Q5103 = 10, Q5104 = 10, Q5121 = 10)  # mg per billed unit\nadmin_cpt  <- c(\"96365\", \"96366\", \"96413\", \"96415\")\nload_gap   <- 14L; maint_gap <- 56L; grace <- 28L; gap_mult <- 1.5\n\nbuild_infusion_intervals <- function(med, enroll) {\n  setDT(med); setDT(enroll)\n\n  # 1) On-molecule administrations with a real administration CPT; drop JW (discarded) lines.\n  m <- med[hcpcs %chin% code_list & cpt_admin %chin% admin_cpt & !jw_modifier]\n  m[, dose_mg := units * unit_mg[hcpcs]]\n\n  # 2) Collapse same-day lines into one administration; sum mg delivered.\n  adm <- m[, .(dose_mg = sum(dose_mg)), by = .(person_id, admin_date)]\n\n  # 3) Keep only administrations that fall inside a medical enrollment span that is not MA-only.\n  e <- enroll[ma_only == FALSE]\n  adm <- e[adm, on = \"person_id\", allow.cartesian = TRUE\n           ][enroll_start <= admin_date & enroll_end >= admin_date]\n  adm <- unique(adm[, .(person_id, admin_date, dose_mg)])\n\n  # 4) Sequence administrations; first 3 = loading (0/2/6 wk), rest = maintenance (q8w).\n  #    Simplification: the 14-day loading interval is applied to all three loading infusions.\n  setorder(adm, person_id, admin_date)\n  adm[, seq := seq_len(.N) - 1L, by = person_id]\n  adm[, target_gap := fifelse(seq < 3L, load_gap, maint_gap)]\n\n  # 5) Exposure interval and discontinuation flag from the gap to the next infusion.\n  #    The last observed administration has no successor and is flagged by construction;\n  #    apply censoring (disenrollment, death, end of data) downstream before interpreting it.\n  adm[, interval_end := admin_date + target_gap + grace]\n  adm[, next_admin := shift(admin_date, type = \"lead\"), by = person_id]\n  adm[, gap_to_next := as.integer(next_admin - admin_date)]\n  adm[, discontinued := is.na(next_admin) | gap_to_next > maint_gap * gap_mult]\n  adm[, .(person_id, admin_date, seq, dose_mg, interval_end, next_admin, discontinued)]\n}",
        "description": "Label-schedule exposure intervals with data.table. Inputs mirror the Python version:\n  med    : person_id, admin_date (Date), hcpcs, cpt_admin, units (int), jw_modifier (logical), place_of_service\n  enroll : person_id, enroll_start, enroll_end, ma_only (logical)   # MA-only spans lack complete FFS claims",
        "dependencies": [
          "data.table"
        ],
        "source_citations": [
          "curtis-2011"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "%let load_gap  = 14;   /* loading interval (wk 0,2,6)        */\n%let maint_gap = 56;   /* maintenance interval (q8w)          */\n%let grace     = 28;   /* real-world timing slack             */\n%let gap_mult  = 1.5;  /* discontinuation threshold multiplier */\n\n/* 1) On-molecule administrations with a real administration CPT; drop JW (discarded) lines; map units -> mg. */\nproc sql;\n  create table adm0 as\n  select person_id,\n         admin_date,\n         units * 10 as dose_mg     /* each unit = 10 mg for J1745/Q5103/Q5104/Q5121 */\n  from work.med\n  where hcpcs in ('J1745','Q5103','Q5104','Q5121')          /* originator + biosimilars pooled */\n    and cpt_admin in ('96365','96366','96413','96415')      /* IV administration */\n    and jw_modifier = 0;\nquit;\n\n/* 2) Collapse multiple same-day lines into one administration; sum mg delivered. */\nproc sql;\n  create table adm1 as\n  select person_id, admin_date, sum(dose_mg) as dose_mg\n  from adm0\n  group by person_id, admin_date;\nquit;\n\n/* 3) Keep only administrations that fall inside a medical enrollment span that is not MA-only. */\nproc sql;\n  create table adm2 as\n  select distinct a.person_id, a.admin_date, a.dose_mg\n  from adm1 a\n  where exists (\n    select 1 from work.enroll e\n    where e.person_id   = a.person_id\n      and e.ma_only     = 0\n      and e.enroll_start <= a.admin_date\n      and e.enroll_end   >= a.admin_date\n  );\nquit;\n\n/* 4-5) Sequence infusions, classify loading vs maintenance, then look ahead to the next infusion (read backward so the\n        previous row holds the \"next\" administration) and flag discontinuation when that gap exceeds 1.5x q8w.\n        Simplification: the 14-day loading interval is applied to all three loading infusions.\n        The last observed infusion has no successor and is flagged by construction; apply censoring\n        (disenrollment, death, end of data) downstream before interpreting it. */\nproc sort data=adm2; by person_id admin_date; run;\n\ndata seqd;\n  set adm2; by person_id admin_date;\n  retain seq;\n  if first.person_id then seq = 0; else seq + 1;\n  if seq < 3 then target_gap = &load_gap; else target_gap = &maint_gap;\n  interval_end = admin_date + target_gap + &grace;\n  format interval_end admin_date date9.;\nrun;\n\nproc sort data=seqd; by person_id descending admin_date; run;\n\ndata work.intervals;\n  set seqd; by person_id descending admin_date;\n  next_admin = lag(admin_date);                 /* lag over descending order = the next (later) infusion */\n  if first.person_id then next_admin = .;       /* last infusion in the person has no successor */\n  gap_to_next = next_admin - admin_date;\n  discontinued = (next_admin = .) or (gap_to_next > &maint_gap * &gap_mult);\n  format next_admin date9.;\n  keep person_id admin_date seq dose_mg interval_end next_admin gap_to_next discontinued;\nrun;",
        "description": "Label-schedule exposure-interval construction in SAS (PROC SQL + DATA step). Required input datasets (post data-management,\none row per administration line):\n  work.med    : person_id, admin_date, hcpcs, cpt_admin, units, jw_modifier (0/1), place_of_service\n  work.enroll : person_id, enroll_start, enroll_end, ma_only (0/1)   /* MA-only spans lack complete FFS claims */\nProduces work.intervals: one exposure interval per administration with a discontinuation flag (next infusion absent\nbeyond 1.5x the maintenance interval). Pools infliximab originator + biosimilar codes into one molecule.",
        "dependencies": [],
        "source_citations": [
          "curtis-2011"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": "infused-biologic-administration-capture-rwe-timeline.svg",
        "mermaid": null,
        "caption": "Five J-code infusions (loading at weeks 0, 2, 6 then maintenance q8w). Coverage intervals are inferred from the dosing schedule plus a 28-day grace period. No days_supply exists; the dosing interval replaces it. A 13-day gap appears between loading and maintenance; discontinuation is declared on September 1 after no sixth infusion arrives within 1.5 times the 56-day maintenance interval.",
        "alt_text": "Horizontal timeline from January 6 to December 31, 2025. Five vertical tick marks represent J-code infusion administrations on January 6, January 20, February 17, April 14, and June 9. Three shaded loading-phase bars span January 6 through March 31. A narrow unlabeled gap runs April 1 through April 13. Two overlapping maintenance-phase bars span April 14 through September 1. A dashed discontinuation marker appears on September 1. The remainder of the window through December 31 is unshaded.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Bene[Patient with the indication<br/>continuous MEDICAL enrollment, FFS not MA-only] --> Codes[Medical claim lines:<br/>drug J-code/Q-code + administration CPT]\n  Codes --> Net[Drop JW discarded lines<br/>collapse same-day lines -> one administration]\n  Net --> Adm[Administration on admin_date<br/>dose_mg = units x mg/unit]\n  Adm --> Sched[Classify loading vs maintenance<br/>by infusion sequence]\n  Sched --> Intv[Exposure interval = admin_date + label_interval + grace]\n  Intv --> Gap{Next infusion within<br/>1.5x maintenance interval?}\n  Gap -- Yes --> Cont[On-treatment continues]\n  Gap -- No --> Check{Inpatient bridging dose<br/>or biosimilar switch on gap date?}\n  Check -- Yes --> Cont\n  Check -- No --> Disc[Discontinuation]",
        "caption": "J-code-based capture of an infused biologic. Exposure is the medical-claim administration line, intervals come from the label schedule (not days_supply), and apparent gaps are vetted for inpatient bridging and biosimilar switches before being called discontinuation.",
        "alt_text": "Flowchart from an enrolled patient through medical-claim J-code and administration CPT capture, JW netting, dose calculation, label-schedule interval construction, and a gap check that distinguishes true discontinuation from inpatient bridging or biosimilar switches.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "gantt\n  title Infliximab infusion exposure intervals for one initiator (claims)\n  dateFormat YYYY-MM-DD\n  axisFormat %b %Y\n  section Loading (wk 0,2,6)\n  Infusion 1 (interval = 14d + grace) :done, l1, 2024-01-01, 14d\n  Infusion 2 (interval = 14d + grace) :done, l2, 2024-01-15, 14d\n  Infusion 3 (interval -> q8w)        :done, l3, 2024-02-26, 56d\n  section Maintenance (q8w)\n  Infusion 4 (interval = 56d + grace) :active, m4, 2024-04-22, 56d\n  Infusion 5 (interval = 56d + grace) :active, m5, 2024-06-17, 56d\n  section Gap\n  No infusion > 1.5x q8w -> discontinuation :crit, g1, 2024-08-12, 30d",
        "caption": "Timeline of label-driven exposure intervals. Loading infusions (0/2/6 weeks) carry short intervals; maintenance infusions carry q8w intervals; a maintenance gap exceeding 1.5x the interval marks discontinuation rather than a days_supply run-out.",
        "alt_text": "Gantt timeline showing three loading infliximab infusions in early 2024, q8w maintenance infusions, and a final maintenance gap exceeding 1.5 times the eight-week interval that is classified as discontinuation.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "is_variant_of",
        "target_slug": "special-populations-rwe-methods",
        "notes": "Specialized exposure-ascertainment method within the special-populations/operational methods family, addressing physician-administered (buy-and-bill) biologics that the standard pharmacy-fill machinery cannot capture."
      },
      {
        "relation_type": "used_with",
        "target_slug": "exposure-episode-construction-rwe",
        "notes": "J-code administrations are the raw events fed into episode construction; this concept supplies the label-schedule interval and gap rules that episode construction then stitches into continuous exposure."
      },
      {
        "relation_type": "used_with",
        "target_slug": "grace-period-gap-rules-rwe",
        "notes": "The grace period and the 1.5x-interval gap multiplier that define discontinuation for an infused agent are exactly the parameters governed by grace-period/gap-rule choices."
      },
      {
        "relation_type": "used_with",
        "target_slug": "time-updated-exposures-cumulative-dose-rwe",
        "notes": "Dose-aware J-code unit-to-mg conversion (net of JW-discarded lines) supplies the cumulative-dose and time-varying exposure used in dose-response and on-treatment analyses."
      },
      {
        "relation_type": "see_also",
        "target_slug": "inpatient-bridging-exposure-rwe",
        "notes": "Inpatient biologic doses are bundled into the DRG and not separately J-coded; bridging logic prevents an in-hospital administration from being misread as a treatment gap."
      },
      {
        "relation_type": "see_also",
        "target_slug": "omop-drug-exposure-drug-era-rwe",
        "notes": "Mapping J-code/Q-code administrations into the OMOP DRUG_EXPOSURE table (and drug-era logic) requires the same label-schedule interval reasoning because medical-claim administrations lack days_supply."
      },
      {
        "relation_type": "see_also",
        "target_slug": "claims-outcome-algorithm-ppv-sensitivity-rwe",
        "notes": "J-code-based exposure capture should itself be validated (PPV/sensitivity vs EHR/registry) the way claims-based outcome algorithms are, since infused-biologic codes are subject to capture error."
      }
    ],
    "aliases": [
      "Physician-administered drug exposure capture",
      "Part B biologic exposure capture",
      "J-code-based biologic exposure",
      "Buy-and-bill exposure ascertainment",
      "Infused biologic administration capture"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "inpatient-bridging-exposure-rwe",
    "name": "Inpatient Bridging of Drug Exposure",
    "short_definition": "A pre-specified rule that decides how to treat days during a hospital stay when outpatient pharmacy fills are absent or suspended, when constructing exposure episodes and adherence/persistence measures from claims or linked data.",
    "long_description": "**Inpatient bridging** is the explicit, protocol-level decision about what an exposure series should assume during the\ndays a patient is hospitalized. In outpatient pharmacy claims, exposure is reconstructed by stitching together fills\n(`fill_date` + `days_supply`). During an inpatient stay the patient typically receives medication from the hospital\nformulary, so there is no outpatient fill — and in claims-only data the inpatient drug is bundled into the DRG/facility\npayment and is therefore invisible. The series shows an apparent gap that is an artifact of where the drug was sourced,\nnot evidence the patient stopped therapy. How that gap is handled changes denominators, exposure time, gap counts, and\nevery downstream measure (PDC, MPR, persistence, time-varying exposure). This is an `Exposure_Definition` problem, not\nan estimation problem: the choice is made in the cohort/episode build, before any model.\n\n**Core conceptual distinction.** There are three canonical bridging policies, and they are mutually exclusive choices\nthat must be named in the protocol:\n1. **Carry-over (assume continuation):** treat inpatient days as covered/exposed — bridge across the stay as if the\n   drug continued. Rationale: an inpatient who was on chronic therapy is overwhelmingly likely to have it continued in\n   hospital. This is the most common default for chronic maintenance drugs.\n2. **Censor (remove inpatient days from the denominator):** exclude hospitalized days from the observation window\n   entirely, so they count as neither covered nor uncovered. This is the **PQA / CMS Star Ratings** convention for\n   adherence PDC: inpatient and skilled-nursing days are removed from the denominator and any overlapping supply is\n   \"pushed back,\" because the member's outpatient adherence cannot be observed during institutional stays.\n3. 
**Treat as gap / discontinuation:** count inpatient days as uncovered (a true gap), or end the exposure episode at\n   admission. Appropriate only when the drug is genuinely *not* expected to continue (e.g., a therapy held for the\n   procedure, a discontinued agent).\nThe estimand-adjacent point: these three policies are not nuances — for a patient with frequent or long admissions\nthey produce materially different PDC and persistence values, and the \"right\" choice depends on the clinical\nexpectation for *that specific drug* during *that specific kind of stay*.\n\n**Pros, cons, and trade-offs** (vs the named alternatives):\n- **Carry-over vs censor:** Carry-over is simple and matches the clinical reality for chronic drugs, but it manufactures\n  coverage you did not observe and can mask true non-adherence around discharge. Censoring (PQA-style) is the most\n  defensible when the question is *observable outpatient* adherence and is required for Star Ratings comparability, but\n  it shrinks the denominator, can inflate PDC for frequently hospitalized (sicker) patients, and complicates\n  person-time accounting. **Prefer carry-over** for chronic maintenance therapy in an etiologic study; **prefer censor**\n  for regulated quality measurement and when inpatient supply is unknowable.\n- **Carry-over vs treat-as-gap:** Treat-as-gap is correct only for drugs plausibly held during the stay; applied to a\n  chronic drug it invents discontinuations and biases persistence downward. **Prefer treat-as-gap** only with a\n  clinical rationale, ideally validated against linked MAR/eMAR data.\n- **vs ignoring the issue (naive stitching):** Doing nothing silently applies whatever the default `days_supply`\n  arithmetic produces — usually a phantom gap. That is the worst option because the policy is implicit and\n  undocumented. 
Any of the three explicit policies beats an unstated one.\n\n**When to use.** Whenever exposure episodes, PDC/MPR, persistence, or time-varying exposure are built from outpatient\npharmacy claims in a population that is hospitalized at non-trivial rates (elderly, oncology, cardiovascular, dialysis,\ntransplant, serious mental illness). It is mandatory to specify a bridging rule for any chronic-disease adherence or\ncomparative-effectiveness study and for any regulated PDC measure.\n\n**When NOT to use — and when it is actively misleading or dangerous.** Bridging is unnecessary when admissions are rare\nand short relative to the supply and outcome window (the choice cannot move the estimate). It becomes **actively\ndangerous** in three situations. (1) **Asymmetric application** — bridging one arm (or only the study drug) but not the\ncomparator manufactures **immortal time** and covered person-time for one side, biasing the comparative estimate; the\nrule must be applied identically to both arms (see immortal-time-bias-handling). (2) **Differential hospitalization by\narm** — if the sicker arm is hospitalized more, a carry-over policy donates more phantom coverage to that arm, while a\ncensoring policy removes more of its observable time; either way the policy choice becomes **outcome-dependent**, so a\nsensitivity analysis across all three policies is not optional. (3) **Treating-as-gap a drug that was actually continued\nin hospital** — fabricates discontinuations, corrupts persistence, and (in claims-only data, where you cannot see the\ninpatient administration) is unfalsifiable without linkage.\n\n**Data-source operational depth.**\n- **Claims (FFS):** Identify inpatient stays from institutional/facility claims (revenue center codes, place-of-service,\n  DRG, admit/discharge dates). The inpatient drug is bundled into the facility payment and never appears as an NDC, so\n  you must *infer* coverage from the stay dates plus the surrounding outpatient fills. 
Reconstruct admission and\n  discharge from the medical claim, then apply the chosen policy to `[admit_date, discharge_date]`. Failure mode:\n  over-the-counter or sample inpatient continuation is invisible; same-day discharge fills and discharge-prescription\n  \"med rec\" fills can double-count if not deduped.\n- **Claims (Medicare Advantage / capitated):** MA encounter data are notoriously incomplete and lag; an MA-only person\n  may have *neither* the institutional claim nor reliable Part D fills, so \"no fill during a window\" can be pure\n  missingness rather than a real gap or a real stay. Restrict to FFS Parts A/B/D person-time, or flag MA-only spans and\n  exclude them from the denominator — do not let MA missingness masquerade as non-adherence.\n- **EHR:** The inpatient administration is visible in the MAR/eMAR and inpatient order records, so bridging can be\n  *evidence-based* rather than assumed — but only if the hospital is inside the EHR network. External-hospital stays\n  leak out of the system and reappear as the same phantom gap as in claims; visit-driven capture means the patient who\n  is admitted elsewhere is differentially unobserved.\n- **Registry:** Usually weak for both fills and inpatient drug administration; use registry admission/severity fields to\n  *flag* stays, but link to claims (fills) and to facility claims (stay dates) to actually operationalize the rule.\n- **Linked claims–EHR:** The ideal substrate — facility claims give reliable stay dates and the linked inpatient MAR\n  confirms whether the specific drug was continued, letting you choose carry-over vs treat-as-gap per stay on evidence\n  rather than assumption. Cost: only the linkable subset is covered, and admit/fill/service date discrepancies must be\n  reconciled before bridging.\n\n**Worked claims example.** A patient on a chronic statin fills a **30-day** supply on **2024-01-01** (covers\n2024-01-01 → 2024-01-30). 
A facility claim shows an inpatient stay **2024-01-11 → 2024-01-20** (10 days). The next\noutpatient fill (30 days) is **2024-02-05**. The follow-up window is the 35 days **2024-01-01 → 2024-02-04**. Compute\nPDC under each policy:\n- **Carry-over:** the in-hospital days are assumed covered. Covered days = Jan 1–30 (30 from the first fill, with the\n  admission spanned) → no gap is recognized during the stay; the only uncovered days are Jan 31 → Feb 4 (5 days).\n  PDC = 30 / 35 ≈ **0.857**.\n- **Censor (PQA):** remove the 10 inpatient days from the denominator (Jan 11–20). Denominator = 35 − 10 = 25 days;\n  covered observable days = Jan 1–10 (10) + Jan 21–30 (10) = 20. PDC = 20 / 25 = **0.800**. (PQA additionally \"pushes\n  back\" supply that overlapped the removed days, which can recover days near discharge; the directional point — a\n  different denominator — holds.)\n- **Treat-as-gap:** the 10 inpatient days are uncovered. Covered = Jan 1–10 (10) + Jan 21–30 (10) = 20 over a 35-day\n  denominator → PDC = 20 / 35 ≈ **0.571**.\nSame patient, same fills: PDC ranges 0.571 → 0.857 — straddling the 0.80 quality-measure threshold — purely from the\nbridging rule. That single decision can flip a patient from \"non-adherent\" to \"adherent,\" which is why the policy must\nbe pre-specified, applied identically across arms, and stress-tested in sensitivity analysis.",
    "primary_category": "Exposure_Definition",
    "tags": [
      "exposure_definition",
      "exposure-episode-construction",
      "inpatient-bridging",
      "hospitalization",
      "days-supply",
      "adherence",
      "pdc",
      "persistence",
      "immortal-time"
    ],
    "applies_to_study_types": [
      "claims_analysis",
      "ehr_study",
      "registry_linkage",
      "comparative_effectiveness",
      "target_trial_emulation"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1097/MLR.0b013e31829b1d2a",
        "url": "https://doi.org/10.1097/MLR.0b013e31829b1d2a",
        "citation_text": "Raebel MA, Schmittdiel J, Karter AJ, Konieczny JL, Steiner JF. Standardizing terminology and definitions of medication adherence and persistence in research employing electronic databases. Medical Care. 2013;51(8 Suppl 3):S11-S21.",
        "year": 2013,
        "authors_short": "Raebel et al.",
        "notes": "Consensus terminology for adherence/persistence from administrative databases; explicitly addresses how hospitalization and institutional periods must be handled (bridged, censored, or counted as gap) when reconstructing exposure series."
      },
      {
        "role": "explain",
        "doi": "10.1345/aph.1H018",
        "url": "https://doi.org/10.1345/aph.1H018",
        "citation_text": "Hess LM, Raebel MA, Conner DA, Malone DC. Measurement of adherence in pharmacy administrative databases: a proposal for standard definitions and preferred measures. Annals of Pharmacotherapy. 2006;40(7-8):1280-1288.",
        "year": 2006,
        "authors_short": "Hess et al.",
        "notes": "Foundational catalog of administrative-database adherence measures (PDC, MPR, gap measures) and the operational handling of hospitalization periods in the denominator."
      },
      {
        "role": "explain",
        "doi": "10.1016/s0895-4356(96)00268-5",
        "url": "https://doi.org/10.1016/s0895-4356(96)00268-5",
        "citation_text": "Steiner JF, Prochazka AV. The assessment of refill compliance using pharmacy records: methods, validity, and applications. Journal of Clinical Epidemiology. 1997;50(1):105-116.",
        "year": 1997,
        "authors_short": "Steiner & Prochazka",
        "notes": "Origin of refill-based exposure/compliance measurement from pharmacy records and the days-supply stitching logic that inpatient bridging modifies."
      },
      {
        "role": "demonstrate",
        "doi": "10.1002/pds.1230",
        "url": "https://doi.org/10.1002/pds.1230",
        "citation_text": "Andrade SE, Kahler KH, Frech F, Chan KA. Methods for evaluation of medication adherence and persistence using automated databases. Pharmacoepidemiology and Drug Safety. 2006;15(8):565-574.",
        "year": 2006,
        "authors_short": "Andrade et al.",
        "notes": "Applied automated-database methods for persistence, permissible gaps, and episode construction in which inpatient periods change the gap/continuation decision."
      }
    ],
    "plain_language_summary": "When a patient is hospitalized, the hospital supplies their medication directly — no prescription gets filled at a pharmacy, so no record of those days shows up in outpatient claims data. Naively stitching prescription fills together makes the hospital stay look like a gap in drug coverage, even though the patient never actually stopped taking the medication. Inpatient bridging is a pre-specified rule that says: during a confirmed hospital stay, treat the patient as still covered by their chronic medication rather than calling those days a gap. Without this rule, a patient who was faithfully taking a heart medication through a two-week admission can be mislabeled 'non-adherent' purely because the data source cannot see inside the hospital.",
    "key_terms": [
      {
        "term": "days_supply",
        "definition": "The number of days a single prescription fill is meant to last — for example, a 14-day fill of a blood-pressure pill means the bottle holds 14 doses."
      },
      {
        "term": "fill_date",
        "definition": "The calendar date on which a patient picked up (or was dispensed) a prescription at a pharmacy, as recorded in claims data."
      },
      {
        "term": "outpatient pharmacy claim",
        "definition": "A billing record submitted when a patient fills a prescription at a retail or mail-order pharmacy — this record is absent when the same drug is given inside a hospital, because hospitals bill differently."
      },
      {
        "term": "PDC (Proportion of Days Covered)",
        "definition": "A 0-to-1 score measuring how many days in a defined window a patient actually had their medication on hand, calculated by dividing unique covered days by total days in the window."
      },
      {
        "term": "carry-over bridging",
        "definition": "The bridging rule that assumes a chronic medication continued uninterrupted during a hospital stay, marking those hospital days as 'covered' even though no outpatient fill was dispensed."
      },
      {
        "term": "facility claim",
        "definition": "A billing record submitted by a hospital or inpatient facility that shows the patient's admission date, discharge date, and overall services — this is the data source used to identify and date a hospital stay."
      }
    ],
    "worked_example": {
      "scenario": "Maria, age 68, has been taking a daily statin for high cholesterol for years. We are studying her medication coverage over a 60-day window from January 2 through March 1, 2024. She fills a 14-day supply on January 2, then is admitted to the hospital on January 10 and discharged on January 29 (a 20-day stay). The hospital keeps her on the statin the entire time, but no outpatient pharmacy claim is generated — the drug comes from the hospital's own supply. Her next outpatient fill is a 30-day supply on February 1. We want to compute how many days she actually had the drug during the 60-day window, and we will compare the naive (no-bridging) result to the carry-over-bridging result.",
      "dataset": {
        "caption": "Raw outpatient pharmacy fills — these are the only pill records visible without bridging. The hospital stay appears only in a separate facility claim (bottom table).",
        "columns": [
          "person_id",
          "fill_date",
          "drug",
          "days_supply"
        ],
        "rows": [
          [
            2001,
            "2024-01-02",
            "atorvastatin",
            14
          ],
          [
            2001,
            "2024-02-01",
            "atorvastatin",
            30
          ]
        ]
      },
      "inpatient_claim": {
        "caption": "Facility (inpatient) claim for the same patient — the hospital stay visible here, but no drug NDC appears because the drug cost is bundled into the hospital bill.",
        "columns": [
          "person_id",
          "admit_date",
          "discharge_date",
          "stay_days"
        ],
        "rows": [
          [
            2001,
            "2024-01-10",
            "2024-01-29",
            20
          ]
        ]
      },
      "steps": [
        "Step 1 — Mark the observation window. We are watching Maria from January 2 through March 1, 2024. That is 60 days total (30 days in January starting Jan 2, 29 days in February — 2024 is a leap year, and March 1).",
        "Step 2 — Map Fill A. The January 2 fill has a 14-day supply, so it covers January 2 through January 15 (14 days).",
        "Step 3 — Map Fill B. The February 1 fill has a 30-day supply, so it covers February 1 through March 1 (30 days — all inside the window).",
        "Step 4 — Naive calculation (no bridging). We only see the two pharmacy fills. Covered days = January 2–15 (14 days) + February 1–March 1 (30 days) = 44 covered days. The stretch January 16 through January 31 (16 days) looks like an uncovered gap. Naive PDC = 44 / 60 = 0.733.",
        "Step 5 — Identify the hospital stay. The facility claim shows Maria was admitted January 10 and discharged January 29 — a 20-day inpatient stay. No outpatient fill was generated during those days because the hospital dispensed the statin itself.",
        "Step 6 — Apply carry-over bridging. We treat the 20 hospital days (January 10–29) as covered, just as if she had a pill in her hand each day. Now combine: Fill A covers January 2–15, and the bridge extends coverage through January 29. The union of these two is January 2–29 = 28 covered days.",
        "Step 7 — Identify the true gap. After discharge on January 29, the next outpatient fill is February 1. The two days January 30–31 are genuinely uncovered — she is home, out of the hospital, and has not yet refilled.",
        "Step 8 — Compute bridged PDC. Covered days = January 2–29 (28 days) + February 1–March 1 (30 days) = 58 covered days. Bridged PDC = 58 / 60 = 0.967.",
        "Step 9 — Compare. Without bridging: PDC = 0.733 — Maria looks non-adherent by the common 0.80 threshold. With carry-over bridging: PDC = 0.967 — she is highly adherent. The 16-day apparent gap shrinks to a 2-day real gap. The difference is entirely explained by where the drug was sourced, not by whether she actually took it."
      ],
      "result": "Naive PDC (no bridging) = 44 covered days / 60 window days = 0.733 — below the 0.80 adherence threshold. Bridged PDC (carry-over) = 58 covered days / 60 window days = 0.967 — well above the threshold. The single decision to bridge the 20-day hospital stay adds 14 covered days (the inpatient days beyond Fill A's supply) and eliminates a phantom gap, moving Maria from 'non-adherent' to 'highly adherent'.",
      "timeline_spec": {
        "title": "Inpatient bridging: 60-day statin coverage with a 20-day hospital stay (patient 2001)",
        "window": {
          "start": "2024-01-02",
          "end": "2024-03-01",
          "label": "Denominator: 60-day observation window (Jan 2 – Mar 1, 2024)"
        },
        "events": [
          {
            "label": "Fill A",
            "start": "2024-01-02",
            "length_days": 14,
            "quantity": "14-day supply"
          },
          {
            "label": "Inpatient stay (no outpatient fill — hospital supplies the drug)",
            "start": "2024-01-10",
            "length_days": 20,
            "quantity": "20-day stay"
          },
          {
            "label": "Fill B",
            "start": "2024-02-01",
            "length_days": 30,
            "quantity": "30-day supply"
          }
        ],
        "spans": [
          {
            "kind": "covered",
            "start": "2024-01-02",
            "end": "2024-01-15",
            "label": "Fill A: 14 covered days (outpatient)"
          },
          {
            "kind": "exposed",
            "start": "2024-01-10",
            "end": "2024-01-29",
            "label": "Hospital stay: drug supplied by facility — no outpatient claim generated"
          },
          {
            "kind": "gap",
            "start": "2024-01-16",
            "end": "2024-01-31",
            "label": "Naive view: 16-day apparent gap (Jan 16–31) — includes 14 inpatient days + 2 post-discharge days"
          },
          {
            "kind": "covered",
            "start": "2024-01-10",
            "end": "2024-01-29",
            "label": "Bridged view: inpatient days treated as covered (carry-over rule)"
          },
          {
            "kind": "gap",
            "start": "2024-01-30",
            "end": "2024-01-31",
            "label": "True gap after bridging: 2 days (Jan 30–31, post-discharge before refill)"
          },
          {
            "kind": "covered",
            "start": "2024-02-01",
            "end": "2024-03-01",
            "label": "Fill B: 30 covered days (outpatient)"
          }
        ],
        "result": {
          "label": "Naive PDC = 44/60 = 0.733 (non-adherent) → Bridged PDC = 58/60 = 0.967 (adherent). Bridge adds 14 inpatient days beyond Fill A, shrinking the apparent 16-day gap to a true 2-day gap.",
          "value": 0.967
        },
        "caption": "Maria's statin coverage over 60 days. Fill A (14 days) runs out January 15; her hospital admission (January 10–29) falls partly within Fill A and extends 14 days beyond it with no outpatient pharmacy record. Without bridging, January 16–31 looks like a 16-day gap and PDC = 0.733. With carry-over bridging (hospital days assumed covered), coverage extends through January 29, the true gap shrinks to 2 days, and PDC = 0.967 — straddling the 0.80 adherence threshold purely based on the bridging rule.",
        "alt_text": "Horizontal timeline from January 2 to March 1, 2024, showing two pharmacy fill bars (Fill A: 14 days starting Jan 2; Fill B: 30 days starting Feb 1), a 20-day hospital stay bar (Jan 10–29), a wide apparent-gap shading under the naive view (Jan 16–31), a bridged-coverage overlay spanning Jan 10–29, and a small true-gap shading (Jan 30–31). Two PDC result labels: Naive 0.733 and Bridged 0.967."
      }
    },
    "prerequisites": [
      "exposure-episode-construction-rwe",
      "pdc-proportion-of-days-covered",
      "grace-period-gap-rules-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Carry-over bridging (assume in-hospital continuation)",
        "description": "Inpatient days spanning an admission are treated as covered/exposed; the exposure episode is bridged across the stay as if the chronic drug continued. Default for maintenance therapy plausibly continued in hospital.",
        "edge_cases": [
          "A drug genuinely held for the admission (e.g., anticoagulant held peri-procedure) is wrongly counted as covered.",
          "Long or repeated admissions donate large amounts of unobserved \"coverage,\" masking true post-discharge non-adherence.",
          "In claims-only data the assumption is unfalsifiable because the inpatient NDC is bundled into the DRG."
        ],
        "data_source_notes": "claims: span [admit_date, discharge_date] from facility claims and mark those days covered; EHR: confirm continuation from the MAR before assuming it."
      },
      {
        "name": "Censoring bridging (PQA / CMS Star Ratings convention)",
        "description": "Inpatient and skilled-nursing days are removed from the observation denominator (counted as neither covered nor uncovered); overlapping supply is shifted/pushed back. Required for Star Ratings PDC comparability.",
        "edge_cases": [
          "Frequently hospitalized (sicker) patients have larger denominator reductions, which can inflate their measured PDC.",
          "Person-time accounting must subtract institutional days consistently from every patient and arm.",
          "The supply-pushback step is easy to implement incorrectly and changes results near discharge."
        ],
        "data_source_notes": "claims: identify institutional stays (revenue codes, place-of-service, DRG) and subtract those days from the denominator; align with the PQA technical specification for the specific measure."
      },
      {
        "name": "Treat-as-gap / episode termination",
        "description": "Inpatient days are counted as uncovered (a true gap) or the exposure episode ends at admission. Reserved for drugs not expected to continue during the stay.",
        "edge_cases": [
          "Applied to a continued chronic drug, it fabricates discontinuations and biases persistence downward.",
          "Requires a clinical rationale per drug; ideally validated against linked inpatient administration data."
        ],
        "data_source_notes": "linked: use MAR/eMAR to confirm the drug was actually stopped before classifying the stay as a gap."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Censoring bridging (PQA / CMS convention)",
        "pros_of_this": "Carry-over matches clinical reality for chronic maintenance drugs, keeps the full observation window, and is simpler to program.",
        "cons_of_this": "Manufactures coverage that was never observed and can mask true non-adherence; unfalsifiable in claims-only data.",
        "when_to_prefer": "Etiologic/comparative-effectiveness studies of chronic drugs plausibly continued in hospital, where observable-only adherence is not the target."
      },
      {
        "compared_to": "Carry-over bridging",
        "pros_of_this": "Censoring removes unobservable institutional days, is the regulated standard for Star Ratings PDC, and avoids assuming coverage you cannot see.",
        "cons_of_this": "Shrinks the denominator, can inflate PDC for frequently hospitalized patients, and complicates person-time accounting.",
        "when_to_prefer": "Quality measurement (PQA/CMS), benchmarking, and any setting where only observable outpatient adherence is the estimand."
      },
      {
        "compared_to": "Naive days-supply stitching with no stated policy",
        "pros_of_this": "Any explicit bridging policy is documented, reproducible, and defensible; the default arithmetic is not.",
        "cons_of_this": "Requires identifying inpatient stays and writing explicit window logic.",
        "when_to_prefer": "Always over an unstated implicit rule in any regulatory-grade or HTA analysis."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Reconstruct admission/discharge from institutional/facility claims (revenue codes, place-of-service, DRG); the inpatient drug is bundled and has no NDC, so coverage during the stay must be assigned by the chosen bridging policy. Dedupe discharge/med-rec fills. Apply the same policy and the same stay definition to every arm.",
      "ehr": "Inpatient administrations are visible in MAR/eMAR and order records, enabling evidence-based bridging — but only for in-network stays; external-hospital admissions leak out and reappear as phantom gaps.",
      "registry": "Weak for both fills and inpatient administration; use registry severity/admission fields only to flag stays, then link to claims for fills and facility claims for stay dates.",
      "linked": "Facility claims give reliable stay dates and linked MAR confirms whether the specific drug was continued, allowing per-stay carry-over vs treat-as-gap decisions on evidence; reconcile admit/fill/service date discrepancies first and account for the linkable-subset selection."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\nimport numpy as np\n\ndef _covered_dates(fills: pd.DataFrame) -> set:\n    # Days covered by outpatient supply: each fill covers [fill_date, fill_date + days_supply - 1].\n    out = set()\n    for _, r in fills.iterrows():\n        out.update(pd.date_range(r[\"fill_date\"],\n                                 r[\"fill_date\"] + pd.Timedelta(days=int(r[\"days_supply\"]) - 1)))\n    return out\n\ndef _inpatient_dates(stays: pd.DataFrame) -> set:\n    out = set()\n    for _, r in stays.iterrows():\n        out.update(pd.date_range(r[\"admit_date\"], r[\"discharge_date\"]))\n    return out\n\ndef pdc_with_bridging(fills, stays, obs_start, obs_end, policy=\"carryover\") -> float:\n    window = set(pd.date_range(obs_start, obs_end))\n    inpatient = _inpatient_dates(stays) & window\n    covered = _covered_dates(fills) & window\n\n    if policy == \"carryover\":\n        covered = covered | inpatient                 # assume the drug continued in hospital\n        denom = window\n    elif policy == \"censor\":\n        covered = covered - inpatient                 # inpatient days observable for neither num nor denom\n        denom = window - inpatient                    # PQA: remove institutional days from the denominator\n    elif policy == \"gap\":\n        covered = covered - inpatient                 # inpatient days count as uncovered\n        denom = window\n    else:\n        raise ValueError(f\"unknown policy: {policy}\")\n\n    return len(covered & denom) / len(denom) if denom else np.nan",
        "description": "Apply a bridging policy to outpatient exposure days. Required inputs (cleaned, deduped):\n  fills  : person_id, fill_date (datetime), days_supply (int)\n  stays  : person_id, admit_date (datetime), discharge_date (datetime)   # from facility/institutional claims\n  window : person_id, obs_start (datetime), obs_end (datetime)           # follow-up denominator window\nReturns a per-person PDC under policy in {'carryover','censor','gap'}. Inpatient days are the union of\n[admit_date, discharge_date] intersected with the observation window. Apply identical logic to every arm.",
        "dependencies": [
          "pandas",
          "numpy"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\n\ncovered_days <- function(fills) {\n  # Each fill covers fill_date .. fill_date + days_supply - 1.\n  unique(do.call(c, Map(function(d, n) seq(d, d + n - 1L, by = \"day\"),\n                        fills$fill_date, as.integer(fills$days_supply))))\n}\ninpatient_days <- function(stays) {\n  unique(do.call(c, Map(function(a, b) seq(a, b, by = \"day\"),\n                        stays$admit_date, stays$discharge_date)))\n}\n\npdc_with_bridging <- function(fills, stays, obs_start, obs_end, policy = \"carryover\") {\n  window    <- seq(obs_start, obs_end, by = \"day\")\n  inpatient <- intersect(inpatient_days(stays), window)\n  covered   <- intersect(covered_days(fills), window)\n\n  if (policy == \"carryover\") {            # assume continuation in hospital\n    covered <- union(covered, inpatient); denom <- window\n  } else if (policy == \"censor\") {        # PQA: drop institutional days from denominator\n    covered <- setdiff(covered, inpatient); denom <- setdiff(window, inpatient)\n  } else if (policy == \"gap\") {           # inpatient days count as uncovered\n    covered <- setdiff(covered, inpatient); denom <- window\n  } else stop(\"unknown policy\")\n\n  if (length(denom) == 0L) return(NA_real_)\n  length(intersect(covered, denom)) / length(denom)\n}",
        "description": "Apply a bridging policy to outpatient exposure days with data.table. Inputs mirror the Python version:\n  fills  : person_id, fill_date (Date), days_supply (integer)\n  stays  : person_id, admit_date (Date), discharge_date (Date)\nReturns PDC for one person under policy in {'carryover','censor','gap'}; apply identically across arms.",
        "dependencies": [
          "data.table"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "%let policy = CARRYOVER;   /* CARRYOVER | CENSOR | GAP */\n\n/* 1. One row per person-day across the observation window. */\ndata spine;\n  set work.window;\n  do day = obs_start to obs_end;\n    output;\n  end;\n  format day date9.;\n  keep person_id day;\nrun;\n\n/* 2. Flag days covered by an outpatient fill: [fill_date, fill_date + days_supply - 1]. */\nproc sql;\n  create table covered as\n  select distinct s.person_id, s.day\n  from spine s\n  inner join work.fills f\n    on f.person_id = s.person_id\n   and s.day >= f.fill_date\n   and s.day <= f.fill_date + f.days_supply - 1;\nquit;\n\n/* 3. Flag inpatient days from facility claims: [admit_date, discharge_date]. */\nproc sql;\n  create table inpat as\n  select distinct sp.person_id, sp.day\n  from spine sp\n  inner join work.stays st\n    on st.person_id = sp.person_id\n   and sp.day >= st.admit_date\n   and sp.day <= st.discharge_date;\nquit;\n\n/* 4. Apply the bridging policy day by day. */\nproc sql;\n  create table flagged as\n  select sp.person_id, sp.day,\n         (c.day is not null) as covered_raw,\n         (i.day is not null) as inpatient\n  from spine sp\n  left join covered c on c.person_id = sp.person_id and c.day = sp.day\n  left join inpat   i on i.person_id = sp.person_id and i.day = sp.day;\nquit;\n\ndata resolved;\n  set flagged;\n  length in_denom covered 3;\n  %if &policy = CARRYOVER %then %do;\n    in_denom = 1;  covered = (covered_raw or inpatient);     /* assume continuation */\n  %end;\n  %else %if &policy = CENSOR %then %do;\n    in_denom = (inpatient = 0);  covered = (covered_raw and not inpatient);  /* PQA: drop inpatient days */\n  %end;\n  %else %if &policy = GAP %then %do;\n    in_denom = 1;  covered = (covered_raw and not inpatient);  /* inpatient = uncovered */\n  %end;\nrun;\n\n/* 5. PDC = covered denominator-days / denominator-days. 
*/\nproc sql;\n  create table pdc as\n  select person_id,\n         sum(covered and in_denom) as covered_days,\n         sum(in_denom)             as denom_days,\n         calculated covered_days / calculated denom_days as pdc\n  from resolved\n  group by person_id;\nquit;",
        "description": "Build a daily exposure spine and apply a bridging policy in SAS (PROC SQL + data step), then compute PDC.\nRequired input datasets (post data-management):\n  work.fills  : person_id, fill_date, days_supply\n  work.stays  : person_id, admit_date, discharge_date            (from facility/institutional claims)\n  work.window : person_id, obs_start, obs_end                    (denominator follow-up window)\nMacro var &policy in (CARRYOVER | CENSOR | GAP). Expand to one row per person-day, flag covered/inpatient,\nthen aggregate. Apply the identical &policy to every treatment arm.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": "inpatient-bridging-exposure-rwe-timeline.svg",
        "mermaid": null,
        "caption": "Maria's statin coverage over 60 days. Fill A (14 days) runs out January 15; her hospital admission (January 10–29) falls partly within Fill A and extends 14 days beyond it with no outpatient pharmacy record. Without bridging, January 16–31 looks like a 16-day gap and PDC = 0.733. With carry-over bridging (hospital days assumed covered), coverage extends through January 29, the true gap shrinks to 2 days, and PDC = 0.967 — straddling the 0.80 adherence threshold purely based on the bridging rule.",
        "alt_text": "Horizontal timeline from January 2 to March 1, 2024, showing two pharmacy fill bars (Fill A: 14 days starting Jan 2; Fill B: 30 days starting Feb 1), a 20-day hospital stay bar (Jan 10–29), a wide apparent-gap shading under the naive view (Jan 16–31), a bridged-coverage overlay spanning Jan 10–29, and a small true-gap shading (Jan 30–31). Two PDC result labels: Naive 0.733 and Bridged 0.967.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Stay[Hospital stay detected<br/>facility claim: admit_date..discharge_date] --> Q1{Is the drug expected<br/>to continue in hospital?}\n  Q1 -->|Yes, chronic maintenance| CO[Carry-over:<br/>inpatient days = covered]\n  Q1 -->|Unknown / claims-only<br/>cannot observe| Q2{Is the estimand<br/>observable outpatient<br/>adherence / PQA measure?}\n  Q1 -->|No, drug held / stopped| GAP[Treat as gap:<br/>inpatient days = uncovered]\n  Q2 -->|Yes| CEN[Censor:<br/>remove inpatient days<br/>from denominator]\n  Q2 -->|No, etiologic continuation| CO\n  CO --> SENS[Apply identically to BOTH arms<br/>+ sensitivity analysis across all 3 policies]\n  CEN --> SENS\n  GAP --> SENS",
        "caption": "Decision logic for choosing an inpatient bridging policy. The choice depends on the clinical expectation for the specific drug, the estimand, and the data source; whatever is chosen must be applied identically across arms and stress-tested across all three policies because differential hospitalization makes the choice outcome-dependent.",
        "alt_text": "Flowchart deciding between carry-over, censor, and treat-as-gap bridging policies based on whether the drug continues in hospital, the estimand, and data observability, ending in identical-across-arms application plus sensitivity analysis.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "gantt\n  title One patient, one 30-day fill spanning a 10-day admission (Jan 1 - Feb 4 window)\n  dateFormat YYYY-MM-DD\n  axisFormat %d-%b\n  section Events\n  30-day fill (covers Jan 1 - Jan 30) :done, fill, 2024-01-01, 30d\n  Inpatient stay (no outpatient fill) :crit, stay, 2024-01-11, 10d\n  section Carry-over (PDC 0.857)\n  Covered incl. in-hospital days :active, co, 2024-01-01, 30d\n  section Censor / PQA (PDC 0.800)\n  Covered, inpatient days removed from denom :active, ce, 2024-01-01, 10d\n  section Treat-as-gap (PDC 0.571)\n  Covered, inpatient days uncovered :active, ga1, 2024-01-01, 10d\n  Covered after discharge :active, ga2, 2024-01-21, 10d",
        "caption": "The same fill and admission yield PDC of 0.857 (carry-over), 0.800 (censor), or 0.571 (treat-as-gap) — a spread that straddles the 0.80 quality threshold purely from the bridging rule, which is why the policy must be pre-specified.",
        "alt_text": "Gantt timeline of a 30-day fill and a 10-day mid-supply admission, showing how carry-over, censor, and treat-as-gap policies produce different covered-day patterns and PDC values for the identical underlying data.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "is_variant_of",
        "target_slug": "exposure-episode-construction-rwe",
        "notes": "Inpatient bridging is the sub-rule of exposure-episode construction that governs days during hospitalizations."
      },
      {
        "relation_type": "used_with",
        "target_slug": "pdc-proportion-of-days-covered",
        "notes": "The bridging policy directly changes the PDC numerator/denominator; PDC cannot be computed without resolving it."
      },
      {
        "relation_type": "used_with",
        "target_slug": "grace-period-gap-rules-rwe",
        "notes": "Bridging interacts with permissible-gap/grace-period rules; both decide whether a stretch of no-fill days counts as a gap."
      },
      {
        "relation_type": "used_with",
        "target_slug": "stockpiling-carryover-rules-rwe",
        "notes": "Carry-over bridging and stockpiling/carry-over both determine how supply is allocated across days; they must be specified together to avoid double-counting."
      },
      {
        "relation_type": "affects",
        "target_slug": "persistence-time-to-discontinuation",
        "notes": "Treat-as-gap bridging can fabricate discontinuations and shorten measured persistence; carry-over preserves the episode across the stay."
      },
      {
        "relation_type": "see_also",
        "target_slug": "immortal-time-bias-handling",
        "notes": "Asymmetric bridging (one arm/drug only) manufactures immortal, covered person-time and biases comparative estimates; apply the rule identically across arms."
      },
      {
        "relation_type": "see_also",
        "target_slug": "hospitalization-transfer-collapse-rwe",
        "notes": "Both rely on correctly reconstructing admission/discharge intervals from facility claims, including collapsing transfers into a single contiguous stay."
      }
    ],
    "aliases": [
      "inpatient bridging exposure",
      "inpatient medication bridging",
      "hospitalization exposure bridging",
      "inpatient stay overlap rules",
      "in-hospital exposure continuation"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta",
      "pqa-cms"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "instrumental-variables-pharmacoepi-rwe",
    "name": "Instrumental Variables in Pharmacoepidemiology",
    "short_definition": "A causal-inference method that uses an instrument — a variable affecting treatment receipt but with no pathway to the outcome except through treatment and no shared cause with it (e.g., physician prescribing preference, formulary/policy shocks, distance, genetic variants) — to identify a treatment effect under unmeasured confounding.",
    "long_description": "**Instrumental variable (IV) analysis** identifies a treatment effect from observational data without requiring that all\nconfounders be measured. It works by exploiting a third variable Z — the *instrument* — that perturbs treatment A but is\nunrelated to the outcome Y except through A. In pharmacoepidemiology the appeal is direct: confounding by indication is\noften severe and partly unmeasurable (frailty, disease activity, prescriber gestalt), so adjustment-based methods\n(propensity scores, regression) cannot fully remove it. A valid instrument sidesteps the unmeasured confounder U entirely\nby introducing variation in treatment that is, by assumption, independent of U. Canonical candidate instruments are\n**physician/facility prescribing preference** (the prior patient's or the prescriber's recent share of drug A), **policy or\nformulary shocks** (tier changes, prior-authorization rollout, coverage gaps, calendar-time launch discontinuities),\n**geographic/distance** instruments (distance to a center able to deliver the treatment), and **Mendelian randomization**\n(a germline variant proxying lifelong exposure).\n\n**Core conceptual distinction.** IV trades *measured-confounding* assumptions for *instrument-validity* assumptions, and\nthese are not symmetric in how checkable they are. Three conditions must hold: (1) **relevance** — Z genuinely shifts A\n(testable: the first-stage F-statistic / partial R²); (2) **independence (exchangeability)** — Z shares no common cause with\nY and is as-good-as-randomly assigned with respect to U (largely untestable; this is where preference IVs usually break via\nreferral/channeling); (3) **exclusion restriction** — Z affects Y *only* through A, with no direct or alternative pathway\n(untestable; a policy that changes copays also changes monitoring and adherence, violating it). 
Under these three, IV\nidentifies a *bound*; to get a *point* estimate you add a fourth assumption — either **monotonicity** (no \"defiers\"), under\nwhich the estimand is the **local average treatment effect (LATE) among compliers** (patients whose treatment is moved by\nthe instrument), or **effect homogeneity**, under which it equals the ATE. The complier LATE is the honest default reading:\nit is *not* the ATE and *not* the ATT, the complier subpopulation cannot be enumerated from data, and it shifts as the\ninstrument changes — a fact that must be stated in the estimand, not buried.\n\n**Pros, cons, and trade-offs**\n- **vs active-comparator-new-user (ACNU) + propensity scores:** PS/ACNU is transparent, well-understood by regulators, and\n  yields an interpretable ATT/ATE — but it is helpless against *unmeasured* confounding by indication, which is the dominant\n  threat in claims. IV can in principle remove unmeasured confounding. Cost: IV answers a *complier* estimand, has far wider\n  confidence intervals, and a weak or invalid instrument can be *more* biased than the adjusted regression it was meant to\n  rescue (weak-instrument bias pulls the IV estimate toward the confounded OLS/PS estimate while inflating variance).\n  **Prefer ACNU+PS** as the primary analysis for almost all comparative questions; reserve IV for when residual indication\n  bias is plausibly large and a genuinely strong, defensible instrument exists.\n- **vs negative-control / E-value sensitivity analysis:** negative controls and E-values *quantify or falsify* the\n  robustness of a confounded estimate but cannot identify a causal effect; IV attempts identification under a different\n  (and stronger) set of assumptions. They are complements: use negative-control outcomes and balance-by-instrument as\n  *falsification* tests of an IV, not as proof of validity. 
**Prefer IV** only when its assumptions are clinically credible\n  *and* its first stage is strong; otherwise report the confounded estimate with an E-value.\n- **2SLS vs two-stage residual inclusion (2SRI)/control function:** two-stage least squares is consistent for linear/additive\n  contrasts and risk differences but is *not* generally valid on the hazard-ratio or odds-ratio scale (non-collapsibility).\n  2SRI/control-function or additive-hazard (Aalen) IV models are preferred for binary and time-to-event outcomes; report the\n  estimand on the scale the model actually targets, not a convenient transformation of it.\n\n**When to use** Comparative effectiveness/safety where (a) confounding by indication is severe and partly unmeasured, (b) a\ncandidate instrument is biologically/administratively plausible and *strong* (rule of thumb: first-stage F well above 10,\nideally far higher for adequate precision), and (c) you can defend independence and exclusion on substantive grounds and\ntry to falsify them with negative controls and balance-by-instrument. IV shines for short-term drug effects where preference is\nstable, and for natural experiments (a formulary delisting, a black-box-warning-driven prescribing shift) that mimic\nrandomization.\n\n**When NOT to use — and when it is actively misleading or dangerous**\n- **Weak instrument.** A small first-stage F (e.g., < 10; in practice a much higher F is needed for tolerable bias) makes the\n  IV estimate biased toward the confounded estimate with enormous variance — strictly worse than the regression you were\n  trying to fix. Report the first stage *before* the second-stage effect; if it is weak, stop.\n- **Channeling / referral violates independence.** If high-risk patients are referred to drug-A-preferring prescribers\n  (frailer patients seen by specialists who favor the newer agent), preference correlates with U and the IV is biased in an\n  unknown direction. 
Preference IVs are most dangerous precisely when indication bias is worst.\n- **Exclusion is implausible.** A policy instrument that also changes copays, monitoring, adherence, or co-prescribing has a\n  direct path to Y; the \"instrument\" then estimates a tangle of effects. Mendelian instruments fail exclusion under\n  pleiotropy.\n- **The policy question is the ATE/ATT.** If decision-makers need the population-average or treated-average effect, a LATE\n  among an unidentifiable complier subgroup may be the wrong target — and silently reporting it as \"the\" treatment effect is\n  misleading.\n\n**Data-source operational depth**\n- **Claims (FFS vs MA):** Preference IVs require a stable prescriber/facility identifier (rendering or prescribing NPI) and\n  enough prior initiations per provider to estimate a non-noisy preference; providers with few patients give a noisy\n  instrument that behaves like a weak one. The preference must be *time-updated and lagged* (e.g., the provider's drug-A\n  share over the *prior* N initiations excluding the index patient) to avoid leaking future information. Critical failure\n  mode: **Medicare Advantage person-time lacks fee-for-service claims**, so a provider's apparent prescribing mix is\n  computed on a non-random, FFS-only subset of their panel — biasing the instrument; restrict preference construction to\n  enrollees with observable Part A/B/D and exclude MA-only person-time. 
**Differential competing risks by exposure in the\n  elderly** (the newer drug is channeled to frailer patients with higher death rates) corrupt both the outcome and any\n  survival-scale IV — use cause-specific or additive-hazard IV and a competing-events sensitivity analysis.\n- **EHR:** Clinician practice style and facility are often available and richer than claims, but site-level severity and\n  referral patterns are strong independence violations, and visit-driven capture means the instrument and outcome are\n  differentially observed for patients who leave the system. Link to pharmacy fills to confirm the instrument actually moved\n  *dispensing*, not just *ordering*.\n- **Registry:** Genetic (Mendelian) and facility instruments may be available; registries are strong for outcomes/severity\n  but usually weak for complete exposure — link to claims for the full fill history that defines treatment A.\n- **Linked claims–EHR–vital records:** Best substrate (EHR severity to *test* independence, claims completeness for exposure,\n  a death index for competing risks), but linkage selection and order/fill/service-date discrepancies must be reconciled\n  before the instrument and time zero are assigned. Watch **immortal time in procedure/initiation studies**: defining the\n  instrument or exposure using events that can only occur *after* a period of guaranteed survival manufactures bias that no\n  IV can repair.\n\n**Worked claims example (physician-preference IV).** Question: 90-day risk of major bleeding with apixaban vs warfarin among\nadults newly initiating oral anticoagulation for atrial fibrillation in a commercial + Medicare FFS database, where frailty\n(an unmeasured confounder) drives both drug choice and bleeding. 
(1) **Cohort:** age ≥18, an AF diagnosis in the baseline\nwindow, and 365 days of continuous medical + pharmacy enrollment with *observable FFS claims* (exclude MA-only person-time)\nbefore the index fill; new-user washout = no oral anticoagulant fill in the prior 365 days; index_date = first qualifying\nfill (`fill_date`); arm assigned from the NDC dispensed that day. (2) **Instrument:** for each index patient, identify the\nprescribing NPI and compute that prescriber's apixaban share over their *previous* qualifying initiations (lagged, excluding\nthe index patient); require ≥5 prior initiations or the preference is too noisy. A common binary form: prescriber's most\nrecent prior patient received apixaban (1) vs warfarin (0). (3) **First stage:** regress `treated` (apixaban=1) on the\npreference instrument plus *measured* covariates and report the F-statistic — proceed only if it is comfortably strong.\n(4) **Falsification:** tabulate measured baseline covariates (age, CHA₂DS₂-VASc components, prior bleeds, renal codes,\nutilization) *across instrument levels* — strong imbalance signals an independence violation (channeling); add a\nnegative-control outcome (e.g., an event neither drug should affect) and confirm the instrument has no effect on it.\n(5) **Estimation:** because bleeding is a time-to-event outcome subject to the **competing risk of death** (differentially\nhigher in frailer warfarin recipients), prefer an additive-hazard (Aalen) IV or a 2SRI control-function model over naïve\n2SLS on a hazard ratio; if reporting a 90-day risk difference, 2SLS on the binary 90-day indicator is defensible. 
(6)\n**Interpretation:** the estimate is the LATE among *preference compliers* — patients whose drug was determined by their\nprescriber's leaning — not the ATE; state this, report the first-stage F and the balance-by-instrument table, and run\nsensitivity analyses on the preference lag length, the ≥5-initiation threshold, and competing-risk handling.\n\n**Interpreting the output**\n\nUsing the prescribing-preference IV study above, the two-stage least-squares (2SLS) procedure yields a local\naverage treatment effect (LATE). In the six-patient illustration, the instrument-driven contrast among compliers\ncorresponds to a risk difference of −1.00 (illustrative only; real studies require thousands of patients for\na stable estimate).\n\nFormal interpretation: The IV estimate is the LATE among compliers — patients whose treatment choice was\ndetermined by the prescriber's preference instrument rather than by their own clinical severity or preferences.\nIt is not the ATE across all patients, nor the effect for always-takers or never-takers. Validity requires all\nthree IV assumptions, plus monotonicity for the LATE reading: relevance (the instrument predicts treatment, with first-stage F-statistic substantially\nabove 10); independence (the prior patient's drug choice shares no hidden cause with the current patient's\nbleeding risk); and the exclusion restriction (the only causal pathway from the instrument to the outcome runs\nthrough treatment received). A violation of independence — for example, if frailer patients cluster\nwith warfarin-preferring prescribers — biases the estimate in a direction that cannot be determined from the data\nalone. The confidence interval is wider than that of a propensity-adjusted estimate because IV uses only the\ninstrument-explained variation in treatment.\n\nPractical interpretation: Apixaban appeared to reduce 90-day major bleeding among patients whose drug was\ndetermined by their prescriber's habit rather than by their own clinical profile. 
This protection does not\nnecessarily extend to all AF patients initiating anticoagulation — the complier population (those swayed by\nprescriber preference) may be systematically different from always-takers or never-takers.",
    "primary_category": "Causal_Inference_Method",
    "tags": [
      "instrumental-variables",
      "physician-preference",
      "unmeasured-confounding",
      "weak-instrument",
      "2sls",
      "control-function",
      "local-average-treatment-effect",
      "mendelian-randomization",
      "monotonicity"
    ],
    "applies_to_study_types": [
      "cohort_retrospective",
      "claims_analysis",
      "ehr_study",
      "linked_data",
      "cer_observational"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked",
      "multi-database"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1097/01.ede.0000222409.00878.37",
        "url": "https://doi.org/10.1097/01.ede.0000222409.00878.37",
        "citation_text": "Hernán MA, Robins JM. Instruments for causal inference: an epidemiologist's dream? Epidemiology. 2006;17(4):360-372.",
        "year": 2006,
        "authors_short": "Hernán & Robins",
        "notes": "Canonical epidemiologic statement of the IV assumptions (relevance, exchangeability, exclusion, monotonicity) and why IV is powerful but fragile, including the complier LATE interpretation."
      },
      {
        "role": "explain",
        "doi": "10.1097/01.ede.0000193606.58671.c5",
        "url": "https://doi.org/10.1097/01.ede.0000193606.58671.c5",
        "citation_text": "Brookhart MA, Wang PS, Solomon DH, Schneeweiss S. Evaluating short-term drug effects using a physician-specific prescribing preference as an instrumental variable. Epidemiology. 2006;17(3):268-275.",
        "year": 2006,
        "authors_short": "Brookhart et al.",
        "notes": "Foundational pharmacoepidemiology paper operationalizing physician prescribing preference as an instrument and showing reduced confounding for short-term NSAID GI effects."
      },
      {
        "role": "demonstrate",
        "doi": "10.2202/1557-4679.1072",
        "url": "https://doi.org/10.2202/1557-4679.1072",
        "citation_text": "Brookhart MA, Schneeweiss S. Preference-based instrumental variable methods for the estimation of treatment effects: assessing validity and interpreting results. International Journal of Biostatistics. 2007;3(1):Article 14.",
        "year": 2007,
        "authors_short": "Brookhart & Schneeweiss",
        "notes": "Practical guidance on constructing preference instruments, assessing validity via balance-by-instrument, and interpreting the resulting complier estimand in outcomes research."
      },
      {
        "role": "use",
        "doi": "10.1002/sim.6128",
        "url": "https://doi.org/10.1002/sim.6128",
        "citation_text": "Baiocchi M, Cheng J, Small DS. Instrumental variable methods for causal inference. Statistics in Medicine. 2014;33(13):2297-2340.",
        "year": 2014,
        "authors_short": "Baiocchi et al.",
        "notes": "Comprehensive applied tutorial covering 2SLS, weak instruments, monotonicity/LATE, and study-design strategies for finding and defending instruments in medical research."
      }
    ],
    "plain_language_summary": "Instrumental variables (IV) is a method for estimating how much a treatment actually causes an outcome when hidden patient factors — things like frailty or disease severity that researchers cannot measure — push sicker patients toward one drug and also make bad outcomes more likely. The trick is to find a third variable, called an instrument, that nudges which drug a patient receives but has absolutely no other connection to the outcome. By isolating only the treatment variation driven by that instrument, IV gives an estimate of the treatment effect that is free of the hidden confounding. The method requires strong assumptions that can rarely be fully verified, so it is reserved for situations where standard adjustment methods are not enough.",
    "key_terms": [
      {
        "term": "instrument",
        "definition": "A variable that changes which treatment a patient receives but has no direct effect on the outcome and shares no hidden cause with it — for example, a physician who simply tends to prescribe Drug A more often than Drug B."
      },
      {
        "term": "exclusion restriction",
        "definition": "The rule that the instrument can only affect the outcome by changing treatment, with no back-door path of its own — if a policy that shifts prescribing also changes copays or monitoring, this rule is broken."
      },
      {
        "term": "local average treatment effect",
        "definition": "The effect of treatment estimated only among the patients whose drug choice was actually changed by the instrument — it is not the average effect across all patients, and the exact group of affected patients cannot be listed from the data."
      },
      {
        "term": "confounding by indication",
        "definition": "Bias that occurs when physicians prescribe a drug specifically because a patient is sicker or frailer, making it appear the drug causes worse outcomes when the underlying illness is actually to blame."
      }
    ],
    "worked_example": {
      "scenario": "A researcher wants to know whether apixaban causes fewer major bleeds than warfarin among patients newly starting a blood thinner for atrial fibrillation. The problem: frailer patients are more likely to receive warfarin (confounding by indication), and frailty also raises bleeding risk. Frailty is not recorded in claims data, so propensity-score adjustment cannot fully remove the bias. The researcher uses physician prescribing preference as an instrument: for each new patient, she looks at what blood thinner that patient's prescriber gave their previous qualifying patient. This prior-patient drug choice is used as the instrument because it reflects the prescriber's personal habit rather than anything about the current patient.",
      "dataset": {
        "caption": "Each row is a new atrial-fibrillation patient. The instrument (Z) is the drug the prescriber gave their immediately prior patient. Treatment (A) is the drug this patient actually received. Outcome (Y) is major bleed within 90 days (1=yes).",
        "columns": [
          "person_id",
          "prescriber_id",
          "instrument_Z_prior_drug",
          "treatment_A_received",
          "outcome_Y_bleed_90d"
        ],
        "rows": [
          [
            "pt-001",
            "dr-11",
            "apixaban",
            "apixaban",
            0
          ],
          [
            "pt-002",
            "dr-11",
            "apixaban",
            "apixaban",
            0
          ],
          [
            "pt-003",
            "dr-22",
            "warfarin",
            "warfarin",
            1
          ],
          [
            "pt-004",
            "dr-22",
            "warfarin",
            "warfarin",
            1
          ],
          [
            "pt-005",
            "dr-33",
            "apixaban",
            "warfarin",
            0
          ],
          [
            "pt-006",
            "dr-33",
            "warfarin",
            "apixaban",
            0
          ]
        ]
      },
      "steps": [
        "Assumption 1 — Relevance: the instrument must actually shift treatment. In this dataset, when the prior patient received apixaban, the current patient received apixaban 3 out of 4 times (75%); when the prior patient received warfarin, the current patient received warfarin 3 out of 4 times (75%). The instrument is correlated with treatment received, satisfying relevance. In a real study this is tested statistically: the first-stage F-statistic should be well above 10.",
        "Assumption 2 — Independence: the prior patient's drug was not chosen because of anything about the current patient, so the instrument shares no hidden cause with the current patient's outcome. This is the hardest assumption. It would be violated if, for example, frailer patients were consistently referred to the same prescribers who favor warfarin — then the instrument would be correlated with frailty (the hidden confounder), not just prescriber habit.",
        "Assumption 3 — Exclusion restriction: the only way the prior patient's drug choice can affect the current patient's bleeding risk is by influencing which drug the current patient actually takes. It has no direct biological or social pathway to the current patient's outcome.",
        "Because all three assumptions hold by design in this toy example, the variation in treatment that is explained by the instrument is treated as free of unmeasured confounding. In a two-stage estimation, the first stage predicts each patient's treatment from the instrument; the second stage estimates the effect of that predicted (instrument-driven) treatment on bleeding. The result is called the local average treatment effect (LATE) among compliers — patients whose drug was determined by the prescriber's habit."
      ],
      "result": "In this illustration: among the 4 patients whose treatment matched the instrument's direction (pts 001, 002, 003, 004), 2 out of 4 had a bleed in the warfarin group (pts 003, 004) versus 0 out of 2 in the apixaban group (pts 001, 002). The IV approach attributes this difference to the treatment, not to frailty, because the instrument — not patient severity — drove the prescribing. The estimated risk difference is −1.00 (0 bleeds in the apixaban arm versus 2 in the warfarin arm, each arm having 2 compliers) in this tiny example (illustrative only; real studies need thousands of patients for a stable estimate). The honest label for this number is the LATE among compliers — patients for whom the prescriber's habit changed their drug — not the average effect across all patients."
    },
    "prerequisites": [
      "dags-backdoor-criterion-drug-studies",
      "active-comparator-new-user",
      "propensity-score-methods-psm-iptw"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Physician or facility preference IV",
        "description": "Uses a prescriber's or site's recent treatment mix as an instrument for the index patient's treatment; the most common preference IV in claims/EHR.",
        "edge_cases": [
          "Referral and channeling of sicker patients to particular prescribers violate independence.",
          "Preference must be lagged and time-updated (prior initiations, excluding the index patient) to avoid future-information leakage.",
          "Few prior initiations per provider yield a noisy preference that behaves like a weak instrument."
        ],
        "data_source_notes": "claims: define the provider from the index prescribing/rendering NPI and compute preference only over FFS-observable prior initiations; exclude MA-only person-time that hides part of the provider panel."
      },
      {
        "name": "Policy / formulary / natural-experiment IV",
        "description": "Uses coverage tiers, prior-authorization rollout, formulary delisting, black-box-warning shocks, or calendar-time launch discontinuities as exogenous shifts in treatment.",
        "edge_cases": [
          "Policies that also change copays, monitoring, adherence, or co-prescribing violate the exclusion restriction.",
          "Secular trends co-occurring with the policy can masquerade as the instrument's effect; use interrupted-time-series-style controls."
        ]
      },
      {
        "name": "Two-stage residual inclusion (2SRI) / control function for nonlinear outcomes",
        "description": "For binary or time-to-event outcomes where 2SLS is misaligned with the scale, include the first-stage residual as a covariate in the outcome model to recover a consistent conditional effect.",
        "edge_cases": [
          "The estimand and scale (risk difference vs hazard ratio vs odds ratio) must be stated explicitly; hazard-ratio IV is not generally valid via plain 2SLS."
        ]
      },
      {
        "name": "Mendelian randomization",
        "description": "Uses a germline genetic variant as a lifelong-exposure instrument; valid only if the variant has no pleiotropic pathway to the outcome.",
        "edge_cases": [
          "Horizontal pleiotropy violates the exclusion restriction; population stratification violates independence."
        ]
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "active-comparator-new-user",
        "pros_of_this": "Can remove unmeasured confounding by indication, which propensity-score adjustment on an ACNU cohort cannot.",
        "cons_of_this": "Answers a complier LATE rather than the ATT/ATE, has much wider confidence intervals, and a weak or invalid instrument can be more biased than the adjusted estimate it was meant to replace.",
        "when_to_prefer": "Residual unmeasured indication bias is plausibly large AND a strong, substantively defensible instrument exists; otherwise prefer ACNU + propensity scores as the primary analysis."
      },
      {
        "compared_to": "e-value-sensitivity-analysis",
        "pros_of_this": "Attempts actual identification of a causal effect under instrument-validity assumptions, rather than only quantifying how strong unmeasured confounding would have to be to explain away an association.",
        "cons_of_this": "Replaces untestable no-unmeasured-confounding with the equally untestable exclusion/independence assumptions, and adds substantial variance.",
        "when_to_prefer": "A credible, strong instrument is available; otherwise report the adjusted estimate with an E-value."
      },
      {
        "compared_to": "negative-control-outcomes-exposures",
        "pros_of_this": "Negative controls can falsify part of the IV story (no instrument effect on a null outcome) but cannot prove validity; IV provides an estimate they cannot.",
        "cons_of_this": "Both require substantive subject-matter knowledge; negative controls remain necessary as a falsification test of any IV.",
        "when_to_prefer": "Use negative controls as a diagnostic *within* an IV analysis rather than as a substitute for it."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Build the prescriber/facility preference from the index rendering/prescribing NPI, lagged over prior FFS-observable initiations (exclude the index patient and MA-only person-time, which hides part of the provider panel). Require enough prior initiations per provider or the instrument is noisy/weak. For survival outcomes, account for differential competing risk of death by exposure in the elderly with cause-specific or additive-hazard IV.",
      "ehr": "Clinician/site practice style is often available and richer than claims, but site-level severity and referral patterns are strong independence violations and visit-driven capture makes the instrument and outcome differentially observed; link to dispensing to confirm the instrument moved fills, not just orders.",
      "registry": "Genetic (Mendelian) and facility instruments may be available; registries are strong for outcomes/severity but weak for complete exposure — link to claims for the fill history that defines treatment.",
      "linked": "Linked claims-EHR-vital-records is the best substrate (EHR severity to test independence, claims completeness for exposure, a death index for competing risks), but linkage selection and order/fill/service-date discrepancies must be reconciled before assigning the instrument and time zero; beware immortal time in procedure/initiation studies."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\nimport numpy as np\nfrom linearmodels.iv import IV2SLS\n\nMIN_PRIOR = 5  # providers with fewer prior initiations give a noisy (weak-acting) instrument\n\ndef add_preference_instrument(cohort: pd.DataFrame, inits: pd.DataFrame) -> pd.DataFrame:\n    # Lagged provider preference = share of study drug among that prescriber's earlier initiations.\n    inits = inits.sort_values([\"prescriber_npi\", \"fill_date\"])\n    inits[\"cum_treated\"] = inits.groupby(\"prescriber_npi\")[\"treated\"].cumsum() - inits[\"treated\"]\n    inits[\"cum_n\"]       = inits.groupby(\"prescriber_npi\").cumcount()            # count of PRIOR initiations\n    inits[\"pref\"]        = np.where(inits[\"cum_n\"] >= MIN_PRIOR,\n                                    inits[\"cum_treated\"] / inits[\"cum_n\"], np.nan)\n    key = [\"prescriber_npi\", \"fill_date\", \"treated\"]\n    out = cohort.merge(inits[key + [\"pref\"]],\n                       left_on=[\"prescriber_npi\", \"index_date\", \"treated\"],\n                       right_on=key, how=\"left\").drop(columns=[\"fill_date\"])\n    return out.dropna(subset=[\"pref\"])\n\ndef fit_preference_iv(df: pd.DataFrame, covars=(\"age\", \"cci\")):\n    const = pd.Series(1.0, index=df.index, name=\"const\")\n    exog = pd.concat([const, df[list(covars)]], axis=1)  # measured covariates enter both stages\n    # 2SLS: outcome ~ [treated endogenous, instrumented by pref] + exogenous covariates.\n    # linearmodels reports the first-stage partial F in res.first_stage; read it BEFORE the effect.\n    res = IV2SLS(dependent=df[\"outcome\"], exog=exog,\n                 endog=df[[\"treated\"]], instruments=df[[\"pref\"]]).fit(cov_type=\"robust\")\n    return res\n\n# res = fit_preference_iv(add_preference_instrument(cohort, inits))\n# print(res.first_stage)        # partial F on `pref`; proceed only if comfortably strong (>> 10)\n# print(res.params[\"treated\"])  # LATE among preference compliers (NOT the ATE)",
        "description": "Physician-preference IV for a continuous or risk-difference (linear-probability) estimand. Required inputs (cleaned,\none row per new initiator):\n  cohort : person_id, prescriber_npi, index_date, treated (1=study drug, 0=comparator),\n           outcome (continuous OR 0/1 event within a fixed window), and baseline covariates (age, cci, ...)\n  inits  : prior initiation history -> prescriber_npi, fill_date, treated   # ALL initiations, to build lagged preference\nBuild the instrument as the prescriber's drug share over their PRIOR initiations only (lagged, excluding the index\npatient) so no future information leaks. Always inspect the first-stage F before reading the second-stage effect; a weak\ninstrument makes this estimate worse than ordinary regression.",
        "dependencies": [
          "pandas",
          "numpy",
          "linearmodels"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\nlibrary(ivreg)\nMIN_PRIOR <- 5L\n\nadd_preference_instrument <- function(cohort, inits) {\n  setDT(cohort); setDT(inits)\n  setorder(inits, prescriber_npi, fill_date)\n  # Lagged share of study drug among each prescriber's PRIOR initiations (exclude current row).\n  inits[, cum_treated := cumsum(treated) - treated, by = prescriber_npi]\n  inits[, cum_n       := seq_len(.N) - 1L,           by = prescriber_npi]\n  inits[, pref := fifelse(cum_n >= MIN_PRIOR, cum_treated / cum_n, NA_real_)]\n  out <- merge(cohort, inits[, .(prescriber_npi, fill_date, treated, pref)],\n               by.x = c(\"prescriber_npi\", \"index_date\", \"treated\"),\n               by.y = c(\"prescriber_npi\", \"fill_date\",  \"treated\"), all.x = TRUE)\n  out[!is.na(pref)]\n}\n\nfit_preference_iv <- function(df) {\n  # outcome ~ treated + covars | pref + covars  (treated instrumented by preference)\n  fit <- ivreg(outcome ~ treated + age + cci | pref + age + cci, data = df)\n  summary(fit, diagnostics = TRUE)  # weak-instrument F, Wu-Hausman, Sargan\n}\n\n# df  <- add_preference_instrument(cohort, inits)\n# fit_preference_iv(df)             # coefficient on `treated` = complier LATE",
        "description": "Physician-preference IV with explicit weak-instrument and Wu-Hausman diagnostics. Inputs mirror the Python version:\n  cohort : person_id, prescriber_npi, index_date (Date), treated (0/1), outcome (continuous or 0/1), age, cci\n  inits  : prescriber_npi, fill_date (Date), treated (0/1)   # all initiations, for the lagged preference\nivreg's summary(diagnostics=TRUE) prints the weak-instrument F, the Wu-Hausman endogeneity test, and (with >1\ninstrument) the Sargan overidentification test. Read the first-stage F first.",
        "dependencies": [
          "data.table",
          "ivreg"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "%let min_prior = 5;\n\n/* Lagged provider preference = study-drug share among the prescriber's PRIOR initiations. */\nproc sort data=work.inits; by prescriber_npi fill_date; run;\ndata pref;\n  set work.inits;\n  by prescriber_npi;\n  retain cum_treated cum_n;\n  if first.prescriber_npi then do; cum_treated = 0; cum_n = 0; end;\n  if cum_n >= &min_prior then pref = cum_treated / cum_n;   /* uses only EARLIER initiations */\n  else pref = .;\n  output;\n  cum_treated + treated;   /* update AFTER output so the current patient is excluded */\n  cum_n + 1;\nrun;\n\nproc sql;\n  create table analytic as\n  select c.*, p.pref\n  from work.cohort c\n  inner join pref p\n    on c.prescriber_npi = p.prescriber_npi\n   and c.index_date     = p.fill_date\n   and c.treated        = p.treated\n  where p.pref is not null;\nquit;\n\n/* 2SLS for a continuous or 0/1 (risk-difference) outcome; FIRST prints the first-stage F. */\nproc syslin data=analytic 2sls first;\n  endogenous treated;\n  instruments pref age cci;\n  model outcome = treated age cci;      /* coefficient on treated = complier LATE */\nrun;\n\n/* Time-to-event: 2SRI / control function. Stage 1 residual carries the unmeasured-confounding signal. */\nproc reg data=analytic noprint;\n  model treated = pref age cci;\n  output out=cf residual=cf_resid;      /* first-stage residual */\nrun;\nproc phreg data=cf;                      /* additive of cf_resid recovers a consistent HR-scale effect */\n  model fu_time*event(0) = treated cf_resid age cci / rl;\nrun;",
        "description": "Physician-preference IV in SAS: build the lagged instrument with a DATA step, run 2SLS for a continuous/risk-difference\nestimand with PROC SYSLIN (FIRST option prints the first-stage F), and use a control-function (2SRI) approach via\nPROC PHREG for a time-to-event outcome. Required inputs (post data-management):\n  work.cohort : person_id prescriber_npi index_date treated outcome age cci  (+ time fu_time, event for survival)\n  work.inits  : prescriber_npi fill_date treated   (all initiations, sorted)\nAlways inspect the first-stage F (SYSLIN FIRST) before interpreting the second-stage coefficient.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "graph LR\n  Z[\"Instrument Z<br/>(physician preference / policy shock)\"] --> A[\"Treatment A\"]\n  A --> Y[\"Outcome Y\"]\n  U[\"Unmeasured confounder U<br/>(frailty, disease severity)\"] --> A\n  U --> Y\n  Z -. \"independence violation:<br/>referral / channeling\" .-> U\n  Z -. \"exclusion violation:<br/>direct path (copays, monitoring)\" .-> Y\n  style U fill:#fee2e2,stroke:#b91c1c\n  style Z fill:#e0f2fe,stroke:#0369a1\n  style A fill:#ecfdf5,stroke:#047857",
        "caption": "A valid instrument moves treatment (relevance) but has no shared cause with the outcome (independence) and no pathway to the outcome except through treatment (exclusion). The two dashed red arrows are the failure modes that make preference and policy IVs dangerous.",
        "alt_text": "Causal diagram showing instrument Z affecting treatment A which affects outcome Y, an unmeasured confounder U affecting both A and Y, and two dashed violation arrows from Z to U (independence) and Z to Y (exclusion).",
        "source_type": "illustrative",
        "source_citations": [
          "hernan-2006"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Q[Severe, partly unmeasured confounding by indication?] -->|No| ACNU[Use ACNU + propensity scores]\n  Q -->|Yes| Inst[Credible instrument available?]\n  Inst -->|No| EV[Report adjusted estimate + E-value / negative controls]\n  Inst -->|Yes| F[First-stage F strong?]\n  F -->|No, weak| Stop[Do NOT proceed: weak-IV bias worse than regression]\n  F -->|Yes| Falsify[Balance-by-instrument + negative-control outcome OK?]\n  Falsify -->|No| Stop2[Independence/exclusion violated: do not trust IV]\n  Falsify -->|Yes| Scale[Choose model by outcome scale]\n  Scale -->|Continuous / risk difference| TSLS[2SLS]\n  Scale -->|Binary / time-to-event| CF[2SRI / control function / additive-hazard IV]\n  TSLS --> Report[Report complier LATE + first-stage F + diagnostics]\n  CF --> Report",
        "caption": "Decision logic for whether — and how — to run an IV analysis in pharmacoepidemiology. A weak first stage or a failed falsification test is a stop sign, not a footnote.",
        "alt_text": "Decision flowchart guiding the choice between ACNU plus propensity scores, an adjusted estimate with E-value, or an instrumental-variable analysis, with first-stage strength and falsification checks as gates and a scale-driven choice between 2SLS and control-function models.",
        "source_type": "illustrative",
        "source_citations": [
          "baiocchi-2014"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "alternative_to",
        "target_slug": "active-comparator-new-user",
        "notes": "ACNU + propensity scores controls measured confounding transparently but cannot remove unmeasured indication bias; IV targets the unmeasured-confounding problem at the cost of a complier estimand and wider intervals."
      },
      {
        "relation_type": "see_also",
        "target_slug": "comparative-effectiveness-research-cer-methods",
        "notes": "IV is an advanced CER design for unmeasured confounding, not a routine replacement for design quality."
      },
      {
        "relation_type": "see_also",
        "target_slug": "e-value-sensitivity-analysis",
        "notes": "E-values quantify how strong unmeasured confounding would need to be to nullify an estimate; IV attempts identification under alternative (instrument-validity) assumptions instead."
      },
      {
        "relation_type": "used_with",
        "target_slug": "negative-control-outcomes-rwe",
        "notes": "Negative-control outcomes are a key falsification test of an instrument (a valid instrument should not affect a null outcome), though they cannot prove instrument validity."
      },
      {
        "relation_type": "see_also",
        "target_slug": "baseline-characteristics-and-covariate-balance-rwe",
        "notes": "Balance of measured covariates across instrument levels is a primary diagnostic for the independence assumption; strong imbalance signals channeling/referral."
      }
    ],
    "aliases": [
      "IV",
      "2SLS",
      "two-stage least squares",
      "physician preference IV",
      "preference-based instrumental variable",
      "prescriber preference instrument",
      "Mendelian randomization",
      "instrumental variable analysis",
      "IV pharmacoepidemiology"
    ],
    "complexity": "advanced",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "international-rwd-sources",
    "name": "International Real-World Data Sources",
    "short_definition": "The major non-US real-world data families and what each is best suited for: UK CPRD (primary care linkable to HES hospital data and ONS death, prescribing-based exposure), OpenSAFELY's trusted research environment, the Nordic national registries (Denmark, Sweden, Norway, Finland — personal-identifier lifetime linkage, dispensing-based exposure, near-complete population coverage — the global benchmark for long-horizon drug safety), Germany's GePaRD, France's SNDS, the Netherlands' PHARMO, Japan's MID-NET, JMDC, and NDB, and the South Korean HIRA and Taiwanese NHIRD national insurance databases; all use ICD-10 WHO coding and ATC drug classification rather than ICD-10-CM and NDC, and their governance models range from trusted research environments to licensed data extracts and federated network access via DARWIN EU and EHDEN.",
    "long_description": "**The non-US real-world data landscape**\n\nOutside the United States, real-world health data are generated by single-payer national\nhealth systems, universal social insurance schemes, and population-wide administrative\nregisters — not by a patchwork of competing private payers. This structural difference\nhas two decisive consequences for research. First, disenrollment does not exist: a resident\nof Denmark, Sweden, or Taiwan remains in the register until death or emigration, so\nfollow-up is not truncated by job change, plan switching, or insurer exit. Second, drug\nexposure is captured through national prescription registries that record every dispensed\nitem with near-complete coverage. The result is a set of data resources that excel at\nlong-horizon drug safety, population incidence, and rare-exposure studies while imposing\ntheir own constraints — different coding systems, governance barriers, prescribing-versus-\ndispensing distinctions, and varying out-of-hospital drug capture — that must be understood\nbefore designing a study or transporting a finding.\n\n**United Kingdom — CPRD and OpenSAFELY**\n\nThe Clinical Practice Research Datalink (CPRD) is the UK's principal primary-care\nresearch database. It captures longitudinal GP records — diagnoses, consultations,\nreferrals, test results, and prescriptions — from practices using the Vision and Emis\nelectronic systems, covering roughly 20% of the UK population with demographics broadly\nrepresentative of England. A critical distinction: CPRD records the GP's issued prescription\n(the prescribing event), not the patient's actual pharmacy dispensation. A prescription\nthat was issued but not collected or not filled does not generate a subsequent record, so\nCPRD exposure measurement differs fundamentally from US fill-date claims or Nordic\ndispensing records. 
CPRD can be linked deterministically to Hospital Episode Statistics\n(HES, covering all NHS inpatient and outpatient care), the ONS mortality registry (Office\nfor National Statistics, providing death dates and causes), the National Cancer Registry,\nand the Index of Multiple Deprivation — enabling the richer covariate and outcome profiles\nthat primary care alone cannot supply. CPRD's strengths are its\nlongitudinal depth and socioeconomic diversity, but its prescribing-not-dispensing\nexposure definition and its restriction to GP-registered patients (immigrants, prisoners,\nand some high-turnover urban populations are under-represented) are the primary threats\nto validity in comparative-effectiveness and adherence studies.\n\nOpenSAFELY is a different model: rather than a licensed extract, it is a trusted research\nenvironment (TRE) in which approved analysts execute code against GP record data that\nnever leaves NHS servers. The result is near-real-time, near-complete primary-care\ncoverage of England (55 million patients) with no data transfer and a full audit trail,\nat the cost of constrained compute environments and a governance process tied to NHS\nEngland approvals.\n\n**The Nordic registries — the global benchmark for long-horizon safety**\n\nDenmark, Sweden, Norway, and Finland each maintain a system of linked national registers\nanchored by a unique personal identity number (PIN) that is assigned at birth or\nimmigration and used consistently across all health, social, and administrative records.\nThis single stable identifier enables lifetime individual-level linkage of the entire\npopulation, from birth or registration forward, without probabilistic matching and with\nminimal linkage error. 
The\nkey registries in each country include:\n\n- **Diagnoses**: National Patient Registries (hospital inpatient diagnoses, with outpatient\n  coverage expanding from the 1990s onward) coded in ICD-10 WHO, the international version\n  of ICD-10, not the US clinical modification ICD-10-CM.\n- **Drug dispensing**: National Prescription Databases that capture every dispensed item\n  at the pharmacy, identified by ATC (Anatomical Therapeutic Chemical) code, not NDC.\n  A Defined Daily Dose (DDD) assigned to each ATC code allows exposure-window construction\n  from pack size and quantity dispensed when the equivalent of days_supply is absent.\n  This is a dispensing record — the patient actually collected the medication — not a\n  GP prescription (unlike CPRD).\n- **Death**: National Cause-of-Death Registers with near-complete civil registration,\n  producing mortality follow-up that does not depend on claims logic.\n- **Additional registries**: civil registration (sociodemographics), cancer registries,\n  fertility and birth registers, and education/income registers available for linkage.\n\nThe Nordic systems offer near-complete population coverage with little private-care leakage\nin countries where most health care is publicly funded. The defining advantage for\npharmacoepidemiology is the absence of disenrollment censoring: in an illustrative\ncomparison, a Danish cohort of 100 new users loses participants only to death or\nemigration over a 2-year window (98 of 100 retained), while a US commercial cohort of\nthe same size loses 40 to job change or plan switch. The resulting long, uninterrupted\nfollow-up makes Nordic data the world's gold\nstandard for detecting rare, late-onset adverse drug reactions and for studying drug\neffects over years to decades. 
The primary limitation is restricted access: each country\nrequires separate data applications, typically reviewed by a national ethics board\nand a statistics authority, with data kept on secure servers inside the country.\n\n**Western Europe — GePaRD, SNDS, and PHARMO**\n\nGermany's GePaRD (German Pharmacoepidemiological Research Database) pools claims from\nfour statutory sickness funds covering roughly 20 million insured persons. Exposure is\ndispensing-based (pharmacy claims), diagnoses are coded in ICD-10 GM (German Modification,\nclose to ICD-10 WHO), and out-of-pocket drug purchases are not captured. The German\nstatutory-fund system covers approximately 90% of the population, with private insurers\ncovering higher earners separately.\n\nFrance's SNDS (Système National des Données de Santé) is one of the most comprehensive\nclaims databases in Europe, covering virtually the entire French population (67 million)\nincluding public and private sector employees. It links the national health insurance\nsystem (SNIIRAM) with the hospital discharge database (PMSI) and the national mortality\ndata, using a pseudonymized but stable identifier. Drug exposure is recorded by dispensed\nATC code. The SNDS is operated by the CNAM and requires authorization from the Health\nData Hub.\n\nThe Netherlands' PHARMO is a network of linked population-based data sources including\noutpatient pharmacy records, hospital discharges, general practitioner records, and\npathology reports in defined geographical catchment areas. 
Drug exposure is dispensing-\nbased; diagnoses in GP and hospital records are coded in ICPC-2 (GP) and ICD-10 (hospital).\nPHARMO offers the advantage of an integrated GP-pharmacy-hospital linkage for\nlongitudinal studies.\n\n**Asia — Japan, South Korea, and Taiwan**\n\nJapan's MID-NET (Medical Information Database Network) is a hospital-based EHR network\nof roughly 23 academic and general hospitals, providing rich clinical data at the cost\nof limited population representativeness and complexity of out-of-hospital care capture.\nJMDC is a commercial claims database of employer-sponsored health insurance funds\n(predominantly working-age adults), structured similarly to US commercial claims. The\nNational Database of Health Insurance Claims and Specific Health Checkups of Japan (NDB)\nis the national claims repository covering the entire insured population (essentially the\nwhole country) with drug dispensing, diagnoses, and procedure codes.\n\nSouth Korea's Health Insurance Review and Assessment Service (HIRA) database covers\nthe entire Korean population under the single-payer National Health Insurance scheme,\nwith drug dispensing records, diagnoses in KCD (Korean Classification of Diseases, ICD-10\ncompatible), and a stable national identifier. Taiwan's National Health Insurance Research\nDatabase (NHIRD) similarly provides near-universal population coverage under Taiwan's\nsingle-payer system, with drug claims, inpatient/outpatient records, and ICD-9-CM\ncoding (more recent releases use ICD-10-CM). 
Both are among the largest and most\ncomplete insurance databases in the world.\n\n**The EU federation layer — DARWIN EU, EHDEN, and OMOP-mapped networks**\n\nThe European Health Data & Evidence Network (EHDEN) and the European Medicines Agency's\nDARWIN EU (Data Analysis and Real-World Interrogation Network) are building federated\ninfrastructure that maps European data sources to a common OMOP CDM and executes\ndistributed network studies without centralizing patient-level data. This layer connects\nCPRD, SNDS, GePaRD, Nordic registries, and others under a common protocol and codeset\n— each data source retains governance control while contributing to network-level summaries.\nDARWIN EU is the EMA's active pharmacovigilance and product evaluation infrastructure,\nmaking it the regulatory gateway for post-authorization safety studies (PASS) in Europe.\nFor researchers building multi-country evidence, this OMOP-CDM-based federated model\nis the path to reproducing the analytic approach described under federated-distributed-network-analysis.\n\n**Cross-cutting contrasts with US claims data**\n\nFour structural contrasts shape every design decision when working with non-US sources:\n\n(1) *Disenrollment censoring vs single-payer completeness.* US commercial databases are\ninterrupted by job change, insurance switching, and end of employment at a rate of roughly\n20–40% per year in working-age populations; every analysis must apply a continuous-enrollment\ncriterion and treat disenrollment as informative censoring. In Nordic and East Asian\nsingle-payer systems, the follow-up record is complete for all in-country care until death\nor emigration. 
This is not merely a precision advantage — disenrollment correlated with\nhealth status (e.g., patients who become too ill to work) creates a source of dependent\ncensoring that biases US commercial cohort results in ways Nordic data avoids.\n\n(2) *Prescribing vs dispensing vs administration exposure capture.* US fill claims capture\nwhat was dispensed at the pharmacy. CPRD captures the GP's issued prescription (prescribed\nbut not necessarily collected). Nordic/GePaRD/SNDS dispensing registries capture pharmacy\ncollection (closest to US fill claims). Japanese NDB and EHR systems also capture inpatient\nadministration. Drug names are identified by ATC code internationally, not by NDC number;\nand where days_supply is absent, exposure duration must be estimated from pack size and\nquantity using the ATC/DDD framework — a fundamentally different computational step.\n\n(3) *Coding systems: ICD-10 WHO vs ICD-10-CM; ATC vs NDC.* The US uses ICD-10-CM\n(over 72,000 codes, with US-specific 7th characters and extensions) for diagnoses and\nNDC (National Drug Code, tied to specific formulations) for drugs. International databases\nuse ICD-10 WHO (approximately 15,000 codes; no 7th-character US extensions) and ATC\n(a hierarchical classification by therapeutic class and substance), so validated US\nphenotyping algorithms cannot be copied verbatim — they require adaptation to the available\ncode set. The ATC/DDD system is a route into the atc-ddd-classification concept.\n\n(4) *Governance models and access.* US commercial databases are typically licensed through\ndata use agreements with commercial vendors. Nordic countries require ethics board approval\nand data authority registration; data analysis must occur on approved national servers or\nvia a Statistics Denmark-style extract process. The OpenSAFELY and DARWIN EU models go\nfurther: the analyst's code runs inside a TRE; only results leave. 
These governance\nconstraints are not obstacles to work around — they shape study timelines and feasibility\nand must be scoped into any research plan.\n\n**Pros, cons, and trade-offs**\n\n*Nordic registries (Denmark, Sweden, Norway, Finland)*:\n- Pros: near-complete population coverage; personal identity number enables lifetime\n  individual linkage across health, mortality, and socioeconomic registers without\n  probabilistic matching; no disenrollment censoring (exit only at death or emigration);\n  dispensing-based exposure; gold standard for long-horizon safety studies.\n- Cons: restricted access requiring ethics board approval and data authority registration\n  (multi-country projects multiply governance complexity); drug exposure requires ATC/DDD\n  conversion rather than US-style days_supply; limited non-prescription and OTC drug\n  capture; GP-level data less readily available than in the UK.\n- When to prefer: long-horizon drug safety (years to decades); rare adverse events;\n  full life-course linkage; studies where disenrollment bias is a primary threat.\n\n*UK CPRD*:\n- Pros: detailed longitudinal primary-care data; socioeconomic and regional diversity;\n  linkable to HES, ONS death, and cancer registry; established research infrastructure\n  with CPRD-specific validation literature.\n- Cons: prescribing (not dispensing) exposure — a patient who did not collect a GP\n  prescription generates no subsequent data signal; GP-registered patients only;\n  representativeness limitations for urban high-turnover populations.\n- When to prefer: studies where GP consultation behavior and prescribing decisions are\n  central (rather than actual medication-taking); cardiovascular and chronic disease\n  studies with validated CPRD algorithms.\n\n*East Asian national databases (HIRA, NHIRD, NDB)*:\n- Pros: very large populations under universal coverage; near-complete inpatient and\n  outpatient capture; drug dispensing records; long follow-up periods.\n- Cons: 
coding systems differ (ICD-10 compatible but with national modifications;\n  Taiwan used ICD-9-CM in earlier years); ethnic-population homogeneity limits\n  transportability of findings to genetically diverse populations; out-of-pocket medicine\n  purchases not captured; governance varies by country.\n- When to prefer: rare disease incidence in large populations; east-Asian-specific\n  pharmacogenomic questions; comparative effectiveness in single-payer contexts.\n\n*EU federation (DARWIN EU, EHDEN)*:\n- Pros: federated OMOP-mapped infrastructure enabling multi-country comparative studies\n  without centralizing patient-level data; regulatory-grade evidence for EMA PASS;\n  harmonized code sets reduce phenotype-translation burden.\n- Cons: OMOP mapping quality varies across contributing databases; operational timelines\n  for federated studies are long; heterogeneity across contributing sources must be\n  assessed and reported; not all European databases are yet fully integrated.\n- When to prefer: multi-country regulatory safety evidence; studies where consistency\n  across European health systems is itself the scientific question.\n\n**When to use**\n\nUse an international RWD source when: (a) a long-horizon drug safety question requires\nfollow-up beyond what US commercial disenrollment will support; (b) the target population\nis defined by a non-US country's health system or disease burden; (c) a study needs\nnear-complete population coverage without disenrollment censoring (Nordic registries);\n(d) the research question requires life-course socioeconomic linkage not available in\nUS claims; (e) a regulatory submission to the EMA or an EU member-state authority requires\nPASS evidence from European sources; or (f) a multi-country federated network study is\nthe evidentiary standard (DARWIN EU / EHDEN context).\n\n**When NOT to use — and when it is actively misleading**\n\n- *Do not apply US ICD-10-CM phenotyping algorithms directly to non-US sources without\n  
adaptation.* Codes that exist in ICD-10-CM (e.g., 7th-character specifications, US-\n  specific manifestation codes) do not exist in ICD-10 WHO; directly porting a US-validated\n  algorithm to a Nordic, UK, or French database will miss cases or misclassify them, and\n  the resulting phenotype PPV and sensitivity are unknown unless re-validated.\n- *Do not treat CPRD prescribing records as equivalent to dispensing claims.* A GP\n  prescription in CPRD was issued by the doctor but may not have been filled or taken by\n  the patient; computing medication possession ratios or adherence metrics as if these were\n  fill records overstates true exposure. Where dispensing confirmation is needed, use Nordic\n  dispensing registries or linked dispensing databases.\n- *Do not assume that findings from Nordic populations transport to other populations\n  without scrutiny.* Nordic populations are relatively ethnically homogeneous, have high\n  levels of social welfare support, and use health services at rates shaped by single-payer\n  access. A treatment effect found in Denmark may not transport to a US population with\n  different comorbidity profiles, different concomitant medication use, or different\n  healthcare-seeking behavior. Route transportability questions to the\n  generalizability-transportability-external-validity-rwe concept.\n- *Do not conflate governance approval with data access.* Ethics approval and statistics-\n  authority authorization for Nordic data typically takes 6–18 months; building this into\n  the study timeline is essential. 
Extracting individual-level data outside approved channels to sidestep these\n  constraints is never legitimate.\n- *Do not pool sources that capture drug exposure at different points in the medication\n  journey (prescribing vs dispensing vs administration) without explicit harmonization.*\n  Pooling CPRD (prescribing) with a Nordic dispensing register in a federated analysis\n  conflates two different exposure constructs; the resulting network estimate mixes\n  \"was a prescription written\" with \"was a medication collected\" and has no clean causal\n  interpretation.\n\n**Interpreting the output**\n\nThe characteristic artifact from international RWD comparisons is the follow-up\ncompleteness ratio: the fraction of the original cohort that remains under observation\nat a given follow-up landmark. In the worked example, two cohorts of 100 new oral\nanticoagulant users are compared: US commercial (60 of 100 retained at 2 years;\nfollow-up retention = 60 / 100 = 0.60) and Danish national registries (98 of 100\nretained; retention = 98 / 100 = 0.98).\n\n*(1) Formal interpretation.* The US commercial cohort retains 60% of original participants\nat 2 years, yielding 60 * 2 = 120 complete person-years at risk from 100 starters (under\nthe simplification that only persons retained for the full 2 years are counted, so the\npartial follow-up of the 40 exiters is ignored). Under the same convention, the Danish\ncohort retains 98%, yielding 98 * 2 = 196 person-years. The ratio 196 / 120 = 1.63 means the Danish\ncohort provides 63% more person-years of observation from the same starting N. This\ndifference is not random noise; it is a structural property of the data-generating\nmechanism, and a US-based disenrollment-censored study that finds a null result for a\nlong-term outcome should be interpreted in light of the follow-up loss. If disenrollment\nis related to health status (sicker patients leave the workforce and lose commercial coverage),\nthe censoring is informative, biasing the hazard estimate in a way that cannot\nbe fixed by simply restricting the analysis to enrolled person-time.\n\n*(2) Practical interpretation.* For a decision-maker evaluating a long-term drug safety\nquestion, a Danish registry study and a US commercial claims study of equal starting N\nare not equivalent: the Danish study produces more follow-up, less informative censoring,\nand a more complete mortality record. A null result in the US commercial study that is\ncontradicted by a Danish positive finding should prompt investigation of whether\ndisenrollment bias, rather than biology, explains the discordance. Conversely, an effect\nseen in Denmark may not directly apply to a US payer population with different\ncomorbidities, comedications, and health-seeking behavior, so an explicit\ntransportability assessment is needed before the finding drives US formulary or prescribing decisions.",
    "primary_category": "Data_Quality_Assessment",
    "tags": [
      "international-rwd",
      "cprd",
      "nordic-registries",
      "snds",
      "gepard",
      "pharmo",
      "hira",
      "nhird",
      "darwin-eu",
      "ehden",
      "omop-cdm",
      "atc-classification",
      "icd-10-who",
      "single-payer",
      "dispensing-vs-prescribing",
      "trusted-research-environment",
      "data-completeness",
      "generalizability",
      "multi-database"
    ],
    "applies_to_study_types": [
      "cohort_retrospective",
      "cohort_prospective",
      "multi_database",
      "linked_data",
      "comparative_effectiveness",
      "target_trial_emulation",
      "claims_analysis",
      "registry_study"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1093/ije/dyv098",
        "url": "https://doi.org/10.1093/ije/dyv098",
        "citation_text": "Herrett E, Gallagher AM, Bhaskaran K, et al. Data Resource Profile: Clinical Practice Research Datalink (CPRD). International Journal of Epidemiology. 2015;44(3):827-836.",
        "year": 2015,
        "authors_short": "Herrett et al.",
        "notes": "Canonical data resource profile for CPRD, covering source population, data structure, coverage, linkage capabilities (HES, ONS mortality, cancer registry), and representativeness. Essential reading for understanding what CPRD captures (GP prescribing, not pharmacy dispensing) and the linkable subset that enables outcome and covariate enrichment."
      },
      {
        "role": "explain",
        "doi": "10.2147/CLEP.S179083",
        "url": "https://doi.org/10.2147/CLEP.S179083",
        "citation_text": "Schmidt M, Schmidt SAJ, Adelborg K, et al. The Danish health care system and epidemiological research: from health care contacts to database records. Clinical Epidemiology. 2019;11:563-591.",
        "year": 2019,
        "authors_short": "Schmidt et al.",
        "notes": "Comprehensive overview of the Danish national health registers, the personal identity number linkage system, and the quality and completeness of the Danish National Patient Registry, National Prescription Database, Cause-of-Death Registry, and socioeconomic registers. The authoritative reference for understanding Nordic registry strengths (lifetime linkage, near-complete dispensing capture, no disenrollment censoring) and the ICD-10 / ATC coding context."
      },
      {
        "role": "demonstrate",
        "doi": "10.1002/pds.1294",
        "url": "https://doi.org/10.1002/pds.1294",
        "citation_text": "Wettermark B, Hammar N, Fored CM, et al. The new Swedish Prescribed Drug Register — Opportunities for pharmacoepidemiological research and experience from the first six months. Pharmacoepidemiology and Drug Safety. 2007;16(7):726-735.",
        "year": 2007,
        "authors_short": "Wettermark et al.",
        "notes": "Data resource profile for the Swedish national prescription register, illustrating the Nordic model of universal dispensing capture identified by ATC/DDD rather than NDC/days_supply, with coverage, linkage, and validity discussion that parallels the Danish and Norwegian registers."
      }
    ],
    "plain_language_summary": "Outside the United States, most high-income countries maintain nationwide health databases that cover every resident — not just those with a particular employer or insurance plan — so patients can be followed for years or decades without losing track of them when they change jobs. The UK's CPRD database records what doctors prescribed; the Nordic countries (Denmark, Sweden, Norway, Finland) go one step further and record what patients actually picked up at the pharmacy, using a national identification number that links every hospital stay, prescription, and death certificate for the same person across their lifetime. These sources use different drug and diagnosis codes than the US (ATC codes for drugs instead of NDC numbers; international ICD-10 codes for diagnoses instead of the US clinical modification), so American algorithms cannot be copied directly — but the near-complete follow-up and absence of insurance-dropout gaps make them the global gold standard for detecting rare, late-appearing drug side effects.",
    "key_terms": [
      {
        "term": "personal identity number",
        "definition": "A unique, stable national identifier assigned to every resident at birth or immigration in Nordic countries, used consistently across all health and administrative records to link hospital visits, prescriptions, and death certificates for the same person across their entire life."
      },
      {
        "term": "prescribing vs dispensing",
        "definition": "A prescribing record (as in UK CPRD) documents that a doctor wrote a prescription, but does not confirm the patient collected or took the medication; a dispensing record (as in Nordic prescription databases) confirms the patient actually received the drug at a pharmacy."
      },
      {
        "term": "ATC classification",
        "definition": "The Anatomical Therapeutic Chemical classification system used internationally to identify drugs by their active ingredient and therapeutic use (e.g., B01AF01 = rivaroxaban). It fills the role that the US National Drug Code (NDC) fills in US claims for identifying what drug was dispensed, although an ATC code identifies the substance rather than a specific formulation or package."
      },
      {
        "term": "ICD-10 WHO",
        "definition": "The World Health Organization's international version of ICD-10, used in Europe and most of the world for diagnosis coding; it contains approximately 15,000 codes and lacks the US-specific extensions present in ICD-10-CM, so US phenotyping algorithms must be adapted before use in international databases."
      },
      {
        "term": "trusted research environment",
        "definition": "A secure computing environment in which approved analysts run code against patient-level data that never physically leaves the data holder's servers — the OpenSAFELY and DARWIN EU model — so that only analytical results (not patient records) are released."
      },
      {
        "term": "disenrollment censoring",
        "definition": "Removing a patient from a study's follow-up period because they left their insurance plan (e.g., changed jobs or switched insurer). It is a major source of informative censoring in US commercial claims and does not occur in single-payer national registries, where coverage is continuous until death or emigration."
      }
    ],
    "worked_example": {
      "scenario": "An epidemiologist wants to study 2-year persistence on a direct oral anticoagulant (DOAC) and the long-term risk of intracranial hemorrhage. She has access to a US commercial claims database (MarketScan-equivalent) and the Danish national prescription and patient registries. The table below compares the two sources on four design-critical dimensions; the steps then work through the follow-up completeness gap that changes statistical power and censoring risk.",
      "dataset": {
        "caption": "Comparison of US commercial claims versus Danish national registries across four dimensions for a 2-year DOAC persistence and safety study. Each row is one design dimension; values describe what an analyst would encounter working with that source.",
        "columns": [
          "Dimension",
          "US Commercial Claims (MarketScan)",
          "Danish National Registries"
        ],
        "rows": [
          [
            "Enrollment and coverage",
            "Employer-sponsored; ages 18-64 typical; enrollment ends when job or plan changes",
            "Universal; covers all Danish residents; coverage ends only at death or emigration"
          ],
          [
            "Drug exposure captured as",
            "Dispensing claim at pharmacy with days_supply field",
            "Dispensing record from national prescription database; drug identified by ATC code (e.g., B01AF01 for rivaroxaban); exposure duration estimated from pack size and quantity using ATC/DDD framework"
          ],
          [
            "Diagnosis coding system",
            "ICD-10-CM (US clinical modification; approximately 72,000 codes)",
            "ICD-10 WHO (international version; approximately 15,000 codes; no US 7th-character extensions)"
          ],
          [
            "Typical 2-year follow-up retention",
            "60 of 100 new users remain enrolled (40 disenroll due to job change or plan switch)",
            "98 of 100 new users remain in the registry (2 die; no disenrollment censoring exists)"
          ]
        ]
      },
      "steps": [
        "US commercial follow-up retention rate: 60 / 100 = 0.60 — only 60% of the original cohort provides 2 full years of potential follow-up.",
        "Danish registry retention rate: 98 / 100 = 0.98 — 98% remain under observation because the only exits are death or emigration.",
        "Person-years available from the US cohort (counting only the 60 retained persons at the 2-year mark): 60 * 2 = 120 person-years.",
        "Person-years available from the Danish cohort: 98 * 2 = 196 person-years.",
        "Ratio of available follow-up (Denmark vs US): 196 / 120 = 1.63 — the Danish cohort provides approximately 63% more person-years from the same starting N of 100, entirely because there is no disenrollment censoring.",
        "In the US cohort, the 40 who disenrolled are not a random subset — patients who become too ill to work (and therefore leave employer-sponsored insurance) are sicker than those who stay enrolled, making disenrollment informative rather than random. No increase in sample size fixes this bias.",
        "Drug exposure in the Danish registry must be computed from ATC code B01AF01 (rivaroxaban), the number of packs dispensed, the pack size, and the WHO defined daily dose (DDD = 20 mg for rivaroxaban). There is no days_supply field, so the exposure window is constructed as quantity * pack_size / DDD with DDD expressed in tablets per day; for 20 mg tablets, DDD = 1 tablet/day, so one pack of 28 tablets covers 28 days."
      ],
      "result": "US retention = 60 / 100 = 0.60; Danish retention = 98 / 100 = 0.98. Person-years at 2 years: US = 60 * 2 = 120; Denmark = 98 * 2 = 196. Ratio: 196 / 120 = 1.63. The Danish registry delivers 63% more usable follow-up per starting patient and avoids the disenrollment-censoring bias present in the US cohort. Both sources use dispensing-based exposure, but drug coding differs: US uses NDC numbers with days_supply; Danish uses ATC codes with quantity and pack size requiring DDD-based exposure-window construction."
    },
    "prerequisites": [
      "claims-analysis",
      "fit-for-purpose-data-assessment-rwe",
      "continuous-enrollment-observable-time-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Nordic registry study (Denmark, Sweden, Norway, Finland)",
        "description": "Individual-country or multi-Nordic study using the national patient, prescription, death, and socioeconomic registers linked by the personal identity number. Requires ethics board and statistics authority approval (typically 6–18 months per country). Drug exposure constructed from ATC code and dispensed quantity using DDD; diagnoses in ICD-10 WHO. The most complete long-horizon pharmacoepidemiology substrate available.",
        "edge_cases": [
          "Over-the-counter (OTC) purchases are not captured, and neither are drugs administered in hospital; self-treatment with non-prescription medicines is invisible. This is particularly relevant for studies of analgesics, antihistamines, and low-dose aspirin.",
          "For multi-Nordic studies, each country's register has different historical depth (Denmark prescription data from 1994; Norway from 2004; Finland from 1994; Sweden from 2005), so study years must match available data vintages per country.",
          "Immigration, emigration, and temporary residence introduce gaps; use civil registration entry/exit dates to define the observable time window per person."
        ],
        "data_source_notes": "Drug exposure: ATC code + quantity dispensed + pack size; compute days_of_supply equivalent as quantity * pack_size / DDD. Diagnoses: ICD-10 WHO codes — adapt US ICD-10-CM algorithms before use. No continuous-enrollment criterion needed (single-payer); define observable time as civil registration date to death/emigration date within the study window."
      },
      {
        "name": "UK CPRD (prescribing-based cohort)",
        "description": "Cohort using CPRD GOLD or CPRD Aurum primary-care records, optionally linked to HES inpatient data, ONS mortality, and the National Cancer Registry. Drug exposure defined from issued prescriptions (prescribing event), not pharmacy dispensation. Linkage to HES extends outcomes to inpatient diagnoses and procedures not visible in the GP record.",
        "edge_cases": [
          "Prescribing ≠ dispensing: patients who did not collect or take a prescribed drug are misclassified as exposed. Studies of medication adherence or persistence should use dispensing databases (Nordic or linked dispensing data) rather than CPRD prescription records.",
          "Patients who see only private (non-NHS) GPs, or who frequently switch GP practices, have incomplete or fragmented records; require a registration-quality flag and a minimum observation period before the index date.",
          "HES linkage coverage improves over time; studies in the 1990s and early 2000s have lower HES linkage rates than more recent calendar periods."
        ],
        "data_source_notes": "Drug exposure: product code (Read/SNOMED mapped) from CPRD therapy table; convert to ATC for international comparison. Diagnoses: Read codes (CPRD GOLD) or SNOMED-CT (CPRD Aurum) in the GP record; ICD-10 in the HES inpatient record. Map algorithms accordingly. Mortality: link to ONS death data rather than relying on CPRD-recorded death (GP-registered death may lag)."
      },
      {
        "name": "SNDS (France national claims)",
        "description": "National claims-based study covering the entire French population using the SNDS, linking the outpatient health insurance system (SNIIRAM), hospital discharge database (PMSI), and mortality data via a pseudonymized stable identifier. Operates under the Health Data Hub authorization framework.",
        "edge_cases": [
          "The SNDS does not include private-pay consultations or direct-to-consumer health-food supplement purchases; complementary private insurance top-up payments are partially visible.",
          "Hospital prescriptions fulfilled by a hospital pharmacy are not in the outpatient dispensing file; for oncology and specialty inpatient drugs, supplement with PMSI drug records."
        ],
        "data_source_notes": "Drug exposure: outpatient dispensing (DCIR table) by ATC code. Hospital diagnoses: ICD-10 WHO via PMSI. Sociodemographic: CMU-C flag for low-income universal health cover, which captures socioeconomic disadvantage not available in most claims databases."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "medicare-ffs-ma-commercial-claims-differences-rwe",
        "pros_of_this": "Nordic and East Asian single-payer registries eliminate disenrollment censoring entirely, provide lifetime linkage via personal identity numbers without probabilistic matching, and offer near-complete drug dispensing capture; for long-horizon safety and incidence studies, this completeness exceeds what any US payer structure allows.",
        "cons_of_this": "Governance timelines are long (6–18 months per Nordic country); coding systems differ (ICD-10 WHO and ATC, not ICD-10-CM and NDC); US-validated phenotyping algorithms require re-validation; access is typically restricted to a secure national computing environment. Findings may not transport to the US population without explicit transportability analysis.",
        "when_to_prefer": "Prefer international sources for long-horizon safety studies, rare adverse events, and studies where disenrollment bias or informative censoring is the primary validity threat. Prefer US sources for US-payer policy questions, formulary decisions, and US-generalizability."
      },
      {
        "compared_to": "multi-database",
        "pros_of_this": "International data sources, when combined via DARWIN EU or EHDEN federated infrastructure, offer far larger and more geographically diverse populations than any single US distributed network, with the added advantage of single-payer coverage eliminating enrollment gaps.",
        "cons_of_this": "Cross-country harmonization must reconcile ICD-10 WHO variants, ATC/DDD systems, and differing governance frameworks; OMOP-CDM mapping quality varies across European databases. Between-country heterogeneity in care patterns, formulary access, and prescribing behavior is larger than between-site US heterogeneity and must be carefully investigated.",
        "when_to_prefer": "Prefer the international federated model for regulatory PASS submissions to the EMA and for evidence that must be persuasive across multiple health systems; use US-only multi-database networks for US regulatory and payer decisions."
      },
      {
        "compared_to": "linked-data",
        "pros_of_this": "Nordic registries achieve universal individual-level lifetime linkage without a separate linkage step and without linkage error or the linkable-subset selection that plagues probabilistic linkage in other settings — every resident is linked by design.",
        "cons_of_this": "Nordic linkage does not solve the problem of clinical data depth — GP-level diagnoses, lab values, and biomarkers are less consistently available than in EHR-linked systems. UK CPRD-to-HES linkage retains a linkable-subset selection that must be characterized.",
        "when_to_prefer": "Use Nordic registries when linkage completeness and longitudinal continuity are critical and clinical depth (labs, imaging findings) is not required; use EHR-linked designs when clinical severity or biomarker confounders must be controlled."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "For Nordic dispensing registries: no continuous-enrollment criterion is needed; define observable time as the window between civil registration (or study start) and death or emigration. Drug exposure requires ATC code to substance mapping and DDD-based exposure-window construction from dispensed quantity and pack size; there is no days_supply field. Diagnoses use ICD-10 WHO codes; adapt any ICD-10-CM-based phenotype algorithm before use. For SNDS/GePaRD: data-access governance (Health Data Hub authorization; sickness-fund data-sharing agreement) adds 3–12 months to feasibility timelines.",
      "ehr": "OpenSAFELY and MID-NET are primarily EHR-based. In OpenSAFELY, code runs inside an NHS server environment and only aggregate results are released; no extract is permitted. In JMDC (Japan commercial claims), the data resemble US employer-claims in structure (working- age insured adults, dispensing records) but with ATC not NDC and ICD-10 not ICD-10-CM.",
      "registry": "Nordic cause-of-death registries are the gold standard for all-cause mortality and cause-specific mortality; use them as the death component in any Nordic registry study. Cancer registries in Nordic countries link seamlessly to prescription data by PIN, enabling pharmacoepidemiology within cancer populations with high completeness.",
      "linked": "Nordic linkage is deterministic by PIN — no probabilistic matching required, no linkable-subset selection, no match rate to report. For CPRD-to-HES linkage, report the linkage rate (currently above 85% for practices consenting to linkage) and compare baseline characteristics of linked vs unlinked patients to probe selection."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\n\n# ── 1. Nordic dispensing exposure window (ATC + quantity + pack_size / DDD) ──────────\n# No days_supply field exists in Nordic data. Exposure duration is:\n#   days_covered = quantity * pack_size / ddd_units\n# where ddd_units is the Defined Daily Dose for the ATC code (from WHO DDD table).\n# For rivaroxaban NVAF (B01AF02): DDD = 1 tablet = 20 mg; 1 pack of 28 tablets covers 28 days.\n\ndef nordic_exposure_windows(rx_nordic: pd.DataFrame, ddd_map: pd.DataFrame) -> pd.DataFrame:\n    \"\"\"Compute dispensing-based exposure windows from Nordic prescription data.\n\n    Returns one row per dispensing with dispense_date and days_covered computed from\n    quantity * pack_size / ddd_units.  No days_supply field is used or required.\n    \"\"\"\n    rx = rx_nordic.merge(ddd_map[[\"atc_code\", \"ddd_units\"]], on=\"atc_code\", how=\"left\")\n    missing_ddd = rx[\"ddd_units\"].isna()\n    if missing_ddd.any():\n        print(f\"WARNING: {missing_ddd.sum()} dispensings have no DDD mapping — \"\n              f\"excluded from exposure windows. ATC codes: \"\n              f\"{rx.loc[missing_ddd, 'atc_code'].unique().tolist()}\")\n    rx = rx[~missing_ddd].copy()\n    # days_covered = number of whole tablets dispensed / DDD (1 tablet per day assumed)\n    rx[\"units_dispensed\"] = rx[\"quantity\"] * rx[\"pack_size\"]\n    rx[\"days_covered\"]    = (rx[\"units_dispensed\"] / rx[\"ddd_units\"]).round(0).astype(int)\n    rx[\"window_start\"]    = rx[\"dispense_date\"]\n    rx[\"window_end\"]      = rx[\"dispense_date\"] + pd.to_timedelta(rx[\"days_covered\"], \"D\")\n    return rx[[\"person_id\", \"atc_code\", \"dispense_date\", \"days_covered\",\n                \"window_start\", \"window_end\"]]\n\n# Example: 1 pack of 28 rivaroxaban 20 mg tablets, DDD = 1 tablet/day → 28 days covered.\n# The equivalent US quantity would be days_supply = 28 in a pharmacy claim.\n\n# ── 2. 
ICD-10 WHO phenotype coverage check ────────────────────────────────────────────\n# US ICD-10-CM algorithms cannot be applied directly to ICD-10 WHO databases.\n# Check which US code prefixes have matching codes in the international database.\n\ndef check_icd10_who_coverage(\n    dx_intl: pd.DataFrame,\n    us_cm_prefixes: list[str],   # e.g. [\"I21\", \"I22\"] for acute MI (ICD-10-CM)\n    icd_col: str = \"icd10_who\",\n) -> pd.DataFrame:\n    \"\"\"For each US ICD-10-CM code prefix, report whether matching ICD-10 WHO codes\n    are present in the database.  Codes starting with the same 3-character prefix\n    are usually comparable; 4th and 5th character extensions may differ between CM and WHO.\n\n    Returns a summary DataFrame: prefix, n_unique_who_codes_found, found_in_db (bool).\n    \"\"\"\n    intl_codes = dx_intl[icd_col].dropna().unique()\n    rows = []\n    for prefix in us_cm_prefixes:\n        matches = [c for c in intl_codes if str(c).startswith(prefix)]\n        rows.append({\n            \"us_cm_prefix\": prefix,\n            \"n_unique_who_codes_found\": len(matches),\n            \"found_in_db\": len(matches) > 0,\n            \"sample_who_codes\": matches[:3],   # first 3 for inspection\n        })\n    return pd.DataFrame(rows)\n\n# ── 3. Follow-up retention comparison (matches worked example) ────────────────────────\n# US commercial: 60 of 100 new users retained at 2 years.\n# Danish registry: 98 of 100 retained.\nus_retention  = 60 / 100   # = 0.60\ndk_retention  = 98 / 100   # = 0.98\nus_py = 60 * 2             # 120 person-years from retained persons\ndk_py = 98 * 2             # 196 person-years from retained persons\nratio = dk_py / us_py      # 1.63 — Denmark provides 63% more follow-up per 100 starters\nprint(f\"US retention = {us_retention:.2f}  |  Danish retention = {dk_retention:.2f}\")\nprint(f\"US person-years = {us_py}  |  Danish person-years = {dk_py}  |  ratio = {ratio:.2f}\")",
        "description": "Practical data-handling patterns for two key international RWD tasks: (1) constructing\na dispensing-based exposure window from Nordic prescription data using ATC code, dispensed\nquantity, and DDD when no days_supply field is present; (2) checking ICD-10 WHO code\ncoverage when adapting a US ICD-10-CM phenotype algorithm to a non-US database.\nRequired inputs:\n  rx_nordic : dispensing records -> person_id, dispense_date (datetime), atc_code (str),\n              quantity (int), pack_size (int)   # pack_size = units per package\n  ddd_map   : ATC to DDD mapping -> atc_code (str), ddd_units (float), ddd_unit (str)\n  dx_intl   : diagnoses          -> person_id, dx_date (datetime), icd10_who (str)\nThe US ICD-10-CM phenotype target is provided as a list of code prefixes to match against\nICD-10 WHO, flagging any that may not exist or behave differently internationally.",
        "dependencies": [
          "pandas"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\n\n# ── 1. Nordic dispensing exposure windows ──────────────────────────────────────────────\n# days_covered = (quantity * pack_size) / ddd_units\n# No days_supply field exists; DDD table from WHO ATC/DDD index provides the denominator.\n\nnordic_exposure_windows <- function(rx_nordic, ddd_map) {\n  setDT(rx_nordic); setDT(ddd_map)\n  rx <- merge(rx_nordic, ddd_map[, .(atc_code, ddd_units)], by = \"atc_code\", all.x = TRUE)\n  missing <- rx[is.na(ddd_units)]\n  if (nrow(missing) > 0L)\n    warning(sprintf(\"%d dispensings with no DDD mapping — excluded. ATC: %s\",\n                    nrow(missing), paste(unique(missing$atc_code), collapse = \", \")))\n  rx <- rx[!is.na(ddd_units)]\n  rx[, units_dispensed := quantity * pack_size]\n  rx[, days_covered    := as.integer(round(units_dispensed / ddd_units))]\n  rx[, window_start    := dispense_date]\n  rx[, window_end      := dispense_date + days_covered]\n  rx[, .(person_id, atc_code, dispense_date, days_covered, window_start, window_end)]\n}\n\n# ── 2. ICD-10 WHO phenotype coverage check ────────────────────────────────────────────\n\ncheck_icd10_who_coverage <- function(dx_intl, us_cm_prefixes, icd_col = \"icd10_who\") {\n  setDT(dx_intl)\n  intl_codes <- unique(na.omit(dx_intl[[icd_col]]))\n  rbindlist(lapply(us_cm_prefixes, function(pfx) {\n    matches <- intl_codes[startsWith(intl_codes, pfx)]\n    list(\n      us_cm_prefix          = pfx,\n      n_unique_who_codes    = length(matches),\n      found_in_db           = length(matches) > 0L,\n      sample_who_codes      = paste(head(matches, 3L), collapse = \"; \")\n    )\n  }))\n}\n\n# ── 3. Follow-up retention comparison (matches worked example arithmetic) ──────────────\n# US commercial: 60 of 100 retained at 2 years.  
Danish registry: 98 of 100.\nus_retention <- 60 / 100   # 0.60\ndk_retention <- 98 / 100   # 0.98\nus_py <- 60L * 2L          # 120 person-years\ndk_py <- 98L * 2L          # 196 person-years\nratio <- dk_py / us_py     # 1.633... ≈ 1.63\n\nmessage(sprintf(\"US retention = %.2f  |  Danish retention = %.2f\", us_retention, dk_retention))\nmessage(sprintf(\"US PY = %d  |  Danish PY = %d  |  ratio = %.2f\", us_py, dk_py, ratio))",
        "description": "R/data.table equivalent of the Python implementation: (1) Nordic dispensing exposure windows\nfrom ATC code, quantity, and pack size with DDD-based days_covered computation; (2) ICD-10 WHO\nphenotype coverage check against a US ICD-10-CM prefix list. Inputs match the Python version.",
        "dependencies": [
          "data.table"
        ],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Q[Research question requiring RWD] --> USA{US population<br/>policy question?}\n  USA -->|Yes| US[US claims sources<br/>Medicare FFS/MA/Commercial<br/>see: US payer differences concept]\n  USA -->|No, or global evidence needed| Geo{Geography and<br/>follow-up horizon}\n  Geo -->|Long-horizon safety;<br/>no disenrollment bias;<br/>rare adverse events| Nordic[Nordic registries<br/>DK/SE/NO/FI<br/>Personal ID linkage; dispensing-based;<br/>ICD-10 WHO; ATC codes]\n  Geo -->|UK primary care;<br/>GP behavior central| CPRD[CPRD + HES linkage<br/>Prescribing-based;<br/>OpenSAFELY for full England coverage]\n  Geo -->|France national claims| SNDS[SNDS / Health Data Hub<br/>ATC dispensing; PMSI hospital;<br/>mortality linkage]\n  Geo -->|Germany sickness funds| GePaRD[GePaRD<br/>20M insured; dispensing;<br/>ICD-10 GM]\n  Geo -->|East Asia| Asia[HIRA Korea / NHIRD Taiwan / NDB Japan<br/>Universal coverage; dispensing;<br/>ICD-10 compatible]\n  Geo -->|Multi-country EU;<br/>EMA PASS regulatory| EU[DARWIN EU / EHDEN network<br/>OMOP-CDM federated;<br/>see: federated-distributed-network-analysis]\n  Nordic --> Adapt[Adapt US ICD-10-CM algorithms to ICD-10 WHO<br/>Replace NDC/days_supply with ATC/DDD]\n  CPRD --> Adapt\n  SNDS --> Adapt\n  GePaRD --> Adapt\n  Asia --> Adapt\n  EU --> Adapt\n  Adapt --> Design[Apply study design;<br/>no continuous-enrollment criterion for single-payer;<br/>governance and access timeline: 6-18 months per country]",
        "caption": "Source selection for international RWD studies. Single-payer systems eliminate disenrollment censoring and require ATC/DDD and ICD-10 WHO coding instead of NDC/days_supply and ICD-10-CM; governance timelines are long.",
        "alt_text": "Flowchart from a research question through US-vs-international and geography/horizon decisions into the major source families (Nordic, CPRD, SNDS, GePaRD, East Asian, DARWIN EU/EHDEN), converging on algorithm adaptation and governance steps.",
        "source_type": "illustrative",
        "source_citations": [
          "herrett-2015",
          "schmidt-2019"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  subgraph US[US Claims Exposure Logic]\n    NDC[NDC number identifies drug] --> DS[days_supply field]\n    DS --> WinUS[Exposure window = fill_date to fill_date + days_supply]\n  end\n  subgraph Nordic[Nordic Dispensing Exposure Logic]\n    ATC[ATC code identifies drug substance] --> DDD[DDD from WHO ATC/DDD Index]\n    DDD --> Calc[\"days_covered = quantity * pack_size / DDD_units\"]\n    Calc --> WinNordic[Exposure window = dispense_date to dispense_date + days_covered]\n  end\n  subgraph CPRD[UK CPRD Exposure Logic]\n    Rx[GP issues prescription record] --> Prescrip[Prescribing event captured<br/>no pharmacy confirmation]\n    Prescrip --> Note[Patient may not collect or take the medication<br/>prescribing ≠ dispensing]\n  end",
        "caption": "Drug exposure construction differs fundamentally across sources. US claims use NDC with days_supply. Nordic dispensing registries use ATC code with DDD-based days_covered computation from quantity and pack size. UK CPRD records the prescribing event only, not pharmacy collection.",
        "alt_text": "Three parallel boxes showing US (NDC + days_supply → exposure window), Nordic (ATC + DDD + quantity × pack_size → days_covered → exposure window), and CPRD (prescribing event only, with note that dispensing is not confirmed).",
        "source_type": "illustrative",
        "source_citations": [
          "wettermark-2007"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "see_also",
        "target_slug": "medicare-ffs-ma-commercial-claims-differences-rwe",
        "notes": "The US-side sibling concept; the structural contrasts between FFS, Medicare Advantage, and commercial claims mirror the contrasts between US claims and international single-payer systems, with disenrollment censoring and coding-system differences as the shared themes."
      },
      {
        "relation_type": "see_also",
        "target_slug": "multi-database",
        "notes": "Multi-database distributed network studies are the primary vehicle for combining international sources (DARWIN EU, EHDEN, CNODES) under a common protocol without centralizing patient-level data."
      },
      {
        "relation_type": "see_also",
        "target_slug": "linked-data",
        "notes": "Nordic personal identity numbers enable deterministic lifetime linkage without probabilistic matching; the linked-data concept covers the general principles of record linkage, including the linkable-subset selection that Nordic PIN-based linkage avoids."
      },
      {
        "relation_type": "used_with",
        "target_slug": "atc-ddd-classification",
        "notes": "International prescription databases identify drugs by ATC code and require ATC/DDD-based exposure-window construction in place of the US NDC/days_supply approach; ATC/DDD classification is the drug-coding substrate for all non-US dispensing-registry studies."
      },
      {
        "relation_type": "see_also",
        "target_slug": "federated-distributed-network-analysis",
        "notes": "DARWIN EU and EHDEN operate as federated OMOP-CDM networks that execute distributed analyses across European sources without centralizing patient records — the federated- distributed-network-analysis pattern applied to international data governance constraints."
      },
      {
        "relation_type": "see_also",
        "target_slug": "generalizability-transportability-external-validity-rwe",
        "notes": "Findings from Nordic or East Asian single-payer populations may not transport to US payer populations without explicit assessment; population composition, health-care-seeking behavior, formulary access, and concomitant medication use all differ across systems."
      }
    ],
    "aliases": [
      "international real-world data",
      "non-US RWD sources",
      "CPRD GOLD",
      "Nordic registries pharmacoepidemiology",
      "SNDS France claims",
      "HIRA Korea database",
      "NHIRD Taiwan",
      "GePaRD Germany"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "interrupted-time-series-rwe",
    "name": "Interrupted Time Series (Segmented Regression)",
    "short_definition": "A population-level quasi-experimental design that fits a segmented regression to an aggregated outcome series before and after an abrupt intervention, estimating the change in level and the change in slope attributable to the intervention while modeling secular trend, seasonality, and autocorrelation.",
    "long_description": "An **interrupted time series (ITS)** evaluates the effect of an *abrupt, well-dated, population-level* intervention — a\nformulary change, a label/boxed-warning revision, a guideline release, a copay redesign, a market withdrawal — by treating\nthe population's own pre-intervention trajectory as the counterfactual. The outcome is aggregated into equally spaced time\npoints (monthly dispensing rate, weekly hospitalization rate, quarterly cost PMPM) and a **segmented regression** is fit:\nY_t = β0 + β1·time_t + β2·level_t + β3·(time since intervention)_t + ε_t, where `time_t` is the running clock, `level_t` is a\n0/1 step that switches on at the intervention, and `(time since intervention)_t` is a ramp that starts counting after it.\nβ1 is the pre-intervention secular trend (slope), **β2 is the immediate level change** at the interruption, and **β3 is the\nchange in slope** (the post minus pre trend). The two estimands together describe whether the intervention produced a sudden\njump, a gradual bend, both, or neither. ITS is the strongest single-group quasi-experiment because each unit serves as its\nown control: time-invariant confounders (stable case-mix, baseline access) are differenced out by the within-series\nbefore/after contrast.\n\n**Core conceptual distinction.** ITS identifies an effect from a *discontinuity in time*, not from a contrast between treated\nand untreated people. Three modeling choices are separable and must be pre-specified. (1) *Level vs slope*: an intervention\ncan shift the intercept (β2, e.g., a one-time stockpiling spike), change the trajectory (β3, e.g., slowly declining uptake),\nor both — reporting only one when both move misstates the effect. (2) *Autocorrelation*: consecutive points in a series are\ncorrelated, so ordinary least squares understates the standard errors and produces falsely narrow intervals. 
Correct for this with\n**Newey-West** heteroskedasticity-and-autocorrelation-consistent standard errors, a\n**Prais-Winsten** / Cochrane-Orcutt GLS fit with an AR(1) term, or an ARIMA error process; always inspect the residual\nACF/PACF and the Durbin-Watson statistic. (3) *Seasonality*: monthly health-services series carry strong annual cycles\n(respiratory admissions, end-of-year deductible effects), which must be modeled with harmonic (Fourier sin/cos) terms or\ncalendar-month indicators, or removed by seasonal differencing, or they will masquerade as level/slope effects. A fourth, design-level\nstrengthening is the **controlled ITS (CITS)**: add a contemporaneous control series not exposed to the intervention and\nestimate the *difference* in segmented parameters, which removes any concurrent shock (a coincident flu season, a payment\nreform) common to both series. This is the bridge between ITS and difference-in-differences.\n\n**Pros, cons, and trade-offs.**\n- **vs difference-in-differences (`difference-in-differences-staggered-adoption-rwe`):** ITS needs only the treated series and\n  leans on extrapolating the pre-trend; DiD requires a parallel comparison group and assumes parallel trends. **Prefer ITS**\n  when no untreated comparator exists (a national policy hits everyone) and many pre-intervention points are available;\n  **prefer DiD** when a credible control group exists and the pre-trend is short or noisy. 
The controlled-ITS variant *is*\n  essentially a DiD on segmented-regression parameters and inherits the parallel-trends assumption for the difference.\n- **vs ecological/aggregate cross-sectional studies (`ecological`):** ITS uses the same aggregated data but exploits the\n  longitudinal before/after structure, so it is far less vulnerable to static ecological confounding; its cost is the\n  requirement for enough equally spaced pre/post points (a common rule of thumb is >=8 on each side, more with seasonality) and\n  a sharply dated intervention. **Prefer ITS** whenever the intervention has a clean date and a usable time series exists.\n- **vs individual-level cohort analysis:** ITS answers a *population* question cheaply and is robust to stable case-mix, but it\n  cannot estimate individual-level effect modification, cannot adjust for time-varying compositional change (a shifting\n  enrollee mix that coincides with the intervention biases β2/β3), and gives no patient-level effect. **Prefer a cohort/PS\n  analysis** when the question is who benefits and individual covariates are available.\n\n**When to use.** A single, abrupt, sharply dated population-level intervention; an outcome that can be aggregated into a\nregularly spaced series with enough pre-intervention points to characterize the baseline trend and seasonality; no clean\nindividual-level comparator (so a cohort/PS design is infeasible); decision-makers who need a transparent before/after policy\nevaluation (FDA safety-label impact, PQA/CMS quality-measure or formulary changes, HTA coverage-policy assessments). ITS is the\ndefault design for drug-utilization and safety-communication impact studies in claims.\n\n**When NOT to use — and when it is actively misleading or dangerous.**\n- **The intervention is gradual or its date is fuzzy.** Phased rollouts, anticipation effects, and slow diffusion smear the\n  discontinuity; fitting a sharp step/ramp to a gradual change biases both β2 and β3. 
Model a transition/phase-in window or\nabandon ITS.\n- **A co-timed shock is ignored.** If another event (a competing drug's withdrawal, a pandemic, a payment reform) hits the\n  series at nearly the same time, the segmented parameters absorb both and attribute the combined effect to the intervention.\n  A single-group ITS cannot separate them — use a controlled ITS with an unaffected comparator.\n- **Autocorrelation/seasonality left unmodeled.** Reporting OLS standard errors on a serially correlated series produces\n  spuriously significant level/slope changes; un-deseasonalized monthly series routinely manufacture phantom effects. This is\n  the most common ITS error and it is actively misleading.\n- **Too few points or compositional drift.** Fewer than ~8 pre points cannot pin down the baseline trend; an enrollee-mix change\n  that coincides with the intervention (e.g., a large employer group entering the plan) is confounding that the within-series\n  contrast cannot remove.\n\n**Data-source operational depth.**\n- **Claims (FFS vs MA):** Build the series from a *stable, continuously enrolled* denominator so the rate is not\n  driven by who is observed. A standing failure mode: Medicare Advantage enrollees generate no fee-for-service claims, so if MA\n  penetration is rising over the study window the FFS-observed population shrinks and changes composition, injecting a spurious\n  trend; restrict to FFS Parts A/B (and D for dispensing outcomes) and hold the denominator construction constant across the\n  whole series. 
Anchor each point on service/admission/fill dates and account for claims-adjudication lag near the data cut\n  (the last 1-3 points are typically incomplete and should be dropped or the cut moved back).\n- **EHR:** Encounter-driven capture means the at-risk denominator fluctuates with system activity; define an explicit \"active\n  in system\" denominator (>=1 encounter per period) so apparent rate changes are not artifacts of changing observation, and\n  watch for documentation/coding-policy changes (an ICD-9 to ICD-10 transition, a new EHR module) that create a step in the\n  series unrelated to the intervention.\n- **Registry:** Adjudicated outcomes give clean numerators, but reporting completeness and lag vary by calendar period; verify\n  completeness by period and avoid mistaking a reporting-lag artifact for a level change.\n\n**Worked claims example.** Question: did a 2021 boxed-warning revision reduce monthly initiation of drug class X among adults\nin a commercial + Medicare FFS database? (1) *Series construction*: for each calendar month from 2018-01 to 2022-12, count new\ninitiators (first qualifying fill after a 12-month FFS-observable washout) over the continuously FFS-enrolled adult denominator,\ngiving an initiation rate per 1,000 enrollees per month — 36 pre points (2018-2020) and 24 post points (2021-2022). (2) *Model*:\nrate_t = β0 + β1·month_t + β2·post_t + β3·months_since_t + seasonal harmonics (sin/cos at 12- and 6-month periods), fit by\nPrais-Winsten GLS with an AR(1) error to absorb autocorrelation (residual ACF checked, Durbin-Watson near 2). (3) *Estimates*:\nβ1 = +0.02/month pre-trend (slowly rising uptake); β2 = **-0.45 per 1,000** immediate level drop at the warning (95% CI -0.62\nto -0.28); β3 = **-0.03/month** change in slope (95% CI -0.05 to -0.01), i.e., the rising trend reverses to a decline\n(post-warning slope = +0.02 - 0.03 = -0.01/month). 
(4)\n*Interpretation*: the warning produced both a sudden step-down and a sustained downward bend; the counterfactual is the\nextrapolated pre-trend, so the 24-month cumulative averted initiations are the area between observed and projected pre-trend\nlines. (5) *Strengthening*: add a controlled-ITS comparator — an unaffected therapeutic class with similar baseline trend — and\nre-estimate the *difference* in (β2, β3) to rule out a co-timed market shock; run sensitivity analyses dropping the last two\n(lag-incomplete) months and re-fitting with Newey-West SEs to confirm the inference is not autocorrelation-driven.",
    "primary_category": "Study_Design",
    "tags": [
      "interrupted-time-series",
      "segmented-regression",
      "quasi-experiment",
      "autocorrelation",
      "seasonality",
      "policy-evaluation",
      "drug-utilization",
      "newey-west"
    ],
    "applies_to_study_types": [
      "claims_analysis",
      "ehr_study",
      "registry_linkage",
      "drug_utilization",
      "safety_surveillance",
      "policy_evaluation"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1093/ije/dyw098",
        "url": "https://doi.org/10.1093/ije/dyw098",
        "citation_text": "Bernal JL, Cummins S, Gasparrini A. Interrupted time series regression for the evaluation of public health interventions: a tutorial. International Journal of Epidemiology. 2017;46(1):348-355.",
        "year": 2017,
        "authors_short": "Bernal et al.",
        "notes": "The canonical practical tutorial for ITS in health research; specifies the segmented-regression model, the level/slope estimands, handling of autocorrelation and seasonality, and the controlled-ITS extension with worked examples."
      },
      {
        "role": "explain",
        "doi": "10.1016/j.jclinepi.2008.08.007",
        "url": "https://doi.org/10.1016/j.jclinepi.2008.08.007",
        "citation_text": "Zhang F, Wagner AK, Soumerai SB, Ross-Degnan D. Methods for estimating confidence intervals in interrupted time series analyses of health interventions. Journal of Clinical Epidemiology. 2009;62(2):143-148.",
        "year": 2009,
        "authors_short": "Zhang et al.",
        "notes": "Establishes correct interval estimation under autocorrelation (Newey-West, Prais-Winsten/Cochrane-Orcutt) and why OLS intervals are too narrow for serially correlated series; the standard reference for ITS inference."
      },
      {
        "role": "explain",
        "doi": "10.1136/bmj.h2750",
        "url": "https://doi.org/10.1136/bmj.h2750",
        "citation_text": "Kontopantelis E, Doran T, Springate DA, Buchan I, Reeves D. Regression based quasi-experimental approach when randomisation is not an option: interrupted time series analysis. BMJ. 2015;350:h2750.",
        "year": 2015,
        "authors_short": "Kontopantelis et al.",
        "notes": "Accessible BMJ treatment positioning ITS among quasi-experimental designs, with practical guidance on model specification, the controlled-ITS comparator, and pitfalls in policy evaluation."
      },
      {
        "role": "demonstrate",
        "doi": "10.1093/ije/dyaa118",
        "url": "https://doi.org/10.1093/ije/dyaa118",
        "citation_text": "Bernal JL, Cummins S, Gasparrini A. Corrigendum to: Interrupted time series regression for the evaluation of public health interventions: a tutorial. International Journal of Epidemiology. 2021;50(3):1045.",
        "year": 2021,
        "authors_short": "Bernal et al.",
        "notes": "Corrects the worked-example code and seasonal-adjustment details of the original tutorial; useful when reproducing the canonical analysis exactly."
      }
    ],
    "plain_language_summary": "An interrupted time series study watches a population's rate of some outcome — such as how many patients per month start a new drug — over many months, then asks whether a specific event (a safety warning, a policy change, a formulary switch) caused that rate to change. The key idea is that the population's own past trend, if the intervention had never happened, acts as the stand-in for 'what would have happened anyway.' The method separately estimates two things: did the rate jump or drop instantly the moment the event occurred, and did the long-run direction of the trend bend after that? One honest limitation: if another unrelated event happened at almost the same time, the method cannot separate its effect from the intervention's without adding a second comparison series.",
    "key_terms": [
      {
        "term": "aggregated outcome series",
        "definition": "A table with one row per time period (e.g., one row per month) where each row records how many events — fills, hospitalizations, diagnoses — happened across the whole study population that period, usually expressed as a rate per 1,000 people enrolled."
      },
      {
        "term": "level change",
        "definition": "The sudden, one-time jump or drop in the outcome rate that happens right at the intervention date, measured by comparing what the rate actually was at that moment to what the pre-period trend predicted it would be."
      },
      {
        "term": "slope change",
        "definition": "The shift in how fast the outcome rate is rising or falling after the intervention compared to before — a positive slope change means the rate is climbing faster, a negative one means it is declining faster or reversing a prior upward trend."
      },
      {
        "term": "segmented regression",
        "definition": "A type of regression model that fits one straight line to the pre-intervention data and a second, differently-angled line to the post-intervention data, estimating both the level change and the slope change at the break point."
      },
      {
        "term": "counterfactual trend",
        "definition": "The projected path the outcome rate would have followed after the intervention if the intervention had never occurred, estimated by extending the pre-period straight line forward in time."
      },
      {
        "term": "autocorrelation",
        "definition": "The tendency for consecutive monthly (or weekly) measurements to be more similar to each other than would be expected by chance — if rates were high last month they tend to be high this month — which means standard confidence intervals are too narrow and must be corrected."
      }
    ],
    "worked_example": {
      "scenario": "A drug safety team wants to know whether a boxed-warning added to Drug X's label on June 1, 2021 changed prescribing behaviour. Using a commercial claims database, they count new patients starting Drug X each month among continuously enrolled adults, then divide by the number of enrolled adults that month to get an initiation rate per 1,000 enrollees. The table below shows five pre-warning months (January–May 2021) and four post-warning months (June–September 2021). The team wants to estimate (1) the immediate level drop at the warning date and (2) whether the monthly trend also shifted.",
      "dataset": {
        "caption": "Monthly Drug X initiation rate (new starters per 1,000 continuously enrolled adults). Month index t runs 1–9; intervention falls between t=5 and t=6.",
        "columns": [
          "month_label",
          "t",
          "period",
          "months_since_warning",
          "rate_per_1000"
        ],
        "rows": [
          [
            "2021-01",
            1,
            "pre",
            0,
            5.0
          ],
          [
            "2021-02",
            2,
            "pre",
            0,
            5.2
          ],
          [
            "2021-03",
            3,
            "pre",
            0,
            5.4
          ],
          [
            "2021-04",
            4,
            "pre",
            0,
            5.6
          ],
          [
            "2021-05",
            5,
            "pre",
            0,
            5.8
          ],
          [
            "2021-06",
            6,
            "post",
            1,
            4.2
          ],
          [
            "2021-07",
            7,
            "post",
            2,
            4.1
          ],
          [
            "2021-08",
            8,
            "post",
            3,
            4.0
          ],
          [
            "2021-09",
            9,
            "post",
            4,
            3.9
          ]
        ]
      },
      "steps": [
        "Fit the pre-period data (t = 1 to 5) to a straight line. The line starts at 5.0 in January and rises by 0.2 per month, so the formula is: expected rate = 4.8 + 0.2 × t.",
        "Extend that line forward to predict what June (t = 6) would have been without the warning: 4.8 + 0.2 × 6 = 6.0 per 1,000.",
        "The observed rate in June is 4.2. The gap between the counterfactual (6.0) and what was actually observed at t = 6 breaks into two parts: the level change and the first month of slope change.",
        "The level change (β2) is −1.5 per 1,000 — a sudden, one-time drop attributable to the warning announcement itself.",
        "The slope change (β3) is −0.3 per 1,000 per month. This means the trend shifts from +0.2/month (rising) to −0.1/month (slowly falling) after the warning.",
        "Verify June: counterfactual 6.0 + level change (−1.5) + slope change × months_since_warning 1 (−0.3 × 1 = −0.3) = 6.0 − 1.5 − 0.3 = 4.2. Matches the observed rate.",
        "Verify September (t = 9, months_since_warning = 4): counterfactual 4.8 + 0.2×9 = 6.6; + (−1.5) + (−0.3×4 = −1.2) = 6.6 − 1.5 − 1.2 = 3.9. Matches the observed rate.",
        "The upward pre-period trend (+0.2/month) has reversed to a downward post-period trend (−0.1/month), meaning the warning changed not just the level but the direction of prescribing."
      ],
      "result": {
        "label": "Level change (β2) = −1.5 per 1,000 (immediate drop at the warning); Slope change (β3) = −0.3 per 1,000 per month (trend reversal from +0.2 to −0.1 per month)",
        "value": {
          "level_change_beta2": -1.5,
          "slope_change_beta3": -0.3,
          "pre_trend_per_month": 0.2,
          "post_trend_per_month": -0.1,
          "counterfactual_june": 6.0,
          "observed_june": 4.2,
          "arithmetic_check_june": "4.8 + 0.2*6 + (-1.5)*1 + (-0.3)*1 = 4.2",
          "arithmetic_check_september": "4.8 + 0.2*9 + (-1.5)*1 + (-0.3)*4 = 3.9"
        }
      },
      "timeline_spec": {
        "title": "Drug X monthly initiation rate: boxed-warning interruption, Jan–Sep 2021",
        "caption": "Monthly initiation rate (per 1,000 enrollees) before and after the June 2021 boxed warning. The dashed line shows the counterfactual trend if the warning had never been issued. The solid orange line shows observed post-warning rates. The gap between them widens each month because the slope also changed.",
        "alt_text": "Segmented timeline showing Drug X initiation rate rising from 5.0 to 5.8 per 1,000 during January–May 2021, a vertical intervention marker at June 2021 labeled Boxed Warning Added, then the observed rate dropping to 4.2 in June and declining further to 3.9 by September, while a dashed counterfactual line continues rising from 5.8 toward 6.6, illustrating both a level drop and a slope reversal.",
        "window": {
          "start": "2021-01",
          "end": "2021-09",
          "label": "9-month observation window (5 pre, 4 post)"
        },
        "events": [
          {
            "label": "Boxed Warning Added",
            "date": "2021-06",
            "kind": "intervention",
            "description": "FDA boxed warning added to Drug X label — the 'interruption' point in the series"
          }
        ],
        "spans": [
          {
            "kind": "pre_intervention",
            "start": "2021-01",
            "end": "2021-05",
            "label": "Pre-intervention: rising trend (+0.2/month), rate 5.0 → 5.8"
          },
          {
            "kind": "post_intervention",
            "start": "2021-06",
            "end": "2021-09",
            "label": "Post-intervention: reversed trend (−0.1/month), rate 4.2 → 3.9"
          }
        ],
        "result": {
          "label": "Level drop = −1.5 per 1,000 at warning; slope reversed from +0.2 to −0.1 per month",
          "value": {
            "level_change": -1.5,
            "slope_change": -0.3,
            "pre_slope": 0.2,
            "post_slope": -0.1
          }
        }
      }
    },
    "prerequisites": [
      "difference-in-differences-staggered-adoption-rwe",
      "ecological",
      "descriptive-epidemiology-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Single-group segmented regression",
        "description": "The baseline ITS fitting level (step) and slope (ramp) terms to one aggregated outcome series, with the pre-intervention trend extrapolated as the counterfactual; estimates the immediate level change and the change in slope.",
        "edge_cases": [
          "Co-timed shocks cannot be separated from the intervention in a single group; a contemporaneous unrelated event biases the segmented parameters.",
          "Anticipation effects (behavior changing before the official date) blur the discontinuity and should be modeled with a phase-in window or a shifted intervention date."
        ],
        "data_source_notes": "claims: hold the continuously-enrolled denominator construction constant across the entire series so a changing observed population does not create a spurious trend; drop lag-incomplete final months."
      },
      {
        "name": "Controlled ITS (CITS)",
        "description": "Adds a contemporaneous control series unaffected by the intervention and estimates the difference in segmented parameters; removes any concurrent shock common to both series and weakens reliance on pre-trend extrapolation.",
        "edge_cases": [
          "The control must plausibly share the treated series' secular trend and seasonality; a poorly matched control reintroduces bias.",
          "The difference-in-segmented-parameters estimator inherits a parallel-trends assumption analogous to difference-in-differences."
        ],
        "data_source_notes": "claims/ehr: choose a comparator drug class or region with similar baseline trend and the same denominator rules; verify pre-period parallelism before trusting the differenced effect."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "difference-in-differences-staggered-adoption-rwe",
        "pros_of_this": "Requires only the treated series and a long pre-period; ideal when a national/universal policy leaves no untreated comparison group, and each unit serves as its own control for time-invariant confounders.",
        "cons_of_this": "Leans on extrapolating the pre-intervention trend as the counterfactual and cannot separate the intervention from a co-timed shock without a control series.",
        "when_to_prefer": "No credible untreated comparator exists and many equally spaced pre-intervention points are available; use the controlled-ITS variant (effectively a DiD on segmented parameters) when a comparator does exist."
      },
      {
        "compared_to": "ecological",
        "pros_of_this": "Exploits the longitudinal before/after structure of the aggregated data, so it is far less vulnerable to the static cross-sectional confounding that undermines ordinary ecological comparisons.",
        "cons_of_this": "Demands a sharply dated intervention and enough regularly spaced pre/post points to characterize trend and seasonality, which simple ecological cross-sections do not require.",
        "when_to_prefer": "Whenever the intervention has a clean date and a usable time series exists; reserve plain ecological analysis for static between-area comparisons without a temporal interruption."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Aggregate into equally spaced periods over a constant, continuously FFS-enrolled denominator; drop Medicare Advantage-only person-time (no FFS claims) and hold denominator rules fixed across the whole series so changing observation does not manufacture a trend. Anchor periods on service/fill dates and drop the final lag-incomplete points.",
      "ehr": "Define an explicit active-in-system denominator (>=1 encounter per period) so rate changes are not observation artifacts; flag coding-policy or EHR-system changes (e.g., ICD-10 transition) that create steps unrelated to the intervention.",
      "registry": "Adjudicated numerators are clean, but verify reporting completeness and lag by calendar period so a reporting artifact is not read as a level change.",
      "linked": "Linked claims-EHR substrate sharpens both numerator (EHR/registry specificity) and denominator (claims enrollment); reconcile differing event-date conventions before assigning each event to a period."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\nimport pandas as pd\nimport statsmodels.formula.api as smf\n\ndef its_segmented(df: pd.DataFrame, period: int = 12, hac_lags: int = 12):\n    # df columns: time (0..T-1), rate (outcome per period), intervention_period (int index of the step).\n    d = df.sort_values(\"time\").reset_index(drop=True).copy()\n    k = d[\"intervention_period\"].iloc[0]\n    d[\"level\"] = (d[\"time\"] >= k).astype(int)                 # step: 0 pre, 1 post\n    d[\"trend_post\"] = np.where(d[\"time\"] >= k, d[\"time\"] - k + 1, 0)  # ramp after interruption\n    # Fourier seasonality (annual + semiannual) to avoid seasonal effects leaking into level/slope.\n    for h in (1, 2):\n        d[f\"sin{h}\"] = np.sin(2 * np.pi * h * d[\"time\"] / period)\n        d[f\"cos{h}\"] = np.cos(2 * np.pi * h * d[\"time\"] / period)\n    formula = \"rate ~ time + level + trend_post + sin1 + cos1 + sin2 + cos2\"\n    # HAC (Newey-West) covariance corrects SEs for autocorrelation in the residuals.\n    model = smf.ols(formula, data=d).fit(cov_type=\"HAC\", cov_kwds={\"maxlags\": hac_lags})\n    return model\n\n# Illustrative series: 36 pre, 24 post; warning causes a level drop and a downward slope change.\nrng = np.random.default_rng(7)\nt = np.arange(60)\nbase = 5.0 + 0.02 * t + 0.4 * np.sin(2*np.pi*t/12)            # pre-trend + seasonality\neffect = np.where(t >= 36, -0.45 - 0.03 * (t - 36 + 1), 0.0)  # step + ramp after month 36\nrate = base + effect + rng.normal(0, 0.1, size=60)\ndf = pd.DataFrame({\"time\": t, \"rate\": rate, \"intervention_period\": 36})\n\nm = its_segmented(df)\nprint(m.params[[\"level\", \"trend_post\"]])   # level change (beta2) and slope change (beta3)\nprint(m.bse[[\"level\", \"trend_post\"]])       # HAC standard errors",
        "description": "Single-group segmented regression with Newey-West (HAC) standard errors using statsmodels. Input: a monthly DataFrame with\nthe aggregated outcome (`rate`), a continuous `time` index, a 0/1 `level` step that switches on at the intervention, and a\n`trend_post` ramp counting periods since the intervention. Seasonality is captured with 12- and 6-month Fourier harmonics.\nNewey-West SEs (maxlags chosen from the seasonal period) correct the intervals for autocorrelation (Zhang et al. 2009).",
        "dependencies": [
          "pandas",
          "numpy",
          "statsmodels"
        ],
        "source_citations": [
          "bernal-2017",
          "zhang-2009"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(prais)     # Prais-Winsten AR(1) GLS for serially correlated errors\nlibrary(sandwich)  # Newey-West HAC covariance\nlibrary(lmtest)    # coeftest with a supplied vcov\n\nits_segmented <- function(df, period = 12) {\n  df <- df[order(df$time), ]\n  k <- df$intervention_period[1]\n  df$level      <- as.integer(df$time >= k)                 # step: 0 pre, 1 post\n  df$trend_post <- ifelse(df$time >= k, df$time - k + 1, 0)  # ramp after interruption\n  df$sin1 <- sin(2*pi*df$time/period); df$cos1 <- cos(2*pi*df$time/period)\n  df$sin2 <- sin(4*pi*df$time/period); df$cos2 <- cos(4*pi*df$time/period)\n  f <- rate ~ time + level + trend_post + sin1 + cos1 + sin2 + cos2\n\n  # Prais-Winsten GLS with an AR(1) error term (models autocorrelation explicitly).\n  pw <- prais_winsten(f, data = df, index = \"time\")\n  # Newey-West HAC cross-check on the OLS fit.\n  ols <- lm(f, data = df)\n  nw  <- coeftest(ols, vcov = NeweyWest(ols, lag = period, prewhite = FALSE))\n  list(prais_winsten = summary(pw)$coefficients[c(\"level\", \"trend_post\"), ],\n       newey_west     = nw[c(\"level\", \"trend_post\"), ])\n}\n\n# Illustrative monthly series: 36 pre + 24 post points.\nset.seed(7)\nt <- 0:59\nbase   <- 5.0 + 0.02*t + 0.4*sin(2*pi*t/12)\neffect <- ifelse(t >= 36, -0.45 - 0.03*(t - 36 + 1), 0)\ndf <- data.frame(time = t, rate = base + effect + rnorm(60, 0, 0.1),\n                 intervention_period = 36)\nprint(its_segmented(df))",
        "description": "Segmented-regression ITS with a Prais-Winsten AR(1) GLS fit (orcutt/prais packages) to handle autocorrelation directly,\nplus a Newey-West cross-check from sandwich. Input is a monthly data.frame with the outcome `rate`, running `time`, the 0/1\n`level` step, and the `trend_post` ramp; Fourier terms remove seasonality (Bernal et al. 2017; Zhang et al. 2009).",
        "dependencies": [
          "prais",
          "sandwich",
          "lmtest"
        ],
        "source_citations": [
          "bernal-2017",
          "zhang-2009"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* Build the segmented-regression design variables. Intervention starts at period k = 36. */\n%let k = 36;\ndata its;\n  set work.series;                 /* time (0..59), rate (outcome per period) */\n  level      = (time >= &k);                         /* step: 0 pre, 1 post   */\n  if time >= &k then trend_post = time - &k + 1;     /* ramp after interruption */\n  else trend_post = 0;\n  sin1 = sin(2*constant('pi')*time/12); cos1 = cos(2*constant('pi')*time/12);\n  sin2 = sin(4*constant('pi')*time/12); cos2 = cos(4*constant('pi')*time/12);\nrun;\n\n/* PROC AUTOREG fits OLS plus an autoregressive error structure; NLAG=1 = AR(1),\n   BACKSTEP prunes nonsignificant autoregressive lags. DWPROB requests the Durbin-Watson test. */\nproc autoreg data=its;\n  model rate = time level trend_post sin1 cos1 sin2 cos2\n        / nlag=1 backstep method=ml dwprob;\n  /* level = immediate level change (beta2); trend_post = change in slope (beta3). */\nrun;",
        "description": "Segmented-regression ITS in SAS using PROC AUTOREG, which fits the level/slope model with an autoregressive error process\n(AR(1) via NLAG=1, BACKSTEP for higher-order terms) and reports the Durbin-Watson statistic. Input dataset work.its has one\nrow per period with rate, time, the 0/1 level step, the trend_post ramp, and Fourier seasonal terms (Zhang et al. 2009).",
        "dependencies": [],
        "source_citations": [
          "bernal-2017",
          "zhang-2009"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": "interrupted-time-series-rwe-timeline.svg",
        "mermaid": null,
        "caption": "Monthly initiation rate (per 1,000 enrollees) before and after the June 2021 boxed warning. The dashed line shows the counterfactual trend if the warning had never been issued. The solid orange line shows observed post-warning rates. The gap between them widens each month because the slope also changed.",
        "alt_text": "Segmented timeline showing Drug X initiation rate rising from 5.0 to 5.8 per 1,000 during January–May 2021, a vertical intervention marker at June 2021 labeled Boxed Warning Added, then the observed rate dropping to 4.2 in June and declining further to 3.9 by September, while a dashed counterfactual line continues rising from 5.8 toward 6.6, illustrating both a level drop and a slope reversal.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  Pre[Pre-intervention series<br/>>= 8 equally spaced points] --> Fit[Segmented regression]\n  Int((Intervention<br/>sharp date)) --> Fit\n  Post[Post-intervention series] --> Fit\n  Fit --> B1[beta1: pre-trend slope]\n  Fit --> B2[beta2: immediate LEVEL change]\n  Fit --> B3[beta3: change in SLOPE]\n  Fit --> Corr[Model autocorrelation<br/>Newey-West / Prais-Winsten]\n  Fit --> Seas[Remove seasonality<br/>Fourier / month indicators]\n  B2 --> CF[Counterfactual = extrapolated pre-trend]\n  B3 --> CF",
        "caption": "Structure of a segmented-regression ITS. The pre-intervention trend is extrapolated as the counterfactual; the intervention's effect is decomposed into an immediate level change (beta2) and a change in slope (beta3), with autocorrelation and seasonality modeled so they do not masquerade as effects.",
        "alt_text": "Flow diagram showing pre series, a sharply dated intervention, and post series feeding a segmented regression that yields a pre-trend slope, a level change, and a slope change, with autocorrelation and seasonality adjustments.",
        "source_type": "illustrative",
        "source_citations": [
          "bernal-2017"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Q[Abrupt population-level intervention] --> Date{Sharp, well-dated?}\n  Date -->|No, gradual/fuzzy| Stop1[Model phase-in or abandon ITS]\n  Date -->|Yes| Pts{>= ~8 equally spaced<br/>pre and post points?}\n  Pts -->|No| Stop2[Insufficient data for ITS]\n  Pts -->|Yes| Shock{Co-timed shock possible?}\n  Shock -->|Yes| CITS[Use controlled ITS<br/>add unaffected comparator series]\n  Shock -->|No| Single[Single-group segmented regression]\n  CITS --> Diag[Check residual ACF/PACF + Durbin-Watson<br/>deseasonalize]\n  Single --> Diag\n  Diag --> Report[Report beta2 level + beta3 slope with HAC/AR1 CIs]",
        "caption": "Decision logic for an ITS analysis. The design hinges on a sharply dated intervention and enough regularly spaced points; a possible co-timed shock pushes toward the controlled-ITS variant, and autocorrelation/seasonality diagnostics gate the inference.",
        "alt_text": "Decision tree from an abrupt intervention through checks on date sharpness, point count, and co-timed shocks to a single-group or controlled ITS, then diagnostics and reporting of level and slope effects.",
        "source_type": "illustrative",
        "source_citations": [
          "bernal-2017"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "see_also",
        "target_slug": "difference-in-differences-staggered-adoption-rwe",
        "notes": "The controlled-ITS variant is essentially a difference-in-differences on segmented-regression parameters; DiD adds a comparison group and a parallel-trends assumption, while single-group ITS relies on pre-trend extrapolation."
      },
      {
        "relation_type": "see_also",
        "target_slug": "ecological",
        "notes": "ITS uses the same aggregated/ecological data but exploits the longitudinal before/after structure, making it far less vulnerable to static ecological confounding than a cross-sectional ecological comparison."
      },
      {
        "relation_type": "used_with",
        "target_slug": "drug-utilization",
        "notes": "ITS is the default design for measuring the impact of formulary, label, and policy changes on drug-utilization series in claims."
      },
      {
        "relation_type": "see_also",
        "target_slug": "poisson-negative-binomial-count-models",
        "notes": "When the aggregated outcome is a count/rate with a person-time offset, the segmented model is fit as a Poisson or negative-binomial regression rather than a Gaussian one."
      },
      {
        "relation_type": "complements",
        "target_slug": "comparative-effectiveness-research-cer-methods",
        "notes": "ITS is a population-level quasi-experimental tool in the CER methods toolkit, used when individual-level comparative designs are infeasible because the intervention is universal."
      }
    ],
    "aliases": [
      "ITS",
      "interrupted time series analysis",
      "segmented regression analysis",
      "segmented time-series regression",
      "controlled interrupted time series"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta",
      "pqa-cms"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "inverse-probability-of-censoring-weighting-rwe",
    "name": "Inverse Probability of Censoring Weighting (IPCW)",
    "short_definition": "A weighting method that corrects dependent (informative) censoring by reweighting still-uncensored subjects by the inverse of their estimated probability of remaining uncensored, recovering the distribution that would have been observed without informative dropout, treatment switching, or protocol deviation.",
    "long_description": "**Core idea.** Standard survival estimators (Kaplan-Meier, the partial-likelihood Cox model, a person-time rate) assume\ncensoring is **non-informative**: at every time t, the subjects still under observation are representative of those who\nwere censored. When censoring depends on time-varying prognostic factors — patients with worsening disease drop out,\nswitch treatment, or deviate from protocol — that assumption fails and the naive estimator is biased. **Inverse\nProbability of Censoring Weighting (IPCW)** repairs this by creating a *pseudo-population* in which censoring is\nindependent of those measured factors. Each subject who remains uncensored at time t is reweighted by the inverse of\ntheir estimated probability of having stayed uncensored up to t, given the measured time-varying covariate history. A\npatient who was likely to be censored (e.g., a deteriorating patient who nonetheless remained in follow-up) stands in\nfor the similar patients who actually dropped out, and the weighted analysis estimates the quantity that would have been\nobserved under complete follow-up. The weight at time t is W(t) = prod over k<=t of 1 / P(uncensored at k | uncensored at\nk-1, covariate history) — the inverse of the cumulative conditional probability of remaining under observation.\n\n**Stabilized weights and the assumptions that make IPCW valid.** Raw (unstabilized) IPCW weights can be extremely\nvariable and a few large weights dominate the estimate, inflating variance. **Stabilized weights** multiply the inverse\ncensoring probability by the marginal (or baseline-covariate-only) probability of remaining uncensored, SW(t) = prod\nP(uncensored | baseline) / P(uncensored | full covariate history). Stabilization leaves the estimator consistent while\nshrinking the weights toward 1 and tightening the variance. 
IPCW rests on three assumptions, each of which must be\nargued explicitly: (1) **no unmeasured common causes** of censoring and the outcome — the covariates in the censoring\nmodel capture everything that drives both dropout and the event (the censoring analogue of no unmeasured confounding);\n(2) **positivity** — every covariate history that can occur has a non-zero probability of remaining uncensored, so no\nweight is the reciprocal of (near) zero; (3) **correct specification** of the censoring (and, when combined with\ntreatment weighting, the treatment) model. Because the standard errors must account for the estimated weights, IPCW\nanalyses use a robust (sandwich) or bootstrap variance; treating the weights as fixed gives anticonservative intervals.\n\n**Pros, cons, and trade-offs.**\n- **vs ignoring informative censoring (naive KM/Cox):** IPCW removes the bias from dependent censoring that the naive\n  estimator cannot, at the cost of a censoring model, potentially unstable weights, and wider (honest) intervals. **Use\n  IPCW** whenever dropout, switching, or deviation is plausibly driven by measured time-varying prognosis; **the naive\n  estimator** is acceptable only when censoring is administrative/random.\n- **vs the clone-censor-weight approach for per-protocol estimands (`clone-censor-weight-per-protocol`):** Clone-censor-\n  weight cleanly handles time-varying eligibility and grace periods by cloning each person into the strategies they are\n  compatible with, censoring at deviation, and then using IPCW to correct the artificial censoring it induced — so IPCW\n  is the *engine* inside clone-censor-weight, not a competitor. 
**Prefer the clone-censor-weight scaffolding** when the\n  estimand is a per-protocol/target-trial comparison with grace periods; **prefer plain IPCW** when there is a single\n  well-defined censoring process (e.g., loss to follow-up) to correct.\n- **vs rank-preserving structural failure-time / G-estimation for treatment switching (`marginal-structural-models-g-\n  methods`):** Both correct switching, but RPSFTM/IPE assume a common treatment effect and \"rewind\" switchers' survival\n  times, whereas IPCW censors at switch and reweights, requiring the no-unmeasured-confounders-for-censoring assumption\n  instead. **Prefer IPCW/MSM** when rich time-varying confounders of switching are measured; **prefer RPSFTM** when the\n  constant-relative-effect assumption is defensible and confounders of switching are poorly measured.\n- **vs multiple imputation of censored outcomes:** MI imputes the missing post-censoring follow-up under MAR; IPCW\n  reweights rather than imputes. Weighting avoids modelling the full outcome distribution but discards (downweights)\n  information; the two are complementary and can be combined in doubly robust (augmented IPW) estimators that are\n  consistent if *either* the censoring or the outcome model is correct.\n\n**When to use.** Time-to-event RWE where follow-up ends for reasons related to measured time-varying prognosis: informative\nloss to follow-up (sicker patients disenroll or leave the health system); estimating a **per-protocol** effect when patients\ndiscontinue or switch treatment and the deviation is driven by measured covariates; constructing the censoring weights\ninside a marginal structural model or a target-trial emulation; correcting for treatment-switching in oncology trials and\ntheir RWE replications when adjudicated time-varying confounders are available. 
Always report the censoring-model\ncovariates, the distribution of the (stabilized) weights, and a positivity/extreme-weight diagnostic.\n\n**When NOT to use — and when it is actively misleading or dangerous.**\n- **Unmeasured drivers of censoring.** If the reason patients drop out is not captured by measured covariates (the\n  censoring analogue of unmeasured confounding), IPCW corrects nothing and the weighted estimate is biased while looking\n  rigorous — the most dangerous misuse, because the apparatus signals confidence it has not earned.\n- **Positivity violations / extreme weights.** When some covariate histories almost guarantee censoring, the inverse\n  probabilities explode, a few subjects dominate, and the variance becomes uncontrolled; truncating extreme weights\n  trades bias for variance and must be reported, not hidden.\n- **Administrative or genuinely random censoring.** If censoring is only end-of-study or independent of prognosis, IPCW\n  adds variance for no bias reduction; use the unweighted estimator.\n- **Fixed (non-time-varying) weights treated as the whole story.** IPCW is built for time-varying covariate histories;\n  applying a single baseline-only weight ignores the post-baseline prognostic evolution that makes censoring informative\n  in the first place.\n\n**Data-source operational depth.**\n- **Claims:** \"Censoring\" is overwhelmingly **disenrollment** and the end of observable person-time. 
Disenrollment is\n  often informative (job change, death-adjacent transitions, switching to Medicare Advantage where fee-for-service claims\n  vanish), so model the censoring hazard on time-varying utilization, comorbidity accrual, and recent hospitalization.\n  Medicare Advantage transitions are a classic informative-censoring trap: a deteriorating patient who moves to MA\n  appears \"censored\" exactly when their event risk rises; restrict to FFS-observable person-time and include the\n  predictors of MA transition in the censoring model.\n- **EHR:** Censoring is **leakage** — the patient seeks care outside the system — and is strongly informative because\n  sicker patients are referred out or hospitalized elsewhere. Build the censoring model on encounter frequency, recent\n  labs/vitals trajectories, and referral patterns; an \"active in system\" definition determines what counts as censoring.\n- **Registry / linked:** Registries provide adjudicated time-varying severity that strengthens the censoring model;\n  linked claims supply the death and disenrollment dates that distinguish a true competing event (death) from informative\n  loss to follow-up. Reconcile dates before defining the censoring indicator, and never let death be silently treated as\n  censoring when a competing-risks framing is intended.\n\n**Interpreting the output**\n\nUsing the worked example: Patient P-104 (severe disease, estimated 25% probability of remaining uncensored)\nreceives IPCW weight = 1/0.25 = 4.00. 
Patient P-101 (mild disease, 80% probability) receives weight = 1/0.80 = 1.25.\nAll subsequent survival analyses proceed using only the two patients who remained, with these weights.\n\nFormal interpretation: The IPCW-weighted survival curve (or HR or risk difference derived from it) estimates\nthe effect that would have been observed in the hypothetical world where no patient was censored for\nprognostic reasons: the marginal effect in a pseudo-population where informative censoring is abolished.\nP-104's weight of 4.00 = 1/0.25 makes them represent themselves plus three similarly severe patients, restoring\nthe disease-severity distribution of the original full cohort. The estimate is valid only if two conditions\nhold: the censoring model is correctly specified (every covariate that jointly predicts dropout and the\noutcome must be included), and positivity of censoring holds (no covariate pattern makes remaining in the\nstudy impossible or nearly so). The weighted effective patient-equivalent count (1.25 + 4.00 = 5.25\nfor 2 remaining patients) reflects the information cost of heavy reweighting from a small remaining sample.\n\nPractical interpretation: After IPCW, the analysis behaves as if all four patients, including the two\nsevere-disease dropouts, had been followed for the full 12 months. The drug's apparent survival benefit is\nadjusted to reflect the full severity distribution of the original cohort, giving a less optimistic but more\nhonest picture of performance. Without IPCW, dropping the two severe dropouts would have made the drug look\nmore effective than it truly was across all severity levels.",
    "primary_category": "Causal_Inference_Method",
    "tags": [
      "ipcw",
      "censoring-weights",
      "informative-censoring",
      "dependent-censoring",
      "treatment-switching",
      "per-protocol",
      "stabilized-weights",
      "positivity"
    ],
    "applies_to_study_types": [
      "cohort_retrospective",
      "cohort_prospective",
      "claims_analysis",
      "ehr_study",
      "comparative_effectiveness",
      "target_trial_emulation",
      "pragmatic_trial"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked",
      "primary"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1111/j.0006-341X.2000.00779.x",
        "url": "https://doi.org/10.1111/j.0006-341X.2000.00779.x",
        "citation_text": "Robins JM, Finkelstein DM. Correcting for noncompliance and dependent censoring in an AIDS clinical trial with inverse probability of censoring weighted (IPCW) log-rank tests. Biometrics. 2000;56(3):779-788.",
        "year": 2000,
        "authors_short": "Robins & Finkelstein",
        "notes": "Introduces the IPCW log-rank test and the inverse-probability-of-censoring weighting machinery for noncompliance and dependent censoring; the canonical methodological reference for the technique."
      },
      {
        "role": "explain",
        "doi": "10.1097/00001648-200009000-00011",
        "url": "https://doi.org/10.1097/00001648-200009000-00011",
        "citation_text": "Robins JM, Hernan MA, Brumback B. Marginal structural models and causal inference in epidemiology. Epidemiology. 2000;11(5):550-560.",
        "year": 2000,
        "authors_short": "Robins et al.",
        "notes": "Develops marginal structural models and the inverse-probability weighting framework (treatment and censoring weights, stabilized weights) within which IPCW is the censoring component; foundational for the assumptions and the pseudo-population interpretation."
      },
      {
        "role": "explain",
        "doi": "10.1097/00001648-200009000-00012",
        "url": "https://doi.org/10.1097/00001648-200009000-00012",
        "citation_text": "Hernan MA, Brumback B, Robins JM. Marginal structural models to estimate the causal effect of zidovudine on the survival of HIV-positive men. Epidemiology. 2000;11(5):561-570.",
        "year": 2000,
        "authors_short": "Hernan et al.",
        "notes": "Worked application that uses IPCW to handle dependent censoring (loss to follow-up) alongside treatment weights; illustrates weight construction, stabilization, and diagnostics on real survival data."
      },
      {
        "role": "explain",
        "doi": "10.1093/aje/kwn164",
        "url": "https://doi.org/10.1093/aje/kwn164",
        "citation_text": "Cole SR, Hernan MA. Constructing inverse probability weights for marginal structural models. American Journal of Epidemiology. 2008;168(6):656-664.",
        "year": 2008,
        "authors_short": "Cole & Hernan",
        "notes": "Practical construction and diagnostics for stabilized inverse probability weights, including numerator and denominator model specification and weight distribution checks."
      },
      {
        "role": "demonstrate",
        "doi": "10.1097/ede.0000000000000409",
        "url": "https://doi.org/10.1097/EDE.0000000000000409",
        "citation_text": "Howe CJ, Cole SR, Lau B, Napravnik S, Eron JJ. Selection bias due to loss to follow-up in cohort studies. Epidemiology. 2016;27(1):91-97.",
        "year": 2016,
        "authors_short": "Howe et al.",
        "notes": "Demonstrates inverse-probability-of-censoring weighting to correct selection bias from informative loss to follow-up in an observational cohort, with practical weight construction and diagnostics."
      },
      {
        "role": "use",
        "doi": "10.1093/aje/kwv254",
        "url": "https://doi.org/10.1093/aje/kwv254",
        "citation_text": "Hernan MA, Robins JM. Using big data to emulate a target trial when a randomized trial is not available. American Journal of Epidemiology. 2016;183(8):758-764.",
        "year": 2016,
        "authors_short": "Hernan & Robins",
        "notes": "Places IPCW within target-trial emulation, where censoring weights correct the artificial censoring induced by per-protocol analysis of treatment strategies; the applied framework most RWE per-protocol analyses now follow."
      }
    ],
    "plain_language_summary": "Inverse Probability of Censoring Weighting (IPCW) is a statistical technique that corrects a specific kind of dropout problem in studies that follow patients over time. When the patients who leave a study early (drop out, switch treatments, or are otherwise lost) are systematically sicker or healthier than those who stay, the remaining group no longer represents everyone, and any survival estimate you calculate will be skewed. IPCW fixes this by mathematically up-weighting each patient who stays in the study to stand in for similar patients who dropped out, so the weighted group behaves as though no one left early for health-related reasons.",
    "key_terms": [
      {
        "term": "censoring",
        "definition": "What happens when a patient's follow-up ends before the study outcome occurs, usually because they drop out, switch treatments, or the study ends, so you stop observing them."
      },
      {
        "term": "informative censoring",
        "definition": "A dropout pattern where the reason a patient leaves the study is connected to how sick they are or how likely they are to have the outcome, meaning the people who leave are different from those who stay."
      },
      {
        "term": "censoring weight",
        "definition": "A number assigned to each patient who remains in the study, equal to the inverse of their estimated probability of still being there, so patients who were unlikely to stay carry more weight in the analysis."
      },
      {
        "term": "pseudo-population",
        "definition": "The reweighted group of study participants created by IPCW that statistically represents what the full original cohort would have looked like if no one had dropped out for health-related reasons."
      },
      {
        "term": "positivity",
        "definition": "The requirement that every type of patient in the study must have at least some chance of remaining in follow-up; if certain patients are virtually guaranteed to drop out, their censoring weight becomes impossibly large and the method breaks down."
      }
    ],
    "worked_example": {
      "scenario": "A study follows 4 patients with a serious chronic illness to see how long they survive on a new drug. At month 6, two patients drop out. The analyst suspects the dropout is informative because the patients who left were sicker than those who stayed. To correct for this, IPCW assigns each remaining patient a weight equal to 1 divided by that patient's estimated probability of still being in the study at month 6. Patients who were unlikely to remain get large weights so they stand in for similar patients who did drop out.",
      "dataset": {
        "caption": "Four patients in a 12-month survival study. At month 6, Patients P-102 and P-103 drop out (are censored). The analyst estimates each remaining patient's probability of being uncensored at month 6 based on their disease severity score.",
        "columns": [
          "patient_id",
          "status_at_month_6",
          "disease_severity",
          "prob_remaining_uncensored",
          "ipcw_weight"
        ],
        "rows": [
          [
            "P-101",
            "still in study",
            "mild",
            0.8,
            1.25
          ],
          [
            "P-102",
            "dropped out",
            "severe",
            null,
            null
          ],
          [
            "P-103",
            "dropped out",
            "severe",
            null,
            null
          ],
          [
            "P-104",
            "still in study",
            "severe",
            0.25,
            4.0
          ]
        ]
      },
      "steps": [
        "Estimate each patient's probability of remaining uncensored at month 6. Patient P-101 has mild disease, so their estimated probability of staying is 0.80. Patient P-104 has severe disease like the dropouts, so their estimated probability of staying is only 0.25.",
        "Calculate the IPCW weight for each patient still in the study: weight = 1 / probability of remaining uncensored.",
        "P-101: weight = 1 / 0.80 = 1.25. This patient was likely to stay, so their weight is close to 1 and they count for slightly more than one person.",
        "P-104: weight = 1 / 0.25 = 4.00. This severely ill patient was unlikely to stay, so they count as 4 patients in the weighted analysis: themselves plus 3 similar severe patients whom the censoring model expects to have dropped out. Only 2 severe patients actually dropped out; the mismatch arises because 0.25 is a model-based estimate, not the proportion observed in this four-patient sample.",
        "All subsequent survival calculations are performed using these weights. P-104 now represents not just themselves but also the severe patients who dropped out, so the weighted sample approximates the distribution of disease severity that existed in the original full cohort."
      ],
      "result": "After IPCW, the two remaining patients carry weights of 1.25 and 4.00 and together stand in for the original cohort of 4 patients. The weights sum to 5.25 pseudo-patients rather than exactly 4 because the probabilities are model-based estimates, not the proportions observed in this small sample. The severe-disease patients who dropped out are no longer ignored; P-104's weight of 4.00 = 1 / 0.25 ensures they stand in for the lost severe-disease patients, reducing the bias that would have made the drug look more effective than it truly was."
    },
    "prerequisites": [
      "attrition-and-loss-to-follow-up-rwe",
      "survival-extrapolation-hta-rwe",
      "propensity-score-methods-psm-iptw"
    ],
    "index_definitions": [
      {
        "name": "Unstabilized censoring weight",
        "definition": "The inverse of the estimated probability of remaining uncensored through a follow-up interval, conditional on the measured covariate history used in the censoring model.",
        "source": "Robins and Finkelstein 2000",
        "use": "Core IPCW construction when informative censoring is plausible.",
        "notes": "Can be highly variable when the probability of remaining uncensored is small."
      },
      {
        "name": "Stabilized censoring weight",
        "definition": "A censoring weight whose numerator uses a marginal or baseline-covariate-only probability of remaining uncensored and whose denominator uses the full measured covariate history.",
        "source": "Cole and Hernan 2008",
        "use": "Variance control and improved finite-sample behavior in weighted survival analyses.",
        "notes": "Report numerator and denominator model covariates separately."
      },
      {
        "name": "Truncated IPCW",
        "definition": "An IPCW implementation that caps extreme weights at pre-specified percentiles or absolute thresholds.",
        "source": "Epidemiologic weighting practice",
        "use": "Bias-variance sensitivity analysis when positivity is strained.",
        "notes": "Truncation accepts some bias in exchange for lower variance and must be reported as a sensitivity analysis, not hidden."
      }
    ],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Stabilized IPCW weights",
        "description": "The inverse of each subject's cumulative probability of remaining uncensored given the full covariate history is multiplied by the corresponding cumulative probability given baseline covariates only (or the marginal probability), so weights center near 1; the estimator remains consistent like the unstabilized version but typically has smaller variance.",
        "edge_cases": [
          "Stabilized weights should have a mean near 1.0; a mean far from 1 signals censoring-model misspecification or positivity problems.",
          "Stabilization does not fix positivity violations; extreme stabilized weights still indicate covariate histories with near-certain censoring."
        ],
        "data_source_notes": "claims: build the numerator from baseline covariates and the denominator from time-varying utilization and comorbidity accrual measured in observable FFS person-time."
      },
      {
        "name": "Truncated / Winsorized weights for positivity control",
        "description": "Weights above a high percentile (and, when Winsorizing both tails, below a low percentile) are set to that percentile's value to limit the influence of a few subjects when positivity is nearly violated, accepting some bias in exchange for reduced variance.",
        "edge_cases": [
          "Truncation thresholds (e.g., 1st/99th percentile) must be pre-specified and reported; the estimand shifts slightly toward the truncated pseudo-population.",
          "Report results across several truncation levels as a sensitivity analysis rather than reporting only the most favorable."
        ],
        "data_source_notes": "ehr: leakage-driven censoring can produce extreme weights for patients with sparse encounters; cap and report the fraction affected."
      },
      {
        "name": "IPCW within a marginal structural / clone-censor-weight model",
        "description": "Censoring weights are multiplied by treatment weights (and applied to artificially censored clones) to estimate a per-protocol or sustained-treatment-strategy effect under time-varying confounding.",
        "edge_cases": [
          "The combined weight inherits positivity requirements from both the treatment and censoring models; diagnose each separately.",
          "Robust/bootstrap variance is required because both sets of weights are estimated."
        ],
        "data_source_notes": "linked: adjudicated time-varying severity from a registry sharpens both the treatment and censoring models in a target-trial emulation."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Naive Kaplan-Meier / Cox ignoring informative censoring",
        "pros_of_this": "Removes bias from dependent censoring by reweighting to a pseudo-population where censoring is independent of measured time-varying prognosis.",
        "cons_of_this": "Requires a correctly specified censoring model and the no-unmeasured-common-causes assumption; weights can be unstable and intervals are wider (and more honest).",
        "when_to_prefer": "Whenever dropout, switching, or deviation is plausibly driven by measured time-varying prognosis; the naive estimator is fine only for administrative/random censoring."
      },
      {
        "compared_to": "marginal-structural-models-g-methods",
        "pros_of_this": "IPCW is the censoring-weight component that lets MSMs handle dependent censoring; together they estimate sustained-treatment effects under time-varying confounding.",
        "cons_of_this": "Inherits MSM positivity and model-specification demands; on its own IPCW corrects only censoring, not treatment-confounding.",
        "when_to_prefer": "Use IPCW alone for a single censoring process; combine with treatment weights inside an MSM when time-varying confounding of treatment is also present."
      },
      {
        "compared_to": "clone-censor-weight-per-protocol",
        "pros_of_this": "Plain IPCW is simpler when there is one well-defined censoring process to correct (e.g., loss to follow-up).",
        "cons_of_this": "Does not by itself handle time-varying eligibility, grace periods, or the cloning needed for clean per-protocol estimands.",
        "when_to_prefer": "For per-protocol/target-trial estimands with grace periods, use clone-censor-weight (which uses IPCW internally); use plain IPCW for direct correction of informative loss to follow-up."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Censoring is dominated by disenrollment and loss of observable person-time, often informative (e.g., transition to Medicare Advantage where FFS claims disappear). Restrict to FFS-observable time, model the censoring hazard on time-varying utilization, comorbidity accrual, and recent hospitalization, and never let death be silently treated as informative censoring when a competing-risks framing is intended.",
      "ehr": "Censoring is leakage (care sought outside the system) and strongly informative because sicker patients are referred or hospitalized elsewhere; model the censoring hazard on encounter frequency and recent lab/vital trajectories, and define an explicit active-in-system rule that determines what counts as censoring.",
      "registry": "Adjudicated time-varying severity strengthens the censoring model; registries usually need linkage to claims or a death index to separate true death (a competing event) from informative loss to follow-up.",
      "linked": "Linked claims-EHR-vital-records gives EHR specificity, claims disenrollment dates, and a death index for an honest censoring definition; reconcile date discrepancies before defining the censoring indicator and the weight intervals."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\nimport pandas as pd\nimport statsmodels.formula.api as smf\nimport statsmodels.api as sm\n\n# df: person-period long format with subject_id, t, event, censored, L (time-varying), x (exposure), base (baseline cov).\ndf = df.sort_values([\"subject_id\", \"t\"]).reset_index(drop=True)\n\n# Censoring hazard models (pooled logistic). Numerator uses baseline only; denominator adds time-varying L.\nnum_m = smf.logit(\"censored ~ x + base + t\", data=df).fit(disp=0)\nden_m = smf.logit(\"censored ~ x + base + L + t\", data=df).fit(disp=0)\n\n# Per-interval probability of REMAINING uncensored = 1 - hazard of being censored this interval.\ndf[\"p_unc_num\"] = 1.0 - num_m.predict(df)\ndf[\"p_unc_den\"] = 1.0 - den_m.predict(df)\n\n# Cumulative products within subject give the stabilized IPCW weight at each time t.\ndf[\"cum_num\"] = df.groupby(\"subject_id\")[\"p_unc_num\"].cumprod()\ndf[\"cum_den\"] = df.groupby(\"subject_id\")[\"p_unc_den\"].cumprod()\ndf[\"sw\"] = df[\"cum_num\"] / df[\"cum_den\"]\n\nprint(f\"Stabilized weight: mean={df['sw'].mean():.3f} (target ~1.0), \"\n      f\"max={df['sw'].max():.2f}, p99={df['sw'].quantile(0.99):.2f}\")\n\n# Weighted pooled-logistic outcome model in the IPCW pseudo-population; robust SEs (cluster on subject).\nout = smf.glm(\"event ~ x + base + t\", data=df, family=sm.families.Binomial(),\n              freq_weights=df[\"sw\"]).fit(cov_type=\"cluster\",\n                                         cov_kwds={\"groups\": df[\"subject_id\"]})\nprint(out.summary())",
        "description": "Stabilized IPCW for a discrete-time survival analysis with informative censoring. Required input is a person-period\n(long) table: one row per subject per time interval up to event, censoring, or administrative end, with columns\nsubject_id, t (interval index), event (1 at the event interval else 0), censored (1 in the interval the subject is\nlost), and a time-varying covariate L driving both censoring and the event. We fit pooled logistic censoring models\nfor the numerator (baseline only) and denominator (baseline + time-varying L), form stabilized weights as the running\nproduct of (1-num_hazard)/(1-den_hazard), and fit a weighted pooled-logistic outcome model. Robust standard errors are\nrequired because the weights are estimated.",
        "dependencies": [
          "pandas",
          "numpy",
          "statsmodels"
        ],
        "source_citations": [
          "robins-finkelstein-2000",
          "robins-hernan-brumback-2000"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(geepack)\ndat <- dat[order(dat$subject_id, dat$t), ]\n\n# Censoring hazard: numerator (baseline only), denominator (baseline + time-varying L).\nnum_m <- glm(censored ~ x + base + t,     data = dat, family = binomial())\nden_m <- glm(censored ~ x + base + L + t, data = dat, family = binomial())\n\n# Probability of remaining uncensored this interval = 1 - censoring hazard.\ndat$p_unc_num <- 1 - predict(num_m, type = \"response\")\ndat$p_unc_den <- 1 - predict(den_m, type = \"response\")\n\n# Cumulative product within subject -> stabilized IPCW weight at each time.\ncumprod_by <- function(p, id) ave(p, id, FUN = cumprod)\ndat$cum_num <- cumprod_by(dat$p_unc_num, dat$subject_id)\ndat$cum_den <- cumprod_by(dat$p_unc_den, dat$subject_id)\ndat$sw      <- dat$cum_num / dat$cum_den\n\ncat(sprintf(\"Stabilized weight mean=%.3f (target ~1), max=%.2f\\n\",\n            mean(dat$sw), max(dat$sw)))\n\n# Weighted pooled-logistic outcome model with robust SEs (GEE, independence working correlation).\nfit <- geeglm(event ~ x + base + t, id = subject_id, data = dat,\n              family = binomial(), weights = sw, corstr = \"independence\")\nprint(summary(fit))",
        "description": "Stabilized IPCW in R with pooled-logistic censoring models and a weighted pooled-logistic outcome model, using a\nperson-period long data frame (subject_id, t, event, censored, L, x, base). geeglm (geepack) gives robust\n(sandwich) standard errors clustered on subject, which are required because the weights are estimated. The ipw\npackage also implements ipwtm() for time-varying censoring weights as an alternative.",
        "dependencies": [
          "geepack"
        ],
        "source_citations": [
          "robins-finkelstein-2000",
          "robins-hernan-brumback-2000"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* Numerator censoring-hazard model (baseline only); score P(censored). */\nproc logistic data=work.pp noprint;\n  model censored(event='1') = x base t;\n  output out=num_out p=ph_num;\nrun;\n\n/* Denominator censoring-hazard model (adds time-varying L); score P(censored). */\nproc logistic data=work.pp noprint;\n  model censored(event='1') = x base L t;\n  output out=den_out p=ph_den;\nrun;\n\nproc sort data=num_out; by subject_id t; run;\nproc sort data=den_out; by subject_id t; run;\n\ndata weights;\n  merge num_out(keep=subject_id t event x base ph_num)\n        den_out(keep=subject_id t ph_den);\n  by subject_id t;                 /* single BY statement; enables first.subject_id */\n  retain cum_num cum_den;\n  p_unc_num = 1 - ph_num;          /* prob remain uncensored this interval (numerator)   */\n  p_unc_den = 1 - ph_den;          /* prob remain uncensored this interval (denominator) */\n  if first.subject_id then do; cum_num = 1; cum_den = 1; end;\n  cum_num = cum_num * p_unc_num;   /* cumulative product within subject */\n  cum_den = cum_den * p_unc_den;\n  sw = cum_num / cum_den;          /* stabilized IPCW weight at time t   */\nrun;\n\nproc means data=weights mean max p99; var sw; title 'Stabilized IPCW weight (mean target ~1)'; run;\n\n/* Weighted pooled-logistic outcome model; empirical (robust) SEs via REPEATED. */\nproc genmod data=weights;\n  class subject_id;\n  weight sw;\n  model event(event='1') = x base t / dist=binomial link=logit;\n  repeated subject=subject_id / type=ind;\nrun;",
        "description": "Stabilized IPCW in SAS on a person-period dataset work.pp (subject_id, t, event, censored, L, x, base). PROC LOGISTIC\nfits the numerator and denominator censoring-hazard models and scores per-interval probabilities; a DATA step forms\nthe cumulative products within subject (RETAIN) to build the stabilized weight; PROC GENMOD fits the weighted\npooled-logistic outcome model with REPEATED SUBJECT= for empirical (robust) standard errors, which are required\nbecause the weights are estimated.",
        "dependencies": [],
        "source_citations": [
          "robins-finkelstein-2000",
          "robins-hernan-brumback-2000"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Cohort[Cohort with time-varying covariate history L] --> CensM[Fit censoring hazard model<br/>P remain uncensored given history]\n  CensM --> Num[Numerator: baseline-only<br/>probability of staying uncensored]\n  CensM --> Den[Denominator: full history<br/>probability of staying uncensored]\n  Num --> SW[Stabilized weight<br/>cumulative product num / den]\n  Den --> SW\n  SW --> Pos{Positivity OK?<br/>weights bounded, mean ~1}\n  Pos -->|Yes| Pseudo[Weighted pseudo-population<br/>censoring now independent of L]\n  Pos -->|No, extreme weights| Trunc[Truncate weights<br/>report sensitivity]\n  Trunc --> Pseudo\n  Pseudo --> Est[Weighted outcome model<br/>robust / bootstrap variance]",
        "caption": "IPCW workflow: estimate the probability of remaining uncensored from the covariate history, form stabilized weights, check positivity, and fit the weighted outcome model in the pseudo-population with robust variance.",
        "alt_text": "Flowchart from a cohort with time-varying covariates through censoring-hazard modelling, stabilized weight construction, a positivity check with optional truncation, and a weighted outcome model with robust variance.",
        "source_type": "illustrative",
        "source_citations": [
          "robins-finkelstein-2000"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  C{Why is follow-up ending?} -->|Administrative / random| Naive[Naive KM/Cox unbiased<br/>do not weight]\n  C -->|Loss to follow-up driven by measured prognosis| IPCW[IPCW: reweight uncensored subjects]\n  C -->|Treatment switching / deviation| PP[Per-protocol estimand]\n  PP --> CCW[Clone-censor-weight uses IPCW internally]\n  C -->|Unmeasured drivers of censoring| Stop[IPCW cannot fix it<br/>weighted estimate still biased]",
        "caption": "Decision logic for handling censoring. IPCW is warranted when censoring is informative through measured time-varying covariates; it cannot rescue censoring driven by unmeasured factors, and per-protocol estimands wrap IPCW inside the clone-censor-weight scaffolding.",
        "alt_text": "Decision tree on the reason follow-up ends, routing administrative censoring to the naive estimator, measured informative censoring to IPCW, switching to clone-censor-weight, and unmeasured-driven censoring to an unfixable case.",
        "source_type": "illustrative",
        "source_citations": [
          "robins-finkelstein-2000"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "used_with",
        "target_slug": "marginal-structural-models-g-methods",
        "notes": "IPCW supplies the censoring-weight component of a marginal structural model; treatment and censoring weights are multiplied to estimate sustained-treatment effects under time-varying confounding and dependent censoring."
      },
      {
        "relation_type": "part_of",
        "target_slug": "clone-censor-weight-per-protocol",
        "notes": "Clone-censor-weight induces artificial censoring at protocol deviation and uses IPCW to correct it; IPCW is the weighting engine inside the per-protocol/target-trial emulation, not a competing method."
      },
      {
        "relation_type": "affects",
        "target_slug": "attrition-and-loss-to-follow-up-rwe",
        "notes": "Informative attrition is exactly the dependent censoring IPCW corrects; the attrition analysis identifies the drivers that must enter the censoring model."
      },
      {
        "relation_type": "see_also",
        "target_slug": "competing-risks-cause-specific-fine-gray-rwe",
        "notes": "Death is a competing event, not informative censoring; IPCW must not silently treat death as censoring when a competing-risks framing is intended."
      },
      {
        "relation_type": "alternative_to",
        "target_slug": "multiple-imputation-longitudinal-rwe",
        "notes": "Weighting and imputation both address censored/missing follow-up; IPCW reweights observed subjects while multiple imputation models the missing follow-up, and weighting can also be combined with an outcome model in doubly robust (augmented) estimators."
      }
    ],
    "aliases": [
      "IPCW",
      "inverse probability of censoring weighting",
      "censoring weights",
      "inverse-probability-of-censoring-weighted estimation",
      "inverse probability censoring weights",
      "inverse-probability-of-censoring weights",
      "stabilized censoring weights",
      "dependent censoring weights",
      "informative censoring weighting",
      "IPCW survival analysis",
      "IPCW Kaplan-Meier",
      "IPCW Cox model",
      "selection-bias weights for loss to follow-up"
    ],
    "complexity": "advanced",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta",
      "journal"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "ipd-meta-analysis",
    "name": "Individual Participant Data (IPD) Meta-Analysis",
    "short_definition": "A meta-analysis that obtains the raw patient-level records from each contributing study or database, harmonizes them to a common data model, and synthesizes a treatment effect either by pooling stratified per-study estimates (two-stage) or by fitting a single hierarchical model to all participants at once (one-stage).",
    "long_description": "**Individual participant data (IPD) meta-analysis** is the synthesis of the original patient-level records from\nmultiple studies or data partners, rather than the published/aggregate summary statistics that conventional\naggregate-data (AD) meta-analysis must rely on. Having the raw rows lets the analyst standardize eligibility,\ncovariate definitions, follow-up time, and the outcome model across all contributors, then estimate a single\npooled effect with quantified between-study heterogeneity. In real-world evidence it is the analytic backbone of\ndistributed/multi-database programs (FDA Sentinel, the OHDSI network, CNODES), where each site runs an identical\nprotocol on its own claims or EHR extract and the network combines the results.\n\n**Core estimand distinction** — IPD-MA targets the same pooled comparative effect (a log-HR, log-OR, or risk\ndifference) as AD meta-analysis, but two structural choices define how it is estimated and what it can additionally\nrecover. (1) *Two-stage*: fit the chosen model **within each study** (e.g., a per-study Cox model on the matched\ncohort), extract the study-specific effect and its standard error, then combine those estimates with a standard\nrandom-effects (DerSimonian–Laird or REML) or common-effect meta-analysis. (2) *One-stage*: fit **one hierarchical\nmodel to the stacked patient rows**, with study entered as a stratification/random-effect term so that nuisance\nparameters (baseline risk, baseline hazard) are study-specific while the treatment effect is shared or random\nacross studies. The two can give different numbers — not because of a bug, but because they make different\nassumptions about how nuisance parameters are estimated, how the within- and between-study information is weighted,\nand ML vs REML estimation of the heterogeneity variance τ² (Burke, Ensor & Riley 2017). 
The decisive IPD-only\ncapability is the unbiased estimation of **within-study treatment–covariate interactions**: pooling published\nsubgroup effects conflates the within-study interaction (what a clinician needs) with the across-study\n*ecological/aggregation* association, whereas IPD separates them (the \"deft\" approach of Fisher et al. 2017).\nIPD-MA does **not** by itself remove confounding, publication bias, or study-level design flaws — it inherits\nwhatever the contributing studies carry.\n\n**Pros, cons, and trade-offs**\n- **vs aggregate-data (AD) meta-analysis of published effects:** IPD lets you re-define the cohort identically\n  everywhere, re-instate excluded covariates, reanalyze time-to-event with a common model, recover unbiased\n  subgroup/interaction effects, and check individual-level assumptions (proportional hazards, functional form).\n  Cost: it is enormously more expensive, requires data-sharing/governance, and demands serious harmonization\n  labor. **Prefer IPD** when interactions, time-to-event reanalysis, or standardized confounding control are the\n  point; AD is acceptable when the marginal main effect is all that is needed and IPD is unobtainable.\n- **vs network meta-analysis (NMA):** NMA connects many treatments through indirect comparisons but usually runs\n  on aggregate arm-level data; IPD-MA is typically pairwise/few-treatment but patient-level. They are complementary\n  — IPD-NMA exists and is the most demanding of all. **Prefer IPD-MA** for a focused head-to-head where interaction\n  and patient-level adjustment matter.\n- **vs a single pooled-rows analysis that ignores study (naive merge):** simply concatenating everyone's rows and\n  fitting one ordinary model **ignores clustering by study**, can induce Simpson's-paradox reversals, and produces\n  falsely precise standard errors. One-stage IPD-MA fixes this by stratifying/randomizing the study term. 
**Never**\n  do the naive merge.\n- **two-stage vs one-stage within IPD-MA:** two-stage is transparent, mirrors familiar forest-plot meta-analysis,\n  and is the natural fit for *federated* analyses where only per-site summary statistics can leave the firewall.\n  One-stage is more efficient with few or small studies and rare events, handles within-study interactions and\n  non-linear terms more naturally, but is easier to misspecify (especially the τ² and the choice of common vs\n  random treatment effect). **Prefer two-stage** when studies are large and governance blocks row-level pooling;\n  **prefer one-stage** for sparse data, complex modeling, or interaction estimation.\n\n**When to use** — a focused comparative-effectiveness or safety question where the contributing studies/databases\ncan supply patient-level data; distributed RWE networks running a common protocol; questions about *who benefits*\n(treatment-effect modification by age, renal function, baseline severity); time-to-event endpoints that the\noriginal papers reported only as crude rates; situations where standardizing covariate adjustment and eligibility\nacross heterogeneous sources is essential for a credible pooled estimate.\n\n**When NOT to use — and when it is actively misleading or dangerous**\n- **When only a handful of contributors will share data and they are the systematically \"better\" studies.**\n  Availability bias means the IPD subset is not the evidence base; an IPD-MA on the cooperative minority can be\n  *more* biased than an AD-MA of everyone. Always report what fraction of eligible studies/participants supplied\n  IPD and benchmark against the AD effect.\n- **When the underlying studies are confounded or biased.** IPD does not launder design flaws. A pooled\n  confounded effect is a precise wrong answer. 
Harmonized confounding control (e.g., the same propensity model per\n  site) is necessary but not sufficient if key confounders are unmeasured everywhere.\n- **The naive single-model merge ignoring study.** Treating all rows as one sample fabricates precision and can\n  reverse the direction of the effect (aggregation/Simpson bias).\n- **Reading an across-study (ecological) covariate association as an individual-level interaction.** A trend in\n  pooled per-study effects against mean study age is *not* the patient-level age interaction; using it to\n  personalize treatment is the aggregation-biased inference Fisher et al. warn against.\n- **Forcing a common-effect model onto genuinely heterogeneous studies/databases.** When between-study τ² is truly\n  non-zero, a common-effect pooled CI is far too narrow; conversely, estimating τ² from 2–3 studies is unstable and\n  may need a Bayesian prior or a Hartung–Knapp adjustment.\n\n**Data-source operational depth**\n- **Claims (FFS vs MA vs commercial):** Each database defines exposure from the pharmacy claim (NDC + `fill_date`\n  + `days_supply`) and requires continuous medical+pharmacy enrollment across the washout so \"no prior fill\" is\n  observed rather than missing. Failure modes that differ *by site* and quietly drive heterogeneity: Medicare\n  Advantage and capitated person-time lack fee-for-service claims (a site heavy in MA-only members has artificially\n  low captured utilization, so exclude MA-only person-time or model the difference), differential coding intensity\n  between commercial and Medicare populations, and the competing risk of death, which varies with the age mix (an\n  elderly Medicare site has more deaths censoring the event of interest). Harmonize the code lists and the\n  continuous-enrollment rule centrally before any site runs the model.\n- **EHR:** Initiation is the *order/administration*, not a dispensing, and capture is visit-driven, so a patient\n  who leaves the system is differentially lost. 
Sites differ in note/lab availability, so a covariate that exists\n  at one site is missing at another — decide centrally whether such a covariate is dropped network-wide or handled\n  by per-site imputation, because inconsistent handling masquerades as between-study heterogeneity.\n- **Registry:** Strong indication, severity, and adjudicated outcomes but typically incomplete pharmacy exposure;\n  link to claims for fills and to a death index for censoring before contributing to the pool.\n- **Linked claims–EHR–vital records:** the ideal per-site substrate (severity + completeness + mortality), but the\n  linkable subset is a selected population and order/fill/service-date discrepancies must be reconciled before\n  time-zero assignment so that time-zero means the same thing at every site.\n- **Federated vs centralized governance:** when row-level data cannot leave a site (HIPAA, GDPR, payer\n  contracts), run a *two-stage* design — each site fits the common model locally and exports only the coefficient\n  and its SE (and, for an exact one-stage equivalent, sufficient statistics / score and information matrices).\n  Beware **immortal time** introduced site-by-site in procedure or treatment-initiation cohorts: if any site\n  starts follow-up before the exposure decision, it contributes a biased study-specific effect that the pooling\n  step will faithfully carry forward.\n\n**Worked claims example (distributed two-stage IPD-MA).** Question: 1-year risk of hospitalized heart failure with\na second-generation sulfonylurea vs a DPP-4 inhibitor among adults with type 2 diabetes, run across four data\npartners (a commercial claims plan, Medicare FFS, an integrated-delivery EHR linked to claims, and a regional\nregistry linked to claims). A single protocol is distributed. 
At each site: (1) Eligibility — age ≥18, ≥2 diabetes\ndiagnoses, and 365 days of continuous medical+pharmacy enrollment before the first study fill; exclude MA-only\nperson-time so the washout is observable. (2) New-user washout — no fill of *any* sulfonylurea or DPP-4 inhibitor\nin the 365-day lookback. (3) Time zero — date of the first qualifying fill (`fill_date`); assign the arm from the\ndispensed NDC. (4) Confounding control — fit the *same* high-dimensional propensity score from covariates measured\nin `[index_date-365, index_date]` and apply 1:1 PS matching, checking standardized differences <0.1. (5) Outcome —\nfirst validated HF hospitalization; follow from time zero, censoring at disenrollment, death (mortality source\nhierarchy fixed centrally), end of data, and end of the 365-day risk window. (6) Stage 1 — each site fits a Cox\nmodel on its matched cohort and exports only the log-HR and its SE. (7) Stage 2 — the coordinating center pools the\nfour log-HRs with a REML random-effects model, reporting the pooled HR, the 95% CI, and τ² / I² for between-site\nheterogeneity; a one-stage stratified Cox (baseline hazard stratified by site) is run as a sensitivity analysis on\nany sites that can share rows, and a meta-regression of the per-site HR on the site's MA-share and age mix probes\nwhether the data-source failure modes above are driving the heterogeneity.",
    "primary_category": "Study_Design",
    "tags": [
      "meta-analysis",
      "evidence-synthesis",
      "individual-participant-data",
      "one-stage",
      "two-stage",
      "treatment-effect-modification",
      "distributed-analysis",
      "heterogeneity"
    ],
    "applies_to_study_types": [
      "ipd_meta_analysis"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1136/bmj.c221",
        "url": "https://doi.org/10.1136/bmj.c221",
        "citation_text": "Riley RD, Lambert PC, Abo-Zaid G. Meta-analysis of individual participant data: rationale, conduct, and reporting. BMJ. 2010;340:c221.",
        "year": 2010,
        "authors_short": "Riley et al.",
        "notes": "Canonical statement of why IPD synthesis outperforms aggregate-data meta-analysis (standardized analyses, restored data, time-to-event reanalysis, subgroup interactions) and how to conduct and report it."
      },
      {
        "role": "explain",
        "doi": "10.1002/sim.7141",
        "url": "https://doi.org/10.1002/sim.7141",
        "citation_text": "Burke DL, Ensor J, Riley RD. Meta-analysis using individual participant data: one-stage and two-stage approaches, and why they may differ. Statistics in Medicine. 2017;36(5):855-875.",
        "year": 2017,
        "authors_short": "Burke et al.",
        "notes": "Definitive treatment of the one-stage vs two-stage estimand distinction and the reasons (nuisance-parameter stratification, ML vs REML, weighting of within/between information) the two can disagree."
      },
      {
        "role": "demonstrate",
        "doi": "10.1001/jama.2015.3656",
        "url": "https://doi.org/10.1001/jama.2015.3656",
        "citation_text": "Stewart LA, Clarke M, Rovers M, et al. Preferred Reporting Items for a Systematic Review and Meta-analysis of Individual Participant Data: the PRISMA-IPD Statement. JAMA. 2015;313(16):1657-1665.",
        "year": 2015,
        "authors_short": "Stewart et al.",
        "notes": "Reporting standard specific to IPD-MA; the checklist that demonstrates how a defensible IPD synthesis is structured and disclosed (data obtained vs eligible, integrity checks, synthesis model)."
      },
      {
        "role": "use",
        "doi": "10.1136/bmj.j573",
        "url": "https://doi.org/10.1136/bmj.j573",
        "citation_text": "Fisher DJ, Carpenter JR, Morris TP, Freeman SC, Tierney JF. Meta-analytical methods to identify who benefits most from treatments: daft, deluded, or deft approach? BMJ. 2017;356:j573.",
        "year": 2017,
        "authors_short": "Fisher et al.",
        "notes": "Applied methods paper showing the IPD-only advantage — separating within-study treatment-covariate interactions from misleading across-study (ecological) associations to identify who benefits."
      }
    ],
    "plain_language_summary": "Individual Participant Data (IPD) meta-analysis is a way to combine evidence from multiple studies by collecting the actual raw patient records from each study — not just the summary numbers published in a journal article. Having every patient's row of data lets researchers apply the same eligibility rules, the same covariate adjustments, and the same outcome definition across all studies, so the comparison is apples-to-apples rather than apples-to-oranges. It also unlocks questions a standard meta-analysis simply cannot answer, such as whether a drug works better in older patients or in patients with worse kidney function. The trade-off is real: gathering and harmonizing raw data from multiple institutions is expensive, requires data-sharing agreements, and is only feasible when data partners are willing to participate.",
    "key_terms": [
      {
        "term": "individual patient data",
        "definition": "The original row-level records for each patient in a study — one row per person — as opposed to the single summary number (like an average hazard ratio) that a published paper reports."
      },
      {
        "term": "aggregate-data meta-analysis",
        "definition": "The conventional approach that combines only the published summary statistics from each study (for example, one hazard ratio and confidence interval per study) rather than the underlying patient records."
      },
      {
        "term": "one-stage vs two-stage",
        "definition": "Two ways to pool the data: two-stage fits a separate statistical model within each study and then combines those study-level results; one-stage stacks all patients' rows together and fits a single model that treats study membership as a built-in grouping variable — both are valid, but they can give slightly different answers."
      },
      {
        "term": "harmonization",
        "definition": "The process of redefining variables, eligibility criteria, and outcome measurements the same way across all contributing studies so that differences in results reflect biology rather than differences in how each study was run."
      },
      {
        "term": "between-study heterogeneity",
        "definition": "The degree to which the treatment effect genuinely differs from study to study, measured by statistics called tau-squared and I-squared — high heterogeneity means results varied more than you would expect from chance alone."
      },
      {
        "term": "federated analysis",
        "definition": "An approach used when raw patient data cannot leave a site due to privacy rules: each site runs the identical analysis on its own data and sends back only the summary result (a number and its standard error), never the individual records."
      }
    ],
    "worked_example": {
      "scenario": "A research team wants to know whether Drug A lowers one-year hospitalization risk more than Drug B among adults with type 2 diabetes. Three separate database studies exist — a commercial claims study (Site 1), a Medicare claims study (Site 2), and an EHR-linked study (Site 3). A standard aggregate-data meta-analysis could combine only the three published hazard ratios. An IPD meta-analysis instead collects the raw patient rows from all three sites, harmonizes the data, and gains the ability to do things aggregate data cannot.",
      "dataset": {
        "caption": "Hypothetical patient rows that IPD meta-analysis works with (one row per patient, all three sites stacked after harmonization). An aggregate-data meta-analysis never sees these rows — it sees only three numbers.",
        "columns": [
          "person_id",
          "study_id",
          "treat",
          "age",
          "baseline_hba1c",
          "event_hosp",
          "followup_days"
        ],
        "rows": [
          [
            "P001",
            "Site1",
            "DrugA",
            58,
            8.1,
            0,
            365
          ],
          [
            "P002",
            "Site1",
            "DrugB",
            61,
            7.9,
            1,
            210
          ],
          [
            "P003",
            "Site2",
            "DrugA",
            74,
            8.6,
            0,
            365
          ],
          [
            "P004",
            "Site2",
            "DrugB",
            72,
            8.4,
            1,
            180
          ],
          [
            "P005",
            "Site3",
            "DrugA",
            55,
            9.0,
            0,
            365
          ],
          [
            "P006",
            "Site3",
            "DrugB",
            57,
            8.8,
            0,
            365
          ]
        ]
      },
      "comparison_table": {
        "caption": "What each approach can and cannot do",
        "columns": [
          "Capability",
          "Aggregate-data meta-analysis",
          "IPD meta-analysis"
        ],
        "rows": [
          [
            "Combine results across studies",
            "Yes — pools published hazard ratios",
            "Yes — pools patient rows then estimates hazard ratios"
          ],
          [
            "Standardize eligibility criteria across studies",
            "No — stuck with each study's own enrollment rules",
            "Yes — re-applies the same rule to every patient row"
          ],
          [
            "Adjust for the same covariates everywhere",
            "No — each paper adjusted for different variables",
            "Yes — the same model is fit at every site"
          ],
          [
            "Detect whether effect differs by age or disease severity",
            "No — subgroup effects from different papers mix within-study and across-study associations",
            "Yes — the within-study subgroup contrast is unbiased"
          ],
          [
            "Work when data cannot leave a site",
            "Yes (it only needs published numbers)",
            "Yes via two-stage federated design (sites send only their summary result)"
          ]
        ]
      },
      "steps": [
        "Each site receives the same analysis protocol specifying eligibility (age 18+, diagnosis codes, 365-day enrollment lookback), the same covariate list (age, HbA1c, comorbidities), and the same outcome definition (first hospitalization within 365 days).",
        "Each site harmonizes its data to the shared schema — the patient rows shown above represent what the combined, standardized dataset looks like after that step.",
        "Two-stage approach: each site fits a Cox survival model on its own matched cohort and reports back one log-hazard-ratio and one standard error (three numbers total from three sites).",
        "The coordinating center combines those three study-level estimates using a random-effects model (REML), producing a single pooled hazard ratio with a 95% confidence interval and a measure of how much the effect varied across sites (I-squared).",
        "Because the IPD contains each patient's age and baseline HbA1c, the team can also ask: does Drug A work better in patients with HbA1c above 9? This within-study interaction is estimated at each site and then pooled — an analysis that aggregate-data meta-analysis cannot perform correctly.",
        "Result: pooled HR = 0.78 (95% CI 0.65-0.93), I-squared = 18% (low heterogeneity). The subgroup analysis shows a stronger benefit in patients with HbA1c > 9.0 (HR 0.62), a finding only recoverable from the patient-level data."
      ],
      "result": "Pooled HR for hospitalization, Drug A vs Drug B = 0.78 (95% CI 0.65-0.93). The IPD approach also reveals that patients with baseline HbA1c above 9.0 benefit more (HR 0.62 in that subgroup) — a finding that would have been invisible to aggregate-data meta-analysis because published papers did not report that specific subgroup split."
    },
    "prerequisites": [
      "meta-analysis-obs",
      "multi-database",
      "propensity-score-methods-psm-iptw"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Two-stage IPD meta-analysis",
        "description": "Fit the chosen outcome model separately within each study/database, extract the study-specific effect estimate and standard error, then combine them with a common-effect or random-effects (DerSimonian-Laird / REML, optionally Hartung-Knapp) meta-analysis.",
        "edge_cases": [
          "Studies with zero events in an arm yield unstable or undefined effect estimates and SEs; consider continuity corrections, exact methods, or a one-stage model instead.",
          "Estimating between-study variance from very few studies (2-3) is unreliable; a Hartung-Knapp adjustment or a weakly informative Bayesian prior on tau-squared is safer than naive REML."
        ],
        "data_source_notes": "Federated networks (Sentinel, OHDSI): each site fits the model behind its firewall and exports only the coefficient + SE (or sufficient statistics), satisfying data-sharing constraints without moving rows."
      },
      {
        "name": "One-stage IPD meta-analysis",
        "description": "Fit a single hierarchical model to the stacked patient rows, with study entered as a stratification term (study-specific baseline hazard/intercept) and the treatment effect modeled as common or random across studies.",
        "edge_cases": [
          "Must stratify or random-effect the study term; a naive single model pooling all rows ignores clustering, can reverse the effect (aggregation/Simpson bias), and understates the standard error.",
          "Choice of common vs random treatment effect and ML vs REML estimation materially changes the result and the CI; pre-specify it."
        ],
        "data_source_notes": "Requires row-level pooling (centralized governance) or a federated one-stage algorithm that exchanges score/information matrices; harmonize covariate definitions across sites before stacking."
      },
      {
        "name": "IPD meta-analysis of treatment-covariate interactions",
        "description": "Estimate effect modification using only the within-study contrast (centering the covariate within each study, or a within/between decomposition) so the interaction is not contaminated by across-study ecological associations.",
        "edge_cases": [
          "Pooling study-level subgroup effects against a study-level mean covariate estimates an ecological association, not the patient-level interaction - it can point the wrong way.",
          "Within-study interactions are low-powered; require many events per study and pre-registration to avoid data-dredged subgroups."
        ],
        "data_source_notes": "Needs the actual patient-level covariate at each site (e.g., baseline eGFR, HbA1c, age) - a common reason AD meta-analysis cannot answer who-benefits questions."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Aggregate-data (AD) meta-analysis of published effects",
        "pros_of_this": "Standardizes eligibility, covariate adjustment, follow-up, and the outcome model across studies; reanalyzes time-to-event; recovers unbiased within-study interactions; allows individual-level assumption checks.",
        "cons_of_this": "Far more costly, requires data-sharing governance and heavy harmonization, and risks availability bias if only the cooperative subset of studies contributes IPD.",
        "when_to_prefer": "When treatment-effect modification, time-to-event reanalysis, or harmonized confounding control is central and patient-level data can be obtained from a representative set of contributors."
      },
      {
        "compared_to": "Network meta-analysis (NMA)",
        "pros_of_this": "Patient-level adjustment and unbiased interaction estimation for a focused comparison; checks individual-level assumptions an arm-level network cannot.",
        "cons_of_this": "Typically limited to pairwise/few-treatment comparisons; does not by itself connect a large network of treatments through indirect evidence.",
        "when_to_prefer": "Focused head-to-head questions where patient-level adjustment and effect modification matter more than breadth of the treatment network."
      },
      {
        "compared_to": "Naive single-model merge of all patient rows (ignoring study)",
        "pros_of_this": "Correctly accounts for clustering by study via stratified baseline parameters and a common/random treatment effect, giving valid standard errors and avoiding aggregation (Simpson) reversals.",
        "cons_of_this": "More complex to specify (random-effects structure, tau-squared, common vs random effect) than a single pooled regression.",
        "when_to_prefer": "Always, over a naive merge - the naive merge is not a defensible analysis."
      },
      {
        "compared_to": "One-stage IPD meta-analysis (when this entry's reference point is two-stage)",
        "pros_of_this": "Two-stage is transparent, mirrors familiar forest-plot synthesis, and is the natural fit for federated analyses where only per-site summary statistics may leave the firewall.",
        "cons_of_this": "Less efficient with few/small studies or rare events, and cannot natively model within-study interactions or non-linear terms as flexibly as a one-stage hierarchical model.",
        "when_to_prefer": "Large studies and/or governance that blocks row-level pooling; switch to one-stage for sparse data, rare events, or interaction estimation."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Each site defines exposure from NDC + fill_date + days_supply and requires continuous medical+pharmacy enrollment across the washout so absence of prior fills is observed; exclude Medicare Advantage-only person-time (no FFS claims) and harmonize code lists and the continuous-enrollment rule centrally before any site fits the model, so between-site differences reflect biology, not capture artifacts.",
      "ehr": "Initiation is the order/administration, not a dispensing, and capture is visit-driven; decide centrally whether covariates available at only some sites are dropped network-wide or per-site imputed, because inconsistent handling inflates apparent between-study heterogeneity.",
      "registry": "Strong indication/severity/adjudicated outcomes but incomplete pharmacy exposure; link to claims for fills and to a death index for censoring before contributing study-specific estimates to the pool.",
      "linked": "Linked claims-EHR-vital-records is the ideal per-site substrate but the linkable subset is selected, and order/fill/service-date discrepancies must be reconciled so time-zero is defined identically at every site."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\nimport pandas as pd\nfrom lifelines import CoxPHFitter\nimport statsmodels.api as sm\n\ndef two_stage_ipd_ma(ipd: pd.DataFrame) -> dict:\n    # ---- Stage 1: per-study Cox model, keep treatment log-HR and its SE ----\n    rows = []\n    for study, g in ipd.groupby(\"study_id\"):\n        if g[\"event\"].sum() == 0 or g[\"treat\"].nunique() < 2:\n            continue  # uninformative site (no events or single-arm) - report, do not impute\n        cph = CoxPHFitter().fit(g[[\"time\", \"event\", \"treat\"]],\n                                duration_col=\"time\", event_col=\"event\")\n        rows.append({\"study_id\": study,\n                     \"loghr\": cph.params_[\"treat\"],\n                     \"se\":    cph.standard_errors_[\"treat\"]})\n    stage1 = pd.DataFrame(rows)\n\n    # ---- Stage 2: REML random-effects pool of the per-study log-HRs ----\n    y, s2 = stage1[\"loghr\"].to_numpy(), stage1[\"se\"].to_numpy() ** 2\n    tau2 = 0.0\n    for _ in range(100):  # REML fixed-point iteration for between-study variance\n        w = 1.0 / (s2 + tau2)\n        mu = np.sum(w * y) / np.sum(w)\n        tau2_new = max(0.0, (np.sum(w**2 * ((y - mu) ** 2 - s2)) +\n                             (1.0 / np.sum(w))) / np.sum(w**2))\n        if abs(tau2_new - tau2) < 1e-10:\n            break\n        tau2 = tau2_new\n    w = 1.0 / (s2 + tau2)\n    mu = np.sum(w * y) / np.sum(w)\n    se_mu = np.sqrt(1.0 / np.sum(w))\n    Q = np.sum((y - np.sum((y / s2)) / np.sum(1 / s2)) ** 2 / s2)\n    I2 = max(0.0, (Q - (len(y) - 1)) / Q) if Q > 0 else 0.0\n    return {\"per_study\": stage1, \"pooled_HR\": np.exp(mu),\n            \"ci95\": (np.exp(mu - 1.96 * se_mu), np.exp(mu + 1.96 * se_mu)),\n            \"tau2\": tau2, \"I2\": I2}\n\ndef one_stage_ipd_ma(ipd: pd.DataFrame):\n    # Stratified Cox: study-specific baseline hazard (nuisance), shared treatment log-HR.\n    cph = CoxPHFitter()\n    cph.fit(ipd[[\"time\", \"event\", \"treat\", 
\"study_id\"]],\n            duration_col=\"time\", event_col=\"event\", strata=[\"study_id\"])\n    return cph  # cph.summary holds the pooled treat log-HR, SE, CI\n\n# res = two_stage_ipd_ma(ipd); print(res[\"pooled_HR\"], res[\"ci95\"], res[\"I2\"])",
        "description": "Two-stage and one-stage IPD meta-analysis from a stacked, already-harmonized patient-level table. Required input\n(one row per patient, identical schema across all contributing sites/studies):\n  ipd : person_id, study_id (site/study label), treat (1=study drug, 0=comparator),\n        time (follow-up time to event/censor), event (1=event, 0=censored),\n        and harmonized baseline covariates already used in each site's propensity model.\nStage 1 fits a per-study Cox model and keeps the log-HR + SE; stage 2 pools them with REML random effects.\nThe one-stage alternative is a stratified Cox (baseline hazard stratified by study). NEVER fit a single Cox on\nthe stacked rows without the study stratifier - that ignores clustering and can reverse the effect.",
        "dependencies": [
          "pandas",
          "numpy",
          "lifelines",
          "statsmodels"
        ],
        "source_citations": [
          "burke-2017"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(survival)\nlibrary(metafor)\n\n# ---- Stage 1: per-study Cox model, collect treatment log-HR and SE ----\nstage1 <- do.call(rbind, lapply(split(ipd, ipd$study_id), function(g) {\n  if (sum(g$event) == 0 || length(unique(g$treat)) < 2) return(NULL)  # uninformative site\n  fit <- coxph(Surv(time, event) ~ treat, data = g)\n  data.frame(study_id = g$study_id[1],\n             loghr = coef(fit)[[\"treat\"]],\n             se    = sqrt(vcov(fit)[[\"treat\", \"treat\"]]))\n}))\n\n# ---- Stage 2: REML random-effects pool (Hartung-Knapp when few studies) ----\npooled <- rma(yi = stage1$loghr, sei = stage1$se,\n              method = \"REML\", test = \"knha\")\npooled_HR <- exp(pooled$b)                              # pooled hazard ratio\nci        <- exp(c(pooled$ci.lb, pooled$ci.ub))         # 95% CI\nc(HR = pooled_HR, lcl = ci[1], ucl = ci[2],\n  tau2 = pooled$tau2, I2 = pooled$I2)\n\n# ---- One-stage alternative: study-stratified Cox (study-specific baseline hazard) ----\none_stage <- coxph(Surv(time, event) ~ treat + strata(study_id), data = ipd)\nsummary(one_stage)$coefficients   # pooled treat log-HR, SE, CI from the stacked rows",
        "description": "Two-stage (survival::coxph per study -> metafor::rma REML pool) and one-stage (study-stratified coxph) IPD-MA.\nInput mirrors the Python version: one row per patient with study_id, treat (0/1), time, event (0/1), and\nharmonized baseline covariates. metafor's rma() with method='REML' gives the pooled estimate, tau^2 and I^2; use\ntest='knha' (Hartung-Knapp) when the number of studies is small.",
        "dependencies": [
          "survival",
          "metafor"
        ],
        "source_citations": [
          "burke-2017"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* ---- Stage 1: per-study Cox model -> treatment log-HR (Estimate) and SE (StdErr) ---- */\nproc phreg data=work.ipd;\n  by study_id;\n  model time*event(0) = treat;          /* time-to-event with treat as the only term */\n  ods output ParameterEstimates=stage1; /* keeps Estimate (log-HR) and StdErr by study */\nrun;\n\ndata stage1;\n  set stage1;\n  where Parameter = 'treat';\n  loghr = Estimate;\n  v     = StdErr**2;                     /* within-study variance of the log-HR */\nrun;\n\n/* ---- Stage 2: DerSimonian-Laird random-effects pool with the known within-study variances ----\n   Closed-form moment estimator of tau^2, then an inverse-variance weighted pool. Runs in Base/STAT\n   with no PARMS data set: each study's within-study variance v is treated as known (fixed). */\nproc means data=stage1 noprint;\n  var loghr;\n  output out=k n=k;                      /* k = number of studies */\nrun;\n\n/* Fixed-effect (inverse-variance) quantities and Cochran's Q. */\ndata fe;\n  if _n_ = 1 then set k(keep=k);\n  set stage1 end=last;\n  retain sw 0 swy 0 sw2 0 q 0;\n  w   = 1 / v;                           /* fixed-effect weight = 1/within-study variance */\n  sw  + w;                               /* sum of weights */\n  swy + w*loghr;                         /* sum of weight*log-HR */\n  sw2 + w*w;                             /* sum of squared weights */\n  if last then do;\n    mu_fe = swy / sw;                    /* fixed-effect pooled log-HR (used to center Cochran's Q) */\n    output;\n  end;\n  keep k sw swy sw2 mu_fe;\nrun;\n\n/* Cochran's Q about the fixed-effect mean, then DerSimonian-Laird tau^2. 
*/\ndata q;\n  if _n_ = 1 then set fe(keep=k sw swy sw2 mu_fe);\n  set stage1 end=last;\n  retain q 0;\n  w  = 1 / v;\n  q + w*(loghr - mu_fe)**2;              /* Cochran's Q */\n  if last then do;\n    c      = sw - (sw2 / sw);            /* DL scaling constant */\n    tau2   = max(0, (q - (k - 1)) / c);  /* between-study variance (truncated at 0) */\n    output;\n  end;\n  keep k tau2;\nrun;\n\n/* Random-effects inverse-variance pool with weights 1/(v + tau^2). */\ndata pool;\n  if _n_ = 1 then set q(keep=tau2);\n  set stage1 end=last;\n  retain swr 0 swry 0;\n  wr   = 1 / (v + tau2);                 /* random-effects weight */\n  swr  + wr;\n  swry + wr*loghr;\n  if last then do;\n    loghr_pool = swry / swr;             /* pooled log-HR */\n    se_pool    = sqrt(1 / swr);          /* SE of pooled log-HR */\n    lcl        = loghr_pool - 1.96*se_pool;\n    ucl        = loghr_pool + 1.96*se_pool;\n    hr     = exp(loghr_pool);            /* pooled hazard ratio */\n    hr_lcl = exp(lcl);\n    hr_ucl = exp(ucl);\n    output;\n  end;\n  keep tau2 loghr_pool se_pool hr hr_lcl hr_ucl;\nrun;\n/* pool holds the pooled HR with 95% CI and tau^2 (between-study heterogeneity). */\n\n/* ---- One-stage alternative: study-stratified Cox on the stacked rows ---- */\nproc phreg data=work.ipd;\n  strata study_id;                       /* study-specific baseline hazard (nuisance) */\n  model time*event(0) = treat;           /* shared pooled treatment log-HR */\n  hazardratio 'pooled' treat;            /* pooled HR with 95% CI */\nrun;",
        "description": "Two-stage and one-stage IPD-MA in SAS. Required input dataset (one row per patient, identical schema per site):\n  work.ipd : person_id study_id treat (1/0) time event (1/0) <harmonized baseline covariates>\nStage 1 uses PROC PHREG by study to get each site's log-HR and SE; PROC MIXED with a known measurement-error\nvariance (the classic two-stage random-effects meta-analysis via method=REML) pools them. The one-stage\nalternative is a single PROC PHREG with STRATA study_id (study-specific baseline hazard, shared treatment effect).",
        "dependencies": [],
        "source_citations": [
          "burke-2017"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Q[Focused comparative question] --> Get[Obtain patient-level data<br/>from each study / database]\n  Get --> Harm[Harmonize: eligibility, covariates,<br/>follow-up, outcome to a common data model]\n  Harm --> Gov{Can rows leave the site?}\n  Gov -->|No - federated| TwoStage[Two-stage: each site fits the model,<br/>exports log-HR + SE only]\n  Gov -->|Yes - centralized| Choice{Sparse data or<br/>interactions needed?}\n  Choice -->|Yes| OneStage[One-stage: single hierarchical model,<br/>study as stratum / random effect]\n  Choice -->|No| TwoStage\n  TwoStage --> Pool[REML random-effects pool:<br/>pooled HR, 95% CI, tau^2 / I^2]\n  OneStage --> Pool\n  Pool --> Het[Probe heterogeneity:<br/>meta-regression on site MA-share, age mix]",
        "caption": "IPD meta-analysis workflow. Harmonization to a common data model precedes synthesis; governance dictates federated two-stage vs centralized one-stage, and heterogeneity is interrogated rather than averaged away.",
        "alt_text": "Flowchart from a focused question through obtaining and harmonizing patient-level data, a governance decision between two-stage and one-stage synthesis, REML pooling, and heterogeneity interrogation.",
        "source_type": "illustrative",
        "source_citations": [
          "riley-2010"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  subgraph WS[Within-study interaction - what IPD recovers]\n    A[Patient-level covariate centered within each study] --> B[Unbiased treatment x covariate effect]\n  end\n  subgraph AS[Across-study association - the ecological trap]\n    C[Per-study effect vs study-mean covariate] --> D[Aggregation / ecological slope<br/>NOT the patient-level interaction]\n  end\n  B --> Deft[Deft: combine within-study interactions]\n  D --> Deluded[Deluded: read ecological slope<br/>as personalization rule]",
        "caption": "The core IPD-only advantage (Fisher et al. 2017). Estimating effect modification from the within-study contrast is valid; reading the across-study (ecological) association as an individual-level interaction is the deluded approach that points the wrong way.",
        "alt_text": "Diagram contrasting the valid within-study treatment-covariate interaction recoverable from IPD against the misleading across-study ecological association.",
        "source_type": "illustrative",
        "source_citations": [
          "fisher-2017"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "is_variant_of",
        "target_slug": "meta-analysis-obs",
        "notes": "IPD-MA is the patient-level form of meta-analysis; the aggregate-data observational meta-analysis combines published effect estimates rather than raw records."
      },
      {
        "relation_type": "alternative_to",
        "target_slug": "meta-analysis-rct",
        "notes": "IPD synthesis of trials is the higher-effort alternative to aggregate-data RCT meta-analysis, adding standardized reanalysis and unbiased within-study interactions."
      },
      {
        "relation_type": "see_also",
        "target_slug": "network-meta-analysis",
        "notes": "NMA connects many treatments via indirect (usually aggregate) evidence; IPD-MA is patient-level but typically pairwise - the two are complementary and can be combined as IPD-NMA."
      },
      {
        "relation_type": "used_with",
        "target_slug": "multi-database",
        "notes": "IPD-MA is the synthesis engine of distributed multi-database RWE networks (Sentinel, OHDSI, CNODES) where each partner runs a common protocol on its own data."
      },
      {
        "relation_type": "part_of",
        "target_slug": "systematic-review",
        "notes": "An IPD meta-analysis is the quantitative synthesis step of a systematic review conducted with patient-level data."
      },
      {
        "relation_type": "used_with",
        "target_slug": "target-trial-emulation",
        "notes": "Each contributing site commonly emulates the same target trial (new-user, active-comparator, time-zero aligned) before its estimate enters the pool."
      },
      {
        "relation_type": "used_with",
        "target_slug": "propensity-score-methods-psm-iptw",
        "notes": "A harmonized propensity-score model is applied identically at each site so that the per-study estimates being pooled are confounding-adjusted in the same way."
      }
    ],
    "aliases": [
      "IPD meta-analysis",
      "individual patient data meta-analysis",
      "one-stage meta-analysis",
      "two-stage meta-analysis"
    ],
    "complexity": "advanced",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "kaplan-meier-estimator",
    "name": "Kaplan-Meier Estimator",
    "short_definition": "The Kaplan-Meier (product-limit) estimator is a nonparametric method for estimating the probability that an event has not yet occurred by any follow-up time, from a mix of subjects who experienced the event and subjects whose observation ended before the event (censored); it multiplies conditional survival probabilities at each observed event time to produce a step-function curve from which median survival time, quantiles, and confidence bands are read, under the assumption that censoring is non-informative.",
    "long_description": "**Product-limit construction: risk sets and conditional survival at each event time**\n\nThe Kaplan-Meier estimator builds the survival curve one event at a time. Let t_1 < t_2 < ...\n< t_k denote the distinct times at which events occur (ties are handled by treating the\nevent and any same-day censorings together). At each event time t_j, define n_j as the\nnumber of subjects still in the risk set (not yet had the event, not yet censored before\nt_j) and d_j as the number of events that occur at exactly t_j. The conditional probability\nof surviving past t_j, given survival to t_j, is (n_j - d_j) / n_j. The Kaplan-Meier\nestimate at time t is the cumulative product of all such conditional probabilities for\nevent times up to and including t:\n\n  S_KM(t) = product over all j with t_j <= t of ((n_j - d_j) / n_j)\n\nThis product-limit formula is what Kaplan and Meier (1958) introduced. Its power comes\nfrom updating the risk set after every event: subjects who are censored between consecutive\nevent times exit the risk set before the next event, so the denominator shrinks and each\nsubsequent conditional probability uses only those still observed. Censored subjects\ncontribute their observed follow-up time to every risk set they are part of before exit;\nthey do not contribute to the event count d_j. The resulting step function drops only at\nobserved event times and is flat between them, making it easy to read off the probability\nof being event-free at any given time.\n\n**Median survival time and quantiles from the curve**\n\nThe KM median survival time is the smallest t at which the curve drops to or below 0.5\n(i.e., S_KM(t) ≤ 0.5). 
If the curve never reaches 0.5 within the observation window,\nbecause too many subjects are censored before enough events accumulate, the median is\nnot estimable and is conventionally reported as \"not reached\" (or as exceeding the last\nobserved follow-up time). Similarly, the p-th quantile (e.g., the\n25th or 75th percentile of the event-time distribution) is read as the smallest t where\nS_KM(t) ≤ (1 - p). These quantiles are robust to the non-normal shape typical of\ntime-to-event data in RWE and are the primary summaries reported alongside KM curves in\npublications and regulatory submissions.\n\n**Greenwood variance and confidence interval bands**\n\nThe variance of the KM estimator at time t is estimated by the Greenwood formula:\n\n  Var[S_KM(t)] = S_KM(t)^2 * sum over j with t_j <= t of (d_j / (n_j * (n_j - d_j)))\n\nPointwise 95% confidence intervals are typically constructed on a transformed scale and\nthen back-transformed. The log-log transformation guarantees limits inside [0, 1]; the log\ntransformation guarantees only a positive lower limit, so its upper limit can exceed 1 and\nis usually truncated at 1. The Hall-Wellner and equal-precision (EP) bands, available through\nthe CONFBAND= option of SAS PROC LIFETEST and through R add-on packages such as km.ci\n(survfit() itself returns pointwise intervals only), are simultaneous\nbands that cover the entire curve at a nominal confidence level. The CI widens as the risk\nset shrinks, notably at the right tail of the curve, where few subjects remain after\nheavy censoring. A KM curve that drops sharply in its final steps with extremely wide CI\nbands signals that the tail estimate is unreliable, a near-universal occurrence in claims\ndata where late follow-up is dominated by the few persistent enrollees.\n\n**Assumptions: non-informative censoring**\n\nThe foundational assumption is that censoring is non-informative (often called independent\ncensoring): at any event time t, subjects censored at t have the same future event hazard as those\nstill under observation. When this holds, the risk-set calculation is valid: the\ncensored subjects who exit are a random draw from those still at risk, not a systematically\nsicker or healthier subset. 
In real-world insurance claims, this assumption is threatened\nby health-correlated disenrollment: patients who lose coverage due to job loss tied to\nworsening disease, or who transition to Medicare Advantage plans because of escalating\ncare needs, leave the observable risk set at elevated risk. The direction of the bias is\ntypically optimistic — the remaining, observable subjects are healthier than those who\nleft — and the KM curve overstates survival probability. See the censoring-mechanisms-rwe\nentry for a full taxonomy of censoring types and when the non-informative assumption fails.\n\n**Number-at-risk tables: mandatory reporting**\n\nA KM curve without a number-at-risk table is incomplete. The number at risk at each\ndisplayed time point (typically at regular intervals below the x-axis) tells the reader\nhow many subjects were still contributing to the estimate at that moment. A curve that\nappears smooth and well-estimated at month 24 in a study that enrolled 10,000 patients\nmay be based on only 200 at-risk individuals if enrollment was staggered and\nadministrative censoring was common. Regulatory guidance from the FDA and EMA, as well\nas methodological standards from Pocock et al. (2002), treat the at-risk table as\nnon-negotiable. Both ggsurvfit (R) and PROC LIFETEST (SAS) produce at-risk rows natively.\n\n**KM versus 1 minus KM and the competing-risks overestimation trap**\n\nA common mistake is reporting 1 - S_KM(t) as the cumulative incidence of the event when\na competing risk (most often death) is present and non-negligible. Treating death as\ncensoring when the event of interest is a non-fatal outcome (readmission, stroke,\nmedication failure) assumes the censored-dead patients remain at risk of the non-fatal\noutcome — which is impossible. 
The 1 - KM curve therefore overestimates cumulative\nincidence in a real population where mortality removes subjects permanently, and the\noverestimation is differential when mortality rates differ by treatment arm. When competing\nrisks are present, the correct descriptive estimator is the Aalen-Johansen cumulative\nincidence function, which treats competing events as distinct exit types rather than\ncensoring. See the competing-risks-cause-specific-fine-gray-rwe and\ncumulative-incidence-risk-rwe entries for full treatment.\n\n**No covariate adjustment: KM is descriptive**\n\nThe KM estimator makes no distributional assumption and accepts no covariates. It describes\nobserved (possibly confounded) survival in the data as delivered. If two arms differ in\nbaseline age, comorbidity, or frailty, the KM curves absorb those imbalances — the\nestimator will not adjust for them. Covariate-adjusted survival curves require either\n(a) Cox proportional hazards regression with the baseline covariates as predictors and the\ncurve predicted at representative covariate values, or (b) standardization (direct\nadjustment) over the covariate distribution of a reference population. These adjusted\ncurves carry the proportional-hazards or model-specification assumption that KM avoids,\nso reporting the unadjusted KM alongside adjusted estimates is standard practice for\ntransparency.\n\n**Claims-specific considerations**\n\nIn commercial insurance claims, the dominant censoring mechanism is plan disenrollment,\nwhich is typically health-correlated (see above). The risk set at the tail of the KM\ncurve in a 12-month or 24-month study is often small — fewer than 5 to 10 percent of\nthe original cohort — because enrollment gaps and plan switches remove subjects over\ntime. KM estimates at the tail based on fewer than 10 to 20 subjects at risk are\nunstable: a single event can drop the curve by 5 to 10 percentage points, and the CI\nbands expand dramatically. 
The convention in regulatory pharmacoepidemiology is to\ntruncate the displayed KM curve at the time when fewer than 10 subjects remain at risk,\nor to report only up to the time when 80 percent of the cohort has been censored. HEOR\nreports to payers should follow this convention and include the at-risk counts explicitly.\n\n**Pros, cons, and trade-offs**\n\nPros of the Kaplan-Meier estimator:\n- Fully nonparametric — makes no assumption about the shape of the underlying event-time\n  distribution; valid for exponential, Weibull, log-normal, and unknown distributions alike.\n- Accommodates censoring rigorously through the risk-set mechanism, provided the\n  non-informative censoring assumption holds.\n- Produces a visual curve that communicates survival probability intuitively to clinical\n  and payer audiences.\n- The log-rank test is the natural companion for group comparisons (see log-rank-test).\n\nCons:\n- Purely descriptive: no covariate adjustment, no causal inference without additional\n  assumptions and methods.\n- The non-informative censoring assumption is often violated in insurance claims, biasing\n  the curve optimistically when the sickest patients disenroll earliest.\n- Tail instability: small risk sets at late time points yield wide CI bands and unstable\n  point estimates.\n- Incorrectly applied to competing-risk settings (treating death as censoring) produces\n  upward-biased cumulative incidence estimates.\n- The KM curve and median survival are not directly usable in cost-effectiveness models\n  that require hazard rates or parametric distributions; parametric fitting or RMST is\n  needed for extrapolation.\n\n**When to use**\n\nUse the Kaplan-Meier estimator as the primary descriptive tool whenever the outcome is a\ntime-to-event endpoint and the research question is \"what fraction of the population\nremained event-free by time t?\" It is appropriate for:\n- Comparative effectiveness and safety studies where unadjusted group-level survival 
curves\n  are needed as a starting point or visual summary.\n- Regulatory submissions to FDA and EMA, where KM curves with at-risk tables are a\n  standard component of the clinical and safety sections.\n- Phase II or pilot studies where the non-parametric approach avoids parametric\n  misspecification in small samples.\n- Any setting where the log-rank test will be used, since both share the same risk-set\n  framework and the KM curve is the natural visual companion.\n\n**When NOT to use — and when this is actively misleading**\n\nDo not use the Kaplan-Meier 1 - S_KM(t) curve as the cumulative incidence when a\ncompeting event is non-negligible — typically death in any study of older patients or\nserious chronic conditions. This is the most consequential misapplication in RWE. The\noverestimation of absolute risk can be clinically and policy-relevant in magnitude (often\n1 to 3 percentage points at 12 months in elderly cohorts), and the bias is differential\nby arm when competing mortality differs.\n\nDo not use KM as the primary analysis when group imbalance must be controlled: the estimator\nhas no mechanism for covariate adjustment. In an observational study without adequate\nmatching or weighting, the KM curves reflect a confounded comparison, and reporting them\nas if they estimate a causal contrast is misleading. Use Cox regression or standardized\nsurvival curves for adjusted inference, and report KM as descriptive background.\n\nDo not extend the curve beyond the time when the risk set becomes small (fewer than 10 to\n20 subjects). The curve is arithmetically valid but practically useless in that region,\nand presenting a smooth-looking KM tail based on two or three subjects at risk misleads\nthe audience about the precision of the estimate.\n\nDo not use KM for long-horizon extrapolation (e.g., lifetime survival for a cost-utility\nmodel). 
The curve is anchored to observed follow-up; extrapolating requires parametric\nsurvival models fitted to the observed data and validated against external sources.\n\n**Interpreting the output**\n\nIn the 10-patient worked example, the KM curve estimates S(10) = 0.9, S(30) = 0.7875,\nS(50) = 0.63, and S(70) = 0.4725. The median survival time is 70 days.\n\nFormal interpretation: the KM survival probability of 0.4725 at day 70 is the estimated\nprobability of remaining event-free through day 70, under the assumption that the three\nsubjects censored before day 70 were censored non-informatively (i.e., their residual event\nrisk was the same as that of those who remained under observation at each event time). The\nestimate is a product of four conditional probabilities, one at each observed event time,\nand the KM curve is the nonparametric maximum-likelihood estimator of the event-time\ndistribution. The stated median of 70 days is the smallest observed time at which the estimated\nevent-free probability drops to or below 0.5. After day 70, only three subjects remain\nin the risk set, and all three are later censored (days 80, 85, and 90). The curve stays\nflat at 0.4725, and because the Greenwood variance changes only at event times, the plotted\ninterval does not widen further even though the estimate rests on fewer and fewer subjects;\nthe tail is no longer reliably estimated.\n\nPractical interpretation: approximately half of patients in this cohort would be expected\nto have had the event by day 70. However, the estimate is based on 10 patients and 4\nevents; the confidence interval around the median is wide and the result should be treated\nas a pilot estimate, not a definitive survival probability. In a real claims analysis with\nthousands of patients, the same calculation applies mechanically but the risk-set stability\nand the plausibility of the non-informative censoring assumption both require explicit\ndocumentation and, in regulatory submissions, a sensitivity analysis.",
    "primary_category": "Inferential_Statistics",
    "tags": [
      "survival-analysis",
      "time-to-event",
      "nonparametric",
      "product-limit",
      "kaplan-meier",
      "KM-curve",
      "median-survival",
      "censoring",
      "risk-set",
      "greenwood-variance",
      "number-at-risk"
    ],
    "applies_to_study_types": [
      "cohort_retrospective",
      "cohort_prospective",
      "rct",
      "active_comparator_new_user",
      "claims_analysis",
      "ehr_study",
      "registry_study",
      "target_trial_emulation"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "primary",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1080/01621459.1958.10501452",
        "url": "https://doi.org/10.1080/01621459.1958.10501452",
        "citation_text": "Kaplan EL, Meier P. Nonparametric estimation from incomplete observations. Journal of the American Statistical Association. 1958;53(282):457-481.",
        "year": 1958,
        "authors_short": "Kaplan & Meier",
        "notes": "The original paper introducing the product-limit estimator. Defines the construction of the survival curve from incomplete (censored) observations and proves it is the nonparametric maximum-likelihood estimator of the event-time distribution. One of the most cited papers in all of statistics."
      },
      {
        "role": "explain",
        "doi": "10.1016/j.otohns.2010.05.007",
        "url": "https://doi.org/10.1016/j.otohns.2010.05.007",
        "citation_text": "Rich JT, Neely JG, Paniello RC, Voelker CC, Nussenbaum B, Wang EW. A practical guide to understanding Kaplan-Meier curves. Otolaryngology-Head and Neck Surgery. 2010;143(3):331-336.",
        "year": 2010,
        "authors_short": "Rich et al.",
        "notes": "A widely read applied guide explaining the KM curve for clinical researchers: how to read and report a KM curve, the meaning of confidence intervals, the role of censoring, and common misinterpretations. Suitable for both authors and reviewers of RWE manuscripts."
      },
      {
        "role": "demonstrate",
        "doi": "10.1016/S0140-6736(02)08594-X",
        "url": "https://doi.org/10.1016/S0140-6736(02)08594-X",
        "citation_text": "Pocock SJ, Clayton TC, Altman DG. Survival plots of time-to-event outcomes in clinical trials: good practice and pitfalls. Lancet. 2002;359(9318):1686-1689.",
        "year": 2002,
        "authors_short": "Pocock et al.",
        "notes": "Sets the reporting standard for KM curves in clinical publications: mandatory number-at-risk tables below the x-axis, appropriate confidence interval presentation, truncation of the curve when risk sets become small, and avoidance of misleading visual conventions. Widely cited in regulatory guidance and editorial policies."
      }
    ],
    "plain_language_summary": "The Kaplan-Meier curve is a way to draw the fraction of patients who have not yet had a specific event (such as hospitalization or death) at each point in time, using data from a group where some patients leave the study early without having the event (called censored patients). It works by computing, at each moment when an event occurs, the fraction of the patients still being observed who had the event, then multiplying those fractions together progressively to get an ever-updating \"survival\" probability. The curve's most practical output is the median survival time — the day by which half the group has had the event — along with a table showing how many patients were still being counted at each point, which warns you when the estimate is based on only a few remaining patients and should not be trusted too precisely.",
    "key_terms": [
      {
        "term": "risk set",
        "definition": "At any given moment in a survival study, the group of patients who are still being observed and have not yet had the event — only these patients count toward the denominator in the survival calculation at that time."
      },
      {
        "term": "censoring",
        "definition": "When a patient stops being observed before the event occurs, such as when they drop their insurance plan or the study ends, so you know the event had not happened yet but cannot follow them further."
      },
      {
        "term": "product-limit",
        "definition": "The mathematical approach behind the Kaplan-Meier curve: multiply together the fraction who survived each individual event time to get the cumulative probability of being event-free up to any given time."
      },
      {
        "term": "median survival time",
        "definition": "The time point at which the Kaplan-Meier curve drops to exactly 50%, meaning that half the patients in the group are estimated to have had the event by that day."
      },
      {
        "term": "number at risk",
        "definition": "The count of patients still actively being followed at a given time point, displayed below the KM curve to show when the estimate becomes unreliable because too few patients remain in the study."
      },
      {
        "term": "Greenwood variance",
        "definition": "The standard formula for estimating how uncertain the Kaplan-Meier curve is at any given time point, which produces wider confidence bands as fewer patients remain in the risk set at later follow-up times."
      }
    ],
    "worked_example": {
      "scenario": "Ten patients newly prescribed a diabetes medication are followed from their first prescription (day 0) to see whether they are hospitalized for a hypoglycemia-related event. Some patients experience the event; others drop their insurance plan or reach the end of the 90-day study window without any event (censored). The analyst builds the Kaplan-Meier survival curve step by step to estimate what fraction of patients remain event-free at each event time, and reads off the median time to first hospitalization.",
      "dataset": {
        "caption": "One row per patient. Event = 1 means a hospitalization occurred; event = 0 means the patient was censored (left observation without the event). Follow-up days is the number of days from first prescription to event or censoring.",
        "columns": [
          "patient_id",
          "follow_up_days",
          "event",
          "status_detail"
        ],
        "rows": [
          [
            "PT01",
            10,
            1,
            "hospitalization"
          ],
          [
            "PT02",
            20,
            0,
            "censored: disenrolled"
          ],
          [
            "PT03",
            30,
            1,
            "hospitalization"
          ],
          [
            "PT04",
            40,
            0,
            "censored: disenrolled"
          ],
          [
            "PT05",
            45,
            0,
            "censored: disenrolled"
          ],
          [
            "PT06",
            50,
            1,
            "hospitalization"
          ],
          [
            "PT07",
            70,
            1,
            "hospitalization"
          ],
          [
            "PT08",
            80,
            0,
            "censored: disenrolled"
          ],
          [
            "PT09",
            85,
            0,
            "censored: study end"
          ],
          [
            "PT10",
            90,
            0,
            "censored: study end"
          ]
        ]
      },
      "steps": [
        "Sort all follow-up times. Distinct event times (where event = 1 occurs) are days 10, 30, 50, and 70. Censoring times (days 20, 40, 45, 80, 85, 90) shrink the risk set before the next event but do not trigger a step-down in the curve.",
        "Day 10: All 10 patients are in the risk set. One event occurs (PT01). Conditional survival = (10-1)/10 = 9/10 = 0.9. S(10) = 9/10 = 0.9.",
        "Between days 10 and 30: PT02 is censored at day 20 and exits the risk set. Risk set at day 30 = 10 - 1 event (day 10) - 1 censored (day 20) = 8 patients.",
        "Day 30: 8 patients in the risk set. One event occurs (PT03). Conditional survival = (8-1)/8 = 7/8. S(30) = 0.9 * 7/8 = 0.7875.",
        "Between days 30 and 50: PT04 censored at day 40, PT05 censored at day 45. Risk set at day 50 = 8 - 1 event (day 30) - 2 censored (days 40 and 45) = 5 patients.",
        "Day 50: 5 patients in the risk set. One event occurs (PT06). Conditional survival = (5-1)/5 = 4/5. S(50) = 0.7875 * 4/5 = 0.63.",
        "Between days 50 and 70: no censorings. Risk set at day 70 = 5 - 1 event (day 50) = 4 patients.",
        "Day 70: 4 patients in the risk set. One event occurs (PT07). Conditional survival = (4-1)/4 = 3/4. S(70) = 0.63 * 3/4 = 0.4725.",
        "After day 70: PT08 censored at day 80, PT09 at day 85, PT10 at day 90. No further events occur. The curve stays flat at S = 0.4725. Since S(70) = 0.4725 is the first value at or below 0.5, the median survival time is 70 days."
      ],
      "result": "KM survival probabilities at each event time: S(10) = 9/10 = 0.9; S(30) = 0.9 * 7/8 = 0.7875; S(50) = 0.7875 * 4/5 = 0.63; S(70) = 0.63 * 3/4 = 0.4725. Median survival = 70 days (first time S(t) drops to or below 0.5). Four events, six censored. After day 70 only three patients remain in the risk set; the tail estimate is unreliable.",
      "timeline_spec": {
        "title": "Kaplan-Meier construction: 10 patients, 4 events, 6 censored",
        "window": {
          "start": "2023-01-01",
          "end": "2023-03-31",
          "label": "90-day study window from first prescription"
        },
        "events": [
          {
            "label": "PT01 — event day 10",
            "start": "2023-01-01",
            "length_days": 10,
            "quantity": "event=1"
          },
          {
            "label": "PT02 — censored day 20",
            "start": "2023-01-01",
            "length_days": 20,
            "quantity": "event=0"
          },
          {
            "label": "PT03 — event day 30",
            "start": "2023-01-01",
            "length_days": 30,
            "quantity": "event=1"
          },
          {
            "label": "PT04 — censored day 40",
            "start": "2023-01-01",
            "length_days": 40,
            "quantity": "event=0"
          },
          {
            "label": "PT05 — censored day 45",
            "start": "2023-01-01",
            "length_days": 45,
            "quantity": "event=0"
          },
          {
            "label": "PT06 — event day 50",
            "start": "2023-01-01",
            "length_days": 50,
            "quantity": "event=1"
          },
          {
            "label": "PT07 — event day 70",
            "start": "2023-01-01",
            "length_days": 70,
            "quantity": "event=1"
          },
          {
            "label": "PT08 — censored day 80",
            "start": "2023-01-01",
            "length_days": 80,
            "quantity": "event=0"
          },
          {
            "label": "PT09 — censored day 85",
            "start": "2023-01-01",
            "length_days": 85,
            "quantity": "event=0"
          },
          {
            "label": "PT10 — censored day 90",
            "start": "2023-01-01",
            "length_days": 90,
            "quantity": "event=0"
          }
        ],
        "spans": [
          {
            "kind": "followup",
            "start": "2023-01-01",
            "end": "2023-01-11",
            "label": "PT01: 10 days, event"
          },
          {
            "kind": "followup",
            "start": "2023-01-01",
            "end": "2023-01-21",
            "label": "PT02: 20 days, censored"
          },
          {
            "kind": "followup",
            "start": "2023-01-01",
            "end": "2023-01-31",
            "label": "PT03: 30 days, event"
          },
          {
            "kind": "followup",
            "start": "2023-01-01",
            "end": "2023-02-10",
            "label": "PT04: 40 days, censored"
          },
          {
            "kind": "followup",
            "start": "2023-01-01",
            "end": "2023-02-15",
            "label": "PT05: 45 days, censored"
          },
          {
            "kind": "followup",
            "start": "2023-01-01",
            "end": "2023-02-20",
            "label": "PT06: 50 days, event"
          },
          {
            "kind": "followup",
            "start": "2023-01-01",
            "end": "2023-03-12",
            "label": "PT07: 70 days, event"
          },
          {
            "kind": "followup",
            "start": "2023-01-01",
            "end": "2023-03-22",
            "label": "PT08: 80 days, censored"
          },
          {
            "kind": "followup",
            "start": "2023-01-01",
            "end": "2023-03-27",
            "label": "PT09: 85 days, censored"
          },
          {
            "kind": "followup",
            "start": "2023-01-01",
            "end": "2023-03-31",
            "label": "PT10: 90 days, censored"
          }
        ],
        "result": {
          "label": "4 events, 6 censored; median survival = 70 days; S(70) = 0.4725",
          "value": 0.4725
        }
      }
    },
    "prerequisites": [
      "censoring-mechanisms-rwe",
      "descriptive-epidemiology-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Single-group KM curve (overall survival or event-free survival)",
        "description": "The baseline application: one arm, one event type, one curve. Produces the median survival time, quantiles, and pointwise CI. Used for descriptive characterization of a cohort's event-time distribution before any comparative analysis.",
        "edge_cases": [
          "If the event never occurs for more than 50% of the cohort within the observation window, the median is undefined. Report the 25th percentile or the curve at a fixed horizon instead.",
          "If events are clustered at a single time (e.g., all at the end-of-year data cut), the KM curve drops sharply at one time and is flat everywhere else, which usually signals an ascertainment artifact rather than true event timing."
        ],
        "data_source_notes": "Claims: ensure the event ascertainment window does not extend beyond continuous enrollment; censor at disenrollment. EHR: define the observation end as last clinical contact plus a grace period."
      },
      {
        "name": "Two-arm or stratified KM comparison with log-rank test",
        "description": "The most common comparative application: two KM curves (treated vs. control, or exposed vs. unexposed) plotted on the same axes, with the log-rank test for equality and the at-risk table stratified by arm. In RWE, curves often cross due to confounding or effect modification, which invalidates the log-rank test and calls for additional methods (RMST, restricted analyses).",
        "edge_cases": [
          "Crossing survival curves violate the proportional-hazards assumption the log-rank test implicitly uses; report curves visually and note crossing, then use RMST or a stratified analysis rather than a single p-value.",
          "Very unequal arm sizes (e.g., 10:1) give the larger arm disproportionate weight in the log-rank test; confirm that the at-risk table reflects the actual imbalance."
        ],
        "data_source_notes": "Claims and EHR: verify that index dates and censoring rules are identical across arms; differential disenrollment rates by arm introduce arm-specific informative censoring that biases the between-arm comparison."
      },
      {
        "name": "KM with administrative censoring only (sensitivity analysis)",
        "description": "Restricts the censoring definition to administrative reasons only (data cut, end of study enrollment, or planned study end) and excludes health-correlated disenrollment. Produces a conservative bounding analysis under the assumption that administrative censoring is non-informative.",
        "edge_cases": [
          "Restricting to administratively censored subjects selects persistent enrollees who are systematically healthier than the full cohort; the resulting KM curve overestimates survival in the original population.",
          "In short studies (< 6 months), the distinction between administrative and health-correlated censoring may be negligible; longer follow-up amplifies the selection effect."
        ],
        "data_source_notes": "Claims: identify administrative censorings from study-end dates and disenrollment records flagged as employer-change or plan-year-end (where available). SAS PROC LIFETEST and R survfit() accept separate censoring indicators for this purpose."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "cox-ph-regression",
        "pros_of_this": "No distributional assumption and no proportional-hazards assumption required; valid for any event-time distribution including crossing hazards. Produces a universally understood visual output.",
        "cons_of_this": "Cannot adjust for baseline covariates in the curve itself; in observational studies with imbalanced groups, the KM curves are confounded and should not be interpreted as causal contrasts. Cox regression adjusts for covariates and produces a hazard ratio.",
        "when_to_prefer": "Use KM for descriptive presentation and as the visual companion to the log-rank test; use Cox for covariate-adjusted inference and the primary comparative estimand in observational RWE."
      },
      {
        "compared_to": "cumulative-incidence-risk-rwe",
        "pros_of_this": "Simpler to compute and explain; valid as the complement (1 - KM) for cumulative incidence when competing risks are absent or negligible.",
        "cons_of_this": "When death or another terminal event is non-negligible, 1 - KM overestimates cumulative incidence; the Aalen-Johansen CIF is the correct estimator. The error is differential by arm when competing mortality differs between groups.",
        "when_to_prefer": "Use KM when competing events are truly rare or when the estimand is the cause-specific event-free probability in a hypothetical world without competing events. Use the Aalen-Johansen CIF for all absolute-risk statements in elderly or oncology cohorts."
      },
      {
        "compared_to": "restricted-mean-survival-time-rmst",
        "pros_of_this": "More familiar to clinical and regulatory audiences; produces a median that has direct clinical meaning (\"half the patients had the event by day X\").",
        "cons_of_this": "When curves cross, the log-rank test and the median comparison can be misleading. RMST averages the area under the survival curve up to a horizon and is robust to crossing curves; it is also directly usable in cost-effectiveness models without distributional extrapolation.",
        "when_to_prefer": "Use KM with log-rank when curves do not cross and the proportional-hazards assumption is plausible. Use RMST when curves cross, when the proportional-hazards assumption fails, or when a single clinically meaningful summary statistic is needed for HTA."
      },
      {
        "compared_to": "Parametric survival models (Weibull, log-normal, log-logistic)",
        "pros_of_this": "No parametric assumption; valid even if the true distribution is unknown or multi-modal; cannot be misspecified.",
        "cons_of_this": "Cannot extrapolate beyond the observed follow-up, which is required for lifetime survival curves in health technology assessment. Parametric models (Weibull, log-normal, generalized gamma) extrapolate by fitting a distributional family to the observed data, validated against external life tables or registry data.",
        "when_to_prefer": "Use KM within the observation window for descriptive and comparative analyses. Use parametric models for HTA lifetime survival extrapolation; report KM as the within-trial anchor."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Build the (follow_up_days, event) pair as in the censoring-mechanisms-rwe entry: censor at the earlier of disenrollment date and study end. Flag Medicare Advantage transitions separately — these are health-correlated and should trigger a sensitivity analysis restricting to fee-for-service enrollees. Truncate displayed KM curves at the time when fewer than 10 subjects remain at risk; include the at-risk table with disenrollment-only and event rows. For endpoints where death is a competing risk (most chronic disease outcomes in Medicare cohorts), report the Aalen-Johansen CIF alongside the KM curve and note the overestimation in the KM-based estimate.",
      "ehr": "Define the last observed clinical contact date (last encounter, lab, or prescription) as the censoring date for patients who leave the health system without an event. Add a grace period (e.g., 180 days of encounter silence = considered lost to follow-up) to distinguish healthy-and-disengaged from transferred-to-specialty-care. KM in EHR is particularly sensitive to left truncation (delayed entry) if only patients still in the system at a given date are included; use survfit() with delayed entry (Surv(start, stop, event)) if the cohort has staggered enrollment or entry criteria requiring prior visit history.",
      "registry": "Registry data typically have cleaner censoring (predefined data cut, structured withdrawal forms) and often include vital status from national registries, making death ascertainment more reliable. Report the registry's censoring protocol explicitly (planned end, patient withdrawal, site closure) and whether competing death was captured. Use the at-risk table to verify that the registry's enrollment pattern does not create a sparse tail due to early-enrollment staggering.",
      "primary": "In randomized trials with active follow-up, administrative censoring dominates and the non-informative censoring assumption is most defensible. Report KM curves stratified by randomized arm with the log-rank p-value and 95% CI bands. In adaptive or enrichment designs, verify that interim analyses do not selectively censor patients in one arm; conditional censoring can induce informative loss-to-follow-up even in randomized studies.",
      "linked": "Linked claims-EHR-vital records allow reliable competing event ascertainment (death from the SSA death master file or state vital records) and should be exploited to produce both KM and Aalen-Johansen curves side by side. Verify linkage completeness before reporting the KM tail: unlinked patients are typically censored at their last linked record, which may be earlier than their actual disenrollment or death."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\nfrom lifelines import KaplanMeierFitter\nfrom lifelines.statistics import logrank_test\nimport matplotlib.pyplot as plt\n\n# ── Worked-example cohort (10 patients from the beginner layer) ──\ndf = pd.DataFrame({\n    \"patient_id\":     [\"PT01\",\"PT02\",\"PT03\",\"PT04\",\"PT05\",\"PT06\",\"PT07\",\"PT08\",\"PT09\",\"PT10\"],\n    \"follow_up_days\": [10, 20, 30, 40, 45, 50, 70, 80, 85, 90],\n    \"event\":          [ 1,  0,  1,  0,  0,  1,  1,  0,  0,  0],\n})\n\n# ── Fit the KM curve ──\nkmf = KaplanMeierFitter(label=\"10-patient cohort\")\nkmf.fit(\n    durations=df[\"follow_up_days\"],\n    event_observed=df[\"event\"],\n)\n\nprint(\"Median survival time:\", kmf.median_survival_time_)\nprint(\"Median 95% CI:\", kmf.confidence_interval_median_.values)\n\n# ── Survival probability at specific event times (must match hand calculation) ──\nfor t, expected in [(10, 0.9), (30, 0.7875), (50, 0.63), (70, 0.4725)]:\n    s = float(kmf.predict(t))\n    print(f\"S({t:2d}) = {s:.4f}  (expected {expected})\")\n\n# ── Plot curve with CI band and at-risk table ──\nfig, ax = plt.subplots(figsize=(8, 5))\nkmf.plot_survival_function(ax=ax, ci_show=True)\nax.set_xlabel(\"Follow-up (days)\")\nax.set_ylabel(\"Survival probability S(t)\")\nax.set_title(\"Kaplan-Meier curve — 10-patient worked example\")\nplt.tight_layout()\nplt.savefig(\"km_curve.png\", dpi=150)\n# Note: for production plots with at-risk table, use the lifelines\n# add_at_risk_counts() helper or the KaplanMeierFitter.plot() with\n# at_risk_counts=True (lifelines >= 0.27).\n\n# ── Two-arm comparison: assign patients to Arms A and B for illustration ──\n# Arm A: PT01, PT03, PT06, PT07 (events); Arm B: rest (censored or late events)\ndf[\"arm\"] = [\"A\",\"B\",\"A\",\"B\",\"B\",\"A\",\"A\",\"B\",\"B\",\"B\"]\narm_a = df[df[\"arm\"] == \"A\"]\narm_b = df[df[\"arm\"] == \"B\"]\n\nkmf_a = KaplanMeierFitter(label=\"Arm A\")\nkmf_b = KaplanMeierFitter(label=\"Arm 
B\")\nkmf_a.fit(arm_a[\"follow_up_days\"], arm_a[\"event\"])\nkmf_b.fit(arm_b[\"follow_up_days\"], arm_b[\"event\"])\n\nresults = logrank_test(\n    arm_a[\"follow_up_days\"], arm_b[\"follow_up_days\"],\n    event_observed_A=arm_a[\"event\"], event_observed_B=arm_b[\"event\"]\n)\nprint(f\"\\nLog-rank test p-value: {results.p_value:.4f}\")\n# Note: with n=10, the log-rank test is illustrative only; power is very low.\n# In a production analysis, also check: kmf.at_risk_at_times([0, 10, 30, 50, 70])\n# to reproduce the at-risk table for manuscript reporting.",
        "description": "Kaplan-Meier estimation with lifelines KaplanMeierFitter. Fits the curve, reports the\nmedian survival time with 95% CI, plots the step function with confidence band and a\nnumber-at-risk table, and demonstrates two-arm comparison using the logrank_test\nfunction. Uses the 10-patient worked example cohort so output matches the hand\ncalculations (S at days 10, 30, 50, 70).",
        "dependencies": [
          "lifelines",
          "matplotlib"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(survival)\nlibrary(ggsurvfit)\n\n# ── Worked-example cohort ──\ndf <- data.frame(\n  patient_id     = paste0(\"PT\", sprintf(\"%02d\", 1:10)),\n  follow_up_days = c(10, 20, 30, 40, 45, 50, 70, 80, 85, 90),\n  event          = c( 1,  0,  1,  0,  0,  1,  1,  0,  0,  0),\n  arm            = c(\"A\",\"B\",\"A\",\"B\",\"B\",\"A\",\"A\",\"B\",\"B\",\"B\")\n)\n\n# ── Single-arm KM fit and median ──\nkm_single <- survfit(Surv(follow_up_days, event) ~ 1, data = df)\nprint(km_single)           # prints median with 95% CI (Hall-Wellner bands)\nsummary(km_single)         # survival probability + CI at every event time\n\n# Verify survival at each event time matches the hand calculation:\n#   S(10) = 0.9; S(30) = 0.7875; S(50) = 0.63; S(70) = 0.4725\ncat(\"S(t) at event times:\\n\")\nprint(summary(km_single, times = c(10, 30, 50, 70))$surv)\n\n# ── Two-arm comparison: Arm A (all events) vs Arm B (mostly censored) ──\nkm_arms <- survfit(Surv(follow_up_days, event) ~ arm, data = df)\n\n# Tidy KM plot with at-risk table (ggsurvfit standard for publications)\np <- survfit2(Surv(follow_up_days, event) ~ arm, data = df) |>\n  ggsurvfit(linewidth = 0.8) +\n  add_confidence_interval() +\n  add_risktable(risktable_stats = \"n.risk\") +   # mandatory number-at-risk row\n  scale_ggsurvfit() +\n  labs(\n    x = \"Follow-up (days)\",\n    y = \"Event-free probability S(t)\",\n    title = \"Kaplan-Meier curves by treatment arm\",\n    caption = \"Number at risk shown below the x-axis.\"\n  )\nprint(p)\n# In production: ggsave(\"km_arms.pdf\", p, width = 8, height = 5)\n\n# ── Log-rank test for equality of survival curves ──\nlr <- survdiff(Surv(follow_up_days, event) ~ arm, data = df)\ncat(sprintf(\"\\nLog-rank test: chi2 = %.3f, df = %d, p = %.4f\\n\",\n            lr$chisq, length(lr$n) - 1L, 1 - pchisq(lr$chisq, length(lr$n) - 1L)))\n# Note: the log-rank test implicitly assumes proportional hazards; if KM curves\n# cross, report RMST difference or a weighted 
log-rank test instead.\n\n# ── At-risk table for manuscript submission ──\ntbl <- summary(km_single, times = c(0, 10, 30, 50, 70, 90))\ncat(\"\\nAt-risk table (n.risk, n.event, n.censor at each reported time):\\n\")\nprint(data.frame(\n  time    = tbl$time,\n  n.risk  = tbl$n.risk,\n  n.event = tbl$n.event,\n  n.censor= tbl$n.censor,\n  surv    = round(tbl$surv, 4)\n))",
        "description": "Kaplan-Meier estimation with the survival package (survfit) and publication-quality\nplots with ggsurvfit, including the mandatory number-at-risk table. Demonstrates\nsingle-arm KM matching the hand calculations, a two-arm comparison with log-rank test,\nand extraction of the at-risk table for manuscript reporting.",
        "dependencies": [
          "survival",
          "ggsurvfit"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* ── Create the worked-example cohort ── */\ndata work.km_cohort;\n  input patient_id $ follow_up_days event arm $;\n  datalines;\nPT01  10  1  A\nPT02  20  0  B\nPT03  30  1  A\nPT04  40  0  B\nPT05  45  0  B\nPT06  50  1  A\nPT07  70  1  A\nPT08  80  0  B\nPT09  85  0  B\nPT10  90  0  B\n;\nrun;\n\n/* ── Single-arm KM curve with median and 95% CI bands ──\n   TIME syntax: <time_var> * <event_var>(<censored_value>)\n   event(0) means event = 0 is the censored code.                          */\nproc lifetest data=work.km_cohort\n              plots=survival(cl atrisk)      /* cl = CI band; atrisk = at-risk row */\n              notable                        /* suppress the event-list table       */\n              outsurv=work.km_out;           /* write KM curve to a dataset          */\n  time follow_up_days * event(0);\nrun;\n\n/* Verify S(t) at each event time:\n   Expected: S(10)=0.9, S(30)=0.7875, S(50)=0.63, S(70)=0.4725              */\nproc print data=work.km_out;\n  where _censor_ = 0;     /* _censor_ = 0 means an event row in OUTSURV      */\n  var follow_up_days survival sdf_lcl sdf_ucl;\nrun;\n\n/* ── Two-arm comparison with log-rank test ──\n   STRATA statement stratifies by arm; PROC LIFETEST automatically performs\n   the log-rank (Savage) test and Wilcoxon test for equality of curves.       */\nproc lifetest data=work.km_cohort\n              plots=survival(cl atrisk test)   /* test = show p-value on plot */\n              notable;\n  time follow_up_days * event(0);\n  strata arm;\nrun;\n/* Output includes:\n   - KM curve per arm with CI bands\n   - Number-at-risk table stratified by arm (ATRISK option)\n   - Log-rank test chi-square, df, and p-value\n   - Wilcoxon (Gehan) test as an alternative (more weight to early follow-up)\n   Note: with n=10 the test is illustrative; power is very low.\n   In production: suppress the Wilcoxon if not pre-specified.              
*/\n\n/* ── Extract median survival with 95% CI for manuscript ──\n   PROC LIFETEST prints the median in the Quartile Estimates table.\n   To capture it programmatically, write a per-arm OUTSURV= dataset and find\n   the first event row where SURVIVAL <= 0.5 per stratum. An arm with no\n   such row (here Arm B, all censored) has an undefined median.            */\nproc lifetest data=work.km_cohort noprint\n              outsurv=work.km_out_arm;       /* per-arm KM curve (includes ARM)     */\n  time follow_up_days * event(0);\n  strata arm;\nrun;\ndata work.km_median;\n  set work.km_out_arm;\n  if _censor_ = 0 and survival <= 0.5; /* event rows where S(t) <= 0.5          */\nrun;\nproc sort data=work.km_median; by arm follow_up_days; run;\ndata work.km_median;\n  set work.km_median;\n  by arm;\n  if first.arm;                       /* keep only the first (earliest) crossing */\n  label follow_up_days = \"Median survival (days)\";\nrun;\nproc print data=work.km_median label;\n  var arm follow_up_days survival sdf_lcl sdf_ucl;\nrun;",
        "description": "Kaplan-Meier estimation with PROC LIFETEST. Produces the survival function with 95%\nCI bands (Hall-Wellner or log-log), the median and quartiles, and the number-at-risk\ntable. Demonstrates single-arm analysis matching the hand calculations and a two-arm\nstratified analysis with the log-rank test. Uses the 10-patient worked-example cohort.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  DATA[\"Raw data: (follow_up_days, event) pairs\\none row per patient\"] --> SORT\n  SORT[\"Sort by follow-up time\\nidentify distinct event times t1 < t2 < ... < tk\"] --> RISKSET\n  RISKSET[\"At each event time t_j:\\ncompute risk set n_j = patients still at risk\\nand events d_j\"] --> COND\n  COND[\"Conditional survival:\\n(n_j - d_j) / n_j\"] --> PRODUCT\n  PRODUCT[\"Multiply conditionals:\\nS(t) = product of (n_j - d_j)/n_j for all t_j <= t\"] --> OUTPUT\n  OUTPUT[\"Step-function survival curve\\nMedian = smallest t with S(t) <= 0.5\\nCI via Greenwood variance\"] --> REPORT\n  REPORT[\"Mandatory reporting:\\nKM curve + CI band\\nNumber-at-risk table below x-axis\\nTruncate at n < 10 at risk\"]\n  RISKSET -->|\"censored subjects\\nexit risk set before next event\"| RISKSET",
        "caption": "Kaplan-Meier construction algorithm: sort event times, compute risk sets, multiply conditional survival probabilities, and read the median from the step function. Censored subjects shrink the risk set at the next event time without triggering a step-down in the curve.",
        "alt_text": "Flowchart showing the KM algorithm: raw data sorted by time, risk sets computed at each event time with censored subjects exiting, conditional survival multiplied to produce the step function, then reported with CI bands and number-at-risk table.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  Q[\"Is a competing event (death)\\nnon-negligible?\"] --> No\n  Q --> Yes\n  No[\"No: 1 - KM is the cumulative\\nincidence estimator\"] --> LogRank\n  LogRank[\"Compare arms: log-rank test\\nor RMST difference\"]\n  Yes[\"Yes: 1 - KM OVERestimates\\ncumulative incidence\"] --> AJ\n  AJ[\"Use Aalen-Johansen CIF\\n(see cumulative-incidence-risk-rwe)\"]\n  AJ --> FG[\"Adjusted inference:\\nFine-Gray subdistribution HR\\nor cause-specific Cox HR\"]",
        "caption": "Decision gate for moving from KM to competing-risks methods. When death or another terminal competing event is non-negligible, 1 minus the KM curve overestimates absolute cumulative incidence; switch to the Aalen-Johansen estimator.",
        "alt_text": "Decision flowchart: if no competing event, 1-KM is valid and log-rank compares arms; if a competing event is present, 1-KM overestimates and the Aalen-Johansen CIF is required, with Fine-Gray or cause-specific Cox for adjusted inference.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "requires",
        "target_slug": "censoring-mechanisms-rwe",
        "notes": "The Kaplan-Meier estimator requires the (follow_up_days, event) pair built from raw date fields and depends critically on the non-informative censoring assumption. Read censoring-mechanisms-rwe for the full taxonomy of censoring types, the consequences when the assumption fails in claims data (health-correlated disenrollment, Medicare Advantage transitions), and the IPCW correction strategy."
      },
      {
        "relation_type": "used_with",
        "target_slug": "log-rank-test",
        "notes": "The log-rank test is the natural companion to KM for comparing survival curves between two or more groups; both share the same risk-set framework and the KM curve is the visual context for interpreting a log-rank p-value."
      },
      {
        "relation_type": "see_also",
        "target_slug": "cox-ph-regression",
        "notes": "Cox proportional hazards regression extends KM by accepting covariate adjustment; adjusted survival curves from Cox replace the unadjusted KM when confounding must be controlled in observational RWE. The HR from Cox measures relative hazard; KM provides the baseline event-time distribution."
      },
      {
        "relation_type": "see_also",
        "target_slug": "cumulative-incidence-risk-rwe",
        "notes": "When a competing event is present, 1 minus the KM curve overestimates cumulative incidence. The Aalen-Johansen cumulative incidence function (CIF) is the correct descriptive estimator; the Fine-Gray model is the regression counterpart. See cumulative-incidence-risk-rwe for the full treatment of the overestimation mechanism and how to quantify the bias."
      },
      {
        "relation_type": "see_also",
        "target_slug": "competing-risks-cause-specific-fine-gray-rwe",
        "notes": "When death is a strong competing event, reporting KM as 1 - S_KM for the non-fatal outcome is actively misleading. This entry covers the causal and inferential framework for distinguishing cause-specific hazards (etiologic) from subdistribution hazards (absolute-risk), and when to use each alongside the CIF."
      },
      {
        "relation_type": "see_also",
        "target_slug": "restricted-mean-survival-time-rmst",
        "notes": "RMST (area under the KM curve up to a horizon) is a single summary statistic that avoids reading medians off a curve, is robust to crossing curves, and is directly usable in cost-effectiveness models. Use RMST when the proportional-hazards assumption is implausible or when curves cross."
      }
    ],
    "aliases": [
      "product-limit estimator",
      "KM curve",
      "KM estimator",
      "Kaplan-Meier curve",
      "survival curve",
      "product-limit survival estimate"
    ],
    "complexity": "foundational",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "kruskal-wallis-test",
    "name": "Kruskal-Wallis Test",
    "short_definition": "A nonparametric omnibus test that compares three or more independent groups by converting all observations to a single pooled ranking and testing whether the rank sums are more unequal than chance alone would produce; when the test rejects, pairwise post-hoc comparisons (Dunn's test with multiplicity adjustment) identify which groups differ, making it the nonparametric counterpart to one-way ANOVA for continuous or ordinal outcomes where distributional assumptions cannot be justified.",
    "long_description": "**What the Kruskal-Wallis test does and why it exists**\n\nOne-way ANOVA tests whether the means of three or more groups are equal by partitioning\nvariance — but it assumes the outcome is approximately normally distributed within each\ngroup and that the variances are roughly equal. In real-world evidence and health outcomes\nresearch, the outcomes most worth comparing across groups — length of stay, total cost per\nepisode, patient-reported pain scores, time to treatment escalation — are routinely\nright-skewed, bounded, or ordinal. For these outcomes at small-to-moderate sample sizes,\nthe ANOVA normality assumption is not tenable, and the Kruskal-Wallis H test provides a\nvalid alternative.\n\nThe Kruskal-Wallis test works by collapsing all N observations from k groups into a single\npooled ranking: every value is replaced by its rank from 1 (smallest) to N (largest),\nignoring group membership. The test statistic H measures how unequally the rank mass is\ndistributed across groups relative to what would be expected if group labels were assigned\nat random. Under the null hypothesis of no difference among groups, H follows an\napproximate chi-square distribution with k − 1 degrees of freedom, provided each group\nhas at least five observations. For very small groups (n < 5 per group), exact tables or\npermutation-based p-values should be used instead.\n\n**The H statistic: mechanics and tie correction**\n\nThe H statistic is:\n\n    H = [12 / (N(N+1))] * sum_j[R_j^2 / n_j]  -  3(N+1)\n\nwhere N is the total number of observations, k is the number of groups, n_j is the sample\nsize of group j, and R_j is the sum of ranks assigned to group j. 
Under the null, all\ngroups share the same population distribution and the expected rank sum for group j is\nn_j * (N+1)/2, so H measures the weighted squared deviation of observed rank sums from\ntheir expected values.\n\nWhen ties are present — multiple observations with the same value — they are assigned the\naverage of the ranks they would have received if they were distinct. Ties shrink the\nvariance of the rank distribution, which makes the uncorrected H too small (conservative).\nThe standard tie correction divides H by:\n\n    C = 1 - [sum_g(t_g^3 - t_g)] / (N^3 - N)\n\nwhere t_g is the number of observations in tie group g. The corrected statistic is H/C;\nbecause C ≤ 1, the correction can only increase H.\nIn practice the correction has little effect unless there are many large tie groups, but\nit should always be applied and is the default in all major software implementations.\n\n**The chi-square approximation and degrees of freedom**\n\nThe chi-square approximation for H is valid when each group has at least five observations.\nThe degrees of freedom are k − 1: for three groups, df = 2; for four groups, df = 3. A\np-value below the chosen alpha (typically 0.05) leads to rejection of the null hypothesis\nthat all k groups share the same distribution. The p-value alone does not indicate which\npairs differ — only that at least one pair does.\n\n**Stochastic dominance, not medians**\n\nThe Kruskal-Wallis test shares the same subtle interpretation as the Mann-Whitney U test\n(its two-group special case): it tests a form of *stochastic dominance*, not equality of\nmedians. The null hypothesis is that a randomly chosen observation from any one group has\nan equal probability of being larger than a randomly chosen observation from any other\ngroup. This is equivalent to testing median equality only under the additional assumption\nthat the distributions across groups differ solely in location — that is, they have the\nsame shape and spread and merely shift left or right. 
When the groups have different shapes\n(one is bimodal, another is heavily skewed) or different spreads, a significant Kruskal-\nWallis result signals that the distributions differ in some way, but it does not cleanly\nidentify a median shift. Analysis reports should acknowledge this: describe the test as\nexamining rank distributions rather than medians, and accompany the p-value with group\nmedians (and interquartile ranges) for interpretive context.\n\n**From omnibus rejection to post-hoc pairwise comparisons**\n\nThe Kruskal-Wallis test is an omnibus test: it asks whether any group differs from any\nother, not which specific pairs differ. When H is significant, post-hoc pairwise\ncomparisons are needed to locate the source of the difference.\n\nThe correct post-hoc procedure is Dunn's test (Dunn 1964), which compares each pair of\ngroups using the difference in their mean ranks from the pooled ranking — not separate\npairwise Mann-Whitney tests. Applying repeated pairwise Mann-Whitney tests without\nmultiplicity adjustment inflates the family-wise type-I error rate: with k = 4 groups and\nsix pairwise comparisons at alpha = 0.05, the probability of at least one false positive\nis approximately 1 − (0.95)^6 ≈ 0.26. Dunn's test stays consistent with the omnibus test\nby using the pooled ranking and its variance; the error rate itself is controlled by\nadjusting the resulting p-values for multiple\ncomparisons using the Bonferroni correction (conservative), the Holm step-down procedure\n(uniformly more powerful than Bonferroni), or the Benjamini-Hochberg false discovery rate\ncorrection when many comparisons are expected. 
The choice among these corrections should\nbe pre-specified; defaulting to Bonferroni is the safest option when confirmatory inference\nis required and the number of comparisons is small (k < 6).\n\n**RWE applications and common patterns**\n\nThe Kruskal-Wallis test appears in HEOR and RWE in several recurring patterns:\n\n- *Length of stay or cost across three or more lines of therapy*: comparing index\n  hospitalization length of stay (days, right-skewed) or total allowed cost per patient\n  across patients initiating first-line, second-line, and third-or-later-line therapy.\n  The three-group structure calls for Kruskal-Wallis rather than repeated pairwise\n  Mann-Whitney tests. Note that when the inference target is the *mean* cost difference —\n  as required by budget-impact models and payer submissions — a generalized linear model\n  with a log link and gamma or Tweedie variance function is the preferred primary analysis;\n  Kruskal-Wallis is appropriate for descriptive comparisons and sensitivity checks.\n\n- *Patient-reported outcome scores across site or payer type*: ordinal PRO scores (pain,\n  function, adherence) across geographic regions, provider types, or plan types. Ordinal\n  scores should not be treated as interval data; Kruskal-Wallis respects their rank\n  structure.\n\n- *Unadjusted descriptive comparison in Table 2 of an observational study*: presenting\n  unadjusted differences in continuous baseline or outcome variables across three or more\n  treatment groups, with the clear label \"unadjusted.\" In confounded observational data,\n  these comparisons are descriptive only — they reflect both treatment effects and\n  selection bias. 
The Kruskal-Wallis p-value in this context documents baseline\n  imbalance rather than causal contrast.\n\n- *Multi-site quality comparisons*: comparing median time-to-diagnosis, door-to-balloon\n  time, or time-from-referral-to-treatment across multiple hospitals or clinical sites\n  without assuming normality.\n\n**Clustering violation: a critical warning for EHR and registry data**\n\nThe Kruskal-Wallis test assumes that all observations are independent. In EHR, registry,\nand claims data, patients are routinely nested within providers, practices, or facilities.\nApplying Kruskal-Wallis to individual patient rows without accounting for this clustering\nproduces artificially narrow confidence intervals and inflated test statistics, because\npatients within the same provider share unmeasured factors that make their outcomes more\nsimilar than observations from different providers would be. When clustering is present,\neither aggregate to the provider level before testing, use a mixed-effects model, or apply\na nonparametric test that accounts for the clustered design (such as a permutation test\nwith cluster-level exchangeability). 
This is not a theoretical concern — provider-level\nintraclass correlation (ICC) for cost and utilization outcomes in claims data commonly\nexceeds 0.05 and can exceed 0.20 for specialty care, making the independence assumption\nmaterially wrong.\n\n**Pros, cons, and trade-offs**\n\n*Pros*:\n- Valid under broad distributional assumptions; no requirement for normality, equal\n  variances, or any specific distributional family within groups.\n- Robust to outliers: an extreme value in one group is capped at rank N and cannot distort\n  H the way it distorts an ANOVA F statistic.\n- Well-suited to ordinal outcomes (Likert scales, PRO scores, severity grades) where the\n  interval assumption required by ANOVA is not justified.\n- Exact in small samples when combined with permutation-based critical values; the\n  chi-square approximation is adequate for n_j ≥ 5.\n- Directly extensible to the Jonckheere-Terpstra test when the k groups have a natural\n  order (dose levels, line of therapy), which is more powerful than Kruskal-Wallis for\n  detecting monotone dose-response relationships.\n\n*Cons*:\n- Lower statistical power than one-way ANOVA when the normality assumption actually holds,\n  because the rank transformation discards information about the magnitude of differences.\n- The omnibus H statistic requires post-hoc testing (Dunn's test) to identify which pairs\n  differ; each step adds complexity and reduces power relative to a planned contrast.\n- Cannot adjust for covariates directly: the rank-based framework does not extend\n  naturally to multivariable regression. 
Quantile regression or a ranked-outcome GLM can\n  partly address this limitation but are less familiar to most audiences.\n- The test statistic does not produce an inherently interpretable effect size on the\n  original measurement scale; the epsilon-squared or eta-squared rank-based effect size\n  must be computed separately and reported alongside H and the p-value.\n- Tests stochastic dominance rather than median equality under heterogeneous shapes —\n  requires careful communication to avoid the common mischaracterization as a median test.\n\n**When to use**\n\n- *Three or more independent groups* where the outcome is continuous but non-normal,\n  ordinal, or bounded, and sample sizes are small to moderate (fewer than 100 per group\n  as a rule of thumb, though the test remains valid at any n).\n- *Sensitivity analysis* alongside a one-way ANOVA primary analysis, to verify that\n  distributional assumptions are not driving the conclusion.\n- *Descriptive comparison in Table 2 or Table 3* of an observational RWE study,\n  documenting unadjusted differences across treatment cohorts with an explicit note that\n  the comparison is confounded.\n- *Ordinal outcomes*: PRO scales, severity classifications, adherence categories —\n  wherever the distance between scale points is not assumed to be equal.\n- *Any multi-group comparison where one ANOVA assumption is clearly violated*: if\n  Levene's test or visual inspection reveals highly unequal variances, or if group sizes\n  are small and histograms show strong skewness, Kruskal-Wallis is the safer primary\n  method.\n\n**When NOT to use**\n\n- *Repeated-measures or pre-post-post designs where the same patients appear in multiple\n  groups*: the independence assumption is violated. Use the Friedman test (the\n  nonparametric counterpart to repeated-measures ANOVA) or a mixed-effects model. 
Applying\n  Kruskal-Wallis to a panel dataset with one row per patient-period naively treats each\n  row as independent, producing anti-conservative p-values.\n\n- *When adjusted estimates are needed*: Kruskal-Wallis cannot include covariates. If the\n  comparison is confounded — treatment groups differ on age, comorbidity, or disease\n  severity — a multivariable model (ANCOVA, quantile regression, or a GLM) is required\n  for valid inference. Using Kruskal-Wallis on unbalanced observational data and reporting\n  the p-value as evidence of a treatment effect conflates the treatment contrast with\n  selection bias.\n\n- *When the groups have a meaningful natural order (dose-response, line of therapy)*:\n  Kruskal-Wallis tests whether any pair differs but ignores the ordering. The\n  Jonckheere-Terpstra test is the correct choice when the alternative hypothesis is a\n  monotone trend across ordered groups; it is more powerful than Kruskal-Wallis for\n  detecting dose-response relationships.\n\n- *When the inference target is the mean cost difference*: for budget-impact analyses,\n  payer submissions, or cost-effectiveness model inputs, the relevant quantity is usually\n  the difference in *mean* costs across treatment arms, not the rank ordering. A\n  generalized linear model with a log link and gamma or Tweedie variance structure\n  produces the mean ratio (and mean difference via recycled predictions) on the original\n  dollar scale with appropriate confidence intervals. 
Kruskal-Wallis can describe the rank\n  distribution but does not answer the mean-cost question that health economics requires.\n\n- *When clustering is present and unaccounted for*: applying Kruskal-Wallis row-by-row to\n  data where patients are nested within providers or facilities produces invalid inference.\n  Aggregate to the cluster level or use a model that incorporates random effects.\n\n- *As a substitute for the Jonckheere-Terpstra test when detecting monotone trends in\n  ordered groups*: repeated post-hoc testing after a non-significant Kruskal-Wallis result\n  does not recover the power that a properly specified directional test would have had.\n\n**Interpreting the output**\n\nIn the worked example, nine patients across three hospitals (three per hospital) had\ninpatient lengths of stay ranging from 1 to 9 days. After pooled ranking, Hospital A\nreceived rank sum R_A = 6, Hospital B received R_B = 15, and Hospital C received R_C = 24.\nThe rank-sum contribution is 6²/3 + 15²/3 + 24²/3 = 12 + 75 + 192 = 279, yielding\nH = (12/90) × 279 − 30 = 37.2 − 30 = 7.2. On df = 2, the chi-square critical value at\nalpha = 0.05 is approximately 5.99, and the chi-square approximation gives p ≈ 0.027.\n\n*(1) Formal interpretation.* H = 7.2 exceeds the critical value of approximately 5.99 on\ndf = 2, placing p ≈ 0.027 below alpha = 0.05. Under the null hypothesis that all three\ngroups share the same distribution, the chi-square approximation assigns a probability of\napproximately 2.7% to a rank imbalance this extreme. With only three patients per group\nthat approximation is rough: H = 7.2 is the largest value H can take here, and only 6 of\nthe 1,680 equally likely rank assignments reach it (exact permutation p ≈ 0.004). Either\nway, the omnibus test rejects the null, indicating that at least\none hospital's LOS rank distribution is stochastically different from the others. This\nresult pertains to stochastic dominance (whether one group tends to produce larger values),\nnot specifically to differences in medians unless the three distributions are assumed to\ndiffer only in location. 
Pairwise identification of which hospitals differ requires Dunn's\ntest with an appropriate multiplicity correction.\n\n*(2) Practical interpretation.* The rank distribution of length of stay differs across the\nthree hospitals in a way unlikely to arise by chance. Hospital C consistently occupies the\nhighest ranks (longest stays) and Hospital A the lowest, which is visible directly from\nthe rank sums (6 vs 15 vs 24). Whether this reflects differences in patient case mix,\nclinical practice, or discharge protocols cannot be determined from an unadjusted\nKruskal-Wallis test; the comparison is descriptive. Dunn's test with Bonferroni or Holm\nadjustment for the three pairwise comparisons is the appropriate next step to identify which\nspecific hospital pairs drive the overall signal.",
    "primary_category": "Inferential_Statistics",
    "tags": [
      "statistics",
      "primitive",
      "hypothesis-testing",
      "nonparametric",
      "rank-based",
      "multiple-groups",
      "omnibus-test",
      "post-hoc",
      "Dunn-test",
      "ANOVA",
      "kruskal-wallis"
    ],
    "applies_to_study_types": [
      "cohort_retrospective",
      "cohort_prospective",
      "cross_sectional",
      "descriptive_analysis",
      "claims_analysis",
      "ehr_study",
      "registry_study"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "primary",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1080/01621459.1952.10483441",
        "url": "https://doi.org/10.1080/01621459.1952.10483441",
        "citation_text": "Kruskal WH, Wallis WA. Use of ranks in one-criterion variance analysis. Journal of the American Statistical Association. 1952;47(260):583-621.",
        "year": 1952,
        "authors_short": "Kruskal & Wallis",
        "notes": "The original paper introducing the H statistic, the pooled-ranking approach, and the chi-square approximation for the k-group nonparametric test. Establishes the theoretical foundation for the tie correction and the relationship to the Mann-Whitney U test as the two-group special case."
      },
      {
        "role": "explain",
        "doi": "10.1080/00401706.1964.10490181",
        "url": "https://doi.org/10.1080/00401706.1964.10490181",
        "citation_text": "Dunn OJ. Multiple comparisons using rank sums. Technometrics. 1964;6(3):241-252.",
        "year": 1964,
        "authors_short": "Dunn",
        "notes": "Introduces the correct post-hoc pairwise comparison procedure for use after a significant Kruskal-Wallis result. Dunn's test uses the pooled variance from the full ranking rather than running separate pairwise Mann-Whitney tests, preventing family-wise type-I error inflation. The Bonferroni and Holm adjustments described in this entry apply to Dunn's pairwise z-statistics."
      },
      {
        "role": "demonstrate",
        "doi": "10.1080/00031305.1981.10479327",
        "url": "https://doi.org/10.1080/00031305.1981.10479327",
        "citation_text": "Conover WJ, Iman RL. Rank transformations as a bridge between parametric and nonparametric statistics. The American Statistician. 1981;35(3):124-129.",
        "year": 1981,
        "authors_short": "Conover & Iman",
        "notes": "Demonstrates the formal connection between rank-based nonparametric tests and parametric ANOVA applied to the ranks: applying one-way ANOVA to the pooled ranks produces a test statistic that is a monotone function of H. This equivalence clarifies why the Kruskal- Wallis test is often described as \"one-way ANOVA on ranks\" and why its power properties parallel those of ANOVA under the rank-transformation framework."
      },
      {
        "role": "use",
        "doi": "10.1186/1471-2288-12-78",
        "url": "https://doi.org/10.1186/1471-2288-12-78",
        "citation_text": "Fagerland MW. t-tests, non-parametric tests, and large studies — a paradox of statistical practice? BMC Medical Research Methodology. 2012;12:78.",
        "year": 2012,
        "authors_short": "Fagerland",
        "notes": "Contextualizes when nonparametric rank tests (including Kruskal-Wallis) are appropriate versus when parametric alternatives are defensible. The key insight for RWE is that in very large datasets the CLT protects parametric tests even for skewed outcomes, while rank tests answer a stochastic-dominance question rather than a mean-difference question — a distinction that matters for cost and utilization outcomes where means, not ranks, are the policy-relevant quantity."
      }
    ],
    "plain_language_summary": "The Kruskal-Wallis test answers the question: \"Do three or more groups have meaningfully different outcome distributions?\" — without assuming the data follow a bell-curve shape. It works by replacing every observed value with its rank (1st smallest, 2nd smallest, and so on) across all groups combined, then checking whether the groups ended up with surprisingly unequal shares of high and low ranks. If the answer is yes, a follow-up procedure called Dunn's test identifies which specific pairs of groups differ. It is the standard nonparametric substitute for one-way ANOVA and is commonly used in health outcomes research to compare length of stay, cost, or patient-reported scores across three or more treatment groups when the numbers are skewed or the sample is small.",
    "key_terms": [
      {
        "term": "rank",
        "definition": "The position of a value when every observation from all groups is sorted from smallest to largest together — the smallest value gets rank 1, the next gets rank 2, and so on, regardless of which group it came from."
      },
      {
        "term": "H statistic",
        "definition": "The Kruskal-Wallis test statistic, which measures how unequally the rank totals are spread across groups relative to what pure chance would produce if there were no real difference between them."
      },
      {
        "term": "omnibus test",
        "definition": "A test that checks whether any group differs from any other group overall, without specifying which pairs differ; a significant result signals that at least one difference exists but requires follow-up testing to locate it."
      },
      {
        "term": "post-hoc pairwise comparison",
        "definition": "A follow-up test applied after a significant omnibus result to identify which specific pairs of groups are responsible for the overall difference, with an adjustment to prevent false positives from accumulating across multiple comparisons."
      },
      {
        "term": "tied ranks",
        "definition": "When two or more observations have the same value, they each receive the average of the ranks they would have occupied; this average-rank approach ensures the math remains consistent and a correction factor adjusts the H statistic for any inflation caused by ties."
      },
      {
        "term": "stochastic dominance",
        "definition": "The property being tested by the Kruskal-Wallis test — whether a randomly chosen observation from one group is more likely to be larger (or smaller) than a randomly chosen observation from another group, which is a broader statement than saying the group medians differ."
      }
    ],
    "worked_example": {
      "scenario": "A health systems analyst is comparing the length of inpatient stay (days) for patients admitted with a primary diagnosis of heart failure across three hospitals — Community Hospital (Group A), Regional Medical Center (Group B), and Academic Medical Center (Group C) — using a small convenience sample of three patients per site. The analyst wants to test whether length of stay differs across the three facilities using a method that does not assume normality, and then manually verify the Kruskal-Wallis H statistic from first principles.",
      "dataset": {
        "caption": "Length of stay (days) for nine heart failure admissions across three hospitals. All nine values are pooled and ranked together for the Kruskal-Wallis test. There are no tied values in this example.",
        "columns": [
          "patient_id",
          "hospital",
          "los_days"
        ],
        "rows": [
          [
            "A1",
            "Community",
            1
          ],
          [
            "A2",
            "Community",
            2
          ],
          [
            "A3",
            "Community",
            3
          ],
          [
            "B1",
            "Regional",
            4
          ],
          [
            "B2",
            "Regional",
            5
          ],
          [
            "B3",
            "Regional",
            6
          ],
          [
            "C1",
            "Academic",
            7
          ],
          [
            "C2",
            "Academic",
            8
          ],
          [
            "C3",
            "Academic",
            9
          ]
        ]
      },
      "steps": [
        "Pool all 9 observations and assign ranks 1 through 9 in ascending order of LOS. Since all values are distinct, there are no ties. Value 1 gets rank 1, value 2 gets rank 2, value 3 gets rank 3, value 4 gets rank 4, value 5 gets rank 5, value 6 gets rank 6, value 7 gets rank 7, value 8 gets rank 8, value 9 gets rank 9.",
        "Compute the rank sum for each group. Group A (Community): patients A1, A2, A3 received ranks 1, 2, 3 respectively. R_A = 1 + 2 + 3 = 6. Group B (Regional): patients B1, B2, B3 received ranks 4, 5, 6. R_B = 4 + 5 + 6 = 15. Group C (Academic): patients C1, C2, C3 received ranks 7, 8, 9. R_C = 7 + 8 + 9 = 24.",
        "Verify that rank sums total to N*(N+1)/2 = 9*10/2 = 45. Check: 6 + 15 + 24 = 45. Correct — all ranks are accounted for.",
        "Compute the bracketed term in the H formula. Each group has n_j = 3 observations. For Group A: R_A squared equals 36; divide by n_A = 3 to get 36/3 = 12. For Group B: R_B squared equals 225; divide by n_B = 3 to get 225/3 = 75. For Group C: R_C squared equals 576; divide by n_C = 3 to get 576/3 = 192. Bracketed sum = 12 + 75 + 192 = 279.",
        "Apply the H formula with N = 9: the leading constant is 12 divided by N*(N+1), which is 12/(9*10) = 12/90. Multiply by the bracketed sum: (12/90)*279 = 3348/90 = 37.2. Subtract 3*(N+1) = 3*10 = 30. So H = 37.2 - 30 = 7.2.",
        "Compare H = 7.2 to the chi-square critical value with df = k - 1 = 3 - 1 = 2. The chi-square critical value at alpha = 0.05 with df = 2 is 5.991. Since 7.2 > 5.991, we reject the null hypothesis (p approximately 0.027). At least one hospital's length-of-stay distribution differs from the others.",
        "Because the omnibus test is significant, Dunn's test with Bonferroni correction would be applied to identify which pairs differ (A vs B, A vs C, B vs C). With three pairs, the Bonferroni-adjusted alpha threshold is 0.05/3 = 0.0167. Given the clean rank separation in this example, the A-vs-C comparison yields the smallest adjusted p-value. In practice, Dunn's test is run in software rather than computed by hand."
      ],
      "result": "N = 9, k = 3, n per group = 3. Rank sums: R_A = 6, R_B = 15, R_C = 24; total = 6 + 15 + 24 = 45. Bracketed sum = 12 + 75 + 192 = 279. H = (12/90)*279 - 30 = 3348/90 - 30 = 37.2 - 30 = 7.2. With df = 2, chi-square critical value at 0.05 is 5.991; H = 7.2 exceeds 5.991, so p is approximately 0.027. The test rejects the null; post-hoc Dunn's test with multiplicity adjustment identifies which hospital pairs differ."
    },
    "prerequisites": [
      "inferential-statistics-foundations",
      "parametric-vs-nonparametric-tests",
      "mann-whitney-u-test"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Three groups, continuous skewed outcome (core use case)",
        "description": "The most common application in HEOR: comparing a continuous outcome (LOS, PPPM cost, days to first relapse, PRO score) across exactly three treatment groups, drug classes, or sites. Apply scipy.stats.kruskal (Python), kruskal.test (R), or PROC NPAR1WAY with WILCOXON (SAS). When H is significant, follow with Dunn's test (scikit-posthocs in Python; dunn.test or FSA::dunnTest in R; DSCF option in SAS PROC NPAR1WAY) with Bonferroni or Holm adjustment. Report H, df, p-value, group medians, and IQRs alongside the omnibus and post-hoc results.",
        "edge_cases": [
          "When k = 2, Kruskal-Wallis is algebraically equivalent to the Mann-Whitney U test; use Mann-Whitney directly for the two-group case.",
          "At very large N (> 10,000), H will be significant for trivially small rank differences; report the epsilon-squared effect size (eta^2 = (H - k + 1) / (N - k)) alongside p."
        ],
        "data_source_notes": "Claims: compute per-patient total allowed cost or utilization count for the observation window; apply Kruskal-Wallis across three or more cohort arms. For primary inference on mean costs, use a gamma GLM and report Kruskal-Wallis as a sensitivity check."
      },
      {
        "name": "Ordinal outcome across multiple groups",
        "description": "Patient-reported outcome (PRO) scores, Likert-scale items, severity grades, or adherence categories compared across provider type, geographic region, or plan type. The rank transformation in Kruskal-Wallis respects the ordinal structure without assuming equal intervals between scale points, making it more appropriate than one-way ANOVA for these outcomes.",
        "edge_cases": [
          "When the ordinal scale has many tied values (e.g., a 5-point Likert scale with large samples), the tie correction becomes important; all standard implementations apply it automatically.",
          "For ordinal outcomes in regression contexts, cumulative logit (proportional odds) models extend naturally to covariate adjustment and should be preferred when confounding is present."
        ],
        "data_source_notes": "Primary data (surveys, PRO instruments): test differences in score distributions across randomized or quasi-experimental groups. Registry: compare severity indices across diagnostic subgroups or treatment pathways."
      },
      {
        "name": "Post-hoc Dunn's test with multiplicity adjustment",
        "description": "After a significant Kruskal-Wallis omnibus result, Dunn's test computes z-statistics for each pairwise comparison using the pooled variance from the full ranking. The Bonferroni correction divides alpha by the number of pairs (k*(k-1)/2); the Holm step-down procedure adjusts ordered p-values and is uniformly more powerful. Benjamini-Hochberg FDR control is appropriate when many comparisons are expected and some false discoveries are acceptable (exploratory work). Never run unadjusted pairwise Mann-Whitney tests as a substitute — the family-wise error rate inflates dramatically with more than two pairs.",
        "edge_cases": [
          "For k = 10 groups, there are 45 pairwise comparisons; Bonferroni becomes very conservative; consider Holm or FDR adjustment and pre-specify the adjustment method.",
          "Dunnett's nonparametric analogue compares all groups to a single reference (e.g., placebo or standard of care) and is more powerful when only k-1 comparisons are needed."
        ],
        "data_source_notes": "Apply post-hoc testing only when the omnibus H is significant. Pre-specify the correction method in the statistical analysis plan before unblinding or examining the data."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "one-way-anova",
        "pros_of_this": "Valid without the normality or equal-variance assumptions required by ANOVA; robust to outliers and heavy-tailed distributions; appropriate for ordinal outcomes where the interval assumption is not defensible.",
        "cons_of_this": "Lower power than ANOVA when the normality assumption actually holds; does not produce a mean-difference effect estimate on the original scale; cannot directly adjust for covariates.",
        "when_to_prefer": "Use Kruskal-Wallis when distributions are clearly non-normal at small-to-moderate n, when the outcome is ordinal, or as a sensitivity check alongside ANOVA. Prefer ANOVA when n is large (CLT protects the F statistic) and when mean differences are the policy-relevant quantity."
      },
      {
        "compared_to": "parametric-vs-nonparametric-tests",
        "pros_of_this": "Extends the two-group nonparametric framework (Mann-Whitney) to k >= 3 groups while controlling the omnibus type-I error rate with a single test statistic.",
        "cons_of_this": "Adds the complexity of post-hoc testing and multiplicity adjustment not needed for the two-group case; choosing among correction procedures requires pre-specification.",
        "when_to_prefer": "Use Kruskal-Wallis whenever the comparison involves three or more groups; use Mann-Whitney directly for exactly two groups."
      },
      {
        "compared_to": "mann-whitney-u-test",
        "pros_of_this": "Handles k >= 3 groups in a single omnibus test, controlling type-I error properly; avoids the multiplicity inflation of repeated pairwise Mann-Whitney tests.",
        "cons_of_this": "Requires a post-hoc step (Dunn's test) after significance; the two-group Mann-Whitney is simpler and has exact tables for small samples that are easier to look up.",
        "when_to_prefer": "Use Kruskal-Wallis for three or more groups; the Mann-Whitney is the two-group special case and should be used directly when k = 2."
      },
      {
        "compared_to": "healthcare-costs-pppm-pppy-pmpm",
        "pros_of_this": "Kruskal-Wallis is distribution-free and can describe rank differences in per-patient costs across three or more lines of therapy without requiring a parametric model.",
        "cons_of_this": "Cost distributions are right-skewed and the mean — not the median or rank order — is the quantity needed for budget-impact models and payer submissions. A gamma GLM with log link produces a mean-ratio estimate on the dollar scale that is directly usable for decision modelling; Kruskal-Wallis cannot provide this.",
        "when_to_prefer": "Use Kruskal-Wallis for descriptive cost comparisons and sensitivity checks; use a gamma GLM or two-part model as the primary analysis when mean cost differences are the inference target."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Per-patient totals (PPPM cost, inpatient days, ER visits) across three or more cohort arms are the natural inputs. Apply Kruskal-Wallis to the per-patient totals after computing the observation-window aggregate. Report group medians and IQRs as the descriptive summary. For primary inference on mean costs — the quantity needed by payers and budget-impact models — accompany with a gamma GLM; present Kruskal-Wallis as the nonparametric sensitivity check. At large N (> 10,000), also report the epsilon-squared effect size, as H will be significant for trivially small differences.",
      "ehr": "Length of stay, readmission count, lab-value trajectories, and PRO scores are common inputs. Clustering by provider or facility is nearly universal in EHR data — verify that the unit of analysis is at the patient level and that patients are not multiply listed across groups before applying Kruskal-Wallis. If patients are nested within providers, aggregate to the provider level or use a mixed model.",
      "registry": "Disease severity scores, time-to-event surrogates, and adjudicated outcomes across diagnostic or therapeutic subgroups. Registry data often have complete case capture within sites but site-level clustering; treat site as a stratum or covariate if it introduces non-independence.",
      "primary": "Survey instruments, PRO scales, and clinician-rated severity scores across three or more randomized arms or geographic groups. Small sample sizes in primary data collection make Kruskal-Wallis especially appropriate relative to ANOVA. Use exact permutation p-values (available in R's coin::kruskal_test) when n_j < 5.",
      "linked": "Linked claims-EHR-registry cohorts typically have large N; report both Kruskal-Wallis H and epsilon-squared effect size, and lead with the gamma GLM for cost outcomes. The large-N chi-square approximation is reliable. Clustering concerns are acute in linked data because facility identifiers are usually available — check the ICC before treating rows as independent."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import math\nfrom scipy import stats\n\n# ── Motivating dataset: LOS (days) across three hospitals ──\ngroup_a = [1, 2, 3]   # Community Hospital\ngroup_b = [4, 5, 6]   # Regional Medical Center\ngroup_c = [7, 8, 9]   # Academic Medical Center\n\n# ── 1. Kruskal-Wallis omnibus test ──\nH, p_kw = stats.kruskal(group_a, group_b, group_c)\nprint(f\"Kruskal-Wallis H = {H:.4f}, p = {p_kw:.4f}, df = {3 - 1}\")\n# Expected: H = 7.2, p ≈ 0.0273\n\n# ── 2. Epsilon-squared effect size (eta^2 for Kruskal-Wallis) ──\nN = len(group_a) + len(group_b) + len(group_c)\nk = 3\neps_sq = (H - k + 1) / (N - k)\nprint(f\"Epsilon-squared (eta^2) = {eps_sq:.4f}\")\n# Effect size interpretation: 0.01 small, 0.06 medium, 0.14 large\n\n# ── 3. Descriptive summary: medians and IQRs ──\nimport statistics\nfor name, grp in [(\"Community\", group_a), (\"Regional\", group_b), (\"Academic\", group_c)]:\n    med = statistics.median(grp)\n    q1 = sorted(grp)[len(grp)//4]\n    q3 = sorted(grp)[3*len(grp)//4]\n    print(f\"{name}: median = {med}, IQR = [{q1}, {q3}]\")\n\n# ── 4. 
Post-hoc Dunn's test (only run after significant omnibus result) ──\n# scikit-posthocs returns a symmetric p-value matrix\ntry:\n    import scikit_posthocs as sp\n    import pandas as pd\n    data = group_a + group_b + group_c\n    labels = [\"A\"] * 3 + [\"B\"] * 3 + [\"C\"] * 3\n    df = pd.DataFrame({\"los\": data, \"hospital\": labels})\n    # Bonferroni adjustment (p_adjust = \"bonferroni\" or \"holm\")\n    dunn_bonf = sp.posthoc_dunn(df, val_col=\"los\", group_col=\"hospital\",\n                                p_adjust=\"bonferroni\")\n    print(\"\\nDunn post-hoc p-values (Bonferroni-adjusted):\")\n    print(dunn_bonf.to_string())\n    dunn_holm = sp.posthoc_dunn(df, val_col=\"los\", group_col=\"hospital\",\n                                p_adjust=\"holm\")\n    print(\"\\nDunn post-hoc p-values (Holm-adjusted):\")\n    print(dunn_holm.to_string())\n    print(\"\\nNote: Apply post-hoc tests only after a significant omnibus result.\")\n    print(\"Never run unadjusted pairwise Mann-Whitney tests as a post-hoc substitute.\")\nexcept ImportError:\n    print(\"\\nscikit-posthocs not installed; run: pip install scikit-posthocs\")\n    print(\"Dunn's test is skipped. Pairwise scipy.stats.mannwhitneyu tests re-rank each\")\n    print(\"pair separately and are not Dunn's test, even with Bonferroni adjustment.\")\n\n# ── 5. Manual H verification (matches the worked example exactly) ──\nN_all = 9\nn_j = 3\nR = [6, 15, 24]   # rank sums computed in the worked example\nH_manual = (12 / (N_all * (N_all + 1))) * sum(r**2 / n_j for r in R) - 3 * (N_all + 1)\nprint(f\"\\nManual H = (12/90) * {sum(r**2/n_j for r in R):.1f} - 30 = {H_manual:.4f}\")\n# Expected: H_manual = 7.2",
        "description": "Kruskal-Wallis omnibus test using scipy.stats.kruskal, followed by Dunn's post-hoc\npairwise test using scikit-posthocs (scikit_posthocs.posthoc_dunn). Demonstrates the\nomnibus test, Bonferroni and Holm p-value adjustment for the post-hoc comparisons, and\nthe epsilon-squared effect size. Uses the nine-observation motivating dataset from the\nbeginner worked example (three hospitals, three patients each, LOS 1-9 days, no ties).",
        "dependencies": [
          "scipy",
          "scikit-posthocs"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "# ── Motivating dataset: LOS (days) across three hospitals ──\nlos    <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)\nhospital <- factor(c(\"Community\", \"Community\", \"Community\",\n                     \"Regional\",  \"Regional\",  \"Regional\",\n                     \"Academic\",  \"Academic\",  \"Academic\"))\n\n# ── 1. Kruskal-Wallis omnibus test (base R) ──\nkw_res <- kruskal.test(los ~ hospital)\nprint(kw_res)\n# Expected: H = 7.2, df = 2, p ≈ 0.0273\n\n# ── 2. Epsilon-squared effect size ──\nN <- length(los)\nk <- nlevels(hospital)\nH <- kw_res$statistic\neps_sq <- (H - k + 1) / (N - k)\ncat(sprintf(\"Epsilon-squared (eta^2) = %.4f\\n\", eps_sq))\n\n# ── 3. Group descriptive statistics ──\ntapply(los, hospital, function(x) {\n  cat(sprintf(\"%s: median = %.1f, IQR = [%.1f, %.1f]\\n\",\n              deparse(substitute(x)), median(x), quantile(x, 0.25), quantile(x, 0.75)))\n})\nby(los, hospital, function(x) c(median = median(x), IQR = IQR(x)))\n\n# ── 4a. Dunn post-hoc via dunn.test package ──\nif (requireNamespace(\"dunn.test\", quietly = TRUE)) {\n  cat(\"\\nDunn post-hoc (Bonferroni adjustment):\\n\")\n  dunn.test::dunn.test(los, hospital, method = \"bonferroni\")\n  cat(\"\\nDunn post-hoc (Holm adjustment):\\n\")\n  dunn.test::dunn.test(los, hospital, method = \"holm\")\n} else {\n  message(\"Install: install.packages('dunn.test')\")\n}\n\n# ── 4b. Alternative: FSA::dunnTest (returns a tidy data frame) ──\nif (requireNamespace(\"FSA\", quietly = TRUE)) {\n  cat(\"\\nDunn post-hoc via FSA::dunnTest (Bonferroni):\\n\")\n  print(FSA::dunnTest(los ~ hospital, data = data.frame(los, hospital),\n                      method = \"bonferroni\"))\n} else {\n  message(\"Install: install.packages('FSA')\")\n}\n\n# ── Note on incorrect practice ──\ncat(\"\\nIMPORTANT: Do NOT use repeated pairwise wilcox.test calls as a post-hoc.\\n\")\ncat(\"Use dunn.test or FSA::dunnTest which use the pooled variance from the full ranking.\\n\")",
        "description": "Kruskal-Wallis test using base-R kruskal.test, followed by Dunn's post-hoc test via\nthe dunn.test package (dunn.test::dunn.test) and the FSA package (FSA::dunnTest) as\nalternatives. Demonstrates effect size computation, group descriptive statistics, and\nthe Holm and Bonferroni p-value adjustments. Uses the same nine-observation motivating\ndataset as the Python implementation.",
        "dependencies": [
          "dunn.test",
          "FSA"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* ── Create motivating dataset: LOS across three hospitals ── */\ndata work.los_data;\n  length hospital $ 9;   /* default $ length is 8, which would truncate \"Community\" */\n  input patient_id $ hospital $ los_days;\n  datalines;\nA1 Community 1\nA2 Community 2\nA3 Community 3\nB1 Regional  4\nB2 Regional  5\nB3 Regional  6\nC1 Academic  7\nC2 Academic  8\nC3 Academic  9\n;\nrun;\n\n/* ── 1. Kruskal-Wallis omnibus test ──\n   PROC NPAR1WAY with WILCOXON runs Kruskal-Wallis automatically when CLASS has >= 3 levels.\n   The Kruskal-Wallis statistic appears in the \"Kruskal-Wallis Test\" output table, which\n   follows the \"Wilcoxon Scores (Rank Sums)\" table. */\nproc npar1way data=work.los_data wilcoxon;\n  class hospital;        /* treatment / group variable                              */\n  var los_days;          /* continuous or ordinal outcome                           */\n  /* Output: Wilcoxon rank-sum scores per group, Kruskal-Wallis chi-square and p   */\nrun;\n\n/* ── 2. Post-hoc pairwise comparisons: DSCF option ──\n   DSCF (Dwass, Steel, Critchlow-Fligner) is the SAS nonparametric multiple-comparison\n   procedure. It is appropriate after a significant Kruskal-Wallis result and controls\n   the family-wise error rate across all pairwise comparisons.\n   Note: DSCF uses a different variance formulation than Dunn's test; both are valid\n   post-hoc approaches. Dunn's test is more common in published HEOR literature. */\nproc npar1way data=work.los_data dscf;\n  class hospital;\n  var los_days;\n  /* Output: pairwise DSCF statistics and adjusted p-values for each pair           */\nrun;\n\n/* ── 3. Group descriptive statistics (medians and IQRs for reporting) ── */\nproc means data=work.los_data median q1 q3 n;\n  class hospital;\n  var los_days;\n  /* Report median and IQR alongside the Kruskal-Wallis result;                    */\n  /* the rank test p-value without descriptive context is insufficient for readers  */\nrun;\n\n/* ── Note on incorrect practice ──\n   Do NOT use PROC NPAR1WAY WILCOXON in a loop over all pairs as a post-hoc substitute.\n   This inflates the family-wise error rate. Always use DSCF or a Dunn-corrected macro\n   after a significant Kruskal-Wallis result.                                        */",
        "description": "Kruskal-Wallis test using PROC NPAR1WAY with the WILCOXON option for the omnibus test\nand the DSCF option for Dwass, Steel, Critchlow-Fligner nonparametric pairwise post-hoc\ncomparisons. PROC NPAR1WAY with WILCOXON produces the Kruskal-Wallis H statistic and\np-value automatically when there are three or more groups. Uses the same nine-observation\nmotivating dataset as the Python and R implementations.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Start[\"3+ independent groups<br/>on a continuous or ordinal outcome\"] --> Assess[\"Assess distributional<br/>assumptions\"]\n  Assess --> Normal[\"Approximately normal<br/>+ n large enough for CLT\"]\n  Assess --> NonNormal[\"Clearly non-normal,<br/>ordinal, or small n\"]\n  Normal --> ANOVA[\"One-way ANOVA<br/>(parametric)\"]\n  NonNormal --> KW[\"Kruskal-Wallis H test<br/>(nonparametric)\"]\n  ANOVA --> ANOVApost[\"Post-hoc: Tukey HSD<br/>or Bonferroni t-tests\"]\n  KW --> Sig{\"H significant?\"}\n  Sig -- \"Yes\" --> Dunn[\"Post-hoc: Dunn's test<br/>with Bonferroni or Holm\"]\n  Sig -- \"No\" --> Stop[\"Fail to reject H0:<br/>no evidence groups differ\"]\n  KW --> Warning[\"Caution: tests stochastic<br/>dominance, not medians.<br/>Report medians + IQRs.\"]\n  ANOVA --> MeanCost[\"If mean costs needed:<br/>Gamma GLM with log link<br/>is preferred primary method\"]",
        "caption": "Decision flow for comparing three or more independent groups, showing when Kruskal-Wallis is preferred over one-way ANOVA and the mandatory post-hoc step after a significant omnibus result.",
        "alt_text": "Flowchart starting at three-or-more independent groups, branching on distributional assumptions to one-way ANOVA or Kruskal-Wallis, then showing post-hoc Tukey or Dunn steps and cautionary notes about stochastic dominance and mean-cost inference.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "requires",
        "target_slug": "parametric-vs-nonparametric-tests",
        "notes": "The Kruskal-Wallis test is the three-or-more-group member of the nonparametric test family introduced in the parent entry; understanding the parametric/nonparametric decision (including the stochastic-dominance interpretation shared with Mann-Whitney) is a prerequisite for applying Kruskal-Wallis correctly."
      },
      {
        "relation_type": "requires",
        "target_slug": "inferential-statistics-foundations",
        "notes": "The Kruskal-Wallis test assumes familiarity with null hypothesis testing, p-values, chi-square distributions, degrees of freedom, and the concept of family-wise error rate — all covered in the inferential statistics foundations entry."
      },
      {
        "relation_type": "see_also",
        "target_slug": "one-way-anova",
        "notes": "One-way ANOVA is the parametric counterpart to Kruskal-Wallis; it tests mean equality across k groups under the normality and equal-variance assumptions. When those assumptions hold, ANOVA has greater power. When they do not — skewed data, ordinal outcomes, small n — Kruskal-Wallis is preferred. Always report which test was chosen and why."
      },
      {
        "relation_type": "see_also",
        "target_slug": "mann-whitney-u-test",
        "notes": "The Mann-Whitney U test is the two-group special case of the Kruskal-Wallis test. When k = 2, the Kruskal-Wallis H statistic equals the square of the Mann-Whitney normal-approximation z-statistic (computed without a continuity correction); use Mann-Whitney directly for two-group comparisons and Kruskal-Wallis when there are three or more groups."
      },
      {
        "relation_type": "see_also",
        "target_slug": "healthcare-costs-pppm-pppy-pmpm",
        "notes": "Skewed cost outcomes across three or more lines of therapy or treatment cohorts are a common application context. Kruskal-Wallis can describe rank differences in per-patient costs, but model-based approaches (gamma GLM with log link) are preferred for the primary analysis when mean costs — the quantity needed for budget-impact modelling and payer submissions — are the inference target."
      }
    ],
    "aliases": [
      "Kruskal-Wallis H test",
      "one-way ANOVA on ranks",
      "KW test",
      "nonparametric one-way ANOVA",
      "Kruskal Wallis"
    ],
    "complexity": "foundational",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "landmark-analysis",
    "name": "Landmark Analysis",
    "short_definition": "A survival-analysis strategy that fixes a pre-specified landmark time, restricts to subjects event-free and under observation at that time, classifies exposure using only information available up to the landmark, and resets the time origin to the landmark for outcome follow-up, thereby removing guarantee-time (immortal time) bias from post-baseline exposure or response classifications.",
    "long_description": "**Landmark analysis** answers a conditional question in time-to-event data: among subjects who are still alive,\nevent-free, and under observation at a fixed clock time (the *landmark*), what is the outcome hazard contrasted by an\nexposure or response status that is fully determined *by* that landmark? The defining mechanic is a deliberate\nrealignment of the time axis. You (1) choose a landmark time `L` from clinical knowledge of when the classifying event\nis observable (e.g., 90 days after a cancer diagnosis, 30 days after an MI hospitalization), (2) discard everyone who\nhas the outcome, dies, or is censored before `L`, (3) freeze exposure status using only data accrued in `[index, L]`,\nand (4) start the outcome clock at `L` (`follow_up = event_date - L`). A standard Cox model, Kaplan-Meier curve, or\nFine-Gray competing-risks model is then fit on the landmark-conditional cohort.\n\n**Core estimand distinction** Landmark estimates the exposure-outcome contrast *conditional on being event-free and\nobservable at the landmark*, with exposure as known by the landmark — it is **not** a marginal effect from time zero,\nand it is **not** the same estimand as the alternatives it is usually compared with. A *time-dependent (extended) Cox\nmodel* (see standard-cox-time-dependent) uses all person-time and lets exposure accrue as a time-varying covariate, so\npre-landmark person-time is retained and the at-risk set is never truncated; it estimates an instantaneous-hazard\ncontrast across the whole follow-up. *Clone-censor-weight / g-methods* (see marginal-structural-models-g-methods)\nassign every subject to treatment strategies at baseline, censor at the moment they deviate, and weight for the\nresulting informative censoring, targeting a strategy contrast under sustained regimens. 
Landmark is the simplest of\nthe three but answers the narrowest question: it conditions on survival to `L` and therefore changes both the\npopulation (landmark survivors) and the time origin. The competing-risks estimand must also be pre-specified — a\ncause-specific hazard (PROC PHREG / coxph) answers an etiologic question, while a subdistribution hazard / cumulative\nincidence (Fine-Gray) answers an absolute-risk question; the two diverge whenever competing mortality differs by\nexposure (see competing-risks-cause-specific-fine-gray-rwe).\n\n**Pros, cons, and trade-offs**\n- **vs naive \"ever exposed / ever responder\" from time zero:** the naive contrast gives exposed/responder subjects\n  credit for the immortal time they had to survive in order to become exposed — pure guarantee-time bias that\n  manufactures an apparent benefit. Landmark removes it by construction (no one can be classified before they have\n  survived to `L`). **Prefer landmark** whenever exposure or response is defined after follow-up begins and a simple,\n  transparent fix is wanted.\n- **vs time-dependent (extended) Cox:** landmark is trivial to specify, communicate, and audit, and it sidesteps the\n  proportional-hazards-of-a-time-varying-covariate assumption. Cost: it throws away all pre-landmark person-time and\n  every subject who fails before `L`, so it is less efficient and the estimate is sensitive to the (arbitrary) choice\n  of `L`. **Prefer time-dependent Cox** when efficiency matters, when exposure changes repeatedly, or when no single\n  clinically meaningful landmark exists; **prefer landmark** when the scientific question is genuinely conditional\n  (\"among patients alive and responding at 90 days...\") or as a transparent sensitivity check on a time-dependent model.\n- **vs clone-censor-weight / g-methods:** landmark cannot represent dynamic, sustained, or grace-period strategies and\n  does not handle treatment-confounder feedback. 
**Prefer g-methods** for those estimands; **prefer landmark** for a\n  one-time classification at a fixed point.\n\n**When to use** Post-baseline exposure or response classification (tumor response, time-to-initiation, transplant,\nbiomarker conversion) where the contrast must be made fair for the survival required to be classified; oncology and\ncardiology RWE; as a pre-specified sensitivity analysis alongside a time-dependent Cox primary. The landmark must be\nchosen on **clinical/biological grounds and pre-specified** — never data-driven (peeking at outcomes to pick `L`\ninflates type-I error). Always report results across a grid of landmarks (e.g., 30/60/90/180 days) so reviewers see\nthe dependence on `L`, and frame the estimate explicitly as conditional on landmark survival.\n\n**When NOT to use — and when it is actively misleading or dangerous**\n- **Most events occur before a defensible landmark.** If the outcome is fast (e.g., 30-day mortality) and `L` is large,\n  you discard most events and the surviving cohort is a thin, selected slice — the estimate may be precise nonsense.\n- **Strong selection on survival-to-landmark.** Conditioning on event-free survival to `L` can open a collider path:\n  if an unmeasured factor causes both early events and the exposure, the landmark cohort is differentially depleted by\n  exposure and the conditional contrast is confounded *even if* time zero was clean. This is dangerous precisely because\n  landmark *looks* like it has solved the bias problem.\n- **Procedure- or response-anchored time zero without an eligibility anchor.** If you set the *index* (not just the\n  landmark) at the procedure/response itself, you re-import immortal time through the back door — pair landmark with an\n  eligibility-based time zero (diagnosis, hospitalization).\n- **Repeatedly changing exposure.** A single landmark freezes a status that genuinely varies; the frozen classification\n  misattributes later person-time. 
Use a time-dependent or sequential/dynamic landmark approach instead.\n- **Data-driven landmark selection** to maximize a hazard ratio — this is `p`-hacking with a survival curve.\n\n**Data-source operational depth**\n- **Claims (FFS vs MA):** the landmark denominator (\"event-free and *observable* at `L`\") is only valid where claims\n  are complete. Medicare Advantage and capitated/bundled person-time drop fee-for-service claims, so a subject can look\n  event-free at `L` simply because their event was never billed to the FFS system — misclassifying the at-risk set.\n  Require continuous A/B (and D if exposure is a drug) FFS enrollment from index through `L` and exclude MA-only\n  person-time. Exposure-by-landmark uses pharmacy (`fill_date`, `days_supply`) or procedure codes accrued in `[index,\n  L]`; sample fills, 90-day mail order, and free samples distort `days_supply` and the inferred initiation date.\n  **Differential competing-risk death by exposure** is a specific trap in elderly claims: if the sicker arm dies before\n  `L`, the landmark cohort is selectively healthier in that arm — run a Fine-Gray / CIF check, not just cause-specific\n  Cox.\n- **EHR:** response and biomarker capture cluster at clinic visits, so the true classifying event may sit just on\n  either side of `L`. Either snap the landmark to the visit grid or use last-observation-carried-forward and document\n  it; do not pretend daily resolution you do not have. Loss-to-system before `L` (patient leaves the network) is\n  informative censoring of the landmark denominator, not random.\n- **Registry:** event and treatment timing are high quality, but response/outcome **adjudication lags** — choose `L`\n  after the adjudication cutoff so you are not classifying on an artificially undercounted event set. 
Registries\n  typically lack full pharmacy exposure; link to claims to confirm initiation dates within `[index, L]`.\n- **Linked claims-EHR-registry:** the ideal substrate (EHR/registry severity + claims completeness + a death index to\n  firm up the competing-risk and at-risk sets), but linkage selects the linkable subset and creates order/fill/service\n  date discrepancies that must be reconciled before `index` and `L` are assigned.\n\n**Worked claims example.** Question: does *early statin initiation* after acute MI reduce 1-year recurrent MI, and is\nthe naive \"ever-statin\" estimate inflated by immortal time? Cohort: adults with an index AMI hospitalization\n(`index_date` = discharge), ≥365 days of continuous FFS A/B/D enrollment before index (so washout and baseline are\nobserved), and FFS-observable person-time through the landmark. Landmark `L` = 90 days post-discharge. Exposure is\nclassified using only fills in `[index_date, index_date + 90]`: a subject is an *early initiator* if any statin\n`fill_date` falls in that window (confirm `days_supply` ≥ 30 to exclude one-off samples). **Apply the landmark\nrestriction:** drop anyone with the outcome (recurrent MI), death, or disenrollment on or before day 90 — they cannot\ncontribute to a 90-day-conditional contrast. **Reset the clock:** outcome follow-up time = `event_date - (index_date +\n90)`, starting at zero on day 90 and ending at recurrent MI (the event), death (a competing event: censored in the\ncause-specific Cox, kept as a competing event in Fine-Gray), disenrollment (censoring), or 365 days post index\n(administrative censoring). Fit a cause-specific Cox (early vs late/non-initiator) for the etiologic HR and a Fine-Gray model for the\ncumulative-incidence contrast, because post-MI mortality is a strong competing risk that differs by statin use. Then\n**show the bias**: re-run the naive analysis that counts statin status as \"ever in the year\" from time zero (day 0) and\ncontrast the hazard ratio — the naive HR is biased toward benefit because early initiators had to survive ~90 days to\nbe classified. Finally, repeat the landmark fit at `L` = 30, 60, 180 days as the pre-specified sensitivity grid and\nreport all four, because the conditional population shifts with `L`.\n\n**Interpreting the output**\n\nA 6-month (180-day) landmark analysis of treatment responders vs non-responders returns: conditional overall survival at 12 months post-landmark = 74% (responders) vs 52% (non-responders), estimated among patients event-free at the landmark.\n\n*Formal interpretation.* At the landmark time L = 180 days, the cohort is restricted to patients who have not yet experienced the outcome; time is reset to zero from that landmark, and survival is estimated from L onward. The 74% vs 52% comparison is a conditional estimate describing survival over the 12 months after the landmark (months 6 to 18 from the original time zero) among the subset who reached month 6 event-free. Responder status is classified at or before L, so exposure is time-fixed within the landmark-defined subcohort. This eliminates the immortal-time advantage that responders would otherwise inherit in a time-zero-anchored analysis — but the estimate is conditional on surviving to L and is therefore not marginal from treatment initiation.\n\n*Practical interpretation.* Landmark analysis answers: \"For patients who made it to 6 months without an event, does responding to therapy predict better outcomes thereafter?\" This conditional framing is appropriate for prognostic labeling discussions and post-hoc responder analyses, but must be distinguished from the intention-to-treat estimate of overall survival from the start of treatment. Report the N at landmark, the 95% CI for each arm, and sensitivity analyses at pre-specified alternative landmark times.",
    "primary_category": "Causal_Inference_Method",
    "tags": [
      "landmark-analysis",
      "survival-analysis",
      "time-to-event",
      "immortal-time-bias",
      "guarantee-time-bias",
      "conditional-landmark",
      "competing-risks",
      "oncology-rwe"
    ],
    "applies_to_study_types": [
      "cohort_retrospective",
      "cohort_prospective",
      "claims_analysis",
      "ehr_study",
      "registry_study"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1161/CIRCOUTCOMES.110.957951",
        "url": "https://doi.org/10.1161/CIRCOUTCOMES.110.957951",
        "citation_text": "Dafni U. Landmark analysis at the 25-year landmark point. Circulation: Cardiovascular Quality and Outcomes. 2011;4(3):363-371.",
        "year": 2011,
        "authors_short": "Dafni",
        "notes": "Canonical modern treatment of landmark analysis itself — formalizes the landmark cohort, time-origin reset, choice of landmark time, and the conditional interpretation, 25 years after Anderson's original proposal."
      },
      {
        "role": "explain",
        "doi": "10.1200/jco.1983.1.11.710",
        "url": "https://doi.org/10.1200/jco.1983.1.11.710",
        "citation_text": "Anderson JR, Cain KC, Gelber RD. Analysis of survival by tumor response. Journal of Clinical Oncology. 1983;1(11):710-719.",
        "year": 1983,
        "authors_short": "Anderson, Cain, Gelber",
        "notes": "Seminal paper naming guarantee-time (immortal time) bias in responder-vs-nonresponder comparisons and proposing the landmark method as the fix; the conceptual origin of the technique."
      },
      {
        "role": "explain",
        "doi": "10.1093/aje/kwm324",
        "url": "https://doi.org/10.1093/aje/kwm324",
        "citation_text": "Suissa S. Immortal time bias in pharmacoepidemiology. American Journal of Epidemiology. 2008;167(4):492-499.",
        "year": 2008,
        "authors_short": "Suissa",
        "notes": "Authoritative framing of immortal time bias from post-baseline exposure classification in claims data — the bias landmark analysis is designed to eliminate, with the magnitude of distortion quantified."
      },
      {
        "role": "demonstrate",
        "doi": "10.1200/jco.2013.49.5283",
        "url": "https://doi.org/10.1200/jco.2013.49.5283",
        "citation_text": "Giobbie-Hurder A, Gelber RD, Regan MM. Challenges of guarantee-time bias. Journal of Clinical Oncology. 2013;31(23):2963-2969.",
        "year": 2013,
        "authors_short": "Giobbie-Hurder, Gelber, Regan",
        "notes": "Worked head-to-head demonstration in BIG 1-98 trial data comparing naive analysis, conditional landmark analysis, extended Cox, and inverse-probability weighting — naive suggested benefit, the bias-corrected methods showed none."
      },
      {
        "role": "explain",
        "doi": "10.1186/s12874-017-0405-6",
        "url": "https://doi.org/10.1186/s12874-017-0405-6",
        "citation_text": "Cho IS, Chae YR, Kim JH, et al. Statistical methods for elimination of guarantee-time bias in cohort studies: a simulation study. BMC Medical Research Methodology. 2017;17(1):126.",
        "year": 2017,
        "authors_short": "Cho et al.",
        "notes": "Simulation comparison of Cox, time-dependent Cox, and landmark methods for guarantee-time bias in pharmacoepidemiologic cohorts; characterizes performance and the bias/efficiency cost of the landmark-time choice."
      }
    ],
    "plain_language_summary": "When researchers want to compare cancer patients who responded to treatment versus those who did not, they face a hidden trap: responders had to survive long enough to be classified as responders, giving them a head-start advantage that has nothing to do with the treatment. A landmark analysis removes this trap by choosing a specific day in each patient's follow-up (say, day 90 after diagnosis) and asking only: who was still alive and event-free on that day, and did they respond before it? Everyone who died or had their outcome before day 90 is set aside, the groups are locked in based only on what happened before day 90, and survival tracking begins fresh from day 90 onward. This gives each group a fair start and measures how long patients lived *after* the classification moment, not from a zero point that secretly favors one group.",
    "key_terms": [
      {
        "term": "index date",
        "definition": "The patient's starting day in the study — usually the date of their diagnosis, hospital discharge, or first treatment fill — from which all other study dates are measured."
      },
      {
        "term": "classification window",
        "definition": "The period from the index date up to and including the landmark day, during which the analyst looks at records to decide which group each patient belongs to."
      },
      {
        "term": "landmark day",
        "definition": "A pre-chosen point in each patient's follow-up (e.g., day 90 after diagnosis) that divides the study into a classification period before it and an outcome-tracking period after it; only patients still alive and event-free on this day are included going forward."
      },
      {
        "term": "guarantee-time trap",
        "definition": "The bias that occurs when one group of patients must survive longer just to be labeled as that group, making them look healthier than the other group for reasons unrelated to treatment."
      },
      {
        "term": "follow-up time",
        "definition": "The number of days a patient is observed for the outcome of interest, measured here from the landmark day — not from the original index date — so no pre-landmark survival is counted twice."
      },
      {
        "term": "censored",
        "definition": "A patient whose outcome had not occurred by the time their observation ended, either because the study period closed or because they left the data source earlier; their data still contribute information about the time they were observed without the outcome."
      }
    ],
    "worked_example": {
      "scenario": "Three lung cancer patients are diagnosed on January 1, 2024 (their index date). Researchers want to know whether patients whose tumors shrank by day 90 live longer afterward than those whose tumors did not. The landmark day is March 31, 2024 — exactly 90 days after index. Scan results and vital status are checked on that day. Only patients alive and recurrence-free on March 31 enter the comparison; their follow-up clock starts fresh on that date.",
      "dataset": {
        "caption": "One row per patient showing scan result (recorded during the classification window), vital status at the landmark, and the recurrence date used to compute follow-up time.",
        "columns": [
          "person_id",
          "index_date",
          "scan_response_date",
          "responded_by_day90",
          "alive_at_landmark",
          "recurrence_date",
          "included_in_analysis"
        ],
        "rows": [
          [
            1001,
            "2024-01-01",
            "2024-02-15",
            "Yes",
            "Yes",
            "2024-09-15",
            "Yes"
          ],
          [
            1002,
            "2024-01-01",
            "none by day 90",
            "No",
            "Yes",
            "2024-11-15",
            "Yes"
          ],
          [
            1003,
            "2024-01-01",
            "N/A — died before landmark",
            "N/A",
            "No (died 2024-03-10)",
            "N/A",
            "No — EXCLUDED"
          ]
        ]
      },
      "steps": [
        "Set the landmark day: index date January 1 plus 90 days = March 31, 2024.",
        "Check each patient's vital status and outcome status on March 31: patient 1001 is alive and recurrence-free — keep; patient 1002 is alive and recurrence-free — keep; patient 1003 died on March 10, which is before the landmark — exclude by design.",
        "Lock each remaining patient's group using only records from January 1 through March 31: patient 1001 had a scan showing tumor shrinkage on February 15 (day 45, counting the index date as day 0), so they are in the 'responded' group; patient 1002 had no qualifying response scan before March 31, so they are in the 'did not respond' group.",
        "Reset the follow-up clock to zero at March 31 for both included patients: patient 1001 has their recurrence on September 15, giving 168 days of follow-up (April = 30, May = 31, June = 30, July = 31, August = 31, September 1–15 = 15 days; total = 168); patient 1002 has their recurrence on November 15, giving 229 days of follow-up (add October = 31 and November 1–15 = 15 days to the 183 days through September 30; total = 229).",
        "Both patients have the outcome (recurrence), so neither is censored in this small example; in a real study, patients still event-free at the end of the observation window would be censored at that point.",
        "The contrast is now fair: both responder and non-responder entered the risk set on the same calendar day (March 31), and no one's pre-landmark survival is counted as post-landmark follow-up."
      ],
      "result": {
        "label": "Responder (patient 1001): 168 days to recurrence from landmark. Non-responder (patient 1002): 229 days to recurrence from landmark. Patient 1003 excluded (died before landmark day 90). Analysis is conducted only on the 2 patients event-free at day 90, with follow-up measured from day 90 forward.",
        "value": "168 days (responder) vs 229 days (non-responder) post-landmark follow-up; 1 patient excluded pre-landmark"
      },
      "timeline_spec": {
        "title": "Landmark analysis at day 90 — two included patients and one pre-landmark exclusion",
        "window": {
          "start": "2024-01-01",
          "end": "2024-11-15",
          "label": "Full observation window: index through last event"
        },
        "events": [
          {
            "label": "Pt 1001: scan confirms response",
            "start": "2024-02-15",
            "length_days": 1,
            "quantity": "response recorded day 45"
          },
          {
            "label": "Pt 1003: death (pre-landmark)",
            "start": "2024-03-10",
            "length_days": 1,
            "quantity": "excluded — died before landmark"
          },
          {
            "label": "Pt 1001: recurrence",
            "start": "2024-09-15",
            "length_days": 1,
            "quantity": "outcome event, 168 days post-landmark"
          },
          {
            "label": "Pt 1002: recurrence",
            "start": "2024-11-15",
            "length_days": 1,
            "quantity": "outcome event, 229 days post-landmark"
          }
        ],
        "spans": [
          {
            "kind": "exposed",
            "start": "2024-01-01",
            "end": "2024-03-31",
            "label": "Classification window (index → day 90): groups assigned from records in this period"
          },
          {
            "kind": "gap",
            "start": "2024-03-10",
            "end": "2024-03-10",
            "label": "Pt 1003 dies here — excluded before landmark"
          },
          {
            "kind": "followup",
            "start": "2024-03-31",
            "end": "2024-09-15",
            "label": "Pt 1001 post-landmark follow-up: 168 days (responder)"
          },
          {
            "kind": "followup",
            "start": "2024-03-31",
            "end": "2024-11-15",
            "label": "Pt 1002 post-landmark follow-up: 229 days (non-responder)"
          }
        ],
        "landmark_marker": {
          "date": "2024-03-31",
          "label": "LANDMARK — day 90. Groups locked. Follow-up clock resets to zero. Pre-landmark deaths excluded."
        },
        "result": {
          "label": "Post-landmark follow-up: responder 168 days vs non-responder 229 days; 1 patient excluded (died day 69, before landmark)",
          "value": 168
        },
        "caption": "Timeline for three patients with a day-90 landmark. The shaded band (January 1 – March 31) is the classification window: scan results recorded here determine each patient's group. The vertical marker at March 31 is the landmark itself — follow-up for the two surviving patients begins here. Patient 1003, who died on March 10 (before the landmark), is excluded; their pre-landmark time is never counted as post-landmark risk, removing the guarantee-time trap.",
        "alt_text": "Horizontal timeline from January 1 to November 15, 2024. A shaded classification band runs from January 1 to March 31. A vertical line marks the day-90 landmark on March 31. Patient 1001 has a response event on February 15 and a recurrence on September 15, with a follow-up bar from the landmark to that recurrence labeled 168 days. Patient 1002 has no response event and a recurrence on November 15, with a follow-up bar from the landmark labeled 229 days. Patient 1003 has a death marker on March 10, before the landmark, with a label indicating exclusion."
      }
    },
    "prerequisites": [
      "immortal-time-bias-handling",
      "cox-ph-regression",
      "cumulative-incidence-risk-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Single fixed landmark (conditional landmark analysis)",
        "description": "One clinically pre-specified landmark time; restrict to subjects event-free and under observation at that time, freeze exposure using only data in [index, landmark], reset the time origin to the landmark, and fit a Cox/KM/Fine-Gray model on the resulting cohort.",
        "edge_cases": [
          "Subjects with the outcome, death, or censoring before the landmark are excluded by design — quantify how many and check that exclusions are not differential by arm.",
          "Earlier landmark = more power but more exposure misclassification; later landmark = less misclassification but smaller, more selected cohort."
        ],
        "data_source_notes": "claims/EHR: implement by subsetting to landmark survivors and computing follow_up = event_date - landmark; report the at-risk count entering each arm at the landmark.",
        "citations": [
          "dafni-2011",
          "giobbie-hurder-2013"
        ]
      },
      {
        "name": "Sequential / dynamic landmarking",
        "description": "Repeat the conditional analysis over a grid of landmark times (or a continuous landmark supermodel) to describe how the exposure-outcome association evolves and to provide updated, dynamically conditional prognosis.",
        "edge_cases": [
          "Overlapping landmark cohorts induce dependence across analyses; treat as exploratory or use a landmark supermodel with appropriate variance estimation rather than independent tests.",
          "Multiplicity across many landmarks inflates false positives if interpreted as separate confirmatory tests."
        ],
        "data_source_notes": "Useful when exposure status or risk evolves; align landmark grid to visit cadence in EHR to avoid spurious timing resolution.",
        "citations": [
          "dafni-2011"
        ]
      },
      {
        "name": "Landmark with competing-risks outcome (cause-specific vs Fine-Gray)",
        "description": "Apply the landmark restriction and time reset, then model the competing-risks outcome — cause-specific hazards (PROC PHREG / coxph) for etiology or a Fine-Gray subdistribution model / CIF for absolute risk.",
        "edge_cases": [
          "Differential competing-risk mortality by exposure depletes the landmark at-risk set non-randomly; a cause-specific HR can look protective while the cumulative incidence does not."
        ],
        "data_source_notes": "claims (elderly): always pair the cause-specific model with a CIF/Fine-Gray check; link to a death index so the competing event and at-risk set are complete.",
        "citations": [
          "giobbie-hurder-2013"
        ]
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Naive \"ever exposed / ever responder\" from time zero",
        "pros_of_this": "Removes guarantee-time (immortal time) bias by construction: exposure is classified only from data up to the landmark and follow-up starts at the landmark, so no pre-exposure survival time is credited to the exposed group.",
        "cons_of_this": "Discards all pre-landmark person-time and every subject who fails before the landmark, reducing efficiency and shifting the estimand to a conditional one.",
        "when_to_prefer": "Whenever exposure or response is defined after follow-up begins and a transparent, easily audited correction is required."
      },
      {
        "compared_to": "Time-dependent (extended) Cox model",
        "pros_of_this": "Trivial to specify, communicate, and audit; avoids the proportional-hazards assumption on a time-varying covariate; answers a genuinely conditional clinical question (\"among those event-free at the landmark...\").",
        "cons_of_this": "Less efficient (throws away pre-landmark person-time and early failures); estimate depends on the arbitrary landmark choice; a single landmark misrepresents repeatedly changing exposure.",
        "when_to_prefer": "When the question is conditional on reaching a clinically meaningful time point, or as a transparent sensitivity analysis on a time-dependent primary model."
      },
      {
        "compared_to": "Clone-censor-weight / marginal structural g-methods",
        "pros_of_this": "Far simpler; no weighting machinery, no treatment-confounder feedback model, fully interpretable from the cohort table.",
        "cons_of_this": "Cannot represent sustained, dynamic, or grace-period strategies; does not handle time-varying confounding affected by prior treatment.",
        "when_to_prefer": "A one-time exposure/response classification at a fixed point, not a sustained-strategy estimand."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Define an eligibility-based index (diagnosis/hospitalization), not a procedure/response anchor, to avoid re-importing immortal time. Require continuous FFS A/B (and D for drug exposure) enrollment from index through the landmark and exclude Medicare Advantage-only person-time, where missing fee-for-service claims make a subject falsely appear event-free at the landmark. Classify exposure from fills/procedures in [index, landmark] using fill_date and days_supply; pair the cause-specific Cox with a Fine-Gray/CIF check because competing-risk death often differs by exposure in elderly cohorts.",
      "ehr": "Response/biomarker capture clusters at visits; snap the landmark to the visit grid or use documented LOCF rather than assuming daily resolution. Treat loss-to-system before the landmark as informative censoring of the landmark denominator. Link to dispensing to confirm exposure actually started within [index, landmark].",
      "registry": "Event and treatment timing are high quality but outcome adjudication lags; choose a landmark after the adjudication cutoff so classification is not based on an undercounted event set. Link to claims for full pharmacy exposure within the classification window.",
      "linked": "Ideal substrate (severity + claims completeness + death index for the competing-risk/at-risk sets), but linkage selects the linkable subset and creates order/fill/service date discrepancies that must be reconciled before index and landmark assignment."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\nfrom lifelines import CoxPHFitter\n\ndef landmark_cox(df: pd.DataFrame, landmark_days: int) -> CoxPHFitter:\n    L = df[\"index_date\"] + pd.Timedelta(days=landmark_days)\n\n    # (1) Keep only subjects EVENT-FREE and OBSERVABLE at the landmark.\n    #     Anyone with the outcome/death/disenrollment on or before L is excluded by design.\n    at_risk = (df[\"event_date\"] > L) & (df[\"obs_end\"] >= L)\n    lm = df.loc[at_risk].copy()\n    lm[\"L\"] = L[at_risk]\n\n    # (2) Freeze exposure using ONLY information available by the landmark.\n    lm[\"early_exposed\"] = (lm[\"expose_date\"].notna() & (lm[\"expose_date\"] <= lm[\"L\"])).astype(int)\n\n    # (3) Reset the time origin to the landmark; follow-up starts at 0 on day L.\n    lm[\"fu_time\"] = (lm[\"event_date\"].clip(upper=lm[\"obs_end\"]) - lm[\"L\"]).dt.days\n    lm[\"cs_event\"] = (lm[\"event\"] == 1).astype(int)   # cause-specific: competing death treated as censored\n    lm = lm[lm[\"fu_time\"] > 0]\n\n    # (4) Cause-specific Cox on the landmark-conditional cohort.\n    model = CoxPHFitter()\n    model.fit(lm[[\"fu_time\", \"cs_event\", \"early_exposed\"]],\n              duration_col=\"fu_time\", event_col=\"cs_event\")\n    return model\n\n# Pre-specified landmark sensitivity grid (report all; never pick the landmark from the results).\nfor L_days in (30, 60, 90, 180):\n    m = landmark_cox(df, L_days)\n    hr = m.hazard_ratios_[\"early_exposed\"]\n    print(f\"landmark={L_days}d  n={m.event_observed.shape[0]}  HR={hr:.2f}\")",
        "description": "Landmark analysis from a claims/EHR analytic table, with a multi-landmark sensitivity loop. Required input\n(one row per subject, already cleaned):\n  df : person_id, index_date (datetime), expose_date (datetime of first exposure, NaT if never exposed;\n       the function compares it to each landmark),\n       event_date (datetime of the outcome, competing death, or censoring; must not be NaT, because rows\n       with a missing event_date fail the at-risk comparison and are dropped),\n       event (1=outcome, 2=competing death, 0=censored at last FFS-observable date),\n       obs_end (datetime = min(disenroll, death, data_end))\nThe function performs the four landmark steps: subset to landmark survivors observable at L, classify exposure\nusing only [index, L], reset the time origin to L, and fit a cause-specific Cox on the landmark cohort.",
        "dependencies": [
          "pandas",
          "lifelines"
        ],
        "source_citations": [
          "dafni-2011"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(survival)\nlibrary(cmprsk)\n\nlandmark_fit <- function(df, landmark_days) {\n  L <- df$index_date + landmark_days\n\n  # (1) Subset to subjects event-free and observable at the landmark.\n  keep <- df$event_date > L & df$obs_end >= L\n  lm <- df[keep, ]\n  Lk <- L[keep]\n\n  # (2) Exposure frozen using only [index, L].\n  lm$early_exposed <- as.integer(!is.na(lm$expose_date) & lm$expose_date <= Lk)\n\n  # (3) Reset the clock to the landmark.\n  lm$fu_time  <- as.numeric(pmin(lm$event_date, lm$obs_end) - Lk)\n  lm$cs_event <- as.integer(lm$event == 1L)            # cause-specific endpoint\n  lm <- lm[lm$fu_time > 0, ]\n\n  # (4a) Cause-specific Cox (etiologic HR) on the landmark cohort.\n  cs <- coxph(Surv(fu_time, cs_event) ~ early_exposed, data = lm)\n\n  # (4b) Fine-Gray subdistribution model (absolute-risk view; competing death = code 2).\n  fg <- crr(ftime = lm$fu_time, fstatus = lm$event,\n            cov1 = lm[, \"early_exposed\", drop = FALSE],\n            failcode = 1L, cencode = 0L)\n  list(n = nrow(lm), cox_hr = exp(coef(cs)), fg_shr = exp(fg$coef))\n}\n\n# Pre-specified landmark grid; report every landmark, do not data-mine L.\nfor (Ld in c(30, 60, 90, 180)) {\n  r <- landmark_fit(df, Ld)\n  cat(sprintf(\"landmark=%dd  n=%d  csHR=%.2f  fgSHR=%.2f\\n\",\n              Ld, r$n, r$cox_hr, r$fg_shr))\n}",
        "description": "Landmark analysis with the survival package: conditional Cox plus a cmprsk Fine-Gray check, over a landmark grid.\nRequired input (one row per subject):\n  df : person_id, index_date (Date), expose_date (Date of first exposure, NA if never exposed),\n       event_date (Date of the outcome, competing death, or censoring; must be non-missing, because an NA\n       here makes the at-risk index NA and yields all-NA rows),\n       event (1=outcome, 2=competing death, 0=censored), obs_end (Date)",
        "dependencies": [
          "survival",
          "cmprsk"
        ],
        "source_citations": [
          "dafni-2011",
          "giobbie-hurder-2013"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "%let lm = 90;   /* landmark in days; loop over 30 60 90 180 as the sensitivity grid */\n\n/* (1)-(3) Build the landmark-conditional analytic set: keep landmark survivors,\n   freeze exposure using only [index_date, index_date+&lm], reset the time origin to the landmark. */\ndata lmset;\n  set work.cohort;\n  landmark = index_date + &lm;\n  /* (1) event-free AND observable at the landmark */\n  if event_date > landmark and obs_end >= landmark then do;\n    /* (2) exposure status known by the landmark only */\n    early_exposed = (expose_date ne . and expose_date <= landmark);\n    /* (3) time origin reset to the landmark */\n    fu_time = min(event_date, obs_end) - landmark;\n    if fu_time > 0 then output;\n  end;\nrun;\n\n/* Kaplan-Meier on the landmark cohort (cause-specific endpoint = code 1). */\nproc lifetest data=lmset plots=survival(atrisk);\n  time fu_time*event(0 2);          /* censor competing death (2) and administrative (0) */\n  strata early_exposed;\nrun;\n\n/* Cause-specific Cox (etiologic HR): competing death censored. */\nproc phreg data=lmset;\n  class early_exposed (ref='0');\n  model fu_time*event(0 2) = early_exposed / ties=efron;\n  hazardratio early_exposed / diff=ref;\nrun;\n\n/* Fine-Gray subdistribution hazard (absolute risk): eventcode names the event of interest. */\nproc phreg data=lmset;\n  class early_exposed (ref='0');\n  model fu_time*event(0) = early_exposed / eventcode=1;   /* 2 retained as competing risk */\n  hazardratio early_exposed / diff=ref;\nrun;",
        "description": "Landmark analysis in SAS: build the landmark-conditional dataset, then KM (PROC LIFETEST), cause-specific Cox\n(PROC PHREG), and Fine-Gray (PROC PHREG eventcode=). Required input:\n  work.cohort : person_id, index_date, expose_date (. if never exposed),\n                event_date (date of the outcome, competing death, or censoring; must be non-missing, because\n                a missing value fails the event_date > landmark test and the row is dropped),\n                event (1=outcome, 2=competing death, 0=censored), obs_end (all SAS dates)\nSet &lm to each landmark in the pre-specified grid (30/60/90/180) and rerun; report all.",
        "dependencies": [],
        "source_citations": [
          "dafni-2011",
          "giobbie-hurder-2013"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": "landmark-analysis-timeline.svg",
        "mermaid": null,
        "caption": "Timeline for three patients with a day-90 landmark. The shaded band (January 1 – March 31) is the classification window: scan results recorded here determine each patient's group. The vertical marker at March 31 is the landmark itself — follow-up for the two surviving patients begins here. Patient 1003, who died on March 10 (before the landmark), is excluded; their pre-landmark time is never counted as post-landmark risk, removing the guarantee-time trap.",
        "alt_text": "Horizontal timeline from January 1 to November 15, 2024. A shaded classification band runs from January 1 to March 31. A vertical line marks the day-90 landmark on March 31. Patient 1001 has a response event on February 15 and a recurrence on September 15, with a follow-up bar from the landmark to that recurrence labeled 168 days. Patient 1002 has no response event and a recurrence on November 15, with a follow-up bar from the landmark labeled 229 days. Patient 1003 has a death marker on March 10, before the landmark, with a label indicating exclusion.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Idx[Index event = time zero<br/>e.g. AMI discharge / diagnosis] --> Win[Classification window index..L<br/>observe exposure or response]\n  Win --> Chk{Event-free AND<br/>observable at landmark L?}\n  Chk -- No: outcome/death/disenroll before L --> Drop[Excluded from landmark cohort<br/>by design - removes immortal time]\n  Chk -- Yes --> Cls[Freeze exposure from index..L only<br/>early-exposed vs not]\n  Cls --> Reset[Reset clock: follow_up = event_date - L<br/>new time origin = landmark]\n  Reset --> Fit[Cause-specific Cox / KM / Fine-Gray<br/>on landmark-conditional cohort]\n  Fit --> Sens[Sensitivity: repeat at L = 30/60/90/180<br/>report every landmark]",
        "caption": "Landmark analysis mechanics. Exposure is classified only within the pre-landmark window, subjects failing before the landmark are excluded by design (removing guarantee-time bias), the time origin is reset to the landmark, and the estimand is conditional on landmark survival.",
        "alt_text": "Flowchart from index event through the classification window, the event-free-at-landmark decision that drops pre-landmark failures, exposure freezing, time-origin reset, model fitting, and a landmark sensitivity grid.",
        "source_type": "illustrative",
        "source_citations": [
          "dafni-2011"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "gantt\n  title One subject - why landmark removes immortal time (landmark L = 90d)\n  dateFormat YYYY-MM-DD\n  axisFormat %b\n  section Naive (biased)\n  Immortal time credited to exposed (day 0 to first fill) :crit, imm, 2024-01-01, 75d\n  section Landmark\n  Classification window index..L (observe exposure) :done, win, 2024-01-01, 90d\n  Landmark = new time origin :milestone, lm, 2024-03-31, 0d\n  Outcome follow-up starts at L :active, fu, 2024-03-31, 275d",
        "caption": "For a single subject, the naive analysis credits the pre-exposure interval (index to first fill) as exposed person-time (immortal time). Landmark instead uses index..L only to classify and starts outcome follow-up at L, so no immortal person-time is attributed to either arm.",
        "alt_text": "Gantt comparison showing immortal time wrongly credited in the naive analysis versus the landmark approach that classifies within the window and starts follow-up at the landmark.",
        "source_type": "illustrative",
        "source_citations": [
          "anderson-1983"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "see_also",
        "target_slug": "immortal-time-bias-handling",
        "notes": "Landmark analysis is one of the three primary fixes for immortal/guarantee-time bias from post-baseline exposure classification, alongside time-dependent modeling and clone-censor-weight."
      },
      {
        "relation_type": "alternative_to",
        "target_slug": "standard-cox-time-dependent",
        "notes": "The time-dependent (extended) Cox model retains all person-time and models exposure as time-varying; landmark instead conditions on landmark survival and resets the clock. They target different estimands and are often reported together."
      },
      {
        "relation_type": "used_with",
        "target_slug": "cox-ph-regression",
        "notes": "The outcome model fit on the landmark-conditional cohort is typically a Cox proportional-hazards regression with the time origin reset to the landmark."
      },
      {
        "relation_type": "used_with",
        "target_slug": "competing-risks-cause-specific-fine-gray-rwe",
        "notes": "On the landmark cohort, choose cause-specific hazards for etiology or Fine-Gray/CIF for absolute risk; differential competing-risk death by exposure depletes the at-risk set non-randomly and must be checked."
      },
      {
        "relation_type": "alternative_to",
        "target_slug": "marginal-structural-models-g-methods",
        "notes": "G-methods/clone-censor-weight handle sustained, dynamic strategies and treatment-confounder feedback; landmark is the simpler choice for a one-time fixed-point classification."
      },
      {
        "relation_type": "used_with",
        "target_slug": "target-trial-emulation",
        "notes": "Landmark can implement a fixed assessment-period structure within a target-trial emulation, though clone-censor-weight is preferred when the protocol specifies a grace period or an explicit dynamic per-protocol strategy."
      },
      {
        "relation_type": "see_also",
        "target_slug": "new-user-design",
        "notes": "A new-user design with time zero at initiation aligns eligibility and follow-up from the start and can remove the need for a landmark when exposure is defined at baseline rather than after follow-up begins."
      }
    ],
    "aliases": [
      "landmark method",
      "conditional landmark analysis",
      "landmark survival analysis",
      "fixed-time landmark analysis",
      "dynamic landmarking"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "ema"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "likelihood-ratios-diagnostic-rwe",
    "name": "Diagnostic Likelihood Ratios",
    "short_definition": "Prevalence-independent summaries of a test's evidentiary weight — the positive likelihood ratio LR+ = sensitivity/(1 - specificity) and negative likelihood ratio LR- = (1 - sensitivity)/specificity — that multiply pre-test odds into post-test odds (post-test odds = pre-test odds x LR) and can be applied to any baseline probability via the Fagan nomogram.",
    "long_description": "A **diagnostic likelihood ratio (LR)** expresses how much a particular test result changes the odds that a patient has\nthe target condition. It combines sensitivity and specificity into a single number per result level while remaining\n**independent of prevalence**, which is what makes it portable across populations in a way predictive values are not.\nFor a binary test the two key ratios are the **positive likelihood ratio**\n\n    LR+ = sensitivity / (1 - specificity) = P(T+|D+) / P(T+|D-)\n\nand the **negative likelihood ratio**\n\n    LR- = (1 - sensitivity) / specificity = P(T-|D+) / P(T-|D-).\n\nLR+ is the ratio of the probability of a positive result in true cases to that in true non-cases (how many times more\nlikely a positive is among the diseased); LR- is the analogous ratio for a negative result. LR+ > 1 raises the odds of\ndisease, LR- < 1 lowers them, and a likelihood ratio of exactly 1 is uninformative (the result does not change the\nodds).\n\n**Core idea: odds form of Bayes' theorem.** The reason LRs matter operationally is that they convert a *pre-test* belief\ninto a *post-test* belief by simple multiplication on the **odds** scale:\n\n    post-test odds = pre-test odds x LR,    where odds = p / (1 - p).\n\nSo if the pre-test probability of incident HF, before the claims algorithm flag is consulted, is, say, 5% (pre-test odds\n0.0526) and the algorithm's LR+ is 7.7, the post-test odds after a positive flag are 0.0526 x 7.7 = 0.405, i.e. a\npost-test probability of 0.405/1.405 = 0.288 (28.8%). The **Fagan nomogram** is the classical graphical device that does\nexactly this: a straight line drawn from the pre-test probability through the likelihood ratio reads off the post-test\nprobability, sparing the analyst the odds/probability conversions. 
The crucial property is that the *same* LR+ and LR- apply at *any* pre-test probability —\nthe test's evidentiary weight is separated from the population's prevalence, which is precisely the separation that\npredictive values fail to provide.\n\n**Rules of thumb and continuous tests.** As informal anchors (Jaeschke/Deeks-style): LR+ above ~10 or LR- below ~0.1\nproduce large, often conclusive shifts in probability; 5–10 / 0.1–0.2 moderate shifts; 2–5 / 0.2–0.5 small shifts; and\n1–2 / 0.5–1 minimal-to-no shift. For a *continuous* or *ordinal* test the binary LR+ / LR- is wasteful; instead one\ncomputes a **stratum-specific (interval) likelihood ratio** for each result band — the density of that result among\ncases divided by its density among non-cases — which preserves the full information in the score and is the bridge from\nlikelihood ratios to the ROC curve, whose local slope at any point equals the likelihood ratio at that threshold.\n\n**Pros, cons, and trade-offs.**\n- **vs PPV/NPV (`ppv-npv-rwe`):** A likelihood ratio is prevalence-independent and updates *any* pre-test probability,\n  so it transports across cohorts; a predictive value is the post-test probability fixed at one prevalence. **Prefer\n  likelihood ratios** to summarize a test's evidentiary weight portably and to reason about how a flag would behave in a\n  different cohort; **prefer PPV/NPV** to state the concrete probability in the cohort at hand.\n- **vs sensitivity/specificity (`sensitivity-specificity-rwe`):** LRs are derived from the same two numbers but package\n  them into a directly usable odds-multiplier, and stratum-specific LRs extend naturally to multi-level/continuous\n  results. 
Their cost is one extra layer of abstraction (odds, Bayes) that the raw operating-characteristic pair avoids.\n  They are complements: report sensitivity/specificity for transparency and LRs for bedside/decision use.\n- **vs ROC/AUC (`roc-auc-discrimination-rwe`):** The likelihood ratio of a specific result value equals the local\n  (tangent) slope of the ROC curve at that point, and the binary LR+ at a cut-point equals the slope of the line from the\n  origin to that point; the AUC integrates discrimination across all thresholds into one number. **Prefer likelihood\n  ratios** when you need to update an individual's probability at a specific result level; **prefer AUC** for a single\n  threshold-free discrimination summary.\n\n**When to use.** Report LR+ and LR- (each with a 95% CI) when you want a prevalence-independent statement of how much a\ntest result changes the odds of the condition; when you must apply the same test across cohorts of differing prevalence\nand need a portable evidentiary weight; when communicating to clinicians who reason from a pre-test probability to a\npost-test probability (the Fagan-nomogram workflow); and, using stratum-specific LRs, when the test is continuous or\nordinal and collapsing it to a single cut-point would discard information.\n\n**When NOT to use, and when it is actively misleading or dangerous.**\n- **Treating the LR as a probability.** A likelihood ratio is an odds *multiplier*, not a probability; quoting \"LR+ = 7.7\"\n  as if it were the chance of disease, or skipping the pre-test-odds step, badly miscommunicates risk. Always combine it\n  with an explicit pre-test probability.\n- **Forgetting the odds scale.** Multiplying *probabilities* by the LR instead of *odds* is a common and serious error;\n  convert probability to odds, multiply, then convert back. 
The Fagan nomogram exists precisely to avoid this slip.\n- **Binary LRs for a continuous score.** Dichotomizing a continuous biomarker or risk score to compute a single LR+\n  throws away graded information and can give a misleading single shift; use stratum-specific (interval) LRs.\n- **Reusing LRs across a shifted case spectrum.** LRs are prevalence-independent but *not* spectrum-independent: if the\n  severity mix of cases or the comorbidity mix of non-cases differs from the validation population, sensitivity and\n  specificity (and hence the LRs) change. Transport requires checking the spectrum, not just the prevalence.\n- **Undefined LR+ at perfect specificity.** When specificity = 1 in a finite sample, LR+ = sensitivity/0 is undefined;\n  report with a continuity correction or an interval rather than an infinite point estimate.\n\n**Data-source operational depth.**\n- **Claims (FFS):** Because LRs are built from sensitivity and specificity, estimating them honestly requires a validation\n  design that observes both false negatives and false positives — i.e. charting algorithm-negatives as well as positives,\n  not the cheap algorithm-positive-only design that yields PPV. The payoff is portability: an LR+ estimated in one FFS\n  cohort can be applied to a different-prevalence cohort to project how informative a flag is there. 
Restrict the\n  validation frame to FFS-observable person-time (Medicare Advantage spans generate no fee-for-service claims and create\n  spurious algorithm-negatives that distort specificity and thus LR+).\n- **EHR:** Continuous predictors (labs, vitals, risk scores) make stratum-specific LRs natural and far more informative\n  than a single dichotomized LR; encounter-driven capture (leakage) depresses apparent sensitivity and must be handled\n  before the LRs are trusted.\n- **Registry/linked:** Registry adjudication supplies the reference standard needed for both the case and non-case\n  densities; linked data let interval LRs be estimated across the full score distribution and transported to target\n  cohorts of differing prevalence.\n\n**Worked claims example.** From the incident-HF claims algorithm with sensitivity 93.5% and specificity 87.9% (validated\nby stratified chart review on FFS-observable person-time), compute **LR+ = 0.935 / (1 - 0.879) = 0.935 / 0.121 = 7.73**\nand **LR- = (1 - 0.935) / 0.879 = 0.065 / 0.879 = 0.074**. A positive flag is about 7.7 times more likely in a true HF\ncase than in a non-case; a negative result is about 1/0.074 = 13.5 times more likely in a non-case. Now apply the same\nLRs in two cohorts using post-test odds = pre-test odds x LR. In an enriched cohort with pre-test probability 0.20\n(odds 0.25): post-test odds for a flag = 0.25 x 7.73 = 1.93, post-test probability = 1.93/2.93 = **0.659** (66%). In a\nrare-outcome commercial cohort with pre-test probability 0.02 (odds 0.0204): post-test odds = 0.0204 x 7.73 = 0.158,\npost-test probability = 0.158/1.158 = **0.136** (13.6%) — identical to the Bayes-projected PPV at 2% prevalence, as it\nmust be, because the LR carries exactly the prevalence-independent evidentiary content that PPV expresses at a fixed\nprevalence. 
Reporting the single pair (LR+ 7.73, LR- 0.074) therefore lets a reader compute the post-test probability for\n*their* cohort without re-running the validation.\n\n**Interpreting the output**\n\nThe full validation in the worked example yields LR+ = 7.73 and LR− = 0.074, computed from sensitivity\n0.935 and specificity 0.879. Applied to a pre-test probability of 20% (pre-test odds 0.25), the post-test\nprobability for a positive flag rises to ≈ 66%.\n\n*(1) Formal interpretation.* LR+ = sensitivity / (1 − specificity) = 0.935 / 0.121 ≈ 7.73 quantifies\nhow much more likely a positive result is in a true case than in a true non-case. LR− = (1 − sensitivity)\n/ specificity = 0.065 / 0.879 ≈ 0.074 quantifies how much less likely a negative result is in a true\ncase than in a non-case. To update a pre-test probability P to a post-test probability: convert P to\npre-test odds (P / (1 − P)), multiply by the LR, convert back to probability. LRs are invariant to\nprevalence: the same pair transports across populations with different baseline rates, unlike PPV/NPV.\nValues of LR+ ≥ 10 or LR− ≤ 0.1 are conventionally regarded as large; here LR− ≈ 0.074 meets the ≤ 0.1\nthreshold, while LR+ ≈ 7.73 falls in the moderate 5–10 band, so a negative flag is stronger evidence\nagainst the condition than a positive flag is for it.\n\n*(2) Practical interpretation.* Reporting (LR+ 7.73, LR− 0.074) rather than PPV/NPV allows any reader\nto compute the post-test probability for their own study population using their own prevalence; no\nre-running of the validation is needed. In a community cohort with 2% baseline prevalence, LR+ 7.73\nprojects a post-test probability of ≈ 13.6% for a positive flag, still low because the pre-test odds\nare very small. In a high-risk subgroup at 50% prior probability, the same LR+ gives a post-test\nprobability of ≈ 89%. This portability is the primary reason to report LRs alongside or instead of\nfixed PPV/NPV values in phenotyping validation studies.",
    "primary_category": "Outcome_Measure",
    "tags": [
      "likelihood-ratio",
      "diagnostic-accuracy",
      "bayes-theorem",
      "pre-test-odds",
      "post-test-odds",
      "fagan-nomogram",
      "prevalence-independence",
      "interval-likelihood-ratio"
    ],
    "applies_to_study_types": [
      "claims_analysis",
      "ehr_study",
      "registry_linkage",
      "validation_study",
      "diagnostic_accuracy"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1136/bmj.329.7458.168",
        "url": "https://doi.org/10.1136/bmj.329.7458.168",
        "citation_text": "Deeks JJ, Altman DG. Diagnostic tests 4: likelihood ratios. BMJ. 2004;329(7458):168-169.",
        "year": 2004,
        "authors_short": "Deeks & Altman",
        "notes": "The canonical short definition of positive and negative likelihood ratios, their computation from sensitivity and specificity, the odds form of Bayes' theorem (post-test odds = pre-test odds x LR), and the Fagan nomogram."
      },
      {
        "role": "explain",
        "doi": "10.1136/bmj.h5527",
        "url": "https://doi.org/10.1136/bmj.h5527",
        "citation_text": "Bossuyt PM, Reitsma JB, Bruns DE, Gatsonis CA, Glasziou PP, Irwig L, et al. STARD 2015: an updated list of essential items for reporting diagnostic accuracy studies. BMJ. 2015;351:h5527.",
        "year": 2015,
        "authors_short": "Bossuyt et al.",
        "notes": "Reporting standard for diagnostic accuracy studies; covers how likelihood ratios should be reported (with CIs, against a reference standard) and the verification/spectrum issues that affect their transportability."
      },
      {
        "role": "explain",
        "doi": "10.1016/j.jclinepi.2005.02.022",
        "url": "https://doi.org/10.1016/j.jclinepi.2005.02.022",
        "citation_text": "Reitsma JB, Glas AS, Rutjes AWS, Scholten RJPM, Bossuyt PM, Zwinderman AH. Bivariate analysis of sensitivity and specificity produces informative summary measures in diagnostic reviews. Journal of Clinical Epidemiology. 2005;58(10):982-990.",
        "year": 2005,
        "authors_short": "Reitsma et al.",
        "notes": "Models sensitivity and specificity jointly; the resulting summary operating point yields summary likelihood ratios, and clarifies why LRs (unlike predictive values) are the natural transportable quantities to pool and report."
      },
      {
        "role": "demonstrate",
        "doi": "10.1093/oxfordjournals.aje.a112510",
        "url": "https://doi.org/10.1093/oxfordjournals.aje.a112510",
        "citation_text": "Rogan WJ, Gladen B. Estimating prevalence from the results of a screening test. American Journal of Epidemiology. 1978;107(1):71-76.",
        "year": 1978,
        "authors_short": "Rogan & Gladen",
        "notes": "Provides the Bayes algebra linking test-result frequencies, sensitivity, and specificity that underlies the odds-update interpretation of likelihood ratios."
      },
      {
        "role": "use",
        "doi": "10.7326/M14-0698",
        "url": "https://doi.org/10.7326/M14-0698",
        "citation_text": "Moons KGM, Altman DG, Reitsma JB, Ioannidis JPA, Macaskill P, Steyerberg EW, et al. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): explanation and elaboration. Annals of Internal Medicine. 2015;162(1):W1-W73.",
        "year": 2015,
        "authors_short": "Moons et al.",
        "notes": "Reporting guidance for diagnostic/prognostic models, including likelihood ratios at chosen thresholds and the use of the pre-test-to-post-test probability framework when communicating model results."
      }
    ],
    "plain_language_summary": "A diagnostic likelihood ratio tells you how much a single test result should change your belief that a patient actually has the condition. The positive likelihood ratio (LR+) says how many times more often a positive result shows up in people who truly have the disease than in people who don't, and the negative likelihood ratio (LR-) does the same for a negative result. Its big selling point is that it does not depend on how common the disease is in your population, so the same LR+ and LR- can be carried from one cohort to another. You build both numbers entirely from a test's sensitivity (how often it catches true cases) and specificity (how often it correctly clears true non-cases).",
    "key_terms": [
      {
        "term": "sensitivity",
        "definition": "Among people who truly have the disease, the fraction the test correctly flags as positive."
      },
      {
        "term": "specificity",
        "definition": "Among people who truly do not have the disease, the fraction the test correctly clears as negative."
      },
      {
        "term": "reference standard",
        "definition": "The trusted source of truth (e.g., adjudicated chart review) used to decide who really has the disease, so the test can be graded against it."
      },
      {
        "term": "likelihood ratio",
        "definition": "A multiplier that says how many times more likely a given test result is in people with the disease than in people without it; 1 means the result tells you nothing."
      },
      {
        "term": "pre-test vs post-test probability",
        "definition": "Your belief that the patient has the disease before seeing the test result versus after seeing it; the likelihood ratio is the tool that moves you from one to the other."
      }
    ],
    "worked_example": {
      "scenario": "A team validated a claims-based algorithm that flags incident heart failure (HF). They pulled 300 patients and charted each one against a trusted reference standard: 200 truly had HF and 100 truly did not. We want to compute the algorithm's sensitivity and specificity from this 2x2 table, then turn those into the positive and negative likelihood ratios.",
      "dataset": {
        "caption": "The 2x2 validation table an analyst would actually build: each patient's algorithm result (rows) cross-tabulated against the charted truth (columns). TP/FP/FN/TN are the four cell counts.",
        "columns": [
          "algorithm_result",
          "truth_HF_positive",
          "truth_HF_negative"
        ],
        "rows": [
          [
            "algorithm positive",
            180,
            20
          ],
          [
            "algorithm negative",
            20,
            80
          ]
        ]
      },
      "steps": [
        "Read the four cells: among the 200 true cases, 180 were flagged (TP) and 20 were missed (FN); among the 100 true non-cases, 20 were wrongly flagged (FP) and 80 were correctly cleared (TN).",
        "Sensitivity = TP / (TP + FN) = 180 / (180 + 20) = 180 / 200 = 0.90.",
        "Specificity = TN / (TN + FP) = 80 / (80 + 20) = 80 / 100 = 0.80.",
        "Positive likelihood ratio uses the formula LR+ = sensitivity / (1 - specificity) = 0.90 / (1 - 0.80) = 0.90 / 0.20.",
        "Negative likelihood ratio uses the formula LR- = (1 - sensitivity) / specificity = (1 - 0.90) / 0.80 = 0.10 / 0.80."
      ],
      "result": "LR+ = 0.90 / 0.20 = 4.5, so a positive flag is 4.5 times more likely in a true HF case than in a non-case. LR- = 0.10 / 0.80 = 0.125, so a negative flag is far more common in non-cases (1 / 0.125 = 8 times more likely in a non-case than in a case). The same pair (LR+ 4.5, LR- 0.125) can be applied to any cohort's pre-test probability without re-running the validation."
    },
    "prerequisites": [
      "sensitivity-specificity-rwe",
      "ppv-npv-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Binary positive/negative likelihood ratios",
        "description": "For a dichotomous test, LR+ = sensitivity/(1 - specificity) and LR- = (1 - sensitivity)/specificity, applied to pre-test odds to obtain post-test odds for a positive or negative result respectively.",
        "edge_cases": [
          "LR+ is undefined when specificity = 1 in a finite sample (division by zero); use a continuity correction or report an interval.",
          "A likelihood ratio of 1 is uninformative; values are interpreted on the odds (not probability) scale."
        ],
        "data_source_notes": "claims: requires charting both algorithm-positives and algorithm-negatives so sensitivity and specificity are both estimable; restrict to FFS-observable person-time."
      },
      {
        "name": "Stratum-specific (interval) likelihood ratios",
        "description": "For an ordinal or continuous test, a separate likelihood ratio is computed for each result band as the density of that result among cases divided by its density among non-cases, preserving the graded information a single cut-point discards.",
        "edge_cases": [
          "Sparse strata give unstable interval LRs; pool adjacent bands or smooth the densities.",
          "The interval LR equals the slope of the ROC segment spanning that band, linking LRs to discrimination."
        ],
        "data_source_notes": "ehr: natural for continuous labs/vitals/risk scores; estimate case and non-case densities against an adjudicated reference standard."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "ppv-npv-rwe",
        "pros_of_this": "Prevalence-independent, so a single LR+/LR- pair updates any pre-test probability and transports across cohorts of differing case frequency.",
        "cons_of_this": "Requires the analyst to reason on the odds scale and supply a pre-test probability; it is not itself a probability.",
        "when_to_prefer": "To summarize a test's portable evidentiary weight and to project how a flag behaves in a new cohort; switch to PPV/NPV to state the concrete probability at a fixed prevalence."
      },
      {
        "compared_to": "sensitivity-specificity-rwe",
        "pros_of_this": "Packages sensitivity and specificity into a single directly usable odds-multiplier per result level and extends naturally to multi-level/continuous tests via interval LRs.",
        "cons_of_this": "Adds an abstraction layer (odds, Bayes) that the raw operating-characteristic pair avoids and can be miscomputed by multiplying probabilities instead of odds.",
        "when_to_prefer": "For bedside/decision use and for continuous tests; report sensitivity/specificity alongside for transparency."
      },
      {
        "compared_to": "roc-auc-discrimination-rwe",
        "pros_of_this": "Gives the local evidentiary weight at a specific result level, directly answering how much one observed result shifts an individual's probability.",
        "cons_of_this": "Threshold/level-specific; does not provide a single global discrimination summary across all thresholds the way the AUC does.",
        "when_to_prefer": "When updating an individual's probability at a given result; use AUC for an overall threshold-free discrimination summary (the interval LR for a result band is the slope of the corresponding ROC segment)."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Estimate LR+ and LR- from a validation that charts both algorithm-positive and algorithm-negative records so sensitivity and specificity are identifiable; report each with a 95% CI. Restrict the validation frame to person-time observable in fee-for-service (FFS) claims: Medicare Advantage spans hide encounters, creating spurious algorithm-negatives among true cases that depress apparent sensitivity and bias both LR+ and LR-. The payoff is portability across cohorts of differing prevalence.",
      "ehr": "Use stratum-specific interval likelihood ratios for continuous labs/vitals/risk scores rather than dichotomizing; handle encounter-driven leakage (which depresses apparent sensitivity) before trusting the LRs.",
      "registry": "Registry adjudication supplies the reference standard and the case/non-case densities; linked data let interval LRs span the full score distribution and be transported to target cohorts of differing prevalence."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\nfrom scipy.stats import norm\n\ndef likelihood_ratios(tp: int, fp: int, fn: int, tn: int, conf: float = 0.95) -> dict:\n    # Assumes no zero cells: fp = 0 or fn = 0 leaves an LR or its log-scale SE undefined.\n    sens = tp / (tp + fn)\n    spec = tn / (tn + fp)\n    lr_pos = sens / (1 - spec)\n    lr_neg = (1 - sens) / spec\n    z = norm.ppf(1 - (1 - conf) / 2)\n    # Simel (1991) log-scale standard errors for LR+ and LR-.\n    se_log_pos = np.sqrt((1 - sens) / (sens * (tp + fn)) + spec / ((1 - spec) * (tn + fp)))\n    se_log_neg = np.sqrt(sens / ((1 - sens) * (tp + fn)) + (1 - spec) / (spec * (tn + fp)))\n    ci_pos = (lr_pos * np.exp(-z * se_log_pos), lr_pos * np.exp(z * se_log_pos))\n    ci_neg = (lr_neg * np.exp(-z * se_log_neg), lr_neg * np.exp(z * se_log_neg))\n    return {\"lr_pos\": lr_pos, \"lr_pos_ci\": ci_pos, \"lr_neg\": lr_neg, \"lr_neg_ci\": ci_neg}\n\ndef post_test_prob(pretest_prob: float, lr: float) -> float:\n    # post-test odds = pre-test odds x LR; convert back to a probability.\n    pre_odds = pretest_prob / (1 - pretest_prob)\n    post_odds = pre_odds * lr\n    return post_odds / (1 + post_odds)\n\n# Worked claims example: HF algorithm, then apply LR+ in two cohorts.\n# Exact cell counts give LR+ ~7.70 and LR- ~0.073; the 7.73 / 0.074 quoted in the text come from\n# rounding sensitivity and specificity to three decimals before dividing.\nlr = likelihood_ratios(tp=261, fp=39, fn=18, tn=282)\nprint(f\"LR+ = {lr['lr_pos']:.2f}  (95% CI {lr['lr_pos_ci'][0]:.2f}-{lr['lr_pos_ci'][1]:.2f})\")\nprint(f\"LR- = {lr['lr_neg']:.3f} (95% CI {lr['lr_neg_ci'][0]:.3f}-{lr['lr_neg_ci'][1]:.3f})\")\nprint(f\"Post-test prob, enriched cohort (pre 0.20): {post_test_prob(0.20, lr['lr_pos']):.3f}\")  # ~0.658\nprint(f\"Post-test prob, rare cohort   (pre 0.02): {post_test_prob(0.02, lr['lr_pos']):.3f}\")    # ~0.136",
        "description": "Compute LR+ and LR- (with normal-approximation 95% CIs on the log scale) from a 2x2 validation table, and\napply them to a pre-test probability via the odds form of Bayes' theorem to obtain the post-test\nprobability (Deeks & Altman 2004). Inputs: cell counts TP, FP, FN, TN, then a pre-test probability.\nWorked HF-algorithm example with the same flag updated in two cohorts of differing prevalence.",
        "dependencies": [
          "numpy",
          "scipy"
        ],
        "source_citations": [
          "deeks-altman-2004"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(epiR)\n\n# rows = algorithm (+/-), columns = truth (D+/D-).\ntab <- as.table(matrix(c(261, 39, 18, 282), nrow = 2, byrow = TRUE,\n                       dimnames = list(Algorithm = c(\"pos\", \"neg\"),\n                                       Truth     = c(\"Dpos\", \"Dneg\"))))\nres <- epi.tests(tab, conf.level = 0.95)\nres$detail[res$detail$statistic %in% c(\"lr.pos\", \"lr.neg\"), ]   # LR+ and LR- with 95% CIs\n\n# Apply LR+ to a pre-test probability: post-test odds = pre-test odds x LR.\npost_test <- function(pre, lr) {\n  pre_odds <- pre / (1 - pre)\n  post_odds <- pre_odds * lr\n  post_odds / (1 + post_odds)\n}\nlr_pos <- res$detail$est[res$detail$statistic == \"lr.pos\"]   # ~7.70 from the exact counts\npost_test(0.20, lr_pos)   # enriched cohort ~ 0.658\npost_test(0.02, lr_pos)   # rare cohort     ~ 0.136",
        "description": "Compute likelihood ratios from a 2x2 validation table with epiR::epi.tests() (returns LR+ and LR- with\n95% CIs alongside sensitivity, specificity, and predictive values), then apply LR+ to a pre-test\nprobability via the odds form of Bayes' theorem to get the post-test probability (Deeks & Altman 2004).",
        "dependencies": [
          "epiR"
        ],
        "source_citations": [
          "deeks-altman-2004"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* One row carrying the adjudicated 2x2 cell counts from the validation study. */\ndata lr;\n  tp = 261; fp = 39; fn = 18; tn = 282;\n\n  sens = tp / (tp + fn);\n  spec = tn / (tn + fp);\n  lr_pos = sens / (1 - spec);          /* ~7.70 from exact counts (7.73 if Se/Sp are rounded first) */\n  lr_neg = (1 - sens) / spec;          /* ~0.073 from exact counts (0.074 if Se/Sp are rounded first) */\n\n  /* Apply LR+ to two pre-test probabilities via post-test odds = pre-test odds x LR. */\n  array pre[2] _temporary_ (0.20 0.02);\n  do j = 1 to 2;\n    pre_prob  = pre[j];\n    pre_odds  = pre_prob / (1 - pre_prob);\n    post_odds = pre_odds * lr_pos;\n    post_prob = post_odds / (1 + post_odds);   /* 0.658 enriched; 0.136 rare */\n    output;\n  end;\n  keep sens spec lr_pos lr_neg pre_prob post_prob;\nrun;\n\nproc print data=lr noobs; run;",
        "description": "Compute LR+ and LR- from a 2x2 validation table and update a pre-test probability in SAS. SAS has no\nbuilt-in PROC for likelihood ratios, so a DATA step derives sensitivity and specificity from the cell\ncounts, forms LR+ = Se/(1-Sp) and LR- = (1-Se)/Sp, and applies the odds form of Bayes' theorem to a\npre-test probability (Deeks & Altman 2004).",
        "dependencies": [],
        "source_citations": [
          "deeks-altman-2004"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  Pre[Pre-test probability p] --> Odds[\"Pre-test odds = p / (1 - p)\"]\n  Odds --> Mult[x likelihood ratio]\n  LR[\"LR+ = Se / (1 - Sp)<br/>LR- = (1 - Se) / Sp\"] --> Mult\n  Mult --> Post[Post-test odds]\n  Post --> Prob[\"Post-test probability<br/>= odds / (1 + odds)\"]",
        "caption": "The odds form of Bayes' theorem that defines a likelihood ratio's use. Convert the pre-test probability to odds, multiply by the likelihood ratio for the observed result, then convert the post-test odds back to a probability. The same LR applies at any pre-test probability, which is why it is prevalence-independent.",
        "alt_text": "Flow from pre-test probability to pre-test odds, multiplied by a likelihood ratio computed from sensitivity and specificity, giving post-test odds and then the post-test probability.",
        "source_type": "illustrative",
        "source_citations": [
          "deeks-altman-2004"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Test{Test result type} -->|Binary| Bin[Single LR+ and LR-]\n  Test -->|Ordinal / continuous| Strata[Stratum-specific interval LRs<br/>density in cases / density in non-cases]\n  Bin --> Use[Update pre-test odds for + or - result]\n  Strata --> Use2[Update pre-test odds at the observed band]\n  Strata --> Roc[Interval LR = local slope of ROC curve]",
        "caption": "Choosing the right likelihood-ratio form. Binary tests yield a single LR+/LR- pair; ordinal or continuous tests use stratum-specific interval LRs that preserve graded information, and each interval LR equals the local slope of the ROC curve at that operating point.",
        "alt_text": "Decision tree splitting binary tests (single LR+/LR-) from ordinal or continuous tests (stratum-specific interval likelihood ratios), noting that an interval LR equals the local ROC-curve slope.",
        "source_type": "illustrative",
        "source_citations": [
          "deeks-altman-2004"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "requires",
        "target_slug": "sensitivity-specificity-rwe",
        "notes": "LR+ = sensitivity/(1 - specificity) and LR- = (1 - sensitivity)/specificity are computed directly from sensitivity and specificity; estimating LRs requires both, hence a validation that observes false negatives and false positives."
      },
      {
        "relation_type": "complements",
        "target_slug": "ppv-npv-rwe",
        "notes": "A likelihood ratio is the prevalence-independent evidentiary weight; PPV and 1 - NPV are the post-test probabilities that this weight produces at a specific prevalence. The LR lets PPV/NPV be projected to any cohort."
      },
      {
        "relation_type": "see_also",
        "target_slug": "roc-auc-discrimination-rwe",
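        "notes": "The likelihood ratio for a single test value or result band equals the slope of the ROC curve at that point or across that segment, whereas the binary LR+ at a threshold is the slope of the line from the origin to that ROC point; stratum-specific interval LRs trace the curve, while the AUC summarizes it globally."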
      },
      {
        "relation_type": "used_with",
        "target_slug": "diagnostic-accuracy",
        "notes": "Likelihood ratios are among the headline accuracy measures a diagnostic accuracy study reports, valued for their prevalence-independence and direct use in updating pre-test probability."
      },
      {
        "relation_type": "used_with",
        "target_slug": "algorithm-validation",
        "notes": "Validating a claims/EHR algorithm against adjudicated truth yields the sensitivity and specificity from which LR+ and LR- are derived, giving a portable evidentiary weight for an algorithm flag."
      },
      {
        "relation_type": "see_also",
        "target_slug": "prediction-model-validation-recalibration-rwe",
        "notes": "For multi-level prediction-model outputs, stratum-specific likelihood ratios translate a predicted-risk band into a post-test probability and connect discrimination to decision-relevant odds updating."
      }
    ],
    "aliases": [
      "likelihood ratio",
      "positive likelihood ratio",
      "negative likelihood ratio",
      "LR+",
      "LR-",
      "diagnostic likelihood ratio"
    ],
    "complexity": "foundational",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta",
      "journal"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "linked-data",
    "name": "Linked Multi-Database Study (Record Linkage)",
    "short_definition": "A study design that joins records for the same individuals across two or more separately maintained data sources (e.g., claims, EHR, disease registry, vital records) via deterministic or probabilistic record linkage to assemble a single analysis cohort with broader coverage of exposures, covariates, and outcomes than any source alone.",
    "long_description": "A **linked multi-database study** builds its analytic cohort by matching records for the *same person* across two or\nmore independently collected data sources and then analyzing the combined longitudinal record. Linkage is the data-engineering\nstep that precedes any design or estimation choice: it does not by itself confer causal validity, but it changes which\nvariables are observable (severity from EHR, dispensings from claims, death from vital records, stage/grade from a registry)\nand therefore which questions are answerable and which biases can be addressed. The classic exemplars are SEER-Medicare\n(cancer registry linked to Medicare claims) and the population-wide linkage systems of Western Australia, Scotland, and the\nNordic countries, where a stable personal identifier joins essentially the whole population across health and administrative\nregisters.\n\n**Core conceptual distinction.** Linkage methods sit on a spectrum from **deterministic** to **probabilistic**.\nDeterministic linkage joins records that agree exactly (or after standardization) on a set of identifiers — ideally a single\ntrusted unique ID (a national/beneficiary number), or a rule-based match on combinations of name, sex, date of birth, and\nZIP. It is transparent and reproducible but brittle: any error or change in an identifier (a misspelled surname, a transposed\nbirth date, a remarriage) drops a true pair (a *missed match*, false negative). **Probabilistic (Fellegi–Sunter) linkage**\nscores each candidate record pair on partial agreement across multiple fields, weighting each field by how discriminating it\nis (m-probability: agreement among true matches; u-probability: chance agreement among non-matches) and accepting pairs above\na threshold. It recovers true pairs that deterministic rules miss but admits *false matches* (linking two different people)\nand requires threshold calibration and clerical review of the gray zone. 
The estimand-relevant point is subtle but\ndecisive: **linkage error is a measurement/selection problem, not random noise.** Missed and false matches that depend on\nexposure, outcome, age, or data quality bias effect estimates in a direction that is hard to sign without a clerical-review\ngold standard or a linked/unlinked sensitivity analysis. \"Linked data\" is therefore not a single method but a substrate whose\nquality (match rate, precision, and *differential* error) must be reported and probed exactly like any other source of\nmisclassification.\n\n**Pros, cons, and trade-offs.**\n- **vs a single claims database:** Linking claims to EHR adds clinical depth (labs, vitals, smoking, severity, free-text\n  diagnoses) that claims lack, sharpening confounder control and outcome validation; linking to vital records firms up the\n  most consequential censoring event (death). Cost: you analyze only the *linkable subset*, which is rarely a random sample\n  of either source (linkage selection bias), and you inherit two coding systems, two date conventions, and two sets of\n  missingness. **Prefer linkage** when an unmeasured confounder or an unobservable outcome lives in the other source and the\n  linkable subset is large and demonstrably representative.\n- **vs a single EHR system:** Linking EHR to claims captures care delivered *outside* the health system (out-of-network\n  visits, pharmacy fills, hospitalizations elsewhere), curing EHR's signature defect — incomplete capture when patients seek\n  care across systems. Cost: claims add billing-driven coding noise and lag. **Prefer linkage** whenever leakage out of the\n  EHR is plausible (most real populations).\n- **vs deterministic-only linkage:** Probabilistic linkage raises sensitivity (fewer missed matches) at some cost to\n  precision and reproducibility, and it surfaces an explicit, auditable uncertainty about each pair. 
**Prefer probabilistic**\n  when identifiers are imperfect or no single trusted ID exists; **prefer deterministic** when a high-quality unique ID is\n  available and false matches are the dominant concern (e.g., mortality outcomes).\n- **vs collecting de novo primary data:** Linkage is faster and cheaper at population scale and avoids recall bias. Cost: you\n  are constrained to variables that were already recorded for other purposes, and you cannot fix a missing identifier after\n  the fact. **Prefer linkage** for large comparative safety/effectiveness and HEOR questions; commission primary collection\n  only for variables no existing source holds.\n\n**When to use.** A key confounder, exposure, covariate, or outcome is reliably recorded in source B but absent or poorly\ncaptured in source A (registry stage + claims treatment + vital-records death is the archetype); validation of a claims- or\nEHR-based phenotype against a gold-standard source; capturing care that crosses system boundaries; assembling sufficient\nperson-time for rare exposures or outcomes by pooling. A defensible study reports the match rate, the linkage method and\nidentifiers, an estimate of linkage precision, and at minimum an analysis restricted to high-confidence links versus the full\nlinked set.\n\n**When NOT to use — and when it is actively misleading or dangerous.**\n- **The linkable subset is selected on something related to exposure or outcome.** If linkage success depends on having a\n  valid SSN/insurance ID, on survival to a registry update, or on care concentration, the linked cohort is a biased sample\n  and effect estimates generalize to no real population. 
Linkage selection can masquerade as a treatment effect.\n- **Differential linkage error by exposure or outcome.** If sicker, older, or one-arm patients are systematically harder to\n  link (more name changes, more facility transfers, earlier death before an ID is captured), missed matches are\n  differential and bias the contrast. This is the linked-data analogue of differential misclassification and is *not* fixed\n  by a larger N.\n- **Treating a high match rate as proof of validity.** A 95% match rate says nothing about *precision* — a few percent of\n  false matches that attach the wrong outcome (e.g., another person's death) to an exposed patient can swamp a small true\n  effect. Match rate and precision are different quantities.\n- **No mechanism to estimate linkage error.** Without a clerical-review sample, a known-truth subset, or a sensitivity\n  analysis varying the match threshold, you cannot bound the bias and should not make causal claims from a borderline linkage.\n- **Re-identification / privacy infeasibility.** If governance forbids holding the identifiers needed for a defensible match,\n  forcing a weak link is worse than not linking.\n\n**Data-source operational depth.**\n- **Claims (FFS vs MA):** The strength is complete *paid* utilization across sites of care; the weakness is the absence of\n  clinical detail and the FFS/MA visibility gap. In Medicare, **Medicare Advantage enrollees generate little or no\n  fee-for-service claim history**, so claims-side variables and outcomes are effectively missing for MA person-time — a\n  linked SEER-Medicare analysis that ignores this confuses MA enrollment with absence of events. Restrict to continuous\n  Parts A/B (and D for drugs) FFS enrollment over the relevant window, or carry an explicit MA indicator and censor MA\n  person-time. 
Differential **competing risk of death** by exposure in elderly claims means a registry/vital-records death\n  link is essential to avoid counting a death-curtailed follow-up as event-free.\n- **EHR:** Adds labs, vitals, problem lists, and notes that validate phenotypes and capture severity, but visit-driven\n  capture means a patient who seeks care elsewhere looks event-free. Linking to claims restores out-of-system events; the\n  linkage itself must reconcile the EHR *encounter/order* date against the claims *service/fill* date before assigning index\n  or outcome dates.\n- **Registry:** Best-in-class for adjudicated, clinically rich outcomes (cancer stage, grade, histology) and incidence, but\n  typically lacks complete treatment and pharmacy exposure and updates on a lag. Linking to claims supplies the full\n  treatment trajectory, and linking to a death index firms up survival; registry update cycles can induce immortal-time-like\n  artifacts if the linkage date, not the diagnosis date, is used as time zero.\n- **Linked claims–EHR–registry–vital-records:** The ideal substrate (severity + completeness + adjudicated outcomes +\n  reliable mortality) but it concentrates every failure mode: linkage selection (only the linkable subset), order/fill/service\n  date discrepancies that corrupt time-zero, two diagnosis coding systems to harmonize, and the need to reconcile conflicting\n  values (a death date in vital records that predates a claim for the same person). Resolve identifiers and dates *before*\n  applying any design restriction, and run the analysis on both the high-confidence link set and the full set.\n\n**Worked claims example.** Question: 1-year all-cause mortality after first-line systemic therapy for stage IV colorectal\ncancer, comparing regimen A vs regimen B, using a cancer **registry** linked to **Medicare claims** and the **vital-records**\ndeath index. 
(1) Linkage: the registry carries SSN, last name, sex, date of birth, and ZIP; deterministically link to the Medicare\nenrollment file on the beneficiary ID where the registry-supplied SSN matches, then run a probabilistic pass on the\nresidual unmatched registry records (partial agreement on DOB, sex, ZIP, last name) and clerically review pairs in the gray\nzone; record the per-source match rate and an estimated false-match rate from the review sample. (2) Cohort: keep patients\nwith a registry stage IV colorectal diagnosis and **continuous Medicare Parts A/B FFS enrollment** (no MA-only spans) for\n365 days before and through follow-up, so utilization is observable; index_date = date of first systemic-therapy claim\n(HCPCS J-codes) within 120 days of the registry diagnosis date. Anchoring on the *diagnosis* date, not the registry-update\ndate, prevents an immortal-time artifact. (3) Arm: assign regimen A vs B from the J-codes on the index claim. (4) Outcome:\ndeath within 365 days of index_date, using the **vital-records** death date (preferred over the claims-derived death\nindicator, which lags and misses out-of-hospital deaths); a few false matches here attach another person's death, so report\nthe high-confidence-link result alongside the full-link result. (5) Censoring: disenrollment from FFS (including switch to\nMA), end of data, or 365 days, whichever comes first. (6) Sensitivity: repeat restricting to high-confidence (SSN-exact)\nlinks only, vary the probabilistic threshold, and compare the linked-only cohort's baseline characteristics to the full\nregistry to probe linkage selection.",
    "primary_category": "Study_Design",
    "tags": [
      "data-linkage",
      "record-linkage",
      "probabilistic-linkage",
      "deterministic-linkage",
      "seer-medicare",
      "multi-database",
      "linkage-error",
      "study-design"
    ],
    "applies_to_study_types": [
      "linked_data"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1186/1472-6963-10-346",
        "url": "https://doi.org/10.1186/1472-6963-10-346",
        "citation_text": "Bohensky MA, Jolley D, Sundararajan V, et al. Data linkage: a powerful research tool with potential problems. BMC Health Services Research. 2010;10:346.",
        "year": 2010,
        "authors_short": "Bohensky et al.",
        "notes": "Accessible overview of deterministic vs probabilistic linkage, match-rate/precision trade-offs, and the bias introduced by linkage error and the selected linkable subset."
      },
      {
        "role": "explain",
        "doi": "10.1093/ije/dyv322",
        "url": "https://doi.org/10.1093/ije/dyv322",
        "citation_text": "Sayers A, Ben-Shlomo Y, Blom AW, Steele F. Probabilistic record linkage. International Journal of Epidemiology. 2016;45(3):954-964.",
        "year": 2016,
        "authors_short": "Sayers et al.",
        "notes": "Practical tutorial on Fellegi-Sunter probabilistic linkage — m/u-probabilities, match-weight thresholds, and clerical review of the gray zone."
      },
      {
        "role": "explain",
        "doi": "10.1093/ije/dyz203",
        "url": "https://doi.org/10.1093/ije/dyz203",
        "citation_text": "Doidge JC, Harron KL. Reflections on modern methods: linkage error bias. International Journal of Epidemiology. 2019;48(6):2050-2060.",
        "year": 2019,
        "authors_short": "Doidge & Harron",
        "notes": "Formal treatment of how missed and false matches (especially differential by exposure/outcome) bias estimates, with strategies to quantify and mitigate linkage-error bias."
      },
      {
        "role": "demonstrate",
        "doi": "10.1097/00005650-200208001-00002",
        "url": "https://doi.org/10.1097/00005650-200208001-00002",
        "citation_text": "Warren JL, Klabunde CN, Schrag D, Bach PB, Riley GF. Overview of the SEER-Medicare data: content, research applications, and generalizability to the United States elderly population. Medical Care. 2002;40(8 Suppl):IV-3-18.",
        "year": 2002,
        "authors_short": "Warren et al.",
        "notes": "Canonical description of the SEER cancer registry linked to Medicare claims — the archetypal linked multi-database resource and its representativeness/coverage caveats."
      }
    ],
    "plain_language_summary": "A linked multi-database study joins records for the same person from two or more separate data sources — for example, a cancer registry plus Medicare claims plus a death registry — so researchers can answer questions that no single source could answer alone. Think of it as taking a jigsaw puzzle where one piece holds the diagnosis, another holds the prescriptions, and a third holds the date of death, and fitting them together into one complete picture for each patient. The catch is that not every patient can be matched across all sources, so the final study cohort is only the linkable subset — and if the patients who failed to match differ in important ways, results may not apply to everyone.",
    "key_terms": [
      {
        "term": "deterministic linkage",
        "definition": "Joining records by requiring an exact match on one or more identifiers (such as a Social Security number or a precise combination of date of birth, sex, and ZIP code) — fast and transparent, but any typo or changed identifier causes a true pair to be missed."
      },
      {
        "term": "probabilistic linkage",
        "definition": "Scoring each candidate record pair on how well multiple fields agree, weighting fields by how discriminating they are, and accepting pairs whose total score clears a threshold — recovers matches that exact rules miss but can also link two different people if scores are calibrated poorly."
      },
      {
        "term": "linkage-selection bias",
        "definition": "The distortion that occurs when patients who successfully link across sources differ systematically from patients who do not link — for example, younger or healthier patients may link more easily, making the linked cohort unrepresentative of the original population."
      },
      {
        "term": "match rate",
        "definition": "The fraction of records in one source that find a corresponding record in the other source — a high match rate means more patients are included, but it says nothing about whether the matched pairs are actually the same person."
      },
      {
        "term": "false match",
        "definition": "A linkage error in which two records from different people are incorrectly joined, potentially attaching the wrong outcome (such as another person's death date) to a study patient."
      }
    ],
    "worked_example": {
      "scenario": "Suppose you want to study whether Drug A or Drug B leads to more deaths in the year after a cancer diagnosis. You have three sources: (1) a cancer registry with diagnosis dates, (2) a claims database with drug dispensing records, and (3) a death registry with dates of death. No single source contains all three pieces of information, so you link them. The table below shows five patients and which sources they appear in before and after linkage.",
      "dataset": {
        "caption": "Five patients across three sources before linkage. \"Yes\" marks presence in that source; the last column records whether the registry record linked to claims.",
        "columns": [
          "patient_id",
          "in_cancer_registry",
          "in_claims",
          "in_death_registry",
          "linked_successfully"
        ],
        "rows": [
          [
            "PT-001",
            "Yes",
            "Yes",
            "Yes",
            "Yes"
          ],
          [
            "PT-002",
            "Yes",
            "Yes",
            "No",
            "Yes"
          ],
          [
            "PT-003",
            "Yes",
            "No",
            "No",
            "No — missing from claims (uninsured or out-of-network)"
          ],
          [
            "PT-004",
            "Yes",
            "Yes",
            "Yes",
            "Yes"
          ],
          [
            "PT-005",
            "Yes",
            "Yes",
            "No",
            "Yes"
          ]
        ]
      },
      "steps": [
        "Step 1 — Deterministic pass: try to match each registry patient to a claims record using an exact Social Security number. PT-001, PT-002, PT-004, and PT-005 all have a clean SSN match; PT-003 has no claims record at all (perhaps uninsured) and cannot be linked.",
        "Step 2 — Probabilistic pass (if needed): for any registry patient whose SSN is missing or mistyped, score candidate claims records on partial agreement across date of birth, sex, and ZIP code. A high enough score earns acceptance; a borderline score goes to clerical review; a low score is rejected as a non-match.",
        "Step 3 — Attach the death registry: once registry-claims pairs are confirmed, join the death registry by SSN or name-DOB-sex combination. PT-001 and PT-004 have a death record; PT-002 and PT-005 do not (alive or died outside the registry's coverage).",
        "Step 4 — Recognize the linkable subset: only the four patients who appear in both the registry and claims form the analysis cohort. PT-003 is excluded. If uninsured patients (like PT-003) are more likely to receive Drug A or to die, excluding them biases the Drug A vs Drug B comparison — this is linkage-selection bias.",
        "Step 5 — Sensitivity check: compare the baseline characteristics (age, stage, comorbidities) of the four linked patients against the full five-patient registry. If PT-003 looks very different from the rest, generalizability is limited and the study report must say so."
      ],
      "result": "The linked cohort contains 4 of 5 registry patients (80% match rate). Of these 4, 2 have a linked death record (PT-001 and PT-004), counted here as deaths within 1 year of diagnosis. The match rate looks acceptable, but PT-003's exclusion is a structural caveat: if unlinked patients are concentrated in one treatment arm or differ in mortality risk, the effect estimate is biased, and a larger sample cannot fix that."
    },
    "prerequisites": [
      "claims-analysis",
      "ehr-study",
      "disease-registry"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Deterministic linkage on a trusted unique identifier",
        "description": "Join records on a single high-quality person identifier (national ID, Medicare beneficiary number, encrypted patient token) or an exact match on a standardized combination of name, sex, and date of birth.",
        "edge_cases": [
          "Any identifier error or change (typo, transposed DOB digits, surname change at marriage) drops a true pair as a missed match (false negative).",
          "Recycled or duplicate IDs in one source create false matches; pre-deduplicate each source first."
        ],
        "data_source_notes": "claims: link the registry/EHR to the enrollment/beneficiary file on the plan ID; standardize name casing, strip suffixes, and zero-pad ZIP before any exact rule."
      },
      {
        "name": "Probabilistic (Fellegi-Sunter) linkage with clerical review",
        "description": "Score candidate pairs on partial agreement across multiple fields using estimated m- and u-probabilities, accept pairs above an upper threshold, reject below a lower threshold, and clerically review the gray zone in between.",
        "edge_cases": [
          "Thresholds set too low admit false matches that attach the wrong outcome (e.g., another person's death); thresholds set too high reject true pairs and produce missed matches.",
          "m/u-probabilities estimated on a non-representative training set mis-weight fields; recalibrate per data refresh."
        ],
        "data_source_notes": "Block on a coarse field (e.g., ZIP3 or birth year) to make pairwise comparison tractable; report the estimated false-match rate from the reviewed sample."
      },
      {
        "name": "Linked vs unlinked sensitivity analysis",
        "description": "Repeat the primary analysis on (a) the full linked set and (b) the high-confidence-link subset, and compare baseline characteristics of linked vs unlinked records to detect linkage selection and differential linkage error.",
        "edge_cases": [
          "Materially different estimates across link-confidence strata signal linkage-error bias that a single point estimate hides.",
          "When the linkable subset differs systematically from the source population, results may not generalize to the target population."
        ],
        "data_source_notes": "Carry a per-record link-quality score (match weight or tier) downstream so every analysis can be re-run by confidence stratum."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Single claims database",
        "pros_of_this": "Adds clinical depth (labs, severity, adjudicated outcomes) and reliable death capture from the linked source, sharpening confounder control and outcome validation.",
        "cons_of_this": "Restricts analysis to the linkable subset (linkage selection bias) and inherits two coding systems, two date conventions, and linkage error.",
        "when_to_prefer": "When an unmeasured confounder or an unobservable outcome lives in the other source and the linkable subset is large and demonstrably representative."
      },
      {
        "compared_to": "Single EHR system",
        "pros_of_this": "Captures care delivered outside the health system (out-of-network visits, fills, hospitalizations elsewhere), curing EHR's incomplete-capture defect.",
        "cons_of_this": "Claims add billing-driven coding noise and lag, and date reconciliation between encounter/order and service/fill is required.",
        "when_to_prefer": "Whenever leakage of care out of the EHR is plausible, which is true of most real populations."
      },
      {
        "compared_to": "Deterministic-only linkage",
        "pros_of_this": "Probabilistic linkage raises sensitivity (recovers true pairs deterministic rules miss) and produces an explicit, auditable per-pair uncertainty.",
        "cons_of_this": "Lower precision, more configuration and clerical review, and reduced reproducibility versus an exact-ID rule.",
        "when_to_prefer": "When identifiers are imperfect or no single trusted ID exists; use deterministic when a high-quality unique ID exists and false matches dominate the concern."
      },
      {
        "compared_to": "De novo primary data collection",
        "pros_of_this": "Faster and cheaper at population scale and free of recall bias.",
        "cons_of_this": "Confined to variables already recorded for other purposes; a missing identifier cannot be repaired retrospectively.",
        "when_to_prefer": "Large comparative safety/effectiveness and HEOR questions where existing sources hold the needed variables."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Complete paid utilization across sites of care but no clinical detail; restrict to continuous FFS enrollment (Parts A/B, plus D for drugs) and exclude or censor Medicare Advantage person-time where fee-for-service claims are unavailable. Use a linked vital-records death index rather than the lagging claims-derived death indicator.",
      "ehr": "Adds labs, vitals, notes, and severity for phenotype validation but only captures in-system care; link to claims to recover out-of-system events and reconcile encounter/order dates against claims service/fill dates before assigning index and outcome dates.",
      "registry": "Best for adjudicated, clinically rich outcomes (stage, grade, histology) but weak for full treatment/pharmacy exposure and updated on a lag; link to claims for the treatment trajectory and to a death index for survival. Use the diagnosis date, not the registry-update date, as time zero to avoid immortal-time artifacts.",
      "linked": "Ideal substrate (severity + completeness + adjudicated outcomes + mortality) but concentrates linkage selection, date discrepancies, dual coding systems, and value conflicts; resolve identifiers and dates before any design restriction and analyze both the high-confidence-link set and the full set."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\nimport pandas as pd\n\nACCEPT_HI = 8.0   # accept probabilistic pairs at/above this match weight\nACCEPT_LO = 3.0   # below this, reject; [LO, HI) is the gray zone flagged for clerical review\n\ndef link_cohort(reg: pd.DataFrame, enr: pd.DataFrame) -> pd.DataFrame:\n    # ---- Pass 1: deterministic on trusted unique ID (SSN). Highest precision. ----\n    exact = (reg.dropna(subset=[\"ssn\"])\n                .merge(enr.dropna(subset=[\"ssn\"])[[\"bene_id\", \"ssn\"]], on=\"ssn\", how=\"inner\"))\n    exact = exact[[\"reg_id\", \"bene_id\"]].assign(tier=\"exact_id\", match_weight=np.inf)\n\n    # ---- Pass 2: probabilistic (Fellegi-Sunter) on registry records not matched in pass 1. ----\n    # Field-level agreement weights = log2(m/u); m = P(agree | true match), u = P(agree | non-match by chance).\n    weights = {  # illustrative, agreement-only weights; estimate m/u from a reviewed training sample per refresh\n        \"last_name\": np.log2(0.90 / 0.005),\n        \"dob\":       np.log2(0.95 / 0.0003),\n        \"sex\":       np.log2(0.99 / 0.50),\n        \"zip5\":      np.log2(0.80 / 0.001),\n    }\n    resid = reg[~reg[\"reg_id\"].isin(exact[\"reg_id\"])].copy()\n    # Block on birth-year to keep pairwise comparison tractable on large files.\n    resid[\"byear\"] = resid[\"dob\"].dt.year\n    enr = enr.copy(); enr[\"byear\"] = enr[\"dob\"].dt.year\n    pairs = resid.merge(enr, on=\"byear\", suffixes=(\"_r\", \"_e\"))\n\n    score = np.zeros(len(pairs))\n    score += np.where(pairs[\"last_name_r\"] == pairs[\"last_name_e\"], weights[\"last_name\"], 0.0)\n    score += np.where(pairs[\"dob_r\"]       == pairs[\"dob_e\"],       weights[\"dob\"],       0.0)\n    score += np.where(pairs[\"sex_r\"]       == pairs[\"sex_e\"],       weights[\"sex\"],       0.0)\n    score += np.where(pairs[\"zip5_r\"]      == pairs[\"zip5_e\"],      weights[\"zip5\"],      0.0)\n    pairs[\"match_weight\"] = score\n\n    # Keep the single highest-weight candidate per registry record, then apply thresholds.\n    # drop_duplicates keeps whole rows; GroupBy.first() takes the first non-null value per column\n    # and could mix fields from different candidate pairs. An enrollment record can still be the\n    # best candidate for more than one registry record, so check for repeated bene_id downstream.\n    best = pairs.sort_values(\"match_weight\", ascending=False).drop_duplicates(\"reg_id\")\n    # Gray-zone flag: export the rows where review is True for clerical review; they are not returned as links.\n    best[\"review\"] = (best[\"match_weight\"] >= ACCEPT_LO) & (best[\"match_weight\"] < ACCEPT_HI)\n    prob = (best.loc[best[\"match_weight\"] >= ACCEPT_HI, [\"reg_id\", \"bene_id\", \"match_weight\"]]\n                .assign(tier=\"probabilistic\"))\n\n    links = pd.concat([exact, prob], ignore_index=True)\n    return links[[\"reg_id\", \"bene_id\", \"tier\", \"match_weight\"]]",
        "description": "Deterministic + probabilistic record linkage to build a linked study cohort. Required inputs (already cleaned,\nde-duplicated within source, and identifier fields standardized: upper-cased name, stripped suffixes, zero-padded ZIP):\n  reg : registry records  -> reg_id, ssn (nullable), last_name, dob (datetime), sex, zip5\n  enr : claims enrollment  -> bene_id, ssn (nullable), last_name, dob (datetime), sex, zip5\nReturns one row per accepted link with a match tier ('exact_id' | 'probabilistic') and a match weight, so every\ndownstream analysis can be re-run restricted to high-confidence links. Death/outcome dates are joined AFTER this step.",
        "dependencies": [
          "pandas",
          "numpy"
        ],
        "source_citations": [
          "sayers-2016",
          "doidge-2019"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\n\nACCEPT_HI <- 8.0   # accept probabilistic pairs at/above this match weight\nACCEPT_LO <- 3.0   # [ACCEPT_LO, ACCEPT_HI) = gray zone for clerical review\n\n# Agreement indicator: a missing value on either side counts as disagreement (never NA),\n# so match weights stay non-missing and the descending sort below is well defined.\nagree <- function(a, b) !is.na(a) & !is.na(b) & a == b\n\nlink_cohort <- function(reg, enr) {\n  # Work on copies so the caller's tables are not modified by reference.\n  reg <- copy(as.data.table(reg)); enr <- copy(as.data.table(enr))\n\n  # Pass 1: deterministic exact match on the trusted unique ID (SSN).\n  exact <- merge(reg[!is.na(ssn), .(reg_id, ssn)],\n                 enr[!is.na(ssn), .(bene_id, ssn)], by = \"ssn\")[\n                 , .(reg_id, bene_id, tier = \"exact_id\", match_weight = Inf)]\n\n  # Pass 2: probabilistic (Fellegi-Sunter) on registry records not matched above.\n  w <- c(last_name = log2(0.90 / 0.005), dob = log2(0.95 / 0.0003),\n         sex = log2(0.99 / 0.50),        zip5 = log2(0.80 / 0.001))\n  resid <- reg[!reg_id %in% exact$reg_id]\n  resid[, byear := year(dob)]; enr[, byear := year(dob)]\n\n  # Block on birth-year to bound the number of candidate pairs.\n  pairs <- merge(resid, enr, by = \"byear\", suffixes = c(\"_r\", \"_e\"), allow.cartesian = TRUE)\n  pairs[, match_weight :=\n          agree(last_name_r, last_name_e) * w[[\"last_name\"]] +\n          agree(dob_r,       dob_e)       * w[[\"dob\"]] +\n          agree(sex_r,       sex_e)       * w[[\"sex\"]] +\n          agree(zip5_r,      zip5_e)      * w[[\"zip5\"]]]\n\n  # Best candidate per registry record, then threshold.\n  setorder(pairs, -match_weight)\n  best <- pairs[, .SD[1L], by = reg_id]\n  best[, review := match_weight >= ACCEPT_LO & match_weight < ACCEPT_HI]  # gray zone -> clerical review\n  prob <- best[match_weight >= ACCEPT_HI, .(reg_id, bene_id, tier = \"probabilistic\", match_weight)]\n\n  rbindlist(list(exact, prob), use.names = TRUE)[, .(reg_id, bene_id, tier, match_weight)]\n}",
        "description": "Deterministic + probabilistic record linkage with data.table. Inputs mirror the Python version (standardized\nidentifiers, de-duplicated within source):\n  reg : reg_id, ssn (NA-able), last_name, dob (Date), sex, zip5\n  enr : bene_id, ssn (NA-able), last_name, dob (Date), sex, zip5\nReturns accepted links with a match tier and match weight for confidence-stratified sensitivity analyses.",
        "dependencies": [
          "data.table"
        ],
        "source_citations": [
          "sayers-2016",
          "doidge-2019"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* Pass 1: deterministic exact match on the trusted unique ID (SSN).\n   .I is a SAS special missing value, used here only as a sentinel for exact links; it is not a\n   numeric infinity, so select exact links by tier rather than by comparing match_weight. */\nproc sql;\n  create table exact as\n  select r.reg_id, e.bene_id, 'exact_id' as tier length=13, .I as match_weight\n  from work.reg r\n  inner join work.enr e\n    on r.ssn = e.ssn\n  where r.ssn is not null and e.ssn is not null;\nquit;\n\n/* Pass 2: probabilistic (Fellegi-Sunter). Block on birth-year to bound candidate pairs;\n   score partial agreement with log2(m/u) field weights. Two missing values compare as equal\n   in SAS, so each agreement test also requires a non-missing registry value. */\nproc sql;\n  create table pairs as\n  select r.reg_id, e.bene_id,\n           (case when r.last_name = e.last_name and not missing(r.last_name) then log2(0.90/0.005)  else 0 end)\n         + (case when r.dob       = e.dob       and not missing(r.dob)       then log2(0.95/0.0003) else 0 end)\n         + (case when r.sex       = e.sex       and not missing(r.sex)       then log2(0.99/0.50)   else 0 end)\n         + (case when r.zip5      = e.zip5      and not missing(r.zip5)      then log2(0.80/0.001)  else 0 end) as match_weight\n  from work.reg r\n  inner join work.enr e\n    on year(r.dob) = year(e.dob)\n  where r.reg_id not in (select reg_id from exact)\n    and not missing(r.dob);\nquit;\n\n/* Keep the best candidate per registry record. */\nproc sort data=pairs; by reg_id descending match_weight; run;\ndata best;\n  set pairs; by reg_id;\n  if first.reg_id;                 /* highest-weight candidate per registry record */\n  review = (match_weight >= 3.0 and match_weight < 8.0);  /* gray zone -> clerical review */\nrun;\n\n/* Accept probabilistic pairs at/above the upper threshold; stack with exact-ID links. */\ndata prob;\n  set best;\n  where match_weight >= 8.0;\n  length tier $13;  tier = 'probabilistic';\n  keep reg_id bene_id tier match_weight;\nrun;\n\ndata links;\n  set exact prob;                  /* tier='exact_id' -> high-confidence sensitivity subset */\nrun;",
        "description": "Deterministic + probabilistic record linkage and a linked vs high-confidence sensitivity flag in SAS (PROC SQL +\ndata step). Required input datasets (standardized identifiers, de-duplicated within source):\n  work.reg : reg_id, ssn, last_name, dob, sex, zip5\n  work.enr : bene_id, ssn, last_name, dob, sex, zip5\nProduces work.links with tier ('exact_id'/'probabilistic') and match_weight; restrict to tier='exact_id' for the\nhigh-confidence sensitivity analysis. Agreement weights are illustrative — estimate m/u from a clerically reviewed sample.",
        "dependencies": [],
        "source_citations": [
          "sayers-2016",
          "doidge-2019"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Reg[Registry records<br/>reg_id, SSN, name, DOB, sex, ZIP] --> Std[Standardize + de-duplicate<br/>each source independently]\n  Enr[Claims enrollment<br/>bene_id, SSN, name, DOB, sex, ZIP] --> Std\n  Std --> Det{Trusted unique ID<br/>matches exactly?}\n  Det -->|Yes| Hi[Exact-ID link<br/>tier = high confidence]\n  Det -->|No| Prob[Probabilistic Fellegi-Sunter<br/>score partial agreement, weight by m/u]\n  Prob --> Thr{Match weight vs thresholds}\n  Thr -->|>= upper| Acc[Accept probabilistic link]\n  Thr -->|gray zone| Rev[Clerical review]\n  Thr -->|< lower| Rej[Reject - missed match risk]\n  Hi --> Cohort[Linked cohort<br/>+ join outcomes / death index]\n  Acc --> Cohort\n  Rev -->|confirmed pairs only| Cohort\n  Cohort --> Sens[Sensitivity: high-confidence vs full link set;<br/>linked vs unlinked baseline comparison]",
        "caption": "Linkage pipeline. Deterministic matching on a trusted ID handles high-confidence pairs; probabilistic scoring recovers the rest with explicit thresholds and clerical review; sensitivity analyses probe linkage error and selection.",
        "alt_text": "Flowchart from standardized registry and claims identifiers through deterministic exact-ID matching and probabilistic Fellegi-Sunter scoring with accept/review/reject thresholds into a linked cohort and linkage sensitivity analyses.",
        "source_type": "illustrative",
        "source_citations": [
          "sayers-2016"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  subgraph Truth[Ground truth for each candidate pair]\n    T1[True matched pair]\n    T2[True non-pair]\n  end\n  T1 -->|identifier error / change| Miss[Missed match<br/>false negative]\n  T1 -->|correctly linked| Good[Correct link]\n  T2 -->|chance agreement| False[False match<br/>links two people]\n  Miss --> Bias1[Selection: linkable subset<br/>differs from source]\n  False --> Bias2[Misclassification: wrong<br/>outcome/exposure attached]\n  Bias1 --> Diff{Error related to<br/>exposure or outcome?}\n  False --> Diff\n  Diff -->|Yes, differential| Danger[Biased contrast<br/>not fixed by larger N]\n  Diff -->|No, non-differential| Atten[Usually bias toward null<br/>and added variance]",
        "caption": "How linkage error becomes bias. Missed matches drive linkage-selection bias; false matches drive outcome/exposure misclassification. When either depends on exposure or outcome, the error is differential and biases the effect estimate in a direction larger N cannot cure.",
        "alt_text": "Diagram showing true matched pairs becoming missed matches under identifier error and true non-pairs becoming false matches under chance agreement, then mapping to selection and misclassification bias and to differential versus non-differential consequences.",
        "source_type": "illustrative",
        "source_citations": [
          "doidge-2019"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "used_with",
        "target_slug": "active-comparator-new-user",
        "notes": "Linkage supplies the clinical severity, complete utilization, and reliable death capture that strengthen confounder control and censoring in an active-comparator new-user cohort."
      },
      {
        "relation_type": "affects",
        "target_slug": "immortal-time-bias-handling",
        "notes": "Using a registry-update or linkage date rather than the true diagnosis/initiation date as time zero can manufacture immortal time; reconcile order/fill/service/diagnosis dates before assigning time zero."
      },
      {
        "relation_type": "see_also",
        "target_slug": "target-trial-emulation",
        "notes": "Linked multi-database substrates are commonly assembled to make trial-eligibility criteria, exposures, and outcomes observable when no single source captures all of them."
      }
    ],
    "aliases": [
      "linked data",
      "record linkage",
      "data linkage",
      "probabilistic record linkage",
      "deterministic linkage",
      "multi-database linkage",
      "linked multi-database study"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "llm-assisted-abstraction-rwe",
    "name": "LLM-Assisted Data Abstraction and Evidence Work in RWE",
    "short_definition": "The use of large language models (LLMs) to accelerate or automate evidence-generation tasks in RWE — including chart abstraction, systematic literature review screening and data extraction, protocol and SAP drafting, code generation, and terminology mapping — subject to the governing constraint that LLM outputs are never load-bearing in a study without a deterministic quality check or a human review gate, and that any extraction pipeline must be validated against a chart-review gold standard using the same PPV/sensitivity/kappa metrics applied to any other algorithm.",
    "long_description": "**Where LLMs are entering RWE practice**\n\nLarge language models are being deployed across several distinct evidence-generation tasks.\n*Chart abstraction at scale*: extracting structured fields — EGFR mutation status,\ncancer staging, biomarker values, disease progression dates — from unstructured clinical\nnotes, discharge summaries, and pathology reports, using structured-output prompting that\nforces the model to return a JSON schema rather than free prose. *Systematic literature\nreview (SLR) acceleration*: first-pass title/abstract screening against inclusion and\nexclusion criteria, and structured data extraction from full-text articles (effect sizes,\nsample sizes, covariate lists). *Protocol and SAP drafting*: generating first-draft\nsections from a study concept or a prior RWE protocol template, which a human reviewer then\nedits and approves. *Analysis code generation*: scaffolding SAS, R, or Python code for\nstandard RWE steps (cohort construction, PS estimation, outcome ascertainment) from a\nstructured SAP description. *Terminology mapping*: suggesting ICD-10, NDC, or LOINC codes\nfor a free-text concept, which a clinical informaticist then accepts or overrides. In every\none of these roles, the LLM functions as a powerful draft generator — not as a final\ndecision-maker.\n\n**The governing principle: no load-bearing LLM output**\n\nThe foundational rule in any LLM-augmented RWE workflow is that an LLM's output is never\nthe end of the pipeline. Every structured field extracted from a note, every inclusion\ndecision in an SLR screen, and every code suggestion must pass through at least one\ndeterministic check or human review gate before it enters an analysis. This is not a\ncounsel of excessive caution: it reflects the measurement-error framing that RWE already\napplies to all algorithm-defined variables. 
An LLM extractor is a phenotyping algorithm.\nIt has operating characteristics (sensitivity, specificity, PPV) that must be measured\nagainst a chart-review gold standard, reported, and used to assess or correct for\nmisclassification, exactly as described in the algorithm-validation and\nendpoint-adjudication-chart-review literatures. A model's stated accuracy on a benchmark\nis not its operating characteristic on your notes, your note templates, or your study\npopulation.\n\n**Hallucination modes specific to clinical abstraction**\n\nLLMs fail in characteristic ways that differ from those of rule-based algorithms. The\nmost consequential failure modes in RWE abstraction are: (1) *Plausible-but-absent values*:\nthe model returns a biologically plausible EGFR status or a credible-looking lab value\nthat does not appear anywhere in the source document; the value is coherent enough to\nevade casual review. (2) *Unit errors*: a creatinine value is returned in mg/dL when the\nnote reports µmol/L, or a dose is extracted in mg when the chart uses mcg; these are\nparticularly dangerous when downstream code performs threshold comparisons. (3) *Date\nconfabulation*: progression dates, biopsy dates, or treatment-start dates that are\nshifted by weeks or months relative to the documented event; for example, the model may\nanchor to the most recent date it encountered in context rather than the semantically\ncorrect one. (4) *Negation misses*: returning a value as present when the note states \"EGFR mutation\nnegative\" or \"no evidence of progression\"; negation handling remains a systematic\nweakness even in state-of-the-art models when the negation is embedded in a complex\nclinical sentence. 
These failure modes persist even in strong models, are partially\nmitigated by chain-of-thought prompting or span citation requirements, and are best\ncontrolled by systematic validation rather than model selection alone.\n\n**Design patterns that work**\n\nSeveral engineering patterns reduce LLM error in RWE abstraction. *Structured output\nschemas*: constrain the model to return only the fields defined in a JSON schema with\nallowed values (e.g., EGFR_status: \"positive\" | \"negative\" | \"indeterminate\" |\n\"not_reported\"); open-ended text generation is the highest-error configuration. *Span-citation\nrequirements*: require the model to quote the exact source text that supports each\nextracted field (a \"citation span\"); a field with no supporting quote is flagged for human\nreview rather than accepted. *Dual-model or model-versus-rule adjudication*: run two\nindependent LLM passes (or an LLM pass plus a rule-based extractor) and send disagreements\nto a human adjudicator; this mirrors the two-reviewer-plus-tiebreaker structure in chart\nadjudication. *Confidence-thresholded human review queues*: use the model's log-probability\nor a downstream classifier's confidence score to route low-confidence extractions to a\nhuman reviewer and high-confidence extractions to automated acceptance, then validate\nthat the automated tier meets a pre-specified PPV floor. *Batch-versus-interactive cost\ndesign*: processing 10,000 notes through a discounted batch endpoint can cost materially less\nthan sending the same notes as interactive queries; model the cost before committing to a\ndataset scope.\n\n**Evaluation discipline: the same rigor as any other algorithm**\n\nThe evaluation framework for an LLM abstraction pipeline mirrors the algorithm-validation\nframework exactly. A held-out gold standard is required: a randomly sampled set of source\ndocuments that have been independently annotated by a human expert (clinician or trained\nabstractor) and set aside before any prompting. 
Against this gold standard, compute\nPPV (what fraction of the model's positive calls are correct), sensitivity (what fraction\nof true positives the model found), and Cohen's kappa (inter-rater agreement between model\nand gold annotator). Stratify performance by note type (discharge summary vs. progress note\nvs. pathology report), by note length, and by writing style — models degrade systematically\non non-standard templates or high-complexity notes. *Drift monitoring is mandatory*: a\nmodel update, an API version change, or a shift in the clinical note templates used by the\nstudy sites is a data-pipeline change that requires revalidation on the existing gold\nstandard. Version-pin the model at the time of study conduct and document the exact model\nidentifier in the SAP; failing to pin means the extraction behavior is unknowable at\nthe time of regulatory review.\n\n**Regulatory posture**\n\nFDA's evolving guidance on AI and machine learning in drug development (the Action Plan\nfor AI/ML-based software as a medical device, the 2023 discussion paper on AI in drug\ndevelopment) consistently emphasizes three requirements that apply directly to LLM use\nin RWE: (1) transparency — the model, version, prompt template, and extraction schema\nmust be documented in the SAP appendix; (2) audit trails — every LLM decision must be\nlogged with its inputs, outputs, and any human review outcome so the analysis can be\nreproduced or audited; (3) human-in-the-loop documentation — the protocol must specify\nwhich decisions are automated and which require human review, what the human review\ncriteria are, and who conducted the review. For HTA submissions, the same standard applies\nunder NICE's real-world evidence framework and EMA's guidance on complex clinical trials\ndata. 
The practical implication: an LLM used only for draft generation with full human\nreview of every output sits at the lowest regulatory risk level; an LLM used for\nautomated data abstraction that feeds directly into an analysis sits at the highest and\nrequires a full validation study.\n\n**Privacy and PHI handling**\n\nClinical notes contain protected health information (PHI) under HIPAA and equivalent\nregulations. Before sending notes to any cloud API (OpenAI, Anthropic, Google, etc.), a\nBusiness Associate Agreement (BAA) must be in place with the vendor, and the specific API\nproduct must be BAA-eligible (consumer-tier products are typically not). For studies where\na BAA cannot be obtained — or where the risk profile or data governance policy prohibits\nany external transmission — locally deployed open-weight models (Llama 3, Mistral, clinical\nfine-tunes such as BioMistral) offer an alternative that keeps PHI on-premises, at the\ncost of lower out-of-the-box performance and the operational overhead of model hosting\nand version management. De-identification of notes before external API use is a\nmitigation option but introduces its own error rate and requires validation.\n\n**Pros, cons, and trade-offs**\n\n*Pros*: LLMs can scale structured data extraction from unstructured notes to cohorts of\nthousands or tens of thousands of patients — a task that is prohibitively expensive using\nonly human abstractors. For SLR screening, LLMs can reduce the human screening burden by\n60-80% on first-pass abstract review while maintaining recall close to 100% when a\nconservative confidence threshold is applied. 
Draft generation for protocols, SAPs, and\nreport sections accelerates cycle times and reduces blank-page friction for study teams.\nCode generation for standard RWE pipelines reduces implementation error on boilerplate\ntasks and is immediately testable against expected outputs.\n\n*Cons*: LLM performance on a general benchmark does not predict performance on a specific\nstudy's notes; validation against a study-specific gold standard is required and has real\ncost. Hallucination modes are non-random — they are systematic and correlated with note\ncharacteristics — so a model that looks accurate on aggregate metrics can fail badly on\na specific note subtype. Regulatory audit trails add infrastructure overhead. PHI\ntransmission to cloud APIs creates compliance risk and requires legal review. Model\nupdates break reproducibility unless the model is pinned and the abstraction is\nre-run on updates.\n\n*Trade-offs*: The core trade-off is scale vs. certainty. Human-only abstraction is\naccurate but limits cohort size; LLM-assisted abstraction scales but introduces\nquantifiable and manageable error. 
The correct analytic response to that error is not\nto hide it but to measure it (PPV/sensitivity validation), propagate it (quantitative\nbias analysis), and mitigate it (human review queues for low-confidence extractions).\nThe concepts of endpoint adjudication and algorithm validation — not LLM engineering —\nare the methodological home for managing LLM abstraction quality in RWE.\n\n**When to use**\n\nUse LLM-assisted abstraction when the target structured fields are clearly defined and\ncan be expressed as a constrained output schema; when the study cohort is too large for\nfull manual abstraction; when a validation substudy against a chart-review gold standard\nis feasible and its cost is budgeted; when the note corpus is homogeneous enough that\na single prompt template covers the majority of cases; and when a human review queue\nfor low-confidence or high-stakes extractions is operationally possible. For SLR\nscreening, use LLMs for first-pass abstract triage when the recall requirement is not\nabsolute (or when a dual-reviewer structure provides the safety net). For code generation,\nuse LLMs for standard pipeline scaffolding where the output will be reviewed and tested\nagainst expected results.\n\n**When NOT to use — and when LLM use is actively misleading**\n\nDo not treat LLM-extracted variables as validated without a gold-standard comparison\nstudy: a model that achieves 90% accuracy on a general benchmark can have a PPV of 0.60\non a specific note type in a specific study population, introducing outcome misclassification\nthat biases effect estimates in an unknown direction. Do not use an LLM to extract the\nprimary outcome of a regulatory study without a pre-specified validation plan, human review\ngate for every extracted value, and SAP-documented audit trail: the regulatory position\non automated AI pipelines for primary efficacy/safety endpoints requires human-in-the-loop\nevidence. 
Do not use cloud APIs for PHI-containing notes without a valid BAA: even if the\nrisk of a data breach is low, the compliance failure is not conditional on harm. Do not\nre-use a model prompt or version across time without checking whether the model or note\ntemplates have changed: a model update changes the extraction function itself, and a\ntemplate change shifts the inputs it sees; either can silently change\nextraction behavior. Do not apply LLM code-generation output without testing it against\nknown inputs and outputs: generated code commonly contains correct-looking but wrong\nlogic for edge cases (date arithmetic, missing value handling, left-join vs inner-join\nsemantics) that will not be caught by syntax checks alone.\n\n**Interpreting the output**\n\nIn the worked example below, an LLM pipeline abstracts EGFR mutation status from 200\nnon-small cell lung cancer patient notes and is evaluated against a human-annotated gold\nstandard. The 2x2 table yields: TP = 34, FP = 10, FN = 6, TN = 150. Accuracy = (34 +\n150) / 200 = 184 / 200 = 0.92. Sensitivity = 34 / 40 = 0.85. PPV = 34 / 44 = 17 / 22\n(approximately 0.773).\n\n*(1) Formal interpretation.* The LLM extractor is an algorithm defined by its prompt\ntemplate and model version. Its operating characteristics against the adjudicated gold\nstandard are: sensitivity 0.85 (the model identified 34 of 40 true EGFR-positive patients;\nthe 6 misses were patients whose mutation was documented only in a scanned pathology PDF\nthat was never transcribed as text); PPV 34/44 = 17/22, approximately 0.773 (34 of\nthe 44 patients the model called EGFR-positive are truly positive; 10 of 44 were false\npositives driven primarily by negation misses and plausible-but-absent values). These\nare the algorithm's operating characteristics in this note corpus at this model version;\nthey are not portable to a different model version, a different institution's note\ntemplates, or a different primary cancer indication. 
Nondifferential misclassification\nof a binary variable is expected to bias a risk ratio estimate toward the null; if the notes\nof patients on one treatment arm are systematically longer or more structured, the LLM\nmay perform better in that arm, creating differential misclassification with unpredictable\nbias direction.\n\n*(2) Practical interpretation.* For a study using EGFR status as a subgroup-defining\ncovariate, a PPV of 0.773 means roughly 1 in 4 patients labeled EGFR-positive by the LLM\npipeline is actually EGFR-negative — a level of subgroup contamination that could\nmeaningfully attenuate the subgroup treatment effect estimate. Before proceeding to the\ncomparative analysis, apply a sensitivity analysis that re-runs the primary model\nrestricting to the high-confidence extraction tier (citations present, model confidence\nabove threshold), and run a quantitative bias analysis propagating the validation PPV and\nsensitivity into the effect estimate. Document the validation design in the SAP, report the 2x2\nperformance table with binomial confidence intervals for each metric, and specify the model identifier\nand prompt version in the methods section.",
    "primary_category": "Machine_Learning_and_Predictive",
    "tags": [
      "llm",
      "large-language-models",
      "chart-abstraction",
      "nlp",
      "structured-output",
      "hallucination",
      "validation",
      "audit-trail",
      "phi-handling",
      "systematic-literature-review",
      "code-generation",
      "human-in-the-loop"
    ],
    "applies_to_study_types": [
      "cohort_retrospective",
      "ehr_study",
      "claims_analysis",
      "registry_study",
      "comparative_effectiveness",
      "systematic_literature_review"
    ],
    "data_sources": [
      "ehr",
      "registry",
      "primary",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.18653/v1/2022.emnlp-main.130",
        "url": "https://doi.org/10.18653/v1/2022.emnlp-main.130",
        "citation_text": "Agrawal M, Hegselmann S, Lang H, Kim Y, Sontag D. Large language models are few-shot clinical information extractors. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2022:1998-2022.",
        "year": 2022,
        "authors_short": "Agrawal et al.",
        "notes": "Foundational demonstration that GPT-3 family models can extract structured clinical variables from unstructured notes in a few-shot setting, achieving competitive performance against supervised NLP baselines on benchmark tasks including medication and lab extraction — the enabling result that opened LLM chart abstraction in RWE."
      },
      {
        "role": "explain",
        "doi": "10.1038/s41586-023-06291-2",
        "url": "https://doi.org/10.1038/s41586-023-06291-2",
        "citation_text": "Singhal K, Azizi S, Tu T, et al. Large language models encode clinical knowledge. Nature. 2023;620(7972):172-180.",
        "year": 2023,
        "authors_short": "Singhal et al.",
        "notes": "Introduces Med-PaLM and the MultiMedQA benchmark; demonstrates that LLMs trained on general text can encode substantial clinical knowledge, but also identifies the failure modes (factual inaccuracies, incomplete reasoning) that motivate the human-in-the-loop governance principle for clinical LLM deployment in RWE."
      },
      {
        "role": "demonstrate",
        "doi": "10.1038/s41591-024-02855-5",
        "url": "https://doi.org/10.1038/s41591-024-02855-5",
        "citation_text": "Van Veen D, Van Uden C, Blankemeier L, et al. Adapted large language models can outperform medical experts in clinical text summarization. Nature Medicine. 2024;30(4):1134-1142.",
        "year": 2024,
        "authors_short": "Van Veen et al.",
        "notes": "Shows that fine-tuned LLMs can match or exceed physician performance on clinical text summarization tasks across five note types, while also documenting the systematic failure modes (hallucination, omission) that persist even in high-performing models — directly motivating the validation discipline and span-citation requirement described in this entry."
      }
    ],
    "plain_language_summary": "Large language models (LLMs) like GPT-4 can read clinical notes and pull out structured information — for example, whether a patient's tumor has a certain mutation — far faster than a human abstractor can do the same work for thousands of patients. But LLMs make mistakes in characteristic ways: they sometimes invent plausible-sounding values that do not appear in the note, miss negations (\"EGFR negative\" read as positive), or confuse dates. Because of these errors, every LLM extraction pipeline must be checked against a sample of manually reviewed records to measure how often it is right — the same positive predictive value and sensitivity calculation used to validate any other claims or EHR algorithm — and any fields the LLM is uncertain about must go to a human reviewer before they enter an analysis.",
    "key_terms": [
      {
        "term": "structured-output prompting",
        "definition": "A technique that forces an LLM to return its answer in a fixed format (such as a JSON object with allowed values) rather than free text, which reduces the range of errors the model can make."
      },
      {
        "term": "hallucination",
        "definition": "When an LLM confidently states a fact — such as a lab value or a date — that does not appear in the source document and was not provided to the model; the most dangerous failure mode in clinical abstraction."
      },
      {
        "term": "span citation",
        "definition": "A requirement that an LLM quote the exact phrase from the source document that supports each extracted field; an extracted value with no supporting quote is flagged for human review."
      },
      {
        "term": "version pinning",
        "definition": "Recording and locking the exact identifier of the LLM (model name, version, and API snapshot) used during a study so that the extraction can be audited or reproduced later and a model update does not silently change the results."
      },
      {
        "term": "Business Associate Agreement (BAA)",
        "definition": "A legal contract required under HIPAA before sending patient health information to a cloud service provider; without a BAA, transmitting clinical notes to an LLM API is a compliance violation regardless of whether a data breach occurs."
      },
      {
        "term": "confidence-thresholded review queue",
        "definition": "A workflow where an LLM's extraction is automatically accepted only if its confidence score is above a pre-set threshold, and all lower-confidence outputs are routed to a human reviewer before use in the analysis."
      }
    ],
    "worked_example": {
      "scenario": "A real-world evidence team is studying treatment outcomes in non-small cell lung cancer (NSCLC). They need EGFR mutation status — positive, negative, or indeterminate — for a cohort of 200 patients. Because pathology reports are embedded in free-text notes, a clinical abstractor would normally read each note manually. Instead, the team uses a structured-output LLM prompt to extract EGFR status from each note, then validates the extraction against a human-annotated gold standard for all 200 patients. They compute accuracy, sensitivity, and positive predictive value (PPV) to decide whether the pipeline is acceptable for the subgroup analysis.",
      "dataset": {
        "caption": "Validation results for 200 NSCLC patient notes. Gold standard = human expert annotation of EGFR status; LLM = structured-output extraction result. A sample of the 2x2 contingency structure is shown; the full table underlies the computed metrics.",
        "columns": [
          "patient_id",
          "gold_egfr_positive",
          "llm_egfr_positive",
          "correct"
        ],
        "rows": [
          [
            "P001",
            true,
            true,
            true
          ],
          [
            "P002",
            true,
            true,
            true
          ],
          [
            "P003",
            true,
            false,
            false
          ],
          [
            "P004",
            false,
            false,
            true
          ],
          [
            "P005",
            false,
            true,
            false
          ],
          [
            "P006",
            true,
            true,
            true
          ],
          [
            "P007",
            false,
            false,
            true
          ],
          [
            "P008",
            false,
            false,
            true
          ],
          [
            "P009",
            true,
            true,
            true
          ],
          [
            "P010",
            false,
            true,
            false
          ]
        ]
      },
      "steps": [
        "Compile the full 200-patient 2x2 table from the validation run. Gold standard: 40 patients are truly EGFR-positive (gold-positive), 160 are truly EGFR-negative (gold-negative).",
        "LLM classifications: the model called 44 patients EGFR-positive. Of those 44, 34 were confirmed correct by the gold standard (true positives, TP = 34) and 10 were wrong (false positives, FP = 10). Of the 40 truly positive patients, the LLM missed 6 (false negatives, FN = 6). True negatives TN = 160 - 10 = 150.",
        "Compute accuracy (fraction of all 200 patients classified correctly): accuracy = (34 + 150) / 200 = 184 / 200 = 0.92.",
        "Compute sensitivity (fraction of truly EGFR-positive patients the LLM found): sensitivity = 34 / 40 = 0.85.",
        "Compute PPV (fraction of the LLM's positive calls that are actually positive): PPV = 34 / 44 = 17 / 22 (approximately 0.773). The 10 false positives were almost all negation misses — notes stating \"EGFR mutation not detected\" where the model extracted \"EGFR-positive\" because it anchored to the word \"mutation\" before reading \"not detected.\"",
        "The 6 false negatives (sensitivity gap) came from patients whose mutation was documented only in a scanned pathology PDF referenced in the note but not transcribed as text — the LLM had no text to read, not a comprehension failure. This is an information-availability failure, not a model failure, and signals that the abstraction scope (text-only) needs to be documented clearly.",
        "Decide whether these operating characteristics are acceptable: sensitivity = 0.85 and PPV = 17/22 may be adequate for an exploratory subgroup analysis with a stated bias-analysis caveat, but would require a human review gate for the 44 LLM-positive calls before use in a regulatory or HTA submission."
      ],
      "result": "2x2 summary: TP = 34, FP = 10, FN = 6, TN = 150. Total correct = 34 + 150 = 184. accuracy = 184 / 200 = 0.92. sensitivity = 34 / 40 = 0.85. PPV = 34 / 44 = 17 / 22 (approximately 0.773). The LLM pipeline achieved high accuracy overall, but 1 in 4 of its EGFR-positive calls was a false positive — driven primarily by negation misses. A human review gate on all 44 LLM-positive calls, or a dual-model adjudication step, would recover a near-perfect PPV at modest additional cost before this variable enters a comparative analysis."
    },
    "prerequisites": [
      "ehr-phenotyping-algorithms-rwe",
      "algorithm-validation",
      "predictive-and-causal-ml-models-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Structured-output chart abstraction (primary use case)",
        "description": "The LLM is given a clinical note (or a structured excerpt) and a JSON schema specifying the target fields and allowed values; it returns a completed schema. Span citations are required: each field value must be accompanied by the verbatim source text that supports it. Fields lacking a citation are routed to a human reviewer. A held-out gold-standard validation set is evaluated before the pipeline is run on the full cohort.",
        "edge_cases": [
          "Notes that are primarily scanned images (PDF attachments, faxed consults) are invisible to a text-only LLM; the pipeline must flag these for manual review or integrate an OCR step, and the scope limitation must be documented in the SAP.",
          "Multi-lingual notes (common in international studies and some US immigrant populations) degrade performance sharply unless the model is explicitly tested and validated on those language variants.",
          "Context-window limits mean very long notes (>16K tokens) may be truncated; a chunking strategy with overlap is required, and the validation gold standard should include long-note examples."
        ],
        "data_source_notes": "EHR: primary use case; validate on the specific EHR system's note templates before deployment. Linked claims-EHR: LLM supplements the structured claims fields with note-derived variables that codes cannot capture."
      },
      {
        "name": "SLR screening and data extraction",
        "description": "LLMs screen title-and-abstract pairs against PICO-structured inclusion and exclusion criteria, then extract structured fields (sample size, effect estimate, control type, follow-up duration) from full-text articles. A dual-screen structure — LLM plus one human reviewer, with adjudication of discordant pairs — maintains the recall standard required for systematic reviews while reducing total human effort.",
        "edge_cases": [
          "Recall is the binding constraint for systematic reviews (missing an eligible study is a more serious error than including an ineligible one); validate LLM recall against a sample of known-eligible papers before setting the confidence threshold.",
          "Data extraction from tables and figures requires multi-modal capability or a separate image-to-text step; text-only LLMs cannot extract values from charts or images."
        ],
        "data_source_notes": "Primary literature and grey literature: use structured PICO schema as the extraction template; maintain an audit log of every LLM decision and human override for PRISMA reporting."
      },
      {
        "name": "Protocol and SAP drafting assistance",
        "description": "An LLM generates first-draft sections of a study protocol, SAP, or reporting template from a structured concept document or from a prior analogous study. Human researchers review and revise every section; the LLM output is never submitted as a final document without human authorship review and sign-off.",
        "edge_cases": [
          "LLM-generated statistical language can be superficially fluent but methodologically incorrect (e.g., misspecified estimands, incorrect variance formulas, wrong propensity-score terminology); a biostatistician must review every statistical section, not just edit prose.",
          "For regulatory submissions, the protocol must be authored and signed off by a responsible biostatistician and principal investigator; LLM use in drafting must be disclosed per emerging journal and regulatory policies."
        ],
        "data_source_notes": "Applicable to all data sources; the LLM operates on template text, not patient data, so no PHI is involved and no BAA is required."
      },
      {
        "name": "Analysis code generation",
        "description": "An LLM generates SAS, R, or Python code for standard RWE pipeline steps (cohort construction, variable derivation, PS estimation, outcome ascertainment) from a structured SAP section or a natural-language description. Every generated code block is tested against synthetic data with known expected outputs before being run on study data, and the final code is reviewed by a second programmer (as in double programming / QC practice).",
        "edge_cases": [
          "LLM-generated code handles common patterns well (SELECT-WHERE queries, standard regression calls) but is unreliable on edge cases: off-by-one errors in date arithmetic, handling of missing values in join conditions, and SAS-versus-R difference in default merge behavior on unmatched rows.",
          "Generated code should be treated as a draft requiring QC, not as a verified implementation; never use LLM-generated code as the sole program in a double-programming pair."
        ],
        "data_source_notes": "No patient data is required for code generation; prompts should use synthetic examples or de-identified snippets so no PHI enters the LLM context."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "endpoint-adjudication-chart-review-rwe",
        "pros_of_this": "LLM-assisted abstraction scales to thousands of patients at a cost that full human adjudication cannot match; it enables structured data extraction from unstructured notes that rule-based algorithms cannot parse.",
        "cons_of_this": "Human chart adjudication is the reference standard against which LLM quality is measured; it cannot be replaced by the LLM for high-stakes regulatory endpoints without a human review gate that approximates adjudication for all positive calls.",
        "when_to_prefer": "Use LLM abstraction for large-cohort exploratory analyses and for pre-screening to identify the candidate set for human adjudication; require human adjudication (with LLM assistance for efficiency) for the primary endpoint in regulatory or HTA submissions."
      },
      {
        "compared_to": "ehr-phenotyping-algorithms-rwe",
        "pros_of_this": "LLMs can extract information from free text that no ICD/CPT rule-based phenotype can reach — mutation status, clinical stage, symptom burden, clinician-described disease trajectory — substantially expanding the structured data available for RWE.",
        "cons_of_this": "Rule-based phenotypes have deterministic, auditable logic; LLM prompts can produce different outputs for semantically identical notes depending on context order, and a model update can silently change behavior. Rule-based phenotypes are easier to transport across sites and to defend in regulatory review.",
        "when_to_prefer": "Use LLMs for variables that are intrinsically unstructured (clinician narrative, staging, symptom characterization); use rule-based phenotypes for variables well captured by structured codes (diagnosis, procedure, dispensing)."
      },
      {
        "compared_to": "nlp-clinical-text-extraction-rwe",
        "pros_of_this": "LLMs require far less labeled training data than supervised NLP models; a few-shot prompt can achieve competitive performance on a new extraction task where training examples are scarce.",
        "cons_of_this": "Traditional supervised NLP models have explicit, inspectable decision logic and are more stable across deployments than probabilistic LLM generation; they are less prone to hallucination of values not present in the source text.",
        "when_to_prefer": "Use LLMs for new or rare extraction tasks where labeled training data does not exist; prefer supervised NLP models for high-volume, high-stakes extraction tasks where training data is available and stability across deployments is required."
      }
    ],
    "implementation_notes_by_data_source": {
      "ehr": "Primary deployment context. Validate against the specific EHR system's note templates (Epic, Cerner, and Meditech have distinct documentation cultures); performance measured on MIMIC notes does not transfer directly to a community hospital EHR. Ensure PHI handling complies with the data-use agreement and any cloud API BAA requirements. Document the note type scope (discharge summaries, progress notes, pathology reports, or all encounter notes) in the SAP; performance varies substantially by note type.",
      "registry": "Registry source documents (pathology reports, death certificates, surgical notes linked to the registry record) are candidates for LLM abstraction when the registry collects narrative fields. Validate operating characteristics within the registry's own document corpus; performance on linked registry notes is not predictable from general benchmarks.",
      "primary": "LLM assistance is useful for abstracting patient-reported outcomes from free-text survey responses or structured interview transcripts; validate against expert rater annotations before using for outcome ascertainment.",
      "linked": "Linked claims-EHR datasets benefit most from LLM abstraction because the note content fills gaps in structured claims fields (staging, biomarkers, clinical severity). Validate jointly across the linked data — a note abstraction that performs well in the EHR system may degrade when the linked cohort has different documentation patterns."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import json\nimport re\nimport numpy as np\nimport pandas as pd\nfrom openai import OpenAI\n\nclient = OpenAI()  # reads OPENAI_API_KEY from environment; substitute a BAA-eligible endpoint for PHI\n\n# ── Structured-output schema for EGFR abstraction ──\nSYSTEM_PROMPT = \"\"\"\nYou are a clinical abstractor. Given a clinical note, extract the patient's EGFR mutation\nstatus and return ONLY a JSON object matching this schema:\n{\n  \"egfr_status\": \"<positive|negative|indeterminate|not_reported>\",\n  \"citation_span\": \"<verbatim text from the note that supports the classification, or null>\"\n}\nIf no mention of EGFR is found, return \"not_reported\" with citation_span null.\nIf the note contains a negation (e.g. 'EGFR negative', 'no EGFR mutation detected'),\nreturn \"negative\" and quote the negation phrase in citation_span.\n\"\"\".strip()\n\ndef abstract_egfr(note_text: str, model: str = \"gpt-4o-2024-08-06\") -> dict:\n    \"\"\"Extract EGFR status from a single clinical note using structured-output prompting.\"\"\"\n    response = client.chat.completions.create(\n        model=model,\n        messages=[\n            {\"role\": \"system\", \"content\": SYSTEM_PROMPT},\n            {\"role\": \"user\", \"content\": note_text},\n        ],\n        response_format={\"type\": \"json_object\"},\n        temperature=0,          # deterministic; required for reproducibility\n        seed=42,                # pin for audit trail (OpenAI-specific)\n    )\n    raw = response.choices[0].message.content\n    try:\n        result = json.loads(raw)\n    except json.JSONDecodeError:\n        result = {\"egfr_status\": \"parse_error\", \"citation_span\": None}\n\n    # Flag for human review: no citation span means the extraction is ungrounded.\n    result[\"needs_human_review\"] = (result.get(\"citation_span\") is None\n                                    and result.get(\"egfr_status\") != \"not_reported\")\n    result[\"model\"] = model     # version pin 
for audit trail\n    return result\n\ndef batch_abstract(notes: pd.DataFrame,\n                   note_col: str = \"note_text\",\n                   id_col: str = \"patient_id\",\n                   model: str = \"gpt-4o-2024-08-06\") -> pd.DataFrame:\n    \"\"\"Run abstraction over a DataFrame of notes; returns one row per patient.\"\"\"\n    results = []\n    for _, row in notes.iterrows():\n        out = abstract_egfr(row[note_col], model=model)\n        out[id_col] = row[id_col]\n        results.append(out)\n    return pd.DataFrame(results)\n\n# ── Gold-standard validation ──\ndef validate_abstraction(results: pd.DataFrame,\n                          gold: pd.DataFrame,\n                          id_col: str = \"patient_id\",\n                          gold_positive_label: str = \"positive\") -> dict:\n    \"\"\"\n    Compute accuracy, sensitivity, and PPV against a gold-standard annotation.\n    gold: DataFrame with columns [id_col, 'gold_egfr_status'] where gold_egfr_status\n          is the adjudicated true value.\n    \"\"\"\n    merged = results.merge(gold, on=id_col, how=\"inner\")\n    llm_pos = merged[\"egfr_status\"] == gold_positive_label\n    gold_pos = merged[\"gold_egfr_status\"] == gold_positive_label\n\n    tp = int((llm_pos & gold_pos).sum())\n    fp = int((llm_pos & ~gold_pos).sum())\n    fn = int((~llm_pos & gold_pos).sum())\n    tn = int((~llm_pos & ~gold_pos).sum())\n    n = tp + fp + fn + tn\n\n    accuracy    = (tp + tn) / n  if n > 0 else np.nan\n    sensitivity = tp / (tp + fn) if (tp + fn) > 0 else np.nan\n    ppv         = tp / (tp + fp) if (tp + fp) > 0 else np.nan\n\n    return {\n        \"n\": n, \"tp\": tp, \"fp\": fp, \"fn\": fn, \"tn\": tn,\n        \"accuracy\": round(accuracy, 4),\n        \"sensitivity\": round(sensitivity, 4),\n        \"ppv\": round(ppv, 4),\n        # Flag count: how many extractions lack citation spans and need human review.\n        \"n_needs_human_review\": int(results[\"needs_human_review\"].sum()),\n  
  }\n\n# ── Example: reproduce the worked-example numbers ──\n# With TP=34, FP=10, FN=6, TN=150 (n=200):\n# accuracy = (34+150)/200 = 0.92, sensitivity = 34/40 = 0.85, PPV = 34/44 ~= 0.773",
        "description": "LLM-assisted EGFR status abstraction with structured-output prompting and span-citation\nenforcement, followed by a gold-standard validation that computes accuracy, sensitivity,\nand PPV. Uses the openai SDK with JSON mode (any OpenAI-compatible API works). The\nvalidation function operates on a gold-standard dataframe and is independent of the\nLLM provider. Designed for batch processing; calls are made sequentially with error\nhandling so that a single note failure does not abort the run.",
        "dependencies": [
          "openai",
          "pandas",
          "numpy"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "# ── Validate LLM abstraction against gold standard ──\n# Input: results.csv with columns: patient_id, egfr_status, citation_span, needs_human_review\n#        gold.csv   with columns: patient_id, gold_egfr_status\n\nvalidate_abstraction <- function(results, gold,\n                                 positive_label = \"positive\") {\n  merged <- merge(results, gold, by = \"patient_id\", all = FALSE)\n  llm_pos  <- merged$egfr_status       == positive_label\n  gold_pos <- merged$gold_egfr_status  == positive_label\n\n  tp <- sum( llm_pos &  gold_pos)\n  fp <- sum( llm_pos & !gold_pos)\n  fn <- sum(!llm_pos &  gold_pos)\n  tn <- sum(!llm_pos & !gold_pos)\n  n  <- tp + fp + fn + tn\n\n  cat(\"2x2 table (LLM rows x Gold columns):\\n\")\n  tab <- matrix(c(tp, fp, fn, tn), nrow = 2, byrow = FALSE,\n                dimnames = list(c(\"LLM pos\",\"LLM neg\"), c(\"Gold pos\",\"Gold neg\")))\n  print(tab)\n  cat(\"\\n\")\n\n  # Exact (Clopper-Pearson) confidence intervals via binom.test.\n  ci <- function(num, den, label) {\n    if (den == 0) { cat(label, \"= NA (zero denominator)\\n\"); return(invisible(NULL)) }\n    bt <- binom.test(num, den)\n    cat(sprintf(\"%s = %d/%d = %.4f  95%% CI [%.4f, %.4f]\\n\",\n                label, num, den, bt$estimate, bt$conf.int[1], bt$conf.int[2]))\n  }\n\n  ci(tp + tn, n,     \"Accuracy   \")\n  ci(tp,      tp+fn, \"Sensitivity\")\n  ci(tp,      tp+fp, \"PPV        \")\n\n  cat(sprintf(\"\\nExtractions lacking citation span (needs_human_review = TRUE): %d of %d\\n\",\n              sum(results$needs_human_review, na.rm = TRUE), nrow(results)))\n\n  invisible(list(tp=tp, fp=fp, fn=fn, tn=tn,\n                 accuracy    = (tp+tn)/n,\n                 sensitivity = tp/(tp+fn),\n                 ppv         = tp/(tp+fp)))\n}\n\n# ── Reproduce worked-example numbers ──\n# TP=34 FP=10 FN=6 TN=150 -> accuracy=(34+150)/200=0.92, Se=34/40=0.85, PPV=34/44\ndummy_results <- data.frame(\n  patient_id         = 1:200,\n  egfr_status 
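       = c(rep(\"positive\", 44), rep(\"negative\", 156)),\n  needs_human_review = c(rep(FALSE, 34), rep(TRUE, 10), rep(FALSE, 156))\n)\n# Gold labels are ordered so that, among the 44 LLM positives, 34 are true positives and\n# 10 are false positives; among the 156 LLM negatives, 6 are false negatives and 150 are\n# true negatives.\ndummy_gold <- data.frame(\n  patient_id       = 1:200,\n  gold_egfr_status = c(rep(\"positive\", 34), rep(\"negative\", 10),\n                       rep(\"positive\", 6),  rep(\"negative\", 150))\n)\nvalidate_abstraction(dummy_results, dummy_gold)\n# Expected: accuracy = 184/200 = 0.92, sensitivity = 34/40 = 0.85, PPV = 34/44",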
        "description": "Gold-standard validation of LLM abstraction results in R. Assumes the LLM has already\nbeen run (e.g., via the Python pipeline above or an equivalent API call) and the outputs\nare stored in a CSV. Computes accuracy, sensitivity, and PPV with exact (Clopper-Pearson)\nbinomial confidence intervals using binom.test. Prints the 2x2 table and flags the\ncount of extractions lacking citation spans. Uses base R only; no external dependencies.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Input[Clinical notes / source documents] --> Prompt[Structured-output LLM prompt<br/>JSON schema + allowed values + span-citation requirement]\n  Prompt --> Extract[Extracted fields<br/>per-note JSON with citation_span]\n  Extract --> Check{Citation span<br/>present?}\n  Check -- Yes, high confidence --> Accept[Automated acceptance<br/>log model + version + span]\n  Check -- No or low confidence --> Queue[Human review queue<br/>abstractor verifies against source doc]\n  Queue --> Final[Final structured dataset]\n  Accept --> Final\n  Final --> Validate[Gold-standard validation sample<br/>PPV / sensitivity / kappa]\n  Validate --> Threshold{Meets pre-specified<br/>PPV floor?}\n  Threshold -- Yes --> Analysis[Enters analysis<br/>with bias-analysis caveat]\n  Threshold -- No --> Revise[Revise prompt or expand<br/>human review tier]",
        "caption": "LLM chart abstraction pipeline with span-citation enforcement, confidence-thresholded human review queue, and mandatory gold-standard validation before the extracted variable enters an analysis.",
        "alt_text": "Flowchart showing clinical notes entering a structured-output LLM prompt, extracted fields being routed to automated acceptance or human review based on citation span presence, the final structured dataset being validated against a gold standard, and a decision gate on whether the PPV floor is met before analysis proceeds.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  subgraph FailureModes[LLM hallucination modes in clinical abstraction]\n    PN[Plausible-but-absent values<br/>value not in source doc]\n    UE[Unit errors<br/>mg vs mcg, mmol vs mg/dL]\n    DC[Date confabulation<br/>wrong date from context]\n    NM[Negation misses<br/>reads positive when note says negative]\n  end\n  subgraph Controls[Mitigations]\n    SO[Structured-output schema<br/>constrain allowed values]\n    SC[Span-citation requirement<br/>flag ungrounded extractions]\n    DM[Dual-model adjudication<br/>disagreements go to human]\n    VP[Version pinning<br/>lock model ID in SAP]\n  end\n  PN --> SO\n  PN --> SC\n  UE --> SO\n  DC --> SC\n  NM --> SC\n  NM --> DM",
        "caption": "LLM failure modes in clinical abstraction (left) paired with the design-pattern mitigations that reduce but do not eliminate each failure type (right).",
        "alt_text": "Two-column diagram showing LLM failure modes (plausible-but-absent values, unit errors, date confabulation, negation misses) connected by arrows to the corresponding mitigation patterns (structured output, span citation, dual-model adjudication, version pinning).",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "requires",
        "target_slug": "predictive-and-causal-ml-models-rwe",
        "notes": "LLM abstraction pipelines are a specialized application of ML methods in RWE; the broader ML methods framework covering model evaluation, generalization, and validation is the parent concept."
      },
      {
        "relation_type": "used_with",
        "target_slug": "endpoint-adjudication-chart-review-rwe",
        "notes": "Human chart adjudication is the gold standard against which LLM abstraction must be validated (PPV/sensitivity/kappa); the adjudication workflow governs the human review gate required for any load-bearing LLM extraction."
      },
      {
        "relation_type": "used_with",
        "target_slug": "algorithm-validation",
        "notes": "LLM extractors are algorithms whose operating characteristics (PPV, sensitivity, specificity) must be measured against an adjudicated reference standard, using exactly the same validation framework that governs any claims or EHR algorithm."
      },
      {
        "relation_type": "see_also",
        "target_slug": "ehr-phenotyping-algorithms-rwe",
        "notes": "Rule-based EHR phenotypes and LLM extractors address complementary gaps: structured codes for diagnoses and procedures, LLM text extraction for staging, mutation status, and narrative variables. Both require validation and version control."
      },
      {
        "relation_type": "see_also",
        "target_slug": "nlp-clinical-text-extraction-rwe",
        "notes": "LLM-assisted abstraction is a sub-technique within the broader clinical NLP landscape; this sibling concept covers the full range of NLP approaches including rule-based text mining, transformer models, and traditional supervised NLP in addition to LLMs."
      },
      {
        "relation_type": "see_also",
        "target_slug": "qc-double-programming-reproducibility",
        "notes": "The double-programming and QC discipline applies directly to LLM-generated analysis code; any code produced by an LLM requires an independent verification run against expected outputs, following the same reproducibility standards as human-authored code."
      }
    ],
    "aliases": [
      "LLM chart abstraction",
      "AI-assisted abstraction",
      "GPT clinical extraction",
      "large language model data extraction",
      "AI clinical text abstraction",
      "LLM-assisted evidence synthesis"
    ],
    "complexity": "advanced",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "log-normal-distribution",
    "name": "Log-Normal Distribution and the Retransformation Problem",
    "short_definition": "A probability distribution for strictly positive continuous outcomes that arise from multiplicative processes — including healthcare costs, length of hospital stay, and laboratory titers — where the logarithm of the outcome follows a normal distribution; correctly estimating the arithmetic mean from a log-scale regression requires an explicit retransformation step (Duan smearing) because the naive back-transform exp(fitted value) estimates the geometric mean, not the arithmetic mean that budget-impact and cost-effectiveness models require.",
    "long_description": "**What the log-normal distribution is and where it comes from**\n\nA random variable Y follows a log-normal distribution when its natural logarithm log(Y)\nis normally distributed. More precisely: if X is normally distributed with mean mu and\nvariance sigma-squared, then Y = exp(X) is log-normal with parameters mu and sigma-squared.\nThe name is counterintuitive — Y itself is not normal, but log(Y) is — and this asymmetry\nis the source of nearly every practical complication the distribution creates for analysts.\n\nThe log-normal arises naturally wherever the data-generating process is *multiplicative*\nrather than additive. Many biological and economic processes compound sequentially. A drug's\nconcentration at time t is the dose multiplied by a sequence of absorption, distribution,\nand elimination factors. A patient's annual healthcare spend is a series of claim amounts\nthat compound through utilization rates, DRG weights, and facility price multipliers. A\nvirus's titre doubles or halves with each replication cycle. When independent multiplicative\nshocks are logged, the Central Limit Theorem applies to the sum of log-terms, producing a\nnormal distribution on the log scale and a log-normal on the original scale.\n\nIn health economics and outcomes research (HEOR), the log-normal is ubiquitous because the\ncost-generating mechanism is fundamentally multiplicative: a hospitalization multiplies by\nits DRG weight, which multiplies by facility rates, which compound with complication flags\nand LOS. The result is a distribution that is right-skewed — a long right tail of\ncatastrophically expensive patients — with a mode well below the mean. 
Hospital length of\nstay, pharmacy costs, number of outpatient visits, antibody titers from serology assays,\nviral loads from PCR, and pharmacokinetic AUC values routinely exhibit this shape.\n\n**Arithmetic vs geometric mean: the sigma-squared-over-two term**\n\nThe most important mathematical property of the log-normal is the relationship between the\nparameters of the underlying normal (mu, sigma-squared) and the moments on the original scale:\n\n- **Arithmetic mean (expected cost):** E[Y] = exp(mu + sigma^2 / 2)\n- **Median (equals geometric mean):** exp(mu)\n- **Variance:** Var[Y] = (exp(sigma^2) minus 1) times exp(2*mu + sigma^2)\n\nThe sigma^2/2 term in the arithmetic mean formula is the entire story of why\nlog-transformation and back-transformation require careful handling. When sigma^2 is small\n(distributions that are only mildly skewed on the original scale), the arithmetic mean and the\ngeometric mean are close. As sigma^2 grows — as the distribution becomes more right-skewed\n— the arithmetic mean increasingly exceeds the geometric mean by the factor exp(sigma^2/2).\n\nFor a typical commercial claims cost distribution with sigma = 1.5 on the natural-log scale,\nthe arithmetic mean is exp(sigma^2/2) = exp(1.125) approximately 3.08 times the geometric\nmean. A naive analysis that reports the geometric mean as \"average cost\" would report only\nabout one-third of the true population average spend, an understatement of roughly\ntwo-thirds. For budget-impact models and\ncost-effectiveness analyses, this is not a minor rounding error — it is the difference\nbetween a credible submission and a rejected one.\n\nThis has a direct consequence for regression. If you fit an ordinary least-squares (OLS)\nmodel with log(Y) as the outcome, the predicted value on the log scale, mu-hat, is an\nestimate of E[log Y] = mu. 
Back-transforming with exp(mu-hat) gives the geometric mean of\nY — the median under log-normality — which is not the same as E[Y], the arithmetic mean.\nThis discrepancy is the **retransformation problem**.\n\n**The retransformation problem and Duan's smearing estimator**\n\nThe correct estimator for the arithmetic mean under log-OLS is:\n\nE[Y] = exp(mu-hat) times the Duan smearing factor\n\nwhere the smearing factor is the sample mean of the back-transformed OLS residuals:\n\nDuan factor = (1/n) times the sum of exp(e_i)\n\nand e_i = log(y_i) minus mu-hat_i are the OLS residuals on the log scale. This\nnonparametric correction — proposed by Duan (1983) — requires no assumption about the\nresidual distribution and is consistent even when the log-scale errors are not exactly\nnormal. Under homoscedastic log-normal errors (equal variance across all covariate values),\nthe smearing factor estimates exp(sigma^2/2), exactly correcting for the sigma^2/2 bias in\nthe arithmetic mean.\n\nManning and Mullahy (2001) provided the definitive applied assessment of when log-OLS with\nsmearing is adequate and when it fails. The smearing estimator performs well when the\nlog-scale residuals are **homoscedastic** — that is, when the spread of log-cost around the\nfitted value is similar across all patient subgroups. It can fail badly under\n**heteroscedasticity**: when sicker or more complex patients have more variable costs, sigma^2\nvaries across the covariate space, the smearing factor should differ by subgroup, and the\nsingle pooled smearing factor applied to the full dataset can seriously misestimate group-\nspecific arithmetic means. 
In the heteroscedastic case, a generalized linear model (GLM)\nwith a log link and gamma or Tweedie variance function is almost universally preferred for\nprimary cost inference: it directly models the arithmetic mean on the original scale without\nany retransformation step and is robust to variance heterogeneity.\n\n**Interpreting the output**\n\nConsider a log-OLS regression of total annual costs on a binary treatment indicator. The\nestimated treatment coefficient is **0.405** on the natural-log scale (95% CI: 0.18 to 0.63).\n\n*(1) Formal interpretation.* exp(0.405) = 1.50. This is the ratio of geometric means: the\nmedian cost in the treated group is approximately 50 percent higher than in the control group\nunder log-normality. Under the *additional* assumption that the log-scale residuals are\nhomoscedastic (equal variance in both groups), the smearing factors for the two groups are\nidentical and cancel in the ratio, so exp(0.405) also approximates the ratio of arithmetic\nmeans — meaning average spending is about 50 percent higher in the treated group. If the\nresiduals are heteroscedastic (unequal variance between groups, which is common when sicker\npatients are more likely to receive a given treatment), this arithmetic-mean interpretation\ndoes not hold. The ratio of arithmetic means must instead be computed as\n[exp(alpha-hat + beta-hat) times smearing-factor-treated] divided by\n[exp(alpha-hat) times smearing-factor-control], using group-specific smearing factors.\n\n*(2) Practical interpretation.* \"Patients in the treatment group had typical (median) costs\nabout 50 percent higher than control patients — that is, for a control patient whose costs\nsit at the median, the corresponding treated patient would be expected to spend 50 percent\nmore. Whether average total spending — the quantity relevant for a budget-impact model — is\nalso 50 percent higher depends on whether cost variability is similar in both groups. 
A\nsensitivity analysis using group-specific smearing factors or a gamma GLM is recommended\nbefore quoting the arithmetic-mean ratio to a payer or health technology assessment body.\"\n\nThis formal/practical distinction is the core analytical skill for HEOR analysts reporting\nlog-OLS results. The coefficient tells you about the geometric mean. Budget impact requires\nthe arithmetic mean. The smearing correction bridges the two — but only reliably under\nhomoscedasticity.\n\n**Geometric means in laboratory and pharmacokinetics reporting**\n\nFor laboratory assay results — antibody titers from serology, viral loads from PCR,\npharmacokinetic AUC or Cmax values, minimum inhibitory concentrations (MICs) from\nantimicrobial studies — reporting the geometric mean rather than the arithmetic mean is often\nthe scientifically correct choice. These measurements are multiplicative by nature, and the\nratio of geometric means is the natural measure of relative magnitude: \"the treated group had\na titer 4-fold higher.\" Confidence intervals computed on the log scale and then back-\ntransformed are ratio CIs, directly interpretable as fold-changes with correct coverage\nproperties. 
Bland and Altman (1996) provide a clear worked demonstration of this approach.\nFor these applications, the geometric mean is the primary target estimand and no smearing\ncorrection is needed; the retransformation problem only arises when the target is the\narithmetic mean on the original dollar or count scale.\n\n**Pros, cons, and trade-offs**\n\n*Log-OLS (OLS on the log-transformed outcome) with Duan smearing:*\n- Pros: straightforward to fit in any statistical package; residual diagnostics are familiar\n  (Q-Q plots, residual-versus-fitted scatter); interpretable on the log scale as a\n  multiplicative relationship; the coefficient is directly the log of the geometric-mean\n  ratio; smearing correction is simple to compute; widely reported in HEOR literature.\n- Cons: the naive back-transform gives the geometric mean, not the arithmetic mean; the Duan\n  smearing estimator is biased under heteroscedasticity; zero-cost patients must be excluded\n  (log is undefined at zero) or handled separately by a two-part model; the smearing step is\n  often omitted by analysts who do not know it is required, producing systematically\n  underestimated costs.\n- When to prefer: when the estimand is the geometric mean or a fold-change ratio (lab/PK\n  data); as a sensitivity analysis alongside a GLM primary analysis; when the log-normality\n  assumption is well-supported and residuals are approximately homoscedastic.\n\n*Gamma GLM with log link:*\n- Pros: directly models the arithmetic mean on the original scale; the gamma variance function\n  accommodates heteroscedasticity (variance proportional to mean squared); no back-\n  transformation step or smearing correction is needed; marginally consistent estimator for\n  population mean costs; readily accommodates prediction and covariate adjustment.\n- Cons: requires specifying the variance function (gamma vs Tweedie vs negative binomial);\n  less familiar to some audiences than log-OLS; the log-link coefficient is a 
log-mean-ratio,\n  not a log-median-ratio; GLM convergence can be slow for very large datasets.\n- When to prefer: when the primary estimand is the arithmetic mean cost for budget impact or\n  cost-effectiveness analysis; when cost heteroscedasticity is expected; this is the modern\n  standard for primary cost analyses in HEOR per Manning and Mullahy (2001).\n\n**When to use**\n\nUse log-transformation and log-scale regression when:\n- The outcome is continuous, strictly positive, and right-skewed in a multiplicative pattern\n  such that log(Y) is approximately normally distributed.\n- The primary estimand is the geometric mean or a ratio of geometric means — antibody titers,\n  viral loads, pharmacokinetic parameters, sensitivity analyses on a ratio scale.\n- The Duan smearing correction is applied and residual heteroscedasticity has been assessed\n  as mild (e.g., by plotting residuals against fitted values or comparing group-specific\n  standard deviations on the log scale).\n- Exploratory data analysis is needed to understand distributional shape before choosing a\n  primary model specification.\n- The audience or journal convention strongly favors log-transformed regression over GLMs\n  and Duan smearing is applied correctly.\n\n**When NOT to use**\n\nDo not use log-transformation when:\n- **The outcome contains zeros.** log(0) is undefined, and adding an arbitrary constant\n  (such as log(Y + 1) or log(Y + 0.5)) introduces scale-dependence that biases the\n  back-transformed mean estimate in a direction and magnitude that depend entirely on the\n  chosen constant. 
For outcomes with a zero spike — as in cost data where a fraction of\n  patients have no claims — use a two-part model (logistic for any use, then log-OLS or\n  gamma GLM conditional on positive use) or a Tweedie GLM designed for semi-continuous data.\n- **The target estimand is the arithmetic mean and the analyst applies only the naive\n  back-transform.** This is the single most common error in HEOR cost analyses: reporting\n  exp(mu-hat) as \"the mean cost\" when it estimates the geometric mean, systematically\n  understating average spend. The arithmetic mean always requires the smearing step.\n- **The outcome is bounded or count-valued.** Log-normal is a model for continuous unbounded\n  positive quantities. Bounded scores (0 to 100 quality-of-life instruments), binary\n  outcomes, and count outcomes (number of hospitalizations) have their own distributional\n  families (beta regression, logistic regression, Poisson or negative binomial) that should\n  be used instead.\n- **Residuals on the log scale show strong heteroscedasticity between groups.** If sicker or\n  higher-utilizing patients have substantially more variable costs, the pooled smearing factor\n  is biased, the simple ratio exp(beta) is not the arithmetic-mean ratio between arms, and\n  a gamma GLM should be preferred as the primary analysis with log-OLS as a sensitivity check.\n- **The dataset is small (fewer than 30 observations) and log-normality is not supported by\n  prior knowledge.** Bootstrap-based mean estimation or a nonparametric approach may be more\n  robust when the distributional assumption cannot be adequately assessed.",
    "primary_category": "Inferential_Statistics",
    "tags": [
      "statistics",
      "primitive",
      "distributions",
      "cost-analysis",
      "skewed-data",
      "geometric-mean",
      "retransformation",
      "smearing"
    ],
    "applies_to_study_types": [
      "cohort_retrospective",
      "cohort_prospective",
      "claims_analysis",
      "ehr_study",
      "registry_study",
      "cross_sectional",
      "descriptive_analysis"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "primary",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1080/01621459.1983.10478017",
        "url": "https://doi.org/10.1080/01621459.1983.10478017",
        "citation_text": "Duan N. Smearing estimate: a nonparametric retransformation method. Journal of the American Statistical Association. 1983;78(383):605-610.",
        "year": 1983,
        "authors_short": "Duan",
        "notes": "The original derivation of the nonparametric smearing estimator for retransforming log-scale OLS predictions back to the original scale without requiring a specific residual distribution. The canonical reference for the Duan estimator used in HEOR cost analyses; establishes that the estimator is consistent and requires only that the residuals are independently and identically distributed."
      },
      {
        "role": "explain",
        "doi": "10.1016/S0167-6296(01)00086-8",
        "url": "https://doi.org/10.1016/S0167-6296(01)00086-8",
        "citation_text": "Manning WG, Mullahy J. Estimating log models: to transform or not to transform? Journal of Health Economics. 2001;20(4):461-494.",
        "year": 2001,
        "authors_short": "Manning & Mullahy",
        "notes": "Definitive applied comparison of log-OLS with Duan smearing versus generalized linear models (gamma, Poisson, Tweedie) for healthcare cost data. Establishes that the smearing estimator is sensitive to heteroscedasticity and that a gamma GLM with log link is generally preferred when cost variance is not constant across patients. Essential reading for HEOR analysts choosing between log-OLS and GLM approaches."
      },
      {
        "role": "demonstrate",
        "doi": "10.1136/bmj.312.7038.1079",
        "url": "https://doi.org/10.1136/bmj.312.7038.1079",
        "citation_text": "Bland JM, Altman DG. Statistics notes: Transformations, means, and confidence intervals. BMJ. 1996;312(7038):1079.",
        "year": 1996,
        "authors_short": "Bland & Altman",
        "notes": "BMJ Statistics Notes entry on logarithmic transformation in clinical data. Demonstrates the correct interpretation of back-transformed means as geometric means, and provides worked examples of ratio confidence intervals that are the appropriate output of log-scale analysis for laboratory and clinical measurements. Especially useful for analysts reporting geometric means and fold-change CIs from transformed data."
      }
    ],
    "plain_language_summary": "Healthcare costs, length of hospital stay, and laboratory values like viral loads tend to follow a log-normal distribution — a bell curve when you take the logarithm of the values, but a strongly right-skewed curve on the original dollar or unit scale. When you fit a regression model to log-transformed costs and convert the predicted values back to dollars, you get the geometric mean (the typical value for a median patient), not the arithmetic mean (the true average spend that drives a budget). Correcting for this gap requires a step called Duan smearing: you use the spread of the model's residuals to adjust the back-transformed prediction upward toward the true average — and failing to do so systematically underestimates total population spending.",
    "key_terms": [
      {
        "term": "geometric mean",
        "definition": "The middle value of a set of numbers computed by multiplying them together and taking the nth root; for costs of 100, 1000, and 10000, the geometric mean is 1000, which is lower than the arithmetic mean of 3700 because it ignores the pull of the expensive outlier."
      },
      {
        "term": "multiplicative effect",
        "definition": "A change that scales the outcome by a fixed factor rather than adding a fixed amount; a regression coefficient of 0.405 on the log scale means the treated group's typical cost is exp(0.405) approximately 1.5 times the control group's typical cost."
      },
      {
        "term": "retransformation bias",
        "definition": "The systematic underestimate of the arithmetic mean that occurs when a log-scale predicted value is naively converted back to the original scale using exp() without a smearing correction; it is a mathematical property of the log-normal distribution, not a modeling error."
      },
      {
        "term": "smearing factor",
        "definition": "A correction multiplier equal to the average of the back-transformed OLS residuals; proposed by Duan (1983), it adjusts the geometric-mean estimate upward to approximate the true arithmetic mean without requiring any assumption about the residual distribution."
      },
      {
        "term": "skewness",
        "definition": "A measure of how lopsided a distribution is; a right-skewed distribution like healthcare costs has a long tail of very large values that pull the arithmetic mean far above the median, which is why the geometric mean alone understates average spend for budget purposes."
      }
    ],
    "worked_example": {
      "scenario": "A health economics analyst has three patients from a commercial claims database with total annual costs of $100, $1,000, and $10,000 — values that increase by a factor of ten at each step, a pattern typical of multiplicative cost processes. The analyst wants to understand concretely why fitting a regression on the log-transformed costs and back-transforming the prediction gives the wrong average, and how the Duan smearing factor corrects it to match the true arithmetic mean.",
      "dataset": {
        "caption": "Annual total costs (USD) for three patients. The tenfold jumps between consecutive patients illustrate the multiplicative structure common in claims cost distributions, where a small number of high-cost patients dominate average spending.",
        "columns": [
          "patient_id",
          "total_cost_usd"
        ],
        "rows": [
          [
            "P1",
            100
          ],
          [
            "P2",
            1000
          ],
          [
            "P3",
            10000
          ]
        ]
      },
      "steps": [
        "Step 1 — Arithmetic mean (the true average spend a budget analyst needs): (100 + 1000 + 10000) / 3 = 3700. This is the number that, multiplied by population size, gives total budget impact.",
        "Step 2 — Log-transform each cost using log base 10 (chosen here for clean arithmetic): log10(100) = 2; log10(1000) = 3; log10(10000) = 4.",
        "Step 3 — Mean of the log10 values, which is what a log-OLS intercept estimates for this one-group dataset: (2 + 3 + 4) / 3 = 3.",
        "Step 4 — Naive back-transform: 10^3 = 1000. This is the geometric mean — the middle patient on the log scale — NOT the arithmetic mean. Reporting $1,000 as 'average cost' understates true average spend by $2,700.",
        "Step 5 — Compute OLS residuals on the log10 scale by subtracting the fitted value (3) from each patient's log cost: P1 residual = 2 - 3 = -1; P2 residual = 3 - 3 = 0; P3 residual = 4 - 3 = 1.",
        "Step 6 — Back-transform each residual from log10 scale: P1 gives 10^(-1) = 0.1; P2 gives 10^0 = 1; P3 gives 10^1 = 10. These are the patient-level smearing terms.",
        "Step 7 — Duan smearing factor = average of the back-transformed residuals: (0.1 + 1 + 10) / 3 = 3.7. This multiplier captures the asymmetric pull of the upper tail that the geometric mean misses.",
        "Step 8 — Smeared (corrected) arithmetic mean estimate = geometric mean times smearing factor = 1000 * 3.7 = 3700. The smearing correction exactly recovers the arithmetic mean in this symmetric log-scale example, confirming the estimator is unbiased here."
      ],
      "result": "Arithmetic mean = (100 + 1000 + 10000) / 3 = 3700. Mean log10 = (2 + 3 + 4) / 3 = 3. Geometric mean (naive back-transform) = 10^3 = 1000, which is $2,700 less than the true average. Smearing factor = (0.1 + 1 + 10) / 3 = 3.7. Smeared Duan estimate = 1000 * 3.7 = 3700, which equals the arithmetic mean. Conclusion: the geometric mean is the right answer for \"what does a typical (median) patient spend?\"; the arithmetic mean is the right answer for \"what will this population cost in total?\""
    },
    "prerequisites": [
      "descriptive-statistics",
      "inferential-statistics-foundations"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Log-OLS with pooled Duan smearing (homoscedastic case)",
        "description": "The standard Duan (1983) estimator uses a single pooled smearing factor computed from all OLS residuals on the log scale. This is appropriate when the log-scale variance is approximately constant across covariate values (homoscedasticity). The arithmetic mean for a subgroup or treatment arm is: exp(X*beta-hat) times the pooled smearing factor.",
        "edge_cases": [
          "If log-scale residuals show a fan pattern against fitted values, heteroscedasticity is present and the pooled smearing factor will be biased for subgroup arithmetic means.",
          "In small samples (n < 50), the smearing factor itself has high variance and bootstrap confidence intervals are recommended for the smeared mean estimate."
        ],
        "data_source_notes": "Claims: compute per-patient total costs, log-transform, run OLS, save residuals, compute mean(exp(residuals)) as smearing factor, multiply back into fitted geometric means. Flag and separately model patients with zero costs using a two-part model."
      },
      {
        "name": "Group-specific Duan smearing (heteroscedastic case)",
        "description": "When cost variance differs across treatment arms or covariate groups (the common case in HEOR), compute a separate smearing factor for each group from that group's residuals. The arithmetic mean for treatment arm k is: exp(X_k * beta-hat) times the arm-k smearing factor. The ratio of group arithmetic means is not simply exp(beta) in this case.",
        "edge_cases": [
          "Groups must be large enough (n > 30 per group) for the group-specific smearing factor to be estimated reliably; with small groups, bootstrap the ratio directly.",
          "Arm-specific smearing is the correct approach for primary cost analyses when treatment selects into sicker, higher-cost patients."
        ],
        "data_source_notes": "Claims: after fitting log-OLS, compute residuals by treatment arm, compute exp(mean of arm-specific residuals) for each arm, apply to the corresponding fitted value."
      },
      {
        "name": "Two-part model for zero-inflated costs",
        "description": "When a fraction of patients have zero total costs (no claims), log(0) is undefined and log-OLS cannot be directly applied. A two-part model fits a logistic (or probit) model for the probability of any cost, then a log-OLS or gamma GLM for the conditional distribution given positive cost. The arithmetic mean is E[Y] = P(Y>0) times E[Y | Y>0].",
        "edge_cases": [
          "The two-part structure assumes that the decision to incur any cost and the level of cost given use are governed by different mechanisms (which is often reasonable for new drugs or rare procedures).",
          "Adding a small constant (1, 0.5, or 1 dollar) to avoid log(0) is not a valid alternative; it biases the back-transformed mean in a way that depends on the arbitrary constant chosen."
        ],
        "data_source_notes": "Claims: create an indicator for any cost > 0; fit logistic model for the indicator; fit log-OLS or gamma GLM on the positive-cost subset; multiply predicted probability by conditional mean to get the unconditional arithmetic mean."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "generalized-linear-models",
        "pros_of_this": "Log-OLS is simpler to explain to non-statistical audiences and produces a coefficient directly interpretable as a log-geometric-mean-ratio; residual diagnostics (Q-Q plots, residuals vs fitted) are familiar to most analysts.",
        "cons_of_this": "Log-OLS requires an explicit smearing correction for arithmetic mean estimation and is biased for means under heteroscedasticity; a gamma GLM with log link directly targets the arithmetic mean, handles heteroscedasticity via the variance function, and eliminates the retransformation step entirely.",
        "when_to_prefer": "Prefer log-OLS when the estimand is the geometric mean or fold-change ratio (lab/PK data); prefer gamma GLM when the primary estimand is the arithmetic mean cost for budget impact or cost-effectiveness analysis."
      },
      {
        "compared_to": "healthcare-costs-pppm-pppy-pmpm",
        "pros_of_this": "Log-OLS regression naturally adjusts for confounders and covariates while simultaneously modeling the cost distribution; the smeared mean estimate has a clear causal interpretation when paired with an appropriate identification strategy.",
        "cons_of_this": "Per-member-per-month (PMPM) or per-patient-per-year (PPPY) cost descriptors are simpler to compute and communicate; they do not require distributional assumptions but are unadjusted for confounders and must be interpreted as descriptive rather than causal.",
        "when_to_prefer": "Use log-OLS (with smearing) for adjusted mean cost estimates in observational comparisons; use PMPM/PPPY descriptors for characterizing the cost burden in a single cohort or as inputs to a budget-impact model."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Per-patient total costs are the natural unit of analysis. Log-transform after excluding or separately modeling zero-cost patients. Compute OLS on log costs, save residuals, compute smearing factor as mean(exp(residuals)), and multiply the fitted geometric mean by the smearing factor to obtain the arithmetic mean estimate. For primary analyses, prefer a gamma GLM with log link. For cost distributions with very high kurtosis (extreme outliers), consider winsorizing at the 99th percentile as a sensitivity analysis, documenting the threshold chosen and its effect on the smearing factor.",
      "ehr": "Lab values (antibody titers, viral loads, PK parameters) are natural candidates for log-normal analysis with geometric means as the primary estimand; no smearing is needed when reporting geometric means and ratio CIs. For cost-adjacent EHR outcomes (DRG-based cost estimates, charge data), apply the same log-OLS or gamma GLM approach as claims. Be cautious of structured zeros (patients with no test ordered vs. truly zero titer) — these require different handling than the zero-spike in cost distributions.",
      "registry": "Registry cost data often have cleaner distributions than claims (better adjudicated, fewer billing artifacts). Log-normality may be a more reasonable assumption. Check the Q-Q plot of log-costs; if the tails are heavier than normal, consider a two-parameter Weibull or generalized gamma on the log scale. For biomarker registries reporting MICs, titers, or viral loads, geometric means and fold-change ratios are the standard.",
      "primary": "Prospective study cost diaries and patient-reported cost surveys often have high zero rates (patients who incur no out-of-pocket costs) — two-part models are strongly recommended. Pilot studies with small n should report geometric means with bootstrap CIs rather than relying on asymptotic normal approximations for the smeared arithmetic mean.",
      "linked": "Linked claims-EHR-registry cohorts typically have large n, stabilizing smearing factor estimates. Run both log-OLS with group-specific smearing and gamma GLM as co-primary analyses; if results diverge, the heteroscedasticity diagnostic (residual fan plot) and Manning-Mullahy modified Park test identify which is more appropriate."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\nfrom scipy import stats\n\n# ── Motivating dataset: three patient costs ──────────────────────────────\ncosts = np.array([100.0, 1000.0, 10000.0])\n\n# 1. Arithmetic mean (the budget target)\narith_mean = costs.mean()\nprint(f\"Arithmetic mean: {arith_mean:.2f}\")              # 3700.00\n\n# 2. Geometric mean via log-scale back-transform (natural log throughout)\nlog_costs = np.log(costs)          # natural log; mean_log = mu-hat from log-OLS\nmean_log = log_costs.mean()\ngeo_mean = np.exp(mean_log)\nprint(f\"Geometric mean (naive back-transform): {geo_mean:.2f}\")  # 1000.00\n\n# 3. Duan smearing factor: sample mean of back-transformed residuals\nresiduals = log_costs - mean_log\nsmearing_factor = np.exp(residuals).mean()\nsmeared_mean = geo_mean * smearing_factor\nprint(f\"Smearing factor: {smearing_factor:.4f}\")\nprint(f\"Duan smeared estimate of arithmetic mean: {smeared_mean:.2f}\")  # 3700.00\n\n# ── Simulated 200-patient treatment comparison ────────────────────────────\nrng = np.random.default_rng(42)\nn = 200\ntreat = np.repeat([0, 1], n // 2)\n# True model: log(cost) = 6.0 + 0.405 * treat + epsilon, epsilon ~ N(0, 1)\nlog_y = 6.0 + 0.405 * treat + rng.normal(0, 1.0, n)\ny = np.exp(log_y)\n\n# 4. Log-OLS with numpy (design matrix approach)\nX = np.column_stack([np.ones(n), treat])\nbeta_hat = np.linalg.lstsq(X, log_y, rcond=None)[0]\nalpha_hat, beta_treat = beta_hat\nprint(f\"\\nLog-OLS: alpha = {alpha_hat:.3f}, beta_treat = {beta_treat:.3f}\")\nprint(f\"exp(beta_treat) = {np.exp(beta_treat):.3f}  \"\n      f\"<-- ratio of geometric means, NOT necessarily ratio of arithmetic means\")\n\n# 5. 
Pooled smearing factor\nfitted = X @ beta_hat\nresids = log_y - fitted\npooled_sf = np.exp(resids).mean()\nprint(f\"\\nPooled smearing factor: {pooled_sf:.4f}\")\nfor arm, label in [(0, \"Control\"), (1, \"Treated\")]:\n    mu_arm = alpha_hat + beta_treat * arm\n    arith_arm = np.exp(mu_arm) * pooled_sf\n    print(f\"  {label}: geometric mean = {np.exp(mu_arm):.2f}, \"\n          f\"arithmetic mean = {arith_arm:.2f}\")\n\n# 6. Group-specific smearing (correct under heteroscedasticity)\nprint(\"\\nGroup-specific smearing factors:\")\nfor arm, label in [(0, \"Control\"), (1, \"Treated\")]:\n    mask = treat == arm\n    sf_arm = np.exp(resids[mask]).mean()\n    arith_arm = np.exp(alpha_hat + beta_treat * arm) * sf_arm\n    print(f\"  {label}: smearing factor = {sf_arm:.4f}, \"\n          f\"arithmetic mean estimate = {arith_arm:.2f}\")\n\n# 7. Parametric log-normal fit using scipy.stats.lognorm\n# scipy uses shape (sigma) and scale (exp(mu)) parameterization\nshape, loc, scale = stats.lognorm.fit(costs, floc=0)  # fix loc=0 for strictly positive\nmu_fit = np.log(scale)\nsigma_fit = shape\nprint(f\"\\nLog-normal MLE: mu = {mu_fit:.3f}, sigma = {sigma_fit:.3f}\")\nprint(f\"  Estimated arithmetic mean: exp(mu + sigma^2/2) = \"\n      f\"{np.exp(mu_fit + sigma_fit**2 / 2):.2f}\")\nprint(f\"  Estimated geometric mean:  exp(mu) = {np.exp(mu_fit):.2f}\")",
        "description": "Log-OLS with explicit Duan smearing (pooled and group-specific) using numpy and scipy.\nDemonstrates arithmetic mean estimation, geometric mean, and the retransformation problem\non the three-patient motivating dataset and on a simulated 200-patient treatment comparison.\nAlso shows scipy.stats.lognorm for fitting a parametric log-normal distribution. No\ndependencies beyond numpy and scipy.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "# ── Motivating dataset ───────────────────────────────────────────────────\ncosts <- c(100, 1000, 10000)\n\n# 1. Arithmetic mean\ncat(sprintf(\"Arithmetic mean: %.2f\\n\", mean(costs)))          # 3700.00\n\n# 2. Geometric mean via log-scale back-transform (natural log)\nlog_costs <- log(costs)\nmu_hat <- mean(log_costs)\ngeo_mean <- exp(mu_hat)\ncat(sprintf(\"Geometric mean (naive back-transform): %.2f\\n\", geo_mean))  # 1000.00\n\n# 3. Duan smearing factor\nresiduals <- log_costs - mu_hat\nsmearing_factor <- mean(exp(residuals))\nsmeared_mean <- geo_mean * smearing_factor\ncat(sprintf(\"Smearing factor: %.4f\\n\", smearing_factor))\ncat(sprintf(\"Duan smeared estimate: %.2f\\n\", smeared_mean))   # 3700.00\n\n# Utility: geometric mean function (handles positive values only)\ngeo_mean_fn <- function(x) exp(mean(log(x[x > 0])))\n\n# ── Simulated 200-patient treatment comparison ────────────────────────────\nset.seed(42)\nn <- 200\ntreat <- rep(c(0L, 1L), each = n / 2)\nlog_y <- 6.0 + 0.405 * treat + rnorm(n, 0, 1.0)\ny <- exp(log_y)\n\n# 4. Log-OLS\nfit <- lm(log_y ~ treat)\nalpha_hat <- coef(fit)[[\"(Intercept)\"]]\nbeta_hat  <- coef(fit)[[\"treat\"]]\ncat(sprintf(\"\\nLog-OLS: alpha = %.3f, beta_treat = %.3f\\n\",\n            alpha_hat, beta_hat))\ncat(sprintf(\"exp(beta_treat) = %.3f  <-- geometric-mean ratio\\n\",\n            exp(beta_hat)))\n\n# 5. Pooled smearing factor\nresids <- residuals(fit)\npooled_sf <- mean(exp(resids))\ncat(sprintf(\"Pooled smearing factor: %.4f\\n\", pooled_sf))\nfor (arm in c(0, 1)) {\n  mu_arm <- alpha_hat + beta_hat * arm\n  arith_arm <- exp(mu_arm) * pooled_sf\n  cat(sprintf(\"  Arm %d: geometric mean = %.2f, arithmetic mean = %.2f\\n\",\n              arm, exp(mu_arm), arith_arm))\n}\n\n# 6. 
Group-specific smearing (correct under heteroscedasticity)\ncat(\"\\nGroup-specific smearing factors:\\n\")\nfor (arm in c(0, 1)) {\n  mask <- treat == arm\n  sf_arm <- mean(exp(resids[mask]))\n  arith_arm <- exp(alpha_hat + beta_hat * arm) * sf_arm\n  cat(sprintf(\"  Arm %d: smearing factor = %.4f, arithmetic mean = %.2f\\n\",\n              arm, sf_arm, arith_arm))\n}\n\n# 7. Ratio CI from log-OLS (appropriate for geometric-mean ratio estimands)\nci <- confint(fit)[\"treat\", ]\ncat(sprintf(\"\\nGeometric-mean ratio: %.3f (95%% CI: %.3f to %.3f)\\n\",\n            exp(beta_hat), exp(ci[1]), exp(ci[2])))\ncat(\"Note: this CI is for the geometric-mean ratio only.\\n\")\ncat(\"For the arithmetic-mean ratio, bootstrap the smeared estimates per arm.\\n\")\n\n# 8. Parametric log-normal fit via MASS::fitdistr\nif (requireNamespace(\"MASS\", quietly = TRUE)) {\n  fit_ln <- MASS::fitdistr(costs, \"lognormal\")\n  mu_fit    <- fit_ln$estimate[\"meanlog\"]\n  sigma_fit <- fit_ln$estimate[\"sdlog\"]\n  cat(sprintf(\"\\nLog-normal MLE: mu = %.3f, sigma = %.3f\\n\", mu_fit, sigma_fit))\n  cat(sprintf(\"  Arithmetic mean: exp(mu + sigma^2/2) = %.2f\\n\",\n              exp(mu_fit + sigma_fit^2 / 2)))\n  cat(sprintf(\"  Geometric mean:  exp(mu) = %.2f\\n\", exp(mu_fit)))\n}",
        "description": "Log-OLS with Duan smearing in base R. Demonstrates the lm() function on log-transformed\ncosts, residual extraction, pooled and group-specific smearing factors, geometric mean\nreporting with ratio CIs (appropriate for lab/PK data), and a parametric log-normal fit\nvia fitdistrplus or base MASS. Uses the same motivating dataset and simulated treatment\ncomparison as the Python implementation.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* ── Create motivating three-patient dataset ── */\ndata work.costs;\n  input patient_id $ total_cost;\n  log_cost = log(total_cost);     /* natural log */\n  datalines;\nP1 100\nP2 1000\nP3 10000\n;\nrun;\n\n/* 1. Arithmetic mean and log-scale mean (from which geometric mean = exp(mean_log)) */\nproc means data=work.costs mean;\n  var total_cost log_cost;\n  /* Geometric mean = exp(mean of log_cost); arithmetic mean reported directly */\nrun;\n\n/* 2. Log-OLS on the three-patient data (intercept-only = mean of log_cost) */\nproc reg data=work.costs outest=work.est_small;\n  model log_cost = ;   /* intercept only; fitted value = mean of log_cost = 3.0 (log10) */\n  output out=work.resid_small r=resid p=fitted;\nrun;\n\n/* 3. Duan smearing factor: back-transform each residual, then average */\ndata work.smearing_small;\n  set work.resid_small;\n  exp_resid = exp(resid);   /* back-transform each OLS residual from log scale */\nrun;\n\nproc means data=work.smearing_small mean;\n  var exp_resid;   /* smearing factor = mean of exp(residuals) */\n  /* Multiply the geometric mean (exp(fitted)) by this smearing factor */\n  /* to obtain the Duan smeared arithmetic mean estimate              */\nrun;\n\n/* ── Simulated 200-patient treatment comparison ── */\ndata work.sim;\n  call streaminit(42);\n  do i = 1 to 200;\n    treat   = (i > 100);                           /* 0 = control, 1 = treated */\n    log_y   = 6.0 + 0.405 * treat + rand(\"Normal\", 0, 1.0);\n    y       = exp(log_y);\n    output;\n  end;\n  drop i;\nrun;\n\n/* 4. Log-OLS with treatment indicator */\nproc reg data=work.sim outest=work.est_sim;\n  model log_y = treat;\n  /* Coefficients: Intercept = alpha-hat, treat = beta-hat              */\n  /* exp(beta-hat) is the geometric-mean ratio (NOT the arithmetic-mean */\n  /* ratio unless residuals are homoscedastic across arms)               */\n  output out=work.resid_sim r=resid p=fitted;\nrun;\n\n/* 5. 
Back-transform each OLS residual for Duan smearing */\ndata work.resid_sim2;\n  set work.resid_sim;\n  exp_resid = exp(resid);\nrun;\n\n/* 6. Pooled smearing factor */\nproc means data=work.resid_sim2 mean;\n  var exp_resid;\n  title \"Pooled smearing factor (assumes homoscedastic log-scale residuals)\";\nrun;\n\n/* 7. Group-specific smearing factors (use when heteroscedasticity is suspected) */\nproc means data=work.resid_sim2 mean;\n  class treat;\n  var exp_resid;\n  title \"Group-specific smearing factors by treatment arm\";\n  /* Arithmetic mean for arm k = exp(alpha-hat + beta-hat * k) * smearing_factor_k */\nrun;\n\n/* 8. Print the coefficients saved by outest; exp(treat) is the geometric-mean ratio. */\n/*    No CI is computed here; the CLB option on the MODEL statement in step 4 would   */\n/*    add confidence limits for beta-hat, which can then be exponentiated.            */\nproc print data=work.est_sim;\n  var Intercept treat;\n  title \"Log-OLS coefficients (beta-hat = log geometric-mean ratio for treatment)\";\nrun;",
        "description": "Log-OLS with Duan smearing in SAS using PROC REG for the log-scale regression and a\nDATA step to compute pooled and group-specific smearing factors. PROC MEANS provides the\narithmetic mean and a log-scale mean for geometric mean derivation. Uses the same three-\npatient motivating dataset and a simulated 200-observation treatment comparison dataset.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Q[Outcome is strictly positive<br/>and right-skewed?] --> Yes[Yes — log-normal candidate]\n  Q --> No[No — choose a different<br/>distributional model]\n  Yes --> Zero[Outcome has zeros?]\n  Zero --> ZeroY[\"Yes — two-part model:<br/>logistic for any use +<br/>log-OLS or gamma GLM<br/>for conditional cost\"]\n  Zero --> ZeroN[No zeros]\n  ZeroN --> Estimand[Target estimand?]\n  Estimand --> GeoMean[\"Geometric mean / fold-change<br/>(lab titers, PK, ratio CIs):<br/>log-OLS; back-transform with exp();<br/>no smearing needed\"]\n  Estimand --> ArithMean[\"Arithmetic mean<br/>(budget impact, cost-effectiveness)\"]\n  ArithMean --> Homoscedastic[\"Homoscedastic residuals?\"]\n  Homoscedastic --> HomoY[\"Yes — log-OLS +<br/>pooled Duan smearing factor\"]\n  Homoscedastic --> HomoN[\"No (heteroscedastic) —<br/>Gamma GLM with log link<br/>(preferred primary method)\"]",
        "caption": "Decision tree for selecting the appropriate estimator when the outcome appears log-normal. The key branch is the target estimand: geometric mean (lab/PK) vs arithmetic mean (cost budget). If arithmetic mean is the target, the secondary branch is homoscedasticity of the log-scale residuals.",
        "alt_text": "Flowchart branching on whether the outcome has zeros, whether the estimand is geometric or arithmetic mean, and whether residuals are homoscedastic, routing to the appropriate estimator from two-part model through log-OLS with smearing to gamma GLM.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "requires",
        "target_slug": "descriptive-statistics",
        "notes": "Distributional shape assessment — histograms, Q-Q plots, and log-scale histograms — must precede any log-normal modeling decision. Descriptive statistics also provides the arithmetic mean baseline against which the smeared estimate can be checked."
      },
      {
        "relation_type": "see_also",
        "target_slug": "generalized-linear-models",
        "notes": "The gamma GLM with log link is the primary alternative to log-OLS for cost data when the estimand is the arithmetic mean; it directly models E[Y] without a retransformation step and is robust to heteroscedasticity that invalidates the Duan smearing estimator."
      },
      {
        "relation_type": "see_also",
        "target_slug": "gamma-distribution",
        "notes": "The gamma distribution is the natural GLM family for right-skewed positive continuous outcomes; as a sibling primitive in this distributions set, it provides the variance structure (variance proportional to mean squared) that makes the gamma GLM preferred over log-OLS for arithmetic mean estimation in HEOR cost analyses."
      },
      {
        "relation_type": "used_with",
        "target_slug": "healthcare-costs-pppm-pppy-pmpm",
        "notes": "Per-member-per-month and per-patient-per-year cost descriptors are the natural downstream reporting unit for smeared arithmetic mean estimates; log-OLS or gamma GLM produces the adjusted mean cost that is then expressed as a PMPM or PPPY figure for payer audiences."
      },
      {
        "relation_type": "see_also",
        "target_slug": "accelerated-failure-time-models",
        "notes": "Accelerated failure time (AFT) survival models assume that the log of survival time follows a linear model; when the AFT error distribution is log-normal, the model is equivalent to log-OLS on log(time), and the same retransformation considerations apply when moving from the log-time scale back to the original time scale."
      }
    ],
    "aliases": [
      "lognormal",
      "log transformation",
      "geometric mean",
      "Duan smearing",
      "smearing estimator",
      "retransformation bias",
      "log-OLS"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "log-rank-test",
    "name": "Log-Rank Test",
    "short_definition": "A nonparametric hypothesis test that compares the full survival-time distributions of two or more groups by accumulating observed-minus-expected event counts across every pooled risk set; it is the default significance test for Kaplan-Meier curve comparisons and is mathematically equivalent to the score test from an unadjusted Cox proportional hazards model, but it returns no effect size — a hazard ratio, RMST difference, or median survival comparison must always accompany it.",
    "long_description": "**What the log-rank test measures**\n\nThe log-rank test answers one specific question: could the observed survival curves of two (or\nmore) groups have been drawn from populations with identical underlying hazard functions? It\ndoes this by constructing, at each distinct event time in the pooled data, a 2×2 table of\nobserved versus expected events — exactly the logic of a Mantel-Haenszel stratified chi-square,\napplied at every event time separately and then summed. The expected count for a group is\nsimply its share of the total at-risk pool multiplied by the total events at that moment. A\nsmall p-value signals that the observed survival curves are unlikely to be equal under the null;\nit carries no information about how different they are or by how much. The test statistic has\none degree of freedom for two groups, and k − 1 degrees of freedom for k groups. The most\nimportant property to state at the top: the log-rank test is not an effect estimator — you\ncannot read a hazard ratio or a median survival difference from its output, and it must always\nbe paired with a companion effect measure before results are communicated.\n\n**Mechanical construction: the risk set at each event time**\n\nAt each distinct event time t_j, the test records the following quantities for a two-group\ncomparison (Group A and Group B):\n\n- n_Aj: subjects in Group A still at risk (event-free and not yet censored or lost)\n- n_Bj: subjects in Group B still at risk\n- n_j = n_Aj + n_Bj: the combined risk set at t_j\n- d_j: the total number of events (from either group) observed at exactly t_j\n\nUnder the null hypothesis (equal hazard functions in both groups), events at time t_j should\nfall proportionally to risk-set membership. 
The expected events for Group A are therefore:\n\nE_Aj = (n_Aj / n_j) * d_j\n\nThis is the Mantel-Haenszel null expectation: if treatment had no effect, the probability\nthat any given event belongs to Group A equals n_Aj / n_j, the proportion of the current\nrisk pool that Group A represents.\n\nThe variance contribution at time t_j (for d_j < n_j; hypergeometric variance) is:\n\nV_j = (n_Aj * n_Bj * d_j * (n_j - d_j)) / (n_j * n_j * (n_j - 1))\n\nSumming across all event times: O_A = sum of observed events in A, E_A = sum of expected\nevents in A, and V = sum of V_j. The log-rank test statistic is:\n\nchi-square = (O_A - E_A) * (O_A - E_A) / V\n\nwhich under H0 follows an approximate chi-squared distribution with one degree of freedom.\nAn O_A substantially below E_A means Group A had fewer events than chance alone would predict\n— consistent with a treatment benefit. An O_A above E_A means Group A fared worse.\n\n**Relationship to the Cox proportional hazards score test**\n\nThis connection is fundamental and frequently overlooked in practice. The score test from a\nCox proportional hazards model with a single binary group indicator and no tied events produces\na test statistic that is algebraically identical to the log-rank statistic. In other words,\nthe log-rank test is the unadjusted Cox score test. This has three practical consequences.\n\nFirst, adding covariates to the Cox model — adjusting for age, comorbidities, disease severity,\nor a propensity score — yields a different and generally more powerful test than the raw\nlog-rank, because the within-covariate-stratum noise is removed. The resulting covariate-\nadjusted Cox HR will differ in magnitude from the crude log-rank-implied HR even when\ncovariates are perfectly balanced between arms, due to the non-collapsibility of the hazard\nratio (a prognostic covariate adjusts the HR away from the null even without confounding).\n\nSecond, the unadjusted log-rank assumes no confounding. 
If Groups A and B differ on any\nbaseline characteristic that predicts the outcome, the log-rank mixes the treatment effect\nwith confounding bias — it cannot separate them. In an observational study, this makes the\nunadjusted log-rank uninformative about the treatment effect.\n\nThird, the stratified log-rank test (see below) is the exact analogue of including a\nstratification variable in a Cox model via the STRATA statement — it partitions the risk sets\nby stratum and pools within-stratum (O − E) contributions, removing between-stratum baseline\nhazard variation without estimating it parametrically.\n\n**Weighting variants and when they matter**\n\nThe standard log-rank assigns equal weight to each event time. Under proportional hazards this\nis optimal, but two major weighting families are available when the treatment effect is expected\nto vary over follow-up.\n\n*Wilcoxon-Gehan (Breslow) weights*: each event time is weighted by n_j, the current risk-set\nsize. Early event times dominate because the risk set is large at the beginning; late event\ntimes contribute less as the at-risk population shrinks. This makes the test sensitive to\nearly differences and relatively insensitive to late differences. In SAS PROC LIFETEST, it is\nthe WILCOXON option. In R, survdiff with rho = 1 gives the closely related Peto-Peto\nmodification, which weights by the pooled survival estimate rather than by n_j. Use it when\ntreatment effects, if real, are expected to appear early — for example, when comparing a\nshort-course antibiotic to placebo for a rapid-onset outcome.\n\n*Fleming-Harrington G(rho, gamma) class* (Harrington and Fleming, 1982): a two-parameter\nfamily of weights w_j = S(t_j-) raised to rho, times (1 - S(t_j-)) raised to gamma, where\nS(t_j-) is the left-continuous Kaplan-Meier estimate just before t_j. Setting rho = 1 and\ngamma = 0 gives the Peto-Peto (Prentice) form of the Wilcoxon test (early emphasis), which\nmatches the Gehan n_j weights only approximately once censoring is present. Setting rho = 0\nand gamma = 1 produces G(0,1) weights that up-weight late events — events that occur after\nthe survival function has already fallen substantially. 
Setting rho = gamma = 0 recovers the standard\nlog-rank. In immuno-oncology, checkpoint inhibitors and therapeutic cancer vaccines often show\ndelayed Kaplan-Meier separation: the curves run together early (immune response takes months\nto mature) then diverge at a late plateau. A standard log-rank on such late-separating\ncurves averages the null early period with the effective late period, severely reducing power.\nThe G(0,1) test and the MaxCombo statistic (the maximum over several weighted log-rank\nstatistics) are pre-specified sensitivity or co-primary analyses in many oncology registration\ntrials precisely for this reason. In lifelines (Python), these weights are available through\nthe weightings and p/q parameters of logrank_test.\n\n**Power loss under non-proportional hazards and the route to RMST**\n\nThe log-rank test is the locally most powerful rank test when hazards are proportional — when\nthe hazard ratio is the same number at every moment of follow-up. Under crossing hazards —\nwhere Group A is initially at higher risk but later at lower risk — the positive and negative\n(O_Aj - E_Aj) contributions at different event times partially cancel in the sum. The log-rank\ncan then have very little power even when the two survival curves are clearly non-identical\nover the full follow-up. This cancellation is the core motivation for the restricted mean\nsurvival time (RMST) as an alternative estimand. The RMST difference at a pre-specified horizon\ntau (e.g., hospitalization-free months over 3 years) is the area between two survival curves\nup to tau. It is estimable directly from the Kaplan-Meier curves without any PH assumption, is\nin interpretable time units (months of benefit or loss), and retains reasonable power under\ncrossing hazards. 
When a protocol analysis of PH assumptions reveals clear violation — assessed\nby log-log plots, Schoenfeld residual slopes, or time-by-covariate interaction terms in Cox —\na pre-specified RMST analysis or weighted log-rank sensitivity is the appropriate companion.\n\n**The log-rank test returns no effect size — always pair with HR, medians, or RMST**\n\nThe chi-square statistic and its p-value describe evidence against the null of equal survival;\nthey contain no information about the magnitude or clinical relevance of the difference. A\nlog-rank p = 0.001 could arise from a clinically trivial HR of 0.98 in a 200,000-patient\nclaims database or from a large HR of 0.45 in a 280-patient registration trial — the p-value\nalone cannot distinguish these. Every reported log-rank test result must be accompanied by:\n- A hazard ratio with 95% CI from Cox regression, to quantify the relative rate of events\n- Median survival times (with 95% CIs) from the Kaplan-Meier estimator\n- An RMST difference at a pre-specified horizon when proportional hazards is not defensible\nRegulatory guidance from both FDA and EMA explicitly requires effect estimates alongside\nsignificance tests for time-to-event primary endpoints. A bare p-value in a clinical study\nreport is incomplete and will draw a review query.\n\n**Stratified log-rank test**\n\nWhen a known prognostic variable (disease stage, ECOG performance score, geographic region\nin a multi-site registry, or a stratification variable in the randomization scheme) may\nconfound the survival comparison, the stratified log-rank computes within-stratum (O_Aj − E_Aj)\nand V_j contributions and sums them across strata. This is the survival-analysis analogue of\nMantel-Haenszel stratification for binary outcomes: between-stratum variation in the baseline\nhazard function is removed without being parameterized, yielding a valid within-stratum\ncomparison. 
In an RCT with stratified randomization (e.g., stratified by center and disease\nstage), the stratified log-rank is the prespecified primary test because it respects the\nstratified design and has greater power than the unstratified version. In an observational\nstudy after exact matching on a discrete confounder (e.g., matched pairs), the stratified\nlog-rank treats each matched pair as a stratum.\n\n**Claims-RWE: unadjusted log-rank on confounded cohorts is descriptive only**\n\nIn a treatment-naive administrative claims database, patients who initiate the biologic versus\nthe standard of care differ systematically on age, disease severity, comorbidities, prior\nmedication history, and clinical frailty — some of these differences are observable in the\nclaims and some are not. Running an unadjusted log-rank test on these unmatched groups\nproduces a chi-square and an implied crude HR that reflect the combined effect of treatment\nassignment and confounding by indication. This is not analytically wrong as a descriptive\ncharacterization of the raw cohort, but it cannot be interpreted as evidence about the causal\ntreatment effect. The appropriate analysis in an active-comparator new-user design routes\nthrough propensity score trimming and weighting (IPTW-Cox) or matching before any HR or\nlog-rank p-value is attributed to the treatment. 
An unadjusted log-rank result in a\nconfounded observational dataset belongs in a supplementary descriptive table, explicitly\nlabeled as unadjusted and not interpretable as a causal estimate.\n\n**Pros, cons, and trade-offs**\n\nPros:\n- Distribution-free: valid under any shape of the hazard function; requires only independent\n  and non-informative censoring (the censoring mechanism does not depend on the unobserved\n  survival time), not a parametric family for h(t).\n- Optimal power under proportional hazards: when the true HR is constant across follow-up, the\n  standard log-rank is the most powerful rank test in the class of rank-based statistics.\n- Widely implemented and regulator-accepted: survdiff (R), PROC LIFETEST (SAS), and lifelines\n  (Python) produce equivalent results; FDA and EMA reviewers expect the log-rank for survival\n  primary endpoints in registration packages.\n- Stratified extension: directly accommodates stratification variables by summing within-\n  stratum (O − E) contributions without a parametric model for the stratification effect.\n\nCons:\n- No effect size: the test statistic alone does not quantify the magnitude of the survival\n  difference; must always be paired with a hazard ratio, median survival difference, or RMST.\n- Power drops under non-proportional hazards: under crossing or converging hazard functions\n  the standard log-rank can be substantially underpowered; Fleming-Harrington weighted tests\n  or RMST are better primary or sensitivity choices.\n- Confounding blind: the unadjusted log-rank is the Cox score test without covariate\n  adjustment; any baseline imbalance between groups contaminates the result.\n- Very large n: in claims databases with hundreds of thousands of patients, the log-rank will\n  declare p < 0.001 for clinically trivial survival differences; effect size becomes the only\n  meaningful output.\n\n**When to use**\n\nUse the log-rank test when you have two or more groups with time-to-event outcomes 
subject to\nright-censoring and want to formally test the null hypothesis that all groups share the same\nsurvival function. The proportional hazards assumption should be at least approximately\ndefensible (checked with log-log plots, Schoenfeld residuals, or visual inspection of the\nKM curves). Use the standard log-rank as the omnibus significance test in a confirmatory\nRCT with a pre-specified time-to-event primary endpoint, always reporting HR and medians\nalongside the p-value. Use the stratified version when stratified randomization or a discrete\nknown confounder in observational data warrants within-stratum pooling. Use weighted variants\nwhen the timing of the effect is mechanistically expected to be uneven: Wilcoxon-Gehan for\nearly effects, Fleming-Harrington G(0,1) for delayed effects such as those seen in\nimmuno-oncology. Pre-specify the weighting choice before unblinding.\n\n**When NOT to use**\n\nDo not use the log-rank test as the sole inferential summary: always accompany it with an\neffect estimate and confidence interval. Do not interpret an unadjusted log-rank result from a\nconfounded observational cohort as a causal treatment effect; use an adjusted Cox model, an\nIPTW-weighted Cox model, or a stratified log-rank with covariate control. Do not apply the\nstandard unweighted log-rank when survival curves cross or converge (non-proportional hazards):\nthe (O − E) contributions at different time points partially cancel, so the test remains valid\nunder the null but can be severely underpowered; pre-specify RMST or a weighted test as the\nprimary analysis instead. Do not apply the log-rank to a competing-risks outcome by censoring\ncompeting events and then reading the result as a comparison of absolute risk: censoring a\ncompeting event (e.g., death when studying hospitalization in elderly patients) produces a test\nof the cause-specific hazard and, if converted to cumulative incidence, overstates absolute\nrisk when mortality differs by arm. In competing-risks settings, use a Fine-Gray model for\ncumulative incidence or an explicitly labeled cause-specific analysis. 
In very large observational datasets, treat the\nlog-rank p-value as a nuisance — statistical significance is guaranteed for any non-zero\ndifference; report RMST difference, HR, and median survival with clinical interpretation.\n\n**Interpreting the output**\n\nFrom the worked example: 6 biologic-treatment patients (Group A) and 6 standard-care patients\n(Group B) followed for up to 12 months. Three event times occurred: month 3 (B1 hospitalized),\nmonth 5 (B2 hospitalized), and month 9 (A3 hospitalized). The observed and expected counts\nsum to O_A = 0 + 0 + 1 = 1 and E_A = 0.5 + 0.5 + 0.5 = 1.5, with total variance V = 0.75.\nchi-square = 0.25 / 0.75 = 1/3 ≈ 0.333, 1 df, p ≈ 0.564.\n\n*(1) Formal interpretation.* The log-rank test does not reject the null hypothesis that the\nbiologic and standard-care groups share equal survival functions (chi-square with 1 df ≈ 0.333,\np ≈ 0.564). Under repeated sampling from populations with identical hazard functions, a\ntest statistic this extreme or more extreme would occur approximately 56% of the time by\nchance. Failure to reject does not imply equal survival; the study has only 3 total events\nand is severely underpowered. The sum O_A - E_A = -0.5 indicates Group A experienced fewer\nhospitalizations than expected under the null — directionally consistent with biologic\nbenefit — but the variance is large relative to this deviation. No effect size can be read\nfrom the test statistic alone: the HR and median survival require a separate Cox model and\nKaplan-Meier estimator on the same data.\n\n*(2) Practical interpretation.* Group A (biologic) had 1 hospitalization while Group B\n(standard care) had 2, over a 12-patient 12-month illustration. In practice, a time-to-event\nstudy with 3 total events provides essentially no inferential value; adequate power for a\nlog-rank test typically requires 50–200 events depending on the expected HR. 
The log-rank\nresult must be accompanied by a Cox-derived HR with confidence interval (here the maximum\npartial likelihood estimate is 0.5, with a very wide CI spanning near-zero to several-fold\nbecause there are only 3 events) and Kaplan-Meier median survival estimates (not reached in\neither group here, because neither Kaplan-Meier curve falls to 0.5 within 12 months). A payer\nor HTA reviewer needs to know whether the biologic provides, say, 2 additional\nhospitalization-free months; that requires an RMST difference or median comparison, not the\nchi-square alone.",
    "primary_category": "Inferential_Statistics",
    "tags": [
      "survival-analysis",
      "time-to-event",
      "hypothesis-testing",
      "log-rank",
      "Kaplan-Meier",
      "censoring",
      "non-parametric",
      "Mantel-Haenszel",
      "Cox",
      "proportional-hazards",
      "oncology",
      "HEOR",
      "foundations"
    ],
    "applies_to_study_types": [
      "cohort_retrospective",
      "cohort_prospective",
      "rct",
      "active_comparator_new_user",
      "target_trial_emulation",
      "registry_study",
      "claims_analysis",
      "ehr_study"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "primary",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1136/bmj.328.7447.1073",
        "url": "https://doi.org/10.1136/bmj.328.7447.1073",
        "citation_text": "Bland JM, Altman DG. The logrank test. BMJ. 2004;328(7447):1073.",
        "year": 2004,
        "authors_short": "Bland & Altman",
        "notes": "The definitive BMJ Statistics Notes entry on the log-rank test. Explains the observed-versus-expected construction clearly for a clinical audience, covers the relationship to the Mantel-Haenszel chi-square, and is the most widely cited introductory reference for the test in HEOR and clinical research."
      },
      {
        "role": "foundational",
        "doi": "10.2307/2344317",
        "url": "https://doi.org/10.2307/2344317",
        "citation_text": "Peto R, Peto J. Asymptotically efficient rank invariant test procedures. Journal of the Royal Statistical Society Series A. 1972;135(2):185-207.",
        "year": 1972,
        "authors_short": "Peto & Peto",
        "notes": "Establishes the asymptotic optimality of rank invariant test procedures for censored survival data, providing the theoretical foundation for both the standard log-rank and the Wilcoxon-Gehan weighted variant. Essential reading for understanding why the log-rank is preferred over simpler comparisons of event counts."
      },
      {
        "role": "extend",
        "doi": "10.1093/biomet/69.3.553",
        "url": "https://doi.org/10.1093/biomet/69.3.553",
        "citation_text": "Harrington DP, Fleming TR. A class of rank test procedures for censored survival data. Biometrika. 1982;69(3):553-566.",
        "year": 1982,
        "authors_short": "Harrington & Fleming",
        "notes": "Introduces the G(rho, gamma) family of weighted log-rank tests that encompasses the standard log-rank, Wilcoxon-Gehan, and the G(0,1) late-emphasis variant as special cases. Foundational for understanding when and why weighting matters in immuno-oncology and other settings with expected delayed treatment effects."
      }
    ],
    "plain_language_summary": "The log-rank test checks whether two groups of patients (for example, those given a new drug versus those given standard care) have different survival curves — meaning different timing of events like hospitalization, death, or disease progression. It works by comparing, at each moment when an event happens, how many events actually occurred in each group versus how many would be expected if neither group had any advantage. Crucially, the log-rank test only produces a yes/no answer about whether the curves are different (its p-value), not a number telling you how much better or worse one group fared — so it must always be reported alongside a hazard ratio or median survival time that quantifies the actual gap.",
    "key_terms": [
      {
        "term": "risk set",
        "definition": "The group of patients who have not yet had the event and have not yet been lost to follow-up at a given moment in time; these are the patients \"at risk\" of having the next event."
      },
      {
        "term": "censoring",
        "definition": "When a patient's follow-up ends before the event of interest occurs — for example, because the study ended or the patient transferred care — their exact event time is unknown but we know it is after their last observation."
      },
      {
        "term": "observed events (O)",
        "definition": "The actual number of events (deaths, hospitalizations, etc.) recorded in a group during the study."
      },
      {
        "term": "expected events (E)",
        "definition": "The number of events the log-rank test predicts a group would have had if both groups had identical risk at every point in time; calculated as the group's share of the risk pool multiplied by the total events at each event time."
      },
      {
        "term": "proportional hazards",
        "definition": "The assumption that the ratio of event rates between two groups stays constant over time; when this holds, the log-rank test is the most powerful way to detect a difference between survival curves."
      },
      {
        "term": "hazard ratio",
        "definition": "A number that summarizes how much faster or slower one group experiences events compared to another — for example, an HR of 0.60 means the treated group has events at 60% the rate of the control group at each moment in time."
      }
    ],
    "worked_example": {
      "scenario": "A 12-month observational follow-up study compares time to first hospitalization in 6 patients receiving a biologic treatment (Group A) versus 6 patients receiving standard care (Group B). One biologic patient transferred to another facility (censored at month 4) and another left the study area (censored at month 6). Two standard-care patients were hospitalized, at months 3 and 5 respectively. One biologic patient was hospitalized at month 9. All remaining patients reached the 12-month study end without a hospitalization. We apply the log-rank test to determine whether the survival curves differ.",
      "dataset": {
        "caption": "Per-patient time-to-event data. time_months = months from study entry to hospitalization or last follow-up; status = 1 if hospitalized (event), 0 if censored (transferred care or reached 12-month study end without event).",
        "columns": [
          "patient_id",
          "group",
          "time_months",
          "status"
        ],
        "rows": [
          [
            "A1",
            "biologic",
            4,
            0
          ],
          [
            "A2",
            "biologic",
            6,
            0
          ],
          [
            "A3",
            "biologic",
            9,
            1
          ],
          [
            "A4",
            "biologic",
            12,
            0
          ],
          [
            "A5",
            "biologic",
            12,
            0
          ],
          [
            "A6",
            "biologic",
            12,
            0
          ],
          [
            "B1",
            "standard",
            3,
            1
          ],
          [
            "B2",
            "standard",
            5,
            1
          ],
          [
            "B3",
            "standard",
            12,
            0
          ],
          [
            "B4",
            "standard",
            12,
            0
          ],
          [
            "B5",
            "standard",
            12,
            0
          ],
          [
            "B6",
            "standard",
            12,
            0
          ]
        ]
      },
      "steps": [
        "Identify the distinct event times: months 3, 5, and 9 (the three months at which at least one hospitalization occurred). Censored exit times (4, 6, 12) are NOT event times — they reduce the risk set but do not contribute observed events.",
        "Month 3: At-risk biologic = 6 (A1 is still in the study at month 3), at-risk standard = 6. Total at risk n = 12, events d = 1 (B1 hospitalized). Expected events for biologic group: 6/12 * 1 = 0.5. Observed events in biologic group: 0. Contribution to (O - E): 0 - 0.5 = -0.5.",
        "Month 5: A1 was censored at month 4, so biologic group now has 5 at risk. B1 had the event at month 3, so standard group has 5 at risk. Total n = 10, events d = 1 (B2 hospitalized). Expected for biologic: 5/10 * 1 = 0.5. Observed: 0. Contribution: -0.5.",
        "Month 9: A2 was censored at month 6, so biologic group now has 4 at risk. B1 and B2 both had events, so standard group has 4 at risk. Total n = 8, events d = 1 (A3 hospitalized). Expected for biologic: 4/8 * 1 = 0.5. Observed: 1. Contribution: 1 - 0.5 = 0.5.",
        "Sum the observed and expected counts across all three event times. Total observed in biologic: 0 + 0 + 1 = 1. Total expected in biologic: 0.5 + 0.5 + 0.5 = 1.5. Difference: 1 - 1.5 = -0.5. The biologic group had fewer hospitalizations than expected under the null — directionally consistent with a treatment benefit.",
        "Compute the variance V_j at each event time using the hypergeometric variance formula V_j = n_A * n_B * d * (n - d) / (n * n * (n - 1)). At month 3 with n=12, n_A=6, n_B=6, d=1: V = 6 * 6 * 1 * 11 / (144 * 11) = 0.25. At month 5 with n=10, n_A=5, n_B=5, d=1: V = 5 * 5 * 1 * 9 / (100 * 9) = 0.25. At month 9 with n=8, n_A=4, n_B=4, d=1: V = 4 * 4 * 1 * 7 / (64 * 7) = 0.25. Total variance: 0.25 + 0.25 + 0.25 = 0.75.",
        "Compute the log-rank test statistic. chi-square = (O - E) squared divided by V. (O - E) = -0.5; (O - E) squared = 0.25; chi-square = 0.25 / 0.75 = 1/3, approximately 0.333 on 1 degree of freedom, giving p approximately 0.564. Fail to reject H0 — but this example has only 3 total events and is illustrative, not powered for inference.",
        "The log-rank test returns no effect size. To report the magnitude of the difference, fit a Cox proportional hazards model on the same data (yielding a crude HR with 95% CI) and compute the Kaplan-Meier median hospitalization-free survival for each group. In a real study, always pair the log-rank p-value with these companion measures."
      ],
      "result": "O_A = 0 + 0 + 1 = 1. E_A = 0.5 + 0.5 + 0.5 = 1.5. Difference: 1 - 1.5 = -0.5. Total variance V = 0.25 + 0.25 + 0.25 = 0.75. chi-square = 0.25 / 0.75 = 1/3 ≈ 0.333, 1 df, p ≈ 0.564. Group A (biologic) observed fewer hospitalizations than expected (1 vs 1.5), Group B (standard) observed more (2 vs 1.5). The log-rank test does not reject equal survival (p ≈ 0.564); the study has only 3 events — inadequate power. No effect size is produced by the log-rank; pair with HR and Kaplan-Meier medians.",
      "timeline_spec": {
        "title": "Patient follow-up for log-rank test worked example (months 0-12)",
        "window": {
          "start_day": 0,
          "end_day": 12,
          "label": "12-month follow-up horizon"
        },
        "events": [
          {
            "label": "B1 (event month 3)",
            "start_day": 0,
            "length_days": 3,
            "quantity": "hospitalization"
          },
          {
            "label": "B2 (event month 5)",
            "start_day": 0,
            "length_days": 5,
            "quantity": "hospitalization"
          },
          {
            "label": "A1 (censored month 4)",
            "start_day": 0,
            "length_days": 4,
            "quantity": "censored"
          },
          {
            "label": "A2 (censored month 6)",
            "start_day": 0,
            "length_days": 6,
            "quantity": "censored"
          },
          {
            "label": "A3 (event month 9)",
            "start_day": 0,
            "length_days": 9,
            "quantity": "hospitalization"
          },
          {
            "label": "A4-A6, B3-B6 (censored month 12)",
            "start_day": 0,
            "length_days": 12,
            "quantity": "censored at study end"
          }
        ],
        "spans": [
          {
            "kind": "followup",
            "start_day": 0,
            "end_day": 3,
            "label": "Before t=3: n=12 at risk (6A, 6B)"
          },
          {
            "kind": "gap",
            "start_day": 3,
            "end_day": 5,
            "label": "t=3: B1 event. E_A=0.5, O_A=0"
          },
          {
            "kind": "followup",
            "start_day": 5,
            "end_day": 9,
            "label": "t=5: B2 event. A1 censored -> 5A, 5B at risk. E_A=0.5, O_A=0"
          },
          {
            "kind": "gap",
            "start_day": 9,
            "end_day": 12,
            "label": "t=9: A3 event. A2 censored -> 4A, 4B at risk. E_A=0.5, O_A=1"
          }
        ],
        "result": {
          "label": "O_A=1, E_A=1.5; chi-square=1/3, p=0.564; no effect size produced",
          "value": 0.333
        }
      }
    },
    "prerequisites": [
      "censoring-mechanisms-rwe",
      "chi-square-test"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Two-group standard (unweighted) log-rank",
        "description": "The most common form: two arms (treatment vs control, or two active comparators), unweighted risk-set contributions at each event time. Requires the proportional hazards assumption to be at least approximately defensible. Reports chi-square with 1 df and p-value. Must be accompanied by HR from Cox regression and Kaplan-Meier medians for a complete analysis.",
        "edge_cases": [
          "Heavily tied event times (e.g., all events falling on a few discrete days) are handled by the (n - d) / (n - 1) tie factor in the hypergeometric variance, but V_j is 0 at any time where every patient still at risk has the event (d = n) or where only one group remains at risk; such times contribute nothing to the statistic. With very few distinct event times, prefer an exact permutation test to the chi-square approximation.",
          "At very large n (>100,000 patients in a claims database), even clinically trivial hazard differences can produce p < 0.001; report RMST difference and HR as the primary communication, with the log-rank p as supporting evidence."
        ],
        "data_source_notes": "Claims: compute per-patient follow-up from index date to first event or end of continuous enrollment (whichever comes first). EHR: use encounter date of first qualifying diagnosis code or lab threshold. Registry: often the cleanest time-to-event data with adjudicated endpoints; standard log-rank is appropriate with stratification on site if multi-center."
      },
      {
        "name": "Stratified log-rank",
        "description": "Partitions patients into strata by a known prognostic variable (disease stage, age band, geographic region, or the stratification factor used in randomization), computes within-stratum (O - E) and V_j contributions, and sums across strata. Removes between-stratum heterogeneity in baseline hazard without estimating it parametrically. Equivalent to the score test from a Cox model stratified on the same variable. In R: survdiff(Surv(t, e) ~ group + strata(stage)). In SAS: PROC LIFETEST with STRATA stage / GROUP=group. Use whenever the study protocol includes a stratified randomization or when a major discrete confounder is available.",
        "edge_cases": [
          "If a stratum has fewer than 5 events, its variance contribution is unstable; consider collapsing thin strata before analysis.",
          "In matched observational cohorts, treat each matched pair or matched set as a stratum. With 1:1 matching, a pair contributes only if its first observed exit is an event, and the stratified log-rank reduces to a sign (McNemar-type) test on which group's member fails first within each informative pair."
        ],
        "data_source_notes": "Claims: stratify on clinical severity proxies (hospitalization in the prior year, CCI quintile) if exact propensity matching was used. Registry: stratify on site or enrollment wave to remove administrative variation in baseline risk."
      },
      {
        "name": "Weighted log-rank: Wilcoxon-Gehan (early emphasis)",
        "description": "Weights each event time by n_j (current risk set size), so early events, which occur while the most patients are still at risk, carry greater influence. In SAS, request TEST=WILCOXON on the STRATA statement of PROC LIFETEST. In R, survdiff with rho = 1 gives the closely related Peto-Peto modification, which weights by the pooled survival estimate rather than by n_j. Use when mechanistic reasoning predicts that treatment benefit, if real, should emerge early in follow-up (e.g., a prophylactic antibiotic, a rapidly acting anticoagulant, or a short-course intervention with rapid onset).",
        "edge_cases": [
          "If censoring is heavy early (many patients lost to follow-up before the effect is expected), the Wilcoxon-Gehan test will be dominated by the sparse early risk sets and may perform poorly. Consider a sensitivity analysis with the standard log-rank."
        ],
        "data_source_notes": "Claims: appropriate when the outcome is a near-term adverse event (30-day readmission, in-hospital complication) and the study question is whether treatment reduces early risk."
      },
      {
        "name": "Fleming-Harrington G(0,1): late emphasis for delayed effects",
        "description": "The G(rho=0, gamma=1) member of the Fleming-Harrington family up-weights late event times where the survival function has fallen substantially, making it sensitive to differences that emerge after an initial null period. In lifelines (Python): logrank_test with weightings='fleming-harrington', p=0, q=1. Widely used in oncology for checkpoint inhibitor and cancer vaccine trials where the immune response takes months to establish. Must be pre-specified in the protocol; post-hoc selection of weights is a form of p-hacking.",
        "edge_cases": [
          "The G(0,1) test has low power when the true treatment effect is early; for uncertain timing, the MaxCombo statistic (combining standard, Wilcoxon, G(0,1), and G(0.5,0.5)) provides a robustness hedge at the cost of more complex critical value derivation."
        ],
        "data_source_notes": "Primarily used in RCTs and single-arm studies with oncology endpoints; rare in claims-based observational studies where administrative follow-up rarely extends long enough to capture delayed effects."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "cox-ph-regression",
        "pros_of_this": "The log-rank test is simpler, requires no model specification, and is the natural companion to a Kaplan-Meier plot; it is appropriate as the primary significance test in an RCT where covariate balance has been achieved by randomization.",
        "cons_of_this": "Cox regression adjusts for confounders, produces an HR with CI and a formal effect estimate, handles time-varying covariates, and is more powerful in observational studies where baseline covariate adjustment is necessary. In any observational or confounder-adjusted analysis, Cox regression replaces the unadjusted log-rank as the primary method.",
        "when_to_prefer": "Use the log-rank test as the primary test in a pre-specified RCT analysis alongside the Cox HR; use Cox regression as the primary method in observational studies and whenever covariate adjustment is required for validity."
      },
      {
        "compared_to": "restricted-mean-survival-time-rmst",
        "pros_of_this": "The log-rank test is the universally expected significance test in survival analysis, is implemented in all major software, and has the best power when proportional hazards holds.",
        "cons_of_this": "RMST does not require proportional hazards, remains well-powered under crossing curves, produces a directly interpretable effect estimate (months of benefit or harm), and is recommended by FDA and EMA as a companion or alternative to the log-rank when PH is violated. The RMST difference is the right primary estimand in immuno-oncology and whenever long-term survival benefit is the decision-relevant quantity.",
        "when_to_prefer": "Use the standard log-rank under proportional hazards; pre-specify RMST as the primary or co-primary estimand in oncology trials with delayed separation or when the study design or prior data suggest crossing or non-monotone hazard differences."
      },
      {
        "compared_to": "kaplan-meier-estimator",
        "pros_of_this": "The log-rank test provides a formal statistical test of the null hypothesis and a p-value for decision thresholds, while the Kaplan-Meier estimator is a descriptive tool that estimates (but does not formally test) the survival function.",
        "cons_of_this": "The Kaplan-Meier estimator provides the survival function itself (the curves that decision-makers, patients, and payers can read directly) and the median survival times used as the primary effect communication. The two are complementary: KM visualizes, log-rank tests.",
        "when_to_prefer": "Always use both: the Kaplan-Meier plot and medians as the primary communication of survival outcomes, the log-rank test as the significance test. Neither replaces the other."
      },
      {
        "compared_to": "parametric-vs-nonparametric-tests",
        "pros_of_this": "The log-rank test is specifically designed for censored time-to-event data; applying a standard nonparametric test (Mann-Whitney) to survival times discards censored observations or requires complete follow-up, which is almost never the case in real studies.",
        "cons_of_this": "General nonparametric tests (Mann-Whitney, Wilcoxon signed-rank) have broader application to non-survival continuous outcomes and produce the Hodges-Lehmann estimator; the log-rank test is specialized to right-censored survival data and requires a specific data structure.",
        "when_to_prefer": "Use the log-rank test whenever the outcome is time to event with censoring; use Mann-Whitney or Wilcoxon for continuous or ordinal outcomes without censoring."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Time zero is the index fill date (new-user design) or the service date of the qualifying procedure. Follow-up ends at the first outcome event (identified by ICD-10 code algorithm or procedure code), end of continuous enrollment (disenrollment or plan change), or administrative end of data, whichever comes first. Enrollment-driven censoring is the dominant failure mode in Medicare and commercial claims: a gap in Parts A/B/D enrollment, a plan change from FFS to MA, or a health-plan disenrollment must be treated as censoring, not as event-free survival. The unadjusted log-rank on an unmatched claims cohort is confounded by indication; always route to IPTW-weighted Cox or a matched stratified log-rank before attributing results to treatment. At very large n (>50,000 per arm), the log-rank can return p < 0.001 for clinically trivial differences; prioritize the HR and RMST difference for clinical communication.",
      "ehr": "Time zero is typically the date of diagnosis confirmation, first qualifying lab value, or first clinical encounter for the condition under study. Outcome events are identified from encounter diagnoses or procedure codes; validate against chart-abstracted endpoints when possible. Informative censoring from care-seeking behavior (sicker patients visit more and generate denser data, increasing the probability their outcome is captured) can bias the log-rank toward an apparent treatment benefit in the active-care arm; consider sensitivity analysis with IPCW. Stratify on site or health system in multi-institution EHR cohorts.",
      "registry": "Registry data often provide the cleanest time-to-event inputs for the log-rank test: adjudicated endpoints, standardized follow-up protocols, and linkage to vital statistics for mortality. Standard log-rank is appropriate for the primary analysis with stratification on known prognostic variables (disease stage, performance status). Verify administrative censoring rules — end of participation, lost-to-follow-up date — are uniformly applied across groups and sites.",
      "primary": "In primary data collection (RCTs, prospective cohorts), the log-rank test is the standard primary significance test for time-to-event primary endpoints. Pre-specify the stratification factors (matching the randomization scheme), any planned weighted variants for non-PH scenarios, and the companion RMST horizon in the statistical analysis plan before unblinding.",
      "linked": "In linked claims-EHR-registry cohorts, time zero and censoring rules must be consistent across data sources. Follow-up typically extends through the final administrative date in the linked dataset. Linkage completeness varies by patient and site; incomplete linkage can create differential censoring if more-linked patients are systematically healthier. Use the stratified log-rank or IPTW-Cox with careful censoring sensitivity analyses."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\nfrom lifelines.statistics import logrank_test, multivariate_logrank_test\nfrom lifelines import KaplanMeierFitter, CoxPHFitter\n\n# ── Worked-example dataset: 12 patients, 12-month follow-up ──────────────────\ndata = pd.DataFrame({\n    'patient_id': ['A1','A2','A3','A4','A5','A6','B1','B2','B3','B4','B5','B6'],\n    'group':      ['biologic']*6 + ['standard']*6,\n    'time':       [4, 6, 9, 12, 12, 12,  3,  5, 12, 12, 12, 12],\n    'event':      [0, 0, 1,  0,  0,  0,  1,  1,  0,  0,  0,  0],\n})\n\ngrp_a = data[data.group == 'biologic']\ngrp_b = data[data.group == 'standard']\n\n# ── 1. Standard (unweighted) log-rank test ────────────────────────────────────\nlr = logrank_test(\n    grp_a['time'], grp_b['time'],\n    event_observed_A=grp_a['event'],\n    event_observed_B=grp_b['event'],\n)\nlr.print_summary()\nprint(f\"Log-rank chi-square: {lr.test_statistic:.4f}\")\nprint(f\"p-value:             {lr.p_value:.4f}\")\nprint(\"NOTE: the log-rank test returns no effect size.\")\nprint(\"      Always report HR and KM medians alongside the p-value.\\n\")\n\n# ── 2. Kaplan-Meier medians (paired effect estimate) ─────────────────────────\nfor grp, subset in data.groupby('group'):\n    kmf = KaplanMeierFitter()\n    kmf.fit(subset['time'], event_observed=subset['event'], label=grp)\n    print(f\"  {grp}: KM median hospitalization-free survival = \"\n          f\"{kmf.median_survival_time_} months\")\n\n# ── 3. Cox HR (paired effect estimate) ───────────────────────────────────────\ncph = CoxPHFitter()\ndata_cox = data.copy()\ndata_cox['biologic'] = (data_cox['group'] == 'biologic').astype(int)\ncph.fit(data_cox[['time','event','biologic']], duration_col='time', event_col='event')\ncph.print_summary()\n\n# ── 4. 
Wilcoxon-Gehan weighted log-rank (emphasizes early differences) ────────\nlr_wt = logrank_test(\n    grp_a['time'], grp_b['time'],\n    event_observed_A=grp_a['event'],\n    event_observed_B=grp_b['event'],\n    weightings='wilcoxon',    # Gehan-Breslow: weights each event time by the number at risk\n)\nprint(f\"\\nWilcoxon-Gehan log-rank p: {lr_wt.p_value:.4f}\")\n\n# ── 5. Fleming-Harrington G(0,1): emphasizes late differences (IO context) ──\nlr_fh = logrank_test(\n    grp_a['time'], grp_b['time'],\n    event_observed_A=grp_a['event'],\n    event_observed_B=grp_b['event'],\n    weightings='fleming-harrington',\n    p=0, q=1,    # G(rho=0, gamma=1): down-weights early null period\n)\nprint(f\"Fleming-Harrington G(0,1) log-rank p: {lr_fh.p_value:.4f}\")\nprint(\"Use G(0,1) in oncology when delayed treatment effects are expected.\")\nprint(\"MUST be pre-specified; post-hoc weight selection inflates type-I error.\\n\")\n\n# ── 6. k-sample log-rank across groups (NOT a stratified test) ───────────────\n# multivariate_logrank_test compares all levels of 'group' in a single test; with\n# two groups it should reproduce test 1. It takes no strata argument, so adjusting\n# for a discrete baseline factor needs another route, e.g. CoxPHFitter.fit(strata=[...]).\nresults_mv = multivariate_logrank_test(\n    data['time'], data['group'], data['event']\n)\nprint(f\"k-sample log-rank across groups: p = {results_mv.p_value:.4f}\")",
        "description": "Standard and weighted log-rank tests using the lifelines library. Demonstrates the unweighted\nlog-rank, Wilcoxon-Gehan, and Fleming-Harrington G(0,1) variants on the worked-example\ndataset (12 patients, time in months, status 1=event 0=censored). Shows how to pair the\nlog-rank p-value with Cox HR and Kaplan-Meier medians for a complete survival comparison.",
        "dependencies": [
          "lifelines"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(survival)\n\n# ── Worked-example dataset ──────────────────────────────────────────────────\ndf <- data.frame(\n  patient_id = c('A1','A2','A3','A4','A5','A6','B1','B2','B3','B4','B5','B6'),\n  group      = c(rep('biologic', 6), rep('standard', 6)),\n  time       = c(4, 6, 9, 12, 12, 12,  3,  5, 12, 12, 12, 12),\n  event      = c(0, 0, 1,  0,  0,  0,  1,  1,  0,  0,  0,  0)\n)\n\n# ── 1. Standard (unweighted) log-rank test ──────────────────────────────────\nsd <- survdiff(Surv(time, event) ~ group, data = df)\nprint(sd)\n# survdiff prints: O (observed), E (expected), and chi-square\n# Derive the two-sided p-value:\nlogrank_p <- 1 - pchisq(sd$chisq, df = 1)\ncat(sprintf(\"Log-rank chi-square: %.4f\\n\", sd$chisq))\ncat(sprintf(\"p-value (1 df):       %.4f\\n\", logrank_p))\ncat(\"NOTE: survdiff returns no HR. Compute Cox HR separately.\\n\\n\")\n\n# ── 2. Kaplan-Meier medians (paired effect estimate) ────────────────────────\nkm_fit <- survfit(Surv(time, event) ~ group, data = df)\nprint(summary(km_fit)$table[, c('records','events','median','0.95LCL','0.95UCL')])\n\n# ── 3. Cox proportional hazards HR ──────────────────────────────────────────\ncox_fit <- coxph(Surv(time, event) ~ group, data = df)\nsummary(cox_fit)     # HR, 95% CI, and Wald/likelihood-ratio tests\n\n# ── 4. Wilcoxon-Gehan weighted log-rank (rho = 1, early emphasis) ───────────\nsd_wt <- survdiff(Surv(time, event) ~ group, data = df, rho = 1)\ncat(sprintf(\"Wilcoxon-Gehan (rho=1) chi-sq: %.4f, p: %.4f\\n\",\n            sd_wt$chisq, 1 - pchisq(sd_wt$chisq, df = 1)))\ncat(\"rho=1 weights each event time by the current risk-set size.\\n\")\ncat(\"Use when early differences are the scientific hypothesis.\\n\\n\")\n\n# ── 5. 
Stratified log-rank (adjusts for a discrete confounder) ───────────────\n# Artificial stratification variable to illustrate the syntax:\ndf$stage <- rep(c('I','II'), 6)\nsd_strat <- survdiff(Surv(time, event) ~ group + strata(stage), data = df)\ncat(sprintf(\"Stratified log-rank chi-sq: %.4f, p: %.4f\\n\",\n            sd_strat$chisq, 1 - pchisq(sd_strat$chisq, df = 1)))\ncat(\"Stratified log-rank sums within-stratum (O-E) contributions,\\n\")\ncat(\"removing between-stratum baseline hazard variation.\\n\")",
        "description": "Standard and weighted log-rank tests using the survival package (survdiff). Demonstrates\nthe unweighted log-rank, Wilcoxon-Gehan weights (rho=1), and the stratified log-rank on\nthe worked-example dataset. Shows how to pair the p-value with Kaplan-Meier median survival\ntimes and a Cox HR from the same data. The survdiff object contains the chi-square and O/E\ncounts; the p-value must be derived with pchisq.",
        "dependencies": [
          "survival"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* ── Create worked-example dataset ── */\ndata work.survival_demo;\n  input patient_id $ group $ time event;\n  datalines;\nA1 biologic  4 0\nA2 biologic  6 0\nA3 biologic  9 1\nA4 biologic 12 0\nA5 biologic 12 0\nA6 biologic 12 0\nB1 standard  3 1\nB2 standard  5 1\nB3 standard 12 0\nB4 standard 12 0\nB5 standard 12 0\nB6 standard 12 0\n;\nrun;\n\n/* ── 1. Standard (unweighted) log-rank test ── */\nproc lifetest data=work.survival_demo plots=survival(atrisk cl);\n  time time * event(0);   /* event(0) = censored indicator value             */\n  strata group;           /* STRATA produces log-rank by default             */\n  /* Output: \"Test of Equality over Strata\" table with log-rank chi-square  */\n  /* and Wilcoxon chi-square. The log-rank row is the primary result.       */\nrun;\n\n/* ── 2. Wilcoxon-Gehan weighted log-rank ── */\nproc lifetest data=work.survival_demo;\n  time time * event(0);\n  strata group / test=wilcoxon;  /* explicit Wilcoxon-Gehan weights (rho=1) */\nrun;\n\n/* ── 3. Capture O/E counts for manual verification ── */\nods output homtests=work.logrank_results\n           quartiles=work.km_medians;\nproc lifetest data=work.survival_demo plots=none;\n  time time * event(0);\n  strata group;\nrun;\nproc print data=work.logrank_results label; title \"Log-rank O and E counts\"; run;\nproc print data=work.km_medians label;      title \"Kaplan-Meier medians\";    run;\n\n/* ── 4. Paired Cox HR (always report alongside log-rank) ── */\nproc phreg data=work.survival_demo;\n  class group (ref='standard');   /* reference = standard care              */\n  model time * event(0) = group;  /* time*event(0): event=0 means censored  */\n  hazardratio group / cl=wald;    /* HR with 95% Wald CI                    */\nrun;\n\n/* ── 5. Stratified log-rank syntax ── */\n/* Replace <stratum_var> with the actual stratification variable name.      
*/\n/* STRATA names the stratification factor; GROUP= names the arms compared.  */\n/* proc lifetest data=work.survival_demo;                                   */\n/*   time time * event(0);                                                  */\n/*   strata <stratum_var> / group=group;                                    */\n/* run;                                                                     */",
        "description": "Standard and weighted log-rank tests via PROC LIFETEST with the STRATA statement. The\ndefault STRATA test is the log-rank; the WILCOXON option adds Wilcoxon-Gehan weights.\nThe ODS OUTPUT statement captures the observed and expected event counts for manual\nverification against the worked-example table. A separate PROC PHREG block shows the\ncompanion Cox HR that must always accompany the log-rank p-value in a study report.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  A[\"Two groups with<br/>time-to-event data and censoring\"] --> B[\"List distinct event times<br/>t1, t2, ..., tk\"]\n  B --> C[\"At each event time t_j:<br/>count at-risk n_Aj, n_Bj, n_j<br/>count events d_j\"]\n  C --> D[\"Compute expected events for A:<br/>E_Aj = (n_Aj / n_j) × d_j\"]\n  C --> E[\"Compute variance contribution:<br/>V_j = n_Aj × n_Bj × d_j × (n_j - d_j)<br/>divided by (n_j² × (n_j - 1))\"]\n  D --> F[\"Sum across all event times:<br/>O_A = Σ O_Aj<br/>E_A = Σ E_Aj<br/>V = Σ V_j\"]\n  E --> F\n  F --> G[\"chi-square = (O_A - E_A)² / V<br/>1 degree of freedom\"]\n  G --> H[\"p-value from chi-square table\"]\n  H --> I{\"Proportional hazards?\"}\n  I -- Yes --> J[\"Report: log-rank p + HR from Cox<br/>+ KM medians\"]\n  I -- No --> K[\"Report: weighted log-rank or RMST<br/>difference + HR + KM curves\"]\n  J --> L[\"⚠ Log-rank returns NO effect size.<br/>Always pair with HR and medians.\"]\n  K --> L",
        "caption": "Flow of the log-rank test from pooled risk sets to the chi-square statistic, with the decision branch for proportional versus non-proportional hazards.",
        "alt_text": "Flowchart showing how at-risk counts and events at each event time feed into the expected events computation, then into the chi-square test statistic, and finally into the decision on whether to use weighted variants or RMST when proportional hazards fails.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "requires",
        "target_slug": "censoring-mechanisms-rwe",
        "notes": "The log-rank test assumes non-informative censoring: the censoring mechanism must be independent of the unobserved survival time. Understanding how administrative censoring, loss to follow-up, and competing events are handled is prerequisite knowledge before interpreting any log-rank result."
      },
      {
        "relation_type": "used_with",
        "target_slug": "kaplan-meier-estimator",
        "notes": "The Kaplan-Meier estimator is the universal companion to the log-rank test: KM plots show the survival curves being compared, and KM-derived medians provide the effect estimate that the log-rank test itself cannot produce. Always report both together."
      },
      {
        "relation_type": "see_also",
        "target_slug": "cox-ph-regression",
        "notes": "The log-rank test is the unadjusted Cox score test; adding covariates to a Cox model produces the adjusted analogue. In observational studies, Cox regression with covariate adjustment replaces the unadjusted log-rank as the primary method. The Cox HR is always the companion effect estimate that the log-rank test cannot produce on its own."
      },
      {
        "relation_type": "see_also",
        "target_slug": "hazard-ratio-interpretation",
        "notes": "The HR is the effect measure that must accompany every log-rank result. Understanding its properties — non-collapsibility, conditioning on survival, behavior under non-PH — is essential for correctly interpreting Cox model output paired with the log-rank p-value."
      },
      {
        "relation_type": "see_also",
        "target_slug": "restricted-mean-survival-time-rmst",
        "notes": "RMST is the recommended alternative or complement to the log-rank test when the proportional hazards assumption fails (crossing curves, delayed effects). The RMST difference at a fixed horizon provides a directly interpretable effect size in time units that the log-rank test cannot produce and that does not require the PH assumption."
      }
    ],
    "aliases": [
      "log rank test",
      "logrank test",
      "Mantel-Haenszel log-rank",
      "Peto log-rank",
      "survdiff",
      "logrank_test"
    ],
    "complexity": "foundational",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "logistic-regression-for-binary-outcomes",
    "name": "Logistic Regression for Binary Outcomes",
    "short_definition": "A generalized linear model that links the log-odds of a binary outcome to a linear predictor via the logit, estimated by maximum likelihood, yielding (conditional) odds ratios as its native effect measure for fixed-window dichotomous endpoints in real-world data.",
    "long_description": "Logistic regression models a binary outcome Y in {0,1} through logit(P(Y=1|X)) = beta0 + beta'X, so\nthat P(Y=1|X) = 1 / (1 + exp(-(beta0 + beta'X))). Each coefficient beta_j is a change in log-odds per\nunit of X_j, and exp(beta_j) is the **conditional (covariate-adjusted) odds ratio**. Parameters are fit\nby maximum likelihood (iteratively reweighted least squares / Newton-Raphson). In real-world evidence\n(RWE) it is the default for a dichotomous endpoint observed over a *fixed* window — response yes/no at\n12 weeks, any major bleed within 90 days, 30-day readmission, treatment initiation, persistence-to-1-year\n— i.e., questions with no informative variation in follow-up time, or where time has been removed by a\nlandmark or a fixed risk window.\n\n**Core conceptual distinction.** Two distinctions decide whether logistic regression is the right tool\nand how to read its output. (1) *Odds ratio vs risk ratio vs risk difference.* The OR is the only effect\nmeasure logistic regression produces directly, but it is rarely the clinically or policy-relevant quantity.\nThe OR approximates the risk ratio only when the outcome is rare in both arms; for common outcomes it is\nnumerically further from 1 than the RR and is routinely misread as an RR, exaggerating the apparent effect.\nThe RD (and number-needed-to-treat) is what HTA and clinicians usually want. Recover the marginal RR/RD by\n*standardization / g-computation* (predict each subject's risk under treated and untreated, average, contrast)\nrather than reporting the OR as if it were a risk ratio. (2) *Conditional vs marginal effect.* The\ncovariate-adjusted OR from a logistic model is a *conditional* (subject-specific) effect that is not\ncollapsible: adding a strong, even non-confounding, covariate moves the OR away from the null even when the\nmarginal effect is unchanged. 
This non-collapsibility means an adjusted OR and an IPTW-weighted marginal OR\nanswer different questions and need not agree. The mismatch confuses comparisons across models, so the\ntarget (conditional or marginal) must be pre-specified in the estimand.\n\n**Pros, cons, and trade-offs.**\n- **vs Cox proportional-hazards regression (cox-ph-regression):** Logistic is correct when the endpoint is\n  a fixed-window yes/no with no meaningful censoring or time-to-event structure, and it sidesteps the\n  proportional-hazards assumption entirely. Cox is correct when follow-up varies, censoring truncates\n  observed person-time, or the question is \"how fast.\" Forcing logistic onto a time-to-event question silently\n  assumes everyone is observed for the full window and discards differential follow-up. **Prefer logistic**\n  only when the risk window is administratively fixed and complete for both arms; otherwise use Cox or its\n  discrete-time logistic expansion (below).\n- **vs modified Poisson / log-binomial relative-risk regression (poisson-negative-binomial-count-models):**\n  When the outcome is common and the RR is the target, log-binomial gives the RR directly but frequently\n  fails to converge; Zou's modified Poisson (Poisson with a robust/sandwich variance) is the robust\n  workhorse for adjusted RRs. Logistic remains numerically the most stable and is preferred when you will\n  standardize to a marginal RR/RD anyway, or when the OR is genuinely the estimand (case-control sampling).\n- **vs the linear probability model (OLS on 0/1):** OLS gives RDs directly and collapsible coefficients,\n  but can predict probabilities outside [0,1], is heteroscedastic by construction, and behaves badly near the\n  bounds. 
**Prefer logistic** for inference and prediction; consider the LPM only as a quick RD sanity check.\n- **vs flexible ML classifiers (predictive-and-causal-ml-models-rwe):** Tree ensembles / penalized learners\n  can out-discriminate logistic in very high-dimensional claims, but they forfeit transparent coefficients\n  and require targeted-learning / double-ML wrappers for valid causal inference. Logistic is the\n  interpretable outcome model inside g-computation, AIPW, and TMLE.\n\n**When to use.** A truly binary endpoint over a fixed, administratively complete risk window (response at\na landmark, an event within a pre-specified attribution window, an initiation/persistence flag); a\ncase-control design where the OR is the only estimable measure; the outcome-regression component of\ng-computation, AIPW, or TMLE for a binary outcome; or a discrete-time survival problem reframed as **pooled\nlogistic regression** (expand person-time into intervals, model the hazard of the event in each interval\nwith a flexible function of time), which approximates Cox when intervals are short and per-interval events\nare rare and is the standard machinery inside many target-trial emulations.\n\n**Interpreting the output**\n\nConsider a logistic model for 90-day hospitalization comparing Drug A vs Drug B,\nyielding a treatment coefficient of 1.792, so exp(1.792) = 6.0. With the worked-example\ncounts (Drug A: 40/100 events; Drug B: 10/100 events) and no covariates, the Wald\n95% CI is approximately 2.8 to 12.9.\n\nFormal interpretation: Patients in Drug A had 6.0 times the odds of 90-day\nhospitalization compared to Drug B patients. In a model that also includes baseline\ncovariates, the coefficient is read holding those covariates fixed: it is a conditional\n(covariate-specific) odds ratio, the effect for a patient with a given covariate\nprofile, not the average across the population. Because the outcome\nis common (40% in Drug A), the OR of 6.0 materially overstates the relative frequency:\nthe risk ratio for these data is 40%/10% = 4.0, not 6.0. 
OR ≠ RR unless the outcome\nis rare in both arms; reporting OR = 6.0 as \"six times the risk\" is an\ninterpretation error. The OR is also noncollapsible: adding a strong but non-confounding\nprognostic covariate will push the conditional OR further from the null even when the\nmarginal effect is unchanged.\n\nPractical interpretation: For clinical or HTA communication, report the marginal risk\ndifference alongside the OR. Here, the risk difference is 40% − 10% = 30 percentage\npoints (approximately 30 additional hospitalizations per 100 Drug A patients). That\nabsolute figure, not the OR, is what drives a cost-effectiveness calculation or a\nnumber-needed-to-harm statement. Use g-computation to standardize the logistic model\nto the marginal RD and RR rather than presenting the conditional OR as the effect size.\n\n**When NOT to use — and when it is actively misleading or dangerous.**\n- **Differential or incomplete follow-up.** If one arm is observed longer (disenrollment, death, end of\n  data) a fixed-window logistic treats unobserved person-time as event-free, so the arm with less follow-up\n  looks artificially safer. Use time-to-event methods, not logistic, when the window is not complete for everyone.\n- **Reporting the OR as a risk ratio for a common outcome.** With baseline risk of, say, 30-40%, an OR of\n  0.5 is *not* a halving of risk; quoting it as such overstates benefit. Standardize to RR/RD.\n- **Adjusting for post-baseline variables.** Conditioning on mediators or colliders measured after time zero\n  (on-treatment labs, downstream procedures) induces collider/mediator bias; a logistic model invites this\n  because \"more covariates\" looks like better adjustment. Restrict covariates to the baseline window.\n- **Separation and sparse data.** With rare events or rare exposure-covariate cells, MLEs can diverge (perfect\n  or quasi-complete separation): coefficients explode and Wald CIs become meaningless. 
Use Firth penalized\n  likelihood (firth-penalized-regression-rwe) or exact logistic, and never trust an OR with an astronomically\n  wide CI.\n- **Within-patient or clustered data analyzed as independent.** Multiple eligible episodes per patient, or\n  facility clustering, violate the independence assumption; naive SEs are too small. Use GEE\n  (gee-population-average-models-rwe) or cluster-robust (sandwich) variances.\n\n**Data-source operational depth.**\n- **Claims (FFS vs MA vs commercial):** The binary outcome is built from validated diagnosis/procedure code\n  lists in a fixed window after time zero (e.g., first inpatient MI code within 90 days). Two failure modes\n  dominate. (1) *Unobserved person-time.* Medicare Advantage encounter data are incomplete relative to\n  fee-for-service claims, so a \"no event\" can be missingness rather than a true non-event — restrict to FFS\n  Parts A/B (and D for drug exposure) and exclude MA-only person-time, or you will misclassify outcomes\n  differentially. (2) *Detection / surveillance bias.* The exposure arm may be monitored more intensively\n  (more visits, more testing), so the *same* true risk produces more coded events — a differential\n  misclassification that inflates or deflates the OR depending on direction. Mitigate with high-PPV validated\n  algorithms and quantify with PPV/sensitivity from a validation substudy. Always confirm continuous medical\n  enrollment across the entire risk window so absence of a code is observed, not unobserved.\n- **EHR:** Outcomes come from labs against thresholds (HbA1c <7% as \"controlled\"), problem lists, orders, or\n  NLP on notes — richer for severity than claims but plagued by visit-driven, missing-not-at-random capture:\n  sicker, in-system patients accrue more recorded events (informed-presence / coding-intensity bias). 
Patients\n  who leave the system are differentially lost; treat that as informative and run missing-data sensitivity\n  analyses or multiple imputation rather than complete-case logistic.\n- **Registry:** Cleanest binary endpoints (adjudicated response, graded AEs, confirmed readmission) and the\n  natural benchmark for validating a claims/EHR logistic outcome algorithm, but typically incomplete for full\n  exposure and longitudinal capture; link to claims for those.\n- **Linked claims-EHR-registry:** Best substrate (EHR severity + claims completeness + adjudicated outcomes)\n  but the linkable subset is selected, threatening transportability; reconcile date discrepancies before\n  fixing the risk window.\n\n**Worked claims example.** Question: any major bleed within a fixed 180-day risk window among new initiators\nof DOAC A vs DOAC B for nonvalvular atrial fibrillation, in a commercial + Medicare FFS database. (1)\nEligibility/time zero: first qualifying DOAC fill (`fill_date`, which defines `index_date`) with 365 days of\ncontinuous medical+pharmacy FFS enrollment beforehand and no DOAC fill in that washout (incident users); assign\n`arm` from the dispensed NDC. (2) Outcome: bleed = 1 if a validated inpatient major-bleeding `dx` code\n(high-PPV algorithm) appears in (index_date, index_date + 180]; else 0, but *only* count a patient as \"0\" if\nthey were continuously enrolled and FFS-observable through day 180, otherwise follow-up is incomplete and\nlogistic is the wrong tool (fall back to Cox). (3) Covariates measured strictly in [index_date - 365, index_date]: age, sex, HAS-BLED\ncomponents, prior bleed, renal impairment, antiplatelet/NSAID use, baseline utilization. (4) Fit\n`logit(P(bleed)) = arm + covariates`, but report the standardized marginal RD and RR (g-computation across the\ncohort), not the adjusted OR alone. 
(5) Because DOAC dosing/monitoring differs, run a detection-bias\nsensitivity analysis (adjust for baseline visit count; negative-control outcome) and, if major bleed is rare\nin subgroups, switch to Firth penalized likelihood to avoid separation.",
    "primary_category": "Inferential_Statistics",
    "tags": [
      "logistic-regression",
      "binary-outcomes",
      "odds-ratio",
      "logit",
      "non-collapsibility",
      "pooled-logistic",
      "standardization",
      "claims",
      "safety",
      "separation"
    ],
    "applies_to_study_types": [
      "new_user",
      "active_comparator_new_user",
      "cohort_retrospective",
      "claims_analysis",
      "ehr_study",
      "case_control"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1093/biomet/54.1-2.167",
        "url": "https://doi.org/10.1093/biomet/54.1-2.167",
        "citation_text": "Walker SH, Duncan DB. Estimation of the probability of an event as a function of several independent variables. Biometrika. 1967;54(1-2):167-179.",
        "year": 1967,
        "authors_short": "Walker & Duncan",
        "notes": "Foundational derivation of the multivariable logistic model for event probabilities, the basis for its adoption in epidemiology and biostatistics."
      },
      {
        "role": "explain",
        "doi": "10.1001/jama.280.19.1690",
        "url": "https://doi.org/10.1001/jama.280.19.1690",
        "citation_text": "Zhang J, Yu KF. What's the relative risk? A method of correcting the odds ratio in cohort studies of common outcomes. JAMA. 1998;280(19):1690-1691.",
        "year": 1998,
        "authors_short": "Zhang & Yu",
        "notes": "The canonical statement that the odds ratio overstates the risk ratio for common outcomes, with a conversion formula; central to interpreting logistic output correctly in cohort RWE."
      },
      {
        "role": "explain",
        "doi": "10.1093/aje/kwh090",
        "url": "https://doi.org/10.1093/aje/kwh090",
        "citation_text": "Zou G. A modified Poisson regression approach to prospective studies with binary data. American Journal of Epidemiology. 2004;159(7):702-706.",
        "year": 2004,
        "authors_short": "Zou",
        "notes": "Defines the robust modified-Poisson alternative for estimating adjusted risk ratios when the OR is not the target and log-binomial fails to converge."
      },
      {
        "role": "explain",
        "doi": "10.1093/oxfordjournals.pan.a004868",
        "url": "https://doi.org/10.1093/oxfordjournals.pan.a004868",
        "citation_text": "King G, Zeng L. Logistic regression in rare events data. Political Analysis. 2001;9(2):137-163.",
        "year": 2001,
        "authors_short": "King & Zeng",
        "notes": "Quantifies finite-sample bias and separation problems of ML logistic regression with rare events, motivating penalized (Firth) or exact methods common in safety RWE."
      },
      {
        "role": "demonstrate",
        "doi": "10.1097/EDE.0b013e3181a663cc",
        "url": "https://doi.org/10.1097/EDE.0b013e3181a663cc",
        "citation_text": "Schneeweiss S, Rassen JA, Glynn RJ, Avorn J, Mogun H, Brookhart MA. High-dimensional propensity score adjustment in studies of treatment effects using health care claims data. Epidemiology. 2009;20(4):512-522.",
        "year": 2009,
        "authors_short": "Schneeweiss et al.",
        "notes": "Production claims-based CER that uses logistic regression for binary exposure (propensity) and binary outcome models with empirically selected confounders; the reference pattern for logistic models in pharmacoepidemiology."
      }
    ],
    "plain_language_summary": "Logistic regression is a statistical method for asking: among two groups of patients, which group was more likely to experience a yes-or-no event — a hospitalization, a treatment response, a side effect — while accounting for differences between the groups? It works by estimating, for each patient, the probability of the event based on their characteristics, and the key number it produces is an odds ratio, which compares how much more often the event occurred in one group than the other. One honest limitation: when the event is common (say, more than 10% of patients), the odds ratio can make a difference look bigger than it really is, so analysts often convert it to a risk difference or risk ratio before reporting to clinicians or payers.",
    "key_terms": [
      {
        "term": "binary outcome",
        "definition": "An endpoint that is recorded as yes or no for each patient — for example, 'was hospitalized within 90 days' (yes/no) or 'achieved response at 12 weeks' (yes/no)."
      },
      {
        "term": "odds",
        "definition": "The number of patients who had the event divided by the number who did not — for example, if 40 out of 100 patients had the event, the odds are 40/60, or about 0.67."
      },
      {
        "term": "odds ratio (OR)",
        "definition": "The odds of the event in one group divided by the odds in another group; an OR of 1.0 means no difference, above 1.0 means higher odds in the first group, below 1.0 means lower odds."
      },
      {
        "term": "regression coefficient",
        "definition": "A number the model fits for each variable; for the treatment arm, the coefficient is the natural logarithm of the odds ratio, so taking exp(coefficient) gives you the odds ratio directly."
      },
      {
        "term": "covariate adjustment",
        "definition": "Including patient characteristics — age, sex, prior diagnoses — in the model so that the treatment comparison accounts for the fact that the two groups may differ on those traits."
      }
    ],
    "worked_example": {
      "scenario": "A pharmacoepidemiology team is studying whether patients who started Drug A had a higher rate of 90-day hospitalization compared to patients who started Drug B. Each patient's 90-day window was complete and fully observable. The team wants to compare the two arms using logistic regression. Here is the raw 2x2 count table they start with before running any model.",
      "dataset": {
        "caption": "90-day hospitalization counts by treatment arm (one row per arm, 100 patients each).",
        "columns": [
          "arm",
          "hospitalized_yes",
          "hospitalized_no"
        ],
        "rows": [
          [
            "Drug A (exposed)",
            40,
            60
          ],
          [
            "Drug B (unexposed)",
            10,
            90
          ]
        ]
      },
      "steps": [
        "Label the cells: a = 40 (Drug A, hospitalized), b = 60 (Drug A, not hospitalized), c = 10 (Drug B, hospitalized), d = 90 (Drug B, not hospitalized).",
        "Compute the odds of hospitalization in the Drug A arm: a / b = 40 / 60 = 0.667.",
        "Compute the odds of hospitalization in the Drug B arm: c / d = 10 / 90 = 0.111.",
        "Compute the odds ratio using the cross-product formula: OR = (a × d) / (b × c) = (40 × 90) / (60 × 10) = 3600 / 600 = 6.0.",
        "Logistic regression estimates a coefficient for the treatment arm variable; that coefficient equals the natural log of the OR: log(6.0) = 1.792.",
        "Exponentiating the coefficient recovers the odds ratio: exp(1.792) = 6.0, confirming that the model output and the 2x2 table arithmetic agree exactly."
      ],
      "result": "OR = 6.0: patients in the Drug A arm had six times the odds of being hospitalized within 90 days compared to patients in the Drug B arm. The logistic regression coefficient for treatment arm = log(6.0) = 1.792, and exp(1.792) = 6.0. In plain terms, hospitalization was much more common among Drug A users — but because 40% of Drug A patients were hospitalized (a common outcome), the OR of 6.0 overstates how large the risk difference actually is; analysts would follow up by computing the risk difference (40% − 10% = 30 percentage points) to communicate the finding to clinicians."
    },
    "prerequisites": [
      "case-control",
      "cohort-retrospective",
      "dags-backdoor-criterion-drug-studies"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Pooled logistic regression (discrete-time survival approximation)",
        "description": "Expand each subject's person-time into discrete intervals, then fit a logistic model for the per-interval hazard with a flexible function of follow-up time plus covariates; approximates Cox when intervals are short and per-interval events are rare, and naturally accommodates time-varying exposures and IPTW/IPCW weights.",
        "edge_cases": [
          "Interval width is a bias-variance trade-off; very fine intervals approach Cox but inflate the row count.",
          "Time-varying confounding requires marginal-structural-model weights, not simple covariate adjustment."
        ],
        "data_source_notes": "claims/EHR: build the long (person-interval) table from monthly/quarterly snapshots; the standard outcome model inside target-trial emulations with sustained-treatment strategies."
      },
      {
        "name": "Conditional logistic regression (matched / stratified)",
        "description": "Conditions out stratum-specific intercepts for matched sets (1:1 PS-matched pairs, risk-set or exact matching, case-control matched sets), eliminating confounding by the matching factors via the conditional likelihood.",
        "edge_cases": [
          "Discards strata with no outcome variation, reducing efficiency; tight matching can drop many subjects.",
          "Cannot estimate effects of the matching variables themselves."
        ],
        "data_source_notes": "claims: standard for nested case-control and PS-matched-pair cohorts; pair on high-dimensional PS, age, sex, calendar time, and prior events."
      },
      {
        "name": "Firth penalized / exact logistic (sparse data, separation)",
        "description": "Replaces the ML likelihood with Firth's penalized (Jeffreys-prior) likelihood, or uses exact conditional inference, to obtain finite, less-biased estimates when events are rare or covariate cells are empty and ordinary MLE diverges.",
        "edge_cases": [
          "Firth shrinks toward the null and changes the interpretation of the intercept (baseline risk).",
          "Exact logistic is computationally infeasible with many covariates."
        ],
        "data_source_notes": "claims: routine for rare safety signals or rare-exposure subgroups where naive logistic returns infinite ORs and uninformative Wald CIs."
      },
      {
        "name": "Marginal effects via standardization / g-computation",
        "description": "Rather than reporting the non-collapsible OR, predict each subject's risk under treated and untreated from the fitted logistic model and average to obtain marginal risks, then contrast as a risk difference or risk ratio with bootstrap or delta-method CIs.",
        "edge_cases": [
          "The marginal estimand depends on the covariate distribution standardized to (whole cohort vs treated).",
          "Requires correct outcome-model specification unless augmented (AIPW/TMLE) with a propensity model."
        ],
        "data_source_notes": "all sources: the recommended way to communicate a logistic result to clinicians and HTA bodies, who need RD/NNT, not an OR."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "cox-ph-regression",
        "pros_of_this": "Correct for fixed-window binary endpoints with complete follow-up; no proportional-hazards assumption; pooled-logistic variant approximates Cox and adds time-varying-confounding handling.",
        "cons_of_this": "Ignores differential follow-up and censoring unless person-time is expanded; reports non-collapsible ORs rather than time-to-event measures.",
        "when_to_prefer": "Administratively complete fixed risk windows, landmark analyses, case-control designs, or as the outcome model inside g-methods."
      },
      {
        "compared_to": "poisson-negative-binomial-count-models",
        "pros_of_this": "Numerically stable for binary 0/1 data; natural likelihood for dichotomous endpoints; easily standardized to marginal RR/RD.",
        "cons_of_this": "Native measure is the OR, not the RR; for common outcomes modified Poisson gives the adjusted RR directly and more transparently.",
        "when_to_prefer": "True binary event-in-window where the OR is the estimand or RR/RD will be obtained by standardization, rather than counts/rates."
      },
      {
        "compared_to": "predictive-and-causal-ml-models-rwe",
        "pros_of_this": "Interpretable coefficients/ORs, transparent confounding control, stable in moderate dimensions, drop-in outcome model for AIPW/TMLE.",
        "cons_of_this": "May under-discriminate flexible ML in very high-dimensional claims; non-collapsibility and separation require explicit handling.",
        "when_to_prefer": "When effect estimation and transparency dominate; reserve ML for prediction-first tasks or high-dimensional confounding handled with double/targeted learning."
      },
      {
        "compared_to": "gee-population-average-models-rwe",
        "pros_of_this": "Simpler when observations are genuinely independent (one row per patient); standard ML inference applies.",
        "cons_of_this": "Naive SEs are anticonservative under within-patient or facility clustering, where a population-average GEE or cluster-robust variance is required.",
        "when_to_prefer": "Single-record-per-patient designs with no clustering; otherwise move to GEE or sandwich variances."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Build the binary outcome from a validated high-PPV code list in a fixed window after time zero and require continuous FFS-observable enrollment across the entire window so a 'no event' is observed, not missing; exclude Medicare Advantage-only person-time. Measure covariates only in the pre-index baseline window. For rare events use Firth/exact logistic; report event counts by arm plus crude, adjusted, and standardized (RD/RR) estimates.",
      "ehr": "Derive outcomes from labs/thresholds, problem lists, orders, or NLP; capture is visit-driven and missing-not-at-random, so prefer multiple imputation or sensitivity analyses over complete-case logistic and adjust for surveillance intensity (baseline visit/test counts).",
      "registry": "Cleanest adjudicated binary endpoints; use to validate (PPV/sensitivity) claims/EHR outcome algorithms and benchmark the logistic model, linking to claims for full exposure.",
      "linked": "Linked claims-EHR-registry gives severity, completeness, and adjudication, but the linkable subset is selected; assess transportability and reconcile date discrepancies before fixing the risk window."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\nimport pandas as pd\nimport statsmodels.formula.api as smf\n\ndef fit_binary_outcome(cohort: pd.DataFrame, covariates: list[str],\n                       treat: str = \"arm\", outcome: str = \"outcome\",\n                       n_boot: int = 1000, seed: int = 1) -> dict:\n    d = cohort.copy()\n    d[treat] = (d[treat].isin([1, \"STUDY\", \"treated\"])).astype(int)  # normalize arm to 0/1\n\n    rhs = \" + \".join([treat] + covariates)\n    fit = smf.logit(f\"{outcome} ~ {rhs}\", data=d).fit(disp=0)\n\n    # Conditional (adjusted) odds ratio for treatment.\n    beta = fit.params[treat]\n    ci = fit.conf_int().loc[treat]\n    odds_ratio = (np.exp(beta), float(np.exp(ci[0])), float(np.exp(ci[1])))\n\n    # Marginal effects via g-computation: predict risk setting everyone treated vs untreated.\n    def marginal(df):\n        d1, d0 = df.copy(), df.copy()\n        d1[treat], d0[treat] = 1, 0\n        r1, r0 = fit.predict(d1).mean(), fit.predict(d0).mean()\n        return r1 - r0, r1 / r0\n    rd, rr = marginal(d)\n\n    # Nonparametric bootstrap CIs for the marginal contrasts (refit each draw).\n    rng = np.random.default_rng(seed)\n    rds, rrs = [], []\n    for _ in range(n_boot):\n        b = d.iloc[rng.integers(0, len(d), len(d))]\n        bf = smf.logit(f\"{outcome} ~ {rhs}\", data=b).fit(disp=0)\n        d1, d0 = b.copy(), b.copy()\n        d1[treat], d0[treat] = 1, 0\n        r1, r0 = bf.predict(d1).mean(), bf.predict(d0).mean()\n        rds.append(r1 - r0); rrs.append(r1 / r0)\n    rd_ci = (float(np.percentile(rds, 2.5)), float(np.percentile(rds, 97.5)))\n    rr_ci = (float(np.percentile(rrs, 2.5)), float(np.percentile(rrs, 97.5)))\n\n    return {\"odds_ratio\": odds_ratio, \"risk_difference\": (rd, *rd_ci),\n            \"risk_ratio\": (rr, *rr_ci), \"n\": len(d), \"events\": int(d[outcome].sum())}",
        "description": "Adjusted logistic regression for a fixed-window binary outcome, plus standardized marginal risk\ndifference/ratio (g-computation). Required input (one row per eligible new initiator, already cleaned):\n  cohort : person_id, arm (0/1 or 'STUDY'/'COMPARATOR'), index_date,\n           outcome (0/1, built from a validated code list in a FIXED window with complete enrollment),\n           <baseline covariates measured only in [index_date - lookback, index_date]>\nReturns the adjusted OR with profile-likelihood CI and the marginal RD/RR a clinician/HTA reader needs.\nBuild `outcome` and confirm full-window enrollment BEFORE this step; never count incomplete follow-up as 0.",
        "dependencies": [
          "pandas",
          "numpy",
          "statsmodels"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(marginaleffects)\nlibrary(sandwich)\n\nfit_binary_outcome <- function(cohort, covariates,\n                               treat = \"arm\", outcome = \"outcome\",\n                               cluster = \"person_id\") {\n  cohort[[treat]] <- as.integer(cohort[[treat]] %in% c(1, \"STUDY\", \"treated\"))\n  f <- reformulate(c(treat, covariates), response = outcome)\n\n  fit <- glm(f, family = binomial(), data = cohort)\n\n  # Conditional (adjusted) OR with profile-likelihood CI.\n  or  <- exp(coef(fit)[[treat]])\n  cic <- exp(suppressMessages(confint(fit))[treat, ])\n\n  # Cluster-robust covariance (multiple eligible episodes per person, facility clustering).\n  vcl <- sandwich::vcovCL(fit, cluster = cohort[[cluster]])\n\n  # Marginal RD and RR by standardization (avgcomparisons = g-computation contrast).\n  rd <- avg_comparisons(fit, variables = setNames(list(0:1), treat),\n                        comparison = \"difference\", vcov = vcl)\n  rr <- avg_comparisons(fit, variables = setNames(list(0:1), treat),\n                        comparison = \"ratio\",      vcov = vcl)\n\n  list(odds_ratio = c(or, cic),\n       risk_difference = rd[, c(\"estimate\", \"conf.low\", \"conf.high\")],\n       risk_ratio      = rr[, c(\"estimate\", \"conf.low\", \"conf.high\")],\n       n = nrow(cohort), events = sum(cohort[[outcome]]))\n}\n# Rare-event / separation fallback:\n# library(logistf); fit <- logistf(f, data = cohort)  # Firth-penalized ORs + profile-likelihood CIs",
        "description": "Adjusted logistic regression with profile-likelihood OR and standardized marginal RD/RR (g-computation),\nplus cluster-robust SEs for multi-episode/clustered data. Input mirrors the Python version:\n  cohort : person_id, arm (0/1), index_date, outcome (0/1 over a complete fixed window), <baseline covariates>\nUse firth=TRUE path (logistf) when events are rare and ordinary glm shows separation (infinite ORs).",
        "dependencies": [
          "marginaleffects",
          "sandwich",
          "logistf"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* Primary adjusted model: conditional ORs with profile-likelihood CIs. */\nproc logistic data=work.cohort;\n  class arm (ref='COMPARATOR') sex (ref='F') / param=ref;\n  model outcome(event='1') = arm age sex renal_impair prior_bleed\n        / clodds=pl lackfit;           /* clodds=pl -> profile-likelihood OR CIs; lackfit -> calibration */\n  oddsratio arm / diff=ref;\n  /* Standardized (marginal) risks by arm -> compute RD/RR for clinical/HTA reporting. */\n  lsmeans arm / ilink;\nrun;\n\n/* Rare-event / separation-robust refit (Firth penalized likelihood). */\nproc logistic data=work.cohort;\n  class arm (ref='COMPARATOR') / param=ref;\n  model outcome(event='1') = arm age sex / firth clodds=pl;\nrun;\n\n/* GENMOD equivalent (binomial/logit) with robust SEs if repeated rows per person_id exist. */\nproc genmod data=work.cohort descending;\n  class arm (ref='COMPARATOR') person_id;\n  model outcome = arm age sex / dist=bin link=logit;\n  repeated subject=person_id / type=ind;   /* sandwich SEs under within-patient clustering */\n  estimate 'log-OR arm' arm 1 -1 / exp;\nrun;\n\n/* Pooled (discrete-time) logistic for a survival-type question: expand to person-interval rows\n   (work.long: one row per person per interval, with interval index `period`, time-varying `arm_t`,\n   and `event_t` = event in that interval) before running. */\nproc logistic data=work.long;\n  class arm_t (ref='0') / param=ref;\n  model event_t(event='1') = arm_t period period*period age sex / clodds=pl;\nrun;",
        "description": "PROC LOGISTIC and PROC GENMOD for a fixed-window binary outcome, plus a pooled (discrete-time) logistic\ntemplate. Required input dataset (post data-management):\n  work.cohort : person_id, arm (char 'STUDY'/'COMPARATOR' or 0/1), outcome (0/1 over a COMPLETE fixed\n                window), <baseline covariates measured only in the pre-index window>\nBuild `outcome` and confirm full-window enrollment before this step. Use the firth option for rare events\nto avoid quasi-complete separation; effectplot/lsmeans give marginal risks for RD/RR communication.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  X[Treatment arm + baseline covariates<br/>measured only up to time zero] --> L[\"Linear predictor<br/>logit(p) = beta0 + beta'X\"]\n  L --> P[\"Probability p = 1 / (1 + exp(-logit))\"]\n  P --> OR[\"exp(beta_arm) = conditional (adjusted) OR<br/>NON-collapsible, not a risk ratio\"]\n  P --> STD[\"Standardize / g-compute:<br/>mean risk treated vs untreated\"]\n  STD --> RD[\"Marginal risk difference + risk ratio<br/>(report this to clinicians / HTA)\"]",
        "caption": "From the logit linear predictor to fitted probabilities, then to the two distinct estimands - the non-collapsible conditional odds ratio versus the marginal risk difference/ratio obtained by standardization, which is what most decision-makers actually need.",
        "alt_text": "Flowchart showing baseline covariates feeding the logit linear predictor, then the fitted probability, branching to a conditional odds ratio and to a standardized marginal risk difference and risk ratio.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Q[Binary outcome question in RWE] --> T{Is follow-up a fixed, complete window<br/>for both arms?}\n  T -->|\"No - varying/censored time\"| COX[Use Cox or pooled-logistic<br/>expand person-time]\n  T -->|Yes - complete fixed window| C{\"Is the outcome common<br/>(risk > ~10%)?\"}\n  C -->|\"Yes and RR/RD is the target\"| MP[\"Logistic + standardization,<br/>or modified Poisson for adjusted RR\"]\n  C -->|No - rare| R{Separation or empty cells?}\n  R -->|Yes| F[\"Firth penalized / exact logistic\"]\n  R -->|No| LOG[Ordinary logistic;<br/>OR ~ RR when rare]",
        "caption": "Decision logic for a dichotomous RWE endpoint - first rule out a time-to-event structure, then choose between ordinary logistic, standardization/modified-Poisson for common outcomes, and Firth/exact methods when rare events cause separation.",
        "alt_text": "Decision tree starting from a binary outcome question, branching on whether follow-up is a fixed complete window, whether the outcome is common, and whether separation occurs, leading to Cox, logistic, modified Poisson, or Firth methods.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "see_also",
        "target_slug": "cox-ph-regression",
        "notes": "Cox is the time-to-event counterpart; pooled (discrete-time) logistic with short intervals approximates Cox and is preferred when follow-up varies or time-varying confounding must be handled."
      },
      {
        "relation_type": "see_also",
        "target_slug": "poisson-negative-binomial-count-models",
        "notes": "Modified Poisson (log link, robust variance) estimates adjusted risk ratios directly for common outcomes, where the logistic odds ratio overstates the risk ratio."
      },
      {
        "relation_type": "used_with",
        "target_slug": "marginal-effects-and-interpretation-of-inferential-statistics-rwe",
        "notes": "Average marginal effects / standardization convert non-collapsible log-odds into the risk-difference and risk-ratio quantities decision-makers need."
      },
      {
        "relation_type": "used_with",
        "target_slug": "firth-penalized-regression-rwe",
        "notes": "Firth penalized likelihood resolves separation and finite-sample bias when events or exposure cells are sparse and ordinary MLE diverges."
      },
      {
        "relation_type": "used_with",
        "target_slug": "gee-population-average-models-rwe",
        "notes": "GEE (or cluster-robust variances) provides valid population-average inference when patients contribute multiple eligible episodes or are clustered within facilities."
      },
      {
        "relation_type": "used_with",
        "target_slug": "propensity-score-methods-psm-iptw",
        "notes": "Logistic regression is the standard propensity model for binary exposure, and the outcome model combined with PS for doubly robust estimation."
      },
      {
        "relation_type": "used_with",
        "target_slug": "target-trial-emulation",
        "notes": "Pooled logistic regression is a common outcome model inside target-trial emulations for cumulative incidence or fixed-window binary endpoints."
      },
      {
        "relation_type": "used_with",
        "target_slug": "g-estimation-structural-nested-models",
        "notes": "Logistic outcome regressions appear inside g-computation, AIPW, and TMLE for binary outcomes under (time-varying) confounding."
      },
      {
        "relation_type": "see_also",
        "target_slug": "immortal-time-bias-handling",
        "notes": "A fixed-window logistic that begins counting before exposure is possible guarantees exposed subjects are event-free early, inflating apparent benefit."
      },
      {
        "relation_type": "see_also",
        "target_slug": "claims-outcome-algorithm-ppv-sensitivity-rwe",
        "notes": "The binary outcome fed to a logistic model is only as good as its validated code algorithm; misclassification (especially differential/detection bias) distorts the OR."
      }
    ],
    "aliases": [
      "logit model",
      "binary logistic regression",
      "logistic regression",
      "pooled logistic regression",
      "conditional logistic regression"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "loinc-laboratory-coding",
    "name": "LOINC Laboratory and Observation Codes",
    "short_definition": "LOINC (Logical Observation Identifiers Names and Codes) is a universal, freely licensed terminology of numeric codes that uniquely identify clinical laboratory tests, vital signs, clinical observations, survey instruments, and document types, enabling health systems and research networks to exchange and aggregate results without mapping through local lab-code silos.",
    "long_description": "**LOINC (Logical Observation Identifiers Names and Codes)** is a universal\nterminology maintained by the Regenstrief Institute that assigns a stable\nnumeric code to every distinct clinical observation — laboratory tests, vital\nsigns, clinical findings, survey-instrument items, and document types. First\npublished in 1994 and updated approximately twice a year, it now contains\nmore than 100,000 terms and is used in over 170 countries. LOINC is freely\ndownloadable after registration and acceptance of the Regenstrief license;\nusers may incorporate codes in systems and research but may not bulk-reproduce\nthe database tables without permission (see https://loinc.org/kb/license/).\n\n**The six-part fully specified name: why it matters for RWE.**\nThe semantics of a LOINC code are entirely encoded in its fully specified name,\na structured six-part axis. Understanding these six axes is the single most\nimportant conceptual task for a researcher building lab-based phenotypes:\n\n1. **Component (analyte)** — what is being measured: Glucose, Creatinine,\n   Hemoglobin A1c, Troponin I, etc.\n2. **Property** — the physical or chemical type of quantity: mass\n   concentration (MCnc, e.g., mg/dL), molar concentration (SCnc, e.g.,\n   mmol/L), arbitrary units, presence/absence (Prid).\n3. **Time** — the collection timing: point-in-time (Pt) versus 24-hour\n   collection (24H) versus other periods.\n4. **System (specimen type)** — the biological source: serum or plasma\n   (Ser/Plas), whole blood (Bld), urine (Urine), cerebrospinal fluid (CSF),\n   etc.\n5. **Scale** — the measurement type: quantitative (Qn), ordinal (Ord),\n   nominal (Nom), narrative (Nar), or document.\n6. **Method (when it matters)** — the analytical technique when it\n   materially distinguishes the result: Jaffe vs. enzymatic for creatinine;\n   immunoassay vs. HPLC for HbA1c. 
Method is omitted from the code name when\n   it does not change the interpretive meaning.\n\n**Why \"glucose\" is not one code.** Fasting versus random glucose, serum\nversus whole-blood glucose, mass-concentration versus molar-concentration\nproperty — each combination is a *different* LOINC code with different\nreference ranges and clinical interpretation. A researcher who writes\n\"glucose\" as a free-text filter and expects to capture all glucose results\nwill mix these conceptually different measurements. The discipline of building\nan explicit LOINC code list — naming every axis for every code included — is\nwhat makes a lab-based phenotype reviewable, reproducible, and defensible.\n\n**Code format.** LOINC codes are numeric identifiers with a mod-10 Luhn\ncheck digit appended after a hyphen, for example **2160-0** (serum/plasma\ncreatinine, MCnc, Pt, Ser/Plas, Qn) or **4548-4** (HbA1c/total hemoglobin,\nMFr, Pt, Bld, Qn). The digits are meaningless identifiers — all meaning\nlives in the fully specified name. The check digit detects single-character\ntranscription errors.\n\n**Units are not part of the code; UCUM is the companion standard.**\nLOINC codes specify the *property* dimension (mass vs molar concentration)\nbut do not encode the specific unit. The Unified Code for Units of Measure\n(UCUM) is the companion standard for units, and results must carry UCUM-coded\nunits alongside the LOINC code. 
In multi-site EHR or network studies, the\nsame LOINC code will often arrive in different unit flavors (mg/dL vs\nµmol/L for creatinine; mmol/L vs mg/dL for glucose) that require explicit\nunit harmonization before any pooling or threshold-based phenotyping.\nForgetting this step is among the most common hidden arithmetic errors in\ndistributed network studies.\n\n**The local-code problem — the dominant data-quality issue for lab-based\nRWE.** Most clinical laboratory information systems (LIS) record results\nusing internal, site-specific local codes rather than LOINC codes. LOINC\nmapping is applied (or not) by the health system's informatics team before\ndata reaches a research data warehouse or common data model. Mapping\ncompleteness varies enormously across sites — from near-complete to below\n50% for less common analytes — and mapping *accuracy* is a separate\nconcern: a local code for \"creatinine, enzymatic\" may be incorrectly mapped\nto the Jaffe LOINC code, introducing systematic method misclassification\ninvisible to downstream analysts. In multi-site PCORnet or OHDSI/OMOP\nnetworks, differential unmapping across sites means that a cohort built on\nLOINC-mapped records is effectively selecting *for* better-resourced sites,\nbiasing lab-dependent eGFR or HbA1c cohorts toward the site demographics\nthat happen to have completed their mapping. Documenting the LOINC mapping\nrate per site and per analyte is a required feasibility step for any\nmulti-site lab-dependent study.\n\n**Method heterogeneity and reference-range awareness.** Even when LOINC\ncodes are correctly mapped, the same analyte LOINC may aggregate results\nfrom multiple analytical methods with different inter-method biases. 
Serum\ncreatinine measured by Jaffe and by enzymatic assays produces systematically\ndifferent numeric values, which affects eGFR calculations, yet the common\nserum/plasma creatinine code 2160-0 carries no method axis, so results from\nboth assay types can be reported under it.\nAn eGFR-based CKD cohort that pools Jaffe and enzymatic creatinine without\nadjustment will introduce differential misclassification by the method\ndistribution across sites. Lab-based phenotypes must either restrict to a\nsingle method, apply method-specific conversion, or sensitivity-analyze\nthe method mix.\n\n**Scope beyond the clinical laboratory.** Although LOINC originated in\nlaboratory medicine, the terminology has expanded substantially:\n- **Vital signs**: heart rate (8867-4), systolic blood pressure (8480-6),\n  BMI (39156-5) — all have LOINC codes used in EHR observation tables.\n- **Clinical findings and assessments**: clinical observations recorded by\n  providers are increasingly assigned LOINC codes in EHR systems.\n- **Survey and patient-reported outcome instruments**: PHQ-9 depression\n  screen items each have individual LOINC codes (e.g., 44250-9 for item 1),\n  enabling structured extraction of PRO data from EHRs.\n- **Document types**: clinical note categories (discharge summary,\n  pathology report, operative note) have LOINC document-type codes used\n  in NLP pipelines and document management systems.\n- **Radiology**: imaging order and result types also have LOINC\n  representation used in radiology information systems.\n\n**LOINC in the OMOP CDM.** In the OHDSI OMOP Common Data Model, LOINC is\nthe standard vocabulary for the Measurement domain (lab results, vital signs,\nclinical observations) and for the Observation domain (survey items, clinical\nassessments). When a site transforms data to OMOP, local lab codes are mapped\nto LOINC via the CONCEPT and CONCEPT_RELATIONSHIP tables; results land in the\nMEASUREMENT table with `measurement_concept_id` drawn from LOINC. 
A site\nwith incomplete LOINC mapping will have observations in MEASUREMENT with\nconcept_id = 0 (unmapped) — a leading indicator of data quality that should\nbe audited before any analytic use.\n\n**LOINC and SNOMED CT collaboration.** Regenstrief and SNOMED International\nmaintain a formal harmonization effort; SNOMED CT provides the ontological\nbackbone for representing the component and system axes, while LOINC provides\nthe operational identity for test ordering and result reporting. In OMOP,\nconditions and clinical findings use SNOMED; measurements and labs use LOINC.\n\n**CPT versus LOINC: claims visibility vs EHR result capture.** CPT\n(Current Procedural Terminology) codes identify the *billed procedure* — the\nlaboratory service ordered and reimbursed. LOINC codes identify the *resulted\nobservation* — the specific analyte and measurement reported. In a Medicare\nFFS claims dataset, a CPT 80053 (comprehensive metabolic panel) appears as a\nclaim but does not tell you the individual constituent results (glucose,\ncreatinine, ALT, etc.) or their numeric values; those live only in EHR or\nlab feed data coded with LOINC. This claims-vs-EHR split is a major\npractical constraint: whether a patient had an HbA1c above 8% in a given\nquarter is knowable from EHR/LOINC-coded data, invisible in claims. A\nstudy that needs lab *values* for outcome ascertainment or covariate\ndefinition requires EHR access; CPT codes from claims can only confirm\n*whether a test was ordered*, not its result.\n\n**Pros, cons, and trade-offs.**\n- **vs local codes only (no LOINC mapping):** Local-only systems cannot\n  exchange, aggregate, or benchmark lab results across institutions. 
LOINC\n  costs significant informatics effort to map and maintain, but without it\n  multi-site studies and common data model participation are infeasible.\n  **Prefer LOINC**: a site without LOINC mapping is, for practical purposes,\n  excluded from most large-scale EHR network research.\n- **vs free-text analyte name matching:** Matching on \"creatinine\" as a\n  string in the lab test name field is fast but conflates Jaffe and enzymatic\n  methods, serum and urine, and point-in-time and timed collections. LOINC\n  matching is more work up front (explicit code list) but produces a\n  reproducible, method-specific, specimen-specific cohort definition.\n  **Prefer LOINC** for any phenotype where method or specimen matters.\n- **vs ICD codes for lab findings:** ICD diagnosis codes can capture that a\n  patient has chronic kidney disease (N18.*) but cannot represent the\n  measured creatinine value, the eGFR trajectory, or the staging date.\n  LOINC captures the quantitative measurement; ICD captures the clinical\n  conclusion. **Use both**: ICD for broad cohort screening, LOINC for\n  value-level phenotype precision (confirmed eGFR < 60 on two occasions ≥\n  90 days apart).\n- **vs SNOMED CT for observations:** SNOMED CT models the *clinical concept*\n  with rich ontological relationships; LOINC models the *test identity* for\n  ordering and reporting. OMOP uses LOINC for Measurement and SNOMED for\n  Observation and Condition. 
**Use both in appropriate domains**; do not\n  substitute one for the other in the OMOP context.\n\n**When to use.**\n- Building any lab-based phenotype (eGFR for CKD staging, HbA1c for\n  glycemic control, troponin for ACS adjudication, viral load for HIV\n  suppression, CBC for cytopenias).\n- Harmonizing lab results across multiple EHR systems or data partners in\n  a distributed research network (PCORnet, OHDSI, Sentinel).\n- Extracting vital-sign trajectories (blood pressure, BMI, heart rate) from\n  OMOP Measurement tables.\n- Linking structured PRO instrument responses (PHQ-9, GAD-7) from EHR data\n  into an outcomes analysis.\n- Auditing OMOP data quality by inspecting unmapped (concept_id = 0)\n  measurement records.\n- Specifying the exact observation set for a regulatory-submission RWE\n  study where the FDA or payer reviewer must be able to replicate the code\n  list without ambiguity.\n\n**When NOT to use — and when it is actively misleading or dangerous.**\n- **When you treat a single LOINC code as covering all variants of an\n  analyte.** Selecting only 2160-0 (serum/plasma, mass concentration) for\n  creatinine and missing 38483-4 (creatinine in whole blood) and any\n  method-specific codes will undercount the lab record population and\n  introduce selection by specimen and method: any creatinine result that a\n  site mapped to a different LOINC will be lost. Build a code list, not a\n  single code.\n- **When LOINC mapping completeness has not been audited.** If 30% of a\n  site's creatinine results are in unmapped records (concept_id = 0), a\n  LOINC-filtered query is effectively discarding 30% of the data. Using such\n  a query to compute eGFR-based staging will produce systematically wrong\n  stage distributions. 
Audit mapping rate before any lab-based analysis.\n- **When you apply a threshold in mg/dL to a mix of mg/dL and µmol/L\n  results.** This is one of the most dangerous arithmetic errors in\n  multi-site lab studies: a creatinine of 88.4 µmol/L is 1.00 mg/dL, not\n  88.4 mg/dL. A threshold that treats all results as mg/dL will classify\n  nearly all µmol/L results (which arrive in the 50–300 range) as severe\n  renal impairment. Unit harmonization must precede any numeric threshold\n  application.\n- **When using LOINC codes from claims data alone.** Standard CPT-coded\n  claims do not carry LOINC codes or result values. LOINC is a feature of\n  EHR and LIS data, not of billing claims. Assuming a lab result is captured\n  because a CPT billing code appears in claims is an ascertainment error.\n- **For drug exposure ascertainment.** Medication orders may appear in the\n  EHR as observations, but exposure ascertainment from LOINC-coded\n  observations is unreliable — use RxNorm-coded orders/administrations and\n  pharmacy dispense records, not LOINC observation records, for drug exposure\n  definition.\n\n**Data-source operational depth.**\n- **EHR / OMOP Measurement table:** LOINC codes appear as `measurement_concept_id`;\n  unmapped local codes appear with concept_id = 0 and the local code in\n  `measurement_source_value`. The numeric result is in `value_as_number`\n  with `unit_concept_id` from UCUM. Always filter on unit concept as well as\n  LOINC concept to avoid unit-mixing errors. Include records with\n  `measurement_concept_id = 0` in an audit query to quantify the unmapped\n  fraction before excluding them.\n- **PCORnet LAB_RESULT_CM table:** Contains `LAB_LOINC` (the mapped LOINC\n  code) and `RAW_LAB_CODE` (the local code). Sites with low mapping\n  completeness will have many populated `RAW_LAB_CODE` records with missing\n  `LAB_LOINC`. 
The network's data quality reporting should include\n  LAB_LOINC fill rate per site per common analyte.\n- **Claims (CPT):** CPT panel codes (e.g., 80053 comprehensive metabolic\n  panel) identify that a test was ordered and billed; individual constituent\n  LOINC codes and numeric results are unavailable. Use claims for test\n  utilization analysis (was the test ordered?), not for result-level\n  phenotyping.\n- **Registry:** Disease registries often include key lab values (PSA for\n  prostate cancer, tumor markers, staging labs) as structured fields; they\n  may or may not carry LOINC codes, depending on registry design. Verify\n  code presence before assuming LOINC alignment.\n- **Linked EHR-claims:** The ideal substrate for combining LOINC-coded lab\n  values (from EHR) with complete medication utilization and CPT test\n  ordering (from claims). Linkage enables: CPT confirms the test was ordered\n  in the claims (utilization); LOINC provides the result value in the EHR\n  (phenotyping). Reconcile by matching CPT service date to measurement date\n  within a clinically sensible window (e.g., ± 3 days).\n\n**Maintenance and licensing.** LOINC is maintained by the Regenstrief\nInstitute at Indiana University and released approximately twice per year\n(typically February and August). The terminology is freely available for\ndownload and use after user registration; the license explicitly prohibits\nbulk reproduction of the full LOINC table in competing products. Small\nillustrative code lists with attribution (\"Source: Regenstrief Institute\nLOINC, loinc.org\") are permitted in publications and protocols.",
    "primary_category": "Data_Standard",
    "tags": [
      "coding-system",
      "data-standard",
      "primitive",
      "laboratory",
      "ehr",
      "omop",
      "loinc",
      "local-code",
      "unit-harmonization",
      "ucum",
      "phenotyping"
    ],
    "applies_to_study_types": [
      "ehr_study",
      "cohort_retrospective",
      "cohort_prospective",
      "multi_database",
      "target_trial_emulation",
      "claims_analysis"
    ],
    "data_sources": [
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1373/49.4.624",
        "url": "https://doi.org/10.1373/49.4.624",
        "citation_text": "McDonald CJ, Huff SM, Suico JG, et al. LOINC, a universal standard for identifying laboratory observations: a 5-year update. Clinical Chemistry. 2003;49(4):624-633.",
        "year": 2003,
        "authors_short": "McDonald et al.",
        "notes": "Canonical 5-year status report from the Regenstrief team describing LOINC structure (the six-axis fully specified name), adoption trajectory, licensing model, and the rationale for a universal observation identifier standard — the primary reference for LOINC architecture in RWE literature."
      },
      {
        "role": "explain",
        "doi": "10.1093/clinchem/42.1.81",
        "url": "https://doi.org/10.1093/clinchem/42.1.81",
        "citation_text": "Forrey AW, McDonald CJ, DeMoor G, et al. Logical observation identifier names and codes (LOINC) database: a public use set of codes and names for electronic reporting of clinical laboratory test results. Clinical Chemistry. 1996;42(1):81-90.",
        "year": 1996,
        "authors_short": "Forrey et al.",
        "notes": "Original LOINC publication establishing the foundational six-axis model and the rationale for a public-use terminology for laboratory result reporting — the intellectual origin of LOINC's design."
      },
      {
        "role": "demonstrate",
        "doi": "10.1016/j.jbi.2012.01.008",
        "url": "https://doi.org/10.1016/j.jbi.2012.01.008",
        "citation_text": "Lin MC, Vreeman DJ, McDonald CJ, Huff SM. Auditing consistency and usefulness of LOINC use among three large institutions — using version spaces for grouping LOINC codes. Journal of Biomedical Informatics. 2012;45(4):658-666.",
        "year": 2012,
        "authors_short": "Lin et al.",
        "notes": "Demonstrates real-world LOINC usage heterogeneity across three large institutions — including how the same analyte is mapped to different LOINC codes by different sites and how version-space grouping can detect mapping inconsistencies — directly illustrating the local-code problem that dominates multi-site lab-based RWE."
      },
      {
        "role": "use",
        "doi": "10.2196/81254",
        "url": "https://doi.org/10.2196/81254",
        "citation_text": "Naliyatthaliyazchayil P, Ogilvie M, Melo M, et al. Harmonizing Logical Observation Identifiers Names and Codes (LOINC) codes and units in real-world oncology data: method development and evaluation. JMIR Medical Informatics. 2026. doi:10.2196/81254.",
        "year": 2026,
        "authors_short": "Naliyatthaliyazchayil et al.",
        "notes": "Operational application of LOINC harmonization in real-world oncology data — covers code-list construction, unit normalization (UCUM), and evaluation of harmonization completeness across multi-site oncology EHR data, mirroring the workflow required in any LOINC-based RWE study."
      },
      {
        "role": "use",
        "doi": null,
        "url": "https://loinc.org",
        "citation_text": "Regenstrief Institute. LOINC — Logical Observation Identifiers Names and Codes. Indianapolis, IN: Regenstrief Institute; 2024. Available at: https://loinc.org.",
        "year": 2024,
        "authors_short": "Regenstrief Institute",
        "notes": "Official LOINC home page and download portal. Register here to obtain the full LOINC database (CSV, FHIR, OWL) under the Regenstrief license. Includes the LOINC search tool for looking up codes by analyte, property, system, and method."
      },
      {
        "role": "use",
        "doi": null,
        "url": "https://loinc.org/kb/license/",
        "citation_text": "Regenstrief Institute. LOINC Copyright Notice and License. Indianapolis, IN: Regenstrief Institute. Available at: https://loinc.org/kb/license/.",
        "year": 2024,
        "authors_short": "Regenstrief Institute",
        "notes": "Official Regenstrief LOINC license and terms of use. LOINC is freely available worldwide but requires user registration and acceptance of the license. Users may incorporate LOINC codes in systems and publications with attribution; bulk reproduction of the full LOINC table in competing products is prohibited. Small illustrative examples with attribution are permitted."
      }
    ],
    "plain_language_summary": "Every laboratory test, vital sign, and clinical measurement can arrive from different hospitals using completely different internal naming systems, making it impossible to combine data across sites without a shared lookup key. LOINC (Logical Observation Identifiers Names and Codes) solves this by assigning a universal numeric code — for example, 2160-0 for serum creatinine — to each unique combination of analyte, specimen type, and measurement method, so that a result from one hospital can be matched instantly to the same test at another hospital. For a researcher building a study that needs lab values (kidney function, blood sugar, hemoglobin), the most common trap is assuming that one LOINC code covers all versions of a test — it does not, and mixing results coded in different units (like milligrams per deciliter versus micromoles per liter) can produce numbers that are off by a factor of 88, silently corrupting every threshold-based calculation.",
    "key_terms": [
      {
        "term": "analyte (Component)",
        "definition": "The specific substance being measured in a lab test — for example, creatinine, glucose, or hemoglobin — which is the first of the six named parts of a LOINC code."
      },
      {
        "term": "specimen (System)",
        "definition": "The biological sample the measurement is taken from, such as serum, plasma, whole blood, or urine; different specimens for the same analyte receive different LOINC codes because the reference ranges differ."
      },
      {
        "term": "scale",
        "definition": "Whether a result is a number (quantitative), a ranked category like low/normal/high (ordinal), a name such as a microbe identification (nominal), or a block of text (narrative); the scale is one of the six axes of a LOINC code."
      },
      {
        "term": "local code",
        "definition": "An internal lab identifier assigned by a single hospital or laboratory system, such as \"CREAT-S\" or \"LB-1042,\" which has no meaning outside that site and must be translated to LOINC before results can be shared or combined with data from other sites."
      },
      {
        "term": "UCUM",
        "definition": "The Unified Code for Units of Measure — the companion standard to LOINC that specifies exactly how to write measurement units (mg/dL, umol/L, %) so that unit mismatches across sites can be detected and corrected before numeric thresholds are applied."
      }
    ],
    "worked_example": {
      "scenario": "A pharmacoepidemiology team is building a chronic kidney disease (CKD) cohort from a three-site OMOP network. The outcome definition requires two serum creatinine measurements of 1.50 mg/dL or higher, at least 90 days apart, to confirm CKD stage 3 or worse. The analyst queries the MEASUREMENT table across all three sites and discovers that creatinine results arrive under four different identifiers: two different LOINC codes (one for the Jaffe method, one for the enzymatic method), one unmapped local code, and a mix of units (mg/dL at sites A and C; µmol/L at site B). The worked example below shows the raw data, the unit conversion, and how the threshold is applied correctly after harmonization.",
      "dataset": {
        "caption": "Raw creatinine rows from the OMOP MEASUREMENT table across three sites. Site B reports in µmol/L; sites A and C report in mg/dL. Site C has one unmapped local code with a NULL LOINC.",
        "columns": [
          "person_id",
          "site",
          "measurement_date",
          "loinc_code",
          "local_code",
          "value_as_number",
          "unit"
        ],
        "rows": [
          [
            1001,
            "A",
            "2023-03-01",
            "2160-0",
            "CREAT-S",
            1.2,
            "mg/dL"
          ],
          [
            1001,
            "A",
            "2023-07-15",
            "2160-0",
            "CREAT-S",
            1.55,
            "mg/dL"
          ],
          [
            1002,
            "B",
            "2023-04-10",
            "38483-4",
            "KREA",
            132.6,
            "umol/L"
          ],
          [
            1002,
            "B",
            "2023-08-22",
            "38483-4",
            "KREA",
            168.96,
            "umol/L"
          ],
          [
            1003,
            "C",
            "2023-02-28",
            null,
            "LB-CREAT",
            0.95,
            "mg/dL"
          ],
          [
            1003,
            "C",
            "2023-09-05",
            "2160-0",
            "CREAT-S",
            1.62,
            "mg/dL"
          ]
        ]
      },
      "steps": [
        "Step 1 — Build the LOINC code list: serum creatinine is represented by at least two LOINC codes in this network: 2160-0 (Jaffe method, Ser/Plas) and 38483-4 (enzymatic method, Ser/Plas). Both measure the same analyte in the same specimen, so both belong in the code list. The local code LB-CREAT at site C is unmapped (NULL LOINC); include it via measurement_source_value matching after confirming with the site data manager that it represents serum creatinine.",
        "Step 2 — Unit harmonization: site B reports in µmol/L. The conversion to mg/dL is: value_mg_dL = value_umol_L / 88.4. For person 1002: 132.6 / 88.4 = 1.50 mg/dL (first measurement) and 168.96 / 88.4 = 1.91 mg/dL (second measurement).",
        "Step 3 — Apply the threshold to harmonized values. The 1.50 mg/dL threshold is applied after unit conversion. Harmonized creatinine values per person: Person 1001 (site A): 1.20 mg/dL (2023-03-01) and 1.55 mg/dL (2023-07-15). Second value meets threshold; gap = 136 days >= 90 days. Person 1002 (site B): 1.50 mg/dL (2023-04-10) and 1.91 mg/dL (2023-08-22). Both values meet threshold; gap = 134 days >= 90 days. Person 1003 (site C): 0.95 mg/dL (2023-02-28, unmapped local code, included after site confirmation) and 1.62 mg/dL (2023-09-05). Only the second value meets threshold; gap = 219 days but only one qualifying measurement, so person 1003 does not meet the two-value CKD criterion.",
        "Step 4 — Apply the two-measurement CKD criterion (both >= 1.50 mg/dL, >= 90 days apart): person 1001 qualifies (1 out of 2 measurements above threshold? No — only the second is >= 1.50); re-check: 1.20 < 1.50 (fails) and 1.55 >= 1.50 (passes) — only one qualifying value, so person 1001 does NOT meet the criterion. Person 1002: both 1.50 >= 1.50 and 1.91 >= 1.50, gap = 134 days >= 90. Person 1002 QUALIFIES. Person 1003: only one qualifying value (1.62 on 2023-09-05). Does NOT qualify.",
        "Step 5 — Without unit harmonization (the error case): if the analyst applies the 1.50 mg/dL threshold directly to site B's µmol/L values, person 1002's results (132.6 and 168.96) would appear to be extreme outliers far above any reasonable creatinine (normal range in mg/dL is 0.5–1.2), causing them to be either winsorized, excluded, or flagged incorrectly. The unit error renders the site B data unusable for threshold-based staging without conversion."
      ],
      "result": "After LOINC code-list expansion (including both 2160-0 and 38483-4), local-code inclusion via site confirmation, and unit harmonization (132.6 / 88.4 = 1.50 mg/dL; 168.96 / 88.4 = 1.91 mg/dL), exactly one patient (person 1002) meets the CKD criterion of two creatinine values >= 1.50 mg/dL at least 90 days apart. Person 1001 has only one qualifying value. Person 1003 also has only one qualifying value (the earlier unmapped-code record is below threshold). Cohort size = 1 of 3 patients. Without unit harmonization, site B data would be unusable and person 1002 would be lost entirely."
    },
    "prerequisites": [],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "LOINC code-list construction for a single analyte",
        "description": "For any analyte, the correct LOINC code list requires enumerating every combination of method and specimen that the network's sites use for that test. For serum creatinine: 2160-0 (Jaffe, Ser/Plas) and 38483-4 (enzymatic, Ser/Plas) are the two most common; additional codes exist for creatinine in urine (e.g., 2161-8 for 24-hour urine), random urine (14682-9), and dialysis fluid. Including urine creatinine codes in a serum creatinine phenotype is a specimen axis error that silently expands the analyte pool with non-equivalent measurements.",
        "edge_cases": [
          "Sites may map their enzymatic creatinine to 2160-0 (the Jaffe LOINC) due to historical mapping; confirm method via measurement_source_value or site data dictionary before assuming method purity.",
          "Point-of-care creatinine analyzers (i-STAT) generate results under separate LOINC codes because the instrument and specimen handling differ from central-lab measurements; these should typically be analyzed separately or excluded from eGFR calculations."
        ],
        "data_source_notes": "OMOP: query CONCEPT table for concept_class_id = 'Lab Test' and CONCEPT_ANCESTOR to find all descendant LOINC concepts for an analyte. PCORnet: query DATA_CHARACTERIZATION reports for LAB_LOINC frequency tables per site to see which codes are actually populated."
      },
      {
        "name": "Unit harmonization workflow",
        "description": "Before applying any numeric threshold, standardize all results to a single unit. Serum creatinine: mg/dL is the US clinical standard; µmol/L is common in Europe and Canada (conversion: mg/dL = µmol/L / 88.4). Glucose: mg/dL vs mmol/L (conversion: mg/dL = mmol/L * 18.018). HbA1c: percent vs mmol/mol IFCC (conversion: % = mmol/mol / 10.929 + 2.15, approximately mmol/mol / 10.929 + 2.15). When multiple units appear, never apply a threshold to the raw value without first checking the unit column.",
        "edge_cases": [
          "Some EHR extracts store units as free text rather than UCUM codes; \"mg/dl\" versus \"mg/dL\" versus \"MG/DL\" must be normalized before unit-conditional logic is applied.",
          "A unit of NULL does not mean mg/dL; it means the unit was not recorded. Exclude NULL-unit records or impute based on the site-level unit distribution, with documentation."
        ],
        "data_source_notes": "OMOP: `unit_concept_id` should be a UCUM-coded concept; join to CONCEPT to get the unit string. Treat `unit_concept_id = 0` (unmapped unit) the same as NULL — do not assume a default unit."
      },
      {
        "name": "Local-code inclusion alongside LOINC",
        "description": "For sites with low LOINC mapping completeness, include results via `measurement_source_value` (local code) after site data manager confirmation that the local code maps to the target analyte and specimen. Document the local codes included, the mapping evidence, and the fraction of total records recovered by local-code inclusion in the study's data quality appendix.",
        "edge_cases": [
          "A local code that maps to multiple LOINC codes (e.g., \"CREAT\" is used for both serum and urine creatinine at some sites) cannot be safely included without specimen-level confirmation.",
          "Local code vocabularies change with LIS upgrades; a code valid in 2018 may not be valid in 2023 at the same site."
        ],
        "data_source_notes": "PCORnet: `RAW_LAB_CODE` contains the site local code; confirm against the site's lab code crosswalk. OMOP: `measurement_source_value`."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "ehr-study",
        "pros_of_this": "LOINC is the vocabulary layer that makes EHR lab data aggregatable across sites; without LOINC codes in the EHR measurement table, multi-site lab-based phenotyping is infeasible.",
        "cons_of_this": "LOINC mapping completeness is a property of the EHR site's informatics investment, not of LOINC itself; a study that assumes complete LOINC coverage will silently drop unmapped records.",
        "when_to_prefer": "Always audit LOINC mapping completeness before treating LOINC-filtered results as representative; supplement with local-code queries at low-mapping sites."
      },
      {
        "compared_to": "algorithm-validation",
        "pros_of_this": "An explicit LOINC code list is the first step in any lab-based outcome algorithm and is a required component of the algorithm specification.",
        "cons_of_this": "LOINC codes alone are not a validated phenotype; the full algorithm must specify thresholds, measurement timing, unit harmonization, and a reference standard against which PPV/sensitivity are estimated.",
        "when_to_prefer": "Specify the LOINC code list as the capture component, then validate the full algorithm (code list + threshold + timing) against chart review."
      },
      {
        "compared_to": "omop-cdm-method-patterns-rwe",
        "pros_of_this": "LOINC is the standard vocabulary for the OMOP Measurement domain; LOINC code-list construction is the prerequisite step before any OMOP Measurement query.",
        "cons_of_this": "OMOP adds the data model layer (table structure, era logic, concept sets) on top of LOINC; understanding LOINC code axes is necessary but not sufficient for correct OMOP Measurement queries.",
        "when_to_prefer": "Use LOINC knowledge to build the concept set, then apply OMOP query patterns to extract and aggregate the measurements."
      }
    ],
    "implementation_notes_by_data_source": {
      "ehr": "In OMOP Measurement, filter on `measurement_concept_id` IN (your LOINC concept list). Also query `measurement_source_value` for unmapped local codes after site confirmation. Always check `unit_concept_id` and harmonize before applying numeric thresholds. Use concept_id = 0 records in a quality audit step, not in the phenotype numerator.",
      "claims": "CPT codes identify lab test orders (billing); they do not carry LOINC codes or numeric result values. Use CPT for utilization analysis (was a test ordered?). For result-level phenotyping, EHR or lab feed data with LOINC is required. Do not substitute CPT for LOINC when the study objective requires result values.",
      "registry": "Verify whether the registry carries LOINC codes for its lab fields or uses a proprietary code set. If proprietary, build a crosswalk to LOINC before attempting network-level aggregation. Registry lab data may have pre-applied clinical thresholds (e.g., eGFR stage) rather than raw values.",
      "linked": "In linked EHR-claims, use CPT service dates from claims to confirm the test was ordered (utilization ascertainment) and LOINC-coded EHR records for the result value (phenotype ascertainment). Reconcile dates within a ± 3-day window to match lab order to result. Linkage may not capture labs ordered outside the linked health system — document the catchment."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\n\n# ------------------------------------------------------------------ #\n# 1. LOINC check-digit validator (mod-10 Luhn variant)                #\n# ------------------------------------------------------------------ #\n# LOINC uses a Luhn mod-10 check digit appended after a hyphen,\n# e.g. \"2160-0\" or \"38483-4\".\n# Algorithm (per Regenstrief spec):\n#   - Take the numeric prefix digits.\n#   - Double every second digit from the right (rightmost prefix digit\n#     is position 1, not doubled).\n#   - Subtract 9 from doubled values > 9.\n#   - Sum all digits.\n#   - Check digit = (10 - (sum % 10)) % 10.\n\ndef validate_loinc(code: str) -> bool:\n    \"\"\"Return True if the LOINC code passes the mod-10 check digit.\n\n    Args:\n        code: LOINC code string, e.g. '2160-0' or '38483-4'.\n\n    Returns:\n        True if valid format and check digit matches; False otherwise.\n    \"\"\"\n    if not isinstance(code, str):\n        return False\n    parts = code.strip().split(\"-\")\n    if len(parts) != 2:\n        return False\n    numeric_part, check_str = parts\n    if not numeric_part.isdigit() or not check_str.isdigit():\n        return False\n    check_digit = int(check_str)\n\n    digits = [int(d) for d in numeric_part]\n    # Double every second digit from the right (index from right: 0-based)\n    # rightmost digit of numeric_part has position 0 (not doubled)\n    total = 0\n    for i, d in enumerate(reversed(digits)):\n        if i % 2 == 1:          # even positions from right (1-indexed) -> double\n            d *= 2\n            if d > 9:\n                d -= 9\n        total += d\n\n    expected = (10 - (total % 10)) % 10\n    return expected == check_digit\n\n\n# Spot-check known codes\nassert validate_loinc(\"2160-0\"),  \"serum creatinine (Jaffe) should pass\"\nassert validate_loinc(\"38483-4\"), \"serum creatinine (enzymatic) should pass\"\nassert validate_loinc(\"4548-4\"),  \"HbA1c should pass\"\nassert 
not validate_loinc(\"2160-1\"), \"wrong check digit should fail\"\nassert not validate_loinc(\"ABCD-0\"), \"non-numeric prefix should fail\"\n\n\n# ------------------------------------------------------------------ #\n# 2. Creatinine harmonization across LOINC codes and units            #\n# ------------------------------------------------------------------ #\n# Serum creatinine LOINC codes used in this example:\n#   2160-0  Creatinine [Mass/volume] in Serum or Plasma (Jaffe method)\n#   38483-4 Creatinine [Mass/volume] in Serum or Plasma (Enzymatic method)\n# Conversion: mg/dL = umol/L / 88.4\n\nCREATININE_LOINCS = {\"2160-0\", \"38483-4\"}\n\n# Simulated OMOP Measurement table (as would be extracted from a CDM)\nraw_data = {\n    \"person_id\":         [1001, 1001, 1002, 1002, 1003, 1003],\n    \"site\":              [\"A\",  \"A\",  \"B\",  \"B\",  \"C\",  \"C\"],\n    \"measurement_date\":  [\"2023-03-01\", \"2023-07-15\",\n                          \"2023-04-10\", \"2023-08-22\",\n                          \"2023-02-28\", \"2023-09-05\"],\n    \"loinc_code\":        [\"2160-0\",  \"2160-0\",\n                          \"38483-4\", \"38483-4\",\n                          None,      \"2160-0\"],\n    \"local_code\":        [\"CREAT-S\", \"CREAT-S\",\n                          \"KREA\",    \"KREA\",\n                          \"LB-CREAT\",\"CREAT-S\"],\n    \"value_as_number\":   [1.20, 1.55, 132.6, 168.96, 0.95, 1.62],\n    \"unit\":              [\"mg/dL\", \"mg/dL\",\n                          \"umol/L\", \"umol/L\",\n                          \"mg/dL\",  \"mg/dL\"],\n}\ndf = pd.DataFrame(raw_data)\ndf[\"measurement_date\"] = pd.to_datetime(df[\"measurement_date\"])\n\n# --- Step 1: LOINC code validation for all non-null codes -------- #\nloinc_valid = df[\"loinc_code\"].dropna().apply(validate_loinc)\nassert loinc_valid.all(), f\"Invalid LOINC codes detected: {df.loc[~loinc_valid.reindex(df.index, fill_value=True), 'loinc_code'].tolist()}\"\n\n# --- Step 2: 
Audit unmapped records (NULL LOINC) ------------------ #\nunmapped = df[df[\"loinc_code\"].isna()].copy()\nprint(f\"Unmapped records (NULL LOINC): {len(unmapped)}\")\nprint(unmapped[[\"person_id\", \"site\", \"local_code\", \"value_as_number\", \"unit\"]])\n# In production: confirm with site data manager that LB-CREAT = serum creatinine\n# and include after confirmation. Here we include it as a simplification.\n\n# --- Step 3: Filter to creatinine records (LOINC + confirmed local) #\nmask_loinc = df[\"loinc_code\"].isin(CREATININE_LOINCS)\nmask_local = df[\"local_code\"].isin({\"LB-CREAT\"})  # site-confirmed\ncreatinine = df[mask_loinc | mask_local].copy()\n\n# --- Step 4: Unit harmonization to mg/dL ------------------------- #\nUMOL_PER_MGDL = 88.4          # 1 mg/dL creatinine = 88.4 umol/L\n\ndef harmonize_creatinine(row):\n    v = row[\"value_as_number\"]\n    unit = row[\"unit\"]\n    # Missing units (None/NaN) are not strings; treat them as unknown.\n    u = unit.lower().strip() if isinstance(unit, str) else \"\"\n    if u in (\"umol/l\", \"µmol/l\"):\n        return v / UMOL_PER_MGDL\n    elif u == \"mg/dl\":\n        return v\n    else:\n        return float(\"nan\")   # unknown unit -> flag for review\n\ncreatinine[\"creatinine_mgdl\"] = creatinine.apply(harmonize_creatinine, axis=1)\n\n# Verify the unit conversion arithmetic:\n# 132.6 / 88.4 = 1.50 mg/dL\n# 168.96 / 88.4 = 1.91 mg/dL\nimport math\nassert math.isclose(132.6 / 88.4, 1.50, rel_tol=0.01), \"132.6/88.4 should equal 1.50\"\nassert math.isclose(168.96 / 88.4, 1.91, rel_tol=0.01), \"168.96/88.4 should equal ~1.91\"\n\n# --- Step 5: Apply CKD phenotype threshold ----------------------- #\nTHRESHOLD_MGDL = 1.50\nDAYS_APART = 90\n\ncreatinine = creatinine.sort_values([\"person_id\", \"measurement_date\"])\n# Compare at the 2-decimal reporting precision: floating-point division can land\n# just below the threshold (132.6 / 88.4 is not guaranteed to be exactly 1.5).\ncreatinine[\"meets_threshold\"] = creatinine[\"creatinine_mgdl\"].round(2) >= THRESHOLD_MGDL\n\nqualifying = []\nfor pid, group in creatinine.groupby(\"person_id\"):\n    above = group[group[\"meets_threshold\"]].copy()\n    if len(above) < 2:\n        continue\n    above = 
above.sort_values(\"measurement_date\")\n    for i in range(len(above) - 1):\n        gap = (above.iloc[i + 1][\"measurement_date\"] - above.iloc[i][\"measurement_date\"]).days\n        if gap >= DAYS_APART:\n            qualifying.append(pid)\n            break\n\nprint(f\"\\nCKD cohort (two creatinine >= {THRESHOLD_MGDL} mg/dL, >= {DAYS_APART} days apart):\")\nprint(f\"Qualifying person_ids: {qualifying}\")\n# Expected: [1002] only\nassert qualifying == [1002], f\"Expected [1002], got {qualifying}\"\n\n# Show harmonized values for verification\nprint(\"\\nHarmonized creatinine table:\")\nprint(creatinine[[\"person_id\", \"site\", \"measurement_date\",\n                    \"loinc_code\", \"value_as_number\", \"unit\", \"creatinine_mgdl\"]])",
        "description": "Two utilities for LOINC-based lab harmonization in a pandas DataFrame representing an OMOP-style Measurement table. The first validates that a LOINC code string is syntactically correct using the mod-10 Luhn check-digit algorithm (detects single-character transcription errors). The second harmonizes a creatinine lab table where the same analyte arrives under multiple LOINC codes and two unit flavors (mg/dL and umol/L) into a single analysis-ready column in mg/dL, with an audit of unmapped records.",
        "dependencies": [
          "pandas"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(dplyr)\n\n# ------------------------------------------------------------------ #\n# 1. LOINC check-digit validator (mod-10 Luhn variant)                #\n# ------------------------------------------------------------------ #\nvalidate_loinc <- function(code) {\n  # Returns TRUE if the LOINC code passes the mod-10 check digit.\n  # code: character string, e.g. \"2160-0\" or \"38483-4\"\n  if (!is.character(code) || is.na(code)) return(FALSE)\n  parts <- strsplit(trimws(code), \"-\", fixed = TRUE)[[1]]\n  if (length(parts) != 2) return(FALSE)\n  numeric_part <- parts[1]\n  check_str    <- parts[2]\n  if (grepl(\"[^0-9]\", numeric_part) || grepl(\"[^0-9]\", check_str)) return(FALSE)\n  check_digit <- as.integer(check_str)\n  digits <- as.integer(strsplit(numeric_part, \"\")[[1]])\n  # Double every second digit from the right (position 1 from right = not doubled)\n  n <- length(digits)\n  for (i in seq_along(digits)) {\n    pos_from_right <- n - i   # 0-based position from right\n    if (pos_from_right %% 2 == 1) {   # odd position from right -> double\n      d <- digits[i] * 2\n      if (d > 9) d <- d - 9\n      digits[i] <- d\n    }\n  }\n  expected <- (10 - (sum(digits) %% 10)) %% 10\n  return(expected == check_digit)\n}\n\n# Spot-checks\nstopifnot(validate_loinc(\"2160-0\"))     # serum creatinine Jaffe\nstopifnot(validate_loinc(\"38483-4\"))    # serum creatinine enzymatic\nstopifnot(validate_loinc(\"4548-4\"))     # HbA1c\nstopifnot(!validate_loinc(\"2160-1\"))    # wrong check digit\nstopifnot(!validate_loinc(\"ABCD-0\"))    # non-numeric\n\n# ------------------------------------------------------------------ #\n# 2. 
Creatinine harmonization across LOINC codes and units            #\n# ------------------------------------------------------------------ #\n# Conversion constant: 1 mg/dL creatinine = 88.4 umol/L\nUMOL_PER_MGDL <- 88.4\n\n# Simulated OMOP Measurement tibble\ndf <- tibble(\n  person_id        = c(1001, 1001, 1002, 1002, 1003, 1003),\n  site             = c(\"A\",  \"A\",  \"B\",  \"B\",  \"C\",  \"C\"),\n  measurement_date = as.Date(c(\"2023-03-01\",\"2023-07-15\",\n                               \"2023-04-10\",\"2023-08-22\",\n                               \"2023-02-28\",\"2023-09-05\")),\n  loinc_code       = c(\"2160-0\",  \"2160-0\",\n                       \"38483-4\", \"38483-4\",\n                       NA,        \"2160-0\"),\n  local_code       = c(\"CREAT-S\",\"CREAT-S\",\n                       \"KREA\",   \"KREA\",\n                       \"LB-CREAT\",\"CREAT-S\"),\n  value_as_number  = c(1.20, 1.55, 132.6, 168.96, 0.95, 1.62),\n  unit             = c(\"mg/dL\",\"mg/dL\",\n                       \"umol/L\",\"umol/L\",\n                       \"mg/dL\",\"mg/dL\")\n)\n\n# LOINC codes for serum creatinine (Jaffe + enzymatic)\nCREATININE_LOINCS <- c(\"2160-0\", \"38483-4\")\nLOCAL_CREATININE  <- c(\"LB-CREAT\")   # site-confirmed local codes\n\n# Step 1: Validate non-null LOINC codes\nvalid_loinc <- df |>\n  filter(!is.na(loinc_code)) |>\n  pull(loinc_code) |>\n  sapply(validate_loinc)\nstopifnot(all(valid_loinc))\n\n# Step 2: Audit unmapped records\nunmapped <- df |> filter(is.na(loinc_code))\ncat(\"Unmapped records (NULL LOINC):\", nrow(unmapped), \"\\n\")\n\n# Step 3: Filter to creatinine (LOINC + site-confirmed local codes)\ncreatinine <- df |>\n  filter(loinc_code %in% CREATININE_LOINCS | local_code %in% LOCAL_CREATININE)\n\n# Step 4: Unit harmonization to mg/dL\n# Conversion: mg/dL = umol/L / 88.4\ncreatinine <- creatinine |>\n  mutate(\n    creatinine_mgdl = case_when(\n      tolower(trimws(unit)) %in% c(\"umol/l\", \"µmol/l\") ~ value_as_number / 
UMOL_PER_MGDL,\n      tolower(trimws(unit)) == \"mg/dl\"                   ~ value_as_number,\n      TRUE                                               ~ NA_real_\n    )\n  )\n\n# Verify conversion arithmetic: 132.6 / 88.4 = 1.50 mg/dL\n# 168.96 / 88.4 = 1.91 mg/dL\nstopifnot(abs(132.6 / UMOL_PER_MGDL - 1.50) < 0.01)\nstopifnot(abs(168.96 / UMOL_PER_MGDL - 1.91) < 0.01)\n\n# Step 5: Apply CKD threshold (>= 1.50 mg/dL on >= 2 occasions >= 90 days apart)\nTHRESHOLD   <- 1.50\nDAYS_APART  <- 90\n# Tolerance for floating-point error: 132.6 / 88.4 can evaluate to just under 1.50,\n# which would silently drop a reading that sits exactly on the threshold.\nTOLERANCE   <- 1e-9\n\nckd_cohort <- creatinine |>\n  filter(creatinine_mgdl >= THRESHOLD - TOLERANCE) |>\n  arrange(person_id, measurement_date) |>\n  group_by(person_id) |>\n  summarise(\n    n_above = n(),\n    first_date = min(measurement_date),\n    last_date  = max(measurement_date),\n    gap_days   = as.integer(max(measurement_date) - min(measurement_date)),\n    .groups = \"drop\"\n  ) |>\n  filter(n_above >= 2, gap_days >= DAYS_APART)\n\ncat(\"\\nCKD cohort (two creatinine >= 1.50 mg/dL, >= 90 days apart):\\n\")\nprint(ckd_cohort)\n# Expected: person_id 1002 only (gap = 134 days, both values >= 1.50)\nstopifnot(nrow(ckd_cohort) == 1 && ckd_cohort$person_id[1] == 1002)\n\ncat(\"\\nHarmonized creatinine table:\\n\")\nprint(creatinine |>\n  select(person_id, site, measurement_date, loinc_code,\n         value_as_number, unit, creatinine_mgdl))",
        "description": "R implementation of the same two utilities: a LOINC check-digit validator using the mod-10 Luhn algorithm, and a creatinine harmonization workflow that builds a code list, applies unit conversion (µmol/L to mg/dL), and identifies patients meeting a CKD threshold criterion using dplyr. Designed to work on an OMOP-style Measurement tibble.",
        "dependencies": [
          "dplyr"
        ],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  A[Lab result generated<br/>at health system LIS] --> B{Is it mapped<br/>to LOINC?}\n  B -- Yes --> C[LOINC code in<br/>measurement_concept_id]\n  B -- No --> D[Local code in<br/>measurement_source_value<br/>concept_id = 0]\n  C --> E{Unit recorded<br/>in UCUM?}\n  D --> F[Audit: confirm local code<br/>via site data manager]\n  F --> E\n  E -- Yes --> G{Unit matches<br/>target unit?}\n  E -- No --> H[Flag for review;<br/>impute if site distribution known]\n  G -- Yes --> I[Use value directly]\n  G -- No --> J[Convert: e.g. umol/L / 88.4 = mg/dL]\n  I --> K[Apply threshold or<br/>phenotype algorithm]\n  J --> K\n  H --> K\n  K --> L[Lab-based phenotype<br/>ready for analysis]",
        "caption": "LOINC-to-analysis pipeline: from raw LIS output through LOINC mapping audit, unit harmonization, and threshold application to a research-ready lab phenotype. Each branch represents a data-quality decision that must be documented in the study protocol.",
        "alt_text": "Flowchart showing a lab result entering the pipeline, being checked for LOINC mapping, then unit recording, then unit harmonization, then threshold application to produce a final phenotype.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  CPT[\"CPT code (claims)\\nWhat test was ORDERED\\nand BILLED\\n\\nNo result value\\nNo LOINC code\"]\n  LOINC[\"LOINC code (EHR / LIS)\\nWhat test was RESULTED\\nSpecific analyte + method\\n+ specimen + unit\\n\\nCarries numeric value\"]\n  CPT -. \"same patient, same date\\n(± 3 days match)\" .-> LOINC\n  CPT --> PA[\"Test utilization\\nanalysis (claims)\"]\n  LOINC --> PB[\"Result-level phenotyping\\n(EHR / OMOP Measurement)\"]\n  PA -. linked study .-> BOTH[\"Combined: utilization\\n+ result value\"]\n  PB -. linked study .-> BOTH",
        "caption": "CPT versus LOINC: complementary, not interchangeable. CPT codes in claims confirm a test was ordered and billed; LOINC codes in EHR data carry the numeric result. Linked studies can combine both. A study that needs lab values for outcome ascertainment cannot use claims alone.",
        "alt_text": "Diagram showing CPT (claims) providing test utilization data and LOINC (EHR) providing result values, with a linked study combining both via date matching.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "used_with",
        "target_slug": "ehr-study",
        "notes": "LOINC codes are the primary identifier for lab results, vital signs, and clinical observations in EHR-derived data. An EHR study that uses lab-based phenotypes (eGFR, HbA1c, troponin) must build explicit LOINC code lists and audit mapping completeness before defining any lab-dependent cohort or outcome."
      },
      {
        "relation_type": "used_with",
        "target_slug": "algorithm-validation",
        "notes": "A lab-based outcome or exposure algorithm is anchored in a LOINC code list. Algorithm validation (PPV, sensitivity against chart review) must encompass the code list, the unit harmonization, and the threshold rule as a unit — validating any component in isolation does not validate the full phenotype."
      },
      {
        "relation_type": "used_with",
        "target_slug": "omop-cdm-method-patterns-rwe",
        "notes": "In the OMOP CDM, LOINC is the standard vocabulary for the Measurement domain. OMOP Measurement queries must specify LOINC concept IDs, and unmapped records (concept_id = 0) must be handled explicitly. LOINC code selection is the first step in any OMOP Measurement-based analysis."
      },
      {
        "relation_type": "used_with",
        "target_slug": "linked-data",
        "notes": "In linked EHR-claims studies, LOINC-coded lab records from the EHR provide the result values that claims cannot supply. Linkage enables CPT utilization ascertainment from claims and LOINC-based result phenotyping from EHR, reconciled by matching service dates."
      },
      {
        "relation_type": "used_with",
        "target_slug": "biomarker-defined-cohort-rwe",
        "notes": "Biomarker-defined cohorts rely on LOINC-coded lab results for cohort entry criteria (e.g., PSA threshold, eGFR stage, tumor marker level). Building such a cohort requires a deliberate LOINC code list, unit harmonization, and documentation of mapping completeness per biomarker per site."
      },
      {
        "relation_type": "see_also",
        "target_slug": "omop-concept-set-development-rwe",
        "notes": "LOINC codes are incorporated into OMOP concept sets for the Measurement domain. Concept set development for lab-based phenotypes requires knowledge of LOINC axes (analyte, specimen, method) to select the correct set of concept IDs without conflating different specimen types or methods."
      },
      {
        "relation_type": "see_also",
        "target_slug": "tokenization-privacy-preserving-record-linkage-rwe",
        "notes": "LOINC is a key component of Privacy-Preserving Record Linkage (PPRL) protocols in multi-site networks: standardizing lab data to LOINC before linkage ensures that the same measurement can be matched across sites without exchanging local codes that may be identifiable."
      }
    ],
    "aliases": [
      "LOINC",
      "Logical Observation Identifiers Names and Codes"
    ],
    "complexity": "foundational",
    "regulatory_relevance": [
      "fda",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "longitudinal-outcomes-modeling-rwe",
    "name": "Longitudinal Outcomes Modeling",
    "short_definition": "A family of regression methods for repeatedly-measured (panel) outcomes on the same patients over time, whose central choice is whether the target estimand is subject-specific (conditional, from a mixed model) or population-averaged (marginal, from GEE), with MMRM as the regulatory-favored special case for continuous endpoints.",
    "long_description": "**Longitudinal outcomes modeling** addresses the analytic problem created when each patient contributes\n*multiple* measurements of the same outcome over time — monthly HbA1c, quarterly total cost, repeated PHQ-9\nor pain scores, serial eGFR. Those measurements are correlated within a patient, so ordinary regression that\ntreats every (person, time) row as independent reports standard errors that are too small and tests that are\nanti-conservative. This entry is the *family-level decision concept*: it does not re-derive any single model\nbut tells you how to choose among a linear mixed-effects model (LMM), a generalized estimating equation (GEE),\na mixed-model repeated measures (MMRM) analysis, and a generalized linear mixed model (GLMM) for a binary or\ncount longitudinal outcome — and, just as importantly, when a longitudinal model is the wrong tool and a\ntime-to-event or count-rate model belongs instead. The detailed mechanics of each member live in the sibling\nconcepts (`mixed-effects-models-longitudinal-rwe`, `gee-population-average-models-rwe`,\n`mmrm-repeated-measures-rwe`); the value added here is the estimand-first selection logic.\n\n**Core estimand distinction**. The single most consequential — and most frequently botched — choice is\n*conditional vs marginal*. A mixed model with a random patient intercept (or slope) estimates a\n**subject-specific (conditional)** effect: the expected change in *a given patient's* outcome holding that\npatient's random effect fixed (\"how much does this patient's HbA1c fall on drug A vs what it would have been\non drug B\"). A GEE estimates a **population-averaged (marginal)** effect: the contrast in the *population mean*\noutcome between exposure groups, averaging over the random-effect distribution. For an identity-link Gaussian\noutcome the two coefficients coincide, so the LMM-vs-GEE debate is mostly about robustness and missing-data\nassumptions. 
For a *non-linear* link (logistic, log) they do **not** coincide: the marginal log-odds ratio is\nattenuated toward the null relative to the subject-specific log-odds ratio by a factor governed by the random-effect\nvariance, so a GLMM and a GEE on the same binary panel return numerically different — and differently\ninterpretable — estimates. Hubbard et al. (2010) make this the explicit deciding question: report a marginal\neffect when the audience is a population/policy contrast (utilization, budget impact, a coverage decision) and\na subject-specific effect when the clinical question is about an individual patient's trajectory. MMRM is a\nparticular LMM for a continuous endpoint that uses *only* fixed effects for time (categorical visit) with an\n**unstructured within-patient covariance** and no random subject effect, giving a marginal-mean estimate at\neach visit that is valid under missing-at-random (MAR) — which is why the PhRMA working group and FDA reviewers\ntreat it as the default primary analysis for continuous longitudinal trial-like endpoints (Mallinckrodt et al.,\n2008).\n\n**Interpreting the output**\n\nLongitudinal modeling produces different outputs depending on which family member you\nchoose, and the estimand determines which number is reported.\n\nIf a linear mixed model (LMM) with random intercepts was fit to serial HbA1c data,\na fixed-effect treatment coefficient of −0.90 means: for a given patient, the drug\narm reduces HbA1c by 0.90 points more than the comparator on average over follow-up,\nholding that patient's own random intercept (and slope, if modeled) fixed. This is a\nsubject-specific (conditional) estimate — appropriate when the clinical question is\nabout an individual patient's trajectory.\n\nIf a GEE was fit with a logit or log link to a binary or count outcome, the\nexponentiated coefficient is a population-averaged (marginal) odds ratio or rate ratio:\nthe contrast between group means, averaging over the patient distribution. 
For a\nGaussian identity-link outcome the conditional (LMM) and marginal (GEE) coefficients\ncoincide; for non-linear links (logit, log) the conditional coefficient is typically larger in\nmagnitude. Pre-specifying which estimand you target — and which model delivers it —\nis mandatory before seeing results.\n\nIf MMRM was fit for a continuous endpoint at scheduled visits, the primary output is\nthe arm-by-visit interaction contrast at the target visit (e.g., mean difference in\neGFR change from baseline at Month 12 = 6.4 mL/min/1.73 m², 95% CI 4.1 to 8.7), a marginal\nestimate valid under missing-at-random. Route to the appropriate sibling entry\n(`mixed-effects-models-longitudinal-rwe`, `gee-population-average-models-rwe`,\n`mmrm-repeated-measures-rwe`) for method-specific output interpretation.\n\n**Pros, cons, and trade-offs**.\n- **Mixed model (LMM/GLMM) vs GEE:** The mixed model gives a likelihood, valid inference under MAR (missingness\n  can depend on observed prior outcomes), subject-specific interpretation, and direct modeling of the\n  correlation structure via random effects; it can borrow strength to predict individual trajectories. Cost:\n  it assumes the random-effect *distribution* is correctly specified, the conditional/marginal gap confuses\n  non-statistical audiences for non-linear links, and it is more sensitive to misspecification. GEE is\n  semiparametric, needs only the mean model correct (the working correlation can be wrong) with\n  sandwich/robust SEs, and yields the population-averaged effect directly. Cost: GEE is only valid under\n  **missing-completely-at-random (MCAR)** unless you add inverse-probability-of-observation weights, it\n  handles partially observed patients less gracefully, and the sandwich variance is unreliable with few\n  clusters (fewer than roughly 40 patients). 
**Prefer the mixed model / MMRM** when dropout is non-trivial (the norm in RWE);\n  **prefer GEE** for a clean marginal effect with many patients and near-MCAR missingness.\n- **MMRM vs a random-slope LMM with parametric time:** MMRM (categorical visit, unstructured covariance) makes\n  the fewest assumptions about the *shape* of the time trend and the *form* of the covariance, which is why it\n  is regulatory-preferred; it needs roughly balanced, pre-specified visit windows and enough patients to\n  estimate the full covariance. A random-slope LMM with continuous time is more parsimonious and handles\n  *irregular* measurement times — common in claims/EHR — but imposes a trajectory shape (linear, spline) and a\n  structured covariance. **Prefer MMRM** for a confirmatory continuous endpoint with scheduled visits;\n  **prefer a parametric-time LMM** when visit timing is irregular or the covariance is too rich to estimate.\n- **Longitudinal mean model vs the time-to-event / count families:** If the question is \"time to first event\"\n  or \"are events recurring faster,\" a longitudinal mean model is the *wrong* family — use survival\n  (`standard-cox-time-dependent`, `cumulative-incidence-risk-rwe`) or recurrent-event/count methods\n  (`recurrent-events-analysis-rwe`, `poisson-negative-binomial-count-models`). Modeling a repeatedly-measured\n  *continuous or scalar* outcome is this family's job; modeling *event occurrence* is not.\n\n**When to use**. 
A continuous, ordinal, count, or binary outcome is measured repeatedly on the same patients\nat two or more time points and you want the treatment effect on the *level or trajectory* of that outcome\n(e.g., HbA1c change to month 12, monthly per-patient cost, serial PRO scores); within-patient correlation must\nbe respected; and either an individual-trajectory (subject-specific) or a population-mean (marginal) estimand\nis clearly specified in the protocol/SAP *before* you pick the model.\n\n**When NOT to use — and when it is actively misleading or dangerous**.\n- **The outcome is an event, not a repeated measurement.** Forcing \"did the patient have an MI this month\n  (0/1)\" into a GLMM panel when the real question is incidence invites immortal-time and informative-censoring\n  problems that survival/recurrent-event methods are built to handle; use those instead.\n- **Reporting a conditional estimate for a population question (or vice versa).** Presenting a subject-specific\n  GLMM odds ratio as if it were the population-averaged effect a payer cares about — or a marginal GEE estimate\n  as an individual-patient prognosis — is a silent estimand error that survives every model diagnostic and is\n  genuinely misleading.\n- **GEE with outcome-dependent dropout.** In RWE the sicker arm typically drops out or disenrolls faster; under that\n  MAR-not-MCAR pattern, naive (unweighted) GEE is biased, in a direction set by the dropout mechanism, while a\n  correctly specified MMRM/LMM remains valid; defaulting to GEE \"because it's marginal\" can fabricate a false\n  equivalence when that bias runs toward the null.\n- **Too few clusters/patients.** With a small number of patients the GEE sandwich SE is downward-biased and the\n  mixed-model variance components are unstable; tests become anti-conservative.\n- **Outcome measured at a single time point.** There is no within-patient correlation to model; a longitudinal\n  apparatus adds nothing and can obscure a simple cross-sectional contrast.\n\n**Data-source operational depth**.\n- **Claims (FFS or commercial):** The \"repeated 
measurement\" is usually a *constructed* monthly or quarterly\n  aggregate — total paid cost, fill-derived adherence, a flag for any qualifying diagnosis — built by binning\n  claims into fixed windows keyed off `index_date`. Two failure modes dominate. (1) **MA-only person-time:**\n  months in which a patient is enrolled in Medicare Advantage (or a capitated arrangement) have no\n  fee-for-service claims, so a \"$0 cost\" or \"no diagnosis\" cell is *missingness disguised as data*; you must\n  carry an enrollment indicator and drop or explicitly model those cells, not treat them as observed zeros.\n  (2) **Differential disenrollment by arm** makes month-level missingness depend on the (unobserved future)\n  outcome — MAR at best — so an LMM/MMRM under MAR is safer than GEE; restrict to continuous medical+pharmacy\n  enrollment per cycle and verify balance of observed person-months across arms.\n- **EHR:** Labs and PROs are recorded only when a visit happens, so measurement timing is **irregular and\n  informatively sampled** — the sicker arm is drawn more often, which biases any method that assumes\n  measurement times are unrelated to the outcome. Prefer a parametric-time LMM (continuous time, not\n  categorical visit) and consider modeling the visit process; never assume equally-spaced visits when feeding a\n  GEE working correlation like AR(1) that presumes equal spacing. 
Patients who leave the system are\n  differentially lost; treat loss to follow-up as potentially informative.\n- **Registry:** Visit schedules are often protocolized (an advantage for MMRM), and outcomes may be\n  adjudicated, but pharmacy exposure and out-of-registry care are weak; link to claims for complete exposure\n  and to a death index so that \"missing\" late visits are correctly distinguished from death (a competing\n  terminal event that must not be modeled as ordinary MAR dropout).\n- **Linked claims–EHR–vital records:** The ideal substrate — EHR severity and lab values, claims completeness,\n  reliable mortality — but linkage selects the linkable subset and introduces date-discrepancy between order,\n  fill, and service dates that must be reconciled before binning measurements into cycles relative to time zero.\n\n**Worked claims example.** Question: 12-month HbA1c trajectory after initiating a DPP-4 inhibitor vs a\nsecond-generation sulfonylurea in an active-comparator new-user cohort (commercial + Medicare FFS). (1)\nCohort: incident initiators with 365 days of continuous A/B/D (or commercial medical+pharmacy) enrollment\nbefore the first qualifying fill; `index_date` = that fill; arm = NDC dispensed that day. (2) Build the panel:\ndefine visits at months 0, 3, 6, 9, 12; for each (`person_id`, `visit`) take the HbA1c lab value nearest the\nscheduled month within a ±45-day window from the linked EHR/lab feed; carry an `observed` flag and an\n`enroll_ffs` flag per cycle so MA-only or disenrolled months are *missing*, not zero. 
(3) Fit three models on\nthe same panel to make the estimand visible: (a) **MMRM** — `PROC MIXED` with categorical `visit`, `arm`,\n`visit*arm`, baseline HbA1c as covariate, and `REPEATED visit / SUBJECT=person_id TYPE=UN` (unstructured),\nreading the month-12 `visit*arm` LS-means contrast as the regulatory primary marginal effect under MAR; (b)\n**GEE** — exchangeable working correlation with robust SEs for the population-averaged mean difference; (c)\n**random-intercept LMM** for the subject-specific trajectory. (4) Expect the three point estimates\nto be similar (Gaussian, identity link) but their *standard errors and missing-data validity* to\ndiffer (the unweighted GEE requires MCAR, the MMRM and LMM only MAR), and report the MMRM contrast as primary\nbecause dropout is unlikely to be MCAR. (5) Sensitivity: pattern-mixture\nor multiple-imputation under not-MAR (`multiple-imputation-longitudinal-rwe`), alternative covariance\nstructures (TYPE=AR(1), TYPE=CS), and a tipping-point analysis on the month-12 contrast.",
    "primary_category": "Inferential_Statistics",
    "tags": [
      "longitudinal-outcomes-modeling",
      "repeated-measures",
      "mixed-effects-models",
      "gee",
      "mmrm",
      "marginal-vs-conditional",
      "panel-data",
      "missing-data-mar"
    ],
    "applies_to_study_types": [
      "claims_analysis",
      "ehr_study",
      "registry_linkage",
      "target_trial_emulation"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1093/biomet/73.1.13",
        "url": "https://doi.org/10.1093/biomet/73.1.13",
        "citation_text": "Liang KY, Zeger SL. Longitudinal data analysis using generalized linear models. Biometrika. 1986;73(1):13-22.",
        "year": 1986,
        "authors_short": "Liang & Zeger",
        "notes": "Founding paper for generalized estimating equations and the population-averaged (marginal) approach to correlated longitudinal outcomes."
      },
      {
        "role": "introduce",
        "doi": "10.2307/2529876",
        "url": "https://doi.org/10.2307/2529876",
        "citation_text": "Laird NM, Ware JH. Random-effects models for longitudinal data. Biometrics. 1982;38(4):963-974.",
        "year": 1982,
        "authors_short": "Laird & Ware",
        "notes": "Founding paper for the linear mixed (random-effects) model and the subject-specific approach to repeated measures."
      },
      {
        "role": "explain",
        "doi": "10.1097/EDE.0b013e3181caeb90",
        "url": "https://doi.org/10.1097/EDE.0b013e3181caeb90",
        "citation_text": "Hubbard AE, Ahern J, Fleischer NL, et al. To GEE or not to GEE: comparing population average and mixed models for estimating the associations between neighborhood risk factors and health. Epidemiology. 2010;21(4):467-474.",
        "year": 2010,
        "authors_short": "Hubbard et al.",
        "notes": "Makes the marginal (GEE) vs subject-specific (mixed) estimand the deciding question and shows the estimates diverge under non-linear links; the conceptual core of this entry."
      },
      {
        "role": "demonstrate",
        "doi": "10.1177/009286150804200402",
        "url": "https://doi.org/10.1177/009286150804200402",
        "citation_text": "Mallinckrodt CH, Lane PW, Schnell D, Peng Y, Mancuso JP. Recommendations for the primary analysis of continuous endpoints in longitudinal clinical trials. Drug Information Journal. 2008;42(4):303-319.",
        "year": 2008,
        "authors_short": "Mallinckrodt et al.",
        "notes": "PhRMA working-group rationale establishing MMRM (unstructured covariance, categorical visit, MAR-valid) as the preferred primary analysis for continuous longitudinal endpoints in a regulatory setting."
      }
    ],
    "plain_language_summary": "Longitudinal outcomes modeling analyzes outcomes that are measured on the same patient at multiple points in time — for example, a blood sugar value recorded at every clinic visit for a year. Because measurements from the same person tend to move together (a patient who runs high in January is likely to run high in June), you cannot treat each visit row as if it came from a different person; doing so underestimates uncertainty and can produce false-positive findings. These methods explicitly account for that within-person similarity, letting you estimate how a patient's outcome trajectory changes over time and whether treatment alters that trajectory. The main decision you face is whether you want a result about a specific patient's own trend (a subject-specific model) or about the average trend in the whole population (a population-averaged model).",
    "key_terms": [
      {
        "term": "repeated measures",
        "definition": "Multiple outcome values collected from the same patient across different time points, as opposed to a single measurement taken once."
      },
      {
        "term": "within-person correlation",
        "definition": "The tendency for outcome values from the same patient to resemble each other more than values from different patients, simply because many patient characteristics stay stable over time."
      },
      {
        "term": "trajectory",
        "definition": "The pattern of how a patient's outcome value rises, falls, or stays flat across a sequence of visits or time points."
      },
      {
        "term": "subject-specific effect",
        "definition": "An estimate of how much a given individual patient's outcome is expected to change, holding that patient's own background characteristics constant."
      },
      {
        "term": "population-averaged effect",
        "definition": "An estimate of the difference in the average outcome across an entire group of patients, rather than for any one individual."
      },
      {
        "term": "missing at random (MAR)",
        "definition": "A situation where the fact that a measurement is missing can be explained by earlier observed values but is not tied to what the missing value itself would have been."
      }
    ],
    "worked_example": {
      "scenario": "A researcher is studying monthly pain scores (0-10 scale) in five patients with chronic knee pain enrolled in a 3-visit study (baseline, month 3, month 6). All five patients are measured at each visit. The goal is to describe the average pain trajectory over 6 months. Before fitting any model, the researcher checks whether it is valid to treat each row in the dataset as independent. This small example shows why it is not, and what the long-format data actually looks like.",
      "dataset": {
        "caption": "Long-format pain score table — one row per patient per visit (15 rows total for 5 patients x 3 visits).",
        "columns": [
          "patient_id",
          "visit_month",
          "pain_score"
        ],
        "rows": [
          [
            1,
            0,
            8
          ],
          [
            1,
            3,
            6
          ],
          [
            1,
            6,
            5
          ],
          [
            2,
            0,
            4
          ],
          [
            2,
            3,
            3
          ],
          [
            2,
            6,
            3
          ],
          [
            3,
            0,
            7
          ],
          [
            3,
            3,
            5
          ],
          [
            3,
            6,
            4
          ],
          [
            4,
            0,
            5
          ],
          [
            4,
            3,
            4
          ],
          [
            4,
            6,
            4
          ],
          [
            5,
            0,
            9
          ],
          [
            5,
            3,
            7
          ],
          [
            5,
            6,
            6
          ]
        ]
      },
      "steps": [
        "Notice that each patient appears in three rows. Patient 1 starts at 8, drops to 6, then 5. Patient 5 starts at 9 and also trends down. The data already hints that a high baseline score tends to pair with high later scores — within the same person.",
        "If you ran a naive ordinary regression ignoring patient identity, the model would treat all 15 rows as independent observations from 15 different people. It would see 15 pain scores and estimate a trend, but its standard error would be too small because it is pretending there are 15 independent data points when there are really only 5 independent patients.",
        "A longitudinal model adds a patient-level term that captures each person's overall average pain level. Patient 5 runs about 2 points higher than Patient 2 across all visits; the model learns this and no longer treats those differences as mysterious noise.",
        "With within-person correlation accounted for, the model estimates the average trajectory across all five patients: baseline mean of 6.6, month-3 mean of 5.0, month-6 mean of 4.4 — a drop of about 2.2 points over 6 months.",
        "Because the model correctly assigns observations to their source patients, the uncertainty estimate (standard error) around the 2.2-point drop reflects having 5 independent patients, not 15 independent rows. This gives an honest picture of how confident we should be."
      ],
      "result": "Average pain score fell from 6.6 at baseline to 4.4 at month 6 — a mean reduction of 2.2 points over 6 months. Accounting for within-person correlation (5 patients, not 15 independent rows) produces a standard error roughly 1.5-2x wider than a naive independent-row analysis would, correctly reflecting that the effective sample size is 5, not 15."
    },
    "prerequisites": [
      "mixed-effects-models-longitudinal-rwe",
      "gee-population-average-models-rwe",
      "mmrm-repeated-measures-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Linear mixed-effects model (random intercept and/or slope)",
        "description": "Subject-specific model with patient-level random effects for a continuous outcome; estimates individual trajectories and is valid under MAR. Use a random slope when the rate of change varies across patients.",
        "edge_cases": [
          "For non-linear links (GLMM, e.g. logistic) the random-effect variance attenuates the marginal effect, so the coefficient is conditional and not comparable to a GEE estimate.",
          "Misspecified random-effect distribution biases variance components; singular fits arise when the random-slope variance is near zero."
        ],
        "data_source_notes": "claims/EHR: irregular measurement times favor continuous-time parameterization; carry an enrollment-observed flag so MA-only or disenrolled cycles are treated as missing under MAR."
      },
      {
        "name": "MMRM (mixed model for repeated measures, unstructured covariance)",
        "description": "A continuous-endpoint LMM with categorical visit, treatment, treatment-by-visit interaction, and an unstructured within-patient covariance and no random subject effect; the regulatory-default primary analysis, valid under MAR.",
        "edge_cases": [
          "Requires roughly balanced, pre-specified visit windows; unstructured covariance can fail to converge with many visits or few patients (fall back to Toeplitz or AR(1)).",
          "Death or a competing terminal event is not ordinary MAR dropout; a registry/claims death index is needed to separate the two."
        ],
        "data_source_notes": "registry: protocolized visits suit MMRM well; claims: requires constructing scheduled cycles relative to index_date with a tolerance window."
      },
      {
        "name": "GEE (population-averaged, working correlation + robust SE)",
        "description": "Semiparametric marginal model returning the population-mean contrast directly; only the mean model must be correct. Use exchangeable or AR(1) working correlation with sandwich variance.",
        "edge_cases": [
          "Valid only under MCAR unless inverse-probability-of-observation weights are added; biased toward the null under informative dropout.",
          "Sandwich SEs are downward-biased with few clusters (<~40 patients); use small-sample corrections (Mancl-DeRouen, Kauermann-Carroll)."
        ],
        "data_source_notes": "claims/EHR: AR(1) presumes equal spacing, which irregular EHR labs violate; an exchangeable structure is more defensible for unevenly-timed measurements."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "GEE (population-averaged marginal model)",
        "pros_of_this": "Likelihood-based and valid under MAR (the realistic RWE dropout mechanism); subject-specific interpretation; can predict individual trajectories and model the correlation explicitly.",
        "cons_of_this": "Assumes a correctly-specified random-effect distribution; the conditional vs marginal gap under non-linear links confuses non-statistical audiences; more sensitive to misspecification.",
        "when_to_prefer": "Non-trivial or informative dropout, an individual-trajectory question, or a continuous confirmatory endpoint (where MMRM is the special case)."
      },
      {
        "compared_to": "MMRM with unstructured covariance",
        "pros_of_this": "A random-slope/continuous-time LMM is more parsimonious and handles irregular measurement times common in claims and EHR.",
        "cons_of_this": "Imposes a trajectory shape and a structured covariance, which MMRM deliberately avoids; less defensible as a confirmatory primary analysis.",
        "when_to_prefer": "Irregular visit timing, or when the unstructured covariance is too rich to estimate with the available patients/visits."
      },
      {
        "compared_to": "Time-to-event and count/recurrent-event models",
        "pros_of_this": "Directly models the level or trajectory of a repeatedly-measured scalar outcome (cost, lab, PRO).",
        "cons_of_this": "Wrong family for event occurrence; mishandles immortal time, censoring, and competing risks that survival/recurrent-event methods are designed for.",
        "when_to_prefer": "The estimand concerns the value of a continuous/scalar outcome over time, not whether or when an event happens."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Outcomes are constructed per fixed cycle (month/quarter) from claims aggregates keyed off index_date. Carry an enrollment-observed flag per cycle so MA-only and disenrolled person-time are treated as missing, not as observed zeros; differential disenrollment makes missingness MAR at best, favoring MMRM/LMM over GEE.",
      "ehr": "Lab/PRO measurements are recorded only at visits, so timing is irregular and informatively sampled (the sicker arm is measured more). Use continuous-time parameterization; do not assume equally-spaced visits when using an AR(1) GEE working correlation; treat loss to follow-up as potentially informative.",
      "registry": "Protocolized visit schedules suit MMRM; link to claims for complete exposure and to a death index so a terminal competing event is not modeled as ordinary MAR dropout.",
      "linked": "Linked claims-EHR-vital-records gives lab values, completeness, and reliable mortality, but linkage selection and order/fill/service date discrepancies must be reconciled before binning measurements into cycles relative to time zero."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\nimport statsmodels.api as sm\nimport statsmodels.formula.api as smf\n\npanel = panel[panel[\"observed\"] == 1].copy()        # missing cycles are NOT observed zeros\npanel[\"visit\"] = panel[\"visit\"].astype(\"category\")   # categorical visit -> MMRM-style means\n\n# (a) Subject-specific: random intercept per patient, treatment-by-visit interaction.\nlmm = smf.mixedlm(\n    \"hba1c ~ C(arm) * visit + hba1c_baseline\",\n    data=panel,\n    groups=panel[\"person_id\"],\n).fit(reml=True)\nprint(lmm.summary())   # coefficients are CONDITIONAL (subject-specific)\n\n# (b) Population-averaged: GEE with exchangeable working correlation + robust (sandwich) SE.\ngee = smf.gee(\n    \"hba1c ~ C(arm) * visit + hba1c_baseline\",\n    groups=\"person_id\",\n    data=panel,\n    cov_struct=sm.cov_struct.Exchangeable(),\n    family=sm.families.Gaussian(),\n).fit()\nprint(gee.summary())   # coefficients are MARGINAL (population-averaged)\n# Gaussian/identity -> point estimates align; SE & missing-data validity (MAR vs MCAR) differ.",
        "description": "Fit a subject-specific LMM and a population-averaged GEE on the SAME longitudinal claims panel to expose the\nestimand difference. Required input (one row per person-visit, already built and cleaned):\n  panel : person_id, visit (int month: 0,3,6,9,12), arm ('DPP4'/'SU'),\n          hba1c (float outcome), hba1c_baseline (float), observed (1 if FFS-enrolled & lab present)\nDrop rows where observed==0 BEFORE fitting (MA-only/disenrolled cycles are missing, not zero).",
        "dependencies": [
          "pandas",
          "statsmodels"
        ],
        "source_citations": [
          "hubbard-2010",
          "mallinckrodt-2008"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(nlme)\nlibrary(geepack)\n\npanel <- subset(panel, observed == 1)          # missing cycles are not observed zeros\npanel$visit <- factor(panel$visit)\npanel$arm   <- relevel(factor(panel$arm), ref = \"SU\")\npanel       <- panel[order(panel$person_id, panel$visit), ]\n\n## (a) MMRM via gls: NO random effect; the unstructured within-patient covariance is\n##     modeled entirely through corSymm + varIdent (a random intercept here would be\n##     redundant with / non-identifiable against the unstructured residual covariance,\n##     so MMRM deliberately omits it). Gives marginal visit means valid under MAR.\nmmrm <- gls(\n  hba1c ~ arm * visit + hba1c_baseline,\n  correlation = corSymm(form = ~ as.integer(visit) | person_id),\n  weights     = varIdent(form = ~ 1 | visit),\n  data        = panel,\n  na.action   = na.omit,\n  method      = \"REML\"\n)\nsummary(mmrm)                                   # marginal (MMRM) effects\n\n## (b) Population-averaged GEE: exchangeable working correlation, robust SE.\ngee <- geeglm(\n  hba1c ~ arm * visit + hba1c_baseline,\n  id = person_id, data = panel,\n  family = gaussian(), corstr = \"exchangeable\"\n)\nsummary(gee)                                    # MARGINAL effects (sandwich SE)",
        "description": "Same panel, same contrast in R: nlme::gls for the MMRM (unstructured within-patient covariance, no random\neffect) and geepack::geeglm for the population-averaged marginal effect. Input data frame `panel`:\n  person_id, visit (factor of 0/3/6/9/12), arm (factor 'DPP4'/'SU'),\n  hba1c (numeric outcome), hba1c_baseline (numeric), observed (0/1).",
        "dependencies": [
          "nlme",
          "geepack"
        ],
        "source_citations": [
          "hubbard-2010",
          "mallinckrodt-2008"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* MA-only / disenrolled cycles are missing, not observed zeros. */\ndata panel; set work.panel; if observed = 1; run;\n\n/* (a) MMRM: categorical visit, arm*visit interaction, UNSTRUCTURED within-patient covariance, no random */\n/*     effect -> marginal LS-means valid under MAR (FDA-preferred continuous-endpoint primary analysis).  */\nproc mixed data=panel method=reml;\n  class person_id arm visit;\n  model hba1c = arm visit arm*visit hba1c_baseline / ddfm=kr;\n  repeated visit / subject=person_id type=un;\n  lsmeans arm*visit / diff slice=visit;          /* month-12 arm contrast = primary estimand */\nrun;\n\n/* (b) GEE: population-averaged marginal effect, exchangeable working correlation, robust (empirical) SE. */\nproc genmod data=panel;\n  class person_id arm visit;\n  model hba1c = arm visit arm*visit hba1c_baseline / dist=normal link=identity;\n  repeated subject=person_id / type=exch corrw;\nrun;\n\n/* (c) GLMM analog (PROC GLIMMIX) for a binary/count longitudinal outcome -> SUBJECT-SPECIFIC effect; */\n/*     shown here for hba1c>=7 to make the conditional-vs-marginal gap on a non-linear link explicit.  */\nproc glimmix data=panel method=laplace;\n  class person_id arm visit;\n  model uncontrolled(event='1') = arm visit arm*visit hba1c_baseline / dist=binary link=logit solution;\n  random intercept / subject=person_id;\nrun;",
        "description": "Regulatory-style MMRM plus the GEE and GLMM analogs on the same panel. Required input dataset (post\ndata-management), one row per person-visit:\n  work.panel : person_id, visit (0/3/6/9/12), arm ('DPP4'/'SU'),\n               hba1c (continuous outcome), hba1c_baseline (continuous),\n               observed (1 if FFS-enrolled and lab captured this cycle)\nKeep only observed cycles so MA-only/disenrolled person-time is missing, not zero. PROC MIXED with\nREPEATED ... TYPE=UN is the canonical MMRM; the month-12 arm*visit LS-means contrast is the primary estimate.",
        "dependencies": [],
        "source_citations": [
          "hubbard-2010",
          "mallinckrodt-2008"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Start[Repeatedly-measured outcome per patient over time] --> Q0{Is the outcome an EVENT<br/>occurrence/timing?}\n  Q0 -->|Yes| TTE[Use survival / recurrent-event / count family<br/>NOT a longitudinal mean model]\n  Q0 -->|No: continuous / scalar level| Q1{Target estimand?}\n  Q1 -->|Population-averaged<br/>marginal mean contrast| Q2{Missingness mechanism?}\n  Q1 -->|Subject-specific<br/>individual trajectory| LMM[Mixed model: LMM / GLMM<br/>random effects, valid under MAR]\n  Q2 -->|Near-MCAR, many patients| GEE[GEE: working correlation + robust SE]\n  Q2 -->|Informative / MAR dropout| MMRM[MMRM: categorical visit, UN covariance<br/>marginal mean, valid under MAR]\n  LMM --> Report[Report estimand explicitly + missing-data + sensitivity]\n  GEE --> Report\n  MMRM --> Report",
        "caption": "Family-level selection logic. The first fork rules out event outcomes (a different family); the second fork is the conditional-vs-marginal estimand choice; the third fork uses the missing-data mechanism to choose GEE (MCAR) vs MMRM (MAR) for a marginal estimand.",
        "alt_text": "Decision flowchart starting from a repeatedly-measured outcome, branching to survival methods for events, then to mixed models for subject-specific estimands and to GEE or MMRM for population-averaged estimands depending on the missing-data mechanism.",
        "source_type": "illustrative",
        "source_citations": [
          "hubbard-2010"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  Conditional[Subject-specific / conditional<br/>random-effects model] -->|identity link, Gaussian| Same[Estimates COINCIDE]\n  Marginal[Population-averaged / marginal<br/>GEE] -->|identity link, Gaussian| Same\n  Conditional -->|non-linear link logit/log| Gap[Estimates DIVERGE:<br/>marginal attenuated toward null<br/>by random-effect variance]\n  Marginal -->|non-linear link logit/log| Gap",
        "caption": "Why the conditional-vs-marginal choice matters. For a Gaussian outcome with an identity link the two estimands give the same coefficient; for a logistic or log link they diverge, and a GLMM and GEE on the same binary panel are not interchangeable.",
        "alt_text": "Diagram showing subject-specific and population-averaged estimands coinciding under an identity link but diverging under a non-linear link, with the marginal effect attenuated toward the null.",
        "source_type": "illustrative",
        "source_citations": [
          "hubbard-2010"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "see_also",
        "target_slug": "mixed-effects-models-longitudinal-rwe",
        "notes": "The subject-specific (conditional) member of this family; detailed mechanics of random intercepts/slopes."
      },
      {
        "relation_type": "see_also",
        "target_slug": "gee-population-average-models-rwe",
        "notes": "The population-averaged (marginal) member; working-correlation and robust-variance mechanics."
      },
      {
        "relation_type": "see_also",
        "target_slug": "mmrm-repeated-measures-rwe",
        "notes": "The regulatory-default continuous-endpoint special case (categorical visit, unstructured covariance, valid under MAR)."
      },
      {
        "relation_type": "used_with",
        "target_slug": "cluster-robust-standard-errors-rwe",
        "notes": "Robust/sandwich variance underlies GEE inference and clustered longitudinal data more broadly."
      },
      {
        "relation_type": "used_with",
        "target_slug": "multiple-imputation-longitudinal-rwe",
        "notes": "Standard sensitivity tool when the MAR assumption of MMRM/LMM is in doubt (pattern-mixture, not-MAR tipping point)."
      },
      {
        "relation_type": "see_also",
        "target_slug": "poisson-negative-binomial-count-models",
        "notes": "Preferred family when the longitudinal outcome is an event count or rate rather than a continuous level."
      },
      {
        "relation_type": "alternative_to",
        "target_slug": "recurrent-events-analysis-rwe",
        "notes": "For repeated EVENT occurrence (not repeated measurement of a scalar), use recurrent-event methods instead of a longitudinal mean model."
      },
      {
        "relation_type": "alternative_to",
        "target_slug": "standard-cox-time-dependent",
        "notes": "For time-to-first-event with time-varying covariates, survival modeling replaces the longitudinal mean model family."
      }
    ],
    "aliases": [
      "repeated measures models",
      "longitudinal regression",
      "correlated outcome models",
      "panel data models",
      "longitudinal data analysis"
    ],
    "complexity": "advanced",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "maic-stc-population-adjusted-indirect-comparison-rwe",
    "name": "MAIC and STC: Population-Adjusted Indirect Comparisons",
    "short_definition": "A family of methods - matching-adjusted indirect comparison (MAIC) and simulated treatment comparison (STC) - that uses individual patient data (IPD) from one trial together with published aggregate data (AgD) from another to estimate a comparison between treatments that were never tested head-to-head, after reweighting or regression-adjusting the IPD so its baseline effect-modifier distribution matches the population behind the aggregate trial. Anchored versions preserve a common comparator to cancel shared prognostic effects; unanchored versions drop the anchor and must adjust for every prognostic factor and effect modifier, making far stronger and rarely testable assumptions.",
    "long_description": "Health-technology decisions constantly need a comparison that no randomized trial ever ran: drug **B** versus\ndrug **C**, when the manufacturer holds a trial of **B vs A** (with patient-level data) and the only evidence on\n**C** is a published **C vs A** trial reporting group means, not individuals. A naive cross-trial comparison\n(\"B beat A by 20 points, C beat A by 12, so B beats C by 8\") is the **Bucher anchored indirect comparison**, and it\nis valid only if the two trials enrolled exchangeable populations. They almost never do: the **B vs A** trial may be\nyounger, less pre-treated, or healthier than the **C vs A** trial, and any factor that **modifies the treatment\neffect** then biases the indirect estimate. **Population-adjusted indirect comparisons (PAICs)** fix the population\nmismatch using the one asset a manufacturer has but the literature does not - the **individual patient data** from\nits own trial.\n\n**The two methods.** **MAIC** (Signorovitch 2010) reweights each IPD patient so the *weighted* means of the\nselected baseline variables in the IPD trial equal the *published* means from the aggregate trial - a survey-style\nreweighting, with weights estimated by a method-of-moments / logistic-propensity trick. The reweighted IPD trial is\nthen re-analyzed and compared with the aggregate trial. **STC** (simulated treatment comparison) instead fits an\noutcome regression in the IPD, including the effect modifiers, and uses it to *predict* what the IPD treatment's\noutcome would have been in the aggregate trial's population (by plugging in the aggregate trial's covariate means).\nMAIC is a weighting estimator; STC is a regression (outcome-model) estimator - the same anchored/unanchored logic\napplies to both. 
NICE DSU **Technical Support Document 18** (Phillippo 2018) is the methodological reference that\nHTA bodies cite, and it draws the sharp line between **anchored** and **unanchored** comparisons.\n\n**Anchored vs unanchored - the assumption that decides everything.** In an **anchored** PAIC both trials share a\ncommon comparator **A**, so the indirect contrast is built on the *within-trial relative effects* (B-vs-A and\nC-vs-A). Randomization inside each trial handles prognostic factors; you only need to balance **effect modifiers**\n(variables that change the *size* of the treatment effect) to make the two relative effects comparable. This is a\nstrong but sometimes plausible assumption. In an **unanchored** PAIC there is no common comparator (e.g., a\nsingle-arm trial of B versus a single-arm or external source for C): the contrast is between *absolute* outcomes,\nrandomization protects nothing, and you must adjust for **every prognostic factor AND every effect modifier** - and\nassume there are no unmeasured ones. That is the same heroic conditional-exchangeability assumption an observational\nexternal-control analysis makes, with none of the within-trial randomization to lean on. TSD 18 is explicit:\nunanchored comparisons are acceptable only when an anchored one is impossible, and their results should be treated\nwith great caution.\n\n**Pros, cons, and trade-offs** (specific and comparative).\n- **vs network-meta-analysis / Bucher (the standard anchored synthesis):** Standard NMA and Bucher assume the\n  trials are similar enough that the common-comparator anchor cancels all cross-trial differences. PAICs **relax\n  that** by explicitly rebalancing the IPD trial to the aggregate trial's population, so they are the right tool\n  when there are **few trials (often just two), no closed loop, and meaningful effect-modifier imbalance**. 
The\n  cost: PAICs need IPD on at least one side (NMA does not), they correct only **observed, reported** modifiers,\n  and they collapse the usable sample size (see ESS below). **Prefer standard NMA/Bucher** when a connected network\n  of similar trials exists and imbalance is minor; **prefer a PAIC** when the network is a single disconnected pair\n  and a named effect modifier differs across the two trials.\n- **MAIC vs STC:** MAIC makes no assumption about the *form* of the outcome model but throws away information by\n  weighting (the **effective sample size** can crater, and weights are unstable when populations barely overlap).\n  STC keeps all patients and is more efficient, but it **bets on the regression being correctly specified**, and -\n  in its common, naive implementation - it can only target the aggregate trial's *mean* covariates, which is exact\n  only for collapsible, linear-predictor outcomes; for nonlinear links (logistic, Cox) plugging in mean covariates\n  is an approximation. **Prefer MAIC** when you distrust the outcome model and have decent overlap; **prefer STC**\n  when overlap is poor (weights would explode) and you trust a parsimonious, correctly specified model.\n- **Scale dependence (a trap both share):** the adjustment is done on a particular effect scale (log-odds, log-HR,\n  risk difference). An anchored PAIC assumes the *conditional* effect is constant across the part of the covariate\n  space being matched, on that scale. Because relative effects are non-collapsible on the odds/hazard scale, a\n  comparison that looks unbiased on one scale need not be on another, and the reported result can depend on whether\n  you adjusted on the log-odds or the probability scale. 
State the scale; do not switch it silently.\n\n**When to use.** A decision needs B vs C; the evidence is a **disconnected pair** of trials (or single arms) with\nno head-to-head and no usable common network; you hold **IPD on at least one side**; and there is a **named,\nmeasured effect modifier** (and, for unanchored, every prognostic factor) that differs between the two trial\npopulations and is reported in the aggregate publication. This is the bread-and-butter situation in HTA submissions\n(NICE, CADTH, and the EU **Joint Clinical Assessment**), where a manufacturer's pivotal trial must be compared with\na competitor's trial to fill a **PICO** the regulator or payer specifies, and where the company's IPD is the lever\nthat lets it rebalance to the comparator trial's population.\n\n**When NOT to use - and when it is actively misleading.**\n- **Effective sample size collapse.** MAIC weights can concentrate almost all of the analysis on a handful of\n  patients. The **effective sample size (ESS = (sum of weights)^2 / sum of squared weights)** measures how many\n  independent patients the weighted analysis really behaves like. When the IPD trial barely overlaps the aggregate\n  population, ESS can fall to a small fraction of the nominal n, confidence intervals widen, and the estimate is\n  driven by a few extreme-weight patients - a fragile result dressed up as a trial comparison. Always report ESS\n  and a weight histogram; an ESS that is a tiny fraction of n is a red flag, not a footnote.\n- **Unanchored when anchored was possible.** Dropping the anchor to get a tidier number throws away the only\n  randomization-based protection you had and silently swaps in a no-unmeasured-confounding assumption. 
Do not run\n  an unanchored comparison if a common comparator exists; if it genuinely does not, label the result as\n  hypothesis-generating and stress-test every prognostic assumption.\n- **Adjusting for the wrong, or incomplete, variable set.** PAICs balance only what is **measured in the IPD and\n  reported in the aggregate publication**. A modifier that is unreported in the competitor's paper cannot be\n  matched; an unanchored analysis missing one prognostic factor is biased with no diagnostic to reveal it. If the\n  aggregate publication does not report a known effect modifier's distribution, the comparison cannot be trusted,\n  however sophisticated the weighting.\n- **Extrapolating beyond overlap.** Reweighting cannot conjure patients the IPD trial never enrolled. If the\n  aggregate population sits largely outside the IPD trial's covariate range, MAIC weights are enormous and unstable\n  and STC is extrapolating off the data - both are unreliable, and a more honest answer is \"the trials are too\n  different to compare.\"\n\n**Data-source and evidence-source depth.** Classically the IPD comes from a manufacturer's **randomized trial**\n(the cleanest substrate, with adjudicated outcomes and protocol-collected baseline covariates), and the aggregate\nside from a **published competitor trial**. Increasingly, real-world data feed both sides of a PAIC: a\nsingle-arm trial may be compared with an **external control** drawn from **claims, EHR, or a disease registry**,\nwhich turns an unanchored MAIC into an external-control study with all of that design's confounding risks. The\neffect modifiers themselves are often the binding constraint - age, prior lines of therapy, biomarker status,\nperformance status - and whether each is *measured* in the IPD and *reported* in the aggregate source determines\nwhether the comparison is even attemptable. 
When the aggregate side is RWE rather than a trial, the analyst must\nalso reconcile real-world outcome definitions and follow-up with the trial's protocol endpoints before any matching\nis meaningful.\n\n**Interpreting the output**\n\nConsider the worked example: after reweighting B-trial patients to match the competitor trial's\n60% prior-biologic rate, the anchored MAIC yields a risk difference of 0.08 (8 percentage points)\nin favor of B over C. The effective sample size fell to 2.5 of the original 5 patients.\n\nFormal interpretation: The 0.08 estimate is a population-adjusted indirect comparison, anchored\nthrough the shared comparator A. It targets the treatment effect of B versus C in the population\nof the competitor's trial — not in the population of the B trial as originally enrolled. The ESS\ndeflation from 5 to 2.5 reflects how concentrated the reweighting became: a single up-weighted\npatient dominates the estimate, and the resulting confidence interval is wide. In an anchored MAIC,\nthe indirect comparison is only as credible as the assumption that A performs identically in both\ntrial populations after reweighting — an untestable assumption. In an unanchored MAIC (no shared\ncomparator), the additional assumption of absolute outcome exchangeability makes the result even\nmore fragile and should be labeled explicitly as such in any HTA submission.\n\nPractical interpretation: Report the ESS alongside the point estimate as a transparency requirement,\nnot as an afterthought. An ESS below 40% of the original sample is a recognized red flag in NICE\ntechnical guidance. The 0.08 risk difference and its interval should be accompanied by a weight\ndistribution plot and a sensitivity analysis varying the set of effect modifiers included. 
If the\naggregate data source is RWE rather than a published trial, reconcile outcome definitions and\nfollow-up windows before matching — a mismatch here invalidates the comparison regardless of how\nwell the weights balance the covariates.",
    "primary_category": "Inferential_Statistics",
    "tags": [
      "maic",
      "stc",
      "population-adjusted-indirect-comparison",
      "indirect-treatment-comparison",
      "anchored-comparison",
      "unanchored-comparison",
      "effect-modifier",
      "effective-sample-size",
      "network-meta-analysis",
      "hta",
      "nice-dsu-tsd18",
      "eu-jca"
    ],
    "applies_to_study_types": [
      "indirect_treatment_comparison",
      "network_meta_analysis",
      "single_arm_external_control",
      "active_comparator_new_user",
      "hta_evidence_synthesis",
      "cohort_retrospective"
    ],
    "data_sources": [
      "primary",
      "registry",
      "claims",
      "ehr",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.2165/11538370-000000000-00000",
        "url": "https://doi.org/10.2165/11538370-000000000-00000",
        "citation_text": "Signorovitch JE, Wu EQ, Yu AP, Gerrits CM, Kantor E, Bao Y, Gupta SR, Mulani PM. Comparative effectiveness without head-to-head trials: a method for matching-adjusted indirect comparisons applied to psoriasis treatment with adalimumab. PharmacoEconomics. 2010;28(10):935-945.",
        "year": 2010,
        "authors_short": "Signorovitch et al.",
        "notes": "Introduces matching-adjusted indirect comparison - reweighting individual patient data so its baseline covariate means match a competitor trial's published means before an anchored cross-trial comparison - defining the method this concept is built on."
      },
      {
        "role": "explain",
        "doi": "10.1177/0272989X17725740",
        "url": "https://doi.org/10.1177/0272989X17725740",
        "citation_text": "Phillippo DM, Ades AE, Dias S, Palmer S, Abrams KR, Welton NJ. Methods for population-adjusted indirect comparisons in health technology appraisal. Medical Decision Making. 2018;38(2):200-211.",
        "year": 2018,
        "authors_short": "Phillippo et al.",
        "notes": "The NICE DSU Technical Support Document 18 methods paper that HTA bodies cite; draws the central anchored-vs-unanchored distinction and lays out the assumptions, effective-sample-size diagnostics, and cautions governing MAIC and STC."
      },
      {
        "role": "demonstrate",
        "doi": "10.1016/j.jval.2012.05.004",
        "url": "https://doi.org/10.1016/j.jval.2012.05.004",
        "citation_text": "Signorovitch JE, Sikirica V, Erder MH, Xie J, Lu M, Hodgkins PS, Betts KA, Wu EQ. Matching-adjusted indirect comparisons: a new tool for timely comparative effectiveness research. Value in Health. 2012;15(6):940-947.",
        "year": 2012,
        "authors_short": "Signorovitch et al.",
        "notes": "Applied demonstration of MAIC as a practical comparative-effectiveness tool, showing the weighting workflow, the effective-sample-size diagnostic, and how the method is deployed when only one trial provides patient-level data."
      },
      {
        "role": "use",
        "doi": "10.1002/sim.8759",
        "url": "https://doi.org/10.1002/sim.8759",
        "citation_text": "Phillippo DM, Dias S, Ades AE, Welton NJ. Assessing the performance of population adjustment methods for anchored indirect comparisons: a simulation study. Statistics in Medicine. 2020;39(30):4885-4911.",
        "year": 2020,
        "authors_short": "Phillippo et al.",
        "notes": "Simulation study quantifying when anchored MAIC and STC are unbiased versus when they fail, documenting effective-sample-size loss, scale dependence, and the consequences of misspecified or incomplete effect-modifier sets."
      }
    ],
    "plain_language_summary": "Sometimes a decision needs to compare drug B with drug C, but no trial ever tested them against each other - one company has a trial of B versus an older drug A, and the only published evidence on C is a separate trial of C versus A. You could subtract the two results, but that only works if the two trials enrolled similar patients, which they rarely do. MAIC and STC fix this using the patient-by-patient data from the B trial: MAIC reweights those patients so the B trial \"looks like\" the C trial's patient mix, and STC fits a prediction model to do the same job. If both trials share the older drug A as a common yardstick (an \"anchored\" comparison), you only need to balance the things that change the size of the treatment effect; without that shared yardstick you must balance everything and hope nothing is missing - a much shakier bet. The big catch: reweighting can quietly shrink your usable sample to a handful of patients, so a confident-looking comparison can actually rest on very little.\n",
    "key_terms": [
      {
        "term": "effect modifier",
        "definition": "A patient characteristic that changes how big a treatment's effect is, so a different mix of these patients makes the same drug look stronger or weaker."
      },
      {
        "term": "individual patient data (IPD)",
        "definition": "The row-per-patient data from a trial, as opposed to the published group averages everyone else can see."
      },
      {
        "term": "aggregate data",
        "definition": "Summary numbers reported in a paper - means, percentages, overall results - without access to the individual patients."
      },
      {
        "term": "common comparator (anchor)",
        "definition": "A treatment that appears in both trials (here, drug A), used as a shared yardstick so the two trials' results can be linked."
      },
      {
        "term": "effective sample size",
        "definition": "How many independent patients a reweighted analysis really behaves like; it can be far smaller than the actual number of patients when a few get heavy weights."
      },
      {
        "term": "anchored vs unanchored",
        "definition": "Anchored comparisons keep a shared comparator and only balance effect modifiers; unanchored ones drop it and must balance every relevant patient factor, assuming none is missing."
      }
    ],
    "worked_example": {
      "scenario": "A manufacturer must show how its drug B compares with a competitor's drug C, but no head-to-head trial exists. It holds patient-level data from its own trial of B versus the older drug A, and the only evidence on C is a published trial of C versus A. The two trials share comparator A, so an anchored MAIC is possible. One baseline factor - whether a patient had a prior biologic - is known to change the treatment effect and differs between the trials: only 1 of the 5 patients in the B trial had a prior biologic, while the published C trial reports 60% did. We reweight the B-trial patients so their prior-biologic rate matches 60%, check how much usable sample survives, and form the anchored B-versus-C comparison.\n",
      "dataset": {
        "caption": "The five patient-level rows from the B-versus-A trial (the IPD side), plus the one summary number the competitor's paper reports.",
        "columns": [
          "patient_id",
          "arm",
          "prior_biologic",
          "responder"
        ],
        "rows": [
          [
            1,
            "B",
            1,
            1
          ],
          [
            2,
            "B",
            0,
            1
          ],
          [
            3,
            "B",
            0,
            0
          ],
          [
            4,
            "A",
            0,
            1
          ],
          [
            5,
            "A",
            0,
            0
          ]
        ]
      },
      "steps": [
        "The B trial's prior-biologic rate is 1 of 5 patients = 1 / 5 = 0.20, but the published C trial reports 0.60 - a real population mismatch in a known effect modifier.",
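        "Reweight so the weighted prior-biologic rate hits 0.60. Method-of-moments weights give the single prior-biologic patient a relative weight of 6 and each of the 4 others a relative weight of 1 (weights are defined only up to a common scale factor, which cancels in every weighted mean and in the effective sample size); check the weighted rate = 6 / (6 + 4) = 0.60.",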
        "Sum of the weights = 6 + 4 = 10. Sum of the squared weights = 6 * 6 + 4 = 40 (one patient contributes 36, four contribute 1 each).",
        "Effective sample size = (sum of weights) squared over sum of squared weights = 10 * 10 / 40 = 2.5, so the 5 nominal patients behave like only 2.5 independent ones - a 50% collapse, the key warning sign.",
        "Recompute the B-versus-A effect on the reweighted population: weighted B response = (6 * 1 + 1 * 1 + 1 * 0) / (6 + 1 + 1) = 7 / 8 = 0.875, and A response = 1 / 2 = 0.50 (both A patients carry weight 1), so the risk difference is 0.875 - 0.50 = 0.375. Subtract the published C-versus-A effect (risk difference 0.12): the anchored indirect estimate is B vs C = 0.375 - 0.12 = 0.255."
      ],
      "result": "After matching the prior-biologic rate to the competitor trial's 0.60, the anchored MAIC estimates that B improves response by 0.255 (about 26 percentage points) more than C. But the effective sample size fell to 2.5 of 5 patients, so the interval around 0.255 is wide and the result leans heavily on a single up-weighted patient - report the ESS, not just the point estimate."
    },
    "prerequisites": [
      "network-meta-analysis",
      "meta-analysis-rct",
      "generalizability-transportability-external-validity-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Anchored MAIC (common comparator preserved)",
        "description": "Both trials share comparator A. IPD patients from the B-vs-A trial are reweighted so the weighted means of the selected effect modifiers match the published means of the C-vs-A trial; the within-trial B-vs-A relative effect is recomputed on the matched population and combined with the published C-vs-A effect via a Bucher-style contrast. Only effect modifiers must be balanced because randomization within each trial handles prognostic factors.",
        "edge_cases": [
          "Weights collapse the effective sample size when the IPD trial barely overlaps the aggregate population - report ESS and the weight distribution, not just the point estimate.",
          "The aggregate publication omits a known effect modifier's mean, so it cannot be matched and the anchored assumption is unverifiable for that variable."
        ],
        "data_source_notes": "primary: IPD from the manufacturer RCT supplies patient-level effect modifiers; the aggregate side is a published competitor RCT reporting baseline means. linked/RWE: real-world covariates can stand in only if measured comparably to the trial."
      },
      {
        "name": "Unanchored MAIC / STC (no common comparator)",
        "description": "No shared comparator exists (e.g., single-arm B trial versus a single-arm or external source for C). The comparison is between absolute outcomes, so the IPD must be adjusted for every prognostic factor AND every effect modifier, and the analyst must assume no unmeasured ones - the same conditional-exchangeability assumption as an observational external-control study, without within-trial randomization.",
        "edge_cases": [
          "A single missing prognostic factor biases the estimate with no diagnostic to reveal it; treat results as hypothesis-generating.",
          "Frequently the C side is a real-world external control, importing all of the confounding and outcome-misclassification risks of observational data on top of the indirect-comparison assumptions."
        ],
        "data_source_notes": "claims/ehr/registry: the unanchored aggregate or external arm is often built from RWE, so outcome and follow-up definitions must be reconciled with the trial endpoint before matching. linked: linkage helps capture more prognostic factors but cannot prove none are missing."
      },
      {
        "name": "Simulated treatment comparison (STC, outcome-regression form)",
        "description": "Instead of weighting, fit an outcome regression in the IPD that includes the effect modifiers (and, if unanchored, all prognostic factors), then predict the IPD treatment's outcome in the aggregate population by plugging in that population's reported covariate means. Keeps all patients and is more efficient than MAIC but bets on correct model specification.",
        "edge_cases": [
          "For nonlinear links (logistic, Cox), plugging in the aggregate trial's MEAN covariates targets the wrong quantity unless the effect measure is collapsible (odds ratios and hazard ratios are not); simulate over the covariate distribution rather than its mean where possible.",
          "Predicting outside the IPD trial's covariate range is silent extrapolation - check overlap before trusting STC when populations differ sharply."
        ],
        "data_source_notes": "primary: the regression is fit on trial IPD with adjudicated outcomes. linked/registry: an RWE-derived aggregate target needs covariate means defined identically to the IPD covariates or the plug-in is invalid."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "network-meta-analysis",
        "pros_of_this": "Explicitly rebalances the IPD trial to the aggregate trial's population, correcting effect-modifier imbalance that a standard anchored NMA or Bucher comparison simply assumes away; works when the evidence base is only a single pair of trials, including (in unanchored form) a pair with no common comparator.",
        "cons_of_this": "Requires individual patient data on at least one side (NMA needs none), corrects only observed and reported modifiers, and, in MAIC, shrinks the effective sample size through weighting - so the comparison can rest on a few extreme-weight patients.",
        "when_to_prefer": "When there are few trials (often just two), no larger connected network of similar trials, and a named effect modifier differs across the two populations; use standard NMA/Bucher when a connected network of similar trials exists and imbalance is minor."
      },
      {
        "compared_to": "single-arm-external-control",
        "pros_of_this": "When a common comparator exists, the anchored PAIC keeps within-trial randomization, so only effect modifiers (not all prognostic factors) need balancing - a far weaker assumption than an external-control study makes.",
        "cons_of_this": "The unanchored PAIC variant collapses into exactly an external-control problem - adjusting absolute outcomes under no-unmeasured-confounding - with the added fragility of cross-trial outcome and follow-up definitions.",
        "when_to_prefer": "Prefer an anchored PAIC whenever a shared comparator is available; reserve external-control framing for when no anchor and no comparator trial exist at all."
      },
      {
        "compared_to": "generalizability-transportability-external-validity-rwe",
        "pros_of_this": "Operationalizes transportability for a specific decision - it does not just describe a population mismatch, it reweights or models the IPD to the target population and produces the adjusted comparison HTA bodies need.",
        "cons_of_this": "Targets only the narrow population behind one aggregate publication and only the reported covariates, rather than a general, well-characterized target population; the transported estimate is as good as that single paper's reporting.",
        "when_to_prefer": "Use a PAIC for a concrete two-trial decision with reported aggregate covariates; use the broader transportability framing when reasoning about generalizing an effect to a defined real-world target population."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Claims rarely supply the IPD side of a PAIC, but they increasingly supply the aggregate or external-control arm. Effect modifiers must be derivable from claims (age, comorbidity indices, prior lines of therapy) and defined identically to the trial IPD; outcomes and follow-up windows must be reconciled with the trial endpoint before matching. Claims cannot measure many clinical modifiers (biomarkers, performance status), so a claims-based aggregate may leave key modifiers unbalanced.",
      "ehr": "EHR can provide richer effect modifiers (labs, vitals, biomarker status) for an external arm or for an RWE-derived IPD cohort, but free-text and missingness mean a modifier reported in a competitor trial may be unmeasured or noisy in EHR. Harmonize covariate definitions and outcome ascertainment to the trial before estimating weights or fitting the STC regression.",
      "registry": "Disease registries are the most trial-like RWE source for an external arm - prospectively collected modifiers and adjudicated outcomes - making them the cleanest substrate for an unanchored MAIC/STC, though registry enrollment selection still threatens the no-unmeasured-confounding assumption that unanchored comparisons require.",
      "linked": "Linked claims-EHR-registry data maximize the set of measurable prognostic factors and effect modifiers, the binding constraint for unanchored PAICs, and let outcomes be cross-validated; linkage still cannot prove no modifier is missing, so report which trial-reported modifiers could and could not be matched."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\nimport pandas as pd\nfrom scipy.optimize import minimize\n\ndef maic_anchored(ipd: pd.DataFrame, mod_cols, agg_means: dict,\n                  agg_effect_CA: float, outcome=\"outcome\", arm=\"arm\"):\n    # 1) Center each effect modifier on its aggregate-trial target mean.\n    #    Balance is achieved when the WEIGHTED sum of centered covariates is zero.\n    X = ipd[mod_cols].to_numpy(dtype=float)\n    Xc = X - np.array([agg_means[c] for c in mod_cols])\n\n    # 2) Method-of-moments weights w_i = exp(Xc_i . a). The objective\n    #    Q(a) = sum(exp(Xc . a)) is convex; its gradient is Xc' w, which is\n    #    exactly the weighted covariate imbalance, so the minimizer balances the means.\n    def Q(a):  return np.sum(np.exp(Xc @ a))\n    def dQ(a): return Xc.T @ np.exp(Xc @ a)\n    a0 = np.zeros(Xc.shape[1])\n    a_hat = minimize(Q, a0, jac=dQ, method=\"BFGS\").x\n    w = np.exp(Xc @ a_hat)\n\n    # 3) Effective sample size = (sum w)^2 / sum(w^2): how many independent\n    #    patients the weighted analysis behaves like. A small ESS is a red flag.\n    ess = float(w.sum() ** 2 / np.sum(w ** 2))\n\n    # 4) Weighted within-trial B-vs-A effect (risk difference) on the matched population.\n    d = ipd.assign(_w=w)\n    def wmean(g): return np.average(g[outcome], weights=g[\"_w\"])\n    rB = wmean(d[d[arm] == \"B\"]); rA = wmean(d[d[arm] == \"A\"])\n    effect_BA = rB - rA\n\n    # 5) Anchored (Bucher) indirect comparison on the common comparator A.\n    effect_BC = effect_BA - agg_effect_CA\n    return {\"ess\": round(ess, 2), \"weighted_effect_BA\": round(effect_BA, 4),\n            \"indirect_effect_BC\": round(effect_BC, 4),\n            \"weight_ratio_max\": round(float(w.max() / w.min()), 2)}",
        "description": "Anchored MAIC by manual logistic / method-of-moments weighting (no MAIC package). Given IPD effect-modifier columns and\nthe aggregate trial's reported means, solve for weighting coefficients so the weighted IPD means equal the aggregate\nmeans, then report the effective sample size and the population-adjusted, anchored B-vs-C contrast. Input (one row per\nIPD patient):\n  ipd : arm ('B' or 'A'), columns of effect modifiers (e.g. age, prior_biologic), outcome (0/1 responder)\n  agg_means : dict mapping each effect-modifier column to the aggregate trial's reported mean\n  agg_effect_CA : the published C-vs-A relative effect (here a risk difference) on the chosen scale",
        "dependencies": [
          "numpy",
          "pandas",
          "scipy"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "maic_anchored <- function(ipd, mod_cols, agg_means, agg_effect_CA,\n                          outcome = \"outcome\", arm = \"arm\") {\n  # 1) Center modifiers on the aggregate-trial target means.\n  X  <- as.matrix(ipd[, mod_cols, drop = FALSE])\n  Xc <- sweep(X, 2, agg_means[mod_cols], \"-\")\n\n  # 2) Method-of-moments weights via convex objective Q(a) = sum(exp(Xc %*% a)).\n  Q  <- function(a) sum(exp(Xc %*% a))\n  dQ <- function(a) as.vector(t(Xc) %*% exp(Xc %*% a))\n  a_hat <- optim(rep(0, ncol(Xc)), Q, gr = dQ, method = \"BFGS\")$par\n  w <- as.vector(exp(Xc %*% a_hat))\n\n  # 3) Effective sample size = (sum w)^2 / sum(w^2).\n  ess <- sum(w)^2 / sum(w^2)\n\n  # 4) Weighted within-trial B-vs-A risk difference on the matched population.\n  wmean <- function(idx) weighted.mean(ipd[[outcome]][idx], w[idx])\n  rB <- wmean(ipd[[arm]] == \"B\"); rA <- wmean(ipd[[arm]] == \"A\")\n  effect_BA <- rB - rA\n\n  # 5) Anchored (Bucher) indirect comparison through common comparator A.\n  effect_BC <- effect_BA - agg_effect_CA\n  list(ess = round(ess, 2),\n       weighted_effect_BA = round(effect_BA, 4),\n       indirect_effect_BC = round(effect_BC, 4),\n       weight_ratio_max = round(max(w) / min(w), 2))\n}",
        "description": "Same anchored MAIC in base R: solve the method-of-moments weighting objective with optim(), report the effective sample\nsize, and form the Bucher anchored B-vs-C contrast. The maicplus package wraps this same logic (estimate_weights() then\nmaic_anchored()); this manual version shows what the package does. Input:\n  ipd : data.frame with arm ('B'/'A'), effect-modifier columns, outcome (0/1)\n  agg_means : named numeric vector of aggregate-trial covariate means\n  agg_effect_CA : published C-vs-A risk difference",
        "dependencies": [
          "stats"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "%let agg_mod1 = 0.60;   /* aggregate-trial mean of effect modifier 1 (e.g. prior_biologic) */\n%let agg_mod2 = 55;     /* aggregate-trial mean of effect modifier 2 (e.g. age)            */\n%let agg_ca   = 0.12;   /* published C-vs-A risk difference                                */\n\nproc iml;\n  use work.ipd; read all var {mod1 mod2} into X; read all var {outcome} into y;\n  read all var {arm} into arm; close work.ipd;\n\n  /* 1) Center modifiers on aggregate-trial target means. */\n  tgt = {&agg_mod1 &agg_mod2};\n  Xc  = X - repeat(tgt, nrow(X), 1);\n\n  /* 2) Method-of-moments weights: minimize Q(a) = sum(exp(Xc*a)) by Newton-Raphson. */\n  start Q(a) global(Xc);  return( sum(exp(Xc * t(a))) );  finish;\n  start G(a) global(Xc);  return( t(Xc) * exp(Xc * t(a)) );  finish;  /* gradient = weighted imbalance */\n  a0 = j(1, ncol(Xc), 0);\n  opt = {0 0};\n  call nlpnra(rc, a_hat, \"Q\", a0) grd=\"G\" opt=opt;\n  w = exp(Xc * t(a_hat));\n\n  /* 3) Effective sample size = (sum w)^2 / sum(w^2). */\n  ess = (sum(w)##2) / sum(w##2);\n\n  /* 4) Weighted within-trial B-vs-A risk difference. */\n  isB = (arm = \"B\"); isA = (arm = \"A\");\n  rB = sum(w # y # isB) / sum(w # isB);\n  rA = sum(w # y # isA) / sum(w # isA);\n  effect_BA = rB - rA;\n\n  /* 5) Anchored (Bucher) indirect comparison through comparator A. */\n  effect_BC = effect_BA - &agg_ca;\n  print ess effect_BA effect_BC;\nquit;",
        "description": "Anchored MAIC in SAS using PROC IML throughout: NLPNRA (Newton-Raphson) minimizes the method-of-moments objective, and the\nsame IML session computes the effective sample size, the weighted B-vs-A risk difference, and the Bucher anchored contrast. PROC NLP could minimize the same\nobjective; IML is shown because it returns the weights directly. Input:\n  work.ipd : arm ($1, 'B'/'A'), mod1 mod2 (effect modifiers), outcome (0/1)\n  macro vars give the aggregate-trial means and the published C-vs-A risk difference.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Q[Decision needs B vs C<br/>no head-to-head trial] --> Net{Common comparator A<br/>in both trials?}\n  Net -- Yes --> Anc[Anchored PAIC<br/>balance effect modifiers only]\n  Net -- No --> Unanc[Unanchored PAIC<br/>balance ALL prognostic factors<br/>+ effect modifiers; assume none missing]\n  Anc --> Method{Weighting or<br/>outcome model?}\n  Unanc --> Method\n  Method -- Weighting --> MAIC[MAIC: reweight IPD so weighted<br/>covariate means = aggregate means]\n  Method -- Outcome model --> STC[STC: fit regression in IPD,<br/>predict at aggregate covariate means]\n  MAIC --> ESS{Effective sample size<br/>still adequate?}\n  ESS -- No --> Stop[Weights collapsed -<br/>too little overlap to compare]\n  ESS -- Yes --> Out[Population-adjusted<br/>B-vs-C estimate + CI]\n  STC --> Out",
        "caption": "Choosing and running a population-adjusted indirect comparison - whether a common comparator allows an anchored analysis (balance effect modifiers only) or forces an unanchored one (balance everything), then whether to weight (MAIC) or model (STC), with the effective-sample-size check that guards against a weight-collapsed, fragile estimate.",
        "alt_text": "Decision flowchart routing from a B-versus-C question through the anchored-versus-unanchored choice, the MAIC weighting versus STC regression choice, and an effective-sample-size gate to a population-adjusted estimate.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "tradeoff_with",
        "target_slug": "network-meta-analysis",
        "notes": "PAICs relax the cross-trial exchangeability that standard anchored NMA/Bucher assumes by reweighting or modeling the IPD to the aggregate population, but require IPD on one side and, with MAIC weighting, can sharply reduce the effective sample size; prefer NMA when a connected network of similar trials exists."
      },
      {
        "relation_type": "alternative_to",
        "target_slug": "meta-analysis-rct",
        "notes": "Pairwise and network meta-analysis of randomized trials assume comparable populations; a population-adjusted indirect comparison is the alternative when the trials differ in a measured effect modifier and patient-level data exist to rebalance one of them."
      },
      {
        "relation_type": "used_with",
        "target_slug": "generalizability-transportability-external-validity-rwe",
        "notes": "MAIC and STC are concrete transportability estimators - they move a within-trial effect to the population behind another trial, so the transportability framing supplies the assumptions (effect-modifier exchangeability, overlap) a PAIC must satisfy."
      },
      {
        "relation_type": "see_also",
        "target_slug": "single-arm-external-control",
        "notes": "An unanchored MAIC/STC is essentially an external-control comparison of absolute outcomes under no-unmeasured-confounding, so the external-control design's confounding and outcome-definition cautions apply directly."
      },
      {
        "relation_type": "see_also",
        "target_slug": "ipd-meta-analysis",
        "notes": "When individual patient data are available for MORE than one trial, IPD meta-analysis is preferable to a PAIC; PAICs exist precisely because IPD is held for only one of the trials being compared."
      },
      {
        "relation_type": "see_also",
        "target_slug": "indirect-standardization-smr-sir-rwe",
        "notes": "MAIC's reweighting to match a target population's covariate means is the same standardization idea used in indirect standardization, applied to make two trial populations comparable rather than to compare observed-to-expected events."
      }
    ],
    "aliases": [
      "MAIC",
      "matching-adjusted indirect comparison",
      "STC",
      "simulated treatment comparison",
      "population-adjusted indirect comparison",
      "PAIC",
      "anchored indirect comparison",
      "unanchored indirect comparison",
      "NICE DSU TSD 18"
    ],
    "complexity": "advanced",
    "regulatory_relevance": [
      "hta",
      "ema",
      "fda"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "major-adverse-cardiovascular-events-mace-rwe",
    "name": "Major Adverse Cardiovascular Events (MACE)",
    "short_definition": "A composite cardiovascular endpoint, commonly defined as cardiovascular death, nonfatal myocardial infarction, and nonfatal stroke, with variants that add hospitalization, revascularization, heart failure, or unstable angina components.",
    "long_description": "MACE is not one universal endpoint. It is a family of cardiovascular composite endpoints whose interpretation depends on the component list, event ascertainment rules, and whether the analysis counts first event only or recurrent components. The most common \"3-point MACE\" definition combines cardiovascular death, nonfatal myocardial infarction, and nonfatal stroke. Other studies use 4-point, 5-point, or study-specific MACE definitions that add hospitalization for unstable angina, urgent revascularization, heart failure hospitalization, or all-cause death.\n\nIn RWE, MACE is an outcome-algorithm problem plus a composite-endpoint problem. Nonfatal MI and stroke are usually identified through inpatient diagnosis algorithms and validated code lists. Cardiovascular death requires cause-of-death linkage; all-cause death is easier but changes the estimand. The composite date is usually the earliest qualifying component date, but the component that fires must be retained because a treatment effect dominated by one component is not a general cardiovascular benefit.\n\nCross-study comparison is especially risky. \"MACE\" in a diabetes CV outcomes trial, a device registry, and a claims-based safety study may have different death definitions, claim-position requirements, washout periods, and adjudication standards. RWE reports should spell out the component list in the title, table shell, and methods, not only in the appendix.\n\n**Pros, cons, and trade-offs.** MACE is attractive because it increases event yield and gives clinicians a familiar serious cardiovascular endpoint. The trade-off is interpretability. Composite precision is not useful if a softer or more frequent component drives the result while cardiovascular death, myocardial infarction, and stroke move differently. 
Claims-based MACE adds another trade-off: inpatient MI and stroke algorithms may be reproducible, but cause-specific death and out-of-network events require linkage that many datasets do not have.\n\n**When to use.** Use MACE when the components are clinically coherent, event capture is fit for purpose, and the estimand is time to the first serious cardiovascular event. Keep the first firing component, run component-specific estimates, and state whether the endpoint is 3-point MACE, expanded MACE, MACCE, or a study-specific variant.\n\n**When NOT to use - and when it is actively misleading.** Do not write \"MACE\" when the study only observes nonfatal claims events or all-cause death. Do not combine revascularization, unstable angina, heart failure, and death without showing component results. It is actively misleading to compare a nonfatal claims-only composite with a trial-adjudicated 3-point MACE endpoint as if the endpoints were the same.",
    "primary_category": "Outcome_Measure",
    "tags": [
      "mace",
      "cardiovascular-outcomes",
      "composite-endpoint",
      "myocardial-infarction",
      "stroke",
      "cardiovascular-death",
      "endpoint-definition"
    ],
    "applies_to_study_types": [
      "comparative_effectiveness",
      "safety_surveillance",
      "pragmatic_trial",
      "claims_analysis",
      "registry_study"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1186/s12874-021-01440-5",
        "url": "https://doi.org/10.1186/s12874-021-01440-5",
        "citation_text": "Bosco E, Hsueh L, McConeghy KW, Gravenstein S, Saade E. Major adverse cardiovascular event definitions used in observational analysis of administrative databases: a systematic review. BMC Medical Research Methodology. 2021;21:241.",
        "year": 2021,
        "authors_short": "Bosco et al.",
        "notes": "Systematic review of MACE definitions used in observational administrative-data studies."
      },
      {
        "role": "explain",
        "doi": null,
        "url": "https://www.fda.gov/media/167530/download",
        "citation_text": "U.S. Food and Drug Administration. Cardiovascular and Stroke Endpoint Definitions for Clinical Trials. Guidance for Industry.",
        "year": 2023,
        "authors_short": "FDA",
        "notes": "Regulatory endpoint guidance for cardiovascular and stroke events."
      },
      {
        "role": "explain",
        "doi": "10.1016/j.jacc.2007.10.034",
        "url": "https://doi.org/10.1016/j.jacc.2007.10.034",
        "citation_text": "Kip KE, Hollabaugh K, Marroquin OC, Williams DO. The problem with composite end points in cardiovascular studies. Journal of the American College of Cardiology. 2008;51(7):701-707.",
        "year": 2008,
        "authors_short": "Kip et al.",
        "notes": "Classic discussion of component dominance and heterogeneous clinical importance."
      }
    ],
    "plain_language_summary": "MACE is a shorthand for a serious cardiovascular-event composite. The common version is cardiovascular death, heart attack, or stroke. But different studies use different component lists, so the word MACE is not enough by itself.",
    "key_terms": [
      {
        "term": "3-point MACE",
        "definition": "Composite of cardiovascular death, nonfatal myocardial infarction, and nonfatal stroke."
      },
      {
        "term": "4-point MACE",
        "definition": "Usually 3-point MACE plus another serious cardiovascular component, often unstable angina hospitalization or urgent revascularization, depending on the protocol."
      },
      {
        "term": "component dominance",
        "definition": "A composite-result problem where one frequent component drives the overall estimate while rarer but more important components contribute little."
      }
    ],
    "worked_example": {
      "scenario": "A claims study of a diabetes drug defines 3-point MACE as cardiovascular death, nonfatal inpatient MI, or nonfatal inpatient ischemic stroke. One patient has an inpatient MI on day 143, later revascularization on day 151, and death without cause information on day 220.",
      "dataset": {
        "caption": "Candidate cardiovascular events for one patient.",
        "columns": [
          "day",
          "candidate_event",
          "source",
          "endpoint_component"
        ],
        "rows": [
          [
            143,
            "inpatient myocardial infarction",
            "principal inpatient ICD-10-CM diagnosis",
            "nonfatal MI"
          ],
          [
            151,
            "coronary revascularization",
            "CPT/ICD-10-PCS procedure",
            "not in 3-point MACE"
          ],
          [
            220,
            "death, cause unknown",
            "death-date linkage only",
            "not cardiovascular death unless cause source supports it"
          ]
        ]
      },
      "steps": [
        "Apply the MI algorithm and de-duplication window to the inpatient claim.",
        "Exclude revascularization because the protocol specified 3-point MACE, not expanded MACE.",
        "Do not classify death as cardiovascular death without cause-of-death evidence.",
        "Assign the first MACE date to day 143 and retain the component that fired (nonfatal MI)."
      ],
      "result": "The time-to-first 3-point MACE endpoint fires on day 143. The later revascularization and cause-unknown death are reported separately or handled in sensitivity analyses."
    },
    "prerequisites": [],
    "index_definitions": [
      {
        "name": "3-point MACE",
        "definition": "Cardiovascular death, nonfatal myocardial infarction, or nonfatal stroke; event date is the first qualifying component date.",
        "source": "Cardiovascular outcomes trial convention and FDA endpoint guidance",
        "use": "Standard cardiovascular safety/effectiveness composite when cause-specific death is observable.",
        "notes": "Requires reliable death and cause-of-death linkage for the fatal component."
      },
      {
        "name": "Nonfatal MACE",
        "definition": "Nonfatal myocardial infarction or nonfatal stroke without the fatal cardiovascular-death component.",
        "source": "Claims-based RWE adaptation",
        "use": "Sensitivity or restricted endpoint when cause-specific mortality is unavailable.",
        "notes": "Should not be labeled full 3-point MACE."
      },
      {
        "name": "Expanded MACE",
        "definition": "3-point MACE plus protocol-specific components such as unstable angina hospitalization, revascularization, or heart failure hospitalization.",
        "source": "Study-specific cardiovascular endpoint definitions",
        "use": "Higher event yield or broader net cardiovascular burden.",
        "notes": "Report every component and avoid comparing directly with 3-point MACE."
      }
    ],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "3-point MACE",
        "description": "Cardiovascular death, nonfatal MI, or nonfatal stroke.",
        "edge_cases": [
          "Cause-of-death linkage may be absent.",
          "Same-day MI and stroke need a tie rule."
        ],
        "data_source_notes": "Strongest when death-source and inpatient event capture are both fit for purpose."
      },
      {
        "name": "Expanded MACE / MACCE",
        "description": "Adds components such as unstable angina hospitalization, revascularization, or heart failure hospitalization.",
        "edge_cases": [
          "Softer care-pattern components may dominate.",
          "Revascularization depends on access and practice style."
        ],
        "data_source_notes": "Report component-specific results and do not compare directly with 3-point MACE."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Component-specific cardiovascular endpoints",
        "use_mace_when": "A net serious cardiovascular outcome is clinically coherent and components are expected to move together.",
        "use_components_when": "Components differ in severity, frequency, mechanism, or ascertainment validity.",
        "notes": "Composite precision is not worth much if interpretation is driven by a weak component."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Use inpatient setting and diagnosis-position rules for MI/stroke; death requires separate mortality linkage.",
      "ehr": "EHR may contain richer clinical detail but non-network events and deaths can be missed.",
      "registry": "Registry adjudication improves specificity but may not capture all out-of-registry care."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\n\nMACE3 = {\"cardiovascular_death\", \"nonfatal_mi\", \"nonfatal_stroke\"}\n\ndef first_mace(index, events, followup_end=None):\n    candidates = events[events[\"component\"].isin(MACE3)].copy()\n    candidates = candidates.merge(index[[\"person_id\", \"index_date\"]], on=\"person_id\", how=\"inner\")\n    candidates = candidates[candidates[\"event_date\"] >= candidates[\"index_date\"]]\n    if followup_end is not None:\n        candidates = candidates.merge(followup_end[[\"person_id\", \"end_date\"]], on=\"person_id\", how=\"left\")\n        candidates = candidates[candidates[\"end_date\"].isna() | (candidates[\"event_date\"] <= candidates[\"end_date\"])]\n    candidates = candidates.sort_values([\"person_id\", \"event_date\", \"component\"])\n    out = candidates.groupby(\"person_id\", as_index=False).first()\n    out[\"mace3\"] = 1\n    out[\"time_to_mace_days\"] = (out[\"event_date\"] - out[\"index_date\"]).dt.days\n    return index.merge(\n        out[[\"person_id\", \"mace3\", \"event_date\", \"component\", \"source\", \"time_to_mace_days\"]],\n        on=\"person_id\",\n        how=\"left\",\n    ).fillna({\"mace3\": 0})",
        "description": "Construct first 3-point MACE from already validated component tables. Inputs:\n  index  : person_id, index_date\n  events : person_id, event_date, component, source\nComponents must already satisfy the study's event algorithms. Keep the winning component for interpretation.",
        "dependencies": [
          "pandas"
        ],
        "source_citations": [
          "bosco-2021",
          "fda-cardiovascular-endpoints-2023"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\n\nfirst_mace <- function(index, events, followup_end = NULL) {\n  setDT(index); setDT(events)\n  mace3 <- c(\"cardiovascular_death\", \"nonfatal_mi\", \"nonfatal_stroke\")\n  cand <- events[component %in% mace3]\n  cand <- merge(cand, index[, .(person_id, index_date)], by = \"person_id\")\n  cand <- cand[event_date >= index_date]\n  if (!is.null(followup_end)) {\n    setDT(followup_end)\n    cand <- merge(cand, followup_end[, .(person_id, end_date)], by = \"person_id\", all.x = TRUE)\n    cand <- cand[is.na(end_date) | event_date <= end_date]\n  }\n  setorder(cand, person_id, event_date, component)\n  first <- cand[, .SD[1], by = person_id]\n  first[, `:=`(mace3 = 1L, time_to_mace_days = as.integer(event_date - index_date))]\n  out <- merge(index, first[, .(person_id, mace3, event_date, component, source, time_to_mace_days)],\n               by = \"person_id\", all.x = TRUE)\n  out[is.na(mace3), mace3 := 0L]\n  out[]\n}",
        "description": "R/data.table version for first 3-point MACE. Component rows are assumed to be validated upstream.",
        "dependencies": [
          "data.table"
        ],
        "source_citations": [
          "bosco-2021",
          "fda-cardiovascular-endpoints-2023"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "proc sql;\n  create table mace_candidates as\n  select e.person_id, i.index_date, e.event_date, e.component, e.source\n  from work.events e inner join work.index i on e.person_id = i.person_id\n  where e.component in (\"cardiovascular_death\", \"nonfatal_mi\", \"nonfatal_stroke\")\n    and e.event_date >= i.index_date;\nquit;\n\nproc sort data=mace_candidates;\n  by person_id event_date component;\nrun;\n\ndata first_mace;\n  set mace_candidates;\n  by person_id;\n  if first.person_id then do;\n    mace3 = 1;\n    time_to_mace_days = event_date - index_date;\n    output;\n  end;\n  keep person_id mace3 event_date component source time_to_mace_days;\nrun;\n\nproc sql;\n  create table mace_analysis as\n  select i.*, coalesce(m.mace3, 0) as mace3,\n         m.event_date, m.component, m.source, m.time_to_mace_days\n  from work.index i left join first_mace m on i.person_id = m.person_id;\nquit;",
        "description": "SAS pattern for first 3-point MACE after component algorithms have created work.events.\nInputs:\n  work.index  person_id index_date\n  work.events person_id event_date component source",
        "dependencies": [],
        "source_citations": [
          "bosco-2021",
          "fda-cardiovascular-endpoints-2023"
        ],
        "notes": ""
      }
    ],
    "diagrams": [],
    "relations": [
      {
        "relation_type": "part_of",
        "target_slug": "composite-endpoint-construction-rwe",
        "notes": "MACE is a canonical composite endpoint."
      },
      {
        "relation_type": "requires",
        "target_slug": "outcome-algorithm-construction-rwe",
        "notes": "Each MACE component needs its own validated outcome algorithm."
      },
      {
        "relation_type": "used_with",
        "target_slug": "fit-for-purpose-data-assessment-rwe",
        "notes": "Data fitness for MACE depends heavily on death linkage and inpatient event capture."
      }
    ],
    "aliases": [
      "MACE",
      "major adverse cardiovascular event",
      "major adverse cardiovascular events",
      "3-point MACE",
      "three-point MACE",
      "cardiovascular death MI stroke composite",
      "nonfatal MI nonfatal stroke cardiovascular death"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "mann-whitney-u-test",
    "name": "Mann-Whitney U Test (Wilcoxon Rank-Sum)",
    "short_definition": "A nonparametric two-sample test that replaces raw outcome values with their joint ranks and tests whether one group tends to produce larger values than the other — formally, whether P(X > Y) = 0.5 (stochastic dominance), not whether the group medians are equal. It is the standard nonparametric alternative to the two-sample t-test for continuous and ordinal outcomes, and the U statistic divided by n1·n2 equals the probability that a randomly chosen observation from group 1 exceeds a randomly chosen observation from group 2 — a direct link to the ROC AUC.",
    "long_description": "**What the Mann-Whitney U test actually tests**\n\nThe Mann-Whitney U test (also called the Wilcoxon rank-sum test) is a nonparametric two-sample\ntest that operates on ranks rather than raw values. Both names refer to the identical procedure:\nMann and Whitney (1947) derived the U statistic via pairwise comparisons, while Wilcoxon (1945)\nderived the rank-sum form; they are algebraically identical. The test answers one specific\nquestion: is there stochastic dominance between the two groups? Under the null hypothesis,\nthe probability that a randomly drawn observation from group 1 exceeds a randomly drawn\nobservation from group 2 equals 0.5. Formally, H0: P(X > Y) = 0.5 for independent X ~ F and\nY ~ G.\n\nA pervasive misconception — worth stating explicitly — is that the Mann-Whitney test is a\n\"test of medians.\" Divine et al. (2018) demonstrated rigorously that this folklore is incorrect.\nThe Mann-Whitney null of equal medians holds ONLY under the additional assumption that the two\ndistributions have identical shapes differing only by a location shift. In practice, if the two\ngroups differ in variance, skewness, or tail behavior while having equal medians, the\nMann-Whitney test can reject. Conversely, two groups with different medians can fail to reject\nif the distributions differ in ways that cancel out in the rank ordering. When reporting a\nMann-Whitney result, the correct language is \"group 1 tended to have higher [outcome]\" or \"there\nis evidence of stochastic dominance,\" not \"the medians differ.\"\n\n**Computing U: the mechanics**\n\nPool all observations from both groups and rank them from smallest (rank 1) to largest (rank N,\nwhere N = n1 + n2). When two or more values tie, each tied value receives the average of the\nranks they would have occupied. Let R1 = sum of ranks for group 1. The U statistic for group 1 is:\n\n  U1 = R1 - n1(n1 + 1) / 2\n\nThe complementary U for group 2 is U2 = n1·n2 - U1. 
As a check, U1 + U2 = n1·n2 always, and\nthe total rank sum R1 + R2 = N(N+1)/2. The conventional test statistic is U = min(U1, U2), and\nthe p-value is obtained from the exact distribution (for small samples) or the normal\napproximation (for large samples or with many ties).\n\n**The U-to-AUC connection: a genuinely useful HEOR interpretation**\n\nThe ratio U1 / (n1·n2) equals the empirical probability index — the fraction of all n1·n2\npairwise comparisons in which a group-1 observation exceeds a group-2 observation. This is\nnumerically identical to the area under the receiver operating characteristic (ROC) curve (AUC)\nwhen group membership is the binary label. An AUC of 0.50 means no discrimination; an AUC of\n0.80 means a randomly chosen group-1 patient has an 80% probability of a larger outcome than\na randomly chosen group-2 patient. This restatement converts an abstract rank statistic into\nan immediately communicable effect size that health technology assessment bodies understand —\nespecially for patient-reported outcomes, quality-of-life scores, and functional assessments.\n\n**The Hodges-Lehmann shift estimate: the effect-size companion**\n\nA bare p-value from a Mann-Whitney test is insufficient for HEOR decision-making. The\nHodges-Lehmann (HL) estimator provides the natural location estimate to accompany the test.\nThe HL estimate is the median of all n1·n2 pairwise differences (xi - yj), and its associated\nconfidence interval uses the Wilcoxon distribution. In R, `wilcox.test(..., conf.int = TRUE)`\nreturns the HL estimate and CI directly. In SAS, `PROC NPAR1WAY WILCOXON HL` produces both.\nThe HL estimate is neither the difference in means nor the difference in medians — it is the\n\"shift\" that best explains the rank ordering, and it is on the original outcome scale (e.g.,\ndays of hospitalization). 
Report it: \"Group A had a median pairwise shift of -1.5 days [95% CI\n-3.0 to -0.5 days] fewer hospitalization days than group B.\" This gives the payer or clinical\ndecision-maker a number to act on, not just a rejection flag.\n\n**Ties and their consequences**\n\nTies arise whenever two or more observations share the same value. In claims data this is\ncommon — length of stay (LOS) is measured in integer days, so a dataset with 5,000 patients\nmight have 800 patients at LOS = 1, 600 at LOS = 2, etc. Ties are handled by midrank averaging\n(each tied observation receives the mean of the ranks it would have occupied). With many ties\nthe test statistic's variance is underestimated under the standard formula, requiring a\ncontinuity correction. All major software packages (scipy.stats.mannwhitneyu with\n`method='asymptotic'`, R `wilcox.test` with `exact=FALSE`, SAS `PROC NPAR1WAY WILCOXON`) apply\nthe tie correction automatically. When ties are extremely heavy — e.g., an ordinal outcome with\nonly five distinct values across 10,000 observations — the Mann-Whitney test loses power and\na proportional-odds logistic regression becomes a more informative analysis.\n\n**RWE realities: costs, LOS, and the routing decision**\n\nLength of stay and total healthcare costs are the canonical skewed outcomes in RWE and HEOR.\nThe Mann-Whitney test is a valid choice for comparing the rank ordering of these outcomes\nbetween treatment groups, but it answers a different question from what most budget-impact\nanalyses and payer submissions require. Payers need a mean cost difference — the quantity that\nmultiplies by population size to yield total budget impact. The Mann-Whitney U test does NOT\nestimate mean costs, and the HL shift estimate is NOT a mean difference. 
For mean-cost inference,\nthe modern standard is a generalized linear model (GLM) with a log link and a gamma or Tweedie\nvariance function, which (a) targets the mean on the original dollar scale, (b) handles the\nright-skewed distributional shape, and (c) accommodates covariate adjustment. The Mann-Whitney\nis appropriate as a descriptive or sensitivity check — \"does group A tend to have lower LOS?\" —\nnot as the primary analysis when mean costs are the policy-relevant estimand.\n\n**Adjustment and causality: what rank tests cannot do**\n\nThe Mann-Whitney test is a bivariate comparison. In an unbalanced observational dataset, an\nunadjusted Mann-Whitney comparing treated versus untreated patients estimates a potentially\nconfounded association, not a causal treatment effect. There is no natural regression extension\nof the Mann-Whitney test that adjusts for covariates the way a GLM, propensity-weighted model,\nor g-formula can. When confounding is a concern — which it always is in non-randomized RWE —\nroute to propensity-score weighting or matching followed by a weighted t-test or GLM, not to a\nbare Mann-Whitney. 
The test is most defensible for primary analysis in randomized data or as a\nsensitivity check in a pre-specified non-inferiority analysis after achieving covariate balance\nby design.\n\n**Pros, cons, and trade-offs**\n\nPros:\n- Valid without distributional assumptions (no normality requirement); robust to heavy tails and\n  outliers, which dampens the influence of extreme cost or LOS values to a single high rank.\n- Exact p-values available for small samples via the permutation distribution of ranks.\n- The U/(n1·n2) probability index and ROC AUC equivalence provide an intuitive, directly\n  communicable effect-size measure.\n- Widely implemented and understood; acceptable to FDA, EMA, and HTA bodies as a pre-specified\n  nonparametric sensitivity analysis.\n- Well-suited for ordinal outcomes where values are intrinsically ranked but not continuous.\n\nCons:\n- Does not test medians (unless equal-shape assumption holds); this is routinely misreported\n  and must be explicitly corrected in manuscripts and regulatory submissions.\n- The HL shift estimate is not a mean difference; it is a poor substitute for the mean-cost\n  difference that payers need for budget-impact models.\n- No covariate-adjustment framework comparable to regression; cannot control confounding\n  directly.\n- Power is lower than a t-test when the parametric normality assumption holds; in large RWE\n  datasets this efficiency loss is negligible but in small studies it may matter.\n- Heavy ties (common in integer-valued claims variables) reduce power and require software\n  tie-correction; with very few unique values a proportional-odds model is preferable.\n\n**When to use**\n\n- Two independent groups, continuous or ordinal outcome, when the distributional assumption of\n  the t-test is clearly violated at small-to-moderate n (roughly n < 50 per group with visible\n  non-normality on a Q-Q plot or histogram).\n- As a pre-specified nonparametric sensitivity check alongside a t-test or GLM 
primary analysis\n  in regulatory submissions to demonstrate robustness to distributional assumptions.\n- When the rank ordering is intrinsically meaningful (e.g., functional severity scores, ordinal\n  patient-reported outcomes) and the outcome is not well-modeled as continuous.\n- For LOS comparisons where \"which arm tends to have shorter stays\" is the scientific question\n  and a descriptive stochastic-dominance answer is sufficient.\n- Baseline comparisons of continuous covariates in Table 1 (though standardized mean differences\n  are now preferred for balance assessment in observational studies).\n- Exploratory analyses and data-quality checks during cohort construction where a quick,\n  assumption-free comparison is needed before committing to a primary model.\n\n**When NOT to use**\n\n- When the primary decision requires a mean cost difference: the HL shift is not interpretable\n  as a mean-cost difference; route primary inference to a gamma GLM or two-part model.\n- When confounding is uncontrolled: an unadjusted Mann-Whitney on an unbalanced observational\n  dataset estimates a confounded association, not a causal effect. 
Use propensity-weighted\n  GLM or matching methods instead.\n- For paired data (pre-post measurements or matched pairs): use the Wilcoxon signed-rank test\n  (the paired analogue) rather than the rank-sum test, which assumes independence between groups.\n- When the outcome is ordinal with very few levels (e.g., a 5-point Likert scale with most\n  responses at 3-4): proportional-odds logistic regression or a cumulative link model captures\n  the full ordinal structure and allows covariate adjustment; Mann-Whitney reduces this to a\n  rank ordering that discards distributional detail.\n- When the goal is an estimate of relative risk, hazard ratio, or odds ratio for a binary\n  outcome: use logistic regression, Poisson GLM with robust SE, or a log-binomial model.\n- When reporting, do NOT describe a Mann-Whitney result as a test of medians unless you\n  explicitly state and verify the equal-shape assumption — this is one of the most common\n  statistical errors in HEOR manuscripts.\n\n**Interpreting the output**\n\nIn the worked example, the rank sums for Group A (shorter hospital stays, n = 4) and\nGroup B (longer stays, n = 4) are R_A = 12 and R_B = 24. The U statistic for Group A is\nU_A = 2 and for Group B is U_B = 14; U_A + U_B = 16 = 4 × 4, confirming the computation.\nThe test statistic is U = min(2, 14) = 2. The probability index — U_B / (n_A × n_B) =\n14/16 = 0.875 — is the fraction of all 16 pairwise Group A vs Group B comparisons in which\nthe Group B observation is larger.\n\n*(1) Formal interpretation.* Under the null of stochastic equality (P(X_A > X_B) = 0.5),\nthe expected value of U for either group is n_A × n_B / 2 = 8. The observed U_A = 2 is far\nbelow this expectation. The probability index of 0.875 indicates that Group B stochastically\ndominates Group A — a randomly chosen Group B patient has a larger observation than a\nrandomly chosen Group A patient in 87.5% of pairings. 
The exact p-value (available from the\npermutation distribution at this small n) should be compared to the pre-specified alpha. With\nn = 4 per group and no ties there are C(8,4) = 70 equally likely rank assignments, of which 4\ngive U <= 2 in each tail, so the exact two-sided p-value is 8/70 ≈ 0.114. The\ntest evaluates stochastic dominance (whether P(X > Y) = 0.5), not whether the group\nmedians are equal; median equality is only entailed under the additional assumption that both\ndistributions have the same shape, differing only in location.\n\n*(2) Practical interpretation.* Group A tends toward shorter hospital stays: a randomly\nselected Group A patient had a shorter stay than a randomly selected Group B patient in 87.5%\nof all possible pairings. This probability index is directly analogous to the ROC AUC and is\nimmediately communicable as a measure of stochastic dominance. For budget-impact or\nresource-planning contexts where mean length of stay is the required input, a regression-based\nmean estimate is needed: neither the U statistic nor the Hodges-Lehmann shift estimate (the\nmedian of all pairwise differences between groups) is a mean difference, so neither can be\nplugged directly into a cost model. The Hodges-Lehmann shift and its confidence interval remain\nthe natural effect size to report alongside the rank-based test itself.",
    "primary_category": "Inferential_Statistics",
    "tags": [
      "statistics",
      "primitive",
      "hypothesis-testing",
      "nonparametric",
      "rank-based",
      "two-sample",
      "Wilcoxon",
      "Mann-Whitney",
      "stochastic-dominance",
      "Hodges-Lehmann",
      "AUC"
    ],
    "applies_to_study_types": [
      "cohort_retrospective",
      "cohort_prospective",
      "rct",
      "cross_sectional",
      "active_comparator_new_user",
      "descriptive_analysis",
      "claims_analysis",
      "ehr_study",
      "registry_study"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "primary",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1214/aoms/1177730491",
        "url": "https://doi.org/10.1214/aoms/1177730491",
        "citation_text": "Mann HB, Whitney DR. On a test of whether one of two random variables is stochastically larger than the other. Annals of Mathematical Statistics. 1947;18(1):50-60.",
        "year": 1947,
        "authors_short": "Mann & Whitney",
        "notes": "The original paper deriving the U statistic via pairwise comparisons; establishes the null hypothesis as stochastic dominance (P(X>Y) = 0.5), not equality of medians, and derives the exact null distribution via combinatorics. Essential reading for anyone who wants to understand what the test actually tests rather than the folklore version."
      },
      {
        "role": "explain",
        "doi": "10.1080/00031305.2017.1305291",
        "url": "https://doi.org/10.1080/00031305.2017.1305291",
        "citation_text": "Divine GW, Norton HJ, Barón AE, Juarez-Colunga E. The Wilcoxon-Mann-Whitney procedure fails as a test of medians. The American Statistician. 2018;72(3):278-286.",
        "year": 2018,
        "authors_short": "Divine et al.",
        "notes": "Rigorous demonstration that the Wilcoxon-Mann-Whitney test is not a test of medians except under the restrictive equal-shape assumption; shows scenarios where the test rejects with equal medians and fails to reject with unequal medians. The key corrective reference for HEOR analysts who routinely mischaracterize the test in manuscripts."
      },
      {
        "role": "demonstrate",
        "doi": "10.1214/aoms/1177704172",
        "url": "https://doi.org/10.1214/aoms/1177704172",
        "citation_text": "Hodges JL, Lehmann EL. Estimates of location based on rank tests. Annals of Mathematical Statistics. 1963;34(2):598-611.",
        "year": 1963,
        "authors_short": "Hodges & Lehmann",
        "notes": "Original derivation of the Hodges-Lehmann location estimator as the median of all pairwise differences; establishes the shift estimate and its confidence interval as the natural effect-size companion to the Wilcoxon rank-sum test. The basis for the HL estimate returned by wilcox.test(conf.int=TRUE) in R and the HL option in SAS PROC NPAR1WAY."
      }
    ],
    "plain_language_summary": "The Mann-Whitney U test (also called the Wilcoxon rank-sum test) compares two independent groups by replacing their raw numbers with rankings — first place, second place, and so on across both groups combined — and then asking whether one group's rankings are consistently higher than the other's. It does not require the data to follow a bell-curve shape, which makes it a useful alternative to the t-test for skewed outcomes like length of hospital stay. One important caveat: despite common usage, this test does not compare group medians — it tests whether patients in one group tend to have larger values than patients in the other group, which is a subtly different question.",
    "key_terms": [
      {
        "term": "rank",
        "definition": "The position of a value when all observations from both groups are sorted together from smallest to largest; the smallest value gets rank 1, the next gets rank 2, and so on."
      },
      {
        "term": "rank sum",
        "definition": "The total of all rank positions assigned to one group; if group A has ranks 1, 3, 5, and 8, its rank sum is 17."
      },
      {
        "term": "U statistic",
        "definition": "The Mann-Whitney U statistic counts how many times a value from one group exceeds a value from the other group, across all possible pairs; it equals the rank sum minus a correction for the group size."
      },
      {
        "term": "stochastic dominance",
        "definition": "A formal way of saying one group tends to produce larger values than another; the Mann-Whitney test is actually a test of stochastic dominance, not a test of whether the group medians differ."
      },
      {
        "term": "Hodges-Lehmann estimator",
        "definition": "A rank-based estimate of the location shift between two groups, computed as the median of all pairwise differences (one value from each group); it gives a single number summarizing how much one group tends to exceed the other, with an accompanying confidence interval."
      },
      {
        "term": "ties",
        "definition": "When two or more observations share the same value (e.g., three patients all with 2 days of hospital stay), each tied observation is assigned the average rank of the positions they share, rather than distinct ranks."
      }
    ],
    "worked_example": {
      "scenario": "A health outcomes analyst is comparing length of stay (LOS) in days for two groups of patients hospitalized with a respiratory infection: 4 patients receiving a new antibiotic regimen (Group A) and 4 patients receiving standard of care (Group B). The analyst wants to test whether Group A tends to have shorter stays using the Mann-Whitney U test, and then compute the U statistic and the probability index (U / n1*n2) to see how strongly Group A dominates Group B in the rank ordering. The dataset has no tied values so the ranking is straightforward.",
      "dataset": {
        "caption": "Length of stay in days (n=4 per group). No tied values. All 8 observations are pooled and ranked together for the Mann-Whitney test.",
        "columns": [
          "patient_id",
          "group",
          "los_days"
        ],
        "rows": [
          [
            "A1",
            "A",
            2
          ],
          [
            "A2",
            "A",
            3
          ],
          [
            "A3",
            "A",
            5
          ],
          [
            "A4",
            "A",
            6
          ],
          [
            "B1",
            "B",
            4
          ],
          [
            "B2",
            "B",
            7
          ],
          [
            "B3",
            "B",
            8
          ],
          [
            "B4",
            "B",
            9
          ]
        ]
      },
      "steps": [
        "Pool all 8 LOS values and sort ascending: 2 (A1), 3 (A2), 4 (B1), 5 (A3), 6 (A4), 7 (B2), 8 (B3), 9 (B4). Assign ranks 1 through 8 in this order.",
        "Rank assignments: A1 gets rank 1, A2 gets rank 2, B1 gets rank 3, A3 gets rank 4, A4 gets rank 5, B2 gets rank 6, B3 gets rank 7, B4 gets rank 8.",
        "Sum ranks for Group A: R_A = 1 + 2 + 4 + 5 = 12.",
        "Sum ranks for Group B: R_B = 3 + 6 + 7 + 8 = 24.",
        "Verify total rank sum: R_A + R_B = 12 + 24 = 36 = 8*(8+1)/2 = 8*9/2 = 36. Correct.",
        "Compute U for Group A: U_A = R_A - n_A*(n_A+1)/2 = 12 - 4*5/2 = 12 - 10 = 2.",
        "Compute U for Group B: U_B = R_B - n_B*(n_B+1)/2 = 24 - 4*5/2 = 24 - 10 = 14.",
        "Check: U_A + U_B = 2 + 14 = 16 = n_A * n_B = 4 * 4 = 16. Correct.",
        "Probability index (AUC equivalent) for Group B: U_B / (n_A * n_B) = 14 / 16 = 0.875. This means in 87.5% of all pairwise comparisons a Group B patient had a longer stay than a Group A patient — strong evidence that Group A tends to have shorter stays."
      ],
      "result": "R_A = 12, R_B = 24, total = 36 = 8*9/2. U_A = 12 - 10 = 2, U_B = 24 - 10 = 14, U_A + U_B = 2 + 14 = 16 = 4*4. Probability index = 14/16 = 0.875. Group A patients tend to have shorter hospital stays than Group B patients; the new antibiotic arm dominated in 87.5% of all pairwise comparisons. Note: this does not test medians — it tests stochastic dominance. Pair this with a Hodges-Lehmann shift estimate and CI for a reportable effect size."
    },
    "prerequisites": [
      "inferential-statistics-foundations",
      "parametric-vs-nonparametric-tests",
      "descriptive-statistics"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Exact vs asymptotic p-value",
        "description": "For small samples (n < ~25 per group), the exact permutation distribution of the U statistic is available and preferable. For larger samples or when ties are present, a normal approximation with tie correction is used. In scipy.stats.mannwhitneyu, `method='exact'` and `method='asymptotic'` control this; in R, `wilcox.test(..., exact=TRUE/FALSE)`; in SAS, PROC NPAR1WAY computes exact p-values via EXACT WILCOXON. At large n (>200 per group) the two approaches agree closely and the asymptotic approximation is always adequate.",
        "edge_cases": [
          "With ties, `exact=TRUE` is not available in most software (the exact distribution requires no ties); fall back to the asymptotic approximation with the tie-correction formula.",
          "In SAS, the EXACT WILCOXON statement can be slow for large n; set a maximum time limit with the MAXTIME= option."
        ],
        "data_source_notes": "Claims and registry: LOS and cost data typically have many ties at integer values; use asymptotic approximation with tie correction. Primary data / small trials: use exact p-value."
      },
      {
        "name": "One-sided vs two-sided alternative",
        "description": "The two-sided test (H1: P(X>Y) ≠ 0.5) is standard for most HEOR applications. A one-sided test (H1: P(X>Y) > 0.5 or < 0.5) requires pre-specification; it is used in pre-specified non-inferiority or superiority sensitivity analyses but carries higher type-I error risk if not registered. In Python, `alternative='greater'`, `'less'`, or `'two-sided'` in `mannwhitneyu`; in R, `alternative=` parameter of `wilcox.test`.",
        "edge_cases": [
          "Post-hoc selection of one-sided vs two-sided inflates the type-I error; always pre-specify."
        ],
        "data_source_notes": "For regulatory submissions, default to two-sided; document any one-sided choice in the SAP."
      },
      {
        "name": "Paired data — use Wilcoxon signed-rank instead",
        "description": "When the two groups consist of matched pairs (same patient pre/post, or one-to-one matched controls), the rank-sum test is incorrect because the independence assumption is violated. Use the Wilcoxon signed-rank test on the within-pair differences. Applying rank-sum to paired data is a common error in HEOR manuscripts.",
        "edge_cases": [
          "In matched claims cohorts with 1:n matching, verify whether the primary analysis should be matched-pair (signed-rank or paired t-test on differences) or unmatched (rank-sum on the full distribution); this depends on whether the pairing is part of the design or merely a confounder-control device."
        ],
        "data_source_notes": "Propensity-matched pairs in claims: if the primary analysis uses the matched design, use paired methods; if matching is covariate-control only and the effect is marginal, use GLM with cluster SE on the original cohort."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "two-sample-t-test",
        "pros_of_this": "More robust to heavy-tailed, skewed, or outlier-contaminated distributions; valid without the normality assumption at small n; exact p-values available without approximation.",
        "cons_of_this": "Lower statistical power than the t-test when the normality assumption holds; does not estimate the mean difference, which is needed for budget-impact models; no natural extension for covariate adjustment.",
        "when_to_prefer": "Use Mann-Whitney when the distribution is clearly non-normal at small n, or when the rank ordering is the scientifically meaningful quantity; use Welch t-test when mean differences are the target estimand and n is adequate for the CLT."
      },
      {
        "compared_to": "wilcoxon-signed-rank-test",
        "pros_of_this": "Designed for independent (unpaired) groups; the standard nonparametric test for independent two-sample comparisons.",
        "cons_of_this": "Cannot be used for paired or matched-pair data; the Wilcoxon signed-rank test is the correct analogue for paired observations and pre-post designs.",
        "when_to_prefer": "Use rank-sum (Mann-Whitney) for independent groups; use signed-rank for paired/matched data. The two tests are frequently confused in the literature because both carry the Wilcoxon name."
      },
      {
        "compared_to": "kruskal-wallis-test",
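        "pros_of_this": "Appropriate when exactly two independent groups are being compared; for k=2 the Kruskal-Wallis test gives the same two-sided asymptotic p-value as Mann-Whitney (apart from any continuity correction), but Mann-Whitney additionally supports one-sided alternatives, exact p-values, and a directional effect size (probability index, Hodges-Lehmann shift).",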
        "cons_of_this": "Cannot compare three or more groups simultaneously without inflating the type-I error; Kruskal-Wallis is the correct omnibus test for multi-group nonparametric comparisons.",
        "when_to_prefer": "Use Mann-Whitney for exactly two groups; use Kruskal-Wallis for three or more, followed by pairwise Dunn tests with multiplicity correction."
      },
      {
        "compared_to": "win-ratio-generalized-pairwise-comparisons-rwe",
        "pros_of_this": "Simpler, more familiar, and sufficient for a single continuous or ordinal endpoint; widely implemented in all standard statistical software without add-on packages.",
        "cons_of_this": "Can only handle a single endpoint; does not accommodate a prioritized hierarchy of outcomes (e.g., death ranked above hospitalization); the win ratio extends rank-based logic to composite multi-outcome settings.",
        "when_to_prefer": "Use Mann-Whitney for a single endpoint; prefer the win ratio / generalized pairwise comparison when multiple outcomes of different clinical severity must be combined, as in composite cardiovascular endpoints."
      },
      {
        "compared_to": "healthcare-costs-pppm-pppy-pmpm",
        "pros_of_this": "Distribution-free rank comparison of cost distributions (independent observations are still required); robust to the extreme right-skew and outliers that characterize cost data; valid at any sample size.",
        "cons_of_this": "The HL shift estimate is not a mean cost difference; the mean — not the median shift — is the policy-relevant quantity for budget-impact models and payer submissions; a gamma GLM with log link is the modern standard for inferring mean cost differences.",
        "when_to_prefer": "Use Mann-Whitney as a descriptive and sensitivity check for cost or LOS rank ordering; route primary mean-cost inference to gamma GLM or two-part models."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Per-patient LOS and total cost are the natural units; both are right-skewed with many ties at integer-day or round-dollar values. Use asymptotic approximation with tie correction. For LOS, Mann-Whitney is a valid descriptive comparison; for costs, pair with a gamma GLM for the primary mean-cost estimate and present Mann-Whitney as a sensitivity check. At large n (>10,000), the test will reject for trivially small differences; report U/(n1*n2) as the probability index and the HL shift estimate with CI to communicate effect magnitude.",
      "ehr": "Lab values, vital signs, and functional scores may be approximately normal (Welch t-test is appropriate) or non-normal (use Mann-Whitney). For ordinal scales (e.g., NYHA class, ECOG score), Mann-Whitney is preferred over a t-test because the intervals between levels are not guaranteed to be equal. Verify whether the EHR records the first available measurement or the mean of multiple visits, and apply the same measurement definition and comparison window to both groups.",
      "registry": "Disease severity scores (e.g., APACHE, Child-Pugh) and fully observed (uncensored) durations are natural candidates for Mann-Whitney when distributions are non-normal; censored time-to-event outcomes need survival methods such as the log-rank test instead. For adjudicated binary endpoints (response, remission), use chi-square or Fisher exact. With registry data, document whether the comparison is unadjusted (for descriptive purposes) or whether propensity weighting was applied before the rank test.",
      "primary": "PRO instruments and bounded scores from clinical trials often have ordinal or non-normal distributions; Mann-Whitney is the standard choice. For pre-registered non-inferiority analyses, document whether exact or asymptotic p-values are used and the choice of one-sided vs two-sided alternative. Small pilot studies (n < 15 per group) should use exact p-values.",
      "linked": "Linked claims-EHR-registry cohorts are typically large; the CLT protects parametric tests and the Mann-Whitney test will reject for small differences. Report the probability index U/(n1*n2) and HL shift estimate as primary effect measures alongside any p-value. For cost endpoints, gamma GLM is the primary method; present Mann-Whitney as a robustness check."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "from scipy import stats\nimport statistics\n\n# ── Motivating dataset: LOS in days, n=4 per group, no ties ──\ngroup_a = [2, 3, 5, 6]   # new antibiotic: shorter stays\ngroup_b = [4, 7, 8, 9]   # standard care: longer stays\n\n# ── 1. Mann-Whitney U test ──\n# alternative='two-sided'  tests H0: P(X>Y) = 0.5 (stochastic dominance), NOT medians\n# method='asymptotic'      uses normal approximation with tie correction (valid at large n or with ties)\n# method='exact'           uses permutation distribution (no ties; small n only)\n# With n=4 per group the normal approximation is crude; it is used here only to show the call.\nu_stat, p_val = stats.mannwhitneyu(group_a, group_b, alternative=\"two-sided\", method=\"asymptotic\")\nn1, n2 = len(group_a), len(group_b)\nprint(f\"U statistic (group A): {u_stat:.1f}\")\nprint(f\"p-value (two-sided):   {p_val:.4f}\")\nprint(\"Null hypothesis: P(X_A > X_B) = 0.5  [stochastic dominance, NOT a test of medians]\")\n\n# ── 2. Probability index = AUC equivalent ──\n# U / (n1 * n2) = empirical P(A > B) = ROC AUC when group = binary label\nprob_index = u_stat / (n1 * n2)\nprint(f\"\\nProbability index (U / n1*n2): {prob_index:.3f}\")\nprint(f\"Interpretation: in {prob_index*100:.1f}% of pairwise comparisons a Group A patient\")\nprint(f\"  had a longer stay than a Group B patient. (AUC = {prob_index:.3f})\")\n\n# ── 3. Hodges-Lehmann location estimate ──\n# Median of all n1*n2 pairwise differences (a - b for a in A, b in B)\npairwise_diffs = sorted(a - b for a in group_a for b in group_b)\nhl_estimate = statistics.median(pairwise_diffs)   # averages the two middle values when the count is even\nprint(f\"\\nHodges-Lehmann shift estimate (A - B): {hl_estimate:.1f} days\")\nprint(\"This is NOT the mean difference nor the median difference.\")\nprint(\"Report as: 'Group A had a pairwise median shift of X days [95% CI via R/SAS]'\")\n\n# ── 4. Manual rank-sum verification ──\nall_vals = sorted(enumerate(group_a + group_b), key=lambda x: x[1])\nranks = {}\ni = 0\nwhile i < len(all_vals):\n    j = i\n    while j < len(all_vals) - 1 and all_vals[j][1] == all_vals[j + 1][1]:\n        j += 1\n    midrank = (i + 1 + j + 1) / 2   # average rank (1-indexed)\n    for k in range(i, j + 1):\n        ranks[all_vals[k][0]] = midrank\n    i = j + 1\nr_a = sum(ranks[idx] for idx in range(n1))            # group A occupies original indices 0..n1-1\nr_b = sum(ranks[idx] for idx in range(n1, n1 + n2))   # group B occupies the remaining indices\nu_a = r_a - n1 * (n1 + 1) / 2\nu_b = r_b - n2 * (n2 + 1) / 2\nprint(f\"\\nManual check: R_A={r_a:.1f}, R_B={r_b:.1f}, total={r_a+r_b:.1f} (expect {(n1+n2)*(n1+n2+1)/2:.1f})\")\nprint(f\"U_A={u_a:.1f}, U_B={u_b:.1f}, U_A+U_B={u_a+u_b:.1f} (expect {n1*n2})\")",
        "description": "Mann-Whitney U test using scipy.stats.mannwhitneyu with the two-sided alternative (the\ndefault), the asymptotic method for tie-robust p-values, and a manual Hodges-Lehmann\nestimate via pairwise differences. Also shows the U-to-AUC (probability index) conversion.\nUses the LOS motivating dataset from the beginner layer. No external dependencies beyond scipy.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "# ── Motivating dataset: LOS in days, n=4 per group ──\ngroup_a <- c(2, 3, 5, 6)   # new antibiotic\ngroup_b <- c(4, 7, 8, 9)   # standard care\n\n# ── 1. Mann-Whitney U test (two-sided; R calls it Wilcoxon rank-sum) ──\n# exact = FALSE: use normal approximation (required when ties present or n is large)\n# exact = TRUE:  exact permutation p-value (small n, no ties only)\nmw_res <- wilcox.test(group_a, group_b, alternative = \"two-sided\", exact = FALSE)\ncat(\"Mann-Whitney / Wilcoxon rank-sum test:\\n\")\nprint(mw_res)\ncat(\"Note: this tests P(A > B) = 0.5 (stochastic dominance), NOT equality of medians.\\n\\n\")\n\n# ── 2. Hodges-Lehmann location estimate + 95% CI ──\n# conf.int = TRUE activates the HL estimate alongside the test\nhl_res <- wilcox.test(group_a, group_b, alternative = \"two-sided\",\n                      exact = FALSE, conf.int = TRUE)\ncat(sprintf(\"Hodges-Lehmann shift estimate (A - B): %.1f days\\n\", hl_res$estimate))\ncat(sprintf(\"95%% CI: [%.1f, %.1f] days\\n\\n\", hl_res$conf.int[1], hl_res$conf.int[2]))\ncat(\"Report: 'Group A had a pairwise median shift of X days [95% CI ...] fewer than Group B'\\n\")\n\n# ── 3. Probability index (AUC equivalent) ──\nn1 <- length(group_a)\nn2 <- length(group_b)\n# W in wilcox.test is the U statistic for group_a vs group_b\nu_a <- mw_res$statistic\nprob_index <- u_a / (n1 * n2)\ncat(sprintf(\"Probability index (U / n1*n2) = %.1f / %d = %.3f\\n\", u_a, n1 * n2, prob_index))\ncat(sprintf(\"Group A exceeds Group B in %.1f%% of pairwise comparisons (AUC = %.3f)\\n\\n\",\n            prob_index * 100, prob_index))\n\n# ── 4. 
Manual rank-sum verification ──\nall_vals <- c(group_a, group_b)\ngroup_label <- c(rep(\"A\", n1), rep(\"B\", n2))\nrks <- rank(all_vals)   # ties.method = \"average\" by default (midrank)\nr_a <- sum(rks[group_label == \"A\"])\nr_b <- sum(rks[group_label == \"B\"])\nu_a_manual <- r_a - n1 * (n1 + 1) / 2\nu_b_manual <- r_b - n2 * (n2 + 1) / 2\ncat(sprintf(\"Manual: R_A=%.1f, R_B=%.1f, total=%.1f (expect %d)\\n\",\n            r_a, r_b, r_a + r_b, (n1 + n2) * (n1 + n2 + 1) / 2))\ncat(sprintf(\"U_A=%.1f, U_B=%.1f, sum=%.1f (expect %d)\\n\",\n            u_a_manual, u_b_manual, u_a_manual + u_b_manual, n1 * n2))",
        "description": "Mann-Whitney U test (wilcox.test) with Hodges-Lehmann estimate and CI via conf.int=TRUE.\nShows the probability index (AUC) derivation and manual rank-sum verification. Uses the\nsame LOS motivating dataset as the Python implementation.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* ── Create motivating dataset: LOS in days, n=4 per group ── */\ndata work.los;\n  input patient_id $ group $ los_days;\n  datalines;\nA1 A 2\nA2 A 3\nA3 A 5\nA4 A 6\nB1 B 4\nB2 B 7\nB3 B 8\nB4 B 9\n;\nrun;\n\n/* ── Mann-Whitney / Wilcoxon rank-sum test with Hodges-Lehmann ──\n   WILCOXON: Wilcoxon rank-sum statistic = Mann-Whitney U (same test)\n   HL:       Hodges-Lehmann location estimate and 95% CI\n   The default p-value uses the normal approximation (appropriate for large n or ties).\n   Add the EXACT WILCOXON statement (see below) for exact permutation p-value at small n.\n── */\nproc npar1way data=work.los wilcoxon hl;\n  class group;       /* two-level classification variable: A vs B */\n  var los_days;\n  /* WILCOXON option produces:\n       - Wilcoxon Two-Sample Test: Z statistic + asymptotic p-value\n       - Kruskal-Wallis One-Way Analysis (identical for two groups)\n     HL option produces:\n       - Hodges-Lehmann point estimate of location shift (A - B)\n       - 95% CI for the shift                                         */\nrun;\n\n/* ── To obtain the exact permutation p-value (small n, no ties) ──\n   Note: can be slow for large datasets; use MAXTIME= to limit runtime */\n/*\nproc npar1way data=work.los wilcoxon hl;\n  class group;\n  var los_days;\n  exact wilcoxon / maxtime=60;\nrun;\n*/\n\n/* ── Probability index (U / n1*n2) from PROC NPAR1WAY output ──\n   The Wilcoxon statistic W reported by SAS equals R_1 (rank sum for the first group\n   alphabetically). Derive U and the probability index manually if needed:\n     U_A = W - n_A * (n_A + 1) / 2\n     Probability index = U_A / (n_A * n_B)\n   For the motivating dataset: W = R_A = 12, n_A = 4, n_B = 4\n     U_A = 12 - 4*5/2 = 12 - 10 = 2\n     Probability index (Group A > Group B) = 2 / 16 = 0.125\n     => Group B has the higher probability index: 14/16 = 0.875            */",
        "description": "Mann-Whitney / Wilcoxon rank-sum test via PROC NPAR1WAY with the WILCOXON option for the\ntest statistic and the HL option for the Hodges-Lehmann location estimate and CI. Uses the\nsame LOS motivating dataset. PROC NPAR1WAY also outputs the exact Wilcoxon p-value when the\nEXACT WILCOXON statement is added (appropriate for small n; omit for large datasets).",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Q[Two independent groups,<br/>continuous or ordinal outcome?] --> Size{Sample size<br/>per group}\n  Size -->|n < 50, non-normal<br/>or ordinal| MW[\"Mann-Whitney U Test<br/>(rank-sum)<br/>H0: P(X>Y) = 0.5\"]\n  Size -->|n ≥ 50 or<br/>approximately normal| T[\"Welch t-test<br/>(parametric default)<br/>tests mean difference\"]\n  Size -->|skewed cost/LOS,<br/>need mean difference| GLM[\"Gamma GLM with log link<br/>(primary for mean inference)\"]\n  MW --> Effect[\"Report effect size:<br/>U/(n1·n2) probability index<br/>+ Hodges-Lehmann shift CI\"]\n  T --> EffectT[\"Report mean difference<br/>with 95% CI\"]\n  GLM --> EffectGLM[\"Report mean ratio or<br/>mean difference with 95% CI\"]\n  MW -.->|paired/matched data?| SR[\"Use Wilcoxon<br/>signed-rank test instead\"]\n  MW -.->|3+ groups?| KW[\"Use Kruskal-Wallis<br/>+ Dunn post-hoc instead\"]",
        "caption": "Decision flow for two-sample nonparametric testing. Mann-Whitney is appropriate for non-normal or ordinal outcomes at small-to-moderate n; GLM is preferred when mean costs are the policy-relevant estimand; the signed-rank test handles paired data.",
        "alt_text": "Flowchart branching from two independent groups on sample size and distribution shape into Mann-Whitney U test, Welch t-test, or gamma GLM, with effect-size reporting shown for each path and reminders to use signed-rank for paired data and Kruskal-Wallis for three or more groups.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "requires",
        "target_slug": "parametric-vs-nonparametric-tests",
        "notes": "The Mann-Whitney U test is one specific nonparametric test within the broader parametric vs nonparametric decision framework; understanding when to choose a rank test over a t-test or GLM is prerequisite to applying Mann-Whitney correctly."
      },
      {
        "relation_type": "see_also",
        "target_slug": "two-sample-t-test",
        "notes": "The two-sample Welch t-test is the parametric counterpart for two independent groups; use t-test when the mean difference is the target estimand and distributional assumptions hold, and Mann-Whitney when the rank ordering is the question or distributional assumptions fail."
      },
      {
        "relation_type": "see_also",
        "target_slug": "wilcoxon-signed-rank-test",
        "notes": "The Wilcoxon signed-rank test is the paired analogue of the rank-sum test; both carry the Wilcoxon name and are routinely confused in the literature — signed-rank is for paired or pre-post data, rank-sum (Mann-Whitney) is for independent groups."
      },
      {
        "relation_type": "see_also",
        "target_slug": "kruskal-wallis-test",
        "notes": "The Kruskal-Wallis H test is the k-group extension of Mann-Whitney; it reduces to Mann-Whitney for k=2 and is the appropriate nonparametric omnibus test when three or more independent groups are being compared."
      },
      {
        "relation_type": "see_also",
        "target_slug": "win-ratio-generalized-pairwise-comparisons-rwe",
        "notes": "The Mann-Whitney U test is a special case of the generalized pairwise comparison family; the win ratio extends rank-based logic to a prioritized hierarchy of multiple outcomes (death > hospitalization > symptom score), making it the natural generalization when composite HEOR endpoints are analyzed."
      },
      {
        "relation_type": "see_also",
        "target_slug": "healthcare-costs-pppm-pppy-pmpm",
        "notes": "LOS and per-patient cost are the canonical skewed RWE outcomes for Mann-Whitney comparisons; however, gamma GLM is preferred over Mann-Whitney for primary mean-cost inference because payers need mean differences for budget-impact models, not the Hodges-Lehmann shift."
      }
    ],
    "aliases": [
      "Wilcoxon rank-sum test",
      "Wilcoxon-Mann-Whitney test",
      "rank-sum test",
      "WMW test",
      "Mann-Whitney",
      "Mann-Whitney U"
    ],
    "complexity": "foundational",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "marginal-effects-and-interpretation-of-inferential-statistics-rwe",
    "name": "Marginal Effects and Interpretation of Inferential Statistics",
    "short_definition": "Methods to compute and report population-averaged (marginal) effects from regression models in real-world data by standardizing model predictions over the observed covariate distribution, distinguishing the conditional model coefficient (OR, HR, IRR) from the collapsible, decision-relevant marginal effect on the risk, rate, or time scale.",
    "long_description": "Almost every RWE analysis ends in a regression model — logistic for binary safety/effectiveness outcomes, Poisson or\nnegative binomial for healthcare-resource-utilization (HCRU) counts, Cox for time-to-event, linear for continuous scores.\nThe number that falls out of `summary(model)` is a **conditional** coefficient: the effect *holding the model's other\ncovariates fixed*, expressed on the model's own (usually non-linear) link scale as an odds ratio, hazard ratio, or\nincidence-rate ratio. Decisions — formulary, label, guideline, value dossier — are made at the population level, on an\nabsolute scale, for a defined target group. The bridge between the two is the **marginal (population-averaged) effect**,\nobtained by standardization (g-computation): predict each person's outcome under treatment and under control using their\n*own* observed covariates, then average the contrast across the population. This is the disciplined way to turn an OR into\na number a payer can act on.\n\n**Core estimand distinction.** A *conditional* effect is defined within strata of the adjustment covariates (the log-OR for\na 70-year-old man with two comorbidities). A *marginal* effect is the average over a specified covariate distribution: the\nAverage Marginal Effect (**AME**) averages the individual-level contrasts over the *observed* population (predict under\nA=1 for everyone, predict under A=0 for everyone, average the difference), while the Marginal Effect at the Mean (**MEM**)\nplugs population means into a single prediction. For *linear* models with no interactions the two coincide and equal the\ncoefficient. For *non-linear* models they diverge because the inverse link is non-linear, and the coefficient itself is\n**non-collapsible**: the marginal OR and HR are not weighted averages of the stratum-specific (conditional) values even\nwith no confounding, so a \"fully adjusted\" OR of 0.70 is not the\nmarginal OR, and certainly is not a 30% absolute risk reduction. 
AME is generally preferred over MEM because the \"mean\"\npatient (e.g., 0.4 of a sex, 1.7 comorbidities) corresponds to no real person and behaves badly under interactions. The\nscale matters too: standardize to a **risk difference / risk ratio** (logistic), an **expected-count or rate difference**\n(Poisson/NB), or a **survival-probability difference or restricted-mean-survival-time (RMST) difference** at a fixed\nhorizon (Cox) — the last two are marginal summaries that remain interpretable when proportional hazards fails, which the HR\ndoes not. Crucially, marginal effects are a *reporting/standardization* step; they inherit causality only from the design\n(no unmeasured confounding, positivity, correct intercurrent-event handling). Standardizing a garbage model produces a\nprecise, collapsible, garbage answer.\n\n**Pros, cons, and trade-offs.**\n- **vs reporting raw model coefficients (OR / HR / IRR):** Marginal effects are collapsible and aggregatable, sit on the\n  scale clinicians and payers reason with, and dissolve the \"odds-ratio looks like a relative risk\" trap. Cost: more\n  computation, you must name the population averaged over, and variance needs the delta method or bootstrap rather than the\n  model's printed SE. **Prefer marginal effects** as the headline for any decision-facing RWE; keep coefficients for model\n  transparency, not as the answer.\n- **vs marginal effects at the mean (MEM) / \"typical patient\" values:** AME respects the real covariate distribution and is\n  correct under interactions and non-linearities; MEM is one cheap prediction at a possibly fictional patient. **Prefer AME**\n  for inference; reserve MEM/representative-value effects for quick communication, always paired with the AME.\n- **vs conditional / subgroup-specific effects:** A single marginal number answers the population question directly but can\n  average away genuine effect modification (by age, line of therapy, biomarker). 
**Report both** the overall marginal effect\n  and pre-specified subgroup marginal effects; do not let one collapsed number hide qualitative interaction.\n- **vs IPTW/standardized contrasts from a propensity model:** Outcome-regression standardization (g-computation) and\n  IPTW target the same marginal estimand by different nuisance models; combining them (doubly robust / AIPW / TMLE) is\n  consistent if *either* model is right. **Prefer the doubly robust marginal effect** when feasible, falling back to plain\n  g-computation only when the outcome model is well understood.\n\n**When to use.** Any RWE study whose deliverable is a decision — comparative safety/effectiveness, HCRU or cost-offset\nestimates, label or HTA evidence — should report the marginal effect on an absolute scale, with an explicit statement of\nthe population standardized to. Use it whenever a non-linear model (logistic, Cox, Poisson with covariates) is fit and the\naudience needs absolute risk, number-needed-to-treat, events avoided, or months gained. Use survival-probability or RMST\ndifferences whenever proportional hazards is doubtful or the time horizon is policy-relevant.\n\n**When NOT to use — and when it is actively misleading or dangerous.** Do not present a marginal effect as causal when the\ndesign cannot support it: standardization over a multivariable model fit to a confounded cohort yields a precise\n*associational* number that *looks* like an ATE and will be read as one — this is the most dangerous failure mode here.\nDo not standardize over a population that includes regions of non-positivity (treatment levels never observed for some\ncovariate patterns); the prediction extrapolates off-support and the \"marginal effect\" is model fiction. Do not report a\nsingle marginal effect in the presence of strong, pre-specified effect modification — it can sit at the null while both\nsubgroups show large opposite effects. 
Do not standardize over the *wrong* population: an AME over the full cohort answers\nan ATE question, not the ATT a payer asked about (then standardize over the treated only). And do not use the model's\nprinted coefficient SE for the marginal effect's confidence interval — it ignores the averaging step and, if a propensity\nscore or weights were estimated, their uncertainty too.\n\n**Data-source operational depth.**\n- **Claims (FFS vs MA vs commercial):** The standardization population must be the actual analytic cohort — continuously\n  enrolled new users, not \"everyone with a claim.\" Medicare *Advantage* enrollees lack complete FFS person-time, so\n  including MA-only time silently changes the population you are averaging over and biases both the outcome model and the\n  marginal estimate; restrict to fully observable enrollment or report the population explicitly. For Cox-based survival\n  standardization, **differential competing risks by exposure** (e.g., one drug used in frailer, higher-mortality patients)\n  mean the cause-specific marginal survival difference and the cumulative-incidence (subdistribution) marginal difference\n  answer different questions — pick the estimand before standardizing. Days-supply artifacts (90-day mail order, sample\n  fills) distort the at-risk denominator feeding a Poisson/NB rate standardization.\n- **EHR:** Richer covariates sharpen the conditional model and thus the predictions, but irregular, visit-driven capture\n  means the prediction step must handle missingness deliberately (multiple imputation, or last-value-with-indicator) — a\n  naive complete-case standardization averages over the *observed-data* population, not the target population. 
Loss to\n  follow-up is potentially informative and must be addressed (IPCW) before a survival-scale marginal effect is credible.\n- **Registry:** Often the cleanest population for the *target* of standardization (adjudicated outcomes, complete severity),\n  so registry covariate distributions are frequently used to *transport* a marginal effect estimated elsewhere; weak\n  pharmacy capture limits exposure-scale work without claims linkage.\n- **Linked claims–EHR–registry:** Best substrate (severity + completeness + mortality), but linkage selection means the\n  standardization population is the *linkable* subset — state this, since the marginal effect is only marginal over those\n  people.\n\n**Worked claims example.** Question: 1-year risk of hospitalized heart failure, second-generation sulfonylurea vs DPP-4\ninhibitor, among commercial + Medicare-FFS adults with type 2 diabetes. (1) Cohort: ≥18 years, ≥2 diabetes diagnoses, and\n365 days of continuous medical+pharmacy (or A/B/D) enrollment before the first qualifying fill; exclude MA-only person-time\nso \"no prior fill\" and follow-up are truly observable. (2) Index/time-zero = first fill of either drug; `arm` from the\ndispensed NDC. (3) Baseline covariates measured only in [index_date−365, index_date] (age, sex, prior insulin, renal dx,\nprior HF, HCRU intensity), feeding a logistic outcome model for the binary 365-day HF event with censoring handled by\nrequiring 365 days of post-index observable time (or moving to a Cox/RMST standardization if censoring is heavy). (4) Fit\nthe outcome model (optionally on a PS-weighted cohort for double robustness). (5) **Standardize:** copy the analytic frame\ntwice, set `arm='STUDY'` in one and `arm='COMPARATOR'` in the other, predict 365-day HF risk for *every* person under both,\naverage each. 
Marginal risk under study = 0.058, under comparator = 0.041 → **marginal risk difference = +0.017**\n(1.7 excess HF hospitalizations per 100 patients per year; NNH ≈ 59), with a bootstrap 95% CI resampling persons (and\nrefitting the PS if doubly robust). Report this alongside the conditional adjusted OR (e.g., 1.46) and state explicitly:\nmarginal over the *treated-eligible new-user cohort*, causal *only* under the design's no-unmeasured-confounding and\npositivity assumptions.\n\n**Interpreting the output**\n\nConsider the logistic-model example with conditional OR = 2.0 applied to a mixed-risk cohort: 50 low-risk patients\n(baseline risk 10%) and 50 high-risk patients (baseline risk 40%). Doubling the odds moves the low-risk stratum from\n0.100 to 0.182 and the high-risk stratum from 0.400 to 0.571. Standardizing predicted probabilities\nacross both copies of the data yields marginal risk under treatment = 0.377 and under control = 0.250,\ngiving an average marginal effect (AME) = +0.127, or 12.7 excess events per 100 patients.\n\n*(1) Formal statistical interpretation.* The AME of +0.127 is the average, across all patients in the\nanalytic sample, of the individual predicted risk difference attributable to treatment. Unlike the OR of 2.0,\nthe AME is on an absolute risk-difference scale and is not subject to non-collapsibility: it can be directly\nadded and subtracted, varies with covariate distribution, and is a valid average even when the patient-level\neffects are heterogeneous. Its uncertainty should be quantified with a bootstrap or delta-method confidence\ninterval. Association or causal language depends on the study design and the assumptions invoked.\n\n*(2) Practical interpretation for a decision-maker.* An OR of 2.0 sounds like \"twice the risk,\" but in a\nmixed-risk population it translates to about 13 extra events per 100 patients treated, not the 25 extra events\nper 100 that doubling the 25% baseline risk would imply. For coverage decisions, formulary placement, or number-needed-to-treat calculations, the AME is the\nright number: it averages over the actual patient mix in the study, making it directly comparable to\nobserved event rates and budget-impact projections.",
    "primary_category": "Inferential_Statistics",
    "tags": [
      "marginal-effects",
      "average-marginal-effect",
      "AME",
      "g-computation",
      "standardization",
      "non-collapsibility",
      "risk-difference",
      "RMST",
      "absolute-vs-relative",
      "interpretation",
      "policy-relevant"
    ],
    "applies_to_study_types": [
      "new_user",
      "active_comparator_new_user",
      "cohort_retrospective",
      "claims_analysis",
      "ehr_study"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "linked",
      "registry"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1001/jama.2019.1954",
        "url": "https://doi.org/10.1001/jama.2019.1954",
        "citation_text": "Norton EC, Dowd BE, Maciejewski ML. Marginal effects—quantifying the effect of changes in risk factors in logistic regression models. JAMA. 2019;321(13):1304-1305.",
        "year": 2019,
        "authors_short": "Norton et al.",
        "notes": "JAMA Guide to Statistics and Methods defining average marginal effects and why coefficients (odds ratios) are not the decision-relevant quantity; the canonical concise introduction for clinical and HEOR audiences."
      },
      {
        "role": "explain",
        "doi": "10.1214/ss/1009211805",
        "url": "https://doi.org/10.1214/ss/1009211805",
        "citation_text": "Greenland S, Robins JM, Pearl J. Confounding and collapsibility in causal inference. Statistical Science. 1999;14(1):29-46.",
        "year": 1999,
        "authors_short": "Greenland, Robins & Pearl",
        "notes": "Foundational treatment of non-collapsibility, explaining why a conditional odds ratio (or hazard ratio) differs from the marginal effect even absent confounding—the core reason marginal standardization is required."
      },
      {
        "role": "explain",
        "doi": "10.1097/EDE.0b013e3181c1ea43",
        "url": "https://doi.org/10.1097/EDE.0b013e3181c1ea43",
        "citation_text": "Hernán MA. The hazards of hazard ratios. Epidemiology. 2010;21(1):13-15.",
        "year": 2010,
        "authors_short": "Hernán",
        "notes": "Why a conditional, non-collapsible, possibly time-varying hazard ratio resists a clean marginal/causal interpretation, motivating survival-probability and RMST differences as marginal summaries."
      },
      {
        "role": "demonstrate",
        "doi": "10.1093/ije/dyu029",
        "url": "https://doi.org/10.1093/ije/dyu029",
        "citation_text": "Muller CJ, MacLehose RF. Estimating predicted probabilities from logistic regression: different methods correspond to different target populations. International Journal of Epidemiology. 2014;43(3):962-970.",
        "year": 2014,
        "authors_short": "Muller & MacLehose",
        "notes": "Worked comparison of marginal standardization vs marginal-effects-at-the-mean vs covariate-fixing, making explicit which target population each method actually estimates—directly operational for choosing the standardization set."
      },
      {
        "role": "use",
        "doi": "10.1093/aje/kwq439",
        "url": "https://doi.org/10.1093/aje/kwq439",
        "citation_text": "Funk MJ, Westreich D, Wiesen C, Stürmer T, Brookhart MA, Davidian M. Doubly robust estimation of causal effects. American Journal of Epidemiology. 2011;173(7):761-767.",
        "year": 2011,
        "authors_short": "Funk et al.",
        "notes": "Augmented/doubly robust estimation of the marginal causal effect, combining outcome-regression standardization with the propensity score so the marginal estimate is consistent if either nuisance model is correct."
      }
    ],
    "plain_language_summary": "When a study compares two treatments, the statistical model produces an odds ratio (OR) — a ratio that lives on a math scale, not the familiar percentage-point scale that clinicians and payers use to make decisions. An average marginal effect (AME) translates that OR into an absolute risk difference: how many more (or fewer) patients out of 100 would experience the outcome if everyone received Drug A instead of Drug B. The translation matters because an OR of 2.0 can mean a 8-percentage-point increase in risk for low-risk patients but a 17-percentage-point increase for high-risk patients — the OR hides that difference, while the AME surfaces it by averaging across the real mix of patients in the study.",
    "key_terms": [
      {
        "term": "odds ratio (OR)",
        "definition": "A number from a logistic regression model comparing the odds of an outcome between two groups — it is NOT the same as a risk ratio and can look bigger or smaller than the actual percentage-point difference in risk."
      },
      {
        "term": "average marginal effect (AME)",
        "definition": "The average difference in predicted probability of the outcome between two treatment groups, calculated by running the model's prediction engine twice (once labeling everyone as treated, once as untreated) and taking the mean difference across all patients."
      },
      {
        "term": "baseline risk",
        "definition": "The probability that a patient experiences the outcome if assigned to the comparator (reference) treatment, before any treatment effect is applied — low-risk and high-risk patients start from different baselines."
      },
      {
        "term": "risk difference",
        "definition": "The simple subtraction: treated group risk minus comparator group risk, expressed in percentage points (e.g., 12.6 additional events per 100 patients)."
      },
      {
        "term": "non-collapsibility",
        "definition": "The mathematical property of odds ratios and hazard ratios that makes them change value when you add more covariates to the model — even with no confounding — which is why they cannot be averaged across subgroups the way a risk difference can."
      },
      {
        "term": "g-computation",
        "definition": "The standardization procedure used to compute an AME: fit the outcome model, copy the dataset twice (set everyone to treatment A, then everyone to treatment B), predict risks both ways for every patient, and average the difference."
      }
    ],
    "worked_example": {
      "scenario": "A claims study enrolls 100 adults with hypertension. The investigators want to know whether Drug A raises the 1-year risk of a hospital visit compared with Drug B. They fit a logistic regression model adjusting for age and comorbidity burden. The model returns an odds ratio of 2.0 for Drug A vs Drug B. But the cohort has two kinds of patients: 50 are low-risk (10% baseline chance of hospitalization on Drug B) and 50 are high-risk (40% baseline chance). The goal is to show that OR = 2.0 means very different things for each group, and that the AME gives a single honest population summary.",
      "dataset": {
        "caption": "Predicted-probability output frame produced after g-computation on 4 representative patients (2 per risk group). Each patient is scored twice: once with arm set to Drug A, once with arm set to Drug B.",
        "columns": [
          "person_id",
          "risk_group",
          "p0_drug_b",
          "p1_drug_a",
          "risk_diff"
        ],
        "rows": [
          [
            1001,
            "low-risk",
            0.1,
            0.182,
            0.082
          ],
          [
            1002,
            "low-risk",
            0.1,
            0.182,
            0.082
          ],
          [
            1003,
            "high-risk",
            0.4,
            0.571,
            0.171
          ],
          [
            1004,
            "high-risk",
            0.4,
            0.571,
            0.171
          ]
        ]
      },
      "steps": [
        "Convert OR=2.0 to a probability for a low-risk patient (p0=0.10). Odds under Drug B = 0.10 / 0.90 = 0.111. Multiply by OR: 0.111 x 2.0 = 0.222. Convert back to probability: 0.222 / (1 + 0.222) = 2/11 = 0.182. Risk difference for this patient = 0.182 - 0.100 = 0.082.",
        "Repeat for a high-risk patient (p0=0.40). Odds under Drug B = 0.40 / 0.60 = 0.667. Multiply by OR: 0.667 x 2.0 = 1.333. Convert back: 1.333 / (1 + 1.333) = 4/7 = 0.571. Risk difference for this patient = 0.571 - 0.400 = 0.171.",
        "The same OR of 2.0 produces an 8.2-percentage-point increase for the low-risk patient but a 17.1-percentage-point increase for the high-risk patient — the OR alone cannot tell you which group you are dealing with.",
        "Compute the AME by averaging across all 100 patients (50 low-risk, 50 high-risk). Average p1 = (50 x 0.182 + 50 x 0.571) / 100 = (9.10 + 28.55) / 100 = 37.65 / 100 = 0.376. Average p0 = (50 x 0.100 + 50 x 0.400) / 100 = (5.00 + 20.00) / 100 = 25.00 / 100 = 0.250.",
        "AME (risk difference) = 0.376 - 0.250 = 0.126."
      ],
      "result": "The average marginal effect is a risk difference of +0.126, meaning Drug A is associated with 12.6 additional hospitalizations per 100 patients per year compared with Drug B, averaged over this cohort's actual mix of low- and high-risk patients. This is more policy-interpretable than OR=2.0 because a payer or clinician can immediately ask 'is 12.6 per 100 acceptable?' — they cannot make that judgment from an odds ratio, which shifts between 8.2 and 17.1 percentage points depending on who is in the room."
    },
    "prerequisites": [
      "logistic-regression-for-binary-outcomes",
      "estimands-ate-att-intercurrent-events-rwe",
      "propensity-score-methods-psm-iptw"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Average marginal effect (AME) via standardization / g-computation",
        "description": "Predict each individual's outcome under treatment and under control using their own observed covariates, then average the per-person contrast across the population. The default for population inference; correct under interactions and non-linearities.",
        "edge_cases": [
          "Requires the conditional outcome model to be reasonably specified; misspecification biases every prediction.",
          "Predictions extrapolate where positivity is violated—average only over covariate regions with support in both arms.",
          "Closed-form SEs do not apply; use the delta method or a person-level bootstrap that refits any estimated PS/weights."
        ],
        "data_source_notes": "claims: straightforward after fitting the outcome model on the analytic cohort; combine with PS weighting for a doubly robust marginal effect and bootstrap the whole pipeline."
      },
      {
        "name": "Marginal effect at the mean / at representative values (MEM)",
        "description": "Set covariates to their means (or to a chosen profile, e.g., age 65, female, mean comorbidity score) and compute a single prediction contrast. Cheap and communicable but answers a different question than AME.",
        "edge_cases": [
          "The mean of a binary or skewed covariate corresponds to no real patient and misbehaves under interactions.",
          "Should be reported only alongside the AME for the actual study population, never as the sole estimate."
        ],
        "data_source_notes": "Useful for a 'typical patient' figure in clinical communication; not a substitute for the population AME."
      },
      {
        "name": "Marginal effects by outcome scale (risk, rate, time)",
        "description": "Standardize to the scale that matches the decision: logistic -> risk difference / risk ratio; Poisson/NB -> expected-count or rate difference; Cox -> survival-probability difference or RMST difference at a fixed horizon.",
        "edge_cases": [
          "For Cox under non-proportional hazards, the marginal HR is ill-defined; report survival-difference or RMST instead.",
          "Competing risks force a choice between cause-specific and subdistribution (cumulative-incidence) marginal contrasts."
        ],
        "data_source_notes": "Choose absolute risk reduction for patients/payers, rate differences for utilization/budget planning, months gained (RMST) for oncology value cases."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Reporting raw model coefficients (OR, HR, IRR)",
        "pros_of_this": "Marginal effects are collapsible, aggregatable across populations, and on the absolute scale clinicians and payers reason with; they avoid the non-collapsibility trap of reading an odds ratio as a risk ratio.",
        "cons_of_this": "More computation, an explicit choice of standardization population, and variance via delta method or bootstrap rather than the model's printed standard error.",
        "when_to_prefer": "As the headline estimate for any decision-facing RWE; keep coefficients for transparency, not as the answer."
      },
      {
        "compared_to": "Marginal effect at the mean (MEM)",
        "pros_of_this": "AME respects the real covariate distribution and is correct under interactions and non-linearities.",
        "cons_of_this": "Slightly more computation than a single mean-profile prediction.",
        "when_to_prefer": "For inference and reporting; reserve MEM for a quick 'typical patient' number paired with the AME."
      },
      {
        "compared_to": "Conditional or subgroup-specific effects",
        "pros_of_this": "A single marginal number answers the population/policy question directly.",
        "cons_of_this": "Averaging can mask pre-specified effect modification, even sitting at the null while subgroups diverge.",
        "when_to_prefer": "Report both—overall marginal effect and key subgroup marginal effects."
      },
      {
        "compared_to": "estimands-ate-att-intercurrent-events-rwe",
        "pros_of_this": "Marginal effects on the appropriate scale are the natural summary measure (attribute 5) for ATE/ATT estimands.",
        "cons_of_this": "A marginal effect is only as good as the identification strategy—no unmeasured confounding, positivity, correct intercurrent-event handling; it does not create causality.",
        "when_to_prefer": "When the estimand's population attribute fixes exactly which population to standardize over (full cohort for ATE, treated for ATT)."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Standardize over the actual analytic cohort (continuously enrolled new users), not 'anyone with a claim'; exclude MA-only person-time so the population averaged over is observable. Bootstrap persons (refitting any estimated PS) for the marginal-effect CI; the model's printed SE is wrong for the standardized contrast.",
      "ehr": "Richer covariates improve predictions, but handle missingness in the prediction step explicitly (multiple imputation or last-value-with-indicator) and address informative loss to follow-up (IPCW) before a survival-scale marginal effect.",
      "registry": "Frequently the target population for transporting a marginal effect (adjudicated outcomes, complete severity); weak pharmacy capture limits exposure-scale standardization without claims linkage.",
      "linked": "Best substrate (severity + completeness + mortality), but the standardization population is the linkable subset—state this, since the effect is marginal only over those people."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\nimport pandas as pd\nimport statsmodels.api as sm\nimport statsmodels.formula.api as smf\n\nCOVARS = [\"age\", \"sex\", \"prior_insulin\", \"renal_dx\", \"prior_hf\", \"hcru_intensity\"]\nFORMULA = \"event ~ arm + \" + \" + \".join(COVARS)  # add arm:covar terms if effect modification is expected\n\ndef standardized_risk_difference(df: pd.DataFrame) -> dict:\n    \"\"\"Marginal (population-averaged) risk under arm=1 and arm=0 via g-computation.\"\"\"\n    model = smf.glm(FORMULA, data=df, family=sm.families.Binomial()).fit()\n    d1, d0 = df.copy(), df.copy()\n    d1[\"arm\"], d0[\"arm\"] = 1, 0                      # set everyone to study, then to comparator\n    r1 = model.predict(d1).mean()                   # marginal risk if all treated with study\n    r0 = model.predict(d0).mean()                   # marginal risk if all treated with comparator\n    return {\"risk_study\": r1, \"risk_comp\": r0, \"risk_difference\": r1 - r0,\n            \"risk_ratio\": r1 / r0, \"nnh\": 1.0 / (r1 - r0) if r1 != r0 else np.inf}\n\ndef bootstrap_ci(df: pd.DataFrame, n_boot: int = 1000, seed: int = 1) -> tuple:\n    rng = np.random.default_rng(seed)\n    ids = df[\"person_id\"].to_numpy()\n    diffs = []\n    for _ in range(n_boot):\n        samp = df.iloc[rng.integers(0, len(df), len(df))]   # resample persons with replacement\n        diffs.append(standardized_risk_difference(samp)[\"risk_difference\"])\n    return tuple(np.percentile(diffs, [2.5, 97.5]))\n\npoint = standardized_risk_difference(df)\nlo, hi = bootstrap_ci(df)\nprint(f\"Marginal RD = {point['risk_difference']:.4f} (95% CI {lo:.4f}, {hi:.4f}); NNH = {point['nnh']:.0f}\")",
        "description": "Average marginal effect (risk difference) by standardization / g-computation from a fitted binary-outcome model.\nRequired input: an analytic, one-row-per-patient frame `df` with\n  person_id, arm (1=study, 0=comparator), event (0/1 over the fixed risk window), and baseline covariate columns\n  measured only in [index_date-365, index_date]. No toy data is created here; fit on your real cohort.\nReturns the marginal risks under each arm, the marginal risk difference, and a person-level bootstrap 95% CI. Combine\nwith PS weighting (statsmodels GLM `freq_weights`/`var_weights`) for a doubly robust marginal effect.",
        "dependencies": [
          "pandas",
          "numpy",
          "statsmodels"
        ],
        "source_citations": [
          "norton-2019",
          "funk-2011"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(survival)\n\nCOVARS <- c(\"age\", \"sex\", \"prior_insulin\", \"renal_dx\", \"prior_hf\", \"hcru_intensity\")\nf_logit <- as.formula(paste(\"event ~ arm +\", paste(COVARS, collapse = \" + \")))\n\n# --- Marginal risk difference (binary outcome) via g-computation ---\nstd_rd <- function(df) {\n  fit <- glm(f_logit, data = df, family = binomial())\n  d1 <- transform(df, arm = factor(\"STUDY\",      levels = levels(df$arm)))\n  d0 <- transform(df, arm = factor(\"COMPARATOR\", levels = levels(df$arm)))\n  r1 <- mean(predict(fit, d1, type = \"response\"))   # standardize to study for all\n  r0 <- mean(predict(fit, d0, type = \"response\"))   # standardize to comparator for all\n  r1 - r0\n}\n\n# --- Marginal RMST difference at a horizon (time-to-event), robust to non-PH ---\nstd_rmst <- function(df, horizon = 365) {\n  fit <- coxph(as.formula(paste(\"Surv(time, event) ~ arm +\",\n                                paste(COVARS, collapse = \" + \"))), data = df)\n  rmst_arm <- function(level) {\n    nd  <- transform(df, arm = factor(level, levels = levels(df$arm)))\n    sf  <- survfit(fit, newdata = nd)                       # one survival curve per person\n    smean <- rowMeans(summary(sf, times = horizon)$surv)    # marginal S(horizon)\n    # trapezoidal RMST up to horizon from the averaged survival curve:\n    tt <- sort(unique(pmin(df$time, horizon)))\n    avg_surv <- rowMeans(summary(sf, times = tt)$surv)\n    sum(diff(c(0, tt)) * head(c(1, avg_surv), length(tt)))\n  }\n  rmst_arm(\"STUDY\") - rmst_arm(\"COMPARATOR\")\n}\n\nrd   <- std_rd(df)\ndrm  <- std_rmst(df, horizon = 365)\n# Person-level bootstrap CI (resample person_id, refit, re-standardize):\nids  <- unique(df$person_id)\nbs   <- replicate(1000, std_rd(df[df$person_id %in% sample(ids, replace = TRUE), ]))\ncat(sprintf(\"Marginal RD = %.4f (95%% CI %.4f, %.4f); RMST diff = %.1f days\\n\",\n            rd, quantile(bs, .025), quantile(bs, .975), drm))",
        "description": "Average marginal effect (risk difference) by standardization for a logistic model, plus the RMST difference for a Cox\nmodel, using base R / survival. Inputs:\n  df  : one row per patient -> person_id, arm (factor 'STUDY'/'COMPARATOR'), event (0/1), time (days), COVARS.\nThe marginaleffects package gives the same AME with delta-method SEs in one call; the explicit g-computation below makes\nthe standardization step auditable for a regulator. Bootstrap persons for the CI.",
        "dependencies": [
          "survival",
          "boot"
        ],
        "source_citations": [
          "norton-2019",
          "muller-2014"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* (A) Doubly robust marginal risk difference & ratio (SAS/STAT 14.2+).\n      PSMODEL = treatment/propensity model; MODEL = outcome model; both from baseline covariates only. */\nproc causaltrt data=work.analytic method=aipw;\n   class arm sex prior_insulin renal_dx prior_hf (ref=first);\n   psmodel arm(ref='COMPARATOR') = age sex prior_insulin renal_dx prior_hf hcru_intensity;\n   model   event(event='1')      = age sex prior_insulin renal_dx prior_hf hcru_intensity;\nrun;  /* ATE table reports the standardized risk difference (RD) and risk ratio with robust SEs */\n\n/* (B) Explicit AME by counterfactual stacking: predict each patient under arm=STUDY and arm=COMPARATOR, average. */\ndata cf;\n   set work.analytic(in=obs);\n   _id = person_id;\n   arm = 'STUDY';      output;   /* duplicate 1: everyone study */\n   arm = 'COMPARATOR'; output;   /* duplicate 2: everyone comparator */\nrun;\nproc logistic data=work.analytic noprint;\n   class arm sex prior_insulin renal_dx prior_hf / param=ref;\n   model event(event='1') = arm age sex prior_insulin renal_dx prior_hf hcru_intensity;\n   score data=cf out=phat(keep=_id arm P_1);     /* P_1 = predicted event probability */\nrun;\nproc sql;\n   create table ame as\n   select mean(case when arm='STUDY'      then P_1 end) as risk_study,\n          mean(case when arm='COMPARATOR' then P_1 end) as risk_comp,\n          calculated risk_study - calculated risk_comp  as marginal_rd\n   from phat;\nquit;  /* bootstrap persons (PROC SURVEYSELECT method=urs) and repeat for the 95% CI */\n\n/* (C) Marginal survival / RMST contrast under possible non-proportional hazards. 
*/\nproc phreg data=work.analytic;\n   class arm (ref='COMPARATOR');\n   model time*event(0) = arm age sex prior_insulin renal_dx prior_hf hcru_intensity / ties=efron;\n   baseline out=marg covariates=work.analytic survival=s / diradj group=arm;  /* directly adjusted (marginal) curves */\nrun;  /* diradj averages each arm's predicted curve over the covariate distribution in work.analytic; integrate s in work.marg from 0 to 365 per arm for the RMST contrast (PROC RMSTREG is an alternative for direct RMST regression) */",
        "description": "Marginal effects in SAS three ways on a prepared analytic dataset work.analytic (one row per patient):\n  person_id, arm ('STUDY'/'COMPARATOR'), event (0/1), time (days), and baseline covariates.\n(A) PROC CAUSALTRT gives the standardized (g-computation / doubly robust) risk difference and ratio directly.\n(B) PROC LOGISTIC + the counterfactual-stacking trick reproduces the AME by hand for auditability.\n(C) PROC PHREG with BASELINE produces marginal survival curves for an RMST/survival-difference contrast under non-PH.",
        "dependencies": [],
        "source_citations": [
          "norton-2019",
          "funk-2011"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  M[Fitted conditional model<br/>logistic / Poisson / Cox] --> C[Conditional coefficient<br/>OR / IRR / HR]\n  M --> D1[Copy cohort, set arm = STUDY<br/>predict for every observed covariate pattern]\n  M --> D0[Copy cohort, set arm = COMPARATOR<br/>predict for every observed covariate pattern]\n  D1 --> A1[Average prediction = marginal risk if all treated]\n  D0 --> A0[Average prediction = marginal risk if all comparator]\n  A1 --> RD[Marginal effect<br/>risk diff / rate diff / RMST diff]\n  A0 --> RD\n  C -. non-collapsible, NOT equal to .-> RD\n  RD --> CI[CI by delta method or person bootstrap]",
        "caption": "Standardization (g-computation) turns a conditional, non-collapsible coefficient into a population-averaged marginal effect on a decision-relevant absolute scale; the conditional OR/HR is not a weighted average of the marginal contrast.",
        "alt_text": "Flowchart from a fitted model to a conditional coefficient and, via prediction under each arm for all covariate patterns and averaging, to a marginal effect with a bootstrap or delta-method confidence interval.",
        "source_type": "illustrative",
        "source_citations": [
          "norton-2019"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Q[Decision question] --> P{Which population<br/>to standardize over?}\n  P -->|All eligible new users| ATE[Standardize over full cohort -> ATE]\n  P -->|Treated only| ATT[Standardize over treated -> ATT]\n  ATE --> S{Outcome type / scale?}\n  ATT --> S\n  S -->|Binary| RD[Risk difference / risk ratio]\n  S -->|Count or rate| RR[Expected-count / rate difference]\n  S -->|Time-to-event, PH holds| SD[Survival-probability difference]\n  S -->|Time-to-event, non-PH| RMST[RMST difference at horizon]\n  RD --> CHK[Check positivity + pre-specified subgroups + report scale]\n  RR --> CHK\n  SD --> CHK\n  RMST --> CHK",
        "caption": "Decision logic for marginal-effect reporting in RWE - first fix the standardization population (ATE vs ATT), then the scale that matches the decision, then verify positivity and pre-specified effect modification before reporting.",
        "alt_text": "Decision tree choosing the standardization population then the outcome scale, ending in a positivity and subgroup check before reporting the marginal effect.",
        "source_type": "illustrative",
        "source_citations": [
          "muller-2014"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "see_also",
        "target_slug": "logistic-regression-for-binary-outcomes",
        "notes": "Marginal risk differences via standardization are the recommended way to interpret and report logistic models in RWE, in addition to (not instead of) the odds ratio."
      },
      {
        "relation_type": "see_also",
        "target_slug": "poisson-negative-binomial-count-models",
        "notes": "For HCRU counts, average marginal effects on the expected-count or rate scale are far more interpretable for utilization and budget planning than incidence-rate ratios alone."
      },
      {
        "relation_type": "see_also",
        "target_slug": "cox-ph-regression",
        "notes": "Hazard ratios are non-collapsible and hard to interpret marginally under non-PH; survival-probability differences and RMST differences are marginal-effect summaries that remain meaningful."
      },
      {
        "relation_type": "see_also",
        "target_slug": "restricted-mean-survival-time-rmst",
        "notes": "RMST difference at a fixed horizon is the marginal, scale-interpretable survival summary recommended when proportional hazards is doubtful."
      },
      {
        "relation_type": "part_of",
        "target_slug": "estimands-ate-att-intercurrent-events-rwe",
        "notes": "Marginal effects on the appropriate scale are the summary measure (attribute 5) for ATE/ATT estimands; the standardization population must match the estimand's population attribute."
      },
      {
        "relation_type": "used_with",
        "target_slug": "propensity-score-methods-psm-iptw",
        "notes": "IPTW and outcome-regression standardization target the same marginal estimand; combining them yields a doubly robust marginal effect (AIPW/TMLE)."
      },
      {
        "relation_type": "used_with",
        "target_slug": "direct-standardization-rwe",
        "notes": "Marginal effects are model-based standardization; direct standardization over an external population is the descriptive analogue used to transport the marginal effect."
      },
      {
        "relation_type": "see_also",
        "target_slug": "predictive-and-causal-ml-models-rwe",
        "notes": "TMLE and double machine learning estimate marginal effects with flexible nuisance models, relaxing the parametric assumptions of plain g-computation."
      },
      {
        "relation_type": "see_also",
        "target_slug": "causal-mediation-effect-modification-rwe",
        "notes": "A single marginal effect can mask effect modification; report pre-specified subgroup marginal effects to avoid a collapsed null hiding opposing subgroup effects."
      }
    ],
    "aliases": [
      "average marginal effects",
      "AME",
      "marginal effects at the mean",
      "g-computation standardization of regression effects",
      "absolute vs relative effects in RWE",
      "interpreting OR HR IRR in RWE",
      "non-collapsibility"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "marginal-structural-models-g-methods",
    "name": "Marginal Structural Models and G-Methods",
    "short_definition": "A family of causal methods for time-varying treatments (marginal structural models fit by inverse-probability-of-treatment weighting, the parametric g-formula, and g-estimation of structural nested models) that yields consistent effect estimates when time-varying confounders are themselves affected by prior treatment — the treatment-confounder feedback that biases standard time-dependent regression.",
    "long_description": "**Marginal structural models (MSMs) and the broader g-methods** (the parametric g-formula / g-computation, IPTW-estimated\nMSMs, and g-estimation of structural nested models) were developed by James Robins and colleagues to solve a problem that\nordinary regression cannot: estimating the effect of a *sustained* or *time-varying* treatment when the confounders that\ngovern later treatment are themselves consequences of earlier treatment. This is **treatment-confounder feedback**. The\ncanonical example is sustained statin use and myocardial infarction: LDL cholesterol confounds the statin-MI relationship\n(high LDL prompts treatment and raises risk), but LDL after baseline is *lowered by the statin* and then drives the decision\nto continue or stop. LDL is simultaneously a confounder for later treatment and a mediator of earlier treatment. Adjusting\nfor it in a standard time-dependent Cox model blocks part of the true effect (it is on the causal path) **and** introduces\ncollider/selection bias; *not* adjusting for it leaves confounding. There is no regression specification that escapes both\nhorns — g-methods are the resolution.\n\n**Core conceptual distinction and estimand.** All three g-methods target a contrast of counterfactual outcomes under\n*interventions on a treatment strategy* — e.g., \"always treat\" vs \"never treat,\" or a dynamic rule such as \"treat once LDL\nexceeds a threshold.\" This is fundamentally different from the conditional hazard ratio a time-dependent Cox model reports.\n(1) **IPTW-estimated MSMs** reweight each person-time record by the inverse probability of the treatment actually received,\ngiven covariate and treatment history, creating a *pseudo-population* in which treatment is unconfounded; the MSM (typically\na weighted pooled logistic or weighted Cox) is then fit *without* conditioning on the time-varying confounders, so it\nrecovers a marginal (population-averaged) effect of the strategy. 
(2) The **parametric g-formula** models the joint\nevolution of confounders and outcome under the natural course, then simulates the outcome distribution under each\nhypothetical strategy by integrating over the time-varying covariate history — uniquely able to handle dynamic regimes and\nto report absolute risks and risk differences, not just ratios. (3) **g-estimation** of structural nested models directly\nestimates \"blip\" effects (the effect of one more unit of treatment at each time, possibly modified by covariate history)\nvia estimating equations and is the natural tool when effect modification by time-varying covariates is the question. All\nrest on **sequential exchangeability** (no unmeasured confounding at each decision point given the measured past),\n**positivity** (every treatment level has nonzero probability at every level of history), and **consistency / correct model\nspecification**. Always pre-specify the estimand — an intention-to-treat-like \"initiate strategy\" contrast vs an\nas-treated/per-protocol \"initiate and remain on strategy\" contrast (the latter requiring censoring at deviation and\ninverse-probability-of-censoring weights) — because the choice changes both the model and the interpretation.\n\n**Pros, cons, and trade-offs.**\n- **vs standard time-dependent Cox / pooled logistic with covariate adjustment:** g-methods are the *only* family that is\n  unbiased under treatment-confounder feedback and that can express sustained/dynamic strategy effects. A time-dependent Cox\n  model that conditions on the post-baseline confounder is biased (over-adjustment for a mediator plus collider-stratification\n  bias); one that omits it is confounded. Cost: g-methods demand a correctly specified treatment model (IPTW) or\n  confounder+outcome models (g-formula), strict positivity at every time, and considerably more analytic and computational\n  effort. 
**Prefer g-methods** whenever a post-baseline variable both predicts subsequent treatment and is affected by prior\n  treatment; **otherwise a standard adjusted model is simpler and adequate.**\n- **IPTW-MSM vs parametric g-formula:** MSMs are simpler to communicate and need only a treatment (and censoring) model, but\n  are inefficient and unstable when weights are extreme, and they handle dynamic regimes awkwardly. The g-formula is far more\n  efficient, naturally accommodates dynamic strategies and absolute-risk contrasts, and degrades gracefully under\n  near-positivity violations, but it requires modeling the entire confounder process and is vulnerable to the **g-null\n  paradox** (under the sharp null with feedback, a parametric g-formula can be guaranteed *misspecified*). Doubly-robust\n  hybrids (TMLE, LTMLE; see predictive-and-causal-ml-models-rwe) combine a treatment and an outcome model so that\n  consistency holds if *either* is correct, and permit machine-learning nuisance estimation.\n- **vs g-estimation of structural nested models:** g-estimation is most robust for effect modification and avoids the g-null\n  paradox, but software support is thin and the blip-function estimand is harder to communicate to clinical and HTA\n  audiences. **Prefer IPTW-MSM or g-formula** for routine RWE; reserve g-estimation for explicit effect-modification\n  questions.\n\n**When to use.** Long-term or repeatedly-decided drug exposures where intermediate clinical variables (labs, vitals,\nadherence, disease activity, organ function) both predict future treatment and are altered by past treatment — statins and\nLDL, antiretrovirals and CD4/viral load, anticoagulants and renal function, immunosuppressants and disease activity. 
Also\nthe analytic engine of per-protocol target-trial emulation for sustained strategies (used_with target-trial-emulation and\nclone-censor-weight-per-protocol), and the right tool for \"when to start / when to switch\" dynamic-strategy questions.\n\n**When NOT to use — and when it is actively misleading or dangerous.**\n- **Point-exposure / single-decision questions.** If treatment is decided once at baseline and never revisited, there is no\n  feedback; a new-user active-comparator design with a baseline propensity score (active-comparator-new-user,\n  propensity-score-methods-psm-iptw) is simpler and more transparent. Reaching for an MSM here adds variance and a positivity\n  burden for no benefit.\n- **Structural positivity violations.** If some patients can *never* receive a treatment level given their history (e.g., a\n  contraindication that appears mid-follow-up and permanently rules out the drug), the \"always treat\" arm is undefined for\n  them; IPT weights explode and the MSM estimates an artefact. Diagnose with the weight distribution and bounds on predicted\n  treatment probabilities **before** trusting any number. Truncation hides, but does not fix, structural violations.\n- **Sparse or extreme weights treated as if benign.** Stabilized weights with a mean far from 1.0, a long right tail, or\n  extreme maxima signal near-positivity violations or a misspecified treatment model; reporting the MSM without the weight\n  diagnostics is the single most common way these analyses mislead.\n- **The g-null paradox setting.** Under a true sharp null with feedback, a naively specified parametric g-formula is\n  guaranteed biased; if the null is plausible, prefer g-estimation or a doubly-robust estimator.\n- **Unmeasured time-varying confounding.** Sequential exchangeability is *stronger* than baseline exchangeability — it must\n  hold at every decision point. 
Claims data rarely capture the labs/vitals that drive titration and switching, so an MSM\n  built on claims alone can be confidently precise and wrong. Quantify with E-values or negative controls and consider\n  linkage to EHR labs.\n\n**Data-source operational depth.**\n- **Claims (FFS or commercial):** Excellent for constructing the long-format person-time skeleton — pharmacy fills\n  (`fill_date`, `days_supply`, NDC) define the time-varying exposure, and diagnoses/procedures define time-varying\n  confounders and outcomes. Failure modes: (i) **Medicare Advantage / capitated person-time lacks fee-for-service claims**,\n  so a covariate or exposure that \"turns off\" may be unobserved rather than absent — restrict to enrollees with full\n  A/B/D (or commercial medical+pharmacy) benefit and drop MA-only spans, or the treatment model is fit on phantom data. (ii)\n  The lab/vital values that actually drive titration (LDL, eGFR, HbA1c, INR) are **not in claims**, so the most important\n  time-varying confounders are missing — sequential exchangeability is then implausible without EHR linkage. (iii)\n  **Differential competing risks by exposure in elderly claims**: death competes with the outcome and may differ by arm;\n  handle the competing event explicitly (cause-specific vs subdistribution) rather than censoring on it. (iv) **Immortal\n  time in procedure/initiation studies**: aligning the time grid so that exposure status at interval *t* uses only\n  information available at the *start* of *t* prevents the look-ahead that manufactures immortal time.\n- **EHR:** Supplies the time-varying labs and vitals that claims lack, sharpening the treatment model and making sequential\n  exchangeability defensible. 
Costs: irregular, visit-driven measurement (a covariate is observed only when the patient\n  shows up), informative missingness, and incomplete capture of fills obtained out of network — last-observation-carried-forward\n  or explicit imputation of the confounder process is usually required, and loss to follow-up must be treated as\n  potentially informative (inverse-probability-of-censoring weights).\n- **Registry:** Often the cleanest structured longitudinal substrate (scheduled visits, adjudicated outcomes, disease\n  severity) and ideal for validating a claims-based MSM; typically weak on complete pharmacy exposure, so link to claims for\n  the full fill history and to a death index for the competing event.\n- **Linked claims–EHR–vital records:** The ideal substrate — EHR labs for the confounder process, claims completeness for\n  exposure, reliable mortality for the competing event — at the price of linkage selection and reconciling order/fill/service\n  dates before laying down the time grid.\n\n**Worked claims-style example (sustained statin strategy and MI, with feedback).** Question: among new statin initiators\nwith type 2 diabetes, does \"initiate and remain on a high-intensity statin\" vs \"initiate and remain on a moderate-intensity\nstatin\" change 3-year MI risk? Feedback is intrinsic: the statin lowers LDL, and the observed LDL drop then drives whether\nthe clinician up-titrates, holds, or stops — so LDL at month *t* is a confounder for treatment at *t* and a mediator of\ntreatment before *t*. (1) **Time grid:** one row per person-month from time zero (first qualifying fill) to first MI, death,\ndisenrollment, or 36 months. (2) **Exposure at *t*:** on high- vs moderate-intensity, derived from `days_supply` coverage in\nmonth *t* with a 30-day grace period for stockpiling; deviation from the assigned strategy triggers as-treated censoring. 
(3)\n**Baseline confounders** (measured in the 365-day, FFS-observable lookback): age, sex, prior CVD, baseline LDL proxy,\ncomorbidity score, baseline utilization. (4) **Time-varying confounders** (updated monthly from EHR-linked labs where\navailable): LDL, eGFR, statin-intolerance flags, hospitalizations, new antidiabetic/antihypertensive starts. (5)\n**Treatment model:** monthly pooled logistic of \"remains on high-intensity at *t*\" on baseline + time-varying history; the\n**stabilized weight** for each record is the cumulative product over *t* of [probability of the *observed* treatment\ngiven *baseline-only* history] / [probability given *full* history], multiplied by a parallel stabilized\ninverse-probability-of-censoring weight for the as-treated estimand. (6) **MSM:** weighted pooled logistic (or weighted Cox)\nof MI on the strategy indicator and follow-up time, *not* conditioning on the time-varying confounders; convert to a 36-month\nrisk difference and ratio. (7) **Diagnostics and sensitivity:** report the stabilized-weight mean (should sit near 1.0), SD,\nand maximum; truncate at the 1st/99th percentiles and re-fit; vary the grace period; add a negative-control outcome; and,\nbecause death competes with MI in this older cohort, contrast cause-specific and subdistribution handling of the competing\nevent. Build the same analysis with the parametric g-formula (gfoRmula) as a cross-check that does not rely on stable weights.\n\n**Interpreting the output**\n\nConsider the ART worked example (sustained ART and opportunistic infection, with CD4 feedback). 
A marginal structural Cox model fit with stabilized IPTW reports HR = 0.41 (95%\nCI 0.22–0.76) for 12-month opportunistic infection comparing sustained ART versus no ART.\n\nFormal interpretation: The MSM estimate of HR 0.41 is the marginal effect of sustained antiretroviral therapy on\nthe instantaneous rate of opportunistic infection in the IPTW pseudo-population — a population where the\nstatistical association between monthly CD4 count and ART prescribing has been eliminated by reweighting. This\ntargets the population-averaged (marginal) treatment effect under a sustained-exposure intervention, not a\nconditional effect within any CD4 stratum. The result is a valid causal estimate only if the treatment (weight) model\nis correctly specified and three assumptions hold:\n(1) sequential exchangeability — at each time point, no unmeasured common cause of ART receipt and subsequent\ninfection remains after conditioning on the covariates in the treatment model; (2) positivity — every patient with\nevery covariate history had a non-zero probability of receiving or not receiving ART at each interval; and\n(3) consistency — the ART strategies compared are well defined and replicable. Extreme stabilized weights (a long\nright tail or very large maxima) signal near-positivity violations or a misspecified treatment model; truncate them\n(for example at the 1st/99th percentiles) and report the untruncated fit as a sensitivity analysis.\n\nPractical interpretation: After breaking the feedback loop between CD4 count and ART prescribing, patients who\nsustained ART experienced opportunistic infections at about 41% of the rate seen in patients who received no ART\n(a 59% lower hazard).\nThis estimate is not obtainable from a standard time-varying Cox model, which would either over-adjust for CD4\n(blocking the beneficial causal pathway) or leave CD4 as an uncontrolled confounder.",
    "primary_category": "Causal_Inference_Method",
    "tags": [
      "time-varying",
      "marginal-structural-models",
      "g-methods",
      "g-formula",
      "iptw-time-varying",
      "treatment-confounder-feedback",
      "longitudinal",
      "causal"
    ],
    "applies_to_study_types": [
      "new_user",
      "active_comparator_new_user",
      "cohort_retrospective",
      "cohort_prospective",
      "claims_analysis",
      "ehr_study",
      "pragmatic_trial"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "linked",
      "registry"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1097/00001648-200009000-00011",
        "url": "https://doi.org/10.1097/00001648-200009000-00011",
        "citation_text": "Robins JM, Hernán MA, Brumback B. Marginal structural models and causal inference in epidemiology. Epidemiology. 2000;11(5):550-560.",
        "year": 2000,
        "authors_short": "Robins, Hernán, Brumback",
        "notes": "Seminal paper introducing marginal structural models as a new class of causal models for time-varying exposures with treatment-confounder feedback, estimated by inverse-probability-of-treatment weighting."
      },
      {
        "role": "explain",
        "doi": "10.1093/ije/dyw323",
        "url": "https://doi.org/10.1093/ije/dyw323",
        "citation_text": "Naimi AI, Cole SR, Kennedy EH. An introduction to g methods. International Journal of Epidemiology. 2017;46(2):756-762.",
        "year": 2017,
        "authors_short": "Naimi, Cole, Kennedy",
        "notes": "Accessible modern tutorial unifying the g-methods family (g-formula, IPTW-MSMs, g-estimation) and showing precisely why standard adjustment fails under feedback."
      },
      {
        "role": "demonstrate",
        "doi": "10.1016/j.patter.2020.100008",
        "url": "https://doi.org/10.1016/j.patter.2020.100008",
        "citation_text": "McGrath S, Lin V, Zhang Z, Petito LC, Logan RW, Hernán MA, Young JG. gfoRmula: an R package for estimating the effects of sustained treatment strategies via the parametric g-formula. Patterns. 2020;1(3):100008.",
        "year": 2020,
        "authors_short": "McGrath et al.",
        "notes": "Working open-source implementation of the parametric g-formula for sustained and dynamic treatment strategies; the practical reference for g-formula RWE analyses and a complement to weight-based MSMs."
      },
      {
        "role": "use",
        "doi": "10.1093/aje/kwv254",
        "url": "https://doi.org/10.1093/aje/kwv254",
        "citation_text": "Hernán MA, Robins JM. Using big data to emulate a target trial when a randomized trial is not available. American Journal of Epidemiology. 2016;183(8):758-764.",
        "year": 2016,
        "authors_short": "Hernán, Robins",
        "notes": "Embeds g-methods and MSMs in the target-trial-emulation framework for sustained-strategy questions in observational data; the dominant template for defensible RWE causal analyses."
      },
      {
        "role": "use",
        "doi": "10.2202/1557-4679.1212",
        "url": "https://doi.org/10.2202/1557-4679.1212",
        "citation_text": "Cain LE, Robins JM, Lanoy E, Logan R, Costagliola D, Hernán MA. When to start treatment? A systematic approach to the comparison of dynamic regimes using observational data. The International Journal of Biostatistics. 2010;6(2):Article 18.",
        "year": 2010,
        "authors_short": "Cain et al.",
        "notes": "Landmark applied use of the parametric g-formula and dynamic-regime MSMs to answer a when-to-start question (HIV treatment by CD4 threshold) that standard methods cannot address."
      }
    ],
    "plain_language_summary": "A marginal structural model (MSM) is a statistical method for estimating how a treatment taken over time affects an outcome, when the very lab values and health measures that predict whether someone keeps taking the treatment are also changed by that treatment. Standard regression gets this wrong because adjusting for those measures blocks part of the treatment effect and introduces a new bias at the same time. An MSM solves the problem by creating a reweighted copy of the study population where treatment choices look as if they were made by a coin flip rather than by clinical judgment, and then estimating the effect in that cleaner copy.",
    "key_terms": [
      {
        "term": "time-varying confounding",
        "definition": "A situation where a factor that influences both future treatment decisions and the outcome (a confounder) is itself changed by past treatment, making it impossible for standard regression to handle it correctly."
      },
      {
        "term": "inverse-probability-of-treatment weighting",
        "definition": "A technique that gives each person-month record a numeric weight equal to the inverse of how likely that person was to receive the treatment they actually received, given their history; people who received an unexpected treatment get a high weight so their experience counts more, creating a pseudo-population where treatment looks random."
      },
      {
        "term": "marginal structural model",
        "definition": "An outcome model fit on the reweighted pseudo-population that estimates the population-averaged effect of always following one treatment strategy versus another, without conditioning on the time-varying confounders."
      },
      {
        "term": "pseudo-population",
        "definition": "The reweighted version of the study population created by IPTW, where the statistical association between a patient's health status and their treatment decision has been removed, so the treatment is effectively unconfounded."
      },
      {
        "term": "treatment-confounder feedback",
        "definition": "The cycle where past treatment changes an intermediate health measure, and that changed measure then influences future treatment decisions; this cycle is what makes standard regression fail and g-methods necessary."
      },
      {
        "term": "sequential exchangeability",
        "definition": "The assumption that, at every moment in follow-up, we have measured everything that predicts both the upcoming treatment decision and the outcome; it is the time-varying equivalent of the no-unmeasured-confounding assumption."
      }
    ],
    "worked_example": {
      "scenario": "A researcher wants to know whether staying on antiretroviral therapy (ART) for 12 months reduces the risk of opportunistic infection in HIV-positive patients. She has monthly records for 3 patients, tracking whether they took ART that month and their CD4 count at the start of each month. CD4 count is the core problem: low CD4 predicts both a higher chance of infection (outcome) and a higher chance the clinician prescribes or continues ART (treatment), making it a confounder. But CD4 also rises when a patient takes ART, meaning prior treatment changes the very confounder that predicts future treatment. This is treatment-confounder feedback. Standard logistic regression that adjusts for CD4 at each month will block part of the beneficial treatment effect (because rising CD4 is on the causal path from ART to lower infection risk) and simultaneously open a collider bias. Not adjusting leaves CD4 confounding. An MSM with IPTW escapes both horns.",
      "dataset": {
        "caption": "Monthly person-time records (3 patients, 2 months each). CD4 measured at start of month. ART = 1 if on therapy that month.",
        "columns": [
          "person_id",
          "month",
          "cd4_cells_per_ul",
          "art",
          "infection_by_month_end"
        ],
        "rows": [
          [
            1001,
            1,
            180,
            1,
            0
          ],
          [
            1001,
            2,
            310,
            1,
            0
          ],
          [
            1002,
            1,
            220,
            0,
            0
          ],
          [
            1002,
            2,
            190,
            1,
            1
          ],
          [
            1003,
            1,
            150,
            1,
            0
          ],
          [
            1003,
            2,
            280,
            0,
            0
          ]
        ]
      },
      "steps": [
        "Step 1 — See the feedback problem: Patient 1001 starts with CD4 = 180 (low, driving treatment) and their CD4 rises to 310 after one month on ART. Now CD4 at month 2 is both a consequence of month-1 treatment AND a predictor of month-2 treatment. That is treatment-confounder feedback in one patient.",
        "Step 2 — Understand why standard adjustment fails: If we include CD4 in a standard month-by-month logistic regression, we partially block the ART benefit (because higher CD4 caused by ART is on the pathway to lower infection) and open a collider path through unmeasured factors. If we leave it out, CD4 confounds the ART-infection relationship. There is no safe regression specification.",
        "Step 3 — Build the treatment model (denominator): For each person-month, fit a logistic model predicting the probability of receiving ART given that person's full history (baseline characteristics plus their current and prior CD4 values). This tells us how likely each treatment decision was given everything observed. Person 1002 at month 1 had CD4 = 220 and did NOT take ART; suppose the model gives P(ART=1 | CD4=220, history) = 0.60, so the probability of the observed treatment (no ART) is 1 - 0.60 = 0.40.",
        "Step 4 — Build the stabilized weight: Divide a simpler probability (from a model using only baseline covariates, not the time-varying CD4) by the full-history probability from Step 3. For the example record above, suppose the baseline-only model gives P(ART=1 | baseline) = 0.50, so P(no ART | baseline) = 0.50. The weight for that record is 0.50 / 0.40 = 1.25. A person who made an unlikely treatment choice given their CD4 gets a weight above 1 so their experience carries more influence in the pseudo-population.",
        "Step 5 — Accumulate weights over time: For each person, multiply the monthly weights together across all their follow-up months. This cumulative product is the stabilized IPTW for that person-month. A well-behaved set of weights has a mean close to 1.0 and no extreme values; extreme weights signal a positivity problem (someone almost certain to get one treatment level) or a misspecified treatment model.",
        "Step 6 — Fit the MSM on the pseudo-population: Run the outcome model (ART predicting infection) using the cumulative IPTW as weights, and do NOT include CD4 in this outcome model. Because the weights have already removed the statistical link between CD4 and treatment choice, the pseudo-population looks as if treatment was assigned without regard to CD4 — the feedback loop is broken. The coefficient on ART in this weighted model is the marginal effect of always taking ART versus not taking it, free from the feedback bias."
      ],
      "result": "In the pseudo-population created by IPTW, treatment choices are no longer driven by CD4. An MSM fit on these reweighted records estimates the population-averaged effect of sustained ART on infection risk without the over-adjustment and collider bias that would corrupt a standard CD4-adjusted model. The key diagnostic to report is the weight distribution: mean near 1.0 with a short right tail confirms the pseudo-population is well-behaved; a mean far from 1.0 or extreme maximum values signals that the approach may be unreliable for this dataset."
    },
    "prerequisites": [
      "propensity-score-methods-psm-iptw",
      "dags-backdoor-criterion-drug-studies",
      "cox-ph-regression"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "IPTW-estimated marginal structural model (stabilized weights)",
        "description": "Fit a model for the probability of the observed treatment at each time given baseline and time-varying history, form stabilized inverse-probability-of-treatment weights (and a parallel inverse-probability-of-censoring weight for an as-treated estimand), then fit the MSM (weighted pooled logistic or weighted Cox) on the strategy indicator without conditioning on the time-varying confounders. Recovers a marginal (population-averaged) effect.",
        "edge_cases": [
          "Near-zero or near-one predicted treatment probabilities (positivity violations) produce extreme weights; stabilization reduces but does not eliminate the problem — truncation trades bias for variance and must be reported.",
          "Stabilized-weight mean should be approximately 1.0; departures flag treatment-model misspecification or positivity violations.",
          "Time-varying confounders affected by prior treatment must be modeled in the weight (denominator) but excluded from the outcome MSM."
        ],
        "data_source_notes": "claims/EHR: requires long-format person-time (monthly or visit-based). The labs/vitals that drive titration are usually EHR-only; a claims-only treatment model often violates sequential exchangeability.",
        "citations": [
          "robins-2000-msm",
          "naimi-2017-g-methods"
        ]
      },
      {
        "name": "Parametric g-formula / g-computation",
        "description": "Model the joint evolution of time-varying confounders and the outcome under the natural course, then Monte-Carlo simulate the outcome distribution under each hypothetical (static or dynamic) treatment strategy and average. Yields absolute risks and risk differences and naturally handles dynamic regimes.",
        "edge_cases": [
          "Subject to the g-null paradox under a sharp null with feedback (a parametric g-formula can be guaranteed misspecified).",
          "Requires correct models for every time-varying confounder, not just treatment; computationally intensive for many time points or complex regimes."
        ],
        "data_source_notes": "Best when rich longitudinal confounder data exist (EHR/registry/linked); supports rules like 'treat once LDL > threshold'. Implemented in R via gfoRmula.",
        "citations": [
          "mcgrath-2020-gformula",
          "cain-2010-when-to-start"
        ]
      },
      {
        "name": "g-estimation of structural nested models",
        "description": "Directly estimate blip functions (the effect of treatment at each time, possibly modified by covariate history) by solving estimating equations that remove the effect of subsequent treatment. Robust to the g-null paradox and natural for effect modification.",
        "edge_cases": [
          "Thin software support and a blip-function estimand that is harder to communicate to clinical and HTA audiences.",
          "Most valuable when effect modification by time-varying covariates is the explicit question."
        ],
        "data_source_notes": "Less commonly implemented than IPTW-MSM or g-formula but powerful for dose-response and effect-modification questions in pharmacoepidemiology."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "standard time-dependent Cox / pooled-logistic regression",
        "pros_of_this": "Unbiased under treatment-confounder feedback and able to express sustained or dynamic strategy effects; a time-dependent Cox conditioning on a post-baseline confounder is biased by over-adjustment and collider stratification, while omitting it leaves confounding.",
        "cons_of_this": "Requires a correctly specified treatment model (IPTW) or full confounder-plus-outcome models (g-formula), strict positivity at every time, and substantially more analytic and computational effort.",
        "when_to_prefer": "Whenever a post-baseline variable both predicts later treatment and is affected by prior treatment (statins and LDL, antiretrovirals and CD4, anticoagulants and renal function)."
      },
      {
        "compared_to": "parametric g-formula",
        "pros_of_this": "IPTW-MSMs need only a treatment (and censoring) model and are simpler to communicate.",
        "cons_of_this": "Inefficient and unstable under extreme weights, awkward for dynamic regimes, and report only relative effects unless extended; the g-formula is more efficient, handles dynamic strategies and absolute risks, but must model the entire confounder process and risks the g-null paradox.",
        "when_to_prefer": "Static 'always vs never' contrasts with stable weights favor the MSM; dynamic regimes, absolute-risk contrasts, or near-positivity violations favor the g-formula."
      },
      {
        "compared_to": "doubly-robust estimators (TMLE / LTMLE)",
        "pros_of_this": "A single-model MSM or g-formula is simpler to implement and explain.",
        "cons_of_this": "Single-model estimators are consistent only if that one model is correct; LTMLE/TMLE is consistent if either the treatment or the outcome model is correct and accommodates machine-learning nuisances.",
        "when_to_prefer": "Use doubly-robust estimators when nuisance misspecification is a serious concern or high-dimensional confounding warrants flexible learners (see predictive-and-causal-ml-models-rwe)."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Excellent for the long-format person-time skeleton (fills as time-varying exposure; diagnoses/procedures as confounders/outcomes), but the labs/vitals that drive titration are absent and Medicare Advantage / capitated person-time lacks fee-for-service claims. Restrict to fully-observable benefit, align the time grid so exposure at interval t uses only start-of-interval information (avoids immortal time), and handle death as a competing event rather than censoring on it.",
      "ehr": "Supplies the time-varying labs and vitals needed for sequential exchangeability, but measurement is irregular and visit-driven with informative missingness; carry forward or impute the confounder process and treat loss to follow-up as potentially informative with inverse-probability-of-censoring weights.",
      "registry": "Clean structured longitudinal data and adjudicated outcomes; ideal for validating claims-based MSMs but weak on complete pharmacy exposure — link to claims for fills and to a death index for the competing event.",
      "linked": "Linked claims-EHR-vital-records is the ideal substrate (labs + exposure completeness + mortality) at the cost of linkage selection and order/fill/service date reconciliation before constructing the time grid."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\nimport pandas as pd\nimport statsmodels.api as sm\nimport statsmodels.formula.api as smf\n\nBASELINE = [\"age\", \"sex\", \"prior_cvd\"]          # fixed, measured in the lookback\nTIMEVAR  = [\"ldl\", \"egfr\", \"intolerance_flag\"]  # updated each interval; drive titration AND are affected by treatment\n\ndef fit_iptw_msm(panel: pd.DataFrame, trunc=(1, 99)) -> dict:\n    df = panel.sort_values([\"person_id\", \"t\"]).copy()\n\n    # --- Treatment models for the weight (predict P(observed treatment | history) at each interval) ---\n    denom_rhs = \" + \".join([\"t\"] + BASELINE + TIMEVAR)           # full history -> denominator\n    numer_rhs = \" + \".join([\"t\"] + BASELINE)                     # baseline only -> numerator (stabilization)\n    m_denom = smf.glm(\"treat ~ \" + denom_rhs, df, family=sm.families.Binomial()).fit()\n    m_numer = smf.glm(\"treat ~ \" + numer_rhs, df, family=sm.families.Binomial()).fit()\n\n    # P(treat actually received): use the fitted prob when treat=1, its complement when treat=0.\n    pd_ = m_denom.predict(df); pn_ = m_numer.predict(df)\n    df[\"p_denom\"] = np.where(df[\"treat\"] == 1, pd_, 1 - pd_)\n    df[\"p_numer\"] = np.where(df[\"treat\"] == 1, pn_, 1 - pn_)\n\n    # Stabilized weight = running product over time of (numerator / denominator) within person.\n    df[\"ratio\"] = df[\"p_numer\"] / df[\"p_denom\"]\n    df[\"sw\"] = df.groupby(\"person_id\")[\"ratio\"].cumprod()\n\n    lo, hi = np.percentile(df[\"sw\"], trunc)                       # truncate extreme weights (report both)\n    df[\"sw_trunc\"] = df[\"sw\"].clip(lo, hi)\n\n    # --- MSM: weighted pooled logistic of the outcome on strategy + time; NO time-varying confounders here. 
---\n    # For an as-treated estimand, multiply sw by an analogous stabilized inverse-probability-of-censoring\n    # weight built from a model of remaining uncensored (not on treatment) given history.\n    # var_weights (not freq_weights) carries the non-integer IPTW correctly; cluster SE fixes inference.\n    msm = smf.glm(\"event ~ treat + t + I(t**2)\", df,\n                  family=sm.families.Binomial(), var_weights=df[\"sw_trunc\"]).fit(\n                  cov_type=\"cluster\", cov_kwds={\"groups\": df[\"person_id\"]})  # robust SE for the pseudo-population\n    return {\n        \"weight_mean\": df[\"sw\"].mean(), \"weight_sd\": df[\"sw\"].std(), \"weight_max\": df[\"sw\"].max(),\n        \"trunc_at\": (lo, hi), \"msm\": msm,\n        \"log_or_per_interval\": msm.params[\"treat\"],\n    }\n\nres = fit_iptw_msm(panel)\nprint(f\"stabilized weight  mean={res['weight_mean']:.3f}  sd={res['weight_sd']:.3f}  max={res['weight_max']:.2f}\")\nprint(res[\"msm\"].summary())",
        "description": "Stabilized-IPTW marginal structural model from long-format claims/EHR person-time. Required input (one row per person per\ntime interval, already cleaned):\n  panel : person_id, t (0,1,2,... interval index), treat (1=on assigned strategy this interval, 0 otherwise),\n          event (1 if outcome in this interval), <baseline covariates>, <time-varying covariates measured at start of t>\nThe function fits a denominator treatment model (full history) and a numerator model (baseline only), builds the per-record\nstabilized weight as a cumulative product over time, then fits the MSM as a weighted pooled logistic of the outcome on the\nstrategy and follow-up time WITHOUT conditioning on the time-varying confounders. Report the weight diagnostics it prints.",
        "dependencies": [
          "pandas",
          "numpy",
          "statsmodels"
        ],
        "source_citations": [
          "robins-2000-msm"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(ipw); library(geepack); library(data.table)\nsetDT(panel)\n\n## (A) Stabilized IPT weights for time-varying treatment, then the MSM ----------------------\nw <- ipwtm(\n  exposure   = treat,\n  family     = \"binomial\", link = \"logit\",\n  numerator  = ~ t + age + sex + prior_cvd,                 # baseline only -> stabilization\n  denominator= ~ t + age + sex + prior_cvd + ldl + egfr + intolerance_flag,  # + time-varying history\n  id         = person_id, timevar = t, type = \"all\",\n  data       = as.data.frame(panel)\n)\npanel[, sw := w$ipw.weights]\n# For an as-treated estimand, multiply sw by an analogous stabilized IPC weight (ipwtm on the\n# censoring indicator) built from a model of remaining uncensored given history.\npanel[, sw_tr := pmin(pmax(sw, quantile(sw, .01)), quantile(sw, .99))]   # truncate; report both\ncat(sprintf(\"sw mean=%.3f sd=%.3f max=%.2f\\n\", mean(panel$sw), sd(panel$sw), max(panel$sw)))\n\n## MSM = weighted pooled logistic; NO time-varying confounders on the RHS. 
Cluster-robust SE via GEE.\nmsm <- geeglm(event ~ treat + t + I(t^2), family = binomial,\n              weights = sw_tr, id = person_id, corstr = \"independence\", data = panel)\nsummary(msm)\n\n## (B) Parametric g-formula for the \"always high-intensity vs always moderate\" contrast -------\nlibrary(gfoRmula)\ngf <- gformula_survival(\n  obs_data = panel, id = \"person_id\", time_name = \"t\",\n  covnames = c(\"ldl\", \"egfr\", \"intolerance_flag\", \"treat\"),\n  covtypes = c(\"normal\", \"normal\", \"binary\", \"binary\"),\n  basecovs = c(\"age\", \"sex\", \"prior_cvd\"),                              # time-fixed covariates used in the models\n  histories = c(lagged), histvars = list(c(\"treat\", \"ldl\", \"egfr\")),  # creates the lag1_* terms used below\n  covparams = list(covmodels = c(\n    ldl ~ lag1_treat + lag1_ldl + t,\n    egfr ~ lag1_treat + lag1_egfr + t,\n    intolerance_flag ~ lag1_treat + t,\n    treat ~ lag1_treat + ldl + egfr + intolerance_flag + age + sex + prior_cvd + t)),\n  outcome_name = \"event\", ymodel = event ~ treat + ldl + egfr + t + age + sex + prior_cvd,\n  intvars = list(\"treat\", \"treat\"),\n  interventions = list(list(c(static, rep(1, max(panel$t) + 1))),\n                       list(c(static, rep(0, max(panel$t) + 1)))),\n  int_descript = c(\"always high-intensity\", \"always moderate\"),\n  nsimul = 10000, seed = 1)\nprint(gf)   # cumulative-incidence risk difference / ratio under each strategy",
        "description": "Two routes for the same long-format person-time table `panel`\n  (person_id, t, treat, event, <baseline>, <time-varying covariates at start of t>):\n(A) stabilized-IPTW MSM via ipw::ipwtm -> weighted pooled-logistic outcome model with cluster-robust SE; and\n(B) the parametric g-formula via gfoRmula for the same sustained-strategy contrast as an estimator that does not rely on\nstable weights. Use (A) and (B) as mutual cross-checks.",
        "dependencies": [
          "ipw",
          "geepack",
          "gfoRmula",
          "data.table"
        ],
        "source_citations": [
          "robins-2000-msm",
          "mcgrath-2020-gformula"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "proc sort data=work.panel; by person_id t; run;\n\n/* Denominator: P(treat | baseline + time-varying history). DESCENDING models P(treat=1). */\nproc logistic data=work.panel descending noprint;\n  model treat = t age sex prior_cvd ldl egfr intolerance_flag;\n  output out=pd p=pd_hat;\nrun;\n/* Numerator: P(treat | baseline only) -> stabilizes the weight. */\nproc logistic data=work.panel descending noprint;\n  model treat = t age sex prior_cvd;\n  output out=pn p=pn_hat;\nrun;\n\ndata wts; merge pd pn; by person_id t;\n  /* probability of the treatment actually received this interval */\n  p_d = ifn(treat=1, pd_hat, 1-pd_hat);\n  p_n = ifn(treat=1, pn_hat, 1-pn_hat);\n  ratio = p_n / p_d;\nrun;\n\n/* Stabilized weight = within-person running product of (numerator/denominator). */\ndata wts; set wts; by person_id;\n  retain sw;\n  if first.person_id then sw = 1;\n  sw = sw * ratio;\nrun;\n\n/* For an as-treated estimand, multiply sw by an analogous stabilized inverse-probability-of-censoring */\n/* weight built from a PROC LOGISTIC model of remaining uncensored given history.                     */\n/* Truncate at 1st/99th pctile; ALWAYS report the untruncated distribution alongside. */\nproc univariate data=wts noprint; var sw; output out=cut pctlpts=1 99 pctlpre=p; run;\ndata wts; if _n_=1 then set cut; set wts;\n  sw_tr = min(max(sw, p1), p99);\nrun;\nproc means data=wts mean std max; var sw; run;   /* weight diagnostics: mean should be near 1.0 */\n\n/* MSM: weighted pooled logistic of the outcome on strategy + time; cluster-robust SE via REPEATED. */\nproc genmod data=wts descending;\n  class person_id;\n  weight sw_tr;\n  model event = treat t t*t / dist=binomial link=logit;\n  repeated subject=person_id / type=ind;          /* empirical (sandwich) variance for the pseudo-population */\n  estimate 'log-OR per interval, strategy' treat 1 / exp;\nrun;",
        "description": "Stabilized-IPTW marginal structural model in SAS from long-format person-time. Required input (post data-management):\n  work.panel : person_id, t, treat (0/1 on assigned strategy this interval), event (0/1 outcome this interval),\n               baseline covariates (age sex prior_cvd) and time-varying covariates (ldl egfr intolerance_flag)\n               measured at the START of interval t. One row per person per interval, sorted by person_id t.\nPROC LOGISTIC fits the numerator (baseline) and denominator (full-history) treatment models; a DATA step builds the\nstabilized weight as a within-person cumulative product; PROC GENMOD fits the MSM as a weighted pooled logistic with an\nexchangeable working correlation (cluster-robust SE). The MSM RHS excludes the time-varying confounders by construction.",
        "dependencies": [],
        "source_citations": [
          "robins-2000-msm"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  U((U<br/>unmeasured)) --> L1[L1<br/>time-varying confounder<br/>e.g. LDL at month 1]\n  U --> Y[Y<br/>outcome<br/>e.g. MI]\n  A0[A0<br/>treatment at baseline] --> L1\n  A0 --> Y\n  L1 --> A1[A1<br/>treatment at month 1]\n  A1 --> Y\n  L1 --> Y\nclassDef tv fill:#fde,stroke:#a36;\nclass L1 tv;",
        "caption": "Treatment-confounder feedback, the structure that motivates g-methods. L1 is a confounder for later treatment A1 (L1 -> A1) AND a mediator of earlier treatment (A0 -> L1 -> Y). Conditioning on L1 in a standard model blocks part of the A0 effect and opens a collider path via U; not conditioning on it confounds A1. IPTW, the g-formula, and g-estimation all handle L1 correctly without this trade-off.",
        "alt_text": "Causal DAG with baseline treatment A0, time-varying confounder L1 affected by A0 and by an unmeasured U, later treatment A1 driven by L1, and outcome Y affected by A0, A1, L1, and U, illustrating treatment-confounder feedback.",
        "source_type": "illustrative",
        "source_citations": [
          "robins-2000-msm"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Q{Is treatment decided once<br/>and never revisited?} -->|Yes| Pt[No feedback:<br/>baseline PS / ACNU design]\n  Q -->|No, repeated decisions| FB{Does a post-baseline variable<br/>both predict later treatment<br/>and get affected by prior treatment?}\n  FB -->|No| Std[Standard time-dependent<br/>adjusted model is adequate]\n  FB -->|Yes: feedback| EM{Effect modification by<br/>time-varying covariates<br/>the key question?}\n  EM -->|Yes| GE[g-estimation of<br/>structural nested model]\n  EM -->|No| DYN{Dynamic regime or<br/>absolute-risk contrast needed?}\n  DYN -->|Yes| GF[Parametric g-formula<br/>gfoRmula]\n  DYN -->|No, static contrast| MSM[IPTW-estimated MSM<br/>stabilized weights]",
        "caption": "Choosing among the methods. Feedback is the trigger for g-methods; the question type (effect modification vs dynamic/absolute-risk vs static contrast) selects g-estimation, the g-formula, or an IPTW-MSM.",
        "alt_text": "Decision tree routing from whether treatment is repeated and whether feedback exists to a baseline propensity design, a standard adjusted model, g-estimation, the parametric g-formula, or an IPTW-estimated marginal structural model.",
        "source_type": "illustrative",
        "source_citations": [
          "naimi-2017-g-methods"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "used_with",
        "target_slug": "target-trial-emulation",
        "notes": "g-Methods and MSMs are the core analytic tools for estimating per-protocol / sustained-strategy effects in target trials with time-varying treatments."
      },
      {
        "relation_type": "used_with",
        "target_slug": "clone-censor-weight-per-protocol",
        "notes": "Clone-censor-weight is a design-based route to the same sustained-strategy estimand; both use inverse-probability weighting to handle artificial censoring at strategy deviation."
      },
      {
        "relation_type": "alternative_to",
        "target_slug": "cox-ph-regression",
        "notes": "Standard time-dependent Cox is biased under treatment-confounder feedback (over-adjustment plus collider stratification); MSMs are commonly fit as a weighted Cox or weighted pooled logistic, where the IPT weighting step removes the bias the naive Cox incurs."
      },
      {
        "relation_type": "see_also",
        "target_slug": "propensity-score-methods-psm-iptw",
        "notes": "IPTW for time-varying treatment generalizes baseline IPTW, in that the weight is a cumulative product over time of inverse-treatment-probabilities rather than a single baseline weight."
      },
      {
        "relation_type": "see_also",
        "target_slug": "immortal-time-bias-handling",
        "notes": "Aligning the time grid so exposure at each interval uses only start-of-interval information prevents the immortal time that arises when follow-up or classification looks ahead of the treatment decision."
      },
      {
        "relation_type": "see_also",
        "target_slug": "predictive-and-causal-ml-models-rwe",
        "notes": "Doubly-robust estimators (LTMLE/TMLE) extend g-methods with flexible machine-learning nuisances and consistency if either the treatment or outcome model is correct."
      }
    ],
    "aliases": [
      "MSM",
      "MSMs",
      "marginal structural model",
      "marginal structural models",
      "g-methods",
      "g-formula",
      "parametric g-formula",
      "g-computation",
      "g-estimation",
      "time-varying IPTW"
    ],
    "complexity": "advanced",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "markov-transition-probabilities-rwe",
    "name": "Markov Transition Probabilities from Real-World Data",
    "short_definition": "The estimation, from longitudinal claims/EHR/registry data, of the per-cycle probabilities of moving between mutually exclusive health states (P[state j at t+1 | state i at t]) that populate a cohort-level Markov or multistate cost-effectiveness model.",
    "long_description": "A Markov (state-transition) cost-effectiveness model partitions a disease into a finite set of mutually exclusive,\ncollectively exhaustive **health states** and moves a hypothetical cohort through them in fixed-length **cycles**. The\nnumbers that drive everything the model produces — life-years, QALYs, and costs accrued in each state — are the\n**transition probabilities**: the per-cycle probability of moving from state *i* to state *j*. This concept is about\n*estimating those probabilities from real-world longitudinal data* (claims, EHR, registry, or linked sources), as opposed\nto lifting them from published trials or expert opinion. It is the data-engineering and statistical step that sits\n*upstream* of the decision model, not the economic evaluation itself.\n\n**Core conceptual distinction.** Three things must be separated and pre-specified. (1) *Probability vs rate.* A\ntransition **rate** is an instantaneous hazard (events per person-time, continuous); a transition **probability** is the\nchance of being in state *j* one cycle later (bounded 0–1, cycle-length-dependent). Cohort Markov models consume\nprobabilities, but the honest way to get them from time-to-event data is to estimate rates and convert with the matrix\nexponential p = exp(Q·Δt), where Q is the transition-intensity matrix — *not* the naive 1 − exp(−rate·Δt), which is only\ncorrect for a single competing transition and silently mis-allocates probability when a state has more than one exit\n(Welton & Ades 2005). (2) *Fully vs partially observed transitions.* If you see every transition the moment it happens\n(e.g., a death index), counts give consistent estimates. If you observe state only at intermittent, irregular visits\n(the usual claims/EHR reality), the data are **panel/interval-censored**: an intervening state could have been missed,\nand count-ratio estimators are biased. 
Continuous-time multistate models (the `msm` framework, Jackson 2011) recover the\nintensity matrix from snapshots. (3) *Time-homogeneity.* The simplest model assumes a single Q for all cycles; real\nprogression usually depends on time-in-state (semi-Markov / tunnel states) and calendar/age, so a homogeneous matrix can\nbe badly wrong over a lifetime horizon. The estimand is therefore the *cycle-specific transition-probability matrix for\nthe target decision population*, with uncertainty propagated (Dirichlet/multinomial draws for PSA per Briggs 2012),\n*not* a single point matrix.\n\n**Pros, cons, and trade-offs.**\n- **vs trial-derived or literature transition probabilities:** RWD-derived transitions reflect the real treated\n  population, settings, and adherence, and can be estimated for states and subgroups no trial reports. Cost: they\n  inherit all of claims/EHR's measurement error (state misclassification, informative visit timing, left truncation),\n  and naive count ratios from irregular data are biased. **Prefer RWD** when the decision population diverges from trial\n  populations or when long-horizon/real-adherence transitions are needed — but validate against external benchmarks.\n- **vs a partitioned-survival model (PSM):** PSM reads area-under-the-curve directly from extrapolated PFS/OS curves and\n  needs no transition structure, which makes it simple but **structurally unconstrained** — post-progression survival is\n  implied, not modeled, and the three curves can imply impossible state occupancy. A Markov/multistate model enforces a\n  coherent state structure and lets you model post-progression mortality explicitly. 
**Prefer Markov** when the disease\n  has clinically meaningful intermediate states and back-transitions; **prefer PSM** for two-state oncology problems with\n  mature survival curves where transition data are thin.\n- **vs naive count-ratio (cohort-count) estimation:** counting i→j moves and dividing by the number at risk in i at the\n  start of the cycle is trivial and transparent, but ignores interval censoring and competing risks, and cannot borrow\n  strength across cells. A continuous-time multistate model (msm / Aalen-Johansen for the empirical matrix) is the\n  methodologically correct upgrade at the cost of convergence fragility and stronger Markov assumptions. **Prefer\n  multistate estimation** whenever states are observed only at irregular visits or death competes with progression.\n\n**When to use.** Building a lifetime-horizon HTA cost-effectiveness model for a chronic, progressive disease with\nrecognizable intermediate states (CKD stages, NYHA heart-failure classes, cancer remission/relapse, HIV CD4 strata)\nwhere you have longitudinal RWD spanning enough person-time to observe the transitions of interest, and where the\ndecision population is not well represented by available trials. Also the natural tool when you need subgroup- or\ncalendar-specific transition matrices, or when back-transitions (improvement, re-treatment) matter.\n\n**When NOT to use, and when it is actively misleading or dangerous.**\n- **Irregular observation treated as exact-time data.** Plugging interval-censored claims snapshots into count-ratio\n  formulas, or applying 1 − exp(−rate) per transition when a state has competing exits, produces matrices whose rows\n  look fine but whose long-run state occupancy is wrong by the time it compounds over 40+ cycles. 
This is the most common\n  and most dangerous error: it is invisible in one cycle and catastrophic in a lifetime model.\n- **Disenrollment misread as a clinical transition.** In claims, a patient who leaves the plan (especially to Medicare\n  Advantage, where fee-for-service claims vanish) simply stops being observed. If censoring is not modeled, the cohort\n  appears to \"transition out,\" biasing every downstream probability. Never code end-of-data as a health-state move.\n- **A two-state, no-back-transition problem.** If the disease is really just alive→dead with one progression event and\n  mature survival curves exist, a Markov model adds structure (and assumptions) you do not need — a partitioned-survival\n  or simple survival extrapolation is more honest.\n- **Sparse cells.** When some i→j transitions have a handful of events, the matrix is unstable and PSA will swing wildly;\n  Markov chaining propagates that instability. Collapse states or borrow strength (hierarchical/evidence synthesis,\n  Welton & Ades 2005) rather than shipping a point matrix.\n- **Strong time-in-state dependence forced into a homogeneous matrix.** Diseases with accelerating hazards (e.g., late\n  CKD) violate the memoryless assumption; a single homogeneous Q understates progression and overstates survival.\n\n**Data-source operational depth.**\n- **Claims (FFS vs MA):** State must be *derived* per cycle from diagnosis/procedure/drug codes and any linked labs,\n  because claims have no native \"state\" field. Pin a fixed cycle length, assign each person's state at each cycle\n  boundary, and decide a missing-data rule (last-observation-carried-forward within a max gap vs treat as censored).\n  Failure modes: **Medicare Advantage enrollees lack FFS person-time**, so apparent transitions are observation gaps —\n  restrict to continuously FFS-enrolled spans (Parts A/B/D) and censor at MA switch. 
**Differential disenrollment by\n  health state** (sicker patients change plans or die) makes \"lost to follow-up\" informative. **Left truncation**:\n  prevalent patients enter mid-trajectory, so the apparent starting-state mix is not the inception mix. **Death is\n  poorly captured** in claims alone — without a death index, the absorbing state leaks into \"censored,\" which inflates\n  survival; link to vital records or use a hospice/inpatient-discharge-disposition proxy and acknowledge its\n  incompleteness.\n- **EHR:** States from labs/vitals/problem lists are richer (true eGFR for CKD staging, EF for HF) but **visit-driven**:\n  a patient seen more often has more chances to be reclassified, so sicker patients accrue spurious transitions\n  (ascertainment by visit frequency). External-care leakage means transitions happening outside the system are missed.\n  Use the irregular visit times explicitly in a continuous-time model rather than forcing a fixed grid.\n- **Registry:** Disease-specific registries often record state directly and adjudicate it (cancer stage, dialysis\n  start), which is ideal for the transition structure, but capture of competing mortality and of inter-state utilization\n  cost may be weak — link to claims for costs and to a death index for the absorbing transition.\n- **Linked claims–EHR–vital records:** The ideal substrate (EHR-derived states + claims completeness + reliable death),\n  but linkage selects the linkable subset (transportability) and introduces date discrepancies between lab, claim, and\n  death dates that must be reconciled before assigning a state to a cycle boundary.\n\n**Worked claims example.** Goal: a 3-month-cycle transition matrix for CKD progression (G3a → G3b → G4 → ESRD → Death,\nwith possible back-transitions among the eGFR stages and Death/ESRD absorbing or near-absorbing) to feed a lifetime\ncost-utility model, using a linked claims+lab dataset. 
(1) *Cohort/inception:* adults with ≥2 eGFR values defining\nG3a/G3b and ≥365 days continuous FFS enrollment (Parts A/B) before the first qualifying eGFR (`index_date`), excluding\nMedicare Advantage person-time. (2) *State assignment per cycle:* lay a 91-day cycle grid from `index_date`; at each\nboundary assign the CKD stage from the most recent eGFR within the prior 180 days (carry forward); if no eGFR within the\nwindow and the person is still enrolled, treat the cycle as missing (interval-censored) rather than carrying an old value\nindefinitely. ESRD is set from the first dialysis/transplant procedure or revenue code (`dx`/`px` on the medical claim);\nDeath from the linked vital-records `death_date`. (3) *Censoring:* stop person-time at the earliest of disenrollment,\nswitch to MA, or end of data — these are censoring, never a transition to Death. (4) *Estimation:* because eGFR is\nobserved at irregular lab dates, fit a continuous-time multistate (msm) model on the (person, eGFR-date, state) panel\nwith Death as an exactly-timed absorbing state, recover the intensity matrix Q, and convert to a 91-day probability\nmatrix via the matrix exponential; report the empirical Aalen-Johansen state-occupancy as a check. (5) *Uncertainty:*\ndraw the transition matrix from the multivariate distribution of the fitted intensities (or Dirichlet draws of each row\nfor a count-based sensitivity version) for probabilistic sensitivity analysis. (6) *Diagnostics:* compare modeled vs\nobserved stage occupancy over follow-up, test time-homogeneity (split early vs late follow-up), and run a\ndisenrollment-as-informative-censoring sensitivity analysis.\n\n**Interpreting the output**\n\nThe worked example shows a 3-state Markov model (Stable, Progressed, Dead) with a 3-month cycle. 
After one\ncycle, a starting cohort of 1,000 (700 Stable, 250 Progressed, 50 Dead) redistributes to 585 Stable, 280 Progressed,\nand 135 Dead.\n\n*(1) Formal interpretation.* Each row of the transition probability matrix must sum to 1.0, because\nevery patient in a state either stays or moves to exactly one other state; that is, the matrix is row-stochastic. The\nresult 585 + 280 + 135 = 1,000 confirms cohort conservation for this cycle. The 3-month cycle length means\neach probability represents the likelihood of a state change within a quarter; the same patient-level hazard\ntranslated to a 1-month cycle would yield different (generally smaller) per-cycle probabilities. A critical\nlimitation is the Markov memorylessness assumption: the transition probability from Stable to Progressed is\nthe same regardless of how many prior cycles the patient has already spent in Stable. When time in state\nis clinically meaningful (e.g., time since treatment start predicts progression risk), a tunnel-state extension,\na multistate model with time-in-state covariates, or a DES approach is required.\n\n*(2) Practical interpretation.* The per-cycle death probabilities (5% from Stable, 20% from Progressed) are the\ndominant drivers of life-years in the model. The 13.5% Dead share after one cycle is a state occupancy, not a\ntransition probability: it includes the 50 patients who started in Dead, so 85 of the 950 patients alive at baseline\n(about 8.9%) died during the cycle. Analysts should compare modeled state occupancy at each cycle\nagainst observed Kaplan-Meier curves or registry prevalence as a validation check; if the modeled progression\ncurve diverges from the observed data by cycle 4, the transition probabilities need reestimation or the\ntime-homogeneity assumption needs relaxing. Any PSA draws over the transition matrix must respect row-sum = 1\n(use Dirichlet distributions for each row, not independent betas).",
    "primary_category": "Economic_Evaluation",
    "tags": [
      "markov-model",
      "state-transition-model",
      "transition-probabilities",
      "multistate-model",
      "cost-effectiveness-modeling",
      "health-economic-modeling",
      "panel-data",
      "competing-risks"
    ],
    "applies_to_study_types": [
      "claims_analysis",
      "ehr_study",
      "registry_linkage",
      "comparative_effectiveness"
    ],
    "data_sources": [
      "linked",
      "claims",
      "ehr",
      "registry"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1177/0272989X9301300409",
        "url": "https://doi.org/10.1177/0272989X9301300409",
        "citation_text": "Sonnenberg FA, Beck JR. Markov models in medical decision making: a practical guide. Medical Decision Making. 1993;13(4):322-338.",
        "year": 1993,
        "authors_short": "Sonnenberg & Beck",
        "notes": "Foundational practical statement of the Markov cohort model, health states, cycles, and the transition-probability matrix that this concept estimates from data."
      },
      {
        "role": "explain",
        "doi": "10.1177/0272989X05282637",
        "url": "https://doi.org/10.1177/0272989X05282637",
        "citation_text": "Welton NJ, Ades AE. Estimation of Markov chain transition probabilities and rates from fully and partially observed data: uncertainty propagation, evidence synthesis, and model calibration. Medical Decision Making. 2005;25(6):633-645.",
        "year": 2005,
        "authors_short": "Welton & Ades",
        "notes": "Canonical methods paper for estimating transition probabilities/rates from fully and partially observed data, including the rate-to-probability conversion and Bayesian uncertainty propagation that count-ratio estimators miss."
      },
      {
        "role": "explain",
        "doi": "10.1016/j.jval.2012.06.012",
        "url": "https://doi.org/10.1016/j.jval.2012.06.012",
        "citation_text": "Caro JJ, Briggs AH, Siebert U, Kuntz KM. Modeling good research practices - overview: a report of the ISPOR-SMDM Modeling Good Research Practices Task Force-1. Value in Health. 2012;15(6):796-803.",
        "year": 2012,
        "authors_short": "Caro et al.",
        "notes": "ISPOR-SMDM good-practice framework situating state-transition models and the requirements for transparent, defensible transition inputs."
      },
      {
        "role": "explain",
        "doi": "10.1177/0272989X12458348",
        "url": "https://doi.org/10.1177/0272989X12458348",
        "citation_text": "Briggs AH, Weinstein MC, Fenwick EAL, Karnon J, Sculpher MJ, Paltiel AD. Model parameter estimation and uncertainty analysis: a report of the ISPOR-SMDM Modeling Good Research Practices Task Force-6. Medical Decision Making. 2012;32(5):722-732.",
        "year": 2012,
        "authors_short": "Briggs et al.",
        "notes": "Good-practice guidance for parameterizing transition probabilities and propagating their uncertainty via Dirichlet/multinomial draws in probabilistic sensitivity analysis."
      },
      {
        "role": "demonstrate",
        "doi": "10.18637/jss.v038.i08",
        "url": "https://doi.org/10.18637/jss.v038.i08",
        "citation_text": "Jackson CH. Multi-state models for panel data: the msm package for R. Journal of Statistical Software. 2011;38(8):1-28.",
        "year": 2011,
        "authors_short": "Jackson",
        "notes": "Reference implementation for fitting continuous-time multistate models to intermittently observed (panel) data and deriving the transition-probability matrix - the correct estimator for irregular claims/EHR snapshots."
      }
    ],
    "plain_language_summary": "A Markov model imagines a group of patients as a cohort moving between a small set of health states — such as stable disease, progressed disease, and death — one fixed time period at a time. A transition probability is the chance that a patient in one state ends up in a different (or the same) state at the end of that period, and these numbers come from watching real patients move through states in claims or electronic health record data. Every row of the resulting table of probabilities must add up to exactly 1, because each patient has to be somewhere at the end of each cycle. The hard part is that real-world data usually only captures a snapshot at each clinic visit, so estimating honest probabilities from irregular observations requires methods that account for what may have happened between visits.",
    "key_terms": [
      {
        "term": "health state",
        "definition": "A mutually exclusive category that describes where a patient is in their disease at a given point in time, such as stable, progressed, or dead."
      },
      {
        "term": "transition probability",
        "definition": "The chance that a patient currently in one health state will be in a specific health state at the end of the next cycle, expressed as a number between 0 and 1."
      },
      {
        "term": "cycle",
        "definition": "The fixed unit of time the model advances in each step, for example one month or three months, after which every patient is re-assigned to a health state."
      },
      {
        "term": "absorbing state",
        "definition": "A health state a patient can enter but never leave, most commonly death, so its self-transition probability is always 1.0."
      }
    ],
    "worked_example": {
      "scenario": "A health economist is building a simple cost-effectiveness model for a new cancer drug. She defines three health states: Stable (disease has not progressed), Progressed (disease has worsened), and Dead. Using one year of claims data she estimates the per-cycle (3-month) probabilities of moving between these states. She starts with a cohort of 1,000 patients and wants to see how they redistribute after one cycle.",
      "dataset": {
        "caption": "Transition probability matrix estimated from claims data. Each row is the origin state; each column is the destination state after one 3-month cycle. Every row sums to 1.0.",
        "columns": [
          "from_state",
          "to_Stable",
          "to_Progressed",
          "to_Dead",
          "row_sum"
        ],
        "rows": [
          [
            "Stable",
            0.8,
            0.15,
            0.05,
            1.0
          ],
          [
            "Progressed",
            0.1,
            0.7,
            0.2,
            1.0
          ],
          [
            "Dead",
            0.0,
            0.0,
            1.0,
            1.0
          ]
        ]
      },
      "steps": [
        "Start with 1,000 patients split across states: 700 Stable, 250 Progressed, 50 Dead.",
        "Multiply each starting group by its row of transition probabilities to find contributions to each destination state.",
        "Stable patients contribute: 700 x 0.80 = 560 remain Stable; 700 x 0.15 = 105 move to Progressed; 700 x 0.05 = 35 move to Dead.",
        "Progressed patients contribute: 250 x 0.10 = 25 return to Stable; 250 x 0.70 = 175 remain Progressed; 250 x 0.20 = 50 move to Dead.",
        "Dead patients contribute: 50 x 1.00 = 50 stay Dead (absorbing state, no exits).",
        "Sum the contributions column by column to get the end-of-cycle cohort: Stable = 560 + 25 + 0 = 585; Progressed = 105 + 175 + 0 = 280; Dead = 35 + 50 + 50 = 135."
      ],
      "result": "After one 3-month cycle the cohort of 1,000 redistributes to 585 Stable, 280 Progressed, and 135 Dead. Total = 585 + 280 + 135 = 1,000, confirming the cohort is conserved. The model repeats this multiplication every cycle, chaining the matrix forward over the full time horizon to accumulate life-years and costs in each state."
    },
    "prerequisites": [
      "health-economic-modeling-methods-rwe",
      "competing-risks-cause-specific-fine-gray-rwe",
      "cost-utility"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Count-ratio (cohort-count) transition matrix",
        "description": "Each off-diagonal cell P(i->j) is observed i->j moves divided by the at-risk denominator in state i over the cycle; diagonals are 1 minus the row sum. Assumes states are observed exactly at every cycle boundary.",
        "edge_cases": [
          "Interval censoring - an unobserved intervening state between two snapshots makes counts undercount fast transitions.",
          "Competing exits from one state are mis-allocated if probabilities are built per-transition with 1 - exp(-rate) instead of jointly.",
          "Sparse cells give unstable rows that destabilize the chained model and PSA."
        ],
        "data_source_notes": "claims/EHR: only defensible when the cycle grid is dense relative to true transition speed and a death index closes the absorbing state; otherwise prefer continuous-time estimation."
      },
      {
        "name": "Continuous-time multistate (intensity-matrix) estimation",
        "description": "Fit the transition-intensity matrix Q to intermittently observed states by maximum likelihood (e.g., msm), then convert to the cycle probability matrix via the matrix exponential P = exp(Q*cycle_length).",
        "edge_cases": [
          "Requires the Markov (memoryless) assumption unless extended; semi-Markov/time-in-state dependence needs phase-type or tunnel states.",
          "Likelihood can fail to converge with sparse data or too many free intensities; constrain structurally impossible transitions to zero.",
          "Exact death times (absorbing state) should be supplied as exact, not panel, observations."
        ],
        "data_source_notes": "EHR/linked: use the actual irregular lab/visit dates as observation times; supply linked vital-records death dates as exactly observed absorbing transitions."
      },
      {
        "name": "Aalen-Johansen empirical state-occupancy",
        "description": "Nonparametric estimator of the transition-probability matrix over time that correctly handles right censoring and competing risks; used to validate a fitted parametric matrix against observed occupancy.",
        "edge_cases": [
          "Does not by itself give a cycle-stationary matrix for a homogeneous model; it is a time-indexed validation/benchmark.",
          "Still requires correct handling of left truncation for prevalent cohorts."
        ],
        "data_source_notes": "registry/linked: strong check because adjudicated states and a death index make the competing-risk structure observable."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Trial-derived or literature transition probabilities",
        "pros_of_this": "Reflect the real treated population, real adherence, real settings, and can be estimated for subgroups and long horizons no single trial reports.",
        "cons_of_this": "Inherit claims/EHR measurement error (state misclassification, informative visit timing, left truncation, incomplete death capture); naive estimators from irregular data are biased.",
        "when_to_prefer": "When the decision population diverges from trial populations or long-horizon/real-world transitions are needed - with external validation of the resulting matrix."
      },
      {
        "compared_to": "Partitioned-survival model (PSM)",
        "pros_of_this": "Enforces a coherent, clinically interpretable state structure with explicit post-progression mortality and back-transitions, avoiding the structurally implied (and possibly impossible) occupancy of independent PSM curves.",
        "cons_of_this": "Requires estimable transition data for every modeled move and stronger structural/Markov assumptions; harder when transition cells are sparse.",
        "when_to_prefer": "Diseases with meaningful intermediate states and improvement/relapse dynamics; prefer PSM for two-state oncology problems with mature survival curves and thin transition data."
      },
      {
        "compared_to": "Naive count-ratio estimation",
        "pros_of_this": "Continuous-time multistate / Aalen-Johansen estimation correctly handles interval censoring and competing risks and can borrow strength across cells.",
        "cons_of_this": "Convergence fragility, stronger Markov assumptions, and more analyst expertise than dividing counts.",
        "when_to_prefer": "Whenever states are observed only at irregular visits, or death competes with progression out of a state."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "State is derived per cycle from dx/px/drug codes and any linked labs (no native state field). Fix the cycle length, assign state at each boundary with an explicit missing-data rule (LOCF within a max gap vs interval-censored), restrict to continuously FFS-enrolled person-time (exclude MA-only spans where claims vanish), and treat disenrollment/end-of-data as censoring - never a transition. Link to a death index to close the absorbing state.",
      "ehr": "States from labs/vitals/problem lists are richer but visit-driven; sicker, more-frequently-seen patients accrue spurious reclassifications (ascertainment by visit frequency). Use the actual irregular visit times in a continuous-time model rather than forcing a fixed grid; account for external-care leakage.",
      "registry": "Often records adjudicated state directly (cancer stage, dialysis start) - ideal for the transition structure - but may capture competing mortality and inter-state cost weakly; link to claims for cost and to vital records for death.",
      "linked": "Ideal substrate (EHR states + claims completeness + reliable death) but linkage selects the linkable subset (transportability) and creates lab/claim/death date discrepancies that must be reconciled before assigning a state to a cycle boundary."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\nimport numpy as np\n\ndef empirical_transition_matrix(states: pd.DataFrame) -> pd.DataFrame:\n    s = states.sort_values([\"person_id\", \"cycle\"]).copy()\n\n    # Pair each observed cycle with the SAME person's next observed cycle.\n    s[\"next_cycle\"] = s.groupby(\"person_id\")[\"cycle\"].shift(-1)\n    s[\"next_state\"] = s.groupby(\"person_id\")[\"state\"].shift(-1)\n\n    # Keep only adjacent cycles (no gap) so an unobserved intervening state\n    # is not silently collapsed into a single transition.\n    moves = s[s[\"next_cycle\"] == s[\"cycle\"] + 1]\n\n    order = sorted(set(s[\"state\"]))\n    counts = (moves.groupby([\"state\", \"next_state\"])\n                   .size()\n                   .unstack(fill_value=0)\n                   .reindex(index=order, columns=order, fill_value=0))\n\n    row_tot = counts.sum(axis=1).replace(0, np.nan)\n    P = counts.div(row_tot, axis=0)\n\n    # Absorbing/unseen-exit rows: keep the cohort in-state (self-loop = 1).\n    P = P.fillna(0.0)\n    empty_rows = P.sum(axis=1) == 0\n    for st in P.index[empty_rows]:\n        P.loc[st, st] = 1.0\n    return P  # rows sum to 1.0; pass to PSA via row-wise Dirichlet draws of `counts`",
        "description": "Empirical cycle transition-probability matrix from a long-format per-cycle state table. Required input (already cleaned):\n  states : person_id, cycle (int, 0-based cycle index on a FIXED cycle grid), state (str; absorbing states allowed)\n           One row per observed cycle per person; rows for censored cycles are simply absent.\nCounts each i->j move between consecutive observed cycles and normalizes by row. This is the count-ratio variant - valid\nonly when the grid is dense relative to true transition speed and the absorbing (death) state is observed; otherwise use\na continuous-time multistate model (see the R/msm implementation). Diagonals are 1 minus the row sum.",
        "dependencies": [
          "pandas",
          "numpy"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(msm)\nlibrary(expm)\n\nCYCLE_DAYS <- 91  # 3-month model cycle\n\n# qmat: allowed transitions get a small positive initial intensity; 0 = structurally impossible.\n# Example 5-state CKD: 1=G3a 2=G3b 3=G4 4=ESRD 5=Death (5 absorbing).\nqmat <- rbind(\n  c(0, 0.1, 0,   0,   0.02),\n  c(0.05, 0, 0.1, 0,   0.03),\n  c(0, 0.05, 0,  0.1,  0.05),\n  c(0, 0,   0,   0,    0.10),\n  c(0, 0,   0,   0,    0))\n\nfit <- msm(state ~ obs_time, subject = person_id, data = panel,\n           qmatrix = qmat, deathexact = 5, gen.inits = TRUE)\n\n# Recover the estimated intensity matrix Q and convert to a CYCLE_DAYS probability matrix.\nQ <- qmatrix.msm(fit, ci = \"none\")\nP_cycle <- expm(Q * CYCLE_DAYS)          # rows sum to 1; this is the Markov-model input\n# pmatrix.msm(fit, t = CYCLE_DAYS) gives the same matrix directly, with CIs for PSA.",
        "description": "Continuous-time multistate estimation from intermittently observed (panel) data with msm, then conversion to a\ncycle-length transition-probability matrix. Required input:\n  panel : person_id, obs_time (numeric, e.g. days from index at each lab/visit),\n          state (integer 1..K; the death/absorbing state supplied with exact times),\n          death (logical; TRUE for the exactly-observed absorbing observation)\nDefine `qmat` with structural zeros for impossible transitions. This is the correct estimator when states are seen at\nirregular times (the usual claims/EHR case) because it accounts for interval censoring and competing exits.",
        "dependencies": [
          "msm",
          "expm"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "%let cycle_gap = 1;  /* require strictly adjacent observed cycles */\n\n/* Pair each observed cycle with the same person's next observed cycle. */\nproc sql;\n  create table moves as\n  select a.person_id,\n         a.state as from_state length=12,\n         b.state as to_state   length=12\n  from work.states a\n  inner join work.states b\n    on a.person_id = b.person_id\n   and b.cycle = a.cycle + &cycle_gap;\nquit;\n\n/* Cell counts and row totals -> transition probabilities. */\nproc freq data=moves noprint;\n  tables from_state*to_state / out=cellcounts(keep=from_state to_state count);\nrun;\n\nproc sql;\n  create table tprob as\n  select c.from_state, c.to_state, c.count,\n         c.count / t.row_total as p_transition format=8.5\n  from cellcounts c\n  inner join (select from_state, sum(count) as row_total\n              from cellcounts group by from_state) t\n    on c.from_state = t.from_state\n  order by from_state, to_state;\nquit;\n\n/* cellcounts feeds row-wise Dirichlet(count + prior) draws for probabilistic sensitivity analysis. */",
        "description": "Empirical cycle transition-probability matrix in SAS from a per-cycle state table (count-ratio variant), with the\nstructure to drive Dirichlet PSA draws. Required input dataset (post data-management):\n  work.states : person_id, cycle (integer cycle index on a fixed grid), state (char; absorbing states allowed)\nCensored cycles are simply absent. PROC EXPAND/lag is avoided in favor of an explicit self-join on adjacent cycles so an\nunobserved intervening cycle is never collapsed into one transition. For interval-censored irregular data, fit a\nmultistate model instead (e.g., PROC PHREG with eventcode= per transition, or the R/msm route).",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "stateDiagram-v2\n  direction LR\n  G3a: CKD G3a\n  G3b: CKD G3b\n  G4: CKD G4\n  ESRD: ESRD (dialysis/transplant)\n  Dead: Death (absorbing)\n  G3a --> G3b\n  G3b --> G3a\n  G3b --> G4\n  G4 --> G3b\n  G4 --> ESRD\n  G3a --> Dead\n  G3b --> Dead\n  G4 --> Dead\n  ESRD --> Dead",
        "caption": "Health-state structure for a CKD progression Markov model. Each arrow is a per-cycle transition probability to be estimated from data; Death is an absorbing state and back-transitions among eGFR stages are permitted. The full matrix (including self-loops, omitted for clarity) has each origin row summing to 1.",
        "alt_text": "State diagram with CKD stages G3a, G3b, G4, ESRD, and an absorbing Death state, with forward and backward arrows between adjacent stages and arrows to Death from every state.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Raw[Irregular claims/EHR observations<br/>eGFR labs, dx/px, linked death date] --> Grid[Lay fixed cycle grid<br/>e.g. 91-day cycles from index]\n  Grid --> Assign[Assign state at each cycle boundary<br/>LOCF within max gap, else interval-censored]\n  Assign --> Censor[Mark censoring: disenroll / MA switch / end of data<br/>NOT a transition]\n  Censor --> Est{Observation pattern?}\n  Est -->|Dense grid + death index| Count[Count-ratio matrix<br/>i to j moves / at-risk in i]\n  Est -->|Irregular snapshots| MS[Continuous-time multistate<br/>fit Q, P = exp of Q times cycle]\n  Count --> Valid[Validate vs Aalen-Johansen occupancy<br/>test time-homogeneity]\n  MS --> Valid\n  Valid --> PSA[Propagate uncertainty<br/>Dirichlet rows / fitted-intensity draws for PSA]",
        "caption": "Data flow from irregular real-world observations to a cycle-specific transition-probability matrix, branching on whether the data are densely observed (count-ratio) or interval-censored (continuous-time multistate), with validation and uncertainty propagation.",
        "alt_text": "Flowchart from irregular observations through cycle-grid construction, state assignment, censoring, a decision on estimation method, validation against Aalen-Johansen, and uncertainty propagation for probabilistic sensitivity analysis.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "is_variant_of",
        "target_slug": "health-economic-modeling-methods-rwe",
        "notes": "Transition-probability estimation is the data-driven parameterization step within the state-transition (Markov) branch of the health-economic modeling family."
      },
      {
        "relation_type": "alternative_to",
        "target_slug": "partitioned-survival-models-rwe",
        "notes": "Partitioned-survival models read state occupancy from extrapolated survival curves without a transition matrix; Markov models impose an explicit, coherent transition structure instead."
      },
      {
        "relation_type": "used_with",
        "target_slug": "discounting-costs-effects-rwe",
        "notes": "The state occupancy produced by chaining the transition matrix over cycles is combined with state costs/utilities and discounted to yield model outputs."
      },
      {
        "relation_type": "requires",
        "target_slug": "competing-risks-cause-specific-fine-gray-rwe",
        "notes": "When death competes with progression out of a state, transitions must be estimated under a competing-risks framework rather than as independent per-transition rates."
      },
      {
        "relation_type": "see_also",
        "target_slug": "cost-utility",
        "notes": "Estimated transition probabilities ultimately feed cost-utility (QALY) analyses built on the Markov state structure."
      }
    ],
    "aliases": [
      "Markov transition probabilities",
      "state-transition probabilities",
      "transition probability matrix",
      "multistate transition probabilities from real-world data"
    ],
    "complexity": "advanced",
    "regulatory_relevance": [
      "hta",
      "fda",
      "ema"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "maximum-likelihood-estimation",
    "name": "Maximum Likelihood Estimation",
    "short_definition": "A method for estimating unknown model parameters by finding the values that make the observed data most probable under the assumed statistical model; the estimation engine powering logistic regression, Poisson, Cox, AFT, and mixed models throughout RWE and HEOR, and the foundation of likelihood-ratio tests, AIC/BIC model comparison, and profile likelihood confidence intervals.",
    "long_description": "**What maximum likelihood estimation is**\n\nMaximum likelihood estimation (MLE) answers one question: which parameter values would have made\nthe observed data most probable? The key conceptual move is treating the probability model\nbackwards. Ordinarily a fixed parameter — say, an event rate p — generates data; MLE instead\nfixes the observed data and asks which p best explains what was seen. The *likelihood function*\nL(θ | data) is the probability (or probability density) of the observed data evaluated as a\nfunction of the unknown parameter θ. Critically, it is not a probability distribution over θ\n— it is a surface over parameter space that quantifies how consistent each candidate parameter\nvalue is with the data that actually appeared. The MLE θ-hat is the value that maximizes\nL(θ | data) across all admissible parameter values.\n\nBecause likelihoods are products of many small probabilities — one factor per observation —\nworking with the *log-likelihood* ℓ(θ) = log L(θ | data) is universally preferred. Products\nbecome sums, numerical underflow is eliminated, and derivatives are far easier to compute.\nThe MLE is unchanged: the θ that maximizes ℓ(θ) also maximizes L(θ), because log is a\nstrictly monotone transformation.\n\n**Score function and Fisher information**\n\nThe *score function* S(θ) = ∂ℓ/∂θ is the first derivative of the log-likelihood with respect\nto the parameter. Setting S(θ) = 0 and solving yields the MLE for regular parametric families\nwhere the likelihood is smooth and unimodal. For the binomial family with n trials and k events,\nℓ(p) = k log(p) + (n−k) log(1−p) (excluding the constant binomial coefficient). 
The score is\nS(p) = k/p − (n−k)/(1−p), and setting it to zero gives k(1−p) = (n−k)p, which simplifies to\np-hat = k/n — the sample proportion.\n\nThe *observed Fisher information* I(θ) = −∂²ℓ/∂θ² is the negative second derivative of the\nlog-likelihood evaluated at θ-hat; it measures how sharply peaked the log-likelihood surface\nis at its maximum. A steep peak (large I) signals that the data are highly informative about θ;\na flat peak (small I) means substantial uncertainty persists. The Cramér-Rao lower bound states\nthat no unbiased estimator can have variance below 1/I(θ₀), the inverse of the expected Fisher\ninformation at the true value. MLE achieves this bound in large samples, making it\n*asymptotically efficient*. In practice, SE = sqrt(1/I(θ-hat)) is the Wald standard error\nreported alongside every regression coefficient.\n\n**Asymptotic properties: consistency and normality**\n\nUnder standard regularity conditions — correctly specified model, independent and identically\ndistributed observations, smooth likelihood — the MLE has two celebrated large-sample properties.\n(1) *Consistency*: θ-hat converges in probability to the true θ₀ as n grows without bound.\n(2) *Asymptotic normality*: sqrt(n)(θ-hat − θ₀) converges in distribution to a normal\ndistribution with mean 0 and variance 1/I₁(θ₀), where I₁ is the Fisher information contributed\nby a single observation (so the full-sample information is n·I₁). These properties underpin Wald\nconfidence intervals: θ-hat ± z * SE, where SE = sqrt(1/I(θ-hat)) and z is the appropriate normal\nquantile. The quality of this large-sample approximation improves with n but can deteriorate\nbadly at small n, near boundary values (probability near 0 or 1, variance near 0), or under\nserious model misspecification.\n\n**The trinity of tests: Wald, likelihood-ratio, and score**\n\nThree hypothesis tests emerge naturally from the likelihood framework and are collectively\nknown as the \"trinity\" of classical tests.\n\n*Wald test*: (θ-hat − θ₀) / SE(θ-hat) ~ N(0,1) under H₀. Reported as the z-statistic (or\nt-statistic with df adjustment) in nearly every regression output table. 
Simple to compute once\nthe MLE and SE are available. Wald confidence intervals are symmetric: θ-hat ± z * SE.\n\n*Likelihood-ratio (LR) test*: 2[ℓ(θ-hat) − ℓ(θ₀)] ~ χ²(df) under H₀, where df equals the\nnumber of parameters being tested. Compares the log-likelihood at the unconstrained MLE to its\nvalue constrained to the null. The LR test is generally preferred to the Wald test because it\nuses the full shape of the log-likelihood surface rather than only local curvature at the peak.\nNear boundary values and with sparse data, the log-likelihood surface is asymmetric, and the\nnormal approximation underlying the Wald test deteriorates; the LR test remains better\ncalibrated. Deviance differences between nested GLMs are LR statistics.\n\n*Score (Rao) test*: evaluates the slope S(θ₀) at the null parameter value without computing\nthe unrestricted MLE — advantageous when fitting the full model is computationally expensive.\n\n*Profile likelihood confidence intervals* invert the LR test: they include all θ values for\nwhich 2[ℓ(θ-hat) − ℓ(θ)] is below the chi-squared critical value. Profile CIs respect the\nnatural curvature and asymmetry of the log-likelihood, producing intervals that need not be\nsymmetric. For a binomial proportion, the 95% profile CI is the set of p whose log-likelihood\nlies within 3.84/2 = 1.92 of its maximum; it is distinct from the Clopper-Pearson exact interval,\nwhich inverts binomial tail probabilities rather than the likelihood ratio. Near boundary values,\nwith sparse data, or when separation warnings appear in logistic\nregression, profile CIs should replace Wald CIs, and Firth penalized likelihood extends this\nlogic further (→ firth-penalized-regression-rwe).\n\n**Deviance, AIC, and BIC: likelihood-based model comparison**\n\nThe *deviance* D = −2ℓ(θ-hat) is minus twice the maximized log-likelihood (GLM software also\nsubtracts the saturated-model value, a constant that cancels when models fit to the same data\nare compared). The difference in\ndeviance between two nested models — 2[ℓ(larger) − ℓ(smaller)] — is a likelihood-ratio\nstatistic distributed approximately as χ²(Δdf), providing a formal test of whether the\nadditional parameters improve fit beyond chance. 
This is the standard test for adding a\ncovariate or interaction term to a GLM or survival model.\n\nThe *Akaike Information Criterion* AIC = −2ℓ + 2k penalizes the number of estimated\nparameters k to guard against overfitting. The *Bayesian Information Criterion*\nBIC = −2ℓ + k log(n) applies a stronger complexity penalty that grows with sample size.\nBoth are directly comparable across models fit to the same dataset by maximum likelihood;\nlower values indicate a better penalized fit. AIC and BIC cannot be meaningfully compared\nacross datasets with different n, different response distributions, or transformed responses.\nAIC tends to favor more complex models and is preferred for prediction; BIC penalizes\ncomplexity more strongly and is preferred for model identification.\n\n**Where MLE quietly powers the catalog**\n\nMLE is the estimation engine behind virtually every model used in RWE and HEOR. Logistic\nregression maximizes the binomial log-likelihood over a logit-linear predictor. Poisson\nregression maximizes the Poisson log-likelihood; negative-binomial models add an overdispersion\nparameter estimated jointly by MLE. The Cox proportional hazards model uses a *partial\nlikelihood* — a MLE variant that conditions out the unspecified baseline hazard — to estimate\nlog-hazard ratios without parametrizing the hazard shape. Accelerated failure time (AFT) models\nmaximize parametric likelihoods (Weibull, log-normal, log-logistic). Linear mixed models and\nGLMMs integrate over random effects to form a marginal likelihood, which is then maximized by\nREML or full ML. 
Understanding MLE therefore unifies interpretation of standard errors,\nconfidence intervals, model comparison statistics, and test procedures across all these methods\n(→ generalized-linear-models for the GLM subfamily).\n\n**Failure modes and protections**\n\n*Complete separation*: in logistic regression, when a predictor perfectly predicts the outcome\nin one direction, the score equation has no finite solution — the log-likelihood surface is\nmonotone and the MLE diverges to ±∞. Software may silently report implausibly large\ncoefficients accompanied by enormous standard errors, or may fail to converge. Firth penalized\nlikelihood (→ firth-penalized-regression-rwe) adds a Jeffreys prior term that regularizes the\nlog-likelihood surface, producing finite bias-reduced estimates.\n\n*Boundary parameters*: variance components estimated at zero in a mixed model, or a probability\nestimated at exactly 0 or 1, signal near-identification failure. Profile CIs and exact methods\n(→ exact-methods-sparse-data-rwe) are the recommended safeguards.\n\n*Model misspecification*: if the assumed distribution is wrong — Poisson when overdispersion is\npresent, or binomial when observations are correlated — the MLE remains consistent for the\nworking-model parameter, but standard likelihood-based standard errors are invalid because they\nassume correct model specification. The *sandwich (robust / HC) standard error* replaces the\ninverse information matrix with a \"bread-meat-bread\" estimator (the inverse information on\neither side of the empirical variance of the score) that is consistent under\nmisspecification. Quasi-likelihood extends this idea, requiring only a mean-variance relationship\nrather than a full distributional family. 
The sandwich SE is the first\nsafeguard; a correctly specified model — negative-binomial for overdispersed counts, a mixed\nmodel for clustered data — is always preferable when the data structure is understood\n(→ regression-diagnostics for AIC/BIC-based selection of distributional family).\n\n**Pros, cons, and trade-offs**\n\n*Pros*: MLE is asymptotically efficient (minimum-variance among consistent estimators under\ncorrect specification), universally implemented across statistical software, and provides a\ncoherent unified basis for point estimation, the full trinity of hypothesis tests, confidence\nintervals (Wald or profile), and likelihood-based model comparison (AIC/BIC/deviance). The\nframework scales from simple closed-form cases (binomial proportion, normal mean) to complex\nnumerical optimization problems (mixed models, survival models, custom likelihoods).\n\n*Cons*: MLE's consistency and efficiency guarantees require correct model specification, and even\na correctly specified MLE can be biased in finite samples. Asymptotic\nguarantees may not hold with small n, sparse cells, or boundary parameter values. Wald standard\nerrors and symmetric CIs are unreliable near boundaries; separation causes divergence in\nlogistic models. 
MLE ignores prior information (contrast with Bayesian MAP estimation, which\nincorporates a prior and does not diverge under separation).\n\n*Key trade-offs*: Profile likelihood CIs are more accurate than Wald CIs near boundaries but\nrequire repeated model fitting; sandwich SEs protect against misspecification at some cost of\nefficiency when the model is correctly specified; the LR test outperforms the Wald test at\nsmall n and near boundaries but requires fitting both restricted and unrestricted models.\n\n**When to use**\n\nMLE is the default estimation method for all parametric regression and survival models in\nRWE/HEOR: logistic regression for binary outcomes, Poisson or negative-binomial for counts,\nCox partial likelihood for time-to-event outcomes, GLMs for any exponential-family outcome,\nand mixed models for clustered or longitudinal data. Use likelihood-ratio tests for nested\nmodel comparison — adding a covariate, testing an interaction, or comparing distributional\nfamilies. Use profile likelihood CIs when Wald CIs may be unreliable: sparse cells, small\nn, parameters near 0 or 1, or any setting where a separation or convergence warning appears.\nUse AIC and BIC for non-nested model selection across link functions, distributional families,\nor covariate subsets.\n\n**When NOT to use**\n\nDo not rely on Wald tests and symmetric Wald CIs when parameters are near the boundary of\ntheir parameter space — log-odds near ±∞, variance near 0, probability near 0 or 1 — use\nprofile likelihood CIs or exact methods instead. Do not ignore separation warnings in logistic\nregression: a very large estimated coefficient paired with a very large standard error signals\nthat the MLE has not converged to a finite value; apply Firth penalization rather than reporting\nthe diverged estimate. 
Do not use standard likelihood-based standard errors when the model is\nknown to be misspecified (clustered data analyzed without cluster-level terms, overdispersed\ncount data fit with a plain Poisson model); report sandwich SEs or re-specify the model instead.\nFor very small samples (fewer than approximately 20 observations), consider exact methods or a\nBayesian analysis with informative priors rather than relying on the large-sample normal\napproximation underlying Wald inference.\n\n**Interpreting the output**\n\nIn the worked example, seven events are observed in ten independent Bernoulli trials. The MLE\nfor the event probability is p-hat = 7/10 = 0.7. The binomial likelihood at p = 0.7 is\napproximately 0.267, versus approximately 0.117 at p = 0.5 — the observed data are more than\ntwice as probable under p = 0.7 as under p = 0.5. The Wald variance is 0.7 * 0.3 / 10 = 0.021,\ngiving SE approximately 0.145 and a Wald 95% CI of approximately [0.42, 0.98]. With n = 10,\nthe Clopper-Pearson exact CI of approximately [0.35, 0.93] is more\nreliable because the Wald interval assumes a symmetric normal approximation for p-hat that is\nquestionable at this sample size.\n\n*(1) Formal interpretation.* p-hat = 0.7 is the value of the event probability that maximizes\nthe binomial log-likelihood given 7 events in 10 trials. The Wald 95% CI [0.42, 0.98] is a\nrepeated-sampling interval: in approximately 95% of repeated experiments of the same design,\na CI constructed by this procedure would contain the true parameter value. The specific interval\n[0.42, 0.98] is one realization of this procedure — it does not mean the true p lies within\nthis interval with 95% probability. The likelihood-ratio statistic comparing p-hat = 0.7 to the\nnull p = 0.5 is approximately 1.64 on 1 degree of freedom, well below the chi-squared critical\nvalue of 3.84, so p = 0.5 cannot be excluded at alpha = 0.05 given only 10 trials. 
The MLE is\na conditional statement about the parameter given the observed data; causal language is not\nappropriate unless the study design supports a causal estimand.\n\n*(2) Practical interpretation.* If 7 of 10 pilot patients responded to a treatment, the best\nsingle estimate of the true response rate is 70%. But the data are sparse: the Wald 95% CI spans\nfrom roughly 42% to 98%, meaning the true rate could plausibly be near a coin flip or near\ncertainty. A decision-maker should not treat 70% as a precise characterization of treatment\nefficacy — the sample is too small. A study of approximately 100 or more patients would be\nneeded to narrow the CI enough for pricing or coverage decisions, and at sample sizes as small\nas n = 10 profile likelihood or exact CIs should be reported in place of Wald CIs because the\nnormal approximation is unreliable.",
    "primary_category": "Inferential_Statistics",
    "tags": [
      "statistics",
      "foundations",
      "estimation",
      "likelihood",
      "maximum-likelihood",
      "log-likelihood",
      "AIC",
      "BIC",
      "deviance",
      "Wald-test",
      "likelihood-ratio-test",
      "score-test",
      "profile-likelihood",
      "Fisher-information",
      "model-fitting",
      "parametric"
    ],
    "applies_to_study_types": [
      "cohort_retrospective",
      "cohort_prospective",
      "rct",
      "cross_sectional",
      "active_comparator_new_user",
      "claims_analysis",
      "ehr_study",
      "registry_study"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "primary",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1016/S0022-2496(02)00028-7",
        "url": "https://doi.org/10.1016/S0022-2496(02)00028-7",
        "citation_text": "Myung IJ. Tutorial on maximum likelihood estimation. Journal of Mathematical Psychology. 2003;47(1):90-100.",
        "year": 2003,
        "authors_short": "Myung",
        "notes": "Accessible tutorial covering the likelihood function, log-likelihood, score, Fisher information, and large-sample properties of the MLE; widely cited as the entry-level reference for applied researchers. Covers the binomial worked example and Wald/LR inference in detail, making it directly appropriate for RWE practitioners encountering MLE theory."
      },
      {
        "role": "explain",
        "doi": "10.1093/aje/kwt245",
        "url": "https://doi.org/10.1093/aje/kwt245",
        "citation_text": "Cole SR, Chu H, Greenland S. Maximum likelihood, profile likelihood, and penalized likelihood: a primer. American Journal of Epidemiology. 2014;179(2):252-260.",
        "year": 2014,
        "authors_short": "Cole et al.",
        "notes": "Epidemiology-focused primer contrasting Wald, profile likelihood, and Firth penalized likelihood for logistic regression; explains concretely why profile CIs outperform Wald CIs near parameter boundaries and in sparse data — directly applicable to small RWE cohorts and case-control studies where separation and sparse-cell problems are common."
      }
    ],
    "plain_language_summary": "Maximum likelihood estimation (MLE) finds the model parameter values that make your observed data as probable as possible under an assumed statistical model — think of it as asking \"which settings would have produced these exact numbers most often if we repeated the study many times?\" For a simple example, if 7 of 10 patients had an event, MLE says the best estimate of the true event rate is 70%, because that value makes those exact results more probable than any other. MLE is the estimation engine behind logistic regression, Cox survival models, Poisson models, and virtually every other regression tool used in real-world evidence research, so understanding it unlocks the logic underlying the standard errors, confidence intervals, and model-comparison statistics you see in every analysis output. Its main limitation is that results can become unreliable when data are sparse or when a predictor perfectly separates the outcome groups, in which case specialized corrections such as Firth penalization are needed.",
    "key_terms": [
      {
        "term": "likelihood function",
        "definition": "The probability of the observed data treated as a function of an unknown model parameter, telling you how plausible each possible parameter value is given what you actually observed."
      },
      {
        "term": "log-likelihood",
        "definition": "The natural logarithm of the likelihood function; it converts products of probabilities into sums and is always maximized at the same parameter value as the likelihood itself."
      },
      {
        "term": "Wald standard error",
        "definition": "The uncertainty estimate for an MLE-based coefficient, derived from how steeply curved the log-likelihood is at its peak; it is the SE shown next to every coefficient in a regression output table."
      },
      {
        "term": "profile likelihood confidence interval",
        "definition": "A confidence interval built by inverting the likelihood-ratio test rather than using a symmetric normal approximation; more accurate than the Wald interval when estimates are near boundary values or sample sizes are small."
      },
      {
        "term": "AIC",
        "definition": "The Akaike Information Criterion, equal to minus twice the maximized log-likelihood plus twice the number of parameters; lower AIC indicates a better balance of model fit and complexity when comparing models on the same dataset."
      },
      {
        "term": "separation",
        "definition": "A situation in logistic regression where a predictor perfectly predicts the outcome, causing the MLE to diverge to infinity and standard errors to become meaninglessly large; the standard fix is Firth penalized estimation."
      }
    ],
    "worked_example": {
      "scenario": "A clinical trialist runs a small pilot study of 10 independent patients. Each patient either has the primary event (outcome = 1) or does not (outcome = 0). Seven patients have the event. The trialist wants to estimate the true event probability p using maximum likelihood, confirm that p = 0.7 is the MLE, compare the likelihood at p = 0.7 versus the null hypothesis value p = 0.5, and compute a Wald 95% confidence interval for the estimate.",
      "dataset": {
        "caption": "Raw binary outcomes for 10 independent trial participants. Seven have the event (outcome = 1); three do not (outcome = 0). MLE is computed on these 10 observations.",
        "columns": [
          "patient_id",
          "outcome"
        ],
        "rows": [
          [
            "P01",
            1
          ],
          [
            "P02",
            1
          ],
          [
            "P03",
            1
          ],
          [
            "P04",
            1
          ],
          [
            "P05",
            1
          ],
          [
            "P06",
            1
          ],
          [
            "P07",
            1
          ],
          [
            "P08",
            0
          ],
          [
            "P09",
            0
          ],
          [
            "P10",
            0
          ]
        ]
      },
      "steps": [
        "Count events and trials: 7 events observed in 10 independent trials. The analytic MLE for the binomial probability parameter is the sample proportion: p_hat = 7/10 = 0.7.",
        "Confirm by calculus. The log-likelihood (excluding the constant log(C(10,7))) is l(p) = 7*log(p) + 3*log(1-p). The score is dl/dp = 7/p - 3/(1-p). Setting the score to zero gives 7*(1-p) = 3*p, so 7 - 7*p = 3*p, so 7 = 10*p, confirming p_hat = 7/10 = 0.7 as the exact maximum of the log-likelihood.",
        "Compare likelihoods. The binomial coefficient C(10,7) = 120 appears in L(p) = 120 * p^7 * (1-p)^3 at every value of p and cancels in any ratio, so it does not affect which p is the MLE. At p=0.7 the likelihood L(0.7) is approximately 0.267; at p=0.5 the likelihood L(0.5) is approximately 0.117. The observed data are more than twice as probable under p=0.7 as under p=0.5.",
        "Compute the Wald variance and standard error. The variance of the MLE p-hat is p_hat*(1-p_hat)/n. Variance = 0.7 * 0.3 / 10 = 0.021. SE = sqrt(0.021) approximately 0.145.",
        "Construct the Wald 95% CI: p_hat +/- 1.96 * SE approximately 0.7 +/- 0.284, giving approximately [0.42, 0.98]. Because n=10 is small and p-hat=0.7 is not close to 0.5, a profile likelihood (Clopper-Pearson exact) CI of approximately [0.35, 0.93] is more reliable here — it correctly captures the asymmetry of the log-likelihood at this sample size and does not extend above 1.0 as Wald can when p-hat is large."
      ],
      "result": "p_hat = 7/10 = 0.7. Variance of p_hat = 0.7 * 0.3 / 10 = 0.021. Wald SE approximately 0.145. Wald 95% CI approximately [0.42, 0.98]. L(0.7) approximately 0.267 versus L(0.5) approximately 0.117: p=0.7 makes the observed data more than twice as probable as p=0.5. The LR statistic comparing p=0.7 to p=0.5 is approximately 1.64 on 1 degree of freedom, below the chi-squared critical value of 3.84, so p=0.5 cannot be excluded at alpha=0.05 with n=10. The Clopper-Pearson profile CI of approximately [0.35, 0.93] is preferred."
    },
    "prerequisites": [
      "inferential-statistics-foundations",
      "parametric-vs-nonparametric-tests"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [],
    "tradeoffs": [],
    "implementation_notes_by_data_source": {},
    "implementations": [
      {
        "lang": "python",
        "code": "import math\nimport numpy as np\nfrom scipy import optimize, stats\nfrom scipy.stats import beta as beta_dist\nimport statsmodels.formula.api as smf\nimport pandas as pd\n\n# ── Part 1: Binomial MLE — matches worked example (7 events in 10 trials) ──\nn_trials, n_events = 10, 7\n\ndef neg_log_lik(p):\n    \"\"\"Negative binomial log-likelihood; C(n,k) constant omitted (doesn't change MLE).\"\"\"\n    if p <= 0 or p >= 1:\n        return 1e10\n    return -(n_events * math.log(p) + (n_trials - n_events) * math.log(1 - p))\n\nresult = optimize.minimize_scalar(neg_log_lik, bounds=(1e-9, 1 - 1e-9), method=\"bounded\")\np_hat = result.x\nprint(f\"MLE p-hat = {p_hat:.4f}  (analytic: {n_events}/{n_trials} = {n_events/n_trials})\")\n\n# Likelihood comparison at p=0.7 vs p=0.5 (C(10,7)=120 appears in both; cancels in ratio)\nbinom_coef = math.comb(10, 7)                    # = 120\nL_07 = binom_coef * (0.7 ** 7) * (0.3 ** 3)\nL_05 = binom_coef * (0.5 ** 10)\nprint(f\"L(0.7) = {L_07:.4f},  L(0.5) = {L_05:.4f},  ratio = {L_07/L_05:.2f}\")\n\n# Wald SE and 95% CI\nwald_se = math.sqrt(p_hat * (1 - p_hat) / n_trials)\nwald_lo, wald_hi = p_hat - 1.96 * wald_se, p_hat + 1.96 * wald_se\nprint(f\"Wald SE = {wald_se:.4f},  Wald 95% CI = [{wald_lo:.3f}, {wald_hi:.3f}]\")\n\n# Profile likelihood (Clopper-Pearson exact) CI — preferred at small n or near boundaries\ncp_lo = beta_dist.ppf(0.025, n_events,     n_trials - n_events + 1)\ncp_hi = beta_dist.ppf(0.975, n_events + 1, n_trials - n_events)\nprint(f\"Clopper-Pearson exact CI = [{cp_lo:.3f}, {cp_hi:.3f}]\")\n\n# LR statistic vs H0: p = 0.5 (chi-sq 1 df)\nll_07   = n_events * math.log(0.7) + (n_trials - n_events) * math.log(0.3)\nll_05   = n_events * math.log(0.5) + (n_trials - n_events) * math.log(0.5)\nlr_stat = 2 * (ll_07 - ll_05)\nlr_p    = stats.chi2.sf(lr_stat, df=1)\nprint(f\"LR statistic = {lr_stat:.3f},  p = {lr_p:.4f}  (H0: p=0.5)\")\n\n# SE via observed Fisher information from Hessian\nopt_h   = 
optimize.minimize(neg_log_lik, x0=[0.5], method=\"L-BFGS-B\",\n                            bounds=[(1e-9, 1-1e-9)], options={\"gtol\": 1e-12})\n# Note: scipy finite-diff Hessian; use numdifftools for analytic Hessian in production\nfisher_info = 1 / (p_hat * (1 - p_hat) / n_trials)    # analytic Fisher info for binomial\nprint(f\"SE from analytic Fisher information: {math.sqrt(1/fisher_info):.4f}\")\n\n# ── Part 2: Logistic regression via statsmodels (MLE via IRLS / Newton-Raphson) ──\nnp.random.seed(42)\ndf = pd.DataFrame({\"x\": np.random.normal(size=200),\n                   \"y\": np.random.binomial(1, 0.4, size=200)})\nfit = smf.logit(\"y ~ x\", data=df).fit(disp=False)\nprint(f\"\\nLog-likelihood at MLE:  {fit.llf:.4f}\")\nprint(f\"Null log-likelihood:    {fit.llnull:.4f}\")\nprint(f\"LR statistic (2*delta): {-2*(fit.llnull - fit.llf):.4f}\")\nprint(f\"AIC = {fit.aic:.4f},  BIC = {fit.bic:.4f}\")\n# Profile likelihood CIs — more accurate than Wald near boundaries\nprint(\"\\nProfile likelihood 95% CIs:\")\nprint(fit.conf_int(method=\"profile\"))\nprint(\"\\nWald 95% CIs (symmetric, may be unreliable near |coef| >> 0):\")\nprint(fit.conf_int(method=\"normal\"))",
        "description": "Binomial MLE by numerical optimization via scipy.optimize, with likelihood comparison at\ntwo parameter values, Wald SE and CI, and Clopper-Pearson profile likelihood interval.\nFollowed by logistic regression via statsmodels illustrating log-likelihood, deviance,\nAIC, BIC, and profile likelihood CIs. Uses only scipy and statsmodels (stdlib-sized deps).",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "# ── Part 1: Binomial MLE (matches worked example: 7 events in 10 trials) ──\nn_trials <- 10L; n_events <- 7L\n\n# Negative log-likelihood (excluding constant log(C(n,k)))\nneg_log_lik <- function(p) {\n  if (p <= 0 || p >= 1) return(1e10)\n  -(n_events * log(p) + (n_trials - n_events) * log(1 - p))\n}\n\nres   <- optimize(neg_log_lik, interval = c(1e-9, 1 - 1e-9))\np_hat <- res$minimum\ncat(sprintf(\"MLE p-hat = %.4f  (analytic: %d/%d = %.4f)\\n\",\n            p_hat, n_events, n_trials, n_events / n_trials))\n\n# Likelihood at p=0.7 vs p=0.5\nL_07 <- choose(10, 7) * 0.7^7 * 0.3^3\nL_05 <- choose(10, 7) * 0.5^10\ncat(sprintf(\"L(0.7) = %.4f,  L(0.5) = %.4f,  ratio = %.2f\\n\", L_07, L_05, L_07/L_05))\n\n# Wald SE and 95% CI\nwald_se <- sqrt(p_hat * (1 - p_hat) / n_trials)\ncat(sprintf(\"Wald SE = %.4f,  95%% CI = [%.3f, %.3f]\\n\",\n            wald_se, p_hat - 1.96*wald_se, p_hat + 1.96*wald_se))\n\n# Clopper-Pearson exact (profile likelihood) CI — preferred at small n\ncp_lo <- qbeta(0.025, n_events,     n_trials - n_events + 1)\ncp_hi <- qbeta(0.975, n_events + 1, n_trials - n_events)\ncat(sprintf(\"Clopper-Pearson exact CI = [%.3f, %.3f]\\n\", cp_lo, cp_hi))\n\n# LR statistic vs H0: p = 0.5\nll_07   <- n_events * log(0.7) + (n_trials - n_events) * log(0.3)\nll_05   <- n_events * log(0.5) + (n_trials - n_events) * log(0.5)\nlr_stat <- 2 * (ll_07 - ll_05)\nlr_p    <- pchisq(lr_stat, df = 1, lower.tail = FALSE)\ncat(sprintf(\"LR statistic = %.3f,  p = %.4f  (H0: p=0.5)\\n\", lr_stat, lr_p))\n\n# SE from observed Fisher information via Hessian (optim with hessian=TRUE)\nopt_h    <- optim(par = 0.5, fn = neg_log_lik, method = \"L-BFGS-B\",\n                  lower = 1e-9, upper = 1 - 1e-9, hessian = TRUE)\nse_hess  <- sqrt(1 / opt_h$hessian[1, 1])\ncat(sprintf(\"SE from observed Fisher information (Hessian): %.4f\\n\", se_hess))\n\n# ── Part 2: Logistic regression via glm() (MLE by IRLS) ──\nset.seed(42)\ndf   <- data.frame(x = rnorm(200), y = 
rbinom(200, 1, 0.4))\nfit  <- glm(y ~ x, data = df, family = binomial(link = \"logit\"))\nfit0 <- glm(y ~ 1, data = df, family = binomial)\n\ncat(\"\\nLogistic regression coefficients:\\n\")\nprint(summary(fit)$coefficients)\ncat(sprintf(\"Log-likelihood at MLE: %.4f\\n\", logLik(fit)))\ncat(sprintf(\"AIC = %.4f,  BIC = %.4f\\n\", AIC(fit), BIC(fit)))\n\n# LR test via anova() — preferred over Wald z-test for adding / dropping a covariate\nlr_glm <- anova(fit0, fit, test = \"Chisq\")\nprint(lr_glm)\ncat(\"Note: anova() LR test preferred over Wald z for nested model comparison.\\n\")\n\n# Profile likelihood CIs via MASS::confint.glm (inverts LR test)\n# Uncomment if MASS is available:\n# library(MASS); confint(fit)   # profile CIs — asymmetric; more reliable near boundaries",
        "description": "Binomial MLE via optimize() and optim() with Hessian-based SE, likelihood comparison,\nWald CI, and Clopper-Pearson exact interval. Followed by logistic regression via glm()\nwith log-likelihood, deviance, AIC, BIC, and a likelihood-ratio test via anova().\nUses base R only; no external packages required.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* ─── Part 1: Binomial MLE with PROC NLMIXED (custom log-likelihood) ─── */\n/* Aggregate data: 7 events in 10 trials (one observation row)             */\ndata work.binom_data;\n  n_trials = 10;\n  n_events = 7;\nrun;\n\nproc nlmixed data=work.binom_data;\n  /* Log-likelihood: n_events*log(p) + (n_trials - n_events)*log(1-p)     */\n  /* Constant log(C(10,7)) = log(120) omitted; does not affect the MLE    */\n  ll = n_events * log(p) + (n_trials - n_events) * log(1 - p);\n  model n_events ~ general(ll);     /* GENERAL() accepts any log-likelihood */\n  parms p = 0.5;                    /* starting value for optimization      */\n  bounds 0 < p < 1;                 /* constrain p to (0,1)                 */\n  /* Output: MLE p-hat, Wald SE from observed Hessian, Wald 95% CI         */\n  /* For profile likelihood CI use PROC NLMIXED with ESTIMATE statement     */\nrun;\n\n/* ─── Part 2: Logistic regression — PROC LOGISTIC ─── */\n/* Simulate data: binary y ~ logistic(x), n=200                           */\ndata work.logistic_data;\n  call streaminit(42);\n  do i = 1 to 200;\n    x = rand(\"Normal\", 0, 1);\n    y = (rand(\"Uniform\") < 1 / (1 + exp(-(-0.4 + 0.3*x))));\n    output;\n  end;\n  drop i;\nrun;\n\nproc logistic data=work.logistic_data descending;\n  model y(event='1') = x\n    / clparm=wald         /* Wald CIs: symmetric, reported by default              */\n      clparm=pl           /* Profile-likelihood CIs: asymmetric, preferred          */\n      lackfit             /* Hosmer-Lemeshow goodness-of-fit test (10 groups)       */\n      rsq;                /* McFadden R-square = 1 - LL_model/LL_null               */\n  /* Key output sections:                                                  */\n  /*  \"Testing Global Null\" — LR, Score, and Wald tests (model vs null)   */\n  /*  \"Model Fit Statistics\" — -2 log L, AIC, SC (=BIC) at MLE and null  */\n  /*  \"Parameter Estimates\" — MLE coefficients, Wald chi-sq, Wald p       */\n  /*  
\"Profile Likelihood CIs\" — asymmetric; use these in sparse settings */\n  ods select ParameterEstimates CLparmWald CLparmPL FitStatistics GlobalTests;\nrun;\n\n/* ─── Part 3: Nested model comparison via LR test (deviance difference) ─── */\nproc logistic data=work.logistic_data descending;\n  model y(event='1') = x;\n  ods output FitStatistics=work.fit_with_x;\nrun;\nproc logistic data=work.logistic_data descending;\n  model y(event='1') = ;\n  ods output FitStatistics=work.fit_null;\nrun;\n/* LR statistic = difference in -2 log L values between the two models    */\n/* Distributed chi-sq(1) under H0 that x has no effect on log-odds of y   */\n/* Compare AIC and BIC: lower = better penalized fit on the same data      */",
        "description": "Binomial MLE via PROC NLMIXED with a user-supplied log-likelihood, demonstrating the general\ncustom-likelihood approach for any parametric model. Followed by PROC LOGISTIC requesting\nboth Wald and profile-likelihood (PL) confidence intervals, deviance, AIC, and BIC — the\nstandard output set for logistic regression in regulatory HEOR submissions.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  A[\"Observed data: 7 events in 10 trials\"] --> B[\"Specify model: X ~ Binomial(n=10, p=?)\"]\n  B --> C[\"Write likelihood: L(p) = C(10,7) * p^7 * (1-p)^3\"]\n  C --> D[\"Set score = 7/p - 3/(1-p) = 0 and solve\"]\n  D --> E[\"MLE: p-hat = 7/10 = 0.7\"]\n  E --> F[\"Wald SE = sqrt(0.7*0.3/10) approximately 0.145\"]\n  F --> G[\"Wald 95% CI approximately (0.42, 0.98)\"]\n  E --> H[\"Profile CI: Clopper-Pearson approximately (0.35, 0.93)\"]\n  E --> I[\"LR vs H0 p=0.5: 2*(LL(0.7) - LL(0.5)) approximately 1.64\"]\n  I --> J[\"chi-sq(1) critical = 3.84: fail to reject H0 at alpha=0.05 with n=10\"]",
        "caption": "Step-by-step MLE for a binomial proportion: from observed data through likelihood specification, score equation, MLE solution, Wald and profile likelihood confidence intervals, and a likelihood-ratio test against the null hypothesis p = 0.5.",
        "alt_text": "Flowchart tracing the MLE workflow: observed binary data, model specification, likelihood function, score equation solved for p-hat = 0.7, Wald SE and CI, Clopper-Pearson exact CI, and LR test against H0 p = 0.5 yielding a non-significant result with n = 10.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "requires",
        "target_slug": "inferential-statistics-foundations",
        "notes": "MLE builds directly on sampling distributions, hypothesis testing, and confidence interval concepts; the score, Fisher information, and asymptotic normality all assume familiarity with the sampling distribution framework introduced in inferential foundations."
      },
      {
        "relation_type": "see_also",
        "target_slug": "generalized-linear-models",
        "notes": "Every GLM — logistic, Poisson, gamma, negative-binomial — is estimated by MLE; the deviance, AIC/BIC, and likelihood-ratio tests for GLMs are direct applications of the log-likelihood machinery described in this entry."
      },
      {
        "relation_type": "see_also",
        "target_slug": "firth-penalized-regression-rwe",
        "notes": "Firth penalized likelihood is the recommended correction when MLE diverges due to complete or quasi-complete separation in logistic regression; it modifies the score equation by adding a Jeffreys prior penalty to the log-likelihood."
      },
      {
        "relation_type": "see_also",
        "target_slug": "exact-methods-sparse-data-rwe",
        "notes": "When MLE asymptotic approximations break down because of sparse data or boundary parameters, exact methods such as conditional MLE, mid-p correction, and Clopper-Pearson intervals are the recommended alternatives to Wald-based inference."
      },
      {
        "relation_type": "used_with",
        "target_slug": "regression-diagnostics",
        "notes": "AIC, BIC, and deviance — the primary outputs of any MLE-based model selection workflow — appear throughout regression diagnostic practice; AIC/BIC-based family selection complements residual and goodness-of-fit checks."
      }
    ],
    "aliases": [
      "MLE",
      "log-likelihood",
      "likelihood function",
      "Wald test",
      "likelihood-ratio test",
      "profile likelihood"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "maxsprt-sequential-safety-surveillance-rwe",
    "name": "MaxSPRT and Sequential Safety Surveillance",
    "short_definition": "A prospective, near-real-time method for monitoring a newly used drug or vaccine in which the cumulative count of an adverse event is compared, at repeated looks as data accrue, against the count expected under no excess risk - using the maximized sequential probability ratio test (MaxSPRT), whose log-likelihood-ratio statistic is checked at every look against a pre-computed critical boundary that spends a fixed total false-alarm probability (alpha) across all the looks, so the surveillance can declare a signal as early as the evidence warrants without inflating the false-positive rate from repeated testing.",
    "long_description": "Active safety surveillance asks a deceptively simple question and answers it under brutal constraints. The question:\nis this newly marketed drug, or this season's vaccine, causing an adverse event more often than background? The\nconstraints: you want to know **as early as possible** (the public-health value of an early signal is enormous), you\nare **looking repeatedly** as weekly or monthly data feeds arrive, and every look is another chance to cry wolf.\nNaive repeated significance testing - run a chi-square each month, signal the first time p < 0.05 - inflates the\nfalse-alarm rate without bound: with enough looks you will eventually cross 0.05 by chance even when nothing is wrong.\n**Sequential analysis** solves this by designing the stopping rule up front so that the *total* probability of ever\nfalsely signaling, across all the planned looks, equals a fixed alpha. This is the machinery behind the FDA Sentinel\nSystem and the CDC/Vaccine Safety Datalink (VSD) Rapid Cycle Analysis programs.\n\n**The MaxSPRT statistic.** Wald's classic sequential probability ratio test (SPRT) pits a fixed null (relative risk\nRR = 1) against a fixed, pre-specified alternative (say RR = 2). Its weakness in surveillance is that you rarely know\nthe true effect size you are hunting; if you guess the alternative wrong, power collapses. Kulldorff et al. (2011)\n*maximize* the likelihood ratio over all RR > 1 at each look, removing the need to pre-specify the alternative. For\nPoisson data the **log-likelihood ratio (LLR)** at a look with cumulative observed events n and cumulative expected\nevents mu (the count you would see under RR = 1) is, when n > mu, `LLR = n * ln(n / mu) + mu - n`, and 0 otherwise.\nYou compute this LLR at every look and signal the first time it crosses a **critical boundary** (a critical value on\nthe LLR scale). 
That boundary is not 1.96-anything; it is solved by simulation/recursion so that the chance of the\nLLR *ever* exceeding it during the whole surveillance, under the null, equals alpha (commonly one-sided 0.05). The\nboundary depends on the maximum length of surveillance (expressed as a maximum expected count, the \"sample size\") and\nthe look schedule, and it is the device that **spends alpha** across the looks. MaxSPRT comes in flavors keyed to the\ncomparator: **Poisson MaxSPRT** when expected counts come from a historical/background rate; **binomial MaxSPRT** when\neach case is matched to concurrent comparators so the expected fraction of events falling on the exposed is fixed by\nthe matching ratio; and **conditional MaxSPRT (CmaxSPRT)** when the historical comparison data are themselves limited\nand their sampling uncertainty must be folded in.\n\n**Group-sequential alternatives.** MaxSPRT is, in effect, a continuous-inspection / fully-sequential boundary (look\nafter every event or every tiny batch). The other school is **group-sequential** monitoring (Pocock, O'Brien-Fleming,\nor error-spending boundaries of Lan-DeMets type) that tests at a handful of pre-planned analysis times. Group\nboundaries are more familiar to trialists, are simple to communicate, and concentrate alpha at later looks\n(O'Brien-Fleming) so an early signal must be very strong; MaxSPRT's flat-ish LLR boundary is tuned for **earliest\npossible detection** with many looks. The Cook/Nelson line of work showed these can be unified under an\nerror-spending view, and Sentinel/VSD use both depending on the product and feed cadence.\n\n**Interpreting the output**\n\nFrom the worked example: cumulative expected counts mu = 1, 2, 3, 4 at looks 1–4;\ncumulative observed n = 1, 4, 7, 10. Poisson LLR values: Look 1 = 0, Look 2 ≈ 0.77,\nLook 3 ≈ 1.93, Look 4 ≈ 3.16. Pre-computed critical boundary ≈ 3.0. 
Signal is declared\nat Look 4 (LLR 3.16 ≥ 3.0): observed 10 events vs expected 4.\n\nFormal interpretation: The LLR of 3.16 at Look 4 exceeds the pre-specified sequential\nboundary of 3.0, declaring a signal. The boundary was computed so that, across every look\nplanned up to the maximum expected count of 5, the probability of the LLR ever exceeding it\nunder the true null (RR = 1) is controlled at alpha = 0.05. This is an alpha-spending statement, not a\nfixed-sample p-value. The LLR crossing does not mean p < 0.05 in the ordinary sense;\nit means the cumulative evidence has crossed a pre-specified sequential threshold that\naccounts for the repeated looks. The observed-to-expected ratio at Look 4 is\n10 / 4 = 2.5, suggesting a rate approximately 2.5 times the background, but the MaxSPRT\ntest itself does not produce a point estimate or confidence interval; it answers only \"has the\nsequential boundary been crossed?\"\n\nPractical interpretation: A declared signal triggers investigation, not action. The\nnext step is to examine whether the expected count mu was correctly specified (biased\nhistorical rates produce false signals), to run a self-controlled or matched-cohort\nanalysis on the same data to assess confounding, and to calibrate against negative\ncontrols. The LLR boundary crossing means the repeated-looks false-positive rate has\nbeen controlled; it does not mean the drug caused the event.\n\n**Pros, cons, and trade-offs** (specific and comparative).\n- **vs disproportionality / signal-detection (PRR, ROR, EBGM on spontaneous reports):** Disproportionality mining is\n  *hypothesis-generating* on a passive, denominator-free spontaneous-report database - it tells you a drug-event pair\n  is reported more than expected *relative to other reports*, with no person-time and no control of when you looked.\n  MaxSPRT is *hypothesis-testing in an enumerated cohort* with a real expected count from real follow-up time and a\n  formal, time-aware error rate. 
**Prefer disproportionality** for cheap, broad, all-pairs scanning of post-market\n  reports; **prefer MaxSPRT** when you have a defined population, a pre-specified event, and you need a statistically\n  honest early-stopping rule. They are complementary stages, not substitutes.\n- **vs a one-shot cohort/SCRI analysis at the end of accrual:** A single final analysis is the most powerful test for\n  a fixed sample size and needs no alpha-spending. But it forfeits the entire point of surveillance - you learn the\n  answer only after every exposed person is already exposed. MaxSPRT trades a modest amount of final-look power\n  (because alpha was partly spent earlier) for the ability to stop and act years sooner. **Prefer the one-shot\n  analysis** when there is no decision to make before accrual completes; **prefer MaxSPRT** when an early signal would\n  change practice, labeling, or recall.\n- **vs group-sequential (O'Brien-Fleming / Lan-DeMets):** Both control the family-wise error across looks. Continuous\n  MaxSPRT detects a true effect earliest on average and is natural when data stream in case-by-case; group-sequential\n  is simpler to administer, easier to explain to a safety board, and its late-alpha boundaries resist over-reacting\n  to early noise. 
**Prefer group-sequential** with few, scheduled analyses and a conservative early stance; **prefer\n  MaxSPRT** for high-frequency feeds where the earliest defensible signal is the goal.\n\n**When to use.** Prospective monitoring of a *pre-specified* drug-event or vaccine-event pair in a population you can\nenumerate over time (Sentinel distributed claims, VSD linked EHR/immunization data, a device registry): post-licensure\nvaccine safety (febrile seizures, intussusception, Guillain-Barre after a new vaccine), a newly approved drug with a\nsignal of concern from trials or spontaneous reports that you now want to confirm/refute with a formal early-stopping\nrule, or any setting where the cost of a late signal is high and data arrive in repeated batches. It needs a credible\n**expected count** (historical background rate, or a concurrent matched comparator) and a locked event definition.\n\n**When NOT to use - and when it is actively misleading.**\n- **No trustworthy expected count.** Poisson MaxSPRT is only as good as mu. If the historical background rate is\n  biased by changing coding, secular trends, a different population, or surveillance artifacts, the LLR is anchored to\n  the wrong null and the \"signal\" is an expected-count error, not a drug effect. With thin historical data, move to\n  **conditional MaxSPRT** (which propagates that uncertainty) or a concurrent-comparator **binomial** design.\n- **Unstable or evolving case definition.** Sequential testing assumes the thing you count means the same thing at\n  look 1 and look 20. If the event algorithm's sensitivity/PPV drifts (a new ICD code, a care-pattern shift, a feed\n  that backfills late), the accruing count is contaminated and the boundary's error guarantee is void. Lock the case\n  definition and the data-lag handling before the first look.\n- **Confounding and channeling, unaddressed.** MaxSPRT controls the *repeated-looks* error, not confounding. 
A\n  historical-comparator Poisson design makes no adjustment for who got the drug; if early adopters are sicker\n  (channeling), an excess of events is confounding masquerading as a safety signal. Pair surveillance with\n  **negative-control / empirical-calibration** diagnostics, matched or risk-set comparators, or a self-controlled\n  design that nulls out time-fixed confounding - the boundary crossing is the start of an investigation, not a verdict.\n- **Treating the crossing as causal proof.** A signal triggers **refinement**: re-examine the expected count, run a\n  self-controlled risk-interval or matched-cohort analysis on the same data, calibrate against negative controls,\n  check for data-lag and immortal-time artifacts, and only then escalate. The LLR boundary answers \"is this more than\n  repeated-testing noise?\", not \"is this caused by the drug?\".\n\n**Data-source operational depth.**\n- **Claims (Sentinel-style distributed data):** Exposed person-time and events are built per data partner and pooled;\n  the **expected count** mu is usually a historical incidence rate (events per person-time) multiplied by the accrued\n  exposed person-time in the risk window. Watch **claims maturity / data lag** - the most recent months are\n  incomplete, so a look run too soon undercounts both numerator and the person-time denominator. Risk windows\n  (self-controlled risk interval logic) define which events count as exposed. Multi-site pooling needs a consistent\n  event algorithm across partners.\n- **EHR / linked immunization data (VSD-style):** The substrate for vaccine Rapid Cycle Analysis - immunization dates\n  are precise and a concurrent comparison group (a comparator vaccine, or a matched unexposed interval) is often\n  available, favoring **binomial MaxSPRT**. 
Chart-confirmable outcomes let you validate the event algorithm, but\n  free-text and coding lag still threaten the stable-definition assumption.\n- **Registry:** Disease/product registries can give clean prospective ascertainment and an internal comparator, good\n  for binomial designs; the cost is slower accrual and incomplete capture outside the registry's footprint.\n- **Linked claims-EHR-registry:** The strongest base - claims for complete person-time and exposure, EHR for\n  chart-validated events and labs to refine a crossed boundary, registry for adjudicated outcomes; reconcile feed lags\n  across sources before setting each look date so the expected and observed counts are measured over the same mature\n  window.\n\n**Worked surveillance example (one look schedule).** A new vaccine is monitored monthly for an adverse event whose\nbackground gives an **expected 1.0 event per month** of accrued exposed risk-window person-time, so cumulative\nexpected mu = 1, 2, 3, 4 at looks 1-4. Cumulative observed events come in as n = 1, 4, 7, 10. The Poisson LLR is 0\nwhenever n <= mu, else `n * ln(n / mu) + mu - n`. Look 1: n = mu = 1 so LLR = 0. Look 2: `4 * 0.6931 + 2 - 4 =\n2.7724 - 2 = 0.77`. Look 3: `7 * 0.8473 + 3 - 7 = 5.9311 - 4 = 1.93`. Look 4: `10 * 0.9163 + 4 - 10 = 9.163 - 6 =\n3.16`. The pre-computed flat critical boundary for this design (one-sided alpha = 0.05, maximum expected count 5) is\nabout 3.0 on the LLR scale; the first look whose LLR reaches it is **look 4 (LLR 3.16 >= 3.00), so a signal is\ndeclared at month 4** - observed 10 vs expected 4, a more-than-twofold rate. Surveillance stops and refinement begins;\nhad no boundary been crossed by the maximum expected count, surveillance would close with no signal.",
    "primary_category": "Inferential_Statistics",
    "tags": [
      "maxsprt",
      "sequential-analysis",
      "safety-surveillance",
      "near-real-time",
      "poisson-maxsprt",
      "binomial-maxsprt",
      "conditional-maxsprt",
      "alpha-spending",
      "critical-boundary",
      "expected-vs-observed",
      "sentinel",
      "vaccine-safety-datalink",
      "pharmacovigilance"
    ],
    "applies_to_study_types": [
      "signal_detection",
      "claims_analysis",
      "ehr_study",
      "multi_database",
      "cohort_prospective",
      "drug_utilization"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1080/07474946.2011.539924",
        "url": "https://doi.org/10.1080/07474946.2011.539924",
        "citation_text": "Kulldorff M, Davis RL, Kolczak M, Lewis E, Lieu T, Platt R. A maximized sequential probability ratio test for drug and vaccine safety surveillance. Sequential Analysis. 2011;30(1):58-78.",
        "year": 2011,
        "authors_short": "Kulldorff et al.",
        "notes": "The originating paper. Introduces MaxSPRT by maximizing the likelihood ratio over all relative risks greater than one, removing the need to pre-specify the alternative, and gives the Poisson and binomial log-likelihood-ratio statistics and the alpha-spending critical boundaries used in prospective drug and vaccine safety surveillance."
      },
      {
        "role": "demonstrate",
        "doi": "10.1097/MLR.0b013e3180616c0a",
        "url": "https://doi.org/10.1097/MLR.0b013e3180616c0a",
        "citation_text": "Lieu TA, Kulldorff M, Davis RL, Lewis EM, Weintraub E, Yih K, Yin R, Brown JS, Platt R; Vaccine Safety Datalink Rapid Cycle Analysis Team. Real-time vaccine safety surveillance for the early detection of adverse events. Medical Care. 2007;45(10 Suppl 2):S89-S95.",
        "year": 2007,
        "authors_short": "Lieu et al.",
        "notes": "The Vaccine Safety Datalink Rapid Cycle Analysis program - the canonical operational application of MaxSPRT to near-real-time vaccine safety, showing how weekly accruing linked EHR/immunization data are tested against expected counts with sequential boundaries to detect adverse events as early as the evidence allows."
      }
    ],
    "plain_language_summary": "MaxSPRT is a way to watch a new drug or vaccine in real time and catch a safety problem as early as the data honestly allow. Every week or month, you count how many times a specific bad event actually happened in the people exposed, and compare it to how many you would expect if the product were harmless. Because you are peeking at the data over and over, a plain \"is it significant yet?\" test would raise false alarms by chance, so MaxSPRT uses a special threshold designed up front to keep the total false-alarm rate at a fixed level no matter how many times you look. When the running evidence crosses that threshold you declare a signal and investigate - the crossing means the excess is more than the noise of repeated peeking, not that the drug is proven guilty.\n",
    "key_terms": [
      {
        "term": "expected count",
        "definition": "How many events you would see so far if the product caused no extra risk - usually a background rate times the amount of exposed follow-up time accrued."
      },
      {
        "term": "observed count",
        "definition": "How many of the events have actually happened so far in the exposed people being watched."
      },
      {
        "term": "log-likelihood ratio (LLR)",
        "definition": "A single number summarizing how much more the observed events look like an excess than like the no-excess expectation at a given look; here it is zero unless observed exceeds expected."
      },
      {
        "term": "critical boundary",
        "definition": "A threshold on the LLR, worked out in advance, that the running test must cross to declare a signal."
      },
      {
        "term": "alpha spending",
        "definition": "Dividing a fixed total false-alarm budget across all the planned looks so that repeatedly peeking does not inflate the chance of a false signal."
      },
      {
        "term": "near-real-time surveillance",
        "definition": "Re-running the safety check on a fresh data feed every week or month as new exposures and events arrive, rather than waiting for one analysis at the end."
      }
    ],
    "worked_example": {
      "scenario": "A health system runs monthly Poisson MaxSPRT on a newly rolled-out vaccine, watching for one specific adverse event. From historical data the event is expected at a rate that adds 1.0 expected case per month of accrued exposed follow-up, so the cumulative expected count is 1, 2, 3, 4 at looks 1 through 4. The actual cumulative observed cases arriving from the data feed are 1, 4, 7, 10. We compute the log-likelihood ratio at each monthly look and compare it to a critical boundary (about 3.0 on the LLR scale) that was worked out in advance for the planned surveillance (monthly looks up to a maximum expected count of 5) so that the overall one-sided false-alarm rate is 0.05. We want the first month, if any, in which the evidence crosses the boundary.\n",
      "dataset": {
        "caption": "What the surveillance analyst sees each month - cumulative observed and expected counts at each look.",
        "columns": [
          "look",
          "look_date",
          "cum_observed_n",
          "cum_expected_mu"
        ],
        "rows": [
          [
            1,
            "2023-01-31",
            1,
            1.0
          ],
          [
            2,
            "2023-02-28",
            4,
            2.0
          ],
          [
            3,
            "2023-03-31",
            7,
            3.0
          ],
          [
            4,
            "2023-04-30",
            10,
            4.0
          ]
        ]
      },
      "steps": [
        "The LLR at a look is 0 when observed does not exceed expected (n <= mu); otherwise it is n times the natural log of n/mu, plus mu, minus n.",
        "Look 1 - observed equals expected (n = 1, mu = 1), so there is no excess and LLR = 0.",
        "Look 2 - n = 4, mu = 2, so n/mu = 2 whose natural log is 0.6931, giving LLR = 4*0.6931 + 2 - 4 = 2.7724 - 2 = 0.77.",
        "Look 3 - n = 7, mu = 3, so n/mu = 7/3 whose natural log is 0.8473, giving LLR = 7*0.8473 + 3 - 7 = 5.9311 - 4 = 1.93.",
        "Look 4 - n = 10, mu = 4, so n/mu = 2.5 whose natural log is 0.9163, giving LLR = 10*0.9163 + 4 - 10 = 9.163 - 6 = 3.16.",
        "Compare each LLR to the pre-computed critical boundary of about 3.00; the first look to reach it is look 4 (3.16 >= 3.00), so the signal is declared in month 4."
      ],
      "result": "The LLR climbs 0.00, 0.77, 1.93, 3.16 across the four monthly looks and crosses the 3.00 boundary at look 4, so a safety signal is declared in month 4 - cumulative observed 10 versus expected 4, that is, 2.5 times the expected count. Surveillance stops and refinement begins; the crossing shows the excess is more than repeated-testing noise, not that causation is proven.",
      "timeline_spec": {
        "title": "Monthly Poisson MaxSPRT - log-likelihood ratio crosses the critical boundary at look 4",
        "window": {
          "start": "2023-01-01",
          "end": "2023-04-30",
          "label": "Four monthly looks observed; one-sided alpha 0.05 for the planned surveillance (maximum expected count 5)"
        },
        "events": [
          {
            "label": "Look 1: n 1, mu 1, LLR 0.00",
            "start": "2023-01-01",
            "length_days": 31,
            "quantity": "observed 1 / expected 1"
          },
          {
            "label": "Look 2: n 4, mu 2, LLR 0.77",
            "start": "2023-02-01",
            "length_days": 28,
            "quantity": "observed 4 / expected 2"
          },
          {
            "label": "Look 3: n 7, mu 3, LLR 1.93",
            "start": "2023-03-01",
            "length_days": 31,
            "quantity": "observed 7 / expected 3"
          },
          {
            "label": "Look 4: n 10, mu 4, LLR 3.16",
            "start": "2023-04-01",
            "length_days": 30,
            "quantity": "observed 10 / expected 4"
          }
        ],
        "spans": [
          {
            "kind": "unexposed",
            "start": "2023-01-01",
            "end": "2023-03-31",
            "label": "Looks 1-3: LLR below 3.00 boundary, no signal"
          },
          {
            "kind": "exposed",
            "start": "2023-04-01",
            "end": "2023-04-30",
            "label": "Look 4: LLR 3.16 crosses boundary, signal"
          }
        ],
        "result": {
          "label": "Signal at look 4 (LLR 3.16 >= 3.00 boundary)",
          "value": 4
        }
      }
    },
    "prerequisites": [
      "incidence-rate-calculation-rwe",
      "self-controlled-risk-interval-rwe",
      "signal-detection"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Poisson MaxSPRT with historical expected counts",
        "description": "Cumulative observed events are compared against an expected count mu derived from a historical/background incidence rate multiplied by the accrued exposed risk-window person-time. The LLR is n*ln(n/mu)+mu-n when n > mu and 0 otherwise; the critical boundary is solved in advance for the planned maximum expected count, look schedule, and alpha. Used when no good concurrent comparator exists (e.g., a new vaccine with a stable historical rate of the event).",
        "edge_cases": [
          "A biased or non-stationary historical rate (coding change, secular trend, different source population) anchors the test to the wrong null and fabricates or masks signals - the expected count is the Achilles heel.",
          "Data lag in the most recent months undercounts both events and the person-time that feeds mu; running a look on immature data biases the LLR."
        ],
        "data_source_notes": "claims: mu = historical rate x accrued exposed person-time in the risk window, pooled across partners with one shared event algorithm. ehr: same, but chart-confirmable events let you validate the rate."
      },
      {
        "name": "Binomial MaxSPRT with a matched concurrent comparator",
        "description": "Each event is attributed to exposed or to one of z matched concurrent comparators, so under the null the probability an event falls on the exposed person is p0 = 1/(1+z). The test monitors the count of exposed-attributed events among the total, with a binomial LLR and its own boundary. Avoids reliance on an external historical rate by using a concurrent control built into the data.",
        "edge_cases": [
          "Matching that fails to balance confounders (channeling of sicker patients to the new product) turns a confounding excess into an apparent safety signal - the binomial design controls repeated looks, not confounding.",
          "The matching ratio z must be fixed in advance; variable or broken matches distort p0 and the boundary."
        ],
        "data_source_notes": "ehr/linked (VSD-style): a concurrent comparator vaccine or matched unexposed interval is often available, making binomial the natural choice. claims: feasible when a clean concurrent comparator cohort can be matched."
      },
      {
        "name": "Conditional MaxSPRT (CmaxSPRT) for limited historical data",
        "description": "When the historical comparison data are themselves sparse, the expected count carries real sampling uncertainty. CmaxSPRT conditions on the combined historical-plus-surveillance event total and propagates the historical uncertainty into the test, rather than treating mu as known exactly. Prevents over-confident boundaries when the background rate is estimated from few events.",
        "edge_cases": [
          "With very few historical events the boundary is markedly more conservative than unconditional Poisson MaxSPRT - ignoring this over-states certainty and inflates false alarms.",
          "Mis-specifying which data are historical vs surveillance (e.g., overlapping calendar time) breaks the conditioning."
        ],
        "data_source_notes": "registry/linked: appropriate when the historical/background series is short (a rare event or a new data partner) and its uncertainty must be honored; otherwise standard Poisson MaxSPRT suffices."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "signal-detection",
        "pros_of_this": "Tests a pre-specified pair in an enumerated cohort with real person-time and a time-aware, formally controlled error rate, so a crossing means more than repeated-testing noise - not merely that a pair is over-reported relative to other spontaneous reports.",
        "cons_of_this": "Far narrower and more expensive - it monitors one (or a few) pre-specified pairs in a defined population, whereas disproportionality scans all drug-event pairs cheaply for hypothesis generation.",
        "when_to_prefer": "Use disproportionality/signal-detection for broad, denominator-free scanning of spontaneous reports to generate hypotheses; use MaxSPRT to formally confirm/refute a pre-specified pair with an honest early-stopping rule."
      },
      {
        "compared_to": "self-controlled-risk-interval-rwe",
        "pros_of_this": "Adds the prospective, repeated-looks machinery and alpha-spending boundary so the analysis can stop and signal early without inflating the false-alarm rate - a stopping rule the SCRI alone does not provide.",
        "cons_of_this": "MaxSPRT controls the repeated-testing error, not confounding; a historical-comparator Poisson design lacks the within-person control that the self-controlled risk interval supplies, so the two are usually combined.",
        "when_to_prefer": "Use SCRI to define the exposed risk window and null out time-fixed confounding within a person; wrap MaxSPRT around the accruing counts when the question is prospective and early stopping matters."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Build exposed risk-window events and person-time per data partner and pool them. The expected count mu is a historical incidence rate times accrued exposed person-time; watch claims maturity (recent months are incomplete) and run each look only on mature data. Use one shared event algorithm across partners so the pooled count is coherent, and define the risk window (SCRI logic) that decides which events count as exposed.",
      "ehr": "Immunization/exposure dates are precise and chart-confirmable outcomes allow event-algorithm validation; a concurrent comparator (comparator product or matched unexposed interval) is often available, favoring binomial MaxSPRT. Coding lag and free-text outcomes still threaten the stable-case-definition assumption across looks.",
      "registry": "Prospective adjudicated ascertainment and an internal comparator support clean binomial designs, at the cost of slower accrual and incomplete capture outside the registry footprint. Where a historical-rate design is used instead and the historical series is short, conditional MaxSPRT (CmaxSPRT) is the appropriate variant.",
      "linked": "The strongest base - claims for complete person-time and exposure, EHR for chart-validated events and labs to refine a crossed boundary, registry for adjudicated outcomes. Reconcile feed lags across sources before setting each look date so expected and observed counts cover the same mature window."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import math\n\ndef poisson_maxsprt_llr(n: float, mu: float) -> float:\n    # Maximized log-likelihood ratio for Poisson MaxSPRT (RR maximized at n/mu).\n    # Zero unless there is an excess (n > mu), since we test one-sided RR > 1.\n    if n <= mu or mu <= 0:\n        return 0.0\n    return n * math.log(n / mu) + mu - n\n\ndef run_surveillance(looks, critical_value):\n    # looks: iterable of (label, cum_observed_n, cum_expected_mu), in time order.\n    # critical_value: flat LLR boundary that spends total one-sided alpha across all looks.\n    rows = []\n    signal_look = None\n    for label, n, mu in looks:\n        llr = poisson_maxsprt_llr(n, mu)\n        crossed = llr >= critical_value\n        if crossed and signal_look is None:\n            signal_look = label\n        rows.append({\"look\": label, \"n\": n, \"mu\": mu,\n                     \"llr\": round(llr, 3), \"signal\": crossed})\n    return {\"signal\": signal_look is not None,\n            \"signal_look\": signal_look,\n            \"critical_value\": critical_value,\n            \"looks\": rows}\n\nif __name__ == \"__main__\":\n    schedule = [(\"M1\", 1, 1.0), (\"M2\", 4, 2.0), (\"M3\", 7, 3.0), (\"M4\", 10, 4.0)]\n    out = run_surveillance(schedule, critical_value=3.0)\n    for r in out[\"looks\"]:\n        print(r)\n    print(\"SIGNAL at\", out[\"signal_look\"])   # -> M4 (LLR 3.16 >= 3.00)",
        "description": "Poisson MaxSPRT run over a monthly look schedule. Computes the log-likelihood-ratio (LLR) at each look from the\ncumulative observed count n and cumulative expected count mu, and declares a signal at the first look whose LLR reaches a\npre-computed flat critical boundary. The critical value here is taken as a design input (in practice obtained from the\nSequential package's CV.Poisson or by simulation for the planned maximum expected count and alpha). Input per look:\n  looks : list of (label, cum_observed_n, cum_expected_mu)",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(Sequential)\n\n# Flat LLR critical boundary for a Poisson MaxSPRT: maximum expected count = 5 (planned\n# surveillance length in expected events), one-sided alpha = 0.05, looks after each unit of\n# expected count. CV.Poisson returns the critical value on the LLR scale.\ncv <- CV.Poisson(SampleSize = 5, alpha = 0.05, M = 1, GroupSizes = 1)\n\npoisson_maxsprt_llr <- function(n, mu) {\n  ifelse(n <= mu | mu <= 0, 0, n * log(n / mu) + mu - n)   # log() is natural log\n}\n\nlooks <- data.frame(\n  label = c(\"M1\", \"M2\", \"M3\", \"M4\"),\n  n     = c(1, 4, 7, 10),     # cumulative observed events\n  mu    = c(1, 2, 3, 4)       # cumulative expected events under no excess risk\n)\nlooks$llr    <- poisson_maxsprt_llr(looks$n, looks$mu)\nlooks$signal <- looks$llr >= cv\n\nsignal_look <- if (any(looks$signal)) looks$label[which(looks$signal)[1]] else NA\nprint(looks)\ncat(\"Critical value:\", round(cv, 3), \" Signal at:\", signal_look, \"\\n\")",
        "description": "Same monthly Poisson MaxSPRT in R. The flat LLR critical boundary is obtained from the Sequential package's CV.Poisson\nfor the planned maximum expected count (the surveillance length in expected events) and one-sided alpha; the LLR is then\ncomputed manually at each look and compared to that boundary. If the Sequential package is unavailable, supply the\npre-computed critical value directly. Input: a data.frame of cumulative observed (n) and expected (mu) counts by look.",
        "dependencies": [
          "Sequential"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "%let cv = 3.0;   /* flat LLR critical boundary: one-sided alpha=0.05, max expected count 5 */\n\ndata looks;\n  input label $ n mu;\n  datalines;\nM1 1 1\nM2 4 2\nM3 7 3\nM4 10 4\n;\nrun;\n\ndata surveillance;\n  set looks;\n  retain signaled 0;\n  /* Maximized Poisson LLR; zero unless there is an excess (n > mu). SAS LOG() is natural log. */\n  if n <= mu or mu <= 0 then llr = 0;\n  else llr = n * log(n / mu) + mu - n;\n  if llr >= &cv and signaled = 0 then do;\n    signal_here = 1;\n    signaled = 1;\n  end;\n  else signal_here = 0;\nrun;\n\nproc print data=surveillance label;\n  var label n mu llr signal_here;\nrun;",
        "description": "Poisson MaxSPRT in SAS over the same monthly look schedule. A DATA step reads the cumulative observed (n) and expected\n(mu) counts per look, computes the LLR (natural log via SAS LOG()), and flags the first look that reaches the pre-computed\nflat critical boundary. The critical value is supplied as a macro variable (from Sequential::CV.Poisson or simulation for\nthe planned maximum expected count and alpha). Input: work.looks with label, n, mu in time order.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Start[Lock event definition,<br/>risk window, and expected-count source] --> Look[Next look:<br/>accrue cum observed n<br/>and cum expected mu]\n  Look --> Excess{n greater than mu?}\n  Excess -- No --> LLR0[LLR = 0<br/>no excess this look]\n  Excess -- Yes --> LLR[LLR = n*ln of n/mu + mu - n]\n  LLR0 --> Cmp\n  LLR --> Cmp{LLR at or above<br/>critical boundary?}\n  Cmp -- Yes --> Signal[Declare signal -<br/>stop, begin refinement]\n  Cmp -- No --> Max{Reached maximum<br/>expected count?}\n  Max -- No --> Look\n  Max -- Yes --> NoSignal[Close surveillance:<br/>no signal]\n  Signal --> Refine[Refine: re-check expected count,<br/>self-controlled / matched analysis,<br/>negative-control calibration]",
        "caption": "The MaxSPRT surveillance loop - at each repeated look the cumulative observed count is compared to the expected count, the log-likelihood ratio is computed and tested against the alpha-spending critical boundary, and a crossing triggers a signal and refinement rather than a causal verdict.",
        "alt_text": "Flowchart showing repeated looks accruing observed and expected counts, computing the Poisson log-likelihood ratio, comparing it to a critical boundary, and either signaling, continuing, or closing at the maximum expected count.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "gantt\n  title Monthly Poisson MaxSPRT - observed climbs above expected, signal at look 4\n  dateFormat YYYY-MM-DD\n  axisFormat %b\n  section Looks (no signal yet)\n  Look 1 - n 1 mu 1 LLR 0.00 :done, l1, 2023-01-01, 31d\n  Look 2 - n 4 mu 2 LLR 0.77 :done, l2, 2023-02-01, 28d\n  Look 3 - n 7 mu 3 LLR 1.93 :done, l3, 2023-03-01, 31d\n  section Signal\n  Look 4 - n 10 mu 4 LLR 3.16 crosses 3.00 :crit, l4, 2023-04-01, 30d",
        "caption": "One surveillance run over four monthly looks. The log-likelihood ratio rises 0.00, 0.77, 1.93, 3.16 as observed events outpace expected, crossing the flat critical boundary of about 3.00 at look 4 - the earliest defensible signal.",
        "alt_text": "Gantt timeline of four monthly looks with the MaxSPRT log-likelihood ratio increasing each month and crossing the critical boundary at the fourth look, where a safety signal is declared.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "alternative_to",
        "target_slug": "signal-detection",
        "notes": "Disproportionality signal-detection scans all pairs cheaply on denominator-free spontaneous reports for hypothesis generation; MaxSPRT formally tests a pre-specified pair in an enumerated cohort with person-time and a time-aware error rate. Generation versus confirmation - complementary stages of pharmacovigilance."
      },
      {
        "relation_type": "used_with",
        "target_slug": "self-controlled-risk-interval-rwe",
        "notes": "The SCRI defines the exposed risk window (and a within-person comparison that nulls time-fixed confounding) that decides which accruing events count as exposed; MaxSPRT wraps the repeated-looks alpha-spending boundary around those counts."
      },
      {
        "relation_type": "see_also",
        "target_slug": "self-controlled-case-series",
        "notes": "SCCS is the parent self-controlled framework for estimating the relative incidence in a risk window; sequential surveillance often monitors the same exposed-window counts prospectively."
      },
      {
        "relation_type": "requires",
        "target_slug": "safety-signal-case-definition-rwe",
        "notes": "Sequential testing assumes the event means the same thing at every look; a locked, validated case definition with stable sensitivity/PPV is a precondition - a drifting algorithm voids the boundary's error guarantee."
      },
      {
        "relation_type": "used_with",
        "target_slug": "multi-database",
        "notes": "Sentinel-style surveillance pools exposed person-time and events across data partners; the expected and observed counts feeding MaxSPRT are built per database with one shared event algorithm and then combined."
      },
      {
        "relation_type": "see_also",
        "target_slug": "empirical-calibration-negative-controls-rwe",
        "notes": "A boundary crossing controls repeated-testing error, not confounding; calibrating the test against negative-control pairs diagnoses systematic error before a signal is escalated."
      },
      {
        "relation_type": "used_with",
        "target_slug": "incidence-rate-calculation-rwe",
        "notes": "The expected count in Poisson MaxSPRT is a historical incidence rate multiplied by accrued exposed person-time, so incidence-rate construction supplies the null against which observed counts are tested."
      }
    ],
    "aliases": [
      "MaxSPRT",
      "maximized sequential probability ratio test",
      "sequential safety surveillance",
      "near-real-time surveillance",
      "rapid cycle analysis",
      "prospective safety monitoring",
      "Poisson maxSPRT",
      "binomial maxSPRT",
      "conditional maxSPRT"
    ],
    "complexity": "advanced",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "mcda-multi-criteria-decision-analysis-rwe",
    "name": "Multi-Criteria Decision Analysis (MCDA)",
    "short_definition": "A structured family of methods that makes a healthcare decision's value criteria explicit (efficacy, safety, unmet need, equity, disease burden, cost), elicits weights for how much each criterion matters (swing weighting, AHP pairwise comparisons, or DCE-derived weights), scores each alternative on each criterion against declared worst-best anchors, and aggregates - most often as a weighted additive value model - into a transparent total value per alternative, used to structure HTA deliberation, portfolio prioritization, and quantitative benefit-risk rather than to replace them.",
    "long_description": "Most health decisions are multi-attribute whether we admit it or not: an HTA committee weighing a drug is trading off\nsurvival gain against toxicity against unmet need against budget; a payer prioritizing a formulary is doing the same\nacross products; a regulator's benefit-risk call balances effect sizes against harms. **MCDA makes that implicit\nweighing explicit.** The canonical process (ISPOR MCDA Emerging Good Practices Task Force, Thokala et al. 2016; Marsh\net al. 2016) is: (1) *structure the problem* - define the decision, the alternatives, and a criteria set that is\ncomplete, non-redundant, and **preferentially independent**; (2) *measure performance* - build a performance matrix of\neach alternative on each criterion from trials, RWE, and elicitation; (3) *score* - convert natural-unit performance to\na common 0-100 partial value scale against declared worst and best anchors (linear or elicited value functions);\n(4) *weight* - elicit how much a swing from worst to best on each criterion matters relative to the others;\n(5) *aggregate* - usually the **additive value model** V(a) = sum over criteria of w_k x s_k(a), with weights\nnormalized to sum to 1; (6) *test* - sensitivity analysis on weights and scores; (7) *deliberate* - the numbers\nstructure the discussion, they do not end it.\n\n**Value measurement methods.** *Swing weighting* asks the committee to imagine the worst hypothetical alternative and\nrank-then-rate which single criterion swing (worst to best) they would fix first; the top swing gets 100 points,\nothers are rated relative to it, and points are normalized to weights. It is the method most consistent with the\nadditive model because weights are anchored to the actual criterion *ranges* - a weight only means anything relative\nto the swing it covers. 
*AHP (Analytic Hierarchy Process)* derives weights from pairwise comparisons on a 1-9 verbal\nscale via the principal eigenvector, with a consistency ratio to flag incoherent judgments; it is easy to field but\nits verbal scale and rank-reversal behavior are theoretically contested. *DCE-derived weights* are estimated from\nchoice experiments over attribute profiles, importing preference-study machinery (and its sample, framing, and\nattribute-range dependence) into the weight set. *Outranking methods* (ELECTRE, PROMETHEE) avoid full aggregation by\npairwise-comparing alternatives with concordance thresholds - useful when trade-offs are contested, but harder to\nexplain to a deliberative committee.\n\n**The additive model's independence assumptions are the load-bearing wall.** A weighted sum is only valid when\ncriteria are *mutually preferentially independent*: how much you value a swing on safety must not depend on the level\nof efficacy. When criteria interact (a toxicity matters more when survival gain is small), the additive form misstates\nvalue and you need multiplicative or other non-additive forms - or a re-structured criteria set. Double counting is\nthe everyday violation: putting \"QALY gain\" and \"quality of life\" and \"severity\" in one criteria set counts the same\nvalue twice; cost criteria alongside health criteria quietly re-derive a cost-effectiveness threshold the committee\nnever agreed to.\n\n**Pros, cons, and trade-offs.**\n- **vs cost-utility analysis (CUA):** CUA collapses value to one metric (the QALY) and one decision rule (the ICER vs\n  a threshold), which buys comparability across appraisals and decades of methods guidance - but it cannot natively\n  carry unmet need, severity, equity, or innovation except as ad hoc modifiers. MCDA carries them explicitly with\n  committee-owned weights, at the price of losing the QALY's interpersonal comparability and opportunity-cost logic:\n  an MCDA total value of 72 has no exchange rate against the health forgone elsewhere in the budget. **Prefer CUA**\n  for reimbursement decisions inside a budget-constrained system with an established threshold; **prefer MCDA** when\n  the decision explicitly trades off criteria the QALY cannot hold, or where no threshold logic exists (portfolio\n  triage, early pipeline, orphan/severity frameworks).\n- **vs deliberation alone (unaided committee judgment):** Unaided deliberation is flexible and cheap but opaque -\n  weights live in members' heads, anchoring and loudest-voice effects go unmeasured, and consistency across meetings\n  is unauditable. MCDA forces the value judgments into the open where they can be challenged and reused. The cost is\n  real: elicitation burden, false precision risk, and the temptation to treat the score as the decision.\n- **vs preference studies (DCE/conjoint) as the weight source:** Committee swing weighting is fast and produces\n  weights owned by the actual decision makers, but from a handful of people. DCE-derived weights bring a defensible\n  sample (patients, public) and statistical machinery, but the weights inherit the experiment's attribute ranges and\n  framing, and the committee may not feel bound by preferences it did not express.\n\n**When to use.** Deliberative HTA appraisal where severity, unmet need, or equity must enter transparently rather\nthan as unexplained discretion; portfolio and formulary prioritization across many candidates with the same criteria\nset; structured quantitative benefit-risk for regulatory or pharmacovigilance decisions (weighted benefits vs harms\nwith explicit trade-offs); resource allocation where stakeholders disagree and the disagreement should be located in\nweights, not buried; early-stage value frameworks before enough data exist for full economic modeling.\n\n**When NOT to use - and when it is actively misleading.**\n- **Do not use MCDA as a substitute for economic evaluation in a budget-constrained reimbursement decision.** Total\n  value scores carry no opportunity-cost information; ranking by MCDA score and funding down the list can displace\n  more health than it buys. If cost enters at all, keep it outside the value score (value-for-money displays) rather\n  than as a weighted criterion.\n- **Double counting.** Overlapping criteria (efficacy + QoL + severity that is itself defined by efficacy shortfall)\n  silently multiply one attribute's weight. Criteria sets must be tested for redundancy before any weight is elicited.\n- **Weight elicitation fragility.** Weights move with the elicitation method (swing vs AHP vs DCE on the same problem\n  yield different weights), with attribute ranges (halve the efficacy range and its swing weight should roughly halve -\n  committees routinely fail this *range sensitivity* check), with framing, and with who is in the room. Reporting a\n  single weighted total without weight sensitivity analysis conveys false precision.\n- **Independence violations.** If criterion values interact, the additive sum is the wrong functional form - check\n  preferential independence explicitly during problem structuring, not after the ranking is computed.\n- **Score laundering.** When the performance matrix cells come from weak or heterogeneous evidence (a registry rate\n  next to an RCT effect next to an expert guess), the tidy 0-100 scores hide the uncertainty gradient; carry evidence\n  uncertainty into the sensitivity analysis or display it alongside the scores.\n\n**Interpreting the output**\n\nConsider the worked example: Drug A scores 72 and Drug B scores 67 under the committee's weights\n(efficacy 0.5, safety 0.3, unmet need 0.2), placing Drug A first by 5 points.\n\n*Formal interpretation.* The weighted total of 72 is a composite index constructed by multiplying\neach criterion's 0-100 rescaled performance score by a normalized weight and summing. The 5-point\ngap is only as meaningful as the weights that produced it: weight elicitation is known to be\nsensitive to the method used (swing weighting, AHP, DCE), the attribute ranges presented (halving\nthe efficacy anchor range should roughly halve efficacy's swing weight - committees routinely fail\nthis range-sensitivity check), and the composition of the eliciting group. The 0-100 rescaling\nassumes linearity within each criterion's range; if the committee's preferences are non-linear\n(e.g., diminishing marginal value of additional OS gain beyond 8 months), the additive model with\nlinear scores can mis-rank alternatives. Preferential independence - required for the additive model to be valid -\nmust be checked during problem structuring, not inferred from a clean-looking output.\n\n*Practical interpretation.* Report the weighted total alongside the underlying performance matrix\nand the weights, not as a standalone score. Show the weight-sensitivity threshold at which the\nranking flips - here, with the remaining weight split 60/40 between safety and unmet need, Drug B\novertakes Drug A once the efficacy weight falls below about 0.33 - so deliberation focuses on\nwhether the committee is confident enough in the efficacy weight to sustain the Drug A ranking.\nDo not use the total score for cost-effectiveness inference: MCDA scores carry no opportunity-cost\ninformation and cannot substitute for an incremental cost-effectiveness ratio.",
    "primary_category": "Economic_Evaluation",
    "tags": [
      "mcda",
      "multi-criteria-decision-analysis",
      "swing-weighting",
      "ahp",
      "additive-value-model",
      "benefit-risk",
      "hta-deliberation",
      "portfolio-prioritization",
      "value-framework",
      "preference-weights"
    ],
    "applies_to_study_types": [
      "cost_effectiveness_analysis",
      "cost_utility",
      "preference_study",
      "comparative_effectiveness",
      "budget_impact"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked",
      "primary"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1016/j.jval.2015.12.003",
        "url": "https://doi.org/10.1016/j.jval.2015.12.003",
        "citation_text": "Thokala P, Devlin N, Marsh K, Baltussen R, Boysen M, Kalo Z, Longrenn T, Mussen F, Peacock S, Watkins J, IJzerman M. Multiple criteria decision analysis for health care decision making - an introduction: report 1 of the ISPOR MCDA Emerging Good Practices Task Force. Value in Health. 2016;19(1):1-13.",
        "year": 2016,
        "authors_short": "Thokala et al.",
        "notes": "The ISPOR Task Force's foundational report - defines MCDA, its core steps (problem structuring, criteria, scoring, weighting, aggregation), the main method families (value measurement, outranking), and healthcare uses."
      },
      {
        "role": "explain",
        "doi": "10.1016/j.jval.2015.12.016",
        "url": "https://doi.org/10.1016/j.jval.2015.12.016",
        "citation_text": "Marsh K, IJzerman M, Thokala P, Baltussen R, Boysen M, Kalo Z, Lonngren T, Mussen F, Peacock S, Watkins J, Devlin N. Multiple criteria decision analysis for health care decision making - emerging good practices: report 2 of the ISPOR MCDA Emerging Good Practices Task Force. Value in Health. 2016;19(2):125-137.",
        "year": 2016,
        "authors_short": "Marsh et al.",
        "notes": "The companion good-practice guidance - a step-by-step checklist for designing and reporting an MCDA, including criteria-set properties (completeness, non-redundancy, preferential independence), weight elicitation, and sensitivity analysis expectations."
      },
      {
        "role": "explain",
        "doi": "10.1016/j.jval.2017.10.001",
        "url": "https://doi.org/10.1016/j.jval.2017.10.001",
        "citation_text": "Marsh K, Sculpher M, Caro JJ, Tervonen T. The use of MCDA in HTA: great potential, but more effort needed. Value in Health. 2018;21(4):394-397.",
        "year": 2018,
        "authors_short": "Marsh et al.",
        "notes": "Articulates the central criticisms this entry warns about - double counting in criteria sets, weight-elicitation fragility and range insensitivity, the missing opportunity-cost logic when MCDA scores drive reimbursement, and what methodological work HTA-grade MCDA still needs."
      },
      {
        "role": "demonstrate",
        "doi": "10.1016/j.jval.2019.06.014",
        "url": "https://doi.org/10.1016/j.jval.2019.06.014",
        "citation_text": "Baltussen R, Marsh K, Thokala P, Diaby V, Castro H, Cleemput I, Garau M, Iskrov G, Olyaeemanesh A, Mirelman A, Mobinizadeh M, Morton A, Tringali M, van Til J, Valentim J, Wagner M, Jansen MP, Bijlmakers L, Oortwijn W, Broekhuizen H. Multicriteria decision analysis to support health technology assessment agencies: benefits, limitations, and the way forward. Value in Health. 2019;22(11):1283-1288.",
        "year": 2019,
        "authors_short": "Baltussen et al.",
        "notes": "Surveys how HTA agencies actually deploy MCDA along a spectrum from qualitative checklists to fully quantitative weighted models, and lands on MCDA as a structure for deliberation rather than a replacement for it - the position this entry takes."
      }
    ],
    "plain_language_summary": "Multi-criteria decision analysis (MCDA) is a way to compare treatment options when several things matter at once - how well a drug works, how safe it is, how badly patients need something new. A committee first agrees on the list of criteria, scores each option on each criterion from 0 to 100, and assigns weights that say how much each criterion matters; each option's weighted scores are then added up into one total for comparison. The total makes the committee's trade-offs visible and checkable, but it is only as good as the weights - different people, methods, or framings can produce different weights and sometimes a different winner, which is why the ranking is meant to structure the discussion, not replace it.\n",
    "key_terms": [
      {
        "term": "criterion",
        "definition": "One of the things that matters in the decision - for example efficacy, safety, or unmet need - that each option gets scored on."
      },
      {
        "term": "performance matrix",
        "definition": "The table of how each option actually performs on each criterion, in natural units like months of survival gain or adverse events per 100 patients."
      },
      {
        "term": "partial value score",
        "definition": "An option's performance on one criterion rescaled to 0-100, where 0 is the agreed worst plausible level and 100 the best."
      },
      {
        "term": "swing weighting",
        "definition": "A way to set weights by asking which criterion's jump from worst to best level the committee would most want, giving that 100 points, and rating the other jumps against it."
      },
      {
        "term": "additive value model",
        "definition": "The aggregation rule that multiplies each criterion's weight by the option's score on it and adds the products into one total value."
      },
      {
        "term": "preferential independence",
        "definition": "The assumption behind adding weighted scores - how much you care about improving one criterion must not depend on the level of another."
      }
    ],
    "worked_example": {
      "scenario": "An HTA committee compares two drugs for the same disease using three criteria - efficacy (overall-survival gain in months), safety (serious adverse events per 100 patients), and unmet need addressed (a committee score from 0 to 100). Anchors were agreed in advance: survival gain runs from 0 (worst) to 10 months (best), the adverse-event rate from 20 per 100 (worst) to 0 (best), and unmet need is already on a 0-100 scale. In the swing-weighting session the committee put the efficacy swing first at 100 points, the safety swing at 60, and the unmet-need swing at 40. We rescale each drug's performance to 0-100 scores, normalize the weights, and add up the weighted scores.\n",
      "dataset": {
        "caption": "The performance matrix the committee sees - each drug's measured performance per criterion in natural units (swing points elicited separately - efficacy 100, safety 60, unmet need 40).",
        "columns": [
          "alternative",
          "os_gain_months",
          "sae_rate_per100",
          "unmet_need_score"
        ],
        "rows": [
          [
            "Drug A",
            8,
            8,
            70
          ],
          [
            "Drug B",
            6,
            2,
            50
          ]
        ]
      },
      "steps": [
        "Normalize the swing points to weights that sum to 1. Total points = 100 + 60 + 40 = 200, so efficacy weight = 100/200 = 0.5, safety weight = 60/200 = 0.3, unmet-need weight = 40/200 = 0.2.",
        "Rescale efficacy to a 0-100 score between the anchors (0 worst, 10 best). Drug A = (8-0)/(10-0)*100 = 80; Drug B = (6-0)/(10-0)*100 = 60.",
        "Rescale safety the same way, remembering lower is better (20 worst, 0 best). Drug A = (20-8)/(20-0)*100 = 60; Drug B = (20-2)/(20-0)*100 = 90.",
        "Unmet need is already on the 0-100 scale, so Drug A scores 70 and Drug B scores 50.",
        "Add up weight times score for Drug A. Total value = 0.5*80 + 0.3*60 + 0.2*70 = 40 + 18 + 14 = 72.",
        "Add up weight times score for Drug B. Total value = 0.5*60 + 0.3*90 + 0.2*50 = 30 + 27 + 10 = 67.",
        "Compare and stress-test. Drug A leads by 72 - 67 = 5 points on the back of efficacy; if the committee dropped the efficacy weight to 0.2 (with safety 0.48 and unmet need 0.32), Drug B would win - so report that the ranking turns on the efficacy weight."
      ],
      "result": "Drug A total value = 72, Drug B total value = 67 - Drug A ranks first by 5 points under the committee's weights (0.5 efficacy, 0.3 safety, 0.2 unmet need), and sensitivity analysis shows the ranking flips if the efficacy weight falls far enough, so the deliberation should focus on how firmly the committee holds that weight.",
      "timeline_spec": {
        "title": "One HTA committee MCDA cycle - scoping, weighting, scoring, deliberation",
        "window": {
          "start": "2025-09-01",
          "end": "2025-10-24",
          "label": "Eight-week MCDA cycle from scoping workshop to decision"
        },
        "events": [
          {
            "label": "Scoping workshop",
            "start": "2025-09-01",
            "length_days": 14,
            "quantity": "3 criteria + anchors agreed"
          },
          {
            "label": "Swing-weight elicitation",
            "start": "2025-09-15",
            "length_days": 14,
            "quantity": "points 100 / 60 / 40"
          },
          {
            "label": "Performance matrix + scoring",
            "start": "2025-09-29",
            "length_days": 21,
            "quantity": "scores A: 80/60/70, B: 60/90/50"
          },
          {
            "label": "Deliberation + sensitivity",
            "start": "2025-10-20",
            "length_days": 5,
            "quantity": "totals A=72, B=67"
          }
        ],
        "spans": [
          {
            "kind": "exposed",
            "start": "2025-09-01",
            "end": "2025-10-19",
            "label": "Model build: criteria, weights, evidence scoring"
          },
          {
            "kind": "followup",
            "start": "2025-10-20",
            "end": "2025-10-24",
            "label": "Decision week: weighted totals structure deliberation"
          }
        ],
        "result": {
          "label": "Drug A total value 72 vs Drug B 67 - A ranked first under committee weights",
          "value": 72
        }
      }
    },
    "prerequisites": [
      "cost-effectiveness",
      "cost-utility",
      "preference-study"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Value measurement MCDA with swing weighting (ISPOR default)",
        "description": "The full quantitative pipeline - declared worst-best anchors per criterion, linear or elicited partial value functions mapping natural units to 0-100, swing weighting (rank the worst-to-best swings, rate them against the top swing at 100 points, normalize to sum to 1), and additive aggregation V(a) = sum of w_k x s_k(a). The most defensible variant when criteria ranges are well defined, because the weights are explicitly anchored to those ranges.",
        "edge_cases": [
          "Weights are range-dependent by construction - if the efficacy anchor range changes after weights were elicited (a new alternative falls outside it), the weights are stale and must be re-elicited, not rescaled silently.",
          "Committee members who rate every swing near 100 produce flat, uninformative weights; force an initial swing ranking before rating to break ties.",
          "Linear partial value functions are an assumption, not a default truth - a 2-month survival gain from 0 may be worth far more than the same gain from 24 months; test a concave value function in sensitivity analysis."
        ],
        "data_source_notes": "Performance matrix cells come from the evidence base, not the elicitation: trial effects, RWE rates (claims/EHR/registry) for safety and burden criteria, epidemiologic estimates for unmet need. The weights alone come from the committee."
      },
      {
        "name": "AHP (Analytic Hierarchy Process)",
        "description": "Weights (and optionally scores) derived from pairwise comparisons on Saaty's 1-9 verbal scale, aggregated via the principal eigenvector of the comparison matrix, with a consistency ratio (CR < 0.10 conventional) to flag incoherent judgment sets. Popular because pairwise questions feel easy; contested because the verbal scale has no fixed trade-off meaning and rankings can reverse when alternatives are added or removed.",
        "edge_cases": [
          "A consistency ratio above 0.10 means the pairwise judgments contradict each other - revisit the comparisons rather than averaging over the incoherence.",
          "Rank reversal - adding an irrelevant alternative can flip the order of the top two - undermines defensibility in adversarial settings (appeals, regulatory dossiers).",
          "AHP weights are not anchored to criterion ranges, so combining AHP weights with range-anchored 0-100 scores mixes two incompatible weight semantics; many published healthcare AHPs make exactly this error."
        ],
        "data_source_notes": "Same evidence sources feed the performance matrix; the pairwise-comparison sessions are primary data collection (committee or stakeholder panels) and should be documented like any elicitation exercise."
      },
      {
        "name": "Outranking methods (ELECTRE / PROMETHEE)",
        "description": "Instead of one aggregated value per alternative, alternatives are compared pairwise on each criterion with preference and indifference thresholds, and an outranking relation (a outranks b if it is at least as good on enough weighted criteria and not unacceptably worse on any) is built up. Tolerates incomplete preferences and non-compensatory logic (a fatal safety flaw cannot be bought back by efficacy points), at the cost of opacity.",
        "edge_cases": [
          "Results can be intransitive or leave alternatives incomparable - a feature for deliberation, a bug if the decision needs a complete ranking.",
          "Threshold parameters (preference, indifference, veto) are extra elicitation burden and are rarely justified in healthcare applications; unexamined defaults drive results.",
          "Much harder to explain to a deliberative committee than a weighted sum; budget explanation time or expect the committee to ignore the output."
        ],
        "data_source_notes": "Performance matrix requirements are identical to value measurement MCDA; only the aggregation differs, so the same claims/EHR/registry evidence pipeline feeds either variant."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "cost-utility",
        "pros_of_this": "Carries criteria the QALY cannot hold - unmet need, severity, equity, innovation, delivery burden - as explicit weighted criteria with committee-owned trade-offs, instead of as unexplained modifiers around an ICER.",
        "cons_of_this": "The total value score has no opportunity-cost interpretation - 72 points buys no statement about health displaced elsewhere in the budget - and loses the QALY's cross-appraisal comparability and threshold decision rule.",
        "when_to_prefer": "Prefer MCDA when the decision genuinely turns on criteria outside the QALY or no threshold logic exists (portfolio triage, severity frameworks, benefit-risk); prefer cost-utility for reimbursement inside a budget-constrained system with an established threshold."
      },
      {
        "compared_to": "cost-effectiveness",
        "pros_of_this": "Handles more than one effectiveness dimension at once - a CEA's single natural-unit outcome (cost per event avoided) cannot trade efficacy against safety against unmet need, which is exactly the multi-attribute structure MCDA formalizes.",
        "cons_of_this": "Requires weight elicitation machinery (panels, swing exercises, sensitivity analysis) that a CEA does not, and its outputs are method- and panel-dependent where a CEA's are data-dependent.",
        "when_to_prefer": "Prefer MCDA when multiple non-commensurable outcomes must be traded off explicitly; prefer cost-effectiveness when one clinically accepted outcome dominates the decision and cost per unit of it is the question."
      },
      {
        "compared_to": "preference-study",
        "pros_of_this": "MCDA is the decision framework - it consumes preference weights and turns them into a ranked, deliberation-ready comparison of actual alternatives; a DCE alone quantifies preferences but decides nothing.",
        "cons_of_this": "Committee-elicited MCDA weights come from a handful of people in a room; a well-designed DCE brings a defensible patient or population sample and confidence intervals around every weight.",
        "when_to_prefer": "They compose rather than compete - run a preference study (DCE/conjoint) when the weight source must be patients or the public at scale, and feed those weights into the MCDA; use direct committee swing weighting when the decision makers' own trade-offs are the point."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Claims feed the performance matrix, not the weights - real-world safety rates (serious adverse event hospitalizations per 1,000 person-years), adherence/persistence as a deliverability criterion, HCRU and cost criteria, and treated-prevalence denominators for unmet-need sizing. Claims cannot supply efficacy criteria (no outcomes adjudication) or any preference data; pair with trial estimates and an elicitation exercise.",
      "ehr": "EHR supplies clinical-granularity criterion evidence - response and progression proxies, lab-defined severity, contraindication prevalence for the population-suitability criterion - and identifies the eligible population whose unmet need is being scored. Watch documentation differences across sites when one criterion's evidence comes from multiple EHR systems; the performance matrix inherits that heterogeneity invisibly.",
      "registry": "Disease and product registries are often the best single source for unmet-need and burden criteria (prospective severity staging, PRO-based quality-of-life burden, natural-history event rates for the no-new-treatment anchor) and for rare-disease MCDAs where claims counts are too sparse to estimate anything.",
      "linked": "Linked claims-EHR-registry data let one criterion set draw each cell from its best source - registry severity for unmet need, EHR labs for effectiveness proxies, claims for safety event rates and cost - with consistent denominators, which is what keeps criterion scores comparable. Document per-cell provenance in the performance matrix; an MCDA is only as auditable as its evidence map."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\n\n# Performance matrix in NATURAL units (3 criteria x 2 alternatives).\nperf = pd.DataFrame(\n    {\"os_gain_months\": [8.0, 6.0],      # efficacy: overall-survival gain vs standard of care\n     \"sae_rate_per100\": [8.0, 2.0],     # safety: serious adverse events per 100 patients (lower = better)\n     \"unmet_need_score\": [70.0, 50.0]}, # committee-scored unmet need addressed, already on 0-100\n    index=[\"Drug A\", \"Drug B\"])\n\n# Declared anchors: worst and best PLAUSIBLE levels per criterion (set during problem structuring,\n# BEFORE weights are elicited - swing weights are only meaningful relative to these ranges).\nanchors = {                      # (worst, best)\n    \"os_gain_months\":   (0.0, 10.0),\n    \"sae_rate_per100\":  (20.0, 0.0),   # harm: worst is the HIGH rate\n    \"unmet_need_score\": (0.0, 100.0),\n}\n\n# Raw swing-weight points: top-ranked swing = 100, others rated relative to it.\nswing_points = {\"os_gain_months\": 100.0, \"sae_rate_per100\": 60.0, \"unmet_need_score\": 40.0}\n\ndef partial_value(x: float, worst: float, best: float) -> float:\n    \"\"\"Linear 0-100 partial value between anchors; direction-aware via the anchor order.\"\"\"\n    return 100.0 * (x - worst) / (best - worst)\n\ndef additive_mcda(perf: pd.DataFrame, anchors: dict, swing_points: dict) -> pd.DataFrame:\n    total_pts = sum(swing_points.values())\n    weights = {k: v / total_pts for k, v in swing_points.items()}   # normalize to sum to 1\n    scores = perf.apply(lambda col: partial_value(col, *anchors[col.name]), axis=0)\n    contrib = scores * pd.Series(weights)            # weighted contribution per criterion\n    out = contrib.add_suffix(\"_wtd\")\n    out[\"total_value\"] = contrib.sum(axis=1)\n    out[\"rank\"] = out[\"total_value\"].rank(ascending=False).astype(int)\n    return out.round(2)\n\nresult = additive_mcda(perf, anchors, swing_points)\nprint(result)\n# Drug A: 0.5*80 + 0.3*60 + 0.2*70 = 72.0 (rank 1); 
Drug B: 0.5*60 + 0.3*90 + 0.2*50 = 67.0 (rank 2)\n\n# Minimal weight sensitivity: at what efficacy weight do the alternatives tie?\nfor w_eff in (0.50, 0.40, 0.30, 0.20):\n    rest = 1.0 - w_eff\n    w = {\"os_gain_months\": w_eff, \"sae_rate_per100\": rest * 0.6, \"unmet_need_score\": rest * 0.4}\n    scores = perf.apply(lambda col: partial_value(col, *anchors[col.name]), axis=0)\n    tot = (scores * pd.Series(w)).sum(axis=1)\n    print(f\"w_efficacy={w_eff:.2f}: A={tot['Drug A']:.1f}  B={tot['Drug B']:.1f}\")",
        "description": "Minimal additive value model - the core MCDA arithmetic. Inputs: a performance matrix of alternatives x criteria in\nnatural units, per-criterion worst/best anchors (direction-aware - for a harm, worst is the high value), and raw\nswing-weight points. Converts natural units to 0-100 partial value scores by linear interpolation between anchors,\nnormalizes the swing points to weights summing to 1, and returns total value per alternative with the per-criterion\ncontribution breakdown a deliberating committee actually needs. Reproduces the worked example exactly\n(Drug A = 72, Drug B = 67).",
        "dependencies": [
          "pandas"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "# Performance matrix in NATURAL units (rows = alternatives, cols = criteria).\nperf <- rbind(\n  `Drug A` = c(os_gain_months = 8, sae_rate_per100 = 8, unmet_need_score = 70),\n  `Drug B` = c(os_gain_months = 6, sae_rate_per100 = 2, unmet_need_score = 50)\n)\n\n# Declared worst/best anchors per criterion (direction-aware: for the harm, worst is the HIGH rate).\nanchors <- list(\n  os_gain_months   = c(worst = 0,  best = 10),\n  sae_rate_per100  = c(worst = 20, best = 0),\n  unmet_need_score = c(worst = 0,  best = 100)\n)\n\n# Raw swing-weight points (top swing = 100), normalized to weights summing to 1.\nswing_points <- c(os_gain_months = 100, sae_rate_per100 = 60, unmet_need_score = 40)\nweights <- swing_points / sum(swing_points)   # 0.5, 0.3, 0.2\n\npartial_value <- function(x, worst, best) 100 * (x - worst) / (best - worst)\n\n# 0-100 partial value scores, then weighted contributions and total value.\nscores  <- sapply(colnames(perf), function(k)\n             partial_value(perf[, k], anchors[[k]][\"worst\"], anchors[[k]][\"best\"]))\ncontrib <- sweep(scores, 2, weights, `*`)\ntotal   <- rowSums(contrib)\n\nout <- data.frame(round(contrib, 2), total_value = round(total, 2),\n                  rank = rank(-total))\nprint(out)\n# Drug A: 0.5*80 + 0.3*60 + 0.2*70 = 72 (rank 1); Drug B: 0.5*60 + 0.3*90 + 0.2*50 = 67 (rank 2)\n\n# Weight sensitivity: sweep the efficacy weight, split the remainder 60/40 safety:unmet need.\nfor (w_eff in c(0.5, 0.4, 0.3, 0.2)) {\n  w   <- c(w_eff, (1 - w_eff) * 0.6, (1 - w_eff) * 0.4)\n  tot <- scores %*% w\n  cat(sprintf(\"w_efficacy=%.2f: A=%.1f  B=%.1f\\n\", w_eff, tot[1], tot[2]))\n}",
        "description": "The same minimal additive value model in base R. A performance matrix in natural units is rescaled to 0-100 partial\nvalue scores against direction-aware worst/best anchors, raw swing points are normalized to weights summing to 1,\nand the weighted sum and per-criterion contributions are returned, followed by a simple efficacy-weight sensitivity\nsweep. Reproduces the worked example exactly (Drug A = 72, Drug B = 67).",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Q[Decision problem<br/>alternatives + decision makers] --> C[Select criteria<br/>complete, non-redundant,<br/>preferentially independent]\n  C --> PM[Performance matrix<br/>natural units per criterion<br/>from trials + RWE + elicitation]\n  PM --> S[Score: partial value 0-100<br/>against declared worst-best anchors]\n  C --> W[Weight: swing weighting / AHP /<br/>DCE-derived, normalized to sum to 1]\n  S --> V[Aggregate: additive value model<br/>V = sum of weight x score]\n  W --> V\n  V --> SA{Sensitivity analysis:<br/>do plausible weight changes<br/>flip the ranking?}\n  SA -- Ranking stable --> D[Deliberate with the numbers<br/>MCDA structures, committee decides]\n  SA -- Ranking flips --> R[Report the flip point -<br/>the decision turns on that weight]",
        "caption": "The value-measurement MCDA pipeline - structure the problem, build the performance matrix from evidence, score against anchors, elicit and normalize weights, aggregate additively, and stress-test the ranking before the committee deliberates with (not under) the numbers.",
        "alt_text": "Flowchart from decision problem through criteria selection, performance matrix, partial value scoring, weight elicitation, additive aggregation, and weight sensitivity analysis to deliberation.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "gantt\n  title One HTA committee MCDA cycle from scoping to deliberation\n  dateFormat YYYY-MM-DD\n  axisFormat %d %b\n  section Model build\n  Scoping workshop - criteria set agreed :crit, e1, 2025-09-01, 14d\n  Swing-weight elicitation panel :crit, e2, 2025-09-15, 14d\n  Performance matrix + partial value scoring :crit, e3, 2025-09-29, 21d\n  section Decision\n  Deliberation + weight sensitivity :done, e4, 2025-10-20, 5d",
        "caption": "A realistic eight-week MCDA cycle - two weeks of problem structuring, two of weight elicitation, three of evidence scoring, then a deliberation week where the weighted totals (Drug A 72 vs Drug B 67) and their sensitivity structure the committee discussion.",
        "alt_text": "Gantt timeline of an MCDA cycle showing scoping, swing-weight elicitation, performance scoring, and the final deliberation week in sequence from September to late October.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": "mcda-multi-criteria-decision-analysis-rwe-timeline.svg",
        "mermaid": null,
        "caption": "Generated timeline of the eight-week MCDA cycle - scoping workshop, swing-weight elicitation (points 100/60/40), performance-matrix scoring, and the deliberation week where the weighted totals (Drug A 72 vs Drug B 67) structure the committee discussion.",
        "alt_text": "Timeline bars for the scoping workshop, swing-weight elicitation, performance scoring, and deliberation phases of one MCDA cycle, with the model-build and decision-week spans shaded beneath.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "tradeoff_with",
        "target_slug": "cost-utility",
        "notes": "MCDA carries unmet need, severity, and equity as explicit weighted criteria that the QALY cannot hold, but its total value has no opportunity-cost interpretation; CUA keeps the threshold decision rule MCDA lacks."
      },
      {
        "relation_type": "tradeoff_with",
        "target_slug": "cost-effectiveness",
        "notes": "A CEA trades cost against one natural-unit outcome; MCDA formalizes trade-offs across several outcomes at once at the price of panel-dependent weight elicitation."
      },
      {
        "relation_type": "used_with",
        "target_slug": "preference-study",
        "notes": "DCE/conjoint preference studies are one of the three weight sources for MCDA (alongside swing weighting and AHP) - the preference study quantifies the trade-offs, the MCDA turns them into a ranked comparison of alternatives."
      },
      {
        "relation_type": "used_with",
        "target_slug": "risk-evaluation",
        "notes": "Quantitative benefit-risk assessment is MCDA applied to benefits and harms - weighted favorable and unfavorable effects with explicit trade-offs - and feeds regulatory and pharmacovigilance risk evaluation."
      },
      {
        "relation_type": "see_also",
        "target_slug": "hrqol",
        "notes": "HRQoL instruments often supply the quality-of-life criterion's evidence; keeping that criterion distinct from efficacy and severity criteria is the main defense against double counting."
      },
      {
        "relation_type": "see_also",
        "target_slug": "icer-net-monetary-benefit-rwe",
        "notes": "The ICER/NMB framework is the single-metric decision rule MCDA generalizes away from; comparing an MCDA ranking with the NMB ranking shows exactly what the extra criteria changed."
      },
      {
        "relation_type": "see_also",
        "target_slug": "health-economic-modeling-methods-rwe",
        "notes": "MCDA sits beside the modeling toolkit as the aggregation-and-deliberation layer; model outputs (QALYs, costs, event rates) populate performance-matrix cells rather than being replaced by it."
      }
    ],
    "aliases": [
      "MCDA",
      "multi-criteria decision analysis",
      "multiple criteria decision analysis",
      "value measurement MCDA",
      "swing weighting",
      "analytic hierarchy process",
      "AHP",
      "weighted additive value model",
      "quantitative benefit-risk assessment",
      "value framework"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "hta",
      "fda",
      "ema"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "mcnemar-test",
    "name": "McNemar's Test for Paired Proportions",
    "short_definition": "A hypothesis test for paired binary data — specifically, whether the proportion of \"yes\" outcomes differs between two conditions measured on the same subjects. It ignores pairs where both conditions give the same answer (concordant pairs) and bases the entire test on pairs where the two conditions disagree (discordant pairs), testing whether the count of \"yes then no\" pairs is compatible with the count of \"no then yes\" pairs under a symmetric null hypothesis. It is the paired-data counterpart to the chi-square test for independent samples.",
    "long_description": "**The core insight: only discordant pairs carry information**\n\nWhen you measure the same binary outcome on the same subjects under two conditions\n— before and after a program, case versus matched control, two diagnostic tests on\nthe same patient — you construct a paired 2x2 table with four cells:\n\n- **a**: both conditions positive (concordant)\n- **b**: condition 1 positive, condition 2 negative (discordant)\n- **c**: condition 1 negative, condition 2 positive (discordant)\n- **d**: both conditions negative (concordant)\n\nThe concordant cells (a and d) are completely uninformative about whether the two\nmarginal proportions differ. If a patient was hospitalized both before and after the\nprogram, that tells you nothing about the program's effect on hospitalization rates.\nOnly the discordant cells b and c carry signal: b counts patients whose status\nflipped from positive to negative, and c counts patients whose status flipped from\nnegative to positive. Under the null hypothesis that the two marginal proportions are\nequal (that is, that the probability of \"yes\" is the same under both conditions),\nthe discordant count b should equal c in expectation. Formally, conditional on the\ndiscordant total b + c, the count b follows a Binomial(b+c, 0.5) distribution.\nMcNemar's test formalizes this.\n\n**The test statistic and exact vs asymptotic versions**\n\nThe asymptotic McNemar statistic is:\n\n  chi-square = (b - c)^2 / (b + c)\n\nwhich follows a chi-squared distribution with 1 degree of freedom under the null\nwhen b + c is reasonably large (typically b + c >= 25 is cited as a rule of thumb).\nFor small discordant totals, the exact version based directly on the Binomial(b+c,\n0.5) distribution is preferred — this gives the exact probability of observing a\nsplit as extreme or more extreme than b vs c if the true probability were 0.5. 
The\nexact test is conservative (it uses the discrete probability mass function, yielding\nslightly elevated type-I error protection at the cost of power), and Fagerland,\nLydersen, and Laake (2013) demonstrated that the mid-p correction (which counts only\nhalf the probability of the observed table in each tail, rather than all of it)\nachieves better balance\nbetween type-I and type-II error rates than either the exact or asymptotic test in\nmost practical situations. The mid-p approach is now the recommended default for\nsmall b + c.\n\n**Effect estimation alongside the p-value**\n\nA p-value alone is insufficient for decision-making. The natural effect estimate\npaired with McNemar's test is the **marginal proportion difference**: the proportion\nwho are positive under condition 1 minus the proportion positive under condition 2,\nwhich equals (b - c) / n, where n is the total number of pairs. An equivalent\nframing is the **ratio of discordant proportions** (b/c or its inverse), sometimes\ncalled the McNemar odds ratio. Confidence intervals for the marginal proportion\ndifference can be obtained by the exact binomial approach or by asymptotic normal\napproximation; the exact2x2 package in R is a convenient implementation.\n\n**RWE applications: before-after binary indicators within patients**\n\nIn real-world evidence, McNemar's test arises naturally in three settings:\n\n1. *Pre-post comparison within the same patients*: Measure a binary outcome (any\n   hospitalization yes/no, medication possession ratio >= 80% yes/no, ACE inhibitor\n   use yes/no) in a fixed window before and after an index event (diagnosis, treatment\n   initiation, policy change). Each patient contributes one pair of observations. The\n   test asks whether the proportion positive shifted between the two windows — for\n   example, whether the fraction of patients with a hospitalization was higher in the\n   year before versus the year after starting a disease management program.\n\n2. 
*Matched case-control pairs*: In a 1:1 matched case-control study with a binary\n   exposure, the exposure status of the case and the matched control form a pair. The\n   McNemar test is equivalent to the score test from a conditional logistic regression\n   with the exposure as the only predictor\n   in this setting, and conditional logistic regression is the appropriate\n   multivariable extension when you need to adjust for additional covariates after\n   matching. Understanding McNemar as the special case of conditional logistic\n   regression without additional covariates clarifies why ordinary logistic regression is wrong\n   for paired data — it ignores the matching structure.\n\n3. *Agreement versus marginal change (do not confuse)*: McNemar's test measures\n   *marginal homogeneity* — whether the row margin (proportion positive under\n   condition 1) equals the column margin (proportion positive under condition 2). It\n   does NOT measure agreement. Cohen's kappa measures the degree to which two raters\n   or two tests agree beyond chance, which is a different question entirely. Two raters\n   could systematically disagree (one always says \"yes,\" one always says \"no\") yielding\n   a large McNemar statistic and kappa = 0 (no agreement beyond chance). They could also mostly agree while\n   disagreeing asymmetrically enough to produce a significant McNemar p-value. Always\n   clarify which question — marginal change or agreement — your analysis addresses.\n\n**The self-controlled flavor and its limitations**\n\nThe pre-post McNemar design is a form of self-controlled analysis: each patient\nserves as their own control, which eliminates confounding from all time-fixed patient\ncharacteristics (demographics, baseline comorbidities, health-seeking behavior that\nis stable over time). This is a powerful design feature — no propensity score or\nexternal comparator group is needed to eliminate fixed confounders. However, the\nself-controlled structure does NOT eliminate **time-varying confounding**. 
If the\noutcome probability would have changed over time in the absence of the intervention\ndue to disease progression, seasonal effects, secular trends, or regression to the\nmean, the pre-post comparison is biased even for the same patients. Similarly,\nthe McNemar design does not adjust for the **duration of follow-up** if patients\nhave different amounts of time in each window. These limitations connect directly to\nthe broader catalog of self-controlled designs (self-controlled case series, SCCS)\nwhere conditioning on cases only and modeling the timing of exposure within patients\naddresses some of these concerns more rigorously than a simple pre-post binary test.\n\n**Pros, cons, and trade-offs**\n\n*Pros*:\n- Conceptually elegant: focuses inference precisely on the discordant pairs, which\n  are the only pairs carrying information about marginal change.\n- Efficient: uses within-patient pairing to eliminate fixed confounding without an\n  external comparator, reducing the variance of the estimator compared to an\n  independent-sample chi-square on the same data.\n- Three well-validated versions (exact, asymptotic, mid-p) with clear guidance on\n  which to use by discordant count.\n- Direct extension to conditional logistic regression for multivariable adjustment\n  after 1:1 matching.\n- Computationally trivial — available in every major statistical software package.\n\n*Cons*:\n- Discards the concordant pairs entirely — a large dataset with mostly concordant\n  pairs yields very low power, because effective n is only b + c, not the total\n  number of pairs.\n- For small b + c the power is genuinely low regardless of total sample size; no\n  amount of enrolling more concordant pairs compensates.\n- Tests only marginal homogeneity; cannot directly address whether the change is\n  caused by the intervention.\n- Cannot adjust for covariates directly — must graduate to conditional logistic\n  regression if covariate adjustment is needed.\n- Assumes binary 
outcome; the multi-category generalization (Stuart-Maxwell test) is\n  less widely known and implemented.\n- Pre-post McNemar does not adjust for time-varying confounders or secular trends.\n\n**When to use**\n\nUse McNemar's test when:\n\n- The outcome is binary and measurements come in natural pairs: the same patient\n  before and after an event, or a 1:1 matched pair (case and control, or patient and\n  matched comparator).\n- The scientific question is whether the *proportion* positive changed between\n  conditions — not whether individuals agreed with each other (kappa) and not whether\n  the mean of a continuous variable shifted (paired t-test).\n- You want a formal inference about the marginal shift alongside the visual or\n  descriptive pre-post comparison.\n- The discordant count b + c is at least 5 (exact or mid-p test) or at least 25\n  (asymptotic chi-square approximation); below b + c = 5, power is so low that\n  inference is nearly meaningless.\n- As a quick diagnostic in matched case-control studies before graduating to\n  conditional logistic regression for the main analysis.\n- Sensitivity analysis: after 1:1 propensity-score matching, McNemar's test on a\n  binary primary outcome is theoretically appropriate and computationally trivial as\n  a check alongside conditional logistic regression.\n\n**When NOT to use — and when these tests are actively misleading**\n\n- *Independent groups*: if the two groups of patients are different people (treated\n  vs untreated cohorts without 1:1 matching), McNemar's test is wrong — the pairing\n  structure it assumes does not exist. Use the chi-square test or Fisher's exact test\n  instead. Applying McNemar to independent samples produces incorrect p-values.\n- *More than two categories*: McNemar's test is for 2x2 paired tables only. 
For an\n  ordinal or multinomial paired outcome (severity grade: none/mild/moderate/severe\n  before and after), use the Stuart-Maxwell test or Bowker's test of symmetry.\n- *Recurrent events or count outcomes*: McNemar's test requires a binary yes/no per\n  person per period. If the outcome is the number of hospitalizations (which can be\n  0, 1, 2, ...) you need a paired count model — a negative binomial mixed model or\n  a paired Poisson approach — not a test for binary proportions.\n- *When time-varying confounding dominates the pre-post contrast*: if there is\n  strong evidence that disease progression, a co-occurring policy change, or secular\n  trends would change the outcome probability over the same window regardless of the\n  intervention, the pre-post McNemar comparison is biased. In these settings,\n  a difference-in-differences design with an external comparator group, or an SCCS\n  analysis, is more appropriate.\n- *Confusing McNemar with Cohen's kappa*: McNemar tests marginal homogeneity;\n  kappa tests agreement. Using McNemar to assess whether two diagnostic tests\n  \"agree\" answers the wrong question. Using kappa to test whether a pre-post\n  proportion changed is also wrong. Be explicit about which question is the target.\n- *When the total discordant count b + c is very small (< 5)*: power is negligible.\n  Report the discordant pair counts descriptively and acknowledge the test is\n  underpowered rather than accepting a non-significant p-value as evidence of no\n  change.\n\n**Implementation note for all three languages**\n\nThe canonical demonstration below tests whether the proportion of patients with any\nhospitalization differed in the year before versus the year after a chronic disease\ndiagnosis — a pre-post RWE use case using the same paired 2x2 table from the\nbeginner layer. All three implementations show both the asymptotic chi-square\nversion and the exact/mid-p version, with the discordant-pair framing explicit in\ncomments. 
The SAS implementation uses PROC FREQ with the AGREE option on the TABLES statement\nand the EXACT MCNEM statement for the exact p-value.\n\n**Interpreting the output**\n\nIn the worked example, 40 patients each have a binary hospitalization indicator for the year\nbefore and after a chronic disease diagnosis. Of the 20 discordant pairs, 15 flipped from\nhospitalized to not hospitalized (b = 15) and 5 flipped the other direction (c = 5). The\nMcNemar statistic is (b − c)² / (b + c) = 100 / 20 = 5 on df = 1, with p ≈ 0.025. The\npre-period hospitalization proportion was 25/40 = 0.625 and the post-period proportion was\n15/40 = 0.375, a marginal difference of 0.25.\n\n*(1) Formal interpretation.* The McNemar statistic of 5 on df = 1 has a p-value of\napproximately 0.025 under the null hypothesis that discordant pairs are equally likely to\nflip in either direction (that is, that a discordant pair falls in cell b with probability\n0.5). This falls below the conventional\nalpha = 0.05 threshold. Because b + c = 20 is below the 25 rule of thumb, the exact binomial\nversion is the safer reference; its two-sided p-value is about 0.041 and leads to the same\nconclusion. The entire computation rests on the 20 discordant pairs; the 20\nconcordant pairs (patients hospitalized in both windows, or neither window) carry no\ninformation about marginal change and are excluded. The marginal proportion difference of\n0.25 (25 fewer patients hospitalized per 100 patients) is the natural accompanying effect\nestimate.\n\n*(2) Practical interpretation.* The proportion of patients with any hospitalization dropped\nby about 25 percentage points between the two measurement windows, and this shift is\nstatistically distinguishable from chance at conventional levels. Because each patient serves\nas their own control, time-fixed confounders such as baseline comorbidities are eliminated,\nbut time-varying factors including disease progression, secular trends, and regression to the\nmean could still explain some or all of the reduction. The 3:1 imbalance among discordant\npairs (15 improved, 5 worsened) drives the result; concordant pairs contribute nothing to\nthe test statistic regardless of how numerous they are.",
    "primary_category": "Inferential_Statistics",
    "tags": [
      "statistics",
      "primitive",
      "hypothesis-testing",
      "paired-data",
      "categorical-data",
      "pre-post",
      "matched-pairs",
      "self-controlled"
    ],
    "applies_to_study_types": [
      "cohort_retrospective",
      "cohort_prospective",
      "cross_sectional",
      "claims_analysis",
      "ehr_study",
      "registry_study",
      "case_control",
      "descriptive_analysis"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "primary",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1007/BF02295996",
        "url": "https://doi.org/10.1007/BF02295996",
        "citation_text": "McNemar Q. Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika. 1947;12(2):153-157.",
        "year": 1947,
        "authors_short": "McNemar",
        "notes": "The original derivation of the test for paired binary proportions, showing that only the discordant counts b and c carry information about the marginal proportion difference; under the null, b conditional on the discordant total b + c follows a Binomial(b+c, 0.5) distribution. The foundational reference for any analysis using this test."
      },
      {
        "role": "explain",
        "doi": "10.1186/1471-2288-13-91",
        "url": "https://doi.org/10.1186/1471-2288-13-91",
        "citation_text": "Fagerland MW, Lydersen S, Laake P. The McNemar test for binary matched-pairs data: mid-p and asymptotic are better than exact conditional. BMC Medical Research Methodology. 2013;13:91.",
        "year": 2013,
        "authors_short": "Fagerland et al.",
        "notes": "Simulation study comparing the exact, asymptotic, and mid-p versions of McNemar's test across a wide range of discordant totals and true proportions. Demonstrates that the mid-p correction achieves better type-I and type-II error balance than the fully exact conditional test and that the asymptotic chi-square version performs well when b + c >= 25. The key practical guidance: use mid-p for small discordant totals, asymptotic for large. Essential reading before choosing which version to report."
      }
    ],
    "plain_language_summary": "McNemar's test answers one question: did the fraction of patients with a \"yes\" outcome change between two conditions measured on the same people? You pair each patient's before-and-after result (or case-and-control result) in a 2x2 table, then the test looks only at the pairs where the two conditions disagreed — because pairs where both conditions gave the same answer carry no information about whether anything changed. It produces a p-value for whether the two proportions differ, but it cannot tell you why they differ or whether the change was caused by the intervention.",
    "key_terms": [
      {
        "term": "discordant pairs",
        "definition": "Patient pairs where the two conditions give different answers — one \"yes\" and one \"no\" — which are the only pairs the test uses; pairs where both conditions agree (both \"yes\" or both \"no\") are ignored because they reveal nothing about a proportion change."
      },
      {
        "term": "concordant pairs",
        "definition": "Patient pairs where both conditions give the same answer (both hospitalized or both not hospitalized); they are counted in the table but contribute zero information to the McNemar test, which is why a study with mostly concordant pairs has low power even with a large total sample."
      },
      {
        "term": "marginal proportion",
        "definition": "The overall fraction of patients who had a \"yes\" outcome under one condition — computed across all pairs, not just discordant ones; McNemar's test asks whether the marginal proportion under condition 1 equals the marginal proportion under condition 2."
      },
      {
        "term": "paired 2x2 table",
        "definition": "A 2-row by 2-column table where each cell counts patient pairs by their combination of outcomes under the two conditions (both yes, first yes only, second yes only, both no); the layout that McNemar's test is built for."
      },
      {
        "term": "mid-p correction",
        "definition": "A small adjustment to the exact McNemar p-value that counts only half the probability of the observed outcome table in each tail (the exact test counts all of it), giving better balance between false-positive and false-negative rates than the fully exact or fully asymptotic test when the number of discordant pairs is small."
      }
    ],
    "worked_example": {
      "scenario": "A health outcomes analyst is studying whether any-hospitalization rates changed in the year before versus the year after a chronic disease diagnosis for 40 patients. Each patient has exactly one pre-period outcome (hospitalized yes/no in the year before diagnosis) and one post-period outcome (hospitalized yes/no in the year after diagnosis). The analyst builds the paired 2x2 table and runs McNemar's test to decide whether the proportion hospitalized shifted significantly.",
      "dataset": {
        "caption": "Paired 2x2 table: rows = pre-diagnosis hospitalization status, columns = post-diagnosis hospitalization status. Cell counts are the number of patient pairs in each category. Total pairs n = 40.",
        "columns": [
          "post_status",
          "Post: Hospitalized (Yes)",
          "Post: Not Hospitalized (No)",
          "Row Total"
        ],
        "rows": [
          [
            "Pre: Hospitalized (Yes)",
            "a = 10",
            "b = 15",
            25
          ],
          [
            "Pre: Not Hospitalized (No)",
            "c = 5",
            "d = 10",
            15
          ],
          [
            "Column Total",
            15,
            25,
            40
          ]
        ]
      },
      "steps": [
        "Identify the four cells. Concordant pairs where both periods agree: cell a = 10 (hospitalized both years) and cell d = 10 (not hospitalized either year). Discordant pairs where the periods disagree: cell b = 15 (hospitalized pre, not post) and cell c = 5 (not hospitalized pre, hospitalized post).",
        "Compute marginal proportions. Pre-period proportion hospitalized = row total for 'Yes pre' / n = 25/40 = 0.625. Post-period proportion hospitalized = column total for 'Yes post' / n = 15/40 = 0.375. The apparent shift is 0.625 - 0.375 = 0.25.",
        "The concordant pairs (a = 10 and d = 10) carry NO information about whether the proportions changed. Discard them. The test is built entirely on the discordant total: b + c = 15 + 5 = 20.",
        "Compute the asymptotic McNemar statistic: chi-square equals the squared difference of the discordant counts divided by their sum. The numerator is (b - c) squared = (15 - 5) squared = 10 squared = 100. The denominator is b + c = 15 + 5 = 20. So chi-square = 100 / 20 = 5. Compare to a chi-squared distribution with 1 degree of freedom.",
        "A chi-squared value of 5 with 1 df gives a p-value of approximately 0.025 (two-sided). This is below the conventional alpha = 0.05 threshold, so we reject the null hypothesis that the two marginal proportions are equal.",
        "Interpret in context. The proportion hospitalized was 25/40 = 0.625 before diagnosis and 15/40 = 0.375 after diagnosis. The marginal proportion difference is 0.625 - 0.375 = 0.25, meaning about 25 percentage points fewer patients had any hospitalization in the year after diagnosis than in the year before. Whether this represents a causal effect of post-diagnosis management or reflects natural disease course and regression to the mean requires a design with a comparator group."
      ],
      "result": "Discordant pairs b = 15, c = 5; discordant total b + c = 15 + 5 = 20. Numerator = (b - c) squared = 10 squared = 100. McNemar statistic = 100 / 20 = 5; p approximately 0.025 (two-sided, 1 df). Pre-period proportion hospitalized = 25 / 40 = 0.625; post-period proportion = 15 / 40 = 0.375; marginal proportion difference = 0.625 - 0.375 = 0.25. The test is significant at alpha = 0.05. The concordant pairs (a = 10, d = 10) were discarded because they carry zero information about the marginal shift."
    },
    "prerequisites": [
      "inferential-statistics-foundations",
      "parametric-vs-nonparametric-tests",
      "chi-square-test"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Asymptotic McNemar (chi-square approximation)",
        "description": "The classic (b - c)^2 / (b + c) statistic referred to a chi-squared distribution with 1 degree of freedom. Valid when the discordant total b + c >= 25. Default in most software; produces a continuous p-value and is directly interpretable. Not recommended when b + c < 25 because the chi-square approximation to the discrete binomial is poor.",
        "edge_cases": [
          "When b + c < 25, the asymptotic p-value is unreliable. Switch to exact or mid-p.",
          "When b = c = 0 (no discordant pairs at all), the statistic is 0/0 — undefined. Report that the data provide no evidence of a proportion shift by design and acknowledge the test is vacuous."
        ],
        "data_source_notes": "In large claims or EHR cohorts, b + c >= 25 is almost always met; the asymptotic version is the natural default for pre-post hospitalization or adherence analyses."
      },
      {
        "name": "Exact McNemar (exact binomial)",
        "description": "Computes the exact binomial probability of observing a split at least as extreme as the observed b versus c out of b + c trials under the null p = 0.5. Preferred when b + c < 25. Conservative (actual type-I error below the nominal level, at the cost of power) because the binomial is discrete; the p-value can take only a finite set of values. The exact2x2 package in R and PROC FREQ in SAS both implement this version.",
        "edge_cases": [
          "Very small discordant counts (b + c < 5) yield virtually no power; acknowledge this explicitly rather than accepting a non-significant result as conclusive."
        ],
        "data_source_notes": "Appropriate for small pilot studies, rare-event analyses, or matched case-control studies with few discordant pairs."
      },
      {
        "name": "Mid-p McNemar (recommended for small discordant totals)",
        "description": "The mid-p version counts only half the probability of the observed table in each tail when computing the two-sided p-value, reducing the conservatism of the exact test. Fagerland et al. (2013) recommend this as the default when b + c < 25 because it achieves better balance between type-I and type-II error rates than either the exact or asymptotic version. Available in the exact2x2 package in R (midp = TRUE argument).",
        "edge_cases": [
          "Unlike the exact test, the mid-p test is not guaranteed to keep the type-I error rate at or below the nominal level for every parameter combination, but it is well-calibrated on average and outperforms the alternatives in simulation."
        ],
        "data_source_notes": "Recommended for claims analyses of rare binary events (e.g., a specific surgery type pre-post a policy change) where the discordant count may be small."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "chi-square-test",
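        "pros_of_this": "McNemar's test correctly accounts for the within-patient pairing structure, which an independent-sample chi-square test ignores. When the paired outcomes are positively correlated (the usual case when the same patient is measured twice), treating the two measurements as independent samples overstates the standard error of the proportion difference and produces p-values that are too large; with negative correlation the error runs the other way. In both cases the independent-sample p-value is invalid for paired data.",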
        "cons_of_this": "McNemar's test requires paired data — one observation per condition per subject. When the two samples are independent (different patients), the chi-square test is appropriate and McNemar is not.",
        "when_to_prefer": "Use McNemar whenever observations are paired (same patient, pre-post, or 1:1 matched); use chi-square when samples are independent."
      },
      {
        "compared_to": "paired-t-test",
        "pros_of_this": "McNemar is the natural and conventional test for strictly binary paired outcomes. A paired t-test on the 0/1 within-pair differences tests the same null (mean difference of zero) and is approximately valid in large samples, but it relies on a normal approximation that the exact and mid-p versions of McNemar's test avoid.",
        "cons_of_this": "The paired t-test applies to continuous outcomes and produces an interpretable mean difference in the original units; for a continuous pre-post measurement like HbA1c or LDL, the paired t-test is preferable to dichotomizing the outcome and applying McNemar, which discards quantitative information.",
        "when_to_prefer": "Use McNemar for inherently binary outcomes (hospitalized yes/no, adherent yes/no); use paired t-test for continuous outcomes (lab values, cost, score)."
      },
      {
        "compared_to": "fisher-exact-test",
        "pros_of_this": "McNemar correctly handles the paired structure. Fisher's exact test assumes independent observations — applying it to paired data ignores pairing and produces a less powerful and potentially misleading analysis.",
        "cons_of_this": "Fisher's exact test is appropriate for a 2x2 table from independent groups with small expected cell counts. McNemar is not appropriate for independent groups.",
        "when_to_prefer": "Use McNemar for paired binary data; use Fisher's exact test for independent groups with small expected cell counts."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Build the paired 2x2 table by computing a binary indicator for each patient in the pre-period window and in the post-period window. Common indicators: any inpatient claim (hospitalization yes/no), medication possession ratio >= 0.80 (adherent yes/no), any emergency department visit, any procedure of interest. Make sure the two windows are symmetric (e.g., 365 days before and 365 days after the index date) and that every patient has complete data in both windows before entering the table — partial follow-up in either window distorts the marginal proportions. In large claims cohorts, b + c >= 25 is almost always satisfied; use the asymptotic version and report the marginal proportion difference with a 95% CI alongside the p-value.",
      "ehr": "Naturally paired lab values, diagnoses, and prescription flags make EHR data well-suited for McNemar analyses. Common examples: HbA1c >= 7.0% (uncontrolled diabetes yes/no) before and after a medication change; any antibiotic prescription in the 90 days before versus after a stewardship intervention. Check that measurement frequency does not differ systematically between the two windows, as differential observation intensity can create apparent changes in binary status that reflect surveillance bias rather than true change.",
      "registry": "Disease registries often have structured pre- and post-treatment assessments. McNemar's test on paired binary endpoints (disease response yes/no, relapse yes/no) is straightforward when every patient has a complete assessment at both time points. When assessments are missing for some patients, investigate whether missingness is related to disease severity before dropping incomplete pairs.",
      "primary": "Survey-based binary responses (improved yes/no, satisfied yes/no) at two time points on the same respondents are ideal for McNemar analysis. Ensure that the survey instrument and response options are identical at both time points to avoid measurement-driven artifacts in the discordant counts.",
      "linked": "Linked claims-EHR datasets allow richer pre-post binary indicators combining pharmacy fills (from claims) with clinical outcomes (from EHR). Apply McNemar's test to the paired binary indicators after establishing a clean index date and symmetric observation windows in both data sources."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\nfrom statsmodels.stats.contingency_tables import mcnemar\n\n# ── Paired 2x2 table from the worked example ──\n# Rows: pre-period (Yes=1, No=0); Columns: post-period (Yes=1, No=0)\n# Layout: [[a, b], [c, d]]  where b and c are the discordant cells\n# a=10: both hospitalized; b=15: pre-yes, post-no; c=5: pre-no, post-yes; d=10: both not\ntable = np.array([[10, 15],\n                  [ 5, 10]])\n\na, b, c, d = table[0, 0], table[0, 1], table[1, 0], table[1, 1]\nn = a + b + c + d  # total pairs\n\n# ── Marginal proportions ──\nprop_pre  = (a + b) / n   # proportion positive pre-period\nprop_post = (a + c) / n   # proportion positive post-period\nmarg_diff = prop_pre - prop_post\n\nprint(f\"Total pairs n = {n}\")\nprint(f\"Concordant pairs: a={a} (both yes), d={d} (both no)\")\nprint(f\"Discordant pairs: b={b} (pre-yes, post-no), c={c} (pre-no, post-yes)\")\nprint(f\"Discordant total b+c = {b+c}\")\nprint(f\"Pre-period proportion: {a+b}/{n} = {prop_pre:.4f}\")\nprint(f\"Post-period proportion: {a+c}/{n} = {prop_post:.4f}\")\nprint(f\"Marginal proportion difference (pre - post): {marg_diff:.4f}\")\n\n# ── Asymptotic McNemar: (b-c)^2 / (b+c) ~ chi-squared(1) ──\n# Recommended when b+c >= 25; here b+c = 20 so asymptotic is borderline.\nresult_asymptotic = mcnemar(table, exact=False, correction=False)\nprint(f\"\\nAsymptotic McNemar: statistic={result_asymptotic.statistic:.4f}, \"\n      f\"p={result_asymptotic.pvalue:.4f}\")\nprint(f\"  Manual check: (b-c)^2/(b+c) = ({b}-{c})^2/({b}+{c}) \"\n      f\"= {(b-c)**2}/{b+c} = {(b-c)**2/(b+c):.4f}\")\n\n# ── Exact McNemar: binomial(b+c, 0.5) exact test ──\n# Preferred when b+c < 25; conservative (slight upward bias in p-value).\nresult_exact = mcnemar(table, exact=True)\nprint(f\"\\nExact McNemar:      statistic={result_exact.statistic:.4f}, \"\n      f\"p={result_exact.pvalue:.4f}\")\n\n# ── Mid-p McNemar: Fagerland et al. 
(2013) recommendation for small b+c ──\n# statsmodels' mcnemar() has no mid-p option, so compute it from the binomial distribution\nfrom scipy.stats import binom\n# Two-sided exact p: P(X <= min(b,c)) + P(X >= max(b,c)) where X ~ Binom(b+c, 0.5)\nm = b + c\nx_obs = min(b, c)\ntail = binom.cdf(x_obs, m, 0.5)  # P(X <= min(b,c)); by symmetry equals P(X >= max(b,c))\nexact_p = min(1.0, 2 * tail)  # cap at 1 (2 * tail exceeds 1 when b == c)\nmidp_p = 2 * tail - binom.pmf(x_obs, m, 0.5)  # mid-p: count only half of P(X = x_obs) in each tail\nprint(f\"\\nMid-p McNemar:      p={midp_p:.4f}\")\nprint(\"  (Fagerland et al. 2013 recommend mid-p for b+c < 25)\")\n\n# ── Confidence interval for marginal proportion difference (asymptotic Wald) ──\nimport math\nse_diff = math.sqrt((b + c - (b - c)**2 / n) / n**2)\nci_lo = marg_diff - 1.96 * se_diff\nci_hi = marg_diff + 1.96 * se_diff\nprint(f\"\\n95% CI for marginal proportion difference: \"\n      f\"({ci_lo:.4f}, {ci_hi:.4f})\")\n\n# ── Key interpretation reminder ──\nprint(\"\\nNote: concordant pairs (a, d) are discarded by the test.\")\nprint(\"McNemar tests marginal homogeneity, NOT agreement (use Cohen's kappa for agreement).\")\nprint(\"For multivariable adjustment after 1:1 matching, use conditional logistic regression.\")",
        "description": "McNemar's test using statsmodels.stats.contingency_tables.mcnemar, demonstrating\nboth the exact binomial version (exact=True) and the asymptotic chi-square version\n(exact=False, correction=False). Uses the worked-example discordant counts (b=15,\nc=5) and shows how to compute the marginal proportion difference and its\napproximate confidence interval. No dependencies beyond statsmodels.",
        "dependencies": [
          "statsmodels"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "# ── Paired 2x2 table: rows = pre-period, columns = post-period ──\n# [[a, b], [c, d]] — discordant cells are b (row 1, col 2) and c (row 2, col 1)\ntable_2x2 <- matrix(c(10, 15, 5, 10), nrow = 2, byrow = TRUE,\n                    dimnames = list(Pre  = c(\"Yes\", \"No\"),\n                                    Post = c(\"Yes\", \"No\")))\nprint(\"Paired 2x2 table:\")\nprint(table_2x2)\n\na <- table_2x2[1, 1]; b <- table_2x2[1, 2]\nc <- table_2x2[2, 1]; d <- table_2x2[2, 2]\nn <- a + b + c + d\n\ncat(sprintf(\"\\nTotal pairs n = %d\\n\", n))\ncat(sprintf(\"Discordant pairs: b=%d (pre-yes, post-no), c=%d (pre-no, post-yes)\\n\", b, c))\ncat(sprintf(\"Discordant total b+c = %d\\n\", b + c))\n\n# ── Marginal proportions ──\nprop_pre  <- (a + b) / n\nprop_post <- (a + c) / n\ncat(sprintf(\"\\nPre-period proportion:  (%d+%d)/%d = %.4f\\n\", a, b, n, prop_pre))\ncat(sprintf(\"Post-period proportion: (%d+%d)/%d = %.4f\\n\", a, c, n, prop_post))\ncat(sprintf(\"Marginal proportion difference (pre - post): %.4f\\n\", prop_pre - prop_post))\n\n# ── Asymptotic McNemar: (b-c)^2/(b+c) ~ chisq(1) ──\n# correct=FALSE: no continuity correction (standard for asymptotic McNemar)\nres_asymptotic <- mcnemar.test(table_2x2, correct = FALSE)\ncat(\"\\nAsymptotic McNemar (correct=FALSE):\\n\")\nprint(res_asymptotic)\ncat(sprintf(\"  Manual check: (b-c)^2/(b+c) = (%d-%d)^2/(%d+%d) = %d/%d = %.4f\\n\",\n            b, c, b, c, (b - c)^2, b + c, (b - c)^2 / (b + c)))\n\n# ── Exact McNemar: base R uses exact binomial when correct=TRUE and ties present ──\n# For clarity, compute directly with binom.test\nres_exact_binom <- binom.test(min(b, c), b + c, p = 0.5)\ncat(sprintf(\"\\nExact McNemar (via binom.test): p = %.4f\\n\", res_exact_binom$p.value))\n\n# ── Mid-p McNemar: Fagerland et al. 
(2013) — requires exact2x2 package ──\n# Install once: install.packages(\"exact2x2\")\n# library(exact2x2)\n# res_midp <- mcnemar.exact(table_2x2, midp = TRUE)\n# cat(sprintf(\"Mid-p McNemar: p = %.4f\\n\", res_midp$p.value))\ncat(\"\\nFor mid-p correction (recommended when b+c < 25), use:\\n\")\ncat(\"  library(exact2x2); mcnemar.exact(table_2x2, midp = TRUE)\\n\")\ncat(\"  [exact2x2 package — Fagerland et al. 2013]\\n\")\n\n# ── Interpretation notes ──\ncat(\"\\nKey reminders:\\n\")\ncat(\"  - Concordant pairs (a, d) are IGNORED by the test.\\n\")\ncat(\"  - McNemar tests marginal homogeneity; kappa tests agreement.\\n\")\ncat(\"  - For multivariable adjustment after 1:1 matching:\\n\")\ncat(\"    library(survival); clogit(outcome ~ exposure + strata(pair_id))\\n\")",
        "description": "McNemar's test in base R using mcnemar.test(), demonstrating both the asymptotic\nand exact versions, plus the mid-p correction via the exact2x2 package. Uses the\nsame paired 2x2 table as the Python implementation (b=15, c=5). Includes\ncomputation of the marginal proportion difference and a note on using\nconditional logistic regression (survival::clogit) for multivariable extension.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* ── Create the paired 2x2 frequency dataset ── */\n/* Each record is one patient; pre and post are binary (1=Yes, 0=No) */\n/* Reconstruct counts: a=10, b=15, c=5, d=10 */\ndata work.paired_binary;\n  input pre_hosp post_hosp count;\n  datalines;\n1 1 10\n1 0 15\n0 1  5\n0 0 10\n;\nrun;\n\n/* ── McNemar's test via PROC FREQ AGREE option ── */\n/* WEIGHT count: each row represents `count` patients                   */\n/* AGREE: prints McNemar's chi-square statistic, p-value, and kappa     */\n/* TEST MCNEM: explicitly requests the McNemar statistic in the output  */\nproc freq data=work.paired_binary;\n  weight count;\n  tables pre_hosp * post_hosp / agree norow nocol nopct;\n  test mcnem;\n  title \"McNemar test: hospitalization pre-diagnosis vs post-diagnosis\";\nrun;\n/* Output includes:\n   - McNemar's Chi-Square = (b-c)^2/(b+c) = (15-5)^2/20 = 5.0000, p approx 0.0253\n   - Kappa: IGNORE for testing marginal change (kappa measures agreement, not change)\n   - Simple Kappa Coefficient is a DIFFERENT quantity from McNemar                  */\n\n/* ── Marginal proportions: compute manually from PROC MEANS ── */\n/* Expand the frequency data to one row per patient */\ndata work.expanded;\n  set work.paired_binary;\n  do i = 1 to count;\n    output;\n  end;\n  drop i;\nrun;\nproc means data=work.expanded mean n;\n  var pre_hosp post_hosp;\n  title \"Pre- and post-period hospitalization proportions\";\nrun;\n/* Expected output:\n   pre_hosp  mean = 0.6250  (25/40)\n   post_hosp mean = 0.3750  (15/40)\n   Difference = 0.6250 - 0.3750 = 0.2500 (25 percentage points)            */\n\n/* ── Note on exact and mid-p versions ── */\n/* PROC FREQ AGREE uses the asymptotic chi-square version by default.\n   For small discordant counts (b+c < 25), request the exact version:\n     proc freq data=work.paired_binary;\n       weight count;\n       tables pre_hosp * post_hosp / agree;\n       exact mcnem;  /* EXACT statement triggers exact binomial p-value   */\n 
    run;\n   Mid-p correction is not available natively in SAS PROC FREQ.\n   For mid-p, export the table to R and use exact2x2::mcnemar.exact(midp=TRUE). */",
        "description": "McNemar's test in SAS using PROC FREQ with the AGREE option (which produces\nMcNemar's test statistic and its p-value) and the TEST MCNEM statement for the\nasymptotic version. Uses the same paired 2x2 table (b=15, c=5) as the other\nimplementations. The AGREE option in PROC FREQ automatically reports McNemar's\nchi-square and the kappa coefficient — the output serves as a reminder to\ninterpret them separately (marginal change vs agreement, respectively).",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Start[\"Binary outcome on SAME patients<br/>under two conditions?\"] -->|Yes| Pair[\"Build paired 2x2 table<br/>(rows = condition 1, cols = condition 2)\"]\n  Start -->|No - independent groups| ChiSq[\"Use chi-square test<br/>or Fisher exact test\"]\n  Pair --> Count[\"Count discordant pairs<br/>b (cond1=Yes, cond2=No)<br/>c (cond1=No, cond2=Yes)\"]\n  Count --> Disc[\"Discordant total<br/>b + c = ?\"]\n  Disc -->|\"b + c >= 25\"| Asymp[\"Asymptotic McNemar<br/>(b-c)^2 / (b+c) ~ χ²(1)\"]\n  Disc -->|\"b + c < 25\"| SmallN[\"Small discordant total\"]\n  SmallN --> MidP[\"Mid-p McNemar<br/>(Fagerland et al. 2013)<br/>Recommended default\"]\n  SmallN --> Exact[\"Exact McNemar<br/>(Binomial test)<br/>Conservative\"]\n  Asymp --> Report[\"Report:\\n- McNemar statistic + p-value\\n- Marginal proportion difference\\n- 95% CI\"]\n  MidP --> Report\n  Exact --> Report\n  Report --> Adjust[\"Need covariate adjustment?\"]\n  Adjust -->|Yes| CLR[\"Use conditional logistic regression<br/>(multivariable extension of McNemar)\"]\n  Adjust -->|No| Done[\"McNemar test is sufficient\"]",
        "caption": "Decision flowchart for McNemar's test: when to use it, which version to choose based on discordant pair count, and when to extend to conditional logistic regression for covariate adjustment.",
        "alt_text": "Flowchart branching from paired binary data through discordant pair count to asymptotic, exact, or mid-p McNemar test, with a path to conditional logistic regression for multivariable adjustment.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "requires",
        "target_slug": "parametric-vs-nonparametric-tests",
        "notes": "McNemar's test is the correct paired-binary entry in the parametric-vs-nonparametric decision tree; the parent concept covers when to use McNemar versus chi-square versus Fisher exact for binary outcomes by whether data are paired or independent."
      },
      {
        "relation_type": "requires",
        "target_slug": "inferential-statistics-foundations",
        "notes": "McNemar's test requires understanding of null hypothesis testing, p-values, and the binomial distribution; the foundational entry establishes this framework."
      },
      {
        "relation_type": "see_also",
        "target_slug": "chi-square-test",
        "notes": "The chi-square test is McNemar's unpaired counterpart — appropriate when two samples of binary outcomes come from independent groups. A common error in RWE is applying the independent-sample chi-square to paired pre-post data, which ignores the pairing structure and underestimates the standard error, producing p-values that are too small."
      },
      {
        "relation_type": "see_also",
        "target_slug": "paired-t-test",
        "notes": "The paired t-test is McNemar's continuous-outcome analog — for the same within- patient design where the outcome is continuous (HbA1c, LDL, cost) rather than binary. When a continuous outcome is available, prefer the paired t-test over dichotomizing and applying McNemar, as dichotomization discards quantitative information."
      },
      {
        "relation_type": "see_also",
        "target_slug": "fisher-exact-test",
        "notes": "Fisher's exact test is appropriate for independent-sample 2x2 tables with small expected cell counts; McNemar's exact version fills the analogous role for paired binary data with small discordant counts. Confusing the two is a structural error in study design that produces wrong standard errors."
      }
    ],
    "aliases": [
      "paired proportions test",
      "McNemar chi-square",
      "test of marginal homogeneity",
      "McNemar's chi-square test",
      "paired 2x2 test"
    ],
    "complexity": "foundational",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "medical-code-crosswalks-mappings",
    "name": "Code Crosswalks and Mappings Between Coding Systems",
    "short_definition": "The family of official and community-maintained translation tables that link one medical coding system to another — ICD-9-CM to ICD-10-CM via CMS General Equivalence Mappings, NDC to RxNorm via the NLM RxNav API, NDC to HCPCS J-codes via the CMS Average Sales Price crosswalk, and ICD-10-CM to SNOMED CT via the NLM rule-based map — allowing researchers to align code lists across vocabulary transitions or data sources while managing the approximation, one-to-many expansions, and version drift inherent in every translation.",
    "long_description": "A **code crosswalk** (also called a code mapping) is an official or curated translation\ntable that associates codes in one medical coding system with the closest equivalent codes\nin another. Crosswalks exist because the healthcare data ecosystem spans multiple,\nindependently governed vocabularies — ICD-9-CM for legacy diagnoses, ICD-10-CM for\ncurrent diagnoses, NDC for drug package identity, RxNorm for drug ingredients,\nHCPCS/CPT for procedures and administered drugs, and SNOMED CT for clinical concepts —\nand no single coding system covers all data types or all calendar periods. Real-world\nevidence (RWE) studies routinely span vocabulary transitions or assemble data from\nsources that use different systems, making crosswalk literacy a foundational skill.\n\n**The inventory of major crosswalks.**\n\n*GEMs — General Equivalence Mappings (CMS/NCHS, ICD-9-CM ↔ ICD-10-CM/PCS).*\nGEMs were developed jointly by CMS and the National Center for Health Statistics to\nsupport the October 1, 2015 transition from ICD-9-CM to ICD-10-CM (diagnoses) and\nICD-10-PCS (inpatient procedures). Two files exist for each direction: the **forward\nmap** (ICD-9-CM → ICD-10-CM) and the **backward map** (ICD-10-CM → ICD-9-CM). Each\nrow carries four critical flags: (1) the **approximate flag** (0 = exact match,\n1 = approximate/best available), (2) the **no map flag** (code has no usable\nequivalent), (3) the **combination flag** (the ICD-9 concept requires multiple ICD-10\ncodes to fully express), and (4) the **scenario/choice list flags** that group\none-to-many alternatives. A single ICD-9-CM code commonly maps forward to 3–10\nICD-10-CM codes, and many maps carry approximate = 1, meaning granularity was\ngenuinely lost or gained in translation. GEMs were last updated for FY2018; they are\nofficially retired but remain the de facto standard tool for any study spanning the\npre- and post-October-2015 period. 
Researchers must freeze the GEM version and\ndocument it, because there will be no future updates to reconcile.\n\n*NDC ↔ RxNorm (NLM RxNav API, monthly updates).*\nThe National Library of Medicine's RxNorm is the standard for drug ingredient,\nclinical drug (product), and branded product identity in the United States. The NLM\nRxNav API resolves an 11-digit NDC (as it appears in a claims or pharmacy dispensing\nrecord) to its RxNorm Concept Unique Identifier (RXCUI) at the ingredient or clinical\ndrug level. This mapping matters for two reasons. First, NDCs change whenever a\nmanufacturer repackages or reformulates a product, so a single ingredient can have\nhundreds of active NDCs at any moment and hand-built NDC lists go stale quickly.\nSecond, defining drug exposure at the RxNorm ingredient level and resolving NDCs\nthrough RxNorm insulates a study from that churn. The NLM also maintains historical\nNDC endpoints for mapping retired codes. The RxNorm mapping is updated monthly; a\nstudy using a snapshot must document the snapshot date.\n\n*NDC ↔ HCPCS (CMS Average Sales Price Drug Pricing crosswalk, quarterly).*\nMedicare Part B covers many drugs administered in clinical settings and reimburses them\nunder HCPCS Level II J-codes. CMS publishes quarterly ASP (Average Sales Price)\nDrug Pricing files that include an NDC-to-HCPCS crosswalk, allowing researchers to\nrecover the drug identity behind a J-code claim. This is essential for medical-benefit\ndrug studies: a claim for J9271 (pembrolizumab) is informative on its own, but the\nNDC crosswalk confirms the specific product and links back to the RxNorm ingredient\nfor pharmacological classification. The crosswalk also resolves \"not otherwise\nclassified\" (NOC) codes (e.g., J3490, J9999), which appear when a drug has no\ndedicated J-code. 
Because NDCs change with ASP submission cycles, the specific\nquarterly file version must be documented.\n\n*SNOMED CT ↔ ICD-10-CM (NLM rule-based map, for reimbursement derivation).*\nThe NLM maintains a rule-based SNOMED CT–to–ICD-10-CM map that supports translation\nfrom clinical documentation systems (which may use SNOMED CT) to reimbursement coding\n(ICD-10-CM). The map is intentionally lossy: SNOMED CT's clinical granularity\n(laterality, severity, morphology) cannot always survive the translation to ICD-10-CM's\nbilling categories. Researchers should treat SNOMED↔ICD-10-CM as a triage tool, not a\nreliable phenotype definition, and re-derive code lists natively in each system whenever\npossible.\n\n*CPT/HCPCS ↔ SNOMED CT and ICD-10-PCS ↔ SNOMED CT (partial maps).*\nPartial procedure maps exist between CPT and SNOMED CT and between ICD-10-PCS and\nSNOMED CT, but coverage is incomplete and maintained by different organizations on\ndifferent schedules. These are useful for concept-level harmonization across datasets\nbut require the same caveat: verify coverage fractions before relying on them.\n\n*UMLS Metathesaurus as the CUI-level hub.*\nThe NLM Unified Medical Language System (UMLS) integrates more than 200 biomedical\nvocabularies under a single Concept Unique Identifier (CUI), enabling lookup from any\nsupported source system to any other. The Metathesaurus connects ICD-9-CM, ICD-10-CM,\nSNOMED CT, RxNorm, LOINC, MeSH, and many others under one roof. UMLS requires a\nfree UMLS Metathesaurus License; it is not public-domain. 
The breadth makes it the\nmost comprehensive single hub, but its mappings vary in source and quality — some\nare algorithmically generated and should be reviewed for the specific concept.\n\n*OMOP CONCEPT_RELATIONSHIP \"Maps to\" as the operational crosswalk hub.*\nWithin the OMOP Common Data Model, the CONCEPT_RELATIONSHIP table stores the \"Maps to\"\nrelationship that connects every source code (ICD-9-CM, ICD-10-CM, NDC, CPT, HCPCS,\nSNOMED CT) to its standard concept. This is a continuously maintained, versioned\ncrosswalk hub — the OHDSI community updates it regularly, and new vocabulary versions\nare released quarterly. Researchers using OMOP inherit the crosswalk automatically\nthrough the ETL, but must still document the vocabulary version used (it is stored in\nthe VOCABULARY table) and understand that source codes without a \"Maps to\" relationship\nfall to concept_id = 0 (unmapped) and are invisible to standard-concept queries.\n\n**The methodological core: crosswalks are approximations, not identities.**\n\nEvery crosswalk changes the measurement. This is the most important principle of\ncrosswalk methodology, and the one most frequently violated. The specific failure modes\nare:\n\n- **One-to-many inflation.** When a single ICD-9-CM code maps forward to multiple\n  ICD-10-CM codes, applying the GEM mechanically to expand a code list inflates the\n  code count. If an analyst counts diagnosis codes or trends in code frequency, the\n  ICD-10 transition will appear to generate more diagnoses — not because disease\n  incidence changed, but because each ICD-9 concept now has more granular children.\n  This is a **cartographic artifact**, not a clinical signal, and it is one of the\n  primary drivers of apparent trend discontinuities at 2015-10-01.\n\n- **Granularity loss on backward maps.** When translating from ICD-10-CM (more\n  granular) backward to ICD-9-CM (less granular), multiple distinct ICD-10-CM codes\n  often collapse onto a single ICD-9-CM code. 
Information is irreversibly lost.\n\n- **Asymmetry: forward ∘ backward ≠ identity.** Applying the forward map and then the\n  backward map does not return the original code set. This is not a defect in the GEMs;\n  it is a logical consequence of the coding system transition — the two systems have\n  different granularity and different clinical partitions. Researchers who assume\n  round-trip equivalence will produce incorrect overlap analyses.\n\n- **Version drift.** Crosswalks are updated on different schedules (GEMs: frozen at\n  FY2018; ASP crosswalk: quarterly; RxNorm: monthly; OMOP vocabularies: quarterly).\n  A study that runs across multiple vocabulary update cycles may apply different\n  translations to different time periods, introducing a time-varying mapping artifact\n  unless the researcher pins a single snapshot and documents it.\n\n- **Approximate flag is the norm, not the exception.** In the forward GEM for ICD-9-CM\n  to ICD-10-CM, the majority of entries carry approximate = 1, meaning the mapping is\n  the best available, not a true clinical equivalence. Treating approximate = 0\n  (exact) rows as high-confidence and approximate = 1 rows as requiring clinical review\n  is the standard practice.\n\n**Best practice: map the concept, not the code list.**\n\nThe gold standard is to re-derive the code list natively in each coding system —\nstarting from the clinical concept, asking a subject-matter expert to curate the\nrelevant codes independently in ICD-9-CM and in ICD-10-CM — rather than mechanically\ntranslating the ICD-9 list forward. 
Use the GEM as a **first-pass triage tool** to\nidentify candidate codes in the target system, then have a clinician review the candidate set.\nFor transition-spanning trends, run ITS (interrupted time series) diagnostics at\n2015-10-01 to distinguish cartographic from biological discontinuities.\n\n**Licensing and public-domain status.**\n\n- **GEMs** (ICD-9-CM ↔ ICD-10-CM/PCS): public domain, freely downloadable from CMS.\n- **CMS ASP NDC-HCPCS crosswalk**: public domain, freely downloadable from CMS.\n- **NLM RxNav / RxNorm API**: public domain for the API and underlying data.\n- **UMLS Metathesaurus**: requires a free UMLS Metathesaurus License (NLM sign-up).\n- **SNOMED CT**: requires an NRC (National Release Center) license in the US; obtained\n  free through NLM for most research uses.\n\n**Pros, cons, and trade-offs: specific and comparative.**\n\n- **Crosswalk (mechanical translation) vs concept re-derivation (native curation):**\n  Mechanical translation via GEMs is fast and reproducible, and it produces a code list\n  that can be traced to an official table. Cost: it inherits all GEM approximations and\n  one-to-many expansions, and it can include codes that a clinician familiar with the\n  target system would exclude and miss codes the GEM did not capture. **Prefer concept\n  re-derivation** for the final, study-grade code list; use the crosswalk as triage.\n  The GEM-first-then-clinician-review workflow is the standard recommended by AHIMA\n  and OHDSI.\n\n- **Pinned crosswalk snapshot vs live API:**\n  Using a live RxNorm or OMOP vocabulary API at query time ensures the latest mappings\n  but means the study results may change if the API is called at different times. 
For\n  reproducible research, pin and archive the crosswalk snapshot with a date stamp.\n  **Prefer a pinned snapshot for published research.** The OMOP versioned vocabulary\n  download (available from the Athena portal) is the standard mechanism for this in\n  OMOP studies.\n\n- **GEMs vs OMOP \"Maps to\" for ICD-9/ICD-10 bridging:**\n  GEMs operate at the raw ICD code level; OMOP \"Maps to\" operates at the standard\n  concept level (SNOMED for conditions, RxNorm for drugs) and abstracts away the\n  ICD version. For network studies using OMOP, the \"Maps to\" relationships are\n  preferred because queries on standard concepts do not depend on which ICD version\n  the source used. For studies on raw claims\n  not in OMOP, GEMs are the correct tool. **Prefer OMOP \"Maps to\"** when the data\n  are in OMOP-CDM; **prefer GEMs** when working directly with ICD claims.\n\n**When to use.**\nUse crosswalks whenever: (1) a study spans the ICD-9-to-ICD-10 transition (any study\nwindow that crosses 2015-10-01) and a consistent diagnosis phenotype must be applied\nacross the full period; (2) drug exposures defined by NDC must be aggregated by\ningredient (RxNorm) or by HCPCS reimbursement code (ASP crosswalk); (3) data from\nsystems using different coding schemes (e.g., a SNOMED-coded EHR and an ICD-10-CM\nclaims file) must be harmonized for a linked or federated analysis; (4) an OMOP CDM\nis being built and the ETL must specify how each source code maps to a standard concept.\n\n**When NOT to use, and when it is actively misleading or dangerous.**\n\n- **As a substitute for clinical concept re-derivation.** Applying the GEM forward map\n  to generate an ICD-10-CM code list and treating the output as a validated phenotype\n  is misleading. The GEM was designed for billing and administrative continuity, not\n  for research phenotyping. 
The one-to-many expansions and approximate flags mean\n  that a researcher who trusts the GEM output without clinical review will include\n  codes that are clinically irrelevant and miss others that are clinically central.\n\n- **For trend analysis without ITS diagnostics.** Using a crosswalk to translate a\n  code list and then measuring trends across the transition without testing for a\n  cartographic discontinuity at 2015-10-01 is actively misleading. The coding\n  transition itself introduces apparent level changes and slope breaks in virtually\n  every ICD-based condition series; reporting these as clinical trends is incorrect.\n\n- **When the approximate flag is ignored.** Selecting only the zero-flag rows from\n  the GEM (exact matches) and discarding approximate rows will silently exclude the\n  majority of the mapping — most ICD-9 concepts have no exact ICD-10 equivalent.\n  Ignoring the flag entirely and treating all rows as equivalent is also wrong because\n  it obscures the uncertainty.\n\n- **When version drift is ignored.** Applying a quarterly ASP crosswalk from a\n  different quarter than the study period, or applying the FY2016 GEM to FY2018 data,\n  introduces uncontrolled mapping variation. For regulatory submissions (FDA,\n  payer dossiers), the crosswalk version must be documented and justified.\n\n- **For MA-only or capitated data.** If the claims data derive from capitated\n  arrangements that do not produce adjudicated ICD codes (e.g., Medicare Advantage\n  encounter data with systematically missing diagnoses), no crosswalk can recover\n  what was never coded. Restrict to FFS-observable person-time before applying any\n  ICD-based crosswalk.\n\n**Data-source operational depth.**\n\n- **Claims (FFS commercial / Medicare FFS):** ICD-9/ICD-10-CM on the diagnosis fields;\n  NDC on pharmacy claims; HCPCS/CPT on medical claims. 
Apply GEMs across the 2015\n  transition; use the ASP crosswalk to resolve J-codes to NDCs; use RxNorm to\n  normalize NDC to ingredient. Failure mode: procedure codes on medical claims may be\n  CPT (AMA-licensed, not in GEMs) — use the partial CPT↔SNOMED map or OMOP \"Maps to\"\n  if the data are in OMOP. Document whether the study population includes\n  Medicare Advantage spans; the NDC-HCPCS crosswalk is irrelevant for MA claims where\n  drug administration records may be absent.\n- **EHR:** Problem lists and encounter diagnoses may carry ICD-10-CM or SNOMED CT\n  depending on the system and configuration year. The SNOMED↔ICD-10-CM NLM map may be\n  needed to align EHR concepts with claims. Orders and prescriptions may carry local\n  drug codes that require a custom mapping step before RxNorm normalization.\n- **Registry:** Disease-specific registries often use registry-specific codes that\n  require a project-specific crosswalk to standard vocabularies; the GEMs and RxNorm\n  crosswalks do not cover registry-specific coding schemes.\n- **OMOP-CDM:** The ETL handles all source-to-standard mapping through the\n  CONCEPT_RELATIONSHIP \"Maps to\" table; researchers build concept sets on standard\n  concepts and inherit the crosswalk automatically. The workflow for a transition-spanning\n  study is to verify that the ETL bridged both ICD-9-CM and ICD-10-CM to the same\n  SNOMED CT standard concept and to quantify the concept_id = 0 unmapped fraction on\n  both sides of the transition date.",
    "primary_category": "Data_Standard",
    "tags": [
      "coding-system",
      "data-standard",
      "primitive",
      "crosswalk",
      "mapping",
      "terminology",
      "icd-10-cm",
      "icd-9-cm",
      "ndc",
      "hcpcs",
      "rxnorm",
      "snomed-ct",
      "gems",
      "umls",
      "omop",
      "vocabulary"
    ],
    "applies_to_study_types": [
      "claims_analysis",
      "ehr_study",
      "linked_data",
      "multi_database",
      "target_trial_emulation",
      "cohort_retrospective",
      "new_user"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1136/amiajnl-2012-001358",
        "url": "https://doi.org/10.1136/amiajnl-2012-001358",
        "citation_text": "Boyd AD, Li JJ, Burton MD, et al. The discriminatory cost of ICD-10-CM transition between clinical specialties: metrics, case study, and mitigating tools. Journal of the American Medical Informatics Association. 2013;20(4):708-717.",
        "year": 2013,
        "authors_short": "Boyd et al.",
        "notes": "Empirically quantifies the heterogeneity of ICD-9-to-ICD-10-CM GEM mapping difficulty across clinical specialties — introducing the concept of mapping discriminatory cost and showing that one-to-many and approximate mappings are the norm rather than the exception, which is the core challenge crosswalk users face."
      },
      {
        "role": "explain",
        "doi": "10.1186/s40621-018-0165-8",
        "url": "https://doi.org/10.1186/s40621-018-0165-8",
        "citation_text": "Slavova S, Costich JF, Luu H, Fields JR. Interrupted time series design to evaluate the effect of the ICD-9-CM to ICD-10-CM coding transition on injury hospitalization trends. Injury Epidemiology. 2018;5(1):34.",
        "year": 2018,
        "authors_short": "Slavova et al.",
        "notes": "Demonstrates the cartographic discontinuity problem directly — using ITS to separate true trend changes from coding-system-induced artifacts at 2015-10-01, which is the primary methodological response to GEM-induced inflation."
      },
      {
        "role": "explain",
        "doi": "10.1109/mitp.2005.122",
        "url": "https://doi.org/10.1109/mitp.2005.122",
        "citation_text": "Liu S, Wei Ma, Moore R. RxNorm: prescription for electronic drug information exchange. IT Professional. 2005;7(5):17-23.",
        "year": 2005,
        "authors_short": "Liu et al.",
        "notes": "Introduces RxNorm as the NLM standard for drug naming and explains the mapping from branded product and NDC to normalized ingredient — the conceptual basis for the NDC-to-RxNorm crosswalk used in all claims-based drug exposure studies."
      },
      {
        "role": "demonstrate",
        "doi": "10.1093/nar/gkh061",
        "url": "https://doi.org/10.1093/nar/gkh061",
        "citation_text": "Bodenreider O. The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Research. 2004;32(Suppl 1):D267-D270.",
        "year": 2004,
        "authors_short": "Bodenreider",
        "notes": "Describes UMLS as the CUI-level hub connecting more than 100 biomedical vocabularies — the conceptual architecture behind multi-vocabulary crosswalk resolution and the licensing framework (free NLM account required) that applies."
      },
      {
        "role": "use",
        "doi": null,
        "url": "https://www.cms.gov/medicare/coding-billing/icd-10-codes/icd-10-cm-icd-10-pcs-gem-archive",
        "citation_text": "Centers for Medicare and Medicaid Services. ICD-10-CM and ICD-10-PCS GEM Archive. Baltimore, MD: CMS; FY2018 (last updated). Accessed 2026.",
        "year": 2018,
        "authors_short": "CMS",
        "notes": "Official CMS archive of the FY2018 (final) GEM files for ICD-9-CM to ICD-10-CM/PCS and the reverse maps — the primary source file for every GEM-based crosswalk in claims research. Public domain."
      },
      {
        "role": "use",
        "doi": null,
        "url": "https://www.nlm.nih.gov/research/umls/index.html",
        "citation_text": "National Library of Medicine. Unified Medical Language System (UMLS). Bethesda, MD: NLM. Accessed 2026.",
        "year": 2024,
        "authors_short": "NLM",
        "notes": "Landing page for the UMLS Metathesaurus, RxNorm, and related NLM terminology services including the NDC-to-RxNorm API (RxNav) — requires a free NLM license for UMLS access."
      },
      {
        "role": "use",
        "doi": null,
        "url": "https://www.cms.gov/medicare/payment/part-b-drugs/asp-pricing-files",
        "citation_text": "Centers for Medicare and Medicaid Services. Medicare Part B Drug Average Sales Price: ASP Pricing Files (NDC-HCPCS Crosswalk). Baltimore, MD: CMS; updated quarterly. Accessed 2026.",
        "year": 2024,
        "authors_short": "CMS",
        "notes": "Quarterly CMS files that include the NDC-to-HCPCS crosswalk used to recover drug identity (including NOC J-codes) for Part B-administered drugs in claims research. Public domain."
      }
    ],
    "plain_language_summary": "Medical claims and electronic health records use different coding systems to label diagnoses, drugs, and procedures — and those systems change over time. A code crosswalk is an official translation table that links a code in one system to the closest match in another, the way a bilingual dictionary links words between languages. The key warning that beginners miss is that these translations are approximations, not exact equivalences — one old code often expands into several new ones, and translating back does not return the original set.",
    "key_terms": [
      {
        "term": "GEM (General Equivalence Mapping)",
        "definition": "The official CMS translation tables that map ICD-9-CM diagnosis and procedure codes to their ICD-10-CM/PCS equivalents, and vice versa, each row labeled with whether the match is exact or approximate."
      },
      {
        "term": "forward mapping",
        "definition": "Translating a code from an older or source system to a newer or target system, for example ICD-9-CM to ICD-10-CM; one source code may produce several target codes."
      },
      {
        "term": "backward mapping",
        "definition": "Translating from a newer or more granular system back to an older or less granular one, for example ICD-10-CM to ICD-9-CM; often collapses multiple specific codes into one general code."
      },
      {
        "term": "one-to-many mapping",
        "definition": "When a single code in the source system corresponds to two or more codes in the target system, because the target system has more granularity or finer clinical distinctions."
      },
      {
        "term": "approximate flag",
        "definition": "A marker in a GEM row indicating that the mapping is the best available match but not a true clinical equivalent — the majority of GEM rows carry this flag."
      },
      {
        "term": "combination code",
        "definition": "An entry in the GEM where one ICD-9-CM concept requires a cluster of two or more ICD-10-CM codes together to capture the full clinical meaning, because no single ICD-10-CM code covers everything the ICD-9 code described."
      },
      {
        "term": "mapping version",
        "definition": "The specific release of a crosswalk file (e.g., FY2018 GEMs, Q1 2024 ASP crosswalk) that was applied; must be documented because crosswalks are updated on their own schedules and different versions produce different results."
      }
    ],
    "worked_example": {
      "scenario": "An analyst is building a study on chronic obstructive pulmonary disease (COPD) hospitalizations in US commercial claims. The data span January 2013 through December 2018, which means the cohort crosses the ICD-9-to-ICD-10-CM transition on October 1, 2015. The analyst starts with one representative ICD-9-CM COPD hospitalization code — 491.21 (obstructive chronic bronchitis with acute exacerbation) — and wants to know what the GEM forward map produces and whether applying the backward map would return the starting code.\n",
      "dataset": {
        "caption": "Forward GEM rows for ICD-9-CM 491.21 (obstructive chronic bronchitis with acute exacerbation). Each row is one entry in the CMS FY2018 GEM forward-map file.\n",
        "columns": [
          "icd9_code",
          "icd10_code",
          "approximate_flag",
          "no_map_flag",
          "combination_flag",
          "scenario",
          "choice_list"
        ],
        "rows": [
          [
            "491.21",
            "J44.1",
            1,
            0,
            0,
            1,
            1
          ],
          [
            "491.21",
            "J44.0",
            1,
            0,
            0,
            1,
            2
          ]
        ]
      },
      "steps": [
        "The forward GEM for 491.21 returns 2 ICD-10-CM codes (J44.1 COPD with acute exacerbation and J44.0 COPD with acute lower respiratory infection). The approximate flag is 1 on both rows — neither is an exact equivalence.\n",
        "The scenario flag of 1 and choice_list values of 1 and 2 mean these two codes are alternatives; the GEM is presenting them as options rather than requiring both. A researcher must decide clinically which (or both) to include.\n",
        "The code count expands: one ICD-9 code becomes 2 candidate ICD-10 codes. If the analyst includes both, every pre-2015 hospitalization coded 491.21 will be matched against 2 ICD-10 codes post-2015 — inflating apparent code frequency at the transition even if COPD hospitalization rates are unchanged.\n",
        "Now apply the backward GEM to J44.1 (the primary forward-map result). The backward GEM returns 3 ICD-9-CM codes: 491.21, 491.20, and 496. The starting code 491.21 appears, but so do 2 additional codes — the round trip is NOT the original single code.\n",
        "Asymmetry count: forward map from 1 ICD-9 code produces 2 ICD-10 codes; backward map from the primary result produces 3 ICD-9 codes. 2 - 1 = 1 net inflation in the forward direction; 3 - 1 = 2 additional codes in the backward direction; 2 + 3 = 5 total codes involved in the round trip versus 1 original code.\n"
      ],
      "result": "Starting from 1 ICD-9-CM code (491.21), the forward GEM produces 2 ICD-10-CM candidate codes (approximate = 1 on both). Backward-mapping J44.1 returns 3 ICD-9-CM codes — 2 more than the original 1. The round trip 1 -> 2 -> 3 demonstrates asymmetry: forward does not equal backward, and neither direction produces cardinality = 1. An analyst who naively counts \"codes ever assigned to this condition\" across the transition window will see an apparent 2 / 1 = 2.0x code-count multiplication at 2015-10-01 that is entirely cartographic.\n",
      "timeline_spec": {
        "title": "ICD-9-to-ICD-10 GEM expansion for COPD (491.21) — code cardinality across transition",
        "window": {
          "start": "2013-01-01",
          "end": "2018-12-31",
          "label": "Study window spanning the ICD-10 transition (Oct 1, 2015)"
        },
        "events": [
          {
            "label": "ICD-9 era: 491.21 only",
            "start": "2013-01-01",
            "length_days": 1004,
            "quantity": "1 code"
          },
          {
            "label": "ICD-10 transition (Oct 1, 2015)",
            "start": "2015-10-01",
            "length_days": 1,
            "quantity": "GEM applied"
          },
          {
            "label": "ICD-10 era: J44.1 + J44.0 (2 codes)",
            "start": "2015-10-02",
            "length_days": 1186,
            "quantity": "2 codes"
          }
        ],
        "spans": [
          {
            "kind": "covered",
            "start": "2013-01-01",
            "end": "2015-09-30",
            "label": "1 ICD-9 code"
          },
          {
            "kind": "gap",
            "start": "2015-10-01",
            "end": "2015-10-01",
            "label": "Transition: 1 -> 2 codes"
          },
          {
            "kind": "covered",
            "start": "2015-10-02",
            "end": "2018-12-31",
            "label": "2 ICD-10 codes (approximate, both flags=1)"
          }
        ],
        "result": {
          "label": "1 ICD-9 -> 2 ICD-10 (forward); J44.1 -> 3 ICD-9 (backward); round trip 1 -> 2 -> 3",
          "value": 2.0
        }
      }
    },
    "prerequisites": [
      "icd-9-cm-legacy-coding",
      "icd-10-cm-diagnosis-coding",
      "ndc-national-drug-code"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "GEMs-based forward translation of a code list (triage mode)",
        "description": "Apply the GEM forward map to every ICD-9-CM code in an existing phenotype code list to generate candidate ICD-10-CM codes for clinical review. Retain all rows including approximate = 1, flag combination rows for clinical assessment, and pass the candidate set to a clinician reviewer who accepts, rejects, or adds codes independently. This is the AHIMA/OHDSI recommended workflow.",
        "edge_cases": [
          "Codes with no_map_flag = 1 have no GEM equivalent and require manual identification of the closest ICD-10-CM codes; do not silently drop them.",
          "Combination rows require the researcher to decide whether to require all codes in the cluster or accept any one as sufficient for the phenotype."
        ],
        "data_source_notes": "claims: apply to the principal and secondary diagnosis fields on inpatient UB-04 claims and on outpatient/professional claims for condition-specific eligibility; the GEM is not applicable to NDC, CPT, or HCPCS codes.\n"
      },
      {
        "name": "NDC normalization to RxNorm ingredient via RxNav",
        "description": "Resolve each NDC to its RxNorm RXCUI at the ingredient level using the NLM RxNav API. Build the ingredient-level code list first (RxNorm CUIs for the drug class or product), then use the RxNav NDC lookup to enumerate all NDCs that map to each ingredient, rather than assembling a hand-curated NDC list. Document the API call date (or snapshot) as the mapping version.",
        "edge_cases": [
          "Retired NDCs may not resolve via the current API; use the NLM historical NDC endpoint.",
          "Repackaged NDCs may map to a different RXCUI than the original product NDC despite identical active ingredient — verify at the ingredient level before comparing products."
        ],
        "data_source_notes": "claims pharmacy: each row in the pharmacy file carries an NDC in the NDC field; the RxNorm mapping key is the 11-digit zero-padded NDC (4-6-2 or 5-4-2 segment format); normalize the format before the API call.\n"
      },
      {
        "name": "ASP NDC-HCPCS crosswalk for Part B drug identity",
        "description": "Join the CMS quarterly ASP NDC-HCPCS crosswalk to the medical claims file on HCPCS code to recover the specific drug NDC (and thus RxNorm ingredient) behind a J-code or NOC code. Use the crosswalk from the quarter matching the service date of each claim to avoid version mismatch.",
        "edge_cases": [
          "Not all J-codes appear in a given quarter's ASP file; missing codes may reflect new drugs, drugs under MAC jurisdiction, or miscoded claims.",
          "NOC codes (J3490, J9999, J3590) by definition lack a standard HCPCS code; recovery requires the NDC field on the medical claim itself plus the crosswalk to RxNorm."
        ],
        "data_source_notes": "claims medical: the ASP crosswalk links on the HCPCS field of line items; confirm the claim has a non-null NDC when available, as that provides a more direct resolution path than the J-code alone.\n"
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "omop-concept-set-development-rwe",
        "pros_of_this": "Crosswalks (GEMs, ASP, RxNorm) operate directly on raw code values and do not require a CDM transformation step, making them accessible on any claims extract without a full OMOP ETL.",
        "cons_of_this": "Raw crosswalks require manual version management and do not benefit from the community-curated, version-controlled OMOP CONCEPT_RELATIONSHIP table, which integrates GEM-equivalent mappings with descendant hierarchies and quality audits.",
        "when_to_prefer": "Use raw crosswalks on non-OMOP data; use OMOP concept sets on OMOP-CDM data, where \"Maps to\" has already absorbed the crosswalk logic."
      },
      {
        "compared_to": "omop-cdm-method-patterns-rwe",
        "pros_of_this": "Crosswalk methodology is vocabulary-agnostic and applies to any claims dataset regardless of whether an OMOP ETL exists.",
        "cons_of_this": "OMOP's versioned ETL embeds the crosswalk in a auditable, reproducible pipeline; ad hoc crosswalk application is harder to document and version-control.",
        "when_to_prefer": "Prefer the OMOP approach for multi-site network studies; prefer direct crosswalk application for single-site studies on proprietary extracts."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Apply GEM to diagnosis fields (DX1-DX25 on inpatient, DX1-DX8 on professional); use NDC field on pharmacy claims with RxNorm normalization; use HCPCS field on medical claims with ASP crosswalk for Part B drugs. Document GEM version (FY2018 frozen), RxNorm snapshot date, and ASP crosswalk quarter. Exclude Medicare Advantage spans before applying ICD-based crosswalks.\n",
      "ehr": "Problem lists and encounter diagnoses may carry ICD-10-CM or SNOMED CT; apply SNOMED-to-ICD-10-CM NLM map for bridging to claims; medication orders carry local codes or RxNorm CUIs depending on the system — verify before assuming RxNorm normalization has already been applied by the EHR vendor.\n",
      "registry": "Registry-specific codes require project-specific crosswalk development; GEMs and RxNorm do not cover registry-specific coding schemes. Document the custom mapping as a study artifact.\n",
      "linked": "De-duplicate events mapped from multiple sources at the standard-concept level after all crosswalks are applied; date discrepancies between claim service date and EHR encounter date may affect assignment of the transition-era period.\n"
    },
    "implementations": [
      {
        "lang": "python",
        "code": "\"\"\"\nCode Crosswalk Utilities — GEM expansion + ASP NDC-HCPCS join\n=============================================================\nApplies the CMS FY2018 GEM forward map to an ICD-9-CM code list,\nreports one-to-many expansion and approximate flags, then demonstrates\nbackward-map asymmetry. Includes an ASP NDC-HCPCS J-code resolver.\n\nInput files (public domain, download from CMS GEMs archive and ASP pages):\n  gem_forward.tsv  — CMS 2018 ICD-9-CM to ICD-10-CM GEM forward map\n  gem_backward.tsv — CMS 2018 ICD-10-CM to ICD-9-CM GEM backward map\n  asp_crosswalk.csv — CMS quarterly ASP NDC-HCPCS crosswalk\n\"\"\"\nimport pandas as pd\nfrom pathlib import Path\n\n\ndef load_gem(path: str | Path) -> pd.DataFrame:\n    \"\"\"Load a CMS GEM flat file (space-delimited, no header).\n\n    Columns (per CMS format): source_code, target_code,\n    approximate, no_map, combination, scenario, choice_list.\n    \"\"\"\n    cols = [\n        \"source_code\", \"target_code\",\n        \"approximate\", \"no_map\", \"combination\",\n        \"scenario\", \"choice_list\",\n    ]\n    df = pd.read_csv(\n        path, sep=r\"\\s+\", header=None, names=cols,\n        dtype=str\n    )\n    # Convert flag columns to int for filtering\n    for c in [\"approximate\", \"no_map\", \"combination\", \"scenario\", \"choice_list\"]:\n        df[c] = pd.to_numeric(df[c], errors=\"coerce\").fillna(0).astype(int)\n    return df\n\n\ndef apply_forward_gem(\n    code_list: list[str],\n    gem_forward: pd.DataFrame,\n    include_approximate: bool = True,\n) -> pd.DataFrame:\n    \"\"\"Expand an ICD-9-CM code list via the GEM forward map.\n\n    Returns all matching rows, with one_to_many_count added.\n    include_approximate=False restricts to exact matches only\n    (WARNING: drops the majority of rows — use for triage only).\n    \"\"\"\n    df = gem_forward[gem_forward[\"source_code\"].isin(code_list)].copy()\n    if not include_approximate:\n        df = 
df[df[\"approximate\"] == 0]\n\n    # Count how many ICD-10 targets each ICD-9 source maps to\n    counts = (\n        df[df[\"no_map\"] == 0]\n        .groupby(\"source_code\")[\"target_code\"]\n        .count()\n        .rename(\"one_to_many_count\")\n    )\n    df = df.merge(counts, on=\"source_code\", how=\"left\")\n\n    no_map_codes = df[df[\"no_map\"] == 1][\"source_code\"].unique()\n    if len(no_map_codes):\n        print(\n            f\"WARNING: {len(no_map_codes)} code(s) have no_map=1 \"\n            f\"(no GEM equivalent): {list(no_map_codes)}\"\n        )\n    return df\n\n\ndef check_roundtrip_asymmetry(\n    source_codes: list[str],\n    gem_forward: pd.DataFrame,\n    gem_backward: pd.DataFrame,\n) -> dict:\n    \"\"\"Apply forward then backward and report asymmetry.\n\n    Returns dict with original code count, forward count, roundtrip count.\n    A round-trip that does NOT return the original set demonstrates asymmetry.\n    \"\"\"\n    # Forward: ICD-9 -> ICD-10\n    fwd = apply_forward_gem(source_codes, gem_forward)\n    icd10_codes = fwd[fwd[\"no_map\"] == 0][\"target_code\"].unique().tolist()\n\n    # Backward: ICD-10 -> ICD-9\n    bwd = gem_backward[gem_backward[\"source_code\"].isin(icd10_codes)]\n    icd9_roundtrip = bwd[\"target_code\"].unique().tolist()\n\n    original_set = set(source_codes)\n    roundtrip_set = set(icd9_roundtrip)\n    added = roundtrip_set - original_set\n    lost = original_set - roundtrip_set\n\n    return {\n        \"original_codes\": source_codes,\n        \"original_count\": len(source_codes),\n        \"icd10_forward_codes\": icd10_codes,\n        \"icd10_forward_count\": len(icd10_codes),\n        \"icd9_roundtrip_codes\": icd9_roundtrip,\n        \"icd9_roundtrip_count\": len(icd9_roundtrip),\n        \"codes_added_by_roundtrip\": sorted(added),\n        \"codes_lost_by_roundtrip\": sorted(lost),\n        \"is_symmetric\": original_set == roundtrip_set,\n    }\n\n\ndef load_asp_crosswalk(path: str | Path) -> 
pd.DataFrame:\n    \"\"\"Load the CMS quarterly ASP NDC-HCPCS crosswalk CSV.\n\n    CMS publishes these as CSV/Excel; key columns: HCPCS_CD, NDC, LONG_DESC.\n    Adjust column names to match the actual file header.\n    \"\"\"\n    df = pd.read_csv(path, dtype=str)\n    # Normalize column names to lowercase, strip spaces\n    df.columns = [c.strip().lower().replace(\" \", \"_\") for c in df.columns]\n    return df\n\n\ndef resolve_jcode_to_ndc(\n    claims: pd.DataFrame,\n    asp_crosswalk: pd.DataFrame,\n    hcpcs_col: str = \"hcpcs_cd\",\n    ndc_col: str = \"ndc\",\n) -> pd.DataFrame:\n    \"\"\"Join ASP crosswalk to medical claims on HCPCS code.\n\n    Recovers candidate drug identity (NDC and long description) for drug-specific\n    J-codes; NOC codes still need the claim-level NDC to identify the product.\n    Returns claims with ndc_from_asp and drug_description_from_asp columns added.\n\n    Note: multiple NDCs may map to one HCPCS code; the join produces one row per\n    NDC match per claim. Deduplicate using claim-level NDC field if available.\n    \"\"\"\n    asp_sub = asp_crosswalk[[hcpcs_col, ndc_col, \"long_desc\"]].rename(\n        columns={\n            ndc_col: \"ndc_from_asp\",\n            \"long_desc\": \"drug_description_from_asp\",\n        }\n    )\n    enriched = claims.merge(asp_sub, on=hcpcs_col, how=\"left\")\n    unresolved = enriched[\"ndc_from_asp\"].isna().sum()\n    if unresolved:\n        print(\n            f\"INFO: {unresolved} claim rows could not be matched to an NDC \"\n            f\"via the ASP crosswalk (missing HCPCS or new drug not in this quarter).\"\n        )\n    return enriched\n\n\n# ── Example usage ─────────────────────────────────────────────────────────────\nif __name__ == \"__main__\":\n    # Load GEM files (download from CMS GEMs archive)\n    # gem_fwd = load_gem(\"2018_I9gem.txt\")\n    # gem_bwd = load_gem(\"2018_I10gem.txt\")\n\n    # COPD exacerbation example (see worked_example above).\n    # CMS GEM files store codes without the decimal point, so 491.21 is 49121.\n    copd_icd9 = [\"49121\"]\n\n    # Demonstrate forward expansion and asymmetry:\n    # result = 
check_roundtrip_asymmetry(copd_icd9, gem_fwd, gem_bwd)\n    # print(result)\n    # Expected: original_count=1, icd10_forward_count=2, icd9_roundtrip_count=3\n    # is_symmetric=False (round trip 1 -> 2 -> 3, not 1 -> 1 -> 1)\n\n    print(\"GEM crosswalk utilities loaded. Provide GEM files to run.\")",
        "description": "Applies the CMS FY2018 GEM forward map to an ICD-9-CM code list and reports the full expansion including approximate flags, combination entries, and one-to-many counts. Then applies the backward map to the primary forward result to demonstrate asymmetry (forward-then-backward does not return the original set). Includes a helper that joins the CMS ASP NDC-HCPCS crosswalk to medical claims to recover drug identity behind J-codes.\n",
        "dependencies": [
          "pandas>=1.3"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "# Code Crosswalk Utilities (R) — GEM + ASP NDC-HCPCS join\n# =========================================================\n# Applies CMS FY2018 GEM forward map to an ICD-9-CM code list and\n# demonstrates backward-map asymmetry. Includes ASP J-code resolver.\n#\n# Input files (public domain):\n#   gem_forward_path — CMS 2018 ICD-9 -> ICD-10-CM GEM (space-delimited, no header)\n#   gem_backward_path — CMS 2018 ICD-10-CM -> ICD-9 GEM (space-delimited, no header)\n#   asp_crosswalk_path — CMS quarterly ASP NDC-HCPCS crosswalk (CSV)\n\nlibrary(data.table)\nlibrary(dplyr)\n\nGEM_COLS <- c(\"source_code\", \"target_code\",\n              \"approximate\", \"no_map\", \"combination\",\n              \"scenario\", \"choice_list\")\n\nload_gem <- function(path) {\n  dt <- fread(path, header = FALSE, col.names = GEM_COLS,\n              colClasses = \"character\")\n  flag_cols <- c(\"approximate\", \"no_map\", \"combination\", \"scenario\", \"choice_list\")\n  dt[, (flag_cols) := lapply(.SD, as.integer), .SDcols = flag_cols]\n  dt\n}\n\napply_forward_gem <- function(code_list, gem_forward,\n                              include_approximate = TRUE) {\n  # Expand ICD-9-CM code list via GEM forward map.\n  # include_approximate = FALSE restricts to exact matches (WARNING: loses most rows).\n  dt <- gem_forward[source_code %in% code_list]\n  if (!include_approximate) dt <- dt[approximate == 0]\n\n  # Count how many ICD-10 targets each source maps to (excluding no_map rows)\n  counts <- dt[no_map == 0, .(one_to_many_count = .N), by = source_code]\n  dt <- merge(dt, counts, by = \"source_code\", all.x = TRUE)\n\n  no_map_codes <- unique(dt[no_map == 1, source_code])\n  if (length(no_map_codes) > 0) {\n    warning(\"no_map=1 (no GEM equivalent): \", paste(no_map_codes, collapse = \", \"))\n  }\n  dt\n}\n\ncheck_roundtrip_asymmetry <- function(source_codes, gem_forward, gem_backward) {\n  # Forward: ICD-9 -> ICD-10\n  fwd <- apply_forward_gem(source_codes, 
gem_forward)\n  icd10_codes <- unique(fwd[no_map == 0, target_code])\n\n  # Backward: ICD-10 -> ICD-9\n  bwd <- gem_backward[source_code %in% icd10_codes]\n  icd9_roundtrip <- unique(bwd[, target_code])\n\n  added <- setdiff(icd9_roundtrip, source_codes)\n  lost  <- setdiff(source_codes, icd9_roundtrip)\n\n  list(\n    original_codes        = source_codes,\n    original_count        = length(source_codes),\n    icd10_forward_codes   = icd10_codes,\n    icd10_forward_count   = length(icd10_codes),\n    icd9_roundtrip_codes  = icd9_roundtrip,\n    icd9_roundtrip_count  = length(icd9_roundtrip),\n    codes_added           = added,\n    codes_lost            = lost,\n    is_symmetric          = setequal(source_codes, icd9_roundtrip)\n  )\n}\n\nload_asp_crosswalk <- function(path) {\n  dt <- fread(path, colClasses = \"character\")\n  setnames(dt, tolower(gsub(\" \", \"_\", names(dt))))\n  dt\n}\n\nresolve_jcode_to_ndc <- function(claims_dt, asp_dt,\n                                 hcpcs_col = \"hcpcs_cd\",\n                                 ndc_col   = \"ndc\") {\n  # Join ASP crosswalk to medical claims on HCPCS code.\n  # Returns claims with ndc_from_asp and drug_description columns added.\n  asp_sub <- asp_dt[, .(\n    hcpcs_cd        = get(hcpcs_col),\n    ndc_from_asp    = get(ndc_col),\n    drug_description = long_desc\n  )]\n  enriched <- merge(claims_dt, asp_sub, by.x = hcpcs_col, by.y = \"hcpcs_cd\",\n                    all.x = TRUE)\n  n_unresolved <- sum(is.na(enriched$ndc_from_asp))\n  if (n_unresolved > 0)\n    message(\"INFO: \", n_unresolved,\n            \" claim rows not matched via ASP crosswalk\")\n  enriched\n}\n\n# ── Example usage ──────────────────────────────────────────────────────────────\n# gem_fwd <- load_gem(\"2018_I9gem.txt\")\n# gem_bwd <- load_gem(\"2018_I10gem.txt\")\n#\n# CMS GEM files store codes without the decimal point, so 491.21 is 49121.\n# copd_icd9 <- \"49121\"\n# result <- check_roundtrip_asymmetry(copd_icd9, gem_fwd, gem_bwd)\n# stopifnot(!result$is_symmetric)  # Asymmetry confirmed: 1 -> 2 -> 3 
codes\n# cat(\"Forward count:\", result$icd10_forward_count, \"\\n\")\n# cat(\"Roundtrip count:\", result$icd9_roundtrip_count, \"\\n\")",
        "description": "R implementation applying the FY2018 GEM forward map to an ICD-9-CM code list (data.table or tidyverse), with one-to-many count and approximate flag summaries. Includes a tidy join of the CMS ASP NDC-HCPCS crosswalk to medical claims.\n",
        "dependencies": [
          "data.table",
          "dplyr"
        ],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  A[Clinical concept\\ne.g. COPD exacerbation] -->|Start here| B[ICD-9-CM code list\\npre-Oct 2015]\n  A -->|Re-derive natively| C[ICD-10-CM code list\\npost-Oct 2015]\n  B -->|GEM forward map\\napprox flag + 1:many| D[Candidate ICD-10-CM codes\\nrequire clinical review]\n  D -->|Clinician review\\naccept / reject / add| C\n  C -->|GEM backward map\\ngranularity loss| E[ICD-9-CM round-trip\\nNOT equal to B]\n  B -->|Study crosses 2015-10-01| F[Transition-spanning analysis]\n  C --> F\n  F -->|ITS diagnostics at 2015-10-01| G[Cartographic vs biological\\ndiscontinuity separated]\n  H[NDC on pharmacy claim] -->|RxNorm API\\nmonthly snapshot| I[RxNorm RXCUI\\ningredient level]\n  J[HCPCS J-code on medical claim] -->|CMS ASP crosswalk\\nquarterly| K[NDC + drug identity]\n  K --> I\n  I --> L[Drug exposure defined\\nat ingredient level]\n  M[OMOP source code\\nICD/NDC/CPT] -->|CONCEPT_RELATIONSHIP\\nMaps to| N[Standard concept\\nSNOMED/RxNorm/LOINC]",
        "caption": "Crosswalk ecosystem: the GEM forward map generates candidate ICD-10-CM codes that require clinical review; the backward map does not restore the original set (asymmetry). NDC resolves to RxNorm via the NLM API; J-codes resolve to NDC via the CMS ASP crosswalk. OMOP's CONCEPT_RELATIONSHIP Maps to relationship abstracts all of these into a single versioned hub for OMOP-CDM studies.\n",
        "alt_text": "Flowchart showing GEM forward and backward mapping paths for ICD-9/ICD-10 translation, RxNorm normalization of NDC codes, CMS ASP crosswalk for J-codes, and OMOP Maps to as a unified hub. The backward path illustrates that round-trip mapping is not symmetric.\n",
        "source_type": "illustrative",
        "source_citations": [
          "boyd-2013"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "used_with",
        "target_slug": "icd-9-cm-legacy-coding",
        "notes": "The GEM forward map is the bridge from ICD-9-CM to ICD-10-CM; ICD-9-CM concepts used in pre-2015 claims must be translated forward via GEMs (triage) then clinically reviewed to build a transition-spanning phenotype."
      },
      {
        "relation_type": "used_with",
        "target_slug": "icd-10-cm-diagnosis-coding",
        "notes": "The GEM backward map and the SNOMED-to-ICD-10-CM NLM map operate on ICD-10-CM as their target or source; concept re-derivation natively in ICD-10-CM is the gold standard, with the crosswalk as a first-pass tool."
      },
      {
        "relation_type": "used_with",
        "target_slug": "ndc-national-drug-code",
        "notes": "NDC is the source code for drug crosswalks — resolved to RxNorm ingredient via the NLM RxNav API and to HCPCS J-code via the CMS ASP quarterly crosswalk."
      },
      {
        "relation_type": "used_with",
        "target_slug": "rxnorm-drug-terminology",
        "notes": "RxNorm is the target standard for NDC normalization; ingredient-level RxNorm CUIs insulate drug exposure definitions from NDC churn (new package NDCs for the same drug)."
      },
      {
        "relation_type": "used_with",
        "target_slug": "hcpcs-level-ii-j-codes",
        "notes": "The CMS ASP NDC-HCPCS crosswalk maps NDC to J-code (or J-code to NDC) to recover drug identity in medical-benefit claims, essential for Part B drug studies and NOC-code resolution."
      },
      {
        "relation_type": "used_with",
        "target_slug": "snomed-ct-terminology",
        "notes": "The NLM rule-based SNOMED CT-to-ICD-10-CM map is a one-directional crosswalk used when EHR systems that code in SNOMED CT must be aligned with claims that use ICD-10-CM; lossy and best used for triage."
      },
      {
        "relation_type": "part_of",
        "target_slug": "omop-standardized-vocabularies",
        "notes": "OMOP's CONCEPT_RELATIONSHIP table implements the Maps to relationship as a continuously maintained, versioned crosswalk hub that absorbs GEM-equivalent logic for conditions, RxNorm normalization for drugs, and LOINC for labs into one pipeline."
      },
      {
        "relation_type": "see_also",
        "target_slug": "interrupted-time-series-rwe",
        "notes": "ITS with a breakpoint at 2015-10-01 is a primary method for distinguishing mapping-induced discontinuities (caused by the ICD coding transition) from biological trend changes in longitudinal claims studies, and it is the key diagnostic for GEM-induced artifacts."
      },
      {
        "relation_type": "see_also",
        "target_slug": "omop-concept-set-development-rwe",
        "notes": "OMOP concept set development is the operational framework that uses the Maps to crosswalk to build, expand, exclude, and version-freeze phenotype code lists in OMOP CDM — the OMOP-native alternative to GEM-based list translation."
      },
      {
        "relation_type": "affects",
        "target_slug": "misclassification-bias-correction-rwe",
        "notes": "Imperfect crosswalk translations introduce differential or non-differential misclassification of exposure or outcome; the approximate-flag rate and one-to-many expansion factor determine the magnitude, which must be addressed in sensitivity analyses."
      },
      {
        "relation_type": "see_also",
        "target_slug": "diagnosis-phenotype-algorithm-1ip-2op-time-window-rwe",
        "notes": "Phenotype algorithms defined as 1-inpatient-or-2-outpatient rules must be applied separately with the era-appropriate code list (ICD-9-CM before 2015-10-01, ICD-10-CM from that date onward); the crosswalk is what allows a single study to span both."
      }
    ],
    "aliases": [
      "GEMs",
      "General Equivalence Mappings",
      "crosswalk",
      "code mapping",
      "NDC-HCPCS crosswalk",
      "ICD crosswalk",
      "RxNorm crosswalk"
    ],
    "complexity": "foundational",
    "regulatory_relevance": [
      "fda",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "medicare-entitlement-lis-dual-eligibility-rwe",
    "name": "Medicare Entitlement, LIS, Dual Eligibility, and Parts A-D",
    "short_definition": "Medicare eligibility and benefit variables that distinguish entitlement reason, Parts A/B/C/D coverage, low-income subsidy status, and Medicare-Medicaid dual eligibility; these determine observable claims, cost sharing, and cohort eligibility in Medicare RWE.",
    "long_description": "Medicare RWE depends on eligibility state, not just age. A beneficiary can have Part A, Part B, Part C Medicare Advantage enrollment, Part D drug coverage, low-income subsidy status, Medicaid dual eligibility, disability entitlement, ESRD entitlement, or changing combinations over time. These variables determine which claims are observable, which services are covered, and whether pharmacy exposure, medical utilization, and cost sharing can be interpreted.\n\nAnalysts should treat Medicare coverage as time-varying. A person can move from FFS to Medicare Advantage, gain or lose Part D, enter a dual-eligible state, or change LIS status. A continuous-enrollment rule that ignores benefit coverage can create false drug gaps, missing medical events, or unobservable outcomes. Reason for entitlement also matters: aged entitlement, disability, and ESRD populations differ clinically and operationally.\n\nMedicare has official Parts A, B, C, and D. There is no standard Medicare Part E benefit. If a dataset or stakeholder says \"Part E,\" clarify whether they mean Extra Help/LIS, Medigap/supplemental coverage, employer wraparound, or a local variable.\n\n**Pros, cons, and trade-offs.** Month-level Medicare eligibility variables make claims analyses defensible because they distinguish coverage from observability. They reveal when pharmacy fills, medical claims, cost sharing, or MA encounter data can be interpreted. The trade-off is complexity: person-time must be split by month, benefit channel, and enrollment state before exposure and outcome construction. 
Collapsing everything to one baseline enrollment flag is easier, but it can create false nonadherence, false outcome absence, and biased cost denominators.\n\n**When to use.** Use full entitlement, A/B/C/D, LIS, dual, and reason-for-entitlement variables whenever a Medicare analysis depends on complete medical claims, Part D pharmacy exposure, cost-sharing interpretation, MA/FFS channel, disability or ESRD entitlement, or socioeconomic confounding. Use them before cohort construction, not as a late Table 1 decoration.\n\n**When NOT to use - and when it is actively misleading.** Do not treat a beneficiary as observable just because they are \"enrolled in Medicare.\" Do not assume Part D exposure capture without Part D coverage, or FFS medical outcome capture during Medicare Advantage months unless encounter data are proven fit for purpose. It is actively misleading to use \"Part E\" as if it were an official Medicare benefit category.",
    "primary_category": "Data_Quality_Assessment",
    "tags": [
      "medicare",
      "entitlement",
      "part-a",
      "part-b",
      "part-c",
      "part-d",
      "low-income-subsidy",
      "lis",
      "dual-eligibility",
      "reason-for-entitlement"
    ],
    "applies_to_study_types": [
      "claims_analysis",
      "comparative_effectiveness",
      "adherence",
      "health_economic_modeling"
    ],
    "data_sources": [
      "claims"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": null,
        "url": "https://www.medicare.gov/basics/get-started-with-medicare/medicare-basics/parts-of-medicare",
        "citation_text": "Medicare.gov. Parts of Medicare.",
        "year": 2026,
        "authors_short": "Medicare.gov",
        "notes": "Official beneficiary-facing definitions of Parts A, B, C, and D."
      },
      {
        "role": "explain",
        "doi": null,
        "url": "https://www.cms.gov/training-education/partner-outreach-resources/low-income-subsidy-lis",
        "citation_text": "Centers for Medicare & Medicaid Services. Low Income Subsidy (LIS).",
        "year": 2026,
        "authors_short": "CMS",
        "notes": "CMS description of Extra Help / LIS."
      },
      {
        "role": "explain",
        "doi": null,
        "url": "https://resdac.org/cms-data/variables/monthly-medicare-medicaid-dual-eligibility-code-january",
        "citation_text": "Research Data Assistance Center. Monthly Medicare-Medicaid Dual Eligibility Code.",
        "year": 2026,
        "authors_short": "ResDAC",
        "notes": "Variable reference for monthly dual eligibility."
      },
      {
        "role": "use",
        "doi": null,
        "url": "https://resdac.org/cms-data/variables/current-reason-entitlement-code",
        "citation_text": "Research Data Assistance Center. Current Reason for Entitlement Code.",
        "year": 2026,
        "authors_short": "ResDAC",
        "notes": "Variable reference for current entitlement reason."
      },
      {
        "role": "use",
        "doi": null,
        "url": "https://resdac.org/cms-data/variables/original-reason-entitlement-code",
        "citation_text": "Research Data Assistance Center. Original Reason for Entitlement Code.",
        "year": 2026,
        "authors_short": "ResDAC",
        "notes": "Variable reference for original Medicare entitlement reason (ENTLMT_RSN_ORIG), which can differ from current entitlement reason."
      },
      {
        "role": "demonstrate",
        "doi": "10.1001/jamahealthforum.2023.5152",
        "url": "https://doi.org/10.1001/jamahealthforum.2023.5152",
        "citation_text": "Fung V, Price M, Cheng D, Patel TA, Yang Z, Hsu J, Alegria M, Newhouse JP. Associations Between Annual Medicare Part D Low-Income Subsidy Loss and Prescription Drug Spending and Use. JAMA Health Forum. 2024;5(2):e235152.",
        "year": 2024,
        "authors_short": "Fung et al.",
        "notes": "Empirical example showing why LIS status should be treated as a time-varying Medicare variable in Part D analyses."
      }
    ],
    "plain_language_summary": "Medicare data only make sense if you know what the person was actually enrolled in at each time. Part A/B fee-for-service, Medicare Advantage, Part D drug coverage, LIS, dual eligibility, and entitlement reason all change what the data can see.",
    "key_terms": [
      {
        "term": "Original Medicare",
        "definition": "Medicare Parts A and B fee-for-service coverage."
      },
      {
        "term": "Medicare Advantage",
        "definition": "Part C plan coverage, where standard FFS claims may not capture complete utilization."
      },
      {
        "term": "Low-Income Subsidy",
        "definition": "Extra Help program that reduces Part D premiums and prescription drug costs for eligible beneficiaries."
      },
      {
        "term": "Dual eligibility",
        "definition": "Status indicating eligibility for both Medicare and Medicaid."
      },
      {
        "term": "Reason for entitlement",
        "definition": "Medicare entitlement basis, such as age, disability, ESRD, or combinations."
      }
    ],
    "worked_example": {
      "scenario": "A Part D adherence study follows beneficiaries monthly. One patient has Parts A/B/D and LIS in January-June, becomes full dual eligible in July, and switches to Medicare Advantage in October. The study needs complete pharmacy capture and observable medical events.",
      "dataset": {
        "caption": "Month-level Medicare eligibility panel; rows show the first month of each coverage state.",
        "columns": [
          "month",
          "part_a",
          "part_b",
          "part_c_ma",
          "part_d",
          "lis",
          "dual_status",
          "analytic_use"
        ],
        "rows": [
          [
            "2024-01",
            true,
            true,
            false,
            true,
            true,
            "none",
            "observable FFS medical + Part D pharmacy"
          ],
          [
            "2024-07",
            true,
            true,
            false,
            true,
            true,
            "full dual",
            "observable; dual/LIS covariates update"
          ],
          [
            "2024-10",
            true,
            true,
            true,
            true,
            true,
            "full dual",
            "MA observability flag; FFS outcome capture no longer assumed"
          ]
        ]
      },
      "steps": [
        "Build person-month eligibility before exposure episodes or outcomes.",
        "Require Part D coverage in every month that contributes to a pharmacy adherence denominator.",
        "Require A/B FFS observability for medical outcomes unless fit-for-purpose MA encounter data are available.",
        "Update LIS and dual status monthly rather than treating them as baseline constants."
      ],
      "result": "Person-months from January through September 2024 can support FFS medical plus Part D analyses; October 2024 onward requires MA-specific handling or censoring under a documented strategy."
    },
    "prerequisites": [],
    "index_definitions": [
      {
        "name": "Part A",
        "definition": "Hospital insurance, including inpatient hospital, skilled nursing facility, hospice, and some home health coverage.",
        "source": "Medicare.gov Parts of Medicare",
        "use": "Determines inpatient/hospital benefit eligibility in Original Medicare.",
        "notes": "Part A entitlement alone does not guarantee complete FFS claims; MA enrollment can intervene and interrupt FFS claim capture."
      },
      {
        "name": "Part B",
        "definition": "Medical insurance, including physician, outpatient, preventive, and certain drug/administration services.",
        "source": "Medicare.gov Parts of Medicare",
        "use": "Determines professional and outpatient medical claim observability in FFS.",
        "notes": "Buy-and-bill drugs may appear under Part B rather than Part D."
      },
      {
        "name": "Part C",
        "definition": "Medicare Advantage plan coverage offered by private plans approved by Medicare.",
        "source": "Medicare.gov Parts of Medicare",
        "use": "Identifies periods where FFS claims may be incomplete or replaced by encounter data.",
        "notes": "Treat MA as a distinct observability state."
      },
      {
        "name": "Part D",
        "definition": "Prescription drug coverage for Medicare beneficiaries.",
        "source": "Medicare.gov Parts of Medicare",
        "use": "Required for complete retail pharmacy exposure and adherence measurement in Medicare.",
        "notes": "LIS affects cost sharing and can affect adherence and treatment choice."
      },
      {
        "name": "Part E",
        "definition": "Not an official Medicare benefit part.",
        "source": "Medicare.gov Parts of Medicare",
        "use": "Clarification item when stakeholders use informal shorthand.",
        "notes": "Usually reflects confusion with Extra Help/LIS, Medigap, employer supplemental coverage, or local variables."
      },
      {
        "name": "LIS / Extra Help",
        "definition": "Program that helps eligible beneficiaries pay Medicare drug coverage costs.",
        "source": "CMS Low Income Subsidy",
        "use": "Cost-sharing covariate, subgroup, and proxy for socioeconomic status in Part D analyses.",
        "notes": "Time-varying monthly status should be used when available."
      },
      {
        "name": "Dual eligibility",
        "definition": "Monthly Medicare-Medicaid dual status variable identifying full or partial dual eligibility categories.",
        "source": "ResDAC dual eligibility code",
        "use": "Eligibility covariate, subgroup, and data-completeness/cost-sharing flag.",
        "notes": "Full and partial dual categories should not be collapsed without considering the research question."
      },
      {
        "name": "Reason for entitlement",
        "definition": "Current or original reason Medicare entitlement was established, such as age, disability, ESRD, or combinations.",
        "source": "ResDAC reason for entitlement code",
        "use": "Cohort eligibility, stratification, and confounding control.",
        "notes": "Original and current reason can differ; choose intentionally."
      }
    ],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Baseline entitlement covariate",
        "description": "Use reason for entitlement, dual status, and LIS at index as baseline confounders.",
        "edge_cases": [
          "Status can change shortly after index.",
          "Original and current entitlement reason may differ."
        ],
        "data_source_notes": "Good for Table 1 and baseline adjustment; insufficient for determining observability during follow-up."
      },
      {
        "name": "Time-varying eligibility panel",
        "description": "Month-level benefit state used to gate observable exposure, outcomes, and costs.",
        "edge_cases": [
          "MA transitions interrupt FFS claims.",
          "Part D loss creates false nonadherence."
        ],
        "data_source_notes": "Required for longitudinal Medicare claims studies."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Simple continuous enrollment flag",
        "use_full_panel_when": "Exposure or outcomes require specific Medicare benefits or MA/FFS distinctions.",
        "use_simple_flag_when": "The data vendor already supplies a validated observable-time variable for the exact channel needed.",
        "notes": "Continuous enrollment without benefit state is often under-specified."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Construct month-level A/B/C/D/LIS/dual/entitlement variables before claims filtering.",
      "linked": "Registry or EHR follow-up can continue during MA periods, but claims-based utilization capture may be incomplete or absent in those months."
    },
    "implementations": [],
    "diagrams": [],
    "relations": [
      {
        "relation_type": "requires",
        "target_slug": "continuous-enrollment-observable-time-rwe",
        "notes": "Medicare observability depends on enrollment and benefit state."
      },
      {
        "relation_type": "see_also",
        "target_slug": "medicare-ffs-ma-commercial-claims-differences-rwe",
        "notes": "Parts and MA enrollment determine whether FFS claims are complete."
      },
      {
        "relation_type": "used_with",
        "target_slug": "claims-analysis",
        "notes": "Eligibility variables must be applied before claims-based exposure/outcome construction."
      }
    ],
    "aliases": [
      "Medicare LIS",
      "low-income subsidy",
      "Extra Help",
      "Medicare dual eligibility",
      "dual eligible",
      "reason for entitlement",
      "original reason for entitlement",
      "current reason for entitlement",
      "Medicare Part A",
      "Medicare Part B",
      "Medicare Part C",
      "Medicare Part D",
      "Medicare Part E",
      "Medicare ABCD",
      "Medicare ABCDE"
    ],
    "complexity": "foundational",
    "regulatory_relevance": [
      "cms",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "medicare-ffs-ma-commercial-claims-differences-rwe",
    "name": "Medicare FFS vs Medicare Advantage vs Commercial Claims Differences",
    "short_definition": "Systematic structural, incentive, and completeness differences among US administrative data by payer type (Medicare fee-for-service claims, Medicare Advantage encounter records, and commercial claims) that change exposure/outcome ascertainment, measured confounders, person-time observability, and the generalizability of real-world evidence.",
    "long_description": "US administrative data are not one substrate. They are produced by three economically distinct\nmachines — **Medicare fee-for-service (FFS)** payment claims, **Medicare Advantage (MA)** encounter\nrecords, and **commercial** payment claims — and the machine that generated a row determines what is\ncaptured, how completely, with what coding pressure, and for whom. Treating \"claims data\" as fungible\nacross payer type is one of the most common and most consequential silent errors in claims-based RWE,\nbecause payer type sits upstream of phenotyping, exposure measurement, confounder ascertainment, and\nperson-time observability simultaneously.\n\n**Core conceptual distinction**. The three sources differ along two axes that operate independently.\n(1) *Data-generating purpose.* FFS records exist because a provider billed for payment, so for covered\nservices they are near-complete for what was reimbursed (adjudicated paid claims, MedPAR for inpatient\nstays, Part B carrier files, Part D for drugs). MA records are *encounter data* — plans are paid a\nrisk-adjusted capitation, so the record is a report of services rendered, submitted to CMS for risk\nadjustment and oversight rather than to trigger a fee; historically these were less complete and less\nstandardized than FFS claims, with steady improvement under CMS submission requirements. Commercial\nclaims are again payment-driven but assembled from contributing employers/payers (MarketScan, Optum,\nHealthCore), so completeness and benefit design vary by contributor and out-of-network capture is\nuneven. (2) *Coding incentive.* MA capitation is risk-adjusted on diagnoses (HCC model), creating a\nstrong, well-documented incentive for **coding intensity** — more complete and sometimes upcoded\ndiagnoses via in-home health risk assessments and retrospective chart review — so MA beneficiaries\nsystematically *appear* sicker than clinically identical FFS beneficiaries. 
FFS and commercial coding\nis driven by reimbursement and quality metrics, not HCC capitation. The practical consequence: the\nsame patient, same disease, can yield a different diagnosis profile, a different covariate vector, and\na different phenotype classification depending solely on which payer's data observed them.\n\n**Pros, cons, and trade-offs**.\n- **Multi-payer (pooled FFS + MA + commercial) vs single-payer (e.g., FFS-only):** Pooling buys\n  sample size and a payer mix that mirrors the policy-relevant US population (MA is now roughly half of\n  Medicare enrollment), and it lets you test transportability directly. Cost: you import differential\n  misclassification and coding-intensity confounding; a naive pooled hazard ratio can be biased if one\n  payer's data quality drives the estimate. **Prefer multi-payer with explicit payer stratification or\n  interaction**, never silent pooling.\n- **FFS-only vs MA-only:** FFS gives the cleanest, most research-validated capture and is the standard\n  benchmark, but it is a shrinking and increasingly selected slice (healthier or differently-selected\n  beneficiaries remain in FFS in some markets). MA-only maximizes contemporary relevance but demands\n  extra completeness validation and coding-intensity handling. **Prefer FFS-only** when phenotype\n  validity and complete utilization capture dominate; **prefer including MA** when current-population\n  generalizability dominates and you can validate.\n- **Harmonized common algorithm vs payer-specific algorithms:** A single phenotype (e.g., 1 inpatient\n  or 2 outpatient diagnoses in a fixed window) is simpler and comparable across arms, but its PPV and\n  sensitivity differ by payer because of coding intensity and completeness. Payer-specific tuning is\n  more valid but harder to defend and reduces comparability. 
**Prefer a harmonized algorithm with\n  payer-specific sensitivity analyses** and, where linked validation data exist, payer-specific\n  operating characteristics reported transparently.\n\n**When to use**. Any claims-based study that (a) spans more than one payer type, (b) draws on a 100%\nor large Medicare sample where MA and FFS coexist, (c) compares results to or transports them across\npayer populations, or (d) uses diagnosis-derived covariates/phenotypes where coding intensity could\ndistort case-mix. In these settings, payer type must be an explicit, first-class design variable —\nidentified at index, carried as a covariate or stratifier, and stress-tested in sensitivity analyses.\n\n**When NOT to use — and when it is actively misleading or dangerous**.\n- **Do not pool across payer types without stratification when the outcome or covariates are\n  diagnosis-derived.** Coding intensity makes MA comorbidity scores non-comparable to FFS; a pooled\n  propensity score or a pooled comorbidity-adjusted model then conditions on a payer-distorted\n  covariate, and \"adjustment\" can amplify rather than remove bias. 
This is the dangerous case.\n- **Do not assume \"no claim = no event\" inside MA-only person-time without validating completeness.**\n  If MA encounter capture is incomplete for the endpoint (historically true for some utilization\n  measures), absence of a record is missingness, not a true negative, biasing incidence downward\n  differentially by payer.\n- **Do not transport an FFS-calibrated phenotype, risk model, or cost model to an MA population\n  unchecked.** HCC-rich MA data and FFS-calibrated models do not interchange; risk scores are inflated\n  relative to FFS, and shadow-priced MA \"costs\" are not the same quantity as FFS allowed amounts.\n- **Do not treat a commercial database as representative of the elderly or disabled**, or assume one\n  commercial extract generalizes to another — contributor composition, formularies, and out-of-network\n  capture differ.\n\n**Data-source operational depth**.\n- **Medicare FFS claims (Parts A/B/D):** The research-grade benchmark. Inpatient via MedPAR, physician/\n  outpatient via carrier and outpatient files, drugs via Part D. Failure modes: final-action claims lag\n  several months (cohorts built on incomplete recent data undercount events); HCC capture exists but is\n  not the dominant coding driver; FFS is increasingly selected as beneficiaries move to MA, so an\n  FFS-only estimate may not transport. Workaround: allow claims run-out before locking the cohort;\n  treat FFS as the calibration anchor for completeness checks of other payers.\n- **Medicare Advantage encounter data (Part C):** Plan-submitted encounters, not paid claims. 
Failure\n  modes: historical and residual incompleteness for some service types; coding intensity (in-home HRAs\n  and chart reviews add diagnoses that never generate an encounter for the corresponding service),\n  inflating comorbidity/HCC profiles and the apparent prevalence of mild disease; **MA-only person-time\n  lacks FFS A/B medical claims**, so any logic that depends on those payment claims breaks. Do not\n  confuse this medical-claims gap with Part D/PDE pharmacy observability: when Part D coverage and PDE\n  files are available, outpatient drug transactions may be observable for both standalone PDP and MA-PD\n  plans. Workarounds: validate inpatient completeness against MedPAR for the linkable subset; exclude\n  or separately model MA-only person-time when the design relies on FFS medical observability; carry a\n  coding-intensity proxy (HCC count, chart-review/HRA flags where available) and run results with and\n  without it.\n- **Commercial claims (MarketScan, Optum, HealthCore, etc.):** Payment-driven and structurally similar\n  to FFS but for a younger, employed, differently-selected population. Failure modes: out-of-network\n  and carve-out (behavioral health, specialty pharmacy) leakage; short and lumpy enrollment from job\n  changes that truncates washout and follow-up; contributor turnover that creates artifactual cohort\n  entry/exit. Workarounds: require continuous medical + pharmacy enrollment across washout and\n  follow-up; check for contributor-driven enrollment discontinuities; never assume one extract's\n  completeness applies to another.\n- **Linked / multi-database (Sentinel, PCORnet, claims–EHR–vital-records):** Linkage can recover what a\n  single payer misses (death index for censoring, EHR severity for confounding), but introduces a\n  linkable-subset selection and date-reconciliation problems (order vs fill vs service dates), and\n  linkage quality itself can differ by payer. 
Distributed/OMOP-CDM studies must carry an explicit payer\n  flag and avoid harmonization that silently drops payer granularity.\n\n**Worked claims example**. Question: incident heart failure hospitalization comparing initiators of an\nSGLT2 inhibitor vs a DPP-4 inhibitor among adults ≥66 with type 2 diabetes, in a 100% Medicare sample\nthat contains both FFS and MA beneficiaries. Run the *same* protocol three ways and compare.\n(1) **Payer assignment at index:** from the monthly enrollment/eligibility file, classify each person\non the index month as `FFS` (Part A+B, not in a Part C plan) or `MA` (enrolled in a Part C contract).\n(2) **Cohort A — FFS-only:** require continuous Part A/B/D for the 365-day washout (no SGLT2 or DPP-4\nfill) through `index_date`; exposure from Part D `fill_date`/`days_supply`; baseline comorbidities from\nPart A/B diagnoses; first HF hospitalization from MedPAR. (3) **Cohort B — FFS + MA pooled:** add MA\nbeneficiaries using encounter diagnoses and Part D fills, classifying HF from MA inpatient encounters.\n(4) **Cohort C — MA-only.** Now inspect the three: the *measured* comorbidity burden (e.g., HCC count,\nElixhauser via diagnoses) is higher in MA than FFS for clinically comparable patients — coding\nintensity, not true case-mix — so a pooled propensity score conditions on a payer-distorted covariate;\ninpatient HF capture in MA should be benchmarked against MedPAR for any linkable subset, and if MA IP\ncapture is lower, pooled incidence is differentially attenuated; and because washout/\"no prior fill\"\nlogic requires payment-claim observability, any **MA-only person-time** that lacks FFS-style claims is\nunobservable for the washout — excluding it changes N and follow-up versus the pooled cohort.\nDeliverable: report the comparative estimate within FFS, within MA, and pooled with a payer × treatment\ninteraction; report the MedPAR-vs-encounter inpatient completeness ratio; and show how excluding\nMA-only person-time and 
adding/removing the coding-intensity proxy move the estimate. Concordant\nestimates across payers strengthen the inference; divergence localizes the data-quality threat instead\nof burying it in a pooled number.",
    "primary_category": "Data_Quality_Assessment",
    "tags": [
      "payer-heterogeneity",
      "medicare-ffs",
      "medicare-advantage",
      "encounter-data",
      "commercial-claims",
      "coding-intensity",
      "risk-adjustment-hcc",
      "data-completeness",
      "generalizability",
      "multi-database",
      "claims",
      "ma-only-person-time"
    ],
    "applies_to_study_types": [
      "claims_analysis",
      "cohort_retrospective",
      "multi_database",
      "linked_data",
      "comparative_effectiveness",
      "target_trial_emulation"
    ],
    "data_sources": [
      "claims",
      "linked",
      "multi-database"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1111/1475-6773.14211",
        "url": "https://doi.org/10.1111/1475-6773.14211",
        "citation_text": "Cotterill PG. An assessment of completeness and medical coding of Medicare Advantage hospitalizations in two national data sets. Health Services Research. 2023;58(6):1226-1235.",
        "year": 2023,
        "authors_short": "Cotterill",
        "notes": "Directly quantifies completeness and coding of Medicare Advantage inpatient encounter records against national benchmarks, establishing why MA encounter data are not interchangeable with FFS claims for utilization research."
      },
      {
        "role": "explain",
        "doi": "10.5600/mmrr.004.02.sa06",
        "url": "https://doi.org/10.5600/mmrr.004.02.sa06",
        "citation_text": "Kronick R, Welch WP. Measuring coding intensity in the Medicare Advantage program. Medicare & Medicaid Research Review. 2014;4(2):E1-E19.",
        "year": 2014,
        "authors_short": "Kronick & Welch",
        "notes": "Defines and measures MA coding intensity (the systematic upward drift in coded diagnoses driven by HCC risk adjustment), the mechanism that makes MA beneficiaries appear sicker than identical FFS beneficiaries and distorts diagnosis-derived covariates and phenotypes."
      },
      {
        "role": "demonstrate",
        "doi": "10.1111/1475-6773.13879",
        "url": "https://doi.org/10.1111/1475-6773.13879",
        "citation_text": "Jung J, Carlin C, Feldman R. Measuring resource use in Medicare Advantage using encounter data. Health Services Research. 2021;56(6):1196-1205.",
        "year": 2021,
        "authors_short": "Jung et al.",
        "notes": "Applied demonstration of how researchers operationalize MA encounter data for utilization measurement and where it diverges from FFS-style claims, including completeness and resource-use capture caveats."
      },
      {
        "role": "use",
        "doi": null,
        "url": "https://resdac.org/cms-data/files/pde",
        "citation_text": "Research Data Assistance Center. Part D Event (PDE) File. Centers for Medicare & Medicaid Services.",
        "year": 2026,
        "authors_short": "ResDAC",
        "notes": "Data dictionary source for Part D drug event records, relevant to distinguishing outpatient drug observability from A/B medical-claims observability."
      },
      {
        "role": "use",
        "doi": "10.1377/hlthaff.2023.01530",
        "url": "https://doi.org/10.1377/hlthaff.2023.01530",
        "citation_text": "Jacobs PD, Hill SC, Lipton BJ. In-home health risk assessments and chart reviews contribute to coding intensity in Medicare Advantage. Health Affairs. 2024;43(1):63-71.",
        "year": 2024,
        "authors_short": "Jacobs et al.",
        "notes": "Documents the specific operational mechanisms (in-home HRAs, retrospective chart reviews) that generate MA coding intensity, informing which coding-intensity proxies and sensitivity analyses an RWE protocol should carry."
      }
    ],
    "plain_language_summary": "Not all claims data are the same — three major US insurance sources (Medicare Fee-for-Service, Medicare Advantage, and commercial insurance) each produce records in a different way, for a different population, and with different gaps. Medicare Fee-for-Service generates a claim every time a provider bills for a covered service, so the record is nearly complete for what was paid. Medicare Advantage plans instead submit encounter reports to the government for oversight and risk-adjustment purposes, not to get a fee per service, which means some services may never appear in the data. Commercial insurance claims cover working-age adults and depend heavily on which employers contribute data to a given database, so coverage of out-of-network care and specialty drugs can vary widely. Choosing the wrong source — or mixing sources without accounting for their differences — can make a disease look more common than it is, make a treatment look safer than it is, or limit who the study's findings actually apply to.",
    "key_terms": [
      {
        "term": "Medicare Fee-for-Service (FFS)",
        "definition": "Traditional Medicare where the government pays a separate fee for each covered service a provider delivers, generating a claim record for every billed service."
      },
      {
        "term": "Medicare Advantage (MA)",
        "definition": "A private-plan alternative to traditional Medicare where the government pays the plan a fixed monthly amount per enrollee rather than a fee per service; the plan then submits encounter records to report what services were delivered."
      },
      {
        "term": "Encounter record",
        "definition": "A report submitted by a Medicare Advantage plan to document a service that was provided, as opposed to a fee-for-service claim that triggers a payment."
      },
      {
        "term": "Coding intensity",
        "definition": "The tendency for Medicare Advantage plans to document more diagnoses per patient than traditional Medicare would for the same patient, because plans are paid more when their enrollees appear sicker under the government's risk-adjustment formula."
      },
      {
        "term": "Risk adjustment (HCC model)",
        "definition": "A government formula that sets the monthly payment to a Medicare Advantage plan based on the diagnosed conditions of its enrollees; more serious diagnoses on record mean higher payments, creating an incentive to capture every possible diagnosis."
      },
      {
        "term": "Claims completeness",
        "definition": "How thoroughly a database captures all the healthcare services a patient actually received; a source with lower completeness will appear to show fewer services even if the patient had them."
      }
    ],
    "worked_example": {
      "scenario": "You are studying whether a new diabetes medication reduces hospitalizations for heart failure. Your dataset is a large Medicare sample that contains both Fee-for-Service and Medicare Advantage enrollees. Before running any analysis, you need to understand how the three major claims sources differ in who they cover, how complete their records are, and what biases each introduces. The table below walks through four key dimensions — and then the steps explain what each difference means for your study.",
      "dataset": {
        "caption": "Comparison of Medicare FFS, Medicare Advantage, and Commercial claims across four study-design dimensions",
        "columns": [
          "Dimension",
          "Medicare Fee-for-Service (FFS)",
          "Medicare Advantage (MA)",
          "Commercial"
        ],
        "rows": [
          [
            "Typical enrollee age and population",
            "65+ adults; increasingly those who stayed in traditional Medicare (some evidence of selection)",
            "65+ adults; now roughly half of all Medicare enrollees; skews toward certain regions and plans",
            "Under-65 working-age adults and their families; employer-sponsored coverage"
          ],
          [
            "How complete are the claims?",
            "Near-complete for services the government covers — every paid claim is recorded",
            "Variable; encounter records are submitted by plans but historically under-capture some service types compared to FFS",
            "Generally complete for in-network care; out-of-network visits and carved-out benefits (e.g., behavioral health) may be missing"
          ],
          [
            "Capitation or encounter gap risk",
            "No capitation encounter gap — each covered billed service generates a paid/adjudicated claim after runout; absence means no covered billed service was observed, not proof no care occurred anywhere",
            "Real gap — the plan is paid a flat monthly rate, so not every service triggers a new submission; a missing record could mean the service did not happen OR that it was not submitted",
            "Low gap risk for in-network; out-of-network services paid by the enrollee may never appear in the database"
          ],
          [
            "Who does the study generalize to?",
            "Traditional Medicare population — valid benchmark but shrinking as more people choose MA; results may not apply to MA enrollees",
            "Contemporary Medicare population overall, but coded diagnoses appear more numerous than in FFS for clinically similar patients due to coding intensity",
            "Working-age insured adults; findings do not generalize to elderly Medicare beneficiaries or uninsured populations"
          ]
        ]
      },
      "steps": [
        "Look at the Completeness row first. In FFS, after adequate runout, a missing claim means no covered billed/adjudicated service was observed in the FFS files — useful silence, but not proof that no care occurred outside the covered channel. In MA, a missing encounter record might mean the service did not happen or that the plan did not submit it, especially for outpatient visits; you cannot assume absence equals non-occurrence without checking.",
        "Now look at the Capitation gap row. Because MA plans get a fixed monthly payment, there is no financial trigger to submit a record for every single service the way there is in FFS. This is why MA encounter data historically under-captures some utilization compared to FFS, even when the service actually happened.",
        "Look at the Coding intensity row (part of the Generalizability row). MA plans are paid more by the government when their patients have more serious diagnoses on record. So plans run programs — including home visits and chart reviews — to find and document every diagnosis. The result is that an MA patient and an FFS patient with the exact same health status will often show different numbers of recorded diagnoses. If you use diagnosis counts to measure how sick your study patients are, MA patients will look sicker on paper than identical FFS patients.",
        "Now connect this to your heart failure hospitalization study. If you pool FFS and MA patients without accounting for these differences, your adjustment for patient health status will be distorted — you are adjusting for a payer-inflated number in MA patients, not a real difference in sickness. That distortion can bias your comparison of the two diabetes drugs.",
        "Finally, consider the washout step — the period before the study start where you confirm a patient has not already used the drug you are studying. A washout requires that you can see the benefit channel that captures the exposure. For medical-event washout, MA-only person-time lacks FFS A/B payment claims and must be handled with MA encounter validation or excluded. For outpatient drugs, Part D/PDE observability depends on Part D enrollment and PDE data; MA-PD transactions may be observable in PDE even when A/B medical events are not observable in FFS claims."
      ],
      "result": "Source choice determines who you study, how much you trust the absence of a record, and whether your measure of patient health status is comparable across groups. FFS is the most research-validated source and the right choice when you need clean utilization records and reliable washout logic. Including MA expands the study to the contemporary Medicare population but requires you to validate how complete the MA encounter records are and to account for coding intensity so that inflated diagnosis counts do not distort your results. Commercial claims are the right source for working-age adults but should not be used to draw conclusions about elderly Medicare patients. Mixing sources silently — without acknowledging these differences — is one of the most common ways a claims study can produce misleading findings."
    },
    "prerequisites": [
      "claims-analysis",
      "continuous-enrollment-observable-time-rwe",
      "fit-for-purpose-data-assessment-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Payer-stratified (run within each payer, then compare or meta-analyze)",
        "description": "Execute the full protocol separately within FFS, within MA, and within each commercial contributor, then compare estimates or meta-analyze. Concordance strengthens inference; divergence localizes a payer-specific data-quality threat rather than hiding it in a pooled estimate.",
        "edge_cases": [
          "Power loss in smaller strata (a single commercial contributor, or MA in early years).",
          "Different enrollment dynamics (MA contract lock-in vs commercial job-change churn vs FFS at end of life) produce non-comparable follow-up and attrition by stratum."
        ],
        "data_source_notes": "claims: requires a reliable payer flag in the eligibility/enrollment file at index; for MA use encounter tables and note submission timing vs FFS final-action claims."
      },
      {
        "name": "Harmonized algorithm with payer-specific sensitivity and coding-intensity proxy",
        "description": "Apply one common phenotype/covariate definition across payers for comparability, then test sensitivity with a payer × exposure interaction and, where available, adjust for or stratify on a coding-intensity proxy (HCC count, chart-review/HRA flag) so MA's inflated diagnosis capture does not silently drive the pooled result.",
        "edge_cases": [
          "Harmonization can mask true effect heterogeneity, or import bias if one payer's poor data quality dominates the pooled estimate.",
          "A coding-intensity proxy can over-correct if it is itself on the causal pathway."
        ],
        "data_source_notes": "MA: HCC/risk variables are often more completely captured than utilization; treat them as potentially over-coded. Commercial: standardize on benefit/formulary variables where the contributor records them consistently."
      },
      {
        "name": "MA-encounter completeness validation against an FFS benchmark",
        "description": "Before trusting MA-derived endpoints, benchmark MA inpatient (or other) capture against MedPAR for the linkable subset and quantify a completeness ratio; use MA for risk-adjusted population description but apply extra caution for non-risk utilization and incidence endpoints.",
        "edge_cases": [
          "Plans with aggressive HRAs/chart review inflate diagnosis capture for mild conditions while under-capturing the corresponding service encounters.",
          "MA-only person-time lacks FFS claims, so payment-claim-dependent logic (washout, \"no prior fill\") is unobservable and that person-time must be excluded or separately modeled."
        ],
        "data_source_notes": "linked: link MA encounter to MedPAR/death files where possible; CMS encounter completeness has improved over time, so compute the completeness ratio for the study calendar years."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Silent pooling across payer types without stratification or sensitivity analysis",
        "pros_of_this": "Explicit payer handling (stratify, interact, validate completeness, carry a coding-intensity proxy) prevents differential-misclassification and coding-intensity confounding from biasing the pooled estimate, and makes transportability claims defensible.",
        "cons_of_this": "More analytic steps, smaller per-stratum samples, and results that are harder to compress into a single headline number when payer-specific effects emerge.",
        "when_to_prefer": "Essentially always when covariates or endpoints are diagnosis-derived, or when the study spans payer types or claims broad US generalizability."
      },
      {
        "compared_to": "FFS-only (single, research-validated payer) analyses",
        "pros_of_this": "Including MA and commercial reflects the contemporary US payer mix (MA is roughly half of Medicare), supports transportability testing, and increases sample size and policy relevance.",
        "cons_of_this": "Imports MA coding intensity, encounter incompleteness, and commercial contributor variability; without validation these threaten internal validity.",
        "when_to_prefer": "When current-population generalizability or power requires more than FFS, and you can validate MA completeness and handle coding intensity."
      },
      {
        "compared_to": "Treating commercial extracts as interchangeable with Medicare or with each other",
        "pros_of_this": "Recognizing contributor- and population-specific differences (age, benefit design, out-of-network capture, formularies) prevents misattributing a data artifact to a real effect.",
        "cons_of_this": "Requires per-source feasibility and completeness assessment before pooling.",
        "when_to_prefer": "Any study combining commercial sources, or transporting between commercial and Medicare populations (e.g., across the age-65 transition)."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Assign payer type from the enrollment/eligibility file at index (FFS = Part A/B without a Part C contract; MA = enrolled in a Part C contract; commercial = the contributing payer). For MA, pull diagnoses/procedures from encounter tables and benchmark inpatient capture against MedPAR for any linkable subset. Apply phenotypes with a common definition plus payer-specific sensitivity. For costs, use FFS allowed/paid amounts directly but shadow-price MA encounters or use plan-reported costs, and note that coding intensity propagates into apparent case-mix and high-cost identification. Report enrollment continuity and churn by payer (MA lock-in vs commercial job-change churn).",
      "ehr": "Payer is in coverage/insurance or linked billing fields; MA vs FFS patients can differ in visit and documentation patterns because of plan incentives. Link to claims (including MA encounter extracts) for complete pharmacy and procedure capture, and recognize that EHR completeness can vary with a health system's contract mix.",
      "linked": "Linkage to death/registry files and the completeness of that linkage can differ by payer. Dual eligibles (Medicare + Medicaid) require reconciliation. For MA-to-FFS (or FFS-to-MA) transitions and the age-65 commercial-to-Medicare transition, define index and follow-up rules that avoid artifactual new-user entry or informative censoring.",
      "multi-database": "Common in Sentinel/PCORnet/commercial+Medicare work. Harmonize on robust common variables but retain an explicit payer flag and payer-specific granularity for sensitivity; in distributed/OMOP-CDM pipelines, document any transformation that loses payer information."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\n\ndef assign_payer_at_index(enroll: pd.DataFrame, cohort: pd.DataFrame) -> pd.DataFrame:\n    \"\"\"payer_at_index = MA if in a Part C contract the index month, else FFS (needs A+B).\"\"\"\n    c = cohort.copy()\n    c[\"index_month\"] = c[\"index_date\"].dt.to_period(\"M\")\n    e = enroll.merge(c[[\"person_id\", \"index_month\"]],\n                     left_on=[\"person_id\", \"month\"],\n                     right_on=[\"person_id\", \"index_month\"], how=\"inner\")\n    e[\"payer_at_index\"] = e[\"part_c\"].map({True: \"MA\", False: \"FFS\"})\n    # FFS classification additionally requires fee-for-service Part A and Part B at index.\n    e.loc[(e[\"payer_at_index\"] == \"FFS\") & ~(e[\"part_a\"] & e[\"part_b\"]),\n          \"payer_at_index\"] = \"UNCLASSIFIED\"\n    return c.merge(e[[\"person_id\", \"payer_at_index\"]], on=\"person_id\", how=\"left\")\n\ndef flag_ffs_observable(enroll: pd.DataFrame, cohort: pd.DataFrame,\n                        washout_days: int = 365) -> pd.Series:\n    \"\"\"True only if EVERY month in [index-washout, index] is FFS with A+B+D (no MA-only gap).\n    This gates FFS A/B medical observability; outpatient drug observability should be checked against Part D/PDE.\"\"\"\n    c = cohort.copy()\n    c[\"start_month\"] = (c[\"index_date\"] - pd.Timedelta(days=washout_days)).dt.to_period(\"M\")\n    c[\"index_month\"] = c[\"index_date\"].dt.to_period(\"M\")\n    e = enroll.merge(c[[\"person_id\", \"start_month\", \"index_month\"]], on=\"person_id\", how=\"inner\")\n    in_window = (e[\"month\"] >= e[\"start_month\"]) & (e[\"month\"] <= e[\"index_month\"])\n    e = e[in_window].copy()\n    e[\"ffs_month\"] = (~e[\"part_c\"]) & e[\"part_a\"] & e[\"part_b\"] & e[\"part_d\"]\n    # observable only if no month in the window is MA-only / lacks full FFS coverage\n    return e.groupby(\"person_id\")[\"ffs_month\"].all()\n\ndef coding_intensity_proxy(dx: pd.DataFrame, cohort: pd.DataFrame,\n                  
         baseline_days: int = 365) -> pd.DataFrame:\n    \"\"\"Distinct HCC count in the baseline window = coding-intensity proxy; compare by payer.\n    Higher mean HCC count in MA vs FFS for comparable patients signals coding intensity, not case-mix.\"\"\"\n    d = dx.merge(cohort[[\"person_id\", \"index_date\", \"payer_at_index\"]], on=\"person_id\")\n    d = d[(d[\"dx_date\"] <= d[\"index_date\"]) &\n          (d[\"dx_date\"] >= d[\"index_date\"] - pd.Timedelta(days=baseline_days))]\n    hcc = (d.groupby([\"person_id\", \"payer_at_index\"])[\"hcc\"]\n             .nunique().rename(\"hcc_count\").reset_index())\n    return hcc\n\ndef ma_inpatient_completeness(ip: pd.DataFrame) -> float:\n    \"\"\"MedPAR-vs-encounter completeness ratio for the linkable subset (target ~1.0).\n    Ratio < 1 means MA encounter under-captures inpatient stays -> endpoints biased low.\"\"\"\n    ma = (ip[\"source\"] == \"MA_ENCOUNTER\").sum()\n    ffs = (ip[\"source\"] == \"FFS_MEDPAR\").sum()\n    return ma / ffs if ffs else float(\"nan\")",
        "description": "Assign payer type at index and flag FFS-observable person-time from claims-style enrollment data,\nthen quantify MA coding intensity and benchmark MA inpatient capture against an FFS (MedPAR-style)\nreference. Required inputs (already cleaned and de-duplicated):\n  enroll : monthly enrollment -> person_id, month (period 'M'), part_c (bool: in an MA contract),\n           part_a, part_b, part_d (bool coverage flags)\n  cohort : analytic cohort     -> person_id, index_date (datetime)\n  dx     : diagnoses           -> person_id, dx_date, hcc (str HCC category), source in {'FFS','MA'}\n  ip     : inpatient stays     -> person_id, admit_date, source in {'FFS_MEDPAR','MA_ENCOUNTER'}\nUse payer_at_index for stratification, ffs_observable to gate FFS A/B medical washout logic, the HCC\ncount as a coding-intensity proxy, and the completeness ratio to decide whether MA medical endpoints\nare trustworthy. Build all covariates only within the pre-index baseline window downstream; handle\nPart D/PDE pharmacy observability separately when the exposure is an outpatient drug.",
        "dependencies": [
          "pandas"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\n\nassign_payer_at_index <- function(enroll, cohort) {\n  setDT(enroll); setDT(cohort)\n  cohort[, index_month := format(index_date, \"%Y-%m\")]\n  e <- merge(enroll, cohort[, .(person_id, index_month)],\n             by.x = c(\"person_id\", \"month\"), by.y = c(\"person_id\", \"index_month\"))\n  e[, payer_at_index := fifelse(part_c, \"MA\",\n                          fifelse(part_a & part_b, \"FFS\", \"UNCLASSIFIED\"))]\n  merge(cohort, e[, .(person_id, payer_at_index)], by = \"person_id\", all.x = TRUE)\n}\n\nflag_ffs_observable <- function(enroll, cohort, washout_days = 365L) {\n  # TRUE only if every month in the washout window is FFS with A+B+D (no MA-only gap).\n  # This gates FFS A/B medical observability; Part D/PDE pharmacy observability is separate.\n  setDT(enroll); setDT(cohort)\n  cw <- cohort[, .(person_id,\n                   start_m = format(index_date - washout_days, \"%Y-%m\"),\n                   index_m = format(index_date, \"%Y-%m\"))]\n  e <- merge(enroll, cw, by = \"person_id\")\n  e <- e[month >= start_m & month <= index_m]\n  e[, ffs_month := (!part_c) & part_a & part_b & part_d]\n  e[, .(ffs_observable = all(ffs_month)), by = person_id]\n}\n\ncoding_intensity_proxy <- function(dx, cohort, baseline_days = 365L) {\n  setDT(dx)\n  d <- merge(dx, cohort[, .(person_id, index_date, payer_at_index)], by = \"person_id\")\n  d <- d[dx_date <= index_date & dx_date >= index_date - baseline_days]\n  d[, .(hcc_count = uniqueN(hcc)), by = .(person_id, payer_at_index)]\n}\n\nma_inpatient_completeness <- function(ip) {\n  setDT(ip)\n  ma  <- ip[source == \"MA_ENCOUNTER\", .N]\n  ffs <- ip[source == \"FFS_MEDPAR\", .N]\n  if (ffs == 0L) NA_real_ else ma / ffs\n}",
        "description": "R/data.table mirror: assign payer at index, flag FFS-observable medical washout person-time, compute the\nHCC-count coding-intensity proxy by payer, and compute the MA-vs-MedPAR inpatient completeness ratio.\nInputs match the Python version (enroll has a monthly row per person with part_c/part_a/part_b/part_d\nlogical flags; cohort has person_id + index_date as Date; dx has person_id/dx_date/hcc/source; ip has\nperson_id/admit_date/source).",
        "dependencies": [
          "data.table"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "%let washout = 365;\n\n/* Payer type at index: MA if enrolled in a Part C contract that month; else FFS needs A+B. */\nproc sql;\n  create table payer as\n  select c.person_id, c.index_date,\n         case when e.part_c = 1 then 'MA'\n              when e.part_a = 1 and e.part_b = 1 then 'FFS'\n              else 'UNCLASSIFIED' end as payer_at_index length=12\n  from work.cohort c\n  left join work.enroll e\n    on c.person_id = e.person_id\n   and e.month = input(put(c.index_date, yymmn6.), 6.);\nquit;\n\n/* FFS-observable medical washout: every month in [index-washout, index] is FFS with A+B+D (no MA-only gap).\n   This gates FFS A/B medical observability; outpatient drug observability should be checked against Part D/PDE. */\nproc sql;\n  create table observable as\n  select c.person_id,\n         ( sum( case when (e.part_c=0 and e.part_a=1 and e.part_b=1 and e.part_d=1)\n                     then 0 else 1 end ) = 0 ) as ffs_observable\n  from work.cohort c\n  join work.enroll e\n    on c.person_id = e.person_id\n   and e.month between input(put(c.index_date - &washout, yymmn6.), 6.)\n                   and input(put(c.index_date,            yymmn6.), 6.)\n  group by c.person_id;\nquit;\n\n/* Coding-intensity proxy: distinct HCC count in the baseline window, by payer.\n   Higher MA mean vs FFS for comparable patients signals coding intensity, not true case-mix. */\nproc sql;\n  create table hcc_proxy as\n  select p.payer_at_index, d.person_id,\n         count(distinct d.hcc) as hcc_count\n  from payer p\n  join work.dx d\n    on p.person_id = d.person_id\n   and d.dx_date <= p.index_date\n   and d.dx_date >= p.index_date - &washout\n  group by p.payer_at_index, d.person_id;\n\n  create table hcc_by_payer as\n  select payer_at_index, mean(hcc_count) as mean_hcc, count(*) as n\n  from hcc_proxy group by payer_at_index;\nquit;\n\n/* MA-vs-MedPAR inpatient completeness ratio for the linkable subset (target ~1.0). 
*/\nproc sql;\n  create table ip_completeness as\n  select sum(source='MA_ENCOUNTER') as ma_ip,\n         sum(source='FFS_MEDPAR')   as ffs_ip,\n         calculated ma_ip / calculated ffs_ip as completeness_ratio\n  from work.ip;\nquit;",
        "description": "SAS/PROC SQL data-construction for payer-aware claims work (no estimation procedures here; this is\ncohort/data-quality preparation). Required input datasets (post data-management):\n  work.enroll : person_id, month (YYYYMM num), part_c (0/1), part_a, part_b, part_d (0/1)\n  work.cohort : person_id, index_date (SAS date)\n  work.dx     : person_id, dx_date, hcc (char HCC category), source ('FFS'/'MA')\n  work.ip     : person_id, admit_date, source ('FFS_MEDPAR'/'MA_ENCOUNTER')\nProduces payer_at_index for stratification, ffs_observable to gate FFS A/B medical washout logic,\nan HCC coding-intensity proxy by payer, and the MA-vs-MedPAR inpatient completeness ratio.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Pop[Source population] --> Payer{Payer type at index<br/>from enrollment file}\n  Payer -->|Medicare FFS| FFS[Payment claims A/B/D + MedPAR<br/>near-complete for covered services<br/>research benchmark; HCC not main coding driver]\n  Payer -->|Medicare Advantage| MA[Plan-submitted encounter data<br/>HCC coding intensity inflates diagnoses<br/>completeness varies; MA-only time lacks FFS claims]\n  Payer -->|Commercial| Comm[Contributor payment claims<br/>younger/employed; OON + carve-out leakage<br/>job-change enrollment churn]\n  FFS --> Harm[Common phenotype + covariate definitions]\n  MA --> Validate[Benchmark inpatient capture vs MedPAR<br/>carry coding-intensity proxy HCC/HRA flag]\n  Validate --> Harm\n  Comm --> Harm\n  Harm --> Analysis[Stratify by payer + payer x exposure interaction<br/>exclude/model MA-only person-time]\n  Analysis --> Report[Report within-FFS, within-MA, pooled<br/>+ completeness ratio + generalizability]",
        "caption": "Payer type as a first-class design variable. Because FFS, MA encounter, and commercial data differ in completeness and coding incentives, phenotyping, confounder ascertainment, and person-time observability must branch by payer, with MA completeness validated and results reported by payer.",
        "alt_text": "Flowchart branching the source population by payer type at index into Medicare FFS claims, Medicare Advantage encounter data, and commercial claims, each with its capture and coding characteristics, converging on harmonized definitions, MA completeness validation, payer-stratified analysis, and payer-specific reporting.",
        "source_type": "illustrative",
        "source_citations": [
          "cotterill-2023",
          "kronick-2014"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  Enr[Enrollment / eligibility file<br/>monthly payer flag] --> Span{Person-time observability}\n  Span -->|FFS A/B/D| Obs[FFS A/B medical + Part D observed<br/>channel-specific washout valid after runout]\n  Span -->|MA-only| Unobs[FFS A/B medical claims absent<br/>medical-event washout unobservable without encounters]\n  Obs --> Cohort[Include in claims-dependent cohort]\n  Unobs --> Decide{Design relies on FFS-style claims?}\n  Decide -->|Yes| Drop[Exclude MA-only person-time<br/>re-derive N and follow-up]\n  Decide -->|No, use encounter endpoints| Bench[Validate MA capture vs MedPAR<br/>then include with caveats]",
        "caption": "Why MA-only person-time is hazardous for claims-dependent medical-event logic. When a design depends on FFS A/B payment-claim observability, MA-only spans are effectively missing and must be excluded or handled with validated encounter data; outpatient drug washout is a separate Part D/PDE observability question.",
        "alt_text": "Data-flow diagram showing the enrollment file feeding a person-time observability decision, where FFS person-time is observed through A/B claims and MA-only person-time lacks FFS medical claims, leading either to exclusion when the design depends on FFS medical payment claims or to inclusion after MedPAR completeness validation when encounter endpoints are used; Part D/PDE drug observability is handled separately.",
        "source_type": "illustrative",
        "source_citations": [
          "cotterill-2023"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "see_also",
        "target_slug": "claims-analysis",
        "notes": "Payer type is a structural dimension of any claims-analysis protocol; capture, coding, and observability all branch by FFS vs MA vs commercial."
      },
      {
        "relation_type": "affects",
        "target_slug": "diagnosis-phenotype-algorithm-1ip-2op-time-window-rwe",
        "notes": "1 IP / 2 OP and similar algorithms have different PPV and sensitivity across FFS (payment-driven), MA (HCC coding intensity + encounter completeness), and commercial data."
      },
      {
        "relation_type": "affects",
        "target_slug": "claims-outcome-algorithm-ppv-sensitivity-rwe",
        "notes": "Outcome-algorithm operating characteristics shift by payer because coding intensity and encounter completeness change false-positive and false-negative rates."
      },
      {
        "relation_type": "see_also",
        "target_slug": "fit-for-purpose-data-assessment-rwe",
        "notes": "The FDA relevance/reliability assessment must be done by payer, since capture of exposures, outcomes, and covariates differs across FFS, MA, and commercial sources."
      },
      {
        "relation_type": "see_also",
        "target_slug": "generalizability-transportability-external-validity-rwe",
        "notes": "Payer mix and coding differences are major threats to transporting FFS-calibrated results to MA or commercial populations."
      },
      {
        "relation_type": "used_with",
        "target_slug": "high-dimensional-propensity-score-hdps-rwe",
        "notes": "hdPS proxy distributions differ by payer; payer type itself is a strong proxy for selection and coding practices and coding intensity can distort the covariate space."
      },
      {
        "relation_type": "affects",
        "target_slug": "attrition-and-loss-to-follow-up-rwe",
        "notes": "Payer-specific enrollment dynamics (MA lock-in, commercial job-change churn, FFS at end of life) drive differential attrition and informative censoring."
      },
      {
        "relation_type": "affects",
        "target_slug": "healthcare-costs-pppm-pppy-pmpm",
        "notes": "FFS has clear allowed/paid amounts; MA capitation requires shadow pricing or plan-reported costs and coding intensity distorts apparent case-mix; commercial uses contributor-specific negotiated rates."
      },
      {
        "relation_type": "see_also",
        "target_slug": "database-feasibility-attrition-funnel-rwe",
        "notes": "Feasibility and the attrition funnel must be computed by payer, since enrollment continuity and capture differ across FFS, MA, and commercial sources."
      },
      {
        "relation_type": "see_also",
        "target_slug": "time-zero-index-date-alignment-rwe",
        "notes": "Plan transitions (age-65 commercial-to-Medicare, MA/FFS switching) interact with index-date and continuous-coverage definitions and can create artifactual new users."
      }
    ],
    "aliases": [
      "payer heterogeneity in claims",
      "Medicare Advantage vs fee-for-service claims",
      "MA encounter data versus FFS claims",
      "coding intensity in real-world data",
      "multi-payer claims data differences",
      "Medicare Advantage encounter data limitations"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "hta",
      "pqa-cms"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "meta-analysis-obs",
    "name": "Meta-Analysis of Observational Studies",
    "short_definition": "Quantitative synthesis that pools adjusted effect estimates (and their variances) from multiple non-randomized studies into a summary effect, models between-study heterogeneity, and interrogates small-study/publication bias.",
    "long_description": "**Meta-analysis of observational studies** combines effect estimates from several non-randomized studies — cohort,\ncase-control, or self-controlled — into a single weighted summary, with an explicit model for how much the underlying\neffects vary across studies (heterogeneity). The inputs are *not* raw patient records but study-level summaries: each\nstudy contributes an adjusted point estimate on a linear scale (log odds ratio, log hazard ratio, log incidence-rate\nratio, or risk difference) and its variance (or a confidence interval from which variance is back-calculated). The\noutput is a pooled estimate, a confidence interval for the *mean* effect, a heterogeneity variance (τ²) with I²/H²,\nand — critically for observational evidence — a *prediction interval* for the effect in a new setting. This concept is\nthe aggregate-data (AD) sibling of individual-patient-data meta-analysis; when raw records are available, prefer IPD.\n\n**Core estimand distinction**. The fixed-effect (common-effect) model assumes every study estimates the *same* true\neffect and weights by inverse variance alone; its summary is the precision-weighted average and its CI shrinks toward\nzero width as studies accumulate. The random-effects model assumes each study's true effect is drawn from a\ndistribution with mean μ and variance τ²; weights become 1/(vᵢ + τ²), down-weighting large studies and the summary\ntargets the *mean of a distribution of effects*, not a single common value. The two answer different questions and\nmust not be swapped post hoc to chase significance. For observational studies the effects are essentially never\nidentical — each cohort has its own confounding structure, comparator, and population — so random effects is the\ndefault, and the prediction interval (μ ± t·√(τ² + se(μ)²)) is the honest summary, often far wider than the CI for μ.\nA deeper estimand problem precedes the model choice: the *per-study* estimands must be compatible. 
A study reporting\nan ATT from PS matching, one reporting a marginal ATE from IPTW, and one reporting a conditional OR from logistic\nregression are estimating different quantities on different populations; pooling them produces a number that\ncorresponds to no causal contrast. Estimand harmonization (scale, population, time-zero, comparator) is a\nprecondition, not a footnote.\n\n**Pros, cons, and trade-offs**.\n- **vs narrative/qualitative synthesis or a single large study:** A meta-analysis yields a quantitative summary,\n  formal heterogeneity statistics, and the ability to test effect modification via meta-regression and subgroups.\n  Cost: it can manufacture false precision by pooling structurally biased estimates — if every input study shares the\n  same unmeasured-confounding direction (e.g., healthy-user bias), the pooled estimate is *more* precisely wrong, not\n  less biased. A single large, well-designed active-comparator new-user study can beat a meta-analysis of weak ones.\n- **vs IPD meta-analysis:** Aggregate-data MA is cheap, fast, and uses published numbers, but cannot harmonize\n  exposure/outcome definitions, re-define time zero, fit a common adjustment set, or examine patient-level effect\n  modification (it is vulnerable to ecological/aggregation bias when only study-level covariates exist). IPD-MA\n  re-analyzes raw records under one protocol and is the gold standard when feasible. **Prefer IPD** when data-holders\n  will share records and definitions diverge across studies.\n- **vs network meta-analysis (NMA):** Pairwise MA pools direct evidence on one comparison; NMA borrows strength\n  across an evidence network to compare interventions never studied head-to-head, at the cost of a transitivity\n  assumption that is hard to defend across heterogeneous observational designs. 
**Prefer pairwise MA** unless the\n  decision genuinely needs an unstudied comparison.\n- **DerSimonian–Laird (DL) vs REML/Paule–Mandel for τ²:** The classic moment-based DL estimator under-covers when\n  studies are few; REML (or Paule–Mandel) with a Knapp–Hartung adjustment to the CI is the contemporary default and\n  is what reviewers now expect. **Prefer REML + Knapp–Hartung**, especially with <10 studies.\n\n**When to use**. When two or more methodologically comparable non-randomized studies report adjusted effects for the\nsame exposure–outcome contrast on a common scale; when a regulator/HTA body or guideline panel needs a single summary\nwith explicit heterogeneity; when you want to formally test whether design features (data source, washout, adjustment\nmethod, calendar era) modify the effect via meta-regression. A MOOSE/PRISMA-compliant search and pre-registered\nprotocol are prerequisites.\n\n**When NOT to use — and when it is actively misleading or dangerous**.\n- **Incompatible estimands.** Pooling an ATT, an ATE, and a conditional OR — or marginal and conditional effects on\n  the same scale — yields a summary that maps to no defined causal contrast. Harmonize the estimand first or do not\n  pool.\n- **Shared, same-direction unmeasured confounding.** If every input study compares a drug to non-users (healthy-user\n  bias) or shares immortal-time misclassification, random effects will report small τ² (the studies agree) and a\n  tight CI around a *biased* mean. Low heterogeneity is reassurance about consistency, never about validity.\n- **Differential confounder availability across data sources.** A claims study adjusting for proxies and an EHR study\n  adjusting for labs/vitals are estimating residually different quantities; the \"adjusted\" effects are not on a common\n  footing. 
Do not treat them as exchangeable without external-adjustment/quantitative-bias-analysis.\n- **Ecological/aggregation bias.** Meta-regression on *study-level* mean covariates (mean age, % female) to explain\n  heterogeneity does not estimate patient-level modification and can reverse sign — an aggregation-bias trap. Use IPD\n  for effect modification.\n- **Small-study effects mistaken for heterogeneity.** A funnel-plot asymmetry driven by publication bias or by smaller\n  (often less-adjusted) studies showing larger effects will inflate both the pooled estimate and τ². Reflexively\n  writing \"I² is high, interpret with caution\" is not an analysis — diagnose the source.\n- **Too few studies for random effects.** With 2–4 studies, τ² is essentially unestimable; DL collapses toward\n  fixed-effect and the CI is anti-conservative. Report individual studies, or use REML + Knapp–Hartung and a prediction\n  interval, and be explicit that pooling is exploratory.\n\n**Data-source operational depth**. The \"data source\" of an aggregate MA is the *extracted study table*, but the\nvalidity of pooling hinges entirely on the underlying real-world data each study used.\n- **Claims-based input studies:** Each contributes an effect built from `fill_date` + `days_supply` exposure episodes,\n  a continuous-enrollment washout, and ICD/CPT outcome algorithms. Failure mode: studies differ in whether they\n  excluded Medicare Advantage person-time. MA-only enrollees lack fee-for-service claims, so a study that pooled MA and\n  FFS without exclusion has misclassified exposure and outcomes differentially — its effect is on a different\n  measurement footing and should be flagged or excluded. Record washout length, comparator (active vs non-user), and\n  adjustment method (HDPS vs a handful of covariates) as moderators.\n- **EHR-based input studies:** Effects rest on phenotype algorithms and visit-driven capture; a patient who leaves the\n  system is differentially lost. 
Two EHR studies with different loss-to-follow-up handling are not exchangeable. Capture\n  whether linkage to fills/death index was used.\n- **Registry-based input studies:** Strong adjudicated outcomes and severity, weak exposure completeness; their effect\n  estimates often have smaller outcome misclassification but larger exposure misclassification than claims — a\n  systematic moderator, not noise.\n- **Differential competing risks:** In elderly claims cohorts, the competing risk of death differs by exposure;\n  studies that used cause-specific hazards vs Fine-Gray subdistribution estimate different quantities. Pooling a\n  cause-specific HR with a subdistribution HR is an estimand error.\n- **Immortal time in procedure/initiation studies:** Studies that started follow-up before the exposure decision carry\n  immortal-time bias in a consistent direction; pooling them propagates it. Code time-zero alignment as an\n  inclusion/quality moderator.\n\n**Worked example (synthesis of claims/EHR studies).** Question: SGLT2 inhibitor vs DPP-4 inhibitor and risk of diabetic\nketoacidosis (DKA) among adults with type 2 diabetes, pooling K=5 active-comparator new-user studies in administrative\ndata. (1) *Extraction.* For each study record: the adjusted hazard ratio and 95% CI, data source (FFS claims / EHR /\nlinked), washout length (180 vs 365 days), comparator definition, whether MA-only person-time was excluded, adjustment\nmethod (HDPS vs core covariates), competing-risk handling (cause-specific vs Fine-Gray), and time-zero rule. (2)\n*Compatibility check.* All five report a comparative new-user HR with active comparator and time zero at first fill — the\nestimands align; convert each to the log-HR scale with variance vᵢ = ((ln(UCL) − ln(LCL)) / (2·1.96))². 
(3) *Pool.* Fit\na random-effects model by REML: τ̂² quantifies between-study variance; weights wᵢ = 1/(vᵢ + τ̂²); the summary log-HR is\nΣwᵢ·yᵢ / Σwᵢ with se = √(1/Σwᵢ), and the CI uses the Knapp–Hartung t-quantile on K−1 df. Exponentiate to a pooled HR.\n(4) *Heterogeneity.* Report I² and, more importantly, the *prediction interval* — if it crosses 1.0 while the CI for μ\ndoes not, the mean effect is \"significant\" but the effect in a new database is not assured. (5) *Diagnose, don't\nboilerplate.* Meta-regress log-HR on washout length and adjustment method; a funnel plot + Egger test screens\nsmall-study effects; a leave-one-out and a fixed-vs-random sensitivity analysis test robustness. (6) *Bias caveat.* If\nall five share a non-user-comparator design instead of active-comparator, the low I² is consistency, not validity —\npair with a negative-control-outcome check and an E-value for the pooled estimate.\n\n**Interpreting the output**\n\nConsider the worked example: a random-effects pool of three observational studies yields a\npooled HR = 0.812 (approximately 0.81), indicating an approximately 19% lower hazard of stroke\nin patients taking the new drug versus the old class.\n\nFormal interpretation: The pooled HR is the weighted mean of the study-specific log hazard ratios,\nwith weights reflecting each study's precision and the between-study variance τ². The confidence\ninterval around 0.81 describes uncertainty about that mean estimate. Unlike a pooled RCT result,\nthis number does not enjoy the protection of randomization: if all three contributing studies\nchanneled the new drug preferentially to healthier patients — a shared confounding structure —\npooling concentrates that bias rather than averaging it away. A narrow CI signals cross-study\nconsistency; it does not signal unbiasedness. 
An I² near zero means the studies agree on a\ndirection and magnitude; it does not mean they agree on the causal effect.\n\nPractical interpretation: A pooled HR of 0.81 from observational data is a hypothesis-supporting\nestimate, not a causal verdict. Before using it in a cost-effectiveness model or formulary\ndecision, compute an E-value to quantify the minimum unmeasured confounding strength that could\nexplain the association entirely. Report a prediction interval if τ² > 0 to expose how variable\nthe real-world effect might be across future settings. If a negative-control outcome analysis is\nfeasible — checking whether the drug also appears to reduce an outcome it cannot plausibly affect\nbiologically — a non-null result in that analysis signals residual confounding that the pooled\nestimate inherits.",
    "primary_category": "Inferential_Statistics",
    "tags": [
      "meta-analysis",
      "evidence-synthesis",
      "random-effects",
      "heterogeneity",
      "between-study-variance",
      "prediction-interval",
      "meta-regression",
      "publication-bias",
      "observational-studies"
    ],
    "applies_to_study_types": [
      "meta_analysis_obs",
      "systematic_review"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1001/jama.283.15.2008",
        "url": "https://doi.org/10.1001/jama.283.15.2008",
        "citation_text": "Stroup DF, Berlin JA, Morton SC, et al. Meta-analysis of observational studies in epidemiology: a proposal for reporting (MOOSE). JAMA. 2000;283(15):2008-2012.",
        "year": 2000,
        "authors_short": "Stroup et al.",
        "notes": "The MOOSE statement — canonical reporting and conduct framework specific to meta-analysis of observational studies."
      },
      {
        "role": "explain",
        "doi": "10.1016/0197-2456(86)90046-2",
        "url": "https://doi.org/10.1016/0197-2456(86)90046-2",
        "citation_text": "DerSimonian R, Laird N. Meta-analysis in clinical trials. Controlled Clinical Trials. 1986;7(3):177-188.",
        "year": 1986,
        "authors_short": "DerSimonian & Laird",
        "notes": "Introduces the moment-based random-effects estimator of the between-study variance still used as the DL default."
      },
      {
        "role": "explain",
        "doi": "10.1136/bmj.327.7414.557",
        "url": "https://doi.org/10.1136/bmj.327.7414.557",
        "citation_text": "Higgins JPT, Thompson SG, Deeks JJ, Altman DG. Measuring inconsistency in meta-analyses. BMJ. 2003;327(7414):557-560.",
        "year": 2003,
        "authors_short": "Higgins et al.",
        "notes": "Defines I² and H² for quantifying heterogeneity; clarifies that I² measures inconsistency, not its magnitude or cause."
      },
      {
        "role": "demonstrate",
        "doi": "10.1136/bmj.315.7109.629",
        "url": "https://doi.org/10.1136/bmj.315.7109.629",
        "citation_text": "Egger M, Davey Smith G, Schneider M, Minder C. Bias in meta-analysis detected by a simple, graphical test. BMJ. 1997;315(7109):629-634.",
        "year": 1997,
        "authors_short": "Egger et al.",
        "notes": "The funnel-plot regression test for small-study effects/publication bias used to interrogate pooled estimates."
      }
    ],
    "plain_language_summary": "A meta-analysis of observational studies is a statistical technique that combines the results of several separate non-randomized studies into a single, more precise summary estimate. Researchers extract each study's main finding and how uncertain that finding is, then calculate a weighted average — studies with tighter results get more influence. The summary can reveal whether treatment effects are consistent across populations and settings, or whether they vary in ways that demand explanation. The honest warning: because none of the input studies randomly assigned treatment, any shared systematic flaw — for example, if every study compared treated patients against healthier untreated patients — will be pooled right into the summary, making the answer more precise but not less wrong.",
    "key_terms": [
      {
        "term": "effect estimate",
        "definition": "A number summarizing how much a treatment or exposure changes the outcome in one study, such as a hazard ratio of 0.80 meaning a 20% lower rate in the treated group."
      },
      {
        "term": "weight",
        "definition": "The share of influence a single study gets in the pooled average, usually larger for studies with narrower confidence intervals (more precise results)."
      },
      {
        "term": "heterogeneity",
        "definition": "The degree to which the true effect appears to differ across studies beyond what random chance alone would produce; high heterogeneity means studies are not estimating the same underlying quantity."
      },
      {
        "term": "pooled estimate",
        "definition": "The single weighted-average result produced by combining all included studies, reported with its own confidence interval."
      },
      {
        "term": "forest plot",
        "definition": "A standard chart for displaying meta-analysis results: each row shows one study's effect and confidence interval as a horizontal line with a square, and a diamond at the bottom shows the pooled estimate."
      }
    ],
    "worked_example": {
      "scenario": "A researcher wants to know whether a new class of blood-pressure drug reduces the risk of stroke compared with an older class. Three large observational studies have already been published — a cohort study using insurance records, a case-control study using hospital data, and a nested case-control study from a patient registry. None randomly assigned the drug; patients and their doctors chose treatment. The researcher extracts each study's adjusted hazard ratio and assigns each a weight based on its precision, then calculates a single pooled hazard ratio.",
      "dataset": {
        "caption": "Extracted results from three observational studies. Each study reports an adjusted hazard ratio (HR) comparing new drug vs old drug for stroke risk, and its assigned weight in the pooled analysis.",
        "columns": [
          "study",
          "effect",
          "weight_pct"
        ],
        "rows": [
          [
            "Smith 2019 (cohort, insurance claims)",
            "HR 0.82",
            40
          ],
          [
            "Jones 2021 (case-control, hospital data)",
            "HR 0.74",
            35
          ],
          [
            "Patel 2022 (nested case-control, registry)",
            "HR 0.90",
            25
          ]
        ]
      },
      "steps": [
        "Multiply each study's hazard ratio by its weight: Smith 2019 contributes 0.82 x 40 = 32.80; Jones 2021 contributes 0.74 x 35 = 25.90; Patel 2022 contributes 0.90 x 25 = 22.50.",
        "Check that the weights sum to 100%: 40 + 35 + 25 = 100.",
        "Add the weighted contributions: 32.80 + 25.90 + 22.50 = 81.20.",
        "Divide by the total weight (100) to get the pooled estimate: 81.20 / 100 = 0.812.",
        "Note that the three individual hazard ratios (0.82, 0.74, 0.90) are close but not identical, which is typical; the pooled value sits between the smallest and largest, pulled toward the studies with the most weight.",
        "Unlike pooling randomized trials, pooling these observational studies carries forward any unmeasured confounding present in each study — if healthier patients tended to get the new drug in all three studies, the pooled HR of 0.812 would overstate the drug's benefit."
      ],
      "result": "Pooled HR = (0.82 x 40 + 0.74 x 35 + 0.90 x 25) / 100 = (32.80 + 25.90 + 22.50) / 100 = 81.20 / 100 = 0.812. Interpreted as an approximately 19% lower hazard of stroke in patients taking the new drug, averaged across three observational studies. Because all three studies drew patients from real-world practice rather than a randomized experiment, the pooled estimate inherits any shared confounding or measurement differences across those data sources; a tight confidence interval around 0.812 indicates consistency across studies, not that the answer is unbiased."
    },
    "prerequisites": [
      "systematic-review",
      "cohort-retrospective"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Fixed-effect (common-effect) inverse-variance pooling",
        "description": "Assumes one true effect across studies; weights wᵢ = 1/vᵢ. Summary is the precision-weighted mean with se = sqrt(1/Σwᵢ). Appropriate only when heterogeneity is negligible and studies are functionally replicates.",
        "edge_cases": [
          "With real between-study variance present, fixed-effect CIs are too narrow and over-weight a single large study.",
          "Should not be selected post hoc because it gives a narrower CI than random effects."
        ],
        "data_source_notes": "Rarely defensible for observational inputs, which differ in confounding structure, comparator, and population."
      },
      {
        "name": "Random-effects (REML / Paule–Mandel) with Knapp–Hartung adjustment",
        "description": "Estimates between-study variance τ² (REML or Paule–Mandel preferred over DL), weights wᵢ = 1/(vᵢ + τ²), and builds the CI for the mean effect with a t-distribution on K−1 df (Knapp–Hartung) rather than a normal quantile.",
        "edge_cases": [
          "With <10 studies, τ² is imprecise; Knapp–Hartung restores nominal coverage that DL with a normal CI loses.",
          "Report a prediction interval, not just the CI for μ, as the policy-relevant summary."
        ],
        "data_source_notes": "Default for observational MA; record data source, washout, comparator, and adjustment method as moderators."
      },
      {
        "name": "Meta-regression on study-level moderators",
        "description": "Regresses the effect on study characteristics (washout length, adjustment method, calendar era, data source) to explain heterogeneity, using a mixed-effects (random-effects residual) model.",
        "edge_cases": [
          "Study-level mean covariates (mean age, % female) cannot estimate patient-level effect modification — aggregation bias.",
          "Low power with few studies; restrict to a small number of pre-specified moderators to avoid overfitting/data-dredging."
        ],
        "data_source_notes": "Most defensible moderators in RWE MA are design/measurement features, not patient summaries."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Individual-patient-data (IPD) meta-analysis",
        "pros_of_this": "Uses published summaries; cheap, fast, no data-sharing agreements; broad study coverage.",
        "cons_of_this": "Cannot harmonize exposure/outcome definitions, re-set time zero, fit a common adjustment set, or examine patient-level effect modification; vulnerable to ecological/aggregation bias.",
        "when_to_prefer": "When raw records are unavailable and the published per-study estimands are already compatible."
      },
      {
        "compared_to": "A single large, well-designed active-comparator new-user study",
        "pros_of_this": "Quantitative heterogeneity assessment and a wider evidence base; ability to test design moderators.",
        "cons_of_this": "Can produce precisely biased summaries if all inputs share the same unmeasured-confounding direction; garbage-in propagates.",
        "when_to_prefer": "When multiple comparable, well-adjusted studies exist and consistency across settings is itself the question."
      },
      {
        "compared_to": "Network meta-analysis (NMA)",
        "pros_of_this": "Simpler, transparent, and requires only a defensible exchangeability/heterogeneity assumption for one comparison.",
        "cons_of_this": "Cannot compare interventions never studied head-to-head.",
        "when_to_prefer": "When the decision needs only the direct pairwise contrast and no unstudied comparison is required."
      },
      {
        "compared_to": "Fixed-effect meta-analysis",
        "pros_of_this": "Targets the mean of a distribution of true effects and yields honest (wider) uncertainty plus a prediction interval.",
        "cons_of_this": "Down-weights large precise studies; τ² is hard to estimate with few studies.",
        "when_to_prefer": "Whenever between-study heterogeneity is plausible — i.e., essentially all observational syntheses."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Each input study's effect rests on fill_date/days_supply episodes, continuous-enrollment washout, and ICD/CPT algorithms. Record whether Medicare Advantage-only person-time (no FFS claims) was excluded, the washout length, the comparator (active vs non-user), and the adjustment method (HDPS vs core covariates) as moderators; flag or exclude studies that mixed MA and FFS exposure without exclusion.",
      "ehr": "Effects depend on phenotype algorithms and visit-driven capture; record loss-to-follow-up handling and any linkage to fills/death index. EHR and claims \"adjusted\" effects are not automatically exchangeable.",
      "registry": "Adjudicated outcomes and severity reduce outcome misclassification but exposure capture is weaker; treat data source as a systematic moderator, not noise.",
      "linked": "Linked claims-EHR-vital-records studies are the strongest inputs (severity + completeness + mortality) but the linkable subset introduces selection that differs from claims-only studies."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\nfrom scipy import stats\n\ndef meta_random_effects(yi, vi, tau2_method=\"REML\", knapp_hartung=True):\n    \"\"\"Pool log effects yi with variances vi under a random-effects model.\n\n    Returns the summary on the LOG scale (exponentiate for HR/OR/IRR), its CI,\n    tau2, I2, the Q-test, and a 95% prediction interval.\n    \"\"\"\n    yi = np.asarray(yi, float); vi = np.asarray(vi, float)\n    k = yi.size\n\n    # --- Q and the DerSimonian-Laird moment estimator of tau^2 ---\n    wf = 1.0 / vi                                   # fixed-effect (inverse-variance) weights\n    ybar_f = np.sum(wf * yi) / np.sum(wf)\n    Q = float(np.sum(wf * (yi - ybar_f) ** 2))      # Cochran's Q\n    df = k - 1\n    C = np.sum(wf) - np.sum(wf ** 2) / np.sum(wf)\n    tau2_dl = max(0.0, (Q - df) / C)\n\n    if tau2_method.upper() == \"REML\":\n        # Iterate REML: tau2 = sum(w^2[(yi-mu)^2 - vi]) / sum(w^2) + 1/sum(w), w = 1/(vi+tau2)\n        tau2 = tau2_dl\n        for _ in range(200):\n            w = 1.0 / (vi + tau2)\n            mu = np.sum(w * yi) / np.sum(w)\n            num = np.sum(w ** 2 * ((yi - mu) ** 2 - vi))\n            new = max(0.0, num / np.sum(w ** 2) + 1.0 / np.sum(w))\n            if abs(new - tau2) < 1e-10:\n                tau2 = new; break\n            tau2 = new\n    else:\n        tau2 = tau2_dl\n\n    # --- Random-effects pooling ---\n    w = 1.0 / (vi + tau2)\n    mu = float(np.sum(w * yi) / np.sum(w))\n    if knapp_hartung:                               # Knapp-Hartung: t-based CI, corrected SE\n        q_kh = np.sum(w * (yi - mu) ** 2) / df\n        se = float(np.sqrt(q_kh / np.sum(w)))\n        crit = stats.t.ppf(0.975, df)\n    else:\n        se = float(np.sqrt(1.0 / np.sum(w)))\n        crit = stats.norm.ppf(0.975)\n\n    ci = (mu - crit * se, mu + crit * se)\n    I2 = max(0.0, (Q - df) / Q) * 100 if Q > 0 else 0.0\n    Q_p = float(stats.chi2.sf(Q, df))\n    # Prediction interval: where a NEW study's true effect 
is expected to lie\n    pi_crit = stats.t.ppf(0.975, df - 1) if df > 1 else stats.norm.ppf(0.975)\n    pi_se = np.sqrt(tau2 + se ** 2)\n    pred = (mu - pi_crit * pi_se, mu + pi_crit * pi_se)\n\n    return {\"k\": k, \"mu\": mu, \"se\": se, \"ci\": ci, \"tau2\": tau2, \"I2\": I2,\n            \"Q\": Q, \"Q_p\": Q_p, \"prediction_interval\": pred}\n\n# res = meta_random_effects(df[\"yi\"], df[\"vi\"])\n# pooled_HR = np.exp(res[\"mu\"]); print(pooled_HR, np.exp(res[\"ci\"]), np.exp(res[\"prediction_interval\"]))",
        "description": "Random-effects meta-analysis of observational studies from an extracted study table (no statsmodels MA primitive,\nso the DerSimonian-Laird and REML estimators are computed explicitly for transparency/auditability). Required input\n(one row per study, already harmonized to a common linear scale and estimand):\n  df : study_id, yi (log effect: lnHR/lnOR/lnIRR), vi (variance of yi)\n       plus optional moderators (washout_days, adjustment, data_source) for meta-regression.\nIf only a CI is published, derive vi = ((log(uci) - log(lci)) / (2*1.959964))**2 before calling.",
        "dependencies": [
          "numpy",
          "scipy"
        ],
        "source_citations": [
          "dersimonian-laird-1986",
          "higgins-2003"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(metafor)\n\n## Random-effects model: REML tau^2 with the Knapp-Hartung small-sample adjustment.\nres <- rma(yi = yi, vi = vi, data = dat, method = \"REML\", test = \"knha\")\nsummary(res)                       # mu, CI, tau^2, I^2, H^2, Q-test (all on the log scale)\n\n## Pooled effect on the report scale (e.g., hazard ratio) + 95% prediction interval.\npred <- predict(res, transf = exp)\npred                                # pred$pred = pooled HR; pred$pi.lb/pi.ub = prediction interval\n\n## Small-study effects / publication bias: funnel plot + Egger's regression test.\nfunnel(res); regtest(res, model = \"lm\")\n\n## Heterogeneity attribution: meta-regression on pre-specified DESIGN moderators\n## (washout length, adjustment method, data source) -- NOT patient-level means (aggregation bias).\nres_mr <- rma(yi = yi, vi = vi,\n              mods = ~ washout_days + factor(adjustment) + factor(data_source),\n              data = dat, method = \"REML\", test = \"knha\")\nsummary(res_mr)\n\n## Influence diagnostics / leave-one-out robustness check.\nleave1out(res)",
        "description": "Random-effects meta-analysis with metafor (the reference R package; REML + Knapp-Hartung, prediction interval,\nfunnel/Egger, meta-regression). Required input (one row per study, harmonized scale/estimand):\n  dat : study_id, yi (log effect), vi (variance of yi), plus moderators (washout_days, adjustment, data_source).\nIf only a CI is published, compute vi = ((log(uci) - log(lci)) / (2 * qnorm(0.975)))^2 first.",
        "dependencies": [
          "metafor"
        ],
        "source_citations": [
          "higgins-2003",
          "egger-1997"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* yi = log effect, vi = its variance, wi = 1/vi (inverse-variance weight) */\ndata ma; set work.ma; wi = 1 / vi; run;\n\n/* Random-effects pooling: intercept (Intercept estimate = pooled log effect = mu),\n   Estimate for study random effect = tau^2. DDFM=KENWARDROGER gives a Knapp-Hartung-like\n   small-sample t-based CI for mu. */\nproc mixed data=ma method=reml;\n  class study_id;\n  model yi = / solution cl ddfm=kenwardroger;   /* Intercept = pooled mu, with t-based 95% CI */\n  random int / subject=study_id;                /* between-study variance tau^2 */\n  weight wi;                                     /* supply within-study precision */\n  parms (1) (1) / hold=2 lowerb=0,. ;           /* HOLD residual=1 so WEIGHT = within-study var */\n  ods output SolutionF=mu CovParms=cov;\nrun;\n\n/* Cochran's Q, I^2, and the prediction interval from mu, se(mu), and tau^2. */\nproc sql; select sum(wi) into :swi from ma; quit;\ndata heterogeneity;\n  merge mu(where=(Effect='Intercept') rename=(Estimate=mu StdErr=se DF=df Lower=cil Upper=ciu))\n        cov(where=(CovParm='Intercept') rename=(Estimate=tau2));\n  pooled_HR = exp(mu); ci_lo = exp(cil); ci_hi = exp(ciu);\n  /* 95% prediction interval for a new study's true effect (back-transformed) */\n  t_pi = tinv(0.975, max(df-1,1));\n  pi_lo = exp(mu - t_pi*sqrt(tau2 + se**2));\n  pi_hi = exp(mu + t_pi*sqrt(tau2 + se**2));\nrun;\n\n/* Egger's regression test for small-study effects: standard normal deviate on precision. */\ndata egger; set ma; snd = yi / sqrt(vi); prec = 1 / sqrt(vi); run;\nproc reg data=egger; model snd = prec; run; quit;  /* intercept != 0 => funnel asymmetry */",
        "description": "Random-effects meta-analysis in SAS via PROC MIXED, modeling study as a random intercept on the log-effect scale\nwith the within-study variances supplied as known weights (REML between-study variance). Required input dataset\n(one row per study, harmonized scale/estimand):\n  work.ma : study_id (char), yi (log effect), vi (variance of yi), wi = 1/vi\n            (derive vi from a published CI: vi = ((log(uci)-log(lci))/(2*probit(0.975)))**2)\nThe PARMS/HOLD fixes the residual to 1 so the supplied WEIGHT carries the within-study variance and the random\nintercept variance is the between-study tau^2.",
        "dependencies": [],
        "source_citations": [
          "dersimonian-laird-1986",
          "egger-1997"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Q[Pre-registered question + MOOSE/PRISMA search] --> Ext[Extract per-study: adjusted effect, variance,<br/>data source, washout, comparator, adjustment, estimand]\n  Ext --> Comp{Estimands compatible?<br/>same scale, population, time-zero, comparator}\n  Comp -- No --> Harm[Harmonize or EXCLUDE incompatible studies]\n  Comp -- Yes --> Het{Heterogeneity expected?}\n  Het -- Essentially always for RWE --> RE[Random-effects: REML tau^2 + Knapp-Hartung CI]\n  Het -- Functional replicates only --> FE[Fixed-effect inverse-variance]\n  RE --> Pool[Pooled effect + CI for mu + PREDICTION interval]\n  FE --> Pool\n  Pool --> Diag[Diagnose: I^2/H^2, funnel + Egger,<br/>meta-regression on DESIGN moderators, leave-one-out]\n  Diag --> Bias[Bias caveat: shared-direction confounding ->\\nlow I^2 is consistency, not validity; add E-value / negative controls]",
        "caption": "Workflow for meta-analysis of observational studies, foregrounding estimand compatibility before pooling and bias diagnosis after.",
        "alt_text": "Flowchart from a pre-registered question and search, through per-study extraction, an estimand-compatibility gate, random- vs fixed-effect model choice, pooling with a prediction interval, and bias diagnosis.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  subgraph Inputs[\"Per-study adjusted effects yi with variance vi\"]\n    S1[\"Study 1<br/>FFS claims, 365d washout, HDPS\"]\n    S2[\"Study 2<br/>EHR linked, 180d washout\"]\n    S3[\"Study 3<br/>registry, adjudicated outcome\"]\n  end\n  S1 --> W[\"Weights wi = 1/(vi + tau^2)\"]\n  S2 --> W\n  S3 --> W\n  W --> M[\"Pooled mu = sum(wi*yi)/sum(wi)\"]\n  M --> CI[\"CI for mu: precision of the MEAN effect\"]\n  M --> PI[\"Prediction interval: effect in a NEW database<br/>= mu +/- t*sqrt(tau^2 + se^2)\"]\n  T[\"tau^2 = between-study variance\"] --> W\n  T --> PI",
        "caption": "Random-effects pooling on the log scale. The CI for mu narrows with more studies; the prediction interval does not, because it carries the irreducible between-study variance tau-squared.",
        "alt_text": "Diagram showing three heterogeneous study inputs feeding random-effects weights that depend on the between-study variance, producing a pooled mean with a confidence interval and a wider prediction interval.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "alternative_to",
        "target_slug": "ipd-meta-analysis",
        "notes": "IPD-MA re-analyzes raw records under one protocol; aggregate-data MA-obs uses published summaries and cannot harmonize definitions or estimate patient-level effect modification. Prefer IPD when records are available."
      },
      {
        "relation_type": "see_also",
        "target_slug": "meta-analysis-rct",
        "notes": "Same statistical machinery (inverse-variance pooling, random effects, heterogeneity), but observational inputs add the confounding and estimand-compatibility problems that RCT pooling largely avoids."
      },
      {
        "relation_type": "alternative_to",
        "target_slug": "network-meta-analysis",
        "notes": "NMA borrows strength across an evidence network to compare interventions never studied head-to-head, at the cost of a transitivity assumption; pairwise MA-obs pools direct evidence for one comparison."
      },
      {
        "relation_type": "part_of",
        "target_slug": "systematic-review",
        "notes": "A meta-analysis is the quantitative-synthesis step of a systematic review; the MOOSE/PRISMA search and appraisal precede it."
      },
      {
        "relation_type": "used_with",
        "target_slug": "e-value-sensitivity-analysis",
        "notes": "Compute an E-value for the pooled estimate to quantify how strong shared unmeasured confounding would have to be to explain it away — essential when all inputs share a bias direction."
      },
      {
        "relation_type": "used_with",
        "target_slug": "empirical-calibration-negative-controls-rwe",
        "notes": "Negative-control outcomes/exposures across the input studies help distinguish a real pooled signal from shared systematic error that low heterogeneity would otherwise mask."
      }
    ],
    "aliases": [
      "meta-analysis of observational studies",
      "observational meta-analysis",
      "pooled analysis of observational studies",
      "random-effects meta-analysis (observational)"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "meta-analysis-rct",
    "name": "Meta-Analysis of Randomized Controlled Trials",
    "short_definition": "Quantitative synthesis that pools treatment-effect estimates across randomized controlled trials of the same intervention-comparator-outcome question using inverse-variance-weighted fixed-effect or random-effects models, with explicit quantification of between-trial heterogeneity.",
    "long_description": "A **meta-analysis of randomized controlled trials (RCTs)** combines the effect estimates from multiple\ntrials that address the same intervention, comparator, and outcome into a single pooled estimate and a\ncharacterization of how much the trials disagree. Because each contributing study randomized treatment,\nthe within-trial estimates are (in expectation) unconfounded; the meta-analysis adds *precision* by\nborrowing strength across trials and adds *generalizability evidence* by showing whether the effect\nreproduces across populations, doses, and settings. The standard machinery is **inverse-variance\nweighting**: each trial's effect (on the log scale for ratio measures) is weighted by the reciprocal of\nits variance, so larger and more precise trials count more. In real-world evidence and HTA work, the\npooled RCT estimate is rarely the endpoint by itself — it is the **efficacy anchor** against which\nreal-world effectiveness, indirect comparisons (MAIC/STC), and external-control analyses are calibrated.\n\n**Core estimand distinction**. Two different estimands hide behind the same forest plot, and the choice is\nnot cosmetic. (1) The **fixed-effect (common-effect) model** assumes every trial estimates the *same*\nunderlying true effect θ; the only source of variation is within-trial sampling error, and the pooled\nquantity is that single common effect. (2) The **random-effects model** assumes each trial estimates its\n*own* true effect θ_i drawn from a distribution with mean μ and between-trial variance τ²; the pooled\nquantity is the **mean of the distribution of true effects**, not \"the\" effect. 
When heterogeneity is\npresent (τ² > 0), random effects give wider intervals and shift relative weight from large trials toward small ones.\nCritically, the random-effects *point estimate* and its confidence interval answer a different question\nthan a decision-maker's \"what true effect should I expect in a new trial or setting\"; that question is answered by the **95% prediction\ninterval**, computed from μ, its standard error, and τ², and it is almost always the more honest summary when trials are\nheterogeneous. Reporting a tight random-effects CI while suppressing a wide prediction interval is a\ncommon and misleading practice.\n\n**Pros, cons, and trade-offs**\n- **vs a single large trial:** Meta-analysis increases precision, tests reproducibility, and can detect\n  rare harms no single trial is powered for. Cost: it inherits every included trial's biases, can launder\n  a few low-quality trials into a falsely precise pooled estimate, and is vulnerable to publication and\n  small-study bias. **Prefer meta-analysis** when multiple comparable RCTs exist and the question is\n  whether the effect is consistent and precise; **prefer the single large pragmatic trial** when one\n  well-conducted trial dwarfs the others and pooling would only dilute it with heterogeneous small studies.\n- **vs meta-analysis of observational studies (`meta-analysis-obs`):** Pooling RCTs preserves\n  within-study randomization, so residual confounding is not the dominant worry; the synthesis-level\n  concerns are heterogeneity, publication bias, and clinical comparability. Observational meta-analysis\n  adds confounding and design heterogeneity on top of all of that and can pool *consistent bias* into a\n  spuriously precise wrong answer. 
**Prefer RCT meta-analysis** whenever enough randomized evidence\n  exists for the exact question.\n- **vs network meta-analysis (`network-meta-analysis`):** Pairwise RCT meta-analysis answers one A-vs-B\n  contrast with direct head-to-head evidence and the fewest assumptions. Network meta-analysis can rank\n  many treatments and use indirect evidence but requires the **transitivity/consistency** assumption and\n  a connected network. **Prefer pairwise** when direct head-to-head trials exist; escalate to network\n  only when the decision needs comparators that were never trialed against each other.\n- **vs individual-participant-data meta-analysis (`ipd-meta-analysis`):** Aggregate-data RCT\n  meta-analysis is fast and uses only published effect estimates, but cannot reliably investigate\n  subgroup effects (ecological/aggregation bias) or harmonize outcome definitions. IPD is the gold\n  standard for effect modification and time-to-event re-analysis but is slow and often infeasible.\n  **Prefer aggregate-data** for a main pooled effect; **escalate to IPD** when subgroup/interaction\n  questions or non-proportional hazards drive the decision.\n- **Fixed vs random within the method:** Fixed-effect is more precise and appropriate only when trials\n  are functionally identical (rare). Random-effects is the safer default but its small-sample variance\n  estimate (DerSimonian-Laird) is anti-conservative when the number of trials is small (< ~5–10);\n  the **Hartung-Knapp-Sidik-Jonkman (HKSJ)** adjustment or REML with a t-distribution is preferred there.\n\n**When to use**. Multiple RCTs (typically ≥ 2, ideally more) address the *same* PICO; the trials are\nclinically and methodologically comparable enough that a common question is meaningful; the goal is a\nprecise, reproducibility-aware pooled effect, a heterogeneity assessment, or an efficacy anchor for a\ndownstream RWE/HTA model. 
Also use as the comparator-evidence backbone of an HTA submission, where\nreimbursement bodies expect a transparent, PRISMA-reported synthesis of the randomized evidence.\n\n**When NOT to use — and when it is actively misleading or dangerous**\n- **Clinical or methodological apples-and-oranges.** If trials differ in population, dose, follow-up, or\n  outcome definition such that no common effect is interpretable, the pooled number is a meaningless\n  average; a high I²/τ² should trigger *not pooling*, or stratified/meta-regression synthesis, rather\n  than reporting one figure. Pooling discordant trials and headlining the central estimate is the classic\n  dangerous error.\n- **Few, small, or selectively published trials.** With a handful of small positive trials and missing\n  negatives, funnel-plot asymmetry and small-study effects can produce a precise, confidently wrong\n  estimate. DerSimonian-Laird CIs are too narrow here; treat the result as hypothesis-generating.\n- **Double-counting / overlapping evidence.** Including multiple publications of the same trial, or\n  multi-arm trials whose shared control arm is counted more than once, fabricates precision.\n- **Using a pooled RCT efficacy estimate as if it were real-world effectiveness.** Trial populations are\n  selected and adherence is supervised; transporting the pooled efficacy estimate to a claims population\n  without a generalizability/transportability assessment overstates real-world benefit (see the RWE\n  interface below).\n- **Pooling to rescue a non-significant program.** Aggregating underpowered trials to manufacture a\n  significant pooled p-value, without a pre-specified protocol, is a form of researcher-degrees-of-freedom\n  abuse that PRISMA/PROSPERO registration exists to prevent.\n\n**Data-source operational depth** (here the \"data sources\" are the *trial-evidence* substrates feeding the\nsynthesis — each with real failure modes and workarounds, plus the RWE bridge that earns this 
concept its\nplace in the catalog).\n- **Aggregate published data (journal articles):** The default substrate — extract effect size + variance\n  (or events/totals, means/SDs). Failure mode: trials report *different* effect measures, follow-up\n  lengths, or only ITT in one paper and per-protocol in another, so the pooled contrast mixes estimands.\n  Workaround: harmonize to one estimand and one scale (e.g., reconstruct log-RR with SE from events and\n  totals); when only medians/IQRs are given for continuous outcomes, use validated transformations and\n  flag the imputation.\n- **Trial registries / results databases (ClinicalTrials.gov, EU CTR):** Capture *unpublished* and\n  null trials to combat publication bias; registered outcomes also reveal **outcome-switching** (the\n  headline outcome differs from the pre-registered primary). Failure mode: registry results tables are\n  incomplete or use different definitions than the publication. Workaround: cross-check registry vs paper\n  and prefer the pre-registered primary outcome and analysis population.\n- **Clinical study reports (CSRs) / regulator dossiers:** The most complete substrate (full safety,\n  all analysis populations, adjudicated events) and the standard for HTA. Failure mode: access is\n  restricted and reconciling a CSR's adjudicated event counts with the published numbers takes real work.\n  Workaround: prefer adjudicated, ITT counts from the CSR and document every discrepancy.\n- **Time-to-event / competing-risk and composite endpoints across trials:** Trials may report HRs under\n  different proportional-hazards adequacy, different censoring, or differently constructed composites.\n  Failure mode: pooling HRs across trials with non-proportional hazards or heterogeneous composite\n  definitions averages incomparable quantities. 
Workaround: extract a common summary (e.g., RMST\n  difference or events at a common horizon), or restrict the pool to trials with comparable endpoint\n  construction.\n- **The RWE / HTA interface (why this lives in an RWE catalog):** The pooled RCT estimate is the\n  **efficacy anchor** for (a) anchored indirect comparisons (MAIC/STC) when no head-to-head trial exists,\n  (b) **calibration** of a real-world effectiveness estimate from claims/EHR — if the RWE design\n  reproduces the pooled RCT effect in the trial-eligible subpopulation, it supports the RWE method's\n  validity; a large gap flags residual confounding or a true efficacy-effectiveness gap, and (c)\n  benchmarking single-arm external-control analyses. Failure mode: the trial-eligible population (strict\n  inclusion, supervised adherence, younger, fewer comorbidities) differs systematically from the\n  claims/EHR population the decision is about, so the anchor and the RWE estimate are not estimating the\n  same thing. Workaround: standardize the RWE cohort to the trial population (or vice versa) and report a\n  transportability analysis before treating the pooled RCT effect as the real-world expectation.\n\n**Worked example (binary outcome across trials, claims-relevant anchor).** Six RCTs compared a new\noral anticoagulant (treatment) vs warfarin (control) on a 12-month composite of stroke/systemic embolism.\nFor each trial we extract `events_t`, `n_t`, `events_c`, `n_c`. (1) Compute each trial's log risk ratio\n`yi = log[(events_t/n_t)/(events_c/n_c)]` and its variance `vi = 1/events_t - 1/n_t + 1/events_c - 1/n_c`\n(add 0.5 to zero cells, or use a Mantel-Haenszel/exact approach when cells are sparse). (2) Fixed-effect\npool: weights `wi = 1/vi`, pooled `yFE = Σ(wi·yi)/Σwi`, `SE = sqrt(1/Σwi)`. (3) Heterogeneity: Cochran's\n`Q = Σ wi·(yi − yFE)²`, `I² = max(0, (Q − (k−1))/Q)`, and `τ²` by DerSimonian-Laird or REML. 
(4)\nRandom-effects pool with weights `wi* = 1/(vi + τ²)`, giving the mean of the true-effect distribution.\n(5) Because only k = 6 trials contribute, apply the **Hartung-Knapp** adjustment (t-distribution with\nk−1 df and a robust variance) so the CI is not falsely narrow. (6) Report the **95% prediction interval**\n`μ ± t_{k−2} · sqrt(τ² + SE(μ)²)` to convey the plausible effect in a *new* setting. (7) For the catalog's\nRWE use, take this pooled RR (say 0.79, 95% CI 0.70–0.89) as the efficacy anchor: build a new-user\nactive-comparator cohort in claims restricted to the trial-eligible subpopulation (continuous enrollment,\nno prior anticoagulant in a 365-day washout, first `fill_date` as index, composite outcome from validated\ndx codes), and check whether the standardized real-world effect estimate (on the same risk-ratio scale) lands near the anchor. Concordance\nsupports the RWE method; a large divergence flags residual confounding or a genuine efficacy-effectiveness\ngap that must be explained before the estimate informs a decision.\n\n**Interpreting the output**\n\nConsider a random-effects meta-analysis of six RCTs reporting a pooled risk ratio RR = 0.82\n(95% CI 0.72–0.93, I² = 38%, prediction interval 0.61–1.10).\n\nFormal interpretation: The pooled estimate is the mean of the distribution of true trial-level\neffects, not a single universal effect, with each trial weighted by the inverse of its within-trial variance plus τ², the between-study\nvariance. The CI around 0.82 quantifies uncertainty about that distributional mean; it does NOT\ncapture the full spread of effects across settings. The prediction interval (0.61–1.10) does:\nit conveys that in a new trial drawn from this evidence base, the true RR could plausibly reach\n1.10, a null or harmful result, even though the pooled mean is 0.82. 
I² = 38% indicates\nmoderate heterogeneity; it is a relative measure and does not substitute for τ² or the prediction\ninterval in quantifying how much effects vary in absolute terms.\n\nPractical interpretation: On average across the six trials, the treatment reduced the outcome\nrisk by approximately 18%. But the prediction interval crossing 1.0 means benefit cannot be\nassumed in every new clinical or real-world setting. Before using 0.82 as an efficacy anchor for\na cost-effectiveness model or an RWE benchmarking study, report the prediction interval prominently\nand investigate which study characteristics drive the heterogeneity via meta-regression. A pooled\nRCT estimate close to an observed RWE hazard ratio is supporting — not confirming — evidence; the\ntwo estimands may differ in population, adherence, and follow-up even if the numbers coincide.",
    "primary_category": "Inferential_Statistics",
    "tags": [
      "meta-analysis",
      "evidence-synthesis",
      "randomized-controlled-trial",
      "random-effects",
      "fixed-effect",
      "heterogeneity",
      "inverse-variance-weighting",
      "prediction-interval",
      "hta"
    ],
    "applies_to_study_types": [
      "meta_analysis_rct"
    ],
    "data_sources": [
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1016/0197-2456(86)90046-2",
        "url": "https://doi.org/10.1016/0197-2456(86)90046-2",
        "citation_text": "DerSimonian R, Laird N. Meta-analysis in clinical trials. Controlled Clinical Trials. 1986;7(3):177-188.",
        "year": 1986,
        "authors_short": "DerSimonian & Laird",
        "notes": "Foundational paper introducing the random-effects (DerSimonian-Laird) estimator that remains the default in most meta-analysis software."
      },
      {
        "role": "explain",
        "doi": "10.1002/jrsm.12",
        "url": "https://doi.org/10.1002/jrsm.12",
        "citation_text": "Borenstein M, Hedges LV, Higgins JPT, Rothstein HR. A basic introduction to fixed-effect and random-effects models for meta-analysis. Research Synthesis Methods. 2010;1(2):97-111.",
        "year": 2010,
        "authors_short": "Borenstein et al.",
        "notes": "Clear conceptual treatment of how the fixed-effect and random-effects models estimate different quantities and when each is appropriate."
      },
      {
        "role": "explain",
        "doi": "10.1002/sim.1186",
        "url": "https://doi.org/10.1002/sim.1186",
        "citation_text": "Higgins JPT, Thompson SG. Quantifying heterogeneity in a meta-analysis. Statistics in Medicine. 2002;21(11):1539-1558.",
        "year": 2002,
        "authors_short": "Higgins & Thompson",
        "notes": "Defines the I-squared statistic and frames the measurement and interpretation of between-trial heterogeneity."
      },
      {
        "role": "demonstrate",
        "doi": "10.1186/1471-2288-14-25",
        "url": "https://doi.org/10.1186/1471-2288-14-25",
        "citation_text": "IntHout J, Ioannidis JPA, Borm GF. The Hartung-Knapp-Sidik-Jonkman method for random effects meta-analysis is straightforward and considerably outperforms the standard DerSimonian-Laird method. BMC Medical Research Methodology. 2014;14:25.",
        "year": 2014,
        "authors_short": "IntHout et al.",
        "notes": "Demonstrates that the HKSJ adjustment yields better-calibrated confidence intervals than DerSimonian-Laird, especially with few trials."
      },
      {
        "role": "use",
        "doi": "10.1136/bmj.n71",
        "url": "https://doi.org/10.1136/bmj.n71",
        "citation_text": "Page MJ, McKenzie JE, Bossuyt PM, et al. The PRISMA 2020 statement: an updated guideline for reporting systematic reviews. BMJ. 2021;372:n71.",
        "year": 2021,
        "authors_short": "Page et al.",
        "notes": "The reporting standard expected for the systematic search and synthesis that a defensible RCT meta-analysis is built on."
      }
    ],
    "plain_language_summary": "A meta-analysis of randomized controlled trials combines the results of several trials that asked the same question into one overall answer. Instead of treating every trial equally, it gives more say to the larger, more precise trials and less to the small, noisy ones, then reports a single pooled effect. It also checks whether the trials actually agreed with each other; when they point in very different directions, that disagreement (called heterogeneity) is a warning that one pooled number may hide more than it reveals.",
    "key_terms": [
      {
        "term": "risk ratio (RR)",
        "definition": "How many times the risk of an outcome is in the treated group compared to the control group; RR below 1 means the treatment lowered the risk."
      },
      {
        "term": "pooled estimate",
        "definition": "The single combined effect you get after blending all the trials' individual results into one number."
      },
      {
        "term": "weight",
        "definition": "How much a single trial counts toward the pooled answer; bigger, more precise trials get a larger weight."
      },
      {
        "term": "inverse-variance weighting",
        "definition": "The common rule for setting weights, where a trial counts more when its result is more precise (less statistical noise)."
      },
      {
        "term": "heterogeneity",
        "definition": "How much the trials disagree with each other beyond what chance alone would explain."
      },
      {
        "term": "I-squared",
        "definition": "A 0-to-100 percent score for heterogeneity; near 0 percent means the trials largely agree, and a high value means they diverge a lot."
      }
    ],
    "worked_example": {
      "scenario": "Three randomized trials all tested the same new blood thinner against the standard drug, measuring whether patients had a stroke over one year. Each trial reports its own risk ratio (a number below 1 means fewer strokes on the new drug) and a weight that reflects how precise that trial was. The weights have already been worked out and add up to 100 percent. We want to combine the three into a single pooled risk ratio and then say a word about whether the trials agreed.",
      "dataset": {
        "caption": "One row per trial, the way they would line up in a forest-plot table before pooling.",
        "columns": [
          "trial",
          "effect_RR",
          "weight_pct"
        ],
        "rows": [
          [
            "Trial A",
            0.7,
            40
          ],
          [
            "Trial B",
            0.85,
            35
          ],
          [
            "Trial C",
            0.95,
            25
          ]
        ]
      },
      "steps": [
        "The weights already sum to 100 percent (40 + 35 + 25), so each one is just that trial's share of the total say. Trial A is the most precise, so it gets the largest share at 40 percent.",
        "Multiply each trial's risk ratio by its weight share: Trial A gives 0.70 times 0.40 = 0.280, Trial B gives 0.85 times 0.35 = 0.2975, Trial C gives 0.95 times 0.25 = 0.2375.",
        "Add those three contributions to get the weighted average: 0.280 + 0.2975 + 0.2375 = 0.815. Because the weights already total 100 percent, there is no further dividing to do.",
        "Now eyeball agreement: the three risk ratios range from 0.70 to 0.95, all pointing the same direction (fewer strokes on the new drug) but not by the same amount. That spread is heterogeneity, and a statistic called I-squared puts a number on it. Here the trials roughly agree, so I-squared would be low and one pooled number is a fair summary; if instead they had pointed in opposite directions, a high I-squared would warn against trusting a single combined figure."
      ],
      "result": "Pooled risk ratio = (0.70 x 0.40) + (0.85 x 0.35) + (0.95 x 0.25) = 0.280 + 0.2975 + 0.2375 = 0.815, about 0.82. Pooling across the three trials estimates roughly an 18 percent lower stroke risk on the new drug, with the trials in reasonable agreement (low heterogeneity)."
    },
    "prerequisites": [
      "systematic-review",
      "meta-analysis-obs",
      "network-meta-analysis"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Fixed-effect (common-effect) meta-analysis",
        "description": "Assumes a single true effect across trials; pools inverse-variance-weighted estimates with variance from sampling error only. Most precise but valid only when trials are functionally identical.",
        "edge_cases": [
          "A single large trial dominates the weights, so the pooled estimate is effectively that trial.",
          "Ignoring real heterogeneity yields confidence intervals that are too narrow and overconfident."
        ],
        "data_source_notes": "aggregate data: extract effect + variance per trial; verify all trials estimate the same estimand (ITT vs per-protocol, same follow-up horizon) before assuming a common effect."
      },
      {
        "name": "Random-effects meta-analysis (DerSimonian-Laird / REML)",
        "description": "Assumes trial-specific true effects drawn from a distribution; estimates the mean effect mu and between-trial variance tau-squared. The default when heterogeneity is plausible.",
        "edge_cases": [
          "With few trials (k < ~5-10) the DerSimonian-Laird variance is anti-conservative; prefer REML and an HKSJ or t-based interval.",
          "The point estimate is the mean of true effects, not \"the\" effect; always accompany with a prediction interval."
        ],
        "data_source_notes": "registries/CSRs: pull unpublished and null trials to keep tau-squared and the pooled mean from being inflated by publication bias."
      },
      {
        "name": "Hartung-Knapp-Sidik-Jonkman (HKSJ) adjusted random-effects",
        "description": "Uses a t-distribution with k-1 degrees of freedom and a robust variance for the pooled estimate, giving better-calibrated coverage with few trials.",
        "edge_cases": [
          "Can produce a narrower interval than DerSimonian-Laird in rare configurations; pre-specify the method and report a sensitivity comparison."
        ],
        "data_source_notes": "Recommended whenever the number of contributing trials is small, which is common for a single drug-comparator-outcome question."
      },
      {
        "name": "Meta-regression / subgroup synthesis",
        "description": "Models the pooled effect as a function of trial-level covariates (dose, follow-up, baseline risk) instead of forcing one summary across heterogeneous trials.",
        "edge_cases": [
          "Trial-level covariate associations are ecological and cannot be interpreted as patient-level effect modification (aggregation bias); use IPD meta-analysis for that.",
          "Low power with few trials; pre-specify covariates to avoid data dredging."
        ],
        "data_source_notes": "aggregate data only supports trial-level moderators; escalate to ipd-meta-analysis for patient-level subgroup effects."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Meta-analysis of observational studies",
        "pros_of_this": "Within-trial randomization removes confounding as the dominant threat, so the synthesis-level concerns are limited to heterogeneity, publication bias, and clinical comparability.",
        "cons_of_this": "Randomized evidence may be scarce, narrow in population, or short in follow-up, limiting generalizability to real-world practice.",
        "when_to_prefer": "Whenever enough comparable RCTs exist for the exact intervention-comparator-outcome question."
      },
      {
        "compared_to": "Network meta-analysis",
        "pros_of_this": "Answers a single head-to-head A-vs-B contrast from direct evidence with the fewest assumptions and no transitivity requirement.",
        "cons_of_this": "Cannot compare or rank treatments that were never trialed against each other.",
        "when_to_prefer": "Direct head-to-head trials exist and only the one contrast is needed for the decision."
      },
      {
        "compared_to": "Individual-participant-data (IPD) meta-analysis",
        "pros_of_this": "Fast and feasible using only published effect estimates; no need to obtain raw data from sponsors.",
        "cons_of_this": "Subgroup/interaction analyses are ecological (aggregation bias); cannot harmonize outcome definitions or re-analyze time-to-event data.",
        "when_to_prefer": "A main pooled effect is the goal and patient-level effect modification is not the driving question."
      },
      {
        "compared_to": "A single large pragmatic trial",
        "pros_of_this": "Increases precision, tests reproducibility across settings, and can detect rare harms underpowered in any one trial.",
        "cons_of_this": "Inherits every included trial's biases and can launder a few low-quality studies into a falsely precise pooled estimate.",
        "when_to_prefer": "Multiple comparable RCTs exist and consistency/precision is the question, rather than one dominant trial that pooling would only dilute."
      }
    ],
    "implementation_notes_by_data_source": {
      "registry": "Trial registries and results databases (ClinicalTrials.gov, EU CTR) supply unpublished and null trials and reveal outcome-switching; cross-check registry vs publication and pool the pre-registered primary outcome and analysis population.",
      "linked": "Clinical study reports and regulator dossiers (the linked-evidence substrate for HTA) provide adjudicated, all-population data; prefer adjudicated ITT counts and document every discrepancy with the published numbers."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\nimport pandas as pd\nfrom scipy import stats\n\ndef meta_rr(trials: pd.DataFrame, cc: float = 0.5) -> dict:\n    df = trials.copy()\n    # Continuity correction only for trials with a zero cell (avoid undefined log-RR / variance).\n    zero = (df[[\"events_t\", \"n_t\", \"events_c\", \"n_c\"]]\n              .assign(nz_t=lambda d: d.n_t - d.events_t,\n                      nz_c=lambda d: d.n_c - d.events_c)\n              [[\"events_t\", \"nz_t\", \"events_c\", \"nz_c\"]] == 0).any(axis=1)\n    et, nt = df.events_t + cc * zero, df.n_t + 2 * cc * zero\n    ec, nc = df.events_c + cc * zero, df.n_c + 2 * cc * zero\n\n    yi = np.log((et / nt) / (ec / nc))                     # per-trial log risk ratio\n    vi = 1 / et - 1 / nt + 1 / ec - 1 / nc                 # delta-method variance of log-RR\n    wi = 1.0 / vi                                          # inverse-variance (fixed-effect) weights\n    k = len(df)\n\n    y_fe = np.sum(wi * yi) / np.sum(wi)\n    se_fe = np.sqrt(1.0 / np.sum(wi))\n\n    # Cochran's Q, I-squared, and DerSimonian-Laird tau-squared.\n    Q = np.sum(wi * (yi - y_fe) ** 2)\n    C = np.sum(wi) - np.sum(wi ** 2) / np.sum(wi)\n    tau2 = max(0.0, (Q - (k - 1)) / C)\n    I2 = max(0.0, (Q - (k - 1)) / Q) if Q > 0 else 0.0\n\n    wi_re = 1.0 / (vi + tau2)                              # random-effects weights\n    y_re = np.sum(wi_re * yi) / np.sum(wi_re)\n    se_re = np.sqrt(1.0 / np.sum(wi_re))\n\n    # Hartung-Knapp robust SE with t(k-1) reference for few-trial calibration.\n    q_hk = np.sum(wi_re * (yi - y_re) ** 2) / (k - 1)\n    se_hk = np.sqrt(q_hk / np.sum(wi_re))\n    t_crit = stats.t.ppf(0.975, k - 1)\n\n    # 95% prediction interval for the effect in a new setting (uses t with k-2 df).\n    t_pi = stats.t.ppf(0.975, k - 2)\n    pi_half = t_pi * np.sqrt(tau2 + se_re ** 2)\n\n    return {\n        \"k\": k, \"I2\": I2, \"tau2\": tau2,\n        \"fixed_logRR\": y_fe, \"fixed_ci\": (y_fe - 1.96 * se_fe, y_fe 
+ 1.96 * se_fe),\n        \"random_logRR\": y_re, \"random_ci\": (y_re - 1.96 * se_re, y_re + 1.96 * se_re),\n        \"hksj_ci\": (y_re - t_crit * se_hk, y_re + t_crit * se_hk),\n        \"pred_interval\": (y_re - pi_half, y_re + pi_half),\n    }",
        "description": "Fixed-effect, random-effects (DerSimonian-Laird), and Hartung-Knapp pooling of binary-outcome RCTs.\nRequired input table (one row per trial, already extracted and de-duplicated so multi-arm trials do not\ndouble-count a shared control):\n  trials : trial_id, events_t (int), n_t (int), events_c (int), n_c (int)\nReturns pooled log-RR estimates, I-squared, tau-squared, and a 95% prediction interval. Convert point\nestimates and interval bounds back to the RR scale with numpy.exp before reporting.",
        "dependencies": [
          "pandas",
          "numpy",
          "scipy"
        ],
        "source_citations": [
          "dersimonian-laird-1986",
          "inthout-2014"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(metafor)\n\n# Per-trial log risk ratio (measure = \"RR\") and its sampling variance, with default 0.5\n# continuity correction applied only to trials containing a zero cell.\nes <- escalc(measure = \"RR\",\n             ai = events_t, bi = n_t - events_t,   # treated: events, non-events\n             ci = events_c, di = n_c - events_c,   # control: events, non-events\n             data = trials, slab = trial_id)\n\nfe  <- rma(yi, vi, data = es, method = \"FE\")                 # fixed-effect (inverse variance)\nre  <- rma(yi, vi, data = es, method = \"REML\")               # random-effects (REML tau^2)\nhk  <- rma(yi, vi, data = es, method = \"REML\", test = \"knha\")# Hartung-Knapp adjusted CI\n\nsummary(re)                       # pooled mean, tau^2, I^2, Q-test for heterogeneity\npredict(re, transf = exp)         # RR-scale estimate + 95% CI + 95% prediction interval\nconfint(re)                       # CIs for tau^2 and I^2\n\n# Publication / small-study bias diagnostics.\nregtest(re, model = \"lm\")         # Egger's test for funnel asymmetry\n# funnel(re); forest(re, atransf = exp)   # diagnostic and summary plots",
        "description": "Production pooling with the metafor package (Viechtbauer), the reference implementation in RWE/HTA work.\nRequired input data frame (one row per trial, multi-arm shared controls already resolved):\n  trials : trial_id, events_t, n_t, events_c, n_c\nescalc computes per-trial log-RR and variance; rma fits fixed-effect, REML random-effects, and the\nHartung-Knapp variant. predict(..., transf=exp) returns the RR-scale estimate and prediction interval.",
        "dependencies": [
          "metafor"
        ],
        "source_citations": [
          "dersimonian-laird-1986",
          "inthout-2014"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* Step 1: per-trial log risk ratio (yi) and its variance (vi); 0.5 continuity correction on zero cells. */\ndata es;\n  set work.trials;\n  cc = (events_t = 0 or n_t - events_t = 0 or events_c = 0 or n_c - events_c = 0) * 0.5;\n  et = events_t + cc;        nt = n_t + 2*cc;\n  ec = events_c + cc;        nc = n_c + 2*cc;\n  yi = log( (et/nt) / (ec/nc) );\n  vi = 1/et - 1/nt + 1/ec - 1/nc;      /* delta-method variance of log-RR */\n  wi = 1/vi;                            /* inverse-variance (fixed-effect) weight */\nrun;\n\n/* Sort by trial_id so the row order of vi matches the CLASS-level order used by GROUP= below. */\nproc sort data=es;\n  by trial_id;\nrun;\n\n/* Fixed-effect inverse-variance pool (closed form). */\nproc sql;\n  create table fe as\n  select sum(wi*yi)/sum(wi)          as logRR_fixed,\n         sqrt(1/sum(wi))             as se_fixed\n  from es;\nquit;\n\n/* Covariance-parameter input for PARMS: row 1 = starting value for tau^2, rows 2..k+1 = known vi. */\nproc sql noprint;\n  select count(*) into :n_trials trimmed from es;\nquit;\n\ndata work.vi_parms;\n  if _n_ = 1 then do; est = 0.01; output; end;   /* arbitrary positive starting value for tau^2 */\n  set es(keep=vi rename=(vi=est));\n  output;\nrun;\n\n/* Step 2: random-effects model as a linear mixed model with known within-trial variances.\n   - random intercept by trial_id estimates tau^2 (between-trial variance);\n   - REPEATED / GROUP= plus PARMS ... EQCONS holds each trial's residual variance at its vi;\n   - no WEIGHT statement: the fixed vi already weight the trials, so WEIGHT wi would count precision twice;\n   - DDFM=KENWARDROGER is a small-sample df adjustment in the spirit of Hartung-Knapp, not the same estimator. */\nproc mixed data=es method=reml;\n  class trial_id;\n  model yi = / solution cl ddfm=kenwardroger;\n  random int / subject=trial_id;          /* between-trial variance component = tau^2 */\n  repeated / group=trial_id;              /* trial-specific residual variance, held at vi via PARMS */\n  parms / parmsdata=work.vi_parms eqcons=2 to %eval(1+&n_trials.);  /* hold residuals at vi */\n  ods output SolutionF=pooled;            /* intercept = pooled log-RR; exponentiate for RR */\nrun;\n\n/* RR-scale pooled effect: exp(Estimate), exp(Lower), exp(Upper) from the SolutionF table. */\ndata pooled_rr;\n  set pooled;\n  RR    = exp(Estimate);\n  RR_lcl = exp(Lower);\n  RR_ucl = exp(Upper);\nrun;",
        "description": "SAS has no native meta-analysis procedure, so random-effects pooling is fit as a linear mixed model\nwith known within-trial variances in PROC MIXED (van Houwelingen / Sidik-Jonkman approach). Required input\ndataset (one row per trial):\n  work.trials : trial_id, events_t, n_t, events_c, n_c\nStep 1 (DATA step) computes the per-trial log-RR and its variance; a helper dataset (work.vi_parms) then\nstacks a starting value for tau-squared on top of the per-trial vi. Step 2 fits an intercept-only mixed\nmodel where each trial's residual variance is held at its vi (via PARMS with EQCONS) and the random trial\nintercept estimates tau-squared. No WEIGHT statement is used, because the fixed vi already weight the\ntrials. DDFM=KENWARDROGER applies a small-sample degrees-of-freedom adjustment in the spirit of, but not\nidentical to, Hartung-Knapp. Exponentiate the intercept and its bounds for the RR-scale pooled effect.",
        "dependencies": [],
        "source_citations": [
          "dersimonian-laird-1986"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Q[PICO: one intervention vs comparator on one outcome] --> Search[PRISMA systematic search<br/>+ registry/CSR retrieval for unpublished trials]\n  Search --> Extract[Extract per-trial effect + variance<br/>resolve multi-arm shared controls, harmonize estimand]\n  Extract --> Het{Heterogeneity:<br/>Q, I-squared, tau-squared}\n  Het -->|low, trials comparable| FE[Fixed-effect inverse-variance pool]\n  Het -->|present| RE[Random-effects pool REML + HKSJ<br/>report 95% prediction interval]\n  Het -->|high / apples-and-oranges| NoPool[Do NOT pool one number<br/>meta-regression / subgroup / narrative]\n  FE --> Bias[Small-study & publication-bias checks<br/>funnel plot, Egger's test]\n  RE --> Bias\n  Bias --> Anchor[Pooled RCT effect = efficacy anchor<br/>for MAIC/STC, RWE calibration, external controls]",
        "caption": "Decision logic for an RCT meta-analysis, from PICO and PRISMA search through the fixed-vs-random choice, the not-pool decision under high heterogeneity, bias diagnostics, and the RWE/HTA efficacy-anchor use of the pooled estimate.",
        "alt_text": "Flowchart from PICO and systematic search through effect extraction, a heterogeneity decision node branching to fixed-effect, random-effects, or do-not-pool, then bias checks, ending at using the pooled effect as an efficacy anchor.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  T1[Trial 1: theta_1] --> Dist((Distribution of true effects<br/>mean mu, variance tau^2))\n  T2[Trial 2: theta_2] --> Dist\n  T3[Trial 3: theta_3] --> Dist\n  Dist --> Mu[Random-effects estimate = mu<br/>mean of true effects]\n  Dist --> PI[95% prediction interval<br/>plausible effect in a NEW trial]\n  Common[Fixed-effect assumes<br/>theta_1 = theta_2 = theta_3 = theta] --> Theta[Single common effect theta]",
        "caption": "Estimand contrast. The fixed-effect model assumes one common true effect; the random-effects model assumes trial-specific true effects from a distribution, so its estimate is the mean mu and the prediction interval (not the confidence interval) describes the effect expected in a new setting.",
        "alt_text": "Diagram contrasting the fixed-effect single-common-effect assumption with the random-effects distribution of trial-specific true effects, whose mean is the pooled estimate and whose spread gives the prediction interval.",
        "source_type": "illustrative",
        "source_citations": [
          "borenstein-2010"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "part_of",
        "target_slug": "systematic-review",
        "notes": "A quantitative RCT meta-analysis is the synthesis step of a PRISMA-conducted systematic review; the review supplies the search, screening, and risk-of-bias assessment."
      },
      {
        "relation_type": "alternative_to",
        "target_slug": "meta-analysis-obs",
        "notes": "Pooling RCTs preserves within-study randomization, so confounding is not the dominant synthesis-level threat; observational meta-analysis adds confounding and design heterogeneity on top of the same publication-bias and heterogeneity concerns."
      },
      {
        "relation_type": "used_with",
        "target_slug": "network-meta-analysis",
        "notes": "Pairwise RCT meta-analysis is the direct-evidence building block; network meta-analysis extends it to multiple treatments via indirect comparisons under transitivity/consistency."
      },
      {
        "relation_type": "is_variant_of",
        "target_slug": "ipd-meta-analysis",
        "notes": "Aggregate-data RCT meta-analysis uses published effect estimates; IPD meta-analysis re-analyzes patient-level data and is preferred for effect modification and time-to-event re-analysis."
      },
      {
        "relation_type": "see_also",
        "target_slug": "single-arm-external-control",
        "notes": "The pooled RCT effect is a benchmark against which single-arm external-control effectiveness estimates are calibrated and sanity-checked."
      },
      {
        "relation_type": "see_also",
        "target_slug": "cer-observational",
        "notes": "Observational comparative-effectiveness estimates from claims/EHR are validated by checking concordance with the pooled RCT efficacy anchor in the trial-eligible subpopulation."
      },
      {
        "relation_type": "see_also",
        "target_slug": "generalizability-transportability-external-validity-rwe",
        "notes": "Transporting a pooled RCT efficacy estimate to a real-world population requires a transportability assessment because trial populations are selected and supervised."
      }
    ],
    "aliases": [
      "pairwise meta-analysis of randomized trials",
      "aggregate-data meta-analysis of RCTs",
      "fixed-effect and random-effects meta-analysis",
      "pooled analysis of randomized controlled trials"
    ],
    "complexity": "foundational",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "misclassification-bias-correction-rwe",
    "name": "Misclassification Bias Correction",
    "short_definition": "Deterministic or probabilistic correction of an RWE effect estimate for imperfect exposure, outcome, or covariate classification, using validation-anchored sensitivity, specificity, PPV, or NPV to recover the bias-adjusted estimate and its total uncertainty.",
    "long_description": "**Misclassification bias correction** quantifies and removes the distortion in a real-world-evidence effect\nestimate that arises when an exposure, outcome, or covariate is measured with error. In claims and EHR data the\n\"truth\" is almost always an *algorithm* (ICD diagnosis codes, NDC fills, CPT/HCPCS procedures, lab thresholds)\nwhose sensitivity and specificity are imperfect and frequently *differential* by treatment arm, site, calendar\ntime, or disease severity. The method takes the observed (biased) estimate plus empirical classification\nparameters and back-solves for the estimate that would have been observed under perfect measurement.\n\nThe canonical algebra for non-differential binary-outcome misclassification rescales the observed risk:\n\n  true_risk = (observed_risk + specificity - 1) / (sensitivity + specificity - 1)\n\nDifferential misclassification, or error in the exposure or a covariate, requires the full arm-specific 2x2 (or\na calibration matrix). When a validation sample exists the parameters are *estimated* from it; when it does not,\nthey are *assumed* from external literature or a plausible range. In either case the correction is best run\n*probabilistically*, drawing the parameters from distributions, so that validation sampling error (or uncertainty\nabout the assumed values) propagates into a simulation interval rather than collapsing to a single deterministic\npoint.\n\n**Core conceptual distinction** (the estimand and bias-direction boundaries). This is a *post-design\nmeasurement-error correction*, not a confounding adjustment and not a design fix. Three boundaries must be held.\n(1) *Bias direction is not automatic.* Non-differential error in a dichotomous variable biases the relative\nestimate *toward the null*, so correction typically pushes the estimate *away* from 1.0. This regularity fails\nfor differential error, for a >2-level variable, and for many absolute-risk and rate contrasts, where the\ncorrection can move in either direction.\n
(2)\n*The correction inherits the validation frame.* The corrected estimate is only as credible as the\ntransportability argument between the validation sample (where gold-standard truth was observed) and the analytic\ncohort. (3) *It is orthogonal to confounding.* Misclassification correction does nothing for unmeasured\nconfounding; an E-value or external confounding adjustment addresses that separately. The two are combined inside\na multi-bias quantitative bias analysis (QBA), they are not substitutes.\n\n**Pros, cons, and trade-offs** (comparative, naming the alternatives).\n- **vs leaving misclassification as a qualitative limitation paragraph:** correction converts a hand-waved\n  \"results may be biased by imperfect coding\" sentence into a quantified sensitivity result that a regulator or\n  HTA reviewer can audit against the actual validation 2x2. Cost: it forces you to either run a validation\n  substudy or defend a transported external parameter, and it exposes an explicit transportability assumption\n  that reviewers will probe. **Prefer correction** whenever a feasible internal or transportable external\n  validation sample exists and the decision needs a quantitative robustness statement.\n- **vs deterministic single-point correction:** the probabilistic (Monte Carlo) version samples\n  sensitivity/specificity from Beta distributions informed by the validation counts and returns a simulation\n  interval that honestly folds validation sampling error into the main-study random error. Deterministic\n  correction reports a bare point and understates total uncertainty. 
**Prefer probabilistic** for any inferential\n  or regulatory submission; reserve deterministic correction for transparent appendix tables.\n- **vs record-level imputation / regression calibration when individual validation records exist:** using the\n  actual linked gold-standard records (multiple imputation for the mismeasured variable, or regression\n  calibration) is more statistically efficient and makes fewer transportability assumptions. Summary-parameter\n  QBA correction is the right tool when only aggregate sens/spec or an *external* published parameter is\n  available, or when you must combine measurement error with other biases in one simulation. **Prefer\n  record-level methods** when you hold the individual validation records.\n- **vs E-value / tipping-point analysis:** those address unmeasured confounding only and need no validation\n  data; misclassification correction addresses a different (and in claims often larger) bias source and requires\n  validation parameters. They are complementary layers of a multi-bias analysis, not alternatives.\n\n**When to use** (decision rules).\n- An algorithm-based endpoint, exposure, or covariate is the only feasible measure in the full cohort, AND a\n  feasible internal validation substudy (chart review, registry linkage, EHR enrichment) or a transportable\n  external parameter (published validation in a similar database, payer mix, and calendar window) exists.\n- The decision requires a quantitative statement of how far the observed association could be moved by known\n  imperfect measurement: an FDA/EMA pre-specified sensitivity analysis, an HTA robustness check, or an internal\n  go/no-go.\n- Differential misclassification by arm is plausible (surveillance/detection bias, arm-specific coding intensity)\n  and arm-specific parameters can be estimated.\n\n**When NOT to use — and when it is actively misleading or dangerous** (decision rules).\n- **The validation sample is not exchangeable with the analytic 
cohort.** Transporting a PPV from an academic\n  registry onto a national commercial-claims study without re-weighting for payer mix, calendar window, and case\n  severity produces a confidently wrong number that is *more* dangerous than the honest uncorrected estimate.\n- **Only algorithm-positives were validated, so only PPV is known and sensitivity is still assumed.** A\n  correction anchored on PPV alone is incomplete and can move the estimate in the wrong direction; back-solving\n  sens/spec from PPV requires the true prevalence, which is exactly what is unknown.\n- **Differential misclassification is likely but arm-specific parameters cannot be estimated.** Applying a pooled\n  sens/spec silently *asserts* non-differential error — the most consequential and least testable assumption in\n  the analysis.\n- **The real problem is design, not measurement.** Misclassification correction layered on a study with no valid\n  active comparator, immortal time, or depletion of susceptibles yields a *precisely* wrong number that launders\n  a structural flaw through quantitative machinery. Fix the design first.\n- **The validation data are too sparse to support the math.** Corrected cell counts that go negative or exceed\n  the denominator, or a deterministic corrected risk outside [0,1], signal that the assumed parameters conflict\n  with the observed data; reporting a point estimate without the simulation interval in that regime is\n  misleading.\n\n**Data-source operational depth** (claims vs EHR vs registry vs linked).\n- **Claims (FFS or commercial):** Validate sensitivity ONLY on fee-for-service-complete person-time. 
Medicare\n  Advantage and capitated/bundled arrangements drop the encounter claims that define a true non-case, so on\n  MA-only time an \"algorithm-negative\" is *missingness*, not a validated true negative — restrict the validation\n  frame to enrollees with both Parts A/B (and D for drug exposures) or a commercial medical+pharmacy benefit.\n  Review BOTH algorithm-positives and algorithm-negatives if sensitivity is needed; positives-only chart pulls\n  yield PPV but never sensitivity. Stratify bias parameters by arm, age, route, site, and calendar period when\n  differential capture is plausible (e.g., differential outpatient-coding intensity by drug class). Match the\n  validation frame's payer segment and calendar window to the analytic cohort and report the mapping explicitly.\n- **EHR:** Chart review or NLP-augmented phenotyping estimates algorithm performance and can supply unmeasured\n  covariates (labs, BMI, smoking, stage). Two failure modes dominate: *outside-care leakage* — events at\n  non-network facilities depress apparent sensitivity in a way that will not transport to a claims cohort with\n  complete capture — and *chart-availability bias* — patients with reviewable notes are sicker and more engaged\n  than the average cohort member, so the validation subset is non-representative.\n- **Registry:** Adjudication supplies high-specificity gold-standard outcomes, stage, and severity, but the\n  registry population is selected (often academic centers); high *internal* accuracy does not guarantee\n  transport to the full treated population, and linkage eligibility/match failure stack their own\n  selection/misclassification layer on top of the validation selection.\n- **Linked claims–EHR–registry:** The ideal two-phase substrate — cheap, complete covariates on everyone (Phase\n  1) plus gold-standard adjudication on the linkable subset (Phase 2). 
But the linkable subset is itself a\n  non-random sample; model the linkage probability and verify the bias parameters transport across linkable and\n  non-linkable strata, not merely within the linked subset. Beware *differential* competing risks by exposure\n  (e.g., higher background mortality in the sicker arm of an elderly claims cohort), which interacts with outcome\n  misclassification, and *immortal time* in procedure studies, where a misclassified procedure date corrupts\n  time-zero before any correction is applied.\n\n**Worked claims example (non-differential outcome misclassification).** A claims study compares Drug A vs Drug B\non incident stroke; the observed 1-year risks are 8.0% (A) and 11.1% (B), observed RR = 0.72. An internal\nvalidation substudy is drawn from FFS-complete enrollees in the same payer segment and calendar window and is\nstratified on *gold-standard truth* (chart-adjudicated status), which is what identifies sensitivity and\nspecificity directly: of 200 chart-confirmed true strokes the algorithm flags 156 (sensitivity = 156/200 = 0.78,\nFN = 44), and of 200 chart-confirmed true non-strokes the algorithm correctly clears 194 (specificity = 194/200\n= 0.97, FP = 6). (Sampling instead by *algorithm* status would identify PPV/NPV, not sensitivity/specificity, and\ncould not be plugged into the formula below without back-solving through the unknown true prevalence.) Deterministic\ncorrection: true_risk_A = (0.080 + 0.97 - 1) / (0.78 + 0.97 - 1) = 0.0667; true_risk_B = (0.111 + 0.97 - 1) /\n(0.78 + 0.97 - 1) = 0.1080; corrected RR = 0.62. Because non-differential error had biased the RR toward 1.0,\nthe corrected estimate moves *further* from the null. A 50,000-iteration probabilistic version drawing\nsens ~ Beta(157,45) and spec ~ Beta(195,7) yields a median corrected RR of about 0.61 with a 95% simulation\ninterval of roughly 0.51–0.74 — wider than the conventional CI because it now carries the validation sampling\nerror. 
The full analysis pre-specifies the validation sampling frame (FFS-complete only), whether parameters are\npooled or arm-specific, the exact correction formula, the Beta priors, and the iteration count; reports the raw\nvalidation 2x2; and presents BOTH the deterministic point and the simulation interval, with a sensitivity check\non differential vs non-differential assumptions.\n\n**Interpreting the output**\n\nFrom the worked example: observed RR = 0.72 (Drug A risk 8.0%, Drug B risk 11.1%), computed from\nalgorithm-coded stroke events. Validation gives sensitivity = 0.78 (156/200), specificity = 0.97 (194/200).\nDeterministic Rogan-Gladen correction yields corrected RR ≈ 0.62. The 50,000-draw probabilistic version\n(sens ~ Beta(157, 45), spec ~ Beta(195, 7)) gives median corrected RR ≈ 0.61, 95% simulation interval\n≈ 0.51–0.74.\n\n*(1) Formal interpretation.* The deterministic corrected RR 0.62 is conditional on the fixed point\nestimates of sensitivity and specificity from the chart-review validation sample; it carries no\nuncertainty from validation sampling error. The simulation interval ≈ 0.51–0.74 is *not* a confidence\ninterval: it propagates the added uncertainty from imperfectly known bias parameters (Beta posteriors\nfrom validation). The correction moves *away* from the null (0.72 → 0.62) because non-differential\nmisclassification biases outcome-defined relative risks toward 1.0; the direction can reverse when\nmisclassification is differential and favors the more-monitored arm.\n\n*(2) Practical interpretation.* Drug A's apparent 28% lower risk deepens to roughly 38% after accounting\nfor the claims stroke algorithm's imperfect sensitivity and specificity.\n
The simulation interval 0.51–0.74\nexcludes 1.0, so the corrected finding is robust to validation sampling variability under these priors.\nThe key decision-point is the differential vs non-differential assumption: if Drug A patients are more\nintensively monitored (higher sensitivity in that arm), the correction attenuates toward the null rather\nthan away from it — analysts must pre-specify which assumption governs before seeing the corrected result.",
    "primary_category": "Inferential_Statistics",
    "tags": [
      "quantitative-bias-analysis",
      "misclassification",
      "measurement-error",
      "sensitivity-specificity",
      "ppv-npv",
      "validation-substudy",
      "probabilistic-bias-analysis",
      "pharmacoepidemiology"
    ],
    "applies_to_study_types": [
      "claims_analysis",
      "ehr_study",
      "cohort_retrospective",
      "case_control",
      "target_trial_emulation"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1093/ije/25.6.1107",
        "url": "https://doi.org/10.1093/ije/25.6.1107",
        "citation_text": "Greenland S. Basic methods for sensitivity analysis of biases. International Journal of Epidemiology. 1996;25(6):1107-1116.",
        "year": 1996,
        "authors_short": "Greenland",
        "notes": "Foundational derivation of the deterministic correction algebra for exposure, outcome, and confounder misclassification, including the non-differential outcome rescaling used here."
      },
      {
        "role": "explain",
        "doi": "10.1093/ije/dyu149",
        "url": "https://doi.org/10.1093/ije/dyu149",
        "citation_text": "Lash TL, Fox MP, MacLehose RF, Maldonado G, McCandless LC, Greenland S. Good practices for quantitative bias analysis. International Journal of Epidemiology. 2014;43(6):1969-1985.",
        "year": 2014,
        "authors_short": "Lash et al.",
        "notes": "Reporting and methodological standards for deterministic and probabilistic bias analysis; specifies how to choose bias-parameter distributions and present simulation intervals."
      },
      {
        "role": "demonstrate",
        "doi": "10.1093/ije/dyi184",
        "url": "https://doi.org/10.1093/ije/dyi184",
        "citation_text": "Fox MP, Lash TL, Greenland S. A method to automate probabilistic sensitivity analyses of misclassified binary variables. International Journal of Epidemiology. 2005;34(6):1370-1376.",
        "year": 2005,
        "authors_short": "Fox et al.",
        "notes": "Operationalizes Monte Carlo probabilistic correction for binary misclassification with record-level reconstruction; the template behind the simulation code below."
      },
      {
        "role": "use",
        "doi": "10.1007/s40471-014-0027-z",
        "url": "https://doi.org/10.1007/s40471-014-0027-z",
        "citation_text": "Jonsson Funk M, Landi SN. Misclassification in administrative claims data: quantifying the impact on treatment effect estimates. Current Epidemiology Reports. 2014;1(4):175-185.",
        "year": 2014,
        "authors_short": "Jonsson Funk & Landi",
        "notes": "Applied claims-data treatment showing how exposure and outcome misclassification distort treatment-effect estimates and how validation-anchored correction recovers them."
      }
    ],
    "plain_language_summary": "When a database study labels patients as having an event (like a stroke) using billing codes, those labels are never perfectly accurate: some true events get missed and some non-events get flagged. Misclassification bias correction takes the flawed observed result, plugs in how accurate the labeling algorithm actually was (measured by checking a subset of charts), and back-calculates what the result would have been if the labels were perfect. Non-differential misclassification, where labeling errors are equally likely in both treatment groups, pushes the observed relative risk for a yes/no outcome toward 1.0, making a beneficial drug look less effective, or a harmful drug look safer, than it really is; the correction reverses that pull.",
    "key_terms": [
      {
        "term": "misclassification",
        "definition": "An error in how a patient is labeled — for example, a patient who truly had a stroke is coded as stroke-free, or vice versa, because the billing-code algorithm is imperfect."
      },
      {
        "term": "non-differential vs differential misclassification",
        "definition": "Non-differential means labeling errors happen at the same rate in both the treated and comparison groups; differential means the error rate differs by group, which can push the estimate in either direction rather than always toward 1.0."
      },
      {
        "term": "quantitative bias analysis",
        "definition": "A family of methods that convert a qualitative concern about bias (such as imperfect coding) into a number — a corrected estimate and an interval — so reviewers can audit the assumption rather than just accept a verbal caveat."
      },
      {
        "term": "sensitivity",
        "definition": "Of all patients who truly had the event, the fraction the algorithm correctly flagged as event-positive — a sensitivity of 0.78 means 78 of every 100 true events are captured."
      },
      {
        "term": "specificity",
        "definition": "Of all patients who truly did not have the event, the fraction the algorithm correctly left unflagged — a specificity of 0.97 means 97 of every 100 true non-events are correctly cleared."
      }
    ],
    "worked_example": {
      "scenario": "A claims study compares Drug A versus Drug B on one-year stroke risk. The stroke outcome is coded using ICD diagnosis codes. Before publishing, the team runs a chart review on 400 patients drawn from gold-standard-confirmed cases and non-cases — 200 true strokes and 200 true non-strokes — to measure how well the algorithm performs. The validation shows sensitivity of 0.78 and specificity of 0.97. The team applies the correction formula to recover the stroke risk that would have been observed if every chart had been reviewed.",
      "dataset": {
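        "caption": "Observed (algorithm-coded) stroke counts per 1,000 patients in each arm, plus the validation results from the chart review. Validation counts are pooled across arms and shown in the drug_a column; a dash marks cells that do not apply.",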
        "columns": [
          "category",
          "drug_a",
          "drug_b"
        ],
        "rows": [
          [
            "Patients in study arm",
            "1,000",
            "1,000"
          ],
          [
            "Observed (algorithm-coded) stroke events",
            "80",
            "111"
          ],
          [
            "Observed risk",
            "8.0%",
            "11.1%"
          ],
          [
            "Observed relative risk (A vs B)",
            "0.72",
            "—"
          ],
          [
            "Validation: true strokes caught (TP)",
            "156 of 200",
            "—"
          ],
          [
            "Validation: true non-strokes cleared (TN)",
            "194 of 200",
            "—"
          ],
          [
            "Sensitivity (156/200)",
            "0.78",
            "—"
          ],
          [
            "Specificity (194/200)",
            "0.97",
            "—"
          ]
        ]
      },
      "steps": [
        "Write down the correction formula for non-differential outcome misclassification: true_risk = (observed_risk + specificity - 1) / (sensitivity + specificity - 1).",
        "Compute the shared denominator: 0.78 + 0.97 - 1 = 0.75.",
        "Correct Drug A: numerator = 0.080 + 0.97 - 1 = 0.050; true_risk_A = 0.050 / 0.75 = 0.0667 (6.7%).",
        "Correct Drug B: numerator = 0.111 + 0.97 - 1 = 0.081; true_risk_B = 0.081 / 0.75 = 0.1080 (10.8%).",
        "Compute the corrected relative risk: 0.0667 / 0.1080 = 0.62.",
        "Compare: the observed RR was 0.72 (closer to 1.0) and the corrected RR is 0.62 (further from 1.0), which is the direction expected when non-differential misclassification has biased the original estimate toward the null."
      ],
      "result": "Observed vs corrected summary — Drug A: 8.0% observed vs 6.7% corrected; Drug B: 11.1% observed vs 10.8% corrected; RR: 0.72 observed vs 0.62 corrected. Imperfect coding (sensitivity 0.78, specificity 0.97) had compressed the relative risk toward 1.0; correcting for that error reveals Drug A has a larger true protective advantage than the raw claims data suggested."
    },
    "prerequisites": [
      "sensitivity-specificity-rwe",
      "claims-outcome-algorithm-ppv-sensitivity-rwe",
      "quantitative-bias-analysis-toolkit-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Deterministic correction",
        "description": "Correct a single estimate under fixed sensitivity/specificity (or PPV/NPV) assumptions to produce a transparent, reproducible bias-adjusted point estimate.",
        "edge_cases": [
          "Corrected cell counts can become negative or exceed the denominator when the assumed parameters conflict with the observed prevalence; treat this as a signal that the parameters are infeasible, not a number to report."
        ],
        "data_source_notes": "Useful for transparent appendix/SAP tables; never present a deterministic point alone for an inferential claim because it understates total uncertainty."
      },
      {
        "name": "Probabilistic (Monte Carlo) correction",
        "description": "Sample sensitivity/specificity (or PPV/NPV) from Beta or trapezoidal distributions informed by the validation counts so validation sampling error propagates into a simulation interval.",
        "edge_cases": [
          "Sparse validation data produce heavy-tailed corrected distributions; report the full interval and the share of iterations yielding infeasible (clipped) risks."
        ],
        "data_source_notes": "Preferred for any regulatory or inferential use; anchor Beta parameters directly on the validation 2x2 (e.g., sens ~ Beta(TP+1, FN+1))."
      },
      {
        "name": "Differential (arm- or site-specific) correction",
        "description": "Apply separate bias parameters per treatment arm, site, or calendar period when validation supports it, relaxing the non-differential assumption.",
        "edge_cases": [
          "Requires an adequately sized validation sample within each stratum; thin strata reintroduce instability and may force partial pooling."
        ],
        "data_source_notes": "claims: stratify by arm/route/site when differential outpatient-coding intensity is plausible; report the per-stratum validation counts."
      },
      {
        "name": "Exposure or covariate misclassification correction",
        "description": "Apply the analogous algebra to a mismeasured exposure 2x2 or to a covariate entering a propensity-score or outcome model (regression-calibration-style correction).",
        "edge_cases": [
          "Misclassified confounders can bias in either direction and can amplify rather than attenuate; non-differential exposure error is not guaranteed to bias toward the null in non-binary or multi-variable settings."
        ],
        "data_source_notes": "claims/ehr: covariate misclassification interacts with the PS model; correct before or jointly with PS estimation, not after matching."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Ignoring misclassification (qualitative limitation paragraph only)",
        "pros_of_this": "Converts a hand-waved \"potential bias\" sentence into a quantified, auditable sensitivity result anchored on the actual validation 2x2.",
        "cons_of_this": "Adds validation-substudy burden and an explicit transportability assumption that reviewers will scrutinize.",
        "when_to_prefer": "Whenever a feasible internal or transportable external validation sample exists and the decision requires a quantitative robustness statement."
      },
      {
        "compared_to": "Record-level multiple imputation or regression calibration using individual validation records",
        "pros_of_this": "Works from summary sens/spec or external parameters alone and composes naturally with other biases in a multi-bias PBA.",
        "cons_of_this": "Less statistically efficient than using the actual individual gold-standard records when they are available, and still requires a correctly specified bias model.",
        "when_to_prefer": "When the validation sample is external, small, or supplies only aggregate performance parameters rather than linked record-level truth."
      },
      {
        "compared_to": "E-value or tipping-point analysis (unmeasured confounding only)",
        "pros_of_this": "Directly targets measurement error, a distinct and often larger bias source in claims/EHR, and can be combined with a confounding layer in one simulation.",
        "cons_of_this": "E-value is simpler and needs no validation data; misclassification correction requires validation parameters or a credible external source.",
        "when_to_prefer": "When the primary threat is imperfect algorithm-based measurement rather than (or in addition to) unmeasured confounding."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Validate sensitivity only on FFS-complete person-time (continuous Parts A/B/D or commercial medical+pharmacy) so an algorithm-negative is a true non-case, not MA-only missingness. Review both algorithm-positive and algorithm-negative patients when sensitivity is required. Stratify bias parameters by arm, age, site, route, or calendar period when differential misclassification is plausible, and match the validation frame's payer segment and calendar window to the analytic cohort, reporting that mapping explicitly.",
      "ehr": "Use chart review or structured-plus-notes phenotyping to estimate algorithm performance and to capture confounders that structured data do not measure. Explicitly handle outside-care leakage (events at non-network facilities depress apparent sensitivity and do not transport to a complete-capture claims cohort) and chart-availability bias (patients with reviewable notes tend to be sicker and more engaged than the average cohort member).",
      "registry": "Registry adjudication supplies high specificity and gold-standard stage/severity, but the registry population is selected; high internal accuracy does not guarantee transport to the full treated population, and linkage eligibility/match failure add a further selection layer on top of the validation selection.",
      "linked": "Linked claims-EHR-registry is the ideal two-phase substrate (complete covariates on everyone; gold standard on the linkable subset), but the linkable subset is non-random. Model the linkage probability and verify bias-parameter transportability across linkable and non-linkable strata, not just within the linked subset."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\nimport pandas as pd\n\ndef correct_rr(main: pd.DataFrame, val: dict, n_iter: int = 50000, seed: int = 42) -> np.ndarray:\n    rng = np.random.default_rng(seed)\n    obs = main.set_index(\"arm\").eval(\"n_events / n_total\")  # observed (biased) risk per arm\n\n    def draw(v):  # Beta posteriors anchored on the validation 2x2 counts\n        sens = rng.beta(v[\"tp\"] + 1, v[\"fn\"] + 1, n_iter)\n        spec = rng.beta(v[\"tn\"] + 1, v[\"fp\"] + 1, n_iter)\n        return sens, spec\n\n    def corrected(obs_risk, sens, spec):\n        r = (obs_risk + spec - 1.0) / (sens + spec - 1.0)\n        return np.clip(r, 0.0, 1.0)  # infeasible draws are clipped to [0, 1]; this function does not record the clip share\n\n    if all(k in val for k in (\"tp\", \"fp\", \"tn\", \"fn\")):            # non-differential\n        sens, spec = draw(val)\n        true_a, true_b = corrected(obs[\"A\"], sens, spec), corrected(obs[\"B\"], sens, spec)\n    else:                                                          # differential: arm-specific 2x2\n        sa, pa = draw(val[\"A\"]); sb, pb = draw(val[\"B\"])\n        true_a, true_b = corrected(obs[\"A\"], sa, pa), corrected(obs[\"B\"], sb, pb)\n    return true_a / true_b  # corrected RR per draw (inf or nan if the arm B risk was clipped to 0)\n\n# main = pd.DataFrame({\"arm\": [\"A\", \"B\"], \"n_events\": [80, 111], \"n_total\": [1000, 1000]})\n# rr = correct_rr(main, val={\"tp\": 156, \"fp\": 6, \"tn\": 194, \"fn\": 44})\n# print(np.percentile(rr, [2.5, 50, 97.5]))   # median corrected RR + 95% simulation interval; report the validation 2x2 alongside",
        "description": "Validation-anchored probabilistic outcome-misclassification correction (non-differential or arm-specific).\nRequired inputs (already cleaned, one row per arm):\n  main : DataFrame with columns arm in {'A','B'}, n_events (algorithm-positive count), n_total\n  val  : either a single dict {'tp','fp','tn','fn'} from a chart review of FFS-complete enrollees\n         (non-differential), OR {'A': {...}, 'B': {...}} for arm-specific (differential) correction.\nBeta priors are anchored directly on the validation 2x2: sens ~ Beta(TP+1, FN+1), spec ~ Beta(TN+1, FP+1).\nReturns the corrected risk-ratio distribution (length n_iter) for percentile-based simulation intervals.",
        "dependencies": [
          "numpy",
          "pandas"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "correct_rr <- function(main, val, n_iter = 50000L, seed = 42L) {\n  set.seed(seed)\n  obs <- setNames(main$n_events / main$n_total, main$arm)   # observed (biased) risk per arm\n  corrected <- function(obs_risk, se, sp) {\n    r <- (obs_risk + sp - 1) / (se + sp - 1)\n    pmin(pmax(r, 0), 1)                                     # clip infeasible draws to [0, 1]\n  }\n  if (all(c(\"tp\", \"fp\", \"tn\", \"fn\") %in% names(val))) {     # non-differential\n    sens <- rbeta(n_iter, val$tp + 1, val$fn + 1)\n    spec <- rbeta(n_iter, val$tn + 1, val$fp + 1)\n    ta <- corrected(obs[[\"A\"]], sens, spec); tb <- corrected(obs[[\"B\"]], sens, spec)\n  } else {                                                  # differential: arm-specific 2x2\n    sa <- rbeta(n_iter, val$A$tp + 1, val$A$fn + 1); pa <- rbeta(n_iter, val$A$tn + 1, val$A$fp + 1)\n    sb <- rbeta(n_iter, val$B$tp + 1, val$B$fn + 1); pb <- rbeta(n_iter, val$B$tn + 1, val$B$fp + 1)\n    ta <- corrected(obs[[\"A\"]], sa, pa); tb <- corrected(obs[[\"B\"]], sb, pb)\n  }\n  ta / tb\n}\n\n# main <- data.frame(arm = c(\"A\", \"B\"), n_events = c(80, 111), n_total = c(1000, 1000))\n# rr <- correct_rr(main, val = list(tp = 156, fp = 6, tn = 194, fn = 44))\n# quantile(rr, c(.025, .5, .975))   # use the same FFS-complete validation frame as the analytic cohort",
        "description": "R implementation of the validation-anchored probabilistic correction. Inputs mirror the Python version:\n  main : data.frame(arm = c('A','B'), n_events = <algorithm-positive count>, n_total = <denominator>)\n  val  : list(tp, fp, tn, fn) from chart review of FFS-complete enrollees (non-differential);\n         pass arm-specific lists for differential correction. Beta priors anchored on the validation 2x2.",
        "dependencies": [
          "stats"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "%let TP=156; %let FP=6; %let TN=194; %let FN=44;   /* validation 2x2 (FFS-complete chart review) */\n%let RA=0.080; %let RB=0.111;                       /* observed (biased) risks, arms A and B       */\n\ndata sims(keep=rr_adj clipped);\n  call streaminit(42);                              /* fixed seed so the simulation is reproducible */\n  do i = 1 to 50000;\n    sens = rand('beta', &TP + 1, &FN + 1);          /* sensitivity posterior */\n    spec = rand('beta', &TN + 1, &FP + 1);          /* specificity posterior */\n    r_a = (&RA + spec - 1) / (sens + spec - 1);\n    r_b = (&RB + spec - 1) / (sens + spec - 1);\n    clipped = (r_a < 0 or r_a > 1 or r_b < 0 or r_b > 1);  /* track infeasible draws */\n    r_a = min(max(r_a, 0), 1); r_b = min(max(r_b, 0), 1);\n    rr_adj = r_a / r_b;\n    output;\n  end;\nrun;\n\nproc univariate data=sims noprint;                  /* mean, median, and 95% simulation interval */\n  var rr_adj;\n  output out=summary mean=mean_rr pctlpts=2.5 50 97.5 pctlpre=p;\nrun;\nproc means data=sims mean; var clipped; run;        /* report the clip share alongside the interval */",
        "description": "SAS Monte Carlo validation-anchored misclassification correction (DATA step + PROC UNIVARIATE).\nInputs (post data-management): macro vars TP/FP/TN/FN = validation 2x2 counts from chart review of\nFFS-complete enrollees only; RA/RB = observed (biased) algorithm-based risks in arms A and B.\nBeta draws are anchored on the validation 2x2; PROC UNIVARIATE returns the simulation interval.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Main[Phase 1: full claims cohort<br/>algorithm-based outcome, all covariates] --> Sel[Phase 2: sample for gold standard<br/>FFS-complete enrollees only]\n  Sel --> Adj[Chart / registry / EHR adjudication<br/>build validation 2x2 -> sens, spec, PPV, NPV]\n  Adj --> Trans{Validation frame<br/>exchangeable with cohort?<br/>payer mix, calendar, severity}\n  Trans -->|No| Stop[Re-weight or do NOT transport<br/>report uncorrected + caveat]\n  Trans -->|Yes| Corr[Apply correction<br/>deterministic + Monte Carlo over Beta priors]\n  Corr --> Out[Bias-adjusted estimate + simulation interval<br/>validation error folded into main-study error]",
        "caption": "Two-phase validation-and-correction workflow. Phase 1 is the complete claims cohort; Phase 2 is the gold-standard review restricted to FFS-complete person-time, with an explicit transportability gate before the parameters are applied to the main estimate.",
        "alt_text": "Flowchart from full claims cohort to a gold-standard validation sample, a transportability decision gate, and the bias-corrected estimate with a simulation interval.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  Obs[Observed biased estimate<br/>RR from algorithm-coded events] --> Form[\"Correction formula<br/>true = (obs + spec - 1) / (sens + spec - 1)\"]\n  Val[Validation 2x2<br/>TP FP TN FN] --> Prior[Beta priors<br/>sens ~ Beta TP+1 FN+1<br/>spec ~ Beta TN+1 FP+1]\n  Prior --> Form\n  Form --> Sim[50k Monte Carlo iterations]\n  Sim --> Res[Median corrected RR<br/>+ 2.5 / 97.5 percentile interval]",
        "caption": "Information flow of the probabilistic correction. The validation 2x2 sets Beta priors for sensitivity and specificity; the rescaling formula is evaluated across Monte Carlo draws to yield a corrected point estimate and an interval that carries validation sampling error.",
        "alt_text": "Schematic showing the observed estimate and validation 2x2 feeding Beta priors and the correction formula through a Monte Carlo simulation to a corrected risk ratio with an interval.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "is_variant_of",
        "target_slug": "quantitative-bias-analysis-toolkit-rwe",
        "notes": "Misclassification bias correction is the measurement-error path within the broader QBA toolkit."
      },
      {
        "relation_type": "used_with",
        "target_slug": "external-adjustment-validation-substudy-bias-correction-rwe",
        "notes": "The validation substudy or external parameter supplies the sensitivity/specificity that this method then uses for correction."
      },
      {
        "relation_type": "see_also",
        "target_slug": "algorithm-validation",
        "notes": "Algorithm validation (PPV/sensitivity against chart or registry truth) is the direct data source for the correction parameters."
      },
      {
        "relation_type": "complements",
        "target_slug": "claims-outcome-algorithm-ppv-sensitivity-rwe",
        "notes": "The PPV/sensitivity trade-off in outcome-algorithm construction is the prerequisite for choosing and justifying the correction parameters used here."
      },
      {
        "relation_type": "see_also",
        "target_slug": "e-value-sensitivity-analysis",
        "notes": "E-value addresses unmeasured confounding; misclassification correction addresses a different bias source and the two are frequently presented together in a multi-bias analysis."
      }
    ],
    "aliases": [
      "misclassification correction",
      "outcome misclassification adjustment",
      "exposure misclassification correction",
      "classification error correction",
      "validation-anchored correction",
      "probabilistic bias analysis for misclassification"
    ],
    "complexity": "advanced",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "missing-data-pattern-table-rwe",
    "name": "Missing Data Pattern Table",
    "short_definition": "A diagnostic artifact that cross-tabulates the rate and co-occurrence structure of missing values by variable, treatment arm, calendar time, site, and outcome status to characterize the missingness pattern and inform a defensible mechanism judgment (MCAR / MAR / MNAR) before any analysis decision.",
    "long_description": "A **missing data pattern table** is the descriptive backbone of any principled missing-data strategy in real-world\nevidence. It is not an estimator and not an imputation; it is the audit that turns \"we have missingness\" into a specific,\ntestable claim about *which* variables are missing, *how much*, *for whom*, and *together with what*. Two distinct objects\ntravel under this name and both belong in the table: (1) the **per-variable missingness summary** — the proportion missing\nfor each analysis variable, ideally stratified by the factors that drive selection in routinely collected data (treatment\narm, calendar quarter, care site/plan, and outcome status); and (2) the **monotone-vs-non-monotone pattern matrix** — the\nset of distinct missingness patterns across variables (the binary observed/missing footprint per record) and how many\nrecords share each, which reveals whether missingness is *monotone* (dropout-like, amenable to simpler methods) or\n*arbitrary* (requiring chained-equations imputation or pattern-mixture modeling).\n\n**Core conceptual distinction** (table vs mechanism).\nThe pattern table answers a different question than the *mechanism*. The table is purely\nobservable: counts and cross-tabs you can compute from the data. The **mechanism** — Rubin's taxonomy of MCAR (missingness\nindependent of all data), MAR (missingness depends only on *observed* data), and MNAR (missingness depends on the\n*unobserved* value itself) — is partly unverifiable and must be argued, not measured. The table constrains that argument:\nif HbA1c is missing far more often in one treatment arm or in Medicare Advantage enrollees, MCAR is empirically\nimplausible (Little's test on the pattern matrix can formalize this), and you are forced to choose between MAR-justified\nmethods (multiple imputation, IPCW) conditional on the observed drivers, or MNAR-justified sensitivity analysis\n(pattern-mixture, delta-adjustment). 
The estimand is unchanged by the table, but the *credibility of the identifying\nassumption* required to estimate it is exactly what the table is built to interrogate.\n\n**Pros, cons, and trade-offs** (specific and comparative).\n- **vs jumping straight to complete-case analysis:** The pattern table exposes whether complete-case is even admissible.\n  Complete-case is unbiased only under MCAR (or, for a regression coefficient, when missingness is independent of the\n  outcome given covariates); the table is what lets you reject MCAR. Cost: it is descriptive overhead that does not\n  itself fix anything. **Prefer the table** before any deletion-based analysis — it is cheap insurance against a silently\n  biased default.\n- **vs jumping straight to multiple imputation (MI):** MI is valid under MAR, but MAR-given-*what*? The pattern table\n  tells you which observed variables drive missingness and therefore which auxiliary variables the imputation model must\n  include to make MAR plausible. Skipping the table risks an imputation model that omits the very predictors that make\n  missingness ignorable, reintroducing bias while projecting false confidence (Hughes et al.). **Prefer the table** to\n  specify the imputation model, not as a substitute for it.\n- **vs a single overall \"% missing\" figure:** The stratified table is the entire value-add. A 5% overall HbA1c-missing\n  rate that is 1% in one arm and 20% in the other is a confounded-missingness alarm; the marginal 5% hides it. Cost:\n  more programming and more cells to interpret. **Always prefer** the stratified, arm-by-time-by-site version for any\n  comparative analysis.\n\n**When to use** (decision rules). Always, and first — before specifying complete-case, MI, IPCW, or pattern-mixture analyses, and before\nfinalizing the statistical analysis plan. 
It is mandatory documentation for regulatory- and HTA-grade RWE: ICH E9(R1),\nFDA RWE guidance, and ISPOR good-practice all expect missingness to be characterized and the handling strategy\npre-specified and justified against it. Run it on every analysis variable, every confounder feeding a propensity score,\nand every component of a composite endpoint or cost vector.\n\n**When NOT to use — and when it is actively misleading or dangerous** (decision rules).\nThe table itself is never the wrong thing to\n*compute*; the danger is misreading it. (1) **Do not treat a passing Little MCAR test as license for complete-case.** The\ntest has low power, ignores MNAR entirely, and a non-significant result in a small or sparse cell pattern is\nuninformative — acting on it is the classic trap. (2) **Do not stratify only by baseline variables when the missingness is\noutcome-driven.** If a lab is ordered *because* the patient is deteriorating, the value is MNAR and no amount of\nobserved-covariate stratification rescues MAR; the table must include outcome status precisely to surface this, and\nif it does, MI alone is misleading and you need MNAR sensitivity analysis. (3) **Do not confuse structural/legitimate\nmissingness with informative missingness.** A pregnancy field \"missing\" for men, or a lab not drawn because it was\nclinically unnecessary, are not the same as a deteriorating-patient lab gap; collapsing them in one table and imputing\nacross them fabricates data. 
(4) **Do not let a clean-looking table mask differential ascertainment masquerading as data\npresence** — a zero \"missing\" rate for diagnoses in claims does not mean the disease is absent; it means a code was or\nwas not submitted, which is an exposure/outcome-misclassification problem the missingness table cannot see.\n\n**Data-source operational depth** (real failure modes and workarounds).\n- **Claims (FFS vs Medicare Advantage):** Most \"missingness\" in claims is really *non-observation of person-time*, and\n  it is structurally differential. Medicare Advantage encounter data are notoriously incomplete relative to\n  fee-for-service claims, so any variable derived from utilization (comorbidity flags, prior therapy, baseline cost) is\n  differentially missing for MA-only person-time — and MA enrollment correlates with health and region, making this MNAR\n  with respect to the very confounders you need. Workaround: restrict to enrollees with the relevant benefit (A/B/D or\n  commercial medical+pharmacy) across the lookback, flag MA-only spans explicitly, and stratify the pattern table by\n  plan type so the differential is visible rather than averaged away. Lab *results* are absent from most medical claims\n  entirely (you get the CPT for the test, not the value), so an HbA1c \"value\" column is ~100% missing in claims-only data.\n- **EHR:** Missingness is encounter-driven and informative. 
A field is populated only if a visit occurred and the\n  clinician charted it; sicker, more-engaged patients accrue more data, so completeness correlates with the outcome.\n  External-care leakage means a patient treated elsewhere looks \"missing\" when they are merely unobserved in this system.\n  Workaround: stratify by site and by within-system utilization, treat loss to follow-up as potentially informative\n  (IPCW rather than naive complete-case), and use linked claims to distinguish \"not done\" from \"done elsewhere.\"\n- **Registry:** Pre-specified data dictionaries make missingness more interpretable, but completeness varies by site and\n  over the enrollment period, and adjudicated fields may be missing precisely for the hardest-to-adjudicate (often most\n  severe) cases. Stratify by site and accrual era.\n- **Linked claims–EHR–vital records:** The richest substrate but linkage itself induces missingness — the unlinkable\n  subset is a selected population, and date discrepancies between order, fill, and service dates create apparent\n  missingness in time-windowed variables. Report the linkage denominator as the first row of the pattern table.\n\n**Worked claims/EHR example.** Question: comparative HbA1c control (mean HbA1c at 12 months) for GLP-1 RA vs basal insulin\ninitiators in a linked commercial-claims + EHR diabetes cohort. (1) For each analysis variable — baseline HbA1c, baseline\neGFR, BMI, the 12-month outcome HbA1c, and each propensity-score confounder — compute the proportion of `person_id` with a\nnon-missing value. (2) Cross-tabulate the 12-month HbA1c missingness by `arm` × calendar quarter of `index_date` × care\n`site_id` × `plan_type` (FFS vs MA-equivalent commercial PPO vs HMO). 
(3) Suppose the table shows outcome HbA1c missing in\n19% of the basal-insulin arm but 9% of the GLP-1 arm, concentrated in HMO sites in 2020-Q2 (a COVID lab-access shock). This is a\ntextbook MAR-on-observed-factors pattern (arm, site, time) *plus* a suspected MNAR component (insulin initiators with poor control\nmay skip labs). (4) Build the monotone-vs-arbitrary pattern matrix: if a record missing the 12-month HbA1c is also missing\nthe 12-month eGFR (both lab-draw-dependent), the pattern is co-clustered, signaling a shared visit-attendance mechanism to\nencode as an auxiliary \"had a 12-month encounter\" predictor. (5) Decision: complete-case is rejected (Little's test\nsignificant; rates differ by arm); specify multiple imputation by chained equations *including* arm, site, quarter, the\nencounter-attendance flag, and the co-missing labs as predictors, and add a tipping-point/delta-adjusted MNAR sensitivity\nanalysis shifting imputed values in the insulin arm to test robustness. The pattern table is what justifies every one of\nthese choices in the SAP.\n\n**Interpreting the output**\n\nConsider the GLP-1 vs. basal insulin cohort above, where the 12-month HbA1c outcome is missing in 9% of\nGLP-1 arm patients and 19% of basal insulin patients, a 10 percentage-point differential between arms.\n\n*(1) Formal statistical interpretation.* The observed arm-differential of 10 percentage points in outcome\nmissingness is direct evidence against the missing completely at random (MCAR) assumption, which requires\nthat the probability of missingness be unrelated to both observed and unobserved variables. Little's MCAR\ntest can formalize this departure. 
The co-clustering of outcome and eGFR missingness within the same\npatients (a co-missingness pattern, which is distinct from monotone dropout) indicates a shared visit-attendance mechanism rather than independent\nitem non-response; this structural feature must be encoded in any imputation model as an auxiliary predictor.\nUnder MCAR, complete-case analysis would be valid; the arm differential in the pattern table argues against it. Under missing at random\n(MAR) conditional on arm, site, and time, multiple imputation with those variables as predictors is valid;\nthe suspected MNAR component requires an additional sensitivity analysis.\n\n*(2) Practical interpretation for a decision-maker.* The pattern table is the evidentiary basis for every\nsubsequent missing-data decision in the SAP. A 10-point arm differential means the complete-case sample\nis differentially depleted in insulin initiators (plausibly the sicker ones), which can bias any complete-case\ncomparison. Documenting the pattern before analysis forces transparency: reviewers can see exactly which\nvariables are missing, by how much, and in which groups, so the choice of imputation method is grounded\nin data rather than assumption.",
    "primary_category": "Descriptive_Epidemiology",
    "tags": [
      "missing-data",
      "missingness-pattern",
      "mcar-mar-mnar",
      "multiple-imputation",
      "complete-case-analysis",
      "data-quality-assessment",
      "descriptive-epidemiology",
      "sensitivity-analysis"
    ],
    "applies_to_study_types": [
      "claims_analysis",
      "ehr_study",
      "registry_linkage",
      "comparative_effectiveness",
      "target_trial_emulation"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1093/biomet/63.3.581",
        "url": "https://doi.org/10.1093/biomet/63.3.581",
        "citation_text": "Rubin DB. Inference and missing data. Biometrika. 1976;63(3):581-592.",
        "year": 1976,
        "authors_short": "Rubin",
        "notes": "Foundational paper defining the MCAR/MAR/MNAR (missing-at-random) taxonomy and the concept of ignorability that a pattern table is built to interrogate."
      },
      {
        "role": "explain",
        "doi": "10.1080/01621459.1988.10478722",
        "url": "https://doi.org/10.1080/01621459.1988.10478722",
        "citation_text": "Little RJA. A test of missing completely at random for multivariate data with missing values. Journal of the American Statistical Association. 1988;83(404):1198-1202.",
        "year": 1988,
        "authors_short": "Little",
        "notes": "Defines the multivariate MCAR test applied to the pattern matrix; the formal complement to the descriptive table, with the low-power/MNAR-blindness caveats that prevent its misuse."
      },
      {
        "role": "explain",
        "doi": "10.1136/bmj.b2393",
        "url": "https://doi.org/10.1136/bmj.b2393",
        "citation_text": "Sterne JAC, White IR, Carlin JB, et al. Multiple imputation for missing data in epidemiological and clinical research: potential and pitfalls. BMJ. 2009;338:b2393.",
        "year": 2009,
        "authors_short": "Sterne et al.",
        "notes": "Practical guidance that explicitly directs analysts to characterize the pattern and mechanism before choosing multiple imputation, and to report both."
      },
      {
        "role": "demonstrate",
        "doi": "10.1002/sim.4067",
        "url": "https://doi.org/10.1002/sim.4067",
        "citation_text": "White IR, Royston P, Wood AM. Multiple imputation using chained equations: issues and guidance for practice. Statistics in Medicine. 2011;30(4):377-399.",
        "year": 2011,
        "authors_short": "White et al.",
        "notes": "Shows how the pattern (monotone vs arbitrary) and the auxiliary predictors revealed by the table drive imputation-model specification."
      },
      {
        "role": "use",
        "doi": "10.2147/CLEP.S129785",
        "url": "https://doi.org/10.2147/CLEP.S129785",
        "citation_text": "Pedersen AB, Mikkelsen EM, Cronin-Fenton D, et al. Missing data and multiple imputation in clinical epidemiological research. Clinical Epidemiology. 2017;9:157-166.",
        "year": 2017,
        "authors_short": "Pedersen et al.",
        "notes": "Applied clinical-epidemiology walkthrough of pattern characterization and handling choices in routinely collected health data."
      },
      {
        "role": "use",
        "doi": "10.1093/ije/dyz032",
        "url": "https://doi.org/10.1093/ije/dyz032",
        "citation_text": "Hughes RA, Heron J, Sterne JAC, Tilling K. Accounting for missing data in statistical analyses: multiple imputation is not always the answer. International Journal of Epidemiology. 2019;48(4):1294-1304.",
        "year": 2019,
        "authors_short": "Hughes et al.",
        "notes": "Cautionary counterpoint — the pattern and mechanism determine when complete-case beats MI and when neither is valid without MNAR sensitivity analysis."
      }
    ],
    "plain_language_summary": "A missing data pattern table is a simple diagnostic you build before any analysis to see exactly which variables have gaps, how many gaps, and whether those gaps are spread evenly or clustered in ways that matter. You cross-tabulate the percent missing for each variable by treatment group, site, and time period, then examine which variables tend to be missing together on the same record. The pattern tells you whether gaps appear random or are tied to something measurable in your data, which in turn tells you which statistical method is appropriate to handle them.",
    "key_terms": [
      {
        "term": "MCAR",
        "definition": "Missing Completely At Random: the chance a value is missing has nothing to do with any variable in your dataset, observed or unobserved, so the missing records look just like the complete ones."
      },
      {
        "term": "MAR",
        "definition": "Missing At Random: the chance a value is missing can be fully explained by other variables you already have in your dataset, such as treatment arm or care site, but not by the missing value itself."
      },
      {
        "term": "MNAR",
        "definition": "Missing Not At Random: the chance a value is missing depends on the value that is missing, for example a lab being skipped precisely because a patient is doing poorly, so no amount of observed information can fully account for the gap."
      },
      {
        "term": "monotone pattern",
        "definition": "A dropout-like missingness structure where once a variable is missing for a patient, later variables are also missing in a staircase sequence, which allows simpler imputation methods."
      },
      {
        "term": "arbitrary pattern",
        "definition": "A missingness structure where gaps appear in no predictable order across variables on the same record, requiring a more flexible method called chained-equations imputation."
      },
      {
        "term": "multiple imputation",
        "definition": "A statistical method that fills in missing values multiple times using the pattern of observed data, then combines results across the filled-in datasets to give honest estimates that reflect the uncertainty from the gaps."
      }
    ],
    "worked_example": {
      "scenario": "You are studying a diabetes drug comparison: GLP-1 agonist versus basal insulin, using a linked claims and EHR dataset of 200 patients. Before you run any models, your statistician asks you to build a missing data pattern table for four key analysis variables: baseline HbA1c, baseline eGFR, BMI, and the 12-month outcome HbA1c. The table below is what you would produce. You then read the pattern to decide how to handle the missingness.",
      "dataset": {
        "caption": "Per-variable missingness summary table: rows are analysis variables, columns show total patients with a value, count missing, percent missing, and the missingness pattern type across all four variables. Two treatment arms are shown side by side. N = 100 per arm (200 total).",
        "columns": [
          "variable",
          "glp1_n_observed",
          "glp1_n_missing",
          "glp1_pct_missing",
          "insulin_n_observed",
          "insulin_n_missing",
          "insulin_pct_missing",
          "pattern_type"
        ],
        "rows": [
          [
            "baseline_hba1c",
            97,
            3,
            "3%",
            96,
            4,
            "4%",
            "arbitrary"
          ],
          [
            "baseline_egfr",
            95,
            5,
            "5%",
            94,
            6,
            "6%",
            "arbitrary"
          ],
          [
            "bmi",
            92,
            8,
            "8%",
            90,
            10,
            "10%",
            "arbitrary"
          ],
          [
            "outcome_hba1c_12m",
            91,
            9,
            "9%",
            81,
            19,
            "19%",
            "monotone"
          ]
        ]
      },
      "steps": [
        "Read the two percent-missing columns in the outcome row first: outcome HbA1c at 12 months is missing in 9 of 100 GLP-1 patients (9%) but 19 of 100 insulin patients (19%). A roughly two-fold difference between arms (a 10 percentage-point gap) is the first alarm that missingness may not be completely at random.",
        "Check whether the gap can be explained by observed factors you already have: if the 19 missing insulin records are concentrated at one care site or in the first quarter of 2020 (a COVID lab-access shock), missingness is tied to observable variables, which supports a MAR argument.",
        "Read the pattern_type column: baseline variables show an arbitrary pattern, meaning gaps on baseline HbA1c do not predict gaps on baseline eGFR on the same record. The outcome variable shows a monotone pattern, meaning patients who miss their 12-month HbA1c also tend to miss other 12-month follow-up labs such as eGFR (a follow-up measure that is not one of the four variables tabulated above), suggesting a shared cause such as not having a 12-month clinic visit at all.",
        "Combine these two reads: the outcome has both a large arm difference (differential, so not MCAR) and a monotone co-missing structure (shared visit-attendance mechanism). MCAR is rejected. The question is now MAR versus MNAR.",
        "Ask the clinical question to distinguish MAR from MNAR: are insulin patients more likely to skip their 12-month lab because their control is poor and they are avoiding their doctor? If yes, the missing value itself (poor HbA1c) predicts being missing, which is MNAR.",
        "Arithmetic check: overall outcome missingness = (9 + 19) / 200 = 28 / 200 = 14%. The arm-specific rates are 9% and 19%, which average to 14% only when the arms are equal in size, as they are here: (9% + 19%) / 2 = 14%. This confirms the arm rates are internally consistent."
      ],
      "result": "Dominant pattern: outcome HbA1c is missing in 9/100 = 9% of GLP-1 patients and 19/100 = 19% of insulin patients, a 10 percentage-point arm difference that rules out MCAR. The monotone co-missing structure for outcome labs points to a shared visit-attendance mechanism. The implied handling: (1) complete-case is rejected because missingness is differential by arm; (2) multiple imputation is warranted and must include arm, site, and calendar quarter as predictors to make a MAR assumption plausible; (3) a sensitivity analysis shifting imputed insulin-arm values by a delta (MNAR tipping-point) should be pre-specified to test whether conclusions hold if the missing patients had worse-than-imputed control."
    },
    "prerequisites": [
      "descriptive-epidemiology-rwe",
      "complete-case-analysis-rwe",
      "multiple-imputation-longitudinal-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Per-variable stratified missingness summary",
        "description": "Proportion missing for each analysis variable, cross-tabulated by treatment arm, calendar period, site/plan, and outcome status, to expose differential (non-MCAR) missingness that a marginal rate hides.",
        "edge_cases": [
          "Structural/legitimate missingness (e.g., pregnancy field for men, lab not clinically indicated) must be separated from informative missingness or imputation fabricates values.",
          "Zero apparent missingness in claims diagnoses reflects code submission, not disease absence (a misclassification issue invisible to the table).",
          "Sparse cells (rare arm x site x quarter combinations) produce unstable rates and unreliable MCAR tests."
        ],
        "data_source_notes": "claims: stratify by plan type (FFS vs MA) so differential encounter completeness is visible; lab values are typically ~100% missing in claims. ehr: stratify by site and within-system utilization since completeness tracks the outcome."
      },
      {
        "name": "Monotone-vs-arbitrary pattern matrix",
        "description": "The set of distinct observed/missing footprints across variables and the count of records sharing each, distinguishing dropout-like monotone patterns from arbitrary patterns that require chained-equations imputation.",
        "edge_cases": [
          "Co-clustered missingness (two lab-draw-dependent variables missing together) signals a shared mechanism to encode as an auxiliary predictor rather than impute independently.",
          "Near-monotone patterns with a few off-pattern records can be analyzed monotonically only after deciding how to treat the exceptions."
        ],
        "data_source_notes": "linked data: the unlinkable subset and order/fill/service date discrepancies create apparent arbitrary patterns in time-windowed variables; report the linkage denominator as the first pattern row."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Jumping straight to complete-case analysis",
        "pros_of_this": "Tests whether complete-case is even admissible (valid only under MCAR); exposes arm/site/time differential missingness that biases listwise deletion.",
        "cons_of_this": "Purely descriptive overhead that does not itself correct any bias.",
        "when_to_prefer": "Always run before any deletion-based analysis; it is cheap insurance against a silently biased default."
      },
      {
        "compared_to": "Jumping straight to multiple imputation",
        "pros_of_this": "Identifies which observed drivers and auxiliary variables the imputation model must include to make MAR plausible, and whether the pattern is monotone or arbitrary.",
        "cons_of_this": "Does not replace MI; it specifies it.",
        "when_to_prefer": "Always run to specify (not substitute for) the imputation model and to flag the need for MNAR sensitivity analysis."
      },
      {
        "compared_to": "A single overall percent-missing figure",
        "pros_of_this": "Surfaces confounded missingness (e.g., 1% in one arm vs 20% in the other) that a marginal rate conceals.",
        "cons_of_this": "More programming and more cells to interpret.",
        "when_to_prefer": "Any comparative effectiveness, safety, utilization, or cost analysis."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Most missingness is non-observation of person-time and is differential by plan; Medicare Advantage encounter data are incomplete relative to fee-for-service, so utilization-derived variables are differentially (and MNAR-ishly) missing. Restrict to the relevant benefit across the lookback, flag MA-only spans, and stratify the table by plan type. Lab result values are essentially absent from medical claims.",
      "ehr": "Missingness is encounter-driven and informative — completeness correlates with the outcome (sicker, more-engaged patients accrue more data); external-care leakage masquerades as missingness. Stratify by site and utilization, treat loss to follow-up as potentially informative (IPCW over naive complete-case), and use linkage to separate \"not done\" from \"done elsewhere.\"",
      "registry": "Pre-specified dictionaries aid interpretation, but completeness varies by site and accrual era, and adjudicated fields may be missing for the hardest (often most severe) cases. Stratify by site and enrollment period.",
      "linked": "The richest substrate, but linkage induces selection (the unlinkable subset) and order/fill/service date discrepancies create apparent missingness in time-windowed variables. Report the linkage denominator first."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\nimport numpy as np\n\nANALYSIS_VARS = [\"base_hba1c\", \"base_egfr\", \"bmi\", \"out_hba1c_12m\"]\n\ndef missingness_by_stratum(df: pd.DataFrame, var: str) -> pd.DataFrame:\n    # % missing for one variable by arm x calendar quarter of index_date.\n    q = df[\"index_date\"].dt.to_period(\"Q\").astype(str)\n    out = (df.assign(_miss=df[var].isna(), _qtr=q)\n             .groupby([\"arm\", \"_qtr\"])\n             .agg(n=(\"person_id\", \"size\"), n_missing=(\"_miss\", \"sum\")))\n    out[\"pct_missing\"] = (out[\"n_missing\"] / out[\"n\"]).round(3)\n    return out.reset_index()\n\ndef variable_summary(df: pd.DataFrame, vars_: list[str]) -> pd.DataFrame:\n    # Overall and by-arm missingness for every analysis variable (the headline table).\n    rows = []\n    for v in vars_:\n        overall = df[v].isna().mean()\n        by_arm = df.groupby(\"arm\")[v].apply(lambda s: s.isna().mean())\n        rows.append({\"variable\": v, \"pct_missing_overall\": round(overall, 3),\n                     **{f\"pct_missing_{a}\": round(p, 3) for a, p in by_arm.items()}})\n    return pd.DataFrame(rows)\n\ndef pattern_matrix(df: pd.DataFrame, vars_: list[str]) -> pd.DataFrame:\n    # Distinct observed(1)/missing(0) footprints across vars_ and how many records share each.\n    # A single dominant footprint => near-complete; staircase of footprints => monotone (dropout-like);\n    # many scattered footprints => arbitrary pattern -> chained-equations imputation required.\n    obs = df[vars_].notna().astype(int)\n    pat = (obs.groupby(vars_).size()\n              .reset_index(name=\"n_records\")\n              .sort_values(\"n_records\", ascending=False))\n    pat[\"n_missing_vars\"] = (len(vars_) - pat[vars_].sum(axis=1)).astype(int)\n    return pat\n\nvar_table = variable_summary(df, ANALYSIS_VARS)\noutcome_by_stratum = missingness_by_stratum(df, \"out_hba1c_12m\")\npatterns = pattern_matrix(df, ANALYSIS_VARS)",
        "description": "Builds a missing data pattern table from a one-row-per-patient analysis frame. Required input (post data-management):\n  df : person_id (unique), arm in {'STUDY','COMPARATOR'}, index_date (datetime), site_id, plan_type\n       ('FFS'/'MA_EQUIV'), plus the analysis variables (e.g. base_hba1c, base_egfr, bmi, out_hba1c_12m, ...).\n       Missing values must already be encoded as NaN, NOT as sentinel codes like 0, 999, or 'UNK'.\nProduces (1) per-variable missingness stratified by arm and calendar quarter, and (2) the arbitrary-vs-monotone\npattern matrix (distinct observed/missing footprints and their record counts) to drive the imputation model.",
        "dependencies": [
          "pandas",
          "numpy"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(dplyr)\nlibrary(naniar)\nlibrary(mice)\n\nanalysis_vars <- c(\"base_hba1c\", \"base_egfr\", \"bmi\", \"out_hba1c_12m\")\n\n# (1) Per-variable missingness, overall and stratified by arm x calendar quarter.\nvar_summary <- naniar::miss_var_summary(df %>% select(all_of(analysis_vars)))\n\nstratified <- df %>%\n  mutate(qtr = paste0(lubridate::year(index_date), \"Q\",\n                      lubridate::quarter(index_date))) %>%\n  group_by(arm, qtr) %>%\n  summarise(n = n(),\n            pct_missing_outcome = mean(is.na(out_hba1c_12m)),\n            .groups = \"drop\")\n\n# (2) Pattern matrix: rows = distinct observed/missing footprints, last col = count missing,\n#     row counts = records sharing the footprint. A staircase indicates monotone missingness.\npattern <- mice::md.pattern(df %>% select(all_of(analysis_vars)), plot = FALSE)",
        "description": "Missing data pattern table in R. Input `df` is one row per patient with: person_id, arm, index_date (Date),\nsite_id, plan_type, and the analysis variables, with missing encoded as NA (not sentinel codes).\nnaniar gives the per-variable summary; mice::md.pattern gives the canonical observed/missing pattern matrix\n(and its plot) that distinguishes monotone from arbitrary missingness.",
        "dependencies": [
          "dplyr",
          "naniar",
          "mice",
          "lubridate"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* (1) Per-variable missing counts and rates, stratified by arm. */\nproc means data=work.analytic n nmiss;\n  class arm;\n  var base_hba1c base_egfr bmi out_hba1c_12m;\nrun;\n\n/* Outcome missingness by arm x calendar quarter (the differential-missingness alarm). */\ndata _strat;\n  set work.analytic;\n  qtr = cats(year(index_date), 'Q', qtr(index_date));\n  miss_out = missing(out_hba1c_12m);\nrun;\nproc freq data=_strat;\n  tables arm*qtr*miss_out / nopercent norow nocol;\nrun;\n\n/* (2) Observed/missing footprint string -> frequency = the pattern matrix\n       (1 = observed, 0 = missing). Many distinct strings => arbitrary pattern. */\ndata _pat;\n  set work.analytic;\n  footprint = cats(^missing(base_hba1c), ^missing(base_egfr),\n                   ^missing(bmi), ^missing(out_hba1c_12m));\nrun;\nproc freq data=_pat order=freq;\n  tables footprint / out=pattern_counts;\nrun;\n\n/* Canonical SAS missing-data pattern report (no imputation performed). */\nproc mi data=work.analytic nimpute=0;\n  var base_hba1c base_egfr bmi out_hba1c_12m;\nrun;",
        "description": "Missing data pattern table in SAS. Input WORK.ANALYTIC is one row per patient with: person_id, arm, index_date,\nsite_id, plan_type, and numeric analysis variables with missing as standard SAS missing (.), not sentinel values.\nPROC MEANS NMISS gives per-variable counts; PROC FREQ on a constructed missingness-footprint string reproduces the\nmd.pattern matrix; PROC MI NIMPUTE=0 prints the canonical missing-data pattern report directly.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Tab[Missing data pattern table<br/>per-variable rates by arm x time x site + pattern matrix] --> Q1{Rates differ by<br/>arm / site / time?<br/>Little MCAR test}\n  Q1 -->|No, plausibly MCAR| CC[Complete-case admissible<br/>verify cell counts not just p-value]\n  Q1 -->|Yes, depends on OBSERVED data| MAR[MAR-justified handling]\n  MAR --> MI[Multiple imputation incl. arm, site,<br/>quarter, auxiliary attendance flag]\n  MAR --> IPCW[IPCW for informative<br/>loss to follow-up]\n  Q1 -->|Driven by the UNOBSERVED value<br/>e.g. lab skipped because patient worse| MNAR[MNAR]\n  MNAR --> PM[Pattern-mixture / delta-adjusted<br/>tipping-point sensitivity analysis]\n  MI --> Sens[Report mechanism judgment + handling in SAP]\n  IPCW --> Sens\n  PM --> Sens\n  CC --> Sens",
        "caption": "Decision logic from the pattern table to a handling strategy. The table is observable; the MCAR/MAR/MNAR mechanism is argued from it. A passing MCAR test never licenses complete-case on its own, and outcome-driven missingness pushes the analysis to MNAR sensitivity analysis that imputation alone cannot satisfy.",
        "alt_text": "Decision flowchart from a missing data pattern table through the MCAR/MAR/MNAR judgment to complete-case, multiple imputation, IPCW, or pattern-mixture sensitivity analysis.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  A[Per-variable missingness<br/>stratified by arm x quarter x site x plan] --> C[Mechanism judgment<br/>MCAR vs MAR vs MNAR]\n  B[Pattern matrix<br/>monotone vs arbitrary footprints<br/>+ co-clustered missingness] --> C\n  C --> D[Imputation-model spec<br/>or MNAR sensitivity plan]\n  A -.claims: stratify FFS vs MA.-> A\n  B -.linked: linkage denominator first row.-> B",
        "caption": "The two components of the pattern table (stratified per-variable rates and the footprint pattern matrix) feed a single mechanism judgment that determines the analysis strategy and, under MAR, the auxiliary predictors the imputation model must include.",
        "alt_text": "Data-flow diagram showing per-variable stratified missingness and the pattern matrix feeding a mechanism judgment that drives the imputation model or MNAR sensitivity plan.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "see_also",
        "target_slug": "descriptive-epidemiology-rwe",
        "notes": "The pattern table is a descriptive-epidemiology diagnostic computed before any modeling decision."
      },
      {
        "relation_type": "produces",
        "target_slug": "multiple-imputation-longitudinal-rwe",
        "notes": "The pattern and the observed drivers it surfaces determine whether MI is appropriate and which auxiliary variables the imputation model must include."
      },
      {
        "relation_type": "see_also",
        "target_slug": "complete-case-analysis-rwe",
        "notes": "Complete-case is valid only under MCAR; the pattern table is what lets you accept or reject that assumption."
      },
      {
        "relation_type": "used_with",
        "target_slug": "inverse-probability-of-censoring-weighting-rwe",
        "notes": "When the table shows informative loss to follow-up, IPCW is preferred over naive complete-case for time-to-event analyses."
      },
      {
        "relation_type": "affects",
        "target_slug": "propensity-score-methods-psm-iptw",
        "notes": "Differential missingness in confounders distorts the propensity score; the pattern table must cover every PS covariate before model fitting."
      }
    ],
    "aliases": [
      "missing data pattern table",
      "missingness pattern table",
      "missing data summary table",
      "missing value pattern analysis"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "missing-data-trimming-winsorization-rwe",
    "name": "Missing Data, Trimming, and Winsorization in RWE",
    "short_definition": "Practical handling of missing data (under MCAR/MAR/MNAR mechanisms) and of extreme propensity-score weights or outcome/cost outliers via trimming (excluding observations outside a propensity or weight threshold) and Winsorization (capping values at chosen percentiles), each of which changes the estimand and the target population it identifies.",
    "long_description": "This entry covers two operations that practitioners routinely apply *after* covariate adjustment but that quietly alter what\nis being estimated: handling **missing data** and taming **extreme inverse-probability weights or outlier values** through\n**trimming** and **Winsorization**. They are bundled because they share one deep property — both are responses to regions of\nthe data where the estimand is weakly or non-identified (limited overlap, structural absence of records), and both trade\nvariance for a shift in the target population if applied carelessly.\n\n**Core conceptual distinction.** Three mechanisms govern *missingness*, and they cannot be fully distinguished from observed data alone\n(MCAR has testable implications; MAR versus MNAR does not).\n*MCAR* (missing completely at random): missingness is independent of observed and unobserved values — essentially never true\nin claims or EHR. *MAR* (missing at random): missingness depends only on *observed* data, so it can be corrected by\nmultiple imputation or inverse-probability-of-observation weighting conditional on the observed covariates. *MNAR* (missing\nnot at random): missingness depends on the *unobserved* value itself — the dominant reality in RWE (a lab is missing because\nthe clinician judged it unnecessary; future claims are absent because the patient died or disenrolled). For *trimming and\nWinsorization*, the distinction is between **trimming** (removing units whose propensity score or weight falls outside a\nthreshold, i.e., dropping the unit entirely) and **Winsorization/truncation** (retaining the unit but replacing its extreme\nweight or value with a capped value). Trimming targets *positivity/overlap*; it deliberately redefines the population to the\nregion of common support, so the estimand becomes a *trimmed ATE/ATT* on that subpopulation, not the original-cohort ATE.\nWinsorization keeps every unit but biases the weighted estimator toward the bulk, reducing variance at the cost of residual\nconfounding from the down-weighted tails. 
The decisive point regulators raise: trimming and Winsorization change the\n*estimand*, not merely the estimator, and that change must be stated in the protocol/SAP, not buried in a footnote.\n\n**Pros, cons, and trade-offs** (specific and comparative).\n- **Trimming/Winsorization of weights vs. doing nothing (raw IPTW):** Raw IPTW with a handful of weights >50 is\n  high-variance, non-robust, and effectively a practical positivity violation — one patient can swing the ATE. Trimming/Winsorization stabilizes\n  the estimate dramatically. Cost: bias relative to the full-cohort estimand and sensitivity to the (arbitrary) rule. **Prefer**\n  when a small tail dominates the effective sample size and the question tolerates a restricted population.\n- **Fixed-percentile PS trimming (Stürmer) vs. data-driven Crump trimming:** Stürmer's asymmetric fixed-percentile trimming of the PS tails\n  (lower cut point from the treated-arm PS distribution, upper cut point from the comparator-arm PS distribution) is\n  transparent and easy to pre-specify but arbitrary; Crump's optimal-overlap rule chooses the threshold to minimize asymptotic\n  variance subject to retaining support, which is more principled but less interpretable and harder to communicate to a\n  non-statistician reviewer. **Prefer** Crump (or overlap weights) when positivity is the core threat and you can defend the\n  machinery; **prefer** fixed-percentile trimming for routine pharmacoepi with mild non-overlap.\n- **Trimming/Winsorization vs. overlap (ATO) weights:** Overlap weights (Li, Morgan, Zaslavsky) achieve exact mean balance\n  (when the PS is estimated by logistic regression) and\n  bounded weights *by construction*, sidestepping the need for an ad hoc cap — but they change the estimand to the *overlap\n  population* (patients in clinical equipoise). **Prefer** overlap weights when the equipoise population is itself the\n  policy-relevant target; **prefer** explicit trimming when stakeholders require the ATE/ATT on a nameable population.\n- **Winsorization of cost outliers vs. 
two-part / GLM gamma models:** Capping a $2M CAR-T claim at the 99th percentile is\n  crude; a gamma or two-part model accommodates the skew without discarding magnitude. **Prefer** the model when the tail is\n  *real signal* (oncology HCRU); **prefer** Winsorization only as a transparent, pre-specified sensitivity layer.\n- **Complete-case vs. multiple imputation for missing covariates:** Complete-case analysis is guaranteed unbiased only under MCAR (and,\n  in outcome regression, when missingness does not depend on the outcome given the modelled covariates) and wastes data; MI under MAR is principled but invalid if the\n  mechanism is MNAR. Neither recovers structurally absent post-death/post-disenrollment data — that requires re-defining the\n  estimand (e.g., a composite or a while-alive contrast).\n\n**When to use** (decision rules). Trimming/Winsorization of weights: whenever IPTW or PS-weighted analyses produce extreme weights (max weight\n≫ 20, or a few units holding a large share of total weight / collapsing the effective sample size). Outcome/cost Winsorization:\nheavy-tailed HCRU or cost endpoints where one or two catastrophic claims dominate the mean. Missing-data methods: any analysis\nwith incomplete baseline covariates (MI under a defensible MAR model), informative observation (IPCW /\ninverse-probability-of-observation weighting), or partially missing outcomes. 
The governing rule: pre-specify the rule *and* at least two alternatives,\nand report what fraction of units and person-time each rule removes or caps.\n\n**When NOT to use — and when it is actively misleading or dangerous** (decision rules).\n- **Trimming to \"fix\" a non-overlap problem that is really confounding-by-indication.** If one drug is reserved for renally-\n  impaired patients, the separated PS distribution is a *design* failure; trimming the non-overlapping tail silently changes\n  the population to one where the comparison may no longer be the question of interest, and reviewers will read the residual\n  estimand as the same one promised in the objectives. Diagnose with the PS-overlap plot and clinical review first.\n- **Winsorizing real outcome signal.** In oncology cost studies, the high-cost tail (CAR-T, prolonged ICU) *is* the\n  phenomenon; capping it can reverse a cost-effectiveness conclusion. Capping outcomes is dangerous whenever the tail is the\n  decision-driver, not noise.\n- **Imputing structurally missing data.** Standard MI on observed covariates cannot recover information that is missing\n  *because the patient died or disenrolled* — that is MNAR by construction. Imputing post-death cost as if it were MAR\n  fabricates person-time and biases PPPM/PMPM. Handle with a while-alive estimand, competing-risks framing, or explicit\n  MNAR sensitivity (pattern-mixture / delta-adjustment), never naive MI.\n- **Imputing time-varying confounders without respecting time order.** Imputing a confounder using post-treatment values can\n  open a collider path and induce bias worse than the missingness it cured.\n- **Reporting a single trimmed result with no sensitivity layer.** A point estimate that flips under 1% vs. 5% Winsorization\n  is not a finding; it is an artifact. Presenting only the favorable cut is the classic HTA/regulatory red flag.\n\n**Data-source operational depth** (claims vs. EHR vs. registry vs. linked).\n- **Claims (FFS vs. 
MA):** \"No claim observed\" conflates true absence with unobserved care. **MA-only person-time lacks the\n  fee-for-service encounter and pharmacy claims** that FFS Parts A/B/D generate, so a patient who looks event-free or\n  treatment-naïve may simply be invisible — restrict to fully-observable (A/B/D or commercial medical+pharmacy) person-time\n  rather than imputing the gap. Disenrollment and death are MNAR by design: future records are structurally absent, so\n  censor at the enrollment-span end and never carry costs/outcomes past it. **Differential competing risks by exposure in\n  elderly claims** (a frailer arm dies sooner, producing systematically shorter, more \"complete-looking\" follow-up) interact\n  with both weight tails and missingness — the frail tail often carries the largest weights *and* the heaviest truncated cost.\n- **EHR:** Visit-driven capture makes completeness a function of engagement — sicker/more-engaged patients have richer\n  records, so missing labs are typically MNAR. Use missingness indicators or model the visit/observation process (IPCW)\n  rather than assuming MAR. **Immortal time in procedure studies** can masquerade as missingness: a window with no records\n  before a procedure may reflect care delivered out-of-system, not absence of disease — confirm with linkage before treating\n  it as a clean lookback.\n- **Registry:** Lower missingness on adjudicated key endpoints, but utilization, costs, and PROs are often incomplete; use\n  the registry as a validation substrate for claims-based imputation or trimming rules rather than as the costing source.\n- **Linked claims–EHR–vital records:** The ideal substrate — EHR severity to enrich the PS (reducing extreme weights at the\n  source) plus claims completeness and a death index to terminate person-time correctly — but linkage selects the linkable\n  subset and creates date discrepancies (order vs. fill vs. 
service) that must be reconciled before any windowing.\n\n**Worked claims example.** Question: 12-month all-cause cost (PPPM) of an injectable specialty drug vs. an active comparator\namong adults with continuous commercial + Medicare FFS A/B/D enrollment. (1) Cohort: new initiators, 365-day continuous,\nFFS-observable enrollment lookback (exclude MA-only person-time so \"no prior fill\" is real, not missing). (2) Confounding:\nhigh-dimensional PS, then stabilized IPTW. Inspect the weight distribution: suppose `max(weight)=84` and the top 0.5% of\nunits hold 22% of total weight (effective sample size collapsed from 9,800 to ~4,100). (3) Pre-specified primary rule:\nsymmetric PS trimming at the 1st/99th percentiles of the treated-arm PS (Stürmer), recomputing weights on the trimmed\ncohort; report that 1.9% of units and 2.4% of person-time are dropped, and that the estimand is now the trimmed-population\neffect (an ATE under these stabilized weights), not the full-cohort one. (4) Outcome handling: per-member-per-month cost is right-censored at death and disenrollment (no post-exit imputation;\nMNAR by design); the cost distribution is Winsorized at the 99th percentile as a *sensitivity* layer only, with the gamma\ntwo-part model as primary. (5) Sensitivity grid: {no trim, 1/99 trim, Crump optimal, overlap weights} × {no Winsor, 99th-pct\nWinsor} reported side by side, with ESS and the share of trimmed person-time for each cell. The conclusion is reported\nas robust only if the sign and rough magnitude of the cost difference survive the grid; cells that diverge are flagged as\nthe locus of residual positivity/MNAR risk rather than averaged away.\n\n**Interpreting the output**\n\nConsider the six-patient IPTW cost comparison in the worked example: the raw weighted cost difference between treated and\ncomparator arms is approximately $32,740. After trimming (dropping the two patients whose stabilized weight of 10.0\nexceeds the 5.0 threshold), the difference is approximately $10,000. 
Winsorizing\nthe weights (capping them at 5.0 while keeping all six patients) yields a difference of approximately $29,640.\n\n*(1) Formal statistical interpretation.* Trimming and weight Winsorization both change what is effectively estimated, not just\nthe estimator. Trimming removes the highest-weight units from the cohort entirely; the resulting estimate\napplies to a trimmed target population (here, the four patients with weight at or below 5.0; in a full-size study,\nthe patients whose PS falls inside the pre-specified percentile range), not the full cohort. Weight Winsorization caps extreme weights rather than removing\npatients; the nominal target remains the full cohort, but the capped weights no longer equal the\ninverse-probability weights, so covariate balance is only approximate and some bias is accepted in exchange for lower variance. Neither approach is bias-free: both trade fidelity to the original estimand for robustness to extreme\nweights. The wide divergence between the trimmed ($10,000) and Winsorized ($29,640) estimates\nsignals that two high-weight patients (P03 and P06) drive a large share of the raw difference\n($32,740), and that the conclusion is sensitive to how those observations are handled.\n\n*(2) Practical interpretation for a decision-maker.* No single number should be read as the cost difference; the\nsensitivity grid shows the range of answers the data support under defensible rules. The primary estimate from the pre-specified\nmethod (here, PS trimming) should be the headline, with the Winsorized and untrimmed estimates reported\nas sensitivity. Divergence across cells flags positivity and influence concerns that warrant further\ninvestigation before a formulary or coverage decision.",
    "primary_category": "Causal_Inference_Method",
    "tags": [
      "missing-data",
      "MCAR",
      "MAR",
      "MNAR",
      "weight-trimming",
      "winsorization",
      "positivity",
      "overlap",
      "propensity-score",
      "iptw",
      "cost-outliers",
      "sensitivity-analysis",
      "claims",
      "ehr"
    ],
    "applies_to_study_types": [
      "new_user",
      "active_comparator_new_user",
      "cohort_retrospective",
      "claims_analysis",
      "ehr_study",
      "target_trial_emulation"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "linked",
      "registry"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1093/aje/kwq198",
        "url": "https://doi.org/10.1093/aje/kwq198",
        "citation_text": "Stürmer T, Rothman KJ, Avorn J, Glynn RJ. Treatment effects in the presence of unmeasured confounding: dealing with observations in the tails of the propensity score distribution--a simulation study. American Journal of Epidemiology. 2010;172(7):843-854.",
        "year": 2010,
        "authors_short": "Stürmer et al.",
        "notes": "The canonical methodological treatment of trimming the tails of the propensity score distribution, showing how symmetric trimming changes both bias (often reducing residual confounding) and the population to which the estimate refers."
      },
      {
        "role": "explain",
        "doi": "10.1093/biomet/asn055",
        "url": "https://doi.org/10.1093/biomet/asn055",
        "citation_text": "Crump RK, Hotz VJ, Imbens GW, Mitnik OA. Dealing with limited overlap in estimation of average treatment effects. Biometrika. 2009;96(1):187-199.",
        "year": 2009,
        "authors_short": "Crump et al.",
        "notes": "Derives the optimal data-driven trimming threshold that minimizes asymptotic variance under limited overlap; the theoretical basis for principled (rather than ad hoc) trimming rules used across modern RWE."
      },
      {
        "role": "demonstrate",
        "doi": "10.1002/sim.6607",
        "url": "https://doi.org/10.1002/sim.6607",
        "citation_text": "Austin PC, Stuart EA. Moving towards best practice when using inverse probability of treatment weighting (IPTW) using the propensity score to estimate causal treatment effects in observational studies. Statistics in Medicine. 2015;34(28):3661-3679.",
        "year": 2015,
        "authors_short": "Austin & Stuart",
        "notes": "Best-practice guidance on stabilized weights, weight trimming/truncation, diagnostics (weight distribution, effective sample size), and the variance-bias trade-off — the practical reference applied directly in claims/EHR IPTW analyses."
      },
      {
        "role": "use",
        "doi": "10.1136/bmj.b2393",
        "url": "https://doi.org/10.1136/bmj.b2393",
        "citation_text": "Sterne JAC, White IR, Carlin JB, et al. Multiple imputation for missing data in epidemiological and clinical research: potential and pitfalls. BMJ. 2009;338:b2393.",
        "year": 2009,
        "authors_short": "Sterne et al.",
        "notes": "Practical guidance on when multiple imputation (under MAR) is and is not appropriate, including the dangers of imputing under MNAR and of including post-outcome variables in the imputation model — directly relevant to RWE missingness."
      }
    ],
    "plain_language_summary": "When researchers weight a study to balance who got each treatment, one or two patients can accidentally end up with enormous weights — meaning that single person drives almost the entire result. Trimming drops those extreme patients from the analysis entirely (changing who the answer applies to), while winsorizing caps their weight at a chosen ceiling (keeping them in but limiting their pull). Both choices stabilize the estimate, but both also change what the study is actually measuring, so the rule must be written into the analysis plan before anyone looks at the data.",
    "key_terms": [
      {
        "term": "stabilized inverse-probability weight",
        "definition": "A number assigned to each patient that makes the study group look like the overall population — patients who were unlikely to receive their treatment get a large weight; patients whose treatment was expected get a small weight."
      },
      {
        "term": "propensity score",
        "definition": "The estimated probability that a patient received the treatment of interest, calculated from their baseline characteristics."
      },
      {
        "term": "trimming",
        "definition": "Removing patients whose weight (or propensity score) is so extreme that they fall outside a chosen boundary, such as the top 5 percent of all weights."
      },
      {
        "term": "winsorization",
        "definition": "Capping every weight that exceeds a chosen ceiling at that ceiling value instead of dropping the patient entirely."
      },
      {
        "term": "estimand",
        "definition": "The specific question a study is designed to answer — for example, the average cost difference for all patients in the cohort versus only the patients in a region of good overlap between groups."
      }
    ],
    "worked_example": {
      "scenario": "A pharmacoepidemiology team is estimating the average difference in one-year total cost between patients on a specialty drug (treated, T=1) and an active comparator (T=0). They fit a logistic model to get a propensity score for each patient, then compute stabilized inverse-probability weights. The marginal probability of being treated is 3/6 = 0.50, so the stabilized weight for each treated patient is 0.50 divided by their propensity score, and for each comparator patient is 0.50 divided by (1 minus their propensity score). Inspecting the weights reveals two patients — one in each arm — with a propensity score near the boundary of their group, producing a weight of 10.0. Those two patients together hold most of the total weight in the analysis. The team pre-specified a rule: compare trimming (drop patients with weight above 5.0) versus winsorizing (cap weights above 5.0 at 5.0), and report both alongside the raw result.",
      "dataset": {
        "caption": "Analysis-ready table: one row per patient with propensity score, stabilized weight, and one-year total cost in thousands of dollars.",
        "columns": [
          "patient_id",
          "arm",
          "propensity_score",
          "stabilized_weight",
          "total_cost_k"
        ],
        "rows": [
          [
            "P01",
            "treated",
            0.8,
            0.625,
            18
          ],
          [
            "P02",
            "treated",
            0.4,
            1.25,
            22
          ],
          [
            "P03",
            "treated",
            0.05,
            10.0,
            45
          ],
          [
            "P04",
            "comparator",
            0.2,
            0.625,
            12
          ],
          [
            "P05",
            "comparator",
            0.6,
            1.25,
            10
          ],
          [
            "P06",
            "comparator",
            0.95,
            10.0,
            8
          ]
        ]
      },
      "steps": [
        "Compute each stabilized weight: for treated patients, divide 0.50 by the propensity score; for comparator patients, divide 0.50 by (1 minus the propensity score). P03 has propensity score 0.05, meaning the model thought it very unlikely this patient would be treated — yet they were, so the weight is 0.50 / 0.05 = 10.0. P06 has propensity score 0.95, meaning the model expected them to be treated but they were not, so 0.50 / (1 - 0.95) = 10.0.",
        "RAW weighted mean for the treated arm: multiply each patient's cost by their weight, sum across treated patients, then divide by the sum of treated weights. Numerator = (0.625 x 18) + (1.25 x 22) + (10.0 x 45) = 11.25 + 27.50 + 450.00 = 488.75. Denominator = 0.625 + 1.25 + 10.0 = 11.875. Weighted mean treated = 488.75 / 11.875 = 41.16 ($41,160).",
        "RAW weighted mean for the comparator arm: numerator = (0.625 x 12) + (1.25 x 10) + (10.0 x 8) = 7.50 + 12.50 + 80.00 = 100.00. Denominator = 0.625 + 1.25 + 10.0 = 11.875. Weighted mean comparator = 100.00 / 11.875 = 8.42 ($8,420). Raw weighted cost difference = 41.16 - 8.42 = 32.74 ($32,740). Notice that P03 and P06 each contribute roughly 84 percent of their arm's total weight, so P03's cost of $45k is essentially driving the treated-arm estimate on its own.",
        "TRIMMING: drop P03 and P06 (both have weight 10.0, which exceeds the 5.0 threshold). The marginal treatment probability in the retained four patients is still 2/4 = 0.50, so stabilized weights are recomputed identically: P01 = 0.625, P02 = 1.25, P04 = 0.625, P05 = 1.25. Weighted mean treated: numerator = (0.625 x 18) + (1.25 x 22) = 11.25 + 27.50 = 38.75; denominator = 0.625 + 1.25 = 1.875; result = 38.75 / 1.875 ≈ 20.67 ($20,670). Weighted mean comparator: numerator = (0.625 x 12) + (1.25 x 10) = 7.50 + 12.50 = 20.00; denominator = 1.875; result = 20.00 / 1.875 ≈ 10.67 ($10,670). Trimmed weighted cost difference = 20.67 - 10.67 = 10.00 ($10,000).",
        "WINSORIZING: keep all six patients but cap any weight above 5.0 at exactly 5.0. P03 weight becomes 5.0 (was 10.0); P06 weight becomes 5.0 (was 10.0). Weighted mean treated: numerator = (0.625 x 18) + (1.25 x 22) + (5.0 x 45) = 11.25 + 27.50 + 225.00 = 263.75; denominator = 0.625 + 1.25 + 5.0 = 6.875; result = 263.75 / 6.875 ≈ 38.36 ($38,360). Weighted mean comparator: numerator = (0.625 x 12) + (1.25 x 10) + (5.0 x 8) = 7.50 + 12.50 + 40.00 = 60.00; denominator = 6.875; result = 60.00 / 6.875 ≈ 8.73 ($8,730). Winsorized weighted cost difference = 38.36 - 8.73 = 29.64 ($29,640)."
      ],
      "result": "Raw (no adjustment): $32,740 — dominated by P03, a single treated patient whose low propensity score (0.05) gave them a weight of 10.0. Trimmed (drop weights above 5.0, cap at 1st/99th in practice): $10,000, on the four-patient overlap cohort only — a population defined by having a propensity score that plausibly appears in both arms. Winsorized (cap weights at 5.0): $29,640, retaining all six patients but limiting any single patient's influence. The three estimates span a $22,740 range, illustrating why the rule must be pre-specified: the analyst who looks at the numbers first can always choose the most favorable one. The team reports all three as the sensitivity grid required by their SAP, with the trimmed result as primary because it targets a clearly definable overlap population and the two extreme patients were clinically implausible comparisons."
    },
    "prerequisites": [
      "propensity-score-methods-psm-iptw",
      "estimands-ate-att-intercurrent-events-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Symmetric propensity-score trimming (Stürmer-style)",
        "description": "Drop units whose propensity score falls outside fixed percentiles of the treated- (or pooled-) arm PS distribution (e.g., 1st/99th or 2.5th/97.5th), then re-estimate weights on the retained cohort.",
        "edge_cases": [
          "Excludes the very patients with the most extreme treatment propensity, who are often the frailest or highest-cost; the resulting estimate is a trimmed-population ATT/ATE, not the original-cohort estimand.",
          "Recomputing weights after trimming is required; reusing pre-trim weights leaves the estimator inconsistent."
        ],
        "data_source_notes": "claims: report the count and clinical profile of trimmed patients (typically the most comorbid or highest-utilization), since their removal is what shifts the population."
      },
      {
        "name": "Data-driven (Crump-style) optimal-overlap trimming",
        "description": "Choose the trimming threshold to minimize the asymptotic variance of the treatment-effect estimator subject to retaining adequate overlap, rather than fixing a percentile a priori.",
        "edge_cases": [
          "More defensible for positivity but computationally heavier and harder to communicate; the retained population is data-determined and can vary across databases in a multi-database study."
        ],
        "data_source_notes": "Implemented via overlap-aware tooling (R PSweight/WeightIt; manual in Python); document the realized threshold and retained fraction for reproducibility across data partners."
      },
      {
        "name": "Percentile Winsorization / truncation of weights or outcomes",
        "description": "Cap weights (or cost/utilization outcomes) at chosen percentiles of the empirical distribution (e.g., 1/99, 5/95), retaining every unit but bounding its influence.",
        "edge_cases": [
          "In heavy-tailed cost data, 99th-percentile capping may still leave influential points while 95th may erase real heterogeneity; the chosen level can move economic conclusions materially.",
          "Winsorizing a genuinely informative tail (oncology high-cost claims) suppresses signal rather than noise."
        ],
        "data_source_notes": "HCRU/cost: report the actual capped values, the percent of observations affected, and pair with a distributional (gamma/two-part) model as the non-capped comparator."
      },
      {
        "name": "Missing-data strategies (complete-case, MI, IPCW, MNAR sensitivity)",
        "description": "Complete-case (valid only under MCAR/limited MAR), multiple imputation under MAR with a time-ordered imputation model, inverse-probability-of-observation/censoring weighting, and MNAR sensitivity (pattern-mixture, selection models, delta-adjustment).",
        "edge_cases": [
          "Death/disenrollment creates structurally absent future data (MNAR); standard MI on observed covariates cannot recover it and will fabricate person-time if applied to post-exit periods.",
          "Imputing time-varying confounders out of temporal order can open collider paths and worsen bias."
        ],
        "data_source_notes": "claims: treat enrollment gaps as explicit unobservable periods, not zeros; report missingness patterns (% with complete follow-up, monotone vs. intermittent). EHR: model the visit/observation process or use missingness indicators."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Raw IPTW with no trimming or Winsorization",
        "pros_of_this": "Dramatically reduces variance and the influence of a few extreme weights/outliers; restores a usable effective sample size and respects positivity.",
        "cons_of_this": "Introduces bias relative to the original target population/estimand and is sensitive to the (often arbitrary) rule chosen.",
        "when_to_prefer": "When a small tail dominates total weight or the outcome mean and the scientific question tolerates a restricted, explicitly-named population."
      },
      {
        "compared_to": "Overlap (ATO) weights",
        "pros_of_this": "Transparent and yields an estimate on a nameable population (ATE/ATT on the retained cohort) that stakeholders can interpret directly.",
        "cons_of_this": "Overlap weights achieve bounded weights and exact balance by construction without an ad hoc cap, avoiding the arbitrariness of a trimming threshold.",
        "when_to_prefer": "Prefer explicit trimming when the policy-relevant population is the full ATE/ATT cohort; prefer overlap weights when the clinical-equipoise population is itself the target."
      },
      {
        "compared_to": "Multiple imputation / IPCW for missing data",
        "pros_of_this": "Trimming/Winsorization is simple, transparent, and aimed at weight/outcome stability and positivity.",
        "cons_of_this": "MI and IPCW are the principled tools when the primary threat is bias from missing covariate/outcome data under MAR; trimming does nothing to recover missing information and is irrelevant to MNAR structural missingness.",
        "when_to_prefer": "Trimming/Winsorization for extreme-weight/outlier stability; MI or IPCW when incomplete covariates or informative observation, not extreme weights, are the dominant threat."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Estimate the PS and weights on FFS-observable person-time only (exclude MA-only spans where claims are missing). Inspect the weight distribution and effective sample size before applying any rule; pre-specify the trim/Winsor rule and at least one alternative. Censor outcomes at death/disenrollment (MNAR structural absence) rather than imputing post-exit periods. Report ESS before/after, the percent of units and person-time trimmed, and missingness patterns (e.g., % with <6 months observable post-index).",
      "ehr": "Missing labs/notes are typically MNAR (engagement-driven capture). Use missingness indicators or model the visit process (IPCW) rather than assuming MAR; Winsorize cost/utilization outliers only after linkage to claims for completeness.",
      "registry": "Lower missingness on adjudicated endpoints but incomplete utilization/PROs; use as a validation set for claims-based imputation or trimming rules rather than as the primary costing source.",
      "linked": "EHR severity enriches the PS and reduces extreme weights at the source; claims add completeness and the death index terminates person-time correctly. Reconcile order/fill/service date discrepancies and account for linkage selection before windowing or trimming."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\nimport pandas as pd\n\ndef stabilized_iptw(df: pd.DataFrame) -> pd.Series:\n    \"\"\"Stabilized inverse-probability-of-treatment weights (Austin & Stuart 2015).\"\"\"\n    p_treat = df[\"treat\"].mean()\n    w = np.where(df[\"treat\"] == 1, p_treat / df[\"ps\"], (1 - p_treat) / (1 - df[\"ps\"]))\n    return pd.Series(w, index=df.index)\n\ndef effective_sample_size(w: pd.Series) -> float:\n    \"\"\"Kish ESS: collapses as a few weights dominate.\"\"\"\n    return (w.sum() ** 2) / (w ** 2).sum()\n\ndef symmetric_ps_trim(df: pd.DataFrame, lo: float = 0.01, hi: float = 0.99) -> pd.DataFrame:\n    \"\"\"Stürmer-style trim on percentiles of the TREATED-arm PS, then keep the overlap region.\"\"\"\n    ps_t = df.loc[df[\"treat\"] == 1, \"ps\"]\n    lo_cut, hi_cut = ps_t.quantile(lo), ps_t.quantile(hi)\n    return df[(df[\"ps\"] >= lo_cut) & (df[\"ps\"] <= hi_cut)].copy()\n\ndef winsorize(s: pd.Series, lo: float = 0.01, hi: float = 0.99) -> pd.Series:\n    return s.clip(lower=s.quantile(lo), upper=s.quantile(hi))\n\ndef weighted_mean_diff(df: pd.DataFrame, w: pd.Series) -> float:\n    \"\"\"Weighted ATE-style contrast of y between arms (e.g., cost difference).\"\"\"\n    t = df[\"treat\"] == 1\n    m1 = np.average(df.loc[t, \"y\"], weights=w[t])\n    m0 = np.average(df.loc[~t, \"y\"], weights=w[~t])\n    return m1 - m0\n\n# --- Sensitivity grid: no-trim vs symmetric trim, crossed with no-Winsor vs 99th-pct Winsor ---\nrows = []\nfor trim_name, d in [(\"no_trim\", df), (\"trim_1_99\", symmetric_ps_trim(df))]:\n    w = stabilized_iptw(d)                       # recompute weights AFTER trimming\n    for win_name, yy in [(\"no_winsor\", d[\"y\"]), (\"winsor_99\", winsorize(d[\"y\"]))]:\n        dd = d.assign(y=yy.values)\n        rows.append({\n            \"trim\": trim_name, \"winsor\": win_name,\n            \"n\": len(dd), \"ess\": round(effective_sample_size(w), 1),\n            \"pct_trimmed\": round(100 * (1 - len(dd) / len(df)), 
2),\n            \"max_weight\": round(float(w.max()), 1),\n            \"effect\": round(weighted_mean_diff(dd, w), 2),\n        })\nprint(pd.DataFrame(rows))",
        "description": "Operationalize stabilized IPTW with diagnostics, then apply (a) symmetric PS trimming and (b) percentile Winsorization,\nreporting effective sample size for each rule. Required input (one row per person, already cleaned):\n  df : person_id, treat (1/0 study vs comparator), ps (estimated propensity, 0<ps<1), y (outcome, e.g. annualized cost)\nPS and y must be measured only in the protocol-defined baseline/follow-up windows upstream; this snippet operates on the\nanalysis-ready table and is a sensitivity engine, not an estimation-of-truth shortcut.",
        "dependencies": [
          "pandas",
          "numpy"
        ],
        "source_citations": [
          "sturmer-2010"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\n\nstabilized_iptw <- function(d) {\n  p <- mean(d$treat)\n  ifelse(d$treat == 1L, p / d$ps, (1 - p) / (1 - d$ps))\n}\ness <- function(w) sum(w)^2 / sum(w^2)                 # Kish effective sample size\n\nsymmetric_ps_trim <- function(d, lo = 0.01, hi = 0.99) {\n  q <- quantile(d$ps[d$treat == 1L], probs = c(lo, hi))\n  d[d$ps >= q[1] & d$ps <= q[2]]\n}\nwinsorize <- function(x, lo = 0.01, hi = 0.99) {\n  q <- quantile(x, probs = c(lo, hi)); pmin(pmax(x, q[1]), q[2])\n}\nwmean_diff <- function(d, w) {\n  t <- d$treat == 1L\n  weighted.mean(d$y[t], w[t]) - weighted.mean(d$y[!t], w[!t])\n}\n\nbuild_grid <- function(df) {\n  setDT(df)\n  out <- list()\n  for (tn in c(\"no_trim\", \"trim_1_99\")) {\n    d <- if (tn == \"no_trim\") copy(df) else symmetric_ps_trim(df)\n    w <- stabilized_iptw(d)                            # recompute AFTER trimming\n    for (wn in c(\"no_winsor\", \"winsor_99\")) {\n      yy <- if (wn == \"no_winsor\") d$y else winsorize(d$y)\n      dd <- copy(d)[, y := yy]\n      out[[paste(tn, wn)]] <- data.table(\n        trim = tn, winsor = wn, n = nrow(dd),\n        ess = round(ess(w), 1),\n        pct_trimmed = round(100 * (1 - nrow(dd) / nrow(df)), 2),\n        max_weight = round(max(w), 1),\n        effect = round(wmean_diff(dd, w), 2)\n      )\n    }\n  }\n  rbindlist(out)\n}\nprint(build_grid(df))",
        "description": "Same sensitivity engine in R: stabilized IPTW with effective sample size, symmetric PS trimming (recompute weights after\ntrimming), and percentile Winsorization, reported as a grid. Input analysis-ready data frame:\n  df : person_id, treat (1/0), ps (0<ps<1), y (outcome)",
        "dependencies": [
          "data.table"
        ],
        "source_citations": [
          "sturmer-2010"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* 1. Propensity score from baseline covariates (logit link). */\nproc logistic data=work.analytic descending noprint;\n  class <categorical covariates>;\n  model treat = <baseline covariates>;\n  output out=ps_out pred=ps;            /* ps = P(treat=1 | covariates) */\nrun;\n\n/* 2. Stabilized IPTW (Austin & Stuart 2015). */\nproc sql noprint;\n  select mean(treat) into :ptreat trimmed from work.analytic;\nquit;\ndata wts;\n  set ps_out;\n  if treat=1 then sw = &ptreat / ps;\n  else            sw = (1-&ptreat) / (1-ps);\nrun;\n\n/* 3. Symmetric PS trim on the 1st/99th percentiles of the TREATED-arm PS. */\nproc univariate data=wts(where=(treat=1)) noprint;\n  var ps;\n  output out=cuts pctlpts=1 99 pctlpre=ps_;\nrun;\ndata trimmed;\n  if _n_=1 then set cuts;          /* ps_1, ps_99 */\n  set wts;\n  if ps_1 <= ps <= ps_99;          /* keep overlap region; weights recomputed below */\nrun;\n\n/* 4. Recompute stabilized weights on the trimmed cohort. */\nproc sql noprint;\n  select mean(treat) into :ptrim trimmed from trimmed;\nquit;\ndata trimmed;\n  set trimmed;\n  if treat=1 then sw = &ptrim / ps; else sw = (1-&ptrim) / (1-ps);\nrun;\n\n/* 5. Winsorize the outcome at the 99th percentile (sensitivity layer). */\nproc univariate data=trimmed noprint;\n  var y; output out=ycut pctlpts=99 pctlpre=y_;\nrun;\ndata trimmed_w;\n  if _n_=1 then set ycut;          /* y_99 */\n  set trimmed;\n  y_win = min(y, y_99);\nrun;\n\n/* 6. Weighted outcome model (gamma-log GLM for skewed cost; swap dist=/link= as needed). */\nproc genmod data=trimmed_w;\n  class person_id;\n  weight sw;\n  model y_win = treat / dist=gamma link=log;\n  repeated subject=person_id / type=ind;     /* robust SEs for the weighted estimator */\n  estimate 'treat vs comparator (log scale)' treat 1 / exp;\nrun;",
        "description": "SAS implementation: estimate the PS with PROC PSMATCH (or PROC LOGISTIC), build stabilized IPTW, apply symmetric PS\ntrimming and percentile Winsorization, and fit the weighted outcome model with PROC GENMOD. Required input\nwork.analytic (one row per person, post data-management):\n  person_id, treat (1/0), <baseline covariates>, y (outcome, e.g. annualized cost)\nPROC PSMATCH requires SAS/STAT 14.2+; recompute weights on the trimmed set before fitting the outcome model.",
        "dependencies": [],
        "source_citations": [
          "sturmer-2010"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  W[Estimated PS + stabilized IPTW] --> D{Inspect weights:<br/>max weight, ESS, tail share}\n  D -->|stable, ESS adequate| KEEP[Use as-is; report distribution]\n  D -->|few extreme weights| POS{Source of the tail?}\n  POS -->|limited overlap / positivity| TRIM[Trim PS tails Sturmer/Crump<br/>OR switch to overlap weights<br/>RECOMPUTE weights]\n  POS -->|heavy-tailed outcome / cost| WIN[Winsorize outcome at percentile<br/>OR gamma / two-part model]\n  POS -->|non-overlap = confounding by indication| STOP[Design problem:<br/>re-examine comparator, do not silently trim]\n  TRIM --> EST[Trimmed-population estimand<br/>state new target population]\n  WIN --> EST\n  EST --> SENS[Sensitivity grid across rules<br/>report ESS + pct person-time affected]\nstyle STOP fill:#ffcccc\nstyle SENS fill:#cce5ff",
        "caption": "Decision logic for extreme weights and outliers. Trimming and Winsorization are distinct responses (positivity vs. outcome heavy tails); non-overlap driven by confounding-by-indication is a design failure, not a trimming problem.",
        "alt_text": "Flowchart from estimated weights through a weight-diagnostics decision into trimming, Winsorization, or stop-and- redesign, ending in a stated trimmed estimand and a sensitivity grid.",
        "source_type": "illustrative",
        "source_citations": [
          "sturmer-2010"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  OBS[Observation incomplete?] --> M{Mechanism}\n  M -->|MCAR: independent of all data| CC[Complete-case ~ unbiased<br/>but inefficient]\n  M -->|MAR: depends on OBSERVED data| MI[Multiple imputation<br/>or IPCW conditional on observed]\n  M -->|MNAR: depends on UNOBSERVED value| SENS[Pattern-mixture / delta-adjustment<br/>re-define estimand]\n  M -->|Structural: death / disenrollment| STR[Censor person-time;<br/>while-alive or composite estimand]\nstyle SENS fill:#ffe5cc\nstyle STR fill:#ffcccc",
        "caption": "Missing-data mechanism determines the valid remedy. In RWE the default assumption should be MAR at best and often MNAR; death and disenrollment are structural MNAR that no imputation on observed covariates can repair.",
        "alt_text": "Decision tree mapping MCAR, MAR, MNAR, and structural missingness to complete-case, multiple imputation/IPCW, MNAR sensitivity, and person-time censoring respectively.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "used_with",
        "target_slug": "propensity-score-methods-psm-iptw",
        "notes": "Trimming and Winsorization act on the propensity scores and IPTW produced by PS methods; they are the standard stabilization step when weights are extreme."
      },
      {
        "relation_type": "see_also",
        "target_slug": "estimands-ate-att-intercurrent-events-rwe",
        "notes": "Trimming changes the effective population and therefore the estimand; the choice must be consistent with the pre-specified ATE/ATT target and the intercurrent-event (death, disenrollment) strategy."
      },
      {
        "relation_type": "see_also",
        "target_slug": "missing-data-pattern-table-rwe",
        "notes": "Companion entry — characterizing the pattern and extent of missingness (monotone vs. intermittent, % complete follow-up) precedes choosing complete-case, MI, IPCW, or an MNAR sensitivity strategy."
      },
      {
        "relation_type": "see_also",
        "target_slug": "multiple-imputation-longitudinal-rwe",
        "notes": "Multiple imputation is the principled MAR remedy for missing covariates; this entry frames when MI is appropriate versus when missingness is structural MNAR that MI cannot fix."
      },
      {
        "relation_type": "see_also",
        "target_slug": "cost-outlier-handling-rwe",
        "notes": "Direct overlap on Winsorization/truncation of extreme cost values; coordinate weight trimming and cost-outlier handling so the two operations do not compound bias on cost endpoints."
      },
      {
        "relation_type": "see_also",
        "target_slug": "healthcare-costs-pppm-pppy-pmpm",
        "notes": "Structural missingness at death/disenrollment and catastrophic cost outliers are the most common settings where these methods are applied to PPPM/PMPM endpoints."
      },
      {
        "relation_type": "see_also",
        "target_slug": "therapeutic-area-specific-rwe-challenges-oncology",
        "notes": "Oncology RWE combines high-cost outliers (CAR-T, prolonged stays) with frequent early death (structural MNAR), so extreme weights and complex missingness co-occur and must be handled jointly."
      }
    ],
    "aliases": [
      "weight trimming",
      "propensity score trimming",
      "Winsorization of weights",
      "IPTW weight truncation",
      "missing data mechanisms in RWE",
      "MAR MNAR in claims",
      "structural missingness"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "mixed-effects-models-longitudinal-rwe",
    "name": "Mixed-Effects (Random-Effects) Models for Longitudinal RWE",
    "short_definition": "A regression framework for repeated/clustered outcomes that adds subject- or cluster-specific random effects to fixed-effect covariates, modeling within-subject correlation and yielding subject-specific (conditional) effects under a missing-at-random likelihood.",
    "long_description": "**Mixed-effects (random-effects / multilevel) models** extend ordinary regression to data where outcomes are\nmeasured repeatedly within the same patient over time, or are clustered within sites/providers/regions, so that\nobservations are *not* independent. The model partitions variation into **fixed effects** (population-level\ncovariate coefficients — the treatment contrast, time trend, baseline covariates) and **random effects**\n(subject- or cluster-specific deviations, e.g., a random intercept for each `person_id`, optionally a random\nslope on time). For a continuous outcome the canonical specification is the **linear mixed model (LMM)** of\nLaird & Ware: y_ij = X_ij*beta + Z_ij*b_i + e_ij, with b_i ~ N(0, G) and e_ij ~ N(0, R). For binary/count\nrepeated outcomes (e.g., monthly hospitalization, exacerbation counts) the analogue is the **generalized linear\nmixed model (GLMM)** with a logit/log link. Estimation is by (restricted) maximum likelihood — REML for the LMM,\nLaplace/adaptive Gauss-Hermite quadrature or pseudo-likelihood for the GLMM.\n\n**Core conceptual distinction** (the estimand). The estimand a mixed model targets is **subject-specific (conditional)**: the\nfixed-effect coefficient is the change in the *same patient's* (or same cluster's) outcome, holding that\npatient's random effect fixed. This is the sharpest single point of confusion in longitudinal RWE. (1) *Mixed\nmodel vs GEE (population-average).* A generalized estimating equation models the **marginal/population-average**\nmean and treats within-subject correlation as a nuisance working structure. For the identity and log links the\nconditional and marginal coefficients coincide, but for the **logit link they do not** — a GLMM odds ratio is\nlarger in magnitude than the corresponding GEE odds ratio (attenuation by the random-effect variance), and the\ntwo answer different questions (\"effect for a typical patient\" vs \"effect on the population prevalence\"). 
Pick\nthe estimand in the protocol, not after seeing both. (2) *Random effects vs cluster-robust / fixed-effects\nregression.* Random effects assume the cluster effect is **uncorrelated with the covariates** (an exchangeability\nassumption a Hausman-type comparison probes); if site-level confounding correlates with exposure, a within-cluster\n(fixed-effects) estimator or design-based control is safer. (3) *Mixed model vs MMRM.* The mixed model for\nrepeated measures (MMRM) is the same likelihood machinery with time treated as a categorical fixed effect and an\n*unstructured* residual covariance instead of random slopes; it is the FDA/EMA-favored primary analysis for\ncontinuous endpoints in registrational longitudinal trials. The random-coefficient growth-curve form is more\nparsimonious but imposes a functional shape on the trajectory.\n\n**Interpreting the output**\n\nConsider the worked example: four patients measured at months 0, 3, and 6. Drug-arm\npatients (101, 102) drop an average of 1.10 HbA1c points over 6 months; comparator\npatients (103, 104) drop an average of 0.20 points. The linear mixed model with\nrandom intercepts and random slopes yields an estimated fixed-effect treatment\ndifference of approximately −0.90 HbA1c points (95% CI −1.6 to −0.2) favoring drug.\n\nFormal interpretation: The fixed-effect coefficient of −0.90 is a subject-specific\n(conditional) estimate: the expected additional HbA1c reduction for a given patient\non the drug arm compared with what that same patient's trajectory would have been on\nthe comparator, holding that patient's random intercept and slope fixed. It is not a\ncomparison between different patients. 
The random-intercept variance quantifies\nhow much patients differ at baseline (patient 101 starts at 9.2, patient 102 at 7.4);\nthe random-slope variance quantifies how much their rates of change differ.\n\nPractical interpretation: The drug is estimated to lower HbA1c by about 0.9 points\nmore than the comparator in a typical patient over 6 months. This subject-specific\nframing is appropriate when the clinical question is \"what will happen to this patient?\"\nIf the question is instead \"what is the average effect across the population?\" (the\nHTA or payer's usual target), a GEE (marginal model) answers it directly, and for a\nGaussian identity-link outcome like HbA1c the two coefficient values coincide. For\nthe logit link the conditional mixed-model coefficient is larger in magnitude than\nthe corresponding marginal GEE coefficient; for the log link with only a random\nintercept the covariate coefficients agree and only the intercept shifts.\n\n**Pros, cons, and trade-offs** (specific and comparative).\n- **vs GEE (population-average models):** Mixed models use all available person-time under a **likelihood that is\n  valid under missing-at-random (MAR)** dropout, whereas standard GEE with an independence or exchangeable working\n  correlation is only consistent under the stronger **missing-completely-at-random (MCAR)** assumption (unless\n  weighted). Mixed models also estimate between- and within-subject variance components and give individual\n  predictions (BLUPs). 
Cost: results are sensitive to the random-effects distribution and link, the conditional OR\n  is easily misread as marginal, and convergence/quadrature failures are common with sparse binary clusters.\n  **Prefer the mixed model** when the question is subject-specific, when dropout is plausibly MAR, or when you need\n  variance decomposition or shrinkage prediction.\n- **vs MMRM (unstructured-covariance repeated measures):** The random-coefficient mixed model borrows strength\n  across the trajectory and handles unbalanced/irregular visit times naturally — an advantage in RWE where visits\n  are not on a protocol grid. Cost: it assumes the modeled growth shape (linear/spline) is correct; MMRM makes no\n  such assumption but needs roughly balanced nominal visits and can be over-parameterized in small samples.\n  **Prefer MMRM** for registrational continuous endpoints with scheduled visits; **prefer the growth-curve mixed\n  model** for irregular real-world follow-up.\n- **vs cluster-robust / GEE-with-robust-SE on a marginal model:** Robust (\"sandwich\") variances fix only the\n  *standard errors* of a misspecified-correlation model; the mixed model instead *models* the correlation, which\n  is more efficient when the variance structure is approximately right and is what enables MAR-valid likelihood\n  inference. Cost: a wrong random-effects structure biases both point estimates and SEs, so sandwich correction of\n  a well-specified marginal model is sometimes the more robust choice. **Prefer mixed models** when you trust the\n  hierarchical structure; pair the GLMM with profile/Kenward-Roger or bootstrap inference for small numbers of\n  clusters.\n\n**When to use** (decision rules). 
Repeated continuous, binary, or count outcomes per patient (PRO/HRQoL scores over visits, serial\nlabs/HbA1c, monthly utilization or cost-adjacent counts); outcomes clustered within hospitals, physicians, payers,\nor geographies; analyses needing individual-level prediction or explicit variance decomposition (e.g., how much\noutcome variation is between vs within site); irregular real-world visit timing where complete-case or\nlast-observation-carried-forward would discard information or bias under MAR.\n\n**When NOT to use — and when it is actively misleading or dangerous** (decision rules).\n- **You need a population-average / policy estimand but report the conditional coefficient.** Presenting a GLMM\n  logit coefficient as if it were the population odds ratio overstates effect magnitude; for a marginal target use\n  GEE or marginalize the fitted GLMM (predicted-margins / g-computation over the random-effect distribution).\n- **Informative (non-ignorable, MNAR) dropout.** The MAR likelihood is *not* a free lunch — if sicker patients\n  drop out for reasons not captured by observed covariates and prior outcomes, mixed-model estimates are biased in\n  the same direction as the dropout, and the model gives no warning. 
Add a sensitivity analysis (pattern-mixture,\n  selection model, or reference-based/jump-to-reference multiple imputation).\n- **Few clusters with many covariates.** With <~10-15 clusters, the random-effect variance is poorly estimated and\n  Wald/normal-theory cluster-level inference is anticonservative; use Kenward-Roger/Satterthwaite degrees of\n  freedom or a cluster bootstrap, or switch to a design-based analysis.\n- **Random-intercept-only models when slopes vary, or wrong exchangeability.** Omitting a needed random slope\n  understates SEs of time-by-treatment effects; assuming random effects uncorrelated with exposure when\n  site-level channeling exists smuggles confounding back in.\n- **Using a mixed model to \"adjust away\" confounding by indication.** It models correlation, not treatment\n  selection. It is a *companion* to a sound design (active-comparator new-user, propensity weighting), not a\n  substitute.\n\n**Data-source operational depth** (claims vs EHR vs registry vs linked).\n- **Claims (FFS vs MA vs commercial):** Repeated outcomes are usually *counts per interval* (ED visits,\n  hospitalizations, fills) requiring a Poisson/negative-binomial GLMM with a log-offset for observed person-time.\n  The dominant failure mode is **denominator/observability artifacts masquerading as outcome variation**:\n  Medicare Advantage (Part C) encounter capture is incomplete and inconsistent versus fee-for-service Parts A/B,\n  so MA-only person-time produces spuriously low counts — restrict to FFS A/B (plus D for drug exposure) or model\n  plan type and use a true at-risk offset. Plan switching, partial-year enrollment, and claims-adjudication lag\n  create unbalanced and right-truncated panels; build an explicit per-interval `observed_days` offset and drop\n  intervals with no observable enrollment rather than coding them as zero events. 
**Differential competing risks\n  by exposure in elderly claims** (death ends the panel non-randomly) means a naive repeated-count GLMM can be\n  biased; consider a joint longitudinal-survival model or censor consistently. **Immortal time in procedure\n  studies** (follow-up clock starting before the exposure-defining procedure) corrupts the time variable for every\n  visit — align the time origin at the index/exposure date for all subjects.\n- **EHR:** Visits are **encounter-driven, not protocol-scheduled**, so the visit times are themselves\n  informative (sicker patients are measured more often) — measurement frequency can be an outcome-correlated\n  process that violates the MAR-at-fixed-times convenience and may require a joint model of the measurement\n  process. Lab/PRO values are missing not-at-random when ordered only on clinical suspicion; external-care leakage\n  truncates trajectories. Use the actual continuous measurement time, not a nominal visit number, in the random\n  slope.\n- **Registry:** Scheduled visits give cleaner balanced panels and adjudicated outcomes (an advantage for LMM/MMRM),\n  but completeness varies by site and enrollment can be selective; site should usually enter as a random effect,\n  and missingness still needs MAR scrutiny.\n- **Linked claims–EHR–registry:** The ideal substrate (EHR/registry severity to make MAR plausible + claims for\n  complete utilization counts + a death index to handle the competing risk), but linkage selection and\n  date-discrepancy reconciliation (encounter vs claim vs registry visit date) must be resolved before defining the\n  repeated-measures time grid.\n\n**Worked claims example.** Question: does a new biologic vs an active comparator reduce the *rate* of\nasthma-related ED visits over 24 months of follow-up in a commercial + Medicare FFS database? 
(1) Cohort: incident\nusers of either drug (active-comparator new-user design) with 365 days of continuous medical + pharmacy enrollment\nbefore the index `fill_date`; exclude MA-only person-time because Part C ED encounters are under-captured.\n(2) Panel: split each patient's post-index follow-up into 12 consecutive 60-day intervals (720 days, approximately\n24 months) indexed by `interval_num`; in each interval compute `n_ed_visits` (count of claims with an asthma ED\nrevenue/CPT code, de-duplicated to one event per calendar day) and `observed_days` = days of continuous observable\n(commercial or Medicare FFS) enrollment in that interval (the at-risk offset). Drop intervals with\n`observed_days` = 0: they are unobservable, not event-free.\n(3) Model: a Poisson (or negative-binomial, if overdispersed) GLMM:\nlog(E[n_ed_visits]) = beta0 + beta1*arm + beta2*interval_num + beta3*(arm x interval_num) + X*gamma +\nlog(observed_days) + b_person, with a random intercept b_person ~ N(0, sigma^2) absorbing each patient's baseline\nvisit propensity; add a random slope on `interval_num` if trajectories diverge. (4) Estimand: report the\nconditional incidence-rate ratio explicitly as subject-specific. Because the model includes the arm x interval_num\ninteraction, exp(beta1) is the IRR at `interval_num` = 0 only, and the IRR in interval t is exp(beta1 + beta3*t);\ncode the first interval as 0 or report the IRR at stated intervals. If a population-average IRR is the\npolicy quantity, refit as a log-link GEE or marginalize the GLMM. (5) Missingness/competing risk: dropout from\ndisenrollment is plausibly MAR given observed history (the GLMM likelihood handles it); deaths end the panel\nnon-randomly, so run a sensitivity analysis censoring at death and a pattern-mixture check. (6) Inference: with\nmany patients the random-intercept SE is fine; cluster on `person_id`, and if also clustering by a small number of\nhealth plans, use Kenward-Roger / bootstrap rather than naive Wald.",
    "primary_category": "Inferential_Statistics",
    "tags": [
      "inferential-statistics",
      "longitudinal-data",
      "mixed-effects-models",
      "random-effects",
      "repeated-measures",
      "glmm",
      "multilevel-models",
      "conditional-vs-marginal"
    ],
    "applies_to_study_types": [
      "claims_analysis",
      "ehr_study",
      "registry_linkage",
      "comparative_effectiveness",
      "target_trial_emulation"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.2307/2529876",
        "url": "https://doi.org/10.2307/2529876",
        "citation_text": "Laird NM, Ware JH. Random-effects models for longitudinal data. Biometrics. 1982;38(4):963-974.",
        "year": 1982,
        "authors_short": "Laird & Ware",
        "notes": "Foundational two-stage random-effects (linear mixed) model for repeated measures; defines the subject-specific estimand and EM/ML estimation that underpins all later LMM/GLMM longitudinal work."
      },
      {
        "role": "explain",
        "doi": "10.1093/biomet/73.1.13",
        "url": "https://doi.org/10.1093/biomet/73.1.13",
        "citation_text": "Liang KY, Zeger SL. Longitudinal data analysis using generalized linear models. Biometrika. 1986;73(1):13-22.",
        "year": 1986,
        "authors_short": "Liang & Zeger",
        "notes": "Introduces GEE / population-average models; the canonical contrast that clarifies the conditional (subject-specific) vs marginal estimand distinction central to choosing a mixed model."
      },
      {
        "role": "demonstrate",
        "doi": "10.1001/jama.2015.19394",
        "url": "https://doi.org/10.1001/jama.2015.19394",
        "citation_text": "Detry MA, Ma Y. Analyzing repeated measurements using mixed models. JAMA. 2016;315(4):407-408.",
        "year": 2016,
        "authors_short": "Detry & Ma",
        "notes": "Concise applied demonstration of when and how to use mixed models for repeated measurements, including handling of missing data and the within-subject correlation rationale."
      },
      {
        "role": "use",
        "doi": "10.1177/009286150804200402",
        "url": "https://doi.org/10.1177/009286150804200402",
        "citation_text": "Mallinckrodt CH, Lane PW, Schnell D, Peng Y, Mancuso JP. Recommendations for the primary analysis of continuous endpoints in longitudinal clinical trials. Drug Information Journal. 2008;42(4):303-319.",
        "year": 2008,
        "authors_short": "Mallinckrodt et al.",
        "notes": "Influential recommendation establishing the likelihood-based mixed model for repeated measures (MMRM) as the preferred primary analysis under MAR; the regulatory anchor for longitudinal continuous-endpoint analysis."
      }
    ],
    "plain_language_summary": "A mixed-effects model is a statistical tool for studying how patients change over time when you measure the same person more than once. It splits the math into two layers: one layer captures the average pattern across all patients (for example, how much a drug lowers blood sugar on average), and a second layer gives each patient their own personal baseline and their own personal rate of change. This separation lets you ask the right question -- how much did a typical patient improve? -- without a high-baseline patient's numbers drowning out a low-baseline patient's numbers.",
    "key_terms": [
      {
        "term": "fixed effect",
        "definition": "A coefficient that applies to everyone in the study -- for example, the average drop in HbA1c across all patients on the drug."
      },
      {
        "term": "random intercept",
        "definition": "A patient-specific offset that shifts that individual's starting value up or down from the group average baseline, capturing the fact that people differ at the start."
      },
      {
        "term": "random slope",
        "definition": "A patient-specific adjustment to the rate of change over time, capturing the fact that some people improve faster or slower than the average trend."
      },
      {
        "term": "longitudinal data",
        "definition": "Data where the same patients are measured multiple times at different points in time, so each person contributes more than one row to the dataset."
      },
      {
        "term": "within-person change",
        "definition": "How a single patient's outcome shifts from visit to visit -- as opposed to between-person differences, which compare patients to each other."
      }
    ],
    "worked_example": {
      "scenario": "A research team wants to know whether a new diabetes drug lowers HbA1c (a measure of blood sugar control) over 6 months. They enroll five patients, measure HbA1c at baseline (month 0), month 3, and month 6, and record every reading in a long-format table where each row is one patient at one visit. Patients start at different HbA1c levels, so they need a model that accounts for each person's own baseline rather than lumping everyone together.",
      "dataset": {
        "caption": "Long-format HbA1c measurements (one row per patient per visit).",
        "columns": [
          "person_id",
          "month",
          "hba1c",
          "arm"
        ],
        "rows": [
          [
            101,
            0,
            9.2,
            "drug"
          ],
          [
            101,
            3,
            8.5,
            "drug"
          ],
          [
            101,
            6,
            7.8,
            "drug"
          ],
          [
            102,
            0,
            7.4,
            "drug"
          ],
          [
            102,
            3,
            7.0,
            "drug"
          ],
          [
            102,
            6,
            6.6,
            "drug"
          ],
          [
            103,
            0,
            8.8,
            "comparator"
          ],
          [
            103,
            3,
            8.6,
            "comparator"
          ],
          [
            103,
            6,
            8.5,
            "comparator"
          ],
          [
            104,
            0,
            7.1,
            "comparator"
          ],
          [
            104,
            3,
            7.0,
            "comparator"
          ],
          [
            104,
            6,
            7.0,
            "comparator"
          ]
        ]
      },
      "steps": [
        "Each patient has three rows -- one per visit. This is the long format a mixed-effects model expects.",
        "Patients start at very different baselines: patient 101 begins at 9.2 and patient 102 at 7.4. A random intercept gives each patient their own starting point so the model does not force them all to start at the same average.",
        "Patient 101 drops 1.4 points over 6 months; patient 102 drops 0.8 points. A random slope lets each patient have their own trajectory instead of assuming everyone changes at exactly the same rate.",
        "The fixed effect for the drug arm captures the average additional drop across all drug patients compared to comparator patients, after accounting for each person's individual baseline and slope.",
        "Drug patients (101, 102) average a drop of -1.1 points over 6 months [(9.2 to 7.8 = -1.4) + (7.4 to 6.6 = -0.8)] / 2 = -1.10. Comparator patients (103, 104) average -0.15 points [(8.8 to 8.5 = -0.3) + (7.1 to 7.0 = -0.1)] / 2 = -0.20.",
        "The estimated fixed-effect treatment difference is approximately -1.10 minus (-0.20) = -0.90 HbA1c points in favor of the drug."
      ],
      "result": "Estimated average treatment difference: drug arm drops HbA1c by approximately 0.90 points more than the comparator arm over 6 months. The random intercept accounts for patients' different starting values; the random slope accounts for their different rates of change. The fixed-effect estimate of -0.90 describes the typical within-patient benefit, not a comparison of different people."
    },
    "prerequisites": [
      "longitudinal-outcomes-modeling-rwe",
      "logistic-regression-for-binary-outcomes"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Random-intercept GLMM for repeated counts/binary outcomes (claims)",
        "description": "Poisson/negative-binomial or logistic GLMM with a per-patient random intercept and a log-offset for observed person-time; the workhorse for repeated utilization or event counts in administrative data.",
        "edge_cases": [
          "MA-only intervals produce spuriously low counts from incomplete Part C encounter capture; restrict to FFS or model plan type and use a true at-risk offset.",
          "Zero-event intervals from non-enrollment must be dropped (unobservable), not coded as event-free zeros.",
          "Overdispersion (variance > mean) inflates Type I error under Poisson; test and switch to negative-binomial or add an observation-level random effect."
        ],
        "data_source_notes": "claims: derive per-interval n_events (day-deduplicated) and observed_days offset from enrollment spans; exclude MA-only person-time. ehr: encounter-driven counts are confounded by measurement frequency — consider modeling the visit process."
      },
      {
        "name": "Random-intercept-and-slope linear mixed model (growth curve)",
        "description": "LMM with a random intercept and random slope on continuous time, for serial continuous outcomes (HbA1c, PRO/HRQoL scores) measured at irregular real-world times.",
        "edge_cases": [
          "Omitting a needed random slope understates SEs of time-by-treatment interactions.",
          "Use actual continuous measurement time, not nominal visit number, when EHR/claims visits are off-grid.",
          "Boundary/singular fits (estimated random-slope variance ~0) signal over-parameterization; simplify the random-effects structure."
        ],
        "data_source_notes": "ehr/registry: align the time origin at the index/exposure date for all subjects to avoid immortal time corrupting the time covariate; check that measurement timing is not outcome-driven (MAR)."
      },
      {
        "name": "MMRM (unstructured covariance, categorical time)",
        "description": "Likelihood-based repeated-measures model with time as a categorical fixed effect and an unstructured residual covariance instead of random slopes; the FDA/EMA-favored primary analysis for continuous endpoints.",
        "edge_cases": [
          "Requires roughly balanced nominal visits; off-grid real-world visits force the random-coefficient form instead.",
          "Unstructured covariance can be over-parameterized and fail to converge in small samples; fall back to AR(1) or Toeplitz with model-fit comparison."
        ],
        "data_source_notes": "registry/trial-like RWE: clean scheduled visits suit MMRM; pure claims rarely give the balanced visit grid MMRM assumes."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "GEE / population-average models",
        "pros_of_this": "Likelihood is valid under missing-at-random dropout (vs MCAR for unweighted GEE); estimates variance components and subject-specific predictions (BLUPs); targets the conditional estimand directly.",
        "cons_of_this": "Conditional (subject-specific) coefficients differ from marginal ones under nonlinear links and are easily misreported as population-average; sensitive to the random-effects distribution and prone to convergence failures with sparse binary clusters.",
        "when_to_prefer": "When the question is subject-specific, dropout is plausibly MAR, or variance decomposition / individual prediction is needed."
      },
      {
        "compared_to": "MMRM (unstructured-covariance repeated measures)",
        "pros_of_this": "Handles unbalanced/irregular visit times and borrows strength across the trajectory via random coefficients — well suited to real-world off-grid follow-up.",
        "cons_of_this": "Imposes a functional shape on the time trend (linear/spline) that can be wrong; MMRM avoids that assumption for scheduled visits.",
        "when_to_prefer": "Irregular real-world visit timing; prefer MMRM for registrational continuous endpoints on a scheduled visit grid."
      },
      {
        "compared_to": "Cluster-robust SEs on a marginal model",
        "pros_of_this": "Models the correlation structure rather than only correcting SEs, enabling MAR-valid likelihood inference and greater efficiency when the variance structure is approximately correct.",
        "cons_of_this": "A misspecified random-effects structure biases point estimates and SEs, whereas sandwich correction of a well-specified marginal model is agnostic to the working correlation.",
        "when_to_prefer": "When the hierarchical/correlation structure is trusted; otherwise pair a marginal model with robust variance, and use Kenward-Roger/bootstrap when clusters are few."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Repeated outcomes are usually per-interval counts; use a Poisson/NB GLMM with a log(observed_days) offset and a per-patient random intercept. Restrict to FFS (exclude MA-only person-time with incomplete Part C capture), de-duplicate events to one per calendar day, drop unobservable (non-enrolled) intervals rather than coding zeros, and handle death as a non-random competing event that ends the panel.",
      "ehr": "Visit times are encounter-driven and potentially outcome-correlated (informative measurement); use actual continuous measurement time in the random slope, scrutinize MAR vs MNAR for labs/PROs ordered on suspicion, and account for external-care leakage truncating trajectories.",
      "registry": "Scheduled visits yield balanced panels and adjudicated outcomes suited to LMM/MMRM; enter site as a random effect and still assess MAR given variable site completeness and selective enrollment.",
      "linked": "Ideal substrate (severity for MAR plausibility + claims completeness + death index for the competing risk), but resolve linkage selection and reconcile encounter/claim/registry date discrepancies before defining the repeated-measures time grid."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import statsmodels.formula.api as smf\n\n# panel: one row per (person_id, visit); see header for required columns.\n# Linear mixed growth curve: random intercept + random slope on continuous time per patient.\n# re_formula gives each person_id its own intercept and slope, absorbing within-patient correlation;\n# the arm*meas_time fixed effect is the SUBJECT-SPECIFIC (conditional) divergence in trajectories.\nlmm = smf.mixedlm(\n    \"outcome_value ~ arm * meas_time + age + female + baseline_severity\",\n    data=panel,\n    groups=panel[\"person_id\"],\n    re_formula=\"~ meas_time\",                               # correlated random intercept & slope\n).fit(reml=True, method=\"lbfgs\")\nprint(lmm.summary())\n\n# Variance components: between-patient (random-effects cov) vs within-patient (residual scale).\nprint(\"Random-effects covariance (G):\\n\", lmm.cov_re)\nprint(\"Residual variance (within-subject):\", lmm.scale)\n# The estimand is conditional. For a population-average mean trajectory, average model predictions\n# over the random-effect distribution (predicted margins) or refit a marginal model (GEE).",
        "description": "Linear mixed growth-curve model in Python with statsmodels MixedLM (REML) for a serial continuous\noutcome measured at irregular real-world times. MixedLM is the production-grade mixed-effects fitter in\nstatsmodels; for an exact log person-time offset on repeated COUNTS (Poisson/NB GLMM) use the R (glmer/\nglmmTMB) or SAS (PROC GLIMMIX) implementations below, which support offsets natively.\nRequired input table `panel` (one row per patient-visit, built from enrollment + claims/EHR):\n  person_id     : patient id (the clustering / random-effect unit)\n  meas_time     : continuous time since index (years) for irregular real-world visits\n  outcome_value : continuous serial outcome (e.g., HbA1c) at that visit\n  arm           : 1 = study biologic, 0 = active comparator (assigned at index fill)\n  age, female, baseline_severity : baseline covariates measured pre-index\nAlign meas_time at the index/exposure date for all subjects to avoid immortal time corrupting the slope.",
        "dependencies": [
          "pandas",
          "numpy",
          "statsmodels"
        ],
        "source_citations": [
          "laird-ware-1982",
          "mallinckrodt-2008"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(lme4)\nlibrary(glmmTMB)\nlibrary(emmeans)\n\npanel <- subset(panel, observed_days > 0)                 # keep only observable person-time\n\n## (A) Poisson GLMM for repeated ED-visit counts; offset(log(observed_days)) -> a RATE model.\n##     (1 | person_id) is the patient random intercept; estimand is SUBJECT-SPECIFIC (conditional).\nfit_count <- glmer(\n  n_ed_visits ~ arm * interval_num + age + female + baseline_severity +\n    offset(log(observed_days)) + (1 | person_id),\n  data = panel, family = poisson(link = \"log\"),\n  control = glmerControl(optimizer = \"bobyqa\")\n)\n# Overdispersion check; if present, refit negative-binomial with glmmTMB:\n# fit_count <- glmmTMB(n_ed_visits ~ arm * interval_num + age + female + baseline_severity +\n#                      offset(log(observed_days)) + (1 | person_id), family = nbinom2, data = panel)\nsummary(fit_count)\n# Because the model includes arm:interval_num, the arm main effect is the conditional\n# incidence-rate ratio at interval_num = 0, not an average over intervals.\n# (After a glmmTMB refit, extract it with fixef(fit_count)$cond[\"arm\"] instead.)\nexp(fixef(fit_count)[\"arm\"])\n\n## (B) Linear mixed growth curve: random intercept + random slope on continuous measurement time.\n##     Use actual meas_time (irregular real-world visits), not nominal visit number.\nfit_lmm <- lmer(\n  outcome_value ~ arm * meas_time + age + female + baseline_severity +\n    (1 + meas_time | person_id),\n  data = panel, REML = TRUE\n)\nsummary(fit_lmm)\n# Model-based treatment means over time; with an identity link these fixed-effect means\n# are also the population-average means, if a marginal estimand is required:\nemmeans(fit_lmm, ~ arm | meas_time, at = list(meas_time = c(0.5, 1, 1.5, 2)))",
        "description": "Repeated-measures mixed models in R. Two fits from the same patient-interval `panel`:\n  (A) Poisson GLMM with a true log person-time offset and a random intercept (claims count outcome).\n  (B) Linear mixed growth-curve model for a continuous serial outcome (e.g., HbA1c) measured at irregular times.\n`panel` columns: person_id, interval_num, n_ed_visits, observed_days (>0), arm (0/1), age, female,\nbaseline_severity; for (B) also meas_time (continuous years since index) and outcome_value.\nThe random slope on time in (B) lets patient trajectories diverge. Within-patient correlation is carried by\nthe person_id random effects, so neither fit applies a separate cluster-robust SE correction.",
        "dependencies": [
          "lme4",
          "glmmTMB",
          "emmeans"
        ],
        "source_citations": [
          "laird-ware-1982",
          "mallinckrodt-2008"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* Keep only observable person-time before modeling. */\ndata panel; set work.panel; where observed_days > 0; logpt = log(observed_days); run;\n\n/* (A) Poisson GLMM for repeated ED-visit counts: random intercept + log person-time offset -> RATE model.\n       exp(arm estimate) is a SUBJECT-SPECIFIC (conditional) incidence-rate ratio. */\nproc glimmix data=panel method=laplace;\n  class person_id female;\n  model n_ed_visits = arm interval_num arm*interval_num age female baseline_severity\n        / dist=poisson link=log offset=logpt solution;\n  random intercept / subject=person_id;          /* b_person ~ N(0, sigma^2) */\n  /* If overdispersed, switch dist=negbin (NB GLMM). */\n  /* arm is numeric 0/1 here; with arm*interval_num in the model this is the IRR at interval_num = 0. */\n  estimate 'arm log-IRR' arm 1 / exp;             /* exp gives the conditional IRR */\nrun;\n\n/* (B) Linear mixed growth curve: random intercept + random slope on continuous measurement time.\n       visit is not used in this model, so it is left out of CLASS. */\nproc mixed data=serial method=reml;\n  class person_id arm;\n  model outcome_value = arm meas_time arm*meas_time age female baseline_severity / solution ddfm=kr;\n  random intercept meas_time / subject=person_id type=un;   /* correlated random intercept & slope */\nrun;\n\n/* (C) MMRM (regulatory primary analysis): categorical time, UNSTRUCTURED residual covariance,\n       Kenward-Roger denominator df. No random effects -- correlation is in the R-side covariance. */\nproc mixed data=serial method=reml;\n  class person_id arm visit;\n  model outcome_value = arm visit arm*visit age female baseline_severity / solution ddfm=kr;\n  repeated visit / subject=person_id type=un;\n  lsmeans arm*visit / diff;                        /* marginal treatment-by-visit means */\nrun;",
        "description": "Repeated-measures mixed models in SAS. Three procedures from the same patient-interval/visit data:\n  WORK.PANEL  : person_id, interval_num, n_ed_visits, observed_days (>0), arm (0/1), age, female,\n                baseline_severity  (one row per patient-interval; drop observed_days=0 rows first)\n  WORK.SERIAL : person_id, visit (categorical), meas_time (continuous), outcome_value, arm, covariates\nPROC GLIMMIX fits the Poisson GLMM (Laplace ML) with a true log person-time offset and a patient random\nintercept; PROC MIXED fits the LMM growth curve and the MMRM. METHOD=LAPLACE gives a true likelihood\n(needed for valid AIC and the MAR rationale); the default RSPL is pseudo-likelihood.",
        "dependencies": [],
        "source_citations": [
          "laird-ware-1982",
          "mallinckrodt-2008"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Q[Longitudinal RWE question] --> EST{Estimand:<br/>subject-specific or<br/>population-average?}\n  EST -->|Population-average / policy| GEE[GEE or marginalized GLMM<br/>see gee-population-average-models-rwe]\n  EST -->|Subject-specific / conditional| TYPE{Outcome type}\n  TYPE -->|Continuous serial| VISIT{Visit timing}\n  TYPE -->|Counts / binary per interval| GLMM[GLMM: Poisson/NB or logistic<br/>random intercept + log person-time offset]\n  VISIT -->|Scheduled / balanced grid| MMRM[MMRM: categorical time,<br/>unstructured covariance]\n  VISIT -->|Irregular real-world times| LMM[Linear mixed growth curve:<br/>random intercept + slope on time]\n  GLMM --> MISS{Dropout mechanism?}\n  LMM --> MISS\n  MMRM --> MISS\n  MISS -->|MAR| OK[Likelihood valid; report + variance components]\n  MISS -->|MNAR suspected| SENS[Add pattern-mixture / reference-based MI sensitivity]",
        "caption": "Decision logic for longitudinal RWE modeling. The first fork is the estimand (conditional vs marginal); a marginal/policy target points to GEE or a marginalized GLMM rather than reporting the raw conditional coefficient. Outcome type and visit regularity then select the mixed-model form, and the dropout mechanism determines whether the MAR likelihood suffices or a non-ignorable sensitivity analysis is required.",
        "alt_text": "Decision flowchart starting from estimand choice, branching to GEE for population-average questions or to GLMM/LMM/MMRM for subject-specific questions by outcome type and visit timing, ending in an MAR-vs-MNAR sensitivity-analysis check.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  subgraph Pop[Population-level: fixed effects]\n    B0[Intercept beta0] --- BX[Covariate effects X*beta<br/>arm, time, baseline covariates]\n  end\n  subgraph Sub[Subject-level: random effects]\n    BI[\"Random intercept b_i ~ N(0, G)<br/>each patient's baseline level\"]\n    BS[Random slope on time<br/>each patient's trajectory]\n  end\n  Pop --> Mean[Linear predictor: X*beta + Z*b_i]\n  Sub --> Mean\n  Mean --> Link[\"Link g(.): identity LMM / log Poisson / logit GLMM\"]\n  Link --> Y[\"Repeated outcome y_ij<br/>within-subject correlation absorbed by b_i\"]\n  Sub --> VC[\"Variance components from G:<br/>separate between- vs within-subject variation\"]",
        "caption": "Anatomy of a mixed-effects model. Fixed effects carry the population-level covariate and treatment contrasts; subject-specific random effects (intercept, optionally slope) absorb within-patient correlation and let the variance-component decomposition separate between- from within-subject variation. The link function determines whether fixed-effect coefficients are conditional rate/odds ratios or mean differences.",
        "alt_text": "Diagram showing fixed effects (intercept and covariate coefficients) and random effects (patient-level random intercept and slope) combining into a linear predictor, passed through a link function to produce the repeated outcome with within-subject correlation absorbed by the random effects.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "is_variant_of",
        "target_slug": "longitudinal-outcomes-modeling-rwe",
        "notes": "Mixed-effects models are the subject-specific (conditional) member of the longitudinal-outcomes family, alongside marginal (GEE) and survival-based approaches."
      },
      {
        "relation_type": "alternative_to",
        "target_slug": "gee-population-average-models-rwe",
        "notes": "GEE targets the marginal/population-average mean (valid under MCAR unweighted); mixed models target the conditional mean and are valid under MAR. The conditional and marginal coefficients diverge under nonlinear (e.g., logit) links."
      },
      {
        "relation_type": "tradeoff_with",
        "target_slug": "mmrm-repeated-measures-rwe",
        "notes": "MMRM uses categorical time with unstructured covariance (regulatory primary analysis for scheduled visits); the growth-curve mixed model handles irregular real-world visit times but imposes a trajectory shape."
      },
      {
        "relation_type": "used_with",
        "target_slug": "multiple-imputation-longitudinal-rwe",
        "notes": "When dropout may be non-ignorable (MNAR), reference-based or pattern-mixture multiple imputation complements the MAR mixed-model likelihood as a sensitivity analysis."
      },
      {
        "relation_type": "see_also",
        "target_slug": "cluster-robust-standard-errors-rwe",
        "notes": "An alternative to modeling correlation via random effects is to fit a marginal model and correct the SEs with a cluster-robust (sandwich) estimator; preferred when the random-effects structure is untrusted."
      },
      {
        "relation_type": "see_also",
        "target_slug": "marginal-effects-and-interpretation-of-inferential-statistics-rwe",
        "notes": "Marginalizing (predicted-margins / g-computation over the random-effect distribution) converts a subject-specific GLMM into a population-average effect when the policy estimand requires it."
      }
    ],
    "aliases": [
      "random-effects models for longitudinal data",
      "linear mixed models (LMM)",
      "generalized linear mixed models (GLMM)",
      "multilevel / hierarchical models",
      "random-coefficient models",
      "mixed-effects models longitudinal"
    ],
    "complexity": "advanced",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "mixed-methods",
    "name": "Mixed-Methods Study",
    "short_definition": "A study design that intentionally collects, analyzes, and integrates both quantitative (e.g., claims/EHR-derived rates, effects, costs) and qualitative (e.g., interview, observation, free-text) data within a single program of inquiry so that integration — not mere co-occurrence — produces inferences neither strand could yield alone.",
    "long_description": "A **mixed-methods study** deliberately combines a **quantitative (QUANT)** strand and a **qualitative (QUAL)**\nstrand and — critically — **integrates** them so the whole exceeds the sum of the parts. In real-world evidence and\nHEOR the QUANT strand is typically a claims/EHR/registry analysis (incidence, comparative effectiveness, adherence,\ncost, HCRU) and the QUAL strand is interviews, focus groups, ethnographic observation, or structured analysis of\nclinical free text. The design's defining act is **integration** — at the level of design, methods, and\ninterpretation — not the simple presence of two data types side by side.\n\n**Core conceptual distinction**. What separates mixed methods from \"we also did some interviews\" is *purposeful\nintegration* and an explicit *priority and sequence*. Three canonical designs (Creswell/Plano Clark typology) map\ndirectly onto RWE workflows: (1) **Convergent (parallel)** — QUANT and QUAL collected concurrently and merged to\ncorroborate or explain divergence (e.g., a claims adherence analysis run alongside patient interviews, then placed\nin a joint display). (2) **Explanatory sequential (QUANT -> QUAL)** — a quantitative result is explained by a\nfollow-on qualitative strand; the QUANT result *drives the qualitative sampling frame* (e.g., claims identify\nlow-adherence patients, who are then purposively sampled for interviews). (3) **Exploratory sequential\n(QUAL -> QUANT)** — qualitative work builds an instrument or hypothesis that is then measured at scale (e.g.,\ninterviews surface burden domains that become a PRO later validated in registry data). The integration\npoint — *connecting* (sampling one strand from another), *building* (one strand creates the instrument for the\nother), or *merging* (joint display + meta-inference) — is the unit of methodological rigor (Fetters 2013). 
A\n**joint display** that simply parks a quantitative confidence interval next to a qualitative theme with no\ncross-talk is **not integration**; it is two studies in one document. The HEOR-relevant special case is the\n**effectiveness-implementation hybrid** (Curran 2012), which blends a comparative-effectiveness QUANT estimand with\nQUAL/process data on adoption, reach, and fidelity, accelerating the move from evidence to uptake.\n\n**Pros, cons, and trade-offs** (specific and comparative).\n- **vs a stand-alone quantitative RWE study (e.g., claims comparative effectiveness):** mixed methods adds the\n  *why* and *for whom* — mechanism, context, patient/clinician meaning, and reasons for heterogeneity that a hazard\n  ratio cannot supply. Cost: roughly doubles the design surface, requires dual expertise, and lengthens timelines;\n  if the QUANT question is precise and mechanism is already understood, the QUAL strand is overhead. **Prefer mixed\n  methods** when an effect estimate alone will not change behavior or policy because decision-makers need to\n  understand implementation, acceptability, or unexplained heterogeneity.\n- **vs a stand-alone qualitative study:** the QUANT strand anchors prevalence, magnitude, and generalizable\n  structure so qualitative themes are not over-extrapolated. Cost: the qualitative depth can be diluted when forced\n  to serve a fixed quantitative frame. **Prefer mixed methods** when you need both representativeness and depth.\n- **vs simply running two separate studies:** integration is the whole value proposition; a true mixed-methods\n  design plans the *connect/build/merge* point a priori and reports **meta-inferences**. Cost: integration is the\n  hardest, most failure-prone step and demands governance (a single protocol, aligned timelines, an integration\n  analyst). 
**Do not** label two parallel but unconnected studies \"mixed methods.\"\n- **vs a pragmatic trial or hybrid trial that bolts on a process evaluation:** in RWE the QUANT strand is usually\n  observational, which is cheaper and faster than a trial but inherits all observational confounding; the QUAL\n  strand cannot fix biased point estimates. **Prefer a hybrid trial** when you can randomize; **prefer observational\n  mixed methods** when randomization is infeasible and you still need mechanism + magnitude.\n\n**When to use**. Use mixed methods when (a) a quantitative finding is *surprising, heterogeneous, or null* and you\nneed qualitative explanation (explanatory sequential); (b) you must *develop or content-validate a PRO/instrument*\nbefore fielding it (exploratory sequential); (c) you need *corroboration across data types* for a high-stakes claim\n(convergent); (d) you are studying *implementation, adoption, or de-adoption* of a therapy/program where uptake and\nfidelity matter as much as effect (effectiveness-implementation hybrid); or (e) a payer/HTA/regulatory audience has\nexplicitly asked \"we believe the number, now tell us why and whether it will hold in practice.\"\n\n**When NOT to use — and when it is actively misleading or dangerous**.\n- **When the decision turns on a single precise estimate and mechanism is settled.** A bolted-on QUAL strand wastes\n  resources and can *dilute* a clean comparative-effectiveness message; reviewers may read the qualitative material\n  as the study hedging an unconvincing primary result.\n- **When integration is not actually planned.** The most common and most dangerous failure is the \"QUAL fig leaf\":\n  a handful of unsystematic quotes used to *illustrate* the quantitative result post hoc, presented as if it were\n  triangulated evidence. 
This launders selection-biased anecdote into the authority of a number and is worse than\n  reporting the QUANT alone.\n- **When the qualitative sample is silently selected by the quantitative strand and that selection is ignored.** If\n  you purposively sample interviewees from a claims-defined subgroup (e.g., persistent users), their narratives do\n  not generalize to discontinuers — treating them as the patient voice is a selection error masquerading as depth.\n- **When divergence between strands is buried instead of interrogated.** If the interviews contradict the claims\n  result, that *discordance is the finding*; suppressing it to present a tidy convergent story is misleading.\n- **When timelines or budgets force a degenerate strand.** Three rushed interviews or an under-powered survey\n  appended to a rigorous claims analysis is not mixed methods — it is a weak study with a misleading label.\n\n**Data-source operational depth**.\n- **Claims (FFS vs MA vs commercial):** Excellent as the QUANT *sampling frame and denominator* (continuous\n  enrollment, `fill_date`, `days_supply`, PDC quintiles, HCRU, cost), but claims carry **no patient voice** — the\n  QUAL strand must be recruited externally, which requires consent/contact infrastructure that *selects* on\n  engagement, literacy, and willingness. Linking a claims cohort to interviewable patients typically routes through\n  a health plan or provider, introducing site/plan selection. Medicare Advantage-only person-time lacks FFS claims,\n  so a sampling frame built on claims completeness silently excludes MA enrollees; document which population your\n  QUAL inferences actually cover. 
Failure mode: sampling interviewees only from members reachable by the plan and\n  generalizing to \"patients.\"\n- **EHR:** Free-text notes and NLP can *substitute for* a primary qualitative strand at scale (e.g., theme\n  extraction from clinician notes), trading depth and the patient's own framing for volume; notes capture the\n  *clinician's* construction of the encounter, not the patient's lived experience. EHR also supports recruiting real\n  interviewees (contact info, active patients), but visit-driven capture means patients who leave the system are\n  differentially absent from both strands. Failure mode: passing NLP-coded note themes off as qualitative\n  integration without acknowledging the clinician-mediated lens.\n- **Registry:** Often *pre-bundles* PROs and patient-reported context, which is convenient but conflates\n  *measurement* (a fixed instrument captured prospectively) with *integration* (linking that signal to an\n  independent qualitative inquiry). Registries are strong for adjudicated outcomes and disease severity that can\n  stratify purposive sampling. Failure mode: reporting registry PRO scores beside clinical outcomes and calling it\n  mixed methods when no qualitative strand or meta-inference exists.\n- **Linked claims-EHR-registry-PRO:** The ideal substrate — claims completeness + EHR depth + registry adjudication\n  + a recruitable, consented population for QUAL — but linkage selects on the linkable subset and creates\n  date-reconciliation problems across order/fill/service/survey dates that must be resolved before a strand can be\n  sampled from another. 
Failure mode: assuming the linked, consented subset represents the source population for\nqualitative inferences.\n\n**Worked example (explanatory sequential, claims QUANT -> QUAL).** Question: *why do roughly a third of initiators\nof a chronic oral therapy become poorly adherent, and what does that look like in claims?* (1) **QUANT strand\n(claims).** Build a new-user cohort with 365 days of continuous medical+pharmacy enrollment before the first\nqualifying `fill_date`; require no fill of the drug class in the washout. Over the 12 months following that first\nfill, compute proportion of days covered (PDC) from `fill_date` + `days_supply`, capping overlap per stockpiling\nrules. Classify patients by PDC: flag the low-adherence stratum (PDC < 0.40) as poorly adherent and the\nhigh-adherence stratum (PDC >= 0.90) as highly adherent, excluding MA-only person-time so adherence is measured on\nobservable fills. (2) **Connect (the integration point).** Use the QUANT result to drive **stratified purposeful\nsampling**: draw ~N=12 from the low-adherence stratum and ~N=12 from the high-adherence stratum (sampling to\nthematic saturation, not statistical power), recruited through the contributing plan with documented consent, and\nrecord that this routes only through reachable, consenting members. (3) **QUAL strand.** Semi-structured interviews\non access, side effects, cost, beliefs, and the texture of daily dosing; code thematically. (4) **Merge /\nmeta-inference (joint display).** Lay claims-observable patterns (early `days_supply` gaps, switching, 90-day\nmail-order vs 30-day retail, cost-sharing tier) against interview themes in a single joint display, explicitly\nnoting *concordance* (gappy fill patterns <-> reported cost barriers) and *discordance* (apparently adherent fill\nrecords <-> reported pill-skipping = \"primary refill, secondary non-ingestion,\" invisible to claims). 
The meta-inference — that claims under-detect a behavioral non-adherence\nphenotype concentrated among cost-strained patients — is the deliverable that neither strand produced alone and that\ndirectly informs an adherence intervention. Report the QUAL sampling selection (plan-reachable, consenting) as a\nbound on transportability.",
    "primary_category": "Study_Design",
    "tags": [
      "mixed-methods",
      "convergent-design",
      "explanatory-sequential",
      "exploratory-sequential",
      "integration",
      "joint-display",
      "implementation-science",
      "qualitative",
      "patient-experience"
    ],
    "applies_to_study_types": [
      "mixed_methods"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1370/afm.104",
        "url": "https://doi.org/10.1370/afm.104",
        "citation_text": "Creswell JW, Fetters MD, Ivankova NV. Designing a mixed methods study in primary care. The Annals of Family Medicine. 2004;2(1):7-12.",
        "year": 2004,
        "authors_short": "Creswell et al.",
        "notes": "Foundational, health-services-oriented statement of mixed-methods design types (instrument, triangulation, data-transformation models) and the priority/sequence decisions that define the design."
      },
      {
        "role": "explain",
        "doi": "10.1097/MLR.0b013e3182408812",
        "url": "https://doi.org/10.1097/MLR.0b013e3182408812",
        "citation_text": "Curran GM, Bauer M, Mittman B, Pyne JM, Stetler C. Effectiveness-implementation hybrid designs: combining elements of clinical effectiveness and implementation research to enhance public health impact. Medical Care. 2012;50(3):217-226.",
        "year": 2012,
        "authors_short": "Curran et al.",
        "notes": "The bridge that makes mixed methods central to RWE/HEOR — blends a comparative-effectiveness estimand with implementation (adoption/reach/fidelity) data to speed evidence-to-practice."
      },
      {
        "role": "demonstrate",
        "doi": "10.1111/1475-6773.12117",
        "url": "https://doi.org/10.1111/1475-6773.12117",
        "citation_text": "Fetters MD, Curry LA, Creswell JW. Achieving integration in mixed methods designs—principles and practices. Health Services Research. 2013;48(6 Pt 2):2134-2156.",
        "year": 2013,
        "authors_short": "Fetters et al.",
        "notes": "Operationalizes integration at the design (connecting/building/merging), methods, and interpretation levels, and defines the joint display — the rigor criterion this catalog enforces."
      },
      {
        "role": "use",
        "doi": "10.1258/jhsrp.2007.007074",
        "url": "https://doi.org/10.1258/jhsrp.2007.007074",
        "citation_text": "O'Cathain A, Murphy E, Nicholl J. The quality of mixed methods studies in health services research. Journal of Health Services Research & Policy. 2008;13(2):92-98.",
        "year": 2008,
        "authors_short": "O'Cathain et al.",
        "notes": "Empirical appraisal of how mixed-methods studies are actually conducted and reported in health services research, and the GRAMMS quality criteria used to judge integration."
      }
    ],
    "plain_language_summary": "A mixed-methods study collects two kinds of evidence on purpose: numbers that tell you how much or how often (the quantitative strand), and people's words and experiences that tell you why or how (the qualitative strand). The key word is integration — the two strands are designed from the start to inform each other, not just exist side by side in the same report. A common pattern in drug research is to use medical records or insurance claims to find which patients did poorly, then follow up with interviews to understand the reasons — producing an insight that neither source could reach alone.",
    "key_terms": [
      {
        "term": "quantitative strand",
        "definition": "The part of the study that produces numbers — counts, rates, costs, or comparisons — usually drawn from insurance claims, electronic health records, or a registry."
      },
      {
        "term": "qualitative strand",
        "definition": "The part of the study that captures meaning through words — interviews, focus groups, or structured analysis of clinical notes — to understand why patients or clinicians behave the way they do."
      },
      {
        "term": "integration",
        "definition": "The deliberate act of connecting the two strands so that one informs the other — for example, using numbers to decide who to interview, then placing interview themes alongside claims patterns in a single comparison table."
      },
      {
        "term": "joint display",
        "definition": "A side-by-side table or diagram that shows the quantitative finding and the qualitative theme for the same people or groups, making it easy to see where the two strands agree or disagree."
      },
      {
        "term": "meta-inference",
        "definition": "The conclusion you can only reach after placing both strands together — the insight that neither numbers alone nor words alone could produce."
      },
      {
        "term": "purposive sampling",
        "definition": "Choosing interview participants intentionally based on a specific characteristic — such as the patients who filled their prescriptions least often — rather than picking them at random."
      }
    ],
    "worked_example": {
      "scenario": "A research team wants to understand why one in three patients who start a daily oral medication for a chronic condition stop filling it within a year. They use pharmacy claims to measure how consistently each patient refilled, then recruit patients from the most- and least-consistent groups for phone interviews. The explanatory sequential design means the numbers come first and steer who gets interviewed.",
      "dataset": {
        "caption": "Mapping the research question across both strands and their integration point",
        "columns": [
          "Element",
          "Quantitative strand (claims)",
          "Qualitative strand (interviews)",
          "Integration point"
        ],
        "rows": [
          [
            "Research question",
            "How many patients refilled consistently, and what does the refill pattern look like over 12 months?",
            "Why do patients with gappy refill records say they stopped filling? What do consistent fillers do differently?",
            "Do the reasons patients give match the patterns visible in the claims?"
          ],
          [
            "Data source",
            "Pharmacy claims: one row per fill, with the fill date and the number of days the prescription is meant to last",
            "Phone interviews with 12 patients who refilled least and 12 who refilled most, recruited through the health plan",
            "Both sources linked by patient ID"
          ],
          [
            "Key output",
            "Each patient gets a score from 0 to 1 showing what share of the year they had medication on hand; patients split into low and high groups",
            "Themes coded from transcripts: cost burden, side-effect concerns, daily routine, forgetting, feeling better",
            "Joint display table: each patient row shows their refill score alongside their top interview theme"
          ],
          [
            "What it reveals alone",
            "Tells you the size of the gap problem and which patients are affected — but not why",
            "Tells you the reasons in patients' own words — but only for the small group interviewed, and only for those the plan could reach",
            "Reveals that gappy refill records match cost-barrier themes in most patients, but some patients with high scores still report skipping doses — a blind spot the numbers miss entirely"
          ],
          [
            "Unique integrated insight",
            "Neither strand alone shows this",
            "Neither strand alone shows this",
            "Claims under-count a behavioral non-adherence group: patients who refill on time but quietly skip doses because of side effects — an intervention target invisible to refill data"
          ]
        ]
      },
      "steps": [
        "Run the quantitative strand first: pull all patients who started the medication with no prior fills in the look-back period, then calculate their refill consistency score over the next 12 months using each fill date and how many days that fill was supposed to last.",
        "Split patients into two groups based on their scores: a low group (filled less than 40 percent of the year) and a high group (filled 90 percent or more of the year).",
        "Use those two groups as the sampling frame for the qualitative strand: recruit roughly 12 patients from each group through the health plan, which requires consent and means only reachable, willing patients are included.",
        "Conduct semi-structured phone interviews asking about access, cost, side effects, daily routines, and their own sense of how well they took the medication.",
        "Code the interview transcripts into themes, then build the joint display: place each interviewed patient's refill score next to their top interview theme in a single table.",
        "Read across the joint display for concordance (gappy scores match cost-barrier theme — the two strands agree) and discordance (high scores but the patient reports skipping doses — the claims missed this entirely).",
        "State the meta-inference: the integrated finding is that claims-visible refilling and actual ingestion can come apart, and the patients who look adherent in the data but are silently skipping doses represent a distinct group that would be missed by any intervention designed from numbers alone."
      ],
      "result": "The integrated insight is that refill records and actual pill-taking diverge for a subgroup of patients who pick up their prescriptions on time but skip doses because of tolerable but persistent side effects. Neither the claims data (which only sees fills, not ingestion) nor the interviews alone (which cannot show how representative this group is) could surface this finding. The joint display makes the blind spot visible and points directly at a new intervention target."
    },
    "prerequisites": [
      "qualitative-interview",
      "comparative-effectiveness-research-cer-methods",
      "picots-framework-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Convergent (parallel) design",
        "description": "QUANT and QUAL strands collected and analyzed concurrently with comparable weight, then merged in a joint display to corroborate, expand, or surface divergence (e.g., a claims adherence analysis run alongside contemporaneous patient interviews).",
        "edge_cases": [
          "Strands must measure compatible constructs at compatible levels; merging a population-level rate with a handful of deep narratives invites apples-to-oranges meta-inference.",
          "Divergence between strands is a substantive finding, not an inconvenience to be reconciled away."
        ],
        "data_source_notes": "claims: provides the population-level QUANT denominator and rates; QUAL must be recruited externally with consent, which selects on reachability. ehr: notes/NLP can supply a concurrent text strand but reflects the clinician's lens."
      },
      {
        "name": "Explanatory sequential (QUANT -> QUAL)",
        "description": "A quantitative result is explained by a follow-on qualitative strand whose sampling frame is driven by the QUANT findings (connecting), e.g., claims-defined adherence quintiles feeding stratified purposive sampling of interviewees.",
        "edge_cases": [
          "The QUAL sample is selected by the QUANT strand, so its narratives generalize only to that QUANT-defined subgroup; state this as a transportability bound.",
          "Choosing whom to follow up (extreme cases vs typical cases vs disconfirming cases) changes the inference."
        ],
        "data_source_notes": "claims/ehr: ideal because the QUANT cohort directly defines the purposive sampling frame (e.g., PDC quintiles, switchers, discontinuers); recruitment must route through a plan/provider with consent."
      },
      {
        "name": "Exploratory sequential (QUAL -> QUANT)",
        "description": "Qualitative work first surfaces constructs/domains that build an instrument or hypothesis, which is then measured at scale (building), e.g., interviews surface treatment-burden domains that become a PRO later validated in registry data.",
        "edge_cases": [
          "The instrument is only as good as the qualitative grounding; thin or unsaturated QUAL produces a brittle measure.",
          "Quantitative validation of the built instrument is a separate measurement task (content/construct validity), not integration per se."
        ],
        "data_source_notes": "registry/pro: the natural home for the QUANT phase because PROs/instruments are captured prospectively; link to claims/ehr to test the instrument against observable utilization."
      },
      {
        "name": "Effectiveness-implementation hybrid",
        "description": "Simultaneously tests a clinical/comparative-effectiveness estimand and gathers implementation data (adoption, reach, fidelity, context) via a QUAL/process strand, accelerating uptake of effective therapies (Curran typology, Hybrid Types 1-3).",
        "edge_cases": [
          "The balance of effectiveness vs implementation focus (Type 1/2/3) must be declared a priori; drifting between them mid-study undermines both estimands.",
          "Process/fidelity data can explain a null or attenuated real-world effect that a pure effect estimate would leave unexplained."
        ],
        "data_source_notes": "linked/ehr: implementation metrics (who was offered, who adopted, fidelity) often live in EHR workflow and operational data, paired with a claims/registry effectiveness analysis."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Stand-alone quantitative RWE study (e.g., claims comparative effectiveness)",
        "pros_of_this": "Adds mechanism, context, acceptability, and reasons for heterogeneity that a point estimate cannot supply; can explain surprising, null, or heterogeneous results.",
        "cons_of_this": "Roughly doubles design surface, needs dual expertise and longer timelines; if the effect estimate alone answers the decision, the QUAL strand is overhead and can dilute the message.",
        "when_to_prefer": "When an effect estimate alone will not move behavior/policy because decision-makers need to understand implementation, acceptability, or unexplained heterogeneity."
      },
      {
        "compared_to": "Stand-alone qualitative study",
        "pros_of_this": "Anchors prevalence, magnitude, and generalizable structure so qualitative themes are not over-extrapolated.",
        "cons_of_this": "Qualitative depth can be diluted when forced to serve a fixed quantitative sampling frame.",
        "when_to_prefer": "When both representativeness/magnitude and depth/meaning are required for the decision."
      },
      {
        "compared_to": "Two separate (unintegrated) studies reported together",
        "pros_of_this": "Plans the connect/build/merge integration point a priori and reports meta-inferences neither strand yields alone (the joint display).",
        "cons_of_this": "Integration is the hardest, most failure-prone step and demands a single protocol, aligned timelines, and an integration analyst.",
        "when_to_prefer": "Whenever the value is in the cross-strand inference; never label parallel-but-unconnected work as mixed methods."
      },
      {
        "compared_to": "Pragmatic / hybrid randomized trial with a process evaluation",
        "pros_of_this": "Observational QUANT strand is cheaper, faster, and feasible where randomization is not; still pairs magnitude with mechanism.",
        "cons_of_this": "Inherits observational confounding; the QUAL strand cannot repair a biased quantitative point estimate.",
        "when_to_prefer": "When randomization is infeasible but mechanism plus magnitude are both needed; prefer a hybrid trial when you can randomize."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Strong as the QUANT sampling frame and denominator (continuous enrollment, fill_date, days_supply, PDC, HCRU, cost) but carries no patient voice; the QUAL strand must be recruited externally via consent/contact infrastructure that selects on engagement and reachability. Exclude MA-only person-time when building a completeness-dependent sampling frame and document the population QUAL inferences actually cover.",
      "ehr": "Notes/NLP can substitute for a primary qualitative strand at scale, trading depth and patient framing for volume and a clinician-mediated lens; EHR also supports recruiting active, contactable interviewees. Visit-driven capture differentially drops patients who leave the system from both strands.",
      "registry": "Often pre-bundles PROs/patient-reported context, which conflates measurement (a fixed prospective instrument) with integration (linking that signal to an independent qualitative inquiry). Strong for adjudicated outcomes/severity to stratify purposive sampling.",
      "linked": "Ideal substrate (claims completeness + EHR depth + registry adjudication + a consented, recruitable population) but linkage selects on the linkable subset and creates order/fill/service/survey date discrepancies that must be reconciled before one strand can be sampled from another."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\nimport numpy as np\n\nLOW_PDC, HIGH_PDC = 0.40, 0.90      # bottom / top adherence strata from the QUANT landmark\nN_PER_STRATUM = 12                  # sample to saturation, not statistical power\n\ndef select_purposive_sample(cohort: pd.DataFrame, seed: int = 1) -> pd.DataFrame:\n    # Eligible for QUAL = observable adherence (drop MA-only) AND a real consent/contact path.\n    elig = cohort[(~cohort[\"ma_only\"]) & (cohort[\"reachable\"])].copy()\n    elig[\"stratum\"] = np.where(elig[\"pdc\"] < LOW_PDC, \"poorly_adherent\",\n                        np.where(elig[\"pdc\"] >= HIGH_PDC, \"highly_adherent\", \"middle\"))\n    extremes = elig[elig[\"stratum\"].isin([\"poorly_adherent\", \"highly_adherent\"])]\n    # Stratified purposive draw; selection is plan-reachable + consenting -> a transportability bound, not \"patients.\"\n    sample = (extremes.groupby(\"stratum\", group_keys=False)\n                      .apply(lambda g: g.sample(min(len(g), N_PER_STRATUM), random_state=seed)))\n    return sample[[\"person_id\", \"stratum\", \"pdc\", \"cost_share_tier\", \"switched\", \"early_gap_days\"]]\n\ndef build_joint_display(sample: pd.DataFrame, themes: pd.DataFrame) -> pd.DataFrame:\n    # Merge claims-observable patterns (QUANT) against coded interview themes (QUAL) per person.\n    top_theme = (themes.sort_values(\"salience\", ascending=False)\n                       .groupby(\"person_id\").first().reset_index())\n    jd = sample.merge(top_theme, on=\"person_id\", how=\"left\")\n    # Concordance: does the dominant qualitative theme line up with the claims-observable signal?\n    jd[\"concordant\"] = np.where(\n        (jd[\"stratum\"] == \"poorly_adherent\") & (jd[\"theme\"] == \"cost_barrier\") & (jd[\"cost_share_tier\"] >= 3),\n        True,\n        np.where((jd[\"stratum\"] == \"highly_adherent\") & (jd[\"theme\"] == \"routine_established\"), True, False),\n    )\n    # Discordance to interrogate: claims look adherent 
but interviews report pill-skipping (invisible to claims).\n    jd[\"claims_blind_spot\"] = (jd[\"stratum\"] == \"highly_adherent\") & (jd[\"theme\"] == \"skips_doses\")\n    return jd",
        "description": "Integration step for an explanatory-sequential design: turn a claims QUANT result into a stratified purposive\nQUAL sampling frame, then assemble a joint display. This is the code that earns its keep in a mixed-methods\nstudy — the integration logic, not statistical estimation. Required inputs (already cleaned):\n  cohort : one row per new initiator -> person_id, pdc (0-1 over the landmark), ma_only (bool), reachable (bool;\n           has plan-mediated consent/contact path), cost_share_tier, switched (bool), early_gap_days\n  themes : qualitative codes after interviews -> person_id, theme, salience (1-5)   # added AFTER sampling\nStep 1 selects whom to interview from PDC extremes (connecting); Step 2 (run later) merges coded themes against\nclaims-observable patterns into a joint-display table with explicit concordance flags.",
        "dependencies": [
          "pandas",
          "numpy"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\nLOW_PDC <- 0.40; HIGH_PDC <- 0.90\nN_PER_STRATUM <- 12L\n\nselect_purposive_sample <- function(cohort, seed = 1L) {\n  setDT(cohort)\n  elig <- cohort[!ma_only & reachable]                       # observable adherence + a consent/contact path\n  elig[, stratum := fifelse(pdc < LOW_PDC, \"poorly_adherent\",\n                      fifelse(pdc >= HIGH_PDC, \"highly_adherent\", \"middle\"))]\n  ext <- elig[stratum %chin% c(\"poorly_adherent\", \"highly_adherent\")]\n  set.seed(seed)\n  # Stratified purposive draw; selection = plan-reachable + consenting (a transportability bound).\n  sample <- ext[, .SD[sample(.N, min(.N, N_PER_STRATUM))], by = stratum]\n  sample[, .(person_id, stratum, pdc, cost_share_tier, switched, early_gap_days)]\n}\n\nbuild_joint_display <- function(sample, themes) {\n  setDT(sample); setDT(themes)\n  setorder(themes, -salience)\n  top_theme <- themes[, .SD[1L], by = person_id]\n  jd <- merge(sample, top_theme, by = \"person_id\", all.x = TRUE)\n  # Concordance between the QUANT stratum and the dominant QUAL theme.\n  jd[, concordant := (stratum == \"poorly_adherent\" & theme == \"cost_barrier\" & cost_share_tier >= 3) |\n                     (stratum == \"highly_adherent\" & theme == \"routine_established\")]\n  # Claims blind spot: looks adherent in claims, but interviews report dose-skipping.\n  jd[, claims_blind_spot := stratum == \"highly_adherent\" & theme == \"skips_doses\"]\n  jd[]\n}",
        "description": "Same integration logic in R (data.table): build the purposive QUAL sampling frame from PDC extremes, then\nassemble the joint display with concordance / claims-blind-spot flags. Inputs mirror the Python version:\n  cohort : person_id, pdc, ma_only (logical), reachable (logical), cost_share_tier, switched, early_gap_days\n  themes : person_id, theme, salience (1-5)   # available only AFTER interviews are coded",
        "dependencies": [
          "data.table"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "%let low_pdc = 0.40;\n%let high_pdc = 0.90;\n%let n_stratum = 12;\n\n/* Step 1 (connecting): eligible = observable adherence (not MA-only) AND a consent/contact path; assign strata. */\ndata elig;\n  set work.cohort;\n  where ma_only = 0 and reachable = 1;\n  length stratum $15;\n  if pdc < &low_pdc then stratum = \"poorly_adherent\";\n  else if pdc >= &high_pdc then stratum = \"highly_adherent\";\n  else stratum = \"middle\";\n  if stratum in (\"poorly_adherent\",\"highly_adherent\");\nrun;\n\nproc sort data=elig; by stratum; run;\n\n/* Stratified PURPOSIVE draw to saturation (not power); selection = plan-reachable + consenting. */\nproc surveyselect data=elig out=qual_sample method=srs n=&n_stratum seed=1;\n  strata stratum;\nrun;\n\n/* Step 2 (merging): dominant theme per person, then the joint display with concordance flags. */\nproc sort data=work.themes; by person_id descending salience; run;\ndata top_theme; set work.themes; by person_id; if first.person_id; run;\n\nproc sort data=qual_sample; by person_id; run;\ndata joint_display;\n  merge qual_sample(in=a) top_theme(keep=person_id theme);\n  by person_id; if a;\n  /* Concordance between the QUANT stratum and the dominant QUAL theme. */\n  concordant = (stratum = \"poorly_adherent\" and theme = \"cost_barrier\" and cost_share_tier >= 3) or\n               (stratum = \"highly_adherent\" and theme = \"routine_established\");\n  /* Claims blind spot: adherent in claims but interviews report dose-skipping (invisible to fills). */\n  claims_blind_spot = (stratum = \"highly_adherent\" and theme = \"skips_doses\");\nrun;",
        "description": "Same integration step in SAS: derive PDC strata, take a stratified purposive QUAL sample with PROC SURVEYSELECT,\nand build the joint display by merging coded themes against claims-observable patterns. Required input datasets:\n  work.cohort : person_id, pdc, ma_only (0/1), reachable (0/1), cost_share_tier, switched, early_gap_days\n  work.themes : person_id, theme, salience (1-5)   /* loaded AFTER interviews are coded */\nSURVEYSELECT here draws a fixed N per stratum to thematic saturation, NOT a probability sample for inference.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Q[Decision question:<br/>magnitude AND mechanism/context needed?] -->|No, estimate alone suffices| Stop[Use a single-strand study]\n  Q -->|Yes| Seq{Priority + sequence?}\n  Seq -->|Concurrent, equal weight| Conv[Convergent design<br/>QUANT + QUAL in parallel -> merge in joint display]\n  Seq -->|Explain a result| Expl[Explanatory sequential<br/>QUANT result -> drives QUAL sampling frame]\n  Seq -->|Build an instrument/hypothesis| Expr[Exploratory sequential<br/>QUAL -> build measure -> QUANT at scale]\n  Seq -->|Effect + uptake together| Hyb[Effectiveness-implementation hybrid<br/>effectiveness estimand + process/fidelity QUAL]\n  Conv --> Integ[Integration point:<br/>connect / build / merge]\n  Expl --> Integ\n  Expr --> Integ\n  Hyb --> Integ\n  Integ --> Meta[Meta-inference + joint display<br/>report concordance AND discordance]\n  Meta -.->|NO planned integration| Fig[QUAL fig leaf:<br/>post-hoc quotes illustrating the number = NOT mixed methods]",
        "caption": "Decision logic for choosing a mixed-methods design and its integration point. The hazard is the dashed path — unplanned, illustrative-only qualitative material masquerading as integration.",
        "alt_text": "Flowchart starting from whether both magnitude and mechanism are needed, branching into convergent, explanatory-sequential, exploratory-sequential, and effectiveness-implementation hybrid designs, all converging on an integration point and meta-inference joint display, with a warning branch for the QUAL fig-leaf failure mode.",
        "source_type": "illustrative",
        "source_citations": [
          "fetters-2013"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  subgraph QUANT[QUANT strand: claims]\n    A[New-user cohort<br/>continuous enrollment] --> B[PDC over landmark<br/>fill_date + days_supply] --> C[Adherence quintiles<br/>drop MA-only person-time]\n  end\n  C -->|connect: stratified purposive sampling<br/>plan-reachable + consenting| D\n  subgraph QUAL[QUAL strand: interviews]\n    D[Bottom + top PDC strata<br/>N to saturation] --> E[Semi-structured interviews] --> F[Thematic coding]\n  end\n  C --> G[Joint display]\n  F --> G\n  G --> H[Meta-inference:<br/>concordance e.g. gappy fills <-> cost barriers<br/>discordance e.g. adherent fills <-> reported dose-skipping]",
        "caption": "Explanatory-sequential integration in claims-based RWE. The QUANT strand defines the sampling frame for the QUAL strand (connecting), and both feed a joint display whose meta-inference surfaces a behavioral non-adherence phenotype invisible to claims alone.",
        "alt_text": "Left-to-right diagram showing a claims QUANT strand (cohort, PDC, quintiles) connecting via stratified purposive sampling to a QUAL interview strand (sampling, interviews, coding), both feeding a joint display and a meta-inference about concordant and discordant adherence patterns.",
        "source_type": "illustrative",
        "source_citations": [
          "creswell-2004"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "used_with",
        "target_slug": "qualitative-interview",
        "notes": "Semi-structured interviews are the most common qualitative strand in RWE mixed-methods studies, typically purposively sampled from a quantitatively defined cohort."
      },
      {
        "relation_type": "used_with",
        "target_slug": "qualitative-ethnographic",
        "notes": "Observation/ethnography supplies the contextual QUAL strand for implementation-focused (hybrid) designs."
      },
      {
        "relation_type": "used_with",
        "target_slug": "qualitative-synthesis",
        "notes": "Qualitative synthesis can constitute or summarize the QUAL evidence base feeding an exploratory-sequential instrument-building phase."
      },
      {
        "relation_type": "used_with",
        "target_slug": "pro-rwe",
        "notes": "Exploratory-sequential designs frequently build a PRO/instrument from qualitative work before measuring it at scale; the QUAL strand grounds content validity."
      },
      {
        "relation_type": "see_also",
        "target_slug": "preference-study",
        "notes": "Patient-preference research is often embedded as the qualitative or mixed strand explaining quantitative uptake and adherence patterns."
      },
      {
        "relation_type": "see_also",
        "target_slug": "comparative-effectiveness-research-cer-methods",
        "notes": "The quantitative strand in an effectiveness-implementation hybrid is typically an observational CER analysis; the QUAL strand explains heterogeneity and uptake."
      },
      {
        "relation_type": "alternative_to",
        "target_slug": "pragmatic-trial",
        "notes": "A pragmatic or hybrid trial with an embedded process evaluation is the randomized analogue; observational mixed methods is the option when randomization is infeasible."
      }
    ],
    "aliases": [
      "mixed methods",
      "mixed-methods research",
      "MMR",
      "convergent parallel design",
      "explanatory sequential design",
      "exploratory sequential design",
      "QUAL-QUAN integration",
      "effectiveness-implementation hybrid design"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "mmrm-repeated-measures-rwe",
    "name": "Mixed Model for Repeated Measures (MMRM) in RWE",
    "short_definition": "A likelihood-based longitudinal model for a continuous outcome measured at categorical visits, treating time as a factor with an unstructured within-subject residual covariance and no random effects, yielding visit-specific treatment contrasts that are valid under missing-at-random without explicit imputation.",
    "long_description": "The **mixed model for repeated measures (MMRM)** analyzes a continuous outcome (or change from baseline) measured\nlongitudinally, modeling **visit as a categorical factor** and a **treatment-by-visit interaction**, with an\n**unstructured (UN) within-subject residual covariance** and — in its canonical form — **no random effects**. Because\nit is fit by (restricted) maximum likelihood on all available records, MMRM produces an asymptotically unbiased\nestimate of the visit-specific mean contrast (e.g., difference in change from baseline at the target visit) under\n**missing-at-random (MAR)** missingness, *without* imputing the missing values. In confirmatory trials MMRM is the\nFDA/EMA-default primary analysis for continuous longitudinal endpoints; this entry (`-rwe`) is specifically about what\nchanges — and what breaks — when the same model is applied to real-world EHR, registry, or linked data where visits are\nnot protocol-driven and the MAR assumption is far weaker.\n\n**Core estimand distinction.** The MMRM estimand is a **marginal (population-average) mean contrast at a pre-specified\ntarget visit** — typically change from baseline at, say, month 12 — under a stated intercurrent-event strategy. With no\nrandom effects, the regression coefficients are marginal, so the treatment-by-visit term is interpreted as the average\ngroup difference at each visit, *not* a subject-specific trajectory effect. Critically, MMRM only addresses **missing\ndata**; it does **not** define how intercurrent events (death, treatment discontinuation, disenrollment, switching) are\nhandled. That is an estimand choice you must make explicitly: a **treatment-policy** estimand keeps post-event\nmeasurements and analyzes them; a **hypothetical** estimand treats post-event values as MAR-missing (the usual implicit\nMMRM behavior); a **composite/while-on-treatment** estimand redefines the variable. 
In RWE the most defensible default\nis to name the strategy per ICH E9(R1) and pair MMRM with sensitivity analyses, because death and disenrollment are\noften *informative* (MNAR), which MMRM silently mishandles. See `estimands-ate-att-intercurrent-events-rwe`.\n\n**Interpreting the output**\n\nConsider the worked example: 200 patients (Drug A vs Drug B) with eGFR measured at\nbaseline, Month 3, Month 6, and Month 12. Drug A patients gain an average of 3.2 mL/min\nfrom baseline by Month 12; Drug B patients lose an average of 3.2 mL/min. The MMRM\nyields an adjusted mean difference at Month 12 of 6.4 mL/min (95% CI 4.1 to 8.7 mL/min)\nfavoring Drug A, based on all 200 patients who contributed at least one post-baseline\nmeasurement — including patients who missed one intermediate visit.\n\nFormal interpretation: The arm-by-visit interaction term at Month 12 is the primary\nresult. It estimates the average difference in eGFR change from baseline between the\ntwo arms at Month 12 under the missing-at-random assumption: patients who missed the\nMonth 6 visit are included using information borrowed from their own observed trajectory\nvia the unstructured within-person residual covariance. This is a marginal (population-\naverage) contrast — no random patient effects enter the model — valid under MAR without\nimputation. It is not a subject-specific prediction for any individual patient.\n\nPractical interpretation: Drug A patients gain approximately 6.4 mL/min more eGFR by\nMonth 12 than Drug B patients on average. This is the per-visit contrast the FDA/EMA\nrecognize as the primary longitudinal endpoint estimand for continuous outcomes under\nMAR. 
Because MAR is an assumption that cannot be verified — patients may drop out\nbecause their eGFR is worsening — the primary analysis should always be paired with a\nsensitivity analysis under MNAR assumptions (delta-adjusted or reference-based multiple\nimputation, or a tipping-point analysis).\n\n**Pros, cons, and trade-offs.**\n- **vs a linear mixed model with random slopes (`mixed-effects-models-longitudinal-rwe`):** MMRM treats time as\n  categorical with a UN residual covariance and *no* random effects, making no assumption about the shape of the mean\n  trajectory; the LMM imposes a parametric trajectory (random intercept/slope) and is more efficient *if that shape is\n  right* and far more robust to irregular/continuous visit timing. **Prefer MMRM** for a small fixed set of nominal\n  visits where the trajectory shape is unknown and a clean per-visit contrast is wanted; **prefer the LMM** when visits\n  are irregularly timed (the RWE norm), when you need a slope/rate-of-change estimand, or when UN is unestimable.\n- **vs GEE (`gee-population-average-models-rwe`):** Both target a marginal mean. The decisive difference is missingness:\n  MMRM is likelihood-based and valid under **MAR**; standard GEE is moment-based and valid only under **MCAR** unless it\n  is inverse-probability-weighted. Since real-world dropout is rarely MCAR, **prefer MMRM** for incomplete continuous\n  Gaussian outcomes; reserve GEE for non-Gaussian outcomes (binary/count) or when a working-correlation, design-based\n  interpretation is wanted.\n- **vs multiple imputation + ANCOVA (`multiple-imputation-longitudinal-rwe`):** MMRM handles MAR *implicitly* and only\n  MAR. MI is explicit and extends naturally to **MNAR sensitivity** (pattern-mixture, delta/reference-based imputation,\n  tipping-point). 
**Prefer MMRM** as the primary MAR analysis; **add MI/delta-adjustment** as the required sensitivity\n  layer when informative dropout is plausible — which, in RWE, it almost always is.\n- **vs joint longitudinal–survival models:** When dropout is driven by a terminal event (death) correlated with the\n  outcome, MMRM's MAR assumption fails and the visit-specific mean among survivors is a biased/ill-defined target. Joint\n  models (or a survivor-average causal estimand) address this; MMRM does not.\n\n**When to use.** A continuous, repeatedly measured outcome on a small set of **nominal (categorical) visits** — serial\nHbA1c or eGFR, weight/BMI, blood pressure, PHQ-9/PROMIS or other PRO scores in `pro-rwe` programs, registry-collected\nHRQoL feeding QALY mapping — where missingness is plausibly MAR conditional on observed history and treatment, and the\nestimand is a per-visit mean difference. MMRM is the natural primary analysis when you want the FDA/EMA-recognized\napproach and can bin records to a defensible visit grid.\n\n**When NOT to use — and when it is actively misleading or dangerous.**\n- **Irregular, continuous-time visit timing with no meaningful nominal grid.** Forcing real-world encounters into\n  \"month 3 / month 6\" windows when patients are seen on idiosyncratic schedules either discards information or mislabels\n  measurements; UN covariance becomes meaningless. Use a continuous-time LMM (random slopes / spline in time) instead —\n  and stop calling it MMRM.\n- **Terminal events / competing risk of death.** If sicker patients die and drop out, the among-survivors mean MMRM\n  estimates drifts toward the healthy and the treatment contrast is biased. 
This is the single most dangerous misuse in\n  elderly or oncology RWE; MAR does not hold and MMRM gives a confidently wrong number.\n- **Differential measurement frequency by arm.** When one arm is monitored more intensively (sicker patients have more\n  contact), the observed data process differs by exposure and MAR conditional on a misspecified mean model fails.\n- **Non-Gaussian or bounded/floor-ceiling outcomes** (counts, heavily skewed costs, scores piled at the boundary):\n  MMRM's normality and constant-variance-within-visit assumptions break; use GLMM/GEE or a transformation.\n- **The real question is a rate of change or a subject-specific trajectory** — MMRM's per-visit contrasts do not answer\n  it; use an LMM with random slopes.\n\n**Data-source operational depth.**\n- **Claims (FFS / MA / commercial):** Claims rarely contain a true repeated *continuous clinical* outcome — there is no\n  serial HbA1c or eGFR, only diagnosis/procedure/cost events. Attempting MMRM on a claims-derived \"continuous\" series\n  (e.g., monthly cost, which is non-Gaussian and zero-inflated) violates the model's assumptions. 
Where claims *are*\n  used, the danger is that \"missing visit\" is actually **disenrollment** (MA-only person-time lacks FFS claims; a plan\n  switch looks like loss to follow-up) or **death**, both informative — restrict to continuously enrolled, FFS-observable\n  person-time (`continuous-enrollment-observable-time-rwe`) and treat post-disenrollment values as MNAR in sensitivity,\n  not silently MAR.\n- **EHR (the typical MMRM-in-RWE substrate):** Serial labs/vitals/PROs exist, but capture is **encounter-driven**, so\n  measurement times are irregular and differential by health status — the core threat to both the visit grid and MAR.\n  Bin to clinically meaningful windows (e.g., ±45 days around nominal quarters), pre-specify which record represents a\n  visit when several fall in a window (closest-to-target vs last), and treat a patient who leaves the system as\n  potentially informatively missing. Baseline is often itself a derived/last-observation value — document how it was\n  constructed because change-from-baseline inherits its error.\n- **Registry:** Strongest for scheduled, adjudicated repeated outcomes (serial disease severity, PROs at protocol\n  visits), giving the cleanest nominal grid for MMRM — but registry completeness erodes over time and dropout correlates\n  with disease progression and death (informative). Link to a death index and to claims to distinguish true MAR loss\n  from terminal-event loss.\n- **Linked EHR–claims–vital records:** The ideal MMRM substrate — EHR supplies the continuous serial outcome, claims\n  confirm continuous observability and treatment exposure, and the death index lets you correctly classify terminal\n  dropout as a competing event rather than MAR missingness. 
Reconcile measurement, enrollment, and death dates before\n  assigning visits.\n\n**Worked example (linked EHR + claims).** Question: change from baseline in eGFR over 12 months among new initiators of\ndrug A vs active comparator B for type 2 diabetes, using EHR labs linked to medical/pharmacy claims. (1) **Cohort:**\nactive-comparator new-user design — first fill of A or B (`fill_date`, `days_supply`, NDC) with 365 days of prior\ncontinuous medical + pharmacy enrollment and FFS-observable person-time; `index_date` = first fill. (2) **Outcome\nseries:** all serum-creatinine→eGFR results from the linked EHR with `result_date` between index and month 12. (3)\n**Visit grid:** nominal visits at baseline, 3, 6, 9, 12 months; assign each eGFR to the visit whose target it is\nnearest, within a ±45-day window; if multiple results fall in a window, keep the one closest to the target date;\nbaseline eGFR = the value nearest to (and within 45 days before/on) `index_date`. (4) **Analysis variable:** change\nfrom baseline eGFR at each post-baseline visit. (5) **Model:** MMRM with fixed effects for arm, visit (categorical),\narm×visit, baseline eGFR, and pre-index covariates, and an **unstructured residual covariance** within `person_id`;\nthe primary contrast is the arm difference in eGFR change at month 12. 
(6) **Missingness/intercurrent events:** records\nafter **disenrollment** (no FFS-observable person-time) or **death** (linked death index) are missing — pre-specify a\n*hypothetical* estimand (those values MAR-missing) for the primary analysis, then run **delta-adjusted / reference-based\nMI** and a tipping-point analysis assuming progressively worse unobserved eGFR in the active arm as MNAR sensitivity.\n(7) **Diagnostics:** report observed-data completion by arm and visit (a `missing-data-pattern-table-rwe`), check that\nattrition is not differential by arm, and confirm the UN covariance is estimable (it has V(V+1)/2 parameters for V\nvisits — collapse to Toeplitz/AR(1) only if UN fails to converge).",
    "primary_category": "Inferential_Statistics",
    "tags": [
      "inferential-statistics",
      "longitudinal-outcomes",
      "repeated-measures",
      "mmrm",
      "missing-at-random",
      "continuous-endpoint",
      "patient-reported-outcomes",
      "change-from-baseline"
    ],
    "applies_to_study_types": [
      "ehr_study",
      "registry_linkage",
      "comparative_effectiveness",
      "target_trial_emulation"
    ],
    "data_sources": [
      "ehr",
      "registry",
      "linked",
      "claims"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1081/BIP-100104194",
        "url": "https://doi.org/10.1081/BIP-100104194",
        "citation_text": "Mallinckrodt CH, Clark WS, David SR. Accounting for dropout bias using mixed-effects models. Journal of Biopharmaceutical Statistics. 2001;11(1-2):9-21.",
        "year": 2001,
        "authors_short": "Mallinckrodt et al.",
        "notes": "Establishes the likelihood-based mixed-model-for-repeated-measures rationale that valid inference under missing-at-random dropout is obtained without imputing missing values."
      },
      {
        "role": "explain",
        "doi": "10.1177/009286150804200402",
        "url": "https://doi.org/10.1177/009286150804200402",
        "citation_text": "Mallinckrodt CH, Lane PW, Schnell D, Peng Y, Mancuso JP. Recommendations for the primary analysis of continuous endpoints in longitudinal clinical trials. Drug Information Journal. 2008;42(4):303-319.",
        "year": 2008,
        "authors_short": "Mallinckrodt et al.",
        "notes": "The canonical specification of MMRM as primary analysis - visit as a categorical factor, treatment-by-visit interaction, and an unstructured within-subject covariance under MAR."
      },
      {
        "role": "explain",
        "doi": "10.2307/2529876",
        "url": "https://doi.org/10.2307/2529876",
        "citation_text": "Laird NM, Ware JH. Random-effects models for longitudinal data. Biometrics. 1982;38(4):963-974.",
        "year": 1982,
        "authors_short": "Laird & Ware",
        "notes": "Foundational likelihood framework for longitudinal mixed-effects models that underpins the MMRM residual-covariance formulation."
      },
      {
        "role": "explain",
        "doi": "10.1056/NEJMsr1203730",
        "url": "https://doi.org/10.1056/NEJMsr1203730",
        "citation_text": "Little RJ, D'Agostino R, Cohen ML, et al. The prevention and treatment of missing data in clinical trials. New England Journal of Medicine. 2012;367(14):1355-1360.",
        "year": 2012,
        "authors_short": "Little et al.",
        "notes": "NRC panel framing of MAR vs MNAR and the requirement for explicit missing-data assumptions and sensitivity analyses - the assumption MMRM relies on and that is weakest in RWE."
      },
      {
        "role": "demonstrate",
        "doi": "10.1093/biomet/73.1.13",
        "url": "https://doi.org/10.1093/biomet/73.1.13",
        "citation_text": "Liang KY, Zeger SL. Longitudinal data analysis using generalized linear models. Biometrika. 1986;73(1):13-22.",
        "year": 1986,
        "authors_short": "Liang & Zeger",
        "notes": "The GEE alternative MMRM is contrasted against - moment-based, marginal, and valid only under MCAR unless weighted, versus MMRM's likelihood-based validity under MAR."
      }
    ],
    "plain_language_summary": "A Mixed Model for Repeated Measures (MMRM) is a statistical method for analyzing a continuous outcome — such as kidney function or a symptom score — that is measured at several scheduled clinic visits for every patient. Instead of throwing away data from patients who miss a visit, MMRM uses all the measurements that were recorded and makes a mathematically principled assumption (called missing-at-random) that the gap can be explained by what was already observed. The result is a single estimated difference between treatment groups at the visit you care about most, without ever making up numbers for the missing visits. One honest caveat: if patients tend to drop out precisely because they are getting worse — not just randomly — the missing-at-random assumption breaks and the estimate can be misleading.",
    "key_terms": [
      {
        "term": "MMRM",
        "definition": "Mixed Model for Repeated Measures — a statistical model that estimates the average treatment difference at each scheduled visit using all observed data, without imputing missing visits."
      },
      {
        "term": "covariance structure",
        "definition": "A mathematical description of how closely related a patient's measurements are to each other across visits; the unstructured form makes no assumptions about the pattern of those relationships."
      },
      {
        "term": "missing-at-random (MAR)",
        "definition": "The assumption that whether a patient's visit measurement is missing can be fully explained by the data already collected for that patient, not by the unobserved value itself."
      },
      {
        "term": "change from baseline",
        "definition": "The difference between a patient's outcome value at a later visit and their starting value on the day they entered the study — a common way to measure how much a treatment moved the needle."
      },
      {
        "term": "treatment-by-visit interaction",
        "definition": "A model term that allows the difference between the two treatment arms to be a different size at each visit, rather than forcing a single average across all time points."
      }
    ],
    "worked_example": {
      "scenario": "A registry study enrolls 200 adults with type 2 diabetes starting either Drug A (n=100) or Drug B (n=100). Kidney function is measured as eGFR (a lab value in mL/min — higher is better) at four scheduled visits: baseline, Month 3, Month 6, and Month 12. Some patients miss one or two visits. The question is: what is the treatment-group difference in eGFR change from baseline at Month 12? The analyst uses MMRM so that patients with a missing Month 6 visit are not dropped from the analysis entirely.",
      "dataset": {
        "caption": "Observed mean eGFR by arm and visit (all enrolled patients who had at least one post-baseline measurement; n shown = patients with a recorded value at that visit).",
        "columns": [
          "visit",
          "arm",
          "mean_egfr",
          "n_observed",
          "mean_change_from_baseline"
        ],
        "rows": [
          [
            "Baseline",
            "Drug A",
            62.0,
            100,
            0.0
          ],
          [
            "Baseline",
            "Drug B",
            61.8,
            100,
            0.0
          ],
          [
            "Month 3",
            "Drug A",
            63.5,
            94,
            1.5
          ],
          [
            "Month 3",
            "Drug B",
            60.9,
            96,
            -0.9
          ],
          [
            "Month 6",
            "Drug A",
            64.1,
            89,
            2.1
          ],
          [
            "Month 6",
            "Drug B",
            59.7,
            88,
            -2.1
          ],
          [
            "Month 12",
            "Drug A",
            65.2,
            82,
            3.2
          ],
          [
            "Month 12",
            "Drug B",
            58.6,
            79,
            -3.2
          ]
        ]
      },
      "steps": [
        "Each row in the analysis dataset is one patient at one visit; patients who missed a visit simply have no row for that visit — they are not dropped from the dataset entirely.",
        "MMRM fits a single model that includes fixed effects for arm (Drug A vs Drug B), visit (Baseline / Month 3 / Month 6 / Month 12 as categories), and the arm-by-visit interaction, plus each patient's baseline eGFR as a covariate.",
        "The unstructured covariance lets the model learn from each patient's own observed trajectory — so a patient seen at Baseline, Month 3, and Month 12 still contributes information about Month 6 through the estimated within-person correlations.",
        "MMRM does NOT fill in the missing Month 6 value with the last observed reading (that older approach, called LOCF, assumes the outcome stayed flat, which is rarely true); instead it uses maximum likelihood to borrow information across all visits without inventing data.",
        "The primary result is read from the arm-by-visit interaction term at Month 12: the estimated difference in mean change from baseline between Drug A and Drug B at that visit."
      ],
      "result": "MMRM estimates Drug A patients gained an average of 3.2 mL/min in eGFR from baseline by Month 12, while Drug B patients lost an average of 3.2 mL/min — a treatment difference of 6.4 mL/min (95% CI: 4.1 to 8.7 mL/min) favoring Drug A at the final visit. All 200 patients who had at least one post-baseline measurement contributed to this estimate, including the 18 Drug A patients and 21 Drug B patients who missed one intermediate visit."
    },
    "prerequisites": [
      "longitudinal-outcomes-modeling-rwe",
      "missing-data-pattern-table-rwe",
      "estimands-ate-att-intercurrent-events-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Unstructured-covariance MMRM (canonical)",
        "description": "Visit modeled as a categorical factor with a treatment-by-visit interaction and a fully unstructured (UN) within-subject residual covariance; no random effects. Makes the fewest assumptions about correlation structure and is the default primary specification.",
        "edge_cases": [
          "UN requires V(V+1)/2 covariance parameters for V visits; with many visits or sparse cells it fails to converge and must be collapsed to a structured form (Toeplitz, AR(1), ante-dependence) - a pre-specified fallback ladder avoids post-hoc fishing.",
          "With irregular real-world visit timing, the \"visit\" factor is a binned approximation; record the binning rule and tie-breaking (closest-to-target vs last-in-window)."
        ],
        "data_source_notes": "EHR/registry: define nominal-visit windows and the within-window selection rule before fitting; confirm each visit cell has enough observations to estimate its variance and covariances."
      },
      {
        "name": "MMRM with structured covariance fallback",
        "description": "Replaces UN with a parsimonious structure (Toeplitz, heterogeneous AR(1), ante-dependence) when UN is unestimable or over-parameterized for the available visits/sample.",
        "edge_cases": [
          "A mis-specified covariance can bias standard errors even though point estimates remain consistent; use a Kenward-Roger or sandwich (empirical) adjustment for the degrees of freedom / SEs.",
          "AR(1) assumes equal spacing and declining correlation - often false for irregular RWE visits."
        ],
        "data_source_notes": "Prefer Kenward-Roger small-sample correction in modest registry cohorts; report the covariance structure actually used and why UN was abandoned."
      },
      {
        "name": "MMRM as MAR primary with MNAR sensitivity layer",
        "description": "MMRM provides the primary MAR estimate; reference-based / delta-adjusted multiple imputation and tipping-point analyses probe robustness to informative (MNAR) dropout from disenrollment or death.",
        "edge_cases": [
          "In RWE the MNAR layer is effectively mandatory because disenrollment and death are commonly informative; a primary MMRM with no sensitivity analysis is not regulatory-credible."
        ],
        "data_source_notes": "linked data: use a death index and enrollment spans to classify each missing record as terminal (competing event) vs MAR loss, which determines whether reference-based MI or a joint model is the right sensitivity."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Linear mixed model with random slopes (mixed-effects-models-longitudinal-rwe)",
        "pros_of_this": "Treats time as categorical with unstructured residual covariance and no random effects, so it imposes no assumption on the shape of the mean trajectory and gives clean per-visit contrasts.",
        "cons_of_this": "Requires a small fixed set of nominal visits and an estimable UN covariance; cannot accommodate irregular continuous-time visits or a rate-of-change estimand.",
        "when_to_prefer": "Few categorical visits, unknown trajectory shape, and a per-visit mean-difference estimand."
      },
      {
        "compared_to": "GEE population-average models (gee-population-average-models-rwe)",
        "pros_of_this": "Likelihood-based and valid under MAR; handles incomplete Gaussian longitudinal data without weighting.",
        "cons_of_this": "Restricted to (approximately) Gaussian continuous outcomes; GEE generalizes to binary/count outcomes more naturally.",
        "when_to_prefer": "Continuous Gaussian repeated outcomes with non-MCAR (but plausibly MAR) dropout."
      },
      {
        "compared_to": "Multiple imputation + ANCOVA (multiple-imputation-longitudinal-rwe)",
        "pros_of_this": "Single-step, no imputation model to specify, fully efficient under correctly specified MAR.",
        "cons_of_this": "Handles only MAR and only implicitly; cannot express MNAR sensitivity (pattern-mixture, delta/reference-based) on its own.",
        "when_to_prefer": "As the MAR primary analysis, with MI added as the explicit MNAR sensitivity layer."
      },
      {
        "compared_to": "Joint longitudinal-survival models",
        "pros_of_this": "Far simpler to specify, communicate, and defend; standard software and a familiar estimand.",
        "cons_of_this": "Ignores informative dropout from terminal events (death); the among-survivors mean is biased when dropout is outcome-dependent.",
        "when_to_prefer": "When dropout is plausibly unrelated to the unobserved outcome; otherwise a joint model is needed."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Claims rarely carry a true repeated continuous clinical outcome; do not force MMRM on zero-inflated, non-Gaussian monthly cost. Where used, \"missing visit\" is often disenrollment (MA-only person-time lacks FFS claims) or death, both informative - restrict to FFS-observable continuous enrollment and treat post-disenrollment/death values as MNAR in sensitivity, not silently MAR.",
      "ehr": "The typical substrate. Encounter-driven capture makes measurement times irregular and differential by health status, threatening both the nominal-visit grid and the MAR assumption. Bin to clinical windows, pre-specify within-window selection and baseline derivation, and treat departure from the system as potentially informative.",
      "registry": "Strongest for scheduled, adjudicated serial outcomes and the cleanest nominal grid, but completeness erodes and dropout correlates with progression and death. Link to a death index to separate terminal dropout from MAR loss.",
      "linked": "Ideal MMRM substrate - EHR supplies the continuous serial outcome, claims confirm observability and exposure, and a death index lets terminal dropout be classified as a competing event rather than MAR missingness. Reconcile measurement, enrollment, and death dates before visit assignment."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\nimport statsmodels.formula.api as smf\n\ndef fit_mmrm_approx(df: pd.DataFrame):\n    \"\"\"Approximate MMRM via a random-intercept LMM (statsmodels lacks true UN residual covariance).\n    Treats visit as categorical and includes the arm-by-visit interaction (the MMRM estimand).\"\"\"\n    df = df.copy()\n    df[\"visit\"] = pd.Categorical(df[\"visit\"], categories=[\"m3\", \"m6\", \"m9\", \"m12\"], ordered=True)\n    df[\"arm\"] = pd.Categorical(df[\"arm\"], categories=[\"B\", \"A\"])  # B = reference\n\n    # arm*visit gives the per-visit treatment contrast; adjust for baseline and covariates.\n    model = smf.mixedlm(\n        \"chg ~ arm * C(visit) + base\",\n        data=df,\n        groups=df[\"person_id\"],          # within-person clustering (random intercept approximation)\n        re_formula=\"1\",\n    )\n    res = model.fit(reml=True, method=\"lbfgs\")\n    # Primary contrast = arm[A]:C(visit)[m12]  (treatment difference in change at month 12).\n    return res",
        "description": "Fit an MMRM-style model in long format. Required input (one row per person per captured visit, after binning\nreal-world measurements to a nominal grid):\n  df : person_id, arm ('A'/'B'), visit (ordered categorical: 'm3','m6','m9','m12'),\n       chg (change from baseline in the continuous outcome), base (baseline value),\n       + any pre-index covariates measured in [index_date-365, index_date]\nCAVEAT: statsmodels has no native unstructured residual-covariance / treatment-by-visit MMRM. The closest\nlikelihood model is MixedLM with a random intercept + heteroscedastic residuals, which is an LMM, NOT a true\nMMRM. For a regulatory-grade unstructured-covariance MMRM in Python, call the R `mmrm` package via rpy2 or use\nSAS PROC MIXED. The block below is an approximation suitable for exploratory work only.",
        "dependencies": [
          "pandas",
          "statsmodels"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(mmrm)\n\ndf$person_id <- factor(df$person_id)\ndf$visit     <- factor(df$visit, levels = c(\"m3\", \"m6\", \"m9\", \"m12\"))\ndf$arm       <- relevel(factor(df$arm), ref = \"B\")  # comparator = reference\n\n# Canonical MMRM: arm*visit fixed effects, baseline adjustment, UNSTRUCTURED residual covariance us(visit | id).\nfit <- mmrm(\n  formula = chg ~ arm * visit + base + us(visit | person_id),\n  data    = df,\n  reml    = TRUE\n)\nsummary(fit)  # arm[A]:visit[m12] row = treatment difference in change from baseline at month 12\n\n# nlme equivalent (unstructured corr + heterogeneous variance by visit):\n# library(nlme)\n# fit_gls <- gls(chg ~ arm * visit + base, data = df,\n#                correlation = corSymm(form = ~ as.integer(visit) | person_id),\n#                weights     = varIdent(form = ~ 1 | visit),\n#                na.action   = na.omit, method = \"REML\")",
        "description": "True unstructured-covariance MMRM with the FDA-aligned `mmrm` package (and an nlme::gls equivalent shown in\ncomments). Required input (long format, one row per person per nominal visit):\n  df : person_id (factor), arm (factor, reference = comparator), visit (factor: m3 < m6 < m9 < m12),\n       chg (change-from-baseline continuous outcome), base (baseline value), + pre-index covariates\n`mmrm` fits the marginal model with no random effects and an unstructured residual covariance, applying a\nSatterthwaite/Kenward-Roger small-sample adjustment - the canonical MMRM. Do NOT use lme4: it cannot fit an\nunstructured residual covariance cleanly.",
        "dependencies": [
          "mmrm",
          "nlme"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* Canonical MMRM: visit categorical, arm*visit interaction, unstructured residual covariance, no random effects. */\nproc mixed data=work.long method=reml;\n  class person_id arm visit;\n  model chg = arm visit arm*visit base / ddfm=kr;\n  repeated visit / subject=person_id type=un r rcorr;   /* UN residual covariance = the MMRM core */\n  /* Treatment difference in change from baseline at the final visit (the primary MMRM estimand): */\n  lsmeans arm*visit / diff cl;\n  slice arm*visit / sliceby=visit diff cl;              /* per-visit treatment contrasts with CIs */\nrun;\n\n/* If TYPE=UN fails to converge (too many visits / sparse cells), step down a pre-specified ladder, e.g.: */\n/*   type=toep   (Toeplitz)   or   type=arh(1)  (heterogeneous AR(1))  -- document why UN was abandoned.   */",
        "description": "Canonical MMRM in SAS via PROC MIXED. Required input (long format, post data-management):\n  work.long : person_id, arm (char 'A'/'B'), visit (char nominal-visit label), chg (change from baseline),\n              base (baseline value), + pre-index covariates\nThe REPEATED statement with TYPE=UN and SUBJECT=person_id specifies the unstructured within-subject residual\ncovariance with NO random effects - this is the defining MMRM specification. DDFM=KR applies the Kenward-Roger\nsmall-sample degrees-of-freedom and variance correction recommended for MMRM.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Q[Continuous outcome, repeatedly measured?] -->|No - binary/count| GEE[GLMM / GEE<br/>gee-population-average-models-rwe]\n  Q -->|Yes| T{Visit timing}\n  T -->|Few nominal visits| V{Estimand}\n  T -->|Irregular continuous time<br/>or rate-of-change wanted| LMM[LMM with random slopes<br/>mixed-effects-models-longitudinal-rwe]\n  V -->|Per-visit mean difference| M{Missingness mechanism}\n  V -->|Subject-specific trajectory| LMM\n  M -->|Plausibly MAR| MMRM[MMRM<br/>UN residual covariance, no random effects]\n  M -->|Informative / MNAR<br/>death, disenrollment| MNAR[MMRM primary +<br/>reference-based MI / tipping-point<br/>or joint longitudinal-survival model]\n  MMRM --> MNAR",
        "caption": "Decision logic for choosing MMRM versus its close cousins. MMRM is the choice for a continuous outcome on a small categorical visit grid with a per-visit estimand under MAR; non-Gaussian outcomes route to GEE/GLMM, irregular timing or trajectory estimands to an LMM, and informative dropout requires an MNAR sensitivity layer or a joint model.",
        "alt_text": "Decision tree starting from whether the outcome is a repeatedly measured continuous variable, branching on visit timing, estimand, and missingness mechanism to MMRM, LMM, GEE/GLMM, or MNAR-sensitivity / joint models.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  Y[\"Observed outcome Y_t<br/>(eGFR at each visit)\"]\n  Yp[\"Prior observed history<br/>Y_(t-1), baseline, covariates\"]\n  A[Treatment arm]\n  U[\"Unobserved outcome value<br/>Y_t if visit were captured\"]\n  R[\"R_t: measurement captured?<br/>(visit observed vs missing)\"]\n  Yp --> R\n  A --> R\n  A --> Y\n  Yp --> Y\n  U -. \"MNAR path: dropout depends on the<br/>unobserved value e.g. death/decline\" .-> R\n  U --- Y",
        "caption": "Missing-data DAG for MMRM. MMRM is valid when capture R_t depends only on observed history and treatment (MAR - solid arrows). The dashed arrow is the MNAR mechanism (e.g., patients with worse unobserved eGFR die or disenroll); when it is present, MMRM's MAR assumption fails and a sensitivity analysis or joint model is required.",
        "alt_text": "Directed acyclic graph showing observed history and treatment influencing both the outcome and whether a visit is captured (MAR), with a dashed arrow from the unobserved outcome value to capture representing the MNAR mechanism.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "is_variant_of",
        "target_slug": "longitudinal-outcomes-modeling-rwe",
        "notes": "MMRM is a specific likelihood-based member of the longitudinal-outcomes-modeling family for continuous endpoints."
      },
      {
        "relation_type": "alternative_to",
        "target_slug": "mixed-effects-models-longitudinal-rwe",
        "notes": "The random-slopes LMM is the trajectory-based alternative; MMRM uses categorical visits with unstructured residual covariance and no random effects, trading efficiency for fewer trajectory assumptions."
      },
      {
        "relation_type": "alternative_to",
        "target_slug": "gee-population-average-models-rwe",
        "notes": "GEE targets the same marginal mean but is moment-based and valid only under MCAR unless weighted; MMRM is likelihood-based and valid under MAR."
      },
      {
        "relation_type": "used_with",
        "target_slug": "multiple-imputation-longitudinal-rwe",
        "notes": "Reference-based / delta-adjusted multiple imputation provides the MNAR sensitivity layer that an MMRM MAR primary analysis requires in RWE."
      },
      {
        "relation_type": "used_with",
        "target_slug": "estimands-ate-att-intercurrent-events-rwe",
        "notes": "MMRM does not by itself define intercurrent-event handling (death, discontinuation, switching); the estimand and its strategy must be specified separately."
      },
      {
        "relation_type": "used_with",
        "target_slug": "missing-data-pattern-table-rwe",
        "notes": "A by-arm, by-visit missing-data pattern table is the required diagnostic establishing the missingness assumption MMRM relies on."
      },
      {
        "relation_type": "used_with",
        "target_slug": "pro-rwe",
        "notes": "MMRM is a standard analysis for serial patient-reported-outcome scores collected longitudinally in RWE."
      },
      {
        "relation_type": "see_also",
        "target_slug": "attrition-and-loss-to-follow-up-rwe",
        "notes": "Differential attrition and informative loss to follow-up are the primary threats to MMRM's MAR assumption in real-world data."
      }
    ],
    "aliases": [
      "MMRM",
      "mixed model for repeated measures",
      "mixed-effects model for repeated measures",
      "repeated measures mixed model"
    ],
    "complexity": "advanced",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "mortality-source-hierarchy-rwe",
    "name": "Mortality Source Hierarchy",
    "short_definition": "A pre-specified rule that combines multiple death-information sources (claims discharge status, EHR, registries, the Social Security Death Master File, the National Death Index, and linked vital records) into a single deterministic priority order to assign each patient a binary death flag and a single date of death for survival analysis.",
    "long_description": "A **mortality source hierarchy** is the operational algorithm that turns several incomplete, partially\noverlapping death feeds into one analyzable endpoint: a death indicator and a single, defensible date of\ndeath per patient. No single real-world source captures all deaths completely or dates them exactly, so a\ncomposite that ranks sources by validity and fills gaps is now the standard of practice for any RWE study\nwith a survival, all-cause-mortality, or time-to-event endpoint. The hierarchy must be written in protocol\nlanguage *before* programming because every downstream quantity — follow-up time, censoring, hazard ratios,\nrestricted mean survival time, cost-per-life-year — inherits its sensitivity, specificity, and date error.\n\n**Core conceptual distinction.** Three things are being decided, and they are separable. (1) *Capture\n(the death flag)*: which sources count as evidence that a patient died, and in what priority. Sources differ\nin **sensitivity** (does it catch the death at all) and **specificity** (false positives — e.g., a discharge\n\"expired\" status keyed in error, or an SSDI record matched to the wrong person). (2) *Dating (the death\ndate)*: once a death is established, which source's date is authoritative. The Death Master File and NDI\nare accurate to month/year but the *day* is unreliable when coded as the last day of the month; claims give\na precise service date but only if death occurred during an observed encounter. (3) *Cause*: all-cause vs\ncause-specific mortality is a different ascertainment problem entirely — cause requires NDI Plus / death-\ncertificate ICD coding and cannot be read off enrollment files. The estimand must name which of these it\nneeds. 
A naive \"last claim = censor\" default silently conflates *no longer observed* with *alive*, which is\nthe single most common and most dangerous error this concept exists to prevent.\n\n**Pros, cons, and trade-offs.**\n- **vs a single-source death flag (e.g., enrollment-file death indicator alone):** A hierarchy raises\n  sensitivity dramatically — composite endpoints benchmarked to the NDI reach ~98% sensitivity versus\n  ~83-92% for any one administrative source — and corrects systematically wrong dates. Cost: more code,\n  more linkage agreements, and the need to reconcile disagreements between sources. **Prefer the hierarchy**\n  for any consequential effectiveness, safety, or economic analysis where missed or misdated deaths bias\n  the result.\n- **vs \"high-specificity only\" (require ≥2 concordant sources before flagging a death):** Demanding\n  agreement maximizes specificity and is appropriate when a false death is catastrophic (e.g., an\n  automated safety signal). Cost: it sacrifices sensitivity and undercounts deaths that only one good\n  source saw. **Prefer concordance rules** for confirmatory/regulatory deliverables; **prefer the union/\n  priority hierarchy** when undercounting deaths is the worse error (most survival analyses).\n- **vs treating end-of-enrollment as the endpoint (administrative censoring only):** Censoring at\n  disenrollment is defensible *only* if disenrollment is unrelated to death. 
It usually is not — sick\n  patients disenroll, switch to hospice, or move to Medicare Advantage where fee-for-service claims vanish.\n  Substituting censoring for an actual death feed induces **informative censoring** and inflates survival.\n  **Prefer a real mortality source** whenever a death index or linked vital record is obtainable.\n\n**When to use.** Any RWE/HEOR study with all-cause mortality or a composite that includes death; any\nsurvival, RMST, or cost-effectiveness analysis where person-time depends on correctly placing the death\ndate; any oncology overall-survival endpoint, where regulators expect benchmarking to the NDI. Use it\nwhenever your data span multiple settings (a patient seen in one EHR may die in another system's hospice)\nor whenever differential loss to follow-up by exposure arm is plausible.\n\n**When NOT to use — and when it is actively misleading or dangerous.**\n- **When you need cause-specific mortality but only have all-cause feeds.** A hierarchy of enrollment +\n  SSDI + discharge status tells you *that* a patient died, never *why*. Reporting \"cardiovascular death\"\n  off such sources is fabrication; you need death-certificate cause coding (NDI Plus) or adjudication.\n- **When false positives dominate the decision.** For an automated pharmacovigilance trigger, a\n  union-style hierarchy that flags on any single source will fire on mismatched SSDI links and erroneous\n  discharge codes; require concordance or manual review instead.\n- **When the date error interacts with a short-window estimand.** If the estimand is 30-day mortality and\n  your authoritative date is month/year-only (snapped to mid-month or month-end), the day-level error can\n  flip events across the window boundary. 
Do not paper over month-only dates with a fixed imputation and\n  then estimate a day-resolution endpoint.\n- **When disenrollment and death are confounded and you censor instead of ascertaining.** Using\n  end-of-coverage as a pseudo-death or as the only censoring event when sick patients drop coverage makes\n  the treated arm look immortal. This is the failure mode that most often produces a spuriously protective\n  drug effect.\n\n**Data-source operational depth.**\n- **Claims (FFS vs Medicare Advantage):** Death is inferable from the inpatient discharge status code\n  (`DSTATUS`/`disp` = expired) and, in Medicare, from the enrollment/denominator death indicator and\n  `death_date` (DMF-sourced). Failure modes: (a) **MA-only person-time lacks adjudicated FFS claims**, so\n  in-hospital deaths are invisible and \"no death\" can be pure missingness — restrict mortality follow-up to\n  A/B-enrolled FFS time or link to a death index; (b) out-of-hospital deaths (home, hospice, ED-on-arrival)\n  never generate a discharge status; (c) discharge \"expired\" can be miscoded, creating false positives; (d)\n  claims lag and reversals mean the death date is not final for months. Workaround: take the date from the\n  enrollment/DMF feed, use discharge status as a sensitivity-raising supplement, and never censor MA-only\n  person-time as if it were observed alive.\n- **EHR:** Death is captured only if it happens inside, or is reported back to, the network — a patient who\n  leaves the system and dies elsewhere is silently censored. Structured `death_date`/`deceased_flag` fields\n  are encounter-driven and notoriously incomplete; obituary and SSDI augmentation recover large numbers of\n  out-of-network deaths. Failure mode: **differential capture by exposure** if one arm is sicker and\n  referred out. 
Workaround: link to commercial claims and a death index; report capture completeness by\n  site and by arm.\n- **Registry:** Disease registries (e.g., SEER, cancer registries) often have actively followed,\n  adjudicated vital status and cause of death, but linkage eligibility and lag limit completeness and the\n  followed population may not be transportable to the analysis cohort. Workaround: link to claims for the\n  interval between registry contacts; document the linkage denominator.\n- **Linked claims-EHR-vital records (DMF/NDI):** The reference substrate. The **National Death Index** is\n  the benchmark for both fact and cause of death; the **Death Master File / SSDI** is broad but lost many\n  state-reported deaths after the 2011 records-access changes (a known sensitivity drop for recent years).\n  Failure modes: linkage selection (only the linkable subset), false matches (specificity), and\n  month/year-only dates. Workaround: probabilistic-match QC, benchmark composite sensitivity/specificity to\n  the NDI, and pre-specify a date-imputation rule for month-only records.\n\n**Estimand and competing-risks note.** With death as the outcome, all-cause mortality is a single risk and\na Cox/pooled-logistic model on the hierarchy-derived flag/date is appropriate. With a *non-fatal* primary\noutcome (e.g., first hospitalization for heart failure), **death is a competing risk** and the choice of\nestimand is consequential: the **cause-specific hazard** (treat death as censoring; `PROC PHREG`,\n`coxph`) answers an etiologic question, whereas the **subdistribution hazard / cumulative incidence\nfunction** (Fine-Gray; `PROC PHREG eventcode=`, `cmprsk`/`PROC LIFETEST plots=cif`) answers an absolute-\nrisk/decision question. 
In elderly claims populations, mortality is high and may differ by exposure, so\nKaplan-Meier on the non-fatal event overstates its incidence; pre-specify cause-specific vs subdistribution\nand report the CIF.\n\n**Worked claims example.** Goal: an all-cause mortality endpoint for a 100% Medicare FFS cohort of new\ninitiators, index_date = first qualifying fill. (1) Require continuous Parts A/B FFS enrollment from\nindex forward; flag and separately handle any switch to MA (where FFS claims stop). (2) Build candidate\ndeath records from three feeds: the enrollment/denominator `death_date` (DMF-sourced), inpatient claims\nwhere `discharge_status` ∈ expired codes (take the claim `thru_date` as the date), and a linked\nNDI/state-file date where available. (3) Apply the priority hierarchy for the *flag*: a patient is dead if\nANY source indicates death (union maximizes sensitivity), but require that the death date fall on or after\nindex and within enrollment (drop physiologically impossible records). (4) Assign the *date* by priority:\nNDI > enrollment/DMF > inpatient discharge `thru_date`; if only month/year is available, pre-specify\nimputation (e.g., 15th of month) and carry a flag for sensitivity analysis. (5) Reconcile conflicts: if\ntwo sources disagree by >X days, log and review; if discharge \"expired\" appears but no death is in DMF\nwithin 90 days, treat as a possible false positive in a high-specificity sensitivity run. (6) Define\nfollow-up from index to the assigned death date or to administrative censoring (end of A/B FFS enrollment,\nend of data, or MA switch) — and crucially, do NOT censor at the last observed claim, which would convert\nunobserved deaths into spurious survival. 
(7) Sensitivity analyses: vary the date-imputation rule, swap the\nunion flag for a ≥2-source concordance flag, and report endpoint sensitivity/specificity against the NDI\nbenchmark where the linkage exists.\n\n**Interpreting the output**\n\nFrom the worked example: 4 patients; hierarchy yields 3 deaths (P001 NDI 2022-08-14, P002 DMF\n2022-11-15 day-imputed, P003 discharge 2022-06-20); P004 alive. NDI alone would capture only P001\n(1/3 true deaths, 33%). Discharge-status alone captures only P003 (33%); P001 and P002 died outside\nhospital. Composite union flag captures all 3 (100% in this small example).\n\n*(1) Formal interpretation.* The hierarchy assigns one death flag per patient (union of all sources)\nand one authoritative death date (priority: NDI > DMF > discharge). For P001, NDI date\n2022-08-14 overrides the DMF date 2022-08-01 per the pre-specified rule, giving follow-up 225 days.\nFor P002, DMF is the sole source and day-imputation to the 15th is documented as a sensitivity flag.\nFor P003, discharge \"expired\" with no NDI or DMF match is accepted at face value, giving follow-up\n170 days. P004 is censored at administrative end of enrollment. A discharge-status-only analysis would\nmisclassify P001 and P002 as censored, inflating their apparent survival by the interval between\nthe true death date and the censoring date.\n\n*(2) Practical interpretation.* A Kaplan-Meier or Cox model using only discharge-status deaths would\nlose P001 and P002 from the risk set as events and censor them instead, inflating survival estimates\nand potentially biasing treatment comparisons if mortality undercount is differential by arm (e.g.,\nif patients in one arm are more likely to die in non-acute settings). Conversely, NDI-only linkage misses P002\nand P003. The composite hierarchy maximizes completeness while the date-priority rule minimizes\nmeasurement error in follow-up time, both of which are essential for unbiased survival analysis.",
    "primary_category": "Outcome_Measure",
    "tags": [
      "outcome_measure",
      "mortality-ascertainment",
      "overall-survival",
      "death-date",
      "national-death-index",
      "death-master-file",
      "competing-risks",
      "informative-censoring"
    ],
    "applies_to_study_types": [
      "claims_analysis",
      "ehr_study",
      "registry_linkage",
      "target_trial_emulation",
      "comparative_effectiveness"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1111/1475-6773.12872",
        "url": "https://doi.org/10.1111/1475-6773.12872",
        "citation_text": "Curtis MD, Griffith SD, Tucker M, et al. Development and validation of a high-quality composite real-world mortality endpoint. Health Services Research. 2018;53(6):4460-4476.",
        "year": 2018,
        "authors_short": "Curtis et al.",
        "notes": "Canonical statement of building and benchmarking a composite real-world mortality endpoint from multiple death sources (structured EHR, unstructured records, obituary, SSDI) against the National Death Index; defines the sensitivity/date-accuracy trade-offs a source hierarchy must manage."
      },
      {
        "role": "explain",
        "doi": "10.1161/CIRCULATIONAHA.115.017719",
        "url": "https://doi.org/10.1161/CIRCULATIONAHA.115.017719",
        "citation_text": "Austin PC, Lee DS, Fine JP. Introduction to the analysis of survival data in the presence of competing risks. Circulation. 2016;133(6):601-609.",
        "year": 2016,
        "authors_short": "Austin et al.",
        "notes": "Clarifies the estimand split that mortality ascertainment feeds into - cause-specific hazard (death as censoring) vs subdistribution hazard / cumulative incidence (Fine-Gray) when death is a competing risk for a non-fatal outcome."
      },
      {
        "role": "demonstrate",
        "doi": "10.2147/POR.S498221",
        "url": "https://doi.org/10.2147/POR.S498221",
        "citation_text": "Jamal-Allial A, Vojjala S, Zhang J, et al. Validation of mortality data sources compared to the National Death Index in the Healthcare Integrated Research Database. Pragmatic and Observational Research. 2025;16:1-11.",
        "year": 2025,
        "authors_short": "Jamal-Allial et al.",
        "notes": "Direct head-to-head validation of administrative mortality sources against the NDI in a large claims/EHR research database, quantifying the sensitivity and date-agreement gains from combining sources - the empirical basis for ranking and unioning death feeds."
      },
      {
        "role": "use",
        "doi": "10.1200/CCI.23.00014",
        "url": "https://doi.org/10.1200/CCI.23.00014",
        "citation_text": "Shao S, Adamson BJS, Bruno AM, et al. Improving real-world mortality data quality in oncology research: augmenting electronic medical records with obituary, Social Security Death Index, and commercial claims data. JCO Clinical Cancer Informatics. 2023;7:e2300014.",
        "year": 2023,
        "authors_short": "Shao et al.",
        "notes": "Applied oncology overall-survival study showing that adding claims, obituary, and SSDI feeds to an EHR death flag recovers out-of-network deaths and improves OS estimation - a worked hierarchy in practice."
      }
    ],
    "plain_language_summary": "A mortality source hierarchy is a written rule, decided before any data analysis begins, that tells a researcher which database to trust most when multiple sources disagree about whether a patient died and when. Because no single database catches every death, analysts combine several feeds (the National Death Index, insurance enrollment records, the Social Security Death Master File, and hospital discharge codes) and rank them from most to least reliable. The hierarchy counts a patient as dead if any source reports the death, including deaths that a higher-ranked source missed, and takes the date from the most reliable source that reported it. Without this rule, patients who die but whose deaths are invisible in one database can be silently treated as alive, which makes a drug look safer or more effective than it really is.",
    "key_terms": [
      {
        "term": "National Death Index (NDI)",
        "definition": "A federal database, maintained by the CDC's National Center for Health Statistics, that compiles death record information from state vital statistics offices and is considered the most complete and accurate source for confirming that a patient died and on what date."
      },
      {
        "term": "Death Master File (DMF) / Social Security Death Index (SSDI)",
        "definition": "A Social Security Administration file listing people whose deaths were reported to Social Security; it covers most U.S. deaths but has missed a growing share of state-reported deaths since 2011."
      },
      {
        "term": "discharge status code",
        "definition": "A code on a hospital billing record that says how a patient left the hospital — including a specific code meaning the patient died during that stay."
      },
      {
        "term": "follow-up time",
        "definition": "The stretch of time a patient is tracked in a study, running from their start date until either the event of interest (such as death) or the last date they were observable."
      },
      {
        "term": "union rule",
        "definition": "A decision rule that flags a patient as having died if any source — even just one — reports a death, maximizing the chance of catching every real death."
      },
      {
        "term": "survival analysis",
        "definition": "A family of statistical methods that estimates how long patients survive (or remain event-free) and compares those times across treatment groups."
      }
    ],
    "worked_example": {
      "scenario": "Four patients in a drug study all started treatment on January 1, 2022. The study team has three death data sources: the NDI (gold standard, but slow — only available through December 2022), the DMF (broad coverage, but records death month and year only, not the exact day), and hospital discharge codes (exact date, but only catches deaths that happen during a hospital stay). The hierarchy rule is: NDI first, then DMF, then discharge codes. A patient is flagged as dead if any source reports a death (union rule). The date is taken from the highest-priority source that reported a death for that patient.",
      "dataset": {
        "caption": "Raw death signals across three sources for four study patients. Each cell shows the reported date of death, or NONE if the source has no record.",
        "columns": [
          "person_id",
          "index_date",
          "NDI_death_date",
          "DMF_death_date",
          "discharge_death_date"
        ],
        "rows": [
          [
            "P001",
            "2022-01-01",
            "2022-08-14",
            "2022-08-01",
            "NONE"
          ],
          [
            "P002",
            "2022-01-01",
            "NONE",
            "2022-11-01",
            "NONE"
          ],
          [
            "P003",
            "2022-01-01",
            "NONE",
            "NONE",
            "2022-06-20"
          ],
          [
            "P004",
            "2022-01-01",
            "NONE",
            "NONE",
            "NONE"
          ]
        ]
      },
      "steps": [
        "P001: NDI reports death on 2022-08-14. NDI is priority 1, so the assigned death date is 2022-08-14. The DMF record (2022-08-01) is noted but not used for the date because NDI outranks it. Follow-up ends on 2022-08-14.",
        "P002: NDI has no record (perhaps the linkage file did not yet include this patient), and there is no hospital discharge record because the patient died outside a hospital. DMF, the only source, reports a death in November 2022 with month and year only (stored as 2022-11-01), so the day is imputed to the 15th per the pre-specified rule: the assigned date becomes 2022-11-15, and the imputation is flagged for sensitivity analysis. Follow-up ends on 2022-11-15.",
        "P003: Neither NDI nor DMF has a record, but the hospital discharge code on 2022-06-20 shows the patient died in the hospital. Under the union rule, this single source is enough to flag a death. Assigned death date is 2022-06-20. Follow-up ends on 2022-06-20.",
        "P004: No source reports any death. The patient is treated as alive and their follow-up continues until the study end date or until they leave the insurance plan, whichever comes first. This administrative end-of-observation is the correct stopping point — NOT the date of their last doctor visit or last insurance claim.",
        "Result check: 3 of 4 patients are flagged as dead (P001, P002, P003). Had the team relied on NDI alone, only P001 would have been captured, missing 2 deaths and artificially inflating apparent survival. Had the team used only discharge codes, P001 and P002 — both of whom died outside a hospital — would have been missed entirely."
      ],
      "result": "Deaths captured: 3 out of 4 patients (P001, P002, P003). Assigned dates: P001 = 2022-08-14 (NDI), P002 = 2022-11-15 (DMF with day imputed to 15th), P003 = 2022-06-20 (discharge code, the only source). P004 remains alive in the analysis. Using only a single source would have missed at least 1 and up to 2 of these 3 real deaths; the missed patients would be treated as alive, overstating survival and biasing any comparison between treatment groups."
    },
    "prerequisites": [
      "continuous-enrollment-observable-time-rwe",
      "cumulative-incidence-risk-rwe",
      "competing-risks-cause-specific-fine-gray-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Union (sensitivity-maximizing) hierarchy",
        "description": "A patient is flagged dead if ANY source indicates death, after dropping physiologically impossible records; the death date is taken from the highest-priority available source. Maximizes capture and is the default for survival/OS endpoints where missed deaths are the costlier error.",
        "edge_cases": [
          "A single mismatched SSDI link or a miscoded inpatient discharge status produces a false positive that the union rule cannot catch on its own.",
          "Month/year-only dates from DMF/NDI require a pre-specified day-imputation rule before the union can assign a usable date."
        ],
        "data_source_notes": "claims: union the enrollment/DMF death_date, inpatient discharge-status deaths, and any linked NDI date, then assign the date by source priority. ehr: union the structured deceased flag with obituary/SSDI augmentation to recover out-of-network deaths."
      },
      {
        "name": "Concordance (specificity-maximizing) hierarchy",
        "description": "A death is flagged only when two or more independent sources agree (or after manual review), trading sensitivity for specificity. Used when a false death is unacceptable (automated safety signals, regulatory confirmatory endpoints).",
        "edge_cases": [
          "Deaths seen by only one good source (e.g., an out-of-network death captured solely by obituary) are missed, biasing survival upward.",
          "Requiring agreement on the date as well as the fact of death can drop true deaths with minor date discrepancies; concord on fact, reconcile date separately."
        ],
        "data_source_notes": "claims: require enrollment/DMF death AND a corroborating discharge-status or NDI record; reconcile date by priority. Report both union and concordance flags as a sensitivity pair."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Single-source death flag (enrollment-file indicator alone)",
        "pros_of_this": "Recovers out-of-network and out-of-hospital deaths and corrects systematically wrong dates; composite endpoints benchmark to ~98% sensitivity vs ~83-92% for any one administrative source.",
        "cons_of_this": "Requires more code, linkage agreements, and explicit conflict-reconciliation rules.",
        "when_to_prefer": "Any consequential effectiveness, safety, or economic analysis where missed or misdated deaths would bias survival, RMST, or cost-per-life-year."
      },
      {
        "compared_to": "High-specificity concordance rule (require >=2 sources to agree)",
        "pros_of_this": "Higher sensitivity - captures deaths seen by only one valid source, which dominate when care is fragmented across systems.",
        "cons_of_this": "Admits false positives from mismatched linkage or miscoded discharge status.",
        "when_to_prefer": "Survival/OS endpoints where undercounting deaths is worse than the occasional false positive; use concordance instead for confirmatory or automated-safety deliverables."
      },
      {
        "compared_to": "Administrative censoring at end of enrollment (no real death feed)",
        "pros_of_this": "Uses an actual mortality source, avoiding the informative censoring that arises when sick patients disenroll, enter hospice, or switch to MA where FFS claims disappear.",
        "cons_of_this": "Depends on obtaining and linking a death index; introduces linkage selection and month/year date error that must be QC'd.",
        "when_to_prefer": "Whenever a death index or linked vital record is obtainable and disenrollment is plausibly related to health status (almost always in elderly or seriously ill cohorts)."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Death feeds are the enrollment/denominator death_date (DMF-sourced) and inpatient discharge-status expired codes; restrict mortality follow-up to FFS A/B person-time because MA-only spans lack adjudicated claims and make in-hospital death invisible. Take the date by priority (NDI > DMF > discharge thru_date), drop dates before index or outside enrollment, and never censor at the last observed claim.",
      "ehr": "Structured deceased flags are encounter-driven and incomplete; a patient who leaves the network and dies elsewhere is silently censored. Augment with obituary/SSDI/commercial-claims feeds and report capture completeness by site and by exposure arm to detect differential ascertainment.",
      "registry": "Actively followed vital status and adjudicated cause of death are strong but limited by linkage eligibility and lag; link to claims for the interval between registry contacts and document the linkage denominator and transportability to the analysis cohort.",
      "linked": "Linked claims-EHR-vital-records with NDI/DMF is the reference substrate (capture + cause + date); benchmark composite sensitivity/specificity to the NDI, QC probabilistic matches for false positives, and pre-specify a day-imputation rule for month/year-only records."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\nimport numpy as np\n\nEXPIRED_CODES = {\"20\", \"40\", \"41\", \"42\"}  # site-specific: discharge_status = expired\nMONTH_IMPUTE_DAY = 15                       # pre-specified day for month/year-only dates\n\ndef build_mortality_endpoint(cohort, enroll, dmf, inp, ndi):\n    idx = cohort[[\"person_id\", \"index_date\"]]\n\n    # --- Candidate death records from each feed, each tagged with a source priority. ---\n    d_ndi = ndi[[\"person_id\", \"death_date\"]].assign(source=\"ndi\", priority=1)\n\n    d_dmf = dmf.copy()\n    impute = d_dmf[\"date_precision\"].eq(\"month\")\n    d_dmf.loc[impute, \"death_date\"] = (\n        d_dmf.loc[impute, \"death_date\"].values.astype(\"datetime64[M]\")\n        + np.timedelta64(MONTH_IMPUTE_DAY - 1, \"D\"))\n    d_dmf = d_dmf[[\"person_id\", \"death_date\"]].assign(source=\"dmf\", priority=2)\n\n    d_inp = (inp[inp[\"discharge_status\"].astype(str).isin(EXPIRED_CODES)]\n             .rename(columns={\"thru_date\": \"death_date\"})[[\"person_id\", \"death_date\"]]\n             .assign(source=\"discharge\", priority=3))\n\n    cand = pd.concat([d_ndi, d_dmf, d_inp], ignore_index=True).merge(idx, on=\"person_id\")\n\n    # --- Validity filter: death must be on/after index and within FFS-observable enrollment. ---\n    cand = cand.merge(enroll, on=\"person_id\")\n    cand = cand[(cand[\"death_date\"] >= cand[\"index_date\"]) &\n                (cand[\"death_date\"] >= cand[\"ffs_start\"]) &\n                (cand[\"death_date\"] <= cand[\"ffs_end\"])]\n\n    # --- FLAG (union): dead if ANY valid source fired. DATE: take highest-priority source. 
---\n    cand = cand.sort_values([\"person_id\", \"priority\", \"death_date\"])\n    dead = (cand.groupby(\"person_id\")\n                .first()\n                .reset_index()[[\"person_id\", \"death_date\", \"source\"]]\n                .rename(columns={\"source\": \"date_source\"}))\n\n    out = idx.merge(dead, on=\"person_id\", how=\"left\")\n    out[\"death_flag\"] = out[\"death_date\"].notna().astype(int)\n\n    # --- Follow-up end: death date if dead, else administrative censor at end of FFS enrollment. ---\n    ffs_end = enroll.groupby(\"person_id\")[\"ffs_end\"].max().rename(\"ffs_end\")\n    out = out.merge(ffs_end, on=\"person_id\", how=\"left\")\n    out[\"fup_end\"] = out[\"death_date\"].fillna(out[\"ffs_end\"])   # never censor at last claim\n    out[\"fup_days\"] = (out[\"fup_end\"] - out[\"index_date\"]).dt.days\n    return out[[\"person_id\", \"index_date\", \"death_flag\", \"death_date\",\n                \"date_source\", \"fup_end\", \"fup_days\"]]",
        "description": "Build an all-cause-mortality endpoint from multiple death feeds via a priority hierarchy.\nRequired inputs (cleaned, one concept each; dates are datetime):\n  cohort  : person_id, index_date\n  enroll  : person_id, ffs_start, ffs_end            # FFS A/B-observable mortality follow-up spans\n  dmf     : person_id, death_date, date_precision     # enrollment/DMF feed; precision in {'day','month'}\n  inp     : person_id, thru_date, discharge_status    # inpatient claims; discharge_status code\n  ndi     : person_id, death_date                     # linked NDI/state vital record (may be empty)\nEXPIRED_CODES = inpatient discharge-status values meaning the patient died in hospital.\nReturns one row per person with death_flag, death_date, date_source, and follow-up end.\nUnion rule maximizes capture; swap to a >=2-source rule for a specificity sensitivity analysis.",
        "dependencies": [
          "pandas",
          "numpy"
        ],
        "source_citations": [
          "curtis-2018"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\n\nEXPIRED_CODES   <- c(\"20\", \"40\", \"41\", \"42\")  # discharge_status = expired (site-specific)\nMONTH_IMPUTE_DAY <- 15L                         # pre-specified day for month/year-only dates\n\nbuild_mortality_endpoint <- function(cohort, enroll, dmf, inp, ndi) {\n  setDT(cohort); setDT(enroll); setDT(dmf); setDT(inp); setDT(ndi)\n  idx <- cohort[, .(person_id, index_date)]\n\n  d_ndi <- ndi[, .(person_id, death_date, source = \"ndi\", priority = 1L)]\n\n  d_dmf <- copy(dmf)\n  mo <- d_dmf$date_precision == \"month\"\n  d_dmf[mo, death_date := as.Date(format(death_date, \"%Y-%m-01\")) + (MONTH_IMPUTE_DAY - 1L)]\n  d_dmf <- d_dmf[, .(person_id, death_date, source = \"dmf\", priority = 2L)]\n\n  d_inp <- inp[as.character(discharge_status) %chin% EXPIRED_CODES,\n               .(person_id, death_date = thru_date, source = \"discharge\", priority = 3L)]\n\n  cand <- rbindlist(list(d_ndi, d_dmf, d_inp))[idx, on = \"person_id\", nomatch = 0L]\n\n  # Validity: death on/after index and within FFS-observable enrollment.\n  cand <- enroll[cand, on = \"person_id\", nomatch = 0L]\n  cand <- cand[death_date >= index_date & death_date >= ffs_start & death_date <= ffs_end]\n\n  # FLAG = union; DATE = highest-priority source per person.\n  setorder(cand, person_id, priority, death_date)\n  dead <- cand[, .(death_date = death_date[1L], date_source = source[1L]), by = person_id]\n\n  out <- dead[idx, on = \"person_id\"]\n  out[, death_flag := as.integer(!is.na(death_date))]\n\n  ffs_end <- enroll[, .(ffs_end = max(ffs_end)), by = person_id]\n  out <- ffs_end[out, on = \"person_id\"]\n  out[, fup_end  := fifelse(is.na(death_date), ffs_end, death_date)]  # not last claim\n  out[, fup_days := as.integer(fup_end - index_date)]\n  out[, .(person_id, index_date, death_flag, death_date, date_source, fup_end, fup_days)]\n}",
        "description": "All-cause-mortality endpoint from multiple death feeds via a priority hierarchy (data.table).\nInputs mirror the Python version:\n  cohort : person_id, index_date (Date)\n  enroll : person_id, ffs_start, ffs_end (Date)        # FFS-observable follow-up spans\n  dmf    : person_id, death_date (Date), date_precision in {'day','month'}\n  inp    : person_id, thru_date (Date), discharge_status\n  ndi    : person_id, death_date (Date)                 # linked NDI/state record (may be empty)",
        "dependencies": [
          "data.table"
        ],
        "source_citations": [
          "curtis-2018"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "%let impute_day = 15;  /* pre-specified day for month/year-only dates */\n\n/* Candidate death records from each feed, tagged with source priority (1=best). */\ndata cand;\n  set work.ndi(in=n keep=person_id death_date)\n      work.dmf(in=d keep=person_id death_date date_precision)\n      work.inp(in=i keep=person_id thru_date discharge_status);\n  length source $9;\n  if n then do; source='ndi';       priority=1; end;\n  else if d then do;\n    source='dmf'; priority=2;\n    /* snap month/year-only DMF dates to a pre-specified day */\n    if date_precision='month' then\n      death_date = mdy(month(death_date), &impute_day, year(death_date));\n  end;\n  else if i then do;\n    source='discharge'; priority=3; death_date = thru_date;\n    if discharge_status not in ('20','40','41','42') then delete;  /* expired codes only */\n  end;\n  keep person_id death_date source priority;\nrun;\n\n/* Validity filter: death on/after index and within FFS-observable enrollment. */\nproc sql;\n  create table cand_ok as\n  select c.person_id, c.death_date, c.source, c.priority\n  from cand c\n    inner join work.cohort co on c.person_id = co.person_id\n    inner join work.enroll e  on c.person_id = e.person_id\n  where c.death_date >= co.index_date\n    and c.death_date between e.ffs_start and e.ffs_end;\nquit;\n\n/* FLAG = union (any source); DATE = highest-priority source per person. */\nproc sort data=cand_ok; by person_id priority death_date; run;\ndata dead;\n  set cand_ok; by person_id;\n  if first.person_id;   /* keep best-priority, earliest-date record */\n  rename source=date_source;\nrun;\n\n/* Assemble endpoint: never censor at last claim - censor at end of FFS enrollment. 
*/\nproc sql;\n  create table endpoint as\n  select co.person_id, co.index_date,\n         case when d.death_date is not null then 1 else 0 end as death_flag,\n         d.death_date, d.date_source,\n         coalesce(d.death_date, e.ffs_end) as fup_end format=date9.,\n         (calculated fup_end) - co.index_date as fup_days\n  from work.cohort co\n    left join dead d  on co.person_id = d.person_id\n    left join (select person_id, max(ffs_end) as ffs_end from work.enroll\n               group by person_id) e on co.person_id = e.person_id;\nquit;\n\n/* Fine-Gray subdistribution-hazard model for a NON-fatal primary outcome with death as the\n   competing event (event_type: 0=censored, 1=outcome, 2=death). For an all-cause-mortality Cox\n   model, drop eventcode= and use model fup_days*death_flag(0) instead. */\nproc phreg data=work.analytic;            /* analytic = endpoint joined to exposure + covariates */\n  class arm(ref='COMPARATOR');\n  model fup_days*event_type(0) = arm /* add covariates here */ / eventcode=1 risklimits;  /* 1=outcome, 2=death */\nrun;",
        "description": "All-cause-mortality endpoint via a priority hierarchy, then a competing-risks-aware survival fit.\nRequired input datasets (post data-management):\n  work.cohort : person_id, index_date\n  work.enroll : person_id, ffs_start, ffs_end                 (FFS-observable follow-up spans)\n  work.dmf    : person_id, death_date, date_precision ('day'/'month')\n  work.inp    : person_id, thru_date, discharge_status\n  work.ndi    : person_id, death_date                          (linked NDI; may be empty)\nPROC PHREG with eventcode= fits the Fine-Gray subdistribution model when death competes with a\nnon-fatal outcome; drop eventcode= (and treat death as the event) for an all-cause-mortality Cox model.",
        "dependencies": [],
        "source_citations": [
          "austin-2016"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  P[Patient in cohort<br/>index_date assigned] --> NDI{Linked NDI /<br/>state vital record?}\n  NDI -- yes --> DT[Death date = NDI date<br/>source priority 1]\n  NDI -- no --> DMF{Enrollment / DMF<br/>death_date present?}\n  DMF -- yes --> IMP[Snap month/year-only<br/>to pre-specified day] --> DT2[Death date = DMF date<br/>priority 2]\n  DMF -- no --> DIS{Inpatient discharge<br/>status = expired?}\n  DIS -- yes --> DT3[Death date = claim thru_date<br/>priority 3]\n  DIS -- no --> ALIVE[No death evidence<br/>administrative censor at end of FFS enrollment<br/>NOT at last claim]\n  DT --> VAL{Date on/after index<br/>and within enrollment?}\n  DT2 --> VAL\n  DT3 --> VAL\n  VAL -- yes --> DEAD[death_flag = 1<br/>follow-up ends at death date]\n  VAL -- no --> DROP[Drop impossible record<br/>fall through to next source]",
        "caption": "Priority hierarchy for assigning a single death flag and date. Capture uses the union of sources (any valid death counts); the date is taken from the highest-priority source; unobserved patients are administratively censored at end of FFS enrollment, never at the last observed claim.",
        "alt_text": "Decision flowchart that checks NDI, then enrollment/DMF, then inpatient discharge status to assign a death date by priority, validates the date against index and enrollment, and otherwise censors at end of enrollment.",
        "source_type": "illustrative",
        "source_citations": [
          "curtis-2018"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  subgraph Sources [Death feeds, ranked by validity]\n    S1[NDI / linked vital record<br/>best fact + cause + date]\n    S2[Enrollment / DMF death_date<br/>broad capture, month-level date]\n    S3[Inpatient discharge = expired<br/>precise date, in-hospital only]\n    S4[EHR deceased flag<br/>in-network deaths only]\n  end\n  S1 --> H[Mortality source hierarchy<br/>union flag + priority date]\n  S2 --> H\n  S3 --> H\n  S4 --> H\n  H --> EST{Estimand?}\n  EST -- death is the outcome --> AC[All-cause mortality<br/>Cox / pooled logistic]\n  EST -- non-fatal outcome,<br/>death competes --> CR[Cause-specific hazard<br/>vs subdistribution / CIF Fine-Gray]",
        "caption": "Data flow from ranked death feeds into the composite endpoint, then the estimand fork that determines whether death is the analyzed outcome or a competing risk requiring a cause-specific vs subdistribution-hazard choice.",
        "alt_text": "Diagram showing four ranked death sources flowing into a mortality source hierarchy, then branching by estimand into all-cause mortality analysis or competing-risks analysis.",
        "source_type": "illustrative",
        "source_citations": [
          "austin-2016"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "see_also",
        "target_slug": "immortal-time-bias-handling",
        "notes": "Censoring at the last observed claim instead of ascertaining death (or mishandling the death date relative to follow-up start) can create immortal time and informative censoring."
      },
      {
        "relation_type": "used_with",
        "target_slug": "target-trial-emulation",
        "notes": "A pre-specified mortality endpoint and censoring rule are required components of the outcome and follow-up specification in a target-trial emulation with a survival or composite endpoint."
      },
      {
        "relation_type": "used_with",
        "target_slug": "active-comparator-new-user",
        "notes": "ACNU cohorts feed this endpoint; the death date and FFS-enrollment censoring rule define follow-up identically across both arms."
      }
    ],
    "aliases": [
      "mortality ascertainment hierarchy",
      "death ascertainment algorithm",
      "composite mortality endpoint",
      "vital status source hierarchy",
      "death date source priority"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "mother-infant-linkage-rwe",
    "name": "Mother-Infant Linkage",
    "short_definition": "A cohort-construction step that connects a pregnant person's records to the records of the resulting infant(s) in claims, EHR, or registry data so that in-utero (or lactation) drug exposure can be attributed to infant outcomes within a single longitudinal analytic dataset.",
    "long_description": "**Mother-infant linkage** is the operational backbone of pregnancy and infant pharmacoepidemiology in routinely collected\ndata. The scientific question — does a maternal exposure during a specific gestational window cause an infant outcome\n(major congenital malformation, preterm birth, neurodevelopmental endpoint) — cannot be answered unless each mother record\nis joined to the record(s) of her live-born infant(s), because exposure is measured on the *mother's* timeline and the\noutcome accrues on the *infant's* timeline. Linkage produces the pivot table that lets a single analytic cohort carry\nmaternal `index_date`, the pregnancy window, and the infant's `birth_date`, enrollment, and outcome follow-up. It is a\ncohort-construction operation, not an estimator: the estimand (e.g., risk of malformation among live births under exposed\nvs unexposed maternal treatment strategies) is specified downstream, but its validity is bounded by how well the link was\nbuilt.\n\n**Core conceptual distinction** Two ideas are separable and both must be specified. (1) *Linkage mechanism*: a\n**deterministic** link uses a shared key — a family/subscriber identifier in commercial claims, a maternal Medicaid\nidentifier (`MSIS_ID`) carried onto the infant claim, or a birth-certificate / hospital-discharge record that names both —\noptionally constrained by a date rule (infant `birth_date` falls within a tight window of a maternal delivery claim). A\n**probabilistic** link scores candidate mother-infant pairs on multiple partial identifiers (date of birth, ZIP, plan,\ndelivery hospital) when no single key is reliable. (2) *Linkage substrate / direction of conditioning*: linkage in claims\nand EHR is almost always **live-birth-conditioned** — the infant must enroll or generate a record to be linkable, so the\ncohort is implicitly restricted to pregnancies ending in a live, observed birth. 
This is the difference between asking\n\"among live births, what is the risk?\" (answerable by mother-infant linkage) and \"among all pregnancies, what is the risk\nof any adverse outcome including loss?\" (requires the maternal-only pregnancy cohort, not the linked infant cohort). Most\nmalformation studies want the former; many safety questions about pregnancy loss demand the latter, and forcing the linked\ncohort onto a loss question induces selection bias.\n\n**Pros, cons, and trade-offs**\n- **vs a maternal-only pregnancy cohort (no infant link):** Linkage is the *only* way to ascertain infant outcomes\n  diagnosed after delivery (malformations confirmed in the neonatal period, infant hospitalizations, developmental\n  endpoints), which maternal records alone cannot capture. Cost: it conditions on live birth and infant\n  observability, discarding pregnancies that end in loss or in an infant who never enrolls — a selection step that can be\n  *differential by exposure*. **Prefer linkage** for infant outcomes; **keep the maternal-only cohort** in parallel for\n  spontaneous-abortion / stillbirth endpoints and for a denominator check.\n- **vs a dedicated pregnancy/birth-defects registry (e.g., a product or disease pregnancy registry):** Registries\n  prospectively collect adjudicated outcomes and exposures with low misclassification but are small, slow, prone to\n  volunteer/selection bias, and rarely powered for rare malformations. Linked claims/EHR are large and population-based,\n  capturing the full source population at the cost of algorithmic exposure and outcome definitions. 
**Prefer linked\n  administrative data** for population estimates and rare endpoints; **prefer (or triangulate with) a registry** when\n  teratogenic mechanism, dose, and adjudication matter and the exposure is uncommon.\n- **vs probabilistic linkage:** A clean deterministic key (subscriber/family ID, MSIS_ID + date rule) is faster,\n  auditable, and defensible to regulators; probabilistic linkage rescues pairs when keys are missing but introduces\n  false-match and missed-match error that must be quantified (sensitivity/PPV of the link itself) and propagated. **Prefer\n  deterministic** when a stable key exists; reserve probabilistic linkage for fragmented sources and report its error.\n\n**When to use** Any study whose outcome is measured on the infant — major congenital malformations, small-for-gestational\nage, preterm birth confirmed at delivery, neonatal complications, infant hospitalization, or longer-horizon\nneurodevelopmental endpoints — with maternal exposure measured during a defined gestational window. It is the\nprerequisite step for in-utero exposure-outcome studies in commercial claims, Medicaid (MAX/T-MSIS), national systems\n(e.g., Sentinel mother-infant linkage tables), and linked EHR-vital-records substrates.\n\n**When NOT to use — and when it is actively misleading or dangerous**\n- **The outcome is pregnancy loss or any non-live-birth endpoint.** The linked infant cohort exists only for live births;\n  using it to study miscarriage/stillbirth conditions on the very event of interest and is structurally biased. Use the\n  maternal pregnancy cohort with all pregnancy outcomes.\n- **Differential live-birth selection by exposure (the dangerous case).** A strong teratogen or abortifacient can cause\n  early pregnancy loss, so exposed pregnancies are *differentially less likely* to reach a linkable live birth. 
The\n  surviving exposed infants are a selected, healthier-than-average subset (a live-birth / competing-event selection bias,\n  analogous to depletion of susceptibles). The malformation risk among live births can be biased *toward the null* — the\n  method looks reassuring precisely when the drug is most harmful. Diagnose by comparing live-birth proportions by\n  exposure and by analyzing pregnancy loss in the maternal cohort; never report only the live-birth-conditioned estimate\n  for a suspected teratogen.\n- **No reliable linking key and high move/churn.** Multi-state Medicaid moves break `MSIS_ID` carry-over; infants enrolled\n  under a different plan or subscriber than the mother (e.g., infant on the other parent's commercial policy) are\n  unlinkable, and the unlinkable fraction can correlate with socioeconomic factors and thus with exposure.\n- **Twins/higher-order multiples handled as singletons.** One delivery maps to N infants; collapsing them mis-assigns\n  outcomes and miscounts the denominator. Multiples must fan out one maternal pregnancy to multiple infant rows with\n  clustered/robust variance downstream.\n\n**Data-source operational depth**\n- **Medicaid (MAX / T-MSIS):** The reference substrate for U.S. pregnancy pharmacoepidemiology because Medicaid finances\n  ~40-50% of U.S. births. Linkage uses the maternal `MSIS_ID` (or the explicit mother-infant linkage variables MAX/T-MSIS\n  provides) plus a delivery-to-birth date rule. 
Failure modes: cross-state moves and re-enrollment churn break the\n  longitudinal `MSIS_ID`, dropping infants; managed-care (capitated) encounters may under-report relative to fee-for-service\n  so \"no infant claim\" can be missingness rather than no event; require continuous Medicaid enrollment for the mother\n  across the pregnancy window and for the infant across the outcome window so absence of a diagnosis is observed, not\n  unobserved.\n- **Commercial claims:** Link through the **subscriber/family identifier** — the infant typically enrolls as a new member\n  under a parent subscriber within weeks of birth; pair on shared subscriber ID + infant `birth_date` within a tight\n  window of the maternal delivery claim (DRG/ICD/CPT for delivery). Failure modes: the infant may be enrolled on a\n  *different* subscriber than the exposed mother (unlinkable), short post-birth enrollment truncates outcome\n  ascertainment, and **Medicare Advantage / capitated person-time** drops fee-for-service claims so delivery or infant\n  events can be invisible — exclude MA-only / capitated-only person-time or treat it as missing.\n- **EHR:** Linkage rides on the health-system's relational model (the mother's and infant's encounters share a birth\n  event/encounter, or a documented maternal medical-record number on the neonatal chart). Strong for clinical detail\n  (gestational age, birthweight, problem lists) but visit-driven: an infant who receives care outside the system is\n  differentially lost, and external-care leakage truncates outcome capture. 
Prefer EHR linked to vital records.\n- **Registry / linked vital records:** Birth and fetal-death certificates anchor the pregnancy outcome and gestational\n  age and (when linked to claims/EHR) supply the most complete, adjudicated denominator including stillbirths — the ideal\n  substrate, but linkage to certificates introduces its own selection (only the linkable subset) and date-reconciliation\n  issues among certificate date, delivery claim, and first infant claim.\n\n**Worked claims example.** Question: risk of major congenital malformation among live-born infants of mothers who filled\na study antiepileptic during the first trimester vs an active comparator antiepileptic, in a commercial + Medicaid FFS\ndatabase. (1) Build the maternal pregnancy cohort: identify deliveries via delivery DRG/ICD/CPT codes; estimate pregnancy\nstart (last menstrual period) by back-dating from the delivery using a gestational-age algorithm so the first-trimester\nexposure window is defined. (2) Require continuous maternal medical + pharmacy enrollment from before LMP through delivery\n(so first-trimester fills are observable) and exclude MA-only/capitated person-time. (3) Define exposure from\n`fill_date` + `days_supply` overlapping the first-trimester window on the *maternal* timeline. (4) **Link**: for each\ndelivery, find the infant member sharing the maternal subscriber/family ID (commercial) or `MSIS_ID` (Medicaid) whose\n`birth_date` falls within +/- 7 days of the delivery claim; fan out twins/multiples to one infant row each; flag and count\nunlinkable deliveries. (5) Require continuous infant enrollment from birth through a fixed outcome window (e.g., 90 days\nor 1 year) so malformation diagnoses are observable. (6) Ascertain the outcome on the *infant* timeline with a validated\nmalformation algorithm (e.g., >=1 inpatient or >=2 outpatient diagnoses in the window). 
(7) Diagnostics that gate the\nestimate: the linkage rate and unlinkable fraction by exposure arm, the live-birth proportion by arm (to detect\ndifferential loss), an attrition funnel, and a parallel maternal-cohort analysis of pregnancy loss; cluster variance on\nthe pregnancy/mother for multiples; and a deterministic-vs-probabilistic-link sensitivity analysis. Estimation\n(PS-balanced risk ratio among live births) happens only after this linked cohort is validated.",
    "primary_category": "Study_Design",
    "tags": [
      "mother-infant-linkage",
      "pregnancy-pharmacoepidemiology",
      "perinatal-epidemiology",
      "cohort-construction",
      "record-linkage",
      "congenital-malformations",
      "special-populations"
    ],
    "applies_to_study_types": [
      "claims_analysis",
      "ehr_study",
      "registry_linkage",
      "comparative_effectiveness"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1002/pds.4789",
        "url": "https://doi.org/10.1002/pds.4789",
        "citation_text": "Huybrechts KF, Bateman BT, Hernández-Díaz S. Use of real-world evidence from healthcare utilization data to evaluate drug safety during pregnancy. Pharmacoepidemiology and Drug Safety. 2019;28(7):906-922.",
        "year": 2019,
        "authors_short": "Huybrechts et al.",
        "notes": "Authoritative methods review of pregnancy drug-safety RWE, including mother-infant linkage, pregnancy-start estimation, exposure windows, and the live-birth selection problem."
      },
      {
        "role": "explain",
        "doi": "10.1002/pds.3284",
        "url": "https://doi.org/10.1002/pds.3284",
        "citation_text": "Margulis AV, Setoguchi S, Mittleman MA, Glynn RJ, Dormuth CR, Hernández-Díaz S. Algorithms to estimate the beginning of pregnancy in administrative databases. Pharmacoepidemiology and Drug Safety. 2013;22(1):16-24.",
        "year": 2013,
        "authors_short": "Margulis et al.",
        "notes": "Validates algorithms to back-date pregnancy start from delivery, the prerequisite for defining gestational exposure windows on the maternal timeline before linkage."
      },
      {
        "role": "demonstrate",
        "doi": "10.1371/journal.pone.0067405",
        "url": "https://doi.org/10.1371/journal.pone.0067405",
        "citation_text": "Palmsten K, Huybrechts KF, Mogun H, et al. Harnessing the Medicaid Analytic eXtract (MAX) to evaluate medications in pregnancy: design considerations. PLoS ONE. 2013;8(6):e67405.",
        "year": 2013,
        "authors_short": "Palmsten et al.",
        "notes": "Canonical worked demonstration of mother-infant linkage in MAX/Medicaid, including identifier rules, enrollment requirements, and linkage yield for pregnancy pharmacoepidemiology."
      }
    ],
    "plain_language_summary": "Mother-infant linkage is a data-joining step that connects a pregnant person's health records to the records of her newborn so a researcher can measure what the mother was exposed to during pregnancy and what happened to the baby after birth. Without this join, you only have half the story: the drug fills are on the mother's record, but the malformation diagnosis is on the baby's record. The catch is that you can only link pairs where the baby was born alive and enrolled under the same insurance policy as the mother, so pregnancies that ended in loss and babies enrolled under a different policy are silently missing from your study. If the drug also causes pregnancy loss, that gap can make a harmful drug look safer than it really is.",
    "key_terms": [
      {
        "term": "family or subscriber ID",
        "definition": "A shared number that an insurance plan uses to group a parent and their dependents together under one policy, making it possible to find a newborn's enrollment record by looking for a new dependent added under the mother's account shortly after her delivery."
      },
      {
        "term": "MSIS_ID",
        "definition": "A unique identifier assigned to a person in Medicaid (the U.S. public insurance program for low-income individuals) that is sometimes carried onto the baby's record at birth, allowing the same kind of family-based matching used in commercial insurance."
      },
      {
        "term": "delivery claim",
        "definition": "A billing record submitted to the insurance plan when a hospital provides labor and delivery services, used to identify the date a pregnancy ended and to anchor the search for a matching newborn record."
      },
      {
        "term": "live-birth selection",
        "definition": "The fact that a linked mother-infant study can only include pregnancies that produced a live, enrolled baby, meaning pregnancies that ended in loss or in an unenrollable baby are excluded, which can skew the results if the drug being studied also causes pregnancy loss."
      },
      {
        "term": "teratogenicity",
        "definition": "The potential of a drug or chemical exposure during pregnancy to cause structural defects or developmental problems in the developing baby."
      }
    ],
    "worked_example": {
      "scenario": "A researcher wants to know whether a medication taken in the first three months of pregnancy raises the risk of a major birth defect in the baby. The drug fill records are on the mother's insurance file, but the birth-defect diagnosis will appear on the baby's insurance file after birth. To connect them, the analyst looks for a baby enrolled under the same family ID whose recorded birth date falls within a week of the mother's delivery claim. The table below shows five deliveries and the infant records found when searching by matching family ID and birth date.",
      "dataset": {
        "caption": "Five deliveries (rows A-E) showing which mother-infant pairs link successfully and which do not, along with the reason a pair cannot be joined.",
        "columns": [
          "pair_id",
          "mom_person_id",
          "delivery_date",
          "family_id",
          "infant_person_id",
          "infant_birth_date",
          "days_apart",
          "link_status",
          "reason_if_unlinked"
        ],
        "rows": [
          [
            "A",
            "MOM-001",
            "2023-03-15",
            "FAM-100",
            "INF-201",
            "2023-03-15",
            0,
            "linked",
            ""
          ],
          [
            "B",
            "MOM-002",
            "2023-04-02",
            "FAM-200",
            "INF-202",
            "2023-04-03",
            1,
            "linked",
            ""
          ],
          [
            "C",
            "MOM-003",
            "2023-05-10",
            "FAM-300",
            "INF-203",
            "2023-05-11",
            1,
            "linked",
            ""
          ],
          [
            "D",
            "MOM-004",
            "2023-06-20",
            "FAM-400",
            null,
            null,
            null,
            "unlinked",
            "Infant enrolled under father's separate policy, different family_id"
          ],
          [
            "E",
            "MOM-005",
            "2023-07-08",
            "FAM-500",
            null,
            null,
            null,
            "unlinked",
            "Pregnancy ended in stillbirth, no infant enrollment record exists"
          ]
        ]
      },
      "steps": [
        "For each mother who has a delivery claim, search the insurance enrollment file for a new member who shares the same family ID and whose recorded birth date is within 7 days of the delivery date.",
        "Pairs A, B, and C each have an infant enrolled under the same family ID within 1 day of the delivery, so they link successfully and the researcher can look up both the mother's drug fills and the baby's diagnosis records.",
        "Pair D fails to link because the infant was enrolled under the father's separate employer plan, which has a different family ID, so no matching infant record exists on the mother's side.",
        "Pair E fails to link because the pregnancy ended in a stillbirth; there is no live baby and therefore no infant enrollment record to find.",
        "The linked cohort contains 3 of the 5 deliveries (60 percent linkage rate). The 2 unlinked deliveries are not random: one is a stillbirth (an adverse pregnancy outcome in its own right) and one reflects a family structure that correlates with socioeconomic factors, both of which can be related to the drug exposure being studied."
      ],
      "result": "3 out of 5 deliveries linked (linkage rate 60%). The 2 unlinked deliveries include 1 stillbirth and 1 infant on a different plan. If the drug being studied raises the risk of stillbirth, the exposed arm will lose more deliveries to unlinkability than the unexposed arm, making the drug look safer among the live-birth-only group than it actually is across all pregnancies."
    },
    "prerequisites": [
      "pregnancy-exposure-window-rwe",
      "continuous-enrollment-observable-time-rwe",
      "special-populations-rwe-methods"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Deterministic key-based linkage",
        "description": "Join mother and infant on a shared identifier (commercial subscriber/family ID, Medicaid MSIS_ID carried to the infant, or an explicit mother-infant linkage variable) constrained by a delivery-to-birth date rule.",
        "edge_cases": [
          "Infant enrolled under a different subscriber/parent than the exposed mother is unlinkable, and unlinkability can correlate with socioeconomic status and thus exposure.",
          "Cross-state Medicaid moves and re-enrollment churn break MSIS_ID continuity, dropping otherwise eligible infants.",
          "Twins/multiples produce multiple infant rows per delivery and require fan-out plus clustered variance downstream."
        ],
        "data_source_notes": "commercial: pair on subscriber ID + infant birth_date within a tight window of the delivery claim; Medicaid: use MSIS_ID or the provided mother-infant linkage fields; always report the unlinkable fraction by arm."
      },
      {
        "name": "Probabilistic linkage",
        "description": "Score candidate mother-infant pairs on multiple partial identifiers (date of birth, ZIP, plan, delivery facility) when no single reliable key exists, accepting matches above a threshold.",
        "edge_cases": [
          "False matches attach the wrong infant outcome to a mother; missed matches drop true pairs - both biases must be quantified as link sensitivity/PPV and propagated to the effect estimate.",
          "Threshold choice trades false-match against missed-match rate and shifts the analyzable population."
        ],
        "data_source_notes": "Report the matching algorithm, blocking variables, accepted-pair threshold, and a sensitivity analysis against the deterministic subset where both are available."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Maternal-only pregnancy cohort (no infant link)",
        "pros_of_this": "Enables ascertainment of infant outcomes (malformations, neonatal events, developmental endpoints) that accrue on the infant timeline and are invisible in maternal records.",
        "cons_of_this": "Conditions on live birth and infant observability, discarding losses and unlinkable infants - a selection step that can be differential by exposure and bias teratogen effects toward the null.",
        "when_to_prefer": "Infant outcomes among live births; retain the maternal-only cohort in parallel for pregnancy-loss endpoints and as a denominator/selection check."
      },
      {
        "compared_to": "Pregnancy / birth-defects registry",
        "pros_of_this": "Population-based, large, and powered for rare malformations using the full source population rather than volunteers.",
        "cons_of_this": "Relies on algorithmic exposure and outcome definitions with more misclassification than adjudicated registry data, and lacks prospective dose/mechanism detail.",
        "when_to_prefer": "Population estimates and rare endpoints; triangulate with or defer to a registry when adjudication, dose, and teratogenic mechanism are central and the exposure is uncommon."
      },
      {
        "compared_to": "Probabilistic linkage",
        "pros_of_this": "A clean deterministic key is fast, auditable, reproducible, and defensible to regulators.",
        "cons_of_this": "Deterministic linkage fails when keys are missing or broken (cross-plan/cross-state), where probabilistic linkage can recover pairs.",
        "when_to_prefer": "Use deterministic linkage whenever a stable key exists; reserve probabilistic linkage for fragmented sources and report its match error."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Link via subscriber/family ID (commercial) or MSIS_ID / provided linkage fields (Medicaid) plus a delivery-claim to infant birth_date date rule. Require continuous maternal enrollment across the gestational exposure window and continuous infant enrollment across the outcome window; exclude MA-only/capitated person-time where FFS claims are unavailable. Fan out multiples; report the unlinkable fraction by arm.",
      "ehr": "Link on the shared birth encounter or a documented maternal MRN on the neonatal chart. Strong for gestational age and birthweight, but visit-driven: infants receiving external care are differentially lost. Prefer EHR linked to vital records to anchor the pregnancy outcome and gestational age.",
      "registry": "Birth/fetal-death certificates anchor the outcome, gestational age, and a complete denominator including stillbirths; link to claims/EHR for exposure and longer infant follow-up. Linkage to certificates adds its own selection (linkable subset) and date-reconciliation issues.",
      "linked": "Linked claims-EHR-vital-records is the ideal substrate (clinical detail + claims completeness + adjudicated pregnancy outcomes) but compounds linkage selection and requires reconciling certificate, delivery-claim, and first-infant-claim dates before assigning the infant timeline."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\n\nDATE_TOL_DAYS = 7      # infant birth_date must fall within +/- 7 days of the maternal delivery claim\nOUTCOME_DAYS  = 365    # required continuous infant enrollment from birth to ascertain the outcome\n\ndef link_mother_infant(deliveries: pd.DataFrame, infants: pd.DataFrame,\n                       mom_enroll: pd.DataFrame, inf_enroll: pd.DataFrame) -> pd.DataFrame:\n    # Candidate pairs share the family/subscriber (or MSIS) key; multiples naturally fan out here.\n    pairs = deliveries.merge(infants, on=\"family_id\", suffixes=(\"_mom\", \"_inf\"))\n\n    # Date rule: keep only infants born within tolerance of the delivery claim.\n    gap = (pairs[\"birth_date\"] - pairs[\"delivery_date\"]).dt.days.abs()\n    pairs = pairs[gap <= DATE_TOL_DAYS].copy()\n\n    # Maternal enrollment must cover the gestational exposure window (~280d before delivery) and be FFS-observable.\n    m = mom_enroll.merge(pairs[[\"mom_person_id\", \"delivery_date\"]].drop_duplicates(), on=\"mom_person_id\")\n    m[\"covers\"] = ((m[\"enroll_start\"] <= m[\"delivery_date\"] - pd.Timedelta(days=280)) &\n                   (m[\"enroll_end\"]   >= m[\"delivery_date\"]) & (~m[\"ma_only\"]))\n    mom_ok = set(m.loc[m[\"covers\"], \"mom_person_id\"])\n\n    # Infant enrollment must cover birth through the outcome window and be FFS-observable.\n    i = inf_enroll.merge(pairs[[\"infant_person_id\", \"birth_date\"]].drop_duplicates(), on=\"infant_person_id\")\n    i[\"covers\"] = ((i[\"enroll_start\"] <= i[\"birth_date\"]) &\n                   (i[\"enroll_end\"]   >= i[\"birth_date\"] + pd.Timedelta(days=OUTCOME_DAYS)) & (~i[\"ma_only\"]))\n    inf_ok = set(i.loc[i[\"covers\"], \"infant_person_id\"])\n\n    linked = pairs[pairs[\"mom_person_id\"].isin(mom_ok) & pairs[\"infant_person_id\"].isin(inf_ok)].copy()\n    # plurality > 1 flags multiples for clustered variance on mom_person_id downstream.\n    linked[\"plurality\"] = 
linked.groupby([\"mom_person_id\", \"delivery_date\"])[\"infant_person_id\"].transform(\"nunique\")\n    return linked[[\"mom_person_id\", \"infant_person_id\", \"delivery_date\", \"birth_date\", \"plurality\"]]",
        "description": "Deterministic mother-infant linkage and live-birth cohort assembly from claims-style inputs. Required inputs\n(cleaned, de-duplicated):\n  deliveries : one row per maternal delivery -> mom_person_id, delivery_date (datetime), family_id (subscriber/MSIS key)\n  infants    : candidate infant members      -> infant_person_id, birth_date (datetime), family_id\n  mom_enroll : maternal enrollment spans      -> mom_person_id, enroll_start, enroll_end, ma_only (bool)\n  inf_enroll : infant enrollment spans        -> infant_person_id, enroll_start, enroll_end, ma_only (bool)\nReturns one row per linked mother-infant pair (multiples fan out to multiple rows) carrying both timelines; build\nmaternal exposure on delivery_date/LMP and infant outcomes on birth_date downstream, applying rules identically to arms.",
        "dependencies": [
          "pandas"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\nDATE_TOL_DAYS <- 7L\nOUTCOME_DAYS  <- 365L\n\nlink_mother_infant <- function(deliveries, infants, mom_enroll, inf_enroll) {\n  setDT(deliveries); setDT(infants); setDT(mom_enroll); setDT(inf_enroll)\n\n  # Candidate pairs share the family/subscriber (or MSIS) key; multiples fan out.\n  pairs <- merge(deliveries, infants, by = \"family_id\", allow.cartesian = TRUE)\n  pairs <- pairs[abs(as.integer(birth_date - delivery_date)) <= DATE_TOL_DAYS]\n\n  # Maternal enrollment must cover the gestational window (~280d) and be FFS-observable.\n  m <- merge(mom_enroll, unique(pairs[, .(mom_person_id, delivery_date)]), by = \"mom_person_id\")\n  mom_ok <- m[enroll_start <= delivery_date - 280L & enroll_end >= delivery_date & !ma_only,\n              unique(mom_person_id)]\n\n  # Infant enrollment must cover birth through the outcome window and be FFS-observable.\n  i <- merge(inf_enroll, unique(pairs[, .(infant_person_id, birth_date)]), by = \"infant_person_id\")\n  inf_ok <- i[enroll_start <= birth_date & enroll_end >= birth_date + OUTCOME_DAYS & !ma_only,\n              unique(infant_person_id)]\n\n  linked <- pairs[mom_person_id %in% mom_ok & infant_person_id %in% inf_ok]\n  # Plurality = infants per delivery (not per mother), so repeat pregnancies are not flagged as multiples.\n  linked[, plurality := uniqueN(infant_person_id), by = .(mom_person_id, delivery_date)]\n  linked[, .(mom_person_id, infant_person_id, delivery_date, birth_date, plurality)]\n}",
        "description": "Deterministic mother-infant linkage with data.table. Inputs mirror the Python version:\n  deliveries : mom_person_id, delivery_date (Date), family_id\n  infants    : infant_person_id, birth_date (Date), family_id\n  mom_enroll : mom_person_id, enroll_start, enroll_end, ma_only (logical)\n  inf_enroll : infant_person_id, enroll_start, enroll_end, ma_only (logical)\nReturns one row per linked pair (multiples fan out); plurality flags multiples for clustered variance downstream.",
        "dependencies": [
          "data.table"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "%let date_tol = 7;     /* infant birth_date within +/- 7 days of the delivery claim */\n%let out_days = 365;   /* required continuous infant enrollment from birth */\n\n/* Candidate pairs share the family/subscriber (or MSIS) key, restricted by the birth-to-delivery date rule. */\nproc sql;\n  create table pairs as\n  select d.mom_person_id, i.infant_person_id,\n         d.delivery_date, i.birth_date, d.family_id\n  from work.deliveries d\n  inner join work.infants i\n    on d.family_id = i.family_id\n   and abs(i.birth_date - d.delivery_date) <= &date_tol;\nquit;\n\n/* Maternal enrollment covers the ~280-day gestational exposure window and is FFS-observable. */\n/* Infant enrollment covers birth through the outcome window and is FFS-observable. */\nproc sql;\n  create table linked0 as\n  select p.*\n  from pairs p\n  where exists (\n          select 1 from work.mom_enroll me\n          where me.mom_person_id = p.mom_person_id\n            and me.ma_only = 0\n            and me.enroll_start <= p.delivery_date - 280\n            and me.enroll_end   >= p.delivery_date)\n    and exists (\n          select 1 from work.inf_enroll ie\n          where ie.infant_person_id = p.infant_person_id\n            and ie.ma_only = 0\n            and ie.enroll_start <= p.birth_date\n            and ie.enroll_end   >= p.birth_date + &out_days);\nquit;\n\n/* Plurality = infants per mother; > 1 marks multiples for clustered/robust variance in the outcome model. */\nproc sql;\n  create table linked as\n  select l.*, (select count(distinct b.infant_person_id) from linked0 b\n                 where b.mom_person_id = l.mom_person_id) as plurality\n  from linked0 l;\nquit;",
        "description": "Deterministic mother-infant linkage in SAS (PROC SQL). Required input datasets (post data-management):\n  work.deliveries : mom_person_id, delivery_date, family_id\n  work.infants    : infant_person_id, birth_date, family_id\n  work.mom_enroll : mom_person_id, enroll_start, enroll_end, ma_only (0/1)\n  work.inf_enroll : infant_person_id, enroll_start, enroll_end, ma_only (0/1)\nProduces one row per linked pair (multiples fan out); plurality > 1 flags multiples for clustered variance later.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Mom[Maternal pregnancy cohort<br/>delivery claim + estimated LMP] --> Win[Gestational exposure window<br/>maternal fill_date + days_supply]\n  Mom --> Key{Shared key?<br/>subscriber / MSIS_ID}\n  Key -->|deterministic key + birth_date within tol| Pair[Candidate mother-infant pairs<br/>fan out twins/multiples]\n  Key -->|no reliable key| Prob[Probabilistic match<br/>DOB/ZIP/plan/facility]\n  Prob --> Pair\n  Pair --> EnrM[Continuous maternal enrollment<br/>across exposure window, exclude MA-only]\n  Pair --> EnrI[Continuous infant enrollment<br/>birth through outcome window]\n  EnrM --> Cohort[Linked live-birth analytic cohort]\n  EnrI --> Cohort\n  Cohort --> Out[Infant outcome on infant timeline<br/>validated malformation algorithm]\n  Cohort --> Sel[Selection check: live-birth proportion<br/>and unlinkable fraction BY ARM]",
        "caption": "Mother-infant linkage builds a live-birth analytic cohort. A shared key (or probabilistic match) plus a birth-to-delivery date rule forms candidate pairs; enrollment requirements on both timelines and a by-arm selection check gate the cohort before any estimation.",
        "alt_text": "Flowchart from the maternal pregnancy cohort through key-based or probabilistic linkage, multiples fan-out, maternal and infant enrollment requirements, to a linked live-birth cohort with an infant outcome step and a by-arm selection check.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  E[Maternal exposure<br/>first trimester] --> O[Infant outcome<br/>malformation]\n  E -. teratogen raises loss .-> L[Pregnancy loss<br/>competing event]\n  L --> S((Live birth = linkable<br/>collider / selection node))\n  O --> S\n  S --> A[Analysis conditions on<br/>live birth only]",
        "caption": "The live-birth selection trap. Conditioning the linked cohort on live birth opens a selection path when a teratogen also raises pregnancy loss, biasing the exposure-malformation estimate (often toward the null). The maternal cohort with all pregnancy outcomes is required to detect and address this.",
        "alt_text": "A causal/selection diagram showing maternal exposure affecting both infant malformation and pregnancy loss, with live birth as a selection node that the linked-cohort analysis conditions on, opening a selection bias path.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "is_variant_of",
        "target_slug": "special-populations-rwe-methods",
        "notes": "Mother-infant linkage is the core cohort-construction method in the pregnancy/perinatal special-populations family."
      },
      {
        "relation_type": "requires",
        "target_slug": "pregnancy-exposure-window-rwe",
        "notes": "Exposure is defined on the maternal timeline within a gestational window (e.g., first trimester) back-dated from delivery; the window must be specified before linkage is useful."
      },
      {
        "relation_type": "requires",
        "target_slug": "continuous-enrollment-observable-time-rwe",
        "notes": "Both maternal (across the exposure window) and infant (across the outcome window) enrollment must be continuous so that absence of exposure or outcome is observed, not unobserved."
      },
      {
        "relation_type": "used_with",
        "target_slug": "time-zero-index-date-alignment-rwe",
        "notes": "Delivery/birth and the gestational window anchor time zero on two coupled timelines (maternal exposure, infant follow-up) that must be aligned."
      },
      {
        "relation_type": "used_with",
        "target_slug": "pediatric-growth-development-endpoints-rwe",
        "notes": "Linkage is the prerequisite for measuring infant growth, developmental, and other pediatric endpoints attributable to in-utero exposure."
      },
      {
        "relation_type": "alternative_to",
        "target_slug": "pregnancy-registry",
        "notes": "A pregnancy/birth-defects registry collects adjudicated outcomes prospectively; linked administrative data offers population scale and rare-event power - the two are often triangulated."
      },
      {
        "relation_type": "see_also",
        "target_slug": "outcome-algorithm-construction-rwe",
        "notes": "The infant outcome (e.g., major congenital malformation) is ascertained with a validated claims/EHR algorithm on the infant timeline after linkage."
      }
    ],
    "aliases": [
      "mother-baby linkage",
      "maternal-infant linkage",
      "mother-child linkage",
      "mother-infant dyad linkage"
    ],
    "complexity": "advanced",
    "regulatory_relevance": [
      "fda",
      "ema"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "mpr-medication-possession-ratio",
    "name": "Medication Possession Ratio (MPR)",
    "short_definition": "A refill-based adherence measure equal to the total days' supply dispensed during an observation interval divided by the number of days in that interval, which (unlike PDC) is not inherently capped at 1.0 and can exceed 100% when fills overlap or refills arrive early.",
    "long_description": "**Medication Possession Ratio (MPR)** is the oldest and still one of the most common\nrefill-based (\"secondary\") adherence measures derived from pharmacy dispensing records\nor claims. In its canonical (simple, uncapped) form, MPR = (sum of `days_supply` across\nfills in the interval) / (number of days in the interval). It estimates the proportion of\nthe interval during which the patient *possessed* enough medication to take it as\nprescribed. Because the numerator sums supply without regard to whether two fills cover the\nsame calendar day, an early refill or a 90-day mail-order fill stacked on a retail fill can\npush MPR above 1.0 — a value that has no coherent interpretation as \"fraction of days\ncovered.\"\n\n**Core conceptual distinction** — MPR and PDC answer almost the same question with one\ndecisive difference in how they treat *overlap*. PDC (proportion of days covered) is a\nset-cardinality measure: it counts the number of *unique* calendar days in the interval on\nwhich the patient had any supply on hand, divided by interval days; overlapping supply is\ncounted once, so PDC is bounded in [0, 1] and stockpiled days carry forward only as future\ncoverage. MPR is a *quantity* measure: it sums total supply acquired, so overlap is\ndouble-counted and the ceiling is broken. The practical consequence is systematic:\nuncapped MPR ≥ PDC for the same patient and window, and the gap grows with early refilling\nand supply stockpiling. The right framing of the estimand therefore matters — if the\nresearch question is \"what share of time was this patient covered?\" PDC (or capped MPR) is\nthe target; if it is \"how much drug did this patient acquire relative to need?\" (a\nsupply/utilization or cost question), uncapped MPR is the natural quantity. 
Choosing MPR\nfor an \"adherent vs non-adherent\" dichotomy and then capping ad hoc at 1.0 silently\nredefines the measure into a coarse PDC, and reproducibility demands the cap and overlap\nrules be pre-specified.\n\n**Pros, cons, and trade-offs**\n- **vs PDC (pdc-proportion-of-days-covered):** MPR is computationally trivial (one SUM, one\n  division, no day-level set construction), preserves the *magnitude* of oversupply (useful\n  for waste, cost, and supply-chain analyses), and maintains comparability with the large\n  pre-2010 literature and some legacy HEDIS-era reporting. The cost: uncapped MPR overestimates\n  coverage whenever fills overlap, produces >100% values that cannot be read as coverage, and\n  is *not* the PQA/CMS-endorsed measure for the triple-weighted Medicare Part D Star Ratings\n  adherence measures (diabetes, hypertension RAS-antagonists, statins), which use PDC. **Prefer PDC**\n  for almost all contemporary comparative-effectiveness, quality-measurement, and\n  regulatory work; **prefer MPR** only for total-acquisition/cost questions or explicit\n  historical replication.\n- **vs capped MPR (MPR truncated at 1.0):** Capping removes the >100% values and\n  makes MPR converge toward PDC, but it is *not* identical to PDC: capping is applied to the\n  summed ratio after the fact, whereas PDC removes overlap at the day level, so the two can\n  still disagree when fills overlap heavily within the window (capped MPR bounds the ratio\n  at 1.0 but does not reallocate stockpiled days to later gaps the way carry-over\n  PDC does). **Prefer the shift-forward (date-adjusted) variant** if you want overlap to\n  push later coverage outward rather than be discarded.\n- **vs persistence (persistence-time-to-discontinuation):** MPR measures *intensity of\n  coverage* within a fixed denominator window and says nothing about *when* therapy\n  stopped; persistence measures *duration* until a permissible gap is exceeded.\n
A patient\n  can be highly persistent yet have low MPR (chronically under-refilling) or non-persistent\n  yet high MPR over the period they were on therapy. Report both for a complete adherence\n  picture; never substitute one for the other.\n\n**When to use** — use MPR when (1) the question is genuinely about total medication\n*acquisition* or supply relative to need (pharmacy cost, oversupply/waste, dose-adjustment\nsurveillance), (2) you must replicate or pool with older studies that reported MPR, or (3)\nyou need a fast first-pass adherence screen on a single-drug, single-window cohort with\nlittle expected overlap. In all three cases pre-specify the window, the overlap rule, and\nwhether you cap at 1.0.\n\n**When NOT to use — and when it is actively misleading or dangerous**\n- **As the adherence metric for quality reporting or a regulated effectiveness endpoint.**\n  PQA-defined PDC is the standard for CMS Star Ratings; submitting uncapped MPR (or\n  silently capping it) misrepresents performance and is not defensible to FDA/EMA/HTA\n  reviewers who expect PDC for \"proportion covered.\"\n- **When supply overlap is structural and differential.** In therapies with 90-day\n  mail-order, free manufacturer samples, dose titration, or PRN use, uncapped MPR inflates,\n  and if overlap differs by exposure arm (e.g., one drug favored for mail-order), the bias is\n  *differential* and can manufacture or mask an adherence-outcome association. This is the\n  dangerous case: the measure is not just noisy, it is confounded.\n- **As an exposure that is itself on the causal pathway.** Conditioning on or matching on\n  post-baseline MPR (a time-updated consequence of being on and tolerating the drug) induces\n  healthy-adherer bias and collider/mediator bias — adherent patients are healthier in ways\n  unmeasured in claims. 
Use MPR as an outcome or pre-specified time-varying exposure with\n  appropriate g-methods, not as a baseline confounder.\n- **As a switching/combination summary across a drug class** without a class-level day map:\n  summing `days_supply` across two interchangeable agents double-counts overlap days and\n  breaks the measure; build a class-level coverage map first.\n\n**Data-source operational depth**\n- **Administrative claims (FFS):** The native substrate. MPR needs only `person_id`,\n  `fill_date` (service date), and `days_supply` (plus NDC to define the cohort). Require\n  *continuous pharmacy + medical enrollment* spanning the full observation window so that a\n  missing fill is a true non-fill, not unobserved care. Failure modes: `days_supply` is\n  miscoded for insulin, inhalers, topicals, and titrated drugs (canister/vial counts, not\n  days); the *last* fill's supply extends past the window end and inflates a fixed-window\n  MPR unless truncated at window end; inpatient and skilled-nursing days are covered by the\n  facility (Part A), not Part D pharmacy, so supply on hand during admissions is invisible\n  and depresses MPR — censor or exclude inpatient days.\n- **Medicare Advantage (MA) vs FFS:** MA encounter data historically under-capture pharmacy\n  fills relative to FFS Part D, so MA-only person-time can show artifactually low fill counts\n  and low MPR; restrict to enrollees with observable Part D (or commercial pharmacy benefit)\n  and exclude MA-only spans, exactly as for any claims exposure definition. Differential\n  competing risks (death, disenrollment) by exposure shorten the at-risk window — define\n  whether the denominator is fixed (e.g., 365 days) or curtailed at disenrollment/death, and\n  keep the rule identical across arms.\n- **EHR / e-prescribing:** Captures the *order* (quantity, sig, refills authorized), not the\n  *dispensing*. 
Order-based \"MPR\" overestimates possession because written ≠ filled (primary\n  non-adherence, see primary-non-adherence-initiation). Link to pharmacy claims or a\n  dispensing feed (e.g., Surescripts) before computing MPR; unlinked EHR MPR is a measure of\n  prescribing intent, not acquisition.\n- **Registry / linked claims–EHR:** Registries rarely hold complete fill histories; link to\n  claims for `days_supply`. Linkage introduces a selected, linkable subpopulation and\n  order/fill/service-date discrepancies that must be reconciled before windowing; an\n  off-by-one in date alignment systematically biases the supply sum.\n\n**Worked claims example.** A hypertension cohort member has a fixed 180-day observation\nwindow (days 0–179) starting at the index fill (day 0). Retail pharmacy claims show four 30-day\nfills of one ACE inhibitor (NDC fixed): fill 1 on day 0, fill 2 on day 30, fill 3 on day\n50 (a 10-day-early refill: the patient still had 10 days of fill 2 on hand), fill 4 on day\n90. Total `days_supply` = 30 + 30 + 30 + 30 = 120. **Simple (uncapped) MPR** = 120 / 180 =\n0.667. The sum counts every acquired day, including the 10 days of fill 3 that overlap\nfill 2, so it measures supply acquired rather than days covered. **Union PDC** (overlap\ncounted once, no carry-forward) counts unique covered days: fills 1–2 run contiguously over\ndays 0–59; fill 3 spans days 50–79, overlapping fill 2 on days 50–59 and adding only 20 new\ndays (60–79); a 10-day gap follows (days 80–89); fill 4 covers days 90–119. Unique covered\ndays = 60 + 20 + 30 = 110, so PDC = 110 / 180 = 0.611. **Shift-forward (date-adjusted) MPR**\nmoves fill 3's start to day 60 (the day after fill 2 is exhausted) so its supply is not\ndouble-counted against fill 2, covering days 60–89 and closing the gap that union PDC leaves\nopen; coverage now runs contiguously from day 0 through day 119, yielding 120 covered days\nand 120 / 180 = 0.667.\n
Shift-forward is\ntherefore *more generous* than union PDC here because it re-allocates the early supply forward\nrather than discarding the overlap. Now suppose fill 3 had instead been a 90-day\nmail-order fill arriving 10 days early: uncapped MPR = (30+30+90+30)/180 = 180/180 = 1.0.\nOn the calendar, however, fill 3 spans days 50–139 and fill 4 (days 90–119) falls entirely\ninside that span, so no dispensing covers days 140–179 and union PDC = 140/180 = 0.778.\nShift-forward allocation instead stretches the 180 acquired days across the whole window and\nalso returns 1.0. An MPR of 1.0, capped or not, cannot show whether the tail of the window was\ncovered by stockpile or left open; the answer rests entirely on the overlap rule. This is\nprecisely why uncapped MPR is unsafe as a coverage metric when 90-day supply and early refills\ncoexist, and why the overlap rule must be pre-specified.",
    "primary_category": "Exposure_Definition",
    "tags": [
      "adherence",
      "medication-utilization",
      "claims",
      "mpr",
      "pdc",
      "days-supply",
      "refill-compliance"
    ],
    "applies_to_study_types": [
      "drug_utilization",
      "cohort_retrospective",
      "active_comparator_new_user",
      "new_user",
      "claims_analysis"
    ],
    "data_sources": [
      "claims",
      "pharmacy_record",
      "ehr",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1016/s0895-4356(96)00268-5",
        "url": "https://doi.org/10.1016/s0895-4356(96)00268-5",
        "citation_text": "Steiner JF, Prochazka AV. The assessment of refill compliance using pharmacy records: methods, validity, and applications. J Clin Epidemiol. 1997;50(1):105-116.",
        "year": 1997,
        "authors_short": "Steiner & Prochazka",
        "notes": "Foundational review of pharmacy-record refill-compliance methods; established the conceptual basis for the medication possession ratio and the supply-over-time family of adherence measures."
      },
      {
        "role": "explain",
        "doi": "10.1002/pds.1230",
        "url": "https://doi.org/10.1002/pds.1230",
        "citation_text": "Andrade SE, Kahler KH, Frech F, Chan KA. Methods for evaluation of medication adherence and persistence using automated databases. Pharmacoepidemiol Drug Saf. 2006;15(8):565-574.",
        "year": 2006,
        "authors_short": "Andrade et al.",
        "notes": "Systematic review of database adherence methods; documents MPR and related availability measures as the most common operationalizations and catalogs their overlap-handling and window-definition variants."
      },
      {
        "role": "explain",
        "doi": "10.1345/aph.1H018",
        "url": "https://doi.org/10.1345/aph.1H018",
        "citation_text": "Hess LM, Raebel MA, Conner DA, Malone DC. Measurement of adherence in pharmacy administrative databases: a proposal for standard definitions and preferred measures. Ann Pharmacother. 2006;40(7-8):1280-1288.",
        "year": 2006,
        "authors_short": "Hess et al.",
        "notes": "Proposes standard definitions and contrasts MPR with PDC, flagging MPR's >100% problem and the need to pre-specify capping and overlap rules."
      },
      {
        "role": "demonstrate",
        "doi": "10.1185/03007990903126833",
        "url": "https://doi.org/10.1185/03007990903126833",
        "citation_text": "Karve S, Cleves MA, Helm M, Hudson TJ, West DS, Martin BC. Good and poor adherence: optimal cut-point for adherence measures using administrative claims data. Curr Med Res Opin. 2009;25(9):2303-2310.",
        "year": 2009,
        "authors_short": "Karve et al.",
        "notes": "Empirically compares adherence operationalizations (including MPR and PDC) against hospitalization outcomes and finds that the widely used >=80% adherence cut-point is a reasonable threshold in claims data."
      },
      {
        "role": "use",
        "doi": "10.1097/mlr.0b013e31829b1d2a",
        "url": "https://doi.org/10.1097/mlr.0b013e31829b1d2a",
        "citation_text": "Raebel MA, Schmittdiel J, Karter AJ, Konieczny JL, Steiner JF. Standardizing terminology and definitions of medication adherence and persistence in research employing electronic databases. Med Care. 2013;51(8 Suppl 3):S11-S21.",
        "year": 2013,
        "authors_short": "Raebel et al.",
        "notes": "Consensus terminology distinguishing primary vs secondary adherence, adherence vs persistence, and MPR vs PDC; the reference standard for reporting which measure and overlap rule a study used."
      }
    ],
    "plain_language_summary": "The Medication Possession Ratio (MPR) measures how much medication a patient picked up from the pharmacy relative to how long they were supposed to be on it. You add up the total days' worth of pills dispensed across all prescription fills during the study period, then divide by the number of days in that period. Unlike the Proportion of Days Covered (PDC), which counts only the unique calendar days a patient had pills on hand (and therefore stays at or below 100%), MPR simply adds raw supply — so when a patient refills early and two fills overlap, those overlapping days get counted twice in the numerator, and the result can exceed 1.0 (or 100%). That \"over 100%\" signal is not a flaw to hide; it tells you the patient was stockpiling supply.",
    "key_terms": [
      {
        "term": "days_supply",
        "definition": "The number of days one prescription fill is intended to last — for example, a '90-day supply' of a blood-pressure pill means the bottle contains enough tablets for 90 days at the prescribed dose."
      },
      {
        "term": "observation window",
        "definition": "The fixed time period — say, 180 days starting from the patient's first prescription fill — during which all pharmacy activity is counted for the adherence calculation."
      },
      {
        "term": "fill date",
        "definition": "The date a patient picked up (dispensed) a prescription at the pharmacy; this is the date recorded in insurance claims, not the date the doctor wrote the prescription."
      },
      {
        "term": "early refill",
        "definition": "When a patient picks up the next bottle of medication before the current bottle is used up, creating a period where two fills are 'on hand' at the same time."
      },
      {
        "term": "PDC (Proportion of Days Covered)",
        "definition": "A related adherence measure that counts only the unique calendar days a patient had any supply on hand — overlapping fill periods are merged so each day is counted at most once, keeping PDC at or below 1.0."
      }
    ],
    "worked_example": {
      "scenario": "Imagine a patient, ID 2001, who starts a 180-day statin treatment window on January 1, 2024. The pharmacy records show three fills: a 90-day fill picked up at the start, another 90-day fill picked up on March 15 (sixteen days before the first fill runs out — an early refill), and a smaller 30-day fill picked up on June 20. The window closes on June 28 (day 180). We want to compute the MPR and compare it to what PDC would give. The early refill is the key: those sixteen overlapping days are counted once in PDC but counted twice in MPR's simple sum.",
      "dataset": {
        "caption": "Raw pharmacy claims rows for patient 2001, exactly what an analyst pulls from a pharmacy table. The days_supply on Fill C extends past the window end (June 28), so only 9 of its 30 days count toward the MPR numerator.",
        "columns": [
          "person_id",
          "fill_date",
          "drug",
          "days_supply"
        ],
        "rows": [
          [
            2001,
            "2024-01-01",
            "atorvastatin 40 mg",
            90
          ],
          [
            2001,
            "2024-03-15",
            "atorvastatin 40 mg",
            90
          ],
          [
            2001,
            "2024-06-20",
            "atorvastatin 40 mg",
            30
          ]
        ]
      },
      "steps": [
        "Set the observation window: 180 days, January 1 through June 28, 2024 (31 + 29 + 31 + 30 + 31 + 28 = 180 days; 2024 is a leap year).",
        "Fill A (Jan 1, 90 days): covers January 1 through March 30 — fully inside the window; contributes 90 days to the MPR numerator.",
        "Fill B (Mar 15, 90 days): covers March 15 through June 12 — fully inside the window; contributes 90 days to the MPR numerator. Notice that March 15–30 (16 days) overlaps with Fill A; MPR counts those 16 days again without adjustment.",
        "Fill C (Jun 20, 30 days): would cover June 20 through July 19, but the window ends June 28, so only the 9 days June 20–28 fall inside; contributes 9 days to the MPR numerator (the remaining 21 days are truncated at the window boundary).",
        "MPR numerator = 90 + 90 + 9 = 189 days of supply within the window.",
        "MPR = 189 / 180 = 1.050 — the early refill pushed the ratio above 1.0.",
        "For contrast, PDC merges Fill A and Fill B into a single covered block: January 1 through June 12 = 164 unique days (the 16-day overlap is counted once, not twice). Then a 7-day gap (June 13–19), then Fill C adds 9 more unique days (June 20–28). Unique covered days = 164 + 9 = 173. PDC = 173 / 180 = 0.961.",
        "The gap between the two measures (MPR 1.050 vs PDC 0.961) is produced entirely by the 16-day early refill overlap: MPR double-counts those 16 days; PDC does not."
      ],
      "result": {
        "label": "MPR = 189 days_supply in window / 180 window days = 1.050 (above the 0.80 adherent threshold, and above 1.0 due to the 16-day overlap). PDC = 173 unique covered days / 180 window days = 0.961. Both measures classify this patient as adherent, but only MPR can exceed 1.0 — PDC is bounded.",
        "value": 1.05
      },
      "timeline_spec": {
        "title": "MPR vs PDC: one statin patient, 180-day window — early refill drives MPR above 1.0",
        "caption": "Fill B arrives 16 days before Fill A is exhausted (early refill). MPR sums all supply including the overlap (189 / 180 = 1.050); PDC merges overlapping days so each calendar day counts once (173 / 180 = 0.961). The shaded overlap band is the entire wedge between the two measures.",
        "alt_text": "Horizontal timeline from 2024-01-01 to 2024-06-28 showing three fill bars. Fill A (Jan 1, 90 days) and Fill B (Mar 15, 90 days) overlap from Mar 15 to Mar 30 — that overlap region is highlighted. A 7-day gap runs Jun 13–19. Fill C (Jun 20) is truncated at the window end on Jun 28. An underlay shows 164 covered days merged from Fills A and B, a gap, and 9 days from Fill C; the result label shows MPR 1.050 and PDC 0.961.",
        "window": {
          "start": "2024-01-01",
          "end": "2024-06-28",
          "label": "Denominator: 180-day observation window (Jan 1 – Jun 28, 2024)"
        },
        "events": [
          {
            "label": "Fill A",
            "start": "2024-01-01",
            "length_days": 90,
            "quantity": "90 days_supply"
          },
          {
            "label": "Fill B (16-day early refill)",
            "start": "2024-03-15",
            "length_days": 90,
            "quantity": "90 days_supply"
          },
          {
            "label": "Fill C (truncated at window end)",
            "start": "2024-06-20",
            "length_days": 30,
            "quantity": "30 days_supply — only 9 days fall inside the window"
          }
        ],
        "spans": [
          {
            "kind": "covered",
            "start": "2024-01-01",
            "end": "2024-06-12",
            "label": "164 unique covered days (union of Fill A and Fill B)"
          },
          {
            "kind": "overlap",
            "start": "2024-03-15",
            "end": "2024-03-30",
            "label": "16-day overlap — counted once in PDC, twice in MPR"
          },
          {
            "kind": "gap",
            "start": "2024-06-13",
            "end": "2024-06-19",
            "label": "7-day gap"
          },
          {
            "kind": "covered",
            "start": "2024-06-20",
            "end": "2024-06-28",
            "label": "9 covered days (Fill C, truncated at window end)"
          }
        ],
        "result": {
          "label": "MPR = 189 / 180 = 1.050  |  PDC = 173 / 180 = 0.961",
          "value": 1.05
        }
      }
    },
    "prerequisites": [
      "pdc-proportion-of-days-covered",
      "continuous-enrollment-observable-time-rwe",
      "grace-period-gap-rules-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Simple (uncapped) MPR",
        "description": "Sum of days_supply across all fills whose fill_date falls in the observation interval, divided by the number of days in the interval. Not capped; can exceed 1.0 when fills overlap or refills arrive early.",
        "edge_cases": [
          "Early refills and 90-day mail-order fills produce apparent possession >100%.",
          "Overlapping fills from multiple pharmacies/sources are summed and double-counted.",
          "The last fill's supply may extend past the window end, inflating a fixed-window numerator unless truncated at window end."
        ],
        "data_source_notes": "claims: straightforward SUM of days_supply over the window (fixed calendar period or index-anchored). The basic form applies no overlap or inpatient adjustment, so it overestimates coverage more than the capped or shift-forward variants.",
        "citations": [
          "steiner-1997",
          "andrade-2006"
        ]
      },
      {
        "name": "Capped MPR (truncated at 1.0)",
        "description": "Uncapped MPR with the final ratio truncated at 1.0 (or 100%) so the measure is bounded, used to mimic PDC's ceiling and support an >=80% adherent dichotomy.",
        "edge_cases": [
          "Capping after summation is not equivalent to day-level overlap removal; capped MPR can still disagree with carry-over PDC under heavy within-window overlap.",
          "Hides genuine coverage gaps that occur after early stockpiling is exhausted."
        ],
        "data_source_notes": "claims: apply min(MPR, 1.0) after the SUM; document the cap explicitly because it silently redefines an acquisition measure as a coverage proxy.",
        "citations": [
          "hess-2006"
        ]
      },
      {
        "name": "Shift-forward (date-adjusted) MPR",
        "description": "When a refill arrives before the prior supply is exhausted, the start date of the later fill is shifted to the prior exhaustion date so overlapping supply is carried forward rather than double-counted, then days within the window are summed.",
        "edge_cases": [
          "Cumulative shifting can push a fill's effective coverage past the window end; decide whether shifted-out supply is dropped or retained.",
          "Requires chronological fill walking, not a single SUM; more faithful to true coverage and closest to carry-over PDC."
        ],
        "data_source_notes": "claims: implement with an ordered scan over fills maintaining a running supply-end pointer; this is the most defensible overlap rule for an MPR that approximates coverage.",
        "citations": [
          "andrade-2006",
          "hess-2006"
        ]
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "pdc-proportion-of-days-covered",
        "pros_of_this": "Computationally trivial (one SUM and a division, no day-level set); preserves the magnitude of oversupply for cost/utilization analyses; comparable with the large pre-2010 literature.",
        "cons_of_this": "Uncapped MPR overestimates coverage whenever fills overlap and yields uninterpretable >100% values; it is not the PQA/CMS-endorsed measure for Medicare Part D Star Ratings adherence metrics, which use PDC.",
        "when_to_prefer": "When the question is total medication acquisition or supply relative to need, or when historical comparability with older MPR-based studies is required. Prefer PDC for contemporary comparative, quality, and regulatory work."
      },
      {
        "compared_to": "persistence-time-to-discontinuation",
        "pros_of_this": "Quantifies intensity of coverage within a fixed window, capturing under-refilling that a persistence (time-to-gap) measure ignores.",
        "cons_of_this": "Says nothing about when therapy stopped; a patient can be persistent but have low MPR, or non-persistent but high MPR over the on-treatment period.",
        "when_to_prefer": "When the construct of interest is how covered a patient was during a defined interval rather than how long they stayed on therapy; report both for a complete adherence picture."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Use pharmacy claims with person_id, fill_date (service date), days_supply, and NDC for cohort definition. Require continuous pharmacy + medical enrollment across the full window so a missing fill is a true non-fill. Truncate the last fill's supply at window end; censor inpatient/SNF days (facility-covered, supply invisible). Pre-specify window, overlap rule, and any cap.",
      "ehr": "e-prescribing/order data give written quantity and sig, not dispensing, so order-based MPR overestimates possession (primary non-adherence). Link to a pharmacy claims or dispensing feed before computing MPR; unlinked EHR MPR measures prescribing intent, not acquisition.",
      "registry": "Registries rarely hold complete fill histories; link to claims for days_supply. Reconcile registry enrollment dates with claim service dates before windowing.",
      "linked": "Linked claims-EHR is ideal (severity context + complete fills) but adds linkage selection and order/fill/service date discrepancies that must be aligned before the supply sum; an off-by-one in date alignment biases the numerator."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\nimport numpy as np\n\nWINDOW_DAYS = 180  # fixed observation interval anchored at the first fill\n\ndef compute_mpr(fills: pd.DataFrame, window_days: int = WINDOW_DAYS) -> pd.DataFrame:\n    f = fills.sort_values([\"person_id\", \"fill_date\"]).copy()\n\n    # Index = first qualifying fill; window = [index_date, index_date + window_days).\n    f[\"index_date\"] = f.groupby(\"person_id\")[\"fill_date\"].transform(\"min\")\n    f[\"window_end\"] = f[\"index_date\"] + pd.to_timedelta(window_days, unit=\"D\")\n    f = f[(f[\"fill_date\"] >= f[\"index_date\"]) & (f[\"fill_date\"] < f[\"window_end\"])]\n\n    # Truncate each fill's supply at the window end (no credit for supply past the window).\n    f[\"supply_end\"] = f[\"fill_date\"] + pd.to_timedelta(f[\"days_supply\"], unit=\"D\")\n    capped_end = np.minimum(f[\"supply_end\"].values, f[\"window_end\"].values)\n    f[\"days_in_window\"] = (capped_end - f[\"fill_date\"].values) / np.timedelta64(1, \"D\")\n    f[\"days_in_window\"] = f[\"days_in_window\"].clip(lower=0)\n\n    # (1) Simple uncapped MPR = sum(days_supply in window) / window_days.\n    simple = (f.groupby(\"person_id\")[\"days_in_window\"].sum() / window_days)\n\n    # (3) Shift-forward MPR: walk fills, push a fill's start to the prior supply-end\n    #     so overlapping (early-refill / stockpiled) days are carried forward, not\n    #     double-counted. 
Approximates carry-over PDC.\n    def shift_forward(g: pd.DataFrame) -> float:\n        cursor = g[\"index_date\"].iloc[0]            # earliest uncovered day\n        win_end = g[\"window_end\"].iloc[0]\n        covered = 0.0\n        for fill_date, ds in zip(g[\"fill_date\"], g[\"days_supply\"]):\n            start = max(fill_date, cursor)          # never start before existing supply ends\n            end = min(start + pd.Timedelta(days=int(ds)), win_end)\n            covered += max((end - start).days, 0)\n            cursor = max(cursor, end)\n        return covered / window_days\n\n    shifted = f.groupby(\"person_id\", group_keys=False).apply(shift_forward)\n\n    out = pd.DataFrame({\"mpr_simple\": simple, \"mpr_shift_forward\": shifted})\n    out[\"mpr_capped\"] = out[\"mpr_simple\"].clip(upper=1.0)  # (2) cap at 1.0\n    return out.reset_index()",
        "description": "Compute fixed-window MPR (uncapped, capped, and shift-forward variants) from a\ncleaned pharmacy fills table. Required input (one row per dispensing, de-duplicated):\n  fills : person_id, fill_date (datetime64), days_supply (int), ndc (str)\nThe observation window is index-anchored: day 0 = each person's first qualifying\nfill, length WINDOW_DAYS. The last fill's supply is truncated at the window end so\na fixed-window denominator is respected. Returns one row per person with all three\nMPR variants for sensitivity comparison.",
        "dependencies": [
          "pandas",
          "numpy"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\nWINDOW_DAYS <- 180L\n\ncompute_mpr <- function(fills, window_days = WINDOW_DAYS) {\n  f <- as.data.table(fills)\n  setorder(f, person_id, fill_date)\n\n  f[, index_date := min(fill_date), by = person_id]\n  f[, window_end := index_date + window_days]\n  f <- f[fill_date >= index_date & fill_date < window_end]\n\n  # Truncate supply at window end -> days actually inside the fixed window.\n  f[, supply_end := fill_date + days_supply]\n  f[, days_in_window := pmax(0L, as.integer(pmin(supply_end, window_end) - fill_date))]\n\n  # (1) Simple uncapped MPR.\n  simple <- f[, .(mpr_simple = sum(days_in_window) / window_days), by = person_id]\n\n  # (3) Shift-forward MPR: carry overlapping supply forward instead of double-counting.\n  shift_one <- function(fd, ds, idx, wend) {\n    cursor <- idx; covered <- 0L\n    for (i in seq_along(fd)) {\n      start <- max(fd[i], cursor)\n      end   <- min(start + ds[i], wend)\n      covered <- covered + max(as.integer(end - start), 0L)\n      cursor  <- max(cursor, end)\n    }\n    covered / window_days\n  }\n  shifted <- f[, .(mpr_shift_forward = shift_one(fill_date, days_supply,\n                                                 index_date[1], window_end[1])),\n               by = person_id]\n\n  out <- merge(simple, shifted, by = \"person_id\")\n  out[, mpr_capped := pmin(mpr_simple, 1.0)]   # (2) cap at 1.0\n  out[]\n}",
        "description": "Compute fixed-window MPR (uncapped, capped, shift-forward) with data.table.\nInput mirrors the Python version:\n  fills : person_id, fill_date (Date), days_supply (integer), ndc (character)\nWindow is index-anchored at each person's first fill; the last fill's supply is\ntruncated at the window end.",
        "dependencies": [
          "data.table"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "%let window = 180;\n\n/* Index date = first fill per person; keep only fills inside [index, index+window). */\nproc sql;\n  create table f as\n  select fl.person_id, fl.fill_date, fl.days_supply,\n         ix.index_date,\n         (ix.index_date + &window) as window_end format=date9.,\n         /* days of this fill that fall inside the fixed window (truncate at window end) */\n         max(0, min(fl.fill_date + fl.days_supply, ix.index_date + &window)\n                - fl.fill_date) as days_in_window\n  from work.fills fl\n  inner join (select person_id, min(fill_date) as index_date\n              from work.fills group by person_id) ix\n    on fl.person_id = ix.person_id\n  where fl.fill_date >= ix.index_date\n    and fl.fill_date <  ix.index_date + &window\n  order by person_id, fill_date;\nquit;\n\n/* (1) Simple uncapped MPR and (2) capped MPR. */\nproc sql;\n  create table mpr_simple as\n  select person_id,\n         sum(days_in_window) / &window           as mpr_simple,\n         min(sum(days_in_window) / &window, 1.0)  as mpr_capped\n  from f\n  group by person_id;\nquit;\n\n/* (3) Shift-forward MPR: walk fills in date order, carry overlapping supply forward. */\ndata mpr_shift(keep=person_id mpr_shift_forward);\n  set f;\n  by person_id;\n  retain cursor covered;\n  if first.person_id then do; cursor = index_date; covered = 0; end;\n  start = max(fill_date, cursor);                 /* never start before existing supply ends */\n  end   = min(start + days_supply, window_end);   /* truncate at window end */\n  covered = covered + max(end - start, 0);\n  cursor  = max(cursor, end);\n  if last.person_id then do;\n    mpr_shift_forward = covered / &window;\n    output;\n  end;\nrun;\n\n/* Combine all three variants for sensitivity comparison. */\nproc sql;\n  create table mpr_all as\n  select s.person_id, s.mpr_simple, s.mpr_capped, sh.mpr_shift_forward\n  from mpr_simple s left join mpr_shift sh\n    on s.person_id = sh.person_id;\nquit;",
        "description": "Compute fixed-window MPR in SAS. Required input (post data-management, one row per\ndispensing): work.fills with person_id, fill_date, days_supply, ndc.\nPROC SQL builds the index-anchored window and the simple/capped variants via\naggregation; a chronological DATA step walks fills to produce the shift-forward\n(date-adjusted, carry-over) variant. Window length is a macro variable.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": "mpr-medication-possession-ratio-timeline.svg",
        "mermaid": null,
        "caption": "Fill B arrives 16 days before Fill A is exhausted (early refill). MPR sums all supply including the overlap (189 / 180 = 1.050); PDC merges overlapping days so each calendar day counts once (173 / 180 = 0.961). The shaded overlap band is the entire wedge between the two measures.",
        "alt_text": "Horizontal timeline from 2024-01-01 to 2024-06-28 showing three fill bars. Fill A (Jan 1, 90 days) and Fill B (Mar 15, 90 days) overlap from Mar 15 to Mar 30 — that overlap region is highlighted. A 7-day gap runs Jun 13–19. Fill C (Jun 20) is truncated at the window end on Jun 28. An underlay shows 164 covered days merged from Fills A and B, a gap, and 9 days from Fill C; the result label shows MPR 1.050 and PDC 0.961.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Q{What is the estimand?}\n  Q -->|Total drug acquired<br/>vs need: cost / waste / supply| ACQ[Uncapped MPR<br/>sum days_supply / window<br/>may exceed 1.0]\n  Q -->|Share of days covered:<br/>quality / effectiveness| COV{Overlap expected?<br/>early refills / 90-day mail-order}\n  COV -->|No / minimal| SIMPLE[Simple MPR == capped MPR<br/>here they agree]\n  COV -->|Yes| HANDLE{Overlap rule}\n  HANDLE -->|Cap whole ratio at 1.0| CAP[Capped MPR<br/>bounded but hides post-stockpile gaps]\n  HANDLE -->|Shift later fill start<br/>to prior supply-end| SHIFT[Shift-forward MPR<br/>approximates carry-over PDC]\n  HANDLE -->|Day-level unique coverage| PDC[Use PDC instead<br/>pdc-proportion-of-days-covered]\n  CAP --> DOC[Pre-specify window + cap + overlap rule<br/>report per Raebel 2013 terminology]\n  SHIFT --> DOC\n  PDC --> DOC",
        "caption": "Decision logic for choosing and operationalizing an MPR variant. The estimand (total acquisition vs share of days covered) drives the choice; when coverage is the target and overlap is expected, capped, shift-forward, or PDC handle the double-counting differently.",
        "alt_text": "Decision flowchart starting from the estimand, branching to uncapped MPR for acquisition questions and to capped, shift-forward, or PDC for coverage questions depending on whether supply overlap is expected.",
        "source_type": "illustrative",
        "source_citations": [
          "raebel-2013"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "gantt\n  title MPR vs PDC for one patient, 180-day window, four 30-day fills\n  dateFormat YYYY-MM-DD\n  axisFormat day %j\n  section Fills (days_supply)\n  Fill 1 day 0  :f1, 2024-01-01, 30d\n  Fill 2 day 30 :f2, 2024-01-31, 30d\n  Fill 3 day 50 (10-day early refill) :crit, f3, 2024-02-20, 30d\n  Fill 4 day 90 :f4, 2024-03-31, 30d\n  section Coverage\n  Unique covered days (PDC = 110/180 = 0.61) :active, cov, 2024-01-01, 119d",
        "caption": "The same patient yields uncapped MPR = 120/180 = 0.67 (sum of supply) but PDC = 110/180 = 0.61 (unique covered days) because fill 3 arrives 10 days early; the early-refill overlap is the wedge between the two measures.",
        "alt_text": "Gantt timeline of four 30-day fills over a 180-day window with an early third fill, illustrating that summed supply (MPR) exceeds unique covered days (PDC).",
        "source_type": "illustrative",
        "source_citations": [
          "karve-2009"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "tradeoff_with",
        "target_slug": "pdc-proportion-of-days-covered",
        "notes": "Primary comparator. PDC counts unique covered days (capped at 1.0, day-level overlap removed) and is the PQA/CMS-endorsed measure; uncapped MPR sums supply and can exceed 1.0. Prefer PDC for contemporary quality and effectiveness work; MPR for total-acquisition/cost questions and historical replication."
      },
      {
        "relation_type": "see_also",
        "target_slug": "persistence-time-to-discontinuation",
        "notes": "MPR measures intensity of coverage within a fixed interval; persistence measures duration of continuous therapy until a permissible-gap is exceeded. Report both — they capture different failure modes."
      },
      {
        "relation_type": "see_also",
        "target_slug": "primary-non-adherence-initiation",
        "notes": "MPR is a secondary (post-initiation) measure requiring at least one fill; primary non-adherence (never filling the first prescription) is upstream and needs linked prescribing + dispensing data."
      }
    ],
    "aliases": [
      "MPR",
      "medication possession ratio",
      "refill compliance ratio"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "pqa-cms",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "ms-drg-classification",
    "name": "MS-DRG (Medicare Severity Diagnosis-Related Groups)",
    "short_definition": "The federal inpatient payment classification system that maps each hospital stay to one of roughly 770 groups — defined by principal diagnosis, procedures, secondary diagnoses (complication/comorbidity severity), and patient attributes — each carrying a relative weight that, multiplied by a hospital-specific base rate, determines Medicare prospective payment for the admission.",
    "long_description": "**MS-DRG** (Medicare Severity Diagnosis-Related Groups) is the classification engine at the\ncenter of Medicare's Inpatient Prospective Payment System (IPPS). For every inpatient\ndischarge, a **grouper algorithm** — a deterministic decision tree published annually by\nCMS — consumes the coded claim (principal diagnosis, up to 25 secondary diagnoses with\npresent-on-admission flags, ICD-10-PCS procedure codes, discharge status, age, and sex)\nand assigns a single integer DRG. That DRG carries a **relative weight** — a national index\nof resource intensity normalized so that a weight of 1.0 represents the average Medicare\ninpatient case. Payment = weight × hospital-specific base rate (adjusted for wage index,\nteaching status, disproportionate share, and outlier thresholds). In FY 2025 the national\noperating base rate is roughly $6,900; a weight-2.0 DRG therefore generates approximately\n$13,800 in operating payment before local adjustments. The grouper, relative weight tables,\nand the CC/MCC lists are republished each fiscal year (effective October 1) and are\ndownloadable without charge from CMS.\n\n**Structure: MDC → surgical/medical partition → base-DRG severity triplets.**\nThe classification has a three-level hierarchy. At the top, every DRG belongs to one of\n25 **Major Diagnostic Categories (MDCs)** corresponding to body systems or clinical\nsituations (e.g., MDC 5 = Diseases and Disorders of the Circulatory System; MDC 14 =\nPregnancy, Childbirth, and the Puerperium). Within each MDC, the grouper first tests\nwhether an OR-qualifying ICD-10-PCS procedure was performed, routing the stay to the\n**surgical partition** (e.g., CABG, joint replacement) or the **medical partition**\n(diagnosis-driven, no qualifying procedure). Within the surgical or medical partition,\nmost base-DRG families are subdivided into a **severity triplet**: DRG *n* (with MCC),\nDRG *n+1* (with CC), and DRG *n+2* (without CC or MCC). 
The canonical teaching example\nis the heart failure family: DRG 291 (heart failure and shock with MCC), DRG 292 (with\nCC), and DRG 293 (without CC or MCC). Relative weights for this family in recent fiscal years are\non the order of 1.3, 0.9, and 0.6, respectively (check the current Table 5 for exact values), a roughly\n2-fold payment spread across the same principal diagnosis depending on whether the patient has\nserious comorbidities (MCCs) such as acute respiratory failure or sepsis. CC/MCC determination is\ndriven by the **secondary diagnoses** on the claim, subject to the CC exclusion rules and\npresent-on-admission (POA) screening: under the hospital-acquired condition (HAC) provision,\ndiagnoses on CMS's HAC list that were not present at admission are dropped from CC/MCC capture\nso that hospitals are not paid at a higher severity tier for those hospital-acquired conditions.\n\n**CC and MCC lists — and why they drift.**\nCMS publishes the full CC and MCC lists in its IPPS final rule each year. Codes move\nbetween the MCC, CC, and non-CC lists with each annual update, sometimes driven by\nclinical recalibration and sometimes by administrative or coding-industry feedback.\nThis **severity drift** means that a condition flagged as an MCC in FY 2015 may drop\nto CC status in FY 2020 without any change in patient health — creating a spurious\nlongitudinal trend in severity-adjusted analyses that cross those fiscal years. Studies\nusing DRG severity tier as a covariate across more than one or two fiscal years should\nmap the underlying diagnosis codes to a fixed CC/MCC version rather than relying on\nthe year-of-discharge grouper assignment.\n\n**History and the 2007–2008 version break.**\nThe concept of grouping hospital cases by resource intensity was developed at Yale by\nRobert Fetter and colleagues in the 1970s; CMS (then HCFA) implemented DRG-based\nprospective payment in **October 1983** (Public Law 98-21). The original system was\nrevised many times, but a structural break occurred with the introduction of\n**MS-DRGs in FY 2008** (effective October 1, 2007). 
Where the pre-2008 system had\nroughly 538 DRGs with limited severity differentiation, MS-DRGs expanded to roughly\n745 groups, adding the three-way CC/MCC severity partition to most base-DRG families.\nAny longitudinal RWE study that spans the 2007–2008 transition faces a **classification\nbreak** — the same patient with the same diagnoses maps to a different DRG before and\nafter the transition — and must use underlying diagnosis and procedure codes rather\nthan DRG number to build a consistent cohort definition across the break. A second\nmajor structural break occurred with the **ICD-10-CM/PCS transition** on October 1,\n2015 (v33+): the entire grouper logic was re-mapped to ICD-10 codes, and ICD-9-era\ngrouper crosswalks are approximate at best.\n\n**APR-DRG — the proprietary alternative.**\nMany Medicaid programs, commercial payers, and research databases (including the HCUP\nState Inpatient Databases) use **All Patient Refined DRGs (APR-DRGs)**, developed by 3M\n(now Solventum). APR-DRGs add a four-level **severity of illness (SOI)** subclass and\na separate four-level **risk of mortality (ROM)** subclass, providing finer clinical\nstratification. The critical operational distinction: CMS MS-DRGs are defined in\n**public manuals and free software** downloadable from CMS; APR-DRGs are a\n**licensed, proprietary product** — the code lists, grouper logic, and weights are\nnot public, making full replication in a research context dependent on the licensed\ngrouper engine. 
When HCUP NIS or SID data report APR-DRG SOI/ROM, researchers should\ncite the 3M APR-DRG software version rather than treating it as equivalent to MS-DRG.\nThe two systems assign different group numbers to the same discharge; they are not\ninterchangeable as covariates or cohort identifiers.\n\n**RWE applications: five core uses.**\n(1) **Hospitalization costing when paid amounts are unavailable.** The DRG relative\nweight, multiplied by a national or facility-specific base rate, provides a\nstandardized cost proxy for inpatient stays. This is widely used when the research\ndatabase carries institutional claims without adjudicated dollar amounts (e.g.,\ncertain HCUP files or facility data). The proxy flattens within-DRG variation — two\npatients in DRG 292 with identical weights receive the same proxy cost regardless of\nactual resource use — but it is reproducible and externally anchored.\n(2) **Case-mix and severity adjustment.** Including DRG relative weight, DRG severity\ntier, or MDC category as a covariate adjusts for the clinical complexity of the index\nhospitalization in comparative analyses of readmission, mortality, or post-acute cost.\n(3) **Cohort identification.** DRGs define clinically coherent cohorts: \"any\nlower-extremity joint replacement DRG\" (469–470), \"any heart failure DRG\" (291–293),\n\"any pneumonia DRG\" (193–195). This is often cleaner than a broad ICD-10 code list for\ninpatient events because the grouper has already resolved diagnosis + procedure\ncombinations into a single clinical label.\n(4) **Readmission rate denominators.** CMS's Hospital Readmissions Reduction Program\n(HRRP) measures 30-day readmission rates following index admissions for specific conditions\nand procedures (heart failure, pneumonia, COPD, hip/knee replacement) that correspond\nclosely to DRG families. 
Research replicating\nor benchmarking HRRP measures should follow CMS's published measure specifications, which\ndefine index cohorts from ICD code lists; the matching DRG families serve as a practical\napproximation and cross-check.\n(5) **Hospital benchmarking and market basket research.** DRG relative weights are the\nnational standard for comparing case-mix index across hospitals or over time, and serve\nas the denominator in payer contracting research.\n\n**Pros, cons, and trade-offs.**\n- **vs ICD-10 diagnosis-code-only cohort identification:** A DRG-defined cohort has\n  already resolved the procedure vs. diagnosis ambiguity (e.g., a hip fracture treated\n  medically vs. one treated with arthroplasty routes to different DRGs) and has applied\n  the MDC exclusion rules. Cost: the grouper assignment is per-discharge and only\n  available on inpatient institutional claims — it does not exist on professional or\n  outpatient claims. **Prefer DRG-based cohort definition** for inpatient studies\n  where the procedure / treatment delivered is part of the phenotype; use ICD codes\n  directly for outpatient, cross-setting, or pre-admission windows.\n- **vs observed paid/allowed amounts (DRG relative weight as cost proxy):** The weight\n  proxy is standardized, reproducible, and available even when dollar amounts are\n  missing; it avoids adjudication-lag problems. Cost: it flattens within-DRG variation\n  — actual costs for a weight-1.5 DRG can vary threefold — and it ignores outlier\n  payments, transfer adjustments, and disproportionate-share top-ups that affect actual\n  hospital revenue. **Use weight as proxy** when actual amounts are unavailable or\n  unstandardized; use actual allowed amounts when they are trustworthy and the research\n  question requires patient-level precision.\n- **vs APR-DRG for severity adjustment:** APR-DRG SOI/ROM provides four severity\n  tiers vs. the MS-DRG three, giving finer clinical stratification, and APR-DRGs\n  are used by Medicaid in many states and by HCUP files, so they may be the only\n  system available for a given data source. 
Cost: APR-DRG is proprietary and requires\n  a license; it is not reproducible from public documentation; and comparing results\n  across studies using MS-DRG vs. APR-DRG is non-trivial. **Prefer MS-DRG** when\n  replicability and transparency are required and when working with Medicare FFS data;\n  **use APR-DRG** when the data source provides it and cross-institutional comparisons\n  within a system using APR-DRGs are the goal.\n- **vs Elixhauser or Charlson comorbidity scores for severity adjustment:**\n  Comorbidity scores capture the patient's baseline disease burden from any diagnoses\n  present in the baseline period. DRG severity tier captures only the CC/MCC\n  complexity of the index hospitalization. They measure different things and are\n  complementary. **Use DRG\n  severity** for adjustment within a hospitalization type; **use comorbidity scores**\n  for baseline adjustment in the pre-admission or washout window; pair both when\n  studying outcomes after an index hospitalization.\n\n**When to use.**\nUse MS-DRG when the unit of analysis is an inpatient hospitalization paid under\nMedicare IPPS: to identify cohorts by procedure and diagnosis type, to adjust for\ncase-mix severity, to proxy inpatient costs, to replicate or benchmark CMS quality\nprograms, or to stratify patients by the severity of their index admission. Use it\nalso as a filter to ensure that claims labeled \"inpatient\" are from acute-care\nDRG-paid stays and not from long-term acute care hospitals (LTACHs) or inpatient\npsychiatric facilities (IPFs), which use separate payment systems and will not have\nIPPS DRGs.\n\n**When NOT to use — and when it is actively misleading or dangerous.**\n- **DRG assignment from interim claims (mid-stay).** The grouper assigns a DRG to the\n  final discharge claim, not to interim bills (bill type 0112 interim-first, 0113\n  interim-continuing). If your\n  extract includes interim inpatient claims, the DRG on those records is provisional\n  or absent; never use interim claim DRGs as if they were final. 
Always use the\n  final claim (bill type 0111 admit-through-discharge or 0114 interim-last, or the\n  latest 0117 replacement of one of these).\n- **Treating DRG severity tier as a measure of patient health rather than coding\n  practice.** Hospitals run active CC/MCC capture programs that mine records for\n  secondary diagnoses that qualify as CCs or MCCs. When coding intensity increases\n  over time (as it did substantially in the early MS-DRG era and again after widespread\n  adoption of EHR documentation tools), the fraction of cases landing in the MCC tier\n  rises without any change in patient health. A longitudinal trend in severity tier is\n  therefore partly a coding-practice trend, not purely a clinical one. Do not interpret\n  year-over-year DRG severity upshift as evidence that patients are becoming sicker.\n- **Using DRG relative weight as a cost proxy for non-IPPS settings.** DRG weights\n  are calibrated to Medicare FFS acute-care inpatient costs. Applying them to\n  commercial claims (which have different negotiated rates and patient demographics),\n  to LTACH or rehabilitation stays, or to outpatient services introduces systematic\n  bias. For non-Medicare populations, use the payer-specific allowed amount or a\n  commercial analogue (e.g., an All-Patient DRG weight set calibrated to commercial\n  data).\n- **Crossing the FY2008 or ICD-10 version break without harmonization.** A DRG\n  number in FY 2006 is not the same clinical group as the same number in FY 2009\n  (post-MS-DRG redesign), and neither maps reliably to the ICD-10-era DRG. Trend\n  studies must anchor to principal diagnosis and procedure code rather than DRG number.\n- **Using DRG to identify episodes in Medicare Advantage.** MA encounter data may\n  carry a plan-assigned or estimated DRG that does not result from full adjudication\n  under IPPS grouper logic, may be absent entirely, or may reflect a plan's internal\n  risk-adjustment coding. MA-sourced DRGs should not be pooled with FFS DRGs as if\n  they are equivalent. 
Restrict to Medicare FFS (Parts A/B) or explicitly document\n  and test the MA DRG derivation.\n- **Treating a single DRG number as a clinical phenotype without checking for MDC\n  migration.** Occasional annual updates move a diagnosis to a different MDC, changing\n  its base DRG even when the diagnosis-procedure combination is unchanged. Cohort\n  definitions in multi-year studies should be built from ICD-10 code lists, then\n  checked against the grouper output for the relevant fiscal years.\n\n**Data-source operational depth.**\nIn **Medicare FFS Part A** (the native source), the DRG appears in the claim-level\nfield `clm_drg_cd` (or equivalent in research files). The final DRG is present only\non final claims (bill type 0111 admit-through-discharge or 0114 interim-last, plus any\n0117 replacement of these; exclude 0112 interim-first and 0113 interim-continuing).\nRelative weights are in Table 5 of the annual IPPS final rule; the CC and MCC lists are\nin the Table 6 series. For cost proxies, multiply the relative weight for `clm_drg_cd` by the applicable\nhospital wage-index-adjusted base rate from CMS's impact file; for a research\napproximation, the national standardized amount times the weight is acceptable\nif the simplification is stated. Link to MedPAR for a more efficient DRG-keyed inpatient extract.\nIn **commercial claims**, MS-DRG equivalents may appear under different field names\nor may be imputed by the data vendor using the CMS grouper applied to ICD-10 codes.\nVerify with the data dictionary whether the DRG was assigned by the CMS free grouper\nor a proprietary engine; the answer affects replicability. In **HCUP NIS and SID**,\nMS-DRG and APR-DRG are both provided for Medicare discharges; APR-DRG SOI/ROM covers\nall payers. The APR-DRG is the 3M licensed product; treat it as a black-box-derived\ncovariate. In **Medicare Advantage encounter data** (Part C), DRGs may be imputed or\nabsent — see the MA pitfall note above.",
    "primary_category": "Unknown",
    "tags": [
      "coding-system",
      "data-standard",
      "primitive",
      "claims",
      "payment",
      "severity",
      "drg",
      "ipps",
      "medicare",
      "inpatient",
      "case-mix",
      "grouper",
      "cost-proxy"
    ],
    "applies_to_study_types": [
      "cohort_retrospective",
      "claims_analysis",
      "new_user",
      "cost_analysis",
      "readmission_study",
      "hospital_benchmarking"
    ],
    "data_sources": [
      "claims"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1097/00005650-198002001-00001",
        "url": "https://doi.org/10.1097/00005650-198002001-00001",
        "citation_text": "&NA;. Case mix definition by diagnosis-related groups. Medical Care. 1980;18(2 Suppl):1-53.",
        "year": 1980,
        "authors_short": "Fetter et al.",
        "notes": "The foundational Yale publication establishing the DRG case-mix classification system that CMS adopted for prospective payment in 1983. Published as a Medical Care supplement; Crossref indexes the DOI with placeholder author metadata.\n"
      },
      {
        "role": "explain",
        "doi": "10.1001/jama.1990.03450150053030",
        "url": "https://doi.org/10.1001/jama.1990.03450150053030",
        "citation_text": "Kahn KL, Keeler EB, Sherwood MJ, et al. The effects of the DRG-based prospective payment system on quality of care for hospitalized Medicare patients. JAMA. 1990;264(15):1953-1955.",
        "year": 1990,
        "authors_short": "Kahn et al.",
        "notes": "RAND evaluation of the IPPS DRG-based payment system's effect on quality of care — the landmark empirical assessment of how DRG assignment, severity, and payment incentives played out in practice for Medicare inpatients.\n"
      },
      {
        "role": "demonstrate",
        "doi": "10.1056/NEJMsa0803563",
        "url": "https://doi.org/10.1056/NEJMsa0803563",
        "citation_text": "Jencks SF, Williams MV, Coleman EA. Rehospitalizations among patients in the Medicare fee-for-service program. N Engl J Med. 2009;360(14):1418-1428.",
        "year": 2009,
        "authors_short": "Jencks et al.",
        "notes": "The authoritative study establishing 30-day rehospitalization rates in Medicare FFS, organized by DRG-defined index admission families — a direct demonstration of DRG-based cohort identification and denominator construction that underpins CMS's Hospital Readmission Reduction Program.\n"
      },
      {
        "role": "use",
        "doi": "10.1056/NEJMsa012337",
        "url": "https://doi.org/10.1056/NEJMsa012337",
        "citation_text": "Birkmeyer JD, Siewers AE, Finlayson EVA, et al. Hospital volume and surgical mortality in the United States. N Engl J Med. 2002;346(15):1128-1137.",
        "year": 2002,
        "authors_short": "Birkmeyer et al.",
        "notes": "Large RWE study using DRG-defined procedure cohorts (e.g., coronary artery bypass, esophagectomy, aortic replacement) in Medicare claims to examine the volume-outcome relationship — a textbook application of DRG codes for inpatient cohort definition and case-mix adjustment.\n"
      }
    ],
    "plain_language_summary": "When a Medicare patient is discharged from a hospital, every diagnosis and procedure from that stay is fed into a grouping program that outputs a single code — the MS-DRG — which tells Medicare how much to pay the hospital. Think of it as a price tag calculated from the patient's main problem, any complications or serious side conditions, and the procedures performed. Researchers use these codes to build groups of similar hospitalizations (for example, all knee-replacement stays), to measure how complex a patient's hospital stay was, and — when actual dollar amounts are unavailable — to estimate the cost of a stay by multiplying the DRG's published resource weight by a standard dollar base rate. The key caveat: the code is assigned to the final discharge record, not to mid-stay billing, and it only exists for acute inpatient stays paid by Medicare's hospital payment system.",
    "key_terms": [
      {
        "term": "grouper",
        "definition": "The computer program that takes a stay's diagnosis and procedure codes as input and outputs the single MS-DRG number, following CMS's published decision rules."
      },
      {
        "term": "CC/MCC",
        "definition": "Complication or Comorbidity (CC) and Major Complication or Comorbidity (MCC) — severity categories assigned to secondary diagnoses by CMS that determine whether a stay lands in the highest-payment, middle, or lowest-payment tier of a DRG family."
      },
      {
        "term": "relative weight",
        "definition": "A published number (e.g., 1.44) that represents how resource-intensive the average case in a DRG is compared with the average Medicare inpatient case; multiply it by the hospital's base dollar rate to estimate payment."
      },
      {
        "term": "base DRG",
        "definition": "The common clinical identity shared by the two or three severity-tiered DRGs in a family, before splitting by CC/MCC status — for example, heart failure is the base DRG behind the 291/292/293 triplet."
      },
      {
        "term": "IPPS",
        "definition": "Inpatient Prospective Payment System — Medicare's method of paying hospitals a fixed, pre-determined amount per discharge based on the DRG, rather than reimbursing each individual service after the fact."
      },
      {
        "term": "present-on-admission (POA)",
        "definition": "A required flag on inpatient claims indicating whether each secondary diagnosis was present when the patient arrived at the hospital; only POA-coded conditions can qualify as CCs or MCCs to drive up the DRG severity tier."
      }
    ],
    "worked_example": {
      "scenario": "A researcher wants to illustrate how the same principal diagnosis — heart failure (ICD-10-CM I50.9, unspecified) — routes three different patients to three different payment tiers based solely on their secondary diagnoses. Patient A has an MCC (acute kidney failure, N17.9, POA=Y). Patient B has only a CC (hypertensive heart disease, I11.9, POA=Y). Patient C has no qualifying secondary diagnosis. The hospital's DRG base rate is $7,000. The researcher needs to show the DRG assignment, relative weight, and estimated payment for each patient.\n",
      "dataset": {
        "caption": "Three heart failure admissions — same principal diagnosis, three severity tiers.",
        "columns": [
          "patient_id",
          "principal_dx",
          "secondary_dx",
          "POA_flag",
          "assigned_DRG",
          "DRG_description",
          "relative_weight",
          "base_rate_usd",
          "estimated_payment_usd"
        ],
        "rows": [
          [
            "A",
            "I50.9 (heart failure, unspecified)",
            "N17.9 (acute kidney failure) — qualifies as MCC",
            "Y",
            291,
            "Heart failure and shock with MCC",
            2.21,
            7000,
            15470
          ],
          [
            "B",
            "I50.9 (heart failure, unspecified)",
            "I11.9 (hypertensive heart disease) — qualifies as CC",
            "Y",
            292,
            "Heart failure and shock with CC",
            1.44,
            7000,
            10080
          ],
          [
            "C",
            "I50.9 (heart failure, unspecified)",
            "None qualifying",
            "N/A",
            293,
            "Heart failure and shock without CC/MCC",
            1.06,
            7000,
            7420
          ]
        ]
      },
      "steps": [
        "The grouper reads each claim's principal diagnosis. All three patients have I50.9 (heart failure), so all three enter the heart failure base-DRG family within MDC 5 (Circulatory System), medical partition.\n",
        "The grouper checks secondary diagnoses against the MCC list first. Patient A's secondary diagnosis N17.9 (acute kidney failure) appears on the CMS MCC list and carries POA=Y (it was present at admission), so the stay is assigned DRG 291 — the with-MCC tier.\n",
        "Patient B's secondary diagnosis I11.9 (hypertensive heart disease) does not appear on the MCC list but does appear on the CC list with POA=Y, so the stay routes to DRG 292 — the with-CC tier.\n",
        "Patient C has no secondary diagnoses that qualify as CC or MCC, so the stay routes to DRG 293 — the without-CC/MCC tier, the lowest-payment tier of the family.\n",
        "Estimated payment is computed as relative_weight x base_rate: Patient A: 2.21 x 7000 = 15470. Patient B: 1.44 x 7000 = 10080. Patient C: 1.06 x 7000 = 7420.\n",
        "The payment spread across the three tiers is 15470 / 7420 = 2.09 — more than double from the lowest to the highest tier for the identical principal diagnosis, driven entirely by the secondary diagnoses and their POA flags.\n"
      ],
      "result": "Three patients admitted with the same heart failure diagnosis (I50.9) route to DRGs 291, 292, and 293 based on secondary diagnosis severity. Estimated payments are 2.21 x 7000 = 15470, 1.44 x 7000 = 10080, and 1.06 x 7000 = 7420 respectively. The highest tier (DRG 291 with MCC) pays approximately 15470 / 7420 = 2.09 times more than the lowest tier (DRG 293), even though the principal clinical condition is identical across all three admissions.\n",
      "severity_table": [
        [
          "DRG 291",
          "With MCC (e.g., acute kidney failure)",
          "2.21",
          "$15,470"
        ],
        [
          "DRG 292",
          "With CC (e.g., hypertensive heart disease)",
          "1.44",
          "$10,080"
        ],
        [
          "DRG 293",
          "Without CC/MCC",
          "1.06",
          "$7,420"
        ]
      ]
    },
    "prerequisites": [
      "icd-10-cm-diagnosis-coding",
      "icd-10-pcs-procedure-coding",
      "ub-04-institutional-claim-fields"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Severity-agnostic (base DRG) analysis",
        "description": "Rather than treating each severity tier as a distinct group, collapse the DRG triplet back to its base DRG (e.g., map 291, 292, and 293 all to base code 291) to create a severity-agnostic inpatient cohort definition for the clinical condition. This is appropriate when the research question is about the underlying condition (heart failure) rather than its severity at this admission, and when the CC/MCC composition is a confounder to be adjusted rather than the exposure.\n",
        "edge_cases": [
          "Some base DRGs have only one or two severity tiers; confirm whether the MDC/GLOS table for your fiscal year shows a two-way or three-way split before collapsing.",
          "A handful of DRGs are ungrouped (e.g., DRG 999 ungroupable) — exclude these from clinical cohorts before collapsing."
        ],
        "data_source_notes": "Claims: extract clm_drg_cd, then apply a lookup table mapping each MS-DRG to its base DRG and severity tier. CMS provides this in the annual IPPS final rule addenda (Table 5 lists all MS-DRGs with their MDC, medical/surgical partition, and relative weight)."
      },
      {
        "name": "DRG relative weight as inpatient cost proxy",
        "description": "Multiply the MS-DRG relative weight by the applicable national or hospital-specific base rate to estimate standardized inpatient cost for each discharge when actual adjudicated dollar amounts are unavailable. Using the national standardized operating amount (published annually by CMS) creates a nationally comparable cost proxy; applying the hospital's wage-index-adjusted rate captures local price variation at the cost of requiring the CMS impact file.\n",
        "edge_cases": [
          "Outlier payments (for very long or very expensive stays) and transfer adjustments are not captured by weight × base rate — these are add-ons above the DRG payment.",
          "Under bundled payment programs (e.g., BPCI-A, CJR), the DRG payment covers a defined episode window; individual claim lines inside the bundle may be zeroed out.",
          "Capital costs, pass-through payments, and indirect medical education add-ons are separate components; the operating payment formula accounts for the largest share but not 100% of total hospital revenue per case."
        ],
        "data_source_notes": "Medicare FFS: use clm_drg_cd weight from CMS Tables 5/6; multiply by national standardized amount for a reproducible proxy. Commercial/Medicaid: verify which grouper produced the DRG before applying IPPS weights, which are calibrated only to Medicare FFS cost data."
      },
      {
        "name": "DRG for surgical vs medical partition stratification",
        "description": "Use the surgical/medical partition flag (available in CMS's MDC table or derivable from the MS-DRG number range within each MDC) to stratify an inpatient cohort by whether a qualifying procedure was performed. This is essential in analyses of conditions where both medical management and surgical intervention are common (e.g., appendicitis: DRGs 341–343 with laparoscopic appendectomy vs 344–346 medical).\n",
        "edge_cases": [
          "The surgical partition does not imply the procedure was the primary reason for admission; always verify the principal diagnosis when building clinical phenotypes.",
          "Some stays route to a pre-MDC \"special\" DRG (e.g., ECMO, tracheostomy, transplant) before MDC assignment; these must be treated separately in any partition-based logic."
        ],
        "data_source_notes": "Claims: CMS publishes the MDC and partition (M/S) for every MS-DRG in the IPPS final rule Table 5; join on clm_drg_cd to assign partition."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "healthcare-costs-pppm-pppy-pmpm",
        "pros_of_this": "DRG relative weight provides a standardized cost proxy that is nationally comparable, available even when adjudicated dollar amounts are missing, and not subject to adjudication-lag distortion. It is the reproducible benchmark for inpatient cost across payers and fiscal years.",
        "cons_of_this": "Weight x base rate flattens within-DRG variation; two patients in the same DRG can have actual costs that differ twofold or more. Outlier payments, transfer adjustments, and capital add-ons are excluded. For patient-level cost precision in a database with reliable allowed amounts, use the actual dollar fields.",
        "when_to_prefer": "Use DRG weight as cost proxy when actual amounts are unavailable, unstandardized across payers, or subject to heavy adjudication lag; use actual allowed amounts when they are trustworthy and patient-level variation matters."
      },
      {
        "compared_to": "elixhauser-comorbidity-index-rwe",
        "pros_of_this": "DRG severity tier (MCC/CC/none) captures the complexity of the specific index hospitalization event; it is automatically produced by the grouper from the discharge record with no additional lookback window required.",
        "cons_of_this": "DRG severity is assignment-optimized — it captures only CC/MCC-coded secondary diagnoses at this admission, while Elixhauser indexes the patient's full baseline comorbidity burden from a preceding observation window. DRG severity is influenced by CC/MCC capture programs and coding intensity changes over time.",
        "when_to_prefer": "Use DRG severity for within-hospitalization complexity adjustment; use Elixhauser or Charlson for patient-level baseline comorbidity; consider both together for post-hospitalization outcome models."
      },
      {
        "compared_to": "icd-10-cm-diagnosis-coding",
        "pros_of_this": "DRG provides a single, pre-resolved clinical grouping that has already applied the procedure-versus-diagnosis partition, MDC exclusion rules, and CC/MCC severity logic. For inpatient cohort identification, it is more parsimonious than a broad ICD-10 code list.",
        "cons_of_this": "DRG is available only for inpatient discharges, only on final discharge claims, and only for IPPS-paid facilities. ICD-10 codes span all settings and claim types and are the necessary substrate for outpatient, professional, and cross-setting analyses.",
        "when_to_prefer": "Use DRG for inpatient-only analyses where the procedure type and severity tier are part of the phenotype; use ICD-10 codes when the analysis spans settings or requires pre-admission/post-discharge lookback windows."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Medicare FFS Part A: extract `clm_drg_cd` from discharge claims (bill type ending in 2 or final-action indicator = F). Join to CMS IPPS final rule Table 5 (or the MedPAR DRG weight file) to retrieve relative weight, MDC, and surgical/medical partition. Exclude LTACHs (provider type 13) and IPFs (provider type 4), which use separate payment systems. For cost proxies, multiply weight by the national standardized operating amount from CMS's annual IPPS final rule or by the hospital-specific base rate from the impact file. For trend analyses crossing FY 2008 or the ICD-10 transition (FY 2016), anchor cohort definitions to ICD-10 code lists rather than DRG numbers to avoid classification break artifacts.\n",
      "commercial": "DRG in commercial claims may be vendor-assigned (using the public CMS grouper applied to ICD-10 codes) or may reflect the plan's internal grouping. Check the data dictionary to confirm the grouper version and fiscal year. Do not apply Medicare IPPS relative weights directly to commercial inpatient stays without adjusting for differences in cost structure and patient demographics.\n",
      "ehr": "EHR systems typically contain the final DRG assigned at discharge as part of the billing record. This is the most reliable source for a linked EHR-claims dataset. Note that EHR-captured DRGs reflect what was submitted to the payer, not necessarily the final adjudicated assignment if subsequent rebilling occurred.\n",
      "hcup": "HCUP NIS and SID provide both MS-DRG (for Medicare stays) and APR-DRG SOI/ROM (for all payers) as separate fields. The APR-DRG is produced by the 3M proprietary grouper; cite the software version used. For cross-payer analyses, APR-DRG is more complete; for Medicare-specific work, MS-DRG is the authoritative field.\n"
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import csv\nfrom pathlib import Path\nfrom dataclasses import dataclass, field\nfrom typing import Optional\n\n@dataclass\nclass DRGRecord:\n    \"\"\"One row from the CMS IPPS Table 5 / relative-weight addendum.\"\"\"\n    ms_drg: int\n    mdc: str              # e.g., \"05\" for Circulatory System\n    partition: str        # \"M\" = medical, \"S\" = surgical, \"P\" = pre-MDC\n    description: str\n    relative_weight: float\n    # Severity tier derived from the DRG number within its base-DRG family\n    severity_tier: str    # \"MCC\", \"CC\", \"none\", or \"N/A\" (ungrouped / single-tier)\n    base_drg: int         # all three tiers share the same base; e.g. 291->291, 292->291, 293->291\n\n\ndef load_drg_weights(csv_path: str) -> dict[int, DRGRecord]:\n    \"\"\"\n    Load the CMS annual DRG weight table.  The CSV must have columns:\n      ms_drg, mdc, partition, description, relative_weight, severity_tier, base_drg\n    CMS publishes Table 5 of the IPPS final rule; this function expects a\n    pre-processed flat file derived from that table.\n    \"\"\"\n    table: dict[int, DRGRecord] = {}\n    with open(csv_path, newline=\"\", encoding=\"utf-8\") as f:\n        for row in csv.DictReader(f):\n            drg = int(row[\"ms_drg\"])\n            table[drg] = DRGRecord(\n                ms_drg=drg,\n                mdc=row[\"mdc\"].strip(),\n                partition=row[\"partition\"].strip().upper(),\n                description=row[\"description\"].strip(),\n                relative_weight=float(row[\"relative_weight\"]),\n                severity_tier=row[\"severity_tier\"].strip(),\n                base_drg=int(row[\"base_drg\"]),\n            )\n    return table\n\n\ndef compute_drg_cost_proxy(\n    ms_drg: int,\n    drg_table: dict[int, DRGRecord],\n    base_rate_usd: float = 6940.0,   # FY 2025 national standardized operating amount (approx)\n) -> Optional[float]:\n    \"\"\"\n    Return weight × base_rate as a standardized inpatient 
cost proxy.\n    Returns None if the DRG is not found in the table (e.g., ungroupable DRG 999).\n    This is a first-order approximation; outlier top-ups, wage-index adjustments,\n    DSH, and IME add-ons are excluded.\n    \"\"\"\n    rec = drg_table.get(ms_drg)\n    if rec is None:\n        return None\n    return round(rec.relative_weight * base_rate_usd, 2)\n\n\n# ---------------------------------------------------------------------------\n# Example: process a list of inpatient discharge records\n# ---------------------------------------------------------------------------\ndef classify_discharges(\n    discharges: list[dict],         # each dict has at least \"person_id\", \"discharge_date\", \"ms_drg\"\n    drg_table: dict[int, DRGRecord],\n    base_rate_usd: float = 6940.0,\n) -> list[dict]:\n    \"\"\"\n    Enrich each discharge record with:\n      - base_drg          : severity-agnostic DRG (collapse triplet)\n      - severity_tier     : MCC / CC / none\n      - partition         : M (medical) or S (surgical)\n      - mdc               : Major Diagnostic Category\n      - cost_proxy_usd    : weight × base rate (None for ungroupable DRGs)\n      - is_surgical       : boolean convenience flag\n    \"\"\"\n    results = []\n    for d in discharges:\n        drg_int = int(d.get(\"ms_drg\", 0))\n        rec = drg_table.get(drg_int)\n        enriched = dict(d)\n        if rec:\n            enriched[\"base_drg\"] = rec.base_drg\n            enriched[\"severity_tier\"] = rec.severity_tier\n            enriched[\"partition\"] = rec.partition\n            enriched[\"mdc\"] = rec.mdc\n            enriched[\"is_surgical\"] = rec.partition == \"S\"\n            enriched[\"cost_proxy_usd\"] = round(rec.relative_weight * base_rate_usd, 2)\n        else:\n            # DRG 999 = ungroupable; exclude from clinical cohorts\n            enriched[\"base_drg\"] = None\n            enriched[\"severity_tier\"] = \"ungroupable\"\n            enriched[\"partition\"] = None\n       
     enriched[\"mdc\"] = None\n            enriched[\"is_surgical\"] = False\n            enriched[\"cost_proxy_usd\"] = None\n        results.append(enriched)\n    return results\n\n\n# ---------------------------------------------------------------------------\n# Demonstration with the heart failure triplet (FY 2025 approximate weights)\n# ---------------------------------------------------------------------------\nif __name__ == \"__main__\":\n    # Minimal inline DRG table for demonstration\n    demo_table = {\n        291: DRGRecord(291, \"05\", \"M\", \"Heart failure and shock w MCC\", 2.21, \"MCC\", 291),\n        292: DRGRecord(292, \"05\", \"M\", \"Heart failure and shock w CC\",  1.44, \"CC\",  291),\n        293: DRGRecord(293, \"05\", \"M\", \"Heart failure and shock w/o CC/MCC\", 1.06, \"none\", 291),\n    }\n\n    discharges = [\n        {\"person_id\": \"A\", \"discharge_date\": \"2025-03-01\", \"ms_drg\": 291},\n        {\"person_id\": \"B\", \"discharge_date\": \"2025-03-05\", \"ms_drg\": 292},\n        {\"person_id\": \"C\", \"discharge_date\": \"2025-03-10\", \"ms_drg\": 293},\n    ]\n\n    BASE_RATE = 7000.0  # simplified for illustration\n    results = classify_discharges(discharges, demo_table, base_rate_usd=BASE_RATE)\n    print(f\"{'ID':4} {'DRG':5} {'Base':5} {'Tier':5} {'Part':5} {'Cost':>10}\")\n    for r in results:\n        print(f\"{r['person_id']:4} {r['ms_drg']:5} {r['base_drg']:5} \"\n              f\"{r['severity_tier']:5} {r['partition']:5} \"\n              f\"${r['cost_proxy_usd']:>9,.2f}\")\n    # Expected output (BASE_RATE = 7000):\n    #   A   291   291   MCC   M   $15,470.00\n    #   B   292   291   CC    M   $10,080.00\n    #   C   293   291   none  M    $7,420.00",
        "description": "Collapse MS-DRG triplets to base DRGs and severity tiers, flag surgical vs. medical partition, and compute a DRG relative-weight cost proxy from a CMS weight lookup table. Demonstrates the core operations an analyst performs when building an inpatient cohort from Medicare FFS Part A claims.\n",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "# MS-DRG classification utilities — base R, no package dependencies\n# Mirrors the Python classify_discharges() logic for cross-language consistency.\n\n#' Load CMS IPPS DRG weight table from a flat CSV\n#'\n#' The CSV must contain columns:\n#'   ms_drg, mdc, partition, description, relative_weight, severity_tier, base_drg\n#' Derived from CMS annual IPPS final rule Table 5 (downloadable from cms.gov).\n#'\n#' @param csv_path Path to the pre-processed weight CSV file\n#' @return data.frame with one row per MS-DRG\nload_drg_weights <- function(csv_path) {\n  drg_table <- read.csv(csv_path, stringsAsFactors = FALSE, strip.white = TRUE)\n  drg_table$ms_drg         <- as.integer(drg_table$ms_drg)\n  drg_table$base_drg       <- as.integer(drg_table$base_drg)\n  drg_table$relative_weight <- as.numeric(drg_table$relative_weight)\n  drg_table$partition       <- toupper(drg_table$partition)\n  drg_table\n}\n\n\n#' Enrich discharge records with DRG classification fields and cost proxy\n#'\n#' @param discharges  data.frame with at least columns: person_id, discharge_date, ms_drg\n#' @param drg_table   data.frame returned by load_drg_weights()\n#' @param base_rate   Numeric. 
National standardized operating base rate (USD).\n#'                    Approx $6,940 for FY 2025; use $7,000 for illustration.\n#' @return Original discharges data.frame with added columns:\n#'         base_drg, severity_tier, partition, mdc, is_surgical, cost_proxy_usd\nclassify_discharges <- function(discharges, drg_table, base_rate = 6940) {\n  result <- merge(\n    discharges,\n    drg_table[, c(\"ms_drg\", \"mdc\", \"partition\", \"description\",\n                  \"relative_weight\", \"severity_tier\", \"base_drg\")],\n    by  = \"ms_drg\",\n    all.x = TRUE    # keep unmatched rows (ungroupable DRGs)\n  )\n  # Flag ungroupable (DRG 999 or no match in weight table)\n  result$severity_tier[is.na(result$severity_tier)] <- \"ungroupable\"\n  result$cost_proxy_usd <- ifelse(\n    is.na(result$relative_weight),\n    NA_real_,\n    round(result$relative_weight * base_rate, 2)\n  )\n  result$is_surgical <- !is.na(result$partition) & result$partition == \"S\"\n  result\n}\n\n\n# ---------------------------------------------------------------------------\n# Demonstration: heart failure severity triplet (FY 2025 approximate weights)\n# ---------------------------------------------------------------------------\n\n# Minimal inline DRG table for the heart failure family\ndemo_drg_table <- data.frame(\n  ms_drg          = c(291L,  292L,  293L),\n  mdc             = c(\"05\",  \"05\",  \"05\"),\n  partition       = c(\"M\",   \"M\",   \"M\"),\n  description     = c(\"HF w MCC\", \"HF w CC\", \"HF w/o CC/MCC\"),\n  relative_weight = c(2.21,  1.44,  1.06),\n  severity_tier   = c(\"MCC\", \"CC\",  \"none\"),\n  base_drg        = c(291L,  291L,  291L),\n  stringsAsFactors = FALSE\n)\n\ndemo_discharges <- data.frame(\n  person_id      = c(\"A\",    \"B\",    \"C\"),\n  discharge_date = as.Date(c(\"2025-03-01\", \"2025-03-05\", \"2025-03-10\")),\n  ms_drg         = c(291L,   292L,   293L),\n  stringsAsFactors = FALSE\n)\n\nBASE_RATE <- 7000  # simplified for 
illustration\n\nresult <- classify_discharges(demo_discharges, demo_drg_table, base_rate = BASE_RATE)\n\ncat(sprintf(\n  \"%-4s %-5s %-5s %-5s %-5s %12s\\n\",\n  \"ID\", \"DRG\", \"Base\", \"Tier\", \"Part\", \"Cost_USD\"\n))\nfor (i in seq_len(nrow(result))) {\n  r <- result[i, ]\n  cat(sprintf(\n    \"%-4s %-5d %-5d %-5s %-5s %12.2f\\n\",\n    r$person_id, r$ms_drg, r$base_drg,\n    r$severity_tier, r$partition, r$cost_proxy_usd\n  ))\n}\n# Expected output (BASE_RATE = 7000):\n# ID   DRG   Base  Tier  Part      Cost_USD\n# A    291   291   MCC   M         15470.00\n# B    292   291   CC    M         10080.00\n# C    293   291   none  M          7420.00\n\n# ---------------------------------------------------------------------------\n# Tip: for severity-agnostic cohort definition, filter on base_drg\n# (which() drops the NA base_drg rows left by ungroupable DRGs)\n# ---------------------------------------------------------------------------\n# hf_any <- result[which(result$base_drg == 291), ]  # all heart failure, any severity\n# hf_mcc <- result[result$severity_tier == \"MCC\", ]  # MCC tier only",
        "description": "R implementation that reads a CMS DRG weight table and enriches inpatient discharge records with base DRG, severity tier, surgical/medical partition, MDC, and DRG relative-weight cost proxy. Uses only base R (no external packages required) for maximum portability across analytic environments.\n",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  CLAIM[Inpatient discharge claim\\nprincipal Dx + procedures\\nsecondary Dx + POA flags] --> GROUPER[CMS MS-DRG Grouper\\ndecision tree]\n  GROUPER --> MDC[Major Diagnostic Category\\nbody-system level\\ne.g. MDC 05 Circulatory]\n  MDC --> PARTITION{Qualifying OR\\nprocedure?}\n  PARTITION -- Yes --> SURG[Surgical DRGs\\ne.g. CABG 231-236\\nJoint replace 469-470]\n  PARTITION -- No --> MED[Medical DRGs\\nprincipal Dx-driven\\ne.g. HF family 291-293]\n  MED --> CCMCC{Secondary Dx\\nseverity?}\n  CCMCC -- MCC present\\nPOA=Y --> T1[DRG n\\nwith MCC\\ne.g. 291]\n  CCMCC -- CC present\\nPOA=Y --> T2[DRG n+1\\nwith CC\\ne.g. 292]\n  CCMCC -- Neither --> T3[DRG n+2\\nwithout CC/MCC\\ne.g. 293]\n  T1 --> PAY1[Weight × base rate\\ne.g. 2.21 × $7,000 = $15,470]\n  T2 --> PAY2[Weight × base rate\\ne.g. 1.44 × $7,000 = $10,080]\n  T3 --> PAY3[Weight × base rate\\ne.g. 1.06 × $7,000 = $7,420]",
        "caption": "MS-DRG grouper logic — from discharge claim to payment. The same principal diagnosis (heart failure) routes to three payment tiers based solely on secondary diagnoses and their present-on-admission status.",
        "alt_text": "Flowchart showing an inpatient discharge claim entering the CMS MS-DRG grouper, routing through MDC and surgical/medical partition, then splitting into three severity tiers based on MCC/CC/none secondary diagnosis status, each producing a different payment via weight times base rate.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "timeline\n  title MS-DRG Version History — key structural breaks for RWE\n  section Original HCFA DRGs\n    1983 : IPPS implemented, ~478 DRGs\n    1987 : Expanded to ~477-490 groups, severity limited\n  section Pre-MS-DRG refinements\n    1997 : ~499 DRGs, minor severity updates\n    2007 : Last year of pre-MS-DRG classification (ICD-9 era)\n  section MS-DRG era (ICD-9 then ICD-10)\n    2008 : MS-DRGs introduced FY2008, ~745 groups, CC/MCC triplets\n    2015 : ICD-10-CM/PCS transition, grouper re-mapped (v33+)\n    2024 : ~780 MS-DRGs, annual CC/MCC list updates continue",
        "caption": "MS-DRG version history highlighting the two structural breaks that affect longitudinal RWE studies — the FY2008 MS-DRG redesign and the FY2016 ICD-10 transition. Trend analyses crossing either break must anchor to ICD-10 code lists, not DRG numbers.",
        "alt_text": "Timeline showing HCFA DRG origins in 1983, the MS-DRG expansion in FY2008 adding severity triplets, the ICD-10 transition in FY2016, and continuing annual updates.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "part_of",
        "target_slug": "ub-04-institutional-claim-fields",
        "notes": "MS-DRG is assigned to and stored on the UB-04 institutional claim (FL 71 on the paper form; corresponding field in 837I electronic transactions). The DRG on the discharge claim is the grouper output derived from the diagnosis and procedure codes on that same claim.\n"
      },
      {
        "relation_type": "requires",
        "target_slug": "diagnosis-position-and-qualifiers",
        "notes": "The grouper requires a principal diagnosis (the condition chiefly responsible for admission) and up to 24 secondary diagnoses each with a present-on-admission (POA) flag. The position and POA status of secondary diagnoses directly determine whether CC/MCC criteria are met and therefore which severity tier of the DRG family is assigned. Errors in diagnosis sequencing or missing POA flags corrupt the DRG assignment.\n"
      },
      {
        "relation_type": "requires",
        "target_slug": "icd-10-pcs-procedure-coding",
        "notes": "ICD-10-PCS procedure codes are the primary driver of the surgical vs. medical partition decision in the grouper. Whether a qualifying OR procedure appears on the claim — and which procedure — determines whether the stay routes to a surgical DRG (e.g., joint replacement 469/470) or a medical DRG (e.g., heart failure 291-293). Missing or incorrect procedure codes misclassify partition and DRG.\n"
      },
      {
        "relation_type": "requires",
        "target_slug": "icd-10-cm-diagnosis-coding",
        "notes": "The grouper consumes ICD-10-CM diagnosis codes for both the principal diagnosis (which determines MDC and base DRG) and the secondary diagnoses (which drive CC/MCC severity tier assignment). All DRG-based cohort definitions and cost proxies are downstream of accurate ICD-10-CM coding.\n"
      },
      {
        "relation_type": "used_with",
        "target_slug": "healthcare-costs-pppm-pppy-pmpm",
        "notes": "DRG relative weight is the standard inpatient cost proxy when paid or allowed dollar amounts are unavailable in claims; weight × base rate feeds directly into PPPM/PPPY cost calculations for inpatient episodes in burden-of-disease, CER, and value dossier analyses.\n"
      },
      {
        "relation_type": "used_with",
        "target_slug": "elixhauser-comorbidity-index-rwe",
        "notes": "DRG severity tier (MCC/CC/none) and Elixhauser comorbidity measures capture orthogonal dimensions of patient complexity — admission-level severity versus baseline disease burden. Both are commonly included as covariates in outcome models for post-hospitalization endpoints such as readmission, mortality, and downstream cost.\n"
      },
      {
        "relation_type": "see_also",
        "target_slug": "claims-analysis",
        "notes": "MS-DRG assignment is specific to inpatient institutional claims and is a foundational field in any claims-based analysis of hospitalizations, surgical procedures, or inpatient costs in Medicare or commercial populations.\n"
      },
      {
        "relation_type": "see_also",
        "target_slug": "all-cause-vs-attributable-costs-rwe",
        "notes": "DRG-based cost proxies inherit the all-cause vs. attributable distinction: the DRG payment covers the entire stay regardless of which condition drove resource use. Attributable inpatient costs require either diagnosis-position filtering or a separate attribution step — the DRG weight alone does not partition costs by condition.\n"
      }
    ],
    "aliases": [
      "DRG",
      "MS-DRG",
      "diagnosis-related group",
      "APR-DRG",
      "Medicare severity DRG",
      "IPPS grouper",
      "DRG relative weight"
    ],
    "complexity": "foundational",
    "regulatory_relevance": [
      "cms",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "multi-database",
    "name": "Multi-Database / Distributed Network Study",
    "short_definition": "A study design that runs one common protocol against two or more independently held data sources, executing identical analytic code at each site and combining only privacy-preserving site-level summaries (e.g., risk-set tables, propensity-score-adjusted estimates) rather than pooling patient-level records.",
    "long_description": "A **multi-database (distributed network) study** answers a single research question by executing one harmonized\nprotocol across several independently governed data sources — commercial and Medicare claims plans, integrated-delivery\nEHRs, registries — and then combining the *results*, not the *records*. Each **data partner** maps its data to a shared\nstructure (a common data model such as OMOP CDM or the Sentinel SCDM, or a shared analytic-table specification), runs\nbyte-identical or specification-identical code locally, and returns only aggregate output. The patient-level protected\nhealth information never leaves the partner's firewall. This is the operating model of the FDA Sentinel System, the\nCanadian Network for Observational Drug Effect Studies (CNODES), the OHDSI network, and most multinational PASS.\n\n**Core conceptual distinction.** The defining choice is *distributed analysis vs centralized pooling*, and it is\nseparable from the choice of estimator. (1) *Distributed vs pooled individual-level data (IPD):* in a distributed\nnetwork each site holds its own data and shares only aggregates; in a pooled study all records are physically combined\nin one analytic file. Distributed analysis is what makes data partners willing to participate (governance, HIPAA,\nGDPR), but it constrains what you can compute — you must design every quantity so it can be assembled from site-level\npieces. (2) *Common data model vs ad hoc harmonization:* a CDM (OMOP, SCDM/Sentinel) fixes table schemas and\nvocabularies so the same code runs everywhere; ad hoc harmonization writes bespoke extraction per site and is fragile.\n(3) *What gets shared* sits on a spectrum: fully aggregate counts/effect estimates (most private), stratified\nrisk-set or propensity-score-stratum tables that enable exact stratified or conditional analyses without IPD\n(Toh's distributed risk-set sharing), or, rarely, a curated pooled extract under a data-use agreement. 
The estimand is\nstill a comparative effect (e.g., a hazard ratio or risk difference for drug A vs B), but it is a *network-level*\nsummary of site-specific estimates whose interpretation depends on whether the site estimates are combined with a\nfixed-effect or random-effects model, and on how heterogeneous the sites are.\n\n**Pros, cons, and trade-offs.**\n- **vs a single-database study:** A network buys sample size for rare exposures and outcomes, broader generalizability\n  across payers/regions/care settings, and the ability to *measure* heterogeneity (is the signal real or one site's\n  artifact?). Cost: enormous operational overhead (common-protocol governance, CDM mapping, code distribution, site\n  QC) and a slower timeline. **Prefer a network** when one database lacks power, when external validity matters for a\n  regulatory or coverage decision, or when reproducibility across data sources is itself the evidentiary claim;\n  **prefer a single well-characterized database** for an exploratory or hypothesis-generating analysis where speed and\n  deep knowledge of one source outweigh breadth.\n- **vs centralized pooled individual-level analysis:** Distributed analysis preserves privacy and partner autonomy and\n  sidesteps the legal/ethical barriers to moving PHI. Cost: you cannot run arbitrary individual-level models\n  centrally; you are limited to what aggregates support (stratified Cox/conditional logistic via risk sets,\n  site-specific PS models, meta-analytic combination). Some flexible estimators (certain machine-learning PS,\n  individual-level g-methods) are awkward or impossible without IPD. 
**Prefer distributed** unless a data-use\n  agreement genuinely permits pooling and the analysis demands individual-level modeling that aggregates cannot\n  reproduce.\n- **vs aggregate meta-analysis of separately published studies:** A distributed network uses one protocol, one set of\n  code lists, and one outcome definition everywhere, eliminating the between-study methodological heterogeneity and\n  publication bias that plague literature-based meta-analysis. Cost: it requires a live, governed network rather than\n  a desk review. **Prefer the network** when you control the analysis; fall back to meta-analysis of published\n  estimates only when you cannot access the underlying data.\n\n**When to use.** Use a multi-database design when (a) no single source is large enough for the exposure-outcome pair\n(rare drugs, rare adverse events, subgroups); (b) a regulator (FDA Sentinel, EMA PASS) or HTA body expects evidence\nreproduced across multiple real-world sources; (c) generalizability across populations, payers, or health systems is\ncentral to the decision; or (d) consistency across heterogeneous data is itself the scientific point (does the effect\nreplicate?). It is the right substrate for active drug-safety surveillance and for high-stakes comparative safety\nquestions where a single-database finding would not be persuasive.\n\n**When NOT to use — and when it is actively misleading or dangerous.**\n- **When one database already answers the question with adequate power and validity.** The network's overhead buys\n  nothing and slows the answer; the breadth is decorative.\n- **When the exposure, outcome, or confounders are not measurable identically across sites.** If inpatient drug\n  administration is captured at one integrated-delivery site but invisible in another's claims, the \"same\" exposure\n  definition means different things; combining them manufactures a spurious network estimate. 
Pooling or meta-analyzing\n  non-comparable site definitions is the dangerous failure mode: heterogeneity then reflects *measurement*, not\n  biology, and a fixed-effect summary will confidently report a number that means nothing.\n- **When sites differ structurally in ways the analysis ignores.** Different drug launch dates and formulary timing by\n  plan create calendar-time confounding that varies by site; Medicare Advantage vs fee-for-service capture differs;\n  case-mix differs. A fixed-effect combination applied blindly hides this. Always report between-site heterogeneity (I², τ²,\n  forest plot) and investigate large I² *before* trusting a pooled number.\n- **When privacy constraints force aggregates so coarse that the estimator is biased.** If a site can only return\n  marginal counts (not PS-stratified risk sets), residual confounding cannot be controlled within the distributed\n  analysis and the network estimate is no better than a crude one.\n\n**Data-source operational depth.**\n- **Claims (commercial / Medicare FFS):** The workhorse of Sentinel and CNODES. Exposure = pharmacy claim\n  (`ndc` + `fill_date` + `days_supply`); enrollment spans define observable time. Require continuous medical +\n  pharmacy enrollment across washout and follow-up so absence of a fill is real. Failure mode: **Medicare Advantage\n  person-time lacks FFS claims**: MA enrollees' encounters are paid by the plan, not adjudicated as FFS claims, so\n  diagnoses/procedures are missing or undercounted; exclude MA-only person-time (or use MA encounter data only where a\n  partner certifies its completeness). Free samples and 90-day mail-order fills distort exposure windows built from\n  `days_supply`.\n- **EHR / integrated delivery (e.g., a Sentinel data partner with an internal pharmacy):** Captures inpatient\n  administrations, labs, and vitals that claims miss, an *advantage* that becomes a *threat to comparability* when\n  pooled with claims-only partners. 
A drug given in hospital is observed at the EHR site and invisible at the claims\n  site, so the operational exposure definition silently differs. Visit-driven capture also means patients who leave\n  the system are differentially lost. Workaround: restrict to the lowest common denominator of captured care, or model\n  site as a fixed effect and stratify so each site's estimate uses only its own internally consistent data.\n- **Registry:** Strong for indication, disease severity, and adjudicated outcomes (e.g., cancer stage, validated MI);\n  weak for complete longitudinal drug exposure. Use registries in a network for the outcome/severity layer and link to\n  claims for exposure and to a death index for censoring; never assume a registry's drug history is complete.\n- **Linked claims–EHR–vital records:** The richest partner type, but linkage selects the linkable subset and creates\n  order/fill/service date discrepancies that must be reconciled before time-zero assignment. In a network, a few\n  linked sites plus many claims-only sites create a *capability gradient* — design the common protocol to the weakest\n  site, then run richer sensitivity analyses only where the data support them.\n- **Cross-cutting failure modes:** **differential competing risks by exposure** (in elderly claims populations a drug\n  preferentially used in frailer patients faces higher competing mortality, biasing cause-specific estimates\n  differently across sites with different age mixes); **immortal time in procedure studies** (defining exposure by a\n  procedure that can only occur after surviving to it); and **outcome-algorithm portability** (a claims-based MI\n  algorithm validated in one plan may have different PPV in another).\n\n**Worked claims example (distributed safety study).** Question: incidence of acute pancreatitis among new users of\nincretin-based therapy (GLP-1/DPP-4) vs sulfonylureas, run across four data partners (two commercial claims plans, one\nMedicare FFS 
extract, one integrated-delivery EHR), one common protocol. At each site, local code builds the analytic\ntable identically: (1) **Cohort entry** = first fill (`fill_date`) of either drug class with no fill of *any* study or\ncomparator drug in the prior 365 days (washout), among adults with ≥2 type-2-diabetes diagnoses in the baseline\nwindow. (2) **Observable time** = continuous medical + pharmacy enrollment spanning the full 365-day washout through\nfollow-up; **exclude MA-only person-time** at the Medicare partner because FFS claims are absent there. (3) **Index\ndate / time zero** = the qualifying fill date; assign the arm from the `ndc` dispensed that day. (4) **Outcome** =\nfirst inpatient acute-pancreatitis diagnosis (validated algorithm) after time zero; censor at disenrollment, death\n(death index), end of data, treatment discontinuation (`days_supply` end + 30-day grace), or switch. (5) Each site\nestimates a **site-specific propensity score** from baseline covariates measured only in `[index_date-365,\nindex_date]`, forms PS strata, and returns only a **stratified risk-set / event-count table per PS stratum and arm**\n(person-time and events) — no patient-level rows leave the site. (6) The coordinating center combines the four\nsite-specific stratified incidence-rate ratios with a **random-effects meta-analysis**, reports the pooled IRR with\nits 95% CI, and reports **I² and a forest plot**; a high I² triggers investigation of whether one partner's inpatient\ncapture or formulary timing — not biology — drives the divergence before any pooled number is released.",
    "primary_category": "Study_Design",
    "tags": [
      "distributed-network",
      "sentinel",
      "cnodes",
      "ohdsi",
      "common-data-model",
      "privacy-preserving",
      "multi-database",
      "pharmacoepidemiology",
      "drug-safety-surveillance"
    ],
    "applies_to_study_types": [
      "multi_database"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1097/mlr.0b013e31829b1bb1",
        "url": "https://doi.org/10.1097/MLR.0b013e31829b1bb1",
        "citation_text": "Toh S, Gagne JJ, Rassen JA, Fireman BH, Kulldorff M, Brown JS. Confounding adjustment in comparative effectiveness research conducted within distributed research networks. Medical Care. 2013;51(8 Suppl 3):S4-S10.",
        "year": 2013,
        "authors_short": "Toh et al.",
        "notes": "Foundational statement of how to do confounding-adjusted comparative analysis across a distributed network while sharing only site-level summaries rather than patient-level data."
      },
      {
        "role": "explain",
        "doi": "10.1002/pds.3483",
        "url": "https://doi.org/10.1002/pds.3483",
        "citation_text": "Toh S, Reichman ME, Houstoun M, et al. Multivariable confounding adjustment in distributed data networks without sharing of patient-level data. Pharmacoepidemiology and Drug Safety. 2013;22(11):1171-1177.",
        "year": 2013,
        "authors_short": "Toh et al.",
        "notes": "Shows privacy-preserving propensity-score and risk-set-sharing approaches that reproduce individual-level confounding adjustment without moving PHI across sites."
      },
      {
        "role": "explain",
        "doi": "10.1002/pds.4936",
        "url": "https://doi.org/10.1002/pds.4936",
        "citation_text": "Platt RW, Henry DA, Suissa S. The Canadian Network for Observational Drug Effect Studies (CNODES): reflections on the first eight years, and a look to the future. Pharmacoepidemiology and Drug Safety. 2020;29(S1):103-107.",
        "year": 2020,
        "authors_short": "Platt et al.",
        "notes": "Senior pharmacoepidemiology authors' retrospective on operating a multinational distributed drug-safety network (common protocol, local execution, meta-analytic combination) and its governance and methods lessons."
      },
      {
        "role": "demonstrate",
        "doi": "10.1073/pnas.1510502113",
        "url": "https://doi.org/10.1073/pnas.1510502113",
        "citation_text": "Hripcsak G, Ryan PB, Duke JD, et al. Characterizing treatment pathways at scale using the OHDSI network. Proceedings of the National Academy of Sciences. 2016;113(27):7329-7336.",
        "year": 2016,
        "authors_short": "Hripcsak et al.",
        "notes": "Large-scale demonstration of identical analytic code executed across a federated OMOP-CDM network, quantifying cross-site heterogeneity in real-world treatment patterns."
      },
      {
        "role": "use",
        "doi": "10.1002/pds.70028",
        "url": "https://doi.org/10.1002/pds.70028",
        "citation_text": "Desai RJ, Wang SV, Sreedhara SK, et al. The FDA Sentinel Real-World Evidence Data Enterprise (RWE-DE). Pharmacoepidemiology and Drug Safety. 2024.",
        "year": 2024,
        "authors_short": "Desai et al.",
        "notes": "Current operational use of a regulatory distributed data enterprise for routine RWE and safety surveillance."
      }
    ],
    "plain_language_summary": "A multi-database study runs the exact same research question across several independent health databases at the same time, combines the findings from each one, and checks whether they agree. No patient records are ever shared between databases — each site keeps its own data, runs the same computer code locally, and sends back only a privacy-safe summary number. Running the same study across multiple sources boosts statistical power for rare events, shows whether a finding holds across different patient populations, and lets researchers spot when one database tells a very different story from the others.",
    "key_terms": [
      {
        "term": "common data model",
        "definition": "A shared blueprint that tells every participating database how to name its tables and columns so that the same analysis code can run identically at every site without custom rewriting."
      },
      {
        "term": "distributed analysis",
        "definition": "An approach where each site runs the study code on its own data and returns only summary numbers to the coordinating team — no individual patient rows ever leave the site's own servers."
      },
      {
        "term": "data partner",
        "definition": "One of the independent databases participating in the network (for example, a commercial insurance plan or a hospital system), each contributing its own patient population."
      },
      {
        "term": "incidence rate ratio",
        "definition": "A number comparing how often an event (like a side effect) occurs per year in one treatment group versus another; a value below 1.0 means the first group had fewer events per year."
      }
    ],
    "worked_example": {
      "scenario": "A research team wants to know whether GLP-1 diabetes medications are associated with fewer hospitalizations for pancreatitis compared with sulfonylureas. No single insurance database has enough pancreatitis cases to answer this reliably, so the team runs the identical study at three separate data partners. Each partner maps its data to a shared common data model, runs the same code locally, and returns only a small table of event counts and patient-years. The coordinating center then pools those three site-level estimates.",
      "dataset": {
        "caption": "Privacy-preserving summary table returned by each data partner (no patient rows cross the firewall). events_study = pancreatitis hospitalizations in GLP-1 arm; PY_study = patient-years in GLP-1 arm; events_comp = hospitalizations in sulfonylurea arm; PY_comp = patient-years in sulfonylurea arm.",
        "columns": [
          "site",
          "events_study",
          "PY_study",
          "events_comp",
          "PY_comp"
        ],
        "rows": [
          [
            "Site A (commercial claims)",
            3,
            600,
            6,
            600
          ],
          [
            "Site B (Medicare FFS)",
            5,
            800,
            8,
            800
          ],
          [
            "Site C (integrated-delivery EHR)",
            6,
            1000,
            8,
            1000
          ]
        ]
      },
      "steps": [
        "For each site, compute the incidence rate ratio (IRR): divide the GLP-1 event rate (events_study / PY_study) by the sulfonylurea event rate (events_comp / PY_comp).",
        "Site A IRR = (3 / 600) / (6 / 600) = 0.005 / 0.010 = 0.500 — GLP-1 users had half the pancreatitis rate of sulfonylurea users at this site.",
        "Site B IRR = (5 / 800) / (8 / 800) = 0.00625 / 0.01000 = 0.625 — GLP-1 users had 62.5% of the sulfonylurea rate at this site.",
        "Site C IRR = (6 / 1000) / (8 / 1000) = 0.006 / 0.008 = 0.750 — GLP-1 users had 75% of the sulfonylurea rate at this site.",
        "Pool the three site-specific IRRs by taking a simple average: (0.500 + 0.625 + 0.750) / 3 = 1.875 / 3 = 0.625.",
        "Note the spread across sites: IRRs range from 0.500 to 0.750, which signals real heterogeneity. Before releasing the pooled number, the coordinating center investigates whether one site captures inpatient events differently or serves a different age mix."
      ],
      "result": "Pooled IRR = (0.500 + 0.625 + 0.750) / 3 = 1.875 / 3 = 0.625. Across all three data partners, GLP-1 users had approximately 37.5% fewer pancreatitis hospitalizations per patient-year than sulfonylurea users (IRR 0.625). The finding is consistent in direction across all three sites, which strengthens confidence in the signal. The range of site-specific estimates (0.500 to 0.750) is worth reporting so readers can judge how much the result varies by data source."
    },
    "prerequisites": [
      "new-user-design",
      "cohort-retrospective",
      "meta-analysis-obs"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Common-data-model federated analysis (OMOP / Sentinel SCDM)",
        "description": "Each site maps its data to a shared CDM and runs byte-identical code; only aggregate results return. This is the OHDSI and Sentinel operating model and maximizes reproducibility and code reuse.",
        "edge_cases": [
          "Vocabulary mapping (e.g., source codes to OMOP standard concepts) is lossy and varies in completeness by site; audit concept-set coverage per partner before trusting cross-site comparability.",
          "A CDM enforces a schema but not data capture; a field that is structurally present can still be empty or differentially recorded at some sites."
        ],
        "data_source_notes": "claims: drug_exposure/condition_occurrence from adjudicated claims; EHR: same tables but populated from orders/administrations — semantically different despite identical schema."
      },
      {
        "name": "Distributed risk-set / PS-stratum sharing (privacy-preserving stratified analysis)",
        "description": "Sites share stratified event/person-time tables (by PS stratum and arm) rather than patient rows, enabling exact stratified or conditional analyses centrally without IPD (Toh approach).",
        "edge_cases": [
          "Small cells risk re-identification; sites apply cell-suppression or minimum-count thresholds that can bias sparse strata.",
          "Requires the PS model and covariate set to be specified identically at every site."
        ],
        "data_source_notes": "Each site fits its own PS on locally available covariates; sites with richer data (EHR labs) may have different PS specifications, which must be reconciled or modeled as site-specific."
      },
      {
        "name": "Centralized pooled individual-level analysis under a data-use agreement",
        "description": "Patient-level extracts are physically combined in one secure environment. Maximizes analytic flexibility (any individual-level model) at the cost of governance burden and reduced privacy.",
        "edge_cases": [
          "Legally and operationally hard; many partners (and most international networks) will not permit it.",
          "Even when permitted, residual cross-site measurement heterogeneity remains and should be modeled (site fixed effects or stratification), not assumed away by pooling."
        ],
        "data_source_notes": "Useful when an estimator (e.g., individual-level g-methods) cannot be assembled from aggregates."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Single-database study",
        "pros_of_this": "Power for rare exposures/outcomes, broader generalizability, and the ability to measure and report cross-source heterogeneity rather than assume a single source is representative.",
        "cons_of_this": "Large operational overhead (governance, CDM mapping, code distribution, site QC) and slower timelines.",
        "when_to_prefer": "When one database lacks power or external validity, or when reproducibility across sources is itself the evidentiary claim (e.g., regulatory safety surveillance)."
      },
      {
        "compared_to": "Centralized pooled individual-level (IPD) analysis",
        "pros_of_this": "Preserves patient privacy and partner autonomy; avoids the legal/ethical barriers to moving PHI; feasible across jurisdictions.",
        "cons_of_this": "Limited to estimators expressible from site-level aggregates; some flexible individual-level models are impossible without IPD.",
        "when_to_prefer": "Whenever data-use agreements forbid pooling, which is the default in real networks."
      },
      {
        "compared_to": "Meta-analysis of separately published observational studies",
        "pros_of_this": "One common protocol, code list, and outcome definition everywhere eliminates between-study methodological heterogeneity and publication bias.",
        "cons_of_this": "Requires a live governed network and active data access rather than a desk review.",
        "when_to_prefer": "When you control the analysis across the contributing data sources."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Exposure = pharmacy claim (ndc + fill_date + days_supply); require continuous medical + pharmacy enrollment across washout and follow-up; exclude Medicare Advantage-only person-time where FFS claims are absent. Each site runs the same code and returns only aggregates.",
      "ehr": "Initiation = order/administration, not dispensing; inpatient capture and labs are an advantage but break comparability with claims-only partners. Model site as a fixed effect or stratify so each site's estimate uses only its internally consistent data.",
      "registry": "Strong for indication/severity/adjudicated outcomes, weak for complete drug exposure; use for the outcome layer and link to claims for exposure and a death index for censoring.",
      "linked": "Richest substrate but introduces linkage selection and order/fill/service-date discrepancies; design the common protocol to the weakest site and run richer sensitivity analyses only where data support them."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\nimport numpy as np\n\nWASHOUT_DAYS = 365   # drug-free + continuous-enrollment lookback defining a new user\nGRACE_DAYS   = 30    # as-treated grace period after last days_supply\nMIN_CELL     = 11    # suppress small cells before sharing (re-identification guard)\n\ndef site_summary(rx: pd.DataFrame, enroll: pd.DataFrame,\n                 events: pd.DataFrame, cov: pd.DataFrame, data_partner_id: str) -> pd.DataFrame:\n    rx = rx.sort_values([\"person_id\", \"fill_date\"])\n    study = rx[rx[\"drug_class\"].isin([\"STUDY\", \"COMPARATOR\"])]\n\n    # New-user index: first qualifying fill; arm = drug_class dispensed that day.\n    idx = (study.groupby(\"person_id\").first().reset_index()\n                .rename(columns={\"fill_date\": \"index_date\", \"drug_class\": \"arm\"}))\n\n    # Washout: drop anyone with a prior study/comparator fill in the 365 days before index.\n    prior = study.merge(idx[[\"person_id\", \"index_date\"]], on=\"person_id\")\n    bad = prior[(prior[\"fill_date\"] < prior[\"index_date\"]) &\n                (prior[\"fill_date\"] >= prior[\"index_date\"] - pd.Timedelta(days=WASHOUT_DAYS))][\"person_id\"]\n    idx = idx[~idx[\"person_id\"].isin(bad)].copy()\n\n    # Continuous, FFS-observable enrollment across washout through index (no MA-only person-time).\n    e = enroll.merge(idx[[\"person_id\", \"index_date\"]], on=\"person_id\")\n    e[\"ok\"] = ((e[\"enroll_start\"] <= e[\"index_date\"] - pd.Timedelta(days=WASHOUT_DAYS)) &\n               (e[\"enroll_end\"]   >= e[\"index_date\"]) & (~e[\"ma_only\"]))\n    idx = idx[idx[\"person_id\"].isin(e.loc[e[\"ok\"], \"person_id\"])].copy()\n\n    # As-treated exit: min(last days_supply end + grace, end of enrollment).\n    last_supply = (study.merge(idx[[\"person_id\"]], on=\"person_id\")\n                        .assign(supply_end=lambda d: d[\"fill_date\"] + pd.to_timedelta(d[\"days_supply\"], \"D\"))\n                        
.groupby(\"person_id\")[\"supply_end\"].max())\n    enr_end = enroll.groupby(\"person_id\")[\"enroll_end\"].max()\n    c = idx.merge(last_supply.rename(\"supply_end\"), on=\"person_id\") \\\n           .merge(enr_end.rename(\"enr_end\"), on=\"person_id\") \\\n           .merge(events.groupby(\"person_id\")[\"event_date\"].min().rename(\"event_date\"),\n                  on=\"person_id\", how=\"left\") \\\n           .merge(cov[[\"person_id\", \"ps_stratum\"]], on=\"person_id\", how=\"left\")\n    c[\"tx_exit\"]  = (c[\"supply_end\"] + pd.Timedelta(days=GRACE_DAYS)).clip(upper=c[\"enr_end\"])\n    c[\"exit\"]     = c[[\"tx_exit\", \"enr_end\"]].min(axis=1)\n    had_event     = c[\"event_date\"].notna() & (c[\"event_date\"] <= c[\"exit\"])\n    c[\"exit\"]     = np.where(had_event, c[\"event_date\"], c[\"exit\"])\n    c[\"event\"]    = had_event.astype(int)\n    c[\"pt_days\"]  = (pd.to_datetime(c[\"exit\"]) - c[\"index_date\"]).dt.days.clip(lower=0)\n\n    # Aggregate to PS-stratum x arm; this is the ONLY thing that leaves the site.\n    agg = (c.groupby([\"ps_stratum\", \"arm\"])\n             .agg(events=(\"event\", \"sum\"), person_years=(\"pt_days\", lambda s: s.sum() / 365.25),\n                  n=(\"person_id\", \"size\")).reset_index())\n    agg[\"data_partner_id\"] = data_partner_id\n    agg.loc[agg[\"n\"] < MIN_CELL, [\"events\", \"person_years\", \"n\"]] = np.nan  # cell suppression\n    return agg",
        "description": "Site-level distributed analysis for a multi-database study. STEP 1 (this code) runs identically at EACH data\npartner and returns ONLY a privacy-preserving aggregate (events + person-time by arm and PS stratum) — no\npatient-level rows leave the site. Required local inputs (post-CDM/data-management, identical schema everywhere):\n  rx     : person_id, fill_date (datetime), drug_class in {'STUDY','COMPARATOR'}, days_supply, ndc\n  enroll : person_id, enroll_start, enroll_end, ma_only (bool)   # ma_only person-time lacks FFS claims\n  events : person_id, event_date (datetime)                      # first validated outcome occurrence\n  cov    : person_id, ps_stratum (int 0..K-1)                    # PS stratum from a LOCALLY fitted, commonly\n                                                                 # specified propensity model (baseline window only)\nReturns a small table the coordinating center can meta-analyze; cells below MIN_CELL are suppressed for privacy.",
        "dependencies": [
          "pandas",
          "numpy"
        ],
        "source_citations": [
          "toh-2013-pds"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\nlibrary(metafor)\n\npool_network <- function(sites) {\n  setDT(sites)\n  sites <- sites[!is.na(events) & !is.na(person_years)]   # drop suppressed cells\n\n  # Site-specific log-IRR via PS-stratified Mantel-Haenszel (study vs comparator).\n  w <- dcast(sites, data_partner_id + ps_stratum ~ arm,\n             value.var = c(\"events\", \"person_years\"))\n  setnames(w, c(\"events_STUDY\",\"events_COMPARATOR\",\"person_years_STUDY\",\"person_years_COMPARATOR\"),\n           c(\"e1\",\"e0\",\"pt1\",\"pt0\"))\n  w <- w[pt1 > 0 & pt0 > 0]\n  site <- w[, {\n    tot <- e1 + e0; pt <- pt1 + pt0                       # stratum totals\n    num <- sum(e1 * pt0 / pt); den <- sum(e0 * pt1 / pt)  # MH rate-ratio components\n    v   <- sum((e1 + e0) * pt1 * pt0 / pt^2) / (num * den)\n    .(logirr = log(num / den), se = sqrt(v))\n  }, by = data_partner_id]\n\n  # Random-effects (DerSimonian-Laird) pooling across sites + heterogeneity.\n  m <- rma(yi = site$logirr, sei = site$se, method = \"DL\")\n  list(IRR = exp(m$beta[[1]]),\n       CI  = exp(c(m$ci.lb, m$ci.ub)),\n       I2  = m$I2, tau2 = m$tau2,\n       per_site = site[, .(data_partner_id, IRR = exp(logirr))])\n}",
        "description": "Coordinating-center STEP 2: combine the privacy-preserving site summaries (output of the Python/R site step) into a\nrandom-effects pooled incidence-rate ratio with between-site heterogeneity (I-squared). Input `sites` is the stacked\nsite-stratum-arm table: data_partner_id, ps_stratum, arm in {'STUDY','COMPARATOR'}, events, person_years.\nWithin each site, sum the PS-stratum-specific rates to a Mantel-Haenszel-style site IRR, then random-effects pool\nacross sites. No patient-level data is involved at any point.",
        "dependencies": [
          "data.table",
          "metafor"
        ],
        "source_citations": [
          "platt-2020"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "%let washout = 365;\n%let grace   = 30;\n%let mincell = 11;   /* suppress small cells before sharing */\n\n/* --- STEP 1 (runs at EACH data partner) --- */\n/* First qualifying fill = time zero; arm = drug_class on that fill. */\nproc sort data=work.rx(where=(drug_class in ('STUDY','COMPARATOR'))) out=rx_q;\n  by person_id fill_date;\nrun;\n\ndata idx;\n  set rx_q;\n  by person_id;\n  if first.person_id;\n  index_date = fill_date;\n  format index_date date9.;\n  length arm $12;\n  arm = drug_class;\n  keep person_id index_date arm;\nrun;\n\n/* New-user washout + continuous FFS-observable enrollment (no MA-only person-time). */\nproc sql;\n  create table cohort as\n  select i.person_id, i.arm, i.index_date,\n         max(e.enroll_end) as enr_end format=date9.\n  from idx i\n  inner join work.enroll e\n    on e.person_id = i.person_id and e.ma_only = 0\n   and e.enroll_start <= i.index_date - &washout and e.enroll_end >= i.index_date\n  where not exists (select 1 from work.rx p\n                    where p.person_id = i.person_id and p.drug_class in ('STUDY','COMPARATOR')\n                      and p.fill_date <  i.index_date\n                      and p.fill_date >= i.index_date - &washout)\n  group by i.person_id, i.arm, i.index_date;\nquit;\n\n/* As-treated exit, event flag, person-time, then aggregate to ps_stratum x arm. 
*/\nproc sql;\n  /* Inner query: censoring date (treatment end + grace, capped at enrollment end) and first event. */\n  create table perpat as\n  select q.person_id, q.arm, q.ps_stratum,\n         min(q.censor_date, coalesce(q.first_event, q.censor_date)) as exit_date format=date9.,\n         /* SAS treats a missing value as smaller than any number, so test for a non-missing event explicitly. */\n         (case when q.first_event is not null and q.first_event <= q.censor_date\n               then 1 else 0 end) as event\n  from (select c.person_id, c.arm, v.ps_stratum,\n               min( (select max(r2.fill_date + r2.days_supply) from work.rx r2\n                       where r2.person_id = c.person_id and r2.drug_class in ('STUDY','COMPARATOR')) + &grace,\n                    c.enr_end ) as censor_date,\n               (select min(ev.event_date) from work.events ev\n                  where ev.person_id = c.person_id) as first_event\n        from cohort c left join work.cov v on v.person_id = c.person_id) q;\n  create table site_summary as\n  select \"PARTNER_A\" as data_partner_id length=12, ps_stratum, arm,\n         sum(event) as events,\n         sum( (exit_date - index_date) ) / 365.25 as person_years,\n         count(*) as n\n  from (select p.*, c.index_date from perpat p inner join cohort c on c.person_id = p.person_id)\n  group by ps_stratum, arm\n  having calculated n >= &mincell;   /* cell suppression */\nquit;\n\n/* --- STEP 2 (coordinating center) on stacked work.allsites = site_summary from every partner --- */\n/* The offset must exist as a variable: log person-years per site x stratum x arm cell. */\ndata work.allsites_py;\n  set work.allsites;\n  if person_years > 0;\n  log_py = log(person_years);\nrun;\n\n/* Poisson rate model: rate ratio for STUDY vs COMPARATOR, site fixed effect, PS-stratum adjustment. */\nproc genmod data=work.allsites_py;\n  class arm(ref='COMPARATOR') data_partner_id ps_stratum;\n  model events = arm data_partner_id ps_stratum / dist=poisson link=log offset=log_py;\n  estimate 'log RR (STUDY vs COMPARATOR)' arm 1 -1 / exp;   /* exp() = pooled rate ratio */\nrun;",
        "description": "SAS distributed-network analysis. STEP 1 (PROC SQL) builds the same site-level new-user cohort and the\nprivacy-preserving PS-stratum x arm summary that each data partner returns. STEP 2 (PROC GENMOD) at the\ncoordinating center fits a Poisson model on the stacked site summaries with a log person-time offset and a\ndata_partner_id (site) fixed effect to obtain a heterogeneity-aware pooled rate ratio. Required local inputs:\n  work.rx     : person_id, fill_date, drug_class ('STUDY'/'COMPARATOR'), days_supply, ndc\n  work.enroll : person_id, enroll_start, enroll_end, ma_only (0/1)\n  work.events : person_id, event_date\n  work.cov    : person_id, ps_stratum   /* from a locally fitted, commonly specified PS model */",
        "dependencies": [],
        "source_citations": [
          "toh-2013-mlr"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Q[Common protocol + shared code list<br/>one operational definition] --> CDM[Map each source to shared CDM<br/>OMOP / Sentinel SCDM]\n  CDM --> S1[Site 1: commercial claims]\n  CDM --> S2[Site 2: Medicare FFS]\n  CDM --> S3[Site 3: integrated-delivery EHR]\n  CDM --> S4[Site 4: registry-linked claims]\n  S1 --> A1[Run identical code locally<br/>build cohort + site PS]\n  S2 --> A2[Run identical code locally]\n  S3 --> A3[Run identical code locally]\n  S4 --> A4[Run identical code locally]\n  A1 --> Agg[Return ONLY aggregates<br/>PS-stratum x arm events + person-time]\n  A2 --> Agg\n  A3 --> Agg\n  A4 --> Agg\n  Agg --> Meta[Coordinating center:<br/>random-effects meta-analysis]\n  Meta --> Het{High I-squared?}\n  Het -->|Yes| Inv[Investigate measurement /<br/>formulary / capture heterogeneity]\n  Het -->|No| Out[Pooled estimate + forest plot]",
        "caption": "Distributed-network workflow. PHI never leaves a data partner; only privacy-preserving site-level summaries are combined, and between-site heterogeneity is interrogated before any pooled estimate is released.",
        "alt_text": "Flowchart from a common protocol and shared CDM mapping, to four data partners running identical code locally and returning aggregates, to random-effects meta-analysis with a heterogeneity check.",
        "source_type": "illustrative",
        "source_citations": [
          "toh-2013-mlr"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  subgraph Site[Inside each data partner's firewall]\n    PHI[Patient-level records<br/>person_id, fill_date, dx] --> Cohort[New-user cohort + local PS]\n    Cohort --> StratTab[PS-stratum x arm table<br/>events, person-time, n]\n    StratTab --> Supp[Cell suppression n<11]\n  end\n  Supp -->|aggregate only| Center[Coordinating center]\n  Center --> Pool[Stratified MH + random-effects pool]\nstyle PHI fill:#fde2e2\nstyle Supp fill:#e2f0fd",
        "caption": "What crosses the firewall. Individual records and the local propensity model stay at the site; only a suppressed, PS-stratified aggregate table is transmitted, enabling confounding-adjusted pooling without IPD.",
        "alt_text": "Diagram showing patient-level data and the propensity model remaining inside a data partner's firewall while only a cell-suppressed propensity-score-stratified aggregate table is sent to the coordinating center.",
        "source_type": "illustrative",
        "source_citations": [
          "toh-2013-pds"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "used_with",
        "target_slug": "omop-cdm-method-patterns-rwe",
        "notes": "A common data model (OMOP CDM) is the usual substrate that lets identical code run at every data partner in a distributed network."
      },
      {
        "relation_type": "used_with",
        "target_slug": "meta-analysis-obs",
        "notes": "Site-specific estimates from a distributed network are combined by (usually random-effects) meta-analysis, with between-site heterogeneity reported and investigated."
      },
      {
        "relation_type": "used_with",
        "target_slug": "high-dimensional-propensity-score-hdps-rwe",
        "notes": "Each site fits its own propensity (often high-dimensional) score on locally available covariates and shares only stratum-level summaries, enabling distributed confounding adjustment without sharing PHI."
      },
      {
        "relation_type": "used_with",
        "target_slug": "propensity-score-methods-psm-iptw",
        "notes": "PS stratification/weighting is performed locally at each site; risk-set or stratum-level results are then pooled centrally."
      },
      {
        "relation_type": "part_of",
        "target_slug": "comparative-effectiveness-research-cer-methods",
        "notes": "Multi-database distributed studies are a core delivery mechanism for adequately powered, generalizable comparative-effectiveness and safety research."
      },
      {
        "relation_type": "see_also",
        "target_slug": "signal-detection",
        "notes": "Active drug-safety surveillance systems (e.g., FDA Sentinel) run sequential signal detection over a distributed network."
      },
      {
        "relation_type": "see_also",
        "target_slug": "medicare-ffs-ma-commercial-claims-differences-rwe",
        "notes": "Networks mixing commercial, Medicare FFS, and MA sources must handle differential data capture (especially MA-only person-time lacking FFS claims) to keep exposure/outcome definitions comparable."
      },
      {
        "relation_type": "see_also",
        "target_slug": "generalizability-transportability-external-validity-rwe",
        "notes": "Replicating an effect across heterogeneous data sources is the network's main external-validity argument; transportability framing explains when site-specific estimates can be combined."
      },
      {
        "relation_type": "see_also",
        "target_slug": "linked-data",
        "notes": "Linked claims-EHR-vital-records partners are the richest network nodes but introduce linkage selection and date-discrepancy issues."
      }
    ],
    "aliases": [
      "multi-database study",
      "distributed network study",
      "distributed data network",
      "distributed research network",
      "federated analysis",
      "multi-site RWE study"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "multi-state-models-rwe",
    "name": "Multi-State Models",
    "short_definition": "A framework that represents a patient's disease history as movement between a finite set of states (e.g., Stable, Progressed, Dead) connected by allowed transitions, estimates the transition intensities (cause-specific hazards) that govern each move, and assembles them with the Aalen-Johansen estimator into transition probabilities - the chance of occupying each state at a future time given the state occupied now - thereby generalizing standard survival analysis and competing risks to repeated, intermediate, and possibly reversible events.",
    "long_description": "Standard survival analysis asks one question - time to a single event - and competing risks extends it to\n\"time to the first of several mutually exclusive events.\" A **multi-state model** is the natural generalization\nof both: it represents a patient's history as a token sitting in one of a **finite set of states** and moving\nalong **allowed transitions** between them over time. Cancer patients move Stable -> Progressed -> Dead;\ntransplant patients move Transplanted -> Graft-failure -> Dead, or back to Re-transplant; HIV patients move\nacross CD4 strata. Each arrow in the state diagram is governed by a **transition intensity** (a hazard for that\nspecific move, possibly depending on time and covariates), and the whole object answers questions that a single\nCox model cannot: *what is the probability a patient who is progression-free today will be alive-with-progression\nin two years?* That quantity - **state occupancy / transition probability** - is the deliverable, and it is\nexactly what health-technology-assessment cost-effectiveness models consume.\n\n**The machinery, in order.** (1) **States and transitions.** Declare the states and the transition matrix (which\nmoves are allowed). States are *transient* (you can leave) or *absorbing* (death - you cannot leave). (2)\n**Transition intensities.** For each allowed transition s -> s', the instantaneous rate of that move among those\ncurrently in state s. These are estimated transition-by-transition: nonparametrically (Nelson-Aalen cumulative\nhazards per transition) or with a **transition-specific Cox model** (stratify the baseline hazard by transition;\nlet covariate effects differ by transition). (3) **Transition probabilities via Aalen-Johansen.** The\nintensities are assembled into the K x K matrix P(s, t) of probabilities of moving from each state at time s to\neach state at time t. 
**Aalen-Johansen** is the estimator: it is a product-integral - a matrix product, over all\nevent times u in (s, t], of (I + dA(u)) where dA(u) holds the estimated transition increments at u. With no\ncensoring it reduces to the empirical fraction of the cohort occupying each state; its real value is that it\ncorrectly **reweights the at-risk sets** under right-censoring, which a naive state-occupancy tally does not.\nCompeting risks (cumulative incidence functions) is the **special case** with one transient starting state and\nseveral absorbing states and no onward moves - Aalen-Johansen there is exactly the cumulative-incidence\nestimator.\n\n**The two assumptions that define your model.** *Markov vs semi-Markov.* In a **Markov** model the transition\nintensity out of a state depends only on the current state and the time *since study origin* (clock-forward) -\nnot on how long the patient has been in the current state, nor how they got there. In a **semi-Markov**\n(clock-reset) model the intensity depends on the time *since entering the current state* (the clock resets to\nzero at each transition). For a Progressed -> Dead transition this is the crux: clock-forward says mortality\nrisk tracks calendar time from diagnosis; clock-reset says it tracks duration of progressed disease (usually the\nmore clinically honest choice). A *homogeneous* Markov model further assumes constant intensities (the engine of\nclassic HTA Markov cohort models with fixed cycle transition probabilities); the nonparametric multi-state model\nhere makes no such constancy assumption.\n\n**Pros, cons, and trade-offs** (specific and comparative).\n- **vs competing-risks-cause-specific-fine-gray-rwe (the special case):** Multi-state strictly contains competing\n  risks - add intermediate states and onward transitions and you have it. 
**Prefer the competing-risks framing**\n  when every event is terminal and you only need cumulative incidence of first events; **prefer the full\n  multi-state model** the moment an *intermediate, non-absorbing* event (progression, relapse, hospitalization,\n  graft failure) matters and you want the probability of *currently being in* that intermediate state, or the\n  effect of passing through it on downstream risk. A Fine-Gray subdistribution model gives you one cumulative\n  incidence curve; it cannot tell you how many patients are *alive with progression right now*.\n- **vs partitioned-survival-models-rwe (the HTA debate):** Partitioned survival (PartSA) reconstructs state\n  occupancy by *subtracting* independently fitted overall-survival and progression-free-survival curves -\n  \"alive-and-progressed\" = OS minus PFS. It is simple and standard in oncology dossiers but **structurally\n  incoherent**: the three state-membership curves are not constrained to come from a consistent transition\n  process, so the implied post-progression survival is an output, not a modeled quantity, and can behave\n  implausibly in extrapolation. A multi-state (state-transition) model fits the Progressed -> Dead transition\n  *directly*, so post-progression survival is governed by an estimated hazard and the curves are internally\n  consistent by construction. **Prefer PartSA** only when data are too thin to estimate the intermediate\n  transition or when a regulator/HTA precedent demands it; **prefer the multi-state model** whenever\n  post-progression survival, treatment switching at progression, or coherent long-term extrapolation drives the\n  cost-effectiveness result (NICE TSD 19 makes exactly this recommendation).\n- **vs a single Cox model on a composite or first event (cox-ph-regression):** One Cox model collapses a rich\n  history into a single time-to-event and throws away the intermediate structure. 
The multi-state model keeps it,\n  at the cost of estimating several transition-specific hazards (more parameters, sparser at-risk sets on the\n  later transitions) and of having to *choose and defend* the Markov/semi-Markov and clock assumptions.\n\n**When to use.** Any question where intermediate, non-terminal events change later risk or are themselves of\ninterest: oncology Stable -> Progressed -> Dead with post-progression survival driving cost-effectiveness;\ntransplant and graft-failure histories; multi-morbidity accrual; recovery-relapse cycles; predicting an\nindividual's probability of occupying each state at a horizon for prognostication; and building HTA economic\nmodels on a coherent state-transition structure rather than subtracted survival curves.\n\n**When NOT to use - and when it is actively misleading.**\n- **The illness-death model with an unverified Markov assumption.** The canonical three-state structure (Healthy\n  -> Ill -> Dead, with a direct Healthy -> Dead arrow) is the workhorse, but the Ill -> Dead intensity is almost\n  never truly clock-forward Markov - mortality after progression depends on *duration* of progressed disease, not\n  calendar time from origin. Fitting a Markov Ill -> Dead transition when the process is semi-Markov biases the\n  transition probabilities and the extrapolated post-progression survival. Test it (e.g., add time-in-state as a\n  covariate) or fit clock-reset.\n- **Sparse late transitions.** The onward transitions (Progressed -> Dead) are estimated only among the subset\n  who reached the intermediate state, so the at-risk set is small and the hazard is noisy. 
Over-parameterizing\n  transition-specific covariate effects on a thin transition produces unstable estimates; consider sharing\n  baseline hazards or covariate effects across transitions when justified.\n- **Treating Aalen-Johansen state occupancy as if it adjusted for confounding.** Aalen-Johansen is a descriptive\n  / prognostic estimator of transition probabilities in the observed cohort. Comparing arms by their multi-state\n  curves is **not** a causal contrast unless you have handled confounding (e.g., weight the transitions, or embed\n  the multi-state model in a target-trial emulation). A naive arm comparison of state occupancy inherits all the\n  usual confounding-by-indication of observational data.\n- **Interval-censored state transitions.** When the intermediate state is only observed at scheduled visits (you\n  see that progression happened *between* two visits, not when), the exact-transition-time machinery of\n  Aalen-Johansen does not apply; a panel/Markov interval-censored model (e.g., a continuous-time Markov chain\n  fit to panel data) is required instead.\n\n**Data-source operational depth.**\n- **Claims:** States must be *constructed* from codes - \"Progressed\" has no field; it is proxied by a regimen\n  switch, a new line of therapy, a radiotherapy/secondary-malignancy code, or hospice enrollment, and \"Dead\"\n  needs a death source (inpatient discharge disposition, linked death index) because pharmacy/medical claims do\n  not record outpatient death. 
Transition *times* are interval-flavored (you see the claim, not the clinical\n  event), so the apparent transition date is a coding date, not the biological one.\n- **EHR:** Richer state definition (labs, vitals, problem list, oncology flowsheets, RECIST in structured or\n  note-derived fields) and finer transition timing, but death and out-of-system transitions are missed when the\n  patient leaves the network - that informative loss-to-follow-up censors the later transitions\n  non-randomly.\n- **Registry:** Disease registries (cancer, transplant) are the cleanest substrate - states and transition dates\n  are adjudicated prospectively, which is why the illness-death and competing-risks literature grew up on them -\n  but cause-of-death and post-exit transitions may still need linkage.\n- **Linked claims-EHR-registry:** The ideal: registry/EHR for adjudicated state definitions and transition\n  timing, claims for treatment-switch proxies and continuity, a death index for the absorbing state. Reconcile\n  the differing transition dates across sources before building the long-format risk sets.\n\n**Interpreting the output**\n\nAn Aalen-Johansen multi-state model (stable → progressed → dead) returns at year 2: P(Stable) = 0.73, P(Progressed) = 0.15, P(Dead) = 0.12 (sum = 1.00).\n\n*Formal interpretation.* Each value is a state-occupancy probability estimated via the Aalen-Johansen product-integral, which correctly accounts for competing transitions: a patient who dies cannot progress, and a patient who progresses cannot return to stable in this illness-death model. P(Progressed) = 0.15 is the probability of occupying the intermediate state at exactly year 2 — distinct from the cumulative probability of ever progressing, which is higher because some who progressed have already died by then. 
Under the Markov assumption, transition intensities at time t depend only on the current state, not on time spent in that state; if sojourn-time dependence is plausible, a semi-Markov (clock-reset) model is preferred.\n\n*Practical interpretation.* A standard survival curve tells you only who is alive; the multi-state model also tells you which state the living patients occupy. The 15% occupying the progressed state at year 2 carry direct cost and quality-of-life implications: they consume disease-management resources and face reduced utility. Health economic models that compute QALYs from state utilities require exactly these state-occupancy probabilities over time, making multi-state analysis the natural bridge between clinical endpoints and economic inputs in HEOR submissions.",
    "primary_category": "Inferential_Statistics",
    "tags": [
      "multi-state-model",
      "transition-intensity",
      "transition-probability",
      "aalen-johansen",
      "illness-death-model",
      "markov",
      "semi-markov",
      "clock-forward",
      "clock-reset",
      "competing-risks",
      "state-transition-model",
      "partitioned-survival",
      "registry"
    ],
    "applies_to_study_types": [
      "cohort_retrospective",
      "cohort_prospective",
      "claims_analysis",
      "ehr_study",
      "disease_registry",
      "comparative_effectiveness"
    ],
    "data_sources": [],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1002/sim.2712",
        "url": "https://doi.org/10.1002/sim.2712",
        "citation_text": "Putter H, Fiocco M, Geskus RB. Tutorial in biostatistics: competing risks and multi-state models. Statistics in Medicine. 2007;26(11):2389-2430.",
        "year": 2007,
        "authors_short": "Putter et al.",
        "notes": "The standard applied tutorial defining states, transition intensities, the Markov vs semi-Markov (clock-forward vs clock-reset) distinction, and Aalen-Johansen estimation of transition probabilities, with competing risks shown as the special case - the conceptual spine of this entry."
      },
      {
        "role": "explain",
        "doi": "10.1191/0962280202SM276ra",
        "url": "https://doi.org/10.1191/0962280202SM276ra",
        "citation_text": "Andersen PK, Keiding N. Multi-state models for event history analysis. Statistical Methods in Medical Research. 2002;11(2):91-115.",
        "year": 2002,
        "authors_short": "Andersen & Keiding",
        "notes": "Foundational review grounding multi-state models in counting-process theory - transition intensities, the product-integral, and the conditions (notably the Markov property) under which Aalen-Johansen consistently estimates transition probabilities."
      },
      {
        "role": "demonstrate",
        "doi": "10.18637/jss.v038.i07",
        "url": "https://doi.org/10.18637/jss.v038.i07",
        "citation_text": "de Wreede LC, Fiocco M, Putter H. mstate: an R package for the analysis of competing risks and multi-state models. Journal of Statistical Software. 2011;38(7):1-30.",
        "year": 2011,
        "authors_short": "de Wreede et al.",
        "notes": "The reference implementation (msprep, transMat, msfit, probtrans) showing exactly how wide patient data are expanded to long counting-process form, transition-specific hazards are fit, and Aalen-Johansen transition probabilities are computed - the basis for the R code in this entry."
      }
    ],
    "plain_language_summary": "A multi-state model tracks patients as they move between a small set of health states over time - for example Stable, Progressed, and Dead - and estimates the chance of being in each state at any future time. Instead of studying a single yes/no outcome, you study every allowed move (a transition) and the rate at which each move happens. From those rates, the Aalen-Johansen estimator builds the probability that a patient who starts Stable is Stable, Progressed, or Dead one or two years later. Competing risks is just the special case where the only moves are out of the starting state into a few final (absorbing) states.\n",
    "key_terms": [
      {
        "term": "state",
        "definition": "One of the finite conditions a patient can be in at a point in time (e.g., Stable, Progressed, Dead); the model tracks which state each patient occupies."
      },
      {
        "term": "transition",
        "definition": "An allowed move from one state to another (e.g., Stable to Progressed); arrows in the state diagram."
      },
      {
        "term": "transition intensity",
        "definition": "The instantaneous rate of a specific move among patients currently in the starting state - a hazard for that one arrow."
      },
      {
        "term": "Aalen-Johansen estimator",
        "definition": "The method that combines all the transition intensities into the probability of being in each state at a future time, correctly accounting for patients still being followed."
      },
      {
        "term": "absorbing state",
        "definition": "A state you cannot leave once you enter it - death is the usual example."
      },
      {
        "term": "Markov vs semi-Markov",
        "definition": "Whether the rate of leaving a state depends on time since the study began (Markov, clock-forward) or time since entering the current state (semi-Markov, clock-reset)."
      }
    ],
    "worked_example": {
      "scenario": "A 100-patient oncology cohort starts in the Stable state and can move to Progressed or directly to Dead, and from Progressed to Dead - the illness-death model. We observe everyone for two years with no dropout, so the at-risk count in each state is just a headcount. We tabulate who moves at year 1 and year 2, then use the Aalen-Johansen idea (apply each year's transition fractions to the people at risk) to get the probability of occupying each state two years after a Stable start.\n",
      "dataset": {
        "caption": "The at-risk counts and transitions at each event time for the 100-patient illness-death cohort (no dropout) - the table you estimate transition probabilities from.",
        "columns": [
          "event_year",
          "from_state",
          "n_at_risk",
          "n_transitions",
          "to_state"
        ],
        "rows": [
          [
            1,
            "Stable",
            100,
            10,
            "Progressed"
          ],
          [
            1,
            "Stable",
            100,
            5,
            "Dead"
          ],
          [
            2,
            "Stable",
            85,
            8,
            "Progressed"
          ],
          [
            2,
            "Stable",
            85,
            4,
            "Dead"
          ],
          [
            2,
            "Progressed",
            10,
            3,
            "Dead"
          ]
        ]
      },
      "steps": [
        "Three states (Stable, Progressed, Dead) with transitions Stable to Progressed, Stable to Dead, and Progressed to Dead - the illness-death model. All 100 patients begin Stable; with no dropout, at-risk equals headcount.",
        "Year 1 transition fractions out of Stable - to Progressed = 10/100 = 0.10, to Dead = 5/100 = 0.05; staying Stable = 1 - 0.10 - 0.05 = 0.85. Applying them, the cohort splits into 85 Stable, 10 Progressed, 5 Dead.",
        "Year 2 fractions - from Stable to Progressed = 8/85 = 0.094, to Dead = 4/85 = 0.047; from Progressed to Dead = 3/10 = 0.30.",
        "Apply the year-2 moves to the people at risk - Stable 85 - 8 - 4 = 73; Progressed 10 + 8 - 3 = 15; Dead 5 + 4 + 3 = 12 (still 100 patients total).",
        "Transition probabilities from a Stable start at time 0 to each state at year 2 - P(Stable) = 73/100 = 0.73, P(Progressed) = 15/100 = 0.15, P(Dead) = 12/100 = 0.12, which sum to 0.73 + 0.15 + 0.12 = 1.00."
      ],
      "result": "From a Stable start, the year-2 transition probabilities are 0.73 Stable, 0.15 Progressed, and 0.12 Dead (they sum to 1.00). The 0.15 chance of being alive-with-progression right now is exactly the intermediate-state occupancy a partitioned-survival model cannot read off directly - it falls out of the multi-state structure.\n",
      "timeline_spec": {
        "title": "One patient's path through the illness-death states - Stable to Progressed (year 1) to Dead (year 2.5)",
        "window": {
          "start": "2020-01-01",
          "end": "2022-07-01",
          "label": "Observation: Stable start through the absorbing Dead state"
        },
        "events": [
          {
            "label": "Stable sojourn (state 1)",
            "start": "2020-01-01",
            "length_days": 366,
            "quantity": "progression-free"
          },
          {
            "label": "Progressed sojourn (state 2)",
            "start": "2021-01-01",
            "length_days": 546,
            "quantity": "alive with progression"
          },
          {
            "label": "Death (state 3, absorbing)",
            "start": "2022-07-01",
            "length_days": 1,
            "quantity": "absorbing state"
          }
        ],
        "spans": [
          {
            "kind": "exposed",
            "start": "2020-01-01",
            "end": "2020-12-31",
            "label": "Stable: 1.0 year in state 1"
          },
          {
            "kind": "followup",
            "start": "2021-01-01",
            "end": "2022-07-01",
            "label": "Progressed: ~1.5 years in state 2"
          }
        ],
        "result": {
          "label": "One patient path: Stable to Progressed at year 1, Dead at year 2.5",
          "value": 2.5
        }
      }
    },
    "prerequisites": [
      "cox-ph-regression",
      "competing-risks-cause-specific-fine-gray-rwe",
      "cumulative-incidence-risk-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Illness-death model (three-state, with direct death transition)",
        "description": "The canonical structure - Healthy (state 1) -> Ill/Progressed (state 2) -> Dead (state 3), plus a direct Healthy -> Dead arrow. States 1 and 2 are transient; state 3 is absorbing. The probability of currently occupying the intermediate Ill state, and the effect of passing through it on death, are the targets.",
        "edge_cases": [
          "The Ill -> Dead transition is estimable only among those who reach state 2, so its at-risk set is small and the hazard is noisy late in follow-up.",
          "If progression is observed only at scheduled visits (interval-censored), the exact-time Aalen-Johansen machinery fails and a panel/interval-censored Markov fit is needed."
        ],
        "data_source_notes": "registry: states and transition dates adjudicated prospectively (cleanest). claims: Ill is proxied by a regimen switch / new line of therapy; Dead needs a linked death source."
      },
      {
        "name": "Clock-forward (Markov) time scale",
        "description": "Every transition intensity is indexed by time since study origin; the rate out of a state ignores how long the patient has been in that state or how they arrived. Aalen-Johansen P(s, t) is computed directly on this single clock. The default in msfit/probtrans.",
        "edge_cases": [
          "A Markov Progressed -> Dead transition assumes post-progression mortality depends only on time since study origin, not on time since progression, which is usually clinically wrong; test by adding time-in-state as a covariate.",
          "Homogeneous (constant-intensity) Markov is a further, stronger assumption - the engine of fixed-cycle HTA Markov cohort models, not assumed by the nonparametric estimator."
        ],
        "data_source_notes": "all sources: requires a single well-defined origin (e.g., diagnosis/index date) shared across transitions; misaligned origins break the clock-forward interpretation."
      },
      {
        "name": "Clock-reset (semi-Markov) time scale",
        "description": "The intensity out of a state depends on the time since *entering* that state - the clock restarts to zero at each transition. For Progressed -> Dead this models mortality by duration of progressed disease, usually the more honest assumption.",
        "edge_cases": [
          "Transition probabilities are no longer a simple product on one clock; computation uses the entry-time distribution into each state (semi-Markov Aalen-Johansen or simulation).",
          "Mixing clock-forward for early transitions and clock-reset for later ones must be declared explicitly and is easy to get silently wrong."
        ],
        "data_source_notes": "registry/ehr: needs an accurate state-entry date for each patient to reset the clock; claims coding dates approximate it and add timing error."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "competing-risks-cause-specific-fine-gray-rwe",
        "pros_of_this": "Handles intermediate, non-absorbing states and onward transitions, yielding the probability of currently occupying a transient state (e.g., alive-with-progression) and the effect of passing through it - quantities a competing-risks cumulative-incidence or Fine-Gray model cannot express.",
        "cons_of_this": "More transitions to estimate (more parameters, sparser late at-risk sets) and an explicit Markov vs semi-Markov choice to defend; competing risks needs neither when every event is terminal.",
        "when_to_prefer": "Whenever an intermediate event matters as a state to be in or a gateway to later risk; use competing risks when you only need cumulative incidence of the first, terminal events."
      },
      {
        "compared_to": "partitioned-survival-models-rwe",
        "pros_of_this": "Models the post-progression (Progressed -> Dead) transition directly, so state-membership curves are internally consistent and long-term extrapolation is governed by an estimated hazard rather than the difference of two independently fitted survival curves.",
        "cons_of_this": "Requires enough data to estimate the intermediate transition and a defensible structure; partitioned survival is simpler, needs only OS and PFS curves, and has strong HTA precedent in oncology dossiers.",
        "when_to_prefer": "When post-progression survival, treatment switching at progression, or coherent extrapolation drives the cost-effectiveness result (per NICE TSD 19); fall back to partitioned survival only when the intermediate transition is unestimable or precedent demands it."
      },
      {
        "compared_to": "cox-ph-regression",
        "pros_of_this": "Preserves the full event history as transitions between states instead of collapsing it into a single time-to-first-event, exposing transition-specific covariate effects and state occupancy over time.",
        "cons_of_this": "Several transition-specific models with smaller at-risk sets, plus the time-scale and Markov assumptions, versus a single, familiar, lower-variance Cox fit.",
        "when_to_prefer": "When intermediate states carry information; a single Cox model suffices when only one terminal event is of interest and intermediate structure is irrelevant."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "States have no native fields - construct Progressed/Ill from regimen switches, new lines of therapy, or disease-specific codes, and the absorbing Dead state from a linked death index (claims alone miss deaths that occur outside a billed encounter). Transition dates are coding dates, not biological event dates; treat the timing as interval-flavored and align all transitions to one origin (index date) for a clock-forward model.",
      "ehr": "Richer state definitions from labs, vitals, problem lists, and oncology flowsheets, and finer transition timing, but transitions out of the network (including death) are missed - that informative loss-to-follow-up censors later transitions non-randomly and biases Progressed -> Dead.",
      "registry": "Cleanest substrate - states and transition dates adjudicated prospectively (cancer, transplant registries), which is why the illness-death and competing-risks methods matured here; cause-of-death and post-exit transitions may still need linkage.",
      "linked": "Ideal - registry/EHR for adjudicated state definitions and transition timing, claims for treatment-switch proxies and enrollment continuity, a death index for the absorbing state. Reconcile differing transition dates across sources before building the long-format at-risk rows."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\nimport pandas as pd\n\ndef aalen_johansen(trans: pd.DataFrame, states: list[int], horizon: float):\n    \"\"\"Return P(0, horizon) and the path of P(0, u) at each event time u <= horizon.\"\"\"\n    K = len(states)\n    idx = {s: i for i, s in enumerate(states)}\n    event_times = sorted(trans.loc[trans[\"status\"] == 1, \"t_stop\"].unique())\n\n    P = np.eye(K)\n    path = []\n    for u in event_times:\n        if u > horizon:\n            break\n        dA = np.zeros((K, K))\n        for s in states:\n            i = idx[s]\n            # who is at risk in state s just before time u (unique patients, dedup competing rows)\n            at_risk = ((trans[\"from_state\"] == s) &\n                       (trans[\"t_start\"] < u) & (trans[\"t_stop\"] >= u))\n            n_s = trans.loc[at_risk, \"id\"].nunique()\n            if n_s == 0:\n                continue\n            fired = trans[(trans[\"from_state\"] == s) &\n                          (trans[\"t_stop\"] == u) & (trans[\"status\"] == 1)]\n            for to_s, grp in fired.groupby(\"to_state\"):\n                dA[i, idx[to_s]] += grp[\"id\"].nunique() / n_s\n            dA[i, i] = -dA[i].sum()          # each row of the increment sums to zero\n        P = P @ (np.eye(K) + dA)             # product-integral step\n        path.append((u, P.copy()))\n    return P, path\n\n# P[idx[1]] is then the row of transition probabilities FROM state 1 (Stable) at time 0\n# to each state at the horizon - e.g. P[0] = [P(Stable), P(Progressed), P(Dead)].",
        "description": "Nonparametric Aalen-Johansen estimation of the transition-probability matrix P(0, t) from long-format\ncounting-process data, using only numpy and pandas (no specialist package). Required input (one row per at-risk\ntransition episode; mstate/msprep-style):\n  trans : id, from_state, to_state, t_start, t_stop, status\n          status=1 if the from_state -> to_state move occurred at t_stop, else 0 (censored or a competing move)\nStates are integer-coded 1..K. P(0, t) is the product-integral over event times u<=t of (I + dA(u)), where\ndA(u) holds the per-transition increments d_{ss'}/n_s at u and the diagonal makes each row of dA sum to zero.",
        "dependencies": [
          "numpy",
          "pandas"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(mstate)\nlibrary(survival)\n\n# Illness-death model: 1=Stable, 2=Progressed, 3=Dead (3 is absorbing).\ntmat <- transMat(x = list(c(2, 3), c(3), c()),\n                 names = c(\"Stable\", \"Progressed\", \"Dead\"))\n\n# Expand wide data to long: one row per at-risk transition episode (Tstart, Tstop, status, trans).\nlong <- msprep(time   = c(NA, \"pfs_time\", \"os_time\"),\n               status = c(NA, \"prog\",     \"death\"),\n               data   = wide, trans = tmat, keep = c(\"age\", \"arm\"))\nlong <- expand.covs(long, covs = c(\"age\", \"arm\"), append = TRUE)\n\n# Transition-specific baseline hazards: stratify the baseline by transition (clock-forward / Markov).\ncox <- coxph(Surv(Tstart, Tstop, status) ~ strata(trans), data = long)\n\n# Cumulative transition hazards -> Aalen-Johansen transition probabilities from each starting state.\nmsf <- msfit(cox, trans = tmat)\npt  <- probtrans(msf, predt = 0)     # predt=0: P(0, t)\n# pt[[1]] = probabilities for patients starting in state 1 (Stable):\n#   columns pstate1, pstate2, pstate3 are the occupancy probabilities over time.\nhead(pt[[1]])\n\n# Semi-Markov (clock-reset) alternative: refit on the per-state sojourn time scale, e.g.\n#   coxph(Surv(time, status) ~ strata(trans), data = long)  where `time` = Tstop - Tstart.",
        "description": "The mstate workflow: define the transition matrix for an illness-death model, expand wide per-patient data to\nlong counting-process form with msprep, fit transition-specific baseline hazards with a stratified Cox model,\nand turn the cumulative transition hazards into Aalen-Johansen transition probabilities with msfit + probtrans.\nInput `wide`: id, pfs_time, prog (1=progressed), os_time, death (1=died), plus covariates age, arm.",
        "dependencies": [
          "mstate",
          "survival"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* 1) Per-transition baseline (Nelson-Aalen) cumulative hazards.\n      STRATA trans = a separate baseline hazard per transition; (tstart,tstop) = counting-process input. */\nproc phreg data=long;\n  strata trans;\n  model (tstart, tstop)*status(0) = ;\n  baseline out=cumhaz cumhaz=H0;        /* H0 = cumulative hazard for each transition over time */\nrun;\n\n/* 2) Transition-specific covariate effects: let age and arm act differently by transition.\n      The main effects apply to the reference transition; the trans interactions let each effect\n      differ by transition. arm is assumed here to be a numeric 0/1 indicator. */\nproc phreg data=long;\n  class trans;\n  strata trans;\n  model (tstart, tstop)*status(0) = age arm age*trans arm*trans;\nrun;\n\n/* 3) Assemble Aalen-Johansen transition probabilities from the per-transition hazards in CUMHAZ.\n      At each event time u build the increment matrix dA(u) (off-diagonals = hazard jumps d_ss', the\n      diagonal set so each row sums to zero), then accumulate the matrix product P = P * (I + dA(u)) in a\n      DATA step / IML loop. The result is P(0, t): occupancy probabilities for each starting state over t. */\nproc iml;\n  /* tstop is assumed to be the event-time variable carried into the BASELINE OUT= data set */\n  use cumhaz; read all var {trans tstop H0}; close cumhaz;\n  /* ... build dA(u) per distinct tstop, multiply forward into P (3x3 here) ... */\nquit;",
        "description": "Transition-specific hazards via PROC PHREG on counting-process (start, stop) data, stratified by transition.\nThe long table holds one row per at-risk transition episode:\n  long : id, trans, tstart, tstop, status, age, arm\nwhere trans codes each allowed move (1: Stable->Progressed, 2: Stable->Dead, 3: Progressed->Dead). The\nstratified null model gives per-transition Nelson-Aalen cumulative hazards; these feed the Aalen-Johansen\nproduct to assemble transition probabilities.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "stateDiagram-v2\n  direction LR\n  [*] --> Stable\n  Stable --> Progressed: intensity a12(t)\n  Stable --> Dead: intensity a13(t)\n  Progressed --> Dead: intensity a23(t)\n  Dead --> [*]",
        "caption": "The illness-death multi-state model - three states (Stable and Progressed transient, Dead absorbing) and three allowed transitions, each governed by its own transition intensity a_ss'(t). Aalen-Johansen assembles these intensities into the probability of occupying each state at a future time.",
        "alt_text": "State diagram with Stable transitioning to Progressed and to Dead, and Progressed transitioning to Dead, each arrow labeled with a transition intensity; Dead is the absorbing state.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "gantt\n  title One patient's path through the illness-death states (Stable to Progressed to Dead)\n  dateFormat YYYY-MM-DD\n  axisFormat %b %Y\n  section Stable (state 1)\n  Progression-free :crit, s1, 2020-01-01, 366d\n  section Progressed (state 2)\n  Post-progression alive :active, s2, 2021-01-01, 546d\n  section Dead (state 3)\n  Death (absorbing) :done, s3, 2022-07-01, 1d",
        "caption": "A single patient occupying state 1 (Stable) for one year, transitioning to state 2 (Progressed) at year 1, and reaching the absorbing state 3 (Dead) at year 2.5 - the kind of state path the cohort-level Aalen-Johansen estimator summarizes into transition probabilities.",
        "alt_text": "Timeline showing a patient progression-free for one year, then alive-with-progression for eighteen months, then dead, illustrating movement through the three states of the illness-death model.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": "multi-state-models-rwe-timeline.svg",
        "mermaid": null,
        "caption": "One patient's path through the illness-death model - Stable for one year, Progressed for the next eighteen months, then the absorbing Dead state at year 2.5 - the per-patient histories that aggregate into Aalen-Johansen transition probabilities.",
        "alt_text": "Stepped state timeline of a single patient moving from Stable to Progressed at year one and to Dead at year two and a half, with the two sojourn spans shaded.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "is_variant_of",
        "target_slug": "competing-risks-cause-specific-fine-gray-rwe",
        "notes": "Competing risks is the special case of a multi-state model with one transient starting state and several absorbing states and no onward transitions; Aalen-Johansen there reduces to the cumulative-incidence estimator."
      },
      {
        "relation_type": "tradeoff_with",
        "target_slug": "partitioned-survival-models-rwe",
        "notes": "Partitioned survival reconstructs state occupancy by subtracting independent OS and PFS curves; a multi-state (state-transition) model fits the Progressed -> Dead transition directly, giving internally consistent curves and coherent extrapolation (NICE TSD 19)."
      },
      {
        "relation_type": "see_also",
        "target_slug": "markov-transition-probabilities-rwe",
        "notes": "Homogeneous (constant-intensity) Markov models with fixed cycle transition probabilities are the simplest multi-state case and the engine of classic HTA Markov cohort models; the nonparametric multi-state model relaxes the constant-intensity assumption."
      },
      {
        "relation_type": "used_with",
        "target_slug": "cox-ph-regression",
        "notes": "Transition intensities are commonly estimated with transition-specific Cox models (baseline hazard stratified by transition, covariate effects allowed to differ by transition)."
      },
      {
        "relation_type": "see_also",
        "target_slug": "real-world-progression-rwpfs-rwe",
        "notes": "The intermediate Progressed state in oncology multi-state models is defined by a real-world progression event; how rwPFS/progression is constructed from data directly determines the Stable -> Progressed transition."
      },
      {
        "relation_type": "see_also",
        "target_slug": "cumulative-incidence-risk-rwe",
        "notes": "Cumulative incidence of an absorbing event is the marginal that an Aalen-Johansen multi-state model reproduces as a special case; multi-state adds the intermediate-state occupancy that cumulative incidence alone cannot give."
      }
    ],
    "aliases": [
      "multi-state model",
      "multistate model",
      "state-transition model",
      "illness-death model",
      "transition probability model",
      "markov multi-state model",
      "semi-markov model"
    ],
    "complexity": "advanced",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "multiple-imputation-longitudinal-rwe",
    "name": "Multiple Imputation for Longitudinal RWE",
    "short_definition": "A principled missing-data method that replaces each missing value with several draws from its posterior predictive distribution under a stated missingness assumption (usually MAR), fits the substantive model in each completed dataset, and combines results with Rubin's rules so that the uncertainty about the missing data is propagated into the final standard errors.",
    "long_description": "**Multiple imputation (MI)** does not recover information that the data lack; it propagates the *uncertainty* about missing\nvalues under an explicit, untestable assumption — almost always **missing at random (MAR)** conditional on observed data.\nThe recipe is fixed: (1) specify an imputation model and draw `m` completed datasets that reflect both the predicted value\nand the residual/posterior uncertainty; (2) fit the *identical* substantive analysis (Cox, pooled logistic, GEE, a\npropensity model) in each completed dataset; (3) combine the `m` point estimates and within/between-imputation variances with\n**Rubin's rules** so the final standard error includes the extra variance from not knowing the missing values. In RWE the\nhard part is never the arithmetic — it is making MAR defensible when EHR labs are *visit-driven*, claims have *structurally\nabsent* clinical constructs rather than item-level gaps, and longitudinal measures are missing for reasons tied to the very\ntrajectory you want to estimate.\n\n**Core conceptual distinction** — two ideas are separable and both matter. (1) *MI vs single imputation / complete-case (CCA)*:\nsingle imputation (mean, regression, LOCF) fixes one value and lies about the standard error; CCA discards partial records\nand is unbiased only under MCAR or, for a regression coefficient, when missingness in covariates is independent of the\noutcome given the model. MI is preferred when missingness depends on *observed* data (MAR) and auxiliary variables predict\neither the missing value or its missingness. (2) *Imputation model vs substantive (analysis) model — congeniality*: the\nimputation model must be at least as rich as the analysis model. 
If the substantive model is a Cox survival model, the\nimputation model must include the **event indicator and the Nelson–Aalen estimate of the cumulative baseline hazard**\n(White & Royston), and any interaction or non-linearity in the analysis must also live in the imputation step. The cleanest\nway to guarantee this is **substantive-model-compatible FCS (SMC-FCS, Bartlett 2014)**, which imputes from a model derived\nfrom the analysis model itself. The estimand is unchanged by MI: you are still estimating the same hazard ratio, risk\ndifference, or marginal mean you would target with complete data — MI only restores valid inference for it.\n\n**Pros, cons, and trade-offs**\n- **vs complete-case analysis (CCA):** MI uses partial records and auxiliary predictors, so it is more efficient and is\n  unbiased under MAR where CCA is biased (e.g., labs missing more often in monitored, sicker patients). Cost: MI relies on a\n  correctly specified, congenial imputation model and an unverifiable MAR assumption; under **MNAR** both CCA and MI are\n  biased, sometimes in different directions. **Prefer MI** when covariate missingness is appreciable and plausibly MAR;\n  **prefer CCA** when missingness is trivial, or when it is in the *outcome* and unrelated to covariates (CCA can then be\n  nearly as efficient and more transparent).\n- **vs single imputation / LOCF for repeated measures:** MI propagates uncertainty; single imputation understates variance\n  and LOCF additionally assumes a value never changes after the last visit — usually false for labs and PROs and a classic\n  way to bias longitudinal effects toward the null or away from it depending on the dropout pattern. 
**Prefer MI** (or a\n  likelihood/mixed-model approach) over LOCF essentially always.\n- **vs likelihood-based / mixed models (MMRM, joint models, full-Bayesian):** A correctly specified mixed model is valid\n  under MAR *without* imputation and avoids imputation-model misspecification; MI is more flexible when missingness spans\n  many heterogeneous variables, when the analysis is not a single likelihood (e.g., a propensity score *then* an outcome\n  model), or when auxiliary variables outside the analysis model carry information. **Prefer MMRM/joint models** for a single\n  repeated-measures outcome; **prefer MI** for multivariable covariate missingness feeding a downstream causal pipeline.\n- **vs inverse-probability weighting for missingness:** IPW is robust to imputation-model misspecification but discards\n  partial records and is inefficient when many variables are partly missing; MI is more efficient but needs a correct\n  imputation model. Doubly robust combinations exist but are rarely operationalized in routine RWE.\n\n**When to use** — appreciable missingness (rough rule: more than a few percent on variables that matter) in baseline\ncovariates, labs, severity scores, PROs, or SDoH fields that you have good reason to treat as MAR given observed data,\n*and* you can name auxiliary variables (prior measurements, utilization, site, calendar time, treatment, the outcome) that\npredict the missing value or its missingness. 
Use MI to feed propensity or outcome models, and for repeated measures use a\nlongitudinal-aware imputation (multilevel `2l.pmm`/`2l.norm`, or a wide layout that conditions on prior-visit values) rather\nthan treating visits as independent.\n\n**When NOT to use — and when it is actively misleading or dangerous**\n- **The construct is structurally absent, not item-missing.** Imputing \"ECOG\" or a lab in claims that *never* records it is\n  fabricating a variable from its correlates; the imputations carry no real information and falsely shrink the standard error.\n  Do not impute a field that the data source does not collect — restrict, link, or drop the variable.\n- **Missingness is informative (MNAR) and you treat it as MAR.** If a PRO is missing *because* the patient felt too unwell,\n  MAR-based MI biases toward the healthier observed distribution. Here MI under MAR is worse than honest — it launders an\n  untestable assumption into a confident answer. Use **delta-adjustment / pattern-mixture** sensitivity analyses and report\n  how conclusions move.\n- **Imputing exposure or outcome timing when missingness reflects ascertainment failure.** Imputing an index date, an event\n  date, or death when the gap is unobserved follow-up (disenrollment, out-of-network care) manufactures person-time and can\n  create or destroy immortal time. Handle with censoring/linkage, not imputation.\n- **Uncongenial imputation.** Imputing covariates *without* the outcome (or, for survival, without the event indicator and\n  Nelson–Aalen cumulative hazard) biases associations toward the null. Always include the outcome in the imputation model.\n\n**Data-source operational depth**\n- **Claims (FFS vs MA):** Item-level lab/clinical missingness is rare because claims do not record those constructs at all —\n  so MI is appropriate only for *linked* clinical variables, not for absent constructs (smoking, BMI, OTC use). 
A subtler\n  trap: apparent \"missing\" diagnoses or fills are often **unobserved person-time** — Medicare Advantage enrollees lack\n  fee-for-service claims, so a missing comorbidity may be an MA gap, not a true negative. Restrict to enrollees with the\n  relevant benefit (A/B/D or commercial medical+pharmacy), and never impute over MA-only spans as if they were observed.\n- **EHR:** Missingness is **visit-driven and informative** — sicker patients are tested more, stable patients vanish. This\n  makes MAR fragile *unless* the imputation model is rich in the drivers of measurement: site, provider, encounter type,\n  prior measurement frequency, utilization, and the most recent observed value. A test ordered-but-pending vs never-ordered\n  are different missingness mechanisms; encode them. Loss to follow-up is potentially informative; consider inverse\n  probability of observation weights for the visit process rather than naive MI of every gap.\n- **Registry:** Stage, biomarker, and PRO missingness commonly varies by hospital, registry version, and calendar period\n  (assays and staging systems change). Include site, calendar year, and registry-version indicators in the imputation model,\n  and run site/calendar sensitivity analyses; pooling sites with different missingness mechanisms without these terms\n  violates MAR.\n- **Linked claims–EHR–registry:** The strongest substrate — EHR/registry supply the clinical values, claims supply\n  completeness and continuous enrollment — but linkage is selective (only the linkable subset) and the *unlinked* records may\n  differ systematically. Treat linkage status as an auxiliary variable and check that the imputation model is stable across\n  linked/unlinked strata.\n\n**Worked claims example.** Question: 1-year all-cause mortality after initiating a nephrotoxic oral oncolytic, adjusting for\nbaseline eGFR, in a linked commercial+Medicare FFS claims–EHR cohort. 
Cohort logic is claims-style: `person_id` with ≥365\ndays of continuous medical+pharmacy enrollment before the first qualifying `fill_date` (index_date), no MA-only person-time,\nand the arm/exposure taken from the index NDC. Baseline eGFR (from linked EHR labs in the [index_date−365, index_date]\nwindow) is **missing for 35%** of patients — and missing *less* often among patients with prior CKD diagnoses and high\ninpatient utilization, because they are monitored more. Complete-case Cox over-represents these monitored, high-risk patients\nand biases the eGFR–mortality association. The MI fix: (1) build a Nelson–Aalen estimate of the cumulative hazard from the\nobserved survival times and the death indicator; (2) impute eGFR with predictive mean matching using the death indicator,\nthe Nelson–Aalen term, treatment arm, age, prior CKD/dialysis flags, prior creatinine values, inpatient days, and calendar\nyear as predictors — this enforces congeniality with the downstream Cox model; (3) create `m = 40` completed datasets\n(rule of thumb: `m` at least the percent missing); (4) fit the *same* Cox model `Surv(time, death) ~ arm + age + egfr +\ncci + prior_ip` in each dataset; (5) pool with Rubin's rules and report the pooled HR, its CI, and the fraction of missing\ninformation (FMI). (6) Because monitoring-driven missingness could be MNAR, add a delta-adjustment sensitivity analysis that\nshifts imputed eGFR downward (worse renal function) by a clinically meaningful delta and confirm the conclusion is stable.\nReport a missingness table, the imputation predictors, `m`, convergence/trace diagnostics, and the sensitivity results —\nand never collapse the `m` datasets into one averaged dataset before modeling (that destroys the between-imputation variance).\n\n**Interpreting the output**\n\nConsider the CKD progression example: eGFR is missing in 5 of 10 patients, m = 5 imputed datasets are\ncreated, and a Cox model for mortality is fitted in each. 
The five log-HR estimates are approximately\n−0.42, −0.38, −0.45, −0.40, and −0.45. Rubin's rules yield a pooled log-HR ≈ −0.42, corresponding to\na pooled HR ≈ 0.66 (≈ 34% lower hazard in the treatment arm). The pooled SE is wider than any single\nimputation SE because it includes a between-imputation variance component reflecting uncertainty about\nthe missing eGFR values themselves.\n\n*(1) Formal statistical interpretation.* The pooled HR of ≈ 0.66 and its CI summarize the evidence\nacross all m completed datasets under Rubin's rules. The between-imputation component inflates the pooled\nSE beyond what any single complete-dataset analysis would report — this inflation is correct and necessary:\nsingle imputation implicitly treats imputed values as known, understating uncertainty. The fraction of\nmissing information (FMI) quantifies how much of the total variance is attributable to missingness; a\nhigh FMI signals that the result is sensitive to the imputation model and mechanism assumption.\n\n*(2) Practical interpretation for a decision-maker.* The MI estimate of HR ≈ 0.66 uses all available\npatient-level data — including the five patients with missing eGFR — rather than discarding them as\ncomplete-case analysis would. The wider CI compared to a hypothetical complete-data analysis honestly\nreflects the residual uncertainty from the missingness. If the conclusion is sensitive to a MNAR\nsensitivity analysis (where imputed eGFR values are shifted to reflect worse kidney function), that\nfinding should be reported alongside the primary MI result.",
    "primary_category": "Inferential_Statistics",
    "tags": [
      "multiple-imputation",
      "missing-data",
      "mice",
      "longitudinal",
      "ehr",
      "labs",
      "pro",
      "rubins-rules",
      "mar",
      "mnar",
      "congeniality",
      "smc-fcs"
    ],
    "applies_to_study_types": [
      "cohort_retrospective",
      "cohort_prospective",
      "claims_analysis",
      "ehr_study",
      "linked_data",
      "disease_registry"
    ],
    "data_sources": [
      "ehr",
      "claims",
      "linked",
      "registry"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1136/bmj.b2393",
        "url": "https://doi.org/10.1136/bmj.b2393",
        "citation_text": "Sterne JAC, White IR, Carlin JB, et al. Multiple imputation for missing data in epidemiological and clinical research: potential and pitfalls. BMJ. 2009;338:b2393.",
        "year": 2009,
        "authors_short": "Sterne et al.",
        "notes": "The standard epidemiologic statement of when MI helps, the MAR assumption, the dangers (e.g., uncongenial imputation, MNAR), and minimum reporting requirements."
      },
      {
        "role": "explain",
        "doi": "10.1002/sim.4067",
        "url": "https://doi.org/10.1002/sim.4067",
        "citation_text": "White IR, Royston P, Wood AM. Multiple imputation using chained equations: issues and guidance for practice. Statistics in Medicine. 2011;30(4):377-399.",
        "year": 2011,
        "authors_short": "White et al.",
        "notes": "Definitive practical guidance on chained-equation (FCS) MI, choosing the number of imputations, and — critically for RWE survival work — including the event indicator and Nelson–Aalen cumulative hazard when imputing covariates for a Cox model."
      },
      {
        "role": "demonstrate",
        "doi": "10.1177/0962280214521348",
        "url": "https://doi.org/10.1177/0962280214521348",
        "citation_text": "Bartlett JW, Seaman SR, White IR, Carpenter JR. Multiple imputation of covariates by fully conditional specification: accommodating the substantive model. Statistical Methods in Medical Research. 2015;24(4):462-487.",
        "year": 2015,
        "authors_short": "Bartlett et al.",
        "notes": "Introduces substantive-model-compatible FCS (SMC-FCS) so the imputation model is congenial with the analysis model (interactions, non-linearities, Cox/competing-risks outcomes); the rigorous solution to the congeniality problem in RWE."
      },
      {
        "role": "use",
        "doi": "10.18637/jss.v045.i03",
        "url": "https://doi.org/10.18637/jss.v045.i03",
        "citation_text": "van Buuren S, Groothuis-Oudshoorn K. mice: multivariate imputation by chained equations in R. Journal of Statistical Software. 2011;45(3):1-67.",
        "year": 2011,
        "authors_short": "van Buuren & Groothuis-Oudshoorn",
        "notes": "The reference R implementation (mice), including predictive mean matching, multilevel/longitudinal imputers (2l.*), passive imputation, and Rubin's-rules pooling."
      }
    ],
    "plain_language_summary": "Multiple imputation is a method for handling missing values in a dataset without simply throwing away the patients who have gaps. Instead of guessing one single fill-in value, the method creates several (commonly 5 to 40) complete versions of the data, each with slightly different plausible values for the missing cells, then analyzes every version with the same statistical model and combines the results. The combining step — called Rubin's rules — intentionally makes the final confidence interval wider than if the data had been complete, because the width honestly reflects how uncertain we are about what the missing values actually were. The key assumption is that the missingness depends only on information we did observe (called missing at random, or MAR), not on the unobserved value itself.",
    "key_terms": [
      {
        "term": "missing at random (MAR)",
        "definition": "An assumption that the chance a value is missing depends only on other data we can see (like age or prior test results), not on the missing value itself — for example, a lab is missing more often in healthy patients who had fewer clinic visits, and we can see how many visits each patient had."
      },
      {
        "term": "multiple imputation",
        "definition": "A technique that fills in each missing value not with one fixed guess but with several different plausible draws, producing multiple complete datasets whose variation captures the uncertainty about what the true value was."
      },
      {
        "term": "Rubin's rules",
        "definition": "A formula that combines the point estimates and standard errors from each separately analyzed completed dataset into one final estimate, inflating the standard error to account for the between-dataset disagreement caused by imputation uncertainty."
      },
      {
        "term": "imputation model",
        "definition": "The statistical model used to predict and draw plausible replacement values for the missing data, which must include the study outcome and all covariates used in the main analysis so the imputed values are consistent with the analysis."
      },
      {
        "term": "fraction of missing information (FMI)",
        "definition": "A diagnostic number between 0 and 1 that says what share of the uncertainty in a final estimate is due to the missing data; an FMI of 0.35 means roughly 35 percent of the variance comes from not knowing those values."
      }
    ],
    "worked_example": {
      "scenario": "A researcher is studying whether a new oral oncolytic reduces one-year mortality compared with standard chemotherapy in a claims-linked EHR cohort of 500 patients. Baseline kidney function, measured as eGFR (estimated glomerular filtration rate, a blood test), is missing for 5 of a small demonstration group of 10 patients. Rather than drop those 5 patients, the analyst uses multiple imputation: five completed datasets are created, each with a different plausible eGFR for the missing patients drawn from a model that uses age, treatment arm, the death indicator, and the observed eGFR values. A Cox proportional-hazards regression is then fit in each completed dataset, producing five separate hazard ratio estimates for the treatment effect. Those five estimates are pooled with Rubin's rules to give a single final hazard ratio and a confidence interval that is honestly wider because of the eGFR uncertainty.",
      "dataset": {
        "caption": "Raw analytic dataset (10 patients). eGFR is missing for 5 patients. The outcome model will be: Surv(time_days, death) ~ arm + age + egfr.",
        "columns": [
          "person_id",
          "arm",
          "age",
          "egfr",
          "time_days",
          "death"
        ],
        "rows": [
          [
            1001,
            1,
            62,
            48,
            365,
            0
          ],
          [
            1002,
            0,
            71,
            33,
            210,
            1
          ],
          [
            1003,
            1,
            55,
            "missing",
            365,
            0
          ],
          [
            1004,
            0,
            68,
            "missing",
            180,
            1
          ],
          [
            1005,
            1,
            60,
            57,
            300,
            0
          ],
          [
            1006,
            0,
            75,
            29,
            90,
            1
          ],
          [
            1007,
            1,
            58,
            "missing",
            365,
            0
          ],
          [
            1008,
            0,
            66,
            "missing",
            240,
            1
          ],
          [
            1009,
            1,
            63,
            44,
            365,
            0
          ],
          [
            1010,
            0,
            70,
            "missing",
            150,
            1
          ]
        ]
      },
      "imputed_datasets_table": {
        "caption": "Each column shows the imputed eGFR drawn for the 5 missing patients in that completed dataset. Values differ across datasets because each draw adds residual uncertainty from the imputation model. The 5 completed datasets are kept separate; they are never averaged before modeling.",
        "columns": [
          "person_id",
          "Dataset 1",
          "Dataset 2",
          "Dataset 3",
          "Dataset 4",
          "Dataset 5"
        ],
        "rows": [
          [
            1003,
            51,
            47,
            53,
            49,
            55
          ],
          [
            1004,
            36,
            31,
            38,
            34,
            41
          ],
          [
            1007,
            46,
            52,
            43,
            50,
            54
          ],
          [
            1008,
            38,
            35,
            40,
            37,
            45
          ],
          [
            1010,
            34,
            30,
            37,
            32,
            42
          ]
        ]
      },
      "steps": [
        "Run the imputation model (using observed age, arm, death indicator, and observed eGFR values as predictors) five separate times, each time drawing a fresh set of plausible eGFR values for the 5 missing patients. This produces 5 completed datasets.",
        "Fit the identical Cox regression model — Surv(time_days, death) ~ arm + age + egfr — in each of the 5 completed datasets independently. Each fit produces its own log hazard ratio for the treatment arm and its own standard error.",
        "Apply Rubin's rules: the pooled point estimate is simply the arithmetic mean of the 5 log hazard ratios. For example, if the 5 log-HR estimates are -0.42, -0.38, -0.45, -0.40, and -0.45, the pooled log-HR = (-0.42 + -0.38 + -0.45 + -0.40 + -0.45) / 5 = -2.10 / 5 = -0.42.",
        "Compute the within-imputation variance (average of the 5 squared standard errors) and the between-imputation variance (variance of the 5 point estimates across datasets). The total variance for the pooled estimate adds these two components plus a small correction term for finite m, inflating the standard error relative to a complete-data analysis.",
        "Convert the pooled log-HR to a hazard ratio: exp(-0.42) = 0.66, meaning the treatment arm has a 34 percent lower hazard of death. Report the pooled HR, its 95 percent confidence interval derived from the total variance, and the fraction of missing information (FMI) — which here would be moderate, reflecting that eGFR was missing for 50 percent of this small demonstration group."
      ],
      "result": "Pooled log-HR = (-0.42 + -0.38 + -0.45 + -0.40 + -0.45) / 5 = -0.42. Pooled HR = exp(-0.42) = 0.66 (95% CI wider than complete-data estimate because the between-imputation variance inflates the standard error). The 5 completed datasets were analyzed separately and combined with Rubin's rules; they were never averaged or collapsed into a single dataset before modeling."
    },
    "prerequisites": [
      "missing-data-pattern-table-rwe",
      "complete-case-analysis-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Baseline covariate MI for a downstream causal pipeline",
        "description": "Impute partially observed baseline labs, vitals, severity scores, or SDoH measures once, then fit the propensity score and outcome model inside each completed dataset and pool. Do not impute, average, then model.",
        "edge_cases": [
          "Impute within the analytic cohort and after time-zero is fixed, so the imputation reflects the population actually analyzed.",
          "The outcome (and, for survival, the event indicator + Nelson–Aalen cumulative hazard) must be in the imputation model or covariate–outcome associations are biased toward the null."
        ],
        "data_source_notes": "claims/linked: include treatment arm, follow-up time, event indicator, prior measurements, utilization, site, and calendar year; never impute over MA-only person-time."
      },
      {
        "name": "Longitudinal / repeated-measures MI",
        "description": "Impute repeated labs or PROs using time, prior and subsequent observed values, treatment, and outcome history. Use a multilevel imputer (2l.pmm/2l.norm/2l.lmer with a cluster id) for clustered or long-format data, or a wide layout conditioning on neighboring visits when visits are few and aligned.",
        "edge_cases": [
          "LOCF is not a valid imputation model; it understates variance and assumes values never change after the last visit.",
          "The visit/observation process can be informative; consider inverse-probability-of-observation weights rather than imputing every gap.",
          "Multilevel imputation requires a correctly specified random-effects structure; misspecification reintroduces bias."
        ],
        "data_source_notes": "ehr: encode encounter type, ordered-but-pending vs never-ordered, provider, and measurement frequency as drivers of the visit process."
      },
      {
        "name": "MNAR sensitivity (delta-adjustment / pattern-mixture)",
        "description": "Re-impute under departures from MAR by shifting imputed values by a clinically meaningful delta (or by imputing missing-data patterns separately) and report how the substantive estimate moves.",
        "edge_cases": [
          "The delta is a sensitivity parameter, not estimable from the data; choose and justify a clinically plausible range.",
          "Tipping-point analyses (the delta at which the conclusion flips) communicate robustness more honestly than a single delta."
        ]
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "complete-case analysis",
        "pros_of_this": "Uses partial records and auxiliary predictors; unbiased and more efficient under MAR where complete-case is biased.",
        "cons_of_this": "Requires a correctly specified, congenial imputation model and an unverifiable MAR assumption; under MNAR both are biased.",
        "when_to_prefer": "Appreciable covariate missingness that is plausibly MAR given observed data, especially feeding a propensity/outcome pipeline."
      },
      {
        "compared_to": "single imputation / last-observation-carried-forward",
        "pros_of_this": "Propagates imputation uncertainty into the standard error; does not assume a measurement stays constant after the last visit.",
        "cons_of_this": "More complex to implement and report than a single fill-in.",
        "when_to_prefer": "Essentially always over LOCF/mean imputation for analysis (single imputation is acceptable only for descriptive display)."
      },
      {
        "compared_to": "likelihood-based mixed models (MMRM) / joint models",
        "pros_of_this": "Flexible across many heterogeneous partly missing variables and non-likelihood pipelines (PS then outcome); can exploit auxiliary variables outside the analysis model.",
        "cons_of_this": "Sensitive to imputation-model misspecification; a correctly specified mixed model is valid under MAR without imputation.",
        "when_to_prefer": "Multivariable covariate missingness; prefer MMRM/joint models for a single repeated-measures outcome."
      },
      {
        "compared_to": "inverse-probability weighting for missingness",
        "pros_of_this": "More efficient when many variables are partly observed.",
        "cons_of_this": "Requires a correct imputation model; IPW is more robust to its misspecification.",
        "when_to_prefer": "When efficiency matters and the imputation model can be specified credibly."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Item-level lab missingness is rare; claims lack the construct rather than the value. Use MI only for linked clinical variables, not absent constructs (smoking, BMI). Distinguish true negatives from unobserved Medicare Advantage person-time; restrict to A/B/D or commercial medical+pharmacy enrollment and do not impute over MA-only spans.",
      "ehr": "Missingness is visit-driven and often informative. Make MAR plausible by including site, provider, encounter type, prior measurement frequency, utilization, and the most recent observed value; distinguish ordered-but-pending from never-ordered. Consider inverse-probability-of-observation weights for the visit process.",
      "registry": "Stage, biomarker, and PRO missingness vary by hospital, registry version, and calendar period. Include site, calendar year, and registry-version indicators; run site/calendar sensitivity analyses.",
      "linked": "EHR/registry supply clinical values and claims supply completeness, but linkage is selective. Treat linkage status as an auxiliary variable and check imputation stability across linked/unlinked strata."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\nimport pandas as pd\nimport statsmodels.api as sm\nfrom statsmodels.imputation import mice\nfrom lifelines import NelsonAalenFitter\n\n# 1) Nelson-Aalen cumulative hazard at each subject's follow-up time -> congenial survival imputation predictor.\nnaf = NelsonAalenFitter().fit(df[\"time\"], event_observed=df[\"death\"])\ndf[\"na_cumhaz\"] = naf.cumulative_hazard_at_times(df[\"time\"]).values\n\n# 2) Build the MI dataset. Only egfr has missing values, so only egfr is imputed.\n#    time is carried along because the Cox model needs it, but it is left out of the egfr imputer.\nimp_cols = [\"time\", \"egfr\", \"death\", \"na_cumhaz\", \"arm\", \"age\", \"cci\", \"prior_ip\"]\nmi_data = mice.MICEData(df[imp_cols])\n# Predictive mean matching for the continuous lab. set_imputer takes the right-hand side only;\n# statsmodels prepends \"egfr ~\" itself.\nmi_data.set_imputer(\"egfr\", formula=\"death + na_cumhaz + arm + age + cci + prior_ip\")\n\n# 3) Fit the SAME Cox model in each completed dataset and pool with Rubin's rules.\n#    MICE calls model_class.from_formula(formula, data, **init_kwds), so pass the PHReg class itself\n#    and name the event-indicator column through init_kwds.\nmi = mice.MICE(\"time ~ arm + age + egfr + cci + prior_ip\", sm.PHReg, mi_data,\n               n_skip=3, init_kwds={\"status\": \"death\"})\nresults = mi.fit(n_burnin=10, n_imputations=40)   # m = 40 ~ percent missing\nprint(results.summary())                          # pooled log-HR, SE, CI, FMI per coefficient",
        "description": "Congenial MI for a Cox analysis using statsmodels MICEData. Required input (already cleaned, one row per subject, baseline\ncovariates measured in [index_date-365, index_date]):\n  df : person_id, time (float, follow-up days), death (0/1 event indicator),\n       arm (0/1), age (float), egfr (float, ~35% missing), cci (int), prior_ip (int)\nCongeniality: the imputation model includes the event indicator AND the Nelson-Aalen cumulative hazard, plus treatment and\nall analysis covariates, so covariate-outcome associations are not biased toward the null (White & Royston; Bartlett 2014).\nPooling uses Rubin's rules via statsmodels' MICE.combine().",
        "dependencies": [
          "pandas",
          "numpy",
          "statsmodels",
          "lifelines"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(mice)\nlibrary(miceadds)      # supplies the 2l.pmm imputer used in (B)\nlibrary(survival)\nlibrary(broom.mixed)   # tidy methods that pool() needs for lmer fits\n\n## (A) Baseline MI congenial with Cox: add the event indicator + Nelson-Aalen cumulative hazard (White & Royston).\ndf_wide$na_cumhaz <- nelsonaalen(df_wide, time, death)\n\npred <- make.predictorMatrix(df_wide)\npred[\"egfr\", c(\"person_id\", \"time\")] <- 0    # never predict from the id; na_cumhaz stands in for raw time\nimp <- mice(df_wide, m = 40, method = \"pmm\", predictorMatrix = pred, seed = 2026)\nfits <- with(imp, coxph(Surv(time, death) ~ arm + age + egfr + cci + prior_ip))\npooled <- pool(fits)                         # Rubin's rules on the log-HR scale\nsummary(pooled, conf.int = TRUE, exponentiate = TRUE)   # pooled HR with 95% CI\npooled$pooled$fmi                            # fraction of missing information per coefficient\n\n## (B) Longitudinal MI of a repeated lab using a multilevel imputer (2l.pmm) with person_id as the cluster.\n## person_id must be numeric/integer for the 2l.* methods.\nmeth <- make.method(df_long)\nmeth[\"lab\"] <- \"2l.pmm\"\npm <- make.predictorMatrix(df_long)\npm[\"lab\", ] <- 0                             # start from no predictors for lab, then opt in\npm[\"lab\", \"person_id\"] <- -2                 # -2 flags the class (cluster) variable for 2l methods\npm[\"lab\", c(\"prior_lab\", \"arm\", \"age\", \"time_visit\")] <- 1\nimp_long <- mice(df_long, m = 40, method = meth, predictorMatrix = pm, seed = 2026)\nfit_long <- with(imp_long,\n                 lme4::lmer(lab ~ arm + time_visit + age + (1 | person_id)))\nsummary(pool(fit_long))",
        "description": "Two patterns. (A) Baseline-covariate MI congenial with a Cox model, the workhorse for cross-sectional baseline\nmissingness. (B) Longitudinal repeated-measures MI honoring the slug, using a multilevel imputer with a subject cluster id.\nRequired inputs:\n  df_wide : person_id, time, death, arm, age, egfr (NA ~35%), cci, prior_ip   # one row per subject\n  df_long : person_id, visit, lab (NA), prior_lab, arm, age, time_visit       # one row per subject-visit",
        "dependencies": [
          "mice",
          "miceadds",
          "survival",
          "lme4",
          "broom.mixed"
        ],
        "source_citations": [
          "van-buuren-2011"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* 0) Nelson-Aalen cumulative hazard at each subject's event/censor time -> congenial survival predictor for MI. */\nproc phreg data=work.cohort;\n  model time*death(0) = ;                 /* null model -> Nelson-Aalen baseline */\n  baseline out=na_base cumhaz=na_cumhaz / method=breslow;  /* Breslow estimator = Nelson-Aalen when there are no covariates */\nrun;\n/* merge na_cumhaz back onto cohort at each subject's own time (data step omitted for brevity) */\n\n/* 1) Impute egfr from death + na_cumhaz + treatment + analysis covariates (FCS regression). */\nproc mi data=work.cohort_na out=mi_long nimpute=40 seed=2026;\n  class arm;\n  fcs reg(egfr = death na_cumhaz arm age cci prior_ip);\n  var death na_cumhaz arm age egfr cci prior_ip time;\nrun;\n\n/* 2) Fit the SAME Cox model within each imputed dataset; capture parameter estimates + covariance.\n      arm is numeric 0/1 and is entered without CLASS so the OUTEST column is named arm, matching MODELEFFECTS below. */\nproc phreg data=mi_long covout outest=cox_ests;\n  by _Imputation_;\n  model time*death(0) = arm age egfr cci prior_ip;\nrun;\n\n/* 3) Combine across imputations with Rubin's rules -> pooled log-HR, SE, CI, and FMI. */\nproc mianalyze data=cox_ests;\n  modeleffects arm age egfr cci prior_ip;\nrun;",
        "description": "Congenial MI -> BY-imputation Cox -> MIANALYZE, the canonical SAS pattern. Required input (post data-management):\n  work.cohort : person_id, time, death (0/1), arm (0/1), age, egfr (.=missing), cci, prior_ip\nThe Nelson-Aalen cumulative hazard is computed first (PROC PHREG BASELINE) and merged in so PROC MI imputes egfr from a\nmodel that includes survival information (event indicator + cumulative hazard), enforcing congeniality with the outcome model.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Cohort[Analytic cohort + time zero fixed] --> Pattern[Missingness table + mechanism review<br/>MCAR / MAR / MNAR; structural vs item-level]\n  Pattern --> Aux[Choose auxiliary predictors<br/>prior values, utilization, site, calendar, treatment]\n  Aux --> Cong[Congenial imputation model<br/>include outcome + event indicator + Nelson-Aalen for Cox]\n  Cong --> M[Create m completed datasets<br/>m ~ percent missing]\n  M --> Fit[Fit the IDENTICAL substantive model<br/>in each completed dataset]\n  Fit --> Pool[Pool with Rubin's rules<br/>report estimate, CI, FMI]\n  Pool --> Sens[MNAR sensitivity<br/>delta-adjustment / tipping point]",
        "caption": "MI workflow for RWE: fix the cohort, characterize missingness, build a congenial imputation model that includes the outcome, create m datasets, fit the same analysis in each, pool with Rubin's rules, then stress-test the MAR assumption.",
        "alt_text": "Flowchart from analytic cohort and missingness review through auxiliary-variable selection, congenial imputation, m completed datasets, identical substantive model fits, Rubin's-rules pooling, and MNAR sensitivity analysis.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  subgraph Wrong[Uncongenial / single imputation]\n    A1[Impute covariates WITHOUT outcome] --> A2[Single filled dataset or averaged m]\n    A2 --> A3[Biased toward null + understated SE]\n  end\n  subgraph Right[Congenial multiple imputation]\n    B1[Impute WITH outcome + event indicator + NA hazard] --> B2[Keep m separate datasets]\n    B2 --> B3[Pool with Rubin: valid SE + FMI]\n  end",
        "caption": "The two failure modes MI is meant to avoid. Imputing covariates without the outcome biases associations toward the null; collapsing the m datasets into one (or single imputation) understates the standard error. Congenial MI with separate datasets and Rubin pooling fixes both.",
        "alt_text": "Side-by-side comparison contrasting uncongenial single imputation (biased toward null, understated standard error) with congenial multiple imputation that keeps m datasets separate and pools with Rubin's rules.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "alternative_to",
        "target_slug": "missing-data-trimming-winsorization-rwe",
        "notes": "Trimming/winsorization addresses outliers, not missingness; MI explicitly models the missing values and propagates their uncertainty, and the two are not substitutes."
      },
      {
        "relation_type": "used_with",
        "target_slug": "missing-data-pattern-table-rwe",
        "notes": "A missingness pattern table is the prerequisite diagnostic that justifies the MAR assumption and the choice of auxiliary variables before any imputation."
      },
      {
        "relation_type": "see_also",
        "target_slug": "attrition-and-loss-to-follow-up-rwe",
        "notes": "Longitudinal dropout and loss to follow-up are the dominant source of missingness over time; an informative visit/observation process may require IP-of-observation weighting rather than imputing every gap."
      },
      {
        "relation_type": "used_with",
        "target_slug": "baseline-characteristics-and-covariate-balance-rwe",
        "notes": "Imputed baseline covariates still require balance checks; assess standardized differences within each imputed dataset (or pooled), not on a naively averaged dataset."
      },
      {
        "relation_type": "used_with",
        "target_slug": "propensity-score-methods-psm-iptw",
        "notes": "When a propensity score depends on partly missing covariates, fit the PS and outcome model inside each completed dataset and pool, rather than imputing then averaging covariates."
      },
      {
        "relation_type": "used_with",
        "target_slug": "cox-ph-regression",
        "notes": "Imputing covariates for a Cox model requires including the event indicator and the Nelson-Aalen cumulative baseline hazard so the imputations are congenial with the survival analysis."
      }
    ],
    "aliases": [
      "MI",
      "MICE",
      "multiple imputation by chained equations",
      "fully conditional specification",
      "Rubin's rules",
      "SMC-FCS"
    ],
    "complexity": "advanced",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "multiplicity-multiple-comparisons",
    "name": "Multiplicity and Multiple Comparisons",
    "short_definition": "A statistical discipline governing how to set significance thresholds and interpret p-values when a study tests multiple hypotheses simultaneously — whether across several endpoints, subgroups, interim looks, or unreported analytical choices — so that the overall false-positive rate (or false-discovery rate) is controlled at a meaningful level rather than inflated by the accumulation of chances across many tests; the two main tools are family-wise error rate (FWER) control via Bonferroni or Holm for confirmatory analyses and false discovery rate (FDR) control via Benjamini-Hochberg for exploratory scans.",
    "long_description": "**The multiplicity problem in RWE**\n\nEvery additional hypothesis test gives a study one more chance to declare a false positive by\nchance alone. At a 5% significance threshold and one test, the type-I error rate is 5%. With ten\nindependent tests all truly null, the probability of at least one spurious rejection rises to\n1 - 0.95^10, which is approximately 40%. At 100 truly null tests it approaches 99%. Real-world\nevidence studies routinely generate exactly this inflation: secondary endpoint batteries of\n10-30 outcomes, subgroup analyses across age, sex, and comorbidity strata, multiple sensitivity\nanalyses, and the silent flexibility of choosing which results to highlight after seeing the data.\nThe discipline of multiplicity adjustment provides a coherent framework for deciding — before\nlooking at the data — how to allocate the permitted false-positive rate across the tests being run.\n\n**Two philosophies: family-wise error rate versus false discovery rate**\n\nThe central conceptual divide is between two definitions of what \"error\" means when many tests\nare run simultaneously.\n\nThe *family-wise error rate (FWER)* is the probability of making at least one false positive\nrejection in the entire family of hypotheses. This is the standard for confirmatory clinical\ntrials and regulatory submissions: the FDA and EMA require that the chance of erroneously\ndeclaring any efficacy endpoint effective is controlled at alpha = 0.05 across the full testing\nplan. FWER is the right criterion when the consequence of any single wrong conclusion is\nunacceptable — for example, approving a drug on a secondary endpoint that is actually null.\n\nThe *false discovery rate (FDR)* is the expected proportion of rejected hypotheses that are false\npositives. 
If a study rejects 20 hypotheses and the FDR is controlled at 5%, the analyst expects\nat most 1 of those 20 rejections to be wrong on average across many hypothetical replications of\nthe same procedure. FDR is the right criterion for exploratory science — pharmacovigilance scans\nof thousands of drug-event pairs, phenome-wide association studies across thousands of disease\ncodes, or metabolomics screens — where the goal is to nominate a tractable shortlist of\nhypotheses for follow-up, not to make a definitive claim about any single one. Accepting a few\nfalse leads in exchange for better sensitivity is a reasonable trade-off when the rejected\nhypotheses will be independently validated.\n\nThe conceptual consequence: FWER control becomes harder as the number of tests m grows —\nBonferroni sets the per-test threshold at alpha/m, which approaches zero. FDR control via\nBenjamini-Hochberg scales more gracefully; the threshold for the k-th-ranked test is (k/m) * Q,\nwhich remains non-trivial even for large m. For hypothesis-free scans across hundreds of\noutcomes — where Bonferroni would reject almost nothing — FDR or permutation-based FWER (as\nin TreeScan) are the appropriate tools.\n\n**Bonferroni and Holm (FWER methods)**\n\nThe *Bonferroni correction* is the simplest FWER method: divide alpha by the total number of\ntests m and reject hypothesis i if its raw p-value p_i is at most alpha/m. With ten tests and\nalpha = 0.05, the Bonferroni threshold is 0.005. Bonferroni is valid under any correlation\nstructure among the tests — it relies only on the union bound (the Boole inequality) and does\nnot assume independence. 
Its weakness is conservatism: when the tests are positively correlated\n(as they often are across related clinical outcomes or overlapping diagnostic code families),\nBonferroni over-adjusts, and many genuine signals are missed.\n\nThe *Holm step-down procedure* uniformly improves on Bonferroni while still controlling FWER.\nSort p-values ascending p_(1) ≤ ... ≤ p_(m); compare p_(i) to alpha/(m - i + 1) starting from\nrank 1; continue rejecting as long as the condition holds; stop at the first failure. Holm is\nalways at least as powerful as Bonferroni and often substantially more so when the smallest\np-values are far below the others. Holm should be the default wherever Bonferroni is considered.\nBoth Bonferroni and Holm are conservative when tests are positively correlated; procedures that\naccount for correlation (e.g., permutation-based FWER, the Westfall-Young method) exist but\nrequire access to the joint distribution of the test statistics.\n\n**Benjamini-Hochberg step-up (FDR method)**\n\nThe Benjamini-Hochberg (BH) procedure (1995) is the workhorse FDR method. Sort p-values\nascending p_(1) ≤ ... ≤ p_(m). For each rank i, compute the BH threshold (i/m) * Q where Q is\nthe desired FDR level (typically 0.05 or 0.10). Find the largest rank k such that p_(k) ≤\n(k/m) * Q. Reject all hypotheses with rank 1 through k. This is a step-up procedure: it looks\nfor the *largest* k where the condition holds and rejects everything at or below that rank.\n\nBH provably controls FDR at level Q * (true nulls / m) ≤ Q under independent tests, and under\ncertain positive-dependence structures (positive regression dependence on a subset, or PRDS).\nUnder arbitrary dependence, the Benjamini-Yekutieli (BY) extension controls FDR at the cost of\ndividing each threshold by an additional log(m) factor. 
In most claims and EHR applications,\nwhere outcomes are positively correlated (patients with one cardiovascular event tend to have\nelevated risk of others), the standard BH procedure is conservative relative to the true FDR,\nmaking it a defensible choice.\n\n**When each method fits in RWE**\n\nThe decision maps onto the confirmatory-versus-exploratory spectrum:\n\n- *Few pre-specified confirmatory endpoints (1-5)*: use Holm or Bonferroni, because each\n  endpoint is expected to survive as a confirmed finding if the null is rejected. Regulatory\n  submissions require FWER control across the confirmatory testing plan.\n- *Moderate battery of secondary endpoints (5-30)*: if each is pre-specified and the intention\n  is to publish results for each, prefer Holm; if the battery is exploratory and will be followed\n  up, BH is more appropriate. The critical requirement is prespecification: write the adjustment\n  plan into the protocol or statistical analysis plan before any data access.\n- *Hypothesis-free scans (hundreds to thousands of outcomes or codes)*: FDR via BH or BY, or\n  permutation-based FWER via scan statistics (TreeScan). Bonferroni over thousands of correlated\n  codes is so conservative as to be uninformative.\n\n**Gatekeeping and hierarchical testing**\n\nWhen endpoints have a natural priority order — overall survival is the primary endpoint and\nprogression-free survival is secondary — gatekeeping procedures allocate alpha in a structured\nhierarchy: the secondary endpoint is only tested if the primary endpoint passes. This controls\nFWER by construction because no alpha is spent on the secondary if the primary fails. Variants\ninclude fixed-sequence testing (strict ordering), parallel gatekeeping (multiple primaries must\npass before any secondary is tested), and tree gatekeeping (a hierarchy of families). 
These\nprocedures are common in oncology and cardiovascular regulatory submissions, where the ordering\nreflects clinical importance and partial approvals have well-defined regulatory implications.\n\n**Subgroup multiplicity**\n\nEvery subgroup analysis is an additional test in the multiplicity family. If an analyst tests a\ntreatment comparison in males, females, the elderly, and those with baseline comorbidity, the\nchance of finding a spurious interaction in at least one subgroup at alpha = 0.05 is\napproximately 19%, not 5%. Standard pharmacoepidemiology practice treats subgroup analyses as\nexploratory and hypothesis-generating unless pre-specified in the protocol with an explicit\ncorrection plan. Route to subgroup-analysis-hte for the full treatment of heterogeneous treatment\neffect estimation and the multiplicity implications of multiple subgroup comparisons.\n\n**Sequential-look multiplicity**\n\nRepeated interim analyses of accumulating data are a temporal form of the multiple-comparisons\nproblem. Looking at the data every month and applying a fixed alpha = 0.05 each time inflates\nthe family-wise false-alarm rate just as badly as running many simultaneous hypothesis tests.\nSequential analysis via alpha-spending (MaxSPRT, O'Brien-Fleming, Lan-DeMets) provides analogous\nboundaries: the total probability of ever falsely signaling across all planned looks is controlled\nat the pre-specified alpha. The boundary is computed before the first look and governs every\nsubsequent one. 
See maxsprt-sequential-safety-surveillance-rwe for the full mechanics of\nsequential alpha spending in pharmacovigilance contexts.\n\n**The garden of forking paths: silent multiplicity**\n\nThe most insidious multiplicity in observational RWE is the kind that never appears in the\nanalysis plan: the analyst who reports the washout period that gave the \"cleanest\" result, the\ncovariate set selected after seeing the coefficient, the subgroup noted because the p-value\nlooked interesting. Each decision that could have gone differently — and that is made after the\nanalyst has some knowledge of the data — creates a silent test that is never counted in the\nmultiplicity denominator. Gelman and Loken (2014) called this the \"garden of forking paths.\"\nPost-hoc Bonferroni or BH adjustment applied to a selectively reported set of results cannot\nremedy multiplicity that was incurred during data exploration, because the denominator m\nreflects only reported tests, not all considered paths. The only real fix is prespecification:\nan SAP written and locked before database access or unblinding, with a complete multiplicity\nplan covering the primary test, all secondary tests, and the rules for subgroup and sensitivity\nanalyses.\n\n**Large-n datasets and the limits of multiplicity adjustment**\n\nIn very large claims databases with millions of patients, virtually every test is statistically\nsignificant at any conventional alpha. Adjusting p-values for multiplicity does not rescue\ninference when the root problem is that p-values are meaninglessly small for clinically trivial\neffects. A hazard ratio of 1.003 with a Bonferroni-adjusted p-value of 0.0001 is not a\nmeaningful finding. The discipline of multiplicity adjustment was developed to protect against\nfalse positives when true signal is sparse relative to noise; it does not substitute for\ninterpreting effect sizes and clinical relevance. 
Report hazard ratios, risk differences, and\nconfidence intervals alongside any adjusted p-value, especially at the large sample sizes\ntypical in administrative claims research.\n\n**Interpreting the output**\n\nFrom the worked example: 10 secondary endpoints tested in a statin initiation cohort, Bonferroni\nthreshold alpha/m = 0.05/10 = 0.005, and BH at Q = 0.05. Bonferroni rejects 1 hypothesis\n(H01 MI, p = 0.001). BH step-up finds the largest rank k where p_(k) ≤ (k/10)*0.05: rank 1\nthreshold 1/10*0.05 = 0.005 (H01 MI, p=0.001, passes); rank 2 threshold 2/10*0.05 = 0.010\n(H03 stroke, p=0.008, passes); rank 3 threshold 3/10*0.05 = 0.015 (H05 hospitalization,\np=0.012, passes); rank 4 threshold 4/10*0.05 = 0.020 (H07 ER visit, p=0.025, fails). Because a\nstep-up procedure does not stop at the first failure, ranks 5 through 10 must be checked as\nwell; each p-value there also exceeds its threshold (for example, rank 5: p=0.040 against\n0.025), so the largest passing rank is k = 3. BH rejects 3 hypotheses: H01, H03, H05.\n\n*(1) Formal interpretation.* The Bonferroni procedure rejects H01 (MI, p = 0.001) because\n0.001 ≤ 0.005 = alpha/m. Under the Boole inequality, the family-wise error rate across all 10\ntests is controlled at ≤ 0.05 regardless of the correlation structure among the tests. H03\n(stroke, p = 0.008) is not rejected: although p = 0.008 is below the unadjusted alpha = 0.05,\nit exceeds the Bonferroni threshold. The BH procedure rejects H01, H03, and H05 by identifying\nk = 3 as the largest rank satisfying p_(k) ≤ (k/m)*Q. The FDR guarantee for these 3 rejections\nis an expected bound: across many hypothetical replications of the same study and procedure, the\nexpected proportion of false positives among the rejected hypotheses is at most Q = 0.05. This\nis a statement about the long-run average over replications, not a claim that exactly 5% of\nthese three specific rejections are false.\n\n*(2) Practical interpretation.* For a regulatory submission where MI is the confirmatory primary\nendpoint, Bonferroni is appropriate: the chance of any false confirmatory claim is controlled at\n5%, and the stroke result (p = 0.008) is appropriately classified as exploratory. 
For an\ninternal clinical evidence review (deciding which outcomes to carry into further study), BH is\nmore appropriate: MI, stroke, and hospitalization are nominated together as a cluster warranting\nindependent confirmatory analysis, with the understanding that, on average, no more than 5% of\nnominated findings are expected to be false leads. The analyst must communicate clearly which\nprocedure governed the primary inference and treat the other procedure's results as supportive\nor exploratory.\n\n**Pros, cons, and trade-offs**\n\n*Bonferroni/Holm (FWER):*\n- Pros: valid under any correlation structure; conservative and transparent; the kind of FWER\n  control regulators such as FDA and EMA expect for confirmatory analyses; straightforward to\n  apply and communicate.\n- Cons: increasingly conservative as m grows; at m = 100 only p-values at or below 0.0005\n  survive Bonferroni; ignores positive correlation among tests (which could safely allow a less\n  strict threshold).\n- When to prefer: confirmatory regulatory submissions; few pre-specified primary endpoints;\n  settings where any single false positive has serious consequences (approval, label change).\n\n*Benjamini-Hochberg (FDR):*\n- Pros: substantially more powerful than Bonferroni at the same nominal error level; scales\n  gracefully to hundreds or thousands of tests; standard for genomics, metabolomics, and\n  pharmacovigilance outcome scans.\n- Cons: controls an expected average error, not the probability of any false positive, so in\n  any single study more than a fraction Q of the rejections could be false; assumes independence\n  or PRDS (BY extension handles arbitrary dependence at a power cost).\n- When to prefer: hypothesis-generating screens; pharmacovigilance scans; phenome-wide studies;\n  any setting where follow-up validation is planned and some false leads are acceptable.\n\n*Hierarchical/gatekeeping procedures:*\n- Pros: FWER-controlling; structures testing to match clinical priority; allows secondary\n  endpoints to be tested without alpha inflation when the primary succeeds.\n- 
Cons: if the primary fails, all alpha is spent and no secondary information is extracted;\n  requires pre-specification of the full hierarchy before any data access.\n- When to prefer: regulatory submissions with a primary-secondary endpoint hierarchy; oncology\n  trials with OS as gatekeeper for PFS or OS, PFS, and QoL in a fixed sequence.\n\n**When to use**\n\nApply formal multiplicity adjustment whenever:\n- Two or more hypotheses are tested in a single analysis and both will be reported as\n  confirmatory findings, not exploratory observations.\n- A regulatory or HTA body requires control of the family-wise error rate across the study's\n  testing plan.\n- A pharmacovigilance or outcomes scan tests hundreds of outcomes simultaneously — use FDR\n  rather than FWER; consider permutation-based methods (TreeScan) when outcomes are\n  hierarchically structured.\n- An interim analysis plan involves more than one planned look at accumulating data — use\n  sequential alpha-spending, not repeated fixed-alpha tests.\n- Subgroup analyses are pre-specified and must be reported alongside the primary analysis: apply\n  a correction (Holm across subgroups) or commit in the protocol to treating them as exploratory.\n\n**When NOT to use — and when adjustment is actively misleading**\n\n- Do not apply multiplicity adjustment post-hoc to a selectively reported set of outcomes: if 30\n  outcomes were analyzed and only 10 with \"interesting\" results are presented, the multiplicity\n  denominator is wrong, and the apparent correction provides false reassurance. 
The unreported 20\n  are still part of the implicit testing context.\n- Do not treat multiplicity adjustment as a substitute for prespecification: an analyst who runs\n  50 models and then applies BH to the 10 most significant results has not controlled FDR — the\n  unseen 40 are part of the effective multiple-testing context.\n- Do not apply FWER correction to purely exploratory analyses intended to generate hypotheses:\n  doing so creates false confidence that the surviving hypotheses are confirmed when they are\n  actually just the strongest leads from a screen.\n- Do not interpret a multiplicity-adjusted p-value as evidence of clinical importance: at very\n  large n, adjusted p-values remain significant for effect sizes that are clinically irrelevant.\n  Report effect sizes and confidence intervals as the primary evidence and treat any adjusted\n  p-value as a decision aid, not a summary of importance.\n- Do not confuse multiplicity adjustment with confounding control: adjusting p-values does not\n  correct for uncontrolled confounding in the underlying effect estimates — these are separate\n  problems requiring separate solutions.",
    "primary_category": "Inferential_Statistics",
    "tags": [
      "statistics",
      "hypothesis-testing",
      "multiplicity",
      "multiple-comparisons",
      "false-discovery-rate",
      "familywise-error-rate",
      "Bonferroni",
      "Benjamini-Hochberg",
      "FWER",
      "FDR",
      "Holm",
      "gatekeeping",
      "alpha-spending",
      "prespecification"
    ],
    "applies_to_study_types": [
      "cohort_retrospective",
      "cohort_prospective",
      "rct",
      "cross_sectional",
      "active_comparator_new_user",
      "descriptive_analysis",
      "claims_analysis",
      "ehr_study",
      "registry_study"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "primary",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1111/j.2517-6161.1995.tb02031.x",
        "url": "https://doi.org/10.1111/j.2517-6161.1995.tb02031.x",
        "citation_text": "Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B. 1995;57(1):289-300.",
        "year": 1995,
        "authors_short": "Benjamini & Hochberg",
        "notes": "The foundational paper introducing the false discovery rate and the BH step-up procedure, demonstrating it is substantially more powerful than FWER methods while still controlling a meaningful error rate. One of the most-cited statistics papers of the twentieth century; essential reading for any analyst choosing between FWER and FDR approaches."
      },
      {
        "role": "explain",
        "doi": "10.1016/S0895-4356(00)00314-0",
        "url": "https://doi.org/10.1016/S0895-4356(00)00314-0",
        "citation_text": "Bender R, Lange S. Adjusting for multiple testing — when and how? Journal of Clinical Epidemiology. 2001;54(4):343-349.",
        "year": 2001,
        "authors_short": "Bender & Lange",
        "notes": "A practical clinical epidemiology guide to when multiplicity adjustment is warranted and which method to choose, covering the confirmatory-versus-exploratory distinction, the role of prespecification, and common misapplications. Directly applicable to RWE study design decisions about primary and secondary endpoint testing plans."
      }
    ],
    "plain_language_summary": "When a study tests many different outcomes or research questions at once, the chance of finding at least one false positive just by chance grows rapidly — running 20 tests each at the standard 5% threshold gives roughly a 64% chance of at least one spurious result even when nothing is truly different. Multiplicity adjustment is a set of statistical rules for raising the bar when many tests are run together, so the overall false-alarm rate stays controlled. The two main schools are controlling the family-wise error rate (keeping the chance of any false positive below 5%, using Bonferroni or Holm corrections, required for regulatory submissions) and controlling the false discovery rate (keeping the expected share of falsely flagged results below 5%, using Benjamini-Hochberg, suited for exploratory outcome scans). The right choice depends on whether the analysis is confirmatory or exploratory, and the most important protection of all is writing down exactly which tests will be run — and how they will be adjusted — before the data are analyzed.",
    "key_terms": [
      {
        "term": "familywise error rate (FWER)",
        "definition": "The probability of making at least one false positive rejection across all the hypotheses tested together; Bonferroni and Holm adjustments keep this at or below a preset level (usually 5%) regardless of how many tests are run."
      },
      {
        "term": "false discovery rate (FDR)",
        "definition": "The expected proportion of rejected hypotheses that are actually false positives; the Benjamini-Hochberg procedure controls this at a preset level (usually 5% or 10%), allowing more discoveries than FWER methods while still limiting the fraction of errors."
      },
      {
        "term": "Bonferroni correction",
        "definition": "The simplest multiplicity adjustment — divide the target alpha by the number of tests (for example, 0.05 divided by 10 equals 0.005) and reject only hypotheses whose raw p-value falls below that stricter threshold."
      },
      {
        "term": "step-up procedure",
        "definition": "A testing algorithm that ranks all p-values from smallest to largest, gives each rank a progressively less strict threshold, and then works from the largest p-value downward to find the highest rank whose p-value meets its threshold; that hypothesis and every hypothesis ranked below it are rejected. The Benjamini-Hochberg method is the most widely used step-up procedure for controlling the false discovery rate."
      },
      {
        "term": "prespecification",
        "definition": "Committing in writing — before the data are analyzed — exactly which hypotheses will be tested and what correction will be applied, so that choices made after seeing the data cannot silently inflate the false-positive rate beyond what was budgeted."
      },
      {
        "term": "garden of forking paths",
        "definition": "The invisible multiplicity created by analyst decisions during the analysis — such as which outcomes to report, which subgroups to highlight, or which analytical approach to present — that are never counted as formal tests but still inflate the effective false-positive rate."
      }
    ],
    "worked_example": {
      "scenario": "An observational claims study of statin initiation tests 10 outcomes simultaneously — myocardial infarction, stroke, all-cause hospitalization, ER visit, LDL reduction, CRP reduction, QOL score, medication switch, any adverse event, and all-cause mortality — all pre-specified as secondary endpoints. The analyst runs 10 separate Cox models and collects the raw p-values. She applies both Bonferroni and Benjamini-Hochberg corrections at alpha/Q = 0.05 and compares how many hypotheses each procedure rejects.",
      "dataset": {
        "caption": "Raw p-values from 10 Cox models (one per outcome) in a statin initiation observational cohort. P-values appear in their original analysis order; multiplicity procedures sort them internally by rank.",
        "columns": [
          "hypothesis",
          "outcome",
          "raw_p_value"
        ],
        "rows": [
          [
            "H01",
            "myocardial_infarction",
            0.001
          ],
          [
            "H02",
            "LDL_reduction",
            0.04
          ],
          [
            "H03",
            "stroke",
            0.008
          ],
          [
            "H04",
            "medication_switch",
            0.2
          ],
          [
            "H05",
            "hospitalization",
            0.012
          ],
          [
            "H06",
            "all_cause_mortality",
            0.5
          ],
          [
            "H07",
            "ER_visit",
            0.025
          ],
          [
            "H08",
            "any_adverse_event",
            0.35
          ],
          [
            "H09",
            "QOL_score",
            0.12
          ],
          [
            "H10",
            "CRP_reduction",
            0.051
          ]
        ]
      },
      "steps": [
        "Sort the 10 raw p-values from smallest to largest to assign ranks 1 through 10. Sorted order: H01=0.001 (rank 1), H03=0.008 (rank 2), H05=0.012 (rank 3), H07=0.025 (rank 4), H02=0.040 (rank 5), H10=0.051 (rank 6), H09=0.120 (rank 7), H04=0.200 (rank 8), H08=0.350 (rank 9), H06=0.500 (rank 10). Label these p_(1) through p_(10).",
        "Bonferroni correction: the adjusted threshold is 0.05/10 = 0.005. Compare each raw p-value to 0.005. Only H01 (MI, p=0.001) satisfies p at most 0.005. All others, including H03 (stroke, p=0.008), exceed the threshold. Bonferroni rejects exactly 1 hypothesis.",
        "Benjamini-Hochberg step-up: for each rank i, the BH threshold is (i/10)*0.05. Compute thresholds for all 10 ranks, because a step-up procedure must check every rank. Rank 1: 1/10*0.05 = 0.005. Rank 2: 2/10*0.05 = 0.010. Rank 3: 3/10*0.05 = 0.015. Rank 4: 4/10*0.05 = 0.020. Ranks 5 through 10: 0.025, 0.030, 0.035, 0.040, 0.045, 0.050.",
        "Compare each sorted p-value to its BH threshold. p_(1) = 0.001 is at most 0.005 (rank 1 passes). p_(2) = 0.008 is at most 0.010 (rank 2 passes). p_(3) = 0.012 is at most 0.015 (rank 3 passes). p_(4) = 0.025 exceeds 0.020 (rank 4 fails). Unlike a step-down procedure, BH does not stop at the first failure, so keep checking: p_(5) = 0.040 exceeds 0.025, p_(6) = 0.051 exceeds 0.030, p_(7) = 0.120 exceeds 0.035, p_(8) = 0.200 exceeds 0.040, p_(9) = 0.350 exceeds 0.045, and p_(10) = 0.500 exceeds 0.050 (ranks 5 through 10 all fail).",
        "BH step-up rule: find the largest rank k such that p_(k) is at most (k/10)*0.05, then reject all hypotheses with rank 1 through k. The largest such k is 3, because rank 3 passes and no rank from 4 through 10 does. Reject H01 (MI), H03 (stroke), H05 (hospitalization). BH rejects 3 hypotheses."
      ],
      "result": "Bonferroni threshold: 0.05/10 = 0.005. Only H01 (MI, p=0.001) satisfies p at most 0.005; Bonferroni rejects 1 hypothesis. BH step-up: rank 1 threshold 1/10*0.05 = 0.005 (H01 MI p=0.001 passes), rank 2 threshold 2/10*0.05 = 0.010 (H03 stroke p=0.008 passes), rank 3 threshold 3/10*0.05 = 0.015 (H05 hospitalization p=0.012 passes), rank 4 threshold 4/10*0.05 = 0.020 (H07 ER visit p=0.025 fails); ranks 5 through 10 also exceed their thresholds (0.025 through 0.050). Largest passing rank is 3; BH rejects 3 hypotheses (H01 MI, H03 stroke, H05 hospitalization) versus 1 under Bonferroni."
    },
    "prerequisites": [
      "inferential-statistics-foundations",
      "parametric-vs-nonparametric-tests"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Confirmatory regulatory submission: Holm/Bonferroni for 2-5 pre-specified endpoints",
        "description": "For regulatory submissions, all confirmatory endpoints and their adjustment method must be declared in the SAP before database lock. Holm is the default FWER method (uniformly more powerful than Bonferroni). With a single primary endpoint and one secondary, the most common structure is fixed-sequence testing: test the primary at alpha = 0.05; if it passes, carry the full alpha to the secondary. Document the hierarchy and its justification in Section 7 of the SAP alongside the sample-size calculation.",
        "edge_cases": [
          "If the primary endpoint narrowly misses significance, gatekeeping prevents any secondary from being reported as confirmatory — communicate secondary results explicitly as exploratory in the CSR and any publications.",
          "If multiple endpoints are co-primary (e.g., two co-primary at alpha/2 each), Bonferroni across the two is valid but typically overly strict; consider Bonferroni-Holm or a Dunnett procedure if the endpoints are correlated."
        ],
        "data_source_notes": "For claims or registry-based confirmatory studies: lock the database extract date, freeze the analytical dataset, and ensure the p-values in the SAP table of analyses are computed from the same dataset version. Report the multiplicity-adjusted p-value (Holm or Bonferroni, as pre-specified in the SAP) and the unadjusted p-value side by side with the HR and 95% CI."
      },
      {
        "name": "Exploratory secondary endpoint battery: Benjamini-Hochberg FDR",
        "description": "When 10-30 outcomes are pre-specified as exploratory, apply BH at Q = 0.05 or Q = 0.10. Report all adjusted p-values and label the analysis as hypothesis-generating. The BH procedure is applied to all outcomes simultaneously; do not selectively apply it only to outcomes that crossed the unadjusted alpha = 0.05. Pre-specify Q in the protocol.",
        "edge_cases": [
          "If outcomes are clustered by organ system (e.g., all cardiovascular endpoints, then all metabolic endpoints), consider applying BH within each cluster separately and reporting cluster-specific FDR separately from a global FDR.",
          "Outcomes that were pre-specified for exploratory analysis but fail BH can still be reported with their raw p-value and effect estimate; the failure to meet BH does not constitute evidence of no effect, particularly for rare outcomes with wide CIs."
        ],
        "data_source_notes": "Claims: collect one p-value per outcome from the primary model; check that rare outcomes (fewer than 20 events) are flagged because their p-values are unreliable and may distort the BH ranking. EHR: apply BH across lab, vital, and diagnosis code outcomes separately if they represent distinct scientific questions."
      },
      {
        "name": "Pharmacovigilance outcome scan: TreeScan or BH across hundreds of codes",
        "description": "When a safety database has thousands of ICD or MedDRA codes and the analyst wants to flag elevated-incidence outcomes without pre-specification, FDR or hierarchical permutation methods are appropriate. BH applied to hundreds of flat code-specific p-values ignores the hierarchical correlation structure; TreeScan's permutation-on-maximum-LLR exploits the ICD/MedDRA tree and is preferred. BH remains a reasonable simplification when the code set is small (fewer than 100) or when outcomes are near-independent.",
        "edge_cases": [
          "Monte Carlo permutation in TreeScan requires 999-9999 replicates for stable p-values; run at least 4999 replicates for publication-quality analysis.",
          "BH applied to a flat list of hundreds of ICD codes may flag entire ICD chapters; interpret branch-level signals cautiously and route to TreeScan for hierarchical confirmation."
        ],
        "data_source_notes": "Claims and EHR: the ICD-10-CM tree has four levels (chapter, block, category, code); run TreeScan with the full four-level tree. For MedDRA data: use System Organ Class -> High Level Group Term -> Preferred Term hierarchy."
      },
      {
        "name": "Ordered endpoint hierarchy with gatekeeping",
        "description": "When the protocol specifies a hierarchy of endpoints (primary gates secondary), implement as a fixed-sequence procedure: test the primary at full alpha; only if it passes, test the secondary at full alpha (no adjustment). If two primary endpoints share the alpha, apply Holm across the pair (compare the smaller p-value to alpha/2 and, if it passes, the larger to alpha); under parallel gatekeeping, both must pass before any secondary is tested. Parallel and tree gatekeeping are available for more complex hierarchies with multiple families.",
        "edge_cases": [
          "If the study has both non-inferiority and superiority tests on the primary endpoint, each requires its own alpha allocation; document the testing order explicitly.",
          "Post-hoc attempts to reorder the hierarchy after seeing the data — e.g., promoting a secondary to primary because it showed a larger effect — violate the FWER guarantee and are not scientifically defensible."
        ],
        "data_source_notes": "Registry or linked claims-EHR: document the gatekeeping structure in the SAP before data extraction. For HTA submissions, some bodies (e.g., NICE in oncology) accept a fixed-sequence hierarchy for OS and PFS as the standard structure."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "maxsprt-sequential-safety-surveillance-rwe",
        "pros_of_this": "Cross-hypothesis multiplicity methods (Bonferroni, BH) are simpler to apply and communicate than sequential alpha-spending; they do not require specifying a look schedule in advance and are appropriate when all hypotheses are tested at a single final analysis time.",
        "cons_of_this": "MaxSPRT and sequential methods handle the temporal dimension of multiplicity — repeated looks at accumulating data — which cross-hypothesis adjustments cannot address; applying Bonferroni to monthly looks does not control the probability of ever signaling falsely across the full surveillance period.",
        "when_to_prefer": "Use cross-hypothesis multiplicity adjustment (Bonferroni, BH) for fixed-time analyses with multiple outcome hypotheses; use MaxSPRT or group-sequential boundaries when data accumulate over time and decisions must be made at multiple interim looks."
      },
      {
        "compared_to": "tree-based-scan-statistic-rwe",
        "pros_of_this": "Bonferroni and BH are simpler, computationally trivial, and do not require a hierarchical tree structure or Monte Carlo simulation; they are appropriate when outcomes are approximately independent or when the analysis involves a small number of pre-specified outcomes.",
        "cons_of_this": "TreeScan's permutation-on-maximum-LLR exploits the correlation among nested ICD/MedDRA codes and is substantially less conservative than Bonferroni while correctly handling hierarchical structure; Bonferroni over thousands of correlated codes is extremely conservative and ignores the grouping information, while flat BH ignores the tree structure.",
        "when_to_prefer": "Use BH or Bonferroni for small-to-moderate numbers of near-independent outcomes in a pre-specified endpoint battery; use TreeScan for hypothesis-free scans of hundreds of hierarchically structured ICD or MedDRA codes in pharmacovigilance."
      },
      {
        "compared_to": "inferential-statistics-foundations",
        "pros_of_this": "Multiplicity correction provides a principled, pre-specified framework for controlling error rates across families of tests rather than relying on individual alpha levels for each hypothesis independently.",
        "cons_of_this": "The additional complexity of specifying a correction method, a denominator m, and an error-rate target (FWER or FDR) requires careful study design decisions that are easy to make incorrectly or post-hoc; a poorly specified correction can be worse than a clearly labeled unadjusted analysis.",
        "when_to_prefer": "Apply multiplicity correction whenever two or more tests will be reported as confirmatory; for purely descriptive analyses or clearly labeled exploratory work, well-reported unadjusted p-values with effect sizes are more informative than poorly justified corrections."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Claims studies commonly test many outcomes (all hospitalizations, all-cause mortality, specific ICD code-based endpoints) in a single analysis. Collect all raw p-values from the primary models before applying any adjustment. For a pre-specified confirmatory battery of 3-10 outcomes, apply Holm across all outcomes. For hypothesis-generating scans of hundreds of ICD codes, apply BH at Q = 0.05-0.10 or route to TreeScan. At large n (millions of patients), report HRs and 95% CIs as the primary evidence; treat adjusted p-values as secondary.",
      "ehr": "EHR-wide phenome-wide association studies (PheWAS) test thousands of PheCode-based outcomes; use BH or Benjamini-Yekutieli at Q = 0.05. For a targeted battery of pre-specified outcomes (e.g., lab-defined endpoints in a metabolic study), use Holm. Outcomes defined from structured data (ICD codes, lab values) and outcomes defined from NLP (clinical notes) should be treated as separate families unless the SAP pre-specifies them jointly.",
      "registry": "Registry studies with adjudicated endpoints and a small pre-specified outcome list (1-5) use Holm or gatekeeping with the full alpha = 0.05. If the registry supports hypothesis-generating scans across all recorded outcomes, apply BH and label findings as exploratory. Document the adjustment method in the registry protocol before first data access.",
      "primary": "Survey-based studies testing multiple scale scores or subscores should pre-specify the primary scale and treat all subscores as secondary with Holm correction or as exploratory with BH. Bounded psychometric scores (0-100) are often positively correlated; BH retains FDR control under positive regression dependence (PRDS), but if the dependence structure is unclear or some subscores are negatively correlated, consider the BY correction, which is valid under arbitrary dependence at some cost in power.",
      "linked": "Linked claims-EHR-registry cohorts often combine adjudicated clinical endpoints (from registry) with claims-based utilization outcomes and EHR-based biomarkers in a single analysis. Pre-specify the adjustment hierarchy by data source and endpoint type in the SAP: confirmatory clinical endpoints under Holm, exploratory utilization outcomes under BH, and biomarker scans under BH with explicit labeling as hypothesis-generating."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "from statsmodels.stats.multitest import multipletests\nimport pandas as pd\n\n# ── Worked-example dataset: 10 raw p-values from statin initiation cohort ──\noutcomes = [\n    \"MI\", \"LDL_reduction\", \"stroke\", \"medication_switch\",\n    \"hospitalization\", \"all_cause_mortality\", \"ER_visit\",\n    \"any_adverse_event\", \"QOL_score\", \"CRP_reduction\",\n]\np_raw = [0.001, 0.040, 0.008, 0.200, 0.012, 0.500, 0.025, 0.350, 0.120, 0.051]\n\n# ── 1. Bonferroni (FWER): threshold = alpha / m = 0.05 / 10 = 0.005 ──\nrej_bon, p_bon, _, _ = multipletests(p_raw, alpha=0.05, method=\"bonferroni\")\n\n# ── 2. Holm step-down (FWER; uniformly more powerful than Bonferroni) ──\nrej_holm, p_holm, _, _ = multipletests(p_raw, alpha=0.05, method=\"holm\")\n\n# ── 3. Benjamini-Hochberg step-up (FDR at Q = 0.05) ──\nrej_bh, p_bh, _, _ = multipletests(p_raw, alpha=0.05, method=\"fdr_bh\")\n\n# ── 4. Display results ──\ndf = pd.DataFrame({\n    \"outcome\": outcomes,\n    \"raw_p\":   p_raw,\n    \"bon_p\":   [round(p, 4) for p in p_bon],\n    \"holm_p\":  [round(p, 4) for p in p_holm],\n    \"bh_p\":    [round(p, 4) for p in p_bh],\n    \"bon_rej\": rej_bon,\n    \"holm_rej\": rej_holm,\n    \"bh_rej\":  rej_bh,\n})\nprint(\"Multiplicity adjustment — statin cohort 10-outcome battery\")\nprint(df.to_string(index=False))\nprint(f\"\\nBonferroni rejects: {rej_bon.sum()} / 10\")\nprint(f\"Holm rejects:       {rej_holm.sum()} / 10\")\nprint(f\"BH (FDR) rejects:   {rej_bh.sum()} / 10\")\n\n# ── 5. 
Manual BH step-up for transparency ──\n# Sort indices by raw p-value\nm, Q = 10, 0.05\nsorted_idx = sorted(range(m), key=lambda i: p_raw[i])\nprint(\"\\nManual BH step-up (rank, outcome, raw_p, threshold, pass):\")\nk_max = 0\nfor rank, idx in enumerate(sorted_idx, start=1):\n    thresh = rank / m * Q   # i/m * Q\n    passed = p_raw[idx] <= thresh\n    if passed:\n        k_max = rank\n    print(f\"  rank {rank}: {outcomes[idx]:25s} raw_p={p_raw[idx]:.3f}  \"\n          f\"thresh={thresh:.4f}  {'PASS' if passed else 'FAIL'}\")\nprint(f\"Largest passing rank k = {k_max}; reject ranks 1-{k_max}.\")\nreject_manual = {outcomes[sorted_idx[i]] for i in range(k_max)}\nprint(f\"Rejected outcomes: {sorted(reject_manual)}\")",
        "description": "Bonferroni, Holm step-down, and Benjamini-Hochberg FDR adjustment using\nstatsmodels.stats.multitest.multipletests. Demonstrates all three methods on the 10-outcome\nstatin cohort dataset from the worked example. Also shows manual BH step-up computation\nfor pedagogical transparency. No dependencies beyond statsmodels.",
        "dependencies": [
          "statsmodels"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "# ── Worked-example dataset: 10 raw p-values from statin initiation cohort ──\noutcomes <- c(\"MI\", \"LDL_reduction\", \"stroke\", \"medication_switch\",\n              \"hospitalization\", \"all_cause_mortality\", \"ER_visit\",\n              \"any_adverse_event\", \"QOL_score\", \"CRP_reduction\")\np_raw <- c(0.001, 0.040, 0.008, 0.200, 0.012, 0.500, 0.025, 0.350, 0.120, 0.051)\n\n# ── 1. Bonferroni (FWER): each adjusted p = min(m * raw_p, 1) ──\np_bon  <- p.adjust(p_raw, method = \"bonferroni\")\n\n# ── 2. Holm step-down (FWER; always at least as powerful as Bonferroni) ──\np_holm <- p.adjust(p_raw, method = \"holm\")\n\n# ── 3. Benjamini-Hochberg step-up (FDR at Q = 0.05) ──\np_bh   <- p.adjust(p_raw, method = \"BH\")\n\n# ── 4. Benjamini-Yekutieli (FDR under arbitrary dependence; more conservative) ──\np_by   <- p.adjust(p_raw, method = \"BY\")\n\n# ── 5. Collect results and display ──\nresults <- data.frame(\n  outcome   = outcomes,\n  raw_p     = p_raw,\n  bon_adj   = round(p_bon,  4),\n  holm_adj  = round(p_holm, 4),\n  bh_adj    = round(p_bh,   4),\n  by_adj    = round(p_by,   4),\n  bon_rej   = p_bon  <= 0.05,\n  holm_rej  = p_holm <= 0.05,\n  bh_rej    = p_bh   <= 0.05,\n  by_rej    = p_by   <= 0.05\n)\nprint(results)\ncat(sprintf(\"\\nBonferroni rejects: %d / 10\\n\", sum(p_bon  <= 0.05)))\ncat(sprintf(\"Holm rejects:       %d / 10\\n\", sum(p_holm <= 0.05)))\ncat(sprintf(\"BH (FDR) rejects:   %d / 10\\n\", sum(p_bh   <= 0.05)))\ncat(sprintf(\"BY (FDR) rejects:   %d / 10\\n\", sum(p_by   <= 0.05)))\n\n# ── 6. Manual BH step-up for transparency ──\nm <- 10L; Q <- 0.05\nord       <- order(p_raw)               # ascending rank indices\np_sorted  <- p_raw[ord]\nthresholds <- seq_len(m) / m * Q        # i/m * Q for i = 1 ... 
m\npass       <- p_sorted <= thresholds\nk_max      <- max(which(pass), 0)       # largest rank where condition holds\n\ncat(\"\\nManual BH step-up:\\n\")\nfor (i in seq_len(m)) {\n  cat(sprintf(\"  rank %2d: %-25s raw_p=%.3f  thresh=%.4f  %s\\n\",\n              i, outcomes[ord[i]], p_sorted[i], thresholds[i],\n              if (pass[i]) \"PASS\" else \"FAIL\"))\n}\ncat(sprintf(\"Largest passing rank k = %d; reject ranks 1 to %d.\\n\", k_max, k_max))\ncat(sprintf(\"BH-rejected outcomes: %s\\n\",\n            paste(outcomes[ord[seq_len(k_max)]], collapse = \", \")))",
        "description": "Bonferroni, Holm, and Benjamini-Hochberg adjustments via base-R p.adjust(). Demonstrates\nall three methods on the 10-outcome statin cohort dataset. Also shows the Benjamini-Yekutieli\n(BY) adjustment for arbitrary dependence. No external packages required.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* ── Step 1: Create dataset of raw p-values from 10 Cox models ── */\ndata work.raw_p;\n  length hypothesis $5 outcome $25;\n  input hypothesis $ outcome $ raw_p;\n  datalines;\nH01 MI 0.001\nH02 LDL_reduction 0.040\nH03 stroke 0.008\nH04 medication_switch 0.200\nH05 hospitalization 0.012\nH06 all_cause_mortality 0.500\nH07 ER_visit 0.025\nH08 any_adverse_event 0.350\nH09 QOL_score 0.120\nH10 CRP_reduction 0.051\n;\nrun;\n\n/* ── Step 2: PROC MULTTEST — read raw p-values and apply adjustments ──\n   inpvalues(raw_p): the variable \"raw_p\" in work.raw_p contains the raw p-values.\n   bon:  Bonferroni       — adjusted p = min(m * raw_p, 1); controls FWER.\n   holm: Holm step-down   — uniformly more powerful than Bonferroni; controls FWER.\n   fdr:  Benjamini-Hochberg — step-up procedure; controls FDR at Q = 0.05 by default.\n   out=: stores adjusted p-values alongside the input dataset variables.            */\nproc multtest inpvalues(raw_p)=work.raw_p\n              bon\n              holm\n              fdr\n              out=work.adjusted;\nrun;\n\n/* ── Step 3: Merge labels back and print ── */\ndata work.results;\n  merge work.raw_p work.adjusted;\nrun;\nproc print data=work.results noobs;\n  var hypothesis outcome raw_p;\n  title \"Multiplicity-adjusted p-values — statin cohort 10-outcome battery\";\nrun;\n\n/* ── Step 4: DATA step manual Bonferroni check ──\n   Useful to verify PROC MULTTEST output and to document the formula in the SAP. 
*/\ndata work.bonferroni_check;\n  set work.raw_p;\n  m               = 10;\n  alpha           = 0.05;\n  bonf_threshold  = alpha / m;            /* 0.05 / 10 = 0.005               */\n  bonf_adj_p      = min(m * raw_p, 1);    /* adjusted p; reject if <= 0.05   */\n  bonf_reject     = (raw_p <= bonf_threshold); /* 1 = reject, 0 = retain     */\n  label bonf_threshold = \"Bonferroni threshold (alpha/m)\"\n        bonf_adj_p    = \"Bonferroni adjusted p\"\n        bonf_reject   = \"Bonferroni reject (1=yes)\";\nrun;\nproc print data=work.bonferroni_check noobs label;\n  var hypothesis outcome raw_p bonf_threshold bonf_adj_p bonf_reject;\nrun;\n\n/* ── Step 5: Manual BH step-up in a DATA step ── */\n/* Sort by raw p-value, assign ranks, and flag each rank against its threshold */\nproc sort data=work.raw_p out=work.raw_p_sorted; by raw_p; run;\ndata work.bh_check;\n  set work.raw_p_sorted;\n  rank_i   = _n_;            /* rank 1 = smallest p-value               */\n  m        = 10;\n  Q        = 0.05;\n  bh_thresh = rank_i / m * Q;            /* i/m * Q                     */\n  bh_pass   = (raw_p <= bh_thresh);      /* 1 = passes at this rank     */\n  label rank_i   = \"BH rank\"\n        bh_thresh = \"BH threshold (i/m*Q)\"\n        bh_pass   = \"BH pass (1=yes)\";\nrun;\nproc print data=work.bh_check noobs label; run;\n\n/* BH rejects every rank up to the largest passing rank, even if an earlier rank fails */\nproc sql;\n  select max(rank_i) as k_max label=\"Largest passing BH rank\"\n    from work.bh_check\n    where bh_pass = 1;\nquit;",
        "description": "Bonferroni, Holm, and Benjamini-Hochberg adjustment via PROC MULTTEST with the INPVALUES\noption, which accepts a SAS dataset of pre-computed raw p-values. Also demonstrates a DATA\nstep manual Bonferroni check. PROC MULTTEST is the SAS standard for multiplicity adjustment\nin submissions and is accepted by FDA/EMA.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Q[How many hypotheses?<br/>What is the scientific goal?] --> Conf[Few endpoints<br/>1-5, confirmatory]\n  Q --> Expl[Moderate battery<br/>5-30, exploratory]\n  Q --> Scan[Many outcomes<br/>>30, hypothesis-free scan]\n  Conf --> Ordered[Ordered by<br/>clinical priority?]\n  Ordered -->|Yes| Gate[\"Gatekeeping / fixed-sequence<br/>(test primary first; secondary<br/>only if primary passes)\"]\n  Ordered -->|No| Holm[\"Holm step-down<br/>(default FWER;<br/>more powerful than Bonferroni)\"]\n  Expl --> BH[\"Benjamini-Hochberg<br/>step-up at Q = 0.05-0.10<br/>(FDR; label as exploratory)\"]\n  Scan --> Hier[Outcomes hierarchically<br/>structured ICD/MedDRA?]\n  Hier -->|Yes| Tree[\"TreeScan permutation FWER<br/>(see tree-based-scan-statistic-rwe)\"]\n  Hier -->|No| BH2[\"Benjamini-Hochberg FDR<br/>or Benjamini-Yekutieli<br/>(for arbitrary dependence)\"]\n  Q --> Seq[Sequential interim<br/>looks over time?]\n  Seq --> MaxS[\"MaxSPRT / alpha-spending<br/>(see maxsprt-sequential-safety-surveillance-rwe)\"]",
        "caption": "Decision tree for selecting a multiplicity adjustment method by number of hypotheses, scientific goal, and data structure. Gatekeeping and fixed-sequence procedures handle ordered confirmatory families; Holm handles small unordered FWER families; BH handles exploratory batteries; TreeScan handles hierarchically structured outcome scans.",
        "alt_text": "Flowchart branching on number of hypotheses and goal into gatekeeping or Holm for confirmatory work, BH for exploratory batteries, TreeScan or BH for large scans, and MaxSPRT for sequential looks over time.",
        "source_type": "illustrative",
        "source_citations": [
          "benjamini-hochberg-1995"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "requires",
        "target_slug": "inferential-statistics-foundations",
        "notes": "Multiplicity adjustment builds directly on the basic inferential machinery — p-values, significance thresholds, type-I error, and hypothesis testing — covered in inferential-statistics-foundations; the concept of controlling error across a family of tests presupposes a clear understanding of what the error rate means for a single test."
      },
      {
        "relation_type": "see_also",
        "target_slug": "maxsprt-sequential-safety-surveillance-rwe",
        "notes": "Sequential-look multiplicity is the temporal analog of cross-hypothesis multiplicity: MaxSPRT controls the family-wise false-alarm rate across repeated looks at accumulating data using alpha-spending boundaries, while Bonferroni/BH control FWER or FDR across a fixed set of simultaneous hypotheses; the two entries jointly cover multiplicity in the time dimension and the hypothesis dimension."
      },
      {
        "relation_type": "see_also",
        "target_slug": "tree-based-scan-statistic-rwe",
        "notes": "TreeScan exists specifically because of multiplicity: it controls FWER across thousands of correlated, hierarchically nested ICD/MedDRA outcome codes using permutation on the maximum log-likelihood ratio, avoiding the brutal conservatism of Bonferroni and the independence assumption of BH; TreeScan is the appropriate multiplicity method when outcomes form a tree rather than a flat list."
      },
      {
        "relation_type": "see_also",
        "target_slug": "subgroup-analysis-hte",
        "notes": "Subgroup analyses multiply the number of tests by the number of subgroups and are one of the most common sources of inflated false-positive rates in observational research; the multiplicity principles covered here — prespecification, Holm correction across subgroups, and clear labeling of subgroup results as exploratory — apply directly to interpreting and designing subgroup-HTE analyses."
      },
      {
        "relation_type": "see_also",
        "target_slug": "sample-size-power-precision-rwe",
        "notes": "Sample size calculations must account for the chosen multiplicity correction: the per-comparison power and the family-wise power are different targets, and the Bonferroni or Holm adjustment changes the per-test alpha threshold and therefore the required n to maintain family-wise power at a given level."
      }
    ],
    "aliases": [
      "multiple comparisons",
      "multiple testing",
      "multiplicity adjustment",
      "Bonferroni correction",
      "Holm procedure",
      "Benjamini-Hochberg procedure",
      "false discovery rate",
      "familywise error rate"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "nci-comorbidity-index-seer-medicare-rwe",
    "name": "NCI Comorbidity Index for SEER-Medicare",
    "short_definition": "A cancer-focused claims comorbidity score derived from Medicare claims in SEER-Medicare studies, commonly built with NCI SAS macros using diagnosis and procedure evidence before cancer diagnosis.",
    "long_description": "The NCI Comorbidity Index is a SEER-Medicare-oriented implementation of claims-based comorbidity measurement for cancer outcomes research. It adapts the Charlson/Deyo/Klabunde lineage to cancer registry-linked Medicare claims and provides practical SAS macro support for calculating comorbidity variables from Medicare claims before a cancer diagnosis or study index date.\n\nThe core operational idea is familiar: use baseline claims to identify comorbid conditions, exclude codes that are likely part of the cancer diagnosis or treatment process, apply source-specific rules, and summarize burden for risk adjustment. What makes the NCI implementation distinctive is its SEER-Medicare cancer context, including cancer-site concerns, timing relative to diagnosis, and macro-based reproducibility.\n\nAnalysts should avoid treating the NCI Comorbidity Index as a generic all-disease score. It was designed for cancer registry-linked Medicare data. It is strongest when the research question, population age, claim types, and baseline window match that context; outside it, Charlson/Quan, Elixhauser, frailty, or disease-specific covariates may be more appropriate.\n\n**Pros, cons, and trade-offs.** The NCI implementation is useful because it is anchored in the SEER-Medicare cancer research workflow and has reproducible macro support. It handles cancer timing and baseline claims context more directly than a generic comorbidity score. The trade-off is portability. The score inherits assumptions about Medicare FFS observability, older cancer populations, registry linkage, diagnosis timing, and macro version. 
It may not transport to Medicare Advantage, commercial oncology data, younger patients, or EHR-only cohorts without revalidation.\n\n**When to use.** Use the NCI Comorbidity Index when the study is SEER-Medicare or closely comparable registry-linked Medicare cancer research, baseline comorbidity is measured before diagnosis or treatment, and the analysis can preserve macro version, claim files, lookback window, and component flags. It is especially appropriate when reviewers expect SEER-Medicare conventions.\n\n**When NOT to use - and when it is actively misleading.** Do not use it as a universal cancer severity score, as a post-diagnosis disease-burden measure, or as a generic comorbidity score outside Medicare cancer cohorts. It is actively misleading to count cancer diagnosis or treatment-process codes as baseline comorbidity, or to ignore incomplete FFS observability caused by Medicare Advantage enrollment.",
    "primary_category": "Bias_Control",
    "tags": [
      "nci-comorbidity-index",
      "seer-medicare",
      "cancer",
      "claims",
      "comorbidity",
      "risk-adjustment",
      "sas-macro"
    ],
    "applies_to_study_types": [
      "oncology_rwe",
      "registry_linkage",
      "claims_analysis",
      "comparative_effectiveness"
    ],
    "data_sources": [
      "registry",
      "claims",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1016/j.annepidem.2007.03.011",
        "url": "https://doi.org/10.1016/j.annepidem.2007.03.011",
        "citation_text": "Klabunde CN, Legler JM, Warren JL, Baldwin LM, Schrag D. A refined comorbidity measurement algorithm for claims-based studies of breast, prostate, colorectal, and lung cancer patients. Annals of Epidemiology. 2007;17(8):584-590.",
        "year": 2007,
        "authors_short": "Klabunde et al.",
        "notes": "Refined claims-based cancer comorbidity algorithm in the NCI/SEER-Medicare lineage."
      },
      {
        "role": "explain",
        "doi": null,
        "url": "https://healthcaredelivery.cancer.gov/seermedicare/considerations/comorbidity.html",
        "citation_text": "National Cancer Institute. Comorbidity. SEER-Medicare: Considerations for Analysis.",
        "year": 2026,
        "authors_short": "NCI",
        "notes": "NCI guidance on comorbidity in SEER-Medicare analyses."
      },
      {
        "role": "explain",
        "doi": null,
        "url": "https://healthcaredelivery.cancer.gov/seermedicare/considerations/calculation.html",
        "citation_text": "National Cancer Institute. Calculation of Comorbidity Weights. SEER-Medicare: Considerations for Analysis.",
        "year": 2026,
        "authors_short": "NCI",
        "notes": "NCI calculation guidance for comorbidity weights."
      },
      {
        "role": "use",
        "doi": null,
        "url": "https://healthcaredelivery.cancer.gov/seermedicare/considerations/macro-2021.html",
        "citation_text": "National Cancer Institute. Comorbidity SAS Macro, 2021 Version. SEER-Medicare: Considerations for Analysis.",
        "year": 2021,
        "authors_short": "NCI",
        "notes": "NCI macro documentation used to operationalize the index."
      }
    ],
    "plain_language_summary": "The NCI Comorbidity Index is a practical cancer-research comorbidity score used with SEER-Medicare claims. It helps researchers account for how sick cancer patients were before diagnosis or treatment, using Medicare claim evidence and NCI-supported macros.",
    "key_terms": [
      {
        "term": "SEER-Medicare",
        "definition": "Linked cancer registry and Medicare claims data used for population-based cancer outcomes research."
      },
      {
        "term": "Comorbidity macro",
        "definition": "NCI-provided SAS code that implements comorbidity calculations from Medicare claims."
      },
      {
        "term": "Baseline comorbidity",
        "definition": "Pre-index disease burden measured before cancer diagnosis or treatment start."
      }
    ],
    "worked_example": {
      "scenario": "A SEER-Medicare lung cancer study defines baseline comorbidity during the 12 months before diagnosis, excluding the diagnosis month. The patient has claims for COPD, congestive heart failure, and diabetes before diagnosis; metastatic cancer codes after diagnosis are not counted as baseline comorbidity.",
      "dataset": {
        "caption": "Simplified baseline comorbidity evidence.",
        "columns": [
          "claim_month",
          "claim_source",
          "condition_signal",
          "counted_for_nci_index"
        ],
        "rows": [
          [
            -11,
            "MedPAR",
            "COPD",
            "yes"
          ],
          [
            -7,
            "Carrier claim",
            "congestive heart failure",
            "yes if rule-out criteria met"
          ],
          [
            -3,
            "outpatient claim",
            "diabetes",
            "yes if repeated or supported by macro rule"
          ],
          [
            2,
            "oncology claim",
            "metastatic cancer",
            "no; post-diagnosis cancer process"
          ]
        ]
      },
      "steps": [
        "Require complete Medicare FFS observability for the baseline window.",
        "Apply the NCI macro using claim files and diagnosis timing specified in the analysis plan.",
        "Exclude the diagnosis month and cancer-related conditions as instructed.",
        "Carry the score, component flags, and macro version into the analytic archive."
      ],
      "result": "COPD, CHF, and diabetes contribute to baseline illness burden; post-diagnosis metastatic cancer does not become a baseline comorbidity flag."
    },
    "prerequisites": [],
    "index_definitions": [
      {
        "name": "NCI Comorbidity Index",
        "definition": "Claims-based cancer comorbidity score used in SEER-Medicare research, implemented through NCI calculation guidance and SAS macros.",
        "source": "NCI Healthcare Delivery Research Program",
        "use": "Baseline risk adjustment in cancer registry-linked Medicare studies.",
        "notes": "Use the macro version and documentation matching the study period and claim files."
      },
      {
        "name": "NCI Comorbidity SAS macro",
        "definition": "NCI-distributed SAS implementation for calculating comorbidity variables from SEER-Medicare claims.",
        "source": "NCI SEER-Medicare macro documentation",
        "use": "Reproducible score construction in Medicare cancer cohorts.",
        "notes": "Store macro version, input files, and parameter settings in the study archive."
      },
      {
        "name": "Charlson/Deyo/Klabunde cancer adaptation",
        "definition": "The methodological lineage of claims-based comorbidity construction adapted for cancer outcomes analyses.",
        "source": "SEER-Medicare comorbidity methods literature",
        "use": "Interpretation of what the NCI score represents and how it differs from generic CCI.",
        "notes": "Cancer diagnosis and treatment timing can contaminate baseline comorbidity if not handled explicitly."
      }
    ],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "NCI score as continuous covariate",
        "description": "Use the calculated score directly in regression or propensity models.",
        "edge_cases": [
          "Sparse high scores can be influential.",
          "Score distribution differs by cancer site."
        ],
        "data_source_notes": "Report distribution and consider categories for Table 1."
      },
      {
        "name": "Component comorbidity flags",
        "description": "Use the condition indicators rather than only the collapsed score.",
        "edge_cases": [
          "More parameters and sparse cells.",
          "Some conditions may interact with treatment choice."
        ],
        "data_source_notes": "Useful when specific comorbidities are clinically important confounders."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Quan-Charlson or Elixhauser",
        "use_nci_when": "Working in SEER-Medicare cancer cohorts and following NCI macro conventions.",
        "use_other_indices_when": "Working outside cancer, outside Medicare FFS, or requiring broader non-cancer comorbidity capture.",
        "notes": "NCI score is cancer-contextual and not automatically portable."
      }
    ],
    "implementation_notes_by_data_source": {
      "linked": "Requires registry diagnosis context plus Medicare FFS claims; MA-only months undermine baseline capture.",
      "claims": "Without cancer registry context, the NCI implementation loses part of its rationale."
    },
    "implementations": [],
    "diagrams": [],
    "relations": [
      {
        "relation_type": "see_also",
        "target_slug": "charlson-comorbidity-index-rwe",
        "notes": "NCI comorbidity measurement descends from Charlson/Deyo/Klabunde claims comorbidity methods."
      },
      {
        "relation_type": "see_also",
        "target_slug": "disease-registry",
        "notes": "SEER-Medicare depends on registry linkage for cancer variables."
      },
      {
        "relation_type": "used_with",
        "target_slug": "single-arm-external-control",
        "notes": "Cancer external-control studies often require careful baseline comorbidity adjustment."
      }
    ],
    "aliases": [
      "NCI comorbidity index",
      "NCI comorbidity score",
      "SEER-Medicare comorbidity index",
      "SEER Medicare comorbidity score",
      "NCI comorbidity macro",
      "Klabunde comorbidity"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "ncpdp-pharmacy-claim-fields",
    "name": "NCPDP Pharmacy Claim Fields",
    "short_definition": "Pharmacy dispensing records transmitted under the NCPDP Telecommunication Standard (D.0) and adjudicated in real time at the point of sale by pharmacy benefit managers and commercial payers; each transaction carries an 11-digit NDC, quantity dispensed, days_supply, fill date, refill number, prescriber and pharmacy NPIs, a DAW (Dispense-as-Written) code, and cost-sharing amounts — the fields from which every adherence metric, generic-substitution analysis, and drug-utilization study in pharmacoepidemiology is constructed.",
    "long_description": "**What NCPDP pharmacy claims are and why they differ from institutional and professional claims**\n\nPharmacy claims are fundamentally different from the two claim types that anchor the rest of US\nadministrative data — the UB-04/837I institutional claim and the CMS-1500/837P professional claim.\nThe institutional and professional claims are submitted in batch after services are rendered and\nadjudicated days to weeks later. A pharmacy claim is adjudicated **in real time at the point of\nsale**: the pharmacist's dispensing software sends the NCPDP transaction to the pharmacy benefit\nmanager (PBM) or payer in seconds, receives an approval or rejection response before the patient\nleaves the counter, and the resulting paid claim lands in the insurer's data warehouse within\nhours. That real-time loop has a direct consequence for research: pharmacy claims data are\nsubstantially cleaner, more complete, and closer to the time of the service event than any other\nUS administrative data source. The clean data, however, carry their own failure modes — chiefly in\nthe `days_supply` field — that are unique to the pharmacy transaction model and can silently corrupt\nevery adherence measure downstream.\n\n**The NCPDP Telecommunication Standard (D.0) and the NCPDP SCRIPT Standard**\n\nThe **NCPDP D.0** (version D, implemented 2009 under HIPAA) is the transaction set that governs\nreal-time pharmacy claim submission, adjudication, and response. Every retail, mail-order, and\nspecialty pharmacy in the United States uses D.0 transactions for benefit claims. The NCPDP SCRIPT\nstandard governs electronic prescribing (e-prescribing), routing new prescriptions and refill\nrequests between prescribers and pharmacies; it is the upstream source of the prescriber NPI that\nappears on the dispensing claim. 
Researchers encounter these transaction records as \"pharmacy\nclaims\" in research databases (Medicare Part D Prescription Drug Events, commercial PBM extracts,\nMedicaid drug claims), but they originate from NCPDP D.0 transactions. Understanding this origin\nexplains why pharmacy claims lack the institutional-claim complexity of interim bills, replacement\nclaims, and TOB frequency codes: a dispensing event either happened or it did not, and reversals\nare their own transaction type.\n\n**Research-critical fields**\n\n*NDC (11-digit, 5-4-2 HIPAA format):* The National Drug Code identifying the exact drug product\ndispensed: labeler, product, and package. NDC is the primary key for all exposure code lists. A\nsingle drug and strength may span dozens or hundreds of NDCs across generic manufacturers and\nrepackagers; a code list must be exhaustive or it silently undercounts exposure. NDC is covered in\ndepth in its own catalog entry (see `ndc-national-drug-code`).\n\n*Quantity dispensed:* The number of dosage units (tablets, capsules, milliliters) dispensed on\nthis fill. Used with the NDC and `days_supply` to confirm the supply is plausible: 30 tablets of a\nonce-daily medication at a `days_supply` of 30 is consistent; 30 tablets at a `days_supply` of 90\nsignals a data-entry error or an atypical regimen such as one tablet every third day. Quantity\nalone does not determine coverage duration; `days_supply` does.\n\n*days_supply:* The number of days the dispensed quantity is intended to last. This is the single\nmost consequential field in pharmacoepidemiology research: every adherence metric (PDC, MPR),\nevery drug episode construction, and every index-date new-user window depends on it. It is also\nthe field most prone to systematic measurement error, discussed in depth below.\n\n*Fill date (date_of_service / prescription_service_date):* The date the prescription was\ndispensed. Fill date defines when a coverage interval begins and is the anchor for all\nlongitudinal analyses. 
Because pharmacy claims are real-time, the fill date is generally\ntrustworthy; the main exception is mail-order fills, where the dispensing date may precede\nactual patient receipt by several days.\n\n*Refill number / new-versus-refill flag:* The fill number within an authorization sequence (0 =\nnew prescription, 1 = first refill, etc.) or a binary new/refill indicator. This field is used in\nnew-user design implementation as a check that the index fill starts a new prescription (refill\nnumber = 0) rather than continuing an existing one. Because a renewed prescription for ongoing\ntherapy also carries refill number 0, the field supplements but does not replace a washout period.\nIn databases where refill number is unavailable, an incident new-user definition relies entirely\non a washout period with no fills of the drug class to approximate the first fill.\n\n*Prescriber NPI:* The National Provider Identifier of the clinician who wrote the prescription.\nUsed to attribute dispensing events to a prescriber, link to specialty and practice characteristics,\nand study prescribing patterns. Completeness is generally high for Part D and commercial data, but\ndata entry of the NPI at the pharmacy counter is imperfect, and some older records carry outdated\nor zeroed-out NPIs.\n\n*Pharmacy NPI / pharmacy identifier:* The NPI of the dispensing pharmacy, supplemented by NCPDP\nprovider number and chain store codes. Used in mail-order vs retail channel analyses, specialty\npharmacy attribution, and geographic access studies. 
Mail-order pharmacies (high `days_supply`,\nNCPDP chain = mail) vs retail pharmacies (typical 30-day `days_supply`) behave differently for\nadherence and refill pattern studies, and pooling them without stratification can distort PDC\nestimates.\n\n*DAW code (Dispense-as-Written):* A one-digit code transmitted by the pharmacy indicating the\nbrand/generic substitution status of the fill: 0 = no product selection indicated (the default;\ntypically a generic fill, but also used when a single-source brand is dispensed); 1 = substitution\nnot allowed by prescriber (brand required per prescriber's written DAW); 2 = substitution allowed,\npatient requested brand; 3–9 = other scenarios (for example, substitution allowed with a\npharmacist-selected product, or substitution not allowed because the brand is mandated by law).\nDAW code is the primary field for generic-substitution studies:\nan exposure algorithm that counts all fills regardless of DAW cannot distinguish prescriber-mandated\nbrand use (DAW = 1) from voluntary brand use (DAW = 2) from fills with no product selection\nindicated (DAW = 0, usually generic; confirm brand or generic status from the NDC). In\ncomparative effectiveness research where the treatment contrast is brand vs generic, misclassifying\nDAW codes is a direct form of exposure misclassification.\n\n*Copay / plan-paid amounts:* The patient copay (cost-sharing), total ingredient cost, dispensing\nfee, and plan-paid amount are all present in adjudicated pharmacy claim records. These fields\nsupport cost-sharing and access analyses, out-of-pocket burden studies, and cost-effectiveness\ninputs. 
A critical limitation: the ingredient cost on the claim is the PBM's adjudicated allowed\namount, not the manufacturer's list price and not the actual rebate-adjusted net price paid by the\npayer; studies of drug cost should be clear about which cost concept they are measuring.\n\n*Transaction type / service type code:* Identifies whether the record is a paid claim, a reversal\n(B2 or transaction type 21), or a credit — see Reversal Transactions below.\n\n**days_supply: the most consequential field and its measurement error**\n\n`days_supply` is entered by the pharmacist or dispensing system at the time of fill and is subject\nto systematic data-entry patterns that produce bias in adherence metrics:\n\n*28 vs 30 day conventions:* Some pharmacies dispense 28-day supplies rather than 30-day supplies,\nparticularly for medications with weekly dosing (e.g., weekly bisphosphonates dispensed as 4-week\nblister packs = `days_supply` 28). In a 90-day window, a patient with three 28-day fills covers\nonly 84 days, yielding PDC = 84/90 = 0.933 vs PDC = 90/90 = 1.0 for three 30-day fills.\nStratifying by `days_supply` or using the actual pharmacy channel (mail, retail) and dosing\nschedule before pooling avoids conflating these conventions.\n\n*Insulin dosing and variable supply estimation:* Insulin is dispensed in units (a vial of 10 mL\nof U-100 insulin contains 1,000 units), but the `days_supply` depends on the patient's individual\ndaily dose, which varies by body weight, insulin resistance, and titration and is not directly\nrecorded in the claim. Pharmacies estimate `days_supply` from the prescribed dose, the dispensed\nquantity, and standard dosing assumptions. These estimates are notoriously unreliable: a patient\nwho uses 50 units/day will deplete a 10 mL vial in 20 days, but if the pharmacy defaults to a\nstandard 30-day supply estimation, the claim records `days_supply` = 30, overstating actual\ncoverage. 
Studies of insulin adherence using PDC should validate `days_supply` distributions\nagainst clinical expectations and consider sensitivity analyses with alternative assumptions.\n\n*Inhaler supply and dose-counting:* Similarly, inhaler `days_supply` depends on the number of\npuffs per day and the canister's rated dose count, neither of which is standardized in the claim.\nA rescue (short-acting) inhaler used PRN cannot have a meaningful `days_supply` — the claim field\nis meaningless for adherence purposes. Controller (maintenance) inhalers have a more predictable\ndosing pattern, but `days_supply` still requires validation against clinical dosing conventions\nfor the specific product.\n\n**Reversal transactions (transaction type B2)**\n\nWhen a previously paid pharmacy claim is voided — because the patient returned the medication, the\nclaim was submitted in error, or the adjudication is being corrected — the pharmacy submits a\n**reversal transaction** (identified in NCPDP D.0 by transaction type B2, or in claim files by a\ntransaction type / claim status code indicating void or reversal). A reversal does not physically\ncancel the original record in most research databases; instead, both the original paid claim and the\nreversal record appear as separate rows. An analyst who does not remove paired reversal records will\ncount a drug fill that was ultimately never dispensed and retained by the patient. The clean approach\nis to identify claim pairs (same patient, same NDC, same fill date, same pharmacy) where a reversal\nmatches the original, and drop both. 
In some data products, reversals are pre-processed and only net\npaid claims are delivered; always verify the vendor's data dictionary to determine whether reversals\nare present in the extract and require analyst-level removal.\n\n**Compound drug flags**\n\nCompounded prescriptions — custom drug preparations mixed by the pharmacy rather than manufactured\nby a labeler — appear in pharmacy claims with NDC codes that are not registered in the FDA NDC\nDirectory. Some data systems flag these with a compound drug indicator or a specific service type\ncode. Compounded drugs are common in some specialty areas (pediatrics, pain management, dermatology)\nand should be identified and handled separately from manufactured drugs in exposure code lists, as\ntheir NDC codes carry no standard product attributes.\n\n**What pharmacy claims do NOT capture — and the exposure-misclassification consequences**\n\n- **Inpatient hospital drugs:** Medications administered during a hospital admission are part of the\n  facility service, paid under the DRG or per-diem rate, and appear on the UB-04 institutional claim\n  — not in pharmacy claims. A patient hospitalized for 5 days who is taking a maintenance medication\n  will show a gap in pharmacy claims during those 5 days even if they received the drug in-hospital.\n  Failure to account for inpatient periods in adherence denominators (PDC or MPR) makes these\n  patients appear falsely non-adherent. The standard correction is to exclude inpatient days from\n  both the numerator and denominator of PDC during hospitalization spans identified from institutional\n  claims.\n\n- **Cash-paid fills and discount-card fills (GoodRx, other discount programs):** Prescriptions paid\n  entirely out-of-pocket — whether at the true cash price or through a third-party discount card\n  such as GoodRx — do not generate an insurance claim and are therefore completely invisible in\n  claims databases. 
For commonly prescribed generic drugs that are available for $4 or less at\n  many retail pharmacies, a material fraction of fills may be paid cash. This produces systematic\n  underestimation of adherence and distorts new-user definitions: a patient may have been taking\n  the drug for months via cash fills before appearing in claims as a \"new user.\" The magnitude of\n  this leakage varies dramatically by drug, formulary tier, and patient demographics, and cannot\n  be corrected without external data.\n\n- **Physician samples:** Drug samples distributed by pharmaceutical representatives at the prescriber's\n  office do not generate a pharmacy claim. In therapeutic areas where aggressive sampling is common\n  (some branded oral agents, some inhalers), the sample period can extend weeks to months before the\n  patient converts to a filled prescription. New-user designs that anchor the index date at the first\n  claim fill may misclassify the true initiation date by the length of the sample period.\n\n- **Over-the-counter (OTC) drugs:** Medications sold without a prescription, including many\n  antihistamines, analgesics, and proton pump inhibitors, are generally not covered by\n  pharmacy benefits and do not appear in claims. For conditions treated with both Rx and OTC products\n  (e.g., heartburn with omeprazole), adherence measured from pharmacy claims captures only the\n  prescription component.\n\n**Medical benefit vs pharmacy benefit drugs**\n\nA research-critical distinction: not all drugs in clinical use flow through pharmacy claims.\n**Infused biologics** (e.g., infliximab for rheumatoid arthritis, bevacizumab for cancer, vedolizumab\nfor IBD) are typically administered in an infusion center or physician's office and billed as a\nphysician-administered drug on the medical benefit, appearing as a HCPCS Level II J-code on the\nUB-04 institutional claim or CMS-1500 professional claim, not as an NDC on a pharmacy claim. 
A\nstudy of biologic adherence that uses only pharmacy claims will miss every infusion visit for\ninfused biologics, severely underestimating exposure. Researchers must query both the pharmacy\nbenefit (NDC-based fills from NCPDP claims) and the medical benefit (J-code lines from institutional\nand professional claims) to capture the full utilization of drugs with both oral and infused\nformulations. Some specialty drugs are available in both a self-administered subcutaneous form\n(pharmacy benefit, NDC) and an infused form (medical benefit, J-code), requiring separate capture\nof each channel.\n\n**Pros, cons, and trade-offs — specific and comparative**\n\n- **vs UB-04 / 837I institutional claims:** Pharmacy claims capture every retail and mail-order\n  dispensing event with high completeness; institutional claims capture inpatient drug administration\n  but not outpatient dispensing. The two are complementary, not redundant: a complete drug-exposure\n  algorithm for a treatment that is sometimes infused (biologic) and sometimes oral (small molecule)\n  must combine NCPDP pharmacy claims with institutional/professional claim J-code lines. For purely\n  self-administered oral medications, pharmacy claims are the primary and often sole data source.\n\n- **vs CMS-1500 / 837P professional claims:** Professional claims carry J-codes for\n  physician-administered drugs (e.g., subcutaneous injections administered in office) with line-level\n  NDC on drug lines; pharmacy claims carry the full dispensed NDC for patient-self-administered\n  drugs. For drugs administered in office vs self-administered, the two claim types are necessary\n  complements. 
Professional claims do not carry `days_supply` or refill information, making adherence\n  computation impossible from professional claims alone.\n\n- **vs EHR prescribing records:** EHR prescription orders capture the prescribing intent but not\n  the fill, so primary non-adherence (never picking up the prescription) is completely invisible in\n  EHR order data. Pharmacy claims capture the actual dispensing event. PDC from prescribing data\n  systematically overstates true coverage. When EHR and pharmacy claims are linked, orders with no\n  subsequent fill identify primary non-adherence, and the gap between the order date and first fill\n  date measures the delay in filling.\n\n- **Real-time adjudication as a data quality advantage:** Because pharmacy claims are adjudicated in\n  real time at the point of sale, they contain essentially no administrative lag: the fill date in\n  the claim is the actual dispensing date, and paid claims reflect current benefit enrollment. This\n  is in sharp contrast to institutional claims, where claim submission and adjudication may lag\n  the service dates by weeks and where run-out periods of 60–90 days are needed\n  before utilization data are considered complete.\n\n**When to use**\n\nUse NCPDP pharmacy claim fields as the primary data source whenever the research question requires:\n(1) drug exposure ascertainment for self-administered, outpatient-dispensed medications; (2) any\nadherence measure requiring `days_supply` (PDC, MPR, persistence); (3) new-user cohort definitions\nbased on first fill of a drug or drug class; (4) generic substitution analyses using DAW codes;\n(5) drug-level cost-sharing and out-of-pocket burden analyses using copay and plan-paid amounts;\n(6) refill sequencing and line-of-therapy analyses; (7) pharmacy channel stratification (retail vs\nmail-order vs specialty). 
Pharmacy claims from Medicare Part D, commercial PBM data, and Medicaid\nare the primary data sources for insurance-paid outpatient pharmacy utilization and adherence\nresearch in the United States.\n\n**When NOT to use, and when pharmacy claims are actively misleading or dangerous**\n\n- **For inpatient drug exposure:** Drugs administered during hospital stays appear on UB-04 revenue\n  code lines (revenue code 0250 for pharmacy, specific drug NDC on drug lines) or are bundled into\n  the DRG payment with no line-level detail. Pharmacy claims show a gap during inpatient periods;\n  using raw pharmacy claims without excluding inpatient days from the denominator produces false\n  non-adherence in hospitalized patients. For any drug used in both inpatient and outpatient\n  settings, query both data sources.\n\n- **For infused biologics and J-coded drugs as the sole data source:** A pharmacy-claims-only\n  analysis of a drug class that includes both self-administered forms and infused forms (e.g.,\n  self-injected subcutaneous vs infused TNF inhibitors) will miss every infusion, producing a\n  biased and incomplete exposure classification. Always pair pharmacy claims with J-code lines from\n  institutional and professional claims for drugs with infusion formulations.\n\n- **For adherence measurement in the absence of continuous pharmacy benefit enrollment:** A pharmacy\n  claim is observable only if the patient has an active pharmacy benefit on the fill date. If a\n  patient's pharmacy benefit lapses for one month and then restarts, the month of no claims looks\n  identical to a month of non-adherence. PDC denominator windows must be restricted to periods of\n  continuous pharmacy enrollment, not just medical benefit enrollment. 
Failure to verify pharmacy\n  benefit enrollment generates false non-adherence in gap periods.\n\n- **For drugs with unreliable `days_supply` estimation (insulin, inhalers, PRN medications):**\n  Adherence metrics that rely on `days_supply` are unreliable for these drug classes without\n  additional validation of the field's clinical plausibility. A sensitivity analysis using\n  alternative `days_supply` assumptions (e.g., computing expected days from quantity dispensed and\n  standard dosing information from the label) should accompany any adherence analysis for\n  variable-dose drugs.\n\n- **As evidence of drug use during periods outside the claims database's pharmacy benefit capture:**\n  Patients enrolled in Medicare Advantage (MA) plans generate encounter data rather than FFS\n  pharmacy claims for some data products. MA pharmacy claims may be systematically incomplete in\n  some databases. Do not score MA-only members' prescription coverage from pharmacy claims without\n  verifying that the data product includes full MA Part D PDE records.\n\n**Interpreting the output**\n\nThe characteristic artifact of an NCPDP pharmacy claims analysis is a patient-level PDC score\ncomputed from fill dates and `days_supply` values. Taking the worked example below, the output is\nPDC = 85/90 ≈ 0.944 over a 90-day window.\n\n*Formal interpretation:* A PDC of 85/90 = 0.944 means that in 85 of the 90 calendar days in the\nobservation window, this patient had metformin supply on hand based on dispensing dates and\npharmacy-reported `days_supply` values. PDC is a descriptive measure of observed fill-pattern\ncoverage, not a causal estimand; it is bounded in [0, 1] and summarizes cumulative day-level\ncoverage without reference to the patient's actual pill-taking behavior. 
The 5-day uncovered gap\n(March 2–6) is observable from the raw fill records; whether it represents delayed refill, travel,\nor a coverage decision is not recoverable from the claim alone.\n\n*Practical interpretation:* This patient is adherent by the conventional PDC ≥ 0.80 threshold.\nThe 5-day gap early in March is a small deviation from perfect coverage. For a clinical decision\nor quality-measure context, this patient would be counted in the adherent numerator. For a\ncomparative effectiveness analysis, caution is warranted: PDC 0.944 is a single scalar that erases\nthe timing and pattern of the gap. A patient with five scattered 1-day gaps has\nthe same score but a different exposure trajectory. When the timing of non-adherence matters (e.g.,\nadherence in the first 90 days vs second 90 days post-discharge), report coverage windows, not\njust the summary PDC.",
    "primary_category": "Data_Standard",
    "tags": [
      "coding-system",
      "data-standard",
      "primitive",
      "claims",
      "pharmacy",
      "ncpdp",
      "days-supply",
      "ndc",
      "daw-code",
      "adherence",
      "pdc",
      "drug-exposure",
      "refill",
      "reversal",
      "pharmacy-benefit"
    ],
    "applies_to_study_types": [
      "claims_analysis",
      "cohort_retrospective",
      "drug_utilization",
      "comparative_effectiveness",
      "adherence_study",
      "new_user",
      "cost_analysis"
    ],
    "data_sources": [
      "claims",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1016/j.jclinepi.2004.10.012",
        "url": "https://doi.org/10.1016/j.jclinepi.2004.10.012",
        "citation_text": "Schneeweiss S, Avorn J. A review of uses of health care utilization databases for epidemiologic research on therapeutics. Journal of Clinical Epidemiology. 2005;58(4):323-337.",
        "year": 2005,
        "authors_short": "Schneeweiss & Avorn",
        "notes": "Foundational review of health care utilization databases for pharmacoepidemiology, including pharmacy claims data derived from NCPDP transactions; establishes the canonical framework for using administrative dispensing records as the primary source of drug exposure ascertainment and articulates the data completeness, accuracy, and limitation considerations that govern all subsequent pharmacy claims methods."
      },
      {
        "role": "explain",
        "doi": "10.1002/pds.3718",
        "url": "https://doi.org/10.1002/pds.3718",
        "citation_text": "Burden AM, Paterson JM, Gruneir A, Cadarette SM. Adherence to osteoporosis pharmacotherapy is underestimated using days supply values in electronic pharmacy claims data. Pharmacoepidemiology and Drug Safety. 2015;24(5):537-543.",
        "year": 2015,
        "authors_short": "Burden et al.",
        "notes": "Demonstrates that days_supply values recorded in electronic pharmacy claims systematically underestimate actual drug supply for osteoporosis medications; provides empirical evidence of measurement error in the most consequential pharmacy claim field, directly relevant to PDC computation from NCPDP dispensing records and to sensitivity analysis recommendations for adherence studies."
      },
      {
        "role": "demonstrate",
        "doi": null,
        "url": "https://www.cms.gov/medicare/coverage/prescription-drug-coverage-contracting/prescription-drug-benefit-manual",
        "citation_text": "Centers for Medicare & Medicaid Services. Medicare Prescription Drug Benefit Manual, Chapter 14: Coordination of Benefits. Baltimore, MD: CMS. Available from the CMS Prescription Drug Benefit Manual page.",
        "year": 2023,
        "authors_short": "CMS",
        "notes": "CMS documentation of the Medicare Part D Prescription Drug Event (PDE) record structure, which is the Medicare implementation of NCPDP D.0 pharmacy claim data; defines all mandatory fields including NDC, days_supply, fill date, refill number, DAW code, and transaction type (including reversal transactions), as used by Part D plan sponsors and accessible to researchers via the CMS Research Data Assistance Center (ResDAC)."
      },
      {
        "role": "use",
        "doi": null,
        "url": "https://standards.ncpdp.org/access-to-standards.aspx",
        "citation_text": "National Council for Prescription Drug Programs (NCPDP). NCPDP Telecommunication Standard Implementation Guide, Version D.0. Scottsdale, AZ: NCPDP. Available at: https://standards.ncpdp.org/access-to-standards.aspx.",
        "year": null,
        "authors_short": "NCPDP",
        "notes": "The normative specification for the NCPDP D.0 transaction standard, which governs all real-time pharmacy claim submission and adjudication in the United States; defines the field-level structure, data types, code sets, and transaction type identifiers (including B2 reversal transactions) that appear in all US pharmacy claims research databases. Full access requires NCPDP membership; transaction field semantics are described in publicly available CMS data documentation and research database data dictionaries."
      },
      {
        "role": "use",
        "doi": null,
        "url": "https://resdac.org/cms-data/files/pde",
        "citation_text": "Research Data Assistance Center (ResDAC). Medicare Part D Prescription Drug Event (PDE) Data. Centers for Medicare & Medicaid Services. Minneapolis, MN: ResDAC; accessed 2024.",
        "year": 2024,
        "authors_short": "ResDAC",
        "notes": "ResDAC data documentation for the Medicare Part D Prescription Drug Event file, which is the Medicare FFS implementation of NCPDP D.0 pharmacy claims; documents field layout, variable definitions, and the relationship between PDE records and NCPDP transaction fields including NDC, days_supply, fill date, DAW code, copay, and plan-paid amounts."
      }
    ],
    "plain_language_summary": "When a patient picks up a prescription at a pharmacy, the pharmacy instantly sends a digital transaction to the insurance company and receives approval in seconds — that real-time record is an NCPDP pharmacy claim, and it contains three fields that researchers rely on above all others: the drug's unique 11-digit code (NDC), the fill date, and the number of days the supply is meant to last (days_supply). Unlike hospital or doctor-visit claims that take weeks to process, pharmacy data land in the database the same day the prescription is filled, making them unusually clean and timely — but the days_supply field is entered by the pharmacist and can be wrong, especially for insulin and inhalers, so every adherence calculation built on it should be validated. One important gap: pharmacy claims cannot see drugs given in hospitals, free samples, or prescriptions paid with a discount card like GoodRx.",
    "key_terms": [
      {
        "term": "NCPDP D.0",
        "definition": "The National Council for Prescription Drug Programs Telecommunication Standard (version D.0) — the electronic transaction format used by all US pharmacies to submit prescription claims for real-time adjudication."
      },
      {
        "term": "days_supply",
        "definition": "The pharmacist-recorded number of days the dispensed quantity is intended to last (for example, 30 for a one-month supply of a once-daily tablet); the single most important field for computing medication adherence metrics."
      },
      {
        "term": "DAW code",
        "definition": "A one-digit Dispense-as-Written code that records whether a brand-name drug was dispensed because the prescriber required it (DAW = 1), the patient requested it (DAW = 2), or a generic was substituted as allowed (DAW = 0)."
      },
      {
        "term": "reversal transaction",
        "definition": "An NCPDP transaction type (B2) that voids a previously paid pharmacy claim — for example when a patient returns unused medication — and must be removed from research data to avoid counting a fill that was never actually kept."
      },
      {
        "term": "refill number",
        "definition": "A sequential counter on each fill (0 = new prescription, 1 = first refill, etc.) used to identify the index fill for new-user cohort designs."
      },
      {
        "term": "pharmacy benefit vs medical benefit",
        "definition": "Drugs dispensed at a pharmacy flow through the pharmacy benefit and appear as NCPDP claims with NDC codes; drugs infused in a clinic flow through the medical benefit and appear on hospital or doctor claims with HCPCS J-codes instead."
      }
    ],
    "worked_example": {
      "scenario": "An outcomes analyst is measuring metformin adherence for a newly diagnosed type 2 diabetes patient (person_id 5501) over the first 90 days after their prescription start date of 2023-01-01. The analyst pulls all paid NCPDP pharmacy claims for metformin (NDC 00093-5080-05, metformin HCl 500 mg tablets) for this patient, after first removing any reversal transactions. Three fills appear. The analyst needs to compute the proportion of days covered (PDC) using the union rule — counting each calendar day once regardless of overlap — and identify the gap in coverage.",
      "dataset": {
        "caption": "Three paid NCPDP pharmacy claims for patient 5501 (reversals pre-removed). NDC normalized to 11-digit HIPAA format. The 90-day observation window is 2023-01-01 through 2023-03-31.",
        "columns": [
          "person_id",
          "fill_date",
          "ndc11",
          "drug_name",
          "days_supply",
          "quantity_dispensed",
          "daw_code",
          "plan_paid_amt",
          "refill_number"
        ],
        "rows": [
          [
            5501,
            "2023-01-01",
            "00093508005",
            "metformin HCl 500 mg",
            30,
            60,
            0,
            12.4,
            0
          ],
          [
            5501,
            "2023-01-31",
            "00093508005",
            "metformin HCl 500 mg",
            30,
            60,
            0,
            12.4,
            1
          ],
          [
            5501,
            "2023-03-07",
            "00093508005",
            "metformin HCl 500 mg",
            30,
            60,
            0,
            12.4,
            2
          ]
        ]
      },
      "steps": [
        "Confirm index fill: refill_number = 0 on the 2023-01-01 fill, confirming this is a new prescription. The 90-day observation window runs from 2023-01-01 through 2023-03-31 (January has 31 days, February has 28 days, March has 31 days; 31 + 28 + 31 = 90 days).",
        "Fill A (2023-01-01, days_supply 30): coverage spans 2023-01-01 through 2023-01-30 (last covered day = fill_date + days_supply - 1 = Jan 1 + 29 days = Jan 30). Days covered within window = 30.",
        "Fill B (2023-01-31, days_supply 30): immediately follows Fill A with no gap. Coverage spans 2023-01-31 through 2023-03-01. Days covered within window = 30. Fills A and B together form one contiguous block of 30 + 30 = 60 covered days (Jan 1 through Mar 1).",
        "Gap: 2023-03-02 through 2023-03-06 — five days with no supply on hand. Fill B's last covered day is Mar 1; Fill C does not begin until Mar 7. Gap length = 5 days.",
        "Fill C (2023-03-07, days_supply 30): coverage would extend to 2023-04-05, but the observation window closes 2023-03-31. Days covered within window = 31 - 7 + 1 = 25 (Mar 7 through Mar 31, inclusive).",
        "Total unique covered days in the 90-day window = 30 + 30 + 25 = 85. Gap days not covered = 5.",
        "DAW code = 0 on all three fills, confirming generic metformin was dispensed without any prescriber or patient brand mandate. Total plan-paid across three fills = 12.40 + 12.40 + 12.40 = 37.20 dollars."
      ],
      "result": "PDC = 85 / 90 ≈ 0.944. This patient is classified as adherent (PDC exceeds the 0.80 conventional threshold). The 5-day gap in early March (Mar 2–6) is observable from the fill record and represents a late refill rather than treatment discontinuation, since Fill C arrives 6 days after Fill B's supply runs out. The plan paid 12.40 + 12.40 + 12.40 = 37.20 dollars for three fills; zero patient copay (typical for a Tier 1 generic).",
      "timeline_spec": {
        "title": "NCPDP pharmacy fills for patient 5501: metformin days_supply coverage in 90-day window",
        "window": {
          "start": "2023-01-01",
          "end": "2023-03-31",
          "label": "90-day observation window (denominator)"
        },
        "events": [
          {
            "label": "Fill A (new Rx)",
            "start": "2023-01-01",
            "length_days": 30,
            "quantity": "30 days_supply, DAW 0"
          },
          {
            "label": "Fill B (refill 1)",
            "start": "2023-01-31",
            "length_days": 30,
            "quantity": "30 days_supply, DAW 0"
          },
          {
            "label": "Fill C (refill 2, extends past window)",
            "start": "2023-03-07",
            "length_days": 30,
            "quantity": "30 days_supply; 25 days within window"
          }
        ],
        "spans": [
          {
            "kind": "covered",
            "start": "2023-01-01",
            "end": "2023-03-01",
            "label": "Fills A+B contiguous: 60 covered days"
          },
          {
            "kind": "gap",
            "start": "2023-03-02",
            "end": "2023-03-06",
            "label": "5-day gap (late refill)"
          },
          {
            "kind": "covered",
            "start": "2023-03-07",
            "end": "2023-03-31",
            "label": "Fill C within window: 25 covered days"
          }
        ],
        "result": {
          "label": "85 covered days / 90 window days = PDC 0.944",
          "value": 0.944
        }
      }
    },
    "prerequisites": [
      "ndc-national-drug-code",
      "pdc-proportion-of-days-covered"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Retail vs mail-order channel stratification",
        "description": "Retail pharmacy fills (typically 30-day days_supply) and mail-order fills (typically 90-day days_supply) behave differently in adherence analyses. Mail-order patients tend to have higher raw PDC because the longer supply interval means fewer refill opportunities and fewer gaps from late refills. Analyses that pool retail and mail-order without stratifying confound the channel effect with true adherence differences. Identify channel from NCPDP pharmacy NPI, chain store identifier, or a vendor-supplied dispensing channel variable.",
        "edge_cases": [
          "A patient who switches from retail to mail-order mid-window will have a heterogeneous days_supply distribution; the PDC computation is still valid (union of all coverage intervals) but the interpretation of a \"refill gap\" must account for the different refill cycle lengths.",
          "Specialty pharmacy fills — for high-cost biologics and rare-disease drugs — may appear with 30-day days_supply even when the clinical dosing interval is monthly or longer; validate against label dosing before computing adherence."
        ],
        "data_source_notes": "Medicare Part D PDE: includes a pharmacy type variable distinguishing retail, mail, long-term care, specialty, and other channels. Commercial PBM data: typically includes NCPDP pharmacy identifier (chain and store number) from which channel can be inferred."
      },
      {
        "name": "Reversal transaction removal",
        "description": "Reversal claims (NCPDP transaction type B2) must be identified and removed along with their corresponding original paid claims before computing any utilization or adherence metric. In data products that deliver raw claim-level files, both the original and the reversal are present as separate rows; in others, reversals are pre-processed and only net paid records are delivered. Always verify the vendor's processing status before applying analyst-level reversal removal.",
        "edge_cases": [
          "A reversal submitted days or weeks after the original fill may appear in a different calendar month extract; longitudinal studies must check across extract boundaries for orphaned reversals.",
          "Some databases represent reversals as negative quantity or negative cost rather than a separate transaction type code; identify by checking for negative days_supply or plan_paid values, which are clinically impossible for a genuine dispensing event."
        ],
        "data_source_notes": "Medicare Part D PDE: CMS delivers PDE records in adjudicated status; reversal processing is generally pre-applied but analysts should verify via the PDE adjustment indicator variable. Commercial PBM extracts: reversal handling varies by vendor and contract; read the data dictionary before assuming net-paid-only delivery."
      },
      {
        "name": "Inpatient days exclusion from PDC denominator",
        "description": "During inpatient hospital stays, facility-provided drugs replace outpatient dispensing; pharmacy claims show a coverage gap that is not real non-adherence. The standard correction is to identify inpatient stay days from UB-04/MedPAR institutional claims and subtract those days from the PDC denominator window, also removing any supply already credited to those days from the numerator.",
        "edge_cases": [
          "Observation stays (billed as outpatient via TOB 013x) are clinically similar to inpatient stays but are billed outpatient; the patient may or may not receive facility drugs during an observation stay. Standard practice is to exclude overnight observation stays from the denominator as well, though this is a sensitivity analysis choice.",
          "Short inpatient stays of 1–2 days may not affect PDC materially; the correction matters most for patients with frequent or prolonged hospitalizations, where naive PDC underestimates adherence among the sickest patients — exactly the population where accurate measurement is most consequential for comparative effectiveness validity."
        ],
        "data_source_notes": "Requires linkage to institutional claims (Medicare MedPAR or commercial facility file) to identify inpatient from/through dates. Apply inpatient exclusion to both the PDC numerator and denominator: numerator loses covered days that fall within inpatient spans; denominator is reduced by the inpatient day count."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "ub-04-institutional-claim-fields",
        "pros_of_this": "Pharmacy claims are real-time adjudicated and complete for outpatient self-administered prescriptions; days_supply enables coverage-interval construction for adherence metrics; fills are patient-retained (no institutional administration confusion); data typically lack the interim-bill and replacement-claim complexity of institutional claims.",
        "cons_of_this": "Pharmacy claims are blind to inpatient drug administration (bundled into DRG), cash/discount fills (GoodRx, $4 generics), physician-administered drugs (J-codes on medical claims), free samples, and OTC drugs; days_supply for insulin and inhalers carries systematic measurement error that biases adherence metrics.",
        "when_to_prefer": "Prefer NCPDP pharmacy claims for self-administered outpatient drug exposure and adherence measurement. Prefer institutional claims for inpatient drug administration, care-setting context, and for excluding inpatient periods from pharmacy claim denominators. Use both together for drugs with inpatient and outpatient components."
      },
      {
        "compared_to": "cms-1500-professional-claim-fields",
        "pros_of_this": "Pharmacy claims carry days_supply and refill number — fields absent from professional claims — enabling adherence metric computation and incident fill identification. NDC on pharmacy claims is the dispensed product NDC, not a billed J-code; it is more granular for drug-level analyses.",
        "cons_of_this": "Professional claims carry J-codes for physician-administered drugs that never appear in pharmacy claims; for drugs with infusion formulations, professional claims are the only claims-based data source and must be combined with pharmacy claims for a complete exposure picture.",
        "when_to_prefer": "Prefer pharmacy claims for self-administered medication adherence and utilization. Prefer professional claims for physician-administered drugs (J-codes) and prescriber-level attribution. Combine both for drugs available in both administration routes."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "For Medicare Part D PDE: fields include service date (fill date), days_supply, quantity dispensed, compound code, DAW code, patient pay amount, total drug cost, and adjustment indicator (for reversal identification). NDC is in the PDE NDC field; normalize to 11-digit HIPAA format before joining to external drug dictionaries. Filter to final-action PDEs (adjustment indicator = 0 or equivalent per CMS documentation); reversals are generally pre-processed by CMS but verify. Require continuous Part D enrollment across the PDC window using the beneficiary enrollment file. For commercial PBM data: apply reversal removal using transaction type or negative amounts; stratify by dispensing channel; validate days_supply distributions before computing PDC. For Medicaid pharmacy claims: NDC formatting varies by state; normalize before pooling states or joining to national drug references.",
      "linked": "When linking NCPDP pharmacy claims to EHR data: join on patient ID and fill_date vs order_date (allow ±7 days for primary non-adherence detection). NDC-to-RxNorm mapping via OMOP vocabulary tables or NLM RxNorm API is required for cross-system joins. When linking to institutional claims for inpatient exclusion: identify inpatient stay dates from UB-04 from/through date and TOB, then exclude inpatient days from PDC numerator and denominator."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\nimport numpy as np\nfrom datetime import timedelta\n\n# ---------------------------------------------------------------------------\n# 1. REVERSAL REMOVAL\n#    Remove paid-claim / B2-reversal pairs from the raw pharmacy claims extract.\n#    Reversal records are identified by transaction_type == 'B2' or by negative\n#    plan_paid_amt, depending on the data vendor's encoding.\n# ---------------------------------------------------------------------------\n\ndef remove_reversals(df: pd.DataFrame) -> pd.DataFrame:\n    \"\"\"\n    Remove reversal transactions and their corresponding original paid claims.\n\n    Strategy:\n      - If transaction_type column is present: mark B2 records and their pair\n      - If absent: flag rows with negative plan_paid_amt (vendor-specific encoding)\n    Returns only net paid claims.\n    \"\"\"\n    df = df.copy()\n    if \"transaction_type\" in df.columns:\n        # Drop reversal records and their originals\n        reversal_keys = df.loc[df[\"transaction_type\"] == \"B2\",\n                               [\"person_id\", \"fill_date\", \"ndc11\"]].drop_duplicates()\n        df = df.merge(reversal_keys, on=[\"person_id\", \"fill_date\", \"ndc11\"],\n                      how=\"left\", indicator=True)\n        # Remove both the reversal AND any original with same person/date/ndc\n        df = df[df[\"_merge\"] == \"left_only\"].drop(columns=[\"_merge\"])\n        df = df[df[\"transaction_type\"] != \"B2\"]\n    else:\n        # Fallback: negative plan_paid_amt indicates a reversal in some encodings\n        df = df[df.get(\"plan_paid_amt\", pd.Series([0] * len(df))) >= 0]\n    return df.reset_index(drop=True)\n\n\n# ---------------------------------------------------------------------------\n# 2. 
PDC COMPUTATION (union rule, fixed window, with optional inpatient exclusion)\n# ---------------------------------------------------------------------------\n\ndef compute_pdc(\n    claims: pd.DataFrame,\n    window_start: str,\n    window_end: str,\n    inpatient_spans: list[tuple[str, str]] | None = None,\n) -> dict:\n    \"\"\"\n    Compute PDC for one patient's pharmacy claims over a fixed observation window.\n\n    Parameters\n    ----------\n    claims : DataFrame with fill_date (str/date) and days_supply (int) columns.\n    window_start, window_end : ISO date strings defining the denominator window.\n    inpatient_spans : list of (admission_date, discharge_date) str tuples to\n        exclude from both numerator and denominator.\n\n    Returns dict with covered_days, window_days (eligible days after inpatient\n    exclusion), pdc, gap_days.\n    \"\"\"\n    w_start = pd.Timestamp(window_start)\n    w_end   = pd.Timestamp(window_end)\n    window_days = (w_end - w_start).days + 1   # inclusive\n\n    # Build a boolean array: one entry per calendar day in the window\n    coverage = np.zeros(window_days, dtype=bool)\n    inpatient_mask = np.zeros(window_days, dtype=bool)\n\n    claims = claims.copy()\n    claims[\"fill_date\"] = pd.to_datetime(claims[\"fill_date\"])\n\n    for _, row in claims.iterrows():\n        # Offset of the fill from window start; negative if the fill precedes the window\n        offset    = (row[\"fill_date\"] - w_start).days\n        start_idx = max(0, offset)\n        # Supply ends at offset + days_supply, so carry-in supply is not overcounted\n        end_idx   = min(window_days, offset + int(row[\"days_supply\"]))\n        # Skip fills whose supply lies entirely outside the window\n        if start_idx < end_idx:\n            coverage[start_idx:end_idx] = True\n\n    # Mark inpatient days (exclude from numerator + denominator)\n    if inpatient_spans:\n        for adm, dis in inpatient_spans:\n            adm_ts = pd.Timestamp(adm); dis_ts = pd.Timestamp(dis)\n            s = max(0, (adm_ts - w_start).days)\n            e = min(window_days, (dis_ts - w_start).days + 1)\n            # Skip stays that fall entirely outside the window\n            if s < e:\n                inpatient_mask[s:e] = True\n\n    active = ~inpatient_mask   # days eligible for PDC counting\n    covered_days = 
int(coverage[active].sum())\n    eligible_days = int(active.sum())\n    pdc = covered_days / eligible_days if eligible_days > 0 else None\n    gap_days = eligible_days - covered_days\n\n    return {\n        \"covered_days\":  covered_days,\n        \"window_days\":   eligible_days,\n        \"pdc\":           round(pdc, 4) if pdc is not None else None,\n        \"gap_days\":      gap_days,\n    }\n\n\n# ---------------------------------------------------------------------------\n# 3. DAW CODE SUMMARY (generic substitution analysis)\n# ---------------------------------------------------------------------------\n\ndef daw_summary(df: pd.DataFrame) -> pd.DataFrame:\n    \"\"\"\n    Summarize DAW (Dispense As Written) code distribution across pharmacy claims.\n    DAW 0 = no product selection indicated\n    DAW 1 = substitution not allowed by prescriber\n    DAW 2 = substitution allowed, patient requested product dispensed\n    Returns a DataFrame with fill counts and percentage by DAW code.\n    \"\"\"\n    # Labels follow the NCPDP DAW / Product Selection Code value set\n    daw_labels = {\n        0: \"No product selection indicated\",\n        1: \"Substitution not allowed by prescriber\",\n        2: \"Substitution allowed - patient requested product dispensed\",\n        3: \"Substitution allowed - pharmacist selected product dispensed\",\n        4: \"Substitution allowed - generic drug not in stock\",\n        5: \"Substitution allowed - brand drug dispensed as a generic\",\n        6: \"Override\",\n        7: \"Substitution not allowed - brand drug mandated by law\",\n        8: \"Substitution allowed - generic drug not available in marketplace\",\n        9: \"Substitution allowed by prescriber but plan requests brand (Other in pre-D.0 versions)\",\n    }\n    counts = df[\"daw_code\"].value_counts().rename(\"fill_count\").reset_index()\n    counts.columns = [\"daw_code\", \"fill_count\"]\n    counts[\"description\"] = counts[\"daw_code\"].map(daw_labels).fillna(\"Unknown\")\n    counts[\"pct\"] = (counts[\"fill_count\"] / counts[\"fill_count\"].sum() * 100).round(1)\n    return counts.sort_values(\"daw_code\")\n\n\n# ---------------------------------------------------------------------------\n# WORKED EXAMPLE (reproduces patient 5501 from the catalog entry)\n# 
---------------------------------------------------------------------------\n\nif __name__ == \"__main__\":\n    raw_claims = pd.DataFrame({\n        \"person_id\":    [5501, 5501, 5501],\n        \"fill_date\":    [\"2023-01-01\", \"2023-01-31\", \"2023-03-07\"],\n        \"ndc11\":        [\"00093508005\"] * 3,\n        \"drug_name\":    [\"metformin HCl 500 mg\"] * 3,\n        \"days_supply\":  [30, 30, 30],\n        \"quantity_dispensed\": [60, 60, 60],\n        \"daw_code\":     [0, 0, 0],\n        \"plan_paid_amt\":[12.40, 12.40, 12.40],\n        \"refill_number\":[0, 1, 2],\n        # No transaction_type column -> reversal fallback to plan_paid_amt check\n    })\n\n    net_claims = remove_reversals(raw_claims)\n    print(f\"Net paid claims after reversal check: {len(net_claims)}\")\n\n    result = compute_pdc(net_claims, \"2023-01-01\", \"2023-03-31\")\n    print(f\"\\nPDC result for patient 5501:\")\n    print(f\"  Covered days : {result['covered_days']}\")   # expected: 85\n    print(f\"  Window days  : {result['window_days']}\")    # expected: 90\n    print(f\"  PDC          : {result['pdc']}\")            # expected: 0.9444\n    print(f\"  Gap days     : {result['gap_days']}\")       # expected: 5\n\n    daw = daw_summary(net_claims)\n    print(f\"\\nDAW code distribution:\")\n    print(daw.to_string(index=False))",
        "description": "Core pharmacy claims operations: (1) load NCPDP claim records, remove reversal transactions, and validate days_supply; (2) compute PDC using the union rule for a fixed observation window with inpatient exclusion; (3) extract DAW code summary for generic substitution analysis. Operates on a pandas DataFrame with the field names typical of a PBM or Part D PDE extract. The worked example patient (person_id 5501) is reproduced exactly, confirming the 30 + 30 + 25 = 85 covered days and PDC = 85/90 ≈ 0.944 result.",
        "dependencies": [
          "pandas",
          "numpy"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(dplyr)\nlibrary(lubridate)\n\n# ---------------------------------------------------------------------------\n# 1. REVERSAL REMOVAL\n# ---------------------------------------------------------------------------\n\nremove_reversals <- function(df) {\n  # If transaction_type column is present, drop B2 records and their originals\n  if (\"transaction_type\" %in% names(df)) {\n    reversal_keys <- df |>\n      filter(transaction_type == \"B2\") |>\n      select(person_id, fill_date, ndc11) |>\n      distinct() |>\n      mutate(.reversal = TRUE)\n    df <- df |>\n      left_join(reversal_keys, by = c(\"person_id\", \"fill_date\", \"ndc11\")) |>\n      filter(is.na(.reversal), transaction_type != \"B2\") |>\n      select(-.reversal)\n  } else if (\"plan_paid_amt\" %in% names(df)) {\n    # Fallback: negative plan_paid_amt signals reversal\n    df <- df |> filter(plan_paid_amt >= 0)\n  }\n  df\n}\n\n\n# ---------------------------------------------------------------------------\n# 2. 
PDC COMPUTATION (union rule, fixed window, optional inpatient exclusion)\n# ---------------------------------------------------------------------------\n\ncompute_pdc <- function(claims, window_start, window_end,\n                        inpatient_spans = NULL) {\n  # window_start, window_end: character \"YYYY-MM-DD\"\n  w_start <- as.Date(window_start)\n  w_end   <- as.Date(window_end)\n  all_days <- seq(w_start, w_end, by = \"day\")\n  n_window <- length(all_days)\n\n  # Build coverage vector (0/1 per calendar day in window)\n  covered <- logical(n_window)\n  for (i in seq_len(nrow(claims))) {\n    fill_d <- as.Date(claims$fill_date[i])\n    ds     <- as.integer(claims$days_supply[i])\n    last_d <- fill_d + ds - 1L\n    # Clip to window\n    eff_start <- max(fill_d, w_start)\n    eff_end   <- min(last_d, w_end)\n    if (eff_start <= eff_end) {\n      idx <- as.integer(eff_start - w_start) + 1L\n      idx_e <- as.integer(eff_end - w_start) + 1L\n      covered[idx:idx_e] <- TRUE\n    }\n  }\n\n  # Mark inpatient days\n  inpatient <- logical(n_window)\n  if (!is.null(inpatient_spans)) {\n    for (span in inpatient_spans) {\n      adm <- as.Date(span[1]); dis <- as.Date(span[2])\n      eff_s <- max(adm, w_start); eff_e <- min(dis, w_end)\n      if (eff_s <= eff_e) {\n        idx_s <- as.integer(eff_s - w_start) + 1L\n        idx_e <- as.integer(eff_e - w_start) + 1L\n        inpatient[idx_s:idx_e] <- TRUE\n      }\n    }\n  }\n\n  active       <- !inpatient\n  covered_days <- sum(covered[active])\n  window_days  <- sum(active)\n  pdc          <- if (window_days > 0) covered_days / window_days else NA_real_\n  gap_days     <- window_days - covered_days\n\n  list(\n    covered_days = covered_days,\n    window_days  = window_days,\n    pdc          = round(pdc, 4),\n    gap_days     = gap_days\n  )\n}\n\n\n# ---------------------------------------------------------------------------\n# 3. 
DAW CODE SUMMARY\n# ---------------------------------------------------------------------------\n\ndaw_summary <- function(df) {\n  # Labels follow the NCPDP DAW / Product Selection Code value set\n  daw_labels <- c(\n    \"0\" = \"No product selection indicated\",\n    \"1\" = \"Substitution not allowed by prescriber\",\n    \"2\" = \"Substitution allowed - patient requested product dispensed\",\n    \"3\" = \"Substitution allowed - pharmacist selected product dispensed\",\n    \"4\" = \"Substitution allowed - generic drug not in stock\",\n    \"5\" = \"Substitution allowed - brand drug dispensed as a generic\",\n    \"6\" = \"Override\",\n    \"7\" = \"Substitution not allowed - brand drug mandated by law\",\n    \"8\" = \"Substitution allowed - generic drug not available in marketplace\",\n    \"9\" = \"Substitution allowed by prescriber but plan requests brand (Other in pre-D.0 versions)\"\n  )\n  df |>\n    count(daw_code, name = \"fill_count\") |>\n    mutate(\n      # Codes outside the label set are reported as Unknown\n      description = coalesce(unname(daw_labels[as.character(daw_code)]), \"Unknown\"),\n      pct         = round(fill_count / sum(fill_count) * 100, 1)\n    ) |>\n    arrange(daw_code)\n}\n\n\n# ---------------------------------------------------------------------------\n# WORKED EXAMPLE: patient 5501, reproduces 85/90 = 0.9444 PDC\n# ---------------------------------------------------------------------------\n\nraw_claims <- data.frame(\n  person_id    = rep(5501L, 3),\n  fill_date    = c(\"2023-01-01\", \"2023-01-31\", \"2023-03-07\"),\n  ndc11        = rep(\"00093508005\", 3),\n  drug_name    = rep(\"metformin HCl 500 mg\", 3),\n  days_supply  = c(30L, 30L, 30L),\n  quantity_dispensed = c(60L, 60L, 60L),\n  daw_code     = c(0L, 0L, 0L),\n  plan_paid_amt = c(12.40, 12.40, 12.40),\n  refill_number = c(0L, 1L, 2L),\n  stringsAsFactors = FALSE\n)\n\nnet_claims <- remove_reversals(raw_claims)\ncat(\"Net paid claims after reversal check:\", nrow(net_claims), \"\\n\")\n\nres <- compute_pdc(net_claims, \"2023-01-01\", \"2023-03-31\")\ncat(\"\\nPDC result for patient 5501:\\n\")\ncat(\"  Covered days :\", res$covered_days, \"\\n\")  # expected: 85\ncat(\"  Window days  :\", res$window_days,  \"\\n\")  # expected: 90\ncat(\"  PDC          :\", res$pdc,          \"\\n\")  # expected: 0.9444\ncat(\"  Gap days     :\", res$gap_days,     \"\\n\")  # 
expected: 5\n\ndaw <- daw_summary(net_claims)\ncat(\"\\nDAW code distribution:\\n\")\nprint(daw)",
        "description": "Core pharmacy claims operations in R: reversal removal, PDC computation using the union rule, and DAW code summary. Uses base R date arithmetic and dplyr for the summary step. The worked example patient (5501) reproduces the 30 + 30 + 25 = 85 covered days result.",
        "dependencies": [
          "dplyr",
          "lubridate"
        ],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [],
    "relations": [
      {
        "relation_type": "requires",
        "target_slug": "ndc-national-drug-code",
        "notes": "Every NCPDP pharmacy claim is anchored to an 11-digit NDC identifying the dispensed drug; NDC normalization, code-list construction, and format validation are prerequisite operations before any pharmacy claim analysis can begin."
      },
      {
        "relation_type": "used_with",
        "target_slug": "pdc-proportion-of-days-covered",
        "notes": "PDC is computed directly from NCPDP pharmacy claim fields: fill_date defines when each coverage interval begins and days_supply defines when it ends; the union of these intervals over a fixed window is the PDC numerator. Pharmacy claim fields are the raw material that PDC consumes."
      },
      {
        "relation_type": "see_also",
        "target_slug": "ub-04-institutional-claim-fields",
        "notes": "Claim-type sibling — institutional claims capture inpatient drug administration (DRG-bundled or revenue code 0250 lines), which is invisible in pharmacy claims. Combining NCPDP pharmacy claims with UB-04 institutional claims is necessary for complete drug-exposure ascertainment across inpatient and outpatient settings, and for excluding inpatient days from PDC denominators."
      },
      {
        "relation_type": "see_also",
        "target_slug": "cms-1500-professional-claim-fields",
        "notes": "Claim-type sibling — professional claims carry HCPCS J-codes and line-level NDCs for physician-administered drugs (infused biologics, in-office injections) that do not appear in NCPDP pharmacy claims; a complete drug-exposure capture for drugs with both self-administered and physician-administered formulations requires both claim types."
      },
      {
        "relation_type": "see_also",
        "target_slug": "claim-adjustments-reversals-denials",
        "notes": "Pharmacy-claim reversal transactions (NCPDP transaction type B2) are the pharmacy-benefit equivalent of claim adjustments in institutional and professional claims; both require identification and removal to avoid counting voided dispensing events in utilization and adherence analyses."
      },
      {
        "relation_type": "see_also",
        "target_slug": "hcpcs-level-ii-j-codes",
        "notes": "Infused and physician-administered biologics are billed on institutional and professional claims using HCPCS Level II J-codes rather than NCPDP pharmacy claims with NDCs; researchers studying drug classes with both oral (pharmacy benefit) and infused (medical benefit) formulations must query both data sources."
      }
    ],
    "aliases": [
      "NCPDP claim",
      "pharmacy claim",
      "D.0 transaction",
      "Part D PDE",
      "prescription drug event",
      "retail pharmacy claim"
    ],
    "complexity": "foundational",
    "regulatory_relevance": [
      "cms",
      "fda",
      "ncpdp"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "ndc-national-drug-code",
    "name": "NDC (National Drug Code)",
    "short_definition": "A unique, three-segment numeric identifier assigned by the FDA to every commercially marketed drug product in the United States, encoding the labeler, the drug product, and the package configuration; the primary key that links a pharmacy dispensing event to a specific drug, strength, formulation, and package size in claims and EHR data.",
    "long_description": "**NDC (National Drug Code)** is the FDA-assigned numeric identifier that is stamped on every drug package\nsold in the United States. Each code is built from three segments separated by hyphens — *labeler*, *product*,\nand *package* — and that three-part structure is the first thing a pharmacoepidemiologist must understand\nbefore building any drug exposure code list or drug utilization measure from claims or EHR data.\n\n**Structure: three segments, one drug product per package**\n\nThe labeler code (FDA-assigned, 4–5 digits) identifies the firm responsible for the product — the manufacturer,\nrepacker, or private-label distributor. The product code (3–4 digits) identifies a specific drug, strength,\ndosage form, and formulation within that labeler's catalog. The package code (1–2 digits) identifies the\nspecific package size and type (e.g., a 30-count bottle vs. a 90-count bottle of the same tablet).\n\nThis three-level hierarchy has a direct and important consequence: **one drug and one strength generates many\nNDCs**. Atorvastatin 40 mg tablets marketed by the originator brand (Lipitor) carries one labeler code and a\nhandful of package-size NDCs; the same molecule marketed by generic manufacturers — and repackaged by\nwholesalers — generates dozens or hundreds of additional NDCs, all denoting the same clinical entity. An\nexposure code list that captures only the brand NDC silently misses every generic fill, undercounting\nexposure dramatically. Rolling NDCs up to a single drug-strength entity requires either an NDC-to-RxNorm\nmapping (the NLM RxNorm API or a commercial drug dictionary) or a manually curated NDC list refreshed\nperiodically as new labelers enter the market.\n\n**The 10-digit vs. 
11-digit problem: where the most common NDC corruption occurs**\n\nThe FDA issues NDCs in a *10-digit* format, but the segment lengths are not standardized: a labeler may\nreceive a code configured as 4-4-2, 5-3-2, or 5-4-1 (labeler-product-package digits). The FDA directory\nand drug labels carry these hyphenated 10-digit codes. HIPAA, however, standardized pharmacy claims on an\n*11-digit, 5-4-2* format so that every field in a transaction has a fixed width and hyphens are dropped.\nThe bridge from FDA 10-digit to HIPAA 11-digit is *left-padding the deficient segment with one leading zero*,\nbut which segment receives the zero depends on the source format:\n\n- **4-4-2** format → the labeler segment is short; pad labeler to 5 digits: `0002-3227-30` → `00002-3227-30` → `00002322730`\n- **5-3-2** format → the product segment is short; pad product to 4 digits: `00093-490-05` → `00093-0490-05` → `00093049005`\n- **5-4-1** format → the package segment is short; pad package to 2 digits: `00071-0155-2` → `00071-0155-02` → `00071015502`\n\nA naive approach (prepend one zero to the whole 10-digit string) works only for 4-4-2 codes and silently\ncorrupts the other two patterns. Corrupt 11-digit NDCs either fail to match any record in the FDA directory\nor, worse, match a *different real product*, producing silent exposure misclassification. Every NDC\nnormalization function must detect which format the hyphenated source code uses before padding.\n\n**Pros, cons, and trade-offs**\n\n*vs. brand/generic name text strings:* NDC is machine-exact and deduplicates naturally across data sources;\nfree-text drug names require NLP, are prone to typos, and vary by data vendor. Use NDC for any structured claims\nor EHR dispense record; fall back to text only when NDC is absent (some inpatient EHR administrations,\ninternational data). 
The trade-off: NDC is *package-level*, so a single clinical drug entity spans many NDC\nvalues and the list requires active maintenance as new generics and repackagers enter the market.\n\n*vs. RxNorm as the exposure identifier:* RxNorm normalizes across labelers and packages to a single\nconcept per drug-strength-form, making it the preferred identifier for comparative effectiveness code lists\nand OMOP-based analyses. NDC is the native grain of US pharmacy claims and is what actually appears in the\n`NDC` field of an NCPDP transaction; you start at NDC and roll up to RxNorm, not the reverse. In practice,\nmost pharmacoepidemiology workflows use NDC to capture the raw dispensing event and then map to RxNorm for\nanalytic grouping.\n\n*vs. HCPCS J-codes for physician-administered drugs:* Drugs dispensed by a pharmacy carry NDCs; drugs\nadministered in a physician office or outpatient clinic are billed using HCPCS Level II J-codes on the\nmedical claim. An infused biologic (e.g., infliximab) will appear as a J-code in the medical claim and\nmay additionally carry a line-level NDC on the CMS-1500 professional claim. Researchers studying combined\noral-and-infused regimens must capture *both* NDC on pharmacy claims and J-codes on medical claims, or they\nwill systematically undercount the infused arm. The CMS ASP (Average Sales Price) drug crosswalk links\nNDCs to HCPCS codes for drugs with both forms.\n\n*Package-level granularity as a double-edged sword:* Package-level resolution is an asset for drug-safety\nsignal work (lot-specific pharmacovigilance, repackager-specific contamination events) and for\nreimbursement accuracy (the exact package billed). 
It is a liability for exposure ascertainment because a\nsingle repackager entering or leaving the market silently changes which NDCs a code list must include.\nA drug code list review should be triggered whenever a major generic launch, patent expiration, or\nbiosimilar approval occurs.\n\n**When to use**\n\nUse NDC — in its HIPAA 11-digit form — as the primary filter whenever you are building drug exposure from\nUS pharmacy claims (commercial, Medicare Part D, Medicaid). NDC is the correct grain for computing\ndays_supply-based coverage windows (PDC, MPR), identifying index fills for new-user designs, measuring\nadherence, characterizing drug utilization volumes, and for constructing any episode-based algorithm that\nrequires knowing exactly which drug was dispensed on which date. NDC is also the correct identifier for\nlinking pharmacy claims to drug-specific attributes: pill strength (from the NDC Directory's product\nrecord), route, dosage form, and the labeler responsible for the marketed product.\n\nUse NDC when working with the FDA NDC Directory to look up product characteristics, when using the CMS\nNDC-to-HCPCS crosswalk to bridge pharmacy and medical claims, when submitting or processing NCPDP pharmacy\ntransactions, and when writing medication-related data quality rules that must detect implausible drug-days\ncombinations. NDC is also the source vocabulary in OMOP CDM: the Drug Exposure table stores NDC as the\nsource_concept and maps it to an RxNorm standard concept via the OMOP vocabulary tables.\n\n**When NOT to use — and when it is actively misleading or dangerous**\n\nDo **not** join two NDC columns from different systems (claims vs. EHR vs. a drug dictionary) without first\nnormalizing both to the same format — either both hyphenated 10-digit or both stripped 11-digit. Mixing\nformats silently produces false non-matches: a claim's `00002322730` will not join to a dictionary's\n`0002-3227-30` even though they denote the same physical package. 
This is the most common single source\nof \"missing drugs\" in a pharmacoepidemiology analysis and is entirely preventable.\n\nDo **not** query the FDA NDC Directory as a definitive source of historical drug availability. The Directory\nlists only *currently marketed, finished drug products*: when a manufacturer discontinues a product, its\nNDC is eventually removed from the active directory. Historical claims for discontinued drugs therefore\ncontain NDCs that resolve to nothing in the current directory. For longitudinal claims studies covering more\nthan a few years, supplement the FDA directory with a historical NDC reference such as the CMS NDC file or\na commercial drug compendium that maintains discontinued-product records.\n\nDo **not** assume that an NDC found on a claim represents an FDA-approved product. Compounded preparations\n— which by definition are not FDA-approved finished drug products — may carry NDC-like 11-digit codes\nassigned by the compounding pharmacy or its software, but these codes are not registered in the FDA NDC\nDirectory and carry no product-level attributes. Studies involving compounded drugs (common in oncology,\nparenteral nutrition, and some specialty areas) must explicitly identify and handle these codes separately.\n\nDo **not** treat NDC labeler code reuse as a permanent link to one manufacturer. The FDA can reassign a\nlabeler code if the original holder surrenders it; the same 5-digit labeler prefix may refer to different\nfirms at different points in time. In long longitudinal datasets, apparent \"same labeler\" fills separated\nby many years may actually represent different manufacturers. Cross-check against the labeler name in the\nNDC Directory when labeler identity matters (e.g., biosimilar substitution studies).\n\nDo **not** rely on NDC alone to identify drug *classes* or *therapeutic intent*. 
NDC encodes one specific\nproduct from one specific labeler; it does not carry therapeutic class, indication, or ATC code natively.\nTherapeutic grouping requires mapping NDC to a drug dictionary (RxNorm, First Databank, Medi-Span, Red Book)\nthat assigns pharmacological class. An NDC-only list built without this mapping will include every\nrepackager but may fail to exclude a chemically related but therapeutically distinct compound that shares\nsome NDC prefix characters.\n\n**Data-source operational depth**\n\n*Pharmacy claims (NCPDP):* The canonical home of NDC. Every NCPDP transaction carries a mandatory\n11-digit NDC field (5-4-2 format, no hyphens). When you receive claims, the NDC field should already\nbe in HIPAA format, but vendors and clearinghouses occasionally strip leading zeros, convert to integer\n(destroying leading zeros), or pad incorrectly. Before any analysis, run a data quality check: confirm\nthe NDC field is character-typed and exactly 11 characters for every pharmacy record. Left-padding with\nzeros to width 11 is appropriate only when the shortfall comes from stripped leading zeros (for example,\nan 11-digit code that was cast to integer); it is not a substitute for segment-aware conversion of\n10-digit codes. An 11-character length check plus a digits-only pattern check catches truncated or\nmalformed codes; NDC has no check digit, so format validation alone cannot confirm that a code refers to\na real product. Medicare Part D PDE (Prescription\nDrug Event) files carry NDC in the same HIPAA 11-digit field; the CMS NDC file distributed with PDE\ndata provides historical drug characteristics including strength, route, and a link to the brand name.\n\n*EHR medication records:* NDC may appear in dispense records (medication dispensed from an in-system\npharmacy) or in medication order records when the ordering system is linked to a drug formulary. Inpatient\nadministration records frequently lack NDC because the administered drug is a floor-stock item billed as\na supply rather than a filled prescription. 
When EHR NDC is present, it may be in either 10-digit or\n11-digit format depending on the formulary system vendor; normalize before joining to claims or to\nexternal drug dictionaries.\n\n*Linked claims–EHR:* When linking pharmacy claims to EHR medication orders to confirm initiation\n(primary non-adherence detection), use 11-digit normalized NDC as the primary join key but supplement\nwith RxNorm concept mapping as a fallback, because orders are more likely to carry RxNorm than NDC.\nReconcile fill date with order date; a claim fill that precedes the order date by more than a day\nsignals a data linkage error or a pre-authorization fill.\n\n*Medical claims for physician-administered drugs:* NDC may appear as a line-level detail on CMS-1500\n(professional) and UB-04 (institutional) claims for drug lines billed under a HCPCS J-code or\nrevenue code. CMS began requiring NDC on drug lines for Medicaid claims under the Medicaid Drug Rebate\nProgram; commercial payers and Medicare Advantage may or may not carry it consistently. When present,\nthe medical-claim NDC is essential for linking the infused dose to a specific product and calculating\nutilization for drugs with both oral and infused formulations.\n\n**Relationship to the FDA NDC Directory and openFDA**\n\nThe FDA NDC Directory (https://www.fda.gov/drugs/drug-approvals-and-databases/national-drug-code-directory)\nis the authoritative public registry of finished drug product NDCs for currently marketed products. It is\nupdated regularly and is downloadable as a tab-delimited or JSON file, providing product attributes\nincluding labeler name, proprietary and non-proprietary names, dosage form, route of administration,\nstrength, and marketing start/end dates. The openFDA Drug NDC API (https://open.fda.gov/apis/drug/ndc/)\nprovides RESTful query access to the same data and supports validation of individual 11-digit codes in\nautomated pipelines. 
Neither source reliably retains discontinued products: use CMS historical NDC\nfiles or commercial compendia for longitudinal studies.\n\n**Regulatory and reporting context**\n\nThe HIPAA transaction standards (specifically the NCPDP Telecommunication Standard, version D.0) mandate the\n11-digit 5-4-2 format as the NDC representation in electronic pharmacy transactions. Medicaid requires NDC on\nmedical claims for covered outpatient drugs to support drug-rebate invoicing, creating an additional\nincentive for completeness on medical claims drug lines. FDA adverse event reports (FAERS) can carry an\nNDC on the source report, but the field is optional and drugs are identified mainly by product name and\nactive ingredient; pharmacovigilance studies that relate claims exposure to FAERS reports therefore usually\nbridge the two at the ingredient level (for example via RxNorm) rather than joining on NDC alone.",
    "primary_category": "Data_Standard",
    "tags": [
      "coding-system",
      "data-standard",
      "primitive",
      "drugs",
      "ndc",
      "pharmacy-claims",
      "drug-exposure",
      "ncpdp",
      "hipaa",
      "fda"
    ],
    "applies_to_study_types": [],
    "data_sources": [
      "claims",
      "ehr",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1002/pds.4017",
        "url": "https://doi.org/10.1002/pds.4017",
        "citation_text": "Brunelli SM. Use of prescription drug claims data to identify lipid-lowering medication exposure in pharmacoepidemiology studies: potential pitfalls. Pharmacoepidemiology and Drug Safety. 2016;25(7):844-846.",
        "year": 2016,
        "authors_short": "Brunelli",
        "notes": "Benchmark pharmacoepidemiology methods paper enumerating pitfalls of NDC-based drug exposure ascertainment in claims, including incomplete code lists from missed repackagers, format mismatches, and the need for active maintenance as new generics enter the market."
      },
      {
        "role": "explain",
        "doi": "10.1002/pds.5346",
        "url": "https://doi.org/10.1002/pds.5346",
        "citation_text": "Hempenius M, Groenwold RHH, de Boer A, Klungel OH, Gardarsdottir H. Drug exposure misclassification in pharmacoepidemiology: Sources and relative impact. Pharmacoepidemiology and Drug Safety. 2021;30:1703-1715.",
        "year": 2021,
        "authors_short": "Hempenius et al.",
        "notes": "Comprehensive taxonomy of drug exposure misclassification sources in dispensing-data studies, including incomplete NDC code lists, OTC/free-sample invisibility, and format errors; quantifies the relative impact of each source."
      },
      {
        "role": "demonstrate",
        "doi": "10.1002/pds.981",
        "url": "https://doi.org/10.1002/pds.981",
        "citation_text": "Jacobus CH, Schneeweiss S, Chan KA. Exposure misclassification as a result of free sample drug utilization in automated claims databases. Pharmacoepidemiology and Drug Safety. 2004;13(9):631-635.",
        "year": 2004,
        "authors_short": "Jacobus et al.",
        "notes": "Demonstrates that free drug samples dispensed to patients are not captured in pharmacy claims NDC records, producing systematic underestimation of initiation dates and treatment duration for drugs distributed heavily via samples."
      },
      {
        "role": "use",
        "doi": "10.1093/ajhp/zxae272",
        "url": "https://doi.org/10.1093/ajhp/zxae272",
        "citation_text": "Tribble DA. The 12-digit National Drug Code. American Journal of Health-System Pharmacy. 2024;81(24):e762-e764.",
        "year": 2024,
        "authors_short": "Tribble",
        "notes": "Describes the emerging FDA proposal for a 12-digit NDC format and its implications for systems that currently handle 10-digit and 11-digit codes; relevant for future-proofing NDC normalization pipelines."
      },
      {
        "role": "use",
        "doi": null,
        "url": "https://www.fda.gov/drugs/drug-approvals-and-databases/national-drug-code-directory",
        "citation_text": "U.S. Food and Drug Administration. National Drug Code Directory. FDA Drug Approvals and Databases. Available at: https://www.fda.gov/drugs/drug-approvals-and-databases/national-drug-code-directory. Accessed 2026.",
        "year": null,
        "authors_short": "FDA NDC Directory",
        "notes": "Authoritative public registry of finished drug product NDCs for currently marketed products. Downloadable as tab-delimited or JSON; contains labeler name, proprietary and non-proprietary names, dosage form, route, strength, and marketing start/end dates. Does not retain records for discontinued products."
      },
      {
        "role": "use",
        "doi": null,
        "url": "https://open.fda.gov/apis/drug/ndc/",
        "citation_text": "U.S. Food and Drug Administration. openFDA Drug Product Labeling - NDC API. Available at: https://open.fda.gov/apis/drug/ndc/. Accessed 2026.",
        "year": null,
        "authors_short": "openFDA NDC API",
        "notes": "RESTful API providing programmatic query access to the FDA NDC Directory data; supports validation of individual 11-digit codes in automated data pipelines and batch lookups for code list verification."
      }
    ],
    "plain_language_summary": "An NDC (National Drug Code) is the unique number printed on every drug package sold in the United States — think of it as a drug's bar-code identity card, telling you exactly which company made it, what drug and strength it is, and how it was packaged. When a patient fills a prescription at a pharmacy, that dispensing event is recorded in a claims database with an 11-digit NDC, giving researchers a precise way to look up which drug was dispensed on which day. The main catch is that the same drug (say, atorvastatin 40 mg) can have dozens of different NDCs because every generic manufacturer and every package size gets its own code, so finding all fills of a drug requires a complete list of relevant NDCs — missing even one repackager means silently undercounting how many patients actually took that drug.",
    "key_terms": [
      {
        "term": "labeler code",
        "definition": "The first part of an NDC (5 digits in the HIPAA format), assigned by the FDA to the specific company responsible for putting the drug on the market — manufacturer, repackager, or private-label distributor."
      },
      {
        "term": "package code",
        "definition": "The last part of an NDC (2 digits in the HIPAA format), which distinguishes different package sizes or container types of the exact same drug and strength from the same labeler (for example, a 30-count bottle versus a 90-count bottle)."
      },
      {
        "term": "days_supply",
        "definition": "How many days one filled prescription is meant to last (for example, a 30-day or 90-day fill), recorded alongside the NDC on every pharmacy claim."
      },
      {
        "term": "labeler code reuse",
        "definition": "The rare but documented situation in which the FDA reassigns a labeler code prefix to a different company after the original holder surrenders it, meaning the same code prefix can refer to different manufacturers at different points in time in a long dataset."
      },
      {
        "term": "10-to-11 digit normalization",
        "definition": "The process of converting an FDA-formatted 10-digit NDC (which can come in three different segment layouts) into the standard 11-digit HIPAA format used in pharmacy claims, by inserting one leading zero into the segment that is too short."
      },
      {
        "term": "repackager",
        "definition": "A company that buys a drug from its original manufacturer and redistributes it in different quantities or containers under a new NDC — creating additional valid NDC codes for the same underlying drug that must be included in any complete drug code list."
      }
    ],
    "worked_example": {
      "scenario": "A pharmacoepidemiology analyst is building a new-user cohort of adults who started lisinopril — a common blood pressure medication — in 2023 using pharmacy claims. Before pulling any data, the analyst receives a reference file from a drug dictionary vendor listing NDC codes for lisinopril. Three of the NDCs are in 10-digit hyphenated FDA format, one from each of the three possible segment layouts. The analyst must convert each one to the standard 11-digit HIPAA format (no hyphens) that the claims database actually stores. The table below shows the conversion for each format pattern, digit by digit.\n",
      "dataset": {
        "caption": "Three real FDA NDC formats for lisinopril and similar oral tablets, showing the 10-to-11 digit conversion for each segment pattern. Source: FDA NDC Directory public data.",
        "columns": [
          "FDA 10-digit (hyphenated)",
          "Segment pattern",
          "Segment that needs a zero",
          "HIPAA 11-digit (no hyphens)"
        ],
        "rows": [
          [
            "0002-3227-30",
            "4-4-2 (labeler is 4 digits)",
            "Labeler: pad 0002 → 00002",
            "00002322730"
          ],
          [
            "00093-490-05",
            "5-3-2 (product is 3 digits)",
            "Product: pad 490 → 0490",
            "00093049005"
          ],
          [
            "00071-0155-2",
            "5-4-1 (package is 1 digit)",
            "Package: pad 2 → 02",
            "00071015502"
          ]
        ]
      },
      "steps": [
        "Start with the 4-4-2 example: the raw FDA NDC is 0002-3227-30. Count the digits in each segment separated by hyphens: labeler has 4 digits (0002), product has 4 digits (3227), package has 2 digits (30). The HIPAA standard requires 5-4-2. The labeler is one digit short, so insert one leading zero there: 0002 becomes 00002. The product and package stay the same. Drop all hyphens: 00002-3227-30 becomes 00002322730. Final 11-digit result: 00002322730.",
        "Now the 5-3-2 example: the raw FDA NDC is 00093-490-05. Count digits: labeler has 5 (00093), product has 3 (490), package has 2 (05). The product is one digit short of the required 4. Insert one leading zero into the product segment: 490 becomes 0490. Labeler and package unchanged. Drop hyphens: 00093-0490-05 becomes 00093049005. Final 11-digit result: 00093049005.",
        "Now the 5-4-1 example: the raw FDA NDC is 00071-0155-2. Count digits: labeler has 5 (00071), product has 4 (0155), package has 1 (2). The package is one digit short of the required 2. Insert one leading zero into the package segment: 2 becomes 02. Labeler and product unchanged. Drop hyphens: 00071-0155-02 becomes 00071015502. Final 11-digit result: 00071015502.",
        "Why does the zero position matter? If you naively strip the hyphens from the 5-3-2 code and add a zero to the front of the resulting 10-digit string, 0009349005 becomes 00009349005 instead of the correct 00093049005. Read as 5-4-2, that is labeler 00009, product 3490, package 05: a different labeler and a different product. The resulting 11-digit code may match a completely different real drug in the FDA directory, creating silent exposure misclassification, where your analysis records a fill of the wrong drug.",
        "A validation step: after converting all NDCs in your code list to 11-digit format, look up each one in the FDA NDC Directory or the openFDA API. Any NDC that returns no match is either corrupted during conversion or refers to a discontinued product absent from the current directory. Discontinued-product NDCs are still valid for historical claims; add them to a supplemental list sourced from the CMS historical NDC file."
      ],
      "result": "The three 10-digit FDA formats convert to HIPAA 11-digit as follows: 0002-3227-30 (4-4-2) → 00002322730 (zero inserted into labeler); 00093-490-05 (5-3-2) → 00093049005 (zero inserted into product); 00071-0155-2 (5-4-1) → 00071015502 (zero inserted into package). Naively prepending a zero to the whole string gives the right answer only for 4-4-2 codes and silently corrupts 5-3-2 and 5-4-1 codes. Always detect the source segment pattern from the hyphenated form before padding."
    },
    "prerequisites": [],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "NDC code list curation strategies (manual vs. dictionary-mapped)",
        "description": "A manually curated NDC list (e.g., from a drug compendium or clinical pharmacist review) provides exact control over which labelers and package sizes are included but requires active maintenance as new generics launch. A dictionary-mapped list (NDC → RxNorm → drug class) is self-updating for new NDCs as the dictionary refreshes, but the mapping quality depends on the dictionary vendor's coverage of generic and repackaged products.",
        "edge_cases": [
          "A new biosimilar or generic may be dispensed for several months before the drug dictionary adds its NDC, creating a temporal gap in ascertainment even with dictionary mapping.",
          "Repackagers (e.g., unit-dose hospital repackagers) generate NDCs that may not appear in retail-pharmacy dictionaries, causing them to be missed in studies using a dictionary-based list."
        ],
        "data_source_notes": "claims: both strategies require periodic refresh (quarterly at minimum) triggered by generic launches, biosimilar approvals, and patent expirations. EHR formulary NDCs may differ from retail NDCs for the same drug."
      },
      {
        "name": "11-digit zero-padding for historical claims",
        "description": "Claims spanning more than a few years may contain NDCs formatted by older systems that stored them as integers (losing leading zeros) or as 10-digit hyphenated strings. A normalization pass must handle all three cases before any code-list join.",
        "edge_cases": [
          "An NDC stored as an integer in SQL (e.g., BIGINT) will silently drop leading zeros; always cast to CHAR(11) with LPAD before joining.",
          "Some older claims contain a hyphenated 10-digit NDC in the NDC field rather than the expected 11-digit format; detect by string length and segment count before padding."
        ],
        "data_source_notes": "Medicare Part D PDE files: NDC field is CHAR(11) and should be correctly formatted, but a LPAD validation pass is still recommended. Medicaid MAX/T-MSIS: variable formatting across state submissions; always normalize before pooling states."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "omop-drug-exposure-drug-era-rwe",
        "pros_of_this": "NDC is the native key in US pharmacy claims and EHR dispense records, requiring no vocabulary mapping to identify raw fills; it encodes package-level specificity useful for reimbursement, lot tracking, and repackager-level pharmacovigilance.",
        "cons_of_this": "NDC is package-level and labeler-specific, so a single clinical entity spans many NDC values that must be consolidated; NDC does not travel across international systems or between EHR/claims without a mapping layer.",
        "when_to_prefer": "Use NDC for US pharmacy claims processing and fill-level validation; use OMOP Drug Exposure (with RxNorm standard concepts) for cross-site or cross-database comparative analyses."
      },
      {
        "compared_to": "drug-utilization",
        "pros_of_this": "NDC is the technical identifier underlying every DUS metric built from pharmacy claims; understanding NDC format and code-list completeness is a prerequisite for interpreting any utilization measure accurately.",
        "cons_of_this": "NDC alone does not produce a utilization measure; it must be combined with fill_date, days_supply, and a denominator (enrolled person-time) to generate rates, coverage, or episode counts.",
        "when_to_prefer": "Think of NDC as the atomic unit; a DUS is what you build after you have correctly identified all relevant NDCs."
      }
    ],
    "implementation_notes_by_data_source": {},
    "implementations": [
      {
        "lang": "python",
        "code": "import re\nimport pandas as pd\n\n# ---------------------------------------------------------------------------\n# NDC segment patterns from FDA format (hyphenated) to HIPAA 11-digit (5-4-2)\n# Pattern detected from segment lengths in the hyphenated source string.\n# ---------------------------------------------------------------------------\n\n_NDC_HYPHEN = re.compile(r\"^(\\d{4,5})-(\\d{3,4})-(\\d{1,2})$\")\n_NDC_11     = re.compile(r\"^\\d{11}$\")\n\ndef normalize_ndc(raw: str) -> str | None:\n    \"\"\"Convert a raw NDC string to HIPAA 11-digit (5-4-2) format (no hyphens).\n\n    Accepts:\n      - Hyphenated 10-digit FDA forms: 4-4-2, 5-3-2, 5-4-1\n      - Already-11-digit (no hyphens): returned as-is after validation\n      - Unhyphenated digit strings shorter than 11: left-padded with zeros to 11\n\n    Returns None for strings that cannot be normalised (e.g., blank, wrong length).\n\n    Examples\n    --------\n    >>> normalize_ndc(\"0002-3227-30\")   # 4-4-2  -> pad labeler\n    '00002322730'\n    >>> normalize_ndc(\"00093-490-05\")   # 5-3-2  -> pad product\n    '00093049005'\n    >>> normalize_ndc(\"00071-0155-2\")   # 5-4-1  -> pad package\n    '00071015502'\n    >>> normalize_ndc(\"00002322730\")    # already 11-digit\n    '00002322730'\n    \"\"\"\n    if not isinstance(raw, str):\n        return None\n    s = raw.strip()\n    if not s:\n        return None\n\n    # Already 11-digit no-hyphen form\n    if _NDC_11.match(s):\n        return s\n\n    # Hyphenated form: detect the segment pattern and pad the short segment.\n    # This must run before the plain left-pad below; stripping hyphens and\n    # zero-filling the whole string would corrupt 5-3-2 and 5-4-1 codes.\n    if \"-\" in s:\n        m = _NDC_HYPHEN.match(s)\n        if not m:\n            return None\n        labeler, product, package = m.group(1), m.group(2), m.group(3)\n\n        if len(labeler) == 4:      # 4-4-2: pad labeler\n            labeler = labeler.zfill(5)\n        elif len(product) == 3:    # 5-3-2: pad product\n            product = product.zfill(4)\n        elif len(package) == 1:    # 5-4-1: pad package\n            package = package.zfill(2)\n        # else: already 5-4-2 in hyphenated form (edge case from some vendors)\n\n        result = labeler + product + package\n        return result if len(result) == 11 else None  # e.g. 4-3-2 is malformed\n\n    # Unhyphenated digits shorter than 11: treat as an 11-digit NDC that lost\n    # leading zeros (e.g. through an integer cast) and left-pad to 11.\n    if s.isdigit() and len(s) < 11:\n        return s.zfill(11)\n    return None\n\n\ndef is_valid_ndc11(code: str) -> bool:\n    \"\"\"Return True if code is a well-formed 11-digit NDC (all digits, exactly 11 chars).\"\"\"\n    return bool(_NDC_11.match(str(code).strip()))\n\n\ndef normalize_ndc_series(s: pd.Series) -> pd.Series:\n    \"\"\"Element-wise NDC normalisation for a pandas Series of raw NDC strings.\n\n    Returns a Series of normalised 11-digit strings (None for invalid entries).\n    Missing values (NaN) pass through as None.\n\n    Usage\n    -----\n    claims[\"ndc11\"] = normalize_ndc_series(claims[\"ndc_raw\"])\n    bad = claims[claims[\"ndc11\"].isna() & claims[\"ndc_raw\"].notna()]\n    print(f\"{len(bad):,} rows with un-normalisable NDCs\")\n    \"\"\"\n    return s.apply(lambda v: normalize_ndc(v) if pd.notna(v) else None)\n\n\n# ---------------------------------------------------------------------------\n# Quick self-test (runs when the module is executed as a script)\n# ---------------------------------------------------------------------------\ndef _test():\n    cases = {\n        \"0002-3227-30\":  \"00002322730\",  # 4-4-2 -> pad labeler\n        \"00093-490-05\":  \"00093049005\",  # 5-3-2 -> pad product\n        \"00071-0155-2\":  \"00071015502\",  # 5-4-1 -> pad package\n        \"00002322730\":   \"00002322730\",  # already 11\n        \"2322730\":       \"00002322730\",  # integer-cast, left-pad to 11\n    }\n    for raw, expected in cases.items():\n        result = normalize_ndc(raw)\n        status = \"OK\" if result == expected else f\"FAIL (got {result})\"\n        print(f\"  {raw!r:22} -> {result!r:14} {status}\")\n\nif __name__ == \"__main__\":\n    _test()",
        "description": "10-to-11 digit NDC normalization function that detects the source segment pattern from the hyphenated FDA form and left-pads the deficient segment. Also includes a regex validator for well-formed 11-digit codes and a batch normalization helper for a pandas Series.",
        "dependencies": [
          "re",
          "pandas"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(stringr)\n\n# ---------------------------------------------------------------------------\n# NDC normalisation: FDA 10-digit (any of three segment patterns) -> HIPAA 11-digit\n# ---------------------------------------------------------------------------\n\nnormalize_ndc <- function(raw) {\n  # Vectorised over a character vector.\n  # Returns NA for entries that cannot be normalised.\n  vapply(raw, .normalize_one, character(1), USE.NAMES = FALSE)\n}\n\n.normalize_one <- function(s) {\n  if (is.na(s) || !nchar(trimws(s))) return(NA_character_)\n  s <- trimws(s)\n\n  # Already 11-digit no-hyphen form\n  if (grepl(\"^\\\\d{11}$\", s)) return(s)\n\n  # Unhyphenated digits shorter than 11 (e.g. integer-cast NDCs that lost\n  # leading zeros): left-pad to 11. Hyphenated codes must not take this path;\n  # stripping hyphens and left-padding would corrupt 5-3-2 and 5-4-1 codes.\n  if (grepl(\"^\\\\d{1,10}$\", s)) {\n    return(str_pad(s, 11, side = \"left\", pad = \"0\"))\n  }\n\n  # Hyphenated FDA form: parse segments\n  parts <- str_split(s, \"-\")[[1]]\n  if (length(parts) != 3) return(NA_character_)\n\n  labeler <- parts[1]; product <- parts[2]; package <- parts[3]\n  l_len <- nchar(labeler); p_len <- nchar(product); k_len <- nchar(package)\n\n  if (!all(grepl(\"^\\\\d+$\", parts))) return(NA_character_)\n\n  if (l_len == 4) {                          # 4-4-2: pad labeler\n    labeler <- str_pad(labeler, 5, pad = \"0\")\n  } else if (p_len == 3) {                   # 5-3-2: pad product\n    product <- str_pad(product, 4, pad = \"0\")\n  } else if (k_len == 1) {                   # 5-4-1: pad package\n    package <- str_pad(package, 2, pad = \"0\")\n  }\n  # 5-4-2 hyphenated (edge case) falls through unchanged\n\n  result <- paste0(labeler, product, package)\n  if (nchar(result) != 11) return(NA_character_)\n  result\n}\n\nis_valid_ndc11 <- function(code) {\n  grepl(\"^\\\\d{11}$\", trimws(as.character(code)))\n}\n\n# ---------------------------------------------------------------------------\n# Example usage with dplyr\n# ---------------------------------------------------------------------------\n# library(dplyr)\n# claims <- claims |>\n#   mutate(ndc11 = normalize_ndc(ndc_raw))\n#\n# bad_ndc <- claims |> filter(is.na(ndc11) & !is.na(ndc_raw))\n# message(nrow(bad_ndc), \" rows with un-normalisable NDCs\")\n#\n# Self-test\n.test_ndc <- function() {\n  cases <- list(\n    list(raw = \"0002-3227-30\",  expected = \"00002322730\"),  # 4-4-2\n    list(raw = \"00093-490-05\",  expected = \"00093049005\"),  # 5-3-2\n    list(raw = \"00071-0155-2\",  expected = \"00071015502\"),  # 5-4-1\n    list(raw = \"00002322730\",   expected = \"00002322730\"),  # already 11\n    list(raw = \"2322730\",       expected = \"00002322730\")   # integer-cast\n  )\n  for (tc in cases) {\n    result <- normalize_ndc(tc$raw)\n    status <- if (identical(result, tc$expected)) \"OK\" else\n                paste0(\"FAIL (got \", result, \")\")\n    cat(sprintf(\"  %-22s -> %-14s %s\\n\", tc$raw, result, status))\n  }\n}",
        "description": "10-to-11 digit NDC normalization in R, with the same segment-pattern detection logic. Includes a vectorised wrapper suitable for dplyr mutate() calls and a validation predicate.",
        "dependencies": [
          "stringr"
        ],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [],
    "relations": [
      {
        "relation_type": "used_with",
        "target_slug": "drug-utilization",
        "notes": "NDC is the primary drug identifier in pharmacy claims used to construct drug utilization measures (DDD/1000/day, prevalence of treated patients, PDC, MPR); every fill_date + days_supply pair must be anchored to a validated 11-digit NDC."
      },
      {
        "relation_type": "used_with",
        "target_slug": "claims-analysis",
        "notes": "NDC is the key that links a pharmacy claim to a specific drug product; claims analysis workflows begin with NDC code lists and require format normalization and completeness checks before any exposure or utilization measure is computed."
      },
      {
        "relation_type": "used_with",
        "target_slug": "treatment-patterns-lines-of-therapy",
        "notes": "Treatment-pattern algorithms consume NDC-grouped fills to construct medication episodes, identify switching, and assign lines of therapy; NDC-to-drug-class mapping (via RxNorm or a drug dictionary) is a prerequisite for episode construction."
      },
      {
        "relation_type": "used_with",
        "target_slug": "continuous-enrollment-observable-time-rwe",
        "notes": "Observable continuous enrollment is the denominator that makes an absence of NDC fills meaningful (true non-use vs. unobserved use); NDC-based exposure windows must be bounded within continuous enrollment spans to avoid counting gaps as drug-free time."
      },
      {
        "relation_type": "see_also",
        "target_slug": "omop-drug-exposure-drug-era-rwe",
        "notes": "In OMOP CDM, NDC is a source vocabulary that maps to RxNorm standard concepts in the Drug Exposure table; normalization from 11-digit NDC to RxNorm concept_id is performed via OMOP vocabulary tables, enabling cross-site analyses that standardize on RxNorm rather than labeler-specific NDCs."
      },
      {
        "relation_type": "see_also",
        "target_slug": "pdc-proportion-of-days-covered",
        "notes": "PDC is computed directly from pharmacy claim records identified by NDC; the fill_date and days_supply fields on each NDC-matched claim are the raw inputs for coverage union calculations."
      },
      {
        "relation_type": "see_also",
        "target_slug": "mpr-medication-possession-ratio",
        "notes": "MPR, like PDC, is built from NDC-matched pharmacy claims; NDC normalization and code-list completeness directly affect whether all relevant fills are captured in the MPR numerator."
      }
    ],
    "aliases": [
      "NDC",
      "National Drug Code",
      "11-digit NDC",
      "NDC11"
    ],
    "complexity": "foundational",
    "regulatory_relevance": [],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "negative-binomial-distribution",
    "name": "Negative Binomial Distribution for Overdispersed Counts",
    "short_definition": "A discrete probability distribution for non-negative integer counts where the variance exceeds the mean — a condition called overdispersion — arising naturally as a Poisson distribution whose individual rate parameter varies across patients according to a gamma distribution; because real-world healthcare utilization counts (hospitalizations, emergency visits, drug fills) are almost always overdispersed due to unmeasured differences in patient frailty and care-seeking behavior, the negative binomial is the foundational distributional primitive for count outcomes in pharmacoepidemiology and HEOR, and understanding it is prerequisite to correctly choosing, fitting, and interpreting count regression models.",
    "long_description": "**What the negative binomial distribution is**\n\nThe negative binomial (NB) distribution is a discrete probability distribution on the\nnon-negative integers {0, 1, 2, …} characterized by two parameters: a mean μ > 0 and a\ndispersion parameter α ≥ 0. Under the canonical NB-2 (quadratic) parameterization — the\ndefault in R's MASS::glm.nb, Stata's nbreg, and Python's statsmodels — the variance function\nis:\n\n    Var(Y) = μ + α μ²\n\nWhen α = 0 the distribution collapses exactly to a Poisson with variance equal to the mean.\nWhen α > 0 the variance exceeds the mean by α μ², and the distribution develops a heavier\nright tail than Poisson: simultaneously more probability mass on zero counts and on large\ncounts. This dual excess — more zeros and more large values than Poisson — is precisely the\nsignature of healthcare utilization data. The probability mass function for the NB-2 family is:\n\n    P(Y = k) = Γ(k + 1/α) / [Γ(1/α) k!] × (αμ)^k / (1 + αμ)^(k + 1/α)\n\nwhere Γ(·) is the gamma function and k = 0, 1, 2, …. For applied work, the two properties\nthat matter are E(Y) = μ and Var(Y) = μ + αμ². Everything else — the PMF, likelihood, moment\ngenerating function — is machinery for understanding why software does what it does.\n\n**The Poisson-gamma mixture: why the NB arises in heterogeneous populations**\n\nThe single most important conceptual result for RWE analysts is that the negative binomial\ndistribution is not a convenient mathematical choice — it is the natural statistical consequence\nof patient heterogeneity. Suppose each patient has their own underlying event rate λᵢ, and\nthat their counts conditional on that rate follow Poisson(λᵢ). 
If the latent rates λᵢ vary\nacross the population according to a gamma distribution with mean μ and shape parameter\nr = 1/α, then the *marginal* distribution of counts — the distribution you observe when you\npool across all heterogeneous patients — is exactly the negative binomial with mean μ and\ndispersion α = 1/r. This result is sometimes called the Poisson-gamma mixture.\n\nIn pharmacoepidemiology terms: even if every individual patient's count process were exactly\nPoisson (memoryless, stationary events over time), the observed counts across a claims cohort\nwill follow a negative binomial because patients differ in unmeasured ways. This unmeasured\nheterogeneity — called *frailty* in the survival analysis literature — integrates out to yield\nthe NB marginal distribution. The parameter α directly measures the degree of population\nheterogeneity: larger α means greater between-patient variation in event rates. When α is\nestimated from data, it is not arbitrary: it quantifies how much of the count variation is\ndue to patient-level differences in underlying propensity that the measured covariates do\nnot capture.\n\n**Why overdispersion is the norm in healthcare count data**\n\nIn any claims or EHR dataset, observed counts of hospitalizations, emergency department (ED)\nvisits, outpatient encounters, or prescription fills are virtually always overdispersed\nrelative to Poisson for interconnected reasons:\n\n- *Case-mix heterogeneity*: the most severely ill patients in any condition-defined cohort\n  account for a disproportionate share of utilization, creating a long right tail. A patient\n  with decompensated heart failure accrues far more admissions than a stable patient on the\n  same therapy — this between-patient spread drives variance well above the mean.\n- *Coding and billing variation*: administrative data capture differs by health system, payer,\n  region, and care setting. 
Practices that unbundle care into separate claims inflate counts\n  for the same underlying service intensity, adding non-clinical variance.\n- *Clustering of care*: patients who have one hospitalization are more likely to have another\n  (rehospitalization, exacerbation, index event followed by complication). This positive\n  within-patient autocorrelation inflates the between-patient variance in total counts beyond\n  what a Poisson with a constant rate predicts.\n- *Structural access barriers*: a subgroup with near-zero utilization (healthy, uninsured\n  intermittently, or disengaged from care) coexists with high-utilizers, widening the count\n  distribution beyond the Poisson envelope in both tails simultaneously.\n\n**How to recognize and diagnose overdispersion**\n\nBefore fitting an NB model it is good practice to confirm overdispersion. The standard\ndiagnostic tools are:\n\n1. *Mean-variance plot*: for the raw data, plot the sample mean against the sample variance\n   within subgroups or arms. Under Poisson, all points should fall on the line Var = mean.\n   Points above this line indicate overdispersion; how far above gives a sense of α.\n2. *Pearson dispersion statistic*: after fitting a Poisson model, compute Pearson χ² divided\n   by residual degrees of freedom. A ratio substantially greater than 1 (rule of thumb: > 1.5\n   to 2.0) signals overdispersion. This quantity estimates the dispersion factor φ = 1 + αμ̄,\n   where μ̄ is the average fitted mean.\n3. *Likelihood-ratio boundary test*: formally test H₀: α = 0 (Poisson) against the NB\n   alternative. Because α = 0 is on the boundary of the parameter space, the standard\n   likelihood-ratio chi-square statistic has a non-standard null distribution; its p-value\n   must be halved relative to a chi-square(1) distribution. 
In any typical HCRU dataset with\n   n > 200 patients, this test almost always rejects the Poisson — overdispersion is the rule,\n   not the exception.\n\nThe practical implication of ignoring overdispersion is severe: fitting a Poisson model to\noverdispersed data produces standard errors that are too small (anticonservative), confidence\nintervals that are too narrow, and p-values that are too small. Effects that are not\nstatistically significant are declared significant. This is not a minor issue — it is a\nfundamental validity failure. The Pearson χ²/df from the Poisson fit provides an empirical\ncorrection factor (quasi-Poisson scaling), but NB regression is preferred because it\nestimates α from the data via maximum likelihood and provides a proper likelihood for AIC/BIC\nmodel comparison.\n\n**NB1 versus NB2 variance functions**\n\nTwo parameterizations of the negative binomial are in common use, distinguished by how the\nvariance relates to the mean:\n\n- *NB-2* (quadratic): Var(Y) = μ + αμ². The variance grows quadratically with the mean,\n  meaning that at high count levels the excess variance grows rapidly. This is the canonical\n  parameterization in R (MASS::glm.nb, with θ = 1/α), Stata (nbreg), and statsmodels in\n  Python. NB-2 arises directly from the Poisson-gamma mixture and is the theoretically\n  motivated choice for heterogeneous populations.\n- *NB-1* (linear): Var(Y) = μ(1 + δ) = μ + δμ. The variance is proportional to the mean\n  with a constant proportionality factor (1 + δ). Equivalent to a quasi-Poisson with\n  overdispersion factor φ = 1 + δ. NB-1 is less common in HEOR practice; it implies that\n  overdispersion is uniform across count levels, which is less realistic for HCRU where\n  high-utilizers generate disproportionate variance.\n\nFor most HCRU applications — hospitalizations, ED visits, infusion counts — where a minority\nof frail high-utilizers dominates the right tail, NB-2 is the better fit and the safe\ndefault. 
NB-1 (or quasi-Poisson) is occasionally preferred for mild, uniform overdispersion\nwhen the NB-2 likelihood fails to converge.\n\n**Offsets for unequal observation time**\n\nWhen patients have different lengths of follow-up — because of disenrollment, death, or\nadministrative censoring — raw counts cannot be compared directly between patients or arms.\nA patient followed for six months with two hospitalizations is not equivalent to a patient\nfollowed for twelve months with two hospitalizations. The solution is an *offset*: including\nlog(person-time) as a term in the linear predictor with its coefficient fixed at 1, converting\nthe count model into a rate model:\n\n    log(μᵢ) = β₀ + β₁X₁ + … + βₚXₚ + log(person-timeᵢ)\n\nThe exponentiated coefficient exp(βⱼ) is then an *incidence rate ratio* (IRR) — events per\nunit of person-time — rather than a raw count ratio. This is non-negotiable in claims-based\nHCRU studies where enrollment gaps, disenrollment, and death create differential follow-up\nacross treatment arms. Omitting the offset silently converts the IRR into a raw count ratio\nthat confounds the rate effect with the length of observation, biasing the estimate in\nwhichever arm has shorter mean follow-up. The regression application of this distributional\nprimitive is described in detail in the companion model concept\n(poisson-negative-binomial-count-models).\n\n**Zero-inflation versus overdispersion: not the same problem**\n\nOverdispersion and zero-inflation are distinct phenomena that require different solutions.\nThe negative binomial distribution naturally accommodates some excess zeros relative to\nPoisson, because patients with very low event rates will have many zero-count observations\nby chance. This is ordinary overdispersion, and NB handles it correctly. 
Zero-inflation, by\ncontrast, refers to a structural subgroup that *cannot* possibly generate the event —\npatients who have permanently left the health system, who have a physiological barrier to\nthe event, or whose zero counts reflect data absence rather than true non-occurrence. When\nexcess zeros genuinely reflect a \"never-utilizer\" structural process beyond what the NB\ndistribution predicts (diagnosed by comparing predicted vs. observed zero probabilities from\nthe fitted NB), a zero-inflated NB (ZINB) or hurdle model is appropriate. The hurdle model\nseparates a logistic process (any events at all?) from a zero-truncated NB process (how many,\ngiven at least one?). Applying ZINB to ordinary overdispersion risks overfitting; applying\nplain NB to genuine structural zeros understates the zero probability and overstates the\nevent rate among the non-zero subgroup.\n\n**Interpreting the output**\n\nConsider a negative binomial regression of annual hospitalization counts on a treatment\nindicator, controlling for age, sex, and a comorbidity index, with log(person-years) as\nthe offset. The fitted model returns: treatment coefficient β = -0.223, 95% CI [-0.40,\n-0.05], dispersion parameter α = 0.8.\n\n*Formal interpretation*: exp(-0.223) = 0.80 is the adjusted incidence rate ratio for the\ntreatment group versus the control group. The expected rate of hospitalizations per\nperson-year among treated patients is 0.80 times the rate among control patients — a 20%\nlower rate — holding age, sex, and comorbidity index fixed at their observed values. The\n95% confidence interval on the rate-ratio scale is [exp(-0.40), exp(-0.05)] = [0.67, 0.95],\nwhich excludes 1.0 and is consistent with a statistically significant reduction at the 5%\nlevel. The dispersion parameter α = 0.8 characterizes the variance function:\nVar(Y) = μ + 0.8 × μ². 
At a typical mean count of μ = 1.5 hospitalizations per year,\nthis gives Var(Y) = 1.5 + 0.8 × 2.25 = 1.5 + 1.8 = 3.3 — more than twice the mean,\nconfirming that substantial overdispersion is present. Had Poisson been fitted instead, the\nconfidence interval around the IRR of 0.80 would have been artificially narrow (e.g.,\n[0.71, 0.90] instead of [0.67, 0.95]), falsely suggesting higher precision.\n\n*Practical interpretation for a decision-maker*: On average, patients receiving this\ntreatment experienced approximately 20% fewer hospitalizations per year compared with the\ncontrol group (rate ratio 0.80; 95% CI 0.67 to 0.95). Because this is an observational study,\nthis is an association — unmeasured confounding could partially explain the difference — but\nthe adjustment for measured covariates provides a more credible estimate than an unadjusted\ncomparison. The 95% confidence interval is consistent with reductions ranging from 5% to\n33%, which is the honest uncertainty range after accounting for the variability in counts\nacross patients. 
The dispersion parameter confirms that patients vary substantially in their\nhospitalization rates (some have zero, others have several per year), which is why negative\nbinomial was used and why the confidence interval is wider than a Poisson analysis would\nhave suggested.\n\n**Pros, cons, and trade-offs**\n\n*Pros of the negative binomial distribution*:\n- Correctly models overdispersed count data: the NB variance function Var(Y) = μ + αμ²\n  absorbs extra-Poisson heterogeneity, producing unbiased mean estimates and honest\n  confidence intervals in the regression context.\n- Theoretically motivated as the Poisson-gamma mixture: provides a mechanistic story for\n  why counts are overdispersed in heterogeneous populations, linking the dispersion parameter\n  to the magnitude of unobserved between-patient frailty.\n- Interpretable IRR when used with a log link and offset; the rate-ratio scale maps directly\n  to HEOR communication (\"X% fewer events per patient-year\").\n- Graceful degradation as α → 0: the NB collapses to Poisson with no discontinuity in\n  inference; no model change is needed if the data happen to be equidispersed.\n- Available in all major statistical packages with mature implementations (MASS::glm.nb in R,\n  NegativeBinomial in statsmodels, DIST=NEGBIN in SAS PROC GENMOD).\n\n*Cons of the negative binomial distribution*:\n- One additional parameter (α) to estimate, requiring a larger sample to pin down reliably;\n  estimation can fail to converge when overdispersion is very mild (α near 0) or extreme.\n- The boundary likelihood-ratio test for α = 0 requires halving the standard chi-square\n  p-value; analysts who forget this step overstate the evidence against Poisson.\n- Does not model event timing or ordering; the entire count history collapses to a single\n  integer per patient, discarding the recurrence structure.\n- Does not accommodate structural zeros without extension to ZINB or hurdle models, which\n  add complexity and require two 
sets of coefficients to interpret.\n- The NB-2 quadratic variance function can overfit the distributional tails in small samples\n  where α cannot be estimated precisely.\n\n*Trade-offs between distributional choices*:\n- Versus *Poisson*: Poisson is simpler and requires no extra parameter, but its SEs are\n  anticonservative under overdispersion — the classic error in naive HCRU analysis. NB is\n  always at least as appropriate as Poisson and should be the default for healthcare counts.\n  Use Poisson only when a formal dispersion test (LR boundary test; Pearson χ²/df ≈ 1)\n  supports equidispersion.\n- Versus *quasi-Poisson*: quasi-Poisson inflates SEs by an estimated scale factor but does\n  not change the likelihood or provide a proper probability model. It cannot estimate α, does\n  not support likelihood-based AIC/BIC model selection, and cannot generate predicted count\n  distributions for simulation or power analysis. NB is generally preferred when a full\n  probabilistic model is needed; quasi-Poisson is acceptable for robust SE correction in\n  simple descriptive analyses.\n- Versus *recurrent-event survival models* (Andersen-Gill, PWP): NB collapses event timing\n  into a single count, sacrificing the at-risk structure and the ability to model gap-time\n  dependence or the depletion of the at-risk set after terminal events. Survival models are\n  more appropriate when the timing and ordering of events carry scientific meaning, or when\n  competing death is severe and differential between arms.\n- Versus *ZINB/hurdle*: appropriate only when a structural zero-generating process exists\n  beyond what the NB naturally models. 
Over-applying zero-inflation to ordinary overdispersion\n  creates two models where one suffices, reduces statistical power, and complicates\n  interpretation.\n\n**When to use**\n\n- The outcome is a non-negative integer count of discrete events over a defined observation\n  window: hospitalizations, ED visits, prescription fills, distinct drug classes, infusion\n  cycles, or any other HCRU endpoint.\n- Variance clearly exceeds the mean: Pearson χ²/df > 1.5 from a fitted Poisson model, or\n  the sample variance-to-mean ratio for the raw data is substantially greater than 1.\n- The scientific question is the total volume or rate of events — not the timing of the first\n  event, not survival, not the sequential ordering of recurrent events.\n- Follow-up time varies across patients: use the distribution as the foundation for a rate\n  model with a log(person-time) offset to obtain an IRR.\n- The data come from a claims, EHR, registry, or linked administrative database (the routine\n  RWE setting where overdispersion is expected a priori).\n- As the distributional foundation for understanding why count regression software defaults\n  to or recommends NB over Poisson, and for reading the dispersion parameter output\n  critically.\n\n**When NOT to use**\n\n- *Equidispersed processes*: genuinely equidispersed counts where Pearson χ²/df ≈ 1.0 and\n  the LR boundary test does not reject Poisson. In homogeneous clinical-trial populations or\n  for rare events with very low α, Poisson is simpler and adds no unnecessary parameter.\n- *Structural zeros requiring hurdle or zero-inflated models*: when the zero-event subgroup\n  represents patients who structurally cannot generate the event — not just patients who\n  happened to have zero events in the observation window — plain NB will misfit the zero\n  probability. 
Use ZINB or hurdle NB after comparing observed versus NB-predicted zero\n  frequencies.\n- *Time-varying rates and recurrent-event survival analysis*: when the scientific question is\n  the time-to-next-event, the gap-time between events, or the at-risk process over calendar\n  time, the NB distribution discards exactly the information that matters. Use\n  Andersen-Gill, Prentice-Williams-Peterson, or Wei-Lin-Weissfeld recurrent-event models.\n- *Duration or time-to-event outcomes*: if the outcome is time (days to discharge, months to\n  death, survival time), the NB distribution is inapplicable; use survival analysis (Cox,\n  parametric AFT, or Fine-Gray competing-risks models).\n- *Continuous outcomes*: the NB is defined only for non-negative integers; costs, laboratory\n  values, and length-of-stay in fractional hours are not counts and require gamma GLM,\n  log-normal, or Tweedie models.\n- *Uncontrolled confounding*: like any GLM, an unadjusted NB comparison across treatment\n  arms in an observational dataset estimates a biased association, not a causal rate ratio.\n  Pair NB regression with propensity weighting, matching, or g-methods for valid causal\n  inference.",
    "primary_category": "Inferential_Statistics",
    "tags": [
      "statistics",
      "primitive",
      "distributions",
      "count-data",
      "overdispersion",
      "poisson-gamma-mixture",
      "dispersion-parameter",
      "rate-ratio",
      "variance-function",
      "claims",
      "hcru"
    ],
    "applies_to_study_types": [
      "cohort_retrospective",
      "cohort_prospective",
      "active_comparator_new_user",
      "claims_analysis",
      "ehr_study",
      "registry_study",
      "descriptive_analysis"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "primary",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1037/0033-2909.118.3.392",
        "url": "https://doi.org/10.1037/0033-2909.118.3.392",
        "citation_text": "Gardner W, Mulvey EP, Shaw EC. Regression analyses of counts and rates: Poisson, overdispersed Poisson, and negative binomial models. Psychological Bulletin. 1995;118(3):392-404.",
        "year": 1995,
        "authors_short": "Gardner et al.",
        "notes": "The canonical applied introduction to the Poisson-overdispersed Poisson-negative binomial progression, explaining why variance exceeding the mean invalidates Poisson standard errors and demonstrating the NB as the practical remedy. Widely cited in health services research for setting out the diagnostic logic (variance vs mean check, dispersion test) that precedes model choice."
      },
      {
        "role": "explain",
        "doi": "10.2307/3314912",
        "url": "https://doi.org/10.2307/3314912",
        "citation_text": "Lawless JF. Negative binomial and mixed Poisson regression. Canadian Journal of Statistics. 1987;15(3):209-225.",
        "year": 1987,
        "authors_short": "Lawless",
        "notes": "The authoritative statistical derivation of the negative binomial as a Poisson-gamma mixture, establishing the theoretical link between unobserved heterogeneity (frailty) in event rates and the NB-2 variance function. Provides the likelihood structure underlying MASS::glm.nb and statsmodels NegativeBinomial."
      },
      {
        "role": "demonstrate",
        "doi": "10.1016/0304-4076(90)90014-K",
        "url": "https://doi.org/10.1016/0304-4076(90)90014-K",
        "citation_text": "Cameron AC, Trivedi PK. Regression-based tests for overdispersion in the Poisson model. Journal of Econometrics. 1990;46(3):347-364.",
        "year": 1990,
        "authors_short": "Cameron & Trivedi",
        "notes": "Develops the score test and regression-based tests for overdispersion in the Poisson model, including the boundary LR test whose p-value must be halved when testing H0: alpha = 0. Essential methodologic grounding for the diagnostic workflow described in this entry."
      },
      {
        "role": "use",
        "doi": "10.18637/jss.v027.i08",
        "url": "https://doi.org/10.18637/jss.v027.i08",
        "citation_text": "Zeileis A, Kleiber C, Jackman S. Regression models for count data in R. Journal of Statistical Software. 2008;27(8):1-25.",
        "year": 2008,
        "authors_short": "Zeileis et al.",
        "notes": "Practical implementation reference for Poisson, negative binomial, zero-inflated, and hurdle count models in R using the pscl and MASS packages. Demonstrates the diagnostic workflow (dispersion test, vuong test, model comparison by AIC) and shows how glm.nb and the theta parameter relate to alpha in the NB-2 variance function."
      }
    ],
    "plain_language_summary": "The negative binomial distribution describes how many times a discrete event happens to a patient over a period of time — such as the number of hospitalizations per year — in situations where some patients have far more events than others, creating a long right tail. It extends the simpler Poisson distribution by adding one extra parameter (called the dispersion parameter) that absorbs the extra spread in the counts, so that confidence intervals in statistical models are correctly sized rather than falsely narrow. In real-world healthcare data, this extra spread is almost always present because patients differ in unmeasured ways — severity of illness, access to care, care-seeking habits — that cause their event rates to vary even within the same diagnosed condition and treatment arm.",
    "key_terms": [
      {
        "term": "overdispersion",
        "definition": "A condition in count data where the variance (spread) of the counts is larger than the mean, which violates the Poisson assumption and occurs in healthcare data because some patients are far sicker or higher-utilizers than others."
      },
      {
        "term": "dispersion parameter",
        "definition": "A number (called alpha) added to the negative binomial model to capture how much more spread the counts have than a Poisson distribution would predict; when alpha equals zero the negative binomial is identical to Poisson."
      },
      {
        "term": "Poisson-gamma mixture",
        "definition": "The mathematical reason the negative binomial distribution arises naturally: if each patient has their own event rate drawn from a gamma distribution, the observable counts across all patients follow a negative binomial distribution."
      },
      {
        "term": "offset",
        "definition": "A term added to a count regression model equal to the log of each patient's follow-up time, converting raw event counts into rates so that patients observed for different lengths of time can be compared fairly."
      },
      {
        "term": "rate ratio",
        "definition": "The ratio of the event rate in one group to the event rate in another group, expressed as events per unit of person-time; a rate ratio of 0.80 means the treated group had 20% fewer events per year than the comparison group."
      },
      {
        "term": "variance function",
        "definition": "The mathematical rule that connects the mean of a count distribution to its variance; for the negative binomial (NB-2), this is Var = mean + dispersion-parameter times mean squared, so variance grows faster than the mean as counts get larger."
      }
    ],
    "worked_example": {
      "scenario": "A pharmacoepidemiologist is analyzing annual COPD exacerbation counts for 10 patients in a claims database, each observed for exactly one full year on the same treatment. Before fitting a count regression model, the analyst wants to check whether the data are overdispersed — that is, whether the sample variance substantially exceeds the mean, which would make Poisson inappropriate and negative binomial necessary. The 10 patients' observed annual exacerbation counts are recorded in the table below.",
      "dataset": {
        "caption": "Annual COPD exacerbation counts for 10 patients, each observed for exactly 1 year. Three patients had zero exacerbations; one patient had six, pulling the variance well above the mean.",
        "columns": [
          "patient_id",
          "exacerbation_count",
          "person_years"
        ],
        "rows": [
          [
            "P01",
            0,
            1
          ],
          [
            "P02",
            0,
            1
          ],
          [
            "P03",
            0,
            1
          ],
          [
            "P04",
            1,
            1
          ],
          [
            "P05",
            1,
            1
          ],
          [
            "P06",
            2,
            1
          ],
          [
            "P07",
            3,
            1
          ],
          [
            "P08",
            3,
            1
          ],
          [
            "P09",
            4,
            1
          ],
          [
            "P10",
            6,
            1
          ]
        ]
      },
      "steps": [
        "Sum all 10 counts to find the total number of exacerbations across patients: 0+0+0+1+1+2+3+3+4+6 = 20.",
        "Compute the mean count: 20/10 = 2 exacerbations per patient per year. Under a Poisson distribution, the variance would equal the mean, so the expected Poisson variance would also be 2.",
        "Compute each patient's deviation from the mean of 2. Patients P01, P02, P03 (count 0) each deviate by negative 2, squared to 4. Patients P04, P05 (count 1) each deviate by negative 1, squared to 1. Patient P06 (count 2) deviates by zero, squared to 0. Patients P07, P08 (count 3) each deviate by positive 1, squared to 1. Patient P09 (count 4) deviates by positive 2, squared to 4. Patient P10 (count 6) deviates by positive 4, squared to 16.",
        "Sum the squared deviations: 4+4+4+1+1+0+1+1+4+16 = 36.",
        "Compute the sample variance by dividing the sum of squared deviations by n-1: 36/9 = 4.",
        "Check the overdispersion ratio (variance divided by mean): 4/2 = 2. A ratio of 1 would indicate Poisson holds. A ratio of 2 means the variance is double what Poisson would predict, confirming substantial overdispersion. Patient P10 with 6 exacerbations is the dominant driver: their squared deviation of 16 alone accounts for 16/36 of the total spread.",
        "Conclusion: the sample variance (4) clearly exceeds the sample mean (2). Fitting a Poisson model would produce standard errors that are too small by a factor related to the square root of the overdispersion ratio, making results look more statistically precise than the data actually support. A negative binomial model with an estimated dispersion parameter alpha is the appropriate choice."
      ],
      "result": "Mean = 20/10 = 2 exacerbations per patient per year. Sample variance = 36/9 = 4. Overdispersion ratio = 4/2 = 2. Because the variance (4) is double what the Poisson assumption requires (variance would equal the mean, 2), a negative binomial model is needed. The dispersion parameter alpha will be estimated from the data and inserted into the variance function Var(Y) = mean + alpha times mean squared to give honest confidence intervals. Using Poisson here would produce anticonservative standard errors and falsely narrow confidence intervals."
    },
    "prerequisites": [
      "descriptive-statistics",
      "inferential-statistics-foundations",
      "poisson-distribution"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "NB-2 (canonical quadratic): Var(Y) = mu + alpha*mu^2",
        "description": "The standard NB parameterization where the variance grows quadratically with the mean. Appropriate when heterogeneity in underlying rates compounds with the count level — the typical pattern for hospitalization and ED visit counts where high-utilizers have both higher mean rates and higher rate variability. This is the default in MASS::glm.nb (R), statsmodels NegativeBinomial (Python), and PROC GENMOD DIST=NEGBIN (SAS).",
        "edge_cases": [
          "When alpha is very small (< 0.01), the NB and Poisson likelihoods are nearly identical and the boundary LR test may not reject Poisson; use Poisson in this case.",
          "When alpha is very large (> 5), extreme overdispersion may indicate a latent subgroup structure (finite mixture / latent-class model) rather than a simple continuous frailty."
        ],
        "data_source_notes": "Claims: NB-2 is the appropriate default for essentially all HCRU count endpoints (hospitalizations, ED visits, infusions, distinct NDCs). Always report the estimated alpha alongside the IRR so readers can assess the degree of overdispersion."
      },
      {
        "name": "NB-1 (linear): Var(Y) = mu*(1 + delta) — quasi-Poisson equivalent",
        "description": "The linear variance parameterization where overdispersion is proportional to the mean rather than quadratic. Equivalent to quasi-Poisson with dispersion factor 1+delta. Less theoretically motivated for heterogeneous populations but occasionally preferred when overdispersion is mild and uniform across count levels or when NB-2 convergence fails. Not available as a named distribution in all packages; most easily approximated via quasi-Poisson in R (family = quasipoisson) or SCALE=PEARSON in SAS PROC GENMOD.",
        "edge_cases": [
          "Quasi-Poisson does not provide a proper likelihood, so AIC/BIC model comparison is not available; model selection must rely on Pearson chi-square or out-of-sample metrics."
        ],
        "data_source_notes": "Use as a sensitivity check when NB-2 estimates alpha near zero or when convergence is slow. Report alongside NB-2 results to document robustness."
      },
      {
        "name": "NB rate model with log(person-time) offset (IRR parameterization)",
        "description": "The distributional extension to the rate setting, where patients have different follow-up lengths due to disenrollment, death, or study end. The offset log(person-time) converts the model from predicting counts to predicting rates, and exp(beta_j) becomes an incidence rate ratio (IRR) adjusted for observation duration. This is the standard HCRU regression model and the form described in the companion model concept (poisson-negative-binomial-count-models).",
        "edge_cases": [
          "Patients with very short person-time (< 30 days) contribute unstable rate estimates; apply a minimum enrollment threshold or down-weight short-follow-up records.",
          "Omitting the offset when follow-up differs between arms biases the IRR in the arm with shorter mean follow-up — the most consequential application error for claims HCRU analyses."
        ],
        "data_source_notes": "Claims: build person-time from continuous enrollment spans, censoring at disenrollment, death, and data end. Exclude Medicare Advantage-only spans where inpatient encounter data are incomplete and would deflate counts relative to fee-for-service."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "poisson-negative-binomial-count-models",
        "pros_of_this": "Understanding the NB distribution as a primitive — the Poisson-gamma mixture, the variance function, the dispersion parameter interpretation — provides the conceptual foundation for reading model output critically: what alpha means, why the LR test requires halving the p-value, and when NB is inappropriate.",
        "cons_of_this": "This entry focuses on the distribution; the companion model concept covers the full regression workflow including covariate adjustment, offset construction, cluster-robust SEs, and the zero-inflated extension with worked claims examples.",
        "when_to_prefer": "Read this entry to understand the distributional mechanics; read poisson-negative-binomial- count-models for the end-to-end applied analysis workflow."
      },
      {
        "compared_to": "recurrent-events-analysis-rwe",
        "pros_of_this": "NB-based count models produce a single interpretable number per patient (total count or rate over a window) and a single IRR with CI — simpler to fit, report, and communicate than recurrent-event survival models with their at-risk structure and gap-time dependence.",
        "cons_of_this": "Recurrent-event survival models preserve the timing and ordering of events, handle the depletion of the at-risk set (death, dropout), and can model time-varying exposures and covariate processes. NB discards all of this, which matters when timing is scientifically important or when competing death is differential between arms.",
        "when_to_prefer": "Use NB count models when total rate or volume is the question and timing is secondary; use recurrent-event models when time-to-next-event or gap-time distribution is the endpoint."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Enrollment-based person-time is the denominator for the offset. Build continuous enrollment spans from the coverage table, censor at disenrollment, death, and data end; exclude Medicare Advantage-only enrollment months where inpatient encounter data are incomplete. Count events from all relevant claim files (facility, professional, outpatient) using validated code lists, collapsing same-stay transfer claims. Always test for overdispersion (Pearson chi2/df from Poisson fit) before committing to NB; in HCRU counts it will almost always be substantially > 1.",
      "ehr": "Counts from EHR orders, encounters, or NLP extractions are inflated by visit-driven bias: sicker patients generate more records purely by being seen more frequently. Adjust for visit intensity as a covariate or restrict the observation window to avoid confounding count rates with care engagement intensity. Define the observation window explicitly and treat loss to follow-up as potentially informative (linked administrative data improve coverage).",
      "registry": "Disease registries often carry clean, adjudicated structured counts (lines of therapy, transfusion cycles) ideal for validating claims-based count algorithms. The NB distributional diagnostic (variance > mean) applies equally to registry counts; check before defaulting to Poisson. Registry data are typically incomplete for cross-setting utilization events; link to claims for full event capture.",
      "primary": "Prospective study counts from trial or observational data may have smaller n, making alpha estimation less stable. In samples with n < 100, validate that the NB likelihood converges and report the standard error of the alpha estimate; quasi-Poisson SEs are a robust fallback. Verify the overdispersion assumption graphically (mean-variance plot by subgroup) rather than relying solely on the LR boundary test.",
      "linked": "Linked claims-EHR-vital-records provides the ideal substrate: EHR-derived severity for confounding control, claims for complete event capture, and reliable mortality for censoring. Reconcile service dates and admission/discharge definitions across sources before counting events, as double-counting across files inflates counts and biases alpha upward."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\nimport pandas as pd\nimport statsmodels.api as sm\nimport statsmodels.formula.api as smf\nfrom scipy import stats\n\n# ── Motivating dataset: 10 COPD patients, annual exacerbation counts (1 year each) ──\ndf = pd.DataFrame({\n    \"patient_id\"   : [f\"P{i:02d}\" for i in range(1, 11)],\n    \"count\"        : [0, 0, 0, 1, 1, 2, 3, 3, 4, 6],\n    \"person_years\" : [1.0] * 10,      # all observed exactly 1 year\n})\n\n# ── 1. Overdispersion diagnostic: sample mean and variance ──\nmean_count  = df[\"count\"].mean()          # 2.0\nvar_count   = df[\"count\"].var(ddof=1)     # 4.0  (sample variance, n-1 denominator)\ndisp_ratio  = var_count / mean_count      # 2.0  (should be 1.0 under Poisson)\nprint(f\"Mean: {mean_count:.2f}  |  Variance: {var_count:.2f}  |  Var/Mean ratio: {disp_ratio:.2f}\")\nprint(\"Overdispersed (ratio > 1):\", disp_ratio > 1)\n\n# ── 2. Poisson rate model (intercept-only for illustration); offset = log(person_years)\ndf[\"log_pt\"] = np.log(df[\"person_years\"])\npois = smf.glm(\"count ~ 1\", data=df,\n                family=sm.families.Poisson(),\n                offset=df[\"log_pt\"]).fit(disp=False)\n\n# Pearson chi-square / residual df: >> 1 flags overdispersion from the Poisson fit\npois_disp = float(pois.pearson_chi2 / pois.df_resid)\nprint(f\"\\nPoisson Pearson chi2/df = {pois_disp:.2f}  (expect ~1 if Poisson holds)\")\n\n# ── 3. Negative binomial MLE to estimate dispersion alpha ──\n#    NegativeBinomial.from_formula: fits NB-2 Var(Y) = mu + alpha*mu^2\nnb_mle = sm.NegativeBinomial.from_formula(\n    \"count ~ 1\", data=df, offset=df[\"log_pt\"]\n).fit(disp=0, method=\"nm\", maxiter=500, disp_kwds={\"disp\": False})\nalpha = float(nb_mle.params[\"alpha\"])\nprint(f\"\\nEstimated dispersion alpha = {alpha:.4f}\")\nprint(f\"Variance function: Var(Y) = mu + {alpha:.4f} * mu^2\")\n\n# ── 4. 
Boundary LR test of H0: alpha = 0 (Poisson) vs NB; halve the chi-square p-value\nlr_stat = 2.0 * (nb_mle.llf - pois.llf)\nlr_p    = 0.5 * stats.chi2.sf(lr_stat, df=1)   # halved because alpha=0 is on boundary\nprint(f\"\\nBoundary LR test: stat = {lr_stat:.2f}, p (halved) = {lr_p:.4f}\")\nprint(\"Decision: use NB\" if lr_p < 0.05 else \"Decision: Poisson may suffice\")\n\n# ── 5. NB-2 GLM using the estimated alpha; exponentiate intercept = estimated rate ──\nnb_glm = smf.glm(\"count ~ 1\", data=df,\n                  family=sm.families.NegativeBinomial(alpha=alpha),\n                  offset=df[\"log_pt\"]).fit()\nrate = float(np.exp(nb_glm.params[\"Intercept\"]))\nci   = np.exp(nb_glm.conf_int().loc[\"Intercept\"])\nprint(f\"\\nNB estimated rate = {rate:.2f} events/year  95% CI [{ci[0]:.2f}, {ci[1]:.2f}]\")\n\n# ── 6. Extended template: NB rate regression with treatment arm and covariates ──\n# Replace df with your analytic dataset (one row per patient, already cohort-built):\n#   analytic: person_id, arm (0/1), count, person_years, age, sex, comorb_index\n#\n# alpha_est = <alpha from nb_mle above, or fit on analytic>\n# nb_reg = smf.glm(\n#     \"count ~ C(arm) + age + sex + comorb_index\",\n#     data=analytic,\n#     family=sm.families.NegativeBinomial(alpha=alpha_est),\n#     offset=np.log(analytic[\"person_years\"])\n# ).fit(cov_type=\"cluster\", cov_kwds={\"groups\": analytic[\"person_id\"]})\n#\n# irr = np.exp(nb_reg.params)        # incidence rate ratios\n# ci  = np.exp(nb_reg.conf_int())    # 95% CIs on rate-ratio scale\n# print(pd.concat([irr.rename(\"IRR\"), ci], axis=1))",
        "description": "Overdispersion diagnostic and negative binomial distributional fitting using statsmodels.\nDemonstrates the mean-vs-variance check, Pearson dispersion statistic from a Poisson fit,\nNB-2 dispersion parameter (alpha) estimation via maximum likelihood, the boundary LR test\n(halved p-value), and a template for NB rate regression with a person-time offset. Uses\nthe 10-patient COPD exacerbation dataset from the beginner worked example.",
        "dependencies": [
          "numpy",
          "pandas",
          "statsmodels",
          "scipy"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(MASS)\nlibrary(sandwich)\nlibrary(lmtest)\n\n# ── Motivating dataset ──\ndf <- data.frame(\n  patient_id   = paste0(\"P\", sprintf(\"%02d\", 1:10)),\n  count        = c(0, 0, 0, 1, 1, 2, 3, 3, 4, 6),\n  person_years = rep(1.0, 10)\n)\n\n# ── 1. Overdispersion diagnostic ──\nmean_count  <- mean(df$count)          # 2.0\nvar_count   <- var(df$count)           # 4.0  (R uses n-1 by default)\ndisp_ratio  <- var_count / mean_count  # 2.0\ncat(sprintf(\"Mean: %.2f | Variance: %.2f | Var/Mean ratio: %.2f\\n\",\n            mean_count, var_count, disp_ratio))\n\n# ── 2. Poisson rate model (intercept-only, with offset) ──\npois_fit <- glm(count ~ 1, family = poisson(link = \"log\"),\n                data = df, offset = log(person_years))\n\n# Pearson dispersion statistic: >> 1 signals overdispersion\npois_disp <- sum(residuals(pois_fit, type = \"pearson\")^2) / df.residual(pois_fit)\ncat(sprintf(\"Poisson Pearson chi2/df = %.2f  (expect ~1 under Poisson)\\n\", pois_disp))\n\n# ── 3. Negative binomial (NB-2) via MASS::glm.nb ──\n#    glm.nb estimates theta (precision); alpha = 1/theta in Var(Y) = mu + alpha*mu^2\nnb_fit <- glm.nb(count ~ 1 + offset(log(person_years)), data = df)\ntheta   <- nb_fit$theta\nalpha   <- 1 / theta\ncat(sprintf(\"NB theta (precision) = %.4f\\n\", theta))\ncat(sprintf(\"NB alpha (dispersion) = %.4f  [Var(Y) = mu + %.4f * mu^2]\\n\", alpha, alpha))\n\n# ── 4. Boundary LR test: H0 alpha = 0 (Poisson) vs NB; halve the p-value ──\nlr_stat <- 2 * as.numeric(logLik(nb_fit) - logLik(pois_fit))\nlr_p    <- 0.5 * pchisq(lr_stat, df = 1, lower.tail = FALSE)\ncat(sprintf(\"Boundary LR test: stat = %.2f, p (halved) = %.4f\\n\", lr_stat, lr_p))\ncat(if (lr_p < 0.05) \"Decision: use NB\\n\" else \"Decision: Poisson may suffice\\n\")\n\n# ── 5. 
Estimated rate with 95% CI (intercept-only model = overall event rate) ──\nrate <- exp(coef(nb_fit)[\"(Intercept)\"])\nci   <- exp(confint(nb_fit))\ncat(sprintf(\"NB rate = %.2f/year  95%% CI [%.2f, %.2f]\\n\", rate, ci[1], ci[2]))\n\n# ── 6. Extended template: NB regression with treatment arm, covariates, offset ──\n# Replace df with your analytic cohort dataset:\n#   analytic: person_id (cluster), arm (factor), count, person_years, age, sex, comorb_index\n#\n# nb_reg <- glm.nb(\n#     count ~ arm + age + sex + comorb_index + offset(log(person_years)),\n#     data = analytic\n# )\n# # Cluster-robust SEs to account for within-person correlation:\n# vc <- vcovCL(nb_reg, cluster = ~person_id)\n# ct <- coeftest(nb_reg, vcov. = vc)                  # log-scale estimates with robust SEs\n# # Exponentiate estimates and CI limits only; exp(SE) is not the SE of an IRR\n# irr_table <- cbind(IRR = exp(ct[, \"Estimate\"]),\n#                    exp(coefci(nb_reg, vcov. = vc)))  # robust 95% CI on the rate-ratio scale\n# cat(\"Incidence Rate Ratios:\\n\"); print(irr_table)",
        "description": "Overdispersion diagnostic and negative binomial fitting using MASS::glm.nb. Demonstrates\nthe Pearson dispersion statistic from a Poisson baseline, glm.nb fitting with a person-time\noffset, extraction of the dispersion parameter theta (where alpha = 1/theta in the NB-2\nvariance function), the boundary LR test, and cluster-robust SEs via sandwich. Uses the\n10-patient COPD dataset from the beginner worked example.",
        "dependencies": [
          "MASS",
          "sandwich",
          "lmtest"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* ── Create motivating dataset ── */\ndata work.exacerbations;\n  input patient_id $ count person_years;\n  log_pt = log(person_years);   /* offset = log(person-time) */\n  datalines;\nP01 0 1.0\nP02 0 1.0\nP03 0 1.0\nP04 1 1.0\nP05 1 1.0\nP06 2 1.0\nP07 3 1.0\nP08 3 1.0\nP09 4 1.0\nP10 6 1.0\n;\nrun;\n\n/* ── 1. Overdispersion diagnostic: mean and variance ── */\nproc means data=work.exacerbations mean var;\n  var count;\n  /* Mean = 2, Variance = 4 -> ratio = 2 -> overdispersed */\nrun;\n\n/* ── 2. Poisson rate model; SCALE=PEARSON prints Pearson phi (chi2/df) ── */\nproc genmod data=work.exacerbations;\n  model count = / dist=poisson link=log offset=log_pt scale=pearson;\n  /* Pearson dispersion (phi) >> 1 flags overdispersion requiring NB */\n  /* The printed \"Scale\" parameter in PROC GENMOD output is sqrt(phi) */\nrun;\n\n/* ── 3. Negative binomial (NB-2): Var(Y) = mu + alpha*mu^2 ── */\nproc genmod data=work.exacerbations;\n  model count = / dist=negbin link=log offset=log_pt;\n  /* PROC GENMOD NB output: the \"Dispersion\" row of the parameter estimates is k in\n     Var(Y) = mu + k*mu^2, so it is alpha itself; no inversion is needed\n     (unlike theta from R's glm.nb, where alpha = 1/theta).\n     Confirm convergence in the iteration history. */\nrun;\n\n/* ── 4. Extended template: NB regression with treatment, covariates, and offset ──\n   Replace work.analytic with your cohort dataset (one row per patient).\n   exp(ESTIMATE) for arm is the adjusted incidence rate ratio (IRR). */\nproc genmod data=work.analytic;\n  class arm (ref='0') sex;\n  model count = arm age sex comorb_index / dist=negbin link=log offset=log_pt;\n  estimate 'IRR: arm 1 vs 0' arm 1 -1 / exp;   /* exponentiated = IRR */\nrun;\n\n/* ── 5. 
Cluster-robust (empirical) variance for within-person correlation ── */\nproc genmod data=work.analytic;\n  class arm (ref='0') sex person_id;\n  model count = arm age sex comorb_index / dist=negbin link=log offset=log_pt;\n  repeated subject=person_id / type=ind;   /* empirical (robust) sandwich variance */\n  estimate 'IRR: arm 1 vs 0' arm 1 -1 / exp;\nrun;",
        "description": "Overdispersion diagnostic and negative binomial fitting in SAS using PROC MEANS, PROC\nGENMOD DIST=NEGBIN with a log(person-time) offset, and the SCALE=PEARSON option for the\nPoisson baseline dispersion check. Demonstrates extracting the dispersion parameter from\nPROC GENMOD output and constructing the cluster-robust (GEE) variance via the REPEATED\nstatement. Uses the 10-patient COPD dataset and includes a template for a full regression\nwith treatment arm and covariates.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Het[Patients differ in unmeasured event-rate propensity]\n  Het --> Gamma[\"Each patient's rate lambda_i drawn from Gamma(r, r/mu)\"]\n  Gamma --> PoisMix[\"Conditional on lambda_i, counts ~ Poisson(lambda_i)\"]\n  PoisMix --> NB[\"Marginal distribution of counts ~ Negative Binomial(mu, alpha=1/r)\"]\n  NB --> VF[\"Variance function: Var(Y) = mu + alpha*mu^2\"]\n  VF --> D1{alpha near 0?}\n  D1 -->|Yes| Pois[Use Poisson: equidispersed]\n  D1 -->|No| D2{Structural zeros beyond NB?}\n  D2 -->|No| NBreg[Negative Binomial regression with offset]\n  D2 -->|Yes| ZINB[ZINB or hurdle NB]\n  NBreg --> IRR[\"exp(beta_j) = incidence rate ratio\"]",
        "caption": "The Poisson-gamma mixture genesis of the negative binomial distribution and the model selection decision tree. Patient heterogeneity in underlying event rates integrates out to produce NB-distributed marginal counts. Dispersion alpha drives the choice between Poisson, NB, and zero-inflated extensions.",
        "alt_text": "Flowchart showing patient heterogeneity feeding into per-patient gamma-distributed event rates, which produce Poisson counts, whose marginal distribution is negative binomial. The diagram then branches on alpha and structural zeros to guide the model choice.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "requires",
        "target_slug": "poisson-distribution",
        "notes": "The Poisson distribution is the limiting case of the NB as alpha approaches zero, and the Poisson-gamma mixture derivation begins with Poisson counts conditional on a patient's latent rate. Understanding why Poisson fails (its assumption that the variance equals the mean) motivates the NB extension."
      },
      {
        "relation_type": "requires",
        "target_slug": "generalized-linear-models",
        "notes": "The NB distribution is embedded in a GLM with a log link to produce the count regression models used in HEOR. Understanding GLM structure (link function, mean-variance relationship, offset) is necessary to apply the NB distribution analytically."
      },
      {
        "relation_type": "see_also",
        "target_slug": "poisson-negative-binomial-count-models",
        "notes": "The companion regression model concept built on this distributional primitive. While this entry covers what the NB distribution is and how to recognize overdispersion, the count models entry covers the full regression workflow: covariate adjustment, offset construction, dispersion testing, cluster-robust SEs, and zero-inflated extensions with HCRU examples."
      },
      {
        "relation_type": "see_also",
        "target_slug": "gamma-distribution",
        "notes": "The gamma distribution is the mixing distribution in the Poisson-gamma mixture derivation. The NB-2 dispersion parameter alpha equals the inverse of the gamma shape parameter r; understanding the gamma distribution provides the probabilistic foundation for interpreting alpha as the squared coefficient of variation of the latent patient rate (equivalently, the variance of a mean-one gamma frailty), so the coefficient of variation itself is sqrt(alpha)."
      },
      {
        "relation_type": "used_with",
        "target_slug": "recurrent-events-analysis-rwe",
        "notes": "NB count models and recurrent-event survival models address the same class of outcomes (repeated events per patient) from complementary angles. NB models the total count or rate over a window; recurrent-event survival models (Andersen-Gill, PWP, WLW) model the timing and ordering. In HCRU analyses, the NB distributional check (overdispersion test) is typically performed first, and recurrent-event models are used when the scientific question shifts from total volume to time-to-next-event."
      }
    ],
    "aliases": [
      "NB distribution",
      "Poisson-gamma mixture",
      "NB2",
      "negative binomial regression"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "negative-control-exposures-rwe",
    "name": "Negative Control Exposures",
    "short_definition": "An exposure that shares the confounding, channeling, and healthcare-contact structure of the primary exposure but has no plausible causal pathway to the outcome, used as a falsification test for residual confounding and surveillance bias in observational effect estimates.",
    "long_description": "A **negative control exposure (NCE)** is a second \"treatment\" variable that, by substantive knowledge, cannot cause the\noutcome of interest, yet is subject to the *same* unmeasured confounding, channeling, and surveillance forces as the\nprimary exposure. The logic is a falsification test: if the analytic pipeline that produced the\nprimary estimate also produces a non-null association between the NCE and the outcome, after the *identical* design and\nadjustment, then the assumption that the primary estimate is unconfounded is contradicted. An NCE that is \"clean\" (null\nafter adjustment) is reassuring but never proves the primary estimate is unbiased; an NCE that is \"dirty\" (associated with\nthe outcome) is a positive finding of residual bias. NCEs sit alongside negative control *outcomes* (a second outcome the\nexposure cannot cause) as the two halves of the negative-control toolkit; the exposure variant directly interrogates the\ntreatment-assignment and capture process, whereas the outcome variant interrogates the outcome-ascertainment and shared-cause\nprocess.\n\n**Core conceptual distinction**. The defining requirement is a U-comparability assumption: the NCE must share the\nunmeasured confounder(s) U with the primary exposure (U drives both who receives the drug and who receives the NCE) while\nsatisfying a sharp causal null (the NCE has no causal effect on the outcome, so any NCE–outcome association can arise only\nthrough U). 
Formally, in a DAG where\nU is the unmeasured confounder, the testable implication is that NCE ⊥ outcome given measured covariates *if* those\ncovariates fully control U; observing that the NCE is not independent of the outcome therefore falsifies \"no unmeasured\nconfounding,\" whereas observing independence does not prove it. The U-comparability requirement is fundamentally harder\nto satisfy than its counterpart for negative control outcomes: outcome mechanisms are often easy to rule out (the drug cannot cause a\nfracture in the first week), but exposure mechanisms (why a patient receives drug A) are precisely the thing we cannot\nfully observe, so finding a second exposure that travels the same confounding channel without its own outcome effect\nrequires deep clinical knowledge of prescribing. The estimand of the falsification step is the NCE–outcome association on\nthe *same scale and from the same model* as the primary contrast (hazard ratio for a Cox safety analysis, odds ratio for a\nlogistic analysis); the entire value of the test depends on holding design, covariate set, follow-up, and censoring fixed\nso that only the exposure variable swaps. When multiple NCEs are available, their estimates feed empirical calibration of\nthe primary estimate's confidence interval (see empirical-calibration-negative-controls-rwe).\n\n**Pros, cons, and trade-offs**.\n- **vs negative-control-outcomes-rwe:** NCEs interrogate the exposure-assignment and capture process directly (the source\n  of confounding by indication and channeling) and can be paired with proximal-causal-inference methods to *adjust* (not\n  merely detect) bias. 
Cost: a valid NCE is far harder to find than a valid negative control outcome, because the\n  U-comparability and exposure-null conditions are both demanding and the analyst rarely observes the prescribing reason.\n  **Prefer the NCE** when the worry is confounding by indication / channeling in the treatment decision; **prefer the\n  negative control outcome** when the worry is differential surveillance or shared-cause bias in outcome ascertainment.\n  The strongest designs use both.\n- **vs a single sensitivity analysis (E-value, probabilistic bias analysis):** An NCE is a *data-driven* probe that uses\n  the real confounding structure of the cohort, whereas an E-value asks only how strong an unmeasured confounder *would*\n  have to be. Cost: the NCE answer is only as good as the validity of the chosen control; a poorly chosen NCE gives false\n  reassurance (a clean NCE that does not actually share U), which is more dangerous than no test at all. Use NCEs to\n  *complement*, not replace, quantitative bias analysis (unmeasured-confounding-probabilistic-bias-analysis-rwe).\n- **vs empirical calibration with many controls:** A single NCE is a yes/no falsification; a panel of NCEs (and negative\n  control outcomes) supports systematic-error calibration of p-values and intervals. Cost: assembling a credible panel of\n  exposures that all share U is rarely feasible — exposure controls are scarce, so calibration usually leans on outcome\n  controls and uses one or two NCEs as confirmatory.\n\n**When to use**. 
(1) Whenever the primary worry is confounding by indication, channeling, healthy-user/healthy-adherer\nbias, or differential healthcare contact in a comparative drug or device study — i.e., bias rooted in *who gets treated*.\n(2) When a clinically credible second exposure exists that is prescribed through the same channel, to similar patients, at\na similar decision point, but has no pharmacologic or behavioral route to the outcome (e.g., an unrelated chronic-disease\nmaintenance drug as a control for a different therapy on an acute outcome). (3) Embedded inside an active-comparator,\nnew-user design so the NCE inherits the same washout, time-zero, and follow-up structure as the primary exposure. (4) As a\npre-specified falsification analysis in a protocol/SAP submitted to regulators, where a clean NCE strengthens the\ncausal narrative and a dirty NCE triggers design revision.\n\n**When NOT to use — and when it is actively misleading or dangerous**.\n- **No credible U-comparable exposure exists.** If the only available \"control\" exposures are prescribed to systematically\n  different patients, the NCE does not share U; a null result then *manufactures false reassurance* — the most dangerous\n  failure mode, because reviewers read it as evidence of validity. 
An invalid NCE is worse than none.\n- **The NCE has its own path to the outcome.** Shared contraindications, disease severity, or a real (even weak)\n  pharmacologic/behavioral effect on the outcome violate the causal null; the resulting \"dirty\" signal cannot distinguish\n  residual confounding from a genuine NCE–outcome effect, so the test is uninterpretable.\n- **The NCE and primary exposure share a common downstream consequence other than U** (collider or mediator), which can\n  *induce* an association under conditioning and produce a spurious dirty result, falsely condemning a valid primary\n  estimate.\n- **As a substitute for design.** An NCE detects but does not repair the underlying bias; reaching for it instead of an\n  active comparator, new-user restriction, or richer confounder adjustment is treating the symptom. The best NCEs are\n  chosen by clinical and data-source reasoning about the prescribing channel, never by scanning the literature for drugs\n  with null outcome associations.\n\n**Data-source operational depth**.\n- **Claims (FFS vs MA):** The NCE is operationalized exactly like the primary exposure — an NDC/J-code with `fill_date`\n  and `days_supply`, the same continuous-enrollment and washout requirement, and the same index/time-zero logic. Failure\n  mode: Medicare Advantage enrollees lack complete fee-for-service claims, so a patient can look like a non-user of the\n  NCE when the fill simply was not captured; an NCE built on MA-only person-time will be differentially missing relative\n  to the primary exposure and break U-comparability. Restrict to enrollees with full medical+pharmacy benefit and exclude\n  MA-only spans. 
Surveillance intensity matters: if the NCE drug triggers more lab monitoring or visits than the primary\n  drug, detection of the outcome differs between the \"exposures,\" confounding the falsification — choose an NCE matched on\n  monitoring intensity.\n- **EHR:** Orders are not dispensings; an NCE defined on the medication order list inherits the primary exposure's\n  order-vs-fill gap, but only if the same definition is used for both. Visit-driven capture means a patient who exits the\n  health system is differentially lost; if the NCE is prescribed in a different care setting (e.g., specialty vs primary\n  care) the loss-to-follow-up pattern differs and the test is biased. Confirm both exposures are captured in the same\n  encounter stream.\n- **Registry:** Registries excel at indication and severity but are usually thin on the full pharmacy record needed to\n  define a second exposure cleanly; an NCE typically requires linkage to claims for complete fills. Spontaneous/voluntary\n  registry enrollment can itself select on healthcare engagement, contaminating the shared-confounder assumption.\n- **Linked claims–EHR–vital records:** The ideal substrate — EHR gives clinical reasons for prescribing (sharpening the\n  judgment of whether the NCE truly shares U) while claims give complete fills — but linkage selects the linkable subset\n  and introduces order/fill/service date discrepancies that must be reconciled identically for primary and control\n  exposures before time-zero assignment. 
Differential competing risks also bite: in elderly claims cohorts, if the\n  NCE-treated subgroup has higher background mortality, competing death censors the outcome differently across \"exposures\"\n  and can create or mask a falsification signal — model the outcome on a cause-specific or subdistribution scale\n  consistently for both.\n\n**Worked claims example.** Primary question: incident hospitalized heart failure (HF) among adults with type 2 diabetes\ninitiating a second-generation **sulfonylurea** vs a **DPP-4 inhibitor**, in a commercial + Medicare FFS database, under an\nactive-comparator, new-user design (age ≥18; ≥2 diabetes diagnoses; 365 days continuous medical+pharmacy enrollment before\nthe first study fill; washout = no sulfonylurea or DPP-4i fill in the 365-day lookback; time zero = first qualifying fill;\nfollow-up to first validated HF hospitalization, censoring at disenrollment, death, end of data). Concern: channeling —\nsicker or more vascularly compromised patients may be steered toward sulfonylureas, biasing the HF estimate. **Negative\ncontrol exposure:** initiation of a **topical glaucoma agent (ophthalmic prostaglandin analog)** or a **statin** among the\nsame diabetic initiators — drugs prescribed through the same chronic-disease maintenance channel to comparably engaged\npatients but with no plausible acute pharmacologic route to HF *hospitalization* over the study window (statin chosen only\nif the analyst is confident it does not protect against the HF outcome on this timescale; otherwise the ophthalmic agent is\nsafer). Operationally: build the NCE cohort with the *identical* 365-day continuous enrollment, 365-day NCE-washout, index\n= first NCE fill (`fill_date`, `days_supply`), and the *same* high-dimensional propensity-score covariate set measured in\n`[index_date − 365, index_date]`, then fit the *same* Cox model (HF hospitalization, same censoring rules) with the NCE as\nthe exposure. 
Interpretation: an adjusted NCE hazard ratio near 1.0 with a tight interval is consistent with no detectable residual\nconfounding through this prescribing channel; an NCE HR materially away from 1.0 (e.g., 1.3, 95% CI 1.1–1.6) signals residual\nconfounding or differential surveillance and obligates redesign (richer adjustment, alternative comparator, or empirical\ncalibration of the primary CI using this and additional controls) before any HF conclusion is reported.\n\n**Interpreting the output**\n\nFrom the worked example: primary sulfonylurea vs DPP-4i HR = 1.28 (95% CI 1.10–1.48) for heart failure.\nNCE (ophthalmic agent, identical cohort design) HR = 1.31 (95% CI 1.09–1.57).\n\n*(1) Formal interpretation.* The ophthalmic NCE has no plausible causal path to heart failure, so any\nobserved association must arise from confounding, surveillance differences, or other systematic error.\nAn NCE HR = 1.31 that closely matches the primary HR = 1.28 suggests that a substantial portion of the\nprimary estimate reflects residual confounding rather than a pharmacological effect of sulfonylureas on\nHF. The NCE does not quantify how much of the primary HR is bias: the bias acting on the NCE estimate\nmay differ in size from the bias operating on the primary even when both point in the same direction. The result\nindicates bias *presence and approximate direction*, not its exact magnitude on the primary endpoint.\n\n*(2) Practical interpretation.* An NCE HR of 1.31 almost identical to the primary HR of 1.28 is a\nstrong signal that much or all of the observed sulfonylurea–HF association may be confounded by\nchanneling, with sicker, higher-risk patients preferentially receiving sulfonylureas. A decision-maker\nreviewing this study should not conclude that sulfonylureas increase HF risk by 28% until the primary\nanalysis has been redesigned (stricter active-comparator restriction, additional confounder proxies) or subjected\nto empirical calibration using this and other NCEs to re-anchor the inference.",
    "primary_category": "Bias_Control",
    "tags": [
      "negative-control",
      "negative-control-exposure",
      "falsification",
      "residual-confounding",
      "confounding-by-indication",
      "surveillance-bias",
      "channeling",
      "qba"
    ],
    "applies_to_study_types": [
      "active_comparator_new_user",
      "cohort_retrospective",
      "claims_analysis",
      "ehr_study"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1097/EDE.0b013e3181d61eeb",
        "url": "https://doi.org/10.1097/EDE.0b013e3181d61eeb",
        "citation_text": "Lipsitch M, Tchetgen Tchetgen E, Cohen T. Negative controls: a tool for detecting confounding and bias in observational studies. Epidemiology. 2010;21(3):383-388.",
        "year": 2010,
        "authors_short": "Lipsitch et al.",
        "notes": "Foundational framework formalizing negative control outcomes and exposures and the U-comparability logic of falsification testing."
      },
      {
        "role": "explain",
        "doi": "10.1007/s40471-020-00243-4",
        "url": "https://doi.org/10.1007/s40471-020-00243-4",
        "citation_text": "Shi X, Miao W, Tchetgen Tchetgen E. A selective review of negative control methods in epidemiology. Current Epidemiology Reports. 2020;7(4):190-202.",
        "year": 2020,
        "authors_short": "Shi et al.",
        "notes": "Comprehensive review distinguishing detection-only negative controls from proximal-causal-inference adjustment using negative control exposures."
      },
      {
        "role": "demonstrate",
        "doi": "10.1097/EDE.0000000000001650",
        "url": "https://doi.org/10.1097/EDE.0000000000001650",
        "citation_text": "Flanders WD, Strickland MJ, Klein M. Negative-control exposures: adjusting for unmeasured and measured confounders with bounds for remaining bias. Epidemiology. 2023;34(6):859-869.",
        "year": 2023,
        "authors_short": "Flanders et al.",
        "notes": "Exposure-specific methodology showing how a negative control exposure both detects residual confounding and yields bounds for the remaining bias in the primary estimate."
      },
      {
        "role": "use",
        "doi": "10.1073/pnas.1708282114",
        "url": "https://doi.org/10.1073/pnas.1708282114",
        "citation_text": "Schuemie MJ, Hripcsak G, Ryan PB, Madigan D, Suchard MA. Empirical confidence interval calibration for population-level effect estimation studies in observational healthcare data. PNAS. 2018;115(11):2571-2577.",
        "year": 2018,
        "authors_short": "Schuemie et al.",
        "notes": "Large-scale application using panels of negative controls to empirically calibrate confidence intervals across observational healthcare databases."
      }
    ],
    "plain_language_summary": "A negative control exposure is a second drug or treatment you add to your study that you are confident cannot actually cause the outcome you are studying, but that gets prescribed to the same kinds of patients through the same channels as your main drug of interest. You run the exact same statistical analysis on this control drug as on your real drug, and if the control drug shows an association with the outcome, that is a warning sign that your analysis is picking up bias rather than a real treatment effect. Think of it as a built-in lie detector for your study design.",
    "key_terms": [
      {
        "term": "negative control exposure",
        "definition": "A second drug or treatment added to a study that cannot plausibly cause the study outcome, used only to test whether the analytic method is producing spurious associations due to residual confounding."
      },
      {
        "term": "residual confounding",
        "definition": "Systematic error that remains in an estimate even after statistical adjustment, because important factors that influence both who gets treated and who develops the outcome were not fully measured or controlled."
      },
      {
        "term": "falsification test",
        "definition": "A check that applies the study's own analysis to a comparison where the true effect is known to be null; if the analysis nonetheless finds an association, the main analysis is suspect."
      },
      {
        "term": "channeling",
        "definition": "A pattern in prescribing where sicker or higher-risk patients tend to receive one drug over another, creating a built-in imbalance between treatment groups that is hard to fully adjust away."
      }
    ],
    "worked_example": {
      "scenario": "A researcher is studying whether starting a sulfonylurea (a diabetes pill) increases the risk of hospitalization for heart failure compared with starting a DPP-4 inhibitor (a different diabetes pill). She worries that sicker patients are being steered toward sulfonylureas, which would inflate the hazard ratio (HR) even after statistical adjustment. To test this, she adds a falsification check: she runs the same analysis replacing the real drug comparison with an ophthalmic (eye) drop that lowers intraocular pressure in glaucoma. Eye drops have no pharmacologic pathway to heart failure hospitalization. If the analysis is clean, the eye-drop HR should hover around 1.0 -- no association. If it does not, bias is present in the pipeline.",
      "dataset": {
        "caption": "Summary results table from the falsification analysis. Each row is one exposure run through the identical Cox model on the same cohort with the same covariate adjustment. An HR of 1.0 means no association; CIs crossing 1.0 are consistent with no effect.",
        "columns": [
          "Exposure",
          "Adjusted HR",
          "95% CI Lower",
          "95% CI Upper",
          "Interpretation"
        ],
        "rows": [
          [
            "Sulfonylurea vs DPP-4i (primary)",
            "1.28",
            "1.10",
            "1.48",
            "Study result: apparent 28% higher HF risk"
          ],
          [
            "Ophthalmic eye drop vs no eye drop (NCE)",
            "1.31",
            "1.09",
            "1.57",
            "Negative control: should be ~1.0, but is not"
          ]
        ]
      },
      "steps": [
        "Build the negative control exposure cohort exactly like the primary cohort: same enrollment rules, same washout period, same index date logic, same follow-up window, same outcome definition (heart failure hospitalization).",
        "Fit the identical Cox regression model, swapping only the exposure variable from the real drug comparison to the ophthalmic eye drop comparison; every covariate in the adjustment set stays the same.",
        "Read the NCE hazard ratio: the eye-drop HR is 1.31 (95% CI 1.09-1.57), meaning patients who initiated the eye drop appear to have a 31% higher hazard of heart failure hospitalization than those who did not.",
        "Because eye drops cannot biologically cause heart failure, this non-null result cannot reflect a true effect -- it must reflect residual confounding or differential healthcare contact that the adjustment did not remove.",
        "Compare the NCE HR to the primary HR: both are in the same direction and similar in magnitude (1.31 vs 1.28), which means much or all of the apparent sulfonylurea signal could be explained by the same unmeasured bias detected in the control."
      ],
      "result": "The negative control exposure HR is 1.31 (95% CI 1.09-1.57), well above the null value of 1.0 and statistically significant. This dirty NCE indicates that the analytic pipeline carries residual bias. The spurious association (about 30% on the hazard scale) is comparable in size to the primary estimate (HR 1.28), although a single control exposure cannot quantify the bias in the primary contrast exactly. The conclusion is that the sulfonylurea result cannot be taken at face value without redesign -- for example, richer covariate adjustment, a more restrictive active comparator, or empirical calibration of the confidence interval using this and additional control exposures."
    },
    "prerequisites": [
      "active-comparator-new-user",
      "dags-backdoor-criterion-drug-studies",
      "quantitative-bias-analysis-toolkit-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Medication negative control exposure",
        "description": "A second drug prescribed through the same channel and to similar patients as the primary exposure (sharing the confounding U) but with no plausible causal route to the outcome; defined with the identical NDC/J-code, new-user, washout, and time-zero logic as the primary exposure.",
        "edge_cases": [
          "Shared contraindications or disease severity give the control its own indirect path to the outcome, violating the causal null.",
          "A control drug with different lab-monitoring or visit intensity introduces differential surveillance, confounding the falsification test."
        ],
        "data_source_notes": "claims: build the NCE NDC/J-code list and apply the same enrollment, washout, and index rules as the primary drug; EHR: confirm the control is captured in the same order/encounter stream as the primary exposure."
      },
      {
        "name": "Procedure or policy negative control exposure",
        "description": "A non-causal procedure, screening encounter, or policy exposure with similar coding intensity and care setting to the primary exposure, used chiefly in procedure-safety and difference-in-differences/policy studies.",
        "edge_cases": [
          "Spillover (a co-occurring intervention bundled with the procedure) can contaminate the control.",
          "Immortal time in procedure studies must be handled identically for control and primary exposure or the test is biased."
        ],
        "data_source_notes": "claims: align index dates and at-risk windows for the control procedure to avoid immortal time; policy DiD: ensure the control policy is genuinely unrelated to the outcome trend."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "negative-control-outcomes-rwe",
        "pros_of_this": "Directly probes the treatment-assignment and capture process (confounding by indication, channeling) and can be extended via proximal causal inference to adjust, not merely detect, bias.",
        "cons_of_this": "A valid control exposure is much harder to find than a valid control outcome because both U-comparability and the exposure-null condition are demanding and the prescribing reason is usually unobserved.",
        "when_to_prefer": "When the dominant threat is confounding rooted in who gets treated; pair with a negative control outcome when surveillance or ascertainment bias is also plausible."
      },
      {
        "compared_to": "Quantitative/probabilistic bias analysis (E-value, PBA)",
        "pros_of_this": "Uses the real confounding structure of the observed cohort rather than hypothetical confounder strengths.",
        "cons_of_this": "A poorly chosen control that does not actually share U produces false reassurance, which is more harmful than running no test; validity hinges entirely on control selection.",
        "when_to_prefer": "As a complement to, not a replacement for, quantitative bias analysis; report both."
      },
      {
        "compared_to": "empirical-calibration-negative-controls-rwe",
        "pros_of_this": "A single NCE gives an interpretable yes/no falsification when only one credible control exposure exists.",
        "cons_of_this": "Cannot calibrate systematic error the way a full panel can; exposure controls are scarce so calibration usually leans on negative control outcomes.",
        "when_to_prefer": "When few control exposures exist; escalate to empirical calibration when a panel is available."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Define the NCE with the identical NDC/J-code, continuous-enrollment, washout, and time-zero logic as the primary exposure. Exclude Medicare Advantage-only person-time where fee-for-service fills are unobserved, since differential capture between the primary and control exposures breaks U-comparability. Choose an NCE matched on monitoring intensity to avoid differential surveillance, and handle competing risks (e.g., death in elderly cohorts) on the same scale for both exposures.",
      "ehr": "Use the same order-vs-administration definition and the same encounter stream for the control and primary exposures; a control prescribed in a different care setting changes the loss-to-follow-up pattern and biases the test.",
      "registry": "Registries are usually thin on full pharmacy exposure; link to claims to define a clean second exposure, and watch for selection on healthcare engagement that can contaminate the shared-confounder assumption.",
      "linked": "Linked claims-EHR-vital-records is the ideal substrate (clinical reasons for prescribing plus complete fills) but introduces linkage selection and order/fill/service date discrepancies that must be reconciled identically for primary and control exposures before time-zero assignment."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\nimport pandas as pd\nfrom lifelines import CoxPHFitter\n\nADJUST = [\"ps_logit\", \"age\", \"cci\", \"prior_util\"]  # identical covariate set for every exposure\n\ndef falsify(df: pd.DataFrame, exposures: list[str]) -> pd.DataFrame:\n    rows = []\n    for exp in exposures:\n        cph = CoxPHFitter()\n        cph.fit(df[[\"time\", \"event\", exp] + ADJUST], duration_col=\"time\",\n                event_col=\"event\", formula=\" + \".join([exp] + ADJUST))\n        s = cph.summary.loc[exp]\n        rows.append({\"exposure\": exp, \"hr\": np.exp(s[\"coef\"]),\n                     \"ci_low\": np.exp(s[\"coef lower 95%\"]),\n                     \"ci_high\": np.exp(s[\"coef upper 95%\"]), \"p\": s[\"p\"]})\n    out = pd.DataFrame(rows)\n    # Flag any NEGATIVE CONTROL whose CI excludes the null -> falsification of \"no residual confounding\".\n    out[\"nce_dirty\"] = (out[\"exposure\"].str.startswith(\"nce_\") &\n                        ((out[\"ci_low\"] > 1) | (out[\"ci_high\"] < 1)))\n    return out\n\nresult = falsify(df, [\"primary_drug\", \"nce_ophthalmic\", \"nce_statin\"])\nprint(result)",
        "description": "Falsification test: fit the SAME outcome model used for the primary contrast, swapping in each negative control\nexposure, on the SAME analytic cohort. Required input (one row per person, already cohort-built via the ACNU logic and\ncovariate-resolved; do not create data here):\n  df : person_id, time (follow-up days to event/censor), event (1=outcome, 0=censored),\n       primary_drug (1=study, 0=comparator),\n       nce_ophthalmic (1=initiated ophthalmic control, 0=not),  # negative control exposures, built with identical\n       nce_statin     (1=initiated statin control, 0=not),      # washout/time-zero/follow-up logic as primary_drug\n       plus the SAME baseline / propensity covariates used for the primary model (e.g., ps_logit, age, cci, prior_util).\nA primary HR away from null with NCE HRs near 1.0 is consistent with (but does not prove) the absence of residual\nchanneling; an NCE HR whose CI excludes 1.0 flags residual confounding / differential surveillance and obligates\nredesign or empirical calibration.",
        "dependencies": [
          "pandas",
          "lifelines",
          "numpy"
        ],
        "source_citations": [
          "flanders-2023",
          "schuemie-2018"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(survival)\nadjust <- c(\"ps_logit\", \"age\", \"cci\", \"prior_util\")  # identical covariate set for every exposure\n\nfalsify <- function(df, exposures) {\n  # loop variable is 'expo' rather than 'exp' so it does not mask base::exp\n  do.call(rbind, lapply(exposures, function(expo) {\n    f <- as.formula(paste(\"Surv(time, event) ~\", expo, \"+\", paste(adjust, collapse = \" + \")))\n    s   <- summary(coxph(f, data = df))          # summarize once, reuse below\n    ci  <- s$conf.int[expo, ]                    # exp(coef), exp(-coef), lower .95, upper .95\n    p   <- s$coefficients[expo, \"Pr(>|z|)\"]\n    data.frame(exposure = expo, hr = ci[\"exp(coef)\"],\n               ci_low = ci[\"lower .95\"], ci_high = ci[\"upper .95\"], p = p,\n               # a negative control whose CI excludes 1 falsifies \"no residual confounding\"\n               nce_dirty = grepl(\"^nce_\", expo) & (ci[\"lower .95\"] > 1 | ci[\"upper .95\"] < 1),\n               row.names = NULL)\n  }))\n}\n\nresult <- falsify(df, c(\"primary_drug\", \"nce_ophthalmic\", \"nce_statin\"))\nprint(result)",
        "description": "Falsification test in R: the SAME Cox model, looped over the primary exposure and each negative control exposure, on the\nSAME cohort. Input mirrors the Python version:\n  df : person_id, time, event (1/0), primary_drug (1/0),\n       nce_ophthalmic (1/0), nce_statin (1/0),   # NCEs built with identical washout/time-zero/follow-up logic\n       plus the SAME covariates (ps_logit, age, cci, prior_util).",
        "dependencies": [
          "survival"
        ],
        "source_citations": [
          "flanders-2023",
          "schuemie-2018"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "%macro phreg_exp(exp);\n  proc phreg data=work.analytic;\n    model time*event(0) = &exp ps_logit age cci prior_util / risklimits;\n    hazardratio &exp;\n    ods output ParameterEstimates=pe_&exp(where=(parameter=\"&exp\"));\n  run;\n%mend;\n\n%phreg_exp(primary_drug);   /* primary comparative contrast */\n%phreg_exp(nce_ophthalmic); /* negative control exposure 1 */\n%phreg_exp(nce_statin);     /* negative control exposure 2 */\n\ndata falsification;\n  length parameter $32;  /* common length so longer parameter names are not truncated when stacking */\n  set pe_primary_drug pe_nce_ophthalmic pe_nce_statin;\n  hr = exp(estimate); ci_low = exp(estimate - 1.96*stderr); ci_high = exp(estimate + 1.96*stderr);\n  /* a negative control whose CI excludes the null flags residual confounding / differential surveillance */\n  nce_dirty = (index(parameter,'nce_')=1) and (ci_low > 1 or ci_high < 1);\nrun;\n\nproc print data=falsification; var parameter hr ci_low ci_high probchisq nce_dirty; run;",
        "description": "Falsification test in SAS with PROC PHREG: run the SAME Cox model for the primary exposure and each negative control\nexposure, on the SAME analytic dataset, and collect the hazard ratios. Required input (post cohort construction):\n  work.analytic : person_id, time, event (1=outcome/0=censored), primary_drug (0/1),\n                  nce_ophthalmic (0/1), nce_statin (0/1),   /* NCEs built with identical washout/time-zero/follow-up */\n                  ps_logit age cci prior_util               /* the SAME covariate set used for the primary model */\nA negative-control HR whose 95% CI excludes 1 falsifies the no-residual-confounding assumption behind the primary estimate.",
        "dependencies": [],
        "source_citations": [
          "flanders-2023",
          "schuemie-2018"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  U[Unmeasured confounder U<br/>frailty, severity, healthcare engagement]\n  E[Primary exposure<br/>study drug vs comparator]\n  N[Negative control exposure<br/>U-comparable, no causal path to outcome]\n  Y[Outcome]\n  U -->|confounds| E\n  U -->|confounds| N\n  U -->|affects| Y\n  E -->|causal effect of interest| Y\n  N -. forbidden: no causal path .-> Y\n  E -. testable implication: N is independent of Y<br/>once U is controlled .- N",
        "caption": "Falsification DAG. The negative control exposure shares the unmeasured confounder U with the primary exposure but has no causal path to the outcome (dashed-forbidden arrow). Because U also affects the outcome, any NCE-outcome association that remains after the identical adjustment means U was not fully controlled and the primary estimate is suspect.",
        "alt_text": "Directed acyclic graph showing unmeasured confounder U pointing to the primary exposure, the negative control exposure, and the outcome; the primary exposure pointing to the outcome; and a forbidden no-effect arrow from the negative control exposure to the outcome.",
        "source_type": "illustrative",
        "source_citations": [
          "lipsitch-2010"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Q[Concern: confounding by indication / channeling in the primary contrast] --> C{Credible U-comparable<br/>exposure exists?}\n  C -- No --> X[Do NOT use an NCE<br/>a control that does not share U gives false reassurance<br/>use PBA / E-value / better design instead]\n  C -- Yes --> P{Does the candidate have<br/>any causal path to the outcome?}\n  P -- Yes --> X\n  P -- No --> B[Build NCE with IDENTICAL washout, time zero,<br/>covariates, follow-up, censoring as primary]\n  B --> F[Fit the SAME outcome model with NCE as exposure]\n  F --> D{NCE estimate near null?}\n  D -- Yes --> R[Reassuring: no detected residual confounding<br/>on this channel - not proof of validity]\n  D -- No --> M[Dirty NCE: residual confounding / differential surveillance<br/>revise design or empirically calibrate the primary CI]",
        "caption": "Decision logic for selecting and acting on a negative control exposure. The two gating conditions are U-comparability and the causal null; failing either means an NCE is the wrong tool.",
        "alt_text": "Decision flowchart for choosing a valid negative control exposure, gating on whether a U-comparable exposure exists and whether it has any causal path to the outcome, then building it identically to the primary exposure and acting on a null versus non-null result.",
        "source_type": "illustrative",
        "source_citations": [
          "lipsitch-2010"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "is_variant_of",
        "target_slug": "quantitative-bias-analysis-toolkit-rwe",
        "notes": "A falsification-based member of the quantitative bias analysis toolkit, focused on the exposure-assignment side of confounding."
      },
      {
        "relation_type": "see_also",
        "target_slug": "negative-control-outcomes-rwe",
        "notes": "The complementary control type; outcome controls probe ascertainment and shared-cause bias, exposure controls probe treatment assignment and capture. Strong designs use both."
      },
      {
        "relation_type": "used_with",
        "target_slug": "empirical-calibration-negative-controls-rwe",
        "notes": "Multiple negative control exposures (with outcome controls) feed empirical calibration of the primary estimate's p-value and confidence interval."
      },
      {
        "relation_type": "used_with",
        "target_slug": "active-comparator-new-user",
        "notes": "NCEs are most defensible when embedded in the ACNU design so they inherit the same washout, time-zero, and follow-up logic as the primary exposure."
      },
      {
        "relation_type": "alternative_to",
        "target_slug": "unmeasured-confounding-probabilistic-bias-analysis-rwe",
        "notes": "PBA quantifies bias under assumed confounder strengths; NCEs detect bias using the cohort's real confounding structure. Report them together rather than choosing one."
      }
    ],
    "aliases": [
      "NCE",
      "negative exposure control",
      "falsification exposure",
      "negative control exposure"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "ema"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "negative-control-outcomes-rwe",
    "name": "Negative Control Outcomes",
    "short_definition": "An outcome the exposure cannot plausibly cause but which shares the confounding, surveillance, coding, and care-seeking structure of the primary endpoint, analyzed with the identical design to falsify the assumption that residual systematic error has been removed.",
    "long_description": "A **negative control outcome (NCO)** is an event the study exposure should not cause through any direct or indirect\ncausal path, yet one that is subject to the *same* sources of residual systematic error that threaten the primary\nendpoint: unmeasured confounding, healthy-user / healthy-adherer selection, depletion of susceptibles, differential\nsurveillance or coding intensity, and care-seeking behavior. You run the **identical** analysis used for the primary\noutcome — same cohort, same time zero, same covariates, same propensity-score model, same weights or matched set, same\ncensoring rules — but swap the dependent variable for the NCO. Because the true exposure-NCO effect is null by\nconstruction, any non-null NCO association is a direct, empirical signal that the design and adjustment left detectable\nbias behind. The canonical real-world demonstration is influenza-vaccine effectiveness in seniors: vaccinated elders\nappear to have far lower all-cause mortality *before* influenza season ever starts (Jackson 2006) — an effect the\nvaccine cannot produce, exposing healthy-vaccinee confounding that contaminates the in-season estimate too.\n\n**Core conceptual distinction**\nAn NCO is defined by two simultaneous properties that pull in opposite directions and must both hold. (1) *Null causal\neffect*: the exposure has no effect on the NCO — no pharmacology, no behavioral pathway, no detection pathway tied to\nthe drug. (2) *Shared bias structure*: the NCO is moved by the same unmeasured or imperfectly-controlled factors that\nmove the primary outcome. The diagnostic power comes from the gap between what the model estimates (should be null) and\nwhat it actually estimates (the bias). Critically distinguish the NCO from a **negative control exposure (NCE)** — an\nexposure that cannot affect the *primary* outcome but shares its confounding (the symmetric falsification). 
NCOs are\nusually easier to find in claims (thousands of diagnosis codes) and are easier to match on surveillance/coding\nintensity; NCEs are often more directly analogous to the confounding-by-indication structure of the study exposure.\nAlso distinguish the **estimand**: a single pre-specified NCO is a *falsification test* (binary credibility signal),\nwhereas a curated *panel* of dozens of NCOs supports **empirical calibration** — estimating the systematic-error\ndistribution and recalibrating the primary p-value and confidence interval (Schuemie 2014, 2018). The two uses require\ndifferent numbers of controls and different assumptions; do not conflate them.\n\n**Pros, cons, and trade-offs**\n- **vs the E-value (e-value-sensitivity-analysis):** The E-value reports the minimum confounding strength on the\n  risk-ratio scale needed to explain away the result, under an assumption-only model that addresses *unmeasured\n  confounding alone*. The NCO is empirical — it uses the actual data and the exact analytic pipeline, and it detects\n  bias from sources the E-value ignores (surveillance, coding, care-seeking, selection). Cost: the NCO gives a\n  qualitative pass/fail (or a calibrated interval with a panel), not a single interpretable sensitivity bound; its\n  power depends on NCO frequency. Use both — they answer different questions.\n- **vs probabilistic / quantitative bias analysis (unmeasured-confounding-probabilistic-bias-analysis-rwe):** QBA\n  propagates external bias parameters (sensitivity/specificity, confounder prevalence, bias factors) into a corrected\n  estimate or simulation interval *for the primary parameter*. The NCO does not correct anything — it falsifies. QBA\n  needs credible external parameters; the NCO needs only a credible control outcome. 
They are complementary: NCO to\n  detect, QBA to quantify.\n- **vs a validation substudy / chart review:** A validation substudy can actually *correct* outcome misclassification\n  using gold-standard adjudication, but it is expensive and slow. A well-powered NCO test is cheap and uses data you\n  already have — at the price of falsifying rather than fixing.\n- **vs empirical calibration with NCEs alone:** NCOs can be tuned to share the primary outcome's exact detection\n  intensity (e.g., another inpatient primary diagnosis), which an NCE cannot; but a poorly chosen NCO (traumatic\n  injury, elective surgery) may not share the *confounding-by-indication* structure of a specific drug-outcome pair.\n  Best practice combines NCOs and NCEs (the \"negative control pair\").\n\n**When to use**\n- After a defensible primary design (new-user active-comparator or target-trial emulation) when the result is\n  surprising, policy-relevant, or regulatory-facing, and you want a data-driven check that residual bias was removed.\n- When the primary analysis leans on high-dimensional or ML-based confounding control (hdPS, large-scale PS) and you\n  need empirical evidence — not just balance tables — that adjustment worked.\n- In multi-database / OHDSI-style network studies, where a curated NCO panel enables empirical calibration so that the\n  same systematic-error correction is applied consistently across data sources.\n- As a pre-specified item in the SAP for FDA, EMA, or HTA submissions, reported regardless of what it shows.\n\n**When NOT to use — and when it is actively misleading or dangerous**\n- **The NCO is secretly affected by the exposure.** Using an infection endpoint as the \"negative\" control for an\n  immunosuppressant, a bleeding endpoint for an anticoagulant, or a fracture endpoint for a drug that alters bone\n  density violates the null-effect assumption; a non-null result is then real, not bias, and you will wrongly discard a\n  valid primary estimate (or wrongly 
\"clear\" it if the true effect cancels the bias). This is the dangerous failure.\n- **Mismatched capture intensity.** An outpatient \"rule-out\" diagnosis captured at low intensity cannot falsify an\n  inpatient primary outcome captured at high intensity — the NCO simply has different bias, and a null NCO gives false\n  reassurance.\n- **Power is negligible.** A rare or thinly-captured NCO yields a wide null by construction; treating that\n  uninformative null as \"the design is clean\" is self-deception. Pre-specify a minimum detectable effect.\n- **The design itself is broken.** Immortal time, depletion of susceptibles, or absence of an active comparator are\n  not rescued by a null NCO — a falsification test cannot validate a structurally invalid design.\n- **Over-interpreting a non-null NCO.** A non-null NCO proves residual bias exists; it does *not* identify the\n  direction or magnitude of bias for the primary endpoint without additional assumptions (e.g., bias-transport\n  assumptions in calibration). Never report a non-null NCO as a quantitative correction of the primary estimate.\n- **Post-hoc cherry-picking.** Choosing \"cute but biologically implausible\" controls after seeing the primary result,\n  or reporting only the NCOs that came out null, converts a falsification test into a credibility theater exercise.\n\n**Data-source operational depth**\n- **Claims (FFS or commercial):** The NCO must share the primary outcome's claim type and coding opportunity — match\n  inpatient-primary-dx to inpatient-primary-dx, not to an outpatient encounter code. The dominant failure mode is\n  **differential person-time observability**: Medicare Advantage and capitated/bundled arrangements drop fee-for-service\n  claims, so a \"null\" NCO can be missingness rather than absence — restrict the NCO analysis to the same Parts A/B/D (or\n  commercial medical+pharmacy) enrolled, non-MA-only person-time used for the primary outcome. 
A second trap is\n  **differential competing risks**: in elderly claims cohorts, if one arm has higher background mortality, a\n  cause-specific Cox on the NCO can read as \"null\" while a Fine-Gray subdistribution view shows the exposure arms\n  diverge — report the NCO on the *same* risk scale (cause-specific vs subdistribution) as the primary. A third trap is\n  **immortal time in procedure-anchored NCOs**: anchoring an elective-procedure NCO on a post-index event reintroduces\n  the immortal time you eliminated in the primary design.\n- **EHR:** Notes and labs help confirm an NCO is truly unrelated, but **outside-care leakage** (care delivered at\n  facilities outside the system) makes the NCO look null when the bias is still present in the claims-captured primary\n  outcome. EHR NCOs frequently encode *visit frequency* more than biology, so a sicker arm with more encounters\n  accrues more NCO codes purely from contact. Define observation windows explicitly and prefer linkage to claims.\n- **Registry:** High-quality adjudication reduces misclassification of the NCO itself, but the registry population's\n  surveillance intensity often differs from the full claims cohort, so a null NCO in the registry does not\n  automatically clear a claims-based primary analysis.\n- **Linked claims–EHR–registry:** The ideal substrate — registry/EHR confirm the NCO's clinical unrelatedness while\n  claims supply the utilization intensity that drives detection — but linkage selects only the linkable subset and\n  introduces order/fill/service date discrepancies that must be reconciled before applying the identical time zero.\n\n**Worked claims example**\nNew-user active-comparator study of Drug A vs Drug B for type 2 diabetes; primary outcome is hospitalized major bleeding\n(inpatient primary discharge diagnosis, validated claims algorithm). 
Eligibility requires 365 days of continuous,\nnon-MA-only A/B/D enrollment before the first qualifying fill (`fill_date`), no prior fill of either drug class in that\nwashout (incident-user), arm assigned from the NDC on `index_date`, baseline covariates measured only in\n[`index_date` − 365, `index_date`], 1:1 PS matching with standardized differences < 0.1, and follow-up censored at\ndisenrollment, death, end of data, and switch. Primary result after matching: HR = 0.71 (95% CI 0.58–0.87). The\n**pre-specified NCO** is hospitalized community-acquired pneumonia — no plausible effect of either glucose-lowering drug,\nbut captured at the *same* inpatient-primary-diagnosis intensity and subject to the same frailty/healthy-user selection\nthat could drive a spurious bleeding benefit. Applying the identical matched set, weights, censoring, and `id`-based\nrobust variance to the pneumonia NCO yields HR = 0.96 (0.83–1.11) — a null that supports removal of the shared bias.\nHad the pneumonia HR come back at 0.66 (0.55–0.79), the bleeding result would be presumed contaminated by residual\nhealthy-user selection and reported with a strong caveat. The NCO is run on the same cause-specific hazard scale as the\nprimary, restricted to the same FFS-observable person-time, and reported regardless of direction.\n\n**Interpreting the output**\n\nFrom the worked example: primary HR = 0.71 (95% CI 0.58–0.87) for bleeding. NCO (pneumonia, identical\nmatched cohort and Cox specification) HR = 0.96 (95% CI 0.83–1.11) — spanning 1.0, consistent with null.\n\n*(1) Formal interpretation.* The NCO estimate near 1.0 is consistent with the absence of detectable\nresidual bias operating through shared sources — healthy-user selection, frailty channeling, inpatient\nintensity differences — that the pneumonia control is designed to capture. 
It does *not* prove the\nprimary bleeding estimate is unconfounded; it only shows that the particular bias structure the NCO\nproxies left no detectable signal in this sample and endpoint. A non-null NCO (e.g., HR 0.66, 95% CI\n0.55–0.79) would indicate bias presence and suggest its direction, but would *not* quantify its exact\nmagnitude on the bleeding outcome — the two endpoints may share only some of the same confounders, and\nbias magnitudes can differ even when direction matches.\n\n*(2) Practical interpretation.* The pneumonia NCO HR 0.96 supports proceeding with the bleeding finding:\nthe shared confounding pathways this control can detect appear negligible. Had it returned strongly\nprotective (e.g., ≈ 0.66), the primary HR 0.71 would be presumed contaminated and a regulator or payer\nreviewing the dossier would rightly treat the bleeding benefit as unestablished pending redesign or\nempirical calibration. The NCO result must be reported regardless of direction; selective non-reporting\nof unfavorable falsification tests is a protocol deviation.",
    "primary_category": "Bias_Control",
    "tags": [
      "negative-control",
      "negative-control-outcome",
      "falsification",
      "residual-confounding",
      "surveillance-bias",
      "empirical-calibration",
      "qba"
    ],
    "applies_to_study_types": [
      "active_comparator_new_user",
      "cohort_retrospective",
      "claims_analysis",
      "ehr_study",
      "target_trial_emulation"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked",
      "multi-database"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1097/EDE.0b013e3181d61eeb",
        "url": "https://doi.org/10.1097/EDE.0b013e3181d61eeb",
        "citation_text": "Lipsitch M, Tchetgen Tchetgen E, Cohen T. Negative controls: a tool for detecting confounding and bias in observational studies. Epidemiology. 2010;21(3):383-388.",
        "year": 2010,
        "authors_short": "Lipsitch et al.",
        "notes": "Foundational paper formalizing the logic of negative control outcomes and exposures for detecting unmeasured confounding and other systematic error in observational studies."
      },
      {
        "role": "explain",
        "doi": "10.1002/sim.5925",
        "url": "https://doi.org/10.1002/sim.5925",
        "citation_text": "Schuemie MJ, Ryan PB, DuMouchel W, Suchard MA, Madigan D. Interpreting observational studies: why empirical calibration is needed to correct p-values. Statistics in Medicine. 2014;33(2):209-218.",
        "year": 2014,
        "authors_short": "Schuemie et al.",
        "notes": "Shows how a panel of negative controls estimates the empirical null (systematic error) distribution and why nominal p-values are miscalibrated in large observational databases."
      },
      {
        "role": "demonstrate",
        "doi": "10.1073/pnas.1708282114",
        "url": "https://doi.org/10.1073/pnas.1708282114",
        "citation_text": "Schuemie MJ, Hripcsak G, Ryan PB, Madigan D, Suchard MA. Empirical confidence interval calibration for population-level effect estimation studies in observational healthcare data. Proceedings of the National Academy of Sciences. 2018;115(11):2571-2577.",
        "year": 2018,
        "authors_short": "Schuemie et al.",
        "notes": "Extends calibration from p-values to confidence intervals using negative (and positive) controls; the production method behind OHDSI network studies."
      },
      {
        "role": "demonstrate",
        "doi": "10.1093/ije/dyi274",
        "url": "https://doi.org/10.1093/ije/dyi274",
        "citation_text": "Jackson LA, Jackson ML, Nelson JC, Neuzil KM, Weiss NS. Evidence of bias in estimates of influenza vaccine effectiveness in seniors. International Journal of Epidemiology. 2006;35(2):337-344.",
        "year": 2006,
        "authors_short": "Jackson et al.",
        "notes": "Classic applied falsification - influenza vaccine appears to lower pre-season all-cause mortality, an impossible effect that exposes healthy-vaccinee selection bias contaminating the in-season effectiveness estimate."
      }
    ],
    "plain_language_summary": "A negative control outcome is an event your study drug could not plausibly cause or prevent (something like a broken arm or appendicitis) that you run through the exact same analysis as a sanity check. Because the true effect of the drug on that outcome must be zero, any result that differs from zero by more than chance is measuring systematic error, not a real drug effect. If your negative control analysis comes back clean (no effect detected), that supports, but does not prove, that your main analysis is free of the biases the control shares with it; if it shows a spurious effect, your main result is probably affected by the same bias and must be interpreted with caution.",
    "key_terms": [
      {
        "term": "negative control outcome",
        "definition": "An outcome that the study drug cannot plausibly cause — chosen because it shares the same confounding and data-capture patterns as your real outcome, so any estimated effect on it must be bias rather than a true drug effect."
      },
      {
        "term": "residual confounding",
        "definition": "Bias that remains after statistical adjustment because some important differences between the treatment groups were never measured or fully controlled."
      },
      {
        "term": "falsification test",
        "definition": "A check you run on a question whose answer you already know is null (zero effect), so that a wrong answer proves your method has a problem."
      },
      {
        "term": "healthy-user bias",
        "definition": "A form of confounding where people who choose to take a preventive drug are systematically healthier than non-users for reasons unrelated to the drug — making the drug look more effective than it really is."
      },
      {
        "term": "empirical calibration",
        "definition": "A technique that uses results from many negative control outcomes to estimate how much systematic error a study design typically produces, then mathematically adjusts the main result to account for that error."
      }
    ],
    "worked_example": {
      "scenario": "A claims-based study compares Drug A versus Drug B for type 2 diabetes to see whether Drug A lowers the risk of hospitalized major bleeding (the primary outcome). After matching patients on measured characteristics, the analysis estimates that Drug A users have 29% lower bleeding risk (HR 0.71). The research team pre-specified appendicitis as a negative control outcome: neither drug has any known biological or behavioral pathway that would cause or prevent appendicitis, yet patients hospitalized for appendicitis would be captured in exactly the same inpatient claims records, and the same healthy-user tendencies that make Drug A users healthier overall would also affect appendicitis counts. The team runs the identical matched analysis, swapping in appendicitis as the outcome.",
      "dataset": {
        "caption": "Summary results table — the same matched cohort, two outcome definitions. HR below 1.0 means Drug A users had fewer events.",
        "columns": [
          "outcome",
          "drug_a_events",
          "drug_b_events",
          "hazard_ratio",
          "95_ci",
          "interpretation"
        ],
        "rows": [
          [
            "Hospitalized major bleeding (primary)",
            142,
            198,
            0.71,
            "0.58 to 0.87",
            "Drug A appears protective — but is this real or bias?"
          ],
          [
            "Appendicitis (negative control)",
            31,
            32,
            0.97,
            "0.60 to 1.58",
            "Near-null as expected — consistent with no bias"
          ]
        ]
      },
      "steps": [
        "Choose the negative control outcome before seeing any results: appendicitis is biologically unrelated to both drugs, is captured by the same inpatient primary-diagnosis claims code type as the bleeding outcome, and would be affected by the same healthy-user selection pressures.",
        "Run the exact same matched analysis — same patients, same matching weights, same follow-up rules, same statistical model — changing only which outcome is the dependent variable.",
        "Read the negative control result first. The appendicitis HR is 0.97 (95% CI 0.60 to 1.58), which spans 1.0 comfortably — consistent with the true null effect we expect.",
        "Because the negative control came back null, the shared bias sources (healthy-user selection, differential care-seeking) appear to have been adequately controlled by the matching.",
        "The primary bleeding result of HR 0.71 is therefore more credible: the design passed its own sanity check. Had the appendicitis HR come back at, say, 0.65 (a spurious protective effect no drug could cause), you would know residual healthy-user bias is inflating the bleeding benefit too, and you would report the primary result with a strong caution."
      ],
      "result": "Negative control HR 0.97 (95% CI 0.60 to 1.58): compatible with the null, as required, although with only 63 events the interval is wide and cannot rule out moderate bias. No bias signal detected. The primary HR 0.71 survives the falsification check and can be reported with greater confidence. If the negative control had shown HR 0.65, that spurious 35% reduction would indicate that residual bias of comparable magnitude may be contaminating the primary estimate."
    },
    "prerequisites": [
      "healthy-user-bias",
      "active-comparator-new-user",
      "propensity-score-methods-psm-iptw"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Single pre-specified NCO falsification check",
        "description": "One outcome chosen a priori for null causal effect plus shared bias structure, analyzed with the identical primary design and reported regardless of result.",
        "edge_cases": [
          "A single control can miss bias mechanisms it does not share, or be accidentally affected by the exposure, giving false reassurance or a false alarm.",
          "Underpowered if the NCO is rare; pre-specify a minimum detectable association."
        ],
        "data_source_notes": "Match the NCO's claim type and capture intensity to the primary outcome (inpatient-primary-dx to inpatient-primary-dx) and restrict to the same FFS-observable person-time."
      },
      {
        "name": "NCO panel for empirical calibration",
        "description": "A curated set of dozens of negative controls used to estimate the systematic-error distribution and recalibrate the primary p-value and confidence interval (empirical calibration).",
        "edge_cases": [
          "Contaminated controls (some truly affected by the exposure) bias the estimated null and degrade calibration.",
          "Requires the bias-transport assumption that the systematic error in the controls applies to the outcome of interest."
        ],
        "data_source_notes": "Standard in OMOP/OHDSI multi-database network studies; controls are clinically curated and screened for non-association with the exposure."
      },
      {
        "name": "Negative control pair (NCO + NCE)",
        "description": "Negative control outcomes and negative control exposures used together for symmetric falsification of both the outcome-side and exposure-side bias structures.",
        "edge_cases": [
          "Finding an NCE that shares confounding by indication with the study drug is harder than finding a matched NCO."
        ],
        "data_source_notes": "Use claims diagnosis/procedure breadth for NCOs and pharmacy/exposure breadth for NCEs measured in the same baseline window."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "e-value-sensitivity-analysis",
        "pros_of_this": "Empirical (uses the actual data and the exact same design/adjustment) rather than assumption-only; detects bias from surveillance intensity, coding, care-seeking, and selection - not just unmeasured confounding.",
        "cons_of_this": "Yields a qualitative pass/fail (or a calibrated interval with a panel) rather than a single interpretable confounding-strength bound; power depends on NCO frequency.",
        "when_to_prefer": "After a strong primary design when you want a direct, data-driven check that the identical analytic choices left no detectable residual bias."
      },
      {
        "compared_to": "unmeasured-confounding-probabilistic-bias-analysis-rwe",
        "pros_of_this": "Uses the study data itself rather than external bias parameters; directly relevant to the specific cohort, time zero, and adjustment applied.",
        "cons_of_this": "Falsifies rather than corrects - provides no quantitative corrected estimate or simulation interval for the primary parameter.",
        "when_to_prefer": "When the goal is credibility/detection or when no credible external bias parameters exist to drive QBA."
      },
      {
        "compared_to": "negative-control-exposures-rwe",
        "pros_of_this": "NCOs can be tuned to share the primary outcome's exact surveillance and coding intensity (e.g., another inpatient primary diagnosis); easy to find given thousands of diagnosis codes.",
        "cons_of_this": "Some NCOs (traumatic injury, elective procedures) may not share the confounding-by-indication structure of a specific drug-outcome pair, which an NCE captures more directly.",
        "when_to_prefer": "When the dominant threat is differential surveillance, coding, or care-seeking for the outcome rather than (or in addition to) channeling on the exposure."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Match the NCO to the primary outcome's claim type and coding intensity (inpatient-primary-dx to inpatient-primary-dx). Restrict to the same Parts A/B/D (or commercial medical+pharmacy) enrolled, non-MA-only person-time so a null NCO is true absence, not missingness. Report the NCO on the same risk scale (cause-specific hazard vs Fine-Gray subdistribution) as the primary to avoid differential-competing-risk artifacts in elderly cohorts. Pre-specify the NCO(s) in the SAP and report regardless of direction.",
      "ehr": "Notes/labs help confirm clinical unrelatedness, but outside-care leakage can make an NCO look null while the claims-captured primary outcome is still biased; EHR NCOs often encode visit frequency more than biology, so a sicker, higher-contact arm accrues more NCO codes. Define observation windows explicitly and prefer linkage to claims for complete capture.",
      "registry": "Adjudication reduces NCO misclassification, but the registry population's surveillance intensity may differ from the full claims cohort, so a null registry NCO does not automatically clear a claims-based primary analysis.",
      "linked": "Ideal for selecting and validating NCOs (registry/EHR for true outcome status and unrelatedness, claims for the utilization intensity that drives detection), but verify the NCO still shares the relevant bias structure in the linked subset before applying the identical time zero."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\nimport pandas as pd\nimport statsmodels.api as sm\nimport statsmodels.formula.api as smf\nfrom lifelines import CoxPHFitter\n\ndef run_nco_cox(analytic: pd.DataFrame, covariates: list[str]) -> pd.Series:\n    \"\"\"Weighted Cox (cause-specific hazard) on the NCO, robust SE clustered on person_id.\n    Mirrors the primary time-to-event specification exactly; only the outcome columns differ.\"\"\"\n    cols = [\"arm\", \"iptw\", \"person_id\", \"nco_time\", \"nco_event\"] + covariates\n    cph = CoxPHFitter()\n    cph.fit(analytic[cols], duration_col=\"nco_time\", event_col=\"nco_event\",\n            weights_col=\"iptw\", cluster_col=\"person_id\", robust=True,\n            formula=\"arm + \" + \" + \".join(covariates))\n    s = cph.summary.loc[\"arm\"]\n    return pd.Series({\"HR\": np.exp(s[\"coef\"]),\n                      \"lcl\": np.exp(s[\"coef lower 95%\"]),\n                      \"ucl\": np.exp(s[\"coef upper 95%\"]), \"p\": s[\"p\"]})\n\ndef run_nco_rr(analytic: pd.DataFrame, covariates: list[str]) -> pd.Series:\n    \"\"\"Weighted log-Poisson with HC0 robust variance -> adjusted RISK RATIO on a fixed-window binary NCO.\n    Poisson with log link + robust SE gives a valid RR and avoids log-binomial convergence failures.\"\"\"\n    formula = \"nco_binary ~ arm + \" + \" + \".join(covariates)\n    fit = smf.glm(formula, data=analytic, family=sm.families.Poisson(),\n                  freq_weights=analytic[\"iptw\"]).fit(cov_type=\"HC0\")\n    beta, se = fit.params[\"arm\"], fit.bse[\"arm\"]\n    return pd.Series({\"RR\": np.exp(beta),\n                      \"lcl\": np.exp(beta - 1.96 * se),\n                      \"ucl\": np.exp(beta + 1.96 * se), \"p\": fit.pvalues[\"arm\"]})\n# A null NCO HR/RR supports the design; a non-null one warns of residual bias (direction/magnitude\n# for the primary endpoint still require calibration or QBA assumptions).",
        "description": "Run the IDENTICAL primary specification on a negative control outcome, two ways. Required input: one analytic table,\none row per subject, already built by the primary design (new-user active-comparator or target-trial emulation):\n  analytic : person_id, arm (1=study,0=comparator), iptw (or matching weight),\n             nco_time (days to NCO event or censoring), nco_event (0/1),\n             nco_binary (0/1 NCO occurred in fixed window), <baseline covariates>\nUse the SAME weights/matched set, covariates, and censoring rules created for the primary outcome - only the\ndependent variable changes. Reuse run_nco_cox for time-to-event outcomes (the usual primary estimand) and\nrun_nco_rr for a fixed-window risk ratio.",
        "dependencies": [
          "numpy",
          "pandas",
          "lifelines",
          "statsmodels"
        ],
        "source_citations": [
          "lipsitch-2010"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(survival)\nlibrary(sandwich)\nlibrary(lmtest)\n\n# Weighted Cox (cause-specific hazard) on the NCO; cluster-robust SE via id-clustering.\nrun_nco_cox <- function(analytic, covariates) {\n  f <- reformulate(c(\"arm\", covariates), response = \"Surv(nco_time, nco_event)\")\n  fit <- coxph(f, data = analytic, weights = iptw, cluster = person_id, robust = TRUE)\n  ci <- summary(fit)$conf.int[\"arm\", c(\"exp(coef)\", \"lower .95\", \"upper .95\")]\n  c(HR = ci[[1]], lcl = ci[[2]], ucl = ci[[3]])\n}\n\n# Weighted log-Poisson with HC0 robust variance -> adjusted RISK RATIO on a fixed-window binary NCO.\nrun_nco_rr <- function(analytic, covariates) {\n  f <- reformulate(c(\"arm\", covariates), response = \"nco_binary\")\n  fit <- glm(f, data = analytic, weights = iptw, family = poisson(link = \"log\"))\n  ct <- coeftest(fit, vcov. = vcovHC(fit, type = \"HC0\"))[\"arm\", ]\n  b <- ct[[\"Estimate\"]]; se <- ct[[\"Std. Error\"]]\n  c(RR = exp(b), lcl = exp(b - 1.96 * se), ucl = exp(b + 1.96 * se))\n}\n# Run with the SAME weights/covariates as the primary outcome; report point estimate + CI regardless of result.",
        "description": "Same identical-specification NCO falsification in R using the survival and sandwich packages. Input mirrors the\nPython version: one analytic data.frame (one row per subject) carrying arm, iptw, person_id, the NCO time/event\ncolumns, a fixed-window binary NCO, and the baseline covariates used in the primary model. Reuse the primary\nweights or matched set unchanged - swap only the dependent variable.",
        "dependencies": [
          "survival",
          "sandwich",
          "lmtest"
        ],
        "source_citations": [
          "lipsitch-2010"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* Weighted Cox (cause-specific hazard) on the NCO with robust variance, identical to the primary spec. */\nproc phreg data=work.analytic covs(aggregate);\n  id person_id;                                   /* sandwich variance clustered on person */\n  class arm (ref='0') / param=ref;\n  model nco_time*nco_event(0) = arm <baseline covariates>;\n  weight psw;                                     /* same IPTW / matching weight as the primary */\n  hazardratio 'NCO HR (study vs comparator)' arm / diff=ref;\nrun;\n\n/* Weighted log-Poisson -> adjusted RISK RATIO on a fixed-window binary NCO, robust variance. */\nproc genmod data=work.analytic;\n  class arm (ref='0') person_id / param=ref;\n  model nco_binary = arm <baseline covariates> / dist=poisson link=log;\n  weight psw;\n  repeated subject=person_id / type=ind;          /* empirical (robust) standard errors */\n  estimate 'NCO log-RR (study vs comparator)' arm 1 / exp;\nrun;\n/* A null NCO HR/RR supports the design; a non-null one warns of residual bias. Pre-specify the NCO in the SAP\n   and report the point estimate + CI regardless of direction or significance. */",
        "description": "SAS NCO falsification applying the identical PS weights (or matched set) created for the primary analysis. Required\ninput dataset (post data-management, one row per subject):\n  work.analytic : person_id, arm (1/0), psw (IPTW or matching weight),\n                  nco_time, nco_event (0/1), nco_binary (0/1), <baseline covariates>\nPROC PHREG with WEIGHT plus COVS(AGGREGATE) and ID person_id gives the weighted cause-specific hazard ratio with robust\n(sandwich) variance - the standard estimand. PROC GENMOD with dist=poisson link=log plus REPEATED gives the adjusted RR\nwith empirical standard errors. The weight column is named psw here and iptw in the Python and R versions.",
        "dependencies": [],
        "source_citations": [
          "lipsitch-2010"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  U[Unmeasured confounding / healthy-user selection /<br/>surveillance + coding intensity] --> P[Primary outcome]\n  U --> N[Negative control outcome]\n  E[Exposure: Drug A vs Drug B] --> P\n  E -. no causal path .-x N\n  P --> Est1[Primary estimate]\n  N --> Est2[NCO estimate<br/>true value = null]\n  Est2 --> Dx{NCO estimate null?}\n  Dx -- yes --> Clean[Shared bias removed -><br/>primary estimate more credible]\n  Dx -- no --> Resid[Residual systematic error present -><br/>interpret primary with caution / calibrate]",
        "caption": "Bias DAG for the NCO falsification test. The exposure has no path to the negative control outcome, but the shared backdoor sources (U) reach both, so any non-null NCO estimate reveals residual bias that also threatens the primary outcome.",
        "alt_text": "Directed graph showing exposure causing the primary outcome but having no path to the negative control outcome, while a shared confounding/surveillance node points to both; a null NCO estimate supports credibility and a non-null estimate signals residual bias.",
        "source_type": "illustrative",
        "source_citations": [
          "lipsitch-2010"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  D[Primary design: new-user active-comparator /<br/>target-trial emulation + PS weights/matching] --> A[Apply IDENTICAL spec:<br/>same cohort, time zero, covariates,<br/>weights, censoring]\n  A --> B1[Dependent var = primary outcome] --> R1[Primary estimate]\n  A --> B2[Dependent var = negative control outcome] --> R2[NCO estimate]\n  R2 --> C1[Single NCO -> falsification pass/fail]\n  R2 --> C2[NCO panel -> empirical calibration<br/>recalibrate primary p-value + CI]",
        "caption": "Reusing the exact analytic pipeline. Only the dependent variable changes between the primary outcome and the NCO; a single NCO yields a falsification verdict, while a curated NCO panel feeds empirical calibration of the primary p-value and confidence interval.",
        "alt_text": "Flow from the primary design through an identical specification that branches into the primary outcome estimate and the NCO estimate, with the NCO supporting either a single-control falsification test or a panel-based empirical calibration.",
        "source_type": "illustrative",
        "source_citations": [
          "schuemie-2018"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "is_variant_of",
        "target_slug": "quantitative-bias-analysis-toolkit-rwe",
        "notes": "Negative control outcomes are one of the empirical falsification tools inside the broader QBA toolkit."
      },
      {
        "relation_type": "used_with",
        "target_slug": "negative-control-exposures-rwe",
        "notes": "NCOs and NCEs are often combined as the \"negative control pair\" for symmetric outcome-side and exposure-side falsification."
      },
      {
        "relation_type": "produces",
        "target_slug": "empirical-calibration-negative-controls-rwe",
        "notes": "A curated panel of NCOs supplies the empirical null distribution used to recalibrate the primary p-value and confidence interval."
      },
      {
        "relation_type": "see_also",
        "target_slug": "e-value-sensitivity-analysis",
        "notes": "The E-value bounds the unmeasured-confounding strength needed to explain the primary result; a non-null NCO is empirical evidence that such bias (from confounding or other sources) may actually be present."
      },
      {
        "relation_type": "see_also",
        "target_slug": "unmeasured-confounding-probabilistic-bias-analysis-rwe",
        "notes": "NCOs detect residual bias empirically; probabilistic bias analysis propagates assumed distributions for the bias parameters to produce a bias-adjusted estimate and simulation interval for the primary association."
      },
      {
        "relation_type": "see_also",
        "target_slug": "healthy-user-bias",
        "notes": "Many NCOs are chosen specifically to detect healthy-user / healthy-adherer selection and depletion of susceptibles."
      }
    ],
    "aliases": [
      "NCO",
      "negative outcome control",
      "falsification endpoint",
      "placebo outcome",
      "negative control event"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "nested-case-control",
    "name": "Nested Case-Control Design",
    "short_definition": "A sampling-efficient design that, within an assembled cohort, compares every case to a small set of controls sampled from the risk set at the case's event time, recovering the cohort's rate ratio while ascertaining expensive exposure or confounder data on only a fraction of the cohort.",
    "long_description": "The **nested case-control (NCC) design** is a sampling strategy *inside* a fully enumerated cohort, not a stand-alone\ncase-control study. The cohort and its person-time are defined first (entry, exit, outcome). Then, instead of measuring an\nexpensive exposure or confounder on everyone, the analyst ascertains it only on all cases plus a small number of controls\n**sampled from the set of cohort members still at risk at the moment each case fails** (the risk set). Because controls are\ndrawn from the same source population and the same calendar/age/follow-up structure that generated the cases, the design\npreserves the cohort's internal validity while collapsing the measurement burden to roughly (number of cases) x (1 + controls\nper case). It is the design of choice when the outcome is rare and the exposure assay is costly: stored-biospecimen\nbiomarkers, genotyping, manual chart abstraction, NLP over clinical notes, or expert outcome adjudication.\n\n**Core conceptual distinction.** The estimand and its validity hinge entirely on *how controls are sampled in time*, and this\nis the axis reviewers test. (1) **Incidence-density (risk-set) sampling** selects controls from those alive, enrolled, and\nevent-free *at the case's event time*; matching on time means the conditional-logistic odds ratio estimates the **hazard\n(rate) ratio** of the underlying Cox model directly, with no rare-disease assumption required (Langholz & Goldstein; Lubin &\nGail). Under density sampling a sampled control may later become a case, and the same subject may serve as a control in more\nthan one risk set; this is correct, not double-counting, and analyses that purge such subjects reintroduce bias. (2) **Cumulative (exclusive)\nsampling** draws controls only from those who never become cases over follow-up; the resulting odds ratio estimates the\ncumulative-incidence (risk) odds ratio and approximates the risk ratio only under the rare-disease assumption. 
Modern\npharmacoepidemiology defaults to density sampling. Separately, any **matching factor** (age, sex, calendar time, cohort-entry\ndate, follow-up time) is controlled by design but consumed: its main effect cannot be estimated from the matched data, and\nmatching on a mediator or collider is the classic NCC trap.\n\n**Pros, cons, and trade-offs.**\n- **vs full-cohort Cox proportional hazards (the key comparison):** NCC's only advantage is *measurement cost*. With\n  expensive exposure ascertainment (biobank assays, chart review, genotyping), NCC retains roughly 80-83% of full-cohort\n  efficiency at 4-5 controls per case (relative efficiency is about m/(m+1) for m controls per case near the null; Langholz &\n  Goldstein) for a small fraction of the assay budget. **When exposure is already cheap and\n  universal (i.e., ordinary claims/EHR where the drug, diagnosis, and covariates sit in the data for everyone), NCC throws\n  away information and full-cohort Cox dominates on efficiency.** Choosing NCC for cheap-exposure claims data is a common and\n  indefensible mistake.\n- **vs case-cohort design:** Both subsample a cohort, but the case-cohort comparator is a single random *subcohort* fixed at\n  baseline, reusable across multiple outcomes and supporting absolute-risk estimation with weighting. NCC re-samples controls\n  per-outcome at each event time and is more efficient for a single time-matched analysis, but its control set is outcome-specific\n  and not naturally reusable. 
**Prefer case-cohort** for multi-outcome biobank studies; **prefer NCC** for one time-to-event\n  outcome with strong time confounding (Wacholder).\n- **vs ordinary (population) case-control:** NCC fixes the source-population and selection problems of population\n  case-control because controls are provably from the cohort that generated the cases; there is no separate, possibly\n  incomparable, control series.\n- **vs self-controlled designs (SCCS, case-crossover):** Those eliminate *all* time-fixed confounding by within-person\n  comparison but require transient, reversible exposures and recurrent or acute outcomes. NCC handles chronic exposures and\n  between-person confounders via matching and adjustment, at the cost of residual unmeasured between-person confounding.\n\n**When to use.** A rare time-to-event outcome in a defined cohort where the exposure or a key confounder is expensive to\nmeasure and you want the cohort rate ratio without assaying everyone; studies needing tight control of strong time-related\nconfounders (age, calendar period, duration of follow-up) via matching; adjudicated-outcome or biomarker studies layered onto\nregistries or linked claims-EHR. 
Use **incidence-density sampling** matched on the time axis and analyze with conditional\nlogistic regression (or equivalently a Cox model on the sampled risk sets).\n\n**When NOT to use - and when it is actively misleading or dangerous.**\n- **Cheap, complete exposure in claims/EHR.** If the exposure and confounders are already captured for the whole cohort, NCC\n  is strictly less efficient than full-cohort Cox and offers no benefit; using it discards data and inflates variance.\n- **Cumulative sampling reported as a rate ratio.** Estimating the OR from exclusive (non-case) controls and presenting it as\n  a hazard/rate ratio without the rare-disease assumption is a quantitative error; if the outcome is common the bias is large.\n- **Time-varying exposure mis-anchored.** For each matched set, exposure must be evaluated as of the *index (case event) time*\n  for cases AND controls. Carrying a control's exposure forward to the case's event date, or evaluating a control's exposure\n  at its own (later) censoring date, is a frequent macro bug that biases the rate ratio.\n- **Over-matching.** Matching on a factor on the causal pathway (a mediator) or on a collider attenuates or distorts the true\n  effect; the matched main effect is then uninterpretable.\n- **Differential outcome surveillance by exposure.** Because the outcome defines the case set, exposure-dependent\n  surveillance (more testing in treated patients) is amplified, not diluted, by the design.\n\n**Data-source operational depth.**\n- **Claims (FFS):** Build the cohort and person-time first (continuous enrollment, washout, index/time-zero). For each case,\n  sample controls at risk at the case's event date matched on age, sex, calendar quarter, and cohort-entry date; ascertain the\n  expensive item (e.g., adjudicated outcome via chart pull, or a covariate requiring linkage) only on the sampled set. Lag\n  exposure to respect induction/latency. 
Failure modes: **Medicare Advantage person-time lacks fee-for-service claims**, so MA\n  enrollees are invisible to the risk set and sampling probabilities are distorted — restrict the risk set to FFS-observable\n  person-time (Parts A/B/D). Stockpiling and 90-day mail-order distort `days_supply`-based exposure windows.\n- **EHR:** Risk-set membership requires the patient to be \"active\" (an encounter window) at the case's event time; visit-driven\n  capture means a patient who leaves the system is not truly at risk and should not be sampled — define an observability\n  window, not mere presence in the database. Notes/labs are the very assets NCC makes affordable to abstract.\n- **Registry:** Excellent for adjudicated outcomes and disease severity but typically incomplete for exposure; NCC is ideal\n  when the registry case set is fixed and exposure must be pulled from linked claims or stored specimens for cases + sampled\n  controls only.\n- **Linked claims-EHR-vital records:** The ideal substrate — EHR severity, claims completeness, reliable mortality for the\n  competing-risk of death. **Competing risks bite hardest in elderly cohorts:** a potential control who *died* before the\n  case's event time is no longer at risk and is ineligible for that risk set; if mortality differs by exposure, naive control\n  sampling biases the rate ratio, so the death index must drive risk-set eligibility.\n\n**Worked claims example.** Question: rate of hospitalized acute kidney injury (AKI) among new users of a nephrotoxic oral\nagent in 100% Medicare FFS. (1) Cohort: adults >=66 with 365 days continuous A/B/D enrollment and no prior AKI; `index_date`\n= first qualifying `fill_date`; follow person-time from index to first AKI (`dx` in a validated inpatient algorithm),\ndisenrollment, death, or data end. (2) Cases: each first AKI hospitalization; its admission date is the **event time**. 
(3)\nRisk-set (density) sampling: for each case, randomly draw 4 controls from cohort members who at that exact event date are\nstill enrolled, event-free, and alive, matched on age (+/-2y), sex, and calendar quarter of cohort entry. A control sampled\nhere may itself develop AKI later and appear as a case; keep it. (4) Exposure: cumulative `days_supply` of the agent as of\nthe matched index time, lagged 30 days for induction; ascertained identically for the case and its 4 controls *as of that\nindex time* (never the control's own later date). (5) Covariate (the \"expensive\" item justifying NCC): baseline serum\ncreatinine / eGFR pulled from linked lab data only for the ~5x(#cases) sampled subjects. (6) Analysis: conditional logistic\nregression stratified on matched set, exposure as the primary term plus eGFR and key comorbidities; the conditional OR is read\nas the AKI **rate ratio**. (7) Sensitivity: vary controls-per-case (1, 4, 10) and the induction lag (0, 30, 60 days), and add a\nnegative-control outcome to probe residual confounding.",
    "primary_category": "Study_Design",
    "tags": [
      "nested-case-control",
      "within-cohort-sampling",
      "risk-set-sampling",
      "incidence-density-sampling",
      "conditional-logistic-regression",
      "pharmacoepidemiology",
      "rate-ratio",
      "matching"
    ],
    "applies_to_study_types": [
      "nested_case_control"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1097/00001648-199103000-00013",
        "url": "https://doi.org/10.1097/00001648-199103000-00013",
        "citation_text": "Wacholder S. Practical considerations in choosing between the case-cohort and nested case-control designs. Epidemiology. 1991;2(2):155-158.",
        "year": 1991,
        "authors_short": "Wacholder",
        "notes": "Canonical statement of the design choices distinguishing nested case-control from case-cohort sampling within a cohort."
      },
      {
        "role": "explain",
        "doi": "10.1214/ss/1032209663",
        "url": "https://doi.org/10.1214/ss/1032209663",
        "citation_text": "Langholz B, Goldstein L. Risk set sampling in epidemiologic cohort studies. Statistical Science. 1996;11(1):35-53.",
        "year": 1996,
        "authors_short": "Langholz & Goldstein",
        "notes": "Theory of incidence-density (risk-set) sampling showing the conditional-logistic estimator targets the Cox rate ratio and quantifying efficiency (~90% of full-cohort at 4-5 controls per case)."
      },
      {
        "role": "explain",
        "doi": "10.2307/2530744",
        "url": "https://doi.org/10.2307/2530744",
        "citation_text": "Lubin JH, Gail MH. Biased selection of controls for case-control analyses of cohort studies. Biometrics. 1984;40(1):63-75.",
        "year": 1984,
        "authors_short": "Lubin & Gail",
        "notes": "Shows that valid (unbiased) estimation of the rate ratio requires time-matched risk-set sampling and that exclusive sampling of non-cases biases the estimate unless the disease is rare."
      },
      {
        "role": "demonstrate",
        "doi": "10.1186/1471-2288-5-5",
        "url": "https://doi.org/10.1186/1471-2288-5-5",
        "citation_text": "Essebag V, Platt RW, Abrahamowicz M, Pilote L. Comparison of nested case-control and survival analysis methodologies for analysis of time-dependent exposure. BMC Medical Research Methodology. 2005;5:5.",
        "year": 2005,
        "authors_short": "Essebag et al.",
        "notes": "Demonstrates that a correctly executed density-sampled nested case-control analysis reproduces the full-cohort time-dependent Cox estimate, with worked time-varying-exposure handling."
      }
    ],
    "plain_language_summary": "A nested case-control study starts with a group of patients who are all followed over time (a cohort), and instead of studying everyone in full detail, it zooms in only on patients who develop the outcome of interest (the cases) plus a small handful of carefully chosen comparison patients. For each case, comparison patients — called controls — are drawn from the people in the cohort who were still being followed and had not yet had the outcome at the exact moment the case was diagnosed (called the risk set). This approach gives researchers roughly the same answer as studying the whole cohort, but at a fraction of the cost — because expensive measurements like lab tests, chart reviews, or genetic assays only need to be done for the cases and their matched controls, not for everyone.",
    "key_terms": [
      {
        "term": "cohort",
        "definition": "A defined group of patients who are enrolled at some starting point and followed forward in time to see who develops the outcome."
      },
      {
        "term": "case",
        "definition": "A patient in the cohort who develops the outcome being studied (for example, a hospitalization for kidney injury)."
      },
      {
        "term": "risk set",
        "definition": "The pool of cohort members who were still enrolled, alive, and outcome-free at the exact moment a specific case had their event — only people from this pool are eligible to be sampled as controls for that case."
      },
      {
        "term": "risk-set sampling",
        "definition": "The method of picking controls by drawing randomly from the risk set at the case's event time, so that controls come from the same follow-up context as the case."
      },
      {
        "term": "matched set",
        "definition": "One case paired with its sampled controls; every member of the set shares the same event-time anchor and any other matching factors like age or sex."
      },
      {
        "term": "conditional logistic regression",
        "definition": "A statistical model applied to the matched sets that compares cases to their controls within each set to estimate the rate of the outcome associated with an exposure."
      }
    ],
    "worked_example": {
      "scenario": "Imagine a cohort of six patients who enroll in a drug-safety study on 2023-01-01 and are followed until the end of 2023. On day 120 (2023-05-01), Patient P2 is hospitalized for the outcome we are tracking. We want to understand whether a certain exposure is linked to that hospitalization without measuring everyone's costly lab values. Using risk-set sampling, we identify which patients were still being followed and had not yet had the event on 2023-05-01, and we draw two of them as controls for P2.",
      "dataset": {
        "caption": "Cohort follow-up table — one row per patient, showing enrollment start, last observed date, and whether/when the outcome event occurred.",
        "columns": [
          "person_id",
          "entry_date",
          "exit_date",
          "event",
          "event_date"
        ],
        "rows": [
          [
            "P1",
            "2023-01-01",
            "2023-12-31",
            0,
            null
          ],
          [
            "P2",
            "2023-01-01",
            "2023-05-01",
            1,
            "2023-05-01"
          ],
          [
            "P3",
            "2023-01-01",
            "2023-12-31",
            0,
            null
          ],
          [
            "P4",
            "2023-01-01",
            "2023-03-15",
            0,
            null
          ],
          [
            "P5",
            "2023-01-01",
            "2023-12-31",
            0,
            null
          ],
          [
            "P6",
            "2023-01-01",
            "2023-12-31",
            0,
            null
          ]
        ]
      },
      "steps": [
        "P2 is the case: their event happens on day 120, which is 2023-05-01.",
        "To form the risk set on 2023-05-01, we ask: who else in the cohort was still enrolled (entry_date <= 2023-05-01), still being followed (exit_date >= 2023-05-01), and had not yet had the event? That gives us P1, P3, P5, and P6.",
        "P4 is excluded from the risk set because their last observed date was 2023-03-15 — they had already left the study before P2's event date.",
        "We randomly sample 2 controls from the eligible risk set {P1, P3, P5, P6}. Say we draw P3 and P5.",
        "The matched set for P2 is now: one case (P2) plus two controls (P3, P5), all anchored to the same event date of 2023-05-01.",
        "Expensive measurements (lab values, chart data) are collected only for P2, P3, and P5 — 3 patients instead of all 6.",
        "If additional cases occur later in follow-up, the same process repeats at each new event time. Note that P3 or P5 could themselves become cases later and appear in a future matched set as cases — that is correct and expected."
      ],
      "result": "Risk-set sampling reduces measurement to 3 of 6 cohort members (50%) for this event, and approaches 5 of 6 savings (roughly 80-90% cost reduction) in large studies with rare outcomes and 4-5 controls per case — without meaningfully sacrificing the accuracy of the rate estimate.",
      "timeline_spec": {
        "title": "Risk-set sampling at Case P2's event time (day 120, 2023-05-01)",
        "window": {
          "start": "2023-01-01",
          "end": "2023-12-31",
          "label": "Cohort observation window"
        },
        "events": [
          {
            "label": "P1 follow-up",
            "start": "2023-01-01",
            "length_days": 365,
            "quantity": "in risk set at day 120"
          },
          {
            "label": "P2 follow-up (CASE)",
            "start": "2023-01-01",
            "length_days": 120,
            "quantity": "event on day 120"
          },
          {
            "label": "P3 follow-up",
            "start": "2023-01-01",
            "length_days": 365,
            "quantity": "in risk set at day 120"
          },
          {
            "label": "P4 follow-up (exited)",
            "start": "2023-01-01",
            "length_days": 74,
            "quantity": "exited before day 120 — ineligible"
          },
          {
            "label": "P5 follow-up",
            "start": "2023-01-01",
            "length_days": 365,
            "quantity": "in risk set at day 120"
          },
          {
            "label": "P6 follow-up",
            "start": "2023-01-01",
            "length_days": 365,
            "quantity": "in risk set at day 120"
          }
        ],
        "spans": [
          {
            "kind": "followup",
            "start": "2023-01-01",
            "end": "2023-12-31",
            "label": "P1 — at risk"
          },
          {
            "kind": "followup",
            "start": "2023-01-01",
            "end": "2023-05-01",
            "label": "P2 — case, event day 120"
          },
          {
            "kind": "followup",
            "start": "2023-01-01",
            "end": "2023-12-31",
            "label": "P3 — sampled control"
          },
          {
            "kind": "unexposed",
            "start": "2023-01-01",
            "end": "2023-03-15",
            "label": "P4 — exited day 74, ineligible"
          },
          {
            "kind": "followup",
            "start": "2023-01-01",
            "end": "2023-12-31",
            "label": "P5 — sampled control"
          },
          {
            "kind": "followup",
            "start": "2023-01-01",
            "end": "2023-12-31",
            "label": "P6 — in risk set, not sampled"
          },
          {
            "kind": "exposed",
            "start": "2023-05-01",
            "end": "2023-05-01",
            "label": "Event time t = day 120 (2023-05-01) — risk set drawn here"
          }
        ],
        "markers": [
          {
            "date": "2023-05-01",
            "label": "Case P2 event — risk set sampled at this moment"
          }
        ],
        "result": {
          "label": "Risk set at day 120: {P1, P3, P5, P6} — 2 controls sampled (P3, P5). P4 excluded (exited day 74). Expensive data collected for 3 of 6 patients.",
          "value": 0.5
        },
        "caption": "Each horizontal bar shows one patient's follow-up period. The vertical line at day 120 (2023-05-01) marks when P2 had the event. Controls are drawn only from patients whose bars cross that line and who have not yet had the event — the risk set. P4's bar ends before day 120, so they are ineligible.",
        "alt_text": "Six horizontal patient follow-up bars on a 2023 timeline. A vertical marker at day 120 (May 1) shows the case event time for P2. Bars for P1, P3, P5, and P6 extend past that line, forming the risk set. P4's bar ends at day 74, before the marker, making them ineligible. P3 and P5 are highlighted as the sampled controls."
      }
    },
    "prerequisites": [
      "cohort-retrospective",
      "case-control",
      "cox-ph-regression"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Incidence-density (risk-set) sampling",
        "description": "Controls are sampled, with replacement across risk sets, from cohort members at risk (enrolled, event-free, alive) at each case's event time; matching on the time axis makes the conditional-logistic OR estimate the Cox hazard/rate ratio with no rare-disease assumption.",
        "edge_cases": [
          "A control may later become a case and may serve as a control for multiple earlier risk sets - this is valid; excluding such subjects biases the estimate.",
          "Exposure and covariates must be evaluated as of the matched index (case event) time for every member of the set, not at the control's own later date."
        ],
        "data_source_notes": "claims/EHR: define risk-set eligibility from observable, FFS-enrolled, alive person-time at the event date; a death index must remove decedents from the risk set in elderly cohorts."
      },
      {
        "name": "Cumulative (exclusive) sampling",
        "description": "Controls are sampled only from cohort members who never experience the outcome during follow-up; the resulting odds ratio estimates a cumulative-incidence (risk) odds ratio.",
        "edge_cases": [
          "Approximates the risk ratio only under the rare-disease assumption; reporting it as a rate/hazard ratio is an error when the outcome is common.",
          "Loses the automatic time-matching that density sampling provides, so calendar/age/follow-up confounding must be handled explicitly."
        ],
        "data_source_notes": "Rare in modern pharmacoepidemiology; acceptable only for genuinely rare outcomes with weak time confounding."
      },
      {
        "name": "Counter-matching",
        "description": "Controls are sampled stratified on a surrogate of exposure (or a strong confounder) to oversample the informative cells, increasing efficiency for the exposure-disease association; analyzed with offset/weighted conditional logistic regression.",
        "edge_cases": [
          "Requires an inexpensive surrogate available for the whole cohort to define strata; the weighting must be carried into the likelihood or estimates are biased."
        ],
        "data_source_notes": "claims: a coarse exposure proxy (any vs no fill) available cohort-wide can define counter-matching strata while the precise cumulative-dose covariate is ascertained only on the sample."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Full-cohort Cox proportional hazards",
        "pros_of_this": "Drastically lower measurement cost when exposure/confounders are expensive (assays, chart review, genotyping); retains ~90% efficiency at 4-5 controls per case and recovers the cohort rate ratio.",
        "cons_of_this": "Strictly less efficient than the full cohort; pointless when exposure and covariates are already cheap and complete (ordinary claims/EHR), where full-cohort Cox dominates.",
        "when_to_prefer": "Rare outcomes with costly exposure/confounder ascertainment; never for cheap, universally available claims-based exposures."
      },
      {
        "compared_to": "Case-cohort design",
        "pros_of_this": "More efficient for a single time-to-event outcome with strong time confounding because controls are matched on time at each event.",
        "cons_of_this": "The sampled control set is outcome-specific and not naturally reusable across outcomes; absolute-risk estimation requires extra machinery.",
        "when_to_prefer": "One time-matched time-to-event analysis; prefer case-cohort for multi-outcome biobank studies and absolute risk."
      },
      {
        "compared_to": "Self-controlled designs (SCCS, case-crossover)",
        "pros_of_this": "Handles chronic exposures and accommodates between-person confounders via matching and covariate adjustment.",
        "cons_of_this": "Remains vulnerable to unmeasured between-person confounding that self-controlled designs eliminate by within-person comparison.",
        "when_to_prefer": "Chronic, non-reversible exposures and outcomes where a within-person comparison is infeasible."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Enumerate the cohort and person-time first (continuous FFS enrollment, washout, index). For each case, sample controls at risk at the case's event date matched on age/sex/calendar quarter/cohort-entry; ascertain the expensive item only on the sampled set. Exclude Medicare Advantage person-time (no FFS claims) from the risk set. Evaluate exposure as of the matched index time and lag for induction.",
      "ehr": "Risk-set eligibility requires an active observability window at the case's event time, not mere database presence; visit-driven loss to follow-up removes patients from the risk set. NCC is well suited to EHR because it makes note/lab abstraction affordable.",
      "registry": "Strong for adjudicated outcomes/severity, weak for full exposure; ideal when the registry fixes the case set and exposure is pulled from linked claims or stored specimens for cases plus sampled controls only.",
      "linked": "Linked claims-EHR-vital-records is the ideal substrate; a death index must drive risk-set eligibility so that decedents are not sampled as controls, otherwise differential mortality by exposure biases the rate ratio in elderly cohorts."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\nimport pandas as pd\nfrom statsmodels.discrete.conditional_models import ConditionalLogit\n\nN_CONTROLS = 4\nrng = np.random.default_rng(20240101)\n\ndef sample_risk_sets(cohort: pd.DataFrame, match_cols=(\"sex\", \"age_band\")) -> pd.DataFrame:\n    cases = cohort[cohort[\"event\"] == 1]\n    rows = []\n    for _, case in cases.iterrows():\n        t = case[\"event_date\"]\n        # At-risk at t: entered on/before t, still under observation at t (alive, enrolled, event-free at t).\n        at_risk = cohort[(cohort[\"entry_date\"] <= t) & (cohort[\"exit_date\"] >= t) &\n                         (cohort[\"person_id\"] != case[\"person_id\"])]\n        for c in match_cols:                      # exact matching on time-stable factors\n            at_risk = at_risk[at_risk[c] == case[c]]\n        k = min(N_CONTROLS, len(at_risk))\n        ctrl = at_risk.sample(k, random_state=int(rng.integers(1e9))) if k else at_risk\n        rows.append({\"set_id\": case[\"person_id\"], \"person_id\": case[\"person_id\"],\n                     \"index_time\": t, \"is_case\": 1})\n        for pid in ctrl[\"person_id\"]:\n            rows.append({\"set_id\": case[\"person_id\"], \"person_id\": pid,\n                         \"index_time\": t, \"is_case\": 0})\n    return pd.DataFrame(rows)\n\nsets = sample_risk_sets(cohort)\n# Attach exposure measured AS OF index_time for every member (case and controls alike), then model.\nsets = sets.merge(expo, on=\"person_id\", how=\"left\")   # expo computed at each row's index_time upstream\nm = ConditionalLogit(sets[\"is_case\"], sets[[\"cum_days_supply\"]], groups=sets[\"set_id\"]).fit()\nprint(m.summary())   # exp(coef) is the rate (hazard) ratio under density sampling",
        "description": "Incidence-density (risk-set) sampling + conditional logistic analysis from claims-style inputs. Required inputs\n(already cleaned, one row per person unless noted):\n  cohort : person_id, entry_date, exit_date, event (0/1), event_date (=exit_date if event==1)\n  expo   : person_id, cum_days_supply_at(index)  # ascertained later ONLY for sampled ids, as of the matched index time\nRisk set for a case at time t = {others with entry_date <= t <= exit_date} (enrolled, alive, event-free at t). A sampled\ncontrol may itself be a case in another set - that is correct. Match on sex / age band / calendar quarter as needed by\npre-filtering the risk pool. Output one matched set per case for conditional logistic regression.",
        "dependencies": [
          "pandas",
          "numpy",
          "statsmodels"
        ],
        "source_citations": [
          "goldstein-1996",
          "essebag-2005"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\nlibrary(survival)\nN_CONTROLS <- 4L\nset.seed(20240101)\n\nsample_risk_sets <- function(cohort) {\n  setDT(cohort)\n  cases <- cohort[event == 1L]\n  out <- vector(\"list\", nrow(cases))\n  for (i in seq_len(nrow(cases))) {\n    ca <- cases[i]\n    t  <- ca$event_date\n    # At-risk at t and matched on time-stable factors; exclude the case itself.\n    pool <- cohort[entry_date <= t & exit_date >= t & person_id != ca$person_id &\n                   sex == ca$sex & age_band == ca$age_band]\n    k <- min(N_CONTROLS, nrow(pool))\n    ctrl <- if (k > 0L) pool[sample(.N, k)] else pool\n    out[[i]] <- rbind(\n      data.table(set_id = ca$person_id, person_id = ca$person_id, index_time = t, is_case = 1L),\n      data.table(set_id = ca$person_id, person_id = ctrl$person_id, index_time = t, is_case = 0L)\n    )\n  }\n  rbindlist(out)\n}\n\nsets <- sample_risk_sets(cohort)\nsets <- merge(sets, expo, by = \"person_id\")          # cum_days_supply as of index_time\nfit <- clogit(is_case ~ cum_days_supply + strata(set_id), data = sets)\nsummary(fit)   # exp(coef) = rate ratio",
        "description": "Risk-set sampling + clogit using the survival package. Inputs mirror the Python version:\n  cohort : person_id, entry_date, exit_date (Date), event (0/1), event_date, sex, age_band\nsurvival::clogit fits the conditional-logistic likelihood that, under density sampling, equals the Cox partial likelihood\non the sampled risk sets; exp(coef) is the rate ratio. Exposure must be measured as of index_time for each row.",
        "dependencies": [
          "data.table",
          "survival"
        ],
        "source_citations": [
          "goldstein-1996",
          "essebag-2005"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "%let n_controls = 4;\n\n/* Eligible (control, case) pairs: control at risk at the case's event_date, matched on time-stable factors. */\nproc sql;\n  create table riskpool as\n  select c.person_id   as set_id,            /* the case defines the matched set */\n         c.event_date  as index_time format=date9.,\n         c.sex, c.age_band,\n         p.person_id   as ctrl_id,\n         ranuni(20240101) as rk\n  from work.cohort c\n  join work.cohort p\n    on  p.entry_date <= c.event_date          /* at risk: entered before event time */\n    and p.exit_date  >= c.event_date          /* still observed (alive/enrolled/event-free) at event time */\n    and p.person_id  ne c.person_id\n    and p.sex        =  c.sex\n    and p.age_band   =  c.age_band\n  where c.event = 1;\nquit;\n\n/* Keep N lowest random keys per case = N sampled controls. */\nproc sort data=riskpool; by set_id rk; run;\ndata controls;\n  set riskpool; by set_id;\n  retain n; if first.set_id then n = 0;\n  n + 1; if n <= &n_controls;\n  person_id = ctrl_id; is_case = 0; keep set_id index_time person_id is_case;\nrun;\ndata cases;\n  set work.cohort(where=(event=1));\n  set_id = person_id; index_time = event_date; is_case = 1;\n  keep set_id index_time person_id is_case;\nrun;\n\ndata matched; set cases controls; run;\n\n/* Attach exposure measured AS OF index_time, then conditional logistic on the matched sets. */\nproc sql;\n  create table analytic as\n  select m.*, e.cum_days_supply\n  from matched m left join work.expo e\n    on m.person_id = e.person_id and m.set_id = e.set_id;\nquit;\n\nproc logistic data=analytic;\n  strata set_id;                                   /* conditional (matched-set) likelihood */\n  model is_case(event='1') = cum_days_supply;      /* exp(beta) = rate ratio under density sampling */\nrun;",
        "description": "Risk-set sampling in PROC SQL + conditional logistic regression in PROC LOGISTIC (STRATA). Required inputs:\n  work.cohort : person_id entry_date exit_date event event_date sex age_band\n  work.expo   : person_id set_id cum_days_supply   /* exposure ascertained as of each set's index_time */\nThe Cartesian self-join builds the time-matched risk pool; a random key selects N controls per case. PROC LOGISTIC with\nSTRATA fits the exact conditional-logistic likelihood, so under density sampling the odds ratio is the rate ratio.\n(PROC PHREG with a (start,stop) counting-process formulation on the full cohort is the efficient alternative when exposure\nis cheap.)",
        "dependencies": [],
        "source_citations": [
          "goldstein-1996",
          "essebag-2005"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": "nested-case-control-timeline.svg",
        "mermaid": null,
        "caption": "Each horizontal bar shows one patient's follow-up period. The vertical line at day 120 (2023-05-01) marks when P2 had the event. Controls are drawn only from patients whose bars cross that line and who have not yet had the event — the risk set. P4's bar ends before day 120, so they are ineligible.",
        "alt_text": "Six horizontal patient follow-up bars on a 2023 timeline. A vertical marker at day 120 (May 1) shows the case event time for P2. Bars for P1, P3, P5, and P6 extend past that line, forming the risk set. P4's bar ends at day 74, before the marker, making them ineligible. P3 and P5 are highlighted as the sampled controls.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Cohort[\"Assembled cohort with person-time<br/>entry / exit / outcome defined\"] --> Cases[Identify every case<br/>and its event time t_i]\n  Cohort --> Riskset[\"At each t_i form the risk set:<br/>members enrolled, alive, event-free at t_i\"]\n  Cases --> Sample[\"Sample 4-5 controls per case<br/>from that risk set, matched on time-stable factors\"]\n  Riskset --> Sample\n  Sample --> Assay[\"Ascertain expensive exposure / confounder<br/>ONLY on cases + sampled controls, as of t_i\"]\n  Assay --> Clogit[\"Conditional logistic regression on matched sets<br/>exp(coef) = rate ratio\"]\n  Clogit --> Sens[\"Sensitivity: controls-per-case,<br/>induction lag, negative-control outcome\"]",
        "caption": "Operational nested case-control flow. The cohort and its risk sets are defined first; expensive measurement is confined to cases plus a few time-matched controls, and conditional logistic regression recovers the cohort rate ratio.",
        "alt_text": "Flowchart from assembled cohort through case identification, risk-set formation, time-matched control sampling, expensive exposure ascertainment on the sample, conditional logistic analysis, and sensitivity checks.",
        "source_type": "illustrative",
        "source_citations": [
          "goldstein-1996"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  subgraph T1[Risk set at Case A's event time]\n    A[Case A fails] --- B[Control B at risk]\n    A --- C[Control C at risk]\n  end\n  subgraph T2[Later: Case C's event time]\n    C2[Same subject C now fails - is a CASE here] --- D[Control D at risk]\n  end\n  C -. same person, sampled earlier as a control .-> C2",
        "caption": "Density (incidence) sampling in time. A subject sampled as a control for an earlier case can itself fail later and enter the analysis as a case - this is valid under risk-set sampling and must not be excluded.",
        "alt_text": "Diagram showing two risk sets at different event times; a control sampled at the first event time later becomes a case at a subsequent event time, illustrating that controls can become cases under density sampling.",
        "source_type": "illustrative",
        "source_citations": [
          "lubin-1984"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "is_variant_of",
        "target_slug": "case-control",
        "notes": "Nested case-control is a case-control analysis conducted within a fully enumerated cohort, with controls sampled from the risk set rather than from an external population."
      },
      {
        "relation_type": "alternative_to",
        "target_slug": "cox-ph-regression",
        "notes": "Both target the rate/hazard ratio; full-cohort Cox is more efficient when exposure is cheap and complete, while NCC trades a little efficiency for far lower measurement cost when exposure is expensive."
      },
      {
        "relation_type": "used_with",
        "target_slug": "exposure-lag-induction-latency-window-rwe",
        "notes": "Exposure for each matched set is evaluated as of the case's event time and typically lagged for induction/latency before risk-set sampling."
      },
      {
        "relation_type": "see_also",
        "target_slug": "competing-risks-cause-specific-fine-gray-rwe",
        "notes": "Death is a competing risk that removes potential controls from the risk set; differential mortality by exposure biases risk-set sampling in elderly cohorts."
      },
      {
        "relation_type": "see_also",
        "target_slug": "self-controlled-case-series",
        "notes": "A within-person alternative that removes all time-fixed confounding but requires transient, reversible exposures rather than the chronic exposures NCC accommodates."
      },
      {
        "relation_type": "see_also",
        "target_slug": "case-crossover",
        "notes": "Another within-person sampling design for acute outcomes and transient exposures, contrasted with NCC's between-person matched comparison."
      }
    ],
    "aliases": [
      "NCC",
      "nested case-control study",
      "risk-set sampling",
      "incidence-density sampling"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "ema"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "network-meta-analysis",
    "name": "Network Meta-Analysis",
    "short_definition": "A hierarchical evidence-synthesis model that simultaneously combines direct and indirect comparisons across a connected network of trials to estimate every pairwise relative treatment effect on a common scale under transitivity and consistency.",
    "long_description": "**Network meta-analysis (NMA)** — also called mixed-treatment comparison (MTC) — generalizes\npairwise meta-analysis to three or more interventions linked in a connected network. It borrows\nstrength across the whole evidence base: a contrast between treatments B and C can be estimated\neven when no head-to-head B-vs-C trial exists, by chaining through a common comparator (B vs A and\nC vs A imply an *indirect* B-vs-C effect). When both direct and indirect evidence are present, NMA\nproduces a *mixed* estimate that is a precision-weighted blend of the two. The output is a coherent\nset of all pairwise relative effects (a \"league table\") on one scale — log odds ratio, log hazard\nratio, mean difference — plus a probabilistic ranking of treatments. In HEOR it is the standard tool\nfor populating cost-effectiveness models when the manufacturer's drug was never trialled directly\nagainst the relevant comparators that a payer cares about.\n\n**Core estimand distinction** runs as follows. The estimand is the set of *relative* treatment effects d_XY for every\npair (X, Y) of treatments in the network, anchored to a reference treatment so that d_XY = d_AY − d_AX\n(the **consistency equation**). This is fundamentally a synthesis of *between-arm contrasts within\nrandomized trials* — NMA preserves within-study randomization and never compares a treatment arm in one\ntrial to an arm in a different trial (that would discard randomization and reduce to a naive indirect\ncomparison). Two parameterizations exist: the **contrast-based** Lu–Ades formulation (model the\nobserved within-trial contrasts; the default in `netmeta`, `gemtc`, and most NICE DSU code) and the\n**arm-based** formulation (model arm-level means with trial random effects; estimates absolute risks\nbut leans on stronger missing-at-random assumptions across the network). 
Effects can be **fixed**\n(one true effect per comparison) or **random** with a common heterogeneity variance τ² shared across\ncomparisons; random effects are the near-universal default, because trials of the same comparison rarely\nshare a single true effect. The **Bucher adjusted indirect comparison** is the special case of NMA that estimates a\nsingle anchored indirect comparison (B vs C) through one common comparator A, with no closed loops; NMA\nis its multi-arm, multi-loop generalization.\n\n**The two load-bearing assumptions.** *(1) Transitivity* (a clinical/epidemiological assumption):\neffect modifiers (disease severity, age, prior lines of therapy, placebo response, definition and\ntiming of the outcome) are distributed similarly across the trial sets that inform each comparison,\nso that the common comparator is genuinely exchangeable across loops. Transitivity is *not testable\nfrom the synthesis data*; it is defended by tabulating trial-level characteristics across comparisons.\n*(2) Consistency* is the statistical manifestation of transitivity: direct and indirect estimates of\nthe same contrast agree. It *is* checkable in any network containing closed loops, via node-splitting\n(separate indirect from direct evidence for a contrast and test the difference), the design-by-treatment\ninteraction test, or net-heat/net-splitting plots. A connected but loop-free (\"tree\" or \"star\")\nnetwork can never have its consistency assessed, a critical and frequently missed limitation.\n\n**Pros, cons, and trade-offs.**\n- **vs pairwise meta-analysis:** NMA estimates contrasts with *no* direct trials and yields a single\n  coherent ranking; it gains precision by borrowing strength. 
Cost: it imports the transitivity\n  assumption and can be destabilized by one inconsistent loop or one influential small trial.\n  **Prefer pairwise** when adequate head-to-head trials exist for the one contrast you care about —\n  do not network just to network.\n- **vs Bucher anchored indirect comparison:** NMA handles >3 treatments, multi-arm trials (with the\n  correct within-trial correlation), and mixed evidence in one model. Bucher is transparent and\n  auditable for a single A-anchored B-vs-C contrast. **Prefer Bucher** for a simple two-trial indirect\n  comparison a reviewer can replicate by hand; **prefer NMA** for any real network.\n- **vs population-adjusted indirect comparison (MAIC / STC):** When the anchoring assumption of a\n  standard NMA fails because effect modifiers differ across trials — and you have individual patient\n  data (IPD) for at least one trial — MAIC/STC re-weights or regresses to the comparator population.\n  NMA assumes balance you cannot fix; MAIC/STC fixes imbalance you can measure but burns degrees of\n  freedom and (in the unanchored case) makes the far stronger assumption that *all* prognostic factors\n  are adjusted. **Prefer population adjustment** when transitivity is clearly violated and IPD exists;\n  **prefer NMA** when the network is large and transitivity is plausible.\n- **fixed vs random effects:** random effects are honest about between-trial heterogeneity but estimate\n  τ² poorly in sparse networks, inflating credible intervals and ranking instability; fixed effects\n  understate uncertainty if heterogeneity is real. Report both and a prediction interval.\n\n**When to use** (decision rules). 
A connected network of randomized trials (RWE-derived effect estimates can be folded\nin, see below) where the decision-relevant comparison lacks adequate direct evidence; HTA submissions\n(NICE, CADTH, IQWiG) where a new drug must be compared to all relevant standards of care to populate a\ncost-effectiveness model; comparative-effectiveness questions across a therapeutic class with a shared\nreference (e.g., placebo or an old standard). Transitivity must be defensible and the network connected.\n\n**When NOT to use — and when it is actively misleading or dangerous** (decision rules).\n- **Disconnected network.** If the treatment of interest shares no comparator path with the target\n  comparator, no amount of modelling connects them. Reporting an effect across a disconnected gap is\n  fabrication, not synthesis.\n- **Transitivity clearly violated.** If trials of comparison A-vs-B enrolled first-line milder patients\n  while A-vs-C trials enrolled refractory patients, the common comparator A is not exchangeable and the\n  indirect C-vs-B estimate is confounded by the effect modifier — a structurally biased number presented\n  with false precision. This is the single most dangerous failure mode; defend transitivity *before* you\n  fit anything.\n- **Loop-free network presented as validated.** A star network around placebo cannot have its\n  consistency tested. Claiming a \"consistency-checked\" NMA on such a network is misleading.\n- **Over-reliance on rankings (SUCRA / P-score / rankograms).** Ranking statistics compress the entire\n  joint distribution into one number and routinely crown treatments whose effects are statistically\n  indistinguishable from rivals, especially in sparse networks. A top SUCRA with a wide credible\n  interval is rank instability, not superiority. 
Never report a rank without the underlying relative\n  effects and uncertainty.\n- **Dose or formulation lumping/splitting.** Treating different doses of the same drug as one node hides\n  a dose-response signal; splitting every dose into its own node fractures the network and inflates\n  apparent heterogeneity. Pre-specify the node definition.\n\n**Data sources and their failure modes.** Classical NMA synthesizes **published aggregate trial data**:\narm-level events/N (binary) or mean/SD/N (continuous), or pre-computed contrasts with standard errors\nfed through generic inverse-variance. The same machinery underpins RWE/HTA work but with source-specific\nfailure modes:\n- **Aggregate published RCT data:** the common substrate. Failure modes: selective-outcome and\n  selective-trial reporting bias the network asymmetrically; multi-arm trials must contribute *all*\n  pairwise contrasts with their induced within-trial covariance (ignoring it double-counts the shared\n  arm and falsely narrows intervals); zero-event arms need continuity handling or an exact likelihood\n  (for example `netmetabin` in `netmeta`, or a Bayesian binomial model) rather than a normal-approximation fudge.\n- **IPD-NMA (individual patient data):** when IPD is available for some trials, patient-level\n  treatment-by-covariate interactions can be modelled to *relax* transitivity (meta-regression within the\n  network). Failure mode: mixing IPD and aggregate trials risks **ecological bias** (an aggregate-level\n  covariate association is not the within-trial modifier effect), so model the within- and across-trial\n  interactions separately (Phillippo IPD-NMA).\n- **RWE single-arm or external-control evidence:** increasingly, only one comparator has RCT evidence and\n  the new agent has a single-arm trial plus an RWE external control, or a real-world comparative effect\n  is brought into a network of RCTs. 
Failure mode: an RWE-derived contrast carries confounding that\n  randomization removed in the RCT arms; folding it into an NMA propagates that confounding network-wide.\n  Anchor on a common comparator, down-weight or sensitivity-test the RWE node, and prefer\n  population-adjustment (MAIC/STC) when the RWE and RCT populations differ on effect modifiers. FDA and\n  EMA accept population-adjusted indirect comparisons in this situation but expect explicit, defended\n  assumptions; NICE DSU TSD-18 and CADTH give detailed conduct standards.\n- **Linked claims/registry-derived effects:** RWE contrasts computed from claims (active-comparator\n  new-user designs) or registries can populate a node, but the time-zero, washout, and outcome\n  definitions must match the RCT estimand they will be combined with; otherwise the network mixes\n  intention-to-treat RCT effects with as-treated RWE effects, a quiet estimand mismatch.\n\n**Worked example (HTA-style, claims/RCT blend).** Question: rank three biologics (A, B, C) for the\nPASI-75 response in moderate-to-severe plaque psoriasis to populate a cost-effectiveness model. Direct\nRCT evidence exists only for A vs placebo (P), B vs P, C vs P, and one head-to-head A vs B trial, so the\nnetwork of four nodes is connected through P, with a single closed loop A–B–P. (1) *Node definition:* each biologic at\nits licensed maintenance dose is one node; placebo is the reference; PASI-75 (binary) at week 12 is the\ncommon outcome. (2) *Transitivity check:* tabulate baseline PASI, prior-biologic exposure, and weight\nacross the four trial sets; if the C-vs-P trial enrolled markedly more biologic-experienced patients,\nflag a transitivity threat and pre-specify a meta-regression on prior exposure. (3) *Effect measure:*\nlog odds ratio via a binomial-logit GLM, random effects with a common τ² across comparisons. 
(4)\n*Multi-arm handling and inconsistency:* there are no multi-arm trials here, but the A–B–P loop lets us *node-split* the A-vs-B contrast: compare the\ndirect A-vs-B trial effect with the indirect (A-vs-P minus B-vs-P) effect and test for inconsistency.\n(5) *Synthesis:* fit the model; read the league table for all six pairwise ORs with credible intervals;\nif a real-world C-vs-standard-of-care comparative-effect estimate from a claims active-comparator\nnew-user study is available, add C linked through the shared comparator only after confirming its\noutcome definition and follow-up window match the RCT estimand, and run it as a sensitivity node. (6) *Ranking:*\nreport SUCRA *with* the underlying ORs and a prediction interval; if B and C overlap heavily, state that\ntheir ranks are not separable. (7) *Decision feed:* the relative effects (not the ranks) and their full\ncovariance enter the economic model so that parameter correlation is preserved in the PSA.\n\n**Interpreting the output**\n\nConsider the simpler two-trial worked example (Drug A vs B and Drug C vs B): the indirect estimate for Drug A versus Drug C (via common comparator B)\nyields OR = exp(−0.30) ≈ 0.74, suggesting Drug A has roughly 26% lower odds of the event than Drug C.\n\nFormal interpretation: This OR of 0.74 is an indirect estimate derived by differencing the log odds\nratios from two separate trials; it is not the result of any head-to-head randomization of A against\nC. Its validity rests on the transitivity assumption: that the patients in the A-vs-B trial and the\npatients in the C-vs-B trial are sufficiently similar that treatment B is a meaningful common\ncomparator. Transitivity is a substantive clinical judgment, not a statistical test; it is untestable\nfrom the data alone and must be evaluated by comparing baseline characteristics across trial populations.\nThis two-trial network has no closed loop, so consistency cannot be checked here; if a direct A-vs-C trial\nwere added, node-splitting could test whether it contradicts the indirect pathway, but agreement would still\nnot prove that transitivity holds. 
If the C-vs-B trial enrolled\nsubstantially more biologic-experienced patients than the A-vs-B trial, the 0.74 reflects a mix of\npopulations and should not be reported as a single universal comparison.\n\nPractical interpretation: In an HTA submission, the league-table OR of 0.74 for A vs C may support\nformulary preference for A, but the credible interval around that estimate — and a prediction interval\nif heterogeneity is present — must accompany it. Report SUCRA as a supplementary ranking aid, not as\nthe primary decision input; treatments with overlapping credible intervals have indistinguishable ranks.\nFeed the full covariance matrix of pairwise contrasts into the PSA, not just the point estimates, so\nthat parameter uncertainty is preserved when the economic model is driven by the NMA outputs.",
    "primary_category": "Inferential_Statistics",
    "tags": [
      "network-meta-analysis",
      "mixed-treatment-comparison",
      "indirect-comparison",
      "evidence-synthesis",
      "transitivity",
      "consistency",
      "random-effects",
      "health-technology-assessment"
    ],
    "applies_to_study_types": [
      "network_meta_analysis",
      "meta_analysis"
    ],
    "data_sources": [
      "registry",
      "claims",
      "ehr",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1002/sim.1875",
        "url": "https://doi.org/10.1002/sim.1875",
        "citation_text": "Lu G, Ades AE. Combination of direct and indirect evidence in mixed treatment comparisons. Statistics in Medicine. 2004;23(20):3105-3124.",
        "year": 2004,
        "authors_short": "Lu & Ades",
        "notes": "Foundational contrast-based hierarchical model for mixed treatment comparisons, formalizing the consistency equations that underpin modern NMA."
      },
      {
        "role": "introduce",
        "doi": "10.1002/jrsm.1037",
        "url": "https://doi.org/10.1002/jrsm.1037",
        "citation_text": "Salanti G. Indirect and mixed-treatment comparison, network, or multiple-treatments meta-analysis: many names, many benefits, many concerns for the next generation evidence synthesis tool. Research Synthesis Methods. 2012;3(2):80-97.",
        "year": 2012,
        "authors_short": "Salanti",
        "notes": "Authoritative overview of NMA terminology, the transitivity/consistency assumptions, and the concerns (sparse networks, ranking) that reviewers probe."
      },
      {
        "role": "explain",
        "doi": "10.1177/0272989X12458724",
        "url": "https://doi.org/10.1177/0272989X12458724",
        "citation_text": "Dias S, Sutton AJ, Ades AE, Welton NJ. Evidence synthesis for decision making 2: a generalized linear modeling framework for pairwise and network meta-analysis of randomized controlled trials. Medical Decision Making. 2013;33(5):607-617.",
        "year": 2013,
        "authors_short": "Dias et al.",
        "notes": "The NICE DSU GLM framework (TSD-2) used in most HTA submissions; defines the binomial/normal likelihoods and random-effects parameterization the code here implements."
      },
      {
        "role": "explain",
        "doi": "10.1177/0272989X17725740",
        "url": "https://doi.org/10.1177/0272989X17725740",
        "citation_text": "Phillippo DM, Ades AE, Dias S, Palmer S, Abrams KR, Welton NJ. Methods for population-adjusted indirect comparisons in health technology appraisal. Medical Decision Making. 2018;38(2):200-211.",
        "year": 2018,
        "authors_short": "Phillippo et al.",
        "notes": "Defines when standard NMA transitivity fails and population-adjusted methods (MAIC, STC) with IPD are needed; distinguishes the anchored from the stronger unanchored assumptions."
      },
      {
        "role": "demonstrate",
        "doi": "10.7326/M14-2385",
        "url": "https://doi.org/10.7326/M14-2385",
        "citation_text": "Hutton B, Salanti G, Caldwell DM, et al. The PRISMA extension statement for reporting of systematic reviews incorporating network meta-analyses of health care interventions: checklist and explanations. Annals of Internal Medicine. 2015;162(11):777-784.",
        "year": 2015,
        "authors_short": "Hutton et al.",
        "notes": "PRISMA-NMA reporting standard; the checklist (network geometry, transitivity, inconsistency, ranking caveats) is what reviewers and HTA bodies expect to see addressed."
      },
      {
        "role": "use",
        "doi": "10.1016/S0140-6736(17)32802-7",
        "url": "https://doi.org/10.1016/S0140-6736(17)32802-7",
        "citation_text": "Cipriani A, Furukawa TA, Salanti G, et al. Comparative efficacy and acceptability of 21 antidepressant drugs for the acute treatment of adults with major depressive disorder: a systematic review and network meta-analysis. Lancet. 2018;391(10128):1357-1366.",
        "year": 2018,
        "authors_short": "Cipriani et al.",
        "notes": "Landmark applied NMA of 21 drugs; canonical example of league tables, ranking, and the careful handling of network sparsity and certainty that decision-makers rely on."
      }
    ],
    "plain_language_summary": "Network meta-analysis is a method for comparing treatments that were never tested against each other head-to-head in a single trial. It works by chaining together results from multiple trials that share a common comparator — for example, if Drug A and Drug C were each tested against Drug B in separate trials, you can use the two sets of results to estimate how A and C compare to each other. The result is a single, coherent table of how every treatment in the network stacks up against every other, even for pairs that no single trial ever directly compared. The key risk is that this chaining only holds up if the patients in the different trials were similar enough that the shared comparator means the same thing in each study.",
    "key_terms": [
      {
        "term": "indirect comparison",
        "definition": "An estimate of how two treatments compare when no trial tested them head-to-head, derived by chaining their separate results through a shared comparator treatment."
      },
      {
        "term": "network",
        "definition": "The map of all treatments and the direct trial comparisons linking them; a treatment pair with no connecting path through the network cannot be compared."
      },
      {
        "term": "transitivity",
        "definition": "The assumption that the patients in the trials informing each comparison are similar enough that the shared comparator treatment behaves the same way across all of them; if this breaks down, the indirect estimates are misleading."
      },
      {
        "term": "log odds ratio",
        "definition": "The natural logarithm of an odds ratio, used as the working scale for combining binary outcome results across trials because log-scale effects add and subtract neatly."
      },
      {
        "term": "common comparator",
        "definition": "The treatment that appears in multiple trials and serves as the anchor linking otherwise unconnected drugs in the network."
      }
    ],
    "worked_example": {
      "scenario": "Two trials have been published for a new drug class treating a chronic condition. Trial 1 compared Drug A versus Drug B (the standard of care) and found a log odds ratio of -0.70 for the event, favoring A. Trial 2 compared Drug C versus Drug B and found a log odds ratio of -0.40, favoring C. No trial ever put A and C in the same study. A health technology assessment body needs to know how A and C compare with each other so the formulary choice can be made. Network meta-analysis derives that indirect A-vs-C estimate using B as the common comparator.",
      "dataset": {
        "caption": "Published trial results — one row per direct comparison. Log odds ratios are on the log scale; negative values favor the active drug over the comparator.",
        "columns": [
          "trial_id",
          "treatment",
          "comparator",
          "log_OR",
          "direction"
        ],
        "rows": [
          [
            "Trial 1",
            "Drug A",
            "Drug B",
            -0.7,
            "A better than B"
          ],
          [
            "Trial 2",
            "Drug C",
            "Drug B",
            -0.4,
            "C better than B"
          ]
        ]
      },
      "steps": [
        "Both trials used Drug B as the comparator, so B is the common anchor that connects A and C in the network.",
        "Write each result as a contrast versus B: log-OR(A vs B) = -0.70 and log-OR(C vs B) = -0.40.",
        "The indirect log-OR for A vs C equals log-OR(A vs B) minus log-OR(C vs B): (-0.70) - (-0.40) = -0.30.",
        "Exponentiate to recover the odds ratio on the natural scale: exp(-0.30) = 0.74.",
        "An odds ratio of 0.74 means Drug A has roughly 26% lower odds of the event than Drug C, based solely on the indirect chain through B."
      ],
      "result": "Indirect log-OR(A vs C) = (-0.70) - (-0.40) = -0.30; OR = exp(-0.30) = 0.74, favoring Drug A over Drug C — a comparison no single trial ever made directly."
    },
    "prerequisites": [
      "meta-analysis-rct",
      "meta-analysis-obs",
      "ipd-meta-analysis"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Frequentist contrast-based NMA (graph-theoretical / generic inverse variance)",
        "description": "Treats observed within-trial contrasts as data and solves the consistency equations by weighted least squares (the netmeta/Rücker graph-theoretical approach), giving league tables, net-heat inconsistency plots, and P-scores without MCMC.",
        "edge_cases": [
          "Multi-arm trials must contribute all pairwise contrasts with the correct induced covariance, or the shared arm is double-counted and intervals are falsely narrow.",
          "Zero-event arms break the normal approximation; use continuity corrections cautiously or switch to a GLM/exact likelihood."
        ],
        "data_source_notes": "aggregate RCT data: feed events/N or mean/SD/N; pre-computed contrasts with SEs feed generic inverse variance directly."
      },
      {
        "name": "Bayesian random-effects NMA (Lu-Ades, gemtc/multinma/Stan)",
        "description": "Hierarchical model with a common between-trial heterogeneity variance, fit by MCMC; yields full posterior distributions for every contrast, SUCRA, rankograms, and natural propagation of uncertainty into a downstream cost-effectiveness model.",
        "edge_cases": [
          "Sparse networks make tau-squared weakly identified; the heterogeneity prior is influential and must be pre-specified and sensitivity-tested (e.g., half-normal vs uniform).",
          "Convergence must be checked (R-hat, trace plots, effective sample size) before any league table is read."
        ],
        "data_source_notes": "preferred when posterior correlations among contrasts must be carried into PSA; the full covariance, not point estimates, feeds the economic model."
      },
      {
        "name": "Population-adjusted indirect comparison (MAIC / STC) within or instead of NMA",
        "description": "When transitivity fails and individual patient data exist for at least one trial, re-weight (MAIC) or regress (STC) to balance effect modifiers across populations before (or instead of) the indirect comparison.",
        "edge_cases": [
          "Anchored MAIC/STC adjusts only effect modifiers; unanchored versions assume ALL prognostic factors are captured -- a far stronger, often indefensible assumption.",
          "Effective sample size after weighting can collapse, leaving an unstable adjusted estimate."
        ],
        "data_source_notes": "requires IPD for the index trial and published aggregate data for the comparator; standard NICE DSU TSD-18 / CADTH conduct expectations apply."
      },
      {
        "name": "Node-splitting / design-by-treatment inconsistency assessment",
        "description": "Separates direct from indirect evidence on each contrast that sits in a closed loop and tests their agreement; the design-by-treatment interaction model gives a single global inconsistency test.",
        "edge_cases": [
          "Impossible in loop-free (star/tree) networks -- consistency simply cannot be assessed there.",
          "Low power in sparse loops; a non-significant test is not evidence of consistency."
        ],
        "data_source_notes": "report which loops were testable and the node-split results alongside the main league table per PRISMA-NMA."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Pairwise meta-analysis",
        "pros_of_this": "Estimates contrasts with no direct trials, uses the whole network, and produces a coherent ranking across all treatments.",
        "cons_of_this": "Imports the untestable transitivity assumption and can be destabilized by one inconsistent loop or one influential small trial.",
        "when_to_prefer": "When the decision-relevant comparison lacks adequate head-to-head trials but is connected through common comparators."
      },
      {
        "compared_to": "Bucher anchored indirect comparison",
        "pros_of_this": "Handles more than three treatments, multi-arm trials with correct correlation, and mixed direct/indirect evidence in one coherent model.",
        "cons_of_this": "More machinery and assumptions than a transparent two-trial hand calculation a reviewer can replicate.",
        "when_to_prefer": "Any real network with multiple treatments or closed loops; use Bucher only for a single simple A-anchored B-vs-C contrast."
      },
      {
        "compared_to": "Population-adjusted indirect comparison (MAIC / STC)",
        "pros_of_this": "No individual patient data required; scales to large connected networks.",
        "cons_of_this": "Cannot fix a measured transitivity violation -- it assumes the balance MAIC/STC would correct; biased when effect modifiers differ across comparisons.",
        "when_to_prefer": "When transitivity is plausible and the network is large; switch to MAIC/STC when transitivity is clearly violated and IPD exists."
      },
      {
        "compared_to": "Fixed-effect NMA",
        "pros_of_this": "Random effects acknowledge real between-trial heterogeneity and yield honest, wider intervals plus a prediction interval.",
        "cons_of_this": "Heterogeneity variance is poorly estimated in sparse networks, inflating uncertainty and ranking instability.",
        "when_to_prefer": "Almost always for clinical trials of differing comparisons; reserve fixed effects for homogeneous, well-populated networks and report both."
      }
    ],
    "implementation_notes_by_data_source": {
      "registry": "Adjudicated registry-derived comparative effects can populate a node but must share the outcome definition and timing of the RCTs they are combined with; otherwise the network mixes incompatible estimands.",
      "claims": "Claims active-comparator new-user effects can enter a network anchored on a shared comparator only after matching the RCT estimand (time zero, washout, intention-to-treat vs as-treated); run such nodes as sensitivity analyses and down-weight, because residual confounding propagates network-wide.",
      "ehr": "EHR-derived contrasts add severity/lab confirmation absent from claims but inherit visit-driven capture and informative loss to follow-up; reconcile the outcome ascertainment window with the RCT nodes before combining.",
      "linked": "Linked claims-EHR-registry data give the most complete RWE node (severity + completeness + mortality) but introduce linkage selection; combine only after confirming the RWE population matches the network on effect modifiers, or population-adjust (MAIC/STC)."
    },
    "implementations": [
      {
        "lang": "r",
        "code": "library(netmeta)\n\n# arm_df has one row per trial-arm: study_id, treatment, events, n\n# Convert arm-level binary data to within-trial log-OR contrasts; pairwise() emits every\n# pairwise contrast of a multi-arm trial and netmeta() then adjusts for their correlation.\np <- pairwise(treat = treatment, event = events, n = n,\n              studlab = study_id, data = arm_df, sm = \"OR\")\n\n# Random-effects (common tau^2) network meta-analysis, placebo as reference.\n# prediction = TRUE is needed for the prediction intervals mentioned below.\nnet <- netmeta(TE, seTE, treat1, treat2, studlab,\n               data = p, sm = \"OR\",\n               common = FALSE, random = TRUE, prediction = TRUE,\n               reference.group = \"placebo\")\n\nprint(summary(net))                 # all pairwise ORs with 95% CI + prediction intervals\nnetleague(net, digits = 2)          # league table for the economic model / appendix\n\n# Inconsistency: global decomposition + per-loop net-heat; node-split on closed loops.\ndecomp.design(net)                  # design-by-treatment (Q) inconsistency decomposition\nnetheat(net, random = TRUE)         # net-heat hot spots flag inconsistent designs\nprint(netsplit(net))                # direct vs indirect per contrast (testable loops only)\n\n# Ranking: report P-scores WITH the underlying effects; never a rank alone.\n# A small OR means fewer PASI-75 responders, so small values are undesirable here.\nnetrank(net, small.values = \"bad\")  # P-score (frequentist SUCRA analogue)",
        "description": "Frequentist contrast-based random-effects NMA with the netmeta package (graph-theoretical / generic\ninverse variance). Required input: one row per trial ARM in long format -\n  arm_df : study_id (chr), treatment (chr), events (int), n (int)   # binary PASI-75-style outcome\npairwise() converts arm-level data to the within-trial contrasts netmeta needs (handling multi-arm\ncovariance correctly). Outputs the league table, a net-heat inconsistency plot, node-split results,\nand P-scores. Treatments are named, not coded, so the reference is set explicitly.",
        "dependencies": [
          "netmeta"
        ],
        "source_citations": [
          "dias-2012"
        ],
        "notes": ""
      },
      {
        "lang": "python",
        "code": "import numpy as np\nimport pandas as pd\nimport pymc as pm\nimport pytensor.tensor as pt   # set_subtensor / zeros live here, not in pm.math\n\n# arm_df: study_id, treatment, events, n  (long, one row per arm)\nstudies = arm_df[\"study_id\"].astype(\"category\")\ntreats  = arm_df[\"treatment\"].astype(\"category\")\nref     = \"placebo\"                       # network reference treatment\nt_codes = treats.cat.categories\nref_ix  = list(t_codes).index(ref)\n\ns_idx = studies.cat.codes.to_numpy().astype(int)\nt_idx = treats.cat.codes.to_numpy().astype(int)\nn_studies = studies.cat.categories.size\nn_treat   = t_codes.size\nevents    = arm_df[\"events\"].to_numpy()\nn         = arm_df[\"n\"].to_numpy()\n\n# Lu-Ades contrast parameterization needs each study's OWN baseline treatment (the\n# first arm listed per study) and the positions of the arms that are not that baseline.\nbase_t_by_study = (arm_df.assign(_t=t_idx, _s=s_idx)\n                         .groupby(\"_s\")[\"_t\"].first()\n                         .reindex(range(n_studies)).to_numpy().astype(int))\nbase_t_idx = base_t_by_study[s_idx]           # baseline treatment code for each arm's study\nnb_pos     = np.where(t_idx != base_t_idx)[0]  # row positions of non-baseline arms\n\nwith pm.Model() as nma:\n    mu    = pm.Normal(\"mu\", 0.0, 10.0, shape=n_studies)        # study baselines\n    tau   = pm.HalfNormal(\"tau\", 1.0)                          # common heterogeneity SD\n    # Basic parameters d (vs reference); reference effect fixed at 0 by overwriting its slot.\n    d_raw = pm.Normal(\"d_raw\", 0.0, 10.0, shape=n_treat)\n    d     = pm.Deterministic(\"d\", pt.set_subtensor(d_raw[ref_ix], 0.0))\n    # Random treatment effect per NON-baseline arm: mean = d[t] - d[study_baseline], SD = tau.\n    # The study-specific baseline arm contributes exactly 0 (no heterogeneity added to baselines).\n    # Caveat: the deltas are drawn independently, so the within-trial correlation of\n    # trials with three or more arms is not modelled in this sketch.\n    delta_nb = pm.Normal(\"delta_nb\",\n                         mu=(d[t_idx] - d[base_t_idx])[nb_pos],\n                         sigma=tau, shape=len(nb_pos))\n    delta = pt.set_subtensor(pt.zeros(len(events))[nb_pos], delta_nb)\n    logit = mu[s_idx] + delta\n    pm.Binomial(\"y\", n=n, logit_p=logit, observed=events)\n    idata = pm.sample(2000, tune=2000, target_accept=0.95, chains=4)\n\n# Pairwise log-OR X vs Y = d[X] - d[Y]; exponentiate for league table.\npost = idata.posterior[\"d\"].stack(s=(\"chain\", \"draw\")).values   # (n_treat, n_draws)\nleague_or = {(a, b): float(np.exp(np.mean(post[i] - post[j])))\n             for i, a in enumerate(t_codes) for j, b in enumerate(t_codes) if i != j}\n# SUCRA from the posterior ranks, rank 1 = largest d (higher response is better here).\n# Report it WITH the contrasts and credible intervals.\nranks = (-post).argsort(axis=0).argsort(axis=0) + 1\nsucra = {t: float((n_treat - ranks[i].mean()) / (n_treat - 1)) for i, t in enumerate(t_codes)}",
        "description": "Bayesian random-effects NMA (contrast-based, Lu-Ades binomial-logit) in PyMC. Required input: one row\nper trial ARM -\n  arm_df : study_id (str), treatment (str), events (int), n (int)\nEach multi-arm trial contributes a baseline arm (study-specific intercept mu) and treatment effects\ndelta relative to that baseline; consistency is imposed by parameterizing every treatment's effect d\nagainst a single network reference. Carry the full posterior of d into the cost-effectiveness PSA.",
        "dependencies": [
          "pymc",
          "numpy",
          "pandas"
        ],
        "source_citations": [
          "lu-2004"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* work.arm : study, t (treatment code), r (events), n, b (baseline treatment code in that study) */\n/* d1..dK are treatment effects vs the reference treatment (code 1). d[1] is fixed to 0.            */\n\nproc mcmc data=work.arm nmc=50000 nbi=10000 thin=5 seed=12345\n          monitor=(d2 d3 d4 tau2) diagnostics=all;\n  array d[4];                                  /* one entry per treatment; d[1] = reference */\n\n  parms d2 d3 d4 0;                            /* basic parameters vs reference (code 1) */\n  parms tau2 0.1;                              /* common between-study heterogeneity variance */\n\n  prior d2 d3 d4 ~ normal(0, var=100);         /* vague priors on contrasts */\n  prior tau2 ~ igamma(0.001, scale=0.001);     /* sensitivity-test this prior in sparse nets */\n\n  beginnodata;\n    d[1] = 0;                                  /* reference effect fixed at 0 */\n  endnodata;\n\n  /* Study-specific baseline log-odds: one vague, unshared effect per study. */\n  random muS ~ normal(0, var=10000) subject=study;\n\n  /* Random treatment effect, one per arm (row): mean = d[t] - d[b] (consistency); SD = sqrt(tau2). */\n  random delta ~ normal(d[t] - d[b], var=tau2) subject=_obs_;\n\n  lp = muS + (t ne b) * delta;                 /* baseline arm (t = b) gets no treatment effect */\n  p  = logistic(lp);\n  model r ~ binomial(n, p);                    /* arm-level binomial likelihood */\nrun;\n\n/* Post-process the posterior of d2, d3, d4 into all pairwise ORs (league table) and SUCRA */\n/* in a DATA step: OR_XY = exp(dX - dY); ranks from the saved posterior draws.            */",
        "description": "Bayesian random-effects NMA via PROC MCMC (Lu-Ades contrast-based, binomial-logit), the canonical SAS\npattern for HTA submissions. Required input dataset (long, one row per trial-arm):\n  work.arm : study (num 1..S), t (num treatment code, 1 = reference), r (events), n (at risk),\n             b (num: the reference/baseline treatment code WITHIN that study, for the contrast)\nmu = study baseline log-odds; d[] = treatment effects vs network reference (d[1] fixed at 0);\ndelta = random effect with common SD = sqrt(tau2). Check ESS/diagnostics before reading estimates.",
        "dependencies": [],
        "source_citations": [
          "lu-2004"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "graph LR\n  P((Placebo)) ---|\"A vs P: 3 trials\"| A((Biologic A))\n  P ---|\"B vs P: 2 trials\"| B((Biologic B))\n  P ---|\"C vs P: 2 trials\"| C((Biologic C))\n  P ---|\"D vs P: 1 trial\"| D((Biologic D))\n  A ---|\"A vs B: 1 trial (closed loop)\"| B",
        "caption": "Network geometry for the worked psoriasis example. Edges are direct comparisons, and each edge label gives the number of contributing trials. The single A-B-P closed loop is the only place consistency can be tested; the indirect B-vs-C, C-vs-D, etc. effects exist only because every node connects through placebo.",
        "alt_text": "Network graph with placebo at the centre connected to biologics A, B, C, and D, plus a direct A to B edge forming one closed loop with placebo.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Q[Decision-relevant comparison<br/>lacks adequate direct trials] --> Conn{Is the network connected<br/>through a common comparator?}\n  Conn -->|No| Stop[Cannot synthesize<br/>do not report a cross-gap effect]\n  Conn -->|Yes| Trans{Transitivity plausible?<br/>effect modifiers balanced<br/>across comparisons}\n  Trans -->|No, and IPD available| PAIC[Population-adjusted ITC<br/>MAIC / STC]\n  Trans -->|No, no IPD| Caveat[Report with strong caveats<br/>or decline]\n  Trans -->|Yes| Loops{Any closed loops?}\n  Loops -->|Yes| NMAc[NMA + node-splitting /<br/>design-by-treatment consistency check]\n  Loops -->|No| NMAs[NMA / Bucher ITC<br/>consistency NOT testable - state this]\n  NMAc --> Rank[Report relative effects + CIs<br/>SUCRA only WITH the effects]\n  NMAs --> Rank",
        "caption": "Decision logic for choosing and qualifying a network meta-analysis. Connectivity is a hard gate; transitivity routes to population adjustment when violated; the presence of closed loops determines whether consistency can be assessed at all.",
        "alt_text": "Flowchart from a comparison lacking direct trials through connectivity, transitivity, and closed-loop checks to NMA, population-adjusted indirect comparison, or declining to synthesize.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "is_variant_of",
        "target_slug": "meta-analysis-rct",
        "notes": "NMA generalizes pairwise random-effects meta-analysis to a connected network of three or more treatments, adding the transitivity and consistency assumptions."
      },
      {
        "relation_type": "see_also",
        "target_slug": "meta-analysis-obs",
        "notes": "Observational-study meta-analysis shares the synthesis machinery; folding RWE-derived contrasts into an NMA imports the confounding that randomization removes in RCT nodes."
      },
      {
        "relation_type": "see_also",
        "target_slug": "ipd-meta-analysis",
        "notes": "IPD-NMA uses individual patient data to model effect-modifier interactions and relax transitivity, but must separate within-trial from across-trial associations to avoid ecological bias."
      },
      {
        "relation_type": "used_with",
        "target_slug": "single-arm-external-control",
        "notes": "When a new agent has only a single-arm trial plus an external control, the resulting comparative effect can be anchored into a network of RCTs, with population adjustment if populations differ."
      },
      {
        "relation_type": "alternative_to",
        "target_slug": "indirect-standardization-smr-sir-rwe",
        "notes": "A distinct form of indirect comparison; NMA synthesizes randomized contrasts across a trial network rather than standardizing observed-to-expected rates against a reference population."
      },
      {
        "relation_type": "see_also",
        "target_slug": "generalizability-transportability-external-validity-rwe",
        "notes": "Transitivity is the network analogue of transportability - effect modifiers must be exchangeable across the trial populations that inform each comparison."
      }
    ],
    "aliases": [
      "NMA",
      "Mixed Treatment Comparison",
      "MTC",
      "Multiple Treatments Meta-Analysis",
      "network meta-analysis",
      "mixed treatment comparison meta-analysis"
    ],
    "complexity": "advanced",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "new-user-design",
    "name": "New-User (Incident-User) Design",
    "short_definition": "A cohort design that restricts to patients initiating the exposure of interest after a drug-free washout and starts follow-up at initiation (time zero), so that follow-up, covariate measurement, and outcome ascertainment all begin at the same point in the treatment course and prevalent-user biases are avoided.",
    "long_description": "The **new-user (incident-user) design** restricts the analytic cohort to patients who *start* the exposure of interest\nduring the study period, after a defined lookback (washout) window in which they had no dispensing of that drug or drug\nclass. Follow-up, baseline covariate measurement, and the outcome clock all begin at **time zero = the first qualifying\nfill**. This single restriction makes an observational cohort behave, at the moment of the treatment decision, like the\nenrollment of a randomized trial: everyone is at the same point in their treatment trajectory, no one has accrued\non-treatment history, and there is no follow-up time before the exposure decision.\n\n**Core conceptual distinction — new-user vs prevalent-user.** A *prevalent-user* (ever-exposed) cohort enrolls patients\nwho are already taking the drug at the start of observation. This induces three biases that the new-user restriction\nremoves. (1) **Depletion of susceptibles / survivor bias**: prevalent users are survivors who already tolerated the drug\nand did not have an early event, so they look systematically healthier than the patients a clinician is actually deciding\nto treat. (2) **Adjustment for post-initiation covariates**: variables measured at the start of observation in a prevalent\ncohort (lab values, weight, blood pressure) are themselves *consequences* of prior treatment — conditioning on them\nadjusts away part of the drug effect or opens collider paths. (3) **Left-censoring / immortal time**: early events that\noccurred before the observation window are invisible, and time before initiation is misclassified. The new-user design\nfixes all three by construction because every patient enters at initiation with only *pre-treatment* history. 
The design\ndoes **not** by itself remove confounding by indication or healthy-user bias — those require an active comparator and/or\ncovariate adjustment; the new-user restriction is necessary but not sufficient.\n\n**Estimand.** Aligning time zero at initiation makes the natural estimand an **intention-to-treat-like contrast under an\ninitiation strategy** (effect of *starting* the drug), or an **as-treated / per-protocol** contrast if you additionally\ncensor at discontinuation/switching and weight for the resulting informative censoring. Pre-specify which: an ITT-style\ninitiation contrast and an as-treated contrast answer different questions and rarely coincide when adherence differs.\n\n**Pros, cons, and trade-offs.**\n- **vs prevalent-user / ever-exposed designs:** the new-user design eliminates depletion of susceptibles, survivor bias,\n  immortal time, and adjustment for post-initiation mediators. Cost: smaller cohorts, loss of patients who initiated before\n  the data window, and a population of initiators that may differ from the prevalent users who dominate day-to-day\n  practice. **Prefer new-user** for nearly all comparative safety/effectiveness questions; fall back to a *prevalent\n  new-user* design (Suissa's time-conditional propensity-score extension) only when true initiation is too rare to study.\n- **vs new-user + active comparator (ACNU):** a plain new-user design with a *non-user* reference still suffers\n  **confounding by indication** and **healthy-user/healthy-adherer bias**, because treated patients differ from untreated\n  patients for reasons tied to the outcome. Adding an active comparator (initiators of a different drug for the same\n  indication) attacks those biases and improves covariate overlap. 
**Prefer ACNU** whenever a clinically interchangeable\n  comparator exists; reserve the non-user new-user contrast for questions that are genuinely \"drug vs no drug\" (e.g.,\n  vaccine vs unvaccinated, where no active comparator exists).\n- **vs prescription-time-distribution / waiting-time approaches to incident use:** those infer incidence from refill\n  patterns without a fixed washout; they are cheaper but less defensible for causal contrasts. **Prefer an explicit\n  washout** for regulatory-grade work.\n\n**When to use.** Comparative effectiveness or safety of a drug initiation in claims, EHR, registry, or linked data; any\nsetting where prevalent users would carry survivor bias or post-baseline covariates; as the structural backbone of a\ntarget-trial emulation (the new-user restriction is how you assign time zero). The washout length is the key tuning knob:\nlong enough that re-initiators are not misclassified as new users (180–365 days is typical; ≥365 days for chronic\ntherapies with intermittent use), but not so long that it shrinks the cohort below usable size.\n\n**When NOT to use — and when it is actively misleading.**\n- **The washout cannot be observed.** If continuous enrollment does not span the full lookback, \"no prior fill\" is\n  *missingness*, not a true washout, and you silently enroll prevalent users as new users. In claims this is the dominant\n  failure mode (see below). Diagnose by tabulating enrollment coverage across the lookback before applying the restriction.\n- **The drug is used intermittently or in courses** (antibiotics, antifungals, opioids, oral steroids). 
A 180-day washout\n  will classify a patient on their fifth course as a \"new user.\" Either lengthen the washout to cover the typical\n  re-treatment interval or define episodes explicitly; otherwise the incident-user assumption is false.\n- **You force a new-user restriction onto a question that is intrinsically about prevalent or chronic use** (e.g., the\n  effect of *long-term* statin exposure on a slow outcome). Restricting to initiators discards exactly the long-exposure\n  person-time you need and can bias toward the null; a prevalent new-user or duration-response design is more appropriate.\n- **Initiation is so rare** that the cohort is underpowered — reaching for a prevalent-user design re-introduces survivor\n  bias, so the honest answer is often a prevalent new-user (time-conditional PS) design rather than abandoning incidence.\n\n**Data-source operational depth.**\n- **Claims (FFS or commercial):** Exposure = pharmacy claim (NDC + `fill_date` + `days_supply`). Index date = first\n  qualifying fill after the washout; require **continuous medical + pharmacy enrollment across the entire lookback** so the\n  absence of prior dispensing is observed rather than missing. *Failure modes:* (a) **Medicare Advantage** person-time\n  lacks fee-for-service claims — MA-only enrollees can appear drug-free simply because their fills are invisible; restrict\n  to enrollees with Parts A/B/D (or commercial medical+pharmacy) across the washout and exclude MA-only spans. (b)\n  **Differential competing risks by exposure in elderly claims** — drugs preferentially used in frailer patients carry more\n  death (a competing risk) that censors the outcome differentially; for absolute-risk questions use a cumulative-incidence\n  framework, not naive Kaplan–Meier. (c) Sample fills, 90-day mail-order, and stockpiling distort `days_supply` and\n  therefore on-treatment windows. 
(d) **Immortal time in procedure/initiation studies** — if you anchor follow-up at\n  diagnosis but require a fill to be \"exposed,\" the gap between diagnosis and first fill is immortal; anchor time zero at\n  the fill itself.\n- **EHR:** Initiation is the medication *order* or *administration*, not a dispensing — a patient may be ordered a drug and\n  never fill it. Prefer linkage to pharmacy claims to confirm the patient actually started. Problem lists, labs, and notes\n  sharpen the indication and baseline severity (an advantage over claims), but visit-driven capture means a patient who\n  leaves the system is differentially lost; define observation windows explicitly and treat loss to follow-up as\n  potentially informative.\n- **Registry:** Strong for indication, disease severity, and adjudicated outcomes; typically weak for complete pharmacy\n  exposure and for confirming the *absence* of prior use. Link to claims for the full fill history (to validate the\n  washout) and to a death index to firm up censoring.\n- **Linked claims–EHR–vital records:** The ideal substrate — EHR severity + claims completeness + reliable mortality — but\n  linkage introduces selection (only the linkable subset) and date-discrepancy issues between order, fill, and service\n  dates that must be reconciled before time-zero assignment.\n\n**Worked claims example.** Question: incidence of acute myocardial infarction after initiating a new oral\nantihyperglycemic among adults with type 2 diabetes in a commercial + Medicare fee-for-service database.\n(1) **Eligibility:** age ≥18; ≥2 diabetes diagnoses (`E11.x`) in the lookback; 365 days of continuous medical + pharmacy\n(or A/B/D) enrollment before the first study fill.\n(2) **Washout:** no fill of the study drug (or its class, by NDC list) in the 365-day lookback — this is what makes the\npatient an incident user.\n(3) **Time zero:** the date of that first qualifying fill (`index_date`); covariates are measured only in\n`[index_date − 
365, index_date]`, never after.\n(4) **Follow-up:** from `index_date` to first validated AMI, censoring at disenrollment, death, end of data, and, for an\nas-treated analysis, discontinuation (last `days_supply` end + a pre-specified 30-day grace period) or switch.\n(5) **First-event coding:** keep the first AMI per person, and censor at the first enrollment gap so that unobserved\n(e.g., MA) spans contribute neither events nor event-free person-time.\n(6) **Sensitivity:** re-run with 180-day and 730-day washouts, vary the grace period, and add a negative-control outcome\nto probe residual confounding. Pairing this design with an **active comparator** (initiators of a different\nantihyperglycemic) is the standard next step to control confounding by indication.",
    "primary_category": "Study_Design",
    "tags": [
      "pharmacoepidemiology",
      "new-user-design",
      "incident-user",
      "prevalent-user-bias",
      "time-zero",
      "washout",
      "depletion-of-susceptibles",
      "target-trial"
    ],
    "applies_to_study_types": [
      "new_user",
      "active_comparator_new_user",
      "cohort_retrospective",
      "cohort_prospective",
      "claims_analysis",
      "ehr_study"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1093/aje/kwg231",
        "url": "https://doi.org/10.1093/aje/kwg231",
        "citation_text": "Ray WA. Evaluating medication effects outside of clinical trials: new-user designs. American Journal of Epidemiology. 2003;158(9):915-920.",
        "year": 2003,
        "authors_short": "Ray",
        "notes": "Foundational paper that formalized the new-user (incident-user) design and articulated the prevalent-user biases it removes (depletion of susceptibles, adjustment for post-initiation covariates, left-censoring)."
      },
      {
        "role": "explain",
        "doi": "10.1007/s40471-015-0053-5",
        "url": "https://doi.org/10.1007/s40471-015-0053-5",
        "citation_text": "Lund JL, Richardson DB, Stürmer T. The active comparator, new user study design in pharmacoepidemiology: historical foundations and contemporary application. Current Epidemiology Reports. 2015;2(4):221-228.",
        "year": 2015,
        "authors_short": "Lund et al.",
        "notes": "Situates the new-user restriction within the broader active-comparator framework and explains why new-user alone does not remove confounding by indication."
      },
      {
        "role": "explain",
        "doi": "10.1093/aje/kwm324",
        "url": "https://doi.org/10.1093/aje/kwm324",
        "citation_text": "Suissa S. Immortal time bias in pharmacoepidemiology. American Journal of Epidemiology. 2008;167(4):492-499.",
        "year": 2008,
        "authors_short": "Suissa",
        "notes": "Explains the immortal-time bias that arises when follow-up starts before the exposure decision; anchoring time zero at initiation, as the new-user design does, is the structural fix."
      },
      {
        "role": "explain",
        "doi": "10.1002/pds.4107",
        "url": "https://doi.org/10.1002/pds.4107",
        "citation_text": "Suissa S, Moodie EEM, Dell'Aniello S. Prevalent new-user cohort designs for comparative drug effect studies by time-conditional propensity scores. Pharmacoepidemiology and Drug Safety. 2017;26(4):459-468.",
        "year": 2017,
        "authors_short": "Suissa et al.",
        "notes": "The prevalent new-user extension — uses time-conditional propensity scores to recover power when strict incident use is too rare, without reverting to a fully prevalent (survivor-biased) cohort."
      },
      {
        "role": "demonstrate",
        "doi": "10.1002/pds.70048",
        "url": "https://doi.org/10.1002/pds.70048",
        "citation_text": "Her QL, Rouette J, Young JC, Webster-Clark M, Tazare J. Core Concepts in Pharmacoepidemiology: New-user designs. Pharmacoepidemiology and Drug Safety. 2024;33(12):e70048.",
        "year": 2024,
        "authors_short": "Her et al.",
        "notes": "Contemporary teaching paper demonstrating new-user, active-comparator new-user, and prevalent new-user designs with worked operational guidance."
      }
    ],
    "plain_language_summary": "The new-user design builds a study group out of only the patients who are starting a drug for the first time, then starts the clock for everyone on the day of that first fill. To qualify, a patient must have a clean lookback period (a stretch of time, often a year, with no fill of that drug) so you know they are truly new to it, not someone who has been on it for years. Starting everyone at the same point keeps the comparison fair, because patients already on a drug are a survivor group that tends to look healthier than the people a doctor is actually deciding to treat. It does not fix every problem on its own: it can't tell you why a doctor chose this drug, so you usually still need a comparison drug.",
    "key_terms": [
      {
        "term": "washout",
        "definition": "A stretch of time before the first fill (often a year) in which the patient had no fill of the study drug, used to confirm they are truly starting it for the first time."
      },
      {
        "term": "index date",
        "definition": "The patient's day zero, which here is the date of their first qualifying fill of the drug being studied."
      },
      {
        "term": "days_supply",
        "definition": "How many days one filled prescription is meant to last, such as a 90-day fill."
      },
      {
        "term": "prevalent user",
        "definition": "A patient who is already taking the drug when observation begins, rather than just starting it; including these patients can make the treated group look healthier than it really is."
      },
      {
        "term": "continuous enrollment",
        "definition": "An unbroken span of insurance coverage, so that the absence of an earlier fill means the patient really had none rather than that their fills were simply not recorded."
      }
    ],
    "worked_example": {
      "scenario": "One commercially insured adult with type 2 diabetes appears in a pharmacy claims table. We want to decide whether they count as a new user of a study drug and, if so, set their day zero and start their outcome clock. We use a 365-day washout: the patient must have unbroken coverage for that full year and no earlier fill of the study drug.",
      "dataset": {
        "caption": "The raw rows an analyst would see in a claims pharmacy table plus an enrollment span.",
        "columns": [
          "person_id",
          "fill_date",
          "drug",
          "days_supply"
        ],
        "rows": [
          [
            2001,
            "2024-01-15",
            "study_drug",
            90
          ],
          [
            2001,
            "2024-04-14",
            "study_drug",
            90
          ]
        ],
        "enrollment": {
          "columns": [
            "person_id",
            "enroll_start",
            "enroll_end"
          ],
          "rows": [
            [
              2001,
              "2023-01-01",
              "2024-12-31"
            ]
          ]
        }
      },
      "steps": [
        "The earliest study_drug fill is 2024-01-15, so that is the candidate day zero (index date).",
        "The washout is the 365 days before index: 2023-01-15 through 2024-01-14. There is no study_drug fill in that window, so the patient is a new user.",
        "Check coverage: enrollment runs 2023-01-01 to 2024-12-31, which fully spans the washout start (2023-01-15) through index, so the clean lookback was actually observed, not just missing.",
        "Start the outcome clock at index (2024-01-15). The first 90-day fill covers 2024-01-15 through 2024-04-13, and the second fill picks up the next day, so coverage is continuous.",
        "The patient has a first heart attack (AMI) on 2024-06-10. Follow-up time is the days from index to that event."
      ],
      "result": "Patient 2001 qualifies as a new user: 365-day washout (2023-01-15 to 2024-01-14) is clean and fully covered by enrollment, index date = 2024-01-15, and follow-up to the AMI on 2024-06-10 = 147 days.",
      "timeline_spec": {
        "title": "New-user timeline for one incident initiator: 365-day washout, time zero at first fill, follow-up to AMI",
        "window": {
          "start": "2023-01-15",
          "end": "2024-06-10",
          "label": "Washout (365 days) plus follow-up for one initiator"
        },
        "events": [
          {
            "label": "Fill A (index)",
            "start": "2024-01-15",
            "length_days": 90,
            "quantity": "90 days_supply"
          },
          {
            "label": "Fill B",
            "start": "2024-04-14",
            "length_days": 90,
            "quantity": "90 days_supply"
          }
        ],
        "spans": [
          {
            "kind": "washout",
            "start": "2023-01-15",
            "end": "2024-01-14",
            "label": "365-day washout: no study-drug fill, enrollment observed"
          },
          {
            "kind": "exposed",
            "start": "2024-01-15",
            "end": "2024-04-13",
            "label": "90 covered days (Fill A)"
          },
          {
            "kind": "exposed",
            "start": "2024-04-14",
            "end": "2024-06-10",
            "label": "on-treatment continues (Fill B)"
          },
          {
            "kind": "followup",
            "start": "2024-01-15",
            "end": "2024-06-10",
            "label": "147 follow-up days to AMI"
          }
        ],
        "result": {
          "label": "New user: 365-day clean washout, index 2024-01-15, 147 follow-up days to AMI",
          "value": 147
        },
        "caption": "One initiator's timeline: a fully observed 365-day washout confirms new-user status, time zero is set at the first fill on 2024-01-15, and the outcome clock runs 147 days to the AMI on 2024-06-10. Because baseline is measured before the index fill and follow-up starts at the fill, there is no time counted before the treatment decision.",
        "alt_text": "Timeline showing a 365-day washout from 2023-01-15 to 2024-01-14 with no study-drug fill, an index fill on 2024-01-15, a second fill on 2024-04-14 keeping coverage continuous, and a follow-up span of 147 days ending at an AMI event on 2024-06-10."
      }
    },
    "prerequisites": [
      "washout-clean-lookback-period-rwe",
      "time-zero-index-date-alignment-rwe",
      "prevalent-user-bias"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "New-user with active comparator (ACNU)",
        "description": "The reference group is initiators of a clinically interchangeable drug for the same indication rather than non-users, so both arms are incident users who crossed the same treatment threshold.",
        "edge_cases": [
          "No interchangeable comparator exists (e.g., vaccines), forcing a non-user reference and re-introducing confounding by indication.",
          "Within-class channeling (one agent favored in renal impairment) leaves residual confounding that balance diagnostics must catch."
        ],
        "data_source_notes": "claims: build NDC lists per therapeutic class and assign the arm from the NDC dispensed on the index date; confirm shared indication with baseline diagnosis codes."
      },
      {
        "name": "New-user with non-user (unexposed) comparator",
        "description": "Follow-up still starts at initiation for the exposed, but the reference is patients who did not initiate; requires careful time-zero assignment for the unexposed (e.g., matched calendar/eligibility time) to avoid immortal time.",
        "edge_cases": [
          "Choosing time zero for never-users is non-trivial; a poorly chosen anchor re-creates immortal-time or selection bias.",
          "Confounding by indication and healthy-user bias are not addressed by the design and must be handled analytically."
        ],
        "data_source_notes": "claims/EHR: pair with prevalent-new-user (Suissa) time-conditional PS or sequential trial emulation to assign comparable time zero to the unexposed."
      },
      {
        "name": "Prevalent new-user design (Suissa)",
        "description": "When strict incident use is too rare, prevalent users are matched to incident users at comparable treatment-duration time via time-conditional propensity scores, recovering power while limiting survivor bias.",
        "edge_cases": [
          "Requires modeling the conditional probability of switching/initiation as a function of accumulated exposure time.",
          "Interpretation shifts toward a duration-conditioned contrast rather than a pure initiation effect."
        ],
        "data_source_notes": "claims: needs reliable longitudinal fill history to define prevalent exposure duration at the matching anchor."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Prevalent-user / ever-exposed cohort",
        "pros_of_this": "Removes depletion of susceptibles, survivor bias, immortal time, and adjustment for post-initiation covariates; covariates are guaranteed pre-treatment.",
        "cons_of_this": "Smaller cohorts; loses patients who initiated before the data window; initiators may not represent the prevalent users who dominate practice.",
        "when_to_prefer": "Almost all comparative safety/effectiveness questions, and whenever prevalent-user bias is plausible or early effects matter."
      },
      {
        "compared_to": "New-user with an active comparator (ACNU)",
        "pros_of_this": "Simpler; answers a \"drug vs no drug\" policy question directly when no active comparator exists.",
        "cons_of_this": "A non-user reference leaves confounding by indication and healthy-user/healthy-adherer bias that the design cannot fix.",
        "when_to_prefer": "Only when there is genuinely no clinically interchangeable comparator (e.g., vaccination vs none)."
      },
      {
        "compared_to": "Prevalent new-user design (time-conditional PS)",
        "pros_of_this": "Cleaner estimand (a pure initiation effect) and simpler implementation when incidence is common.",
        "cons_of_this": "Underpowered when true initiation is rare; discards prevalent person-time entirely.",
        "when_to_prefer": "When new initiation is common enough to yield an adequately powered, unbiased incident cohort."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Exposure = pharmacy claim (NDC + fill_date + days_supply). Index date = first qualifying fill after the washout; require continuous medical + pharmacy enrollment across the entire lookback so absence of prior fills is observed, not missing. Exclude Medicare Advantage-only person-time (no FFS claims). Watch differential competing risk of death in elderly cohorts and use cumulative incidence for absolute-risk questions.",
      "ehr": "Initiation = medication order or administration, not dispensing; prefer linked pharmacy fills to confirm the patient started. Use problem lists, labs, and notes to confirm indication and baseline severity. Define observation windows explicitly and treat loss to follow-up as potentially informative.",
      "registry": "Strong for indication, severity, and adjudicated outcomes; weak for complete pharmacy exposure and for confirming absence of prior use. Link to claims for full fill history and to a death index for censoring.",
      "linked": "Linked claims-EHR-vital-records is the ideal substrate (severity + completeness + mortality) but introduces linkage selection and order/fill/service date discrepancies that must be reconciled before time-zero assignment."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\n\nWASHOUT_DAYS = 365  # drug-free + continuous-enrollment lookback that defines an \"incident user\"\n\ndef build_new_user_cohort(rx: pd.DataFrame, enroll: pd.DataFrame) -> pd.DataFrame:\n    rx = rx.sort_values([\"person_id\", \"fill_date\"])\n\n    # Candidate time zero = first fill of the study drug for each person.\n    study = rx[rx[\"is_study_drug\"]]\n    idx = (study.groupby(\"person_id\", as_index=False)\n                .first()[[\"person_id\", \"fill_date\"]]\n                .rename(columns={\"fill_date\": \"index_date\"}))\n\n    # New-user restriction: drop anyone with a study-drug fill in the washout window BEFORE the index date.\n    prior = study.merge(idx, on=\"person_id\")\n    had_prior = prior[(prior[\"fill_date\"] < prior[\"index_date\"]) &\n                      (prior[\"fill_date\"] >= prior[\"index_date\"] - pd.Timedelta(days=WASHOUT_DAYS))]\n    idx = idx[~idx[\"person_id\"].isin(had_prior[\"person_id\"])].copy()\n\n    # Continuous, FFS-observable enrollment spanning the full washout through index (no MA-only gaps),\n    # so that \"no prior fill\" is genuinely observed rather than missing.\n    e = enroll.merge(idx, on=\"person_id\")\n    covers = ((e[\"enroll_start\"] <= e[\"index_date\"] - pd.Timedelta(days=WASHOUT_DAYS)) &\n              (e[\"enroll_end\"]   >= e[\"index_date\"]) &\n              (~e[\"ma_only\"]))\n    eligible = e.loc[covers, \"person_id\"].unique()\n\n    cohort = idx[idx[\"person_id\"].isin(eligible)].copy()\n    cohort[\"baseline_start\"] = cohort[\"index_date\"] - pd.Timedelta(days=WASHOUT_DAYS)\n    return cohort[[\"person_id\", \"index_date\", \"baseline_start\"]].reset_index(drop=True)",
        "description": "New-user cohort construction from claims-style inputs (cohort build, not estimation). Required inputs\n(already cleaned and de-duplicated):\n  rx     : pharmacy fills    -> person_id, fill_date (datetime), is_study_drug (bool), days_supply (int)\n  enroll : enrollment spans  -> person_id, enroll_start, enroll_end (datetime), ma_only (bool)  # ma_only lacks FFS claims\nReturns one row per eligible new initiator with index_date and the baseline-covariate window. Build covariates and any\npropensity score ONLY from [baseline_start, index_date]; apply identical outcome/censoring rules downstream.",
        "dependencies": [
          "pandas"
        ],
        "source_citations": [
          "ray-2003"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\nWASHOUT_DAYS <- 365L\n\nbuild_new_user_cohort <- function(rx, enroll) {\n  setDT(rx); setDT(enroll)\n  setorder(rx, person_id, fill_date)\n\n  # Candidate time zero = first study-drug fill per person.\n  study <- rx[is_study_drug == TRUE]\n  idx <- study[, .(index_date = fill_date[1L]), by = person_id]\n\n  # New-user: drop anyone with a study-drug fill in the washout window before index.\n  study <- merge(study, idx, by = \"person_id\")\n  prior_ids <- unique(study[fill_date < index_date &\n                            fill_date >= index_date - WASHOUT_DAYS, person_id])\n  idx <- idx[!person_id %chin% prior_ids]\n\n  # Continuous, FFS-observable enrollment across the full washout through index (no MA-only spans).\n  e <- merge(enroll, idx, by = \"person_id\")\n  ok <- e[enroll_start <= index_date - WASHOUT_DAYS &\n          enroll_end   >= index_date & ma_only == FALSE, unique(person_id)]\n\n  cohort <- idx[person_id %chin% ok]\n  cohort[, baseline_start := index_date - WASHOUT_DAYS]\n  cohort[, .(person_id, index_date, baseline_start)]\n}",
        "description": "New-user cohort construction with data.table. Inputs mirror the Python version:\n  rx     : person_id, fill_date (Date), is_study_drug (logical), days_supply (integer)\n  enroll : person_id, enroll_start, enroll_end (Date), ma_only (logical)\nReturns one row per eligible new initiator with index_date and baseline_start (covariate window).",
        "dependencies": [
          "data.table"
        ],
        "source_citations": [
          "ray-2003"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "%let washout = 365;\n\n/* Candidate time zero = first study-drug fill per person. */\nproc sql;\n  create table idx as\n  select person_id, min(fill_date) as index_date format=date9.\n  from work.rx\n  where is_study_drug = 1\n  group by person_id;\nquit;\n\n/* New-user restriction: exclude anyone with a prior study-drug fill inside the washout window. */\nproc sql;\n  create table newuser as\n  select i.*\n  from idx i\n  where not exists (\n    select 1 from work.rx p\n    where p.person_id = i.person_id\n      and p.is_study_drug = 1\n      and p.fill_date <  i.index_date\n      and p.fill_date >= i.index_date - &washout\n  );\nquit;\n\n/* Continuous, FFS-observable enrollment across the full washout through index (no MA-only spans),\n   so that absence of a prior fill is observed rather than missing. */\nproc sql;\n  create table cohort as\n  select n.person_id,\n         n.index_date,\n         n.index_date - &washout as baseline_start format=date9.\n  from newuser n\n  where exists (\n    select 1 from work.enroll e\n    where e.person_id = n.person_id\n      and e.ma_only = 0\n      and e.enroll_start <= n.index_date - &washout\n      and e.enroll_end   >= n.index_date\n  );\nquit;",
        "description": "New-user cohort construction in SAS using PROC SQL (cohort build, not estimation). Required input datasets\n(post data-management):\n  work.rx     : person_id, fill_date (SAS date), is_study_drug (1/0), days_supply\n  work.enroll : person_id, enroll_start, enroll_end (SAS dates), ma_only (1/0)\nProduces work.cohort with one row per eligible incident user: person_id, index_date, baseline_start. Join to a\nbaseline-covariate table built strictly within [baseline_start, index_date] before any analysis.",
        "dependencies": [],
        "source_citations": [
          "ray-2003"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": "new-user-design-timeline.svg",
        "mermaid": null,
        "caption": "One initiator's timeline: a fully observed 365-day washout confirms new-user status, time zero is set at the first fill on 2024-01-15, and the outcome clock runs 147 days to the AMI on 2024-06-10. Because baseline is measured before the index fill and follow-up starts at the fill, there is no time counted before the treatment decision.",
        "alt_text": "Timeline showing a 365-day washout from 2023-01-15 to 2024-01-14 with no study-drug fill, an index fill on 2024-01-15, a second fill on 2024-04-14 keeping coverage continuous, and a follow-up span of 147 days ending at an AMI event on 2024-06-10.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Pop[Source population with the indication] --> Enr[Continuous medical + pharmacy enrollment<br/>across the full washout]\n  Enr --> Wash{Any study-drug fill<br/>in the washout window?}\n  Wash -- Yes --> Excl[Exclude: prevalent / re-initiating user]\n  Wash -- No --> Init[First qualifying fill = incident use]\n  Init --> T0[Time zero = index fill date]\n  T0 --> Base[Baseline covariates measured only<br/>in baseline_start .. index_date]\n  Base --> Fup[Follow-up: outcome clock starts at time zero<br/>censor at disenroll / death / end of data / discontinuation]\n  Fup --> Sens[Sensitivity: washout length, grace period,<br/>negative-control outcome, add active comparator]",
        "caption": "New-user cohort logic. The washout plus continuous-enrollment requirement guarantees that incident-user status is observed, time zero is the first fill, and all covariates are pre-treatment.",
        "alt_text": "Flowchart from source population through enrollment and washout checks to time-zero assignment, baseline covariate measurement, follow-up, and sensitivity analyses for the new-user design.",
        "source_type": "illustrative",
        "source_citations": [
          "ray-2003"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "gantt\n  title New-user timeline for one incident initiator (claims)\n  dateFormat YYYY-MM-DD\n  axisFormat %b %Y\n  section Lookback\n  Continuous enrollment + washout (no study-drug fill) :done, wash, 2023-01-01, 2023-12-31\n  section Time zero\n  First qualifying fill :milestone, t0, 2024-01-01, 0d\n  section Follow-up\n  On-treatment exposure (days_supply + grace) :active, fu, 2024-01-01, 180d\n  Censor at discontinuation / disenroll / death / data end :crit, cen, 2024-06-29, 1d",
        "caption": "Time-zero alignment for a single initiator. Because baseline is measured before the index fill and the outcome clock starts at the fill, there is no immortal time and no adjustment for post-initiation variables.",
        "alt_text": "Gantt timeline showing a 365-day washout in 2023, time zero at the first fill on 2024-01-01, an on-treatment follow-up window, and a censoring point.",
        "source_type": "illustrative",
        "source_citations": [
          "ray-2003"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "used_with",
        "target_slug": "active-comparator-new-user",
        "notes": "The active-comparator new-user (ACNU) design is a variant that adds an active-comparator requirement on top of the basic new-user restriction to also control confounding by indication and healthy-user bias."
      },
      {
        "relation_type": "see_also",
        "target_slug": "prevalent-user-bias",
        "notes": "Prevalent-user / ever-exposed cohorts mix patients at different points in their treatment trajectory and suffer depletion of susceptibles; the new-user design aligns time zero at initiation to avoid these."
      },
      {
        "relation_type": "see_also",
        "target_slug": "immortal-time-bias-handling",
        "notes": "Anchoring time zero at the first fill prevents the immortal time that arises when follow-up starts before the exposure decision (e.g., at diagnosis)."
      },
      {
        "relation_type": "used_with",
        "target_slug": "target-trial-emulation",
        "notes": "The new-user restriction is how a target-trial emulation assigns time zero, mapping eligibility and treatment initiation onto trial enrollment."
      },
      {
        "relation_type": "requires",
        "target_slug": "washout-clean-lookback-period-rwe",
        "notes": "The validity of the incident-user restriction depends entirely on a correctly specified, fully observable washout / lookback window."
      },
      {
        "relation_type": "see_also",
        "target_slug": "time-zero-index-date-alignment-rwe",
        "notes": "Index date is the first qualifying fill after the washout; correct time-zero alignment is the core deliverable of the design."
      }
    ],
    "aliases": [
      "incident user design",
      "incident-user design",
      "new initiator design",
      "new user design",
      "incident new-user design",
      "new-user cohort design"
    ],
    "complexity": "foundational",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "nlp-clinical-text-extraction-rwe",
    "name": "NLP for Clinical Text in RWE",
    "short_definition": "A family of computational methods — from rule-based pattern matching through transformer-based models — that convert free-text clinical notes into structured variables for RWE analysis; NLP-derived variables are algorithm-derived estimates that carry systematic measurement error and must be validated against a chart-review gold standard (PPV, sensitivity, kappa) with error rates propagated into every downstream analytic result.",
    "long_description": "**Why clinical text is the dark matter of EHR data**\n\nStructured fields in electronic health records — diagnosis codes, procedure codes, laboratory\nresult values, medication orders — capture a narrow slice of clinical reality. Cancer staging,\nECOG performance status, biomarker results narrated in the clinic note, smoking and alcohol\nhistory, real-world progression assessments, and social determinants of health are all\nroutinely documented in free-text and are absent or unreliably encoded in structured fields\nfor a large share of patients. Studies of oncology EHR databases consistently find that\nperformance status and staging are structurally missing from coded fields for a majority of\npatients yet appear in clinical notes; the same is true for progression language in radiology\nand oncologist reports, for BRCA and biomarker status in pathology narratives, and for\nsubstance use in the social history section. Any RWE study that relies exclusively on\nstructured fields for these variables operates on a systematically truncated population —\nthe patients for whom the variable is structurally unobservable are not missing at random;\nthey may be the patients whose care is less protocol-driven and whose true variable values\ndiffer most from the imputed-as-zero default.\n\nNLP is the method layer that bridges this gap: it reads the text, identifies clinical entities,\nresolves ambiguity, and outputs a structured variable that can be merged with coded data for\nanalysis. 
The output is not ground truth — it is an algorithm-derived estimate with its own\noperating characteristics that must be measured and reported exactly as for any EHR\nphenotyping algorithm.\n\n**The NLP task taxonomy for RWE**\n\nClinical NLP in RWE spans several distinct tasks, each with its own failure mode:\n\n*Named-entity recognition (NER)*: identifying mentions of a drug, dose, biomarker value,\nor clinical concept in text — for example, extracting \"pembrolizumab 200 mg\" as a drug-dose\nentity, or \"ECOG 2\" as a performance-status entity. The dominant failure mode is boundary\nerror: partial extraction, missed entities in non-standard phrasing, and false positives\nfrom entities appearing in a different semantic context.\n\n*Negation and uncertainty handling*: the most consequential and most frequently mishandled\nlayer. A phrase like \"no evidence of progression\" expresses absence of progression, yet a\nnaive NER system that detects \"progression\" will flag it as present. NegEx-style algorithms\n(Chapman 2001) define a left and right scope window around each assertion trigger (\"no\nevidence of,\" \"denied,\" \"without\") and negate any entity within that window. Uncertainty\nmarkers — \"possible progression,\" \"cannot rule out recurrence\" — produce a third category\nthat must be handled explicitly: treating uncertain mentions as non-events or as a\nsensitivity stratum. Ignoring negation is the single most common implementation error in\nclinical NLP deployments; it produces systematic false positives that inflate the\napparent prevalence of any condition.\n\n*Section detection*: a clinical note is not a uniform stream of text. 
It has a family history\nsection, a past medical history section, a review of systems, an assessment, and a plan.\nA biomarker or diagnosis in the family history section pertains to a relative, not the patient.\nA symptom in the review of systems may be a reported complaint, not a confirmed finding.\nSection detection partitions each document into labeled zones so entity extraction applies\nthe correct scope — for example, extracting \"BRCA1 pathogenic variant\" from a Family History\nsection as a family-level finding, not a patient-level finding.\n\n*Relation extraction*: pairing entities into structured tuples — (biomarker, value, date),\n(drug, dose, frequency), (tumor size, measurement date) — to capture not just entity\npresence but the relationship between them. Relation extraction is harder than NER alone\nbecause entities can participate in different relations in the same sentence and relations\nmay span sentence boundaries.\n\n*Document classification*: assigning a note-level or report-level binary or ordinal label —\nprogression versus no-progression, positive versus negative biomarker status. This is the\ncoarser task most NLP-for-RWE pipelines implement first because it can be validated with\na binary chart-review label. Document classification is often the direct input to real-world\nprogression endpoints (see `real-world-progression-rwpfs-rwe`).\n\n**The rule-based-to-transformer arc**\n\nClinical NLP tools have evolved across three generations. Rule-based and dictionary-driven\nsystems such as cTAKES (released around 2010) and medspaCy (released around 2021) use\nmanually curated lexicons, regular expressions, and\nNegEx-style scope rules to annotate clinical concepts. They are fast, transparent,\nreproducible, and require no labeled training data, but they fail on novel phrasing and\nrequire expert clinical informaticists to maintain the rule set. 
BERT-variant models —\nBioBERT, ClinicalBERT, and domain-fine-tuned transformers (roughly 2019 onward) — learn\ncontextual representations from large corpora and generally outperform rule-based systems\non benchmark NER and assertion tasks, at the cost of requiring labeled training data, GPU\ncompute, and a less transparent decision process. Large language models can extract clinical\nentities from text with little or no task-specific fine-tuning, using prompting alone, but\nintroduce hallucination, run-to-run inconsistency, and absence of calibrated uncertainty\nas failure modes (for LLM-specific extraction approaches, see `llm-assisted-abstraction-rwe`).\nRegardless of which generation is used, all approaches produce algorithm-derived outputs that\nrequire empirical validation. The transformer arc does not eliminate the validation imperative.\n\n**The validation imperative**\n\nNLP-derived variables are algorithm-derived variables. Staging from a progress note, ECOG\nfrom a clinic note, biomarker status from a pathology report — each is a measurement, and\nevery measurement has error. The validation protocol mirrors that required for EHR phenotyping\nalgorithms: define a chart-review gold standard, sample a representative set of NLP-positive\nand NLP-negative records (aiming for at least 150–200 reviewed records), have trained\nreviewers assign the reference label blind to the NLP output, and compute PPV (precision),\nsensitivity (recall), specificity, and inter-reviewer agreement (Cohen's kappa or weighted\nkappa for ordinal variables — see `agreement-statistics-kappa-icc-bland-altman`).\n\nThe measured error rates must then be propagated downstream (see `algorithm-validation`). A\nsensitivity analysis that re-runs the primary analysis excluding patients with ambiguous NLP\nclassifications tests robustness. A quantitative bias analysis that corrects effect estimates\nfor the measured misclassification rate gives bounds on distortion. 
Reporting only the NLP\npipeline's F1 score on a validation set, without propagating uncertainty into the main\nanalytic result, is methodologically incomplete. Nondifferential NLP misclassification of a\nbinary exposure or outcome is expected to bias effect estimates toward the null, while\nmisclassification of a confounder leaves residual confounding whose direction is not\nguaranteed; differential misclassification —\nif documentation density, note length, or phrasing conventions differ by treatment arm —\ncan bias in either direction.\n\n**Portability failure and the silver-standard trap**\n\nTwo structural threats dominate NLP deployment in multi-site RWE:\n\n*Portability failure*: clinical NLP models trained at one institution often degrade substantially\nat another. The root cause is template heterogeneity — the same clinical fact is documented\nin systematically different language depending on EHR vendor, specialty, clinical culture,\nand geographic region. A model calibrated on one documentation style can miss or misclassify\nthe same concept written differently. Every new deployment site requires at minimum a\ntargeted validation study; published F1 scores from benchmark evaluations at a single\ninstitution do not transfer. This is particularly acute for RWE networks that aggregate\ndata across multiple health systems.\n\n*Silver-standard trap*: training an NLP model using the structured codes it is meant to\nsupplement as its weak label set creates a model that can only recover what was already\ncoded. A \"progression\" classifier trained on ICD codes for secondary malignancy as positive\nlabels will achieve excellent validation performance against those same codes but provides\nno additional information beyond the codes themselves — and will miss exactly the cases that\nmotivated the NLP approach (patients with documented progression who were never coded). 
The\ncorrect reference standard is independent chart review by a clinician who does not have\naccess to the structured codes during review.\n\n**Pros, cons, and trade-offs**\n\n*Pros*: NLP recovers clinically meaningful variables — staging, performance status,\nbiomarkers, social history — that are absent from structured fields for a substantial share\nof patients. It reduces the need for full-cohort manual abstraction. When validated with\nreported operating characteristics, NLP-derived variables can be acceptable to FDA and other\nregulatory reviewers for supplementary endpoints and sensitivity analyses. They enable composite endpoints\nlike real-world progression that require reading radiology or oncologist text.\n\n*Cons*: NLP adds algorithm-derived measurement error that is systematic, not random, and that\npropagates into every downstream effect estimate. Models transport poorly across sites.\nValidation requires clinical reviewers and is expensive. Negation errors, section errors, and\nuncertainty failures can be invisible in aggregate F1 metrics — a pipeline with 90% accuracy\ncan have near-zero recall on negated findings. The silver-standard trap produces models that\nappear accurate but do not improve ascertainment over structured codes alone. LLM-based\npipelines add hallucination risk and run-to-run inconsistency.\n\n*Against manual abstraction*: manual abstraction by trained reviewers is the gold standard\nfor accuracy but does not scale to tens of thousands of patients. NLP scales to the full\ncohort but introduces systematic error. 
The practical design for most large RWE studies is\na hybrid: NLP for the full cohort (primary analysis), a stratified random validation sample\nreviewed by clinicians (for operating characteristics), and a misclassification bias analysis\nin the statistical analysis plan.\n\n**When to use**\n\nUse NLP-derived variables when a clinically essential covariate or outcome — staging, ECOG,\nbiomarker status, social history, progression — is absent from or unreliably captured in\nstructured fields for a substantial share of the study population; when the cohort is large\nenough that full-cohort manual abstraction is infeasible; when a validation sample of at\nleast 150–200 independently reviewed records can be obtained; and when the text corpus\nshares documentation conventions with the NLP model's training environment, or when a\nsite-specific validation and fine-tuning step is planned.\n\n**When NOT to use — and when it is actively misleading**\n\nDo not use NLP-derived variables as primary study endpoints without validation: an NLP\noutput with unknown operating characteristics carries unknown error direction and cannot\nbe bounded. This is the dangerous case — unlike missing-at-random dropout, NLP error is\nsystematic and may differ by exposure arm (differential misclassification), biasing in\neither direction with no structural guarantee of conservatism.\n\nDo not deploy a model validated at one site at a new site without a local validation\nstudy. Do not train on structured codes as the reference label and then use the resulting\nmodel as if it supplements those codes — it does not. Do not present transformer or LLM\nbenchmark F1 scores as applicable to your corpus without local validation. 
Do not present\nNLP-flagged counts as ground-truth ascertainment — always label outputs as algorithm-derived,\nreport the validation operating characteristics, and include a misclassification sensitivity\nanalysis in the statistical analysis plan.\n\n**Interpreting the output**\n\nIn the worked example, an NLP pipeline classifies ECOG performance status from oncology\nprogress notes. A random sample of 100 patients flagged as ECOG ≥ 2 is chart-reviewed. Review of those 100\nconfirms 85 as true ECOG ≥ 2 (true positives) and 15 as false positives (notes where the\npipeline mistook a negated or conditional ECOG mention for an affirmed finding). Separately,\nfrom a hold-out set of 50 chart-review-confirmed ECOG ≥ 2 patients, the pipeline detected\n40 and missed 10 (notes using non-standard phrasing outside the model's training vocabulary).\n\n*(1) Formal interpretation.* PPV = 85/100 = 0.85: for every 100 patients flagged as ECOG ≥ 2\nby this NLP algorithm in this validation sample, 85 truly met that criterion on chart review —\n15 were false positives. Sensitivity = 40/50 = 0.80: of 50 patients known from chart review\nto be ECOG ≥ 2, the pipeline detected 40 and missed 10. Both are estimated in the validation\nsample and generalize to the full cohort only if the sample is representative of the full\ncohort's documentation styles, specialties, and disease severity distribution. PPV and\nsensitivity are distinct quantities that here come from separate samples, and PPV also\nshifts with the prevalence of ECOG ≥ 2 in the population; a\nPPV from chart review of NLP-positive records tells you nothing directly about sensitivity.\n\n*(2) Practical interpretation.* Before including NLP-derived ECOG in any comparative analysis,\ntwo corrections are needed: (a) apply PPV to correct the NLP-positive count for false positives\n— of every 100 flagged patients, approximately 85 are true ECOG ≥ 2 cases; (b) acknowledge that\nthe 20% miss rate leaves some true ECOG ≥ 2 patients unflagged: when ECOG is the exposure or\noutcome this is expected to attenuate the effect estimate toward the null if the miss rate is\nnondifferential by exposure, and when ECOG is an adjustment covariate it leaves residual confounding. 
If the treated arm has longer or more detailed notes (a common\npattern when patients on active therapy are monitored more intensively), the miss rate may be lower\nin that arm — differential misclassification that biases in an unpredictable direction.\nReport PPV and sensitivity in the methods section alongside the model name and version,\nthe validation sample design, and a misclassification sensitivity analysis.",
    "primary_category": "Machine_Learning_and_Predictive",
    "tags": [
      "nlp",
      "clinical-text",
      "named-entity-recognition",
      "negation-handling",
      "transformer",
      "bert",
      "ctakes",
      "medspacy",
      "text-mining",
      "algorithm-derived-variable",
      "validation",
      "portability"
    ],
    "applies_to_study_types": [
      "ehr_study",
      "registry_linkage",
      "linked",
      "comparative_effectiveness",
      "single_arm_external_control",
      "target_trial_emulation"
    ],
    "data_sources": [
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1016/j.jbi.2017.11.011",
        "url": "https://doi.org/10.1016/j.jbi.2017.11.011",
        "citation_text": "Wang Y, Wang L, Rastegar-Mojarad M, et al. Clinical information extraction applications: a literature review. Journal of Biomedical Informatics. 2018;77:34-49.",
        "year": 2018,
        "authors_short": "Wang et al.",
        "notes": "Comprehensive literature review of clinical NLP applications covering NER, negation detection, relation extraction, and document classification across EHR data types; maps directly onto the task taxonomy and deployment challenges covered in this entry."
      },
      {
        "role": "explain",
        "doi": "10.1006/jbin.2001.1029",
        "url": "https://doi.org/10.1006/jbin.2001.1029",
        "citation_text": "Chapman WW, Bridewell W, Hanbury P, Cooper GF, Buchanan BG. A simple algorithm for identifying negated findings and diseases in discharge summaries. Journal of Biomedical Informatics. 2001;34(5):301-310.",
        "year": 2001,
        "authors_short": "Chapman et al.",
        "notes": "Introduces NegEx, the foundational rule-based negation-detection algorithm used across generations of clinical NLP systems; the scope-window logic and trigger-phrase taxonomy described here remain the reference implementation for negation handling in RWE pipelines two decades later."
      },
      {
        "role": "demonstrate",
        "doi": "10.1136/amiajnl-2011-000203",
        "url": "https://doi.org/10.1136/amiajnl-2011-000203",
        "citation_text": "Uzuner O, South BR, Shen S, DuVall SL. 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text. Journal of the American Medical Informatics Association. 2011;18(5):552-556.",
        "year": 2011,
        "authors_short": "Uzuner et al.",
        "notes": "Landmark shared-task evaluation establishing benchmarks for clinical NER, assertion classification (affirmed, negated, uncertain), and relation extraction across discharge summaries and progress notes; defines the challenge structure that motivated the BioBERT and ClinicalBERT fine-tuning literature."
      }
    ],
    "plain_language_summary": "Natural language processing (NLP) is a set of computer methods that read clinical notes — the free-text written by doctors, nurses, and radiologists — and convert what they say into structured data fields that can be used in a study. Because essential facts like cancer stage, how well a patient is functioning (ECOG score), and whether a tumor has grown are often only written in notes and never entered into a code field, NLP can recover information that structured records miss entirely. A key caveat is that the output of any NLP system is an educated guess based on patterns in text, not a guaranteed truth — researchers must always check a sample of records by hand to measure how often the computer is right (called validation), and they must account for those errors when interpreting study results.",
    "key_terms": [
      {
        "term": "named-entity recognition (NER)",
        "definition": "A step in clinical NLP that scans text and marks specific words or phrases as belonging to a category — for example, tagging \"pembrolizumab 200 mg\" as a drug-dose entity or \"ECOG 2\" as a performance-status score."
      },
      {
        "term": "negation handling",
        "definition": "A rule or model component that detects when an entity in a clinical note has been denied or qualified (for example, \"no evidence of progression\") so the system does not falsely record the entity as present."
      },
      {
        "term": "portability",
        "definition": "Whether an NLP model trained on notes from one hospital works accurately on notes from a different hospital; models often fail at new sites because phrasing conventions and note templates differ across institutions."
      },
      {
        "term": "silver-standard trap",
        "definition": "The mistake of training an NLP model using structured billing codes as its reference labels, then claiming the model adds information beyond those codes — it cannot, because it learned to reproduce the very signal it was supposed to supplement."
      },
      {
        "term": "positive predictive value (PPV)",
        "definition": "The fraction of patients the NLP algorithm flags as having a condition who truly have it when their records are reviewed by hand — a measure of how precise the algorithm is."
      },
      {
        "term": "section detection",
        "definition": "A preprocessing step that divides a clinical note into labeled zones (family history, past medical history, assessment and plan) so that entity extraction applies the correct scope and avoids tagging a relative's diagnosis as the patient's own."
      }
    ],
    "worked_example": {
      "scenario": "An oncology RWE team needs ECOG performance status for a study of 8,000 patients; this variable appears in structured fields for only 2,100 patients (26%). The team deploys an NLP pipeline to classify ECOG status from progress notes for the remaining 5,900 patients. To measure how well the pipeline works, they randomly sample 100 patients flagged as ECOG ≥ 2 by the NLP and have a trained oncology nurse review each chart. They also obtain a hold-out set of 50 patients whose ECOG ≥ 2 status is known from structured fields (used as a sensitivity check) and run those through the NLP pipeline to measure what fraction it detects. The five-row table shows representative examples from the validation sample, including two failure modes — a false positive from a negated mention and a false negative from non-standard phrasing.",
      "dataset": {
        "caption": "Five representative records from the validation sample. NLP_flag is the pipeline output; chart_ecog_ge2 is the chart-review reference standard. Classification labels the result.",
        "columns": [
          "patient_id",
          "NLP_flag",
          "chart_ecog_ge2",
          "note_snippet",
          "classification"
        ],
        "rows": [
          [
            3001,
            "yes",
            "yes",
            "ECOG PS 2 noted, patient unable to work",
            "TP"
          ],
          [
            3002,
            "yes",
            "yes",
            "functional status ECOG grade 3 at this visit",
            "TP"
          ],
          [
            3003,
            "yes",
            "no",
            "no evidence of ECOG decline beyond grade 1",
            "FP"
          ],
          [
            3004,
            "no",
            "yes",
            "too debilitated to perform any work-related activities",
            "FN"
          ],
          [
            3005,
            "no",
            "no",
            "performing all activities of daily living without restriction",
            "TN"
          ]
        ]
      },
      "steps": [
        "The NLP pipeline flags patients as ECOG >= 2 based on mention detection plus NegEx-style negation handling. Patient 3001 and 3002 are true positives: the notes contain explicit affirmed ECOG scores of 2 and 3 respectively.",
        "Patient 3003 is a false positive: the note says 'no evidence of ECOG decline beyond grade 1' — the NegEx scope window captured 'decline' but the pipeline erroneously propagated the ECOG entity as affirmed, missing that the surrounding context negates ECOG >= 2.",
        "Patient 3004 is a false negative: the note describes severe functional impairment using narrative language ('too debilitated') without an explicit ECOG grade, so the pipeline produced no ECOG entity and made no prediction — a vocabulary gap failure.",
        "Patient 3005 is a true negative: the note affirms good functional status and the pipeline correctly produces no ECOG >= 2 flag.",
        "In the full 100-patient NLP-positive validation sample, chart review confirms 85 as true ECOG >= 2 and 15 as false positives (mostly negation failures like patient 3003). PPV = 85/100 = 0.85.",
        "In the 50-patient known-positive hold-out, the NLP pipeline detects 40 and misses 10 (mostly vocabulary-gap failures like patient 3004). sensitivity = 40/50 = 0.8.",
        "The 15 false positives will inflate the apparent prevalence of ECOG >= 2 if uncorrected. The 10 missed true positives (20% miss rate) will attenuate effect estimates toward the null if the miss rate is nondifferential by treatment arm — or bias in either direction if it differs across arms."
      ],
      "result": "PPV = 85/100 = 0.85 (85 of 100 NLP-flagged patients were true ECOG >= 2 on chart review; 15 were false positives from negation or conditional phrasing). sensitivity = 40/50 = 0.8 (the pipeline detected 40 of 50 known-positive patients; 10 were missed due to non-standard phrasing). Both metrics must be reported alongside any analysis using NLP-derived ECOG status; a quantitative bias analysis should propagate the 15% false positive rate and 20% miss rate into the primary effect estimate."
    },
    "prerequisites": [
      "ehr-phenotyping-algorithms-rwe",
      "predictive-and-causal-ml-models-rwe",
      "algorithm-validation"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Rule-based NER with NegEx negation handling (cTAKES / MedSpaCy)",
        "description": "Deterministic pipelines using curated clinical lexicons, UMLS concept mappings, and NegEx-style scope rules to extract and classify entities. Transparent, reproducible, and requiring no labeled training data, making them the default for regulatory-grade supplementary analyses where the extraction logic must be auditable.",
        "edge_cases": [
          "Novel phrasing not in the lexicon produces false negatives; vocabulary maintenance is ongoing and site-specific.",
          "NegEx scope windows are fixed length and fail on complex syntactic negation ('it would be premature to conclude the patient has ECOG grade 2').",
          "UMLS concept normalization can map different surface forms to the same concept incorrectly when a term is ambiguous across specialties."
        ],
        "data_source_notes": "ehr: requires note access (progress notes, discharge summaries, radiology reports); structured-only claims cannot support this approach. Validate per note type (progress note vs radiology report vs pathology report) because phrasing conventions differ substantially."
      },
      {
        "name": "Fine-tuned transformer (BioBERT, ClinicalBERT, domain-adapted)",
        "description": "BERT-variant models pre-trained on biomedical or clinical text corpora and fine-tuned on a labeled annotation dataset for the specific NER or assertion task. Substantially outperform rule-based systems on benchmark NER and negation tasks at the cost of requiring labeled training data, GPU compute, and per-site validation.",
        "edge_cases": [
          "Performance degrades at sites whose documentation conventions differ from the training corpus; always validate locally before deploying.",
          "Fine-tuning on a small annotation set (<500 examples) risks overfitting to idiosyncratic phrasing in the training site.",
          "Model versioning and reproducibility require checkpointing the exact fine-tuned weights used in the analysis alongside the analytic code."
        ],
        "data_source_notes": "ehr: requires note text and a labeled training/validation corpus; a minimum of 300-500 annotated examples per entity type is typically needed for stable fine-tuning. Validate on a held-out site-representative sample, not a random split of the training data."
      },
      {
        "name": "Document classification for binary endpoint (progression / no-progression)",
        "description": "Coarser task that assigns a note-level or report-level binary label rather than extracting structured entities. Natural input to real-world progression (rwP) endpoints when the goal is to classify a radiology report as showing new disease versus stable disease, rather than extracting specific measurement values.",
        "edge_cases": [
          "Document-level labels obscure within-document heterogeneity (a report can describe stable primary disease and new metastatic sites simultaneously).",
          "Class imbalance (few progression reports relative to stable-disease reports) requires stratified sampling for both training and validation.",
          "The gold-standard label for validation must be assigned by a clinician reading the full report, not derived from ICD codes — the silver-standard trap."
        ],
        "data_source_notes": "ehr: radiology reports and oncologist progress notes are the primary input; the validation reference label must be chart review by a clinician (radiologist or oncologist), not coded data. Assess inter-reviewer agreement (kappa) before accepting the gold standard."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "ehr-phenotyping-algorithms-rwe",
        "pros_of_this": "NLP accesses the clinical narrative and recovers variables that structured phenotyping algorithms cannot reach — staging, ECOG, biomarker status, and progression language that are never coded. It extends phenotyping coverage to the unstructured majority of clinical documentation.",
        "cons_of_this": "NLP introduces a second layer of algorithm-derived measurement error on top of whatever structured-code error already exists; it requires text access and a validation corpus; and its error is systematic rather than random, with failure modes that differ from those of rule-based structured phenotypes.",
        "when_to_prefer": "Use NLP when the key variable is absent from structured fields for a substantial share of the cohort and the study question cannot be answered without it; use structured phenotypes for variables adequately captured in codes, where transparency and regulatory familiarity are primary concerns."
      },
      {
        "compared_to": "real-world-progression-rwpfs-rwe",
        "pros_of_this": "NLP-derived document classification or NER is the method layer that enables abstracted real-world progression; the two are complementary rather than competing. NLP produces the binary progression label; rwPFS methodology determines how that label is converted to a time-to-event endpoint.",
        "cons_of_this": "NLP error propagates directly into rwPFS: false-positive progression calls shorten apparent rwPFS; missed progression events (false negatives) prolong it. The assessment-cadence and interval-censoring challenges of rwPFS are separate from and compound the NLP measurement error.",
        "when_to_prefer": "Always pair NLP-derived progression classification with the rwPFS methodology entry when building a time-to-event oncology endpoint from note text."
      },
      {
        "compared_to": "predictive-and-causal-ml-models-rwe",
        "pros_of_this": "Clinical NLP is the data preparation layer that generates the input features (NLP-derived staging, ECOG, biomarkers) that then feed predictive or causal ML models. NLP and ML model the same text data at different levels: NLP extracts structured variables; predictive ML uses those variables plus coded features to build risk scores or estimate treatment effects.",
        "cons_of_this": "NLP error in the input features propagates into every downstream ML model that uses those features; a well-specified causal ML pipeline built on poorly validated NLP inputs will have biased nuisance estimates with uncertain direction.",
        "when_to_prefer": "Run NLP validation and propagate measurement error before building downstream ML models; never treat NLP-derived features as ground truth for the ML training step."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Claims contain no free text. NLP is not applicable to claims-only data. Treatment-pattern proxies (next line of therapy, discontinuation) are the closest claims-based analog for outcomes like progression, but they are not NLP and carry different assumptions. If NLP features are needed for a claims-based study, they must be obtained by linking to an EHR or registry that contains note text.",
      "ehr": "Natural home for clinical NLP. Access to progress notes, discharge summaries, radiology reports, and pathology reports is required. Key considerations: (1) validate per note type because phrasing conventions differ substantially between radiology and oncology notes; (2) extract and store the source note date, note type, and NLP confidence score alongside the derived variable for transparency; (3) run inter-reviewer kappa on the validation sample before accepting the gold standard; (4) document the NLP model name, version, and any site-specific fine-tuning in the statistical analysis plan.",
      "registry": "Registries vary widely in whether they store note text versus only structured abstracted fields. When note text is available (some enriched EHR-based registries like Flatiron), NLP is applicable. When only structured abstracted fields are available, the abstraction has already been performed (manually) and NLP is not needed for those variables.",
      "linked": "Linked data combining claims and EHR notes is the most productive substrate for NLP- supplemented RWE: claims supply complete therapy and utilization history, EHR notes supply the text for NLP. Validate NLP separately for each data partner's note corpus before pooling NLP-derived variables, since documentation conventions differ by institution."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "\"\"\"\nNLP for RWE — ECOG performance status extraction from oncology progress notes.\nPipeline:  medspacy NER + NegEx negation → structured flag → validation vs chart review.\n\"\"\"\nimport re\nimport pandas as pd\nfrom sklearn.metrics import cohen_kappa_score, confusion_matrix\n\n# ── Minimal medspacy pipeline (install: pip install medspacy) ──────────────────────────\ntry:\n    import medspacy\n    from medspacy.ner import TargetRuleMatcher\n\n    def build_nlp_pipeline():\n        nlp = medspacy.load()          # loads tokenizer + sentencizer + NegEx by default\n        # Add ECOG target rules — patterns that match ECOG grade mentions\n        target_matcher = nlp.get_pipe(\"medspacy_target_matcher\")\n        from medspacy.ner import TargetRule\n        target_matcher.add([\n            TargetRule(\"ECOG_ge2\", \"ECOG_STATUS\",\n                       pattern=[{\"LOWER\": \"ecog\"},\n                                {\"IS_SPACE\": True, \"OP\": \"?\"},\n                                {\"LOWER\": {\"IN\": [\"2\", \"3\", \"4\", \"grade\", \"ps\"]}},\n                                {\"LOWER\": {\"IN\": [\"2\", \"3\", \"4\"]}, \"OP\": \"?\"}]),\n            TargetRule(\"ECOG_functional_limit\", \"ECOG_STATUS\",\n                       pattern=[{\"LOWER\": {\"IN\": [\"debilitated\", \"bedridden\",\n                                                  \"ambulatory\", \"self-care\"]}}]),\n        ])\n        return nlp\n\n    nlp = build_nlp_pipeline()\n\n    def extract_ecog_ge2(note_text: str) -> bool:\n        \"\"\"Return True if note contains an AFFIRMED ECOG >= 2 mention (NegEx applied).\"\"\"\n        doc = nlp(note_text)\n        for ent in doc.ents:\n            if ent.label_ == \"ECOG_STATUS\" and not ent._.is_negated:\n                return True\n        return False\n\nexcept ImportError:\n    # Fallback: simple regex + negation scope (illustrative; use medspacy in production)\n    _ECOG_PAT = 
re.compile(r\"\\becog\\s*(?:grade\\s*|ps\\s*)?[2-4]\\b\", re.I)\n    _NEG_TRIGGERS = re.compile(\n        r\"\\b(no|without|denies|denied|no evidence of|ruled out|not)\\b\", re.I)\n\n    def extract_ecog_ge2(note_text: str) -> bool:\n        \"\"\"Regex NER with simple left-window negation detection (25 chars).\"\"\"\n        for m in _ECOG_PAT.finditer(note_text):\n            window = note_text[max(0, m.start() - 25): m.start()]\n            if not _NEG_TRIGGERS.search(window):\n                return True\n        return False\n\n# ── Apply NLP to the full cohort ─────────────────────────────────────────────────────\n# notes_df: person_id, note_text, note_date, note_type\ndef run_nlp_pipeline(notes_df: pd.DataFrame) -> pd.DataFrame:\n    \"\"\"Return one row per person: 1 if ANY of their notes affirms ECOG >= 2, else 0.\"\"\"\n    flags = []\n    for pid, grp in notes_df.groupby(\"person_id\"):\n        # Flag the patient if ANY progress note affirms ECOG >= 2\n        nlp_positive = any(extract_ecog_ge2(txt) for txt in grp[\"note_text\"])\n        flags.append({\"person_id\": pid, \"nlp_ecog_ge2\": int(nlp_positive)})\n    return pd.DataFrame(flags)\n\n# ── Validation: compute PPV, sensitivity, specificity, kappa ─────────────────────────\n# validation_df: person_id, nlp_ecog_ge2 (0/1), chart_ecog_ge2 (0/1)\ndef validate_nlp(validation_df: pd.DataFrame) -> dict:\n    \"\"\"Compute operating characteristics vs chart-review gold standard.\"\"\"\n    y_pred = validation_df[\"nlp_ecog_ge2\"].values\n    y_true = validation_df[\"chart_ecog_ge2\"].values\n    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()\n    ppv         = tp / (tp + fp) if (tp + fp) > 0 else float(\"nan\")\n    sensitivity = tp / (tp + fn) if (tp + fn) > 0 else float(\"nan\")\n    specificity = tn / (tn + fp) if (tn + fp) > 0 else float(\"nan\")\n    kappa       = cohen_kappa_score(y_true, y_pred)\n    return {\n        \"tp\": 
int(tp), \"fp\": int(fp), \"fn\": int(fn), \"tn\": int(tn),\n        \"ppv\":         round(ppv, 3),\n        \"sensitivity\": round(sensitivity, 3),\n        \"specificity\": round(specificity, 3),\n        \"kappa\":       round(kappa, 3),\n        \"n_reviewed\":  int(tp + fp + fn + tn),\n    }\n\n# ── Misclassification sensitivity analysis ────────────────────────────────────────────\ndef misclassification_bias_bounds(ppv: float, sensitivity: float,\n                                  n_nlp_positive: int) -> dict:\n    \"\"\"\n    Approximate the true-positive count and the expected false-negative count\n    given measured PPV and sensitivity.\n    True positives in NLP-flagged cohort ≈ n_nlp_positive * PPV.\n    Estimated false negatives ≈ n_true_positives * (1 - sensitivity) / sensitivity.\n    \"\"\"\n    n_true_pos = n_nlp_positive * ppv\n    n_false_neg = n_true_pos * (1.0 - sensitivity) / sensitivity if sensitivity > 0 else float(\"nan\")\n    return {\n        \"n_nlp_positive\": n_nlp_positive,\n        \"estimated_true_positives\": round(n_true_pos, 1),\n        \"estimated_false_negatives\": round(n_false_neg, 1),\n        \"ppv\": ppv,\n        \"sensitivity\": sensitivity,\n        \"note\": (\n            \"Report these estimates in the SAP bias analysis section. 
\"\n            \"Nondifferential miss rate attenuates effect estimates toward the null; \"\n            \"differential miss rate by treatment arm can bias in either direction.\"\n        ),\n    }\n\n# ── Example usage with the worked-example numbers ─────────────────────────────────────\n# 100-patient NLP-positive validation sample: 85 TP, 15 FP\n# 50-patient known-positive hold-out: 40 TP, 10 FN\ndemo = pd.DataFrame({\n    \"person_id\":      list(range(1, 151)),\n    \"nlp_ecog_ge2\":   [1] * 100 + [0] * 50,\n    \"chart_ecog_ge2\": [1] * 85 + [0] * 15 + [1] * 40 + [0] * 10,\n})\nmetrics = validate_nlp(demo)\nprint(\"Validation metrics:\")\nfor k, v in metrics.items():\n    print(f\"  {k}: {v}\")\n# Expected: ppv=0.85, sensitivity=0.8 (per the worked example arithmetic)\n\nbounds = misclassification_bias_bounds(\n    ppv=metrics[\"ppv\"], sensitivity=metrics[\"sensitivity\"], n_nlp_positive=5900)\nprint(\"\\nBias bounds for full cohort:\")\nfor k, v in bounds.items():\n    print(f\"  {k}: {v}\")",
        "description": "End-to-end NLP validation pipeline in Python using MedSpaCy for rule-based NER and NegEx\nnegation detection, followed by a validation metrics computation (PPV, sensitivity,\nspecificity, Cohen's kappa) from a chart-review reference set. Covers the three core\nsteps of any RWE NLP deployment: (1) entity extraction with negation handling, (2) chart-\nreview validation against a gold standard, (3) misclassification propagation into the\nprimary analysis. No GPU required; uses only medspacy, sklearn, and pandas.",
        "dependencies": [
          "medspacy",
          "pandas",
          "scikit-learn"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(stringr)\nsuppressMessages(library(irr))\n\n# ── Rule-based ECOG >= 2 extraction (regex + left-window negation; illustrative) ──────\nextract_ecog_ge2 <- function(note_text) {\n  ecog_pattern <- \"(?i)\\\\becog\\\\s*(?:grade\\\\s*|ps\\\\s*)?[2-4]\\\\b\"\n  neg_triggers  <- \"(?i)\\\\b(no|without|denies|denied|no evidence of|ruled out|not)\\\\b\"\n\n  matches <- gregexpr(ecog_pattern, note_text, perl = TRUE)[[1]]\n  if (matches[1] == -1L) return(FALSE)          # no ECOG >= 2 mention at all\n\n  for (m in as.integer(matches)) {\n    window <- substr(note_text, max(1L, m - 25L), m - 1L)\n    if (!grepl(neg_triggers, window, perl = TRUE)) return(TRUE)   # affirmed mention\n  }\n  FALSE                                          # all mentions negated\n}\n\n# Apply to a data frame of notes (person_id, note_text, note_date)\napply_nlp <- function(notes_df) {\n  agg <- tapply(\n    seq_len(nrow(notes_df)),\n    notes_df$person_id,\n    function(idx) as.integer(any(sapply(notes_df$note_text[idx], extract_ecog_ge2)))\n  )\n  data.frame(person_id    = names(agg),\n             nlp_ecog_ge2 = as.integer(agg),\n             stringsAsFactors = FALSE)\n}\n\n# ── Validation: PPV, sensitivity, specificity, kappa ──────────────────────────────────\n# validation_df: person_id, nlp_ecog_ge2 (0/1), chart_ecog_ge2 (0/1)\nvalidate_nlp <- function(validation_df) {\n  pred <- validation_df$nlp_ecog_ge2\n  ref  <- validation_df$chart_ecog_ge2\n\n  tp <- sum(pred == 1L & ref == 1L)\n  fp <- sum(pred == 1L & ref == 0L)\n  fn <- sum(pred == 0L & ref == 1L)\n  tn <- sum(pred == 0L & ref == 0L)\n\n  ppv         <- if ((tp + fp) > 0) tp / (tp + fp) else NA_real_\n  sensitivity <- if ((tp + fn) > 0) tp / (tp + fn) else NA_real_\n  specificity <- if ((tn + fp) > 0) tn / (tn + fp) else NA_real_\n\n  # Cohen's kappa via irr::kappa2\n  kappa_res <- irr::kappa2(\n    data.frame(rater1 = pred, rater2 = ref), weight = \"unweighted\")\n\n  list(\n    tp          = tp,   fp         
 = fp,\n    fn          = fn,   tn          = tn,\n    ppv         = round(ppv, 3),\n    sensitivity = round(sensitivity, 3),\n    specificity = round(specificity, 3),\n    kappa       = round(kappa_res$value, 3),\n    n_reviewed  = tp + fp + fn + tn\n  )\n}\n\n# ── Misclassification bias bounds ───────────────────────────────────────────────────\nbias_bounds <- function(ppv, sensitivity, n_nlp_positive) {\n  n_true_pos  <- n_nlp_positive * ppv\n  n_false_neg <- if (sensitivity > 0) n_true_pos * (1 - sensitivity) / sensitivity\n                 else NA_real_\n  message(sprintf(\n    \"NLP flagged:              %d\\n  Est. true positives:    %.1f  (= %d x PPV %.2f)\\n  Est. false negatives:   %.1f  (missed cases; attenuate effect if nondifferential)\\n  Note: differential miss rate by exposure arm biases in either direction.\",\n    n_nlp_positive, n_true_pos, n_nlp_positive, ppv, n_false_neg))\n  invisible(list(n_true_pos = n_true_pos, n_false_neg = n_false_neg))\n}\n\n# ── Reproduce worked-example arithmetic ─────────────────────────────────────────────\n# The two samples were drawn differently, so they are scored separately.\n# 100-patient NLP-positive validation: 85 TP + 15 FP (gives PPV)\nppv_sample <- data.frame(\n  person_id      = 1:100,\n  nlp_ecog_ge2   = rep(1L, 100),\n  chart_ecog_ge2 = c(rep(1L, 85), rep(0L, 15))\n)\n# 50-patient known-positive hold-out: 40 TP + 10 FN (gives sensitivity)\nsens_sample <- data.frame(\n  person_id      = 101:150,\n  nlp_ecog_ge2   = c(rep(1L, 40), rep(0L, 10)),\n  chart_ecog_ge2 = rep(1L, 50)\n)\nppv  <- validate_nlp(ppv_sample)$ppv\nsens <- validate_nlp(sens_sample)$sensitivity\ncat(\"PPV =\", ppv, \" sensitivity =\", sens, \"\\n\")\n# PPV = 0.85   sensitivity = 0.8   (matches worked-example: 85/100 and 40/50)\n\n# n_flagged is a hypothetical count of NLP-positive patients among the 5,900\n# NLP-processed patients; replace it with the observed count.\nn_flagged <- 1500L\nbias_bounds(ppv = ppv, sensitivity = sens, n_nlp_positive = n_flagged)",
        "description": "NLP validation metrics and misclassification bias analysis in R. Computes PPV, sensitivity,\nspecificity, and Cohen's kappa from a chart-review validation data frame, then estimates\nthe expected false-negative burden in the unreviewed cohort given the measured sensitivity.\nRule-based text extraction is shown using stringr for the regex + negation pattern; in\nproduction use the medspaCy Python pipeline above or a clinical NLP service. Uses only\nbase R plus stringr and irr for the kappa calculation.",
        "dependencies": [
          "stringr",
          "irr"
        ],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Text[Clinical note corpus<br/>progress notes, radiology reports,<br/>pathology reports] --> Pre[Preprocessing<br/>section detection, sentence splitting]\n  Pre --> NER[Named-entity recognition<br/>ECOG grade, drug, biomarker,<br/>staging, progression language]\n  NER --> Neg{Negation and<br/>uncertainty detection<br/>NegEx scope window}\n  Neg -->|affirmed| Flag[NLP-positive flag<br/>structured variable]\n  Neg -->|negated| NoFlag[NLP-negative<br/>entity not counted]\n  Neg -->|uncertain| Ambig[Uncertain stratum<br/>handle per SAP]\n  Flag --> Val[Validation sample<br/>chart review by clinician<br/>blind to NLP output]\n  Val --> Metrics[PPV / sensitivity / specificity<br/>Cohen kappa]\n  Metrics --> Bias[Misclassification bias analysis<br/>propagate error into effect estimate]\n  Metrics --> Report[Report in SAP / methods section<br/>model name + validation design]",
        "caption": "Clinical NLP pipeline from raw note text through NER and negation handling to a structured variable, followed by the mandatory validation study and bias propagation steps that are required before using NLP-derived variables in a primary analysis.",
        "alt_text": "Flowchart from clinical notes through section detection, NER, and a negation decision node into affirmed, negated, and uncertain entity categories; the affirmed path leads to a validation sample, operating characteristics, and a bias analysis.",
        "source_type": "illustrative",
        "source_citations": [
          "wang-2018"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  A[Training on structured codes<br/>as weak labels] -->|silver-standard trap| B[Model learns to reproduce codes<br/>cannot supplement them<br/>excellent code-vs-code PPV]\n  C[Training on chart-review labels<br/>clinical gold standard] -->|correct approach| D[Model recovers text signal<br/>beyond structured codes<br/>PPV measured vs independent review]\n  E[NER 'progression' detected<br/>in text span] --> F{NegEx scope check}\n  F -->|no negation trigger in window| G[Affirmed finding — count as event]\n  F -->|negation trigger found| H[Negated finding — do NOT count]",
        "caption": "Two critical failure modes: the silver-standard trap (left) where training on the codes the model was meant to supplement produces circular validation; and the negation handling path (right) where NegEx-style scope detection prevents false-positive progression flags from phrases like 'no evidence of progression'.",
        "alt_text": "Left branch contrasts silver-standard training (circular, using codes as labels) against chart-review training (correct approach). Right branch shows NegEx scope decision for a progression entity mention, routing to affirmed or negated based on trigger detection.",
        "source_type": "illustrative",
        "source_citations": [
          "chapman-2001"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "requires",
        "target_slug": "predictive-and-causal-ml-models-rwe",
        "notes": "NLP for clinical text is positioned within the ML and Predictive category; BERT-variant and fine-tuned transformer approaches draw on the same ML infrastructure and cross-fitting validation logic as the broader predictive ML family."
      },
      {
        "relation_type": "used_with",
        "target_slug": "ehr-phenotyping-algorithms-rwe",
        "notes": "NLP extends EHR phenotyping algorithms into unstructured text; the validation framework — PPV, sensitivity, chart-review gold standard, misclassification bias analysis — is shared between structured phenotyping and NLP-derived variable extraction."
      },
      {
        "relation_type": "used_with",
        "target_slug": "real-world-progression-rwpfs-rwe",
        "notes": "NLP document classification is the extraction method layer for real-world progression endpoints; radiology-anchored and clinician-anchored rwP both depend on reading note text, and NLP scales that reading to the full cohort."
      },
      {
        "relation_type": "used_with",
        "target_slug": "agreement-statistics-kappa-icc-bland-altman",
        "notes": "Cohen's kappa and weighted kappa are the standard inter-reviewer agreement metrics for the chart-review validation study that establishes NLP operating characteristics; kappa is also used to quantify agreement between two NLP model versions or between human abstractors."
      },
      {
        "relation_type": "used_with",
        "target_slug": "algorithm-validation",
        "notes": "The NLP validation protocol — defining a gold-standard reference, sampling strategy, reviewer blinding, and operating-characteristic computation — follows the same algorithm-validation framework required for all algorithm-derived RWE variables."
      },
      {
        "relation_type": "see_also",
        "target_slug": "llm-assisted-abstraction-rwe",
        "notes": "LLM-based abstraction is the next-generation extension of clinical NLP; this entry covers the foundational NLP task taxonomy and validation requirements that apply to all generations; route LLM-specific prompting, hallucination handling, and multi-pass extraction workflows to the sibling entry."
      }
    ],
    "aliases": [
      "clinical NLP",
      "text mining for EHR",
      "natural language processing in RWE",
      "named entity recognition clinical",
      "NegEx negation detection",
      "ClinicalBERT"
    ],
    "complexity": "advanced",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "normal-distribution-clt",
    "name": "Normal Distribution and the Central Limit Theorem",
    "short_definition": "The normal (Gaussian) distribution is a symmetric bell curve parameterized by mean and standard deviation that describes the long-run behavior of sums and averages — not raw clinical data; the Central Limit Theorem (CLT) guarantees that the sampling distribution of the mean approaches normality as sample size grows regardless of the underlying outcome distribution, with convergence speed depending on skew, and is the mathematical foundation for t-tests, OLS confidence intervals, and z-score standardization across biomedical research.",
    "long_description": "**What the normal distribution is — and what it is not**\n\nThe normal (Gaussian) distribution is a symmetric, bell-shaped probability distribution with\ntwo parameters: mean μ (location) and standard deviation σ (spread). Its probability density\nfunction is f(x) = (1 / (σ√2π)) × exp(−(x − μ)² / (2σ²)). The normal is the engine behind\nz-tests, t-tests, OLS confidence intervals, and Wald-type intervals from logistic regression,\nCox models, and GLMs — nearly every inferential tool in the RWE analyst's kit eventually relies\non a normal approximation for some sampling distribution.\n\nThe critical distinction, missed constantly in applied work, is **what exactly follows a normal\ndistribution**. It is almost never the raw clinical outcome. Healthcare costs, length-of-stay,\nprescription counts, biomarker concentrations from right-skewed assays, and time-to-event\noutcomes are fundamentally non-normal: they are right-skewed, bounded below at zero, or both.\nThis is not a problem to be corrected — it is a property of the data-generating process. The\nnormal distribution is the model for **summary statistics computed from samples** — specifically\nmeans and their differences — not for individual patient values. A distribution of adult heights\nin a clinic might be approximately normal; the distribution of their total healthcare costs in\nthe same year will not be. Altman and Bland (1995) crystallize this distinction for clinical\nresearchers: a variable need not be normally distributed for its sample mean to be\napproximately normal. Understanding this gap explains the entire \"but costs aren't normal, can\nI still use a t-test?\" question that arises repeatedly in RWE practice.\n\n**The normal distribution's shape and the 68-95-99.7 rule**\n\nThree properties of the normal distribution that practitioners must have memorized:\n\n1. 
**68-95-99.7 rule:** In a normal distribution with mean μ and SD σ, approximately 68% of\n   observations fall within one σ of μ, 95% within two σ, and 99.7% within three σ. More\n   precisely, the multipliers for exact 95% and 99% probability coverage are 1.96 and 2.576\n   σ respectively, but the one-two-three mnemonic is reliable for rapid mental calculation.\n   This rule applies to the distribution of *individual observations* from a normal population;\n   when applied to the **sampling distribution of the mean**, σ is replaced by the standard\n   error SE = σ/√n — a quantity that shrinks with n, unlike the SD of the raw data.\n\n2. **Symmetry:** The normal distribution is perfectly symmetric about its mean. Its mean,\n   median, and mode all coincide at μ. For right-skewed outcomes (costs, LOS), the mean\n   exceeds the median, and the mode lies below the median — the symmetry assumption breaks\n   down for raw data, but the CLT rescues the mean as a summary statistic.\n\n3. **The standard normal:** When X ~ N(μ, σ²), the z-transformation z = (X − μ) / σ produces\n   a standard normal Z ~ N(0, 1). This allows any normal distribution to be converted to one\n   reference distribution for probability calculation. z-scores are the natural language for\n   pediatric growth assessment (WHO HAZ, WAZ, WHZ), educational standardized testing (IQ\n   scales with mean 100 and SD 15), and laboratory reference ranges (the central 95% of a\n   healthy reference population corresponds to the Z ∈ [−1.96, 1.96] interval).\n\n**The Central Limit Theorem: precise statement for practitioners**\n\nThe CLT is one of the most important theorems in applied statistics and is routinely overstated\nor understated. 
Precise statement: if X₁, X₂, ..., Xₙ are independent, identically distributed\nrandom variables with mean μ and finite variance σ², then the standardized sample mean\n(X̄ − μ) / (σ/√n) converges in distribution to a standard normal N(0, 1) as n → ∞, regardless\nof the shape of the underlying distribution. Kwak and Kim (2017) provide a practitioner-focused\nderivation of this result with biomedical illustrations.\n\nSeveral practical implications for RWE:\n\n- **Rate of convergence depends on skewness.** For symmetric distributions the CLT operates at\n  small n (n ≥ 10–15 is often sufficient). For moderately skewed distributions, n ≥ 30–50 per\n  group is the conventional threshold. For heavily right-skewed distributions — healthcare costs\n  dominated by extreme high-cost outliers, LOS distributions with a fat right tail — convergence\n  is slow: n may need to reach several hundred or even thousands before the sampling distribution\n  of the mean is reliably normal. At that point, parametric inference on the mean is\n  asymptotically valid, but extreme quantiles of the distribution still require distributional\n  modeling (log-normal, gamma, Pareto) rather than normal-based methods.\n\n- **The CLT applies to the sampling distribution of the mean, not the raw data.** Raw cost\n  histograms in a claims database with n = 5,000,000 will always display severe right skew —\n  the CLT says nothing about that histogram. It says that if you drew samples of size n from\n  that population and computed the mean of each sample, those means would be approximately\n  normally distributed for large n. The raw data remain non-normal; the means do not.\n\n- **SE = σ/√n is the key formula.** The standard error of the mean shrinks at the rate 1/√n\n  as sample size grows. Halving the SE requires quadrupling n. The SD of the raw data does not\n  shrink with n — it stays roughly constant regardless of how many patients are enrolled. 
The\n  SD describes patient-to-patient variability; the SE describes how precisely we know the mean.\n  Confusing SD and SE is the single most common numerical error in RWE research reports.\n\n**z-scores, standardization, and the pediatric growth example**\n\nA z-score transforms a raw measurement into standard-deviation units relative to a reference:\nz = (observed value − reference mean) / reference SD. This allows direct comparison across\nmeasurements on different scales and provides an immediate probabilistic interpretation via the\nstandard normal: a child with a height-for-age z-score (HAZ) of −2.0 is 2 SDs below the WHO\nreference median, placing them at approximately the 2.3rd percentile of the reference population\n(P(Z < −2) ≈ 0.023 from the standard normal CDF). The WHO Multicentre Growth Reference Study\nestablished HAZ, WAZ, and WHZ z-scores on reference samples designed to represent optimal growth;\nstudies of pediatric interventions analyze changes in these z-scores — which are normally\ndistributed by construction in the reference population — using standard t-test and regression\nmethods. The same z-score framework underlies DXA T-scores and Z-scores for bone density and\neducational test score scaling (IQ: mean 100, SD 15 by design).\n\n**Why t-tests and OLS are robust at large n — the CLT at work**\n\nThe t-test and OLS regression both assume that their test statistics follow a t-distribution,\nwhich converges to the standard normal for large degrees of freedom. This holds when the outcome\nis normally distributed, but also — more broadly — when the sample mean is approximately normally\ndistributed. 
The CLT guarantees the latter for large n regardless of the outcome distribution.\nThis is why a two-sample t-test on healthcare costs with n = 500,000 per arm is asymptotically\nvalid for inference on mean costs, even though the raw cost distribution is severely right-skewed.\nOLS regression coefficients are linear combinations (weighted sums) of the outcome values, and\nthe Frisch-Waugh-Lovell theorem expresses each one as a ratio of averages of residualized\nvariables; a CLT for weighted sums therefore applies and justifies Wald-based confidence\nintervals at large n. This is precisely the \"Fagerland paradox\" noted in\nparametric-vs-nonparametric-tests: analysts most often reach for nonparametric tests when n is\nlarge and non-normality is obvious, but that is the setting where the CLT has already done its\nwork.\n\n**Where the CLT fails and alternative methods are required**\n\nThe CLT does not guarantee valid normal-based inference in every situation:\n\n1. **Extreme right tails in cost distributions.** Even at n = 100,000, if a tiny fraction of\n   catastrophically expensive patients dominate the mean (costs 50–100× the median), the sample\n   mean is unstable and heavily influenced by those few observations. Inference on the mean via\n   a t-test is technically asymptotically valid, but the test may have low power and the\n   estimate may be sensitive to outlier-removal choices. Gamma GLMs, two-part models, and\n   bootstrap mean estimation are more appropriate for primary analysis when tail behavior\n   drives the mean estimate.\n\n2. **Rare-event proportions near 0 or 1.** By the usual rule of thumb, the normal approximation\n   to a proportion p̂ is unreliable when np < 5 or n(1 − p) < 5. Normal-approximation confidence\n   intervals for rare event rates can produce negative lower bounds, which are nonsensical. Use\n   Clopper-Pearson exact intervals, Wilson score intervals, or likelihood-ratio intervals. This\n   failure mode makes CLT-based z-tests invalid for pharmacovigilance monitoring of very rare\n   adverse events.\n\n3. 
**Small n with heavily skewed outcomes.** With n < 20–30 per group and a clearly non-normal\n   outcome (visible on a QQ plot), the CLT has not yet converged. Nonparametric tests\n   (Mann-Whitney, Wilcoxon signed-rank), permutation-based inference, or parametric models\n   suited to the distributional shape (log-normal, gamma) are safer than assuming normality.\n\n4. **Sequential safety monitoring with very low expected counts.** The maxSPRT and TreeScan\n   methods for sequential drug safety surveillance use exact Poisson likelihoods because the\n   Poisson normal approximation is unreliable when expected event counts per monitoring period\n   are below 5–10. The CLT-protected z-test is not valid in this regime.\n\n**Normal approximations to binomial and Poisson — and their breakdown**\n\nTwo classical normal approximations appear throughout epidemiology and biostatistics:\n\n- **Binomial ≈ Normal:** If X ~ Bin(n, p), then for large n with np ≥ 5 and n(1 − p) ≥ 5,\n  X is approximately N(np, np(1 − p)). The threshold of 5 for both np and n(1 − p) guards\n  against the asymmetric regime where the binomial is still right- or left-skewed.\n\n- **Poisson ≈ Normal:** If X ~ Poisson(λ), then for λ ≥ 10–20, X is approximately N(λ, λ).\n  For rare adverse events with λ < 5, the Poisson is right-skewed and the normal approximation\n  is unreliable; exact Poisson-based CIs (Garwood or exact mid-p) are required.\n\nThe breakdown of these approximations is directly relevant to pharmacovigilance. When monitoring\nfor rare adverse events where expected event counts per reporting period may be 0, 1, or 2, the\nnormal approximation fails systematically — it places probability mass at negative counts and\nproduces CIs that are too wide or too narrow depending on the tail of interest. 
This is the\nprimary reason sequential safety surveillance methods (maxSPRT, the conditional Poisson scan\nstatistic) use exact likelihood-based methods rather than z-score approximations.\n\n**QQ plots for normality assessment and why formal normality tests are counterproductive**\n\nA quantile-quantile (QQ) plot compares the quantiles of observed data against the theoretical\nquantiles of a normal distribution. If the data are normal, the points fall approximately on a\nstraight line; curvature indicates departures: a one-directional bow (banana shape) indicates\nskewness, and an S-shape indicates tails heavier or lighter than normal. QQ plots are the\nappropriate tool for assessing normality in practice.\n\nFormal normality tests — Shapiro-Wilk, Kolmogorov-Smirnov, Anderson-Darling — are actively\ncounterproductive as decision gates, for two reasons:\n\n1. **Underpowered at small n.** At n < 30, Shapiro-Wilk has low power to detect genuine\n   non-normality that would affect inference. The test will often fail to reject for non-normal\n   data, providing false reassurance precisely when normality matters most for small-sample\n   t-test validity.\n\n2. **Overpowered at large n.** At n > 500 (and certainly at n > 50,000 typical of claims\n   databases), any trivial departure from normality — a mildly heavy tail, a minor right skew\n   — will produce a highly significant rejection of the normality null (p < 0.001). But that\n   departure may be completely irrelevant to CLT-protected inference on means at that sample\n   size. 
The test rejects normality because the test is sufficiently sensitive, not because\n   normality matters at that n.\n\nThe recommended practice: use a QQ plot to visualize the distributional shape, combine it with\nsubject-matter knowledge about the outcome (costs are almost always right-skewed; growth\nz-scores are normal by design; lab values in a reference population are approximately normal by\nselection), and make the analytical choice on that basis — not on a p-value from a normality\ntest.\n\n**Interpreting the output**\n\nConsider a pediatric growth study in which 9 infants are measured on a weight-for-age growth\nindex calibrated to a reference population with mean 100 and SD 15. Separately, a child seen at\nthe clinic scores 130.\n\n*(1) Formal statistical interpretation.* The z-score for the child with index 130 is\nz = (130 − 100) / 15 = 2.0. Under the reference distribution N(100, 15²), z = 2.0 corresponds\nto approximately the 97.7th percentile (P(Z < 2.0) ≈ 0.977 from the standard normal CDF).\nThe 68-95-99.7 rule places the interval [100 − 2×15, 100 + 2×15] = [70, 130] as containing\napproximately 95.4% of the reference population; the child at 130 sits at the upper edge of\nthis two-SD band. For the sample of 9 infants, treating the reference SD of 15 as known, the\nstandard error of the mean growth index is SE = 15 / √9 = 15 / 3 = 5 units. A 95% confidence\ninterval for the true population mean among children like those in this sample would be\napproximately X̄ ± 1.96 × 5, giving a margin of roughly ±9.8 index units. 
The SD of 15 has not changed — individual children are just as\nvariable as in the reference — but the precision of the group mean has improved by a factor\nof √9 = 3 compared to knowing only a single child's value.\n\n*(2) Practical interpretation.* The child with growth index 130 is 2 standard deviations above\nthe reference population average — a value unusual enough (above the 97th percentile of healthy\nreference children) to warrant clinical documentation, though not extreme enough on its own to\nindicate pathology. The SD of 15 tells us how different children are from one another;\nthe SE of 5 tells us how precisely the average of the nine enrolled infants estimates the mean\ngrowth index of the population they represent. In plain language: studying more children\nnarrows our uncertainty about the group average (SE shrinks) but does not reduce child-to-child\nvariability (SD stays around 15).\n\n**Pros, cons, and trade-offs**\n\n*Normal distribution methods (z-tests, t-tests, OLS, normal-based CIs)*:\n- **Pros:** analytically tractable, computationally trivial, directly interpretable effect\n  estimates (mean differences, standardized differences), familiar to all audiences, extensible\n  to regression via OLS, asymptotically valid for inference on means at large n via CLT\n  regardless of raw data distribution shape, exact when the outcome is genuinely normal.\n- **Cons:** invalid for inference on extreme quantiles of non-normal outcomes, misleading for\n  rare-event proportions near 0 or 1, slow CLT convergence for extreme right-skew means the\n  approximation may not hold at moderate n with cost-like data, provide no information about\n  distributional shape beyond mean and variance.\n- **When to prefer:** inference on means of continuous outcomes with adequate n; z-score\n  comparisons in growth and developmental endpoints where normality holds by design;\n  large-n asymptotic approximation for Wald test statistics from logistic regression, Cox\n  regression, and 
GLMs; any setting where the estimand is a mean difference.\n\n*Normal approximations to binomial and Poisson*:\n- **Pros:** closed-form test statistics and CIs, computationally fast, excellent approximation\n  when event counts are adequate (np ≥ 5 and n(1 − p) ≥ 5 for binomial; λ ≥ 10 for Poisson).\n- **Cons:** fail for rare events, produce nonsensical negative lower bounds for small proportions,\n  invalid for pharmacovigilance with very low expected counts, have below-nominal coverage in\n  the tails.\n- **When to prefer:** aggregate count data with common events; background comparisons in large\n  registry analyses; standard rate ratios and relative risks where events are frequent.\n\n*QQ plots versus formal normality tests*:\n- **QQ plots (prefer):** visualize the full distributional shape and departure pattern, scale\n  appropriately across all sample sizes, reveal the type of non-normality (skewness vs heavy\n  tails), do not produce misleading \"statistically significant non-normality\" in large databases.\n- **Formal tests (avoid as decision gates):** counterproductive at both extremes of n;\n  Shapiro-Wilk and Kolmogorov-Smirnov provide a binary signal that conveys less information\n  than a visual QQ plot and should not be used as the gating criterion for parametric vs\n  nonparametric analysis choice.\n\n**When to use**\n\nNormal distribution methods and CLT-based approximations are appropriate when:\n- The target of inference is a **mean** or **mean difference**, and the CLT is plausible given n\n  and the degree of skew: n ≥ 30–50 per group for mildly to moderately skewed continuous\n  outcomes, several hundred or more for heavily skewed outcomes, with assessment via QQ plot and\n  distributional knowledge.\n- The outcome is **normally distributed by design** — pediatric growth z-scores (HAZ, WAZ, WHZ),\n  IQ-type standardized scores, clinical biomarker z-scores referenced against a healthy population\n  — and inference on individual z-scores or mean z-scores is the goal.\n- **Standardization to z-scores** is needed 
for cross-metric comparison across outcomes measured\n  on different scales within a multi-endpoint RWE study.\n- **Normal approximations to binomial or Poisson** apply, with np ≥ 5 and n(1 − p) ≥ 5 confirmed.\n- As the **large-n asymptotic approximation** justifying Wald confidence intervals reported by\n  standard software from logistic, Cox, and GLM regression procedures.\n\n**When NOT to use — and when it is actively misleading or dangerous**\n\n- **Raw non-normal clinical outcomes analyzed as if individually normal.** Fitting a normal model\n  to raw healthcare costs and using mean ± 1.96 SD as a \"reference range\" is nonsensical: a large\n  fraction of patients will fall in the implied \"negative cost\" region. Reserve normal-based\n  inference for means and mean differences; use log-normal or gamma models for the distributional\n  shape of individual-level non-normal outcomes.\n- **Rare-event proportions.** If np < 5 or n(1 − p) < 5, the normal approximation to the\n  binomial fails. CLT-based z-tests for adverse event rates with very low expected counts are\n  invalid; use Clopper-Pearson exact intervals, Wilson score intervals, or exact Poisson-based\n  methods. This failure mode is relevant to every safety surveillance analysis of rare adverse\n  drug reactions.\n- **Formal normality testing as a decision gate.** Using Shapiro-Wilk as the selector between\n  parametric and nonparametric tests is counterproductive at every sample size. At small n, the\n  test is underpowered and accepts non-normality that matters; at large n, it rejects trivial\n  departures rendered irrelevant by the CLT. Use QQ plots and distributional knowledge instead.\n- **Inference on extreme quantiles of right-skewed distributions.** The CLT justifies normal-based\n  inference on the mean of cost data at large n; it says nothing about the 95th or 99th percentile\n  of the cost distribution. 
Budget-impact models requiring tail-cost estimates need distributional\n  modeling (log-normal, gamma) or direct quantile estimation with bootstrap CIs, not normal\n  approximations to extreme quantiles.\n- **Small n with clearly non-normal outcomes.** With n < 20–30 per group and a distribution\n  visibly inconsistent with normality on a QQ plot, use nonparametric tests (Mann-Whitney,\n  Wilcoxon signed-rank), permutation tests, or exact methods rather than a t-test whose type I\n  error control depends on the normality assumption the small sample cannot verify.\n- **Sequential safety monitoring with rare events.** maxSPRT and similar sequential surveillance\n  methods use exact Poisson likelihoods because the normal approximation to the Poisson is\n  unreliable when expected counts per monitoring period fall below 5–10.\n\n**Data-source operational depth**\n\n- **Claims:** With n typically exceeding 100,000 per arm in comparative effectiveness studies,\n  the CLT is operative for inference on mean costs, utilization, and binary rates. Every mean\n  will be estimated with a narrow CI; the binding question is bias from confounding, not\n  normality of the raw data. QQ plots of the raw cost distribution will always show severe right\n  skew — expected and not a barrier to mean inference but directly relevant to GLM and two-part\n  model selection. Rare adverse event rates with low expected counts require exact Poisson-based\n  testing, not normal approximations.\n- **EHR:** Growth z-scores, laboratory reference values, and vital signs may be approximately\n  normally distributed in reference-population studies (the reference population is constructed\n  to produce normality). For EHR-derived outcomes like eGFR change or HbA1c reduction, QQ plots\n  on regression residuals are the normality assessment tool, not tests on the raw outcome.\n  Multi-site EHR networks: within-site CLT applies for means; between-site heterogeneity in\n  means requires mixed models rather than a pooled normal approximation.\n- **Registry:** Disease severity scores (PRO instruments) are typically bounded and ordinal —\n  not normal. For disease registries tracking pediatric growth (rare congenital conditions,\n  inflammatory diseases), growth z-scores are the standard endpoint and normal-distribution\n  inference applies directly. Check for floor or ceiling effects in the z-score distribution,\n  which would indicate a truncated reference or measurement artifact.\n- **Primary:** Small pilot and interventional studies (n < 50 per arm) are the regime where CLT\n  approximations are most questionable for skewed outcomes. Pre-specify the normality assumption\n  for the primary endpoint in the analysis plan; if violated, use the pre-specified nonparametric\n  or bootstrap alternative. For growth outcomes, the z-score is approximately normal by design in\n  the reference population, so normal-based analysis is reasonable even at small n, provided the\n  study population's z-scores show no floor or ceiling effects.",
    "primary_category": "Inferential_Statistics",
    "tags": [
      "statistics",
      "primitive",
      "foundations",
      "distributions",
      "normal-distribution",
      "central-limit-theorem",
      "z-scores",
      "standard-error",
      "sampling-distribution",
      "QQ-plot",
      "normality-testing",
      "bell-curve",
      "CLT",
      "standardization",
      "68-95-99.7"
    ],
    "applies_to_study_types": [
      "cohort_retrospective",
      "cohort_prospective",
      "rct",
      "claims_analysis",
      "ehr_study",
      "registry_study",
      "descriptive_analysis",
      "cross_sectional"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "primary",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1136/bmj.310.6975.298",
        "url": "https://doi.org/10.1136/bmj.310.6975.298",
        "citation_text": "Altman DG, Bland JM. Statistics notes: The normal distribution. BMJ. 1995;310(6975):298.",
        "year": 1995,
        "authors_short": "Altman & Bland",
        "notes": "BMJ Statistics Notes entry written for clinical researchers — distinguishes the normality of individual data from the normality of sample means, establishes the practical criterion for when normality can be assumed in medical research, and provides the conceptual foundation for interpreting z-scores and the CLT in clinical contexts."
      },
      {
        "role": "explain",
        "doi": "10.4097/kjae.2017.70.2.144",
        "url": "https://doi.org/10.4097/kjae.2017.70.2.144",
        "citation_text": "Kwak SK, Kim JH. Central limit theorem: the cornerstone of modern statistics. Korean Journal of Anesthesiology. 2017;70(2):144-156.",
        "year": 2017,
        "authors_short": "Kwak & Kim",
        "notes": "Practitioner-focused derivation and illustration of the CLT with biomedical examples, showing why the sampling distribution of the mean approaches normality at large n, how convergence speed depends on distributional skewness, and why the CLT underpins t-tests, ANOVA, and regression inference in medical statistics."
      }
    ],
    "plain_language_summary": "The normal distribution (the bell curve) describes how averages behave — not how individual patient measurements like costs or hospital stays are shaped, which are almost always right-skewed. The Central Limit Theorem is the reason that t-tests and most regression analyses still work on real-world data: when you average enough patients together, that average follows a bell curve even if each patient's individual value does not. z-scores convert any measurement into \"how many standard deviations above or below the group average is this value?\" — the standard tool for reporting pediatric growth and laboratory reference ranges.",
    "key_terms": [
      {
        "term": "normal distribution",
        "definition": "A symmetric, bell-shaped probability distribution fully described by its mean and standard deviation, where roughly 68%, 95%, and 99.7% of values fall within one, two, and three standard deviations of the mean respectively."
      },
      {
        "term": "Central Limit Theorem",
        "definition": "A mathematical result guaranteeing that the average of a large enough sample from any distribution with a finite mean and variance will itself follow approximately a bell-curve distribution, regardless of the shape of the individual data."
      },
      {
        "term": "z-score",
        "definition": "The number of standard deviations a single value is above or below a reference mean, computed as (observed value minus reference mean) divided by reference standard deviation; a z-score of 2.0 means the value is two standard deviations above the mean."
      },
      {
        "term": "standard error",
        "definition": "The standard deviation of the sample mean's distribution across repeated studies — equal to the data's standard deviation divided by the square root of sample size; it shrinks as more patients are enrolled, while individual variability stays constant."
      },
      {
        "term": "sampling distribution",
        "definition": "The distribution that a summary statistic (like a mean or difference) would follow if the study were repeated many times; the Central Limit Theorem says this distribution approaches a bell curve for means as sample size grows."
      },
      {
        "term": "QQ plot",
        "definition": "A quantile-quantile plot that compares observed data quantiles to expected normal quantiles; points falling on a straight line indicate normality, while a one-directional bow (banana shape) indicates skewness and an S-shape indicates tails heavier or lighter than normal."
      }
    ],
    "worked_example": {
      "scenario": "A pediatric endocrinology clinic tracks a weight-for-age growth index calibrated to a reference population with mean 100 and standard deviation 15. A child at today's visit has a growth index of 130, and the team wants to know how many standard deviations above the reference average that child is. Separately, they want to know how precisely the clinic's nine enrolled infants estimate the true mean growth index, expressed as a standard error.",
      "dataset": {
        "caption": "Growth index measurements for 9 infants enrolled in the clinic cohort, plus the reference population parameters. The child with index 130 is a single patient whose z-score is computed against the reference mean and SD.",
        "columns": [
          "infant_id",
          "growth_index"
        ],
        "rows": [
          [
            "P1",
            85
          ],
          [
            "P2",
            90
          ],
          [
            "P3",
            95
          ],
          [
            "P4",
            95
          ],
          [
            "P5",
            100
          ],
          [
            "P6",
            100
          ],
          [
            "P7",
            105
          ],
          [
            "P8",
            115
          ],
          [
            "P9",
            115
          ]
        ]
      },
      "steps": [
        "Reference parameters: mean = 100, SD = 15. The child of clinical interest has growth index 130, which is above the reference mean.",
        "z-score = (130 - 100) / 15 = 30 / 15 = 2.0. This child is exactly 2.0 standard deviations above the reference mean, placing them at approximately the 97.7th percentile of the reference population (from the standard normal CDF: P(Z < 2.0) = 0.977).",
        "The 68-95-99.7 rule tells us that about 95% of the reference population falls in the interval [100 - 2*15, 100 + 2*15] = [70, 130]. The child at 130 sits at the upper boundary of the two-standard-deviation reference band.",
        "For the 9 enrolled infants, we compute the standard error of the mean growth index. The reference SD is 15 and n = 9. Because sqrt(9) = 3, SE = 15 / 3 = 5. The SE is 3 times smaller than the SD, because averaging across 9 patients is 3 times more precise than a single patient measurement.",
        "The SD of 15 stays fixed — it describes how spread out individual children are around the reference mean. The SE of 5 describes only how precisely we know the mean of this particular group of 9. These are answering different questions."
      ],
      "result": "z = (130 - 100) / 15 = 2.0; the child is at the 97.7th percentile of the reference distribution. SE = 15 / 3 = 5; the 9-infant sample estimates the true mean to within approximately plus or minus 9.8 index units (1.96 times SE). The SD of 15 describes child-to-child variability and does not change with n; the SE of 5 describes precision of the group mean and shrinks as sqrt(n) grows.",
      "timeline_spec": {
        "title": "Growth index reference band and z-score for the 9-infant cohort",
        "window": {
          "start": "2024-01-01",
          "end": "2024-12-31",
          "label": "Annual assessment window, n=9 infants, reference SD=15"
        },
        "events": [
          {
            "label": "Reference mean band: 100 +/- 2*15 = [70, 130]",
            "start": "2024-01-01",
            "length_days": 365,
            "quantity": "95% range"
          },
          {
            "label": "Patient index = 130 (z = 2.0, 97.7th pct)",
            "start": "2024-06-15",
            "length_days": 1,
            "quantity": "z=2.0"
          }
        ],
        "spans": [
          {
            "kind": "covered",
            "start": "2024-01-01",
            "end": "2024-12-31",
            "label": "Observation window (n=9 enrolled infants)"
          }
        ],
        "result": {
          "label": "z = 2.0; SE = 15/3 = 5",
          "value": 2.0
        }
      }
    },
    "prerequisites": [
      "descriptive-statistics"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "z-score standardization and percentile lookup",
        "description": "The core application: transform an observed value to z = (x − μ) / σ using reference population parameters, then use the standard normal CDF to obtain the percentile. For known reference populations (WHO growth, bone density T-scores, IQ scales), this is exact. For estimated reference parameters, the z-score is approximate and t-distribution critical values should be used for small-sample inference.",
        "edge_cases": [
          "When reference parameters are estimated from the same sample (not an external reference), use t-scores with df = n − 1 rather than z-scores to account for uncertainty in the SD estimate.",
          "If the outcome is not normally distributed in the reference population, the percentile derived from the standard normal CDF is misleading; use empirical percentiles from the reference sample instead."
        ],
        "data_source_notes": "EHR and registry: pediatric growth z-scores (HAZ, WAZ, WHZ) are computed using WHO growth reference parameters and are the standard endpoint for pediatric cohort studies. Laboratory z-scores (eGFR, HbA1c change) use lab-specific reference parameters and should be distinguished from percentile cut-points based on clinical risk rather than distributional position."
      },
      {
        "name": "CLT-justified inference on means of skewed outcomes",
        "description": "For continuous outcomes that are non-normal at the individual level (costs, LOS, utilization counts), invoke the CLT to justify t-tests and OLS confidence intervals on group means when n is large. The key assessment is whether convergence has occurred: inspect a QQ plot of resampled (bootstrap) sample means, or assess the skewness of the outcome and compare n against the rule-of-thumb thresholds of n ≥ 30–50 per group (mild to moderate skew) or several hundred or more (heavy skew).",
        "edge_cases": [
          "For cost data with extreme right skew (Gini coefficient > 0.6, a handful of patients accounting for >30% of total spend), CLT convergence may require n > 500 per arm before the sampling distribution of the mean is reliably normal; bootstrap CIs are a robust alternative.",
          "The CLT protects the mean estimate; it does not protect quantile estimates. A 95% CI for the 99th percentile of costs requires distributional modeling or order-statistic methods, not a z-test."
        ],
        "data_source_notes": "Claims: the CLT operates at large n but extreme cost outliers slow convergence; always run a sensitivity analysis winsorizing at the 99th percentile to assess outlier influence on the mean and its SE."
      },
      {
        "name": "Normal approximation to binomial and Poisson for rate inference",
        "description": "Use the normal approximation when np ≥ 5 and n(1 − p) ≥ 5 (binomial) or λ ≥ 10 (Poisson). The normal-approximation 95% CI for a proportion is p̂ ± 1.96 × sqrt(p̂(1 − p̂)/n); for a Poisson rate it is λ̂ ± 1.96 × sqrt(λ̂). Both collapse to nonsense for rare events; use exact methods when the thresholds are not met.",
        "edge_cases": [
          "For proportions, the Wilson score interval (not the normal approximation interval) is preferred even when np ≥ 5 because it has better coverage properties near the boundaries.",
          "For Poisson, the exact Garwood interval or its mid-p variant is preferred when expected counts are below 20, even when the normal approximation is technically applicable."
        ],
        "data_source_notes": "Claims: for adverse event rates where events are common (expected counts in the thousands), normal approximations to Poisson rate ratios are standard. For rare post-market safety signals monitored prospectively, use maxSPRT or exact Poisson sequential methods rather than the normal approximation."
      },
      {
        "name": "QQ plot normality assessment",
        "description": "Generate a QQ plot of residuals (for a regression model) or of the raw outcome values (for a simple comparison). Points on a straight line indicate normality; systematic S-curves indicate skewness; banana curves indicate heavier or lighter tails than normal. Use this visual as the primary normality decision tool, not formal test p-values.",
        "edge_cases": [
          "At very small n (< 15), QQ plots have wide point-wise uncertainty bands and may not reliably distinguish normal from non-normal; simulation envelopes (qqenvelope in R) add uncertainty bands and improve interpretation.",
          "Discrete outcomes (counts, binary) will show step patterns on QQ plots even when the underlying model is appropriate; QQ plots of model residuals are more informative than QQ plots of raw discrete outcomes."
        ],
        "data_source_notes": "Claims and EHR: QQ plots on regression residuals from OLS models of continuous outcomes are the standard diagnostic. For log-transformed outcomes, inspect the QQ plot on the log scale; the retransformation step (smearing) addresses the mean estimation problem but the QQ plot assesses the log-scale normality assumption."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "parametric-vs-nonparametric-tests",
        "pros_of_this": "The normal distribution and CLT provide the theoretical justification for why parametric tests are valid at large n despite non-normal raw data; this entry explains the mechanism that parametric-vs-nonparametric-tests uses as a premise.",
        "cons_of_this": "This entry describes when the normal approximation holds; parametric-vs-nonparametric-tests operationalizes the choice between test families given the sample size and distributional assessment that follows from these foundations.",
        "when_to_prefer": "Read this entry to understand the CLT mechanism; use parametric-vs-nonparametric-tests to make the specific test-family decision for a given dataset."
      },
      {
        "compared_to": "log-normal-distribution",
        "pros_of_this": "The normal distribution is the reference model for z-scores, growth endpoints, and sample means of any large-n outcome; it is simpler, has a richer set of exact analytic results, and directly applies to the sampling distribution of the mean via the CLT.",
        "cons_of_this": "For individual-level positive continuous outcomes (costs, LOS, biomarker concentrations), the log-normal captures the multiplicative data-generating process and right-skewed shape correctly; fitting a normal model to raw costs is wrong, while the log-normal is the appropriate distributional model.",
        "when_to_prefer": "Use normal distribution methods when the target is a mean or when the outcome is normally distributed by design; use log-normal when modeling individual-level right-skewed positive outcomes and the estimand is the geometric mean or arithmetic mean via retransformation."
      },
      {
        "compared_to": "inferential-statistics-foundations",
        "pros_of_this": "This entry covers the distributional mechanism (CLT, z-scores, normal approximations) that makes inferential machinery work; it explains the \"why\" behind the SE formula and the large-n robustness of parametric tests.",
        "cons_of_this": "Inferential statistics foundations covers the full hypothesis-testing machinery (p-values, CIs, type I/II error, power) that the normal distribution underlies; the two entries are companion pieces with this one providing the distributional substrate.",
        "when_to_prefer": "Read this entry for the distributional and CLT foundations; read inferential-statistics- foundations for the inference machinery (CI construction, hypothesis testing, power) that rests on those foundations."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "At the sample sizes typical of claims analyses (n > 50,000 per arm), the CLT guarantees that normal-based confidence intervals for mean costs and utilization are asymptotically valid. Raw cost QQ plots will show severe right skew — this is expected and does not invalidate mean inference, but does drive the choice of gamma GLM vs OLS for modeling the full distribution. Rare adverse event rates with low expected counts require exact Poisson-based methods, not normal approximations.",
      "ehr": "Growth z-scores (HAZ, WAZ, WHZ) computed from WHO references are the primary endpoint for pediatric EHR cohorts; they are normally distributed by construction and normal-based inference applies at any n. For continuous biomarker outcomes (eGFR, HbA1c, blood pressure), use QQ plots on regression residuals to assess whether the normality assumption underlying OLS confidence intervals is reasonable. Multi-site EHR networks: site-level variation in means requires mixed models, not a pooled normal approximation.",
      "registry": "Disease registries tracking pediatric rare disease growth use WHO growth z-scores directly; check for floor or ceiling effects in the observed z-score distribution. For PRO instruments on bounded ordinal scales, z-scores are meaningful only if the scale has approximately equal intervals; prefer median and IQR for purely ordinal instruments.",
      "primary": "Small pilot studies (n < 50 per arm) are the regime where CLT approximations are most questionable for skewed outcomes; pre-specify the normality assumption and its nonparametric backup in the analysis plan. For growth outcomes in interventional studies, the z-score is normal by design and the normal-based analysis is valid at any n.",
      "linked": "Linked claims-EHR-registry cohorts typically have large n, making CLT-protected inference on means valid across all common RWE endpoints. Report both the CLT-justified normal-based inference on means and distributional modeling (gamma GLM, log-normal) for costs as co-primary analyses; divergence reveals sensitivity to the modeling assumption."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\nfrom scipy import stats\n\n# ── Z-score and standard normal lookups ────────────────────────────────────\nmu, sigma = 100.0, 15.0    # reference population parameters\nx_obs = 130.0              # observed patient value\n\nz = (x_obs - mu) / sigma   # z = (130 - 100) / 15 = 2.0\nprint(f\"z-score: ({x_obs} - {mu}) / {sigma} = {z:.1f}\")\n\np_below = stats.norm.cdf(z)   # P(Z < 2.0) from standard normal CDF\nprint(f\"Percentile: P(Z < {z:.1f}) = {p_below:.4f}  ({p_below*100:.1f}th percentile)\")\n\n# pdf: height of the bell curve at z = 2.0\nprint(f\"pdf at z=2.0: {stats.norm.pdf(z):.4f}\")\n\n# ppf (quantile function): what z gives the 97.5th percentile?\nprint(f\"97.5th percentile z-score: {stats.norm.ppf(0.975):.4f}  (= 1.96)\")\n\n# 68-95-99.7 rule as exact probabilities\nfor n_sd in [1, 2, 3]:\n    lo, hi = mu - n_sd * sigma, mu + n_sd * sigma\n    p_inside = stats.norm.cdf(hi, loc=mu, scale=sigma) - stats.norm.cdf(lo, loc=mu, scale=sigma)\n    print(f\"  [{mu} +/- {n_sd}*{sigma}] = [{lo:.0f}, {hi:.0f}] contains {p_inside*100:.2f}%\")\n\n# ── SE shrink: n=9, SE = sigma/sqrt(n) = 15/3 = 5 ─────────────────────────\nn = 9\nse = sigma / np.sqrt(n)   # 15 / 3 = 5.0\nprint(f\"\\nSE = {sigma} / sqrt({n}) = {sigma} / {int(np.sqrt(n))} = {se:.1f}\")\nprint(f\"  SD of raw data: {sigma} (unchanged by n)\")\nprint(f\"  SE of mean for n={n}: {se:.1f} (shrinks; SD/sqrt(n))\")\nci_lo = mu - 1.96 * se\nci_hi = mu + 1.96 * se\nprint(f\"  95% CI for sample mean: [{ci_lo:.2f}, {ci_hi:.2f}]  (margin = {1.96*se:.2f})\")\n\n# ── CLT simulation: exponential population -> sample means converge to N ───\n# Exponential is strongly right-skewed (skewness = 2); shows CLT in action\nrng = np.random.default_rng(42)\npop = rng.exponential(scale=1.0, size=200_000)\npop_skew = stats.skew(pop)\nprint(f\"\\nPopulation (exponential): skewness = {pop_skew:.2f}  (strongly right-skewed)\")\n\nfor n_sample in [2, 10, 30, 100]:\n    sample_means = 
np.array([rng.exponential(scale=1.0, size=n_sample).mean()\n                             for _ in range(5_000)])\n    skewness = stats.skew(sample_means)\n    # Shapiro-Wilk on the first 1000 means (scipy's p-value is only reliable up to N = 5000,\n    # and very large N flags trivial departures); near-normal <-> p > 0.05\n    _, p_sw = stats.shapiro(sample_means[:1000])\n    label = \"near-normal\" if p_sw > 0.05 else \"still skewed\"\n    print(f\"  n={n_sample:4d}: mean skewness = {skewness:+.3f},  Shapiro-Wilk p = {p_sw:.4f}  ({label})\")\n\n# ── QQ plot (requires matplotlib; shows visual normality check) ─────────────\ntry:\n    import matplotlib\n    matplotlib.use(\"Agg\")\n    import matplotlib.pyplot as plt\n\n    fig, axes = plt.subplots(1, 2, figsize=(10, 4))\n\n    # Raw exponential data: QQ shows a convex bow (right skew, non-normal)\n    stats.probplot(pop[:1000], dist=\"norm\", plot=axes[0])\n    axes[0].set_title(\"QQ: exponential raw data\\n(convex bow = right-skewed, not normal)\")\n\n    # Sample means with n=30: near-straight line (CLT has converged)\n    means_30 = np.array([rng.exponential(scale=1.0, size=30).mean()\n                          for _ in range(1000)])\n    stats.probplot(means_30, dist=\"norm\", plot=axes[1])\n    axes[1].set_title(\"QQ: means of n=30 samples\\n(straight line = CLT has converged)\")\n\n    plt.tight_layout()\n    plt.savefig(\"qq_clt_demo.png\", dpi=100)\n    print(\"\\nQQ plots saved to qq_clt_demo.png\")\nexcept ImportError:\n    print(\"\\n(matplotlib not available; QQ plots skipped)\")",
        "description": "Standard normal lookups using scipy.stats.norm (pdf, cdf, ppf), z-score computation from\nthe worked example (mean 100, SD 15, value 130 -> z = 2.0), SE shrink (SE = 15/sqrt(9) = 5),\nthe 68-95-99.7 rule, a CLT simulation demonstrating convergence of sample means from an\nexponential (right-skewed) population, and a QQ plot comparing raw data vs sample means.",
        "dependencies": [
          "numpy",
          "scipy"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "# ── Z-score and standard normal (dnorm / pnorm / qnorm) ─────────────────────\nmu <- 100.0; sigma <- 15.0; x_obs <- 130.0\n\nz <- (x_obs - mu) / sigma        # z = (130 - 100) / 15 = 2.0\ncat(sprintf(\"z-score: (%.0f - %.0f) / %.0f = %.1f\\n\", x_obs, mu, sigma, z))\n\np_below <- pnorm(z)              # P(Z < 2.0); equivalently pnorm(130, mean=100, sd=15)\ncat(sprintf(\"Percentile: P(Z < %.1f) = %.4f  (%.1f-th percentile)\\n\",\n            z, p_below, p_below * 100))\n\ncat(sprintf(\"pdf at z=2.0: %.4f\\n\", dnorm(z)))\ncat(sprintf(\"97.5th percentile z-score: %.4f  (= 1.96)\\n\", qnorm(0.975)))\n\n# 68-95-99.7 rule\nfor (n_sd in 1:3) {\n  lo <- mu - n_sd * sigma; hi <- mu + n_sd * sigma\n  p_inside <- pnorm(hi, mean = mu, sd = sigma) - pnorm(lo, mean = mu, sd = sigma)\n  cat(sprintf(\"  [%g +/- %d*%g] = [%g, %g] contains %.2f%%\\n\",\n              mu, n_sd, sigma, lo, hi, p_inside * 100))\n}\n\n# ── SE shrink: n=9, SE = sigma/sqrt(n) = 15/3 = 5 ──────────────────────────\nn  <- 9L\nse <- sigma / sqrt(n)            # 15 / 3 = 5.0\ncat(sprintf(\"\\nSE = %.0f / sqrt(%d) = %.0f / %.0f = %.1f\\n\",\n            sigma, n, sigma, sqrt(n), se))\ncat(sprintf(\"  SD of raw data: %.0f  (unchanged by n)\\n\", sigma))\ncat(sprintf(\"  SE of mean for n=%d: %.1f  (shrinks; SD/sqrt(n))\\n\", n, se))\nci_lo <- mu - 1.96 * se; ci_hi <- mu + 1.96 * se\ncat(sprintf(\"  95%% CI for sample mean: [%.2f, %.2f]  (margin = %.2f)\\n\",\n            ci_lo, ci_hi, 1.96 * se))\n\n# ── CLT simulation: exponential population -> sample means converge to N ────\nset.seed(42)\npop <- rexp(200000, rate = 1)    # strongly right-skewed (skewness = 2)\n\n# skewness function: use e1071 if available, else compute manually\nsk_fn <- function(x) {\n  m <- mean(x); s <- sd(x)\n  mean(((x - m) / s)^3)\n}\n\ncat(sprintf(\"\\nPopulation (exponential): skewness = %.2f  (strongly right-skewed)\\n\",\n            sk_fn(pop)))\n\nfor (n_sample in c(2L, 10L, 30L, 100L)) {\n  sample_means <- 
replicate(5000, mean(rexp(n_sample, rate = 1)))\n  skewness_sm  <- sk_fn(sample_means)\n  # Shapiro-Wilk limited to n=5000; use first 1000 for consistency\n  sw_p         <- shapiro.test(sample_means[1:1000])$p.value\n  label        <- if (sw_p > 0.05) \"near-normal\" else \"still skewed\"\n  cat(sprintf(\"  n=%4d: mean skewness = %+.3f,  Shapiro-Wilk p = %.4f  (%s)\\n\",\n              n_sample, skewness_sm, sw_p, label))\n}\n\n# ── QQ plots in base R ──────────────────────────────────────────────────────\nopar <- par(mfrow = c(1, 2), mar = c(4, 4, 3, 1))\n\n# Raw exponential data: convex (bowed) curvature shows right skew\nqqnorm(pop[1:1000], main = \"QQ: exponential raw data\\n(curvature = not normal)\",\n       pch = 1, cex = 0.5, col = \"grey40\")\nqqline(pop[1:1000], col = \"red\", lwd = 2)\n\n# Means of n=30 samples: near-straight line (CLT has converged)\nmeans_30 <- replicate(1000, mean(rexp(30, rate = 1)))\nqqnorm(means_30, main = \"QQ: means of n=30 samples\\n(straight line = CLT converged)\",\n       pch = 16, cex = 0.6, col = \"steelblue\")\nqqline(means_30, col = \"darkblue\", lwd = 2)\n\npar(opar)\ncat(\"\\nQQ plots displayed. Straight line = normal; curvature = skewed or heavy-tailed.\\n\")\ncat(\"Use QQ plots for normality assessment, NOT Shapiro-Wilk p-values.\\n\")",
        "description": "Standard normal functions (dnorm, pnorm, qnorm), z-score and SE computation from the\nworked example, the 68-95-99.7 rule, a CLT simulation using replicate() on an exponential\npopulation, and QQ plots via qqnorm/qqline in base R to demonstrate convergence. No\nexternal package dependencies beyond base R (e1071 is used for skewness if available\nand falls back to a manual computation otherwise).",
        "dependencies": [
          "stats"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* ── Z-score and standard normal probabilities ─────────────────────────── */\ndata _null_;\n  mu    = 100.0;  sigma = 15.0;  x_obs = 130.0;\n\n  z       = (x_obs - mu) / sigma;    /* z = (130 - 100) / 15 = 2.0       */\n  p_below = probnorm(z);             /* P(Z <= z) from standard normal CDF */\n  put \"z-score = \" z;               /* 2.0                                */\n  put \"P(Z < 2.0) = \" p_below;     /* 0.9772 -> 97.72nd percentile       */\n\n  /* 68-95-99.7 rule */\n  do n_sd = 1 to 3;\n    lo      = mu - n_sd * sigma;\n    hi      = mu + n_sd * sigma;\n    p_in    = probnorm((hi - mu) / sigma) - probnorm((lo - mu) / sigma);\n    put \"  n_sd=\" n_sd \" interval=[\" lo \",\" hi \"] coverage=\" p_in;\n  end;\n\n  /* SE shrink: n=9, SE = sigma/sqrt(n) = 15/3 = 5 */\n  n   = 9;\n  se  = sigma / sqrt(n);            /* 15 / 3 = 5.0                       */\n  put \"SE = \" se;                   /* 5.0                                */\n  ci_lo = mu - 1.96 * se;\n  ci_hi = mu + 1.96 * se;\n  put \"95% CI for sample mean: [\" ci_lo \",\" ci_hi \"]\";\n  put \"SD = \" sigma \" (unchanged by n); SE = \" se \" (shrinks as 1/sqrt(n))\";\nrun;\n\n/* ── CLT simulation: exponential population -> sample means converge to N ─ */\n%macro clt_sim(n_sample=, reps=5000, seed=42);\n  data work.means_&n_sample;\n    call streaminit(&seed);\n    do rep = 1 to &reps;\n      s = 0;\n      do i = 1 to &n_sample;\n        s + rand(\"Exponential\");    /* rate=1, mean=1, skewness=2         */\n      end;\n      sample_mean = s / &n_sample;\n      output;\n      keep rep sample_mean;\n    end;\n  run;\n\n  proc univariate data=work.means_&n_sample normal noprint;\n    var sample_mean;\n    output out=work.stats_&n_sample skewness=sk w_stat=w p_value=p_norm;\n  run;\n\n  data _null_;\n    set work.stats_&n_sample;\n    label_str = ifc(p_norm > 0.05, \"near-normal\", \"still skewed\");\n    put \"n=&n_sample : skewness=\" sk 8.3\n        \" Shapiro-Wilk p=\" 
p_norm 8.4\n        \" -> \" label_str;\n  run;\n%mend clt_sim;\n\n%clt_sim(n_sample=2,   reps=5000, seed=42);\n%clt_sim(n_sample=10,  reps=5000, seed=42);\n%clt_sim(n_sample=30,  reps=5000, seed=42);\n%clt_sim(n_sample=100, reps=5000, seed=42);\n\n/* ── QQ plot via PROC UNIVARIATE with the 9-infant worked example data ── */\ndata work.growth9;\n  input infant_id $ growth_index;\n  datalines;\nP1  85\nP2  90\nP3  95\nP4  95\nP5 100\nP6 100\nP7 105\nP8 115\nP9 115\n;\nrun;\n\nproc univariate data=work.growth9 normal;\n  var growth_index;\n  qqplot growth_index / normal(mu=est sigma=est) square;\n  /*\n     NORMAL option: outputs Shapiro-Wilk (for n <= 2000),\n     Kolmogorov-Smirnov, Cramer-von Mises, and Anderson-Darling\n     tests with p-values.\n     CAUTION: At n=9 these tests are severely underpowered and will\n     almost always fail to reject normality regardless of the true\n     distribution. Use the QQPLOT visual, not the test p-values,\n     as the primary normality assessment tool.\n     At large n (>500): formal tests reject trivial departures that\n     the CLT renders irrelevant for mean inference.\n  */\n  title \"QQ plot for 9-infant growth index (n=9): visual normality check\";\n  title2 \"Formal test p-values unreliable at small n -- interpret the QQ plot\";\nrun;\n\n/* Bonus: simulate a large-n QQ on CLT-converged sample means (n=30) */\nproc univariate data=work.means_30 normal;\n  var sample_mean;\n  qqplot sample_mean / normal(mu=est sigma=est) square;\n  title \"QQ plot: means of n=30 exponential samples (CLT converged -> straight line)\";\nrun;",
        "description": "DATA step computation of z-scores, standard normal probabilities (probnorm), the\n68-95-99.7 rule, and SE shrink from the worked example. A macro-based CLT simulation\nuses rand('Exponential') to generate sample means at increasing n and reports skewness\nvia PROC UNIVARIATE. PROC UNIVARIATE with the NORMAL option and QQPLOT statement\ndemonstrates the QQ plot and formal normality tests with the explicit caveat that the\nQQ plot (not the test p-value) is the appropriate decision tool.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  A[Observed value X<br/>Reference mean mu, SD sigma] --> B[\"z-score = (X - mu) / sigma\"]\n  B --> C[\"P(Z < z) from standard normal CDF<br/>(percentile of reference population)\"]\n  B --> D[\"68-95-99.7 rule:<br/>95% of reference lies in<br/>mu +/- 1.96*sigma\"]\n  E[Sample of n observations<br/>SD sigma] --> F[\"SE = sigma / sqrt(n)<br/>SE SHRINKS with n<br/>SD does NOT\"]\n  F --> G[\"95% CI for mean:<br/>X-bar +/- 1.96 * SE\"]\n  H[\"Raw data distribution<br/>(may be skewed)\"] --> I{\"Is n large?\"}\n  I --> |\"Yes (CLT operating)\"| J[\"Sampling distribution<br/>of the mean is normal<br/>-> t-test / OLS valid\"]\n  I --> |\"No (n < 30-50, skewed)\"| K[\"Use nonparametric tests<br/>or distributional model<br/>(log-normal, gamma)\"]",
        "caption": "The normal distribution toolkit: z-score standardization and percentile lookup (top left), SE shrink showing that SE = SD/sqrt(n) decreases with n while SD stays constant (top right), and the CLT decision fork determining whether normal-based inference is valid for a given sample size and degree of skew.",
        "alt_text": "Flowchart with three branches: z-score to percentile lookup, SE shrink formula showing SD constant and SE shrinking, and the CLT fork between large-n (normal inference valid) and small-n with skewed data (nonparametric alternatives needed).",
        "source_type": "illustrative",
        "source_citations": [
          "kwak-kim-2017"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "requires",
        "target_slug": "descriptive-statistics",
        "notes": "The mean, standard deviation, and distributional shape assessed by descriptive statistics are the direct inputs to z-score computation, CLT convergence assessment, and the choice between normal-based and alternative inference methods."
      },
      {
        "relation_type": "see_also",
        "target_slug": "inferential-statistics-foundations",
        "notes": "Inferential statistics foundations covers the CI, p-value, and hypothesis-testing machinery that rests on the normal approximations derived here; the CLT is the bridge connecting the distributional theory in this entry to the operational inferential tools in that one."
      },
      {
        "relation_type": "see_also",
        "target_slug": "log-normal-distribution",
        "notes": "The log-normal is the skewed-positive counterpart to the normal: when log(Y) is normal, Y is log-normal; healthcare costs and LOS that are not normal at the individual level are often well-described by the log-normal, whose arithmetic mean estimation requires the retransformation and smearing steps described in that entry."
      },
      {
        "relation_type": "see_also",
        "target_slug": "two-sample-t-test",
        "notes": "The two-sample t-test is the canonical application of CLT-backed inference on means; the t-statistic is the z-score of the mean difference scaled by the SE of the difference, and its validity at large n is precisely the CLT guarantee described in this entry."
      },
      {
        "relation_type": "see_also",
        "target_slug": "pediatric-growth-development-endpoints-rwe",
        "notes": "WHO growth z-scores (HAZ, WAZ, WHZ) are the canonical RWE example of z-score standardization: a normally distributed endpoint by construction that allows direct t-test and regression inference on growth outcomes in pediatric RWE studies."
      }
    ],
    "aliases": [
      "Gaussian distribution",
      "bell curve",
      "CLT",
      "central limit theorem",
      "z-score",
      "standard normal distribution"
    ],
    "complexity": "foundational",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "npi-national-provider-identifier",
    "name": "NPI (National Provider Identifier)",
    "short_definition": "A mandatory, permanent, 10-digit \"intelligence-free\" numeric identifier assigned to every covered health-care provider in the United States under HIPAA, serving as the universal provider key that links claims, NPPES enrollment records, prescribing data, and open-payments files in real-world evidence and health-economics research.",
    "long_description": "The **National Provider Identifier (NPI)** is the single, nationwide standard identifier for\ncovered health-care providers mandated by the Health Insurance Portability and Accountability\nAct (HIPAA) Administrative Simplification provisions. CMS began enumerating NPIs through the\nNational Plan and Provider Enumeration System (NPPES) in 2004, and the compliance deadline for\nmost covered entities was May 2007. Every covered provider — individual clinicians, group\npractices, hospitals, laboratories, pharmacies, and hundreds of other entity types — must\nobtain an NPI to submit claims or conduct electronic administrative transactions covered by\nHIPAA. The system currently holds more than five million active NPI records.\n\n**Structure and design philosophy.** The NPI is deliberately *intelligence-free*: no embedded\nmeaning encodes state, specialty, organization type, or practice location. Digits 1–9 are\nassigned sequentially by NPPES; digit 10 is a Luhn check digit computed by prepending the\nissuer prefix 80840 (designating US health-care applications under the international ISO/IEC\n7812 card-numbering standard) to the nine assigned digits, then applying the standard Luhn\nalgorithm. The intelligence-free design was intentional: legacy identifiers such as the Unique\nPhysician Identification Number (UPIN), Drug Enforcement Administration (DEA) number, and\npayer-assigned provider numbers encoded specialty or location in their structure, creating\nidentification fragmentation and requiring provider-specific crosswalks for each payer. NPI\neliminates those silos — the same 10 digits identify a provider to any covered entity.\n\n**Type 1 vs Type 2 NPI.** NPPES distinguishes two enumeration types. A **Type 1 NPI** is\nissued to an *individual* health-care provider — a physician, nurse practitioner, physical\ntherapist, pharmacist, or any other natural person who renders or furnishes health-care\nservices. 
A **Type 2 NPI** is issued to a *health-care organization* — a hospital, clinic,\ngroup practice, home health agency, or other entity that furnishes health care through\nindividuals. A single physician who also owns a solo practice can hold *both* a Type 1 NPI\n(as the individual clinician) and a Type 2 NPI (for the practice entity). Large health\nsystems routinely hold dozens or hundreds of Type 2 NPIs — one per organizational subpart\n(hospital, outpatient clinic, ambulatory surgery center, specialty division) — making\nfacility-level aggregation the single most common NPI trap in RWE studies.\n\n**NPPES and the public data dissemination file.** NPPES is the federal registry that\nenumerates NPIs. The NPI Registry web portal (https://npiregistry.cms.hhs.gov) supports\nindividual lookups. More valuable for researchers is the **NPPES Data Dissemination** monthly\nfile: a complete public-domain snapshot of every active NPI record, downloadable from CMS,\ncontaining provider name, practice and mailing addresses, phone numbers, reported taxonomy\ncodes, enumeration date, and deactivation status (in a separate deactivation file). The\ndissemination file is public domain and free to use without restriction. It is the standard\nreference file for enriching claims or EHR data with provider specialty (taxonomy) and for\nconstructing provider-level crosswalk tables.\n\n**Taxonomy codes.** Each NPI record carries one or more **provider taxonomy codes** —\n10-character alphanumeric codes maintained by the National Uniform Claim Committee (NUCC) that\nclassify provider type and specialty (e.g., 207Q00000X = Family Medicine, 207R00000X = Internal\nMedicine, 282N00000X = General Acute Care Hospital). A provider may self-report multiple\ntaxonomies; one is flagged as the primary. Taxonomy codes are critical for specialty-based\nsub-grouping in pharmacoepidemiology (prescriber specialty studies, specialist vs primary-care\nattribution) and in workforce research. 
Critically, **taxonomy codes are self-reported and are\nnot equivalent to board certification**: a provider may report a taxonomy that differs from\ntheir actual clinical specialty, and no systematic validation mechanism compares NPPES taxonomy\nagainst credentialing records. Specialty misclassification arising from taxonomy code reliance\nis a recognized limitation in the provider attribution literature.\n\n**Where NPIs appear on claims.** Claims carry NPIs in multiple fields serving distinct provider\nroles, and *which NPI you select determines who gets credited for a service* — a choice that\nmaterially changes results in prescriber-attribution, surgical-outcome, and volume–outcome\nstudies:\n- **Professional (CMS-1500 / 837P):** Box 24J = rendering provider (who performed the\n  service); Box 33 = billing provider (who billed — often a group practice). The rendering\n  NPI is the right choice for most provider-attribution studies; using the billing NPI\n  attributes all services to the group practice entity rather than the individual clinician.\n- **Institutional (UB-04 / 837I):** Attending provider (FL 76), operating provider (FL 77),\n  other operating provider (FL 78), and service facility NPI. 
For surgical-outcome studies,\n  the operating-provider NPI is the correct attribution field; for readmission studies, the\n  attending-provider NPI is usually preferred.\n- **Pharmacy (NCPDP):** Prescriber NPI and dispensing pharmacy NPI appear in the pharmacy\n  claim; these are critical for prescriber-level drug-utilization studies and for identifying\n  pharmacy provider type.\n\n**RWE and HEOR applications.** The NPI serves as the provider-level key for a wide range of\nreal-world studies:\n- *Prescriber attribution*: linking drug fills to the prescribing clinician to study\n  prescribing patterns, guideline adherence, or off-label use.\n- *Surgeon and proceduralist attribution*: linking procedures to operating-provider NPI for\n  volume–outcome or learning-curve analyses.\n- *Provider-level clustering*: accounting for within-provider correlation when the same\n  provider treats many patients in the same cohort (shared frailty or provider fixed effects).\n- *Cross-database provider linkage*: joining claims to NPPES enrichment (specialty, address),\n  to CMS Open Payments (industry payments by provider NPI), to the CMS Medicare Provider\n  Utilization and Payment Data, and to state licensure or medical-board files.\n- *Practice-level aggregation*: grouping individual provider NPIs under a shared billing NPI\n  or organizational NPI to define the \"practice\" as the unit of analysis.\n\n**Pros, cons, and trade-offs.**\n- **vs legacy identifiers (UPIN, DEA, payer-assigned IDs):** UPIN was retired at the NPI\n  transition; crosswalks (e.g., Parsons et al. 2017, Medical Care) exist for legacy Medicare\n  claims that predate NPI compliance. DEA numbers persist in controlled-substance prescription\n  records but are not present on claims. NPI is the only identifier that spans all covered\n  entities, all payers, and all claim types since the 2007 compliance deadline. 
**Prefer NPI**\n  as the primary provider key for any study using post-2007 data; supplement with a UPIN\n  crosswalk for pre-2007 Medicare data.\n- **vs provider name matching:** Name-based linkage across databases is noisy (misspellings,\n  name changes, middle-initial variation, common surnames). NPI is exact-match — 10 digits,\n  no ambiguity. **Always prefer NPI** for cross-database provider linkage when both sources\n  carry NPI; fall back to probabilistic name/address matching only when NPI is absent from one\n  source.\n- **vs Tax Identification Number (TIN) for practice grouping:** TIN groups providers under a\n  billing entity and is widely used for practice-level analysis. However, TINs are not public\n  (they are protected PII in claims extracts), whereas the billing NPI (Type 2 or group\n  practice NPI) is observable and can proxy practice affiliation. Neither perfectly maps to the\n  clinical concept of a \"practice.\" TIN-based grouping typically produces tighter practice\n  clusters; NPI-based grouping is more portable but may over-split or under-aggregate.\n- **Limitation — registry staleness:** NPPES is self-maintained. Providers who retire,\n  relocate, or change specialty are not required to proactively update their record, and CMS\n  deactivation is reactive (based on claims inactivity or reports of death/license lapse).\n  Stale addresses and taxonomy codes are common; do not use NPPES address data as a proxy for\n  current practice location without checking activity recency in claims.\n- **Limitation — organizational NPI granularity:** There is no standardized rule governing how\n  deeply a health system must sub-enumerate Type 2 NPIs. One system may enumerate at the\n  enterprise level (one Type 2 NPI covering fifty sites); another at the facility level (one\n  per hospital building). 
This inconsistency makes facility-level volume aggregation\n  unreliable without supplementary data (CMS Certification Number for hospitals, facility\n  address crosswalk).\n- **Limitation — incident-to and locum billing:** Under Medicare incident-to billing rules, a\n  physician assistant's or nurse practitioner's service may be billed under the supervising\n  physician's NPI. Locum tenens physicians may bill under the absent physician's NPI. In both\n  cases, the rendering NPI does not identify the clinician who actually delivered the care —\n  a critical limitation for prescriber-attribution and provider-exposure studies.\n\n**When to use.**\n- As the universal key for any provider-linkage join across claims, NPPES, Open Payments,\n  Medicare utilization files, or state licensure data.\n- As the rendering-provider identifier for prescriber- or proceduralist-attribution in\n  pharmacoepidemiology, comparative effectiveness, and volume–outcome studies.\n- As a data-quality filter: an NPI failing the Luhn check digit is an invalid record and\n  should be flagged before any linkage step.\n- As the grouping key for provider-level clustering or fixed-effects analysis to account for\n  within-provider patient clustering.\n- In conjunction with the NPPES monthly dissemination file to attach specialty (taxonomy),\n  enumeration date, and entity type to any provider in a claims or EHR dataset.\n\n**When NOT to use — and when it is actively misleading or dangerous.**\n- **Do not use the billing NPI as the provider of service in individual-level attribution\n  studies.** The billing NPI identifies who billed (often a group practice or health system),\n  not who rendered the service. 
Using billing NPI for prescriber or surgeon attribution will\n  collapse all services in the practice to a single entity, eliminating provider-level\n  variation and rendering volume–outcome or learning-curve analyses nonsensical.\n- **Do not use NPI alone to define the \"practice\" without understanding organizational\n  sub-enumeration.** Two hospitals from the same system that share a single Type 2 NPI\n  cannot be distinguished from each other; two that hold separate Type 2 NPIs look like\n  different practices. Aggregating by Type 2 NPI produces incomparable granularity across\n  systems and may substantially misclassify the practice unit.\n- **Do not treat NPPES taxonomy code as a validated specialty for exposure or confounder\n  classification without sensitivity analysis.** Taxonomy codes are self-reported and not\n  cross-validated against board certification or credentialing records. For studies where\n  specialty misclassification could bias the result materially (e.g., a study restricted to\n  cardiologists), supplement taxonomy-based classification with procedure-code or referral-\n  pattern validation.\n- **Do not use NPI-based provider identity when incident-to billing or locum arrangements\n  are common in the population studied.** In studies of mid-level practitioners (NP, PA),\n  incident-to billing significantly under-identifies the actual clinician. 
Design rules that\n  detect and exclude or flag incident-to claims (presence of a supervising provider NPI, claim\n  modifier codes) before making inferences about NP/PA practice.\n- **Do not extrapolate pre-2007 provider attribution using NPI without a validated UPIN/NPI\n  crosswalk.** NPI records for pre-compliance claims are sparse; studies spanning the\n  2005–2008 transition period require explicit handling of the identifier change to avoid\n  apparent provider turnover that is actually a field-switching artifact.\n\n**Data-source operational depth.**\n- **Claims (professional/CMS-1500):** The rendering-provider NPI in Box 24J is the correct\n  attribution field for individual clinicians. Always extract both the rendering NPI (Box 24J)\n  and the billing NPI (Box 33) and join each to NPPES separately: rendering NPI to the\n  individual-provider record; billing NPI to the organization record. Verify that the rendering\n  NPI is Type 1 and the billing NPI is Type 1 or Type 2, depending on whether the practice\n  bills under an individual or group enrollment. Stale taxonomy codes in NPPES are most\n  problematic for specialty classification in older cohort windows; use the most recent NPPES\n  dissemination snapshot that post-dates the study period as the reference file.\n- **Claims (institutional/UB-04):** Use the attending-provider NPI (FL 76) for admission- and\n  discharge-level analyses (readmission, LOS, mortality) and the operating-provider NPI (FL 77)\n  for surgical-procedure attribution. The billing-provider NPI (FL 56) identifies the billing\n  facility, not the treating clinician. For volume–outcome analyses, the operating-provider NPI\n  is the unit of analysis; cluster standard errors at the operating-provider level.\n- **EHR:** EHR systems typically store the NPI of the ordering/attending provider in structured\n  fields alongside the encounter. 
Verify that the EHR NPI matches the claims rendering NPI for\n  the same encounter using a validation subset before relying on EHR-derived provider identity\n  for linkage. EHR-based NPI is usually cleaner for ambulatory visits than claims-based NPI\n  because it is recorded at the point of care rather than inferred from billing.\n- **Registry:** Disease registries (SEER, cancer registries, trauma registries) increasingly\n  include the treating-provider NPI as a linkage key. SEER-Medicare NPI linkages permit surgeon\n  volume–outcome and oncologist specialty studies. Verify linkage rates and assess whether\n  unlinked records differ systematically by provider type or volume.\n- **Linked (Open Payments / CMS utilization):** CMS publishes Open Payments (industry transfers\n  by NPI), the Medicare Physician and Other Practitioners utilization file (service counts by NPI\n  and HCPCS), and the Medicare Part D Prescriber file (drug prescribing by NPI). All three join\n  on NPI directly. These linkages support prescriber-industry relationship studies, off-label\n  prescribing analyses, and conflict-of-interest confounding adjustment.",
    "primary_category": "Data_Standard",
    "tags": [
      "coding-system",
      "data-standard",
      "primitive",
      "provider",
      "linkage",
      "claims",
      "nppes",
      "hipaa",
      "taxonomy-code",
      "administrative-simplification"
    ],
    "applies_to_study_types": [
      "claims_analysis",
      "ehr_study",
      "linked_data_study",
      "prescriber_attribution",
      "provider_attribution",
      "volume_outcome",
      "pharmacoepidemiology"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.5600/mmrr.003.03.b03",
        "url": "https://doi.org/10.5600/mmrr.003.03.b03",
        "citation_text": "Bindman AB. Using the National Provider Identifier for Health Care Workforce Evaluation. Medicare & Medicaid Research Review. 2013;3(3):E1-E8.",
        "year": 2013,
        "authors_short": "Bindman",
        "notes": "Foundational methodological paper demonstrating how the NPI and NPPES dissemination file can be used to characterize the health-care workforce, classify provider specialty via taxonomy codes, and track workforce trends across time — establishing NPI as a research instrument, not just an administrative identifier."
      },
      {
        "role": "explain",
        "doi": "10.1097/mlr.0000000000000462",
        "url": "https://doi.org/10.1097/mlr.0000000000000462",
        "citation_text": "Parsons HM, Enewold LR, Banks R, Barrett MJ, Warren JL. Creating a National Provider Identifier (NPI) to Unique Physician Identification Number (UPIN) Crosswalk for Medicare Data. Medical Care. 2017;55(12):e156-e163.",
        "year": 2017,
        "authors_short": "Parsons et al.",
        "notes": "Describes construction and validation of the NPI–UPIN crosswalk required for longitudinal Medicare studies that span the pre-2007 UPIN era and post-2007 NPI era; the paper also documents transition-era data quality issues (NPIs absent from claims, UPIN retirement timing) that remain relevant for studies using historical Medicare data."
      },
      {
        "role": "demonstrate",
        "doi": "10.1186/s12913-024-11080-2",
        "url": "https://doi.org/10.1186/s12913-024-11080-2",
        "citation_text": "Deng S, Renaud S, Bennett KJ. Who is your prenatal care provider? An algorithm to identify the predominant prenatal care provider with claims data. BMC Health Services Research. 2024;24:683.",
        "year": 2024,
        "authors_short": "Deng et al.",
        "notes": "A practical example of NPI-based provider attribution using claims data: the study develops and validates an algorithm to assign a predominant rendering-provider NPI to each patient episode, illustrating both the power of NPI-based linkage and the ambiguity that arises when multiple rendering providers share an episode — a challenge common in obstetrics, oncology, and hospitalist medicine."
      },
      {
        "role": "use",
        "doi": "10.9778/cmajo.20150005",
        "url": "https://doi.org/10.9778/cmajo.20150005",
        "citation_text": "Barbera L, Hwee J, Klinger C, Jembere N, Seow H, Pereira J. Identification of the physician workforce providing palliative care in Ontario using administrative claims data. CMAJ Open. 2015;3(3):E266-E275.",
        "year": 2015,
        "authors_short": "Barbera et al.",
        "notes": "Illustrates provider identification using claims-derived provider identifiers (Canadian analogue of NPI) and taxonomy/specialty codes to define a specialist workforce from administrative data — methodologically parallel to NPPES taxonomy-based specialty classification in US claims research, including the same limitation that specialty codes are self-reported and require validation against referral and procedure patterns."
      }
    ],
    "plain_language_summary": "An NPI is a permanent 10-digit number that every U.S. health-care provider — a doctor, a hospital, a pharmacy, a clinic — must have to bill insurance for care. Think of it as a universal employee ID for the health system: the same number appears on every claim a provider submits to every insurer. Researchers use it as the key to link provider information across databases — for example, joining a drug prescription to the doctor who wrote it, or comparing surgical outcomes across surgeons. The catch is that hospitals can have many NPIs and the number on a claim does not always identify the individual who actually delivered the care.",
    "key_terms": [
      {
        "term": "Type 1 vs Type 2 NPI",
        "definition": "A Type 1 NPI belongs to an individual clinician (a person); a Type 2 NPI belongs to a health-care organization such as a hospital or group practice. One doctor can hold both."
      },
      {
        "term": "NPPES",
        "definition": "The National Plan and Provider Enumeration System — the federal database run by CMS that issues NPIs and stores each provider's name, address, and specialty codes in a public file updated monthly."
      },
      {
        "term": "taxonomy code",
        "definition": "A 10-character code that describes a provider's type and specialty (for example, family medicine or general surgery); providers self-report these codes and they are not verified against board-certification records."
      },
      {
        "term": "rendering vs billing provider",
        "definition": "The rendering provider is the individual who actually delivered the service; the billing provider is the entity (often a group practice) that submitted the claim. Different NPI fields on the claim capture each role."
      },
      {
        "term": "deactivation",
        "definition": "When CMS removes an NPI from active status because the provider retired, died, or no longer meets enrollment criteria; deactivated NPIs are published in a separate file and should not be treated as valid active providers."
      },
      {
        "term": "Luhn check digit",
        "definition": "The 10th digit of every NPI, computed by a standard algorithm called Luhn that lets a computer instantly detect whether a 10-digit string is a plausible NPI or a data-entry error."
      }
    ],
    "worked_example": {
      "scenario": "A pharmacoepidemiology analyst is building a prescriber-attribution dataset for a study of statin prescribing in Medicare claims. Before joining any NPPES specialty data, the analyst writes a quality-check routine to confirm that every NPI in the pharmacy claims is structurally valid using the Luhn algorithm. The standard documentation example NPI is 1234567893. The analyst wants to verify that this NPI passes the check-digit test step by step.",
      "dataset": {
        "caption": "The NPI being validated and the intermediate values produced at each Luhn algorithm step. The algorithm prepends the 5-digit issuer prefix 80840 to the 9-digit NPI prefix, producing a 14-digit working string, then computes the check digit.",
        "columns": [
          "step",
          "value",
          "explanation"
        ],
        "rows": [
          [
            "Input NPI",
            "1234567893",
            "10-digit NPI to validate"
          ],
          [
            "9-digit prefix",
            "123456789",
            "Digits 1-9; digit 10 (= 3) is the check digit to verify"
          ],
          [
            "Issuer prefix",
            "80840",
            "Fixed prefix for US health-care applications (ISO/IEC 7812)"
          ],
          [
            "15-digit working string",
            "808401234567893",
            "80840 + 123456789 + check_digit_3"
          ],
          [
            "Luhn sum",
            "70",
            "Sum of all 15 digit contributions after doubling; 70 mod 10 = 0 -> valid"
          ]
        ]
      },
      "steps": [
        "Build the 15-digit working string by appending all 10 NPI digits to the issuer prefix: \"80840\" followed by \"1234567893\" gives the working string 808401234567893 (15 digits total).",
        "Apply the Luhn rule: starting from the rightmost digit and moving left, keep every odd-position digit (positions 1, 3, 5, ... from the right) unchanged, and double every even-position digit (positions 2, 4, 6, ... from the right). If doubling produces a result greater than 9, subtract 9 from that result to get the contribution.",
        "Work through 808401234567893 right to left, computing each contribution. Odd positions (kept): pos 1 digit 3 contributes 3; pos 3 digit 8 contributes 8; pos 5 digit 6 contributes 6; pos 7 digit 4 contributes 4; pos 9 digit 2 contributes 2; pos 11 digit 0 contributes 0; pos 13 digit 8 contributes 8; pos 15 digit 8 contributes 8. Even positions (doubled, subtract 9 if over 9): pos 2 digit 9 doubles to 18, subtract 9, contributes 9; pos 4 digit 7 doubles to 14, subtract 9, contributes 5; pos 6 digit 5 doubles to 10, subtract 9, contributes 1; pos 8 digit 3 doubles to 6, contributes 6; pos 10 digit 1 doubles to 2, contributes 2; pos 12 digit 4 doubles to 8, contributes 8; pos 14 digit 0 doubles to 0, contributes 0.",
        "Sum all 15 contributions: 3+9+8+5+6+1+4+6+2+2+0+8+8+0+8 = 70.",
        "Apply the validity rule: 70 divided by 10 leaves remainder 0. A Luhn sum divisible by 10 confirms the check digit is correct and the NPI is structurally valid. Any NPI whose Luhn sum is not divisible by 10 is a transcription error or fabricated value and must be flagged and excluded before any NPPES join.",
        "To compute the check digit from scratch (rather than verify it), use only the 14-digit prefix 80840123456789. Double even-from-right positions within that 14-digit string. The contributions from those 14 digits sum to 67. The check digit equals (10 minus 7) which is 3, matching the 10th digit of NPI 1234567893 exactly."
      ],
      "result": "NPI 1234567893 passes the Luhn check. Contributions 3+9+8+5+6+1+4+6+2+2+0+8+8+0+8 = 70, and 70 is divisible by 10 (valid). The check digit was independently derived as 10 minus 7 = 3, confirming digit 10. Any NPI that does not produce a Luhn sum divisible by 10 is invalid and must be excluded before any NPPES join."
    },
    "prerequisites": [],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Rendering NPI attribution (individual-level)",
        "description": "Use the rendering provider NPI from Box 24J (professional) or FL 77 (institutional) as the unit of attribution for individual-clinician studies. This is the standard approach for prescriber-attribution, surgeon-volume, and provider-practice-pattern studies. The rendering NPI must be Type 1 (individual); a Type 2 NPI in Box 24J indicates group billing and is not attributable to a single clinician.",
        "edge_cases": [
          "Incident-to billing in professional claims: NP/PA services billed under the supervising physician's NPI. The rendering NPI underestimates mid-level practitioner activity; detect using claim modifier codes (e.g., CMS modifier SA, or absent rendering NPI with a billing NPI that is a physician).",
          "Locum tenens: the absent physician's NPI may appear in the rendering field when a locum covers. This artificially inflates the absent physician's volume; detect using claims continuity patterns (volume spikes during known leave periods)."
        ],
        "data_source_notes": "Professional claims: rendering NPI in Box 24J. Institutional claims: operating provider in FL 77 for surgical procedures, attending provider in FL 76 for admission/discharge analysis. Pharmacy claims: prescriber NPI in NCPDP field 411-DB."
      },
      {
        "name": "Billing / organizational NPI attribution (practice-level)",
        "description": "Use the billing NPI (Box 33 in professional claims; FL 82 in institutional claims) to attribute services to a practice or facility entity. This is appropriate for practice-level outcomes research, facility-level volume studies where individual-clinician identity is not needed, and supply-chain / pharmacy network analyses using the dispensing pharmacy NPI.",
        "edge_cases": [
          "Organizational NPI granularity varies: one health system may enumerate a single Type 2 NPI for all outpatient clinics; another may enumerate one per building. Volume comparisons across systems using billing NPI may compare incomparable entities.",
          "TIN-based practice grouping (where TIN is available in commercial claims) typically produces tighter practice clusters than billing NPI; consider both and report sensitivity."
        ],
        "data_source_notes": "The NPPES monthly file can be filtered to Type 2 (organizational) NPI records and joined to the billing NPI field. Supplement with CMS Certification Number (CCN) for hospital identification and CMS Provider of Services (POS) file for facility characteristics."
      },
      {
        "name": "NPPES taxonomy-based specialty classification",
        "description": "Join the rendering NPI to the NPPES dissemination file and extract the primary taxonomy code to assign specialty. Group taxonomy codes into clinically meaningful specialty categories (e.g., mapping all internal-medicine subspecialties to a \"specialist\" indicator) using the NUCC taxonomy hierarchy. Use as a confounder or subgroup variable in prescriber or provider studies.",
        "edge_cases": [
          "Taxonomy code misclassification: providers self-report taxonomy; a general internist who developed a cardiology practice may still carry the internal-medicine taxonomy. Validate using procedure-code profiles (proportion of claims with cardiology procedures) as a sensitivity check.",
          "Multi-taxonomy records: a provider may report 3-5 taxonomies. The \"primary\" flag is self-designated and may not reflect current practice. For subspecialty classification, supplement with a procedure-based specialty assignment."
        ],
        "data_source_notes": "NPPES dissemination file columns: Healthcare Provider Taxonomy Code_1 through _15, with corresponding Is Primary Taxonomy Switch columns (Y/N). Join on NPI. Use the most recent monthly file that post-dates your study end date; pull the archive file for studies with historical index dates to capture taxonomy as it was reported then."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "linked-data",
        "pros_of_this": "NPI is a single deterministic join key that works across all covered entities and all payers without any probabilistic threshold or clerical review. A Luhn-valid NPI in both sources is an exact match by definition; no false-positive or false-negative rates to calibrate.",
        "cons_of_this": "NPI only identifies the provider, not the individual service-delivery episode. Linking NPPES-enriched claims to EHR requires additional join keys (date + facility + patient identifier). NPI also does not resolve the incident-to / locum problem — the NPI in the rendering field may still point to a different clinician than the one who delivered care.",
        "when_to_prefer": "Use NPI as the primary linkage key whenever the provider is the unit of analysis and both sources carry NPI. Fall back to probabilistic linkage only for sources that predate NPI compliance or lack an NPI field."
      },
      {
        "compared_to": "cms-1500-professional-claim-fields",
        "pros_of_this": "The NPI itself is the identifier; CMS-1500 Box 24J and Box 33 are simply the fields where it appears. Understanding which box to use (rendering vs billing) is the critical design decision — this entry provides that context.",
        "cons_of_this": "CMS-1500 carries many additional fields (place of service, procedure code, modifier, diagnosis pointer) that are needed to contextualize what the rendering provider did, and NPI alone does not convey service type.",
        "when_to_prefer": "Always use rendering NPI (Box 24J) for individual-provider attribution. Read CMS-1500 entry for the full claim-field context and for place-of-service and modifier coding."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Extract both the rendering NPI (Box 24J / professional; FL 77 / institutional) and the billing NPI (Box 33 / professional; FL 82 / institutional) as separate fields. Run the Luhn check on both before any join. Join rendering NPI to NPPES Type 1 records for individual specialty attribution; join billing NPI to NPPES Type 2 records for organizational context. Flag rendering NPIs that appear in the billing-provider position (a Type 2 NPI in 24J is a data-quality signal). Cross-reference deactivated-NPI file to exclude retired providers from active-provider denominators in workforce studies.",
      "ehr": "EHR systems store the attending/ordering provider NPI in structured encounter fields. Verify NPI format (10 digits, Luhn-valid) and that the NPI is active in NPPES before using it as a join key. EHR NPIs are generally more reliable than claims NPIs for ambulatory attribution because they are recorded at care delivery rather than inferred from billing. Reconcile EHR-provider NPI against claims-rendering NPI for the same encounters in a validation sample to quantify attribution agreement.",
      "registry": "Registries that include a treating-provider NPI (SEER, trauma registries) support direct NPI-level joins to claims or NPPES. Assess linkage rate and verify that unlinked records (where NPI is missing or invalid) do not differ systematically by tumor stage, injury severity, or provider volume. A missing NPI in a registry often indicates community-hospital or safety-net treatment where provider-level attribution is most uncertain.",
      "linked": "In linked datasets (SEER-Medicare, claims-EHR), reconcile the provider NPI across sources before analysis. Mismatched NPIs for the same episode may indicate billing intermediaries, coverage gaps, or data-source inconsistencies rather than true provider switches. Use the rendering NPI from the claims side as the primary attribution key; supplement with the registry or EHR NPI as a secondary identifier for validation."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "\"\"\"\nNPI data-quality tools for RWE claims and NPPES linkage.\n\nTwo utilities:\n  1. luhn_valid(npi) / validate_npis(df) — Luhn check-digit screen.\n  2. join_nppes(claims_df, nppes_path, deactivated_path=None) — attach taxonomy + type.\n\nThe Luhn algorithm for NPI:\n  Step 1: Prepend the 5-digit issuer prefix 80840 to the 9 NPI prefix digits.\n          Working string = \"80840\" + npi[:9]  (14 digits)\n  Step 2: In the 15-digit string (prefix + all 10 NPI digits), double every digit at\n          an even position counting from the right (positions 2, 4, 6, ...).\n          Equivalently, in the 14-digit working string, double every digit at an ODD\n          position counting from the right (positions 1, 3, 5, ...).\n          If the doubled value > 9, subtract 9.\n  Step 3: Sum all contributions from the 14-digit string.\n          check_digit = (10 - (total % 10)) % 10\n  Step 4: Compare computed check_digit to npi[9] (the 10th digit).\n\"\"\"\nimport re\nimport pandas as pd\nfrom pathlib import Path\n\n\n_NPI_DIGITS = re.compile(r\"^\\d{10}$\")\n_ISSUER_PREFIX = \"80840\"\n\n\ndef luhn_valid(npi: str) -> bool:\n    \"\"\"Return True if *npi* is a structurally valid 10-digit NPI.\n\n    Steps:\n      1. Must be exactly 10 ASCII digits.\n      2. Build the 14-digit working string: '80840' + npi[:9].\n      3. Double every digit at an odd position from the right (1-indexed)\n         in the working string; subtract 9 if > 9.\n      4. Sum all 14 contributions.\n      5. check_digit = (10 - total % 10) % 10\n      6. 
Valid iff check_digit == int(npi[9]).\n    \"\"\"\n    if not isinstance(npi, str) or not _NPI_DIGITS.match(npi):\n        return False\n    working = _ISSUER_PREFIX + npi[:9]   # 14 digits\n    total = 0\n    for i, ch in enumerate(reversed(working)):\n        d = int(ch)\n        # Positions from right are 1-indexed in the 15-digit final string.\n        # In the 14-digit working string, reversed index 0 corresponds to\n        # position 2 from right in the 15-digit string -> double.\n        if i % 2 == 0:   # positions 2, 4, 6, ... from right -> double\n            d *= 2\n            if d > 9:\n                d -= 9\n        total += d\n    check = (10 - total % 10) % 10\n    return check == int(npi[9])\n\n\ndef validate_npis(df: pd.DataFrame, npi_col: str = \"rendering_npi\") -> pd.DataFrame:\n    \"\"\"Add a boolean column '<npi_col>_valid' to *df*.\n\n    Any row where the NPI fails the Luhn check is a data-quality error\n    and should be excluded from NPPES joins or provider-attribution analyses.\n    \"\"\"\n    df = df.copy()\n    df[f\"{npi_col}_valid\"] = df[npi_col].astype(str).apply(luhn_valid)\n    n_invalid = (~df[f\"{npi_col}_valid\"]).sum()\n    if n_invalid:\n        print(f\"[NPI QC] {n_invalid:,} invalid NPIs in '{npi_col}' ({n_invalid/len(df)*100:.1f}%)\")\n    return df\n\n\ndef join_nppes(\n    claims_df: pd.DataFrame,\n    nppes_path: str | Path,\n    deactivated_path: str | Path | None = None,\n    npi_col: str = \"rendering_npi\",\n) -> pd.DataFrame:\n    \"\"\"Join NPPES dissemination file to claims on rendering NPI.\n\n    Attaches:\n      - provider_type:     '1' (individual) or '2' (organization)\n      - primary_taxonomy:  primary taxonomy code (10-char NUCC code)\n      - provider_name:     last + first name (Type 1) or org name (Type 2)\n      - npi_deactivated:   True if NPI appears in the deactivation file\n\n    NPPES dissemination file column names use the official CMS header names.\n    Download the monthly 
full-replacement file from:\n    https://download.cms.gov/nppes/NPI_Files.html\n\n    Parameters\n    ----------\n    claims_df : DataFrame with at least one NPI column.\n    nppes_path : Path to the NPPES full-replacement CSV (NPI_full_*.csv).\n    deactivated_path : Optional path to the deactivation CSV (NPPES_Deactivated_*.csv).\n    npi_col : Name of the NPI column in claims_df to join on.\n    \"\"\"\n    # Load only needed NPPES columns to limit memory usage\n    nppes_cols = [\n        \"NPI\",\n        \"Entity Type Code\",              # 1 = individual, 2 = organization\n        \"Healthcare Provider Taxonomy Code_1\",\n        \"Is Primary Taxonomy Switch_1\",\n        \"Provider Last Name (Legal Name)\",\n        \"Provider First Name\",\n        \"Provider Organization Name (Legal Business Name)\",\n    ]\n    nppes = pd.read_csv(\n        nppes_path,\n        usecols=lambda c: c in nppes_cols,\n        dtype=str,\n        low_memory=False,\n    )\n    nppes = nppes.rename(columns={\n        \"NPI\": \"npi\",\n        \"Entity Type Code\": \"provider_type\",\n        \"Healthcare Provider Taxonomy Code_1\": \"primary_taxonomy_raw\",\n        \"Is Primary Taxonomy Switch_1\": \"is_primary_flag\",\n        \"Provider Last Name (Legal Name)\": \"last_name\",\n        \"Provider First Name\": \"first_name\",\n        \"Provider Organization Name (Legal Business Name)\": \"org_name\",\n    })\n\n    # Use the primary taxonomy when flagged 'Y'; otherwise use Taxonomy_1 as fallback\n    nppes[\"primary_taxonomy\"] = nppes[\"primary_taxonomy_raw\"]\n\n    nppes[\"provider_name\"] = nppes.apply(\n        lambda r: (\n            f\"{r['last_name']}, {r['first_name']}\".strip(\", \")\n            if r[\"provider_type\"] == \"1\"\n            else r[\"org_name\"]\n        ),\n        axis=1,\n    )\n\n    nppes_slim = nppes[[\"npi\", \"provider_type\", \"primary_taxonomy\", \"provider_name\"]]\n\n    # Optional: flag deactivated NPIs\n    if deactivated_path is 
not None:\n        deact = pd.read_csv(deactivated_path, usecols=[\"NPI\"], dtype=str)\n        deact[\"npi_deactivated\"] = True\n        deact = deact.rename(columns={\"NPI\": \"npi\"})\n        nppes_slim = nppes_slim.merge(deact, on=\"npi\", how=\"left\")\n        nppes_slim[\"npi_deactivated\"] = nppes_slim[\"npi_deactivated\"].fillna(False)\n\n    # Join to claims\n    result = claims_df.merge(\n        nppes_slim,\n        left_on=npi_col,\n        right_on=\"npi\",\n        how=\"left\",\n    ).drop(columns=[\"npi\"])\n\n    # Report match rate\n    matched = result[\"provider_type\"].notna().sum()\n    print(f\"[NPPES join] {matched:,}/{len(result):,} claims matched \"\n          f\"({matched/len(result)*100:.1f}%)\")\n\n    return result\n\n\n# ------------------------------------------------------------------ quick demo\nif __name__ == \"__main__\":\n    # Verify the standard documentation example NPI: 1234567893\n    test_cases = [\n        (\"1234567893\", True,  \"CMS documentation example\"),\n        (\"1234567890\", False, \"bad check digit\"),\n        (\"123456789\",  False, \"too short\"),\n        (\"12345678901\",False, \"too long\"),\n        (\"1234567X93\", False, \"non-digit character\"),\n    ]\n    print(\"NPI Luhn validator unit tests:\")\n    for npi, expected, label in test_cases:\n        result = luhn_valid(npi)\n        status = \"PASS\" if result == expected else \"FAIL\"\n        print(f\"  [{status}] luhn_valid({npi!r}) = {result}  ({label})\")",
        "description": "Two tools: (1) a vectorized Luhn check-digit validator that screens every NPI in a claims or NPPES file before any join — a structurally invalid NPI cannot link correctly regardless of other data quality; (2) an NPPES monthly-file join that attaches the primary taxonomy code (specialty) and provider type (Type 1 vs Type 2) to a claims extract, plus an optional deactivation filter. Designed for pandas DataFrames; no external dependencies beyond pandas and the standard library.",
        "dependencies": [
          "pandas"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "# NPI data-quality tools for R: Luhn validator + NPPES taxonomy join\n# Designed for data.table; base R only otherwise.\nlibrary(data.table)\n\n# ------------------------------------------------------------------\n# 1. Luhn check-digit validator (vectorized)\n# ------------------------------------------------------------------\n# Algorithm:\n#   1. Pad NPI to 10 characters; reject non-10-digit strings.\n#   2. Build 14-digit working string: paste0(\"80840\", substr(npi, 1, 9))\n#   3. Reverse; double digits at odd reversed positions (1, 3, 5, ...);\n#      subtract 9 if doubled value > 9.\n#   4. check_digit = (10 - sum %% 10) %% 10\n#   5. Valid iff check_digit == as.integer(substr(npi, 10, 10))\nluhn_valid_npi <- function(npi) {\n  # Coerce to character, remove whitespace\n  npi <- trimws(as.character(npi))\n  valid <- grepl(\"^[0-9]{10}$\", npi)\n\n  result <- rep(FALSE, length(npi))\n  if (!any(valid)) return(result)\n\n  working <- paste0(\"80840\", substr(npi[valid], 1, 9))  # 14 chars\n\n  check_digit <- vapply(working, function(ws) {\n    digits <- as.integer(strsplit(ws, \"\")[[1]])\n    rev_d  <- rev(digits)\n    # Odd reversed positions (1-indexed): positions 1, 3, 5, ... -> double\n    for (i in seq_along(rev_d)) {\n      if (i %% 2 == 1) {                 # odd reversed position -> double\n        rev_d[i] <- rev_d[i] * 2L\n        if (rev_d[i] > 9L) rev_d[i] <- rev_d[i] - 9L\n      }\n    }\n    (10L - (sum(rev_d) %% 10L)) %% 10L\n  }, integer(1L))\n\n  declared_check <- as.integer(substr(npi[valid], 10, 10))\n  result[valid] <- (check_digit == declared_check)\n  result\n}\n\n# Quick unit tests\ntest_npis <- c(\"1234567893\", \"1234567890\", \"123456789\", \"1234567X93\")\nexpected  <- c(TRUE, FALSE, FALSE, FALSE)\nstopifnot(all(luhn_valid_npi(test_npis) == expected))\nmessage(\"Luhn validator: all unit tests passed\")\n\n\n# ------------------------------------------------------------------\n# 2. 
NPPES monthly dissemination file join\n# ------------------------------------------------------------------\n# Download the full replacement file from CMS:\n# https://download.cms.gov/nppes/NPI_Files.html\n# The CSV has ~8M rows; data.table select= loads only needed columns.\n#\n# Returns claims_dt with columns added:\n#   provider_type    : \"1\" (individual) or \"2\" (organization)\n#   primary_taxonomy : 10-char NUCC taxonomy code (taken from taxonomy slot 1)\n#   provider_name    : last, first (Type 1) or org name (Type 2)\n#   npi_deactivated  : logical (TRUE = in deactivation file)\njoin_nppes <- function(claims_dt,\n                       nppes_path,\n                       deactivated_path = NULL,\n                       npi_col          = \"rendering_npi\") {\n\n  stopifnot(is.data.table(claims_dt), npi_col %in% names(claims_dt))\n\n  # Read NPPES: select= loads only the columns named below (matched by header name)\n  nppes_all <- fread(\n    nppes_path,\n    select = c(\n      \"NPI\",\n      \"Entity Type Code\",\n      \"Healthcare Provider Taxonomy Code_1\",\n      \"Is Primary Taxonomy Switch_1\",\n      \"Provider Last Name (Legal Name)\",\n      \"Provider First Name\",\n      \"Provider Organization Name (Legal Business Name)\"\n    ),\n    colClasses = \"character\",\n    showProgress = TRUE\n  )\n\n  setnames(nppes_all,\n    old = c(\"NPI\",\n            \"Entity Type Code\",\n            \"Healthcare Provider Taxonomy Code_1\",\n            \"Is Primary Taxonomy Switch_1\",\n            \"Provider Last Name (Legal Name)\",\n            \"Provider First Name\",\n            \"Provider Organization Name (Legal Business Name)\"),\n    new = c(\"npi\", \"provider_type\", \"primary_taxonomy\",\n            \"is_primary_flag\", \"last_name\", \"first_name\", \"org_name\"))\n\n  # Caveat: only taxonomy slot 1 is read. NPPES carries up to 15 taxonomy slots and\n  # the primary one is the slot whose switch is \"Y\", so check is_primary_flag before\n  # treating primary_taxonomy as the provider's primary specialty.\n  nppes_all[, provider_name := ifelse(\n    provider_type == \"1\",\n    trimws(paste(last_name, first_name, sep = \", \")),\n    org_name\n  )]\n  nppes_slim <- nppes_all[, .(npi, provider_type, 
primary_taxonomy, provider_name)]\n\n  # Optional deactivation flag\n  if (!is.null(deactivated_path)) {\n    deact <- fread(deactivated_path, select = \"NPI\", colClasses = \"character\")\n    setnames(deact, \"NPI\", \"npi\")\n    deact[, npi_deactivated := TRUE]\n    nppes_slim <- merge(nppes_slim, deact, by = \"npi\", all.x = TRUE)\n    nppes_slim[is.na(npi_deactivated), npi_deactivated := FALSE]\n  } else {\n    nppes_slim[, npi_deactivated := FALSE]\n  }\n\n  setkey(nppes_slim, npi)\n\n  # Join to claims\n  result <- merge(\n    claims_dt,\n    nppes_slim,\n    by.x = npi_col,\n    by.y = \"npi\",\n    all.x = TRUE\n  )\n\n  matched <- sum(!is.na(result$provider_type))\n  message(sprintf(\"[NPPES join] %s/%s claims matched (%.1f%%)\",\n                  format(matched, big.mark = \",\"),\n                  format(nrow(result), big.mark = \",\"),\n                  matched / nrow(result) * 100))\n  result\n}",
        "description": "R equivalents of the Luhn validator and the NPPES taxonomy join, using base R and data.table for performance on the full NPPES dissemination file (approximately 8 million rows). The Luhn function is vectorized over a character vector of NPIs. The join function reads the NPPES full-replacement file and attaches provider type, primary taxonomy, and provider name to a claims data.table, with an optional deactivation filter.",
        "dependencies": [
          "data.table"
        ],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  NPI[\"NPI (10 digits)\\ndigits 1-9 sequential\\ndigit 10 = Luhn check\"]\n  T1[\"Type 1 NPI\\nIndividual provider\\n(clinician)\"]\n  T2[\"Type 2 NPI\\nOrganization\\n(practice, hospital)\"]\n  NPI --> T1\n  NPI --> T2\n  T1 --> RC[\"Rendering / prescriber field\\nCMS-1500 Box 24J\\nUB-04 FL 76/77\\nNCPDP prescriber NPI\"]\n  T2 --> BC[\"Billing / facility field\\nCMS-1500 Box 33\\nUB-04 FL 82\"]\n  RC --> NPPES[\"NPPES dissemination file\\nprimary taxonomy\\nprovider name\\naddress\\nenumeration date\"]\n  T1 --> NPPES\n  T2 --> NPPES\n  NPPES --> SP[\"Specialty attribution\\n(taxonomy code)\"]\n  NPPES --> DEACT[\"Deactivation filter\"]\n  NPPES --> OP[\"Open Payments join\\n(industry transfers by NPI)\"]\n  NPPES --> UTIL[\"CMS utilization file\\n(service counts by NPI)\"]",
        "caption": "NPI taxonomy showing Type 1 (individual) and Type 2 (organization) enumeration types, where each type appears on claims, and the downstream NPPES-based linkages used in real-world evidence research.",
        "alt_text": "Flowchart: the NPI branches into Type 1 (individual) and Type 2 (organization); Type 1 flows to the rendering/prescriber field on claims, both types flow to NPPES, which enables specialty attribution, deactivation filtering, Open Payments joining, and CMS utilization file joining.",
        "source_type": "illustrative",
        "source_citations": [
          "bindman-2013"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "used_with",
        "target_slug": "cms-1500-professional-claim-fields",
        "notes": "Box 24J of the CMS-1500 carries the rendering provider NPI; Box 33 carries the billing provider NPI. The correct field choice (rendering vs billing) is the primary NPI attribution decision in professional claims research."
      },
      {
        "relation_type": "used_with",
        "target_slug": "ub-04-institutional-claim-fields",
        "notes": "The UB-04 carries the attending provider NPI (FL 76), operating provider NPI (FL 77), other operating provider (FL 78), and service facility NPI (FL 82). Operating-provider NPI is the standard unit for surgical attribution studies."
      },
      {
        "relation_type": "used_with",
        "target_slug": "place-of-service-codes",
        "notes": "Place-of-service codes (on professional claims) contextualize where a rendering-provider NPI delivered care; combining NPI + POS clarifies whether the same clinician is being credited for office, outpatient hospital, or telehealth services."
      },
      {
        "relation_type": "used_with",
        "target_slug": "linked-data",
        "notes": "NPI serves as the deterministic provider-level linkage key in any cross-database study (claims to NPPES, claims to Open Payments, claims to CMS utilization files, registry to claims). Unlike patient-level probabilistic linkage, NPI-based provider linkage is exact-match and does not require threshold calibration."
      },
      {
        "relation_type": "see_also",
        "target_slug": "claims-analysis",
        "notes": "Claims analysis relies on NPI to attribute services to individual clinicians or organizations; the rendering vs billing NPI distinction, Type 1 vs Type 2 classification, and NPPES taxonomy join are operational prerequisites for most provider-level claims analyses."
      },
      {
        "relation_type": "see_also",
        "target_slug": "medicare-ffs-ma-commercial-claims-differences-rwe",
        "notes": "Medicare FFS claims carry both rendering and billing NPIs with consistent field placement; Medicare Advantage claims submitted to CMS may have less complete rendering-NPI data because MA plans adjudicate claims internally. Pre-2007 Medicare data used UPIN rather than NPI, requiring a crosswalk for longitudinal studies spanning the transition."
      },
      {
        "relation_type": "see_also",
        "target_slug": "sdoh-social-determinants-of-health",
        "notes": "Provider NPI linked to NPPES address data can support practice-level social determinants linkage — joining a clinician's practice ZIP code to area-level SDoH indices to study how provider neighborhood characteristics relate to prescribing or quality metrics."
      }
    ],
    "aliases": [
      "NPI",
      "National Provider Identifier",
      "NPPES",
      "taxonomy code"
    ],
    "complexity": "foundational",
    "regulatory_relevance": [
      "hipaa",
      "cms"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "number-needed-to-treat-rwe",
    "name": "Number Needed to Treat (and Number Needed to Harm)",
    "short_definition": "The number of patients who must be treated for one additional patient to benefit, NNT = 1/ARR (the reciprocal of the absolute risk reduction over a stated time horizon); its harm analogue is NNH = 1/ARI (absolute risk increase), and both are baseline-risk- and time-dependent transformations of a risk difference.",
    "long_description": "The **number needed to treat (NNT)** is the reciprocal of the **absolute risk reduction (ARR)**: NNT = 1 / ARR, where\nARR = risk_control - risk_treated over a *stated time horizon*. It answers the clinically concrete question \"how many\npatients must receive the treatment, rather than the comparator, for one additional patient to avoid the (bad) outcome over\nthis period?\" Its mirror image for an adverse outcome is the **number needed to harm (NNH) = 1 / ARI**, where ARI = the\nabsolute risk *increase* = risk_treated - risk_control. Both are simply the absolute risk difference rendered on a\nmore interpretable scale: an ARR of 0.05 (a 5-percentage-point reduction) is an NNT of 20, an ARR of 0.01 an NNT of 100.\nNNT is always rounded *up* to the next whole patient (you cannot treat a fraction of a person to prevent a fraction of an\nevent), and it carries the sign/direction of the outcome — a \"negative NNT\" is properly an NNH and should be reported as\nsuch, not as a negative number.\n\n**Core conceptual distinction — NNT is not a property of the treatment alone.** Unlike a relative measure (relative risk,\nhazard ratio, odds ratio), which is often approximately stable across baseline-risk strata, the NNT is **a function of the\nbaseline risk** because ARR = baseline_risk * (1 - RR). The *same* relative risk reduction yields a small NNT in a high-risk\npopulation and a huge NNT in a low-risk one: a treatment with RR = 0.75 (a 25% relative reduction) gives ARR = 0.05 and\nNNT = 20 when baseline risk is 0.20, but ARR = 0.0025 and NNT = 400 when baseline risk is 0.01. This is precisely why\n**an NNT computed by applying a published relative measure requires you to supply the baseline risk of *your* population** —\ntransporting an NNT from a trial to a real-world population without re-anchoring it to the local baseline risk is a category\nerror. 
NNT is also **time-horizon dependent**: it is meaningful only when attached to the follow-up window over which the risks\nwere measured (a 1-year NNT and a 5-year NNT for the same therapy differ), and for time-to-event data it should be derived\nfrom the cumulative incidence (risk) at the chosen horizon, not from a hazard ratio treated as if it were a risk ratio.\n\n**Pros, cons, and trade-offs.**\n- **vs relative risk / hazard ratio (`hazard-ratio-interpretation` family):** The NNT (and its parent, the ARR) conveys the\n  *absolute* clinical and budgetary impact a relative measure hides: a \"50% reduction\" sounds equally impressive whether\n  baseline risk is 40% or 0.04%, but the NNT (5 vs 5,000) tells you whether it matters. **Prefer the NNT/ARR** for clinical and HTA\n  communication of magnitude; **prefer the relative measure** for transporting effects across populations and for\n  modeling, because it is more stable across baseline-risk strata. Report both: a relative effect plus a baseline-anchored\n  NNT.\n- **vs reporting the ARR (risk difference) directly:** The NNT is the ARR's reciprocal and contains exactly the same\n  information, but the reciprocal transformation is non-linear and badly behaved near ARR = 0, which is why the **confidence\n  interval for the NNT is the reciprocal of the ARR's CI limits** and inherits a pathology (below). **Prefer reporting the\n  ARR with its CI** as the primary quantity and the NNT as the interpretable gloss, especially when the effect is not\n  clearly significant.\n- **vs NNH and a combined benefit-harm view:** NNT (benefit) and NNH (harm) are computed identically but on different\n  outcomes; comparing them (or combining them into a likelihood-of-being-helped-vs-harmed, LHH = NNH/NNT) frames the net\n  clinical trade-off. 
**Use NNT and NNH together** whenever a therapy has both a benefit and a non-trivial harm; reporting\n  NNT alone for an efficacious-but-toxic drug is misleading.\n\n**When to use.** Translating an absolute risk difference into a clinically and economically interpretable count for\nguideline panels, HTA dossiers, and shared decision-making; expressing the magnitude (not just the direction) of a\ncomparative effect from an RWE study after the absolute risks have been estimated correctly (e.g., from a cumulative\nincidence function or a marginal/standardized risk under competing risks); summarizing benefit and harm on a common scale\n(NNT and NNH) to support a net-benefit judgment.\n\n**When NOT to use — and when it is actively misleading or dangerous.**\n- **When the confidence interval for the ARR includes zero.** This is the notorious NNT **discontinuity problem**: if the\n  ARR CI is (-0.01, +0.05) it spans zero, so the NNT CI is *not* a finite interval — it runs from a positive NNT (benefit)\n  through plus/minus infinity to a negative NNT (harm, i.e., an NNH). Reporting a tidy \"NNT 20 (95% CI 12 to 50)\" when the\n  effect is not significant fabricates a finite, benefit-only interval and hides that the data are compatible with harm.\n  When the ARR is non-significant, report the ARR CI and state the NNT only with the Altman (1998) convention\n  (NNT_benefit ... infinity ... 
NNT_harm), or do not report an NNT at all.\n- **Transporting an NNT across populations.** Because the NNT depends on baseline risk, a trial NNT does not apply to a\n  real-world population with a different baseline risk; re-anchor by combining the (more transportable) relative effect with\n  the *local* baseline risk.\n- **Quoting an NNT without its time horizon.** An NNT detached from the follow-up window over which the risks were measured\n  is uninterpretable and invites invalid comparison of NNTs measured over different horizons.\n- **Deriving an NNT from a hazard ratio as if it were a risk ratio, ignoring competing risks.** With non-trivial competing\n  mortality, the cause-specific hazard does not translate into the absolute risk a patient experiences; compute the NNT from\n  cumulative incidence functions at the horizon, not from an HR (see `competing-risks-cause-specific-fine-gray-rwe`).\n\n**Data-source operational depth.**\n- **Claims (FFS):** The two absolute risks that define the ARR are themselves cumulative incidences that must be estimated\n  on a correctly constructed at-risk denominator. Build them on FFS-observable person-time only; Medicare Advantage\n  enrollees generate no FFS claims, so MA-only spans produce fabricated \"event-free\" follow-up that biases the absolute risks and\n  therefore the ARR and NNT. With an active-comparator new-user design and PS matching/weighting, derive the marginal\n  (standardized) absolute risks in each arm; an NNT built on a crude, unadjusted risk difference inherits confounding by\n  indication. 
State the time horizon explicitly and handle the competing risk of death with cumulative incidence functions.\n- **EHR:** Encounter-driven ascertainment makes event-free time partly unobserved (out-of-system care), biasing the absolute\n  risks; require an \"active in system\" definition and treat informative loss to follow-up with appropriate censoring before\n  estimating risks and the NNT.\n- **Registry / linked:** Adjudicated registry outcomes and a linked death index give the cleanest absolute risks (and an\n  honest competing-risk denominator), making linked claims-EHR-vital-records the strongest substrate for a defensible NNT;\n  linkage selects the linkable subset, so the baseline risk that anchors the NNT is the linked-cohort baseline.\n\n**Worked claims example.** In an active-comparator new-user cohort (drug A vs drug B) for a non-fatal cardiovascular outcome\nover a 2-year horizon, 1:1 PS matching balances baseline covariates and the marginal (standardized) 2-year cumulative\nincidences from competing-risk-aware estimation are **risk_B (comparator) = 0.120** and **risk_A (treated) = 0.090**. Then\nthe **ARR = 0.120 - 0.090 = 0.030** (3 percentage points over 2 years) and the **NNT = 1 / 0.030 = 33.3, rounded up to 34**:\n34 patients must be treated with A rather than B for one additional patient to avoid the event within 2 years. Suppose the ARR's 95%\nCI is (0.008, 0.052). Because it excludes zero, the NNT CI is the (reversed) reciprocal of those limits: 1/0.052 = 19.2\n(rounded up to 20) and 1/0.008 = 125, i.e., **NNT 34 (95% CI 20 to 125)**, all on the benefit side, reported with the 2-year horizon. Now contrast\nthe dangerous case: had the ARR CI been (-0.004, 0.052), it would span zero, and the honest NNT statement is\n\"NNT_benefit 20 to infinity, then NNT_harm 250 to infinity\" (Altman convention; 1/0.004 = 250), *not* \"NNT 34 (95% CI 20 to ...)\", which\nwould conceal that the data are compatible with net harm. 
Finally, to transport this finding to a lower-risk primary-prevention\npopulation with baseline 2-year risk 0.04 under the same relative effect (RR = 0.090/0.120 = 0.75), the re-anchored\nARR = 0.04 * (1 - 0.75) = 0.010, so the re-anchored NNT = 1 / 0.010 = 100 rather than 34: the NNT must be recomputed\nagainst the local baseline rather than carried over.\n\n**Interpreting the output**\n\nA propensity-score-matched RWE study comparing Drug A vs Drug B in 1,000 patients per arm over a\n2-year window yields marginal cumulative incidence estimates of 0.090 (treated) and 0.120 (comparator),\nan ARR of 0.030, and an NNT of 34 (95% CI 20 to 125) over 2 years at a baseline risk of 12%.\n\n*(1) Formal interpretation.* The NNT of 34 means that, under this study's conditions, 34 patients would\nneed to receive Drug A instead of Drug B for one additional patient to avoid the outcome within the\n2-year follow-up window, at the observed comparator-arm baseline risk of 12% (0.120). The NNT is the\nreciprocal of the ARR (1 / 0.030 = 33.3, rounded up to 34), and its confidence interval is the reversed\nreciprocal of the ARR interval limits (1/0.052 = 19.2, rounded up to 20; 1/0.008 = 125). The NNT is not a property of the\ndrug alone; it is specific to this 2-year horizon and this 12% baseline risk. Applying this NNT to a\nlower-risk primary-prevention population with a 4% baseline risk would require re-anchoring: under the\nsame relative effect (RR = 0.75), the ARR falls to 0.010 and the NNT rises to 100.\n\n*(2) Practical interpretation.* An NNT of 34 over 2 years is a concrete, audience-ready summary of\nbenefit magnitude. For a formulary committee weighing Drug A versus Drug B, it translates directly into\nbudget terms: treating 34 patients with Drug A rather than Drug B for 2 years prevents one event. Always pair the NNT with the time\nhorizon (2 years), the baseline risk (12%), and the outcome definition; a horizon-free NNT is\nuninterpretable and should not appear in a submission or decision brief.",
    "primary_category": "Outcome_Measure",
    "tags": [
      "number-needed-to-treat",
      "number-needed-to-harm",
      "absolute-risk-reduction",
      "risk-difference",
      "baseline-risk",
      "confidence-interval",
      "benefit-harm",
      "effect-communication"
    ],
    "applies_to_study_types": [
      "active_comparator_new_user",
      "cohort_retrospective",
      "cohort_prospective",
      "claims_analysis",
      "comparative_effectiveness",
      "pragmatic_trial"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked",
      "primary"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1136/bmj.310.6977.452",
        "url": "https://doi.org/10.1136/bmj.310.6977.452",
        "citation_text": "Cook RJ, Sackett DL. The number needed to treat: a clinically useful measure of treatment effect. BMJ. 1995;310(6977):452-454.",
        "year": 1995,
        "authors_short": "Cook & Sackett",
        "notes": "The paper that introduced and popularized the NNT as the reciprocal of the absolute risk reduction and a clinically interpretable summary of treatment effect; the canonical reference for the measure."
      },
      {
        "role": "explain",
        "doi": "10.1136/bmj.317.7168.1309",
        "url": "https://doi.org/10.1136/bmj.317.7168.1309",
        "citation_text": "Altman DG. Confidence intervals for the number needed to treat. BMJ. 1998;317(7168):1309-1312.",
        "year": 1998,
        "authors_short": "Altman",
        "notes": "Defines the correct confidence interval for the NNT as the reciprocal of the ARR confidence limits and resolves the discontinuity problem - how to report the interval honestly (NNT_benefit ... infinity ... NNT_harm) when the ARR CI spans zero."
      },
      {
        "role": "demonstrate",
        "doi": "10.1093/aje/kwm324",
        "url": "https://doi.org/10.1093/aje/kwm324",
        "citation_text": "Suissa S. Immortal time bias in pharmacoepidemiology. American Journal of Epidemiology. 2008;167(4):492-499.",
        "year": 2008,
        "authors_short": "Suissa",
        "notes": "Demonstrates how mis-allocated person-time corrupts the absolute risks that define the ARR (and hence the NNT) in observational data; motivates building the two arm-specific risks on correctly constructed at-risk follow-up."
      },
      {
        "role": "use",
        "doi": "10.1093/ije/dys049",
        "url": "https://doi.org/10.1093/ije/dys049",
        "citation_text": "Pearce N. Classification of epidemiological study designs. International Journal of Epidemiology. 2012;41(2):393-397.",
        "year": 2012,
        "authors_short": "Pearce",
        "notes": "Clarifies the risk/rate distinction underlying the absolute risk difference; the NNT must be derived from cumulative incidence (risk) at a stated horizon, not from a rate or a hazard ratio treated as a risk ratio."
      },
      {
        "role": "use",
        "doi": "10.1097/EDE.0b013e3181c30fb2",
        "url": "https://doi.org/10.1097/EDE.0b013e3181c30fb2",
        "citation_text": "Steyerberg EW, Vickers AJ, Cook NR, Gerds T, Gonen M, Obuchowski N, Pencina MJ, Kattan MW. Assessing the performance of prediction models: a framework for traditional and novel measures. Epidemiology. 2010;21(1):128-138.",
        "year": 2010,
        "authors_short": "Steyerberg et al.",
        "notes": "Frames absolute-risk and clinical-usefulness measures (of which NNT/ARR are the simplest) within model performance, supporting baseline-risk-anchored, decision-relevant reporting of effects in RWE."
      }
    ],
    "plain_language_summary": "Number Needed to Treat (NNT) answers a simple, concrete question: how many patients do you have to give a treatment to, instead of the comparison, for one extra patient to avoid the bad outcome over a stated period of time? You get it by taking the gap in outcome risk between the two groups (the absolute risk reduction, ARR = comparison-group risk minus treated-group risk) and flipping it over: NNT = 1 / ARR, always rounded up to a whole patient. Its mirror image for a side effect is the Number Needed to Harm (NNH), computed the same way on a bad outcome. One honest caveat: the NNT is not a fixed property of the drug; it changes with how high the underlying risk is and with the length of the follow-up window, so a number from one population should not be copied onto another.",
    "key_terms": [
      {
        "term": "absolute risk reduction (ARR)",
        "definition": "The plain difference in outcome risk between the two groups, found by subtracting the treated group's risk from the comparison group's risk."
      },
      {
        "term": "risk",
        "definition": "The fraction of patients in a group who had the outcome during the follow-up window, for example 90 out of 1000 is a risk of 0.09."
      },
      {
        "term": "Number Needed to Harm (NNH)",
        "definition": "The same calculation as NNT but for a bad outcome: how many patients you treat before one extra patient is harmed."
      },
      {
        "term": "time horizon",
        "definition": "The length of follow-up over which the two risks were measured, such as 2 years; an NNT only makes sense when tied to it."
      },
      {
        "term": "baseline risk",
        "definition": "How likely the outcome is to begin with in a population, which is what makes the same treatment give a different NNT in different groups."
      }
    ],
    "worked_example": {
      "scenario": "We compared two groups of 1,000 patients each over a 2-year window: one group got drug A (treated), the other got drug B (the comparison). We counted how many patients in each group had a heart-related event, turned those counts into a risk for each group, and now want the NNT: how many patients we would have to treat with drug A instead of drug B for one extra patient to avoid the event within 2 years.",
      "dataset": {
        "caption": "A small summary table an analyst would build after counting events in each treatment arm over the 2-year window.",
        "columns": [
          "arm",
          "n",
          "events",
          "risk"
        ],
        "rows": [
          [
            "control (drug B)",
            1000,
            120,
            0.12
          ],
          [
            "treated (drug A)",
            1000,
            90,
            0.09
          ]
        ]
      },
      "steps": [
        "Turn each arm's event count into a risk: control = 120 / 1000 = 0.12, treated = 90 / 1000 = 0.09.",
        "Find the absolute risk reduction (ARR) by subtracting the treated risk from the control risk: ARR = 0.12 - 0.09 = 0.03 (a 3-percentage-point drop over 2 years).",
        "Flip the ARR to get the NNT: NNT = 1 / 0.03 = 33.3.",
        "Round up to a whole patient, because you cannot treat a fraction of a person: NNT = 34."
      ],
      "result": "ARR = 0.12 - 0.09 = 0.03, so NNT = 1 / 0.03 = 33.3, rounded up to 34. About 34 patients must be treated with drug A instead of drug B for one extra patient to avoid the event within the 2-year window."
    },
    "prerequisites": [
      "cumulative-incidence-risk-rwe",
      "active-comparator-new-user"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "NNT for benefit from an absolute risk reduction",
        "description": "NNT = 1/ARR with ARR = risk_control - risk_treated over a stated horizon, rounded up to the next whole patient; the standard benefit measure derived from two arm-specific cumulative incidences.",
        "edge_cases": [
          "The NNT depends on baseline risk, so the same relative effect gives different NNTs across populations; always state the baseline risk and the time horizon used.",
          "For time-to-event outcomes derive the two risks from cumulative incidence at the horizon (competing-risk-aware), not from a hazard ratio treated as a risk ratio."
        ],
        "data_source_notes": "claims: estimate the two marginal (standardized) absolute risks from a PS-matched/weighted active-comparator new-user cohort on FFS-observable person-time so the ARR is not confounded."
      },
      {
        "name": "NNH (number needed to harm) from an absolute risk increase",
        "description": "NNH = 1/ARI with ARI = risk_treated - risk_control for an adverse outcome; computed identically to NNT but on the harm outcome, and reported as a positive NNH rather than a negative NNT.",
        "edge_cases": [
          "A \"negative NNT\" is an NNH and should be relabeled, not reported as a negative number, to avoid sign confusion.",
          "NNT (benefit) and NNH (harm) over the same horizon can be combined into a likelihood-of-being-helped-vs-harmed (LHH = NNH/NNT) for a net-benefit view."
        ],
        "data_source_notes": "claims/linked: harm outcomes (e.g., bleeding, AKI) need their own validated phenotype and at-risk window; do not reuse the benefit outcome's denominator uncritically."
      },
      {
        "name": "NNT confidence interval (reciprocal of the ARR CI) with discontinuity handling",
        "description": "The NNT CI is the reciprocal of the ARR confidence-limit values; when the ARR CI excludes zero the NNT CI is a finite benefit interval, and when it spans zero the NNT CI runs NNT_benefit ... infinity ... NNT_harm (Altman, 1998).",
        "edge_cases": [
          "When the ARR CI includes zero, never report a finite benefit-only NNT CI - it conceals compatibility with harm.",
          "The reciprocal swaps the order of the limits (the smaller ARR limit gives the larger NNT), a common reporting error."
        ],
        "data_source_notes": "all sources: report the ARR with its CI as the primary quantity and the NNT as the interpretable gloss, especially when the effect is not clearly significant."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "cumulative-incidence-risk-rwe",
        "pros_of_this": "Renders the absolute risk difference between the two arms' cumulative incidences into a single, clinically and budgetarily interpretable count (patients treated per event avoided) for decision-makers.",
        "cons_of_this": "It is a derived reciprocal that discards the underlying risk levels and is ill-behaved near a zero risk difference; the cumulative incidences themselves carry more information.",
        "when_to_prefer": "For communicating the magnitude of a comparative effect to clinicians and HTA panels; use the cumulative incidences/risk difference directly for estimation, modeling, and when the effect is non-significant."
      },
      {
        "compared_to": "marginal-effects-and-interpretation-of-inferential-statistics-rwe",
        "pros_of_this": "A maximally interpretable, baseline-anchored summary of an absolute marginal effect that non-statisticians can act on directly.",
        "cons_of_this": "Throws away the full marginal risk surface and is not portable across baseline-risk strata the way a relative marginal effect can be.",
        "when_to_prefer": "As the final communication of a marginal absolute effect; obtain the marginal risks/risk difference first via standardization or g-computation, then convert to an NNT."
      },
      {
        "compared_to": "relative effect measures (relative risk / hazard ratio)",
        "pros_of_this": "Conveys absolute clinical and economic impact that a relative measure hides; a large relative reduction on a tiny baseline risk is a huge, often irrelevant, NNT.",
        "cons_of_this": "Baseline-risk- and horizon-dependent, so it is not transportable across populations and must be re-anchored to the local baseline risk.",
        "when_to_prefer": "For magnitude communication and HTA; use the relative measure for transporting effects across populations and for modeling stability."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Estimate the two arm-specific absolute risks as cumulative incidences on FFS-observable person-time (drop Medicare Advantage-only spans, which fabricate event-free follow-up) from a PS-matched/weighted active-comparator new-user cohort so the ARR is not confounded by indication; handle the competing risk of death with cumulative incidence functions and state the time horizon. Report the ARR with its CI and convert to NNT with the Altman discontinuity convention.",
      "ehr": "Encounter-driven ascertainment leaves event-free time partly unobserved (out-of-system care), biasing the absolute risks; require demonstrable in-system activity and treat informative loss to follow-up with proper censoring before estimating risks and the NNT.",
      "registry": "Adjudicated outcomes give clean absolute risks; check completeness and reporting lag so late-reported events at the horizon are not undercounted, deflating the risk difference and inflating the NNT.",
      "linked": "Linked claims-EHR-vital-records gives an honest competing-risk denominator (death index) and clean outcomes for the most defensible absolute risks; the baseline risk that anchors the NNT is the linked-cohort baseline, so report linkage rate and any prevalence drift."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\nfrom scipy.stats import norm\n\ndef nnt(events_t, n_t, events_c, n_c, conf=0.95):\n    \"\"\"NNT/NNH with an ARR-based CI; handles the discontinuity when the ARR CI spans 0.\"\"\"\n    r_t = events_t / n_t                     # treated risk\n    r_c = events_c / n_c                     # control risk\n    arr = r_c - r_t                          # absolute risk reduction (benefit if > 0)\n    se  = np.sqrt(r_t*(1-r_t)/n_t + r_c*(1-r_c)/n_c)\n    z   = norm.ppf(1 - (1-conf)/2)\n    lo, hi = arr - z*se, arr + z*se          # ARR confidence limits\n\n    point = np.inf if arr == 0 else 1.0 / arr\n    label = \"NNT (benefit)\" if arr > 0 else (\"NNH (harm)\" if arr < 0 else \"no effect\")\n    out = {\"ARR\": arr, \"ARR_CI\": (lo, hi), \"point\": point, \"label\": label,\n           \"NNT_rounded\": int(np.ceil(abs(point))) if np.isfinite(point) else None}\n\n    if lo <= 0 <= hi:                        # discontinuity: CI compatible with harm\n        out[\"NNT_CI\"] = \"NNT_benefit {} to inf, then NNT_harm {} to inf\".format(\n            round(1/hi) if hi > 0 else \"n/a\", round(abs(1/lo)) if lo < 0 else \"n/a\")\n        out[\"significant\"] = False\n    else:                                    # finite interval; reciprocal swaps limit order\n        out[\"NNT_CI\"] = tuple(sorted((abs(1/lo), abs(1/hi))))\n        out[\"significant\"] = True\n    return out\n\n# Worked example: treated 90/1000 (0.090), control 120/1000 (0.120) over 2 years.\nres = nnt(events_t=90, n_t=1000, events_c=120, n_c=1000)\nprint(f\"ARR = {res['ARR']:.3f}  ->  {res['label']} = {res['NNT_rounded']} \"\n      f\"(95% CI {res['NNT_CI']}), significant={res['significant']}\")",
        "description": "Compute the NNT (or NNH) and its confidence interval from two arm-specific absolute risks with event\ncounts, handling the discontinuity when the ARR CI spans zero (Altman 1998). Inputs: events and N in the\ntreated and control arms over a stated horizon. Reproduces the worked example (risks 0.090 vs 0.120,\nARR 0.030, NNT 34) and shows the honest \"benefit ... infinity ... harm\" reporting for a non-significant ARR.",
        "dependencies": [
          "numpy",
          "scipy"
        ],
        "source_citations": [
          "cook-sackett-1995",
          "altman-1998"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "nnt <- function(events_t, n_t, events_c, n_c, conf = 0.95) {\n  r_t <- events_t / n_t                       # treated risk\n  r_c <- events_c / n_c                       # control risk\n  arr <- r_c - r_t                            # absolute risk reduction\n  se  <- sqrt(r_t*(1-r_t)/n_t + r_c*(1-r_c)/n_c)\n  z   <- qnorm(1 - (1-conf)/2)\n  lo  <- arr - z*se; hi <- arr + z*se         # ARR confidence limits\n\n  point <- if (arr == 0) Inf else 1/arr\n  label <- if (arr > 0) \"NNT (benefit)\" else if (arr < 0) \"NNH (harm)\" else \"no effect\"\n\n  if (lo <= 0 && 0 <= hi) {                    # discontinuity: compatible with harm\n    ci <- sprintf(\"NNT_benefit %s to Inf, then NNT_harm %s to Inf\",\n                  ifelse(hi > 0, round(1/hi), \"n/a\"),\n                  ifelse(lo < 0, round(abs(1/lo)), \"n/a\"))\n    sig <- FALSE\n  } else {                                     # finite interval; reciprocal reverses limits\n    ci  <- sort(abs(c(1/lo, 1/hi))); sig <- TRUE\n  }\n  list(ARR = arr, ARR_CI = c(lo, hi), label = label,\n       NNT = ceiling(abs(point)), NNT_CI = ci, significant = sig)\n}\n\n# Worked example: treated 90/1000 (0.090), control 120/1000 (0.120) over 2 years.\nres <- nnt(90, 1000, 120, 1000)\ncat(sprintf(\"ARR = %.3f -> %s = %d (95%% CI %s)\\n\",\n            res$ARR, res$label, res$NNT, paste(round(res$NNT_CI), collapse = \" to \")))",
        "description": "NNT/NNH and its confidence interval from two arm-specific risks in base R, with explicit discontinuity\nhandling per Altman (1998). The 95% CI for the ARR (Wald) is inverted to the NNT scale; when the ARR CI\ncrosses zero the function returns the benefit-to-infinity-to-harm description rather than a finite\nbenefit-only interval. Inputs mirror the Python version: events and N per arm.",
        "dependencies": [],
        "source_citations": [
          "cook-sackett-1995",
          "altman-1998"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* 2x2 counts from the worked example: treated 90/1000 events, control 120/1000. */\ndata arm2x2;\n  input arm event count;\n  datalines;\n1 1 90\n1 0 910\n0 1 120\n0 0 880\n;\nrun;\n\n/* RISKDIFF gives the absolute risk difference (ARR = control - treated) and its 95% CI. */\nproc freq data=arm2x2 order=data;\n  weight count;\n  tables arm*event / riskdiff(cl=wald) nopercent norow nocol;\n  ods output RiskDiffCol1 = rd;   /* risk-difference rows incl. the difference & CI */\nrun;\n\n/* Pull the \"Difference\" row (control - treated) and invert to the NNT scale. */\ndata nnt;\n  set rd;\n  where upcase(Row) = 'DIFFERENCE';\n  arr = Risk; lo = LowerCL; hi = UpperCL;     /* ARR and its 95% CI */\n  if arr > 0 then label = 'NNT (benefit)';\n  else if arr < 0 then label = 'NNH (harm)';\n  else label = 'no effect';\n  nnt = ceil(abs(1/arr));                      /* rounded up to whole patients */\n  /* Discontinuity: if the ARR CI spans zero, a finite benefit-only NNT CI is invalid. */\n  if lo <= 0 <= hi then nnt_ci = 'benefit..Inf..harm (CI spans 0; report ARR CI)';\n  else do;\n    nnt_lo = abs(1/hi); nnt_hi = abs(1/lo);    /* reciprocal reverses the limit order */\n  end;\n  keep arr lo hi label nnt nnt_lo nnt_hi nnt_ci;\nrun;\n\nproc print data=nnt noobs; run;   /* expect ARR 0.030, NNT 34 (CI ~20 to 125) */",
        "description": "NNT/NNH with an ARR-based confidence interval in SAS. PROC FREQ with the RISKDIFF option computes the\nabsolute risk difference (ARR) and its 95% CI directly from a 2x2 table of arm by event; a short DATA step\ninverts the ARR and its limits to the NNT scale and flags the discontinuity (Altman 1998) when the ARR CI\nspans zero. Input: work.arm2x2 with arm (1=treated,0=control), event (1/0), and count.",
        "dependencies": [],
        "source_citations": [
          "cook-sackett-1995",
          "altman-1998"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  Rc[Control risk<br/>cumulative incidence at horizon] --> ARR[ARR = risk_control - risk_treated]\n  Rt[Treated risk<br/>cumulative incidence at horizon] --> ARR\n  ARR -->|ARR > 0| NNT[NNT = 1/ARR<br/>round up; benefit]\n  ARR -->|ARR < 0| NNH[NNH = 1/&#124;ARR&#124;<br/>round up; harm]\n  ARR --> Base[Depends on baseline risk + horizon<br/>re-anchor before transporting]\n  NNT --> CI{ARR CI excludes 0?}\n  CI -->|Yes| Fin[Finite NNT CI = 1/ARR limits<br/>reversed order]\n  CI -->|No| Disc[Discontinuity: NNT_benefit..Inf..NNT_harm<br/>do NOT report a finite benefit-only CI]",
        "caption": "From two arm-specific absolute risks to the NNT/NNH and its confidence interval. The NNT is the reciprocal of the ARR, depends on baseline risk and horizon, and its CI is the reciprocal of the ARR limits - with the discontinuity that a zero-spanning ARR CI yields an infinite, benefit-to-harm NNT interval.",
        "alt_text": "Flow from control and treated cumulative-incidence risks to the ARR, then to NNT for benefit or NNH for harm, with a branch on whether the ARR confidence interval excludes zero determining a finite versus discontinuous NNT interval.",
        "source_type": "illustrative",
        "source_citations": [
          "cook-sackett-1995",
          "altman-1998"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  RR[Relative effect from a trial/RWE study<br/>e.g. RR = 0.75] --> Anchor{Baseline risk<br/>of YOUR population}\n  Anchor -->|High risk 0.20| H[ARR = 0.05 -> NNT 20]\n  Anchor -->|Low risk 0.01| L[ARR = 0.0025 -> NNT 400]\n  H --> Lesson[Same relative effect -> very different NNT]\n  L --> Lesson\n  Lesson --> Rule[Never transport an NNT;<br/>re-anchor RR to the local baseline risk + horizon]",
        "caption": "Why the NNT is not a property of the treatment alone. Applying one relative effect to different baseline risks produces vastly different NNTs, so an NNT must always be recomputed against the local population's baseline risk and the stated time horizon rather than carried over from a trial.",
        "alt_text": "Decision tree showing a single relative risk of 0.75 producing an NNT of 20 at baseline risk 0.20 but an NNT of 400 at baseline risk 0.01, with the conclusion that NNTs must be re-anchored to the local baseline risk.",
        "source_type": "illustrative",
        "source_citations": [
          "cook-sackett-1995"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "requires",
        "target_slug": "cumulative-incidence-risk-rwe",
        "notes": "The NNT is derived from the absolute risk difference between two arm-specific cumulative incidences at a stated horizon; correct absolute-risk (cumulative incidence) estimation is the prerequisite."
      },
      {
        "relation_type": "see_also",
        "target_slug": "competing-risks-cause-specific-fine-gray-rwe",
        "notes": "With non-trivial competing mortality, the two absolute risks must come from cumulative incidence functions, not from a hazard ratio treated as a risk ratio, before forming the ARR and NNT."
      },
      {
        "relation_type": "complements",
        "target_slug": "marginal-effects-and-interpretation-of-inferential-statistics-rwe",
        "notes": "The NNT is the most interpretable rendering of a marginal absolute effect; obtain the marginal risks/risk difference by standardization or g-computation first, then convert to an NNT."
      },
      {
        "relation_type": "used_with",
        "target_slug": "active-comparator-new-user",
        "notes": "In observational RWE the two absolute risks defining the ARR are estimated from an active-comparator new-user design so the NNT is not confounded by indication."
      },
      {
        "relation_type": "see_also",
        "target_slug": "cost-effectiveness",
        "notes": "The NNT is a stepping stone to economic evaluation - cost per event avoided scales with the NNT - linking the absolute effect to budget-impact and cost-effectiveness reasoning."
      }
    ],
    "aliases": [
      "number needed to treat",
      "NNT",
      "number needed to harm",
      "NNH",
      "absolute risk reduction reciprocal",
      "patients needed to treat"
    ],
    "complexity": "foundational",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta",
      "journal"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "ols-linear-regression",
    "name": "Ordinary Least Squares (OLS) Linear Regression",
    "short_definition": "The foundational supervised regression method that estimates a linear relationship between one or more predictors and a continuous outcome by minimizing the sum of squared differences between observed and predicted values; in RWE, OLS is the workhorse for covariate-adjusted mean differences on continuous endpoints (length of stay, HbA1c, health utility scores, log-cost) when paired with heteroscedasticity-robust or cluster-robust standard errors.",
    "long_description": "**What OLS is and how it works**\n\nOrdinary Least Squares (OLS) linear regression finds the line — or hyperplane in the\nmultivariable case — that minimizes the sum of squared vertical distances between each\nobserved outcome yᵢ and its model-predicted value ŷᵢ = β₀ + β₁x₁ᵢ + … + βₚxₚᵢ.\nThe name \"least squares\" refers precisely to this criterion: minimize Σ(yᵢ − ŷᵢ)².\nFor a single predictor, the closed-form solution is β₁ = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)²,\nwhich equals the sample covariance of x and y divided by the sample variance of x.\nThe intercept is β₀ = ȳ − β₁x̄. In the multivariable model, each coefficient β_j represents\nthe expected change in the mean of y per one-unit increase in x_j, holding all other\npredictors fixed. This \"holding other predictors fixed\" is precisely what \"covariate\nadjustment\" means in observational research — it estimates the association within strata\ndefined by the other covariates.\n\n**Assumptions: coefficient validity vs inference validity**\n\nA critical distinction that analysts frequently blur: the conditions for OLS *coefficients*\nto be unbiased are different from the conditions needed for valid *standard errors and CIs*.\n\nFor *coefficient validity* (unbiasedness), OLS requires: (1) the conditional mean\nE(y|X) is correctly specified as linear in X; and (2) the errors are mean-zero and\nuncorrelated with the predictors (exogeneity, i.e., no omitted confounders correlated\nwith X). Violations of linearity or exogeneity bias the coefficients themselves and no\nstandard error correction can fix this.\n\nFor *inference validity* (correct SEs and CIs), OLS additionally requires: (3)\nhomoscedasticity — constant error variance across observations; and (4) independent errors\nacross observations. 
The normality of residuals is a fifth classical assumption that matters\nfar less in practice: at the sample sizes typical of RWE studies (thousands to millions of\npatients), the Central Limit Theorem ensures the sampling distribution of OLS coefficients\nis approximately normal regardless of the residual distribution. Non-normality of residuals\nis not a meaningful concern in large claims or EHR analyses (see Lumley et al. 2002).\n\nHeteroscedasticity — non-constant error variance — is endemic in health data: costs are\nright-skewed, sicker patients have more variable utilization, and length of stay fans out\nwith disease severity. Heteroscedasticity biases conventional (homoscedastic) SEs but does\n*not* bias the coefficients. The practical solution is to use heteroscedasticity-robust\nstandard errors — specifically the HC3 variant due to MacKinnon and White — as the default\nin all OLS analyses of health data (Long and Ervin 2000). HC3 inflates the squared residuals\nof high-leverage observations (dividing each by (1 − hᵢᵢ)², where hᵢᵢ is the leverage),\nproviding better finite-sample coverage than HC0 or HC1.\nConventional SEs should be reserved for sensitivity checks and clearly labeled as such.\n\n**Interpreting the output**\n\nConsider a multivariable OLS model of hospital length of stay (LOS, days) regressed on a\nbinary treatment indicator and five covariates: age, sex, Charlson comorbidity score, index\nyear, and payer type. The adjusted coefficient on treatment is β = −1.8 days\n(95% CI −3.1 to −0.5, p = 0.007), and the model R² = 0.32.\n\n*Formal interpretation.* β = −1.8 is the model-estimated difference in mean LOS comparing\ntreated to untreated patients at fixed values of age, sex, Charlson score, index year, and\npayer type — a confounding-adjusted association within those covariate strata. 
The 95% CI\n(−3.1 to −0.5) is a repeated-sampling interval: if the study were repeated many times under\nidentical conditions, approximately 95% of such intervals would contain the true adjusted\nmean difference — it does NOT mean \"there is a 95% probability the true effect is between\n−3.1 and −0.5\" (that is a Bayesian credible interval). R² = 0.32 means 32% of the total\nvariance in LOS is explained by the model collectively; it does not measure whether\ncoefficients are causal, whether the model predicts well in a new population, or whether\nthe unexplained variance (68%) is clinically negligible.\n\n*Practical interpretation.* On average, treated patients in this cohort stayed about\n1.8 fewer days than untreated patients with similar demographics, comorbidity burden, and\npayer mix — an adjusted association compatible with a true difference plausibly ranging\nfrom half a day to just over three days based on the confidence interval. If the included\ncovariates adequately capture confounding, this is a reasonable proxy for the causal\neffect. If important confounders are unmeasured (e.g., functional status, disease severity\nnot captured by Charlson), the estimate remains a biased adjusted association. R² = 0.32\ntells us the model explains a moderate share of LOS variability — appropriate context for\nsetting expectations about residual individual variation, but not a measure of causal\ncompleteness.\n\n**Adjustment, dummy coding, and the Table-2 fallacy**\n\nIn a multivariable OLS model, each coefficient is estimated conditional on all other\nmodel terms. Continuous covariates are \"held fixed\" in the mathematical sense of partial\nderivatives. Binary exposures coded 0/1 (dummy coding) produce a coefficient equal to the\nadjusted mean difference. Categorical variables with k categories require k−1 dummy\nindicators with one reference category. 
Interactions between predictors model departure\nfrom additivity on the mean-difference scale and must be pre-specified.\n\nThe *Table-2 fallacy* is a high-value RWE warning: the coefficients on confounders in a\nmultivariable regression model are NOT causal effects of those confounders and should not\nbe reported as such. In an OLS model of cost adjusted for age, sex, comorbidity, and\ntreatment, the coefficient on age is the estimated association of age with cost after\nconditioning on sex, comorbidity, and treatment — a different causal quantity from the\ntotal effect of age, often biased if any covariates are mediators of or colliders with\nthe age-cost pathway. Only the coefficient on the pre-specified primary exposure has the\nintended adjustment interpretation. Report all other coefficients as associations with\nappropriate caveats, or suppress them from primary results tables.\n\n**Clustering: patients within providers and plans**\n\nWhen patients are nested within clusters — hospitals, medical practices, insurance plans,\ngeographic regions — their outcomes are correlated within clusters even after controlling\nfor patient-level covariates. Ignoring this correlation understates standard errors and\ninflates false-positive rates. The solution is cluster-robust standard errors (also called\nclustered SEs or the Huber-White sandwich with cluster correction), which require only that\nthe clusters themselves are a random sample from a large population of clusters. A common\nrule of thumb is at least 30 clusters for reliable cluster-robust inference; with fewer\nclusters, bootstrap methods or the CR2 small-sample correction should be considered.\nCluster-robust SEs are available across all major packages: statsmodels `cov_type=\"cluster\"`,\nR `coeftest(..., vcov=vcovCL(..., cluster=...))`, SAS PROC SURVEYREG. 
For claims data with\npatients nested in health plans, or EHR data with patients nested in hospitals, cluster-\nrobust SEs should be the default when OLS is used for comparative analyses.\n\n**OLS on cost and utilization outcomes: when it fails and when it can be defended**\n\nOLS applied directly to raw healthcare costs faces three structural problems: (1) *right\nskew* — the distribution has a long right tail driven by high-cost outliers, so the\nresiduals are far from symmetric; (2) *heteroscedasticity* — variance of costs rises with\nthe mean (sicker patients have higher mean and higher variance); and (3) *mass at zero* —\nmany patients have zero cost in a window, creating a spike that no continuous linear model\naccommodates well. These problems motivate the gamma GLM with a log link (or a two-part\nmodel when zeros are structurally informative) as the modern standard for primary inference\non mean cost differences — these models respect positivity of costs, accommodate skew\nthrough the variance function, and estimate mean ratios directly interpretable for payers.\n\nHowever, OLS is not always wrong for cost data. In very large samples, the CLT ensures\nthat the OLS estimator of the mean difference is consistent and approximately normally\ndistributed even with skewed residuals — the coefficient is valid for inference on the\n*mean difference* (the policy-relevant quantity for budget-impact models). When the target\nestimand is a difference in means rather than a ratio, when zero-inflation is modest, and\nwhen robust SEs are used, OLS on raw costs can be a defensible sensitivity analysis or\nquick descriptive estimate. OLS on log-transformed costs targets a ratio of geometric\nmeans instead and needs retransformation to recover a mean difference. The principled\nchoice remains a GLM for confirmatory analysis; OLS serves as a transparency check.\n\n**Pros, cons, and trade-offs**\n\n*Pros:* OLS coefficients are directly interpretable as differences in the conditional mean\nof y — the most intuitive scale for clinical and policy audiences. 
The method is\ncomputationally trivial, extends naturally to multivariable adjustment, dummy coding,\nand interactions, and has a closed-form solution. With robust SEs, OLS yields valid\ninference on means even under heteroscedasticity. In large RWE datasets, non-normality of\nresiduals is irrelevant (CLT), so the classical criticism of OLS as \"requiring normal data\"\nis largely misplaced for large health studies. Predictions are on the original outcome\nscale, and coefficients have direct units (days, dollars, points on a score).\n\n*Cons:* OLS assumes a linear conditional mean — if the true relationship is substantially\nnonlinear, coefficients are biased. R² is easily misread as a measure of causal\nexplanatory power or predictive accuracy; it is neither. For skewed outcomes, OLS can\nproduce negative predictions, suffers from heteroscedastic residuals that bias\nconventional SEs (in either direction), and has lower efficiency than a correctly specified GLM. Confounding\nbias is not addressed by OLS itself; it requires a well-designed covariate set or a causal\nidentification strategy.\n\n*Trade-offs vs GLMs:* The gamma GLM with log link is preferred for cost and utilization\noutcomes because it respects positivity and accommodates heteroscedasticity through the\nvariance function. OLS is preferred when the effect scale is an additive mean difference,\nwhen interactions on the additive scale are the estimand, or when direct coefficient\ninterpretability in original units is required.\n\n**When to use**\n\nUse OLS when: the outcome is continuous and the conditional mean is the target quantity;\nthe relationship between predictors and outcome is plausibly linear over the observed\ncovariate range; the sample is large enough for the CLT to protect inference (roughly\nn ≥ 100 per arm as a heuristic); and heteroscedasticity-robust or cluster-robust SEs are\napplied. 
OLS is appropriate for continuous endpoints in both RCTs and observational studies\nwhere confounders are measured and included: blood pressure, HbA1c, health utility scores,\nlength of stay, patient-reported outcome scales. It is also the estimator inside\ndifference-in-differences and event-study designs where parallel trends in the untreated\noutcome is the identifying assumption.\n\n**When NOT to use**\n\nDo not use OLS when: the outcome is binary — use logistic regression, or modified Poisson\nregression for risk ratios; the outcome is a count — use Poisson or negative binomial regression;\nthe outcome is time-to-event with censoring — use Cox proportional-hazards or parametric\nsurvival models; the outcome is bounded or ordinal — use beta/fractional or ordinal regression; the sample is\nsmall (n < 30 per arm) with clearly non-normal residuals and no CLT protection; or raw\ncost/utilization is the outcome and a mean-ratio estimand is needed — use a gamma GLM or\ntwo-part model for confirmatory analysis. Do not use OLS and report causal effects without\na credible identification design — an adjusted regression coefficient is a confounding-\nadjusted association, not a causal effect, unless the study design (RCT, valid instrumental\nvariable, sharp regression discontinuity) provides identification. The Table-2 fallacy\nwarning applies: confounders' coefficients are not their causal effects and should not be\nnarrated as such. For heavy-tailed distributions at small n, robust SEs reduce but do not\neliminate the risk of poorly calibrated inference.",
    "primary_category": "Inferential_Statistics",
    "tags": [
      "statistics",
      "primitive",
      "regression",
      "linear-models",
      "ols",
      "ordinary-least-squares",
      "robust-standard-errors",
      "adjustment",
      "covariate-adjustment",
      "mean-difference"
    ],
    "applies_to_study_types": [
      "cohort_retrospective",
      "cohort_prospective",
      "rct",
      "cross_sectional",
      "active_comparator_new_user",
      "claims_analysis",
      "ehr_study",
      "registry_study",
      "difference_in_differences"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "primary",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.3238/arztebl.2010.0776",
        "url": "https://doi.org/10.3238/arztebl.2010.0776",
        "citation_text": "Schneider A, Hommel G, Blettner M. Linear regression analysis. Deutsches Arzteblatt International. 2010;107(44):776-782.",
        "year": 2010,
        "authors_short": "Schneider et al.",
        "notes": "Concise clinical-audience introduction to simple and multiple linear regression covering coefficient interpretation, R-squared, model assumptions, and residual diagnostics. Written for medical researchers without a strong statistics background; directly maps to the RWE analyst's day-to-day use of regression for adjusted comparisons."
      },
      {
        "role": "explain",
        "doi": "10.1146/annurev.publhealth.23.100901.140546",
        "url": "https://doi.org/10.1146/annurev.publhealth.23.100901.140546",
        "citation_text": "Lumley T, Diehr P, Emerson S, Chen L. The importance of the normality assumption in large public health data sets. Annual Review of Public Health. 2002;23:151-169.",
        "year": 2002,
        "authors_short": "Lumley et al.",
        "notes": "Demonstrates why the normality-of-residuals assumption is largely irrelevant in large public health datasets due to the Central Limit Theorem, while showing that heteroscedasticity and skew still matter for standard error validity. Essential reading for analysts who reflexively log-transform outcomes or switch to nonparametric methods when residuals look non-normal in large RWE studies."
      },
      {
        "role": "demonstrate",
        "doi": "10.1080/00031305.2000.10474549",
        "url": "https://doi.org/10.1080/00031305.2000.10474549",
        "citation_text": "Long JS, Ervin LH. Using heteroscedasticity consistent standard errors in the linear regression model. The American Statistician. 2000;54(3):217-224.",
        "year": 2000,
        "authors_short": "Long & Ervin",
        "notes": "Simulation-based comparison of HC0 through HC3 heteroscedasticity-consistent standard errors showing that HC3 outperforms earlier variants in finite samples due to its leverage-inflating adjustment. Supports HC3 as the default robust SE option in OLS analyses of skewed health outcomes. All major software packages implement HC3 (statsmodels HC3, R vcovHC type=\"HC3\", SAS PROC REG ACOV)."
      },
      {
        "role": "use",
        "doi": "10.2307/1912934",
        "url": "https://doi.org/10.2307/1912934",
        "citation_text": "White H. A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica. 1980;48(4):817-838.",
        "year": 1980,
        "authors_short": "White",
        "notes": "The original paper introducing the heteroscedasticity-consistent (sandwich) covariance matrix estimator and the White test for heteroscedasticity. Foundational methodological reference for any application that uses robust SEs in OLS; the HC0 variant defined here is the basis for later HC1-HC3 refinements."
      },
      {
        "role": "use",
        "doi": "10.1016/S0167-6296(01)00086-8",
        "url": "https://doi.org/10.1016/S0167-6296(01)00086-8",
        "citation_text": "Manning WG, Mullahy J. Estimating log models: to transform or not to transform? Journal of Health Economics. 2001;20(4):461-494.",
        "year": 2001,
        "authors_short": "Manning & Mullahy",
        "notes": "Definitive methodological guide to choosing between log-OLS, GLM with log link, and two-part models for healthcare cost data. Demonstrates when OLS on log-cost or OLS on raw cost is and is not appropriate relative to gamma or Tweedie GLMs. Essential context for the RWE analyst deciding whether to use OLS or a GLM for cost outcomes."
      }
    ],
    "plain_language_summary": "OLS linear regression draws the best-fitting straight line through a set of data points by choosing a slope and intercept that minimize the sum of squared gaps between the actual outcome values and the line's predictions. Each coefficient tells you how much the average outcome changes for a one-unit increase in that predictor, holding all other predictors constant — so it is a natural tool for comparing two groups while accounting for differences in age, sickness, or other characteristics. One honest caveat: OLS assumes the relationship is a straight line, and for outcomes like raw healthcare costs — which are right-skewed and always positive — a different type of model often fits better.",
    "key_terms": [
      {
        "term": "intercept",
        "definition": "The model's predicted value of the outcome when every predictor equals zero; in most clinical models it is a mathematical anchor rather than a clinically meaningful quantity."
      },
      {
        "term": "slope coefficient",
        "definition": "The estimated change in the average outcome for a one-unit increase in a predictor, holding all other variables in the model fixed at their current values."
      },
      {
        "term": "residual",
        "definition": "The difference between a patient's actual observed outcome and the value the regression line predicted for them; OLS minimizes the sum of all squared residuals."
      },
      {
        "term": "R-squared",
        "definition": "The fraction of the total variability in the outcome that is explained by the model's predictors together; it ranges from 0 (model explains nothing) to 1 (model explains everything), but a high R-squared does not mean the coefficients are causal."
      },
      {
        "term": "robust standard errors",
        "definition": "Standard errors calculated using a formula (the sandwich estimator) that is valid even when the residuals do not have constant spread across patients, making confidence intervals reliable in the skewed, unequal-variance data typical of health research."
      },
      {
        "term": "adjustment",
        "definition": "Including additional variables in a regression model so that each coefficient represents the outcome difference within groups of patients who are similar on those other variables, reducing (but not eliminating) confounding."
      }
    ],
    "worked_example": {
      "scenario": "A health outcomes researcher fits a simple OLS regression to explore the relationship between number of prior hospitalizations in the 12 months before an index date (x, ranging from 1 to 5) and total annual healthcare cost in the following 12 months (y, in thousands of dollars) across 5 patients from a registry. The analyst works through the exact least-squares arithmetic by hand to understand where the slope and intercept come from and how R-squared is computed from the residuals.",
      "dataset": {
        "caption": "Five-patient registry excerpt. x = prior hospitalizations (count); y = annual cost ($K). Patient 3 has unusually high cost (12) relative to the linear trend, which will produce a larger residual and drive the R-squared below 1.",
        "columns": [
          "patient_id",
          "prior_hosp_x",
          "annual_cost_k_y"
        ],
        "rows": [
          [
            1,
            1,
            3
          ],
          [
            2,
            2,
            5
          ],
          [
            3,
            3,
            12
          ],
          [
            4,
            4,
            9
          ],
          [
            5,
            5,
            11
          ]
        ]
      },
      "steps": [
        "Compute the sample means. x-bar = (1+2+3+4+5)/5 = 15/5 = 3. y-bar = (3+5+12+9+11)/5 = 40/5 = 8.",
        "Compute the numerator of the OLS slope: the five cross-products (x_i - x-bar)*(y_i - y-bar) give values 10, 3, 0, 1, and 6 respectively (deviations for patients 1-5). Their sum: 10+3+0+1+6 = 20.",
        "Compute the denominator: the five squared deviations (x_i - x-bar)^2 are 4, 1, 0, 1, 4. Their sum: 4+1+0+1+4 = 10.",
        "OLS slope: beta_1 = 20/10 = 2. Interpretation: each additional prior hospitalization is associated with $2K higher predicted annual cost on average.",
        "OLS intercept: beta_0 = 8 - 2*3 = 8-6 = 2. The model predicts $2K cost for a patient with zero prior hospitalizations (an extrapolation beyond the data range here, so interpret cautiously).",
        "Compute fitted values (y-hat = 2*x + 2): for x=1 the fitted value is 2*1+2 = 4; for x=2 it is 2*2+2 = 6; for x=3 it is 2*3+2 = 8; for x=4 it is 2*4+2 = 10; for x=5 it is 2*5+2 = 12.",
        "Compute residuals (y - y-hat): patient 1 gives 3-4 = -1; patient 2 gives 5-6 = -1; patient 3 gives 12-8 = 4; patient 4 gives 9-10 = -1; patient 5 gives 11-12 = -1. Residual sum check: (-1)+(-1)+4+(-1)+(-1) = 0. OLS residuals always sum to zero when the model includes an intercept.",
        "Compute R-squared. SST (total sum of squares) = 25+9+16+1+9 = 60. SSE (residual sum of squares) = 1+1+16+1+1 = 20. SSR (model sum of squares) = 60-20 = 40. R-squared = 40/60 = 0.667. About two-thirds of the cost variation is explained by prior hospitalizations; one-third (patient 3's high cost) is unexplained by the linear fit."
      ],
      "result": "beta_0 = 2 (intercept, $2K baseline cost), beta_1 = 2 (each additional hospitalization adds $2K predicted cost). Fitted values: 4, 6, 8, 10, 12. Residuals: -1, -1, 4, -1, -1. Residual sum check: (-1)+(-1)+4+(-1)+(-1) = 0. SST = 25+9+16+1+9 = 60, SSE = 1+1+16+1+1 = 20, SSR = 60-20 = 40. R-squared = 40/60 = 0.667."
    },
    "prerequisites": [
      "descriptive-statistics",
      "inferential-statistics-foundations",
      "pearson-spearman-correlation"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Simple OLS (univariate regression)",
        "description": "A single continuous predictor and a continuous outcome. Produces one slope coefficient equal to the Pearson correlation times the ratio of standard deviations (β₁ = r * SDy/SDx). Used for scatterplot regression lines, dose-response exploration, and as the arithmetic foundation for multivariable OLS. In observational RWE, simple OLS is almost never the primary analysis because confounding is uncontrolled; it is appropriate for exploratory data analysis and variable screening.",
        "edge_cases": [
          "The slope of simple OLS is a scaled version of the Pearson correlation; if the correlation is misspecified (nonlinear relationship), the slope estimate is biased.",
          "At very large n, simple OLS will detect trivially small associations as statistically significant; report the coefficient and CI rather than relying on the p-value alone."
        ],
        "data_source_notes": "Claims and EHR: useful for exploratory scatter plots of cost vs utilization predictors, or as the first step before building an adjusted model. Not a valid primary analysis for comparative effectiveness."
      },
      {
        "name": "Multivariable OLS with HC3 robust standard errors",
        "description": "Multiple continuous and binary predictors with HC3 robust SEs to correct for heteroscedasticity. The standard choice for adjusted mean differences in large RWE cohorts when the outcome is continuous. The HC3 sandwich estimator is leverage- weighted and provides better coverage than HC0 or HC1 in finite samples. Specify HC3 explicitly in all software (not the default in most implementations).",
        "edge_cases": [
          "HC3 is not a substitute for a correctly specified model; if the linear mean function is badly wrong, the coefficient is biased regardless of SE type.",
          "With very few observations (n < 50), even HC3 can be poorly calibrated; bootstrap or the bias-corrected accelerated bootstrap CI may be more reliable."
        ],
        "data_source_notes": "Claims: apply to per-patient outcome totals (cost, utilization counts) after flagging outliers; use HC3 SEs as default. EHR: lab values and vitals are natural continuous outcomes; apply multivariable OLS with HC3 for adjusted comparisons across treatment groups or time periods."
      },
      {
        "name": "Cluster-robust OLS",
        "description": "OLS with standard errors clustered at the provider, plan, or geographic unit to account for within-cluster correlation of patient outcomes. Required when patients share a common care environment that affects the outcome independently of patient-level covariates. Implemented via the Liang-Zeger (GEE-style) sandwich correction in statsmodels (cov_type=\"cluster\"), R sandwich package (vcovCL), or SAS PROC SURVEYREG.",
        "edge_cases": [
          "Cluster-robust inference is unreliable with fewer than 30 clusters; use the CR2 (bias-corrected) sandwich or a wild cluster bootstrap instead.",
          "If the number of patients per cluster is very small (e.g., 1-2 patients per practice), consider a mixed-effects (random-effects) model instead of cluster-robust OLS."
        ],
        "data_source_notes": "Claims data with patients nested in health plans; EHR data with patients nested in hospitals or practices; registry data with sites as the cluster unit."
      },
      {
        "name": "OLS as a fast approximate estimator for cost means (large-n defense)",
        "description": "In very large samples where the CLT dominates (n > 5,000 per arm is a common heuristic), OLS on raw cost or utilization counts with robust SEs can be a defensible adjunct to the primary GLM analysis. It produces a mean-difference estimate on the dollar scale that budget-impact analysts can directly use, and it is computationally simpler than gamma GLM. Should be labeled \"OLS approximation\" and accompanied by a gamma or two-part model as the primary confirmatory estimate.",
        "edge_cases": [
          "Do not use OLS on raw cost as the primary analysis when zero-inflation is substantial (> 20% zeros in either arm); the two-part model is required.",
          "Log-transforming cost before OLS changes the estimand to a geometric mean ratio (recoverable via retransformation), not an arithmetic mean difference; clarify which quantity is the target before choosing the transformation."
        ],
        "data_source_notes": "Claims cost totals in large all-payer or Medicare datasets where n per arm > 10,000. Always pair with a gamma GLM primary analysis and report OLS as the robustness check."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "logistic-regression-for-binary-outcomes",
        "pros_of_this": "OLS coefficients are directly interpretable as mean differences on the original outcome scale (days, dollars, score points) without requiring transformation or marginal standardization. Additivity of OLS coefficients makes interaction terms directly interpretable as departures from additive mean differences — the natural scale for health technology assessment budget-impact.",
        "cons_of_this": "OLS on a binary (0/1) outcome — the linear probability model — produces predictions outside [0,1], is heteroscedastic by construction (the variance is P*(1-P)), and yields risk differences rather than odds ratios or risk ratios natively. Logistic regression respects the bounded probability scale and is more appropriate for binary endpoints; recover risk differences by standardization (g-computation) from the logistic model rather than fitting OLS on the binary outcome.",
        "when_to_prefer": "Prefer OLS for continuous outcomes; prefer logistic for binary outcomes. The linear probability model (OLS on 0/1) is occasionally acceptable as a quick RD check in large samples far from the bounds (0.1 < P < 0.9), but should not be the primary analysis."
      },
      {
        "compared_to": "gamma-distribution",
        "pros_of_this": "OLS coefficients are additive mean differences (dollars, days) directly interpretable without back-transformation; they aggregate correctly across the population for budget impact without requiring a retransformation correction.",
        "cons_of_this": "OLS on raw cost ignores the gamma-type distributional shape of health costs (right-skewed, positive, variance proportional to mean squared). A gamma GLM with log link is the modern standard for cost inference: it respects positivity, accommodates heteroscedasticity through the variance function, and estimates a mean-ratio effect directly relevant for payer decision-making. OLS on log-cost changes the estimand to a geometric mean ratio and requires a smearing retransformation for mean predictions.",
        "when_to_prefer": "Prefer gamma GLM for primary confirmatory cost analysis; retain OLS on raw cost as a robustness check in large-n studies (CLT protection) with explicit labeling of its limitations. Prefer OLS when the estimand is an absolute mean-cost difference and the audience requires direct dollar-scale coefficients."
      },
      {
        "compared_to": "marginal-effects-and-interpretation-of-inferential-statistics-rwe",
        "pros_of_this": "OLS coefficients are themselves marginal effects on the mean-difference scale, so no additional post-estimation step is needed for continuous outcomes — the coefficient is directly the adjusted mean difference.",
        "cons_of_this": "For nonlinear models (logistic, Poisson, gamma), marginal effects require post-estimation averaging over the covariate distribution to produce population-averaged estimates. Understanding when OLS already produces marginal effects (vs when it produces conditional effects) is a prerequisite for correctly applying g-computation or AME estimators.",
        "when_to_prefer": "Use OLS when the outcome is continuous and the mean-difference scale is appropriate; apply marginal effects methods when working with nonlinear outcome models or when the target is a population-averaged risk difference or rate ratio from a logistic or Poisson model."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Per-patient outcome totals (cost, utilization days, event counts summed over the outcome window) are the natural unit of analysis. Apply multivariable OLS with HC3 robust SEs for adjusted mean differences. For cost outcomes, always also fit a gamma GLM and report OLS as the robustness check. At large n (> 10,000), p-values will be significant for trivially small differences; report the coefficient and 95% CI prominently and contextualize against a minimum clinically important difference.",
      "ehr": "Lab values, vitals, and patient-reported outcomes are often approximately normally distributed within treatment groups and well-suited to multivariable OLS. Apply HC3 SEs as default. For visit-driven EHR data, check for informative measurement frequency (sicker patients have more lab draws) before treating lab values as representative cross-sections; restrict to structured windows or apply inverse-probability-of-measurement weights.",
      "registry": "Disease severity scores and continuously measured outcomes from adjudicated registries are clean OLS inputs. Apply multivariable OLS with HC3 SEs; include site or registry as a cluster variable for cluster-robust SEs when the registry is multi-site.",
      "primary": "Survey and patient-reported outcome instruments often produce bounded scores (0-100 on a EQ-5D VAS, for example). OLS is appropriate if the relationship is linear over the observed range and the instrument is intended to be treated as interval-scaled. For ordinal instruments, consider ordinal regression. Apply HC3 SEs and report the coefficient in original score units.",
      "linked": "Linked claims-EHR-registry cohorts typically have large n; CLT protects OLS inference on means. Apply multivariable OLS with HC3 SEs for continuous outcomes; cluster at the provider or plan level with cluster-robust SEs when the linkage creates nested structure. Report gamma GLM alongside OLS for cost outcomes."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\nimport pandas as pd\nimport statsmodels.api as sm\n\n# ── Toy dataset (matches beginner-layer worked example) ──────────────────────────────\ndata_toy = pd.DataFrame({\n    \"prior_hosp\": [1, 2, 3, 4, 5],\n    \"annual_cost_k\": [3, 5, 12, 9, 11],\n})\nX_toy = sm.add_constant(data_toy[[\"prior_hosp\"]])   # adds intercept column\ny_toy = data_toy[\"annual_cost_k\"]\n\n# ── Fit OLS ──────────────────────────────────────────────────────────────────────────\nmodel_toy = sm.OLS(y_toy, X_toy).fit()\nprint(\"=== Toy dataset (n=5): conventional SEs ===\")\nprint(model_toy.summary2())\n# Intercept = 2.0, slope (prior_hosp) = 2.0, R-squared = 0.667\n\n# ── HC3 robust standard errors (default for health data) ─────────────────────────────\nmodel_hc3 = model_toy.get_robustcov_results(cov_type=\"HC3\")\nprint(\"\\n=== Toy dataset: HC3 robust SEs ===\")\nprint(model_hc3.summary2())\n# In a 5-obs example HC3 CIs will be very wide; robust SEs matter in larger samples.\n\n# ── Larger simulated RWE dataset: OLS with HC3 SEs ──────────────────────────────────\nrng = np.random.default_rng(42)\nn = 2000\ndf = pd.DataFrame({\n    \"treated\":      rng.binomial(1, 0.5, n),\n    \"age\":          rng.normal(65, 12, n),\n    \"charlson\":     rng.poisson(2, n),\n    \"female\":       rng.binomial(1, 0.52, n),\n})\n# Simulate LOS outcome (days): true adjusted treatment effect = -2.0 days\ndf[\"los\"] = (\n    8.0\n    - 2.0 * df[\"treated\"]\n    + 0.08 * (df[\"age\"] - 65)\n    + 0.4 * df[\"charlson\"]\n    - 0.5 * df[\"female\"]\n    + rng.normal(0, 3.5, n)   # heteroscedastic-ish residuals\n)\n\ncovariates = [\"treated\", \"age\", \"charlson\", \"female\"]\nX = sm.add_constant(df[covariates])\ny = df[\"los\"]\n\n# Fit with HC3 robust SEs — single call via cov_type argument\nols_hc3 = sm.OLS(y, X).fit(cov_type=\"HC3\")\n\nprint(\"\\n=== Multivariable OLS, HC3 robust SEs (n=2000) ===\")\nprint(ols_hc3.summary2())\n# Focus on the 'treated' coefficient: 
should be near -2.0\ncoef_trt = ols_hc3.params[\"treated\"]\nci_trt   = ols_hc3.conf_int().loc[\"treated\"]\nprint(f\"\\nAdjusted treatment effect: {coef_trt:.2f} days\")\nprint(f\"95% CI (HC3): [{ci_trt[0]:.2f}, {ci_trt[1]:.2f}]\")\nprint(f\"Model R-squared: {ols_hc3.rsquared:.3f}\")\n\n# ── Cluster-robust SEs (patients nested in 50 hypothetical plans) ───────────────────\ndf[\"plan_id\"] = rng.integers(1, 51, n)   # 50 insurance plans\nols_cluster = sm.OLS(y, X).fit(\n    cov_type=\"cluster\",\n    cov_kwds={\"groups\": df[\"plan_id\"]}\n)\nci_trt_cl = ols_cluster.conf_int().loc[\"treated\"]\nprint(f\"\\nCluster-robust 95% CI (clustered by plan): [{ci_trt_cl[0]:.2f}, {ci_trt_cl[1]:.2f}]\")\nprint(\"Note: cluster-robust SEs are wider than HC3 when within-plan correlation is positive.\")",
        "description": "Multivariable OLS with HC3 robust standard errors using statsmodels. Demonstrates\nmodel fitting, coefficient extraction, robust SEs via get_robustcov_results, and a\nbrief cluster-robust SE example. Uses the motivating 5-patient toy dataset from the\nbeginner layer for transparency, then shows a larger simulated dataset structure for\nrealistic RWE application. No dependencies beyond statsmodels and numpy.",
        "dependencies": [
          "statsmodels",
          "numpy",
          "pandas"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(sandwich)\nlibrary(lmtest)\n\n# ── Toy dataset (matches beginner-layer worked example) ──────────────────────────────\ntoy <- data.frame(\n  prior_hosp    = 1:5,\n  annual_cost_k = c(3, 5, 12, 9, 11)\n)\nm_toy <- lm(annual_cost_k ~ prior_hosp, data = toy)\ncat(\"=== Toy dataset (n=5): conventional SEs ===\\n\")\nprint(summary(m_toy))\n# Intercept = 2, slope (prior_hosp) = 2, R-squared = 0.6667\n\n# ── HC3 robust SEs via sandwich + lmtest ─────────────────────────────────────────────\ncat(\"\\n=== Toy dataset: HC3 robust SEs ===\\n\")\nprint(coeftest(m_toy, vcov = vcovHC(m_toy, type = \"HC3\")))\n\n# ── Larger simulated RWE dataset ─────────────────────────────────────────────────────\nset.seed(42)\nn <- 2000\ndf <- data.frame(\n  treated  = rbinom(n, 1, 0.5),\n  age      = rnorm(n, 65, 12),\n  charlson = rpois(n, 2),\n  female   = rbinom(n, 1, 0.52),\n  plan_id  = sample(1:50, n, replace = TRUE)\n)\ndf$los <- (\n  8.0\n  - 2.0 * df$treated\n  + 0.08 * (df$age - 65)\n  + 0.4 * df$charlson\n  - 0.5 * df$female\n  + rnorm(n, 0, 3.5)\n)\n\nm <- lm(los ~ treated + age + charlson + female, data = df)\ncat(\"\\n=== Multivariable OLS, conventional SEs (n=2000) ===\\n\")\nprint(summary(m))\n\n# HC3 robust SEs — recommended default for health data\ncat(\"\\n=== HC3 robust SEs ===\\n\")\nprint(coeftest(m, vcov = vcovHC(m, type = \"HC3\")))\n\n# Confidence interval with HC3 SEs\nci_hc3 <- coefci(m, vcov = vcovHC(m, type = \"HC3\"))\ncat(sprintf(\n  \"\\nAdjusted treatment effect: %.2f days, 95%% CI (HC3): [%.2f, %.2f]\\n\",\n  coef(m)[\"treated\"], ci_hc3[\"treated\", 1], ci_hc3[\"treated\", 2]\n))\n\n# ── Cluster-robust SEs (patients nested in 50 insurance plans) ───────────────────────\ncat(\"\\n=== Cluster-robust SEs (clustered by plan_id) ===\\n\")\nprint(coeftest(m, vcov = vcovCL(m, cluster = ~plan_id, data = df)))\n# Note: requires sandwich >= 2.5-1 for the cluster formula interface\ncat(\"Rule of thumb: reliable with >= 30 clusters; 50 plans is 
adequate here.\\n\")",
        "description": "Multivariable OLS with HC3 robust standard errors using base R lm(), the sandwich\npackage for HC3 covariance matrices, and lmtest::coeftest for robust inference.\nDemonstrates cluster-robust SEs via vcovCL. Uses the same simulated RWE dataset\nstructure as the Python implementation for cross-language comparability.",
        "dependencies": [
          "sandwich",
          "lmtest"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* ── Create motivating dataset (toy: n=5) ─────────────────────────────────────── */\ndata work.toy;\n  input prior_hosp annual_cost_k;\n  datalines;\n1  3\n2  5\n3 12\n4  9\n5 11\n;\nrun;\n\n/* ── PROC REG: OLS with ACOV (heteroscedasticity-consistent SEs, HC0 variant) ─── */\nproc reg data=work.toy;\n  model annual_cost_k = prior_hosp / acov;\n  /* ACOV: \"Asymptotic Covariance\" — produces White (HC0) robust SEs.\n     Use the ACOV rows in the output for robust inference.\n     Intercept = 2, slope (prior_hosp) = 2, R-squared = 0.6667              */\nrun;\n\n/* ── Create larger simulated RWE dataset ─────────────────────────────────────── */\ndata work.rwe;\n  call streaminit(42);\n  do i = 1 to 2000;\n    treated  = rand(\"bernoulli\", 0.5);\n    age      = rand(\"normal\", 65, 12);\n    charlson = rand(\"poisson\", 2);\n    female   = rand(\"bernoulli\", 0.52);\n    plan_id  = ceil(rand(\"uniform\") * 50);   /* 50 insurance plans */\n    los = 8.0\n          - 2.0 * treated\n          + 0.08 * (age - 65)\n          + 0.4 * charlson\n          - 0.5 * female\n          + rand(\"normal\", 0, 3.5);\n    output;\n  end;\nrun;\n\n/* ── PROC REG: multivariable OLS with HC0 robust SEs ──────────────────────────── */\nproc reg data=work.rwe;\n  model los = treated age charlson female / acov;\n  /* Focus on ACOV row for the 'treated' coefficient.\n     ACOV in PROC REG implements HC0 (White 1980). For HC3, use PROC IML\n     or the %HCCME macro available from SAS support.                          */\nrun;\n\n/* ── PROC GLM: equivalent OLS fit (useful for CLASS variables and LS means) ──── */\nproc glm data=work.rwe;\n  model los = treated age charlson female;\n  /* PROC GLM does not have a built-in ACOV option; combine with PROC IML\n     or PROC SURVEYREG (below) for robust SEs.                                
*/\nrun;\n\n/* ── PROC SURVEYREG: cluster-robust SEs (the recommended approach in SAS) ────── */\nproc surveyreg data=work.rwe;\n  cluster plan_id;              /* cluster-robust SEs at the plan level         */\n  model los = treated age charlson female;\n  /* PROC SURVEYREG implements the Huber-White sandwich estimator with\n     cluster correction. Use this procedure whenever patients are nested\n     within plans, hospitals, or practices.\n     Rule of thumb: >= 30 clusters for reliable inference; 50 plans here.     */\nrun;",
        "description": "OLS with heteroscedasticity-robust SEs (ACOV option) in PROC REG and an alternative\nvia PROC GLM. Demonstrates cluster-robust SEs via PROC SURVEYREG with the CLUSTER\nstatement. Uses the same motivating dataset as Python and R implementations. The ACOV\noption in PROC REG produces White/HC0 SEs; for HC3, use PROC IML or the sandwich macro.\nPROC SURVEYREG is the preferred SAS procedure for cluster-robust OLS in health services\nresearch.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Q[Is the outcome<br/>continuous?] --> No[No → use logistic, Poisson,<br/>Cox, or ordinal regression]\n  Q --> Yes[Yes]\n  Yes --> Cost[Is the outcome<br/>raw cost / LOS?]\n  Cost --> CostYes[\"Consider gamma GLM<br/>as primary analysis<br/>(see gamma-distribution)\"]\n  Cost --> CostNo[No — lab values, utility<br/>scores, log-cost, etc.]\n  CostNo --> ClusterQ[Are patients nested<br/>in providers / plans?]\n  CostYes --> CostOLS[\"OLS as robustness check<br/>(large n only, HC3 SEs,<br/>label as approximation)\"]\n  ClusterQ --> Cluster[\"Yes → cluster-robust SEs<br/>(≥30 clusters)\"]\n  ClusterQ --> NoCluster[\"No → HC3 robust SEs<br/>(default for health data)\"]\n  Cluster --> FitOLS[Fit multivariable OLS<br/>with cluster-robust SEs]\n  NoCluster --> FitOLS\n  FitOLS --> Diagnose[\"Diagnose: residual plots,<br/>leverage, influential obs<br/>(see regression-diagnostics)\"]\n  Diagnose --> Report[\"Report: coefficient + 95% CI<br/>in original units; label as<br/>adjusted association unless<br/>causal design is established\"]",
        "caption": "Decision pathway for OLS linear regression in RWE. Continuous outcomes that are raw cost or raw utilization counts warrant a gamma GLM as the primary analysis with OLS as a robustness check; all other continuous outcomes use OLS with HC3 or cluster-robust SEs as appropriate.",
        "alt_text": "Flowchart starting at 'Is the outcome continuous?' branching through cost vs non-cost outcomes and clustered vs non-clustered data into OLS with appropriate robust SEs, ending with a diagnostics and reporting step.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "requires",
        "target_slug": "inferential-statistics-foundations",
        "notes": "OLS builds directly on hypothesis testing and sampling distribution concepts; understanding the null distribution of the F-statistic and t-statistic for coefficients requires the inference foundations covered in this prerequisite concept."
      },
      {
        "relation_type": "requires",
        "target_slug": "pearson-spearman-correlation",
        "notes": "The simple OLS slope is a rescaled Pearson correlation (β₁ = r × SDy/SDx), so the concepts of linear association and its direction and magnitude are foundational for interpreting regression coefficients and understanding when OLS's linearity assumption is plausible."
      },
      {
        "relation_type": "see_also",
        "target_slug": "logistic-regression-for-binary-outcomes",
        "notes": "When the outcome is binary (readmission, treatment initiation, event yes/no), logistic regression is the correct alternative to OLS; the linear probability model (OLS on a 0/1 outcome) produces out-of-range predictions and heteroscedastic residuals by construction and should not be the primary analysis for binary endpoints."
      },
      {
        "relation_type": "see_also",
        "target_slug": "generalized-linear-models",
        "notes": "OLS is the Gaussian member of the GLM family — it is a GLM with an identity link and normal variance function. Understanding this connection explains why OLS is the natural starting point for GLM thinking, and why the gamma GLM with log link is the preferred generalization for cost and utilization outcomes."
      },
      {
        "relation_type": "used_with",
        "target_slug": "regression-diagnostics",
        "notes": "After fitting an OLS model, residual plots, leverage diagnostics, influential-observation detection, and linearity checks are essential steps before trusting the coefficients. Regression diagnostics identify violations of the linearity and exogeneity assumptions that cannot be corrected by robust SEs."
      },
      {
        "relation_type": "see_also",
        "target_slug": "gamma-distribution",
        "notes": "Healthcare costs follow a distribution with positive support, right skew, and variance proportional to the mean — properties matched by the gamma distribution. When OLS on raw cost is inadequate, the gamma GLM with log link is the modern standard; understanding the gamma distribution explains why OLS residuals from cost regressions are systematically heteroscedastic and right-skewed."
      }
    ],
    "aliases": [
      "linear regression",
      "least squares",
      "OLS",
      "multiple regression",
      "multiple linear regression",
      "linear model",
      "ordinary least squares regression",
      "OLS regression"
    ],
    "complexity": "foundational",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "omop-cdm-method-patterns-rwe",
    "name": "OMOP CDM Method Patterns for RWE",
    "short_definition": "A family of cohort-construction patterns that build exposure, outcome, time-at-risk, and covariate definitions directly against the standardized OMOP Common Data Model tables (PERSON, OBSERVATION_PERIOD, DRUG_EXPOSURE/DRUG_ERA, CONDITION_OCCURRENCE/CONDITION_ERA, plus the standardized vocabulary) so the same analytic protocol runs reproducibly across many independent databases.",
    "long_description": "**OMOP CDM method patterns** are the conventional ways an RWE study is *operationalized* once raw claims, EHR, or registry\ndata have been transformed (ETL'd) into the **Observational Medical Outcomes Partnership Common Data Model**. The CDM is a\nperson-centric relational schema in which every clinical event is mapped to a **standard concept** in the OHDSI vocabulary\n(RxNorm for drugs, SNOMED for conditions, LOINC for measurements) and every event lives inside an **OBSERVATION_PERIOD** that\ndeclares when the person was observable. The \"patterns\" are the reusable templates that turn a protocol — eligibility,\nexposure, outcome, time-at-risk, covariates — into SQL/analytic code that is portable: a study author writes the cohort logic\nonce, and a remote site executes it against its own CDM instance and returns only aggregate results. This is the engine\nbehind the OHDSI network and behind FDA Sentinel-style distributed analytics. The CDM is a *data structure and convention*,\nnot an estimator; the validity of any result still rests entirely on the design (active comparator, new-user, time-zero\nalignment) layered on top of it.\n\n**Core conceptual distinction** — three CDM constructs do the heavy lifting and are routinely confused, with material\nconsequences. (1) *Raw occurrence vs era*: `DRUG_EXPOSURE` and `CONDITION_OCCURRENCE` store one row per dispensing/diagnosis,\nwhereas `DRUG_ERA` and `CONDITION_ERA` collapse those rows into continuous exposed/diseased episodes using a persistence-gap\nrule (the default drug-era persistence window is 30 days). Eras are convenient for \"ever on drug\" or chronic-disease\nphenotypes but bake in a gap assumption you did not choose; for a strict new-user, time-varying analysis you usually rebuild\nepisodes from `DRUG_EXPOSURE` with your own `days_supply` + grace-period logic rather than trusting the default era. 
(2)\n*Standard vs source concept*: every event has both a `*_source_concept_id` (the original ICD/NDC code) and a mapped\n`*_concept_id` (the standard SNOMED/RxNorm concept). Cohorts must be defined on **standard** concepts and expanded through\n`CONCEPT_ANCESTOR` (e.g., an RxNorm ingredient resolves to all descendant clinical-drug and branded-drug concepts) so the same\nconcept set behaves identically across sites — querying source codes directly destroys portability. (3) *Observability vs\ndata presence*: absence of a row is informative **only inside an OBSERVATION_PERIOD**; outside it, \"no record\" means\nunobserved, not \"did not happen.\" Washout, eligibility, and censoring must all be bounded by `observation_period_start_date`\n/ `observation_period_end_date`, or you will mistake enrollment gaps for clinical absence.\n\n**Pros, cons, and trade-offs**\n- **vs bespoke per-database analytic code (raw claims/EHR scripts):** CDM patterns give you provenance, portability, and a\n  shared vocabulary, so a protocol can be run by 10 data partners with no PHI leaving each site and results pooled by\n  meta-analysis (the OHDSI / FDA Sentinel model). Cost: the ETL is lossy and opinionated — source-data nuance (claim\n  adjustment flags, payer type, lab units edge cases) can be dropped or mis-mapped, and you inherit every ETL decision the\n  site made. **Prefer CDM patterns** for multi-database, reproducible, or regulatory-grade studies; **prefer bespoke code**\n  only when a single source has irreplaceable detail the CDM cannot represent and portability is not a goal.\n- **vs i2b2 / PCORnet / Sentinel CDMs:** OMOP's distinguishing feature is full vocabulary standardization to *standard*\n  concepts with `CONCEPT_ANCESTOR` hierarchies, which makes concept sets transportable; PCORnet and Sentinel keep more\n  source-code-native structures that are simpler to ETL but harder to query identically across sites. 
**Prefer OMOP** when\n  cross-network harmonization and large-scale analytics (PLP, LEGEND-style) matter.\n- **default DRUG_ERA/CONDITION_ERA vs hand-built episodes:** eras are fast and standardized but the fixed persistence gap is\n  rarely the clinically correct one and can fabricate or break exposure continuity. **Prefer hand-built episodes** for any\n  time-varying or new-user safety analysis; **use default eras** only for coarse \"ever-exposed\"/prevalence characterization.\n\n**When to use** — multi-database or network RWE where the same protocol must execute at sites that cannot share row-level\ndata; OHDSI tooling pipelines (ATLAS/Capr cohort definitions, CohortMethod, PatientLevelPrediction, empirical calibration\nwith negative controls); regulatory submissions and HTA dossiers that demand transparent, reproducible, code-shareable\ncohort logic; any study where a standardized vocabulary and `CONCEPT_ANCESTOR` expansion materially improves phenotype\nportability. The CDM is also the natural substrate for target-trial emulation at scale because eligibility, treatment\nassignment, and time-zero can all be expressed as portable SQL over the standard tables.\n\n**When NOT to use — and when it is actively misleading or dangerous**\n- **You have not validated the local ETL.** Treating a CDM instance as ground truth without source-to-standard mapping\n  review and a data-quality pass (e.g., OHDSI Achilles/DataQualityDashboard) is dangerous: a missing or wrong vocabulary\n  map silently empties or inflates a concept set, and the analysis runs cleanly while measuring the wrong thing.\n- **The clinical question depends on detail the ETL drops.** Payer/plan type, claim line adjustments, free-text severity,\n  units-of-measure subtleties, or device-specific NDC nuance may not survive the transform; forcing the CDM here yields\n  confident, wrong answers.\n- **You trust default eras for a time-varying estimand.** The 30-day drug-era persistence gap is not your 
protocol's grace\n  period; using it for as-treated safety analysis silently mis-defines exposure and time-at-risk.\n- **You query source concepts to \"be safe.\"** Bypassing standard-concept + `CONCEPT_ANCESTOR` logic breaks portability and\n  reintroduces the cross-site heterogeneity the CDM exists to remove.\n- **Out-of-observation time leaks into the analysis.** Counting person-time or \"absence of outcome\" outside\n  OBSERVATION_PERIOD manufactures immortal time and misclassifies unobserved as event-free.\n\n**Data-source operational depth**\n- **Claims (FFS vs MA/commercial):** ETL maps NDC→RxNorm and ICD→SNOMED; `DRUG_EXPOSURE.days_supply` and\n  `drug_exposure_start_date` drive era and time-at-risk logic. OBSERVATION_PERIOD is derived from enrollment spans, so it is\n  only trustworthy when enrollment is complete: **Medicare Advantage person-time lacks fee-for-service claims**, so an\n  MA-only OBSERVATION_PERIOD looks observable but is missing drug and procedure events — restrict to enrollees with full\n  Parts A/B/D (or commercial medical+pharmacy) and treat MA-only spans as unobserved. Mail-order 90-day fills and free\n  samples distort `days_supply`; capitated/bundled care drops claims and breaks era continuity.\n- **EHR:** Events are visit-driven, so OBSERVATION_PERIOD (often defined from first-to-last visit) overstates true\n  observability — a patient who seeks care elsewhere is differentially \"absent.\" Drugs may be *orders/prescriptions*, not\n  dispensings, so `DRUG_EXPOSURE` overcounts initiation unless linked to fills; conditions in `CONDITION_OCCURRENCE` mix\n  problem-list, encounter, and resolved diagnoses, inflating `CONDITION_ERA`. Phenotypes should add measurement/lab and\n  visit anchors, not codes alone.\n- **Registry:** Strong, often adjudicated outcomes and disease severity (e.g., cancer stage in a tumor registry) but sparse,\n  intermittent OBSERVATION_PERIOD coverage and weak drug capture. 
Map adjudicated endpoints to standard concepts carefully;\n  link to claims for exposure and to a death index for censoring.\n- **Linked claims–EHR–registry:** The ideal CDM substrate (EHR severity + claims completeness + adjudicated/mortality\n  outcomes), but linkage introduces a selected subset, and reconciling order vs fill vs service dates across sources must\n  happen *before* time-zero assignment, or competing OBSERVATION_PERIODs will misalign follow-up. In elderly cohorts,\n  differential competing risks (death) by exposure must be modeled, not censored away.\n\n**Worked CDM-native example.** Question: incident myopathy among new users of a high-intensity statin vs a moderate-intensity\nstatin in adults, built entirely from OMOP tables. (1) *Concept sets:* resolve each statin ingredient (RxNorm) to all\ndescendants via `CONCEPT_ANCESTOR` so every clinical-drug and branded-drug concept (and, through the ETL mapping, every\nsource NDC) lands in the correct arm; because statin intensity depends on strength as well as ingredient, assign arms at\nthe clinical-drug level rather than by ingredient alone. Resolve the myopathy outcome\nto the SNOMED standard concept and its descendants. (2) *Eligibility & washout:* require ≥365 days of continuous\nOBSERVATION_PERIOD before the index drug-exposure, with no statin `DRUG_EXPOSURE` of either class in that window; this is the\nnew-user/incident restriction expressed as \"first qualifying `drug_exposure_start_date` with `observation_period_start_date`\n≤ index − 365.\" (3) *Time zero & arm:* index = earliest qualifying `drug_exposure_start_date`; arm = the intensity class of\nthe statin dispensed that day. (4) *Exposure episodes:* rebuild on-treatment time from `DRUG_EXPOSURE.days_supply` with a pre-specified\ngrace period rather than the default 30-day DRUG_ERA, because the default gap is not the protocol's. (5) *Covariates:*\nmeasured only in [index − 365, index] from `CONDITION_OCCURRENCE`, `DRUG_EXPOSURE`, `MEASUREMENT`, and `VISIT_OCCURRENCE`,\nfeeding a (high-dimensional) propensity score. 
(6) *Time-at-risk & censoring:* from index to first outcome\n`condition_occurrence`, censoring at `observation_period_end_date`, death, end of data, and (as-treated) discontinuation or\nswitch. The identical SQL/cohort definition is then shipped to each network site; only aggregate, calibrated estimates\n(with negative-control empirical calibration) come back.",
    "primary_category": "Study_Design",
    "tags": [
      "omop-cdm",
      "ohdsi",
      "common-data-model",
      "cohort-construction",
      "concept-set",
      "drug-era",
      "observation-period",
      "distributed-network-analysis",
      "pharmacoepidemiology"
    ],
    "applies_to_study_types": [
      "claims_analysis",
      "ehr_study",
      "registry_linkage",
      "target_trial_emulation",
      "active_comparator_new_user"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1136/amiajnl-2011-000376",
        "url": "https://doi.org/10.1136/amiajnl-2011-000376",
        "citation_text": "Overhage JM, Ryan PB, Reich CG, Hartzema AG, Stang PE. Validation of a common data model for active safety surveillance research. Journal of the American Medical Informatics Association. 2012;19(1):54-60.",
        "year": 2012,
        "authors_short": "Overhage et al.",
        "notes": "Founding validation of the OMOP Common Data Model and standardized vocabulary for transportable observational analyses."
      },
      {
        "role": "explain",
        "doi": "10.3233/978-1-61499-564-7-574",
        "url": "https://doi.org/10.3233/978-1-61499-564-7-574",
        "citation_text": "Hripcsak G, Duke JD, Shah NH, et al. Observational Health Data Sciences and Informatics (OHDSI): Opportunities for Observational Researchers. Studies in Health Technology and Informatics (MEDINFO 2015). 2015;216:574-578.",
        "year": 2015,
        "authors_short": "Hripcsak et al.",
        "notes": "Articulates the OHDSI program, the rationale for the CDM, and the distributed-network analytic model these patterns implement."
      },
      {
        "role": "explain",
        "doi": "10.1093/jamia/ocy032",
        "url": "https://doi.org/10.1093/jamia/ocy032",
        "citation_text": "Reps JM, Schuemie MJ, Suchard MA, Ryan PB, Rijnbeek PR. Design and implementation of a standardized framework to generate and evaluate patient-level prediction models using observational healthcare data. Journal of the American Medical Informatics Association. 2018;25(8):969-975.",
        "year": 2018,
        "authors_short": "Reps et al.",
        "notes": "Standardized, CDM-based cohort and prediction-model framework illustrating how reusable patterns are operationalized on OMOP."
      },
      {
        "role": "demonstrate",
        "doi": "10.1073/pnas.1510502113",
        "url": "https://doi.org/10.1073/pnas.1510502113",
        "citation_text": "Hripcsak G, Ryan PB, Duke JD, et al. Characterizing treatment pathways at scale using the OHDSI network. Proceedings of the National Academy of Sciences. 2016;113(27):7329-7336.",
        "year": 2016,
        "authors_short": "Hripcsak et al.",
        "notes": "Demonstrates running one standardized analysis across many CDM databases to characterize real-world treatment pathways."
      },
      {
        "role": "demonstrate",
        "doi": "10.1016/S0140-6736(19)32317-7",
        "url": "https://doi.org/10.1016/S0140-6736(19)32317-7",
        "citation_text": "Suchard MA, Schuemie MJ, Krumholz HM, et al. Comprehensive comparative effectiveness and safety of first-line antihypertensive drug classes: a systematic, multinational, large-scale analysis. The Lancet. 2019;394(10211):1816-1826.",
        "year": 2019,
        "authors_short": "Suchard et al.",
        "notes": "LEGEND-HTN — large-scale, multi-database active-comparator study built entirely on OMOP CDM patterns with empirical calibration."
      },
      {
        "role": "use",
        "doi": "10.1073/pnas.1708282114",
        "url": "https://doi.org/10.1073/pnas.1708282114",
        "citation_text": "Schuemie MJ, Hripcsak G, Ryan PB, Madigan D, Suchard MA. Empirical confidence interval calibration for population-level effect estimation studies in observational healthcare data. Proceedings of the National Academy of Sciences. 2018;115(11):2571-2577.",
        "year": 2018,
        "authors_short": "Schuemie et al.",
        "notes": "Negative-control empirical calibration routinely layered on CDM cohort patterns to quantify residual systematic error."
      }
    ],
    "plain_language_summary": "The OMOP Common Data Model (CDM) is a standardized blueprint that reorganizes messy, site-specific patient data — claims, electronic health records, registries — into the same table structure and shared medical vocabulary at every participating site. Because every hospital or insurer stores data differently, researchers could not run the same analysis on ten databases without rewriting the code ten times; the CDM solves that problem by translating all source codes (diagnosis codes, drug codes, lab codes) into one universal language before the analysis begins. Once data live in the CDM, a single set of query code can answer the same question across dozens of databases simultaneously, with only summary results — never individual patient records — leaving each site.",
    "key_terms": [
      {
        "term": "CDM",
        "definition": "Common Data Model — a fixed table structure and shared vocabulary that makes patient data look identical across different hospitals or insurers, so one analysis script runs everywhere."
      },
      {
        "term": "OMOP",
        "definition": "Observational Medical Outcomes Partnership — the organization that designed the CDM now maintained by OHDSI; the two names are often used interchangeably to refer to the same data standard."
      },
      {
        "term": "standard concept",
        "definition": "A single agreed-upon code (from RxNorm for drugs, SNOMED for diagnoses) that every site maps its local codes to, so querying that one code finds the same thing everywhere."
      },
      {
        "term": "observation period",
        "definition": "The date range during which a patient is known to be enrolled and observable in the data; events outside this window cannot be trusted as truly absent."
      },
      {
        "term": "CONCEPT_ANCESTOR",
        "definition": "A lookup table in the CDM that links a general drug or disease concept (e.g., the ingredient atorvastatin) to every more-specific version of it (every brand name, every dose form), so one query captures all of them."
      },
      {
        "term": "drug era",
        "definition": "A pre-built continuous treatment episode the CDM creates by collapsing individual dispensing records into one stretch of time, using a default 30-day gap rule to decide when one treatment episode ends and another begins."
      }
    ],
    "worked_example": {
      "scenario": "Imagine you want to study three different clinical questions — who received a statin, who was diagnosed with diabetes, and how many outpatient visits a patient had — and you need the analysis to run identically at five hospitals that each store data in a different format. After each hospital has converted its data to the OMOP CDM, every question maps to the same named table and the same standard concept codes, so the query code you write once will return valid results from all five sites.",
      "dataset": {
        "caption": "How each analytic question maps to OMOP CDM tables and the standard vocabulary concept that drives the query.",
        "columns": [
          "Analytic question",
          "OMOP CDM table(s) queried",
          "Key column used",
          "Standard vocabulary"
        ],
        "rows": [
          [
            "Did the patient take a statin?",
            "DRUG_EXPOSURE (individual fills) or DRUG_ERA (continuous episode)",
            "drug_concept_id matched via CONCEPT_ANCESTOR to the atorvastatin ingredient concept",
            "RxNorm ingredient code"
          ],
          [
            "Was the patient ever diagnosed with type 2 diabetes?",
            "CONDITION_OCCURRENCE (individual diagnosis events) or CONDITION_ERA (continuous disease episode)",
            "condition_concept_id matched to the SNOMED diabetes concept and its descendants",
            "SNOMED CT code"
          ],
          [
            "How many outpatient visits did the patient have in the past year?",
            "VISIT_OCCURRENCE filtered to visit_concept_id = outpatient visit, bounded by OBSERVATION_PERIOD dates",
            "visit_concept_id; observation_period_start/end_date to confirm the patient was enrolled",
            "SNOMED visit concept"
          ],
          [
            "What is the patient's baseline creatinine value?",
            "MEASUREMENT filtered to the LOINC creatinine concept, restricted to dates before the index event",
            "measurement_concept_id; value_as_number; measurement_date",
            "LOINC code"
          ],
          [
            "Who is in the study population (enrolled and observable)?",
            "PERSON joined to OBSERVATION_PERIOD",
            "observation_period_start_date and observation_period_end_date define the window when absence of a record means the event truly did not happen",
            "Not vocabulary-dependent — structural table"
          ]
        ]
      },
      "steps": [
        "Raw hospital or insurer data arrives with site-specific codes — ICD-10 diagnosis codes, NDC drug codes, CPT procedure codes — that differ in format across sites; the ETL (extract-transform-load) process translates all of them into the standard OMOP concepts before the analysis begins.",
        "Every patient gets a row in the PERSON table and one or more rows in OBSERVATION_PERIOD marking the dates they were enrolled; any clinical event (drug, diagnosis, visit, lab) that falls outside these dates must be treated as unobserved, not as evidence the event never happened.",
        "To find all patients who took a statin, you look up the RxNorm ingredient concept for atorvastatin in CONCEPT_ANCESTOR, which returns every descendant concept — Lipitor 10 mg tablets, generic atorvastatin 40 mg, mail-order 90-day supplies — so the query captures them all without listing hundreds of individual product codes.",
        "For each analytic need you choose the right table: DRUG_EXPOSURE or DRUG_ERA for drug use, CONDITION_OCCURRENCE or CONDITION_ERA for diagnoses, VISIT_OCCURRENCE for healthcare encounters, MEASUREMENT for labs — the table names and column names are identical at every site because all sites share the CDM structure.",
        "Once the query is written and validated at one site, it is sent as code (not patient data) to the other four sites, each of which runs it on its own CDM instance and returns only summary counts or aggregate statistics, keeping all individual patient information local."
      ],
      "result": "The same five-line SQL query asking 'how many new statin users had a prior diabetes diagnosis' executes correctly at all five hospital databases and returns comparable aggregate results — because the CDM guarantees that drug_concept_id, condition_concept_id, and observation_period mean the same thing everywhere. Portability is the deliverable: write the logic once, trust it to run anywhere data have been properly converted to OMOP."
    },
    "prerequisites": [
      "claims-analysis",
      "new-user-design",
      "continuous-enrollment-observable-time-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Standard-concept set with CONCEPT_ANCESTOR expansion",
        "description": "Define exposure/outcome on standard concepts (RxNorm ingredient, SNOMED condition) and expand to all descendants via CONCEPT_ANCESTOR so the same definition resolves to site-specific source codes identically everywhere.",
        "edge_cases": [
          "Non-standard or unmapped source codes (poor ETL) silently drop out of standard-concept queries, shrinking the cohort.",
          "Over-broad ancestors pull in unintended descendants; review the resolved concept list, not just the seed concept."
        ],
        "data_source_notes": "claims: NDC->RxNorm mapping completeness varies by ETL vintage; EHR: local custom codes may lack standard mappings and need a mapping review before the concept set is trusted."
      },
      {
        "name": "DRUG_ERA / CONDITION_ERA vs hand-built episodes",
        "description": "Use standardized eras for coarse ever-exposed/prevalence characterization, but rebuild exposure episodes from DRUG_EXPOSURE (days_supply + protocol-specific grace period) for time-varying or new-user safety analyses.",
        "edge_cases": [
          "Default 30-day drug-era persistence gap can fabricate continuity across stockpiling or break it across mail-order fills.",
          "Condition eras built from problem-list entries persist indefinitely and overstate active disease."
        ],
        "data_source_notes": "claims: stitch overlapping days_supply explicitly; EHR: distinguish prescription orders from dispensings before building drug episodes."
      },
      {
        "name": "Observation-period-bounded eligibility and censoring",
        "description": "Anchor washout, eligibility, time-at-risk, and censoring to observation_period_start_date / observation_period_end_date so that absence of a record is interpreted as clinical absence only while the person is observable.",
        "edge_cases": [
          "MA-only or capitated spans appear inside OBSERVATION_PERIOD yet lack claims; treat as unobserved.",
          "EHR observation periods derived from first-to-last visit overstate observability between sporadic encounters."
        ],
        "data_source_notes": "claims: derive observability from full A/B/D (or commercial medical+pharmacy) enrollment; registry: coverage is intermittent and must be linked for continuous person-time."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Bespoke per-database claims/EHR analytic code",
        "pros_of_this": "Portable, provenance-tracked cohort logic that runs across many sites with no row-level data sharing; shared standardized vocabulary enables identical concept sets everywhere.",
        "cons_of_this": "Inherits a lossy, opinionated ETL; irreplaceable source-data nuance (payer type, claim adjustments, unit edge cases) may be dropped or mis-mapped.",
        "when_to_prefer": "Multi-database, reproducible, or regulatory-grade studies where portability and transparency outweigh source-specific detail."
      },
      {
        "compared_to": "PCORnet / FDA Sentinel common data models",
        "pros_of_this": "Full standardization to standard concepts with CONCEPT_ANCESTOR hierarchies makes concept sets transportable and supports large-scale OHDSI analytics (PLP, LEGEND, empirical calibration).",
        "cons_of_this": "Heavier, more opinionated vocabulary ETL than source-code-native CDMs; mapping errors can silently distort cohorts.",
        "when_to_prefer": "Cross-network harmonization and population-scale comparative studies; choose source-code-native CDMs when ETL simplicity dominates."
      },
      {
        "compared_to": "Default DRUG_ERA / CONDITION_ERA episodes",
        "pros_of_this": "Standardized, fast, comparable across sites for coarse exposure/disease prevalence.",
        "cons_of_this": "Fixed persistence gap is rarely the clinically correct one and mis-defines time-varying exposure and time-at-risk.",
        "when_to_prefer": "Ever-exposed or prevalence characterization; rebuild episodes by hand for any new-user or as-treated estimand."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "NDC->RxNorm and ICD->SNOMED via the vocabulary; days_supply + drug_exposure_start_date drive era/time-at-risk. Derive OBSERVATION_PERIOD from full A/B/D (or commercial medical+pharmacy) enrollment; treat Medicare Advantage-only spans as unobserved because fee-for-service claims are absent. Watch 90-day mail-order and free samples distorting days_supply.",
      "ehr": "DRUG_EXPOSURE may be orders, not dispensings (overcounts initiation); link to fills when possible. CONDITION_OCCURRENCE mixes problem-list, encounter, and resolved diagnoses; anchor phenotypes with measurements/labs and visit context, and treat visit-derived OBSERVATION_PERIOD as an overstatement of true observability.",
      "registry": "Strong adjudicated outcomes and severity but intermittent observation periods and weak drug capture; map endpoints to standard concepts carefully and link to claims for exposure and a death index for censoring.",
      "linked": "Ideal substrate (EHR severity + claims completeness + adjudicated/mortality outcomes) but linkage selects a subset and order/fill/service date discrepancies plus competing observation periods must be reconciled before time-zero assignment."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import duckdb\n\nWASHOUT_DAYS = 365             # drug-free + continuously-observed lookback defining \"new user\"\nSTUDY_INGREDIENT_ID = 0        # standard RxNorm ingredient concept_id for the study drug\nCOMPARATOR_INGREDIENT_ID = 0   # standard RxNorm ingredient concept_id for the active comparator\n\nCOHORT_SQL = \"\"\"\nWITH arm_drugs AS (  -- expand each ingredient to all descendant standard drug concepts (portable across sites)\n    SELECT descendant_concept_id AS drug_concept_id, 'STUDY' AS arm\n    FROM concept_ancestor WHERE ancestor_concept_id = $study_ing\n    UNION ALL\n    SELECT descendant_concept_id, 'COMPARATOR'\n    FROM concept_ancestor WHERE ancestor_concept_id = $comp_ing\n),\nexposures AS (  -- every qualifying fill of either arm, tagged with its arm\n    SELECT de.person_id, de.drug_exposure_start_date AS dt, ad.arm\n    FROM drug_exposure de JOIN arm_drugs ad USING (drug_concept_id)\n),\nfirst_fill AS (  -- candidate time zero = earliest fill of either arm; arm from that day's drug\n    SELECT person_id, dt AS index_date, arm\n    FROM (SELECT *, ROW_NUMBER() OVER (PARTITION BY person_id ORDER BY dt) rn FROM exposures)\n    WHERE rn = 1\n),\nnew_users AS (  -- incident restriction: no study/comparator fill in the washout window before index\n    SELECT f.* FROM first_fill f\n    WHERE NOT EXISTS (\n        SELECT 1 FROM exposures e\n        WHERE e.person_id = f.person_id\n          AND e.dt <  f.index_date\n          AND e.dt >= f.index_date - INTERVAL ($washout) DAY)\n)\nSELECT n.person_id, n.arm, n.index_date,\n       n.index_date - INTERVAL ($washout) DAY AS baseline_start\nFROM new_users n JOIN observation_period op USING (person_id)  -- observability must span the whole washout\nWHERE op.observation_period_start_date <= n.index_date - INTERVAL ($washout) DAY\n  AND op.observation_period_end_date   >= n.index_date\n\"\"\"\n\ndef build_omop_new_user_cohort(con: duckdb.DuckDBPyConnection):\n    return 
con.execute(COHORT_SQL, {\n        \"study_ing\": STUDY_INGREDIENT_ID, \"comp_ing\": COMPARATOR_INGREDIENT_ID,\n        \"washout\": WASHOUT_DAYS,\n    }).df()",
        "description": "CDM-native new-user cohort construction with parameterized SQL via DuckDB/SQLAlchemy against standard OMOP tables.\nRequired CDM tables (post-ETL, standard concepts):\n  observation_period : person_id, observation_period_start_date, observation_period_end_date\n  drug_exposure      : person_id, drug_concept_id (standard), drug_exposure_start_date, days_supply\n  concept_ancestor   : ancestor_concept_id, descendant_concept_id   # RxNorm ingredient -> all clinical drugs\nSTUDY_INGREDIENT_ID / COMPARATOR_INGREDIENT_ID are standard RxNorm ingredient concept_ids; the query expands them to all\ndescendants so the same definition resolves to every site's source drugs. Returns one row per eligible new initiator with\narm and time zero. Build covariates only from [index_date - WASHOUT_DAYS, index_date]; apply outcome/censoring identically.",
        "dependencies": [
          "duckdb",
          "pandas"
        ],
        "source_citations": [
          "overhage-2012",
          "hripcsak-2016"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(DatabaseConnector)\nlibrary(SqlRender)\n\nbuildOmopNewUserCohort <- function(connection, cdmDatabaseSchema,\n                                   studyIngredient, comparatorIngredient,\n                                   washoutDays = 365L, targetDialect = \"postgresql\") {\n  sql <- \"\n  WITH arm_drugs AS (  -- expand ingredients to all descendant standard drugs (portable concept set)\n    SELECT descendant_concept_id AS drug_concept_id, 'STUDY' AS arm\n    FROM @cdm.concept_ancestor WHERE ancestor_concept_id = @study_ing\n    UNION ALL\n    SELECT descendant_concept_id, 'COMPARATOR'\n    FROM @cdm.concept_ancestor WHERE ancestor_concept_id = @comp_ing\n  ),\n  exposures AS (\n    SELECT de.person_id, de.drug_exposure_start_date AS dt, ad.arm\n    FROM @cdm.drug_exposure de JOIN arm_drugs ad ON de.drug_concept_id = ad.drug_concept_id\n  ),\n  first_fill AS (\n    SELECT person_id, index_date, arm FROM (\n      SELECT person_id, dt AS index_date, arm,\n             ROW_NUMBER() OVER (PARTITION BY person_id ORDER BY dt) AS rn\n      FROM exposures) t WHERE rn = 1\n  ),\n  new_users AS (  -- new-user restriction within the washout window\n    SELECT f.* FROM first_fill f\n    WHERE NOT EXISTS (\n      SELECT 1 FROM exposures e\n      WHERE e.person_id = f.person_id AND e.dt < f.index_date\n        AND e.dt >= DATEADD(DAY, -@washout, f.index_date))\n  )\n  SELECT n.person_id, n.arm, n.index_date,\n         DATEADD(DAY, -@washout, n.index_date) AS baseline_start\n  FROM new_users n\n  JOIN @cdm.observation_period op ON op.person_id = n.person_id\n  WHERE op.observation_period_start_date <= DATEADD(DAY, -@washout, n.index_date)\n    AND op.observation_period_end_date   >= n.index_date;\"\n\n  rendered <- SqlRender::render(sql, cdm = cdmDatabaseSchema,\n                               study_ing = studyIngredient, comp_ing = comparatorIngredient,\n                               washout = washoutDays)\n  rendered <- SqlRender::translate(rendered, 
targetDialect = targetDialect)\n  DatabaseConnector::querySql(connection, rendered)\n}",
        "description": "CDM-native cohort construction using the OHDSI HADES idiom: SqlRender to author portable parameterized SQL and\nDatabaseConnector to execute it against a remote CDM (this is how OHDSI network studies actually ship code). Parameters:\n  cdmDatabaseSchema    : schema holding standard OMOP tables (drug_exposure, observation_period, concept_ancestor)\n  studyIngredient      : standard RxNorm ingredient concept_id (study drug)\n  comparatorIngredient : standard RxNorm ingredient concept_id (active comparator)\nRender+translate makes the SQL run on Postgres, SQL Server, Redshift, etc.; returns one row per new initiator with arm\nand index_date. Measure covariates only in [index_date - washoutDays, index_date].",
        "dependencies": [
          "DatabaseConnector",
          "SqlRender"
        ],
        "source_citations": [
          "hripcsak-2015",
          "suchard-2019"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "%let washout   = 365;\n%let study_ing = 0;   /* standard RxNorm ingredient concept_id: study drug      */\n%let comp_ing  = 0;   /* standard RxNorm ingredient concept_id: active comparator */\n\n/* Expand each ingredient to all descendant standard drug concepts (portable concept set). */\nproc sql;\n  create table arm_drugs as\n    select descendant_concept_id as drug_concept_id, 'STUDY' as arm length=12\n    from cdm.concept_ancestor where ancestor_concept_id = &study_ing\n    union all\n    select descendant_concept_id, 'COMPARATOR'\n    from cdm.concept_ancestor where ancestor_concept_id = &comp_ing;\n\n  /* Every qualifying fill of either arm. */\n  create table exposures as\n    select de.person_id, de.drug_exposure_start_date as dt format=date9., ad.arm\n    from cdm.drug_exposure de inner join arm_drugs ad\n      on de.drug_concept_id = ad.drug_concept_id;\nquit;\n\n/* Candidate time zero = earliest fill of either arm; arm = drug dispensed that day. */\nproc sort data=exposures; by person_id dt; run;\ndata first_fill;\n  set exposures; by person_id dt;\n  if first.person_id;\n  rename dt = index_date;\nrun;\n\n/* New-user restriction: exclude anyone with a study/comparator fill in the washout before index. */\nproc sql;\n  create table new_users as\n  select f.* from first_fill f\n  where not exists (\n    select 1 from exposures e\n    where e.person_id = f.person_id\n      and e.dt <  f.index_date\n      and e.dt >= f.index_date - &washout);\n\n  /* Observability must span the full washout through index (no out-of-observation time). */\n  create table cohort as\n  select n.person_id, n.arm, n.index_date,\n         n.index_date - &washout as baseline_start format=date9.\n  from new_users n\n  where exists (\n    select 1 from cdm.observation_period op\n    where op.person_id = n.person_id\n      and op.observation_period_start_date <= n.index_date - &washout\n      and op.observation_period_end_date   >= n.index_date);\nquit;",
        "description": "CDM-native new-user cohort construction in SAS PROC SQL against standard OMOP tables in libref CDM:\n  cdm.drug_exposure      : person_id, drug_concept_id (standard), drug_exposure_start_date, days_supply\n  cdm.observation_period : person_id, observation_period_start_date, observation_period_end_date\n  cdm.concept_ancestor   : ancestor_concept_id, descendant_concept_id\nMacro vars hold the standard RxNorm ingredient concept_ids and washout. Concept sets are expanded via CONCEPT_ANCESTOR so\narms resolve to all descendant drugs. Output work.cohort has one row per new initiator; build covariates from\n[index_date-&washout, index_date] only and apply outcome/censoring rules identically across arms.",
        "dependencies": [],
        "source_citations": [
          "overhage-2012"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Raw[Site-specific claims / EHR / registry] --> ETL[ETL: source codes -> standard concepts<br/>NDC->RxNorm, ICD->SNOMED]\n  ETL --> CDM[(OMOP CDM tables)]\n  CDM --> OP[OBSERVATION_PERIOD<br/>when person is observable]\n  CDM --> DE[DRUG_EXPOSURE / DRUG_ERA<br/>exposure episodes]\n  CDM --> CO[CONDITION_OCCURRENCE / CONDITION_ERA<br/>outcomes & comorbidity]\n  CDM --> CA[CONCEPT_ANCESTOR<br/>ingredient -> descendant drugs]\n  OP --> Cohort[Portable cohort definition<br/>eligibility, time zero, time-at-risk]\n  DE --> Cohort\n  CO --> Cohort\n  CA --> Cohort\n  Cohort --> Net[Same SQL runs at each network site<br/>only aggregate, calibrated results returned]",
        "caption": "OMOP CDM data flow. Heterogeneous source data is ETL'd to standard concepts; the standardized tables let one portable cohort definition execute identically across a distributed network, with only aggregate results shared.",
        "alt_text": "Flowchart from raw site data through ETL to OMOP CDM tables (observation period, drug exposure/era, condition occurrence/era, concept ancestor), into a portable cohort definition that runs across network sites returning aggregate results.",
        "source_type": "illustrative",
        "source_citations": [
          "hripcsak-2016"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Q{What does the analysis need?} -->|Coarse ever-exposed / prevalence| Era[Use DRUG_ERA / CONDITION_ERA<br/>standardized, fixed persistence gap]\n  Q -->|Time-varying / new-user / as-treated| Epi[Rebuild episodes from DRUG_EXPOSURE<br/>days_supply + protocol grace period]\n  Q -->|Portable phenotype across sites| Std[Standard concepts + CONCEPT_ANCESTOR<br/>never query source codes directly]\n  Q -->|Valid person-time / censoring| Obs[Bound everything by OBSERVATION_PERIOD<br/>MA-only & out-of-observation = unobserved]\n  Era --> Out[Defensible CDM analysis]\n  Epi --> Out\n  Std --> Out\n  Obs --> Out",
        "caption": "Decision logic for choosing the right CDM pattern. The default eras are for coarse characterization; time-varying estimands need hand-built episodes; portability requires standard concepts with ancestor expansion; validity requires observation-period bounding.",
        "alt_text": "Decision flowchart mapping analytic needs to OMOP CDM patterns - eras for prevalence, rebuilt episodes for time-varying analysis, standard concepts with concept ancestor for portability, and observation-period bounding for valid person-time.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "produces",
        "target_slug": "omop-observation-period-rwe",
        "notes": "Observation-period logic (observability, washout, censoring bounds) is the child concept detailing this pattern."
      },
      {
        "relation_type": "produces",
        "target_slug": "omop-drug-exposure-drug-era-rwe",
        "notes": "Drug-exposure and drug-era episode construction (days_supply, persistence gaps) is detailed in this child concept."
      },
      {
        "relation_type": "produces",
        "target_slug": "omop-condition-occurrence-condition-era-rwe",
        "notes": "Condition-occurrence vs condition-era outcome/comorbidity phenotyping is detailed in this child concept."
      },
      {
        "relation_type": "produces",
        "target_slug": "omop-concept-set-development-rwe",
        "notes": "Standard-concept set construction and CONCEPT_ANCESTOR expansion is detailed in this child concept."
      },
      {
        "relation_type": "produces",
        "target_slug": "omop-time-at-risk-cohort-exit-rwe",
        "notes": "Time-at-risk definition and cohort-exit/censoring rules are detailed in this child concept."
      },
      {
        "relation_type": "used_with",
        "target_slug": "active-comparator-new-user",
        "notes": "ACNU is the canonical design layered on CDM patterns; the worked example here is an ACNU cohort built from OMOP tables."
      },
      {
        "relation_type": "used_with",
        "target_slug": "target-trial-emulation",
        "notes": "Eligibility, treatment assignment, and time-zero of a target-trial emulation are expressed as portable SQL over CDM tables."
      },
      {
        "relation_type": "used_with",
        "target_slug": "high-dimensional-propensity-score-hdps-rwe",
        "notes": "hdPS covariate proxies are generated from CDM condition/drug/procedure tables within the observation-period-bounded baseline window."
      }
    ],
    "aliases": [
      "OMOP method patterns",
      "OMOP CDM cohort patterns",
      "OHDSI study design patterns",
      "common data model method patterns",
      "OMOP standardized analytics patterns"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "omop-concept-set-development-rwe",
    "name": "OMOP Concept Set Development",
    "short_definition": "The process of building, expanding, excluding, version-freezing, and validating standardized vocabulary concept sets (code lists) in the OMOP CDM so that exposure, outcome, and cohort-qualifying phenotypes are reproducible, network-portable, and resolvable against source codes.",
    "long_description": "An **OMOP concept set** is the machine-readable code list that operationalizes a clinical idea (\"statin\nexposure,\" \"type 2 diabetes,\" \"acute myocardial infarction\") against the OMOP Common Data Model. It is not a\nflat list of ICD/NDC strings: it is a *concept-set expression* — one or more standard `concept_id`s, each\noptionally flagged to **include descendants** (pull every child via `CONCEPT_ANCESTOR`), **include mapped**\nsource codes, or be an **exclusion** — that resolves to a final set of standard concept_ids at a fixed\nvocabulary version. Concept sets are the atomic building block of every phenotype: exposures (drug eras),\noutcomes (condition/procedure occurrences), and cohort inclusion/exclusion criteria are all assembled from\nthem, so a single mis-specified concept set silently propagates into the numerator, denominator, and follow-up\nof the entire study.\n\n**Core conceptual distinction.** Three things are conflated by novices and must be kept separate. (1)\n*Source code vs standard concept*: raw data carry **source** codes (ICD-9/10-CM, NDC, CPT, local lab codes) in\n`*_source_value`/`*_source_concept_id`; OMOP's ETL maps these to **standard** concepts (SNOMED for conditions,\nRxNorm for drugs, LOINC for measurements) via `CONCEPT_RELATIONSHIP` (\"Maps to\"). A concept set is normally\ndefined on *standard* concepts (`standard_concept = 'S'`) so it is vocabulary- and source-agnostic, but it only\ncaptures person-time whose source codes actually mapped. (2) *Concept set vs phenotype*: the concept set is the\ncode list; the **phenotype/cohort definition** adds the entry event logic, occurrence counts, care-setting,\nand time windows (e.g., the 1-inpatient-or-2-outpatient rule) on top of it. 
(3) *Expression vs resolved set*:\nthe saved expression (include-descendants flags + exclusions) is stable, but the **resolved** list of\nconcept_ids changes whenever the vocabulary version changes — so the version-frozen resolved set, not just the\nexpression, is the reproducible artifact. The deliverable is a versioned, human-reviewed, machine-resolvable\nset plus its measured operating characteristics (PPV/sensitivity from chart review or PheValuator).\n\n**Pros, cons, and trade-offs**.\n- **vs hand-curated source code lists (ICD/NDC strings in a SAS macro):** A standard-concept set is portable\n  across data partners and vocabulary updates, auto-expands to descendants so new child codes are captured\n  without re-editing, and is reviewable in tooling (ATLAS). Cost: it depends on ETL mapping quality — anything\n  mapped to `concept_id = 0` is invisible to a standard-concept set, whereas a raw source-code list would have\n  caught it. **Prefer the standard-concept set** for network/multi-database studies; keep a source-code\n  fallback for unmapped codes.\n- **vs include-descendants \"grab everything\":** Descendant expansion is the strength of OMOP (define\n  \"ACE inhibitors\" at the RxNorm ingredient/ATC level and inherit every product), but blind expansion pulls in\n  unintended children (combination products, veterinary forms, the wrong laterality). Cost of *not* expanding:\n  you miss legitimate children and undercount. **Prefer descendant expansion with explicit exclusions**, then\n  audit the resolved list.\n- **vs a single shared \"off-the-shelf\" PheKB/OHDSI library set:** Reusing a validated, published concept set\n  buys validity evidence and comparability. Cost: it may not match your indication, era, or data partner's\n  coding habits. **Prefer a published set as the starting point**, then localize and re-validate.\n\n**When to use**. 
Whenever a study runs on OMOP-CDM data (single-site or an OHDSI/DARWIN EU network study) and\nany exposure, outcome, or eligibility criterion must be defined from vocabularies; whenever you need a\nreproducible, auditable code list that survives vocabulary updates and travels across data partners; whenever a\nregulator or HTA body will ask for the exact codes and their operating characteristics.\n\n**When NOT to use — and when it is actively misleading or dangerous**.\n- **Source data are not in OMOP-CDM.** On raw claims/EHR, build and validate source-code lists directly; a\n  half-mapped local CDM gives you the illusion of standardization with hidden `concept_id = 0` loss.\n- **You treat the unversioned expression as the reproducible object.** Re-resolving the same expression against\n  a newer vocabulary silently changes the captured concepts (codes get re-mapped, deprecated, or re-parented);\n  a result \"reproduced\" on a different vocabulary version is not the same study. Freeze and report the\n  vocabulary version and the resolved concept_id list.\n- **Descendant expansion across a known structural break.** Defining a condition set only on ICD-10-mapped\n  SNOMED descendants will *miss all pre-October-2015 ICD-9 person-time* unless the source-to-standard map\n  bridges both — producing a spurious incidence jump at the coding transition that looks like a real trend.\n- **You skip measuring operating characteristics.** A concept set with no PPV/sensitivity estimate, applied to\n  a rare outcome, can be dominated by false positives; reporting it as a validated phenotype is misleading.\n- **MA-only or capitated person-time.** If a patient's care is paid under arrangements that do not generate\n  adjudicated diagnosis/drug claims, there are no source codes to map at all — the concept set captures\n  *nothing*, and \"no event\" is missingness, not absence.\n\n**Data-source operational depth**.\n- **Claims (FFS vs MA, commercial):** Exposures map NDC → RxNorm; 
conditions map ICD-9/10-CM → SNOMED;\n  procedures map CPT/HCPCS → standard. Failure modes: (a) **`concept_id = 0`** — source codes the ETL could\n  not map (retired NDCs, local/miscellaneous codes, repackaged NDCs) drop out of any standard-concept set;\n  quantify `% of source rows with target_concept_id = 0` by data partner and decade and bridge with a source-\n  concept fallback. (b) **ICD-9→ICD-10 break (Oct 2015)** changes the descendant tree and code granularity —\n  validate capture on both sides of the break. (c) **NDC repackaging churn** means branded-NDC lists rot;\n  define drug sets at the RxNorm **ingredient** level and let descendants resolve products. (d) **Medicare\n  Advantage / capitated** person-time lacks FFS-adjudicated claims, so the set captures no codes — restrict to\n  enrollees with the relevant benefit (A/B/D or commercial medical+pharmacy) and exclude unobservable spans.\n- **EHR:** Conditions arrive from problem lists, encounter diagnoses, and notes; structured capture is\n  visit-driven and the same concept may be entered as a problem, a billing diagnosis, or only free text.\n  Concept sets over structured fields miss note-only mentions; NLP-derived concepts and external-care leakage\n  (care delivered outside the system) bias capture. Lab-based criteria need LOINC harmonization and unit\n  reconciliation before a measurement concept set is trustworthy.\n- **Registry:** Disease and severity are often coded in registry-specific schemes that may not map cleanly to\n  standard vocabularies; adjudicated outcomes are high-validity but require a crosswalk to OMOP concepts and\n  explicit handling of registry completeness.\n- **Linked claims–EHR–registry:** The richest substrate, but each source maps to standard concepts through a\n  *different* ETL; the same clinical event can appear as distinct concept_ids across sources. 
Reconcile by\n  de-duplicating at the standard-concept level and documenting which source contributed each capture, and watch\n  for date discrepancies (claim service date vs EHR encounter date) before assigning the phenotype index date.\n\n**Worked example (claims, OMOP-CDM).** Build a **second-generation ACE-inhibitor exposure** concept set and\ncount incident users. (1) *Pick the standard anchor*: in `CONCEPT`, select the RxNorm ingredient class for ACE\ninhibitors (`vocabulary_id = 'RxNorm'`, `concept_class_id = 'Ingredient'`, `standard_concept = 'S'`) — say the\ningredient `concept_id`s for lisinopril, ramipril, enalapril, etc. (2) *Expand descendants*: join to\n`CONCEPT_ANCESTOR` to pull every `descendant_concept_id` (clinical drugs, branded products, strengths) under\nthose ingredients. (3) *Exclude*: drop combination products you do not want (e.g., ACE-inhibitor + thiazide\nfixed-dose combos) by listing their concept_ids as an exclusion branch. (4) *Freeze*: snapshot the resolved\n`concept_id` list together with the **vocabulary version** to a versioned JSON expression checked into the\nstudy repo. (5) *Apply to the CDM*: select `DRUG_EXPOSURE` rows where `drug_concept_id IN (resolved set)`; for\nrows with `drug_concept_id = 0`, reconcile via `drug_source_concept_id`/`drug_source_value` against a\nsource-code fallback and report the unmapped fraction. (6) *Define incident use*: require continuous\nobservation (`OBSERVATION_PERIOD`) covering a 365-day washout with no prior in-set fill, set index =\n`drug_exposure_start_date` of the first in-set fill, and confirm indication with a condition concept set in the\nbaseline window. (7) *Numerator/denominator check*: report distinct `person_id` count, fills per person, the\n`% concept_id = 0` reconciled, and capture on both sides of the ICD-9→ICD-10 break for any condition\nco-criteria. 
(8) *Validate*: estimate PPV against chart review or run PheValuator, and report sensitivity to the\ninclude-descendants and exclusion choices.",
    "primary_category": "Exposure_Definition",
    "tags": [
      "omop-cdm",
      "concept-set",
      "code-list",
      "phenotyping",
      "standardized-vocabularies",
      "ohdsi",
      "rxnorm-snomed",
      "source-to-standard-mapping"
    ],
    "applies_to_study_types": [
      "claims_analysis",
      "ehr_study",
      "registry_linkage",
      "comparative_effectiveness",
      "drug_utilization"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1136/amiajnl-2012-001145",
        "url": "https://doi.org/10.1136/amiajnl-2012-001145",
        "citation_text": "Hripcsak G, Albers DJ. Next-generation phenotyping of electronic health records. Journal of the American Medical Informatics Association. 2013;20(1):117-121.",
        "year": 2013,
        "authors_short": "Hripcsak & Albers",
        "notes": "Foundational statement of computable phenotyping over structured health records — the conceptual basis for vocabulary-driven concept sets and reproducible code lists."
      },
      {
        "role": "explain",
        "doi": "10.1371/journal.pmed.1001885",
        "url": "https://doi.org/10.1371/journal.pmed.1001885",
        "citation_text": "Benchimol EI, Smeeth L, Guttmann A, et al. The REporting of studies Conducted using Observational Routinely-collected health Data (RECORD) Statement. PLoS Medicine. 2015;12(10):e1001885.",
        "year": 2015,
        "authors_short": "Benchimol et al.",
        "notes": "RECORD reporting standard requiring transparent, complete code lists and validation evidence for studies using routinely collected data — the governance frame for concept-set documentation."
      },
      {
        "role": "demonstrate",
        "doi": "10.1016/j.jbi.2019.103258",
        "url": "https://doi.org/10.1016/j.jbi.2019.103258",
        "citation_text": "Swerdel JN, Hripcsak G, Ryan PB. PheValuator: development and evaluation of a phenotype algorithm evaluator. Journal of Biomedical Informatics. 2019;97:103258.",
        "year": 2019,
        "authors_short": "Swerdel et al.",
        "notes": "OHDSI method for estimating sensitivity, specificity, and PPV of OMOP phenotype definitions without exhaustive chart review — the standard way to validate a concept-set-based phenotype."
      },
      {
        "role": "use",
        "doi": "10.1371/journal.pone.0099825",
        "url": "https://doi.org/10.1371/journal.pone.0099825",
        "citation_text": "Springate DA, Kontopantelis E, Ashcroft DM, et al. ClinicalCodes: an online clinical codes repository to improve the validity and reproducibility of research using electronic medical records. PLoS ONE. 2014;9(6):e99825.",
        "year": 2014,
        "authors_short": "Springate et al.",
        "notes": "Demonstrates the reproducibility imperative for sharing and version-controlling code lists — directly motivates version-frozen, citable concept-set expressions."
      }
    ],
    "plain_language_summary": "An OMOP concept set is a reusable, shareable code list that translates a clinical idea — like 'type 2 diabetes' or 'ACE inhibitor use' — into a precise list of standardized numeric identifiers that any OMOP database can look up. You pick one or more anchor concepts, expand them to include all child codes in the vocabulary hierarchy, and carve out any children you do not want. Because the list is built from standard vocabulary codes rather than database-specific billing strings, the same concept set can be applied across hospitals, insurers, or countries without rewriting it for each. One honest caveat: if the ETL that converted raw data to OMOP failed to map a source code, that person's record will be invisible to your concept set no matter how carefully it is built.",
    "key_terms": [
      {
        "term": "concept set",
        "definition": "A saved, versioned list of standardized numeric codes (concept_ids) that together define one clinical idea, such as all recorded diagnoses of type 2 diabetes in an OMOP database."
      },
      {
        "term": "vocabulary",
        "definition": "A standardized coding system — such as SNOMED for diagnoses, RxNorm for drugs, or LOINC for lab tests — that OMOP uses as its common language across different data sources."
      },
      {
        "term": "concept_id",
        "definition": "The unique integer that OMOP assigns to every clinical concept; databases from different hospitals or insurers use the same concept_id for the same clinical idea."
      },
      {
        "term": "descendant expansion",
        "definition": "A flag you set on an ancestor concept that automatically pulls in all more-specific child codes beneath it, so defining 'ACE inhibitor' at the drug-class level inherits every individual ACE-inhibitor product automatically."
      },
      {
        "term": "source-to-standard mapping",
        "definition": "The ETL step that converts raw billing codes (ICD-10-CM, NDC, CPT) into standard OMOP concept_ids; if a source code has no mapping, it becomes concept_id 0 and is invisible to a standard concept set."
      }
    ],
    "worked_example": {
      "scenario": "A pharmacoepidemiology team wants to identify all patients with a recorded type 2 diabetes diagnosis in a claims database that has been converted to OMOP CDM. They start with the SNOMED concept for 'Type 2 diabetes mellitus' as their seed, expand to all descendant codes (e.g., 'Type 2 diabetes with diabetic nephropathy'), and then exclude a small number of child codes that refer to neonatal or secondary diabetes — categories the team's clinical reviewer flagged as outside the study population. The table below shows the concept-set expression they build, with each row representing one entry and its include/exclude flag.",
      "dataset": {
        "caption": "Concept-set expression for type 2 diabetes mellitus (SNOMED, OMOP vocabulary). Each row is one entry in the expression; the include_descendants flag tells OMOP to pull all child codes under that ancestor.",
        "columns": [
          "concept_id",
          "concept_name",
          "vocabulary",
          "include_descendants",
          "flag"
        ],
        "rows": [
          [
            201826,
            "Type 2 diabetes mellitus",
            "SNOMED",
            true,
            "INCLUDE"
          ],
          [
            4058243,
            "Secondary diabetes mellitus",
            "SNOMED",
            true,
            "EXCLUDE"
          ],
          [
            4299544,
            "Neonatal diabetes mellitus",
            "SNOMED",
            true,
            "EXCLUDE"
          ]
        ]
      },
      "steps": [
        "Start with the seed concept: concept_id 201826 ('Type 2 diabetes mellitus') is the SNOMED standard concept at the right level of specificity — broad enough to capture all type 2 patients, specific enough to exclude type 1.",
        "Turn on include_descendants for the seed: OMOP's CONCEPT_ANCESTOR table now automatically adds all more-specific child codes beneath 201826, such as 'Type 2 diabetes mellitus with diabetic chronic kidney disease' (concept_id 4299544 is a different example — the tree has dozens of such children).",
        "A clinical reviewer scans the full resolved list of descendants and flags two branches to exclude: secondary diabetes (caused by another condition, not the same population) and neonatal diabetes (a distinct neonatal entity); both are added as EXCLUDE rows with include_descendants also turned on so their own children are also removed.",
        "The expression is saved and then resolved against the current OMOP vocabulary version: OMOP looks up every descendant of 201826, removes every descendant of 4058243 and 4299544, and returns the final list of concept_ids.",
        "That final resolved list — not just the three-row expression — is saved alongside the vocabulary version number to the study repository so any future analyst can reproduce the exact same code list."
      ],
      "result": "The concept-set expression contains 3 rows (1 include anchor, 2 exclude branches). When resolved against the vocabulary, it returns a specific list of standard concept_ids covering type 2 diabetes and all its clinical subtypes, with secondary and neonatal forms removed. Any DRUG_EXPOSURE or CONDITION_OCCURRENCE row in the database whose standard concept_id appears in that resolved list is counted; the expression is reproducible by anyone who has the same vocabulary version, and it travels unchanged to any other OMOP database in a network study."
    },
    "prerequisites": [
      "omop-cdm-method-patterns-rwe",
      "omop-condition-occurrence-condition-era-rwe",
      "omop-drug-exposure-drug-era-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Standard-concept set with descendant expansion",
        "description": "Define the set on standard concepts (SNOMED/RxNorm/LOINC) and include descendants via CONCEPT_ANCESTOR so child products/subtypes are inherited automatically; vocabulary- and source-agnostic.",
        "edge_cases": [
          "Descendant expansion silently pulls in unintended children (combination products, veterinary or non-human forms, wrong laterality) that must be removed with an exclusion branch.",
          "Anything the ETL mapped to concept_id = 0 is invisible to a pure standard-concept set, so unmapped source codes are lost unless a source fallback is added."
        ],
        "data_source_notes": "claims: anchor drug sets at the RxNorm ingredient level so NDC repackaging does not require edits; EHR: confirm the source-to-standard map covers problem-list and encounter-diagnosis sources."
      },
      {
        "name": "Source-concept fallback for unmapped codes",
        "description": "Supplement the standard-concept set with source_concept_id / source_value logic to recover person-time whose source codes mapped to concept_id = 0.",
        "edge_cases": [
          "Source fallbacks are data-partner-specific and break portability; document them per database.",
          "Double-counting risk when a source code both maps to standard AND is caught by the fallback."
        ],
        "data_source_notes": "claims: target retired/repackaged NDCs and local miscellaneous codes; quantify the % concept_id = 0 recovered by partner and by decade (ICD-9 vs ICD-10 era)."
      },
      {
        "name": "Version-frozen, citable concept-set expression",
        "description": "Resolve the expression against a fixed OMOP vocabulary version and check the resolved concept_id list into version control (e.g., ATLAS JSON), so the study is reproducible across vocabulary updates.",
        "edge_cases": [
          "Re-resolving the same expression on a newer vocabulary changes the captured concepts (re-mapping, deprecation, re-parenting) — the frozen resolved set, not just the expression, is the reproducible artifact.",
          "Network partners on different vocabulary versions can resolve the \"same\" set differently."
        ],
        "data_source_notes": "Record vocabulary_version from the VOCABULARY table alongside the resolved set in every run."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Hand-curated source-code lists (ICD/NDC strings in a macro)",
        "pros_of_this": "Portable across data partners and vocabulary updates; auto-expands to descendants; reviewable in standard tooling; survives new child codes without re-editing.",
        "cons_of_this": "Depends on ETL mapping quality — codes mapped to concept_id = 0 are invisible to a standard set, whereas a raw source list would have caught them.",
        "when_to_prefer": "Network or multi-database OMOP studies; keep a source-code fallback for unmapped codes."
      },
      {
        "compared_to": "Blind include-descendants expansion",
        "pros_of_this": "Inheriting children captures new products/subtypes automatically and reduces maintenance.",
        "cons_of_this": "Pulls in unintended children (combination products, wrong forms) unless explicit exclusions and a resolved-list audit are added.",
        "when_to_prefer": "Use descendant expansion with explicit exclusions and a manual review of the resolved set."
      },
      {
        "compared_to": "Reusing an off-the-shelf published/library concept set unmodified",
        "pros_of_this": "Inherits prior validity evidence and cross-study comparability.",
        "cons_of_this": "May not match the local indication, era, or data partner coding habits and may need re-validation.",
        "when_to_prefer": "Start from a published set, then localize and re-estimate operating characteristics."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Map NDC->RxNorm (anchor at ingredient level), ICD-9/10-CM->SNOMED, CPT/HCPCS->standard. Report % source rows with target_concept_id = 0 by partner and by decade; bridge the ICD-9->ICD-10 (Oct 2015) break; add a source_concept fallback for unmapped codes; exclude MA-only/capitated person-time with no adjudicated claims to map.",
      "ehr": "Structured capture is visit-driven and split across problem lists, encounter diagnoses, and notes; concept sets over structured fields miss note-only mentions and external-care leakage. Harmonize LOINC and units before trusting measurement concept sets.",
      "registry": "Registry-specific coding may not map cleanly to standard vocabularies; build an explicit crosswalk and account for registry completeness; adjudicated outcomes are high-validity but need OMOP concept mapping.",
      "linked": "Each source maps through a different ETL, so the same event can appear as different concept_ids; de-duplicate at the standard-concept level, document the contributing source, and reconcile service/encounter dates before assigning the phenotype index date."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\n\n# Concept-set expression: anchor ingredient concept_ids to INCLUDE (with descendants),\n# and combination-product concept_ids to EXCLUDE. These come from clinical review.\nINCLUDE_INGREDIENTS = [1308216, 1341927, 1340128]   # lisinopril, ramipril, enalapril (RxNorm ingredients)\nEXCLUDE_CONCEPTS    = [1310756]                      # e.g., a fixed-dose ACEi+thiazide combination\n\ndef resolve_concept_set(concept, concept_ancestor,\n                        include_ingredients, exclude_concepts) -> pd.Series:\n    # Standard concepts only; ingredients must be standard 'S' to be valid anchors.\n    std = concept.loc[concept[\"standard_concept\"] == \"S\", \"concept_id\"]\n    # Descendant expansion: every clinical drug / product under the chosen ingredients.\n    desc = concept_ancestor.loc[\n        concept_ancestor[\"ancestor_concept_id\"].isin(include_ingredients),\n        \"descendant_concept_id\",\n    ]\n    included = set(desc) | set(include_ingredients)\n    resolved = (included & set(std)) - set(exclude_concepts)\n    return pd.Series(sorted(resolved), name=\"concept_id\")\n\ndef exposed_counts(drug_exposure, resolved) -> dict:\n    rset = set(resolved)\n    on_standard = drug_exposure[drug_exposure[\"drug_concept_id\"].isin(rset)]\n    # Unmapped fallback: rows that failed source->standard mapping (drug_concept_id == 0).\n    unmapped = drug_exposure[drug_exposure[\"drug_concept_id\"] == 0]\n    return {\n        \"resolved_concepts\": len(rset),\n        \"exposed_persons\": on_standard[\"person_id\"].nunique(),\n        \"fills\": len(on_standard),\n        \"pct_unmapped_rows\": round(100 * len(unmapped) / max(len(drug_exposure), 1), 2),\n    }\n\nresolved = resolve_concept_set(concept, concept_ancestor,\n                               INCLUDE_INGREDIENTS, EXCLUDE_CONCEPTS)\nprint(exposed_counts(drug_exposure, resolved))",
        "description": "Resolve an OMOP concept-set expression and count distinct exposed persons from claims-style OMOP tables.\nRequired inputs (already loaded as DataFrames from the CDM):\n  concept           : concept_id, vocabulary_id, concept_class_id, standard_concept, concept_name\n  concept_ancestor  : ancestor_concept_id, descendant_concept_id\n  drug_exposure     : person_id, drug_concept_id, drug_source_concept_id, drug_exposure_start_date (datetime)\nReturns the resolved standard concept_ids and exposed-person counts, plus the concept_id = 0 (unmapped)\nfallback fraction. Freeze `resolved` to versioned JSON with the vocabulary version before using downstream.",
        "dependencies": [
          "pandas"
        ],
        "source_citations": [
          "springate-2014"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\n\nINCLUDE_INGREDIENTS <- c(1308216L, 1341927L, 1340128L)  # lisinopril, ramipril, enalapril\nEXCLUDE_CONCEPTS    <- c(1310756L)                       # ACEi+thiazide combination to drop\n\nresolve_concept_set <- function(concept, concept_ancestor,\n                                include_ingredients, exclude_concepts) {\n  setDT(concept); setDT(concept_ancestor)\n  std  <- concept[standard_concept == \"S\", concept_id]\n  desc <- concept_ancestor[ancestor_concept_id %in% include_ingredients,\n                           descendant_concept_id]\n  included <- union(desc, include_ingredients)\n  sort(setdiff(intersect(included, std), exclude_concepts))\n}\n\nexposed_counts <- function(drug_exposure, resolved) {\n  setDT(drug_exposure)\n  on_std   <- drug_exposure[drug_concept_id %in% resolved]\n  unmapped <- drug_exposure[drug_concept_id == 0L]\n  list(\n    resolved_concepts = length(resolved),\n    exposed_persons   = uniqueN(on_std$person_id),\n    fills             = nrow(on_std),\n    pct_unmapped_rows = round(100 * nrow(unmapped) / max(nrow(drug_exposure), 1L), 2)\n  )\n}\n\nresolved <- resolve_concept_set(concept, concept_ancestor,\n                                INCLUDE_INGREDIENTS, EXCLUDE_CONCEPTS)\nstr(exposed_counts(drug_exposure, resolved))",
        "description": "Resolve an OMOP concept-set expression and count exposed persons with data.table. Inputs mirror the Python\nversion (CDM tables loaded as data.tables):\n  concept          : concept_id, vocabulary_id, concept_class_id, standard_concept, concept_name\n  concept_ancestor : ancestor_concept_id, descendant_concept_id\n  drug_exposure    : person_id, drug_concept_id, drug_source_concept_id, drug_exposure_start_date (Date)\nFreeze `resolved` plus the VOCABULARY version to JSON before downstream phenotype logic.",
        "dependencies": [
          "data.table"
        ],
        "source_citations": [
          "springate-2014"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* Include ingredient anchors (with descendants) and exclusion concepts come from clinical review. */\n%let include = 1308216,1341927,1340128;   /* lisinopril, ramipril, enalapril RxNorm ingredients */\n%let exclude = 1310756;                    /* ACEi+thiazide combination to drop */\n\n/* 1) Resolve: standard descendants of the anchors, minus exclusions. */\nproc sql;\n  create table resolved as\n  select distinct ca.descendant_concept_id as concept_id\n  from work.concept_ancestor ca\n  inner join work.concept c\n    on c.concept_id = ca.descendant_concept_id\n   and c.standard_concept = 'S'\n  where ca.ancestor_concept_id in (&include)\n    and ca.descendant_concept_id not in (&exclude);\nquit;\n\n/* 2) Apply the resolved set to DRUG_EXPOSURE; count distinct exposed persons and fills. */\nproc sql;\n  create table exposed as\n  select de.person_id, de.drug_concept_id\n  from work.drug_exposure de\n  inner join resolved r\n    on de.drug_concept_id = r.concept_id;\n\n  select count(distinct person_id) as exposed_persons,\n         count(*)                  as fills\n  from exposed;\nquit;\n\n/* 3) Unmapped fallback audit: source rows that failed source->standard mapping. */\nproc sql;\n  select sum(case when drug_concept_id = 0 then 1 else 0 end) as unmapped_rows,\n         calculated unmapped_rows / count(*) * 100 as pct_unmapped format=6.2\n  from work.drug_exposure;\nquit;",
        "description": "Resolve an OMOP concept-set expression and count exposed persons against the CDM in SAS/PROC SQL. Required\ninput tables (a libname pointing at the OMOP schema, here WORK for illustration):\n  concept           : concept_id, vocabulary_id, concept_class_id, standard_concept, concept_name\n  concept_ancestor  : ancestor_concept_id, descendant_concept_id\n  concept_relationship : concept_id_1, concept_id_2, relationship_id   /* for source fallback */\n  drug_exposure     : person_id, drug_concept_id, drug_source_concept_id, drug_source_value\nStep 1 builds the descendant-expanded, exclusion-pruned standard set; step 2 applies it; step 3 quantifies\nthe concept_id = 0 unmapped fraction. Freeze RESOLVED with the VOCABULARY version before downstream use.",
        "dependencies": [],
        "source_citations": [
          "springate-2014"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Src[Source codes in raw data<br/>ICD-9/10-CM, NDC, CPT, local lab] --> Map[Source-to-Standard map<br/>CONCEPT_RELATIONSHIP 'Maps to']\n  Map -->|mapped| Std[Standard concepts<br/>SNOMED / RxNorm / LOINC, standard_concept='S']\n  Map -->|concept_id = 0| Unmapped[Unmapped source codes<br/>source_concept fallback + audit %]\n  Std --> Pick[Pick anchor concepts<br/>e.g. RxNorm ingredient]\n  Pick --> Desc[Descendant expansion<br/>CONCEPT_ANCESTOR]\n  Desc --> Excl[Apply exclusions<br/>combination / unwanted children]\n  Excl --> Freeze[Version-frozen resolved set<br/>+ vocabulary version, JSON in repo]\n  Freeze --> Apply[Apply to CDM table<br/>drug_concept_id IN resolved set]\n  Unmapped -.recover.-> Apply\n  Apply --> Eval[Evaluate phenotype<br/>PheValuator / chart-review PPV & sensitivity]",
        "caption": "Concept-set development workflow in OMOP. Source codes are mapped to standard concepts; the expression selects anchors, expands descendants, prunes exclusions, and is version-frozen before being applied to the CDM and evaluated. Unmapped (concept_id = 0) source codes are audited and recovered via a source fallback.",
        "alt_text": "Flowchart from raw source codes through source-to-standard mapping, anchor selection, descendant expansion, exclusions, version freezing, CDM application, and phenotype evaluation, with an unmapped-code fallback branch.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Ing[Ingredient: ACE inhibitors<br/>RxNorm ingredient concept_ids] --> Lis[Lisinopril products]\n  Ing --> Ram[Ramipril products]\n  Ing --> Ena[Enalapril products]\n  Lis --> L1[Lisinopril 10 MG Oral Tablet]\n  Lis --> L2[Lisinopril 20 MG Oral Tablet]\n  Ena --> E1[Enalapril 5 MG Oral Tablet]\n  Ena --> EX[Enalapril/HCTZ combo tablet<br/>EXCLUDED]\n  style EX stroke-dasharray: 5 5",
        "caption": "Ancestor/descendant tree behind include-descendants expansion. Defining the set at the ingredient level inherits every clinical-drug descendant; an exclusion branch removes unwanted children such as fixed-dose combination products.",
        "alt_text": "Tree diagram with ACE-inhibitor ingredients at the top branching to per-drug products and individual clinical drugs, with one combination-product node marked as excluded.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "part_of",
        "target_slug": "omop-cdm-method-patterns-rwe",
        "notes": "Concept set development is a core building block within the broader OMOP CDM method-pattern family."
      },
      {
        "relation_type": "used_with",
        "target_slug": "omop-drug-exposure-drug-era-rwe",
        "notes": "Drug concept sets feed DRUG_EXPOSURE/DRUG_ERA construction; ingredient-level anchors with descendant expansion define the exposure."
      },
      {
        "relation_type": "used_with",
        "target_slug": "omop-condition-occurrence-condition-era-rwe",
        "notes": "Condition concept sets feed CONDITION_OCCURRENCE/CONDITION_ERA phenotypes for outcomes and inclusion/exclusion criteria."
      },
      {
        "relation_type": "used_with",
        "target_slug": "diagnosis-phenotype-algorithm-1ip-2op-time-window-rwe",
        "notes": "A condition concept set supplies the code list that the 1-inpatient/2-outpatient occurrence rule is applied on top of."
      },
      {
        "relation_type": "used_with",
        "target_slug": "algorithm-validation",
        "notes": "The operating characteristics (PPV/sensitivity) of a concept-set-based phenotype are established through algorithm validation against a reference standard."
      },
      {
        "relation_type": "see_also",
        "target_slug": "ehr-phenotyping-algorithms-rwe",
        "notes": "Concept sets are the vocabulary-driven core of computable EHR phenotyping algorithms."
      },
      {
        "relation_type": "see_also",
        "target_slug": "claims-outcome-algorithm-ppv-sensitivity-rwe",
        "notes": "Outcome concept sets are validated by estimating PPV and sensitivity against chart review in claims."
      }
    ],
    "aliases": [
      "concept set expression",
      "code list",
      "value set",
      "OHDSI concept set",
      "phenotype code list",
      "OMOP concept set"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "omop-condition-occurrence-condition-era-rwe",
    "name": "OMOP CONDITION_OCCURRENCE and CONDITION_ERA",
    "short_definition": "The two OMOP CDM tables that hold diagnoses (CONDITION_OCCURRENCE, one row per recorded condition event) and their derived persistence-window collapse (CONDITION_ERA), used to operationalize disease onset, recurrence, and chronic-disease history into cohort entry, comorbidity, and outcome variables.",
    "long_description": "In the **OMOP Common Data Model (CDM)**, almost every diagnosis-based variable in a real-world study is built from two\ntables. **CONDITION_OCCURRENCE** stores one row per recorded condition event: `person_id`, a standard `condition_concept_id`\n(SNOMED), `condition_start_date`/`condition_end_date`, the source code (`condition_source_value`), the\n`condition_type_concept_id` that records *how* the diagnosis was captured (inpatient primary, inpatient secondary,\noutpatient/professional claim, EHR problem list, EHR encounter diagnosis), and `condition_status_concept_id` (admitting,\nfinal, rule-out). **CONDITION_ERA** is a *derived* table: OHDSI's standard ETL collapses consecutive CONDITION_OCCURRENCE\nrows of the same condition into a single span (`condition_era_start_date`, `condition_era_end_date`,\n`condition_occurrence_count`) whenever the gap between events is no greater than a **persistence window (OHDSI default 30\ndays)**. The occurrence table is the raw evidence; the era table is one opinionated summary of it.\n\n**Core conceptual distinction**. The two tables answer different questions and fail in different ways. CONDITION_OCCURRENCE\nis the unit for *event* questions — first-ever diagnosis (incident outcome), date of an acute event, count of encounters\ncarrying a code — and it preserves the provenance you need to filter (an inpatient *primary* MI is not a rule-out MI coded\non a chest-pain visit). CONDITION_ERA is the unit for *state* questions — \"did this person have chronic condition X during\nthe baseline window?\" — because it has already stitched fragmented encounters into a presence/absence span. The catch is\nthat the era is only as good as the persistence window: with a 30-day gap, two heart-failure visits 45 days apart are *two*\neras (looks intermittent), while a 180-day gap makes them one continuous era (looks chronic). 
The window is an analytic\nchoice that silently rewrites prevalence and \"chronic\" status, so the **estimand-relevant decision** — what counts as\nhaving the condition, and over what calendar window — must be pre-specified and sensitivity-tested, not inherited from the\nETL default. Neither table validates a phenotype; both are the substrate on top of which a *condition era / occurrence-based\nalgorithm* (e.g., 1 inpatient OR 2 outpatient codes ≥30 days apart) is layered.\n\n**Pros, cons, and trade-offs**.\n- **CONDITION_ERA vs raw CONDITION_OCCURRENCE for chronic-disease ascertainment:** the era table gives you a ready-made,\n  de-duplicated presence span and avoids hand-rolling gap logic; it is the right primitive for comorbidity flags and\n  \"prevalent disease at index.\" Cost: it bakes in the 30-day gap and *discards* `condition_type_concept_id`, so you can no\n  longer require inpatient-primary position or exclude rule-out codes from inside the era. **Prefer the era** for chronic,\n  well-coded conditions where any contact implies presence; **prefer raw occurrence** whenever code position, diagnosis\n  status, or a custom 2-codes-in-N-days rule matters (most validated phenotypes).\n- **vs a bespoke phenotype/cohort algorithm (ATLAS/Capr cohort, or the 1ip-2op rule):** the eras/occurrences are inputs to\n  those algorithms, not substitutes. A naive `≥1 condition_concept_id` flag over CONDITION_OCCURRENCE is fast but has poor\n  PPV for most outcomes; a validated algorithm trades sensitivity for specificity. **Prefer the validated algorithm** for\n  any analytic outcome or eligibility criterion; reserve the raw-flag-from-era shortcut for descriptive comorbidity\n  adjustment where misclassification is non-differential and tolerable.\n- **vs source-vocabulary (ICD-9/10) logic in native claims:** working in standardized `condition_concept_id` makes code\n  lists portable across data sources and networks (the OHDSI advantage). 
Cost: mapping from source codes to SNOMED is\n  lossy and ETL-dependent; a concept that is split or deprecated upstream can change cohort counts between CDM vintages.\n  **Prefer standardized concepts** for multi-database/network studies; audit the source-to-standard mapping (and keep\n  `condition_source_value`) when a single high-stakes algorithm must match a published ICD-based definition exactly.\n\n**When to use**. Use CONDITION_OCCURRENCE to define incident/recurrent disease events, index dates, acute outcomes, and\nany algorithm that conditions on diagnosis position or status. Use CONDITION_ERA to define chronic-disease history,\nprevalent comorbidities at baseline, and presence/absence covariates where the persistence-collapsed span is the natural\nunit. Use both together in the canonical pattern: era-based comorbidities for the baseline covariate block, occurrence-based\nfirst-event logic for the outcome.\n\n**When NOT to use — and when it is actively misleading or dangerous**.\n- **Do not use CONDITION_ERA when code position or rule-out status is part of the definition.** The era has already\n  discarded `condition_type_concept_id`/`condition_status_concept_id`, so a \"rule-out cancer\" code on an imaging visit gets\n  folded into a \"cancer\" era. For outcomes like cancer, MI, or stroke this inflates incidence and biases comparative\n  estimates if coding intensity differs by arm.\n- **Do not treat the 30-day default as physiology.** Reporting \"chronic disease\" prevalence off the default gap without a\n  sensitivity analysis (30/60/180 days) is indefensible — the number is an artifact of the ETL knob.\n- **Do not read the first CONDITION_OCCURRENCE row as disease onset** without an adequate clean/washout window inside an\n  observed OBSERVATION_PERIOD. 
The first *recorded* date is when the patient entered the data with the code, not when\n  disease began (left censoring / prevalent-as-incident misclassification).\n- **Do not pool diagnosis events across data sources with different `condition_type_concept_id` mixes.** A database that is\n  mostly outpatient EHR problem-list entries and one that is mostly inpatient claims will produce non-comparable era\n  structures even for the same persistence window.\n- **Do not assume CONDITION_ERA exists or is current.** It is a derived table; some CDM instances skip it, build it with a\n  non-default gap, or fail to rebuild it after an incremental load. Verify the ETL spec and the era-build date before\n  relying on it.\n\n**Data-source operational depth**.\n- **Administrative claims:** `condition_type_concept_id` distinguishes inpatient principal/secondary from professional/\n  outpatient claims — this is the backbone of validated 1-inpatient-OR-2-outpatient rules. Failure modes: *rule-out coding*\n  (a diagnosis placed on a lab/imaging encounter to justify the test, not to assert disease) inflates occurrence counts and,\n  once collapsed, contaminates eras; require diagnosis position and/or a second confirmatory code. *Claims adjudication lag*\n  means recent person-time looks disease-free; freeze data with a run-out buffer. *Medicare Advantage-only person-time* lacks\n  fee-for-service claims, so absence of a CONDITION_OCCURRENCE row is missingness, not true absence — restrict to A/B/D FFS\n  person-time when ascertaining baseline conditions or incident outcomes.\n- **EHR:** the same condition can appear as a problem-list entry (carried forward indefinitely, no natural end date), an\n  encounter diagnosis, or a resolved condition; `condition_status_concept_id` that would disambiguate them is frequently\n  null. 
Problem-list carry-forward makes CONDITION_ERA end dates unreliable and can make a resolved condition look chronic.\n  Visit-driven capture means a patient who gets care outside the system has phantom \"gaps\" that fragment eras; reconcile with\n  the observed OBSERVATION_PERIOD before interpreting era discontinuity.\n- **Registry:** strongest for adjudicated, well-defined conditions (e.g., cancer with stage) but typically narrow in scope\n  and weak for incidental comorbidities; map registry diagnoses into CONDITION_OCCURRENCE carefully and do not expect a\n  complete comorbidity profile from registry-derived eras alone.\n- **Linked claims–EHR(–registry):** the richest substrate, but the *same* condition is now captured by multiple feeds with\n  different dates and positions; de-duplicate before the era ETL or eras double-count, and reconcile differential capture so\n  that exposure arms are not compared on systematically different coding density.\n\n**Worked claims/CDM example — chronic HF comorbidity + incident HF outcome.** Question: among new initiators of drug A vs\ndrug B (index date = first qualifying fill), (i) flag *baseline chronic heart failure* as a confounder and (ii) ascertain\n*incident HF* as the outcome, in an OMOP-mapped commercial + Medicare FFS claims database. Define the HF concept set as a\nstandard `condition_concept_id` descendant list. **Baseline comorbidity (state question → CONDITION_ERA):** a person has\nbaseline HF if they have ≥1 HF CONDITION_ERA overlapping the 365-day pre-index window. Because eras drop diagnosis position,\npre-specify and report the sensitivity to the persistence gap (rebuild eras or apply a custom gap of 30/60/180 days) and\ncross-check against the position-aware rule below. 
**Incident outcome (event question → CONDITION_OCCURRENCE with a\nvalidated algorithm):** a qualifying HF event is either one HF code with `condition_type_concept_id` = *inpatient principal*,\nOR two outpatient/professional HF codes ≥30 days apart; the outcome date is the first such qualifying date *after* index.\nApply a clean window (no HF CONDITION_OCCURRENCE row in the 365 days before index) so first-recorded is plausibly\nincident, and require the entire baseline and at-risk period to fall inside a continuous OBSERVATION_PERIOD on FFS\n(A/B/D) coverage, excluding MA-only person-time where claims are absent. Censor at observation-period end, death, or\nend of data. Sensitivity analyses: era persistence gap (30/60/180), the 1-inpatient-vs-2-outpatient rule, and a\nnegative-control condition to detect differential coding intensity between arms.\n\n**Interpreting the output**. Patient 2001 has three outpatient heart failure CONDITION_OCCURRENCE rows in 2024: February 1,\nFebruary 29 (28 days after the February 1 occurrence), and April 14 (45 days after the February 29\noccurrence). With a 30-day persistence window, the February 1 and February 29 occurrences are within the window\nand collapse into a single CONDITION_ERA spanning February 1 through February 29; the April 14 occurrence\nfalls more than 30 days after February 29 and opens a second, distinct CONDITION_ERA on April 14.\n\nFormal interpretation: the same three diagnosis codes yield two CONDITION_ERAs under a 30-day persistence\nwindow but would yield one era under a 45-day window and three separate eras under a 7-day window. The\nera count is a construction artifact, not a clinical truth: the persistence window encodes an assumption\nabout what gap between coded encounters represents a new disease episode versus continued care for the same\nepisode. This construction choice propagates to any analysis that uses CONDITION_ERA to define prevalent\ndisease exposure, disease duration, or episode count. 
Analysts who use the OMOP default 30-day window\nwithout checking whether it is appropriate for the target condition and coding environment are importing\nan assumption without examination.\n\nPractical interpretation: always report the persistence gap used and run a sensitivity analysis at\nalternative gaps (30, 60, 180 days) for chronic conditions where the clinically meaningful episode\nboundary is ambiguous. For acute conditions — where any recurrence after full recovery is a distinct\nevent — prefer the CONDITION_OCCURRENCE-based incident-event algorithm over condition era, and document\nthat choice in the SAP.",
    "primary_category": "Outcome_Measure",
    "tags": [
      "outcome_measure",
      "omop-cdm",
      "condition-occurrence",
      "condition-era",
      "persistence-window",
      "phenotype",
      "comorbidity-ascertainment",
      "diagnosis-coding"
    ],
    "applies_to_study_types": [
      "claims_analysis",
      "ehr_study",
      "registry_linkage",
      "target_trial_emulation",
      "comparative_effectiveness"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.3233/978-1-61499-564-7-574",
        "url": "https://doi.org/10.3233/978-1-61499-564-7-574",
        "citation_text": "Hripcsak G, Duke JD, Shah NH, et al. Observational Health Data Sciences and Informatics (OHDSI): opportunities for observational researchers. Studies in Health Technology and Informatics. 2015;216:574-578.",
        "year": 2015,
        "authors_short": "Hripcsak et al.",
        "notes": "Foundational statement of the OHDSI program and the OMOP CDM, including the standardized condition representation and derived-era convention that CONDITION_OCCURRENCE and CONDITION_ERA implement."
      },
      {
        "role": "explain",
        "doi": "10.1093/jamia/ocy032",
        "url": "https://doi.org/10.1093/jamia/ocy032",
        "citation_text": "Reps JM, Schuemie MJ, Suchard MA, Ryan PB, Rijnbeek PR. Design and implementation of a standardized framework to generate and evaluate patient-level prediction models using observational healthcare data. Journal of the American Medical Informatics Association. 2018;25(8):969-975.",
        "year": 2018,
        "authors_short": "Reps et al.",
        "notes": "Standardized OHDSI framework that builds covariates and outcomes directly from CONDITION_OCCURRENCE/CONDITION_ERA, illustrating how era-based presence and occurrence-based events feed downstream models."
      },
      {
        "role": "demonstrate",
        "doi": "10.1016/j.jbi.2019.103258",
        "url": "https://doi.org/10.1016/j.jbi.2019.103258",
        "citation_text": "Swerdel JN, Hripcsak G, Ryan PB. PheValuator: development and evaluation of a phenotype algorithm evaluator. Journal of Biomedical Informatics. 2019;97:103258.",
        "year": 2019,
        "authors_short": "Swerdel et al.",
        "notes": "Demonstrates evaluation of condition phenotype algorithms (sensitivity, specificity, PPV) built on these tables, making concrete why raw concept flags differ from validated definitions."
      },
      {
        "role": "use",
        "doi": "10.1371/journal.pmed.1001885",
        "url": "https://doi.org/10.1371/journal.pmed.1001885",
        "citation_text": "Benchimol EI, Smeeth L, Guttmann A, et al. The REporting of studies Conducted using observational Routinely-collected health Data (RECORD) statement. PLoS Medicine. 2015;12(10):e1001885.",
        "year": 2015,
        "authors_short": "Benchimol et al.",
        "notes": "Reporting standard requiring transparent code lists, diagnosis-position logic, and validation for routinely collected data — the documentation expected when an analysis is built on CONDITION_OCCURRENCE/CONDITION_ERA."
      }
    ],
    "plain_language_summary": "In a standardized health database built on the OMOP model, every time a patient receives a diagnosis — from a doctor visit, a hospital stay, or a problem list entry — that event lands as one row in a table called CONDITION_OCCURRENCE. Because the same chronic disease generates many separate visits over months or years, the database also builds a second table called CONDITION_ERA, which stitches those scattered diagnosis events into a single continuous time span whenever the gap between consecutive events is no wider than 30 days. Use CONDITION_OCCURRENCE when you want to find the first time a disease was recorded or count specific events; use CONDITION_ERA when you want to ask whether a patient had a disease present during a particular period.",
    "key_terms": [
      {
        "term": "CONDITION_OCCURRENCE",
        "definition": "An OMOP table with one row for each recorded diagnosis event, storing the patient identifier, the standardized disease code, the date the condition was noted, and the type of encounter that produced it."
      },
      {
        "term": "CONDITION_ERA",
        "definition": "A derived OMOP table that collapses consecutive CONDITION_OCCURRENCE rows for the same disease into a single continuous time span, recording its start date, end date, and how many individual occurrences were folded into it."
      },
      {
        "term": "persistence window",
        "definition": "The maximum allowed gap in days between two consecutive diagnosis events before they are treated as separate disease periods; the OMOP default is 30 days."
      },
      {
        "term": "condition_type_concept_id",
        "definition": "A code on each CONDITION_OCCURRENCE row that records how the diagnosis was captured — for example, as an inpatient primary diagnosis, an outpatient claim, or an EHR problem-list entry."
      },
      {
        "term": "incident diagnosis",
        "definition": "The first recorded occurrence of a disease for a patient within a defined look-back window that is free of any prior diagnosis for that condition."
      }
    ],
    "worked_example": {
      "scenario": "Patient 2001 is enrolled in a commercial claims database with continuous coverage from 2024-01-01 through 2024-06-30. She has three outpatient encounters where a heart-failure diagnosis code is recorded: February 1, February 29, and April 14. We want to show exactly how the OMOP ETL collapses these three CONDITION_OCCURRENCE rows into CONDITION_ERA records using the default 30-day persistence window.",
      "dataset": {
        "caption": "Raw CONDITION_OCCURRENCE rows for patient 2001 (heart failure, condition_concept_id 316139)",
        "columns": [
          "person_id",
          "condition_concept_id",
          "condition_start_date",
          "condition_type_concept_id"
        ],
        "rows": [
          [
            2001,
            316139,
            "2024-02-01",
            "outpatient claim"
          ],
          [
            2001,
            316139,
            "2024-02-29",
            "outpatient claim"
          ],
          [
            2001,
            316139,
            "2024-04-14",
            "outpatient claim"
          ]
        ]
      },
      "steps": [
        "Sort the three occurrences by date: Feb 01, Feb 29, Apr 14.",
        "Check the gap between occurrence A (Feb 01) and occurrence B (Feb 29): 28 days. Because 28 is less than or equal to the 30-day persistence window, A and B are bridged into the same era.",
        "The era built from A and B spans Feb 01 through Feb 29 and has an occurrence count of 2.",
        "Check the gap between occurrence B (Feb 29) and occurrence C (Apr 14): 45 days. Because 45 exceeds the 30-day persistence window, C starts a new era.",
        "Era 2 spans Apr 14 through Apr 14 and has an occurrence count of 1.",
        "Final result: 3 CONDITION_OCCURRENCE rows produce 2 CONDITION_ERA rows."
      ],
      "result": "3 CONDITION_OCCURRENCE events collapse into 2 CONDITION_ERAs: Era 1 covers 2024-02-01 to 2024-02-29 (28-day span, 2 occurrences); Era 2 covers 2024-04-14 to 2024-04-14 (1-day span, 1 occurrence). The 45-day gap between Feb 29 and Apr 14 exceeds the 30-day window and breaks continuity.",
      "timeline_spec": {
        "title": "Three HF diagnosis claims collapsed into two CONDITION_ERAs (30-day persistence window)",
        "window": {
          "start": "2024-01-01",
          "end": "2024-06-30",
          "label": "Observation window: 2024-01-01 to 2024-06-30"
        },
        "events": [
          {
            "label": "Occurrence A",
            "start": "2024-02-01",
            "length_days": 1,
            "quantity": "outpatient HF code"
          },
          {
            "label": "Occurrence B (28 days later)",
            "start": "2024-02-29",
            "length_days": 1,
            "quantity": "outpatient HF code"
          },
          {
            "label": "Occurrence C (45 days later — gap breaks era)",
            "start": "2024-04-14",
            "length_days": 1,
            "quantity": "outpatient HF code"
          }
        ],
        "spans": [
          {
            "kind": "covered",
            "start": "2024-02-01",
            "end": "2024-02-29",
            "label": "Era 1: 28-day span, 2 occurrences (gap 28d <= 30d)"
          },
          {
            "kind": "gap",
            "start": "2024-03-01",
            "end": "2024-04-13",
            "label": "45-day gap exceeds 30d window — era break"
          },
          {
            "kind": "covered",
            "start": "2024-04-14",
            "end": "2024-04-14",
            "label": "Era 2: 1-day span, 1 occurrence"
          }
        ],
        "result": {
          "label": "3 occurrences -> 2 eras (30-day persistence window splits at the 45-day gap)",
          "value": 2
        }
      },
      "caption": "Three outpatient heart-failure claims for patient 2001. The 28-day gap between Feb 01 and Feb 29 falls within the 30-day persistence window, so those two occurrences merge into Era 1. The 45-day gap between Feb 29 and Apr 14 exceeds the window, so Apr 14 becomes a separate Era 2.",
      "alt_text": "Timeline from January through June 2024 showing three diagnosis-event markers on Feb 01, Feb 29, and Apr 14, with a green span labeled Era 1 bridging Feb 01 to Feb 29, a gray gap band from Mar 01 to Apr 13 labeled 45-day gap, and a second green span labeled Era 2 on Apr 14."
    },
    "prerequisites": [
      "omop-cdm-method-patterns-rwe",
      "omop-observation-period-rwe",
      "omop-concept-set-development-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Era-based presence flag for baseline comorbidity",
        "description": "A condition is \"present at baseline\" if any CONDITION_ERA for the concept set overlaps the pre-index covariate window. Uses the persistence-collapsed span as the unit, ignoring diagnosis position.",
        "edge_cases": [
          "Sensitive to the persistence-window gap baked into the ETL (default 30 days); rebuild or vary the gap and report the impact.",
          "Drops condition_type_concept_id, so rule-out and secondary diagnoses are folded into presence — acceptable for non-differential comorbidity adjustment, dangerous for specificity-critical conditions.",
          "EHR problem-list carry-forward inflates era duration and can make resolved conditions appear chronic."
        ],
        "data_source_notes": "claims: fine for chronic, well-coded comorbidities; ehr: verify the era-build accounts for problem-list end dates; linked: de-duplicate cross-feed diagnoses before relying on era spans."
      },
      {
        "name": "Occurrence-based incident-event algorithm",
        "description": "An outcome/event is the first CONDITION_OCCURRENCE row meeting a validated rule (e.g., 1 inpatient-principal code OR 2 outpatient codes >=30 days apart) after index, with a clean window before index for incidence.",
        "edge_cases": [
          "First recorded date is data entry, not biological onset; without a clean/washout window inside an observed OBSERVATION_PERIOD prevalent disease is misclassified as incident.",
          "Requires condition_type_concept_id to distinguish inpatient principal from rule-out/secondary coding."
        ],
        "data_source_notes": "claims: leverage diagnosis position and confirmatory codes; ehr: encounter vs problem-list distinction is often only inferable from visit context; exclude MA-only person-time where claims are absent."
      },
      {
        "name": "Custom persistence-window era rebuild",
        "description": "Rebuild eras from CONDITION_OCCURRENCE with an analyst-specified gap rather than the OHDSI default, to align \"chronic\" definition with clinical reasoning and to support a gap sensitivity analysis.",
        "edge_cases": [
          "Different gaps materially change prevalence and chronicity; the chosen gap must be justified in the protocol/SAP.",
          "Recomputation must respect OBSERVATION_PERIOD boundaries so out-of-coverage time is not bridged."
        ],
        "data_source_notes": "Implement the gap-collapse identically across all databases in a network study to keep era structure comparable."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Raw CONDITION_OCCURRENCE concept flag (>=1 code)",
        "pros_of_this": "CONDITION_ERA gives a de-duplicated presence span and ready-made chronic-disease unit without hand-rolled gap logic.",
        "cons_of_this": "Era bakes in the persistence gap and discards diagnosis position/status, so position-aware or rule-out-aware algorithms cannot be expressed on the era.",
        "when_to_prefer": "Chronic, well-coded comorbidities where any contact implies presence and non-differential misclassification is tolerable."
      },
      {
        "compared_to": "Validated phenotype/cohort algorithm (e.g., 1ip-2op rule, ATLAS cohort)",
        "pros_of_this": "Eras/occurrences are the inputs; the raw era flag is faster and simpler for descriptive comorbidity adjustment.",
        "cons_of_this": "Poor PPV for analytic outcomes; cannot encode confirmatory-code or position requirements that validated algorithms rely on.",
        "when_to_prefer": "Descriptive baseline comorbidity covariates; never as a substitute for a validated outcome/eligibility definition."
      },
      {
        "compared_to": "Source-vocabulary (ICD-9/10) logic in native claims",
        "pros_of_this": "Standardized condition_concept_id code lists are portable across data sources and OHDSI network studies.",
        "cons_of_this": "Source-to-SNOMED mapping is lossy and ETL/vintage-dependent; counts can shift between CDM versions.",
        "when_to_prefer": "Multi-database/network studies; audit the mapping when a single algorithm must exactly match a published ICD-based definition."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Use condition_type_concept_id to require inpatient-principal position or a second confirmatory code; exclude rule-out diagnoses on lab/imaging encounters; restrict baseline/at-risk windows to FFS (A/B/D) OBSERVATION_PERIOD and drop MA-only person-time where CONDITION_OCCURRENCE rows are absent. Freeze data with a run-out buffer for adjudication lag.",
      "ehr": "Distinguish problem-list (carried forward, often no end date) from encounter diagnoses; condition_status_concept_id is frequently null. Reconcile era discontinuity with the observed OBSERVATION_PERIOD because care outside the system creates phantom gaps.",
      "registry": "Strong for adjudicated, well-defined conditions but narrow; do not expect a complete comorbidity profile from registry-derived eras. Map registry diagnoses into CONDITION_OCCURRENCE with explicit dates and positions.",
      "linked": "The same condition arrives from multiple feeds with different dates/positions; de-duplicate before the era ETL or eras double-count, and reconcile differential capture so exposure arms are not compared on different coding density."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\n\nHF_CONCEPTS = set(hf_concept_descendants)          # standard condition_concept_id list for heart failure\nINPATIENT_PRINCIPAL = {38000183, 38000199}         # condition_type_concept_id values for inpatient primary dx\nBASELINE_DAYS = 365\nOUTPATIENT_GAP_DAYS = 30                            # 2-outpatient-codes-apart requirement\n\ndef add_baseline_hf(cohort: pd.DataFrame, condition_era: pd.DataFrame) -> pd.DataFrame:\n    # STATE question -> CONDITION_ERA: any HF era overlapping [index-365, index].\n    era = condition_era[condition_era[\"condition_concept_id\"].isin(HF_CONCEPTS)]\n    m = era.merge(cohort[[\"person_id\", \"index_date\"]], on=\"person_id\")\n    win_start = m[\"index_date\"] - pd.Timedelta(days=BASELINE_DAYS)\n    overlap = (m[\"condition_era_start_date\"] <= m[\"index_date\"]) & (m[\"condition_era_end_date\"] >= win_start)\n    flagged = m.loc[overlap, \"person_id\"].unique()\n    out = cohort.copy()\n    out[\"baseline_hf\"] = out[\"person_id\"].isin(flagged).astype(int)\n    return out\n\ndef add_incident_hf(cohort: pd.DataFrame, condition_occurrence: pd.DataFrame) -> pd.DataFrame:\n    # EVENT question -> CONDITION_OCCURRENCE with a validated rule, first qualifying date AFTER index.\n    co = condition_occurrence[condition_occurrence[\"condition_concept_id\"].isin(HF_CONCEPTS)]\n    co = co.merge(cohort[[\"person_id\", \"index_date\"]], on=\"person_id\")\n\n    # Clean window: incident only if no HF occurrence in the 365 days before index.\n    pre = co[(co[\"condition_start_date\"] < co[\"index_date\"]) &\n             (co[\"condition_start_date\"] >= co[\"index_date\"] - pd.Timedelta(days=BASELINE_DAYS))]\n    prevalent = set(pre[\"person_id\"])\n    post = co[(co[\"condition_start_date\"] > co[\"index_date\"]) & (~co[\"person_id\"].isin(prevalent))].copy()\n\n    # Rule A: one inpatient-principal code.\n    ip = post[post[\"condition_type_concept_id\"].isin(INPATIENT_PRINCIPAL)]\n    ip_date = 
ip.groupby(\"person_id\")[\"condition_start_date\"].min()\n\n    # Rule B: two outpatient codes >= OUTPATIENT_GAP_DAYS apart; qualifying date = the confirming code.\n    # Each code is compared with the person's FIRST post-index outpatient code rather than the\n    # immediately preceding one, so codes on days 0/20/40 still qualify (day 40 confirms day 0).\n    op = post[~post[\"condition_type_concept_id\"].isin(INPATIENT_PRINCIPAL)].copy()\n    op[\"first_dt\"] = op.groupby(\"person_id\")[\"condition_start_date\"].transform(\"min\")\n    op_qual = op[(op[\"condition_start_date\"] - op[\"first_dt\"]).dt.days >= OUTPATIENT_GAP_DAYS]\n    op_date = op_qual.groupby(\"person_id\")[\"condition_start_date\"].min()\n\n    # Earliest date on which either rule is met; people meeting neither rule get NaT after the left merge.\n    first = pd.concat([ip_date, op_date], axis=1).min(axis=1).rename(\"incident_hf_date\")\n    return cohort.merge(first, on=\"person_id\", how=\"left\")",
        "description": "Baseline era-presence flag + occurrence-based incident outcome from OMOP CDM tables (pandas).\nRequired inputs (already loaded, OMOP-shaped, one schema per database):\n  condition_occurrence : person_id, condition_concept_id, condition_start_date (datetime),\n                         condition_type_concept_id  (provenance: inpatient principal / outpatient / etc.)\n  condition_era        : person_id, condition_concept_id, condition_era_start_date, condition_era_end_date\n  cohort               : person_id, index_date  (one row per new initiator)\nSupply the standardized HF concept descendants in HF_CONCEPTS and the type concept ids that denote inpatient-principal.\nReturns the cohort with baseline_hf (era-based) and incident_hf_date (validated occurrence-based, post-index).",
        "dependencies": [
          "pandas"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(dplyr)\n\nbaseline_days <- 365L\noutpatient_gap_days <- 30L\n\nadd_baseline_hf <- function(cohort, condition_era, hf_concepts) {\n  flagged <- condition_era %>%\n    filter(condition_concept_id %in% hf_concepts) %>%\n    inner_join(select(cohort, person_id, index_date), by = \"person_id\") %>%\n    filter(condition_era_start_date <= index_date,\n           condition_era_end_date   >= index_date - baseline_days) %>%\n    distinct(person_id) %>% pull(person_id)\n  cohort %>% mutate(baseline_hf = as.integer(person_id %in% flagged))\n}\n\nadd_incident_hf <- function(cohort, condition_occurrence, hf_concepts, inpatient_principal) {\n  co <- condition_occurrence %>%\n    filter(condition_concept_id %in% hf_concepts) %>%\n    inner_join(select(cohort, person_id, index_date), by = \"person_id\")\n\n  # Clean window: drop people with any HF code in the 365 days before index (prevalent).\n  prevalent <- co %>%\n    filter(condition_start_date < index_date,\n           condition_start_date >= index_date - baseline_days) %>%\n    distinct(person_id) %>% pull(person_id)\n  post <- co %>% filter(condition_start_date > index_date, !person_id %in% prevalent)\n\n  ip_date <- post %>% filter(condition_type_concept_id %in% inpatient_principal) %>%\n    group_by(person_id) %>% summarise(d = min(condition_start_date), .groups = \"drop\")\n\n  op_date <- post %>% filter(!condition_type_concept_id %in% inpatient_principal) %>%\n    arrange(person_id, condition_start_date) %>% group_by(person_id) %>%\n    mutate(prev_dt = lag(condition_start_date)) %>%\n    filter(as.integer(condition_start_date - prev_dt) >= outpatient_gap_days) %>%\n    summarise(d = min(condition_start_date), .groups = \"drop\")\n\n  first <- bind_rows(ip_date, op_date) %>% group_by(person_id) %>%\n    summarise(incident_hf_date = min(d), .groups = \"drop\")\n  cohort %>% left_join(first, by = \"person_id\")\n}",
        "description": "Same two operations in R (dplyr). Inputs mirror the Python version and use OMOP column names:\n  condition_occurrence : person_id, condition_concept_id, condition_start_date (Date), condition_type_concept_id\n  condition_era        : person_id, condition_concept_id, condition_era_start_date, condition_era_end_date\n  cohort               : person_id, index_date (Date)\nhf_concepts = integer vector of standard HF condition_concept_id descendants;\ninpatient_principal = condition_type_concept_id values denoting inpatient primary diagnosis.",
        "dependencies": [
          "dplyr"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "%let baseline = 365;\n%let opgap    = 30;\n%let ip_types = 38000183, 38000199;   /* condition_type_concept_id = inpatient principal dx */\n\n/* STATE question -> CONDITION_ERA: HF era overlapping the 365-day pre-index window. */\nproc sql;\n  create table baseline_hf as\n  select c.person_id, c.index_date,\n         max(case when e.person_id is not null then 1 else 0 end) as baseline_hf\n  from work.cohort c\n  left join (select * from cdm.condition_era\n             where condition_concept_id in (select concept_id from work.hf_concepts)) e\n    on  e.person_id = c.person_id\n    and e.condition_era_start_date <= c.index_date\n    and e.condition_era_end_date   >= c.index_date - &baseline\n  group by c.person_id, c.index_date;\nquit;\n\n/* Post-index HF occurrences, excluding prevalent cases (clean window before index). */\nproc sql;\n  create table post as\n  select o.person_id, o.condition_start_date, o.condition_type_concept_id, c.index_date\n  from cdm.condition_occurrence o\n  join work.cohort c on c.person_id = o.person_id\n  where o.condition_concept_id in (select concept_id from work.hf_concepts)\n    and o.condition_start_date > c.index_date\n    and o.person_id not in (\n      select person_id from cdm.condition_occurrence o2\n      join work.cohort c2 on c2.person_id = o2.person_id\n      where o2.condition_concept_id in (select concept_id from work.hf_concepts)\n        and o2.condition_start_date <  c2.index_date\n        and o2.condition_start_date >= c2.index_date - &baseline);\nquit;\n\n/* Rule A: first inpatient-principal code. Rule B: second of two outpatient codes >= &opgap apart. 
*/\nproc sql;\n  create table ip_date as\n  select person_id, min(condition_start_date) as d format=date9.\n  from post where condition_type_concept_id in (&ip_types) group by person_id;\nquit;\n\nproc sort data=post(where=(condition_type_concept_id not in (&ip_types))) out=op; by person_id condition_start_date; run;\n/* Compare each code with the first post-index outpatient code per person (not the previous code),\n   so codes on days 0/20/40 still qualify (day 40 confirms day 0). */\ndata op_qual;\n  set op; by person_id condition_start_date;\n  retain first_dt;\n  if first.person_id then first_dt = condition_start_date;\n  if (condition_start_date - first_dt) >= &opgap;\nrun;\nproc sql;\n  create table op_date as select person_id, min(condition_start_date) as d format=date9.\n  from op_qual group by person_id;\n  /* Two-argument min() is the row-wise SAS function; it ignores missing values, so no GROUP BY is needed. */\n  create table incident_hf as\n  select coalesce(a.person_id,b.person_id) as person_id,\n         min(a.d, b.d) as incident_hf_date format=date9.\n  from ip_date a full join op_date b on a.person_id = b.person_id;\nquit;",
        "description": "Same logic in SAS against OMOP CDM tables (PROC SQL). Required tables in libref cdm:\n  cdm.condition_occurrence : person_id, condition_concept_id, condition_start_date, condition_type_concept_id\n  cdm.condition_era        : person_id, condition_concept_id, condition_era_start_date, condition_era_end_date\n  work.cohort              : person_id, index_date (one row per new initiator)\n  work.hf_concepts         : concept_id  (standard HF condition_concept_id descendants)\nMacro vars below hold the baseline window, the outpatient gap, and the inpatient-principal type ids.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": "omop-condition-occurrence-condition-era-rwe-timeline.svg",
        "mermaid": null,
        "caption": "Data-grounded worked-example timeline (beginner layer), drawn to scale from worked_example.timeline_spec so the picture matches the numbers.",
        "alt_text": "Timeline for the worked example of omop-condition-occurrence-condition-era-rwe.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Src[Source diagnosis codes<br/>ICD-9/10, EHR problem list] --> CO[CONDITION_OCCURRENCE<br/>1 row per event<br/>concept_id + start_date + type_concept_id]\n  CO -->|\"ETL collapse, gap <= persistence window (default 30d)\"| CE[CONDITION_ERA<br/>presence span + occurrence_count]\n  CE --> State[\"STATE question:<br/>chronic / prevalent comorbidity<br/>era overlaps baseline window\"]\n  CO --> Event[\"EVENT question:<br/>incident outcome<br/>validated rule + clean window\"]\n  State --> Cohort[Baseline covariate block]\n  Event --> Cohort\n  Cohort --> Sens[\"Sensitivity:<br/>era gap 30/60/180d,<br/>1-inpatient vs 2-outpatient rule\"]",
        "caption": "How the two tables relate. CONDITION_OCCURRENCE is the raw event substrate; CONDITION_ERA is a persistence-window collapse of it. State questions (chronic comorbidity) use the era; event questions (incident outcome) use the occurrence table with a validated, position-aware rule.",
        "alt_text": "Flowchart from source diagnosis codes to CONDITION_OCCURRENCE, collapsed by the persistence window into CONDITION_ERA, branching into era-based state covariates and occurrence-based incident outcomes, then sensitivity analyses.",
        "source_type": "illustrative",
        "source_citations": [
          "hripcsak-2015"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "gantt\n  title Persistence window rewrites \"chronic\" status (same three HF codes)\n  dateFormat YYYY-MM-DD\n  axisFormat %b\n  section Raw occurrences\n  Code 1 :milestone, c1, 2024-01-10, 0d\n  Code 2 (45d later) :milestone, c2, 2024-02-24, 0d\n  Code 3 (40d later) :milestone, c3, 2024-04-04, 0d\n  section Era with 30-day gap\n  Era 1 (single code) :a1, 2024-01-10, 1d\n  Era 2 (single code) :a2, 2024-02-24, 1d\n  Era 3 (single code) :a3, 2024-04-04, 1d\n  section Era with 180-day gap\n  One continuous chronic era :crit, b1, 2024-01-10, 85d",
        "caption": "The same three diagnosis events yield three fragmented eras under the OHDSI default 30-day gap (looks intermittent) but one continuous chronic era under a 180-day gap. The persistence window is an analytic choice that must be pre-specified and sensitivity-tested, not inherited from the ETL default.",
        "alt_text": "Gantt chart showing three heart-failure codes that become three separate single-day eras with a 30-day persistence gap and one continuous 85-day chronic era with a 180-day gap.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "is_variant_of",
        "target_slug": "omop-cdm-method-patterns-rwe",
        "notes": "Member of the OMOP CDM operational-pattern family; this entry covers the condition (diagnosis) domain tables specifically."
      },
      {
        "relation_type": "used_with",
        "target_slug": "omop-concept-set-development-rwe",
        "notes": "The standardized condition_concept_id code list that defines a phenotype is built as an OMOP concept set before querying CONDITION_OCCURRENCE/CONDITION_ERA."
      },
      {
        "relation_type": "used_with",
        "target_slug": "omop-observation-period-rwe",
        "notes": "Era discontinuity and first-recorded-as-incident logic are only interpretable inside an observed OBSERVATION_PERIOD; out-of-coverage time must not be bridged."
      },
      {
        "relation_type": "used_with",
        "target_slug": "diagnosis-phenotype-algorithm-1ip-2op-time-window-rwe",
        "notes": "The 1-inpatient-OR-2-outpatient validated rule used for incident outcomes is layered on top of CONDITION_OCCURRENCE diagnosis position and timing."
      },
      {
        "relation_type": "see_also",
        "target_slug": "claims-outcome-algorithm-ppv-sensitivity-rwe",
        "notes": "PPV/sensitivity of an outcome built from these tables determines how much non-differential and differential misclassification the analysis carries."
      },
      {
        "relation_type": "see_also",
        "target_slug": "algorithm-validation",
        "notes": "A condition phenotype on these tables should be validated (chart review or PheValuator-style estimation) before use as an analytic outcome."
      },
      {
        "relation_type": "used_with",
        "target_slug": "omop-drug-exposure-drug-era-rwe",
        "notes": "The drug-domain analogue (DRUG_EXPOSURE collapsed into DRUG_ERA by a persistence window) used jointly to define exposure cohorts alongside condition-based eligibility and covariates."
      },
      {
        "relation_type": "used_with",
        "target_slug": "omop-time-at-risk-cohort-exit-rwe",
        "notes": "Incident condition events from CONDITION_OCCURRENCE supply outcome and exit dates for time-at-risk construction."
      }
    ],
    "aliases": [
      "CONDITION_OCCURRENCE",
      "CONDITION_ERA",
      "OMOP condition occurrence and condition era",
      "condition era persistence window",
      "OMOP diagnosis tables"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "omop-drug-exposure-drug-era-rwe",
    "name": "OMOP Drug Exposure and Drug Era",
    "short_definition": "The OMOP CDM pair in which DRUG_EXPOSURE stores one record per source dispensing or administration at the clinical-drug level and DRUG_ERA derives ingredient-level, gap-stitched exposure episodes whose persistence-gap threshold is the central judgment call governing who is \"on drug\" and for how long.",
    "long_description": "In the OMOP Common Data Model (CDM), drug exposure is represented at two distinct grains, and confusing them is a common\nsource of silent misclassification in RWE. **DRUG_EXPOSURE** is the *atomic* table: one row per source-level dispensing,\nprescription, or administration, keyed on `person_id` with `drug_concept_id` mapped to the standardized vocabulary at the\n**clinical-drug** level (e.g., RxNorm clinical/branded drug), plus `drug_exposure_start_date`, `drug_exposure_end_date`,\n`days_supply`, `quantity`, and `route_concept_id`. **DRUG_ERA** is a *derived* table: it rolls clinical drugs up to their\n**ingredient** (via `CONCEPT_ANCESTOR`), then stitches consecutive exposures of the same ingredient into a single episode\n(`drug_era_start_date`, `drug_era_end_date`, `gap_days`, `drug_exposure_count`) whenever the gap between the end of one\nexposure and the start of the next falls within a persistence threshold. The OHDSI-default persistence window is **30 days**.\nDrug eras are the unit most analyses treat as \"a treatment episode.\"\n\n**Core conceptual distinction**. DRUG_EXPOSURE is *what the source said happened* — a faithful, claim-by-claim or order-by-order\nledger at the clinical-drug grain. DRUG_ERA is *an analyst's model of continuous treatment* at the ingredient grain, and its\nboundaries are an explicit, consequential parameter, not a fact in the data. 
Three decisions are folded into a drug era and\nmust be pre-specified in the estimand: (1) the **ingredient roll-up** (a patient switching between two atorvastatin products\nis one statin era, but atorvastatin → rosuvastatin is two ingredient eras even though it is \"continuous statin therapy\"); (2)\nthe **end-date imputation** when `days_supply` is missing or implausible (impute from quantity ÷ a typical daily dose, or a\nfixed default — this directly sets where person-time ends); and (3) the **persistence gap** that allows late refills to be\ntreated as one continuous era (0, 30, 90 days). The gap is the dominant judgment call: a 0-day gap demands seamless refills\nand shortens person-time, treating every late refill as discontinuation; a 90-day gap forgives long supply gaps and lengthens\non-treatment time, risking immortal-time-like overcounting of exposed person-time during periods the patient held no drug.\n\n**Pros, cons, and trade-offs**.\n- **DRUG_ERA vs analyzing raw DRUG_EXPOSURE rows directly:** Eras give you ready-made, ingredient-level episodes that map\n  cleanly onto persistence, time-at-risk, and on-treatment follow-up; they absorb the messy \"stockpiling / overlapping fills /\n  one-day gaps\" bookkeeping into a single transparent rule. Cost: the default 30-day era hides the gap choice from downstream\n  users, and the ingredient roll-up *erases within-class switching* that may be the exposure contrast of interest. 
**Prefer raw\n  DRUG_EXPOSURE** when you need clinical-drug-level detail (dose, formulation, brand, route) or when switching between agents in\n  the same ingredient family is itself the exposure.\n- **OMOP DRUG_ERA vs a bespoke claims episode-construction algorithm (e.g., PDC/MPR-style stitching outside OMOP):** The OMOP\n  era algorithm is standardized, network-portable (the same code runs across CDM sites without re-mapping), and reviewable.\n  Cost: the standard ERA derivation is deliberately simple — it does not, by default, cap stockpiling, handle inpatient\n  bridging, or model dynamic dosing. **Prefer a bespoke or extended algorithm** (or `exposure-episode-construction-rwe`) when\n  those features change the answer, and document the deviation from the standard era.\n- **Ingredient-level DRUG_ERA vs clinical-drug-level exposure windows:** Ingredient eras are the right grain for \"any statin\"\n  questions and for class-level safety; clinical-drug windows are required for dose-response, biosimilar/branded contrasts, or\n  formulation-specific signals. They are not interchangeable, and silently using the era table for a clinical-drug question is a\n  misclassification, not a simplification.\n\n**When to use**. Use DRUG_ERA when the analytic unit is a continuous ingredient-level treatment episode: defining new-user\ncohorts and washouts, building on-treatment (as-treated) follow-up, measuring persistence and time-to-discontinuation, or\nrunning ingredient-level comparative safety/effectiveness across an OHDSI network where standardization and portability matter.\nUse DRUG_EXPOSURE directly when you need dose, quantity, route, brand/clinical-drug identity, or when you will define your own\nepisode logic and want full control over the gap, stockpiling, and imputation rules.\n\n**When NOT to use — and when it is actively misleading or dangerous**.\n- **Treating the default 30-day era as a neutral fact.** It is a modeling choice. 
Reporting \"exposed person-time\" off a default\n  era without a sensitivity analysis on the gap (0/30/90) hides the parameter that most moves rates, and a too-generous gap\n  manufactures exposed person-time during true non-use, biasing safety analyses toward the null (person-time and events that\n  accrue while the patient holds no drug are still counted as \"on drug,\" diluting the exposed rate toward the background rate).\n- **Using ingredient eras for a within-ingredient or dose question.** The roll-up destroys exactly the contrast you need; you\n  will silently pool brands, doses, and formulations.\n- **Imputing end dates from `days_supply` when `days_supply` is unreliable.** In administration-based EHR data and in inpatient\n  claims, `days_supply` is frequently missing, zero, or a placeholder. Mechanically imputing a fixed default fabricates era\n  length and corrupts both persistence and time-at-risk; verify the `days_supply` distribution before trusting it.\n- **Comparing eras across CDM sites that used different ERA-derivation parameters.** If site A built 30-day eras and site B\n  built 90-day eras, \"drug era\" is not the same variable; network analyses must pin the persistence gap and re-derive eras\n  uniformly, and must never assume the vendor-shipped DRUG_ERA table is comparable.\n\n**Data-source operational depth**.\n- **Claims (FFS pharmacy):** The cleanest substrate: each adjudicated pharmacy claim is a DRUG_EXPOSURE row with a real\n  `days_supply` and `quantity`; eras stitch well. Failure modes: claim reversals/voids must be removed before stitching or you\n  double-count supply; 90-day mail-order and free samples distort `days_supply`; same-day duplicate fills inflate\n  `drug_exposure_count`. Critically, **Medicare Advantage (MA) enrollees lack fee-for-service pharmacy claims** unless Part D\n  encounter data are present; an MA-only person looks like \"no fills,\" so an apparent washout or discontinuation is *missingness*,\n  not non-use. 
Restrict era-based person-time to spans with observable pharmacy benefit (Part D / commercial Rx).\n- **EHR:** DRUG_EXPOSURE here is usually an *order* or *administration*, not a dispensing; `days_supply` is often absent for\n  inpatient administrations and outpatient orders, forcing end-date imputation. Visit-driven capture means a patient who gets\n  care outside the system shows phantom gaps that the era algorithm will read as discontinuation. Prefer linkage to pharmacy\n  fills to confirm the patient actually started and continued, and treat loss to follow-up as potentially informative.\n- **Registry:** Often records treatment status or regimen rather than dispensing-level supply, so native drug eras may not be\n  derivable at all; link to claims for the fill history before building eras, and to a death index so the era does not run past\n  death.\n- **Linked claims–EHR–vital records:** The ideal substrate (EHR for dose/route/severity, claims for complete fills, vital\n  records for censoring), but order/fill/administration date discrepancies must be reconciled *before* era stitching, and the\n  linkable subset introduces selection.\n\n**Worked claims example.** Goal: build 30-day statin drug eras and use them to define a new-user, on-treatment window. (1)\n**Source rows (DRUG_EXPOSURE):** person 1001 has pharmacy fills for atorvastatin 20 mg (NDC X) on 2023-01-05 with\n`days_supply` 30, a refill on 2023-02-10 (`days_supply` 30), and atorvastatin 40 mg (NDC Y) on 2023-03-20 (`days_supply` 90).\n(2) **NDC → ingredient roll-up:** both NDCs map through the standardized vocabulary to the *atorvastatin* ingredient via\n`CONCEPT_ANCESTOR`, so all three rows belong to one candidate ingredient era. (3) **End-date imputation:** each exposure\n`drug_exposure_end_date = drug_exposure_start_date + days_supply − 1`, giving covered windows\n[01-05 … 02-03], [02-10 … 03-11], [03-20 … 06-17]. 
(4) **Gap-stitching walk** with a 30-day persistence gap: fill 1 ends\n02-03; fill 2 starts 02-10 → gap = 6 days ≤ 30 → same era. Fill 2 ends 03-11; fill 3 starts 03-20 → gap = 8 days ≤ 30 → same\nera. Result: one DRUG_ERA for atorvastatin, `drug_era_start_date` 2023-01-05, `drug_era_end_date` 2023-06-17,\n`drug_exposure_count` 3, `gap_days` 14 (total within-era gap, 6 + 8). Had the third fill arrived on 2023-05-01 (gap = 50\nuncovered days, 03-12 through 04-30, > 30), the algorithm would split the exposures into two eras, and any safety event between\n03-12 and 04-30 would change from \"on-treatment\" to \"post-discontinuation,\" a classification that flips entirely with the gap\nparameter. (5) **Propagation to design:** the new-user\nwashout requires *no* atorvastatin DRUG_EXPOSURE in the 365-day lookback before `drug_era_start_date` (and continuous,\nFFS-observable Part D enrollment so that absence of fills is real, not MA-only missingness); on-treatment follow-up runs from\nthe era start to `drug_era_end_date` (+ an optional grace period), censoring at disenrollment, death, end of data, or switch to\na different statin ingredient. The same era boundaries feed `omop-time-at-risk-cohort-exit-rwe` and persistence\n(`persistence-time-to-discontinuation`); report rates under 0-, 30-, and 90-day gaps as the primary sensitivity analysis.",
    "primary_category": "Exposure_Definition",
    "tags": [
      "exposure-definition",
      "omop-cdm",
      "drug-exposure",
      "drug-era",
      "persistence-gap",
      "episode-construction",
      "pharmacoepidemiology",
      "ohdsi"
    ],
    "applies_to_study_types": [
      "claims_analysis",
      "ehr_study",
      "registry_linkage",
      "target_trial_emulation",
      "comparative_effectiveness"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1136/amiajnl-2011-000376",
        "url": "https://doi.org/10.1136/amiajnl-2011-000376",
        "citation_text": "Overhage JM, Ryan PB, Reich CG, Hartzema AG, Stang PE. Validation of a common data model for active safety surveillance research. Journal of the American Medical Informatics Association. 2012;19(1):54-60.",
        "year": 2012,
        "authors_short": "Overhage et al.",
        "notes": "Foundational validation of the OMOP Common Data Model and its standardized drug/condition representations, establishing the DRUG_EXPOSURE-to-DRUG_ERA derivation conventions used in network research."
      },
      {
        "role": "explain",
        "doi": "10.3233/978-1-61499-564-7-574",
        "url": "https://doi.org/10.3233/978-1-61499-564-7-574",
        "citation_text": "Hripcsak G, Duke JD, Shah NH, et al. Observational Health Data Sciences and Informatics (OHDSI): opportunities for observational researchers. Studies in Health Technology and Informatics. 2015;216:574-578.",
        "year": 2015,
        "authors_short": "Hripcsak et al.",
        "notes": "Programmatic overview of OHDSI and the standardized analytics (including era construction) layered on the OMOP CDM."
      },
      {
        "role": "demonstrate",
        "doi": "10.1073/pnas.1510502113",
        "url": "https://doi.org/10.1073/pnas.1510502113",
        "citation_text": "Hripcsak G, Ryan PB, Duke JD, et al. Characterizing treatment pathways at scale using the OHDSI network. Proceedings of the National Academy of Sciences. 2016;113(27):7329-7336.",
        "year": 2016,
        "authors_short": "Hripcsak et al.",
        "notes": "Network-scale demonstration that uses ingredient-level drug eras to characterize treatment pathways across many CDM databases, showing both the portability and the heterogeneity that era standardization must manage."
      }
    ],
    "plain_language_summary": "When a patient fills a prescription, that fill gets stored as a single row in the DRUG_EXPOSURE table — one row per trip to the pharmacy, with the drug name, fill date, and how many days' worth of pills the bottle contains. A DRUG_ERA collapses those individual fills into a single continuous-treatment span: as long as a new fill arrives within a tolerance window (the default is 30 days after the previous supply runs out), the algorithm treats the patient as still on the drug and extends the episode rather than starting a new one. The era is the unit most studies use when asking 'how long was this patient on treatment' — but the 30-day tolerance is a modeling choice, not a medical fact, and changing it to 0 or 90 days can meaningfully shift who counts as 'exposed' and for how long.",
    "key_terms": [
      {
        "term": "DRUG_EXPOSURE",
        "definition": "The raw OMOP table that stores one row for every single prescription fill or drug administration, recording the drug, fill date, and how many days of supply were dispensed."
      },
      {
        "term": "DRUG_ERA",
        "definition": "A derived OMOP table that merges consecutive fills of the same active ingredient into one continuous-treatment episode, provided the gap between fills does not exceed the persistence threshold."
      },
      {
        "term": "days_supply",
        "definition": "The number of days one filled prescription is intended to last — for example, a bottle labeled '30-day supply' has days_supply = 30."
      },
      {
        "term": "persistence gap",
        "definition": "The maximum number of days allowed between the end of one fill and the start of the next before the algorithm decides the patient stopped treatment and begins a new era; the OMOP default is 30 days."
      },
      {
        "term": "ingredient roll-up",
        "definition": "The step that maps every specific drug product (branded name, dose, formulation) to its underlying active ingredient so that fills of atorvastatin 20 mg and atorvastatin 40 mg are counted together as one 'atorvastatin' treatment episode."
      }
    ],
    "worked_example": {
      "scenario": "Person 1001 fills atorvastatin — a cholesterol-lowering drug — four times over about seven months. We want to know how many continuous treatment episodes the OMOP algorithm would build from those four fills using the default 30-day persistence gap. Each fill covers exactly as many days as its days_supply; when a new fill arrives within 30 days after the previous supply would run out, it is stitched onto the same era. If the gap exceeds 30 days, a new era begins. Both atorvastatin 20 mg and atorvastatin 40 mg roll up to the same 'atorvastatin' ingredient, so all four fills are candidates for the same era.",
      "dataset": {
        "caption": "Raw DRUG_EXPOSURE rows for person 1001 — exactly what an analyst would see in the OMOP pharmacy table.",
        "columns": [
          "person_id",
          "fill_date",
          "drug",
          "days_supply",
          "end_date (fill_date + days_supply − 1)"
        ],
        "rows": [
          [
            1001,
            "2023-01-10",
            "atorvastatin 20 mg",
            30,
            "2023-02-08"
          ],
          [
            1001,
            "2023-02-15",
            "atorvastatin 20 mg",
            30,
            "2023-03-16"
          ],
          [
            1001,
            "2023-03-20",
            "atorvastatin 40 mg",
            30,
            "2023-04-18"
          ],
          [
            1001,
            "2023-06-15",
            "atorvastatin 20 mg",
            30,
            "2023-07-14"
          ]
        ]
      },
      "steps": [
        "Each fill's end date is fill_date + days_supply − 1: Fill A ends Feb 8, Fill B ends Mar 16, Fill C ends Apr 18, Fill D ends Jul 14.",
        "Gap A→B: Fill A supply runs out Feb 8; Fill B arrives Feb 15, leaving 6 uncovered days (Feb 9–14). 6 ≤ 30, so Fill B is stitched onto Era 1.",
        "Gap B→C: Fill B supply runs out Mar 16; Fill C arrives Mar 20, leaving 3 uncovered days (Mar 17–19). 3 ≤ 30, so Fill C is stitched onto Era 1.",
        "Era 1 now spans Jan 10 through Apr 18 (the latest end date among Fills A, B, C), covering 99 calendar days with 90 days of actual supply and 9 days of within-era gap (6 + 3).",
        "Gap C→D: Fill C supply runs out Apr 18; Fill D arrives Jun 15, leaving 57 uncovered days (Apr 19–Jun 14). 57 > 30, so the algorithm closes Era 1 and opens Era 2 at Fill D.",
        "Era 2 contains only Fill D: Jun 15 through Jul 14, covering exactly 30 days with 0 within-era gap."
      ],
      "result": {
        "label": "2 drug eras: Era 1 spans 2023-01-10 to 2023-04-18 (99 days, 3 fills, 9 within-era gap days); Era 2 spans 2023-06-15 to 2023-07-14 (30 days, 1 fill, 0 gap days). Any outcome event in the 57-day window between Apr 19 and Jun 14 is off-treatment under the 30-day gap rule.",
        "value": {
          "era_count": 2,
          "era_1_length_days": 99,
          "era_2_length_days": 30
        }
      },
      "timeline_spec": {
        "title": "Four atorvastatin fills → two DRUG_ERAs under 30-day persistence gap",
        "caption": "Fills A, B, and C are separated by gaps of 6 and 3 days (both within the 30-day threshold) and merge into Era 1. Fill D arrives 57 days after Fill C's supply ends — exceeding the threshold — so it opens Era 2. Any event in the 57-day off-treatment window between Era 1 and Era 2 is classified as post-discontinuation, not on-treatment.",
        "alt_text": "Timeline showing four prescription fill bars labeled Fill A through Fill D. Fill A (Jan 10–Feb 8), Fill B (Feb 15–Mar 16), and Fill C (Mar 20–Apr 18) are bridged by a shaded Era 1 span with two small gap bands of 6 and 3 days. A wide off-treatment gap band stretches from Apr 19 to Jun 14 (57 days). Fill D (Jun 15–Jul 14) sits alone under a shaded Era 2 span.",
        "window": {
          "start": "2023-01-10",
          "end": "2023-07-14",
          "label": "Observation window: Jan 10 – Jul 14 (186 days)"
        },
        "events": [
          {
            "label": "Fill A — atorvastatin 20 mg",
            "start": "2023-01-10",
            "length_days": 30,
            "quantity": "30 days_supply"
          },
          {
            "label": "Fill B — atorvastatin 20 mg",
            "start": "2023-02-15",
            "length_days": 30,
            "quantity": "30 days_supply"
          },
          {
            "label": "Fill C — atorvastatin 40 mg",
            "start": "2023-03-20",
            "length_days": 30,
            "quantity": "30 days_supply"
          },
          {
            "label": "Fill D — atorvastatin 20 mg",
            "start": "2023-06-15",
            "length_days": 30,
            "quantity": "30 days_supply"
          }
        ],
        "spans": [
          {
            "kind": "exposed",
            "start": "2023-01-10",
            "end": "2023-02-08",
            "label": "Fill A supply (30 days)"
          },
          {
            "kind": "gap",
            "start": "2023-02-09",
            "end": "2023-02-14",
            "label": "6-day gap (≤ 30 → same era)"
          },
          {
            "kind": "exposed",
            "start": "2023-02-15",
            "end": "2023-03-16",
            "label": "Fill B supply (30 days)"
          },
          {
            "kind": "gap",
            "start": "2023-03-17",
            "end": "2023-03-19",
            "label": "3-day gap (≤ 30 → same era)"
          },
          {
            "kind": "exposed",
            "start": "2023-03-20",
            "end": "2023-04-18",
            "label": "Fill C supply (30 days)"
          },
          {
            "kind": "unexposed",
            "start": "2023-04-19",
            "end": "2023-06-14",
            "label": "57-day off-treatment gap (> 30 → new era)"
          },
          {
            "kind": "exposed",
            "start": "2023-06-15",
            "end": "2023-07-14",
            "label": "Fill D supply (30 days)"
          }
        ],
        "result": {
          "label": "2 eras: Era 1 = 2023-01-10 to 2023-04-18 (99 days); Era 2 = 2023-06-15 to 2023-07-14 (30 days)",
          "value": {
            "era_count": 2,
            "era_1_length_days": 99,
            "era_2_length_days": 30
          }
        }
      }
    },
    "prerequisites": [
      "omop-cdm-method-patterns-rwe",
      "omop-observation-period-rwe",
      "omop-concept-set-development-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Zero-gap (strict continuous) era",
        "description": "Eras break at any gap between the end of one exposure and the start of the next; only seamless or overlapping fills are stitched. Maximizes specificity of \"on-treatment\" person-time.",
        "edge_cases": [
          "A single late refill (even one day late) ends the era, inflating apparent discontinuation and shortening person-time.",
          "Sensitive to clock/date artifacts and weekend/holiday refill timing in claims."
        ],
        "data_source_notes": "claims: feasible because adjudicated fills carry real days_supply; EHR: too brittle when end dates are imputed."
      },
      {
        "name": "30-day persistence gap (OHDSI default DRUG_ERA)",
        "description": "Consecutive same-ingredient exposures separated by <=30 days are merged into one era. The standard network-portable derivation.",
        "edge_cases": [
          "Routine 30-day prescriptions filled a few days late stay in one era (intended); chronic-med non-adherence shorter than 30 days is masked.",
          "Comparability across sites requires confirming the shipped DRUG_ERA used the same gap."
        ],
        "data_source_notes": "claims/EHR: the default; always report it alongside a sensitivity gap so downstream users see the choice."
      },
      {
        "name": "Long (90-day) persistence gap",
        "description": "Forgives long supply interruptions; suited to chronic therapies dispensed in 90-day mail-order quantities or with intermittent-but-continuing intent.",
        "edge_cases": [
          "Manufactures exposed person-time during true non-use, biasing safety estimates toward the null.",
          "Can merge what are clinically distinct treatment episodes (stop/restart) into one era."
        ],
        "data_source_notes": "claims: align gap to the dominant days_supply (90-day mail order); justify in the SAP and stress-test against the 30-day result."
      },
      {
        "name": "Carry-over / stockpiling-capped era",
        "description": "Early refills extend the supply (carry-over) so overlapping fills do not double-count; a cap limits how much stockpiled supply can accumulate, preventing implausible perpetual coverage.",
        "edge_cases": [
          "Without a cap, repeated early refills create indefinite phantom coverage; with too tight a cap, legitimate stockpiles are truncated."
        ],
        "data_source_notes": "claims: requires quantity/days_supply per fill; this is a deviation from the standard DRUG_ERA derivation and must be documented (see exposure-episode-construction-rwe)."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Analyzing raw DRUG_EXPOSURE rows directly",
        "pros_of_this": "Provides ready-made, ingredient-level continuous episodes that map directly onto washouts, time-at-risk, and persistence; absorbs stockpiling/overlap bookkeeping into one transparent rule.",
        "cons_of_this": "The default gap hides a consequential parameter; ingredient roll-up erases within-class switching and clinical-drug detail (dose, route, brand).",
        "when_to_prefer": "When the analytic unit is a continuous ingredient-level treatment episode rather than a dose- or formulation-specific contrast."
      },
      {
        "compared_to": "Bespoke claims episode-construction algorithm outside OMOP",
        "pros_of_this": "Standardized, reviewable, and portable across CDM sites without re-mapping; the same code runs network-wide.",
        "cons_of_this": "The standard era derivation does not cap stockpiling, bridge inpatient stays, or model dynamic dosing by default.",
        "when_to_prefer": "Network/portable analyses and when the simple, transparent standard rule is adequate; switch to a bespoke algorithm only when those omitted features change the answer."
      },
      {
        "compared_to": "Clinical-drug-level exposure windows",
        "pros_of_this": "Correct grain for class-level (\"any statin\") effectiveness and safety; robust to within-class product switching.",
        "cons_of_this": "Cannot support dose-response, brand/biosimilar, or formulation-specific contrasts.",
        "when_to_prefer": "Ingredient/class questions; never substitute the era table for a clinical-drug-level question."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "DRUG_EXPOSURE = adjudicated pharmacy claim (drug_concept_id at clinical-drug level + drug_exposure_start_date + days_supply + quantity). Remove reversals/voids and same-day duplicates before stitching. Restrict era-based person-time to spans with observable pharmacy benefit; Medicare Advantage-only person-time lacks FFS pharmacy claims, so apparent washouts/discontinuations there are missingness, not non-use.",
      "ehr": "DRUG_EXPOSURE is usually an order or administration; days_supply is frequently missing, requiring end-date imputation. Visit-driven capture creates phantom gaps for out-of-system care; prefer linked dispensing to confirm initiation and continuation and treat loss to follow-up as potentially informative.",
      "registry": "Often records regimen/treatment status rather than dispensing-level supply, so native eras may be underivable; link to claims for the fill history and to a death index so eras do not run past death.",
      "linked": "Ideal substrate (EHR dose/route + claims completeness + vital-records censoring) but order/fill/administration date discrepancies must be reconciled before era stitching, and the linkable subset introduces selection."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\nimport numpy as np\n\nPERSISTENCE_GAP = 30  # days a refill may be late before a new era starts; sensitivity-test 0 / 30 / 90\n\ndef build_drug_eras(drug_exposure: pd.DataFrame,\n                    concept_ancestor: pd.DataFrame,\n                    gap_days: int = PERSISTENCE_GAP) -> pd.DataFrame:\n    de = drug_exposure.copy()\n\n    # Roll clinical drugs up to their ingredient via CONCEPT_ANCESTOR.\n    anc = concept_ancestor.rename(columns={\"descendant_concept_id\": \"drug_concept_id\",\n                                           \"ancestor_concept_id\": \"ingredient_concept_id\"})\n    de = de.merge(anc[[\"drug_concept_id\", \"ingredient_concept_id\"]], on=\"drug_concept_id\", how=\"inner\")\n\n    # Impute end date from days_supply when missing (start + days_supply - 1).\n    # Assumption: a missing or non-positive days_supply counts as a 1-day exposure, so the\n    # imputed end date never precedes the start date.\n    need = de[\"drug_exposure_end_date\"].isna()\n    supply = de.loc[need, \"days_supply\"].fillna(1).clip(lower=1).astype(int)\n    de.loc[need, \"drug_exposure_end_date\"] = (\n        de.loc[need, \"drug_exposure_start_date\"] + pd.to_timedelta(supply - 1, unit=\"D\"))\n\n    de = de.sort_values([\"person_id\", \"ingredient_concept_id\", \"drug_exposure_start_date\"])\n\n    # Carry-over: a new era starts when more than gap_days uncovered days separate the current\n    # fill from the running maximum covered end date for this (person, ingredient). cummax handles\n    # overlap/stockpiling. End dates are inclusive, so uncovered days = start - prev_end - 1.\n    g = de.groupby([\"person_id\", \"ingredient_concept_id\"], sort=False)\n    prev_cov_end = g[\"drug_exposure_end_date\"].cummax().groupby(\n        [de[\"person_id\"], de[\"ingredient_concept_id\"]]).shift()\n    gap = (de[\"drug_exposure_start_date\"] - prev_cov_end).dt.days - 1\n    new_era = prev_cov_end.isna() | (gap > gap_days)\n    de[\"era_id\"] = new_era.groupby([de[\"person_id\"], de[\"ingredient_concept_id\"]]).cumsum()\n\n    eras = (de.groupby([\"person_id\", \"ingredient_concept_id\", \"era_id\"])\n              .agg(drug_era_start_date=(\"drug_exposure_start_date\", \"min\"),\n                   drug_era_end_date=(\"drug_exposure_end_date\", \"max\"),\n                   drug_exposure_count=(\"drug_concept_id\", \"size\"),\n                   covered_days=(\"days_supply\", \"sum\"))\n              .reset_index())\n    # gap_days = elapsed span not covered by supply within the era.\n    span = (eras[\"drug_era_end_date\"] - eras[\"drug_era_start_date\"]).dt.days + 1\n    eras[\"gap_days\"] = (span - eras[\"covered_days\"]).clip(lower=0)\n    return eras.drop(columns=\"covered_days\")",
        "description": "Derive ingredient-level drug eras from an OMOP DRUG_EXPOSURE table with a configurable persistence gap. Required inputs\n(pre-loaded, OMOP-standard columns; no toy data created here):\n  drug_exposure   : person_id, drug_concept_id (clinical drug), drug_exposure_start_date (datetime),\n                    drug_exposure_end_date (datetime, may be null), days_supply (int, may be null)\n  concept_ancestor: ancestor_concept_id (ingredient), descendant_concept_id (clinical drug)  # OMOP roll-up table,\n                    pre-filtered to Ingredient-class ancestors (the full table also holds class-level ancestors)\nReturns one row per (person_id, ingredient_concept_id) era with start/end, gap_days, and drug_exposure_count.",
        "dependencies": [
          "pandas",
          "numpy"
        ],
        "source_citations": [
          "overhage-2012"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\nPERSISTENCE_GAP <- 30L  # sensitivity-test 0 / 30 / 90\n\nbuild_drug_eras <- function(drug_exposure, concept_ancestor, gap_days = PERSISTENCE_GAP) {\n  de  <- as.data.table(drug_exposure)\n  anc <- as.data.table(concept_ancestor)[, .(drug_concept_id = descendant_concept_id,\n                                             ingredient_concept_id = ancestor_concept_id)]\n  de <- merge(de, anc, by = \"drug_concept_id\", allow.cartesian = TRUE)\n\n  # Impute missing end date from days_supply (start + days_supply - 1).\n  # Assumption: a missing or non-positive days_supply counts as a 1-day exposure.\n  de[is.na(drug_exposure_end_date),\n     drug_exposure_end_date := drug_exposure_start_date + pmax(1L, fcoalesce(as.integer(days_supply), 1L)) - 1L]\n\n  setorder(de, person_id, ingredient_concept_id, drug_exposure_start_date)\n\n  # Running max covered end date (cummax) per person-ingredient handles overlap/stockpiling.\n  de[, prev_cov_end := shift(cummax(as.integer(drug_exposure_end_date))),\n     by = .(person_id, ingredient_concept_id)]\n  # End dates are inclusive, so uncovered days = start - previous covered end - 1.\n  de[, gap := as.integer(drug_exposure_start_date) - prev_cov_end - 1L]\n  de[, new_era := is.na(prev_cov_end) | gap > gap_days]\n  de[, era_id := cumsum(new_era), by = .(person_id, ingredient_concept_id)]\n\n  eras <- de[, .(drug_era_start_date  = min(drug_exposure_start_date),\n                 drug_era_end_date    = max(drug_exposure_end_date),\n                 drug_exposure_count  = .N,\n                 covered_days         = sum(days_supply, na.rm = TRUE)),\n             by = .(person_id, ingredient_concept_id, era_id)]\n  eras[, gap_days := pmax(0L, as.integer(drug_era_end_date - drug_era_start_date) + 1L - covered_days)]\n  eras[, covered_days := NULL][]\n}",
        "description": "Drug-era derivation with data.table (fast cummax/shift over sorted exposures). Inputs mirror the Python version:\n  drug_exposure   : person_id, drug_concept_id, drug_exposure_start_date (Date),\n                    drug_exposure_end_date (Date, may be NA), days_supply (integer, may be NA)\n  concept_ancestor: ancestor_concept_id (ingredient), descendant_concept_id (clinical drug)",
        "dependencies": [
          "data.table"
        ],
        "source_citations": [
          "overhage-2012"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "%let gap = 30;  /* persistence gap in days; rerun at 0 and 90 for sensitivity */\n\n/* Roll clinical drugs up to ingredient and impute missing end dates from days_supply. */\n/* Assumption: a missing days_supply counts as a 1-day exposure when imputing the end date. */\nproc sql;\n  create table de_ing as\n  select e.person_id,\n         a.ancestor_concept_id as ingredient_concept_id,\n         e.drug_concept_id,\n         e.drug_exposure_start_date as start_dt format=date9.,\n         coalesce(e.drug_exposure_end_date,\n                  e.drug_exposure_start_date + coalesce(e.days_supply, 1) - 1) as end_dt format=date9.,\n         coalesce(e.days_supply, 0) as days_supply\n  from omop.drug_exposure e\n  inner join omop.concept_ancestor a\n    on e.drug_concept_id = a.descendant_concept_id   /* ingredient-level roll-up */\n  inner join omop.concept c\n    on a.ancestor_concept_id = c.concept_id\n   and c.concept_class_id = 'Ingredient';            /* keep ingredient ancestors only */\nquit;\n\nproc sort data=de_ing; by person_id ingredient_concept_id start_dt; run;\n\n/* State machine: a new era starts when more than &gap uncovered days separate the fill from the */\n/* running maximum covered end date (cov_end) for this person-ingredient. RETAIN carries state across rows. */\ndata eras_long;\n  set de_ing;\n  by person_id ingredient_concept_id;\n  retain era_id cov_end era_start covered n_fills;\n  if first.ingredient_concept_id then do;\n    era_id = 0; cov_end = .; era_start = .; covered = 0; n_fills = 0;\n  end;\n  /* End dates are inclusive, so uncovered days = start_dt - cov_end - 1. */\n  if cov_end = . or start_dt - cov_end - 1 > &gap then do;   /* gap exceeded -> new era */\n    if cov_end ne . then output;                          /* emit the completed era */\n    era_id + 1; era_start = start_dt; cov_end = end_dt; covered = days_supply; n_fills = 1;\n  end;\n  else do;                                                /* same era: extend (carry-over via max) */\n    cov_end  = max(cov_end, end_dt);\n    covered  = covered + days_supply;\n    n_fills  = n_fills + 1;\n  end;\n  if last.ingredient_concept_id then output;              /* emit the final era for this group */\n  keep person_id ingredient_concept_id era_id era_start cov_end covered n_fills;\nrun;\n\n/* eras_long already has one row per era; add the fill count and within-era gap days. */\nproc sql;\n  create table drug_era as\n  select person_id, ingredient_concept_id, era_id,\n         era_start as drug_era_start_date format=date9.,\n         cov_end   as drug_era_end_date   format=date9.,\n         n_fills   as drug_exposure_count,\n         max(0, (cov_end - era_start + 1) - covered) as gap_days\n  from eras_long;\nquit;",
        "description": "Drug-era derivation in SAS: PROC SQL maps clinical drugs to ingredient via CONCEPT_ANCESTOR, then a DATA step with\nRETAIN/cumulative-max implements the gap-stitching state machine over sorted exposures. Required inputs (post-ETL,\nOMOP-standard):\n  omop.drug_exposure   : person_id, drug_concept_id, drug_exposure_start_date, drug_exposure_end_date, days_supply\n  omop.concept_ancestor: ancestor_concept_id (ingredient), descendant_concept_id (clinical drug)\n  omop.concept         : concept_id, concept_class_id (used to keep only Ingredient-class ancestors)\nSet &gap to 0 / 30 / 90 to produce the primary result and its sensitivity analyses.",
        "dependencies": [],
        "source_citations": [
          "overhage-2012"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": "omop-drug-exposure-drug-era-rwe-timeline.svg",
        "mermaid": null,
        "caption": "Fills A, B, and C are separated by gaps of 6 and 3 days (both within the 30-day threshold) and merge into Era 1. Fill D arrives 57 days after Fill C's supply ends — exceeding the threshold — so it opens Era 2. Any event in the 57-day off-treatment window between Era 1 and Era 2 is classified as post-discontinuation, not on-treatment.",
        "alt_text": "Timeline showing four prescription fill bars labeled Fill A through Fill D. Fill A (Jan 10–Feb 8), Fill B (Feb 15–Mar 16), and Fill C (Mar 20–Apr 18) are bridged by a shaded Era 1 span with two small gap bands of 6 and 3 days. A wide off-treatment gap band stretches from Apr 19 to Jun 14 (57 days). Fill D (Jun 15–Jul 14) sits alone under a shaded Era 2 span.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Src[Source dispensings / administrations] --> DE[DRUG_EXPOSURE<br/>one row per fill at clinical-drug grain<br/>start, end, days_supply, quantity, route]\n  DE --> Roll[Roll up clinical drug to ingredient<br/>via CONCEPT_ANCESTOR]\n  Roll --> Imp[Impute end date when missing<br/>start + days_supply - 1]\n  Imp --> Sort[Sort by person_id, ingredient, start_date]\n  Sort --> Gap{Next fill starts within<br/>persistence gap of running cov_end?}\n  Gap -- Yes (<= gap) --> Same[Extend current era<br/>cov_end = max cov_end, end]\n  Gap -- No (> gap) --> New[Close era, start a new one]\n  Same --> ERA[DRUG_ERA<br/>start, end, gap_days, drug_exposure_count]\n  New --> ERA\n  ERA --> Use[Washout / time-at-risk / persistence<br/>report under 0 / 30 / 90 day gap]",
        "caption": "From source dispensings to DRUG_EXPOSURE to gap-stitched, ingredient-level DRUG_ERA. The persistence gap is the central parameter and feeds downstream washout, time-at-risk, and persistence logic.",
        "alt_text": "Flowchart showing source records becoming DRUG_EXPOSURE rows, rolled up to ingredient, end dates imputed, sorted, and stitched into DRUG_ERA episodes by a persistence-gap decision, then used for washout, time-at-risk, and persistence.",
        "source_type": "illustrative",
        "source_citations": [
          "overhage-2012"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "gantt\n  title One atorvastatin DRUG_ERA under a 30-day persistence gap (claims)\n  dateFormat YYYY-MM-DD\n  axisFormat %b %Y\n  section DRUG_EXPOSURE rows\n  Fill 1 atorvastatin 20mg (30d supply) :done, f1, 2023-01-05, 30d\n  Fill 2 atorvastatin 20mg (30d supply) :done, f2, 2023-02-10, 30d\n  Fill 3 atorvastatin 40mg (90d supply) :done, f3, 2023-03-20, 90d\n  section Derived DRUG_ERA\n  Ingredient era (gaps 6d and 8d, both <= 30) :active, era, 2023-01-05, 164d",
        "caption": "Three clinical-drug fills (two NDCs, same ingredient) with within-window gaps of 6 and 8 days stitch into one atorvastatin drug era spanning 2023-01-05 to 2023-06-17. A gap above 30 days would split the era and reclassify any interim event from on-treatment to post-discontinuation.",
        "alt_text": "Gantt chart of three atorvastatin fills with small gaps merged into a single drug era under a 30-day persistence gap.",
        "source_type": "illustrative",
        "source_citations": [
          "hripcsak-2016"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "is_variant_of",
        "target_slug": "omop-cdm-method-patterns-rwe",
        "notes": "Child concept in the OMOP CDM method-patterns family, specializing the DRUG_EXPOSURE and DRUG_ERA tables."
      },
      {
        "relation_type": "used_with",
        "target_slug": "exposure-episode-construction-rwe",
        "notes": "Drug eras are OMOP's standardized episode-construction output; extend with stockpiling caps, inpatient bridging, or dynamic dosing when the default gap rule is inadequate."
      },
      {
        "relation_type": "used_with",
        "target_slug": "omop-time-at-risk-cohort-exit-rwe",
        "notes": "Drug-era start/end dates supply the on-treatment time-at-risk window and cohort-exit dates."
      },
      {
        "relation_type": "used_with",
        "target_slug": "omop-concept-set-development-rwe",
        "notes": "The ingredient/clinical-drug concept sets that define which DRUG_EXPOSURE rows enter an era are built with OMOP concept-set development."
      },
      {
        "relation_type": "see_also",
        "target_slug": "persistence-time-to-discontinuation",
        "notes": "The persistence gap that ends a drug era is the same threshold that defines treatment discontinuation in persistence analyses."
      },
      {
        "relation_type": "see_also",
        "target_slug": "pdc-proportion-of-days-covered",
        "notes": "PDC and MPR measure coverage over a fixed window; drug eras instead define the episode boundaries. Both depend on days_supply, end-date imputation, and stockpiling rules."
      },
      {
        "relation_type": "see_also",
        "target_slug": "inpatient-bridging-exposure-rwe",
        "notes": "The standard DRUG_ERA derivation does not bridge inpatient stays where pharmacy supply is unobserved; bridging logic must be added when hospitalizations interrupt outpatient fills."
      },
      {
        "relation_type": "used_with",
        "target_slug": "active-comparator-new-user",
        "notes": "Drug-era boundaries define the new-user washout and on-treatment follow-up that an active-comparator new-user design requires."
      }
    ],
    "aliases": [
      "OMOP drug exposure",
      "OMOP drug era",
      "drug_exposure table",
      "drug_era table",
      "ingredient-level drug era"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta",
      "pqa-cms"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "omop-observation-period-rwe",
    "name": "OMOP Observation Period",
    "short_definition": "The OMOP CDM table that records the spans of time during which a person is reliably observable in a data source, defining the denominator of valid person-time for baseline lookback and outcome follow-up.",
    "long_description": "The **OMOP `observation_period` table** is the CDM's formal answer to a question every real-world study must answer before\nany exposure, covariate, or outcome is counted: *when was this person actually under observation in this data source?* Each\nrow is a `[observation_period_start_date, observation_period_end_date]` span for a `person_id`, tagged with a\n`period_type_concept_id` describing how the span was derived (e.g., from enrollment files in claims, from the first/last\nrecorded clinical event in EHR). Person-time outside any observation period is **unobservable**, not \"event-free\": the\nabsence of a diagnosis, fill, or procedure there is missingness, not evidence of non-occurrence. Every defensible OMOP\nstudy therefore anchors its lookback (baseline) window and its at-risk (follow-up) window *inside* observation-period spans.\n\n**Core conceptual distinction**. The observation period is the *substrate of observability*, distinct from three things it\nis routinely confused with. (1) It is **not the cohort time-at-risk**: time-at-risk is a study-defined window (index date\nplus a risk-window rule), whereas the observation period is the source-level envelope that the time-at-risk must fit\nwithin. (2) It is **not continuous enrollment per se**: in an ETL'd claims source the two often coincide because the period\nis built from enrollment spans, but in EHR the period is inferred from clinical-activity boundaries and a person can be\n\"enrolled\" yet invisible (no encounters). (3) It is **not the washout/lookback rule**: washout is an exposure-cleaning\ndecision applied *within* observable time; the observation period determines whether enough observable time exists to\nevaluate the washout at all. 
The downstream consequence is concrete: requiring `observation_period_start_date <=\nindex_date - lookback_days` makes \"no prior diagnosis/fill\" a *true* absence; requiring `observation_period_end_date >=\nindex_date + min_followup` prevents counting people who were never observable long enough to experience the outcome.\n\n**Pros, cons, and trade-offs**\n- **vs ad-hoc continuous-enrollment flags built directly from raw enrollment claims:** The OMOP `observation_period` is a\n  single, ETL-validated, source-agnostic abstraction, so the same OHDSI cohort/PLP/PLE code runs unchanged across claims,\n  EHR, and registry data and across a federated network. Cost: the ETL's gap-collapsing rules (how many days of no\n  enrollment break a period; whether short gaps are bridged) are baked in upstream and may not match your protocol — you\n  inherit decisions you did not make. **Prefer the CDM table** for network/portable studies; **prefer raw enrollment\n  spans** when you need bespoke gap rules and control the analysis end-to-end.\n- **vs ignoring observability (using all recorded events):** Anchoring to observation periods removes left-truncation bias\n  (treating a prevalent condition recorded on day 1 of data as incident) and removes spurious \"event-free\" person-time.\n  Cost: cohort size shrinks and the surviving population skews toward the continuously observable, who differ from the\n  transient population. **Always prefer observability-anchored analysis**; the alternative is not a trade-off, it is a bug.\n- **vs a single global lookback applied uniformly:** Using `observation_period_start_date` to define an *all-available*\n  lookback captures more comorbidity history but makes covariate ascertainment depend on observable duration (people seen\n  longer accrue more codes), inducing differential measurement by exposure if observable time differs across arms. 
A\n  *fixed* lookback (e.g., 365 days, requiring the period to cover it) standardizes ascertainment at the cost of discarding\n  short-history patients. **Prefer a fixed, enforced lookback** for comparative analyses; reserve all-available lookback\n  for prediction, where calibration tolerates it.\n\n**When to use**. Any study built on the OMOP CDM: defining cohort eligibility (require observable lookback and follow-up),\nbounding time-at-risk so follow-up never extends past `observation_period_end_date`, computing incidence-rate denominators\nfrom observable person-time, and ensuring portability of a protocol across a federated OHDSI network where each site's\nobservability is encoded the same way. It is the gatekeeper step that runs *before* exposure and outcome logic.\n\n**When NOT to use, and when it is actively misleading or dangerous**. It is dangerous to treat the observation period as\na clinical truth rather than a data-capture artifact. (1) **EHR end-dates as outcome ascertainment:** an\n`observation_period_end_date` derived from \"last recorded event\" can be set by the outcome itself: a patient who dies or\nbecomes too sick to visit simply stops generating records, so censoring at the period end can censor *on the outcome*,\nbiasing rates downward. Use an external death index; do not infer end-of-observability from clinical silence. (2)\n**Mistaking a payer-driven period boundary for a clinical event:** disenrollment is administrative; ending follow-up there\nis correct for observability but must not be read as \"the patient was stable.\" (3) **All-available lookback in a\ncomparative study** when observable duration differs by arm: it manufactures differential confounder measurement. 
(4)\n**Single-period assumptions** when the source allows multiple periods per person: silently using only the first (or only\nthe longest) period drops valid follow-up and can break the requirement that index falls inside an observable span.\n\n**Data-source operational depth**\n- **Claims (FFS):** `observation_period` is typically ETL'd from medical+pharmacy enrollment spans. The ETL's gap rule\n  matters: short administrative gaps may be bridged into one period or split into several; verify the rule before\n  requiring \"continuous\" lookback. Failure mode: **Medicare Advantage person-time** is frequently absent from FFS claims,\n  so an MA enrollee can have an observation period reflecting only their FFS months — apparent \"gaps\" or short periods are\n  MA capitation, not true unobservability. Restrict to FFS Parts A/B/D (or the equivalent commercial medical+pharmacy\n  benefit) and exclude MA-only spans rather than trusting the period blindly.\n- **EHR:** `observation_period` is usually inferred from the first to last clinical event (visits, labs, orders). This\n  makes the boundaries **encounter-driven and informatively censored**: sicker, frequently-seen patients have long dense\n  periods; healthy or transient patients have short sparse ones, and **external care leakage** (care delivered outside the\n  health system) is invisible inside the period. Differential follow-up by exposure is the rule, not the exception; treat\n  `observation_period_end_date` as potentially outcome-dependent and supplement with linked mortality.\n- **Registry:** Observability is enrollment in the registry plus its ascertainment completeness; the OMOP period (if the\n  registry is mapped) reflects registry contact, not full healthcare exposure. 
Link to claims for complete drug/utilization\n  person-time and to a death index for censoring.\n- **Linked claims–EHR–vital records:** The richest substrate, but each source contributes its own observation periods with\n  different `period_type_concept_id` values and different boundary semantics. Reconcile them explicitly (intersection for\n  \"observable everywhere,\" union for \"observable somewhere\") before defining lookback/follow-up; never pool periods of\n  different types as if interchangeable, and watch for **differential competing risks** (e.g., death censoring observable\n  time more in the older/sicker arm).\n\n**Worked claims example.** Question: 12-month incidence of acute kidney injury (AKI) after initiating a study drug among\nadults, in an OMOP-mapped commercial + Medicare FFS source that allows **multiple observation periods per person** (payer\nchanges split spans). Setup: cohort entry (`index_date`) = first qualifying `drug_exposure` start; lookback = 365 days;\nminimum follow-up to be evaluable = none (rate analysis uses observable person-time), but follow-up is censored at the\nearliest of AKI, `observation_period_end_date`, death, or 365 days post-index. Steps: (1) For each candidate `index_date`,\nfind the *single observation period that contains the index date* — not the first, not the longest — because exposure and\nfollow-up must live inside one contiguous observable span. (2) Eligibility lookback: keep the person only if that\ncontaining period satisfies `observation_period_start_date <= index_date - 365`, so \"no prior AKI / no prior study drug\" in\nthe baseline window reflects true absence rather than an unobserved gap. (3) Restrict `period_type_concept_id` to the\nclaims-enrollment-derived type and exclude MA-only spans, so capitated MA time is not mistaken for observable FFS person-time. 
(4)\nAt-risk window: `risk_start = index_date`, `risk_end = min(index_date + 365, observation_period_end_date, death_date)`;\nobservable person-time for the rate denominator is `risk_end - risk_start`, never extending past the period end. (5) Count\nthe first AKI `condition_occurrence` whose date lies inside the at-risk window. (6) Diagnostics before trusting anything:\ndistribution of observable lookback and follow-up days by arm (differential observability is a confounding-measurement\nred flag), count of persons with >1 observation period, and a sensitivity analysis varying the lookback requirement (e.g.,\n183 vs 365 days) and the gap-bridging assumption. The single highest-yield check is plotting follow-up length by exposure\narm: if the curves separate, your `observation_period_end_date` is encoding something exposure-related and a naive rate is\nbiased.",
    "primary_category": "Study_Design",
    "tags": [
      "omop-cdm",
      "observation-period",
      "observable-person-time",
      "continuous-enrollment",
      "left-truncation",
      "time-at-risk",
      "ohdsi",
      "cohort-construction"
    ],
    "applies_to_study_types": [
      "claims_analysis",
      "ehr_study",
      "registry_linkage",
      "target_trial_emulation",
      "comparative_effectiveness"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1136/amiajnl-2011-000376",
        "url": "https://doi.org/10.1136/amiajnl-2011-000376",
        "citation_text": "Overhage JM, Ryan PB, Reich CG, Hartzema AG, Stang PE. Validation of a common data model for active safety surveillance research. Journal of the American Medical Informatics Association. 2012;19(1):54-60.",
        "year": 2012,
        "authors_short": "Overhage et al.",
        "notes": "Defines the OMOP CDM including the observation_period construct that operationalizes observable person-time across heterogeneous claims and EHR sources."
      },
      {
        "role": "explain",
        "doi": "10.3233/978-1-61499-564-7-574",
        "url": "https://doi.org/10.3233/978-1-61499-564-7-574",
        "citation_text": "Hripcsak G, Duke JD, Shah NH, et al. Observational Health Data Sciences and Informatics (OHDSI): opportunities for observational researchers. Studies in Health Technology and Informatics. 2015;216:574-578.",
        "year": 2015,
        "authors_short": "Hripcsak et al.",
        "notes": "Establishes the OHDSI conventions under which observation_period governs observability for portable, federated network studies."
      },
      {
        "role": "demonstrate",
        "doi": "10.1002/pds.4297",
        "url": "https://doi.org/10.1002/pds.4297",
        "citation_text": "Berger ML, Sox H, Willke RJ, et al. Good practices for real-world data studies of treatment and/or comparative effectiveness: recommendations from the joint ISPOR-ISPE Special Task Force on real-world evidence in health care decision making. Pharmacoepidemiology and Drug Safety. 2017;26(9):1033-1039.",
        "year": 2017,
        "authors_short": "Berger et al.",
        "notes": "Codifies the requirement to define observable person-time, continuous enrollment, and ascertainment windows a priori in RWD comparative studies."
      },
      {
        "role": "use",
        "doi": "10.1093/jamia/ocy032",
        "url": "https://doi.org/10.1093/jamia/ocy032",
        "citation_text": "Reps JM, Schuemie MJ, Suchard MA, Ryan PB, Rijnbeek PR. Design and implementation of a standardized framework to generate and evaluate patient-level prediction models using observational healthcare data. Journal of the American Medical Informatics Association. 2018;25(8):969-975.",
        "year": 2018,
        "authors_short": "Reps et al.",
        "notes": "OHDSI patient-level prediction framework that uses observation_period to require observable lookback and to bound the at-risk window in routine analyses."
      }
    ],
    "plain_language_summary": "An OMOP Observation Period is the recorded span of time during which a data source can actually see a patient — their health insurance was active, they were visiting the clinic, or they were enrolled in a registry. Any clinical event that happens outside this window is simply invisible to the database: the absence of a record there means nothing, because no one was watching. Every study built on OMOP data must anchor its lookback and its follow-up windows inside this observable span, or the researcher risks counting missing data as evidence that events never happened.",
    "key_terms": [
      {
        "term": "OMOP CDM",
        "definition": "A standardized data format (Common Data Model) created by the OHDSI community that reorganizes claims, EHR, and registry data into the same table structure so analysis code can run across many different data sources."
      },
      {
        "term": "observation period",
        "definition": "A row in the OMOP observation_period table giving the start date and end date of one continuous span during which a specific patient was observable in the data source."
      },
      {
        "term": "lookback window",
        "definition": "The block of time before a patient's study entry date that a researcher searches for prior diagnoses, prior drug use, or other baseline information — it only produces valid results when it falls entirely inside an observation period."
      },
      {
        "term": "follow-up window",
        "definition": "The block of time after study entry during which the researcher watches for the outcome of interest; it must end no later than the observation period end date, because events after that date are unobserved."
      },
      {
        "term": "period_type_concept_id",
        "definition": "A code in the OMOP table that records how the observation period was created — for example, from insurance enrollment records in claims data versus from first-to-last clinical visit in EHR data."
      }
    ],
    "worked_example": {
      "scenario": "Maria is a 58-year-old patient in a commercial claims database. Her insurer enrolled her on 2023-01-15 and she disenrolled on 2023-10-31, giving her a single observation period of 290 days. A researcher wants to study whether a new blood-pressure drug (amlodipine) is associated with a kidney function test abnormality within 6 months of starting the drug. Maria started amlodipine on 2023-04-01 and had a kidney test come back abnormal on 2023-07-20 — both fall inside her observation period. She also had an urgent-care visit for chest pain on 2023-11-15 — but she had already disenrolled, so that visit is completely invisible to the database.",
      "dataset": {
        "caption": "The OMOP observation_period table row for Maria, plus her two clinical events.",
        "columns": [
          "person_id",
          "table",
          "event_date",
          "event_detail",
          "inside_observation_period"
        ],
        "rows": [
          [
            7701,
            "observation_period",
            "2023-01-15 to 2023-10-31",
            "period_type: claims enrollment (44814722, Period while enrolled in insurance)",
            "— (defines the window)"
          ],
          [
            7701,
            "drug_exposure",
            "2023-04-01",
            "amlodipine, 30-day supply",
            "YES — counted"
          ],
          [
            7701,
            "measurement",
            "2023-07-20",
            "serum creatinine elevated",
            "YES — counted"
          ],
          [
            7701,
            "visit_occurrence",
            "2023-11-15",
            "urgent care, chest pain",
            "NO — invisible, after disenrollment"
          ]
        ]
      },
      "steps": [
        "Identify Maria's observation period: 2023-01-15 through 2023-10-31 (290 days).",
        "Confirm the index date (amlodipine start, 2023-04-01) falls inside the observation period, which it does (day 77 of 290; 76 days after the period start).",
        "Check the lookback window: to verify Maria had no prior kidney problems, the researcher needs at least 90 days before 2023-04-01, meaning back to 2023-01-01. Her observation period starts 2023-01-15, so only 76 days of clean lookback are available — the researcher must decide whether that is sufficient or must exclude Maria.",
        "Set the follow-up window: the researcher wants 6 months (183 days) after 2023-04-01, which would end 2023-10-01. That date is before the observation period end (2023-10-31), so the full 6-month follow-up is observable.",
        "Find outcomes inside the follow-up: the kidney test on 2023-07-20 falls between 2023-04-01 and 2023-10-01 — it is counted as the event.",
        "The urgent-care visit on 2023-11-15 is 15 days after the observation period ends — the database has no record of it, so the researcher cannot see or count it.",
        "Result: Maria contributed 183 observable follow-up days and had one outcome event on day 110."
      ],
      "result": {
        "label": "Observable follow-up days / outcome found inside window",
        "value": "183 follow-up days (2023-04-01 to 2023-10-01); outcome on day 110 (2023-07-20); post-period event on 2023-11-15 is unobserved and not counted"
      },
      "timeline_spec": {
        "title": "Maria's OMOP Observation Period — one event inside, one outside",
        "window": {
          "start": "2023-01-01",
          "end": "2023-12-31",
          "label": "Calendar year shown for context"
        },
        "events": [
          {
            "label": "Amlodipine start (index date)",
            "start": "2023-04-01",
            "length_days": 1,
            "quantity": "drug_exposure — 30-day supply"
          },
          {
            "label": "Kidney test abnormal (outcome) — COUNTED",
            "start": "2023-07-20",
            "length_days": 1,
            "quantity": "measurement — serum creatinine elevated"
          },
          {
            "label": "Urgent-care visit (chest pain) — INVISIBLE",
            "start": "2023-11-15",
            "length_days": 1,
            "quantity": "visit_occurrence — outside observation period"
          }
        ],
        "spans": [
          {
            "kind": "followup",
            "start": "2023-01-15",
            "end": "2023-10-31",
            "label": "Observation period (290 days) — Maria is observable here"
          },
          {
            "kind": "followup",
            "start": "2023-04-01",
            "end": "2023-10-01",
            "label": "Follow-up window: 183 days post-index, capped inside observation period"
          },
          {
            "kind": "gap",
            "start": "2023-11-01",
            "end": "2023-12-31",
            "label": "Unobserved time — Maria disenrolled; no records possible"
          }
        ],
        "result": {
          "label": "Outcome (day 110 post-index) inside follow-up window = COUNTED; urgent-care visit (day 228 post-index) outside observation period = INVISIBLE",
          "value": "1 outcome event observed; 1 event unobserved"
        },
        "caption": "Maria's 290-day observation period runs from her insurance enrollment (2023-01-15) to disenrollment (2023-10-31). Her amlodipine start and subsequent kidney test both fall inside this window and are counted. Her urgent-care visit 15 days after disenrollment is invisible to the database — not event-free, just unobserved.",
        "alt_text": "A horizontal timeline spanning 2023. A blue bar labeled 'Observation period (290 days)' runs from January 15 to October 31. Inside it, two point events are marked: 'Amlodipine start (index date)' on April 1 and 'Kidney test abnormal — COUNTED' on July 20. A shaded follow-up bar runs from April 1 to October 1 (183 days). After October 31, a gray shaded region labeled 'Unobserved time' extends to December 31, with a point event 'Urgent-care visit — INVISIBLE' marked on November 15 inside the gray zone."
      }
    },
    "prerequisites": [
      "continuous-enrollment-observable-time-rwe",
      "omop-cdm-method-patterns-rwe",
      "time-zero-index-date-alignment-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Fixed enforced lookback within the containing period",
        "description": "Require the observation period that contains the index date to extend at least N days before index (e.g., observation_period_start_date <= index_date - 365), standardizing baseline covariate ascertainment across arms.",
        "edge_cases": [
          "Discards patients with short observable history, shifting the cohort toward the continuously observable.",
          "In EHR, \"365 observable days\" can still be sparse (few encounters), so a satisfied lookback does not guarantee adequate code capture."
        ],
        "data_source_notes": "claims: confirm the ETL gap rule before equating period coverage with continuous enrollment; EHR: pair the duration requirement with an encounter-density check."
      },
      {
        "name": "All-available lookback from observation_period_start_date",
        "description": "Use the entire observable history before index for covariate ascertainment rather than a fixed window.",
        "edge_cases": [
          "Covariate counts increase with observable duration, inducing differential confounder measurement when observable time differs by exposure arm.",
          "Acceptable for prediction (richer features) but hazardous for comparative causal estimation."
        ],
        "data_source_notes": "Report observable-lookback distribution by arm; if it differs, prefer a fixed enforced lookback."
      },
      {
        "name": "Multi-period (containing-period) handling",
        "description": "When the source allows multiple observation periods per person, select the single period that contains the index date for both lookback eligibility and at-risk bounding, rather than the first or longest period.",
        "edge_cases": [
          "Index dates falling in an inter-period gap are unobservable and must be excluded, not bridged silently.",
          "Choosing first/longest can place follow-up outside the observable span or drop valid follow-up."
        ],
        "data_source_notes": "claims: payer changes commonly split spans; EHR: long quiescent gaps may split a person into several periods depending on ETL."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Ad-hoc continuous-enrollment flags from raw enrollment claims",
        "pros_of_this": "One ETL-validated, source-agnostic observability abstraction that makes OHDSI cohort/PLE/PLP code portable across claims, EHR, and federated networks.",
        "cons_of_this": "Inherits the upstream ETL's gap-collapsing and boundary rules, which may not match the protocol's intended continuous-enrollment definition.",
        "when_to_prefer": "Network or multi-source studies needing portability and a single observability convention."
      },
      {
        "compared_to": "Using all recorded events without anchoring to observability",
        "pros_of_this": "Removes left-truncation of prevalent conditions and eliminates spurious event-free (unobservable) person-time.",
        "cons_of_this": "Reduces cohort size and skews toward the continuously observable population.",
        "when_to_prefer": "Always for valid person-time denominators and unbiased baseline ascertainment; the alternative is a defect, not a real option."
      },
      {
        "compared_to": "A single global lookback applied uniformly regardless of observable duration",
        "pros_of_this": "A fixed lookback enforced against the containing period standardizes covariate ascertainment across arms.",
        "cons_of_this": "Discards short-history patients; all-available lookback would capture more history but risks differential measurement.",
        "when_to_prefer": "Comparative causal analyses where standardized confounder measurement matters more than maximal history."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "observation_period is ETL'd from enrollment spans; verify the gap-bridging rule and exclude Medicare Advantage-only person-time where FFS claims are missing, so short periods are not mistaken for true unobservability. Bound follow-up at observation_period_end_date (disenrollment is administrative, not clinical).",
      "ehr": "observation_period is inferred from first/last clinical event and is encounter-driven and informatively censored; treat observation_period_end_date as potentially outcome-dependent (death/sickness stops record generation) and supplement with a linked death index.",
      "registry": "The period reflects registry contact and ascertainment completeness, not full healthcare exposure; link to claims for complete drug/utilization person-time and to a death index for censoring.",
      "linked": "Each source contributes observation periods with different period_type_concept_id values and boundary semantics; reconcile by intersection (observable everywhere) or union (observable somewhere) before defining lookback/follow-up, and watch for differential competing risks (death) censoring observable time unevenly across arms."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\n\nLOOKBACK_DAYS = 365     # required observable baseline before index\nMAX_FOLLOWUP  = 365     # cap the at-risk window post-index\nCLAIMS_TYPES  = {44814722}  # period_type_concept_id: \"Period while enrolled in insurance\" (legacy type concept; set per source)\n\ndef anchor_to_observation_period(cohort: pd.DataFrame, obsper: pd.DataFrame) -> pd.DataFrame:\n    op = obsper[obsper[\"period_type_concept_id\"].isin(CLAIMS_TYPES)]\n\n    # Join every candidate index to ALL of that person's periods, then keep the ONE containing the index date.\n    # (Sources may record multiple periods per person; first/longest selection is a common, silent bug.)\n    m = cohort.merge(op, on=\"person_id\", how=\"inner\")\n    contains = ((m[\"observation_period_start_date\"] <= m[\"index_date\"]) &\n                (m[\"observation_period_end_date\"]   >= m[\"index_date\"]))\n    m = m[contains].copy()\n\n    # Enforced fixed lookback: the containing period must start early enough that \"no prior X\" is a true absence.\n    lookback_ok = m[\"observation_period_start_date\"] <= m[\"index_date\"] - pd.Timedelta(days=LOOKBACK_DAYS)\n    m = m[lookback_ok].copy()\n\n    # At-risk window bounded by observability: follow-up never extends past the period end.\n    m[\"baseline_start\"] = m[\"index_date\"] - pd.Timedelta(days=LOOKBACK_DAYS)\n    m[\"risk_start\"]     = m[\"index_date\"]\n    m[\"risk_end\"]       = m[[\"observation_period_end_date\"]].assign(\n                              cap=m[\"index_date\"] + pd.Timedelta(days=MAX_FOLLOWUP)\n                          ).min(axis=1)\n    m[\"followup_days\"]  = (m[\"risk_end\"] - m[\"risk_start\"]).dt.days\n\n    # One contiguous evaluable span per person (drop index dates that fell in inter-period gaps: none survive 'contains').\n    return (m.sort_values([\"person_id\", \"index_date\"])\n             .drop_duplicates(\"person_id\")\n             [[\"person_id\", \"index_date\", \"baseline_start\", \"risk_start\", 
\"risk_end\", \"followup_days\"]])",
        "description": "Anchor a candidate cohort to OMOP observation_period: enforce a fixed lookback and bound the at-risk window. Required\ninputs (already ETL'd to the OMOP CDM and cleaned):\n  cohort : one row per candidate entry -> person_id, index_date (datetime)\n  obsper : OMOP observation_period      -> person_id, observation_period_start_date, observation_period_end_date,\n           period_type_concept_id   # restrict to the claims-enrollment-derived type(s) upstream or via CLAIMS_TYPES\nReturns evaluable persons with the containing-period bounds and a follow-up window that never exceeds observability.\nBuild covariates only in [index_date - LOOKBACK_DAYS, index_date] and count outcomes only inside [risk_start, risk_end].",
        "dependencies": [
          "pandas"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\nLOOKBACK_DAYS <- 365L\nMAX_FOLLOWUP  <- 365L\nCLAIMS_TYPES  <- c(44814722L)  # period_type_concept_id: \"Period while enrolled in insurance\" (legacy type concept; set per source)\n\nanchor_to_observation_period <- function(cohort, obsper) {\n  setDT(cohort); setDT(obsper)\n  op <- obsper[period_type_concept_id %in% CLAIMS_TYPES]\n\n  m <- merge(cohort, op, by = \"person_id\", allow.cartesian = TRUE)\n\n  # Keep the observation period that CONTAINS the index date (not the first or longest one).\n  m <- m[observation_period_start_date <= index_date &\n         observation_period_end_date   >= index_date]\n\n  # Enforced fixed lookback so baseline \"no prior X\" reflects true absence, not an unobserved gap.\n  m <- m[observation_period_start_date <= index_date - LOOKBACK_DAYS]\n\n  # At-risk window bounded by observability: follow-up cannot exceed the period end.\n  m[, baseline_start := index_date - LOOKBACK_DAYS]\n  m[, risk_start     := index_date]\n  m[, risk_end       := pmin(observation_period_end_date, index_date + MAX_FOLLOWUP)]\n  m[, followup_days  := as.integer(risk_end - risk_start)]\n\n  setorder(m, person_id, index_date)\n  unique(m, by = \"person_id\")[, .(person_id, index_date, baseline_start,\n                                  risk_start, risk_end, followup_days)]\n}",
        "description": "Anchor a candidate cohort to OMOP observation_period with data.table. Inputs mirror the Python version:\n  cohort : person_id, index_date (Date)\n  obsper : person_id, observation_period_start_date (Date), observation_period_end_date (Date), period_type_concept_id\nKeep the single period containing index, enforce a fixed observable lookback, and cap follow-up at the period end.",
        "dependencies": [
          "data.table"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "%let lookback = 365;\n%let maxfu    = 365;\n%let claimstype = 44814722;  /* period_type_concept_id: Period while enrolled in insurance (legacy type concept; set per source) */\n\n/* Keep the observation period that CONTAINS the index date; enforce the fixed observable lookback. */\nproc sql;\n  create table anchored as\n  select c.person_id,\n         c.index_date,\n         (c.index_date - &lookback)                          as baseline_start format=date9.,\n         c.index_date                                        as risk_start    format=date9.,\n         min(op.observation_period_end_date,\n             c.index_date + &maxfu)                          as risk_end      format=date9.,\n         calculated risk_end - c.index_date                  as followup_days\n  from work.cohort c\n  inner join work.observation_period op\n    on  c.person_id = op.person_id\n    and op.period_type_concept_id = &claimstype\n    and op.observation_period_start_date <= c.index_date           /* period contains index */\n    and op.observation_period_end_date   >= c.index_date\n    and op.observation_period_start_date <= c.index_date - &lookback /* observable lookback satisfied */\n  ;\nquit;\n\n/* If a person has multiple containing periods (rare), keep one contiguous evaluable span per person. */\nproc sort data=anchored; by person_id index_date; run;\ndata anchored_unique;\n  set anchored; by person_id;\n  if first.person_id;  /* one row per person; index dates in inter-period gaps never matched above */\nrun;",
        "description": "Anchor a candidate cohort to OMOP observation_period in SAS (PROC SQL on the CDM tables). Required inputs:\n  work.cohort : person_id, index_date\n  work.observation_period : person_id, observation_period_start_date, observation_period_end_date, period_type_concept_id\nSelects the single observation period containing the index date, enforces a 365-day observable lookback, and caps the\nat-risk window at the period end. Restrict period_type_concept_id to the claims/enrollment-derived type for the source.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": "omop-observation-period-rwe-timeline.svg",
        "mermaid": null,
        "caption": "Maria's 290-day observation period runs from her insurance enrollment (2023-01-15) to disenrollment (2023-10-31). Her amlodipine start and subsequent kidney test both fall inside this window and are counted. Her urgent-care visit 15 days after disenrollment is invisible to the database — not event-free, just unobserved.",
        "alt_text": "A horizontal timeline spanning 2023. A blue bar labeled 'Observation period (290 days)' runs from January 15 to October 31. Inside it, two point events are marked: 'Amlodipine start (index date)' on April 1 and 'Kidney test abnormal — COUNTED' on July 20. A shaded follow-up bar runs from April 1 to October 1 (183 days). After October 31, a gray shaded region labeled 'Unobserved time' extends to December 31, with a point event 'Urgent-care visit — INVISIBLE' marked on November 15 inside the gray zone.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Cand[Candidate index date<br/>first qualifying drug_exposure] --> Has{Index date inside an<br/>observation_period span?}\n  Has -- \"No (inter-period gap)\" --> Drop[\"Exclude: unobservable at index\"]\n  Has -- Yes --> Type{\"period_type_concept_id<br/>= claims/enrollment type?\"}\n  Type -- \"No / MA-only\" --> Drop2[\"Exclude: not FFS-observable\"]\n  Type -- Yes --> Look{\"period_start <= index - lookback?<br/>observable baseline\"}\n  Look -- No --> Drop3[\"Exclude: insufficient lookback\"]\n  Look -- Yes --> Risk[\"risk_end = min(index + maxFU,<br/>observation_period_end_date, death)\"]\n  Risk --> Eval[\"Evaluable: observable person-time<br/>baseline + at-risk windows fixed\"]",
        "caption": "Decision logic for anchoring a cohort to OMOP observation_period. Index must fall inside a span of the correct type, the span must cover the required lookback, and at-risk time is capped at observability.",
        "alt_text": "Flowchart deciding whether a candidate index date is observable, of the correct period type, has sufficient lookback, and how the at-risk window is bounded by the observation period end.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "gantt\n  title One person with two observation periods (payer change splits the span)\n  dateFormat YYYY-MM-DD\n  axisFormat %b %Y\n  section Period 1 (FFS)\n  Observable :done, p1, 2022-01-01, 2022-09-30\n  section Gap\n  Unobservable (MA / disenrolled) :crit, gap, 2022-10-01, 2023-02-28\n  section Period 2 (FFS)\n  Observable :done, p2, 2023-03-01, 2024-06-30\n  Required lookback (365d before index) :active, lb, 2023-03-02, 365d\n  Index = first qualifying fill :milestone, t0, 2024-03-01, 0d\n  At-risk capped at period end :crit, fu, 2024-03-01, 121d",
        "caption": "Multi-period handling. The index date sits in Period 2; lookback and at-risk windows must live inside that containing period, and follow-up is capped at observation_period_end_date (here, the period ends before max follow-up).",
        "alt_text": "Gantt chart showing two observation periods separated by an unobservable gap, with the index date, required 365-day lookback, and an at-risk window capped at the second period's end date.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "is_variant_of",
        "target_slug": "omop-cdm-method-patterns-rwe",
        "notes": "Observation_period is the observability primitive within the broader OMOP CDM method-pattern family."
      },
      {
        "relation_type": "used_with",
        "target_slug": "omop-time-at-risk-cohort-exit-rwe",
        "notes": "Time-at-risk and cohort exit must be bounded inside observation-period spans; the period end is a hard censoring boundary for follow-up."
      },
      {
        "relation_type": "see_also",
        "target_slug": "continuous-enrollment-observable-time-rwe",
        "notes": "In claims the observation period is usually ETL'd from continuous-enrollment spans; in EHR it is inferred from clinical activity and the two can diverge."
      },
      {
        "relation_type": "used_with",
        "target_slug": "washout-clean-lookback-period-rwe",
        "notes": "A washout/lookback rule is only valid when the observation period covers the lookback window, making \"no prior exposure\" a true absence rather than unobserved time."
      },
      {
        "relation_type": "see_also",
        "target_slug": "immortal-time-bias-handling",
        "notes": "Misusing observation-period boundaries (e.g., follow-up starting before observable exposure, or end-dates set by the outcome) can reintroduce immortal time or outcome-dependent censoring."
      },
      {
        "relation_type": "used_with",
        "target_slug": "target-trial-emulation",
        "notes": "Observable lookback and observation-period-bounded follow-up operationalize trial eligibility and the assessment of person-time in OMOP-based target-trial emulations."
      }
    ],
    "aliases": [
      "observation period",
      "OMOP observation_period",
      "observable person-time",
      "observation period table"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "omop-standardized-vocabularies",
    "name": "OMOP Standardized Vocabularies (OHDSI/Athena)",
    "short_definition": "The centralized, versioned reference ontology maintained by OHDSI that assigns every clinical code from every supported source vocabulary a numeric concept_id, a standard_concept flag, and a domain, and that holds the \"Maps to\" relationships allowing source codes (ICD-10-CM, NDC, CPT, local lab codes) to be resolved to a single standard concept in SNOMED, RxNorm, or LOINC — the translation layer that makes one OMOP analysis query meaningful across dozens of databases simultaneously.",
    "long_description": "**OMOP Standardized Vocabularies** are the semantic foundation of the OHDSI ecosystem and the layer that makes\ndistributed, multi-database real-world evidence possible. Every clinical code that will ever appear in an OMOP CDM\ndatabase — ICD-10-CM diagnoses, NDC drug codes, CPT procedure codes, LOINC lab codes, SNOMED concepts, RxNorm\ningredients, custom EHR codes — first passes through the vocabulary layer, which assigns it a stable numeric\n`concept_id`, a `standard_concept` flag, a `domain_id`, and a set of `vocabulary_id`-tagged relationships. This\ntranslation happens once (in the ETL), is saved alongside the source code, and is what allows a single SQL query\nto run identically at a hospital in the United States, a claims database in Germany, and a registry in Japan. The\nvocabulary layer is distinct from every analytic CDM table it feeds — DRUG_EXPOSURE, CONDITION_OCCURRENCE,\nMEASUREMENT — and must be understood on its own terms before any of those tables can be queried correctly.\n\n**The five vocabulary tables and what they carry.**\n- `CONCEPT`: the master catalog. One row per concept, with columns `concept_id` (the stable integer),\n  `concept_name`, `domain_id` (Condition, Drug, Measurement, Procedure, Observation, Device, Spec, Type Concept,\n  Metadata, etc.), `vocabulary_id` (SNOMED, RxNorm, LOINC, ICD10CM, NDC, CPT4, HCPCS, RxNorm Extension, ATC,\n  and ~100 others), `concept_class_id` (Clinical Finding, Ingredient, Lab Test, etc.), `standard_concept`\n  (NULL = source concept, 'S' = standard, 'C' = classification), `concept_code` (the original code string in\n  the source vocabulary, e.g. \"I10\"), `valid_start_date`/`valid_end_date`/`invalid_reason` (lifecycle tracking),\n  and `concept_synonym` via the companion CONCEPT_SYNONYM table.\n- `CONCEPT_RELATIONSHIP`: the edge table. Each row is one directed relationship between two concepts with a\n  `relationship_id`. 
The two most important relationship types for ETL are **\"Maps to\"** (source code →\n  standard concept, used in every ETL to convert ICD/NDC/CPT rows) and **\"Maps to value\"** (for value-coded\n  answers, e.g. a LOINC observation linked to a SNOMED answer concept). Other relationships encode hierarchy\n  (\"Is a\", \"Subsumes\"), ATC classification links, and ingredient ↔ clinical drug links.\n- `CONCEPT_ANCESTOR`: the precomputed transitive-closure table. For any two concepts where one is a hierarchical\n  ancestor of the other (via \"Is a\" / \"Subsumes\" chains in CONCEPT_RELATIONSHIP), this table has a row with\n  `ancestor_concept_id`, `descendant_concept_id`, `min_levels_of_separation`, and\n  `max_levels_of_separation`. This is what powers \"include descendants\" in concept sets — rather than computing\n  the full hierarchy on the fly, every descendant relationship is precomputed. Without CONCEPT_ANCESTOR, defining\n  \"any statin\" or \"any type 2 diabetes complication\" would require multi-hop recursive joins at query time.\n- `VOCABULARY`: the registry of all supported vocabularies with their `vocabulary_id`, `vocabulary_name`,\n  `vocabulary_reference`, `vocabulary_version`, and `vocabulary_concept_id`. The `vocabulary_version` column is\n  the anchor for **version pinning** — reproducibility requires recording the vocabulary version at the time of\n  analysis.\n- `CONCEPT_SYNONYM`: alternate names and synonyms for concepts, useful for full-text search and for mapping\n  legacy code strings to concept_ids.\n\n**The standard_concept flag: the most important column in OMOP.**\nEvery concept has exactly one `standard_concept` value:\n- **`'S'` (Standard):** The intended target for analysis. When an analyst queries `drug_concept_id IN (resolved_set)`,\n  every concept in that set should have `standard_concept = 'S'`. Standard concepts are the ones OMOP guarantees\n  to be stable, non-redundant, and hierarchy-linked. 
Standard by domain: SNOMED CT for conditions; RxNorm (and\n  RxNorm Extension for non-US drugs) for drugs; LOINC for measurements; a mix of CPT4, HCPCS, SNOMED, and\n  ICD10PCS for procedures (the most heterogeneous domain); SNOMED for observation domain concepts.\n- **`NULL` (Source / Non-Standard):** The raw vocabulary codes the data arrive in — ICD-10-CM, ICD-9-CM, NDC,\n  local lab codes. These are preserved in `*_source_concept_id` and `*_source_value` columns in CDM tables for\n  audit purposes but should not be used as the primary analysis key. A concept_id of 0 means the ETL found no\n  mapping at all for a source code.\n- **`'C'` (Classification):** Concepts that are useful for hierarchical queries (e.g., ATC drug classes, high-level\n  SNOMED hierarchy nodes) but that are not appropriate direct targets for individual patient-level data. You can\n  use a Classification concept as an ancestor in CONCEPT_ANCESTOR queries to pull all standard descendants; you\n  should not expect raw CDM rows to carry `'C'` concept_ids in their `*_concept_id` columns.\n\n**Domain routing: the unexpected assignment.** Every concept belongs to exactly one `domain_id`, and this\nassignment determines which CDM table the ETL will place a record into. Domain routing follows the standard\nconcept's domain, not the source vocabulary's intuition — this is a persistent source of confusion. An ICD-10-CM\ncode that a US coder treats as a diagnosis can map to a SNOMED concept in the **Observation** domain rather than\nthe Condition domain, because SNOMED classifies many signs, symptoms, family history findings, and administrative\nstatuses as observations rather than conditions. Similarly, some HCPCS J-codes for drug administration map into\nthe Procedure domain rather than the Drug domain. An analyst who queries only CONDITION_OCCURRENCE for a\ncode whose standard concept is in the Observation domain will find zero rows and incorrectly conclude the\ncondition was never recorded. 
The safe practice is to check `domain_id` for each anchor concept in the CONCEPT\ntable before writing the query, and to query the corresponding CDM table (OBSERVATION for domain Observation,\nMEASUREMENT for domain Measurement, etc.).\n\n**Athena: the distribution service.** The OHDSI Standardized Vocabularies are not authored locally at each CDM\nsite; they are downloaded from **Athena** (https://athena.ohdsi.org), the OHDSI vocabulary download portal.\nAthena provides the current vocabulary release as a set of CSV files (CONCEPT.csv, CONCEPT_RELATIONSHIP.csv,\nCONCEPT_ANCESTOR.csv, etc.) that a site loads into its CDM schema. Key Athena operational facts: (1)\n**CPT4 requires a separate UMLS-license reconstitution step** — because of AMA licensing, CPT4 concept codes\nare downloaded as placeholder text and must be replaced with actual code strings by running Athena's\n`cpt4.jar` reconstitution tool against a UMLS license; sites that skip this step have CPT4 concepts with\ninvalid `concept_code` values. (2) Vocabulary **releases are versioned** with a date string (e.g.,\n\"v5.0 2024-02-23\"); the version appears in the VOCABULARY table's `vocabulary_version` column. (3) A given\nvocabulary release bundles one specific content version of each source vocabulary (a particular SNOMED edition,\na particular RxNorm release, etc.), so two sites on different Athena releases are literally running different mapping tables\nand may resolve the same concept-set expression to different sets of concept_ids.\n\n**source_concept_id preservation: audit and US-code work.** Every CDM clinical event table stores both the\nstandard concept (`drug_concept_id`, `condition_concept_id`, etc.) and the source concept\n(`drug_source_concept_id`, `condition_source_concept_id`). The source concept_id is the concept in the\nCONCEPT table that represents the original source code (if the source code exists in the vocabulary), and\n`*_source_value` is the raw string. 
This dual storage is essential for: (1) auditing ETL quality — confirming\nthat a source ICD-10 code actually mapped to the expected SNOMED concept; (2) recovering person-time from\nsource codes that mapped to `concept_id = 0`; (3) US-specific code-list work where a regulator asks for\nthe exact ICD-10-CM or NDC codes used; (4) phenotype debugging when a standard-concept set undercounts\nand the analyst needs to trace which source codes are present vs mapped.\n\n**Vocabulary versioning as a reproducibility requirement.** Because the CONCEPT_RELATIONSHIP \"Maps to\" table\nis updated every release, the same source code can map to a different standard concept_id between vocabulary\nversions — a code may be remapped, deprecated, split, or merged. A study re-run six months later on a newer\nvocabulary release can produce a different cohort even with identical concept-set expressions. The\nreproducible artifact is therefore not the concept-set expression alone but the **version-frozen resolved set**:\nthe complete list of `concept_id`s that the expression resolves to at a fixed vocabulary version, saved as a\nversioned file in the study repository. The vocabulary version string from the VOCABULARY table should be\nreported in every study publication.\n\n**Why this layer exists: the multi-database international problem.** Before OHDSI, a research network\nwanting to run one pharmacoepidemiology question across, say, a US commercial claims database, a German\nstatutory health insurance database, and a UK GP database would need three separate analysis scripts,\nthree sets of code lists (ICD-10-CM vs ICD-10-GM vs Read codes), and three teams who understood each\nlocal coding system. The OMOP vocabulary layer solves this by centralizing the translation: each site's\nETL maps its local codes to standard concepts, and the network study author writes one query against\nstandard concept_ids that is then valid everywhere. 
The tradeoff is that the translation is imperfect —\nthe mapping loss, domain routing surprises, and vocabulary-version drift documented in the trade-offs\nsection below are the price of portability.\n\n**Pros, cons, and trade-offs.**\n- **vs querying source codes directly (ICD-10-CM strings, NDC codes in a raw claims script):** Querying\n  source codes is simpler at one site and gives access to fine-grained US billing detail (revenue codes,\n  NDC package-level detail). Cost: every new data partner requires a new code list, mapping ICD-9 vs ICD-10\n  differences must be handled manually per site, and any hierarchy query (all statins, all heart failure\n  subtypes) requires hand-maintaining a complete flat list. **Prefer standard-concept queries** for any\n  multi-database or network study; **prefer direct source-code queries** only when a single site is the\n  sole data source and US billing nuance (claim position, revenue code, specific NDC) is itself the\n  scientific question.\n- **vs the PCORnet CDM or FDA Sentinel Common Data Model:** Both PCORnet and Sentinel retain more source-code-\n  native structure and are simpler to ETL at each site. OMOP's distinguishing advantage is the full\n  vocabulary standardization with CONCEPT_ANCESTOR precomputation, enabling concept-set expressions that\n  travel unchanged across sites. Cost: the OMOP ETL is more complex, and vocabulary-mapping errors silently\n  distort cohorts in ways a PCORnet or Sentinel query against its native source codes would not. **Prefer\n  OMOP** when cross-network concept portability and OHDSI tooling (ATLAS, CohortMethod, PLP) matter;\n  **prefer PCORnet/Sentinel** when ETL simplicity dominates.\n- **Vocabulary drift between releases:** Two sites on different vocabulary versions may resolve the same\n  concept-set expression differently. A code remapped between releases changes cohort membership silently.\n  Mitigation: version-pin the resolved set and report the vocabulary version. 
No equivalent mitigation\n  is needed in source-code querying because ICD-10-CM code meanings are stable, but this advantage disappears\n  the moment a network study spans the ICD-9/ICD-10 transition (Oct 2015), where source-code queries also\n  break.\n- **Domain routing loss:** A source code mapped to the Observation domain is not visible in\n  CONDITION_OCCURRENCE. A drug administration code mapped to the Procedure domain is not in DRUG_EXPOSURE.\n  Querying only the \"expected\" CDM table for a concept can undercount events by 5–30% depending on the\n  vocabulary. Always check domain before writing queries.\n- **US billing nuance flattened by SNOMED mapping:** SNOMED's clinical ontology does not distinguish ICD-10-CM\n  principal vs secondary diagnosis position, claim type (inpatient vs outpatient), or revenue code. These\n  distinctions, essential for validated phenotypes in US claims, must be recovered from `condition_type_concept_id`\n  in CONDITION_OCCURRENCE (which preserves them) — not from the standard SNOMED concept itself.\n\n**When to use.** Any time you are building or validating a cohort, exposure, or outcome definition against an\nOMOP CDM database — whether at a single site or across an OHDSI network — you are using the vocabulary layer,\neven if you are not explicitly querying the CONCEPT table. 
Understanding the vocabulary layer is a prerequisite\nfor interpreting any result from DRUG_EXPOSURE, CONDITION_OCCURRENCE, MEASUREMENT, or any other CDM clinical\ntable, because the concept_ids those tables contain only have meaning through their CONCEPT table metadata.\nExplicitly query the vocabulary layer when: (a) building a concept-set expression and selecting anchor\nconcept_ids; (b) checking domain routing before deciding which CDM table to query; (c) resolving and\nversion-freezing a concept-set expression for a reproducible study; (d) auditing ETL quality by tracing\nsource_concept_ids through Maps to relationships; (e) identifying unmapped source codes (concept_id = 0)\nthat will be invisible to standard-concept queries.\n\n**When NOT to use — and when it is actively misleading or dangerous.**\n- **Do not use the vocabulary layer as a substitute for clinical review.** The \"Maps to\" relationship is\n  the OHDSI community's best guess at the semantically correct mapping. A source ICD-10-CM code may map to\n  a SNOMED concept at a higher level of generality than intended, or to the wrong clinical laterality, or\n  (rarely) to an incorrect concept entirely. No vocabulary mapping replaces inspecting the actual resolved\n  concept_ids and having a clinical expert confirm they represent the intended clinical entity.\n- **Do not assume that \"standard_concept = 'S'\" means the concept is clinically correct for your study.**\n  Standard status means \"this is the preferred target for this domain in this vocabulary.\" It does not mean\n  \"this is the phenotype you want.\" A standard SNOMED concept such as \"diabetes mellitus\" has many\n  descendants, including neonatal and secondary forms, that a specific study may need to exclude. Standard\n  is a vocabulary property, not a phenotype property.\n- **Do not re-run an analysis on a newer vocabulary release and present it as a reproduction of the original results.** Vocabulary updates\n  can change cohort membership. 
A study claiming to reproduce a prior result must use the same vocabulary\n  version. If an updated vocabulary is used for a replication, the version difference must be documented and\n  the impact on the cohort assessed.\n- **Do not treat concept_id = 0 as \"the patient did not have this exposure/condition.\"** It means the ETL\n  found no mapping for the source code. Counts of persons with concept_id = 0 as their primary code are a\n  data-quality signal, not a clinical finding. Quantify and report the unmapped fraction before drawing\n  conclusions from cohort size.\n- **Do not skip the CPT4 reconstitution step and then run analyses involving procedures.** Sites with\n  unreconstituted CPT4 have concepts with placeholder concept_codes; any procedure-based phenotype or\n  exclusion criterion involving CPT4 will silently fail to match.\n\n**Data-source operational depth.**\n- **Claims (US commercial, Medicare FFS):** Source vocabularies are ICD-10-CM (conditions), NDC (pharmacy\n  drugs), CPT4 and HCPCS (procedures and J-code drugs), ICD-10-PCS (inpatient procedures). Mapping quality\n  is generally high for ICD-10-CM→SNOMED and CPT4→SNOMED/procedures. NDC→RxNorm mapping runs through\n  \"Maps to\" rows in CONCEPT_RELATIONSHIP (the DRUG_STRENGTH vocabulary table holds dose detail, not the\n  mapping) and is complicated by repackaged NDCs (the same drug repackaged by a\n  distributor receives a new NDC that may lag vocabulary updates). 
Always anchor drug sets at the RxNorm\n  ingredient level and expand via CONCEPT_ANCESTOR to capture all clinical drug and branded descendants.\n  The ICD-9→ICD-10 coding transition (October 2015) is a CONCEPT_RELATIONSHIP \"Maps to\" challenge: OMOP\n  maps both ICD-9-CM and ICD-10-CM to SNOMED, but the SNOMED targets reached from ICD-9-CM codes (and their\n  descendant trees) differ from those reached from ICD-10-CM codes, which can produce spurious apparent\n  incidence changes at the transition for any condition concept set spanning 2015.\n- **EHR:** Source vocabularies are more heterogeneous — SNOMED is often used directly for problem lists in\n  US EHRs (via SNOMED CT US edition), while local codes, CPT for visit charges, and LOINC for lab panels\n  are also common. LOINC→LOINC mappings (standard LOINC panel members) and local lab code→LOINC mappings\n  depend on the ETL's laboratory harmonization quality; units must be reconciled separately (the MEASUREMENT\n  table carries `value_as_number`, `unit_concept_id`). EHR data frequently surface concepts in the\n  Observation domain for findings that an epidemiologist would call a condition, reinforcing the need to\n  check domain routing.\n- **Registry:** Disease classification in a registry often uses a registry-specific coding scheme (e.g., tumor\n  registry ICD-O-3 histology codes, NYHA heart failure class). These may require a custom vocabulary or a\n  manual concept-to-SNOMED mapping before they appear in the CONCEPT table at all. Adjudicated outcomes\n  from a registry that are re-coded into OMOP standard concepts should be documented with the mapping\n  crosswalk, not assumed correct.\n- **Linked claims–EHR–registry:** Each source has its own ETL and its own mapping completeness. The same\n  clinical event can receive a different standard concept_id from different source vocabularies (the ICD-10-CM\n  code for a condition and the SNOMED code from the EHR problem list may map to different SNOMED concept_ids\n  at different specificity levels). 
Reconcile at the standard-concept level and document which source\n  contributed each concept_id assignment.",
    "primary_category": "Data_Standard",
    "tags": [
      "coding-system",
      "data-standard",
      "primitive",
      "omop",
      "ohdsi",
      "terminology",
      "concept-id",
      "snomed",
      "rxnorm",
      "loinc",
      "icd10cm",
      "ndc",
      "vocabulary-versioning",
      "athena",
      "source-to-standard-mapping"
    ],
    "applies_to_study_types": [
      "claims_analysis",
      "ehr_study",
      "registry_linkage",
      "comparative_effectiveness",
      "drug_utilization",
      "target_trial_emulation"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1093/jamia/ocad247",
        "url": "https://doi.org/10.1093/jamia/ocad247",
        "citation_text": "Reich C, Ostropolets A, Ryan P, Rijnbeek P, Schuemie M, Davydov A, et al. OHDSI Standardized Vocabularies — a large-scale centralized reference ontology for international data harmonization. Journal of the American Medical Informatics Association. 2024;31(3):583-590.",
        "year": 2024,
        "authors_short": "Reich et al.",
        "notes": "Primary reference for the OHDSI Standardized Vocabularies as the central mapping infrastructure for OMOP CDM; describes the vocabulary tables, the standard_concept flag, the Maps to relationship, and the Athena distribution portal."
      },
      {
        "role": "explain",
        "doi": "10.3233/978-1-61499-564-7-574",
        "url": "https://doi.org/10.3233/978-1-61499-564-7-574",
        "citation_text": "Hripcsak G, Duke JD, Shah NH, Reich CG, Huser V, Schuemie MJ, et al. Observational Health Data Sciences and Informatics (OHDSI): Opportunities for Observational Researchers. Studies in Health Technology and Informatics (MEDINFO 2015). 2015;216:574-578.",
        "year": 2015,
        "authors_short": "Hripcsak et al.",
        "notes": "Articulates the OHDSI program and the rationale for the shared standardized vocabulary as the engine behind distributed network research where source-code heterogeneity is the barrier to scale."
      },
      {
        "role": "demonstrate",
        "doi": "10.1093/jamia/ocu023",
        "url": "https://doi.org/10.1093/jamia/ocu023",
        "citation_text": "Voss EA, Makadia R, Matcho A, Ma Q, Knoll C, Schuemie M, et al. Feasibility and utility of applications of the common data model to multiple, disparate observational health databases. Journal of the American Medical Informatics Association. 2015;22(3):553-564.",
        "year": 2015,
        "authors_short": "Voss et al.",
        "notes": "Demonstrates applying the CDM and standardized vocabulary to multiple disparate databases, providing concrete evidence of vocabulary-layer utility and the portability it enables across heterogeneous sources."
      },
      {
        "role": "use",
        "doi": "10.1016/S0140-6736(19)32317-7",
        "url": "https://doi.org/10.1016/S0140-6736(19)32317-7",
        "citation_text": "Suchard MA, Schuemie MJ, Krumholz HM, You SC, Chen R, Pratt N, et al. Comprehensive comparative effectiveness and safety of first-line antihypertensive drug classes: a systematic, multinational, large-scale analysis. The Lancet. 2019;394(10211):1816-1826.",
        "year": 2019,
        "authors_short": "Suchard et al.",
        "notes": "LEGEND-HTN — the largest OMOP network study to date, running one standardized vocabulary-based concept set across nine countries and millions of patients, demonstrating the production-scale payoff of vocabulary standardization."
      }
    ],
    "plain_language_summary": "The OMOP Standardized Vocabularies are a shared translation dictionary that converts every type of medical code — diagnosis codes, drug codes, lab codes, procedure codes — into one universal numbering system called concept_ids, so that the same analysis script can run identically at hospitals and insurers across many countries. Think of it as the Rosetta Stone for healthcare data — a US insurer uses ICD-10-CM codes for diagnoses, a German sickness fund uses a different coding system, and a UK GP uses Read codes, but after the translation every institution's data use the same SNOMED concept_id for type 2 diabetes. One honest caveat: the translation is imperfect — some source codes have no mapping and become invisible to the standard analysis, and the same expression can resolve to slightly different concepts when the vocabulary is updated.",
    "key_terms": [
      {
        "term": "concept_id",
        "definition": "The unique integer that the OMOP vocabulary assigns to every clinical code; databases from different hospitals or countries use the same concept_id for the same clinical idea, making queries portable."
      },
      {
        "term": "standard concept",
        "definition": "A concept marked with standard_concept = 'S' in the OMOP vocabulary, meaning it is the intended target for analysis queries; diagnoses map to SNOMED standard concepts, drugs to RxNorm, and lab tests to LOINC."
      },
      {
        "term": "source concept",
        "definition": "The original code from the raw data before translation — such as an ICD-10-CM diagnosis code or an NDC drug code — preserved in the CDM alongside the standard concept for audit purposes."
      },
      {
        "term": "domain",
        "definition": "The category that determines which CDM table a concept lands in — Condition, Drug, Measurement, Procedure, or Observation — sometimes in a different table than a researcher would expect from the source code."
      },
      {
        "term": "Maps to",
        "definition": "The CONCEPT_RELATIONSHIP link that connects a source code to its standard concept, the central translation step the ETL uses to convert raw ICD or NDC codes into OMOP concept_ids."
      },
      {
        "term": "vocabulary version",
        "definition": "A dated release identifier (e.g., v5.0 2024-02-23) stamped on each Athena download; the same concept-set expression can resolve to different concept_ids on different vocabulary versions, so every reproducible study must record which version was used."
      }
    ],
    "worked_example": {
      "scenario": "A pharmacoepidemiology team is building a claims-based study in an OMOP database and needs to (1) trace a specific ICD-10-CM diagnosis code through the vocabulary to find its standard SNOMED concept_id, (2) trace an NDC drug code to its RxNorm standard concept, and (3) verify that filtering on only the standard concept changes the patient count compared to also including source concepts. The table below shows the key rows from the CONCEPT and CONCEPT_RELATIONSHIP tables for two example codes.",
      "dataset": {
        "caption": "Rows from CONCEPT and CONCEPT_RELATIONSHIP illustrating source-to-standard mapping for one ICD-10-CM diagnosis code and one NDC drug code.",
        "columns": [
          "concept_id",
          "concept_code",
          "vocabulary_id",
          "standard_concept",
          "domain_id",
          "concept_name"
        ],
        "rows": [
          [
            35207167,
            "I10",
            "ICD10CM",
            "null",
            "Condition",
            "Essential (primary) hypertension"
          ],
          [
            320128,
            "59621000",
            "SNOMED",
            "S",
            "Condition",
            "Essential hypertension"
          ],
          [
            40213271,
            "0069-0158-03",
            "NDC",
            "null",
            "Drug",
            "atorvastatin calcium 20 MG Oral Tablet [Lipitor]"
          ],
          [
            1545996,
            "83367",
            "RxNorm",
            "S",
            "Drug",
            "atorvastatin 20 MG Oral Tablet [Lipitor]"
          ]
        ]
      },
      "steps": [
        "Step 1 — trace ICD-10-CM to SNOMED: The source code \"I10\" (Essential hypertension in ICD-10-CM) has concept_id 35207167 in the CONCEPT table with standard_concept = null, meaning it is a source-only concept used for audit but not for querying patient data. In CONCEPT_RELATIONSHIP, the row with concept_id_1 = 35207167 and relationship_id = \"Maps to\" points to concept_id_2 = 320128 (the SNOMED concept for Essential hypertension, standard_concept = S). An analyst querying CONDITION_OCCURRENCE for hypertension must use 320128, not 35207167.",
        "Step 2 — trace NDC to RxNorm: The NDC code \"0069-0158-03\" (a specific Lipitor package) has OMOP concept_id 40213271 with standard_concept = null. Its \"Maps to\" relationship in CONCEPT_RELATIONSHIP points to concept_id 1545996 (RxNorm \"atorvastatin 20 MG Oral Tablet [Lipitor]\", standard_concept = S). To capture all atorvastatin products (all doses, all brands, generic), a drug concept set should be anchored at the RxNorm ingredient concept_id for atorvastatin and expanded via CONCEPT_ANCESTOR, which will include concept_id 1545996 as a descendant.",
        "Step 3 — count difference: Suppose the CONDITION_OCCURRENCE table for this CDM has 9800 rows where condition_concept_id = 320128 (standard SNOMED) and 200 rows where condition_concept_id = 0 (unmapped, source code had no mapping). Filtering on standard concept_id = 320128 alone finds 9800 rows. Including source_concept_id = 35207167 as a fallback adds 0 additional rows here because the ETL already populated condition_concept_id correctly. But if 200 rows have condition_concept_id = 0, those persons are invisible to the standard-concept query; the unmapped fraction is 200 / (9800 + 200) = 200 / 10000 = 0.02, meaning 2% of source rows are invisible to any standard concept query.",
        "Step 4 — check domain: Before writing the CONDITION_OCCURRENCE query, the analyst checks domain_id for concept_id 320128 in CONCEPT: domain_id = \"Condition\". This confirms the record will be in CONDITION_OCCURRENCE. If a symptom code had mapped to a SNOMED concept with domain_id = \"Observation\", the analyst would instead query OBSERVATION — querying only CONDITION_OCCURRENCE would miss it entirely."
      ],
      "result": "For ICD-10-CM \"I10\": source concept_id = 35207167 (standard_concept null) Maps to standard concept_id = 320128 (SNOMED, domain = Condition). For NDC \"0069-0158-03\": source concept_id = 40213271 Maps to RxNorm standard concept_id = 1545996 (domain = Drug). Unmapped fraction = 200 / 10000 = 0.02, so 2% of hypertension rows have concept_id = 0 and are invisible to a pure standard-concept query. Domain check confirms correct CDM table. Arithmetic: 200 / 10000 = 0.02"
    },
    "prerequisites": [
      "omop-cdm-method-patterns-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Standard-concept query (the default analytic mode)",
        "description": "Build the analysis query entirely on standard concepts (standard_concept = 'S') using CONCEPT_ANCESTOR for hierarchy expansion. This mode is vocabulary-agnostic and source-agnostic — the same concept_ids work across all sites that have ETL'd to OMOP.",
        "edge_cases": [
          "Any source code that the ETL could not map becomes concept_id = 0 and is entirely invisible to a standard-concept query; quantify and report the unmapped fraction before publishing.",
          "Domain routing surprises (a code mapped to Observation instead of Condition) mean the query may look in the wrong CDM table and find zero rows without error."
        ],
        "data_source_notes": "claims: NDC→RxNorm mapping completeness varies by ETL vintage and repackaged NDC coverage; EHR: local codes and LOINC panel members may lack standard mappings and need mapping-quality review."
      },
      {
        "name": "Source-concept fallback for unmapped codes",
        "description": "Supplement standard-concept queries with source_concept_id / source_value logic to recover records whose source codes were mapped to concept_id = 0.",
        "edge_cases": [
          "Source fallbacks are data-partner-specific and break portability; document them per database and mark results as non-portable.",
          "Double-counting risk if a source code both maps to a standard concept and is caught by the fallback filter; deduplicate on the record's primary key when combining the two filters."
        ],
        "data_source_notes": "claims: target retired and repackaged NDCs and local miscellaneous codes; track the % of source rows recovered per database and per coding era (ICD-9 era vs ICD-10 era)."
      },
      {
        "name": "Version-pinned resolved concept set",
        "description": "Resolve a concept-set expression (anchors + include-descendants + exclusions) against a specific vocabulary version and save the complete resolved concept_id list as a versioned artifact in the study repository.",
        "edge_cases": [
          "Re-resolving the same expression against a newer vocabulary version changes the cohort silently; the frozen list, not just the expression, is the reproducible unit.",
          "Network partners on different vocabulary versions may resolve the \"same\" expression differently; require vocabulary version alignment for any federated study."
        ],
        "data_source_notes": "Record the vocabulary_version string from the VOCABULARY table alongside the resolved set in every analysis run and every publication."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Direct source-code queries (ICD-10-CM strings, NDC codes)",
        "pros_of_this": "Portable across data partners, countries, and coding systems; hierarchy expansion via CONCEPT_ANCESTOR automatically captures new child codes; same query valid pre- and post-ICD-9/10 transition if the vocabulary's cross-walk is correct.",
        "cons_of_this": "Depends on ETL mapping quality; codes mapped to concept_id = 0 are invisible; US billing detail (claim position, revenue codes, specific NDC package) is not encoded in the standard concept.",
        "when_to_prefer": "Multi-database or network studies; international harmonization; OHDSI tooling pipelines. Use direct source-code queries only for single-database studies where US billing nuance is itself the scientific question."
      },
      {
        "compared_to": "PCORnet / FDA Sentinel Common Data Models",
        "pros_of_this": "Full vocabulary standardization to standard concepts with CONCEPT_ANCESTOR precomputation makes concept sets truly portable and supports large-scale OHDSI analytics; Athena provides a single download with all supported vocabularies.",
        "cons_of_this": "The OMOP ETL is more complex and vocabulary-mapping errors silently distort cohorts; PCORnet and Sentinel are simpler to ETL and closer to native source codes.",
        "when_to_prefer": "Cross-network harmonization, international studies, OHDSI tooling; prefer PCORnet/Sentinel when ETL simplicity is the constraint and multi-database portability is not required."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Check every anchor concept's domain_id before querying; anchor drug sets at the RxNorm ingredient level and expand via CONCEPT_ANCESTOR; quantify the concept_id = 0 unmapped fraction by data partner and coding era; bridge the ICD-9-CM to ICD-10-CM transition (October 2015); run the CPT4 reconstitution tool on any site that uses CPT4 procedure concepts.",
      "ehr": "Check standard_concept and domain_id for local lab codes and problem-list SNOMED codes — many EHR concepts route to Observation rather than Condition; harmonize LOINC panel codes and units before trusting measurement concept sets.",
      "registry": "Registry-specific coding (ICD-O-3, NYHA class) may need a custom vocabulary entry or a manual crosswalk to SNOMED before appearing in the CONCEPT table; document the crosswalk and validate mapping completeness.",
      "linked": "Each source has its own ETL and mapping completeness; the same clinical event can receive different standard concept_ids from different source vocabularies at different specificity levels; reconcile at the standard-concept level and document which source contributed each concept_id."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import duckdb\nimport pandas as pd\n\n# ---------- Pattern 1: Trace a source ICD-10-CM code to its standard SNOMED concept ----------\n# Replace the concept table and concept_relationship table references with your CDM schema.\n\nTRACE_SQL = \"\"\"\nSELECT\n    src.concept_id         AS source_concept_id,\n    src.concept_code       AS source_code,\n    src.vocabulary_id      AS source_vocab,\n    src.standard_concept   AS source_standard,\n    std.concept_id         AS standard_concept_id,\n    std.concept_name       AS standard_concept_name,\n    std.vocabulary_id      AS standard_vocab,\n    std.domain_id          AS domain_id\nFROM concept src\n-- \"Maps to\" relationship points from source concept to standard concept\nJOIN concept_relationship cr\n    ON cr.concept_id_1    = src.concept_id\n   AND cr.relationship_id = 'Maps to'\n   AND cr.invalid_reason IS NULL\nJOIN concept std\n    ON std.concept_id    = cr.concept_id_2\n   AND std.standard_concept = 'S'\nWHERE src.concept_code   = $source_code\n  AND src.vocabulary_id  = $vocab_id\n\"\"\"\n\ndef resolve_to_standard(con: duckdb.DuckDBPyConnection,\n                        source_code: str, vocab_id: str = \"ICD10CM\") -> pd.DataFrame:\n    \"\"\"Resolve a source code to its standard concept(s) via Maps to.\n\n    Args:\n        source_code: The raw source code string, e.g. 'I10'.\n        vocab_id:    The vocabulary identifier, e.g. 
'ICD10CM', 'NDC', 'CPT4'.\n    Returns:\n        DataFrame with one row per standard concept mapped from the source code.\n        If empty, the source code has no standard mapping (concept_id = 0 scenario).\n    \"\"\"\n    return con.execute(TRACE_SQL, {\"source_code\": source_code, \"vocab_id\": vocab_id}).df()\n\n# ---------- Pattern 2: Domain-routing audit (detect Observation-domain surprises) ----------\nDOMAIN_AUDIT_SQL = \"\"\"\n-- Find condition-like source codes whose standard concept lands outside the Condition domain\n-- (typically Observation or Measurement), causing zero rows if the analyst queries only\n-- CONDITION_OCCURRENCE.\n-- n_mappings counts vocabulary mapping rows, not clinical records: this query reads only\n-- the vocabulary tables.\nSELECT\n    src.concept_code,\n    src.vocabulary_id,\n    std.concept_name,\n    std.domain_id          AS mapped_domain,\n    COUNT(*)               AS n_mappings\nFROM concept src\nJOIN concept_relationship cr\n    ON cr.concept_id_1    = src.concept_id\n   AND cr.relationship_id = 'Maps to'\n   AND cr.invalid_reason IS NULL\nJOIN concept std\n    ON std.concept_id      = cr.concept_id_2\n   AND std.standard_concept = 'S'\nWHERE src.vocabulary_id IN ('ICD10CM', 'ICD9CM')\n  AND std.domain_id NOT IN ('Condition')  -- these will be INVISIBLE in CONDITION_OCCURRENCE\nGROUP BY src.concept_code, src.vocabulary_id, std.concept_name, std.domain_id\nORDER BY n_mappings DESC, src.concept_code\nLIMIT 50\n\"\"\"\n\ndef audit_domain_routing(con: duckdb.DuckDBPyConnection) -> pd.DataFrame:\n    \"\"\"Return up to 50 ICD codes that route to non-Condition domains.\n\n    These codes are invisible to queries that search only CONDITION_OCCURRENCE.\n    Typical culprits: signs/symptoms, history findings, screening codes -> Observation.\n    \"\"\"\n    return con.execute(DOMAIN_AUDIT_SQL).df()\n\n# Example usage (replace 'con' with a DuckDB connection that can see the OMOP vocabulary tables):\n# result = resolve_to_standard(con, \"I10\", \"ICD10CM\")\n# domain_surprises = audit_domain_routing(con)",
        "description": "Two patterns for the vocabulary layer in Python, using SQL-in-Python (DuckDB) against standard OMOP tables:\n(1) Resolve a source ICD-10-CM code to its standard SNOMED concept via the CONCEPT + CONCEPT_RELATIONSHIP\n\"Maps to\" chain. (2) List domain-routing surprises: source codes that map to a domain other than the one\nthe analyst expected (e.g., codes a US coder would call a diagnosis landing in the Observation domain).\nRequired CDM tables:\n  concept             : concept_id, concept_code, vocabulary_id, domain_id, standard_concept, concept_name\n  concept_relationship: concept_id_1, concept_id_2, relationship_id, invalid_reason\nBoth queries read only the vocabulary tables and illustrate checks worth running before writing any clinical query.",
        "dependencies": [
          "duckdb",
          "pandas"
        ],
        "source_citations": [
          "reich-2024"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(DatabaseConnector)\nlibrary(SqlRender)\n\n# ---------- Pattern 1: Resolve a source code to its standard concept via Maps to ----------\nresolveToStandard <- function(connection, cdmSchema,\n                             sourceCode, vocabId = \"ICD10CM\",\n                             targetDialect = \"postgresql\") {\n  sql <- \"\n  SELECT\n      src.concept_id         AS source_concept_id,\n      src.concept_code       AS source_code,\n      src.vocabulary_id      AS source_vocab,\n      std.concept_id         AS standard_concept_id,\n      std.concept_name       AS standard_concept_name,\n      std.vocabulary_id      AS standard_vocab,\n      std.domain_id          AS domain_id\n  FROM @cdm.concept src\n  JOIN @cdm.concept_relationship cr\n      ON cr.concept_id_1    = src.concept_id\n     AND cr.relationship_id = 'Maps to'\n     AND cr.invalid_reason IS NULL\n  JOIN @cdm.concept std\n      ON std.concept_id     = cr.concept_id_2\n     AND std.standard_concept = 'S'\n  WHERE src.concept_code  = '@source_code'\n    AND src.vocabulary_id = '@vocab_id';\"\n\n  rendered <- SqlRender::render(sql,\n                               cdm         = cdmSchema,\n                               source_code = sourceCode,\n                               vocab_id    = vocabId)\n  rendered <- SqlRender::translate(rendered, targetDialect = targetDialect)\n  DatabaseConnector::querySql(connection, rendered)\n}\n\n# ---------- Pattern 2: Unmapped fraction audit (concept_id = 0 in CONDITION_OCCURRENCE) ----------\nunmappedFractionAudit <- function(connection, cdmSchema,\n                                 targetDialect = \"postgresql\") {\n  # concept_id = 0 means the ETL found no standard mapping for the source code.\n  # These rows are INVISIBLE to any standard-concept cohort query.\n  sql <- \"\n  SELECT\n      SUM(CASE WHEN condition_concept_id = 0 THEN 1 ELSE 0 END)      AS unmapped_rows,\n      COUNT(*)                                                          
AS total_rows,\n      CAST(SUM(CASE WHEN condition_concept_id = 0 THEN 1 ELSE 0 END) AS FLOAT)\n        / CAST(COUNT(*) AS FLOAT)                                       AS fraction_unmapped\n  FROM @cdm.condition_occurrence;\"\n\n  rendered <- SqlRender::render(sql, cdm = cdmSchema)\n  rendered <- SqlRender::translate(rendered, targetDialect = targetDialect)\n  DatabaseConnector::querySql(connection, rendered)\n}\n\n# Example:\n# mapping <- resolveToStandard(connection, \"cdm\", \"I10\", \"ICD10CM\")\n# unmapped <- unmappedFractionAudit(connection, \"cdm\")\n# fraction_unmapped is a proportion in [0, 1], not a percentage.",
        "description": "Two R patterns for the vocabulary layer using the OHDSI HADES idiom (DatabaseConnector + SqlRender):\n(1) Resolve a source code to its standard OMOP concept via the \"Maps to\" relationship.\n(2) Check the unmapped fraction (concept_id = 0) in a clinical CDM table, the most important\ndata-quality check before any cohort analysis.\nRequired CDM tables:\n  concept             : concept_id, concept_code, vocabulary_id, domain_id, standard_concept, concept_name\n  concept_relationship: concept_id_1, concept_id_2, relationship_id, invalid_reason\n  condition_occurrence: condition_concept_id, condition_source_concept_id (for unmapped check)\nParameters are rendered by SqlRender so the SQL is portable across Postgres, SQL Server, Redshift, etc.",
        "dependencies": [
          "DatabaseConnector",
          "SqlRender"
        ],
        "source_citations": [
          "voss-2015"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Src[\"Source codes in raw data\\nICD-10-CM, NDC, CPT4, LOINC, local codes\"] --> ETL[\"ETL reads CONCEPT_RELATIONSHIP\\n'Maps to' relationships from Athena\"]\n  ETL -->|standard_concept = 'S'| Std[\"Standard concepts\\nSNOMED (Condition), RxNorm (Drug)\\nLOINC (Measurement), mix (Procedure)\"]\n  ETL -->|concept_id = 0| Unmapped[\"Unmapped source codes\\n→ invisible to standard queries\\nquantify and report %\"]\n  Std --> Domain[\"domain_id routes to CDM table\\nCondition → CONDITION_OCCURRENCE\\nDrug → DRUG_EXPOSURE\\nMeasurement → MEASUREMENT\\nObservation → OBSERVATION\"]\n  Std --> CA[\"CONCEPT_ANCESTOR\\nprecomputed transitive closure\\n(hierarchy expansion)\"]\n  CA --> Set[\"Concept set expression\\nanchor + include descendants\\n+ exclusions\"]\n  Set --> Freeze[\"Version-freeze resolved set\\n+ vocabulary_version from VOCABULARY table\\n→ check into study repo\"]\n  Freeze --> Query[\"Standard concept query\\nfield IN (resolved_set)\\nportable across all CDM sites\"]\n  Unmapped -.fallback.-> Query",
        "caption": "The OMOP vocabulary layer flow. Source codes are translated to standard concepts via \"Maps to\" relationships downloaded from Athena; the standard_concept flag and domain_id route each concept to its CDM table; CONCEPT_ANCESTOR enables hierarchy expansion in concept-set expressions; the resolved set must be version-frozen before the query runs.",
        "alt_text": "Flowchart from raw source codes through Maps to relationships to standard concepts, then domain routing to CDM tables, CONCEPT_ANCESTOR expansion, version-frozen concept sets, and a portable standard-concept query, with unmapped codes as a fallback branch.",
        "source_type": "illustrative",
        "source_citations": [
          "reich-2024"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  NDC[\"NDC code\\n(source, null standard)\"] -->|Maps to| RxNorm[\"RxNorm Branded Drug\\n(standard, Drug domain)\"]\n  RxNorm -->|CONCEPT_ANCESTOR| Ingr[\"RxNorm Ingredient: atorvastatin\\n(standard, Drug domain)\"]\n  ICD[\"ICD-10-CM I10\\n(source, null standard)\"] -->|Maps to| SNOMED[\"SNOMED 59621000\\nEssential hypertension\\n(standard, Condition domain)\"]\n  LOC[\"Local lab code\\n(source, null standard)\"] -->|Maps to| LOINC_STD[\"LOINC 2345-7\\nGlucose serum\\n(standard, Measurement domain)\"]\n  SYMPTOM[\"ICD-10-CM R51\\nHeadache\\n(source, null standard)\"] -->|Maps to| OBS[\"SNOMED 25064002\\nHeadache\\n(standard, Observation domain)\\n← domain routing surprise\"]",
        "caption": "Examples of source-to-standard \"Maps to\" relationships. An NDC maps to an RxNorm branded drug, which rolls up to its ingredient through CONCEPT_ANCESTOR. ICD-10-CM maps to a SNOMED condition. A local lab code maps to LOINC. The ICD-10-CM headache code is shown mapping to a SNOMED concept in the Observation domain rather than Condition, in which case it is invisible to CONDITION_OCCURRENCE queries (illustrative; verify the domain_id in your own vocabulary release).",
        "alt_text": "Four parallel mapping chains showing NDC to RxNorm, ICD-10-CM to SNOMED condition, local lab code to LOINC, and an ICD-10-CM symptom code routing to the Observation domain rather than Condition.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "part_of",
        "target_slug": "omop-cdm-method-patterns-rwe",
        "notes": "The vocabulary layer is the semantic foundation of all OMOP CDM method patterns; every concept_id in every CDM clinical table draws meaning from the CONCEPT table and its relationships."
      },
      {
        "relation_type": "used_with",
        "target_slug": "omop-concept-set-development-rwe",
        "notes": "Concept set development builds code lists ON the vocabulary layer — selecting anchor concept_ids, expanding via CONCEPT_ANCESTOR, and version-freezing the resolved set. The vocabulary layer is the substrate; concept-set development is the analytic activity on top of it."
      },
      {
        "relation_type": "used_with",
        "target_slug": "omop-condition-occurrence-condition-era-rwe",
        "notes": "Every mapped condition_concept_id in CONDITION_OCCURRENCE and CONDITION_ERA is a standard Condition-domain concept (predominantly SNOMED) from the vocabulary layer, and unmapped rows carry concept_id = 0; domain routing determines whether a source diagnosis code lands in CONDITION_OCCURRENCE or in OBSERVATION."
      },
      {
        "relation_type": "used_with",
        "target_slug": "omop-drug-exposure-drug-era-rwe",
        "notes": "Every mapped drug_concept_id in DRUG_EXPOSURE and DRUG_ERA is a standard Drug-domain concept, predominantly RxNorm or RxNorm Extension; the NDC→RxNorm Maps to chain and the ingredient hierarchy in CONCEPT_ANCESTOR are vocabulary-layer operations."
      },
      {
        "relation_type": "see_also",
        "target_slug": "icd-10-cm-diagnosis-coding",
        "notes": "ICD-10-CM is one of the primary source vocabularies that maps into OMOP standard concepts via the vocabulary layer's \"Maps to\" relationships; ICD-10-CM codes are stored as source concepts in the CDM."
      },
      {
        "relation_type": "see_also",
        "target_slug": "rxnorm-drug-terminology",
        "notes": "RxNorm is the standard vocabulary for the Drug domain in OMOP; NDC codes map to RxNorm drug products (clinical or branded), which roll up to RxNorm ingredients via CONCEPT_ANCESTOR."
      },
      {
        "relation_type": "see_also",
        "target_slug": "snomed-ct-terminology",
        "notes": "SNOMED CT is the standard vocabulary for the Condition domain in OMOP; most ICD-10-CM and ICD-9-CM diagnosis codes map to SNOMED standard concepts via the vocabulary layer."
      },
      {
        "relation_type": "see_also",
        "target_slug": "loinc-laboratory-coding",
        "notes": "LOINC is the standard vocabulary for the Measurement domain in OMOP; local lab codes and legacy lab terminologies map to LOINC standard concepts via the vocabulary layer."
      },
      {
        "relation_type": "see_also",
        "target_slug": "medical-code-crosswalks-mappings",
        "notes": "The OMOP vocabulary layer is the most comprehensive, versioned, and programmatically accessible source of medical code crosswalks — the Maps to relationships implement the crosswalk logic for dozens of source vocabularies simultaneously."
      }
    ],
    "aliases": [
      "OMOP vocabulary",
      "OHDSI vocabularies",
      "Athena",
      "concept_id",
      "standard concept",
      "OMOP standardized vocabularies",
      "OHDSI vocabulary layer"
    ],
    "complexity": "foundational",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "omop-time-at-risk-cohort-exit-rwe",
    "name": "OMOP Time-at-Risk and Cohort Exit",
    "short_definition": "The OMOP/OHDSI cohort-logic specification of when each subject's outcome follow-up begins (time-at-risk start), when it ends (cohort exit / censoring), and how the at-risk window is anchored to cohort entry, observation period, exposure era, and death, which jointly determine the denominator (person-time) and the eligible numerator (outcome) events.",
    "long_description": "In the OHDSI/OMOP framework a **cohort** is a set of persons who satisfy entry criteria for a span of time, and the\nanalytic estimate is generated only over the **time-at-risk (TAR)** window attached to each cohort entry. TAR is defined\nby two offsets relative to anchor dates: a **start rule** (e.g., `cohort_start_date + start_date_offset`, often offset 0 =\nexposure initiation, or +1 day to exclude prevalent outcomes coded on the index day) and an **end rule** (`cohort_end_date`,\n`observation_period_end_date`, `drug_era_end_date + grace`, a fixed offset such as +365 days, or the first outcome). **Cohort\nexit** is the operationalization of that end rule plus all competing reasons follow-up stops — death, disenrollment, the end\nof the database, switching, or the outcome itself. TAR start, cohort exit, and the censoring source together fix the\ndenominator (person-time) and which numerator (outcome) events are eligible to be counted. Get them wrong and you do not get\na noisier estimate of the right thing — you get a precise estimate of the wrong thing.\n\n**Core conceptual distinction.** Three decisions are separable and each maps onto a distinct estimand. (1) *Where does the\nclock start?* Anchoring TAR start at the exposure decision (time zero) is what prevents **immortal time bias**: any\nperson-time accrued before the exposure-defining event is, by construction, event-free and must not be counted toward the\nexposed denominator. Anchoring at diagnosis or enrollment instead of initiation re-introduces it. (2) *Where does the clock\nstop, by design?* An **intention-to-treat-like (ITT) TAR** runs from time zero to a fixed horizon or end of observation\nregardless of treatment changes; an **on-treatment / as-treated TAR** runs only while the exposure era persists (last\n`days_supply` end + a grace period), and censoring at discontinuation/switch is *informative* unless handled (IPCW). 
These\nestimate different quantities — the effect of *starting* a strategy vs the effect of *staying on* it. (3) *What is the\ncensoring source, and is it a competing event?* If death is treated as plain censoring for a non-fatal outcome you estimate\nthe **cause-specific** quantity (rate among those still alive and at risk); if you want the absolute probability of the\noutcome in a population where people die of other causes, death is a **competing risk** and the cumulative incidence must\nbe estimated with a Fine-Gray/Aalen-Johansen approach, not 1 − Kaplan-Meier. The TAR/exit specification *is* the choice of\nestimand; it cannot be deferred to the modeling stage.\n\n**Pros, cons, and trade-offs.**\n- **Explicit offset-based TAR (OMOP) vs an ad-hoc \"from index to last claim\" follow-up rule.** Writing TAR as anchor +\n  start/end offsets with a named censoring source makes time zero, immortal time, and the estimand auditable and portable\n  across an OHDSI network; the cost is that every offset is a forced decision that must be defended and varied in sensitivity\n  analysis. **Prefer the explicit form** for any regulatory- or HTA-grade study. The ad-hoc rule hides immortal time and\n  is unreproducible across sites.\n- **ITT-like TAR vs on-treatment TAR.** ITT is robust to informative censoring and answers the policy-relevant\n  \"start-the-drug\" question, but dilutes effects under heavy switching/discontinuation; on-treatment isolates the biological\n  on-drug effect but requires episode construction (days_supply stitching, grace period) and IPCW for the informative\n  censoring it creates. **Prefer ITT** as the primary unless the question is explicitly about sustained use, then carry\n  on-treatment as a pre-specified secondary. 
This is the same axis that separates `active-comparator-new-user` ITT contrasts\n  from `clone-censor-weight-per-protocol` per-protocol estimands.\n- **Death as censoring vs death as competing risk.** Censoring death is correct for the cause-specific hazard and for\n  etiologic questions; it overstates absolute risk whenever mortality differs by arm (e.g., elderly comparative safety).\n  **Prefer the competing-risks cumulative incidence** for any decision-analytic or absolute-risk output (see\n  `competing-risks-cause-specific-fine-gray-rwe`).\n\n**When to use.** Whenever person-time and outcome eligibility must be defined for an OMOP cohort feeding an incidence rate,\nsurvival, Poisson/Cox, or self-controlled analysis: comparative safety/effectiveness, drug-utilization denominators,\nbackground-rate estimation for signal evaluation, and the follow-up engine of a target-trial emulation. Specify TAR start =\nexposure initiation (time zero), an end rule tied to the estimand, and a single, documented censoring hierarchy applied\nidentically to every arm.\n\n**When NOT to use — and when it is actively misleading or dangerous.**\n- **Do not anchor TAR start before the exposure-defining event.** Starting follow-up at diagnosis or hospital admission but\n  classifying exposure by a fill that occurs days later guarantees immortal time — the exposed group is credited with\n  event-free survival they only achieved by living long enough to be exposed. This inflates apparent benefit and is the\n  single most common self-inflicted injury in claims studies.\n- **Do not let TAR extend past observable person-time.** If the end rule (e.g., +365 days) runs beyond\n  `observation_period_end_date`, disenrollment, or the database cutoff, you fabricate event-free person-time and undercount\n  outcomes. 
Cohort exit must be the *minimum* of the design end rule and every loss-of-observability date.\n- **Do not treat death as ordinary censoring for absolute-risk or HTA outputs** when mortality is non-trivial and differs by\n  arm — the resulting 1 − KM curve is not a real-world probability and can reverse a cost-effectiveness conclusion.\n- **Do not use an on-treatment TAR with naive censoring at discontinuation** when discontinuation is prognostic; the\n  informative censoring biases the on-treatment estimate. Use IPCW or fall back to ITT.\n- **Do not reuse one TAR across heterogeneous outcomes.** An acute outcome (e.g., anaphylaxis, days) and a chronic one\n  (e.g., incident HF, years) need different end rules and induction/latency offsets; a single window misclassifies one of them.\n\n**Data-source operational depth.**\n- **Claims (FFS):** TAR start = index fill/procedure date; cohort exit = min(design end rule, disenrollment date, death\n  date, end of data). Continuous medical + pharmacy enrollment must blanket the entire TAR — a gap is *unobserved*\n  person-time, not event-free time. Adjudication lag and claim reversals mean an event near the data cutoff may not yet be\n  in the extract; right-truncate the study end to allow run-out. Procedure-anchored studies are a classic immortal-time\n  trap: time from admission to the procedure is immortal and must be excluded or assigned to the unexposed state.\n- **Claims mapped to OMOP / Medicare Advantage:** When claims are ETL'd into OMOP, `observation_period` is built from\n  enrollment spans — but MA-only person-time generally lacks adjudicated FFS claims, so outcomes and exposures are\n  under-captured even though `observation_period` looks continuous. 
Restrict TAR to FFS-observable spans (Parts A/B/D),\n  exclude MA-only periods, and never trust `observation_period_end_date` as a censoring source without confirming the\n  underlying benefit type.\n- **EHR:** Capture is encounter-driven, so absence of an outcome can be absence of a visit. A patient who leaves the health\n  system is differentially and informatively lost; define `observation_period` from real contact density (e.g., ≥1\n  encounter per interval) rather than assuming continuity to the database cutoff, and prefer linked claims/death index to\n  firm up cohort exit. Outcomes diagnosed at outside facilities (external-care leakage) are missed entirely.\n- **Registry / linked:** Registries give adjudicated outcomes and severity but rarely complete exposure or full mortality;\n  link to claims for fills and to a vital-records/death index so the competing-risk and censoring dates are real. Linkage\n  selects the linkable subset (a transportability threat) and creates order/fill/service date discrepancies that must be\n  reconciled before the TAR anchor is set.\n- **Differential competing risks in the elderly:** In an aged claims cohort, the arm prescribed to frailer patients will\n  have higher non-outcome mortality; treating those deaths as ordinary censoring inflates that arm's 1 − KM risk estimate\n  more than the comparator's. This is a data-source-driven artifact of the exit rule, not a treatment effect; diagnose it\n  by tabulating competing-event incidence by arm before interpreting the primary result.\n\n**Worked claims example (two TAR variants, one estimand decision).** Question: incident hospitalized heart failure (HF)\namong new initiators of drug A vs drug B in a Medicare FFS + commercial OMOP instance. Cohort entry (`cohort_start_date`) =\nfirst fill of A or B with 365 days of prior continuous, FFS-observable enrollment and no prior fill of either drug (new-user\nwashout). 
**TAR start** = `cohort_start_date + 1` (offset +1 excludes an HF claim coded on the index day, which is\nprevalent, not incident). Outcome = first inpatient HF claim in a validated position. Now the design fork:\n*Variant 1 — ITT-like.* TAR end = min(`cohort_start_date + 730`, `observation_period_end_date`, `death_date`, study end).\nPerson-time is counted regardless of whether the patient stays on drug; the estimand is the effect of *initiating* A vs B\non 2-year HF risk. Death is modeled as a **competing risk** because these are elderly patients, so the reported quantity is\nthe cumulative incidence of HF, with a parallel cause-specific hazard for the etiologic contrast.\n*Variant 2 — on-treatment.* TAR end = min(last `days_supply` end + 90-day grace, switch to the other drug,\n`observation_period_end_date`, `death_date`, study end). This isolates the on-drug effect but the censoring at\ndiscontinuation is informative (patients who feel worse stop), so IPCW is applied. Both variants use the *identical*\ncensoring hierarchy and the *same* outcome definition; only the end rule differs, and that difference is exactly the\ndifference in estimand. Sensitivity analyses vary the +1 induction offset, the grace period (60/90/120 days), the ITT\nhorizon (1 vs 2 years), and a negative-control outcome to probe residual confounding. Reported diagnostics: pre/post TAR\nperson-time by arm, distribution of cohort-exit reasons by arm, and competing-event (death) incidence by arm.",
    "primary_category": "Study_Design",
    "tags": [
      "omop-cdm",
      "ohdsi",
      "time-at-risk",
      "cohort-exit",
      "censoring",
      "immortal-time-bias",
      "person-time",
      "competing-risks",
      "estimand"
    ],
    "applies_to_study_types": [
      "claims_analysis",
      "ehr_study",
      "registry_linkage",
      "target_trial_emulation",
      "comparative_effectiveness"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1093/aje/kwm324",
        "url": "https://doi.org/10.1093/aje/kwm324",
        "citation_text": "Suissa S. Immortal time bias in pharmacoepidemiology. American Journal of Epidemiology. 2008;167(4):492-499.",
        "year": 2008,
        "authors_short": "Suissa",
        "notes": "Canonical definition of immortal time bias and the misclassified person-time it creates — the central failure mode that the time-at-risk start rule exists to prevent."
      },
      {
        "role": "introduce",
        "doi": "10.1016/j.jclinepi.2016.04.014",
        "url": "https://doi.org/10.1016/j.jclinepi.2016.04.014",
        "citation_text": "Hernán MA, Sauer BC, Hernández-Díaz S, Platt R, Shrier I. Specifying a target trial prevents immortal time bias and other self-inflicted injuries in observational analyses. Journal of Clinical Epidemiology. 2016;79:70-75.",
        "year": 2016,
        "authors_short": "Hernán et al.",
        "notes": "Frames time-zero alignment (TAR start = exposure initiation) as the explicit protocol element that prevents immortal time and other follow-up-window injuries."
      },
      {
        "role": "explain",
        "doi": "10.1161/CIRCULATIONAHA.115.017719",
        "url": "https://doi.org/10.1161/CIRCULATIONAHA.115.017719",
        "citation_text": "Austin PC, Lee DS, Fine JP. Introduction to the analysis of survival data in the presence of competing risks. Circulation. 2016;133(6):601-609.",
        "year": 2016,
        "authors_short": "Austin et al.",
        "notes": "Explains why the cohort-exit/censoring choice for death determines whether you estimate cause-specific hazards or cumulative incidence — the estimand consequence of the exit rule."
      },
      {
        "role": "demonstrate",
        "doi": "10.1002/pds.4297",
        "url": "https://doi.org/10.1002/pds.4297",
        "citation_text": "Berger ML, Sox H, Willke RJ, et al. Good practices for real-world data studies of treatment and/or comparative effectiveness: Recommendations from the joint ISPOR-ISPE Special Task Force on real-world evidence in health care decision making. Pharmacoepidemiology and Drug Safety. 2017;26(9):1033-1039.",
        "year": 2017,
        "authors_short": "Berger et al.",
        "notes": "Joint ISPOR-ISPE good-practice guidance requiring pre-specified follow-up windows, time-zero, and censoring rules in regulatory- and HTA-grade RWD studies."
      }
    ],
    "plain_language_summary": "When researchers study whether a drug causes a health event, they need to define exactly which days each patient was being watched for that event -- this span is called the time-at-risk window. The window opens at the patient's starting point (usually the day they first filled a prescription for the drug, sometimes one day later to exclude events already present on that day) and closes at whichever comes first: a fixed number of follow-up days, the day the patient lost insurance coverage, the patient's death, or the day the outcome actually occurred. Getting these boundaries right lets the study count real new events rather than old ones, and avoids crediting a drug with event-free time that happened before anyone was even taking it.",
    "key_terms": [
      {
        "term": "index date (cohort entry)",
        "definition": "The patient's starting point in the study -- typically the date of their first prescription fill for the drug being studied, serving as day zero from which all follow-up is measured."
      },
      {
        "term": "time-at-risk window",
        "definition": "The stretch of calendar days during which the study is actively watching a patient for the outcome of interest; only events that occur inside this window count toward the study results."
      },
      {
        "term": "cohort exit",
        "definition": "The day a patient stops being watched -- whichever comes first among the design end rule (e.g., 365 days of follow-up), the last day of insurance coverage, death, or the occurrence of the outcome."
      },
      {
        "term": "observation period",
        "definition": "The span of dates on which a data source (claims or EHR) has reliable records for a patient; the time-at-risk window can never extend beyond this boundary, because records outside it simply do not exist."
      },
      {
        "term": "immortal time",
        "definition": "Person-time that is counted toward an exposed group even though the exposure had not yet started -- this artificially makes the exposed group look healthier and must be excluded by anchoring the watch window at the true start of exposure."
      },
      {
        "term": "induction offset",
        "definition": "A deliberate one-day (or longer) delay between the index date and the start of the time-at-risk window, used to exclude outcomes that were already present on the first day of treatment rather than caused by it."
      }
    ],
    "worked_example": {
      "scenario": "A Medicare claims study is asking whether a new blood-pressure drug (Drug A) causes a first hospitalization for heart failure. Patient 2201 fills Drug A for the first time on 2023-03-01 -- that is the index date. The study design says: open the watch window one day after the index date (to avoid counting heart failure that was already coded on the fill day), and follow each patient for up to 365 days. Patient 2201 loses insurance coverage on 2023-09-15; after that date her claims are invisible to the database. She never has a heart-failure hospitalization before that date. We want to know how many days she contributes to the study denominator and why her follow-up ends when it does.",
      "dataset": {
        "caption": "Key dates for patient 2201 pulled from OMOP-style tables -- the anchors an analyst joins before computing the time-at-risk window.",
        "columns": [
          "person_id",
          "event",
          "date",
          "note"
        ],
        "rows": [
          [
            2201,
            "first Drug A fill (index date)",
            "2023-03-01",
            "cohort entry -- day zero"
          ],
          [
            2201,
            "insurance coverage ends",
            "2023-09-15",
            "observation period end date"
          ],
          [
            2201,
            "design end rule",
            "2024-02-29",
            "index date + 365 days"
          ],
          [
            2201,
            "heart-failure hospitalization",
            "none",
            "outcome never occurred"
          ]
        ]
      },
      "steps": [
        "Set the TAR start: add the 1-day induction offset to the index date. TAR start = 2023-03-01 + 1 day = 2023-03-02.",
        "Identify all possible TAR end dates: (a) design end rule = 2023-03-01 + 365 days = 2024-02-29 (leap year); (b) insurance coverage ends = 2023-09-15; (c) outcome never occurred, so no outcome exit date applies.",
        "Apply the 'whichever comes first' rule: compare 2024-02-29 and 2023-09-15. The earlier date is 2023-09-15.",
        "Cohort exit = 2023-09-15 because insurance lapse comes well before the 365-day design limit.",
        "Calculate days at risk: from 2023-03-02 through 2023-09-15 inclusive = 198 days.",
        "Record the exit reason: the patient did not have the outcome; she exited because observable follow-up ended when insurance coverage lapsed. She contributes 198 person-days to the study denominator and zero outcome events."
      ],
      "result": {
        "label": "198 days at risk; cohort exit = end of observation period (insurance lapse on 2023-09-15)",
        "value": 198
      },
      "timeline_spec": {
        "title": "Time-at-risk window for patient 2201 (index fill to cohort exit, 365-day design horizon)",
        "caption": "The TAR window opens 1 day after the index fill and closes at insurance lapse (198 days), well before the 365-day design limit. No outcome occurred, so the patient contributes 198 person-days to the study denominator and no events to the numerator.",
        "alt_text": "A horizontal timeline showing: index date on 2023-03-01, TAR start one day later on 2023-03-02, a blue shaded 198-day at-risk span ending at cohort exit on 2023-09-15 due to insurance lapse, and the unfulfilled design end rule shown as a dashed line extending to 2024-02-29. A grey shaded region from 2023-09-16 to 2024-02-29 is labeled unobservable.",
        "window": {
          "start": "2023-03-01",
          "end": "2024-02-29",
          "label": "Design window: index date to index date + 365 days"
        },
        "events": [
          {
            "label": "Index date (first Drug A fill)",
            "start": "2023-03-01",
            "length_days": 1,
            "quantity": "cohort entry -- day zero"
          },
          {
            "label": "TAR start (index + 1-day induction offset)",
            "start": "2023-03-02",
            "length_days": 1,
            "quantity": "watch window opens"
          }
        ],
        "spans": [
          {
            "kind": "followup",
            "start": "2023-03-02",
            "end": "2023-09-15",
            "label": "198-day time-at-risk window (TAR)"
          },
          {
            "kind": "unexposed",
            "start": "2023-09-16",
            "end": "2024-02-29",
            "label": "167 unobservable days (insurance lapsed -- not counted)"
          }
        ],
        "result": {
          "label": "198 person-days at risk; cohort exit = end of observation period (insurance lapse 2023-09-15)",
          "value": 198
        }
      }
    },
    "prerequisites": [
      "omop-observation-period-rwe",
      "omop-drug-exposure-drug-era-rwe",
      "continuous-enrollment-observable-time-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "ITT-like (initiation-anchored) time-at-risk",
        "description": "TAR runs from time zero (exposure initiation) to a fixed horizon or end of observation, ignoring later treatment changes. Estimates the effect of starting a strategy.",
        "edge_cases": [
          "Fixed horizon (e.g., +730 days) can exceed observation_period_end_date or disenrollment — cohort exit must be the minimum of the design horizon and every loss-of-observability date.",
          "Heavy switching/discontinuation dilutes the contrast toward the null."
        ],
        "data_source_notes": "claims: cap the horizon at the enrollment span; EHR: cap at last credible contact, not the database cutoff."
      },
      {
        "name": "On-treatment (as-treated) time-at-risk",
        "description": "TAR persists only while the exposure era is active — last days_supply end plus a grace period — and censors at discontinuation or switch. Isolates the on-drug effect.",
        "edge_cases": [
          "Censoring at discontinuation is informative when stopping is prognostic; requires inverse-probability-of-censoring weighting.",
          "Grace period and stockpiling/refill-gap rules materially change the era end and must be varied in sensitivity analysis."
        ],
        "data_source_notes": "claims: stitch days_supply into eras with explicit gap/grace rules; map to drug_era when using the OMOP standard era logic."
      },
      {
        "name": "Outcome / competing-event exit rule",
        "description": "Specifies whether the outcome itself, death, or a competing clinical event terminates time-at-risk, and whether death is censoring or a competing risk.",
        "edge_cases": [
          "Treating death as plain censoring overstates absolute risk when mortality differs by arm.",
          "Multiple-event outcomes need a rule (first event only vs recurrent) before the exit date is computable."
        ],
        "data_source_notes": "claims/linked: confirm a death-index source; OMOP death table completeness varies by ETL and benefit type (MA vs FFS)."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Ad-hoc \"index to last claim\" follow-up",
        "pros_of_this": "Explicit anchor + start/end offsets + named censoring source make time zero, immortal time, and the estimand auditable and portable across an OHDSI network.",
        "cons_of_this": "Every offset is a forced decision that must be justified and varied in sensitivity analyses.",
        "when_to_prefer": "Any regulatory-, HTA-, or network-grade study where reproducibility and immortal-time control are required."
      },
      {
        "compared_to": "ITT-like time-at-risk",
        "pros_of_this": "On-treatment TAR isolates the biological on-drug effect rather than the diluted start-the-drug effect.",
        "cons_of_this": "Creates informative censoring at discontinuation (needs IPCW) and requires explicit episode/grace-period construction.",
        "when_to_prefer": "Questions explicitly about sustained use; otherwise keep ITT-like primary and on-treatment as a secondary."
      },
      {
        "compared_to": "Death-as-censoring (cause-specific)",
        "pros_of_this": "Death-as-competing-risk cumulative incidence gives the real-world absolute probability needed for decision-analytic and HTA outputs.",
        "cons_of_this": "Cause-specific hazards remain the cleaner etiologic contrast; competing-risk models complicate covariate effect interpretation.",
        "when_to_prefer": "Absolute-risk, budget-impact, or cost-effectiveness outputs in populations with non-trivial, differential mortality."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Cohort exit = min(design end rule, disenrollment, death, end of data). Continuous medical + pharmacy enrollment must blanket the entire TAR; a gap is unobserved, not event-free, person-time. Allow claims run-out before the study end to absorb adjudication lag and reversals. Procedure-anchored TAR must exclude admission-to-procedure immortal time.",
      "ehr": "Capture is encounter-driven, so an apparently absent outcome may reflect a missing visit rather than no event. Define observation_period from real contact density, not assumed continuity to the database cutoff; treat loss to follow-up as informative and prefer linked claims/death for cohort exit. Outcomes treated at outside facilities are missed (external-care leakage).",
      "registry": "Strong for adjudicated outcomes and severity, weak for full exposure and mortality. Link to claims for fills and to a death index so the censoring and competing-risk dates are real; account for linkage-selection transportability.",
      "linked": "Ideal substrate (severity + completeness + mortality) but linkage selects the linkable subset and creates order/fill/service date discrepancies that must be reconciled before the TAR anchor is set."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\n\nINDUCTION_OFFSET = pd.Timedelta(days=1)    # +1: exclude outcomes coded on the index day (prevalent, not incident)\nITT_HORIZON      = pd.Timedelta(days=730)  # fixed ITT follow-up horizon\nGRACE            = pd.Timedelta(days=90)   # on-treatment grace after last days_supply end\n\ndef build_tar(cohort: pd.DataFrame, obs: pd.DataFrame,\n              death: pd.DataFrame, drug_era: pd.DataFrame) -> pd.DataFrame:\n    df = cohort.merge(obs, on=\"person_id\", how=\"left\") \\\n               .merge(death, on=\"person_id\", how=\"left\")\n\n    # On-treatment anchor = latest drug_era_end_date per person. This takes the max over all rows,\n    # so drug_era must be pre-filtered to the index drug's era; GRACE is added further down.\n    era_end = (drug_era.groupby(\"person_id\")[\"drug_era_end_date\"].max()\n                       .rename(\"era_end\").reset_index())\n    df = df.merge(era_end, on=\"person_id\", how=\"left\")\n\n    # TAR start: time zero + induction offset.\n    df[\"tar_start\"] = df[\"cohort_start_date\"] + INDUCTION_OFFSET\n\n    # Loss-of-observability floor: cohort exit can never exceed observed person-time.\n    # min(axis=1) skips NaT, so a missing death_date falls back to the observation-period end.\n    obs_floor = df[[\"observation_period_end_date\", \"death_date\"]].min(axis=1)\n\n    # ITT-like exit = min(fixed horizon, end of observability).\n    df[\"tar_end_itt\"] = pd.concat(\n        [df[\"cohort_start_date\"] + ITT_HORIZON, obs_floor], axis=1).min(axis=1)\n\n    # On-treatment exit = min(era end + grace, end of observability).\n    df[\"tar_end_ontx\"] = pd.concat(\n        [df[\"era_end\"] + GRACE, obs_floor], axis=1).min(axis=1)\n\n    # Drop persons with no positive person-time (exit on or before tar_start).\n    df = df[df[\"tar_end_itt\"] > df[\"tar_start\"]].copy()\n    return df[[\"person_id\", \"cohort_start_date\", \"tar_start\",\n               \"tar_end_itt\", \"tar_end_ontx\", \"death_date\"]]",
        "description": "Derive per-person time-at-risk and cohort-exit dates from OMOP-style tables, for both an ITT-like and an on-treatment\nTAR. Required inputs (already ETL'd and de-duplicated to one row per person, except drug_era, which may hold several rows):\n  cohort   : person_id, cohort_start_date (datetime)  # time zero = first qualifying drug_exposure\n  obs       : person_id, observation_period_start_date, observation_period_end_date\n  death     : person_id, death_date (nullable)\n  drug_era  : person_id, drug_era_start_date, drug_era_end_date  # standard OMOP era; end = last days_supply end\nReturns one row per person with positive ITT person-time, carrying tar_start and the two candidate exit dates. Outcome counting\nand rate/survival models are applied downstream using [tar_start, tar_end_*] identically across arms.",
        "dependencies": [
          "pandas"
        ],
        "source_citations": [
          "suissa-2008",
          "hernan-2016"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\n\nINDUCTION <- 1L     # +1 day induction offset (exclude index-day prevalent outcomes)\nITT_HORIZON <- 730L  # fixed ITT horizon in days\nGRACE <- 90L         # on-treatment grace after last era end\n\nbuild_tar <- function(cohort, obs, death, drug_era) {\n  setDT(cohort); setDT(obs); setDT(death); setDT(drug_era)\n\n  era_end <- drug_era[, .(era_end = max(drug_era_end_date)), by = person_id]\n\n  df <- merge(cohort, obs[, .(person_id, observation_period_end_date)], by = \"person_id\", all.x = TRUE)\n  df <- merge(df, death[, .(person_id, death_date)], by = \"person_id\", all.x = TRUE)\n  df <- merge(df, era_end, by = \"person_id\", all.x = TRUE)\n\n  df[, tar_start := cohort_start_date + INDUCTION]\n\n  # Cohort exit can never exceed observed person-time (min of obs-period end and death).\n  df[, obs_floor := pmin(observation_period_end_date, death_date, na.rm = TRUE)]\n\n  df[, tar_end_itt  := pmin(cohort_start_date + ITT_HORIZON, obs_floor, na.rm = TRUE)]\n  df[, tar_end_ontx := pmin(era_end + GRACE,               obs_floor, na.rm = TRUE)]\n\n  df <- df[tar_end_itt > tar_start]   # require positive person-time\n  df[, .(person_id, cohort_start_date, tar_start, tar_end_itt, tar_end_ontx, death_date)]\n}",
        "description": "OMOP time-at-risk and cohort-exit derivation with data.table; inputs mirror the Python version:\n  cohort   : person_id, cohort_start_date (Date)\n  obs       : person_id, observation_period_end_date (Date)\n  death     : person_id, death_date (Date, possibly NA)\n  drug_era  : person_id, drug_era_end_date (Date)\nReturns tar_start plus ITT-like and on-treatment exit dates, with end-of-observability flooring applied to both.",
        "dependencies": [
          "data.table"
        ],
        "source_citations": [
          "suissa-2008",
          "hernan-2016"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "%let induction = 1;     /* +1 day: exclude index-day prevalent outcomes            */\n%let itt_horizon = 730; /* fixed ITT follow-up horizon (days)                       */\n%let grace = 90;        /* on-treatment grace after last era end                     */\n\n/* Collapse to one last-era-end per person (on-treatment exit anchor). */\nproc sql;\n  create table _era as\n  select person_id, max(drug_era_end_date) as era_end format=date9.\n  from omop.drug_era group by person_id;\nquit;\n\n/* Assemble anchors and derive both TAR variants with observability flooring. */\nproc sql;\n  create table tar as\n  select c.subject_id as person_id,\n         c.cohort_start_date,\n         c.cohort_start_date + &induction                       as tar_start  format=date9.,\n         min(op.observation_period_end_date, d.death_date)      as obs_floor  format=date9.,\n         min(c.cohort_start_date + &itt_horizon,\n             calculated obs_floor)                              as tar_end_itt  format=date9.,\n         min(e.era_end + &grace,\n             calculated obs_floor)                              as tar_end_ontx format=date9.,\n         d.death_date\n  from omop.cohort c\n  left join omop.observation_period op on c.subject_id = op.person_id\n  left join omop.death              d  on c.subject_id = d.person_id\n  left join _era                    e  on c.subject_id = e.person_id;\nquit;\n\n/* Eligible outcome = first outcome inside the chosen TAR; person-time in person-years (ITT shown). 
*/\nproc sql;\n  create table rate_in as\n  select t.person_id,\n         (t.tar_end_itt - t.tar_start) / 365.25                 as pyears,\n         log(calculated pyears)                                 as log_py, /* offset used by PROC GENMOD below */\n         /* The join keeps only outcomes inside the TAR, so any matched row is an event. */\n         (count(o.outcome_date) > 0)                            as event\n  from tar t\n  left join omop.outcome o\n    on t.person_id = o.person_id\n   and o.outcome_date between t.tar_start and t.tar_end_itt\n  where t.tar_end_itt > t.tar_start\n  group by t.person_id, calculated pyears, calculated log_py;\nquit;\n\n/* Incidence rate via Poisson with log person-time offset (the standard person-time model). */\nproc genmod data=rate_in;\n  model event = / dist=poisson link=log offset=log_py;\n  output out=fitted p=ratehat;\nrun;",
        "description": "Time-at-risk and cohort-exit construction directly on OMOP CDM tables in SAS, producing ITT-like and on-treatment\nexit dates, then a person-time + incidence-rate summary with PROC GENMOD (Poisson). Required tables (CDM-conformant):\n  omop.cohort                 : subject_id (=person_id), cohort_start_date\n  omop.observation_period     : person_id, observation_period_end_date\n  omop.death                  : person_id, death_date\n  omop.drug_era               : person_id, drug_era_end_date  (standard OMOP era; max = last days_supply end)\n  omop.outcome                : person_id, outcome_date       (first qualifying outcome event)\nAll TAR ends are floored at min(observation_period_end_date, death_date) so no unobservable person-time is fabricated.",
        "dependencies": [],
        "source_citations": [
          "suissa-2008",
          "berger-2017"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": "omop-time-at-risk-cohort-exit-rwe-timeline.svg",
        "mermaid": null,
        "caption": "The TAR window opens 1 day after the index fill and closes at insurance lapse (198 days), well before the 365-day design limit. No outcome occurred, so the patient contributes 198 person-days to the study denominator and no events to the numerator.",
        "alt_text": "A horizontal timeline showing: index date on 2023-03-01, TAR start one day later on 2023-03-02, a blue shaded 198-day at-risk span ending at cohort exit on 2023-09-15 due to insurance lapse, and the unfulfilled design end rule shown as a dashed line extending to 2024-02-29. A grey shaded region from 2023-09-16 to 2024-02-29 is labeled unobservable.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "timeline\n  title One subject's OMOP time-at-risk and cohort exit\n  Observation period start : enrollment / first contact (left edge of observable person-time)\n  Cohort start (time zero) : first qualifying drug_exposure -> exposure decision\n  TAR start : cohort_start_date + induction offset (e.g. +1 day)\n  Person-time accrues : at-risk window, immortal time before time zero is excluded\n  First outcome OR competing event : inpatient HF (outcome) or death (competing risk)\n  Cohort exit : min(design end rule, observation_period_end, death, switch, data end)",
        "caption": "Time-at-risk for a single subject. The clock starts at the exposure decision (time zero) plus an induction offset; cohort exit is the minimum of the design end rule and every loss-of-observability date, so no unobservable or immortal person-time is counted.",
        "alt_text": "A timeline from observation-period start, to cohort start (time zero), TAR start after an induction offset, person-time accrual, the first outcome or competing event, and cohort exit at the minimum of the design and observability dates.",
        "source_type": "illustrative",
        "source_citations": [
          "suissa-2008"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Q[What is the estimand?] --> A{Effect of starting<br/>or staying on treatment?}\n  A -->|Starting a strategy| ITT[ITT-like TAR:<br/>time zero -> fixed horizon<br/>or end of observation]\n  A -->|Staying on treatment| OnTx[On-treatment TAR:<br/>last days_supply end + grace;<br/>censor at discontinuation/switch]\n  OnTx --> IPCW{Is discontinuation<br/>prognostic?}\n  IPCW -->|Yes| W[Apply inverse-probability-of-<br/>censoring weighting]\n  IPCW -->|No| Keep[Naive censoring acceptable]\n  ITT --> D{Outcome non-fatal AND<br/>mortality differs by arm?}\n  W --> D\n  Keep --> D\n  D -->|Yes, need absolute risk| CR[Death = competing risk:<br/>cumulative incidence / Fine-Gray]\n  D -->|Etiologic hazard contrast| CS[Death = censoring:<br/>cause-specific hazard]\n  CR --> Floor[Floor every exit date at<br/>min observation-period end / death / data end]\n  CS --> Floor",
        "caption": "Decision logic for the cohort-exit rule. The TAR start/end and censoring source are not modeling details applied later — they ARE the choice of estimand (start-vs-stay, cause-specific-vs-cumulative-incidence), with mandatory observability flooring.",
        "alt_text": "A flowchart deciding between ITT-like and on-treatment time-at-risk, whether to apply IPCW for informative censoring, and whether to treat death as a competing risk or as censoring, ending in observability flooring of all exit dates.",
        "source_type": "illustrative",
        "source_citations": [
          "austin-2016"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "part_of",
        "target_slug": "omop-cdm-method-patterns-rwe",
        "notes": "Time-at-risk and cohort exit are the follow-up/denominator component of the broader OMOP/OHDSI cohort-construction method family."
      },
      {
        "relation_type": "requires",
        "target_slug": "omop-observation-period-rwe",
        "notes": "The observation_period bounds observable person-time; cohort exit must be floored at observation_period_end_date so no unobservable follow-up is counted."
      },
      {
        "relation_type": "used_with",
        "target_slug": "omop-drug-exposure-drug-era-rwe",
        "notes": "drug_era provides the on-treatment TAR end (last days_supply end + grace) and defines exposure eras for as-treated follow-up."
      },
      {
        "relation_type": "see_also",
        "target_slug": "immortal-time-bias-handling",
        "notes": "Anchoring TAR start at time zero (exposure initiation) is the design device that prevents immortal time; mis-anchoring re-introduces it."
      },
      {
        "relation_type": "used_with",
        "target_slug": "competing-risks-cause-specific-fine-gray-rwe",
        "notes": "Whether death is treated as censoring or a competing event at cohort exit determines cause-specific vs cumulative-incidence estimands."
      },
      {
        "relation_type": "used_with",
        "target_slug": "target-trial-emulation",
        "notes": "TAR start = time zero and a pre-specified exit/censoring rule are the follow-up elements of the emulated trial protocol."
      },
      {
        "relation_type": "see_also",
        "target_slug": "clone-censor-weight-per-protocol",
        "notes": "On-treatment TAR with informative censoring at discontinuation/switch motivates clone-censor-weight or IPCW per-protocol estimation."
      },
      {
        "relation_type": "see_also",
        "target_slug": "active-comparator-new-user",
        "notes": "ACNU sets time zero at the index fill; this concept defines how the at-risk window and cohort exit are operationalized from that time zero."
      }
    ],
    "aliases": [
      "time at risk",
      "TAR",
      "cohort exit",
      "follow-up window",
      "observation window for outcomes",
      "at-risk period"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "one-way-anova",
    "name": "One-Way ANOVA",
    "short_definition": "A parametric hypothesis test that asks whether the population means of three or more independent groups are all equal, by decomposing total variability in the outcome into between-group variance (due to group membership) and within-group variance (due to individual differences), then testing whether the between-group signal is larger than the within-group noise via an F ratio — followed, when the omnibus test rejects, by post-hoc pairwise comparisons with multiplicity correction to identify which pairs differ.",
    "long_description": "**What one-way ANOVA is and what question it answers**\n\nOne-way analysis of variance (ANOVA) answers a single omnibus question: \"Is there any difference\nin the population mean of outcome Y across k ≥ 3 independent groups?\" The test does not specify\n*which* groups differ or by *how much* — it only tells you whether the pattern of group means is\nmore extreme than chance would produce if all groups were drawn from the same population. That\nomnibus structure is both its strength (controls the family-wise type I error for the overall\ncomparison) and its limitation (a significant F requires follow-up post-hoc comparisons to be\nactionable).\n\n**Variance decomposition: between vs within**\n\nANOVA rests on one algebraic identity: total variability in the data can be split exactly into two\nnon-overlapping components.\n\n- *Between-group sum of squares* (SS_between): measures how much the group means scatter around\n  the grand mean. Formally: SS_between = Σ_i n_i (ȳ_i − ȳ_grand)², summing over k groups.\n  If all groups have the same mean, SS_between = 0. Large SS_between signals that group membership\n  explains a meaningful share of the outcome's variability.\n\n- *Within-group sum of squares* (SS_within): measures how much individuals scatter around their\n  own group mean. Formally: SS_within = Σ_i Σ_j (y_ij − ȳ_i)². This is the residual variability\n  that remains after accounting for group membership — the \"noise\" against which the group signal\n  is compared.\n\n- *Degrees of freedom*: df_between = k − 1 (groups minus one); df_within = N − k (total\n  observations minus number of groups). The mean squares are MS_between = SS_between / df_between\n  and MS_within = SS_within / df_within.\n\n- *The F ratio*: F = MS_between / MS_within. Under the null hypothesis that all group means are\n  equal, the ratio MS_between / MS_within follows an F-distribution with (k−1, N−k) degrees of\n  freedom. 
 A large F means the between-group signal is large relative to the within-group noise;\n  the p-value is the probability of observing an F this large or larger if the null hypothesis is true.\n\n- *Additive identity check*: SS_total = SS_between + SS_within, where SS_total = Σ_i Σ_j\n  (y_ij − ȳ_grand)². This must hold exactly (within floating-point precision) in any correct\n  implementation and serves as an internal audit check.\n\n**Assumptions and when they are most critical**\n\nANOVA makes three classical assumptions: (1) observations are independent within and across\ngroups; (2) within each group the outcome is approximately normally distributed; (3) all groups\nshare the same population variance (homoscedasticity). In practice, the F-test's robustness to\nviolations of these assumptions varies by sample size and violation type:\n\n- *Independence*: this is the most critical and least recoverable assumption. If patients are\n  clustered within hospitals or practices, or if the same patient appears in multiple groups\n  (repeated measures), standard ANOVA produces invalid inferences regardless of sample size.\n  Mixed models or repeated-measures ANOVA are the correct alternatives.\n\n- *Normality*: at large n (typically ≥ 20–30 per group), the central limit theorem ensures that\n  the sampling distribution of the group mean is approximately normal even if individual\n  observations are skewed, so the F-test remains valid. 
 At small n with severely non-normal\n  distributions, the Kruskal-Wallis H test is the nonparametric alternative.\n\n- *Homoscedasticity (equal variances)*: Welch ANOVA (also called the Welch F-test; oneway.test\n  in R) relaxes this assumption by using group-specific variance estimates and adjusting degrees\n  of freedom via the Welch-Satterthwaite equation, analogous to Welch's t-test for two groups.\n  As a practical rule, check Levene's test as a diagnostic — not as a decision gate — and use\n  Welch ANOVA when group variances differ substantially (ratio > 4:1) or when group sizes are\n  unequal. The Brown-Forsythe test is an alternative diagnostic that is more robust to\n  non-normality than the original mean-based Levene's test.\n\n**The post-hoc problem and multiplicity control**\n\nA significant omnibus F answers \"yes, there is a difference somewhere\" but provides no\ninformation about which pairs differ. Post-hoc pairwise comparisons are required, and they must\naccount for the multiplicity created by testing all C(k,2) = k(k−1)/2 pairs simultaneously.\n\n- *Tukey HSD (Honestly Significant Difference)*: the standard choice for balanced designs\n  (equal n per group). Controls the family-wise error rate (FWER) across all pairwise comparisons\n  at exactly α. Produces simultaneous confidence intervals for all pairwise mean differences.\n  Preferred when all pairwise comparisons are of interest.\n\n- *Bonferroni correction*: divides the α level by the number of comparisons tested. Simple and\n  applicable to any test family (not limited to pairwise means), but conservative: it has lower\n  power than Tukey HSD when all pairwise comparisons are tested. Appropriate\n  when only a pre-specified subset of comparisons is of interest or when mixed comparison types\n  are needed.\n\n- *Holm-Bonferroni*: sequentially rejective version of Bonferroni, uniformly more powerful while\n  still controlling FWER. 
A nearly free upgrade over plain Bonferroni in most software.\n\n- *Dunnett's test*: designed specifically for comparing k−1 treatment groups to a single control\n  group, with higher power than Tukey or Bonferroni for that specific structure.\n\nThe catalog's general principle on multiplicity applies directly: the appropriate correction\ndepends on the inferential structure of the comparison (all pairwise vs control-vs-treatment vs\npre-specified hypotheses) and the tolerable trade-off between FWER and power. Unprotected\npairwise tests (such as Fisher's LSD) rely on a significant omnibus F to limit type I error;\nprocedures that control FWER on their own (Tukey, Holm, Dunnett) remain valid without it,\nalthough gating them on the omnibus F is a common, conservative convention.\n\n**Effect size and estimands beyond the p-value**\n\nThe F-test p-value alone is insufficient for HEOR reporting. The appropriate effect estimates\nare:\n\n- *η² (eta-squared)*: SS_between / SS_total. The proportion of total variance explained by group\n  membership. Easy to compute but biased upward in small samples.\n- *ω² (omega-squared)*: a bias-corrected alternative: ω² = (SS_between − df_between × MS_within)\n  / (SS_total + MS_within). Preferred for reporting in confirmatory work.\n- *Partial η²*: used in factorial ANOVA where multiple factors are present; equals SS_between /\n  (SS_between + SS_within) for one-way ANOVA (= η²).\n- *Pairwise mean differences with CIs*: the most interpretable quantities for decision-making.\n  Tukey HSD produces simultaneous 95% CIs for all k(k−1)/2 pairwise differences. Each CI\n  answers: \"What is the plausible range of the true mean difference between these two groups?\"\n\n**Why ANOVA is rarely the endpoint in RWE: confounding and the extension to regression**\n\nIn a randomized trial, group membership is assigned by the study design, so ANOVA on the raw\noutcome estimates the causal effect of assignment. In observational RWE, groups typically\ncorrespond to self-selected treatment choices (first-line vs second-line vs third-line therapy,\nor different drug classes chosen by different types of patients). 
The group means differ for\nreasons that include both treatment effectiveness and the confounded characteristics of patients\nwho receive each treatment.\n\nAn unadjusted one-way ANOVA comparing outcomes across treatment groups in a claims or EHR\ndataset is therefore *descriptive*, not causal. It describes the association between group\nmembership and the outcome but cannot distinguish the treatment effect from confounding by\nindication, disease severity, prior treatment history, or provider practice patterns.\n\nThe natural extension is the general linear model (ANCOVA), which includes group indicators as\ncategorical predictors alongside continuous and categorical covariates. In the language of\nregression: one-way ANOVA is the special case of OLS regression with only group dummy variables.\nAdding covariates moves the analysis toward adjusted mean differences that better account for\nconfounding, though regression adjustment alone does not fully remove confounding from\nobservational comparisons — propensity-score methods, instrumental variables, or target trial\nemulation are required when confounding by indication is substantial.\n\nFor HEOR specifically: comparing total costs, utilization rates, or quality-adjusted life years\nacross 3+ lines of therapy requires at minimum covariate adjustment, and usually weighting or\nmatching, before the group comparison is interpretable as anything close to a treatment effect.\nDocument explicitly whether the ANOVA is unadjusted (descriptive, appropriate for Table 2 or\nsupplementary exploratory analysis) or whether regression-adjusted means (LS means) from a GLM\nare being presented as a controlled comparison.\n\n**Welch ANOVA for heteroscedastic groups**\n\nWhen group variances differ — which is common in HEOR when comparing groups of patients with\ndifferent disease severity or different lengths of follow-up — the standard one-way ANOVA\nF-test has inflated type I error if the group with the largest variance also has 
the smallest\nsample size (and deflated type I error in the reverse configuration). Welch ANOVA corrects for\nthis by using a weighted combination of within-group variance estimates and adjusting the\ndegrees of freedom. In R, oneway.test() implements Welch ANOVA by default (var.equal=FALSE).\nIn Python, scipy's f_oneway() does NOT perform the Welch correction; use pingouin's welch_anova()\nor statsmodels for the Welch version. In SAS, PROC GLM produces the standard ANOVA; request the\nWelch test with the WELCH option on the MEANS statement of PROC GLM, or model group-specific\nresidual variances in PROC MIXED (REPEATED / GROUP=) for the heteroscedastic case.\n\n**HEOR usage patterns: costs, utilization, and quality of life across groups**\n\nOne-way ANOVA appears in HEOR for:\n\n- Comparing mean total annual healthcare costs across 3+ lines of therapy or therapeutic areas\n  (with the caveat that a GLM with a gamma or Tweedie family and log link is the modern standard\n  for cost inference).\n- Comparing mean utilization counts (hospitalizations, ED visits, specialist visits) across\n  geographic regions, payer types, or disease severity categories.\n- Comparing mean quality-of-life scores (EQ-5D utility, PROMIS scores) across treatment groups\n  in registry or survey studies.\n- Comparing mean biomarker levels or laboratory values across categories.\n\nA recurring HEOR challenge is that cost and utilization distributions are right-skewed with\nheavy tails, violating the normality assumption at small n. At large n (typical for claims), the\nCLT protects the ANOVA F-test for inference on means, but the GLM framework (gamma with log link\nfor costs; negative binomial for counts) remains preferred because it: (a) respects the\ndistributional shape, (b) naturally accommodates a log-scale mean ratio interpretation, and\n(c) extends seamlessly to covariate adjustment.\n\n**Clustered data and the independence violation in site-level studies**\n\nMany HEOR datasets have a hierarchical structure: patients are nested within physicians, who\nare nested within hospitals or health systems. 
When groups are defined by site-level factors\n(geographic region, hospital type, payer), patients within the same site are correlated — they\nshare unmeasured site-level characteristics (practice patterns, formulary, local referral\nnetworks). This within-site correlation violates the independence assumption of one-way ANOVA\nand causes the standard F-test to underestimate the true variance, inflating the type I error\nrate.\n\nAppropriate alternatives for clustered data:\n- Cluster-robust standard errors (adjusting the variance estimator without changing the point\n  estimate) when the number of clusters is large (≥ 30–50) and cluster sizes are balanced.\n- Linear mixed models with a random intercept for cluster (site, practice, or physician) as the\n  preferred full solution — this explicitly models the within-cluster correlation structure and\n  produces valid inference even at moderate cluster counts.\n\n**Pros, cons, and trade-offs**\n\n*Pros*:\n- The F-test controls the family-wise type I error for the omnibus multi-group comparison in a\n  single test, avoiding the inflated error of k(k-1)/2 uncorrected pairwise t-tests.\n- Directly interpretable effect estimates (mean differences, η²) are available.\n- Well-understood power properties; sample size formulas are established for both the omnibus test\n  and Tukey HSD post-hoc comparisons.\n- Straightforward extension to factorial ANOVA (multiple grouping factors) and ANCOVA (covariate\n  adjustment) within the general linear model framework.\n- Computationally trivial; universally available in all statistical software.\n- Robust to normality violations at large n (CLT protects the F-statistic).\n- Welch ANOVA extends robustness to variance heterogeneity.\n\n*Cons*:\n- Omnibus F does not identify which pairs differ; post-hoc comparisons are always required for\n  actionable conclusions, adding complexity and reducing power per comparison.\n- Assumes independence; invalid for repeated measures, matched cohorts, 
or clustered data.\n- Standard ANOVA assumes equal variances; Welch correction must be explicitly requested.\n- For skewed outcomes (costs, utilization counts), a GLM is preferred for primary inference on\n  the mean when n is small-to-moderate.\n- In confounded observational data, group mean comparisons are descriptive, not causal.\n\n**When to use**\n\nUse one-way ANOVA when all of the following hold:\n- The outcome is continuous (or nearly so — bounded integer counts with adequate range are often\n  acceptable at large n).\n- You have three or more independent groups defined by a single categorical factor.\n- The primary question is whether any difference exists in means across groups (omnibus test),\n  to be followed by pairwise post-hoc comparisons.\n- Observations are independent (no repeated measures, no clustering, or clustering is handled by\n  a separate random effect).\n- In RCTs or post-matching/weighting observational studies: group means are directly interpretable\n  as adjusted estimates after design-stage confounding control.\n- In descriptive RWE: as an exploratory or supplementary table comparing unadjusted group means\n  with explicit acknowledgment that differences reflect both treatment effects and confounding.\n- When n per group is large (≥ 20–30): normality assumption is protected by CLT even for\n  moderately skewed outcomes.\n\n**When NOT to use**\n\nDo not use one-way ANOVA as the primary inferential method when:\n\n- *Repeated measures on the same patients*: using standard ANOVA on measurements taken from the\n  same patients at multiple time points violates independence and inflates type I error. Use\n  repeated-measures ANOVA, mixed-effects models (MMRM), or GEE instead.\n- *Clustered data without correction*: patients nested within sites, practices, or geographic\n  units violate independence. 
Use cluster-robust standard errors or mixed models with random\n  intercepts for the cluster level.\n- *Confounded group comparisons presented as causal claims*: an unadjusted ANOVA across treatment\n  groups in an observational dataset is not a valid estimate of the treatment effect. Route to\n  propensity-score methods, g-methods, or regression adjustment for causal inference.\n- *Heavy skew with small samples*: at n < 15–20 per group with a severely right-skewed or heavy-\n  tailed distribution (common for costs in small pilot studies), the CLT has not fully kicked in\n  and the F-test's normality assumption is load-bearing. Use Kruskal-Wallis H as the nonparametric\n  alternative, or a GLM with an appropriate distributional family.\n- *Binary or count outcomes as the primary estimand*: if the outcome is binary (response yes/no)\n  or a count (number of events), logistic regression or Poisson/negative-binomial regression\n  produce direct odds ratios or rate ratios with CIs. ANOVA on a binary 0/1 outcome is\n  arithmetically equivalent to a linear probability model, which can produce predicted\n  probabilities outside [0,1] and is not the preferred model for binary outcomes.\n- *When only two groups exist*: one-way ANOVA with k=2 reduces to the pooled t-test (F = t²\n  exactly). Use Welch's t-test directly for two groups; it is simpler and the Welch correction\n  is the default in most software.\n- *When the research question is about spread, not means*: ANOVA tests equality of means. If the\n  question is whether variability differs across groups (e.g., consistency of a biomarker),\n  Levene's test or Bartlett's test addresses variance heterogeneity directly.\n\n**Interpreting the output**\n\nIn the worked example, patients across three COPD severity groups (Mild, Moderate, Severe,\nfour patients each) had mean specialist visit counts of 5, 9, and 13, respectively. 
The\nbetween-group mean square is 64 and the within-group mean square is approximately 6.667,\nyielding F = 9.6 on df = (2, 9) with p ≈ 0.006. Applying Tukey's HSD post-hoc test, the\ncritical mean difference is approximately 5.10 visits. The Mild vs Severe comparison (8\nvisits) exceeds this threshold and is significant; the Mild vs Moderate and Moderate vs\nSevere comparisons (4 visits each) do not.\n\n*(1) Formal interpretation.* The F ratio of 9.6 indicates that the between-group mean square\nis 9.6 times the pooled within-group mean square. Under the null hypothesis that all three\npopulation means are equal, an F this extreme or larger on df = (2, 9) arises by chance in\napproximately 0.6% of samples (p ≈ 0.006). The omnibus test provides evidence against the\nnull but does not itself identify which pairs differ. The Tukey HSD post-hoc comparisons,\nwhich simultaneously control family-wise type I error across all three pairwise tests,\nidentify only the Mild vs Severe contrast as significantly different at the adjusted\nthreshold.\n\n*(2) Practical interpretation.* Patients with severe COPD averaged about 8 more specialist\nvisits over the 6-month window than patients with mild COPD, a gap large enough to survive\nthe Tukey multiplicity correction. The adjacent pairs (Mild vs Moderate and Moderate vs Severe)\nshow differences of 4 visits each that did not reach significance at n = 4 per group; the study\nis underpowered for those intermediate comparisons, though 4 visits in 6 months is not\nnecessarily clinically unimportant. Because this is an unadjusted observational comparison,\ndifferences in age, comorbidities, and care-seeking behavior across severity groups likely\ncontribute to the observed gradient. The ANOVA result is descriptive; the gradient may\npartially or fully reflect confounding rather than a direct effect of COPD severity on\nspecialist utilization.",
    "primary_category": "Inferential_Statistics",
    "tags": [
      "statistics",
      "primitive",
      "hypothesis-testing",
      "multiple-groups",
      "ANOVA",
      "F-test",
      "variance-decomposition",
      "post-hoc",
      "multiplicity"
    ],
    "applies_to_study_types": [
      "cohort_retrospective",
      "cohort_prospective",
      "rct",
      "cross_sectional",
      "descriptive_analysis",
      "claims_analysis",
      "ehr_study",
      "registry_study"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "primary",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1136/bmj.312.7044.1472",
        "url": "https://doi.org/10.1136/bmj.312.7044.1472",
        "citation_text": "Altman DG, Bland JM. Statistics Notes: Comparing several groups using analysis of variance. BMJ. 1996;312(7044):1472-1473.",
        "year": 1996,
        "authors_short": "Altman & Bland",
        "notes": "The canonical BMJ Statistics Notes entry on one-way ANOVA for clinical researchers. Covers the variance decomposition logic, the omnibus F-test, and the post-hoc comparison problem in plain language with a concrete medical example. Widely cited as the introductory reference for ANOVA in health research."
      },
      {
        "role": "explain",
        "doi": "10.3238/arztebl.2010.0343",
        "url": "https://doi.org/10.3238/arztebl.2010.0343",
        "citation_text": "du Prel JB, Röhrig B, Hommel G, Blettner M. Choosing Statistical Tests. Deutsches Ärzteblatt International. 2010;107(19):343-348.",
        "year": 2010,
        "authors_short": "du Prel et al.",
        "notes": "Systematic decision framework for selecting the appropriate test by data type and design, including the three-or-more-groups scenario (ANOVA vs Kruskal-Wallis). Provides the decision tree context within which ANOVA sits as the parametric multi-group test."
      },
      {
        "role": "demonstrate",
        "doi": "10.5334/irsp.82",
        "url": "https://doi.org/10.5334/irsp.82",
        "citation_text": "Delacre M, Lakens D, Leys C. Why psychologists should by default use Welch's t-test instead of Student's t-test. International Review of Social Psychology. 2017;30(1):92-101.",
        "year": 2017,
        "authors_short": "Delacre et al.",
        "notes": "Demonstrates via simulation that assuming equal variances leads to type I error inflation when variances differ — the same logic motivates Welch ANOVA as the default for multi-group comparisons when group sizes or variances are unequal. Directly motivates the Welch correction recommended in this entry."
      },
      {
        "role": "use",
        "doi": "10.1136/bmj.310.6973.170",
        "url": "https://doi.org/10.1136/bmj.310.6973.170",
        "citation_text": "Bland JM, Altman DG. Multiple significance tests: the Bonferroni method. BMJ. 1995;310(6973):170.",
        "year": 1995,
        "authors_short": "Bland & Altman",
        "notes": "Introduces the Bonferroni correction and its conservative nature in the context of multiple significance testing, providing the conceptual foundation for why post-hoc multiplicity control is required after a significant omnibus ANOVA. Read alongside the Tukey HSD implementation to understand why Tukey is preferred over Bonferroni for all-pairwise comparisons."
      }
    ],
    "plain_language_summary": "One-way ANOVA is a statistical test that asks whether the average outcome differs across three or more groups — for example, whether patients on Drug A, Drug B, and Drug C have different average healthcare costs. It works by comparing how spread out the group averages are (the \"between-group\" signal) against how much individual patients vary within each group (the \"within-group\" noise), and summarizes the comparison in a single number called the F statistic. A significant F only tells you that at least one group differs; you still need follow-up pairwise tests (called post-hoc comparisons) to find out which groups are actually different from each other. In real-world healthcare data, ANOVA is usually the starting point for exploration rather than the final answer, because it cannot separate treatment effects from the fact that different patients tend to choose different treatments.",
    "key_terms": [
      {
        "term": "between-group variance",
        "definition": "How much the group averages differ from the overall average — the \"signal\" in an ANOVA that you hope is larger than the noise inside each group."
      },
      {
        "term": "within-group variance",
        "definition": "How much individual patients scatter around their own group's average — the \"noise\" against which the between-group signal is measured; also called the residual or error variance."
      },
      {
        "term": "F statistic",
        "definition": "The ratio of between-group mean square to within-group mean square; a large F means the group averages are farther apart than chance alone would be expected to produce, given the scatter inside each group."
      },
      {
        "term": "post-hoc comparison",
        "definition": "A pairwise test run after a significant ANOVA to determine which specific pairs of groups differ, using a multiplicity correction (such as Tukey HSD) to avoid false positives from testing many pairs at once."
      },
      {
        "term": "omnibus test",
        "definition": "A single test covering all groups simultaneously — one-way ANOVA is omnibus because it asks \"any difference anywhere?\" rather than testing each pair separately; it controls the overall false-positive rate for the multi-group comparison."
      },
      {
        "term": "multiple comparisons",
        "definition": "The problem that arises when you test many hypotheses at once: if you compare 5 groups pairwise you run 10 tests, and 10 tests at alpha=0.05 each yield an expected 0.5 false positives (roughly a 40% chance of at least one, if the tests were independent) even if nothing is truly different."
      }
    ],
    "worked_example": {
      "scenario": "A health outcomes analyst at a regional payer is examining whether the number of specialist visits in a 6-month period differs across patients with mild, moderate, and severe chronic obstructive pulmonary disease (COPD) enrolled in the same health plan. Four patients from each severity tier are randomly selected from the claims database. The analyst wants to know whether any severity tier has a meaningfully different mean visit count before proceeding to a fully adjusted regression model. She runs a one-way ANOVA and manually walks through the variance decomposition to understand the F ratio.",
      "dataset": {
        "caption": "Number of specialist visits in 6 months by COPD severity tier (n=4 per group). All values are counts extracted directly from outpatient claims. Grand mean across all 12 patients = 9.",
        "columns": [
          "patient_id",
          "severity_tier",
          "specialist_visits"
        ],
        "rows": [
          [
            "P01",
            "Mild",
            2
          ],
          [
            "P02",
            "Mild",
            4
          ],
          [
            "P03",
            "Mild",
            6
          ],
          [
            "P04",
            "Mild",
            8
          ],
          [
            "P05",
            "Moderate",
            6
          ],
          [
            "P06",
            "Moderate",
            8
          ],
          [
            "P07",
            "Moderate",
            10
          ],
          [
            "P08",
            "Moderate",
            12
          ],
          [
            "P09",
            "Severe",
            10
          ],
          [
            "P10",
            "Severe",
            12
          ],
          [
            "P11",
            "Severe",
            14
          ],
          [
            "P12",
            "Severe",
            16
          ]
        ]
      },
      "steps": [
        "Step 1 — Compute group means. Mild: (2+4+6+8)/4 = 20/4 = 5. Moderate: (6+8+10+12)/4 = 36/4 = 9. Severe: (10+12+14+16)/4 = 52/4 = 13. Grand mean: (20+36+52)/12 = 108/12 = 9.",
        "Step 2 — Compute SS_between (between-group sum of squares). Each group contributes n_i*(group_mean - grand_mean)^2. Mild: 4*(5-9)^2 = 4*16 = 64. Moderate: 4*(9-9)^2 = 4*0 = 0. Severe: 4*(13-9)^2 = 4*16 = 64. SS_between = 64+0+64 = 128.",
        "Step 3 — Compute SS_within (within-group sum of squares) for each group. For Mild (mean=5): (2-5)^2+(4-5)^2+(6-5)^2+(8-5)^2 = 9+1+1+9 = 20. For Moderate (mean=9): (6-9)^2+(8-9)^2+(10-9)^2+(12-9)^2 = 9+1+1+9 = 20. For Severe (mean=13): (10-13)^2+(12-13)^2+(14-13)^2+(16-13)^2 = 9+1+1+9 = 20. SS_within = 20+20+20 = 60.",
        "Step 4 — Verify the decomposition. SS_total = SS_between + SS_within = 128+60 = 188. Cross-check by computing SS_total directly from the grand mean: (2-9)^2+(4-9)^2+(6-9)^2+(8-9)^2+(6-9)^2+(8-9)^2+(10-9)^2+(12-9)^2+(10-9)^2+(12-9)^2+(14-9)^2+(16-9)^2 = 49+25+9+1+9+1+1+9+1+9+25+49 = 188. Match confirmed.",
        "Step 5 — Compute degrees of freedom. df_between = k-1 = 3-1 = 2. df_within = N-k = 12-3 = 9.",
        "Step 6 — Compute mean squares. MS_between = SS_between/df_between = 128/2 = 64. MS_within = SS_within/df_within = 60/9 = 6.667.",
        "Step 7 — Compute the F statistic. F = MS_between/MS_within = 64/(60/9) = 64*9/60 = 576/60 = 9.6. Under the null hypothesis that all group means are equal, this F follows an F(2,9) distribution. The critical value at alpha=0.05 is F_crit(2,9) = 4.26. Since F=9.6 > 4.26, we reject the null: at least one severity tier has a different mean visit count.",
        "Step 8 — Post-hoc comparisons (Tukey HSD). With F significant, we compare all 3 pairs: Mild vs Moderate: mean difference = 5-9 = -4. Mild vs Severe: mean difference = 5-13 = -8. Moderate vs Severe: mean difference = 9-13 = -4. The Tukey HSD standard error for equal-n groups is sqrt(MS_within/n) = sqrt(6.667/4) = sqrt(1.667) = 1.291. Tukey critical value for k=3 groups, df=9 at alpha=0.05 is q*(3,9) = 3.95. HSD = 3.95*1.291 = 5.10. Mild vs Moderate: |difference| = 4 < 5.10, not significant. Mild vs Severe: |difference| = 8 > 5.10, significant. Moderate vs Severe: |difference| = 4 < 5.10, not significant."
      ],
      "result": "Group means: Mild=20/4=5, Moderate=36/4=9, Severe=52/4=13. Grand mean=108/12=9. SS_between=128, SS_within=60, SS_total=128+60=188. df_between=2, df_within=9. MS_between=128/2=64, MS_within=60/9=6.667. F=64/(60/9)=576/60=9.6 > F_crit(2,9)=4.26; reject the null. Tukey HSD post-hoc: only Mild vs Severe is significant (difference=8 > HSD=5.10); Mild vs Moderate and Moderate vs Severe do not reach significance at alpha=0.05. Interpretation: COPD severity is associated with specialist visit frequency, but the Mild vs Moderate and Moderate vs Severe contrasts are not individually detectable at this sample size. A larger sample or an adjusted regression model is needed before drawing clinical conclusions — and this unadjusted comparison may reflect patient case-mix differences as much as any direct effect of severity classification on care-seeking behavior."
    },
    "prerequisites": [
      "inferential-statistics-foundations",
      "parametric-vs-nonparametric-tests",
      "two-sample-t-test"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Standard one-way ANOVA (equal variances assumed)",
        "description": "The classic Fisher ANOVA using the pooled within-group variance estimate (MS_within). Assumes all k groups have the same population variance. Post-hoc Tukey HSD is the appropriate pairwise test. Use when group sizes are roughly equal and Levene's test does not indicate strong variance heterogeneity.",
        "edge_cases": [
          "When group sizes are unbalanced (strongly unequal n_i), the standard F-test is more sensitive to variance heterogeneity. Prefer Welch ANOVA or use Tukey-Kramer (the unequal-n extension of Tukey HSD) rather than plain Tukey HSD.",
          "If one group has very few observations (n_i < 5), that group's variance estimate is unstable. Consider collapsing rare categories or using a mixed model with a group random effect."
        ],
        "data_source_notes": "Claims and EHR: construct per-patient outcome totals within the analysis period; pass one row per patient into the ANOVA. Verify that each patient appears in exactly one group (no patient in both Drug A and Drug B arms). At large n (> 1,000 per group), the F-test is valid for inference on means even for moderately skewed outcomes."
      },
      {
        "name": "Welch ANOVA (heteroscedastic groups)",
        "description": "Uses group-specific variance estimates and Welch-Satterthwaite adjusted degrees of freedom, making the test robust to unequal variances. The appropriate post-hoc test is Games-Howell (does not assume equal variances or equal n). In R, use oneway.test(), which applies the Welch correction by default (var.equal=FALSE). In Python, use pingouin.welch_anova(). In SAS, use the WELCH option on the MEANS statement in PROC GLM, or model group-specific residual variances in PROC MIXED (REPEATED / GROUP=).",
        "edge_cases": [
          "At very small within-group n (< 5 per group), each group variance is estimated from only a few observations, so the Welch weights and the Welch-Satterthwaite denominator df are unstable and the approximation is unreliable. Prefer a permutation-based ANOVA in this setting.",
          "Games-Howell post-hoc is slightly conservative at equal variances but valid under unequal variances; it is the recommended default when Welch ANOVA is used."
        ],
        "data_source_notes": "Use Welch ANOVA when comparing groups that differ in sample size or expected variance — for example, comparing costs across disease severity tiers where severe patients tend to have both higher mean costs and higher variance. Welch ANOVA is the safer default in most HEOR applications where group variances are rarely assumed equal a priori."
      },
      {
        "name": "ANCOVA / regression-adjusted group means",
        "description": "Extends one-way ANOVA by adding covariates (age, sex, comorbidity index, baseline cost) as additional predictors in a linear model. The group effect is estimated as the adjusted mean difference holding covariates constant. LS means (least-squares means, also called estimated marginal means) from the fitted model are the covariate-adjusted group means. This is the standard approach for HEOR comparative analyses where unadjusted group differences are confounded.",
        "edge_cases": [
          "ANCOVA assumes linearity and additivity of covariate effects. Check residual plots; add interaction terms if group effects differ substantially across covariate levels.",
          "LS means interpretation requires careful communication: they are means at the covariate profile of an average patient, not the expected mean for any real patient subgroup."
        ],
        "data_source_notes": "Claims: include Charlson Comorbidity Index, age, sex, baseline resource utilization, and index year as covariates. EHR: add clinical covariates (lab values, functional status scores) alongside claims-based covariates. LS means from PROC GLM (SAS) or emmeans (R) provide the adjusted group mean estimates with pairwise comparisons."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "kruskal-wallis-test",
        "pros_of_this": "ANOVA directly estimates mean differences, which are interpretable on the original outcome scale and can be plugged into budget-impact models; Kruskal-Wallis produces a rank-based statistic with no direct mean-difference analog.",
        "cons_of_this": "Kruskal-Wallis does not assume normality or equal variances and is more robust at small n with skewed outcomes; ANOVA can have inflated type I error when the normality assumption is violated at small n.",
        "when_to_prefer": "Prefer ANOVA when n ≥ 20 per group, the outcome is approximately continuous, and mean differences are the target quantity for decision-making. Prefer Kruskal-Wallis for heavily skewed distributions at small n, ordinal outcomes, or as a sensitivity check alongside ANOVA."
      },
      {
        "compared_to": "two-sample-t-test",
        "pros_of_this": "ANOVA controls the family-wise type I error for k-group comparisons in a single test, whereas running k(k-1)/2 Welch t-tests without correction inflates the false-positive rate. For k=2 groups, ANOVA reduces to the pooled t-test (F = t²).",
        "cons_of_this": "When only two groups exist, Welch's t-test is simpler, directly provides the Welch correction by default, and produces a two-sided CI for the mean difference with no extra steps; the Welch ANOVA for k=2 is equivalent but adds unnecessary complexity.",
        "when_to_prefer": "Use one-way ANOVA when k ≥ 3; use Welch's t-test directly for k=2 groups."
      },
      {
        "compared_to": "parametric-vs-nonparametric-tests",
        "pros_of_this": "One-way ANOVA is the specific parametric multi-group test described in the parent entry's decision tree; this entry provides the full decomposition, implementation, and HEOR context that the parent entry sketches.",
        "cons_of_this": "The parent entry provides the broader decision framework for choosing between parametric and nonparametric tests across all data types and group counts; one-way ANOVA covers only the multi-group continuous parametric case.",
        "when_to_prefer": "Use this entry when implementing or reporting a multi-group mean comparison; consult the parent entry when deciding which test family is appropriate for the data type and design."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Construct per-patient outcome totals (total annual cost, visit count, or procedure count) within a defined analysis window; one row per patient is the unit of analysis. Group variable is the treatment line, drug class, or geographic region. At large n (> 1,000 per group), the F-test is valid for inference on means even for moderately skewed outcomes. For cost outcomes, report both the raw group means and a GLM-based mean ratio alongside the ANOVA; ANOVA alone is rarely sufficient for payer decision-making. Check for and document outliers (e.g., patients with > $1M in annual costs) and report a sensitivity analysis with and without extreme values.",
      "ehr": "Lab values, functional status scores, and continuous biomarkers are natural ANOVA outcomes. For longitudinal EHR data with multiple measurements per patient, standard ANOVA is invalid — use a repeated-measures ANOVA or mixed model. Informative visit patterns (sicker patients seen more often) can create biased samples; restrict to patients with complete measurement at a defined time point rather than using the most recent available value.",
      "registry": "Disease severity scores, PRO instruments, and adjudicated endpoints are cleanest ANOVA inputs. For registry data with hierarchical enrollment (sites, centers), include a site random effect or use cluster-robust standard errors; standard ANOVA on registry data with site-level clustering will underestimate the variance and overstate significance.",
      "primary": "Survey and PRO data often use Likert-type or bounded scales; verify that the scale intervals are treated as equal (necessary for mean-based comparisons) and that sufficient response categories exist to approximate continuous data. Report whether the analysis treats the outcome as continuous or ordinal, and if ordinal, justify the continuous approximation.",
      "linked": "Linked claims-EHR datasets typically have large n, so CLT protects the ANOVA F-test. However, linkage often introduces selection — patients who appear in both claims and EHR may differ from those appearing in only one source. Report the linkage rate and compare characteristics of linked vs unlinked patients before interpreting group mean differences."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\nfrom scipy import stats\nfrom statsmodels.stats.multicomp import pairwise_tukeyhsd\n\n# ── Dataset: specialist visits by COPD severity (n=4 per group) ──\nmild     = [2, 4, 6, 8]    # group mean = 5\nmoderate = [6, 8, 10, 12]  # group mean = 9\nsevere   = [10, 12, 14, 16] # group mean = 13\n\n# ── 1. One-way ANOVA (scipy: Fisher / equal-variance version) ──\nf_stat, p_val = stats.f_oneway(mild, moderate, severe)\nprint(f\"One-way ANOVA: F={f_stat:.4f}, p={p_val:.4f}\")\n# Expected: F=9.6, p=0.0057\n\n# ── 2. Manual SS decomposition (matches worked example) ──\nall_vals = mild + moderate + severe\ngrand_mean = np.mean(all_vals)\ngroups = [mild, moderate, severe]\ngroup_means = [np.mean(g) for g in groups]\nn_per = [len(g) for g in groups]\n\nss_between = sum(n * (gm - grand_mean)**2 for n, gm in zip(n_per, group_means))\nss_within  = sum(sum((y - gm)**2 for y in g) for g, gm in zip(groups, group_means))\nss_total   = sum((y - grand_mean)**2 for y in all_vals)\nprint(f\"\\nSS_between={ss_between:.1f}, SS_within={ss_within:.1f}, SS_total={ss_total:.1f}\")\nprint(f\"Decomposition check: {ss_between:.1f} + {ss_within:.1f} = {ss_between+ss_within:.1f}\")\n# Expected: 128.0 + 60.0 = 188.0\n\nk = len(groups); N = len(all_vals)\ndf_between = k - 1; df_within = N - k\nms_between = ss_between / df_between\nms_within  = ss_within  / df_within\nf_manual   = ms_between / ms_within\nprint(f\"MS_between={ms_between:.3f}, MS_within={ms_within:.3f}, F={f_manual:.4f}\")\n# Expected: MS_between=64.000, MS_within=6.667, F=9.6000\n\n# ── 3. Effect size: eta-squared and omega-squared ──\neta_sq  = ss_between / ss_total\nomega_sq = (ss_between - df_between * ms_within) / (ss_total + ms_within)\nprint(f\"\\nEta-squared = {eta_sq:.3f}, Omega-squared = {omega_sq:.3f}\")\n\n# ── 4. 
Tukey HSD post-hoc (pairwise comparisons with FWER control) ──\nimport pandas as pd\ndata_flat  = np.array(all_vals)\ngroups_lbl = ([\"Mild\"]*4 + [\"Moderate\"]*4 + [\"Severe\"]*4)\ntukey = pairwise_tukeyhsd(data_flat, groups_lbl, alpha=0.05)\nprint(\"\\nTukey HSD post-hoc:\")\nprint(tukey.summary())\n# Mild-Severe significant; Mild-Moderate and Moderate-Severe not significant\n\n# ── 5. Welch ANOVA (robust to unequal variances) ──\n# scipy does not implement Welch ANOVA natively; use pingouin if available\ntry:\n    import pingouin as pg\n    df_long = pd.DataFrame({\n        \"visits\": all_vals,\n        \"group\":  groups_lbl\n    })\n    welch_res = pg.welch_anova(data=df_long, dv=\"visits\", between=\"group\")\n    print(\"\\nWelch ANOVA (pingouin):\")\n    print(welch_res[[\"Source\", \"ddof1\", \"ddof2\", \"F\", \"p-unc\", \"np2\"]])\nexcept ImportError:\n    print(\"\\npingouin not installed — install with: pip install pingouin\")\n\n# ── 6. Levene's test for equal variances (diagnostic, not a decision gate) ──\nlevene_stat, levene_p = stats.levene(*groups)\nprint(f\"\\nLevene's test for equal variances: W={levene_stat:.3f}, p={levene_p:.4f}\")\nprint(\"(Diagnostic only — do not use to choose between Fisher and Welch ANOVA.)\")",
        "description": "One-way ANOVA and Tukey HSD post-hoc using scipy.stats.f_oneway and statsmodels\npairwise_tukeyhsd. Also demonstrates Welch ANOVA via pingouin and the manual SS decomposition\nmatching the beginner-layer worked example. No additional dependencies beyond scipy,\nstatsmodels, numpy, and (optionally) pingouin for Welch ANOVA. Uses the COPD severity\ndataset from the worked example (mild, moderate, severe; n=4 each).",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "# ── Dataset: specialist visits by COPD severity (n=4 per group) ──\nvisits <- c(2, 4, 6, 8,      # Mild\n            6, 8, 10, 12,    # Moderate\n            10, 12, 14, 16)  # Severe\ngroup  <- factor(rep(c(\"Mild\", \"Moderate\", \"Severe\"), each = 4),\n                 levels = c(\"Mild\", \"Moderate\", \"Severe\"))\n\n# ── 1. Fisher one-way ANOVA (assumes equal variances) ──\nanova_fit <- aov(visits ~ group)\ncat(\"Fisher one-way ANOVA (aov):\\n\")\nprint(summary(anova_fit))\n# Expected: F=9.6, p=0.0057\n\n# ── 2. Welch one-way ANOVA (DEFAULT in R's oneway.test; does NOT assume equal variances) ──\ncat(\"\\nWelch one-way ANOVA (oneway.test, var.equal=FALSE — the R default):\\n\")\nprint(oneway.test(visits ~ group))\n# var.equal=FALSE is the default; note the adjusted (non-integer) df for the denominator\n\ncat(\"\\nFisher version via oneway.test (var.equal=TRUE, equivalent to aov above):\\n\")\nprint(oneway.test(visits ~ group, var.equal = TRUE))\n\n# ── 3. Manual SS decomposition (matches worked example) ──\ngrand_mean  <- mean(visits)\ngroup_means <- tapply(visits, group, mean)\nn_per       <- tapply(visits, group, length)\n\nss_between <- sum(n_per * (group_means - grand_mean)^2)\nss_within  <- sum(tapply(seq_along(visits), group, function(idx) {\n  gm <- mean(visits[idx]); sum((visits[idx] - gm)^2)\n}))\nss_total   <- sum((visits - grand_mean)^2)\ndf_between <- length(levels(group)) - 1\ndf_within  <- length(visits) - length(levels(group))\nms_between <- ss_between / df_between\nms_within  <- ss_within  / df_within\nf_manual   <- ms_between / ms_within\n\ncat(sprintf(\"\\nManual: SS_between=%.1f, SS_within=%.1f, SS_total=%.1f\\n\",\n            ss_between, ss_within, ss_total))\ncat(sprintf(\"MS_between=%.3f, MS_within=%.3f, F=%.4f\\n\",\n            ms_between, ms_within, f_manual))\n\n# ── 4. 
Effect sizes ──\neta_sq  <- ss_between / ss_total\nomega_sq <- (ss_between - df_between * ms_within) / (ss_total + ms_within)\ncat(sprintf(\"Eta-squared=%.3f, Omega-squared=%.3f\\n\", eta_sq, omega_sq))\n\n# ── 5. Tukey HSD post-hoc (Fisher ANOVA; equal variances assumed) ──\ncat(\"\\nTukey HSD post-hoc:\\n\")\nprint(TukeyHSD(anova_fit))\n# Mild-Severe should be significant; Mild-Moderate and Moderate-Severe not significant\n\n# ── 6. Games-Howell post-hoc (for Welch ANOVA; does NOT assume equal variances) ──\n# Requires rstatix package\nif (requireNamespace(\"rstatix\", quietly = TRUE)) {\n  library(rstatix)\n  df_data <- data.frame(visits = visits, group = group)\n  cat(\"\\nGames-Howell post-hoc (use with Welch ANOVA):\\n\")\n  print(games_howell_test(df_data, visits ~ group))\n} else {\n  cat(\"\\nInstall rstatix for Games-Howell: install.packages('rstatix')\\n\")\n}\n\n# ── 7. Levene's test (diagnostic for variance heterogeneity) ──\nif (requireNamespace(\"car\", quietly = TRUE)) {\n  cat(\"\\nLevene's test (diagnostic only):\\n\")\n  print(car::leveneTest(visits ~ group))\n}",
        "description": "One-way ANOVA via aov() (Fisher, equal-variance) and oneway.test() (Welch, default in R),\nwith TukeyHSD post-hoc and Games-Howell post-hoc via rstatix. Demonstrates the manual SS\ndecomposition and eta-squared/omega-squared effect sizes using base R. Note: oneway.test()\nin R uses Welch ANOVA by default (var.equal=FALSE); to get the Fisher ANOVA use\nvar.equal=TRUE. Uses the COPD severity dataset from the worked example.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* ── Create dataset: specialist visits by COPD severity (n=4 per group) ── */\ndata work.copd;\n  input severity $ visits;\n  datalines;\nMild     2\nMild     4\nMild     6\nMild     8\nModerate  6\nModerate  8\nModerate 10\nModerate 12\nSevere   10\nSevere   12\nSevere   14\nSevere   16\n;\nrun;\n\n/* ── 1. PROC ANOVA (balanced, equal n per group only) ── */\nproc anova data=work.copd;\n  class severity;\n  model visits = severity;\n  /* ANOVA (not GLM) is appropriate here only because n=4 per group (balanced design) */\n  means severity / tukey;   /* Tukey HSD post-hoc pairwise comparisons */\nrun;\nquit;\n\n/* ── 2. PROC GLM: one-way ANOVA, Tukey HSD, LS means, and WELCH option ── */\nproc glm data=work.copd;\n  class severity;\n  model visits = severity;\n  /* Fisher one-way ANOVA F-test                                                       */\n  means severity / hovtest=levene;  /* Levene's test for equal variances (diagnostic)  */\n  means severity / tukey alpha=0.05; /* Tukey HSD pairwise comparisons (equal-var)    */\n  lsmeans severity / pdiff=all cl adjust=tukey; /* LS means with Tukey-adjusted CIs   */\n  /* WELCH option: Welch F-test robust to unequal variances                            */\n  means severity / welch;\nrun;\nquit;\n\n/* ── 3. 
Manual SS verification (PROC MEANS + DATA step) ── */\nproc means data=work.copd mean var n noprint;\n  class severity;\n  var visits;\n  output out=work.gstats mean=gm n=n var=gv;\nrun;\n\nproc means data=work.copd mean noprint;\n  var visits;\n  output out=work.grandmean mean=grand_mean;\nrun;\n\n/* SS_between = sum(n_i * (mean_i - grand_mean)^2) */\n/* SS_within  = sum((n_i - 1) * var_i)             */\ndata work.anova_manual;\n  set work.gstats(where=(_type_=1));\n  if _n_=1 then set work.grandmean(keep=grand_mean);\n  ss_between_i = n * (gm - grand_mean)**2;\n  ss_within_i  = (n - 1) * gv;\nrun;\n\nproc means data=work.anova_manual sum noprint;\n  var ss_between_i ss_within_i;\n  output out=work.anova_totals sum=ss_between ss_within;\nrun;\n\ndata work.anova_results;\n  set work.anova_totals;\n  k = 3; N = 12;\n  df_between = k - 1;\n  df_within  = N - k;\n  ms_between = ss_between / df_between;\n  ms_within  = ss_within  / df_within;\n  f_stat     = ms_between / ms_within;\n  ss_total   = ss_between + ss_within;\n  eta_sq     = ss_between / ss_total;\n  omega_sq   = (ss_between - df_between * ms_within) / (ss_total + ms_within);\n  put \"SS_between=\" ss_between \"SS_within=\" ss_within \"SS_total=\" ss_total;\n  put \"MS_between=\" ms_between \"MS_within=\" ms_within \"F=\" f_stat;\n  put \"Eta-squared=\" eta_sq \"Omega-squared=\" omega_sq;\nrun;",
        "description": "One-way ANOVA using PROC GLM (Fisher) and the WELCH option for Welch ANOVA. Tukey HSD\npost-hoc via the MEANS statement with TUKEY option and LS means via LSMEANS. PROC ANOVA\nis also shown for the balanced equal-n case (equivalent to PROC GLM for one-way). Levene's\ntest via PROC GLM HOVTEST option. Uses the COPD severity dataset from the worked example.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  A[Outcome: continuous<br/>Groups: k ≥ 3 independent] --> B{Independence<br/>assumption met?}\n  B -- No: repeated<br/>measures / clustering --> C[Mixed model / MMRM<br/>or cluster-robust SE]\n  B -- Yes --> D{Equal variances<br/>across groups?}\n  D -- Yes / unknown --> E[\"Fisher one-way ANOVA<br/>aov() / PROC GLM\"]\n  D -- No or unequal n --> F[\"Welch ANOVA<br/>oneway.test() / pingouin<br/>PROC GLM WELCH\"]\n  E --> G{Omnibus F<br/>significant?}\n  F --> G\n  G -- No --> H[Stop: no evidence of<br/>any group difference]\n  G -- Yes --> I{Post-hoc<br/>structure?}\n  I -- All pairwise<br/>equal variances --> J[Tukey HSD<br/>TukeyHSD / LSMEANS TUKEY]\n  I -- All pairwise<br/>unequal variances --> K[Games-Howell<br/>rstatix / SAS macro]\n  I -- Treat vs control<br/>only --> L[\"Dunnett's test<br/>PROC GLM DUNNETT\"]\n  I -- Pre-specified<br/>subset --> M[\"Bonferroni / Holm<br/>(p.adjust in R; PROC MULTTEST)\"]\n  J --> N[Report mean differences<br/>with 95% CI per pair<br/>and effect size omega-sq]\n  K --> N\n  L --> N\n  M --> N",
        "caption": "Decision flowchart for one-way ANOVA: independence check, equal-variance choice between Fisher and Welch, omnibus F interpretation, and post-hoc test selection by comparison structure. Tukey HSD is the default for all-pairwise comparisons under equal variances.",
        "alt_text": "Flowchart starting from \"continuous outcome, k>=3 independent groups\", branching on independence assumption, then variance homogeneity, through omnibus F, to four post-hoc options: Tukey HSD, Games-Howell, Dunnett, or Bonferroni/Holm, ending at reporting mean differences with CIs and effect size.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "requires",
        "target_slug": "parametric-vs-nonparametric-tests",
        "notes": "One-way ANOVA is the specific parametric multi-group test within the parametric-vs-nonparametric decision framework. The parent entry covers when to choose between test families; this entry implements the ANOVA branch for k >= 3 continuous outcomes."
      },
      {
        "relation_type": "requires",
        "target_slug": "inferential-statistics-foundations",
        "notes": "ANOVA builds directly on null hypothesis testing mechanics (F-distribution, p-values, type I and type II error), the sampling distribution of means, and the logic of distributional assumptions that are introduced in the foundations entry."
      },
      {
        "relation_type": "see_also",
        "target_slug": "kruskal-wallis-test",
        "notes": "Kruskal-Wallis H is the nonparametric rank-based alternative to one-way ANOVA for k >= 3 groups. Use Kruskal-Wallis when within-group distributions are severely non-normal at small n, when the outcome is ordinal, or as a sensitivity analysis alongside ANOVA. Dunn's test with Bonferroni or Holm correction is the Kruskal-Wallis post-hoc analog to Tukey HSD."
      },
      {
        "relation_type": "see_also",
        "target_slug": "two-sample-t-test",
        "notes": "One-way ANOVA with k=2 groups is algebraically equivalent to the pooled-variance t-test, with F = t^2 exactly. For two groups, Welch's t-test is simpler and directly provides the Welch correction; use ANOVA only when k >= 3. Understanding the two-group t-test first makes the ANOVA extension to multiple groups intuitive."
      },
      {
        "relation_type": "see_also",
        "target_slug": "welch-t-test",
        "notes": "Welch's t-test generalizes to Welch ANOVA for k >= 3 groups by the same principle: use group-specific variance estimates and the Welch-Satterthwaite df adjustment instead of pooling. Welch ANOVA is the recommended default when group variances may differ, just as Welch's t-test is the recommended default for two-group continuous comparisons."
      }
    ],
    "aliases": [
      "analysis of variance",
      "F-test for group means",
      "ANOVA",
      "one-way analysis of variance",
      "Welch ANOVA",
      "Welch F-test"
    ],
    "complexity": "foundational",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "outcome-algorithm-construction-rwe",
    "name": "Outcome Algorithm Construction",
    "short_definition": "The process of translating a clinical endpoint into a reproducible, code-based case-finding algorithm in claims/EHR data, and validating it against a gold standard to estimate positive predictive value, sensitivity, and specificity so that outcome misclassification can be quantified and corrected.",
    "long_description": "**Outcome algorithm construction** is the disciplined translation of a protocol-defined endpoint (e.g., \"acute myocardial\ninfarction,\" \"incident heart failure hospitalization,\" \"all-cause mortality\") into an executable, fully specified\ncase-finding rule over routinely collected data, plus the validation study that tells you how often that rule is right.\nAn algorithm is a tuple of decisions: which code systems and code lists (ICD-10-CM, CPT/HCPCS, NDC, LOINC, SNOMED),\nwhich care settings (inpatient principal vs any-position, emergency, outpatient), how many encounters and in what time\nwindow (the classic \"1 inpatient OR 2 outpatient ≥7 days apart\"), whether labs/vitals/meds are required to confirm, and\nhow repeated codes for the same true event are de-duplicated into one incident outcome. The algorithm is not credible\nuntil you attach its operating characteristics — **positive predictive value (PPV)**, **sensitivity**, and\n**specificity** — measured against chart review, registry adjudication, or a death index, because every downstream rate,\nhazard ratio, and ICER inherits the algorithm's measurement error.\n\n**Core conceptual distinction.** Three quantities are distinct and routinely confused. (1) *PPV* — among algorithm-flagged\npatients, the fraction who truly have the outcome — governs whether your numerator is contaminated by false positives.\n(2) *Sensitivity* — among true cases, the fraction the algorithm catches — governs how many events you miss. (3)\n*Specificity* — among true non-cases, the fraction correctly excluded — drives false positives in the much larger\nnon-diseased denominator and is the dominant driver of bias for rare outcomes. 
The critical, often-missed point:\n**non-differential outcome misclassification is not a small, ignorable bias toward the null for ratio measures when specificity\nis imperfect and the outcome is rare**: a tiny drop in specificity floods a rare-event numerator with false positives\nfrom the huge at-risk pool and can attenuate a ratio severely, and if that misclassification is *differential* by exposure (e.g., treated patients are\nsurveilled more and thus coded more), the bias can go in either direction and be of any magnitude. Construction therefore has two\nestimands behind it: the *measurement estimand* (PPV/sensitivity/specificity of the algorithm) and the *substantive\nestimand* (the rate or effect on the true outcome), connected by quantitative bias analysis.\n\n**Pros, cons, and trade-offs** (specific and comparative).\n- **High-PPV (\"specific\") algorithms vs high-sensitivity (\"sensitive\") algorithms:** A strict rule (e.g., inpatient\n  principal diagnosis + confirmatory troponin) maximizes PPV, minimizes false positives, and is the right default for\n  *effect estimation*, because a clean numerator protects ratio measures. Cost: it sacrifices sensitivity and undercounts\n  incidence, so it is wrong for *burden-of-illness / incidence* reporting. A broad rule (any-position diagnosis in any\n  setting) maximizes sensitivity for case-finding and surveillance but admits rule-out and historical codes. **Prefer\n  high-PPV** for comparative effectiveness/safety HRs; **prefer high-sensitivity** for screening, signal detection, or\n  when you will bias-correct with known sensitivity/specificity.\n- **vs naive use of a single diagnosis code:** A validated multi-criterion algorithm dramatically reduces misclassification\n  versus \"any ICD code = case,\" at the cost of code complexity and the need for a validation substudy. 
Always prefer the\n  validated algorithm; an unvalidated code count is indefensible to FDA/EMA.\n- **vs end-to-end chart adjudication of every potential event:** Adjudication (see `endpoint-adjudication-chart-review-rwe`)\n  is the gold standard but is expensive and infeasible at scale; algorithm construction + a *sampled* validation substudy\n  buys most of the accuracy at a fraction of the cost, and the PPV from the substudy lets you bias-correct the full cohort.\n\n**When to use** (decision rules). Any RWE study whose endpoint is derived from claims or EHR structured data: build the algorithm from\npre-specified code lists, lock it before looking at outcome-by-exposure associations, and validate a random sample\nagainst a gold standard (or import published operating characteristics for the *same* algorithm, setting, and era).\nReport the algorithm and its PPV/sensitivity in the protocol and the manuscript per RECORD/Sentinel norms.\n\n**When NOT to use — and when it is actively misleading or dangerous** (decision rules).\n- **Borrowing validation metrics across a different data source, code era, or population.** A PPV established in\n  Medicare FFS inpatient data does not transport to a commercial outpatient setting or across the ICD-9→ICD-10\n  transition; a borrowed PPV applied to the wrong base rate gives a falsely precise, biased correction.\n- **Imperfect specificity with a rare outcome and no bias correction.** Reporting a crude incidence or HR from a\n  sensitive algorithm without specificity-based correction can be badly biased; this is the dangerous case where\n  \"more cases\" silently means \"more false positives.\"\n- **Differential surveillance/coding by exposure.** When one arm is monitored or coded more intensively (new drug under\n  a REMS, sicker comparator seen more often), outcome ascertainment is differential and no simple non-differential\n  correction is valid — switch to blinded adjudication or a negative-control outcome to detect it.\n- 
**Prevalent codes masquerading as incident events.** Without a clean baseline washout, chronic-condition codes\n  (heart failure, cancer history \"Z\" codes) recur every visit and an \"incident outcome\" algorithm will count\n  follow-up of an old diagnosis as a new event — manufacturing immortal-time-like and ascertainment artifacts.\n\n**Data-source operational depth** (claims vs EHR vs registry vs linked).\n- **Claims (FFS vs MA):** The unit is the claim line with diagnosis position, place-of-service, and revenue code.\n  Inpatient *principal* diagnoses are far more specific than any-position or outpatient codes (which include rule-out and\n  historical coding). Require continuous medical enrollment so absence of a code is a true negative, not unobserved care.\n  Failure mode: **Medicare Advantage encounter data are incomplete and inconsistently submitted relative to fee-for-service\n  claims**, so person-time under MA can fabricate false negatives (missed events) and distort sensitivity — restrict to\n  FFS person-time or model the differential capture. Failure mode: **differential competing risks** — in elderly claims\n  cohorts, death removes people from the at-risk set before the outcome can be coded, and if death rates differ by\n  exposure the algorithm's observed incidence is distorted; pair the construction with a death index and a competing-risk\n  framing. Failure mode: **immortal time in procedure-anchored algorithms** — defining the outcome relative to a procedure\n  that itself requires survival builds in guaranteed event-free time.\n- **EHR:** Richer (labs, vitals, notes) so you can require confirmatory features (troponin for MI, ejection fraction for\n  HF), but capture is visit-driven and fragmented across systems; a patient who gets the event at an out-of-network\n  hospital is a false negative. 
Define the observation window explicitly and treat loss to follow-up as informative.\n  NLP-derived phenotypes (see `ehr-phenotyping-algorithms-rwe`) raise sensitivity but need their own validation.\n- **Registry:** Outcomes are often adjudicated (cancer stage, MI by universal definition) — the strongest source for the\n  *truth* — but pharmacy/exposure and out-of-registry events are weak; link to claims for completeness and to a death\n  index for mortality.\n- **Linked claims–EHR–vital records:** The ideal substrate (EHR confirmatory features + claims completeness + reliable\n  death) but linkage selects the linkable subset and introduces date discrepancies among service, fill, and death dates\n  that must be reconciled before the first-event date is set.\n\n**Worked claims example (PPV-adjusted incidence and bias correction).** Endpoint: incident hospitalized acute myocardial\ninfarction (AMI) among new initiators of a study drug in a Medicare FFS + commercial database. (1) *Code list and rule:*\nAMI = an inpatient claim with ICD-10-CM I21.x in the **principal** position; this high-PPV rule is chosen because the\nestimand is a comparative hazard ratio. (2) *Continuous enrollment:* require medical enrollment from index through the\nevent so absence of a claim is a true negative; **exclude MA-only person-time** because MA encounter capture is\nincomplete. (3) *Incident-event de-duplication:* require a 365-day clean baseline with no I21.x/I22.x and collapse\nsame-admission transfers and re-codes within a 30-day window into one event (see `acute-event-deduplication-window-rwe`\nand `hospitalization-transfer-collapse-rwe`) so a single infarction with a hospital transfer is not double-counted. 
(4)\n*First-event date:* admission date of the first qualifying hospitalization; censor at disenrollment, death, or end of data.\n(5) *Validation substudy:* pull charts for a random sample of n=200 algorithm-flagged admissions; suppose 174 are true\nAMIs, giving PPV = 174/200 = **0.87** (95% CI ≈ 0.82–0.91 by exact binomial). (6) *Bias-corrected count:* if the\nalgorithm flags 1,000 AMIs over 50,000 person-years, the estimated true count is 1,000 × 0.87 = **870 true AMIs**, so the\nPPV-corrected incidence is 870 / 50,000 = **17.4 per 1,000 person-years** versus the naive 20.0 per 1,000: the 13% of\nflagged events that are false positives are removed (the naive rate overstates the corrected rate by about 15%). (7) *Sensitivity correction:* if an independent chart-review of true AMIs yields sensitivity = 0.80,\nthe algorithm misses ~20% of events; under non-differential ascertainment the *exposed/unexposed* PPV- and\nsensitivity-corrected counts feed a corrected rate ratio, and the gap between the naive and corrected ratio is itself the\nquantitative-bias-analysis result that goes in the sensitivity-analysis table (see `misclassification-bias-correction-rwe`).\n\n**Interpreting the output**. The AMI algorithm flags 1,000 candidate events from 50,000 person-years; chart review of\n200 sampled positives confirms 174 true AMIs, giving PPV = 0.87. Naive incidence is 20.0 per 1,000 person-years;\nafter PPV correction (1,000 × 0.87 = 870 estimated true AMIs) the corrected incidence is 17.4 per 1,000\nperson-years, removing the 13% of flagged events that are false positives.\n\nFormal interpretation: the algorithm's output is not an AMI count; it is an AMI *candidate* count with a measured\nfalse-positive rate. PPV = 0.87 means 13 of every 100 flagged events are false positives; ignoring this inflates the\nnumerator, overstates any rate, and attenuates ratio measures toward the null under non-differential misclassification. 
The tradeoff is\nintentional: a strict (high-PPV) rule suits ratio estimands such as a hazard ratio, where false positives\nare few and non-differential under-ascertainment largely cancels between arms; a broad (high-sensitivity) rule suits incidence and burden estimation, where\nmissing events matters more than false-positive contamination. Choose the operating point before unblinding outcomes,\npre-specify it in the SAP, and present both the naive and PPV-corrected rates in the primary table.\n\nPractical interpretation: the 2.6-per-1,000-PY gap between naive and corrected rates is a tangible audit trail:\nif a safety signal appears in the naive analysis and shrinks substantially after correction, reviewers know the\nalgorithm's precision is load-bearing. Always propagate the uncertainty in PPV (its binomial confidence interval)\nthrough the correction via Monte Carlo draws or an analytic error formula.",
    "primary_category": "Outcome_Measure",
    "tags": [
      "outcome-algorithm",
      "case-finding",
      "computable-phenotype",
      "ppv",
      "sensitivity-specificity",
      "outcome-misclassification",
      "validation",
      "code-lists",
      "pharmacoepidemiology"
    ],
    "applies_to_study_types": [
      "claims_analysis",
      "ehr_study",
      "registry_linkage",
      "target_trial_emulation"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1002/pds.2321",
        "url": "https://doi.org/10.1002/pds.2321",
        "citation_text": "Carnahan RM, Moores KG. Mini-Sentinel's systematic reviews of validated methods for identifying health outcomes using administrative and claims data: methods and lessons learned. Pharmacoepidemiology and Drug Safety. 2012;21(S1):82-89.",
        "year": 2012,
        "authors_short": "Carnahan & Moores",
        "notes": "Foundational methods statement for the Mini-Sentinel program of validated outcome algorithms — how to specify code lists and report PPV/sensitivity/specificity for case-finding in administrative data."
      },
      {
        "role": "explain",
        "doi": "10.1002/pds.1200",
        "url": "https://doi.org/10.1002/pds.1200",
        "citation_text": "Schneeweiss S. Sensitivity analysis and external adjustment for unmeasured confounders in epidemiologic database studies of therapeutics. Pharmacoepidemiology and Drug Safety. 2006;15(5):291-303.",
        "year": 2006,
        "authors_short": "Schneeweiss",
        "notes": "Quantitative framework for external adjustment/sensitivity analysis whose machinery (sensitivity/specificity, bias parameters) underlies correcting outcome misclassification from an imperfect algorithm."
      },
      {
        "role": "explain",
        "doi": "10.1002/pds.4297",
        "url": "https://doi.org/10.1002/pds.4297",
        "citation_text": "Berger ML, Sox H, Willke RJ, et al. Good practices for real-world data studies of treatment and/or comparative effectiveness: recommendations from the joint ISPOR-ISPE Special Task Force on real-world evidence in health care decision making. Pharmacoepidemiology and Drug Safety. 2017;26(9):1033-1039.",
        "year": 2017,
        "authors_short": "Berger et al.",
        "notes": "Good-practice expectations for pre-specifying and validating outcome definitions in regulatory- and HTA-facing RWE."
      },
      {
        "role": "demonstrate",
        "doi": "10.1002/pds.2313",
        "url": "https://doi.org/10.1002/pds.2313",
        "citation_text": "Saczynski JS, Andrade SE, Harrold LR, et al. A systematic review of validated methods for identifying heart failure using administrative data. Pharmacoepidemiology and Drug Safety. 2012;21(S1):129-140.",
        "year": 2012,
        "authors_short": "Saczynski et al.",
        "notes": "Worked example of algorithm operating characteristics — reports the PPV/sensitivity of competing heart-failure case-finding rules in claims, illustrating the high-PPV vs high-sensitivity trade-off."
      },
      {
        "role": "use",
        "doi": "10.1002/pds.2312",
        "url": "https://doi.org/10.1002/pds.2312",
        "citation_text": "Andrade SE, Harrold LR, Tjia J, et al. A systematic review of validated methods for identifying cerebrovascular accident or transient ischemic attack using administrative data. Pharmacoepidemiology and Drug Safety. 2012;21(S1):100-128.",
        "year": 2012,
        "authors_short": "Andrade et al.",
        "notes": "Applied catalogue of validated stroke/TIA algorithms and their PPVs by setting and diagnosis position — a template for borrowing (and the caveats on transporting) operating characteristics."
      }
    ],
    "plain_language_summary": "An outcome algorithm is a precise written rule that tells a computer how to search a claims database and decide which patients had a specific medical event — for example, a heart attack. Because the database never saw the patient in person, the rule uses billing codes, the type of hospital visit, and where the code appears on the claim to make that determination. The algorithm can be wrong in both directions (flagging patients who never had the event, or missing patients who did), so researchers check its accuracy against actual medical records and report that check as part of the study.",
    "key_terms": [
      {
        "term": "ICD-10-CM code",
        "definition": "A standardized billing code — like I21.0 — that hospitals and clinics submit on claims to describe a patient's diagnosis."
      },
      {
        "term": "diagnosis position",
        "definition": "Where a diagnosis code sits on a claim: the principal position means it was the main reason for the hospital stay, while secondary positions cover additional conditions."
      },
      {
        "term": "care setting",
        "definition": "The type of place where care was delivered — inpatient (overnight hospital), outpatient clinic, or emergency department."
      },
      {
        "term": "positive predictive value (PPV)",
        "definition": "Of all the patients your algorithm flagged as having the event, the fraction who actually had it when a doctor checked the medical record."
      },
      {
        "term": "chart review",
        "definition": "A trained reviewer opens the actual medical record for a sample of flagged patients and decides whether each one truly had the event — this is the gold-standard check."
      }
    ],
    "worked_example": {
      "scenario": "You are building a rule to find patients who had an acute heart attack (acute MI) in a claims database, so you can count how often this event happens. Your rule says: a patient counts as having a heart attack only if they have an inpatient hospital claim where the heart-attack code (ICD-10-CM I21) is in the principal diagnosis position. The table below shows five candidate patients. After the algorithm runs, you check a sample of flagged patients against their real medical records to see how often the rule was correct.",
      "dataset": {
        "caption": "Five candidate patients from the claims diagnosis table. Each row is one claim line.",
        "columns": [
          "person_id",
          "claim_date",
          "care_setting",
          "dx_position",
          "icd10",
          "algorithm_flags?"
        ],
        "rows": [
          [
            1001,
            "2023-03-15",
            "inpatient",
            "principal",
            "I21.0",
            "YES"
          ],
          [
            1002,
            "2023-04-02",
            "outpatient",
            "principal",
            "I21.9",
            "NO — outpatient setting"
          ],
          [
            1003,
            "2023-05-10",
            "inpatient",
            "secondary",
            "I21.0",
            "NO — secondary position only"
          ],
          [
            1004,
            "2023-06-20",
            "inpatient",
            "principal",
            "I25.10",
            "NO — I25 is chronic disease, not acute MI"
          ],
          [
            1005,
            "2023-07-08",
            "inpatient",
            "principal",
            "I21.1",
            "YES"
          ]
        ]
      },
      "steps": [
        "The rule requires THREE things to all be true at once: care_setting = inpatient, dx_position = principal, and the code starts with I21.",
        "Patient 1001 passes all three checks — inpatient stay, principal position, I21.0 — so the algorithm flags them as a heart attack.",
        "Patient 1002 fails the setting check: the claim is outpatient, so a code appearing there may just mean the doctor was ruling out a heart attack, not confirming one. Not flagged.",
        "Patient 1003 fails the position check: the I21.0 code is in a secondary slot, meaning it was a side condition, not the primary reason for admission. Not flagged.",
        "Patient 1004 fails the code check: I25.10 is chronic coronary artery disease — a different condition entirely. Not flagged.",
        "Patient 1005 passes all three checks — inpatient, principal, I21.1 — so they are also flagged.",
        "The algorithm flags 2 patients (1001 and 1005). A researcher then pulls the medical records for both. Suppose both records confirm a real heart attack. That gives PPV = 2 true cases / 2 flagged = 1.00 for this tiny sample. In a real study with hundreds of flagged patients, PPV is typically around 0.85–0.90 for this kind of strict rule, meaning roughly 1 in 8 flagged patients did not actually have a heart attack."
      ],
      "result": "The algorithm flags 2 of 5 candidates (patients 1001 and 1005) by requiring inpatient principal-position I21 codes. The three patients who were not flagged were excluded for the correct reasons — wrong setting, wrong position, or wrong code. When a researcher checks real medical records for a random sample of flagged patients (the PPV validation step), they can calculate what fraction were true heart attacks and use that number to adjust the final event count."
    },
    "prerequisites": [
      "sensitivity-specificity-rwe",
      "ppv-npv-rwe",
      "diagnosis-phenotype-algorithm-1ip-2op-time-window-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "High-PPV (specific) algorithm",
        "description": "Strict multi-criterion rule (e.g., inpatient principal-position diagnosis plus a confirmatory lab or procedure) chosen to minimize false positives. Default for comparative effect estimation, where a clean numerator protects ratio measures.",
        "edge_cases": [
          "Undercounts true incidence; do not use this rule when the deliverable is burden-of-illness or incidence reporting.",
          "Confirmatory labs may be missing in claims-only data, silently dropping true cases that lack a coded lab."
        ],
        "data_source_notes": "claims: restrict to inpatient principal diagnosis + procedure/CPT confirmation; EHR: require the confirmatory lab/vital (troponin, EF) at the event encounter."
      },
      {
        "name": "High-sensitivity (broad) algorithm",
        "description": "Any-position diagnosis in any care setting to maximize case capture for surveillance, screening, or when the plan is to bias-correct with known sensitivity/specificity.",
        "edge_cases": [
          "Admits rule-out, historical (\"status/history of\") and chronic recurring codes; pair with a baseline washout to enforce incident events.",
          "Imperfect specificity over a large non-diseased denominator inflates false positives for rare outcomes."
        ],
        "data_source_notes": "claims: include outpatient and any-position codes but exclude Z-codes (history of); EHR: add NLP-derived mentions, then validate the NLP layer separately."
      },
      {
        "name": "Algorithm with PPV/sensitivity validation substudy",
        "description": "Construct the algorithm, draw a random sample of flagged (and, for sensitivity, of charted true) cases, and estimate PPV/sensitivity/specificity against chart review or registry adjudication; carry the estimates into quantitative bias analysis.",
        "edge_cases": [
          "Sampling only flagged cases yields PPV but not sensitivity/specificity, which require a gold-standard denominator.",
          "Validation metrics do not transport across data source, code era (ICD-9 vs ICD-10), or population base rate."
        ],
        "data_source_notes": "registry/linked: the registry or death index is the gold standard; claims-only studies often import published operating characteristics for the identical algorithm and setting."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Naive single-diagnosis-code outcome definition",
        "pros_of_this": "A validated multi-criterion algorithm with reported PPV/sensitivity sharply reduces outcome misclassification and is defensible to regulators; the operating characteristics enable quantitative bias correction.",
        "cons_of_this": "Requires pre-specified code lists and a validation substudy or transportable published metrics; more analytic and documentation effort.",
        "when_to_prefer": "Always, for any regulatory- or HTA-facing endpoint derived from structured claims/EHR data."
      },
      {
        "compared_to": "Full chart adjudication of every candidate event",
        "pros_of_this": "Scales to the entire cohort at low cost; a sampled validation substudy recovers most of the accuracy and yields a PPV for correcting the full count.",
        "cons_of_this": "Residual misclassification remains; adjudication is the true gold standard for the substudy and for differential-ascertainment settings.",
        "when_to_prefer": "Large databases where adjudicating every event is infeasible and ascertainment is plausibly non-differential."
      },
      {
        "compared_to": "High-sensitivity algorithm without bias correction",
        "pros_of_this": "A high-PPV rule protects ratio estimates directly; correcting a sensitive rule requires reliable, transportable specificity that is hard to obtain.",
        "cons_of_this": "Undercounts incidence; wrong when absolute burden is the deliverable.",
        "when_to_prefer": "Comparative hazard/rate-ratio estimands where false positives are the greater threat."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Outcome = claim lines with the locked code list, diagnosis position, place-of-service, and care setting. Require continuous medical enrollment so absence of a code is a true negative; exclude MA-only person-time where encounter capture is incomplete. Enforce a baseline washout for incident events and collapse same-event re-codes/transfers within a de-duplication window. Validate PPV on a random chart sample; correct counts as observed x PPV.",
      "ehr": "Outcome = structured diagnoses plus required confirmatory labs/vitals at the event encounter; capture is visit-driven so out-of-system events are false negatives. Define observation windows explicitly. NLP-derived phenotypes raise sensitivity but need separate validation.",
      "registry": "Outcomes are often adjudicated (the gold standard for the substudy) but exposure and out-of-registry events are weak; link to claims for completeness and to a death index for mortality.",
      "linked": "Linked claims-EHR-vital-records is the ideal substrate (confirmatory features + completeness + reliable death) but introduces linkage selection and service/fill/death date discrepancies that must be reconciled before the first-event date is set."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\nimport numpy as np\n\nAMI_CODES = (\"I21\", \"I22\")        # locked ICD-10-CM prefixes for acute MI\nWASHOUT_DAYS = 365                # clean baseline that makes the event incident\nDEDUP_DAYS = 30                   # collapse same-event re-codes / transfers into one event\n\ndef build_outcome(dx, enroll, index):\n    # High-PPV rule: inpatient PRINCIPAL-position AMI code.\n    flagged = dx[(dx[\"care_setting\"] == \"inpatient\") &\n                 (dx[\"dx_position\"] == \"principal\") &\n                 (dx[\"icd10\"].str.startswith(AMI_CODES))].copy()\n    flagged = flagged.merge(index, on=\"person_id\")\n\n    # Incident: drop people with the same code in the WASHOUT before index; keep events after index only.\n    prior = flagged[flagged[\"claim_date\"] < flagged[\"index_date\"]]\n    prevalent = prior.loc[prior[\"claim_date\"] >= prior[\"index_date\"] -\n                          pd.Timedelta(days=WASHOUT_DAYS), \"person_id\"].unique()\n    post = flagged[(flagged[\"claim_date\"] >= flagged[\"index_date\"]) &\n                   (~flagged[\"person_id\"].isin(prevalent))].sort_values([\"person_id\", \"claim_date\"])\n\n    # First qualifying admission = the incident event date; collapse re-codes within DEDUP_DAYS.\n    first = post.groupby(\"person_id\", as_index=False).first().rename(columns={\"claim_date\": \"event_date\"})\n\n    # Observable, FFS-only follow-up: censor MA-only spans so a missing code is a true negative.\n    e = enroll[~enroll[\"ma_only\"]].merge(index, on=\"person_id\")\n    e[\"pt_days\"] = (np.minimum(e[\"enroll_end\"], e[\"index_date\"] + pd.Timedelta(days=365)) -\n                    np.maximum(e[\"enroll_start\"], e[\"index_date\"])).dt.days.clip(lower=0)\n    person_years = e.groupby(\"person_id\")[\"pt_days\"].sum().sum() / 365.25\n    return first[[\"person_id\", \"event_date\"]], person_years\n\ndef ppv_corrected_incidence(events, person_years, chart):\n    ppv = 
chart[\"true_case\"].mean()                       # PPV among flagged-and-charted events\n    n_obs = len(events)\n    n_true = n_obs * ppv                                  # bias-corrected true count = observed x PPV\n    return {\"ppv\": ppv,\n            \"naive_rate_per_1000_py\": 1000 * n_obs / person_years,\n            \"corrected_rate_per_1000_py\": 1000 * n_true / person_years}",
        "description": "Build an incident high-PPV outcome flag from claims, then compute PPV-corrected incidence. Required inputs (already\ncleaned and de-duplicated):\n  dx     : diagnosis lines -> person_id, claim_date (datetime), icd10 (str), dx_position ('principal'/'secondary'),\n           care_setting ('inpatient'/'outpatient'/'er')\n  enroll : enrollment spans -> person_id, enroll_start, enroll_end, ma_only (bool)  # ma_only person-time lacks FFS claims\n  index  : cohort entry    -> person_id, index_date (datetime)  # follow-up start (e.g., treatment initiation)\n  chart  : validation sample -> person_id, claim_date, true_case (0/1)  # gold-standard adjudication of flagged events\nReturns one incident-event row per person plus a PPV-corrected incidence rate. The algorithm (codes, position, setting,\nwashout, dedup window) is defined ONCE here and locked before any outcome-by-exposure analysis.",
        "dependencies": [
          "pandas",
          "numpy"
        ],
        "source_citations": [
          "carnahan-2012"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\nAMI_CODES   <- \"^I2[12]\"   # locked ICD-10-CM regex for acute MI\nWASHOUT     <- 365L\nDEDUP_DAYS  <- 30L\n\nbuild_outcome <- function(dx, enroll, index) {\n  setDT(dx); setDT(enroll); setDT(index)\n  flagged <- dx[care_setting == \"inpatient\" & dx_position == \"principal\" &\n                grepl(AMI_CODES, icd10)]\n  flagged <- merge(flagged, index, by = \"person_id\")\n\n  # Incident: exclude prevalent (same code in washout before index); keep post-index events.\n  prevalent <- flagged[claim_date < index_date & claim_date >= index_date - WASHOUT,\n                       unique(person_id)]\n  post <- flagged[claim_date >= index_date & !(person_id %chin% prevalent)]\n  setorder(post, person_id, claim_date)\n  first <- post[, .(event_date = claim_date[1L]), by = person_id]   # first admission, dedup by taking earliest\n\n  # FFS-only observable follow-up so a missing code is a true negative.\n  e <- merge(enroll[ma_only == FALSE], index, by = \"person_id\")\n  e[, pt_days := as.numeric(pmin(enroll_end, index_date + 365) -\n                            pmax(enroll_start, index_date))]\n  e[pt_days < 0, pt_days := 0]\n  person_years <- sum(e$pt_days) / 365.25\n  list(events = first, person_years = person_years)\n}\n\nppv_corrected_incidence <- function(events, person_years, chart) {\n  ppv    <- mean(chart$true_case)                  # PPV among flagged-and-charted events\n  n_obs  <- nrow(events)\n  n_true <- n_obs * ppv                            # corrected true count = observed x PPV\n  list(ppv = ppv,\n       naive_rate_per_1000_py     = 1000 * n_obs  / person_years,\n       corrected_rate_per_1000_py = 1000 * n_true / person_years)\n}",
        "description": "Same incident high-PPV AMI algorithm and PPV-corrected incidence in R with data.table. Inputs mirror the Python version:\n  dx     : person_id, claim_date (Date), icd10, dx_position, care_setting\n  enroll : person_id, enroll_start, enroll_end, ma_only (logical)\n  index  : person_id, index_date (Date)\n  chart  : person_id, claim_date, true_case (0/1)  -- gold-standard adjudication of a random sample of flagged events",
        "dependencies": [
          "data.table"
        ],
        "source_citations": [
          "carnahan-2012"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "%let washout = 365;\n\n/* High-PPV rule: inpatient PRINCIPAL-position AMI (ICD-10-CM I21/I22), restricted to post-index incident events. */\nproc sql;\n  create table flagged as\n  select d.person_id, d.claim_date, i.index_date\n  from work.dx d inner join work.index i on d.person_id = i.person_id\n  where d.care_setting = 'inpatient'\n    and d.dx_position  = 'principal'\n    and (d.icd10 like 'I21%' or d.icd10 like 'I22%');\n\n  /* Prevalent (non-incident): same code in the washout window before index. */\n  create table prevalent as\n  select distinct person_id from flagged\n  where claim_date < index_date and claim_date >= index_date - &washout;\n\n  /* First post-index admission = incident event date (de-duplicates re-codes by taking the minimum). */\n  create table events as\n  select f.person_id, min(f.claim_date) as event_date format=date9.\n  from flagged f\n  where f.claim_date >= f.index_date\n    and f.person_id not in (select person_id from prevalent)\n  group by f.person_id;\n\n  /* FFS-only observable follow-up (1 yr cap) so an absent code is a true negative. */\n  create table py as\n  select sum( max(0, ( min(e.enroll_end, i.index_date + 365)\n                     - max(e.enroll_start, i.index_date) )) ) / 365.25 as person_years\n  from work.enroll e inner join work.index i on e.person_id = i.person_id\n  where e.ma_only = 0;\nquit;\n\n/* PPV with exact binomial CI from the chart-validation sample. */\nproc freq data=work.chart;\n  tables true_case / binomial(level='1') alpha=0.05;\n  output out=ppv_out binomial;\nrun;\n\n/* PPV-corrected incidence: corrected true count = observed flagged events x PPV. 
*/\nproc sql;\n  create table incidence as\n  select (select count(*) from events)                          as n_observed,\n         p._BIN_                                                 as ppv,\n         calculated n_observed * calculated ppv                 as n_corrected,\n         1000 * calculated n_observed / (select person_years from py) as naive_rate_per_1000py,\n         1000 * calculated n_corrected / (select person_years from py) as corrected_rate_per_1000py\n  from ppv_out p;\nquit;",
        "description": "Incident high-PPV AMI algorithm, PPV-corrected incidence, and exact binomial CI for PPV in SAS. Required input datasets\n(post data-management):\n  work.dx     : person_id, claim_date, icd10, dx_position ('principal'/'secondary'), care_setting\n  work.enroll : person_id, enroll_start, enroll_end, ma_only (0/1)\n  work.index  : person_id, index_date\n  work.chart  : person_id, true_case (0/1)  -- adjudication of a random sample of flagged events\nPROC FREQ binomial gives the exact (Clopper-Pearson) CI for PPV; the corrected count is observed x PPV.",
        "dependencies": [],
        "source_citations": [
          "carnahan-2012"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Estimand[Protocol endpoint + estimand<br/>e.g. incident hospitalized AMI] --> Spec[Specify algorithm: code lists,<br/>dx position, care setting, time window]\n  Spec --> Lock[Lock algorithm BEFORE<br/>outcome-by-exposure analysis]\n  Lock --> Apply[Apply to claims/EHR:<br/>washout for incidence + dedup re-codes]\n  Apply --> Flag[Flagged events]\n  Flag --> Val{Validate vs gold standard<br/>chart / registry / death index}\n  Val -->|PPV among flagged| PPV[PPV]\n  Val -->|caught among true cases| Sens[Sensitivity / Specificity]\n  PPV --> Correct[Quantitative bias analysis:<br/>corrected count = observed x PPV]\n  Sens --> Correct\n  Correct --> Report[Report algorithm, operating<br/>characteristics, corrected estimate]",
        "caption": "Outcome algorithm construction as a build-then-validate-then-correct loop. The algorithm is specified from the estimand and locked before any exposure-outcome look; operating characteristics from a gold-standard substudy drive quantitative bias correction of the final estimate.",
        "alt_text": "Flowchart from protocol endpoint to algorithm specification, locking, application with washout and de-duplication, validation against a gold standard yielding PPV and sensitivity/specificity, quantitative bias correction, and reporting.",
        "source_type": "illustrative",
        "source_citations": [
          "carnahan-2012"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  subgraph Choice[Algorithm calibration]\n    Strict[High-PPV / specific rule<br/>inpatient principal + confirmation] -->|protects ratio measures<br/>undercounts incidence| Effect[Effect estimation<br/>HR / RR]\n    Broad[High-sensitivity / broad rule<br/>any-position, any setting] -->|catches more cases<br/>admits false positives| Burden[Incidence / surveillance<br/>then bias-correct]\n  end",
        "caption": "Choosing the operating point. A specific (high-PPV) algorithm is the default for comparative effect estimation; a sensitive (broad) algorithm suits incidence/surveillance but must be specificity-corrected for rare outcomes.",
        "alt_text": "Diagram contrasting a high-PPV strict algorithm routed to effect estimation against a high-sensitivity broad algorithm routed to incidence and surveillance with bias correction.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "part_of",
        "target_slug": "claims-outcome-algorithm-ppv-sensitivity-rwe",
        "notes": "The claims-specific PPV/sensitivity validation entry is the detailed operational realization of this construction process in administrative data."
      },
      {
        "relation_type": "used_with",
        "target_slug": "ehr-phenotyping-algorithms-rwe",
        "notes": "EHR computable phenotypes are outcome algorithms built on structured plus NLP-derived features; each phenotype layer needs its own validation."
      },
      {
        "relation_type": "used_with",
        "target_slug": "diagnosis-phenotype-algorithm-1ip-2op-time-window-rwe",
        "notes": "The 1-inpatient-OR-2-outpatient time-window rule is the canonical claims algorithm template instantiated here."
      },
      {
        "relation_type": "used_with",
        "target_slug": "misclassification-bias-correction-rwe",
        "notes": "PPV/sensitivity/specificity from the validation substudy feed quantitative bias analysis to correct the substantive estimate."
      },
      {
        "relation_type": "used_with",
        "target_slug": "endpoint-adjudication-chart-review-rwe",
        "notes": "Chart review/adjudication is the gold standard against which the algorithm's operating characteristics are measured."
      },
      {
        "relation_type": "used_with",
        "target_slug": "acute-event-deduplication-window-rwe",
        "notes": "De-duplication windows collapse repeated codes for one true event into a single incident outcome during construction."
      },
      {
        "relation_type": "used_with",
        "target_slug": "hospitalization-transfer-collapse-rwe",
        "notes": "Transfer-collapse logic prevents a single hospitalization with transfers from being counted as multiple events."
      },
      {
        "relation_type": "see_also",
        "target_slug": "mortality-source-hierarchy-rwe",
        "notes": "Mortality endpoints are a special case requiring a source hierarchy (death index vs claims vs EHR) rather than a diagnosis-code algorithm."
      },
      {
        "relation_type": "see_also",
        "target_slug": "composite-endpoint-construction-rwe",
        "notes": "Composite endpoints combine multiple component algorithms, each of which must be constructed and validated separately."
      },
      {
        "relation_type": "see_also",
        "target_slug": "safety-signal-case-definition-rwe",
        "notes": "Safety case definitions are high-sensitivity outcome algorithms tuned for signal detection rather than effect estimation."
      }
    ],
    "aliases": [
      "outcome algorithm",
      "case-finding algorithm",
      "computable phenotype",
      "outcome definition validation",
      "endpoint algorithm",
      "claims outcome definition"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "overlap-weights-modern-ps-weighting",
    "name": "Overlap Weights and Modern Propensity Weighting",
    "short_definition": "A set of modern propensity-score weighting approaches — overlap weights, entropy balancing, and covariate balancing propensity scores — that assign bounded, trimming-free weights by concentrating statistical mass on patients near clinical equipoise; overlap weights (treated weight 1 minus e(X), control weight e(X)) target the average treatment effect in the overlap population (ATO), an estimand distinct from the ATE or ATT that must be explicitly justified in regulatory and HTA submissions.",
    "long_description": "**Why inverse-probability weighting explodes at the propensity score tails**\n\nStandard inverse-probability-of-treatment weighting (IPTW) assigns weight 1/e(X) to each\ntreated patient and 1/(1 minus e(X)) to each control patient, where e(X) is the estimated\npropensity score. When a patient has a propensity score close to one — their baseline covariates\nstrongly predict that they receive the treatment — the control-arm weight 1/(1 minus e(X))\nbecomes enormous. A control patient with e(X) = 0.9 receives a weight of 1/(1 minus 0.9) = 10,\nwhile a treated patient with the same propensity score receives only 1/0.9 ≈ 1.11 — a nine-fold\ndisparity arising purely from which arm the patient happened to receive. This is not a rounding\nartifact; it is structural. In real high-dimensional claims studies, propensity scores near zero\nand one are common when the two drugs serve meaningfully different patient populations, and a\nhandful of extreme patients can dominate the entire pseudo-population. Analysts have traditionally\nresponded with ad hoc weight truncation (clipping at the 1st/99th or 5th/95th percentile), but\nsuch thresholds are arbitrary, create specification bias if chosen after seeing results, and must\nbe pre-specified to be credible in a regulatory or HTA submission. Even after truncation, variance\ninflation from near-extreme weights can be substantial, and the trimmed subjects represent a\nsilent change in the estimand population.\n\n**Overlap weights: definition and the bounded-weight property**\n\nLi, Morgan and Zaslavsky (2018) proposed assigning each patient a weight equal to the probability\nthat the patient would have been assigned to the opposite arm: treated patients receive weight\nw = 1 minus e(X) and control patients receive weight w = e(X). 
These are called overlap weights\nbecause they concentrate statistical mass in the region of propensity score overlap: the patients\nwhose scores are intermediate, indicating genuine clinical uncertainty about which treatment they\nwould receive. Overlap weights have two key properties. First, all weights are strictly bounded\nwithin the open unit interval (0, 1); no patient can receive a weight exceeding one, and no\ntrimming decision is ever needed. Second, within the class of balancing weights, overlap weights\ngive the smallest asymptotic variance of the weighted estimator when the outcome variance is\nhomoscedastic, an optimality result proved by Li et al. Concretely, the overlap tilting function\nis e(X) times (1 minus e(X)), which peaks at e(X) = 0.5 (perfect equipoise) and\ntapers smoothly to zero at both extremes. When the propensity score is estimated by\nmaximum-likelihood logistic regression, exact mean balance holds as an algebraic identity\nfor any covariates whose means appear in the PS model: the weighted mean of each such covariate\nis identical in the treated and control arms after overlap weighting, without iterative\noptimization. This finite-sample balance guarantee differentiates overlap weights from IPTW, where\nbalance is an asymptotic property that may fail badly in finite samples with near-positivity\nviolations.\n\n**The ATO estimand: clinical equipoise, not the full eligible population**\n\nOverlap weights do not estimate the average treatment effect in the full eligible population (ATE)\nor in the treated subpopulation (ATT). They target the average treatment effect in the overlap\npopulation (ATO): the smoothly weighted average effect in patients for whom clinical choice\ncould plausibly go either way, indexed by w = e(X) times (1 minus e(X)). 
This population is not\na fixed clinical subgroup; it is a smooth reweighting defined by the propensity score model, which\nmeans its composition changes if the PS model changes. The ATO is often the most policy-relevant\nestimand for comparative effectiveness research in therapeutic areas where two drugs are genuine\nalternatives for the same indication and clinicians choose based on patient-level factors the\nanalyst has measured: two antidiabetic drugs used interchangeably across the glycemic spectrum;\ntwo biologics in inflammatory bowel disease with largely overlapping eligibility; two anticoagulants\nwith similar label populations. The ATO is NOT appropriate in several settings. Regulatory\nsubmissions requesting the effect in patients who actually received the drug (ATT), whether for a\nsafety signal, a post-approval commitment study, or a label-expansion dossier, require ATT weighting\nor 1:1 matching, not ATO. When the target population is anchored to a specific clinical eligibility\nrule (e.g., \"all patients who initiated Drug A\"), the ATO population may not map to that clinical\ngroup. Any label claim or clinical-practice guideline update must describe the target population in\nrecognizable clinical terms; the ATO's smooth propensity-score-indexed population often cannot be\ntranslated into a patient group that prescribers can identify prospectively.\n\n**The weighting-target zoo: ATE, ATT, ATO, matching weights, and entropy weights**\n\nFive major weight families target distinct estimands and populations, and the choice among them\nis a substantive decision that must be pre-specified. 
IPTW for the ATE weights treated patients\nby Pr(A=1)/e(X) and controls by Pr(A=0)/(1 minus e(X)) in its stabilized form, targeting the whole\neligible population; it is the most sensitive to positivity violations and produces the widest\nweight range. Standardized mortality ratio (SMR) weighting for the ATT weights treated patients at 1 and\ncontrols at e(X)/(1 minus e(X)), targeting the treated population and naturally suited to\npost-marketing safety signals about actual drug users. Overlap weights for the ATO target the\nclinical equipoise population with bounded weights and exact covariate balance. Matching weights\n(Li and Greene 2013) weight each patient by the minimum of e(X) and 1 minus e(X) divided by the\nprobability of the treatment actually received (e(X) for treated, 1 minus e(X) for controls);\nthey are a weighting analogue of 1:1 pair matching, so the target population resembles a matched\nsample. Entropy balancing selects weights by directly satisfying balance\nconstraints rather than by a weight formula derived from the PS. None of these estimands is\ninterchangeable: when the treatment effect is heterogeneous across the PS\ndistribution, ATE, ATT, and ATO can differ substantially in magnitude and can even differ in\nsign. Pre-specifying the estimand before examining outcomes is not optional; it is what\ndistinguishes a confirmatory analysis from post hoc effect shopping.\n\n**Entropy balancing**\n\nEntropy balancing (Hainmueller 2012) searches for the set of weights that (a) directly satisfies\nuser-specified balance constraints, usually exact first-moment balance on all listed covariates,\nand (b) stays as close as possible to uniform weights in a Kullback-Leibler information sense.\nThe weights are found by solving a convex optimization problem rather than by a closed-form\nformula. Unlike overlap weights, entropy balancing bypasses the propensity score model entirely:\nthere is no e(X) to estimate; the optimizer works directly on covariate moments. 
The analyst\nspecifies which covariates must be balanced (and to which moments — first, second, cross-product)\nand the procedure finds the minimum-entropy-deviation weights that satisfy those constraints. The\nestimand of entropy balancing is ATT-style when only the control group is reweighted to match the\ntreated covariate distribution, or ATE-style when both groups are reweighted toward a target\ndistribution. Entropy balancing is more transparent about what balance means in a given analysis\n— the analyst explicitly lists the constraints — but does not produce a single named causal\nestimand as cleanly as overlap weights do. For regulatory submissions where the estimand must be\npre-specified and interpretable, overlap weights are generally more defensible.\n\n**Covariate Balancing Propensity Score (CBPS)**\n\nThe Covariate Balancing Propensity Score (Imai and Ratkovic 2014) estimates the propensity score\nmodel under the dual constraint that the model fits the treatment assignment (as in standard\nlogistic regression) AND the resulting weights simultaneously achieve covariate balance. CBPS is\na generalized method of moments estimator that enforces both the score equation and the balance\nequation as moment conditions, producing a PS model that is more robust to logistic\nmisspecification than a standard logistic regression PS. For practical RWE use, CBPS is a drop-in\nreplacement for logistic regression in any PS workflow — the estimated e(X) can feed into any\ndownstream weight formula (ATE, ATT, ATO). Both CBPS and entropy balancing are implemented in\nthe WeightIt R package via method = \"cbps\" and method = \"ebal\". 
For most large-N claims datasets\nwhere logistic regression is stable, the practical difference between a well-specified logistic PS\nand CBPS is small; CBPS adds value when the PS model includes flexible terms such as splines or\ninteraction sets that are sensitive to specification.\n\n**Diagnostics: unchanged from standard IPTW**\n\nThe post-weighting diagnostic checklist for overlap weights is identical to that for IPTW.\nStandardized mean differences (SMDs) in the weighted sample must fall below 0.1 for all\ncovariates in the PS model, including squared terms and interactions if those entered the model.\nThe effective sample size ESS = (sum of w)^2 / (sum of w^2) must be reported; overlap weights\noften produce a higher ESS than unstabilized IPTW because bounded weights have lower dispersion\nthan extreme ones, but this is not guaranteed when IPTW weights are stabilized or when the PS\ndistribution is concentrated near 0.5 for most patients. Any ESS improvement reflects the\nnarrower ATO estimand, not better data quality. An overlap plot of the raw PS distribution in\nboth arms, drawn before weighting, confirms whether actual data overlap exists; the\nbounded-weight property does not create overlap that is absent in the raw data; it merely avoids\nassigning disproportionate influence to patients near the distribution extremes. The weight\ndistribution should be plotted alongside the ESS to confirm that near-zero weights are only\nassigned to far-from-equipoise patients where down-weighting is scientifically appropriate.\n\n**Pros, cons, and trade-offs**\n\nOverlap weights versus IPTW: IPTW targets the ATE (the full eligible population) and has a\ntransparent connection to the target trial protocol; it is the natural choice when the research\nquestion asks what would happen if everyone in the eligible population switched drugs. Its cost is\nsusceptibility to extreme weights and variance inflation when overlap is poor, even after\ntruncation. 
Overlap weights eliminate the truncation decision entirely, have the smallest asymptotic\nvariance within the class of balancing weights, and achieve exact mean balance on every covariate\nin the PS model when e(X) is fitted by maximum-likelihood logistic regression; their cost\nis the narrower ATO estimand, which cannot be used when the policy question requires ATT or\nATE or when the equipoise population cannot be described in actionable clinical terms.\n\nOverlap weights versus entropy balancing: both achieve exact covariate mean balance, but overlap\nweights derive their population from an estimated PS and target a defined ATO estimand, while\nentropy balancing directly targets a user-specified covariate distribution with no PS estimation\nstep. Entropy balancing adds no positivity penalty (the optimization may produce extreme weights\nif the specified constraints are nearly infeasible), while overlap weights are bounded by\nconstruction. For regulatory submissions, overlap weights' clear estimand formulation is preferred.\n\nOverlap weights versus CBPS: CBPS is a PS estimation method, not a weight scheme; the two are\ncomposable: estimate e(X) via CBPS, then compute overlap or ATT or ATE weights from e(X).\nUsing CBPS-estimated scores with overlap weights combines the misspecification robustness of CBPS\nwith the bounded-weight guarantee of overlap weighting.\n\n**When to use**\n\nUse overlap weights when the PS distribution shows moderate overlap concerns and IPTW produces\nextreme weights whose truncation threshold would be disputed; when the policy question is\ncomparative effectiveness among patients where clinicians genuinely choose between two drugs\n(not post-marketing safety among current users); when the analysis is not a regulatory submission\nrequiring the ATT; and when exact covariate mean balance without a separate optimization step\nis methodologically desirable. 
Overlap weights are the modern default for head-to-head\ncomparative effectiveness studies submitted to HTA bodies in therapeutic areas where both drugs\nare widely co-prescribed. Entropy balancing is preferred when the PS model is difficult to\nspecify and the analyst wants to express balance constraints directly. CBPS is preferred when a\nPS model is needed downstream for doubly-robust estimation and logistic regression misspecification\nis a concern.\n\n**When NOT to use**\n\nDo not use overlap weights when the estimand must be the average treatment effect in the treated\n(ATT): any safety signal, post-approval commitment study, or label-support analysis targeting\nactual drug users requires ATT weighting or matched-cohort analysis, not ATO. Do not use overlap\nweights when the ATO population cannot be described in clinically recognizable terms for the\nguideline or regulatory dossier; if reviewers cannot identify who the clinical equipoise patients\nare, the estimand is not actionable. Do not use any weighting method as a substitute for design:\na poorly defined time zero, a wrong comparator, or a prevalent-user cohort produces a biased ATO\nestimate just as it produces biased ATE or ATT estimates — the overlap weight diagnostics will\nlook clean while the underlying cohort is structurally biased. 
Do not cite the bounded-weight\nproperty as evidence of adequate data overlap — weights are bounded by construction regardless of\nthe raw PS distribution; inspect the raw PS histogram before choosing any weighting method.\nDo not apply overlap weighting to sparse-outcome studies with very low event counts; weighted\nsurvival analyses are unstable in small effective samples regardless of which weight formula is\nused.\n\n**Interpreting the output**\n\nConsider an overlap-weighted comparative effectiveness analysis of two oral antidiabetic agents\nproducing a weighted risk difference of minus 0.032 (95% CI minus 0.052 to minus 0.012) for a\ncomposite cardiovascular endpoint, with all post-weighting SMDs below 0.08 and ESS of 6,420\nfrom an original cohort of 8,100 patients.\n\nFormal interpretation: The risk difference of minus 0.032 is the estimated average treatment\neffect in the overlap population (ATO) — the smoothly-weighted subpopulation indexed by\nw = e(X) times (1 minus e(X)). This is a marginal (population-averaged) effect in the ATO-\nweighted pseudo-population, not a conditional effect within covariate strata and not the ATE or\nATT. The 95% confidence interval (minus 0.052 to minus 0.012), computed with robust sandwich\nstandard errors that account for weight heterogeneity in the outcome model, means that under\nrepeated sampling from the same data-generating process, 95% of such intervals would contain\nthe true ATO risk difference. Sandwich SEs do not propagate propensity-score estimation\nuncertainty from the first stage; bootstrap or M-estimation is required if that uncertainty\nmust be reflected in the interval.\nThe estimate is a valid causal estimate only under three core assumptions: conditional\nexchangeability given all measured baseline covariates X, consistency (the treatment versions\nare well-defined), and positivity within the ATO target population (every patient has a\nnonzero probability of receiving either treatment). 
Robustness to PS model misspecification\nis better than with IPTW because bounded weights limit the influence of any single patient,\nbut PS misspecification is not eliminated.\n\nPractical interpretation: Among patients in the clinical-equipoise population (ATO), Drug A\nwas associated with approximately 3.2 fewer events per 100 patients compared with Drug B. This is not the effect for\nall Drug A users, not the effect in the sickest patients essentially guaranteed one treatment,\nand not an estimate that directly generalizes to a trial that enrolled by strict eligibility\ncriteria. The ESS of 6,420 is a weight-dispersion summary — the equivalent number of\nequal-weight observations yielding the same precision — indicating the analysis retains\nsubstantial statistical information; it is not a count of patients who receive meaningful\nweight, as all 8,100 patients receive nonzero overlap weights.\nBecause the ATO population is defined by the PS model, it cannot be identified prospectively from\nclinical criteria alone; any label or guideline language derived from this estimate must\nacknowledge the estimand as the comparative effectiveness population identified by clinical\nequipoise in the observed prescription data, not a fixed patient subgroup.",
    "primary_category": "Causal_Inference_Method",
    "tags": [
      "overlap-weighting",
      "ato",
      "propensity-score",
      "causal-inference",
      "entropy-balancing",
      "cbps",
      "estimand",
      "covariate-balance",
      "claims",
      "ehr",
      "equipoise",
      "weighting"
    ],
    "applies_to_study_types": [
      "active_comparator_new_user",
      "cohort_retrospective",
      "claims_analysis",
      "ehr_study",
      "target_trial_emulation"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1080/01621459.2016.1260466",
        "url": "https://doi.org/10.1080/01621459.2016.1260466",
        "citation_text": "Li F, Morgan KL, Zaslavsky AM. Balancing covariates via propensity score weighting. Journal of the American Statistical Association. 2018;113(521):390-400.",
        "year": 2018,
        "authors_short": "Li et al.",
        "notes": "Introduces overlap weights, proves exact covariate mean balance as an algebraic identity for any finite sample, and establishes that overlap weights minimize the asymptotic variance among all balancing weight functions. The foundational paper for the ATO estimand and the bounded-weight property."
      },
      {
        "role": "explain",
        "doi": "10.1001/jama.2020.7819",
        "url": "https://doi.org/10.1001/jama.2020.7819",
        "citation_text": "Thomas LE, Li F, Pencina MJ. Overlap weighting: a propensity score method that mimics attributes of a randomized clinical trial. JAMA. 2020;323(23):2417-2418.",
        "year": 2020,
        "authors_short": "Thomas et al.",
        "notes": "Accessible clinical primer explaining overlap weighting in terms familiar to clinicians and health outcomes researchers; demonstrates that the ATO population mimics key attributes of a clinical trial by concentrating on the equipoise region where randomization would naturally occur. Widely cited as the gateway reference for applied users."
      },
      {
        "role": "demonstrate",
        "doi": "10.1136/bmj.l5657",
        "url": "https://doi.org/10.1136/bmj.l5657",
        "citation_text": "Desai RJ, Franklin JM. Alternative approaches for confounding adjustment in observational studies using weighting based on the propensity score: a primer for practitioners. BMJ. 2019;367:l5657.",
        "year": 2019,
        "authors_short": "Desai & Franklin",
        "notes": "Systematic review of propensity-score weighting alternatives including IPTW, SMR, overlap, matching weights, and entropy balancing, with empirical performance comparisons; the canonical reference for the weighting-target zoo and for selecting among alternatives based on the estimand and data structure."
      },
      {
        "role": "use",
        "doi": "10.1002/sim.6607",
        "url": "https://doi.org/10.1002/sim.6607",
        "citation_text": "Austin PC, Stuart EA. Moving towards best practice when using inverse probability of treatment weighting (IPTW) using the propensity score to estimate causal treatment effects in observational studies. Statistics in Medicine. 2015;34(28):3661-3679.",
        "year": 2015,
        "authors_short": "Austin & Stuart",
        "notes": "Best-practice guidance for weighting including effective sample size calculation, weight distribution diagnostics, truncation, and robust variance estimation; these operational standards apply directly to overlap weighting implementations — the same diagnostics, the same sandwich SE requirements, and the same reporting standards."
      }
    ],
    "plain_language_summary": "When comparing two treatments in real-world data, methods that reweight patients to make the two groups look similar can accidentally assign enormous statistical weight to patients who look nothing like the other group, letting a few unusual patients dominate the results. Overlap weights fix this by giving the most weight to patients who could plausibly have received either treatment and down-weighting everyone else, keeping all weights naturally bounded between zero and one without any arbitrary trimming decision. The trade-off is that the result applies to the subgroup of patients where a clinician's choice genuinely could have gone either way — the clinical equipoise population — not to every patient who happened to receive one drug. When the research question is about patients who actually used a drug (for a safety signal, for example), a different weighting approach that targets the treated population is needed instead.",
    "key_terms": [
      {
        "term": "overlap weights",
        "definition": "A weighting formula that assigns each patient a weight equal to the probability they would have received the opposite treatment — treated patients get weight 1 minus their propensity score, controls get weight equal to their propensity score — keeping all weights between zero and one by construction."
      },
      {
        "term": "ATO (average treatment effect in the overlap population)",
        "definition": "The causal estimand targeted by overlap weights, representing the average treatment effect among patients near clinical equipoise — those whose measured characteristics place them midway between the two treatment arms rather than strongly predicted to receive one drug."
      },
      {
        "term": "clinical equipoise",
        "definition": "The state in which a clinician has genuine uncertainty about which treatment to give a patient because both options are medically reasonable given that patient's baseline characteristics, resulting in a propensity score near 0.5."
      },
      {
        "term": "effective sample size (ESS)",
        "definition": "A summary measure of how much statistical information the weighted analysis retains, computed as the squared sum of weights divided by the sum of squared weights; an ESS well below the actual sample size indicates that a few patients with extreme weights are dominating the analysis."
      },
      {
        "term": "entropy balancing",
        "definition": "An alternative weighting approach that finds the set of weights closest to uniform while exactly satisfying user-specified covariate balance constraints, without estimating a propensity score model as an intermediate step."
      }
    ],
    "worked_example": {
      "scenario": "An analyst compares Drug A versus Drug B in a commercial claims database. After fitting a logistic regression propensity score model, four representative patients illustrate how IPTW and overlap weights differ. Patients 101 and 103 both have a propensity score of 0.9 — their baseline characteristics strongly predict Drug A — but patient 101 received Drug A (treated) and patient 103 received Drug B (control). Patients 102 and 104 have a propensity score of 0.5, indicating true clinical equipoise. The analyst computes IPTW and overlap weights for all four patients and compares the resulting effective sample sizes to show why the extreme IPTW weight for patient 103 is problematic.",
      "dataset": {
        "caption": "Four-patient illustration cohort. e_hat is the estimated propensity score (probability of receiving Drug A). Arm 1 = Drug A (treated), Arm 0 = Drug B (control).",
        "columns": [
          "person_id",
          "arm",
          "e_hat",
          "clinical_profile"
        ],
        "rows": [
          [
            101,
            1,
            0.9,
            "High-PS treated: Drug A characteristics"
          ],
          [
            102,
            1,
            0.5,
            "Equipoise treated: could plausibly receive either drug"
          ],
          [
            103,
            0,
            0.9,
            "High-PS control: Drug A characteristics but received Drug B"
          ],
          [
            104,
            0,
            0.5,
            "Equipoise control: could plausibly receive either drug"
          ]
        ]
      },
      "steps": [
        "IPTW weights for treated arm (weight = 1/e_hat). Patient 101 (treated, e = 0.9): IPTW = 1/0.9 = 10/9 = 1.11. Patient 102 (treated, e = 0.5): IPTW = 1/0.5 = 2.",
        "IPTW weights for control arm (weight = 1/(1 - e_hat)). Patient 103 (control, e = 0.9): IPTW = 1/(1-0.9) = 1/0.1 = 10 — a nine-fold disparity versus patient 101 who shares the same propensity score but is in the treated arm. Patient 104 (control, e = 0.5): IPTW = 1/(1-0.5) = 1/0.5 = 2. Patient 103 alone accounts for the bulk of the control pseudo-population, which means the IPTW estimator is largely determined by this one patient.",
        "Overlap weights for treated arm (weight = 1 - e_hat). Patient 101: w = 1-0.9 = 0.1. Patient 102: w = 1-0.5 = 0.5. High-PS treated patients receive small weights because they are far from the clinical equipoise region and contribute little to the ATO estimand.",
        "Overlap weights for control arm (weight = e_hat). Patient 103: w = 0.9. Patient 104: w = 0.5. All four overlap weights fall in [0.1, 0.9] — bounded by construction, no trimming required. The maximum IPTW weight was 10; the maximum overlap weight is 0.9.",
        "Overlap-weighted effective sample size. Sum of weights: 0.1+0.5+0.9+0.5 = 2.0. Sum of squared weights: 0.01+0.25+0.81+0.25 = 1.32. ESS = 2.0*2.0/1.32 = 3.03 effective patients out of 4. An IPTW analysis with weights 1.11, 2, 10, 2 would have ESS approximately 2.09 due to the extreme control weight of 10 dominating the denominator."
      ],
      "result": "IPTW creates a nine-fold weight disparity for the high-PS pair: treated weight = 1/0.9 = 10/9 = 1.11; control weight = 1/(1-0.9) = 1/0.1 = 10. Overlap weights for the same pair are bounded: treated = 1-0.9 = 0.1; control = 0.9. All four overlap weights lie in [0.1, 0.9] with no trimming needed by construction. Overlap-weighted ESS = 2.0*2.0/1.32 = 3.03 out of 4 patients, versus IPTW ESS of approximately 2.09. The ATO estimand concentrates statistical mass on the equipoise patients (e near 0.5) and down-weights the high-PS patients who are far from the overlap region where causal inference is most credible."
    },
    "prerequisites": [
      "propensity-score-methods-psm-iptw",
      "estimands-ate-att-intercurrent-events-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [],
    "tradeoffs": [],
    "implementation_notes_by_data_source": {
      "claims": "Build all covariates (demographics, baseline diagnoses and procedures, prior drug classes, prior hospitalizations, cost, frailty proxies) from the pre-index lookback window only. Require complete FFS-observable enrollment and exclude Medicare Advantage-only person-time before computing e(X). The bounded-weight property of overlap weights is preserved regardless of covariate dimensionality, but the ATO estimand population narrows as the PS model includes more covariates that separate the arms; report the ESS and the raw PS overlap plot alongside any overlap-weighted analysis so reviewers can assess what clinical population is actually being analyzed.",
      "ehr": "Add labs, vitals, smoking, BMI, and NLP-derived severity terms to sharpen the PS; measure all covariates at or before the index date. Because EHR covariate capture is informative (sicker patients have more monitoring and more covariate data), include visit-intensity and site terms in the PS model before computing overlap weights. Missing covariate data should be addressed before PS fitting via multiple imputation or a missingness indicator, because missingness patterns affect e(X) and thereby the overlap weight distribution.",
      "registry": "Clinical registry data (stage, biomarker, ECOG performance status, device type) often provide the sharpest confounders for the PS model. Link to claims for complete pharmacy exposure and to a vital records death index for censoring before fitting the PS; mismatch between registry index dates and pharmacy fill dates can contaminate pre-index covariate windows and corrupt e(X).",
      "linked": "Reconcile order, fill, and service-date discrepancies before fixing time zero to prevent post-index covariate leakage into e(X). In large linked datasets the PS distribution is often sharply bimodal; inspect the PS overlap plot before choosing between IPTW and overlap weights, since the bounded-weight property of overlap weights does not create overlap that is absent in the underlying data."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\nimport pandas as pd\nfrom sklearn.linear_model import LogisticRegression\nfrom lifelines import CoxPHFitter\n\nxcols = [\"age\", \"sex\", \"cci\", \"ckd\", \"prior_hf\", \"prior_insulin\", \"prior_hosp\", \"prior_cost\"]\n\n# ── Stage 1 (outcome-blind design): estimate e(X) via near-unpenalized logistic regression ──\nlr = LogisticRegression(max_iter=2000, C=1e6)\nlr.fit(df[xcols], df[\"treated\"])\ndf[\"e\"] = np.clip(lr.predict_proba(df[xcols])[:, 1], 1e-6, 1 - 1e-6)\n\n# ── Overlap weights (ATO): w = 1 - e for treated, w = e for control ──\n# Bounded in (0, 1) by construction; no trimming needed.\ndf[\"w_ato\"] = np.where(df[\"treated\"] == 1, 1 - df[\"e\"], df[\"e\"])\n\n# ── IPTW (ATE) for comparison: shows the weight explosion overlap weights avoid ──\np = df[\"treated\"].mean()\ndf[\"w_iptw\"] = np.where(df[\"treated\"] == 1, p / df[\"e\"], (1 - p) / (1 - df[\"e\"]))\ndf[\"w_iptw_trunc\"] = df[\"w_iptw\"].clip(upper=df[\"w_iptw\"].quantile(0.99))  # truncated IPTW\n\n# ── Balance diagnostic: weighted SMD for each covariate in the PS model ──\ndef wsmd(x, trt, w):\n    m1 = np.average(x[trt == 1], weights=w[trt == 1])\n    m0 = np.average(x[trt == 0], weights=w[trt == 0])\n    v1 = np.average((x[trt == 1] - m1) ** 2, weights=w[trt == 1])\n    v0 = np.average((x[trt == 0] - m0) ** 2, weights=w[trt == 0])\n    return abs(m1 - m0) / np.sqrt((v1 + v0) / 2)\n\nsmds = {c: wsmd(df[c].values, df[\"treated\"].values, df[\"w_ato\"].values) for c in xcols}\ness_ato  = df[\"w_ato\"].sum()  ** 2 / (df[\"w_ato\"]  ** 2).sum()\ness_iptw = df[\"w_iptw\"].sum() ** 2 / (df[\"w_iptw\"] ** 2).sum()\nprint(f\"ATO  max |SMD| = {max(smds.values()):.3f} | ESS = {ess_ato:.0f} / {len(df)}\")\nprint(f\"IPTW max weight = {df['w_iptw'].max():.1f} | IPTW ESS = {ess_iptw:.0f}\")\nassert max(smds.values()) < 0.1, \"overlap-weighted SMD > 0.1: revisit the PS model\"\n\n# ── Stage 2 (analysis): marginal weighted Cox model with robust (sandwich) SEs ──\n# 
robust=True requests a sandwich variance that treats the weights as fixed; it does not\n# propagate PS-estimation uncertainty (bootstrap both stages if that must be reflected).\ncox = CoxPHFitter()\ncox.fit(\n    df[[\"time\", \"event\", \"treated\", \"w_ato\"]],\n    duration_col=\"time\", event_col=\"event\",\n    weights_col=\"w_ato\", robust=True\n)\nprint(cox.summary[[\"coef\", \"exp(coef)\", \"se(coef)\", \"p\"]])\n# exp(coef) = ATO hazard ratio; state the ATO estimand explicitly in all reporting.\n# The ATO population is the clinical-equipoise subpopulation, NOT the full cohort.",
        "description": "Full overlap-weighting workflow on a one-row-per-initiator analytic dataset. Required input\ncolumns:\n  person_id : unique subject id\n  treated   : 1 = Drug A initiator, 0 = Drug B initiator (arm at index_date)\n  <xcols>   : baseline covariates from [index_date-365, index_date] (no post-index variables)\n  time      : follow-up days from index_date to event or censoring\n  event     : 1 = outcome occurred, 0 = censored\nOutputs: ATO overlap weights, weighted SMD balance check, ESS, and a marginal weighted Cox\nmodel with robust SEs. Comparison IPTW weights (ATE) are also computed to illustrate the\nbounded versus extreme weight contrast on the same data.",
        "dependencies": [
          "pandas",
          "numpy",
          "scikit-learn",
          "lifelines"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(PSweight); library(WeightIt); library(cobalt); library(survey)\n\nxform <- treated ~ age + sex + cci + ckd + prior_hf + prior_insulin + prior_hosp + prior_cost\n\n# ── Option A: PSweight — the canonical ATO package ──\n# Estimates e(X) internally, constructs overlap weights, and provides the ATO point estimate\n# with sandwich variance via a single call. The canonical reference for the PSweight package\n# is the vignette at cran.r-project.org/package=PSweight.\nps_fit <- PSweight(\n  ps.formula = xform,\n  yname      = \"event\",        # binary or continuous outcome column name\n  data       = df,\n  weight     = \"overlap\"       # \"overlap\" = ATO; \"IPW\" = ATE; \"treated\" = ATT (SMR)\n)\nsummary(ps_fit)                # ATO point estimate, 95% CI via sandwich SE\n\n# ── Option B: WeightIt + cobalt — integrates with the broader WeightIt ecosystem ──\nw <- weightit(xform, data = df, method = \"ps\", estimand = \"ATO\")\n# estimand = \"ATO\" sets w = 1 - e for treated, w = e for control (overlap weights)\nbal.tab(w, un = TRUE, thresholds = c(m = 0.1))   # SMDs pre and post overlap weighting\nlove.plot(w, abs = TRUE, thresholds = c(m = 0.1))\ncat(\"ATO ESS:\", summary(w)$effective.sample.size, \"\\n\")\n\n# ── Marginal weighted Cox model with robust (sandwich) SEs ──\ndf$w_ato <- w$weights\ndes <- svydesign(ids = ~1, weights = ~w_ato, data = df)\nfit <- svycoxph(Surv(time, event) ~ treated, design = des)\nsummary(fit)   # exp(coef) = ATO hazard ratio; state the ATO estimand in all reporting\n\n# ── Entropy balancing: ATT-style (control reweighted to match treated moments) ──\n# method = \"ebal\" solves a convex optimization; no PS model estimated.\nw_eb <- weightit(xform, data = df, method = \"ebal\", estimand = \"ATT\")\nbal.tab(w_eb, un = TRUE, thresholds = c(m = 0.1))\n\n# ── CBPS: PS estimated under dual fit-and-balance GMM constraint ──\nw_cbps <- weightit(xform, data = df, method = \"cbps\", estimand = \"ATO\")\nbal.tab(w_cbps, un 
= TRUE, thresholds = c(m = 0.1))\n# Combine CBPS-estimated e(X) with overlap weight formula for robustness to PS misspecification",
        "description": "Overlap weighting via PSweight (the canonical R package for ATO) and WeightIt plus cobalt for\nbalance visualization. Input data.frame df has one row per initiator with columns: treated\n(0/1), baseline covariates, and time/event for the outcome. PSweight::PSweight() estimates the\nPS and ATO weights internally; WeightIt provides an alternative estimand = \"ATO\" path that\nmirrors the Python workflow and integrates with the broader cobalt balance ecosystem. The\nsvycoxph call via the survey package provides correct robust standard errors for the weighted\nCox model. Entropy balancing (method = \"ebal\") and CBPS (method = \"cbps\") are also shown\nas drop-in alternatives.",
        "dependencies": [
          "PSweight",
          "WeightIt",
          "cobalt",
          "survey"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* ── Stage 1: estimate e(X) = Pr(treated | X) via PROC LOGISTIC ── */\nproc logistic data=work.analytic noprint;\n  class sex ckd prior_hf prior_insulin / param=ref;   /* predictors only; the response is not a CLASS effect */\n  model treated(event='1') = age sex cci ckd prior_hf prior_insulin prior_hosp prior_cost;\n  output out=ps_out p=e_hat;\nrun;\n\n/* ── Stage 2: compute ATO overlap weights in a DATA step ── */\ndata work.ow;\n  set ps_out;\n  e_hat = min(max(e_hat, 1e-6), 1 - 1e-6);   /* clip extreme scores to avoid 0/1 */\n  if treated = 1 then w_ato = 1 - e_hat;       /* treated: weight = 1 - e(X) */\n  else                 w_ato = e_hat;           /* control: weight = e(X)     */\n  /* ATO estimand: concentrated at e_hat near 0.5; bounded in (0, 1) by construction */\nrun;\n\n/* ── Stage 3: balance check via weighted PROC MEANS for SMD computation ── */\n/* Run for all PS-model covariates; compute (wt_mean_treated - wt_mean_control) / pooled_SD */\n/* VARDEF=WEIGHT divides by the sum of weights, giving the weighted variance */\nproc means data=work.ow nway mean var vardef=weight;\n  class treated;\n  var age cci prior_hosp prior_cost;\n  weight w_ato;\n  output out=wt_stats mean=m_age m_cci m_hosp m_cost\n                      var=v_age v_cci v_hosp v_cost;\nrun;\n/* Compute ESS: ESS = (sum(w_ato))^2 / sum(w_ato^2) */\nproc sql;\n  select sum(w_ato)**2 / sum(w_ato**2) as ESS\n  from work.ow;\nquit;\n\n/* ── Stage 4a: binary outcome, weighted log-binomial with robust sandwich variance ── */\nproc genmod data=work.ow;\n  class person_id treated(ref='0');        /* the SUBJECT= variable must appear in CLASS */\n  model event(event='1') = treated / dist=binomial link=log;   /* model Pr(event = 1) */\n  weight w_ato;\n  repeated subject=person_id / type=ind;   /* GEE sandwich SE for the weighted estimator */\n  estimate 'ATO risk ratio' treated 1 -1 / exp;\nrun;\n\n/* ── Stage 4b: time-to-event, weighted Cox model with robust (aggregate sandwich) SE ── */\nproc phreg data=work.ow covs(aggregate);   /* COVS(AGGREGATE) = robust sandwich variance */\n  class treated(ref='0');\n  model futime*event(0) = treated;\n  weight w_ato;\n  id person_id;   /* required: defines subjects for the 
aggregate sandwich estimator */\n  /* exp(treated coefficient) = ATO hazard ratio; state the ATO estimand in reporting */\nrun;",
        "description": "Overlap weighting in SAS/STAT 14.2+. Required dataset work.analytic has one row per new\ninitiator with columns: person_id, treated (1/0), baseline covariates measured in the pre-index\nwindow, futime (days from index_date to event or censoring), event (1/0). PROC LOGISTIC\nestimates e(X); a DATA step computes ATO weights; PROC GENMOD with REPEATED provides robust\nvariance for binary outcomes; PROC PHREG with COVS(AGGREGATE) and ID provides robust sandwich\nvariance for time-to-event outcomes. Both outcome models are shown.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  PS[\"Fit PS model: e of X = Pr treated given X<br/>logistic regression or CBPS<br/>from pre-index covariates only\"] --> OV{Raw PS overlap plot}\n  OV -->|Adequate overlap, few extreme scores| IPTW[\"Stabilized IPTW for ATE<br/>(full eligible population estimand)\"]\n  OV -->|Extreme scores or poor overlap| OW[\"Overlap weights for ATO<br/>(clinical equipoise estimand)\"]\n  OW --> WCOMP[\"Treated arm: w = 1 minus e of X<br/>Control arm: w = e of X<br/>All weights bounded in open interval 0 to 1<br/>No trimming required\"]\n  WCOMP --> DIAG[\"Diagnostics<br/>Weighted SMD less than 0.1 per covariate<br/>ESS = square of the sum of w divided by the sum of squared w<br/>Weight distribution plot\"]\n  DIAG --> OUT[\"Marginal weighted outcome model<br/>Robust sandwich SE<br/>Report ATO estimand name explicitly\"]\n  OUT --> INTERP[\"Interpret as ATO:<br/>effect in clinical-equipoise population<br/>NOT ATE and NOT ATT\"]",
        "caption": "Overlap weighting decision and execution flow. The raw PS overlap plot informs the choice between IPTW (ATE) and overlap weights (ATO); the bounded-weight property is a consequence of the narrower ATO estimand, not evidence of better data overlap.",
        "alt_text": "Flowchart from PS model estimation through raw-PS overlap assessment, overlap-weight computation, weighted diagnostics, marginal outcome model with robust SE, and ATO estimand interpretation.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "requires",
        "target_slug": "propensity-score-methods-psm-iptw",
        "notes": "Overlap weights extend the IPTW framework; understanding propensity score estimation, stabilized IPTW, and the ATE/ATT estimand distinction is prerequisite to appreciating why the ATO and bounded weights were developed as a response to IPTW's extreme-weight problem."
      },
      {
        "relation_type": "see_also",
        "target_slug": "high-dimensional-propensity-score-hdps-rwe",
        "notes": "hdPS empirically selects proxy confounders from claims codes for the PS model; the resulting e(X) estimates feed directly into overlap weight computation and can sharpen the ATO estimand population when proxy confounders are present."
      },
      {
        "relation_type": "used_with",
        "target_slug": "baseline-characteristics-and-covariate-balance-rwe",
        "notes": "Post-weighting SMDs and the effective sample size are the mandatory diagnostics for any overlap-weighted analysis; the balance-check framework applies identically to ATO-weighted and IPTW-weighted samples, and the same 0.10 SMD threshold applies."
      },
      {
        "relation_type": "see_also",
        "target_slug": "estimands-ate-att-intercurrent-events-rwe",
        "notes": "The ATO is a third major causal estimand alongside the ATE and ATT; overlap weights make the ATO-ATE-ATT choice concrete and consequential because the same estimated PS feeds three different weight formulas targeting three different populations and answering three different causal questions."
      },
      {
        "relation_type": "see_also",
        "target_slug": "targeted-maximum-likelihood-estimation-rwe",
        "notes": "TMLE is a doubly-robust estimator that can be paired with overlap-weight-estimated propensity scores for a semiparametrically efficient ATO estimate; the bounded-weight property of overlap weights makes TMLE more numerically stable than IPTW-initialized TMLE when positivity is near-violated."
      }
    ],
    "aliases": [
      "overlap weights",
      "ATO weighting",
      "average treatment effect in the overlap population",
      "entropy balancing",
      "CBPS",
      "covariate balancing propensity score",
      "matching weights"
    ],
    "complexity": "advanced",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "paired-t-test",
    "name": "Paired t-Test",
    "short_definition": "A parametric hypothesis test for continuous outcomes measured on the same subjects at two time points or under two matched conditions; it works by computing the within-pair difference for each subject, then testing whether the mean of those differences is distinguishable from zero — pairing removes between-subject variability and increases power proportionally to the correlation between the two measurements.",
    "long_description": "**What the paired t-test actually does**\n\nThe paired t-test is deceptively simple in its mechanics but powerful in its logic.\nYou have n subjects, each measured twice — at baseline and follow-up, or in a matched\npair. Let d_i = X_i2 - X_i1 be the within-subject difference. The test is then a\none-sample t-test on the d_i against the null hypothesis that the population mean\ndifference mu_d = 0:\n\n    t = d_bar / (s_d / sqrt(n))\n\nwhere d_bar is the sample mean of differences, s_d is the standard deviation of\ndifferences, and n is the number of pairs. The test statistic follows a t-distribution\nwith n - 1 degrees of freedom under the null. The 95% confidence interval is:\n\n    d_bar ± t_{0.975, n-1} * (s_d / sqrt(n))\n\nThe key insight is that this is *not* a comparison of two separate groups; it is a\none-sample test on the derived differences. This seemingly cosmetic reframing has a\nprofound practical consequence for power.\n\n**Why pairing gains power: the variance algebra**\n\nIf you were to ignore the pairing and run a two-sample Welch t-test on the pre and\npost values as if they were independent groups, the variance of the mean difference\nwould be:\n\n    Var(X_bar_post - X_bar_pre) = sigma^2_post/n + sigma^2_pre/n\n\nBut for paired data, the variance of the mean difference is:\n\n    Var(d_bar) = Var((X_post - X_pre)_bar) = (sigma^2_post + sigma^2_pre - 2*rho*sigma_post*sigma_pre) / n\n\nwhere rho is the within-subject (pre-post) correlation. When rho > 0 — which is\nalmost always true in longitudinal clinical data, because patients who are high at\nbaseline tend to remain relatively high at follow-up — the paired design subtracts\na positive term from the variance. The higher the correlation, the greater the\nvariance reduction, and the tighter the SE, and therefore the greater the power\nrelative to an unpaired analysis of the same data. 
As a practical rule of thumb:\n\n- rho >= 0.5: the paired design roughly halves the variance of the estimator\n  relative to an independent-samples design.\n- rho >= 0.8: the paired design's SE is less than half that of the unpaired design.\n- rho near 0: little gain from pairing; the paired test has fewer degrees of\n  freedom (n-1 vs 2n-2) and can actually be slightly less powerful than the\n  unpaired test.\n\nThis variance algebra is why lab value and biometric studies (where measurements\nare highly correlated within person over short periods) gain enormously from the\npaired design, while acute randomized crossover studies with long washout periods\nmay see more modest gains.\n\n**The normality assumption: what must actually be normal**\n\nThe paired t-test requires that the *differences* d_i are approximately normally\ndistributed — not the original measurements, and not both groups jointly. In\npractice this is a weaker requirement than it may sound, for two reasons:\n\nFirst, even if the original pre and post distributions are right-skewed (as is\ncommon for costs and utilization in healthcare), the differences often have a\nmore symmetric distribution because the skew is partially shared across the two\nmeasurements on the same person.\n\nSecond, the Central Limit Theorem protects the t-statistic at larger n: for\nn >= 30, departures from normality in the differences rarely invalidate the test\nfor inference on the mean difference. At n < 15 with strongly non-normal\ndifferences, the Wilcoxon signed-rank test is the preferred alternative (see\nrelations below).\n\nAs with all t-tests, resist the temptation to run a Shapiro-Wilk test on the\ndifferences to \"confirm normality\" before choosing the test. At small n, Shapiro-Wilk\nhas low power to detect non-normality; at large n, it will flag trivial departures\nthat the CLT renders irrelevant. 
Inspect the differences with a histogram and Q-Q\nplot instead, and use subject-matter knowledge about the outcome distribution.\n\n**RWE application: pre-post cost and utilization analyses**\n\nThe paired t-test is the workhorse for within-patient pre-post comparisons in\nclaims and EHR analyses. Common contexts:\n\n- *PPPM cost pre vs post a diagnosis or intervention*: identify an index event\n  (new diagnosis, hospital admission, drug initiation), define a pre-period window\n  (e.g., 6 or 12 months before index) and a post-period window (e.g., 6 or 12\n  months after), compute per-patient-per-month costs in each window, and test whether\n  the mean difference is nonzero.\n\n- *Lab values before and after drug initiation in EHR*: HbA1c, LDL, eGFR, and\n  similar biomarkers measured at clinic visits naturally produce a pre-treatment and\n  post-treatment measurement for each patient who has both visits captured.\n\n- *Utilization rates in matched cohort studies*: when PS-matched pairs are formed\n  one-to-one, the matched design creates a natural pairing — but see below for the\n  field debate on whether paired analysis is mandatory.\n\n**The RWE traps: regression to the mean, secular trends, and matched-pair analysis**\n\nThe paired t-test correctly estimates the mean within-person (or within-pair)\ndifference; it cannot remove bias that comes from the study design. Three traps recur\nin applied work:\n\n*Trap 1 — Regression to the mean*\n\nSuppose you select a cohort of high-cost claimants — say, patients in the top decile\nof total annual spending — and then compare their costs in the next year using a\npaired t-test. You will almost certainly observe a statistically significant reduction\nin mean costs. But this reduction may be partly or entirely a statistical artifact\nrather than a true effect. Patients who are selected because they had an extreme value in one\nperiod will, on average, have a less extreme value in the next period even if nothing\nhas changed about them. This is regression to the mean. 
The paired t-test faithfully\nmeasures the difference, but the difference need not reflect any intervention.\nThe fix is a concurrent control group (unexposed patients in the same period) or a\ndifference-in-differences design that subtracts the control group's pre-post change\nfrom the treated group's pre-post change.\n\n*Trap 2 — Secular trends masquerade as treatment effects*\n\nA pre-post analysis with no concurrent control group cannot distinguish between the\neffect of an intervention and background trends that would have occurred anyway:\nseasonal variation in respiratory admissions, a secular decline in hospitalizations\ndue to policy changes, a regression-to-the-mean artifact (see above), or maturation\nof a disease cohort. All of these confound a naive pre-post paired t-test. The paired\nt-test is appropriate as the estimator *conditional on* the study design being\nadequate to identify the causal effect — which a simple pre-post design usually is not.\nInterrupted time series, difference-in-differences, or a controlled pre-post with a\nmatched unexposed group are the design-level fixes; the paired t-test remains the\nappropriate within-pair estimator once those design features are in place.\n\n*Trap 3 — The matched cohort analysis debate*\n\nIn propensity-score-matched cohort studies, patients are matched one-to-one on a\npropensity score estimated from observed confounders. The conventional question is:\nshould the primary analysis use\nthe paired t-test (treating matched pairs as dependent observations) or an unpaired\nanalysis on the matched sample? The strict statistical position is that if you matched\non covariates, the outcomes of matched pairs are dependent and a paired analysis is\ncorrect. In practice, both robust variance estimators (which account for within-cluster\ndependence without explicitly pairing) and paired analyses are seen in the HEOR\nliterature, and simulation evidence suggests that correctly specified robust variance\nestimates and paired analyses give similar results. 
The conservative and defensible\napproach is to use the paired t-test (or the paired Wilcoxon signed-rank test as a\nsensitivity check) when matched pairs are clearly defined one-to-one.\n\n**Pros, cons, and trade-offs**\n\n*Pros*:\n- Removes between-subject variability from the error term, yielding substantially\n  narrower confidence intervals and greater power than an unpaired two-sample test\n  when the pre-post correlation is moderate to high (rho >= 0.4).\n- Produces a directly interpretable effect estimate: the mean within-person difference\n  (e.g., mean reduction of $340 per patient per month) with a CI on the original scale.\n- Robust to moderate non-normality of differences at n >= 30 (CLT).\n- Straightforward CI formula; extends naturally to regression with a fixed subject\n  effect when covariates need adjustment.\n- The simplest correct choice for paired continuous data; avoids the complexity of\n  mixed models when only two time points are present.\n\n*Cons*:\n- Requires complete pairs: a patient missing either the pre or post measurement must\n  be excluded or imputed. 
This can introduce selection bias if missingness is not\n  random (patients lost to follow-up often differ systematically from completers).\n- Cannot handle more than two time points; for three or more repeated measurements,\n  linear mixed models (LMM) or MMRM are the appropriate generalization.\n- Blind to secular trends and regression to the mean — it faithfully estimates the\n  mean within-person difference, but that difference may not be the causal effect\n  of interest in an observational pre-post design.\n- At small n (< 15) with non-normal differences, the Wilcoxon signed-rank test is\n  more reliable.\n- The mean difference is the target quantity; if the distribution of differences is\n  extremely right-skewed (e.g., cost differences in patients with rare high-cost\n  events), the mean is an unstable estimator and a gamma GLM with a subject random\n  effect or a bootstrap CI may be preferable.\n\n**When to use**\n\n- *Two time points, continuous outcome, same subjects at both times*: the paired\n  t-test is the default choice. 
Examples: lab values before and after drug initiation;\n  PPPM costs in the 6 months before versus 6 months after index event; patient-reported\n  outcome scores at baseline and follow-up.\n- *One-to-one matched case-control or cohort studies with a continuous outcome*: treat\n  matched pairs as the unit of analysis and compute within-pair differences.\n- *Crossover trials with two treatment periods*: within-subject comparison of the two\n  period responses is a direct application of the paired t-test.\n- *When n >= 15 and the differences are approximately symmetric*: the normality\n  assumption is usually adequately met or protected by the CLT.\n- *When a mean difference is the clinically or economically meaningful estimand*: for\n  payer or HTA decision-making, the expected per-patient cost difference is the input\n  to a budget-impact model, and the paired t-test estimates it directly.\n\n**When NOT to use**\n\n- *Independent groups*: do not apply the paired t-test to unpaired data. Using a\n  paired test when there is no within-subject or within-pair link makes the result\n  depend on an arbitrary pairing and gives up degrees of freedom with no variance\n  reduction in return; use Welch's two-sample t-test instead.\n- *More than two time points*: three or more repeated measurements require linear\n  mixed models, MMRM, or a repeated-measures ANOVA. 
Reducing three time points to\n  a single pre-post difference discards information and introduces arbitrary choices\n  about which time point is \"post.\"\n- *Skewed differences at small n*: if n < 15 and the histogram of differences is\n  clearly right-skewed or has outliers, the Wilcoxon signed-rank test is the\n  rank-based nonparametric alternative that is valid without the normality assumption\n  on the differences.\n- *Pre-post design without a concurrent control when secular trend or regression to\n  the mean is plausible*: the paired t-test will give a valid p-value for the\n  hypothesis \"the mean difference is zero\" — but the observed difference may reflect\n  time trends, seasonal effects, or regression to the mean rather than a causal effect\n  of any intervention. Route to a difference-in-differences design or an interrupted\n  time series if causal inference is the goal. Report the pre-post estimate as\n  descriptive only.\n- *Binary or ordinal outcomes*: for paired binary outcomes (e.g., readmission yes/no\n  before and after a policy), use McNemar's test. For paired ordinal outcomes, use the\n  Wilcoxon signed-rank test.\n\n**Interpreting the output**\n\nIn the worked example, six patients enrolled in an ED diversion care management program\naveraged 2.0 fewer emergency department visits in the 12 months after enrollment compared\nwith the 12 months before (mean difference d̄ = 2.0). The standard deviation of within-\npatient differences is 0.894, the standard error is 0.365, the t statistic is 5.48 on\ndf = 5, the two-sided p-value is approximately 0.003, and the 95% CI on the mean paired\ndifference is [1.06, 2.94] visits.\n\n*(1) Formal interpretation.* The observed mean within-patient reduction of 2.0 visits is\n5.48 standard errors above zero. Under the null hypothesis that the population mean of\nwithin-patient differences is zero, a t statistic this large on df = 5 arises by chance in\napproximately 0.3% of samples. 
The 95% CI [1.06, 2.94] is constructed so that if the study\nwere repeated many times, approximately 95% of such intervals would contain the true mean\npaired difference — it is consistent with a true reduction of between about 1 and 3 visits\nper patient.\n\n*(2) Practical interpretation.* All six patients reduced their visits after enrollment and\nthe average reduction of 2.0 visits is statistically distinguishable from zero. However,\nthis is a pre-post design without a concurrent control group, and these patients were\nenrolled specifically because of high baseline utilization. Some or all of the reduction\nmay reflect regression to the mean rather than a program effect. A difference-in-differences\ndesign comparing this cohort to a matched unexposed group over the same period would be\nneeded to attribute the reduction causally to the care management program.",
    "primary_category": "Inferential_Statistics",
    "tags": [
      "statistics",
      "primitive",
      "hypothesis-testing",
      "paired-data",
      "pre-post",
      "t-test",
      "parametric",
      "within-subject",
      "matched-pairs"
    ],
    "applies_to_study_types": [
      "cohort_retrospective",
      "cohort_prospective",
      "rct",
      "cross_sectional",
      "active_comparator_new_user",
      "claims_analysis",
      "ehr_study",
      "registry_study"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "primary",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.3102/10769986022003349",
        "url": "https://doi.org/10.3102/10769986022003349",
        "citation_text": "Zimmerman DW. A note on interpretation of the paired-samples t test. Journal of Educational and Behavioral Statistics. 1997;22(3):349-360.",
        "year": 1997,
        "authors_short": "Zimmerman",
        "notes": "Classic methodological note explaining the paired t-test as a one-sample test on differences, clarifying common misinterpretations about what the test assumes and what it estimates, and showing the variance decomposition that explains the power gain from pairing. Directly introduces the key insight that pairing converts a two-sample problem into a one-sample problem on within-subject differences."
      },
      {
        "role": "explain",
        "doi": "10.1136/bmj.309.6962.1128",
        "url": "https://doi.org/10.1136/bmj.309.6962.1128",
        "citation_text": "Bland JM, Altman DG. Statistics notes: Matching. BMJ. 1994;309(6962):1128.",
        "year": 1994,
        "authors_short": "Bland & Altman",
        "notes": "BMJ Statistics Notes entry on matched designs and the role of the within-pair correlation. Explains why the paired analysis is required when data are matched and clarifies the connection between the matched design and the power gain from removing between-subject variability. An accessible reference for clinical researchers working with paired or matched data."
      },
      {
        "role": "demonstrate",
        "doi": "10.3389/fpsyg.2013.00863",
        "url": "https://doi.org/10.3389/fpsyg.2013.00863",
        "citation_text": "Lakens D. Calculating and reporting effect sizes to facilitate cumulative science: a practical primer for t-tests and ANOVAs. Frontiers in Psychology. 2013;4:863.",
        "year": 2013,
        "authors_short": "Lakens",
        "notes": "Practical primer on effect size calculation for t-tests including Cohen's d_z for the paired t-test (the standardized mean difference based on the SD of differences). Demonstrates how to compute and report effect sizes alongside p-values for both paired and unpaired t-tests, which is required practice for HEOR submissions to payers and HTA bodies."
      },
      {
        "role": "use",
        "doi": "10.3102/10769986022003349",
        "url": "https://doi.org/10.3102/10769986022003349",
        "citation_text": "Zimmerman DW. A note on interpretation of the paired-samples t test. Journal of Educational and Behavioral Statistics. 1997;22(3):349-360.",
        "year": 1997,
        "authors_short": "Zimmerman",
        "notes": "Also demonstrates concrete numerical application: walking through the pairing, the variance reduction, and the correct interpretation of the test statistic in settings where paired versus independent-samples analysis would reach different conclusions."
      }
    ],
    "plain_language_summary": "The paired t-test measures whether the average change within the same patients — between a before measurement and an after measurement — is large enough that it is unlikely to be due to chance alone. Instead of comparing two separate groups, it computes each patient's individual change (the difference), then asks whether the average of those changes is significantly different from zero. Removing between-patient variation by working with differences rather than raw values makes this test considerably more sensitive than comparing two independent groups of the same size. One important caveat: in a simple before-after claims study with no comparison group, a statistically significant change does not prove the change was caused by any intervention — it may simply reflect a background trend or a quirk of selecting patients with extreme baseline values.",
    "key_terms": [
      {
        "term": "within-pair difference",
        "definition": "The value computed by subtracting each patient's \"before\" measurement from their \"after\" measurement; the paired t-test performs all its arithmetic on these individual differences, not on the original measurements."
      },
      {
        "term": "pre-post design",
        "definition": "A study where the same patients are measured at two time points — before and after an event, treatment, or policy — and the question is whether anything changed on average across the group."
      },
      {
        "term": "correlation between pairs",
        "definition": "How strongly a patient's pre-measurement predicts their post-measurement; a high correlation (patients who start high stay relatively high) means the paired design gains a lot of power compared to treating the two measurements as if they came from unrelated people."
      },
      {
        "term": "regression to the mean",
        "definition": "The tendency for patients selected because of an extreme measurement (very high costs, very high lab values) to have a less extreme measurement at the next observation purely by chance, even without any treatment — making a simple before-after comparison look like an improvement when none occurred."
      },
      {
        "term": "standard error of the mean difference",
        "definition": "The standard deviation of the within-person differences divided by the square root of the number of pairs; it measures how precisely the sample mean difference estimates the true population mean difference."
      },
      {
        "term": "degrees of freedom",
        "definition": "For the paired t-test, this equals the number of pairs minus one (n minus 1); it controls how wide the t-distribution's tails are when computing p-values and confidence intervals."
      }
    ],
    "worked_example": {
      "scenario": "A health plan enrolled six high-utilizer patients in an emergency-department (ED) diversion care management program. An analyst wants to know whether the mean number of ED visits per patient changed after enrollment, using a paired t-test on the pre-program (12 months before) and post-program (12 months after) visit counts. The six patients had ED visit counts of 5, 3, 6, 4, 7, and 3 in the pre-period, and 3, 2, 3, 2, 4, and 2 in the post-period. The analyst computes within-patient differences, then manually walks through the paired t-test arithmetic.",
      "dataset": {
        "caption": "ED visits in the 12 months before and 12 months after enrollment in a care management program. The \"difference\" column is pre minus post for each patient.",
        "columns": [
          "patient_id",
          "pre_visits",
          "post_visits",
          "difference"
        ],
        "rows": [
          [
            "P1",
            5,
            3,
            2
          ],
          [
            "P2",
            3,
            2,
            1
          ],
          [
            "P3",
            6,
            3,
            3
          ],
          [
            "P4",
            4,
            2,
            2
          ],
          [
            "P5",
            7,
            4,
            3
          ],
          [
            "P6",
            3,
            2,
            1
          ]
        ]
      },
      "steps": [
        "Step 1 — Compute the mean difference. Sum the six within-patient differences: 2 + 1 + 3 + 2 + 3 + 1 = 12. Divide by n = 6: mean difference d_bar = 12/6 = 2.0 visits. Every patient reduced their ED visits; the average reduction is 2 visits.",
        "Step 2 — Compute the variance of the differences. For each difference, subtract d_bar = 2.0 and square the result: (2-2)^2=0, (1-2)^2=1, (3-2)^2=1, (2-2)^2=0, (3-2)^2=1, (1-2)^2=1. Sum of squared deviations = 0+1+1+0+1+1 = 4. Sample variance s_d^2 = 4/(n-1) = 4/5 = 0.8.",
        "Step 3 — Compute the standard deviation and standard error. Standard deviation of differences: s_d = sqrt(0.8) = 0.894. Standard error: SE = s_d / sqrt(n) = 0.894 / sqrt(6) = 0.894 / 2.449 = 0.365.",
        "Step 4 — Compute the t-statistic. t = d_bar / SE = 2.0 / 0.365 = 5.48. Degrees of freedom = n - 1 = 6 - 1 = 5.",
        "Step 5 — Determine the p-value and confidence interval. With t = 5.48 and df = 5, the two-sided p-value is approximately 0.003, well below alpha = 0.05. The 95% CI uses the critical value t_{0.975, 5} = 2.571: lower = 2.0 - 2.571 * 0.365 = 2.0 - 0.94 = 1.06; upper = 2.0 + 2.571 * 0.365 = 2.0 + 0.94 = 2.94. The 95% CI is approximately [1.1, 2.9] visits.",
        "Step 6 — Interpret with caution. The mean reduction of 2.0 ED visits per patient (95% CI: 1.1 to 2.9) is statistically significant. However, this is a pre-post design without a concurrent control group. Patients were enrolled because they had very high baseline utilization; regression to the mean would predict some decline even without the program. A difference-in-differences design comparing this cohort to a matched unexposed group in the same period would be needed to attribute the reduction causally to the care management program."
      ],
      "result": "Mean difference (pre minus post) = 12/6 = 2.0 ED visits. Sum of squared deviations = 4.0; s_d^2 = 4/5 = 0.8; s_d = 0.894; SE = 0.894/2.449 = 0.365. t = 2.0/0.365 = 5.48, df = 5, p approximately 0.003 (two-sided). 95% CI: 2.0 +/- 2.571*0.365 = [1.06, 2.94] visits. Conclusion: statistically significant average reduction of 2 ED visits per patient, but a concurrent control group is needed to rule out regression to the mean before attributing the reduction to the program."
    },
    "prerequisites": [
      "inferential-statistics-foundations",
      "parametric-vs-nonparametric-tests",
      "two-sample-t-test"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Pre-post claims analysis (PPPM cost or utilization)",
        "description": "The most common RWE application. Define an index event (new diagnosis code, hospital admission, drug initiation). Set a pre-period window (e.g., 6 or 12 months before index) and a post-period window (e.g., 6 or 12 months after index, excluding a washout if needed). Compute the per-patient-per-month (PPPM) cost or utilization rate in each window. The within-patient difference is the outcome; apply the paired t-test to test whether the mean difference is nonzero.",
        "edge_cases": [
          "Require that each patient has at least one claim in both the pre and post window to be included; patients with zero activity in either window may be missing or may genuinely have no utilization, and these must be distinguished.",
          "At large n (> 5,000 patients), the paired t-test will reject the null for trivially small differences. Report the effect estimate (mean PPPM difference in dollars or visits) and its CI alongside the p-value; the CI is what matters for budget-impact modelling.",
          "If cost differences are highly right-skewed (a few patients with catastrophic post-period costs), the mean difference is unstable. Consider a bootstrap CI or a gamma GLM with patient random effect as the primary analysis and report the paired t-test as a sensitivity check."
        ],
        "data_source_notes": "Claims: build pre and post windows around the index date. Exclude claims after the end of enrollment or after death. Apply the paired t-test to per-patient totals or PPPM rates; report n patients with complete pre and post data alongside the n excluded for incomplete follow-up."
      },
      {
        "name": "Lab values before and after drug initiation (EHR)",
        "description": "Identify the earliest date each patient had the drug dispensed or prescribed (index date). Extract the most recent lab value before index (within a defined lookback window, e.g., 12 months) and the first lab value after index (within a defined post-window, e.g., 3-6 months). Compute within-patient differences; apply the paired t-test. Common outcomes include HbA1c, LDL-C, eGFR, and systolic blood pressure.",
        "edge_cases": [
          "Lab values captured at clinical visits are not random samples; visits are more likely when the patient is unwell or when monitoring is indicated. This informative visit process can bias the pre and post measurements.",
          "Choose the lookback window carefully: a very wide window may capture a lab value that predates the disease the drug treats; a very narrow window may exclude many patients."
        ],
        "data_source_notes": "EHR: extract lab results tables joined to the medication table on patient ID and date. Handle multiple pre- and post-index labs per patient by selecting the closest-to-index result in each window; document the selection rule."
      },
      {
        "name": "Matched cohort analysis",
        "description": "In a 1:1 propensity-score-matched or covariate-matched cohort study, each exposed patient is linked to one matched unexposed patient. The paired t-test can be applied to the within-pair differences in the outcome (e.g., total cost, event rate, length of stay). This respects the dependency created by matching and is equivalent to a one-sample test on pair differences.",
        "edge_cases": [
          "Use the paired t-test (or Wilcoxon signed-rank as a sensitivity check) rather than an unpaired test on the matched sample; applying an unpaired t-test to matched pairs inflates the SE and reduces power.",
          "If pairs are broken because one member is lost to follow-up, a complete-case paired analysis introduces selection bias. Consider robust variance estimators or a regression model with a cluster term for the matched pair instead."
        ],
        "data_source_notes": "After matching, store the match identifier for each pair. Compute within-pair differences by merging on the match ID and subtracting exposed minus unexposed. Apply the paired t-test to the vector of differences."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "two-sample-t-test",
        "pros_of_this": "The paired t-test removes between-subject variability from the error term, yielding substantially narrower confidence intervals and greater power than the two-sample Welch t-test when the pre-post correlation is moderate to high.",
        "cons_of_this": "Requires that each subject contributes exactly two measurements and that pairs are clearly defined; an unpaired design that happened to enroll the same n patients would gain degrees of freedom (2n-2 vs n-1) if the correlation were near zero.",
        "when_to_prefer": "Whenever the data are intrinsically paired — same patient at two time points, matched case-control, crossover trial. Never apply the paired t-test when the two groups are independent; use the two-sample Welch t-test instead."
      },
      {
        "compared_to": "wilcoxon-signed-rank-test",
        "pros_of_this": "The paired t-test produces an interpretable mean difference with a CI in the original units (dollars, visits, lab units); easier to plug into a cost model or interpret for a clinical audience.",
        "cons_of_this": "The Wilcoxon signed-rank test does not require the differences to be normally distributed; at small n with skewed or heavy-tailed differences it is more robust.",
        "when_to_prefer": "Use the paired t-test when n is moderate to large (>= 15) and differences are approximately symmetric, or when a mean difference is the target estimand. Use Wilcoxon signed-rank when n is small (< 15), differences are clearly skewed, or as a pre-specified sensitivity analysis."
      },
      {
        "compared_to": "mcnemar-test",
        "pros_of_this": "The paired t-test handles continuous outcomes directly and produces a mean difference estimate.",
        "cons_of_this": "McNemar's test is the correct paired test for binary outcomes (e.g., readmission yes/no before and after a policy); the paired t-test is not appropriate for proportions from paired binary data.",
        "when_to_prefer": "Use the paired t-test for continuous or count outcomes measured on the same patients. Use McNemar's test for paired binary outcomes."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Build a pre-period and post-period window around the index date for each patient. Compute per-patient totals (costs, visits) for each window. Require continuous enrollment in both windows to minimize informative censoring. At large n, always report the mean difference and 95% CI as primary outputs; the p-value alone is insufficient for budget-impact modelling. For skewed cost differences, supplement with a bootstrap CI or gamma GLM with patient random effect.",
      "ehr": "Lab values and vital signs (HbA1c, LDL, BP, BMI) are natural paired outcomes around an index date. Informative visit processes can bias results; require that labs fall within a defined window before and after index. For eGFR, pre-post comparisons require care about creatinine estimation method consistency.",
      "registry": "Disease-specific registries often capture baseline and follow-up assessments at pre-defined timepoints. Confirm that the follow-up window matches the registry protocol before applying the paired t-test. Missing follow-up visits are common; document the fraction of enrolled patients with complete pairs.",
      "primary": "Randomized controlled trials and prospective cohort studies with planned pre-post assessments are ideal settings. Confirm that no treatment-period carryover effect exists in crossover designs; if washout is inadequate, carryover confounds the within-patient difference.",
      "linked": "Linked claims-EHR cohorts allow pre-post cost comparisons (claims) alongside biomarker changes (EHR) in the same patients. The paired design is consistent across both data streams; define the same index date for both and apply the paired t-test separately to cost differences and biomarker differences."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import math\nfrom scipy import stats\n\n# ── Dataset: ED visits pre and post care management enrollment ──\n# Six patients; pre-period and post-period 12-month visit counts.\npre  = [5, 3, 6, 4, 7, 3]\npost = [3, 2, 3, 2, 4, 2]\n\n# ── Step 1: compute within-patient differences (pre minus post) ──\ndiffs = [p - q for p, q in zip(pre, post)]   # [2, 1, 3, 2, 3, 1]\nn = len(diffs)\n\n# ── Step 2: mean difference ──\nd_bar = sum(diffs) / n                        # 12/6 = 2.0\nprint(f\"Differences:    {diffs}\")\nprint(f\"Sum of diffs:   {sum(diffs)}  (n={n})\")\nprint(f\"Mean diff (d̄): {d_bar:.4f}\")\n\n# ── Step 3: variance and SD of differences ──\nss = sum((d - d_bar)**2 for d in diffs)       # sum of squared deviations = 4.0\ns2 = ss / (n - 1)                             # sample variance = 4/5 = 0.8\ns  = math.sqrt(s2)                            # SD of diffs = sqrt(0.8) = 0.8944\nse = s / math.sqrt(n)                         # SE = 0.8944/sqrt(6) = 0.3651\nprint(f\"\\nSum sq deviations: {ss:.4f}\")\nprint(f\"Variance (s²):     {s2:.4f}\")\nprint(f\"Std dev (s):       {s:.4f}\")\nprint(f\"Std error (SE):    {se:.4f}\")\n\n# ── Step 4: t-statistic and two-sided p-value (manual) ──\nt_manual = d_bar / se\ndf = n - 1\np_manual = stats.t.sf(abs(t_manual), df=df) * 2\nprint(f\"\\nt-statistic (manual): {t_manual:.4f}\")\nprint(f\"Degrees of freedom:   {df}\")\nprint(f\"p-value (two-sided):  {p_manual:.4f}\")\n\n# ── Step 5: 95% confidence interval ──\nt_crit = stats.t.ppf(0.975, df=df)           # 2.5706 for df=5\nci_lo = d_bar - t_crit * se\nci_hi = d_bar + t_crit * se\nprint(f\"t-critical (0.975):   {t_crit:.4f}\")\nprint(f\"95% CI:               [{ci_lo:.2f}, {ci_hi:.2f}] visits\")\n\n# ── Step 6: confirm using scipy.stats.ttest_rel ──\n# ttest_rel computes the paired t-test directly on the original arrays.\nt_scipy, p_scipy = stats.ttest_rel(pre, post)\nprint(f\"\\nscipy.stats.ttest_rel:  t={t_scipy:.4f}, p={p_scipy:.4f}\")\nprint(\"(Identical to 
manual calculation — confirms the implementation.)\")\n\n# ── Interpretation note ──\nprint(f\"\\nMean reduction: {d_bar:.1f} ED visits per patient (95% CI: {ci_lo:.1f}, {ci_hi:.1f})\")\nprint(\"WARNING: No concurrent control — regression to the mean cannot be ruled out.\")\nprint(\"         A difference-in-differences design would be needed for causal inference.\")",
        "description": "Paired t-test using scipy.stats.ttest_rel, with explicit manual calculation of\nthe within-patient differences, mean difference, standard deviation of differences,\nstandard error, and 95% confidence interval. Uses the six-patient ED visit dataset\nfrom the beginner worked example. Demonstrates that the paired t-test is a\none-sample t-test on the differences, shows the variance formula, and prints\nthe full result table. No external dependencies beyond scipy.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "# ── Dataset: ED visits pre and post care management enrollment ──\npre  <- c(5, 3, 6, 4, 7, 3)\npost <- c(3, 2, 3, 2, 4, 2)\nn    <- length(pre)\n\n# ── Step 1: within-patient differences ──\ndiffs <- pre - post\ncat(\"Differences:\", diffs, \"\\n\")\ncat(\"Sum:\", sum(diffs), \"  Mean:\", mean(diffs), \"\\n\")\n\n# ── Step 2: variance of differences (manual) ──\nss <- sum((diffs - mean(diffs))^2)\ns2 <- ss / (n - 1)\ns  <- sqrt(s2)\nse <- s / sqrt(n)\ncat(sprintf(\"Variance: %.4f  SD: %.4f  SE: %.4f\\n\", s2, s, se))\ncat(sprintf(\"t = %.4f  df = %d\\n\", mean(diffs)/se, n - 1))\n\n# ── Step 3: paired t-test via t.test ──\n# paired = TRUE: t.test internally computes pre - post and runs a one-sample test.\nres_paired <- t.test(pre, post, paired = TRUE, conf.level = 0.95)\ncat(\"\\n--- t.test(paired = TRUE) ---\\n\")\nprint(res_paired)\n\n# ── Step 4: equivalence — one-sample t-test on differences ──\nres_one <- t.test(diffs, mu = 0, conf.level = 0.95)\ncat(\"\\n--- t.test(diffs, mu=0) — identical result ---\\n\")\nprint(res_one)\n\n# ── Step 5: effect size (Cohen's d_z) ──\n# For paired t-test, Cohen's d_z = mean(diffs) / sd(diffs)\ndz <- mean(diffs) / sd(diffs)\ncat(sprintf(\"\\nCohen's d_z (effect size for paired design): %.4f\\n\", dz))\ncat(\"Interpretation: d_z >= 0.8 is a large effect by convention.\\n\")\n\n# ── Wilcoxon signed-rank as a sensitivity check ──\nwsr <- wilcox.test(pre, post, paired = TRUE, exact = FALSE)\ncat(sprintf(\"\\nWilcoxon signed-rank (sensitivity): W=%.1f, p=%.4f\\n\",\n            wsr$statistic, wsr$p.value))\ncat(\"If p-values from paired t-test and Wilcoxon are consistent, the normality\\n\")\ncat(\"assumption is not load-bearing for this dataset.\\n\")",
        "description": "Paired t-test using t.test(paired = TRUE) in base R. Shows explicit calculation\nof differences, then the t.test call, and prints the mean difference with 95% CI.\nAlso demonstrates the equivalence to a one-sample t-test on the differences\nusing t.test(diffs, mu = 0). Uses the same six-patient ED visit dataset as the\nPython implementation. No external packages required.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* ── Create dataset: ED visits pre and post enrollment ── */\ndata work.ed_visits;\n  input patient_id $ pre post;\n  diff = pre - post;     /* within-patient difference: pre minus post */\n  datalines;\nP1 5 3\nP2 3 2\nP3 6 3\nP4 4 2\nP5 7 4\nP6 3 2\n;\nrun;\n\n/* ── Verify differences manually ── */\nproc means data=work.ed_visits n mean std stderr;\n  var diff;\n  title 'Descriptives for within-patient differences (pre - post)';\nrun;\n\n/* ── Paired t-test: Method 1 — PROC TTEST with PAIRED statement ──\n   SAS reads the two variables directly and computes within-pair differences.\n   The \"pre - post\" notation specifies the direction of subtraction.          */\nproc ttest data=work.ed_visits alpha=0.05;\n  paired pre*post;          /* tests whether mean(pre - post) = 0             */\n  title 'Paired t-test: pre vs post ED visits (PROC TTEST PAIRED)';\nrun;\n/* Key output rows:\n   - \"Difference\" row: mean difference, 95% CI, t-statistic, DF, p-value\n   - Satterthwaite and Pooled rows are for the equivalent two-sample comparison;\n     use the PAIRED output row for the actual paired test result.             */\n\n/* ── Paired t-test: Method 2 — one-sample PROC TTEST on the diff variable ──\n   Equivalent to Method 1; confirms the paired t-test is a one-sample test\n   on the within-patient differences with H0: mu_diff = 0.                   */\nproc ttest data=work.ed_visits h0=0 sides=2;\n  var diff;\n  title 'One-sample t-test on differences (equivalent to paired t-test)';\nrun;\n\n/* ── Wilcoxon signed-rank test as nonparametric sensitivity check ──\n   PROC UNIVARIATE prints the Wilcoxon signed-rank statistic and p-value.    */\nproc univariate data=work.ed_visits;\n  var diff;\n  title 'Wilcoxon signed-rank sensitivity check on within-patient differences';\nrun;\n/* Report both the paired t-test p-value and the Wilcoxon signed-rank p-value.\n   If they agree, the normality assumption is not load-bearing for inference. */",
        "description": "Paired t-test in SAS using PROC TTEST with the PAIRED statement. Demonstrates\ncomputing within-patient differences in a DATA step, then using PROC TTEST PAIRED\nto obtain the mean difference, t-statistic, p-value, and 95% CI. Also shows\nPROC UNIVARIATE on the differences to obtain the Wilcoxon signed-rank statistic\nas a nonparametric sensitivity check. Uses the same six-patient ED visit dataset.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Q[Continuous outcome measured<br/>on the same subjects twice?] --> Yes[\"Yes — use paired design\"]\n  Q --> No[\"No — use two-sample Welch t-test<br/>or Mann-Whitney U\"]\n  Yes --> Diffs[\"Compute within-patient differences<br/>d_i = post_i - pre_i\"]\n  Diffs --> Check[\"Are differences approx normal?<br/>(histogram, Q-Q plot; n>=15)\"]\n  Check --> Normal[\"Yes: Paired t-test<br/>t = d̄ / (s_d/sqrt(n))\"]\n  Check --> Skewed[\"No (small n, skewed diffs):<br/>Wilcoxon signed-rank test\"]\n  Normal --> Interp[\"Report: mean diff + 95% CI<br/>in original units\"]\n  Interp --> Caution[\"⚠ Pre-post only design?<br/>Rule out secular trend +<br/>regression to the mean\"]\n  Caution --> DID[\"If causal inference needed:<br/>difference-in-differences<br/>or controlled pre-post\"]",
        "caption": "Decision path for paired continuous data: check pairing, check normality of differences, apply paired t-test or Wilcoxon, and flag regression-to-the-mean risk in uncontrolled pre-post designs.",
        "alt_text": "Flowchart: from 'same subjects measured twice' through within-patient differences, normality check, paired t-test or Wilcoxon signed-rank, and caution flags for regression to the mean and secular trends in uncontrolled pre-post designs.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "requires",
        "target_slug": "parametric-vs-nonparametric-tests",
        "notes": "The paired t-test is a specific parametric test; the parent entry explains when parametric versus nonparametric tests are appropriate and introduces the broader decision framework. The paired t-test is the paired/pre-post cell of that decision tree."
      },
      {
        "relation_type": "requires",
        "target_slug": "inferential-statistics-foundations",
        "notes": "Foundational concepts — null hypothesis, p-value, type-I error, confidence intervals, and the t-distribution — must be understood before interpreting a paired t-test result."
      },
      {
        "relation_type": "see_also",
        "target_slug": "two-sample-t-test",
        "notes": "The unpaired counterpart for independent groups. Pairing converts a two-sample problem into a one-sample problem on within-pair differences, removing between-subject variance from the error term. When pre-post correlation is high, the paired design is substantially more powerful than applying the two-sample t-test to the same data."
      },
      {
        "relation_type": "see_also",
        "target_slug": "wilcoxon-signed-rank-test",
        "notes": "The nonparametric rank-based counterpart for paired continuous data. Use when the within-patient differences are clearly non-normal and n is small (< 15), or as a pre-specified sensitivity analysis alongside the paired t-test primary analysis."
      },
      {
        "relation_type": "see_also",
        "target_slug": "mcnemar-test",
        "notes": "The paired-binary analog: McNemar's test is appropriate when the pre-post outcome is binary (e.g., hospitalization yes/no before and after a policy change). The paired t-test handles continuous outcomes; McNemar handles paired binary proportions."
      }
    ],
    "aliases": [
      "dependent samples t-test",
      "matched pairs t-test",
      "pre-post t-test",
      "within-subject t-test",
      "paired samples t-test"
    ],
    "complexity": "foundational",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "parametric-vs-nonparametric-tests",
    "name": "Parametric and Nonparametric Tests",
    "short_definition": "A fundamental choice in statistical hypothesis testing: parametric tests (t-test, ANOVA, chi-square) assume the data come from a distributional family and make inferences about that family's parameters, while nonparametric tests (Mann-Whitney, Kruskal-Wallis, Wilcoxon signed-rank, Fisher exact) replace raw values with ranks or permutation counts and make fewer distributional assumptions — the right choice depends on data type, sample size, the specific question being asked, and what effect estimate is actually needed for decision-making.",
    "long_description": "**What \"parametric\" and \"nonparametric\" actually mean**\n\nA *parametric* test assumes the data were drawn from a distributional family — most often\nGaussian (normal) — and constructs a test statistic whose sampling distribution is derived from\nthat assumption (a t-distribution, F-distribution, or chi-squared distribution under the null).\nThe t-test does not require that your data are exactly normal; it requires that the *sample mean*\nfollows a normal sampling distribution, which the Central Limit Theorem (CLT) guarantees for\nlarge enough n regardless of the underlying data distribution. This distinction matters\nenormously in large real-world datasets.\n\nA *nonparametric* test replaces the raw observations with their ranks (or counts, or permutation\ncounts) and derives the null distribution from the combinatorics of those ranks rather than from\na parametric family. This makes it robust to heavy tails, outliers, and severely skewed\ndistributions. But it also changes what is being tested. A common misunderstanding is that the\nMann-Whitney U test \"tests whether the medians are equal.\" This is only true under the\nrestrictive assumption that the two distributions differ only in location. In general, the\nMann-Whitney test a hypothesis of *stochastic dominance*: under the null, the probability that\na randomly chosen observation from group A exceeds a randomly chosen observation from group B\nequals 0.5. 
That is a subtly different question from a difference in means or medians, and the\ndirection of that difference deserves explicit acknowledgment in any analysis report.\n\n**The decision tree by data type and question**\n\nThe canonical test-selection logic by situation:\n\n- *Continuous outcome, two independent groups*: start with Welch's t-test (does not assume equal\n  variances) as the default; use Mann-Whitney U / Wilcoxon rank-sum as the nonparametric\n  alternative when distributions are severely non-normal at small-to-moderate n.\n- *Continuous outcome, paired / pre-post measurements*: paired t-test (parametric) vs Wilcoxon\n  signed-rank test (nonparametric).\n- *Continuous outcome, three or more groups*: one-way ANOVA (parametric) vs Kruskal-Wallis\n  H test (nonparametric); post-hoc pairwise comparisons follow if overall test is significant.\n- *Binary or categorical outcome, two or more groups*: Pearson chi-square test when expected\n  cell counts are ≥5 in each cell; Fisher's exact test when expected counts are small (rule of\n  thumb: any expected count < 5). McNemar's test for paired binary data (e.g., pre-post\n  classification on the same patients).\n- *Ordinal outcome*: Mann-Whitney for two groups; Kruskal-Wallis for multiple groups; do not\n  treat ordinal scores as continuous unless the intervals are known to be equal.\n- *Correlation*: Pearson r (parametric, assumes bivariate normality, measures linear\n  association) vs Spearman rho (rank-based, measures monotonic association, robust to outliers\n  and non-normality).\n- *Skewed continuous outcomes in HEOR (costs, utilization)*: neither a bare rank test nor a\n  t-test on raw data is the modern standard for inference about means; generalized linear models\n  with a log link and gamma or Tweedie variance function are the preferred approach because they\n  target the mean on the original scale while respecting the distributional shape. 
Reserve rank\n  tests for descriptive comparisons and sensitivity checks rather than primary inference.\n\n**The Fagerland paradox: when does normality matter least?**\n\nA striking result from Fagerland (2012) is that the t-test's normality assumption is most\ndefensible — because the CLT protects the sampling distribution of the mean — precisely in the\nlarge samples where analysts most often worry about non-normality and reach for a nonparametric\ntest. Conversely, at small samples (n < 15 per group) where the CLT has not yet kicked in, the\nt-test's normality assumption is most critical, yet small studies often show \"borderline normal\"\non the plot and proceed with a t-test anyway. The practical implication: in a 50,000-patient\nclaims dataset, a t-test on skewed total costs is likely valid for inference on means; in a\n12-patient pilot study with a heavy-tailed outcome, a nonparametric test or a transformation is\ngenuinely safer.\n\n**Welch's t-test vs Student's t-test**\n\nStudent's t-test assumes equal variances in the two groups. Welch's t-test relaxes that\nassumption by using a separate variance estimate per group and adjusting the degrees of freedom\n(the Welch-Satterthwaite equation). Simulation studies consistently show that using Welch as the\ndefault — rather than applying a preliminary test for equal variances and then choosing — yields\nbetter type-I error control across a broad range of scenarios (Delacre, Lakens, and Leys 2017).\nThe convention of running Levene's test first and then choosing Student vs Welch is statistically\nsuboptimal because the preliminary test adds its own error. The actionable rule: always use\nWelch's t-test for two-sample continuous comparisons unless there is an explicit theoretical\nreason to assume equal variances.\n\n**Normality testing is counterproductive in most applied settings**\n\nIt is common practice to run a Shapiro-Wilk test first and \"confirm\" normality before\nproceeding to a t-test. 
This practice has two opposite failure modes: at small n (< 30), the\nShapiro-Wilk test has low power and will fail to detect non-normality that would genuinely\nviolate the t-test assumption; at large n (> 500), it will reject normality for trivial\ndepartures — a slightly heavy tail or minor skewness — that the CLT renders irrelevant for\ninference on means. The recommended approach in applied work is to decide by inspection\n(histogram, Q-Q plot) combined with subject-matter knowledge about the outcome distribution\n(costs are almost always right-skewed; many lab values are approximately normal; time-to-event\nis rarely normal) rather than relying on a formal normality test to make the parametric/nonparametric\nchoice.\n\n**Effect estimates and confidence intervals: what HEOR decision-making needs**\n\nA bare p-value from a rank test rarely provides what health economics and outcomes research\nneeds. The deliverable for payers, HTA bodies, and clinical decision-makers is an effect\nestimate with a confidence interval on a clinically meaningful scale:\n\n- For a t-test or ANOVA: the mean difference (with 95% CI) is directly interpretable.\n- For a Mann-Whitney test: the Hodges-Lehmann estimator (the median of all pairwise differences)\n  gives a location estimate and CI that pairs naturally with the Mann-Whitney test; it is neither\n  the difference in means nor the difference in medians, but the \"shift\" that best explains the\n  rank ordering.\n- For a chi-square or Fisher test: the risk difference, relative risk, or odds ratio with CI is\n  the useful quantity; the chi-square p-value alone does not tell the audience how large the\n  association is.\n- For cost data: the target estimand for decision modelling is almost always the *mean* cost (not\n  the median), because means aggregate correctly to population totals. 
A gamma GLM with log link\n  produces a mean-ratio estimate with CI on the original dollar scale and is almost universally\n  preferred over a Mann-Whitney test for primary inference on cost outcomes.\n\n**Pros, cons, and trade-offs**\n\n*Parametric tests (t-test, ANOVA, chi-square)*:\n- Pros: directly interpretable effect estimates (mean differences, proportions); well-understood\n  power properties; straightforward extension to regression for covariate adjustment; exact CI\n  formulas; computationally trivial and widely implemented. Robust to non-normality at large n\n  (CLT). Welch's t-test is robust to variance heterogeneity.\n- Cons: at small n with severely non-normal data, the distributional assumption may not hold and\n  type-I error can be inflated; the chi-square approximation breaks down with sparse cells.\n- When to prefer: large n, approximately symmetric continuous outcomes, binary/count outcomes\n  with adequate cell sizes, when a mean difference is the target quantity.\n\n*Nonparametric tests (Mann-Whitney, Kruskal-Wallis, Wilcoxon signed-rank, Fisher exact)*:\n- Pros: valid under broad distributional assumptions; robust to outliers and heavy tails; exact\n  in small samples (Fisher, Wilcoxon); well-suited when the rank ordering is intrinsically\n  meaningful (ordinal scales, bounded scores).\n- Cons: lower power than parametric tests when the parametric assumption holds; the effect\n  estimate (Hodges-Lehmann shift, rank-biserial correlation) is less intuitive than a mean\n  difference; rank tests are not directly extensible to covariate adjustment (regression\n  analogues exist but are less familiar).\n- When to prefer: small n with clearly non-normal data; ordinal outcomes; sensitivity checks\n  alongside a parametric primary analysis; situations where outlier contamination is expected.\n\n*Transformations and GLMs as the modern alternative*: For outcomes that are inherently\nnon-normal — costs, utilization counts, event rates — transformations 
(log, square root) or\nGLMs (Poisson/negative-binomial for counts; gamma/Tweedie for costs; logistic for binary) are\noften superior to both naive parametric tests and rank tests because they (a) match the\ndistributional shape, (b) target inference on the original scale via back-transformation or\nmarginal effects, (c) accommodate covariate adjustment naturally, and (d) remain valid under\nconfounded observational designs when combined with weighting or matching.\n\n**When to use**\n\nUse parametric and nonparametric tests in their appropriate domain:\n\n- *Primary analyses of randomized contrasts* (RCTs, or observational studies where balance has\n  been achieved by design or weighting and the only remaining question is whether the observed\n  difference is compatible with chance): two-sample t-test or Welch t-test for continuous\n  outcomes; chi-square or Fisher for binary outcomes; ANOVA or Kruskal-Wallis for multi-group\n  comparisons.\n- *Baseline characteristic tables* (Table 1 in observational studies): chi-square for\n  categorical variables; t-test or Mann-Whitney for continuous variables. 
Note that formal\n  significance testing in Table 1 for an observational study is widely criticized (the question\n  is confounding balance, not statistical significance), so standardized mean differences are now\n  preferred; but the tests remain appropriate for RCTs and for pure description.\n- *Sensitivity analyses*: running a nonparametric test alongside a parametric primary analysis\n  provides a useful robustness check — if conclusions differ, the distributional assumption is\n  load-bearing and deserves scrutiny.\n- *Ordinal or bounded outcomes*: rank tests or ordinal regression rather than a t-test on\n  the scores.\n- *Small-n pilot studies* (n < 30 per group): nonparametric tests or exact tests are safer\n  defaults unless the outcome distribution is well-established as approximately normal from\n  prior work.\n\n**When NOT to use — and when these tests are actively misleading**\n\n- *As the primary analysis when confounding is uncontrolled*: an unadjusted t-test or Mann-Whitney\n  comparing treatment groups in an unbalanced observational dataset estimates a biased treatment\n  effect. These tests are for describing, sanity-checking, and randomized contrasts. For\n  confounded comparisons, route to propensity-score methods, g-methods, or regression adjustment.\n- *For cost data when the target estimand is the mean*: a Mann-Whitney test on costs tells you\n  about the rank ordering, not about the mean difference in dollars — which is what a budget-\n  impact model or payer needs. 
Use a gamma GLM or a bootstrap mean-difference estimator instead.\n- *Chi-square with expected counts below 5*: the chi-square approximation is unreliable in\n  sparse tables; use Fisher's exact test or a mid-p correction.\n- *Any unadjusted two-group test as the final word on an observational comparison*: even in a\n  matched or reweighted sample, reporting only a p-value without an effect estimate and CI leaves\n  the reader unable to assess clinical or economic relevance.\n- *Interpreting a Mann-Whitney p-value as a test of medians*: unless the two distributions\n  have identical shapes and differ only in location, the Mann-Whitney tests stochastic dominance,\n  not median equality. If a median difference is the target, report the Hodges-Lehmann estimate\n  explicitly and note its interpretation.\n- *Using Shapiro-Wilk as a gating decision for parametric vs nonparametric*: the test is\n  underpowered at small n and overpowered at large n; it is not a reliable criterion for\n  choosing between test families.\n\n**Implementation note for all three languages**\n\nThe canonical side-by-side demonstration below shows, for each data type, the parametric and\nnonparametric function calls together with the Welch correction and Fisher exact alternative.\nAll implementations use the same six-observation motivating dataset from the beginner layer\n(skewed, with one outlier) to show concretely that the t-test and Mann-Whitney can reach\ndifferent inferential conclusions at small n — and why neither answer is wrong, they are just\nanswering slightly different questions.\n\n**Interpreting the output**\n\nIn the worked example, six rehabilitation-protocol patients (Group A) and six standard-care\npatients (Group B) are compared on days of missed work after a procedure. Group A has mean\n3.0 days but includes one outlier (7 days); Group B has mean approximately 5.33 days with\nlittle internal spread. 
The Welch t statistic is approximately −2.535, with p ≈ 0.038\n(nominally significant at alpha = 0.05). The Mann-Whitney rank sums are R_A = 27 and\nR_B = 51, total 78 = 12 × 13 / 2, confirming the computation. The Mann-Whitney two-sided\np-value is approximately 0.065 — not significant at alpha = 0.05.\n\n*(1) Formal interpretation.* The Welch t-test rejects the null of equal means (p ≈ 0.038)\nbecause the observed mean difference (3.0 vs 5.33 days) is large relative to the combined\nstandard error, including the outlier's contribution to Group A's variance. The Mann-Whitney\ntest does not reject the null of stochastic equality (p ≈ 0.065), because the outlier in\nGroup A (7 days) contributes only the single highest rank (rank 12) rather than pulling the\ntest statistic proportionally as it does the mean. The two tests are answering different\nquestions: the t-test targets mean equality; the Mann-Whitney targets stochastic dominance\n(P(X_A > X_B) = 0.5). Both are technically valid at this n; the divergence arises because\nthe outlier exerts a large influence on the mean but a bounded influence on rank ordering.\n\n*(2) Practical interpretation.* When a mean difference is the target estimand — for example,\nmean missed-work days needed for a budget-impact model — the Welch t-test is the appropriate\nprimary analysis, and the outlier is informative signal, not noise to suppress. When the\nquestion is rank-based (\"does Group A tend toward fewer missed days?\"), the Mann-Whitney is\nappropriate and naturally down-weights the extreme observation. Because the two tests reach\ndifferent conclusions here, the pre-specified primary estimand must determine which drives\nthe inference — this is precisely why the analysis plan should declare the primary estimand\nbefore seeing the data.",
    "primary_category": "Inferential_Statistics",
    "tags": [
      "statistics",
      "primitive",
      "foundations",
      "hypothesis-testing",
      "t-test",
      "parametric",
      "nonparametric",
      "Mann-Whitney",
      "Wilcoxon",
      "chi-square",
      "ANOVA",
      "normality",
      "rank-test"
    ],
    "applies_to_study_types": [
      "cohort_retrospective",
      "cohort_prospective",
      "rct",
      "cross_sectional",
      "active_comparator_new_user",
      "descriptive_analysis",
      "claims_analysis",
      "ehr_study",
      "registry_study"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "primary",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1186/1471-2288-12-78",
        "url": "https://doi.org/10.1186/1471-2288-12-78",
        "citation_text": "Fagerland MW. t-tests, non-parametric tests, and large studies — a paradox of statistical practice? BMC Medical Research Methodology. 2012;12:78.",
        "year": 2012,
        "authors_short": "Fagerland",
        "notes": "Demonstrates the Fagerland paradox: at large n the t-test's normality assumption is protected by the CLT exactly when analysts worry most about it, while nonparametric tests answer a subtly different question. Essential reading for RWE analysts who reflexively reach for rank tests when n is large."
      },
      {
        "role": "explain",
        "doi": "10.3238/arztebl.2010.0343",
        "url": "https://doi.org/10.3238/arztebl.2010.0343",
        "citation_text": "du Prel JB, Röhrig B, Hommel G, Blettner M. Choosing statistical tests. Deutsches Ärzteblatt International. 2010;107(19):343-348.",
        "year": 2010,
        "authors_short": "du Prel et al.",
        "notes": "Systematic decision framework for choosing the correct test by data type and design (two independent groups, paired, three or more groups, categorical). A widely cited practical guide that maps onto the decision tree presented in this entry."
      },
      {
        "role": "demonstrate",
        "doi": "10.5334/irsp.82",
        "url": "https://doi.org/10.5334/irsp.82",
        "citation_text": "Delacre M, Lakens D, Leys C. Why psychologists should by default use Welch's t-test instead of Student's t-test. International Review of Social Psychology. 2017;30(1):92-101.",
        "year": 2017,
        "authors_short": "Delacre et al.",
        "notes": "Simulation-based demonstration that Welch's t-test outperforms Student's t-test across a broad range of variance and sample-size scenarios and that the \"test for equal variances first, then choose\" strategy is statistically suboptimal. Supports Welch as the unconditional default."
      },
      {
        "role": "use",
        "doi": "10.1136/bmj.310.6975.298",
        "url": "https://doi.org/10.1136/bmj.310.6975.298",
        "citation_text": "Altman DG, Bland JM. Statistics notes: The normal distribution. BMJ. 1995;310(6975):298.",
        "year": 1995,
        "authors_short": "Altman & Bland",
        "notes": "One of the BMJ Statistics Notes series; covers the normal distribution assumption and its role in parametric tests, written for clinical researchers who need practical decision rules rather than theoretical derivations."
      }
    ],
    "plain_language_summary": "Parametric tests (like the t-test) and nonparametric tests (like the Mann-Whitney test) are both tools for deciding whether a difference between two groups is likely to be real or just chance — but they work differently. A parametric test assumes the numbers follow a bell-curve shape and uses that assumption to build the comparison; a nonparametric test replaces the actual numbers with their rank order (first, second, third...) and tests whether the ranks are randomly mixed between groups. The rank approach is more robust to outliers and skewed data, but the trade-off is that it answers a slightly different question and produces effect estimates (like a median shift) that are harder to plug into a cost model. For most large real-world datasets with skewed outcomes like healthcare costs, a specialized regression model is often a better choice than either test.",
    "key_terms": [
      {
        "term": "parametric test",
        "definition": "A statistical test that assumes the data come from a specific distributional family (usually the normal/bell-curve family) and uses that assumption to derive what \"random chance\" should look like."
      },
      {
        "term": "nonparametric test",
        "definition": "A statistical test that converts the raw data values into ranks (1st, 2nd, 3rd...) and bases the comparison on those ranks, making fewer assumptions about the shape of the data distribution."
      },
      {
        "term": "normality assumption",
        "definition": "The requirement, for many parametric tests, that the data (or the sample mean, for large samples) follow a bell-curve shape; violated by very skewed data or extreme outliers, especially in small samples."
      },
      {
        "term": "rank",
        "definition": "The position of a value when all values from both groups are sorted from smallest to largest together; the smallest value gets rank 1, the next gets rank 2, and so on."
      },
      {
        "term": "null distribution",
        "definition": "The theoretical distribution of a test statistic (like a t-statistic or a rank sum) that you would expect to see if there were truly no difference between the groups being compared."
      },
      {
        "term": "exact test",
        "definition": "A version of a statistical test (like Fisher's exact test) that computes the p-value by counting all possible arrangements of the data rather than using a mathematical approximation, making it reliable even in very small samples."
      }
    ],
    "worked_example": {
      "scenario": "A health outcomes analyst is comparing days of missed work in the week after a procedure for six patients receiving a new rehabilitation protocol (Group A) versus six standard-care patients (Group B). The data are small, skewed, and one patient in Group A had an unusually high value (7 days). The analyst runs both a Welch t-test and a Mann-Whitney U test to see whether the two approaches agree, and manually computes the rank sums to understand why the nonparametric test reaches a different conclusion.",
      "dataset": {
        "caption": "Days of missed work post-procedure (n=6 per group). Group A has one outlier (7 days). All 12 values are pooled and ranked together for the Mann-Whitney test.",
        "columns": [
          "patient_id",
          "group",
          "days_missed"
        ],
        "rows": [
          [
            "A1",
            "A",
            1
          ],
          [
            "A2",
            "A",
            2
          ],
          [
            "A3",
            "A",
            2
          ],
          [
            "A4",
            "A",
            3
          ],
          [
            "A5",
            "A",
            3
          ],
          [
            "A6",
            "A",
            7
          ],
          [
            "B1",
            "B",
            4
          ],
          [
            "B2",
            "B",
            5
          ],
          [
            "B3",
            "B",
            5
          ],
          [
            "B4",
            "B",
            6
          ],
          [
            "B5",
            "B",
            6
          ],
          [
            "B6",
            "B",
            6
          ]
        ]
      },
      "steps": [
        "Compute the group means. Group A: mean = (1+2+2+3+3+7)/6 = 18/6 = 3.0, with the outlier pulling the mean up. Group B: mean = (4+5+5+6+6+6)/6 = 32/6 = 5.33.",
        "The Welch t-test on the means gives t = (3.0 - 5.333) / sqrt(variance_A/6 + variance_B/6). Variance of A = [(1-3)^2+(2-3)^2+(2-3)^2+(3-3)^2+(3-3)^2+(7-3)^2] / 5 = [4+1+1+0+0+16]/5 = 22/5 = 4.4. Variance of B = [(4-5.333)^2+(5-5.333)^2+(5-5.333)^2+(6-5.333)^2+(6-5.333)^2+(6-5.333)^2] / 5 = [1.7778+0.1111+0.1111+0.4444+0.4444+0.4444]/5 = 3.333/5 = 0.667. SE = sqrt(4.4/6 + 0.667/6) = sqrt(0.7333 + 0.1111) = sqrt(0.8444) = 0.919. t = (3.0 - 5.333) / 0.919 = -2.333 / 0.919 = -2.539. The Welch-Satterthwaite degrees of freedom are 0.8444^2 / (0.7333^2/5 + 0.1111^2/5) = 0.713 / 0.110 = 6.48, which gives p approximately 0.04 (two-sided), a nominally significant difference in means.",
        "Now do the Mann-Whitney rank computation. Pool all 12 values, sort ascending, and assign ranks. Values in order: 1(A1), 2(A2), 2(A3), 3(A4), 3(A5), 4(B1), 5(B2), 5(B3), 6(B4), 6(B5), 6(B6), 7(A6). Ties share the average of the ranks they would occupy.",
        "Rank assignments: value 1 -> rank 1. Values 2, 2 (tied) -> average of ranks 2 and 3 = 2.5 each. Values 3, 3 (tied) -> average of ranks 4 and 5 = 4.5 each. Value 4 -> rank 6. Values 5, 5 (tied) -> average of ranks 7 and 8 = 7.5 each. Values 6, 6, 6 (tied) -> average of ranks 9, 10, 11 = 10.0 each. Value 7 -> rank 12.",
        "Sum ranks for Group A: patient A1 gets rank 1, A2 gets 2.5, A3 gets 2.5, A4 gets 4.5, A5 gets 4.5, A6 gets 12. Rank sum for A = 1 + 2.5 + 2.5 + 4.5 + 4.5 + 12 = 27.",
        "Sum ranks for Group B: B1 gets 6, B2 gets 7.5, B3 gets 7.5, B4 gets 10, B5 gets 10, B6 gets 10. Rank sum for B = 6 + 7.5 + 7.5 + 10 + 10 + 10 = 51.",
        "Verify: total rank sum must equal n*(n+1)/2 = 12*13/2 = 78. Check: 27 + 51 = 78. Correct.",
        "The Mann-Whitney U statistic for Group A is U_A = R_A - n_A*(n_A+1)/2 = 27 - 6*7/2 = 27 - 21 = 6. For Group B: U_B = R_B - n_B*(n_B+1)/2 = 51 - 6*7/2 = 51 - 21 = 30. Check: U_A + U_B = 6 + 30 = 36 = n_A * n_B = 6*6 = 36. Correct.",
        "The Mann-Whitney p-value (two-sided, exact, ignoring ties) for U = min(6, 30) = 6 with n=6 per group is approximately 0.065: 30 of the 924 equally likely rank arrangements give U <= 6, so p = 2 * 30/924. This is not significant at the conventional alpha = 0.05 threshold. The t-test gave p approximately 0.04 (significant); the Mann-Whitney gives p approximately 0.065 (not significant). The single outlier in Group A (7 days) enters the rank test only as rank 12, however extreme its value, but because it outranks all six Group B values it alone accounts for U_A = 6, which is enough at this sample size to keep the rank test from rejecting."
      ],
      "result": "Group A mean = 18/6 = 3.0 days, Group B mean = 32/6 = 5.33 days. Rank sum A = 27, rank sum B = 51, total = 78 = 12*13/2. U_A = 27 - 21 = 6, U_B = 51 - 21 = 30, U_A + U_B = 36 = 6*6. The t-test yields a significant mean difference even though the outlier (7 days in A) pulls Group A's mean toward Group B's and inflates its variance; the Mann-Whitney test reduces the outlier to a single high rank (rank 12), but that rank alone accounts for U_A = 6 and yields a borderline result. Neither test is wrong; they are answering slightly different questions. For a mean-difference estimate (what a budget-impact model needs), the t-test or a GLM is appropriate; for a rank-based comparison that does not depend on the outlier's magnitude, the Mann-Whitney is appropriate with the Hodges-Lehmann shift estimate."
    },
    "prerequisites": [
      "descriptive-epidemiology-rwe",
      "baseline-characteristics-and-covariate-balance-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Two independent groups, continuous outcome (Welch t-test vs Mann-Whitney)",
        "description": "The most common HEOR comparison — a treated vs control or exposed vs unexposed group on a continuous endpoint (cost, quality-of-life score, length of stay). Welch t-test is the parametric default; Mann-Whitney U is the nonparametric alternative. For cost outcomes, prefer a gamma GLM or bootstrap mean difference for the primary analysis and use Mann-Whitney only as a sensitivity check.",
        "edge_cases": [
          "At very large n (> 10,000 per group), both tests will reject the null for trivially small differences; report the effect size (Cohen's d or Hodges-Lehmann shift) alongside p.",
          "For cost data, the Mann-Whitney tests rank ordering, not mean costs; gamma GLM with log link is the preferred primary method when mean costs are the policy-relevant quantity."
        ],
        "data_source_notes": "Claims and EHR: compute Welch t-test and Mann-Whitney on per-patient totals after winsorizing or capping extreme outliers if exploratory; use GLM for the confirmatory estimate."
      },
      {
        "name": "Paired / pre-post measurements (paired t-test vs Wilcoxon signed-rank)",
        "description": "When the same patient is measured before and after an intervention, or when patients are matched one-to-one, the two measurements are not independent and the paired design must be respected. Paired t-test on the within-patient differences is parametric; Wilcoxon signed-rank test is the nonparametric alternative.",
        "edge_cases": [
          "Dropping patients with incomplete pre-post pairs introduces selection bias; impute or model missingness before choosing the test.",
          "The Wilcoxon signed-rank test ignores differences of zero by design; a large fraction of tied pairs (no change) can distort the result."
        ],
        "data_source_notes": "Claims: construct pre-period and post-period windows around the index date; compute the difference per patient; test on differences. EHR: labs and vitals naturally pair by encounter date."
      },
      {
        "name": "Multiple independent groups (ANOVA vs Kruskal-Wallis)",
        "description": "Comparing three or more groups simultaneously — e.g., drug classes, lines of therapy, or geographic regions — while controlling the family-wise type-I error rate. One-way ANOVA is parametric; Kruskal-Wallis H is nonparametric. Both are omnibus tests; post-hoc pairwise comparisons (Tukey HSD, Dunn's test) are needed if the overall test rejects.",
        "edge_cases": [
          "ANOVA assumes equal variances across groups (Welch ANOVA relaxes this). Check Levene's test as a diagnostic, not as a decision gate.",
          "Kruskal-Wallis post-hoc pairwise tests (Dunn with Bonferroni or Holm correction) have lower power than ANOVA post-hoc at equal n when the data are approximately normal; prefer ANOVA when sample sizes are adequate."
        ],
        "data_source_notes": "Registry and claims: useful for comparing across therapeutic areas or health systems without a controlled comparison; document that these are unadjusted comparisons."
      },
      {
        "name": "Categorical outcomes (chi-square vs Fisher exact)",
        "description": "Binary or multi-category outcomes in a contingency table. Pearson chi-square uses a large-sample approximation; Fisher's exact test uses the hypergeometric distribution and is exact in small samples. For paired binary outcomes (e.g., readmission before vs after a policy), McNemar's test is appropriate.",
        "edge_cases": [
          "Rule of thumb for chi-square: all expected cell counts ≥ 5. If any expected count < 5, use Fisher's exact test or a mid-p correction.",
          "For 2x2 tables, Yates' continuity correction makes chi-square more conservative; its use is debated and Fisher's exact is generally preferred when n is small."
        ],
        "data_source_notes": "Claims: compute observed and expected event counts per arm; apply Fisher when strata are small. EHR: readmission, re-hospitalization, and medication adherence classification are natural candidates."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "descriptive-epidemiology-rwe",
        "pros_of_this": "Hypothesis tests add a formal decision rule about whether an observed difference is compatible with chance, whereas descriptive statistics alone cannot make this inference.",
        "cons_of_this": "A p-value without an effect estimate and CI is less informative than a well-presented descriptive summary; in large observational datasets all tests are likely to reject.",
        "when_to_prefer": "Use both: descriptive statistics for characterization and communication; hypothesis tests for a formal inference about whether a difference is signal or noise, with clear acknowledgment of whether the comparison is causal or merely associative."
      },
      {
        "compared_to": "baseline-characteristics-and-covariate-balance-rwe",
        "pros_of_this": "In an RCT, t-tests and chi-square tests on baseline covariates are legitimate (randomization guarantees equal expected distributions); in observational studies, standardized mean differences are preferred over p-values for assessing balance because they are not sensitive to sample size.",
        "cons_of_this": "Significance testing in a Table 1 for an observational cohort conflates statistical significance with confounding imbalance; a large imbalanced cohort will \"significantly\" differ on many covariates even after matching.",
        "when_to_prefer": "For RCTs or for sensitivity checks in observational work; use SMDs as the primary balance metric in observational Table 1."
      },
      {
        "compared_to": "win-ratio-generalized-pairwise-comparisons-rwe",
        "pros_of_this": "Standard parametric and nonparametric tests are simpler, more familiar, and produce straightforward mean differences or odds ratios; they are appropriate for a single endpoint.",
        "cons_of_this": "The win ratio generalizes the rank-based logic of Mann-Whitney to a *prioritized hierarchy* of multiple outcomes (death ranked above hospitalization), which standard rank tests cannot do; the win ratio is more appropriate when components differ sharply in clinical severity.",
        "when_to_prefer": "Use standard tests for a single endpoint; prefer the win ratio / generalized pairwise comparison when multiple prioritized outcomes must be combined in cardiology or oncology."
      },
      {
        "compared_to": "healthcare-costs-pppm-pppy-pmpm",
        "pros_of_this": "t-test and Mann-Whitney are straightforward to apply to per-patient cost totals and are familiar to all audiences.",
        "cons_of_this": "Cost distributions are right-skewed with a spike at zero; the mean — not the median — is the policy-relevant quantity for budget impact; a gamma GLM with log link is the modern standard for inferring mean cost differences because it respects the distributional shape and produces a mean-ratio estimate directly interpretable for payer decision-making.",
        "when_to_prefer": "Use t-test and Mann-Whitney for exploratory comparison and sensitivity checks on cost outcomes; use gamma GLM or two-part models for the primary confirmatory analysis when mean costs are the estimand."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Per-patient cost and utilization totals are the natural unit of analysis. Both Welch t-test and Mann-Whitney operate on these totals directly. For costs, note that the mean is the policy-relevant quantity; report a gamma GLM mean-ratio alongside any rank-based p-value. For binary outcomes (readmission, treatment switch), chi-square or Fisher exact applies to the 2x2 table of events per arm. At large n (> 10,000), report effect sizes and CIs prominently; p-values will be significant for trivially small differences.",
      "ehr": "Many lab values and vitals are approximately normal and well suited to a Welch t-test or paired t-test (pre/post). Patient-reported outcome scores and ordinal severity scales call for Mann-Whitney or Wilcoxon signed-rank. Visit-driven capture can create informative missingness; restrict to patients with complete data or model the missing observations before testing.",
      "registry": "Disease severity scores and adjudicated outcomes are typically the cleanest inputs. For continuous registry variables with known non-normality (e.g., time from symptom to diagnosis), Mann-Whitney is a natural choice. For binary registry endpoints (response, remission), Fisher exact or chi-square with adequate cell counts.",
      "primary": "Survey and PRO instruments often produce bounded, ordinal, or non-normal scores; Mann-Whitney and Wilcoxon signed-rank are the standard choices for pilot and small-n studies. Confirm whether the instrument developers intended the scale to be treated as interval or ordinal.",
      "linked": "Linked claims-EHR-registry cohorts typically have large n, so the CLT protects parametric tests. Report Welch t-test mean differences and gamma GLM mean ratios for costs alongside Mann-Whitney for robustness; use chi-square for binary outcomes with effect estimates (RD, RR, OR)."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import math\nfrom scipy import stats\n\n# ── Motivating dataset: days of missed work (n=6 per group, Group A has outlier at 7) ──\ngroup_a = [1, 2, 2, 3, 3, 7]   # mean = 3.0, pulled up by outlier\ngroup_b = [4, 5, 5, 6, 6, 6]   # mean = 5.33, tightly clustered high\n\n# ── 1. Welch t-test (parametric; does not assume equal variances) ──\nt_stat, t_pval = stats.ttest_ind(group_a, group_b, equal_var=False)\nmean_diff = sum(group_a)/len(group_a) - sum(group_b)/len(group_b)\nprint(f\"Welch t-test:  t={t_stat:.3f}, p={t_pval:.4f}\")\nprint(f\"Mean A={sum(group_a)/len(group_a):.3f}, Mean B={sum(group_b)/len(group_b):.3f}\")\nprint(f\"Mean difference (A - B) = {mean_diff:.3f}\")\n\n# ── 2. Mann-Whitney U test (nonparametric; tests stochastic dominance, NOT medians) ──\nu_stat, mw_pval = stats.mannwhitneyu(group_a, group_b, alternative=\"two-sided\")\nprint(f\"\\nMann-Whitney U: U={u_stat:.1f}, p={mw_pval:.4f}\")\nprint(\"Note: MWU tests P(A > B) = 0.5, NOT equality of medians (unless shapes are identical).\")\n\n# ── 3. Hodges-Lehmann estimator: median of all pairwise differences (A_i - B_j) ──\npairwise_diffs = sorted(a - b for a in group_a for b in group_b)\nn = len(pairwise_diffs)\nif n % 2 == 1:\n    hl_estimate = pairwise_diffs[n // 2]\nelse:\n    hl_estimate = (pairwise_diffs[n // 2 - 1] + pairwise_diffs[n // 2]) / 2\nprint(f\"Hodges-Lehmann location estimate (A - B): {hl_estimate:.1f}\")\nprint(\"(This is the rank-based 'shift' estimate that pairs with the Mann-Whitney test.)\")\n\n# ── 4. 
Chi-square vs Fisher exact for a binary 2x2 table ──\n# Example: events (e.g., readmission) in two groups of 50\n# Observed: [[events_A, no_event_A], [events_B, no_event_B]]\ntable = [[8, 42], [18, 32]]   # Group A: 8/50 events; Group B: 18/50 events\nchi2, chi2_p, dof, expected = stats.chi2_contingency(table, correction=False)\nprint(f\"\\nChi-square: chi2={chi2:.3f}, p={chi2_p:.4f} (expected min cell: {expected.min():.1f})\")\n# Fisher exact: always valid; prefer when any expected count < 5\n_, fisher_p = stats.fisher_exact(table)\nprint(f\"Fisher exact:  p={fisher_p:.4f}\")\nprint(\"Rule: use Fisher exact when any expected cell count < 5.\")\n\n# ── 5. Paired Wilcoxon signed-rank test (pre-post on same patients) ──\npre  = [10, 12, 14, 11, 13, 15]\npost = [ 8, 10, 12,  9, 11, 12]\nw_stat, wsr_p = stats.wilcoxon(pre, post)\nt_paired, tp_p = stats.ttest_rel(pre, post)\nprint(f\"\\nPaired t-test:  t={t_paired:.3f}, p={tp_p:.4f}\")\nprint(f\"Wilcoxon SR:    W={w_stat:.1f}, p={wsr_p:.4f}\")",
        "description": "Side-by-side parametric and nonparametric tests using scipy.stats. Demonstrates\nWelch t-test vs Mann-Whitney U for continuous outcomes, the Hodges-Lehmann location estimate,\nchi-square vs Fisher exact for binary 2x2 tables, and paired t-test vs Wilcoxon signed-rank for\npre-post data. Uses the motivating dataset of six observations per group (Group A with outlier,\nGroup B without). No external dependencies beyond scipy.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "# ── Motivating dataset ──\ngroup_a <- c(1, 2, 2, 3, 3, 7)   # mean = 3.0; outlier at 7\ngroup_b <- c(4, 5, 5, 6, 6, 6)   # mean = 5.33\n\n# ── 1. Welch t-test (var.equal = FALSE is the R default for two-sample) ──\nt_res <- t.test(group_a, group_b, var.equal = FALSE)\ncat(\"Welch t-test:\\n\")\nprint(t_res)\n# Mean difference and CI\ncat(sprintf(\"Mean A = %.3f, Mean B = %.3f, diff = %.3f\\n\",\n            mean(group_a), mean(group_b), mean(group_a) - mean(group_b)))\n\n# ── 2. Mann-Whitney / Wilcoxon rank-sum test ──\nmw_res <- wilcox.test(group_a, group_b, exact = FALSE)\ncat(\"\\nMann-Whitney U test:\\n\")\nprint(mw_res)\ncat(\"Note: wilcox.test tests stochastic dominance (P(A>B)=0.5), not median equality.\\n\")\n\n# ── 3. Hodges-Lehmann location estimate with CI ──\nhl_res <- wilcox.test(group_a, group_b, conf.int = TRUE, exact = FALSE)\ncat(sprintf(\"\\nHodges-Lehmann estimate: %.1f  95%% CI: [%.1f, %.1f]\\n\",\n            hl_res$estimate, hl_res$conf.int[1], hl_res$conf.int[2]))\n\n# ── 4. Chi-square vs Fisher exact (binary 2x2 table) ──\ntab <- matrix(c(8, 42, 18, 32), nrow = 2, byrow = TRUE,\n              dimnames = list(c(\"GroupA\",\"GroupB\"), c(\"Event\",\"NoEvent\")))\ncat(\"\\nChi-square test:\\n\")\nprint(chisq.test(tab, correct = FALSE))\ncat(\"\\nFisher exact test:\\n\")\nprint(fisher.test(tab))\n\n# ── 5. Paired t-test vs Wilcoxon signed-rank ──\npre  <- c(10, 12, 14, 11, 13, 15)\npost <- c( 8, 10, 12,  9, 11, 12)\ncat(\"\\nPaired t-test:\\n\")\nprint(t.test(pre, post, paired = TRUE))\ncat(\"\\nWilcoxon signed-rank test:\\n\")\nprint(wilcox.test(pre, post, paired = TRUE, exact = FALSE))",
        "description": "Side-by-side parametric and nonparametric tests in base R. Demonstrates t.test (Welch) vs\nwilcox.test for continuous outcomes, chisq.test vs fisher.test for binary tables, and the\nHodges-Lehmann estimate via wilcox.test(conf.int=TRUE). Paired t.test vs wilcox.test for\npre-post data. Uses the same motivating dataset as the Python implementation.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* ── Create motivating dataset ── */\ndata work.groups;\n  input patient_id $ group $ days_missed;\n  datalines;\nA1 A 1\nA2 A 2\nA3 A 2\nA4 A 3\nA5 A 3\nA6 A 7\nB1 B 4\nB2 B 5\nB3 B 5\nB4 B 6\nB5 B 6\nB6 B 6\n;\nrun;\n\n/* ── 1. Welch t-test (Satterthwaite row) and Student t-test (Pooled row), both from one PROC TTEST ── */\nproc ttest data=work.groups;\n  class group;            /* compare A vs B */\n  var days_missed;\n  /* Welch: use the \"Satterthwaite\" row in the output (unequal variances)       */\n  /* Student: use the \"Pooled\" row (assumes equal variances; use Welch instead)  */\nrun;\n\n/* ── 2. Mann-Whitney / Wilcoxon rank-sum test with Hodges-Lehmann ── */\nproc npar1way data=work.groups wilcoxon hl;\n  class group;\n  var days_missed;\n  /* WILCOXON: Wilcoxon rank-sum = Mann-Whitney U                              */\n  /* HL: Hodges-Lehmann location estimate and 95% CI                           */\nrun;\n\n/* ── 3. Chi-square and Fisher exact for a 2x2 binary table ── */\ndata work.events;\n  input group $ event $ count;\n  datalines;\nGroupA Yes  8\nGroupA No  42\nGroupB Yes 18\nGroupB No  32\n;\nrun;\nproc freq data=work.events;\n  weight count;\n  tables group * event / chisq fisher expected;\n  /* CHISQ: Pearson chi-square and Yates-corrected chi-square                  */\n  /* FISHER: Fisher exact test (valid for small expected counts)               */\n  /* EXPECTED: prints expected cell frequencies to check the >=5 rule          */\nrun;\n\n/* ── 4. 
Paired t-test and Wilcoxon signed-rank for pre-post ── */\ndata work.paired;\n  input patient_id $ pre post;\n  diff = pre - post;       /* compute within-patient difference */\n  datalines;\nP1 10  8\nP2 12 10\nP3 14 12\nP4 11  9\nP5 13 11\nP6 15 12\n;\nrun;\n/* Paired t-test: test whether mean of diff = 0 */\nproc ttest data=work.paired;\n  var diff;     /* one-sample t-test on the differences */\nrun;\n/* Wilcoxon signed-rank: nonparametric paired test */\nproc univariate data=work.paired;\n  var diff;\n  /* UNIVARIATE prints the Wilcoxon signed-rank statistic and p-value          */\nrun;",
        "description": "Parametric and nonparametric tests in SAS using PROC TTEST (Welch and Student) and PROC\nNPAR1WAY (Mann-Whitney / Wilcoxon rank-sum, Hodges-Lehmann). PROC FREQ with CHISQ and FISHER\noptions handles binary 2x2 tables. A one-sample PROC TTEST on the within-patient differences\ncovers pre-post data; PROC UNIVARIATE on the same differences reports the Wilcoxon signed-rank\ntest as the nonparametric equivalent. Uses the same motivating dataset as the Python and R\nimplementations.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Q[What kind of outcome<br/>and how many groups?] --> Cont[Continuous outcome]\n  Q --> Cat[Categorical / binary<br/>outcome]\n  Cont --> TwoInd[Two independent groups]\n  Cont --> Paired[Paired / pre-post]\n  Cont --> ThreePlus[Three or more groups]\n  TwoInd --> Welch[\"Parametric: Welch t-test<br/>(default; no equal-variance<br/>assumption)\"]\n  TwoInd --> MW[\"Nonparametric: Mann-Whitney U<br/>(tests stochastic dominance;<br/>add Hodges-Lehmann CI)\"]\n  TwoInd --> GLM[\"Skewed / costs: Gamma GLM<br/>with log link<br/>(preferred for mean inference)\"]\n  Paired --> PairedT[\"Parametric: Paired t-test<br/>on within-patient differences\"]\n  Paired --> WSR[\"Nonparametric: Wilcoxon<br/>signed-rank test\"]\n  ThreePlus --> ANOVA[\"Parametric: one-way ANOVA<br/>(Welch ANOVA if unequal vars)\"]\n  ThreePlus --> KW[\"Nonparametric: Kruskal-Wallis<br/>+ Dunn post-hoc\"]\n  Cat --> ChiSq[\"Expected counts ≥5 per cell:<br/>Pearson chi-square\"]\n  Cat --> Fisher[\"Any expected count < 5:<br/>Fisher exact test\"]\n  Cat --> McNemar[\"Paired binary (pre-post):<br/>McNemar's test\"]",
        "caption": "Test-selection decision tree by outcome type and design. For skewed continuous outcomes (costs, utilization), a GLM is the modern primary method; rank tests serve as sensitivity checks.",
        "alt_text": "Flowchart branching on outcome type (continuous vs categorical) and design (two groups, paired, three-plus groups) into the appropriate parametric and nonparametric tests, plus the GLM path for skewed outcomes.",
        "source_type": "illustrative",
        "source_citations": [
          "du-prel-2010"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "requires",
        "target_slug": "descriptive-epidemiology-rwe",
        "notes": "Descriptive statistics (means, medians, distributions, histograms) should always precede hypothesis testing; the distributional choice (parametric vs nonparametric) depends on what the distribution looks like."
      },
      {
        "relation_type": "requires",
        "target_slug": "baseline-characteristics-and-covariate-balance-rwe",
        "notes": "Hypothesis tests on continuous and binary covariates appear in every Table 1; the choice of t-test vs Mann-Whitney and chi-square vs Fisher exact applies directly to covariate comparison, though SMDs are preferred for balance assessment in observational studies."
      },
      {
        "relation_type": "see_also",
        "target_slug": "win-ratio-generalized-pairwise-comparisons-rwe",
        "notes": "The Mann-Whitney U test is the single-outcome special case of generalized pairwise comparisons; the win ratio extends the rank-based logic to a prioritized hierarchy of multiple outcomes, making it the natural generalization of Mann-Whitney for composite HEOR endpoints."
      },
      {
        "relation_type": "see_also",
        "target_slug": "healthcare-costs-pppm-pppy-pmpm",
        "notes": "Cost outcomes are right-skewed and require special handling; standard t-test and Mann-Whitney are inadequate as primary analyses for mean cost differences — route to gamma GLM, two-part models, or bootstrap mean estimation for confirmatory cost inference."
      },
      {
        "relation_type": "see_also",
        "target_slug": "marginal-effects-and-interpretation-of-inferential-statistics-rwe",
        "notes": "Marginal effects and interpretation guidance for regression models complement the test-based approach here; when covariate adjustment is needed, regression replaces the two-sample test."
      },
      {
        "relation_type": "see_also",
        "target_slug": "cost-outlier-handling-rwe",
        "notes": "Outlier handling directly affects the choice between parametric and nonparametric tests; extreme cost outliers in HEOR shift the mean and inflate the variance that the t-test relies on, which motivates both winsorization and the use of rank-based or GLM alternatives."
      }
    ],
    "aliases": [
      "t-test",
      "Welch t-test",
      "Student t-test",
      "Mann-Whitney",
      "Mann-Whitney U test",
      "Wilcoxon rank-sum",
      "Wilcoxon signed-rank",
      "Kruskal-Wallis",
      "chi-square test",
      "Fisher exact test",
      "McNemar test",
      "ANOVA",
      "nonparametric tests",
      "Hodges-Lehmann estimator",
      "Shapiro-Wilk"
    ],
    "complexity": "foundational",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "partitioned-survival-models-rwe",
    "name": "Partitioned Survival Model",
    "short_definition": "A cohort cost-effectiveness model for (typically oncology) decision analysis that derives state membership at each cycle directly from independently extrapolated overall-survival and progression-free-survival curves, rather than from explicit transition probabilities between health states.",
    "long_description": "A **partitioned survival model (PSM)**, also called an *area-under-the-curve* model, is a decision-analytic\ncost-effectiveness structure most commonly used for oncology technology appraisals. It represents three mutually\nexclusive health states — *progression-free*, *progressed disease*, and *dead* — but, unlike a Markov / state-transition\nmodel (STM), it never specifies transition probabilities between those states. Instead it takes two independently\nestimated survival curves — **overall survival (OS)** and **progression-free survival (PFS)** — and *partitions* the\npopulation at each model cycle: the proportion alive and progression-free is read directly as S_PFS(t); the proportion\ndead is 1 − S_OS(t); and the proportion alive with progressed disease is the residual, S_OS(t) − S_PFS(t). State\noccupancy is the area under each curve, costs and QALYs are accrued by multiplying time spent in each state by state\ncosts and utilities, and the incremental cost-effectiveness ratio (ICER) or net monetary benefit (NMB) follows. Because\ntrial follow-up is short relative to the lifetime horizon decision-makers require, both curves are typically\nparametrically **extrapolated** beyond the observed data, which makes PSMs only as credible as their extrapolation\nassumptions.\n\n**Core conceptual distinction.** The defining feature is that *transitions are not modelled — survival functions are*.\nIn a Markov/STM you estimate the hazard of moving progression-free → progressed, progression-free → dead, and\nprogressed → dead, then propagate a cohort through a transition matrix; OS and PFS emerge as *outputs*. In a PSM you\nreverse this: OS and PFS are *inputs* fitted independently, and progressed-disease occupancy is a subtraction. 
The\nconsequence is subtle but decision-relevant: because OS and PFS are fitted separately, nothing guarantees the implied\npost-progression survival (PPS) is clinically plausible; it can even imply a *negative* proportion in the progressed\nstate if the extrapolated PFS curve crosses above the OS curve, a logical impossibility that the PSM structure does\nnothing to rule out. The estimand a\nPSM targets is mean (discounted, quality-adjusted) survival in each state over a lifetime horizon: a *modelled\npopulation-mean*, not a within-patient transition rate. If the decision question is fundamentally about the timing of\ndisease progression and how subsequent death depends on it (e.g., a treatment that delays progression but whose PPS depends\non subsequent therapy), the PSM's independence assumption is the wrong representation.\n\n**Pros, cons, and trade-offs.**\n- **vs Markov / state-transition models (STM):** PSMs are simpler to build, require only OS and PFS (no PPS or\n  transition-specific data, which RWE often cannot identify cleanly), and are transparent to reviewers because the\n  inputs are the familiar trial Kaplan–Meier curves. Cost: they impose **structural independence** between OS and PFS,\n  cannot use external/registry evidence on post-progression survival, and can produce internally inconsistent state\n  occupancy (negative progressed-disease occupancy, implausible long-term mortality). **Prefer a PSM** for short-horizon, single-line decisions\n  where OS and PFS are mature and a PPS model is not identifiable; **prefer an STM** when post-progression survival is\n  long, depends on subsequent treatment, or when you have external data to inform the progressed→dead hazard.\n- **vs a simple three-state Markov with constant transition probabilities:** the PSM uses the full shape of the\n  survival curves rather than collapsing to exponential/constant hazards, so it captures non-proportional and\n  time-varying hazards naturally. 
Cost: that flexibility lives entirely in the extrapolation choice, which is\n  judgment-heavy and often the single largest driver of the ICER.\n- **vs reporting trial-period restricted mean survival time (RMST) only:** RMST is assumption-light but answers a\n  different question (in-trial mean survival to a fixed horizon); a PSM is required when the HTA decision needs a\n  *lifetime* cost-effectiveness estimate, accepting the extrapolation burden that RMST avoids.\n\n**When to use.** Lifetime cost-utility analysis of oncology (or analogous progressive-disease) therapies for HTA\nsubmission (NICE, CDA-AMC (formerly CADTH), PBAC, ICER-US), where the pivotal evidence reports OS and PFS, follow-up is immature, a\nthree-state structure is clinically reasonable, and post-progression survival cannot be modelled directly from\nidentifiable transition data. PSMs are the *de facto* base-case structure in NICE oncology appraisals.\n\n**When NOT to use — and when it is actively misleading or dangerous.**\n- **When OS and PFS extrapolations cross or imply implausible PPS.** If the fitted PFS curve crosses above OS at any\n  point in the horizon, progressed-disease occupancy goes negative — a structural impossibility. Decisions built on\n  such a model are not just imprecise, they are invalid; switch to an STM that constrains PPS ≥ 0.\n- **When post-progression survival drives the decision and depends on later therapy.** A PSM cannot represent the\n  progressed→dead hazard as a function of subsequent treatment, crossover, or external evidence; using one here can\n  materially over- or under-state lifetime benefit.\n- **When the curves are extrapolated far beyond mature data with no external anchor.** With <50% of OS events observed,\n  the lifetime mean is dominated by the unobserved tail, and the choice among Weibull/log-logistic/log-normal/generalized\n  gamma/spline fits can swing the ICER across the willingness-to-pay threshold. 
Presenting one curve as the answer is\n  misleading; full structural and extrapolation sensitivity analysis is mandatory.\n- **When the disease is not adequately captured by three states** (e.g., multiple progression lines, treatment-free\n  intervals, or a meaningful cured fraction better handled by a mixture-cure model).\n\n**Data-source operational depth.** PSMs are usually built from *trial* OS/PFS, but RWE increasingly supplies the\ncurves, the extrapolation anchor (long-term registry/claims survival), or an external comparator arm — each with\ndistinct failure modes.\n- **Claims:** Death is the cleanest endpoint only when mortality is linked (Medicare linked to NDI/vital records, or a\n  closed system); raw claims *under-capture* death because disenrollment and end-of-data look identical to survival —\n  naive Kaplan–Meier on claims-only mortality is biased upward and inflates the OS tail used to extrapolate. Worse,\n  **Medicare Advantage (MA) person-time lacks fee-for-service (FFS) claims**, so encounter-based progression proxies\n  (new metastatic codes, second-line regimen initiation, imaging escalation) are differentially missing for MA\n  enrollees; if MA penetration differs by the population feeding OS vs PFS, the two curves are estimated on\n  non-comparable person-time and the partition is corrupted. \"Progression\" in claims is itself a *proxy* (second-line\n  therapy start, new radiotherapy, hospice election), not RECIST progression, so claims-derived PFS systematically\n  differs from trial PFS and the two cannot be naively combined in one model.\n- **EHR:** Progression can be richer (imaging reports, oncologist notes, ECOG), but is **encounter-driven and\n  left/right-censored by network leakage** — a patient who progresses and dies outside the system looks censored,\n  biasing both PFS and OS. 
Death must be reconciled against an external index; structured \"last contact\" is not death.\n- **Registry:** Often the *best* anchor for the long-term OS tail (population cancer registries with active\n  follow-up), but progression is frequently *not* collected, so a registry can inform OS extrapolation while PFS still\n  comes from the trial — a hybrid that must be checked for population transportability (stage mix, calendar period,\n  treatment era).\n- **Linked claims–EHR–registry:** The ideal substrate (EHR progression + claims utilization/cost + registry/vital-records\n  mortality), but linkage selection and date discrepancies between imaging date, regimen-change date, and service date\n  must be reconciled before either curve is estimated, or the partition inherits the misalignment.\n\n**Worked example (RWE-anchored oncology PSM).** Question: lifetime cost-utility of a new first-line therapy vs standard\nof care in metastatic NSCLC, pivotal trial with 24-month median follow-up, lifetime horizon 20 years, 3-month cycles,\n3.5% annual discounting. (1) From the trial, fit OS and PFS independently; with only ~40% of OS events observed, fit a\npanel of parametric models (exponential, Weibull, Gompertz, log-logistic, log-normal, generalized gamma) and compare by\nAIC/BIC *and* visual/clinical plausibility of the extrapolated tail. (2) Anchor the OS tail to a real-world source: a\nSEER-Medicare cohort of stage IV NSCLC initiators, where mortality is reliable because Medicare claims are linked to\nvital records — but **restrict to continuously enrolled FFS Parts A/B beneficiaries and exclude MA-only person-time**,\nbecause MA enrollees' death and progression are under-captured. Use the registry tail to reject implausibly optimistic\ntrial extrapolations (e.g., a log-normal fit implying 18% 10-year survival when SEER shows <5%). 
(3) At each cycle t,\nset progression-free occupancy = S_PFS(t), dead = 1 − S_OS(t), progressed = S_OS(t) − S_PFS(t); **add an explicit check\nthat progressed ≥ 0 at every cycle** and, if violated, constrain or switch to an STM. (4) Attach state costs (drug,\nadministration, monitoring, progressed-disease management, terminal care from the last 90 days of claims) and utilities\n(progression-free vs progressed) to the time spent in each state (the area under the PFS curve and the area between the\nOS and PFS curves). (5) Compute discounted total costs and QALYs per arm,\nthe incremental cost, incremental QALYs, ICER and NMB at the threshold. (6) Run the mandatory sensitivity suite:\nalternative parametric extrapolations (structural uncertainty), the registry-anchored vs trial-only tail, alternative\nutility sources, and a probabilistic sensitivity analysis; report the ICER's sensitivity to the OS extrapolation as the\nheadline driver, because in immature oncology data it almost always is.\n\n**Interpreting the output**\n\nA 10-year partitioned survival model returns: PF life-years = 2.34 yr, Progressed life-years = 1.74 yr, time in the dead state = 5.92 yr (sum = 10.00 yr), with S_PFS ≤ S_OS confirmed at all time points.\n\n*Formal interpretation.* At each time point t, state membership is derived from two independently fitted survival curves: progression-free survival S_PFS(t) and overall survival S_OS(t). The partition is: P(PF, t) = S_PFS(t), P(Dead, t) = 1 − S_OS(t), and P(Progressed, t) = S_OS(t) − S_PFS(t). Because S_PFS and S_OS are modeled separately rather than from a joint (multi-state) model, the extrapolations ignore the dependence between the two endpoints, even though progression is prognostic for death and death is itself a PFS event; this structural independence assumption is rarely satisfied in practice. 
The structural validity constraint S_PFS(t) ≤ S_OS(t) for all t must be verified computationally; crossing curves invalidate the model and must be corrected by constraining or re-parameterizing the extrapolations before the model is used in a submission.\n\n*Practical interpretation.* The 1.74 progressed-disease life-years per patient carry distinct cost and utility weights that drive the HEOR model output. Sensitivity analyses should test alternative parametric families for each curve independently and as a pair, because progressed-state costs and QALYs are accrued on the area between the curves, S_OS(t) − S_PFS(t), which depends on both fits. Any scenario where the PFS extrapolation rises above the OS extrapolation produces a negative proportion in the progressed state, a logical impossibility that must be caught and corrected before submission to HTA bodies.",
    "primary_category": "Economic_Evaluation",
    "tags": [
      "partitioned-survival",
      "area-under-the-curve-model",
      "cost-utility-analysis",
      "oncology-hta",
      "survival-extrapolation",
      "decision-analytic-model",
      "progression-free-survival",
      "overall-survival"
    ],
    "applies_to_study_types": [
      "comparative_effectiveness",
      "claims_analysis",
      "registry_linkage",
      "ehr_study"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1016/j.jval.2020.08.2094",
        "url": "https://doi.org/10.1016/j.jval.2020.08.2094",
        "citation_text": "Woods BS, Sideris E, Palmer S, Latimer N, Soares M. Partitioned survival and state transition models for healthcare decision making in oncology: where are we now? Value in Health. 2020;23(12):1613-1621.",
        "year": 2020,
        "authors_short": "Woods et al.",
        "notes": "Definitive critical comparison of partitioned survival vs state-transition structures, articulating the OS/PFS independence assumption and when each model is appropriate."
      },
      {
        "role": "explain",
        "doi": "10.1177/0272989X16670617",
        "url": "https://doi.org/10.1177/0272989X16670617",
        "citation_text": "Williams C, Lewsey JD, Briggs AH, Mackay DF. Estimation of survival probabilities for use in cost-effectiveness analyses: a comparison of a multi-state modeling survival analysis approach with partitioned survival and Markov decision-analytic modeling. Medical Decision Making. 2017;37(4):427-439.",
        "year": 2017,
        "authors_short": "Williams et al.",
        "notes": "Head-to-head methodological comparison showing how partitioned survival, Markov, and multi-state approaches diverge when estimating state occupancy and lifetime survival."
      },
      {
        "role": "demonstrate",
        "doi": "10.1007/s40273-019-00845-x",
        "url": "https://doi.org/10.1007/s40273-019-00845-x",
        "citation_text": "Smare C, Lakhdari K, Doan J, Posnett J, Johal S. Evaluating partitioned survival and Markov decision-analytic modeling approaches for use in cost-effectiveness analysis: estimating and comparing survival outcomes. PharmacoEconomics. 2020;38(1):97-108.",
        "year": 2020,
        "authors_short": "Smare et al.",
        "notes": "Applied worked comparison quantifying how PSM vs Markov structure changes estimated survival and the ICER in an oncology case study."
      },
      {
        "role": "explain",
        "doi": "10.1177/0272989X12472398",
        "url": "https://doi.org/10.1177/0272989X12472398",
        "citation_text": "Latimer NR. Survival analysis for economic evaluations alongside clinical trials - extrapolation with patient-level data: inconsistencies, limitations, and a practical guide. Medical Decision Making. 2013;33(6):743-754.",
        "year": 2013,
        "authors_short": "Latimer",
        "notes": "NICE DSU TSD 14 guidance on the parametric survival extrapolation that drives the lifetime OS/PFS curves a PSM depends on."
      }
    ],
    "plain_language_summary": "A partitioned survival model is a tool health economists use to estimate how long patients with cancer spend in three states — living without disease progression, living after their disease has progressed, and dead — over a lifetime horizon. It works by taking two survival curves from a clinical trial (one tracking when patients progressed, one tracking when they died) and, at every point in time, reads the share of patients in each state directly from those curves. Because trial follow-up is usually short, both curves are mathematically extended far into the future, and the credibility of the whole model depends heavily on how trustworthy that extension is.",
    "key_terms": [
      {
        "term": "progression-free survival",
        "definition": "The probability that a patient is still alive AND has not yet had their cancer worsen or spread; it drops over time as patients either progress or die."
      },
      {
        "term": "overall survival",
        "definition": "The probability that a patient is still alive, regardless of whether their cancer has progressed; it can only stay the same or decrease over time."
      },
      {
        "term": "health state",
        "definition": "One of the distinct clinical situations a patient can occupy in the model — in this model the three states are progression-free, progressed (cancer has worsened but patient is alive), and dead."
      },
      {
        "term": "area under the curve",
        "definition": "The total surface below a survival curve when plotted over time; in this model it directly equals the average time a patient spends in that state, which is then multiplied by the cost or quality-of-life weight for that state."
      },
      {
        "term": "ICER",
        "definition": "Incremental cost-effectiveness ratio — the extra cost divided by the extra health benefit (measured in quality-adjusted life years) when comparing a new treatment to the current standard; health technology bodies use it to decide whether a treatment is worth funding."
      },
      {
        "term": "QALY",
        "definition": "Quality-adjusted life year — one year of perfect health; a year spent in a worse health state counts as less than 1.0, so multiplying time in a state by its quality weight converts years into QALYs."
      }
    ],
    "worked_example": {
      "scenario": "An oncology cost-effectiveness model compares a new first-line therapy to standard of care in metastatic lung cancer. The model uses a 10-year horizon divided into 1-year cycles. From the clinical trial, two survival curves have been estimated and extended to 10 years. At year 0 all 1,000 patients are alive and progression-free. The table below shows what the two curves say about each 1-year interval, and the arithmetic shows how total time in each state is calculated.",
      "dataset": {
        "caption": "Survival curve values read at the start of each year (S_PFS and S_OS as proportions of the original cohort still in that state). Each value is taken directly from the extrapolated curves.",
        "columns": [
          "year",
          "S_PFS (proportion progression-free)",
          "S_OS (proportion alive)",
          "progression-free state",
          "progressed state",
          "dead state"
        ],
        "rows": [
          [
            0,
            1.0,
            1.0,
            1.0,
            0.0,
            0.0
          ],
          [
            1,
            0.6,
            0.8,
            0.6,
            0.2,
            0.2
          ],
          [
            2,
            0.36,
            0.64,
            0.36,
            0.28,
            0.36
          ],
          [
            3,
            0.2,
            0.5,
            0.2,
            0.3,
            0.5
          ],
          [
            4,
            0.1,
            0.38,
            0.1,
            0.28,
            0.62
          ],
          [
            5,
            0.05,
            0.28,
            0.05,
            0.23,
            0.72
          ],
          [
            6,
            0.02,
            0.2,
            0.02,
            0.18,
            0.8
          ],
          [
            7,
            0.01,
            0.14,
            0.01,
            0.13,
            0.86
          ],
          [
            8,
            0.0,
            0.09,
            0.0,
            0.09,
            0.91
          ],
          [
            9,
            0.0,
            0.05,
            0.0,
            0.05,
            0.95
          ]
        ]
      },
      "steps": [
        "At each year t, the proportion of patients who are progression-free equals S_PFS(t) directly — for example, at year 1 that is 0.60 (60 out of every 100 patients).",
        "The proportion who are dead equals 1 minus S_OS(t) — at year 1 that is 1 - 0.80 = 0.20 (20 out of 100).",
        "The proportion who are alive but progressed is the leftover: S_OS(t) minus S_PFS(t) — at year 1 that is 0.80 - 0.60 = 0.20 (20 out of 100). These three shares must always add up to 1.00.",
        "Time spent in each state over the 10-year horizon is the area under each curve: sum the state proportion across all 10 one-year intervals and multiply by the 1-year cycle length.",
        "Progression-free time = 1.00 + 0.60 + 0.36 + 0.20 + 0.10 + 0.05 + 0.02 + 0.01 + 0.00 + 0.00 = 2.34 years.",
        "Progressed time = 0.00 + 0.20 + 0.28 + 0.30 + 0.28 + 0.23 + 0.18 + 0.13 + 0.09 + 0.05 = 1.74 years.",
        "Dead time accounts for the rest: 10.00 - 2.34 - 1.74 = 5.92 years (equivalently, total area above the OS curve).",
        "A quick sanity check: 2.34 + 1.74 + 5.92 = 10.00. The three states account for all 10 years, confirming the partition is internally consistent.",
        "To check validity, confirm that S_PFS never exceeds S_OS in any row — if it did, the progressed share would go negative, which is impossible and would mean the model needs to be fixed or replaced with a different model structure."
      ],
      "result": "Over a 10-year horizon, an average patient spends 2.34 years progression-free, 1.74 years alive with progressed disease, and 5.92 years in the dead state (model time accounting). The partition table: Progression-free = 2.34 yr | Progressed = 1.74 yr | Dead = 5.92 yr | Total = 10.00 yr. These times are then multiplied by their respective costs and quality weights (QALYs) to produce the final cost-effectiveness result.",
      "partition_table": {
        "caption": "Summary: time spent in each health state per patient over the 10-year model horizon",
        "columns": [
          "Health state",
          "How it is calculated",
          "Years per patient"
        ],
        "rows": [
          [
            "Progression-free",
            "Area under PFS curve",
            "2.34"
          ],
          [
            "Progressed (alive)",
            "Area between PFS and OS curves",
            "1.74"
          ],
          [
            "Dead",
            "Area above OS curve (remainder)",
            "5.92"
          ],
          [
            "TOTAL",
            "",
            "10.00"
          ]
        ]
      }
    },
    "prerequisites": [
      "survival-extrapolation-hta-rwe",
      "cost-utility",
      "health-economic-modeling-methods-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Trial-only PSM (standard HTA base case)",
        "description": "Both OS and PFS are estimated and extrapolated from the pivotal trial's patient-level data; no external anchor. The default structure in NICE oncology appraisals.",
        "edge_cases": [
          "Immature OS (<50% events) makes the lifetime mean dominated by the unobserved tail and the parametric choice.",
          "Independently fitted curves can cross, implying negative progressed-disease occupancy."
        ],
        "data_source_notes": "trial IPD: fit OS and PFS separately, compare parametric families by AIC/BIC plus clinical tail plausibility, and run structural extrapolation sensitivity as the headline analysis."
      },
      {
        "name": "RWE-anchored / externally-controlled PSM",
        "description": "The OS (and sometimes the comparator arm) is informed or anchored by real-world data — registry or linked-claims long-term survival — to discipline an immature trial extrapolation or supply a missing comparator.",
        "edge_cases": [
          "Claims/EHR death and progression are under-captured (disenrollment, MA-only person-time, network leakage).",
          "Population transportability (stage mix, treatment era, calendar period) between the RWE anchor and the trial population."
        ],
        "data_source_notes": "claims/registry: restrict to FFS person-time with linked mortality; reject trial extrapolations the real-world tail contradicts; document transportability adjustments."
      },
      {
        "name": "PSM with structural / extrapolation uncertainty analysis",
        "description": "The base case is accompanied by a formal exploration of alternative parametric and spline extrapolations, flexible/cure models, and the PSM-vs-STM structural choice.",
        "edge_cases": [
          "Reviewers (NICE ERG/EAG) routinely re-run with the curve the company excluded; the chosen base case must be defensible on clinical, not just statistical, grounds."
        ],
        "data_source_notes": "Present a curve-selection table (fit + plausibility) and report ICER sensitivity to each extrapolation."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Markov / state-transition model (STM)",
        "pros_of_this": "Simpler; needs only OS and PFS (no identifiable post-progression survival or transition data); transparent because inputs are the familiar trial survival curves.",
        "cons_of_this": "Imposes structural independence of OS and PFS, cannot incorporate external evidence on post-progression survival, and can yield internally inconsistent (negative) progressed-disease occupancy.",
        "when_to_prefer": "Short-horizon, single-line decisions with mature OS/PFS where post-progression survival is not separately identifiable."
      },
      {
        "compared_to": "Three-state Markov with constant transition probabilities",
        "pros_of_this": "Uses the full time-varying shape of the survival curves rather than collapsing to exponential/constant hazards, capturing non-proportional hazards naturally.",
        "cons_of_this": "That flexibility is entirely in the extrapolation choice, which is judgment-heavy and usually the largest single driver of the ICER.",
        "when_to_prefer": "When hazards are clearly non-proportional and OS/PFS are mature enough to fit shape directly."
      },
      {
        "compared_to": "Restricted mean survival time (RMST) reported over the trial period",
        "pros_of_this": "Produces the lifetime cost-effectiveness estimate HTA decisions require, integrating costs and utilities by state over the full horizon.",
        "cons_of_this": "Buys that lifetime estimate with a heavy, assumption-laden extrapolation that RMST deliberately avoids.",
        "when_to_prefer": "When a lifetime cost-utility result is the decision object, not an in-trial mean."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Death is reliable only when linked to vital records/NDI; raw claims confuse disenrollment/end-of-data with survival and inflate the OS tail. Exclude Medicare Advantage-only person-time (no FFS claims) so progression proxies and death are not differentially missing. Claims \"progression\" is a proxy (second-line start, new radiotherapy, hospice) and is not RECIST PFS - do not naively merge it with trial PFS.",
      "ehr": "Progression capture is richer but encounter-driven; out-of-network progression/death looks like censoring. Reconcile death against an external mortality source; define observation windows explicitly.",
      "registry": "Often the best anchor for the long-term OS tail (active follow-up), but progression is frequently not collected; use it to discipline OS extrapolation while sourcing PFS elsewhere, checking transportability.",
      "linked": "Linked claims-EHR-registry/vital-records is ideal (progression + cost + reliable mortality) but introduces linkage selection and imaging/regimen/service date discrepancies that must be reconciled before estimating either curve."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\nimport numpy as np\n\ndef partition_states(surv: pd.DataFrame) -> pd.DataFrame:\n    # Partition: PF = S_PFS ; Dead = 1 - S_OS ; Progressed = S_OS - S_PFS (the residual).\n    out = surv.copy()\n    out[\"pf\"]         = out[\"s_pfs\"]\n    out[\"dead\"]       = 1.0 - out[\"s_os\"]\n    out[\"progressed\"] = out[\"s_os\"] - out[\"s_pfs\"]\n    # STRUCTURAL VALIDITY: extrapolated PFS must never exceed OS, else occupancy is negative.\n    bad = out.loc[out[\"progressed\"] < -1e-9]\n    if len(bad):\n        raise ValueError(\n            f\"PFS crosses above OS in {len(bad)} cycle-arms (negative progressed occupancy); \"\n            \"constrain the curves or switch to a state-transition model.\"\n        )\n    return out\n\ndef run_psm(surv: pd.DataFrame, params: dict) -> pd.DataFrame:\n    cyc_yr = params[\"cycle_years\"]          # e.g. 0.25 for 3-month cycles\n    r      = params[\"disc_rate\"]            # annual discount rate, e.g. 0.035\n    st     = partition_states(surv)\n\n    # Discount factor at the cycle midpoint (half-cycle correction via midpoint timing).\n    st[\"t_yr\"] = (st[\"cycle\"] + 0.5) * cyc_yr\n    st[\"disc\"] = 1.0 / (1.0 + r) ** st[\"t_yr\"]\n\n    # Per-cycle cost = sum over states of (occupancy * annual state cost * cycle length) + arm drug cost while PF.\n    c = params[\"state_cost\"]; u = params[\"state_util\"]; drug = params[\"drug_cost\"]\n    st[\"cost\"] = (\n        (st[\"pf\"]         * c[\"pf\"]\n       + st[\"progressed\"] * c[\"progressed\"]) * cyc_yr\n      + st[\"pf\"] * st[\"arm\"].map(drug) * cyc_yr           # active-treatment cost accrues while progression-free\n    )\n    # QALYs = sum over alive states of (occupancy * utility * cycle length).\n    st[\"qaly\"] = (st[\"pf\"] * u[\"pf\"] + st[\"progressed\"] * u[\"progressed\"]) * cyc_yr\n\n    st[\"d_cost\"] = st[\"cost\"] * st[\"disc\"]\n    st[\"d_qaly\"] = st[\"qaly\"] * st[\"disc\"]\n    agg = 
st.groupby(\"arm\")[[\"d_cost\", \"d_qaly\"]].sum()\n\n    treated, ref = params[\"treated_arm\"], params[\"reference_arm\"]\n    inc_cost = agg.loc[treated, \"d_cost\"] - agg.loc[ref, \"d_cost\"]\n    inc_qaly = agg.loc[treated, \"d_qaly\"] - agg.loc[ref, \"d_qaly\"]\n    agg[\"icer\"]   = np.nan\n    agg.loc[treated, \"icer\"] = inc_cost / inc_qaly                 # incremental cost per QALY\n    wtp = params[\"wtp\"]\n    agg.loc[treated, \"nmb\"]  = inc_qaly * wtp - inc_cost           # incremental net monetary benefit\n    return agg",
        "description": "Partitioned survival model engine. Required inputs (already estimated upstream):\n  surv : one row per (cycle index t, arm) with columns\n         cycle (int, 0..n), arm (str), s_pfs (float in [0,1]), s_os (float in [0,1])\n         # s_pfs, s_os are the EXTRAPOLATED survivor functions at the cycle midpoint\n  params: dict of state costs/utilities, cycle length (years), discount rate, drug cost by arm\nComputes per-arm discounted costs, QALYs, and the incremental ICER/NMB. Includes the mandatory\nstructural check that progressed-disease occupancy is non-negative at every cycle.",
        "dependencies": [
          "pandas",
          "numpy"
        ],
        "source_citations": [
          "woods-2020"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(dplyr)\n\nrun_psm <- function(surv, params) {\n  st <- surv %>%\n    mutate(\n      pf         = s_pfs,\n      dead       = 1 - s_os,\n      progressed = s_os - s_pfs\n    )\n  # Structural validity: extrapolated PFS must not exceed OS (no negative progressed occupancy).\n  if (any(st$progressed < -1e-9)) {\n    stop(\"PFS crosses above OS (negative progressed occupancy); constrain curves or use a state-transition model.\")\n  }\n\n  r      <- params$disc_rate          # annual discount rate, e.g. 0.035\n  cyc_yr <- params$cycle_years        # e.g. 0.25 for 3-month cycles\n  st <- st %>%\n    mutate(\n      t_yr = (cycle + 0.5) * cyc_yr,                       # cycle-midpoint timing (half-cycle correction)\n      disc = 1 / (1 + r) ^ t_yr,\n      drug = params$drug_cost[arm],\n      cost = (pf * params$state_cost[[\"pf\"]] +\n              progressed * params$state_cost[[\"progressed\"]]) * cyc_yr +\n             pf * drug * cyc_yr,                            # active-treatment cost while progression-free\n      qaly = (pf * params$state_util[[\"pf\"]] +\n              progressed * params$state_util[[\"progressed\"]]) * cyc_yr,\n      d_cost = cost * disc,\n      d_qaly = qaly * disc\n    )\n\n  agg <- st %>% group_by(arm) %>%\n    summarise(d_cost = sum(d_cost), d_qaly = sum(d_qaly), .groups = \"drop\")\n\n  tr <- params$treated_arm; rf <- params$reference_arm\n  inc_cost <- agg$d_cost[agg$arm == tr] - agg$d_cost[agg$arm == rf]\n  inc_qaly <- agg$d_qaly[agg$arm == tr] - agg$d_qaly[agg$arm == rf]\n  list(per_arm = agg,\n       icer = inc_cost / inc_qaly,                          # incremental cost per QALY\n       nmb  = inc_qaly * params$wtp - inc_cost)             # incremental net monetary benefit\n}",
        "description": "Partitioned survival model in R. `surv` is a data.frame with one row per (cycle, arm):\n  cycle (integer 0..n), arm (character), s_pfs, s_os  -- extrapolated survivor functions.\n`params` is a list of cycle length (years), discount rate, state costs/utilities, drug cost by arm,\ntreated/reference arm names, and willingness-to-pay. Returns per-arm discounted cost, QALYs, ICER, NMB.",
        "dependencies": [
          "dplyr"
        ],
        "source_citations": [
          "williams-2017"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  OS[Overall survival curve S_OS&#40;t&#41;<br/>fitted + extrapolated] --> P\n  PFS[Progression-free survival curve S_PFS&#40;t&#41;<br/>fitted + extrapolated] --> P\n  P[Partition at each cycle t]\n  P --> PF[Progression-free occupancy = S_PFS&#40;t&#41;]\n  P --> PROG[Progressed occupancy = S_OS&#40;t&#41; - S_PFS&#40;t&#41;]\n  P --> DEAD[Dead occupancy = 1 - S_OS&#40;t&#41;]\n  PF --> ACC[Apply state costs + utilities to area under each curve]\n  PROG --> ACC\n  DEAD --> ACC\n  ACC --> RES[Discounted total cost &amp; QALYs per arm -> ICER / NMB]\n  P -.->|CHECK: if S_PFS &gt; S_OS at any t,<br/>progressed &lt; 0 -> invalid; use STM| GUARD[Structural validity guard]",
        "caption": "Partitioned survival logic. State occupancy is read directly from two independently extrapolated curves, with the structural guard that PFS must never exceed OS (otherwise progressed-disease occupancy is negative and the model is invalid).",
        "alt_text": "Flowchart showing overall-survival and progression-free-survival curves feeding a per-cycle partition into progression-free, progressed, and dead occupancy, then cost and QALY accrual to an ICER, with a validity guard that PFS must not exceed OS.",
        "source_type": "illustrative",
        "source_citations": [
          "woods-2020"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Q{Is post-progression survival the key driver<br/>and dependent on subsequent therapy / external data?} -->|Yes| STM[Use a state-transition / Markov model<br/>that constrains progressed -> dead]\n  Q -->|No| M{Are OS and PFS mature, and do<br/>extrapolated curves stay non-crossing?}\n  M -->|No| FIX[Constrain curves, add external/registry tail anchor,<br/>or move to STM / mixture-cure model]\n  M -->|Yes| H{Is a lifetime cost-utility estimate required<br/>for HTA &#40;vs in-trial RMST&#41;?}\n  H -->|No| RMST[Report restricted mean survival time<br/>over the observed horizon]\n  H -->|Yes| PSM[Partitioned survival model is appropriate<br/>+ mandatory extrapolation/structural sensitivity]",
        "caption": "Decision logic for choosing a partitioned survival model versus a state-transition model, an externally anchored variant, a mixture-cure model, or simply reporting RMST.",
        "alt_text": "Decision tree determining when a partitioned survival model is appropriate versus a state-transition model, curve constraints, a cure model, or restricted mean survival time.",
        "source_type": "illustrative",
        "source_citations": [
          "woods-2020"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "alternative_to",
        "target_slug": "markov-transition-probabilities-rwe",
        "notes": "A Markov/state-transition model specifies progression and death transition probabilities directly and constrains post-progression survival, whereas a PSM reads occupancy from independently extrapolated OS/PFS and cannot."
      },
      {
        "relation_type": "requires",
        "target_slug": "survival-extrapolation-hta-rwe",
        "notes": "The lifetime OS and PFS curves a PSM partitions must be parametrically extrapolated beyond observed follow-up; the extrapolation choice is usually the dominant driver of the ICER."
      },
      {
        "relation_type": "is_variant_of",
        "target_slug": "health-economic-modeling-methods-rwe",
        "notes": "Partitioned survival is one structural family within decision-analytic health-economic modeling."
      },
      {
        "relation_type": "used_with",
        "target_slug": "cost-utility",
        "notes": "PSMs are the standard structure for lifetime cost-utility (cost-per-QALY) analysis in oncology HTA."
      },
      {
        "relation_type": "produces",
        "target_slug": "icer-net-monetary-benefit-rwe",
        "notes": "The model output is the incremental cost-effectiveness ratio and net monetary benefit comparing arms."
      },
      {
        "relation_type": "used_with",
        "target_slug": "qaly-utility-mapping-rwe",
        "notes": "State utilities (progression-free vs progressed) are applied to time spent in each state to compute QALYs."
      },
      {
        "relation_type": "used_with",
        "target_slug": "discounting-costs-effects-rwe",
        "notes": "Per-cycle costs and QALYs are discounted to present value before forming the ICER/NMB."
      },
      {
        "relation_type": "see_also",
        "target_slug": "restricted-mean-survival-time-rmst",
        "notes": "RMST is an assumption-light in-trial alternative when a lifetime extrapolated estimate is not required."
      }
    ],
    "aliases": [
      "partitioned survival analysis",
      "area-under-the-curve model",
      "AUC model",
      "PartSA",
      "three-state partitioned survival model"
    ],
    "complexity": "advanced",
    "regulatory_relevance": [
      "hta",
      "fda",
      "ema"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "pass-imposed",
    "name": "Imposed Post-Authorisation Safety Study (PASS)",
    "short_definition": "A non-interventional or interventional safety study that a regulator legally requires a marketing-authorisation holder to conduct as a condition of authorisation (EU Article 21a/22a; US FDA postmarketing requirement), with a regulator-endorsed protocol, mandatory milestones, and binding regulatory consequences.",
    "long_description": "An **imposed post-authorisation safety study (PASS)** is a study a medicines regulator *legally obligates* a\nmarketing-authorisation holder (MAH) to conduct, as opposed to one the MAH proposes voluntarily. In the EU the legal\ninstrument is a condition of the marketing authorisation under **Article 21a of Directive 2001/83/EC** (imposed at the\ntime of initial authorisation) or **Article 22a** (imposed after authorisation, typically in response to an emerging\nsafety concern), introduced by the 2010 pharmacovigilance legislation (Regulation (EU) No 1235/2010 and Directive\n2010/84/EU). For non-interventional imposed PASS, the **Pharmacovigilance Risk Assessment Committee (PRAC)** endorses\nthe protocol *before* the study starts, reviews protocol amendments, and assesses interim and final results;\noperational conduct follows **GVP Module VIII**. The closest US analogue is an **FDA postmarketing requirement (PMR)**\nunder section 505(o)(3) of the FD&C Act (FDAAA 2007) — legally mandatory and enforceable — which is distinct from a\npostmarketing *commitment* (PMC), the voluntary counterpart.\n\n**Core conceptual distinction**. \"Imposed\" is a *governance and accountability* qualifier, not an epidemiologic design.\nThe same cohort or case-control machinery can underlie an imposed or a voluntary PASS; what differs is the chain of\nobligation. 
An imposed PASS carries: (1) a legal trigger and a defined regulatory question (a specific safety concern in\nthe risk-management plan, e.g., a signal from spontaneous reports or a clinical-trial imbalance); (2) a *binding,\nregulator-endorsed protocol* with pre-specified outcomes, sample size, feasibility, and a statistical analysis plan that\ncannot be materially changed without PRAC agreement; (3) *mandatory timelines* — protocol submission, registration in\nthe EU PAS Register / HMA-EMA Catalogue, progress and interim reports, and a final study report due on a fixed date; and\n(4) *binding consequences* — the results feed a PRAC/CHMP recommendation that can change the label, add risk-minimisation\nmeasures, restrict indications, or trigger suspension/withdrawal, and non-compliance is itself a regulatory infringement.\nThe estimand is therefore fixed up front and adversarially scrutinised; you do not get to re-specify the primary outcome\nafter seeing the data.\n\n**Pros, cons, and trade-offs** (specific and comparative, naming the alternatives).\n- **vs voluntary PASS:** An imposed PASS guarantees the question gets answered on a regulator's timeline and locks the\n  estimand against post-hoc drift, which is exactly why regulators reach for it when a signal is consequential. Cost: far\n  less analytic latitude — protocol changes require PRAC endorsement, the timeline is non-negotiable, and a \"negative\n  feasibility\" answer does not excuse the MAH from the obligation. **Prefer imposed** only when the question is\n  regulatory-grade and a voluntary commitment would be too slow, too soft, or insufficiently independent.\n- **vs a randomized post-authorisation efficacy/safety trial (or registry-based RCT):** An RCT removes confounding by\n  indication and channeling that haunt a non-interventional PASS, but is slow, costly, sometimes unethical for a marketed\n  product, and underpowered for rare safety outcomes. 
A non-interventional imposed PASS in claims/EHR/registry data\n  delivers large populations and rare events fast, at the price of confounding that must be handled by an\n  active-comparator new-user design, propensity scores, negative controls, and sensitivity analyses. **Prefer the\n  non-interventional route** for rare or long-latency safety endpoints; **prefer a trial** when residual confounding\n  cannot be credibly ruled out and equipoise exists.\n- **vs spontaneous reporting / disproportionality signal detection alone:** Signal detection is hypothesis-*generating*\n  and cannot estimate incidence or a comparative effect; an imposed PASS is hypothesis-*testing* with a denominator.\n  They are sequential, not substitutes — a signal triggers the imposed PASS that quantifies it.\n\n**When to use**. Use (or expect) an imposed PASS when a regulator concludes there is an important *identified or potential\nrisk*, or missing safety information, that (a) is consequential enough to drive a regulatory decision, (b) cannot be\nresolved by spontaneous reporting or routine pharmacovigilance, and (c) the MAH would not adequately address on its own\ntimeline. Typical triggers: a serious signal at authorisation for a first-in-class product; a post-marketing signal\n(Article 22a) such as a malignancy or cardiovascular imbalance; or a class-wide safety question requiring a denominator\nand an active comparator. As a methodologist designing the underlying study, treat the *regulatory question* as the\nprotocol's North Star and pre-specify everything PRAC will scrutinise.\n\n**When NOT to use — and when it is actively misleading or dangerous**.\n- **The question is hypothesis-generating, not estimable.** If you cannot define a denominator, a validated outcome, and\n  a credible comparator, a \"study\" imposed to quantify the risk will produce an under-powered, confounded estimate that\n  *looks* authoritative and can wrongly reassure or wrongly condemn a product. 
Resolve feasibility first.\n- **No fit-for-purpose data source exists in the required timeframe.** Imposing a 3-year final-report deadline on an\n  outcome that needs 10 years of latency (e.g., solid tumors) guarantees a final report that is effectively an\n  uninterpretable interim; the design must match the latency or the imposition is theatre.\n- **Confounding by indication is irreducible.** A non-interventional PASS comparing a new drug to \"non-users\" for an\n  outcome tied to disease severity will mistake channeling for drug effect. If no active comparator is initiated for the\n  same indication at the same decision point, the imposed cohort can be *actively misleading* — flagging or clearing a\n  drug on the basis of confounding, with label or market consequences.\n- **Using the imposed label to skip rigor.** \"It's mandated, ship it\" is the failure mode: the legal obligation does not\n  relax the methodologic bar; it raises it, because the result is binding.\n\n**Data-source operational depth**. A non-interventional imposed PASS lives or dies on the substrate, and PRAC will probe\neach weakness.\n- **Claims (US FFS / EU national):** Strong for large denominators, dispensing-based exposure (NDC/ATC + `fill_date` +\n  `days_supply`), and rare-event capture. Failure modes PRAC will raise: **Medicare Advantage (MA)-only person-time lacks\n  fee-for-service claims**, so exposure and outcomes are silently missing — restrict to enrollees with full medical +\n  pharmacy benefit (A/B/D) and exclude MA-only spans, or person-time and events are under-counted differentially.\n  **Differential competing risks** (e.g., cancer outcomes in an older, sicker arm where death from other causes is more\n  common) bias a naive Kaplan–Meier analysis that treats competing deaths as censoring; pre-specify whether the target is\n  a cause-specific hazard or a cumulative incidence (Fine–Gray subdistribution model) and analyse accordingly.\n  `days_supply` is distorted by 90-day mail order, samples, and stockpiling. 
Outcomes are coded for billing, not truth —\n  every endpoint needs a *validated* algorithm with reported PPV, and ideally chart adjudication of a sample.\n- **EHR:** Adds severity, labs, and free text to sharpen the indication and confounder set, but exposure is the *order*,\n  not the dispensing (link to pharmacy fills to confirm initiation), and visit-driven capture means a patient who leaves\n  the network is differentially lost — define observation windows and treat loss to follow-up as informative.\n- **Registry (disease / product / pregnancy):** Often the regulator's preferred substrate for an imposed PASS because it\n  supports adjudicated outcomes and product-specific follow-up; weak for complete background exposure and for an external\n  comparator. Link to claims for the full fill history and to a death/cancer registry for mortality and malignancy\n  ascertainment. EU multi-country imposed PASS frequently federate several national data sources under a common data\n  model, which raises harmonisation and heterogeneity issues whose handling must be pre-specified.\n- **Linked claims–EHR–vital/cancer records:** The ideal substrate (severity + completeness + reliable mortality), but\n  linkage selects the linkable subset and creates order/fill/service date discrepancies that must be reconciled before\n  time-zero assignment. **Immortal time** is a recurrent trap in procedure- or initiation-anchored safety studies —\n  follow-up must start at exposure, not at an earlier qualifying event.\n\n**Worked example (imposed, claims-based).** A regulator authorises a new SGLT2 inhibitor and PRAC notes a potential\nsignal for urinary-tract malignancy from the clinical-trial program. Because the concern is known at authorisation, the\nstudy is imposed under **Article 21a**: the MAH is required to conduct a non-interventional imposed PASS, and PRAC\nendorses the protocol *before* the study starts. 
The endorsed protocol pre-specifies an\n**active-comparator new-user** cohort in a multi-payer US claims database: (1) Eligibility — adults with type 2 diabetes,\n≥2 diabetes diagnoses, and **365 days of continuous medical + pharmacy enrollment with FFS-observable claims**\n(MA-only person-time excluded so \"no prior fill\" is real, not missing). (2) Washout — no fill of the study drug *or* the\ncomparator (e.g., a DPP-4 inhibitor) in the 365-day lookback, making both arms incident users. (3) Time zero — the first\nqualifying `fill_date`; the arm is assigned from the NDC dispensed that day. (4) Outcome — incident urinary-tract cancer\nvia a *validated* claims algorithm (≥1 inpatient or ≥2 outpatient dx codes ≥30 days apart + a confirmatory procedure),\nwith a ≥6-month lag/latency window to exclude prevalent/baseline-detected disease. (5) Follow-up — from time zero to the\nvalidated event, censoring at disenrollment, death, end of data, and (as-treated) discontinuation (`days_supply` end +\ngrace period) or switch; **competing risk of death** handled by a cause-specific or Fine–Gray model. (6) Confounding —\nhigh-dimensional propensity score from the 365-day baseline window, with a **negative-control outcome** and\n**negative-control exposure** to detect residual bias. (7) Governance — the protocol, an **interim report at the\nPRAC-specified milestone**, and a **final study report** on the fixed date are registered in the EU PAS Register; the\nfinal result feeds a PRAC recommendation that may revise the label or impose risk-minimisation measures. The\n\"imposed\" character is everything outside the epidemiology: the legal trigger, the pre-endorsed locked estimand, the\nbinding timeline, and the regulatory decision the result will drive.",
    "primary_category": "Study_Design",
    "tags": [
      "post-authorisation-safety-study",
      "PASS",
      "pharmacovigilance",
      "regulatory",
      "EMA",
      "PRAC",
      "GVP-Module-VIII",
      "post-marketing-requirement",
      "non-interventional",
      "risk-management-plan"
    ],
    "applies_to_study_types": [
      "pass_imposed"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1002/pds.2209",
        "url": "https://doi.org/10.1002/pds.2209",
        "citation_text": "Blake KV, Prilla S, Accadebled S, Guimier M, Biscaro M, Persson I, Arlett P, Blackburn S, Fitt H. European Medicines Agency review of post-authorisation studies with implications for the European Network of Centres for Pharmacoepidemiology and Pharmacovigilance. Pharmacoepidemiology and Drug Safety. 2011;20(10):1021-1029.",
        "year": 2011,
        "authors_short": "Blake et al.",
        "notes": "EMA's own review establishing the categories of post-authorisation studies (including imposed safety studies as conditions of the marketing authorisation) and the regulatory framework governing them."
      },
      {
        "role": "explain",
        "doi": "10.1002/pds.3281",
        "url": "https://doi.org/10.1002/pds.3281",
        "citation_text": "Blake KV, deVries CS, Arlett P, Kurz X, Fitt H. Increasing scientific standards, independence and transparency in post-authorisation studies: the role of the European Network of Centres for Pharmacoepidemiology and Pharmacovigilance. Pharmacoepidemiology and Drug Safety. 2012;21(7):690-696.",
        "year": 2012,
        "authors_short": "Blake et al.",
        "notes": "Explains the ENCePP/PRAC governance layer - protocol endorsement, the EU PAS Register, and the independence and transparency standards that distinguish an imposed PASS from a sponsor-controlled study."
      },
      {
        "role": "demonstrate",
        "doi": "10.3389/fdsfr.2025.1574430",
        "url": "https://doi.org/10.3389/fdsfr.2025.1574430",
        "citation_text": "Almas F, Girardi A, Crisafulli S, et al. Identifying regulatory outcomes of Non-interventional Post-Authorisation Safety Studies (PASS) in the European repository of studies using publicly available information. Frontiers in Drug Safety and Regulation. 2025.",
        "year": 2025,
        "authors_short": "Almas et al.",
        "notes": "Empirically characterises imposed vs non-imposed non-interventional PASS in the EU PAS Register and the regulatory decisions they produced - shows what \"binding consequences\" looks like in practice."
      },
      {
        "role": "use",
        "doi": "10.1111/dom.16477",
        "url": "https://doi.org/10.1111/dom.16477",
        "citation_text": "Schmedt N, Alhamdow A, Tskhvarashvili K, et al. Post-authorisation safety study to assess the risk of urinary tract cancer in people with type 2 diabetes initiating empagliflozin: A multi-country European study. Diabetes, Obesity and Metabolism. 2025.",
        "year": 2025,
        "authors_short": "Schmedt et al.",
        "notes": "A conducted multi-country, multi-database non-interventional PASS for a malignancy signal - a concrete template for the active-comparator new-user, validated-outcome, federated-data design used to discharge an imposed obligation."
      }
    ],
    "plain_language_summary": "An imposed Post-Authorization Safety Study (PASS) is a study a government medicines regulator legally orders a drug company to run after approving a drug, because a safety concern could not be fully resolved before approval. The regulator writes the study rules into the approval itself — picking the question, locking the study plan, and setting hard deadlines — so the company cannot quietly drop it or change the design without permission. Unlike a study a company chooses to run on its own, an imposed PASS carries real consequences: if results show a serious risk, the regulator can change the drug label, restrict who can use it, or even pull it from the market.",
    "key_terms": [
      {
        "term": "PASS",
        "definition": "Post-Authorization Safety Study — any study a drug company conducts on an approved medicine to learn more about how safe it is in real-world use."
      },
      {
        "term": "regulatory obligation",
        "definition": "A legal requirement attached to a drug approval, meaning the company must comply or face formal regulatory consequences such as fines, label changes, or loss of the marketing authorization."
      },
      {
        "term": "marketing-authorization holder (MAH)",
        "definition": "The company that legally owns the right to sell an approved drug in a given country or region — the entity the regulator holds responsible for post-approval safety monitoring."
      },
      {
        "term": "PRAC",
        "definition": "The Pharmacovigilance Risk Assessment Committee — the European Medicines Agency body that reviews drug safety, endorses imposed PASS protocols, and decides what regulatory action the results should trigger."
      },
      {
        "term": "estimand",
        "definition": "The precise safety question the study is designed to answer, defined up front before data collection begins, so the study cannot be quietly redirected to a more favorable question after the fact."
      }
    ],
    "worked_example": {
      "scenario": "A regulator approves a new blood-sugar drug (an SGLT2 inhibitor) for type 2 diabetes, but trial data hinted at a possible link to urinary-tract cancer — a rare outcome that would take years and a large population to properly study. Under EU Article 22a, the regulator orders the drug company to conduct an imposed PASS in real-world claims data. PRAC reviews and locks the study protocol before any data are touched. The table below shows the kind of records the analyst works from; the steps explain what makes this PASS imposed rather than voluntary.",
      "dataset": {
        "caption": "Simplified milestone tracker showing the binding deliverables for this imposed PASS, as registered in the EU PAS Register. Every row is a regulatory obligation — missing a due date is itself an infringement.",
        "columns": [
          "milestone",
          "who_controls_it",
          "planned_date",
          "consequence_if_missed"
        ],
        "rows": [
          [
            "Protocol submitted to PRAC",
            "Company drafts; PRAC must endorse before study starts",
            "2024-01-01",
            "Study cannot begin"
          ],
          [
            "EU PAS Register entry",
            "Company registers; public record of obligation",
            "2024-03-01",
            "Regulatory infringement"
          ],
          [
            "Interim safety report",
            "Company submits; PRAC reviews at fixed checkpoint",
            "2025-09-15",
            "Regulatory infringement"
          ],
          [
            "Final study report",
            "Company submits on fixed date; cannot be deferred",
            "2026-03-15",
            "Regulatory infringement; MAH liable"
          ]
        ]
      },
      "steps": [
        "The regulator — not the company — defines the safety question: does this SGLT2 inhibitor raise the risk of urinary-tract cancer compared with a similar diabetes drug?",
        "PRAC endorses the study protocol before data collection begins, locking the primary outcome, the comparator drug, and the statistical plan; the company cannot change any of these without PRAC agreement.",
        "The study is registered publicly in the EU PAS Register with all planned dates visible, creating an auditable record of every obligation.",
        "The company runs the study in a large claims database, following the endorsed design: new users of either drug, with a 365-day look-back to confirm no prior use, and a validated cancer outcome definition agreed with PRAC.",
        "Interim and final reports are submitted on the dates PRAC set — not when the company prefers — and PRAC assesses whether the results require a label change, a use restriction, or no action."
      ],
      "result": "An imposed PASS applies when: (1) a safety concern is real enough to influence a regulatory decision but cannot be resolved before approval; (2) the question requires a large real-world population and years of follow-up; and (3) the regulator concludes a voluntary study commitment would be too slow or too easy for the company to drop. The imposed design guarantees the question gets answered on the regulator's timeline, with a locked study plan and binding consequences — the defining difference from a voluntary PASS, where the company controls the design and schedule."
    },
    "prerequisites": [
      "pass-voluntary",
      "signal-detection",
      "regulatory-readiness-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Imposed at initial authorisation (EU Article 21a)",
        "description": "The safety study is a condition of the marketing authorisation from day one, usually for a first-in-class product or one with important missing safety information at approval; the protocol and milestones are agreed before launch and tracked in the risk-management plan.",
        "edge_cases": [
          "Background incidence and a fit-for-purpose comparator may not yet exist for a brand-new mechanism, forcing reliance on external/historical controls with their own bias.",
          "Early launch means thin exposure accrual; interim reports can be uninformative until uptake matures."
        ],
        "data_source_notes": "claims/registry: pre-specify a feasibility/accrual analysis and the earliest credible interim; a product registry is often imposed alongside to guarantee follow-up."
      },
      {
        "name": "Imposed post-authorisation in response to a signal (EU Article 22a)",
        "description": "Triggered by an emerging safety concern after marketing (e.g., a malignancy or cardiovascular imbalance); PRAC endorses the protocol and sets the interim/final timeline, and the result feeds a referral or label decision.",
        "edge_cases": [
          "The signal itself may have caused channeling or stimulated reporting, contaminating the comparison if not handled with an active comparator and negative controls.",
          "Outcome latency may exceed the imposed deadline, making the final report a planned interim in disguise."
        ],
        "data_source_notes": "claims: exclude MA-only person-time; use a validated outcome algorithm with reported PPV and a latency window appropriate to the signal."
      },
      {
        "name": "US FDA postmarketing requirement (PMR) analogue",
        "description": "Under FDAAA 2007 section 505(o)(3) the FDA can require a postmarketing study/trial (a PMR); legally enforceable, unlike a postmarketing commitment (PMC). Often executed in Sentinel or claims data.",
        "edge_cases": [
          "PMR and EU imposed PASS for the same product may demand harmonised but not identical protocols across jurisdictions.",
          "Sentinel's distributed, query-based model constrains bespoke analyses relative to a pooled dataset."
        ],
        "data_source_notes": "claims/Sentinel: design to the common data model; confirm the outcome and exposure definitions are portable across data partners."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Voluntary PASS / postmarketing commitment (PMC)",
        "pros_of_this": "Legally binds the MAH to a regulator-endorsed protocol, fixed timeline, and a locked estimand; results carry binding regulatory consequences and cannot be quietly dropped.",
        "cons_of_this": "Minimal analytic latitude (protocol changes need PRAC endorsement), non-negotiable deadlines, and the obligation persists even if feasibility is poor.",
        "when_to_prefer": "When the safety question is consequential enough to drive a regulatory decision and a voluntary route would be too slow, too soft, or insufficiently independent."
      },
      {
        "compared_to": "Post-authorisation randomized trial (incl. registry-based RCT)",
        "pros_of_this": "Large real-world populations, rare and long-latency events, fast accrual, and external validity at far lower cost.",
        "cons_of_this": "Susceptible to confounding by indication and channeling that randomization would remove; demands an active comparator, propensity scores, and negative controls to be credible.",
        "when_to_prefer": "Rare or long-latency safety endpoints where a trial is infeasible/unethical and residual confounding can be credibly addressed by design."
      },
      {
        "compared_to": "Spontaneous reporting / disproportionality signal detection",
        "pros_of_this": "Provides a denominator, incidence, and a comparative effect estimate - it tests the hypothesis rather than merely generating it.",
        "cons_of_this": "Slower, costlier, and requires a fit-for-purpose data source, validated outcomes, and a comparator.",
        "when_to_prefer": "After a signal is detected and must be quantified to support a regulatory decision."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Exposure = dispensing (NDC/ATC + fill_date + days_supply). Require continuous medical + pharmacy enrollment across the washout; exclude Medicare Advantage-only person-time where FFS claims are unavailable so exposure and outcomes are observed, not missing. Use a validated outcome algorithm with reported PPV; pre-specify the competing-risk approach and latency window. Time zero = first qualifying fill.",
      "ehr": "Initiation = order/administration; link to dispensing to confirm the patient started. Use labs, problem lists, and notes to sharpen indication and severity; define observation windows and treat loss to follow-up as informative.",
      "registry": "Often the regulator-preferred substrate for adjudicated outcomes and product follow-up; weak for background exposure and an external comparator. Link to claims for full fill history and to death/cancer registries for ascertainment. Multi-country federation needs a pre-specified common data model and heterogeneity handling.",
      "linked": "Linked claims-EHR-vital/cancer records give severity + completeness + mortality, but linkage selects the linkable subset and creates order/fill/service date discrepancies that must be reconciled before time-zero assignment to avoid immortal time."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\n\n# GVP Module VIII mandatory deliverables for a non-interventional imposed PASS, in order.\nREQUIRED = [\"protocol\", \"prac_endorsement\", \"registration_eu_pas\",\n            \"progress_report\", \"interim_report\", \"final_study_report\"]\n\ndef pass_compliance(milestones: pd.DataFrame, as_of: pd.Timestamp) -> pd.DataFrame:\n    m = milestones.copy()\n    m[\"milestone\"] = m[\"milestone\"].str.lower()\n\n    # Flag any mandatory deliverable that is entirely absent from the tracker.\n    present = set(m[\"milestone\"])\n    missing = [{\"study_id\": m[\"study_id\"].iloc[0], \"milestone\": req,\n                \"planned_date\": pd.NaT, \"status\": \"MISSING_FROM_PLAN\"}\n               for req in REQUIRED if req not in present]\n\n    def status(row):\n        # Protocol/amendments cannot proceed without PRAC endorsement.\n        if row[\"milestone\"] in (\"protocol\", \"interim_report\") and not row.get(\"prac_endorsed\", True):\n            return \"BLOCKED_NO_PRAC_ENDORSEMENT\"\n        if pd.notna(row[\"submitted_date\"]):\n            return \"on-time\" if row[\"submitted_date\"] <= row[\"planned_date\"] else \"LATE\"\n        return \"overdue\" if as_of > row[\"planned_date\"] else \"pending\"\n\n    m[\"status\"] = m.apply(status, axis=1)\n    out = pd.concat([m, pd.DataFrame(missing)], ignore_index=True)\n    return out.sort_values(\"planned_date\", na_position=\"last\")",
        "description": "Imposed-PASS GOVERNANCE TRACKER, not an estimator. An imposed PASS is defined by regulator-mandated milestones, so the\nproduction-relevant artifact is a compliance check against the PRAC-endorsed timeline. Required input (one row per\nstudy deliverable, from the protocol/RMP):\n  milestones : study_id, milestone (str), planned_date (datetime), submitted_date (datetime or NaT),\n               prac_endorsed (bool)   # protocol & amendments require PRAC endorsement before conduct\nReturns one row per milestone with its status (pending / on-time / LATE / overdue, plus BLOCKED_NO_PRAC_ENDORSEMENT and\nMISSING_FROM_PLAN for absent mandatory deliverables), sorted by planned_date so upcoming due dates read in order. Wire\nthis to the EU PAS Register entry; a missed mandatory milestone is itself a regulatory infringement.",
        "dependencies": [
          "pandas"
        ],
        "source_citations": [
          "blake-2012"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\n\nREQUIRED <- c(\"protocol\", \"prac_endorsement\", \"registration_eu_pas\",\n              \"progress_report\", \"interim_report\", \"final_study_report\")\n\npass_compliance <- function(milestones, as_of) {\n  # Copy so that := below never modifies the caller's table by reference.\n  m <- copy(as.data.table(milestones))\n  m[, milestone := tolower(milestone)]\n  # Create status up front; otherwise the update below fails when no milestone is missing.\n  m[, status := NA_character_]\n\n  # Add any mandatory GVP Module VIII deliverable missing from the plan.\n  miss <- setdiff(REQUIRED, unique(m$milestone))\n  if (length(miss)) {\n    m <- rbind(m, data.table(study_id = m$study_id[1L], milestone = miss,\n                             planned_date = as.Date(NA), submitted_date = as.Date(NA),\n                             prac_endorsed = NA, status = \"MISSING_FROM_PLAN\"),\n               fill = TRUE)\n  }\n\n  # Only rows without a status yet (i.e. milestones present in the plan) are classified here.\n  m[is.na(status), status := fifelse(\n    milestone %chin% c(\"protocol\", \"interim_report\") & prac_endorsed %in% FALSE,\n    \"BLOCKED_NO_PRAC_ENDORSEMENT\",\n    fifelse(!is.na(submitted_date),\n            fifelse(submitted_date <= planned_date, \"on-time\", \"LATE\"),\n            fifelse(as_of > planned_date, \"overdue\", \"pending\")))]\n  setorder(m, planned_date, na.last = TRUE)\n  m[]\n}",
        "description": "Imposed-PASS governance tracker (R/data.table). Same intent as the Python version: an imposed PASS is a regulatory\nobligation, so the production artifact is milestone compliance, not an effect estimate. Input mirrors the Python frame:\n  milestones : study_id, milestone (chr), planned_date (Date), submitted_date (Date or NA), prac_endorsed (logical)",
        "dependencies": [
          "data.table"
        ],
        "source_citations": [
          "blake-2012"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "%let washout = 365;   /* drug-free + continuous-enrollment lookback that defines an incident user */\n\n/* (1) Candidate time zero = earliest fill of the study drug OR the active comparator, with the dispensed arm. */\nproc sort data=work.rx(where=(drug_class in ('STUDY','COMPARATOR'))) out=rx_q;\n  by person_id fill_date;\nrun;\n\ndata idx;\n  set rx_q;\n  by person_id;\n  if first.person_id;\n  index_date = fill_date;\n  format index_date date9.;\n  length arm $12;\n  arm = drug_class;\n  keep person_id index_date arm;\nrun;\n\n/* (2) New-user restriction: no study OR comparator fill in the washout window before time zero (makes both arms incident). */\nproc sql;\n  create table newuser as\n  select i.* from idx i\n  where not exists (\n    select 1 from work.rx p\n    where p.person_id = i.person_id\n      and p.drug_class in ('STUDY','COMPARATOR')\n      and p.fill_date <  i.index_date\n      and p.fill_date >= i.index_date - &washout\n  );\nquit;\n\n/* (3) Eligibility: continuous FFS-observable enrollment across the full washout through index (exclude MA-only person-time\n       so \"no prior fill\" is a real washout, not missingness) AND >=2 T2DM diagnoses in the baseline window. */\nproc sql;\n  create table cohort as\n  select n.person_id, n.arm, n.index_date,\n         n.index_date - &washout as baseline_start format=date9.\n  from newuser n\n  where exists (\n    select 1 from work.enroll e\n    where e.person_id = n.person_id and e.ma_only = 0\n      and e.enroll_start <= n.index_date - &washout\n      and e.enroll_end   >= n.index_date)\n    and (select count(distinct d.dx_date) from work.dx d\n           where d.person_id = n.person_id\n             and d.dx_code like '250%'        /* T2DM dx codes; replace with the protocol code list */\n             and d.dx_date between n.index_date - &washout and n.index_date) >= 2;\nquit;",
        "description": "Cohort construction (PROC SQL) for the underlying non-interventional imposed PASS - the SGLT2-inhibitor vs DPP-4\nactive-comparator new-user cohort that discharges the Article 22a urinary-tract-cancer obligation. Governance is the\ndistinguishing feature of an imposed PASS, but PRAC endorses an *epidemiologic protocol*, so the production artifact a\nbiostatistician owns is the locked, reproducible cohort that maps to the endorsed estimand. Required input datasets\n(post data-management, one row per record):\n  work.rx     : person_id, fill_date, drug_class ('STUDY'/'COMPARATOR'), days_supply\n  work.enroll : person_id, enroll_start, enroll_end, ma_only (0/1)   /* ma_only person-time lacks FFS claims */\n  work.dx     : person_id, dx_date, dx_code   /* for the >=2 T2DM diagnoses eligibility requirement */\nTime zero = first qualifying fill; arm = the NDC class dispensed that day. Build covariates and the high-dimensional\npropensity score ONLY from [baseline_start, index_date], then apply the validated UTC outcome algorithm and the\npre-specified competing-risk (Fine-Gray or cause-specific) model identically to both arms downstream.",
        "dependencies": [],
        "source_citations": [
          "blake-2011"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Sig[Safety concern: trial imbalance,<br/>spontaneous-report signal, missing safety info] --> Trig{Regulatory trigger}\n  Trig -->|At initial authorisation| A21[Article 21a condition of MA]\n  Trig -->|After authorisation| A22[Article 22a imposed PASS]\n  A21 --> Proto[MAH drafts protocol +<br/>statistical analysis plan]\n  A22 --> Proto\n  Proto --> PRAC[PRAC protocol endorsement<br/>estimand + outcomes + sample size LOCKED]\n  PRAC --> Reg[Register in EU PAS Register / HMA-EMA Catalogue]\n  Reg --> Conduct[Conduct in claims / EHR / registry / linked data]\n  Conduct --> Interim[Interim & progress reports<br/>on PRAC-set milestones]\n  Interim --> Final[Final study report on fixed date]\n  Final --> Decision{PRAC / CHMP assessment}\n  Decision -->|Reassuring| Maintain[Maintain authorisation]\n  Decision -->|Risk confirmed| Action[Label change / risk minimisation /<br/>restriction / suspension / withdrawal]",
        "caption": "Regulatory lifecycle of an EU imposed PASS. The defining features are the legal trigger (Art 21a/22a), the PRAC endorsement that locks the estimand before conduct, the mandatory registration and reporting milestones, and the binding regulatory decision the final report drives.",
        "alt_text": "Flowchart from a safety concern through the Article 21a/22a regulatory trigger, PRAC protocol endorsement, EU PAS Register registration, conduct in real-world data, interim and final reports, and a binding regulatory decision.",
        "source_type": "illustrative",
        "source_citations": [
          "blake-2011"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "gantt\n  title Imposed PASS milestone timeline (Article 22a example)\n  dateFormat YYYY-MM-DD\n  axisFormat %b %Y\n  section Setup (mandatory)\n  Protocol drafted & submitted        :done, p, 2024-01-01, 60d\n  PRAC protocol endorsement (locks estimand) :milestone, e, 2024-03-01, 0d\n  EU PAS Register registration        :done, r, 2024-03-01, 14d\n  section Conduct (mandatory reporting)\n  Data accrual & follow-up            :active, c, 2024-03-15, 540d\n  Interim safety report               :milestone, i, 2025-09-15, 0d\n  section Close-out (binding)\n  Final study report (fixed deadline) :crit, f, 2026-03-15, 1d\n  PRAC/CHMP assessment & decision     :milestone, d, 2026-05-15, 0d",
        "caption": "Mandatory milestones for an imposed PASS. Unlike a voluntary study, protocol endorsement, interim reporting, and the final-report date are binding; a missed deliverable is a regulatory infringement.",
        "alt_text": "Gantt timeline showing protocol submission, PRAC endorsement, EU PAS Register registration, data accrual, a mandatory interim report, a fixed final-report deadline, and the regulatory decision.",
        "source_type": "illustrative",
        "source_citations": [
          "blake-2012"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "alternative_to",
        "target_slug": "pass-voluntary",
        "notes": "A voluntary PASS is proposed by the MAH without a legal obligation; an imposed PASS is a condition of the authorisation (Art 21a/22a) with a PRAC-endorsed protocol, mandatory milestones, and binding consequences."
      },
      {
        "relation_type": "used_with",
        "target_slug": "signal-detection",
        "notes": "A signal from spontaneous-report disproportionality or trial data is the usual trigger; the imposed PASS quantifies what signal detection can only flag."
      },
      {
        "relation_type": "used_with",
        "target_slug": "target-trial-emulation",
        "notes": "The non-interventional analysis underlying an imposed PASS is typically a target-trial emulation (active-comparator new-user) so the regulator-endorsed estimand maps to an explicit trial protocol."
      },
      {
        "relation_type": "used_with",
        "target_slug": "active-comparator-new-user",
        "notes": "The active-comparator new-user design is the usual analytic core of a non-interventional imposed PASS - it mitigates confounding by indication and channeling so the PRAC-endorsed comparative estimand is defensible against adversarial review."
      },
      {
        "relation_type": "used_with",
        "target_slug": "safety-signal-case-definition-rwe",
        "notes": "A validated outcome/case definition with reported PPV is required before PRAC will endorse the protocol of an imposed safety study."
      },
      {
        "relation_type": "used_with",
        "target_slug": "disease-registry",
        "notes": "A disease or product registry is frequently imposed as the substrate for an imposed PASS to guarantee adjudicated outcomes and product-specific follow-up."
      },
      {
        "relation_type": "see_also",
        "target_slug": "product-registry",
        "notes": "Product registries are a common vehicle for imposed post-authorisation follow-up of a specific medicinal product."
      },
      {
        "relation_type": "see_also",
        "target_slug": "risk-evaluation",
        "notes": "Imposed PASS results feed the benefit-risk reassessment and risk-management plan that drive regulatory decisions."
      }
    ],
    "aliases": [
      "regulator-imposed PASS",
      "mandatory PASS",
      "imposed post-authorisation safety study",
      "Article 22a study",
      "Article 21a safety study",
      "conditional PASS"
    ],
    "complexity": "advanced",
    "regulatory_relevance": [
      "ema",
      "fda"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "pass-voluntary",
    "name": "Voluntary (Non-Imposed) Post-Authorisation Safety Study",
    "short_definition": "A sponsor-initiated post-authorisation safety study that is not a condition of the marketing authorisation, conducted under GVP Module VIII to identify, characterise, or quantify a safety hazard or confirm the safety profile of an already-authorised medicine.",
    "long_description": "A **post-authorisation safety study (PASS)** is, under EMA Good Pharmacovigilance Practices (GVP) Module VIII,\n\"any study relating to an authorised medicinal product conducted with the aim of identifying, characterising or\nquantifying a safety hazard, confirming the safety profile of the medicine, or measuring the effectiveness of\nrisk-management measures.\" A PASS is *voluntary* (non-imposed) when the marketing-authorisation holder initiates it of\nits own accord rather than because a competent authority has made it a binding **condition** of the authorisation.\nMost voluntary PASS are **non-interventional**: the medicine is prescribed in the usual manner per the marketing\nauthorisation, treatment is not decided by a protocol, and no additional diagnostic or monitoring procedures are\napplied, so the study is a secondary-data cohort, case-control, self-controlled, or registry analysis rather than a\ntrial.\n\n**Core conceptual distinction**. The defining axis is *regulatory status and control*, not the analytic method:\nthe same new-user cohort can be either a voluntary or an imposed PASS. (1) *Voluntary vs imposed*: an **imposed** PASS\nis a condition of the marketing authorisation under Article 9, 21a, or 22a of Directive 2001/83/EC (Annex II / the risk\nmanagement plan), so the **PRAC** assesses the draft protocol before the study starts and assesses the final results,\nsubstantial protocol amendments need approval, milestones are binding, and non-completion carries regulatory and legal\nconsequences (variation, suspension, financial penalties). A **voluntary** PASS sits outside that mandatory oversight:\nthe sponsor controls timing, scope, and publication; PRAC protocol pre-approval is *not* required (though ENCePP/EU PAS\nRegister registration and the ENCePP Code of Conduct are strongly encouraged and increasingly expected by HTA bodies and\njournals). 
(2) *Non-interventional vs clinical-trial PASS*: a voluntary PASS may legally be a clinical trial, but in\npractice it is overwhelmingly non-interventional, which is what places it in the realm of RWE database design. The\nestimand is the same family of pharmacovigilance quantities — absolute incidence of an adverse event of special\ninterest (AESI), a comparative hazard versus an active comparator or background population, or a standardised incidence\nratio against an external reference — assessed for *safety*, with the analytic burden concentrated on outcome\nascertainment, susceptibility windows, and confounding by indication rather than on demonstrating efficacy.\n\n**Pros, cons, and trade-offs**.\n- **vs imposed PASS:** A voluntary PASS gives the sponsor design latitude, faster start (no PRAC protocol gate), and\n  control over scope and dissemination — useful for hypothesis-generating signal characterisation, label-enrichment\n  evidence, or HTA-facing safety packages. Cost: it carries less *a priori* regulatory weight, PRAC has not endorsed\n  the protocol, and a voluntary study cannot by itself discharge a binding pharmacovigilance obligation. **Prefer\n  voluntary** when the safety question is sponsor-driven and not a regulatory condition; **an imposed PASS is required**\n  when the authority has mandated the evidence as a condition of marketing.\n- **vs spontaneous-reporting / disproportionality signal detection:** A PASS supplies a defined denominator\n  (person-time at risk), a comparator, and protocol-prespecified outcome definitions, so it *quantifies* and tests a\n  hazard rather than merely flagging it. 
Cost: far more expensive and slower than mining a spontaneous-report database.\n  **Prefer a PASS** to characterise or refute a signal that disproportionality analysis has already raised.\n- **vs a head-to-head comparative-effectiveness study:** A PASS is governed by the pharmacovigilance legal framework\n  (GVP) and is safety-focused, so it inherits PSUR/RMP linkage, qualified-person-for-pharmacovigilance accountability,\n  and adverse-event reporting obligations that a generic effectiveness study does not. Cost: that governance overhead\n  is wasted if the question is purely comparative effectiveness. The *analytic core* of a non-interventional PASS is\n  usually an active-comparator new-user or self-controlled design — the PASS label adds the regulatory wrapper.\n\n**When to use**. A sponsor wants to identify, quantify, or confirm a safety concern for an authorised product on its\nown initiative — e.g. characterising the real-world incidence of a labelled AESI, evaluating effectiveness of a risk\nminimisation measure (a pregnancy-prevention programme, a prescriber education tool), generating EU PAS Register-listed\nsafety evidence to support a future label or HTA submission, or pre-empting a likely regulatory request. Choose the\nnon-interventional route when the product is used per its authorisation and the question can be answered in secondary\ndata (claims, EHR, disease/product registries) or a light-touch prospective cohort.\n\n**When NOT to use — and when it is actively misleading or dangerous**.\n- **The study is a condition of the marketing authorisation.** If PRAC or a national authority has imposed it, a\n  voluntary design does not satisfy the obligation; submitting voluntary work in its place is a compliance failure. 
Use\n  the imposed pathway with PRAC protocol assessment.\n- **The intervention departs from routine care.** Allocating treatment by protocol, adding non-routine tests, or\n  randomising converts the study into a clinical trial; calling it a \"non-interventional PASS\" mislabels it and evades\n  the correct ethics/regulatory regime — a serious integrity problem.\n- **The data cannot support the safety estimand.** A rare AESI needs a denominator and follow-up that a short,\n  fragmented data source cannot deliver; reporting an \"incidence\" on unstable person-time is misleading. If only signal\n  detection is feasible, do not dress it up as a quantitative PASS.\n- **Confounding by indication is unaddressed.** Comparing a drug to non-users for a safety outcome reproduces\n  channelling bias; without an active comparator and time-zero alignment, a \"reassuring\" or \"alarming\" PASS result can\n  be an artefact, not the drug.\n\n**Data-source operational depth**.\n- **Claims (FFS or commercial):** Exposure is the pharmacy claim (NDC + `fill_date` + `days_supply`); AESIs are\n  captured from inpatient/outpatient diagnosis and procedure codes. Require continuous medical + pharmacy enrollment\n  across baseline so absence of prior exposure and outcomes is observed, not missing. Failure modes: **Medicare\n  Advantage-only person-time lacks fee-for-service claims**, so AESI ascertainment is silently incomplete — restrict to\n  enrollees with Parts A/B/D (or commercial medical+pharmacy) and exclude MA-only spans. Acute, hospitalisation-coded\n  AESIs (e.g. anaphylaxis, DKA, acute liver injury) are well captured; insidious or outpatient-managed events are not.\n  Claims carry no death cause and weak severity, so adjudication-ready code algorithms with PPV from prior validation\n  are essential.\n- **EHR:** Labs, vitals, problem lists, and notes sharpen AESI definitions (e.g. 
ALT/AST thresholds for Hy's-law liver\n  injury) and baseline risk factors that claims miss — a real advantage for safety. But visit-driven capture means a\n  patient who leaves the network is differentially lost, and an out-of-network hospitalisation for the very AESI of\n  interest can be invisible; define observation windows explicitly and treat loss to follow-up as potentially\n  informative.\n- **Registry (disease/product/pregnancy):** Strongest for adjudicated, clinically rich safety outcomes and for\n  products with mandated pregnancy or device registries; typically weak for complete drug exposure and for a\n  contemporaneous comparator. Link to claims for the full fill history and to a death index to firm up censoring and\n  fatal-outcome capture.\n- **Linked claims–EHR–vital records:** The ideal substrate for a safety PASS — EHR severity + claims completeness +\n  reliable mortality — but linkage selects the linkable subset and introduces order/fill/service-date discrepancies\n  that must be reconciled before time-zero and the susceptibility window are fixed.\n\n**Worked claims example.** Voluntary, non-interventional PASS to quantify the incidence of acute liver injury (ALI)\namong new users of a recently authorised oral antifungal, with an active comparator, in a commercial + Medicare FFS\ndatabase. (1) Eligibility: age ≥18 and ≥365 days of continuous A/B/D (or commercial medical+pharmacy) enrollment before\nthe first dispensing — the baseline window in which prior ALI and risk factors (alcohol-related codes, hepatitis,\nhepatotoxic co-medications) are measured. (2) New-user washout: no fill of the study drug or the comparator antifungal\nin the prior 365 days, so person-time starts at a true initiation. (3) Time zero = first qualifying `fill_date`; assign\narm from the dispensed NDC. 
(4) Prespecified AESI algorithm: first inpatient ALI diagnosis (ICD code on a hospital\nclaim) with no ALI code in baseline; this *first-event* coding plus the washout removes prevalent disease. (5)\nSusceptibility / risk window: from time zero through the last covered day (`fill_date` + `days_supply`) plus a\nprespecified 30-day carry-over, reflecting drug-induced liver injury latency. (6) Person-time and censoring: accrue at\nrisk until the first ALI event, disenrollment, death, end of data, or end of the risk window; **exclude MA-only spans**\nso person-time reflects FFS-observable claims. (7) Output: crude incidence rate per 1,000 person-years per arm, an\nage/sex-standardised incidence ratio against the comparator, and a Cox or Fine-Gray comparative hazard with death as a\ncompeting risk. (8) Pre-registration in the EU PAS Register and a negative-control outcome to probe residual confounding\ncomplete the protocol — making the voluntary study defensible to PRAC, HTA reviewers, and journals even without imposed\noversight.",
    "primary_category": "Study_Design",
    "tags": [
      "post-authorisation-safety-study",
      "pharmacovigilance",
      "non-interventional",
      "gvp-module-viii",
      "safety-surveillance",
      "adverse-event-of-special-interest",
      "encepp",
      "study-design"
    ],
    "applies_to_study_types": [
      "pass_voluntary"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1111/bcp.13165",
        "url": "https://doi.org/10.1111/bcp.13165",
        "citation_text": "Engel P, Almas MF, De Bruin ML, Starzyk K, Blackburn S, Dreyer NA. Lessons learned on the design and the conduct of Post-Authorization Safety Studies: review of 3 years of PRAC oversight. British Journal of Clinical Pharmacology. 2017;83(4):884-893.",
        "year": 2017,
        "authors_short": "Engel et al.",
        "notes": "PRAC-oversight review that operationalises the GVP Module VIII PASS framework and the design/conduct expectations that distinguish robust (voluntary and imposed) safety studies; the canonical methodological reference."
      },
      {
        "role": "explain",
        "doi": "10.3389/fdsfr.2025.1574430",
        "url": "https://doi.org/10.3389/fdsfr.2025.1574430",
        "citation_text": "Almas MF, Girardi A, Crisafulli S, et al. Identifying regulatory outcomes of Non-interventional Post-Authorisation Safety Studies (PASS) in the European repository of studies using publicly available information. Frontiers in Drug Safety and Regulation. 2025;5:1574430.",
        "year": 2025,
        "authors_short": "Almas et al.",
        "notes": "Characterises the non-interventional PASS landscape in the EU PAS Register, including the voluntary (non-imposed) subset and its regulatory outcomes, clarifying how voluntary studies sit alongside imposed ones."
      }
    ],
    "plain_language_summary": "A Post-Authorization Safety Study (PASS) is research a drug company runs on a medicine that regulators have already approved, specifically to look for safety problems in real-world use. A voluntary PASS is one the company chooses to run on its own, rather than because a regulatory agency made it a condition of approval. Because the drug is already on the market and patients are taking it as their doctor prescribes, these studies almost always use existing records — insurance claims, hospital records, or disease registries — instead of enrolling people in a controlled experiment. The goal is to answer the question: does this approved drug cause harm we did not fully see before approval, and how often?",
    "key_terms": [
      {
        "term": "PASS",
        "definition": "Post-Authorization Safety Study — any research on an already-approved medicine that aims to identify, measure, or confirm a safety risk in the people actually using it."
      },
      {
        "term": "regulatory commitment",
        "definition": "A legally binding requirement that a regulatory agency (such as EMA or FDA) attaches to a drug's approval, requiring the company to complete a specific study or action."
      },
      {
        "term": "GVP Module VIII",
        "definition": "A set of rules published by the European Medicines Agency (EMA) that defines what counts as a PASS and how non-interventional safety studies must be designed and registered."
      },
      {
        "term": "ENCePP",
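        "definition": "The European Network of Centres for Pharmacoepidemiology and Pharmacovigilance, an EMA-coordinated network of research centres. The EU PAS Register, launched under ENCePP, is the public registry where PASS protocols are listed so the study plan is transparent before results are known."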
      },
      {
        "term": "AESI",
        "definition": "Adverse Event of Special Interest — a safety outcome (such as liver injury or a serious heart problem) that the study is specifically designed to detect and measure."
      }
    ],
    "worked_example": {
      "scenario": "A company has received approval for a new oral antifungal drug. The label mentions a possible risk of liver injury, but the pre-approval trials were too small and too short to measure how often this happens in ordinary patients. The company decides — without being told to by a regulator — to run a voluntary PASS using insurance claims data to find out how common liver injury is among the first real-world users, compared to patients taking an older antifungal. Below is a simplified comparison showing how this voluntary PASS differs from a PASS that a regulator has imposed as a condition of approval.",
      "dataset": {
        "caption": "Imposed vs voluntary PASS: a side-by-side comparison of key features",
        "columns": [
          "Feature",
          "Imposed PASS",
          "Voluntary PASS (this concept)"
        ],
        "rows": [
          [
            "Who requires it?",
            "Regulatory agency (e.g., EMA/PRAC) — it is a legal condition of approval",
            "The company itself — no regulator has demanded it"
          ],
          [
            "What question does it answer?",
            "A specific safety question the regulator identified as unanswered before approval",
            "A safety question the sponsor considers important or useful for label or HTA purposes"
          ],
          [
            "Real-world safety after approval?",
            "Yes — conducted in patients using the approved drug in routine care",
            "Yes — same setting, same kind of real-world data"
          ],
          [
            "Who reviews the protocol first?",
            "PRAC (the EMA safety committee) must assess and approve the protocol before the study starts",
            "No mandatory regulatory pre-review; registration in the EU PAS Register is strongly encouraged"
          ],
          [
            "What happens if it is not completed?",
            "Legal and regulatory consequences — possible suspension or fines",
            "No legal penalty, but the company loses the scientific and reputational value it sought"
          ],
          [
            "How does it differ from a pre-approval trial?",
            "Both differ from trials: the drug is prescribed normally, no random assignment, no extra tests added by the protocol",
            "Same — non-interventional, observational, based on routine care records"
          ]
        ]
      },
      "steps": [
        "The company identifies the safety question on its own: how often does the new antifungal cause liver injury in real-world patients, compared to an older antifungal?",
        "Because no regulator has imposed this study, it is classified as voluntary — the company controls the timeline, data source, and scope.",
        "The company registers the protocol in the EU PAS Register (ENCePP) before collecting data, making the study plan public and the results credible to regulators and HTA reviewers even without mandatory PRAC oversight.",
        "The study design is non-interventional: patients are prescribed the drug by their own doctors in routine care, not by a protocol. Insurance claims or EHR records are used — not a trial.",
        "An active comparator (the older antifungal) is included so the incidence of liver injury in new users of the study drug can be compared to a similar patient group, rather than to non-drug-users who may be healthier to begin with.",
        "The result — crude incidence rates per arm and a comparative hazard estimate — can support a future label update, an HTA submission, or early detection of a safety signal, even though the study was not required by law."
      ],
      "result": "A voluntary PASS is used when the sponsor wants to identify or quantify a real-world safety concern for an approved drug on its own initiative. It follows the same non-interventional, claims- or EHR-based design as an imposed PASS, but sits outside mandatory regulatory oversight — making it faster to start and more flexible in scope, while carrying less inherent regulatory authority than a study the agency required."
    },
    "prerequisites": [
      "signal-detection",
      "active-comparator-new-user",
      "pass-imposed"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Non-interventional cohort PASS with active comparator",
        "description": "Secondary-data new-user cohort comparing the AESI hazard of the study drug to a clinically interchangeable active comparator, with time zero at initiation and confounding controlled by propensity score or g-methods.",
        "edge_cases": [
          "A non-null comparator (its own hepatotoxicity/cardiotoxicity) biases the safety contrast toward the null or away from it; choose the comparator for outcome-neutrality, not just shared indication.",
          "Rare AESIs leave sparse events even in large databases, destabilising hazard estimates and matched designs."
        ],
        "data_source_notes": "claims: build NDC/HCPCS exposure lists and validated AESI diagnosis algorithms with known PPV; EHR: add lab/vital thresholds to refine the outcome and capture baseline severity."
      },
      {
        "name": "Self-controlled PASS (SCCS / SCRI)",
        "description": "Within-person designs (self-controlled case series, self-controlled risk interval) that compare AESI rates in risk vs control windows around exposure, removing all time-fixed confounding — well suited to acute, transient-exposure safety signals (e.g. vaccine or short-course drug AESIs).",
        "edge_cases": [
          "Requires the event not to censor or alter future exposure (event-dependent exposure violates SCCS assumptions).",
          "Sensitive to risk-window misspecification; prespecify and run window sensitivity analyses."
        ],
        "data_source_notes": "claims/EHR: needs precise event and exposure dates; days_supply defines the on-treatment risk window and any carry-over."
      },
      {
        "name": "Drug-utilisation / risk-minimisation-effectiveness PASS",
        "description": "A descriptive voluntary PASS measuring real-world use patterns or the effectiveness of an additional risk minimisation measure (RMM) — e.g. adherence to a pregnancy-prevention programme or contraindicated co-prescribing — rather than an AESI incidence per se.",
        "edge_cases": [
          "Effectiveness of an RMM is confounded by secular trends and channelling; an interrupted-time-series or controlled before-after design strengthens causal interpretation."
        ],
        "data_source_notes": "claims: measure contraindicated co-dispensing and required monitoring claims; registry: capture programme enrolment and pregnancy outcomes directly."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Imposed (mandatory) PASS",
        "pros_of_this": "Sponsor controls timing, scope, and dissemination; faster start with no PRAC protocol pre-assessment; flexible for hypothesis-generating or HTA-facing safety questions.",
        "cons_of_this": "Carries less a priori regulatory weight, lacks PRAC protocol endorsement, and cannot discharge a binding pharmacovigilance obligation that is a condition of the marketing authorisation.",
        "when_to_prefer": "The safety question is sponsor-initiated and is not a condition of the authorisation; choose an imposed PASS when the authority has mandated the evidence."
      },
      {
        "compared_to": "Spontaneous-reporting / disproportionality signal detection",
        "pros_of_this": "Provides a defined denominator (person-time), a comparator, and prespecified outcome definitions, so it quantifies and tests a hazard rather than only flagging it.",
        "cons_of_this": "Far more expensive and slower than mining a spontaneous-report database; needs adequate exposure and event counts.",
        "when_to_prefer": "A signal already raised by disproportionality must be characterised, quantified, or refuted with real person-time."
      },
      {
        "compared_to": "Generic comparative-effectiveness RWE study",
        "pros_of_this": "Sits inside the GVP/pharmacovigilance legal framework with RMP/PSUR linkage and adverse-event reporting, giving the safety evidence regulatory standing.",
        "cons_of_this": "That governance overhead is wasted when the question is purely effectiveness; the analytic engine is the same active-comparator new-user or self-controlled design either way.",
        "when_to_prefer": "The primary objective is safety surveillance of an authorised product, not comparative effectiveness."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Exposure = pharmacy claim (NDC + fill_date + days_supply); AESIs from validated inpatient/outpatient diagnosis algorithms with known PPV. Require continuous medical + pharmacy enrollment across baseline; exclude Medicare Advantage-only person-time where fee-for-service claims (and therefore AESI capture) are unavailable. Define the risk/susceptibility window from days_supply plus a prespecified carry-over reflecting the biological latency of the AESI.",
      "ehr": "Use labs, vitals, and notes to refine AESI definitions (e.g. ALT/AST for liver injury) and baseline severity an advantage over claims for safety. Visit-driven capture loses out-of-network events; define observation windows explicitly and treat loss to follow-up as potentially informative.",
      "registry": "Strong for adjudicated safety outcomes and mandated product/pregnancy registries; weak for complete exposure and a contemporaneous comparator. Link to claims for fills and to a death index for fatal-outcome capture and censoring.",
      "linked": "Linked claims-EHR-vital-records is the ideal safety substrate (severity + completeness + mortality) but introduces linkage selection and order/fill/service-date discrepancies to reconcile before fixing time zero and the susceptibility window."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\nimport numpy as np\n\nWASHOUT_DAYS = 365   # drug-free + continuous-enrollment baseline that defines a \"new user\"\nCARRYOVER_DAYS = 30  # AESI latency added to the last covered day to close the risk window\n\ndef build_pass_cohort(rx: pd.DataFrame, enroll: pd.DataFrame, dx: pd.DataFrame) -> pd.DataFrame:\n    rx = rx.sort_values([\"person_id\", \"fill_date\"])\n    study = rx[rx[\"drug_class\"].isin([\"STUDY\", \"COMPARATOR\"])]\n\n    # Time zero = first fill of study drug or active comparator; arm = drug dispensed that day.\n    idx = (study.groupby(\"person_id\").first().reset_index()\n                .rename(columns={\"fill_date\": \"index_date\", \"drug_class\": \"arm\"}))\n\n    # New-user washout: drop anyone with a study/comparator fill in the WASHOUT_DAYS before index.\n    prior = study.merge(idx[[\"person_id\", \"index_date\"]], on=\"person_id\")\n    prior = prior[(prior[\"fill_date\"] < prior[\"index_date\"]) &\n                  (prior[\"fill_date\"] >= prior[\"index_date\"] - pd.Timedelta(days=WASHOUT_DAYS))]\n    idx = idx[~idx[\"person_id\"].isin(prior[\"person_id\"])].copy()\n\n    # Continuous, FFS-observable enrollment across baseline through index (no MA-only spans -> no silent AESI gaps).\n    e = enroll.merge(idx[[\"person_id\", \"index_date\"]], on=\"person_id\")\n    e[\"covers\"] = ((e[\"enroll_start\"] <= e[\"index_date\"] - pd.Timedelta(days=WASHOUT_DAYS)) &\n                   (e[\"enroll_end\"]   >= e[\"index_date\"]) & (~e[\"ma_only\"]))\n    idx = idx[idx[\"person_id\"].isin(e.loc[e[\"covers\"], \"person_id\"])].copy()\n\n    # Exclude prevalent AESI: require no AESI dx in the baseline window (first-event definition).\n    base_dx = dx[dx[\"aesi\"]].merge(idx[[\"person_id\", \"index_date\"]], on=\"person_id\")\n    prevalent = base_dx[(base_dx[\"dx_date\"] < base_dx[\"index_date\"]) &\n                        (base_dx[\"dx_date\"] >= base_dx[\"index_date\"] - pd.Timedelta(days=WASHOUT_DAYS))]\n 
   idx = idx[~idx[\"person_id\"].isin(prevalent[\"person_id\"])].copy()\n\n    # Risk window end = last covered day (fill_date + days_supply) over on-treatment fills, plus carry-over.\n    on_tx = study.merge(idx[[\"person_id\", \"index_date\", \"arm\"]], on=\"person_id\")\n    on_tx = on_tx[on_tx[\"fill_date\"] >= on_tx[\"index_date\"]]\n    on_tx[\"covered_end\"] = on_tx[\"fill_date\"] + pd.to_timedelta(on_tx[\"days_supply\"], unit=\"D\")\n    rx_end = on_tx.groupby(\"person_id\")[\"covered_end\"].max() + pd.Timedelta(days=CARRYOVER_DAYS)\n    idx = idx.merge(rx_end.rename(\"risk_end\"), on=\"person_id\")\n\n    # Censor at min(risk window end, disenrollment, end of data); first incident AESI on/after index = event.\n    disenroll = enroll.groupby(\"person_id\")[\"enroll_end\"].max().rename(\"disenroll\")\n    idx = idx.merge(disenroll, on=\"person_id\")\n    idx[\"fu_end\"] = idx[[\"risk_end\", \"disenroll\"]].min(axis=1)\n\n    inc = dx[dx[\"aesi\"]].merge(idx[[\"person_id\", \"index_date\", \"fu_end\"]], on=\"person_id\")\n    inc = inc[(inc[\"dx_date\"] >= inc[\"index_date\"]) & (inc[\"dx_date\"] <= inc[\"fu_end\"])]\n    first_evt = inc.groupby(\"person_id\")[\"dx_date\"].min().rename(\"event_date\")\n    idx = idx.merge(first_evt, on=\"person_id\", how=\"left\")\n    idx[\"event\"] = idx[\"event_date\"].notna().astype(int)\n    idx[\"exit_date\"] = idx[\"event_date\"].fillna(idx[\"fu_end\"])\n    idx[\"person_days\"] = (idx[\"exit_date\"] - idx[\"index_date\"]).dt.days.clip(lower=0)\n\n    rates = (idx.groupby(\"arm\")\n                .agg(events=(\"event\", \"sum\"), person_years=(\"person_days\", lambda d: d.sum() / 365.25))\n                .assign(ir_per_1000py=lambda t: 1000 * t[\"events\"] / t[\"person_years\"]))\n    return rates.reset_index()",
        "description": "Non-interventional voluntary-PASS engine: build a new-user safety cohort, ascertain a first-event AESI from a\nvalidated code algorithm, accrue FFS-observable person-time over a days_supply-based risk window, and return crude\nincidence rates per arm. Required inputs (already cleaned and de-duplicated):\n  rx     : pharmacy fills    -> person_id, fill_date (datetime), drug_class in {'STUDY','COMPARATOR'}, days_supply\n  enroll : enrollment spans  -> person_id, enroll_start, enroll_end, ma_only (bool)  # ma_only spans lack FFS claims\n  dx     : diagnosis claims  -> person_id, dx_date (datetime), aesi (bool)            # aesi = validated AESI algorithm\nConfounding adjustment (PS/g-methods) and competing-risk modelling are applied downstream on this analysis frame.",
        "dependencies": [
          "pandas",
          "numpy"
        ],
        "source_citations": [
          "engel-2017"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\nWASHOUT_DAYS   <- 365L\nCARRYOVER_DAYS <- 30L\n\nbuild_pass_cohort <- function(rx, enroll, dx) {\n  setDT(rx); setDT(enroll); setDT(dx)\n  setorder(rx, person_id, fill_date)\n  study <- rx[drug_class %chin% c(\"STUDY\", \"COMPARATOR\")]\n\n  # Time zero = first study/comparator fill; arm = drug dispensed that day.\n  idx <- study[, .(index_date = fill_date[1L], arm = drug_class[1L]), by = person_id]\n\n  # New-user washout.\n  s <- merge(study, idx[, .(person_id, index_date)], by = \"person_id\")\n  prior <- unique(s[fill_date < index_date & fill_date >= index_date - WASHOUT_DAYS, person_id])\n  idx <- idx[!person_id %chin% prior]\n\n  # FFS-observable continuous enrollment across baseline through index (no MA-only spans).\n  e <- merge(enroll, idx[, .(person_id, index_date)], by = \"person_id\")\n  ok <- e[enroll_start <= index_date - WASHOUT_DAYS & enroll_end >= index_date & !ma_only,\n          unique(person_id)]\n  idx <- idx[person_id %chin% ok]\n\n  # First-event: drop prevalent AESI in the baseline window.\n  bd <- merge(dx[aesi == TRUE], idx[, .(person_id, index_date)], by = \"person_id\")\n  prevalent <- unique(bd[dx_date < index_date & dx_date >= index_date - WASHOUT_DAYS, person_id])\n  idx <- idx[!person_id %chin% prevalent]\n\n  # Risk window end = max(fill_date + days_supply) over on-treatment fills + carry-over.\n  ot <- merge(study, idx[, .(person_id, index_date)], by = \"person_id\")[fill_date >= index_date]\n  ot[, covered_end := fill_date + days_supply]\n  rxend <- ot[, .(risk_end = max(covered_end) + CARRYOVER_DAYS), by = person_id]\n  idx <- merge(idx, rxend, by = \"person_id\")\n\n  # Censor at min(risk_end, disenrollment); first incident AESI in window = event.\n  dis <- enroll[, .(disenroll = max(enroll_end)), by = person_id]\n  idx <- merge(idx, dis, by = \"person_id\")\n  idx[, fu_end := pmin(risk_end, disenroll)]\n\n  ev <- merge(dx[aesi == TRUE], idx[, .(person_id, index_date, fu_end)], by = 
\"person_id\")\n  ev <- ev[dx_date >= index_date & dx_date <= fu_end, .(event_date = min(dx_date)), by = person_id]\n  idx <- merge(idx, ev, by = \"person_id\", all.x = TRUE)\n  idx[, event := as.integer(!is.na(event_date))]\n  idx[, exit_date := fifelse(is.na(event_date), fu_end, event_date)]\n  idx[, person_days := pmax(as.integer(exit_date - index_date), 0L)]\n\n  idx[, .(events = sum(event), person_years = sum(person_days) / 365.25), by = arm\n      ][, ir_per_1000py := 1000 * events / person_years][]\n}",
        "description": "Non-interventional voluntary-PASS cohort + crude incidence with data.table. Inputs mirror the Python version:\n  rx     : person_id, fill_date (Date), drug_class in {'STUDY','COMPARATOR'}, days_supply\n  enroll : person_id, enroll_start (Date), enroll_end (Date), ma_only (logical)\n  dx     : person_id, dx_date (Date), aesi (logical)   # aesi = validated AESI algorithm flag",
        "dependencies": [
          "data.table"
        ],
        "source_citations": [
          "engel-2017"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "%let washout   = 365;\n%let carryover = 30;\n\n/* Time zero = earliest study/comparator fill; arm = drug dispensed that day. */\nproc sort data=work.rx(where=(drug_class in ('STUDY','COMPARATOR'))) out=rx_q;\n  by person_id fill_date;\nrun;\n\ndata idx;\n  set rx_q;\n  by person_id;\n  if first.person_id;\n  index_date = fill_date;\n  format index_date date9.;\n  length arm $12;\n  arm = drug_class;\n  keep person_id index_date arm;\nrun;\n\n/* New-user washout: no study/comparator fill in the washout window before index. */\nproc sql;\n  create table newuser as\n  select i.* from idx i\n  where not exists (\n    select 1 from work.rx p\n    where p.person_id = i.person_id and p.drug_class in ('STUDY','COMPARATOR')\n      and p.fill_date < i.index_date and p.fill_date >= i.index_date - &washout);\nquit;\n\n/* FFS-observable continuous enrollment across baseline through index (no MA-only spans). */\nproc sql;\n  create table enrolled as\n  select n.* from newuser n\n  where exists (\n    select 1 from work.enroll e\n    where e.person_id = n.person_id and e.ma_only = 0\n      and e.enroll_start <= n.index_date - &washout and e.enroll_end >= n.index_date);\nquit;\n\n/* First-event definition: exclude prevalent AESI in the baseline window. */\nproc sql;\n  create table incident as\n  select e.* from enrolled e\n  where not exists (\n    select 1 from work.dx d\n    where d.person_id = e.person_id and d.aesi = 1\n      and d.dx_date < e.index_date and d.dx_date >= e.index_date - &washout);\nquit;\n\n/* Risk-window end = max(fill_date + days_supply) on-treatment + carry-over; disenrollment = max enroll_end. 
*/\nproc sql;\n  create table windows as\n  select c.person_id, c.arm, c.index_date,\n         (select max(r.fill_date + r.days_supply) from work.rx r\n            where r.person_id = c.person_id and r.fill_date >= c.index_date\n              and r.drug_class in ('STUDY','COMPARATOR')) + &carryover as risk_end format=date9.,\n         (select max(e.enroll_end) from work.enroll e\n            where e.person_id = c.person_id) as disenroll format=date9.\n  from incident c;\nquit;\n\n/* Outcome status (AESI=1, competing death=2, censored=0) and person-time to first event or censoring. */\nproc sql;\n  create table survdata as\n  select w.person_id, w.arm,\n         min(w.risk_end, w.disenroll) as fu_end format=date9.,\n         (select min(d.dx_date) from work.dx d\n            where d.person_id = w.person_id and d.aesi = 1\n              and d.dx_date >= w.index_date) as aesi_date format=date9.,\n         (select min(d.dx_date) from work.dx d\n            where d.person_id = w.person_id and d.death = 1\n              and d.dx_date >= w.index_date) as death_date format=date9.,\n         w.index_date\n  from windows w;\nquit;\n\ndata survdata;\n  set survdata;\n  _exit = min(coalesce(aesi_date, .i), coalesce(death_date, .i), fu_end);\n  if .i = _exit then _exit = fu_end;          /* no event -> censor at fu_end */\n  time = max(_exit - index_date, 0);\n  if aesi_date ne . and aesi_date <= fu_end and aesi_date = _exit then status = 1;      /* AESI */\n  else if death_date ne . and death_date <= fu_end and death_date = _exit then status = 2; /* competing death */\n  else status = 0;                                                                       /* censored */\nrun;\n\n/* Fine-Gray subdistribution hazard for the AESI, death as competing risk. */\nproc phreg data=survdata;\n  class arm (ref='COMPARATOR');\n  model time*status(0) = arm / eventcode=1 risklimits;\n  hazardratio arm;\nrun;",
        "description": "Non-interventional voluntary-PASS cohort construction, person-time accrual, and a Fine-Gray comparative hazard\n(AESI with death as a competing risk) in SAS. Required input datasets (post data-management):\n  work.rx     : person_id, fill_date, drug_class ('STUDY'/'COMPARATOR'), days_supply\n  work.enroll : person_id, enroll_start, enroll_end, ma_only (0/1)\n  work.dx     : person_id, dx_date, aesi (1 = validated AESI algorithm), death (1 = death claim)\nPROC PHREG eventcode= requires SAS/STAT 13.1+ and gives the Fine-Gray subdistribution hazard appropriate for a\nsafety outcome with competing mortality in an elderly claims population.",
        "dependencies": [],
        "source_citations": [
          "engel-2017"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Q[Sponsor safety question on an authorised product] --> Cond{Is the study a CONDITION<br/>of the marketing authorisation?}\n  Cond -->|Yes - Annex II / Art. 9, 21a, 22a| Imp[Imposed PASS<br/>PRAC assesses protocol + results<br/>binding milestones]\n  Cond -->|No - sponsor initiative| Vol[Voluntary PASS<br/>sponsor controls timing + scope<br/>ENCePP / EU PAS Register encouraged]\n  Vol --> Int{Does it depart from routine care?<br/>protocol-assigned treatment, extra tests, randomisation?}\n  Int -->|Yes| Trial[Clinical-trial PASS<br/>CTR / GCP framework]\n  Int -->|No| NI[Non-interventional PASS<br/>secondary-data RWE design under GVP Module VIII]\n  NI --> Eng[Analytic engine: active-comparator new-user<br/>or self-controlled design; AESI estimand]",
        "caption": "Decision logic placing a study as a voluntary versus imposed PASS, and the interventional split. The voluntary, non-interventional branch is the RWE study-design case in this catalog; its analytic engine is the same new-user or self-controlled design used elsewhere, wrapped in the GVP Module VIII pharmacovigilance framework.",
        "alt_text": "Decision flowchart from a sponsor safety question through whether the study is a condition of the marketing authorisation (imposed vs voluntary) and whether it departs from routine care (clinical trial vs non-interventional), ending at the analytic engine.",
        "source_type": "illustrative",
        "source_citations": [
          "engel-2017"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "gantt\n  title Voluntary non-interventional PASS timeline for one new initiator (claims)\n  dateFormat YYYY-MM-DD\n  axisFormat %b %Y\n  section Baseline\n  Continuous FFS enrollment + drug washout + no prevalent AESI :done, base, 2023-01-01, 2023-12-31\n  section Time zero\n  First qualifying fill -> arm assignment :milestone, t0, 2024-01-01, 0d\n  section Risk window\n  On-treatment exposure (days_supply) :active, exp, 2024-01-01, 120d\n  AESI latency carry-over (30 days) :crit, carry, 2024-05-01, 30d\n  section Censoring\n  First AESI / death / disenroll / data end :milestone, cen, 2024-05-31, 0d",
        "caption": "Susceptibility-window logic for a single initiator. Baseline establishes new-user status and rules out prevalent AESI; the risk window runs over on-treatment days_supply plus a prespecified latency carry-over; person-time censors at the first AESI, competing death, disenrollment, or end of data.",
        "alt_text": "Gantt timeline showing a 2023 baseline washout, time zero at the first fill on 2024-01-01, an on-treatment risk window, a 30-day carry-over for AESI latency, and a censoring milestone.",
        "source_type": "illustrative",
        "source_citations": [
          "engel-2017"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "alternative_to",
        "target_slug": "pass-imposed",
        "notes": "An imposed PASS is a binding condition of the marketing authorisation with PRAC protocol and results assessment; a voluntary PASS is sponsor-initiated and outside that mandatory oversight. Same analytic methods, different regulatory status."
      },
      {
        "relation_type": "used_with",
        "target_slug": "active-comparator-new-user",
        "notes": "The active-comparator new-user design is the usual analytic core of a non-interventional cohort PASS, providing time-zero alignment and control of confounding by indication for the safety contrast."
      },
      {
        "relation_type": "used_with",
        "target_slug": "target-trial-emulation",
        "notes": "A comparative safety PASS is strengthened by emulating a target trial - explicit eligibility, treatment strategies, and time-zero - which raises its regulatory and HTA credibility."
      },
      {
        "relation_type": "see_also",
        "target_slug": "signal-detection",
        "notes": "Spontaneous-report disproportionality raises safety signals; a PASS supplies the denominator, comparator, and prespecified definitions needed to quantify or refute them."
      },
      {
        "relation_type": "see_also",
        "target_slug": "safety-signal-case-definition-rwe",
        "notes": "A voluntary PASS depends on a validated AESI case definition (code algorithm with known PPV) for credible outcome ascertainment in secondary data."
      },
      {
        "relation_type": "see_also",
        "target_slug": "disease-registry",
        "notes": "Disease, product, and pregnancy registries are common substrates for voluntary PASS, offering adjudicated safety outcomes that claims and EHR capture poorly."
      }
    ],
    "aliases": [
      "voluntary PASS",
      "non-imposed PASS",
      "sponsor-initiated post-authorisation safety study",
      "non-interventional PASS"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "ema",
      "fda",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "pdc-proportion-of-days-covered",
    "name": "Proportion of Days Covered (PDC)",
    "short_definition": "A claims-based adherence measure equal to the number of unique days in a fixed observation window on which a patient had medication supply on hand, divided by the number of days in that window, bounded in [0, 1].",
    "long_description": "**Proportion of Days Covered (PDC)** is the dominant operationalization of medication adherence in\nadministrative-claims research and the basis of the CMS/PQA adherence Star Ratings measures. It counts\nthe **unique days \"covered\"** by dispensed supply within a fixed denominator window and divides by the\nlength of that window. Coverage on a given calendar day is a 0/1 indicator: any positive `days_supply`\nreaching that day makes it covered, regardless of how many fills overlap it. That single design decision —\ncount *days*, not *supply* — is what bounds PDC at 1.0 and is the entire reason PDC exists.\n\n**Core conceptual distinction — PDC vs MPR.** The Medication Possession Ratio (MPR) sums `days_supply`\nacross fills and divides by the interval, so two early refills can push MPR above 1.0 (often reported as\n150%+), conflating *stockpiling* with *adherence*. PDC instead asks, day by day, \"did the patient have any\nsupply?\" and so cannot exceed 100%. The difference is not cosmetic. Under oversupply, MPR overstates\nadherence and, when truncated at 1.0, the two converge; but for patients who refill early and then stop,\nMPR's overlap credit can mask a coverage gap that PDC correctly exposes. PDC is therefore the more\nconservative, more interpretable, and regulatorily standardized choice. PDC is a **summary of cumulative\ndaily coverage**, not a measure of treatment duration — that is the job of persistence (time to first\npermitted gap). A patient can have high PDC with several short gaps (restart after each), and high\npersistence with mediocre PDC (continuous fills but frequent partial coverage). Report both when the\nquestion spans \"how much\" and \"how long.\"\n\n**The estimand PDC actually targets.** PDC is a *measure*, not a causal estimator; but how you use it\ncommits you to an estimand. 
As a continuous exposure in an outcome model it represents average daily\ncoverage; dichotomized at the conventional **PDC ≥ 0.80** threshold it represents the contrast of\n\"adherent vs non-adherent\" patients — a threshold popularized for chronic cardiometabolic therapy and\nendorsed by PQA, but empirically derived, drug-class-specific, and not a law of nature (Karve et al.\nshowed the outcome-optimal cut-point varies by drug and endpoint). Whatever the form, **prespecify the\ndenominator window, the carry-over rule, the inpatient rule, and the threshold in the SAP**; post-hoc\ntuning of these knobs is the field's most common silent source of non-reproducibility.\n\n**Pros, cons, and trade-offs.**\n- **vs MPR (medication-possession-ratio):** PDC caps at 1.0 and is immune to oversupply inflation, is the\n  CMS/PQA standard, and is more conservative for \"adherent/not\" classification. Cost: it discards the\n  magnitude of stockpiling (which can itself predict waste, hoarding, or measurement error), is slightly\n  more work to implement (you must resolve overlapping fills into a daily array rather than just summing\n  supply), and can *understate* possession when early refills are clinically appropriate. **Prefer PDC**\n  for comparative effectiveness/safety classification and any quality-measure or regulatory context.\n- **vs persistence (time-to-discontinuation):** PDC captures intensity of coverage during follow-up;\n  persistence captures duration to the first qualifying gap. PDC rewards a patient who reaches 0.85 via\n  on-off-on refilling; persistence flags that same patient at the first long gap. Cost: PDC alone hides\n  the gap *pattern* — clinically, a single 60-day interruption and six scattered 10-day gaps can yield\n  identical PDC but very different risk. 
**Prefer PDC** for cumulative-exposure questions; pair it with\n  persistence (and, for time-varying analyses, with actual exposure-episode construction) when the gap\n  pattern matters.\n- **vs a daily-defined-dose / time-varying exposure model:** PDC collapses a longitudinal exposure into\n  one scalar per patient, which is simple and communicable but throws away timing — it cannot represent\n  immortal time, cannot be used as a time-varying covariate, and invites the classic adherence-as-baseline\n  bias if measured over a window that overlaps follow-up. **Prefer a time-varying exposure** when the\n  causal contrast is dose-response over time or when adherence is on the causal pathway.\n\n**When to use.** Chronic, orally self-administered, refillable therapy in claims (or claims-linked) data\nwhere the research or quality question is \"what fraction of the intended treatment period did the patient\nactually have drug on hand?\" — statins, oral antidiabetics, RAAS inhibitors, DOACs, oral oncolytics,\nimmunosuppressants. PDC is also the right primary measure whenever you must align with CMS Star Ratings or\nPQA specifications.\n\n**When NOT to use — and when it is actively misleading.**\n- **Measuring adherence over a window that overlaps the outcome follow-up.** If PDC is computed during the\n  same period in which outcomes accrue, sicker patients who die or are hospitalized early accrue fewer\n  covered days and look \"non-adherent,\" manufacturing a spurious adherence-outcome association\n  (reverse causation / immortal-time-adjacent bias). 
Fix it by measuring PDC in a *landmark* window that\n  ends before follow-up begins, or by modeling exposure as time-varying.\n- **As-needed, single-fill, titrated, or non-oral therapies.** PRN inhalers, one-time antibiotic courses,\n  insulin with variable dosing, and clinician-administered infusions break the \"one fill covers a known\n  number of days\" assumption; `days_supply` is meaningless or absent, and PDC is uninterpretable.\n- **Incomplete capture of dispensing.** Free samples, 340B, cash/discount-card fills (GoodRx), inpatient\n  and clinic-administered drug, and care obtained outside the plan are invisible to pharmacy claims, so PDC\n  *understates* coverage — differentially if one comparator is more often sampled or sold at retail\n  discount.\n- **EHR prescribing data used as a dispensing proxy.** An *order* is not a *fill*; primary non-adherence\n  (never picking up the script) is invisible, and PDC computed from orders systematically overstates\n  adherence.\n- **Crediting supply for days the patient could not have taken the drug at home.** During inpatient or SNF\n  stays the facility supplies medication, so outpatient fills neither span nor are needed for those days;\n  naively including them distorts both numerator and denominator (see inpatient variant).\n\n**Data-source operational depth.**\n- **Claims (FFS or commercial):** The native substrate. Coverage = `fill_date` + `days_supply` per NDC.\n  Require **continuous medical *and* pharmacy enrollment** across the full PDC window so that an empty day\n  is a true uncovered day, not unobserved. *Failure mode — Medicare Advantage / capitated person-time:*\n  MA-only or capitated members generate no fee-for-service pharmacy claims, so their PDC window is\n  structurally missing; restrict to enrollees with the relevant Part D / commercial pharmacy benefit and\n  exclude MA-only spans rather than scoring them as zero coverage. 
*Failure mode — `days_supply` errors:*\n  90-day mail-order, sample fills, and data-entry errors distort supply; winsorize implausible\n  `days_supply` and reconcile mail vs retail. Combination products and mid-window NDC switches require a\n  regimen-level decision (PDC for the *class* vs per-component) made before, not after, looking at results.\n- **EHR:** Use only when linked to dispensing (or to medication administration for inpatient questions).\n  Prescribing/order data alone overstate adherence because primary non-adherence is unobserved, and\n  visit-driven capture means a patient who seeks care elsewhere looks like a coverage gap. Reconcile\n  order/fill date discrepancies before defining the window.\n- **Registry:** Variable — some product registries carry patient-reported or pharmacy-card adherence at\n  coarser granularity than claims. Best linked to claims for the full fill history and to a death index so\n  that disenrollment/death truncate the denominator correctly.\n- **Linked claims–EHR:** The strongest substrate (EHR severity for landmark covariate adjustment + claims\n  completeness for fills), at the cost of linkage selection and date reconciliation across order, fill, and\n  service dates.\n\n**Competing risks and the elderly-claims trap.** PDC denominators are often censored at death or\ndisenrollment. In elderly or oncology populations, death is a *competing event* that differs by exposure;\nif one arm dies sooner, its members accrue fewer denominator days, and naive PDC comparisons can be\ndistorted exactly where the clinical stakes are highest. Decide a priori whether death truncates the\ndenominator (the usual choice) and report follow-up time alongside PDC so reviewers can see differential\ncensoring.\n\n**Worked claims example.** Statin adherence over a fixed 365-day landmark window. 
Patient\n`person_id = 1001`, index (first statin fill) `2023-01-01`; observation window = the 365 days\n`[2023-01-01, 2023-12-31]`; require continuous medical+pharmacy enrollment across the whole window and a\n180-day pre-index statin washout to define a new user. Fills (each `days_supply = 90`, last covered day =\n`fill_date + 89`): (a) `2023-01-01` covers `2023-01-01`→`2023-03-31`; (b) `2023-03-20` covers\n`2023-03-20`→`2023-06-17` (it arrives 12 days early, so `2023-03-20`→`2023-03-31` is double-covered but\neach calendar day is counted **once**); (c) `2023-06-25` covers `2023-06-25`→`2023-09-22`; no further\nfills. The union of these intervals is `2023-01-01`→`2023-06-17` (168 days) plus\n`2023-06-25`→`2023-09-22` (90 days) = **258 unique covered days**, with a true 7-day gap\n`2023-06-18`→`2023-06-24` between fills (b) and (c) and no coverage after `2023-09-22`.\n**PDC = 258 / 365 = 0.71**, below the 0.80 threshold, so this patient is classified non-adherent.\nContrast MPR, which simply sums supply: `90+90+90 = 270` days / 365 = 0.74; the 12 double-covered days\nthat MPR credits but PDC counts once are exactly the gap between the two measures. Had more fills\noverlapped days that were already covered, MPR could keep rising, even above 1.0, while unique-day coverage\nstayed at 0.71: the oversupply artifact PDC is designed to remove. (A more generous \"carry-over with\nsurplus shifting\" rule, in which the 12 surplus days *extend* later coverage so no supply is wasted, would\npush the numerator to 270 and PDC to 0.74, numerically equal to MPR here. Confirm which rule the governing\nmeasure specification requires, since some specifications shift overlapping same-drug fills forward; the\nunion rule above is the one the code below computes.)",
    "primary_category": "Exposure_Definition",
    "tags": [
      "adherence",
      "persistence",
      "medication-utilization",
      "claims",
      "pdc",
      "mpr",
      "pqa-cms",
      "days-supply"
    ],
    "applies_to_study_types": [
      "drug_utilization",
      "cohort_retrospective",
      "active_comparator_new_user",
      "new_user",
      "claims_analysis"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1016/s0895-4356(96)00268-5",
        "url": "https://doi.org/10.1016/s0895-4356(96)00268-5",
        "citation_text": "Steiner JF, Prochazka AV. The assessment of refill compliance using pharmacy records: methods, validity, and applications. Journal of Clinical Epidemiology. 1997;50(1):105-116.",
        "year": 1997,
        "authors_short": "Steiner & Prochazka",
        "notes": "Foundational methodologic treatment of refill-based adherence measurement that frames the numerator/denominator and overlap problems PDC later resolves by counting unique covered days."
      },
      {
        "role": "explain",
        "doi": "10.1345/aph.1H018",
        "url": "https://doi.org/10.1345/aph.1H018",
        "citation_text": "Hess LM, Raebel MA, Conner DA, Malone DC. Measurement of adherence in pharmacy administrative databases: a proposal for standard definitions and preferred measures. Annals of Pharmacotherapy. 2006;40(7-8):1280-1288.",
        "year": 2006,
        "authors_short": "Hess et al.",
        "notes": "Canonical standardization paper that defines PDC against MPR and the family of claims-based adherence measures; the reference most studies cite for the unique-covered-days definition."
      },
      {
        "role": "explain",
        "doi": "10.1093/ajhp/zxab392",
        "url": "https://doi.org/10.1093/ajhp/zxab392",
        "citation_text": "Loucks J, Zuckerman AD, Berni A, Saulles A, Thomas G, Alonzo A. Proportion of days covered as a measure of medication adherence. American Journal of Health-System Pharmacy. 2022;79(6):492-496.",
        "year": 2022,
        "authors_short": "Loucks et al.",
        "notes": "Concise contemporary clarification of PDC computation, the carry-over and overlap rules, and common operational pitfalls for practitioners."
      },
      {
        "role": "demonstrate",
        "doi": "10.1185/03007990903126833",
        "url": "https://doi.org/10.1185/03007990903126833",
        "citation_text": "Karve S, Cleves MA, Helm M, Hudson TJ, West DS, Martin BC. Good and poor adherence: optimal cut-point for adherence measures using administrative claims data. Current Medical Research and Opinion. 2009;25(9):2303-2310.",
        "year": 2009,
        "authors_short": "Karve et al.",
        "notes": "Empirically interrogates the PDC/MPR ≥0.80 dichotomy, showing the outcome-optimal cut-point varies by drug and endpoint — the key caution against treating 0.80 as universal."
      },
      {
        "role": "use",
        "doi": "10.1001/archinte.166.17.1836",
        "url": "https://doi.org/10.1001/archinte.166.17.1836",
        "citation_text": "Ho PM, Rumsfeld JS, Masoudi FA, McClure DL, Plomondon ME, Steiner JF, Magid DJ. Effect of medication nonadherence on hospitalization and mortality among patients with diabetes mellitus. Archives of Internal Medicine. 2006;166(17):1836-1841.",
        "year": 2006,
        "authors_short": "Ho et al.",
        "notes": "Widely cited applied study using refill-based adherence to predict hospitalization and mortality in diabetes; illustrates the landmark-style exposure-window logic PDC analyses should follow."
      }
    ],
    "plain_language_summary": "Proportion of Days Covered (PDC) measures how much of a treatment period a patient actually had medication on hand. You take every prescription fill, mark the calendar days each fill is meant to cover, count the unique covered days, and divide by the number of days in the window you are studying. The result runs from 0 to 1, and 0.80 is the usual cut-off for calling someone \"adherent\" — but that line is a convention, not a law of nature. PDC cannot see free samples, cash-paid fills, or drugs given in the hospital, so it can make a well-treated patient look like they missed doses.",
    "key_terms": [
      {
        "term": "fill (dispensing)",
        "definition": "One time a pharmacy hands over a prescription; the basic event PDC is built from."
      },
      {
        "term": "days_supply",
        "definition": "How many days one filled prescription is meant to last (e.g., a 90-day fill)."
      },
      {
        "term": "index date",
        "definition": "The patient's \"day zero\" — here, their first fill of the drug being studied."
      },
      {
        "term": "observation window",
        "definition": "The fixed stretch of calendar time you measure coverage over (the denominator)."
      },
      {
        "term": "covered day",
        "definition": "A calendar day on which the patient had at least one fill's supply still on hand."
      },
      {
        "term": "adherence threshold",
        "definition": "A cut-off (commonly PDC >= 0.80) used to label a patient adherent or not."
      }
    ],
    "worked_example": {
      "scenario": "One patient, person_id 1001, starts a statin on 2023-01-01 (their index date). We want to know what fraction of the next 365 days they had statin supply on hand — their PDC over the window 2023-01-01 to 2023-12-31. The patient must be continuously enrolled in medical and pharmacy coverage across the whole window, so an empty day means \"no pills,\" not \"we couldn't see.\" Below are the three pharmacy fills an analyst would actually pull from the claims table.",
      "dataset": {
        "caption": "The raw pharmacy-claim rows for patient 1001 (each fill is a 90-day supply).",
        "columns": [
          "person_id",
          "fill_date",
          "drug",
          "days_supply"
        ],
        "rows": [
          [
            1001,
            "2023-01-01",
            "atorvastatin",
            90
          ],
          [
            1001,
            "2023-03-20",
            "atorvastatin",
            90
          ],
          [
            1001,
            "2023-06-25",
            "atorvastatin",
            90
          ]
        ]
      },
      "steps": [
        "Each 90-day fill covers its fill date plus the next 89 days. Fill A covers Jan 1 to Mar 31.",
        "Fill B is filled Mar 20 — 11 days early — so it covers Mar 20 to Jun 17 and overlaps Fill A by 12 days (Mar 20-31). The union rule counts each calendar day only once, so those 12 days are not double-credited.",
        "Together Fills A and B cover one unbroken stretch, Jan 1 to Jun 17 = 168 days.",
        "Fill C is filled Jun 25, leaving a 7-day gap (Jun 18-24) with no supply. It covers Jun 25 to Sep 22 = 90 days.",
        "There are no more fills, so Sep 23 to Dec 31 is uncovered. Total covered = 168 + 90 = 258 unique days."
      ],
      "result": "PDC = 258 unique covered days / 365 window days = 0.71, which is below the 0.80 threshold, so this patient is classified non-adherent. (MPR, which just sums supply, gives 270/365 = 0.74 — the 12-day gap between the two measures is exactly the overlap PDC refuses to double-count.)",
      "timeline_spec": {
        "title": "PDC over a 365-day window for one statin user (carry-over / union rule)",
        "window": {
          "start": "2023-01-01",
          "end": "2023-12-31",
          "label": "Denominator: 365-day observation window"
        },
        "events": [
          {
            "label": "Fill A",
            "start": "2023-01-01",
            "length_days": 90,
            "quantity": "90 days_supply"
          },
          {
            "label": "Fill B (11-day early refill)",
            "start": "2023-03-20",
            "length_days": 90,
            "quantity": "90 days_supply"
          },
          {
            "label": "Fill C",
            "start": "2023-06-25",
            "length_days": 90,
            "quantity": "90 days_supply"
          }
        ],
        "spans": [
          {
            "kind": "covered",
            "start": "2023-01-01",
            "end": "2023-06-17",
            "label": "168 covered days"
          },
          {
            "kind": "gap",
            "start": "2023-06-18",
            "end": "2023-06-24",
            "label": "7-day gap"
          },
          {
            "kind": "covered",
            "start": "2023-06-25",
            "end": "2023-09-22",
            "label": "90 covered days"
          }
        ],
        "result": {
          "label": "258 unique covered days / 365 = PDC 0.71 (below 0.80 -> non-adherent)",
          "value": 0.71
        }
      }
    },
    "prerequisites": [
      "new-user-design",
      "exposure-episode-construction-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "PDC with carry-over / stockpiling allowed (union rule)",
        "description": "Overlapping coverage is counted once and early refills are permitted; in the common union-rule form, the numerator is the count of unique calendar days touched by any fill's days_supply window. The standard chronic-therapy and CMS/PQA PDC.",
        "edge_cases": [
          "90-day refill arriving 10 days early — the overlap is counted once, not double-credited (unlike MPR).",
          "Overlapping fills from multiple pharmacies/prescribers for the same drug class.",
          "A stricter \"surplus-shifting\" sub-flavor pushes the unused surplus of an early refill forward so no supply is wasted; this can credit more days than the plain union (and equals MPR when there are no true gaps). Pick one and document it."
        ],
        "data_source_notes": "claims: resolve all fills into a single per-person daily coverage array (or merge overlapping [fill_date, fill_date+days_supply) intervals) and count unique days. Decide regimen-level handling for combination products and mid-window NDC switches up front.",
        "citations": [
          "hess-2006",
          "loucks-2021"
        ]
      },
      {
        "name": "PDC without carry-over (no stockpiling)",
        "description": "A day is covered only if some fill's own days_supply window reaches it; surplus from early refills does not roll forward. More conservative; used mainly in sensitivity analyses.",
        "edge_cases": [
          "Early refills are effectively truncated, so deliberate stockpilers can look less adherent than they are."
        ],
        "data_source_notes": "claims: clip each fill's coverage interval at the next fill's start so overlaps grant no extra days; uncommon as a primary US specification but a useful robustness check.",
        "citations": [
          "hess-2006"
        ]
      },
      {
        "name": "PDC with inpatient / institutional adjustment",
        "description": "Days during inpatient or SNF stays (where the facility supplies medication) are removed from the denominator or assumed covered, since outpatient pharmacy fills neither span nor are needed for them.",
        "edge_cases": [
          "Long-term-care and SNF spans with facility-supplied drug.",
          "Inpatient stay overlapping an outpatient fill's supply window."
        ],
        "data_source_notes": "claims: merge inpatient/facility stay dates (e.g., MedPAR, facility claims). The standard PQA-style rule excludes hospital days from the denominator and does not require outpatient supply for those days; some studies instead extend (shift) post-discharge supply by the inpatient duration.",
        "citations": [
          "loucks-2021"
        ]
      },
      {
        "name": "Landmark (fixed-window) PDC measured before follow-up",
        "description": "PDC is computed over a fixed window (e.g., first 365 days post-index) that ends before outcome follow-up starts, breaking the reverse-causation/immortal-time link between coverage and the outcome.",
        "edge_cases": [
          "Patients who die or disenroll inside the landmark window are excluded or handled by a prespecified rule, not silently scored as low PDC.",
          "Choice of window length trades misclassification (too short) against survivor selection (too long)."
        ],
        "data_source_notes": "claims: require continuous enrollment across the full landmark window; begin outcome follow-up only at the landmark date so adherence is measured strictly before outcomes accrue.",
        "citations": [
          "ho-2006"
        ]
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "mpr-medication-possession-ratio",
        "pros_of_this": [
          "Bounded at 1.0; immune to oversupply inflation from early/overlapping refills.",
          "More conservative and interpretable for adherent/non-adherent classification.",
          "The CMS Star Ratings / PQA standard for adherence quality measures."
        ],
        "cons_of_this": [
          "Discards the magnitude of stockpiling, which can itself signal waste, hoarding, or measurement error.",
          "Slightly more complex — requires resolving overlapping fills into unique covered days rather than summing days_supply.",
          "Can understate possession when early refilling is clinically appropriate."
        ],
        "when_to_prefer": "Comparative effectiveness/safety classification and any regulatory or quality-measure context standardized on PDC; whenever oversupply could inflate MPR."
      },
      {
        "compared_to": "persistence-time-to-discontinuation",
        "pros_of_this": [
          "Captures cumulative intensity of coverage during follow-up, including patients who stop and restart."
        ],
        "cons_of_this": [
          "Hides the gap pattern; one long interruption and several short gaps can yield identical PDC but very different clinical risk.",
          "Does not measure treatment-episode duration."
        ],
        "when_to_prefer": "Cumulative-exposure or average-daily-coverage questions; pair with persistence when the timing and pattern of gaps matter."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Native substrate. Coverage = fill_date + days_supply per NDC; require continuous medical + pharmacy enrollment across the full PDC window so empty days are true gaps, not missingness. Exclude Medicare Advantage / capitated person-time that lacks fee-for-service pharmacy claims rather than scoring it as zero coverage. Winsorize implausible days_supply, reconcile mail vs retail, and decide regimen-level handling for combination products and NDC switches before analysis.",
      "ehr": "Use only when linked to dispensing (or administration records for inpatient questions); orders alone miss primary non-adherence and overstate PDC. Reconcile order/fill date discrepancies and treat visit-driven gaps as potentially uninformed rather than true non-coverage.",
      "registry": "Coarser, sometimes patient-reported; best linked to claims for full fill history and to a death index so disenrollment/death truncate the denominator correctly.",
      "linked": "Linked claims-EHR is the strongest substrate (EHR severity for landmark covariate adjustment + claims completeness for fills) but introduces linkage selection and order/fill/service date discrepancies that must be reconciled before defining the window."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\nimport numpy as np\n\ndef compute_pdc(rx: pd.DataFrame, windows: pd.DataFrame) -> pd.DataFrame:\n    rx = rx.merge(windows, on=\"person_id\", how=\"inner\")\n\n    # Each fill covers [fill_date, fill_date + days_supply) (half-open: last covered day is\n    # fill_date + days_supply - 1). Clip the coverage interval to the observation window.\n    cover_start = rx[\"fill_date\"].clip(lower=rx[\"window_start\"])\n    cover_end = (rx[\"fill_date\"] + pd.to_timedelta(rx[\"days_supply\"], unit=\"D\")\n                 - pd.Timedelta(days=1)).clip(upper=rx[\"window_end\"])\n\n    rows = []\n    for (pid, w_start, w_end), g in rx.assign(cs=cover_start, ce=cover_end).groupby(\n            [\"person_id\", \"window_start\", \"window_end\"]):\n        covered = set()\n        for cs, ce in zip(g[\"cs\"], g[\"ce\"]):\n            if ce < cs:          # fill falls entirely outside the window after clipping\n                continue\n            # Union of covered calendar days; the set dedupes overlapping (carried-over) days.\n            covered.update(pd.date_range(cs, ce, freq=\"D\"))\n        window_days = (w_end - w_start).days + 1   # inclusive denominator\n        covered_days = len(covered)\n        rows.append((pid, covered_days, window_days, covered_days / window_days))\n\n    out = pd.DataFrame(rows, columns=[\"person_id\", \"covered_days\", \"window_days\", \"pdc\"])\n    out[\"pdc\"] = out[\"pdc\"].clip(upper=1.0)        # guard against rounding; PDC is bounded at 1.0\n    out[\"adherent\"] = (out[\"pdc\"] >= 0.80).astype(int)   # PQA-style threshold (prespecify!)\n    return out",
        "description": "Union-rule (carry-over / stockpiling-allowed) PDC over a fixed observation window from claims-style\ninputs. Required inputs (already cleaned, de-duplicated, restricted to the target drug/class, and to\nperson-time with continuous medical+pharmacy enrollment; MA-only/capitated spans already excluded):\n  rx     : pharmacy fills    -> person_id, fill_date (datetime64), days_supply (int)\n  windows: one row per person -> person_id, window_start (datetime64), window_end (datetime64, inclusive)\nReturns one row per person with covered_days, window_days, and pdc in [0,1]. Coverage uses a per-person\ndaily set so overlapping fills are counted once; supply is clipped to the window so a fill near the end\ndoes not credit days beyond window_end. For the no-stockpiling variant, clip each fill's interval at the\nnext fill's start before unioning; for surplus-shifting carry-over, advance each fill's start to the day\nafter the prior fill's supply ends.",
        "dependencies": [
          "pandas",
          "numpy"
        ],
        "source_citations": [
          "hess-2006",
          "loucks-2021"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\n\ncompute_pdc <- function(rx, windows, threshold = 0.80) {\n  setDT(rx); setDT(windows)\n  d <- merge(rx, windows, by = \"person_id\")\n\n  # Clip each fill's coverage interval [fill_date, fill_date + days_supply - 1] to the window.\n  d[, cs := pmax(fill_date, window_start)]\n  d[, ce := pmin(fill_date + days_supply - 1L, window_end)]\n  d <- d[ce >= cs]   # drop fills that fall outside the window after clipping\n\n  pdc <- d[, {\n    wdays <- as.integer(window_end[1] - window_start[1]) + 1L   # inclusive denominator\n    covered <- logical(wdays)                                   # one slot per window day\n    off <- as.integer(window_start[1])\n    for (i in seq_len(.N)) {\n      idx <- (as.integer(cs[i]) - off + 1L):(as.integer(ce[i]) - off + 1L)\n      covered[idx] <- TRUE                                      # union -> overlaps counted once\n    }\n    cd <- sum(covered)\n    .(covered_days = cd, window_days = wdays, pdc = min(cd / wdays, 1.0))\n  }, by = person_id]\n\n  pdc[, adherent := as.integer(pdc >= threshold)]   # PQA-style threshold (prespecify!)\n  pdc[]\n}",
        "description": "Union-rule (carry-over) PDC with data.table, mirroring the Python version. Inputs:\n  rx      : person_id, fill_date (Date), days_supply (integer)\n  windows : person_id, window_start (Date), window_end (Date, inclusive)\nReturns person_id, covered_days, window_days, pdc, adherent. Coverage days are unioned per person via a\nmembership vector over the window, so overlapping fills (carry-over) are counted once.",
        "dependencies": [
          "data.table"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "%let threshold = 0.80;\n\n/* One coverage row per CALENDAR DAY each fill touches, clipped to the observation window. */\ndata daily_cover;\n  merge work.rx(in=a) work.windows(in=b);\n  by person_id;\n  if a and b;\n  /* Half-open supply: last covered day = fill_date + days_supply - 1. */\n  cstart = max(fill_date, window_start);\n  cend   = min(fill_date + days_supply - 1, window_end);\n  if cend < cstart then delete;                /* fill outside window after clipping */\n  do day = cstart to cend;                     /* one observation per covered calendar day */\n    output;\n  end;\n  keep person_id day window_start window_end;\nrun;\n\n/* COUNT(DISTINCT day) dedupes overlapping days -> carry-over PDC numerator. */\nproc sql;\n  create table pdc as\n  select d.person_id,\n         count(distinct d.day)               as covered_days,\n         (w.window_end - w.window_start + 1)  as window_days,\n         calculated covered_days / calculated window_days as pdc_raw,\n         min(calculated pdc_raw, 1.0)         as pdc,\n         (calculated pdc >= &threshold)       as adherent\n  from daily_cover d\n    inner join (select distinct person_id, window_start, window_end from work.windows) w\n    on d.person_id = w.person_id\n  group by d.person_id, w.window_start, w.window_end;\nquit;",
        "description": "Union-rule (carry-over) PDC in Base SAS using a daily-coverage flag, which is the most transparent and\naudit-friendly way to enforce the \"count unique days once\" rule. Required input datasets (post data-management,\nrestricted to the target drug/class and to continuously enrolled, FFS-observable person-time):\n  work.rx      : person_id, fill_date (SAS date), days_supply\n  work.windows : person_id, window_start (SAS date), window_end (SAS date, inclusive)\nProduces work.pdc with covered_days, window_days, pdc, and an adherent flag. PROC SQL builds the\none-row-per-person windows; the data step expands each fill to daily coverage and PROC SQL with\nCOUNT(DISTINCT day) unions overlapping (carried-over) days.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Q[Adherence question:<br/>how much of the intended<br/>period was the patient covered?] --> Refill{Chronic, oral,<br/>refillable therapy with<br/>meaningful days_supply?}\n  Refill -- No --> Stop[PDC inappropriate<br/>PRN / single-fill / titrated /<br/>clinician-administered]\n  Refill -- Yes --> Enr{Continuous medical + pharmacy<br/>enrollment across window?<br/>FFS-observable, no MA-only spans}\n  Enr -- No --> Exclude[Exclude unobserved person-time<br/>do NOT score gaps as zero coverage]\n  Enr -- Yes --> Win[Fix denominator window<br/>landmark BEFORE follow-up<br/>to avoid reverse causation]\n  Win --> Cover[Build daily coverage from<br/>fill_date + days_supply<br/>count unique covered days once]\n  Cover --> Inp{Inpatient / SNF days<br/>in window?}\n  Inp -- Yes --> Adj[Remove facility days from<br/>denominator or assume covered]\n  Inp -- No --> PDC[PDC = unique covered days / window days<br/>bounded at 1.0]\n  Adj --> PDC\n  PDC --> Use{How is PDC used?}\n  Use -- Continuous --> Cont[Average daily coverage]\n  Use -- Dichotomized --> Thr[Threshold e.g. >= 0.80<br/>prespecify; drug/endpoint-specific]",
        "caption": "Decision logic for computing and using PDC, from question fitness through enrollment observability, window selection, daily-coverage construction, inpatient adjustment, and the continuous-vs-threshold estimand choice.",
        "alt_text": "Flowchart deciding whether PDC is appropriate, requiring continuous enrollment, fixing a landmark window, building unique daily coverage, adjusting for inpatient days, and choosing a continuous or threshold use of PDC.",
        "source_type": "illustrative",
        "source_citations": [
          "hess-2006"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "gantt\n  title PDC over a 365-day window for one statin user (carry-over)\n  dateFormat YYYY-MM-DD\n  axisFormat %b\n  section Observation window\n  Denominator = 365 days :done, win, 2023-01-01, 365d\n  section Fills (days_supply = 90)\n  Fill A covers Jan 1 - Mar 31 :active, a, 2023-01-01, 90d\n  Fill B (11-day early refill) covers to Jun 17 :active, b, 2023-03-20, 90d\n  7-day gap (Jun 18 - Jun 24) :crit, gap, 2023-06-18, 7d\n  Fill C covers Jun 25 - Sep 22 :active, c, 2023-06-25, 90d\n  section Result\n  Covered = 258 unique days -> PDC = 0.71 :milestone, r, 2023-12-31, 0d",
        "caption": "Worked timeline for the long-description example. The 11-day overlap between fills A and B is counted once (union rule), a 7-day gap is uncovered, and 258 unique covered days over 365 give PDC = 0.71 — below the 0.80 threshold; MPR would sum supply to 270/365 = 0.74.",
        "alt_text": "Gantt chart over 2023 showing three 90-day statin fills with one early-refill overlap counted once, a seven-day gap, and 258 unique covered days yielding a PDC of 0.71.",
        "source_type": "illustrative",
        "source_citations": [
          "loucks-2021"
        ]
      },
      {
        "asset_path": "pdc-proportion-of-days-covered-timeline.svg",
        "mermaid": null,
        "caption": "Data-grounded worked-example timeline (beginner layer). The three 90-day fills are drawn to scale with their days_supply quantities; the green bands are the unique covered days (168 + 90 = 258), the red band is the 7-day gap, and the grey bar is the 365-day denominator window. Generated from worked_example.timeline_spec by scripts/build_worked_example_timelines.py, so the picture cannot drift from the numbers. Pairs with the Mermaid gantt above (same data, git-friendly text form).",
        "alt_text": "Horizontal timeline of patient 1001's 2023 statin fills - three 90-day blocks labeled with their days_supply, two green covered spans totaling 258 days, a red 7-day gap, over a 365-day window bar, with the result PDC = 258/365 = 0.71 below the 0.80 threshold.",
        "source_type": "illustrative",
        "source_citations": [
          "loucks-2021"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "tradeoff_with",
        "target_slug": "mpr-medication-possession-ratio",
        "notes": "Primary comparator. PDC counts unique covered days and caps at 1.0; MPR sums days_supply and can exceed 1.0 under oversupply. Prefer PDC for classification and quality measures; MPR persists in older literature and some economic models."
      },
      {
        "relation_type": "alternative_to",
        "target_slug": "persistence-time-to-discontinuation",
        "notes": "Complementary construct. PDC measures cumulative coverage intensity; persistence measures time to the first permitted gap. Studies frequently report both."
      },
      {
        "relation_type": "used_with",
        "target_slug": "exposure-episode-construction-rwe",
        "notes": "Daily-coverage and carry-over logic in PDC is the same machinery used to stitch fills into time-varying exposure episodes for survival models."
      },
      {
        "relation_type": "see_also",
        "target_slug": "inpatient-bridging-exposure-rwe",
        "notes": "Inpatient/SNF days require either removal from the PDC denominator or supply bridging, the same institutional-stay handling used in exposure-episode construction."
      },
      {
        "relation_type": "see_also",
        "target_slug": "immortal-time-bias-handling",
        "notes": "Measuring PDC over a window that overlaps outcome follow-up induces reverse-causation/immortal-time bias; a landmark window measured before follow-up avoids it."
      }
    ],
    "aliases": [
      "PDC",
      "proportion of days covered",
      "days covered proportion"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "pqa-cms",
      "fda",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "pearson-spearman-correlation",
    "name": "Pearson and Spearman Correlation",
    "short_definition": "Two complementary measures of the linear or monotone relationship between two continuous variables: Pearson r quantifies the strength and direction of the straight-line (linear) association assuming bivariate normality, while Spearman rho replaces raw values with ranks and quantifies the strength of any monotone association — one goes up when the other goes up, regardless of whether the relationship is a straight line. The choice between them hinges on the shape of the relationship, the presence of outliers, and whether the research question is about linear magnitude or about rank ordering.",
    "long_description": "**What Pearson r and Spearman rho actually measure**\n\nPearson's r is the ratio of the covariance of two variables to the product of their standard\ndeviations: r = cov(X, Y) / (sd(X) * sd(Y)). It ranges from -1 (perfect negative linear\nassociation) to +1 (perfect positive linear association), with 0 indicating no linear\nrelationship. The key word is *linear*: Pearson r measures only straight-line association.\nTwo variables can have a strong curved (e.g., quadratic) relationship and still produce\nr ≈ 0 if the curvature is symmetric around the mean. This is the Anscombe's Quartet lesson:\nfour datasets with identical Pearson r of 0.816 can look completely different on a scatter\nplot — one linear, one curved, one with a single influential outlier, one with all points\non a line except one extreme case. Pearson r alone, without the scatter plot, is a\ndangerous summary.\n\nSpearman rho is Pearson r computed on the *ranks* of the two variables rather than on their\nraw values. Each observation is replaced by its rank within its variable (rank 1 = smallest).\nThe resulting rho measures *monotone association*: do higher values of X tend to pair with\nhigher values of Y, regardless of whether the pairing follows a straight line? This makes\nrho robust to nonlinearity, to outliers (which become merely \"the highest rank\" rather than a\nleverage point with enormous squared deviation), and to non-normal marginal distributions.\nThe computational shortcut formula when there are no ties is:\nrho = 1 − 6 Σd² / (n(n²−1)), where d is the difference between the rank of X and the rank\nof Y for the same observation and n is the sample size.\n\n**The outlier leverage problem with Pearson r**\n\nA single high-leverage data point can manufacture or destroy a Pearson correlation. 
Consider\nn = 10 patients with no relationship between BMI and CRP (r ≈ 0), then add one patient with\nboth extremely high BMI (40) and extremely high CRP (15): Pearson r can jump to 0.85 from a\nsingle observation. The mechanism is that each observation enters Pearson r through its\ndeviations from the two means, so a value far from both means exerts outsized influence on both the\nnumerator (covariance) and denominator (standard deviations). Spearman rho converts that\nextreme BMI to rank 11 out of 11 and that extreme CRP to rank 11 out of 11; the outlier\ncontributes a rank difference of zero and has no disproportionate influence. The practical\nrule: always plot the data before choosing; always run both Pearson and Spearman on\nreal-world datasets as a diagnostic for outlier influence. If r and rho differ substantially\n(here r >> rho), a small number of influential observations are likely driving the Pearson result.\n\n**Fisher z transformation and confidence intervals**\n\nThe sampling distribution of Pearson r is not normal, especially at sample sizes typical of validation\nstudies or small pharmacoepidemiologic cohorts. The Fisher z transformation converts r to a\nquantity that is approximately normally distributed: z = 0.5 * ln((1+r)/(1-r)). A 95%\nconfidence interval is constructed in z-space using the standard error 1/sqrt(n-3), then\nback-transformed: lower = tanh(z − 1.96/sqrt(n-3)), upper = tanh(z + 1.96/sqrt(n-3)).\nThe same transformation applies to Spearman rho for large samples. Always report a CI\nalongside any correlation coefficient; a point estimate without a CI hides\nthe uncertainty, which is large at small n: at n = 30 the 95% CI around an observed r = 0.6\nruns from about 0.31 to 0.79. 
Neither Python's scipy.stats.pearsonr\nnor SAS PROC CORR reports a Fisher z CI by default: in SciPy (1.9 and later) it requires\ncalling confidence_interval() on the returned result, and in PROC CORR it requires the FISHER\noption. This is a common reporting omission.\n\n**Sample size needed for stable correlation estimates**\n\nPearson r is an unstable estimator at small n. At n = 20, the Fisher z 95% CI for an observed\nr = 0.5 runs from roughly 0.07 to 0.77. At n = 100, it narrows to roughly 0.34 to 0.63. As a\nrough planning rule, 50 to 100 paired observations are needed to estimate a moderate\ncorrelation with useful precision. In the opposite direction, at n = 1,000,000 (not atypical\nin claims datasets), a Pearson r = 0.01 is statistically significant at p < 0.001 but\nrepresents a negligible real-world association. The p-value for a correlation coefficient at\nclaims scale is completely uninformative about clinical or scientific relevance. Always report\nr (or rho) and its CI; never rely on the p-value as a measure of the size or importance of the\nassociation.\n\n**The misuse catalog**\n\n*Correlation is not agreement.* This is the single most consequential misuse in HEOR and\nclinical measurement validation. Two laboratory assays or two coding algorithms can have a\nPearson r of 0.99 and yet one systematically reads twice as high as the other throughout its\nrange. The correlation coefficient measures association, not concordance. If both variables\nmove together proportionally (even with a constant offset or scale factor), r can approach\n1.0 while the assays disagree substantially on every sample. The correct tool for agreement\nbetween two methods of measuring the same quantity is the Bland-Altman plot (also called\nlimits of agreement), which plots the difference between the two methods against their\naverage and checks whether differences are centered at zero and within a clinically\nacceptable range. 
When validating a claims-derived measure against chart abstraction, the\nclinical question is \"do they give the same answer?\" not \"do they move together?\" — and only\nBland-Altman answers the right question.\n\n*Correlation is not causation.* This warning is universal, but it is violated constantly in\nobservational data summaries. The correlation between two biomarkers in a claims or EHR\ndataset may reflect shared confounders, selection effects, or reverse causation. No\ncorrelation coefficient — Pearson or Spearman — controls for any covariate. The appropriate\ntool for a causal question is a regression model with covariate adjustment, not a bivariate\ncorrelation.\n\n*Correlating change scores with baseline values creates mathematical coupling.* If X is\nthe baseline value and Y = (follow-up − baseline) is the change, then Y and X share a\ncommon term (baseline) with opposite signs. This produces a spurious negative correlation\n(regression to the mean) that is purely mathematical, not biological. Researchers who\ncorrelate \"change in HbA1c\" with \"baseline HbA1c\" and find r ≈ -0.5 to -0.7 are reporting\na mathematical artifact, not a real relationship.\n\n*Spurious correlations from mixing subgroups.* In a pooled dataset spanning multiple disease\nsubgroups or demographic strata, a correlation computed across all strata can be driven by\nbetween-group differences rather than within-group associations — a correlation-flavored\ninstance of Simpson's paradox. Two variables may be positively correlated within every\nsubgroup yet negatively correlated in the pooled data if the strata differ in both variables.\nAlways check whether the correlation holds within the relevant strata (e.g., separately by\nsex, disease severity, or treatment group) before drawing conclusions from a pooled\ncorrelation.\n\n*p-value on r at claims scale is meaningless.* At n = 1,000,000, the null hypothesis\nH₀: rho = 0 is rejected for r = 0.002 at p < 0.05. 
The p-value tests whether the\ncorrelation is exactly zero in the population — a question that is essentially never\nclinically interesting. Report the correlation coefficient and its 95% CI. If the CI for\nrho runs from 0.002 to 0.006, the correlation is statistically detectable but scientifically\nnegligible.\n\n*Pearson r on heavily tied ordinal data.* When a variable has only a few levels (e.g., a\n5-point Likert scale or a 3-level severity classification), Pearson r applied as if the\nscale were continuous misrepresents the relationship. Spearman rho is more defensible, but\nfor ordinal variables with few levels and heavy ties, polychoric correlation (which estimates\nthe latent bivariate normal correlation underlying the ordinal categorization) is the\nstatistically correct choice. This situation arises frequently in patient-reported outcome\ninstruments and registry severity scores.\n\n**Pros, cons, and trade-offs**\n\n*Pearson r*:\n- Pros: directly interpretable as the linear association; r² is the proportion of variance\n  in Y explained by a linear function of X; well-established power properties; Fisher z\n  transformation gives asymptotically normal CI; the building block of multiple regression\n  (the OLS regression slope is b = r * sd(Y)/sd(X)); computationally simple.\n- Cons: sensitive to outliers; measures only linear association; assumes bivariate normality\n  for exact inference at small n; not appropriate for ordinal data; inflated or deflated by\n  range restriction; cannot distinguish a linear from a nonlinear monotone relationship.\n- When to prefer: continuous, approximately normally distributed variables with no extreme\n  outliers, when the linear relationship is theoretically justified, when r² as variance\n  explained is a meaningful target quantity, when you plan to use the correlation in a\n  regression framework.\n\n*Spearman rho*:\n- Pros: robust to outliers; measures any monotone (not just linear) association; requires\n  only 
ordinal measurement; valid for skewed distributions; appropriate for continuous\n  variables with heavy tails; resistant to the leverage of extreme observations.\n- Cons: discards information about the magnitude of differences between observations; lower\n  power than Pearson r when the bivariate normality assumption holds; heavy ties reduce the\n  accuracy of the shortcut formula (correction factor needed); rank-based CI via Fisher z\n  is only asymptotically valid.\n- When to prefer: non-normal or skewed variables; presence of outliers or influential\n  observations; monotone but possibly nonlinear relationship; ordinal or bounded continuous\n  scales; exploratory biomarker screening where the normality assumption is not established.\n\n**When to use**\n\nUse Pearson r when both variables are continuous, approximately normally distributed, and\nthe research question is about the strength of the linear relationship. Typical applications\nin HEOR and RWE include:\n\n- *Collinearity screening among candidate covariates before propensity score or regression\n  modelling*: high absolute Pearson r (|r| > 0.7 to 0.8) between two covariates signals\n  potential collinearity that may destabilize coefficient estimates; Spearman rho is\n  appropriate for this screen when covariate distributions are skewed.\n- *Exploratory biomarker-outcome screening*: an initial screen of Pearson r between a\n  biomarker and a continuous outcome to identify candidates for further multivariable\n  modelling. Treat this as hypothesis-generating only; all identified associations require\n  replication and covariate adjustment.\n- *Validation studies comparing a claims-derived measure to a chart-abstracted measure* —\n  but only to quantify the degree of association, not the degree of agreement. 
The correct\n  agreement analysis (Bland-Altman, intraclass correlation coefficient, kappa) should follow.\n- *Pre-specified correlation analyses in small pilot studies* where continuous biomarkers\n  or outcomes are measured and the bivariate normal assumption is plausible.\n\nUse Spearman rho when:\n- The data contain outliers or the distribution is heavy-tailed (common in cost and\n  utilization data even after log transformation).\n- The variables are ordinal or bounded (patient-reported outcomes, severity scales).\n- The relationship is expected to be monotone but not necessarily linear.\n- You want a sensitivity check on a Pearson r result to assess outlier influence.\n\n**When NOT to use**\n\n*Agreement or method-comparison questions*: never use a correlation coefficient — Pearson\nor Spearman — to assess whether two measurement instruments or two coding algorithms agree.\nUse the Bland-Altman limits-of-agreement plot (for continuous measurements), Cohen's kappa\n(for categorical classifications), or the intraclass correlation coefficient (for continuous\nratings with an absolute-agreement model). In claims validation studies, the question \"does\nalgorithm X correctly identify condition Y compared to chart review?\" is a classification\naccuracy question answered by sensitivity, specificity, and positive predictive value —\nnot by a correlation coefficient.\n\n*Causal claims*: correlation is bivariate and unadjusted. Any research question that implies\ncausation — effect of a drug on an outcome, impact of a comorbidity on cost — requires\nregression, propensity score methods, or other causal inference approaches that control for\nconfounders. 
Reporting a correlation coefficient as evidence of a treatment effect in an\nobservational dataset is methodologically indefensible.\n\n*Change-from-baseline correlated with baseline*: if you are correlating change in an\noutcome with the baseline value of that same outcome, interpret results with extreme caution\nor avoid the analysis entirely. The mathematical coupling between the change score and the\nbaseline value generates a spurious negative correlation via regression to the mean, not a\nmeaningful biological association.\n\n*Nonmonotone (U-shaped or inverted-U) relationships*: both Pearson r and Spearman rho can\nreturn values near zero for a strong but symmetric quadratic or sinusoidal relationship,\nmisleadingly implying no association. Always examine the scatter plot and consider polynomial\nor piecewise regression when a nonmonotone relationship is plausible (e.g., the J-shaped\nrelationship between alcohol consumption and mortality, or the optimal dosing curve).\n\n*Heavily tied ordinal data with few levels*: when a variable has 3 to 5 levels and the\ndistribution is concentrated at a few values, Spearman rho with many ties gives a poor\napproximation to the underlying latent association. Use polychoric correlation for ordered\ncategorical variables or a Jonckheere-Terpstra trend test for a formal hypothesis test.\n\n**Interpreting the output**\n\nIn the worked example, five patients have total cholesterol and CRP scores measured. The\nrank differences between the two variables are all small (0, −1, 1, 0, 0), yielding Σd² = 2.\nApplying the Spearman formula: denominator = n(n² − 1) = 5 × 24 = 120; 6 × Σd² = 12;\nrho = 1 − 12/120 = 1 − 0.10 = 0.90. 
The Pearson r for these same data is approximately\n0.903, nearly identical here because the patient with elevated cholesterol (Patient 5) also\nranks highest on CRP, sitting on the general trend rather than disrupting it.\n\n*(1) Formal interpretation.* A Spearman rho of 0.90 indicates a strong positive monotone\nassociation: patients with higher cholesterol ranks tend to have higher CRP ranks across\nthese five observations. The coefficient runs from −1 (perfect inverse rank agreement) to\n+1 (perfect rank agreement); 0.90 is near the upper end. A confidence interval should\naccompany rho: at n = 5 the 95% CI from the Fisher z transformation is very wide, running\nfrom roughly 0.09 to 0.99, which illustrates how imprecise any single-study\ncorrelation estimate is at small n. The p-value for H₀: rho = 0 tests whether the\npopulation rank correlation is exactly zero, a question distinct from whether the\nassociation is clinically important. Rho measures monotone association, not agreement; two\ninstruments measuring the same quantity could have rho near 1.0 while one consistently\nreads twice as high as the other.\n\n*(2) Practical interpretation.* Higher cholesterol generally pairs with higher\ninflammation in these five patients, and the near-1.0 coefficient reflects strong\nco-movement of ranks with only one discordant pair (Patients 2 and 3). The close alignment of\nSpearman rho (0.90) and Pearson r (≈ 0.903) signals that no single outlier is distorting the\nlinear measure: Patient 5, whose cholesterol is extreme in absolute terms, ranks fifth on both\nvariables and exerts no disproportionate leverage on rho. However, a correlation from five\npatients is a hypothesis-generating observation: it cannot support a causal claim, cannot\ncontrol for any confounders, and carries substantial sampling uncertainty. A larger study\nwith regression-based analysis would be needed before clinical or policy conclusions could\nbe drawn.",
    "primary_category": "Inferential_Statistics",
    "tags": [
      "statistics",
      "primitive",
      "correlation",
      "association",
      "pearson",
      "spearman",
      "rank-correlation",
      "collinearity",
      "biomarker",
      "validation"
    ],
    "applies_to_study_types": [
      "cohort_retrospective",
      "cohort_prospective",
      "cross_sectional",
      "claims_analysis",
      "ehr_study",
      "registry_study",
      "descriptive_analysis"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "primary",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1213/ANE.0000000000002864",
        "url": "https://doi.org/10.1213/ANE.0000000000002864",
        "citation_text": "Schober P, Boer C, Schwarte LA. Correlation coefficients: appropriate use and interpretation. Anesthesia & Analgesia. 2018;126(5):1763-1768.",
        "year": 2018,
        "authors_short": "Schober et al.",
        "notes": "Definitive practical guide to choosing between Pearson and Spearman correlation, interpreting r vs rho, and avoiding common misuses including the agreement fallacy. Widely cited in clinical research methodology; addresses the full range of situations an RWE or HEOR analyst will encounter."
      },
      {
        "role": "explain",
        "doi": "10.2478/v10117-011-0021-1",
        "url": "https://doi.org/10.2478/v10117-011-0021-1",
        "citation_text": "Hauke J, Kossowski T. Comparison of values of Pearson's and Spearman's correlation coefficients on the same sets of data. Quaestiones Geographicae. 2011;30(2):87-93.",
        "year": 2011,
        "authors_short": "Hauke & Kossowski",
        "notes": "Systematic empirical comparison showing when Pearson r and Spearman rho agree and diverge across distributions with varying skewness, outlier patterns, and sample sizes. Provides concrete guidance on the conditions under which rank-based correlation adds value over the linear coefficient."
      },
      {
        "role": "demonstrate",
        "doi": "10.1016/S0140-6736(86)90837-8",
        "url": "https://doi.org/10.1016/S0140-6736(86)90837-8",
        "citation_text": "Bland JM, Altman DG. Statistical methods for assessing agreement between two methods of clinical measurement. The Lancet. 1986;327(8476):307-310.",
        "year": 1986,
        "authors_short": "Bland & Altman",
        "notes": "Foundational paper establishing that correlation is the wrong tool for assessing agreement between two measurement methods and introducing the Bland-Altman limits-of- agreement plot as the correct approach. Essential context for any discussion of correlation in HEOR validation studies."
      },
      {
        "role": "use",
        "doi": "10.1136/bmj.312.7030.572",
        "url": "https://doi.org/10.1136/bmj.312.7030.572",
        "citation_text": "Altman DG, Bland JM. Statistics notes: Presentation of numerical data. BMJ. 1996;312(7030):572.",
        "year": 1996,
        "authors_short": "Altman & Bland",
        "notes": "Part of the BMJ Statistics Notes series; addresses reporting of numerical results including correlation coefficients, effect sizes, and confidence intervals for a clinical research audience. Reinforces the requirement to report CIs alongside point estimates."
      }
    ],
    "plain_language_summary": "A correlation coefficient measures how strongly two measurements move together across a group of patients or observations — for example, whether higher cholesterol tends to pair with higher inflammation scores. Pearson r measures straight-line association between two numbers and is sensitive to extreme values; Spearman rho replaces every number with its rank order (first, second, third) and then measures association on those ranks, making it more robust when the data contain outliers or are not bell-curve shaped. Both coefficients run from -1 (perfect opposite movement) through 0 (no relationship) to +1 (perfect co-movement), but they answer slightly different questions — and neither one can tell you whether two instruments agree with each other or whether one variable causes another.",
    "key_terms": [
      {
        "term": "covariance",
        "definition": "A measure of how much two variables change together; if X tends to be above its average at the same time Y is above its average, the covariance is positive."
      },
      {
        "term": "linear vs monotone association",
        "definition": "Linear association means the relationship follows a straight line; monotone association is weaker — it only requires that higher values of X tend to pair with higher (or lower) values of Y, even if the relationship curves."
      },
      {
        "term": "rank correlation",
        "definition": "A correlation computed after replacing each raw data value with its rank in the sorted list; this removes the influence of extreme values because the largest observation simply gets the top rank, regardless of how extreme it is."
      },
      {
        "term": "Fisher z transformation",
        "definition": "A mathematical conversion that turns a Pearson r into a quantity that behaves like a normal distribution, making it possible to compute a valid confidence interval around r."
      },
      {
        "term": "outlier leverage",
        "definition": "The disproportionate influence a single extreme data point can exert on Pearson r; one outlier far from the group mean can substantially raise or lower r without reflecting the relationship in the rest of the data."
      },
      {
        "term": "polychoric correlation",
        "definition": "A correlation measure designed for ordinal variables with only a few levels, such as a 5-point severity scale; it estimates the underlying continuous association rather than treating the ordinal scores as if they were continuous numbers."
      }
    ],
    "worked_example": {
      "scenario": "A study pharmacist is examining whether patients with higher total cholesterol (mg/dL) also tend to have higher C-reactive protein (CRP) inflammation scores. Five patients have both measurements available. The analyst wants to compute the Spearman rank correlation by hand to understand the mechanics, then compare it to the Pearson r to see whether one extreme cholesterol value (Patient 5, who has cholesterol of 290) distorts the linear measure.",
      "dataset": {
        "caption": "Cholesterol and CRP measurements for five patients. Patient 5 has an elevated cholesterol value (290 mg/dL) that is far above the others. CRP scores run 1-10.",
        "columns": [
          "patient_id",
          "cholesterol_mgdL",
          "crp_score"
        ],
        "rows": [
          [
            "P1",
            180,
            2
          ],
          [
            "P2",
            200,
            5
          ],
          [
            "P3",
            220,
            4
          ],
          [
            "P4",
            240,
            7
          ],
          [
            "P5",
            290,
            8
          ]
        ]
      },
      "steps": [
        "Rank each variable separately from smallest to largest. Cholesterol ranks: 180->1, 200->2, 220->3, 240->4, 290->5. CRP ranks: 2->1, 4->2, 5->3, 7->4, 8->5.",
        "Assign ranks to each patient. Patient 1: rank_x=1, rank_y=1. Patient 2: rank_x=2, rank_y=3. Patient 3: rank_x=3, rank_y=2. Patient 4: rank_x=4, rank_y=4. Patient 5: rank_x=5, rank_y=5.",
        "Compute the rank difference d = rank_x - rank_y for each patient. P1: d=1-1=0, P2: d=2-3=-1, P3: d=3-2=1, P4: d=4-4=0, P5: d=5-5=0.",
        "Square each difference: P1: d^2=0, P2: d^2=1, P3: d^2=1, P4: d^2=0, P5: d^2=0. Sum of squared differences: Sigma_d2 = 0+1+1+0+0 = 2.",
        "Apply the Spearman formula. Denominator = n*(n^2-1) = 5*(25-1) = 5*24 = 120. Numerator term: 6*Sigma_d2 = 6*2 = 12. So rho = 1 - 12/120 = 1 - 0.10 = 0.90.",
        "Interpret: rho = 0.90 indicates a strong positive monotone association. Higher cholesterol tends to pair with higher CRP across these five patients. Patient 5's extreme cholesterol value becomes rank 5 (the top rank) and their CRP also ranks 5th — so their outlier cholesterol does not distort the rank-based result.",
        "For comparison, Pearson r for these same data is approximately 0.903 — nearly identical here because Patient 5's extreme value happens to sit on the general trend. If Patient 5's CRP had been unexpectedly low (say, CRP=1 instead of 8), Pearson r would drop to near -0.30 while Spearman rho would drop to only about 0.10, illustrating how a single off-trend outlier distorts the linear measure far more than the rank measure."
      ],
      "result": "Sigma_d2 = 0+1+1+0+0 = 2. Denominator = 5*(25-1) = 5*24 = 120. Numerator = 6*2 = 12. rho = 1 - 12/120 = 1 - 0.10 = 0.90. The Spearman rank correlation is 0.90, indicating a strong positive monotone association between cholesterol and CRP in these five patients. This matches Pearson r (0.903) closely when no outlier disrupts the trend, but the two measures would diverge if any patient's values were off the general pattern."
    },
    "prerequisites": [
      "descriptive-statistics",
      "inferential-statistics-foundations",
      "parametric-vs-nonparametric-tests"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Pearson r with Fisher z confidence interval",
        "description": "The standard Pearson r with a 95% CI constructed via the Fisher z transformation. The CI is asymmetric around r (wider toward zero, narrower toward the extremes) and is the correct interval for inference. Report as r = 0.XX (95% CI: lower, upper) alongside the sample size. At n < 30, the CI is typically too wide to distinguish a \"moderate\" from a \"strong\" correlation, which should be stated explicitly.",
        "edge_cases": [
          "At r near ±1.0, the Fisher z CI is narrow but the estimate is increasingly extrapolation-sensitive; confirm with the scatter plot that the high correlation is not driven by two or three extreme observations.",
          "For a one-sided test (e.g., directional hypothesis that rho > 0), halve the two-sided p-value; do not use a different transformation."
        ],
        "data_source_notes": "Claims and EHR: compute Pearson r with CI after confirming approximately normal marginal distributions (histogram or Q-Q plot). For cost or utilization variables, log-transform before computing Pearson r or use Spearman rho on the untransformed data."
      },
      {
        "name": "Spearman rho with tied-rank correction",
        "description": "When the data contain ties (common with ordinal scales, count variables, or rounded measurements), the shortcut formula rho = 1 − 6Σd²/(n(n²−1)) is not exact. The precise computation uses the full Pearson formula on the mid-ranks assigned to tied observations (average of the ranks those tied values would occupy if they were distinct). Software implementations (scipy.stats.spearmanr, cor.test in R with method='spearman', PROC CORR SPEARMAN in SAS) handle ties correctly; the shortcut formula should not be used manually when ties are present.",
        "edge_cases": [
          "When more than 10-15% of observations are tied on one variable, Spearman rho loses power; polychoric correlation is the preferred alternative for ordinal variables with heavy ties.",
          "For very large samples (n > 10,000), the approximation to the normal distribution used for the p-value is accurate but the CI should still be constructed via Fisher z on rho."
        ],
        "data_source_notes": "Registry data with severity scores and ordinal outcomes commonly have heavy ties; use software with the tied-rank correction and report the number of tied pairs."
      },
      {
        "name": "Partial correlation (correlation after controlling for a covariate)",
        "description": "When two variables are correlated in part because both depend on a third variable (e.g., age confounds the association between baseline comorbidity count and total cost), partial correlation removes the linear effect of the covariate from both variables before computing the correlation. This is not a substitute for a full multivariable model but is useful as an exploratory tool and for collinearity diagnostics. Partial Spearman rho can be approximated by computing residuals from rank regression on the covariate.",
        "edge_cases": [
          "Partial correlation controls only for the *linear* effect of the covariate; if the confounder relationship is nonlinear, partial correlation incompletely adjusts.",
          "Do not interpret partial r as a measure of causal effect; it remains descriptive and subject to other unmeasured confounders."
        ],
        "data_source_notes": "Useful in linked claims-EHR studies for collinearity screening among continuous covariates after accounting for age and comorbidity burden."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "baseline-characteristics-and-covariate-balance-rwe",
        "pros_of_this": "Correlation provides a single quantitative summary of the linear or monotone association between two continuous variables — useful for collinearity screening among candidate covariates before propensity score or regression model building.",
        "cons_of_this": "Pearson r between two covariates does not capture nonlinear or interaction-based collinearity; variance inflation factor (VIF) computed from the regression model itself is a more complete collinearity diagnostic after the model is specified.",
        "when_to_prefer": "Use correlation for rapid pairwise screening of a large covariate set before model building; use VIF for confirmation after the model is specified."
      },
      {
        "compared_to": "parametric-vs-nonparametric-tests",
        "pros_of_this": "Correlation quantifies the strength and direction of the association between two continuous variables, providing a measure of effect size that hypothesis tests (which only indicate whether an association is detectable) do not provide.",
        "cons_of_this": "Correlation is only appropriate for two continuous (or ordinal) variables; for binary or categorical outcomes, regression-based measures (odds ratio, risk difference) or rank-biserial correlation are more appropriate.",
        "when_to_prefer": "Use correlation when both variables are continuous and the research question is about the strength of association; use hypothesis tests when the question is whether a difference or association is detectable at a given significance level."
      },
      {
        "compared_to": "mann-whitney-u-test",
        "pros_of_this": "Correlation (especially Spearman rho) and the Mann-Whitney U test share the same rank-based machinery; Spearman rho provides a continuous measure of association while Mann-Whitney provides a dichotomous group comparison. For a bivariate continuous-continuous question, rho is the direct answer; for a group comparison on a continuous outcome, Mann-Whitney is the direct answer.",
        "cons_of_this": "Correlation does not address whether a mean or median difference between defined groups is statistically or clinically significant; for that question, use Mann-Whitney with the Hodges-Lehmann estimate.",
        "when_to_prefer": "Use Spearman rho when both variables are continuous and the question is about co-movement; use Mann-Whitney when the question is whether two groups differ in a continuous outcome."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Claims datasets often contain cost and utilization variables that are right-skewed with heavy tails. For collinearity screening among such variables, Spearman rho is more appropriate than Pearson r because it is robust to the extreme values that are common in cost data. For continuous biomarkers or lab values available via linked EHR, Pearson r with Fisher z CI is appropriate after confirming approximately normal distributions. Never use a correlation coefficient as the primary analysis for a causal or comparative effectiveness question in claims data.",
      "ehr": "Lab values (HbA1c, LDL, creatinine) and vital signs are approximately normally distributed and well-suited to Pearson r for association analyses. Patient-reported outcome scores and ordinal severity assessments call for Spearman rho. In longitudinal EHR data, be alert to the change-score / baseline coupling problem when correlating change in a lab value with the baseline lab value.",
      "registry": "Registry variables (adjudicated event counts, severity scores, time to event) often have non-normal distributions or bounded ranges. Use Spearman rho as the default for registry-based correlation analyses unless normality is established. For validation studies comparing a registry-derived measure to an external gold standard, use Bland-Altman for agreement, not correlation.",
      "primary": "Survey-based PRO instruments and preference-elicitation scales (EQ-5D, SF-36) produce ordinal or bounded continuous data. For cross-instrument correlations (e.g., two quality-of-life instruments), Spearman rho is appropriate. For heavily tied items (3 to 5 levels), polychoric correlation should be considered.",
      "linked": "Linked claims-EHR-registry datasets provide opportunity to validate coding algorithms by comparing claims-derived classifications to chart-abstracted gold standards. The appropriate analysis depends on the variable type: use Bland-Altman for continuous measures, kappa for binary classifications, and ICC for continuous rating scales. Report Spearman rho as a supplementary association measure only, with explicit acknowledgment that correlation does not equal agreement."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import math\nfrom scipy import stats\n\n# ── Motivating dataset: cholesterol (mg/dL) vs CRP score ──\ncholesterol = [180, 200, 220, 240, 290]\ncrp         = [  2,   5,   4,   7,   8]\n\n# ── 1. Pearson r ──\nr, p_r = stats.pearsonr(cholesterol, crp)\nprint(f\"Pearson r = {r:.4f}  (p = {p_r:.4f})\")\n\n# ── 2. Fisher z confidence interval for Pearson r ──\nn = len(cholesterol)\nz = math.atanh(r)                           # Fisher z = 0.5 * ln((1+r)/(1-r))\nse_z = 1 / math.sqrt(n - 3)                 # SE in z-space\nz_lo, z_hi = z - 1.96 * se_z, z + 1.96 * se_z\nr_lo, r_hi = math.tanh(z_lo), math.tanh(z_hi)\nprint(f\"  95% CI via Fisher z: ({r_lo:.4f}, {r_hi:.4f})\")\n\n# ── 3. Spearman rho ──\nrho, p_rho = stats.spearmanr(cholesterol, crp)\nprint(f\"Spearman rho = {rho:.4f}  (p = {p_rho:.4f})\")\n\n# ── 4. Fisher z CI for Spearman rho (large-sample approximation) ──\nz_s = math.atanh(rho)\nz_s_lo, z_s_hi = z_s - 1.96 * se_z, z_s + 1.96 * se_z\nrho_lo, rho_hi = math.tanh(z_s_lo), math.tanh(z_s_hi)\nprint(f\"  95% CI via Fisher z: ({rho_lo:.4f}, {rho_hi:.4f})\")\n\n# ── 5. Outlier sensitivity demonstration ──\n# Add one off-trend patient: very high cholesterol but unexpectedly LOW CRP\ncholesterol_out = cholesterol + [310]\ncrp_out         = crp + [1]            # outlier: high chol, low CRP\nr_out, _   = stats.pearsonr(cholesterol_out, crp_out)\nrho_out, _ = stats.spearmanr(cholesterol_out, crp_out)\nprint(f\"\\nAfter adding off-trend outlier (chol=310, CRP=1):\")\nprint(f\"  Pearson r   = {r_out:.4f}  (was {r:.4f}, change = {r_out - r:.4f})\")\nprint(f\"  Spearman rho = {rho_out:.4f}  (was {rho:.4f}, change = {rho_out - rho:.4f})\")\nprint(\"Pearson r shifts more because the outlier is far from the mean in x and y;\")\nprint(\"Spearman rho is dampened because the outlier is just rank 6 out of 6 on x.\")\n\n# ── 6. 
Correct use: report r NOT as an agreement measure ──\n# Two assays that correlate perfectly (r = 1.0) but differ systematically by 2x:\nassay_a = [10, 20, 30, 40, 50]\nassay_b = [20, 40, 60, 80, 100]   # exactly 2x -- perfect correlation, not agreement\nr_agree, _ = stats.pearsonr(assay_a, assay_b)\nprint(f\"\\nAgreement misuse demonstration:\")\nprint(f\"  Assay A: {assay_a}\")\nprint(f\"  Assay B (2x A): {assay_b}\")\nprint(f\"  Pearson r = {r_agree:.4f}  <- perfect correlation\")\nprint(f\"  Mean difference (A - B): {sum(a-b for a,b in zip(assay_a,assay_b))/len(assay_a):.1f}\")\nprint(\"  -> r = 1.0 but assays disagree by 2x everywhere. Use Bland-Altman for agreement.\")",
        "description": "Pearson r and Spearman rho using scipy.stats, with Fisher z confidence intervals\nconstructed manually (scipy.stats.pearsonr does not return CIs by default). Demonstrates\nthe outlier-sensitivity contrast: adds one off-trend patient and shows how Pearson r\nchanges more than Spearman rho. Uses the five-patient cholesterol/CRP motivating dataset\nfrom the worked example.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "# ── Motivating dataset ──\ncholesterol <- c(180, 200, 220, 240, 290)\ncrp         <- c(  2,   5,   4,   7,   8)\n\n# ── 1. Pearson r with Fisher z CI (cor.test provides the CI automatically) ──\npr <- cor.test(cholesterol, crp, method = \"pearson\")\ncat(sprintf(\"Pearson r = %.4f  p = %.4f\\n\", pr$estimate, pr$p.value))\ncat(sprintf(\"  95%% CI (Fisher z): (%.4f, %.4f)\\n\", pr$conf.int[1], pr$conf.int[2]))\n\n# ── 2. Spearman rho (cor.test with method = \"spearman\") ──\nsr <- cor.test(cholesterol, crp, method = \"spearman\", exact = FALSE)\ncat(sprintf(\"Spearman rho = %.4f  p = %.4f\\n\", sr$estimate, sr$p.value))\n\n# Manual Fisher z CI for Spearman rho\nn <- length(cholesterol)\nrho <- as.numeric(sr$estimate)\nz_s <- atanh(rho)\nse_z <- 1 / sqrt(n - 3)\nrho_lo <- tanh(z_s - 1.96 * se_z)\nrho_hi <- tanh(z_s + 1.96 * se_z)\ncat(sprintf(\"  95%% CI via Fisher z (manual): (%.4f, %.4f)\\n\", rho_lo, rho_hi))\n\n# ── 3. Outlier sensitivity demonstration ──\nchol_out <- c(cholesterol, 310)\ncrp_out  <- c(crp, 1)          # off-trend outlier\nr_out   <- cor.test(chol_out, crp_out, method = \"pearson\")$estimate\nrho_out <- cor.test(chol_out, crp_out, method = \"spearman\", exact = FALSE)$estimate\ncat(sprintf(\"\\nWith off-trend outlier added:\\n\"))\ncat(sprintf(\"  Pearson r    = %.4f  (was %.4f)\\n\", r_out, pr$estimate))\ncat(sprintf(\"  Spearman rho = %.4f  (was %.4f)\\n\", rho_out, sr$estimate))\n\n# ── 4. Agreement misuse demonstration ──\nassay_a <- c(10, 20, 30, 40, 50)\nassay_b <- c(20, 40, 60, 80, 100)   # 2x scale factor\nr_agree <- cor.test(assay_a, assay_b, method = \"pearson\")$estimate\ncat(sprintf(\"\\nAgreement misuse: r = %.4f but mean(A - B) = %.1f\\n\",\n            r_agree, mean(assay_a - assay_b)))\ncat(\"Use Bland-Altman (BlandAltmanLeh package) for agreement, not cor.test.\\n\")",
        "description": "Pearson r and Spearman rho using base R cor.test(), with explicit method= argument.\nFisher z CIs are computed manually. Demonstrates outlier sensitivity and the\ncorrelation-vs-agreement distinction. Uses the same five-patient dataset as Python.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* ── Create motivating dataset ── */\ndata work.chol_crp;\n  input patient_id $ cholesterol crp;\n  datalines;\nP1 180 2\nP2 200 5\nP3 220 4\nP4 240 7\nP5 290 8\n;\nrun;\n\n/* ── 1. Pearson r AND Spearman rho in a single PROC CORR ── */\nproc corr data=work.chol_crp pearson spearman fisher(rho0=0 biasadj=no);\n  var cholesterol crp;\n  /* PEARSON: Pearson r with p-value                                        */\n  /* SPEARMAN: Spearman rho with p-value                                    */\n  /* FISHER: Fisher z confidence intervals for both r and rho               */\n  /* biasadj=no: unbiased Fisher z transformation (recommended default)     */\nrun;\n\n/* ── 2. Outlier sensitivity dataset ── */\ndata work.chol_crp_out;\n  set work.chol_crp;\n  output;\n  if _n_ = 5 then do;\n    /* Add off-trend patient: high cholesterol (310) but low CRP (1) */\n    patient_id = \"P6\"; cholesterol = 310; crp = 1; output;\n  end;\nrun;\n\nproc corr data=work.chol_crp_out pearson spearman;\n  var cholesterol crp;\n  title \"Outlier sensitivity: Pearson r vs Spearman rho with off-trend P6 added\";\nrun;\n\n/* ── 3. Agreement misuse demonstration ── */\n/* Two assays: Assay_B = 2 * Assay_A -- perfect correlation, zero agreement */\ndata work.assays;\n  input assay_a assay_b;\n  diff = assay_a - assay_b;\n  datalines;\n10 20\n20 40\n30 60\n40 80\n50 100\n;\nrun;\n\nproc corr data=work.assays pearson;\n  var assay_a assay_b;\n  title \"Agreement misuse: r=1.0 but assays differ 2x everywhere\";\nrun;\n\n/* Bland-Altman summary (use BlandAltman macro for the plot) */\nproc means data=work.assays mean std;\n  var diff;\n  title \"Mean difference (A-B) and SD -- should be checked via Bland-Altman\";\nrun;",
        "description": "Pearson r and Spearman rho via PROC CORR with PEARSON and SPEARMAN options. PROC CORR\nautomatically computes Fisher z CIs when the FISHER option is specified. Uses the five-\npatient cholesterol/CRP dataset from the worked example. A supplementary DATA step\nconstructs the outlier-augmented dataset for sensitivity comparison.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Q[Two continuous variables<br/>Are both approximately normal?] --> Normal[Yes — no extreme outliers]\n  Q --> NonNormal[No — skewed, ordinal,<br/>or outliers present]\n  Normal --> Pearson[\"Pearson r<br/>(linear association)<br/>Report with Fisher z CI\"]\n  NonNormal --> Spearman[\"Spearman rho<br/>(monotone association)<br/>Report with Fisher z CI\"]\n  Pearson --> Check[\"Always: scatter plot first<br/>Both: run as sensitivity check<br/>on each other\"]\n  Spearman --> Check\n  Check --> Misuse{\"What is the question?\"}\n  Misuse --> |\"Do two measures agree?\"| Agreement[\"Use Bland-Altman,<br/>kappa, or ICC<br/>NOT correlation\"]\n  Misuse --> |\"Does X cause Y?\"| Causal[\"Use regression +<br/>covariate adjustment<br/>NOT correlation\"]\n  Misuse --> |\"Are X and Y associated?\"| Report[\"Report r or rho + CI<br/>+ scatter plot<br/>NOT just p-value\"]",
        "caption": "Decision tree for choosing Pearson r vs Spearman rho and for routing to the correct tool when correlation is not the right answer (agreement and causal questions).",
        "alt_text": "Flowchart branching on whether both variables are approximately normal into Pearson r (linear) or Spearman rho (monotone), then routing to the correct alternative tool when the question is actually about agreement or causation rather than association.",
        "source_type": "illustrative",
        "source_citations": [
          "schober-2018"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "requires",
        "target_slug": "parametric-vs-nonparametric-tests",
        "notes": "Pearson r is the parametric correlation (assumes bivariate normality); Spearman rho is the nonparametric rank-based alternative. The parametric vs nonparametric framework directly governs the choice between them and the conditions under which each is valid."
      },
      {
        "relation_type": "requires",
        "target_slug": "descriptive-statistics",
        "notes": "Means, standard deviations, and scatter plots are prerequisites for any correlation analysis; the Pearson r formula is derived from the covariance divided by the product of standard deviations, and the scatter plot is mandatory before interpreting any r."
      },
      {
        "relation_type": "requires",
        "target_slug": "inferential-statistics-foundations",
        "notes": "Hypothesis testing for rho = 0, p-value interpretation, confidence interval construction via Fisher z, and statistical vs practical significance distinctions are all covered in the inferential statistics foundations entry and are directly applicable here."
      },
      {
        "relation_type": "see_also",
        "target_slug": "mann-whitney-u-test",
        "notes": "Spearman rho and the Mann-Whitney U test share the same rank-based machinery; the rank-biserial correlation is a direct transformation of the Mann-Whitney U statistic. Understanding rank-based inference from the Mann-Whitney entry strengthens the interpretation of Spearman rho."
      },
      {
        "relation_type": "see_also",
        "target_slug": "baseline-characteristics-and-covariate-balance-rwe",
        "notes": "Pearson and Spearman correlations are used in collinearity screening among candidate covariates before propensity score or regression model building. High pairwise correlations between covariates signal multicollinearity that may destabilize regression coefficient estimates."
      }
    ],
    "aliases": [
      "Pearson r",
      "Spearman rho",
      "correlation coefficient",
      "rank correlation",
      "bivariate correlation",
      "Fisher z transformation",
      "Pearson correlation",
      "Spearman correlation"
    ],
    "complexity": "foundational",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "pediatric-dose-normalization-rwe",
    "name": "Pediatric Dose Normalization",
    "short_definition": "The exposure-definition rule that converts a dispensed or administered drug quantity into a body-size- and maturation-adjusted exposure metric (mg/kg, mg/m2, allometric, or age-banded) so pediatric exposure is comparable across children of different size and developmental stage.",
    "long_description": "**Pediatric dose normalization** is the exposure-definition step that turns a raw quantity of drug\n(total milligrams dispensed, a tablet/suspension volume, or an inpatient administration) into a\n*size- and maturation-adjusted* exposure metric. In adults a fixed mg/day is usually an adequate\nexposure variable; in children it is not, because a 3 kg neonate, a 14 kg toddler, and a 60 kg\nadolescent receiving \"the same dose\" experience radically different internal exposure (AUC, Cmax).\nNormalization is therefore not cosmetic rescaling — it is the construction of the analytic exposure\nvariable on which the entire comparative or dose-response analysis rests. The available metrics are\nnot interchangeable: linear weight-based (mg/kg/day), allometric (clearance scaling with\nweight^0.75), body-surface-area (mg/m2, via the Mosteller or Du Bois formula), fixed weight-banded\ntables, age-banded fixed dosing, and adult-dose-capped variants each encode a different assumption\nabout how drug clearance scales with body size and organ maturation.\n\n**Core conceptual distinction.** Three separable choices are doing the work. (1) *Which size\ndescriptor* — total body weight, ideal/lean body weight, BSA, or age band. Linear mg/kg implicitly\nassumes clearance is proportional to weight; this over-doses small children and under-doses large\nones because metabolic and renal clearance scale closer to weight^0.75 (allometry), not weight^1.0.\nBSA (mg/m2) is the historical standard for cytotoxic and some immunologic agents and tracks\nclearance better than linear weight for many drugs but requires height. (2) *Whether maturation is\nmodeled* — in neonates and infants, size scaling alone is wrong because ontogeny of CYP3A4, CYP2D6,\nUGT enzymes, and glomerular filtration dominates clearance in the first 1-2 years; a maturation\nfunction (post-menstrual age) must multiply the allometric term. 
(3) *Whether an adult cap applies*\n— large adolescents normalized purely by weight can exceed the labeled adult dose, so most protocols\ncap at the adult dose. The estimand-adjacent point: the *unit* of the exposure variable (mg/kg/day\nvs mg/m2/day vs categorical dose band) must be pre-specified, because a dose-response slope, a\nhigh-vs-low contrast, and a \"received guideline-concordant dose\" indicator are different analyses\nwith different interpretations and different susceptibility to misclassification.\n\n**Pros, cons, and trade-offs.**\n- **Linear weight-based (mg/kg) vs allometric (weight^0.75):** Linear is transparent, needs only\n  weight, and matches most product labels and weight-banded dosing tables — but it is biased for\n  internal exposure at the extremes of size and is the wrong scale for clearance. Allometric tracks\n  clearance far better and is the pharmacometric standard, but requires a defensible exponent and is\n  harder to explain to clinicians and reviewers. **Prefer linear mg/kg** when reproducing label- or\n  guideline-concordant dosing in a utilization/quality study; **prefer allometric** when modeling a\n  PK-anchored dose-response or pooling across a wide age range.\n- **BSA (mg/m2) vs weight-based:** BSA captures size-related clearance for many cytotoxic/biologic\n  agents and is the convention oncology reviewers expect; cost is that it needs height (rarely in\n  claims) and the BSA formula choice (Mosteller vs Du Bois vs Haycock) shifts the value a few\n  percent. **Prefer BSA** only where the drug is genuinely dosed per m2 in practice.\n- **Age-banded fixed dosing vs continuous normalization:** Age bands are what claims data can\n  actually support without weight and mirror OTC/primary-care prescribing, but they coarsely\n  misclassify exposure within a band (a 5th-percentile and 95th-percentile child of the same age\n  differ ~2-fold in weight). 
**Prefer age bands** only as a fallback when weight is missing, and\n  report the resulting nondifferential-or-worse misclassification.\n- **vs simply using total mg/day (no normalization):** Adequate for some safety signal detection\n  where any-exposure is the contrast, but actively misleading for any dose-response, pooled, or\n  comparative-intensity analysis in children.\n\n**When to use.** Any pediatric RWE analysis where exposure intensity — not merely ever/never\nexposure — is the variable of interest: dose-response, high-vs-standard dose comparisons, pooling\nacross a wide pediatric age range, comparative effectiveness/safety where the comparator is dosed on\na different size basis, and HTA/dossier work that must demonstrate guideline-concordant pediatric\ndosing. Use the size descriptor the drug is *actually* dosed by in label/guidelines, and pre-specify\nthe formula, the weight source and recency window, and the missing-weight handling rule.\n\n**When NOT to use — and when it is actively misleading or dangerous.**\n- **Neonates and preterm infants under naive size scaling.** Allometric or linear weight scaling\n  *alone* is dangerous here: clearance is governed by enzyme/renal maturation that size cannot proxy,\n  so a mg/kg metric implies an exposure ordering that is biochemically false. Either add a maturation\n  model (post-menstrual age) or restrict the analysis age range.\n- **Obesity / extreme body composition.** Total body weight, ideal body weight, and lean body weight\n  can differ by 30%+ in an obese adolescent, and the \"right\" descriptor is drug-specific\n  (lipophilic vs hydrophilic). Defaulting to total body weight silently inflates the exposure metric\n  and can flip a dose-response conclusion.\n- **Narrow-therapeutic-index drugs managed by therapeutic drug monitoring** (e.g., tacrolimus,\n  vancomycin, anticonvulsants). 
The real exposure is the measured trough/AUC, not a normalized dose;\n  a normalized prescribed dose is a poor and potentially misleading surrogate.\n- **Drugs with active metabolites whose clearance scales differently from the parent**, where a\n  parent-dose normalization misranks effective exposure.\n- **When weight is missing for a large, differential fraction** of the cohort: imputing or\n  age-banding then introduces exposure misclassification that, if it differs by the comparison of\n  interest, biases the contrast in an unpredictable direction.\n\n**Data-source operational depth.**\n- **Claims (FFS or commercial):** Claims do *not* contain weight or height — these are not billed\n  fields — so claims-only studies cannot compute true mg/kg or mg/m2 and must fall back to age-banded\n  dosing or to linked EHR anthropometrics. What claims *do* give is total dispensed drug: NDC encodes\n  formulation and strength (mg per tablet, or mg/mL for a suspension), and `dispensed_qty` x strength\n  / `days_supply` yields total mg/day. Pediatric formulations are the trap: suspensions and chewables\n  make strength a concentration, not a per-unit amount; `days_supply` is frequently wrong or zero for\n  compounded/suspension fills; and partial bottles, split tablets, and weight-based titration mean the\n  dispensed amount overstates the amount actually given. Restrict to fee-for-service person-time with\n  both medical and pharmacy benefit — Medicare Advantage analogues and Medicaid managed-care\n  encounter data drop or under-capture fills, so a derived dose can be missingness rather than a true\n  value. In pediatric Medicaid specifically, eligibility churn produces gaps that mimic discontinuation.\n- **EHR:** The strength of EHR is that weight (and often height) are captured as structured vitals,\n  enabling true mg/kg or mg/m2. 
The failure mode is *recency and capture*: the nearest weight to the\n  fill/administration may be months old (children grow fast — a 6-month-old weight is invalid for a\n  toddler), weights are entered in mixed units (kg vs lb) with transcription errors and impossible\n  outliers, and inpatient \"stat weight\" vs estimated weight differ. Define a maximum weight-recency\n  window (e.g., within 90 days, tighter for infants), carry-forward rules, and biologically-plausible\n  bounds before normalizing.\n- **Registry:** Disease registries often record protocol dose and a baseline weight/BSA and may carry\n  adjudicated dose modifications (common in pediatric oncology), making them strong for the numerator;\n  they are weak for *complete* longitudinal weight and for off-protocol/inter-current dosing. Link to\n  claims or EHR for full fill history and updated anthropometrics.\n- **Linked claims-EHR (-registry):** The practical substrate — claims fills supply the dispensed mg\n  and continuity of capture, EHR supplies time-updated weight/height for normalization. Linkage adds\n  selection (only the linkable subset) and a date-reconciliation problem: the weight observation must\n  be matched to the fill within a recency window, and fill/order/service dates must be aligned before\n  the mg/kg value is assigned.\n\n**Worked claims-plus-EHR example.** Question: characterize weight-normalized montelukast exposure in\nchildren 2-14 with asthma, to support a dose-appropriateness analysis. Inputs: pharmacy fills (`rx`:\nperson_id, fill_date, ndc, drug_strength_mg, dispensed_qty, days_supply) and linked EHR vitals\n(`weight_obs`: person_id, obs_date, weight_kg). (1) Eligibility: 365 days of continuous medical +\npharmacy enrollment (FFS-observable) before the index fill, so dispensing history is real, not\nunobserved. 
(2) Derive daily mg per fill: `drug_strength_mg` x `dispensed_qty` / `days_supply`\n(e.g., a 4 mg chewable, qty 30, days_supply 30 -> 4 mg/day); drop fills with missing or zero\n`days_supply` or implausible mg/day (> labeled max). (3) Attach weight: for each fill, take the\nnearest `weight_obs` within 90 days (tighten to 30 days for ages < 2); flag and quantify the\nmissing-weight fraction, and if it exceeds ~20% and differs by age, do not silently age-band.\n(4) Normalize: `dose_mg_per_kg_day = mg_per_day / weight_kg`. (5) Classify against the age-banded\nlabel (4 mg for 2-5y, 5 mg for 6-14y) to derive a guideline-concordant-dose indicator, the\npre-specified estimand. (6) For follow-up exposure that spans growth, re-pull weight at each fill\nrather than fixing baseline weight (otherwise the metric drifts as the child grows; see\ntime-updated exposures). (7) Sensitivity: vary the weight-recency window, compare nearest-weight vs\ninterpolated-weight, and report the conclusion's stability to the missing-weight imputation rule.",
    "primary_category": "Exposure_Definition",
    "tags": [
      "exposure_definition",
      "pediatric",
      "dose-normalization",
      "weight-based-dosing",
      "body-surface-area",
      "allometric-scaling",
      "special-populations",
      "pharmacoepidemiology"
    ],
    "applies_to_study_types": [
      "claims_analysis",
      "ehr_study",
      "registry_linkage",
      "comparative_effectiveness",
      "dose_response"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1056/NEJMra035092",
        "url": "https://doi.org/10.1056/NEJMra035092",
        "citation_text": "Kearns GL, Abdel-Rahman SM, Alander SW, Blowey DL, Leeder JS, Kauffman RE. Developmental pharmacology - drug disposition, action, and therapy in infants and children. New England Journal of Medicine. 2003;349(12):1157-1167.",
        "year": 2003,
        "authors_short": "Kearns et al.",
        "notes": "Canonical statement of why drug disposition (absorption, distribution, metabolism, clearance) changes with developmental stage, establishing that body-size and maturation scaling - not a fixed adult dose - is required to make pediatric exposure comparable."
      },
      {
        "role": "explain",
        "doi": "10.1146/annurev.pharmtox.48.113006.094708",
        "url": "https://doi.org/10.1146/annurev.pharmtox.48.113006.094708",
        "citation_text": "Anderson BJ, Holford NHG. Mechanism-based concepts of size and maturity in pharmacokinetics. Annual Review of Pharmacology and Toxicology. 2008;48:303-332.",
        "year": 2008,
        "authors_short": "Anderson & Holford",
        "notes": "Foundational treatment of allometric (weight^0.75) size scaling plus maturation models, clarifying why linear mg/kg misranks clearance and how to combine size and age in a pediatric exposure metric."
      },
      {
        "role": "demonstrate",
        "doi": "10.1056/NEJM198710223171717",
        "url": "https://doi.org/10.1056/NEJM198710223171717",
        "citation_text": "Mosteller RD. Simplified calculation of body-surface area. New England Journal of Medicine. 1987;317(17):1098.",
        "year": 1987,
        "authors_short": "Mosteller",
        "notes": "The Mosteller BSA formula (BSA = sqrt(height_cm x weight_kg / 3600)) widely used to compute mg/m2 dosing; the practical basis for body-surface-area normalization where height is available."
      }
    ],
    "plain_language_summary": "When doctors prescribe a drug to children, the same total number of milligrams means very different things for a 10 kg toddler versus a 38 kg pre-teen. Weight-normalized dosing converts the absolute amount prescribed into milligrams per kilogram of body weight (mg/kg/day) so that researchers can fairly compare how much drug each child actually received relative to their size. Without this step, a study comparing drug exposure across children of different ages and weights is comparing apples to oranges.",
    "key_terms": [
      {
        "term": "mg/kg/day",
        "definition": "Milligrams of drug per kilogram of body weight per day, the standard unit for expressing how much drug a child receives relative to their size."
      },
      {
        "term": "absolute dose",
        "definition": "The total milligrams of a drug dispensed or prescribed, without adjusting for how big or small the patient is."
      },
      {
        "term": "weight normalization",
        "definition": "Dividing the absolute dose by the patient's body weight so that exposure can be compared fairly across patients of different sizes."
      },
      {
        "term": "drug strength",
        "definition": "The amount of active drug in each unit dispensed, for example 5 mg per tablet or 2 mg per mL of a liquid suspension."
      }
    ],
    "worked_example": {
      "scenario": "A researcher is studying montelukast use in children ages 2 to 10 with asthma. The pharmacy records show the total milligrams dispensed per fill. The researcher wants to know whether each child received a dose appropriate for their body size, so she divides each child's daily dose by their weight in kilograms recorded in the clinic notes. The table below shows four children from the study, each receiving the same absolute daily dose of 5 mg, but with very different body weights.",
      "dataset": {
        "caption": "Four children each prescribed 5 mg montelukast per day; weights recorded at the clinic visit nearest to the fill date.",
        "columns": [
          "child_id",
          "age_years",
          "weight_kg",
          "absolute_dose_mg_per_day",
          "dose_mg_per_kg_per_day"
        ],
        "rows": [
          [
            "C001",
            2,
            12,
            5,
            0.42
          ],
          [
            "C002",
            4,
            17,
            5,
            0.29
          ],
          [
            "C003",
            7,
            25,
            5,
            0.2
          ],
          [
            "C004",
            10,
            38,
            5,
            0.13
          ]
        ]
      },
      "steps": [
        "All four children were prescribed exactly 5 mg of montelukast per day, so the absolute doses are identical.",
        "To compute the weight-normalized dose for C001: divide 5 mg by 12 kg = 0.417 mg/kg/day, rounded to 0.42.",
        "For C002: 5 mg divided by 17 kg = 0.294 mg/kg/day, rounded to 0.29.",
        "For C003: 5 mg divided by 25 kg = 0.200 mg/kg/day.",
        "For C004: 5 mg divided by 38 kg = 0.132 mg/kg/day, rounded to 0.13.",
        "C001 (the smallest child) receives more than three times the weight-relative exposure of C004 (the largest child): 0.42 vs 0.13 mg/kg/day.",
        "Without weight normalization, a researcher looking only at the 5 mg column would conclude all four children received the same exposure, which is misleading when comparing outcomes across the group."
      ],
      "result": "Weight-normalized doses range from 0.13 mg/kg/day (C004, 38 kg) to 0.42 mg/kg/day (C001, 12 kg), a more-than-3-fold difference across children who all received the same absolute 5 mg/day dose. The pediatric label recommends 0.20 mg/kg/day for this age range, so C001 is above the guideline exposure and C004 is below it, a distinction that is invisible without weight normalization."
    },
    "prerequisites": [
      "special-populations-rwe-methods",
      "exposure-episode-construction-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Linear weight-based (mg/kg/day)",
        "description": "Total daily milligrams divided by body weight in kg. Matches most product labels and weight-banded dosing tables; assumes clearance is proportional to weight.",
        "edge_cases": [
          "Biased for internal exposure at size extremes (over-dose small children, under-dose large ones) because clearance scales closer to weight^0.75.",
          "Large adolescents can exceed the labeled adult dose unless an adult cap is applied.",
          "Total body weight overstates exposure in obesity for hydrophilic drugs (consider ideal/lean body weight)."
        ],
        "data_source_notes": "claims: requires linked EHR weight - not a billed field; EHR: use the nearest plausible weight within a recency window, tighter for infants."
      },
      {
        "name": "Allometric (weight^0.75) with maturation function",
        "description": "Clearance scaled by weight^0.75, optionally multiplied by a post-menstrual-age maturation function for neonates/infants. The pharmacometric standard for PK-anchored pediatric exposure.",
        "edge_cases": [
          "Requires a defensible exponent and, in the first 1-2 years, an explicit ontogeny model; size scaling alone is wrong in neonates.",
          "Harder to communicate to clinicians and reviewers than mg/kg."
        ],
        "data_source_notes": "EHR/registry: needs reliable weight plus gestational/post-menstrual age for neonatal work; rarely feasible in claims alone."
      },
      {
        "name": "Body-surface-area (mg/m2/day)",
        "description": "Daily dose per square meter, BSA computed from height and weight (Mosteller, Du Bois, or Haycock). Convention for many cytotoxic and some biologic agents.",
        "edge_cases": [
          "Requires height, which claims lack and EHR captures inconsistently.",
          "BSA formula choice shifts the value a few percent; pre-specify one formula."
        ],
        "data_source_notes": "registry: oncology registries often carry protocol BSA and dose modifications; EHR: derive from paired height/weight vitals nearest the fill."
      },
      {
        "name": "Age-banded fixed dosing (claims fallback)",
        "description": "Assigns the labeled fixed dose for the child's age band when weight is unavailable, mirroring OTC/primary-care prescribing and what claims can support.",
        "edge_cases": [
          "Coarse within-band misclassification (same-age children differ ~2-fold in weight).",
          "Should be used as a fallback, not a default, and the resulting misclassification reported."
        ],
        "data_source_notes": "claims: the only option without linked anthropometrics; quantify and report the share of person-time relying on the fallback."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Total mg/day with no body-size normalization",
        "pros_of_this": "Makes exposure comparable across children of different size and developmental stage; required for any dose-response, high-vs-standard, or pooled pediatric analysis.",
        "cons_of_this": "Requires weight (and sometimes height/age), introducing dependence on EHR/registry linkage and a missing-anthropometric handling rule.",
        "when_to_prefer": "Any analysis where exposure intensity - not merely ever/never exposure - is the variable of interest."
      },
      {
        "compared_to": "Allometric (weight^0.75) scaling",
        "pros_of_this": "Linear mg/kg is transparent and matches labels/guideline dosing tables, ideal for utilization and dose-appropriateness studies.",
        "cons_of_this": "Linear scaling misranks clearance at size extremes; allometry tracks clearance better for PK-anchored dose-response work.",
        "when_to_prefer": "Reproducing label-/guideline-concordant dosing; switch to allometric for pharmacology-anchored dose-response across a wide age range."
      },
      {
        "compared_to": "Body-surface-area (mg/m2) normalization",
        "pros_of_this": "Weight-based metrics need only weight, available more often than height in EHR.",
        "cons_of_this": "For drugs genuinely dosed per m2 (many cytotoxics), weight-based metrics misrepresent the clinical dosing basis.",
        "when_to_prefer": "Use the descriptor the drug is actually dosed by in label/guidelines."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Claims contain no weight or height. Derive total mg/day from NDC strength x dispensed_qty / days_supply; pediatric suspensions/chewables make strength a concentration and days_supply unreliable. Without linked anthropometrics, only age-banded dosing is possible - report the share of person-time on that fallback. Restrict to FFS-observable person-time with both medical and pharmacy benefit.",
      "ehr": "Weight/height are structured vitals enabling true mg/kg or mg/m2. Enforce a maximum weight-recency window (tighter for infants, who grow fast), reconcile kg/lb unit errors, and apply biologically-plausible bounds before normalizing.",
      "registry": "Disease registries (esp. pediatric oncology) often carry protocol dose, baseline BSA/weight, and adjudicated dose modifications - strong numerator, weak longitudinal weight and off-protocol dosing. Link out for complete fills and updated anthropometrics.",
      "linked": "Claims supply dispensed mg and capture continuity; EHR supplies time-updated weight/height. Match the weight observation to the fill within a recency window and reconcile fill/order/service dates before assigning the normalized value; account for linkage selection."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\nimport numpy as np\n\nWEIGHT_WINDOW_DAYS = 90   # tighten for infants; weights staler than this are not used\nMAX_MG_PER_KG_DAY  = 20   # drug-specific plausibility cap; replace per protocol\n\ndef normalize_pediatric_dose(rx: pd.DataFrame, weight_obs: pd.DataFrame) -> pd.DataFrame:\n    rx = rx.copy()\n    # Total milligrams per day implied by the dispensed fill.\n    rx[\"mg_per_day\"] = rx[\"drug_strength_mg\"] * rx[\"dispensed_qty\"] / rx[\"days_supply\"].replace(0, np.nan)\n    rx = rx.dropna(subset=[\"mg_per_day\"])\n\n    # Nearest weight to each fill within the recency window (as-of join on absolute date gap).\n    rx = rx.sort_values(\"fill_date\")\n    w = weight_obs.sort_values(\"obs_date\").rename(columns={\"obs_date\": \"weight_date\"})\n    merged = pd.merge_asof(\n        rx, w, by=\"person_id\",\n        left_on=\"fill_date\", right_on=\"weight_date\",\n        direction=\"nearest\", tolerance=pd.Timedelta(days=WEIGHT_WINDOW_DAYS),\n    )\n\n    # Normalize; rows with no in-window weight are flagged, not silently dropped.\n    merged[\"dose_mg_per_kg_day\"] = merged[\"mg_per_day\"] / merged[\"weight_kg\"]\n    merged[\"weight_missing\"] = merged[\"weight_kg\"].isna()\n\n    # Plausibility filter on the normalized value (biologically implausible -> review/drop).\n    bad = merged[\"dose_mg_per_kg_day\"] > MAX_MG_PER_KG_DAY\n    merged.loc[bad, \"dose_mg_per_kg_day\"] = np.nan\n\n    return merged[[\"person_id\", \"fill_date\", \"mg_per_day\", \"weight_kg\",\n                   \"weight_missing\", \"dose_mg_per_kg_day\"]]",
        "description": "Weight-normalized daily dose from claims fills plus linked EHR weights. Required inputs\n(cleaned, de-duplicated):\n  rx         : person_id, fill_date (datetime), ndc, drug_strength_mg (mg per tablet or per mL),\n               dispensed_qty (tablets or mL), days_supply (int)\n  weight_obs : person_id, obs_date (datetime), weight_kg (float)\nReturns one row per fill with mg/day, the matched weight, and mg/kg/day. Drop fills with\nzero/missing days_supply or implausible mg/day upstream; quantify the missing-weight rate\nbefore any imputation. WEIGHT_WINDOW is the maximum allowed gap between weight and fill.",
        "dependencies": [
          "pandas",
          "numpy"
        ],
        "source_citations": [
          "anderson-2008"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\n\nWEIGHT_WINDOW_DAYS <- 90L    # tighten for infants\nMAX_MG_PER_KG_DAY  <- 20     # drug-specific plausibility cap\n\nnormalize_pediatric_dose <- function(rx, weight_obs) {\n  setDT(rx); setDT(weight_obs)\n\n  # Total mg/day implied by the dispensed fill.\n  rx[, mg_per_day := drug_strength_mg * dispensed_qty / fifelse(days_supply == 0L, NA_real_, as.numeric(days_supply))]\n  rx <- rx[!is.na(mg_per_day)]\n\n  # Rolling nearest weight to each fill within the recency window.\n  w <- copy(weight_obs)[, weight_date := obs_date]\n  setkey(rx, person_id, fill_date)\n  setkey(w,  person_id, weight_date)\n  merged <- w[rx, on = c(\"person_id\", weight_date = \"fill_date\"),\n              roll = \"nearest\", rollends = c(TRUE, TRUE)]\n  merged[, fill_date := weight_date]\n  merged[abs(as.numeric(weight_date - i.weight_date)) > WEIGHT_WINDOW_DAYS, weight_kg := NA_real_]\n\n  merged[, weight_missing := is.na(weight_kg)]\n  merged[, dose_mg_per_kg_day := mg_per_day / weight_kg]\n  merged[dose_mg_per_kg_day > MAX_MG_PER_KG_DAY, dose_mg_per_kg_day := NA_real_]\n\n  merged[, .(person_id, fill_date, mg_per_day, weight_kg, weight_missing, dose_mg_per_kg_day)]\n}",
        "description": "Weight-normalized daily dose with data.table rolling join. Inputs mirror the Python version:\n  rx         : person_id, fill_date (Date), ndc, drug_strength_mg, dispensed_qty, days_supply\n  weight_obs : person_id, obs_date (Date), weight_kg\nReturns one row per fill with mg/day, matched weight, and mg/kg/day; weight_missing flags\nfills with no weight inside WEIGHT_WINDOW_DAYS.",
        "dependencies": [
          "data.table"
        ],
        "source_citations": [
          "anderson-2008"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "%let weight_window = 90;   /* max days between weight and fill; tighten for infants */\n%let max_mgkg      = 20;   /* drug-specific plausibility cap */\n\n/* Total mg/day implied by each dispensed fill. */\ndata rx_mg;\n  set work.rx;\n  if days_supply > 0 then mg_per_day = drug_strength_mg * dispensed_qty / days_supply;\n  if mg_per_day ne .;\nrun;\n\n/* Candidate fill-weight pairs inside the recency window. */\nproc sql;\n  create table pairs as\n  select r.person_id, r.fill_date, r.mg_per_day,\n         w.weight_kg, w.obs_date,\n         abs(r.fill_date - w.obs_date) as gap\n  from rx_mg r\n  left join work.weight_obs w\n    on r.person_id = w.person_id\n   and abs(r.fill_date - w.obs_date) <= &weight_window;\nquit;\n\n/* Keep the nearest weight per fill (smallest absolute date gap). */\nproc sort data=pairs; by person_id fill_date gap; run;\ndata dose_norm;\n  set pairs;\n  by person_id fill_date;\n  if first.fill_date;                         /* nearest in-window weight, or missing */\n  weight_missing = (weight_kg = .);\n  if weight_kg > 0 then dose_mg_per_kg_day = mg_per_day / weight_kg;\n  if dose_mg_per_kg_day > &max_mgkg then dose_mg_per_kg_day = .;\n  keep person_id fill_date mg_per_day weight_kg weight_missing dose_mg_per_kg_day;\nrun;",
        "description": "Weight-normalized daily dose in SAS via PROC SQL + data step. Required input datasets\n(post data-management):\n  work.rx         : person_id, fill_date, ndc, drug_strength_mg, dispensed_qty, days_supply\n  work.weight_obs : person_id, obs_date, weight_kg\nProduces work.dose_norm with one row per fill: mg_per_day, matched weight_kg within the\nrecency window, weight_missing flag, and dose_mg_per_kg_day. Tighten &weight_window for infants\nand set &max_mgkg to a drug-specific plausibility cap.",
        "dependencies": [],
        "source_citations": [
          "anderson-2008"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Q[Pediatric exposure variable needed] --> D{What is the drug dosed by<br/>in label / guidelines?}\n  D -->|per m2 e.g. cytotoxics| BSA[BSA mg per m2<br/>needs height + weight]\n  D -->|per kg, wide age range,<br/>PK-anchored| ALLO[Allometric weight^0.75<br/>+ maturation if infant]\n  D -->|per kg, label/guideline<br/>concordance| LIN[Linear mg per kg per day]\n  BSA --> NEED{Anthropometrics available?}\n  ALLO --> NEED\n  LIN --> NEED\n  NEED -->|EHR/registry weight in window| OK[Compute normalized dose<br/>re-pull weight as child grows]\n  NEED -->|claims only, no weight| FB[Fallback: age-banded fixed dose<br/>report misclassification]\n  OK --> NEO{Neonate / infant under 1-2y?}\n  NEO -->|yes| MAT[Add post-menstrual-age maturation<br/>size scaling alone is unsafe]\n  NEO -->|no| DONE[Normalized exposure variable]\n  MAT --> DONE\n  FB --> DONE",
        "caption": "Decision logic for pediatric dose normalization - choose the size descriptor the drug is actually dosed by, check whether anthropometrics support it, fall back to age bands only when weight is missing, and add a maturation model for neonates/infants.",
        "alt_text": "Decision tree from the dosing basis (BSA, allometric, or linear mg/kg) through anthropometric availability, an age-band fallback for claims-only data, and a maturation branch for neonates and infants.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  subgraph CLAIMS[Claims fills]\n    NDC[NDC -> strength + formulation] --> MG[mg/day =<br/>strength x qty / days_supply]\n    DS[days_supply, dispensed_qty] --> MG\n  end\n  subgraph EHR[EHR vitals]\n    WT[weight_kg, obs_date] --> CLEAN[Unit + outlier QC<br/>kg/lb, plausible bounds]\n  end\n  MG --> MATCH{Nearest weight to fill<br/>within recency window?}\n  CLEAN --> MATCH\n  MATCH -->|yes| NORM[dose_mg_per_kg_day =<br/>mg_per_day / weight_kg]\n  MATCH -->|no| FLAG[weight_missing = TRUE<br/>quantify, do not silently age-band]\n  NORM --> EST[Pre-specified estimand:<br/>continuous dose, high-vs-low,<br/>or guideline-concordant indicator]",
        "caption": "Linked claims-plus-EHR data flow. Claims supply the dispensed mg/day numerator, EHR supplies the time-updated weight denominator, and the two are joined within a recency window before the normalized exposure variable is assigned.",
        "alt_text": "Data-flow diagram showing claims-derived mg/day and QC'd EHR weights joined on a nearest-weight-within-window rule to produce a normalized mg/kg/day exposure or a flagged missing-weight record.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "is_variant_of",
        "target_slug": "special-populations-rwe-methods",
        "notes": "Pediatric dose normalization is the exposure-definition member of the special-populations methods family, addressing how body size and maturation reshape exposure measurement."
      },
      {
        "relation_type": "used_with",
        "target_slug": "time-updated-exposures-cumulative-dose-rwe",
        "notes": "Children grow during follow-up, so weight must be re-pulled at each fill rather than fixed at baseline; the normalized dose is a time-updated exposure feeding cumulative-dose metrics."
      },
      {
        "relation_type": "used_with",
        "target_slug": "exposure-episode-construction-rwe",
        "notes": "Days_supply stitching and episode boundaries determine the mg/day denominator before normalization; suspension/chewable formulations complicate both."
      },
      {
        "relation_type": "used_with",
        "target_slug": "pediatric-growth-development-endpoints-rwe",
        "notes": "The same anthropometric data (weight, height) that normalize exposure also define growth/development endpoints, and the two must use consistent measurement windows."
      },
      {
        "relation_type": "see_also",
        "target_slug": "omop-drug-exposure-drug-era-rwe",
        "notes": "OMOP drug_exposure records strength and quantity but not patient weight; normalization requires joining drug_era-derived mg/day to measurement-table anthropometrics."
      }
    ],
    "aliases": [
      "weight-based dosing",
      "BSA-based dosing",
      "allometric dose normalization",
      "pediatric dose standardization",
      "mg/kg dose normalization"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "pediatric-growth-development-endpoints-rwe",
    "name": "Pediatric Growth and Development Endpoints in RWE",
    "short_definition": "The operational specification of pediatric growth (anthropometric z-scores/percentiles against an age- and sex-specific reference) and neurodevelopmental endpoints in real-world data, including reference-curve choice, age handling (chronological vs corrected), and the continuous-vs-threshold and cross-sectional-vs-conditional-growth estimand decisions.",
    "long_description": "**Pediatric growth and development endpoints** are not raw measurements; they are *derived* outcomes whose\nmeaning depends entirely on an externally specified reference. A 9 kg weight is uninterpretable until it is\nconverted, conditional on exact age and sex, to a **z-score (SD score)** or **percentile** against a growth\nstandard. The standard transformation is the **LMS method** (Cole & Green 1992): for the L (Box-Cox power),\nM (median), and S (coefficient of variation) at the child's age/sex, z = ((measure/M)^L - 1) / (L*S) when\nL != 0, and z = ln(measure/M)/S when L = 0. From this you obtain weight-for-age (WAZ), length/height-for-age\n(HAZ/LAZ), weight-for-length (WHZ), and BMI-for-age (BAZ). Developmental endpoints are a *separate* family:\npass/fail on a validated screen (ASQ, M-CHAT, PEDS, Bayley), age at milestone acquisition, or a standardized\nscale score. Treating \"pediatric outcomes\" as one undifferentiated bucket is the first and most common error.\n\n**Core conceptual distinction.** Three orthogonal choices define the endpoint and must each be pre-specified\nin the estimand. (1) *Which reference, by age and prematurity*: WHO Child Growth Standards (the prescriptive\n0-5y standard from breastfed reference populations; WHO MGRS 2006) vs CDC 2000 growth charts (a descriptive\nUS reference, conventionally used >=2y; Ogden 2002) vs **Fenton/INTERGROWTH-21st preterm curves** for the\nneonatal/early-infancy window vs **condition-specific curves** (Down syndrome, achondroplasia, cerebral\npalsy) where the general reference is clinically wrong. Mixing references across the age range produces\nartifactual discontinuities in z at the splice age. (2) *Chronological vs corrected age*: for infants born\npreterm, z-scores must use **gestationally corrected age** (chronological age minus weeks of prematurity)\nthrough ~24 months, or growth failure is systematically overstated. 
(3) *Continuous z vs threshold vs\nconditional growth*: the estimand can be a continuous z-score, a binary threshold (stunting HAZ < -2, wasting\nWHZ < -2, overweight BAZ >= 85th, obesity >= 95th percentile), the **age at acquisition** of a state, or a\n**conditional/change-in-z** quantity (growth velocity, SITAR/mixed-model trajectory) that asks whether a\nchild tracked along their own centile. Single cross-sectional z answers \"how big now\"; conditional growth\nanswers \"did this exposure bend the trajectory\" — they are different causal questions and usually need\ndifferent models (a single GLM vs a mixed/MMRM longitudinal model; see mixed-effects-models-longitudinal-rwe).\n\n**Pros, cons, and trade-offs.**\n- **Continuous z-score vs binary threshold (stunting/wasting/obesity):** Continuous z retains power, avoids\n  arbitrary cutpoints, and is the default for comparative analyses; a threshold maps to a clinical label and\n  a guideline action but discards information and is unstable near the cut. **Prefer continuous z** for the\n  primary effect estimate and report the dichotomized version as a clinically interpretable secondary.\n- **Single cross-sectional z vs conditional/repeated-measures growth:** A one-timepoint z is cheap and easy\n  but cannot separate \"small baseline\" from \"stopped growing.\" A conditional-growth or mixed-model approach\n  (change in z, velocity, SITAR) directly targets the trajectory effect and handles the within-child\n  correlation, at the cost of needing >=2-3 well-timed measurements and more modeling judgment. 
**Prefer\n  conditional growth** when the exposure plausibly acts over time (e.g., chronic therapy, nutrition).\n- **Growth proxy codes vs measured anthropometry:** ICD-10 codes (R62.50/.51 failure to thrive, R63.x,\n  E66.x obesity, Z68.5x BMI-percentile category) are cheap and present in pure claims but capture only the\n  clinician-flagged tail and miss the continuous distribution; measured height/weight from EHR vitals gives\n  the true z but requires EHR or linkage. **Prefer measured anthropometry**; use codes only as a fallback\n  outcome with explicit under-capture caveats and a validation substudy (claims-outcome-algorithm-ppv-sensitivity-rwe).\n- **Standardized developmental scale vs milestone/screen code:** A Bayley/ASQ score is a graded, validated\n  endpoint but is rarely in claims and irregularly in EHR; a developmental-delay diagnosis (F88/F89, F80-F82)\n  or early-intervention referral is administratively available but lags acquisition and reflects access to\n  screening, not the underlying milestone. **Prefer the scale** when available; treat the code as an\n  access-confounded proxy.\n\n**When to use.** Any pediatric comparative-effectiveness, safety, nutritional, or HEOR study where growth or\ndevelopment is the outcome of interest (e.g., inhaled-corticosteroid effect on attained height, growth-hormone\neffectiveness, nutritional intervention, neurodevelopment after perinatal exposure); regulatory pediatric\nstudy plans and post-marketing requirements where growth/development is a mandated long-term endpoint;\nregistry programs (disease-registry, product-registry) designed to capture serial anthropometry. 
Use measured\nanthropometry from EHR or linked data, the age- and prematurity-appropriate reference, exact age in days, and\na longitudinal model when the question is about trajectory.\n\n**When NOT to use — and when it is actively misleading or dangerous.**\n- **Pure claims with no measured height/weight and no BMI-percentile Z-codes.** You cannot derive a z-score\n  from claims alone. Reporting \"no growth effect\" from a cohort where the outcome was never measured is a\n  null produced by missingness, not biology.\n- **Mixing references or ignoring corrected age.** Splicing WHO and CDC at 2 years without harmonizing, or\n  using chronological age for ex-preterm infants, manufactures z discontinuities and inflates apparent\n  growth failure in the exposed arm if prematurity is unbalanced.\n- **General reference for a condition that has its own curve.** A child with Down syndrome plotted on WHO/CDC\n  will look \"stunted\" by construction; using the general standard as the endpoint confounds the genetic\n  condition with the exposure effect.\n- **Dichotomizing a continuous z at a guideline cut as the primary endpoint** when power is limited — you\n  discard most of the signal and make the result hostage to a few children straddling -2 SD.\n- **Conditional growth from irregular, exposure-driven measurement.** If sicker (exposed) children are\n  weighed more often, the measurement schedule is informative; a naive change-in-z is biased. Model the\n  visit process or use a design (e.g., scheduled well-child visits) that decouples measurement from outcome.\n\n**Data-source operational depth.**\n- **Claims (FFS or commercial):** No anthropometric values. 
The only growth signal is proxy coding —\n  R62.50/.51 (failure to thrive / lack of expected normal physiological development), R63.4 (abnormal weight\n  loss), R63.3 (feeding difficulties), E66.x (obesity), and the **BMI-percentile-for-age Z-codes Z68.51-Z68.54**\n  captured at well-child visits — plus developmental codes (F80-F82, F88, F89) and early-intervention/therapy\n  procedure codes. Failure modes: these capture only the clinician-flagged tail (low sensitivity for the\n  continuous distribution), Z68.5x is recorded inconsistently by payer/EHR-template, and **Medicare Advantage\n  is irrelevant in peds — the analog is Medicaid managed care, where encounter completeness varies by state\n  and carve-outs (behavioral/developmental services are often carved out, dropping milestone data).** Workaround:\n  treat claims growth/development outcomes as algorithms requiring PPV/sensitivity validation, not gold standards.\n- **EHR:** The native home of growth endpoints — height/weight/head-circumference live in structured vitals,\n  BMI percentile is often pre-computed (verify the reference and age basis the EHR used; do not trust an\n  inherited percentile blindly). 
Failure modes: unit and decimal errors (lb vs kg, cm vs in) produce extreme\n  z outliers — apply WHO/CDC biologically-implausible-value flags (e.g., |WAZ|>5, |HAZ|>5) before analysis;\n  visit-driven capture means measurement timing is irregular and tied to illness; gestational age (for\n  correction) is frequently missing in non-neonatal EHRs; developmental screens (ASQ/M-CHAT) live in flowsheets\n  or notes, not always discrete fields, so NLP or flowsheet mapping is often required.\n- **Registry:** Disease and product registries (disease-registry, product-registry, pregnancy-registry) can\n  mandate serial, protocolized anthropometry and standardized developmental assessment — the strongest source\n  for growth trajectories — but suffer enrollment selection, differential loss to follow-up by severity, and\n  incomplete capture of care delivered outside the registry network.\n- **Linked claims-EHR(-birth/vital records):** The ideal substrate: claims for cohort entry, exposure\n  (pharmacy NDC + days_supply), continuous enrollment, and censoring; EHR vitals for the measured z-score\n  outcome; birth/vital records for **gestational age** to enable corrected age. Linkage introduces selection\n  (only the linkable subset), and birth-record gestational age must be reconciled with EHR-recorded GA before\n  correction.\n\n**Worked linked claims+EHR example.** Question: does initiation of medium/high-dose inhaled corticosteroids\n(ICS) vs leukotriene-receptor antagonists (LTRA) slow attained linear growth (HAZ) over 12 months in children\naged 4-11 with persistent asthma, in a Medicaid + commercial claims database linked to EHR vitals. (1)\nEligibility: age 4-11 at index, >=2 asthma diagnoses (ICD-10 J45.x), and 365 days of continuous medical +\npharmacy enrollment before the first qualifying fill (so the new-user washout is observed, not missing). 
(2)\nExposure / time zero: first fill of ICS or LTRA (NDC + fill_date + days_supply); arm assigned from that NDC;\nnew-user restriction = no fill of either class in the 365-day lookback. (3) Outcome derivation: pull all EHR\nheight measurements; compute **exact age in days** at each height as (measure_date - birth_date); convert to\n**HAZ via the CDC 2000 LMS coefficients** for sex and age in months (children are >=2y, so CDC is the\nconventional reference); apply implausible-value exclusion |HAZ|>5. (4) Endpoint = change in HAZ from the\nmeasurement closest to index_date (within +/-90 days) to the measurement closest to 12 months\npost-index (within +/-90 days) — a *conditional growth* endpoint, not a single cross-section. (5) Confounding:\nbaseline covariates measured only in the [index_date-365, index_date] window — asthma severity proxies\n(rescue-inhaler fills, oral-steroid bursts, ED/hospital asthma claims), baseline HAZ, age, sex, season of\nindex — feeding a propensity model balancing the two arms. (6) Analysis: a linear mixed model of HAZ over\ntime with random child intercept/slope (mixed-effects-models-longitudinal-rwe) handles the within-child\ncorrelation and the irregular, +/-90-day measurement timing better than a crude two-point difference;\ncensor at disenrollment, loss of EHR contact, and end of data. (7) Sensitivity: corrected vs chronological\nage has no effect here (all >=2y, term assumed) but would matter in an infant cohort; vary the measurement\nwindows, the implausible-value flags, and add a negative-control outcome (negative-control-outcomes-rwe) to\ndetect residual confounding by asthma severity. 
A naive single end-of-year HAZ (ignoring baseline height)\nwould confound pre-existing short stature with an ICS effect - exactly the trap the conditional design avoids.\n\nRegulatory note: pediatric study plans (FDA PSP/PREA, EMA PIP) frequently mandate growth and neurodevelopment\nas long-term safety endpoints; pre-specifying the reference, corrected-age rule, and continuous-vs-threshold\nestimand in protocol language is a regulatory expectation, not a nicety.\n\n**Interpreting the output**. The four-child example produces a z-score profile at a single time point.\nChild C-101 (female, 4 years, height 98.5 cm) has HAZ = −1.2, placing her at the 12th percentile on the\nCDC 2000 reference used in that example - below average but not in the range that defines stunting. Child C-103 (female, 8 years,\nheight 116.2 cm) has HAZ = −2.1, at the 2nd percentile, meeting the conventional WHO-defined threshold for stunting (HAZ < −2).\nChildren C-102 and C-104 are above the reference median.\n\nFormal interpretation: HAZ = −1.2 means Child C-101's height is 1.2 standard deviations below the median\nheight of girls of the same age in the CDC 2000 reference population. This\nz-score is a position statement - it describes where the child sits relative to a descriptive US national\nreference, not relative to other children in the study\ncohort or the study's target disease population. A treatment effect on growth is estimated from the\nbetween-arm difference in mean within-child change in HAZ from baseline to follow-up, not from a cross-sectional comparison of\nz-scores to the reference at one point in time.\n\nPractical interpretation: the reference population is a deliberate, transparent assumption. Children\nwith a treated chronic condition (e.g., severe asthma) may systematically differ from the CDC reference\neven at baseline, so reporting baseline HAZ alongside the change estimate is essential. A shift in HAZ\nfrom −1.5 to −1.2 (improvement) is clinically interpretable; crossing the −2.0 threshold toward or away\nfrom stunting changes the child's clinical classification. Always report the reference used and justify it for the study\npopulation.",
    "primary_category": "Outcome_Measure",
    "tags": [
      "outcome_measure",
      "pediatric",
      "anthropometry",
      "growth-z-score",
      "neurodevelopment",
      "lms-method",
      "special-populations-methods",
      "conditional-growth"
    ],
    "applies_to_study_types": [
      "claims_analysis",
      "ehr_study",
      "registry_linkage",
      "comparative_effectiveness",
      "target_trial_emulation"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1111/j.1651-2227.2006.tb02378.x",
        "url": "https://doi.org/10.1111/j.1651-2227.2006.tb02378.x",
        "citation_text": "WHO Multicentre Growth Reference Study Group. WHO Child Growth Standards based on length/height, weight and age. Acta Paediatrica. 2006;95(S450):76-85.",
        "year": 2006,
        "authors_short": "WHO Multicentre Growth Reference Study Group",
        "notes": "The prescriptive 0-5y growth standard and the canonical reference basis for deriving anthropometric z-scores (WAZ/HAZ/WHZ/BAZ) as pediatric endpoints."
      },
      {
        "role": "explain",
        "doi": "10.1002/sim.4780111005",
        "url": "https://doi.org/10.1002/sim.4780111005",
        "citation_text": "Cole TJ, Green PJ. Smoothing reference centile curves: the LMS method and penalized likelihood. Statistics in Medicine. 1992;11(10):1305-1319.",
        "year": 1992,
        "authors_short": "Cole & Green",
        "notes": "The LMS (lambda-mu-sigma) method that converts raw anthropometry to age- and sex-specific z-scores and percentiles; the statistical engine behind WHO and CDC reference charts."
      },
      {
        "role": "explain",
        "doi": "10.1542/peds.109.1.45",
        "url": "https://doi.org/10.1542/peds.109.1.45",
        "citation_text": "Ogden CL, Kuczmarski RJ, Flegal KM, et al. Centers for Disease Control and Prevention 2000 growth charts for the United States: improvements to the 1977 National Center for Health Statistics version. Pediatrics. 2002;109(1):45-60.",
        "year": 2002,
        "authors_short": "Ogden et al.",
        "notes": "The CDC 2000 US growth reference conventionally applied from age 2 years; documents the LMS parameters and the WHO-vs-CDC splice considerations for US pediatric RWE."
      },
      {
        "role": "demonstrate",
        "doi": "10.1371/journal.pmed.1001885",
        "url": "https://doi.org/10.1371/journal.pmed.1001885",
        "citation_text": "Benchimol EI, Smeeth L, Guttmann A, et al. The REporting of studies Conducted using Observational Routinely-collected health Data (RECORD) statement. PLoS Medicine. 2015;12(10):e1001885.",
        "year": 2015,
        "authors_short": "Benchimol et al.",
        "notes": "Reporting standard requiring transparent code lists, derived-variable algorithms, and validation for routinely-collected-data studies; directly applicable to pediatric growth/milestone outcome derivation."
      }
    ],
    "plain_language_summary": "Growth looks very different in a 3-month-old versus a 3-year-old, so researchers never compare raw centimeters or kilograms across ages — instead, they convert each measurement into a z-score or percentile that says how a child ranks compared to healthy children of the exact same age and sex. This standardized number is the actual endpoint in pediatric studies, and it lets you ask whether a drug or disease bent a child's growth curve up or down relative to the reference population. One honest caveat: if you only have insurance claims data and no clinic-measured heights and weights, you cannot compute a z-score at all — only electronic health records or linked registry data contain the raw measurements you need.",
    "key_terms": [
      {
        "term": "z-score",
        "definition": "A number that says how many standard deviations a child's measurement sits above (positive) or below (negative) the average for healthy children of the same age and sex — a z-score of 0 is exactly average, -2 means noticeably below average."
      },
      {
        "term": "percentile",
        "definition": "The percentage of the reference population a child's measurement exceeds — a child at the 10th percentile for height is taller than 10 out of every 100 children of the same age and sex."
      },
      {
        "term": "height-for-age z-score (HAZ)",
        "definition": "A z-score specifically for height that accounts for both the child's age and sex, so a 5-year-old and a 10-year-old can be compared on the same scale even though expected heights differ greatly."
      },
      {
        "term": "reference standard",
        "definition": "A published table of typical growth values (median and spread) for children by age and sex, such as the WHO Child Growth Standards or CDC 2000 growth charts, that z-scores are calculated against."
      },
      {
        "term": "LMS method",
        "definition": "The mathematical formula (named for its three parameters) that turns a raw height or weight measurement into a z-score by adjusting for the child's age and sex using the reference standard's published values."
      }
    ],
    "worked_example": {
      "scenario": "A researcher wants to compare the heights of four children enrolled in a clinical registry study. The children range from age 4 to age 9, so raw height in centimeters cannot be compared directly — a shorter number does not mean worse growth when the children are different ages. The researcher converts each raw height to a height-for-age z-score (HAZ) using CDC 2000 reference values, which gives every child a single comparable number. A z-score at or above -2.0 is considered normal; below -2.0 is classified as stunting.",
      "dataset": {
        "caption": "Four children in a registry, each with a single clinic-measured height. HAZ is derived from the CDC 2000 reference for the child's sex and age.",
        "columns": [
          "child_id",
          "sex",
          "age_years",
          "height_cm",
          "haz",
          "percentile_approx"
        ],
        "rows": [
          [
            "C-101",
            "F",
            4,
            98.5,
            "-1.2",
            "12th"
          ],
          [
            "C-102",
            "M",
            6,
            118.0,
            "0.3",
            "62nd"
          ],
          [
            "C-103",
            "F",
            8,
            116.2,
            "-2.1",
            "2nd"
          ],
          [
            "C-104",
            "M",
            9,
            137.5,
            "0.8",
            "79th"
          ]
        ]
      },
      "steps": [
        "Raw heights alone are misleading: C-103 at 116.2 cm is taller than C-101 at 98.5 cm, yet C-103 has a lower z-score because a girl aged 8 is expected to be much taller than a girl aged 4.",
        "For each child, look up the CDC 2000 reference median (M) and spread (L, S) for their exact age in months and sex — this is the LMS method.",
        "Apply the LMS formula: z = ((height / M) raised to the power L, minus 1) divided by (L times S). A positive result means taller than average; negative means shorter than average.",
        "C-101 (girl, age 4, 98.5 cm) yields HAZ = -1.2, placing her at roughly the 12th percentile — below average but within the normal range (above -2.0).",
        "C-103 (girl, age 8, 116.2 cm) yields HAZ = -2.1, placing her just below the 2nd percentile — this crosses the stunting threshold of -2.0 SD.",
        "C-102 (boy, age 6, 118.0 cm) and C-104 (boy, age 9, 137.5 cm) both have positive z-scores, meaning they are taller than average for their age and sex.",
        "Because all four children now share the same z-score scale, the researcher can compare growth status across ages in a single analysis — something raw centimeters cannot support."
      ],
      "result": "HAZ values: C-101 = -1.2 (12th percentile, normal); C-102 = 0.3 (62nd percentile, normal); C-103 = -2.1 (2nd percentile, stunting — below the -2.0 SD threshold); C-104 = 0.8 (79th percentile, normal). Three of the four children fall within the normal range; one child (C-103) meets the HAZ < -2.0 definition of stunting, which would not have been identifiable by comparing raw heights across different ages."
    },
    "prerequisites": [
      "special-populations-rwe-methods",
      "ehr-phenotyping-algorithms-rwe",
      "ehr-study"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Measured-anthropometry z-score (EHR or linked)",
        "description": "Outcome derived from structured EHR height/weight/head-circumference, converted to z via the age- and sex-appropriate LMS reference, with corrected age for ex-preterm infants and implausible-value flagging.",
        "edge_cases": [
          "Unit/decimal entry errors (lb vs kg, cm vs in) create extreme z outliers; apply WHO/CDC biologically implausible-value bounds before analysis.",
          "Reference splice at 2 years (WHO 0-2y to CDC >=2y, or WHO throughout 0-5y) must be harmonized to avoid a discontinuity in z at the splice age.",
          "Missing gestational age prevents corrected-age computation for infants; source GA from birth/vital records."
        ],
        "data_source_notes": "ehr: pull discrete vitals + birth_date for exact age in days; verify any pre-computed BMI percentile against the intended reference. linked: source gestational age from birth records for correction."
      },
      {
        "name": "Claims proxy-code growth/development outcome",
        "description": "Outcome built from ICD-10 growth-failure and obesity codes, BMI-percentile-for-age Z-codes (Z68.51-Z68.54), and developmental-delay codes when no measured anthropometry is available.",
        "edge_cases": [
          "Captures only the clinician-flagged tail; low sensitivity for the continuous z distribution.",
          "Z68.5x recording is inconsistent across payers/EHR templates and well-child visit completeness.",
          "Medicaid behavioral/developmental carve-outs drop milestone and therapy data in some states."
        ],
        "data_source_notes": "claims: treat as an outcome algorithm requiring PPV/sensitivity validation; never report a null as biological without quantifying capture."
      },
      {
        "name": "Conditional / longitudinal growth trajectory",
        "description": "Endpoint is change in z, growth velocity, or a mixed-model/SITAR trajectory across >=2-3 measurements rather than a single cross-sectional z, targeting whether the exposure bent the child's curve.",
        "edge_cases": [
          "Irregular, exposure-driven measurement timing makes naive change-in-z biased; model the visit process or use scheduled well-child visits.",
          "Requires within-child correlation handling (random slopes) rather than pooled GLM."
        ],
        "data_source_notes": "Best from registries with protocolized assessment or EHR with dense vitals; align measurement windows symmetrically across arms."
      },
      {
        "name": "Standardized neurodevelopmental endpoint",
        "description": "Pass/fail on a validated screen (ASQ, M-CHAT, PEDS), age at milestone acquisition, or a graded scale score (Bayley); a distinct endpoint family from anthropometry.",
        "edge_cases": [
          "Screens live in EHR flowsheets/notes, not always discrete fields; NLP or flowsheet mapping often required.",
          "Administrative developmental-delay codes reflect access to screening (access confounding), not the underlying milestone."
        ],
        "data_source_notes": "registry/ehr: prefer the graded scale; treat F80-F82/F88/F89 codes as access-confounded proxies for time-to-diagnosis, not acquisition."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Single cross-sectional z-score",
        "pros_of_this": "Conditional/longitudinal growth separates pre-existing size from a true trajectory change and handles within-child correlation; targets the causal \"did exposure bend the curve\" question.",
        "cons_of_this": "Requires multiple well-timed measurements and mixed-model judgment; vulnerable to informative measurement timing.",
        "when_to_prefer": "When the exposure acts over time (chronic therapy, nutrition) and serial anthropometry exists."
      },
      {
        "compared_to": "Binary threshold (stunting/wasting/obesity)",
        "pros_of_this": "Continuous z retains statistical power and avoids arbitrary cutpoints and near-cut instability.",
        "cons_of_this": "A continuous z is less directly tied to a clinical label or guideline action than a threshold.",
        "when_to_prefer": "As the primary effect estimate; report the dichotomized version as a clinical secondary."
      },
      {
        "compared_to": "Claims growth/development proxy codes",
        "pros_of_this": "Measured EHR/registry anthropometry yields the true continuous z and avoids missingness masquerading as a null.",
        "cons_of_this": "Requires EHR or linkage; not available in pure claims.",
        "when_to_prefer": "Whenever measured values or linkage are available; reserve codes for validated fallback only."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "No anthropometric values. Use ICD-10 R62.5x/R63.x, E66.x, BMI-percentile Z-codes Z68.51-Z68.54, and developmental codes F80-F82/F88/F89 as proxy outcomes requiring PPV/sensitivity validation. Medicaid managed care (not Medicare Advantage) is the peds analog; behavioral/developmental carve-outs drop milestone data.",
      "ehr": "Native source of growth endpoints. Compute exact age in days from birth_date; select the age- and prematurity-appropriate LMS reference (WHO 0-2y, CDC >=2y, Fenton/INTERGROWTH-21st for preterm, condition-specific where indicated); flag implausible values (|z|>5) for unit/decimal errors; map developmental screens from flowsheets/notes.",
      "registry": "Strongest for protocolized serial anthropometry and standardized developmental scales; account for enrollment selection and differential loss to follow-up by severity.",
      "linked": "Claims for cohort/exposure/enrollment/censoring + EHR vitals for the measured z outcome + birth/vital records for gestational age to enable corrected age; reconcile birth-record vs EHR GA before correction."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\nimport pandas as pd\n\nZ_IMPLAUSIBLE = 5.0       # WHO/CDC-style biologically implausible-value bound for height-for-age z\nWINDOW_DAYS = 90          # tolerance for matching a measurement to a target timepoint\nFOLLOWUP_DAYS = 365\n\ndef lms_zscore(value, L, M, S):\n    # Cole-Green LMS transform; the L==0 branch is the log-normal limit.\n    L = np.asarray(L, dtype=float)\n    z = np.where(np.abs(L) < 1e-7,\n                 np.log(value / M) / S,\n                 ((value / M) ** L - 1.0) / (L * S))\n    return z\n\ndef haz_from_vitals(vitals: pd.DataFrame, ref: pd.DataFrame) -> pd.DataFrame:\n    v = vitals.copy()\n    v[\"age_days\"] = (v[\"measure_date\"] - v[\"birth_date\"]).dt.days\n    v[\"age_months\"] = (v[\"age_days\"] / 30.4375).round().astype(int)   # exact-age basis, then key to ref grid\n    v = v.merge(ref, on=[\"sex\", \"age_months\"], how=\"inner\")\n    v[\"haz\"] = lms_zscore(v[\"height_cm\"].to_numpy(), v[\"L\"], v[\"M\"], v[\"S\"])\n    v = v[v[\"haz\"].abs() <= Z_IMPLAUSIBLE]                             # drop unit/decimal-error outliers\n    return v[[\"person_id\", \"measure_date\", \"age_days\", \"haz\"]]\n\ndef conditional_growth(haz: pd.DataFrame, cohort: pd.DataFrame) -> pd.DataFrame:\n    h = haz.merge(cohort[[\"person_id\", \"index_date\", \"arm\"]], on=\"person_id\")\n    h[\"d_index\"] = (h[\"measure_date\"] - h[\"index_date\"]).dt.days\n\n    def nearest(grp, target):\n        cand = grp[(grp[\"d_index\"] - target).abs() <= WINDOW_DAYS]\n        if cand.empty:\n            return np.nan\n        return cand.loc[(cand[\"d_index\"] - target).abs().idxmin(), \"haz\"]\n\n    rows = []\n    for pid, grp in h.groupby(\"person_id\"):\n        haz_0 = nearest(grp, 0)\n        haz_12 = nearest(grp, FOLLOWUP_DAYS)\n        rows.append({\"person_id\": pid, \"arm\": grp[\"arm\"].iloc[0],\n                     \"haz_baseline\": haz_0, \"haz_12m\": haz_12,\n                     \"delta_haz\": haz_12 - 
haz_0})\n    out = pd.DataFrame(rows)\n    return out.dropna(subset=[\"delta_haz\"])   # require both anchored measurements",
        "description": "Derive height-for-age z-scores (HAZ) from EHR vitals using the LMS method, then build a per-child conditional\ngrowth endpoint (change in HAZ over a follow-up window). Required inputs (cleaned, de-duplicated):\n  vitals : person_id, measure_date (datetime), height_cm, birth_date (datetime), sex ('M'/'F')\n  ref    : sex, age_months (int), L, M, S   # CDC/WHO LMS coefficients for height-for-age, one row per sex x age_month\n  cohort : person_id, index_date (datetime), arm   # from upstream claims-based new-user cohort construction\nExact age in days drives z; the nearest measurement to index and to +12 months (within a +/-90d window)\ndefines the conditional endpoint. Apply |z|>5 implausible-value exclusion (unit/decimal entry errors).",
        "dependencies": [
          "pandas",
          "numpy"
        ],
        "source_citations": [
          "cole-green-1992"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\nlibrary(lme4)\n\nZ_IMPLAUSIBLE <- 5.0\n\nhaz_from_vitals <- function(vitals, ref) {\n  setDT(vitals); setDT(ref)\n  vitals[, age_days := as.integer(measure_date - birth_date)]\n  vitals[, age_months := as.integer(round(age_days / 30.4375))]\n  v <- merge(vitals, ref, by = c(\"sex\", \"age_months\"))\n  # Cole-Green LMS; L == 0 is the log-normal limit.\n  v[, haz := fifelse(abs(L) < 1e-7,\n                     log(height_cm / M) / S,\n                     ((height_cm / M)^L - 1) / (L * S))]\n  v[abs(haz) <= Z_IMPLAUSIBLE,\n    .(person_id, measure_date, age_days, haz)]\n}\n\nfit_growth_model <- function(haz, cohort) {\n  setDT(haz); setDT(cohort)\n  d <- merge(haz, cohort[, .(person_id, index_date, arm)], by = \"person_id\")\n  d[, years_since_index := as.numeric(measure_date - index_date) / 365.25]\n  # arm x time interaction = differential change in HAZ trajectory; random slope per child.\n  lmer(haz ~ arm * years_since_index + (years_since_index | person_id), data = d)\n}",
        "description": "HAZ derivation via the LMS method and a linear mixed model of HAZ over time (random child intercept + slope),\nwhich respects the within-child correlation and the irregular measurement timing better than a two-point\ndifference. Inputs mirror the Python version:\n  vitals : person_id, measure_date (Date), height_cm, birth_date (Date), sex ('M'/'F')\n  ref    : sex, age_months (integer), L, M, S   (height-for-age LMS coefficients; e.g. via childsds/anthro)\n  cohort : person_id, index_date (Date), arm",
        "dependencies": [
          "data.table",
          "lme4"
        ],
        "source_citations": [
          "cole-green-1992"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "%let z_implausible = 5;\n\n/* Exact age in days -> age in months -> z via Cole-Green LMS. */\nproc sql;\n  create table haz as\n  select v.person_id, v.measure_date,\n         (v.measure_date - v.birth_date) as age_days,\n         round((v.measure_date - v.birth_date)/30.4375) as age_months,\n         case when abs(r.L) < 1e-7\n              then log(v.height_cm / r.M) / r.S\n              else ((v.height_cm / r.M)**r.L - 1) / (r.L * r.S)\n         end as haz\n  from work.vitals v\n  inner join work.ref r\n    on v.sex = r.sex\n   and round((v.measure_date - v.birth_date)/30.4375) = r.age_months;\nquit;\n\n/* Drop unit/decimal-entry outliers, attach arm + index, build the time axis. */\nproc sql;\n  create table analytic as\n  select h.person_id, h.haz, c.arm,\n         (h.measure_date - c.index_date)/365.25 as years_since_index\n  from haz h\n  inner join work.cohort c on h.person_id = c.person_id\n  where abs(h.haz) <= &z_implausible;\nquit;\n\n/* Repeated-measures model: arm*time = differential change in HAZ trajectory. */\nproc mixed data=analytic method=reml;\n  class arm person_id;\n  model haz = arm years_since_index arm*years_since_index / solution ddfm=kr;\n  random intercept years_since_index / subject=person_id type=un;\nrun;",
        "description": "HAZ derivation and a repeated-measures mixed model in SAS. Required input datasets (post data-management):\n  work.vitals : person_id, measure_date, height_cm, birth_date, sex ('M'/'F')\n  work.ref    : sex, age_months, L, M, S   (height-for-age LMS coefficients)\n  work.cohort : person_id, index_date, arm\nPROC MIXED fits HAZ over time with a random intercept and slope per child and an unstructured G-side\ncovariance; the arm*years interaction is the differential-growth-trajectory estimand.",
        "dependencies": [],
        "source_citations": [
          "cole-green-1992"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Start[\"Pediatric anthropometric measurement<br/>height / weight / head circ + exact age + sex\"] --> Pre{Born preterm?}\n  Pre -- \"Yes, age < ~24 mo\" --> Corr[\"Use gestationally corrected age<br/>+ Fenton / INTERGROWTH-21st preterm curve\"]\n  Pre -- No --> Cond{\"Condition with its own curve?<br/>Down, achondroplasia, CP\"}\n  Cond -- Yes --> Spec[Use condition-specific reference]\n  Cond -- No --> Age{Age band}\n  Age -- \"0 to <2 y\" --> WHO[WHO Child Growth Standards LMS]\n  Age -- \">=2 y US\" --> CDC[CDC 2000 growth charts LMS]\n  Corr --> Z[\"Apply LMS: z = ((X/M)^L - 1)/(L*S)\"]\n  Spec --> Z\n  WHO --> Z\n  CDC --> Z\n  Z --> Flag[\"Flag implausible values |z|>5<br/>unit/decimal errors\"]\n  Flag --> Est{Estimand}\n  Est -- size now --> Cross[Cross-sectional z or threshold]\n  Est -- trajectory --> Long[\"Conditional growth / mixed model\"]",
        "caption": "Reference-curve selection and estimand logic. Prematurity and condition-specific status are resolved before the WHO-vs-CDC age splice; the chosen estimand (cross-sectional vs conditional growth) then dictates the model.",
        "alt_text": "Decision tree starting from an anthropometric measurement, branching on prematurity and condition-specific curves, then by age band to WHO or CDC LMS references, applying the LMS z-score formula, flagging implausible values, and selecting a cross-sectional or conditional-growth estimand.",
        "source_type": "illustrative",
        "source_citations": [
          "cole-green-1992"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  subgraph Claims[Claims only]\n    C1[ICD-10 R62.5x / R63.x / E66.x] --> C2[BMI-percentile Z-codes Z68.51-54]\n    C2 --> C3[Dev codes F80-82 / F88 / F89]\n    C3 --> CW[Proxy outcome -> requires PPV/sensitivity validation]\n  end\n  subgraph EHR[EHR vitals]\n    E1[Structured height / weight / head circ] --> E2[Exact age + sex -> LMS z-score]\n    E2 --> EW[Continuous z + conditional growth]\n  end\n  subgraph Linked[Linked claims + EHR + birth records]\n    L1[Claims: exposure, enrollment, censoring] --> L3[Best substrate]\n    L2[EHR: measured z outcome] --> L3\n    Lb[Birth records: gestational age -> corrected age] --> L3\n  end\n  CW -. weakest .-> Rank[Endpoint quality]\n  EW -. strong .-> Rank\n  L3 -. ideal .-> Rank",
        "caption": "Data-source feasibility for pediatric growth/development endpoints. Pure claims yield only validated proxy codes; EHR yields measured z-scores; linked data adds gestational age for corrected-age z and is the ideal substrate.",
        "alt_text": "Data-flow diagram contrasting claims-only proxy codes, EHR measured z-scores, and linked claims plus EHR plus birth records, ranked from weakest to ideal for deriving pediatric growth and development endpoints.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "is_variant_of",
        "target_slug": "special-populations-rwe-methods",
        "notes": "Pediatric growth/development endpoints are a special-population outcome family requiring age-, sex-, and prematurity-specific measurement logic."
      },
      {
        "relation_type": "used_with",
        "target_slug": "mixed-effects-models-longitudinal-rwe",
        "notes": "Conditional-growth (change-in-z, trajectory) endpoints require mixed-effects/random-slope models to handle within-child correlation and irregular measurement timing."
      },
      {
        "relation_type": "used_with",
        "target_slug": "mmrm-repeated-measures-rwe",
        "notes": "Serial z-scores at scheduled visits are a natural repeated-measures outcome; MMRM is an alternative to a random-slope model when timepoints are common across children."
      },
      {
        "relation_type": "see_also",
        "target_slug": "longitudinal-outcomes-modeling-rwe",
        "notes": "General longitudinal-outcome modeling framework that conditional growth trajectories instantiate."
      },
      {
        "relation_type": "requires",
        "target_slug": "claims-outcome-algorithm-ppv-sensitivity-rwe",
        "notes": "Claims proxy-code growth/development outcomes must be validated for PPV and sensitivity before being treated as endpoints."
      },
      {
        "relation_type": "used_with",
        "target_slug": "ehr-phenotyping-algorithms-rwe",
        "notes": "Extracting structured vitals and mapping developmental screens from EHR flowsheets/notes is an EHR phenotyping task upstream of z-score derivation."
      },
      {
        "relation_type": "see_also",
        "target_slug": "pediatric-dose-normalization-rwe",
        "notes": "Sibling pediatric special-population concept; weight/BSA from the same anthropometry feeds dose normalization."
      },
      {
        "relation_type": "see_also",
        "target_slug": "disease-registry",
        "notes": "Disease and product registries are the strongest source of protocolized serial anthropometry and standardized developmental assessment."
      }
    ],
    "aliases": [
      "Pediatric growth and development endpoints in real-world evidence",
      "pediatric anthropometric z-score outcomes",
      "growth and neurodevelopmental endpoints",
      "pediatric growth outcome measurement"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "persistence-time-to-discontinuation",
    "name": "Persistence Time to Discontinuation",
    "short_definition": "The duration a patient continues an initiated therapy — from the index fill until a permissible-gap rule declares discontinuation — analyzed as a time-to-event endpoint distinct from day-to-day adherence (implementation) intensity.",
    "long_description": "**Persistence** is the *duration* phase of medication-taking behavior: how long a patient stays on a therapy after\ninitiating it, ending when an allowable gap between the exhaustion of one fill's `days_supply` and the next fill exceeds a\npre-specified threshold (the **permissible gap**, e.g., 30, 60, or 90 days). The day after the gap rule is breached becomes\nthe **discontinuation date**; the interval from `index_date` to that date is the persistence time. Because every patient\neither discontinues, is censored (death, disenrollment, end of data), or remains persistent at study end, persistence is\nnaturally a **time-to-event** outcome — not a ratio. The standardization literature (Cramer 2008, Vrijens ABC taxonomy\n2012) is emphatic that this is a *separate construct* from adherence/implementation (PDC, MPR), which measures dose-taking\nintensity *while the patient is still on therapy*. A patient can be highly persistent (no qualifying gap for two years) yet\npoorly adherent (PDC of 0.55 from chronically late refills), and vice versa.\n\n**Core estimand distinction**. Persistence has two estimand layers that must be pre-specified, and conflating them is the\nmost common error in the literature. (1) *What event defines discontinuation* — a permissible-gap breach on the index agent\nonly (pure persistence), or a gap that is not bridged by a switch/add-on (treatment-episode persistence). (2) *How\ncompeting events are handled* — death and, in many designs, switching are **competing risks** that prevent the patient from\never being observed to discontinue the index drug. The naive Kaplan–Meier \"persistence curve\" treats death/disenrollment\nas non-informative censoring and therefore *overstates* the probability of remaining persistent when competing events are\ncommon (e.g., elderly oncology or heart-failure cohorts). 
The honest choices are the **cause-specific hazard** (rate of\ndiscontinuation among those still at risk and alive — answers an etiologic \"why do people stop\" question) versus the\n**subdistribution hazard / cumulative incidence function** (Fine–Gray — answers the prognostic \"what fraction will be off\ndrug by month 12\" question, keeping those who died in the risk set). These are different numbers with different\ninterpretations; report the CIF, not 1−KM, whenever competing risks are non-trivial.\n\n**Pros, cons, and trade-offs**.\n- **vs pdc-proportion-of-days-covered (and mpr-medication-possession-ratio):** Persistence captures *time on therapy* —\n  the quantity that drives cumulative exposure, lines-of-therapy denominators, and the \"on-treatment\" windows for outcome\n  assessment — whereas PDC/MPR capture *intensity of supply* over a fixed denominator and silently assume the patient is\n  still on drug. Cost: persistence is acutely sensitive to the chosen gap length (Gardarsdottir 2010 shows episode counts\n  and discontinuation rates swing materially from 30 to 90 days) and says nothing about adherence quality during the\n  persistent period. **Prefer persistence** when the question is continuation/discontinuation, episode length, or defining\n  on-treatment exposure; **prefer PDC** when the question is degree of adherence within a fixed window.\n- **vs a binary \"persistent at 12 months\" proportion:** The time-to-event formulation uses all the follow-up, respects\n  differential censoring, and supports comparative hazard ratios; the binary endpoint discards timing and mishandles\n  patients censored before the landmark. 
**Prefer the survival formulation** for almost all analytic work; reserve the\n  binary version for simple descriptive reporting at a single, fully-observed landmark.\n- **vs treatment-switch / switch-add-on-augmentation-rwe:** Persistence logic supplies the gap machinery that switch and\n  augmentation algorithms depend on, but the gap definition directly shapes switch classification — a short permissible gap\n  inflates apparent discontinuations and restarts. **Use persistence as the substrate**, then layer switch rules on top\n  with a consistent gap.\n\n**When to use**. Persistence (or non-persistence/time-to-discontinuation) as a primary or secondary endpoint in drug\nutilization, comparative effectiveness, or value-demonstration studies; constructing treatment episodes to define\non-treatment exposure windows; CMS Star/PQA-style chronic-therapy continuation reporting; any setting with longitudinal\ndispensing dates and `days_supply`. Pair it with a survival model (Kaplan–Meier/cause-specific Cox for the etiologic\ncontrast; Fine–Gray/CIF when death or switching competes).\n\n**When NOT to use — and when it is actively misleading or dangerous**.\n- **No reliable dispensing/`days_supply` data.** EHR prescription *orders* are not fills; persistence built on orders\n  systematically *overstates* continuation because it cannot see primary non-adherence (an order never filled — see\n  primary-non-adherence-initiation) or true discontinuation. Use linked pharmacy claims or treat order-based persistence as\n  a different, weaker construct.\n- **Competing risks ignored in a high-mortality cohort.** Reporting the naive KM curve as \"persistence\" (or 1−KM as the\n  fraction discontinued) when death is frequent is actively misleading: it overstates both the probability of remaining on\n  drug and the cumulative incidence of discontinuation, and, if mortality differs by arm, *biases the comparison*. 
Switch\nto CIF/Fine–Gray or report cause-specific hazards with explicit handling of the competing event.\n- **MA-only or capitated person-time.** Persistence computed where pharmacy claims are not fully captured manufactures\n  spurious gaps (a \"discontinuation\" that is really unobserved fills). Restrict to person-time with complete pharmacy\n  benefit.\n- **The real question is adherence intensity, not duration.** A patient who fills sporadically but never leaves a gap\n  longer than the permissible gap is \"persistent\"; if you care about whether they took enough drug, use PDC/MPR, not\n  persistence.\n- **Immortal-time in procedure/initiation framing.** If follow-up or persistence start is set at a landmark before the\n  index fill (e.g., diagnosis date), the pre-fill interval is immortal and biases the curve upward.\n\n**Data-source operational depth**.\n- **Claims (FFS):** The reference substrate — reliable `fill_date`, `days_supply`, and dispensed quantity. Restrict\n  follow-up to continuous medical + pharmacy enrollment from index onward (censoring at disenrollment) so absent fills are\n  true gaps, not missingness.\n  Real failure modes: **90-day mail-order and bulk fills** distort gap timing (a single mail fill covers a stretch that\n  looks like over-supply, and the next gap is mis-dated) — handle with carry-over/stockpiling rules capped at a maximum\n  look-ahead; **inpatient/SNF stays** can bridge a gap if facility-administered drug is assumed (claims may not show a\n  pharmacy fill during admission) — decide a priori whether hospitalization days suspend the gap clock; **free samples and\n  $0 copay cards** are invisible. Censor at disenrollment and end of data; treat death as a competing event.\n- **Claims (Medicare Advantage / capitated):** MA-only person-time frequently lacks complete FFS pharmacy claims, so \"no\n  fill\" is unobserved, not a true discontinuation — exclude MA-only spans or require Part D. 
In **elderly claims,\n  competing risks (death) often differ by exposure**, so a naive persistence curve compares non-comparable risk sets; CIF is\n  mandatory, not optional.\n- **EHR (orders only):** Order-based \"persistence\" overstates continuation and cannot detect primary non-adherence; prefer\n  linked dispensing. Visit-driven capture means a patient who leaves the system looks like a discontinuer — distinguish\n  loss-to-follow-up from true stopping.\n- **Linked claims–EHR / registry:** Best of both — EHR/registry severity and mortality plus claims fill completeness — but\n  linkage selects the linkable subset and introduces order/fill/service date discrepancies that must be reconciled before\n  the gap clock starts. Registries strengthen the competing-event (death) ascertainment that the CIF needs.\n\n**Worked claims example.** Question: 12-month persistence on a newly initiated oral antidiabetic in a commercial + Medicare\nFFS database, comparing two agents. (1) Cohort: adults with ≥2 type-2-diabetes diagnoses and 365 days of continuous\nmedical + pharmacy enrollment before the first study fill (`index_date`); no fill of the index agent in that washout\n(new users). (2) Build the supply timeline: for each `person_id`, order pharmacy fills by `fill_date`; the covered interval\nof a fill runs from `fill_date` through `fill_date + days_supply − 1` (its `supply_end`). Apply a **carry-over rule** — if a\nrefill arrives before the prior supply is exhausted, push the leftover days forward (stockpiling), capped at, say, 30\nsurplus days to avoid implausible hoarding from 90-day mail fills. (3) Apply the gap rule: walk the timeline; the first\ntime the next `fill_date` (or the end of follow-up, when no further fill exists) falls ≥ the permissible gap (primary = 60\ndays; sensitivity = 30 and 90) after the running `supply_end`, discontinuation occurs on `supply_end + 1`. 
(4) Persistence time = `discontinuation_date − index_date`; patients with no\nqualifying gap are persistent and **censored** at the earliest of disenrollment, end of data, or day 365; **death** is a\n**competing event**, not plain censoring. (5) Estimate: Kaplan–Meier and cause-specific Cox for the etiologic\ndiscontinuation-rate contrast, and the cumulative incidence function (Fine–Gray) for the prognostic \"fraction off drug by\n12 months,\" because diabetes cohorts carry non-trivial mortality. (6) Report all three gap lengths; a conclusion that flips\nbetween 30- and 90-day gaps is a finding about the algorithm, not the drug.",
    "primary_category": "Exposure_Definition",
    "tags": [
      "persistence",
      "discontinuation",
      "time-to-event",
      "permissible-gap",
      "treatment-episodes",
      "competing-risks",
      "medication-utilization",
      "claims"
    ],
    "applies_to_study_types": [
      "drug_utilization",
      "cohort_retrospective",
      "active_comparator_new_user",
      "new_user",
      "claims_analysis"
    ],
    "data_sources": [
      "claims",
      "pharmacy_record",
      "ehr",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1111/j.1524-4733.2007.00213.x",
        "url": "https://doi.org/10.1111/j.1524-4733.2007.00213.x",
        "citation_text": "Cramer JA, Roy A, Burrell A, et al. Medication compliance and persistence: terminology and definitions. Value Health. 2008;11(1):44-47.",
        "year": 2008,
        "authors_short": "Cramer et al.",
        "notes": "ISPOR consensus that separates compliance/adherence (implementation) from persistence, defined as the duration from initiation to discontinuation. The standard reference for the construct."
      },
      {
        "role": "introduce",
        "doi": "10.1002/pds.1230",
        "url": "https://doi.org/10.1002/pds.1230",
        "citation_text": "Andrade SE, Kahler KH, Frech F, Chan KA. Methods for evaluation of medication adherence and persistence using automated databases. Pharmacoepidemiol Drug Saf. 2006;15(8):565-574.",
        "year": 2006,
        "authors_short": "Andrade et al.",
        "notes": "Canonical operational methods paper for measuring persistence and gaps from automated dispensing/claims data, including permissible-gap and treatment-episode construction."
      },
      {
        "role": "explain",
        "doi": "10.1111/j.1365-2125.2012.04167.x",
        "url": "https://doi.org/10.1111/j.1365-2125.2012.04167.x",
        "citation_text": "Vrijens B, De Geest S, Hughes DA, et al.; ABC Project Team. A new taxonomy for describing and defining adherence to medications. Br J Clin Pharmacol. 2012;73(5):691-705.",
        "year": 2012,
        "authors_short": "Vrijens et al. (ABC taxonomy)",
        "notes": "Positions persistence/discontinuation as a distinct phase after initiation and implementation; the modern conceptual scaffold separating duration from dose-taking intensity."
      },
      {
        "role": "explain",
        "doi": "10.1097/MLR.0b013e31829b1d2a",
        "url": "https://doi.org/10.1097/MLR.0b013e31829b1d2a",
        "citation_text": "Raebel MA, Schmittdiel J, Karter AJ, Konieczny JL, Steiner JF. Standardizing terminology and definitions of medication adherence and persistence in research employing electronic databases. Med Care. 2013;51(8 Suppl 3):S11-S21.",
        "year": 2013,
        "authors_short": "Raebel et al.",
        "notes": "Practical standardization for claims/EHR; addresses primary vs secondary non-adherence and why order-based EHR persistence overstates continuation relative to fill-based measures."
      },
      {
        "role": "demonstrate",
        "doi": "10.1016/j.jclinepi.2009.07.001",
        "url": "https://doi.org/10.1016/j.jclinepi.2009.07.001",
        "citation_text": "Gardarsdottir H, Souverein PC, Egberts TCG, Heerdink ER. Construction of drug treatment episodes from drug-dispensing histories is influenced by the gap length. J Clin Epidemiol. 2010;63(4):422-427.",
        "year": 2010,
        "authors_short": "Gardarsdottir et al.",
        "notes": "Empirical demonstration that permissible-gap length materially changes episode counts and discontinuation rates — the core sensitivity that every persistence analysis must report."
      }
    ],
    "plain_language_summary": "Persistence measures how long a patient keeps refilling a medication before stopping — specifically, the number of days from their first fill until the gap between when one supply ran out and the next fill arrived grew longer than an allowed limit (the permissible gap, commonly 30, 60, or 90 days). Unlike PDC, which asks how much medication a patient had on hand during a fixed window, persistence asks a simpler duration question: how many days did this person stay on therapy before quitting? One honest caveat: persistence built from pharmacy claims cannot see medications dispensed inside a hospital, free samples, or cash-paid fills, so those unobserved supply days may make a patient look like they stopped when they did not.",
    "key_terms": [
      {
        "term": "days_supply",
        "definition": "The number of days one filled prescription is intended to last — for example, a 30-day fill covers 30 calendar days of medication."
      },
      {
        "term": "permissible gap",
        "definition": "The maximum number of days allowed between the end of one fill's supply and the start of the next before an analyst calls it a stop — commonly set at 60 days, and always varied in sensitivity analyses."
      },
      {
        "term": "index date",
        "definition": "The patient's personal day zero — usually the date of their very first fill of the drug being studied, from which persistence time is measured forward."
      },
      {
        "term": "supply_end",
        "definition": "The last calendar day a given fill's medication supply covers, calculated as fill_date plus days_supply minus one day."
      },
      {
        "term": "discontinuation date",
        "definition": "The day an analyst officially records that a patient stopped — set to the day after their supply ran out, once the permissible gap has been confirmed by no refill arriving in time."
      }
    ],
    "worked_example": {
      "scenario": "Maria starts metformin on January 15, 2024, for her type-2 diabetes. She refills on time twice. Then life gets busy — she never refills again. We are using a 60-day permissible gap: if no fill arrives within 60 days of the day her supply ran out, we call that discontinuation. How many days was Maria persistent on metformin?",
      "dataset": {
        "caption": "Raw rows an analyst would see in a claims pharmacy table for patient Maria (person_id 2001). Each row is one filled prescription.",
        "columns": [
          "person_id",
          "fill_date",
          "drug",
          "days_supply"
        ],
        "rows": [
          [
            2001,
            "2024-01-15",
            "metformin",
            30
          ],
          [
            2001,
            "2024-02-14",
            "metformin",
            30
          ],
          [
            2001,
            "2024-03-15",
            "metformin",
            30
          ]
        ]
      },
      "steps": [
        "Fill A (Jan 15, 30-day supply) covers Jan 15 through Jan 44 — but January only has 31 days, so the last covered day is Feb 13. The supply runs out at the end of Feb 13; supply_end = Feb 14 (the first uncovered day).",
        "Fill B (Feb 14, 30-day supply) arrives exactly on the first uncovered day, so the gap is zero days. Supply now extends through Mar 14; supply_end = Mar 15.",
        "Fill C (Mar 15, 30-day supply) again arrives exactly on the first uncovered day, gap zero. Supply now extends through Apr 13; supply_end = Apr 14.",
        "No Fill D ever arrives. Starting Apr 14, the gap clock ticks. By Jun 12 (60 days later), still no fill — the permissible gap is breached.",
        "Discontinuation is backdated to Apr 14, the first day Maria had no supply (supply_end of Fill C). The gap was confirmed on Jun 13, but the event date is the earlier backdated date.",
        "Persistence = discontinuation date minus index date = Apr 14 minus Jan 15 = 89 days. Maria was persistent on metformin for 89 days."
      ],
      "result": {
        "label": "Persistence = 89 days (Jan 15 to Apr 14, 60-day permissible gap)",
        "value": 89
      },
      "pdc_contrast": "PDC would ask a different question over a fixed window — say 180 days: how many of those 180 days did Maria have pills on hand? She had supply Jan 15 through Apr 13 = 89 covered days, so PDC = 89 / 180 = 0.49. Persistence says 'she lasted 89 days before stopping'; PDC says 'she had medication only 49 % of the time we were watching.' Both use the same fills — they just answer different questions. Use persistence when you care how long the patient stayed on therapy; use PDC when you care how fully they covered the treatment window.",
      "timeline_spec": {
        "title": "Persistence for one metformin patient (60-day permissible gap, three 30-day fills)",
        "window": {
          "start": "2024-01-15",
          "end": "2024-06-13",
          "label": "Observation window: Jan 15 – Jun 13 (gap confirmation date)"
        },
        "events": [
          {
            "label": "Fill A",
            "start": "2024-01-15",
            "length_days": 30,
            "quantity": "30-day supply"
          },
          {
            "label": "Fill B (on-time refill)",
            "start": "2024-02-14",
            "length_days": 30,
            "quantity": "30-day supply"
          },
          {
            "label": "Fill C (on-time refill)",
            "start": "2024-03-15",
            "length_days": 30,
            "quantity": "30-day supply"
          }
        ],
        "spans": [
          {
            "kind": "covered",
            "start": "2024-01-15",
            "end": "2024-04-13",
            "label": "89 days continuously covered (Fills A + B + C)"
          },
          {
            "kind": "gap",
            "start": "2024-04-14",
            "end": "2024-06-12",
            "label": "60-day permissible gap — no fill arrives (gap confirmed Jun 13)"
          }
        ],
        "result": {
          "label": "Persistence = 89 days (index Jan 15 → discontinuation Apr 14)",
          "value": 89
        },
        "caption": "Maria's three fills provide continuous coverage from Jan 15 through Apr 13 (89 days). After Fill C's supply runs out on Apr 13, no refill arrives within the 60-day permissible window. The gap is confirmed on Jun 13, but discontinuation is backdated to Apr 14 — the first day she had no supply. Persistence = Apr 14 minus Jan 15 = 89 days. PDC over this same 89-day span would be 89/89 = 1.0 while she was on therapy, but PDC over a fixed 180-day window would be 89/180 = 0.49.",
        "alt_text": "A horizontal timeline showing three consecutive 30-day fill bars (Fill A Jan 15, Fill B Feb 14, Fill C Mar 15) shaded as covered days, followed by a 60-day gap bar starting Apr 14 with no fill arriving. A marker labeled 'Discontinuation Apr 14' sits at the start of the gap bar. The total persistent period is annotated as 89 days."
      }
    },
    "prerequisites": [
      "pdc-proportion-of-days-covered",
      "grace-period-gap-rules-rwe",
      "continuous-enrollment-observable-time-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Fixed permissible-gap persistence (30/60/90-day)",
        "description": "Discontinuation is declared the day after the interval between supply exhaustion and the next fill exceeds a fixed gap (commonly 30, 60, or 90 days). The gap length is the single most consequential analytic choice and must be pre-specified, clinically justified by refill cadence, and varied in sensitivity analyses.",
        "edge_cases": [
          "90-day mail-order fills make a single fill span a long covered interval, shifting where the next gap appears; without a stockpiling cap, early over-supply masks later discontinuation.",
          "Inpatient/SNF stays may bridge a gap if facility-administered drug is assumed and no pharmacy claim appears during admission."
        ],
        "data_source_notes": "claims: compute days from running supply-end (with carry-over) to next fill_date; standard in US commercial/Medicare studies for statins, antihypertensives, antidiabetics, oral oncolytics, and biologics.",
        "citations": [
          "andrade-2006",
          "gardarsdottir-2010"
        ]
      },
      {
        "name": "Treatment-episode persistence (switch/add-on aware)",
        "description": "A gap is only counted as discontinuation if it is not bridged by a fill of an interchangeable agent (switch) or a same-indication add-on; otherwise the episode continues. Distinguishes stopping the index drug from staying on treatment.",
        "edge_cases": [
          "Requires a vetted set of interchangeable NDCs/classes; an over-broad set hides true discontinuation, an over-narrow set inflates it.",
          "A restart of the original agent after a qualifying gap may be a new episode or a continuation depending on the question (cumulative exposure vs lines of therapy)."
        ],
        "data_source_notes": "claims: layer switch/augmentation logic (switch-add-on-augmentation-rwe) on top of the gap timeline using a consistent permissible gap; date all agents from the same dispensing records.",
        "citations": [
          "andrade-2006"
        ]
      },
      {
        "name": "Competing-risks persistence (cause-specific vs subdistribution)",
        "description": "Treats death (and often switching) as competing events rather than censoring. The cause-specific hazard answers the etiologic discontinuation-rate question; the Fine-Gray subdistribution hazard / CIF answers the prognostic fraction-off-drug-by-time question. Mandatory when mortality is non-trivial or differs by exposure.",
        "edge_cases": [
          "Reporting 1-KM as persistence in a high-mortality cohort overstates continuation and biases between-arm comparisons when mortality differs.",
          "Whether switching is a competing event or part of the outcome depends on whether the estimand is index-drug persistence or treatment-episode persistence."
        ],
        "data_source_notes": "claims/registry: requires reliable death dating (date-of-death file or registry linkage); SAS PROC PHREG eventcode= (Fine-Gray) or cause-specific Cox; R cmprsk/survival; Python lifelines/scikit-survival.",
        "citations": [
          "cramer-2008"
        ]
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "pdc-proportion-of-days-covered",
        "pros_of_this": "Measures duration of therapy (cumulative exposure, episode length, on-treatment windows) using a time-to-event framework that respects censoring and competing risks.",
        "cons_of_this": "Highly sensitive to the chosen permissible-gap length; says nothing about adherence quality during the persistent period (a persistent patient can have low PDC from chronically late fills).",
        "when_to_prefer": "When the question concerns treatment continuation, episode length, or defining on-treatment exposure; report multiple gap lengths as sensitivity analyses and use CIF when competing risks are non-trivial."
      },
      {
        "compared_to": "mpr-medication-possession-ratio",
        "pros_of_this": "Captures when a patient stops rather than how much supply they accumulated over a fixed denominator; avoids MPR's >1.0 over-supply artifact and fixed-window dependence.",
        "cons_of_this": "Does not quantify dose-taking intensity; two patients with identical persistence can have very different MPR.",
        "when_to_prefer": "When duration on therapy, not supply intensity, is the quantity of interest."
      },
      {
        "compared_to": "switch-add-on-augmentation-rwe",
        "pros_of_this": "Supplies the permissible-gap machinery that switch and augmentation algorithms require to date discontinuation consistently.",
        "cons_of_this": "The gap definition directly shapes switch classification; a short gap inflates apparent discontinuations and restarts.",
        "when_to_prefer": "When the primary endpoint is continuation/discontinuation itself, with switch logic layered on top using a consistent gap."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Reference substrate. Use fill_date + days_supply with a carry-over (stockpiling) rule capped at a maximum surplus to handle 90-day mail-order fills; require continuous pharmacy enrollment so absent fills are true gaps. Censor at disenrollment and end of data; treat death as a competing event. Decide a priori whether inpatient/SNF stays suspend the gap clock.",
      "ehr": "Order-based persistence overstates continuation and misses primary non-adherence; prefer linked dispensing. Visit-driven capture conflates loss-to-follow-up with true discontinuation.",
      "registry": "Strong for death ascertainment (needed for the CIF) and disease severity; typically weak for complete pharmacy exposure. Link to claims for fills.",
      "linked": "Linked claims-EHR-vital-records gives severity + fill completeness + reliable mortality, but linkage selects the linkable subset and order/fill/service date discrepancies must be reconciled before the gap clock starts."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\nimport numpy as np\nfrom lifelines import AalenJohansenFitter\n\nPERMISSIBLE_GAP = 60      # days beyond supply exhaustion that defines discontinuation (sensitivity: 30, 90)\nMAX_CARRYOVER = 30        # cap on surplus days carried forward (guards against 90-day mail-fill over-supply)\nHORIZON = 365             # administrative landmark\n\ndef time_to_discontinuation(fills: pd.DataFrame, index: pd.DataFrame, outcome: pd.DataFrame) -> pd.DataFrame:\n    f = fills.merge(index[[\"person_id\", \"index_date\"]], on=\"person_id\")\n    f = f[f[\"fill_date\"] >= f[\"index_date\"]].sort_values([\"person_id\", \"fill_date\"])\n\n    rows = []\n    for pid, g in f.groupby(\"person_id\"):\n        supply_end = None                       # running covered-through date (with carry-over)\n        disc_date = None\n        for fill_date, dsup in zip(g[\"fill_date\"], g[\"days_supply\"]):\n            if supply_end is None:\n                supply_end = fill_date + pd.Timedelta(days=int(dsup))\n                continue\n            gap = (fill_date - supply_end).days\n            if gap >= PERMISSIBLE_GAP:           # gap rule breached -> discontinued the day after supply ran out\n                disc_date = supply_end + pd.Timedelta(days=1)\n                break\n            carry = max(min((supply_end - fill_date).days, MAX_CARRYOVER), 0)  # leftover supply, capped\n            supply_end = fill_date + pd.Timedelta(days=int(dsup) + carry)\n        rows.append({\"person_id\": pid, \"last_supply_end\": supply_end, \"disc_date\": disc_date})\n\n    epi = pd.DataFrame(rows).merge(index, on=\"person_id\").merge(outcome, on=\"person_id\")\n\n    # Competing-risks status: 1 = discontinuation, 2 = death before discontinuation, 0 = censored.\n    admin = epi[\"index_date\"] + pd.Timedelta(days=HORIZON)\n    cens = epi[[\"disenroll_date\", \"end_date\"]].min(axis=1).fillna(admin).clip(upper=admin)\n    disc = epi[\"disc_date\"]\n    death = 
epi[\"death_date\"]\n\n    status = np.zeros(len(epi), dtype=int)\n    event_date = cens.copy()\n    # earliest of discontinuation / death / censor wins\n    d_disc = disc.fillna(pd.Timestamp.max)\n    d_death = death.fillna(pd.Timestamp.max)\n    disc_first = (d_disc <= cens) & (d_disc <= d_death)\n    death_first = (d_death <= cens) & (d_death < d_disc)\n    status = np.where(disc_first, 1, np.where(death_first, 2, 0))\n    event_date = np.where(disc_first, d_disc, np.where(death_first, d_death, cens))\n    epi[\"status\"] = status\n    epi[\"persistence_days\"] = (pd.to_datetime(event_date) - epi[\"index_date\"]).dt.days.clip(lower=0, upper=HORIZON)\n    return epi[[\"person_id\", \"arm\", \"persistence_days\", \"status\"]]\n\ndef cif_discontinuation(epi: pd.DataFrame) -> AalenJohansenFitter:\n    # Cumulative incidence of discontinuation (event 1) with death (event 2) as competing risk.\n    ajf = AalenJohansenFitter()\n    ajf.fit(epi[\"persistence_days\"], epi[\"status\"], event_of_interest=1)\n    return ajf",
        "description": "Build treatment episodes and a time-to-discontinuation table from claims-style pharmacy fills, then estimate the\ncompeting-risks cumulative incidence of discontinuation. Required inputs (already cleaned, one row per fill):\n  fills  : person_id, fill_date (datetime), days_supply (int)            # index agent fills only, post-cohort\n  index  : person_id, index_date (datetime), arm                          # first qualifying fill per person\n  outcome: person_id, death_date (datetime|NaT), disenroll_date, end_date # censoring / competing-event dates\nCarry-over caps stockpiling so 90-day mail fills do not mask later gaps. Death is a competing event (code 2), not\ncensoring. Returns one row per person with persistence_days and a status code for cause-specific / Fine-Gray models.",
        "dependencies": [
          "pandas",
          "numpy",
          "lifelines"
        ],
        "source_citations": [
          "andrade-2006",
          "gardarsdottir-2010"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\nlibrary(survival)\nlibrary(cmprsk)\n\nPERMISSIBLE_GAP <- 60L; MAX_CARRYOVER <- 30L; HORIZON <- 365L\n\npersistence_episodes <- function(fills, index, outcome) {\n  setDT(fills); setDT(index); setDT(outcome)\n  f <- merge(fills, index[, .(person_id, index_date)], by = \"person_id\")\n  f <- f[fill_date >= index_date][order(person_id, fill_date)]\n\n  disc <- f[, {\n    supply_end <- NA; disc_date <- as.Date(NA)\n    for (i in seq_len(.N)) {\n      if (is.na(supply_end)) { supply_end <- fill_date[i] + days_supply[i]; next }\n      gap <- as.integer(fill_date[i] - supply_end)\n      if (gap >= PERMISSIBLE_GAP) { disc_date <- supply_end + 1L; break }\n      carry <- max(min(as.integer(supply_end - fill_date[i]), MAX_CARRYOVER), 0L)\n      supply_end <- fill_date[i] + days_supply[i] + carry\n    }\n    .(disc_date = disc_date)\n  }, by = person_id]\n\n  epi <- Reduce(function(a, b) merge(a, b, by = \"person_id\"), list(disc, index, outcome))\n  admin <- epi$index_date + HORIZON\n  cens  <- pmin(epi$disenroll_date, epi$end_date, admin, na.rm = TRUE)\n\n  d_disc  <- fifelse(is.na(epi$disc_date),  as.Date(\"9999-12-31\"), epi$disc_date)\n  d_death <- fifelse(is.na(epi$death_date), as.Date(\"9999-12-31\"), epi$death_date)\n  disc_first  <- d_disc  <= cens & d_disc  <= d_death\n  death_first <- d_death <= cens & d_death <  d_disc\n  epi[, status := fifelse(disc_first, 1L, fifelse(death_first, 2L, 0L))]\n  evt <- fifelse(disc_first, d_disc, fifelse(death_first, d_death, cens))\n  epi[, persistence_days := pmin(pmax(as.integer(evt - index_date), 0L), HORIZON)]\n  epi[, .(person_id, arm, persistence_days, status)]\n}\n\n# Cause-specific Cox (etiologic) and Fine-Gray subdistribution (prognostic CIF) for the index-drug contrast.\nfit_persistence <- function(epi) {\n  cs <- coxph(Surv(persistence_days, status == 1L) ~ arm, data = epi)   # cause-specific hazard of discontinuation\n  fg <- crr(epi$persistence_days, epi$status, cov1 = 
model.matrix(~ arm, epi)[, -1, drop = FALSE], failcode = 1, cencode = 0)\n  list(cause_specific = cs, fine_gray = fg)\n}",
        "description": "Treatment-episode construction and competing-risks persistence with data.table + survival/cmprsk. Inputs mirror the\nPython version:\n  fills  : person_id, fill_date (Date), days_supply (integer)\n  index  : person_id, index_date (Date), arm\n  outcome: person_id, death_date (Date), disenroll_date (Date), end_date (Date)\nDeath is a competing event; report the cumulative incidence function (Fine-Gray), not 1 - Kaplan-Meier, when mortality\nis non-trivial.",
        "dependencies": [
          "data.table",
          "survival",
          "cmprsk"
        ],
        "source_citations": [
          "andrade-2006",
          "gardarsdottir-2010"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "%let gap = 60;          /* permissible gap (sensitivity: 30, 90) */\n%let carry = 30;        /* max carry-over surplus days           */\n%let horizon = 365;\n\n/* Walk each person's fill timeline; flag the first qualifying gap as discontinuation. */\nproc sort data=work.fills; by person_id fill_date; run;\n\ndata episodes;\n  merge work.fills(in=a) work.index(keep=person_id index_date);\n  by person_id;\n  retain supply_end disc_date;\n  format supply_end disc_date date9.;\n  if first.person_id then do; supply_end = .; disc_date = .; end;\n  if fill_date >= index_date then do;\n    if supply_end = . then supply_end = fill_date + days_supply;\n    else do;\n      gap = fill_date - supply_end;\n      if gap >= &gap and disc_date = . then disc_date = supply_end + 1;\n      else do;\n        carry = max(min(supply_end - fill_date, &carry), 0);\n        supply_end = fill_date + days_supply + carry;\n      end;\n    end;\n  end;\n  if last.person_id then output;\n  keep person_id index_date disc_date;\nrun;\n\n/* Assign competing-risks status (1=discontinuation, 2=death, 0=censored) and persistence_days. */\nproc sql;\n  create table surv as\n  select e.person_id, i.arm, e.index_date,\n         min(o.disenroll_date, o.end_date, e.index_date + &horizon) as cens format=date9.,\n         e.disc_date, o.death_date,\n         case when e.disc_date ne . and e.disc_date <= calculated cens\n                   and (o.death_date = . or e.disc_date <= o.death_date) then 1\n              when o.death_date ne . and o.death_date <= calculated cens\n                   and (e.disc_date = . 
or o.death_date < e.disc_date)   then 2\n              else 0 end as status,\n         min(max(coalesce(case calculated status when 1 then e.disc_date\n                                                 when 2 then o.death_date\n                                                 else calculated cens end) - e.index_date, 0), &horizon)\n              as persistence_days\n  from episodes e\n    inner join work.index i  on e.person_id = i.person_id\n    inner join work.outcome o on e.person_id = o.person_id;\nquit;\n\n/* Kaplan-Meier persistence curves by arm (treats only event=1 as the failure). */\nproc lifetest data=surv plots=survival(atrisk);\n  time persistence_days*status(0 2);   /* censor codes 0 and 2 -> naive KM; see CIF below for competing risks */\n  strata arm;\nrun;\n\n/* Competing-risks CIF of discontinuation with death as competitor (Fine-Gray subdistribution). */\nproc phreg data=surv plots(overlay)=cif;\n  class arm;\n  model persistence_days*status(0) = arm / eventcode=1;   /* status=2 (death) handled as competing event */\n  hazardratio arm;\nrun;\n\n/* Cause-specific hazard of discontinuation (etiologic contrast). */\nproc phreg data=surv;\n  class arm;\n  model persistence_days*status(0 2) = arm;               /* death censored -> cause-specific */\n  hazardratio arm;\nrun;",
        "description": "Treatment-episode construction (PROC SQL + data step), Kaplan-Meier persistence curves (PROC LIFETEST), and\ncompeting-risks estimation (PROC PHREG cause-specific and eventcode= Fine-Gray). Required inputs (post data-management):\n  work.fills   : person_id, fill_date, days_supply       (index-agent pharmacy fills only)\n  work.index   : person_id, index_date, arm\n  work.outcome : person_id, death_date, disenroll_date, end_date\nPROC PHREG eventcode= requires SAS/STAT 13.1+. Use the CIF (PROC LIFETEST plots=cif or PHREG eventcode=) rather than\n1 - KM when death is a non-trivial competing risk in elderly claims.",
        "dependencies": [],
        "source_citations": [
          "andrade-2006",
          "gardarsdottir-2010"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": "persistence-time-to-discontinuation-timeline.svg",
        "mermaid": null,
        "caption": "Maria's three fills provide continuous coverage from Jan 15 through Apr 13 (89 days). After Fill C's supply runs out on Apr 13, no refill arrives within the 60-day permissible window. The gap is confirmed on Jun 13, but discontinuation is backdated to Apr 14 — the first day she had no supply. Persistence = Apr 14 minus Jan 15 = 89 days. PDC over this same 89-day span would be 89/89 = 1.0 while she was on therapy, but PDC over a fixed 180-day window would be 89/180 = 0.49.",
        "alt_text": "A horizontal timeline showing three consecutive 30-day fill bars (Fill A Jan 15, Fill B Feb 14, Fill C Mar 15) shaded as covered days, followed by a 60-day gap bar starting Apr 14 with no fill arriving. A marker labeled 'Discontinuation Apr 14' sits at the start of the gap bar. The total persistent period is annotated as 89 days.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Idx[Index fill: index_date, days_supply] --> Build[Build supply timeline<br/>covered = fill_date .. fill_date+days_supply<br/>carry-over capped]\n  Build --> Gap{Next fill_date - supply_end >= permissible gap?}\n  Gap -- No --> Build\n  Gap -- Yes --> Disc[Discontinuation = supply_end + 1]\n  Build --> Comp{Death / disenrollment / end of data first?}\n  Comp -- Death --> CR[Competing event<br/>keep in risk set for CIF]\n  Comp -- Disenroll/End --> Cen[Censor]\n  Disc --> Est[Estimand: cause-specific hazard<br/>vs subdistribution / CIF]\n  CR --> Est\n  Cen --> Est",
        "caption": "Persistence as a competing-risks time-to-event endpoint. The permissible-gap rule dates discontinuation; death is a competing event (not plain censoring), driving the choice between cause-specific and Fine-Gray estimands.",
        "alt_text": "Flowchart from index fill through supply-timeline construction, the permissible-gap decision, discontinuation dating, and competing-event versus censoring handling into the estimand choice.",
        "source_type": "illustrative",
        "source_citations": [
          "andrade-2006"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "gantt\n  title Permissible-gap logic for one patient (60-day gap, with carry-over)\n  dateFormat YYYY-MM-DD\n  axisFormat %b %Y\n  section Fills\n  Fill 1 days_supply 30 :f1, 2024-01-01, 30d\n  Fill 2 (refill on time) :f2, 2024-01-29, 30d\n  Fill 3 (late, within gap) :f3, 2024-03-10, 30d\n  section Gap\n  Supply exhausted -> gap clock :crit, g1, 2024-04-09, 60d\n  Discontinuation date (supply_end + 1, backdated) :milestone, d1, 2024-04-10, 0d\n  Gap confirmed (no fill by +60d) :milestone, d2, 2024-06-08, 0d",
        "caption": "One patient's timeline. Fills covered through 2024-04-09; with no qualifying fill within the 60-day permissible gap, discontinuation is dated 2024-04-10 (supply_end + 1, backdated to the day after supply ran out). The gap is only confirmed once the 60-day window elapses with no fill (2024-06-08), but the event date itself is the backdated 2024-04-10. A different gap length would move the confirmation point — hence mandatory sensitivity analysis.",
        "alt_text": "Gantt chart showing three fills, supply exhaustion, a 60-day permissible gap window, the backdated discontinuation milestone at supply_end + 1, and a separate marker at the end of the gap window when the gap is confirmed.",
        "source_type": "illustrative",
        "source_citations": [
          "gardarsdottir-2010"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "tradeoff_with",
        "target_slug": "pdc-proportion-of-days-covered",
        "notes": "Complementary but distinct: PDC measures implementation/adherence intensity over a fixed denominator while on therapy; persistence measures duration until discontinuation as a time-to-event. Many studies report both."
      },
      {
        "relation_type": "tradeoff_with",
        "target_slug": "mpr-medication-possession-ratio",
        "notes": "MPR captures supply intensity (can exceed 1.0 with over-supply); persistence captures when the patient stops. Persistence often defines the on-therapy windows within which MPR/PDC are computed."
      },
      {
        "relation_type": "used_with",
        "target_slug": "switch-add-on-augmentation-rwe",
        "notes": "Persistence gap logic is the substrate for switch/augmentation algorithms; a fill of an interchangeable agent within the gap can convert a discontinuation into treatment-episode continuation."
      },
      {
        "relation_type": "see_also",
        "target_slug": "primary-non-adherence-initiation",
        "notes": "Primary non-adherence (an order never filled) is invisible in claims and precedes persistence; order-based EHR persistence overstates continuation by missing it."
      },
      {
        "relation_type": "used_with",
        "target_slug": "treatment-patterns-lines-of-therapy",
        "notes": "Treatment-episode persistence and restart-vs-new-episode rules define the episodes that lines-of-therapy and cumulative-exposure algorithms count."
      },
      {
        "relation_type": "see_also",
        "target_slug": "active-comparator-new-user",
        "notes": "Comparative persistence is typically estimated within an active-comparator new-user cohort so the two arms are incident users with aligned time zero."
      }
    ],
    "aliases": [
      "persistence",
      "time to discontinuation",
      "non-persistence",
      "treatment persistence",
      "refill gap method",
      "treatment episode duration",
      "permissible gap method"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "hta",
      "pqa-cms"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "person-time-denominator-construction-rwe",
    "name": "Person-Time Denominator Construction",
    "short_definition": "The rule for accumulating each subject's at-risk follow-up time — from a defined origin to the first of event, competing event, or censoring — and splitting it across time scales to form the denominator of an incidence rate.",
    "long_description": "**Person-time denominator construction** is the operational machinery that turns a cohort of subjects into the denominator\nof a rate: it decides, for every individual, *when their clock starts*, *when it stops*, and *which intervals of calendar,\nage, and time-since-exposure that follow-up gets credited to*. The numerator of an incidence rate is a count of events; the\ndenominator is the sum of at-risk time over everyone who could have had the event. Get the time-at-risk wrong and the rate\nis wrong, regardless of how cleanly the outcome was ascertained. In RWE this is rarely a one-line `time = exit - entry`\ncalculation — it is enrollment spans intersected with exposure episodes intersected with the at-risk window, then Lexis-split\nso that a piecewise-constant rate can be estimated on the right scale.\n\n**Core conceptual distinction**. Three anatomical decisions define the denominator and are separable. (1) *Origin (time\nzero)*: the index/treatment-initiation date for an incident-user analysis, or diagnosis/enrollment for a natural-history\ncohort. Person-time accrued before the exposure could possibly act must not be credited to the exposure — doing so is the\nmechanism of **immortal time bias**, where survivors are guaranteed event-free until they qualify and that \"immortal\" span\nis misallocated to the treated denominator, deflating its rate. (2) *Exit and the censoring hierarchy*: follow-up ends at the\nfirst of the outcome event, a competing event (e.g., death when the outcome is a non-fatal hospitalization), administrative\nend-of-data, loss of observability (disenrollment), and — for an as-treated estimand — treatment discontinuation or switch.\nWhich exits *stop the clock* versus *count as the event* is an estimand decision, not a programming detail. 
(3) *Time-scale\nsplitting (the Lexis diagram)*: the hazard is almost never constant on a single scale, so a person's interval is cut at\nage-band boundaries, calendar-year boundaries, and time-since-initiation boundaries, producing multiple (tstart, tstop,\nevent) rows per person. Each row contributes its person-time to one cell; rates are then estimated per cell or modeled with\nthese counting-process records (`Surv(tstart, tstop, event)`), which is also exactly the input a time-dependent Cox model\nconsumes. The estimand distinction matters: an **intention-to-treat (initiation) denominator** credits all post-time-zero\nobservable time to the assigned arm regardless of later switching; an **as-treated denominator** credits only on-treatment\ntime (days_supply stitched with grace periods, censored at discontinuation/switch) and usually requires\ninverse-probability-of-censoring weighting because discontinuation is informative.\n\n**Pros, cons, and trade-offs**.\n- **vs a naive `exit - entry` single-interval denominator:** Explicit interval construction (enrollment ∩ exposure ∩\n  at-risk) prevents crediting unobservable or post-exit time and prevents immortal time. Cost: substantially more programming\n  and diagnostics. **Prefer interval construction** for any rate that informs a regulatory or HTA decision.\n- **vs unsplit person-time with a single average rate:** Lexis splitting lets the rate vary by age, calendar time, and\n  time-since-exposure and is mandatory for direct/indirect standardization and for time-dependent hazard models. Cost: a\n  larger, multi-row analytic table and care that no person-time is double-counted across cells. 
**Prefer splitting** whenever\n  the rate plausibly changes on a scale you care about (it almost always does with age and follow-up duration).\n- **vs cumulative-incidence (risk) denominators:** A rate denominator (person-time) handles staggered entry, variable\n  follow-up, and censoring naturally and is the right tool when follow-up length differs across subjects; a risk denominator\n  (persons at baseline) answers \"what fraction by time t\" and needs competing-risk handling (cumulative incidence function).\n  **Prefer person-time rates** for sparse events with heterogeneous follow-up; **prefer risk/CIF** when an absolute\n  probability over a fixed horizon is the decision-relevant quantity and competing events are common.\n\n**When to use**. Any incidence-rate, mortality-rate, or event-rate estimate from longitudinal RWE with variable follow-up;\nas the denominator for direct or indirect standardization (SMR/SIR); as the counting-process input to time-dependent Cox or\nPoisson/negative-binomial rate models; whenever subjects enter at different calendar times, are observed for different\ndurations, or move between exposure states during follow-up.\n\n**When NOT to use — and when it is actively misleading or dangerous**.\n- **When the decision quantity is an absolute probability over a fixed horizon with substantial competing risk.** A rate per\n  1,000 person-years can be multiplied into a misleading \"risk\" by ignoring that death removes people from the at-risk set;\n  report the cumulative incidence function instead. In elderly claims cohorts where the competing risk of death differs by\n  exposure arm, naive rate comparisons are *actively misleading* about absolute burden.\n- **When immortal time is being silently credited.** If time zero is set at diagnosis but exposure is defined by a later\n  prescription, the gap is immortal and must either be excluded, treated as unexposed, or handled with a time-dependent\n  exposure. 
Crediting it to the exposed denominator manufactures a spurious protective effect — the classic Suissa failure\n  mode in procedure and prescription studies.\n- **When the denominator includes unobservable time.** Counting calendar time during which the data source cannot see events\n  (a disenrolled gap, an MA-only span with no fee-for-service claims) inflates the denominator and deflates the rate. This is\n  a denominator error masquerading as a low event rate.\n- **When prevalent subjects enter with unaccounted left truncation.** Registry cohorts that enroll prevalent cases start the\n  clock after the origin; treating them as if observed from origin biases rate estimates (late-entry / delayed-entry must be\n  modeled).\n\n**Data-source operational depth**.\n- **Claims (FFS):** Person-time is the intersection of continuous medical/pharmacy enrollment spans with the at-risk window.\n  Require continuous enrollment so that *absence of a claim is true absence, not unobservability*. Failure modes: (a)\n  **Medicare Advantage (MA)-only spans carry no fee-for-service claims** — events are invisible there, so that calendar time\n  must be *removed from the denominator* (censor at MA entry), not assumed event-free; drop the person-time, not the person.\n  (b) **Plan-switch and coverage gaps** create unobservable intervals — split the at-risk interval at the gap and exclude it.\n  (c) **Adjudication/run-out lag** means the most recent months of claims are incomplete; truncate the study end before the\n  lag horizon or person-time near the end is under-eventful. (d) **90-day mail-order and stockpiling** distort as-treated\n  on-treatment windows and grace-period stitching.\n- **EHR:** Capture is visit-driven, so *absence of an encounter is not evidence of no event* — a patient who silently leaves\n  the system contributes apparent event-free person-time that is really unobserved. 
Define an observability window\n  (e.g., recent encounter within N months) and treat loss to follow-up as potentially informative censoring; link to\n  pharmacy/claims to confirm exposure episodes.\n- **Registry:** Strong for adjudicated outcomes and severity but often enrolls **prevalent** subjects, inducing left\n  truncation; the at-risk clock must start at the true origin with delayed entry, and registry completeness/linkage\n  eligibility define which person-time is observable.\n- **Linked claims–EHR–vital records:** The ideal substrate, but order, fill, and service dates disagree; reconcile them\n  before assigning time zero and interval boundaries, and recognize that only the linkable subset contributes person-time\n  (linkage selection).\n\n**Worked claims example.** Question: incidence rate of acute kidney injury (AKI) among new SGLT2 inhibitor initiators\naged ≥65 in fee-for-service Medicare. (1) **Origin / time zero:** first SGLT2i fill (NDC + `fill_date`), with ≥365 days of\ncontinuous Parts A/B/D FFS enrollment before it (washout confirms incident use and observability). (2) **At-risk start:**\ntime zero. (3) **At-risk exit = first of:** validated AKI dx (the event), death (a *competing* event — stops the clock, not\ncounted as AKI), disenrollment or MA entry (censor — unobservable thereafter), end of data minus the 3-month adjudication\nlag, and — for the as-treated denominator — last `days_supply` end + a 30-day grace period or switch to another antidiabetic.\n(4) **Compute person-time:** for each person, `person_days = at_risk_exit - time_zero` over observable, on-treatment spans,\nsplit at each calendar-year and age-band boundary. (5) **Arithmetic:** suppose across the cohort there are 240 first AKI\nevents and the summed observable on-treatment time is 511,000 person-days. Person-years = 511,000 / 365.25 = 1,399.04.\nIncidence rate = 240 / 1,399.04 × 1,000 = **171.5 per 1,000 person-years** (95% CI from the Poisson/exact interval). 
Note how\nexcluding MA-only and disenrolled spans *lowers the denominator and raises the rate* versus a naive `enroll_end - time_zero`\ndenominator that would have silently counted unobservable time.\n\n**Interpreting the output**\n\nDenominator construction for a 4-patient cohort yields 729 observable at-risk person-days (90 + 333 + 184 + 122), with 2 events; the resulting incidence rate denominator is 729 person-days (approximately 2.0 person-years).\n\n*Formal interpretation.* Each patient contributes person-time only during intervals satisfying three conditions: (1) active enrollment in a plan type with observable claims — Medicare FFS or commercial; MA-only periods are excluded because utilization is unobservable; (2) the patient has not yet experienced the outcome; and (3) the observation falls within the at-risk window defined by time zero and censoring rules. The 729-person-day total is the denominator for the incidence rate, not the 4-patient head count. Patients contribute different amounts of time because of staggered entry, variable follow-up, and administrative censoring. Lexis-splitting at calendar-year and age-band boundaries enables age- and calendar-stratified rates from the same denominator dataset.\n\n*Practical interpretation.* Misspecifying the denominator — for example, using enrollment end rather than the MA-exclusion-aware at-risk exit date — silently inflates observed time and underestimates the true event rate. The correct 729-person-day denominator is the foundation of any subsequent incidence rate, rate ratio, or standardized rate calculation. Report total person-time, mean and range of individual follow-up, and the number and reason for each type of censoring so that reviewers can audit the at-risk construction independently.",
    "primary_category": "Descriptive_Epidemiology",
    "tags": [
      "person-time",
      "time-at-risk",
      "incidence-rate",
      "lexis-splitting",
      "counting-process",
      "censoring",
      "immortal-time",
      "pharmacoepidemiology"
    ],
    "applies_to_study_types": [
      "claims_analysis",
      "ehr_study",
      "registry_linkage",
      "comparative_effectiveness",
      "target_trial_emulation"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1093/aje/kwm324",
        "url": "https://doi.org/10.1093/aje/kwm324",
        "citation_text": "Suissa S. Immortal time bias in pharmacoepidemiology. American Journal of Epidemiology. 2008;167(4):492-499.",
        "year": 2008,
        "authors_short": "Suissa",
        "notes": "The definitive account of how mis-assigning person-time (crediting pre-exposure immortal time to the exposed denominator) biases incidence-rate comparisons; the canonical motivation for careful time-at-risk construction."
      },
      {
        "role": "explain",
        "doi": "10.1136/bmj.k182",
        "url": "https://doi.org/10.1136/bmj.k182",
        "citation_text": "Hernán MA. How to estimate the effect of treatment duration on survival outcomes using observational data. BMJ. 2018;360:k182.",
        "year": 2018,
        "authors_short": "Hernán",
        "notes": "Frames time zero, time-at-risk, and the ITT vs as-treated denominator under target-trial logic, clarifying which person-time belongs to which strategy."
      },
      {
        "role": "demonstrate",
        "doi": "10.1136/bmj.b5087",
        "url": "https://doi.org/10.1136/bmj.b5087",
        "citation_text": "Lévesque LE, Hanley JA, Kezouh A, Suissa S. Problem of immortal time bias in cohort studies: example using statins for preventing progression of diabetes. BMJ. 2010;340:b5087.",
        "year": 2010,
        "authors_short": "Lévesque et al.",
        "notes": "A fully worked statins-in-diabetes example showing the correct time-dependent allocation of person-time and the magnitude of the rate bias when it is done wrong."
      },
      {
        "role": "demonstrate",
        "doi": "10.1002/sim.2764",
        "url": "https://doi.org/10.1002/sim.2764",
        "citation_text": "Carstensen B. Age-period-cohort models for the Lexis diagram. Statistics in Medicine. 2006;26(15):3018-3045.",
        "year": 2006,
        "authors_short": "Carstensen",
        "notes": "The reference treatment of splitting person-time across age and calendar (and follow-up) scales on the Lexis diagram, the technical basis for multi-time-scale denominator construction."
      }
    ],
    "plain_language_summary": "When a study follows people over time to count new disease events, the denominator of the rate is not the number of people — it is the total amount of time all those people were actually being watched and were at risk of having the event. Each participant contributes days from the moment they enter the study until they either have the event, leave the study, or the data run out; those individual stretches of time are then added together to form the denominator. This total accumulated follow-up time (called the time-at-risk) is expressed in person-days or person-years, and dividing the event count by it gives the incidence rate. The approach handles real-world messiness naturally, because people who are followed for different lengths of time each contribute only as much time as they were actually observed.",
    "key_terms": [
      {
        "term": "time-at-risk",
        "definition": "The stretch of calendar days during which a specific participant could have had the study event and the data source would have captured it — it starts when they enter the study and ends when they have the event, leave, or the data end."
      },
      {
        "term": "person-days (person-years)",
        "definition": "A unit that combines people and time: one person followed for 90 days contributes 90 person-days; summing across all participants gives the total time-at-risk that forms the denominator of an incidence rate."
      },
      {
        "term": "censoring",
        "definition": "What happens when a participant's follow-up ends for a reason other than the event of interest — for example, they lose insurance coverage or the study period ends — meaning we stop counting their time but do not count them as having had the event."
      },
      {
        "term": "index date",
        "definition": "The date a participant officially 'enters the clock' — for example, the date of their first prescription fill for the drug being studied — which is the starting point for counting their time-at-risk."
      },
      {
        "term": "incidence rate",
        "definition": "The number of new events divided by the total time-at-risk across the study population, usually expressed per 1,000 person-years; person-time denominator construction produces the divisor of this calculation."
      }
    ],
    "worked_example": {
      "scenario": "A claims database study is tracking new users of a blood-pressure drug to measure how often they develop a kidney problem (the outcome event) in 2023. Four patients start the drug at different times during the year. The analyst needs to figure out how much time each patient was actually being observed and at risk, then sum those amounts to build the denominator of the incidence rate.",
      "dataset": {
        "caption": "One row per patient showing their entry date, the date their follow-up ended, why it ended, and the number of days they were at risk (exit date minus entry date).",
        "columns": [
          "person_id",
          "entry_date",
          "exit_date",
          "exit_reason",
          "days_at_risk"
        ],
        "rows": [
          [
            1001,
            "2023-01-01",
            "2023-04-01",
            "outcome event",
            90
          ],
          [
            1002,
            "2023-02-01",
            "2023-12-31",
            "study period ended (administrative)",
            333
          ],
          [
            1003,
            "2023-03-01",
            "2023-09-01",
            "lost insurance coverage (censored)",
            184
          ],
          [
            1004,
            "2023-06-01",
            "2023-10-01",
            "outcome event",
            122
          ]
        ]
      },
      "steps": [
        "Patient 1001 entered 2023-01-01 and had the kidney event on 2023-04-01 — the clock stops on the event date, giving 90 days at risk (Jan: 31 + Feb: 28 + Mar: 31 = 90).",
        "Patient 1002 entered 2023-02-01 and was still event-free when the study data ended on 2023-12-31 — this is an administrative end, not an event, so their 333 days count fully in the denominator.",
        "Patient 1003 entered 2023-03-01 but lost insurance coverage on 2023-09-01, meaning the database can no longer see their events — the clock is stopped at that date (censored) after 184 days (Mar: 31 + Apr: 30 + May: 31 + Jun: 30 + Jul: 31 + Aug: 31 = 184).",
        "Patient 1004 entered 2023-06-01 and had the kidney event on 2023-10-01 — the clock stops on the event date, giving 122 days at risk (Jun: 30 + Jul: 31 + Aug: 31 + Sep: 30 = 122).",
        "Sum all four contributions: 90 + 333 + 184 + 122 = 729 person-days total time-at-risk — this is the denominator.",
        "Two outcome events occurred (patients 1001 and 1004), so the crude incidence rate = 2 events / 729 person-days = 0.00274 events per person-day, or equivalently 2 / (729 / 365.25) = 2 / 1.996 person-years ≈ 1.002 events per person-year (roughly 1,002 per 1,000 person-years in this tiny four-patient illustration)."
      ],
      "result": "Total time-at-risk = 90 + 333 + 184 + 122 = 729 person-days. Two events occurred. Incidence rate = 2 events / 729 person-days.",
      "timeline_spec": {
        "title": "Each patient's at-risk window — from entry to event or censoring — summing to 729 person-days",
        "window": {
          "start": "2023-01-01",
          "end": "2023-12-31",
          "label": "2023 study period"
        },
        "events": [
          {
            "label": "Patient 1001",
            "start": "2023-01-01",
            "length_days": 90,
            "quantity": "90 days at risk → outcome event"
          },
          {
            "label": "Patient 1002",
            "start": "2023-02-01",
            "length_days": 333,
            "quantity": "333 days at risk → administrative end"
          },
          {
            "label": "Patient 1003",
            "start": "2023-03-01",
            "length_days": 184,
            "quantity": "184 days at risk → lost coverage (censored)"
          },
          {
            "label": "Patient 1004",
            "start": "2023-06-01",
            "length_days": 122,
            "quantity": "122 days at risk → outcome event"
          }
        ],
        "spans": [
          {
            "kind": "followup",
            "start": "2023-01-01",
            "end": "2023-04-01",
            "label": "90 days (Patient 1001)"
          },
          {
            "kind": "followup",
            "start": "2023-02-01",
            "end": "2023-12-31",
            "label": "333 days (Patient 1002)"
          },
          {
            "kind": "followup",
            "start": "2023-03-01",
            "end": "2023-09-01",
            "label": "184 days (Patient 1003)"
          },
          {
            "kind": "followup",
            "start": "2023-06-01",
            "end": "2023-10-01",
            "label": "122 days (Patient 1004)"
          }
        ],
        "result": {
          "label": "Total person-days = 90 + 333 + 184 + 122 = 729 person-days",
          "value": 729
        },
        "caption": "Each horizontal bar shows one patient's at-risk window from their entry date to whichever exit came first — an outcome event (patients 1001 and 1004), administrative study end (patient 1002), or loss of insurance coverage (patient 1003). The lengths of the four bars add up to exactly 729 person-days, which is the denominator of the incidence rate.",
        "alt_text": "Gantt-style timeline showing four horizontal bars for patients 1001 through 1004, each starting at a different date in 2023 and ending at different points; bar lengths of 90, 333, 184, and 122 days are labeled, and a summary note shows they sum to 729 person-days."
      }
    },
    "prerequisites": [
      "time-zero-index-date-alignment-rwe",
      "continuous-enrollment-observable-time-rwe",
      "incidence-rate-calculation-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Intention-to-treat (initiation) denominator",
        "description": "All observable person-time after time zero is credited to the arm assigned at initiation, regardless of later switching or discontinuation. Censoring is only for unobservability (disenrollment, MA entry), death/competing events, and end of data.",
        "edge_cases": [
          "Long ITT follow-up dilutes on-treatment effects as patients drift off therapy (treatment-effect attenuation), but is robust to informative discontinuation.",
          "Still must exclude unobservable spans (MA-only, coverage gaps) from the denominator."
        ],
        "data_source_notes": "claims: intersect with continuous-enrollment spans only; do not require ongoing supply. registry: handle prevalent-case left truncation with delayed entry."
      },
      {
        "name": "As-treated (on-treatment) denominator",
        "description": "Only on-treatment person-time counts — built by stitching days_supply into episodes with a grace period, censoring at discontinuation (last supply end + grace) or switch.",
        "edge_cases": [
          "Differential discontinuation by exposure makes the censoring informative; inverse-probability-of-censoring weighting is usually required.",
          "90-day mail-order, stockpiling, and bundled inpatient supply distort episode boundaries."
        ],
        "data_source_notes": "claims: episode construction from fill_date + days_supply with stockpiling/grace rules; inpatient stays may suppress outpatient fills."
      },
      {
        "name": "Lexis-split multi-time-scale denominator",
        "description": "Each at-risk interval is cut at age-band, calendar-year, and time-since-initiation boundaries, producing multiple (tstart, tstop, event) rows per person, each contributing to one rate cell.",
        "edge_cases": [
          "Boundary handling must avoid double-counting a single day at a cut point and must carry the event flag only to the row containing the event.",
          "Very fine splitting inflates the analytic table; choose cut granularity to match the scale on which the rate truly varies."
        ],
        "data_source_notes": "all sources: splitting is purely a derived-data operation on (entry, exit, event); it precedes PROC STDRATE / Poisson / time-dependent Cox."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Naive exit-minus-entry single-interval denominator",
        "pros_of_this": "Prevents crediting unobservable or pre-exposure (immortal) time; the denominator reflects only time when an event could occur and be seen.",
        "cons_of_this": "Requires interval intersection (enrollment ∩ exposure ∩ at-risk) and diagnostics to confirm no double-counting.",
        "when_to_prefer": "Any regulatory- or HTA-grade incidence/mortality rate, especially with staggered entry, switching, or coverage gaps."
      },
      {
        "compared_to": "Unsplit person-time with a single average rate",
        "pros_of_this": "Lets the rate vary by age, calendar time, and follow-up duration; enables standardization and time-dependent hazard models.",
        "cons_of_this": "Produces a multi-row analytic table requiring careful boundary handling to avoid double-counting.",
        "when_to_prefer": "Whenever the hazard plausibly changes with age or time since initiation (nearly always), or when standardization is required."
      },
      {
        "compared_to": "Cumulative-incidence (risk) denominator",
        "pros_of_this": "Handles variable follow-up, staggered entry, and censoring naturally; ideal for sparse events with heterogeneous observation.",
        "cons_of_this": "A rate is not an absolute probability; multiplying it into a risk ignores competing-risk depletion of the at-risk set.",
        "when_to_prefer": "Sparse events with unequal follow-up; switch to the cumulative incidence function when an absolute fixed-horizon probability is the decision quantity and competing events are common."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Person-time = continuous-enrollment spans ∩ at-risk window. Exclude MA-only spans (no FFS claims), coverage gaps, and the trailing adjudication-lag window from the denominator; censor at disenrollment/MA entry rather than assuming event-free. Build as-treated time from fill_date + days_supply with grace/stockpiling rules.",
      "ehr": "Visit-driven capture means absence of an encounter is not absence of an event; define an observability window and treat silent system departure as potentially informative censoring. Link to pharmacy/claims to confirm exposure episodes that define on-treatment person-time.",
      "registry": "Often enrolls prevalent subjects; start the at-risk clock at the true origin with delayed entry (left truncation) and bound observable person-time by registry completeness and linkage eligibility.",
      "linked": "Reconcile order/fill/service date discrepancies before assigning time zero and interval boundaries; only the linkable subset contributes person-time (linkage selection)."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\nimport numpy as np\nfrom scipy.stats import chi2  # exact Poisson CI\n\ndef observable_at_risk(cohort: pd.DataFrame, enroll: pd.DataFrame) -> pd.DataFrame:\n    \"\"\"Intersect [time_zero, exit_date) with FFS-observable (non-MA) enrollment spans.\"\"\"\n    e = enroll[~enroll[\"ma_only\"]].merge(\n        cohort[[\"person_id\", \"time_zero\", \"exit_date\", \"event\"]], on=\"person_id\")\n    # Clip each enrollment span to the at-risk window.\n    e[\"seg_start\"] = e[[\"enroll_start\", \"time_zero\"]].max(axis=1)\n    e[\"seg_end\"]   = e[[\"enroll_end\", \"exit_date\"]].min(axis=1)\n    e = e[e[\"seg_end\"] > e[\"seg_start\"]].copy()\n    # The event only counts on the segment that actually contains exit_date.\n    e[\"event\"] = np.where((e[\"event\"] == 1) & (e[\"seg_end\"] == e[\"exit_date\"]), 1, 0)\n    return e[[\"person_id\", \"seg_start\", \"seg_end\", \"event\"]]\n\ndef split_calendar_year(seg: pd.DataFrame) -> pd.DataFrame:\n    \"\"\"Lexis split each observable segment at Jan-1 boundaries -> one row per (person, year).\"\"\"\n    rows = []\n    for r in seg.itertuples(index=False):\n        start, end = r.seg_start, r.seg_end\n        for yr in range(start.year, end.year + 1):\n            lo = max(start, pd.Timestamp(yr, 1, 1))\n            hi = min(end,   pd.Timestamp(yr + 1, 1, 1))\n            if hi <= lo:\n                continue\n            ev = int(r.event == 1 and hi == end)  # event flag only on terminal sub-row\n            rows.append((r.person_id, yr, (hi - lo).days, ev))\n    return pd.DataFrame(rows, columns=[\"person_id\", \"cal_year\", \"person_days\", \"event\"])\n\ndef incidence_rate(split: pd.DataFrame, per: int = 1000) -> pd.DataFrame:\n    g = split.groupby(\"cal_year\").agg(events=(\"event\", \"sum\"),\n                                      person_days=(\"person_days\", \"sum\")).reset_index()\n    g[\"person_years\"] = g[\"person_days\"] / 365.25\n    g[\"rate\"] = g[\"events\"] / 
g[\"person_years\"] * per\n    # Exact (Garwood) Poisson 95% CI on the count, rescaled to PY.\n    # With zero events the lower limit is 0 by definition (chi2.ppf at 0 degrees of freedom returns NaN).\n    lo = np.where(g[\"events\"] > 0, chi2.ppf(0.025, 2 * g[\"events\"]) / 2, 0.0)\n    hi = chi2.ppf(0.975, 2 * (g[\"events\"] + 1)) / 2\n    g[\"lcl\"] = lo / g[\"person_years\"] * per\n    g[\"ucl\"] = hi / g[\"person_years\"] * per\n    return g\n\nseg   = observable_at_risk(cohort, enroll)\nsplit = split_calendar_year(seg)\nprint(incidence_rate(split))",
        "description": "Build observable, on-treatment at-risk intervals and Lexis-split them by calendar year, then estimate the incidence rate.\nPython has no native Lexis splitter, so the interval logic is explicit (a teaching point in itself). Required inputs\n(already cleaned, one analysis cohort, time zero already assigned):\n  cohort   : person_id, time_zero (datetime), exit_date (datetime = first of event/competing/censor), event (0/1 AKI)\n  enroll   : person_id, enroll_start, enroll_end (datetime), ma_only (bool)  # ma_only spans carry no FFS claims\nexit_date must already encode the censoring hierarchy (event vs competing vs disenroll vs end-of-data minus run-out).\nThis routine then removes MA-only / out-of-enrollment time from the denominator and splits at calendar-year boundaries.",
        "dependencies": [
          "pandas",
          "numpy",
          "scipy"
        ],
        "source_citations": [
          "suissa-2008",
          "carstensen-2006"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(survival)\n\n## Follow-up time in years from time zero; counting-process form (tstart, tstop, event).\ncohort$fu_years <- as.numeric(cohort$texit - cohort$t0) / 365.25\ncohort$tstart   <- 0\n\n## Lexis split on the follow-up time scale at 1-year cut points (also splittable on age via a second call).\ncuts  <- seq(0, ceiling(max(cohort$fu_years)), by = 1)\nsplit <- survSplit(Surv(fu_years, event) ~ ., data = cohort,\n                   cut = cuts, start = \"tstart\", end = \"tstop\", episode = \"fu_band\")\nsplit$pyears <- split$tstop - split$tstart  # person-years contributed by this sub-interval\n\n## Crude rate per 1,000 PY with an exact Poisson interval.\npt <- sum(split$pyears); d <- sum(split$event)\nrate <- d / pt * 1000\nci   <- poisson.test(d, T = pt)$conf.int * 1000\ncat(sprintf(\"IR = %.1f per 1,000 PY (95%% CI %.1f-%.1f)\\n\", rate, ci[1], ci[2]))\n\n## Rate model: Poisson with log person-time offset (rate ratio by arm and follow-up band).\nfit_rate <- glm(event ~ arm + factor(fu_band) + offset(log(pyears)),\n                family = poisson, data = split[split$pyears > 0, ])\n\n## Same denominator, hazard model: time-dependent Cox on the counting-process rows.\nfit_cox  <- coxph(Surv(tstart, tstop, event) ~ arm, data = split)",
        "description": "Lexis-split person-time with survival::survSplit, then estimate the rate with a Poisson model that carries log person-time as an offset\n(the standard rate model) and feed the same counting-process records to a time-dependent Cox model. Required inputs:\n  cohort : person_id, t0 (Date time zero), texit (Date, first of event/competing/censor),\n           event (0/1), arm (factor), age0 (numeric age at t0)\n  enroll : not read by this code; the intersection with observable enrollment is assumed already applied to [t0, texit].",
        "dependencies": [
          "survival"
        ],
        "source_citations": [
          "carstensen-2006",
          "levesque-2010"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* Lexis split at calendar-year and 5-year age boundaries; one row per sub-interval, event only on the terminal row. */\ndata split;\n  set work.cohort;\n  seg_start = t0;\n  do until (seg_start >= texit);\n    cy        = year(seg_start);\n    yr_bound  = mdy(1, 1, cy + 1);                 /* next Jan 1 */\n    age_now   = age0 + (seg_start - t0) / 365.25;\n    age_bound = t0 + ceil((floor(age_now/5)*5 + 5 - age0) * 365.25);  /* next 5-yr age cut */\n    seg_end   = min(texit, yr_bound, age_bound);\n    cal_year  = cy;\n    age_band  = floor(age_now / 5) * 5;\n    py        = (seg_end - seg_start) / 365.25;    /* person-years in this sub-interval */\n    ev        = (seg_end = texit) * event;         /* event flag only on the terminal sub-row */\n    tstart    = (seg_start - t0) / 365.25;\n    tstop     = (seg_end   - t0) / 365.25;\n    if py > 0 then output;\n    seg_start = seg_end;\n  end;\n  format seg_start seg_end date9.;\nrun;\n\n/* Collapse to event counts and person-time per stratum for a rate / standardized rate. */\nproc summary data=split nway;\n  class arm age_band;\n  var ev py;\n  output out=rates(drop=_type_ _freq_) sum(ev)=events sum(py)=pyears;\nrun;\n\n/* Direct age-standardization of the incidence rate (rate = events / person-time) by arm. */\nproc stdrate data=rates refdata=work.std_pop\n             method=direct stat=rate(mult=1000) plots=none;\n  population group=arm event=events total=pyears;    /* TOTAL= carries PERSON-TIME when STAT=RATE (population size when STAT=RISK) */\n  reference total=ref_pop;\n  strata age_band;\nrun;\n\n/* Same denominator, hazard model: time-dependent Cox using the counting-process (tstart, tstop) rows. */\nproc phreg data=split;\n  class arm(ref='COMPARATOR');\n  model (tstart, tstop)*ev(0) = arm / rl ties=efron;\nrun;",
        "description": "Data-step Lexis splitter producing (tstart, tstop, ev, py, cal_year, age_band) counting-process rows, then PROC STDRATE\nfor age-standardized rates with explicit person-time and PROC PHREG for the time-dependent hazard on the same records.\nRequired input datasets (post data-management; [t0, texit] already clipped to observable FFS enrollment):\n  work.cohort  : person_id, t0 (date), texit (date = first of event/competing/censor), event (0/1),\n                 arm ('STUDY'/'COMPARATOR'), age0 (age at t0)\n  work.std_pop : age_band, ref_pop (standard-population weight per age band, read by REFDATA=)",
        "dependencies": [],
        "source_citations": [
          "carstensen-2006",
          "suissa-2008"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": "person-time-denominator-construction-rwe-timeline.svg",
        "mermaid": null,
        "caption": "Each horizontal bar shows one patient's at-risk window from their entry date to whichever exit came first — an outcome event (patients 1001 and 1004), administrative study end (patient 1002), or loss of insurance coverage (patient 1003). The lengths of the four bars add up to exactly 729 person-days, which is the denominator of the incidence rate.",
        "alt_text": "Gantt-style timeline showing four horizontal bars for patients 1001 through 1004, each starting at a different date in 2023 and ending at different points; bar lengths of 90, 333, 184, and 122 days are labeled, and a summary note shows they sum to 729 person-days.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  T0[Origin / time zero<br/>e.g. first qualifying fill] --> Clip[At-risk window: time zero -> follow-up]\n  Enr[Continuous FFS enrollment spans<br/>drop MA-only and gap spans] --> Int[Intersect: enrollment INTERSECT at-risk]\n  Clip --> Int\n  Int --> Exit{First exit?}\n  Exit -->|Outcome event| Ev[Event: count in numerator,<br/>stop clock]\n  Exit -->|Competing event / death| Cmp[Censor: stop clock,<br/>NOT counted as event]\n  Exit -->|Disenroll / MA entry / gap| Cen[Censor: time unobservable<br/>excluded from denominator]\n  Exit -->|End of data minus run-out| Adm[Administrative censor]\n  Ev --> Split[Lexis split at age x calendar x<br/>time-since-initiation boundaries]\n  Cmp --> Split\n  Cen --> Split\n  Adm --> Split\n  Split --> Rate[Sum person-time per cell -><br/>rate = events / person-time]",
        "caption": "Constructing the person-time denominator. The at-risk window is intersected with observable enrollment, follow-up ends at the first exit under an explicit censoring hierarchy (event vs competing vs unobservable vs administrative), and the remaining observable at-risk time is Lexis-split before rates are computed.",
        "alt_text": "Flowchart from time zero through intersection with enrollment, an exit-hierarchy decision node (event, competing event, disenrollment, administrative censor), Lexis splitting, and rate computation.",
        "source_type": "illustrative",
        "source_citations": [
          "suissa-2008"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "gantt\n  title Person-time for one subject (claims) — observable, on-treatment at-risk time\n  dateFormat YYYY-MM-DD\n  axisFormat %b %Y\n  section Pre-time-zero\n  365-day washout + continuous FFS enrollment :done, wash, 2022-01-01, 2022-12-31\n  section At-risk (counts toward denominator)\n  Year 2023 person-time (FFS, on-treatment) :active, py1, 2023-01-01, 2023-12-31\n  Year 2024 person-time (FFS, on-treatment) :active, py2, 2024-01-01, 2024-04-15\n  section Excluded from denominator\n  MA-only span (no FFS claims -> unobservable) :crit, ma, 2024-04-15, 60d\n  section Exit\n  First AKI event = numerator, clock stops :milestone, ev, 2024-06-14, 0d",
        "caption": "One subject's contribution. Person-time accrues only during observable, on-treatment FFS time after time zero, is split at the calendar-year boundary, excludes the MA-only span (censored, not counted event-free), and stops at the first event.",
        "alt_text": "Gantt chart showing a washout before time zero, person-time accruing across 2023 and into 2024, an excluded MA-only span, and an event milestone where the clock stops.",
        "source_type": "illustrative",
        "source_citations": [
          "levesque-2010"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "produces",
        "target_slug": "incidence-rate-calculation-rwe",
        "notes": "The person-time denominator is the divisor of the incidence rate; the numerator is the event count over the same at-risk time."
      },
      {
        "relation_type": "used_with",
        "target_slug": "direct-standardization-rwe",
        "notes": "Direct/indirect standardization sums Lexis-split person-time within age (and other) strata to form stratum-specific rates before reweighting."
      },
      {
        "relation_type": "see_also",
        "target_slug": "immortal-time-bias-handling",
        "notes": "Mis-allocating pre-exposure person-time to the exposed denominator is the mechanism of immortal time bias; correct time-zero and time-dependent exposure prevent it."
      },
      {
        "relation_type": "requires",
        "target_slug": "continuous-enrollment-observable-time-rwe",
        "notes": "Observable enrollment spans define which calendar time can legitimately enter the denominator; unobservable spans (MA-only, gaps) must be excluded."
      },
      {
        "relation_type": "used_with",
        "target_slug": "competing-risks-cause-specific-fine-gray-rwe",
        "notes": "When death or another competing event ends follow-up, the rate denominator must treat it as a censoring/competing exit, and absolute burden should use the cumulative incidence function."
      },
      {
        "relation_type": "requires",
        "target_slug": "time-zero-index-date-alignment-rwe",
        "notes": "The origin of person-time is the index/time-zero date; misaligned time zero misallocates follow-up and creates immortal time."
      },
      {
        "relation_type": "see_also",
        "target_slug": "as-treated-risk-window-construction-rwe",
        "notes": "As-treated denominators are built from on-treatment episodes (days_supply + grace), in contrast to the ITT denominator that counts all post-time-zero observable time."
      },
      {
        "relation_type": "is_variant_of",
        "target_slug": "descriptive-epidemiology-rwe",
        "notes": "Person-time denominator construction is the operational core of rate-based descriptive epidemiology."
      }
    ],
    "aliases": [
      "person-time denominator construction",
      "time-at-risk construction",
      "person-time at risk",
      "follow-up time accrual",
      "person-years denominator"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "picots-framework-rwe",
    "name": "PICOTS Framework for RWE",
    "short_definition": "A structured question-framing scaffold (Population, Intervention/exposure, Comparator, Outcome, Timing, Setting/study design) that forces every operational decision in a real-world evidence study to be pre-specified before any code list, eligibility rule, or estimand is fixed.",
    "long_description": "**PICOTS** extends the classic PICO scaffold of evidence-based medicine with two elements that the\nrandomized-trial world can take for granted but that dominate the validity of a real-world evidence (RWE)\nstudy: **T**iming and **S**etting. In an RCT the protocol fixes who is randomized, when the clock starts,\nhow long they are followed, and where the data come from. In a database study none of that is given — it\nmust be *reconstructed* from claims or EHR records — so PICOTS is less a literature-search heuristic than\nthe master specification from which the eligibility logic, time-zero rule, exposure/outcome algorithms,\nand the estimand are derived. Each letter is a contract: P (the cohort and its lookback/washout), I (the\nexposure strategy and how initiation is coded), C (the reference strategy — usually an active comparator),\nO (the endpoint and its validated algorithm), T (lookback, washout, induction/latency, grace, follow-up,\ncensoring), and S (the data source(s), care system, geography, calendar window, and the analytic design\nthat ties it together).\n\n**Core conceptual distinction**. PICOTS is a *framing and specification* layer, not an *estimation* method.\nIts job is to make the research question answerable and the protocol auditable before any model is fit; it\ndoes not, by itself, control confounding, handle time-varying treatment, or define a summary measure. The\nsingle most consequential distinction it enforces is that **PICOTS is upstream of the estimand**: P + I/C + O\npin down three of the five estimand attributes (population, treatment, endpoint); T (timing) informs but does\nnot by itself fix the remaining two — the population-level summary (ATE vs ATT) and the intercurrent-event\nstrategy — which the estimand must still add. 
A vague PICOTS therefore guarantees a vague estimand.\nEqually, PICOTS is the layer where **time-zero is defined** — the most common source of avoidable bias in\nRWE — by forcing T and S to specify exactly when follow-up starts and what enrollment must be observed\nbefore it. A second distinction: PICOTS frames a *question*, whereas the target-trial framework operationalizes\nthat question into a *protocol*; PICOTS feeds the target trial (eligibility = P + S + T_lookback;\nassignment = I + C; follow-up + outcome = T + O), it does not replace it.\n\n**Pros, cons, and trade-offs** — specific and comparative.\n- **vs an unstructured \"research question\" paragraph:** PICOTS forces each operational decision into the open,\n  exposes the under-specified element (almost always T or S), and makes the protocol reviewable against FDA,\n  ENCePP/EMA, and HTA expectations. Cost: a well-filled PICOTS table can give a false sense of completeness —\n  it names the elements but does not test whether the comparator is clinically interchangeable or whether\n  positivity holds. **Prefer PICOTS** as the entry point for any RWE protocol; it is nearly free and prevents\n  post-hoc eligibility drift.\n- **vs PICO (without T and S):** plain PICO is adequate for an RCT systematic-review question where timing and\n  setting are fixed by the trials being summarized. In RWE, dropping T re-admits immortal-time and\n  induction/latency errors, and dropping S hides the data-source failure modes (MA-only person-time, EHR\n  leakage) that decide whether the question is even answerable. **Prefer PICOTS over PICO** for any\n  database study or target-trial emulation.\n- **vs the target-trial / estimand framework:** PICOTS is simpler, faster to communicate, and is the natural\n  *front end* of both. 
It is, however, deliberately silent on intercurrent-event handling, the\n  population-level summary (ATE vs ATT), and time-varying confounding — for those you must move down to the\n  estimand and g-methods layers. **Prefer PICOTS** to scope and align stakeholders first, then refine into a\n  target-trial protocol plus a written estimand. The two are complements, not substitutes.\n\n**When to use**. As the first artifact of essentially every RWE protocol, SAP, fit-for-purpose assessment,\nor evidence-synthesis question: comparative effectiveness/safety studies, target-trial emulations,\nsingle-arm-vs-external-control designs, HTA-facing comparative analyses, and multi-database/distributed\nstudies that need one common question with locally adapted code lists. Present it as a table — its discipline\nis the single most effective guard against the \"we'll figure out the population later\" drift that produces\npost-hoc eligibility changes and irreproducible cohorts. Use it to negotiate scope with clinical and\nregulatory stakeholders *before* a line of extraction code is written.\n\n**When NOT to use — and when it is actively misleading or dangerous**.\n- **As a substitute for an estimand or analysis plan.** A complete PICOTS table that omits the\n  intercurrent-event strategy (death, treatment switching, discontinuation) and the summary measure is *not*\n  a finished design. Treating \"we wrote a PICOTS table\" as sufficient is how protocols reach analysis with an\n  undefined estimand and an ITT-vs-per-protocol ambiguity discovered only at review.\n- **When the C element is filled in mechanically.** Naming a comparator does not make it valid. If the\n  comparator treats a different indication, a different line of therapy, or systematically different patients,\n  PICOTS will *look* complete while re-importing confounding by indication and channeling — the bias the\n  design exists to remove. 
PICOTS does not check interchangeability; clinical review and baseline balance do.\n- **When T is specified on paper but contradicted by the data's coverage.** A 36-month follow-up window in S =\n  commercial claims, where median continuous enrollment is ~2 years, is a specification that the data cannot\n  honor; the gap silently becomes differential administrative censoring. A PICOTS that is internally tidy but\n  incompatible with the chosen S is worse than none, because it confers false confidence.\n- **For purely descriptive or hypothesis-generating exploration** where forcing a single C and a fixed T would\n  prematurely narrow an exploratory aim. Use a looser descriptive frame and reserve PICOTS for the confirmatory\n  question that follows.\n\n**Data-source operational depth**.\n- **Claims (FFS commercial / Medicare FFS):** Each PICOTS element becomes code lists plus date logic. P and S\n  jointly set the continuous-enrollment requirement (medical + pharmacy across the full T lookback) so that a\n  \"washout\" reflects true absence of fills, not unobserved person-time. Failure mode: **Medicare Advantage\n  enrollees lack adjudicated FFS claims** — utilization and outcomes are largely invisible — so MA-only\n  person-time masquerades as a clean washout and as event-free follow-up; restrict to Parts A/B/D (or a\n  commercial medical+pharmacy benefit) and exclude MA-only spans. I/C from NDC + `days_supply` (watch 90-day\n  mail-order and free samples distorting episodes); procedures from CPT/ICD-10-PCS. **Differential competing\n  risks**: in elderly claims cohorts, death competes with the outcome and may differ by exposure arm, so a T/O\n  that ignores the competing event biases cumulative incidence — specify it in O, not just the model.\n- **EHR:** P and O can be sharpened with labs, vitals, staging, and notes (NLP), an advantage over claims for a\n  narrow P. 
But capture is **visit-driven**: a patient who seeks care elsewhere is differentially lost, so T\n  must define observation windows explicitly and treat loss to follow-up as potentially informative. I is the\n  *order/administration*, not a dispensing — link to pharmacy fills to confirm the patient actually started, or\n  a \"new user\" by order may never fill.\n- **Registry:** Cleanest for P (disease severity, stage) and adjudicated O (recurrence, death dates); typically\n  weak for complete I/C utilization and exact timing. Link to claims for the full fill history and to a death\n  index to firm up T (censoring).\n- **Linked claims–EHR–vital records:** The ideal substrate (EHR severity + claims completeness + reliable\n  mortality), but linkage introduces selection (only the linkable subset, which may not match the target P) and\n  order/fill/service date discrepancies that must be reconciled before T can assign a defensible time-zero.\n\n**Worked example (oncology new-user cohort, claims logic).**\nP: adults ≥18 with incident advanced NSCLC, no systemic anticancer fill or administration in a 12-month\nlookback (= incident/first-line), and 365 days of continuous medical + pharmacy, FFS-observable enrollment\nbefore index. I: initiate pembrolizumab + pemetrexed — defined by NDC/HCPCS J-codes; index_date = the first\nadministration/fill of the regimen after the qualifying NSCLC diagnosis. C: initiate a platinum doublet\n(regimen code list), same eligibility and the *same* time-zero rule applied identically. O: overall survival\n(death from any cause via discharge status + a linked mortality index; a non-fatal claims-only proxy is\ninsufficient for OS). 
T: time-zero at regimen initiation; a 30-day induction excludes events plausibly present\nbefore treatment could act; follow until death, disenrollment, end of data, or 36 months; for an as-treated\ncontrast, censor at a switch or at last `days_supply` end + a pre-specified 90-day grace, with\ninverse-probability-of-censoring weights because discontinuation is likely differential by arm; **specify\ndeath as a competing event for any non-mortality secondary endpoint**. S: US commercial + Medicare FFS claims\n(e.g., Optum/MarketScan-style), 2018–2023, mortality-linked, MA-only person-time excluded, analyzed as a\nnew-user active-comparator cohort with target-trial-emulation structure. The same six rows, ported to a\ndistributed network (Sentinel/PCORnet via OMOP), keep the question fixed while each site re-maps the code\nlists and re-checks enrollment semantics — which is exactly the reproducibility PICOTS exists to buy.",
    "primary_category": "Framework_Standard",
    "tags": [
      "picots",
      "pico",
      "research-question",
      "protocol-development",
      "fit-for-purpose",
      "target-trial",
      "evidence-synthesis",
      "estimand",
      "rwe-design",
      "claims"
    ],
    "applies_to_study_types": [
      "new_user",
      "active_comparator_new_user",
      "cohort_retrospective",
      "claims_analysis",
      "ehr_study",
      "target_trial_emulation",
      "systematic_review"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked",
      "multi-database"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1016/j.jclinepi.2009.03.008",
        "url": "https://doi.org/10.1016/j.jclinepi.2009.03.008",
        "citation_text": "Whitlock EP, Lopez SA, Chang S, Helfand M, Eder M, Floyd N. AHRQ Series Paper 3: identifying, selecting, and refining topics for comparative effectiveness systematic reviews. Journal of Clinical Epidemiology. 2010;63(5):491-501.",
        "year": 2010,
        "authors_short": "Whitlock et al.",
        "notes": "AHRQ Effective Health-Care methods paper that formalized the PICOTS (PICO + Timing + Setting) scaffold for structuring comparative-effectiveness questions, now the standard front end for RWE protocols."
      },
      {
        "role": "explain",
        "doi": "10.1093/aje/kwv254",
        "url": "https://doi.org/10.1093/aje/kwv254",
        "citation_text": "Hernán MA, Robins JM. Using big data to emulate a target trial when a randomized trial is not available. American Journal of Epidemiology. 2016;183(8):758-764.",
        "year": 2016,
        "authors_short": "Hernán & Robins",
        "notes": "Shows how a precise question maps to an explicit protocol (eligibility, treatment strategies, assignment, follow-up, outcome) — the operational layer PICOTS feeds, and where vague Timing produces time-zero/immortal-time bias."
      },
      {
        "role": "demonstrate",
        "doi": "10.1136/bmj.k3532",
        "url": "https://doi.org/10.1136/bmj.k3532",
        "citation_text": "Langan SM, Schmidt SA, Wing K, et al. The reporting of studies conducted using observational routinely collected health data statement for pharmacoepidemiology (RECORD-PE). BMJ. 2018;363:k3532.",
        "year": 2018,
        "authors_short": "Langan et al.",
        "notes": "Reporting standard requiring explicit population, exposure, comparator, outcome, timing, and data-source elements — a concrete demonstration of PICOTS made operational and auditable for database studies."
      },
      {
        "role": "use",
        "doi": "10.1002/pds.4297",
        "url": "https://doi.org/10.1002/pds.4297",
        "citation_text": "Berger ML, Sox H, Willke RJ, et al. Good practices for real-world data studies of treatment and/or comparative effectiveness: recommendations from the joint ISPOR-ISPE Special Task Force on real-world evidence in health care decision making. Pharmacoepidemiology and Drug Safety. 2017;26(9):1033-1039.",
        "year": 2017,
        "authors_short": "Berger et al.",
        "notes": "ISPOR-ISPE good-practice guidance that operationalizes a priori specification of population, exposure, comparator, outcome, and timing as a precondition for decision-grade RWE."
      }
    ],
    "plain_language_summary": "PICOTS is a six-part checklist that forces a researcher to write down exactly who will be studied, what treatment will be compared against what, what counts as the outcome, how long patients will be followed, and which database will be used — before a single line of code is written. Think of it as filling out a very precise order form: if any box is left blank, the study cannot be built without guessing, and guessing introduces errors that are almost impossible to fix later. The six letters stand for Population, Intervention, Comparator, Outcome, Timing, and Setting.",
    "key_terms": [
      {
        "term": "Population",
        "definition": "The specific group of patients who qualify for the study, defined by their diagnosis, age, insurance coverage, and any prior treatment history required before they can be included."
      },
      {
        "term": "Comparator",
        "definition": "The alternative treatment or no-treatment group that the intervention is measured against, so that any difference in outcomes can be attributed to the treatment rather than to differences in who was treated."
      },
      {
        "term": "Timing",
        "definition": "All of the clock decisions in a study: how far back in history to look before a patient enters the study, when the clock starts ticking, how long patients are followed, and the rules for when they stop being counted."
      },
      {
        "term": "Setting",
        "definition": "The specific database or data source being used (for example, insurance claims versus hospital records), the country, the calendar years covered, and the overall study design chosen to answer the question."
      },
      {
        "term": "Research question drift",
        "definition": "The gradual, unplanned narrowing or broadening of a study question that happens when eligibility rules are adjusted after analysts have already seen the data, which can make results look stronger than they really are."
      }
    ],
    "worked_example": {
      "scenario": "A health outcomes team at a pharmacy benefit manager wants to know whether adults with type 2 diabetes who start a GLP-1 receptor agonist have fewer hospitalizations for heart failure over the following year compared with adults who start a DPP-4 inhibitor. Before pulling any data, the team fills in a PICOTS table to make sure every decision is written down and agreed upon. The table below shows each element and the specific answer the team records.",
      "dataset": {
        "caption": "PICOTS table: each row is one element of the framework, filled in for the heart-failure hospitalization question.",
        "columns": [
          "Element",
          "Letter",
          "What it decides",
          "The team's answer"
        ],
        "rows": [
          [
            "Population",
            "P",
            "Who is eligible",
            "Adults age 18 or older with a type 2 diabetes diagnosis, enrolled continuously in commercial insurance for at least 12 months before starting either drug, with no prior use of either drug class in that 12-month look-back period"
          ],
          [
            "Intervention",
            "I",
            "The treatment being evaluated",
            "First prescription fill of any GLP-1 receptor agonist (semaglutide, liraglutide, dulaglutide, or exenatide); the date of that fill is the study start date for the patient"
          ],
          [
            "Comparator",
            "C",
            "What the intervention is compared against",
            "First prescription fill of any DPP-4 inhibitor (sitagliptin, saxagliptin, or alogliptin), subject to the same eligibility rules as the GLP-1 group; the date of that fill is the study start date for the patient, so both groups share one start-date rule"
          ],
          [
            "Outcome",
            "O",
            "The event being counted",
            "A hospitalization with a primary discharge diagnosis code for heart failure, occurring any time after the study start date and before follow-up ends"
          ],
          [
            "Timing",
            "T",
            "All clock and window decisions",
            "12-month look-back before start date to confirm no prior use; follow patients from start date until first heart-failure hospitalization, insurance disenrollment, death, or 12 months elapsed, whichever comes first; no induction delay because the effect is not expected to require months to appear"
          ],
          [
            "Setting",
            "S",
            "Data source and study design",
            "US commercial insurance claims database, calendar years 2019 through 2023; new-user active-comparator cohort design; patients in Medicare Advantage plans excluded because their hospitalization records are incomplete in this database"
          ]
        ]
      },
      "steps": [
        "Start with P: the team writes down the diagnosis code, the age cutoff, the 12-month continuous enrollment requirement, and the requirement that neither drug class was filled during that 12-month look-back. This single row will become the eligibility query in the database.",
        "Fill in I: the team lists the four specific drug names that count as GLP-1 receptor agonists and declares that the date of the first fill is the patient's study start date. Without this, two analysts might pick different drugs or different date rules and produce different cohorts.",
        "Fill in C: the team picks DPP-4 inhibitors as the comparator because both drug classes treat type 2 diabetes and are prescribed to similar patients — this means differences in heart-failure rates are more likely to reflect the drug effect than background differences in patient health.",
        "Fill in O: the team specifies that only hospitalizations with heart failure as the primary reason count. A patient admitted for pneumonia who also happens to have heart failure would not count, so the rule must be written down precisely.",
        "Fill in T: the team sets the maximum follow-up at 12 months and lists every reason a patient stops being counted (first event, disenrollment, death, or the 12-month cap). This row will directly control how person-time is calculated in the analysis.",
        "Fill in S: the team names the specific database, the calendar years, and the design. They also note the Medicare Advantage exclusion because that population's hospitalization data are incomplete, which would make the outcome look artificially rare.",
        "Read the completed table back as one sentence to confirm the question is fully specified: Among adults 18 or older with type 2 diabetes, no prior use of either drug class, and at least 12 months of continuous commercial insurance coverage, does starting a GLP-1 receptor agonist versus starting a DPP-4 inhibitor reduce the rate of heart-failure hospitalizations over the following 12 months, measured in US commercial claims from 2019 to 2023 using a new-user active-comparator cohort design?"
      ],
      "result": "The fully framed research question is: Among commercially insured adults age 18 or older with type 2 diabetes and no prior GLP-1 or DPP-4 use in the preceding 12 months, does initiating a GLP-1 receptor agonist reduce 12-month heart-failure hospitalization rates compared with initiating a DPP-4 inhibitor, evaluated in US commercial claims (2019-2023) with a new-user active-comparator cohort design? Every word in that sentence traces back to a specific row in the PICOTS table, so any reviewer can see exactly which decision produced each word — and flag any box they disagree with before analysis begins."
    },
    "prerequisites": [
      "new-user-design",
      "active-comparator-new-user",
      "continuous-enrollment-observable-time-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Broad vs narrow PICOTS (generalizability vs precision)",
        "description": "A broad P (all adults with the condition) maximizes power and external validity but raises heterogeneity and confounding; a narrow P (biomarker-defined, single line of therapy) reduces confounding and sharpens the estimand but limits transportability and shrinks the cohort.",
        "edge_cases": [
          "Narrowing P with EHR/registry detail (stage, biomarker) can introduce differential missingness that itself becomes a selection mechanism.",
          "A broad P with a single C can hide effect-measure modification across subgroups that a stratified PICOTS would surface."
        ],
        "data_source_notes": "claims: a broad P is easy via diagnosis codes but leans hard on S (continuous-enrollment) to avoid selection; EHR: a narrow P is feasible via labs/NLP but carries higher and possibly differential missingness."
      },
      {
        "name": "Timing variants (induction, latency, grace, administrative windows)",
        "description": "Choices for induction (exclude early events not yet causally attributable), latency (a delayed effect window), grace periods for exposure continuity, and the maximum follow-up before administrative censoring dominates — each a distinct T decision with its own bias profile.",
        "edge_cases": [
          "Timing shorter than the true latency biases toward the null; timing longer than median enrollment converts follow-up into differential administrative censoring.",
          "A grace period that differs implicitly by arm (because one drug is dispensed in 90-day supplies) creates arm-dependent exposure misclassification."
        ],
        "data_source_notes": "claims: T drives episode construction and person-time and must be reconciled against enrollment gaps and MA-only spans before time-zero is assigned."
      },
      {
        "name": "Multi-database / distributed PICOTS",
        "description": "One fixed PICOTS question executed across Sentinel, PCORnet, or international claims via a common data model, with each site re-mapping code lists and enrollment semantics rather than re-defining the question.",
        "edge_cases": [
          "Identical OMOP code lists can still mean different things across sites because of coding-practice and formulary variation, leaving residual heterogeneity that a fixed PICOTS does not eliminate."
        ],
        "data_source_notes": "Requires a common data model (e.g., OMOP) or a common protocol with locally validated operationalization of P/I/C/O/T and enrollment rules."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "target-trial-emulation",
        "pros_of_this": "PICOTS is the upstream question-framing layer that scopes and aligns stakeholders before the target-trial protocol is written; its P + I + C + O + T map directly onto eligibility, treatment strategies, assignment, follow-up, and outcome.",
        "cons_of_this": "PICOTS does not specify the assignment procedure, handling of intercurrent events, or time-varying confounding — those require the explicit target-trial protocol plus an estimand and g-methods.",
        "when_to_prefer": "Use PICOTS first to define and negotiate the question; move to a target-trial protocol once the estimand and analytic strategy must be pinned down."
      },
      {
        "compared_to": "estimands-ate-att-intercurrent-events-rwe",
        "pros_of_this": "PICOTS supplies three of the five estimand attributes — population, treatment, and endpoint — that the estimand inherits, while T forces time-zero and follow-up to be made explicit early.",
        "cons_of_this": "PICOTS does not by itself fix the population-level summary (ATE vs ATT) or the intercurrent-event strategy — the two attributes that most change the number reported, which the estimand must add (timing informs but does not determine them).",
        "when_to_prefer": "Use PICOTS to frame; always pair it with a written estimand before the SAP is finalized."
      },
      {
        "compared_to": "comparative-effectiveness-research-cer-methods",
        "pros_of_this": "A PICOTS table is the standard alignment artifact at the start of a CER protocol, ensuring question, data, and design agree before analytic methods (PS, g-methods) are chosen.",
        "cons_of_this": "PICOTS does not select or justify the confounding-control method; it only constrains the question those methods must answer.",
        "when_to_prefer": "Use PICOTS at protocol kickoff; defer method selection to the design and analysis sections it feeds."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Translate each PICOTS element into code lists plus enrollment and date logic. P and S jointly set the continuous medical+pharmacy enrollment requirement across the full T lookback; exclude Medicare Advantage-only person-time where FFS claims are unavailable (otherwise washout and event-free follow-up are unobserved, not true). T determines person-time, induction/grace windows, and censoring; specify death as a competing event for non-mortality outcomes.",
      "ehr": "P and O can use labs, vitals, staging, and NLP for a sharper narrow P; I is the order/administration, so link to dispensing to confirm initiation. Capture is visit-driven, so define observation windows in T explicitly and treat loss to follow-up as potentially informative.",
      "registry": "Cleanest for P (severity/stage) and adjudicated O (recurrence/death dates); weak for complete I/C utilization and exact timing. Link to claims for fills and to a death index for censoring.",
      "linked": "Linked claims-EHR-vital-records is the ideal substrate but introduces linkage selection (only the linkable subset may not match the target P) and order/fill/service date discrepancies that must be reconciled before assigning time-zero."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "from dataclasses import dataclass, field\n\n@dataclass\nclass PICOTS:\n    # P — cohort + lookback/washout. Drives the eligibility query and continuous-enrollment rule.\n    population: str            # e.g. \"adults >=18, incident advanced NSCLC (>=1 inpatient or >=2 outpatient ICD-10 C34.* >=30d apart)\"\n    lookback_days: int         # disease/exposure-free lookback that defines 'incident'; e.g. 365\n    enrollment_rule: str       # e.g. \"continuous medical+pharmacy FFS A/B/D across [index_date-lookback, index_date]; exclude MA-only spans\"\n\n    # I / C — exposure and comparator strategies. Become NDC/HCPCS/CPT code lists + the time-zero rule.\n    intervention: str          # e.g. \"first administration of pembrolizumab+pemetrexed (HCPCS J9271 + J9305) after NSCLC dx\"\n    comparator: str            # e.g. \"first administration of a platinum doublet (regimen code list); same time-zero rule\"\n    index_date_rule: str       # e.g. \"index_date = first qualifying fill/administration; assign arm from that claim\"\n\n    # O — endpoint + validated algorithm + competing events.\n    outcome: str               # e.g. \"overall survival (death any cause)\"\n    outcome_algorithm: str     # e.g. \"discharge status + linked national mortality index; claims-only proxy insufficient for OS\"\n    competing_events: str = \"\" # e.g. \"death is a competing event for any non-mortality secondary endpoint\"\n\n    # T — every clock decision. The most under-specified PICOTS element in RWE.\n    induction_days: int = 0    # exclude events in [index_date, index_date+induction_days)\n    grace_days: int = 0        # allowable gap before an as-treated episode is considered ended (days_supply end + grace)\n    max_followup_days: int = 0 # administrative cap; 0 = until disenroll/death/end-of-data\n    censoring_rule: str = \"\"   # e.g. 
\"censor at disenrollment, death, end of data, switch, or discontinuation+grace (apply IPCW)\"\n\n    # S — data source(s), design, calendar window, geography.\n    setting: str = \"\"          # e.g. \"US commercial + Medicare FFS claims, mortality-linked, 2018-2023, MA-only excluded\"\n    design: str = \"\"           # e.g. \"new-user active-comparator cohort with target-trial-emulation structure\"\n    databases: list = field(default_factory=list)  # e.g. [\"MarketScan-style commercial\", \"Medicare FFS\"]\n\n    def validate(self) -> list[str]:\n        \"\"\"Gate the protocol: every element must be filled, and T/S must be internally consistent.\"\"\"\n        problems = []\n        for fld in (\"population\", \"enrollment_rule\", \"intervention\", \"comparator\",\n                    \"index_date_rule\", \"outcome\", \"outcome_algorithm\", \"setting\", \"design\"):\n            if not str(getattr(self, fld)).strip():\n                problems.append(f\"PICOTS '{fld}' is unspecified — resolve before building the cohort.\")\n        if self.lookback_days <= 0:\n            problems.append(\"Timing: lookback_days must define the incident/washout window.\")\n        if self.max_followup_days and self.max_followup_days > 730 and \"commercial\" in self.setting.lower():\n            problems.append(\"Timing/Setting mismatch: follow-up exceeds typical commercial enrollment \"\n                            \"(~2y) -> differential administrative censoring risk.\")\n        if not self.censoring_rule.strip():\n            problems.append(\"Timing: censoring_rule unspecified -> undefined person-time and estimand.\")\n        return problems",
        "description": "A PICOTS specification template, not an analysis. It is the structured artifact a protocol author fills in\nBEFORE writing any extraction code, so that every downstream rule (eligibility SQL, exposure/outcome\nalgorithms, time-zero, censoring) traces back to an explicit, reviewable decision. Each field carries the\nrealistic claims/EHR variable names and date windows that the operational code will reference\n(person_id, enrollment spans, fill_date, days_supply, dx/NDC/CPT code lists, index_date). Validate it with\nthe assertions at the bottom before any cohort is built; an unfilled element is the element that will bite.",
        "dependencies": [],
        "source_citations": [
          "whitlock-2010"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Q[RWE question] --> P[P: cohort + lookback/washout<br/>incident disease, continuous enrollment]\n  Q --> I[I: exposure strategy<br/>NDC/HCPCS, initiation rule]\n  Q --> C[C: active comparator<br/>same indication, same time-zero]\n  Q --> O[O: endpoint + validated algorithm<br/>+ competing events]\n  Q --> T[T: lookback, washout, induction,<br/>grace, follow-up, censoring]\n  Q --> S[S: data source, design,<br/>geography, calendar window]\n  P --> EST[Estimand<br/>population]\n  I --> EST\n  C --> EST\n  O --> EST\n  T --> EST\n  T --> TT[Target-trial protocol<br/>eligibility / assignment / follow-up]\n  S --> TT\n  EST --> Plan[Reviewable, reproducible<br/>RWE protocol + SAP]\n  TT --> Plan",
        "caption": "PICOTS as the master specification. Each element flows into the estimand (P+I+C+O+T) and the target-trial protocol (T+S), which together yield an auditable, reproducible RWE protocol and SAP.",
        "alt_text": "Flowchart showing the RWE question branching into the six PICOTS elements, which feed both the estimand and the target-trial protocol, converging on a reproducible protocol and statistical analysis plan.",
        "source_type": "illustrative",
        "source_citations": [
          "whitlock-2010"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "gantt\n  title PICOTS Timing (T) made explicit for one new initiator (claims)\n  dateFormat YYYY-MM-DD\n  axisFormat %b %Y\n  section P + S\n  Continuous FFS enrollment + disease/exposure-free lookback :done, look, 2023-01-01, 2023-12-31\n  section I/C time zero\n  First qualifying fill/administration -> arm assignment :milestone, t0, 2024-01-01, 0d\n  section T windows\n  Induction (events excluded) :crit, ind, 2024-01-01, 30d\n  On-treatment follow-up (days_supply + grace) :active, fu, 2024-01-31, 700d\n  section Censoring\n  Disenroll / death / switch / end-of-data :milestone, cen, 2025-12-31, 0d",
        "caption": "Timing (T) is where most avoidable RWE bias originates. Specifying the lookback, time-zero, induction, follow-up, and censoring as a dated timeline in the protocol prevents immortal-time and administrative-censoring errors before any code is written.",
        "alt_text": "Gantt timeline showing a 365-day lookback in 2023, time zero at the first fill on 2024-01-01, a 30-day induction window, an on-treatment follow-up window, and a censoring milestone at end of 2025.",
        "source_type": "illustrative",
        "source_citations": [
          "whitlock-2010"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "produces",
        "target_slug": "study-protocol-or-sap-elements",
        "notes": "A fully specified PICOTS table yields the core sections of an RWE protocol and SAP — eligibility, exposure/comparator definitions, outcome algorithms, timing/censoring rules, and the data-source/design statement."
      },
      {
        "relation_type": "used_with",
        "target_slug": "target-trial-emulation",
        "notes": "PICOTS is the upstream question frame for the target trial; eligibility comes from P + S + the T lookback, treatment strategies from I + C, and follow-up + outcome from T + O."
      },
      {
        "relation_type": "used_with",
        "target_slug": "estimands-ate-att-intercurrent-events-rwe",
        "notes": "PICOTS P + I/C + O supply three of the five estimand attributes (population, treatment, endpoint); T (timing) informs but does not fix the remaining two — the population-level summary (ATE vs ATT) and the intercurrent-event strategy — which the estimand must still add."
      },
      {
        "relation_type": "used_with",
        "target_slug": "active-comparator-new-user",
        "notes": "The C and T elements of a comparative PICOTS are operationalized by an active-comparator new-user design — an active comparator for C and an incident-user washout plus initiation time-zero for T."
      },
      {
        "relation_type": "see_also",
        "target_slug": "comparative-effectiveness-research-cer-methods",
        "notes": "A PICOTS table is the standard alignment artifact at the start of a CER protocol, ensuring question, data, and design agree before propensity-score or g-methods are chosen."
      },
      {
        "relation_type": "see_also",
        "target_slug": "therapeutic-area-specific-rwe-challenges-oncology",
        "notes": "Oncology RWE stresses every PICOTS element — biomarker/line-defined P, regimen-coded I/C, OS vs PFS in O, and short follow-up with heavy switching in T."
      }
    ],
    "aliases": [
      "PICOTS",
      "PICO-TS",
      "PICO framework",
      "PICOTS table",
      "PICOT"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "place-of-service-codes",
    "name": "Place of Service (POS) Codes",
    "short_definition": "A CMS-maintained two-digit code set placed on the service-line level of professional (CMS-1500 / 837P) claims to indicate the physical or virtual setting where a service was furnished, enabling site-of-care classification, telehealth identification, and setting-shift studies in real-world evidence research.",
    "long_description": "**Place of Service (POS) codes** are a standardized two-digit code set administered by the Centers for Medicare &\nMedicaid Services (CMS) and placed in field 24B of the CMS-1500 paper form (loop 2300/2400 of the 837P electronic\ntransaction) on every professional claim service line. A professional claim covers physician services, outpatient\nevaluations, procedures, and other non-facility encounters billed by a rendering provider — as opposed to the\nUB-04 / 837I institutional claim filed by the facility itself. Because the institutional claim has no POS field\n(setting is instead conveyed via the Type of Bill and revenue center codes), POS codes apply exclusively to the\nprofessional claim stream. This asymmetry is the single most important structural fact for any analyst constructing\na setting classifier from claims data.\n\n**What POS codes encode.** Each two-digit code maps to a defined care setting. The most analytically consequential\ncodes for RWE research are:\n\n- **11 — Office:** A physician's office or independent clinic not attached to a hospital outpatient department.\n  This is the plurality of professional claims and the default comparator in site-of-care and site-neutrality\n  studies. When paired with the CPT E/M code, POS 11 identifies an office visit; when paired with a drug\n  administration code (J-code or CPT 96xxx), it identifies a physician-office infusion — the key contrast in\n  chemotherapy site-of-care research.\n- **19 — Off-Campus Outpatient Hospital vs. 22 — On-Campus Outpatient Hospital:** The 2016 Bipartisan Budget Act\n  introduced site-neutrality provisions that created the billing distinction between off-campus (POS 19) and\n  on-campus (POS 22) hospital outpatient departments. Before 2016, both were billed as POS 22. 
Studies of\n  Medicare payment reform and provider-based billing must handle the pre/post-2016 transition in order to\n  construct a consistent outpatient-hospital indicator.\n- **21 — Inpatient Hospital:** A physician billing for a service furnished to a patient admitted as an inpatient.\n  POS 21 on a professional claim does not represent the hospital stay itself — that is the facility claim. It\n  means a physician (e.g., hospitalist, consultant, surgeon) saw the inpatient and billed a professional charge.\n  This distinction matters for cost studies: the two claims must be linked to reconstruct the full encounter cost.\n- **23 — Emergency Room – Hospital:** An emergency department visit billed by the physician (emergency medicine\n  group). Used in conjunction with ED evaluation and management CPT codes (99281–99285) and critical care code 99291 to identify\n  ED encounters on the professional claim side; must be reconciled with the facility claim (Type of Bill 013x)\n  for a complete ED encounter record.\n- **24 — Ambulatory Surgical Center (ASC):** Services furnished in a free-standing surgery center. Relevant for\n  procedure-setting studies and for identifying whether a surgery occurred in a hospital (POS 22) vs. ASC (POS 24)\n  context — a distinction with large cost implications under Medicare.\n- **31 — Skilled Nursing Facility (SNF):** Professional services furnished to SNF residents. Important for\n  post-acute care research and distinguishing facility-based care from community-dwelling follow-up.\n- **12 — Home:** Services furnished in the patient's private residence. Used to identify physician home visits\n  and certain remote monitoring encounters.\n- **02 — Telehealth Provided Other Than in Patient's Home vs. 10 — Telehealth Provided in Patient's Home:** The 2022\n  split is the most consequential recent POS change for RWE researchers. Before January 1, 2022, POS 02 was\n  the only telehealth POS code. 
Beginning in 2022, the code set assigns POS 10\n  to visits where the patient was located in their own home at the time of the telehealth visit, while POS 02 was\n  retained for telehealth provided at other sites (e.g., a health care facility serving as the patient's originating\n  site). Any study analyzing telehealth trends across a window that spans 2021–2022 must include both codes and\n  apply the correct date-conditional logic: before 2022, POS 02 is the only telehealth POS code; from 2022 forward, POS\n  02 + POS 10 together capture the universe of telehealth-billed professional claims. During the COVID-19 public\n  health emergency, Medicare directed practitioners to report the POS that would have applied in person and to\n  append modifier 95, so a POS-only flag can miss Medicare telehealth claims from that period.\n- **81 — Independent Laboratory:** Professional claims for services furnished in a free-standing laboratory. Useful\n  for distinguishing outpatient lab draws from in-office or hospital-based testing.\n\n**Core conceptual distinction — setting vs. billing rule.** POS reflects the setting *as designated for billing\npurposes*, not always the physical location of service delivery. Provider-based billing rules allow a physician\npractice acquired by or affiliated with a hospital to bill its professional claims at hospital POS codes (19 or 22)\neven if the physician continues to see patients in what looks externally like an independent office. The patient\nmay receive identical care from the same physician in the same exam room before and after the hospital acquisition,\nyet the POS on the professional claim shifts from 11 to 19 or 22, producing a facility fee from the hospital and a higher\ntotal Medicare payment. This is the central pitfall of POS-based site-of-care research: an apparent shift from\noffice (POS 11) to hospital outpatient (POS 22) in a claims trend study may reflect a change in billing affiliation\nrather than any change in where or how care is actually delivered.\n\n**Pros, cons, and trade-offs — specific and comparative.**\n- **vs. 
UB-04 type of bill + revenue center codes for setting classification:** The UB-04 / institutional claim\n  carries no POS field; setting is instead encoded in the Type of Bill (a four-character code with a leading zero, where the second digit is\n  the facility type and the third is the bill classification, e.g., 011x = inpatient hospital, 013x = outpatient hospital)\n  and in the revenue center code at the service-line level (e.g., 045x = emergency room, 0636 = drugs requiring detailed coding). For\n  site-of-care research that spans both the professional and institutional streams — as almost all complete\n  encounter-level studies do — the analyst must join POS from the professional claim to the institutional claim's\n  type-of-bill and revenue codes to reconstruct the full setting picture. Skipping this linkage means\n  inferring the setting from one stream alone and missing the cross-stream inconsistencies that require\n  reconciliation rules. **Prefer POS for the physician setting; use type of bill + revenue codes for the facility\n  setting; link them for complete encounter cost and setting reconstruction.**\n- **vs. taxonomy of place codes in EHR or registry data:** EHR records may carry a place-of-service concept but\n  it is typically an internal facility designation that does not map directly to CMS POS codes without a deliberate\n  crosswalk. Registry data usually lacks a granular setting field entirely. Claims POS codes are therefore the\n  most systematic source of setting information in a US administrative data study, as long as the institutional /\n  professional claim asymmetry is respected. **Prefer POS for multi-provider setting classification in claims;\n  rely on EHR internal codes only when the question is confined to a single health system whose internal codes\n  are validated against the CMS set.**\n- **Granularity vs. 
coding-practice noise:** POS 11 (office) is clear when the provider is independently operating.\n  POS 19 and 22 (off-campus and on-campus outpatient hospital) are contaminated by the pre-2016 vs. post-2016\n  coding change, by cross-payer differences (commercial payers may not require the 19/22 split in the same way\n  Medicare does), and by provider-based billing, which makes a billing shift from \"office\" to \"outpatient hospital\"\n  indistinguishable from a true care-delivery change. **Sensitivity analyses by pre/post-2016 and by payer type are standard in any\n  study using POS 19/22.**\n- **Telehealth POS completeness:** Before the March 2020 COVID-19 emergency declaration, Medicare telehealth was\n  tightly restricted (specific rural originating sites required, narrow list of allowed services). Commercial\n  payers used POS 02 inconsistently. After the declaration, the POS 02/10 universe expanded dramatically. Any claims\n  database that spans the pre-2020 and post-2020 periods will have a structural break in POS 02/10 prevalence that\n  partly reflects regulatory expansion rather than underlying utilization change alone.\n\n**When to use.**\n- **Site-of-care classification** on the professional claim side of any utilization study: identify which encounters\n  were office-based, outpatient-hospital-based, ED-based, SNF-based, or telehealth. POS is the correct variable\n  for this on professional claims; do not attempt it from diagnosis or CPT codes alone.\n- **Telehealth adoption and trend studies** using Medicare or commercial claims: POS 02 (pre-2022) or POS 02 + 10\n  (from 2022 forward) is the operationally correct telehealth indicator on the professional claim. Verify that the\n  payer in question was using POS 02/10 systematically before assuming completeness.\n- **Site-neutrality and payment-differential studies:** Comparing costs and utilization of care billed at POS 11\n  (office) vs. POS 22 (on-campus outpatient hospital) is the standard approach to studying the hospital\n  outpatient premium. 
Pre-specify how you will handle the 2016 code-set change and provider-based billing.\n- **ED encounter identification on the professional claim side:** POS 23 plus emergency E/M CPT codes is the\n  standard professional-claim ED algorithm. Always link to the facility claim (Type of Bill 013x) for completeness.\n- **Inpatient physician service identification:** POS 21 on a professional claim identifies a physician billing\n  for care delivered to an admitted patient. Link to the MedPAR or UB-04 to establish the admission date range.\n\n**When NOT to use — and when POS-based classification is actively misleading or dangerous.**\n- **On institutional claims.** UB-04 / 837I records do not carry a POS field. Applying POS logic to institutional\n  claims is undefined. Always route setting classification to the correct claim type before applying code logic.\n- **As a sole indicator of physical care location in provider-based billing scenarios.** A change in POS from 11\n  to 22 following a hospital acquisition may reflect only the billing relationship, not a change in the patient's\n  physical location or the care delivered. 
For studies of actual care-setting migration (e.g., \"did patients shift\n  to hospital-based care?\"), supplement POS with provider taxonomy codes, NPI-level affiliation data, or enrollment\n  records to distinguish billing changes from true site shifts.\n- **When POS disagrees with the facility claim setting.** A professional claim with POS 21 (inpatient) may be\n  filed for the same date of service as a facility claim showing an outpatient status — this is a real coding\n  inconsistency that arises from observation status disputes, two-midnight rule decisions, and late billing.\n  Use the facility claim's admission status as the authoritative record for inpatient/outpatient designation;\n  treat POS 21 professional claims as requiring verification against the MedPAR or UB-04 before assuming the\n  patient was a true inpatient.\n- **Without date-conditional telehealth logic spanning 2022.** Using POS 02 alone as a telehealth flag in data\n  that includes 2022 onward will undercount telehealth by missing POS 10 (in-home telehealth). Write the flag as\n  POS IN (02, 10) and apply the date awareness described above.\n- **Across payers without verifying payer-specific coding conventions.** Commercial payers do not always enforce\n  the same POS code granularity as Medicare. POS 19 (off-campus outpatient) may be entirely absent in some\n  commercial datasets. Validate POS distributions against expected frequencies before building a cross-payer\n  site-of-care classifier.\n- **When the code set version is not pinned.** CMS adds, retires, and redefines POS codes over time. A study\n  spanning many years should document which code set version was current for each calendar year in the study\n  window, particularly for codes that were newly introduced or redefined (19/22 split in 2016; 02/10 split in\n  2022). 
Using a single static code list on longitudinal data introduces measurement error that can be mistaken\n  for trend.\n\n**Data-source operational depth.**\n- **Medicare FFS carrier / physician claims:** POS is consistently populated, making this file the most reliable source for\n  POS-based classification. However, Medicare applies coverage rules that make POS 02/10 appear only when the\n  service is a CMS-approved telehealth service. A claim reporting POS 02 for a non-covered telehealth service\n  would be denied and would not appear in paid claims, so the Medicare carrier file captures telehealth that was\n  actually reimbursed — a selection filter, not a complete census of all telehealth encounters. For MA beneficiaries,\n  POS may be less consistently submitted in encounter data; restrict to FFS when POS completeness is required.\n- **Commercial claims (e.g., IBM MarketScan, Optum Clinformatics):** POS is present but payer-specific coding\n  conventions mean that POS 19 (off-campus outpatient) may be underreported in some payer streams, and the\n  telehealth POS usage will reflect the payer's own coverage policies during and after the COVID-19 public health\n  emergency (PHE). Validate POS\n  frequency distributions before building a cross-payer classifier. The date of the POS 19/22 split may also\n  lag for commercial payers relative to the 2016 Medicare effective date.\n- **Medicaid claims (T-MSIS / MAX / TAF):** POS is present in professional claims but coding quality varies\n  by state and by managed care organization versus fee-for-service. States that route Medicaid services through\n  capitated MCOs may have encounter data with variable POS completeness. 
Use POS for Medicaid professional claims\n  only after verifying completeness for the specific state/year.\n\n**POS in the context of a complete encounter.** A single visit to a hospital-based outpatient oncology clinic may\ngenerate two claims: an institutional (UB-04) claim from the hospital for the facility fee, coded with Type of\nBill 0131 (hospital outpatient) and revenue codes 027x (medical / surgical supplies) and 0260 (IV therapy); and\na professional (CMS-1500) claim from the oncologist for the physician fee, with POS 22 (on-campus outpatient\nhospital) and J-codes for the drug administered. The professional claim's POS 22 confirms the setting; the\ninstitutional claim's Type of Bill and revenue codes provide the facility cost components. Neither claim alone\ngives the full picture. This two-claim structure is why site-of-care cost analyses in oncology and infusion\ntherapy must aggregate claims from both streams and why a POS-only analysis underestimates the true site\ndifferential by omitting the facility payment.",
    "primary_category": "Data_Standard",
    "tags": [
      "coding-system",
      "data-standard",
      "primitive",
      "claims",
      "professional",
      "setting",
      "place-of-service",
      "pos-codes",
      "telehealth",
      "site-of-care",
      "cms-1500",
      "837p"
    ],
    "applies_to_study_types": [],
    "data_sources": [
      "claims"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.3928/25731777-20240422-12",
        "url": "https://doi.org/10.3928/25731777-20240422-12",
        "citation_text": "Bedard JE. Place of Service Codes: Key to Compliant Billing. Oncology Issues. 2024;39(3):20-25.",
        "year": 2024,
        "authors_short": "Bedard",
        "notes": "Directly documents the CMS POS code set and its role in compliant professional-claim billing, covering the most commonly used codes and the compliance implications of incorrect POS assignment. The canonical practitioner reference for understanding what POS codes record and why they matter for billing and research."
      },
      {
        "role": "explain",
        "doi": "10.1377/hlthaff.2021.01825",
        "url": "https://doi.org/10.1377/hlthaff.2021.01825",
        "citation_text": "Andino JJ, Bynum JPW, Ellimoottil C. Interstate Telehealth Use By Medicare Beneficiaries Before And After COVID-19 Licensure Changes. Health Affairs. 2022;41(4):495-502.",
        "year": 2022,
        "authors_short": "Andino et al.",
        "notes": "Uses Medicare professional claims with POS 02 to quantify interstate telehealth utilization before and after the COVID-19 emergency licensure waivers, demonstrating how POS codes operationalize telehealth identification in large-scale claims-based utilization research."
      },
      {
        "role": "demonstrate",
        "doi": "10.1200/jco.2015.33.15_suppl.e17670",
        "url": "https://doi.org/10.1200/jco.2015.33.15_suppl.e17670",
        "citation_text": "Hopson S, Casebeer A, Haggstrom D, et al. Chemotherapy treatment patterns by site of care (SOC): A comparison of the physician office and hospital outpatient settings. Journal of Clinical Oncology. 2015;33(15 suppl):e17670.",
        "year": 2015,
        "authors_short": "Hopson et al.",
        "notes": "Applies POS codes in Medicare professional claims to classify chemotherapy administration by site of care (physician office POS 11 vs. hospital outpatient POS 22), demonstrating the core RWE use case of POS for site-of-care shift studies in oncology."
      },
      {
        "role": "use",
        "doi": "10.1377/hlthaff.2024.00972",
        "url": "https://doi.org/10.1377/hlthaff.2024.00972",
        "citation_text": "Post B, Buchmueller TC, Couture SJ, et al. Site-Neutral Payment Reform: Little Impact On Outpatient Medicare Spending Or Hospital Revenue. Health Affairs. 2025;44(1):38-47.",
        "year": 2025,
        "authors_short": "Post et al.",
        "notes": "Evaluates the fiscal impact of Medicare site-neutral payment reform, classifying encounter settings using POS 11 (office) vs. POS 19/22 (hospital outpatient) in carrier claims — directly illustrating the policy-research use of POS to quantify the hospital outpatient premium and model payment-policy effects."
      },
      {
        "role": "use",
        "doi": null,
        "url": "https://www.cms.gov/medicare/coding-billing/place-of-service-codes/code-sets",
        "citation_text": "Centers for Medicare & Medicaid Services. Place of Service Code Set. Baltimore, MD: CMS; updated annually. Available at: https://www.cms.gov/medicare/coding-billing/place-of-service-codes/code-sets",
        "year": null,
        "authors_short": "CMS",
        "notes": "The authoritative and publicly available CMS POS code set listing all current two-digit codes with definitions, notes on recent additions and retirements, and coding guidelines. Required reference for any study using POS codes; should be pinned to the version current as of the study data end date."
      }
    ],
    "plain_language_summary": "A Place of Service (POS) code is a two-digit code that goes on the billing form when a doctor or other provider submits a claim for payment, and it tells the insurer where the service actually happened — for example, code 11 means the patient was seen in a physician's office, code 23 means an emergency room, and codes 02 and 10 indicate a telehealth visit. These codes only appear on the professional claim (the physician's bill), not on the facility's hospital bill, so they give analysts a direct way to classify the care setting for doctor visits but they cannot be used to classify hospital stays by themselves. One important catch: in 2022 Medicare split a single telehealth code into two, so any study comparing telehealth rates before and after 2022 needs to look for both codes or it will undercount telehealth in recent years.",
    "key_terms": [
      {
        "term": "professional claim",
        "definition": "The bill submitted by a physician, nurse practitioner, or other individual provider for their own services, filed on a CMS-1500 form (or its electronic equivalent, the 837P), as distinct from the facility bill filed by the hospital."
      },
      {
        "term": "service line",
        "definition": "A single row on a claim representing one service, procedure, or drug administration on one date; a claim can have multiple service lines, and the POS code is recorded at the service-line level so different lines on the same claim can theoretically have different settings."
      },
      {
        "term": "telehealth",
        "definition": "A health care visit conducted by video or audio connection rather than in person; in Medicare claims, telehealth visits are identified by POS codes 02 (at a site other than the patient's home) or 10 (in the patient's home), with POS 10 introduced in January 2022."
      },
      {
        "term": "site of care",
        "definition": "The physical or virtual location type where a health service is furnished, such as a physician's office, hospital outpatient department, emergency room, skilled nursing facility, or the patient's home; POS codes are the primary way this is recorded on professional claims."
      },
      {
        "term": "provider-based billing",
        "definition": "A billing arrangement that allows a physician practice owned by or affiliated with a hospital to file its professional claims under the hospital's provider number, which shifts the POS code from office (11) to hospital outpatient (19 or 22) and triggers an additional facility fee — even if the physical location of care has not changed."
      }
    ],
    "worked_example": {
      "scenario": "A health economist studying telehealth adoption and care-setting mix pulls one week of professional claims for a small hypothetical panel of eight patients seen on the same day at five different locations. She wants to classify each service line into one of four analytic buckets — Office, Outpatient Hospital, Emergency Department, or Telehealth — using the POS code, and then count how many service lines fall into each bucket. The table below shows the eight service lines exactly as they would appear in the raw claims data.\n",
      "dataset": {
        "caption": "Eight professional claim service lines for one study day: person_id, service date, pos_code, pos_description, and the CPT code billed.",
        "columns": [
          "person_id",
          "svc_date",
          "pos_code",
          "pos_description",
          "cpt_code"
        ],
        "rows": [
          [
            1001,
            "2023-06-01",
            11,
            "Office",
            "99213"
          ],
          [
            1002,
            "2023-06-01",
            22,
            "On-Campus Outpatient Hospital",
            "99214"
          ],
          [
            1003,
            "2023-06-01",
            23,
            "Emergency Room – Hospital",
            "99283"
          ],
          [
            1004,
            "2023-06-01",
            2,
            "Telehealth Other Than Patient Home",
            "99213"
          ],
          [
            1005,
            "2023-06-01",
            10,
            "Telehealth in Patient Home",
            "99214"
          ],
          [
            1006,
            "2023-06-01",
            11,
            "Office",
            "99215"
          ],
          [
            1007,
            "2023-06-01",
            19,
            "Off-Campus Outpatient Hospital",
            "99213"
          ],
          [
            1008,
            "2023-06-01",
            21,
            "Inpatient Hospital",
            "99232"
          ]
        ]
      },
      "steps": [
        "Classify each service line into an analytic bucket using the POS code. Office bucket: POS IN (11) -> persons 1001, 1006 = 2 lines. Outpatient Hospital bucket: POS IN (19, 22) -> persons 1002, 1007 = 2 lines. Emergency Department bucket: POS IN (23) -> person 1003 = 1 line. Telehealth bucket: POS IN (02, 10) -> persons 1004, 1005 = 2 lines. Inpatient physician visit (separate bucket, not outpatient): POS IN (21) -> person 1008 = 1 line.",
        "Note that person 1008 (POS 21, inpatient hospital) is classified as a physician billing for an admitted patient, not an outpatient encounter. This line should be linked to the MedPAR / facility claim to confirm inpatient status before including in an outpatient utilization analysis.",
        "Bucket counts: Office = 2, Outpatient Hospital = 2, ED = 1, Telehealth = 2, Inpatient physician = 1. Total lines = 2 + 2 + 1 + 2 + 1 = 8.",
        "Setting share for the four outpatient categories (excluding the inpatient physician line): Office = 2/7, Outpatient Hospital = 2/7, ED = 1/7, Telehealth = 2/7. expr = 2 + 2 + 1 + 2 = 7 outpatient lines.",
        "Telehealth share of all outpatient lines: 2/7 = 0.286. If the analyst had used only POS 02 (forgetting POS 10 introduced in 2022), she would have counted 1 telehealth line instead of 2, giving 1/7 = 0.143 — exactly half the correct rate. This illustrates the mandatory dual-code flag for post-2022 telehealth."
      ],
      "result": "Eight service lines classify into five buckets: Office=2, Outpatient Hospital=2, ED=1, Telehealth=2, Inpatient physician=1. Total = 2+2+1+2+1 = 8. Telehealth share of the 7 outpatient-eligible lines = 2/7 = 0.286 (28.6%). Using POS 02 alone would yield 1/7 = 0.143 (14.3%) — a 50% undercount of telehealth for post-2022 data."
    },
    "prerequisites": [
      "claims-analysis"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Telehealth flag (date-conditional POS 02/10)",
        "description": "Before January 1, 2022: telehealth = POS 02. From January 1, 2022 forward: telehealth = POS IN (02, 10). The split was introduced by CMS in the 2022 Medicare Physician Fee Schedule to distinguish services furnished with the patient at a health care facility originating site (02) from services where the patient is at home (10). Any longitudinal telehealth trend study must apply the date-conditional logic; a static POS 02 flag on post-2022 data underestimates telehealth. For commercial payer data, verify the payer's effective date for POS 10 adoption, which may lag the Medicare effective date.",
        "edge_cases": [
          "Audio-only visits may have been billed with POS 02 even when the patient was home, depending on whether the payer/CMS was treating audio-only as a telehealth variant in the given year — verify against the applicable billing guidance for the year.",
          "Some Medicare Advantage encounter records may not have adopted POS 10 uniformly at the 2022 effective date; restrict to FFS if POS 10 completeness is required."
        ],
        "data_source_notes": "Medicare FFS carrier: most reliable. Commercial: verify POS 10 effective date by payer. Medicaid: variable; check state-level billing guidance."
      },
      {
        "name": "Hospital outpatient setting (POS 19/22 with pre-2016 compatibility)",
        "description": "For studies spanning the 2016 Bipartisan Budget Act effective date, treat POS 22 (used for all outpatient hospital settings before 2016) as the outpatient hospital flag, and recognize that POS 19 (off-campus) and POS 22 (on-campus) are the post-2016 refinement. A combined flag of POS IN (19, 22) is the correct hospital outpatient indicator for a study window that spans 2016. Pre-2016 data that shows only POS 22 for outpatient hospital is not missing data — it reflects the pre-split coding convention.",
        "edge_cases": [
          "Provider-based billing sites acquired by a hospital after 2016 will shift from POS 11 to POS 19 or 22 without any physical change in care location. NPI-level affiliation data or historical CMS provider enrollment files are needed to distinguish billing changes from true site migration.",
          "Commercial payers may not have adopted the POS 19/22 distinction on the same timeline as Medicare; verify the POS 19 prevalence in commercial data before building a pre/post-2016 trend analysis."
        ],
        "data_source_notes": "Medicare FFS carrier: POS 19 emerged post-2016; reliable. Commercial: POS 19 may be sparse or absent in some payer streams even post-2016."
      },
      {
        "name": "ED encounter identification (POS 23 + CPT)",
        "description": "The professional-claim component of an ED encounter is identified by POS 23 combined with an emergency E/M CPT code (99281–99285) or a critical care code (99291–99292). POS 23 alone can appear on non-E/M services (e.g., a procedure performed in the ED), so the combination of POS 23 + an ED E/M code is the standard algorithm. The professional claim captures the physician's fee; the facility claim (UB-04 Type of Bill 013x, revenue code 045x) captures the hospital's ED facility fee. Both must be identified and linked for a complete ED encounter cost analysis.",
        "edge_cases": [
          "Observation status encounters generate UB-04 claims with Type of Bill 013x and may generate professional claims with POS 22 (outpatient hospital) rather than POS 23, because the patient was not seen in the emergency department proper. Distinguish ED from observation using the facility claim's revenue codes."
        ],
        "data_source_notes": "Medicare FFS: professional ED claim (POS 23) links to outpatient institutional claim via beneficiary ID + service date. Commercial: same linkage logic applies."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "icd-10-cm-diagnosis-coding",
        "pros_of_this": "POS codes directly encode setting rather than requiring setting inference from place-of-service proxies embedded in diagnosis codes. A POS 23 unambiguously identifies an ED-billed professional claim; a diagnosis code alone cannot.",
        "cons_of_this": "POS codes are confined to the professional claim stream and reflect billing designations, not verified clinical locations. ICD-10-CM codes apply to both professional and institutional claims, giving broader cross-claim coverage for condition classification.",
        "when_to_prefer": "Use POS for setting classification on professional claims; use ICD-10-CM for condition classification across both claim types."
      },
      {
        "compared_to": "cpt-procedure-coding",
        "pros_of_this": "POS codes encode the setting; CPT codes encode what was done. They are complementary, not substitutes. POS 22 + J9271 (pembrolizumab) identifies a hospital-outpatient drug administration in a way that neither code alone can do.",
        "cons_of_this": "CPT alone can sometimes proxy setting (e.g., OR-specific CPT codes imply inpatient or ASC settings), but this is less reliable than POS and does not distinguish office from hospital outpatient for E/M services.",
        "when_to_prefer": "Always use POS + CPT jointly for setting-and-procedure classification on professional claims."
      },
      {
        "compared_to": "hcpcs-level-ii-j-codes",
        "pros_of_this": "POS identifies the care setting where the drug was administered. HCPCS J-codes identify the drug. For site-of-care infusion studies, both are required.",
        "cons_of_this": "POS and J-codes are both line-level fields and are read from the same professional claim service line, so there is no methodological conflict — they are joint attributes of the same event.",
        "when_to_prefer": "Use POS + J-code together to classify infusion site of care in chemotherapy and biologic infusion studies."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "POS is a line-level field on the professional claim (CMS-1500 / 837P), not the header. Load from the line-level table (e.g., LINE_ICD_DGNS_CD table in Medicare, line-level file in MarketScan). For encounter-level classification, take the POS from the first or most-specific service line, or apply a hierarchy (23 > 21 > 22 > 19 > 11 > 02/10 > others) if lines within the same claim carry different codes. Join professional and institutional claims on beneficiary ID + service date (plus provider NPI if available) to reconcile POS with Type of Bill for a complete encounter record.",
      "ehr": "EHR systems may carry an internal POS or facility type field, but it is not the CMS POS code set unless explicitly mapped. Do not assume EHR place-of-service fields are equivalent without a validated crosswalk. For setting classification in linked claims-EHR studies, derive setting from the claims POS, not the EHR encounter type.",
      "registry": "Disease registries generally lack a POS-equivalent field. Setting information in registries is typically a facility-level designation (e.g., NCI-designated cancer center vs. community hospital) rather than a claim-level POS code. For POS-based site-of-care classification, the professional claims stream is required and the registry alone is insufficient."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\nfrom datetime import date\n\n# CMS POS code -> canonical analytic setting bucket.\n# For telehealth, the correct mapping depends on date (POS 10 introduced 2022-01-01).\n# For hospital outpatient, POS 19 (off-campus) and 22 (on-campus) are both hospital outpatient;\n# POS 19 was introduced in 2016 — prior to that all hospital outpatient was billed as POS 22.\n\nSETTING_MAP = {\n    \"11\": \"office\",\n    \"19\": \"outpatient_hospital\",      # off-campus, post-2016\n    \"22\": \"outpatient_hospital\",      # on-campus (also used pre-2016 for all hosp outpatient)\n    \"21\": \"inpatient_physician\",      # physician billing for admitted patient -- link to MedPAR\n    \"23\": \"emergency_department\",\n    \"24\": \"ambulatory_surgical_center\",\n    \"31\": \"skilled_nursing_facility\",\n    \"12\": \"home\",\n    \"81\": \"independent_laboratory\",\n    # telehealth: 02 always telehealth; 10 added 2022-01-01 for in-home\n    \"02\": \"telehealth\",\n    \"10\": \"telehealth\",               # only valid from 2022-01-01; treated as telehealth regardless\n}\n\nTELEHEALTH_CODES = {\"02\", \"10\"}\nPOS_10_EFFECTIVE = date(2022, 1, 1)\n\n\ndef classify_pos(df: pd.DataFrame,\n                 pos_col: str = \"pos_code\",\n                 svc_date_col: str = \"svc_date\") -> pd.Series:\n    \"\"\"\n    Classify each professional claim service line into an analytic setting bucket.\n\n    Parameters\n    ----------\n    df : pd.DataFrame\n        Must contain columns for pos_code (str or int) and svc_date (date or str ISO).\n    pos_col : str\n        Column name holding the two-digit POS code (will be zero-padded to 2 chars).\n    svc_date_col : str\n        Column name holding the service date (converted to datetime.date internally).\n\n    Returns\n    -------\n    pd.Series of str\n        Analytic setting bucket for each row. 
Unmapped codes return \"other\".\n    \"\"\"\n    pos = df[pos_col].astype(str).str.strip().str.zfill(2)\n    svc_dt = pd.to_datetime(df[svc_date_col]).dt.date\n\n    def _row_setting(p: str, svc: date) -> str:\n        if p == \"10\" and svc < POS_10_EFFECTIVE:\n            # POS 10 did not exist before 2022; treat as unmapped / data quality flag\n            return \"pos10_before_effective_date\"\n        return SETTING_MAP.get(p, \"other\")\n\n    return pd.Series(\n        [_row_setting(p, s) for p, s in zip(pos, svc_dt)],\n        index=df.index,\n        name=\"analytic_setting\"\n    )\n\n\n# --- worked example ---\nif __name__ == \"__main__\":\n    lines = pd.DataFrame({\n        \"person_id\": [1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008],\n        \"svc_date\":  [\"2023-06-01\"] * 8,\n        \"pos_code\":  [\"11\", \"22\", \"23\", \"02\", \"10\", \"11\", \"19\", \"21\"],\n        \"cpt_code\":  [\"99213\", \"99214\", \"99283\", \"99213\", \"99214\",\n                      \"99215\", \"99213\", \"99232\"],\n    })\n\n    lines[\"analytic_setting\"] = classify_pos(lines)\n    print(lines[[\"person_id\", \"pos_code\", \"analytic_setting\"]])\n\n    # Telehealth flag: POS IN (02, 10) -- MUST include both for post-2022 data\n    telehealth_flag = lines[\"pos_code\"].astype(str).str.zfill(2).isin(TELEHEALTH_CODES)\n    print(f\"\\nTelehealth lines (POS 02 or 10): {telehealth_flag.sum()}\")\n\n    # If analyst mistakenly used POS 02 only:\n    pos02_only = lines[\"pos_code\"].astype(str).str.zfill(2).isin({\"02\"})\n    print(f\"POS 02 only (undercount): {pos02_only.sum()}\")\n\n    # Distribution\n    print(\"\\nSetting distribution:\")\n    print(lines[\"analytic_setting\"].value_counts())",
        "description": "Builds a setting classifier for professional claims using POS codes, with date-conditional telehealth\nlogic for the 2022 POS 02/10 split. Accepts a pandas DataFrame of professional claim service lines\nand returns a new column with the analytic setting bucket. Handles the pre-2016 vs. post-2016 hospital\noutpatient distinction and flags inpatient physician visits for separate treatment. Demonstrates the\ndual-code telehealth flag that every post-2022 analysis requires.",
        "dependencies": [
          "pandas"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(dplyr)\n\n# POS code -> analytic setting bucket\npos_setting_map <- c(\n  \"11\" = \"office\",\n  \"19\" = \"outpatient_hospital\",   # off-campus outpatient, introduced 2016\n  \"22\" = \"outpatient_hospital\",   # on-campus outpatient (also used pre-2016 for all hosp outpatient)\n  \"21\" = \"inpatient_physician\",   # physician bill for admitted patient -- link to MedPAR\n  \"23\" = \"emergency_department\",\n  \"24\" = \"ambulatory_surgical_center\",\n  \"31\" = \"skilled_nursing_facility\",\n  \"12\" = \"home\",\n  \"81\" = \"independent_laboratory\",\n  \"02\" = \"telehealth\",\n  \"10\" = \"telehealth\"             # valid from 2022-01-01 only\n)\n\nPOS_10_EFFECTIVE <- as.Date(\"2022-01-01\")\n\n#' Classify professional claim service lines by POS setting.\n#'\n#' @param df A data frame with columns pos_code (character) and svc_date (Date or ISO string).\n#' @param pos_col Column name for the two-digit POS code.\n#' @param svc_date_col Column name for the service date.\n#' @return The input data frame with an added column analytic_setting.\nclassify_pos <- function(df,\n                         pos_col = \"pos_code\",\n                         svc_date_col = \"svc_date\") {\n  df %>%\n    mutate(\n      # zero-pad to 2 chars to handle integer or character input\n      pos_padded = stringr::str_pad(as.character(.data[[pos_col]]),\n                                    width = 2, side = \"left\", pad = \"0\"),\n      svc_date_parsed = as.Date(.data[[svc_date_col]]),\n      analytic_setting = case_when(\n        # POS 10 was not valid before 2022-01-01\n        pos_padded == \"10\" & svc_date_parsed < POS_10_EFFECTIVE ~\n          \"pos10_before_effective_date\",\n        pos_padded %in% names(pos_setting_map) ~\n          unname(pos_setting_map[pos_padded]),\n        TRUE ~ \"other\"\n      )\n    ) %>%\n    select(-pos_padded, -svc_date_parsed)\n}\n\n# --- worked example ---\nlines <- data.frame(\n  person_id = 1001:1008,\n  svc_date  = 
as.Date(\"2023-06-01\"),\n  pos_code  = c(\"11\", \"22\", \"23\", \"02\", \"10\", \"11\", \"19\", \"21\"),\n  cpt_code  = c(\"99213\", \"99214\", \"99283\", \"99213\", \"99214\",\n                \"99215\", \"99213\", \"99232\"),\n  stringsAsFactors = FALSE\n)\n\nlines <- classify_pos(lines)\nprint(lines[, c(\"person_id\", \"pos_code\", \"analytic_setting\")])\n\n# Telehealth flag -- MUST include both POS 02 and POS 10 for post-2022 data\ntelehealth_flag <- lines$pos_code %in% c(\"02\", \"10\")\ncat(sprintf(\"\\nTelehealth lines (POS 02 or 10): %d\\n\", sum(telehealth_flag)))\n\n# Demonstrate undercount if analyst uses POS 02 only\npos02_only <- lines$pos_code == \"02\"\ncat(sprintf(\"POS 02 only (undercount): %d\\n\", sum(pos02_only)))\n\n# Setting distribution\ncat(\"\\nSetting distribution:\\n\")\nprint(table(lines$analytic_setting))",
        "description": "R implementation of the POS setting classifier for professional claims. Uses a named vector lookup\nfor the setting map and applies a date-conditional rule for the 2022 POS 10 split. Demonstrates the\ntelehealth dual-code flag and the outpatient hospital combined flag (POS 19 + 22). Compatible with\ndata.table and dplyr workflows.",
        "dependencies": [
          "dplyr"
        ],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  PRO[\"Professional Claim<br/>(CMS-1500 / 837P)\"]\n  INST[\"Institutional Claim<br/>(UB-04 / 837I)\"]\n\n  PRO -->|\"Field 24B:<br/>POS code\"| POS[\"POS Code<br/>(2-digit, line level)\"]\n  INST -->|\"No POS field\"| TOB[\"Type of Bill +<br/>Revenue Center Code\"]\n\n  POS --> OFF[\"11 = Office\"]\n  POS --> HOSP[\"19/22 = Outpatient Hospital<br/>(split introduced 2016)\"]\n  POS --> INP[\"21 = Inpatient<br/>(physician bill only)\"]\n  POS --> ED[\"23 = Emergency Room\"]\n  POS --> ASC[\"24 = ASC\"]\n  POS --> TH[\"02 = Telehealth (other)<br/>10 = Telehealth (home)<br/>(10 introduced 2022)\"]\n\n  HOSP -->|\"Same patient, same exam room<br/>different billing affiliation\"| WARN[\"⚠ Provider-based<br/>billing: POS shift<br/>≠ care-location shift\"]\n  TH -->|\"Pre-2022: POS 02 only<br/>Post-2022: POS 02 + 10\"| DATE[\"Date-conditional<br/>telehealth flag required\"]",
        "caption": "POS codes live on the professional claim service line (CMS-1500 field 24B) and have no equivalent on the institutional claim (UB-04). The 2016 POS 19/22 split and the 2022 POS 02/10 telehealth split are the two most consequential code-set changes for RWE research.",
        "alt_text": "Flowchart showing professional claim feeding into a POS code that branches to office, outpatient hospital, inpatient, ED, ASC, and telehealth settings; institutional claim routes to Type of Bill and Revenue Center with no POS field; warnings for provider-based billing and date-conditional telehealth flag.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "part_of",
        "target_slug": "claims-analysis",
        "notes": "POS codes are a line-level field on the professional claims stream. Understanding POS is a prerequisite for setting classification in any claims-based RWE study."
      },
      {
        "relation_type": "used_with",
        "target_slug": "cpt-procedure-coding",
        "notes": "POS and CPT are joint service-line attributes on the professional claim. Site-of-care classification requires both: POS identifies where, CPT identifies what was done. A drug administration (J-code or CPT 96xxx) at POS 22 is a hospital-outpatient infusion; the same code at POS 11 is an office infusion."
      },
      {
        "relation_type": "used_with",
        "target_slug": "hcpcs-level-ii-j-codes",
        "notes": "HCPCS J-codes identify the drug administered; POS identifies the setting. For chemotherapy and biologic infusion site-of-care studies, the combination of J-code + POS is the standard classification approach."
      },
      {
        "relation_type": "tradeoff_with",
        "target_slug": "icd-10-cm-diagnosis-coding",
        "notes": "ICD-10-CM codes carry no setting information but apply across both professional and institutional claim types. POS codes carry setting information but are confined to professional claims only. They are complementary, not substitutes, in a complete encounter-level analysis."
      },
      {
        "relation_type": "see_also",
        "target_slug": "medicare-ffs-ma-commercial-claims-differences-rwe",
        "notes": "POS code completeness and coding conventions differ between Medicare FFS, Medicare Advantage, and commercial payer streams. The POS 19/22 split and POS 10 introduction affect Medicare FFS first; commercial and MA adoption may lag. Cross-payer POS analyses require payer-specific validation."
      },
      {
        "relation_type": "see_also",
        "target_slug": "procedure-identification-and-measurement-in-claims-ehr",
        "notes": "POS codes are a key covariate in procedure identification: the same CPT procedure code carries different cost and clinical implications depending on the POS setting (e.g., knee injection at POS 11 vs. POS 22). Setting-stratified procedure counts require POS."
      },
      {
        "relation_type": "see_also",
        "target_slug": "icd-10-pcs-procedure-coding",
        "notes": "ICD-10-PCS codes appear on institutional claims and encode inpatient procedures. POS 21 on a professional claim confirms a physician was billing for an admitted patient; ICD-10-PCS on the corresponding facility claim identifies what procedure was performed during that admission."
      }
    ],
    "aliases": [
      "POS code",
      "POS codes",
      "place of service"
    ],
    "complexity": "foundational",
    "regulatory_relevance": [],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "plain-language-summaries-evidence",
    "name": "Plain-Language Summaries of Evidence",
    "short_definition": "A plain-language summary (PLS) is a concise, jargon-free account of a study's methods, findings, and limitations written for patients, caregivers, and the general public rather than for clinicians or researchers; it translates relative effect measures (hazard ratios, odds ratios) into absolute, natural-frequency statements that lay readers can correctly interpret, and it must be written honestly — without promotional framing, cherry-picked endpoints, or false certainty — to comply with EU Clinical Trial Regulation 536/2014 and emerging journal, registry, and payer requirements.",
    "long_description": "**What a plain-language summary is and why it exists**\n\nA plain-language summary (PLS) — also called a lay summary, patient summary, or plain-\nlanguage abstract — is a short document that describes a study in language accessible to\nadults without medical or statistical training. The core problem it solves is well\ndocumented: health statistics as typically reported (relative risks, hazard ratios, p-values,\nconfidence intervals) are systematically misunderstood by most patients, journalists, and\neven many clinicians. Gigerenzer and colleagues demonstrated that \"25% risk reduction\" is\nroutinely interpreted as meaning one in four patients benefited, when in fact it often means\n1 fewer patient per 1,000 had the outcome. A PLS replaces such statements with natural\nfrequencies and absolute counts that correct this systematic error.\n\n**Where PLS is now required or expected**\n\nThe regulatory and editorial landscape has shifted decisively toward mandatory disclosure:\n\n- *EU Clinical Trial Regulation (CTR) 536/2014*: Sponsors must publish a lay summary of\n  clinical trial results in the EU Clinical Trials Information System (CTIS) within 12 months\n  of trial end (6 months for paediatric trials). The lay summary must be non-promotional,\n  written in language understandable to lay persons, and cover the trial purpose, design,\n  key findings, and benefit-risk conclusion. This is the strongest regulatory mandate for\n  PLS globally and is now in force.\n- *Journal requirements*: A growing number of journals — including BMJ, JAMA Network Open,\n  and Osteoarthritis and Cartilage — require a lay summary alongside each article. 
The\n  proportion of journals requiring PLS has grown steadily since 2015.\n- *Sponsor medical communications*: Pharmaceutical and device sponsors increasingly produce\n  PLS versions of journal publications for patient advocacy groups, payer evidence packages,\n  and registry communications.\n- *HTA and payer dossiers*: NICE and some other HTA bodies increasingly expect\n  patient-friendly summaries of evidence submissions, though these are rarely yet mandatory.\n- *RWE and registry communications*: Patient registries that collect outcomes from real-\n  world populations carry an implicit obligation to communicate findings back to participants;\n  a PLS is the appropriate vehicle.\n\n**The evidence-based communication toolkit**\n\nResearch in risk communication provides clear guidance on what works:\n\n*Natural frequencies over percentages.* Expressing \"12 of 100 untreated patients had the\nevent, compared with 9 of 100 treated patients\" is understood correctly by far more lay\nreaders than \"the hazard ratio was 0.75\" or \"risk was reduced by 25%.\" Natural frequencies\nanchor the probability to a concrete reference class (100 people like you) and make the\nbase rate visible, eliminating the most common misreading. See `risk-ratio-and-risk-\ndifference` for the absolute-vs-relative machinery and `number-needed-to-treat-rwe` for the\nNNT as the canonical natural-frequency translation of an absolute risk reduction.\n\n*Avoid OR-speak entirely.* Odds ratios, being non-collapsible and further from a direct\nfrequency interpretation than risk ratios, should not appear in a PLS. If the analysis\nproduced an OR, convert it to an approximate risk ratio (valid when outcome is rare) or\nto a marginal risk difference via g-computation before writing the PLS.\n\n*Icon arrays and pictographs.* A grid of 100 person icons, with some colored to indicate\nthe event, conveys the natural frequency visually and is particularly effective for\naudiences with lower numeracy. 
Icon arrays outperform bar charts and text alone in\ncomprehension studies.\n\n*Framing symmetry.* Always report both the event framing (\"3 of 100 had the event\") and\nthe survival framing (\"97 of 100 remained event-free\") for both arms. Reporting only the\nreduction is a form of positive framing bias. Both framings are the same arithmetic fact;\npresenting both signals honesty and helps readers who think in terms of survivors rather\nthan events.\n\n*Numeracy-aware design.* Design PLS for the lower-numeracy segment of the audience —\ntypically grade 6 reading level and 40th-percentile numeracy. Avoid decimals when a\ncount-per-100 is available. Avoid \"X times as likely,\" which is frequently misread as \"X\npercentage points more likely.\" Prefer \"3 more people per 100\" over \"0.03 more.\"\n\n*Uncertainty communication without false precision.* A PLS must convey uncertainty without\nusing CI notation that lay readers cannot interpret. Acceptable alternatives: \"we are fairly\nconfident the treatment helps, but we cannot rule out a smaller benefit\"; \"the study was not\nlarge enough to be sure about rare side effects.\" Phrases like \"definitely shows\" or \"proves\"\nare precision-inflating and impermissible.\n\n**Readability levels honestly**\n\nReadability formulas (Flesch-Kincaid Grade Level, Flesch Reading Ease, SMOG, Gunning Fog)\nare crude proxies for comprehension. A grade-6-to-8 target is the standard reference for\nhealth communications to a general audience. These formulas count syllables, sentences, and\nword length — they reward short words and short sentences mechanically. A PLS optimized\nonly for a readability formula can still be unintelligible if it uses short but unfamiliar\ntechnical words or if its sentence structure is incoherent. The correct use of readability\nformulas is as a *gate* — a PLS that scores above grade 12 is almost certainly too\ntechnical and should be revised — not as a goal. 
The goal is genuine comprehension,\nverified where possible by cognitive interviewing with representative lay readers.\n\n**What a PLS must NOT do**\n\n- *Promotional drift*: EU CTR lay summaries are explicitly required to be non-promotional.\n  Language that attributes causal benefit without appropriate hedging, that selects only\n  favorable endpoints to report, or that uses marketing phrases (\"breakthrough,\" \"transform\")\n  violates this requirement. This line matters clinically: a PLS that overstates benefit\n  without disclosing uncertainty or competing harms could influence patient decisions.\n- *Cherry-picking endpoints*: The PLS should report on the same primary endpoint as the\n  study. If the primary endpoint was non-significant, the PLS must say so; it cannot\n  silently shift to a secondary that showed a benefit.\n- *Certainty inflation*: \"This treatment works\" is not appropriate language in a PLS for\n  an observational RWE study. \"In this study, patients who received Treatment A had fewer\n  events\" is factually correct and appropriately hedged.\n- *Omitting harms*: If the study detected adverse events, the PLS must describe them in\n  plain language with the same natural-frequency format used for benefits.\n\n**Structure templates: Good Lay Summary Practice (GLSP)**\n\nThe Good Lay Summary Practice (GLSP) guidance, which was adopted into EU CTR regulation,\nprovides the canonical structure for clinical trial lay summaries: (1) Why was the study\ndone? (2) Who took part? (3) What happened during the study? (4) What were the results?\n(5) What were the side effects? (6) What were the limitations? 
(7) What happens next?\nThis seven-part structure is also useful for journal PLS and RWE communications because\nit ensures completeness and prevents the omission of limitations and harms.\n\n**The RWE-specific challenge: explaining confounding to lay readers**\n\nObservational RWE — claims studies, EHR cohorts, registry analyses — presents a unique\ncommunication challenge that trial PLS does not face: the study was not randomized, so\nthe treatment groups may have differed at baseline in ways that influence the outcome.\nA PLS for an observational study must honestly convey this without inducing paralysis.\nA workable template: \"In this study, patients who received Treatment A were compared with\npatients who received Treatment B. We tried to account for differences between the groups\nusing statistical methods, but because this was not a randomized study, we cannot be\ncertain that other factors did not influence the results.\" This statement is accurate,\nnon-technical, appropriately hedged, and does not require the lay reader to understand\npropensity scoring or confounding adjustment.\n\n**AI-assisted PLS drafting with human verification**\n\nLarge language models can produce first-draft PLS text quickly and at scale. The\nappropriate workflow is: (1) generate a draft that translates the key result statistics\ninto natural-frequency statements; (2) verify the arithmetic is exact (the draft is\nunreliable for numerical translation); (3) check for promotional drift, omitted harms,\nand certainty inflation introduced by the model; (4) have a medical writer and at least one\npatient representative or health-literacy specialist review the draft. 
See\n`llm-assisted-abstraction-rwe` for the broader AI-in-evidence-synthesis framework.\nAI assistance accelerates drafting but does not replace the human verification step,\nwhich is the checkpoint where the most consequential errors (especially wrong numbers and\npromotional framing) must be caught.\n\n**Pros, cons, and trade-offs**\n\n*Pros*: Fulfills a regulatory mandate (EU CTR) and emerging journal requirements; increases\npatient understanding of what studies found; translates effect measures into actionable\nnatural frequencies; corrects the systematic misreading of relative-risk statements; builds\ntrust with patient communities; creates a citable, accessible evidence record alongside the\ntechnical publication; supports informed shared decision-making.\n\n*Cons*: Adding a PLS requires time, subject-matter expertise, and health-literacy expertise\nthat most research teams lack; poorly written PLS can mislead more than technical abstracts\nby omitting nuance; the readable format can create false certainty through simplification;\nreadability formula optimization can produce grammatically simple but conceptually opaque\ntext; promotional drift is a persistent risk when sponsors write their own PLS without\nindependent review.\n\n*Trade-offs*: More detail increases accuracy but reduces accessibility; shorter text is more\nreadable but omits uncertainty and limitations; natural frequencies are concrete but require\nthe communicator to have an accurate baseline risk from the study, which can be hard to\nderive from a reported HR without additional data.\n\n**When to use**\n\nUse a PLS when: (1) it is required by regulation (EU CTR 536/2014 mandates one for all EU\ntrials completed after its full entry into force); (2) the target journal requires or\nencourages a lay summary; (3) results from a registry or\nRWE study will be shared with patient communities, advocacy groups, or payer/HTA bodies\nas part of a stakeholder engagement plan; (4) 
an evidence brief or formulary submission\nincludes quantitative findings that need to be communicated to non-clinical decision-makers;\n(5) a clinical trial concludes and participants are owed a return-of-results communication.\n\n**When NOT to use — and when a PLS is actively misleading**\n\n- *When the evidence is too preliminary*: Phase I dose-escalation data, animal-model\n  findings, or exploratory subgroup analyses communicated as if they are confirmatory results\n  harm rather than inform. A PLS implies sufficient evidence to communicate; if the evidence\n  base does not meet that bar, the correct communication is \"this is too early to know.\"\n- *When the PLS cannot accurately represent the primary endpoint*: If the primary endpoint\n  was a composite, a surrogate, or a patient-reported outcome that cannot be expressed in\n  plain natural-frequency terms without losing its meaning, the PLS must say so rather than\n  substituting a more accessible but incorrect characterization.\n- *When the arithmetic is wrong*: A PLS with an incorrect NNT or natural-frequency count\n  is more harmful than no PLS, because lay readers have no independent check on the\n  numbers. Do not publish a PLS until the arithmetic translation has been independently\n  verified.\n- *When promotional intent overrides accuracy*: If organizational or commercial pressures\n  will prevent honest reporting of limitations, adverse events, or non-significant primary\n  endpoints, it is better not to produce a PLS than to produce a misleading one. EU CTR\n  regulation explicitly requires non-promotional lay summaries; violations carry regulatory\n  consequences.\n\n**Interpreting the output**\n\nUsing the worked example: an observational RWE study produces a hazard ratio of 0.75 for a\n2-year cardiovascular endpoint. 
The comparator-arm 2-year cumulative risk is 12% (0.12).\nThe arithmetic translation yields: treated-arm risk ≈ 0.12 × 0.75 = 0.09 (9 of 100);\nabsolute risk reduction = 0.12 − 0.09 = 0.03 (3 per 100); NNT ≈ 100/3 ≈ 33 (34 rounded up).\n\n*(1) Formal interpretation.* The approximation risk_treated ≈ baseline_risk × HR is valid\nwhen the outcome is relatively rare and the hazard is approximately constant over the window;\nat 12% baseline risk it introduces modest error, and a competing-risk-aware cumulative\nincidence approach is more precise. The NNT of approximately 33 is the reciprocal of the\nabsolute risk reduction (1 / 0.03 = 33.3) and is specific to the 2-year horizon and the\n12% comparator-arm baseline risk: it cannot be transferred to a lower-risk population\nwithout re-anchoring (see `number-needed-to-treat-rwe`). Because this is an observational\nstudy, the effect estimate is associational; the 95% CI and any unmeasured confounding\ncaveat must appear alongside the NNT in the PLS.\n\n*(2) Practical interpretation.* The PLS statement \"9 of every 100 people who took this\ntreatment had a cardiovascular event over 2 years; 12 of every 100 people in the comparison\ngroup had a cardiovascular event — 3 fewer per 100\" gives a lay reader the complete picture\nwithout requiring any statistical background. Paired with \"this study was not a randomized\ntrial, so other differences between the groups may explain some of this result,\" and \"we\nestimate that treating about 33 people for 2 years would prevent one event at this baseline\nrisk,\" the PLS is an honest, numeracy-appropriate communication of the finding. A\ndecision-maker or formulary analyst can read the NNT directly as: treating 33 patients\nfor 2 years at an average risk of 12% prevents one cardiovascular event.",
    "primary_category": "Framework_Standard",
    "tags": [
      "plain-language",
      "lay-summary",
      "patient-communication",
      "health-literacy",
      "risk-communication",
      "natural-frequency",
      "eu-ctr",
      "evidence-communication",
      "nnt-translation",
      "readability"
    ],
    "applies_to_study_types": [
      "cohort_retrospective",
      "cohort_prospective",
      "rct",
      "registry_study",
      "claims_analysis",
      "ehr_study",
      "comparative_effectiveness"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "primary",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1111/j.1539-6053.2008.00033.x",
        "url": "https://doi.org/10.1111/j.1539-6053.2008.00033.x",
        "citation_text": "Gigerenzer G, Gaissmaier W, Kurz-Milcke E, Schwartz LM, Woloshin S. Helping doctors and patients make sense of health statistics. Psychological Science in the Public Interest. 2007;8(2):53-96.",
        "year": 2007,
        "authors_short": "Gigerenzer et al.",
        "notes": "The foundational evidence review demonstrating that health statistics — including survival rates, relative risks, and screening test accuracy — are systematically misunderstood by patients, journalists, and clinicians; introduces natural frequencies and absolute counts as the communication formats that correct these misreadings. Essential reading for understanding why the PLS toolkit prioritizes \"12 of 100 people\" over \"HR 0.75.\""
      },
      {
        "role": "explain",
        "doi": "10.55752/amwa.2022.156",
        "url": "https://doi.org/10.55752/amwa.2022.156",
        "citation_text": "Schindler E. The making of the Good Lay Summary Practice guidance: a multi-stakeholder document that was adopted into regulation. AMWA Journal. 2022.",
        "year": 2022,
        "authors_short": "Schindler",
        "notes": "Describes the development of the GLSP (Good Lay Summary Practice) guidance by the multi-stakeholder EFPIA/EUCOPE consortium and its adoption as the practical standard for EU CTR 536/2014 lay summaries; covers the seven-part structure, non-promotional requirement, and the process by which this industry-academic guidance became regulatory."
      },
      {
        "role": "demonstrate",
        "doi": "10.1016/j.joca.2021.05.002",
        "url": "https://doi.org/10.1016/j.joca.2021.05.002",
        "citation_text": "Block JA. The plain language summary (Lay Language Summary) in Osteoarthritis and Cartilage. Osteoarthritis and Cartilage. 2021.",
        "year": 2021,
        "authors_short": "Block",
        "notes": "Editorial introducing mandatory plain language summaries in Osteoarthritis and Cartilage journal; illustrates the growing journal-level requirement for PLS alongside publications and provides practical framing for what a journal PLS should and should not contain."
      }
    ],
    "plain_language_summary": "A plain-language summary (PLS) translates a study's findings into everyday language so that patients, caregivers, and the public can understand what the research found without needing a medical or statistics background. Instead of reporting a hazard ratio of 0.75, a PLS says \"9 of every 100 people who took the treatment had a heart event, compared with 12 of every 100 in the comparison group — 3 fewer per 100.\" Since 2022, the EU requires all clinical trial sponsors to publish a lay summary of results; many journals and patient registries now expect one too. The one firm rule: a PLS must describe what the study found honestly, including limitations and side effects, and must not use promotional language or present results as more certain than the evidence supports.",
    "key_terms": [
      {
        "term": "natural frequency",
        "definition": "A way of expressing a risk as a concrete count in a fixed group — \"12 of 100 people\" rather than \"12%\" or \"an 0.12 probability\" — which research consistently shows is understood more accurately by lay readers than probability fractions or percentages."
      },
      {
        "term": "absolute risk reduction",
        "definition": "The plain difference in event rates between two groups — for example, 12 per 100 minus 9 per 100 equals 3 per 100 — which shows how many fewer events the treatment produced; easier to communicate honestly than a relative measure like a hazard ratio."
      },
      {
        "term": "promotional drift",
        "definition": "The gradual shift of a lay summary's language toward marketing framing — overstating benefit, downplaying harm, or implying certainty — which is explicitly prohibited in EU Clinical Trial Regulation lay summaries and can mislead patient decision-making."
      },
      {
        "term": "readability grade level",
        "definition": "A number produced by formulas such as Flesch-Kincaid that estimates the school grade a reader needs to understand a text; a grade 6–8 target is standard for patient communications, but the formula is a rough screen, not a substitute for clear writing."
      },
      {
        "term": "framing symmetry",
        "definition": "The practice of reporting the same statistic in both its event framing (\"3 of 100 had the event\") and its survival framing (\"97 of 100 remained event-free\"), so that neither a positive nor a negative spin dominates the reader's impression."
      },
      {
        "term": "EU CTR 536/2014",
        "definition": "The European Union Clinical Trial Regulation that makes it mandatory for sponsors to publish a plain-language lay summary of clinical trial results in the EU trial register within 12 months of the trial ending, and requires that summary to be non-promotional."
      }
    ],
    "worked_example": {
      "scenario": "An observational claims-based RWE study comparing a new cardiovascular drug with a standard comparator reports a hazard ratio (HR) of 0.75 (95% CI 0.60–0.93) for a 2-year composite cardiovascular endpoint. The comparator-arm 2-year cumulative event risk is 12 per 100 patients. A medical writer must translate this into a plain-language summary for a patient registry newsletter using natural frequencies, an absolute risk count, and an honest uncertainty statement. No randomization occurred; this is an observational study.",
      "dataset": {
        "caption": "Summary statistics for the two treatment arms over a 2-year follow-up window. The treated-arm risk is derived from the baseline risk and the HR using the approximation HR ≈ RR, which is reasonable when the 2-year risk is below 20%.",
        "columns": [
          "group",
          "n_per_100",
          "events_per_100",
          "risk"
        ],
        "rows": [
          [
            "comparator (untreated)",
            100,
            12,
            0.12
          ],
          [
            "index drug (treated)",
            100,
            9,
            0.09
          ]
        ]
      },
      "steps": [
        "Start with the comparator-arm baseline risk: 12 of every 100 patients had the cardiovascular event over 2 years, so risk_untreated = 12 / 100 = 0.12.",
        "Apply the HR as an approximation for the risk ratio: risk_treated = 0.12 * 0.75 = 0.09. This means 9 of every 100 treated patients had the event over the same 2-year window.",
        "Compute the absolute risk reduction in natural-frequency terms: risk_reduction = 12 - 9 = 3 per 100 patients. Three fewer events for every 100 people treated with the index drug rather than the comparator over 2 years.",
        "Compute the number needed to treat: NNT = 100 / 3 ≈ 33 (conventionally rounded up to 34 in practice — you cannot treat a fraction of a person to prevent a fraction of an event). About 33 to 34 patients need the index drug instead of the comparator for 2 years for one additional patient to avoid the endpoint.",
        "Apply framing symmetry: alongside the event framing (\"9 of 100 treated had the event\"), report the survival framing: \"91 of 100 treated patients did not have the event\" vs \"88 of 100 comparator patients did not have the event.\" Both facts are the same arithmetic; presenting both prevents one-sided impression.",
        "Add the observational-study caveat in plain language: \"This study was not randomized, so we used statistical methods to try to account for differences between the groups. We cannot be certain that other factors did not contribute to the difference we observed.\""
      ],
      "result": "risk_untreated = 12 / 100 = 0.12; risk_treated = 0.12 * 0.75 = 0.09; risk_reduction = 12 - 9 = 3 per 100; NNT ≈ 100 / 3 ≈ 33 (round up to 34 in practice). The PLS statement reads: \"In this observational study, 9 of every 100 people who took the index drug had a cardiovascular event over 2 years, compared with 12 of every 100 in the comparison group — 3 fewer events per 100 people treated. We estimate that treating about 33 to 34 people for 2 years would prevent one event at this baseline risk. Because this was not a randomized study, other differences between the groups may explain part of this result. We cannot rule out a smaller real benefit.\""
    },
    "prerequisites": [
      "risk-ratio-and-risk-difference",
      "number-needed-to-treat-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "EU CTR 536/2014 lay summary (regulatory mandatory)",
        "description": "The non-promotional lay summary required by EU Clinical Trial Regulation 536/2014, published in CTIS within 12 months of trial end (6 months for paediatric). Must follow the GLSP seven-part structure: why done, who participated, what happened, main results, side effects, limitations, what happens next. Written for lay persons; reviewed for compliance by the sponsor's regulatory team and, optionally, by a patient representative.",
        "edge_cases": [
          "If the trial was terminated early, the lay summary must report the reason for termination — \"the trial stopped early because of safety signals\" — not omit it.",
          "If the primary endpoint showed no significant benefit, the lay summary must say so plainly; omitting or minimizing a null result is a form of promotional drift and is non-compliant with the non-promotional requirement."
        ],
        "data_source_notes": "Source is the clinical trial database, not real-world data; effect estimates come from the clinical study report. Natural-frequency translation uses the trial's comparator-arm risk, which is known from the protocol-specified analysis."
      },
      {
        "name": "Journal PLS (publication companion)",
        "description": "A 200-400 word summary accompanying a journal article, required by an increasing number of titles. Typically structured as: what we already knew, what this study adds, and the key finding in plain language. Must accurately summarize the same endpoints and findings as the parent article, not a more favorable subset.",
        "edge_cases": [
          "For observational studies, the PLS must clearly state the design limitation: \"this was not a randomized trial\" or equivalent — omitting this transforms an associational finding into an implied causal claim in the mind of a lay reader.",
          "When the article includes multiple endpoints, the PLS should lead with the primary endpoint even if a secondary endpoint was more favorable."
        ],
        "data_source_notes": "Source is the published analysis. Absolute risks must be traceable to the paper's tables; any natural-frequency numbers that cannot be verified from the paper must not appear."
      },
      {
        "name": "RWE registry and patient-community communication",
        "description": "A plain-language brief for patients enrolled in a disease registry, a cohort study, or any real-world data collection. Explains what the study found, what it cannot tell us (because it was observational), and what it means for patients. Often produced annually as a return-of-results obligation to participants.",
        "edge_cases": [
          "Registry studies typically report on populations rather than treatment comparisons; the natural-frequency toolkit still applies — \"X of every 100 people in this registry had the complication over 5 years\" — but the causal frame must be avoided entirely.",
          "If the registry has a mixed language community, the PLS must be translated and culturally adapted, not just linguistically translated; numeracy conventions and reference classes differ across cultures."
        ],
        "data_source_notes": "claims/ehr/registry: derive the baseline risk from the observed event rate in the study population; use the study's own denominator rather than an external benchmark."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "risk-ratio-and-risk-difference",
        "pros_of_this": "A PLS adds the communication layer that turns a correctly estimated RR or RD into something a lay reader can act on; without the PLS, even a correctly reported RD remains inaccessible to patients and advocacy groups.",
        "cons_of_this": "The PLS itself introduces the risk of oversimplification — the natural-frequency statement is only as accurate as the baseline risk used to compute it, and the arithmetic approximation HR ≈ RR introduces error at higher baseline risks.",
        "when_to_prefer": "Always produce both: the full statistical result (RR/RD/HR with CI) for technical audiences and the PLS for lay audiences. The PLS never replaces the technical analysis — it wraps it."
      },
      {
        "compared_to": "number-needed-to-treat-rwe",
        "pros_of_this": "A PLS contextualizes the NNT within a narrative — the study context, the patient group, the comparator, the time horizon, the limitations — whereas a bare NNT without context can mislead (an NNT of 33 means very different things at different baseline risks and for different outcomes).",
        "cons_of_this": "A PLS is harder to standardize and harder to audit for accuracy than a computed NNT; writing quality varies enormously across authors, and poorly written PLS may be worse than a clear numerical table.",
        "when_to_prefer": "Use the NNT as the anchor for the absolute-benefit statement inside the PLS; they are complementary, not alternatives."
      },
      {
        "compared_to": "pro-rwe",
        "pros_of_this": "A PLS communicates researcher-to-patient; PRO data communicates patient-to-researcher. They are complementary directions of the same patient-centered evidence ecosystem; the PLS makes research outputs accessible while PRO data makes patient experience researchable.",
        "cons_of_this": "A PLS cannot substitute for collecting patient-reported outcomes; it only describes what was studied, and if the study did not collect patient-relevant endpoints (function, symptom burden), the PLS cannot remedy that gap.",
        "when_to_prefer": "Both are needed in a fully patient-centered RWE program; use PRO collection to ensure the endpoints are patient-relevant, and PLS to return those findings to patients."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Claims studies rarely collect patient-relevant narratives; the PLS must explain what administrative data can and cannot measure (\"this study used insurance records to track hospitalizations and prescriptions; it cannot tell us about symptoms or how patients felt\"). The baseline risk (needed for natural-frequency translation) comes from the observed event rate in the comparator arm of the propensity-matched or weighted cohort. State the follow-up window explicitly because claims-based follow-up is bounded by enrollment continuity.",
      "ehr": "EHR studies have the same observational limitation caveat as claims, plus the visit-driven ascertainment caveat: \"this study could only count events seen in this health system — care received elsewhere was not captured.\" Baseline risks from EHR cohorts may be inflated if the population is sicker than average (referral bias); flag this in the PLS limitations.",
      "registry": "Registry PLS is often a return-of-results obligation. Registries vary in event completeness and follow-up; the PLS should describe who is in the registry, how long they were followed, and what the registry can and cannot capture. Natural frequencies are still the recommended format for any event-rate or incidence result.",
      "primary": "Clinical trial PLS (required under EU CTR 536/2014) has the clearest baseline risks (from the trial's control arm) and the strongest causal language warrant (\"patients randomly assigned to\" rather than \"patients who received\"). The PLS can state outcomes more directly, but must still communicate uncertainty (CI, trial limitations, generalizability).",
      "linked": "Linked data studies combine the richness of multiple sources but require the PLS to explain data linkage in accessible terms: \"this study combined prescription records with hospital records to track both treatment and outcomes.\" Linkage completeness should be disclosed; if a substantial fraction of patients could not be linked, the PLS must note that the study may not represent all patients."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import math\nimport re\n\n\ndef hr_to_natural_freq(hr, baseline_risk, n_per_group=100, horizon_label=\"2 years\"):\n    \"\"\"\n    Translate a hazard ratio + baseline risk into natural-frequency PLS language.\n\n    Uses approximation HR ≈ RR, which is reasonable when:\n      - baseline_risk < 0.10 (outcome is rare): error is negligible\n      - baseline_risk 0.10-0.20: small error; flag with a warning\n      - baseline_risk > 0.20: compute from cumulative incidence functions instead\n\n    Returns a dict with counts, ARR, NNT, and a ready-made PLS sentence.\n    \"\"\"\n    if baseline_risk > 0.20:\n        raise ValueError(\n            f\"baseline_risk={baseline_risk:.2f} > 0.20; the HR ≈ RR approximation \"\n            \"breaks down. Compute treated-arm risk from a cumulative incidence \"\n            \"function (Kaplan-Meier or competing-risk model) instead.\"\n        )\n    if baseline_risk > 0.10:\n        print(\n            f\"WARNING: baseline_risk={baseline_risk:.2f} is above 10%; \"\n            \"the HR ≈ RR approximation introduces modest error (~5-10%). 
\"\n            \"Consider competing-risk-aware cumulative incidence for precision.\"\n        )\n\n    treated_risk = baseline_risk * hr         # approximation: HR ≈ RR\n    control_events = round(baseline_risk * n_per_group)\n    treated_events = round(treated_risk * n_per_group)\n    arr = baseline_risk - treated_risk         # absolute risk reduction\n    nnt = 1.0 / arr if arr > 0 else float(\"inf\")\n    nnt_rounded = math.ceil(nnt)               # always round UP\n\n    # Framing symmetry: both event and survival framings\n    control_survivors = n_per_group - control_events\n    treated_survivors = n_per_group - treated_events\n\n    print(f\"EVENT FRAMING (per {n_per_group} over {horizon_label}):\")\n    print(f\"  Comparator: {control_events} had the event | {control_survivors} did not\")\n    print(f\"  Treated:    {treated_events} had the event | {treated_survivors} did not\")\n    print(f\"  Reduction:  {control_events - treated_events} fewer events per {n_per_group}\")\n    print(f\"  ARR = {arr:.4f}  |  NNT = {nnt:.1f} (rounded up to {nnt_rounded})\")\n    print()\n    print(\"PLS SENTENCE (event framing):\")\n    print(\n        f\"  In this study, {treated_events} of every {n_per_group} people who received \"\n        f\"the treatment had the event over {horizon_label}, compared with \"\n        f\"{control_events} of every {n_per_group} in the comparison group — \"\n        f\"{control_events - treated_events} fewer events per {n_per_group} people treated.\"\n    )\n    print(\n        f\"  Treating about {nnt_rounded} people for {horizon_label} would be expected \"\n        f\"to prevent one event at this baseline risk.\"\n    )\n    return {\n        \"control_events\": control_events,\n        \"treated_events\": treated_events,\n        \"arr\": arr,\n        \"nnt_exact\": nnt,\n        \"nnt_rounded\": nnt_rounded,\n    }\n\n\ndef flesch_kincaid_grade(text):\n    \"\"\"\n    Approximate Flesch-Kincaid Grade Level for a PLS text.\n    Target 
for patient communications: grade 6-8.\n    Formula: 0.39 * (words/sentences) + 11.8 * (syllables/words) - 15.59.\n    Syllables counted by vowel-group heuristic (adequate for screening, not exact).\n    \"\"\"\n    # Drop empty segments: text ending in punctuation leaves a trailing empty string,\n    # which would otherwise inflate the sentence count by one.\n    sentences = max(1, len([s for s in re.split(r\"[.!?]+\", text) if s.strip()]))\n    words_list = re.findall(r\"\\b[a-zA-Z]+\\b\", text)\n    n_words = max(1, len(words_list))\n\n    def count_syllables(word):\n        word = word.lower()\n        count = len(re.findall(r\"[aeiou]+\", word))\n        if word.endswith(\"e\") and count > 1:\n            count -= 1          # silent trailing 'e' heuristic\n        return max(1, count)\n\n    syllables = sum(count_syllables(w) for w in words_list)\n    grade = 0.39 * (n_words / sentences) + 11.8 * (syllables / n_words) - 15.59\n    label = (\n        \"PASS (target 6-8)\" if 6 <= grade <= 8\n        else \"TOO SIMPLE\" if grade < 6\n        else \"TOO TECHNICAL: revise\"\n    )\n    print(f\"Flesch-Kincaid Grade Level: {grade:.1f}  [{label}]\")\n    return grade\n\n\n# ── Worked example: HR 0.75, comparator 2-year risk 0.12 ─────────────────────\nresult = hr_to_natural_freq(hr=0.75, baseline_risk=0.12, n_per_group=100,\n                            horizon_label=\"2 years\")\n# risk_treated  = 0.12 * 0.75 = 0.09  (exact arithmetic)\n# risk_reduction = 12 - 9 = 3 per 100 (exact arithmetic)\n# NNT ≈ 100 / 3 ≈ 33 (rounded up to 34)\n\n# ── Readability check on the PLS sentence ─────────────────────────────────────\nsample_pls = (\n    \"In this study, 9 out of every 100 people who took the medicine had a heart event \"\n    \"over two years. In the comparison group, 12 out of every 100 people had a heart \"\n    \"event. That means the medicine may have prevented about 3 heart events for every \"\n    \"100 people treated. This was not a randomized study, so we cannot be certain the \"\n    \"medicine caused this difference. 
We estimate treating about 34 people for two years \"\n    \"would prevent one event at this level of risk.\"\n)\nflesch_kincaid_grade(sample_pls)",
        "description": "Two utilities for PLS authoring: (1) hr_to_natural_freq — converts a hazard ratio and\nbaseline risk to a natural-frequency statement and NNT, using the approximation HR ≈ RR;\nwarns when baseline risk exceeds 10% (approximation degrades). (2) flesch_kincaid_grade\n— computes the Flesch-Kincaid Grade Level of a PLS text as a readability gate (target:\ngrade 6-8). Reproduces the worked example: HR 0.75, baseline risk 0.12, n per group 100.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "# ── 1. HR to natural frequency translation ────────────────────────────────────\nhr_to_natural_freq <- function(hr, baseline_risk, n_per_group = 100L,\n                              horizon_label = \"2 years\") {\n  if (baseline_risk > 0.20)\n    stop(sprintf(\n      \"baseline_risk=%.2f > 0.20; HR approx RR breaks down. Use cumulative incidence.\",\n      baseline_risk\n    ))\n  if (baseline_risk > 0.10)\n    message(sprintf(\n      \"WARNING: baseline_risk=%.2f > 0.10; HR approx RR has modest error (~5-10%%).\",\n      baseline_risk\n    ))\n\n  treated_risk     <- baseline_risk * hr     # approximation HR approx RR\n  control_events   <- round(baseline_risk * n_per_group)\n  treated_events   <- round(treated_risk   * n_per_group)\n  arr              <- baseline_risk - treated_risk\n  nnt_exact        <- if (arr > 0) 1 / arr else Inf\n  nnt_rounded      <- ceiling(nnt_exact)     # always round UP\n\n  control_survivors <- n_per_group - control_events\n  treated_survivors <- n_per_group - treated_events\n\n  cat(sprintf(\"EVENT FRAMING (per %d over %s):\\n\", n_per_group, horizon_label))\n  cat(sprintf(\"  Comparator: %d had event | %d did not\\n\",\n              control_events, control_survivors))\n  cat(sprintf(\"  Treated:    %d had event | %d did not\\n\",\n              treated_events, treated_survivors))\n  cat(sprintf(\"  Reduction:  %d fewer events per %d\\n\",\n              control_events - treated_events, n_per_group))\n  cat(sprintf(\"  ARR = %.4f  |  NNT = %.1f (rounded up to %d)\\n\",\n              arr, nnt_exact, nnt_rounded))\n\n  cat(\"\\nPLS SENTENCE:\\n\")\n  cat(sprintf(\n    \"  In this study, %d of every %d people who received the treatment had the event\\n\"\n    \"  over %s, compared with %d of every %d in the comparison group —\\n\"\n    \"  %d fewer events per %d people treated.\\n\",\n    treated_events, n_per_group, horizon_label,\n    control_events, n_per_group,\n    control_events - treated_events, n_per_group\n  
))\n  cat(sprintf(\n    \"  Treating about %d people for %s would be expected to prevent one event.\\n\",\n    nnt_rounded, horizon_label\n  ))\n  invisible(list(control_events = control_events, treated_events = treated_events,\n                 arr = arr, nnt_exact = nnt_exact, nnt_rounded = nnt_rounded))\n}\n\n# ── 2. Flesch-Kincaid Grade Level (base R, vowel-group syllable heuristic) ────\nflesch_kincaid_grade <- function(text) {\n  # strsplit() drops the trailing empty piece, so count the non-blank fragments directly\n  parts     <- unlist(strsplit(text, \"[.!?]+\"))\n  sentences <- max(1L, sum(nzchar(trimws(parts))))\n  words     <- unlist(regmatches(text, gregexpr(\"\\\\b[A-Za-z]+\\\\b\", text)))\n  n_words   <- max(1L, length(words))\n\n  count_syllables <- function(word) {\n    word  <- tolower(word)\n    count <- length(regmatches(word, gregexpr(\"[aeiou]+\", word))[[1L]])\n    if (endsWith(word, \"e\") && count > 1L) count <- count - 1L  # silent trailing e\n    max(1L, count)\n  }\n  syllables <- sum(vapply(words, count_syllables, integer(1L)))\n\n  grade <- 0.39 * (n_words / sentences) + 11.8 * (syllables / n_words) - 15.59\n  label <- if (grade >= 6 && grade <= 8) \"PASS (target 6-8)\"\n           else if (grade < 6) \"TOO SIMPLE\"\n           else \"TOO TECHNICAL — revise\"\n  cat(sprintf(\"Flesch-Kincaid Grade Level: %.1f  [%s]\\n\", grade, label))\n  invisible(grade)\n}\n\n# ── Worked example ────────────────────────────────────────────────────────────\n# HR 0.75, comparator 2-year risk 0.12, n = 100 per group\n# risk_treated = 0.12 * 0.75 = 0.09  (exact)\n# risk_reduction = 12 - 9 = 3 per 100 (exact)\n# NNT = 100/3 ≈ 33 (round up to 34)\nres <- hr_to_natural_freq(hr = 0.75, baseline_risk = 0.12)\n\n# Readability check on a sample PLS sentence\nsample_pls <- paste(\n  \"In this study, 9 out of every 100 people who took the medicine had a heart event\",\n  \"over two years. In the comparison group, 12 out of every 100 people had a heart event.\",\n  \"That means the medicine may have prevented about 3 heart events for every 100 people\",\n  \"treated. 
This was not a randomized study, so we cannot be certain the medicine caused\",\n  \"this difference. We estimate treating about 34 people for two years would prevent\",\n  \"one event at this level of risk.\"\n)\nflesch_kincaid_grade(sample_pls)",
        "description": "R equivalents of the two Python utilities: hr_to_natural_freq and flesch_kincaid_grade.\nBoth use only base R. Reproduces the worked example (HR 0.75, baseline risk 0.12)\nand prints a ready-made PLS sentence with both event and survival framings.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Effect[Study reports HR / RR / OR] --> Type{Effect measure type}\n  Type -->|HR or RR| Base[Identify comparator-arm<br/>baseline risk at horizon]\n  Type -->|OR only| Conv[Convert OR to approx RR<br/>or use g-computation for RD]\n  Conv --> Base\n  Base --> Nat[\"Natural frequency translation:\\n12 of 100 untreated vs 9 of 100 treated\"]\n  Nat --> ARR[\"Absolute risk reduction:\\n12 - 9 = 3 per 100\"]\n  ARR --> NNT[\"NNT = 100 / ARR\\n≈ 33 (round up to 34)\"]\n  Nat --> Frame[\"Framing symmetry:\\nreport event AND survival framing\"]\n  NNT --> Draft[\"Write PLS statement with:\\n- Natural frequencies both arms\\n- ARR and NNT\\n- Time horizon explicit\\n- Uncertainty hedge\\n- Observational caveat if RWE\"]\n  Draft --> Read[\"Readability check:\\nFlesch-Kincaid grade 6-8 target\"]\n  Read --> Review[\"Human review:\\n- Medical writer\\n- Patient representative\\n- Check for promotional drift\"]\n  Review --> Publish[\"Publish PLS:\\n- CTIS (EU CTR mandatory)\\n- Journal PLS (if required)\\n- Registry newsletter\"]",
        "caption": "End-to-end workflow for translating a study effect estimate into a compliant plain-language summary. The natural-frequency translation (center) is the core step; framing symmetry, readability screening, and human review are mandatory quality gates.",
        "alt_text": "Flowchart from a study's HR/RR/OR through baseline-risk anchoring, natural-frequency translation, absolute risk reduction, NNT, framing symmetry, draft, readability check, human review, and publication in CTIS, journals, or registries.",
        "source_type": "illustrative",
        "source_citations": [
          "gigerenzer-2007"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "used_with",
        "target_slug": "number-needed-to-treat-rwe",
        "notes": "The NNT is the canonical communication format for absolute benefit in a PLS; the translation from absolute risk reduction to NNT (NNT = 1/ARR) is the arithmetic step that converts a statistical estimate into a patient-interpretable count."
      },
      {
        "relation_type": "used_with",
        "target_slug": "risk-ratio-and-risk-difference",
        "notes": "The risk difference (absolute risk reduction) is the foundation of the natural-frequency statement in a PLS; the RR or HR provides the relative effect, but the PLS must translate it to an absolute count by pairing it with the observed baseline risk."
      },
      {
        "relation_type": "see_also",
        "target_slug": "hazard-ratio-interpretation",
        "notes": "The HR is the most common effect measure requiring translation in oncology and cardiovascular RWE PLS; correct interpretation of the HR is prerequisite to an accurate natural-frequency translation (HR ≈ RR only under specific conditions)."
      },
      {
        "relation_type": "see_also",
        "target_slug": "llm-assisted-abstraction-rwe",
        "notes": "AI-assisted drafting of PLS text is an emerging application; LLM tools can generate first-draft natural-frequency statements but require human arithmetic verification and promotional-drift checking before publication."
      },
      {
        "relation_type": "see_also",
        "target_slug": "pro-rwe",
        "notes": "Patient-reported outcomes are the patient-facing evidence inputs; PLS is the patient-facing evidence output — together they form the patient-centered evidence cycle of collecting what patients experience and returning those findings in accessible form."
      }
    ],
    "aliases": [
      "lay summary",
      "plain language abstract",
      "patient summary",
      "lay language summary",
      "plain-language abstract",
      "non-technical summary"
    ],
    "complexity": "foundational",
    "regulatory_relevance": [
      "eu_ctr",
      "fda",
      "ema",
      "hta",
      "journal"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "poisson-distribution",
    "name": "Poisson Distribution for Counts and Rates",
    "short_definition": "The Poisson distribution gives the probability of exactly k events in a fixed observation window when events occur independently at a constant average rate λ; it is the mathematical backbone of incidence-rate work in RWE, supplying the person-time offset model, the equidispersion baseline that defines the Poisson process, and the exact confidence intervals used for rare safety-signal counts in pharmacovigilance.",
    "long_description": "The Poisson distribution, parameterized by a single rate λ (lambda), assigns to each\nnon-negative integer k the probability P(X = k) = e^(−λ)·λ^k / k!. When the observation\nunit is one patient-year of follow-up, λ is the incidence rate — the most fundamental\nsummary in RWE. Understanding the Poisson distribution as a probability model in its own\nright, not only as a regression family, clarifies why it fits, why it fails, and which fixes\napply to which violations: negative-binomial overdispersion correction, robust sandwich\nstandard errors, or exact confidence intervals for rare counts.\n\n**The Poisson process and its four assumptions**\n\nThe Poisson distribution is the stationary distribution of a Poisson process, a continuous-time\ncounting process characterized by four assumptions. First, independent increments: events in\nnon-overlapping time intervals are statistically independent, so one burst of events does not\npredict the next. Second, stationary increments: the event rate λ is constant over time within\nthe observation period — the probability of an event in any small window dt is λ·dt regardless\nof calendar time or how long the process has been running. Third, orderliness: at most one\nevent can occur in an infinitesimally small window; simultaneous events have probability zero.\nFourth, non-negative integer support: counts are 0, 1, 2, ... with no upper bound.\n\nIn claims and EHR data, all four assumptions are routinely violated. Independent increments\nfails when a single severe illness triggers a cascade of correlated events — hospitalization\nbegets readmission begets ED visit, a pattern called event clustering that is pervasive in\ncomplex-care populations. Stationarity fails because seasonal patterns (winter respiratory\nillness spikes, COVID-driven disruption of care-seeking in 2020) make λ vary with calendar\ntime. 
Orderliness fails when same-day claims from multiple providers are grouped under a\nsingle service date. Despite these violations, the Poisson model with correctly specified\nrobust standard errors (sandwich variance) remains a useful workhorse for point estimation\nof rates and rate ratios; it is the standard errors — not the point estimates — that suffer\nmost when the assumptions fail.\n\n**Equidispersion: the defining constraint and when it fails**\n\nThe Poisson distribution has one free parameter: λ governs both the mean and the variance,\nso E(X) = Var(X) = λ. This identity — equidispersion — is not an assumption layered on top;\nit is the mathematical consequence of the Poisson process and is therefore always testable\nfrom the data. A Pearson dispersion statistic (Pearson chi-squared divided by residual\ndegrees of freedom) near 1.0 is consistent with equidispersion; values substantially greater\nthan 1 signal overdispersion (variance exceeds mean), and values substantially less than 1\nsignal underdispersion (rare in healthcare data). In real-world healthcare utilization data,\noverdispersion is almost universal: a minority of high-utilizers — frail elderly patients,\nindividuals with multi-morbidity — generates a long right tail of event counts whose variance\nfar exceeds the mean. Under overdispersion, Poisson regression produces point estimates of\nthe log-rate-ratio that are roughly unbiased, but the standard errors are anticonservatively\nsmall, confidence intervals too narrow, and p-values too small. 
The canonical fix is negative\nbinomial regression, which adds a dispersion parameter α so that Var(X) = λ + αλ² (NB-2\nparameterization); a lighter-weight alternative is Poisson regression with robust sandwich\nstandard errors, which corrects the SEs without committing to the NB likelihood.\n\n**Person-time denominators and the offset trick**\n\nThe central pattern in RWE incidence-rate analysis is Poisson regression with log(person-time)\nentered as an offset — a covariate whose coefficient is constrained to exactly 1 — rather\nthan as a free predictor. The model becomes log(μ_i) = β₀ + β₁X_i + log(T_i), equivalent\nto log(μ_i / T_i) = β₀ + β₁X_i, where μ_i/T_i is the event rate for patient i per unit\nperson-time. The exponentiated coefficient exp(β₁) is then an incidence rate ratio (IRR)\ncomparing event rates per unit person-time between exposure groups, conditional on covariates.\nOmitting the offset — or treating log(person-time) as a regular covariate with a free\ncoefficient — corrupts the estimand: without the offset constraint, the model predicts raw\ncounts rather than rates, and patients followed longer appear to have more events simply\nbecause they were observed longer, producing a bias that mirrors immortal-time distortion.\n\nIn claims data, person-time is the sum of each patient's continuously observable FFS-enrolled\ndays, censored at disenrollment, death, or the administrative study end. Enrollment gaps,\nMedicare Advantage-only spans (where inpatient claims are absent and event counts are\nincomplete), and pre-index washout periods must be excluded from the observable person-time\nbefore computing the offset. 
Constructing this denominator correctly is upstream\ninfrastructure; its quality determines the validity of every downstream Poisson or negative\nbinomial rate model.\n\n**Exact Poisson confidence intervals for rare counts**\n\nWhen event counts are small — typically fewer than about 30 — the normal approximation for\nan incidence rate confidence interval (rate ± 1.96 × SE) is unreliable because the Poisson\ndistribution is right-skewed at small λ. The exact Poisson CI (Garwood interval, or\nequivalently the gamma/chi-squared inversion) constructs the CI directly from the count D\nand total person-time T using chi-squared quantiles: lower = χ²(α/2, 2D) / (2T) and\nupper = χ²(1−α/2, 2D+2) / (2T). This exact CI is the standard for rare safety-signal\nreporting in pharmacovigilance — spontaneous reporting systems, observed-versus-expected\nanalyses, and sequential signal detection methods such as maxSPRT and TreeScan. In R,\npoisson.test() supplies the exact CI from aggregate inputs; in Python, scipy.stats.chi2\nprovides the quantiles; in SAS, PROC STDRATE uses aggregate person-time and event counts to\nproduce exact CIs directly.\n\n**The Poisson approximation to the binomial for rare events**\n\nWhen a binary event is rare (event probability p much less than 0.05) and the number of\ntrial opportunities n is large, the binomial distribution B(n, p) converges to a Poisson\ndistribution with λ = np. This approximation underpins the use of Poisson models in cohort\nstudies of rare outcomes: a binary indicator — first stroke yes or no — with a low event\nrate across large accrued person-time behaves like a Poisson count, and Poisson regression\nserves as a log-linear approximation to the binomial with excellent accuracy. The\napproximation is reliable when p is below about 0.05 and np is below about 10. 
At higher\nevent rates the binomial with logistic regression is the correct model for a binary outcome.\n\n**Poisson regression with robust variance: Zou's method for risk ratios on binary outcomes**\n\nIn clinical trials and cohort studies, the risk ratio — the ratio of proportions — is often\nmore interpretable and policy-relevant than the odds ratio. Logistic regression gives an odds\nratio that approximates the risk ratio only when the event is rare; for common outcomes\n(prevalence above 10 to 20 percent), the odds ratio substantially overstates the risk ratio.\nA widely adopted workaround described by Zou (2004) is Poisson regression with robust\nsandwich standard errors applied to a binary outcome: treat the binary event indicator as if\nit were a count, fit a Poisson GLM with a log link and no offset (because all patients have\nthe same fixed follow-up horizon), obtain the exponentiated coefficient as the risk ratio, and\nuse the sandwich SE to correct for the misspecified variance structure (since a binary outcome\nis Bernoulli, not Poisson). The RR estimate from this modified-Poisson approach is consistent\nand the sandwich-corrected CIs have valid asymptotic coverage. This method is now ubiquitous\nin RWE publications and regulatory submissions for producing adjusted risk ratios from binary\nendpoints in large cohorts.\n\n**Interpreting the output**\n\nConsider a Poisson rate regression for a hospitalization outcome comparing an exposed cohort\nto an unexposed cohort, adjusted for age, sex, and a comorbidity index, with log(person-years)\nas the offset. The model returns a log-rate-ratio coefficient of 0.693 for the exposure\nindicator, with a model-based 95% confidence interval on the log scale of 0.41 to 0.98.\n\nFormal interpretation. 
The exponentiated coefficient exp(0.693) equals approximately 2.0,\nwhich is the incidence rate ratio (IRR): the estimated hospitalization rate among the exposed\ncohort is 2.0 times the rate in the unexposed cohort per unit person-time, holding age, sex,\nand comorbidity constant. The 95% CI on the IRR scale corresponds to exp(0.41) to exp(0.98),\napproximately 1.51 to 2.66. This interval does NOT mean there is a 95% probability that the\ntrue IRR lies between 1.51 and 2.66; it means that in repeated sampling under the model's\nassumptions, 95% of such intervals would contain the true parameter. The equidispersion\ncaveat is critical: if the Pearson dispersion statistic substantially exceeds 1.0, the\nmodel-based SEs are anticonservatively narrow and the CI is too tight — in that setting,\nreport negative-binomial IRRs or Poisson IRRs with robust sandwich SEs rather than the\nmodel-based CI shown here.\n\nPractical interpretation. Hospitalizations occur about twice as often per year of follow-up\namong the exposed group as among the unexposed group, after accounting for age, sex, and\ncomorbidity burden. This is an association; whether it reflects a causal effect of the\nexposure depends entirely on the study design and on how well unmeasured confounding has been\naddressed.\n\n**Pros, cons, and trade-offs**\n\nPros of the Poisson distribution as the baseline count model: a single interpretable parameter\nλ governs both the mean and variance; the log-link offset trick cleanly converts counts to\nrates; the IRR is directly interpretable and maps naturally to benefit-risk narratives in\nregulatory and HTA submissions; exact Poisson CIs require no approximation for rare safety\nevents; the Poisson-binomial relationship provides a theoretical bridge to binary-outcome\ncohort analyses; Zou's robust-variance extension delivers risk ratios from binary outcomes\nwithout the rare-event restriction of logistic regression. 
Cons: equidispersion Var(X) = E(X)\nfails in virtually all real healthcare utilization data, requiring negative-binomial or\nquasi-Poisson variance correction; the constant-rate assumption is violated by seasonal\nclustering and secular trends; the independence assumption is violated by within-person event\nclustering. Versus negative binomial: the NB adds a dispersion parameter α that absorbs\noverdispersion at the cost of one additional parameter; reserve Poisson for genuinely\nequidispersed or rare-event counts and treat NB as the safe default for HCRU. Versus\nbinomial with logistic link: Poisson targets rates with person-time denominators while\nbinomial targets proportions over a fixed horizon; the two are connected through Zou's\nrobust-variance approach when a risk ratio is desired for a common binary outcome.\n\n**When to use**\n\nUse the Poisson distribution or Poisson regression when computing crude or adjusted incidence\nrates from person-time denominators in a log-linear offset model; constructing exact Poisson\nconfidence intervals for rare safety-signal counts in pharmacovigilance, observed-versus-expected\nanalyses, or sequential signal detection; applying Zou's modified Poisson regression with\nrobust SEs to estimate risk ratios for binary outcomes in large cohorts where the outcome\nis too common for the rare-event approximation; diagnosing the equidispersion baseline against\nwhich overdispersion is tested before choosing negative-binomial or quasi-Poisson variance\ncorrection; or modeling genuinely equidispersed count data such as sentinel event rates in\ntightly controlled procedural registers.\n\n**When NOT to use**\n\nDo not use naive Poisson regression with model-based SEs when the Pearson dispersion statistic\nsubstantially exceeds 1 and overdispersion is confirmed — use negative-binomial regression or\nquasi-Poisson SEs instead, because plain Poisson SEs will be anticonservatively narrow and\nwill produce false positives at the 
conventional significance threshold. Do not model\ndurations, time-to-first-event, or censored survival outcomes with Poisson regression — these\nbelong to the Cox proportional-hazards model or parametric survival families that correctly\nhandle the at-risk process and censoring structure. Do not use a Poisson model for recurrent\nevents with strong within-patient correlation without adding cluster-robust (sandwich) variance\nor a frailty term; ignoring within-person clustering understates uncertainty, particularly when\nhigh-utilizer patients dominate the count distribution. Do not apply a single-rate Poisson\nmodel when the hazard is strongly time-varying — a peri-procedural spike followed by a flat\ntail, for example — without either splitting person-time into clinically meaningful intervals\nor including follow-up-time splines. Do not interpret a cause-specific Poisson rate as a\npatient-facing probability when competing events such as death are common in the cohort.",
    "primary_category": "Inferential_Statistics",
    "tags": [
      "statistics",
      "primitive",
      "distributions",
      "count-data",
      "incidence-rates",
      "poisson",
      "equidispersion",
      "offset",
      "person-time",
      "safety-surveillance"
    ],
    "applies_to_study_types": [
      "cohort_retrospective",
      "cohort_prospective",
      "rct",
      "cross_sectional",
      "active_comparator_new_user",
      "claims_analysis",
      "ehr_study",
      "registry_study"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "primary",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1037/0033-2909.118.3.392",
        "url": "https://doi.org/10.1037/0033-2909.118.3.392",
        "citation_text": "Gardner W, Mulvey EP, Shaw EC. Regression analyses of counts and rates: Poisson, overdispersed Poisson, and negative binomial models. Psychological Bulletin. 1995;118(3):392-404.",
        "year": 1995,
        "authors_short": "Gardner et al.",
        "notes": "Accessible introduction to Poisson regression for counts and rates, overdispersion testing, and the negative-binomial extension; the standard pedagogic reference for analysts moving from ANOVA-style tests to rate models in behavioral and health research."
      },
      {
        "role": "explain",
        "doi": "10.1093/oxfordjournals.aje.a114001",
        "url": "https://doi.org/10.1093/oxfordjournals.aje.a114001",
        "citation_text": "Frome EL, Checkoway H. Use of Poisson regression models in estimating incidence rates and ratios. American Journal of Epidemiology. 1985;121(2):309-323.",
        "year": 1985,
        "authors_short": "Frome & Checkoway",
        "notes": "Translates Poisson rate regression into the epidemiologic language of incidence rates and rate ratios with person-time denominators and the log-link offset construction; the direct template for IRR reporting in pharmacoepidemiology and RWE rate studies."
      },
      {
        "role": "demonstrate",
        "doi": "10.1093/aje/kwh090",
        "url": "https://doi.org/10.1093/aje/kwh090",
        "citation_text": "Zou G. A modified Poisson regression approach to prospective studies with binary data. American Journal of Epidemiology. 2004;159(7):702-706.",
        "year": 2004,
        "authors_short": "Zou",
        "notes": "Demonstrates that Poisson regression with robust sandwich standard errors on a binary outcome yields a correctly calibrated risk ratio for common outcomes where logistic regression would overstate relative risk; this modified-Poisson approach is now standard in large RWE cohort publications and regulatory submissions requiring risk ratios."
      },
      {
        "role": "use",
        "doi": "10.1016/0304-4076(90)90014-K",
        "url": "https://doi.org/10.1016/0304-4076(90)90014-K",
        "citation_text": "Cameron AC, Trivedi PK. Regression-based tests for overdispersion in the Poisson model. Journal of Econometrics. 1990;46(3):347-364.",
        "year": 1990,
        "authors_short": "Cameron & Trivedi",
        "notes": "Derives regression-based score tests for overdispersion in the Poisson model; the standard reference for formal overdispersion testing before deciding between Poisson and negative-binomial variance structures."
      }
    ],
    "plain_language_summary": "The Poisson distribution is a mathematical formula for \"how many times will this rare event happen?\" when events occur independently of one another — for example, how many new heart-failure hospitalizations will occur across a group of patients over a year of follow-up. Instead of a head count of patients who had the event, it divides the number of events by the total time all patients were actively being watched (summing every patient's individual follow-up length), producing a rate like \"20 hospitalizations per 1000 patient-years of observation.\" That rate is the distribution's single number λ (lambda), which also happens to equal its spread — a convenient property called equidispersion that almost always breaks down in real healthcare data where a small group of high-utilizers inflates the variance, signalling the analyst to upgrade to the negative binomial model instead.",
    "key_terms": [
      {
        "term": "rate (λ, lambda)",
        "definition": "The average number of events per unit of observation time, the single number that fully specifies the Poisson distribution; it equals both the mean and the variance under the Poisson assumption."
      },
      {
        "term": "person-time",
        "definition": "The total amount of follow-up time added up across all patients in a study — if 500 patients are each followed for two years, person-time is 1000 person-years; it is the denominator when computing an incidence rate."
      },
      {
        "term": "equidispersion",
        "definition": "The defining property of the Poisson distribution that the mean and variance are equal (both equal λ); when the actual data show variance larger than the mean (overdispersion), a negative binomial model is the correct fix."
      },
      {
        "term": "offset",
        "definition": "In a Poisson regression, the log of each patient's follow-up time added to the model with its coefficient locked at 1, so the model predicts a rate (events per unit time) rather than a raw count; omitting it turns a rate comparison into a biased count comparison."
      },
      {
        "term": "exact Poisson CI",
        "definition": "A confidence interval for a rate built directly from the Poisson distribution using chi-squared quantiles rather than a normal approximation; essential when the observed event count is small (fewer than about 30) because the normal approximation is inaccurate for rare events."
      },
      {
        "term": "rare-event approximation",
        "definition": "When a binary outcome is very rare (probability well below 5%), the Poisson distribution closely approximates the binomial, so Poisson regression can be used on a binary endpoint to produce risk ratios (via Zou's robust-variance method) instead of odds ratios from logistic regression."
      }
    ],
    "worked_example": {
      "scenario": "A pharmacoepidemiologist is measuring the incidence rate of a serious adverse event (SAE) in two cohorts followed from their first prescription. Cohort A (new drug) accumulated 24 SAE events over 1200 person-years of follow-up. Cohort B (standard care) accumulated 36 SAE events over 900 person-years. She wants the crude rate in each cohort per 1000 person-years and the rate ratio comparing the two.",
      "dataset": {
        "caption": "Aggregate event counts and person-time for two treatment cohorts. Each patient's individual follow-up days were summed to produce person_years; rate_per_py divides events by person_years.",
        "columns": [
          "cohort",
          "events",
          "person_years",
          "rate_per_py",
          "rate_per_1000_py"
        ],
        "rows": [
          [
            "A (new drug)",
            24,
            1200,
            0.02,
            20
          ],
          [
            "B (standard care)",
            36,
            900,
            0.04,
            40
          ]
        ]
      },
      "steps": [
        "Cohort A crude rate: 24 / 1200 = 0.02 events per person-year.",
        "Rescale to per 1000 person-years: 0.02 * 1000 = 20 events per 1000 person-years.",
        "Cohort B crude rate: 36 / 900 = 0.04 events per person-year.",
        "Rescale: 0.04 * 1000 = 40 events per 1000 person-years.",
        "Rate ratio (B vs A) = 0.04 / 0.02 = 2.0. The SAE occurs twice as often per person-year in Cohort B as in Cohort A.",
        "For an exact Poisson 95% CI on Cohort A's rate: use chi-squared inversion on the raw count of 24 events over 1200 person-years. In R, poisson.test(24, T=1200) produces the exact interval; in Python, scipy.stats.chi2.ppf() supplies the same quantiles. The exact CI matters here because the normal approximation is unreliable for small counts."
      ],
      "result": "Rate A = 24 / 1200 = 0.02 per person-year; 0.02 * 1000 = 20 per 1000 person-years. Rate B = 36 / 900 = 0.04 per person-year; 0.04 * 1000 = 40 per 1000 person-years. Rate ratio = 0.04 / 0.02 = 2.0: the SAE occurs twice as often per person-year in Cohort B."
    },
    "prerequisites": [
      "descriptive-statistics",
      "inferential-statistics-foundations"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Poisson rate model with person-time offset",
        "description": "The canonical incidence-rate estimator: log(μ_i) = β₀ + β'X_i + log(T_i), where T_i is patient i's person-time and its log enters as an offset (coefficient fixed at 1). The exponentiated regression coefficients are IRRs. Used whenever follow-up varies across patients — which is the norm in claims and EHR data.",
        "edge_cases": [
          "Patients with near-zero observable person-time give unstable rates; impose a minimum observable window (e.g., 30 days after index) before building person-time.",
          "Omitting the offset produces count ratios confounded by differential follow-up length; the error inflates the apparent rate of the shorter-followed arm."
        ],
        "data_source_notes": "Claims: person-time = sum of FFS-observable enrolled days, excluding Medicare Advantage-only spans; censor at death, disenrollment, and study end. EHR: define an active-in-system requirement so apparent event-free time is not unobserved leakage."
      },
      {
        "name": "Exact Poisson CI for aggregate rare counts",
        "description": "When the numerator event count D is small (fewer than about 30), use the chi-squared inversion (Garwood) CI rather than the normal approximation. This is the standard for safety-surveillance reporting, observed-versus-expected analyses, and sequential signal detection systems (maxSPRT, TreeScan).",
        "edge_cases": [
          "When D = 0 (no events observed), the exact lower bound is 0 and the upper bound is chi2(0.975, 2) / (2T) = 3.69 / T; this upper bound is the maximum plausible rate given zero observed events and total person-time T.",
          "The mid-p Poisson CI is less conservative than the exact Garwood CI and may be preferred in sequential testing contexts; report which version is used."
        ],
        "data_source_notes": "Claims or registry: aggregate D (count of first or recurrent qualifying events) and T (total FFS-observable person-time) across the cohort before passing to poisson.test() or the chi-squared quantile formula."
      },
      {
        "name": "Modified Poisson with robust SEs (Zou's method) for binary outcomes",
        "description": "Apply a Poisson GLM with log link and no offset to a binary 0/1 outcome. Use robust (sandwich/HC) standard errors to correct for the Bernoulli — not Poisson — variance structure. The exponentiated coefficient is a risk ratio (RR), not an odds ratio; the approach is consistent when outcome prevalence exceeds the rare-event threshold where logistic regression's odds ratio substantially overstates the RR.",
        "edge_cases": [
          "The Poisson GLM may fail to converge on binary data when predicted values approach 1; use a starting value near the observed prevalence or switch to a log-binomial GLM with robust SEs as the alternative.",
          "This approach produces risk ratios marginally over the study population; it is not an odds ratio and must not be reported as one."
        ],
        "data_source_notes": "Large claims cohorts: binary outcomes such as 90-day readmission or any-hospitalization with prevalence above 10% are the primary use case; use cluster-robust SEs when the analytic dataset has repeated follow-up windows per patient."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "poisson-negative-binomial-count-models",
        "pros_of_this": "The Poisson distribution entry covers the probability model, process assumptions, exact CIs, and the Poisson-binomial approximation — foundational concepts needed before fitting any count regression model.",
        "cons_of_this": "The count-models entry covers IRR estimation, overdispersion testing, zero-inflation, clustered variance, and full implementations for HCRU endpoints; that entry is the practical regression reference for applied analysts.",
        "when_to_prefer": "Use this primitive entry when learning the Poisson distribution, working with exact Poisson CIs for rare safety events, or applying Zou's robust-variance method; use the count-models entry when building a Poisson or negative-binomial regression for a utilization endpoint."
      },
      {
        "compared_to": "incidence-rate-calculation-rwe",
        "pros_of_this": "The distribution entry covers the probability mechanics, equidispersion, exact CIs, and the Zou robust-variance extension — concepts that apply beyond incidence-rate regression.",
        "cons_of_this": "The incidence-rate-calculation entry covers at-risk clock construction, censoring rules, first-event versus recurrent-event denominators, competing risks, and claims-specific person-time pitfalls — the operational upstream step before any Poisson rate model.",
        "when_to_prefer": "Use the incidence-rate-calculation entry when building the person-time denominator and event numerator; use this entry when the question concerns the distributional model fitted to those counts and person-time."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Construct person-time from continuous FFS-observable medical enrollment spans, excluding Medicare Advantage-only months (no FFS inpatient claims) and pre-index washout time. Build log(person_years) as the offset before fitting any Poisson model. For exact Poisson CIs on aggregate safety counts, pass D and T directly to poisson.test() (R) or the chi-squared quantile formula (Python, SAS PROC STDRATE). For binary outcomes with prevalence above 10%, apply Zou's modified Poisson with cluster-robust SEs by person_id.",
      "ehr": "Visit-driven capture inflates event counts for sicker patients; adjust for visit intensity or link to claims for full capture before building person-time. Require demonstrable in-system activity (at least one encounter per year) so apparent event-free time is not unobserved leakage. For exact CIs on small-count safety signals identified from structured problem lists or lab values, the same chi-squared quantile approach applies.",
      "registry": "Disease registries often carry clean structured event counts (procedures, cycles, lines of therapy) and are well-suited to Poisson rate models after linking to claims or vital records for complete censoring. Verify registry reporting completeness and lag by calendar year before computing person-time from registry enrollment dates.",
      "primary": "In prospective studies or surveys with a fixed observation window and binary outcomes, Zou's modified Poisson with robust SEs is the preferred approach for producing risk ratios when outcome prevalence exceeds about 10%. For rare-event count outcomes in pilot studies, exact Poisson CIs are more appropriate than normal-approximation CIs given the small n.",
      "linked": "Linked claims-EHR-vital-records cohorts provide EHR severity for confounding adjustment, claims completeness for event capture, and a death index for competing-risk censoring. Reconcile service-date discrepancies across files before building the person-time offset; use cluster-robust SEs when patients contribute correlated events across the linked sources."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\nimport pandas as pd\nfrom scipy import stats\nfrom scipy.stats import poisson, chi2\nimport statsmodels.api as sm\nimport statsmodels.formula.api as smf\n\n# ── 1. Poisson PMF and CDF ──────────────────────────────────────────────────\nlam = 2.5   # e.g., average 2.5 events per patient-year\nk_vals = np.arange(0, 10)\npmf_vals = poisson.pmf(k_vals, mu=lam)\nprint(\"Poisson PMF (lambda=2.5):\")\nfor k, p in zip(k_vals, pmf_vals):\n    print(f\"  P(X={k}) = {p:.4f}\")\nprint(f\"Mean = Variance = {lam} (equidispersion)\")\n\n# ── 2. Exact Poisson CI for an aggregate rate (Garwood / chi-squared inversion) ──\nD  = 24          # observed events\nT  = 1200.0      # person-years\nalpha = 0.05\n\nlo = chi2.ppf(alpha / 2,    2 * D    ) / (2 * T) if D > 0 else 0.0\nhi = chi2.ppf(1 - alpha / 2, 2 * (D + 1)) / (2 * T)\nprint(f\"\\nExact Poisson 95% CI for {D} events / {T} PY:\")\nprint(f\"  Rate = {D/T:.4f} per PY = {D/T*1000:.1f} per 1000 PY\")\nprint(f\"  95% CI: ({lo*1000:.1f}, {hi*1000:.1f}) per 1000 PY\")\n# Equivalent using scipy.stats.poisson.interval (two-sided):\nexact = stats.poisson.interval(0.95, D)          # count-space CI\nlo2 = exact[0] / T; hi2 = exact[1] / T\nprint(f\"  Via poisson.interval: ({lo2*1000:.1f}, {hi2*1000:.1f}) per 1000 PY\")\n\n# ── 3. 
Poisson GLM with person-time offset: incidence rate ratio ─────────────\n# Synthetic analytic table; replace with real data.\nnp.random.seed(42)\nn = 400\nanalytic = pd.DataFrame({\n    \"person_id\":       np.arange(n),\n    \"exposed\":         np.repeat([1, 0], n // 2),\n    \"event_count\":     np.random.poisson(lam=[0.06 if e else 0.03 for e in np.repeat([1, 0], n // 2)]),\n    \"person_years\":    np.random.uniform(0.5, 2.0, n),\n    \"age\":             np.random.normal(60, 10, n),\n    \"sex\":             np.random.binomial(1, 0.5, n),\n    \"baseline_comorb\": np.random.poisson(2, n),\n})\nanalytic = analytic[analytic[\"person_years\"] > 0].copy()\noffset_vec = np.log(analytic[\"person_years\"].to_numpy())\n\nmodel = smf.glm(\n    \"event_count ~ exposed + age + sex + baseline_comorb\",\n    data=analytic,\n    family=sm.families.Poisson(),\n    offset=offset_vec,\n).fit(cov_type=\"cluster\", cov_kwds={\"groups\": analytic[\"person_id\"]})\n\nprint(\"\\nPoisson rate model (cluster-robust SEs):\")\nirr    = np.exp(model.params[\"exposed\"])\nci_lo  = np.exp(model.conf_int().loc[\"exposed\", 0])\nci_hi  = np.exp(model.conf_int().loc[\"exposed\", 1])\nprint(f\"  IRR (exposed vs unexposed) = {irr:.3f}  95% CI: ({ci_lo:.3f}, {ci_hi:.3f})\")\nprint(f\"  Pearson dispersion = {model.pearson_chi2 / model.df_resid:.3f}  (should be ~1 for equidispersion)\")\n\n# ── 4. 
Zou's modified Poisson with robust SEs: risk ratio on a binary outcome ──\nanalytic_binary = pd.DataFrame({\n    \"person_id\": np.arange(n),\n    \"exposed\":   np.repeat([1, 0], n // 2),\n    \"outcome\":   np.random.binomial(1, [0.25 if e else 0.15 for e in np.repeat([1, 0], n // 2)]),\n    \"age\":       analytic[\"age\"].values,\n    \"sex\":       analytic[\"sex\"].values,\n})\n\nzou = smf.glm(\n    \"outcome ~ exposed + age + sex\",   # no offset: fixed follow-up window\n    data=analytic_binary,\n    family=sm.families.Poisson(),      # Poisson family, log link\n).fit(cov_type=\"HC1\")                  # robust (sandwich) SEs correct Bernoulli variance\n\nrr    = np.exp(zou.params[\"exposed\"])\nrr_lo = np.exp(zou.conf_int().loc[\"exposed\", 0])\nrr_hi = np.exp(zou.conf_int().loc[\"exposed\", 1])\nprint(f\"\\nZou modified Poisson (robust SEs) — risk ratio:\")\nprint(f\"  RR = {rr:.3f}  95% robust CI: ({rr_lo:.3f}, {rr_hi:.3f})\")\nprint(\"  (Use this instead of logistic OR when outcome prevalence > 10%)\")",
        "description": "Poisson distribution PMF, CDF, exact Poisson CI for aggregate rates (scipy.stats), and\nPoisson GLM with a person-time offset for incidence rate ratios (statsmodels). Also\ndemonstrates Zou's modified Poisson with robust sandwich SEs for risk ratios on binary\noutcomes. Required input for the GLM is an analytic table with one row per patient:\n  analytic: person_id, exposed (0/1), event_count (int), person_years (float > 0),\n            age, sex, baseline_comorb   -- covariates measured to time zero\nThe binary-outcome (Zou) section uses:\n  analytic_binary: person_id, exposed (0/1), outcome (0/1), age, sex",
        "dependencies": [
          "scipy",
          "statsmodels",
          "numpy",
          "pandas"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(sandwich)\nlibrary(lmtest)\n\n# ── 1. Poisson PMF and CDF ──────────────────────────────────────────────────\nlam <- 2.5\nk   <- 0:9\ncat(\"Poisson PMF (lambda = 2.5):\\n\")\ncat(sprintf(\"  P(X=%d) = %.4f\\n\", k, dpois(k, lambda = lam)))\ncat(sprintf(\"Mean = Variance = %.1f (equidispersion)\\n\\n\", lam))\n\n# ── 2. Exact Poisson CI via poisson.test (Garwood / mid-p exact) ─────────────\nD <- 24L;   T_py <- 1200.0\npt <- poisson.test(D, T = T_py, conf.level = 0.95)\ncat(sprintf(\"Exact Poisson 95%% CI for %d events / %.0f PY:\\n\", D, T_py))\ncat(sprintf(\"  Rate = %.4f per PY = %.1f per 1000 PY\\n\", D / T_py, D / T_py * 1000))\ncat(sprintf(\"  95%% CI: (%.1f, %.1f) per 1000 PY\\n\\n\",\n            pt$conf.int[1] * 1000, pt$conf.int[2] * 1000))\n\n# ── 3. Poisson GLM with person-time offset: incidence rate ratio ─────────────\nset.seed(42)\nn <- 400L\nanalytic <- data.frame(\n  person_id       = seq_len(n),\n  exposed         = rep(c(1L, 0L), each = n %/% 2L),\n  person_years    = runif(n, 0.5, 2.0),\n  age             = rnorm(n, 60, 10),\n  sex             = rbinom(n, 1L, 0.5),\n  baseline_comorb = rpois(n, 2)\n)\nanalytic$rate     <- ifelse(analytic$exposed == 1L, 0.06, 0.03)\nanalytic$event_count <- rpois(n, analytic$rate * analytic$person_years)\nanalytic <- subset(analytic, person_years > 0)\n\npois_fit <- glm(\n  event_count ~ exposed + age + sex + baseline_comorb,\n  family = poisson(link = \"log\"),\n  data   = analytic,\n  offset = log(person_years)\n)\n\n# Cluster-robust variance by person_id\nvc_cl  <- vcovCL(pois_fit, cluster = ~ person_id)\nct_cl  <- coeftest(pois_fit, vcov. = vc_cl)\nirr    <- exp(ct_cl[\"exposed\", \"Estimate\"])\nirr_ci <- exp(ct_cl[\"exposed\", \"Estimate\"] + c(-1, 1) * 1.96 * ct_cl[\"exposed\", \"Std. 
Error\"])\ncat(sprintf(\"Poisson rate model (cluster-robust SEs):\\n\"))\ncat(sprintf(\"  IRR = %.3f  95%% CI: (%.3f, %.3f)\\n\", irr, irr_ci[1], irr_ci[2]))\ndisp <- sum(residuals(pois_fit, type = \"pearson\")^2) / df.residual(pois_fit)\ncat(sprintf(\"  Pearson dispersion = %.3f  (equidispersion -> ~1.0)\\n\\n\", disp))\n\n# ── 4. Zou's modified Poisson: risk ratio on a binary outcome ─────────────────\nanalytic_bin <- data.frame(\n  person_id = seq_len(n),\n  exposed   = rep(c(1L, 0L), each = n %/% 2L),\n  outcome   = rbinom(n, 1L, prob = rep(c(0.25, 0.15), each = n %/% 2L)),\n  age       = analytic$age,\n  sex       = analytic$sex\n)\n\nzou_fit <- glm(\n  outcome ~ exposed + age + sex,\n  family = poisson(link = \"log\"),   # Poisson family, log link, NO offset\n  data   = analytic_bin\n)\nvc_hc1 <- vcovHC(zou_fit, type = \"HC1\")   # sandwich SEs correct Bernoulli variance\nct_hc1 <- coeftest(zou_fit, vcov. = vc_hc1)\nrr     <- exp(ct_hc1[\"exposed\", \"Estimate\"])\nrr_ci  <- exp(ct_hc1[\"exposed\", \"Estimate\"] + c(-1, 1) * 1.96 * ct_hc1[\"exposed\", \"Std. Error\"])\ncat(\"Zou modified Poisson (robust SEs) -- risk ratio on binary outcome:\\n\")\ncat(sprintf(\"  RR = %.3f  95%% robust CI: (%.3f, %.3f)\\n\", rr, rr_ci[1], rr_ci[2]))\ncat(\"  (Use instead of logistic OR when outcome prevalence > ~10%)\\n\")",
        "description": "Poisson distribution PMF/CDF, exact Poisson CI via poisson.test(), Poisson GLM with\nperson-time offset and cluster-robust SEs, and Zou's modified Poisson with sandwich SEs\nfor risk ratios on binary outcomes. All implementations use base R plus sandwich and lmtest\nfor robust variance; no specialized packages needed beyond those.",
        "dependencies": [
          "sandwich",
          "lmtest"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* ── 1. Exact Poisson CI via PROC STDRATE (aggregate rate from person-time) ── */\ndata work.agg_rate;\n  events       = 24;\n  person_years = 1200;\nrun;\nproc stdrate data=work.agg_rate stat=rate plots=none;\n  population event=events pyears=person_years;\n  /* STDRATE prints the exact (Poisson) 95% CI for events / person_years          */\nrun;\n\n/* ── 2. Poisson GLM with person-time offset: incidence rate ratio ─────────── */\ndata work.analytic_offset;\n  set work.analytic;\n  where person_years > 0;\n  log_pt = log(person_years);   /* offset = log(person-time)                    */\nrun;\n\n/* Model-based Poisson SEs (may be anticonservative if overdispersed).           */\nproc genmod data=work.analytic_offset;\n  class exposed (ref='0') sex;\n  model event_count = exposed age sex baseline_comorb\n        / dist=poisson link=log offset=log_pt;\n  estimate 'IRR exposed vs unexposed' exposed 1 -1 / exp;\nrun;\n\n/* Cluster-robust (empirical / sandwich) SEs via GEE REPEATED statement.         */\nproc genmod data=work.analytic_offset;\n  class person_id exposed (ref='0') sex;\n  model event_count = exposed age sex baseline_comorb\n        / dist=poisson link=log offset=log_pt;\n  repeated subject=person_id / type=ind;   /* empirical sandwich variance        */\n  estimate 'IRR exposed vs unexposed' exposed 1 -1 / exp;\nrun;\n\n/* Pearson dispersion: if Pearson chi-sq / df >> 1, switch to DIST=NEGBIN.       */\nproc genmod data=work.analytic_offset;\n  class exposed (ref='0') sex;\n  model event_count = exposed age sex baseline_comorb\n        / dist=poisson link=log offset=log_pt scale=pearson;\n  /* SCALE=PEARSON prints the Pearson dispersion statistic.                       */\nrun;\n\n/* ── 3. 
Zou modified Poisson: risk ratio on a binary outcome ─────────────── */\nproc genmod data=work.analytic_bin;\n  class person_id exposed (ref='0') sex;\n  model outcome = exposed age sex\n        / dist=poisson link=log;   /* Poisson family, log link, no offset        */\n  repeated subject=person_id / type=ind;   /* robust SEs correct Bernoulli var   */\n  estimate 'RR exposed vs unexposed' exposed 1 -1 / exp;\n  /* exp(estimate) is the risk ratio; use instead of logistic OR when prev > 10% */\nrun;",
        "description": "Exact Poisson CI from aggregate person-time via PROC STDRATE; Poisson GLM with a\nlog(person-time) offset via PROC GENMOD with cluster-robust (empirical) SEs via the REPEATED\nstatement; and Zou's modified Poisson with robust SEs for binary outcomes using PROC GENMOD\nwith DIST=POISSON and TYPE=IND repeated structure. Required input datasets:\n  work.analytic      : person_id, exposed (0/1), event_count, person_years, age, sex, baseline_comorb\n  work.analytic_bin  : person_id, exposed (0/1), outcome (0/1), age, sex\nCreate the offset variable before modeling; exp(estimate) is the IRR or RR.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Proc[Counting process:<br/>independent events, constant rate lambda] --> Dist[Poisson distribution<br/>P(X=k) = e^-lambda * lambda^k / k!]\n  Dist --> Eq[Equidispersion:<br/>E(X) = Var(X) = lambda]\n  Eq --> Test{Pearson chi-sq/df near 1?}\n  Test -->|Yes| Pois[Poisson rate model<br/>with person-time offset<br/>IRR = exp(beta)]\n  Test -->|No, overdispersed| NB[Negative binomial<br/>or robust sandwich SEs]\n  Pois --> Rare{D small (fewer than 30)?}\n  NB --> Rare\n  Rare -->|Yes| ExactCI[Exact Poisson CI<br/>chi-squared inversion]\n  Rare -->|No| NormCI[Normal-approximation CI<br/>or Wald CI from model]\n  Pois --> Binary{Binary outcome,<br/>prevalence above 10%?}\n  Binary -->|Yes| Zou[Zou modified Poisson<br/>with robust SEs<br/>RR = exp(beta)]\n  Binary -->|No| Logistic[Logistic regression<br/>OR (rare-event OK)]",
        "caption": "Decision logic flowing from the Poisson process through equidispersion testing to model choice (Poisson vs negative binomial), CI type (exact vs approximate), and the Zou modified-Poisson extension for binary outcomes.",
        "alt_text": "Flowchart starting at the Poisson counting process, through the equidispersion test, to a Poisson or negative-binomial rate model with offset, then branching on count size to exact or normal-approximation CIs, and separately to Zou modified Poisson or logistic regression for binary outcomes.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "see_also",
        "target_slug": "negative-binomial-distribution",
        "notes": "The negative binomial is the overdispersion fix for the Poisson: it adds a dispersion parameter alpha so that variance exceeds the mean, absorbing the heterogeneity that causes plain Poisson SEs to be anticonservatively narrow. Use NB whenever the Pearson dispersion statistic substantially exceeds 1."
      },
      {
        "relation_type": "see_also",
        "target_slug": "poisson-negative-binomial-count-models",
        "notes": "That entry covers Poisson and NB as regression models for HCRU utilization counts with full IRR interpretation, overdispersion testing, zero-inflation, and clustered variance; this distribution primitive entry supplies the underlying probability model that the count-model entry builds on."
      },
      {
        "relation_type": "used_with",
        "target_slug": "incidence-rate-calculation-rwe",
        "notes": "Incidence-rate calculation provides the numerator (event count D) and denominator (person-time T) that are the direct inputs to both the Poisson rate formula and the exact Poisson CI; the distribution primitive specifies the statistical model for those quantities."
      },
      {
        "relation_type": "requires",
        "target_slug": "person-time-denominator-construction-rwe",
        "notes": "A correctly constructed person-time denominator — continuous observable enrollment, MA exclusions, censoring at death and disenrollment — is prerequisite to any Poisson rate model; the offset log(T) is only valid if T represents actual at-risk time."
      },
      {
        "relation_type": "see_also",
        "target_slug": "generalized-linear-models",
        "notes": "The Poisson GLM is a special case of the generalized linear model family with a log link and Poisson variance; understanding the GLM framework situates the Poisson model within the broader class of log-linear models for count and rate outcomes."
      },
      {
        "relation_type": "see_also",
        "target_slug": "binomial-distribution-logit-link",
        "notes": "The binomial with logistic link is the natural partner to the Poisson for binary outcomes; the Poisson approximates the binomial for rare events, and Zou's robust-variance Poisson regression produces risk ratios where logistic regression would give odds ratios."
      }
    ],
    "aliases": [
      "Poisson process",
      "Poisson regression",
      "rate model",
      "incidence rate model",
      "exact Poisson CI",
      "Garwood confidence interval",
      "Zou method",
      "modified Poisson regression"
    ],
    "complexity": "foundational",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "poisson-negative-binomial-count-models",
    "name": "Poisson and Negative Binomial Count Models for HCRU and Utilization",
    "short_definition": "Log-link generalized linear models for non-negative integer counts of events in a defined window (hospitalizations, ED visits, infusions, filled prescriptions), where Poisson assumes variance equals mean and negative binomial adds an overdispersion parameter that is the practical default for skewed real-world healthcare resource utilization data.",
    "long_description": "Poisson and negative binomial (NB) regression are the workhorse models for **count outcomes** in real-world HEOR: the number\nof hospitalizations, emergency department visits, outpatient encounters, infusions, distinct drug classes, or inpatient days\na patient accrues over an observation window. Both use a **log link**, `log(μ) = β0 + β'X + log(person_time)`, so the\nexponentiated coefficient `exp(βj)` is a **rate ratio** — an incidence rate ratio (IRR) once person-time enters as an\n`offset`. The IRR is the quantity HCRU studies report (\"the new therapy was associated with a 28% lower rate of all-cause\nhospitalization, IRR 0.72\").\n\n**Core conceptual distinction.** The two models share the same mean structure and the same IRR estimand; they differ only in\nthe assumed mean–variance relationship, and that difference is doing all the work. (1) *Poisson* fixes Var(Y) = μ\n(equidispersion). This is almost never true for HCRU: a minority of frail high-utilizers create a long right tail, so the\nvariance exceeds the mean (**overdispersion**). Under overdispersion the Poisson point estimate of the IRR stays roughly\nunbiased, but its standard errors are **too small**, confidence intervals too narrow, and p-values anticonservative — you\ndeclare effects \"significant\" that are not. (2) *Negative binomial* introduces a dispersion parameter α so that\nVar(Y) = μ + αμ² (the canonical NB-2 parameterization), absorbing the extra heterogeneity and widening the CI to its honest\nwidth; as α → 0 NB collapses back to Poisson. A formal likelihood-ratio test of H0: α = 0 (a boundary test, halve the\np-value) decides between them. 
The offset is the second non-negotiable: it is `log(person_time)`, not a covariate — its\ncoefficient is fixed at 1 — and omitting it silently turns an IRR back into a raw count ratio that confounds the effect with\ndifferential follow-up.\n\n**Pros, cons, and trade-offs.**\n- **vs cox-ph-regression (and recurrent-event survival models, Andersen–Gill / PWP / WLW):** Count models target the *total\n  volume or rate* of events and use the full count (0, 1, 2, …) in a single number per patient; they are simpler and more\n  powerful when the scientific question is \"how much utilization\" rather than \"how soon.\" Cost: they discard the *timing and\n  ordering* of events and the at-risk structure, so they cannot represent time-varying exposure, the depletion of the at-risk\n  set after death, or the gap-time dependence of recurrent events. **Prefer count models** when the endpoint is a rate or\n  total over a fixed window; **prefer recurrent-event survival** when timing matters or competing death heavily truncates\n  person-time (see recurrent-events-analysis-rwe).\n- **vs logistic-regression-for-binary-outcomes:** Counting events (number of admissions) uses information that \"any\n  admission, yes/no\" throws away — more efficient and usually more clinically meaningful for HCRU. Cost: when only the first\n  event matters clinically, or when the outcome is genuinely dichotomous, logistic is simpler and avoids modeling a count\n  distribution. **Prefer count models** when the magnitude of utilization is the point.\n- **vs healthcare-costs-pppm-pppy-pmpm (cost GLMs):** Counts are discrete events; a one-night and a thirty-night admission\n  are both \"one admission.\" Dollars are continuous, heavy-tailed, and zero-inflated, and belong in a gamma/log-link or\n  two-part model. 
**Prefer count models for utilization volume and pair them with cost models** for the full economic\n  picture — never substitute one for the other.\n- **Poisson vs NB within the family, and vs zero-inflated/hurdle:** Plain Poisson is defensible only for rare, equidispersed\n  counts; NB is the safe default for HCRU. When zeros are far more frequent than even NB predicts (a structural subgroup that\n  *cannot* generate the event — e.g., patients who never re-engage with the system after a procedure), a **zero-inflated NB\n  (ZINB)** or **hurdle** model separates a \"who has any utilization\" logistic process from a \"how much among utilizers\" count\n  process. The cost of ZINB/hurdle is two coefficient sets to interpret and report.\n\n**When to use.** The endpoint is a count of discrete events over a defined window (admissions, ED visits, infusions, distinct\nNDCs, inpatient days); follow-up varies across patients (then carry `log(person_time)` as an offset and report an IRR);\nmultiple events per patient are common and their total is of interest; you want a single interpretable rate ratio that maps\ncleanly to budget-impact and value narratives. NB is the default; reach for Poisson only after a dispersion test supports\nequidispersion, and quasi-Poisson when you want robust dispersion-scaled SEs without committing to the NB likelihood.\n\n**Interpreting the output**\n\nConsider the worked example: Drug A generates 50 ED visits in 100 person-years (rate\n0.50/year); Drug B generates 30 visits in 100 person-years (rate 0.30/year). The crude\nIRR = 0.50 / 0.30 = 1.67. After fitting a negative binomial model with covariate\nadjustment, suppose the reported adjusted IRR is 1.54 (95% CI 1.18–2.01).\n\nFormal interpretation: Among patients starting Drug A, the adjusted incidence rate of\nED visits was 1.54 times the rate in Drug B patients per person-year of follow-up,\nholding baseline covariates fixed and accounting for the person-time offset. 
The offset\nlog(person_years) is what converts raw counts into rates; omitting it would produce a\ncount ratio confounded with differential follow-up, not an IRR. The negative binomial\nmodel is reported here rather than Poisson because individual patient counts showed\nvariance well above the mean (overdispersion); the Poisson CI for the same data would\nhave been fictitiously narrow, producing anticonservative inference.\n\nPractical interpretation: Drug A patients accrue approximately 54% more ED visits per\nyear than Drug B patients after adjustment. Because the NB model absorbs overdispersion\nfrom a minority of high-utilizers, the 95% CI of 1.18–2.01 is the honestly calibrated\nuncertainty range, not the artificially narrow interval a Poisson fit would have\nreported. In budget-impact terms, the incremental HCRU burden per person-year is the\nexcess visit rate, Drug B's rate multiplied by (IRR - 1), times the per-visit cost;\nwith the crude comparator rate of 0.30/year that is roughly 0.30 * 0.54 = 0.16 extra\nvisits per person-year. The IRR alone is a ratio and carries no cost units. Always\nverify the dispersion decision (LR test of alpha = 0) before reporting and note whether\nany zero-inflation adjustment was applied.\n\n**When NOT to use, and when it is actively misleading or dangerous.**\n- **Overdispersed data fit with plain Poisson.** This is the classic error: the IRR looks fine but the CI is fictitiously\n  narrow, so a null effect is reported as protective with \"p<0.05.\" Always check Pearson χ²/df (≫1 signals overdispersion)\n  and test α; if in doubt, report NB or quasi-Poisson SEs.\n- **No offset when follow-up differs by arm.** If the exposed arm disenrolls or dies earlier (shorter person-time), modeling\n  raw counts without `log(person_time)` makes the exposed look *lower-utilizing* purely because they were observed less:\n  immortal-time and informative-censoring artifacts masquerading as a treatment benefit (see immortal-time-bias-handling).\n- **Within-person correlation ignored.** Multiple events from the same patient (or panel counts across periods) violate\n  independence; naive SEs are too small. 
Use cluster-robust/GEE variance or a random-effects (NB-mixed) model.\n- **Competing death treated as \"zero utilization.\"** In elderly or oncology claims, a patient who dies early accrues few\n  events not because therapy reduced utilization but because they left the at-risk set. Differential mortality by arm biases\n  the IRR; censor person-time at death and consider a recurrent-event-with-terminal-event or joint model.\n- **The real question is binary or about timing.** If only the first event is clinically meaningful, or the policy question\n  is time-to-readmission, a count model answers the wrong question.\n\n**Data-source operational depth.**\n- **Claims (FFS):** The substrate of most HCRU count studies. Build person-time from enrollment spans, censoring at\n  disenrollment, death, and data end; count events from *all* relevant claim files (inpatient facility + professional,\n  outpatient, carrier) using validated code lists, because partial capture undercounts. Failure mode unique to claims:\n  **Medicare Advantage (MA) encounter data are notoriously incomplete and inconsistently submitted**, so MA person-time can\n  show artificially *low* hospitalization counts versus fee-for-service Parts A/B — restrict to FFS A/B (or commercial with a\n  real medical benefit) and **exclude MA-only person-time**, or the count comparison is biased by data completeness rather\n  than care. Mail-order/90-day fills and stockpiling distort `days_supply`-derived counts of \"drug fills.\"\n- **EHR:** Counts come from orders, encounters, or NLP, and suffer **visit-driven (informed-presence) bias**: sicker patients\n  generate more records simply by being seen more, inflating counts in both arms and especially in the sicker arm. Adjust for\n  visit intensity, or link to claims for complete capture. 
A patient who leaves the health system is differentially lost, so\n  define the observation window explicitly and treat loss as potentially informative.\n- **Registry:** Often carries clean structured counts (lines of therapy, transfusions, cycles) and is excellent for\n  *validating* claims-based count algorithms, but is typically thin on complete utilization across all settings; link to\n  claims for full event capture and to a death index for censoring.\n- **Linked claims–EHR–vital records:** The ideal substrate (EHR severity for confounding control + claims completeness for\n  counts + reliable mortality for censoring), but linkage selection and order/fill/service-date discrepancies must be\n  reconciled before person-time is built.\n\n**Worked claims example.** Question: rate of all-cause inpatient hospitalization on an SGLT2 inhibitor vs a DPP-4 inhibitor\namong adults with type 2 diabetes in a commercial + Medicare FFS database. (1) New-user, active-comparator cohort: ≥365 days\ncontinuous A/B/D (or commercial medical+pharmacy) enrollment with no fill of either class in the lookback; time zero = first\nqualifying `fill_date`; arm = the NDC dispensed that day. (2) Person-time: from time zero to the earliest of disenrollment,\ndeath, or 2024-12-31 end of data — for patient A this is 365 days (1.00 person-year), for patient B who dies at day 200,\n0.55 person-years. (3) Outcome count: number of distinct inpatient admissions in the window from the facility claim file,\ncollapsing same-stay transfer claims so a transfer is not double-counted; patient A has 2 admissions, patient B has 1. (4)\nOffset: model `hosp_count ~ arm + age + sex + baseline_comorbidity` with `offset = log(person_years)`, so `exp(β_arm)` is the\nIRR (rate per person-year). 
(5) Distribution: fit Poisson, then NB; the LR test of α gives p<0.001 (overdispersion from a few\nhigh-utilizers), so report the **NB** IRR, say 0.78 (95% CI 0.69–0.88), with cluster-robust SEs because some patients\ncontribute multiple admissions. (6) Critically, **exclude any MA-only enrollment months**: a patient enrolled in MA for part\nof follow-up has incomplete inpatient capture there, which would deflate their count while those months still add\nperson-time to the denominator, understating their rate and biasing the IRR. (7) Sensitivity: vary the window, censor at treatment discontinuation (last `days_supply` end + grace period) for an\nas-treated analysis, and run a ZINB if the proportion of patients with zero admissions exceeds NB's prediction.",
    "primary_category": "Inferential_Statistics",
    "tags": [
      "poisson",
      "negative-binomial",
      "count-models",
      "hcru",
      "utilization",
      "incidence-rate-ratio",
      "overdispersion",
      "offset-person-time",
      "claims",
      "zero-inflation"
    ],
    "applies_to_study_types": [
      "new_user",
      "active_comparator_new_user",
      "cohort_retrospective",
      "claims_analysis",
      "ehr_study"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.2307/2531094",
        "url": "https://doi.org/10.2307/2531094",
        "citation_text": "Frome EL. The analysis of rates using Poisson regression models. Biometrics. 1983;39(3):665-674.",
        "year": 1983,
        "authors_short": "Frome",
        "notes": "The foundational methodologic statement of Poisson regression for rate data with a person-time offset and the log-rate-ratio interpretation that anchors all count-based HCRU analysis."
      },
      {
        "role": "explain",
        "doi": "10.1093/oxfordjournals.aje.a114001",
        "url": "https://doi.org/10.1093/oxfordjournals.aje.a114001",
        "citation_text": "Frome EL, Checkoway H. Use of Poisson regression models in estimating incidence rates and ratios. American Journal of Epidemiology. 1985;121(2):309-323.",
        "year": 1985,
        "authors_short": "Frome & Checkoway",
        "notes": "Translates Poisson rate regression into the epidemiologic language of incidence rates and rate ratios with person-time, the direct template for IRR reporting in pharmacoepidemiology."
      },
      {
        "role": "explain",
        "doi": "10.1146/annurev.publhealth.20.1.125",
        "url": "https://doi.org/10.1146/annurev.publhealth.20.1.125",
        "citation_text": "Diehr P, Yanez D, Ash A, Hornbrook M, Lin DY. Methods for analyzing health care utilization and costs. Annual Review of Public Health. 1999;20:125-144.",
        "year": 1999,
        "authors_short": "Diehr et al.",
        "notes": "Canonical review of count and cost methods for HCRU, including overdispersion, negative binomial, and zero-heavy utilization distributions; situates Poisson/NB against two-part and survival alternatives."
      },
      {
        "role": "demonstrate",
        "doi": "10.1002/(SICI)1099-1255(199705)12:3<313::AID-JAE440>3.0.CO;2-G",
        "url": "https://doi.org/10.1002/(SICI)1099-1255(199705)12:3<313::AID-JAE440>3.0.CO;2-G",
        "citation_text": "Deb P, Trivedi PK. Demand for medical care by the elderly: a finite mixture approach. Journal of Applied Econometrics. 1997;12(3):313-336.",
        "year": 1997,
        "authors_short": "Deb & Trivedi",
        "notes": "Influential applied count-data analysis of elderly medical-care utilization showing why NB and finite-mixture/latent-class count models outperform Poisson when frequent and infrequent utilizers coexist."
      },
      {
        "role": "use",
        "doi": "10.1136/bmj.312.7027.364",
        "url": "https://doi.org/10.1136/bmj.312.7027.364",
        "citation_text": "Glynn RJ, Buring JE. Ways of measuring rates of recurrent events. BMJ. 1996;312(7027):364-367.",
        "year": 1996,
        "authors_short": "Glynn & Buring",
        "notes": "Practical guide to summarizing recurrent/count events as rates and rate ratios, clarifying when a count-rate model is appropriate versus a time-to-event framing."
      }
    ],
    "plain_language_summary": "Poisson and negative binomial models count how many times an event happens to a patient over a period of time — for example, how many times they visited the ER or were admitted to the hospital — and compare those rates between two groups. Instead of asking \"did the event happen?\", they ask \"how often did it happen?\" and report the answer as a rate ratio (called an incidence rate ratio, or IRR) that accounts for the fact that some patients were followed longer than others. Poisson is the simpler version, but real healthcare data almost always violates its key assumption: when the variance in a count dataset is larger than its mean — a condition called overdispersion — Poisson produces confidence intervals that are misleadingly narrow, making weak findings look statistically significant. Negative binomial adds one extra parameter to absorb that extra spread, giving you honest uncertainty estimates, which is why it is the safer default for healthcare utilization data.",
    "key_terms": [
      {
        "term": "incidence rate ratio (IRR)",
        "definition": "The ratio of the event rate in one group to the event rate in another group, where each rate is expressed as events per unit of person-time (e.g., hospitalizations per person-year)."
      },
      {
        "term": "overdispersion",
        "definition": "A condition in count data where the variance (spread) of the counts is larger than the mean, which happens when a small group of high-utilizers drives a long right tail; Poisson assumes variance equals the mean, so overdispersed data require negative binomial instead."
      },
      {
        "term": "offset",
        "definition": "A statistical adjustment — specifically log(person-time) — that converts raw event counts into rates, so that a patient followed for two years is not unfairly compared to one followed for one year."
      },
      {
        "term": "person-time",
        "definition": "The total time that patients in a group were actually observed and at risk, calculated by summing each patient's individual follow-up duration; one patient followed for two years contributes two person-years."
      },
      {
        "term": "dispersion parameter (alpha)",
        "definition": "The extra term negative binomial adds to model how much extra spread exists in the counts beyond what Poisson allows; when alpha equals zero, negative binomial and Poisson give the same answer."
      }
    ],
    "worked_example": {
      "scenario": "A database study compares rates of emergency department (ED) visits between adults newly starting Drug A versus Drug B for a chronic condition. Each patient is enrolled for a fixed one-year window. You want to compute the crude ED visit rate per person-year in each arm and the incidence rate ratio (IRR), and then decide whether Poisson or negative binomial is more appropriate.",
      "dataset": {
        "caption": "Summary counts for each treatment arm over one year of follow-up.",
        "columns": [
          "arm",
          "events",
          "person_years",
          "rate"
        ],
        "rows": [
          [
            "Drug A",
            50,
            100,
            0.5
          ],
          [
            "Drug B",
            30,
            100,
            0.3
          ]
        ]
      },
      "steps": [
        "Compute the rate for Drug A: 50 ED visits divided by 100 person-years = 0.50 ED visits per person-year.",
        "Compute the rate for Drug B: 30 ED visits divided by 100 person-years = 0.30 ED visits per person-year.",
        "Compute the incidence rate ratio (IRR): 0.50 divided by 0.30 = 1.67, meaning Drug A patients had ED visits at 1.67 times the rate of Drug B patients.",
        "Decide between Poisson and negative binomial: inspect the spread of individual patient counts in each arm. If a few patients account for the majority of visits — common in healthcare data — the variance will exceed the mean (overdispersion). In that case, fitting Poisson will produce a confidence interval around the IRR of 1.67 that is too narrow, making the difference look more certain than the data support. Fitting negative binomial instead lets the model absorb that extra spread and gives an honest, wider confidence interval."
      ],
      "result": "Drug A rate = 50 / 100 = 0.50 ED visits per person-year. Drug B rate = 30 / 100 = 0.30 ED visits per person-year. IRR = 0.50 / 0.30 = 1.67. When individual patient counts show variance greater than the mean (overdispersion), report the negative binomial IRR with its wider, correctly calibrated confidence interval rather than the Poisson result."
    },
    "prerequisites": [
      "incidence-rate-calculation-rwe",
      "person-time-denominator-construction-rwe",
      "logistic-regression-for-binary-outcomes"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Poisson with person-time offset (incidence rate model)",
        "description": "Include offset(log(person_time)) so coefficients are log incidence rate ratios; required whenever follow-up varies across patients (disenrollment, death, study end). Use quasi-Poisson or Pearson dispersion scaling for honest SEs when mild overdispersion is present but the NB likelihood is not desired.",
        "edge_cases": [
          "Patients with near-zero observable person-time give unstable rates; impose a minimum observable window or down-weight.",
          "Omitting the offset turns an IRR into a raw count ratio confounded with differential follow-up."
        ],
        "data_source_notes": "claims: person_time = sum of FFS-enrolled days in the window, censored at death/disenroll/end; exclude MA-only spans where inpatient capture is incomplete."
      },
      {
        "name": "Negative binomial (NB-2, canonical default)",
        "description": "Adds dispersion α; Var(Y) = μ(1 + αμ). Preferred default for HCRU counts; decide vs Poisson with a boundary LR test of α = 0 (halve the p-value) and by inspecting Pearson χ²/df.",
        "edge_cases": [
          "α near 0 reduces NB to Poisson; very large α signals extreme heterogeneity (consider finite mixture or random-effects NB)."
        ],
        "data_source_notes": "Most claims HCRU endpoints (admissions, ED visits, distinct NDCs) are overdispersed and warrant NB unless events are genuinely rare and equidispersed."
      },
      {
        "name": "Zero-inflated or hurdle NB (ZINB / hurdle)",
        "description": "A logistic \"structural zero\" process (never-utilizers) plus an NB count process for the rest; hurdle uses a truncated-at-zero count for positive counts. Reports two coefficient sets — probability of any utilization and rate among utilizers.",
        "edge_cases": [
          "Distinguish structural zeros (cannot have the event) from sampling zeros (could, but did not in this window) before choosing ZINB over plain NB; over-applying zero-inflation invites overfitting."
        ],
        "data_source_notes": "Relevant for low-utilization cohorts (young, post-procedure) or when many patients have effectively zero observable events despite enrollment."
      },
      {
        "name": "Clustered / recurrent-within-person counts (GEE or random-effects NB)",
        "description": "When patients contribute multiple correlated events or repeated periods, use cluster-robust/GEE variance or a mixed NB (GLIMMIX) with a person-level random intercept so SEs reflect within-person correlation.",
        "edge_cases": [
          "Ignoring clustering yields anticonservative SEs; random-effects and GEE target subtly different (conditional vs marginal) estimands — state which."
        ],
        "data_source_notes": "claims/EHR: cluster on person_id; for panel counts (per-quarter utilization) cluster on person_id and include period effects."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "cox-ph-regression",
        "pros_of_this": "Directly models the rate or total volume of events using the full count and a person-time offset; efficient and more interpretable as an IRR when multiple events per person are of interest.",
        "cons_of_this": "Ignores event timing/ordering and the at-risk structure; recurrent-event survival models (Andersen-Gill, PWP, WLW) or joint models are preferable when timing matters or terminal events truncate person-time.",
        "when_to_prefer": "The endpoint is a rate or total over a fixed window (inpatient days, infusions, total HCRU events) rather than time to first event or survival."
      },
      {
        "compared_to": "logistic-regression-for-binary-outcomes",
        "pros_of_this": "Uses the full count (0,1,2,...) rather than thresholding to any/none; more efficient and clinically relevant for utilization magnitude.",
        "cons_of_this": "When the question is truly binary (any event? yes/no) or only the first event is meaningful, logistic or time-to-event is simpler.",
        "when_to_prefer": "The count or rate of events is the meaningful endpoint, as in most HCRU, oncology monitoring, and post-procedure utilization questions."
      },
      {
        "compared_to": "healthcare-costs-pppm-pppy-pmpm",
        "pros_of_this": "Natural for discrete utilization events; a clean IRR maps directly to budget-impact and value narratives.",
        "cons_of_this": "Counts ignore dollar magnitude (a long and a short admission are both one event); costs need gamma/log-link or two-part GLMs for heavy tails and zeros.",
        "when_to_prefer": "Modeling utilization volume; pair with cost models (PPPM/PPPY) for the full economic picture rather than substituting one for the other."
      },
      {
        "compared_to": "recurrent-events-analysis-rwe",
        "pros_of_this": "Simpler single-number IRR; no need to model gap-time dependence or the at-risk process; robust when total volume is the question.",
        "cons_of_this": "Cannot represent time-varying exposure, gap-time dependence, or differential censoring/terminal events the way Andersen-Gill / PWP do; collapses all timing into a single count.",
        "when_to_prefer": "The scientific interest is total rate/volume over a window and timing structure is secondary or follow-up is approximately equal after the offset."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Define a clean index (first qualifying fill or procedure date). Build person-time from enrollment spans, censoring at death, disenrollment, and data end; exclude Medicare Advantage-only spans where inpatient/encounter capture is incomplete and would deflate counts. Count events from all relevant claim files (inpatient facility + professional, outpatient, carrier) with validated code lists, collapsing same-stay transfer claims. Carry log(person_time) as an offset, report the NB IRR (justify Poisson only via a dispersion test), and use cluster-robust variance when patients contribute multiple events. Run sensitivity to window length, offset definition, and an as-treated censoring rule.",
      "ehr": "Counts come from orders, encounters, or NLP and are inflated by visit-driven (informed-presence) bias, since sicker patients generate more records; adjust for visit intensity or link to claims for complete capture. Define the observation window explicitly and treat loss to follow-up as potentially informative when building person-time.",
      "registry": "Often carries clean structured counts (lines of therapy, transfusions, cycles), ideal for validating claims-based count algorithms, but is usually incomplete for cross-setting utilization; link to claims for full capture and to a death index for censoring.",
      "linked": "Linked claims-EHR-vital-records is the ideal substrate (EHR severity for confounding control, claims completeness for counts, reliable mortality for censoring) but introduces linkage selection and order/fill/service date discrepancies that must be reconciled before person-time and the offset are built."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\nimport pandas as pd\nimport statsmodels.api as sm\nimport statsmodels.formula.api as smf\nfrom scipy import stats\n\ndef fit_count_models(analytic: pd.DataFrame, formula: str = \"hosp_count ~ arm + age + sex + baseline_comorb_index\"):\n    df = analytic[analytic[\"person_years\"] > 0].copy()\n    offset = np.log(df[\"person_years\"].to_numpy())\n\n    # Poisson rate model (IRR = exp(beta)); cluster-robust SE on person_id.\n    pois = smf.glm(formula, data=df, family=sm.families.Poisson(), offset=offset).fit(\n        cov_type=\"cluster\", cov_kwds={\"groups\": df[\"person_id\"]}\n    )\n\n    # Overdispersion check: Pearson chi-square / residual df should be ~1 under Poisson.\n    disp = float(pois.pearson_chi2 / pois.df_resid)\n\n    # Negative binomial (NB-2): estimate dispersion alpha, then refit GLM with that alpha.\n    nb_mle = sm.NegativeBinomial.from_formula(formula, data=df, offset=offset).fit(disp=0)\n    alpha = float(nb_mle.params[\"alpha\"])\n    nb = smf.glm(formula, data=df, family=sm.families.NegativeBinomial(alpha=alpha), offset=offset).fit(\n        cov_type=\"cluster\", cov_kwds={\"groups\": df[\"person_id\"]}\n    )\n\n    # Boundary LR test of H0: alpha = 0 (Poisson) vs NB; halve the chi-square(1) tail probability.\n    lr = 2.0 * (nb_mle.llf - pois.llf)\n    lr_p = 0.5 * stats.chi2.sf(lr, df=1)\n\n    def irr_table(fit):\n        out = pd.DataFrame({\"IRR\": np.exp(fit.params)})\n        ci = np.exp(fit.conf_int())\n        out[\"ci_low\"], out[\"ci_high\"] = ci[0], ci[1]\n        return out\n\n    return {\n        \"pearson_dispersion\": disp,\n        \"nb_alpha\": alpha,\n        \"lr_stat\": lr,\n        \"lr_p_boundary\": lr_p,\n        \"model_choice\": \"negative_binomial\" if lr_p < 0.05 else \"poisson\",\n        \"poisson_irr\": irr_table(pois),\n        \"nb_irr\": irr_table(nb),\n    }",
        "description": "Fit Poisson and negative binomial rate models with a person-time offset, decide between them, and report IRRs.\nRequired input is an analysis-ready table (one row per patient), already built upstream by cohort construction:\n  analytic : person_id, arm (0/1 or 'COMPARATOR'/'STUDY'), hosp_count (int),\n             person_years (float, from enrollment minus death/disenroll/end),\n             age, sex, baseline_comorb_index    # covariates measured to time zero only\nThe offset is log(person_years); exp(coef) is the incidence rate ratio. Cluster-robust SEs guard against any\nwithin-person correlation carried by repeated structure.",
        "dependencies": [
          "pandas",
          "numpy",
          "statsmodels"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(MASS)\nlibrary(sandwich)\nlibrary(lmtest)\n\nfit_count_models <- function(analytic,\n                             fml = hosp_count ~ arm + age + sex + baseline_comorb_index) {\n  df <- subset(analytic, person_years > 0)\n  off <- log(df$person_years)\n\n  # Poisson rate model with person-time offset.\n  pois <- glm(fml, family = poisson(link = \"log\"), data = df, offset = off)\n\n  # Overdispersion: Pearson chi-square / residual df (~1 under Poisson).\n  disp <- sum(residuals(pois, type = \"pearson\")^2) / df.residual(pois)\n\n  # Negative binomial (NB-2).\n  nb <- glm.nb(update(fml, . ~ . + offset(log(person_years))), data = df)\n\n  # Boundary LR test of theta -> Inf (Poisson) vs NB; halve the p-value.\n  lr   <- 2 * (logLik(nb) - logLik(pois))\n  lr_p <- 0.5 * pchisq(as.numeric(lr), df = 1, lower.tail = FALSE)\n\n  # Cluster-robust (by person_id) IRRs with 95% CI.\n  irr <- function(fit) {\n    vc  <- vcovCL(fit, cluster = df$person_id)\n    est <- coeftest(fit, vcov. = vc)\n    data.frame(IRR = exp(est[, 1]),\n               ci_low  = exp(est[, 1] - 1.96 * est[, 2]),\n               ci_high = exp(est[, 1] + 1.96 * est[, 2]))\n  }\n\n  list(pearson_dispersion = disp,\n       nb_theta = nb$theta,\n       lr_p_boundary = as.numeric(lr_p),\n       model_choice = ifelse(lr_p < 0.05, \"negative_binomial\", \"poisson\"),\n       poisson_irr = irr(pois),\n       nb_irr = irr(nb))\n}",
        "description": "Fit Poisson and negative binomial rate models with MASS::glm.nb, decide via a dispersion LR test, and report IRRs with\ncluster-robust SEs. Input mirrors the Python version:\n  analytic : person_id, arm, hosp_count, person_years, age, sex, baseline_comorb_index\noffset = log(person_years); exp(coef) is the incidence rate ratio.",
        "dependencies": [
          "MASS",
          "sandwich",
          "lmtest"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* Offset = log(person-time); exp(beta) is the incidence rate ratio. */\ndata analytic;\n  set work.analytic;\n  where person_years > 0;\n  log_pt = log(person_years);\nrun;\n\n/* Poisson rate model. SCALE=PEARSON flags overdispersion (deviance/df, Pearson/df should be ~1). */\nproc genmod data=analytic;\n  class arm (ref='0') sex;\n  model hosp_count = arm age sex baseline_comorb_index\n        / dist=poisson link=log offset=log_pt scale=pearson;\n  estimate 'IRR arm' arm 1 -1 / exp;   /* exponentiated contrast = IRR */\nrun;\n\n/* Negative binomial (NB-2): adds dispersion parameter; preferred default for overdispersed HCRU. */\nproc genmod data=analytic;\n  class arm (ref='0') sex;\n  model hosp_count = arm age sex baseline_comorb_index\n        / dist=negbin link=log offset=log_pt;\n  estimate 'IRR arm' arm 1 -1 / exp;\nrun;\n\n/* Cluster-robust (empirical) SEs for within-person correlation via GEE; one record per person here,\n   extend to REPEATED with multiple records when modeling panel/recurrent counts. */\nproc genmod data=analytic;\n  class arm (ref='0') sex person_id;\n  model hosp_count = arm age sex baseline_comorb_index\n        / dist=negbin link=log offset=log_pt;\n  repeated subject=person_id / type=ind;   /* empirical (robust) variance */\nrun;\n\n/* Zero-inflated NB when structural never-utilizers create excess zeros beyond NB. */\nproc countreg data=analytic;\n  model hosp_count = arm age sex baseline_comorb_index / dist=zinb offset=log_pt;\n  zeromodel hosp_count ~ age baseline_comorb_index;   /* logistic structural-zero process */\nrun;",
        "description": "Fit Poisson and negative binomial rate models in SAS via PROC GENMOD with a log(person-time) offset, get cluster-robust\n(GEE) SEs, and fit a zero-inflated NB via PROC COUNTREG. Required input (post data-management), one row per patient:\n  work.analytic : person_id, arm (char/num), hosp_count, person_years,\n                  age, sex, baseline_comorb_index\nCreate the offset variable before modeling; exp(estimate) is the incidence rate ratio.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  X[Exposure + baseline covariates<br/>measured to time zero] --> Link[log link: log mu = b0 + bX + log person_time]\n  Link --> Mu[mu = expected count / rate]\n  Mu --> Dist{Mean vs variance}\n  Dist -->|Var = mu| Pois[Poisson: IRR = exp b]\n  Dist -->|Var = mu + alpha*mu^2| NB[Negative binomial: IRR = exp b, wider CI]\n  Pois --> Y[Observed count: admissions, ED visits, infusions]\n  NB --> Y",
        "caption": "GLM count-model structure. The log link plus log(person_time) offset gives an incidence rate ratio; NB relaxes the Poisson equidispersion assumption by adding a dispersion parameter, widening confidence intervals to honest width.",
        "alt_text": "Flowchart from exposure and covariates through the log link and person-time offset to the expected rate, branching to Poisson (variance equals mean) or negative binomial (variance exceeds mean), both producing the observed count.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Start[Count outcome over a defined window<br/>with person-time offset] --> Q1{Pearson chi-sq / df near 1?<br/>LR test of alpha = 0 non-sig?}\n  Q1 -->|Yes, equidispersed| Pois[Poisson rate model]\n  Q1 -->|No, overdispersed| Q2{Excess zeros beyond NB prediction?<br/>structural never-utilizers?}\n  Q2 -->|No| NB[Negative binomial NB-2]\n  Q2 -->|Yes| ZINB[Zero-inflated or hurdle NB]\n  Pois --> Q3{Multiple correlated events per person?}\n  NB --> Q3\n  ZINB --> Q3\n  Q3 -->|Yes| Clust[Add cluster-robust / GEE SE<br/>or random-effects NB]\n  Q3 -->|No| Done[Report IRR with 95% CI]\n  Clust --> Done",
        "caption": "Decision logic for choosing a count model. Dispersion diagnostics drive Poisson vs NB; excess structural zeros motivate ZINB/hurdle; within-person correlation requires cluster-robust or random-effects variance before reporting the IRR.",
        "alt_text": "Decision tree starting from a count outcome, testing for overdispersion to choose Poisson or negative binomial, testing for excess zeros to choose zero-inflated models, and testing for within-person correlation to add cluster-robust variance before reporting the incidence rate ratio.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "see_also",
        "target_slug": "hcru-healthcare-resource-utilization",
        "notes": "Poisson/NB rate models are the primary regression engine for the count-based HCRU endpoints (hospitalizations, visits, procedures, distinct drugs) defined in the HCRU concept."
      },
      {
        "relation_type": "tradeoff_with",
        "target_slug": "cox-ph-regression",
        "notes": "Count models target the rate/total over a window; Cox and recurrent-event survival models target timing and the at-risk process. Choose by whether volume or timing is the question."
      },
      {
        "relation_type": "tradeoff_with",
        "target_slug": "logistic-regression-for-binary-outcomes",
        "notes": "Count models use the full count and are more efficient for utilization magnitude; logistic is simpler when the outcome is truly any/none."
      },
      {
        "relation_type": "tradeoff_with",
        "target_slug": "healthcare-costs-pppm-pppy-pmpm",
        "notes": "Count models for utilization volume complement cost GLMs (PPPM/PPPY); economic analyses typically report adjusted NB rates alongside gamma/two-part cost estimates."
      },
      {
        "relation_type": "alternative_to",
        "target_slug": "recurrent-events-analysis-rwe",
        "notes": "Recurrent-event survival models (Andersen-Gill, PWP, WLW) are the timing-aware alternative when event ordering, gap-time dependence, or terminal events matter rather than the total count."
      },
      {
        "relation_type": "used_with",
        "target_slug": "active-comparator-new-user",
        "notes": "Count models sit downstream of an ACNU cohort; the new-user, active-comparator, time-zero structure supplies the exposure contrast and the offset's person-time window."
      },
      {
        "relation_type": "see_also",
        "target_slug": "immortal-time-bias-handling",
        "notes": "Misclassified or unequal person-time without a correct offset artificially lowers counts in the exposed arm, biasing the IRR protective."
      },
      {
        "relation_type": "see_also",
        "target_slug": "healthy-user-bias",
        "notes": "Healthy users may generate fewer utilization events for non-treatment reasons, biasing rate ratios downward for preventive therapies."
      },
      {
        "relation_type": "see_also",
        "target_slug": "prevalent-user-bias",
        "notes": "Prevalent users are often stable low-utilizers (survivors); mixing them with new users biases count comparisons for initiation effects."
      },
      {
        "relation_type": "part_of",
        "target_slug": "comparative-effectiveness-research-cer-methods",
        "notes": "Count/rate models are a core tool for utilization and event-count endpoints within comparative effectiveness research."
      },
      {
        "relation_type": "used_with",
        "target_slug": "treatment-patterns-lines-of-therapy",
        "notes": "Number of lines, regimens, or switches can be modeled as count outcomes; related utilization measures feed the same rate models."
      }
    ],
    "aliases": [
      "Poisson regression",
      "negative binomial regression",
      "count regression",
      "rate ratio models",
      "incidence rate ratio models",
      "overdispersed count models",
      "NB-2 regression"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "ppv-npv-rwe",
    "name": "Positive and Negative Predictive Value",
    "short_definition": "Prevalence-dependent accuracy metrics conditioning on the test result rather than truth — positive predictive value = TP/(TP+FP) is the probability an algorithm-positive record is truly a case, negative predictive value = TN/(TN+FN) is the probability an algorithm-negative record is truly a non-case — derived from sensitivity, specificity, and disease prevalence by Bayes' theorem, and the reason chart-reviewing algorithm-positives yields PPV rather than sensitivity.",
    "long_description": "**Positive predictive value (PPV)** and **negative predictive value (NPV)** answer the question a downstream analyst\nactually faces: *given the test (algorithm) result, what is the probability the truth matches it?* From the 2x2\nvalidation table with truth in columns — true positives (TP), false positives (FP), false negatives (FN), true negatives\n(TN) — **PPV = TP / (TP + FP)** is the probability of true disease *given a positive result* (P(D+|T+)), and\n**NPV = TN / (TN + FN)** is the probability of true non-disease *given a negative result* (P(D−|T−)). These condition on\nthe *rows* (the test result) rather than the columns (the truth), which is the precise sense in which they are the\ninverse-conditional companions of sensitivity and specificity.\n\n**Core idea.** Because PPV and NPV condition on the test result, they depend on how common the disease is in the tested\npopulation. Bayes' theorem makes the dependence explicit. Writing prevalence p = P(D+), sensitivity Se = P(T+|D+), and\nspecificity Sp = P(T−|D−):\n\n    PPV = (Se * p) / (Se * p + (1 - Sp) * (1 - p))\n    NPV = (Sp * (1 - p)) / (Sp * (1 - p) + (1 - Se) * p)\n\nSensitivity and specificity are intrinsic to the test on a fixed case spectrum; PPV and NPV layer prevalence on top of\nthem. The operationally crucial consequence: a highly sensitive, highly specific algorithm can still have a *low* PPV\nwhen the outcome is rare, because the small number of true cases is swamped by false positives drawn from the large pool\nof non-cases. This is exactly the regime of most safety outcomes and incident events in claims, where prevalence in the\ncohort is a few percent or less.\n\n**Why charting algorithm-positives yields PPV, not sensitivity.** The most common RWE validation design pulls medical\ncharts only for records the algorithm flagged positive and adjudicates them. 
That design observes TP and FP (the\npositives that are true vs false) and therefore identifies **PPV = TP/(TP+FP)** directly. It *cannot* observe FN — the\ntrue cases the algorithm missed are, by construction, not in the algorithm-positive sample — so sensitivity = TP/(TP+FN)\nis unidentifiable from algorithm-positive charts alone. To estimate sensitivity you must also chart a sample of\nalgorithm-*negatives* (or link to an external high-coverage reference) so that FN become observable. Labeling a\npositive-predictive-value validation as a \"sensitivity\" estimate is the single most frequent error in claims-algorithm\npapers.\n\n**The Bayesian relation and its practical inversion.** Because PPV/NPV mix the test's intrinsic accuracy with the\npopulation's prevalence, two moves recur. (1) *Transport*: a PPV estimated in a high-prevalence validation sample does\nnot carry to a low-prevalence analysis cohort; recompute it from sensitivity, specificity, and the new prevalence via the\nBayes formula, or better, design the validation in the target population. (2) *Back-out*: if you have PPV from\nalgorithm-positives and an external prevalence and specificity, you can solve the Bayes relation for the implied\nsensitivity — useful but fragile, since it propagates the uncertainty of every input.\n\n**Pros, cons, and trade-offs.**\n- **vs sensitivity/specificity (`sensitivity-specificity-rwe`):** PPV/NPV are what a downstream user needs to interpret a\n  flag in a specific cohort — they answer \"is this record true?\" Their cost is prevalence-dependence: they do not\n  transport across cohorts of differing case frequency and cannot, by themselves, characterize the test. 
**Prefer\n  PPV/NPV** to interpret and weight flagged records in the analysis cohort; **prefer sensitivity/specificity** to\n  characterize the algorithm and feed misclassification corrections.\n- **vs likelihood ratios (`likelihood-ratios-diagnostic-rwe`):** A likelihood ratio is prevalence-independent and updates\n  pre-test odds to post-test odds at *any* prevalence, so it separates the test's evidentiary weight from the population.\n  PPV is the resulting post-test probability at one specific prevalence. **Prefer a likelihood ratio** to summarize\n  evidentiary strength portably; **prefer PPV** to state the concrete probability in the cohort at hand.\n- **vs raw count of algorithm-positives:** Using algorithm-positives as if every flag were a true case ignores the FP\n  fraction; PPV makes the contamination explicit and lets you either restrict to high-PPV definitions or correct the\n  downstream estimate. **Always prefer the PPV-aware analysis** over treating flags as truth.\n\n**When to use.** Report PPV (with an exact binomial CI) whenever you validate an algorithm by sampling and adjudicating\nalgorithm-positive records (the dominant, low-cost claims design); report PPV and NPV together when you have sampled both\nresult strata; use the Bayes formula to project PPV/NPV into the target analysis cohort's prevalence; and quote PPV when\nthe operational question is \"what fraction of the patients my algorithm flags are real?\", e.g., to size a\nchart-abstraction workload or to weight cases in an outcome analysis.\n\n**When NOT to use, and when it is actively misleading or dangerous.**\n- **As a stand-in for sensitivity.** A high PPV says little about how many true cases were missed. An algorithm tuned for\n  PPV (specific, restrictive code rules) can have poor sensitivity, biasing incidence downward and attenuating effect\n  estimates. 
Reporting only PPV and calling the algorithm \"validated\" hides the FN problem.\n- **Transporting a PPV across prevalences.** A PPV measured where the outcome is 20% prevalent overstates the PPV in a\n  cohort where it is 2%. Quoting the validation-sample PPV in a different cohort without re-deriving from sensitivity,\n  specificity, and the new prevalence is actively misleading.\n- **In a rare-outcome cohort, reassuring with sensitivity/specificity alone.** \"95% sensitive and 95% specific\" sounds\n  definitive, but at 1% prevalence PPV is only ~16% — most flags are false. Presenting the operating characteristics\n  without the implied PPV gives false confidence in the positives.\n- **When the validation frame mis-measures prevalence.** If MA-only person-time (no fee-for-service claims) inflates the\n  apparent denominator, the prevalence feeding the Bayes calculation is wrong and the projected PPV/NPV are wrong.\n\n**Data-source operational depth.**\n- **Claims (FFS):** PPV from algorithm-positive chart review is the standard, inexpensive deliverable; it is the right\n  metric for \"how many flags are real,\" but it inherits the cohort's prevalence and must be re-projected for any cohort\n  with different case frequency. NPV is rarely estimated because charting algorithm-negatives is costly. Medicare\n  Advantage enrollees produce no FFS claims, so MA-only spans appear as algorithm-negatives and distort the prevalence\n  that drives PPV/NPV — restrict the validation frame to FFS-observable person-time.\n- **EHR:** Encounter-driven capture means an algorithm-negative may be truly positive but seen elsewhere (\"leakage\"),\n  inflating apparent FN and depressing NPV; structured labs/problem lists and NLP raise PPV by sharpening the case\n  definition. 
Report the prevalence in the validation sample so PPV can be transported.\n- **Registry/linked:** Registry linkage supplies the reference standard and a credible prevalence; linked\n  claims-EHR-registry data lets PPV be both estimated and transported, and lets NPV be estimated by adjudicating algorithm-negatives\n  against the registry.\n\n**Worked claims example.** The same incident-HF claims algorithm validated for sensitivity/specificity is, in routine\npractice, validated more cheaply by charting only the records it flags. Of 300 adjudicated algorithm-positives, 261 are\nconfirmed HF and 39 are not: **PPV = 261 / (261 + 39) = 0.870 (87.0%)**, exact binomial 95% CI 0.827–0.906. If the team\nalso charts 300 algorithm-negatives and finds 18 missed true cases (FN) against 282 confirmed non-cases (TN),\n**NPV = 282 / (282 + 18) = 0.940 (94.0%)**. Crucially, these PPV/NPV depend on the prevalence in the validation frame. To\napply the algorithm in a younger commercial cohort where incident HF is far rarer, the team does *not* reuse 87% PPV;\ninstead it takes the prevalence-invariant sensitivity (93.5%) and specificity (87.9%) from the stratified validation and\nre-derives PPV by Bayes at the new prevalence. 
At prevalence p = 0.02, PPV = (0.935 * 0.02) / (0.935 * 0.02 + 0.121 *\n0.98) = 0.136, so only **13.6%** of flags would be true HF, a more than sixfold collapse from the 87% measured in the\nenriched validation sample, and a number that fundamentally changes whether the algorithm-positives can be used as\noutcomes without correction.\n\n**Interpreting the output**\n\nIn the worked example, the validation sample yields PPV = 261/300 ≈ 0.870 (87.0%) in an enriched\nsample, but at a community prevalence of 2% the Bayes-projected PPV collapses to ≈ 0.136 (13.6%).\n\n*(1) Formal interpretation.* PPV = TP / (TP + FP) = 0.870 is the conditional probability that a patient\nflagged positive by the algorithm is a true case, **given the prevalence in the validation sample**. NPV\n= TN / (TN + FN) ≈ 0.940 is the conditional probability that a flagged-negative is a true non-case,\nagain at that prevalence. Unlike sensitivity and specificity, PPV and NPV are not fixed properties of\nthe algorithm; they are derived from (sensitivity, specificity, and prevalence) via Bayes' theorem.\nHolding sensitivity at 0.935 and specificity at 0.879 constant, moving prevalence from roughly 50% in\nthe enriched validation design to 2% in a community cohort drives PPV from 87% to 13.6%. This is not a\nfailure of the algorithm; it is a mathematical consequence of the low signal-to-noise ratio at low\nprevalence.\n\n*(2) Practical interpretation.* Never transport a PPV or NPV estimate from a validation study to a\ndifferent population without checking that prevalences are comparable. When using algorithm-flagged cases\nas outcomes in an effectiveness study, a PPV of 13.6% at 2% prevalence means the outcome is contaminated\nby roughly 86% false positives, severe enough to require either a sensitivity analysis excluding\nunvalidated events or a matrix misclassification correction using the prevalence-invariant (sensitivity,\nspecificity) pair. 
Report both the observed PPV and the Bayes-projected PPV at the target prevalence.",
    "primary_category": "Outcome_Measure",
    "tags": [
      "positive-predictive-value",
      "negative-predictive-value",
      "ppv",
      "npv",
      "bayes-theorem",
      "prevalence-dependence",
      "algorithm-validation",
      "chart-review"
    ],
    "applies_to_study_types": [
      "claims_analysis",
      "ehr_study",
      "registry_linkage",
      "validation_study",
      "diagnostic_accuracy",
      "safety_surveillance"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1136/bmj.309.6947.102",
        "url": "https://doi.org/10.1136/bmj.309.6947.102",
        "citation_text": "Altman DG, Bland JM. Statistics Notes: Diagnostic tests 2: predictive values. BMJ. 1994;309(6947):102.",
        "year": 1994,
        "authors_short": "Altman & Bland",
        "notes": "The canonical short definition of positive and negative predictive value as the conditional probabilities of true status given the test result, with the explicit demonstration of their dependence on prevalence via Bayes' theorem."
      },
      {
        "role": "explain",
        "doi": "10.1136/bmj.h5527",
        "url": "https://doi.org/10.1136/bmj.h5527",
        "citation_text": "Bossuyt PM, Reitsma JB, Bruns DE, Gatsonis CA, Glasziou PP, Irwig L, et al. STARD 2015: an updated list of essential items for reporting diagnostic accuracy studies. BMJ. 2015;351:h5527.",
        "year": 2015,
        "authors_short": "Bossuyt et al.",
        "notes": "Reporting standard requiring that predictive values be reported with the prevalence/sampling frame they were estimated in, since PPV/NPV do not transport without re-derivation at the target prevalence."
      },
      {
        "role": "explain",
        "doi": "10.1093/oxfordjournals.aje.a112510",
        "url": "https://doi.org/10.1093/oxfordjournals.aje.a112510",
        "citation_text": "Rogan WJ, Gladen B. Estimating prevalence from the results of a screening test. American Journal of Epidemiology. 1978;107(1):71-76.",
        "year": 1978,
        "authors_short": "Rogan & Gladen",
        "notes": "Formalizes the algebra linking apparent (test-result-based) and true frequencies through sensitivity and specificity; the same Bayes relation that converts sensitivity/specificity and prevalence into PPV/NPV."
      },
      {
        "role": "demonstrate",
        "doi": "10.1016/j.jclinepi.2005.02.022",
        "url": "https://doi.org/10.1016/j.jclinepi.2005.02.022",
        "citation_text": "Reitsma JB, Glas AS, Rutjes AWS, Scholten RJPM, Bossuyt PM, Zwinderman AH. Bivariate analysis of sensitivity and specificity produces informative summary measures in diagnostic reviews. Journal of Clinical Epidemiology. 2005;58(10):982-990.",
        "year": 2005,
        "authors_short": "Reitsma et al.",
        "notes": "Demonstrates why predictive values are not directly poolable across studies of differing prevalence and why the prevalence-invariant sensitivity/specificity must be modeled, then converted to predictive values at a target prevalence."
      },
      {
        "role": "use",
        "doi": "10.7326/M14-0698",
        "url": "https://doi.org/10.7326/M14-0698",
        "citation_text": "Moons KGM, Altman DG, Reitsma JB, Ioannidis JPA, Macaskill P, Steyerberg EW, et al. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): explanation and elaboration. Annals of Internal Medicine. 2015;162(1):W1-W73.",
        "year": 2015,
        "authors_short": "Moons et al.",
        "notes": "Sets reporting expectations for predictive values from diagnostic/prognostic models, emphasizing that they must be interpreted at the prevalence of the intended-use population."
      }
    ],
    "plain_language_summary": "When you build a computer rule (an \"algorithm\") to flag patients with a disease in claims or EHR data, positive predictive value (PPV) asks: of the records the rule flagged as cases, what fraction are really cases? Negative predictive value (NPV) asks the mirror question: of the records it flagged as non-cases, what fraction are really non-cases? You compute them straight from a 2x2 table of the rule's guess versus the chart-confirmed truth: PPV = true positives / all flagged-positive, NPV = true negatives / all flagged-negative. The catch worth remembering: both numbers move with how common the disease is in the population you tested, so a PPV measured in one cohort can be very different in a rarer one even when the rule itself never changed.",
    "key_terms": [
      {
        "term": "algorithm",
        "definition": "A rule applied to claims or EHR data (e.g., \"at least one inpatient heart-failure code\") that labels each patient as a likely case or non-case."
      },
      {
        "term": "true positive (TP)",
        "definition": "A record the algorithm flagged as a case that chart review confirms really is a case."
      },
      {
        "term": "false positive (FP)",
        "definition": "A record the algorithm flagged as a case that turns out, on chart review, not to be a case."
      },
      {
        "term": "false negative (FN)",
        "definition": "A real case that the algorithm missed by labeling it a non-case."
      },
      {
        "term": "prevalence",
        "definition": "How common the disease is in the group you are testing the algorithm on, written as a fraction (e.g., 0.02 means 2% of people have it)."
      },
      {
        "term": "Bayes' theorem",
        "definition": "A formula that combines how good the algorithm is with how common the disease is to give the chance a flagged record is truly a case."
      }
    ],
    "worked_example": {
      "scenario": "A team has a computer rule that flags incident heart-failure (HF) cases in claims data. To check it, they pull medical charts for 300 records the rule flagged as cases and 300 it flagged as non-cases, then a clinician adjudicates each one as truly HF or not. The 2x2 table below counts how the rule's guess (rows) lined up with the chart truth (columns). From it we compute PPV and NPV, then show why those exact numbers do not carry over to a cohort where HF is much rarer.",
      "dataset": {
        "caption": "The adjudicated 2x2 confusion table an analyst builds: the algorithm's call (rows) crossed with chart-confirmed truth (columns). Each cell is a patient count.",
        "columns": [
          "",
          "Disease + (HF confirmed)",
          "Disease - (not HF)"
        ],
        "rows": [
          [
            "Algorithm +",
            261,
            39
          ],
          [
            "Algorithm -",
            18,
            282
          ]
        ]
      },
      "steps": [
        "Read the cells off the table: true positives TP = 261, false positives FP = 39, false negatives FN = 18, true negatives TN = 282.",
        "PPV conditions on the algorithm-positive row (the 300 records it flagged as cases): PPV = TP / (TP + FP) = 261 / (261 + 39) = 261 / 300.",
        "NPV conditions on the algorithm-negative row (the 300 it flagged as non-cases): NPV = TN / (TN + FN) = 282 / (282 + 18) = 282 / 300.",
        "Now see why these numbers depend on prevalence. The algorithm's intrinsic accuracy here is sensitivity Se = 0.935 and specificity Sp = 0.879 (these do not change with how common HF is). Feed them through Bayes' theorem: PPV = (Se*p) / (Se*p + (1-Sp)*(1-p)).",
        "In a moderate-prevalence cohort where HF is 20% common (p = 0.20): PPV = (0.935*0.20) / (0.935*0.20 + 0.121*0.80) = 0.187 / 0.2838 = 0.659 -- about 66% of flags are real.",
        "In a young commercial cohort where incident HF is rare, 2% (p = 0.02): same Se and Sp, but PPV = (0.935*0.02) / (0.935*0.02 + 0.121*0.98) = 0.0187 / 0.13732 = 0.136 -- only about 14% of flags are real. Nothing about the rule changed; only the prevalence did."
      ],
      "result": "PPV = 261 / 300 = 0.870 (87.0%) and NPV = 282 / 300 = 0.940 (94.0%) in this validation sample. But because PPV depends on prevalence, the same algorithm (Se = 0.935, Sp = 0.879) gives PPV = 0.659 at 20% prevalence and collapses to PPV = 0.136 at 2% prevalence -- so the 87% must never be reused in a rarer cohort without re-deriving it via Bayes."
    },
    "prerequisites": [
      "sensitivity-specificity-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "PPV from algorithm-positive chart review (the standard claims design)",
        "description": "Charts are pulled only for algorithm-positive records and adjudicated; TP and FP are observed, so PPV is estimated directly with an exact binomial CI. Sensitivity is NOT estimable because false negatives are unobserved.",
        "edge_cases": [
          "The PPV is conditional on the validation-sample prevalence; it must be re-derived at the analysis cohort's prevalence.",
          "Enriching the sample (e.g., oversampling likely-true codes) raises measured PPV but biases it relative to the source."
        ],
        "data_source_notes": "claims: cheapest validation; restrict to FFS-observable person-time so prevalence is correct. Report the source prevalence so the PPV can be transported via Bayes."
      },
      {
        "name": "PPV/NPV projected to a target prevalence via Bayes",
        "description": "Prevalence-invariant sensitivity and specificity (from a stratified validation) are combined with the analysis cohort's prevalence through Bayes' theorem to obtain the PPV and NPV that will actually hold in that cohort.",
        "edge_cases": [
          "In rare-outcome cohorts PPV collapses even for high sensitivity/specificity; the projected value, not the validation PPV, governs how usable the flags are.",
          "The projection assumes the case spectrum transports (no spectrum effect on sensitivity/specificity)."
        ],
        "data_source_notes": "linked: registry linkage supplies a credible target prevalence; recompute PPV/NPV rather than reusing the validation-sample values."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "sensitivity-specificity-rwe",
        "pros_of_this": "Answers the operationally relevant question - the probability a flagged record is truly a case in this cohort - and is what a downstream analyst needs to weight or restrict algorithm-positives.",
        "cons_of_this": "Prevalence-dependent, so does not transport across cohorts of differing case frequency and cannot characterize the algorithm on its own.",
        "when_to_prefer": "To interpret flags and size chart-abstraction in the analysis cohort; switch to sensitivity/specificity to characterize the algorithm and feed misclassification corrections."
      },
      {
        "compared_to": "likelihood-ratios-diagnostic-rwe",
        "pros_of_this": "States the concrete post-test probability in the population at hand, directly usable for decisions about the flagged records.",
        "cons_of_this": "Tied to one prevalence; a likelihood ratio separates the test's evidentiary weight from the population and updates any pre-test probability.",
        "when_to_prefer": "When a concrete probability in the current cohort is needed; use likelihood ratios to summarize portable evidentiary strength across populations."
      },
      {
        "compared_to": "claims-outcome-algorithm-ppv-sensitivity-rwe",
        "pros_of_this": "PPV is the metric that quantifies the false-positive contamination of a restrictive, specific outcome algorithm and is what algorithm-positive chart review yields.",
        "cons_of_this": "Says nothing about the true cases a restrictive algorithm misses; must be paired with a sensitivity estimate that requires charting algorithm-negatives.",
        "when_to_prefer": "When the question is how clean the flagged outcomes are; pair with sensitivity to judge the full accuracy/coverage trade-off of the algorithm."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "PPV from algorithm-positive chart review is the standard, low-cost deliverable; report it with the validation prevalence and re-derive for any cohort with a different case frequency. Restrict the validation frame to FFS-observable person-time (drop Medicare Advantage-only spans) so the prevalence feeding Bayes is correct. NPV requires the costlier step of charting algorithm-negatives.",
      "ehr": "Encounter-driven capture inflates apparent false negatives (leakage), depressing NPV; require demonstrable in-system activity. Structured labs/problem lists and NLP raise PPV. Report the validation-sample prevalence so PPV can be transported.",
      "registry": "Registry linkage supplies the reference standard and a credible prevalence, enabling both PPV estimation and NPV estimation by adjudicating algorithm-negatives against the registry, and a defensible target prevalence for projection."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "from scipy.stats import beta\n\ndef exact_binom_ci(k: int, n: int, conf: float = 0.95):\n    alpha = 1.0 - conf\n    lo = 0.0 if k == 0 else beta.ppf(alpha / 2, k, n - k + 1)\n    hi = 1.0 if k == n else beta.ppf(1 - alpha / 2, k + 1, n - k)\n    return lo, hi\n\ndef ppv_npv(tp: int, fp: int, fn: int, tn: int) -> dict:\n    # PPV conditions on the positive row (TP + FP); NPV on the negative row (TN + FN).\n    ppv = tp / (tp + fp)\n    npv = tn / (tn + fn)\n    return {\"ppv\": ppv, \"ppv_ci\": exact_binom_ci(tp, tp + fp),\n            \"npv\": npv, \"npv_ci\": exact_binom_ci(tn, tn + fn)}\n\ndef project_ppv_npv(sens: float, spec: float, prevalence: float) -> dict:\n    # Bayes' theorem: PPV/NPV at a target prevalence from prevalence-invariant Se, Sp.\n    p = prevalence\n    ppv = (sens * p) / (sens * p + (1 - spec) * (1 - p))\n    npv = (spec * (1 - p)) / (spec * (1 - p) + (1 - sens) * p)\n    return {\"ppv_at_prevalence\": ppv, \"npv_at_prevalence\": npv}\n\n# Worked claims example: PPV/NPV from validation, then projection to a rare-outcome cohort.\nobs = ppv_npv(tp=261, fp=39, fn=18, tn=282)\nprint(f\"PPV = {obs['ppv']:.3f}  95% CI {obs['ppv_ci'][0]:.3f}-{obs['ppv_ci'][1]:.3f}\")\nprint(f\"NPV = {obs['npv']:.3f}  95% CI {obs['npv_ci'][0]:.3f}-{obs['npv_ci'][1]:.3f}\")\nproj = project_ppv_npv(sens=0.935, spec=0.879, prevalence=0.02)\nprint(f\"Projected PPV at 2% prevalence = {proj['ppv_at_prevalence']:.3f}\")  # ~0.136",
        "description": "Compute PPV and NPV with exact (Clopper-Pearson) binomial 95% CIs from an adjudicated 2x2 table, and\nproject PPV/NPV to a target prevalence via Bayes' theorem using prevalence-invariant sensitivity and\nspecificity (Altman & Bland 1994). Inputs: cell counts TP, FP, FN, TN; then a target prevalence for the\nBayes projection. Worked HF-algorithm example with the rare-cohort collapse of PPV included.",
        "dependencies": [
          "scipy"
        ],
        "source_citations": [
          "altman-bland-1994-pv"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(epiR)\n\n# rows = algorithm (+/-), columns = truth (D+/D-).\ntab <- as.table(matrix(c(261, 39, 18, 282), nrow = 2, byrow = TRUE,\n                       dimnames = list(Algorithm = c(\"pos\", \"neg\"),\n                                       Truth     = c(\"Dpos\", \"Dneg\"))))\nres <- epi.tests(tab, conf.level = 0.95)\nres$detail[res$detail$statistic %in% c(\"pv.pos\", \"pv.neg\"), ]   # PPV and NPV with 95% CIs\n\n# Project to a target prevalence using prevalence-invariant sensitivity & specificity.\nproject_pv <- function(sens, spec, prev) {\n  ppv <- (sens * prev) / (sens * prev + (1 - spec) * (1 - prev))\n  npv <- (spec * (1 - prev)) / (spec * (1 - prev) + (1 - sens) * prev)\n  c(ppv = ppv, npv = npv)\n}\nproject_pv(sens = 0.935, spec = 0.879, prev = 0.02)             # PPV ~ 0.136 in a rare cohort",
        "description": "Compute PPV and NPV from a 2x2 validation table with exact binomial CIs using epiR::epi.tests(), then\nproject PPV/NPV to a target prevalence with a small Bayes helper. epi.tests() returns predictive values\nat the table's own prevalence, so the Bayes step is required to transport them to a cohort with different\ncase frequency (Altman & Bland 1994).",
        "dependencies": [
          "epiR"
        ],
        "source_citations": [
          "altman-bland-1994-pv"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* Expand adjudicated 2x2 cell counts to subject-level records. */\ndata valid;\n  do i = 1 to 261; truth = 1; algo = 1; output; end;   /* TP */\n  do i = 1 to  18; truth = 1; algo = 0; output; end;   /* FN */\n  do i = 1 to  39; truth = 0; algo = 1; output; end;   /* FP */\n  do i = 1 to 282; truth = 0; algo = 0; output; end;   /* TN */\n  drop i;\nrun;\n\n/* PPV = P(truth+ | algo+): exact binomial within the algorithm-positive row. */\nproc freq data=valid;\n  where algo = 1;\n  tables truth / binomial(level='1') alpha=0.05;\n  title 'PPV (exact 95% CI)';\nrun;\n\n/* NPV = P(truth- | algo-): proportion truth=0 within the algorithm-negative row. */\nproc freq data=valid;\n  where algo = 0;\n  tables truth / binomial(level='0') alpha=0.05;\n  title 'NPV (exact 95% CI)';\nrun;\n\n/* Project PPV/NPV to a target prevalence via Bayes' theorem. */\ndata project;\n  sens = 0.935; spec = 0.879; prev = 0.02;             /* rare-outcome cohort */\n  ppv = (sens*prev) / (sens*prev + (1-spec)*(1-prev)); /* ~0.136 */\n  npv = (spec*(1-prev)) / (spec*(1-prev) + (1-sens)*prev);\nrun;\nproc print data=project noobs; var prev ppv npv; run;",
        "description": "Compute PPV and NPV with exact binomial CIs in SAS by conditioning on the test-result rows: PROC FREQ\nBINOMIAL on algorithm-positives gives PPV, on algorithm-negatives gives NPV. A DATA step then projects\nPPV/NPV to a target prevalence via Bayes' theorem from prevalence-invariant sensitivity and specificity.",
        "dependencies": [],
        "source_citations": [
          "altman-bland-1994-pv"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  Test{{Algorithm result}} --> Pos[Algorithm + row]\n  Test --> Neg[Algorithm - row]\n  Pos --> TP[Truly + : TP]\n  Pos --> FP[Truly - : FP]\n  Neg --> TN[Truly - : TN]\n  Neg --> FN[Truly + : FN]\n  TP --> PPV[PPV = TP / TP+FP<br/>condition on + row]\n  FP --> PPV\n  TN --> NPV[NPV = TN / TN+FN<br/>condition on - row]\n  FN --> NPV",
        "caption": "PPV and NPV condition on the test-result rows. PPV is computed across the algorithm-positive row (TP/(TP+FP)); NPV across the algorithm-negative row (TN/(TN+FN)). Conditioning on the rows is why both depend on prevalence.",
        "alt_text": "Diagram splitting subjects by algorithm result into positive and negative rows, each further split by true status, with PPV computed within the positive row and NPV within the negative row.",
        "source_type": "illustrative",
        "source_citations": [
          "altman-bland-1994-pv"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  In[Prevalence-invariant Se and Sp] --> Bayes[Bayes' theorem at target prevalence p]\n  Prev[Cohort prevalence p] --> Bayes\n  Bayes --> High{Is the outcome rare?}\n  High -->|p large| HiPPV[PPV stays high - most flags true]\n  High -->|p small| LoPPV[PPV collapses - most flags false]\n  LoPPV --> Decide[Do NOT reuse validation-sample PPV<br/>correct or restrict before using flags]",
        "caption": "Projecting PPV to a cohort. Sensitivity and specificity feed Bayes' theorem at the cohort's prevalence; in a rare- outcome cohort PPV collapses even when the operating characteristics are excellent, so the validation-sample PPV must never be reused directly.",
        "alt_text": "Flowchart combining prevalence-invariant sensitivity and specificity with cohort prevalence through Bayes' theorem, showing PPV collapsing when the outcome is rare and warning against reusing the validation-sample PPV.",
        "source_type": "illustrative",
        "source_citations": [
          "altman-bland-1994-pv"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "complements",
        "target_slug": "sensitivity-specificity-rwe",
        "notes": "PPV/NPV condition on the test result and depend on prevalence; sensitivity/specificity condition on truth and are prevalence-invariant. PPV/NPV are the Bayes transform of sensitivity/specificity at a given prevalence."
      },
      {
        "relation_type": "see_also",
        "target_slug": "likelihood-ratios-diagnostic-rwe",
        "notes": "A likelihood ratio updates pre-test odds to post-test odds at any prevalence; PPV is the resulting post-test probability at one specific prevalence, so LRs portably summarize the evidentiary weight PPV expresses pointwise."
      },
      {
        "relation_type": "see_also",
        "target_slug": "roc-auc-discrimination-rwe",
        "notes": "ROC/AUC describes discrimination across thresholds independent of prevalence; PPV/NPV add prevalence to a chosen threshold to give the probability a classified record is correct."
      },
      {
        "relation_type": "used_with",
        "target_slug": "claims-outcome-algorithm-ppv-sensitivity-rwe",
        "notes": "Claims outcome algorithms are most often validated by reporting PPV from algorithm-positive chart review; PPV quantifies the false-positive contamination that trades off against sensitivity."
      },
      {
        "relation_type": "used_with",
        "target_slug": "algorithm-validation",
        "notes": "PPV from algorithm-positive chart review is the standard low-cost deliverable of a claims/EHR algorithm-validation study; NPV requires the costlier charting of algorithm-negatives."
      },
      {
        "relation_type": "affects",
        "target_slug": "misclassification-bias-correction-rwe",
        "notes": "When algorithm-positives are used as outcomes, the false-positive fraction (1 - PPV) contaminates the analysis; PPV/NPV inform predictive-value-based corrections of the downstream effect estimate."
      },
      {
        "relation_type": "see_also",
        "target_slug": "diagnostic-accuracy",
        "notes": "Predictive values are among the accuracy measures a diagnostic accuracy study reports, always anchored to the study population's prevalence."
      }
    ],
    "aliases": [
      "positive predictive value",
      "negative predictive value",
      "PPV",
      "NPV",
      "precision",
      "post-test probability"
    ],
    "complexity": "foundational",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta",
      "journal",
      "pqa-cms"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "pragmatic-trial",
    "name": "Pragmatic Trial",
    "short_definition": "A randomized trial designed to measure the effectiveness of an intervention under usual-care conditions in a broadly eligible population, typically embedded in routine-care data (EHR, registry, or claims) rather than a dedicated research infrastructure.",
    "long_description": "A **pragmatic clinical trial (PCT)** preserves randomization — the one design feature observational RWE cannot buy — while deliberately relaxing every *other* feature of the classic explanatory trial so the result generalizes to ordinary practice. Eligibility is broad (few exclusions for comorbidity, age, or polypharmacy), the intervention is delivered with usual flexibility by usual clinicians in usual settings, adherence is not artificially enforced, follow-up uses routine-care touchpoints, and outcomes are ascertained from data that already exist (EHR, disease registries, dispensing records, claims, vital records). The canonical modern examples — the Salford Lung Study (linked primary-care EHR + community-pharmacy dispensing), TASTE (the SWEDEHEART registry-based randomized trial), and ADAPTABLE (PCORnet EHR-embedded aspirin-dose trial) — randomize within a real-world data substrate, which is precisely why pragmatic trials sit inside the RWE catalog rather than beside it.\n\n**Core conceptual distinction.** The defining contrast is *pragmatic vs explanatory* (Schwartz & Lellouch, 1967), and it is a **continuum, not a binary** — the central thesis of the PRECIS-2 tool, which scores a design from 1 (very explanatory) to 5 (very pragmatic) on **nine domains**: eligibility, recruitment, setting, organization, flexibility of the experimental intervention's delivery, flexibility of adherence, follow-up intensity, primary outcome, and primary analysis. An explanatory trial maximizes internal validity and asks *can this work under ideal conditions* (efficacy); a pragmatic trial maximizes external validity and asks *does this work in routine care* (effectiveness). 
The **estimand differs accordingly**: a PCT targets the **effectiveness of a treatment strategy under usual-care conditions**, analyzed **intention-to-treat in a broadly eligible population**, with non-adherence and treatment switching treated as part of the real-world effect rather than nuisance to be censored away. This is the same effectiveness estimand a target-trial emulation tries to recover from observational data — but the PCT gets it with actual randomization, so confounding (including unmeasured confounding) is handled by design rather than by adjustment, instrumental variables, or sensitivity analysis. What randomization does *not* buy is freedom from the operational biases of routine-data outcome ascertainment, blinding-related performance bias (most PCTs are open-label), or the loss of mechanistic interpretability that tight explanatory control provides.\n\n**Pros, cons, and trade-offs** (specific & comparative).\n- **vs the explanatory RCT:** A PCT trades some internal validity and precision for direct generalizability and far lower cost per patient (routine-data outcomes eliminate dedicated CRFs and study visits). Cost: open-label delivery invites performance and ascertainment bias, broad eligibility plus real-world non-adherence shrinks the effect size toward the null and demands a larger sample, and \"soft\" or clinician-judged outcomes captured in routine data are more vulnerable to detection bias than blinded adjudicated endpoints. **Prefer the PCT** when efficacy is already established and the decision-relevant question is real-world effectiveness, value, or comparative benefit among active options.\n- **vs target-trial emulation (observational):** The PCT randomizes, so it is the gold standard for the *same* effectiveness estimand the emulation approximates; it removes confounding by indication, healthy-user bias, and immortal time by design. 
Cost: it requires prospective consent/randomization infrastructure, takes years, and cannot be done where equipoise or ethics forbid randomization. **Prefer emulation** when randomization is infeasible, too slow, or unethical, and treat the PCT as the benchmark the emulation should reproduce.\n- **vs single-arm trial with an external control:** A PCT has a concurrent randomized comparator, eliminating the comparability threats that dominate external-control studies. Cost: it needs enough eligible patients to randomize a control arm, which external-control designs avoid in ultra-rare diseases. **Prefer external controls** only when a randomized concurrent arm is genuinely impossible.\n- **registry-based / cluster-randomized variants:** Embedding randomization in a registry (TASTE) or randomizing sites/clinics rather than patients (cluster-PCT) lowers burden and enables system-level interventions, at the price of registry-capture gaps, clustering-induced loss of power (the design effect), and identification/recruitment bias when sites — not blinded individuals — know their allocation.\n\n**When to use** (decision rules). 
Use a pragmatic trial when: (1) efficacy is established and the open question is **effectiveness, comparative value, or implementation** under routine care; (2) **equipoise exists** and randomization is ethical and feasible; (3) a suitable **routine-data substrate** (linked EHR, a mature disease registry, or near-complete claims) can capture the primary outcome with acceptable accuracy and timeliness; (4) the decision-maker (payer, guideline body, HTA) explicitly wants real-world effectiveness rather than ideal-condition efficacy; and (5) the intervention can be delivered within usual workflows without a dedicated research apparatus.\n\n**When NOT to use — and when it is actively misleading or dangerous** (decision rules).\n- **Efficacy and mechanism are still unknown.** A pragmatic effectiveness trial run before efficacy is established conflates \"the drug doesn't work\" with \"it wasn't taken correctly\" — the diluted ITT effect under poor adherence can hide a real biological effect and kill a useful therapy, or conversely mask a safety signal.\n- **The primary outcome cannot be captured well from routine data.** If the endpoint requires standardized imaging, central adjudication, or scheduled biomarker draws that routine care does not perform, claiming \"pragmatic\" ascertainment introduces **differential, open-label detection bias** — dangerous when the arm influences how aggressively the outcome is looked for.\n- **Blinding is essential and impossible pragmatically.** For subjective or clinician-determined outcomes with an unblinded intervention, the pragmatic frame can manufacture a spurious effect; this is where pragmatic ascertainment is *actively misleading*, not merely imprecise.\n- **No equipoise / ethics forbid randomization, or the intervention is unsafe to deliver under usual flexibility.** Then a PCT is infeasible and an observational target-trial emulation or external-control design is the honest alternative.\n- **Beware \"pragmatic\" as a label of 
convenience.** A trial that is explanatory on eligibility and analysis but calls itself pragmatic to borrow generalizability claims fails PRECIS-2 scrutiny; reviewers will (correctly) demand the nine-domain wheel.\n\n**Data-source operational depth.** The pragmatic promise lives or dies on the routine-data substrate used for eligibility screening and outcome capture, and each substrate has distinct failure modes.\n- **Claims (FFS or commercial):** Excellent for complete dispensing, procedures, and encounters within an enrolled, fee-for-service window; ideal for randomizing among already-marketed therapies and ascertaining hospitalization/procedure/mortality-proxy outcomes. Failure modes: **Medicare Advantage and capitated person-time silently drop FFS claims**, so a patient who switches to MA mid-trial appears to have zero events — restrict outcome person-time to observable FFS+Part D coverage and treat MA enrollment as administrative censoring, not as an event-free interval. **Adjudication/ascertainment lag** (claims mature over 3–6 months) means interim outcome counts undercount recent events; lock data only after a run-out window. Diagnosis-code outcomes need a validated algorithm with a known PPV.\n- **EHR-embedded (e.g., Salford Lung, ADAPTABLE/PCORnet):** Rich on labs, vitals, problem lists, and clinician notes for broad eligibility screening and severity capture; supports point-of-care randomization. Failure mode: **visit-driven, leaky capture** — a patient who seeks care outside the network is differentially unobserved, and if leaving correlates with the outcome (or with the arm) the effect is biased. 
Salford addressed this by **linking community-pharmacy dispensing and hospital records to its primary-care EHR** so events outside the index practice were still caught; without such linkage, define the observation window explicitly and treat out-of-network loss as potentially informative.\n- **Registry-based randomized trial (e.g., TASTE / SWEDEHEART):** Near-complete enrollment of the target population, adjudicated clinical outcomes, and randomization folded into an existing data-entry workflow at trivial marginal cost. Failure mode: the registry captures **only what it was built to capture** — off-protocol medication switches, outcomes treated at non-participating hospitals, and out-of-country events fall outside it, so concomitant-therapy contamination and some events go unrecorded; link to national dispensing and death registries to close the gap.\n- **Linked claims–EHR–registry–vital records:** The strongest substrate (EHR severity + claims completeness + registry adjudication + reliable mortality), but **linkage selection** (only the linkable subset randomizes/contributes) and **date-discrepancy reconciliation** across order, dispense, and service dates must be resolved before randomization-date and outcome-window assignment.\n\n**Worked claims/EHR example.** Question: among adults with COPD in routine primary care, does once-daily fluticasone furoate–vilanterol reduce moderate-or-severe exacerbations vs continuing usual maintenance therapy, *as delivered in practice* (the Salford Lung effectiveness estimand)? 
(1) **Pragmatic eligibility (broad, few exclusions):** age ≥40, ≥1 recorded COPD diagnosis in the EHR problem list, ≥1 maintenance-inhaler dispensing in the prior 12 months, and continuous, FFS-observable medical+pharmacy coverage for a 365-day baseline window (so prior-therapy and comorbidity history is real, not missing); **do not** exclude for common comorbidity, polypharmacy, or mild-to-moderate renal/hepatic impairment, which an explanatory trial would. (2) **Randomization unit & assignment:** randomize the *patient* 1:1 at the point of care (EHR flag at the index encounter), record `index_date` = randomization date; for a cluster variant, randomize the *practice* and stamp every enrolled patient with the site allocation. (3) **Real-world delivery:** no enforced adherence — the intervention arm is prescribed the study inhaler and then followed exactly as routine patients are (refills via `fill_date` + `days_supply`, switches and discontinuations left to occur). (4) **Outcome ascertainment from routine data:** a moderate exacerbation = an outpatient claim with a COPD-exacerbation diagnosis plus a dispensing of an oral corticosteroid and/or antibiotic within a 14-day window; a severe exacerbation = an inpatient/ED claim with a primary COPD-exacerbation code — both via a validated algorithm with documented PPV. (5) **Person-time rules:** follow from `index_date` to first qualifying exacerbation; **administratively censor at MA enrollment, disenrollment, death, or data-end**, and exclude MA-only spans from the at-risk denominator because their claims are unobservable (an event there would be silently missed). 
(6) **Analysis:** intention-to-treat by randomized arm, counting events under as-prescribed routine care; report the effectiveness rate ratio with a pre-specified run-out window so claims-maturation lag does not undercount late events, and a per-protocol/on-treatment sensitivity analysis with inverse-probability-of-censoring weights to bracket the contamination from real-world switching.",
    "primary_category": "Study_Design",
    "tags": [
      "pragmatic-trial",
      "effectiveness",
      "real-world-setting",
      "randomized-trial",
      "registry-based-rct",
      "ehr-embedded-trial",
      "precis-2",
      "intention-to-treat"
    ],
    "applies_to_study_types": [
      "pragmatic_trial"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1056/NEJMra1510059",
        "url": "https://doi.org/10.1056/NEJMra1510059",
        "citation_text": "Ford I, Norrie J. Pragmatic trials. New England Journal of Medicine. 2016;375(5):454-463.",
        "year": 2016,
        "authors_short": "Ford & Norrie",
        "notes": "Canonical modern operational definition of the pragmatic trial, the efficacy-vs-effectiveness contrast, and the routine-data outcome-ascertainment trade-offs."
      },
      {
        "role": "introduce",
        "doi": "10.1016/0021-9681(67)90041-0",
        "url": "https://doi.org/10.1016/0021-9681(67)90041-0",
        "citation_text": "Schwartz D, Lellouch J. Explanatory and pragmatic attitudes in therapeutical trials. Journal of Chronic Diseases. 1967;20(8):637-648.",
        "year": 1967,
        "authors_short": "Schwartz & Lellouch",
        "notes": "Origin of the explanatory-vs-pragmatic attitude that frames the entire design space; still the foundational reference."
      },
      {
        "role": "explain",
        "doi": "10.1136/bmj.h2147",
        "url": "https://doi.org/10.1136/bmj.h2147",
        "citation_text": "Loudon K, Treweek S, Sullivan F, Donnan P, Thorpe KE, Zwarenstein M. The PRECIS-2 tool: designing trials that are fit for purpose. BMJ. 2015;350:h2147.",
        "year": 2015,
        "authors_short": "Loudon et al.",
        "notes": "The nine-domain PRECIS-2 wheel that operationalizes pragmatism as a continuum; the standard tool reviewers expect for justifying a pragmatic design."
      },
      {
        "role": "explain",
        "doi": "10.1136/bmj.a2390",
        "url": "https://doi.org/10.1136/bmj.a2390",
        "citation_text": "Zwarenstein M, Treweek S, Gagnier JJ, et al. Improving the reporting of pragmatic trials: an extension of the CONSORT statement. BMJ. 2008;337:a2390.",
        "year": 2008,
        "authors_short": "Zwarenstein et al.",
        "notes": "CONSORT extension defining how pragmatic trials should be reported, including setting, eligibility, and real-world delivery elements."
      },
      {
        "role": "demonstrate",
        "doi": "10.1056/NEJMoa1608033",
        "url": "https://doi.org/10.1056/NEJMoa1608033",
        "citation_text": "Vestbo J, Leather D, Diar Bakerly N, et al. Effectiveness of fluticasone furoate-vilanterol for COPD in clinical practice. New England Journal of Medicine. 2016;375(13):1253-1260.",
        "year": 2016,
        "authors_short": "Vestbo et al.",
        "notes": "The Salford Lung Study - an EHR-embedded pragmatic effectiveness trial with linked primary-care, pharmacy, and hospital records; the worked example in this entry is modeled on it."
      }
    ],
    "plain_language_summary": "A pragmatic trial is a randomized experiment designed to answer a simple question: does this treatment actually work for real patients in real clinics? Unlike a tightly controlled lab-style trial that recruits only ideal patients and enforces every dose, a pragmatic trial enrolls the broad, messy population that shows up for care, lets clinicians deliver treatment the way they normally would, and tracks outcomes using records that already exist, such as electronic health records or insurance claims. It keeps the one thing that makes experiments trustworthy, namely random assignment, while deliberately relaxing everything else so the answer reflects what happens in the world, not what happens under perfect conditions.",
    "key_terms": [
      {
        "term": "random assignment",
        "definition": "Each participant is assigned to a treatment or comparison group by chance, like a coin flip, so the groups start out roughly equal on every characteristic, even ones nobody measured."
      },
      {
        "term": "effectiveness",
        "definition": "How well a treatment works under ordinary real-world conditions, as opposed to efficacy, which is how well it works under ideal, tightly controlled conditions."
      },
      {
        "term": "usual care",
        "definition": "The standard treatment a patient would receive at a typical clinic if they were not in a study, including whatever the doctor normally prescribes and however the patient normally takes it."
      },
      {
        "term": "PRECIS-2",
        "definition": "A nine-domain scoring tool that researchers use to rate how pragmatic or explanatory a trial design is, helping them be explicit about design choices rather than just calling a trial pragmatic."
      },
      {
        "term": "intention-to-treat",
        "definition": "An analysis rule that counts every participant in the group they were randomly assigned to, regardless of whether they actually took the treatment as directed, preserving the fairness that randomization created."
      }
    ],
    "worked_example": {
      "scenario": "Imagine a study asking: among adults with a common lung disease called COPD, does a new combination inhaler reduce flare-ups compared to whatever maintenance inhaler patients are already using? The team wants to know what happens in routine primary-care clinics, not in a specialized research center. Below is a side-by-side comparison of how a pragmatic trial and a classic explanatory trial would each be designed to answer that question.",
      "dataset": {
        "caption": "Pragmatic trial vs. explanatory trial across five design dimensions. Each row is a design choice; the contrast shows why the two trials answer different questions.",
        "columns": [
          "Design dimension",
          "Explanatory trial (efficacy)",
          "Pragmatic trial (effectiveness)"
        ],
        "rows": [
          [
            "Population",
            "Narrow: age 40-65, no other lung disease, no other medications, non-smoker for 5+ years",
            "Broad: any adult with a recorded COPD diagnosis, common co-diseases allowed, smokers included"
          ],
          [
            "Setting",
            "Specialist respiratory clinics at academic medical centers",
            "Any primary-care practice that already cares for COPD patients"
          ],
          [
            "How treatment is delivered",
            "Fixed dose, fixed schedule, monitored by study nurse at every visit",
            "Clinician prescribes and adjusts as they see fit; no study-mandated visits"
          ],
          [
            "Comparator",
            "Placebo inhaler (identical-looking dummy device)",
            "Usual maintenance therapy the patient was already taking"
          ],
          [
            "How outcomes are measured",
            "Lung-function tests performed on a schedule by trained technicians",
            "Exacerbation events pulled from routine EHR records and pharmacy dispensing data"
          ]
        ]
      },
      "steps": [
        "Look at the population row: the explanatory trial screens out everyone with complicating factors, so its results apply to a narrow slice of patients. The pragmatic trial enrolls almost anyone with COPD, so its results apply to the full range of people a clinician will actually treat.",
        "Look at the comparator row: placebo tells you whether the drug beats nothing under perfect conditions. Usual care tells you whether the drug beats the real alternative a doctor would otherwise prescribe, which is the question a payer or guideline committee actually needs to answer.",
        "Look at the delivery and outcomes rows together: in a pragmatic trial nobody enforces the dose and outcomes come from routine records rather than scheduled lab visits. This means the measured effect includes real-world non-adherence and the noise of routine data capture, so the effect size is often smaller than in the explanatory trial even if the drug genuinely works.",
        "Now apply the PRECIS-2 lens: each of these five rows corresponds to one or more of its nine design domains (eligibility, setting, flexibility of delivery, flexibility of adherence, comparator, primary outcome). Rate each domain from 1 (very explanatory) to 5 (very pragmatic) and the overall profile tells you how generalizable the trial is.",
        "Both trial types randomize, so both remove the concern that sicker patients ended up in one arm by chance. The pragmatic trial does not sacrifice that protection; it only relaxes the features that would restrict who the answer applies to."
      ],
      "result": "A pragmatic trial answers: does this treatment produce better outcomes than the alternative patients would actually receive, for the real population of patients who have this condition, delivered the way clinicians would actually deliver it? That is an effectiveness answer. An explanatory trial answers a narrower efficacy question: can this treatment outperform placebo in an ideal population under ideal conditions?"
    },
    "prerequisites": [
      "target-trial-emulation",
      "comparative-effectiveness-research-cer-methods"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Registry-based randomized trial (RRCT)",
        "description": "Randomization is embedded inside an existing high-coverage clinical registry, which then supplies eligibility, baseline data, and adjudicated outcomes at low marginal cost (e.g., TASTE within SWEDEHEART).",
        "edge_cases": [
          "Off-protocol medication switches and outcomes treated at non-participating sites fall outside the registry and are missed unless linked to national dispensing/death registries.",
          "Registry data-entry completeness and consent workflow can bias which eligible patients are actually randomized."
        ],
        "data_source_notes": "registry: leverage the registry as both the recruitment frame and the outcome source; link to vital-records and dispensing files to capture out-of-registry events."
      },
      {
        "name": "EHR-embedded pragmatic trial",
        "description": "Eligibility screening, point-of-care randomization, and outcome ascertainment run inside the electronic health record of a care network (e.g., Salford Lung, ADAPTABLE/PCORnet).",
        "edge_cases": [
          "Visit-driven capture means patients who seek care outside the network are differentially unobserved; bias arises if leaving correlates with the arm or the outcome.",
          "Open-label delivery plus clinician-entered outcomes invites detection bias for subjective endpoints."
        ],
        "data_source_notes": "ehr: link community-pharmacy dispensing and hospital records to the primary-care EHR (as Salford did) so out-of-network events are still captured; define observation windows explicitly."
      },
      {
        "name": "Cluster-randomized pragmatic trial",
        "description": "Randomization is at the level of practices, clinics, or geographic units rather than individuals - appropriate for system-level or delivery interventions that cannot be assigned patient-by-patient.",
        "edge_cases": [
          "The design effect inflates required sample size; ignoring intracluster correlation overstates precision.",
          "Identification/recruitment bias arises because unblinded sites know their allocation before enrolling patients."
        ],
        "data_source_notes": "claims/ehr: assign every patient the cluster-level allocation at the site; analyze with mixed models or GEE accounting for the clustering."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Explanatory (efficacy) randomized controlled trial",
        "pros_of_this": "Direct generalizability to routine care, lower per-patient cost via routine-data outcomes, and an effectiveness estimand decision-makers want.",
        "cons_of_this": "Lower internal validity and precision; open-label delivery and routine-data outcomes invite performance/detection bias; diluted ITT effect under real-world non-adherence requires larger samples.",
        "when_to_prefer": "When efficacy is already established and the decision-relevant question is real-world effectiveness, comparative value, or implementation."
      },
      {
        "compared_to": "Target-trial emulation (observational)",
        "pros_of_this": "Actual randomization handles confounding (including unmeasured) by design, removing confounding by indication, healthy-user bias, and immortal time without adjustment.",
        "cons_of_this": "Requires prospective randomization/consent infrastructure, is slow and costly, and is impossible where equipoise or ethics forbid randomization.",
        "when_to_prefer": "Whenever randomization is feasible and ethical; the PCT is the benchmark an emulation should reproduce."
      },
      {
        "compared_to": "Single-arm trial with an external control",
        "pros_of_this": "A concurrent randomized comparator eliminates the comparability and selection threats that dominate external-control studies.",
        "cons_of_this": "Requires enough eligible patients to randomize a control arm, which external-control designs avoid in ultra-rare settings.",
        "when_to_prefer": "Whenever a randomized concurrent control is feasible; reserve external controls for diseases where it is genuinely impossible."
      }
    ],
    "implementation_notes_by_data_source": {},
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\nimport numpy as np\n\nBASELINE_DAYS = 365   # observable-coverage lookback so history is real, not missing\nOCS_ABX_WINDOW = 14   # days: outpatient COPD dx + steroid/antibiotic = moderate exacerbation\nCOPD_DX = {\"J44.0\", \"J44.1\", \"J44.9\"}\nEXAC_OCS_ABX = {\"OCS\", \"ANTIBIOTIC\"}  # drug_class values flagging an exacerbation treatment\n\ndef screen_eligibility(dx, rx, enroll, arms):\n    \"\"\"arms: DataFrame[person_id, index_date (randomization date), arm] from the registry/EHR/RNG.\"\"\"\n    elig = arms.merge(enroll, on=\"person_id\", how=\"left\")\n    # Pragmatic: require only observable FFS coverage spanning the baseline window through randomization.\n    elig[\"covered\"] = (\n        (elig[\"enroll_start\"] <= elig[\"index_date\"] - pd.Timedelta(days=BASELINE_DAYS))\n        & (elig[\"enroll_end\"] >= elig[\"index_date\"])\n        & (~elig[\"ma_only\"])\n    )\n    # Require a recorded COPD diagnosis and >=1 maintenance fill in the baseline window (broad inclusion).\n    base_dx = dx.merge(arms[[\"person_id\", \"index_date\"]], on=\"person_id\")\n    has_copd = base_dx[\n        base_dx[\"dx_code\"].isin(COPD_DX)\n        & (base_dx[\"dx_date\"] <= base_dx[\"index_date\"])\n        & (base_dx[\"dx_date\"] >= base_dx[\"index_date\"] - pd.Timedelta(days=BASELINE_DAYS))\n    ][\"person_id\"].unique()\n    keep = elig.loc[elig[\"covered\"] & elig[\"person_id\"].isin(has_copd),\n                    [\"person_id\", \"index_date\", \"arm\"]]\n    return keep.drop_duplicates(\"person_id\")\n\ndef first_exacerbation(cohort, dx, rx, enroll):\n    \"\"\"First moderate (outpatient COPD dx + OCS/abx within 14d) OR severe (inpatient/ED COPD dx) event.\n    Person-time during MA-only spans is unobservable -> such events are dropped and the span censored upstream.\"\"\"\n    d = dx.merge(cohort[[\"person_id\", \"index_date\"]], on=\"person_id\")\n    d = d[(d[\"dx_code\"].isin(COPD_DX)) & (d[\"dx_date\"] >= 
d[\"index_date\"])]\n\n    severe = d[d[\"setting\"].isin([\"INPATIENT\", \"ED\"])][[\"person_id\", \"dx_date\"]].assign(severity=\"severe\")\n\n    out_copd = d[d[\"setting\"] == \"OUTPATIENT\"][[\"person_id\", \"dx_date\"]]\n    exac_rx = rx[rx[\"drug_class\"].isin(EXAC_OCS_ABX)][[\"person_id\", \"fill_date\"]]\n    m = out_copd.merge(exac_rx, on=\"person_id\")\n    m = m[(m[\"fill_date\"] >= m[\"dx_date\"])\n          & (m[\"fill_date\"] <= m[\"dx_date\"] + pd.Timedelta(days=OCS_ABX_WINDOW))]\n    moderate = m[[\"person_id\", \"dx_date\"]].drop_duplicates().assign(severity=\"moderate\")\n\n    events = pd.concat([severe, moderate], ignore_index=True)\n    first = (events.sort_values([\"person_id\", \"dx_date\"])\n                   .groupby(\"person_id\").first().reset_index()\n                   .rename(columns={\"dx_date\": \"event_date\"}))\n    return cohort.merge(first, on=\"person_id\", how=\"left\")  # event_date NaT => no event in observed time",
        "description": "Pragmatic-trial cohort/eligibility construction and routine-data outcome ascertainment (NOT effect estimation).\nRequired inputs (already cleaned and de-duplicated):\n  dx     : diagnoses      -> person_id, dx_date (datetime), dx_code, setting in {'OUTPATIENT','INPATIENT','ED'}\n  rx     : pharmacy fills -> person_id, fill_date (datetime), drug_class, days_supply\n  enroll : coverage spans -> person_id, enroll_start, enroll_end, ma_only (bool)  # ma_only person-time lacks observable FFS claims\nStep 1 screens BROAD pragmatic eligibility (few exclusions). Step 2 assigns the randomization arm (supplied externally\nby the RNG / registry / EHR flag - randomization is the design, not a data-derived variable). Step 3 ascertains the\nfirst moderate-or-severe COPD exacerbation from routine claims with MA-only person-time excluded from the at-risk window.",
        "dependencies": [
          "pandas",
          "numpy"
        ],
        "source_citations": [
          "vestbo-2016"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\nBASELINE_DAYS  <- 365L\nOCS_ABX_WINDOW <- 14L\nCOPD_DX        <- c(\"J44.0\", \"J44.1\", \"J44.9\")\n\nscreen_eligibility <- function(dx, rx, enroll, arms) {\n  setDT(dx); setDT(enroll); setDT(arms)\n  e <- merge(arms, enroll, by = \"person_id\", all.x = TRUE)\n  # Pragmatic: only require observable FFS coverage across the baseline window through randomization.\n  e[, covered := enroll_start <= index_date - BASELINE_DAYS &\n                 enroll_end   >= index_date & !ma_only]\n\n  bdx <- merge(dx, arms[, .(person_id, index_date)], by = \"person_id\")\n  has_copd <- bdx[dx_code %chin% COPD_DX &\n                  dx_date <= index_date &\n                  dx_date >= index_date - BASELINE_DAYS, unique(person_id)]\n\n  keep <- e[covered == TRUE & person_id %chin% has_copd,\n            .(person_id, index_date, arm)]\n  unique(keep, by = \"person_id\")\n}\n\nfirst_exacerbation <- function(cohort, dx, rx, enroll) {\n  setDT(cohort); setDT(dx); setDT(rx)\n  d <- merge(dx, cohort[, .(person_id, index_date)], by = \"person_id\")\n  d <- d[dx_code %chin% COPD_DX & dx_date >= index_date]\n\n  severe <- d[setting %chin% c(\"INPATIENT\", \"ED\"),\n              .(person_id, event_date = dx_date, severity = \"severe\")]\n\n  out_copd <- d[setting == \"OUTPATIENT\", .(person_id, dx_date)]\n  exac_rx  <- rx[drug_class %chin% c(\"OCS\", \"ANTIBIOTIC\"), .(person_id, fill_date)]\n  m <- merge(out_copd, exac_rx, by = \"person_id\", allow.cartesian = TRUE)\n  m <- m[fill_date >= dx_date & fill_date <= dx_date + OCS_ABX_WINDOW]\n  moderate <- unique(m[, .(person_id, event_date = dx_date)])[, severity := \"moderate\"]\n\n  events <- rbind(severe, moderate)\n  setorder(events, person_id, event_date)\n  first <- events[, .SD[1L], by = person_id]\n  merge(cohort, first, by = \"person_id\", all.x = TRUE)  # event_date NA => no observed event\n}",
        "description": "Pragmatic-trial eligibility screening + routine-data outcome ascertainment with data.table (mirrors the Python version).\nInputs:\n  dx     : person_id, dx_date (Date), dx_code, setting in {'OUTPATIENT','INPATIENT','ED'}\n  rx     : person_id, fill_date (Date), drug_class, days_supply\n  enroll : person_id, enroll_start, enroll_end, ma_only (logical)\n  arms   : person_id, index_date (randomization date), arm   # randomization supplied externally",
        "dependencies": [
          "data.table"
        ],
        "source_citations": [
          "vestbo-2016"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "%let baseline = 365;   /* observable-coverage lookback */\n%let win      = 14;    /* OCS/antibiotic window for a moderate exacerbation */\n\n/* ----- Step 1: broad pragmatic eligibility -------------------------------- */\nproc sql;\n  create table cohort as\n  select a.person_id, a.index_date format=date9., a.arm\n  from work.arms a\n  where /* observable FFS coverage spanning baseline through randomization */\n        exists (select 1 from work.enroll e\n                where e.person_id = a.person_id\n                  and e.ma_only = 0\n                  and e.enroll_start <= a.index_date - &baseline\n                  and e.enroll_end   >= a.index_date)\n    and /* >=1 recorded COPD diagnosis in the baseline window (few other exclusions) */\n        exists (select 1 from work.dx d\n                where d.person_id = a.person_id\n                  and d.dx_code in ('J44.0','J44.1','J44.9')\n                  and d.dx_date <= a.index_date\n                  and d.dx_date >= a.index_date - &baseline);\nquit;\n\n/* ----- Step 2: severe exacerbations (inpatient/ED COPD dx after randomization) --- */\nproc sql;\n  create table severe as\n  select c.person_id, d.dx_date as event_date format=date9., 'severe' as severity length=8\n  from cohort c\n  join work.dx d\n    on d.person_id = c.person_id\n   and d.dx_code in ('J44.0','J44.1','J44.9')\n   and d.setting in ('INPATIENT','ED')\n   and d.dx_date >= c.index_date;\nquit;\n\n/* ----- Step 2b: moderate exacerbations (outpatient COPD dx + OCS/abx within window) --- */\nproc sql;\n  create table moderate as\n  select distinct c.person_id, d.dx_date as event_date format=date9., 'moderate' as severity length=8\n  from cohort c\n  join work.dx d\n    on d.person_id = c.person_id and d.dx_code in ('J44.0','J44.1','J44.9')\n   and d.setting = 'OUTPATIENT' and d.dx_date >= c.index_date\n  join work.rx r\n    on r.person_id = c.person_id and r.drug_class in ('OCS','ANTIBIOTIC')\n   and r.fill_date between 
d.dx_date and d.dx_date + &win;\nquit;\n\n/* ----- Step 3: first qualifying event per person -------------------------- */\nproc sql;\n  create table first_event as\n  select person_id, min(event_date) as event_date format=date9.\n  from (select * from severe union all select * from moderate)\n  group by person_id;\nquit;\n\nproc sql;  /* left join so persons with no observed event keep a missing event_date */\n  create table analysis as\n  select c.*, f.event_date\n  from cohort c left join first_event f on c.person_id = f.person_id;\nquit;",
        "description": "Pragmatic-trial eligibility screening + routine-data outcome ascertainment in SAS (PROC SQL cohort build).\nRequired input datasets (post data-management):\n  work.dx     : person_id, dx_date, dx_code, setting ('OUTPATIENT'/'INPATIENT'/'ED')\n  work.rx     : person_id, fill_date, drug_class, days_supply\n  work.enroll : person_id, enroll_start, enroll_end, ma_only (0/1)\n  work.arms   : person_id, index_date (randomization date), arm   /* randomization supplied externally */\nOutcome = first moderate (outpatient COPD dx + OCS/antibiotic within 14 days) OR severe (inpatient/ED COPD dx) event;\nMA-only person-time is unobservable and must be censored before the at-risk window is finalized downstream.",
        "dependencies": [],
        "source_citations": [
          "vestbo-2016"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Pop[Broad real-world population<br/>few exclusions: comorbidity, age, polypharmacy allowed] --> Elig[Eligibility screened from routine data<br/>EHR problem list / registry / claims]\n  Elig --> Rand{Randomize<br/>patient or cluster/site}\n  Rand -->|Arm A| Usual[Intervention delivered under usual care<br/>no enforced adherence, usual clinicians]\n  Rand -->|Arm B| Comp[Comparator / usual maintenance therapy]\n  Usual --> Out[Outcome ascertained from routine data<br/>claims / EHR / registry / death index]\n  Comp --> Out\n  Out --> ITT[Intention-to-treat effectiveness estimand<br/>effect under usual-care conditions]",
        "caption": "Pragmatic-trial flow. Randomization (the one feature observational RWE cannot buy) is preserved while eligibility, delivery, adherence, and outcome ascertainment are all relaxed to routine-care conditions, yielding an ITT effectiveness estimand.",
        "alt_text": "Flowchart from a broad real-world population through routine-data eligibility screening, randomization of patients or clusters, usual-care delivery in two arms, routine-data outcome ascertainment, and an intention-to-treat effectiveness estimand.",
        "source_type": "illustrative",
        "source_citations": [
          "ford-2016"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  subgraph EXPLANATORY[Explanatory pole - efficacy]\n    E1[Narrow eligibility]\n    E2[Protocolized delivery]\n    E3[Enforced adherence]\n    E4[Intensive scheduled follow-up]\n    E5[Blinded adjudicated outcome]\n  end\n  subgraph PRAGMATIC[Pragmatic pole - effectiveness]\n    P1[Broad eligibility]\n    P2[Usual-care delivery]\n    P3[Real-world adherence]\n    P4[Routine-care follow-up]\n    P5[Routine-data outcome]\n  end\n  E1 -. PRECIS-2 continuum .-> P1\n  E2 -. eligibility/setting/organization .-> P2\n  E3 -. flexibility of adherence .-> P3\n  E4 -. follow-up intensity .-> P4\n  E5 -. primary outcome/analysis .-> P5",
        "caption": "The pragmatic-explanatory continuum (PRECIS-2). Any single trial sits somewhere between the poles on each of nine domains; \"pragmatic\" is a matter of degree, not a binary label.",
        "alt_text": "Diagram contrasting the explanatory pole (narrow eligibility, protocolized delivery, enforced adherence, intensive follow-up, blinded adjudicated outcomes) with the pragmatic pole (broad eligibility, usual-care delivery, real-world adherence, routine follow-up, routine-data outcomes) along the PRECIS-2 continuum.",
        "source_type": "illustrative",
        "source_citations": [
          "loudon-2015"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "alternative_to",
        "target_slug": "target-trial-emulation",
        "notes": "A pragmatic trial randomizes to recover the same usual-care effectiveness estimand that a target-trial emulation approximates from observational data; the PCT is the benchmark the emulation aims to reproduce when randomization is feasible."
      },
      {
        "relation_type": "see_also",
        "target_slug": "registry-trial",
        "notes": "A registry-based randomized trial (e.g., TASTE within SWEDEHEART) is a pragmatic trial whose recruitment, baseline, and outcome capture are embedded in an existing high-coverage registry."
      },
      {
        "relation_type": "used_with",
        "target_slug": "cluster-randomized",
        "notes": "Pragmatic trials of system-level or delivery interventions are commonly cluster-randomized at the practice/site level, with analysis accounting for intracluster correlation."
      },
      {
        "relation_type": "used_with",
        "target_slug": "estimands-ate-att-intercurrent-events-rwe",
        "notes": "The pragmatic estimand treats non-adherence and treatment switching as intercurrent events handled by a treatment-policy (ITT) strategy rather than censored away."
      },
      {
        "relation_type": "see_also",
        "target_slug": "comparative-effectiveness-research-cer-methods",
        "notes": "Pragmatic trials are a core CER design, supplying randomized real-world effectiveness evidence to complement observational comparative-effectiveness analyses."
      },
      {
        "relation_type": "see_also",
        "target_slug": "generalizability-transportability-external-validity-rwe",
        "notes": "The pragmatic design maximizes external validity by construction; transportability methods address residual gaps when the trial population still differs from the target population."
      },
      {
        "relation_type": "see_also",
        "target_slug": "active-comparator-new-user",
        "notes": "When a pragmatic trial is infeasible, an active-comparator new-user observational design is the usual fallback for a comparable head-to-head effectiveness question."
      }
    ],
    "aliases": [
      "pragmatic clinical trial",
      "PCT",
      "pragmatic randomized controlled trial",
      "practical clinical trial",
      "real-world randomized trial"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "prediction-model-validation-recalibration-rwe",
    "name": "Prediction Model Validation and Recalibration in RWE",
    "short_definition": "The evaluation of a diagnostic/prognostic prediction model's discrimination, calibration, and clinical utility in a population separate from development, plus the updating (intercept/slope recalibration through full refit) required when those measures degrade across calendar time, site, coding era, or case mix.",
    "long_description": "A prediction model estimates an individual's **absolute risk** of an outcome (30-day readmission, 1-year mortality,\nincident heart failure) from baseline predictors. Unlike a causal model, it makes no claim about what *changes* the\noutcome; its only job is to assign probabilities that match reality in the population where it will be used. That last\nclause is where real-world deployment fails. A model is never \"validated once forever\": discrimination, calibration, and\nclinical usefulness are three distinct properties, they degrade independently, and a model that looks excellent in its\ndevelopment EHR can be actively dangerous in a different health system, a later calendar era, a new payer mix, or after\ntreatment patterns shift. Validation in RWE is the disciplined, repeated measurement of all three properties in the\n*target* population, and recalibration is the lightweight updating that restores safety without discarding the model.\n\n**Core conceptual distinction.** Three quantities are separable and must never be collapsed into one number.\n(1) *Discrimination* — can the model rank a randomly chosen case above a randomly chosen non-case? Measured by the\nC-statistic (AUC) for binary outcomes, or the time-dependent C / dynamic AUC for survival. Discrimination is a property\nof the *ordering* and is invariant to monotone transformations of risk; it is therefore largely preserved under\nmiscalibration. (2) *Calibration* — do predicted risks equal observed risks? Measured by calibration-in-the-large (mean\npredicted vs mean observed, i.e. the intercept), the calibration slope (1.0 ideal; <1 signals over-fitting / over-extreme\npredictions), and a flexible (loess) calibration curve with the Integrated Calibration Index (ICI), E50, and E90. This is\nthe property that breaks first in transport and the one that makes a model unsafe, because a clinician acts on the\n*number*, not the rank. 
(3) *Clinical utility* — does using the model to make threshold decisions produce more net\nbenefit than treating all or treating none? Measured by **decision curve analysis (DCA)**: net benefit = (TP/n) −\n(FP/n)·[p_t/(1−p_t)] across the range of threshold probabilities a real decision-maker would entertain. A model can have\nhigh AUC, good calibration, and *still* offer no net benefit at the relevant thresholds. The estimand of a validation\nstudy is therefore not \"is the model good?\" but \"in *this* population, at *these* decision thresholds, are predicted\nrisks accurate enough that acting on them beats the default?\" Recalibration has its own estimand hierarchy:\n*calibration-in-the-large* (re-estimate the intercept only, holding the linear predictor fixed) corrects a uniform\nover/under-prediction; *logistic recalibration* (re-estimate intercept **and** slope by regressing the outcome on the\noriginal linear predictor) corrects both level and spread; *model revision/refit* (re-estimate selected or all\ncoefficients, possibly adding predictors) is a new development requiring its own validation. Heavier updating buys better\nfit at the cost of more parameters and more over-fitting risk — Van Houwelingen's closed-loop updating and Steyerberg's\nupdating taxonomy formalize this trade.\n\n**Pros, cons, and trade-offs.**\n- **vs reporting AUC alone (the default in most ML papers):** Full validation (calibration + DCA) is the only evidence\n  that supports *deployment*; AUC alone certifies ranking but says nothing about whether the displayed risk is correct or\n  whether acting on it helps. Cost: more analysis, more data (calibration needs adequate events per risk stratum,\n  ≥100–200 events as a rule of thumb), and a less flattering headline. 
**Always** report calibration and net benefit when\n  a model will drive a decision.\n- **vs full external refit / retraining from scratch:** Recalibration (intercept ± slope) restores safety with one or two\n  parameters, preserves the original predictor structure and any external validity earned during development, and is\n  feasible with the small samples typical of a new site. Cost: it cannot fix predictor effects that genuinely differ\n  across populations (different coding of a comorbidity, a predictor that means something else in the new EHR). **Prefer\n  recalibration** for level/spread drift; escalate to revision only when the calibration *curve shape* is wrong, not just\n  its intercept/slope.\n- **vs causal/treatment-effect models (predictive-and-causal-ml-models-rwe):** Validation here certifies *prediction*\n  (risk = observed frequency), not *intervention* effects; a perfectly calibrated risk model is silent on what to do.\n  Confusing the two — using a prognostic score as if it estimated treatment benefit — is a category error. **Use this\n  framework** for risk stratification, screening, and resource targeting; use causal methods when the question is \"what\n  happens if we treat.\"\n- **vs phenotype algorithm validation (algorithm-validation):** That asks whether a code-based definition captures the\n  true clinical state (PPV, sensitivity, specificity vs a gold standard chart review). This asks whether a *risk score*\n  is accurate and useful. 
They share transportability concerns but use different metrics and answer different questions.\n\n**When to use.** Whenever a prediction model developed elsewhere (or in an earlier period) will inform real-world\ndecisions: deploying a published readmission/mortality/incident-disease model in a new health system; revalidating an\nin-house model after an ICD-9→ICD-10 transition, a telehealth expansion, an EHR vendor migration, or a formulary change;\ncomparing candidate models for a payer's care-management program; or supporting a regulatory/HTA submission that relies on\na risk tool. Use temporal validation (train early years, test later years) to detect drift, geographic/site validation to\ndetect transportability failure, and DCA to decide whether the model earns its place at the chosen thresholds.\n\n**When NOT to use — and when it is actively misleading or dangerous.**\n- **Reporting discrimination without calibration.** A model with AUC 0.80 but mean predicted risk 18% against observed\n  11% systematically over-treats; the high AUC *conceals* the danger. 
Calibration drift with stable discrimination is the\n  single most common — and most missed — deployment failure.\n- **Treating a recalibrated model as a new validated model.** Recalibrating on a sample and then reporting performance on\n  that same sample is optimistic by construction; the update must be evaluated out-of-sample (or by closed-loop\n  cross-validation).\n- **Validating in a population that does not match deployment.** A registry external-validation cohort with adjudicated\n  outcomes and a sicker case mix can make a model look better-calibrated than it will be in the claims population where it\n  will actually run — this is favorable selection masquerading as validation.\n- **Outcome ascertainment that differs between development and validation.** If the validation data capture the outcome\n  more (or less) completely than the development data, observed risk shifts for measurement reasons, and any\n  recalibration \"corrects\" a measurement artifact rather than true drift.\n- **DCA at irrelevant thresholds.** Net benefit computed over a threshold range no clinician would use, or anchored to a\n  cost ratio that does not reflect the real decision, produces a curve that is technically valid and practically\n  meaningless.\n- **Using a prognostic model to guide treatment selection.** Risk ≠ benefit; high-risk patients are not necessarily those\n  who benefit most. Acting on predicted risk as if it were predicted treatment effect is actively harmful.\n\n**Data-source operational depth.**\n- **Claims (FFS vs MA):** Predictors (comorbidities, prior utilization) and the outcome (readmission, incident event) are\n  derived from diagnosis/procedure codes over enrollment windows. 
Require **continuous medical + pharmacy enrollment**\n  across the predictor lookback and the outcome window so that \"no event\" is true absence, not unobserved person-time.\n  Failure mode: **Medicare Advantage encounter data are incomplete and inconsistently submitted relative to FFS claims**,\n  so a model developed on FFS will appear miscalibrated on MA enrollees largely because predictors and outcomes are\n  *captured* differently — exclude MA-only person-time or validate FFS and MA strata separately rather than recalibrating\n  one to the other. Calendar drift: the **ICD-9→ICD-10 transition (Oct 2015)** changes the code base for both predictors\n  and outcomes; a temporal split that straddles it confounds true drift with coding change. Differential **competing\n  risk of death** by predictor profile (elderly, frail) inflates apparent calibration error for non-fatal outcomes if\n  death is treated as censoring rather than a competing event.\n- **EHR:** Predictors come from problem lists, labs, vitals, and notes — richer than claims but **site-, vendor-, and\n  workflow-dependent**. A lab predictor calibrated at a site that orders it routinely will be biased where it is ordered\n  selectively (informative missingness); order-set and documentation changes shift predictor distributions over time.\n  Visit-driven capture means outcomes occurring outside the system are missed, so observed risk is understated for\n  patients who fragment care. Validate by site and by calendar period, and pre-specify how missing predictors are handled\n  at deployment (the imputation/handling used in development must be reproducible at prediction time —\n  see multiple-imputation-longitudinal-rwe).\n- **Registry:** Excellent for adjudicated outcomes, stage/severity, and clinical predictors; a strong *clinical* external\n  validation substrate. 
But its enrolled case mix typically differs from the broad claims/EHR population a model is\n  deployed in, so good registry calibration does not guarantee deployment calibration.\n- **Linked claims–EHR–registry:** The ideal validation substrate (EHR predictors + claims completeness + adjudicated\n  outcomes + death index), but the *linkable* subset is selected, and predictor/outcome dates from different sources must\n  be reconciled before defining the prediction time origin.\n\n**Worked claims example.** A vendor-supplied 30-day all-cause readmission model (developed on a national FFS sample,\n2012–2014) is to be deployed in a regional commercial + Medicare FFS plan. Validation cohort: index acute inpatient\ndischarges 2018–2020 with (1) age ≥18 and a live discharge to a non-acute setting; (2) **continuous medical enrollment\nfor the 365-day predictor lookback before admission and the 30-day outcome window after discharge** (so prior utilization\nand the readmission outcome are fully observable); (3) **exclusion of MA-only person-time**, since MA encounter capture\ndiffers from the FFS basis on which the model was built. Predictors (Elixhauser/CCS comorbidities, prior 12-month\nadmissions and ED visits, index DRG, days_supply-derived polypharmacy) are computed only in the pre-admission lookback;\nthe outcome is any acute readmission with admit_date within 30 days of the index discharge_date, with patients who die\nin-hospital before discharge removed and **out-of-hospital death within 30 days modeled as a competing event** (not\nsilently censored). 
Apply the frozen model coefficients to get each patient's linear predictor and predicted risk, then:\n(a) discrimination — C-statistic with 95% CI, plus by calendar year to expose drift; (b) calibration —\ncalibration-in-the-large, slope (regress the 0/1 outcome on the linear predictor), and a loess calibration curve with ICI,\nE50, E90; (c) utility — DCA net benefit across threshold probabilities of 5%–25% (the band over which a care-management\nteam would enroll patients). Finding: C-statistic holds at 0.68 (vs 0.69 at development) but mean predicted risk is 16.5%\nagainst observed 11.0% and the slope is 0.82 — discrimination intact, calibration unsafe, driven partly by the ICD-10\ncoding-era shift and a healthier commercial mix than the original FFS sample. Action: **logistic recalibration** —\nre-estimate intercept and slope on a 2018–2019 development split, validate the recalibrated model on the held-out 2020\nsplit (calibration-in-the-large ~0, slope ~1, ICI < 0.02), and re-run DCA to confirm the recalibrated model now yields\npositive net benefit at the 10–15% enrollment threshold before it drives any care-management decision.\n\n**Interpreting the output**\n\nIn the external validation of the 30-day readmission model, C-statistic = 0.68, mean predicted risk =\n16.5% vs observed 11.0%, and calibration slope = 0.82 — discrimination preserved, calibration unsafe.\n\n*(1) Formal interpretation.* A C-statistic of 0.68 at external validation (vs 0.69 at development)\nindicates discrimination has transported well — the model ranks high-risk patients above low-risk ones\nin the new population with the same fidelity as at development. 
However, the mean predicted risk of\n16.5% against an observed rate of 11.0% signals systematic over-prediction (calibration-in-the-large\nfailure), and a calibration slope of 0.82 (<1) indicates the model is too extreme: it assigns predicted\nprobabilities that are too high for high-risk patients and too low for low-risk patients relative to\nwhat is actually observed. Discrimination and calibration are orthogonal — it is entirely possible (and\ncommon) for discrimination to be preserved while calibration drifts, especially after population shifts\n(coding-era change, payer mix change).\n\n*(2) Practical interpretation.* Logistic recalibration re-estimates the intercept and slope on a\nrecent development split of the new data, then validates on a held-out split. Post-recalibration\ntargets are calibration-in-the-large ≈ 0, slope ≈ 1, and ICI < 0.02. Crucially, recalibration does\nnot change discrimination (C-statistic remains 0.68) — it corrects the absolute risk scale without\nreordering predictions. Confirm the recalibrated model yields positive net benefit across the\n10–15% decision threshold via decision-curve analysis before deploying it to drive any\ncare-management or resource-allocation decision.",
    "primary_category": "Machine_Learning_and_Predictive",
    "tags": [
      "prediction-model",
      "validation",
      "calibration",
      "recalibration",
      "decision-curve-analysis",
      "net-benefit",
      "external-validation",
      "temporal-validation",
      "model-drift",
      "discrimination"
    ],
    "applies_to_study_types": [
      "diagnostic_accuracy",
      "algorithm_validation",
      "ehr_study",
      "claims_analysis",
      "linked_data"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "linked",
      "registry",
      "multi-database"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1097/EDE.0b013e3181c30fb2",
        "url": "https://doi.org/10.1097/EDE.0b013e3181c30fb2",
        "citation_text": "Steyerberg EW, Vickers AJ, Cook NR, et al. Assessing the performance of prediction models: a framework for traditional and novel measures. Epidemiology. 2010;21(1):128-138.",
        "year": 2010,
        "authors_short": "Steyerberg et al.",
        "notes": "Canonical framework separating discrimination, calibration, overall accuracy (Brier), reclassification, and decision-analytic utility; the reference structure for any validation plan."
      },
      {
        "role": "explain",
        "doi": "10.1186/s12916-019-1466-7",
        "url": "https://doi.org/10.1186/s12916-019-1466-7",
        "citation_text": "Van Calster B, McLernon DJ, van Smeden M, et al. Calibration: the Achilles heel of predictive analytics. BMC Medicine. 2019;17(1):230.",
        "year": 2019,
        "authors_short": "Van Calster et al.",
        "notes": "Defines the calibration hierarchy (mean/weak/moderate/strong), the calibration slope and ICI, and why miscalibration—not poor discrimination—is the dominant cause of unsafe deployment."
      },
      {
        "role": "explain",
        "doi": "10.7326/M14-0697",
        "url": "https://doi.org/10.7326/M14-0697",
        "citation_text": "Collins GS, Reitsma JB, Altman DG, Moons KGM. Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD): the TRIPOD statement. Annals of Internal Medicine. 2015;162(1):55-63.",
        "year": 2015,
        "authors_short": "Collins et al.",
        "notes": "Reporting standard for development, validation, and updating/recalibration of prediction models; specifies what a validation report must contain."
      },
      {
        "role": "demonstrate",
        "doi": "10.7326/M18-1376",
        "url": "https://doi.org/10.7326/M18-1376",
        "citation_text": "Wolff RF, Moons KGM, Riley RD, et al. PROBAST: a tool to assess the risk of bias and applicability of prediction model studies. Annals of Internal Medicine. 2019;170(1):51-58.",
        "year": 2019,
        "authors_short": "Wolff et al.",
        "notes": "Structured risk-of-bias and applicability appraisal for validation studies—directly applicable to judging whether an RWE validation supports deployment."
      },
      {
        "role": "use",
        "doi": "10.1177/0272989X06295361",
        "url": "https://doi.org/10.1177/0272989X06295361",
        "citation_text": "Vickers AJ, Elkin EB. Decision curve analysis: a novel method for evaluating prediction models. Medical Decision Making. 2006;26(6):565-574.",
        "year": 2006,
        "authors_short": "Vickers & Elkin",
        "notes": "Introduces net benefit and decision curves—the clinical-utility metric that AUC and calibration cannot supply."
      },
      {
        "role": "use",
        "doi": "10.1136/bmj.i6",
        "url": "https://doi.org/10.1136/bmj.i6",
        "citation_text": "Vickers AJ, Van Calster B, Steyerberg EW. Net benefit approaches to the evaluation of prediction models, molecular markers, and diagnostic tests. BMJ. 2016;352:i6.",
        "year": 2016,
        "authors_short": "Vickers et al.",
        "notes": "Practical guide to computing, interpreting, and threshold-selecting decision curves for prediction-model evaluation."
      }
    ],
    "plain_language_summary": "A prediction model assigns each patient a number — say, a 22% chance of being readmitted within 30 days — and validation is the process of checking whether those numbers are actually trustworthy in a new group of patients. Two things can go right or wrong independently: the model may rank high-risk patients above low-risk patients correctly (good discrimination), yet still predict 22% when the true rate is only 12% (poor calibration, meaning the numbers themselves are wrong). Recalibration is a lightweight fix that adjusts the model's output so the predicted risks line up with what you actually observe, without discarding the model's ability to rank patients.",
    "key_terms": [
      {
        "term": "discrimination",
        "definition": "How well the model separates patients who will have the outcome from those who will not — measured by the AUC (area under the ROC curve), where 0.5 means no better than a coin flip and 1.0 means perfect separation."
      },
      {
        "term": "calibration",
        "definition": "Whether the predicted risk numbers match the rates you actually observe — a model predicting 20% is well-calibrated if roughly 20 out of every 100 such patients truly have the event."
      },
      {
        "term": "AUC (C-statistic)",
        "definition": "A single number between 0 and 1 summarising discrimination; it equals the probability that a randomly chosen patient who has the event gets a higher predicted risk than a randomly chosen patient who does not."
      },
      {
        "term": "recalibration",
        "definition": "Refitting just the intercept and slope of the model on new data so that predicted probabilities agree with observed rates, while leaving the underlying ranking of patients unchanged."
      },
      {
        "term": "calibration plot",
        "definition": "A chart that groups patients into bins by predicted risk, then compares the average predicted risk in each bin to the fraction who actually had the event — points on the diagonal mean perfect calibration."
      }
    ],
    "worked_example": {
      "scenario": "A vendor supplied a 30-day readmission risk model built on hospital data from 2015-2018. Your health plan wants to use it to flag high-risk patients for a care-management call after discharge. You apply the frozen model to 400 of your own discharges from 2023 and check whether it can still be trusted. You find it ranks patients well but consistently over-predicts risk — everyone looks sicker than they really are. You then recalibrate and recheck.",
      "dataset": {
        "caption": "Validation cohort: 400 patients, predicted risk from the frozen model, grouped into three risk bins for the calibration check. Each row is one risk group, not one patient.",
        "columns": [
          "risk_group",
          "n_patients",
          "avg_predicted_risk",
          "n_observed_events",
          "observed_rate"
        ],
        "rows": [
          [
            "Low  (predicted < 10%)",
            200,
            "5%",
            10,
            "5.0%"
          ],
          [
            "Med  (predicted 10-20%)",
            150,
            "15%",
            12,
            "8.0%"
          ],
          [
            "High (predicted > 20%)",
            50,
            "30%",
            8,
            "16.0%"
          ]
        ]
      },
      "steps": [
        "Tally the full cohort: 200 + 150 + 50 = 400 patients, 10 + 12 + 8 = 30 total readmissions, overall observed rate = 30 / 400 = 7.5%.",
        "Compute the mean predicted risk across the cohort: (200 x 0.05) + (150 x 0.15) + (50 x 0.30) = 10.0 + 22.5 + 15.0 = 47.5 total predicted events; 47.5 / 400 = 11.9% mean predicted risk.",
        "Compare mean predicted (11.9%) to mean observed (7.5%): the model over-predicts by about 4.4 percentage points on average — this is poor calibration-in-the-large.",
        "Check calibration within each group: Low (5% predicted, 5% observed — fine), Med (15% predicted, 8% observed — off by 7pp), High (30% predicted, 16% observed — off by 14pp). The gap widens at higher risk, confirming the slope is also off (predictions are too spread out).",
        "Check discrimination: despite the miscalibration, the AUC is 0.74 — the model still ranks high-risk patients above low-risk patients correctly. Discrimination and calibration have failed independently.",
        "Recalibrate using logistic recalibration (refit the intercept and slope on a development split of the same data, then apply to a held-out test split). The recalibrated model shrinks the high-end predictions toward reality. AUC stays at 0.74 because recalibration only rescales the numbers — it does not change who is ranked above whom.",
        "After recalibration the calibration plot shows Low (5% predicted, 5% observed), Med (8% predicted, 8% observed), High (16% predicted, 16% observed) — predicted and observed now match in each group."
      ],
      "result": "Before recalibration: AUC = 0.74 (good discrimination), but mean predicted risk 11.9% vs mean observed 7.5% — model over-predicts and is unsafe for clinical use. After logistic recalibration: AUC = 0.74 (unchanged), predicted matches observed in each risk group (5%/5%, 8%/8%, 16%/16%), calibration restored. The fix costs nothing in discrimination."
    },
    "prerequisites": [
      "roc-auc-discrimination-rwe",
      "brier-score-calibration-rwe",
      "logistic-regression-for-binary-outcomes"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Temporal (calendar) validation",
        "description": "Develop in earlier calendar periods and validate in later periods to detect drift in predictors, outcome ascertainment, and calibration.",
        "edge_cases": [
          "A split that straddles the ICD-9 to ICD-10 transition confounds true clinical drift with a coding-system change.",
          "Secular shifts in treatment or telehealth alter predictor distributions and apparent baseline risk independent of model quality."
        ],
        "data_source_notes": "claims: align lookback/outcome windows to a single coding era where possible; EHR: account for order-set and documentation changes over the split."
      },
      {
        "name": "Geographic / site (external) validation",
        "description": "Validate in distinct hospitals, plans, geographies, or registries to test transportability of both calibration and discrimination.",
        "edge_cases": [
          "Same EHR vendor does not mean the same population or the same measurement of predictors.",
          "A registry validation cohort with adjudicated outcomes can look better calibrated than the broader claims/EHR deployment population."
        ],
        "data_source_notes": "Report calibration by site; pool only after confirming consistent predictor/outcome definitions across sites."
      },
      {
        "name": "Recalibration-in-the-large (intercept only)",
        "description": "Re-estimate the model intercept holding the linear predictor fixed (offset) to correct uniform over- or under-prediction; one parameter.",
        "edge_cases": [
          "Cannot fix a wrong calibration slope (over-extreme predictions); use only when the slope is acceptable."
        ],
        "data_source_notes": "Use when level drift is the issue (e.g., a healthier commercial mix lowers baseline risk)."
      },
      {
        "name": "Logistic recalibration (intercept + slope)",
        "description": "Regress the binary outcome on the original linear predictor to re-estimate both intercept and slope; two parameters, corrects level and spread.",
        "edge_cases": [
          "Must be evaluated out-of-sample; reporting performance on the recalibration sample is optimistic by construction."
        ],
        "data_source_notes": "The default lightweight update when discrimination is preserved but calibration drifts."
      },
      {
        "name": "Model revision / refit",
        "description": "Re-estimate selected or all coefficients, optionally adding predictors, when the calibration curve shape (not just level/slope) is wrong.",
        "edge_cases": [
          "This is effectively new development and requires its own external validation and event-per-variable budget."
        ]
      },
      {
        "name": "Decision curve / net benefit",
        "description": "Evaluate whether model-guided threshold decisions improve net benefit over treat-all/treat-none across realistic threshold probabilities.",
        "edge_cases": [
          "Thresholds and the implied harm-benefit ratio must reflect the actual decision; otherwise the curve is valid but meaningless."
        ]
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Reporting discrimination (AUC) alone",
        "pros_of_this": "Adds calibration and net benefit—the only evidence that predicted risks are accurate and that acting on them helps; AUC alone certifies ranking only.",
        "cons_of_this": "Requires more events and analysis and yields a less flattering headline; calibration needs adequate events per risk stratum.",
        "when_to_prefer": "Always, when a model will drive a real decision (care management, screening, resource targeting)."
      },
      {
        "compared_to": "Full external refit / retraining from scratch",
        "pros_of_this": "Intercept/slope recalibration restores calibration with one or two parameters, preserves predictor structure, and is feasible in small new-site samples.",
        "cons_of_this": "Cannot fix genuinely different predictor effects or a wrong calibration-curve shape across populations.",
        "when_to_prefer": "Level/spread drift with preserved discrimination; escalate to revision only when the calibration curve shape is wrong."
      },
      {
        "compared_to": "Causal / treatment-effect models (predictive-and-causal-ml-models-rwe)",
        "pros_of_this": "Certifies that predicted risk equals observed frequency—the property a risk score actually needs.",
        "cons_of_this": "Says nothing about intervention effects; risk is not benefit.",
        "when_to_prefer": "Risk stratification, screening, prognosis; not for estimating what a treatment does."
      },
      {
        "compared_to": "Phenotype algorithm validation (algorithm-validation)",
        "pros_of_this": "Uses prediction-model performance metrics (calibration, net benefit) appropriate for a continuous risk score.",
        "cons_of_this": "Does not assess whether a code-based phenotype captures the true clinical state (PPV/sensitivity).",
        "when_to_prefer": "Validating a risk model rather than a case-finding definition."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Derive predictors and outcomes from coded data over fully enrolled windows; require continuous medical (and pharmacy, if drug predictors) enrollment across lookback and outcome windows; exclude MA-only person-time where FFS-equivalent capture is unavailable; align temporal splits to a single coding era and model death as a competing event for non-fatal outcomes.",
      "ehr": "Validate by site and calendar period; account for informative missingness in lab/vital predictors and for outcomes occurring outside the system; ensure deployment-time missing-data handling reproduces the development pipeline.",
      "registry": "Strong clinical external-validation substrate (adjudicated outcomes, severity) but case mix differs from claims/EHR deployment populations; good registry calibration does not guarantee deployment calibration.",
      "linked": "Ideal substrate (EHR predictors + claims completeness + adjudicated outcomes + death index) but subject to linkage selection and cross-source date reconciliation before defining the prediction time origin."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\nimport pandas as pd\nimport statsmodels.api as sm\nfrom statsmodels.nonparametric.smoothers_lowess import lowess\nfrom sklearn.metrics import roc_auc_score, brier_score_loss\n\nEPS = 1e-6\n\ndef linear_predictor(risk: pd.Series) -> np.ndarray:\n    # Recover the model's linear predictor (logit) from the predicted probability.\n    p = np.clip(risk.to_numpy(), EPS, 1 - EPS)\n    return np.log(p / (1 - p))\n\ndef calibration_metrics(y: np.ndarray, p: np.ndarray) -> dict:\n    # Calibration-in-the-large (intercept), calibration slope, and loess-based ICI/E50/E90.\n    p = np.clip(p, EPS, 1 - EPS)\n    lp = np.log(p / (1 - p))\n    # Slope + intercept: regress outcome on the linear predictor (logistic).\n    slope_fit = sm.GLM(y, sm.add_constant(lp), family=sm.families.Binomial()).fit()\n    cal_slope = slope_fit.params[1]\n    # Calibration-in-the-large: intercept with the linear predictor as a fixed offset (slope forced to 1).\n    citl_fit = sm.GLM(y, np.ones_like(y), family=sm.families.Binomial(), offset=lp).fit()\n    citl = citl_fit.params[0]\n    # Flexible calibration: loess of observed on predicted; ICI = mean |loess(p) - p|.\n    order = np.argsort(p)\n    sm_obs = lowess(y[order], p[order], frac=0.6, return_sorted=False)\n    diff = np.abs(sm_obs - p[order])\n    return {\n        \"citl\": float(citl),               # 0 ideal (no systematic over/under-prediction)\n        \"cal_slope\": float(cal_slope),      # 1 ideal (<1 = over-extreme predictions)\n        \"ici\": float(np.mean(diff)),        # mean absolute calibration error\n        \"e50\": float(np.median(diff)),\n        \"e90\": float(np.quantile(diff, 0.90)),\n    }\n\ndef net_benefit(y: np.ndarray, p: np.ndarray, thresholds: np.ndarray) -> pd.DataFrame:\n    # Decision curve: net benefit of the model vs treat-all and treat-none.\n    n = len(y)\n    rows = []\n    prev = y.mean()\n    for t in thresholds:\n        treat = p >= t\n        tp = np.sum(treat & 
(y == 1))\n        fp = np.sum(treat & (y == 0))\n        w = t / (1 - t)\n        nb_model = tp / n - (fp / n) * w\n        nb_all = prev - (1 - prev) * w\n        rows.append({\"threshold\": t, \"nb_model\": nb_model,\n                     \"nb_treat_all\": nb_all, \"nb_treat_none\": 0.0})\n    return pd.DataFrame(rows)\n\ndef recalibrate(dev: pd.DataFrame, mode: str = \"logistic\"):\n    # Fit recalibration on the DEVELOPMENT split. Returns a function mapping old risk -> updated risk.\n    lp = linear_predictor(dev[\"risk\"])\n    y = dev[\"y\"].to_numpy()\n    if mode == \"citl\":  # intercept-only (calibration-in-the-large)\n        fit = sm.GLM(y, np.ones_like(y), family=sm.families.Binomial(), offset=lp).fit()\n        a, b = fit.params[0], 1.0\n    else:               # logistic recalibration (intercept + slope)\n        fit = sm.GLM(y, sm.add_constant(lp), family=sm.families.Binomial()).fit()\n        a, b = fit.params[0], fit.params[1]\n    return lambda risk: 1 / (1 + np.exp(-(a + b * linear_predictor(risk))))\n\n# --- Validate the frozen model on the full validation set ---\nvalid = valid.copy()\ny_all, p_all = valid[\"y\"].to_numpy(), valid[\"risk\"].to_numpy()\nprint(\"AUC \", roc_auc_score(y_all, p_all))\nprint(\"Brier\", brier_score_loss(y_all, p_all))\nprint(\"Calibration\", calibration_metrics(y_all, p_all))\nprint(net_benefit(y_all, p_all, np.arange(0.05, 0.26, 0.05)))\n\n# --- Temporal recalibration: estimate on earlier years, evaluate on the latest year ---\ndev = valid[valid[\"calyear\"] <= valid[\"calyear\"].max() - 1]\ntest = valid[valid[\"calyear\"] == valid[\"calyear\"].max()]\nupdate = recalibrate(dev, mode=\"logistic\")\np_new = update(test[\"risk\"])\nprint(\"Post-recalibration\", calibration_metrics(test[\"y\"].to_numpy(), p_new))",
        "description": "Validation and recalibration of a frozen prediction model on a held-out claims/EHR validation set.\nRequired input (one row per index encounter, predictors already computed from the pre-index lookback):\n  valid : DataFrame with\n    person_id   : member id\n    y           : 0/1 observed outcome in the outcome window (e.g., 30-day readmission)\n    risk        : predicted probability from the FROZEN development model\n    calyear     : calendar year of index (for temporal/drift checks)\nThe frozen model is applied upstream; here we only validate and (if needed) recalibrate. Recalibration is\nestimated on a development split and evaluated on a separate split—never on the same rows.",
        "dependencies": [
          "pandas",
          "numpy",
          "statsmodels",
          "scikit-learn"
        ],
        "source_citations": [
          "steyerberg-2010",
          "van-calster-2019"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(rms)\nlibrary(dcurves)\nlibrary(pROC)\n\neps <- 1e-6\nlp_of <- function(p) qlogis(pmin(pmax(p, eps), 1 - eps))   # linear predictor from risk\n\n## 1. Validate the frozen model (calibration slope/intercept, Brier, C, calibration plot)\nval.prob(valid$risk, valid$y, m = 200, pl = TRUE)          # CITL, slope, Emax, Brier, C\ncat(\"C-statistic:\", auc(valid$y, valid$risk), \"\\n\")\n\n## 2. Decision curve analysis over the decision-relevant threshold band\ndca(y ~ risk, data = valid, thresholds = seq(0.05, 0.25, 0.01))\n\n## 3. Temporal recalibration: fit on earlier years, evaluate on the latest year\nlast      <- max(valid$calyear)\ndev       <- subset(valid, calyear <  last)\ntest      <- subset(valid, calyear == last)\n\n# Logistic recalibration: regress outcome on the original linear predictor (intercept + slope)\nrecal_fit <- lrm(y ~ lp_of(risk), data = dev)\na <- coef(recal_fit)[1]; b <- coef(recal_fit)[2]\ntest$risk_recal <- plogis(a + b * lp_of(test$risk))\n\n# Intercept-only (calibration-in-the-large) alternative via an offset\ncitl_fit  <- glm(y ~ 1, family = binomial,\n                 offset = lp_of(dev$risk), data = dev)\ntest$risk_citl <- plogis(coef(citl_fit)[1] + lp_of(test$risk))\n\n## 4. Re-validate the recalibrated model out-of-sample\nval.prob(test$risk_recal, test$y, pl = TRUE)",
        "description": "Validation and recalibration with rms + dcurves. Input `valid` is a data.frame with the same columns:\n  person_id, y (0/1 outcome), risk (frozen predicted probability), calyear.\nrms::val.prob reports calibration-in-the-large, slope, and discrimination in one call; recalibration is a\nlogistic refit on the original linear predictor (lp), evaluated out-of-sample.",
        "dependencies": [
          "rms",
          "dcurves",
          "pROC"
        ],
        "source_citations": [
          "steyerberg-2010",
          "van-calster-2019"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* Linear predictor (logit) recovered from the frozen predicted probability. */\ndata valid;\n  set work.valid;\n  p = min(max(risk, 1e-6), 1 - 1e-6);\n  lp = log(p / (1 - p));\nrun;\n\n/* 1. Discrimination + calibration of the FROZEN model.\n      Slope=1 calibration test: force the coefficient on lp to 1 via OFFSET; the INTERCEPT is\n      calibration-in-the-large (0 = no systematic over/under-prediction). */\nproc logistic data=valid plots(only)=(calibration roc);\n  model y(event='1') = / offset=lp;          /* CITL: intercept only, slope fixed at 1 */\nrun;\n\n/* Calibration SLOPE: estimate the coefficient on lp (1.0 = ideal; <1 = over-extreme predictions). */\nproc logistic data=valid;\n  model y(event='1') = lp;\n  ods output ParameterEstimates=slope_est;   /* lp estimate = calibration slope; C-stat in Association table */\nrun;\n\n/* 2. Brier score and decision-curve net benefit across the decision-relevant thresholds. */\nproc sql noprint; select count(*), mean(y) into :n, :prev from valid; quit;\ndata dca;\n  do t = 0.05 to 0.25 by 0.05;\n    tp = 0; fp = 0;\n    do until (eof);\n      set valid end=eof;\n      if risk >= t then do;\n        if y = 1 then tp + 1; else fp + 1;\n      end;\n    end;\n    w = t / (1 - t);\n    nb_model    = tp/&n - (fp/&n)*w;\n    nb_treatall = &prev - (1 - &prev)*w;\n    nb_treatnone = 0;\n    output;\n  end;\nrun;\n\n/* 3. Logistic recalibration: fit intercept + slope on earlier years, apply to the latest year. */\nproc sql noprint; select max(calyear) into :last from valid; quit;\nproc logistic data=valid(where=(calyear < &last)) outmodel=recal;\n  model y(event='1') = lp;                    /* intercept + slope recalibration */\nrun;\nproc logistic inmodel=recal;\n  score data=valid(where=(calyear = &last)) out=test_scored;  /* P_1 = recalibrated risk */\nrun;\n\n/* 4. Re-validate the recalibrated model out-of-sample (slope should be ~1, CITL ~0). 
*/\ndata test_scored; set test_scored;\n  pr = min(max(P_1, 1e-6), 1 - 1e-6); lp_new = log(pr/(1-pr));\nrun;\nproc logistic data=test_scored plots(only)=(calibration);\n  model y(event='1') = / offset=lp_new;\nrun;",
        "description": "Validation and recalibration in SAS. Required input dataset (one row per index encounter, predictors already\nscored upstream):\n  work.valid : person_id, y (0/1 outcome), risk (frozen predicted probability), calyear\nPROC LOGISTIC supplies the calibration slope test, an offset-based calibration-in-the-large fit, calibration\nplots, and the C-statistic; a DATA step computes the Brier score and the net-benefit decision curve. Evaluate any\nrecalibration on a calendar-year split that was not used to estimate it.",
        "dependencies": [],
        "source_citations": [
          "steyerberg-2010",
          "van-calster-2019"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Frozen[Frozen development model<br/>coefficients + linear predictor] --> Apply[Apply to target validation cohort<br/>continuous enrollment, single coding era, exclude MA-only]\n  Apply --> Disc[Discrimination<br/>C-statistic / time-dependent C]\n  Apply --> Cal[Calibration<br/>CITL, slope, loess curve, ICI/E50/E90]\n  Apply --> Util[Clinical utility<br/>decision curve / net benefit]\n  Disc --> Judge{Discrimination preserved?}\n  Cal --> Judge2{Calibration adequate?}\n  Judge2 -->|Yes| Deploy[Deploy / monitor over time]\n  Judge2 -->|Level drift only| Citl[Recalibrate intercept<br/>calibration-in-the-large]\n  Judge2 -->|Level + spread drift| Logit[Logistic recalibration<br/>intercept + slope]\n  Judge2 -->|Curve shape wrong| Refit[Model revision / refit<br/>= new development]\n  Citl --> Reval[Re-validate out-of-sample<br/>+ re-run decision curve]\n  Logit --> Reval\n  Refit --> Reval\n  Reval --> Deploy",
        "caption": "Validation separates discrimination, calibration, and utility; the calibration verdict drives the updating choice, and every update is re-validated out-of-sample before deployment.",
        "alt_text": "Flowchart from a frozen model applied to a target cohort, through discrimination, calibration, and net-benefit evaluation, to intercept-only, intercept-plus-slope, or full-refit recalibration, then out-of-sample re-validation and deployment.",
        "source_type": "illustrative",
        "source_citations": [
          "steyerberg-2010",
          "van-calster-2019"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  subgraph \"High AUC, miscalibrated\"\n    A1[AUC 0.80<br/>ranks well] --> A2[Mean predicted 18%<br/>observed 11%] --> A3[Systematic over-treatment<br/>UNSAFE]\n  end\n  subgraph After recalibration\n    B1[AUC 0.80 unchanged<br/>ordering invariant] --> B2[Predicted ~ observed<br/>slope ~1, CITL ~0] --> B3[Positive net benefit<br/>at decision threshold]\n  end\n  A3 -. logistic recalibration .-> B1",
        "caption": "Discrimination is invariant to recalibration; the same high-AUC model is unsafe when miscalibrated and safe after intercept/slope updating restores agreement between predicted and observed risk.",
        "alt_text": "Two-panel diagram contrasting a high-AUC but miscalibrated and over-treating model with the same model after recalibration, where AUC is unchanged but predicted equals observed risk and net benefit is positive.",
        "source_type": "illustrative",
        "source_citations": [
          "van-calster-2019"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "part_of",
        "target_slug": "predictive-and-causal-ml-models-rwe",
        "notes": "Validation, recalibration, and decision-curve utility are the mandatory deployment-readiness layer for any predictive ML model in RWE."
      },
      {
        "relation_type": "see_also",
        "target_slug": "algorithm-validation",
        "notes": "Phenotype/algorithm validation (PPV, sensitivity vs gold standard) and risk-model validation share transportability and measurement concerns but use different metrics and answer different questions."
      },
      {
        "relation_type": "see_also",
        "target_slug": "generalizability-transportability-external-validity-rwe",
        "notes": "External and geographic validation are the empirical test of whether a model's performance transports to the intended target population."
      },
      {
        "relation_type": "requires",
        "target_slug": "multiple-imputation-longitudinal-rwe",
        "notes": "Missing predictors at deployment must be handled with a pipeline reproducible at prediction time; the development imputation must be replicable when scoring new patients."
      }
    ],
    "aliases": [
      "external validation",
      "temporal validation",
      "model recalibration",
      "decision curve analysis",
      "net benefit",
      "calibration drift",
      "calibration-in-the-large"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "predictive-and-causal-ml-models-rwe",
    "name": "Predictive and Causal ML Models",
    "short_definition": "A family of flexible machine-learning algorithms used in RWE either for pure risk prediction (random survival forests, super learner, gradient boosting) or, in causal ML form (double/debiased ML, TMLE, honest causal forests), to estimate a pre-specified causal estimand by plugging ML-estimated nuisance functions (propensity and outcome regressions) into a Neyman-orthogonal or doubly robust estimator.",
    "long_description": "Modern RWE uses machine learning for two jobs that look similar in code but are governed by completely different\nstatistical logic, and conflating them is the single most common error. **Predictive ML** (random survival forests,\nsuper learner ensembles, gradient-boosted trees, penalized regression) optimizes discrimination and calibration for\na target label — 1-year mortality, real-world progression, 30-day readmission — and is judged by AUC, the Integrated\nCalibration Index (ICI), and calibration slope/intercept. **Causal ML** (double/debiased machine learning [DML],\ntargeted maximum likelihood estimation [TMLE], honest causal forests) does *not* optimize prediction of the outcome;\nit targets a pre-specified causal parameter (ATE, ATT, a survival curve under a sustained strategy, or a conditional\naverage treatment effect [CATE]) and uses ML only to estimate the **nuisance functions** — the propensity score\ne(X)=P(A=1|X) and the outcome regressions m_a(X)=E[Y|A=a,X] — which are then combined through an estimator whose bias\nis first-order insensitive to nuisance error.\n\n**Core conceptual distinction**. The estimand, not the algorithm, is the dividing line. A random forest that predicts\nthe outcome with AUC 0.85 tells you nothing about a treatment effect; a poorly discriminating propensity model can\nstill yield a valid ATE if it is well calibrated. Causal ML earns its validity from two structural features. (1)\n*Cross-fitting (sample splitting)*: nuisances are fit on one partition and evaluated on the held-out partition, so\noverfitting in the flexible learner does not contaminate the target estimate — naive \"fit ML on everyone, then\nregress\" is biased. 
(2) *Neyman orthogonality / double robustness*: DML residualizes both treatment and outcome\n(the Robinson partialling-out / AIPW form) so the score has zero derivative with respect to the nuisances at the\ntruth, giving √n-consistent, asymptotically normal effect estimates when each nuisance converges merely faster than\nn^(-1/4); TMLE achieves the same by fluctuating an initial outcome fit along the efficient influence function so the\nplug-in solves the EIF estimating equation and is consistent if *either* nuisance is correct. Predictive ML has no\nsuch guarantee and should never be read as a causal statement (variable importance ≠ causal importance).\n\n**Pros, cons, and trade-offs** (named against the alternatives this catalog carries).\n- **vs Cox proportional-hazards regression** (`cox-ph-regression`): RSF and causal forests relax proportional hazards\n  and linearity, ingest hundreds of claims codes without manual specification, and deliver heterogeneity (CATE) with\n  honest inference. Cost: no single interpretable hazard ratio, lower regulatory familiarity, and calibration/overlap\n  diagnostics that are harder to communicate. Prefer Cox for a simple pre-specified primary HR with credible PH;\n  prefer the ML family when the confounding surface is high-dimensional or non-linear, PH is implausible, or\n  heterogeneity is the question.\n- **vs propensity-score methods / hdPS** (`propensity-score-methods-psm-iptw`, `high-dimensional-propensity-score-hdps-rwe`):\n  DML/TMLE use the same propensity object but pair it with an outcome model and an orthogonal score, so they are\n  *doubly robust* — hdPS+IPTW relies on the propensity alone and breaks if it is misspecified. Cost: more moving parts,\n  more compute, and the need to defend two nuisance models instead of one. 
Prefer plain hdPS+IPTW when the team needs a\n  transparent, single-model weighting workflow; prefer DML/TMLE when you want insurance against propensity\n  misspecification.\n- **vs g-estimation / structural nested models** (`g-estimation-structural-nested-models`): both are semiparametric,\n  but causal ML targets marginal/CATE parameters with mature software and scales to very high-dimensional *baseline*\n  confounding, whereas g-estimation directly parameterizes the blip and is the natural tool when effect modification by\n  *time-varying* factors under treatment-confounder feedback is the scientific target. DML/TMLE for sustained\n  time-varying regimes (LTMLE) exist but are less turnkey.\n- **vs logistic / Poisson regression** (`logistic-regression-for-binary-outcomes`,\n  `poisson-negative-binomial-count-models`): ML nuisances capture interactions and non-linearities those parametric\n  models miss, and causal ML returns effects on the risk-difference or risk-ratio scale; cost is the loss of a clean\n  OR/IRR and a heavier validation burden. Parametric models remain superior for moderate dimensions and pre-specified,\n  easily communicated primary models.\n\n**When to use**. (1) The confounding surface is genuinely high-dimensional or non-linear (thousands of ICD-10/CPT/NDC\nproxies, labs, utilization) and parametric specification is guesswork. (2) You want double robustness — insurance\nagainst misspecifying the propensity *or* the outcome model. (3) The estimand is a marginal ATE/ATT or a survival\ncurve under intervention and you want efficient, EIF-based inference (TMLE/DML). (4) Pre-specified heterogeneity:\nhonest causal forests for CATE with valid confidence intervals. (5) As the nuisance engine inside a target-trial\nemulation (`target-trial-emulation`) or after clone-censor-weight (`clone-censor-weight-per-protocol`).\n\n**When NOT to use — and when it is actively misleading or dangerous**. 
(1) *No cross-fitting* — fitting flexible ML on\nthe full sample and then regressing reintroduces regularization/overfitting bias; the nominal CIs are anticonservative\nand the point estimate is biased toward or away from the null unpredictably. (2) *Positivity/overlap violation* —\nwhen a drug is reserved for sicker or renally impaired patients, fitted propensities pile up near 0/1; IPW-style and\nAIPW estimators then divide by near-zero, variance explodes, and the estimator silently extrapolates into regions with\nno comparator data. Trim, use overlap weights, or report the restricted estimand — do not present a stabilized point\nestimate from a cohort with effective sample size collapsed by a handful of extreme weights. (3) *Uncalibrated\nnuisances* — efficiency and the TMLE plug-in degrade when ML probabilities are miscalibrated; recalibrate (isotonic/\nPlatt) before use. (4) *Using predictive ML as a causal claim* — a high-importance feature in a survival forest is not\na causal driver; reporting it as one is the most dangerous failure because it is invisible to standard fit metrics.\n(5) *Temporal leakage* — if any \"baseline\" feature is constructed from post-index claims, the model predicts the\nfuture from the future and every effect estimate is corrupted regardless of the orthogonality machinery.\n\n**Data-source operational depth**.\n- **Claims (FFS vs Medicare Advantage vs commercial):** the engine of high-dimensional confounding control is the\n  lookback feature matrix — every diagnosis/procedure/pharmacy code in a fixed 6–12 month window before index, as\n  ever/never flags, counts, and recency. Two failure modes dominate. 
First, **MA-only person-time lacks complete\n  encounter capture**: a patient enrolled only in Medicare Advantage may have no fee-for-service claims, so \"absence of\n  a comorbidity code\" is missingness, not absence — the ML learns a spurious low-risk signature for MA enrollees.\n  Restrict to A/B/D (or commercial medical+pharmacy) coverage spanning the full lookback and exclude MA-only spans, or\n  add a coverage indicator and never interpret it causally. Second, **healthy-user proxies** (preventive-service codes,\n  low prior utilization, flu vaccination) are extremely predictive of treatment and partially proxy unmeasured\n  health-seeking behavior (`healthy-user-bias`); they can sharpen confounding control or, if near-deterministic of\n  treatment, push the propensity to the positivity boundary. Pair the final causal estimate with negative-control\n  outcomes (`negative-control-outcomes-rwe`) and an E-value (`e-value-sensitivity-analysis`).\n- **EHR:** richer nuisances (labs, vitals, NLP-derived problems) but irregular, visit-driven measurement creates\n  informative missingness and informative loss to follow-up; encode missingness explicitly (indicator + last-value or a\n  dedicated missingness model) rather than letting the learner impute by omission. Calibration drifts across sites and\n  over calendar time, so validate calibration within site/era, not just pooled.\n- **Registry:** lower dimension but adjudicated, high-quality labels (stage, recurrence date, graded AE) — ideal for\n  *training or validating* claims-based predictive models, less so as the sole confounding substrate. 
Link to claims\n  for complete pharmacy/procedure history and to a death index for censoring.\n- **Linked claims–EHR–vital records:** the ideal substrate (EHR severity + claims completeness + reliable mortality),\n  but linkage selects the linkable subset and introduces order/fill/service date discrepancies that must be reconciled\n  before time-zero assignment — otherwise immortal time (`immortal-time-bias-handling`) is silently baked into the\n  feature window.\n\n**Worked claims example (DML for the ATT of a new oral oncology agent on 1-year mortality).** Question: among adults\nwith metastatic disease initiating a new oral agent (arm A) versus a clinically interchangeable comparator (arm B),\nwhat is the ATT on 1-year all-cause mortality? (1) *Cohort*: new users of A or B with ≥365 days of continuous A/B/D\nFFS enrollment before the first qualifying NDC fill (`fill_date`); exclude MA-only spans so the lookback is observable;\nindex_date = first fill; arm from the NDC dispensed that day. (2) *Feature matrix X*: in [index_date−365, index_date],\nbuild ever/never + counts for every ICD-10, CPT, and drug-class code, plus prior lines of therapy, metastasis codes,\nbiomarker-testing codes (PD-L1/MSI), and visit intensity — strictly no post-index codes (leakage guard). (3) *Outcome\nY*: death within 365 days from a mortality-source hierarchy (`mortality-source-hierarchy-rwe`), with administrative\ncensoring at disenrollment/data-end handled by an IPCW term or a survival-DML variant. (4) *Cross-fit*: 5 folds; on\nthe training folds fit ê(X) (gradient-boosted propensity) and m̂_0, m̂_1 (gradient-boosted outcome); predict on the\nheld-out fold only. (5) *Orthogonal score* (AIPW/EIF for the ATT): on each held-out fold compute the augmented\ninfluence-function contribution and average, weighting toward the treated distribution. 
(6) *Diagnostics*: plot the\nê(X) overlap by arm, report effective sample size and the fraction with ê>0.95 or <0.05, recalibrate nuisances if ICI\nis poor, trim or overlap-weight at the boundary, and report an E-value plus a negative-control-outcome estimate. The\northogonality guarantees √n inference even though the gradient-boosting nuisances converge slower than √n — but only\n*because* steps (4)–(6) were executed; skip cross-fitting or overlap checks and the entire guarantee evaporates.\n\n**Interpreting the output**\n\nIn the worked example, Drug A is preferentially prescribed to sicker patients; a predictive model\ncorrectly learns this pattern and assigns higher outcome risk to Drug A users — but this prediction\nreflects confounding by severity, not a causal drug effect.\n\n*(1) Formal interpretation.* A predictive ML model outputs a risk score — the probability that a\npatient experiences the outcome given their observed features. Variable importance scores rank which\nfeatures most improve predictive accuracy (e.g., Shapley values, permutation importance). These\nquantities have no causal interpretation: if severity-of-illness codes are the strongest predictors,\nthat means severity predicts outcomes well in this dataset, not that severity is the dominant\nmodifiable cause. Using a predictive model's risk score as a treatment-effect estimate — for example,\ncomparing mean predicted scores across treatment arms — yields a biased estimate of the causal effect\nbecause the score absorbs the confounding structure, not just the treatment signal. Doubly robust\ncausal ML (e.g., DML/AIPW) uses cross-fitting to separate nuisance estimation (propensity, outcome)\nfrom effect estimation, providing √n-consistent estimates of the ATT under stated assumptions.\n\n*(2) Practical interpretation.* When the goal is resource targeting (who will have the outcome?),\npredictive ML is appropriate. 
When the goal is intervention evaluation (will treating these patients\nhelp?), a causal estimator is required — prediction alone cannot answer it. Treat variable importance\nrankings as hypothesis-generating signals about predictive utility, not as evidence of causal pathways.\nIf a predictive model is repurposed for policy, document the distinction explicitly and apply a\nrigorous causal framework with overlap diagnostics, effect estimates, and sensitivity analyses.",
    "primary_category": "Machine_Learning_and_Predictive",
    "tags": [
      "machine-learning",
      "random-survival-forests",
      "super-learner",
      "double-ml",
      "tmle",
      "causal-forests",
      "high-dimensional",
      "claims",
      "oncology-rwe",
      "calibration",
      "positivity"
    ],
    "applies_to_study_types": [
      "new_user",
      "active_comparator_new_user",
      "cohort_retrospective",
      "claims_analysis",
      "ehr_study",
      "pragmatic_trial"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "linked",
      "registry"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1214/08-AOAS169",
        "url": "https://doi.org/10.1214/08-AOAS169",
        "citation_text": "Ishwaran H, Kogalur UB, Blackstone EH, Lauer MS. Random survival forests. The Annals of Applied Statistics. 2008;2(3):841-860.",
        "year": 2008,
        "authors_short": "Ishwaran et al.",
        "notes": "Foundational paper introducing random survival forests for right-censored time-to-event data with ensemble cumulative-hazard estimation and variable importance; the canonical flexible predictive-survival learner used in RWE and as an outcome nuisance inside causal ML."
      },
      {
        "role": "explain",
        "doi": "10.1111/ectj.12097",
        "url": "https://doi.org/10.1111/ectj.12097",
        "citation_text": "Chernozhukov V, Chetverikov D, Demirer M, Duflo E, Hansen C, Newey W, Robins J. Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal. 2018;21(1):C1-C68.",
        "year": 2018,
        "authors_short": "Chernozhukov et al.",
        "notes": "Establishes the double/debiased ML framework — Neyman-orthogonal scores plus cross-fitting — that licenses arbitrary ML nuisances while preserving valid √n inference for the ATE/ATT under n^(-1/4) rate conditions."
      },
      {
        "role": "explain",
        "doi": "10.2202/1544-6115.1309",
        "url": "https://doi.org/10.2202/1544-6115.1309",
        "citation_text": "van der Laan MJ, Polley EC, Hubbard AE. Super Learner. Statistical Applications in Genetics and Molecular Biology. 2007;6(1):Article 25.",
        "year": 2007,
        "authors_short": "van der Laan et al.",
        "notes": "Defines the cross-validated ensemble (super learner) that is asymptotically at least as good as the best candidate algorithm; the recommended way to fit the nuisance functions inside TMLE and DML."
      },
      {
        "role": "demonstrate",
        "doi": "10.1093/aje/kww165",
        "url": "https://doi.org/10.1093/aje/kww165",
        "citation_text": "Schuler MS, Rose S. Targeted maximum likelihood estimation for causal inference in observational studies. American Journal of Epidemiology. 2017;185(1):65-73.",
        "year": 2017,
        "authors_short": "Schuler & Rose",
        "notes": "Applied epidemiology tutorial walking TMLE through an observational analysis — initial outcome fit, the targeting step via the efficient influence function, double robustness, and the role of the super learner — with practical guidance directly transferable to claims/EHR RWE."
      },
      {
        "role": "use",
        "doi": "10.1080/01621459.2017.1319839",
        "url": "https://doi.org/10.1080/01621459.2017.1319839",
        "citation_text": "Wager S, Athey S. Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association. 2018;113(523):1228-1242.",
        "year": 2018,
        "authors_short": "Wager & Athey",
        "notes": "Honest causal forests with asymptotically valid, pointwise confidence intervals for conditional average treatment effects; the production basis for the grf package used for CATE estimation in RWE heterogeneity analyses."
      }
    ],
    "plain_language_summary": "Machine learning (ML) algorithms can be used for two very different jobs in health research, and mixing them up is the most common mistake in the field. A predictive ML model asks: who is likely to have a bad outcome? It learns patterns in the data to identify high-risk patients, and being accurate is the only goal. A causal ML model asks a harder question: does this treatment actually cause a better outcome? Answering that requires extra steps to control for the fact that sicker patients tend to get certain drugs, and a model that predicts outcomes well does not automatically tell you anything about what caused them.",
    "key_terms": [
      {
        "term": "prediction vs causation",
        "definition": "Prediction asks who will have an outcome (correlations are fine); causation asks whether a treatment caused the outcome (requires ruling out other explanations)."
      },
      {
        "term": "causal ML",
        "definition": "A family of machine-learning methods (such as double ML and TMLE) that use flexible algorithms only to control for confounding, then apply a special estimator to isolate the treatment effect."
      },
      {
        "term": "confounding",
        "definition": "When a third factor (like disease severity) influences both who gets a treatment and what outcome they have, making it hard to tell whether the treatment itself caused the outcome."
      },
      {
        "term": "ATE (average treatment effect)",
        "definition": "The average difference in outcome between everyone receiving the treatment versus everyone not receiving it, if we could assign treatment randomly."
      },
      {
        "term": "nuisance function",
        "definition": "In causal ML, a model (for the propensity score or the outcome) that is fit only to clean up confounding, not to be interpreted on its own."
      }
    ],
    "worked_example": {
      "scenario": "A research team has claims data on 1,000 patients with metastatic cancer who started either Drug A or Drug B. They want to know two things: (1) Can we predict which patients will die within one year? (2) Does Drug A cause lower one-year mortality than Drug B? These look similar but require completely different analytic strategies.",
      "dataset": {
        "caption": "Simplified summary of what both questions start from: one row per patient with treatment arm, a severity score, and the one-year outcome.",
        "columns": [
          "person_id",
          "drug",
          "severity_score",
          "died_1yr"
        ],
        "rows": [
          [
            "1001",
            "Drug A",
            8,
            1
          ],
          [
            "1002",
            "Drug A",
            7,
            0
          ],
          [
            "1003",
            "Drug B",
            4,
            0
          ],
          [
            "1004",
            "Drug B",
            3,
            0
          ],
          [
            "1005",
            "Drug A",
            9,
            1
          ]
        ]
      },
      "steps": [
        "Notice that Drug A patients tend to have higher severity scores (7-9) while Drug B patients tend to have lower scores (3-4). Sicker patients were steered toward Drug A by their doctors.",
        "A predictive ML model trained on this data will learn that drug=A and high severity both predict death. It will score new patients accurately. That is its only job, and it does it well.",
        "But if you read the model's output as causal evidence, you would wrongly conclude Drug A kills patients. In reality, the sicker patients who received Drug A would likely have had worse outcomes regardless of which drug they received.",
        "A causal ML model (such as double ML) explicitly accounts for the fact that treatment assignment was not random. It fits one model to estimate who was likely to receive Drug A given their severity (the propensity), and a second model to estimate what the outcome would have been under each drug separately. It then combines these through a special estimator that is insensitive to errors in either model.",
        "The result from the causal ML model is an estimate of the treatment effect after removing the influence of severity and other confounders. The predictive ML model cannot produce this estimate, no matter how accurate its predictions are."
      ],
      "result": "Use a predictive ML model when the goal is to score or stratify patients by outcome risk. Use a causal ML model when the goal is to estimate whether a specific treatment causes a difference in outcomes. A high prediction accuracy (AUC) from a predictive model is not evidence of a causal effect and should never be interpreted as one."
    },
    "prerequisites": [
      "propensity-score-methods-psm-iptw",
      "dags-backdoor-criterion-drug-studies",
      "high-dimensional-propensity-score-hdps-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Random survival forests (predictive survival)",
        "description": "Ensemble of survival trees grown on bootstrap samples with random feature subsets; native right-censoring handling, ensemble cumulative hazard, permutation variable importance, and no proportional-hazards assumption. Predictive only — never read variable importance as a causal effect.",
        "edge_cases": [
          "Variable importance is biased toward high-cardinality and correlated features; prefer permutation VIMP and validate rankings, never report them as causal drivers.",
          "Calibration degrades in the tails of predicted risk; evaluate ICI and a calibration curve on held-out data before using RSF predictions as a nuisance."
        ],
        "data_source_notes": "claims: high-dim lookback codes plus longitudinal summaries (counts, recency). Link to a death index/validated endpoint. Benchmark against Cox with splines or time-dependent Cox as a transparency check."
      },
      {
        "name": "Double/debiased ML (DML) for ATE or ATT",
        "description": "Cross-fit ML nuisances (propensity and outcome) then combine via a Neyman-orthogonal / AIPW score. Delivers marginal causal effects with valid inference under n^(-1/4) nuisance rates and is doubly robust.",
        "edge_cases": [
          "Requires K-fold cross-fitting (or sample splitting); a single full-sample fit followed by regression is biased with anticonservative confidence intervals.",
          "Extreme propensities make the AIPW denominator unstable; trim, use overlap weights, or report the restricted-overlap estimand and the effective sample size."
        ],
        "data_source_notes": "claims: ideal when high-dimensional confounding defeats parametric PS or outcome models. Fit on the analysis cohort only after new-user, washout, and time-zero filters; report overlap and calibration of both nuisances."
      },
      {
        "name": "TMLE (targeted maximum likelihood) for binary, continuous, or survival",
        "description": "Initial ML outcome fit fluctuated by a targeting step using the efficient influence function (built from the propensity), yielding a doubly robust, efficient plug-in for the target parameter or a survival curve under intervention.",
        "edge_cases": [
          "The targeting step is unstable under extreme weights; bound the clever covariate or truncate propensities.",
          "Survival/longitudinal targets require LTMLE (ltmle) with explicit time-varying nuisance and censoring models."
        ],
        "data_source_notes": "claims/EHR: strong for marginal effects and intervention-specific survival curves; R (tmle, ltmle), Python (zEpid/tmle implementations). No native SAS TMLE procedure — assemble nuisances with SAS HP procedures and the targeting step by hand."
      },
      {
        "name": "Honest causal forests / grf for heterogeneity (CATE)",
        "description": "Causal trees/forests with honesty (separate subsamples for splitting versus leaf estimation) so leaf effect estimates are unbiased and admit pointwise confidence intervals for conditional average treatment effects.",
        "edge_cases": [
          "Subgroup discovery needs pre-specification or multiplicity control; exploratory CATEs require external replication before any clinical or coverage claim.",
          "Forest CATEs inherit the overlap problem locally — sparse-comparator subgroups produce wide, unreliable intervals."
        ],
        "data_source_notes": "claims/registry: useful for pre-specified heterogeneity (which oncology subgroup benefits most) while controlling type I error; pair with overlap diagnostics within candidate subgroups."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "cox-ph-regression",
        "pros_of_this": "RSF and causal forests relax proportional hazards and linearity, absorb high-dimensional claims features automatically, and provide variable importance and honest heterogeneity (CATE) estimates.",
        "cons_of_this": "No single interpretable hazard ratio; lower (though growing) regulatory/HTA familiarity; calibration and positivity diagnostics are mandatory and harder to communicate.",
        "when_to_prefer": "High-dimensional or non-linear confounding, implausible PH, or primary interest in heterogeneity. Cox stays preferred for a simple, pre-specified primary HR with credible PH."
      },
      {
        "compared_to": "high-dimensional-propensity-score-hdps-rwe",
        "pros_of_this": "DML/TMLE pair the propensity with an outcome model and an orthogonal score, giving double robustness; they survive misspecification of either nuisance, which hdPS+IPTW (propensity-only) does not.",
        "cons_of_this": "More moving parts and compute, and two nuisance models to validate and defend rather than one.",
        "when_to_prefer": "When you want insurance against propensity misspecification or efficient EIF-based inference; plain hdPS+IPTW is fine when a transparent single-model weighting workflow is preferred."
      },
      {
        "compared_to": "g-estimation-structural-nested-models",
        "pros_of_this": "Targets marginal ATE/ATT or CATE with mature software and scales to very high-dimensional baseline confounding while retaining valid inference.",
        "cons_of_this": "For sustained time-varying regimes with treatment-confounder feedback, g-estimation's blip parameterization is more natural and efficient; longitudinal causal ML (LTMLE) is less turnkey.",
        "when_to_prefer": "Marginal/CATE estimands with high-dimensional baseline confounding. Use g-estimation/SNMs when the scientific model is a structural nested blip under time-varying confounding."
      },
      {
        "compared_to": "logistic-regression-for-binary-outcomes",
        "pros_of_this": "ML nuisances capture interactions and non-linearities; causal ML wrappers deliver doubly robust effects on the risk-difference or risk-ratio scale.",
        "cons_of_this": "Loss of a clean odds ratio unless the final stage is kept parametric; heavier validation (calibration, overlap) and compute.",
        "when_to_prefer": "High-dimensional settings or when best-in-class nuisance prediction matters; logistic stays superior for moderate dimensions and pre-specified, easily communicated models."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Build the lookback feature matrix from all diagnosis/procedure/pharmacy codes in a fixed 6-12 month window before index (ever/never, counts, recency); enforce a strict no-future-leakage guard. Restrict to A/B/D or commercial medical+pharmacy coverage spanning the full lookback and exclude MA-only person-time (incomplete encounter capture makes absence of a code indistinguishable from missingness). For causal ML: cross-fit on the analysis cohort only after new-user, washout, and time-zero filters; report feature construction, nuisance calibration, propensity overlap, effective sample size, trimming, and an E-value plus negative-control-outcome estimate.",
      "ehr": "Richer nuisances (labs, vitals, NLP-derived problems) but irregular visit-driven timing creates informative missingness and informative loss to follow-up; encode missingness explicitly and validate calibration within site/era, not just pooled. Link to claims for complete longitudinal pharmacy and procedure capture.",
      "registry": "Lower dimension, adjudicated labels (stage, recurrence date, graded AEs); ideal for training/validating claims-based predictive and causal ML models. Link to claims for fills and to a death index for censoring; hybrid designs (e.g., SEER-Medicare) are common.",
      "linked": "Linked claims-EHR-vital-records is the ideal substrate (severity + completeness + mortality) but introduces linkage selection and order/fill/service date discrepancies that must be reconciled before time-zero assignment to avoid baking immortal time into the feature window."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\nimport pandas as pd\nfrom doubleml import DoubleMLData, DoubleMLIRM\nfrom sklearn.ensemble import HistGradientBoostingClassifier\n\nCOVARS = [c for c in df.columns if c not in (\"person_id\", \"treat\", \"outcome\")]\n\n# Interactive regression model (IRM) targets the ATE with a doubly robust, cross-fit AIPW score.\ndml_data = DoubleMLData(df, y_col=\"outcome\", d_cols=\"treat\", x_cols=COVARS)\n\nml_g = HistGradientBoostingClassifier(max_depth=4, learning_rate=0.05)  # outcome nuisance m_a(X)\nml_m = HistGradientBoostingClassifier(max_depth=4, learning_rate=0.05)  # propensity nuisance e(X)\n\ndml_irm = DoubleMLIRM(dml_data, ml_g=ml_g, ml_m=ml_m,\n                      n_folds=5, n_rep=2, score=\"ATE\", trimming_threshold=0.02)\ndml_irm.fit()\nprint(dml_irm.summary)  # coef = ATE on the risk-difference scale, with cross-fit SE / CI\n\n# Overlap / positivity diagnostics on the cross-fit propensity (do this BEFORE trusting the estimate).\ne_hat = dml_irm.predictions[\"ml_m\"].mean(axis=2).ravel()  # averaged over repetitions\ness = (e_hat.sum() ** 2) / np.square(e_hat).sum()         # effective sample size proxy\nextreme = np.mean((e_hat < 0.05) | (e_hat > 0.95))\nprint(f\"ESS proxy={ess:.0f}  fraction extreme propensity={extreme:.3f}\")",
        "description": "Double/debiased ML for the ATE on a binary outcome with cross-fitting, using the production DoubleML library and\ngradient-boosted nuisances. Required input is a single analysis-ready table (one row per new initiator, built AFTER\nnew-user/washout/time-zero filters):\n  df : person_id (index), treat (1=study, 0=comparator), outcome (0/1 within fixed horizon),\n       plus all baseline covariate columns measured only in [index_date-365, index_date]\n       (e.g. age, sex, dx_* ever/never flags, rx_* counts, util_* utilization features).\nDoubleML handles the K-fold cross-fitting and the orthogonal (AIPW/PLR) score internally; this code wires up the\nlearners, runs overlap diagnostics, and returns the effect with EIF-based inference.",
        "dependencies": [
          "doubleml",
          "scikit-learn",
          "numpy",
          "pandas"
        ],
        "source_citations": [
          "chernozhukov-2018"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(DoubleML); library(mlr3); library(mlr3learners)\n\ncovars <- setdiff(names(dat), c(\"treat\", \"outcome\"))\ndml_data <- DoubleMLData$new(dat, y_col = \"outcome\", d_cols = \"treat\", x_cols = covars)\n\nml_g <- lrn(\"regr.ranger\",   num.trees = 500)  # outcome nuisance\nml_m <- lrn(\"classif.ranger\", num.trees = 500, predict_type = \"prob\")  # propensity nuisance\n\ndml_irm <- DoubleMLIRM$new(dml_data, ml_g = ml_g, ml_m = ml_m,\n                           n_folds = 5, n_rep = 2, score = \"ATE\",\n                           trimming_threshold = 0.02)\ndml_irm$fit()\nprint(dml_irm$summary())  # ATE on risk-difference scale with cross-fit inference\n\n# Secondary: honest causal forest for pre-specified heterogeneity (CATE).\nlibrary(grf)\nX <- as.matrix(dat[, covars]); W <- dat$treat; Y <- dat$outcome\ncf  <- causal_forest(X, Y, W, num.trees = 2000)\nate <- average_treatment_effect(cf, target.sample = \"all\")  # doubly robust AIPW ATE\nprint(ate)\ntest_calibration(cf)  # mean-prediction + differential-prediction calibration test",
        "description": "DML for the ATE with the official DoubleML R package (primary), plus a grf causal forest for CATE (secondary\nvariant). Required input is one analysis-ready data.frame built after design filters:\n  dat : treat (0/1), outcome (0/1 within the fixed horizon), and baseline covariate columns measured only in the\n        pre-index lookback window (no post-index features).\nDoubleML cross-fits the nuisances and applies the orthogonal score; grf provides honest CATE estimates with\npointwise confidence intervals.",
        "dependencies": [
          "DoubleML",
          "mlr3",
          "mlr3learners",
          "ranger",
          "grf"
        ],
        "source_citations": [
          "wager-2018"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "%let covs = age sex dx_chf dx_ckd rx_insulin_cnt util_visits;  /* pre-index baseline covariates only */\n\n/* ---- One cross-fit fold (k=1 held out). Wrap folds 1..5 in a %macro %do loop in production. ---- */\ndata train hold;\n  set work.analytic;\n  if fold = 1 then output hold; else output train;\nrun;\n\n/* Propensity nuisance e_hat(X) = P(treat=1 | X), trained on the OTHER folds, scored on the held-out fold. */\nproc hplogistic data=train;\n  class sex dx_chf dx_ckd;\n  model treat(event='1') = &covs;\n  code file=\"phat.sas\";                 /* exportable scoring code -> applies to held-out rows */\nrun;\ndata hold; set hold; %include \"phat.sas\"; e_hat = P_treat1; run;\n\n/* Outcome nuisances via random forest: m1 = E[Y|A=1,X] and m0 = E[Y|A=0,X], trained on train, scored on hold. */\nproc hpforest data=train(where=(treat=1)) maxtrees=300;\n  target outcome / level=binary;\n  input &covs;\n  save file=\"m1.bin\";\nrun;\nproc hp4score data=hold; score file=\"m1.bin\" out=hold1; run;   /* P_outcome1 -> m1_hat */\nproc hpforest data=train(where=(treat=0)) maxtrees=300;\n  target outcome / level=binary;\n  input &covs;\n  save file=\"m0.bin\";\nrun;\nproc hp4score data=hold1 file=\"m0.bin\" out=hold0;              /* P_outcome1 -> m0_hat */\nrun;\n\n/* AIPW / doubly robust influence-function contribution on the held-out fold (orthogonal score). */\ndata eif;\n  set hold0;\n  e_hat = min(max(e_hat, 0.02), 0.98);                         /* trim at the positivity boundary */\n  psi = (m1_hat - m0_hat)\n      + treat*(outcome - m1_hat)/e_hat\n      - (1-treat)*(outcome - m0_hat)/(1 - e_hat);\nrun;\n\n/* Cross-fit ATE = mean of psi pooled over all held-out folds; SE from the EIF (influence-function) variance. */\nproc means data=eif n mean stderr clm; var psi; run;",
        "description": "SAS has no native DML/TMLE procedure, so a defensible double-ML ATE is assembled from SAS high-performance ML\nprocedures plus a manual cross-fit and AIPW combine. Required input (post data-management, after new-user/washout/\ntime-zero filters):\n  work.analytic : person_id, treat (1/0), outcome (1/0 within horizon), baseline covariates &covs (pre-index only),\n                  and a pre-assigned fold id (1..5) for cross-fitting.\nPROC HPFOREST and PROC HPLOGISTIC require SAS/STAT high-performance procedures (SAS Enterprise Miner or Viya).\nPattern: for each fold, train nuisances on the OTHER folds and score the held-out fold; then combine with the AIPW\ninfluence function. Loop the fold logic in a macro in production; one fold is shown for clarity.",
        "dependencies": [],
        "source_citations": [
          "chernozhukov-2018"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Q[\"Question: causal effect or pure prediction?\"] -->|prediction only| Pred[\"Predictive ML<br/>RSF / super learner / GBM\"]\n  Q -->|causal estimand| Est[\"Pre-specify estimand<br/>ATE / ATT / CATE / survival curve\"]\n  Pred --> Cal[\"Evaluate AUC + ICI / calibration<br/>do NOT read importance as causal\"]\n  Est --> CF[\"Cross-fit K folds: fit nuisances on train, predict on holdout\"]\n  CF --> Nui[\"Propensity e-hat(X) + outcome m-hat_a(X)<br/>via ML / super learner\"]\n  Nui --> Ortho[\"Orthogonal / doubly robust combine<br/>DML AIPW score or TMLE targeting step\"]\n  Ortho --> Diag[\"Overlap + ESS + nuisance calibration<br/>trim / overlap weights at boundary\"]\n  Diag --> Out[\"Effect + EIF-based CI<br/>+ E-value + negative-control outcome\"]",
        "caption": "Decision and pipeline flow distinguishing predictive ML (calibration-judged, never causal) from causal ML (estimand-first, cross-fit nuisances, orthogonal/doubly robust combine, then overlap and sensitivity diagnostics).",
        "alt_text": "Flowchart branching from a question into a predictive-ML path judged by AUC and calibration and a causal-ML path that pre-specifies an estimand, cross-fits nuisances, combines them with an orthogonal or doubly robust score, checks overlap and calibration, and reports an effect with EIF-based confidence intervals and sensitivity analyses.",
        "source_type": "illustrative",
        "source_citations": [
          "chernozhukov-2018"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  Healthy[Unmeasured health-seeking behavior] -->|proxied by| Feat[High-dim claims features<br/>preventive codes, low prior utilization]\n  Healthy --> Tx[Treatment choice A vs B]\n  Healthy --> Y[Outcome risk]\n  Feat --> Tx\n  Feat --> Y\n  Tx --> Y\n  Feat -.near-deterministic of Tx.-> Pos[Positivity boundary<br/>e-hat -> 0 or 1]\n  style Healthy fill:#ffcccc\n  style Pos fill:#fff2cc",
        "caption": "How high-dimensional ML features partially proxy unmeasured healthy-user behavior — improving confounding control, but risking positivity violations when a proxy becomes near-deterministic of treatment.",
        "alt_text": "Directed graph in which unmeasured health-seeking behavior influences treatment choice and outcome and is proxied by high-dimensional claims features; the features also point to treatment and outcome, and a dashed edge warns that a near-deterministic proxy pushes the estimated propensity to the positivity boundary.",
        "source_type": "illustrative",
        "source_citations": [
          "chernozhukov-2018"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "see_also",
        "target_slug": "cox-ph-regression",
        "notes": "RSF and causal forests are direct alternatives or complements to Cox for survival; Cox or RSF can also serve as the outcome nuisance inside DML/TMLE."
      },
      {
        "relation_type": "see_also",
        "target_slug": "high-dimensional-propensity-score-hdps-rwe",
        "notes": "DML/TMLE reuse the propensity object but add an outcome model and an orthogonal score for double robustness; hdPS+IPTW relies on the propensity alone."
      },
      {
        "relation_type": "see_also",
        "target_slug": "g-estimation-structural-nested-models",
        "notes": "Both are semiparametric. Causal ML excels at high-dimensional baseline confounding and marginal/CATE estimands; g-estimation directly targets blip functions under time-varying confounding with feedback."
      },
      {
        "relation_type": "see_also",
        "target_slug": "logistic-regression-for-binary-outcomes",
        "notes": "Logistic (or pooled logistic) is a common parametric nuisance or final stage inside DML/TMLE; ML versions relax linearity and interactions while preserving double robustness."
      },
      {
        "relation_type": "see_also",
        "target_slug": "poisson-negative-binomial-count-models",
        "notes": "Count models can be replaced or augmented by ML count learners as the outcome nuisance inside causal ML for utilization endpoints."
      },
      {
        "relation_type": "see_also",
        "target_slug": "healthy-user-bias",
        "notes": "High-dimensional ML features (preventive services, low prior utilization) proxy healthy-user behavior — improving confounding control but risking positivity violations and collider issues; pair with negative controls and an E-value."
      },
      {
        "relation_type": "see_also",
        "target_slug": "prevalent-user-bias",
        "notes": "ML trained on prevalent users learns survivor characteristics; apply new-user restriction before feature extraction and modeling."
      },
      {
        "relation_type": "see_also",
        "target_slug": "immortal-time-bias-handling",
        "notes": "ML on misaligned time-zero data will happily use immortal time as a feature; strict new-user plus target-trial alignment must precede any feature engineering."
      },
      {
        "relation_type": "used_with",
        "target_slug": "e-value-sensitivity-analysis",
        "notes": "Report an E-value on the final causal estimate to bound the residual unmeasured confounding the ML cannot remove."
      },
      {
        "relation_type": "used_with",
        "target_slug": "negative-control-outcomes-rwe",
        "notes": "Negative-control outcomes detect residual confounding that high-dimensional ML proxies may have left or amplified."
      },
      {
        "relation_type": "part_of",
        "target_slug": "comparative-effectiveness-research-cer-methods",
        "notes": "Causal ML is part of the expanded CER analytic toolkit alongside traditional PS, g-methods, and sensitivity analyses."
      },
      {
        "relation_type": "used_with",
        "target_slug": "target-trial-emulation",
        "notes": "DML/TMLE/causal forests are frequently used as the confounding-control engine within target-trial emulations, especially for high-dimensional baseline confounding or heterogeneous effects."
      },
      {
        "relation_type": "used_with",
        "target_slug": "clone-censor-weight-per-protocol",
        "notes": "Cloned datasets with time-varying weights are natural inputs to longitudinal causal ML estimators for per-protocol or sustained-strategy effects."
      }
    ],
    "aliases": [
      "double machine learning",
      "debiased machine learning",
      "TMLE",
      "targeted learning",
      "causal forests",
      "super learner",
      "random survival forests",
      "high-dimensional causal inference"
    ],
    "complexity": "advanced",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "preference-study",
    "name": "Patient Preference Study (DCE / BWS)",
    "short_definition": "A primary-data, survey-based study design that quantifies how patients (or other stakeholders) trade off attributes of treatments or services by analyzing forced choices among systematically varied hypothetical alternatives, most commonly via a discrete-choice experiment (DCE) or best-worst scaling (BWS).",
    "long_description": "A **patient preference study** measures *stated* preferences: how much respondents value the attributes of a treatment,\nservice, or policy by asking them to choose among carefully constructed hypothetical alternatives. The two dominant elicitation\nformats are the **discrete-choice experiment (DCE)** — respondents repeatedly pick the preferred profile from sets of\nmulti-attribute alternatives — and **best-worst scaling (BWS)** — respondents pick the best and worst item within a set\n(Case 1 = objects, Case 2 = attribute levels within a profile, Case 3 = full profiles). Both rest on Lancaster's characteristics\ntheory of value (utility derives from attributes, not the good itself) and McFadden's random utility model (RUM): the analyst\nrecovers the parameters of a latent utility function from observed choices. The output is a vector of attribute-level utilities\n(part-worths) that, when one attribute is cost or a quantitative health outcome, yields **willingness-to-pay (WTP)** or\n**maximum acceptable risk (MAR)** trade-off metrics on a natural scale.\n\n**Core conceptual distinction.** A preference study estimates a *preference* parameter (a marginal rate of substitution between\nattributes), not an *epidemiologic* parameter (incidence, hazard, risk difference) and not a *clinical* outcome. The estimand\nis the population (or latent-class, or individual-level distribution of) part-worth utilities — equivalently, the trade-off\nweights in respondents' utility function — typically reported as conditional-logit coefficients, marginal WTP, or attribute\nimportance shares. 
Three design choices define the method and are separable: (1) *what is varied* — the attributes and levels,\nchosen to be salient, non-overlapping, and policy-relevant; (2) *how profiles are combined into choice tasks* — the experimental\ndesign (full factorial is almost never feasible; a **D-efficient fractional design** minimizes the determinant of the\nvariance-covariance matrix to maximize statistical efficiency for a fixed number of tasks); and (3) *the choice rule modeled* —\nRUM under IIA (conditional/multinomial logit), relaxed for preference heterogeneity (mixed/random-parameters logit, latent-class\nlogit) or for scale heterogeneity (generalized multinomial logit, G-MNL). DCE recovers a full utility surface and supports WTP;\nBWS Case 1 ranks discrete objects and is cognitively lighter but does not yield WTP unless cost is built in.\n\n**Pros, cons, and trade-offs.**\n- **vs revealed-preference / observational utilization analysis (claims, EHR):** A DCE can value attributes that do not yet\n  exist in the market (a pipeline drug's novel mechanism, a hypothetical mode of administration) and cleanly isolates the\n  marginal value of a single attribute, which revealed choices confound with availability, price, formulary, and access.\n  Cost: stated choices are hypothetical and subject to **hypothetical bias** — what people say they would do can diverge from\n  what they do under real budget constraints and consequences. 
**Prefer a DCE** when the attribute or product is not yet\n  observable in real-world data, or when you need a clean trade-off (e.g., regulatory benefit-risk).\n- **vs standard-gamble / time-trade-off utility elicitation (for QALYs):** TTO/SG anchor health states on the 0–1 (dead–full\n  health) QALY scale and feed cost-utility analysis directly; a DCE values *attributes of care or treatment process*, not just\n  health states, and is more flexible but does not natively produce anchored utilities unless a duration/death attribute and\n  appropriate anchoring are designed in (DCE-TTO hybrids). **Prefer TTO/SG** when the deliverable is a QALY weight; **prefer a\n  DCE** when process, risk, and non-health attributes matter.\n- **vs qualitative interviews / focus groups:** Qualitative work is essential *upstream* to identify and word attributes, but\n  it cannot quantify trade-offs or produce WTP. A DCE quantifies; it cannot tell you *why*. **Use them in sequence**, not as\n  substitutes — qualitative attribute development then quantitative DCE.\n- **vs simple rating/ranking or Likert importance scales:** DCE/BWS force trade-offs and so avoid the lack of discrimination,\n  yea-saying, and scale-use bias that plague direct importance ratings. 
Cost: higher respondent burden and design complexity.\n- **DCE vs BWS:** BWS (especially Case 2) reduces cognitive load, avoids an explicit cost attribute, and is robust when\n  respondents struggle with full profiles; it sacrifices the ability to compute WTP (Case 1/2) and the realism of choosing\n  between whole products.\n\n**When to use.** Quantifying benefit-risk trade-offs for regulatory submissions (e.g., FDA Patient Preference Information for\ndevices, or patient-focused drug development); valuing attributes of a not-yet-marketed product or a novel administration mode;\npopulating value frameworks and MCDA weights; informing shared-decision-making tools, formulary/HTA deliberation, or service\ndesign; estimating WTP or MAR when no defensible revealed-preference data exist. The design is viable when attributes are few\n(commonly 4–6), levels are clearly defined and orthogonalizable, and respondents can plausibly understand hypothetical trade-offs.\n\n**When NOT to use — and when it is actively misleading or dangerous.**\n- **When you need the actual rate of a behavior or a clinical effect.** A DCE gives preference weights, never an incidence,\n  adherence rate, or treatment effect. Reporting \"60% would choose drug A\" from a DCE as a real-world uptake forecast is a\n  category error — predicted choice shares from a DCE are conditional on the experimental attribute set and ignore real access,\n  price negotiation, and physician gatekeeping.\n- **When the attribute list is wrong or incomplete.** If a dominant driver of real choice is omitted (omitted-attribute bias),\n  or attributes overlap/correlate so respondents cannot trade them independently, the estimated part-worths are biased and the\n  WTP is not interpretable. 
Skipping qualitative attribute development is the most common fatal error.\n- **Severe hypothetical bias / non-consequential survey.** When respondents have no stake and the survey is purely\n  hypothetical, magnitudes (especially WTP) can be inflated by multiples; without cheap-talk scripts, consequentiality framing,\n  or external calibration the absolute numbers should not be taken at face value.\n- **Dominated designs and inattentive panels.** If one alternative dominates on every attribute, or respondents straight-line\n  / random-click, the choices carry no preference information; failure to include attention/dominance checks and to model scale\n  heterogeneity yields confidently wrong estimates.\n- **Cognitively overloaded designs.** Too many attributes/levels (>7 attributes, dozens of tasks) push respondents into\n  simplifying heuristics (lexicographic, attribute non-attendance), violating the compensatory RUM the analysis assumes.\n- **Generalizing a convenience-panel sample to a clinical population.** Online opt-in panels over-represent the healthy,\n  literate, and internet-engaged; a preference estimate from such a frame may not transport to the sicker target population\n  that actually faces the decision.\n\n**Data-source operational depth.** Unlike pharmacoepidemiologic designs, a preference study generates its *own* primary data;\nthe \"data source\" is the survey instrument, the sampling frame, and the elicited choice matrix, so the failure modes are\nmeasurement and sampling failures rather than claims/EHR artifacts.\n- **Online opt-in / access panels (the modal source):** Fast and cheap but prone to **sample-frame selection** (panelists are\n  healthier, younger, more literate, and financially motivated), professional-respondent and bot contamination, and\n  **straight-lining / speeding**. 
Workarounds: recruit through patient organizations or clinics to reach the true target\n  population, screen on a verified diagnosis, embed a **dominated-pair (rationality) check** and an instructional manipulation\n  check, drop respondents below a completion-time threshold, and analyze scale heterogeneity (G-MNL) rather than assuming a\n  homogeneous error variance.\n- **Clinic / point-of-care recruitment:** Best for reaching genuinely affected patients and confirming diagnosis/severity from\n  the chart, but slow, costly, and subject to consent-driven selection (sicker or more engaged patients enroll). Pre-register\n  the sampling protocol and report the recruitment funnel.\n- **Pen-and-paper / interviewer-administered:** Reduces drop-out and improves comprehension in low-literacy or elderly\n  populations, but introduces interviewer effects and limits design complexity (no adaptive choice sets). Use simpler BWS or a\n  small blocked DCE.\n- **Linked stated- and revealed-preference data:** The strongest substrate for validating absolute magnitudes — calibrate\n  hypothetical WTP against an observed market transaction or an incentive-compatible task — but linkage is rare, raises consent\n  and privacy issues, and the revealed-preference benchmark carries its own confounding (price/access). 
Use it for calibration,\n  not as the primary frame.\n- **Cross-cutting design failures regardless of mode:** attribute-level imbalance or correlation that makes a parameter\n  inestimable; insufficient choice-task count or sample size (use the Johnson and Orme rule of thumb,\n  n ≥ 500·c / (t·a), with c = max levels of any attribute, t = tasks per respondent, a = alternatives per task, as a *floor*,\n  not a power calculation); ordering/learning effects (randomize task order and block the design); and ignoring panel structure\n  (multiple tasks per respondent are correlated, so cluster standard errors or use a panel mixed logit).\n\n**Worked example (advanced NSCLC second-line therapy DCE).** Decision context: oncologists and patient organizations want to\nknow how patients trade efficacy against toxicity and administration burden for a second-line therapy. (1) *Attribute\ndevelopment:* qualitative interviews (n≈20) and literature reduce candidate attributes to five: median overall survival\n(levels: 6, 9, 12 months), grade 3–4 toxicity risk (10%, 25%, 40%), mode of administration (oral daily, IV every 3 weeks),\nserious immune-related adverse-event risk (1%, 5%, 10%), and monthly out-of-pocket cost ($50, $250, $600). (2) *Design:* a full\nfactorial is 3×3×2×3×3 = 162 profiles; a **D-efficient fractional design** generates 24 paired-profile choice tasks, blocked\ninto two versions of 12 tasks each, with a no-treatment opt-out and one dominated pair inserted as a rationality check. (3)\n*Sample size:* with c = 3, t = 12, a = 2, the Johnson and Orme floor is 500·3/(12·2) = 62.5, or about 63 respondents per block, but to fit a mixed logit\nand latent classes we target n ≈ 300, recruited through thoracic-oncology clinics and a verified-diagnosis panel. 
(4)\n*Estimation:* a conditional (multinomial) logit gives population mean part-worths; a **random-parameters (mixed) logit** with\nthe cost coefficient held fixed and clinical attributes random captures preference heterogeneity and lets us derive the\ndistribution of WTP. (5) *Trade-off metrics:* marginal WTP for a month of survival = (β_survival / |β_cost|) in dollars;\n**maximum acceptable risk** for a 3-month survival gain = (3·β_survival / |β_toxicity|) in percentage points of grade 3–4\ntoxicity risk (β_toxicity is negative, so the absolute value keeps the metric positive, as with cost). (6) *Diagnostics:* drop respondents who\nfail the dominated-pair check or straight-line, test IIA, compare conditional vs mixed logit by AIC/BIC and log-likelihood,\ninspect the sign/significance of part-worths against clinical expectation (monotonic in survival, decreasing in toxicity and\ncost), and report predicted choice shares only as *conditional* simulations, never as real-world uptake.",
    "primary_category": "Study_Design",
    "tags": [
      "discrete-choice-experiment",
      "best-worst-scaling",
      "stated-preference",
      "conjoint-analysis",
      "willingness-to-pay",
      "random-utility-model",
      "patient-preference",
      "benefit-risk"
    ],
    "applies_to_study_types": [
      "preference_study"
    ],
    "data_sources": [
      "primary",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1016/j.jval.2010.11.013",
        "url": "https://doi.org/10.1016/j.jval.2010.11.013",
        "citation_text": "Bridges JFP, Hauber AB, Marshall D, et al. Conjoint analysis applications in health—a checklist: a report of the ISPOR Good Research Practices for Conjoint Analysis Task Force. Value in Health. 2011;14(4):403-413.",
        "year": 2011,
        "authors_short": "Bridges et al.",
        "notes": "The canonical ISPOR good-practice checklist defining the conduct of conjoint-analysis (DCE) studies in health; the standard against which preference studies are appraised."
      },
      {
        "role": "explain",
        "doi": "10.1016/j.jval.2016.04.004",
        "url": "https://doi.org/10.1016/j.jval.2016.04.004",
        "citation_text": "Hauber AB, González JM, Groothuis-Oudshoorn CGM, et al. Statistical methods for the analysis of discrete choice experiments: a report of the ISPOR Conjoint Analysis Good Research Practices Task Force. Value in Health. 2016;19(4):300-315.",
        "year": 2016,
        "authors_short": "Hauber et al.",
        "notes": "Authoritative treatment of conditional, mixed, and latent-class logit estimation and the derivation of WTP and attribute importance from DCE data."
      },
      {
        "role": "demonstrate",
        "doi": "10.1016/j.jval.2012.08.2223",
        "url": "https://doi.org/10.1016/j.jval.2012.08.2223",
        "citation_text": "Reed Johnson F, Lancsar E, Marshall D, et al. Constructing experimental designs for discrete-choice experiments: report of the ISPOR Conjoint Analysis Experimental Design Good Research Practices Task Force. Value in Health. 2013;16(1):3-13.",
        "year": 2013,
        "authors_short": "Reed Johnson et al.",
        "notes": "The reference for building D-efficient fractional-factorial choice designs, blocking, and the design choices that determine statistical efficiency."
      },
      {
        "role": "use",
        "doi": "10.1007/s40273-018-0734-2",
        "url": "https://doi.org/10.1007/s40273-018-0734-2",
        "citation_text": "Soekhai V, de Bekker-Grob EW, Ellis AR, Vass CM. Discrete choice experiments in health economics: past, present and future. PharmacoEconomics. 2019;37(2):201-226.",
        "year": 2019,
        "authors_short": "Soekhai et al.",
        "notes": "Systematic review documenting the routine, large-scale application of DCEs across health economics and the maturing methodological conventions."
      }
    ],
    "plain_language_summary": "A patient preference study asks people to choose between carefully described treatment options that differ on a few key features — like how well a drug works, what side effects it causes, how it is taken, and what it costs — and uses those choices to figure out what patients value most and what trade-offs they are willing to make. The most common approach is a discrete choice experiment (DCE), where a respondent picks their preferred option from a series of paired profiles, and the pattern of choices reveals how much one benefit is worth relative to one risk or one inconvenience. Because the choices are hypothetical rather than real purchases or real prescriptions, a DCE can measure preferences for a drug that does not exist yet or isolate the value of a single feature that real-world use mixes together with price, insurance coverage, and physician habit. One honest caveat: people sometimes say they would accept a higher risk when answering a survey than they actually would in a doctor's office, so absolute willingness-to-pay figures should be treated as estimates, not facts.",
    "key_terms": [
      {
        "term": "discrete choice experiment (DCE)",
        "definition": "A survey method where respondents repeatedly choose their preferred option from sets of two or three treatment profiles that differ on defined features, and the pattern of choices is used to calculate how much each feature is worth to them."
      },
      {
        "term": "attribute",
        "definition": "One specific feature of a treatment that is varied across choice options in a DCE, such as the chance of a side effect, the way the drug is taken, or the monthly out-of-pocket cost."
      },
      {
        "term": "level",
        "definition": "One of the specific values an attribute can take in a choice task — for example, an efficacy attribute might have levels of 40%, 55%, and 70% response rate."
      },
      {
        "term": "part-worth utility",
        "definition": "The estimated value a respondent places on one level of one attribute, calculated from the pattern of their choices; higher part-worth means that level is strongly preferred."
      },
      {
        "term": "willingness to accept (WTA)",
        "definition": "The minimum gain in a desirable attribute (such as higher efficacy) that a respondent requires before they will accept a worse level of another attribute (such as a higher rate of side effects)."
      },
      {
        "term": "hypothetical bias",
        "definition": "The tendency for people to express stronger preferences or higher willingness to pay in a survey than they would in a real decision with actual consequences."
      }
    ],
    "worked_example": {
      "scenario": "An HEOR analyst wants to know how patients with moderate-to-severe plaque psoriasis trade off efficacy, side-effect risk, dosing convenience, and monthly cost when choosing between two biologic treatment profiles. The analyst designs a simple DCE with four attributes and presents one example choice task to illustrate how the data are read.",
      "dataset": {
        "caption": "One DCE choice task — a respondent sees exactly this table and circles Profile A or Profile B.",
        "columns": [
          "Attribute",
          "Profile A",
          "Profile B"
        ],
        "rows": [
          [
            "Skin clearance at 16 weeks (PASI 90 response rate)",
            "55%",
            "75%"
          ],
          [
            "Serious infection risk per year",
            "3%",
            "6%"
          ],
          [
            "Dosing",
            "Monthly injection",
            "Weekly pill"
          ],
          [
            "Monthly out-of-pocket cost",
            "$50",
            "$200"
          ]
        ]
      },
      "steps": [
        "Each profile is built by combining one level from each attribute — Profile A offers moderate efficacy (55%) with low infection risk (3%), a monthly injection, and low cost ($50); Profile B offers higher efficacy (75%) with higher infection risk (6%), a weekly pill, and higher cost ($200).",
        "The analyst presents 12 such tasks to each respondent, each task pairing two profiles that differ across all four attributes in a planned pattern so every trade-off can be estimated independently.",
        "After collecting responses from 300 patients, the analyst fits a choice model and finds that going from 55% to 75% clearance produces a part-worth of +1.4 utility units, while going from 3% to 6% infection risk produces a part-worth of -0.9 utility units.",
        "Dividing those two numbers gives the willingness-to-accept ratio: patients require a 22-percentage-point efficacy gain (1.4 / 0.9 x 14 pp) — roughly the full observed 20 pp gap — before they are willing to accept the doubling of infection risk.",
        "The analyst also divides the efficacy part-worth by the cost part-worth to get willingness-to-pay: each additional percentage point of clearance is worth about $6.50/month in out-of-pocket cost to the average respondent."
      ],
      "result": "Patients value efficacy gains roughly 1.6 times more than they penalize an equivalent increase in infection risk; a jump from 55% to 75% clearance is worth accepting a 3-percentage-point higher annual infection risk, but only barely — and only if cost does not also rise. Monthly out-of-pocket cost is the strongest deterrent: patients would need a 20-percentage-point efficacy improvement to justify paying an extra $130/month."
    },
    "prerequisites": [
      "qualitative-interview",
      "cross-sectional",
      "pro-development"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Discrete-choice experiment (DCE) with D-efficient design",
        "description": "Respondents repeatedly choose the preferred multi-attribute profile from choice sets generated by a D-efficient fractional-factorial design; recovers a full utility surface and supports WTP and MAR when a cost or risk attribute is included.",
        "edge_cases": [
          "Dominant or correlated attributes prevent independent trade-offs and bias part-worths; verify attribute orthogonality in the design's information matrix.",
          "Too many attributes/levels trigger attribute non-attendance and simplifying heuristics that violate the compensatory RUM."
        ],
        "data_source_notes": "primary: blocked, randomized task order; insert a dominated-pair rationality check and an opt-out where a no-treatment option is realistic."
      },
      {
        "name": "Best-worst scaling (BWS)",
        "description": "Respondents identify the best and worst item in each set. Case 1 ranks discrete objects, Case 2 ranks attribute levels within a profile, Case 3 compares full profiles; lower cognitive load than DCE but Case 1/2 do not natively yield WTP.",
        "edge_cases": [
          "Case selection changes the estimand — object importance (Case 1) is not the same quantity as attribute-level utility (Case 2); choose to match the decision question.",
          "Position/order effects are stronger in BWS; balance item positions across sets."
        ],
        "data_source_notes": "primary: BWS Case 2 is preferred for low-literacy or elderly populations where full-profile DCE tasks overload respondents."
      },
      {
        "name": "Mixed (random-parameters) / latent-class logit for heterogeneity",
        "description": "Relaxes the homogeneous-preference and IIA assumptions of conditional logit by letting part-worths vary across respondents (continuous mixing) or across discrete latent segments, and by modeling the within-respondent panel correlation of repeated choices.",
        "edge_cases": [
          "Confounding of preference heterogeneity with scale (error-variance) heterogeneity inflates apparent dispersion; consider generalized multinomial logit (G-MNL) to separate them.",
          "Holding the cost coefficient fixed stabilizes the WTP distribution but imposes homogeneous price sensitivity."
        ],
        "data_source_notes": "primary: account for the panel structure (multiple tasks per respondent) via the mixing distribution or clustered standard errors."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Revealed-preference / observational utilization analysis (claims, EHR)",
        "pros_of_this": "Can value not-yet-marketed products and novel attributes, and isolates the marginal value of a single attribute free of price, formulary, and access confounding.",
        "cons_of_this": "Choices are hypothetical and subject to hypothetical bias; absolute WTP magnitudes may be inflated without calibration or consequentiality framing.",
        "when_to_prefer": "When the attribute or product is not observable in real-world data, or a clean isolated trade-off is needed (e.g., regulatory benefit-risk)."
      },
      {
        "compared_to": "Time-trade-off / standard-gamble utility elicitation for QALYs",
        "pros_of_this": "Flexibly values process, risk, and non-health attributes, not just health states; can be designed as a DCE-TTO hybrid to recover anchored utilities.",
        "cons_of_this": "Does not natively produce QALY-anchored (0–1) utilities unless duration/death anchoring is built into the design.",
        "when_to_prefer": "When process, mode of administration, and risk attributes matter, or no defensible market data exist."
      },
      {
        "compared_to": "Qualitative interviews / focus groups",
        "pros_of_this": "Quantifies trade-offs and yields WTP/MAR rather than themes; forces respondents to make compensatory choices.",
        "cons_of_this": "Cannot explain motivations and depends entirely on a valid upstream attribute set developed qualitatively.",
        "when_to_prefer": "After qualitative attribute development, when the deliverable is a numeric preference weight or trade-off."
      },
      {
        "compared_to": "Direct rating / ranking / Likert importance scales",
        "pros_of_this": "Forced trade-offs avoid lack of discrimination, yea-saying, and scale-use bias inherent to direct importance ratings.",
        "cons_of_this": "Higher respondent burden and design complexity; requires an experimental design and choice modeling.",
        "when_to_prefer": "Whenever a defensible, trade-off-based importance metric is required rather than self-reported ratings."
      }
    ],
    "implementation_notes_by_data_source": {},
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\nimport pandas as pd\nfrom xlogit import MultinomialLogit, MixedLogit\n\n# choices: long format, one row per alternative; `choice` is the 0/1 chosen flag within each (respondent_id, task_id).\nchoices = pd.read_parquet(\"dce_long.parquet\")\nchoices[\"alt_key\"] = choices[\"respondent_id\"].astype(str) + \"_\" + choices[\"task_id\"].astype(str)\nvarnames = [\"survival\", \"toxicity\", \"oral\", \"cost\"]   # design attributes (cost in $100s for stable scaling)\n\n# (1) Conditional (multinomial) logit: population-mean part-worths under RUM/IIA.\ncl = MultinomialLogit()\ncl.fit(X=choices[varnames], y=choices[\"choice\"], varnames=varnames,\n       alts=choices[\"alt_id\"], ids=choices[\"alt_key\"])\ncl.summary()\n\n# Marginal willingness-to-pay = -beta_attr / beta_cost  (cost coded so a higher level lowers utility).\nb = dict(zip(varnames, cl.coeff_))\nwtp_per_survival_month = -b[\"survival\"] / b[\"cost\"] * 100.0   # undo the $100 scaling\nprint(f\"WTP per extra month of survival: ${wtp_per_survival_month:,.0f}\")\n\n# (2) Mixed (random-parameters) logit: clinical attrs random, cost fixed; panels capture within-respondent correlation.\nml = MixedLogit()\nml.fit(X=choices[varnames], y=choices[\"choice\"], varnames=varnames,\n       alts=choices[\"alt_id\"], ids=choices[\"alt_key\"],\n       panels=choices[\"respondent_id\"],\n       randvars={\"survival\": \"n\", \"toxicity\": \"n\", \"oral\": \"n\"},  # normal mixing; cost stays fixed\n       n_draws=500)\nml.summary()   # mean and SD of each random part-worth quantify preference heterogeneity",
        "description": "Two-stage DCE workflow in Python: (1) inspect a D-efficient design and (2) estimate conditional and mixed logit.\nRequired long-format choice table (one row per alternative per task per respondent), already assembled from the\nfielded survey:\n  choices : respondent_id (int), task_id (int), alt_id (int), choice (0/1 indicator of the chosen alternative),\n            survival (months), toxicity (risk %), oral (1=oral,0=IV), cost (monthly OOP $)\nAttribute columns are the (effects- or dummy-coded) design columns merged onto the respondent's answered tasks.\nxlogit handles the panel structure (repeated tasks per respondent) in the mixed logit via the panels argument.",
        "dependencies": [
          "pandas",
          "numpy",
          "xlogit"
        ],
        "source_citations": [
          "hauber-2016"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(idefix)\nlibrary(mlogit)\n\n## (1) D-efficient design: 5 attributes, levels per the NSCLC example; 24 tasks blocked into 2 sets of 12, 2 alts.\nlvls   <- c(3, 3, 2, 3, 3)                       # survival, toxicity, admin, ir-AE risk, cost\ncoding <- c(\"E\", \"E\", \"E\", \"E\", \"E\")             # effects coding\nset.seed(42)\ndes <- Modfed(cand.set = Profiles(lvls = lvls, coding = coding),\n              n.sets = 24, n.alts = 2, alt.cte = c(0, 0),\n              par.draws = rep(0, sum(lvls - 1)))  # zero priors -> D-optimal (use informative priors if available)\n# des$design is the efficient choice design; block and field it, then attach respondents' choices.\n\n## (2) Conditional logit on the fielded long data.\nidx <- dfidx(dce, idx = list(c(\"task_id\", \"respondent_id\"), \"alt\"), choice = \"choice\")\ncl  <- mlogit(choice ~ survival + toxicity + oral + cost | 0, data = idx)\nsummary(cl)\nwtp_survival <- -coef(cl)[\"survival\"] / coef(cl)[\"cost\"]   # marginal WTP per survival month\ncat(sprintf(\"WTP per survival month: %.0f\\n\", wtp_survival))\n\n## (3) Mixed logit (panel) for preference heterogeneity: clinical attrs random-normal, cost fixed.\nml <- mlogit(choice ~ survival + toxicity + oral + cost | 0, data = idx,\n             rpar = c(survival = \"n\", toxicity = \"n\", oral = \"n\"),\n             R = 500, panel = TRUE, correlation = FALSE)\nsummary(ml)   # estimated SDs flag attributes with heterogeneous preferences",
        "description": "Two-stage DCE workflow in R: design generation with idefix and estimation with mlogit/mixl.\nEstimation input `dce` is a long data.frame: respondent_id, task_id (unique choice-situation id), alt (1..A),\nchoice (logical/0-1), survival, toxicity, oral, cost. dfidx indexes the nested choice structure for mlogit.",
        "dependencies": [
          "idefix",
          "mlogit",
          "mixl"
        ],
        "source_citations": [
          "reed-johnson-2013",
          "hauber-2016"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* (1) Build a D-efficient choice design: 5 attributes, then assess efficiency. */\n%mktex(3 3 2 3 3, n=24, seed=42, out=cand);     /* candidate fractional-factorial design */\n%choiceff(data=cand, model=class(x1 x2 x3 x4 x5 / sta),\n          nsets=24, nalts=2, seed=42, out=design);   /* maximize D-efficiency for 24 sets x 2 alts */\n\n/* (2) Conditional (multinomial) logit on the fielded long data: population-mean part-worths under RUM. */\nproc mdc data=work.dce;\n  id respondent_id;                              /* clusters repeated tasks within respondent */\n  model choice = survival toxicity oral cost /\n        type=clogit nchoice=2;                   /* 2 alternatives per choice set */\nrun;\n/* Marginal WTP = -(beta_survival)/(beta_cost), computed from the parameter estimates above. */\n\n/* (3) Mixed (random-parameters) logit: clinical attributes random-normal, cost fixed; Halton draws. */\nproc mdc data=work.dce type=mixedlogit nmixed=500;\n  id respondent_id;\n  model choice = survival toxicity oral cost /\n        type=mixedlogit nchoice=2;\n  random survival toxicity oral / type=normal;   /* heterogeneous part-worths; cost left fixed */\nrun;",
        "description": "Two-stage DCE workflow in SAS using the SAS/STAT market-research macros and PROC MDC.\nEstimation input WORK.DCE is long: respondent_id, task_id, alt, choice (1=chosen/0=not), and the design attribute\ncolumns (survival toxicity oral cost). PROC MDC fits conditional and mixed (random-parameters) logit for choice data;\n%MktEx / %ChoicEff (autocall macros, SAS/STAT) build and evaluate the efficient design.",
        "dependencies": [],
        "source_citations": [
          "reed-johnson-2013",
          "hauber-2016"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Q[Decision question + target population] --> Attr[Qualitative attribute development<br/>interviews / literature -> 4-6 attributes, defined levels]\n  Attr --> Des[D-efficient experimental design<br/>fractional factorial, blocking, opt-out, dominated-pair check]\n  Des --> Field[Field the survey to the target frame<br/>screen diagnosis, randomize task order]\n  Field --> Clean[Data quality screen<br/>drop straight-liners, speeders, rationality-check failures]\n  Clean --> Est[Estimate choice model<br/>conditional logit -> mixed / latent-class logit]\n  Est --> Out[Part-worths, attribute importance,<br/>WTP / maximum acceptable risk]\n  Out --> Use[Benefit-risk, value frameworks, SDM tools<br/>report predicted shares ONLY as conditional simulations]",
        "caption": "End-to-end workflow of a patient preference study, from qualitative attribute development through D-efficient design, fielding, data-quality screening, choice modeling, and trade-off metrics.",
        "alt_text": "Flowchart from decision question and attribute development through experimental design, survey fielding, data cleaning, conditional and mixed logit estimation, and willingness-to-pay outputs.",
        "source_type": "illustrative",
        "source_citations": [
          "bridges-2011"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  RUM[\"Random utility model<br/>U = V(attributes) + error\"] --> Q1{Preference<br/>heterogeneity?}\n  Q1 -- No --> CL[\"Conditional / multinomial logit<br/>population-mean part-worths, assumes IIA\"]\n  Q1 -- \"Yes, continuous\" --> ML[\"Mixed / random-parameters logit<br/>panel, distribution of part-worths\"]\n  Q1 -- \"Yes, segments\" --> LC[Latent-class logit<br/>discrete preference classes]\n  ML --> Q2{Scale vs preference<br/>heterogeneity confounded?}\n  LC --> Q2\n  Q2 -- Yes --> GMNL[Generalized multinomial logit G-MNL<br/>separates scale from taste variance]\n  CL --> WTP[\"Derive WTP = -beta_attr / beta_cost<br/>and maximum acceptable risk\"]\n  GMNL --> WTP",
        "caption": "Choice-model selection logic. The conditional logit is the baseline under RUM/IIA; mixed, latent-class, and G-MNL models progressively relax the homogeneous-preference, segment, and constant-scale assumptions before WTP is derived.",
        "alt_text": "Decision diagram selecting among conditional logit, mixed logit, latent-class logit, and generalized multinomial logit based on preference and scale heterogeneity, leading to willingness-to-pay derivation.",
        "source_type": "illustrative",
        "source_citations": [
          "hauber-2016"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "used_with",
        "target_slug": "qualitative-interview",
        "notes": "Qualitative interviews and focus groups are the standard upstream step to identify, word, and reduce the attributes and levels before a quantitative DCE; skipping this step is the leading cause of omitted-attribute bias."
      },
      {
        "relation_type": "produces",
        "target_slug": "qaly-utility-mapping-rwe",
        "notes": "DCE-derived preference weights (especially DCE-TTO hybrids with duration anchoring) can feed health-state utility values and value-framework weights used in cost-utility analysis."
      },
      {
        "relation_type": "see_also",
        "target_slug": "pro-development",
        "notes": "Both are primary-data, instrument-based measurement designs; PRO development measures health status while a preference study measures trade-off weights among attributes."
      },
      {
        "relation_type": "see_also",
        "target_slug": "hrqol",
        "notes": "Health-related quality-of-life attributes are frequently among the DCE attributes valued, and DCE trade-offs can inform the relative importance of HRQoL domains."
      },
      {
        "relation_type": "see_also",
        "target_slug": "cross-sectional",
        "notes": "A preference study is fielded as a cross-sectional survey; it shares the sampling-frame and selection concerns of cross-sectional primary data collection."
      }
    ],
    "aliases": [
      "DCE",
      "discrete choice experiment",
      "BWS",
      "best-worst scaling",
      "conjoint analysis",
      "choice-based conjoint",
      "stated-preference study",
      "patient preference study"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "pregnancy-exposure-window-rwe",
    "name": "Pregnancy Exposure Window",
    "short_definition": "A protocol-specified set of gestational timing rules that maps maternal dispensings, administrations, or procedures onto etiologically relevant developmental periods (periconceptional, first/second/third trimester, peripartum, lactation) so that in-utero exposure is attributed to the window in which a given fetal or neonatal outcome can plausibly arise.",
    "long_description": "A **pregnancy exposure window** is the date logic that converts a raw maternal exposure record (a pharmacy fill, an\ninfusion administration, a procedure) into the analytic statement \"exposed during the developmental period relevant to\n*this* outcome.\" Most teratogenic and developmental effects are time-specific — organogenesis (roughly gestational weeks\n3-8) governs the majority of major structural malformations, while later gestation governs fetal growth, neonatal\nadaptation/withdrawal, and neurodevelopment. Because of this, a single \"any use during pregnancy\" indicator is usually the\n*wrong* exposure: it dilutes a true window-specific signal toward the null by averaging in person-time when the drug could\nnot have caused the outcome, and it invites confounding by indication when women who continue a drug into later trimesters\ndiffer systematically from those exposed only periconceptionally.\n\nEvery window is anchored to an estimated **pregnancy start date** — usually last menstrual period (LMP) or an estimated\nconception date — which in claims and most EHR data must itself be *derived* (delivery date minus a gestational-age\nestimate, or earliest prenatal-ultrasound/dating code). Once the anchor is set, the analyst applies pre-specified day or\nweek intervals (e.g., LMP-90 to LMP-1 days periconceptional; LMP to LMP+90/97 days first trimester) and computes overlap\nbetween the drug's covered-days interval and each gestational interval. 
The output is a set of per-window exposure variables\n(binary, days-exposed, or dose-in-window) on the **maternal** timeline that are then carried into the mother-infant linked\noutcome analysis.\n\n**Core conceptual distinction** — the estimand is *window-specific*, not pregnancy-wide, and the window must be named a\npriori from biology — e.g., \"first-trimester exposure to drug X versus an active comparator and the risk of major\ncongenital malformation among live births.\" This differs sharply from three superficially similar definitions. (1) Versus\nan *any-pregnancy ever-exposed* indicator: that variable answers a different, usually less interpretable question and is\nbiased toward the null whenever the causal window is narrow. (2) Versus *cumulative dose across the whole pregnancy*: that\nmatches a cumulative-toxicity model but mis-specifies a pulse/critical-window mechanism. (3) Versus *exposure-lag/induction\nwindows* in non-pregnancy pharmacoepidemiology: pregnancy windows are forward-anchored to a developmental clock (the\nconceptus's gestational age), not backward-lagged from outcome onset. Getting the anchor and the window boundaries right is\nthe entire game; the downstream regression is comparatively routine.\n\n**Pros, cons, and trade-offs** — versus an *any-use-during-pregnancy binary*, window-specific classification preserves\netiologic specificity, raises power for the relevant period, and reduces exposure misclassification from irrelevant\nperson-time — at the cost of requiring a credible LMP estimate, explicit rules for fills that straddle two windows, and a\nbank of dating-sensitivity analyses. Versus *trimester-only buckets* (first/second/third with no periconceptional or\nlactation window), windows let you capture pre-recognition exposures (folate antagonists, retinoids acting before a woman\nknows she is pregnant) and lactation transfer for some neonatal endpoints, which crude trimester buckets miss. 
Versus\n*cumulative-dose-over-pregnancy*, a window is easier to communicate (\"first-trimester exposed\") and correct when the risk\nis a single critical pulse, but it loses power and is mis-specified when the true relationship is dose-cumulative across\ngestation — in which case a dose-in-window or time-updated cumulative-dose metric is preferable. The window approach also\nshares the active-comparator logic of mainstream pharmacoepidemiology: comparing first-trimester initiators of drug X to\nfirst-trimester initiators of a clinically interchangeable comparator beats a drug-versus-unexposed contrast, which is\nconfounded by the underlying maternal indication.\n\n**When to use** — use windows for any study of in-utero drug effects on structural malformations, fetal growth, preterm\nbirth, neonatal adaptation/withdrawal, or longer-term neurodevelopment where biology or prior human/animal data implicate a\ntime-limited sensitive period. They are effectively mandatory for regulatory pregnancy-safety studies (FDA and EMA\npost-approval pregnancy safety guidance, pregnancy registries, and PASS) and for any comparative safety/effectiveness claim\nthat references a specific gestational timing on a product label.\n\n**When NOT to use — and when it is actively misleading or dangerous** — do *not* impose a window when the outcome has no\nplausible gestational specificity (e.g., a maternal injury months postpartum); the window only adds noise. Do *not* trust\nwindow membership when LMP cannot be estimated within an acceptable error — very late or absent prenatal care, missing\ngestational-age fields, or reliance on a delivery code with no dating information can make window misclassification *worse*\nthan an honest any-pregnancy analysis, and a fill near a trimester boundary will be assigned almost at random. 
The design\nbecomes actively dangerous in three situations: (a) **differential pregnancy recognition or termination by exposure** —\nwomen on a known teratogen may be diagnosed earlier, have earlier dating ultrasounds, or undergo elective termination,\nwhich depletes the affected fetuses from the **live-birth denominator** and biases the live-birth-restricted risk toward the\nnull (a form of live-birth/survivor selection bias); (b) **immortal-time-style misalignment** when exposure status in a\nlater window is used to define a cohort whose person-time in an earlier window is then counted as \"unexposed\"; and (c)\npresenting a reassuring \"no increased risk\" from an any-pregnancy binary when the etiologically relevant window-specific\nanalysis was never run or was hopelessly under-powered.\n\n**Data-source operational depth**\n- **Claims (FFS vs MA vs commercial):** LMP is almost always derived — delivery date minus gestational weeks from\n  diagnosis/procedure codes, prenatal-ultrasound dating codes, or a published/validated algorithm. Drug exposure is the\n  pharmacy claim (`ndc` + `fill_date` + `days_supply`); you must construct covered-day intervals and handle reversals,\n  early refills, stockpiling, and 90-day mail-order supplies before computing window overlap. **Require continuous\n  medical + pharmacy enrollment** across the entire periconceptional-through-delivery span so that \"no fill in the window\"\n  is observed non-exposure, not missingness. Specific failure modes: **Medicare Advantage and capitated/bundled\n  person-time drop fee-for-service claims**, so a MA-only span fabricates apparent non-exposure — restrict to enrollees\n  with both medical and pharmacy benefit and exclude MA-only person-time. **Differential competing risks of fetal loss by\n  exposure** distort the live-birth denominator; report the proportion of pregnancies ending in an observed live birth by\n  arm and analyze non-live-birth endpoints in the maternal cohort. 
**Immortal time in procedure or infusion studies** —\n  e.g., requiring a second-trimester administration to be \"exposed\" makes all of the first trimester immortal — must be\n  avoided by aligning time zero to the first qualifying exposure.\n- **EHR:** Prenatal flow sheets, dating ultrasounds, and problem-list gestational age give a *direct* LMP/EDC and far\n  sharper exposure timing via e-prescribing and medication administration records than fills alone. The cost is\n  visit-driven, leaky capture: external pharmacies and care outside the system are missing, and a patient who leaves the\n  network is differentially lost. Link to claims for complete dispensing and for infant outcomes rendered elsewhere.\n- **Registry:** Product or disease pregnancy registries capture exact LMP, ultrasound dating, exposure dates, and\n  adjudicated outcomes prospectively with high validity — but are small and selected (enrollment, recall, and\n  healthy-volunteer effects). Their best use is to *validate* claims-based window/LMP algorithms or to triangulate effect\n  sizes, not to estimate population risk alone.\n- **Linked claims-EHR-vital records:** The reference substrate — claims for complete pharmacy + enrollment, EHR for\n  clinical dating and infant exams, vital/birth records for confirmed live births and recorded gestational age. The\n  catch is linkage selection (only the linkable subset) and **date reconciliation** across order, fill, service, and\n  delivery dates, which must be resolved *before* any window is assigned.\n\n**Worked claims example (antiepileptic drug, first-trimester major-malformation risk).** Source: linked commercial +\nmulti-state Medicaid claims, 2016-2023. Build the pregnancy cohort by identifying delivery claims, then back-date LMP with a\nvalidated gestational-age algorithm (delivery_date - estimated gestational weeks) or the earliest prenatal-dating\nultrasound code when present. 
Eligibility: continuous maternal medical + pharmacy enrollment (no MA-only spans) from\nLMP-90 days through 90 days post-delivery, and a linked infant continuously enrolled >=90 days. Define the first-trimester\nwindow as `[LMP, LMP+97 days]`. For each maternal fill, compute its covered interval `[fill_date, fill_date + days_supply -\n1]` and flag `exp_t1 = 1` if that interval overlaps the first-trimester window by at least one day (primary, binary);\nsecondary metrics are days-overlapping-window and average daily milligrams in window. Outcome: a major-congenital-\nmalformation algorithm with validated PPV > 0.80 applied to infant claims in days 0-90 plus the index inpatient stay.\nRestrict to first-trimester *initiators* of the index AED versus first-trimester initiators of an active-comparator AED\n(active-comparator, new-user logic ported into pregnancy), and balance maternal age, epilepsy-type and severity proxies,\ncomorbidities, folic-acid use, and concomitant AEDs with a high-dimensional propensity score (matching or overlap\nweighting). In this construction the adjusted risk ratio for first-trimester exposure to the teratogenic AED was 2.8\n(95% CI 1.6-4.9), whereas the any-pregnancy ever-exposed RR was only 1.9 (1.1-3.2) — diluted by third-trimester-only users\nfor whom organogenesis had passed. Pre-specified dating-sensitivity analyses (shift the LMP algorithm by +/-14 days; move\nthe window end by +/-7 days; reassign straddling fills by majority-of-covered-days rather than any-overlap) moved the point\nestimate only between 2.4 and 3.1, and a negative-control outcome with no plausible teratogenic mechanism was null,\nsupporting robustness against residual dating error and confounding.",
    "primary_category": "Exposure_Definition",
    "tags": [
      "pregnancy",
      "exposure_window",
      "special_populations",
      "teratogen",
      "perinatal",
      "mother_infant",
      "gestational_timing",
      "pharmacoepidemiology"
    ],
    "applies_to_study_types": [
      "claims_analysis",
      "ehr_study",
      "registry_linkage",
      "active_comparator_new_user",
      "comparative_effectiveness"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1002/pds.4789",
        "url": "https://doi.org/10.1002/pds.4789",
        "citation_text": "Huybrechts KF, Bateman BT, Hernández-Díaz S. Use of real-world evidence from healthcare utilization data to evaluate drug safety during pregnancy. Pharmacoepidemiology and Drug Safety. 2019;28(7):906-922.",
        "year": 2019,
        "authors_short": "Huybrechts et al.",
        "notes": "Authoritative methods review of pregnancy pharmacoepidemiology in healthcare-utilization data — LMP estimation, exposure-window definition, mother-infant linkage, live-birth selection bias, and analytic strategy for claims and EHR."
      },
      {
        "role": "explain",
        "doi": "10.1002/pds.3284",
        "url": "https://doi.org/10.1002/pds.3284",
        "citation_text": "Margulis AV, Setoguchi S, Mittleman MA, Glynn RJ, Dormuth CR, Hernández-Díaz S. Algorithms to estimate the beginning of pregnancy in administrative databases. Pharmacoepidemiology and Drug Safety. 2013;22(1):16-24.",
        "year": 2013,
        "authors_short": "Margulis et al.",
        "notes": "Validation and comparison of claims-based algorithms that back-date LMP from delivery and prenatal codes — the foundational upstream step that determines whether any exposure-window assignment is trustworthy."
      },
      {
        "role": "demonstrate",
        "doi": "10.1371/journal.pone.0067405",
        "url": "https://doi.org/10.1371/journal.pone.0067405",
        "citation_text": "Palmsten K, Huybrechts KF, Mogun H, et al. Harnessing the Medicaid Analytic eXtract (MAX) to evaluate medications in pregnancy: design considerations. PLoS ONE. 2013;8(6):e67405.",
        "year": 2013,
        "authors_short": "Palmsten et al.",
        "notes": "Canonical worked demonstration of pregnancy cohort construction, LMP estimation, exposure-window application, and mother-infant linkage in U.S. Medicaid (MAX) administrative data."
      },
      {
        "role": "use",
        "doi": "10.1212/wnl.0000000000004857",
        "url": "https://doi.org/10.1212/wnl.0000000000004857",
        "citation_text": "Hernández-Díaz S, Huybrechts KF, Desai RJ, et al. Topiramate use early in pregnancy and the risk of oral clefts. Neurology. 2018;90(4):e342-e351.",
        "year": 2018,
        "authors_short": "Hernández-Díaz et al.",
        "notes": "High-impact application using first-trimester-specific exposure windows to detect and quantify a teratogenic signal (oral clefts) for topiramate, illustrating why a window-specific rather than any-pregnancy estimand is decisive."
      }
    ],
    "plain_language_summary": "A pregnancy exposure window is a specific slice of pregnancy time when a drug could actually cause a birth defect or developmental problem. Instead of simply asking whether a mother took a drug at any point during pregnancy, researchers pin down which biological stage the baby was in when the pills were on hand. For example, the first eight weeks after conception are when the heart, spine, and limbs are being formed, so a drug fill during that window matters far more to birth-defect risk than the same drug taken in the final weeks of pregnancy. In a health insurance claims database, analysts figure out the pregnancy start date, divide the pregnancy into developmental stages, and then check whether any prescription fills overlap each stage, producing a yes/no (or days-of-overlap) score per stage.",
    "key_terms": [
      {
        "term": "LMP (last menstrual period)",
        "definition": "The first day of the mother's last menstrual cycle before conception, used as the standard starting point from which gestational age and trimester boundaries are calculated."
      },
      {
        "term": "gestational age",
        "definition": "How far along a pregnancy is, counted in weeks from LMP; it determines which developmental stage the fetus is in at any given date."
      },
      {
        "term": "organogenesis",
        "definition": "The early-pregnancy window, roughly weeks 3 through 8 after conception, when the fetus forms its major organs and is most vulnerable to drugs that cause structural birth defects."
      },
      {
        "term": "fill date",
        "definition": "The calendar date on a pharmacy record when a prescription was dispensed to the patient."
      },
      {
        "term": "days_supply",
        "definition": "The number of days one dispensed prescription is expected to last, for example 30 or 90 days."
      },
      {
        "term": "covered-day interval",
        "definition": "The calendar range from the fill date through fill date plus days_supply minus one, representing the days a patient theoretically had medication in hand."
      }
    ],
    "worked_example": {
      "scenario": "Maria is a 28-year-old with epilepsy who filled a prescription for an antiepileptic drug several times during her pregnancy. Her delivery was on 2023-09-28, and a validated algorithm estimates her LMP as 2022-12-28, giving a gestational length of about 274 days. We want to know whether she had the drug on hand during the first trimester, specifically the organogenesis window (LMP through LMP+97 days, i.e., 2022-12-28 through 2023-04-04), because that is when major structural malformations are most plausible. We will look up her pharmacy fills in the claims table and compute whether any covered-day interval overlaps that first-trimester window.",
      "dataset": {
        "caption": "Maria's pharmacy claims rows (person_id 2001). Each row is one prescription fill.",
        "columns": [
          "person_id",
          "fill_date",
          "drug",
          "days_supply"
        ],
        "rows": [
          [
            2001,
            "2022-12-01",
            "levetiracetam 500 mg",
            30
          ],
          [
            2022,
            "2022-12-01",
            "levetiracetam 500 mg",
            30
          ],
          [
            2001,
            "2022-12-30",
            "levetiracetam 500 mg",
            90
          ],
          [
            2001,
            "2023-03-28",
            "levetiracetam 500 mg",
            90
          ],
          [
            2001,
            "2023-06-25",
            "levetiracetam 500 mg",
            90
          ]
        ]
      },
      "steps": [
        "Row 1 (fill_date 2022-12-01, 30 days): covered interval is Dec 1 to Dec 30, 2022. LMP is Dec 28, 2022, so this fill ends just two days into the pregnancy. It overlaps the first-trimester window by 3 days (Dec 28-30). This counts as first-trimester exposed.",
        "Row 2 is for person 2022, not Maria (person 2001) -- skip it.",
        "Row 3 (fill_date 2022-12-30, 90 days): covered interval is Dec 30, 2022 through Mar 29, 2023. The first-trimester window ends Apr 4, 2023. This fill is entirely inside the first trimester -- all 90 days overlap.",
        "Row 4 (fill_date 2023-03-28, 90 days): covered interval is Mar 28 through Jun 25, 2023. The first-trimester window ends Apr 4. The overlap is Mar 28 through Apr 4 = 8 days. This fill also reaches into the second trimester (Apr 5 onward).",
        "Row 5 (fill_date 2023-06-25, 90 days): covered interval is Jun 25 through Sep 22, 2023. The first-trimester window ended Apr 4 -- no overlap. This fill is second and third trimester only.",
        "Total unique days overlapping the first-trimester window: the union of Dec 28 through Apr 4 is 98 days. Rows 1, 3, and 4 together cover this entire span without a gap, so Maria had levetiracetam on hand for all 98 days of the first-trimester window.",
        "Result: first-trimester exposed = YES (binary = 1); days-in-window = 98; coverage fraction = 98/98 = 1.00."
      ],
      "result": "Maria is classified as first-trimester exposed (binary = 1). Her covered days fully span the 98-day first-trimester window (Dec 28, 2022 through Apr 4, 2023), meaning the drug was theoretically on hand during the entire organogenesis period. This exposure flag is what gets carried into the birth-defect outcome analysis.",
      "timeline_spec": {
        "title": "First-trimester drug exposure during organogenesis for one epilepsy patient",
        "caption": "Maria's levetiracetam fills (gray bars) plotted against her pregnancy timeline. The first-trimester organogenesis window (red span, LMP through LMP+97 days) is fully covered by fills from December 2022 through April 2023. Fills in the second and third trimesters do not affect the first-trimester exposure classification.",
        "alt_text": "Horizontal timeline from estimated LMP (Dec 28 2022) to delivery (Sep 28 2023). Three trimester spans are shown as colored bands beneath the timeline. Four prescription fill bars sit above the timeline, labeled with fill date and days_supply. The first-trimester band (organogenesis window) is highlighted in red; two fill bars overlap it completely. A result label reads: first-trimester exposed = 1, days in window = 98.",
        "window": {
          "start": "2022-12-28",
          "end": "2023-09-28",
          "label": "Full pregnancy (LMP to delivery, 274 days)"
        },
        "events": [
          {
            "label": "Fill A (pre-LMP tail)",
            "start": "2022-12-01",
            "length_days": 30,
            "quantity": "30 days_supply"
          },
          {
            "label": "Fill B (first 90-day fill)",
            "start": "2022-12-30",
            "length_days": 90,
            "quantity": "90 days_supply"
          },
          {
            "label": "Fill C (straddles T1/T2 boundary)",
            "start": "2023-03-28",
            "length_days": 90,
            "quantity": "90 days_supply"
          },
          {
            "label": "Fill D (T2/T3 only)",
            "start": "2023-06-25",
            "length_days": 90,
            "quantity": "90 days_supply"
          }
        ],
        "spans": [
          {
            "kind": "exposed",
            "start": "2022-12-28",
            "end": "2023-04-04",
            "label": "First trimester (organogenesis window): 98 days -- fully covered, exposed = 1"
          },
          {
            "kind": "followup",
            "start": "2023-04-05",
            "end": "2023-06-24",
            "label": "Second trimester: 81 days"
          },
          {
            "kind": "followup",
            "start": "2023-06-25",
            "end": "2023-09-28",
            "label": "Third trimester: 95 days"
          }
        ],
        "result": {
          "label": "First-trimester exposed = 1 (98 covered days / 98-day window = 100% coverage)",
          "value": 1
        }
      }
    },
    "prerequisites": [
      "continuous-enrollment-observable-time-rwe",
      "exposure-episode-construction-rwe",
      "mother-infant-linkage-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Fixed calendar-day windows anchored to estimated LMP",
        "description": "The default in claims — periconceptional, first/second/third trimester defined as fixed day intervals from a back-dated LMP (e.g., first trimester [LMP, LMP+97d]). Transparent and reproducible, but only as good as the LMP estimate.",
        "edge_cases": [
          "Late or no prenatal care leaves only a delivery code for dating, widening LMP error and randomizing fills near boundaries.",
          "Preterm and post-term deliveries break a fixed gestational-length assumption; use the algorithm-estimated gestational age, not a constant 280 days."
        ],
        "data_source_notes": "claims: derive LMP via a validated algorithm (delivery minus estimated gestational weeks) before computing any overlap; EHR: prefer ultrasound/EDC dating when present."
      },
      {
        "name": "Biologically anchored windows from ultrasound/prenatal dating",
        "description": "Window boundaries anchored to an early-ultrasound estimated date of conception/EDC rather than a back-dated delivery, giving sharper organogenesis windows when dating records exist.",
        "edge_cases": [
          "Available only when EHR/registry dating is captured; missing for women without early prenatal care.",
          "Ultrasound and LMP dates can disagree by 1-2 weeks; pre-specify which source wins."
        ],
        "data_source_notes": "ehr/registry: read EDC or first-trimester dating ultrasound; link to claims for complete dispensings."
      },
      {
        "name": "Straddling-fill assignment rule (any-overlap vs majority-of-days vs first-day)",
        "description": "A fill whose covered interval spans two windows must be assigned by a pre-specified rule — any one day of overlap, the window containing the majority of covered days, or the window containing the fill_date.",
        "edge_cases": [
          "Any-overlap maximizes sensitivity but can double-count a drug across two windows; majority-of-days is more specific.",
          "Long (90-day mail-order) supplies straddle multiple windows and need stockpiling-aware covered-day construction."
        ],
        "data_source_notes": "claims: build covered intervals from fill_date + days_supply with carry-over/stockpiling rules, then apply the chosen overlap rule consistently across both exposure arms."
      },
      {
        "name": "Dose- or intensity-in-window metric",
        "description": "Replace the binary with cumulative milligrams, defined daily doses, or covered days within the window when prior evidence supports a dose-response inside the sensitive period.",
        "edge_cases": [
          "Requires reliable dose parsing from NDC/strength fields; missing or imputed strength biases the intensity metric.",
          "Mixing chronic users (high in-window dose) and new initiators changes the contrast; align on new-user timing."
        ],
        "data_source_notes": "claims: map NDC to strength and multiply by days-overlapping-window; EHR MAR gives exact administered dose."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Any-use-during-pregnancy (ever-exposed) binary",
        "pros_of_this": "Preserves etiologic specificity for time-sensitive developmental effects; raises power for the relevant period; reduces exposure misclassification from irrelevant person-time, avoiding null-bias when the causal window is narrow.",
        "cons_of_this": "Requires a credible LMP estimate and explicit straddling-fill rules; more sensitive to dating error and to differential prenatal care/recognition by exposure.",
        "when_to_prefer": "Whenever biology, animal data, or prior human signals implicate a narrow sensitive period (organogenesis, third-trimester growth/adaptation, etc.) and adequate dating is available."
      },
      {
        "compared_to": "Trimester-only buckets (no periconceptional or lactation window)",
        "pros_of_this": "Captures pre-recognition exposures (folate antagonists, retinoids) and lactation transfer for some neonatal endpoints that crude trimester buckets miss.",
        "cons_of_this": "More windows means more multiplicity and smaller per-window counts; needs pre-specification to avoid fishing across windows.",
        "when_to_prefer": "Mechanisms acting before pregnancy recognition or via breast milk, or when a specific early/late window is the labeling question."
      },
      {
        "compared_to": "Cumulative dose across the whole pregnancy",
        "pros_of_this": "Matches a single-critical-pulse biology; easy to communicate ('first-trimester exposed') and to map to a label statement about gestational timing.",
        "cons_of_this": "Loses power and is mis-specified when the true relationship is dose-cumulative across gestation or exposure is chronic across multiple windows.",
        "when_to_prefer": "A clear prior hypothesis about a critical window, or a regulatory/labeling question that references specific gestational timing rather than total exposure."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "LMP is derived (delivery date minus algorithm-estimated gestational weeks, or earliest prenatal-dating code). Build covered-day intervals from ndc + fill_date + days_supply with reversal/stockpiling logic before computing window overlap. Continuous medical + pharmacy enrollment from LMP-90 through post-delivery is non-negotiable so non-exposure is observed; exclude Medicare Advantage-only person-time where FFS pharmacy/medical claims are absent. Link to and require infant enrollment so absence of an outcome code is informative, and report live-birth proportion by exposure arm.",
      "ehr": "Ultrasound/EDC and problem-list gestational age give direct, sharper dating; e-prescribing and MAR give precise exposure timing. Handle external fills and loss to the system by linking to claims for complete dispensings and for infant care rendered outside the network.",
      "registry": "Highest validity for dating, exposure, and adjudicated outcomes but small and selected; best used to validate claims-based LMP/window algorithms or to triangulate effect sizes rather than to estimate population risk alone.",
      "linked": "Claims (complete fills + enrollment) + EHR (clinical dating + infant exams) + vital/birth records (confirmed live births, recorded gestational age) is the reference standard. Reconcile order/fill/service/delivery dates before assigning any window; account for linkage selection."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\nimport numpy as np\n\n# Window edges as integer days relative to LMP (LMP = day 0). End is inclusive.\nWINDOWS = {\n    \"precon\": (-90, -1),    # periconceptional: 90 days before LMP\n    \"t1\":     (0, 97),      # first trimester (organogenesis-bearing)\n    \"t2\":     (98, 195),    # second trimester\n    \"t3\":     (196, 300),   # third trimester (capped; delivery may truncate)\n}\n\ndef assign_windows(preg: pd.DataFrame, rx: pd.DataFrame) -> pd.DataFrame:\n    f = rx.merge(preg[[\"person_id\", \"lmp_derived\", \"delivery_date\"]], on=\"person_id\", how=\"inner\")\n    # Covered interval of each fill, in days since LMP.\n    f[\"cov_start\"] = (f[\"fill_date\"] - f[\"lmp_derived\"]).dt.days\n    f[\"cov_end\"]   = f[\"cov_start\"] + f[\"days_supply\"].astype(int) - 1\n    # Truncate third trimester at the observed delivery day so post-delivery supply isn't counted in utero.\n    deliv_day = (f[\"delivery_date\"] - f[\"lmp_derived\"]).dt.days\n\n    rows = []\n    for win, (w0, w1_const) in WINDOWS.items():\n        w1 = np.minimum(w1_const, deliv_day) if win == \"t3\" else pd.Series(w1_const, index=f.index)\n        ov_start = np.maximum(f[\"cov_start\"], w0)\n        ov_end   = np.minimum(f[\"cov_end\"],   w1)\n        days_overlap = (ov_end - ov_start + 1).clip(lower=0)\n        tmp = f.assign(window=win, days_in_window=days_overlap)\n        tmp[\"mg_in_window\"] = tmp[\"days_in_window\"] * tmp.get(\"daily_mg\", np.nan)\n        rows.append(tmp[[\"person_id\", \"window\", \"days_in_window\", \"mg_in_window\"]])\n\n    long = pd.concat(rows, ignore_index=True)\n    agg = (long.groupby([\"person_id\", \"window\"], as_index=False)\n                .agg(days_in_window=(\"days_in_window\", \"sum\"),\n                     mg_in_window=(\"mg_in_window\", \"sum\")))\n    agg[\"exposed\"] = (agg[\"days_in_window\"] > 0).astype(int)   # any-overlap binary; swap for majority-of-days rule\n    wide = 
agg.pivot(index=\"person_id\", columns=\"window\",\n                     values=[\"exposed\", \"days_in_window\", \"mg_in_window\"])\n    wide.columns = [f\"{m}_{w}\" for m, w in wide.columns]\n    return preg.merge(wide.reset_index(), on=\"person_id\", how=\"left\").fillna(\n        {c: 0 for c in wide.columns if c.startswith(\"exposed_\")})",
        "description": "Window assignment from claims-style inputs. Required inputs (already cleaned, de-duplicated, and enrollment-filtered):\n  preg : one row per pregnancy -> person_id, lmp_derived (datetime), delivery_date (datetime)\n         # lmp_derived comes from a validated back-dating algorithm or prenatal-dating code, NOT a constant 280 days\n  rx   : maternal fills        -> person_id, ndc, fill_date (datetime), days_supply (int), daily_mg (float, optional)\nReturns per-pregnancy binary, days-in-window, and dose-in-window exposure for each gestational window using an\noverlap-of-covered-interval rule. Pre-specify the straddling-fill rule (here: any-overlap) in the SAP.",
        "dependencies": [
          "pandas",
          "numpy"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\n\n# Window edges in days relative to LMP (day 0); end inclusive.\nwin_edges <- list(precon = c(-90L, -1L), t1 = c(0L, 97L),\n                  t2 = c(98L, 195L),     t3 = c(196L, 300L))\n\nassign_windows <- function(preg, rx) {\n  setDT(preg); setDT(rx)\n  f <- merge(rx, preg[, .(person_id, lmp_derived, delivery_date)], by = \"person_id\")\n  f[, cov_start := as.integer(fill_date - lmp_derived)]\n  f[, cov_end   := cov_start + as.integer(days_supply) - 1L]\n  f[, deliv_day := as.integer(delivery_date - lmp_derived)]\n  if (!\"daily_mg\" %in% names(f)) f[, daily_mg := NA_real_]\n\n  out <- rbindlist(lapply(names(win_edges), function(win) {\n    w0 <- win_edges[[win]][1L]\n    w1 <- if (win == \"t3\") pmin(win_edges[[win]][2L], f$deliv_day) else win_edges[[win]][2L]\n    ov_start <- pmax(f$cov_start, w0)\n    ov_end   <- pmin(f$cov_end,   w1)\n    days_ov  <- pmax(ov_end - ov_start + 1L, 0L)\n    data.table(person_id = f$person_id, window = win,\n               days_in_window = days_ov, mg_in_window = days_ov * f$daily_mg)\n  }))\n\n  agg <- out[, .(days_in_window = sum(days_in_window),\n                 mg_in_window   = sum(mg_in_window, na.rm = TRUE)),\n             by = .(person_id, window)]\n  agg[, exposed := as.integer(days_in_window > 0)]   # any-overlap; swap for majority-of-days rule\n  wide <- dcast(agg, person_id ~ window,\n                value.var = c(\"exposed\", \"days_in_window\", \"mg_in_window\"), fill = 0)\n  merge(preg, wide, by = \"person_id\", all.x = TRUE)\n}",
        "description": "data.table window assignment mirroring the Python version. Inputs (cleaned, enrollment-filtered):\n  preg : person_id, lmp_derived (Date), delivery_date (Date)\n  rx   : person_id, ndc, fill_date (Date), days_supply (integer), daily_mg (numeric, optional)\nProduces per-pregnancy any-overlap binary, days-in-window, and dose-in-window for each gestational window.",
        "dependencies": [
          "data.table"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* 1. Covered interval of each fill in days since LMP, joined to the pregnancy anchor. */\nproc sql;\n  create table fills as\n  select r.person_id, r.ndc, r.daily_mg,\n         (r.fill_date - p.lmp_derived)                      as cov_start,\n         (r.fill_date - p.lmp_derived) + r.days_supply - 1  as cov_end,\n         (p.delivery_date - p.lmp_derived)                  as deliv_day\n  from work.rx r\n  inner join work.preg p on r.person_id = p.person_id;\nquit;\n\n/* 2. Day-overlap with each window; t3 truncated at the observed delivery day. Any-overlap = exposed. */\nproc sql;\n  create table exp_wide as\n  select person_id,\n         max( max(0, calculated o_precon) > 0 ) as exposed_precon,\n         max( max(0, calculated o_t1)     > 0 ) as exposed_t1,\n         max( max(0, calculated o_t2)     > 0 ) as exposed_t2,\n         max( max(0, calculated o_t3)     > 0 ) as exposed_t3,\n         /* days-in-window for dose/intensity secondary analyses */\n         sum( max(0, calculated o_t1) )         as days_t1\n  from (\n    select person_id,\n           (min(cov_end, -1)  - max(cov_start, -90) + 1) as o_precon,\n           (min(cov_end, 97)  - max(cov_start, 0)   + 1) as o_t1,\n           (min(cov_end, 195) - max(cov_start, 98)  + 1) as o_t2,\n           (min(cov_end, min(300, deliv_day)) - max(cov_start, 196) + 1) as o_t3\n    from fills)\n  group by person_id;\nquit;\n\n/* 3. Adjusted first-trimester exposure vs active comparator on the linked infant malformation outcome.\n      work.analytic = exp_wide joined to maternal baseline covariates + infant outcome (mcm=0/1) + arm. */\nproc genmod data=work.analytic descending;\n  class arm(ref='COMPARATOR') epilepsy_type / param=ref;\n  model mcm = exposed_t1 mat_age epilepsy_type folate concomitant_aed\n        / dist=binomial link=log;          /* log-binomial -> adjusted risk ratio */\n  estimate 'First-trimester exposure RR' exposed_t1 1 / exp;\nrun;",
        "description": "Window assignment and a standardized exposed-vs-comparator risk step in SAS. Required inputs (post data-management,\nenrollment-filtered, dates as SAS date values):\n  work.preg : person_id, lmp_derived, delivery_date\n  work.rx   : person_id, ndc, fill_date, days_supply, daily_mg (optional)\nPROC SQL computes day-overlap of each fill's covered interval with each gestational window; PROC GENMOD (log-binomial)\nestimates the adjusted risk ratio for first-trimester exposure vs an active-comparator arm on the linked infant outcome.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": "pregnancy-exposure-window-rwe-timeline.svg",
        "mermaid": null,
        "caption": "Maria's levetiracetam fills (gray bars) plotted against her pregnancy timeline. The first-trimester organogenesis window (red span, LMP through LMP+97 days) is fully covered by fills from December 2022 through April 2023. Fills in the second and third trimesters do not affect the first-trimester exposure classification.",
        "alt_text": "Horizontal timeline from estimated LMP (Dec 28 2022) to delivery (Sep 28 2023). Three trimester spans are shown as colored bands beneath the timeline. Four prescription fill bars sit above the timeline, labeled with fill date and days_supply. The first-trimester band (organogenesis window) is highlighted in red; two fill bars overlap it completely. A result label reads: first-trimester exposed = 1, days in window = 98.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  A[Delivery claim identified] --> B[\"Estimate LMP / pregnancy start<br/>delivery date - algorithm gestational weeks<br/>or earliest ultrasound dating code\"]\n  B --> C[\"Require continuous medical+pharmacy enrollment<br/>LMP-90 through post-delivery; exclude MA-only person-time\"]\n  C --> D[\"Define windows from LMP<br/>Periconceptional | T1 0-97d | T2 | T3 | Lactation\"]\n  D --> E[\"Build covered intervals from fill_date + days_supply<br/>apply straddling-fill rule\"]\n  E --> F[\"Per-window exposure: binary / days-in-window / dose-in-window\"]\n  F --> G[\"Link to infant; apply validated outcome algorithm<br/>require infant enrollment\"]\n  G --> H[\"Sensitivity: LMP +/-14d, window +/-7d,<br/>straddle rule, negative-control outcome\"]",
        "caption": "End-to-end workflow for pregnancy exposure windows in RWE. The load-bearing upstream step is a credible LMP estimate with adequate enrollment; the load-bearing analytic step is pre-specifying which window is etiologically relevant and how straddling fills are assigned.",
        "alt_text": "Flowchart from delivery claim through LMP estimation, enrollment requirement, window definition, covered-interval overlap, infant linkage, outcome algorithm, and dating-sensitivity analyses.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  Cause[Maternal drug exposure] --> Recog[Earlier pregnancy recognition<br/>+ dating ultrasound in exposed]\n  Recog --> Term[Differential elective termination<br/>or detected fetal loss]\n  Term --> LB[Live-birth restriction]\n  Cause --> Mal[Fetal malformation]\n  Mal --> Term\n  LB --> Est[Live-birth-restricted risk]\n  Mal --> Est\n  Term -. depletes affected fetuses .-> Est",
        "caption": "Live-birth selection (survivor) bias. Exposure can drive earlier recognition and differential termination/fetal loss, depleting affected conceptuses from the live-birth denominator and biasing the live-birth-restricted window-specific risk toward the null - which is why the live-birth proportion by arm and a maternal-cohort analysis of non-live-birth endpoints must be reported.",
        "alt_text": "Causal diagram showing maternal exposure affecting both malformation and differential termination, with live-birth restriction acting as a collider/selection node that biases the restricted risk estimate.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "part_of",
        "target_slug": "special-populations-rwe-methods",
        "notes": "A core operational building block of pregnancy and perinatal RWE; gestational timing is a first-order determinant of bias and interpretation in all pregnancy pharmacoepidemiology."
      },
      {
        "relation_type": "requires",
        "target_slug": "mother-infant-linkage-rwe",
        "notes": "Window-specific exposure is assigned on the maternal timeline; linkage is required to attach infant outcomes and to make absence of an outcome code informative."
      },
      {
        "relation_type": "used_with",
        "target_slug": "claims-outcome-algorithm-ppv-sensitivity-rwe",
        "notes": "Infant outcomes (malformations, preterm, growth) must use validated algorithms whose PPV/sensitivity in the relevant ascertainment window is known or estimated."
      },
      {
        "relation_type": "see_also",
        "target_slug": "exposure-lag-induction-latency-window-rwe",
        "notes": "Both are time-window exposure constructs, but pregnancy windows are forward-anchored to gestational age whereas induction/latency windows are backward-anchored from outcome onset."
      },
      {
        "relation_type": "see_also",
        "target_slug": "time-zero-index-date-alignment-rwe",
        "notes": "Requiring exposure in a later window to define the cohort creates immortal time in earlier windows; align time zero to the first qualifying gestational exposure."
      },
      {
        "relation_type": "see_also",
        "target_slug": "misclassification-bias-correction-rwe",
        "notes": "Exposure (window-membership) misclassification is especially consequential in narrow windows near boundaries; validation substudies or probabilistic bias analysis are often required."
      }
    ],
    "aliases": [
      "gestational exposure window",
      "trimester-specific exposure",
      "periconceptional exposure",
      "LMP-anchored exposure window",
      "first-trimester exposure definition",
      "in-utero exposure window",
      "window-specific teratogen analysis"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "ema"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "pregnancy-registry",
    "name": "Pregnancy Exposure Registry",
    "short_definition": "A prospective cohort design that enrolls pregnant patients exposed to a defined medication (or vaccine) before the pregnancy outcome is known and actively follows them to ascertain major congenital malformations and other pregnancy outcomes, comparing risk against an internal or external unexposed reference.",
    "long_description": "A **pregnancy exposure registry** is a prospective, observational cohort assembled specifically to detect teratogenic\nand other adverse pregnancy effects of an exposure (drug, biologic, or vaccine) taken during pregnancy. Its defining\nfeature is **enrollment before the pregnancy outcome is known** (\"prospective\" in the FDA/registry sense): a pregnant\npatient with a documented exposure is registered while still pregnant, exposure and gestational timing are recorded at\nintake, and the outcome — most importantly major congenital malformation (MCM) — is ascertained later through active,\nprotocol-driven follow-up. Classic operating examples are the Antiretroviral Pregnancy Registry, the (now-closed)\nInternational Lamotrigine Pregnancy Registry, the North American AED Pregnancy Registry, and MotherToBaby/OTIS cohorts.\n\n**Core conceptual distinction**. A pregnancy registry is a *cohort study with a deliberately engineered time zero and\noutcome-blind enrollment*, not a passive collection of reports. Three features separate it from look-alikes. (1)\n*Prospective vs retrospective ascertainment*: prospectively enrolled cases (registered before any prenatal diagnosis\nof a defect and before the outcome) are the analytic backbone; retrospectively reported cases (registered after an\nabnormal outcome is already known) are reported separately and excluded from the primary risk estimate because they\nare subject to recall- and reporting-driven over-representation of defects. (2) *Registry vs spontaneous adverse-event\nreporting (FAERS/EudraVigilance)*: a registry has a defined denominator (enrolled exposed pregnancies and their\nexpected outcomes) and so can estimate a *risk*, whereas spontaneous reports have no denominator and can only generate\nsignals. 
(3) *Internal vs external comparator*: the estimand is the proportion (or prevalence) of MCM among\nfirst-trimester-exposed live births versus an unexposed reference — either an internally enrolled disease-matched\nunexposed group or an external population (MACDP, EUROCAT, or a claims-based unexposed pregnancy cohort). The contrast\nis a malformation prevalence ratio/difference, not a hazard; there is no clean active-comparator analogue because the\ncomparator is almost always a disease-matched *unexposed* pregnancy.\n\n**Pros, cons, and trade-offs**.\n- **vs retrospective claims/EHR-based pregnancy cohorts:** A registry collects gestational timing (LMP, trimester of\n  exposure), indication, dose, and adjudicated malformation detail that claims rarely capture, and it enrolls before\n  the outcome is known, which blunts recall and ascertainment bias. Cost: it is slow, expensive, volunteer-driven, and\n  chronically underpowered; a claims-based pregnancy cohort (mother-infant linked, with a far larger denominator) is\n  usually faster, cheaper, and better powered for common outcomes once a validated outcome algorithm exists. 
**Prefer\n  a registry** when the exposure is new, rare, or pregnancy-specific and no validated claims algorithm or linkage exists;\n  **prefer a claims cohort** when a large linked database and validated outcome definitions are available.\n- **vs MotherToBaby/OTIS prospective comparative cohorts:** Single-product manufacturer registries enroll faster for one\n  drug but typically rely on external malformation prevalence as the comparator; OTIS-style cohorts enroll an internal\n  disease-matched and a healthy comparator concurrently, giving a stronger internal contrast at the cost of slower,\n  multi-site recruitment.\n- **vs spontaneous reporting / passive surveillance:** A registry yields an estimable risk and an organogenesis-window\n  exposure classification; spontaneous reports are faster and broader but denominator-free and cannot support a rate.\n  **Prefer a registry** when you need a quantified malformation risk for labeling; passive surveillance only for early\n  signal generation.\n\n**When to use**. A drug or vaccine likely to be used in pregnancy (or in people who can become pregnant) where pregnancy\nwas excluded from the pivotal trials and a teratogenic signal must be characterized; a post-marketing requirement or\ncommitment (FDA, EMA RMP/PASS) for pregnancy safety; an exposure too new or too pregnancy-specific for an existing\nvalidated claims algorithm; settings needing reliable first-trimester (organogenesis) exposure timing and adjudicated\nMCM classification that secondary data cannot supply.\n\n**When NOT to use — and when it is actively misleading or dangerous**.\n- **You need a precise estimate of a common outcome quickly.** Registries accrue slowly and are usually underpowered:\n  detecting a doubling of the ~3% baseline MCM prevalence at 80% power requires on the order of 200+ first-trimester\n  exposures per the registry literature; rarer outcomes need far more. 
A null from an underpowered registry must never\n  be read as \"no risk.\"\n- **Differential loss to follow-up is uncontrolled.** This is the registry's signature failure: patients whose pregnancy\n  ends badly (spontaneous abortion, fetal death, a prenatally diagnosed defect) drop out or are never re-contacted at a\n  higher rate than those with normal outcomes, so passively the registry *under*-counts defects (informative censoring);\n  conversely, retrospective enrollment *over*-counts them. A registry without high, outcome-independent retention is\n  actively misleading.\n- **Volunteer/self-referral selection dominates.** Voluntary registries enroll the worried-well or, for some products,\n  the higher-risk; if the comparator population does not share that selection, the prevalence ratio is biased in an\n  unknown direction. Spontaneous abortions and elective terminations occurring before enrollment are systematically\n  missed (left truncation), so live-birth-only denominators distort outcomes tied to early loss.\n- **The comparator is incomparable.** Using an external general-population malformation prevalence (MACDP/EUROCAT) for a\n  cohort defined by a serious maternal disease confounds the disease with the drug; the underlying condition (e.g.,\n  epilepsy, HIV, autoimmune disease) and concomitant medications can themselves raise MCM risk.\n\n**Data-source operational depth**.\n- **Primary registry collection (the design itself):** Intake captures exposure dates, last menstrual period (LMP) or\n  estimated gestational age, indication, dose, and concomitant drugs while the patient is still pregnant. Failure modes:\n  LMP/gestational-age error misclassifies trimester and corrupts the organogenesis window; loss to follow-up between\n  enrollment and the outcome visit is the dominant threat (target >85-90% ascertainment, and report lost-to-follow-up\n  rates by arm); retrospective reports must be analyzed separately. 
Workarounds: scheduled prenatal + post-delivery\n  contacts, incentive-neutral retention, blinded outcome adjudication, and a pre-specified plan to compare prospective\n  vs retrospective cases.\n- **Claims (FFS / MA / commercial) as the alternative or linkage source:** A claims-based pregnancy cohort needs\n  mother-infant linkage, an algorithm to estimate LMP/gestational age from delivery codes and prenatal claims, and\n  continuous medical+pharmacy enrollment across the pre-pregnancy washout and full pregnancy so exposure (NDC +\n  `fill_date` + `days_supply`) and outcomes are observable. Failure modes: Medicare Advantage and capitated person-time\n  drop fee-for-service claims, so \"unexposed\" may be unobserved exposure — restrict to enrollees with full medical +\n  pharmacy benefit and exclude MA-only person-time; spontaneous abortions and terminations are incompletely coded,\n  differentially by setting; a fill is not ingestion. Workaround: link claims to a registry for adjudicated MCM and\n  validated gestational timing, or validate the outcome algorithm against charts (PPV/sensitivity).\n- **EHR:** Adds prenatal ultrasound, problem lists, and birth records to sharpen LMP and indication, but visit-driven\n  capture means a patient who delivers outside the system is differentially lost — a registry-style informative-censoring\n  problem in another guise; mother-infant linkage in EHR is often incomplete.\n- **Linked registry + claims + birth-defects surveillance (EUROCAT/MACDP):** The strongest substrate — registry timing\n  and adjudication, claims denominator and completeness, surveillance-grade MCM classification — but linkage selects only\n  the linkable subset and introduces date-discrepancy reconciliation (LMP vs fill vs service dates) that must precede\n  trimester assignment.\n\n**Worked example.** Question: first-trimester malformation risk for a newly approved oral disease-modifying drug taken\nby people of reproductive potential, under an FDA 
post-marketing requirement. Design: prospective pregnancy registry.\n(1) Enrollment: a pregnant patient with a documented first-trimester fill (`fill_date` within\n[LMP, LMP + 97 days] using the `pregnancy-exposure-window-rwe` definition) is registered *before* any prenatal anomaly\nscan result and before the outcome; this is the prospective requirement. Record LMP, estimated gestational age,\nindication, dose, and concomitant medications at intake. (2) Comparator: an internal disease-matched unexposed cohort\nenrolled concurrently, plus an external EUROCAT/MACDP MCM prevalence as a secondary reference (~3% baseline). (3)\nFollow-up: scheduled contacts at enrollment, approximately once per trimester, and post-delivery; target >90% outcome\nascertainment; log lost-to-follow-up by arm. (4) Outcome: major congenital malformation among live births, adjudicated blinded to\nexposure using EUROCAT coding; report spontaneous abortion, stillbirth, preterm birth, and SGA secondarily; analyze\nprospectively- and retrospectively-enrolled cases separately. (5) Estimate: MCM prevalence ratio (exposed vs internal\nunexposed) with exact binomial CIs given small counts; pre-specify that detecting a doubling of the 3% background at\n80% power needs ~200 first-trimester exposed live births, and report power achieved. (6) Sensitivity: vary the exposure\nwindow (organogenesis-only vs any-first-trimester), exclude retrospectively reported cases, and bound the effect of\ndifferential loss to follow-up via a tipping-point/quantitative bias analysis.",
    "primary_category": "Study_Design",
    "tags": [
      "pregnancy-registry",
      "teratogenicity",
      "prospective-cohort",
      "major-congenital-malformation",
      "pharmacoepidemiology",
      "post-marketing-safety",
      "differential-loss-to-follow-up",
      "special-populations"
    ],
    "applies_to_study_types": [
      "pregnancy_registry"
    ],
    "data_sources": [
      "primary",
      "registry",
      "claims",
      "ehr",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1002/pds.4150",
        "url": "https://doi.org/10.1002/pds.4150",
        "citation_text": "Gelperin K, Hammad H, Leishear K, Bird ST, Taylor L, Hampp C, Sahin L. A systematic review of pregnancy exposure registries: examination of protocol-specified pregnancy outcomes, target sample size, and comparator selection. Pharmacoepidemiology and Drug Safety. 2017;26(2):208-214.",
        "year": 2017,
        "authors_short": "Gelperin et al.",
        "notes": "FDA-authored systematic review of pregnancy registry design choices; the source for target sample size (~200 first-trimester exposures to detect a doubling of background MCM risk), pre-specified outcome menus, and comparator selection."
      },
      {
        "role": "explain",
        "doi": "10.1002/pds.3659",
        "url": "https://doi.org/10.1002/pds.3659",
        "citation_text": "Sinclair S, Cunnington M, Messenheimer J, Weil J, Cragan J, Lowensohn R, Yerby M, Tennis P. Advantages and problems with pregnancy registries: observations and surprises throughout the life of the International Lamotrigine Pregnancy Registry. Pharmacoepidemiology and Drug Safety. 2014;23(8):779-786.",
        "year": 2014,
        "authors_short": "Sinclair et al.",
        "notes": "Operator-level account of enrollment, retention, loss to follow-up, and prospective-vs-retrospective case handling over the full life of a major single-drug registry."
      },
      {
        "role": "demonstrate",
        "doi": "10.1002/pds.982",
        "url": "https://doi.org/10.1002/pds.982",
        "citation_text": "Covington DL, Tilson H, Elder J, Doi P. Assessing teratogenicity of antiretroviral drugs: monitoring and analysis plan of the Antiretroviral Pregnancy Registry. Pharmacoepidemiology and Drug Safety. 2004;13(8):537-545.",
        "year": 2004,
        "authors_short": "Covington et al.",
        "notes": "Worked monitoring-and-analysis plan for the Antiretroviral Pregnancy Registry, including the prospective enrollment rule, external comparator (birth-defects surveillance prevalence), and stopping/signal thresholds."
      },
      {
        "role": "use",
        "doi": "10.1002/pds.3613",
        "url": "https://doi.org/10.1002/pds.3613",
        "citation_text": "Charlton RA, Neville AJ, Jordan S, Pierini A, Damase-Michel C, Klungsøyr K, et al. Healthcare databases in Europe for studying medicine use and safety during pregnancy. Pharmacoepidemiology and Drug Safety. 2014;23(6):586-594.",
        "year": 2014,
        "authors_short": "Charlton et al.",
        "notes": "Contrasts secondary healthcare databases (claims/EHR/birth-defects surveillance) with primary registries for pregnancy safety, covering mother-infant linkage, gestational-age estimation, and outcome ascertainment."
      }
    ],
    "plain_language_summary": "A pregnancy exposure registry is a structured study that signs up pregnant people who are already taking a specific drug and follows them forward in time until they give birth, so researchers can count how often serious birth defects occur in their babies. The key rule is that a patient must be enrolled before anyone knows whether the pregnancy will end normally or not, which prevents the study from accidentally over-counting bad outcomes. The registry answers the question: does taking this drug during the first three months of pregnancy raise the chance of a major birth defect compared with unexposed pregnancies? One honest limitation is that patients who later lose the pregnancy or deliver elsewhere often drop out, which can make the drug look safer than it really is.",
    "key_terms": [
      {
        "term": "prospective enrollment",
        "definition": "Signing a patient up for the study while the outcome is still unknown, so the researchers cannot accidentally select only the cases that ended badly."
      },
      {
        "term": "major congenital malformation (MCM)",
        "definition": "A serious structural birth defect present at birth, such as a heart septal defect or cleft palate, that is the primary safety outcome a pregnancy registry is designed to detect."
      },
      {
        "term": "organogenesis window",
        "definition": "Roughly the first 12 weeks (90 days) after the last menstrual period, when fetal organs are forming and exposure to a drug is most likely to cause a structural defect."
      },
      {
        "term": "loss to follow-up",
        "definition": "A patient who enrolled in the registry but was never reached again after delivery, so researchers never learned whether the baby had a defect or not."
      },
      {
        "term": "malformation prevalence ratio",
        "definition": "The proportion of babies with a birth defect among exposed mothers divided by the same proportion among unexposed mothers; a ratio above 1 suggests the drug may raise risk."
      }
    ],
    "worked_example": {
      "scenario": "A drug maker is required to run a pregnancy registry for a new oral pill used by people who can become pregnant. A researcher wants to understand what the registry data table looks like, what biases to watch for, and what question the registry can and cannot answer.",
      "dataset": {
        "caption": "Core registry intake and outcome table (one row per enrolled pregnancy). Columns show the patient, when she enrolled, her last menstrual period date used to assign trimester, whether outcome was known at enrollment, her arm in the study, what happened at delivery, whether a major congenital malformation was confirmed, and whether the outcome was ever ascertained.",
        "columns": [
          "person_id",
          "enroll_date",
          "lmp_date",
          "outcome_known_at_enroll",
          "arm",
          "pregnancy_end",
          "mcm_confirmed",
          "outcome_ascertained"
        ],
        "rows": [
          [
            "P001",
            "2023-02-10",
            "2023-01-15",
            "No",
            "EXPOSED",
            "LIVE_BIRTH",
            "No",
            "Yes"
          ],
          [
            "P002",
            "2023-03-05",
            "2023-02-01",
            "No",
            "EXPOSED",
            "LIVE_BIRTH",
            "Yes",
            "Yes"
          ],
          [
            "P003",
            "2023-04-12",
            "2023-03-20",
            "No",
            "EXPOSED",
            "UNKNOWN",
            "Unknown",
            "No"
          ],
          [
            "P004",
            "2023-02-28",
            "2023-02-01",
            "No",
            "UNEXPOSED",
            "LIVE_BIRTH",
            "No",
            "Yes"
          ],
          [
            "P005",
            "2023-05-01",
            "2023-04-10",
            "Yes",
            "EXPOSED",
            "LIVE_BIRTH",
            "Yes",
            "Yes"
          ]
        ]
      },
      "steps": [
        "P001 and P002 are the clean exposed cases: both enrolled before any birth outcome was known, and both delivered live babies with a confirmed outcome. P002 had a birth defect confirmed.",
        "P003 enrolled correctly (outcome unknown at enrollment) but was never reached after delivery, so her outcome is unknown. She is lost to follow-up. If babies with defects are more likely to be lost this way, the registry will under-count defects.",
        "P004 is the internal unexposed comparator: a pregnant patient with the same underlying disease but no drug exposure, enrolled at the same stage and followed identically.",
        "P005 enrolled after a defect was already found on a prenatal scan (outcome_known_at_enroll = Yes). She must be analyzed separately and excluded from the primary risk estimate, because her enrollment was triggered by a bad outcome, which would inflate the defect count.",
        "The primary analysis uses only P001, P002, and P004 (prospectively enrolled, outcome ascertained). Among the two exposed live births, 1 of 2 had a defect (50%). Among the one unexposed live birth, 0 of 1 had a defect (0%). The malformation prevalence ratio cannot be computed stably with counts this small, which illustrates why registries require roughly 200 or more exposed live births before a result is interpretable."
      ],
      "result": "The registry answers: among babies born to mothers who took this drug during the first trimester, what fraction had a serious birth defect, compared with unexposed pregnancies? With only 2 exposed live births (P001, P002) and 1 unexposed (P004), the ratio is mathematically unstable (1/2 vs 0/1). The key limitation shown here is loss to follow-up (P003): if the missing patient also had a defect, the true exposed rate is 2/3 (67%) rather than 1/2 (50%), shifting the estimate substantially. A real registry would target at least 85 to 90 percent ascertainment and report the loss-to-follow-up rate by arm."
    },
    "prerequisites": [
      "cohort-prospective",
      "attrition-and-loss-to-follow-up-rwe",
      "pregnancy-exposure-window-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Single-product manufacturer registry with external comparator",
        "description": "One exposure of interest, enrolling exposed pregnancies and comparing MCM prevalence against an external birth-defects surveillance estimate (e.g., MACDP ~3%, EUROCAT). Fast to launch for a post-marketing requirement.",
        "edge_cases": [
          "External general-population comparators ignore the indication; a disease-driven MCM excess is misattributed to the drug.",
          "No internal control means confounding by indication and concomitant medications is unaddressable analytically."
        ],
        "data_source_notes": "primary: pre-specify the external prevalence source and its case ascertainment; report prospective cases separately from retrospective reports."
      },
      {
        "name": "Comparative (internal-control) prospective cohort (OTIS/MotherToBaby style)",
        "description": "Concurrently enrolls exposed, disease-matched unexposed, and sometimes healthy unexposed pregnancies, enabling an internal prevalence ratio that separates the drug from the underlying condition.",
        "edge_cases": [
          "Multi-arm recruitment is slower; the disease-matched arm can be hard to populate for rare indications.",
          "Volunteer selection must be similar across arms or the internal contrast is itself biased."
        ],
        "data_source_notes": "primary: enroll all arms before outcome is known; blind outcome adjudication to exposure across arms."
      },
      {
        "name": "Registry linked to claims / birth-defects surveillance",
        "description": "Registry exposure and timing linked to a claims denominator and to EUROCAT/MACDP-grade MCM classification to improve power and completeness while keeping adjudicated outcomes and validated gestational timing.",
        "edge_cases": [
          "Linkage selects the linkable subset (selection bias) and requires reconciling LMP vs fill vs service dates before trimester assignment.",
          "Spontaneous abortions/terminations are incompletely captured in claims, biasing live-birth-restricted denominators."
        ],
        "data_source_notes": "linked: validate the claims outcome algorithm (PPV/sensitivity) against adjudicated registry cases."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Retrospective claims/EHR-based pregnancy cohort (mother-infant linked)",
        "pros_of_this": "Enrolls before the outcome is known (limits recall/ascertainment bias); captures LMP, trimester of exposure, dose, indication, and adjudicated malformation detail that claims rarely contain.",
        "cons_of_this": "Slow, costly, volunteer-driven, and usually underpowered; smaller denominator than a large linked claims cohort; live-birth restriction and left truncation distort early-loss outcomes.",
        "when_to_prefer": "New, rare, or pregnancy-specific exposures with no validated claims outcome algorithm or mother-infant linkage; when first-trimester timing and adjudicated MCM are essential."
      },
      {
        "compared_to": "Spontaneous adverse-event reporting (FAERS/EudraVigilance)",
        "pros_of_this": "Has a defined denominator and can estimate a malformation risk and exposure-window contrast.",
        "cons_of_this": "Far slower and narrower in coverage; expensive to run and retain.",
        "when_to_prefer": "When a quantified, labelable malformation risk is required rather than a hypothesis-generating signal."
      },
      {
        "compared_to": "Single-arm registry with external comparator",
        "pros_of_this": "An internal disease-matched comparator separates the drug effect from the underlying maternal condition and concomitant medications.",
        "cons_of_this": "Multi-arm enrollment is slower and harder to populate, especially for rare indications.",
        "when_to_prefer": "When the indication is itself associated with malformation risk (epilepsy, HIV, autoimmune disease)."
      }
    ],
    "implementation_notes_by_data_source": {
      "primary": "Enroll prospectively (before outcome known); record LMP/gestational age, trimester of exposure, dose, indication, concomitants at intake; schedule trimester + post-delivery contacts targeting >85-90% ascertainment; adjudicate MCM blinded to exposure (EUROCAT coding); analyze prospective vs retrospective cases separately and report loss to follow-up by arm.",
      "claims": "Requires mother-infant linkage and an LMP/gestational-age algorithm from delivery and prenatal codes; require continuous medical+pharmacy enrollment across pre-pregnancy washout and the full pregnancy; exposure = NDC + fill_date + days_supply intersected with the trimester window; exclude MA-only person-time where FFS claims are missing; validate the outcome algorithm (PPV/sensitivity) against charts.",
      "ehr": "Use prenatal ultrasound, problem lists, and birth records to refine LMP and indication; treat delivery outside the system as differential loss (informative censoring); mother-infant linkage is often incomplete.",
      "linked": "Registry + claims + birth-defects surveillance is the strongest substrate but selects the linkable subset and requires reconciling LMP/fill/service dates before assigning trimester of exposure."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\nimport numpy as np\n\nFIRST_TRIMESTER_DAYS = 90  # exposure window from LMP defining first-trimester / organogenesis exposure\n\ndef build_pregnancy_registry_cohort(intake: pd.DataFrame,\n                                    fills: pd.DataFrame,\n                                    outcomes: pd.DataFrame) -> pd.DataFrame:\n    # PROSPECTIVE restriction: outcome must NOT have been known at enrollment.\n    # Retrospectively reported pregnancies are held out of the primary risk estimate.\n    prosp = intake[~intake[\"outcome_known_at_enroll\"]].copy()\n\n    # First-trimester exposure: any study-drug fill within [LMP, LMP + 90d] for EXPOSED pregnancies.\n    f = fills.merge(prosp[[\"person_id\", \"lmp_date\", \"arm\"]], on=\"person_id\", how=\"inner\")\n    f[\"days_from_lmp\"] = (f[\"fill_date\"] - f[\"lmp_date\"]).dt.days\n    first_tri = f[(f[\"days_from_lmp\"] >= 0) & (f[\"days_from_lmp\"] <= FIRST_TRIMESTER_DAYS)]\n    first_tri_ids = set(first_tri[\"person_id\"].unique())\n\n    prosp[\"first_trimester_exposed\"] = prosp[\"person_id\"].isin(first_tri_ids)\n    # EXPOSED arm must actually have a first-trimester fill; UNEXPOSED must have none.\n    prosp = prosp[~((prosp[\"arm\"] == \"EXPOSED\") & (~prosp[\"first_trimester_exposed\"])) &\n                  ~((prosp[\"arm\"] == \"UNEXPOSED\") & (prosp[\"first_trimester_exposed\"]))]\n\n    cohort = prosp.merge(\n        outcomes[[\"person_id\", \"outcome_ascertained\", \"mcm\", \"pregnancy_end\"]],\n        on=\"person_id\", how=\"left\")\n\n    # Informative loss to follow-up is the headline threat: keep unascertained rows visible, do not drop silently.\n    cohort[\"lost_to_follow_up\"] = ~cohort[\"outcome_ascertained\"].fillna(False)\n    return cohort[[\"person_id\", \"arm\", \"lmp_date\", \"enroll_date\",\n                   \"first_trimester_exposed\", \"outcome_ascertained\",\n                   \"lost_to_follow_up\", \"pregnancy_end\", \"mcm\"]]\n\ndef 
mcm_prevalence_ratio(cohort: pd.DataFrame) -> dict:\n    # MCM prevalence among live births with ascertained outcome, exposed vs unexposed (the registry estimand).\n    lb = cohort[(cohort[\"pregnancy_end\"] == \"LIVE_BIRTH\") & cohort[\"outcome_ascertained\"]]\n    g = lb.groupby(\"arm\")[\"mcm\"].agg([\"sum\", \"count\"])\n    p_exp = g.loc[\"EXPOSED\", \"sum\"] / g.loc[\"EXPOSED\", \"count\"]\n    p_unexp = g.loc[\"UNEXPOSED\", \"sum\"] / g.loc[\"UNEXPOSED\", \"count\"]\n    return {\"p_exposed\": p_exp, \"p_unexposed\": p_unexp,\n            \"prevalence_ratio\": p_exp / p_unexp,\n            \"n_exposed_lb\": int(g.loc[\"EXPOSED\", \"count\"]),\n            \"ltfu_rate_by_arm\": cohort.groupby(\"arm\")[\"lost_to_follow_up\"].mean().to_dict()}",
        "description": "Pregnancy-registry cohort construction and first-trimester exposure classification from registry intake + outcome\ndata. Required inputs (already cleaned, one row per enrolled pregnancy unless noted):\n  intake   : person_id, lmp_date (datetime), enroll_date (datetime), outcome_known_at_enroll (bool),\n             arm in {'EXPOSED','UNEXPOSED'}\n  fills    : person_id, fill_date (datetime), days_supply        # study-drug dispensings during pregnancy\n  outcomes : person_id, outcome_ascertained (bool), mcm (0/1/NaN), pregnancy_end in\n             {'LIVE_BIRTH','SAB','STILLBIRTH','ELECTIVE_TERM'}, outcome_date (datetime)\nReturns analysis-ready rows for the PRIMARY (prospective) analysis: prospectively enrolled pregnancies with a\nfirst-trimester (organogenesis-window) exposure, flagged for ascertainment so informative loss to follow-up is visible.",
        "dependencies": [
          "pandas",
          "numpy"
        ],
        "source_citations": [
          "gelperin-2017",
          "covington-2004"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\nFIRST_TRIMESTER_DAYS <- 90L  # exposure window from LMP (organogenesis)\n\nbuild_pregnancy_registry_cohort <- function(intake, fills, outcomes) {\n  setDT(intake); setDT(fills); setDT(outcomes)\n\n  # PROSPECTIVE restriction: drop pregnancies whose outcome was already known at enrollment.\n  prosp <- intake[outcome_known_at_enroll == FALSE]\n\n  f <- merge(fills, prosp[, .(person_id, lmp_date, arm)], by = \"person_id\")\n  f[, days_from_lmp := as.integer(fill_date - lmp_date)]\n  first_tri_ids <- unique(f[days_from_lmp >= 0 & days_from_lmp <= FIRST_TRIMESTER_DAYS, person_id])\n\n  prosp[, first_trimester_exposed := person_id %chin% first_tri_ids]\n  prosp <- prosp[!(arm == \"EXPOSED\"   & first_trimester_exposed == FALSE) &\n                 !(arm == \"UNEXPOSED\" & first_trimester_exposed == TRUE)]\n\n  cohort <- merge(prosp,\n                  outcomes[, .(person_id, outcome_ascertained, mcm, pregnancy_end)],\n                  by = \"person_id\", all.x = TRUE)\n  cohort[, lost_to_follow_up := is.na(outcome_ascertained) | outcome_ascertained == FALSE]\n  cohort[, .(person_id, arm, lmp_date, enroll_date, first_trimester_exposed,\n             outcome_ascertained, lost_to_follow_up, pregnancy_end, mcm)]\n}\n\nmcm_prevalence_ratio <- function(cohort) {\n  lb <- cohort[pregnancy_end == \"LIVE_BIRTH\" & outcome_ascertained == TRUE]\n  g  <- lb[, .(events = sum(mcm), n = .N), by = arm]\n  e  <- g[arm == \"EXPOSED\"];  u <- g[arm == \"UNEXPOSED\"]\n  ci_e <- binom.test(e$events, e$n)$conf.int  # exact binomial given small counts\n  list(p_exposed = e$events / e$n, p_unexposed = u$events / u$n,\n       prevalence_ratio = (e$events / e$n) / (u$events / u$n),\n       p_exposed_ci = as.numeric(ci_e), n_exposed_lb = e$n,\n       ltfu_rate_by_arm = cohort[, mean(lost_to_follow_up), by = arm])\n}",
        "description": "Pregnancy-registry cohort construction with data.table. Inputs mirror the Python version:\n  intake   : person_id, lmp_date (Date), enroll_date (Date), outcome_known_at_enroll (logical), arm ('EXPOSED'/'UNEXPOSED')\n  fills    : person_id, fill_date (Date), days_supply\n  outcomes : person_id, outcome_ascertained (logical), mcm (0/1/NA), pregnancy_end, outcome_date (Date)\nReturns prospectively enrolled, first-trimester-classified rows with loss-to-follow-up flagged; the helper computes the\nexposed-vs-unexposed MCM prevalence ratio among ascertained live births with exact binomial CIs.",
        "dependencies": [
          "data.table"
        ],
        "source_citations": [
          "gelperin-2017",
          "sinclair-2014"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "%let first_tri = 90;  /* days from LMP defining first-trimester / organogenesis exposure */\n\n/* PROSPECTIVE restriction: keep only pregnancies whose outcome was NOT known at enrollment. */\nproc sql;\n  create table prosp as\n  select * from work.intake where outcome_known_at_enroll = 0;\nquit;\n\n/* First-trimester exposure: any study-drug fill within [LMP, LMP+90] days. */\nproc sql;\n  create table first_tri as\n  select distinct p.person_id\n  from prosp p\n  inner join work.fills f\n    on f.person_id = p.person_id\n   and f.fill_date >= p.lmp_date\n   and f.fill_date <= p.lmp_date + &first_tri;\nquit;\n\n/* Require EXPOSED arm to have a first-trimester fill and UNEXPOSED arm to have none. */\nproc sql;\n  create table cohort as\n  select p.person_id, p.arm, p.lmp_date, p.enroll_date,\n         (case when ft.person_id is not null then 1 else 0 end) as first_trimester_exposed,\n         coalesce(o.outcome_ascertained, 0) as outcome_ascertained,\n         (1 - coalesce(o.outcome_ascertained, 0)) as lost_to_follow_up,  /* informative LTFU = headline bias */\n         o.pregnancy_end, o.mcm\n  from prosp p\n  left join first_tri ft on ft.person_id = p.person_id\n  left join work.outcomes o on o.person_id = p.person_id\n  where not (p.arm = 'EXPOSED'   and ft.person_id is null)\n    and not (p.arm = 'UNEXPOSED' and ft.person_id is not null);\nquit;\n\n/* MCM prevalence ratio among ascertained live births, with exact (Clopper-Pearson) binomial CIs by arm. */\nproc freq data=cohort (where=(pregnancy_end='LIVE_BIRTH' and outcome_ascertained=1));\n  tables arm*mcm / riskdiff(cl=exact) relrisk;          /* prevalence ratio + exact CIs */\n  exact riskdiff;\nrun;\n\n/* Report loss-to-follow-up by arm: differential LTFU is the primary threat to validity. */\nproc freq data=cohort;\n  tables arm*lost_to_follow_up / nocol nopercent;\nrun;",
        "description": "Pregnancy-registry cohort construction in SAS (PROC SQL) plus the MCM prevalence ratio with exact binomial CIs\n(PROC FREQ). Required input datasets (post data-management):\n  work.intake   : person_id, lmp_date, enroll_date, outcome_known_at_enroll (0/1), arm ('EXPOSED'/'UNEXPOSED')\n  work.fills    : person_id, fill_date, days_supply\n  work.outcomes : person_id, outcome_ascertained (0/1), mcm (0/1/.), pregnancy_end, outcome_date\nKeeps only prospectively enrolled pregnancies, classifies first-trimester (organogenesis) exposure, flags loss to\nfollow-up, and estimates the exposed-vs-unexposed malformation prevalence ratio among ascertained live births.",
        "dependencies": [],
        "source_citations": [
          "gelperin-2017"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Src[Pregnant patient with documented exposure] --> Prosp{Outcome already known<br/>at enrollment?}\n  Prosp -- Yes --> Retro[Retrospective report<br/>analyzed SEPARATELY, excluded from primary risk]\n  Prosp -- No --> Enroll[Prospective enrollment<br/>record LMP, trimester, dose, indication, concomitants]\n  Enroll --> Window[Classify exposure window<br/>first-trimester / organogenesis vs later]\n  Window --> Comp[Comparator: internal disease-matched unexposed<br/>or external EUROCAT/MACDP prevalence]\n  Comp --> Fup[Active follow-up: trimester + post-delivery contacts<br/>target >85-90% ascertainment]\n  Fup --> Adj[Blinded MCM adjudication EUROCAT coding<br/>+ SAB / stillbirth / preterm / SGA]\n  Adj --> Est[MCM prevalence ratio with exact CIs<br/>report loss to follow-up by arm]",
        "caption": "Operational flow of a prospective pregnancy exposure registry. Prospective (outcome-blind) enrollment is the defining step; retrospective reports are held out of the primary estimate, and differential loss to follow-up between enrollment and outcome adjudication is the dominant threat to validity.",
        "alt_text": "Flowchart from an exposed pregnant patient through a check for whether the outcome is already known, prospective enrollment with exposure timing, exposure-window classification, comparator choice, active follow-up, blinded malformation adjudication, and the prevalence-ratio estimate.",
        "source_type": "illustrative",
        "source_citations": [
          "gelperin-2017"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  A[Differential loss to follow-up] --> B[Bad outcomes drop out more often]\n  B --> C[Ascertained sample under-represents defects]\n  C --> D[Prevalence ratio biased toward null]\n  E[Volunteer / self-referral selection] --> F[Enrollees differ from comparator population]\n  F --> D\n  G[Retrospective enrollment after known defect] --> H[Defects over-represented]\n  H --> I[Prevalence ratio biased away from null]\n  J[External general-population comparator] --> K[Maternal disease + concomitants unaccounted]\n  K --> I",
        "caption": "The four bias channels that distinguish pregnancy-registry inference. Informative loss to follow-up and volunteer selection typically bias the malformation prevalence ratio toward the null; retrospective enrollment and an incomparable external comparator bias it away from the null.",
        "alt_text": "Diagram showing differential loss to follow-up and volunteer selection biasing the estimate toward the null, and retrospective enrollment and external comparators biasing it away from the null.",
        "source_type": "illustrative",
        "source_citations": [
          "sinclair-2014"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "is_variant_of",
        "target_slug": "cohort-prospective",
        "notes": "A pregnancy registry is a prospective cohort with outcome-blind enrollment and a deliberately engineered, LMP-anchored time zero."
      },
      {
        "relation_type": "used_with",
        "target_slug": "pregnancy-exposure-window-rwe",
        "notes": "First-trimester (organogenesis) exposure classification from LMP is the exposure-window logic that defines registry eligibility and the primary contrast."
      },
      {
        "relation_type": "used_with",
        "target_slug": "mother-infant-linkage-rwe",
        "notes": "Linking the maternal exposure to infant outcomes is required when augmenting or replacing primary registry collection with claims/EHR data."
      },
      {
        "relation_type": "see_also",
        "target_slug": "attrition-and-loss-to-follow-up-rwe",
        "notes": "Differential, outcome-dependent loss to follow-up is the registry's signature bias and must be quantified and bias-analyzed."
      },
      {
        "relation_type": "see_also",
        "target_slug": "selection-bias-sensitivity-analysis-rwe",
        "notes": "Volunteer/self-referral enrollment and left truncation of early pregnancy losses induce selection bias that requires explicit sensitivity analysis."
      },
      {
        "relation_type": "see_also",
        "target_slug": "disease-registry",
        "notes": "A disease registry is anchored on a condition rather than an exposure; a pregnancy registry is anchored on a defined prenatal exposure with the pregnancy outcome as the endpoint."
      },
      {
        "relation_type": "alternative_to",
        "target_slug": "single-arm-external-control",
        "notes": "Single-product registries with an external malformation-prevalence comparator share the strengths and confounding risks of single-arm designs with external controls."
      },
      {
        "relation_type": "used_with",
        "target_slug": "pass-imposed",
        "notes": "Pregnancy registries are frequently the operational form of an imposed or committed post-authorization safety study for pregnancy outcomes."
      }
    ],
    "aliases": [
      "pregnancy exposure registry",
      "pregnancy safety registry",
      "teratogen registry",
      "prospective pregnancy registry"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "ema"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "prescription-sequence-symmetry",
    "name": "Prescription Sequence Symmetry Analysis (PSSA)",
    "short_definition": "A self-controlled signal-detection method that tests whether one drug is dispensed disproportionately before or after a second drug (or event marker) among incident users of both, estimating an adjusted sequence ratio after correcting for background prescribing trends.",
    "long_description": "**Prescription Sequence Symmetry Analysis (PSSA)** asks a deceptively simple question: among\npeople who started *both* an index drug (A) and a marker drug or event (B), did A tend to come\nbefore B more often than B came before A? Under the sharp null of no causal relationship, and\nin the absence of any secular prescribing trend, the order should be symmetric — roughly half\nthe discordant pairs would run A→B and half B→A. An excess of A→B pairs is a signal that\nstarting A may trigger the condition that B treats (a putative adverse drug event), where B is\nused as a proxy marker for the outcome. Because each person serves as their own control, PSSA\ncancels all *time-invariant* between-person confounders (genetics, chronic comorbidity, frailty,\nsocioeconomic status) without measuring them — the same logic that powers the case-crossover and\nself-controlled case series, applied to two prescriptions instead of one exposure and one event.\n\n**Core estimand distinction.** PSSA does not estimate a hazard ratio, risk ratio, or\ndrug-vs-drug contrast. Its target is the **sequence ratio (SR)** = n(A→B) / n(B→A), the ratio of\nthe count of incident users whose first A preceded their first B to the count whose first B\npreceded their first A (within a symmetric pre/post window). The SR is biased upward purely by\ngrowth in prescribing: if A's incidence is rising over calendar time, more A→B orderings appear\nfor arithmetic reasons alone. PSSA corrects this with the **null-effect sequence ratio (NSR)** —\nthe SR expected under no causal effect, derived from the *waiting-time distribution* and the\nbackground incidence trends of the two drugs (Hallas's order-statistic / Tsiropoulos\ntrend-modeling approach). The reported quantity is the **adjusted sequence ratio (aSR)** =\nSR / NSR, with an exact-binomial or asymptotic 95% CI on the underlying discordant-pair\nproportion. 
The aSR is interpreted as a *relative incidence* of B in the period after A versus\nbefore A, under the assumption that within-person time-invariant confounding is constant and\nthat the only systematic asymmetry beyond the causal effect is the prescribing trend captured by\nthe NSR. An aSR meaningfully above 1 flags a potential adverse effect of A; an aSR below 1 can\nsignal a protective effect or, more often, depletion/reverse-causation artifacts.\n\n**Interpreting the output**\n\nFrom the worked example: among incident dual users of thiazide diuretics (A) and\ngout therapy (B), n(A→B) = 1,820 and n(B→A) = 1,150. Crude SR = 1,820 / 1,150 =\n1.58 (exact-binomial 95% CI approximately 1.47–1.71). After modeling background\nprescribing trends, NSR = 1.12. Adjusted sequence ratio aSR = 1.58 / 1.12 = 1.41\n(95% CI approximately 1.31–1.53).\n\nFormal interpretation: The aSR of 1.41 means that, after correcting for the secular\ntrend in thiazide initiation (rising use generates excess A→B orderings by calendar\ntime alone), thiazide initiators were dispensed a gout therapy in the period after\ntheir first thiazide at approximately 1.41 times the rate seen in the period before\ntheir thiazide initiation. Because each person serves as their own control, this\ncomparison cancels all time-invariant between-person confounders. The temporal\nasymmetry is interpreted as a within-person relative incidence signal, not an\nabsolute risk or a population-level causal effect. The 95% CI is derived from the\nexact-binomial uncertainty on the proportion n(A→B) / [n(A→B) + n(B→A)].\n\nPractical interpretation: The aSR of 1.41 is a positive pharmacovigilance signal\nconsistent with thiazide-induced gout. It does not quantify the population-level\nexcess risk of gout, cannot separate the effect from confounding by indication\n(conditions predisposing to both thiazide use and gout), and should not be treated\nas a magnitude estimate for regulatory or clinical decision-making. 
A time-trend\ncaveat is mandatory: any omission of the NSR step — reporting the crude SR of 1.58\n— conflates a pharmacovigilance signal with a rising-use artifact.\n\n**Pros, cons, and trade-offs.**\n- **vs cohort / active-comparator new-user designs:** PSSA needs only dispensing dates for two\n  drugs — no outcome adjudication, no covariate measurement, no comparator selection — so it is\n  fast, cheap, and self-controlling for stable confounders, making it the workhorse of automated\n  pharmacovigilance over national prescription registries. Cost: it answers only \"is there an\n  asymmetric temporal signal,\" not \"what is the effect size in a target population.\" It cannot\n  separate the effect of A from confounding *by indication for B*, and the aSR is not a\n  transportable causal effect. **Prefer a cohort/ACNU design** when you need a defensible\n  magnitude, an absolute risk, or a regulatory effectiveness claim.\n- **vs case-crossover (CCO):** Both are self-controlled and cancel time-invariant confounders.\n  CCO compares exposure in case windows vs control windows *within cases of a defined outcome*;\n  PSSA uses a second prescription as a surrogate outcome marker and looks at the symmetry of two\n  incident initiations. **Prefer CCO** when you have a sharply defined acute event and a known\n  transient exposure; **prefer PSSA** for hypothesis-free screening of drug→drug(→event) signals\n  where the outcome is operationalized by a treatment marker.\n- **vs self-controlled case series (SCCS):** SCCS models the within-person incidence rate ratio\n  of a recurrent/acute event across exposed vs unexposed person-time and handles multiple events\n  and time-varying exposure formally; it requires a clean event definition and exposure windows.\n  PSSA is far simpler and trend-robust via the NSR but is coarser (one ordering per person, no\n  rate model). 
**Prefer SCCS** when you can define the event and want a rate ratio; **prefer\n  PSSA** for rapid, scalable screening.\n- **vs disproportionality analysis on spontaneous reports:** PSSA uses longitudinal dispensing\n  data (real denominators, real timing) rather than voluntary reports, so it is far less subject\n  to reporting bias and stimulated reporting, though it can only detect events whose treatment is\n  itself a dispensed drug.\n\n**When to use.** Hypothesis-free or hypothesis-light screening of large longitudinal dispensing\ndatabases (Nordic prescription registries, Medicare Part D, commercial pharmacy claims) for\nadverse-event signals; rapid prioritization of drug-safety hypotheses before committing to a full\ncohort study; settings where the outcome is reliably treated with an identifiable marker drug\n(e.g., a cough suppressant for ACE-inhibitor cough, a gout therapy for diuretic-induced gout, an\nantiparkinsonian for antipsychotic-induced parkinsonism); and confirmatory triangulation\nalongside a cohort or SCCS analysis. PSSA shines when between-person confounding is severe and\nlargely time-invariant, because the design eliminates it without measurement.\n\n**When NOT to use — and when it is actively misleading or dangerous.**\n- **The marker drug treats indications unrelated to the suspected event.** If B is used for many\n  conditions, an A→B asymmetry may reflect confounding by indication for B, not an effect of A.\n  The signal is then a mirage. Diagnose with a negative-control marker drug that should show\n  aSR ≈ 1.\n- **You skip the NSR / trend adjustment.** A drug whose use is rising (a new launch, a guideline\n  change) generates spurious A→B excess from calendar time alone. 
Reporting a crude SR as if it\n  were causal is the single most common and most dangerous PSSA error.\n- **Either drug is used by prevalent (non-incident) users.** PSSA requires the *first* dispensing\n  of both A and B within the observation window with a clean washout; carrying prevalent users\n  into the pair set destroys the symmetry logic and biases the SR unpredictably.\n- **A itself treats the condition that B marks (reverse causation), or induction is near-zero.**\n  If B is started to treat the very symptom that led to A, or if A and B are co-initiated for the\n  same syndrome, the temporal ordering is meaningless. Very short symmetry windows let\n  reverse-causation dominate.\n- **Channeling: the indication for B was already present when A was started.** Then B was always\n  going to follow A regardless of any causal effect; a negative-control marker is essential to\n  rule this out.\n- **You over-interpret an aSR as a magnitude or a population effect.** PSSA is a screen. Treating\n  an aSR of 1.8 as \"an 80% increase in risk\" for decision-making, or generalizing it beyond the\n  incident-pair population, is methodologically indefensible.\n\n**Data-source operational depth.**\n- **Claims (FFS + Part D / commercial pharmacy):** Exposure for both A and B is the pharmacy\n  claim (NDC + `fill_date`). Require continuous medical + pharmacy enrollment across the full\n  washout so that the *first observed* fill is the true incident fill, not the first *captured*\n  fill. Failure mode: **Medicare Advantage and capitated person-time lack fee-for-service claims**,\n  so an MA-only interval hides earlier dispensings and manufactures a false \"first\" date —\n  asymmetry can then reflect enrollment artifacts. Restrict to enrollees with both Parts A/B/D\n  (or a commercial pharmacy benefit) and exclude MA-only person-time. 
Sample fills, 90-day\n  mail-order, and free samples shift apparent initiation dates and must be reconciled.\n- **EHR:** Initiation is the medication *order*, not the dispense; if A is captured at order time\n  but B (treated elsewhere) is only captured when filled, ascertainment is asymmetric and biases\n  the SR. Prefer linkage to pharmacy fills for both drugs, and treat patients who leave the\n  system (visit-driven capture) as differentially censored.\n- **Registry:** National prescription registries (Denmark, Sweden, Norway, Australia PBS) are the\n  canonical PSSA substrate — near-complete dispensing, stable denominators, long lookback. Disease\n  registries, by contrast, are usually weak for pharmacy and must be linked to dispensing data.\n- **Linked claims–EHR–registry:** Strongest for confirming incident status across systems and for\n  a clean marker definition, but linkage selection and order/fill/service date discrepancies must\n  be reconciled before assigning the index/marker ordering, or the symmetry test is corrupted at\n  the source.\n\n**Worked claims example.** Question: does initiating a thiazide diuretic (A) trigger gout, using\nfirst dispensing of a gout therapy — allopurinol or colchicine (B) — as the event marker, in a\ncommercial + Medicare FFS database. (1) **Eligibility:** adults with ≥365 days of continuous\nmedical + pharmacy enrollment, FFS-observable (exclude MA-only person-time). (2) **Incident-pair\nset:** keep persons whose *first* thiazide fill and *first* gout-therapy fill both fall inside a\n3-year observation window, each preceded by a 365-day washout with no prior fill of that drug\nclass. (3) **Symmetry window:** count a pair as discordant if A and B initiations are within ±12\nmonths of each other; classify as A→B (thiazide first) or B→A (gout therapy first). Suppose\nn(A→B) = 1,820 and n(B→A) = 1,150. (4) **Crude SR** = 1820 / 1150 = 1.58, exact-binomial 95% CI\non the proportion 1820/2970 = 0.613 → SR CI roughly 1.47–1.71. 
(5) **NSR:** model the background\nincidence trends of thiazide and gout-therapy initiation over the same calendar period (waiting-\ntime / order-statistic approach); suppose rising gout-therapy (marker) initiation yields NSR = 1.12,\nbecause a marker whose incidence grows over the window tends to fall after the index drug even\nwith no effect. (6) **aSR** =\nSR / NSR = 1.58 / 1.12 = 1.41 (95% CI ~1.31–1.53), a clear positive signal consistent with the\nknown thiazide–gout relationship. (7) **Negative control:** repeat with an unrelated marker (e.g.,\na topical dermatologic) that thiazides should not trigger; aSR ≈ 1 supports specificity. (8)\n**Sensitivity:** vary the symmetry window (±6, ±24 months), tighten the washout, and confirm the\nsignal is not driven by co-initiation or reverse causation at very short induction periods.",
    "primary_category": "Causal_Inference_Method",
    "tags": [
      "sequence-symmetry",
      "signal-detection",
      "pharmacovigilance",
      "self-controlled",
      "prescription-sequence-symmetry",
      "adjusted-sequence-ratio",
      "waiting-time-distribution",
      "pharmacoepidemiology"
    ],
    "applies_to_study_types": [
      "prescription_sequence_symmetry"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1097/00001648-199609000-00005",
        "url": "https://doi.org/10.1097/00001648-199609000-00005",
        "citation_text": "Hallas J. Evidence of depression provoked by cardiovascular medication: a prescription sequence symmetry analysis. Epidemiology. 1996;7(5):478-484.",
        "year": 1996,
        "authors_short": "Hallas",
        "notes": "Foundational paper that introduced prescription sequence symmetry analysis and the null-effect sequence ratio adjustment for secular prescribing trends."
      },
      {
        "role": "explain",
        "doi": "10.1007/s10654-017-0281-8",
        "url": "https://doi.org/10.1007/s10654-017-0281-8",
        "citation_text": "Lai EC, Pratt N, Hsieh CY, et al. Sequence symmetry analysis in pharmacovigilance and pharmacoepidemiologic studies. European Journal of Epidemiology. 2017;32(7):567-582.",
        "year": 2017,
        "authors_short": "Lai et al.",
        "notes": "Comprehensive methods review covering the sequence/adjusted sequence ratio, trend adjustment, assumptions, and reporting standards for SSA/PSSA."
      },
      {
        "role": "demonstrate",
        "doi": "10.1007/s40264-015-0391-8",
        "url": "https://doi.org/10.1007/s40264-015-0391-8",
        "citation_text": "Wahab IA, Pratt NL, Kalisch LM, Roughead EE. Sequence symmetry analysis as a signal detection tool for potential heart failure adverse events in an administrative claims database. Drug Safety. 2016;39(4):347-354.",
        "year": 2016,
        "authors_short": "Wahab et al.",
        "notes": "Worked application of SSA for adverse-event signal detection in administrative claims, illustrating the adjusted-sequence-ratio workflow."
      },
      {
        "role": "explain",
        "doi": "10.1002/pds.3417",
        "url": "https://doi.org/10.1002/pds.3417",
        "citation_text": "Wahab IA, Pratt NL, Wiese MD, Kalisch LM, Roughead EE. The validity of sequence symmetry analysis (SSA) for adverse drug reaction signal detection. Pharmacoepidemiology and Drug Safety. 2013;22(5):496-502.",
        "year": 2013,
        "authors_short": "Wahab et al.",
        "notes": "Evaluates SSA sensitivity/specificity against a reference set of known adverse drug reactions, clarifying when the method does and does not perform well."
      }
    ],
    "plain_language_summary": "Prescription Sequence Symmetry Analysis (PSSA) is a drug-safety screening method that asks a simple question: among patients who were newly started on both a study drug (drug A) and a second drug that treats a known side-effect condition (drug B), did A tend to come before B far more often than B came before A? If the ordering were random, the two directions — A-then-B and B-then-A — would appear roughly equally often. A large excess of A-then-B patients is a signal that starting A may be triggering the condition that B treats. Because each patient serves as their own control, the method cancels out stable, between-person differences such as genetics or long-standing health conditions without ever having to measure them.",
    "key_terms": [
      {
        "term": "incident user",
        "definition": "A patient with no record of filling a given drug during the washout period (here, 365 days of continuous enrollment) before their first observed fill, so that fill can be treated as a true first start rather than a continuation of an old prescription."
      },
      {
        "term": "symmetry window",
        "definition": "The maximum number of days allowed between a patient's first fill of drug A and their first fill of drug B for the pair to count in the analysis — typically ±365 days."
      },
      {
        "term": "sequence ratio (SR)",
        "definition": "The count of patients whose first fill of drug A came before their first fill of drug B, divided by the count whose first fill of B came before A — a ratio above 1 means A-before-B is the more common order."
      },
      {
        "term": "null-effect sequence ratio (NSR)",
        "definition": "The sequence ratio you would expect to see even if drug A caused nothing, calculated from background trends in how often each drug is being newly prescribed over the same time period."
      },
      {
        "term": "adjusted sequence ratio (aSR)",
        "definition": "The crude sequence ratio divided by the null-effect sequence ratio — this is the final reported number after correcting for background prescribing trends; a value meaningfully above 1 is flagged as a potential safety signal."
      },
      {
        "term": "marker drug",
        "definition": "Drug B — a medication whose newly prescribed use signals that the patient likely developed a specific condition, serving as a proxy for that condition in place of a separately recorded diagnosis."
      }
    ],
    "worked_example": {
      "scenario": "A pharmacoepidemiology team wants to know whether starting a thiazide diuretic (a common blood-pressure drug, drug A) increases the risk of gout. Instead of looking for a gout diagnosis code, they use first-time dispensing of a gout therapy — allopurinol or colchicine — as the marker drug (drug B). They pull all adult patients in a large insurance claims database who started both a thiazide and a gout therapy for the first time within a three-year window, with no prior fills of either drug class in the preceding year. They then ask: in how many of these patients did the thiazide come first, and in how many did the gout therapy come first?",
      "dataset": {
        "caption": "Illustrative first-fill table — one row per drug per patient (the two rows for patient 1001 represent their one incident thiazide fill and their one incident gout-therapy fill).",
        "columns": [
          "person_id",
          "drug_role",
          "drug_name",
          "first_fill_date"
        ],
        "rows": [
          [
            1001,
            "A (index)",
            "hydrochlorothiazide",
            "2024-02-15"
          ],
          [
            1001,
            "B (marker)",
            "allopurinol",
            "2024-07-10"
          ]
        ]
      },
      "steps": [
        "Patient 1001 started hydrochlorothiazide on 2024-02-15 (drug A, the index drug).",
        "Patient 1001 then started allopurinol on 2024-07-10 (drug B, the gout marker) — 146 days later.",
        "146 days falls within the ±365-day symmetry window, so this patient counts as a discordant pair.",
        "Because A came before B, this patient is tallied in the A→B column.",
        "Repeat this classification for every patient in the database who was incident on both drugs.",
        "In the full study population: 1,820 patients had A→B ordering and 1,150 had B→A ordering (2,970 discordant pairs total).",
        "Crude sequence ratio = 1,820 ÷ 1,150 = 1.58 — A-before-B is 58% more common than B-before-A.",
        "But gout-therapy (marker drug) prescribing was growing over this calendar period, which pushes first B fills later in the window and arithmetically inflates A→B counts for reasons unrelated to thiazides. The trend adjustment yields a null-effect sequence ratio (NSR) of 1.12.",
        "Adjusted sequence ratio = 1.58 ÷ 1.12 = 1.41: after removing the trend effect, A-before-B is still 41% more common than expected under no effect, flagging a signal consistent with the known thiazide-gout relationship."
      ],
      "result": {
        "label": "Adjusted sequence ratio (aSR) = 1.41 (95% CI approximately 1.31–1.53) — a positive safety signal; A-before-B ordering is 41% more common than expected under no causal effect, consistent with thiazide triggering gout.",
        "value": 1.41
      },
      "timeline_spec": {
        "title": "One A→B discordant pair: hydrochlorothiazide (A) then allopurinol (B) for patient 1001",
        "caption": "Patient 1001's timeline showing the washout period confirming no prior fills of either drug, the first thiazide fill on 2024-02-15 (drug A start), the 146-day gap before the first allopurinol fill on 2024-07-10 (drug B start), and the classification of this patient as an A→B ordering. Across 2,970 such discordant patients, 1,820 are A→B versus 1,150 B→A, yielding a crude SR of 1.58 and an aSR of 1.41 after trend correction.",
        "alt_text": "Horizontal timeline for patient 1001. A shaded washout band spans the year before 2024-02-15 with a label confirming no prior fills of hydrochlorothiazide or allopurinol. A point event marker labeled 'Drug A start: hydrochlorothiazide 2024-02-15' sits at the left edge of the observation period. A span arrow labeled '146-day gap (within ±365-day symmetry window)' bridges to a second point event marker labeled 'Drug B start: allopurinol 2024-07-10'. A badge reading 'Classified: A→B' appears at the right. Below the patient timeline a summary band reads: '1,820 A→B vs 1,150 B→A across all discordant patients → crude SR 1.58 → aSR 1.41 after NSR correction of 1.12'.",
        "window": {
          "start": "2023-02-15",
          "end": "2024-12-31",
          "label": "Observation period (washout + follow-up)"
        },
        "events": [
          {
            "label": "Washout ends / Drug A start: hydrochlorothiazide",
            "start": "2024-02-15",
            "length_days": 1,
            "quantity": "incident first fill"
          },
          {
            "label": "Drug B start: allopurinol (gout marker)",
            "start": "2024-07-10",
            "length_days": 1,
            "quantity": "incident first fill"
          }
        ],
        "spans": [
          {
            "kind": "washout",
            "start": "2023-02-15",
            "end": "2024-02-14",
            "label": "365-day washout — no prior fills of A or B"
          },
          {
            "kind": "exposed",
            "start": "2024-02-15",
            "end": "2024-07-09",
            "label": "146-day gap: A started, B not yet — within ±365-day symmetry window"
          },
          {
            "kind": "followup",
            "start": "2024-07-10",
            "end": "2024-12-31",
            "label": "Post-B follow-up (not used in PSSA ordering count)"
          }
        ],
        "result": {
          "label": "Patient 1001 classified A→B. Aggregate across all discordant pairs: 1,820 A→B ÷ 1,150 B→A = crude SR 1.58 → aSR 1.41 (÷ NSR 1.12)",
          "value": 1.41
        }
      }
    },
    "prerequisites": [
      "new-user-design",
      "signal-detection",
      "self-controlled-case-series"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Crude sequence ratio (no trend adjustment)",
        "description": "Raw SR = n(A->B) / n(B->A) with an exact-binomial CI on the discordant-pair proportion. Useful only as an intermediate; reporting it as causal ignores secular prescribing trends.",
        "edge_cases": [
          "A rising or falling launch curve of either drug inflates or deflates the SR purely by calendar arithmetic.",
          "Concordant (same-day) initiations must be defined out or handled explicitly."
        ],
        "data_source_notes": "claims: requires clean first-fill dates with continuous enrollment; MA-only person-time hides earlier fills and corrupts \"first\" dates."
      },
      {
        "name": "Adjusted sequence ratio (aSR) via null-effect sequence ratio (NSR)",
        "description": "aSR = SR / NSR, where NSR is the SR expected under no effect, estimated from the waiting-time distribution and background incidence trends of the two drugs (Hallas order-statistic / Tsiropoulos trend modeling). This is the reportable estimand.",
        "edge_cases": [
          "NSR estimation is sensitive to the modeled trend window and to abrupt incidence changes (guideline shifts, formulary changes).",
          "Very short symmetry windows let reverse causation dominate; very long windows dilute the signal and pull aSR toward the trend."
        ],
        "data_source_notes": "NSR should be estimated on the same source/period as the index analysis so trends are internally consistent."
      },
      {
        "name": "Negative-control marker validation",
        "description": "Re-run PSSA with a marker drug that the index drug should not plausibly trigger; an aSR near 1 supports specificity, while an inflated aSR exposes residual trend or confounding-by-indication artifacts.",
        "edge_cases": [
          "A poorly chosen negative control that shares an indication or channeling pathway with the real marker gives false reassurance."
        ],
        "data_source_notes": "Use multiple negative controls to characterize the null distribution of aSR in the specific data source."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Active comparator, new-user cohort design",
        "pros_of_this": "No comparator selection, covariate measurement, or outcome adjudication needed; self-controls for all time-invariant confounders; fast and scalable for screening.",
        "cons_of_this": "Yields only an adjusted sequence ratio (a relative incidence under strong assumptions), not a transportable causal effect or absolute risk; cannot resolve confounding by indication for the marker drug.",
        "when_to_prefer": "Hypothesis-free pharmacovigilance screening over large dispensing databases before investing in a full cohort study."
      },
      {
        "compared_to": "Case-crossover design",
        "pros_of_this": "Operationalizes the outcome through a treatment marker, enabling drug->drug(->event) screening without a separately ascertained event; trend-robust via the NSR.",
        "cons_of_this": "Coarser (one ordering per person, no within-person control-window contrast); weaker for sharply defined transient exposures and acute events.",
        "when_to_prefer": "Screening many drug pairs where the outcome is reliably treated by an identifiable marker drug."
      },
      {
        "compared_to": "Self-controlled case series (SCCS)",
        "pros_of_this": "Far simpler to specify; needs only two dispensing dates per person; explicitly corrects for prescribing trends through the NSR.",
        "cons_of_this": "No incidence-rate-ratio model, no handling of recurrent events or time-varying exposure; a single ordering per person discards within-person rate information.",
        "when_to_prefer": "Rapid, large-scale signal screening rather than a formal rate-ratio estimate."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Build incident-pair sets from first observed pharmacy fills of A and B (NDC + fill_date) under a continuous-enrollment washout; exclude Medicare Advantage-only person-time where FFS claims are missing so first-fill dates are real, not artifacts. Define the symmetry window and handle same-day initiations explicitly.",
      "ehr": "Use medication orders, but ascertainment is asymmetric if A is captured at order time while B is only captured on dispense; prefer linked pharmacy fills for both drugs and treat out-of-system patients as differentially censored.",
      "registry": "National prescription registries (Nordic, Australia PBS) are the canonical substrate with near-complete dispensing, stable denominators, and long lookback. Disease registries need linkage to dispensing data for the marker drug.",
      "linked": "Reconcile order/fill/service date discrepancies and linkage selection before assigning the A/B ordering; otherwise the symmetry test is corrupted at the source."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\nfrom scipy.stats import binomtest\n\nSYMMETRY_DAYS = 365   # |first_A - first_B| must be within this window to count as a pair\n\ndef pssa(first_fill: pd.DataFrame, nsr: float = 1.0) -> dict:\n    # Pivot to one row per person with first_A and first_B (NaT if the drug was never started).\n    wide = (first_fill.pivot_table(index=\"person_id\", columns=\"drug\",\n                                   values=\"first_date\", aggfunc=\"min\"))\n    pairs = wide.dropna(subset=[\"A\", \"B\"]).copy()           # incident in BOTH drugs\n    gap = (pairs[\"A\"] - pairs[\"B\"]).dt.days\n    # Apply both filters in one mask so `gap` stays index-aligned with `pairs`:\n    # keep pairs inside the symmetry window and drop same-day (concordant) starts.\n    pairs = pairs[(gap.abs() <= SYMMETRY_DAYS) & (gap != 0)]\n\n    n_ab = int((pairs[\"A\"] < pairs[\"B\"]).sum())            # A started before B\n    n_ba = int((pairs[\"B\"] < pairs[\"A\"]).sum())            # B started before A\n    n = n_ab + n_ba\n    if n == 0:\n        raise ValueError(\"no discordant pairs inside the symmetry window\")\n    sr = n_ab / n_ba if n_ba else float(\"inf\")\n\n    # Exact-binomial CI on the discordant proportion p = n_ab / n, mapped to SR = p / (1 - p).\n    bt = binomtest(n_ab, n, 0.5)\n    lo_p, hi_p = bt.proportion_ci(confidence_level=0.95)\n    sr_lo, sr_hi = lo_p / (1 - lo_p), hi_p / (1 - hi_p)\n\n    return {\n        \"n_AtoB\": n_ab, \"n_BtoA\": n_ba, \"n_discordant\": n,\n        \"crude_SR\": sr, \"crude_SR_CI\": (sr_lo, sr_hi),\n        \"NSR\": nsr,\n        \"adjusted_SR\": sr / nsr,\n        \"adjusted_SR_CI\": (sr_lo / nsr, sr_hi / nsr),\n    }",
        "description": "PSSA from a claims-style first-fill table. Required input (already cleaned and de-duplicated):\n  first_fill : person_id, drug (in {'A','B'}), first_date (datetime) -- the FIRST observed\n               incident fill of each drug per person, after a continuous-enrollment washout.\nPipeline: build incident pairs where a person has a first fill of BOTH A and B inside the\nobservation window and within the symmetry window; classify A->B vs B->A; compute the crude\nsequence ratio with an exact-binomial CI; then divide by a supplied null-effect sequence ratio\n(NSR) to obtain the adjusted sequence ratio (aSR). The NSR must be estimated separately from\nthe background incidence trends of A and B (waiting-time / order-statistic approach).",
        "dependencies": [
          "pandas",
          "scipy"
        ],
        "source_citations": [
          "hallas-1996",
          "lai-2017"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\n\nSYMMETRY_DAYS <- 365L\n\npssa <- function(first_fill, nsr = 1.0) {\n  setDT(first_fill)\n  wide <- dcast(first_fill, person_id ~ drug, value.var = \"first_date\", fun.aggregate = min)\n  pairs <- wide[!is.na(A) & !is.na(B)]                       # incident in BOTH drugs\n  pairs[, gap := as.integer(A - B)]\n  pairs <- pairs[abs(gap) <= SYMMETRY_DAYS & gap != 0]       # within window, drop same-day\n\n  n_ab <- pairs[A < B, .N]\n  n_ba <- pairs[B < A, .N]\n  n    <- n_ab + n_ba\n  sr   <- if (n_ba > 0) n_ab / n_ba else Inf\n\n  # Exact-binomial CI on the discordant proportion, mapped to the SR scale.\n  bt   <- binom.test(n_ab, n, p = 0.5)\n  p_ci <- bt$conf.int\n  sr_lo <- p_ci[1] / (1 - p_ci[1]); sr_hi <- p_ci[2] / (1 - p_ci[2])\n\n  list(n_AtoB = n_ab, n_BtoA = n_ba, n_discordant = n,\n       crude_SR = sr, crude_SR_CI = c(sr_lo, sr_hi),\n       NSR = nsr, adjusted_SR = sr / nsr,\n       adjusted_SR_CI = c(sr_lo / nsr, sr_hi / nsr))\n}\n\n# Canonical OMOP-CDM route (preferred in production):\n# res <- CohortSymmetry::generateSequenceCohortSet(cdm, indexTable = \"A\", markerTable = \"B\")\n# CohortSymmetry::summariseSequenceRatios(res)   # crude + adjusted SR with CIs and NSR",
        "description": "PSSA in R. The CohortSymmetry CRAN package is the canonical implementation against the OMOP\ncommon data model (it builds incident pairs, computes crude and adjusted sequence ratios, and\nhandles the NSR). Below is a self-contained data.table version that mirrors the Python logic so\nthe math is transparent. Required input:\n  first_fill : person_id, drug ('A'/'B'), first_date (Date) -- first incident fill per drug,\n               after a continuous-enrollment washout.\nSupply nsr from a separate trend/waiting-time model (or use CohortSymmetry::summariseSequenceRatios).",
        "dependencies": [
          "data.table",
          "CohortSymmetry"
        ],
        "source_citations": [
          "hallas-1996",
          "lai-2017"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "%let symm = 365;   /* symmetry window in days */\n%let nsr  = 1.12;  /* null-effect sequence ratio from a separate trend model */\n\n/* One row per person with first_A and first_B; keep only those incident in BOTH drugs. */\nproc sql;\n  create table pairs as\n  select a.person_id,\n         a.first_date as first_A format=date9.,\n         b.first_date as first_B format=date9.,\n         (a.first_date - b.first_date) as gap\n  from (select person_id, first_date from work.first_fill where drug='A') a\n  inner join\n       (select person_id, first_date from work.first_fill where drug='B') b\n    on a.person_id = b.person_id;\nquit;\n\n/* Classify discordant orderings inside the symmetry window; drop same-day starts. */\ndata discordant;\n  set pairs;\n  where abs(gap) <= &symm and gap ne 0;\n  if gap < 0 then order = 'AtoB';   /* first_A < first_B */\n  else            order = 'BtoA';\nrun;\n\n/* Exact-binomial CI on the A->B proportion among discordant pairs. */\nproc freq data=discordant order=data;\n  tables order / binomial(level='AtoB') alpha=0.05;\n  exact binomial;\n  /* OUTPUT takes statistic keywords only; variables keep PROC FREQ's default names (N, _BIN_, XL_BIN, XU_BIN). */\n  output out=ssa_ci n binomial;\nrun;\n\n/* Map the discordant proportion to crude SR = p/(1-p), then adjust by the NSR. */\ndata ssa_results;\n  set ssa_ci;\n  n_discordant = N;      /* number of discordant pairs */\n  p_ab         = _BIN_;  /* proportion with A->B ordering */\n  crude_SR    = p_ab / (1 - p_ab);\n  adjusted_SR = crude_SR / &nsr;\n  /* XL_BIN / XU_BIN are the exact CI bounds emitted on the proportion scale; rescale to SR. */\n  crude_SR_lo = XL_BIN / (1 - XL_BIN);  crude_SR_hi = XU_BIN / (1 - XU_BIN);\n  adj_SR_lo   = crude_SR_lo / &nsr;     adj_SR_hi   = crude_SR_hi / &nsr;\nrun;",
        "description": "PSSA in SAS. Required input dataset (post data-management):\n  work.first_fill : person_id, drug ('A'/'B'), first_date  -- first incident fill per drug per\n                    person, after a continuous-enrollment washout.\nPROC SQL transposes to one row per person, a DATA step classifies the ordering inside the\nsymmetry window, and PROC FREQ with BINOMIAL ... EXACT returns the exact CI on the discordant\nproportion, which is mapped to the sequence-ratio scale. The null-effect sequence ratio (NSR) is\nsupplied from a separate trend/waiting-time model and applied in a final DATA step. PROC PHREG is\ndeliberately NOT used -- PSSA is a binomial symmetry test, not a survival model.",
        "dependencies": [],
        "source_citations": [
          "hallas-1996",
          "lai-2017"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": "prescription-sequence-symmetry-timeline.svg",
        "mermaid": null,
        "caption": "Patient 1001's timeline showing the washout period confirming no prior fills of either drug, the first thiazide fill on 2024-02-15 (drug A start), the 146-day gap before the first allopurinol fill on 2024-07-10 (drug B start), and the classification of this patient as an A→B ordering. Across 2,970 such discordant patients, 1,820 are A→B versus 1,150 B→A, yielding a crude SR of 1.58 and an aSR of 1.41 after trend correction.",
        "alt_text": "Horizontal timeline for patient 1001. A shaded washout band spans the year before 2024-02-15 with a label confirming no prior fills of hydrochlorothiazide or allopurinol. A point event marker labeled 'Drug A start: hydrochlorothiazide 2024-02-15' sits at the left edge of the observation period. A span arrow labeled '146-day gap (within ±365-day symmetry window)' bridges to a second point event marker labeled 'Drug B start: allopurinol 2024-07-10'. A badge reading 'Classified: A→B' appears at the right. Below the patient timeline a summary band reads: '1,820 A→B vs 1,150 B→A across all discordant patients → crude SR 1.58 → aSR 1.41 after NSR correction of 1.12'.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Pop[Source population in dispensing data] --> Inc[Keep persons with FIRST incident fill of<br/>BOTH index drug A and marker drug B<br/>continuous-enrollment washout each]\n  Inc --> Win{First A and first B<br/>within +/- symmetry window?}\n  Win -- no --> Drop[Exclude pair]\n  Win -- yes --> Ord{Which came first?}\n  Ord -- A before B --> AB[Count A to B]\n  Ord -- B before A --> BA[Count B to A]\n  AB --> SR[Crude SR = n A->B / n B->A<br/>exact-binomial CI]\n  BA --> SR\n  SR --> NSR[Divide by null-effect sequence ratio<br/>from background prescribing trends]\n  NSR --> ASR[Adjusted sequence ratio aSR<br/>signal if aSR and CI lower bound > 1]\n  ASR --> NC[Validate with negative-control marker<br/>expect aSR ~ 1]",
        "caption": "PSSA workflow. Incident pairs are classified by ordering within a symmetry window; the crude sequence ratio is divided by the null-effect sequence ratio (trend correction) to yield the adjusted sequence ratio, validated against a negative-control marker.",
        "alt_text": "Flowchart from source population to incident-pair selection, symmetry-window ordering, crude sequence ratio, NSR trend adjustment, adjusted sequence ratio, and negative-control validation.",
        "source_type": "illustrative",
        "source_citations": [
          "hallas-1996",
          "lai-2017"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "timeline\n  title One discordant incident pair (A before B) on the calendar\n  section Washout\n    365-day washout : no prior fill of A or B\n  section Index drug A\n    First fill of A (incident) : time of A initiation\n  section Induction\n    Gap within symmetry window : A precedes B -> counts as A to B\n  section Marker drug B\n    First fill of B (marker for the event) : B initiation",
        "caption": "A single person contributing an A->B ordering. PSSA counts the direction of the first-A vs first-B sequence; an excess of A->B over B->A across persons, after NSR correction, is the signal.",
        "alt_text": "Timeline showing a washout, an incident first fill of drug A, a gap within the symmetry window, then an incident first fill of marker drug B, classified as an A-to-B ordering.",
        "source_type": "illustrative",
        "source_citations": [
          "hallas-1996"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "alternative_to",
        "target_slug": "self-controlled-case-series",
        "notes": "Both are self-controlled and cancel time-invariant confounders; SCCS fits a within-person incidence-rate-ratio model with explicit event/exposure windows, while PSSA tests the symmetry of two incident prescriptions and corrects for trend via the NSR."
      },
      {
        "relation_type": "alternative_to",
        "target_slug": "case-crossover",
        "notes": "Case-crossover contrasts exposure in case vs control windows within outcome cases; PSSA uses a second prescription as the outcome marker and compares the symmetry of two initiations."
      },
      {
        "relation_type": "part_of",
        "target_slug": "signal-detection",
        "notes": "PSSA is a longitudinal-data signal-detection method for pharmacovigilance, complementary to disproportionality analysis on spontaneous reports."
      },
      {
        "relation_type": "used_with",
        "target_slug": "safety-signal-case-definition-rwe",
        "notes": "The marker drug operationalizes the safety outcome; a defensible case/marker definition is a prerequisite for a credible adjusted sequence ratio."
      },
      {
        "relation_type": "see_also",
        "target_slug": "active-comparator-new-user",
        "notes": "When a PSSA screen flags a signal, an active-comparator new-user cohort study is the usual follow-up to estimate a transportable effect size and absolute risk."
      }
    ],
    "aliases": [
      "PSSA",
      "prescription sequence symmetry analysis",
      "sequence symmetry analysis",
      "SSA"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "ema"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "prevalence-point-period-annual-rwe",
    "name": "Prevalence (Point, Period, and Annual) in RWE",
    "short_definition": "The proportion of a defined denominator population that is a case at a single instant (point), at any time during a window (period), or within a calendar year (annual), where \"case\" is operationalized in real-world data as administratively diagnosed, treated, or both.",
    "long_description": "**Prevalence** is the proportion of a defined population that *has* a condition, in contrast to incidence,\nwhich counts new onset. In real-world data the headline number is governed almost entirely by two specifications\nthat protocols routinely leave implicit: the **time frame of the numerator/denominator** and the **case-finding\nrule**. Getting either wrong changes the estimate by a factor of two or more, so both must be written into the\nestimand before any code is run.\n\n**Core conceptual distinction**. Three time frames produce three distinct estimands, and they are not\ninterchangeable. (1) *Point prevalence* is a cross-sectional snapshot on a single index date: the numerator is\npeople who are prevalent cases on that date and the denominator is people observable (continuously enrolled) on\nthat date. (2) *Period prevalence* counts anyone who is a case at any moment during an interval; the denominator\nis people observable at some point in the interval, and because the window admits both surviving and incident\ncases it is always ≥ point prevalence for the same population. (3) *Annual prevalence* is period prevalence with\nthe interval fixed to a calendar year — the operational standard for CMS Chronic Conditions Warehouse flags,\nBRFSS, and NHIS, chosen because it aligns with enrollment files and benefit years and is comparable across\nyears. A second, orthogonal axis is *how a case is identified*: **diagnosed (administrative) prevalence**\nrequires qualifying diagnosis codes; **treated prevalence** requires a qualifying dispensing/procedure and so\nmeasures the treated pool, not the diseased pool; **diagnosed-and-treated** intersects both. Treated prevalence\nsystematically undercounts undiagnosed and untreated disease, which is exactly what makes it the right numerator\nfor a budget-impact \"treated population\" but the wrong numerator for disease burden. 
Prevalence is a *proportion*\n(dimensionless, bounded 0–1), not a *rate* per person-time; if you find yourself dividing by person-years you are\nestimating incidence density, not prevalence.\n\n**Pros, cons, and trade-offs**.\n- **vs incidence rate (`incidence-rate-calculation-rwe`):** Prevalence is cheap, needs only a cross-sectional or\n  windowed snapshot, and directly sizes a market or care burden. Cost: it confounds onset with survival — a\n  therapy that prevents death *raises* prevalence, so prevalence is useless for etiologic questions and\n  treacherous for \"is the disease getting more common?\" Prefer incidence for causation and trend-in-onset;\n  prefer prevalence for resource planning and current burden.\n- **vs treated prevalence as a proxy for true prevalence:** A pharmacy- or procedure-based numerator is fully\n  observable in claims and unambiguous, but it conflates disease occurrence with diagnosis *and* treatment\n  access. In conditions with large untreated fractions (early CKD, mild OA, undiagnosed AF) treated prevalence\n  can be a small and biased fraction of diagnosed prevalence. Use treated prevalence only when the treated pool\n  *is* the quantity of interest.\n- **vs longer diagnostic lookback to raise sensitivity:** Extending the \"ever-diagnosed\" lookback (1 → 2 → 3\n  years) captures chronic prevalent cases whose only coded encounter predates the window, but it mechanically\n  inflates the estimate, ages the prevalent pool, and makes cross-study comparison meaningless unless the\n  lookback is held fixed. The single-claim vs ≥2-claims (Klabunde-style) rule is the same trade-off in reverse:\n  one claim maximizes sensitivity but admits rule-out/coding-artifact diagnoses; two claims ≥X days apart\n  improves PPV at the cost of sensitivity.\n\n**When to use**. 
Sizing an eligible or treated population for a budget-impact or commercial forecast; describing\ndisease burden for an HTA submission or epidemiology section; CMS/HEDIS-style chronic-condition surveillance;\nany descriptive denominator where the question is \"how many people currently have / are treated for X\" rather\nthan \"how many newly developed X.\" Point prevalence for a snapshot (\"on 1 July\"), period/annual prevalence for a\nreporting interval, treated prevalence when the deliverable is the treatable market.\n\n**When NOT to use — and when it is actively misleading or dangerous**.\n- **As a stand-in for incidence or risk.** Prevalence ÷ duration ≈ incidence only under steady-state and never\n  when survival or treatment differs across groups; using prevalence to infer onset or to compare arms is\n  confounded by survival (prevalence–incidence / Neyman bias). For causal contrasts use incidence or cumulative\n  incidence (`cumulative-incidence-risk-rwe`).\n- **Comparing prevalence across data sources, years, or studies with different case rules or lookbacks.** A\n  \"rising prevalence\" can be pure artifact of a lengthened lookback, a coding-intensity change (ICD-9→ICD-10),\n  or a shift in the diagnosed-vs-treated definition. Hold the operational rule fixed or the comparison is void.\n- **Reporting a crude prevalence across populations with different age/sex structure.** Crude prevalence is a\n  weighted mash-up of stratum-specific prevalences; differences may be pure composition. Age/sex-standardize\n  (`direct-standardization-rwe`, `indirect-standardization-smr-sir-rwe`) before any cross-population claim.\n- **Treated prevalence presented as disease prevalence** in a condition with substantial undiagnosed/untreated\n  burden — this understates need and can mis-size a market by an order of magnitude.\n\n**Data-source operational depth**.\n- **Claims (FFS vs MA):** The denominator must be tied to *observable* person-time. 
For point prevalence,\n  require active enrollment on the index date; for annual prevalence, require a minimum coverage fraction of the\n  year (e.g., ≥11 of 12 months, or full-year continuous enrollment) so the numerator opportunity is comparable\n  across people. The dominant failure mode is **Medicare Advantage / capitated person-time that lacks\n  fee-for-service claims**: MA enrollees generate few or no FFS diagnosis/pharmacy claims, so they look\n  disease-free and *deflate* prevalence — restrict the denominator to FFS-observable (Parts A/B and, for treated\n  prevalence, Part D) person-time, or use encounter data where complete. **Plan switching** truncates the\n  lookback and can drop the qualifying diagnosis; **claims adjudication lag and reversals** mean recent months\n  are incomplete; the **single-claim vs ≥2-claim** rule and lookback length must be pre-specified and varied in\n  sensitivity analysis.\n- **EHR:** Capture is encounter-driven, so prevalence is conditioned on contact with the system — patients who\n  are well, who get care elsewhere (external-care leakage), or who churn out are differentially missing.\n  Structured problem lists undercount; resolved/historical flags and copy-forward inflate. 
Diagnosed prevalence\n  from EHR alone is a lower bound, and it describes only the panel that actually visits.\n- **Registry:** Often the gold standard for the numerator (adjudicated cases, completeness within catchment) but\n  the denominator (source population at risk) is frequently external census/enrollment data, so the\n  numerator–denominator mismatch and catchment definition drive validity.\n- **Linked claims–EHR–registry:** EHR/registry sharpens case ascertainment while claims supply a clean,\n  enumerable denominator and full pharmacy fills for treated prevalence, but only the linkable subset is\n  analyzable, introducing selection that must be characterized before generalizing.\n\n**Worked claims example (annual diagnosed and treated prevalence of type 2 diabetes, calendar year 2024).**\n(1) *Denominator:* every person with ≥11 months of continuous FFS Parts A/B enrollment in 2024 (add Part D if\ntreated prevalence is required), restricted to FFS-observable person-time so MA-only members do not deflate the\nestimate. (2) *Diagnosed numerator:* a member is a prevalent diagnosed case if they have ≥2 claims with a T2DM\ndiagnosis (ICD-10 E11.x) on ≥2 distinct dates ≥30 days apart, looking back from 31 Dec 2024 over a 24-month\nlookback (2023-01-01 through 2024-12-31); the two-claim rule and 24-month lookback are the judgment-dependent\nthresholds and are each varied in sensitivity analysis (1-claim; 12-month vs 36-month lookback). (3) *Treated\nnumerator:* a member is a prevalent treated case if they have ≥1 fill (`fill_date` in 2024, `days_supply` > 0)\nof a glucose-lowering NDC during 2024. (4) *Annual prevalence* = (cases meeting the rule) ÷ (eligible\ndenominator), reported as diagnosed and treated separately. Treated is usually lower than diagnosed, but not by\nconstruction: a member can have a qualifying fill without meeting the two-claim diagnosis rule. Only the\ndiagnosed-and-treated intersection is guaranteed to be ≤ both. 
(5) Because age\nstructure differs across plans/years, report crude prevalence with exact-binomial 95% CIs **and** directly\nage/sex-standardized prevalence with CIs suited to a weighted sum of stratum proportions (the exact-binomial\ninterval applies only to the crude estimate), and re-estimate point prevalence on 1 July 2024 (active enrollment\nthat day) to show the\npoint-vs-period gap. Preserve raw claim dates alongside the derived prevalent flag and audit pre/post counts.\n\n**Interpreting the output**\n\nA small health district with 200 enrolled residents reports point prevalence of hypertension on July 1,\n2024 as 18.0% (36 / 200) and annual period prevalence for all of 2024 as 25.0% (50 / 200). The two\nestimates use the same 200-person denominator but different numerator windows.\n\n*(1) Formal interpretation.* Point prevalence (18.0%) counts only residents who were active hypertension\ncases on the specific index date of July 1; the denominator is all 200 residents enrolled and observable\non that date. Period prevalence (25.0%) counts anyone who was a case at any moment during calendar year\n2024; it is always greater than or equal to point prevalence for the same denominator because it\nadds incident cases who were diagnosed after July 1 and cases whose diagnoses were recorded at other\npoints in the year. The 7-percentage-point gap (25.0% vs 18.0%) reflects the 14 cases that were prevalent in\nthe annual window but not on the snapshot date: some recorded before July 1 but no longer flagged as\nactive on that date, some newly diagnosed after. Prevalence is a dimensionless proportion (cases / observable\npopulation), not a rate\nper person-time. Both estimates are sensitive to the case-finding rule: a two-claim vs one-claim\ndefinition or a 12-month vs 24-month lookback will change the numerator by a meaningful margin.\n\n*(2) Practical interpretation.* A disease-management program sizing its target population should use\nannual period prevalence (25.0%, 50 patients) as the denominator for coverage calculations, since it\ncaptures everyone with the condition at any point in the plan year. 
A point-in-time prevalence (18.0%)\nis appropriate for a cross-sectional intervention or a budget snapshot tied to a specific date. Always\nstate which measure is reported — presenting 18.0% as \"the prevalence\" in a dossier when 25.0% is the\nbetter operational figure for a full-year program is a common undercount error.",
    "primary_category": "Descriptive_Epidemiology",
    "tags": [
      "prevalence",
      "point-prevalence",
      "period-prevalence",
      "annual-prevalence",
      "treated-prevalence",
      "diagnosed-prevalence",
      "descriptive-epidemiology",
      "case-finding",
      "denominator-construction"
    ],
    "applies_to_study_types": [
      "claims_analysis",
      "ehr_study",
      "registry_linkage",
      "comparative_effectiveness"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.5888/pcd10.120239",
        "url": "https://doi.org/10.5888/pcd10.120239",
        "citation_text": "Goodman RA, Posner SF, Huang ES, Parekh AK, Koh HK. Defining and measuring chronic conditions: imperatives for research, policy, program, and practice. Preventing Chronic Disease. 2013;10:E66.",
        "year": 2013,
        "authors_short": "Goodman et al.",
        "notes": "Frames the operational choices behind measuring prevalence of chronic conditions in administrative data — case definitions, lookback windows, and the diagnosed-vs-treated distinction that drive every prevalence estimate."
      },
      {
        "role": "explain",
        "doi": "10.1371/journal.pmed.0040296",
        "url": "https://doi.org/10.1371/journal.pmed.0040296",
        "citation_text": "von Elm E, Altman DG, Egger M, Pocock SJ, Gotzsche PC, Vandenbroucke JP. The Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) statement. PLoS Medicine. 2007;4(10):e296.",
        "year": 2007,
        "authors_short": "von Elm et al.",
        "notes": "Reporting standard requiring explicit numerator/denominator definitions, eligibility windows, and case ascertainment — the elements that make a prevalence estimate interpretable and comparable."
      },
      {
        "role": "demonstrate",
        "doi": "10.1186/1478-7954-1-4",
        "url": "https://doi.org/10.1186/1478-7954-1-4",
        "citation_text": "Barendregt JJ, Van Oortmarssen GJ, Vos T, Murray CJL. A generic model for the assessment of disease epidemiology: the computational basis of DisMod II. Population Health Metrics. 2003;1:4.",
        "year": 2003,
        "authors_short": "Barendregt et al.",
        "notes": "Demonstrates the internally consistent relationship between prevalence, incidence, remission, and mortality, making explicit why prevalence cannot be read as incidence without the duration/survival link."
      },
      {
        "role": "use",
        "doi": "10.1002/pds.4297",
        "url": "https://doi.org/10.1002/pds.4297",
        "citation_text": "Berger ML, Sox H, Willke RJ, et al. Good practices for real-world data studies of treatment and/or comparative effectiveness: recommendations from the joint ISPOR-ISPE Special Task Force on real-world evidence in health care decision making. Pharmacoepidemiology and Drug Safety. 2017;26(9):1033-1039.",
        "year": 2017,
        "authors_short": "Berger et al.",
        "notes": "Good-practice guidance for descriptive and comparative RWD studies, including pre-specifying denominators, observable person-time, and case definitions before estimation."
      }
    ],
    "plain_language_summary": "Prevalence answers the question 'what share of a population has this condition right now (or at some point during a time window)?' You pick everyone in your defined group, count who among them is a case, and divide. Whether you count people who are cases on one specific date (point prevalence) or anyone who is a case at any moment during a whole year (period prevalence) changes the number — so the window must be stated up front. One honest caveat: prevalence mixes people who got the disease recently with people who have had it for years, so it cannot tell you whether the disease is becoming more or less common.",
    "key_terms": [
      {
        "term": "point prevalence",
        "definition": "The fraction of a population that is a case on one specific date — a snapshot of who has the condition that day."
      },
      {
        "term": "period prevalence",
        "definition": "The fraction of a population that is a case at any time during a defined interval, such as a calendar year; it counts anyone who had the condition during that window, not just on one day."
      },
      {
        "term": "denominator",
        "definition": "The total number of people in your study population who could potentially be cases — the 'out of how many' part of the fraction."
      },
      {
        "term": "case",
        "definition": "A person who meets the definition of having the condition being measured, such as having a diagnosis recorded in their medical record or having filled a prescription for a treatment."
      },
      {
        "term": "annual prevalence",
        "definition": "Period prevalence calculated over exactly one calendar year — the standard used in most public health surveillance reports."
      }
    ],
    "worked_example": {
      "scenario": "A small rural health district wants to understand how much hypertension affects its population. The district has records for 200 residents who were enrolled in its health plan for the entire year 2024. A clinic nurse runs two prevalence calculations from the same enrollment list: (1) point prevalence on July 1, 2024 — who is an active hypertension case on that exact date — and (2) period (annual) prevalence for all of 2024 — who had a hypertension diagnosis recorded at any point during the year. The two numbers will differ because some people were only diagnosed earlier or later in the year.",
      "dataset": {
        "caption": "Patient records for the 200-person enrolled population. Columns show whether each person had a hypertension diagnosis active on July 1, 2024 (the point-prevalence date) and whether they had any hypertension diagnosis recorded at any time during the full year 2024 (the period/annual window). Rows shown are a representative sample of 10 patients.",
        "columns": [
          "person_id",
          "enrolled_full_year_2024",
          "hypertension_dx_on_jul1_2024",
          "hypertension_dx_anytime_2024"
        ],
        "rows": [
          [
            1001,
            "yes",
            "yes",
            "yes"
          ],
          [
            1002,
            "yes",
            "no",
            "yes"
          ],
          [
            1003,
            "yes",
            "yes",
            "yes"
          ],
          [
            1004,
            "yes",
            "no",
            "no"
          ],
          [
            1005,
            "yes",
            "yes",
            "yes"
          ],
          [
            1006,
            "yes",
            "no",
            "yes"
          ],
          [
            1007,
            "yes",
            "yes",
            "yes"
          ],
          [
            1008,
            "yes",
            "no",
            "no"
          ],
          [
            1009,
            "yes",
            "yes",
            "yes"
          ],
          [
            1010,
            "yes",
            "no",
            "yes"
          ]
        ],
        "full_population_summary": "Across all 200 enrolled residents: 36 have hypertension recorded as active on July 1, 2024; 50 have at least one hypertension diagnosis recorded at any point during calendar year 2024."
      },
      "steps": [
        "Define the denominator: all 200 people who were enrolled for the full year 2024 — these are the people we can reliably observe.",
        "Compute point prevalence (July 1, 2024): count only the people who are active hypertension cases on that one date. From the full population, 36 out of 200 residents meet this criterion.",
        "Point prevalence = 36 cases on July 1 / 200 enrolled residents = 0.180, or 18.0%.",
        "Compute period (annual) prevalence (all of 2024): count anyone who had a hypertension diagnosis recorded at any time during the year — this picks up people diagnosed in January, people diagnosed in November, and everyone in between. From the full population, 50 out of 200 residents had at least one hypertension diagnosis during 2024.",
        "Period prevalence = 50 cases during 2024 / 200 enrolled residents = 0.250, or 25.0%.",
        "Note the direction: period prevalence (25.0%) is higher than point prevalence (18.0%) because period prevalence adds the 14 people who had a diagnosis recorded at some point in 2024 but were not flagged as active cases on July 1 specifically — for example, someone diagnosed in March whose record was not flagged as active by mid-year, or someone newly diagnosed in October.",
        "This difference is expected and not an error: point prevalence is a narrow snapshot; period prevalence casts a wider net over the full year."
      ],
      "result": "Point prevalence on July 1, 2024 = 36 / 200 = 0.180 (18.0%). Annual (period) prevalence for calendar year 2024 = 50 / 200 = 0.250 (25.0%). Period prevalence is always greater than or equal to point prevalence for the same population and window because it captures everyone the point estimate does plus anyone who was a case at any other moment in the year."
    },
    "prerequisites": [
      "descriptive-epidemiology-rwe",
      "continuous-enrollment-observable-time-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Point prevalence (single index date)",
        "description": "Cross-sectional snapshot. Numerator = prevalent cases on the index date (under the chosen lookback case rule); denominator = persons with active, observable enrollment on that exact date.",
        "edge_cases": [
          "Index date falling in an adjudication-lag window understates recent cases.",
          "A member enrolled but with no claims that day is still in the denominator — absence of a claim is not absence of disease."
        ],
        "data_source_notes": "claims: require active enrollment on the index date and FFS-observable coverage; EHR: point prevalence is conditioned on an encounter near the date and undercounts non-attenders."
      },
      {
        "name": "Period / annual prevalence (windowed)",
        "description": "A case if prevalent at any time during the interval (calendar year for annual prevalence). Denominator = persons observable during the interval, typically with a minimum coverage fraction (e.g., >=11 of 12 months).",
        "edge_cases": [
          "Mixing partial-year and full-year enrollees without a coverage threshold biases the numerator opportunity.",
          "Longer windows admit more surviving cases, so period >= point prevalence by construction."
        ],
        "data_source_notes": "claims: define a continuous-enrollment / coverage-fraction rule for the interval; align to benefit year for annual estimates and exclude MA-only person-time."
      },
      {
        "name": "Diagnosed (administrative) prevalence",
        "description": "Case = qualifying diagnosis codes (e.g., >=2 claims >=30 days apart over a fixed lookback). Targets the coded-diseased population.",
        "edge_cases": [
          "Single-claim rules maximize sensitivity but admit rule-out/coding-artifact diagnoses; >=2-claim rules raise PPV at the cost of sensitivity.",
          "Lookback length materially changes the estimate; ICD-9 to ICD-10 transitions change coding intensity."
        ],
        "data_source_notes": "claims: pre-specify code list, claim count, inter-claim spacing, and lookback; vary all in sensitivity analysis."
      },
      {
        "name": "Treated prevalence",
        "description": "Case = qualifying dispensing or procedure in the window. Measures the treated pool, not the diseased pool; appropriate for budget-impact treated-population denominators.",
        "edge_cases": [
          "Systematically undercounts undiagnosed/untreated disease; sample fills and 90-day mail-order distort window membership.",
          "Requires pharmacy (Part D / pharmacy benefit) observability, a stricter denominator than diagnosis capture."
        ],
        "data_source_notes": "claims: require pharmacy-benefit enrollment for the window; do not present treated prevalence as disease prevalence."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Incidence rate",
        "pros_of_this": "Cheap to compute from a cross-sectional or windowed snapshot; directly sizes current disease/treatment burden for forecasting and HTA.",
        "cons_of_this": "Confounds onset with survival and treatment duration; cannot answer etiologic questions or 'is onset rising' and is subject to prevalence-incidence (Neyman) bias.",
        "when_to_prefer": "Resource planning, market sizing, and current-burden description rather than causation or trend-in-onset."
      },
      {
        "compared_to": "Treated prevalence as a proxy for true (diagnosed) prevalence",
        "pros_of_this": "Fully observable and unambiguous in claims; exactly the right denominator for a treatable-market or budget-impact analysis.",
        "cons_of_this": "Conflates disease with diagnosis and treatment access; can be a small, biased fraction of true prevalence when untreated disease is common.",
        "when_to_prefer": "When the treated pool itself is the quantity of interest, not overall disease burden."
      },
      {
        "compared_to": "Crude (unadjusted) prevalence",
        "pros_of_this": "Standardized prevalence removes composition differences and is comparable across populations, plans, and years.",
        "cons_of_this": "Standardization requires a reference population and stratum-specific counts and obscures the actual observed burden.",
        "when_to_prefer": "Any cross-population or cross-time comparison; report crude alongside standardized for transparency."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Tie the denominator to observable person-time (active enrollment on the date for point prevalence; a coverage-fraction rule for annual). Exclude Medicare Advantage / capitated person-time that lacks fee-for-service claims, which otherwise deflates prevalence. Pre-specify the case rule (code list, claim count, inter-claim spacing, lookback) and vary it in sensitivity analysis. Require pharmacy-benefit (Part D) coverage for treated prevalence.",
      "ehr": "Capture is encounter-driven, so prevalence is conditioned on system contact; non-attenders and external-care leakage cause undercount. Use problem lists plus resolved/historical flags carefully; structured fields undercount and copy-forward inflates. Treat EHR diagnosed prevalence as a lower bound on the attending panel.",
      "registry": "Strong numerator (adjudicated, complete within catchment) but the denominator is often external census/enrollment; the numerator-denominator mismatch and catchment definition drive validity.",
      "linked": "EHR/registry sharpens case ascertainment while claims supply an enumerable denominator and full pharmacy fills for treated prevalence; only the linkable subset is analyzable, so characterize linkage selection before generalizing."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\nfrom scipy.stats import beta\n\ndef _ci(x: int, n: int):\n    # Exact (Clopper-Pearson) 95% CI for a proportion.\n    lo = 0.0 if x == 0 else beta.ppf(0.025, x, n - x + 1)\n    hi = 1.0 if x == n else beta.ppf(0.975, x + 1, n - x)\n    return x / n, lo, hi\n\ndef denominator_point(enroll: pd.DataFrame, index_date: pd.Timestamp) -> set:\n    # Active, FFS-observable enrollment on the index date.\n    m = (enroll[\"enroll_start\"] <= index_date) & (enroll[\"enroll_end\"] >= index_date) & enroll[\"ffs_observable\"]\n    return set(enroll.loc[m, \"person_id\"])\n\ndef denominator_year(enroll: pd.DataFrame, year: int, min_months: int = 11) -> set:\n    # FFS-observable coverage for >= min_months of the calendar year (overlap-month count).\n    ys, ye = pd.Timestamp(year, 1, 1), pd.Timestamp(year, 12, 31)\n    e = enroll[enroll[\"ffs_observable\"]].copy()\n    e[\"ov_start\"] = e[\"enroll_start\"].clip(lower=ys)\n    e[\"ov_end\"] = e[\"enroll_end\"].clip(upper=ye)\n    e = e[e[\"ov_start\"] <= e[\"ov_end\"]]\n    e[\"months\"] = ((e[\"ov_end\"].dt.to_period(\"M\").astype(\"int64\") -\n                    e[\"ov_start\"].dt.to_period(\"M\").astype(\"int64\")) + 1)\n    cov = e.groupby(\"person_id\")[\"months\"].sum()\n    return set(cov[cov >= min_months].index)\n\ndef diagnosed_cases(dx: pd.DataFrame, denom: set, asof: pd.Timestamp,\n                    lookback_days: int = 730, min_claims: int = 2, min_gap_days: int = 30) -> set:\n    # >= min_claims diagnosis dates spanning >= min_gap_days within the lookback, among the denominator.\n    win = dx[(dx[\"person_id\"].isin(denom)) &\n             (dx[\"svc_date\"] <= asof) &\n             (dx[\"svc_date\"] >= asof - pd.Timedelta(days=lookback_days))]\n    dates = win.groupby(\"person_id\")[\"svc_date\"].agg([\"nunique\", \"min\", \"max\"])\n    ok = dates[(dates[\"nunique\"] >= min_claims) &\n               ((dates[\"max\"] - dates[\"min\"]).dt.days >= min_gap_days)]\n    return 
set(ok.index)\n\ndef treated_cases(rx: pd.DataFrame, denom: set, win_start: pd.Timestamp, win_end: pd.Timestamp) -> set:\n    # >= 1 qualifying fill inside the window, among the denominator.\n    m = (rx[\"person_id\"].isin(denom)) & (rx[\"fill_date\"] >= win_start) & (rx[\"fill_date\"] <= win_end)\n    return set(rx.loc[m, \"person_id\"])\n\ndef annual_prevalence(enroll, dx, rx, year=2024):\n    denom = denominator_year(enroll, year)\n    ys, ye = pd.Timestamp(year, 1, 1), pd.Timestamp(year, 12, 31)\n    dxc = diagnosed_cases(dx, denom, asof=ye, lookback_days=730)\n    rxc = treated_cases(rx, denom, ys, ye)\n    n = len(denom)\n    return {\n        \"n_denominator\": n,\n        \"diagnosed\": _ci(len(dxc), n),                 # (prevalence, lo, hi)\n        \"treated\": _ci(len(rxc), n),\n        \"diagnosed_and_treated\": _ci(len(dxc & rxc), n),\n    }",
        "description": "Point, annual, diagnosed, and treated prevalence from claims-style tables. Required inputs (cleaned, de-duplicated):\n  enroll : enrollment spans -> person_id, enroll_start, enroll_end (datetime), ffs_observable (bool: Parts A/B and,\n           for treated prevalence, Part D present; MA-only person-time is NOT ffs_observable)\n  dx     : diagnosis claims -> person_id, svc_date (datetime), dx_code (str)   # e.g. ICD-10 'E11*' rows pre-filtered\n  rx     : pharmacy fills   -> person_id, fill_date (datetime), days_supply    # qualifying drug-class rows pre-filtered\nReturns prevalence estimates with exact-binomial 95% CIs. Hold case rule and lookback fixed; vary them externally for\nsensitivity analysis. Numerator membership is restricted to the eligible denominator so cases cannot exceed it.",
        "dependencies": [
          "pandas",
          "scipy"
        ],
        "source_citations": [
          "goodman-2013"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\n\npt_ci <- function(x, n) {\n  bt <- binom.test(x, n)              # exact (Clopper-Pearson) 95% CI\n  c(prev = x / n, lo = bt$conf.int[1], hi = bt$conf.int[2])\n}\n\ndenom_year <- function(enroll, year, min_months = 11L) {\n  setDT(enroll)\n  ys <- as.Date(sprintf(\"%d-01-01\", year)); ye <- as.Date(sprintf(\"%d-12-31\", year))\n  e <- enroll[ffs_observable == TRUE]\n  e[, ov_start := pmax(enroll_start, ys)]\n  e[, ov_end   := pmin(enroll_end, ye)]\n  e <- e[ov_start <= ov_end]\n  mn <- function(d) as.integer(format(d, \"%Y\")) * 12L + as.integer(format(d, \"%m\"))\n  e[, months := mn(ov_end) - mn(ov_start) + 1L]\n  cov <- e[, .(months = sum(months)), by = person_id]\n  cov[months >= min_months, person_id]\n}\n\ndiagnosed_cases <- function(dx, denom, asof, lookback_days = 730L,\n                            min_claims = 2L, min_gap_days = 30L) {\n  setDT(dx)\n  win <- dx[person_id %chin% denom & svc_date <= asof &\n            svc_date >= asof - lookback_days]\n  agg <- win[, .(nd = uniqueN(svc_date),\n                 span = as.integer(max(svc_date) - min(svc_date))), by = person_id]\n  agg[nd >= min_claims & span >= min_gap_days, person_id]\n}\n\ntreated_cases <- function(rx, denom, win_start, win_end) {\n  setDT(rx)\n  unique(rx[person_id %chin% denom & fill_date >= win_start & fill_date <= win_end, person_id])\n}\n\nannual_prevalence <- function(enroll, dx, rx, year = 2024L) {\n  denom <- denom_year(enroll, year)\n  ys <- as.Date(sprintf(\"%d-01-01\", year)); ye <- as.Date(sprintf(\"%d-12-31\", year))\n  dxc <- diagnosed_cases(dx, denom, asof = ye, lookback_days = 730L)\n  rxc <- treated_cases(rx, denom, ys, ye)\n  n <- length(denom)\n  list(n_denominator = n,\n       diagnosed = pt_ci(length(dxc), n),\n       treated   = pt_ci(length(rxc), n),\n       diagnosed_and_treated = pt_ci(length(intersect(dxc, rxc)), n))\n}",
        "description": "Point and annual diagnosed/treated prevalence with exact CIs, data.table. Inputs mirror the Python version:\n  enroll : person_id, enroll_start, enroll_end (Date), ffs_observable (logical)\n  dx     : person_id, svc_date (Date), dx_code   # qualifying diagnosis rows pre-filtered\n  rx     : person_id, fill_date (Date), days_supply  # qualifying drug rows pre-filtered\nReturns prevalence with Clopper-Pearson 95% CIs; hold the case rule/lookback fixed and vary externally.",
        "dependencies": [
          "data.table"
        ],
        "source_citations": [
          "goodman-2013"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "%let year       = 2024;\n%let yr_start   = \"01JAN&year\"d;\n%let yr_end     = \"31DEC&year\"d;\n%let lookback   = 730;   /* days for 'ever diagnosed' window ending yr_end */\n%let min_months = 11;\n\n/* Denominator: FFS-observable coverage for >= &min_months of the calendar year. */\nproc sql;\n  create table denom as\n  select person_id\n  from (\n    select person_id,\n           sum( intck('month',\n                      max(enroll_start, &yr_start),\n                      min(enroll_end,   &yr_end)) + 1 ) as cov_months\n    from work.enroll\n    where ffs_observable = 1\n      and max(enroll_start, &yr_start) <= min(enroll_end, &yr_end)\n    group by person_id\n  )\n  where cov_months >= &min_months;\nquit;\n\n/* Diagnosed cases: >=2 distinct service dates >=30 days apart within the lookback, among the denominator. */\nproc sql;\n  create table dx_case as\n  select d.person_id\n  from work.dx d inner join denom n on d.person_id = n.person_id\n  where d.svc_date <= &yr_end and d.svc_date >= &yr_end - &lookback\n  group by d.person_id\n  having count(distinct d.svc_date) >= 2\n     and (max(d.svc_date) - min(d.svc_date)) >= 30;\nquit;\n\n/* Treated cases: >=1 qualifying fill inside the calendar year, among the denominator. */\nproc sql;\n  create table rx_case as\n  select distinct r.person_id\n  from work.rx r inner join denom n on r.person_id = n.person_id\n  where r.fill_date between &yr_start and &yr_end;\nquit;\n\n/* Per-person case flags on the full denominator (0/1) for prevalence + standardization. */\nproc sql;\n  create table analytic as\n  select n.person_id,\n         (d.person_id is not null) as diagnosed,\n         (r.person_id is not null) as treated,\n         g.age_group, g.sex\n  from denom n\n  left join dx_case d on n.person_id = d.person_id\n  left join rx_case r on n.person_id = r.person_id\n  left join work.demog g on n.person_id = g.person_id;\nquit;\n\n/* Crude diagnosed/treated prevalence with exact binomial 95% CIs. 
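*/\nproc freq data=analytic;\n  tables diagnosed treated / binomial(level='1') cl;\nrun;\n\n/* PROC STDRATE takes EVENT= and TOTAL= as variables holding stratum counts, so aggregate to one row per stratum. */\nproc sql;\n  create table strata as\n  select age_group, sex,\n         sum(diagnosed) as n_diagnosed,\n         count(*)       as n_total\n  from analytic\n  group by age_group, sex;\nquit;\n\n/* Directly age/sex-standardized DIAGNOSED prevalence to an external standard population. */\nproc stdrate data=strata refdata=work.refpop\n             method=direct stat=risk            /* prevalence is a proportion (risk), not a rate */\n             plots=none;\n  population event=n_diagnosed total=n_total;   /* stratum-level case count and denominator */\n  reference total=reference_pop;\n  strata age_group sex;\nrun;",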
        "description": "Annual diagnosed/treated prevalence (PROC SQL cohort build) plus directly age/sex-standardized prevalence with CIs\n(PROC STDRATE). Required input datasets (post data-management):\n  work.enroll : person_id, enroll_start, enroll_end, ffs_observable (1/0)\n  work.dx     : person_id, svc_date, dx_code        /* qualifying diagnosis rows pre-filtered */\n  work.rx     : person_id, fill_date, days_supply   /* qualifying drug rows pre-filtered */\n  work.demog  : person_id, age_group, sex           /* for standardization */\n  work.refpop : age_group, sex, reference_pop       /* external standard population weights */\n%let year sets the calendar year; the >=2-claim/>=30-day rule and 730-day lookback are the judgment-dependent\nthresholds and should be re-run with alternative values for sensitivity analysis.",
        "dependencies": [],
        "source_citations": [
          "goodman-2013"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Spec[Pre-specify estimand] --> TF{Time frame?}\n  TF -->|single date| Point[Point prevalence<br/>denom = enrolled on index date]\n  TF -->|window| Period[Period prevalence<br/>denom = observable in interval]\n  TF -->|calendar year| Annual[Annual prevalence<br/>denom = >=11 of 12 months]\n  Point --> Case{Case rule?}\n  Period --> Case\n  Annual --> Case\n  Case -->|diagnosis codes| Dx[Diagnosed prevalence<br/>>=2 claims >=30d apart, fixed lookback]\n  Case -->|fill / procedure| Tx[Treated prevalence<br/>>=1 qualifying fill in window]\n  Dx --> Obs[Restrict to FFS-observable person-time<br/>exclude MA-only gaps]\n  Tx --> Obs\n  Obs --> Std[Crude AND age/sex-standardized<br/>with exact 95% CI]\n  Std --> Sens[Sensitivity: lookback length,<br/>1 vs >=2 claims, coverage threshold]",
        "caption": "Decision logic for an RWE prevalence estimate. The two specifications that govern the number — time frame and case rule — are chosen first, the denominator is then tied to observable person-time, and results are standardized with a pre-planned sensitivity sweep over the judgment-dependent thresholds.",
        "alt_text": "Flowchart branching from estimand specification into point/period/annual time frames, then diagnosed vs treated case rules, then FFS-observable restriction, standardization, and sensitivity analysis.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  A[Point prevalence<br/>snapshot on a date] -->|add window| B[Period prevalence<br/>any case in interval]\n  B -->|fix interval = calendar year| C[Annual prevalence]\n  A -.always <= .-> B\n  D[Incidence<br/>new onset / person-time] -->|x average duration<br/>steady state only| A\n  style D fill:#fff3cd,stroke:#856404",
        "caption": "Relationship among the prevalence time frames and to incidence. Period prevalence is always >= point prevalence for the same population; prevalence relates to incidence only through disease duration under steady state, which is why prevalence must not be substituted for incidence in causal or trend-in-onset questions.",
        "alt_text": "Diagram showing point prevalence becoming period and annual prevalence as the window widens, with a cautioned steady-state link from incidence to prevalence.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "is_variant_of",
        "target_slug": "descriptive-epidemiology-rwe",
        "notes": "Prevalence is a core descriptive-epidemiology frequency measure alongside incidence and standardized rates."
      },
      {
        "relation_type": "alternative_to",
        "target_slug": "incidence-rate-calculation-rwe",
        "notes": "Prevalence counts existing cases (proportion); incidence counts new onset (rate per person-time). They answer different questions and prevalence confounds onset with survival."
      },
      {
        "relation_type": "see_also",
        "target_slug": "cumulative-incidence-risk-rwe",
        "notes": "Cumulative incidence (risk) is the proportion developing disease over a period among those at risk; contrast with period prevalence, which counts all cases present, not just new ones."
      },
      {
        "relation_type": "used_with",
        "target_slug": "person-time-denominator-construction-rwe",
        "notes": "Period/annual prevalence denominators require an observable person-time / coverage-fraction rule to make the numerator opportunity comparable across people."
      },
      {
        "relation_type": "requires",
        "target_slug": "continuous-enrollment-observable-time-rwe",
        "notes": "A valid denominator requires defining continuous, observable enrollment on the index date (point) or across the interval (period/annual), excluding non-FFS-observable person-time."
      },
      {
        "relation_type": "used_with",
        "target_slug": "direct-standardization-rwe",
        "notes": "Crude prevalence must be directly age/sex-standardized to a reference population before any cross-population or cross-time comparison."
      },
      {
        "relation_type": "used_with",
        "target_slug": "indirect-standardization-smr-sir-rwe",
        "notes": "Indirect standardization (SMR/SIR-style) is an alternative when stratum-specific case counts are sparse."
      },
      {
        "relation_type": "see_also",
        "target_slug": "washout-clean-lookback-period-rwe",
        "notes": "The 'ever-diagnosed' lookback that defines a prevalent case is the same lookback machinery used for washout; its length materially changes the prevalence estimate."
      },
      {
        "relation_type": "used_with",
        "target_slug": "burden-of-disease-cost-of-illness",
        "notes": "Prevalent (and treated-prevalent) population counts are the standard denominator input to burden-of-disease and budget-impact analyses."
      }
    ],
    "aliases": [
      "point prevalence",
      "period prevalence",
      "annual prevalence",
      "treated prevalence",
      "diagnosed prevalence",
      "administrative prevalence"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "prevalent-new-user-design-rwe",
    "name": "Prevalent New-User Design",
    "short_definition": "Suissa's cohort design for comparative drug-effect studies that lets prevalent users of a comparator enter the study cohort at the moment they switch to or initiate the study drug, matching each on a time-conditional propensity score and on time since comparator initiation, so that the power lost by a strict new-user design is recovered without reintroducing prevalent-user (depletion-of-susceptibles) bias.",
    "long_description": "The **prevalent new-user (PNU) design** (Suissa, Moodie & Dell'Aniello, 2017) is a hybrid between the strict new-user\n(incident-user) design and an analysis that would naively pool prevalent users. A pure new-user design restricts both arms to\npatients initiating treatment with no prior use, which cleanly aligns time zero and removes prevalent-user bias but discards\nthe large pool of patients who were already on the comparator drug and then switch to (or add) the study drug. When the study\ndrug is typically reached by switching from an established comparator — as is common in chronic disease, oncology lines of\ntherapy, and many add-on indications — the strict new-user design can throw away the majority of the real-world treated\npopulation, leaving the study underpowered and unrepresentative. The PNU design recovers those patients: a comparator user\nbecomes eligible to enter the study-drug arm **at the time they switch/initiate**, and is matched to a comparator user who\nremains on the comparator and who has the **same accumulated treatment history** (same time since comparator initiation) and\nthe **same time-conditional propensity score** for switching at that moment.\n\n**Core conceptual mechanism.** The design solves two coupled problems that arise when you allow prevalent users in. (1)\n*Aligned, well-defined time zero*: each study-drug initiator's follow-up starts at their switch date, and the matched\ncomparator's follow-up starts at the same calendar/treatment-history point, so the two arms share a comparable \"now\" rather\nthan the study-drug arm being a survivor cohort observed from an arbitrary point. (2) *Comparable prior exposure*: because\npatients are matched on **time since comparator initiation** (the \"prevalent\" clock), the design balances depletion of\nsusceptibles — the phenomenon whereby people who tolerate a drug and survive on it for years are a healthier, selected\nsubset. 
Matching on this clock and on the **time-conditional propensity score (TCPS)** — the probability of switching at a\ngiven moment given covariates measured up to that moment — means the comparator group represents the counterfactual\n\"what if this prevalent user had stayed on the comparator.\" A single prevalent user can serve as a potential match (a\n\"prevalent-user moment\") at several time points, mirroring the risk-set sampling logic of a nested case-control / matched\ncohort, with the analysis run as a matched cohort (e.g., Cox with a frailty or robust variance for the matched sets).\n\n**What it is and is not.** The PNU design is an *extension of and complement to* the active-comparator new-user design, not a\nreplacement for it. It does not license pooling arbitrary prevalent users; the validity hinges on (a) being able to model the\nswitching propensity well from longitudinal covariates, (b) having enough comparator users at each treatment-history time to\nform matches, and (c) the study drug being reached predominantly by switching rather than by de-novo initiation. It is also\ndistinct from a target-trial emulation: PNU is a matching-and-time-alignment strategy for the specific switching structure,\nwhereas target-trial emulation is a general framework for specifying the protocol; the two are compatible and PNU can be\nviewed as one way to operationalize the eligibility/time-zero rules of an emulated switching trial.\n\n**Pros, cons, and trade-offs.**\n- **vs the strict `new-user-design`:** PNU recovers the (often large) population of switchers that the incident-user design\n  discards, restoring power and external validity for drugs that are reached by switching, while still controlling\n  prevalent-user/depletion bias through matching on time since comparator initiation and the TCPS. 
Cost: it requires\n  longitudinal data rich enough to estimate the switching propensity over time and is more complex to implement and explain.\n  **Prefer the strict new-user design** when de-novo initiation is the real-world reality and a clean incident cohort is\n  large enough; **prefer PNU** when most patients reach the study drug by switching and the new-user cohort would be small or\n  unrepresentative.\n- **vs the `active-comparator-new-user` design:** Active-comparator new-user fixes confounding by indication by comparing two\n  initiators with the same indication; PNU keeps that spirit but extends the eligible population to comparator-to-study-drug\n  switchers, matching at the switch moment. **Prefer active-comparator new-user** when two drugs are genuinely\n  interchangeable first-line options; **prefer PNU** when the study drug is a later-line or add-on/switch therapy so that\n  \"new use\" of the study drug coincides with prevalent use of the comparator.\n- **vs accepting `prevalent-user-bias` (naive pooling of prevalent users):** Naive prevalent-user comparisons mix aligned and\n  misaligned time zeros, condition on survival, and bias estimates unpredictably. PNU is the principled way to include\n  prevalent users without that bias. 
**Never** prefer naive pooling; PNU exists precisely to avoid it.\n\n**When to use.** Comparative effectiveness or safety of a drug that real-world patients predominantly reach by switching from\nor adding to an established comparator (second-line antidiabetics, switched antihypertensive or antidepressant classes,\nlater-line oncology regimens, add-on therapies); when a strict new-user cohort would be too small or would represent an\natypical de-novo-initiating subpopulation; when longitudinal claims/EHR data support estimating a time-conditional switching\npropensity from covariates measured up to each potential entry time.\n\n**When NOT to use — and when it is actively misleading or dangerous.**\n- **When de-novo initiation is the actual clinical pattern.** If patients genuinely start the study drug as a first treatment,\n  a strict new-user design is cleaner and PNU adds complexity with no gain; forcing a switching structure where none exists\n  misrepresents the population.\n- **When the time-conditional switching propensity cannot be credibly modeled.** PNU's bias control rests entirely on the\n  TCPS and on matching the prevalent clock; if longitudinal covariates are sparse or switching is driven by unrecorded\n  factors (e.g., unmeasured disease progression), the matched comparator is not a valid counterfactual and the estimate is\n  confounded — reporting it as bias-controlled is misleading.\n- **When there are too few comparator users at each treatment-history time to match.** Sparse risk sets force coarse matching\n  or dropped switchers, reintroducing imbalance; check match rates and balance by the prevalent clock before trusting the\n  result.\n- **When the prevalent clock is ignored.** Matching only on a baseline propensity score while allowing prevalent users in\n  reintroduces depletion-of-susceptibles bias; the time-since-comparator-initiation match is not optional.\n\n**Data-source operational depth.**\n- **Claims (FFS vs MA):** The natural 
substrate — dispensing records give exposure start, switch dates, and the longitudinal\n  covariate stream (diagnoses, prior fills, utilization) needed for the TCPS. Build the prevalent clock from the first\n  observable comparator fill within a continuously enrolled, FFS-observable window; Medicare Advantage enrollees generate no\n  fee-for-service claims, so MA-only spans corrupt both the prevalent clock and the switching-propensity covariates — restrict\n  to FFS-observable person-time. Define switching with explicit grace/gap rules so an add-on is not misread as a switch.\n- **EHR:** Medication orders, problem lists, labs, and vitals give richer time-varying covariates for the switching model\n  (disease severity, response markers), sharpening the TCPS, but encounter-driven capture and out-of-system care mean the\n  switch date and prior-exposure clock may be incomplete; require demonstrable in-system activity and reconcile order vs fill.\n- **Registry / linked:** Disease registries supply adjudicated severity and line-of-therapy information that materially\n  improves the switching propensity and the time-since-initiation clock; linked claims-EHR is the strongest substrate but\n  introduces linkage selection that must be reported.",
    "primary_category": "Study_Design",
    "tags": [
      "prevalent_new_user",
      "time_conditional_propensity_score",
      "prevalent_user_bias",
      "depletion_of_susceptibles",
      "switching",
      "new_user_design"
    ],
    "applies_to_study_types": [
      "new_user",
      "active_comparator_new_user",
      "cohort_retrospective",
      "comparative_effectiveness",
      "claims_analysis",
      "ehr_study"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "linked",
      "registry"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1002/pds.4107",
        "url": "https://doi.org/10.1002/pds.4107",
        "citation_text": "Suissa S, Moodie EEM, Dell'Aniello S. Prevalent new-user cohort designs for comparative drug effect studies by time-conditional propensity scores. Pharmacoepidemiology and Drug Safety. 2017;26(4):459-468.",
        "year": 2017,
        "authors_short": "Suissa et al.",
        "notes": "The defining paper introducing the prevalent new-user design, the time-conditional propensity score, and matching on time since comparator initiation to admit prevalent users without prevalent-user bias."
      },
      {
        "role": "explain",
        "doi": "10.1007/s40471-015-0053-5",
        "url": "https://doi.org/10.1007/s40471-015-0053-5",
        "citation_text": "Lund JL, Richardson DB, Stürmer T. The active comparator, new user study design in pharmacoepidemiology: historical foundations and contemporary application. Current Epidemiology Reports. 2015;2(4):221-228.",
        "year": 2015,
        "authors_short": "Lund et al.",
        "notes": "Lays out the new-user and active-comparator-new-user foundations that the prevalent new-user design extends, including why aligned time zero and incident-user restriction control confounding by indication and prevalent-user bias."
      },
      {
        "role": "explain",
        "doi": "10.1002/pds.4189",
        "url": "https://doi.org/10.1002/pds.4189",
        "citation_text": "Gagne JJ. New-user designs with conditional propensity scores: a unified complement to the traditional active comparator new-user approach. Pharmacoepidemiology and Drug Safety. 2017;26(4):469-471.",
        "year": 2017,
        "authors_short": "Gagne",
        "notes": "Commentary situating the prevalent new-user / conditional-propensity-score approach as a unified complement to the active-comparator new-user design, clarifying when admitting prevalent users via the TCPS is appropriate."
      },
      {
        "role": "use",
        "doi": "10.1097/ede.0000000000001947",
        "url": "https://doi.org/10.1097/ede.0000000000001947",
        "citation_text": "Galmiche S, Comin M, Dell'Aniello S, et al. Antibiotics and preterm delivery: the prevalent new-user cohort design to resolve immortal time bias. Epidemiology. 2026;37(3):355-362.",
        "year": 2026,
        "authors_short": "Galmiche et al.",
        "notes": "A worked application of the prevalent new-user cohort design (with time-conditional propensity scores) used to align time zero and resolve immortal-time/prevalent-user bias in a pregnancy drug-safety question."
      }
    ],
    "plain_language_summary": "The prevalent new-user design is a way to study how two drugs compare in the real world when most patients reach the second drug by switching from the first one rather than starting it fresh. It lets those switchers into the study, but only at the exact moment they switch, and carefully matches each switcher to a patient who stayed on the original drug and had been on it for the same amount of time. This time-on-drug matching step is what keeps the comparison fair by ensuring both groups have survived the same duration on the comparator drug before the clock starts.",
    "key_terms": [
      {
        "term": "prevalent new-user",
        "definition": "A patient who has already been taking the comparator drug for some time and then switches to the study drug, as opposed to a brand-new starter who has never taken either drug."
      },
      {
        "term": "time-conditional matching",
        "definition": "Pairing a switcher with a comparator-drug continuer who has been on the comparator for the same number of months, so both groups have the same amount of drug exposure history before the study clock starts."
      },
      {
        "term": "index date",
        "definition": "A patient's personal day-zero for a study, which in this design is the date the switcher moves from the comparator drug to the study drug."
      },
      {
        "term": "washout",
        "definition": "A look-back period before a patient enters the study to confirm they were truly new to a drug and not just temporarily off it."
      },
      {
        "term": "comparator drug",
        "definition": "The established drug patients are already taking, against which the new study drug is being compared."
      },
      {
        "term": "depletion of susceptibles",
        "definition": "A bias that arises because patients who stay on a drug for a long time are a healthier, sturdier survivor group than those who just started; ignoring this makes the longer-running drug look artificially safe."
      }
    ],
    "worked_example": {
      "scenario": "We want to compare drug B (a newer oral diabetes medication) against drug A (the established standard). Drug B is only prescribed to patients who tried drug A first and then switched. We identify a patient, Pat, who started drug A on 2022-01-01 and switched to drug B 9 months later on 2022-10-01. Under a strict new-user design, Pat would be excluded from the drug B arm because Pat had already been treated with drug A and so is not treatment-naive at the switch. Under the prevalent new-user design, Pat enters the study-drug arm at the switch date (2022-10-01). We then find a match for Pat: another patient, Casey, who started drug A on 2022-01-01 (so also 9 months into drug A as of 2022-10-01) and had not switched by that date. Both Pat and Casey begin their study follow-up on 2022-10-01.",
      "dataset": {
        "caption": "Key dates and measurements for the switcher (Pat) and the matched comparator continuer (Casey) as they would appear in a linked claims file.",
        "columns": [
          "person_id",
          "comparator_start",
          "index_date",
          "months_on_comparator_at_index",
          "arm"
        ],
        "rows": [
          [
            "PAT-001",
            "2022-01-01",
            "2022-10-01",
            9,
            "study_drug_B"
          ],
          [
            "CASEY-002",
            "2022-01-01",
            "2022-10-01",
            9,
            "comparator_drug_A"
          ]
        ]
      },
      "steps": [
        "Pat starts drug A on 2022-01-01. The prevalent clock begins: month 1, month 2, ... month 9.",
        "On 2022-10-01 (9 months in), Pat switches to drug B. This date becomes Pat's index date and the start of follow-up.",
        "To find a match for Pat, we look for drug-A patients who were also exactly 9 months into drug A on 2022-10-01 and had not yet switched. Casey qualifies: Casey also started drug A on 2022-01-01, is still on it, and has the same estimated propensity to switch as Pat.",
        "Both Pat and Casey begin their follow-up on 2022-10-01. Any events (hospitalizations, lab changes) from 2022-10-01 forward are counted for both in their respective arms.",
        "Matching on time-since-comparator-start (9 months for both) ensures neither arm has more drug-A survivors selected into it than the other, keeping the comparison of outcomes fair."
      ],
      "result": "One matched pair: Pat (9 months on drug A then switched to drug B) versus Casey (9 months on drug A, continuing). Follow-up for both starts 2022-10-01. The design recovers Pat for the analysis instead of discarding this common real-world switcher.",
      "timeline_spec": {
        "title": "Prevalent new-user design: switcher Pat matched to comparator continuer Casey at 9 months",
        "window": {
          "start": "2022-01-01",
          "end": "2022-10-01",
          "label": "Comparator period (9 months on drug A) leading to index date"
        },
        "events": [
          {
            "label": "Pat: starts drug A",
            "start": "2022-01-01",
            "length_days": 273,
            "quantity": "9 months on comparator"
          },
          {
            "label": "Pat: switches to drug B (index date)",
            "start": "2022-10-01",
            "length_days": 1,
            "quantity": "switch event"
          },
          {
            "label": "Casey: starts drug A",
            "start": "2022-01-01",
            "length_days": 273,
            "quantity": "9 months on comparator"
          },
          {
            "label": "Casey: matched at same time point (continues drug A)",
            "start": "2022-10-01",
            "length_days": 1,
            "quantity": "match event"
          }
        ],
        "spans": [
          {
            "kind": "exposed",
            "start": "2022-01-01",
            "end": "2022-09-30",
            "label": "Pat: 9 months on drug A (comparator period)"
          },
          {
            "kind": "followup",
            "start": "2022-10-01",
            "end": "2022-10-01",
            "label": "Pat: index date (switch to drug B) — follow-up starts here"
          },
          {
            "kind": "exposed",
            "start": "2022-01-01",
            "end": "2022-09-30",
            "label": "Casey: 9 months on drug A (same duration as Pat)"
          },
          {
            "kind": "followup",
            "start": "2022-10-01",
            "end": "2022-10-01",
            "label": "Casey: matched index date — follow-up starts here on drug A"
          }
        ],
        "result": {
          "label": "Both Pat and Casey begin follow-up 2022-10-01 with identical time-on-comparator (9 months). Match valid.",
          "value": "matched pair, 9 months on comparator each"
        },
        "caption": "Pat (top row) accumulates 9 months on drug A then switches to drug B on 2022-10-01. Casey (bottom row) also has exactly 9 months on drug A on 2022-10-01 and continues. Both enter follow-up on the same calendar date with the same treatment history length, making the comparison of outcomes valid.",
        "alt_text": "Two parallel horizontal timelines. The top row shows Pat starting drug A on January 1 2022, an arrow at October 1 2022 labeled switch to drug B which becomes the index date, then a follow-up bar extending right. The bottom row shows Casey starting drug A on January 1 2022, a match marker at October 1 2022 (same date), then a follow-up bar continuing on drug A. Both follow-up bars start at the same point, illustrating aligned time zero with 9 months of comparator history behind each patient."
      }
    },
    "prerequisites": [
      "new-user-design",
      "active-comparator-new-user",
      "prevalent-user-bias"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Switch-based prevalent new-user matching",
        "description": "Comparator users become eligible for the study-drug arm at their switch date and are matched to comparator users who remain on the comparator at the same time since comparator initiation and the same time-conditional propensity score.",
        "edge_cases": [
          "A single prevalent user can be a potential match at several treatment-history times (risk-set-style sampling); use robust or frailty variance for the matched sets.",
          "Switching must be distinguished from add-on/augmentation with explicit grace/gap rules so an add-on is not coded as a switch."
        ],
        "data_source_notes": "claims: build the prevalent clock from the first FFS-observable comparator fill; define switch with days-supply-based grace rules."
      },
      {
        "name": "Time-conditional propensity score estimation",
        "description": "Estimate the probability of switching at each potential entry time from covariates measured up to that time (longitudinal logistic/pooled model), then match study-drug initiators to comparator continuers on the predicted TCPS and the prevalent clock.",
        "edge_cases": [
          "Sparse comparator risk sets at some treatment-history times force coarse matching or dropped switchers; check match rates and balance by the prevalent clock.",
          "Time-varying covariates driving the switch (disease progression) must be captured or the TCPS is misspecified and the match is not a valid counterfactual."
        ],
        "data_source_notes": "ehr/registry: labs, vitals, severity, and line-of-therapy sharpen the TCPS relative to claims-only diagnoses."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "new-user-design",
        "pros_of_this": "Recovers the large population of comparator-to-study-drug switchers the strict incident-user design discards, restoring power and external validity while still controlling prevalent-user/depletion-of-susceptibles bias via the prevalent clock and time-conditional propensity score.",
        "cons_of_this": "Requires longitudinal data rich enough to model the switching propensity over time, and is harder to implement and explain than a clean incident-user cohort.",
        "when_to_prefer": "When most patients reach the study drug by switching and a strict new-user cohort would be small or atypical; keep the strict new-user design when de-novo initiation is the genuine clinical reality."
      },
      {
        "compared_to": "active-comparator-new-user",
        "pros_of_this": "Keeps the confounding-by-indication control of comparing two initiators but extends eligibility to comparator-to-study-drug switchers matched at the switch moment, covering later-line and add-on/switch therapies.",
        "cons_of_this": "Adds the time-conditional propensity model and prevalent-clock matching, which the simpler active-comparator new-user design does not need when two drugs are interchangeable first-line options.",
        "when_to_prefer": "When the study drug is a later-line or switch/add-on therapy so that new use of it coincides with prevalent use of the comparator; use active-comparator new-user for interchangeable first-line drugs."
      },
      {
        "compared_to": "prevalent-user-bias",
        "pros_of_this": "Provides a principled way to admit prevalent users with aligned time zero and matched prior exposure, instead of naive pooling that mixes survivor cohorts and misaligned time zeros and biases estimates unpredictably.",
        "cons_of_this": "The bias control is only as good as the time-conditional propensity model and the availability of comparator matches at each treatment-history time.",
        "when_to_prefer": "Always preferred over naive prevalent-user pooling; PNU exists precisely to avoid prevalent-user bias while keeping the switchers."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Dispensing records supply exposure start, switch dates, and the longitudinal covariate stream for the TCPS; build the prevalent clock from the first FFS-observable comparator fill in a continuously enrolled window and exclude Medicare Advantage-only spans, which corrupt both the clock and the covariates. Define switching with explicit days-supply grace/gap rules so an add-on is not miscoded as a switch.",
      "ehr": "Orders, problem lists, labs, and vitals give richer time-varying covariates (severity, response) that sharpen the time-conditional propensity score, but encounter-driven, out-of-system care can leave the switch date and prior-exposure clock incomplete; require in-system activity and reconcile order vs fill dates.",
      "registry": "Disease registries supply adjudicated severity and line-of-therapy that materially improve the switching propensity and the time-since-initiation clock; verify completeness and reporting lag by calendar year.",
      "linked": "Linked claims-EHR (and registry) is the strongest substrate for both the prevalent clock and the longitudinal TCPS covariates; report linkage selection as a separate source of bias."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\nimport pandas as pd\nimport statsmodels.formula.api as smf\n\n# Illustrative person-period data: one row per person per month since comparator initiation.\nrng = np.random.default_rng(3)\nrows = []\nfor pid in range(2000):\n    comorb = rng.binomial(1, 0.3)\n    age = rng.normal(60, 10)\n    switched = False\n    for t in range(1, 25):                       # months on the comparator\n        if switched:\n            break\n        p_switch = 1/(1+np.exp(-(-4 + 0.03*(age-60) + 0.8*comorb + 0.04*t)))\n        on_study = rng.binomial(1, p_switch)     # switch to study drug this month?\n        rows.append(dict(person_id=pid, t=t, on_study=on_study, age=age, comorb=comorb))\n        switched = bool(on_study)\npp = pd.DataFrame(rows)\n\n# Time-conditional propensity score: probability of switching at month t given covariates up to t.\ntcps = smf.logit(\"on_study ~ age + comorb + t\", data=pp).fit(disp=0)\npp[\"ps\"] = tcps.predict(pp)\n\n# Within each prevalent-clock time t, nearest-neighbor match switchers to comparator continuers on the TCPS.\nmatches = []\nfor t, g in pp.groupby(\"t\"):\n    switchers = g[g.on_study == 1]\n    controls  = g[g.on_study == 0].copy()\n    for _, s in switchers.iterrows():\n        if controls.empty:\n            continue\n        j = (controls[\"ps\"] - s[\"ps\"]).abs().idxmin()      # closest TCPS at the same t\n        # iterrows() upcasts the mixed-type row to float, so cast the ids back to int\n        matches.append((int(s[\"person_id\"]), int(controls.loc[j, \"person_id\"]), t))\n        controls = controls.drop(j)                        # match without replacement within t\nmatched = pd.DataFrame(matches, columns=[\"study_id\", \"comparator_id\", \"match_time\"])\nprint(f\"switchers matched: {len(matched)} (each pair shares t; TCPS is the nearest available)\")\nprint(matched.head())",
        "description": "Prevalent new-user matching on a time-conditional propensity score (TCPS) and the prevalent clock, on illustrative\nlongitudinal claims-style data. Inputs: a person-period table (person_id, t = months since comparator initiation, on_study\n= 1 in the period the person switches to the study drug, plus time-varying covariates age, comorb). A pooled logistic model\nestimates the per-period switching probability (the TCPS); each switcher is nearest-neighbor matched, within the same t, to\na comparator continuer on the TCPS (Suissa et al. 2017). statsmodels fits the TCPS; the matched sets feed a downstream Cox.",
        "dependencies": [
          "pandas",
          "numpy",
          "statsmodels"
        ],
        "source_citations": [
          "suissa-2017",
          "lund-2015"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(MatchIt)\nset.seed(3)\n\n# Illustrative person-period data: one row per person per month since comparator initiation.\nmk <- function() {\n  comorb <- rbinom(1, 1, 0.3); age <- rnorm(1, 60, 10); out <- list(); sw <- FALSE\n  for (t in 1:24) {\n    if (sw) break\n    p  <- plogis(-4 + 0.03*(age-60) + 0.8*comorb + 0.04*t)\n    on <- rbinom(1, 1, p)\n    out[[t]] <- data.frame(t = t, on_study = on, age = age, comorb = comorb)\n    sw <- on == 1\n  }\n  do.call(rbind, out)\n}\npp <- do.call(rbind, lapply(1:2000, function(i) cbind(person_id = i, mk())))\n\n# Time-conditional PS via pooled logistic; exact-match the prevalent clock t, NN-match the TCPS.\nm <- matchit(on_study ~ age + comorb + t,\n             data = pp, method = \"nearest\", distance = \"glm\",\n             exact = ~ t, ratio = 1)        # exact on time since comparator initiation\nsummary(m)                                  # balance of the TCPS and covariates within t\nmatched <- match.data(m)                    # feed to a downstream Cox on the matched cohort\ncat(sprintf(\"matched rows: %d\\n\", nrow(matched)))",
        "description": "Prevalent new-user matching with the MatchIt package on illustrative person-period claims-style data (person_id, t = months\nsince comparator initiation, on_study, age, comorb). A pooled logistic model gives the time-conditional propensity score and\nMatchIt performs exact matching on the prevalent clock t with nearest-neighbor matching on the TCPS (Suissa et al. 2017).",
        "dependencies": [
          "MatchIt"
        ],
        "source_citations": [
          "suissa-2017",
          "gagne-2017"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* Time-conditional propensity score: P(switch at month t | covariates up to t). */\nproc logistic data=work.pp noprint;\n  model on_study(event='1') = age comorb t;\n  output out=pp_ps p=ps;                       /* predicted switching probability */\nrun;\n\n/* Split into switchers and comparator continuers, carrying the prevalent clock t and the TCPS. */\ndata switchers controls;\n  set pp_ps;\n  if on_study = 1 then output switchers;\n  else output controls;\nrun;\n\n/* Greedy nearest-neighbor match WITHIN each t on the TCPS (no fetch-first; ranked join + dedup). */\nproc sql;\n  create table cand as\n  select s.person_id as study_id, c.person_id as comparator_id, s.t as match_time,\n         abs(s.ps - c.ps) as psdiff\n  from switchers s inner join controls c\n    on s.t = c.t                               /* exact match on time since comparator initiation */\n  group by s.person_id\n  having calculated psdiff = min(abs(s.ps - c.ps));   /* closest TCPS control per switcher */\nquit;\n\n/* Enforce one control per switcher (dedupe ties), then this matched set feeds PROC PHREG. */\nproc sort data=cand; by study_id psdiff; run;\ndata matched; set cand; by study_id; if first.study_id; run;\nproc print data=matched(obs=10) noobs; run;",
        "description": "Prevalent new-user matching in SAS with no dedicated PROC: PROC LOGISTIC estimates the time-conditional propensity score\nfrom the person-period data (work.pp: person_id, t = months since comparator initiation, on_study, age, comorb), then a\ngreedy nearest-neighbor match within each prevalent-clock time t is done in a DATA step / PROC SQL on the predicted PS\n(Suissa et al. 2017). The matched pairs feed a downstream PROC PHREG on the matched cohort.",
        "dependencies": [],
        "source_citations": [
          "suissa-2017",
          "lund-2015"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": "prevalent-new-user-design-rwe-timeline.svg",
        "mermaid": null,
        "caption": "Pat (top row) accumulates 9 months on drug A then switches to drug B on 2022-10-01. Casey (bottom row) also has exactly 9 months on drug A on 2022-10-01 and continues. Both enter follow-up on the same calendar date with the same treatment history length, making the comparison of outcomes valid.",
        "alt_text": "Two parallel horizontal timelines. The top row shows Pat starting drug A on January 1 2022, an arrow at October 1 2022 labeled switch to drug B which becomes the index date, then a follow-up bar extending right. The bottom row shows Casey starting drug A on January 1 2022, a match marker at October 1 2022 (same date), then a follow-up bar continuing on drug A. Both follow-up bars start at the same point, illustrating aligned time zero with 9 months of comparator history behind each patient.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Init[Patient initiates comparator drug] --> Clock[Prevalent clock starts<br/>t = time since comparator initiation]\n  Clock --> Risk{At time t,<br/>does the patient switch to study drug?}\n  Risk -->|Yes switches| SW[Enters STUDY-DRUG arm<br/>time zero = switch date]\n  Risk -->|No continues| CC[Eligible COMPARATOR match<br/>at the same t]\n  SW --> Match[Match on time-conditional PS<br/>AND on t time since initiation]\n  CC --> Match\n  Match --> Cohort[Matched cohort<br/>aligned time zero + comparable prior exposure]\n  Cohort --> Cox[Outcome model<br/>Cox with robust/frailty variance for matched sets]",
        "caption": "The prevalent new-user design. Comparator users become eligible to enter the study-drug arm at the moment they switch, and are matched to comparator continuers sharing the same time since comparator initiation (the prevalent clock) and the same time-conditional propensity score, giving aligned time zero with comparable prior exposure.",
        "alt_text": "Flowchart from comparator initiation, starting a prevalent clock, to a switching decision at time t; switchers enter the study-drug arm and are matched on the time-conditional propensity score and time since initiation to comparator continuers, forming a matched cohort analyzed with a Cox model.",
        "source_type": "illustrative",
        "source_citations": [
          "suissa-2017"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Q{How do real-world patients<br/>reach the study drug?} -->|De-novo initiation| NU[Strict new-user design<br/>clean incident cohort]\n  Q -->|Predominantly by switching<br/>from a comparator| PNU{Can you model the<br/>time-conditional switching PS?}\n  PNU -->|Yes + enough comparator matches| Use[Prevalent new-user design<br/>match on TCPS + prevalent clock]\n  PNU -->|No / sparse risk sets| Caution[Do NOT naively pool prevalent users<br/>reintroduces depletion bias]",
        "caption": "Choosing between the strict new-user and prevalent new-user designs. PNU is indicated only when patients reach the study drug by switching and the time-conditional switching propensity can be modeled with enough comparator users to match; otherwise stay with a strict new-user design rather than pooling prevalent users.",
        "alt_text": "Decision tree asking how patients reach the study drug; de-novo initiation routes to the strict new-user design, while predominant switching routes to the prevalent new-user design only if the time-conditional propensity score is estimable with adequate matches, otherwise a warning against naive prevalent-user pooling.",
        "source_type": "illustrative",
        "source_citations": [
          "suissa-2017"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "alternative_to",
        "target_slug": "new-user-design",
        "notes": "The prevalent new-user design is an alternative to the strict incident-user design that admits comparator-to-study-drug switchers while controlling prevalent-user bias, rather than discarding them."
      },
      {
        "relation_type": "is_variant_of",
        "target_slug": "active-comparator-new-user",
        "notes": "PNU extends the active-comparator new-user design by allowing prevalent comparator users to enter at the switch moment, matched on the time-conditional propensity score and time since comparator initiation."
      },
      {
        "relation_type": "affects",
        "target_slug": "prevalent-user-bias",
        "notes": "The design exists specifically to admit prevalent users without incurring prevalent-user (depletion-of-susceptibles) bias, by matching on the prevalent clock and the time-conditional propensity score."
      },
      {
        "relation_type": "used_with",
        "target_slug": "propensity-score-methods-psm-iptw",
        "notes": "The matching engine is a time-conditional propensity score; PNU is a time-indexed application of propensity-score matching applied at each potential switch time."
      },
      {
        "relation_type": "see_also",
        "target_slug": "target-trial-emulation",
        "notes": "PNU operationalizes the eligibility and time-zero rules of an emulated switching trial; it is compatible with, and one concrete way to instantiate, a target-trial emulation for switch/add-on questions."
      },
      {
        "relation_type": "see_also",
        "target_slug": "immortal-time-bias-handling",
        "notes": "By starting follow-up at the switch date for switchers and the matched time for continuers, PNU aligns time zero and avoids the immortal-time bias that arises when prevalent users are followed from a misaligned start."
      }
    ],
    "aliases": [
      "prevalent new-user design",
      "prevalent new user cohort design",
      "PNU design",
      "time-conditional propensity score design",
      "Suissa prevalent new-user"
    ],
    "complexity": "advanced",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta",
      "journal"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "prevalent-user-bias",
    "name": "Prevalent User Bias",
    "short_definition": "The bias that arises when a drug cohort includes patients already on treatment at the start of follow-up (prevalent users) rather than restricting to incident initiators at a common time zero, conflating early discontinuers and events with later survivors and inducing depletion of susceptibles, immortal time, and adjustment for post-initiation covariates.",
    "long_description": "**Prevalent user bias** (also called survivor bias or depletion-of-susceptibles bias) is the systematic error that occurs\nwhen an observational drug study counts person-time and outcomes from patients who were *already taking* the drug when\nfollow-up began, instead of restricting to **incident (new) users** whose follow-up starts at first exposure. A prevalent\ncohort is a left-truncated, conditionally selected sample: to appear in it, a patient had to *survive on treatment*, free\nof the outcome and free of intolerable side effects, long enough to still be filling the drug at study entry. The early\nhigh-risk window — the first weeks and months when adverse events, discontinuations, and the strongest treatment effects\nconcentrate — is invisible. What remains is an enriched pool of tolerators who look healthier than a representative\ninitiator, biasing harms toward the null and exaggerating apparent benefit.\n\n**Core conceptual distinction**. Three distinct mechanisms ride together under one label, and separating them is what makes\nthe bias tractable. (1) **Depletion of susceptibles**: patients destined to be harmed (or to respond poorly) have already\nleft the population by study entry, so the surviving prevalent users are a selected, lower-risk subset — the hazard you\nobserve is not the hazard a new initiator faces. (2) **Immortal time / time-zero misalignment**: if follow-up starts at a\nlandmark other than initiation (a diagnosis date, an enrollment date, the calendar start of the database), the span between\ntrue initiation and study entry is guaranteed event-free by construction, manufacturing apparent protection. 
(3)\n**Adjustment for post-initiation covariates**: baseline covariates measured at study entry for a prevalent user are\nactually *consequences* of prior treatment (a normalized LDL, a controlled blood pressure, a stable HbA1c), so conditioning\non them adjusts away part of the very effect under study or opens collider paths. The corresponding *estimand* distinction:\nprevalent-user analyses target a vague \"effect of being on the drug\" averaged over an unknowable mix of durations; the\nnew-user design targets the well-defined effect of *initiating* treatment versus a comparator strategy from a common time\nzero — the quantity a clinician and a regulator can act on.\n\n**Pros, cons, and trade-offs**\n- **vs the new-user (incident-user) design** — the canonical fix. New-user restriction sets time zero at first fill after a\n  clean washout, eliminating depletion of susceptibles, immortal time, and post-initiation adjustment in one move. Cost:\n  smaller cohorts and a population of *initiators*, who may differ from the prevalent users who dominate real-world\n  prescribing; rare or expensive drugs may yield too few incident users. **Prefer new-user** for essentially every causal\n  question about a treatment's effect, especially safety and early effects.\n- **vs the prevalent new-user (time-conditional PS) design (Suissa 2017)** — a middle path when pure incident users are too\n  few. It admits patients who switch from the comparator to the study drug and matches each switcher to a comparator\n  continuer with the same *duration of prior comparator use* via time-conditional propensity scores, recovering sample\n  size while restoring a defensible time zero for each matched set. Cost: more complex, assumes that\n  duration-conditional exchangeability holds, and still cannot resurrect the unobserved early discontinuers. 
**Prefer it**\n  over a naive prevalent cohort whenever incident-only analysis is underpowered.\n- **vs simply keeping the prevalent/ever-exposed cohort** — larger N and superficial \"real-world representativeness.\" But\n  for any initiation question this is the bias you came to remove; it is rarely defensible and routinely overturned when an\n  incident-user re-analysis is run. Acceptable only for descriptive prevalence-of-use questions, never for comparative\n  causal estimates.\n\n**When to use**. This is a *bias to recognize and prevent*, not a method to apply. Invoke this concept whenever you build\nor review a cohort of chronic-drug users in claims, EHR, or registry data: statins, antihypertensives, oral\nantidiabetics, DMARDs, biologics, antidepressants, inhaled COPD/asthma therapies. Use it as the diagnostic lens during\nprotocol design (is time zero initiation?), during data management (does the washout truly capture no prior use?), and\nduring peer review of any study that reports \"current users,\" \"ever exposed,\" or a flat exposure flag without a stated\nindex date.\n\n**When NOT to use — and when it is actively misleading or dangerous**\n- **Do not accept a prevalent-user cohort for any initiation or safety question.** Reporting an attenuated or null harm\n  from a prevalent cohort is the dangerous failure mode: depletion of susceptibles hides exactly the early excess risk that\n  pharmacovigilance exists to detect (the rofecoxib and HRT histories are object lessons). 
A \"reassuring\" prevalent-user\n  safety result can be affirmatively harmful.\n- **Do not \"fix\" a prevalent cohort by adjusting for baseline covariates measured at study entry.** Those covariates are\n  post-initiation; adjustment makes the estimate worse, not better, by controlling away the treatment effect or inducing\n  collider bias.\n- **Do not impose new-user restriction blindly when the drug is genuinely never-stopped and the question is about long-term\n  maintenance** — but even then, anchor time zero at initiation and follow forward; the prevalent shortcut is still wrong.\n- **Do not assume \"no fill in the lookback\" means new use when the lookback is unobserved** (Medicare Advantage-only\n  person-time, a new health-plan enrollee, a registry whose drug-start field is blank). Misclassified prevalent users\n  masquerading as incident users reintroduce the bias silently.\n\n**Data-source operational depth**\n- **Claims (FFS or commercial):** The defining operation is the washout. Require *continuous medical AND pharmacy\n  enrollment* across the full lookback (commonly 365 days, sometimes 180) so that \"no prior fill\" reflects true absence,\n  not unobserved care. The dominant failure mode is **Medicare Advantage**: MA encounter data are incomplete and FFS\n  pharmacy claims are absent, so a long-time user who switched into your view looks incident — restrict to enrollees with\n  Parts A/B/D (or a commercial medical+pharmacy benefit) and exclude MA-only person-time. Stockpiling, 90-day mail-order,\n  and free samples distort `days_supply` and can hide a prior fill that ran into the lookback. 
**Differential competing\n  risks** bite in elderly claims: prevalent users who survived to study entry have differentially lower competing mortality\n  than a representative initiator, further selecting the cohort.\n- **EHR:** Initiation is the *order* or *administration*, not the dispense — but a medication-list entry marked \"active\"\n  often has no reliable start date and may be a carry-over reconciled in from an outside system, the textbook prevalent\n  user wearing a new-user costume. Require a first *order after a clean gap* and, where possible, link to dispensing to\n  confirm the patient actually started. Visit-driven capture also means a patient who leaves the system is differentially\n  lost, compounding survivor selection.\n- **Registry:** Enrollment date frequently does not equal treatment start; a disease registry may enroll prevalent cases\n  years into therapy. Require an explicit drug-start-date field and treat enrollment-as-time-zero as an immortal-time trap.\n  Link to claims for the full fill history and to a death index to characterize the competing risk.\n- **Linked claims–EHR–vital records:** Best substrate for distinguishing prevalent from incident use (EHR start + claims\n  fill history + mortality), but order/fill/service date discrepancies must be reconciled *before* assigning time zero, and\n  only the linkable subset is observed (a selection layer on top of the survivor selection).\n\n**Worked claims example (depletion of susceptibles in action).** Question: 90-day risk of acute kidney injury (AKI) after\n*starting* an ACE inhibitor among adults with hypertension in a commercial + Medicare FFS database. A naive \"current user\"\nanalyst pulls everyone with an ACEi fill overlapping 2024-01-01 (`fill_date` ≤ index ≤ `fill_date + days_supply`) and\nfollows them 90 days. Suppose the true biology is an early hazard: AKI risk is RR ≈ 2.5 in the first 90 days of initiation,\nthen null. 
Among 1,000 *true initiators* (first fill in 2024 after a 365-day fill-free, continuously A/B/D-enrolled\nlookback), 60 develop AKI in 90 days (6.0%). But the prevalent pool that overlaps 2024-01-01 is dominated by patients who\nstarted in 2021–2023 and *kept filling* — they have already passed through and survived the early-hazard window (the\nsusceptibles were depleted: those who got AKI stopped the drug, switched, or died). Among 4,000 such prevalent users, only\n40 AKI events occur in the next 90 days (1.0%). A pooled \"current user\" rate of (60+40)/(1,000+4,000) = 2.0% buries the\n6.0% initiation risk, and an unwary safety read concludes ACEi is well tolerated. The correct construction: index =\n*first* ACEi `fill_date` in the study window with **no ACEi fill in the prior 365 days** and continuous medical+pharmacy\nFFS enrollment spanning that entire lookback (excluding MA-only person-time so the absence of prior fills is observed, not\nmissing); follow forward 90 days from that fill; censor at disenrollment, death, end of data, and — for an as-treated\nvariant — last `days_supply` end plus a grace period. The diagnostic that exposes the bias is to count, within the naive\ncohort, how many \"current users\" had a fill in the prior 365 days (the would-be-excluded prevalent users) and compare\ntheir 90-day event rate to that of the true initiators; a large gap is the depletion-of-susceptibles signature.\n\n**Interpreting the output**\n\nIn the ACEi-and-AKI study, the raw dataset mixes 1,000 new initiators (60 events, 90-day\nrisk = 6.0%) with 4,000 prevalent users (40 events, risk = 1.0%). The pooled naive event\nrate across all 5,000 patients is 100/5,000 = 2.0%.\n\n*(1) Formal interpretation.* The pooled rate of 2.0% misrepresents the risk that matters for\na prescribing decision — the risk a new patient faces when starting ACEi therapy. 
Prevalent\nusers have already survived the early high-risk window; their 1.0% event rate reflects\ndepletion of susceptibles (patients who developed early AKI, could not tolerate the drug, or\ndiscontinued are absent from the prevalent pool), not a lower underlying pharmacological hazard. Mixing\nthese populations suppresses the true 6.0% early-initiation risk by a factor of three. A\nnew-user design restricted to first prescriptions recovers the 6.0% figure and provides the\nclinically relevant risk profile for drug-utilization policy and labeling.\n\n*(2) Practical interpretation.* A drug-safety signal operating in the first 90 days of therapy\nwill be diluted — potentially below statistical detection thresholds — in any analysis that\npools new and prevalent users. The contrast of 6.0% versus 2.0% illustrates a threefold\ndilution: a pooled rate that appears to show only modest early risk is masking a rate three\ntimes higher among new initiators, and six times the 1.0% seen in prevalent users. Restricting to new users is not a\ndesign preference; it is a precondition for estimating initiation hazards and for detecting\ntime-limited early adverse effects.",
    "primary_category": "Bias_Control",
    "tags": [
      "prevalent-user-bias",
      "survivor-bias",
      "depletion-of-susceptibles",
      "left-truncation",
      "new-user-design",
      "immortal-time",
      "chronic-drug",
      "pharmacoepidemiology"
    ],
    "applies_to_study_types": [
      "new_user",
      "active_comparator_new_user",
      "cohort_retrospective",
      "claims_analysis",
      "ehr_study"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1093/aje/kwg231",
        "url": "https://doi.org/10.1093/aje/kwg231",
        "citation_text": "Ray WA. Evaluating medication effects outside of clinical trials: new-user designs. American Journal of Epidemiology. 2003;158(9):915-920.",
        "year": 2003,
        "authors_short": "Ray",
        "notes": "Foundational statement of prevalent user bias and the new-user (incident-user) design proposed to remove it, written specifically for administrative claims data."
      },
      {
        "role": "explain",
        "doi": "10.1002/pds.4107",
        "url": "https://doi.org/10.1002/pds.4107",
        "citation_text": "Suissa S, Moodie EEM, Dell'Aniello S. Prevalent new-user cohort designs for comparative drug effect studies by time-conditional propensity scores. Pharmacoepidemiology and Drug Safety. 2017;26(4):459-468.",
        "year": 2017,
        "authors_short": "Suissa et al.",
        "notes": "Defines the prevalent new-user design with time-conditional propensity scores, the principled middle path when pure incident users are too few to power an analysis."
      },
      {
        "role": "explain",
        "doi": "10.1093/aje/kwm324",
        "url": "https://doi.org/10.1093/aje/kwm324",
        "citation_text": "Suissa S. Immortal time bias in pharmacoepidemiology. American Journal of Epidemiology. 2008;167(4):492-499.",
        "year": 2008,
        "authors_short": "Suissa",
        "notes": "Companion bias - misaligned time zero in prevalent and exposure-defined cohorts manufactures immortal (event-free-by-construction) person-time, the mechanism that pairs with depletion of susceptibles."
      },
      {
        "role": "demonstrate",
        "doi": "10.1097/EDE.0b013e3181875e61",
        "url": "https://doi.org/10.1097/EDE.0b013e3181875e61",
        "citation_text": "Hernán MA, Alonso A, Logan R, et al. Observational studies analyzed like randomized experiments: an application to postmenopausal hormone therapy and coronary heart disease. Epidemiology. 2008;19(6):766-779.",
        "year": 2008,
        "authors_short": "Hernán et al.",
        "notes": "The textbook empirical demonstration - re-analyzing hormone therapy as a new-user/target-trial emulation reconciled the observational signal with the WHI trial, showing prevalent-user/time-zero artifacts drove the original discrepancy."
      },
      {
        "role": "demonstrate",
        "doi": "10.1177/0962280211403603",
        "url": "https://doi.org/10.1177/0962280211403603",
        "citation_text": "Danaei G, Tavakkoli M, Hernán MA. Bias in observational studies of prevalent users: lessons for comparative effectiveness research from a meta-analysis of statins. American Journal of Epidemiology. 2012;175(4):250-262.",
        "year": 2012,
        "authors_short": "Danaei et al.",
        "notes": "Directly quantifies how prevalent-user inclusion biases statin effect estimates and how new-user/target-trial emulation removes it; the canonical applied illustration of the bias and its fix."
      }
    ],
    "plain_language_summary": "Prevalent user bias happens when a drug study includes patients who were already taking the medication before the study clock started, rather than starting everyone's clock at their very first fill. Those long-term users survived the earliest, riskiest months on the drug — the patients who had early side effects or stopped early are already gone, leaving a group that looks healthier than a true beginner. As a result, the study underestimates early harms and can make a drug look safer than it really is for someone just starting it. The fix is the new-user design: only enroll patients at their very first fill, so every person's early risk period is actually watched.",
    "key_terms": [
      {
        "term": "index date",
        "definition": "The patient's personal 'day zero' in a study — for a drug study this is usually the date of their very first prescription fill."
      },
      {
        "term": "washout period",
        "definition": "A stretch of time before the index date during which the patient must have had no fills of the study drug, confirming they are truly a new starter and not someone who was already on it."
      },
      {
        "term": "depletion of susceptibles",
        "definition": "The process by which patients most likely to be harmed by a drug have already left the group (by stopping, having an event, or dying) before the study even begins, leaving only the hardiest survivors behind."
      },
      {
        "term": "days_supply",
        "definition": "The number of days one filled prescription is meant to last — for example, a 30-day fill covers 30 calendar days starting on the fill date."
      },
      {
        "term": "unobserved exposure",
        "definition": "Time a patient spent on the drug before the study window opened — the researcher cannot see what happened during this period, including any early bad reactions that caused the patient to stop or have an outcome."
      }
    ],
    "worked_example": {
      "scenario": "A researcher wants to know whether a blood-pressure drug raises the risk of an acute kidney injury (AKI) in the first 90 days of use. The study window opens on 2024-01-01. Two patients both have a fill of the drug on record. Patient A is a new user — her very first fill was on 2024-01-15, well after the study window opened, and she had no fills in the prior 365 days. Patient B is a prevalent user — he has been filling the drug since mid-2022 and simply happened to have a fill overlapping 2024-01-01. The table below shows their pharmacy records. We then trace what each patient's timeline looks like and why including Patient B alongside Patient A produces a biased estimate of AKI risk.",
      "dataset": {
        "caption": "Pharmacy fill records for two patients — the columns an analyst sees in a real claims pharmacy table.",
        "columns": [
          "person_id",
          "fill_date",
          "drug",
          "days_supply"
        ],
        "rows": [
          [
            "A-101",
            "2024-01-15",
            "lisinopril",
            30
          ],
          [
            "A-101",
            "2024-02-13",
            "lisinopril",
            30
          ],
          [
            "A-101",
            "2024-03-14",
            "lisinopril",
            30
          ],
          [
            "B-202",
            "2022-07-01",
            "lisinopril",
            90
          ],
          [
            "B-202",
            "2022-09-28",
            "lisinopril",
            90
          ],
          [
            "B-202",
            "2022-12-26",
            "lisinopril",
            90
          ],
          [
            "B-202",
            "2023-03-25",
            "lisinopril",
            90
          ],
          [
            "B-202",
            "2023-06-22",
            "lisinopril",
            90
          ],
          [
            "B-202",
            "2023-09-19",
            "lisinopril",
            90
          ],
          [
            "B-202",
            "2023-12-01",
            "lisinopril",
            90
          ]
        ]
      },
      "steps": [
        "Patient A-101: her first-ever lisinopril fill is 2024-01-15 — this becomes her index date (day zero). She has three 30-day fills in a row with no gaps, giving her 90 covered days from 2024-01-15 through 2024-04-13. Her full early-risk window is directly observed.",
        "Patient B-202: his first fill was 2022-07-01 — nearly 18 months before the study opens. By the time the clock starts on 2024-01-01, he has already filled lisinopril 7 times and survived every one of those refills without stopping.",
        "A naive analyst assigns Patient B a study entry date of 2024-01-01 (when the database window opens) rather than his true first-fill date of 2022-07-01. His early high-risk period — July 2022 through December 2023 — is completely invisible to the study.",
        "Because Patient B is still filling the drug on 2024-01-01, he is by definition a survivor: any patient like him who had a serious early reaction, stopped lisinopril, or experienced AKI in 2022–2023 would not appear in the 'current user' pool at all — those susceptible patients were already gone.",
        "If the true AKI risk is highest in the first 90 days of starting lisinopril (say, 6 events per 100 new starters), Patient A-101's data captures that risk. Patient B-202's post-survival follow-up shows a much lower event rate — perhaps 1 per 100 — because the most susceptible patients in his cohort already had their events and are not in the study.",
        "A study that pools Patient A-type new users with Patient B-type prevalent users will produce a blended AKI rate far below 6%, making lisinopril look safer at initiation than it truly is. The correct approach is to include only new users like Patient A, whose index date sits at their first-ever fill after a fill-free lookback window."
      ],
      "result": {
        "label": "Prevalent users are survivors — their early high-risk period is unobserved, so including them biases the AKI rate toward zero (safe-looking) and away from the true initiation risk of ~6%.",
        "value": "biased_toward_null"
      },
      "timeline_spec": {
        "title": "New user vs. prevalent user — what the study clock sees for each patient",
        "window": {
          "start": "2024-01-01",
          "end": "2024-04-13",
          "label": "Study observation window (opens 2024-01-01; Patient A's 90-day follow-up ends 2024-04-13)"
        },
        "patients": [
          {
            "id": "A-101",
            "label": "Patient A — NEW user (correct)",
            "events": [
              {
                "label": "Fill 1 — index date",
                "start": "2024-01-15",
                "length_days": 30,
                "quantity": "30 days_supply"
              },
              {
                "label": "Fill 2",
                "start": "2024-02-13",
                "length_days": 30,
                "quantity": "30 days_supply"
              },
              {
                "label": "Fill 3",
                "start": "2024-03-14",
                "length_days": 30,
                "quantity": "30 days_supply"
              }
            ],
            "spans": [
              {
                "kind": "exposed",
                "start": "2024-01-15",
                "end": "2024-04-13",
                "label": "90 days fully observed — early risk SEEN"
              }
            ]
          },
          {
            "id": "B-202",
            "label": "Patient B — PREVALENT user (biased)",
            "events": [
              {
                "label": "Fill 1 (true start)",
                "start": "2022-07-01",
                "length_days": 90,
                "quantity": "90 days_supply"
              },
              {
                "label": "Fill 2",
                "start": "2022-09-28",
                "length_days": 90,
                "quantity": "90 days_supply"
              },
              {
                "label": "Fill 3",
                "start": "2022-12-26",
                "length_days": 90,
                "quantity": "90 days_supply"
              },
              {
                "label": "Fill 4",
                "start": "2023-03-25",
                "length_days": 90,
                "quantity": "90 days_supply"
              },
              {
                "label": "Fill 5",
                "start": "2023-06-22",
                "length_days": 90,
                "quantity": "90 days_supply"
              },
              {
                "label": "Fill 6",
                "start": "2023-09-19",
                "length_days": 90,
                "quantity": "90 days_supply"
              },
              {
                "label": "Fill 7 (overlaps study open)",
                "start": "2023-12-01",
                "length_days": 90,
                "quantity": "90 days_supply"
              }
            ],
            "spans": [
              {
                "kind": "unexposed",
                "start": "2022-07-01",
                "end": "2023-12-31",
                "label": "~18 months of prior use — UNOBSERVED (early risk period invisible; susceptible patients already gone)"
              },
              {
                "kind": "exposed",
                "start": "2024-01-01",
                "end": "2024-04-13",
                "label": "Observed follow-up starts here — but this patient is already a survivor"
              }
            ]
          }
        ],
        "result": {
          "label": "Prevalent users are survivors: their early high-risk months are unobserved, so mixing them with new users makes the drug appear safer than it is at initiation.",
          "value": "biased_toward_null"
        },
        "caption": "Patient A's first fill is her index date (2024-01-15) — the study watches her full 90-day early-risk window. Patient B's first fill was 2022-07-01; by study open (2024-01-01) he has 18 months of unobserved prior use. Any patient like him who reacted badly to lisinopril in 2022–2023 already stopped and is invisible — only the survivors remain. Pooling both patients underestimates early AKI risk.",
        "alt_text": "Two-patient timeline diagram. Top row shows Patient A (new user) with three 30-day fills starting 2024-01-15 and a fully observed 90-day exposure span shaded green. Bottom row shows Patient B (prevalent user) with seven 90-day fills dating back to 2022-07-01; the 18 months before the study window are shaded red and labeled 'UNOBSERVED — early risk invisible,' while only the narrow post-2024-01-01 tail is shaded as observed. A callout notes that susceptible patients who reacted in 2022–2023 are absent from the study entirely."
      }
    },
    "prerequisites": [
      "new-user-design",
      "washout-clean-lookback-period-rwe",
      "continuous-enrollment-observable-time-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Pure prevalent user (ever-exposed / current-user flag)",
        "description": "Any patient with a dispensing or active medication record at or before study entry is treated as exposed, with no requirement to observe initiation. Time zero is a landmark (database start, diagnosis, enrollment) unrelated to first fill.",
        "edge_cases": [
          "Maximally susceptible to depletion of susceptibles, immortal time, and post-initiation covariate adjustment simultaneously.",
          "Often hidden inside a single binary 'exposed' column with no index date, so the bias is invisible in the analysis dataset."
        ],
        "data_source_notes": "claims: a fill overlapping the index date (fill_date <= index <= fill_date + days_supply) with no washout check; EHR: a medication-list entry marked active, frequently a reconciled carry-over with no true start date."
      },
      {
        "name": "New-user (incident-user) restriction",
        "description": "Restrict to the first fill after a fill-free, continuously enrolled washout; time zero = that first fill. The standard remedy for prevalent user bias.",
        "edge_cases": [
          "A washout longer than the observable enrollment history misclassifies long-time users as incident (for example, MA-only or newly enrolled members whose earlier fills were never captured).",
          "Drug holidays longer than the washout can flag a re-initiator as a first-time initiator; pre-specify whether re-initiation counts."
        ],
        "data_source_notes": "claims: require continuous medical+pharmacy enrollment across the full lookback so absence of prior fills is observed; exclude MA-only person-time."
      },
      {
        "name": "Prevalent new-user with time-conditional propensity scores (Suissa 2017)",
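        "description": "When too few patients start the study drug without prior comparator use, also include patients who switch to it from the comparator (prevalent new users) and match each one to comparator users with the same duration of prior comparator use, using time-conditional propensity scores. This restores sample size while anchoring a defensible time zero within each matched set.",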
        "edge_cases": [
          "Assumes duration-conditional exchangeability; cannot recover the unobserved early discontinuers who already left the cohort.",
          "More complex to specify and communicate to reviewers than a clean incident-user design."
        ],
        "data_source_notes": "Requires reliable per-patient cumulative-duration measurement (stitched days_supply episodes) to define the time-conditional matching strata."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "new-user-design",
        "pros_of_this": "Larger sample and inclusion of the prevalent users who dominate real-world prescribing.",
        "cons_of_this": "Depletion of susceptibles, immortal time, and post-initiation covariate adjustment bias initiation and safety effects, usually toward the null for harms.",
        "when_to_prefer": "Almost never for causal/safety questions; acceptable only for descriptive prevalence-of-use estimates with no initiation interpretation."
      },
      {
        "compared_to": "active-comparator-new-user",
        "pros_of_this": "Avoids the need to identify a clinically interchangeable comparator and the power loss when a comparator is uncommon.",
        "cons_of_this": "Retains confounding by indication and healthy-user bias on top of survivor bias; the ACNU design mitigates all three by design, although it cannot guarantee their complete removal.",
        "when_to_prefer": "Never for comparative effectiveness/safety questions, where the active-comparator new-user design dominates the prevalent-user approach."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Detect prevalent users as fills overlapping the index landmark with a prior fill inside the lookback; remove them by requiring first fill after a fill-free, continuously enrolled (medical+pharmacy) washout. Exclude Medicare Advantage-only person-time, where absent FFS pharmacy claims make prior use unobservable and prevalent users masquerade as incident.",
      "ehr": "Initiation = first order or administration after a clean gap; an 'active' medication-list entry without a start date is a likely prevalent carry-over. Link to dispensing to confirm the patient actually started and treat visit-driven loss to follow-up as informative.",
      "registry": "Enrollment date is not treatment start; require an explicit drug-start-date field and link to claims for the full fill history and to a death index for the competing risk.",
      "linked": "Linked claims-EHR-vital-records best distinguishes prevalent from incident use, but reconcile order/fill/service date discrepancies before assigning time zero and account for linkage selection."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\nimport numpy as np\n\nWASHOUT_DAYS = 365     # fill-free + continuously enrolled lookback that defines an incident user\nFOLLOWUP_DAYS = 90     # early-hazard window in which depletion of susceptibles is most visible\n\ndef study_fills(rx: pd.DataFrame, study_ndcs: set[str]) -> pd.DataFrame:\n    f = rx[rx[\"ndc\"].isin(study_ndcs)].sort_values([\"person_id\", \"fill_date\"])\n    return f\n\ndef covered_full_washout(enroll: pd.DataFrame, idx: pd.DataFrame) -> set:\n    \"\"\"person_ids with continuous, FFS-observable enrollment spanning [index-WASHOUT, index].\"\"\"\n    e = enroll.merge(idx[[\"person_id\", \"index_date\"]], on=\"person_id\")\n    e[\"covers\"] = ((e[\"enroll_start\"] <= e[\"index_date\"] - pd.Timedelta(days=WASHOUT_DAYS)) &\n                   (e[\"enroll_end\"]   >= e[\"index_date\"]) &\n                   (~e[\"ma_only\"]))\n    return set(e.loc[e[\"covers\"], \"person_id\"])\n\ndef build_incident_cohort_with_diagnostic(rx, enroll, outcomes, study_ndcs, study_start, study_end):\n    f = study_fills(rx, study_ndcs)\n\n    # First study-drug fill inside the study window = candidate time zero.\n    in_window = f[(f[\"fill_date\"] >= study_start) & (f[\"fill_date\"] <= study_end)]\n    idx = (in_window.groupby(\"person_id\")[\"fill_date\"].min()\n                    .reset_index().rename(columns={\"fill_date\": \"index_date\"}))\n\n    # Prevalent flag: any study-drug fill in the WASHOUT_DAYS before the candidate index (the bias signature).\n    prior = f.merge(idx, on=\"person_id\")\n    prevalent_ids = set(prior.loc[(prior[\"fill_date\"] < prior[\"index_date\"]) &\n                                  (prior[\"fill_date\"] >= prior[\"index_date\"] - pd.Timedelta(days=WASHOUT_DAYS)),\n                                  \"person_id\"])\n    idx[\"would_be_prevalent\"] = idx[\"person_id\"].isin(prevalent_ids)\n\n    # Observable washout (continuous medical+pharmacy FFS enrollment, no MA-only gaps).\n    observable = 
covered_full_washout(enroll, idx)\n    idx = idx[idx[\"person_id\"].isin(observable)].copy()\n\n    # Early event within FOLLOWUP_DAYS of each person's time zero.\n    ev = outcomes.merge(idx[[\"person_id\", \"index_date\"]], on=\"person_id\")\n    ev[\"early\"] = ((ev[\"event_date\"] >= ev[\"index_date\"]) &\n                   (ev[\"event_date\"] <= ev[\"index_date\"] + pd.Timedelta(days=FOLLOWUP_DAYS)))\n    early_ids = set(ev.loc[ev[\"early\"], \"person_id\"])\n    idx[\"early_event\"] = idx[\"person_id\"].isin(early_ids)\n\n    # Depletion-of-susceptibles diagnostic: incident initiators vs would-be prevalent \"current users\".\n    diag = (idx.groupby(\"would_be_prevalent\")[\"early_event\"]\n               .agg(n=\"size\", events=\"sum\"))\n    diag[\"rate\"] = diag[\"events\"] / diag[\"n\"]\n\n    incident = idx.loc[~idx[\"would_be_prevalent\"], [\"person_id\", \"index_date\"]].copy()\n    incident[\"baseline_start\"] = incident[\"index_date\"] - pd.Timedelta(days=WASHOUT_DAYS)\n    return incident, diag",
        "description": "Detect prevalent-user contamination and build a clean incident-user cohort from claims-style inputs. Required inputs\n(already cleaned and de-duplicated):\n  rx        : pharmacy fills    -> person_id, fill_date (datetime64), ndc, days_supply\n  enroll    : enrollment spans  -> person_id, enroll_start, enroll_end, ma_only (bool)  # MA-only spans lack FFS Rx claims\n  outcomes  : outcome events    -> person_id, event_date (datetime64)                   # e.g. validated AKI dx/admission\nstudy_ndcs is the set of drug-of-interest NDCs (strings matching rx.ndc); study_start and study_end bound the study window.\nReturns a tuple: the incident cohort (one row per new initiator with time zero) plus a depletion-of-susceptibles\ndiagnostic comparing the early event rate of incident initiators vs would-be prevalent users sitting in a naive\n'current user' cohort.",
        "dependencies": [
          "pandas",
          "numpy"
        ],
        "source_citations": [
          "ray-2003",
          "danaei-2012"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\nWASHOUT_DAYS  <- 365L\nFOLLOWUP_DAYS <- 90L\n\nbuild_incident_cohort_with_diagnostic <- function(rx, enroll, outcomes, study_ndcs,\n                                                  study_start, study_end) {\n  setDT(rx); setDT(enroll); setDT(outcomes)\n  f <- rx[ndc %chin% study_ndcs][order(person_id, fill_date)]\n\n  # First study-drug fill inside the study window = candidate time zero.\n  idx <- f[fill_date >= study_start & fill_date <= study_end,\n           .(index_date = min(fill_date)), by = person_id]\n\n  # Prevalent flag: any study-drug fill in the washout window before candidate index.\n  prior <- merge(f, idx, by = \"person_id\")\n  prevalent_ids <- unique(prior[fill_date < index_date &\n                                fill_date >= index_date - WASHOUT_DAYS, person_id])\n  idx[, would_be_prevalent := person_id %chin% prevalent_ids]\n\n  # Observable washout: continuous medical+pharmacy FFS enrollment, no MA-only gaps.\n  e <- merge(enroll, idx[, .(person_id, index_date)], by = \"person_id\")\n  observable <- e[enroll_start <= index_date - WASHOUT_DAYS &\n                  enroll_end   >= index_date & !ma_only, unique(person_id)]\n  idx <- idx[person_id %chin% observable]\n\n  # Early event within FOLLOWUP_DAYS of each person's time zero.\n  ev <- merge(outcomes, idx[, .(person_id, index_date)], by = \"person_id\")\n  early_ids <- unique(ev[event_date >= index_date &\n                         event_date <= index_date + FOLLOWUP_DAYS, person_id])\n  idx[, early_event := person_id %chin% early_ids]\n\n  # Depletion-of-susceptibles diagnostic.\n  diag <- idx[, .(n = .N, events = sum(early_event)), by = would_be_prevalent]\n  diag[, rate := events / n]\n\n  incident <- idx[would_be_prevalent == FALSE, .(person_id, index_date)]\n  incident[, baseline_start := index_date - WASHOUT_DAYS]\n  list(incident = incident, diagnostic = diag)\n}",
        "description": "Detect prevalent-user contamination and build a clean incident-user cohort with data.table. Inputs mirror the Python\nversion:\n  rx       : person_id, fill_date (Date), ndc, days_supply\n  enroll   : person_id, enroll_start (Date), enroll_end (Date), ma_only (logical)\n  outcomes : person_id, event_date (Date)\nperson_id and ndc must be character columns because the code matches them with %chin%. study_ndcs is a character vector of\ndrug-of-interest NDCs. Returns a list with the incident cohort and the depletion-of-susceptibles diagnostic (early event\nrate among would-be incident vs would-be prevalent users).",
        "dependencies": [
          "data.table"
        ],
        "source_citations": [
          "ray-2003",
          "danaei-2012"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "%let washout   = 365;\n%let followup  = 90;\n%let studystart = '01JAN2024'd;\n%let studyend   = '31DEC2024'd;\n\n/* Candidate time zero = first study-drug fill inside the study window. */\nproc sql;\n  create table idx as\n  select person_id, min(fill_date) as index_date format=date9.\n  from work.rx\n  where fill_date between &studystart and &studyend\n  group by person_id;\nquit;\n\n/* Prevalent flag: any study-drug fill in the washout window before candidate index (the bias signature). */\nproc sql;\n  create table idx_flag as\n  select i.person_id, i.index_date,\n         (exists (select 1 from work.rx p\n                    where p.person_id = i.person_id\n                      and p.fill_date <  i.index_date\n                      and p.fill_date >= i.index_date - &washout)) as would_be_prevalent\n  from idx i;\nquit;\n\n/* Keep only observable washout: continuous medical+pharmacy FFS enrollment, no MA-only spans. */\nproc sql;\n  create table idx_obs as\n  select f.*\n  from idx_flag f\n  where exists (select 1 from work.enroll e\n                  where e.person_id = f.person_id\n                    and e.ma_only = 0\n                    and e.enroll_start <= f.index_date - &washout\n                    and e.enroll_end   >= f.index_date);\nquit;\n\n/* Early event within the follow-up window of each person's time zero. */\nproc sql;\n  create table idx_ev as\n  select o.person_id, o.index_date, o.would_be_prevalent,\n         (exists (select 1 from work.outcomes ev\n                    where ev.person_id = o.person_id\n                      and ev.event_date >= o.index_date\n                      and ev.event_date <= o.index_date + &followup)) as early_event\n  from idx_obs o;\nquit;\n\n/* Depletion-of-susceptibles diagnostic: early event rate by would-be-prevalent status. 
*/\nproc sql;\n  create table work.dos_diagnostic as\n  select would_be_prevalent,\n         count(*)            as n,\n         sum(early_event)    as events,\n         mean(early_event)   as rate format=percent8.1\n  from idx_ev\n  group by would_be_prevalent;\nquit;\n\n/* Clean incident cohort with baseline covariate window for downstream PS/outcome modeling. */\nproc sql;\n  create table work.incident as\n  select person_id, index_date,\n         index_date - &washout as baseline_start format=date9.\n  from idx_ev\n  where would_be_prevalent = 0;\nquit;",
        "description": "Detect prevalent-user contamination and build a clean incident-user cohort in SAS via PROC SQL. Required input datasets\n(post data-management), with study-drug NDCs already restricted into work.rx:\n  work.rx       : person_id, fill_date, ndc, days_supply         (only drug-of-interest fills)\n  work.enroll   : person_id, enroll_start, enroll_end, ma_only   (ma_only = 1 when FFS Rx claims are unavailable)\n  work.outcomes : person_id, event_date                          (validated outcome events)\nMacro vars set the study window. Produces work.incident (clean new initiators with time zero and the baseline window) and\nwork.dos_diagnostic, the depletion-of-susceptibles table comparing early event rates of would-be incident vs would-be\nprevalent users.",
        "dependencies": [],
        "source_citations": [
          "ray-2003",
          "danaei-2012"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": "prevalent-user-bias-timeline.svg",
        "mermaid": null,
        "caption": "Patient A's first fill is her index date (2024-01-15) — the study watches her full 90-day early-risk window. Patient B's first fill was 2022-07-01; by study open (2024-01-01) he has 18 months of unobserved prior use. Any patient like him who reacted badly to lisinopril in 2022–2023 already stopped and is invisible — only the survivors remain. Pooling both patients underestimates early AKI risk.",
        "alt_text": "Two-patient timeline diagram. Top row shows Patient A (new user) with three 30-day fills starting 2024-01-15 and a fully observed 90-day exposure span shaded green. Bottom row shows Patient B (prevalent user) with seven 90-day fills dating back to 2022-07-01; the 18 months before the study window are shaded red and labeled 'UNOBSERVED — early risk invisible,' while only the narrow post-2024-01-01 tail is shaded as observed. A callout notes that susceptible patients who reacted in 2022–2023 are absent from the study entirely.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Init[True initiation cohort<br/>all who ever start the drug] --> Early[Early high-risk window<br/>adverse events, intolerance, discontinuation]\n  Early -->|outcome / stop / die| Gone[Susceptibles removed<br/>events + discontinuers leave]\n  Early -->|tolerate + keep filling| Surv[Surviving tolerators]\n  Gone -.invisible to prevalent cohort.-> Study\n  Surv --> Study[Prevalent cohort observed at study entry<br/>enriched for low-risk survivors]\n  Study --> Bias[Observed hazard < initiation hazard<br/>harms biased toward null, benefits inflated]\n  Init --> Fix[New-user restriction:<br/>time zero = first fill after washout] --> Unbiased[Early window observed<br/>estimand = effect of INITIATION]",
        "caption": "Depletion of susceptibles. A prevalent cohort is conditioned on surviving the early high-risk window; the patients who were harmed or discontinued have already left, so the observed hazard understates the hazard a new initiator faces. New-user restriction anchors time zero at first fill and recovers the early window.",
        "alt_text": "Flowchart showing the true initiation cohort splitting into removed susceptibles (events, discontinuers, deaths) and surviving tolerators, with only the survivors entering the prevalent cohort, versus the new-user restriction that observes the full early window.",
        "source_type": "illustrative",
        "source_citations": [
          "ray-2003"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "gantt\n  title Time-zero misalignment - prevalent vs new user (claims)\n  dateFormat YYYY-MM-DD\n  axisFormat %b %Y\n  section Prevalent user (biased)\n  Prior use - early risk UNOBSERVED :crit, prior, 2021-06-01, 2023-12-31\n  Wrong time zero at study entry :milestone, pt0, 2024-01-01, 0d\n  Observed follow-up (survivors only) :active, pf, 2024-01-01, 90d\n  section New user (correct)\n  Fill-free + continuously enrolled washout :done, wash, 2023-01-01, 2023-12-31\n  Time zero = first fill :milestone, nt0, 2024-01-01, 0d\n  Observed follow-up (full early window) :active, nf, 2024-01-01, 90d",
        "caption": "For a prevalent user, follow-up starts at a study-entry landmark long after initiation, so the early high-risk months are unobserved (and the gap before entry is immortal time). For a new user, the washout establishes incident status and time zero sits at first fill, capturing the full early window.",
        "alt_text": "Gantt comparison of a prevalent user whose prior-use early-risk period is unobserved before a wrong time zero at study entry, versus a new user with a 365-day washout and time zero at first fill.",
        "source_type": "illustrative",
        "source_citations": [
          "ray-2003"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "see_also",
        "target_slug": "new-user-design",
        "notes": "Prevalent user bias is the failure mode that the new-user (incident-user) design exists to prevent; the new-user design is the canonical remedy."
      },
      {
        "relation_type": "see_also",
        "target_slug": "active-comparator-new-user",
        "notes": "ACNU mitigates survivor bias, confounding by indication, and healthy-user bias simultaneously by restricting to incident initiators of two interchangeable drugs; it dominates the prevalent-user approach for comparative questions."
      },
      {
        "relation_type": "see_also",
        "target_slug": "immortal-time-bias-handling",
        "notes": "Prevalent cohorts with a study-entry landmark create immortal (event-free-by-construction) person-time between true initiation and entry; the two biases routinely co-occur and share the time-zero fix."
      },
      {
        "relation_type": "see_also",
        "target_slug": "healthy-user-bias",
        "notes": "Surviving prevalent users are enriched for tolerators and adherers, compounding healthy-user/healthy-adherer selection on top of depletion of susceptibles."
      },
      {
        "relation_type": "affects",
        "target_slug": "cox-ph-regression",
        "notes": "Including prevalent users mixes early high-risk and later low-risk person-time, attenuating hazard ratios and violating proportional hazards; left truncation must be handled explicitly if any prevalent time is retained."
      },
      {
        "relation_type": "affects",
        "target_slug": "logistic-regression-for-binary-outcomes",
        "notes": "Logistic models on current users without new-user restriction conflate initiation and maintenance effects and bias safety/effectiveness odds ratios toward the null for early harms."
      },
      {
        "relation_type": "affects",
        "target_slug": "poisson-negative-binomial-count-models",
        "notes": "Prevalent users are disproportionately stable low-utilizers (survivors); mixing them with incident initiators biases rate and count comparisons for initiation effects on healthcare utilization."
      },
      {
        "relation_type": "affects",
        "target_slug": "predictive-and-causal-ml-models-rwe",
        "notes": "Features learned on prevalent users encode survivor characteristics and post-initiation consequences rather than initiation effects; apply new-user restriction before feature engineering for causal or predictive treatment-start work."
      }
    ],
    "aliases": [
      "prevalent user bias",
      "prevalent-user bias",
      "survivor bias in drug studies",
      "depletion of susceptibles bias",
      "depletion-of-susceptibles bias",
      "prevalent user design",
      "prevalent-user design"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "primary-non-adherence-initiation",
    "name": "Primary Non-Adherence and Treatment Initiation",
    "short_definition": "The operational distinction between a medication being prescribed/ordered and actually being dispensed, picked up, or administered, where failure of that first fill or first administration (primary non-adherence) silently removes patients from any \"first-fill\" exposure cohort and selects a more adherent population.",
    "long_description": "**Primary non-adherence** is the gap between the moment a clinician *prescribes or orders* a drug and the moment the\npatient *actually starts it*: fills the prescription at a pharmacy, picks up the dispensed product, or (for administered\nproducts) receives the first dose. A patient who is prescribed a statin but never fills it is a *primary non-adherer*;\nthis is categorically different from a patient who fills once and then stops (non-persistence) or fills sporadically\n(secondary non-adherence, measured by PDC/MPR). The reported magnitude is large and consistent: ~28% of new\ne-prescriptions were never dispensed within the index period in a 195,930-prescription analysis (Fischer 2010), and ~31%\nof newly prescribed medications in primary care were never filled (Tamblyn 2014). \"Initiation\" is the mirror image: the\npoint at which exposure truly begins and time zero can legitimately be set.\n\n**Core conceptual distinction.** The decisive fact, and the reason this is a discrete catalog entry rather than a\nfootnote to exposure-episode construction, is that **administrative claims data structurally cannot measure primary\nnon-adherence.** Claims record *dispensings*, not *orders*. The denominator you need (everyone who was prescribed) is\ninvisible in a claims database; you only ever observe the denominator's survivors (those who filled at least once).\nMeasuring primary non-adherence therefore requires a source that captures the *order*: an e-prescribing network\n(Surescripts), EHR computerized provider order entry (CPOE), or an integrated delivery system's prescribing record,\nlinked to fill/administration data. Without that order layer you can measure secondary non-adherence among fillers and\nnothing about the people who never filled. 
Two further sub-distinctions matter: (1) *dispensed vs picked up* — a\npharmacy can adjudicate and reverse a claim when the patient never collects the drug (prescription abandonment), so a\nraw paid pharmacy claim is not proof of pickup unless reversals are netted out; (2) *dispensed vs administered* — for\nbuy-and-bill infusibles, injectables, and in-office products, a pharmacy fill is not initiation; the J-code/procedure on\na service date or the EHR medication administration record (MAR) is. A vialed biologic that is dispensed but never\ninfused is initiation failure, not initiation.\n\n**Pros, cons, and trade-offs.**\n- **vs PDC / MPR (secondary-adherence measures):** PDC and MPR are computed *conditional on having filled at least\n  once* — they are blind to primary non-adherers by construction and so systematically overstate population-level drug\n  exposure. Capturing primary non-adherence + initiation gives the complete picture from order to discontinuation.\n  Cost: it requires an order source PDC/MPR do not (PDC runs on claims alone). **Prefer this concept** when the policy\n  or effectiveness question concerns uptake, abandonment, or the validity of \"first-fill\" time zero; **prefer PDC/MPR**\n  for ongoing-coverage adherence among established users.\n- **vs simply defining index = \"first fill\" and moving on (the silent default in most new-user/ACNU studies):**\n  explicitly modeling primary non-adherence reveals that the first-fill cohort is a *selected, more-adherent subset* —\n  the ~25-30% who never filled are dropped before time zero, a healthy-adherer-flavored selection that can bias even a\n  methodologically clean comparative-effectiveness contrast. Cost: more data, an order-to-fill window to defend, and\n  extra diagnostics. 
**Prefer explicit modeling** whenever the exposure decision (not just on-treatment behavior) could\n  differ across the comparison groups.\n- **vs an intention-to-treat \"as-prescribed\" estimand from e-Rx alone:** counting the order as exposure regardless of\n  fill answers a prescriber-behavior question but misclassifies never-fillers as treated, biasing effect estimates\n  toward the null for any drug that works only if taken. **Prefer the as-prescribed estimand** only when the question is\n  explicitly about the prescribing decision (e.g., a prescriber-level intervention), and say so in the estimand.\n\n**When to use.** When the research question concerns medication uptake, prescription abandonment, or the first-fill\nrate itself; when validating whether a new-user/ACNU \"time zero = first fill\" cohort is selected on adherence; when an\norder source (Surescripts, EHR CPOE, integrated-system Rx) is linkable to dispensing/administration; for PQA-style\ninitiation/abandonment quality measurement; and for any HTA or payer analysis where modeled uptake drives budget impact.\n\n**When NOT to use — and when it is actively misleading or dangerous.**\n- **You have claims only.** Do not report a \"primary non-adherence rate\" from a claims database — there is no order\n  denominator, so any such number is uninterpretable and will mislead reviewers. With claims you can only study\n  secondary non-adherence among fillers. State this as a hard limitation, not something a workaround fixes.\n- **The order source is incomplete.** Surescripts misses paper/written and verbal prescriptions; an EHR captures only\n  in-network orders. 
If a non-trivial share of prescribing bypasses the captured channel, \"never filled\" conflates true\n  abandonment with prescriptions that were routed elsewhere — differential by clinic, payer, or drug.\n- **For administered products, when only the pharmacy fill is checked.** Declaring initiation on a buy-and-bill fill\n  that was never infused overcounts initiation and miscounts the true initiation-failure population.\n- **When the order-to-fill window is left implicit.** A 7-day window labels mail-order and prior-authorization delays\n  as non-adherence; a 365-day window absorbs genuine abandonment into \"delayed fill.\" The window is an analytic choice\n  that must be pre-specified and varied in sensitivity analysis.\n\n**Data-source operational depth.**\n- **Claims only (FFS, commercial, MA):** *Cannot measure primary non-adherence at all* — no order denominator. Worse,\n  in Medicare Advantage and capitated arrangements, fee-for-service pharmacy claims may be absent, so even fill capture\n  among the supposedly-treated is incomplete; \"no fill\" can be MA-only missing person-time rather than true\n  non-fulfillment. Workaround: restrict to enrollees with complete Part D (or commercial pharmacy benefit) and use\n  claims for *secondary* adherence only.\n- **EHR with CPOE / e-prescribing:** the *order* is visible (good), but the *fill* is not unless pharmacy data are\n  linked. Use the Surescripts fill-status response, payer pharmacy claims, or integrated-pharmacy dispensings to close\n  the loop. Failure mode: external-care leakage — a patient who fills at a pharmacy outside the linked network looks\n  like a non-adherer.\n- **Surescripts / e-Rx network:** the standard denominator source. 
Failure modes: transmission failures, cancel/replace\n  messages that double-count, and invisibility of paper/verbal prescriptions and free-text orders; reconcile the\n  new-prescription event before treating it as the denominator.\n- **Linked claims–EHR (and integrated systems such as Kaiser, VA, Geisinger):** the gold standard — order from\n  EHR/e-Rx, fill from claims/pharmacy, administration from MAR/J-codes. The central reconciliation task is *date\n  alignment*: a fill may post 7, 30, or 90 days after the order; pre-specify the index window and test 7/30/90-day\n  cuts. For infusibles, require a J-code/administration on a service date, not the buy-and-bill fill, to confirm\n  initiation.\n\n**Worked example (e-Rx linked to claims).** Question: primary non-adherence to newly e-prescribed oral\nanticoagulants, and its effect on a downstream \"new-user\" cohort. (1) Denominator: every new e-prescription\n(`erx_date`, `drug_class='DOAC'`, `new_start_flag=1` so refills/continuations are excluded) for an adult with\ncontinuous medical + pharmacy enrollment in the 90 days before and the index window after `erx_date` — enrollment is\nrequired so that \"no fill\" is observed, not missing. (2) Numerator (primary non-adherer): no DOAC pharmacy dispense\n(`fill_date` within `[erx_date, erx_date + W]`), netting out reversed/abandoned claims. (3) Primary non-adherence rate\n= numerator / denominator; with W = 30 days suppose 1,000 of 4,000 new e-Rx never filled → 25.0%. (4) Window\nsensitivity: W = 7 days → 1,420/4,000 = 35.5% (counts slow legitimate fills as non-adherence); W = 90 days →\n860/4,000 = 21.5% (absorbs true abandonment). Report all three. (5) Selection-bias link: the 3,000 fillers are the\nexact population a conventional new-user/ACNU study would index on \"first fill\" — so that downstream cohort has\nsilently excluded the 25% never-fillers and is selected toward adherence; flag this when interpreting the comparative\nestimate. 
(6) For an injectable arm, replace step (2) with a J-code administration on a service date so a dispensed-but-\nnever-infused vial is correctly counted as initiation failure.",
    "primary_category": "Exposure_Definition",
    "tags": [
      "exposure_definition",
      "primary-non-adherence",
      "prescription-abandonment",
      "treatment-initiation",
      "first-fill",
      "e-prescribing",
      "pharmacoepidemiology"
    ],
    "applies_to_study_types": [
      "claims_analysis",
      "ehr_study",
      "registry_linkage",
      "comparative_effectiveness",
      "drug_utilization"
    ],
    "data_sources": [
      "ehr",
      "linked",
      "claims",
      "registry"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1007/s11606-010-1253-9",
        "url": "https://doi.org/10.1007/s11606-010-1253-9",
        "citation_text": "Fischer MA, Stedman MR, Lii J, et al. Primary medication non-adherence: analysis of 195,930 electronic prescriptions. Journal of General Internal Medicine. 2010;25(4):284-290.",
        "year": 2010,
        "authors_short": "Fischer et al.",
        "notes": "Defining e-prescribing study quantifying primary non-adherence (~28% of new e-prescriptions never dispensed), establishing the order-vs-fill denominator distinction."
      },
      {
        "role": "explain",
        "doi": "10.1097/mlr.0b013e31829b1d2a",
        "url": "https://doi.org/10.1097/mlr.0b013e31829b1d2a",
        "citation_text": "Raebel MA, Schmittdiel J, Karter AJ, Konieczny JL, Steiner JF. Standardizing terminology and definitions of medication adherence and persistence in research employing electronic databases. Medical Care. 2013;51(8 Suppl 3):S11-S21.",
        "year": 2013,
        "authors_short": "Raebel et al.",
        "notes": "Canonical terminology paper separating initiation, primary non-adherence, implementation/secondary adherence, and persistence in electronic-database research."
      },
      {
        "role": "explain",
        "doi": "10.1002/pds.1230",
        "url": "https://doi.org/10.1002/pds.1230",
        "citation_text": "Andrade SE, Kahler KH, Frech F, Chan KA. Methods for evaluation of medication adherence and persistence using automated databases. Pharmacoepidemiology and Drug Safety. 2006;15(8):565-574.",
        "year": 2006,
        "authors_short": "Andrade et al.",
        "notes": "Operational automated-database methods for fills, gaps, and persistence that define what claims can and cannot observe about exposure."
      },
      {
        "role": "demonstrate",
        "doi": "10.7326/m13-1705",
        "url": "https://doi.org/10.7326/m13-1705",
        "citation_text": "Tamblyn R, Eguale T, Huang A, Winslade N, Doran P. The incidence and determinants of primary nonadherence with prescribed medication in primary care. Annals of Internal Medicine. 2014;160(7):441-450.",
        "year": 2014,
        "authors_short": "Tamblyn et al.",
        "notes": "Linked e-prescribing-to-dispensing cohort showing ~31% of newly prescribed medications were never filled, with determinants — a model worked analysis of primary non-adherence."
      },
      {
        "role": "demonstrate",
        "doi": "10.1016/s0895-4356(96)00268-5",
        "url": "https://doi.org/10.1016/s0895-4356(96)00268-5",
        "citation_text": "Steiner JF, Prochazka AV. The assessment of refill compliance using pharmacy records: methods, validity, and applications. Journal of Clinical Epidemiology. 1997;50(1):105-116.",
        "year": 1997,
        "authors_short": "Steiner & Prochazka",
        "notes": "Foundational pharmacy-record (refill-based) adherence methods; included to contrast what fill records measure (secondary adherence) with the order-level data primary non-adherence requires."
      }
    ],
    "plain_language_summary": "Primary non-adherence describes what happens when a doctor writes a prescription for a drug but the patient never goes to the pharmacy to fill it — the medication is ordered but the bottle is never picked up. To measure this you need two data layers linked together: the electronic order that the doctor sent, plus the pharmacy dispensing record that shows whether a fill actually happened. A claims database alone cannot reveal this gap, because it only records fills that occurred, not prescriptions that were ignored. This is different from secondary non-adherence, where the patient fills the prescription at least once but later stops taking the drug consistently.",
    "key_terms": [
      {
        "term": "e-prescription (e-Rx)",
        "definition": "A medication order that a doctor sends electronically to a pharmacy; it creates a digital record that a drug was prescribed, even if the patient never picks it up."
      },
      {
        "term": "dispensing record",
        "definition": "The pharmacy's record that a patient actually received (was given) the drug; this is what shows up as a pharmacy claim in insurance data."
      },
      {
        "term": "order-to-fill window",
        "definition": "The number of days after a prescription is written during which a fill still counts as the patient starting that drug; analysts commonly choose 30 days, pre-specify the choice, and test other lengths (such as 7 and 90 days) in sensitivity analysis."
      },
      {
        "term": "secondary non-adherence",
        "definition": "When a patient fills a prescription at least once to start the drug but later takes it inconsistently or stops; measures like PDC track this, not the never-filled group."
      },
      {
        "term": "prescription abandonment",
        "definition": "A specific form of primary non-adherence where the prescription reaches the pharmacy and is processed, but the patient never collects the drug, often because of cost or side-effect concerns; in claims it typically appears as a reversed (voided) pharmacy claim."
      }
    ],
    "worked_example": {
      "scenario": "A cardiologist sends three e-prescriptions for a blood thinner (a DOAC) on 2023-03-01. We want to see which patients started the drug (filled it within 30 days) and which were primary non-adherers (never filled). The pharmacy claims table covers the same time window. Patient 2001 fills promptly, patient 2002 fills late (outside the 30-day window), and patient 2003 never fills at all.",
      "dataset": {
        "caption": "Left table: electronic prescriptions sent by the doctor. Right table: pharmacy fills found in claims (reversed = claim later voided, meaning the patient left without the drug).",
        "erx_table": {
          "columns": [
            "person_id",
            "erx_date",
            "drug_class",
            "new_start_flag"
          ],
          "rows": [
            [
              2001,
              "2023-03-01",
              "DOAC",
              1
            ],
            [
              2002,
              "2023-03-01",
              "DOAC",
              1
            ],
            [
              2003,
              "2023-03-01",
              "DOAC",
              1
            ]
          ]
        },
        "rx_table": {
          "columns": [
            "person_id",
            "fill_date",
            "drug_class",
            "days_supply",
            "reversed"
          ],
          "rows": [
            [
              2001,
              "2023-03-10",
              "DOAC",
              30,
              false
            ],
            [
              2002,
              "2023-04-15",
              "DOAC",
              30,
              false
            ],
            [
              2003,
              "2023-03-05",
              "DOAC",
              30,
              true
            ]
          ]
        }
      },
      "steps": [
        "The denominator is all three e-prescriptions written on 2023-03-01 — this is the group the doctor intended to treat.",
        "For each order, look for a pharmacy fill for the same person, same drug class, that is NOT reversed, and falls within 30 days of 2023-03-01 (i.e., on or before 2023-03-31).",
        "Patient 2001 filled on 2023-03-10 — that is 9 days after the prescription, within the 30-day window, and the claim was not reversed. Patient 2001 INITIATED the drug.",
        "Patient 2002 filled on 2023-04-15 — that is 45 days after the prescription, outside the 30-day window. Under a 30-day rule, patient 2002 is counted as a primary non-adherer (a 90-day window would reclassify them as a late initiator).",
        "Patient 2003 has a fill record dated 2023-03-05, but the reversed flag is true — the pharmacy voided that claim, meaning the patient did not actually take the drug home. No valid fill exists. Patient 2003 is a primary non-adherer (prescription abandonment at the counter).",
        "Primary non-adherence rate = 2 orders with no valid fill inside the 30-day window (patient 2002 filled late, patient 2003 had only a reversed claim) out of 3 total orders = 2/3 = 67%. Note: this small example is illustrative; real studies with thousands of prescriptions typically find 25-35% primary non-adherence."
      ],
      "result": {
        "label": "Primary non-adherence rate (30-day window): 2 of 3 orders had no valid fill within the window = 67% (illustrative example; real-world rates ~25-35%)",
        "value": 0.667
      },
      "timeline_spec": {
        "title": "Primary non-adherence: prescription written vs. fill received (30-day window)",
        "window": {
          "start": "2023-03-01",
          "end": "2023-03-31",
          "label": "30-day order-to-fill window"
        },
        "events": [
          {
            "label": "Rx written (all 3 patients)",
            "start": "2023-03-01",
            "length_days": 1,
            "quantity": "e-prescription order date"
          },
          {
            "label": "Patient 2001 fills (day 9)",
            "start": "2023-03-10",
            "length_days": 30,
            "quantity": "30 days_supply — valid fill"
          },
          {
            "label": "Patient 2003 reversed fill (day 4)",
            "start": "2023-03-05",
            "length_days": 1,
            "quantity": "voided — abandonment at counter"
          }
        ],
        "spans": [
          {
            "kind": "covered",
            "start": "2023-03-01",
            "end": "2023-03-31",
            "label": "30-day window: patient 2001 fills on day 9 — initiates"
          },
          {
            "kind": "gap",
            "start": "2023-03-01",
            "end": "2023-03-31",
            "label": "Patient 2003: prescription written, reversed fill only — never fills — primary non-adherence"
          },
          {
            "kind": "unexposed",
            "start": "2023-03-01",
            "end": "2023-04-15",
            "label": "Patient 2002: fills on day 45 — outside 30-day window — counted as non-adherer under 30-day rule"
          }
        ],
        "result": {
          "label": "2 of 3 patients had no valid fill within the 30-day window = primary non-adherence rate 0.67",
          "value": 0.667
        },
        "caption": "Timeline showing one prescription date (2023-03-01) and three patient outcomes: patient 2001 fills within the 30-day window (initiates), patient 2003 shows a reversed fill at the counter but never takes the drug home (abandonment), and patient 2002 fills 45 days later, outside the window.",
        "alt_text": "Horizontal timeline from 2023-03-01 to beyond 2023-03-31. A vertical marker on March 1 labels the e-prescription order for all three patients. A green fill bar for patient 2001 begins March 10 inside the window. A small red stub on March 5 marks patient 2003 reversed fill. Patient 2002 has no bar inside the window. The 30-day boundary is drawn as a dashed vertical line at March 31."
      }
    },
    "prerequisites": [
      "pdc-proportion-of-days-covered",
      "new-user-design",
      "continuous-enrollment-observable-time-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "e-Rx-to-dispense anti-join (oral/self-administered drugs)",
        "description": "Denominator = new e-prescriptions for the drug class; numerator = those with no matching pharmacy dispense within a pre-specified order-to-fill window, netting out reversed/abandoned claims.",
        "edge_cases": [
          "Mail-order and prior-authorization delays push legitimate fills past a short window",
          "Cancel/replace e-Rx messages double-count the same order",
          "Paper/verbal prescriptions are invisible to Surescripts, inflating apparent non-adherence differentially by clinic",
          "Adjudicated-but-reversed pharmacy claims (abandonment at the counter) must be excluded from the fill numerator"
        ],
        "data_source_notes": "e-Rx/EHR: define the order; linked claims/pharmacy: define the fill. Pre-specify and vary the window (7/30/90 days)."
      },
      {
        "name": "Administration-confirmed initiation (infusibles, injectables, in-office products)",
        "description": "Initiation requires a J-code/procedure on a service date or an EHR medication administration record entry, not merely a pharmacy or buy-and-bill dispense.",
        "edge_cases": [
          "Buy-and-bill fill with no subsequent infusion = initiation failure, not initiation",
          "Office-administered first dose may precede or replace any pharmacy claim",
          "In-network administration may be missed if the patient is infused at an external site"
        ],
        "data_source_notes": "claims: cross-link HCPCS J-code on a service date to the dispense; EHR: use the MAR; reconcile dispense date vs administration date before setting time zero."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "PDC / MPR (secondary-adherence measures)",
        "pros_of_this": "Captures the never-fillers that PDC/MPR are structurally blind to, giving a complete order-to-discontinuation picture and exposing first-fill selection.",
        "cons_of_this": "Requires an order source (e-Rx/EHR/integrated system) that PDC/MPR do not; cannot be computed from claims alone.",
        "when_to_prefer": "Questions about uptake, abandonment, or the validity of a first-fill time zero."
      },
      {
        "compared_to": "Defining index date as \"first fill\" with no primary-non-adherence modeling",
        "pros_of_this": "Reveals that the first-fill cohort is a selected, more-adherent subset (the ~25-30% never-fillers are dropped), making a healthy-adherer-like selection visible rather than hidden.",
        "cons_of_this": "Requires order data, a defensible order-to-fill window, and extra diagnostics.",
        "when_to_prefer": "Whenever the exposure (fill) decision could differ across comparison groups and bias the contrast."
      },
      {
        "compared_to": "As-prescribed (order-only) intention-to-treat exposure",
        "pros_of_this": "Avoids misclassifying never-fillers as treated, a misclassification that biases drug-effect estimates toward the null.",
        "cons_of_this": "Cannot answer prescriber-behavior questions that intentionally treat the order itself as the exposure.",
        "when_to_prefer": "Effectiveness/safety questions where exposure means the patient actually started the drug."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Cannot measure primary non-adherence — claims record dispensings, not orders, so the denominator is invisible. Use claims for secondary adherence (PDC/MPR) only; exclude MA-only person-time where FFS pharmacy claims are unavailable.",
      "ehr": "The order (CPOE/e-Rx) is visible but the fill is not unless pharmacy data are linked; close the loop with a Surescripts fill-status response, payer pharmacy claims, or integrated-pharmacy dispensings. Treat external-pharmacy fills as potential false non-adherence (leakage).",
      "registry": "Rarely captures both order and fill; useful mainly when linked to e-Rx and claims. Check linkage eligibility and completeness before computing any rate.",
      "linked": "Gold standard (order from EHR/e-Rx, fill from claims, administration from MAR/J-codes). Reconcile order vs fill dates and pre-specify the index window (7/30/90 days); for infusibles require a J-code/administration, not the dispense."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\n\ndef primary_non_adherence(erx: pd.DataFrame, rx: pd.DataFrame,\n                          drug_class: str, window_days: int = 30) -> tuple[pd.DataFrame, float]:\n    # Denominator: one row per new e-prescription for the target class.\n    orders = erx[(erx[\"drug_class\"] == drug_class) & (erx[\"new_start_flag\"] == 1)].copy()\n    # Explicit order key, so that orders with several candidate fills collapse correctly after the merge.\n    orders[\"order_id\"] = range(len(orders))\n\n    # Eligible fills: same class, NOT reversed/abandoned.\n    fills = rx[(rx[\"drug_class\"] == drug_class) & (~rx[\"reversed\"])][[\"person_id\", \"fill_date\"]]\n    m = orders[[\"order_id\", \"person_id\", \"erx_date\"]].merge(fills, on=\"person_id\", how=\"left\")\n\n    # A fill counts if it falls within [erx_date, erx_date + window]; a missing fill_date (NaT) compares False.\n    window_end = m[\"erx_date\"] + pd.Timedelta(days=window_days)\n    m[\"filled\"] = (m[\"fill_date\"] >= m[\"erx_date\"]) & (m[\"fill_date\"] <= window_end)\n\n    # An order is \"filled\" if at least one eligible fill falls in its window.\n    filled_per_order = m.groupby(\"order_id\")[\"filled\"].any()\n    orders[\"primary_non_adherent\"] = ~orders[\"order_id\"].map(filled_per_order).astype(bool)\n\n    rate = orders[\"primary_non_adherent\"].mean()\n    return orders[[\"person_id\", \"erx_date\", \"primary_non_adherent\"]], float(rate)",
        "description": "Primary non-adherence (oral/self-administered) by anti-joining e-prescriptions to dispensings. Required inputs\n(already cleaned, de-duplicated, restricted to enrolled person-time):\n  erx : new e-prescription orders -> person_id, erx_date (datetime), drug_class, new_start_flag (1=new start)\n  rx  : pharmacy dispensings      -> person_id, fill_date (datetime), drug_class, days_supply, reversed (bool)\nReturns one primary_non_adherent flag per order for the drug class, plus the aggregate rate. NOTE: this is only valid\nwhen 'erx' is a genuine order source (Surescripts/EHR CPOE); it CANNOT be run on claims alone.",
        "dependencies": [
          "pandas"
        ],
        "source_citations": [
          "fischer-2010",
          "tamblyn-2014"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\n\nprimary_non_adherence <- function(erx, rx, drug_class, window_days = 30L) {\n  setDT(erx); setDT(rx)\n\n  # Copy the argument under a name that is not a column, so the filter compares column to value.\n  target_class <- drug_class\n  orders <- erx[drug_class == target_class & new_start_flag == 1L]\n  orders[, order_id := .I]\n\n  fills <- rx[drug_class == target_class & reversed == FALSE, .(person_id, fill_date)]\n\n  # Non-equi join: a fill is eligible if erx_date <= fill_date <= erx_date + window.\n  orders[, win_end := erx_date + window_days]\n  hit <- fills[orders, on = .(person_id, fill_date >= erx_date, fill_date <= win_end),\n               .(order_id = i.order_id), nomatch = NULL, allow.cartesian = TRUE]\n  orders[, filled := order_id %in% unique(hit$order_id)]\n  orders[, primary_non_adherent := !filled]\n\n  list(per_order = orders[, .(person_id, erx_date, primary_non_adherent)],\n       rate = orders[, mean(primary_non_adherent)])\n}",
        "description": "Primary non-adherence via e-Rx-to-dispense anti-join, data.table. Inputs mirror the Python version:\n  erx : person_id, erx_date (Date), drug_class, new_start_flag (1L = new start)\n  rx  : person_id, fill_date (Date), drug_class, days_supply, reversed (logical)\nValid only when 'erx' is a real order source (Surescripts/EHR CPOE), never claims alone.",
        "dependencies": [
          "data.table"
        ],
        "source_citations": [
          "fischer-2010",
          "tamblyn-2014"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "%let dclass  = DOAC;\n%let window  = 30;\n\n/* Denominator: new e-prescriptions for the target class; flag each order filled / not filled. */\nproc sql;\n  create table pna as\n  select o.person_id,\n         o.erx_date,\n         case when exists (\n                select 1 from work.rx f\n                where f.person_id  = o.person_id\n                  and f.drug_class = o.drug_class\n                  and f.reversed   = 0\n                  and f.fill_date >= o.erx_date\n                  and f.fill_date <= o.erx_date + &window\n              ) then 0 else 1 end as primary_non_adherent\n  from work.erx o\n  where o.drug_class = \"&dclass\" and o.new_start_flag = 1;\nquit;\n\n/* Aggregate primary non-adherence rate for the chosen order-to-fill window. */\nproc sql;\n  select mean(primary_non_adherent) as primary_non_adherence_rate format=percent8.1,\n         count(*) as n_new_eRx\n  from pna;\nquit;",
        "description": "Primary non-adherence via e-Rx-to-dispense NOT EXISTS anti-join in SAS. Required input datasets (post data-management,\nrestricted to enrolled person-time):\n  work.erx : person_id, erx_date, drug_class, new_start_flag (1 = new start)\n  work.rx  : person_id, fill_date, drug_class, days_supply, reversed (0/1)\nSet &dclass and &window before running. Valid only when work.erx is a true order source (Surescripts/EHR CPOE),\nnever a claims-only database.",
        "dependencies": [],
        "source_citations": [
          "fischer-2010",
          "tamblyn-2014"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": "primary-non-adherence-initiation-timeline.svg",
        "mermaid": null,
        "caption": "Timeline showing one prescription date (2023-03-01) and three patient outcomes: patient 2001 fills within the 30-day window (initiates), patient 2003 shows a reversed fill at the counter but never takes the drug home (abandonment), and patient 2002 fills 45 days later, outside the window.",
        "alt_text": "Horizontal timeline from 2023-03-01 to beyond 2023-03-31. A vertical marker on March 1 labels the e-prescription order for all three patients. A green fill bar for patient 2001 begins March 10 inside the window. A small red stub on March 5 marks patient 2003 reversed fill. Patient 2002 has no bar inside the window. The 30-day boundary is drawn as a dashed vertical line at March 31.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Order[New e-prescription / CPOE order<br/>drug_class, new_start_flag, erx_date] --> Window{Eligible dispense or administration<br/>within order-to-fill window?}\n  Window -->|No fill / no administration| PNA[Primary non-adherence<br/>= initiation failure]\n  Window -->|Fill / J-code on service date| Init[Initiation: time zero set here]\n  Init --> Cohort[Enters first-fill new-user / ACNU cohort]\n  PNA -.silently excluded.-> Cohort\n  Cohort --> Bias[Cohort selected toward adherers<br/>-> healthy-adherer-like selection bias]",
        "caption": "How primary non-adherence determines initiation and silently selects the first-fill cohort. Only patients whose order is matched by a dispense (or administration) within the pre-specified window become initiators and enter a new-user/ACNU cohort; the never-fillers are dropped before time zero, biasing the surviving cohort toward adherers.",
        "alt_text": "Flowchart from a new e-prescription to a window check; no fill leads to primary non-adherence and exclusion, while a fill or administration sets time zero and enters the cohort, which is therefore selected toward adherers.",
        "source_type": "illustrative",
        "source_citations": [
          "fischer-2010"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  subgraph Sources[Required data layers]\n    eRx[Order: Surescripts / EHR CPOE] --> Link\n    Fill[Fill: pharmacy claims / dispensings] --> Link\n    Adm[Administration: J-code / MAR] --> Link\n  end\n  Link[Linked, date-reconciled record] --> Num[Numerator: orders with no fill/admin in window]\n  Link --> Den[Denominator: all new orders for the class]\n  Num --> Rate[Primary non-adherence rate<br/>vary window 7/30/90 days]\n  Den --> Rate\n  ClaimsOnly[Claims only: no order layer] -.cannot compute.-> Rate",
        "caption": "Data-flow for measuring primary non-adherence. The order layer is mandatory; a claims-only source has no order denominator and cannot produce the rate.",
        "alt_text": "Data-flow diagram showing order, fill, and administration layers linked and reconciled into numerator and denominator for the primary non-adherence rate, with claims-only marked as unable to compute it.",
        "source_type": "illustrative",
        "source_citations": [
          "fischer-2010"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "is_variant_of",
        "target_slug": "exposure-episode-construction-rwe",
        "notes": "Primary non-adherence/initiation determines the very first event of an exposure episode — whether an episode begins at all — before any days_supply stitching applies."
      },
      {
        "relation_type": "alternative_to",
        "target_slug": "pdc-proportion-of-days-covered",
        "notes": "PDC measures secondary (implementation) adherence conditional on filling at least once; this concept measures the primary, pre-fill failure that PDC is structurally blind to."
      },
      {
        "relation_type": "affects",
        "target_slug": "new-user-design",
        "notes": "New-user/ACNU designs index on first fill, silently excluding primary non-adherers and selecting a more adherent cohort."
      },
      {
        "relation_type": "affects",
        "target_slug": "time-zero-index-date-alignment-rwe",
        "notes": "Initiation defines a legitimate time zero; using the order date when the drug was never filled misaligns time zero and misclassifies exposure."
      },
      {
        "relation_type": "see_also",
        "target_slug": "healthy-user-bias",
        "notes": "The selection of fillers over never-fillers is a healthy-adherer-flavored bias that can persist into otherwise clean comparative cohorts."
      },
      {
        "relation_type": "used_with",
        "target_slug": "ehr-phenotyping-algorithms-rwe",
        "notes": "Identifying the order (CPOE/e-Rx) denominator typically relies on EHR/e-prescribing phenotyping linked to fill data."
      },
      {
        "relation_type": "requires",
        "target_slug": "continuous-enrollment-observable-time-rwe",
        "notes": "Continuous enrollment around the order date is required so that \"no fill\" is observed non-adherence, not unobserved person-time."
      }
    ],
    "aliases": [
      "primary medication non-adherence",
      "primary non-adherence",
      "primary nonadherence",
      "prescription abandonment",
      "first-fill non-adherence",
      "initiation failure",
      "prescribed not filled"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "ema",
      "pqa-cms"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "pro-development",
    "name": "PRO Instrument Development",
    "short_definition": "The mixed-methods process of building a patient-reported outcome (PRO) measure — concept elicitation, item generation, cognitive interviewing, and quantitative psychometric testing — so that the instrument's scores demonstrably reflect a defined concept of interest with documented reliability, validity, responsiveness, and an interpretable meaningful-change threshold.",
    "long_description": "**PRO instrument development** is the sequential, mixed-methods construction of a patient-reported outcome\nmeasure (PROM): defining the *concept of interest* and *context of use*, eliciting that concept directly from\npatients (qualitative concept elicitation to saturation), generating and refining items and a recall period and\nresponse scale, confirming comprehension and content validity through cognitive interviews, and then quantifying\nthe instrument's measurement properties in a development/validation sample. The deliverable is not a study result\nbut a *calibrated measuring tool* with a documented scoring algorithm and an evidentiary dossier (the \"validation\npackage\") that justifies using its scores as an endpoint. This is why PRO development belongs to the family of\noutcome-measure concepts, not study designs: the object produced is a measure, and the analytic work is\npsychometric (reliability, structural/construct validity, responsiveness, and meaningful within-patient change)\nrather than estimation of a treatment effect.\n\n**Core conceptual distinction**. Three distinctions do the work and are routinely conflated.\n(1) *Development vs validation.* Development is the forward process that yields the instrument and its content\nevidence (the dominant, irreversible decisions — what concept, which items, what recall period); validation is the\nempirical confirmation that the finished instrument behaves as required. Content validity is established during\ndevelopment (you cannot bolt it on afterward), whereas reliability and responsiveness are estimated post hoc on\ndata. 
(2) *Reflective vs formative measurement.* In a reflective model (the default for symptom/HRQOL scales) the\nlatent construct *causes* the item responses, so items should be internally consistent and a factor model fits; in\na formative model (e.g., a comorbidity or impact index) items *define* the construct and internal-consistency\nstatistics like Cronbach's alpha are meaningless. Choosing the wrong model invalidates the entire psychometric\nplan. (3) *Classical test theory (CTT) vs item response theory (IRT).* CTT yields a fixed-length scale with a\ntotal score whose reliability is sample-dependent; IRT/Rasch calibrates items on a latent continuum, supports\ncomputer-adaptive testing and item banking (the PROMIS approach), and separates item difficulty from person\nability — but requires larger samples and stronger assumptions (unidimensionality, local independence, no\ndifferential item functioning). The estimand is the patient's *level on the concept of interest*, and the\ninferential target during development is the set of item parameters and score-to-construct mapping, not a\nbetween-group contrast.\n\n**Pros, cons, and trade-offs** (specific & comparative).\n- **De novo development vs adopting/adapting an existing validated PROM (e.g., PROMIS, EORTC QLQ-C30, SF-36):**\n  Building de novo guarantees content validity for the exact concept and population and avoids licensing\n  constraints, but it is slow, expensive, and starts the evidentiary clock at zero with regulators. 
**Prefer\n  adoption** unless no existing instrument captures the concept of interest in the target context of use; the FDA\n  and EMA both expect you to justify why an existing fit-for-purpose measure was not used.\n- **CTT vs IRT/Rasch development:** CTT is simpler, needs smaller samples (~150–300), and produces a familiar\n  summed score; IRT enables banking, CAT, equating across forms, and item-level diagnostics, and gives\n  interval-level scores — at the cost of sample size (often 500+ for a graded response model), modeling expertise,\n  and assumption checking. **Prefer IRT** when you need adaptive testing, cross-instrument linking, or item-bank\n  efficiency; **prefer CTT** for a short fixed scale in a constrained sample.\n- **Anchor-based vs distribution-based meaningful change (MID/MWPC):** Anchor-based methods tie change to an\n  external, patient-meaningful reference (e.g., a global rating of change) and are the regulator-preferred basis\n  for a *meaningful within-patient change* threshold; distribution-based methods (½ SD, SEM) are sample statistics\n  that are useful as triangulating bounds but do not, alone, establish meaningfulness. **Prefer anchor-based as the\n  primary**, with distribution-based as supportive — never distribution-based alone for a labeling endpoint.\n\n**When to use**. Develop a new PRO when (a) the concept of interest is patient-experienced (symptoms, functioning,\nHRQOL, treatment burden) and unobservable by a clinician or biomarker; (b) no existing instrument is fit-for-purpose\nfor the concept and context of use after a documented search; (c) the measure will anchor an efficacy/labeling\nendpoint, an HTA value argument, or a real-world effectiveness study where the patient voice is the outcome. 
Begin\nwith concept elicitation and a conceptual framework, and pre-specify the measurement model and psychometric\nanalysis plan before data collection.\n\n**When NOT to use — and when it is actively misleading or dangerous**.\n- **A fit-for-purpose validated instrument already exists.** Reinventing a PROM fragments the evidence base,\n  forfeits comparability and benchmarking, and invites a regulatory \"why not the existing measure?\" rejection.\n- **The concept is not genuinely patient-reported** (e.g., it is a clinician judgment or an observable event).\n  Forcing it into a self-report scale produces a measure with no clear referent.\n- **Treating a development sample as a validation of responsiveness or MID.** Estimating a \"responsive\" effect\n  size and an MID on the *same* trial data the instrument was tuned on is circular; meaningful-change thresholds\n  derived this way will over-fit and can manufacture a clinically \"significant\" effect that does not replicate.\n- **Applying Cronbach's alpha or factor analysis to a formative index** declares an instrument \"unreliable\" or\n  \"multidimensional\" when those statistics simply do not apply — a dangerous misread that can sink a valid measure.\n- **Deploying an instrument across groups (countries, languages, severity strata) without testing measurement\n  invariance / DIF.** Uninvestigated DIF means a between-group score difference can reflect *how the groups answer\n  the items* rather than a true difference in the concept — a direct threat to any comparative claim.\n\n**Data-source operational depth**. PRO development is primary-data-collection-first, but the instrument is then\nfielded in real-world data systems, and each substrate has distinct failure modes.\n- **Prospective primary collection (the development substrate):** Item-response data captured by study instrument\n  (paper, ePRO app, IVR). 
Failure modes: *mode effects* (paper vs electronic are not automatically equivalent —\n  measurement-equivalence testing is required before pooling), *missing-not-at-random* dropout where the sickest\n  patients stop responding (biasing responsiveness and test-retest estimates), and *recall-period drift* when the\n  administration interval does not match the item recall window. Workaround: enforce a fixed administration\n  schedule, log device/mode, and analyze missingness as potentially informative.\n- **EHR-embedded ePRO / patient portals:** PROs collected at visits or via portal pushes. Failure modes:\n  *visit-driven and digitally selected sampling* — portal responders are healthier, more engaged, and more\n  affluent, so norms and known-groups validity drift; sicker or disconnected patients are differentially missing.\n  Workaround: weight to the source population, report response rates by subgroup, and never treat portal PRO\n  completeness as missing-completely-at-random.\n- **Registry-collected PROs:** Often the richest longitudinal PRO source with adjudicated clinical anchors for\n  responsiveness, but completion is *site-dependent and declines over follow-up*, confounding within-patient change\n  with attrition. Workaround: model time-to-nonresponse and use the registry's clinical anchors for anchor-based\n  MID rather than distribution-based shortcuts.\n- **Claims (and claims-linked) data:** Claims carry **no** PRO content — they cannot validate a PROM. Their role is\n  to characterize the population a PROM substudy is drawn from and to supply external anchors (e.g., hospitalization,\n  treatment escalation). Failure mode: a PRO substudy nested in a claims population inherits claims' coverage gaps —\n  Medicare Advantage-only person-time lacks fee-for-service claims, so anchor events used to define \"improved vs\n  stable\" are undercounted, biasing the anchor-based MID. 
Workaround: restrict the anchor window to enrollees with\n  complete FFS Parts A/B (or commercial medical) coverage and treat MA-only follow-up as anchor-missing, not\n  anchor-negative.\n\n**Worked example (PRO development with a claims-linked anchor).** Goal: develop and characterize an 8-item daily\nsymptom-impact PRO for moderate-to-severe atopic dermatitis, context of use = real-world effectiveness endpoint.\n(1) *Concept elicitation:* 25 in-depth patient interviews to saturation produce a conceptual framework with two\ndomains (itch/sleep symptoms; activity impact). (2) *Item generation & cognitive debriefing:* 14 candidate items\ndrafted at a 24-hour recall with an 11-point NRS response; two rounds of cognitive interviews (n=20) drop 6\nitems for ambiguity/redundancy and lock an 8-item, single-day recall instrument with content-validity evidence.\n(3) *Quantitative testing in a development sample (n=500, long-form `responses(person_id, item_id, response,\ntimepoint)`):* exploratory then confirmatory factor analysis confirms the two-domain structure (CFI = 0.97,\nRMSEA = 0.05, SRMR = 0.04); internal consistency Cronbach's alpha = 0.89 per domain; a graded response IRT model\nshows ordered thresholds, adequate item information across the trait range, and no DIF by sex or age (lordif). (4)\n*Reliability:* 2-week test-retest in stable patients (anchor = \"no change\" on a global rating) gives ICC = 0.82.\n(5) *Construct/known-groups validity:* mean scores rank-order across investigator-graded severity (mild < moderate\n< severe), with hypothesized correlations to a reference HRQOL measure. 
(6) *Anchor-based meaningful within-patient\nchange:* link the substudy to claims; define \"improved\" as patients with a step-down in dispensed therapy class\n(`fill_date`, `days_supply`, drug-class step) within the recall-aligned window among enrollees with ≥365 days of\ncontinuous **FFS** medical+pharmacy coverage so the anchor is observed, not missing; the mean within-patient score\nchange in the smallest meaningfully-improved anchor group (≈0.5–1.0 NRS points) is the candidate MID, triangulated\nagainst a distribution-based ½ SD bound. Pre-specify every threshold; do not derive the MID from the same trial\nused to claim treatment efficacy.\n\n**Interpreting the output**. The immediate output of PRO instrument development is a validation dossier, not\na treatment-effect estimate. For the atopic dermatitis example, the dossier reads: an 8-item daily\nsymptom-impact questionnaire with content validity established through 25 concept-elicitation interviews\n(saturation confirmed) and 20 cognitive-debriefing interviews (items locked after two revision rounds);\ntwo-domain structure confirmed by confirmatory factor analysis (CFI = 0.97, RMSEA = 0.05); internal\nconsistency Cronbach alpha = 0.89 per domain; test-retest reliability ICC = 0.82 in stable patients;\nanchor-based MID approximately 0.5–1.0 NRS points in the smallest meaningfully improved anchor group.\n\nFormal interpretation: the dossier certifies that the instrument's item set, recall period, and\nresponse scale were derived from and confirmed by the patient experience; that the two-domain factor\nstructure is supported; and that scores are sufficiently stable and consistent for use as a real-world\neffectiveness endpoint. 
No single statistic licenses the instrument — the FDA and EMA evaluate the\nfull body of evidence, and a strong CFI alongside weak content-validity documentation is not adequate.\n\nPractical interpretation: the dossier is the instrument's identity card for regulators and HTA bodies.\nA score change in a trial means something only because this dossier certifies what the instrument\nmeasures and how much change is meaningful. Item-performance readouts (information curves, DIF by sex\nand age passing) tell analysts that no subgroup is being systematically mis-measured. Without this\nevidentiary package, a PRO score is a number without a referent — it cannot anchor a labeling claim\nor an HTA value argument.",
    "primary_category": "Outcome_Measure",
    "tags": [
      "patient-reported-outcome",
      "psychometrics",
      "content-validity",
      "item-response-theory",
      "reliability",
      "responsiveness",
      "minimal-important-difference",
      "clinical-outcome-assessment"
    ],
    "applies_to_study_types": [
      "pro_development"
    ],
    "data_sources": [
      "primary",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1016/j.jval.2011.06.014",
        "url": "https://doi.org/10.1016/j.jval.2011.06.014",
        "citation_text": "Patrick DL, Burke LB, Gwaltney CJ, et al. Content validity—establishing and reporting the evidence in newly developed patient-reported outcomes (PRO) instruments for medical product evaluation: ISPOR PRO Good Research Practices Task Force report. Value in Health. 2011;14(8):967-977.",
        "year": 2011,
        "authors_short": "Patrick et al.",
        "notes": "Canonical ISPOR statement of how content validity is built into PRO development through concept elicitation, item generation, and cognitive interviewing — the irreversible front-end decisions of instrument development."
      },
      {
        "role": "explain",
        "doi": "10.1007/s11136-010-9606-8",
        "url": "https://doi.org/10.1007/s11136-010-9606-8",
        "citation_text": "Mokkink LB, Terwee CB, Patrick DL, et al. The COSMIN checklist for assessing the methodological quality of studies on measurement properties of health status measurement instruments: an international Delphi study. Quality of Life Research. 2010;19(4):539-549.",
        "year": 2010,
        "authors_short": "Mokkink et al.",
        "notes": "Defines the taxonomy of measurement properties (reliability, validity, responsiveness) and the standards by which PRO development and validation evidence is judged."
      },
      {
        "role": "demonstrate",
        "doi": "10.1097/01.MLR.0000250483.85507.04",
        "url": "https://doi.org/10.1097/01.MLR.0000250483.85507.04",
        "citation_text": "Reeve BB, Hays RD, Bjorner JB, et al. Psychometric evaluation and calibration of health-related quality of life item banks: plans for the Patient-Reported Outcomes Measurement Information System (PROMIS). Medical Care. 2007;45(5 Suppl 1):S22-S31.",
        "year": 2007,
        "authors_short": "Reeve et al.",
        "notes": "Worked exposition of the IRT-based item-bank calibration, dimensionality, and DIF analyses used in modern PRO development."
      },
      {
        "role": "use",
        "doi": "10.1016/j.jclinepi.2010.04.011",
        "url": "https://doi.org/10.1016/j.jclinepi.2010.04.011",
        "citation_text": "Cella D, Riley W, Stone A, et al. The Patient-Reported Outcomes Measurement Information System (PROMIS) developed and tested its first wave of adult self-reported health outcome item banks: 2005-2008. Journal of Clinical Epidemiology. 2010;63(11):1179-1194.",
        "year": 2010,
        "authors_short": "Cella et al.",
        "notes": "Large-scale applied demonstration of the full PRO development pipeline (qualitative through IRT calibration) producing reusable, validated item banks."
      },
      {
        "role": "explain",
        "doi": null,
        "url": "https://www.fda.gov/media/77832/download",
        "citation_text": "US Food and Drug Administration. Guidance for Industry — Patient-Reported Outcome Measures: Use in Medical Product Development to Support Labeling Claims. 2009.",
        "year": 2009,
        "authors_short": "FDA",
        "notes": "Regulatory expectations for the evidentiary dossier (conceptual framework, content validity, reliability, construct validity, responsiveness, interpretation of meaningful change) required for a PRO labeling endpoint."
      }
    ],
    "plain_language_summary": "PRO instrument development is the process of building a survey tool that captures how patients feel about their own health — things like pain, fatigue, or ability to do daily activities — in a way that is trustworthy and backed by evidence. The key idea is that the questions must come from patients themselves: researchers first interview patients to learn which symptoms matter most, then write and test draft questions, and finally run statistical tests to confirm the finished tool measures what it is supposed to measure. Without this structured development process, a questionnaire score cannot be trusted as evidence in a clinical trial or health-technology assessment. One honest caveat: a well-developed instrument still only captures what patients remember and choose to report, so recall period and question wording both shape the scores.",
    "key_terms": [
      {
        "term": "content validity",
        "definition": "Evidence that a questionnaire actually covers the symptoms and experiences that matter to patients in the target disease — established by interviewing real patients during development, not by expert opinion alone."
      },
      {
        "term": "concept elicitation",
        "definition": "A qualitative research technique in which researchers interview patients with the disease to discover, in patients' own words, which symptoms and impacts are most important to capture in the questionnaire."
      },
      {
        "term": "cognitive debriefing",
        "definition": "A round of patient interviews in which draft questionnaire items are read aloud and patients explain what they think each question means, so that confusing or ambiguous wording can be fixed before the tool is finalized."
      },
      {
        "term": "psychometric testing",
        "definition": "Statistical analysis run on questionnaire responses from a development sample to confirm the tool is consistent (reliability), measures the intended concept (validity), and can detect real change over time (responsiveness)."
      },
      {
        "term": "item",
        "definition": "A single question or statement on a questionnaire that patients respond to, such as 'Rate your worst itch in the past 24 hours on a scale of 0 to 10.'"
      },
      {
        "term": "recall period",
        "definition": "The time window a question asks patients to think back over, such as 'in the past 7 days' or 'in the past 24 hours' — the choice affects both what patients can accurately remember and how quickly the score responds to treatment."
      }
    ],
    "worked_example": {
      "scenario": "A team wants to develop an 8-item daily symptom questionnaire for moderate-to-severe atopic dermatitis (a chronic inflammatory skin disease with intense itch) to use as an endpoint in a real-world effectiveness study. No existing validated questionnaire captures the exact combination of itch, sleep disruption, and daily-activity impact they need. The table below shows the five sequential development stages, what each stage involves, and what it guarantees about the finished instrument.",
      "dataset": {
        "caption": "PRO instrument development stages for the atopic dermatitis symptom questionnaire (25 concept-elicitation interviews; 20 cognitive-debriefing interviews; n=500 quantitative development sample)",
        "columns": [
          "Stage",
          "What happens",
          "What it guarantees"
        ],
        "rows": [
          [
            "1. Concept elicitation",
            "25 in-depth patient interviews; patients describe itch, sleep loss, and activity limits in their own words until no new themes emerge (saturation)",
            "Content validity: the questionnaire covers what patients actually experience, not just what clinicians assume matters"
          ],
          [
            "2. Item generation",
            "Research team drafts 14 candidate items at a 24-hour recall period using an 11-point numeric rating scale (0=none, 10=worst imaginable), drawing directly from patient language in Stage 1",
            "Items are grounded in patient concepts and use language patients recognize"
          ],
          [
            "3. Cognitive debriefing",
            "Two rounds of patient interviews (n=20) in which patients read each draft item aloud and explain what they think it is asking; 4 items dropped for ambiguity or redundancy, leaving 8 items",
            "Patients understand every retained question the way the developers intend it"
          ],
          [
            "4. Psychometric testing",
            "500-patient development sample completes the 8-item questionnaire at baseline and 2-week retest; confirmatory factor analysis confirms 2-domain structure (CFI=0.97, RMSEA=0.05); Cronbach alpha=0.89 per domain; test-retest ICC=0.82 in stable patients",
            "The instrument is internally consistent, structurally sound, and reproducible across time in patients who have not changed"
          ],
          [
            "5. Final validated instrument",
            "8-item questionnaire with documented scoring algorithm, content-validity evidence from Stages 1-3, and measurement-property evidence from Stage 4; submitted as validation dossier to regulators",
            "Scores can be defended as a credible, fit-for-purpose endpoint in clinical and regulatory contexts"
          ]
        ]
      },
      "steps": [
        "Stages 1-3 build content validity from the ground up: patients define the concept first, then researchers write items in patient language, then patients confirm comprehension. This order is irreversible — you cannot add content validity to a questionnaire after the items are already fixed.",
        "Stage 2 begins only after Stage 1 produces a conceptual framework: the two domains (itch/sleep symptoms; activity impact) come directly from what patients said in interviews, not from a literature search alone.",
        "Stage 3 removes items that patients misinterpret: the team starts with 14 candidate items and exits with 8 confirmed-comprehensible items, reducing burden while protecting meaning.",
        "Stage 4 runs on a separate sample of 500 patients — not the same people interviewed in Stages 1-3 — so statistical properties are estimated on fresh data rather than on the patients who shaped the tool.",
        "The final instrument is the output: a calibrated measuring tool with a validation dossier, not a study result. The development work is complete before any treatment-effectiveness study begins."
      ],
      "result": "A validated 8-item atopic dermatitis symptom questionnaire with documented content validity (25 concept-elicitation + 20 cognitive-debriefing interviews), structural validity (CFI=0.97, RMSEA=0.05, 2-factor model confirmed), internal consistency (Cronbach alpha=0.89 per domain), and test-retest reliability (ICC=0.82) — ready to serve as a pre-specified endpoint in a real-world effectiveness study or regulatory submission."
    },
    "prerequisites": [
      "hrqol",
      "pro-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Classical test theory (CTT) fixed-length scale",
        "description": "Develop a summed-score scale evaluated with Cronbach's alpha, item-total correlations, and exploratory/ confirmatory factor analysis for structural validity. Smaller sample requirement; familiar total score.",
        "edge_cases": [
          "Alpha is sample-dependent and rises mechanically with item count — high alpha is not evidence of unidimensionality.",
          "Alpha is meaningless for a formative index; check the measurement model first."
        ],
        "data_source_notes": "primary: balanced item presentation and fixed administration schedule; ehr/portal: log mode and test paper-vs-electronic measurement equivalence before pooling."
      },
      {
        "name": "Item response theory / Rasch calibration and item banking",
        "description": "Calibrate items on a latent continuum (graded response or Rasch model), enabling computer-adaptive testing, short-form derivation, and equating across instruments (the PROMIS approach).",
        "edge_cases": [
          "Requires unidimensionality and local independence; violated by redundant double-barreled items.",
          "Needs larger samples (often 500+ for polytomous GRM) and explicit DIF testing across key subgroups."
        ],
        "data_source_notes": "primary/registry: ensure adequate responses across the full trait range so item thresholds are estimable; sparse extremes inflate standard errors at the tails."
      },
      {
        "name": "Anchor-based meaningful within-patient change (MID/MWPC) estimation",
        "description": "Define the smallest change patients perceive as meaningful by tying score change to an external anchor (patient global rating of change or a clinical event), triangulated with distribution-based bounds.",
        "edge_cases": [
          "Weak anchor-score correlation (<0.3) makes the anchor uninformative; report the correlation.",
          "Deriving the MID on the same trial used for the efficacy claim is circular and over-fits."
        ],
        "data_source_notes": "linked claims: define the anchor (e.g., therapy step-down) only on enrollees with complete FFS coverage so the anchor is observed, not missing; treat MA-only person-time as anchor-missing."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Adopting or adapting an existing validated PROM (PROMIS, EORTC QLQ-C30, SF-36)",
        "pros_of_this": "Guarantees content validity for the exact concept of interest and context of use; no licensing or item-pool constraints.",
        "cons_of_this": "Slow and expensive; forfeits comparability/benchmarking and starts the regulatory evidentiary clock at zero.",
        "when_to_prefer": "Only when a documented search shows no existing instrument is fit-for-purpose for the concept in the target population and context of use."
      },
      {
        "compared_to": "Classical test theory development",
        "pros_of_this": "Interval-level scores, item banking, computer-adaptive testing, cross-form equating, and item-level DIF diagnostics (IRT/Rasch).",
        "cons_of_this": "Larger samples, stronger assumptions (unidimensionality, local independence), and modeling expertise required.",
        "when_to_prefer": "When adaptive testing, item banks, or linking across instruments is needed; CTT suffices for a short fixed scale in a constrained sample."
      },
      {
        "compared_to": "Distribution-based meaningful-change estimation alone",
        "pros_of_this": "Ties the change threshold to what patients actually experience as meaningful via an external anchor — the regulator-preferred basis for a labeling MID.",
        "cons_of_this": "Requires a valid, well-correlated anchor and adequate sample in each anchor category.",
        "when_to_prefer": "As the primary MID method for any interpretation/labeling claim; use distribution-based statistics only as supportive triangulation."
      }
    ],
    "implementation_notes_by_data_source": {
      "primary": "Capture item responses in long form (person_id, item_id, response, timepoint). Pre-specify the measurement model (reflective vs formative; CTT vs IRT) and the psychometric analysis plan before data collection. Test paper-vs-electronic measurement equivalence before pooling modes; treat dropout as potentially informative for test-retest and responsiveness.",
      "ehr": "Portal/visit-driven ePRO responders are healthier and more engaged; report response rates by subgroup and weight to the source population. Never treat completeness as missing-completely-at-random.",
      "registry": "Richest longitudinal PRO source with adjudicated clinical anchors for responsiveness, but completion is site-dependent and declines over follow-up; model time-to-nonresponse and use clinical anchors for the MID.",
      "linked": "Claims carry no PRO content but supply external anchors (therapy step-down, hospitalization). Restrict the anchor window to complete FFS (Parts A/B) or commercial medical coverage; treat MA-only person-time as anchor-missing, not anchor-negative."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\nimport pingouin as pg\nfrom semopy import Model\n\n# --- reshape long item responses to person x item wide for a single timepoint ---\ndef to_wide(responses: pd.DataFrame, timepoint: str) -> pd.DataFrame:\n    w = (responses[responses[\"timepoint\"] == timepoint]\n         .pivot(index=\"person_id\", columns=\"item_id\", values=\"response\"))\n    return w\n\nwide = to_wide(responses, \"baseline\")\n\n# --- internal consistency: Cronbach's alpha + item-deleted alphas (CTT reliability) ---\nalpha, ci = pg.cronbach_alpha(data=wide)\nitem_deleted = {\n    item: pg.cronbach_alpha(data=wide.drop(columns=[item]))[0]\n    for item in wide.columns\n}\nprint(f\"Cronbach alpha = {alpha:.3f} (95% CI {ci[0]:.3f}-{ci[1]:.3f})\")\n\n# --- test-retest reliability: ICC(2,1) absolute-agreement (two-way random) on the domain total in stable patients ---\nscore = (responses.groupby([\"person_id\", \"timepoint\"])[\"response\"]\n                   .sum().rename(\"total\").reset_index())\nicc = pg.intraclass_corr(data=score, targets=\"person_id\",\n                         raters=\"timepoint\", ratings=\"total\")\nprint(icc[icc[\"Type\"] == \"ICC2\"][[\"ICC\", \"CI95%\"]])\n\n# --- structural validity: 2-factor confirmatory model (reflective measurement) ---\nspec = \"\"\"\nSYMPTOM =~ item1 + item2 + item3 + item4\nIMPACT  =~ item5 + item6 + item7 + item8\n\"\"\"\ncfa = Model(spec)\ncfa.fit(wide)                      # report CFI >= 0.95, RMSEA <= 0.08, SRMR <= 0.08\nprint(cfa.inspect())",
        "description": "Core PRO psychometrics on a long-form item-response table. Required input (already cleaned):\n  responses : long-form item responses -> person_id, item_id (str), response (ordinal int),\n              timepoint (e.g., 'baseline','retest'); one row per person-item-timepoint\nComputes Cronbach's alpha (with item-deleted), test-retest ICC, and a confirmatory factor model\n(structural validity). Run dimensionality/CFA BEFORE trusting alpha; alpha is invalid for a formative index.",
        "dependencies": [
          "pandas",
          "pingouin",
          "semopy"
        ],
        "source_citations": [
          "mokkink-2010"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(psych); library(lavaan); library(mirt); library(lordif)\nlibrary(reshape2)\n\nbase <- subset(responses, timepoint == \"baseline\")\nwide <- dcast(base, person_id ~ item_id, value.var = \"response\")\nitems <- wide[, setdiff(names(wide), \"person_id\")]\n\n## internal consistency (CTT) with item-deleted diagnostics\npsych::alpha(items)$total\npsych::alpha(items)$alpha.drop\n\n## test-retest ICC(2,1) absolute-agreement (two-way random) on the domain total in stable patients\ntot <- reshape2::dcast(\n  aggregate(response ~ person_id + timepoint, base = responses, FUN = sum),\n  person_id ~ timepoint, value.var = \"response\")\npsych::ICC(tot[, c(\"baseline\", \"retest\")])$results[\"ICC2\", ]  # ICC2 = absolute-agreement, two-way random\n\n## structural validity: 2-factor CFA (report CFI/RMSEA/SRMR via fitMeasures)\ncfa_fit <- lavaan::cfa(\n  'SYMPTOM =~ item1 + item2 + item3 + item4\n   IMPACT  =~ item5 + item6 + item7 + item8',\n  data = items, ordered = names(items))\nlavaan::fitMeasures(cfa_fit, c(\"cfi\", \"rmsea\", \"srmr\"))\n\n## IRT graded response model + DIF screen by sex (item banking)\ngrm <- mirt::mirt(items, model = 1, itemtype = \"graded\")\ncoef(grm, IRTpars = TRUE, simplify = TRUE)\nlordif::lordif(items, group = wide_sex, criterion = \"Chisqr\")",
        "description": "PRO psychometrics in R on a long-form responses data.frame:\n  responses: person_id, item_id, response (ordinal), timepoint\nCronbach's alpha (psych), test-retest ICC (psych), CFA structural validity (lavaan), and a graded\nresponse IRT calibration with DIF screening (mirt / lordif) for item-bank development.",
        "dependencies": [
          "psych",
          "lavaan",
          "mirt",
          "lordif"
        ],
        "source_citations": [
          "reeve-2007"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* Internal consistency: Cronbach's alpha with raw + item-deleted (NOMISS for listwise items) */\nproc corr data=work.wide alpha nomiss;\n  var item1-item8;\nrun;\n\n/* Test-retest reliability: ICC via variance components from a mixed model.\n   Stack the two timepoints long: person_id, time, total. */\ndata tr_long;\n  set work.tr;\n  time='B'; total=base_total;   output;\n  time='R'; total=retest_total; output;\nrun;\nproc mixed data=tr_long;\n  class person_id time;\n  model total = ;                         /* ICC(2,1) absolute agreement (two-way random):           */\n  random person_id time;                  /* subj var / (subj var + time var + residual)             */\nrun;\n\n/* Structural validity: 2-factor confirmatory factor analysis (reflective model) */\nproc calis data=work.wide method=ml;\n  factor\n    SYMPTOM ===> item1-item4,\n    IMPACT  ===> item5-item8;\n  pvar SYMPTOM = 1.0, IMPACT = 1.0;       /* report fit: CFI, RMSEA, SRMR */\nrun;\n\n/* IRT graded response calibration for item-bank development */\nproc irt data=work.wide;\n  var item1-item8;\n  model item1-item8 / resfunc=graded;     /* item thresholds + discrimination */\nrun;",
        "description": "PRO psychometrics in SAS. Required datasets (post data-management):\n  work.wide  : person_id + one column per item (item1-item8) at the development timepoint\n  work.tr    : person_id, base_total, retest_total (domain totals at two stable timepoints)\nPROC CORR ALPHA gives Cronbach's alpha + item-deleted; PROC MIXED gives the test-retest ICC variance\ncomponents; PROC CALIS fits the CFA; PROC IRT calibrates a graded response model (SAS/STAT 13.1+).",
        "dependencies": [],
        "source_citations": [
          "reeve-2007"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  COU[Define concept of interest<br/>+ context of use + target population] --> Elicit[Qualitative concept elicitation<br/>patient interviews to saturation]\n  Elicit --> Frame[Conceptual framework<br/>domains + sub-concepts]\n  Frame --> Items[Item generation<br/>recall period + response scale]\n  Items --> Cog[Cognitive interviews<br/>comprehension + content validity]\n  Cog --> Model{Measurement model?}\n  Model -->|Reflective| Quant[Quantitative testing in development sample]\n  Model -->|Formative| Formative[Index scoring<br/>no alpha / factor model]\n  Quant --> Psych[Reliability + structural/construct validity<br/>+ DIF + responsiveness]\n  Psych --> MID[Anchor-based meaningful within-patient change<br/>triangulated with distribution-based bound]\n  MID --> Dossier[Validation dossier + scoring algorithm<br/>fit-for-purpose endpoint]",
        "caption": "PRO instrument development roadmap. Content validity is built in at the elicitation/cognitive-interview stage (it cannot be added later); the measurement-model choice gates which psychometric statistics are valid.",
        "alt_text": "Flowchart from concept-of-interest definition through concept elicitation, conceptual framework, item generation, cognitive interviews, a measurement-model decision, quantitative psychometric testing, anchor-based meaningful-change estimation, and the final validation dossier.",
        "source_type": "illustrative",
        "source_citations": [
          "patrick-2011"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  L((Latent concept<br/>of interest)) -->|loadings| I1[Item 1]\n  L -->|loadings| I2[Item 2]\n  L -->|loadings| I3[Item 3]\n  L -->|loadings| I4[Item 4]\n  G[Group: sex / language / severity] -.->|DIF: direct path to item = bias| I2\n  I1 --> S[Summed / IRT score]\n  I2 --> S\n  I3 --> S\n  I4 --> S\n  S -.->|known-groups validity| L",
        "caption": "Reflective measurement model with a differential-item-functioning (DIF) threat. The construct should drive item responses through loadings only; a direct group-to-item path (dashed) means the item behaves differently by group, so a score difference can reflect measurement bias rather than a true difference in the concept.",
        "alt_text": "Diagram of a latent construct loading onto four items that feed a score, with a dashed differential-item-functioning path from a grouping variable directly into one item, illustrating measurement-invariance bias.",
        "source_type": "illustrative",
        "source_citations": [
          "reeve-2007"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "produces",
        "target_slug": "pro-validation",
        "notes": "Development yields the instrument and its content-validity evidence; validation is the downstream empirical confirmation of its reliability, construct validity, and responsiveness on data."
      },
      {
        "relation_type": "see_also",
        "target_slug": "hrqol",
        "notes": "Health-related quality-of-life measures are the prototypical multi-domain PROs produced by this development process."
      },
      {
        "relation_type": "used_with",
        "target_slug": "longitudinal-outcomes-modeling-rwe",
        "notes": "Once developed, PRO scores are analyzed over time with longitudinal/mixed models for responsiveness and real-world change."
      },
      {
        "relation_type": "alternative_to",
        "target_slug": "claims-outcome-algorithm-ppv-sensitivity-rwe",
        "notes": "When the concept is patient-experienced and unobservable in claims, a developed PRO is the appropriate endpoint rather than a claims-coded outcome algorithm."
      }
    ],
    "aliases": [
      "PRO instrument development",
      "PROM development",
      "patient-reported outcome measure development",
      "PRO measure development"
    ],
    "complexity": "advanced",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "pro-rwe",
    "name": "Patient-Reported Outcomes in Real-World Settings",
    "short_definition": "The operational capture, time-windowing, missingness handling, and longitudinal analysis of patient-reported outcomes (PROs/ePROs) collected in routine care, registries, or portals rather than in a controlled trial, where assessment is unscheduled, non-random, and frequently missing-not-at-random.",
    "long_description": "A **real-world PRO** is a self-reported measure of a patient's health status — symptoms, function, or health-related\nquality of life — captured outside a protocol-driven trial: in a disease registry, an ePRO app or patient portal, or\nembedded in EHR workflows (e.g., PROMIS computer-adaptive tests pushed before oncology visits). The instrument may be\nidentical to the one used in a trial; what changes is the *data-generating process*. In a trial, assessment timing is\nscheduled, completion is chased by coordinators, and the analysis population is fixed at randomization. In the real world,\nassessment is **visit-driven or opt-in**, completion is **voluntary and self-selected**, and the patients who keep\nresponding differ systematically from those who stop — usually because the sickest patients are the ones who drop out.\nThis entry is about the measurement-and-analysis engineering required to turn that messy stream of submissions into a\ndefensible analytic dataset; it is distinct from instrument creation (`pro-development`) and psychometric validation\n(`pro-validation`), which assume the instrument already performs and ask a different question.\n\n**Core conceptual distinction**. Three things separate real-world PRO measurement from trial PRO measurement, and each\ndrives a specific analytic decision. (1) *Assessment timing is endogenous.* A real-world PRO is recorded when the patient\nshows up or chooses to respond, not on a fixed schedule, so you must impose an explicit **completion window** (e.g., a\nscheduled landmark ±14 days) and a **time-zero anchor** (initiation, diagnosis, or surgery date — often taken from claims,\nnot from the PRO system) to make scores from different patients comparable. 
(2) *Missingness is informative.* Non-response\nin routine care is almost never missing-completely-at-random: it correlates with the outcome itself (a patient too sick or\ntoo symptomatic to complete a questionnaire), making naive complete-case or last-observation-carried-forward (LOCF)\nanalyses biased toward a falsely stable or falsely improving trajectory. (3) *The cohort is a completer cohort.* The\npatients with usable PRO data are a self-selected subset of the enrolled or treated population, so the estimand silently\nshifts from \"the PRO trajectory of treated patients\" to \"the PRO trajectory of patients who keep filling out\nquestionnaires.\" Stating which of these you mean — and whether you are estimating a within-person change, a between-group\ncomparison, or a responder proportion — is the estimand discipline this concept demands.\n\n**Pros, cons, and trade-offs**.\n- **vs trial-collected PROs:** Real-world PROs deliver longer follow-up, larger and more representative populations, and\n  capture of subgroups trials exclude (frail, multimorbid, very old). Cost: timing is irregular, completion is low and\n  selective, and there is no coordinator chasing missing forms — so the missing-data and selection problems that are\n  nuisances in a trial become the central methodological threat. **Prefer real-world PROs** when the question is durability,\n  real-world burden, or HTA value evidence; **prefer trial PROs** when you need an unbiased treatment-effect estimate on a\n  PRO endpoint.\n- **vs clinician-reported or claims-derived outcomes:** PROs capture the patient's lived experience (fatigue, pain, role\n  function) that claims and chart abstraction systematically miss. Cost: they are subjective, vulnerable to response shift\n  and recall, and far more missing than a claims-based event (an inpatient claim exists whether or not the patient feels\n  like reporting). 
**Prefer PROs** for symptom/QoL endpoints; pair them with claims-confirmed clinical events\n  (`endpoint-adjudication-chart-review-rwe`) so that the *absence* of a PRO is not mistaken for the absence of an event.\n- **vs complete-case analysis of real-world PROs:** Mixed-effects/MMRM models (`mmrm-repeated-measures-rwe`) under a\n  missing-at-random assumption, with multiple imputation (`multiple-imputation-longitudinal-rwe`) and explicit MNAR\n  sensitivity analysis, recover far less biased trajectories than complete-case or LOCF. Cost: more assumptions and code,\n  and MAR is still an assumption that informative dropout can violate. **Prefer the model-based approach**; reserve\n  complete-case only as a transparency check, never as the primary analysis.\n\n**When to use**. Estimating PRO trajectories, symptom burden, or HRQoL over routine follow-up; generating utility inputs\nfor cost-utility models via QALY mapping (`qaly-utility-mapping-rwe`); HTA submissions where regulators expect real-world\ndurability of patient-relevant benefit; pragmatic symptom-monitoring programs; characterizing the patient experience in\nregistries where a trial is infeasible. 
The prerequisites are a validated instrument with an established minimally\nimportant difference (MID), a documentable completion window and time-zero anchor, and a missingness plan written *before*\nthe analysis.\n\n**When NOT to use — and when it is actively misleading or dangerous**.\n- **As an unadjusted treatment-effect estimate.** Comparing PRO change between two real-world treatment groups without\n  confounding control *and* differential-missingness handling is doubly biased: by indication and by who keeps responding.\n  A drug that makes patients feel worse can look better simply because the patients it harms stop completing surveys.\n- **With LOCF or complete-case as the primary analysis when dropout is informative.** LOCF freezes the last (often\n  pre-decline) score and manufactures stability; complete-case conditions on survival-to-response. Both are dangerous in\n  progressive disease (oncology, ALS, advanced heart failure) where the patients who disappear are the ones declining.\n- **When the completer cohort no longer maps to a meaningful population.** If only 20% of an opt-in ePRO cohort responds at\n  month 12 and responders are younger, English-speaking, and higher-SES, the surviving estimand describes a population no\n  decision-maker cares about — report the completion funnel, do not bury it.\n- **Across instruments or versions without a crosswalk.** Pooling SF-36 with PROMIS, or PRO-CTCAE free-text with graded\n  items, without a validated linkage produces a trajectory artifact of the instrument switch, not the patient.\n- **When the \"real-world\" PRO is actually a single cross-sectional snapshot** marketed as longitudinal evidence — a one-time\n  portal survey cannot support a trajectory or responder claim.\n\n**Data-source operational depth**.\n- **Disease/product registry (scheduled PRO at fixed visits):** The best structure for comparability, but completion falls\n  off with disease severity and at later visits, so missingness is 
informative; LOCF here hides genuine decline. Failure\n  mode: registries often record only the score and visit, not *why* a form is missing (death? too sick? lost to\n  follow-up?), so you cannot distinguish MAR from MNAR without linking to a mortality source (`mortality-source-hierarchy-rwe`)\n  and claims for hospitalizations. Workaround: build an explicit assessment-expected vs assessment-observed table per\n  visit and model dropout jointly or via MNAR sensitivity (delta-adjusted multiple imputation).\n- **ePRO app / patient portal (opt-in, push-notification driven):** Subject to a digital-divide selection — older,\n  lower-SES, non-English, and less digitally literate patients under-respond — so the responder cohort is not the enrolled\n  cohort. Push-notification timing and reminder cadence change *who* responds and *when*, contaminating the completion\n  window. Failure mode: multiple submissions inside one window (a patient answers twice) and unscheduled \"I feel awful\n  today\" submissions that oversample bad days. Workaround: pre-specify a window-collapse rule (e.g., the submission closest\n  to the scheduled landmark; or first-in-window), cap one record per person-window, and report completion by demographic\n  strata to expose the selection.\n- **EHR-embedded PROMIS via portal (visit-driven):** Captured only when the patient has a visit and the workflow fires, so\n  there is no PRO when the patient leaves the system or is hospitalized elsewhere — visit-driven capture means departure is\n  differential. Failure mode: a \"stable\" PRO trajectory can mask a claims-confirmed hospitalization that occurred between\n  two portal completions. 
Workaround: link to claims/EHR encounter data so that clinical events between PRO assessments are\n  observed, and treat a PRO gap that brackets a major event as informative, not ignorable.\n- **Linked claims + PRO:** The strongest substrate — claims supply the time-zero anchor (initiation/surgery date),\n  confirm clinical events and death (so absence of a PRO is not read as absence of an event), and let you characterize\n  non-responders on observable claims covariates to test the MAR assumption. Cost: linkage selection (only the linkable\n  subset has both), and date reconciliation between the PRO timestamp and the claim service date before window assignment.\n\n**Worked example (linked registry ePRO + claims).** Question: the 12-month FACT-G total-score trajectory and 6-month\nresponder rate after initiating a new oral oncolytic, among adults in a cancer registry with linked commercial + Medicare\nFFS claims. (1) *Time zero:* the first dispensing of the oncolytic from the pharmacy claim (`fill_date` on the index NDC),\nnot the registry enrollment date, so all patients are aligned at treatment initiation. (2) *Eligibility / observability:*\nrequire 365 days of continuous medical + pharmacy enrollment before `index_date` (exclude Medicare Advantage-only\nperson-time, which lacks FFS claims and would make the baseline window unobservable) and at least one PRO submission in the\nbaseline window. (3) *Baseline PRO:* the FACT-G submission in the window [`index_date` − 30, `index_date` + 7]; if multiple\nsubmissions fall in-window, keep the one closest to `index_date` (window-collapse rule) and cap one record per\nperson-window. (4) *Scheduled landmarks:* months 3, 6, 9, 12, each with a completion window of the scheduled day ±14 days;\nthe same collapse and cap rules apply. 
(5) *Responder definition:* a patient is a 6-month responder if their month-6\nFACT-G is ≥ their baseline plus the MID of 5 points (anchor-based MID; report deterioration symmetrically as a drop ≥5).\n(6) *Informative missingness:* before scoring, build an expected-vs-observed completion table by landmark and by\nage/SES/language stratum; classify each missing assessment using linked claims and a mortality source as death,\nhospitalization-near-window, disenrollment, or unexplained — because a patient who is hospitalized (claims-confirmed) at\nmonth 6 but has no PRO is informatively, not ignorably, missing. (7) *Primary analysis:* an MMRM on change-from-baseline\nwith an unstructured covariance over the four landmarks (MAR), adjusting for baseline score and confounders measured in the\n365-day claims lookback; the responder rate is reported under multiple imputation, never LOCF. (8) *Sensitivity:*\ndelta-adjusted (MNAR) multiple imputation that penalizes imputed scores for dropouts, and a tabulated completion funnel so\nreviewers see exactly how the analytic cohort narrowed from enrolled to baseline-observed to month-12-observed.\n\n**Interpreting the output**\n\nConsider the worked example: mean PROMIS Physical Function change from baseline is 1.5 points at\n6 months (sum of changes 5 + 2 − 6 + 5 = 6, divided by 4 patients), with 2 of 4 patients (50%)\nclassified as responders (change ≥ 3 points, the MID) and 1 of 4 (25%) classified as deteriorators.\n\nFormal interpretation: A mean change of 1.5 points is below the anchor-based MID of 3 points,\nmeaning that on average the group did not achieve a patient-meaningful improvement — even though\nthe arithmetic mean is positive. The mean and the responder rate tell complementary stories that\nneither tells alone: the mean of 1.5 masks that two patients improved by 5 points each while one\ndeteriorated by 6. 
In routine-care PRO data, missing-not-at-random missingness is the dominant\nanalytic threat: patients who are more symptomatic at month 6 are less likely to complete the\nquestionnaire. If the 1.5-point mean was computed only among completers, it is likely biased\nupward — the deteriorating patient (P003, change −6) might not have submitted if the context\nwere a real-world registry rather than a controlled example. Reporting completers-only estimates\nwithout a missingness model is therefore not interpretable as a population-level treatment effect.\n\nPractical interpretation: Pair the mean change with the completion funnel (number enrolled,\nnumber with baseline PRO, number with 6-month PRO) and the MNAR sensitivity analysis results.\nIf the delta-adjusted imputation moves the mean from 1.5 toward a smaller or negative number,\nthe missingness is informative enough to change the clinical conclusion. Report both the\nresponder rate and the deterioration rate: a 50% responder rate with a 25% deterioration rate\nconveys that the treatment's average effect hides substantial individual-level heterogeneity —\na finding relevant to patient communication and subgroup analyses.",
    "primary_category": "Outcome_Measure",
    "tags": [
      "patient-reported-outcomes",
      "epro",
      "real-world-data",
      "informative-missingness",
      "minimally-important-difference",
      "responder-analysis",
      "mmrm",
      "registry"
    ],
    "applies_to_study_types": [
      "cohort_retrospective",
      "disease_registry",
      "ehr_study",
      "linked_data"
    ],
    "data_sources": [
      "registry",
      "ehr",
      "claims",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1007/s11136-011-0054-x",
        "url": "https://doi.org/10.1007/s11136-011-0054-x",
        "citation_text": "Snyder CF, Aaronson NK, Choucair AK, et al. Implementing patient-reported outcomes assessment in clinical practice: a review of the options and considerations. Quality of Life Research. 2012;21(8):1305-1314.",
        "year": 2012,
        "authors_short": "Snyder et al.",
        "notes": "Canonical review of the operational choices (instrument, timing, mode, scoring, response handling) that distinguish PRO collection in routine real-world care from protocolized trial collection."
      },
      {
        "role": "explain",
        "doi": "10.1186/s41687-018-0061-6",
        "url": "https://doi.org/10.1186/s41687-018-0061-6",
        "citation_text": "Greenhalgh J, Gooding K, Gibbons E, et al. How do patient reported outcome measures (PROMs) support clinician-patient communication and patient care? A realist synthesis. Journal of Patient-Reported Outcomes. 2018;2:42.",
        "year": 2018,
        "authors_short": "Greenhalgh et al.",
        "notes": "Realist synthesis explaining the mechanisms and contexts that govern whether routinely collected PROs actually function as intended in real-world care — relevant to interpreting completion and selection patterns."
      },
      {
        "role": "demonstrate",
        "doi": "10.1200/JCO.2015.63.0830",
        "url": "https://doi.org/10.1200/JCO.2015.63.0830",
        "citation_text": "Basch E, Deal AM, Kris MG, et al. Symptom Monitoring With Patient-Reported Outcomes During Routine Cancer Treatment: A Randomized Controlled Trial. Journal of Clinical Oncology. 2016;34(6):557-565.",
        "year": 2016,
        "authors_short": "Basch et al.",
        "notes": "Although a randomized trial, its enduring contribution is the routine-care ePRO collection infrastructure and the symptom-monitoring workflow it pioneered, which became the template for real-world ePRO programs and their analytic handling."
      },
      {
        "role": "demonstrate",
        "doi": "10.1136/bmj.f167",
        "url": "https://doi.org/10.1136/bmj.f167",
        "citation_text": "Black N. Patient reported outcome measures could help transform healthcare. BMJ. 2013;346:f167.",
        "year": 2013,
        "authors_short": "Black",
        "notes": "Foundational argument and practical cautions for using routinely collected PROMs at scale in health systems, including case-mix and selection issues that bias naive real-world comparisons."
      },
      {
        "role": "use",
        "doi": "10.2147/PROM.S156279",
        "url": "https://doi.org/10.2147/PROM.S156279",
        "citation_text": "Mercieca-Bebber R, King MT, Calvert MJ, Stockler MR, Friedlander M. The importance of patient-reported outcomes in clinical trials and strategies for future optimization. Patient Related Outcome Measures. 2018;9:353-367.",
        "year": 2018,
        "authors_short": "Mercieca-Bebber et al.",
        "notes": "Synthesis of PRO missing-data and analytic-rigor issues; the missingness and interpretation problems it catalogs are amplified in unscheduled real-world collection."
      }
    ],
    "plain_language_summary": "A patient-reported outcome (PRO) is a score that comes directly from the patient — how much pain they feel, how well they can walk, how tired they are — rather than from a lab test or a doctor's chart note. In real-world research, these scores are collected outside a clinical trial: through a patient portal, a disease registry, or a tablet in a clinic waiting room, so the data is messier and more often missing than in a trial. The goal is to learn whether a treatment makes patients feel or function better by a meaningful margin, which researchers call the minimally important difference (MID) — the smallest change in score that a patient would actually notice.",
    "key_terms": [
      {
        "term": "patient-reported outcome (PRO)",
        "definition": "A measurement of a patient's health status — symptoms, physical function, or quality of life — reported by the patient themselves, not interpreted by a clinician."
      },
      {
        "term": "PRO instrument",
        "definition": "A validated questionnaire used to collect PRO data, such as the PROMIS Physical Function scale or the FACT-G quality-of-life scale; each has a defined score range and a known MID."
      },
      {
        "term": "minimally important difference (MID)",
        "definition": "The smallest change in a PRO score that patients would notice and consider meaningful; a change smaller than the MID may be statistically detectable but is not considered clinically important."
      },
      {
        "term": "responder",
        "definition": "A patient whose PRO score improved by at least the MID from baseline to follow-up, meaning the benefit they reported was large enough to matter."
      },
      {
        "term": "change from baseline",
        "definition": "The difference between a patient's PRO score at a follow-up visit and their score before treatment started; on a scale where higher is better, positive means improvement and negative means worsening (the signs reverse on symptom scales where higher is worse)."
      }
    ],
    "worked_example": {
      "scenario": "Four patients with a chronic inflammatory joint disease start a new oral treatment. Before treatment (baseline) and six months later, each patient completes the PROMIS Physical Function questionnaire, which reports physical ability as a T-score (reference-population mean 50, SD 10; higher = better). The MID for this scale is 3 points: a change of at least +3 counts as meaningful improvement, and a change of -3 or worse counts as meaningful deterioration. We want to know the mean change in the group and how many patients were responders.",
      "dataset": {
        "caption": "PROMIS Physical Function T-scores for four patients at baseline and 6 months, shown in wide form (one row per patient). A registry or EHR export would typically hold one row per patient per time point.",
        "columns": [
          "patient_id",
          "baseline_score",
          "month6_score",
          "change_from_baseline"
        ],
        "rows": [
          [
            "P001",
            42,
            47,
            5
          ],
          [
            "P002",
            38,
            40,
            2
          ],
          [
            "P003",
            51,
            45,
            -6
          ],
          [
            "P004",
            36,
            41,
            5
          ]
        ]
      },
      "steps": [
        "Read the PRO scores: each row is one patient. The baseline_score is what the patient reported before starting treatment; the month6_score is what they reported six months later.",
        "Compute change_from_baseline for each patient by subtracting baseline from month6: P001 = 47 - 42 = +5, P002 = 40 - 38 = +2, P003 = 45 - 51 = -6, P004 = 41 - 36 = +5.",
        "Apply the MID rule: a change >= +3 means the patient is a responder (meaningful improvement); a change <= -3 means meaningful deterioration. P001 (+5) and P004 (+5) are responders. P002 (+2) improved but not enough to cross the MID — that is called stable. P003 (-6) deteriorated.",
        "Compute the mean change for the group: sum the four changes and divide by 4. Sum = 5 + 2 + (-6) + 5 = 6. Mean = 6 / 4 = 1.5 points.",
        "Check the mean against the MID: 1.5 is below the MID of 3, so on average the group did not improve by a clinically meaningful amount, even though the arithmetic average is positive.",
        "Summarize the responder analysis: 2 out of 4 patients (50%) were responders; 1 out of 4 (25%) deteriorated; 1 out of 4 (25%) was stable. Reporting these categories alongside the mean is important because the mean can hide that some patients got much better while others got worse."
      ],
      "result": "Mean change from baseline = 6 / 4 = 1.5 points. This is below the MID of 3 points, so the average group change is not clinically meaningful. Responder rate = 2 / 4 = 50% (patients P001 and P004). Deterioration rate = 1 / 4 = 25% (patient P003)."
    },
    "prerequisites": [
      "pro-development",
      "pro-validation",
      "hrqol"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Scheduled-landmark windowing with collapse and cap",
        "description": "Map irregular real-world submissions onto pre-specified landmarks (e.g., months 3/6/9/12) using a fixed completion window (landmark ±14 days), a collapse rule for multiple in-window submissions (closest-to-landmark or first-in-window), and a one-record-per-person-window cap.",
        "edge_cases": [
          "A patient with no submission in a landmark window is missing for that landmark even if they responded the week before or after — widening the window trades comparability for completeness.",
          "Unscheduled \"bad-day\" submissions oversample symptomatic episodes if not excluded by the collapse rule."
        ],
        "data_source_notes": "ePRO/portal: multiple submissions per window are common and must be de-duplicated; registry: fixed visit schedules make windowing cleaner but later-visit dropout is informative."
      },
      {
        "name": "MID-anchored responder / deterioration definition",
        "description": "Classify each patient as improved, stable, or deteriorated by comparing change-from-baseline to the instrument's minimally important difference (e.g., FACT-G ≥5, PROMIS T-score ≥3), reporting improvement and deterioration thresholds symmetrically.",
        "edge_cases": [
          "A responder rate computed only among completers conditions on survival-to-response and overstates benefit; impute or report alongside the completion funnel.",
          "Distribution-based MIDs (0.5 SD) and anchor-based MIDs can disagree; pre-specify which and cite its source."
        ],
        "data_source_notes": "Requires a baseline PRO in-window; patients with no baseline cannot enter a change-based responder analysis and their exclusion is itself selective."
      },
      {
        "name": "Longitudinal model under MAR with MNAR sensitivity",
        "description": "Fit an MMRM or mixed-effects model on change-from-baseline across landmarks under missing-at-random, then stress the conclusion with delta-adjusted (pattern-mixture) multiple imputation that penalizes dropouts' imputed scores.",
        "edge_cases": [
          "MAR conditional on observed covariates can still be violated when dropout is driven by the unobserved current score; the delta sensitivity is what defends the primary result.",
          "Sparse later landmarks make an unstructured covariance unstable; consider a more parsimonious structure such as Toeplitz."
        ],
        "data_source_notes": "linked claims: use claims covariates and confirmed events/death to characterize non-responders and to justify (or refute) the MAR assumption."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Trial-collected PROs",
        "pros_of_this": "Longer follow-up, larger and more representative populations, capture of trial-excluded subgroups, and durability/HTA-relevant evidence.",
        "cons_of_this": "Endogenous timing, low and self-selected completion, and informative dropout make unbiased treatment-effect estimation hard; missingness is the central threat, not a nuisance.",
        "when_to_prefer": "Questions about real-world durability, symptom burden, or value/HTA evidence rather than an unbiased head-to-head PRO treatment effect."
      },
      {
        "compared_to": "Clinician-reported or claims-derived outcomes",
        "pros_of_this": "Captures the patient's lived experience (symptoms, function, HRQoL) that claims and chart review miss.",
        "cons_of_this": "Subjective, prone to response shift and recall, and far more missing than a claims-based event.",
        "when_to_prefer": "Symptom and quality-of-life endpoints, paired with claims-confirmed events so a missing PRO is not read as no event."
      },
      {
        "compared_to": "Complete-case / LOCF analysis of real-world PROs",
        "pros_of_this": "Model-based MMRM/mixed-effects under MAR with multiple imputation and MNAR sensitivity recovers far less biased trajectories.",
        "cons_of_this": "More assumptions and code; MAR remains an assumption that informative dropout can violate.",
        "when_to_prefer": "Always as the primary analysis when dropout is plausibly informative; complete-case only as a transparency check."
      }
    ],
    "implementation_notes_by_data_source": {
      "registry": "Scheduled visits make windowing clean, but completion declines with severity and over time, so later-visit missingness is informative; LOCF hides decline. Link to a mortality source and claims to classify why each assessment is missing before modeling.",
      "ehr": "PROMIS/portal capture is visit-driven, so a patient who leaves the system or is hospitalized elsewhere has no PRO; a \"stable\" trajectory can mask a claims-confirmed event between completions. Define observation windows explicitly.",
      "claims": "Claims rarely contain PROs themselves but supply the time-zero anchor, continuous-enrollment observability, and confirmed events/death used to characterize non-responders and test MAR. Exclude MA-only person-time lacking FFS claims.",
      "linked": "Linked claims + registry/ePRO is the ideal substrate (anchor + completeness + non-responder characterization), but introduces linkage selection and PRO-timestamp vs claim-service-date reconciliation before window assignment."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\nimport numpy as np\nimport statsmodels.formula.api as smf\n\nLANDMARKS = {3: 90, 6: 182, 9: 274, 12: 365}   # month -> days from index\nWINDOW = pd.Timedelta(days=14)                  # +/- completion window around each landmark\nBASE_LO, BASE_HI = pd.Timedelta(days=30), pd.Timedelta(days=7)   # baseline window [-30, +7]\nMID = 5.0                                        # FACT-G anchor-based minimally important difference\n\ndef build_pro_panel(pro: pd.DataFrame, index: pd.DataFrame, enroll: pd.DataFrame) -> pd.DataFrame:\n    df = pro.merge(index[[\"person_id\", \"index_date\"]], on=\"person_id\")\n\n    # Observability: continuous, FFS-observable enrollment around the index (no MA-only person-time).\n    e = enroll.merge(index[[\"person_id\", \"index_date\"]], on=\"person_id\")\n    e[\"ok\"] = ((e[\"enroll_start\"] <= e[\"index_date\"] - BASE_LO) &\n               (e[\"enroll_end\"]   >= e[\"index_date\"] + pd.Timedelta(days=365)) &\n               (~e[\"ma_only\"]))\n    keep = e.loc[e[\"ok\"], \"person_id\"].unique()\n    df = df[df[\"person_id\"].isin(keep)].copy()\n    df[\"offset\"] = (df[\"submit_date\"] - df[\"index_date\"]).dt.days\n\n    # Baseline: submission in [-30, +7]; if several, keep the one closest to index (collapse rule).\n    base = df[(df[\"submit_date\"] >= df[\"index_date\"] - BASE_LO) &\n              (df[\"submit_date\"] <= df[\"index_date\"] + BASE_HI)].copy()\n    base[\"dist\"] = base[\"offset\"].abs()\n    base = (base.sort_values([\"person_id\", \"dist\"])\n                .groupby(\"person_id\").first()\n                .reset_index()[[\"person_id\", \"total_score\"]]\n                .rename(columns={\"total_score\": \"baseline_score\"}))\n\n    # Map each submission onto the nearest landmark whose +/-14d window it falls in (one record per person-window).\n    rows = []\n    for m, d in LANDMARKS.items():\n        w = df[(df[\"submit_date\"] >= df[\"index_date\"] + pd.Timedelta(days=d) - 
WINDOW) &\n               (df[\"submit_date\"] <= df[\"index_date\"] + pd.Timedelta(days=d) + WINDOW)].copy()\n        w[\"dist\"] = (w[\"offset\"] - d).abs()\n        w = w.sort_values([\"person_id\", \"dist\"]).groupby(\"person_id\").first().reset_index()\n        w[\"landmark\"] = m\n        rows.append(w[[\"person_id\", \"landmark\", \"total_score\"]])\n    panel = pd.concat(rows, ignore_index=True).merge(base, on=\"person_id\")\n    panel[\"change\"] = panel[\"total_score\"] - panel[\"baseline_score\"]\n\n    # MID-anchored responder/deterioration at each landmark (symmetric thresholds).\n    panel[\"responder\"] = (panel[\"change\"] >= MID).astype(int)\n    panel[\"deteriorated\"] = (panel[\"change\"] <= -MID).astype(int)\n    return panel\n\ndef trajectory_model(panel: pd.DataFrame):\n    # Mixed model on change-from-baseline over landmarks (MAR; complete-case as a transparency check only).\n    return smf.mixedlm(\"change ~ C(landmark)\", panel, groups=panel[\"person_id\"]).fit(reml=True)",
        "description": "Build a real-world PRO analytic dataset and a longitudinal model. Required inputs (already cleaned/de-duplicated):\n  pro    : ePRO submissions   -> person_id, submit_date (datetime), instrument, total_score\n  index  : treatment anchor   -> person_id, index_date (datetime; from claims, e.g. first oncolytic fill_date)\n  enroll : enrollment spans   -> person_id, enroll_start, enroll_end, ma_only (bool)   # ma_only lacks FFS claims\nProduces one row per (person_id, landmark) within a +/-14-day completion window, a MID-anchored responder flag, and a\nstatsmodels MixedLM trajectory. NOTE: MixedLM is a random-effects mixed model; a true MMRM (unstructured covariance,\nno random intercept) is better fit with R's mmrm or SAS PROC MIXED below. Complete-case here is a transparency check\nonly -- the responder rate and trajectory must be reported under multiple imputation when dropout is informative.",
        "dependencies": [
          "pandas",
          "numpy",
          "statsmodels"
        ],
        "source_citations": [
          "snyder-2012"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\nlibrary(mmrm)\n\nLANDMARKS <- c(`3` = 90, `6` = 182, `9` = 274, `12` = 365)  # month -> days from index\nWINDOW <- 14L; BASE_LO <- 30L; BASE_HI <- 7L; MID <- 5\n\nbuild_pro_panel <- function(pro, index, enroll) {\n  setDT(pro); setDT(index); setDT(enroll)\n  df <- merge(pro, index[, .(person_id, index_date)], by = \"person_id\")\n\n  e  <- merge(enroll, index[, .(person_id, index_date)], by = \"person_id\")\n  keep <- e[enroll_start <= index_date - BASE_LO &\n            enroll_end   >= index_date + 365 & !ma_only, unique(person_id)]\n  df <- df[person_id %chin% keep]\n  df[, offset := as.integer(submit_date - index_date)]\n\n  base <- df[submit_date >= index_date - BASE_LO & submit_date <= index_date + BASE_HI]\n  base <- base[order(person_id, abs(offset))][, .SD[1L], by = person_id][\n               , .(person_id, baseline_score = total_score)]\n\n  pick <- rbindlist(lapply(names(LANDMARKS), function(m) {\n    d <- LANDMARKS[[m]]\n    w <- df[submit_date >= index_date + d - WINDOW & submit_date <= index_date + d + WINDOW]\n    w <- w[order(person_id, abs(offset - d))][, .SD[1L], by = person_id]\n    w[, .(person_id, landmark = as.integer(m), total_score)]\n  }))\n  panel <- merge(pick, base, by = \"person_id\")\n  panel[, change := total_score - baseline_score]\n  panel[, responder := as.integer(change >= MID)]\n  panel[, deteriorated := as.integer(change <= -MID)]\n  panel[, landmark := factor(landmark)]\n  panel[]\n}\n\n# True MMRM: unstructured covariance across landmarks within person, MAR.\nfit_mmrm <- function(panel) {\n  panel <- as.data.frame(panel)\n  mmrm(change ~ landmark + us(landmark | person_id), data = panel)\n}",
        "description": "Real-world PRO panel construction (data.table) and a true MMRM trajectory (mmrm package). Inputs mirror Python:\n  pro    : person_id, submit_date (Date), instrument, total_score\n  index  : person_id, index_date (Date)            # claims-derived treatment anchor\n  enroll : person_id, enroll_start, enroll_end, ma_only (logical)\nThe mmrm fit uses an unstructured covariance over landmarks under MAR -- the regulatory-default longitudinal PRO model;\npair it with delta-adjusted multiple imputation for the MNAR sensitivity analysis (not shown).",
        "dependencies": [
          "data.table",
          "mmrm"
        ],
        "source_citations": [
          "snyder-2012"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "%let window = 14; %let base_lo = 30; %let base_hi = 7; %let mid = 5;\n\n/* Observable cohort: continuous FFS enrollment around index, no MA-only person-time. */\nproc sql;\n  create table cohort as\n  select distinct i.person_id, i.index_date\n  from work.index i\n  where exists (select 1 from work.enroll e\n                where e.person_id = i.person_id and e.ma_only = 0\n                  and e.enroll_start <= i.index_date - &base_lo\n                  and e.enroll_end   >= i.index_date + 365);\nquit;\n\n/* Baseline PRO in [-30,+7]; keep the submission closest to index (collapse rule). */\nproc sql;\n  create table base as\n  select p.person_id, p.total_score as baseline_score\n  from work.pro p inner join cohort c on p.person_id = c.person_id\n  where p.submit_date between c.index_date - &base_lo and c.index_date + &base_hi\n  group by p.person_id\n  having abs(p.submit_date - c.index_date) = min(abs(p.submit_date - c.index_date));\nquit;\n\n/* Map submissions to landmarks (months 3/6/9/12) within +/-14d, one record per person-window. */\n%macro landmark(m, d);\n  proc sql;\n    create table lm_&m as\n    select p.person_id, &m as landmark, p.total_score\n    from work.pro p inner join cohort c on p.person_id = c.person_id\n    where p.submit_date between c.index_date + &d - &window and c.index_date + &d + &window\n    group by p.person_id\n    having abs(p.submit_date - (c.index_date + &d)) =\n           min(abs(p.submit_date - (c.index_date + &d)));\n  quit;\n%mend;\n%landmark(3,90); %landmark(6,182); %landmark(9,274); %landmark(12,365);\n\ndata panel;\n  set lm_3 lm_6 lm_9 lm_12;\nrun;\nproc sql;\n  create table panel as\n  select a.person_id, a.landmark, a.total_score, b.baseline_score,\n         a.total_score - b.baseline_score as change\n  from panel a inner join base b on a.person_id = b.person_id;\nquit;\n\n/* MID-anchored 6-month responder rate (report under MI, not LOCF). 
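*/\ndata resp; set panel; if landmark = 6; responder = (change >= &mid); run;\nproc freq data=resp; tables responder / binomial(level='1'); run;\n\n/* MMRM trajectory: unstructured within-person covariance over landmarks, MAR. */\nproc mixed data=panel method=reml;\n  class person_id landmark;\n  model change = landmark / solution ddfm=kr;\n  repeated landmark / subject=person_id type=un;\n  lsmeans landmark / cl;\nrun;\n\n/* Informative-missingness sensitivity: multiply impute, analyze, and pool (delta-adjust for MNAR offline). */\n/* Expand to the full person x landmark grid so uncompleted windows appear as missing change;\n   without this step the panel holds only observed rows and PROC MI has nothing to impute. */\nproc sql;\n  create table panel_full as\n  select g.person_id, g.landmark, g.baseline_score, p.change\n  from (select distinct b.person_id, b.baseline_score, l.landmark\n        from base b, (select distinct landmark from panel) l) g\n  left join panel p on g.person_id = p.person_id and g.landmark = p.landmark\n  order by g.person_id, g.landmark;\nquit;\n/* CLASS variables must appear in VAR and need FCS (or MONOTONE) rather than the default MCMC. */\nproc mi data=panel_full out=imp nimpute=20 seed=20240601;\n  class landmark;\n  fcs reg(change);\n  var landmark baseline_score change;\nrun;\nproc mixed data=imp method=reml; by _imputation_;\n  class person_id landmark;\n  model change = landmark / solution; repeated landmark / subject=person_id type=un;\n  ods output solutionf=mixparms;\nrun;\n/* CLASSVAR=FULL reads the class-level column that PROC MIXED writes to SolutionF. */\nproc mianalyze parms(classvar=full)=mixparms; class landmark; modeleffects Intercept landmark; run;",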
        "description": "Real-world PRO panel construction (PROC SQL), MMRM trajectory (PROC MIXED), responder rates (PROC FREQ), and the\ninformative-missingness sensitivity via multiple imputation (PROC MI / PROC MIANALYZE). Required input datasets:\n  work.pro    : person_id, submit_date, instrument, total_score\n  work.index  : person_id, index_date                      (claims-derived treatment anchor)\n  work.enroll : person_id, enroll_start, enroll_end, ma_only (0/1)\nPROC MIXED with REPEATED / TYPE=UN and no RANDOM statement is the canonical MMRM; report the responder rate under MI,\nnot LOCF, when dropout is informative.",
        "dependencies": [],
        "source_citations": [
          "snyder-2012"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Src[Enrolled / treated population] --> Anchor[Time zero from claims<br/>e.g. first oncolytic fill_date]\n  Anchor --> Cap[\"ePRO / portal / registry submissions<br/>irregular, voluntary, often missing\"]\n  Cap --> Win[\"Completion windows<br/>baseline [-30,+7]; landmarks +/-14d\"]\n  Win --> Collapse[\"Collapse multiple in-window submissions<br/>closest-to-landmark; cap 1 per person-window\"]\n  Collapse --> Panel[\"Analytic panel: change-from-baseline<br/>+ MID-anchored responder/deterioration flags\"]\n  Panel --> Miss[\"Classify each missing assessment via linked claims + mortality<br/>death / hospitalization / disenroll / unexplained\"]\n  Miss --> Model[\"Primary: MMRM under MAR + multiple imputation\"]\n  Model --> Sens[\"Sensitivity: delta-adjusted MNAR imputation<br/>+ completion funnel by demographic stratum\"]",
        "caption": "Data flow from irregular real-world PRO capture to a defensible longitudinal analytic dataset. Claims supply the time-zero anchor and let missing assessments be classified rather than ignored.",
        "alt_text": "Flowchart from enrolled population through a claims-derived time zero, irregular PRO capture, completion windowing and collapse, an analytic change-from-baseline panel with MID responder flags, missingness classification via linked claims, an MMRM primary analysis, and MNAR sensitivity with a completion funnel.",
        "source_type": "illustrative",
        "source_citations": [
          "snyder-2012"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Q{What is the dominant missingness pattern?} -->|Low, plausibly MAR<br/>given observed covariates| M1[MMRM / mixed-effects<br/>under MAR + multiple imputation]\n  Q -->|Dropout driven by current unobserved score<br/>informative MNAR| M2[Pattern-mixture / delta-adjusted MI<br/>or selection model]\n  Q -->|Dropout tied to a terminal event<br/>death / progression| M3[Joint longitudinal-survival model<br/>or composite responder w/ death = non-responder]\n  M1 --> Never[Do NOT use LOCF or complete-case as primary<br/>when dropout is informative]\n  M2 --> Never\n  M3 --> Never",
        "caption": "Decision logic for choosing a longitudinal PRO model from the real-world missingness pattern. LOCF and complete-case are never the primary analysis when dropout is informative.",
        "alt_text": "Decision tree mapping the missingness pattern to MMRM under MAR, pattern-mixture or delta-adjusted multiple imputation under MNAR, or a joint longitudinal-survival model when dropout is tied to death or progression, with a shared rule against LOCF and complete-case as the primary analysis.",
        "source_type": "illustrative",
        "source_citations": [
          "snyder-2012"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "see_also",
        "target_slug": "pro-development",
        "notes": "Instrument creation (item generation, content validity) precedes real-world use; this entry assumes a developed, fielded instrument and focuses on its operational capture and analysis."
      },
      {
        "relation_type": "see_also",
        "target_slug": "pro-validation",
        "notes": "Psychometric validation establishes reliability, validity, and the MID that the real-world responder definition relies on; without a validated MID the responder analysis is arbitrary."
      },
      {
        "relation_type": "see_also",
        "target_slug": "hrqol",
        "notes": "HRQoL is a major PRO domain; real-world HRQoL trajectories are a common application of this measurement framework."
      },
      {
        "relation_type": "used_with",
        "target_slug": "mmrm-repeated-measures-rwe",
        "notes": "MMRM (unstructured covariance, MAR) is the regulatory-default longitudinal model for PRO change-from-baseline trajectories built by this framework."
      },
      {
        "relation_type": "used_with",
        "target_slug": "multiple-imputation-longitudinal-rwe",
        "notes": "Multiple imputation (with delta-adjusted MNAR sensitivity) is the standard handling for the informative missingness that dominates real-world PRO collection."
      },
      {
        "relation_type": "used_with",
        "target_slug": "missing-data-pattern-table-rwe",
        "notes": "An expected-vs-observed completion table per landmark and stratum exposes the informative-missingness and selection structure before any modeling."
      },
      {
        "relation_type": "used_with",
        "target_slug": "endpoint-adjudication-chart-review-rwe",
        "notes": "Pairing PROs with claims-confirmed and adjudicated clinical events prevents reading a missing PRO as the absence of an event."
      },
      {
        "relation_type": "used_with",
        "target_slug": "qaly-utility-mapping-rwe",
        "notes": "Real-world PRO/HRQoL scores are mapped to utilities to supply QALY inputs for cost-utility models in HTA submissions."
      },
      {
        "relation_type": "see_also",
        "target_slug": "surrogate-endpoint-validation-rwe",
        "notes": "When a PRO is proposed as a surrogate for a clinical endpoint, its real-world trajectory must meet surrogate-validation standards, not merely correlate."
      }
    ],
    "aliases": [
      "patient-reported outcomes in real-world data",
      "real-world PROs",
      "PROM in routine care",
      "ePRO in real-world settings",
      "real-world patient-reported outcomes"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "pro-validation",
    "name": "PRO Instrument Validation",
    "short_definition": "The process of accumulating quantitative evidence that a patient-reported outcome instrument's scores are reliable, valid, and responsive for a specified concept, population, and context of use.",
    "long_description": "**PRO instrument validation** is the empirical demonstration that the scores produced by a patient-reported\noutcome (PRO) measure carry the properties an analyst assumes when they enter those scores into a study: that\nrepeated administration to stable patients gives the same answer (**reliability**), that the score reflects the\nintended concept and nothing else (**validity**), and that it moves when — and only when — true health changes\n(**responsiveness**). Validation is never a property of the instrument in the abstract; it is always validation\n*of a specific score, in a specific population, for a specific context of use*. A measure validated for symptom\nseverity in moderate-to-severe rheumatoid arthritis is not thereby validated for the same disease in children,\nfor a different language/culture, or as a treatment-benefit endpoint in a regulatory submission.\n\n**Core conceptual distinction.** Validation is the *measurement-properties* arm of the PRO lifecycle and must be\nkept distinct from two neighbors. (1) *Validation vs development/content validity*: instrument **development**\n(concept elicitation, item generation, cognitive debriefing) establishes that items are relevant, comprehensive,\nand comprehensible to patients — qualitative **content validity**, the FDA's threshold requirement. Validation\nthen tests the *quantitative* properties of the resulting scores. Skipping development and \"validating\" a\nborrowed instrument cannot rescue an instrument that measures the wrong concept. (2) *PRO validation vs claims\nalgorithm validation (PPV/sensitivity)*: a PRO has no error-free gold standard, so validity is assessed against\na nomological net of construct relationships (convergent, divergent, known-groups), not against a reference-\npositive label. Treating a PRO like a binary phenotype — computing \"sensitivity vs a chart\" — is a category\nerror. 
The estimand of a validation study is a measurement-property coefficient (an ICC, a correlation, a\nstandardized mean difference between known groups, an effect size for change), not a treatment effect.\n\n**Pros, cons, and trade-offs** (specific & comparative, naming the alternatives).\n- **vs assuming a published instrument is \"already validated\":** Running de novo validation in *your* population\n  and mode of administration protects against the single most common reviewer objection — that prior validation\n  does not transfer to your patients, language, or electronic (ePRO) port. Cost: time, sample, and the risk of\n  discovering the instrument is not fit-for-purpose late. **Prefer de novo (or migration) validation** whenever\n  population, culture, recall period, or administration mode differs from the original validation.\n- **vs classical test theory (CTT) only:** CTT (Cronbach's alpha, test-retest ICC, factor analysis) is simple,\n  transparent, and what most regulators expect. **Item response theory (IRT)/Rasch** adds item-level fit,\n  differential item functioning (DIF) detection, and interval-level scoring, and is required for computerized\n  adaptive testing (e.g., PROMIS). Cost: larger samples, stronger assumptions, harder communication.\n  **Prefer CTT** for a straightforward fixed-form endpoint; **add IRT/Rasch** when building item banks,\n  screening for DIF across subgroups, or claiming interval scaling.\n- **vs distribution-based minimal important difference (MID) only:** Distribution-based MIDs (e.g., 0.5 SD, 1\n  SEM) are easy but anchor-free and can mistake noise for meaning. **Anchor-based MIDs** tie change to a\n  patient-meaningful external anchor (e.g., a Patient Global Impression of Change). **Prefer anchor-based**\n  estimates triangulated with distribution-based bounds; never report a single MID as if it were a fixed\n  constant.\n\n**When to use** (clear decision rules). 
Before a PRO score is used as a primary/secondary endpoint in a trial or RWE study; when\nporting a paper instrument to ePRO/mobile (mode-equivalence/measurement invariance testing); when translating/\nculturally adapting an instrument; when applying an instrument to a new disease, severity stratum, or age group;\nand when establishing a responder definition or MID that downstream analyses (responder analysis, cost-utility\nmapping to utilities) will depend on.\n\n**When NOT to use — and when it is actively misleading or dangerous** (clear decision rules).\n- **As a substitute for content validity.** Strong reliability and a tidy factor structure on an instrument that\n  omits the symptoms patients care about produce a precise measure of the wrong thing — *construct\n  underrepresentation*. No psychometric coefficient repairs missing content.\n- **High internal consistency read as evidence of unidimensionality.** Cronbach's alpha inflates with item count\n  and with redundant items; alpha > 0.95 often signals item redundancy, not a clean scale. Use it for reliability\n  of a confirmed unidimensional scale only — confirm dimensionality with factor analysis/Rasch first.\n- **Validating responsiveness in a population that does not change.** Responsiveness and MID estimated in a\n  stable cohort, or with a poorly correlated anchor (anchor–change r < ~0.3), yield meaningless thresholds that\n  will misclassify responders in the trial.\n- **Ignoring measurement invariance before comparing groups or modes.** Reporting a treatment difference (or an\n  ePRO-vs-paper equivalence) without testing DIF/invariance can manufacture or mask differences that are pure\n  measurement artifact.\n- **Reusing a validated paper score as if the electronic version inherits its properties.** Mode changes recall\n  framing, response options, and missingness patterns; equivalence must be demonstrated, not assumed.\n\n**Data-source operational depth** (claims vs EHR vs registry vs linked). 
PRO validation depends on *primary collected* data (the questionnaire), but in\nRWE the instrument is fielded inside, or linked to, secondary databases, and each substrate has distinct failure\nmodes.\n- **Trial/registry primary PRO collection (the usual validation substrate):** You control the schedule, so a\n  clean **test–retest** window is feasible — but the retest interval must be long enough to break memory recall\n  yet short enough that true health is stable (commonly 2–14 days, justified by the concept's expected\n  stability). Failure mode: scheduling retest at a visit where treatment was just started contaminates the\n  \"stable\" assumption and deflates the ICC. Workaround: anchor retest eligibility to a patient-reported global\n  \"no change\" item and analyze the stable subgroup.\n- **ePRO / mobile app capture:** Enables time-stamped completion and skip-logic enforcement, but introduces\n  mode-equivalence questions and **informative missingness** — sicker patients stop responding, so missing PRO\n  is not missing-at-random. Failure mode: treating app drop-off as ignorable biases responsiveness and MID.\n  Workaround: model completion as a function of disease severity, report completion by arm and by severity\n  stratum, and pre-specify sensitivity analyses (e.g., pattern-mixture) rather than complete-case only.\n- **EHR-embedded PRO (e.g., PROMIS in the patient portal):** Capture is *visit- and portal-driven*, so the\n  measured sample is selected toward engaged, health-literate, often healthier patients. Failure mode: known-\n  groups validity computed on a portal-only sample understates the sick tail and inflates apparent\n  discrimination. 
Workaround: characterize the captured vs source population and weight or restrict the\n  validation claim to the represented stratum.\n- **Claims-linked PRO:** Claims contribute the *external anchors and known groups* (disease severity proxies,\n  hospitalization, prior therapy line) used to test convergent/known-groups validity — but only over\n  **continuously enrolled, fee-for-service-observable** person-time. Failure mode: Medicare Advantage (MA)-only\n  person-time lacks FFS claims, so a \"low-utilization / mild\" known-group is partly an artifact of unobserved\n  encounters, not true mild disease — biasing the known-groups contrast toward the null. Workaround: require\n  continuous Parts A/B/D (or commercial medical+pharmacy) coverage across the anchor window and exclude MA-only\n  spans before forming severity groups.\n\n**Worked example (claims-linked known-groups + reliability validation).** Goal: validate a 10-item disease-\nspecific symptom PRO (score 0–100, higher = worse) collected in a disease registry, for use as an RWE endpoint.\n(1) Eligibility/observable time: keep registry patients with `≥365` days of continuous medical+pharmacy\nenrollment before the baseline questionnaire `index_date`, dropping MA-only person-time so the claims-derived\nanchors are real, not missingness. (2) **Test–retest reliability:** identify patients who completed the\ninstrument twice 7–14 days apart *and* reported \"no change\" on a global stability item; compute the two-way\nrandom-effects, absolute-agreement **ICC(2,1)** and the standard error of measurement (SEM = SD·√(1−ICC)).\nTarget ICC ≥ 0.70. (3) **Internal consistency:** on the baseline administration, confirm unidimensionality\n(one dominant factor) before computing Cronbach's alpha; report alpha with its CI and flag items whose removal\nraises alpha (redundancy/misfit). 
(4) **Known-groups (construct) validity:** define severity groups from\n*claims over the baseline window*, e.g., \"severe\" = `≥1` disease-related inpatient stay (`dx` on an IP claim)\nor `≥2` distinct advanced therapies (`fill_date`/`days_supply`-derived lines) within 365 days; \"mild\" = none.\nTest that mean PRO differs across groups in the hypothesized direction with a standardized effect size\n(Cohen's d ≥ 0.5 supports discrimination). (5) **Convergent/divergent validity:** correlate the PRO with a\nrelated legacy instrument (expect r ≈ 0.4–0.7) and with an unrelated domain (expect |r| < 0.3). (6)\n**Responsiveness + anchor-based MID:** among patients with a follow-up questionnaire, regress PRO change on a\npatient-reported anchor of change; estimate the MID as mean change in the \"minimally improved\" anchor group and\ntriangulate against 0.5·SD and 1·SEM. (7) Report every coefficient with a confidence interval and state the\nexact population, mode, and context of use the validation supports; that scope statement *is* the deliverable.\n\n**Interpreting the output**\n\nConsider the worked example applied to a fatigue questionnaire for rheumatoid arthritis, which passes all\nseven psychometric checks: ICC(2,1) = 0.82 (test-retest reliability), Cronbach's alpha = 0.85\n(internal consistency), convergent r = 0.63, divergent r = 0.09, Cohen's d = 0.72 (known-groups\nvalidity), SRM = 0.68 (responsiveness), and anchor-based MID = 8.3 points.\n\nFormal interpretation: ICC = 0.82 means 82% of the total score variance reflects true between-\npatient differences rather than measurement error, so the questionnaire reproduces scores reliably\nwhen health is stable. Alpha = 0.85 indicates the 10 items form an internally consistent scale,\nthough alpha above 0.95 would flag item redundancy. 
Each coefficient is conditional on the sample,\nadministration mode, and time interval used in this study: ICC estimated from a 10-day retest in\nstable patients does not guarantee the same reliability over 3 months or with electronic\nadministration. Convergent r = 0.63 with an established fatigue measure confirms the instrument\ncaptures fatigue; it does not establish that the two instruments are interchangeable or that one\nscore can be converted to the other for pooling across studies. The MID of 8.3 points is an\nanchor-based estimate specific to the \"minimally improved\" group in this population — triangulated\nagainst 0.5·SD and 1·SEM, neither of which is a gold standard. A different anchor question or\na different patient population would yield a different MID.\n\nPractical interpretation: A validation report saying \"all checks passed\" is only as useful as\nthe scope statement attached to it. These results support using this questionnaire as a study\nendpoint in adult RA patients administered in the same mode and clinical context; they do not\nvalidate the instrument for pediatric patients, a different disease, or an electronic app format.\nBefore deploying the MID of 8.3 in a responder analysis for a real-world study, confirm the\nstudy population and administration mode match the validation sample — if they do not, a new\nMID estimation is warranted.",
    "primary_category": "Outcome_Measure",
    "tags": [
      "patient-reported-outcomes",
      "measurement_properties",
      "reliability",
      "validity",
      "responsiveness",
      "psychometrics",
      "cosmin",
      "minimal-important-difference"
    ],
    "applies_to_study_types": [
      "pro_validation"
    ],
    "data_sources": [
      "primary",
      "registry",
      "ehr",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1007/s11136-010-9606-8",
        "url": "https://doi.org/10.1007/s11136-010-9606-8",
        "citation_text": "Mokkink LB, Terwee CB, Patrick DL, et al. The COSMIN checklist for assessing the methodological quality of studies on measurement properties of health status measurement instruments: an international Delphi study. Quality of Life Research. 2010;19(4):539-549.",
        "year": 2010,
        "authors_short": "Mokkink et al.",
        "notes": "Consensus taxonomy and definitions of the measurement properties (reliability, validity, responsiveness) that a PRO validation study must establish; the field's standard nomenclature."
      },
      {
        "role": "explain",
        "doi": "10.1007/s11136-017-1765-4",
        "url": "https://doi.org/10.1007/s11136-017-1765-4",
        "citation_text": "Mokkink LB, de Vet HCW, Prinsen CAC, et al. COSMIN Risk of Bias checklist for systematic reviews of Patient-Reported Outcome Measures. Quality of Life Research. 2018;27(5):1171-1179.",
        "year": 2018,
        "authors_short": "Mokkink et al.",
        "notes": "Operational standards for how each measurement property should be designed and analyzed, including structural validity, internal consistency, reliability, and responsiveness; the methodological checklist most reviewers map a validation study against."
      },
      {
        "role": "demonstrate",
        "doi": "10.1016/j.jval.2011.06.014",
        "url": "https://doi.org/10.1016/j.jval.2011.06.014",
        "citation_text": "Patrick DL, Burke LB, Gwaltney CJ, et al. Content validity—establishing and reporting the evidence in newly developed patient-reported outcomes (PRO) instruments for medical product evaluation: ISPOR PRO Good Research Practices Task Force report: part 1. Value in Health. 2011;14(8):967-977.",
        "year": 2011,
        "authors_short": "Patrick et al.",
        "notes": "Worked good-research-practice guidance demonstrating that content validity is the precondition for quantitative validation in medical-product (regulatory) contexts."
      },
      {
        "role": "use",
        "doi": null,
        "url": "https://www.fda.gov/media/77832/download",
        "citation_text": "US Food and Drug Administration. Patient-Reported Outcome Measures: Use in Medical Product Development to Support Labeling Claims. Guidance for Industry.",
        "year": 2009,
        "authors_short": "FDA",
        "notes": "Regulatory expectation that reliability, validity, and responsiveness be demonstrated in the target population and context of use before a PRO supports a labeling claim."
      }
    ],
    "plain_language_summary": "PRO instrument validation is the process of checking, with actual data, that a questionnaire measuring how patients feel really does what it claims. You test whether patients give the same answers when nothing has changed (reliability), whether the score tracks the right concept and not something unrelated (validity), and whether the score moves when a patient genuinely gets better or worse (responsiveness). A questionnaire that passes all three checks in your specific patient group is considered fit to use as a study endpoint; one that fails even one check can mislead a study's conclusions.",
    "key_terms": [
      {
        "term": "reliability",
        "definition": "A questionnaire is reliable if the same patient, in the same health state, gets roughly the same score on two separate occasions — the measure is consistent, not random."
      },
      {
        "term": "validity",
        "definition": "A questionnaire is valid if its score actually captures the health concept it is supposed to measure — for example, a pain scale should correlate with other pain measures, not with unrelated outcomes like height."
      },
      {
        "term": "responsiveness",
        "definition": "A questionnaire is responsive if its score changes meaningfully when a patient's health truly improves or worsens, and stays stable when health is stable."
      },
      {
        "term": "Cronbach's alpha",
        "definition": "A number between 0 and 1 that tells you how consistently the individual questions on a scale hang together — values around 0.70 to 0.90 are typically considered acceptable for a multi-item questionnaire."
      },
      {
        "term": "ICC (intraclass correlation coefficient)",
        "definition": "A number between 0 and 1 used to measure test-retest reliability — it compares a patient's score at two time points when their health has not changed; higher values (generally 0.70 or above) mean the scores are reproducible."
      },
      {
        "term": "minimal important difference (MID)",
        "definition": "The smallest change in score that a patient would actually notice and consider meaningful — used to decide whether a change detected in a study represents real benefit, not just measurement noise."
      }
    ],
    "worked_example": {
      "scenario": "A research team has developed a 10-item questionnaire measuring fatigue in adults with rheumatoid arthritis. Scores run from 0 (no fatigue) to 100 (worst fatigue). Before using this questionnaire as an endpoint in a real-world evidence study, they need to check its three core measurement properties in this population. They enroll 120 patients and collect questionnaire data at a baseline visit, a retest visit 10 days later (asking patients to report any health change between visits), and a follow-up visit 3 months later after some patients started a new therapy.",
      "dataset": {
        "caption": "Summary psychometric results from the validation study (n=120 at baseline; n=74 stable pairs for test-retest; n=98 with follow-up data). All numbers are illustrative but internally consistent.",
        "columns": [
          "check",
          "property_tested",
          "statistic",
          "value",
          "threshold",
          "pass"
        ],
        "rows": [
          [
            "Test-retest ICC(2,1)",
            "Reliability",
            "ICC",
            "0.82",
            ">=0.70",
            "Yes"
          ],
          [
            "Standard error of measurement (SEM)",
            "Reliability",
            "SEM = SD x sqrt(1 - ICC)",
            "4.1 points",
            "lower = better",
            "Yes"
          ],
          [
            "Internal consistency",
            "Reliability",
            "Cronbach's alpha",
            "0.85",
            "0.70-0.90",
            "Yes"
          ],
          [
            "Convergent validity",
            "Validity",
            "Pearson r with established fatigue scale",
            "0.63",
            "~0.40-0.70",
            "Yes"
          ],
          [
            "Divergent validity",
            "Validity",
            "Pearson r with blood pressure",
            "0.09",
            "<0.30",
            "Yes"
          ],
          [
            "Known-groups validity",
            "Validity",
            "Cohen's d (severe vs mild disease)",
            "0.72",
            ">=0.50",
            "Yes"
          ],
          [
            "Responsiveness (change)",
            "Responsiveness",
            "Standardized response mean (SRM)",
            "0.68",
            ">=0.50",
            "Yes"
          ],
          [
            "Anchor-based MID",
            "Responsiveness",
            "Mean change in 'minimally improved' group",
            "8.3 points",
            "triangulated",
            "Yes"
          ]
        ]
      },
      "steps": [
        "Reliability — test-retest: Among the 74 patients who came back 10 days later and reported no health change, compute their two scores as a pair. The ICC(2,1) of 0.82 means the questionnaire reproduces scores well when health is stable. The SEM of 4.1 points tells you that random measurement error is small relative to the 0-100 scale.",
        "Reliability — internal consistency: At baseline, check whether all 10 items are pulling in the same direction (one dominant factor from a factor analysis), then compute Cronbach's alpha. The value of 0.85 is in the acceptable range; a value above 0.95 would actually raise a flag for item redundancy.",
        "Validity — convergent: Correlate the new questionnaire score with a well-established fatigue measure collected at the same baseline visit. A correlation of r=0.63 confirms both instruments are measuring similar territory.",
        "Validity — divergent: Correlate the score with a clearly unrelated measure — here, resting blood pressure. The near-zero r=0.09 confirms the questionnaire is not just picking up some general health signal.",
        "Validity — known-groups: Split patients into clinically 'severe' (hospitalised for flare or on third-line therapy) vs 'mild' based on their medical records. The fatigue questionnaire should score higher in severe patients; Cohen's d=0.72 confirms a meaningful separation.",
        "Responsiveness: Among the 98 patients with follow-up data, compare score change in those whose physician-rated disease activity improved vs those who were stable. The standardized response mean of 0.68 shows the questionnaire detects real clinical change.",
        "Anchor-based MID: Among patients who reported they felt 'minimally better' on a separate global rating, the average score drop was 8.3 points. This is the MID — a change smaller than 8 points in a future study would not be considered patient-meaningful."
      ],
      "result": "All eight psychometric checks meet their pre-specified thresholds. The fatigue questionnaire is considered fit-for-purpose as a study endpoint in adults with rheumatoid arthritis in this population and mode of administration. The validation claim is scoped to this context: a different patient group, language, or electronic format would need its own evidence."
    },
    "prerequisites": [
      "pro-development",
      "pro-rwe",
      "hrqol"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Classical test theory (CTT) validation",
        "description": "Evaluate reliability (Cronbach's alpha, test-retest ICC, SEM), structural validity (exploratory/ confirmatory factor analysis), construct validity (convergent/divergent/known-groups correlations), and responsiveness with effect sizes. Default regulatory expectation for a fixed-form endpoint.",
        "edge_cases": [
          "Cronbach's alpha inflates with item count and item redundancy; alpha > 0.95 usually signals redundancy, not quality.",
          "Test-retest ICC requires a genuinely stable interval; contamination by early treatment effects deflates it."
        ],
        "data_source_notes": "primary/registry: schedule a 2-14 day retest gated on a global stability item; ePRO: time-stamps confirm interval but informative missingness must be modeled."
      },
      {
        "name": "Item response theory (IRT) / Rasch validation",
        "description": "Item-level analysis of fit, item characteristic curves, information functions, and differential item functioning (DIF) across subgroups; produces interval-level scores and underpins item banks and computerized adaptive testing (e.g., PROMIS).",
        "edge_cases": [
          "Requires substantially larger samples and well-targeted item difficulty across the trait range.",
          "Unidimensionality and local-independence assumptions must be checked before trusting the model."
        ],
        "data_source_notes": "linked claims provide external grouping variables (severity, line of therapy) to test DIF by clinically meaningful subgroups."
      },
      {
        "name": "Mode-equivalence / measurement-invariance validation",
        "description": "Demonstrate that an electronic (ePRO/app) or translated version yields equivalent scores to the original via measurement-invariance testing (configural/metric/scalar) or randomized equivalence/crossover.",
        "edge_cases": [
          "Mode changes response framing and missingness; equivalence cannot be assumed from paper validation.",
          "Cross-cultural adaptation requires both linguistic validation and quantitative invariance testing."
        ],
        "data_source_notes": "ePRO: completion timestamps and skip-logic logs support equivalence analysis but introduce differential drop-off by severity."
      },
      {
        "name": "Responsiveness and minimal important difference (MID) estimation",
        "description": "Quantify ability to detect change (standardized response mean, effect size) and anchor-based MID tied to a patient-meaningful external anchor, triangulated with distribution-based bounds (0.5 SD, 1 SEM).",
        "edge_cases": [
          "A weak anchor (anchor-change correlation < ~0.3) yields an uninterpretable MID.",
          "Distribution-based thresholds alone can mistake measurement noise for meaningful change."
        ],
        "data_source_notes": "registry/primary follow-up questionnaires supply the change scores; claims anchors (e.g., new hospitalization) can supplement patient-reported anchors of change."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Assuming a published instrument is already validated",
        "pros_of_this": "De novo validation in the actual population, language, and mode protects against the most common reviewer objection that prior validation does not transfer.",
        "cons_of_this": "Requires sample, time, and risks discovering non-fit-for-purpose late in the program.",
        "when_to_prefer": "Whenever population, culture, recall period, or administration mode differs from the original validation study."
      },
      {
        "compared_to": "Item response theory / Rasch validation",
        "pros_of_this": "Classical test theory is simpler, transparent, and aligned with standard regulatory expectations for a fixed-form endpoint.",
        "cons_of_this": "CTT gives ordinal (not interval) scores, cannot localize item-level misfit, and cannot detect differential item functioning the way IRT/Rasch can.",
        "when_to_prefer": "Straightforward fixed-form endpoints; switch to or add IRT/Rasch for item banks, CAT, or DIF screening across subgroups."
      },
      {
        "compared_to": "Claims/EHR outcome-algorithm validation (PPV, sensitivity, specificity)",
        "pros_of_this": "PRO validation correctly evaluates a construct against a nomological net, since no error-free gold standard exists for a subjective concept.",
        "cons_of_this": "No single accuracy number; evidence is a portfolio of coefficients that is harder to summarize for non-specialist stakeholders.",
        "when_to_prefer": "Any subjective patient-reported concept; use PPV/sensitivity validation only for binary algorithmic phenotypes with a chart/registry gold standard."
      }
    ],
    "implementation_notes_by_data_source": {
      "primary": "Questionnaire is primary-collected. Schedule a 2-14 day test-retest gated on a patient-reported global stability item so the ICC is not deflated by true change. Confirm unidimensionality before reporting alpha.",
      "registry": "Strongest substrate for adjudicated severity and follow-up change scores; link to claims for external known-groups and convergent-validity anchors.",
      "ehr": "Portal/visit-driven PRO capture selects toward engaged, healthier patients; characterize captured vs source population and scope the validity claim to the represented stratum.",
      "linked": "Claims-linked PRO supplies external anchors and known groups over continuously enrolled, FFS-observable person-time only; exclude Medicare Advantage-only spans before forming severity groups so low-utilization is not confused with mild disease."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\nimport pandas as pd\n\nITEMS = [f\"item_{i:02d}\" for i in range(1, 11)]\n\ndef icc_2_1(x: np.ndarray, y: np.ndarray) -> float:\n    # Two-way random-effects, absolute-agreement ICC for a single rating (test-retest).\n    n = len(x)\n    m = np.column_stack([x, y]); k = 2\n    grand = m.mean()\n    ms_rows = k * ((m.mean(axis=1) - grand) ** 2).sum() / (n - 1)            # between-subject\n    ms_cols = n * ((m.mean(axis=0) - grand) ** 2).sum() / (k - 1)            # between-occasion (systematic)\n    ss_tot = ((m - grand) ** 2).sum()\n    ms_err = (ss_tot - (n - 1) * ms_rows - (k - 1) * ms_cols) / ((n - 1) * (k - 1))\n    return (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n)\n\ndef cronbach_alpha(item_matrix: pd.DataFrame) -> float:\n    k = item_matrix.shape[1]\n    item_var = item_matrix.var(axis=0, ddof=1).sum()\n    total_var = item_matrix.sum(axis=1).var(ddof=1)\n    return (k / (k - 1)) * (1 - item_var / total_var)\n\ndef validate_pro(pro: pd.DataFrame, severity: pd.DataFrame) -> dict:\n    base = pro[pro[\"administration\"] == \"baseline\"]\n\n    # Test-retest on the stable subgroup only (true health unchanged between administrations).\n    rt = pro[pro[\"administration\"] == \"retest\"]\n    pair = (base.merge(rt, on=\"person_id\", suffixes=(\"_b\", \"_r\"))\n                .query(\"stable_global_r == 1\"))\n    icc = icc_2_1(pair[\"total_score_b\"].to_numpy(), pair[\"total_score_r\"].to_numpy())\n    sem = base[\"total_score\"].std(ddof=1) * np.sqrt(1 - icc)\n\n    alpha = cronbach_alpha(base[ITEMS])\n\n    # Known-groups validity: standardized mean difference (Cohen's d) severe vs mild.\n    g = base.merge(severity, on=\"person_id\")\n    sev, mild = g.loc[g.severity_group == \"severe\", \"total_score\"], g.loc[g.severity_group == \"mild\", \"total_score\"]\n    pooled_sd = np.sqrt(((sev.var(ddof=1) * (len(sev) - 1)) +\n                         (mild.var(ddof=1) * (len(mild) - 1))) / 
(len(sev) + len(mild) - 2))\n    cohens_d = (sev.mean() - mild.mean()) / pooled_sd\n\n    return {\"icc_2_1\": round(icc, 3), \"sem\": round(sem, 2),\n            \"cronbach_alpha\": round(alpha, 3), \"known_groups_d\": round(cohens_d, 3)}",
        "description": "Compute the core CTT measurement properties from a long PRO table plus a claims-derived severity group.\nRequired inputs (cleaned, one row per person-administration; MA-only person-time already excluded):\n  pro      : person_id, index_date, administration in {'baseline','retest','followup'}, item_01..item_10,\n             total_score (0-100), stable_global (1 if patient reported 'no change' at retest)\n  severity : person_id, severity_group in {'mild','severe'}  # derived from claims over the baseline window\nReturns test-retest ICC(2,1), SEM, Cronbach's alpha, and the known-groups standardized mean difference.",
        "dependencies": [
          "pandas",
          "numpy"
        ],
        "source_citations": [
          "mokkink-2010"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\nlibrary(psych)\n\nitems <- sprintf(\"item_%02d\", 1:10)\n\nvalidate_pro <- function(pro, severity) {\n  setDT(pro); setDT(severity)\n  base <- pro[administration == \"baseline\"]\n\n  ## Test-retest on the patient-confirmed stable subgroup.\n  rt   <- pro[administration == \"retest\", .(person_id, retest = total_score, stable_global)]\n  pair <- merge(base[, .(person_id, baseline = total_score)], rt, by = \"person_id\")\n  pair <- pair[stable_global == 1]\n  icc  <- ICC(as.matrix(pair[, .(baseline, retest)]))$results\n  icc21 <- icc[icc$type == \"ICC2\", \"ICC\"]            # two-way random, absolute agreement, single rating\n  sem   <- sd(base$total_score) * sqrt(1 - icc21)\n\n  ## Internal consistency (confirm unidimensionality before trusting alpha).\n  alpha_val <- psych::alpha(as.matrix(base[, ..items]), warnings = FALSE)$total$raw_alpha\n\n  ## Known-groups validity: Cohen's d severe vs mild.\n  g    <- merge(base, severity, by = \"person_id\")\n  sev  <- g[severity_group == \"severe\", total_score]\n  mild <- g[severity_group == \"mild\",   total_score]\n  pooled_sd <- sqrt(((var(sev) * (length(sev) - 1)) +\n                     (var(mild) * (length(mild) - 1))) / (length(sev) + length(mild) - 2))\n  d <- (mean(sev) - mean(mild)) / pooled_sd\n\n  list(icc_2_1 = round(icc21, 3), sem = round(sem, 2),\n       cronbach_alpha = round(alpha_val, 3), known_groups_d = round(d, 3))\n}",
        "description": "CTT validation with the psych package. Inputs mirror the Python version:\n  pro      : person_id, administration in {'baseline','retest','followup'}, item_01..item_10,\n             total_score, stable_global\n  severity : person_id, severity_group in {'mild','severe'}  (claims-derived; MA-only excluded)\nReports test-retest ICC(2,1) + SEM, Cronbach's alpha, and the known-groups standardized mean difference.",
        "dependencies": [
          "data.table",
          "psych"
        ],
        "source_citations": [
          "mokkink-2010"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* Test-retest pairs restricted to patients who reported NO global change (stable subgroup). */\nproc sql;\n  create table pairs as\n  select b.person_id, b.total_score as score, 'baseline' as occ length=8\n  from work.pro b where b.administration='baseline'\n    and b.person_id in (select person_id from work.pro where administration='retest' and stable_global=1)\n  union all\n  select r.person_id, r.total_score as score, 'retest' as occ length=8\n  from work.pro r where r.administration='retest' and r.stable_global=1;\nquit;\n\n/* Two-way random-effects variance components -> ICC(2,1) = (BMS-EMS)/(BMS+(k-1)EMS+k(JMS-EMS)/n), k=2. */\nproc mixed data=pairs covtest;\n  class person_id occ;\n  model score = ;\n  random person_id occ;\nrun;\n\n/* Internal consistency (confirm one dominant factor with PROC FACTOR before relying on alpha). */\nproc corr data=work.pro(where=(administration='baseline')) alpha nomiss;\n  var item_01-item_10;\nrun;\n\n/* Known-groups validity: standardized mean difference (Cohen's d) severe vs mild on baseline score. */\nproc ttest data=work.analytic cohend;     /* analytic = baseline PRO joined to work.severity */\n  class severity_group;                    /* 'severe' vs 'mild' */\n  var total_score;\nrun;",
        "description": "CTT validation in SAS. Required datasets (post data-management; MA-only person-time excluded):\n  work.pro      : person_id, administration ('baseline'/'retest'/'followup'), item_01-item_10,\n                  total_score, stable_global (0/1)\n  work.severity : person_id, severity_group ('mild'/'severe')  (claims-derived over baseline window)\nPROC MIXED yields the variance components for ICC(2,1); PROC CORR ALPHA gives Cronbach's alpha; PROC TTEST\nwith COHEND gives the known-groups standardized mean difference. SEM = baseline SD * sqrt(1 - ICC).",
        "dependencies": [],
        "source_citations": [
          "mokkink-2010"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Dev[Instrument development<br/>concept elicitation + content validity] --> Scope[Define context of use<br/>population, mode, recall, endpoint role]\n  Scope --> Field[Field instrument in target population<br/>baseline + retest + follow-up administrations]\n  Field --> Rel[Reliability<br/>test-retest ICC, SEM, Cronbach's alpha]\n  Field --> Struct[Structural validity<br/>EFA/CFA or Rasch dimensionality]\n  Field --> Val[Construct validity<br/>convergent / divergent / known-groups]\n  Field --> Resp[Responsiveness + MID<br/>anchor-based, triangulated]\n  Rel --> Judge{All properties meet<br/>pre-specified thresholds?}\n  Struct --> Judge\n  Val --> Judge\n  Resp --> Judge\n  Judge -- yes --> Fit[Fit-for-purpose for the stated context of use]\n  Judge -- no --> Revise[Revise items / scoring / scope, or do not use as endpoint]",
        "caption": "PRO validation pipeline. Content validity from development is the precondition; the four measurement properties are tested in the target context of use; the validation claim is scoped to that context.",
        "alt_text": "Flowchart from instrument development and context-of-use definition through fielding, then reliability, structural validity, construct validity, and responsiveness, to a fit-for-purpose decision.",
        "source_type": "illustrative",
        "source_citations": [
          "mokkink-2010"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  PRO[PRO total score] -- \"convergent: r ~ 0.4-0.7\" --> Rel1[Related legacy instrument]\n  PRO -- \"divergent: |r| < 0.3\" --> Unrel[Unrelated domain]\n  PRO -- \"known-groups: d >= 0.5\" --> Sev[Claims-defined severe vs mild]\n  PRO -- \"responsiveness: change vs anchor\" --> Anc[Patient global impression of change]",
        "caption": "Nomological net for construct validity. A PRO has no error-free gold standard, so validity is judged by whether score relationships match a priori hypotheses, not by sensitivity/specificity against a reference label.",
        "alt_text": "Diagram showing the PRO score linked to a related instrument (convergent), an unrelated domain (divergent), claims-defined severity groups (known-groups), and a change anchor (responsiveness).",
        "source_type": "illustrative",
        "source_citations": [
          "mokkink-2010"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "is_variant_of",
        "target_slug": "pro-development",
        "notes": "Development establishes qualitative content validity (the precondition); validation then tests the quantitative measurement properties of the resulting scores."
      },
      {
        "relation_type": "used_with",
        "target_slug": "pro-rwe",
        "notes": "An instrument must be validated for the population and mode before its scores are used as endpoints in real-world evidence studies."
      },
      {
        "relation_type": "see_also",
        "target_slug": "hrqol",
        "notes": "Health-related quality-of-life instruments are a major class of PROs whose scores require the same reliability, validity, and responsiveness evidence before use."
      },
      {
        "relation_type": "alternative_to",
        "target_slug": "claims-outcome-algorithm-ppv-sensitivity-rwe",
        "notes": "Binary claims/EHR phenotypes are validated against a gold standard via PPV/sensitivity; a subjective PRO has no error-free gold standard and is validated against a nomological net instead."
      },
      {
        "relation_type": "see_also",
        "target_slug": "surrogate-endpoint-validation-rwe",
        "notes": "Both establish that a measured quantity is fit to serve as an endpoint, but surrogate validation links a biomarker to a clinical outcome whereas PRO validation establishes measurement properties of the score itself."
      },
      {
        "relation_type": "see_also",
        "target_slug": "missing-data-pattern-table-rwe",
        "notes": "Informative missingness in ePRO/EHR-collected PROs (sicker patients stop responding) must be characterized before responsiveness and MID estimates can be trusted."
      }
    ],
    "aliases": [
      "PRO validation",
      "patient-reported outcome instrument validation",
      "PROM validation",
      "psychometric validation of PROs"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "probabilistic-sensitivity-analysis-hea-rwe",
    "name": "Probabilistic Sensitivity Analysis (PSA) for Health-Economic Models",
    "short_definition": "A Monte Carlo procedure that assigns probability distributions to a health-economic model's input parameters and propagates them jointly through the model to characterize decision uncertainty in incremental cost-effectiveness.",
    "long_description": "**Probabilistic sensitivity analysis (PSA)** treats every estimated input of a decision-analytic model\n(transition probabilities, hazard ratios, utilities, resource use, unit costs) as a random variable with a\ndistribution that reflects its sampling uncertainty, then draws a large Monte Carlo sample (commonly 5,000-10,000\niterations), runs the full model once per draw, and summarizes the resulting joint distribution of incremental\ncosts and incremental effects (QALYs). The outputs are the cost-effectiveness (CE) plane scatter, the\ncost-effectiveness acceptability curve (CEAC), the expected net benefit, and — increasingly required by HTA bodies —\nthe expected value of perfect information (EVPI). In real-world evidence (RWE), the inputs are not assumed; they are\nestimated from claims, EHR, registry, or linked data, so PSA is the bridge between the uncertainty in your RWE\nestimates and the uncertainty in the reimbursement decision.\n\n**Core conceptual distinction.** Three distinctions are doing the work and they are routinely conflated.\n(1) *PSA vs deterministic (one-way/multi-way) sensitivity analysis*: deterministic analysis varies one or a few\nparameters across plausible bounds to find drivers; PSA varies all uncertain parameters *simultaneously according to\ntheir distributions* and so quantifies the probability that a decision is correct given everything we do not know.\nDeterministic SA answers \"what matters?\"; PSA answers \"how sure are we?\" — they are complements, not substitutes.\n(2) *Parameter uncertainty vs structural uncertainty*: PSA addresses only second-order (parameter) uncertainty.\nChoices of model structure — time horizon, cycle length, half-cycle correction, the competing-risk specification,\nthe survival-extrapolation family — are **not** captured by PSA and must be handled by scenario analysis. 
Presenting\na tight CEAC while ignoring structural uncertainty is a known way to overstate confidence and is exactly what\nreviewers (and NICE/CADTH critique) flag. (3) *Sampling uncertainty vs variability*: distributions in PSA represent\nuncertainty about a *mean* parameter (the standard error of an estimate), not patient-to-patient heterogeneity;\nheterogeneity belongs in subgroup analysis or individual-level (microsimulation/DES) models, not in the PSA\ndistributions of cohort-model means.\n\n**Pros, cons, and trade-offs.**\n- **vs deterministic sensitivity analysis (tornado diagrams):** PSA yields a coherent probabilistic statement of\n  decision uncertainty (P[cost-effective at threshold k]) and supports value-of-information analysis; deterministic\n  SA cannot. Cost: PSA requires specifying a distribution and, critically, a *correlation structure* for every\n  parameter, is computationally heavier, and can give a false sense of precision if distributions are guessed.\n  **Prefer PSA** as the headline uncertainty analysis for any reimbursement submission, with deterministic SA\n  retained to identify drivers and to communicate.\n- **vs reporting a point ICER with a confidence interval:** A single ICER ratio is statistically ill-behaved (the\n  ratio of two uncertain quantities; CIs can straddle quadrants of the CE plane and become uninterpretable). PSA on\n  the net-benefit scale sidesteps the ratio problem and produces a well-defined CEAC. **Prefer PSA / net-benefit**\n  whenever uncertainty must be communicated.\n- **vs bootstrapping a within-trial / within-cohort cost-effectiveness analysis:** When costs and effects are\n  observed at the patient level in one RWE dataset, nonparametric bootstrap of the patient-level data propagates the\n  real (correlated, skewed) joint distribution directly and is often preferable. 
PSA on a *decision model* is\n  required instead when the model synthesizes parameters from multiple sources, extrapolates beyond observed\n  follow-up, or links exposure effects to long-term costs/QALYs. **Prefer bootstrap** for a self-contained\n  patient-level CEA; **prefer model-based PSA** for synthesis and extrapolation.\n- **vs the E-value and other causal sensitivity analyses:** these quantify robustness of a *causal estimate* to\n  unmeasured confounding; PSA quantifies robustness of a *decision* to parameter sampling uncertainty. They are not\n  interchangeable — a confounded hazard ratio fed into PSA produces a precisely characterized but biased decision.\n\n**When to use.** Any cost-utility or cost-effectiveness model intended for HTA submission (NICE, CADTH, ICER, IQWiG,\nPBAC) where decision uncertainty and value-of-information must be reported; whenever the model synthesizes RWE\nparameters whose standard errors are known; whenever the deterministic ICER sits near the decision threshold and the\nreviewer needs the probability of cost-effectiveness; and as the substrate for EVPI/EVPPI to prioritize future\nevidence generation.\n\n**When NOT to use — and when it is actively misleading or dangerous.**\n- **As a substitute for handling structural uncertainty.** A narrow CEAC built on one fixed structure (e.g., a\n  particular extrapolation curve) is misleadingly confident. If extrapolation or model structure drives the result,\n  PSA understates total uncertainty — actively dangerous for a reimbursement decision.\n- **When parameters are sampled independently but came from one regression.** Treating jointly estimated coefficients\n  (e.g., a multivariable cost model, a parametric survival fit, a multinomial transition model) as independent\n  *miscalibrates* the propagated uncertainty — typically inflating it and distorting the CEAC. 
Use the full\n  variance-covariance matrix (multivariate normal on the natural/log scale; Dirichlet for transition rows).\n- **When the point estimates are biased.** PSA propagates uncertainty, not bias. Garbage-in confounded RWE effects\n  yield a confidently wrong decision. Resolve confounding/immortal-time/selection issues in the estimation step\n  first; PSA cannot rescue them.\n- **When distributions are invented without an evidence basis.** Assuming convenient (often too-narrow or symmetric)\n  distributions for skewed costs or bounded utilities will mis-state tail behavior and the decision probability.\n\n**Data-source operational depth.**\n- **Claims (FFS or commercial):** Costs are right-skewed with heavy tails (ICU stays, dialysis, biologics, cost\n  outliers); a Gamma or log-normal fitted to the *raw* arm mean understates the upper tail and the cost variance\n  that feeds PSA. Fit on appropriately handled data (Gamma/log-normal with explicit outlier handling) and run a\n  separate deterministic scenario on outlier trimming. Event rates that become transition probabilities should be\n  derived from FFS-observable person-time only: Medicare Advantage enrollees lack adjudicated FFS claims, so\n  including MA-only person-time biases both rates and per-member costs — restrict to Parts A/B/D FFS and document\n  whether a 5% sample or 100% denominator was used. In elderly claims cohorts, **differential competing risk of\n  death by exposure** distorts cause-specific transition probabilities; derive them with a competing-risk model and\n  sample the sub-distribution, not naive Kaplan-Meier complements.\n- **EHR:** Utilities/PROs and disease-severity states are richer than in claims, but visit-driven capture makes\n  state occupancy and resource use differentially missing for patients who leave the system; loss to follow-up is\n  potentially informative and should be reflected as additional (scenario) uncertainty rather than ignored. 
Unit\n  costs are usually absent from EHR and must be imported (and their import uncertainty represented).\n- **Registry:** Strong for adjudicated clinical states and disease-severity transitions (good source for transition\n  probabilities and their SEs) but weak for complete cost capture; link to claims for costs and to a death index to\n  pin the absorbing state, and propagate the linkage-completeness uncertainty.\n- **Linked claims–EHR–vital records:** The ideal substrate — EHR severity for utilities, claims for costs, vital\n  records for the death state — but only the linkable subset is analyzable, and order/fill/service-date\n  discrepancies must be reconciled before deriving cycle-level rates. The selection induced by linkage is itself a\n  structural-uncertainty scenario.\n\n**Worked example (RWE → PSA → decision).** Decision: a novel oral anticoagulant (NOAC) vs warfarin for stroke\nprevention in non-valvular atrial fibrillation, costed to a payer over a lifetime horizon in a 3-month-cycle Markov\ncost-utility model with states {AF-no-event, post-ischemic-stroke, post-major-bleed, dead}. RWE inputs and their PSA\ndistributions: (1) *Treatment effect* — the adjusted hazard ratio for ischemic stroke (NOAC vs warfarin) comes from\nan active-comparator new-user analysis in linked claims (continuous Parts A/B/D enrollment, 365-day washout, index =\nfirst NOAC/warfarin `fill_date`, first-event coding, censoring at disenrollment/death/end-of-data); sample\nlog(HR) ~ Normal(log ĤR, SE) and exponentiate, drawing it *once per iteration* and applying it to the baseline\nstroke transition so effect uncertainty propagates coherently. (2) *Baseline transitions* — from the warfarin arm's\nFFS person-time, computed with a competing-risk model for stroke vs bleed vs death; sample each transition row as\nDirichlet(α = event counts) so the row sums to 1 and counts drive precision. (3) *Utilities* — state utilities from\nEHR-linked PRO data, sampled Beta(a,b) (bounded 0-1). 
(4) *Costs* — acute stroke, acute bleed, and per-cycle\nmaintenance costs from claims `paid_amount` aggregated to the cycle, fitted Gamma to respect right skew, with an\noutlier-trimming scenario held aside. (5) Run 10,000 iterations, each drawing the *full* correlated parameter vector\n(use the survival model's variance-covariance matrix for any jointly estimated terms), evaluate the model, and store\nincremental cost ΔC and incremental QALYs ΔE. Outputs: CE-plane scatter; CEAC giving P(NOAC cost-effective) at WTP =\n$50k, $100k, $150k/QALY; expected incremental net monetary benefit; and population EVPI to judge whether the residual\nuncertainty justifies a confirmatory study. Report PSA alongside deterministic one-way SA (drivers) and structural\nscenarios (time horizon, extrapolation family, competing-risk handling) — the PSA alone does not cover the last.\n\n**Interpreting the output**\n\nThe simplified 1,000-iteration worked example (Drug A vs standard care) reports that 780 of 1,000 PSA iterations\nproduce an ICER below $100,000/QALY, giving a\ncost-effectiveness acceptability curve (CEAC) value of 78% at that threshold.\n\n*(1) Formal interpretation.* The CEAC value of 78% at λ = $100,000/QALY means that in 78% of the joint\nparameter draws — each representing one plausible combination of all uncertain model inputs — the intervention\nis estimated to be cost-effective at that willingness-to-pay level. This quantifies *decision uncertainty under\nthe model*: it is not the probability that the intervention works for a given patient, nor a statement about\nthe fraction of patients who benefit. 
The 22% of draws above the threshold reflect the combined effect of\nuncertainty across all distributional parameters simultaneously — utility weights, transition probabilities,\nunit costs, and relative effects — propagated jointly through the model structure.\n\n*(2) Practical interpretation.* A CEAC of 78% at $100,000/QALY means the decision to adopt the intervention\ncarries meaningful but not extreme residual uncertainty. HTA bodies typically want to see both the CEAC\nand the expected value of perfect information (EVPI), which converts this uncertainty into the monetary value\nof resolving it — a key input to the value-of-research argument. One critical error to avoid: reporting \"there\nis a 78% probability the drug works\" from a PSA CEAC. The 78% is a property of the decision model's parameter\nuncertainty, not a patient-level efficacy probability. Report the CEAC alongside the base-case ICER and the\ntornado diagram from one-way deterministic sensitivity analysis; the three together provide a complete picture\nof uncertainty.",
    "primary_category": "Economic_Evaluation",
    "tags": [
      "probabilistic-sensitivity-analysis",
      "monte-carlo",
      "decision-uncertainty",
      "cost-effectiveness-acceptability-curve",
      "expected-value-of-information",
      "health-economic-modeling",
      "hta"
    ],
    "applies_to_study_types": [
      "cost_utility_analysis",
      "cost_effectiveness_analysis",
      "decision_analytic_modeling",
      "claims_analysis",
      "registry_linkage"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1177/0272989X12458348",
        "url": "https://doi.org/10.1177/0272989X12458348",
        "citation_text": "Briggs AH, Weinstein MC, Fenwick EAL, Karnon J, Sculpher MJ, Paltiel AD. Model parameter estimation and uncertainty analysis: a report of the ISPOR-SMDM Modeling Good Research Practices Task Force-6. Medical Decision Making. 2012;32(5):722-732.",
        "year": 2012,
        "authors_short": "Briggs et al.",
        "notes": "Definitive good-practice statement on characterizing parameter uncertainty and conducting probabilistic sensitivity analysis, including distribution choice (Beta/Gamma/Dirichlet/log-normal) and correlation handling."
      },
      {
        "role": "explain",
        "doi": "10.1016/j.jval.2012.06.012",
        "url": "https://doi.org/10.1016/j.jval.2012.06.012",
        "citation_text": "Caro JJ, Briggs AH, Siebert U, Kuntz KM. Modeling good research practices - overview: a report of the ISPOR-SMDM Modeling Good Research Practices Task Force-1. Value in Health. 2012;15(6):796-803.",
        "year": 2012,
        "authors_short": "Caro et al.",
        "notes": "Places PSA within the broader decision-modeling workflow and distinguishes parameter from structural uncertainty."
      },
      {
        "role": "demonstrate",
        "doi": "10.1002/hec.903",
        "url": "https://doi.org/10.1002/hec.903",
        "citation_text": "Fenwick E, O'Brien BJ, Briggs A. Cost-effectiveness acceptability curves - facts, fallacies and frequently asked questions. Health Economics. 2004;13(5):405-415.",
        "year": 2004,
        "authors_short": "Fenwick et al.",
        "notes": "Canonical treatment of the CEAC — the principal PSA output — including correct net-benefit interpretation and common misreadings."
      },
      {
        "role": "use",
        "doi": "10.1016/j.jval.2021.11.1351",
        "url": "https://doi.org/10.1016/j.jval.2021.11.1351",
        "citation_text": "Husereau D, Drummond M, Augustovski F, et al. Consolidated Health Economic Evaluation Reporting Standards 2022 (CHEERS 2022) statement: updated reporting guidance for health economic evaluations. Value in Health. 2022;25(1):3-9.",
        "year": 2022,
        "authors_short": "Husereau et al.",
        "notes": "Current reporting standard requiring characterization and reporting of decision uncertainty (PSA, CEAC) in published economic evaluations."
      }
    ],
    "plain_language_summary": "Probabilistic sensitivity analysis (PSA) answers the question: given everything we are uncertain about in a cost-effectiveness model, how confident are we that a treatment is worth its price? Instead of plugging a single best-guess number into each model input, PSA treats each input as a range with a most-likely value and plausible spread, then re-runs the model thousands of times, each time drawing a fresh random set of inputs. The result is not one cost-effectiveness ratio but a cloud of thousands, from which you can read off the probability that the treatment is cost-effective at any given price threshold. One honest caveat: PSA captures uncertainty about the numbers we feed the model, but it cannot capture uncertainty about whether the model structure itself is correct.",
    "key_terms": [
      {
        "term": "probabilistic sensitivity analysis",
        "definition": "A technique that re-runs a cost-effectiveness model thousands of times, each time sampling all uncertain inputs from their probability distributions, to produce a spread of outcomes instead of a single answer."
      },
      {
        "term": "parameter uncertainty",
        "definition": "Uncertainty about the true value of a model input (such as a hazard ratio or a cost estimate) because it was measured in a finite sample and therefore has a margin of error."
      },
      {
        "term": "CEAC",
        "definition": "The cost-effectiveness acceptability curve is a graph that shows, at each possible willingness-to-pay threshold, the percentage of PSA simulation draws in which the treatment was cost-effective."
      },
      {
        "term": "Monte Carlo simulation",
        "definition": "A computational method that runs a model repeatedly using random draws from input distributions to characterize the range of possible outcomes."
      },
      {
        "term": "incremental cost-effectiveness ratio (ICER)",
        "definition": "The extra cost of the new treatment divided by the extra health benefit it produces, expressed as cost per unit of health gained (for example, cost per QALY)."
      },
      {
        "term": "willingness-to-pay threshold",
        "definition": "The maximum amount a payer or health system is willing to spend to gain one additional unit of health benefit, used to judge whether an ICER is acceptable."
      }
    ],
    "worked_example": {
      "scenario": "A health economist is building a simple cost-effectiveness model comparing a new drug (Drug A) versus standard care for a chronic condition. The model has three uncertain inputs: the annual cost of Drug A, the reduction in hospitalizations Drug A produces, and the quality-of-life benefit per hospitalization avoided. Instead of running the model once with single best-guess numbers, she runs 1,000 Monte Carlo iterations, drawing each input randomly from its distribution each time. She then tallies what fraction of those 1,000 runs show Drug A to be cost-effective at a willingness-to-pay threshold of $100,000 per QALY gained.",
      "dataset": {
        "caption": "A sample of five simulation draws from the 1,000-iteration PSA run. Each row is one complete re-run of the model with a different randomly sampled set of inputs. The final two columns show the resulting incremental cost and incremental QALYs for that draw.",
        "columns": [
          "draw",
          "drug_a_annual_cost_usd",
          "hosp_reduction_rate",
          "qaly_per_hosp_avoided",
          "incremental_cost_usd",
          "incremental_qalys"
        ],
        "rows": [
          [
            1,
            18200,
            0.22,
            0.18,
            14800,
            0.31
          ],
          [
            2,
            21500,
            0.15,
            0.12,
            18900,
            0.22
          ],
          [
            3,
            16400,
            0.31,
            0.24,
            11200,
            0.48
          ],
          [
            4,
            22800,
            0.19,
            0.16,
            19700,
            0.26
          ],
          [
            5,
            17600,
            0.28,
            0.21,
            13400,
            0.41
          ]
        ]
      },
      "steps": [
        "For each of the 1,000 draws, sample drug_a_annual_cost_usd from a Gamma distribution (right-skewed because costs cannot be negative and tend to have a long upper tail), sample hosp_reduction_rate from a Beta distribution (bounded between 0 and 1 because it is a probability), and sample qaly_per_hosp_avoided from a Beta distribution for the same reason.",
        "Run the cost-effectiveness model once using that draw's three sampled values to produce incremental_cost_usd (the extra cost of Drug A vs standard care) and incremental_qalys (the extra health benefit).",
        "Compute the implied ICER for that draw: incremental_cost_usd divided by incremental_qalys. For draw 1: $14,800 / 0.31 QALYs = $47,742 per QALY.",
        "Classify each draw as cost-effective if its ICER is below the $100,000 per QALY threshold (equivalently, if 100,000 x incremental_qalys minus incremental_cost_usd is positive).",
        "After all 1,000 draws, count the fraction classified as cost-effective to get the CEAC value at $100,000 per QALY."
      ],
      "result": "In this illustrative 5-draw sample, all five draws have an ICER below $100,000 per QALY (ICERs range from roughly $48k to $86k), so 5/5 = 100% are cost-effective in this small sample. Across all 1,000 full-simulation draws using the file's model parameters, 780 of 1,000 draws fall below the $100,000 threshold, giving a CEAC value of 78% at that threshold. This means the model estimates a 78% probability that Drug A is cost-effective at a willingness-to-pay of $100,000 per QALY."
    },
    "prerequisites": [
      "health-economic-modeling-methods-rwe",
      "icer-net-monetary-benefit-rwe",
      "markov-transition-probabilities-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Correlated multivariate sampling from a fitted model",
        "description": "When transition probabilities, survival parameters, or cost-regression coefficients are jointly estimated, sample the full parameter vector from a multivariate normal on the natural/log scale using the model's variance-covariance matrix (Dirichlet for multinomial transition rows) rather than drawing each parameter independently.",
        "edge_cases": [
          "Independent sampling of correlated coefficients typically inflates and distorts propagated uncertainty, biasing the CEAC.",
          "Survival-extrapolation parameters (e.g., Weibull shape/scale) are strongly correlated; ignore the covariance and long-horizon cost/QALY uncertainty is mis-stated."
        ],
        "data_source_notes": "claims/registry: retain the variance-covariance matrix from the survival or cost regression; EHR: PRO-derived utility correlations across states should be respected where estimated jointly."
      },
      {
        "name": "Value-of-information layer (EVPI / EVPPI)",
        "description": "Use the PSA sample to compute population expected value of perfect (or partial perfect) information, quantifying whether residual decision uncertainty justifies further data collection and which parameters to target.",
        "edge_cases": [
          "EVPPI for correlated parameter subsets requires nested or regression-based (e.g., GAM/Gaussian-process) metamodels; naive two-level Monte Carlo is unstable.",
          "Population EVPI scales with incidence and effective decision horizon; mis-specifying either dominates the result."
        ],
        "data_source_notes": "claims/registry incidence denominators set the population multiplier; document the catchment and time horizon used."
      },
      {
        "name": "Distribution choice matched to parameter support",
        "description": "Beta for probabilities and utilities (0-1), Dirichlet for multinomial transition rows, Gamma or log-normal for right-skewed costs and resource use, log-normal/normal-on-log for hazard and rate ratios.",
        "edge_cases": [
          "Utilities that can be negative (worse than death) violate Beta support; shift/scale or use a flexible distribution.",
          "Method-of-moments fitting of Gamma to a heavy-tailed claims cost mean can understate the upper tail."
        ],
        "data_source_notes": "claims: fit cost distributions per arm/age band and pair with an outlier-handling deterministic scenario."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Deterministic one-way / multi-way sensitivity analysis",
        "pros_of_this": "Varies all uncertain parameters jointly per their distributions and yields a probabilistic statement of decision correctness (CEAC) plus value-of-information; deterministic SA cannot.",
        "cons_of_this": "Requires specifying a distribution and correlation structure for every parameter, is computationally heavier, and can convey false precision if distributions are guessed.",
        "when_to_prefer": "As the headline uncertainty analysis for any HTA/reimbursement submission, with deterministic SA kept to identify drivers."
      },
      {
        "compared_to": "Point ICER with a confidence interval",
        "pros_of_this": "Avoids the ill-behaved ratio statistic by working on the net-benefit scale and produces an interpretable CEAC across willingness-to-pay thresholds.",
        "cons_of_this": "More to specify and compute than reporting a single ratio.",
        "when_to_prefer": "Whenever decision uncertainty must be communicated, especially near the cost-effectiveness threshold."
      },
      {
        "compared_to": "Nonparametric bootstrap of a within-cohort patient-level CEA",
        "pros_of_this": "Handles synthesis of parameters from multiple sources and extrapolation beyond observed follow-up, which a single-dataset bootstrap cannot.",
        "cons_of_this": "Relies on assumed parametric distributions rather than the empirical patient-level joint distribution; for a self-contained patient-level CEA the bootstrap is often more faithful.",
        "when_to_prefer": "Model-based PSA for evidence synthesis and extrapolation; bootstrap for a self-contained patient-level cost-effectiveness analysis."
      },
      {
        "compared_to": "Causal sensitivity analyses (E-value)",
        "pros_of_this": "Characterizes uncertainty of the cost-effectiveness decision given parameter sampling error and supports value-of-information.",
        "cons_of_this": "Does not address bias from unmeasured confounding in the underlying causal estimates.",
        "when_to_prefer": "For decision uncertainty; pair with E-value/negative-control analyses that address confounding of the effect estimate feeding the model."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Fit right-skewed cost distributions (Gamma/log-normal) per arm and age band on FFS-observable person-time; exclude Medicare Advantage-only person-time where adjudicated claims are missing and document 5%-sample vs 100%-denominator. Derive transition probabilities with competing-risk models in elderly cohorts and sample sub-distributions. Hold an outlier-trimming scenario aside as deterministic, not parameter, uncertainty.",
      "ehr": "Source state utilities/PROs and disease-severity transitions here; import unit costs and propagate their uncertainty. Treat visit-driven missingness and informative loss to follow-up as additional scenario uncertainty.",
      "registry": "Strong for adjudicated clinical-state transitions and their standard errors; link to claims for costs and to a death index for the absorbing state, and propagate linkage-completeness uncertainty.",
      "linked": "Ideal substrate (EHR severity + claims costs + vital-records death) but only the linkable subset is analyzable; reconcile order/fill/service dates before deriving cycle rates and treat linkage selection as a structural scenario."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\n\nRNG = np.random.default_rng(20240101)\nN_ITER, N_CYCLES = 10_000, 200          # 3-month cycles over a lifetime horizon\nWTP = np.array([50_000, 100_000, 150_000])\nSTATES = [\"AF\", \"post_stroke\", \"post_bleed\", \"dead\"]\nDISC_C, DISC_E = 0.03, 0.03             # annual discount rates (apply per cycle)\n\ndef run_psa(transition_counts, util_beta, cost_gamma, loghr_stroke):\n    # --- one draw of every uncertain parameter, sampled jointly across N_ITER ---\n    # Transition rows: Dirichlet(counts) keeps each row a valid probability vector; counts drive precision.\n    P = {arm: np.stack([RNG.dirichlet(c, size=N_ITER) for c in rows], axis=1)   # (N_ITER, n_states, n_states)\n         for arm, rows in transition_counts.items()}\n    # Treatment effect drawn ONCE per iteration on the log scale, then exponentiated.\n    hr = np.exp(RNG.normal(loghr_stroke[0], loghr_stroke[1], N_ITER))\n    u = np.stack([RNG.beta(*util_beta[s], size=N_ITER) for s in STATES], axis=1)  # utilities, bounded 0..1\n    cost = np.stack([RNG.gamma(cost_gamma[s][0], cost_gamma[s][1], N_ITER)        # right-skewed per-cycle cost\n                     for s in STATES], axis=1)\n\n    results = {}\n    for arm in (\"warfarin\", \"noac\"):\n        Parm = P[arm].copy()\n        if arm == \"noac\":                      # apply RWE hazard ratio to the AF->stroke transition\n            Parm[:, 0, 1] *= hr\n            Parm[:, 0, 0] = 1.0 - Parm[:, 0, 1:].sum(axis=1)   # renormalize the originating row\n        trace = np.zeros((N_ITER, len(STATES)))\n        trace[:, 0] = 1.0                       # everyone starts in AF\n        tot_c = np.zeros(N_ITER); tot_e = np.zeros(N_ITER)\n        for t in range(N_CYCLES):\n            trace = np.einsum(\"is,isj->ij\", trace, Parm)        # advance one cycle\n            dfac_c = 1 / (1 + DISC_C) ** (t / 4); dfac_e = 1 / (1 + DISC_E) ** (t / 4)\n            tot_c += (trace * cost).sum(axis=1) * dfac_c\n            
tot_e += (trace * u).sum(axis=1) * 0.25 * dfac_e    # quarter-cycle QALYs\n        results[arm] = (tot_c, tot_e)\n\n    dC = results[\"noac\"][0] - results[\"warfarin\"][0]\n    dE = results[\"noac\"][1] - results[\"warfarin\"][1]\n    ceac = {k: float(np.mean(k * dE - dC > 0)) for k in WTP}    # P(NOAC cost-effective) by WTP\n    inb = WTP[1] * dE - dC                                      # incremental NMB at $100k\n    evpi = float(np.maximum(inb, 0).mean() - max(inb.mean(), 0))\n    return {\"dC\": dC, \"dE\": dE, \"ceac\": ceac, \"evpi_per_patient\": evpi}",
        "description": "Markov cost-utility PSA driven by RWE-derived parameters. Inputs (all already estimated upstream from claims/EHR/\nregistry):\n  transition_counts : dict arm -> array of per-row event counts for Dirichlet sampling of the transition matrix\n                      (rows: AF, post-stroke, post-bleed, dead; dead is absorbing)\n  util_beta         : dict state -> (a, b) Beta hyperparameters from EHR/PRO utility data\n  cost_gamma        : dict state -> (shape k, scale theta) Gamma fit to per-cycle claims paid_amount (right-skewed)\n  loghr_stroke      : (mean log-HR, se) for NOAC vs warfarin from the active-comparator new-user RWE analysis\nDraw the FULL parameter vector once per iteration (effect HR drawn once and applied to the baseline transition) so\ncorrelation through the model is preserved; report CE plane, CEAC, and EVPI. Deterministic one-way SA and\nstructural scenarios (horizon, cycle length, extrapolation) are run separately — PSA does not cover them.",
        "dependencies": [
          "numpy"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(MCMCpack)   # rdirichlet\nset.seed(20240101)\nN_ITER <- 10000L; N_CYCLES <- 200L\nWTP <- c(50000, 100000, 150000)\nSTATES <- c(\"AF\", \"post_stroke\", \"post_bleed\", \"dead\")\nDISC_C <- 0.03; DISC_E <- 0.03\n\nrun_psa <- function(transition_counts, util_beta, cost_gamma, loghr_stroke) {\n  draw_P <- function(rows)                      # list of Dirichlet-sampled transition matrices, one per iteration\n    lapply(seq_len(N_ITER), function(i)\n      t(vapply(rows, function(cnt) rdirichlet(1, cnt), numeric(length(STATES)))))\n  P <- lapply(transition_counts, draw_P)\n  hr <- exp(rnorm(N_ITER, loghr_stroke[1], loghr_stroke[2]))         # effect drawn once per iteration\n  u  <- sapply(STATES, function(s) rbeta(N_ITER, util_beta[[s]][1], util_beta[[s]][2]))\n  cost <- sapply(STATES, function(s) rgamma(N_ITER, cost_gamma[[s]][1], scale = cost_gamma[[s]][2]))\n\n  arm_totals <- function(arm) {\n    tot_c <- numeric(N_ITER); tot_e <- numeric(N_ITER)\n    for (i in seq_len(N_ITER)) {\n      Pi <- P[[arm]][[i]]\n      if (arm == \"noac\") {                       # apply RWE hazard ratio to AF->stroke, renormalize the row\n        Pi[1, 2] <- Pi[1, 2] * hr[i]\n        Pi[1, 1] <- 1 - sum(Pi[1, -1])\n      }\n      tr <- c(1, 0, 0, 0)\n      for (t in seq_len(N_CYCLES)) {\n        tr <- as.numeric(tr %*% Pi)\n        dC <- 1 / (1 + DISC_C)^((t - 1) / 4); dE <- 1 / (1 + DISC_E)^((t - 1) / 4)\n        tot_c[i] <- tot_c[i] + sum(tr * cost[i, ]) * dC\n        tot_e[i] <- tot_e[i] + sum(tr * u[i, ]) * 0.25 * dE\n      }\n    }\n    list(c = tot_c, e = tot_e)\n  }\n  w <- arm_totals(\"warfarin\"); n <- arm_totals(\"noac\")\n  dC <- n$c - w$c; dE <- n$e - w$e\n  ceac <- sapply(WTP, function(k) mean(k * dE - dC > 0))\n  inb  <- WTP[2] * dE - dC\n  evpi <- mean(pmax(inb, 0)) - max(mean(inb), 0)\n  list(dC = dC, dE = dE, ceac = setNames(ceac, WTP), evpi_per_patient = evpi)\n}",
        "description": "Markov cost-utility PSA in base R, parameters estimated upstream from RWE. Inputs mirror the Python version:\n  transition_counts : list(arm = list of per-row event-count vectors) for Dirichlet sampling\n  util_beta         : list(state = c(a, b)) Beta hyperparameters from EHR/PRO data\n  cost_gamma        : list(state = c(shape, scale)) Gamma fit to per-cycle claims paid_amount\n  loghr_stroke      : c(mean_log_hr, se) from the active-comparator new-user RWE analysis\nThe treatment-effect draw is taken once per iteration and applied to the baseline transition so model correlation\nis preserved. For jointly estimated coefficients use MASS::mvrnorm with the model vcov() instead of independent\ndraws. Structural uncertainty (horizon, cycle length, extrapolation) is handled by separate scenarios.",
        "dependencies": [
          "MCMCpack"
        ],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  RWE[RWE estimates with standard errors<br/>HR from active-comparator new-user analysis<br/>transitions, utilities, costs] --> Dist[Assign distributions matched to support<br/>log-normal HR / Dirichlet rows / Beta utilities / Gamma costs]\n  Dist --> Corr[Joint Monte Carlo sampling<br/>use full variance-covariance for jointly estimated terms]\n  Corr --> Model[Evaluate the decision model once per draw<br/>10,000 iterations]\n  Model --> Joint[Joint distribution of incremental cost and QALYs]\n  Joint --> Plane[CE plane scatter]\n  Joint --> CEAC[CEAC: P cost-effective vs WTP]\n  Joint --> EVPI[EVPI / EVPPI: value of further evidence]\n  Struct[Structural uncertainty<br/>horizon, cycle length, extrapolation, competing risks] -. handled separately by scenario analysis .-> Model",
        "caption": "PSA propagates RWE parameter uncertainty (with correlation) through the decision model to produce the CE plane, CEAC, and value-of-information; structural uncertainty is addressed by separate scenario analysis, not by PSA.",
        "alt_text": "Flowchart from RWE estimates to distribution assignment, joint Monte Carlo sampling, per-draw model evaluation, and outputs (CE plane, CEAC, EVPI), with structural uncertainty entering via separate scenario analysis.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  Draws[10,000 incremental cost-QALY pairs] --> NB[For each WTP threshold k:<br/>net benefit = k*dE - dC]\n  NB --> Prob[Proportion of draws with positive net benefit<br/>= P strategy is optimal at k]\n  Prob --> Curve[CEAC: plot probability optimal across k]\n  Curve --> Read{Reading the CEAC}\n  Read --> Low[At low WTP comparator usually optimal]\n  Read --> High[At high WTP study strategy usually optimal]\n  Read --> Cross[Crossing point near the deterministic ICER]",
        "caption": "How the CEAC is computed and read — at each willingness-to-pay threshold the curve is the probability the strategy has the highest net benefit across the PSA draws.",
        "alt_text": "Diagram showing incremental cost-QALY pairs converted to net benefit at each willingness-to-pay threshold, summarized as the probability optimal, and read as a cost-effectiveness acceptability curve.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "is_variant_of",
        "target_slug": "health-economic-modeling-methods-rwe",
        "notes": "PSA is the uncertainty-analysis component of the broader RWE health-economic modeling family."
      },
      {
        "relation_type": "produces",
        "target_slug": "icer-net-monetary-benefit-rwe",
        "notes": "PSA generates the distribution of incremental costs and effects underlying the ICER and net-monetary-benefit statements (and the CEAC built on net benefit)."
      },
      {
        "relation_type": "used_with",
        "target_slug": "markov-transition-probabilities-rwe",
        "notes": "Transition probabilities estimated from RWE are sampled (typically Dirichlet by row) as PSA inputs."
      },
      {
        "relation_type": "used_with",
        "target_slug": "discrete-event-simulation-rwe",
        "notes": "PSA wraps an outer Monte Carlo loop over model parameters; in DES this is layered on top of the inner individual-level simulation, requiring careful separation of parameter uncertainty from variability."
      },
      {
        "relation_type": "used_with",
        "target_slug": "cost-utility",
        "notes": "PSA is the standard decision-uncertainty analysis for cost-utility models reported to HTA bodies."
      },
      {
        "relation_type": "used_with",
        "target_slug": "survival-extrapolation-hta-rwe",
        "notes": "Survival-model parameters are correlated; their joint uncertainty (via the variance-covariance matrix) is a major PSA input, while the choice of extrapolation family is structural and handled by scenario analysis."
      },
      {
        "relation_type": "see_also",
        "target_slug": "discounting-costs-effects-rwe",
        "notes": "Discount rates are conventionally fixed and varied deterministically rather than sampled in PSA; their effect belongs in scenario analysis."
      },
      {
        "relation_type": "see_also",
        "target_slug": "e-value-sensitivity-analysis",
        "notes": "The E-value addresses unmeasured-confounding robustness of the causal effect estimate; PSA addresses parameter-uncertainty robustness of the decision — complementary, not interchangeable."
      },
      {
        "relation_type": "complements",
        "target_slug": "scenario-deterministic-sensitivity-analysis-hea-rwe",
        "notes": "Scenario/deterministic SA handles influence ranking and the structural choices (horizon, extrapolation, perspective) PSA holds fixed; PSA quantifies joint parameter uncertainty. Good practice requires both."
      }
    ],
    "aliases": [
      "PSA",
      "probabilistic sensitivity analysis",
      "probabilistic analysis",
      "second-order Monte Carlo simulation",
      "parameter uncertainty analysis"
    ],
    "complexity": "advanced",
    "regulatory_relevance": [
      "hta",
      "ema",
      "fda"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "procedure-identification-and-measurement-in-claims-ehr",
    "name": "Procedure Identification and Measurement in Claims and EHR",
    "short_definition": "The operational task of defining a procedure exposure or outcome from administrative and clinical data by assembling validated CPT/HCPCS, ICD-10-PCS, revenue-center, and (where available) registry or operative-note evidence into a code set, deciding setting and laterality, and fixing the procedure date as a time-zero or event date for downstream analysis.",
    "long_description": "**Procedure identification and measurement** is the data-management bridge between a clinical concept\n(\"the patient had a total knee arthroplasty,\" \"the patient received a cardiac catheterization\") and an\nanalyzable variable: a flag, a date, a count, or a setting-tagged exposure episode. Unlike a drug, which\nin claims is captured uniformly as a pharmacy fill (NDC + `fill_date` + `days_supply`), a single procedure\nis encoded across multiple, non-interchangeable coding systems that depend on the *site of service and the\npayer's billing format*. The same total knee arthroplasty appears as CPT 27447 on a physician (carrier)\nclaim, ICD-10-PCS 0SRC0J9 on a hospital inpatient (UB-04) claim, and an APC/revenue-center line if done\nin a hospital outpatient department. The analyst's job is to write a code set and an assembly rule that\ncaptures the procedure once, at the right date, in the right setting, without double-counting the bilateral\nor staged case — and to know which billing streams in the source data can even see it.\n\n**Core conceptual distinction.** Three things are routinely conflated and must be separated. (1) *Coding\nsystem vs. clinical event*: CPT/HCPCS (physician and outpatient facility), ICD-10-PCS (inpatient facility),\nICD-10-CM procedure-adjacent diagnosis codes (status/history, not the act), and revenue center codes each\ndescribe a slice of the same act from a different billing actor — a complete definition usually requires a\nunion across systems, not a single code. (2) *Procedure-as-exposure vs. procedure-as-outcome*: as an\nexposure (e.g., bariatric surgery -> later diabetes remission) the procedure date is time zero and the\ncentral threat is immortal time and selection of who gets operated on; as an outcome (e.g., drug ->\nrevascularization) the same code set is an event date and the central threat is differential ascertainment\nand competing risks. (3) *The act vs. 
the claim*: a denied, reversed, or duplicate facility-plus-professional\npair for one surgery generates several lines; counting lines instead of events inflates incidence. The\nestimand must state whether the quantity is \"ever received,\" \"first receipt (incident),\" \"count of\nprocedures per person-time,\" or \"time to procedure,\" because the de-duplication and date rules differ for\neach.\n\n**Pros, cons, and trade-offs.**\n- **Multi-system union code set vs. a single coding system.** A union of CPT + HCPCS + ICD-10-PCS + revenue\n  codes captures the procedure regardless of where it was performed and is far more sensitive; the cost is\n  more programming, the need to reconcile dates across facility and professional claims for the same act,\n  and a higher duplicate burden that the de-duplication window must absorb. **Prefer the union** for any\n  consequential analysis. A single-system rule (e.g., CPT only) silently drops every inpatient case and\n  under-counts in exactly the sicker subgroup.\n- **Claims procedure codes vs. structured EHR procedure/order tables.** Claims procedure capture is highly\n  specific and reasonably complete for billable, reimbursed acts because billing is the reason the code\n  exists; EHR order/flowsheet capture sees in-system care in more clinical detail (laterality, surgeon,\n  intra-operative fields) but misses care delivered outside the network and is encounter-driven. **Prefer\n  claims** when completeness across sites of care matters; **prefer EHR/registry** when you need clinical\n  granularity (stage, laterality, device) and can tolerate leakage. Linkage gets both but adds selection.\n- **Counting all procedure claims vs. incident first-event logic.** Counting every qualifying claim answers\n  utilization questions (procedures per 1,000 person-years) but, for an exposure or \"first surgery\"\n  cohort, mistakes the staged second-eye cataract or a revision for a new patient. 
**Prefer incident\n  first-event logic** (washout + first qualifying claim) for cohort entry; **prefer counts** for HCRU.\n\n**When to use.** Whenever a procedure is the exposure, the outcome, or a utilization metric in claims, EHR,\nregistry, or linked data and you need a defensible, reproducible definition: comparative effectiveness of a\nprocedural intervention (e.g., TAVR vs. SAVR), procedure as an outcome of a drug, surgical safety/utilization,\ncost analyses anchored on a procedure, and any regulatory- or HTA-grade submission where the code set and its\nvalidity must be auditable.\n\n**When NOT to use — and when it is actively misleading or dangerous.**\n- **As an exposure without controlling immortal time.** If follow-up starts at diagnosis (or cohort entry)\n  but the \"procedure\" arm is defined by an act that happens later, every day a patient survives to be\n  operated on is misclassified as exposed person-time. This makes the procedure look protective when it is\n  not (Suissa's immortal-time bias) — the single most common, most lethal error in procedure-as-exposure\n  studies. Use time-varying exposure, a landmark, or a target-trial design instead.\n- **When the billing stream cannot see the procedure.** In Medicare Advantage (managed-care) enrollees,\n  encounter data are historically incomplete and FFS-style procedure claims may be absent; \"no procedure\"\n  can be missingness, not a true negative. 
Do not pool MA-only and FFS person-time for a procedure rate.\n- **When laterality/staging is decision-relevant but uncoded.** Many CPT codes do not encode side; for\n  cataract, joint, and many oncologic procedures, treating a second-side or staged procedure as a recurrence\n  (or as a new patient) is a definitional error, not a data artifact.\n- **When the procedure is rare and the code is non-specific.** A broad revenue-center or unlisted CPT code\n  used to maximize sensitivity can drown a rare true procedure in non-specific lines; validate PPV before\n  trusting it.\n\n**Data-source operational depth.**\n- **Administrative claims (Medicare FFS / commercial).** Procedures live on *both* the carrier/physician\n  file (CPT/HCPCS) and the institutional file (ICD-10-PCS on inpatient; CPT/HCPCS + revenue codes on\n  outpatient facility). Failure modes: (a) one surgery generates a facility line and a professional line on\n  different `service_date`s — collapse to one event with an acute-event de-duplication window (e.g., a single\n  procedure per person within N days) and take the earliest date; (b) MA-only person-time lacks complete FFS\n  procedure claims, so restrict procedure-rate denominators to FFS-observable time (Parts A/B, no MA months);\n  (c) claim reversals/denials and resubmissions create phantom duplicate procedures — keep only adjudicated/paid\n  or use a within-window collapse; (d) bundled/global surgical packages roll post-op visits into one payment,\n  so the absence of follow-up claims is not the absence of care. Always require continuous enrollment across\n  the washout so \"first procedure\" is genuinely first.\n- **EHR.** Procedures appear in structured procedure/order tables, surgical case logs, and operative notes.\n  Advantage: laterality, surgeon, device, and intra-operative detail; problem lists and pathology refine\n  indication. 
Failure modes: encounter-driven capture (a procedure done at an out-of-network facility is\n  invisible — external-care leakage), inconsistent local procedure dictionaries that must be mapped to a\n  standard (CPT/SNOMED), and structured fields that are blank when the act is documented only in free text,\n  requiring NLP of operative notes.\n- **Registry (e.g., NSQIP, STS, SEER, device registries).** Strongest source for the procedure itself —\n  standardized definitions, laterality, approach, surgeon-reported detail, and adjudicated peri-operative\n  outcomes. Failure modes: registry inclusion is a selected subset (participating sites, eligible cases),\n  and follow-up beyond the index admission is usually thin; link to claims for longitudinal follow-up and to\n  a death index for mortality.\n- **Linked claims–EHR–registry.** The ideal substrate (registry/EHR procedure detail + claims completeness\n  + reliable mortality), but linkage selects the linkable subset and introduces date discrepancies among the\n  operative note, the facility claim, and the professional claim that must be reconciled before fixing the\n  procedure date / time zero.\n\n**Worked claims example.** Question: incident bariatric surgery as an exposure (sleeve gastrectomy or\nRoux-en-Y gastric bypass) among commercially insured + Medicare FFS adults with obesity, with later\ntype 2 diabetes remission as the outcome. (1) Code set: union of CPT/HCPCS (43775 sleeve, 43644/43645 RYGB)\non carrier and outpatient-facility claims AND ICD-10-PCS (e.g., 0DB64Z3, 0D164ZA) on inpatient facility\nclaims; this is the multi-system union — CPT-only would miss every inpatient bypass. (2) Eligibility: age >=18,\n>=2 obesity diagnoses, and 365 days of continuous medical enrollment (FFS-observable; exclude MA-only months)\nbefore the first qualifying procedure claim, so the washout can establish incidence. 
(3) Washout / incidence:\nno qualifying bariatric procedure code in any stream during the 365-day lookback -> the first qualifying claim\ndate is the *candidate* index. (4) De-duplication: a single patient's facility line (PCS, `service_date`\n2024-03-10) and professional line (CPT, `service_date` 2024-03-11) describe one surgery; collapse all\nqualifying lines within a 30-day acute window to one event and assign `index_date` = the earliest qualifying\n`service_date`. (5) Time zero = `index_date` (the procedure date), NOT the obesity-diagnosis date — anchoring\nat diagnosis and waiting for surgery would create immortal time. (6) Follow-up: from `index_date` to first\nvalidated diabetes-remission event, censoring at disenrollment, death, end of data, and (for an as-treated\ncontrast) a competing bariatric revision. (7) Sensitivity: vary the de-duplication window (7/30/90 days), test\nCPT-only vs. union to quantify inpatient capture, and report PPV against operative notes in a linked subset.",
    "primary_category": "Exposure_Definition",
    "tags": [
      "exposure_definition",
      "procedure-identification",
      "cpt-hcpcs",
      "icd-10-pcs",
      "claims-coding",
      "code-list-development",
      "surgical-cohort",
      "immortal-time"
    ],
    "applies_to_study_types": [
      "claims_analysis",
      "ehr_study",
      "registry_linkage",
      "comparative_effectiveness",
      "drug_utilization"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1016/j.jclinepi.2004.10.012",
        "url": "https://doi.org/10.1016/j.jclinepi.2004.10.012",
        "citation_text": "Schneeweiss S, Avorn J. A review of uses of health care utilization databases for epidemiologic research on therapeutics. Journal of Clinical Epidemiology. 2005;58(4):323-337.",
        "year": 2005,
        "authors_short": "Schneeweiss & Avorn",
        "notes": "Foundational methods review of how administrative utilization databases capture diagnoses, drugs, and procedures, and the validity considerations (code specificity, billing artifacts, completeness) that govern procedure ascertainment."
      },
      {
        "role": "explain",
        "doi": "10.1097/01.mlr.0000182534.19832.83",
        "url": "https://doi.org/10.1097/01.mlr.0000182534.19832.83",
        "citation_text": "Quan H, Sundararajan V, Halfon P, et al. Coding algorithms for defining comorbidities in ICD-9-CM and ICD-10 administrative data. Medical Care. 2005;43(11):1130-1139.",
        "year": 2005,
        "authors_short": "Quan et al.",
        "notes": "Canonical example of building and translating diagnosis/procedure code algorithms across ICD revisions; the template for transparent, reproducible code-set construction and ICD-9-to-ICD-10/PCS crosswalking that procedure definitions depend on."
      },
      {
        "role": "demonstrate",
        "doi": "10.1093/aje/kwm324",
        "url": "https://doi.org/10.1093/aje/kwm324",
        "citation_text": "Suissa S. Immortal time bias in pharmacoepidemiology. American Journal of Epidemiology. 2008;167(4):492-499.",
        "year": 2008,
        "authors_short": "Suissa",
        "notes": "Demonstrates the dominant pitfall when a procedure is used as a time-fixed exposure - person-time before the procedure date misclassified as exposed - and why time-zero must be the procedure date with time-varying or landmark handling."
      }
    ],
    "plain_language_summary": "When researchers want to know whether a patient had a specific surgery or procedure, they must search for it across several different medical billing systems that each use their own set of codes — the surgeon files one code, the hospital files a different one, and sometimes both show up for the very same operation. This entry explains how to find and count procedures correctly by combining those code lists, recognizing when two billing rows actually describe a single event, and removing the duplicate before counting. The main pitfall is treating each billing row as a separate procedure, which inflates how often procedures appear to occur.",
    "key_terms": [
      {
        "term": "CPT/HCPCS codes",
        "definition": "Five-character billing codes that physicians and outpatient facilities use to report procedures they performed, such as CPT 27447 for a total knee replacement."
      },
      {
        "term": "ICD procedure codes",
        "definition": "Seven-character codes (ICD-10-PCS) used by hospitals to describe procedures performed during an inpatient stay, covering the same surgeries as CPT but in a completely different coding language."
      },
      {
        "term": "professional claim",
        "definition": "The bill submitted by the surgeon or physician for their work during a procedure, distinct from the hospital's own bill for the same event."
      },
      {
        "term": "facility claim",
        "definition": "The bill submitted by the hospital or surgery center for the use of the operating room, nursing staff, and supplies — separate from the surgeon's professional claim."
      },
      {
        "term": "deduplication",
        "definition": "The step where an analyst collapses multiple billing rows that all describe the same single procedure into one counted event, so the procedure is not counted twice."
      }
    ],
    "worked_example": {
      "scenario": "A researcher wants to count how many patients in a commercial claims database had a total knee replacement in 2023. She pulls all claim lines that carry a qualifying procedure code. For patient 1001, two rows come back: one from the surgeon (CPT 27447, dated March 14) and one from the hospital (ICD-10-PCS 0SRC0J9, dated March 14). Both rows describe the same single surgery. Without deduplication she would count this patient as having had two procedures; after deduplication she correctly counts one.",
      "dataset": {
        "caption": "Raw qualifying claim lines for two patients before deduplication",
        "columns": [
          "person_id",
          "service_date",
          "code",
          "code_system",
          "claim_type"
        ],
        "rows": [
          [
            1001,
            "2023-03-14",
            "27447",
            "CPT",
            "professional"
          ],
          [
            1001,
            "2023-03-14",
            "0SRC0J9",
            "ICD10PCS",
            "facility"
          ],
          [
            1002,
            "2023-07-22",
            "27447",
            "CPT",
            "professional"
          ],
          [
            1002,
            "2023-07-23",
            "0SRC0J9",
            "ICD10PCS",
            "facility"
          ]
        ]
      },
      "steps": [
        "Build a union code list: CPT 27447 covers total knee replacement on physician and outpatient-facility claims; ICD-10-PCS 0SRC0J9 covers the same surgery on inpatient-facility claims. Both are needed because the same operation is billed in two different coding languages depending on who submits the bill.",
        "Pull every paid claim line that matches any code in the union list. This returns 4 rows for 2 patients — 2 rows per patient, one professional and one facility.",
        "For patient 1002 the two service dates differ by one day (Jul 22 vs Jul 23), which is normal because the hospital and the surgeon may process and submit their bills on slightly different dates for the same surgery.",
        "Apply a 30-day deduplication window: for each patient, group all qualifying lines that fall within 30 days of the earliest qualifying date and collapse them into a single event dated at the earliest service date. Patient 1001 keeps date 2023-03-14; patient 1002 keeps date 2023-07-22.",
        "After deduplication: 2 patients, 2 distinct procedure events — one per patient. Without deduplication the raw row count was 4, which would wrongly suggest 4 procedures."
      ],
      "result": "After deduplication: 2 unique knee replacement procedures identified (1 per patient). Raw row count before deduplication was 4 rows. Deduplication removed 2 duplicate rows (1 per patient), yielding the correct count of 2 procedures."
    },
    "prerequisites": [
      "claims-analysis",
      "exposure-episode-construction-rwe",
      "acute-event-deduplication-window-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Multi-system union (CPT/HCPCS + ICD-10-PCS + revenue) incident first-event",
        "description": "Captures the procedure across carrier/physician (CPT/HCPCS), inpatient facility (ICD-10-PCS), and outpatient facility (CPT/HCPCS + revenue center) claims, then collapses qualifying lines within an acute-event window to one incident event dated at the earliest qualifying service date.",
        "edge_cases": [
          "Facility and professional claims for one surgery carry different service dates - collapse to one event and take the earliest date.",
          "Claim reversals/denials and resubmissions produce phantom duplicates - keep adjudicated/paid lines or collapse within the window.",
          "Bilateral or staged procedures (cataract second eye, contralateral joint) can be mis-read as a recurrence or a new patient when laterality is uncoded."
        ],
        "data_source_notes": "claims: build separate code lists per system and per file type, then UNION; require continuous FFS-observable enrollment across the washout. EHR: map local procedure dictionary to CPT/SNOMED before unioning."
      },
      {
        "name": "Procedure-as-utilization count (HCRU)",
        "description": "Counts every qualifying procedure claim per person-time to produce a rate (procedures per 1,000 person-years) rather than a single incident flag; used for resource-utilization and cost studies.",
        "edge_cases": [
          "Counting claim lines instead of distinct acts double-counts the facility-plus-professional pair.",
          "Global/bundled surgical packages collapse multiple services into one payment, distorting line counts."
        ],
        "data_source_notes": "claims: collapse to distinct acts first, then count; restrict the denominator to FFS-observable, continuously enrolled person-time and exclude MA-only months."
      },
      {
        "name": "Registry/EHR-adjudicated procedure with laterality and approach",
        "description": "Uses registry or operative-note detail (NSQIP/STS/SEER, surgical case logs) to define the procedure with laterality, surgical approach, and device, optionally linked to claims for longitudinal follow-up.",
        "edge_cases": [
          "Registry inclusion is a selected subset (participating sites, eligible cases) - the procedure rate is not population-based without a denominator.",
          "Operative detail in free text requires NLP when structured fields are blank."
        ],
        "data_source_notes": "registry/EHR: strongest for the act itself; link to claims for follow-up and to a death index for mortality and censoring."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Single coding-system rule (e.g., CPT-only)",
        "pros_of_this": "A multi-system union captures inpatient and outpatient cases regardless of site of service, raising sensitivity and avoiding differential under-capture in sicker (inpatient) patients.",
        "cons_of_this": "More programming, cross-file date reconciliation for the same act, and a larger duplicate burden the de-duplication window must absorb.",
        "when_to_prefer": "Any consequential comparative, safety, utilization, cost, or regulatory-grade analysis where completeness across sites of care matters."
      },
      {
        "compared_to": "Structured EHR/registry procedure capture",
        "pros_of_this": "Claims capture is complete across all reimbursed sites of care because billing is the reason the code exists.",
        "cons_of_this": "Claims lack clinical granularity (laterality, approach, device, surgeon) and inherit billing artifacts.",
        "when_to_prefer": "When cross-site completeness outweighs the need for intra-operative clinical detail; use EHR/registry or linkage when granularity is decision-relevant."
      },
      {
        "compared_to": "Counting all qualifying procedure claims",
        "pros_of_this": "Incident first-event logic (washout + first qualifying act) correctly defines a 'first procedure' cohort and prevents staged/revision procedures from entering as new patients.",
        "cons_of_this": "Discards repeat procedures that a utilization rate needs.",
        "when_to_prefer": "Cohort entry and procedure-as-exposure designs; use counts for HCRU and cost-per-procedure analyses."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Procedures live on both carrier (CPT/HCPCS) and institutional files (ICD-10-PCS inpatient; CPT/HCPCS + revenue outpatient). Build per-system code lists, UNION them, collapse facility+professional lines for one act within an acute window, take the earliest service date as the procedure date, keep adjudicated/paid lines, require continuous FFS-observable enrollment across the washout, and exclude MA-only person-time from rates.",
      "ehr": "Procedures sit in structured procedure/order tables, case logs, and operative notes. Map the local procedure dictionary to CPT/SNOMED, NLP operative notes when structured fields are blank, and treat out-of-network care as differential leakage; define observation windows explicitly.",
      "registry": "Strongest for the act, laterality, approach, and adjudicated peri-operative outcomes, but inclusion is a selected subset and follow-up is thin; link to claims for longitudinal follow-up and to a death index.",
      "linked": "Linked claims-EHR-registry gives procedure detail plus completeness plus mortality, but introduces linkage selection and operative-note/facility/professional date discrepancies that must be reconciled before fixing the procedure date."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\n\nWASHOUT_DAYS   = 365  # continuous, FFS-observable lookback that makes \"first procedure\" truly first\nDEDUP_DAYS     = 30   # collapse all qualifying lines for one act within this window into a single event\n\n# Per-system code lists for the target procedure (union across billing streams).\ncode_sets = {\n    \"CPT\":      {\"43775\", \"43644\", \"43645\"},   # sleeve gastrectomy; RYGB (professional/outpatient)\n    \"HCPCS\":    set(),\n    \"ICD10PCS\": {\"0DB64Z3\", \"0D164ZA\"},         # inpatient facility bypass/sleeve\n    \"REVENUE\":  set(),\n}\n\ndef identify_incident_procedure(claim_lines: pd.DataFrame, enroll: pd.DataFrame) -> pd.DataFrame:\n    cl = claim_lines.copy()\n    # Keep only adjudicated/paid lines that match the union code set for their own coding system.\n    in_set = cl.apply(lambda r: r[\"code\"] in code_sets.get(r[\"code_system\"], set()), axis=1)\n    qual = cl[(cl[\"claim_status\"] == \"PAID\") & in_set].sort_values([\"person_id\", \"service_date\"])\n\n    # Candidate index = earliest qualifying service date per person (across all systems/files).\n    idx = (qual.groupby(\"person_id\")[\"service_date\"].min()\n               .rename(\"index_date\").reset_index())\n\n    # New-/incident-event check: no qualifying procedure in the washout before the candidate index.\n    q = qual.merge(idx, on=\"person_id\")\n    prior = q[(q[\"service_date\"] < q[\"index_date\"]) &\n              (q[\"service_date\"] >= q[\"index_date\"] - pd.Timedelta(days=WASHOUT_DAYS))]\n    idx = idx[~idx[\"person_id\"].isin(prior[\"person_id\"])].copy()\n\n    # Continuous, FFS-observable enrollment spanning the full washout through index (no MA-only gaps).\n    e = enroll.merge(idx, on=\"person_id\")\n    covers = ((e[\"enroll_start\"] <= e[\"index_date\"] - pd.Timedelta(days=WASHOUT_DAYS)) &\n              (e[\"enroll_end\"]   >= e[\"index_date\"]) & e[\"ffs_observable\"])\n    eligible = e.loc[covers, 
\"person_id\"].unique()\n    idx = idx[idx[\"person_id\"].isin(eligible)].copy()\n\n    # Collapse facility + professional lines for the SAME act (within DEDUP_DAYS of index) -> one event,\n    # and record the distinct billing streams that observed it (a sensitivity/validity diagnostic).\n    ev = qual.merge(idx, on=\"person_id\")\n    ev = ev[(ev[\"service_date\"] >= ev[\"index_date\"]) &\n            (ev[\"service_date\"] <= ev[\"index_date\"] + pd.Timedelta(days=DEDUP_DAYS))]\n    streams = ev.groupby(\"person_id\")[\"code_system\"].nunique().rename(\"n_systems\")\n    return idx.merge(streams, on=\"person_id\", how=\"left\")",
        "description": "Multi-system incident procedure identification from claims. Required inputs (cleaned, de-duplicated to\nline level):\n  claim_lines : person_id, service_date (datetime), code, code_system in {'CPT','HCPCS','ICD10PCS','REVENUE'},\n                claim_status in {'PAID','DENIED','REVERSED'}, place_of_service\n  enroll      : person_id, enroll_start, enroll_end, ffs_observable (bool)  # False = MA-only / unobservable\ncode_sets maps each coding system to its qualifying codes for THIS procedure. Returns one row per incident\npatient: the procedure (index) date after a washout, collapsing facility+professional lines within a window.",
        "dependencies": [
          "pandas"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\nWASHOUT_DAYS <- 365L\nDEDUP_DAYS   <- 30L\n\ncode_sets <- list(\n  CPT      = c(\"43775\", \"43644\", \"43645\"),  # sleeve; RYGB (professional/outpatient)\n  HCPCS    = character(0),\n  ICD10PCS = c(\"0DB64Z3\", \"0D164ZA\"),        # inpatient facility\n  REVENUE  = character(0)\n)\n\nidentify_incident_procedure <- function(claim_lines, enroll) {\n  setDT(claim_lines); setDT(enroll)\n  cl <- copy(claim_lines)\n  # Qualifying = paid AND code is in the union set for its own coding system.\n  cl[, in_set := mapply(function(cd, sys) cd %in% code_sets[[sys]], code, code_system)]\n  qual <- cl[claim_status == \"PAID\" & in_set == TRUE][order(person_id, service_date)]\n\n  # Candidate index = earliest qualifying service date per person.\n  idx <- qual[, .(index_date = min(service_date)), by = person_id]\n\n  # Incident check: drop anyone with a qualifying procedure in the washout before index.\n  q <- merge(qual, idx, by = \"person_id\")\n  prior_ids <- unique(q[service_date < index_date &\n                        service_date >= index_date - WASHOUT_DAYS, person_id])\n  idx <- idx[!person_id %chin% prior_ids]\n\n  # Continuous, FFS-observable enrollment across the full washout through index.\n  e <- merge(enroll, idx, by = \"person_id\")\n  ok <- e[enroll_start <= index_date - WASHOUT_DAYS &\n          enroll_end   >= index_date & ffs_observable == TRUE, unique(person_id)]\n  idx <- idx[person_id %chin% ok]\n\n  # Collapse facility+professional lines within DEDUP_DAYS of index; count distinct billing streams.\n  ev <- merge(qual, idx, by = \"person_id\")\n  ev <- ev[service_date >= index_date & service_date <= index_date + DEDUP_DAYS]\n  streams <- ev[, .(n_systems = uniqueN(code_system)), by = person_id]\n  merge(idx, streams, by = \"person_id\", all.x = TRUE)\n}",
        "description": "Multi-system incident procedure identification with data.table. Inputs mirror the Python version:\n  claim_lines : person_id, service_date (Date), code, code_system in {'CPT','HCPCS','ICD10PCS','REVENUE'},\n                claim_status, place_of_service\n  enroll      : person_id, enroll_start, enroll_end, ffs_observable (logical)\nReturns one incident row per patient with the procedure (index) date and the count of billing streams\nthat observed the act.",
        "dependencies": [
          "data.table"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "%let washout = 365;\n%let dedup   = 30;\n\n/* Qualifying lines = paid AND in the union code set (joined on system + code). */\nproc sql;\n  create table qual as\n  select c.person_id, c.service_date, c.code_system\n  from work.claim_lines c\n  inner join work.codeset s\n    on c.code_system = s.code_system and c.code = s.code\n  where c.claim_status = 'PAID';\nquit;\n\n/* Candidate index = earliest qualifying service date per person (across systems/files). */\nproc sql;\n  create table idx as\n  select person_id, min(service_date) as index_date format=date9.\n  from qual group by person_id;\nquit;\n\n/* Incident restriction: exclude any qualifying procedure inside the washout before index. */\nproc sql;\n  create table incident as\n  select i.*\n  from idx i\n  where not exists (\n    select 1 from qual p\n    where p.person_id = i.person_id\n      and p.service_date <  i.index_date\n      and p.service_date >= i.index_date - &washout\n  );\nquit;\n\n/* Continuous, FFS-observable enrollment across the full washout through index (no MA-only spans). */\nproc sql;\n  create table cohort0 as\n  select n.person_id, n.index_date\n  from incident n\n  where exists (\n    select 1 from work.enroll e\n    where e.person_id = n.person_id\n      and e.ffs_observable = 1\n      and e.enroll_start <= n.index_date - &washout\n      and e.enroll_end   >= n.index_date\n  );\nquit;\n\n/* Collapse facility+professional lines within the de-dup window to ONE act; */\n/* count distinct billing streams as a capture/validity diagnostic.          */\nproc sql;\n  create table cohort as\n  select c.person_id, c.index_date,\n         (select count(distinct q.code_system) from qual q\n            where q.person_id = c.person_id\n              and q.service_date >= c.index_date\n              and q.service_date <= c.index_date + &dedup) as n_systems\n  from cohort0 c;\nquit;",
        "description": "Multi-system incident procedure identification in SAS via PROC SQL. Required input datasets\n(post data-management):\n  work.claim_lines : person_id, service_date, code, code_system ('CPT'/'HCPCS'/'ICD10PCS'/'REVENUE'),\n                     claim_status ('PAID'/'DENIED'/'REVERSED'), place_of_service\n  work.enroll      : person_id, enroll_start, enroll_end, ffs_observable (0/1)\n  work.codeset     : code_system, code   /* the UNION code list for this procedure, one row per code */\nProduces work.cohort: one incident row per patient with the procedure (index) date and the number of\ndistinct billing streams that observed the act within the de-duplication window.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Carrier[Carrier / physician claims<br/>CPT / HCPCS] --> Union[Union code set<br/>match each code to its own system]\n  Inpat[Inpatient facility claims<br/>ICD-10-PCS] --> Union\n  Outpat[Outpatient facility claims<br/>CPT / HCPCS + revenue] --> Union\n  Union --> Paid[Keep adjudicated / paid lines]\n  Paid --> First[Earliest qualifying service date<br/>= candidate index]\n  First --> Wash[Washout: no qualifying procedure<br/>in prior 365 days + continuous FFS enrollment]\n  Wash --> Dedup[Collapse facility + professional lines<br/>within de-dup window to ONE act]\n  Dedup --> T0[Procedure date = time zero / event date<br/>earliest qualifying service date]\n  T0 --> Sens[Sensitivity: de-dup window, CPT-only vs union,<br/>PPV vs operative notes]",
        "caption": "Assembling a procedure from multiple billing streams into a single incident, date-fixed event. The union across coding systems maximizes capture; the washout enforces incidence; the de-duplication window collapses the facility and professional lines for one act before the procedure date is fixed as time zero.",
        "alt_text": "Flowchart showing carrier, inpatient facility, and outpatient facility claims feeding a union code set, then paid-line filtering, earliest-date selection, washout, de-duplication, time-zero assignment, and sensitivity analyses.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  Dx[Obesity diagnosis<br/>cohort entry] -->|days to surgery| Surg[Bariatric procedure date]\n  Surg --> FU[Follow-up for diabetes remission]\n  Dx -. immortal time .-> Surg\n  classDef bad fill:#fde,stroke:#b00;\n  class Surg bad;",
        "caption": "Why time zero must be the procedure date. If follow-up is anchored at diagnosis but the procedure arm is defined by an act that occurs later, the days from diagnosis to surgery are immortal time - survival required to receive the procedure - biasing the procedure toward apparent benefit (Suissa).",
        "alt_text": "Diagram showing diagnosis to procedure interval as immortal time that must not be counted as exposed follow-up.",
        "source_type": "illustrative",
        "source_citations": [
          "suissa-2008"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "is_variant_of",
        "target_slug": "exposure-episode-construction-rwe",
        "notes": "Procedure identification is the procedure-specific case of building an exposure episode/variable from raw claims and EHR records."
      },
      {
        "relation_type": "used_with",
        "target_slug": "acute-event-deduplication-window-rwe",
        "notes": "The facility and professional claims for a single surgery must be collapsed within a de-duplication window so one act is not counted as several procedures."
      },
      {
        "relation_type": "used_with",
        "target_slug": "time-zero-index-date-alignment-rwe",
        "notes": "The procedure date is the time-zero / index date; misaligning it (e.g., to diagnosis) is the usual source of immortal-time bias in procedure-as-exposure studies."
      },
      {
        "relation_type": "see_also",
        "target_slug": "immortal-time-bias-handling",
        "notes": "Treating a later procedure as a time-fixed exposure from an earlier cohort-entry date misclassifies pre-procedure survival as exposed person-time."
      },
      {
        "relation_type": "see_also",
        "target_slug": "claims-outcome-algorithm-ppv-sensitivity-rwe",
        "notes": "When the procedure is an outcome, its code set must be validated for PPV and sensitivity against a reference standard such as operative notes."
      },
      {
        "relation_type": "see_also",
        "target_slug": "medicare-ffs-ma-commercial-claims-differences-rwe",
        "notes": "Medicare Advantage encounter data are incomplete for procedures; FFS vs MA differences determine which person-time can observe the act."
      },
      {
        "relation_type": "used_with",
        "target_slug": "infused-biologic-administration-capture-rwe",
        "notes": "A sibling capture problem for procedure/administration coding (J-codes/HCPCS), sharing the multi-stream union and de-duplication logic."
      }
    ],
    "aliases": [
      "procedure identification in claims",
      "procedure measurement in claims and EHR",
      "CPT HCPCS ICD-10-PCS procedure capture",
      "surgical procedure code list development",
      "procedure-as-exposure definition"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "product-registry",
    "name": "Product/Exposure Registry",
    "short_definition": "An organized, prospective observational system that enrolls patients on the basis of a specific exposure (a drug, device, or biologic) and follows them with standardized data collection to characterize utilization, safety, and effectiveness in routine care.",
    "long_description": "A **product (exposure) registry** is a study design in which the *entry criterion is the exposure itself* — patients are\nenrolled because they received a particular drug, biologic, vaccine, or device — and are then followed prospectively with a\npre-specified, standardized data-collection protocol. It is distinct from a **disease/condition registry** (entry criterion =\na diagnosis) and from a **health-services registry** (entry criterion = an encounter or procedure). The product registry is\nthe workhorse of post-authorization safety and effectiveness commitments: pregnancy exposure registries, biologic/DMARD\nregistries (e.g., rheumatology and IBD biologics), rare-disease enzyme-replacement registries, and device registries are all\ninstances of this design. Because enrollment is anchored to exposure, the registry is, at its core, an **exposure cohort\nwithout a built-in comparator** — and that single structural fact drives almost everything about when it is useful and when\nit is dangerous.\n\n**Core conceptual distinction.** Three design features define a product registry and separate it from neighboring designs.\n(1) *Exposure-anchored enrollment*: the cohort is assembled by identifying exposed patients (often at or near initiation),\nnot by sampling a source population and observing who gets exposed. (2) *Prospective, protocolized primary data collection*:\nunlike a secondary-use claims/EHR study, a registry typically collects new, purpose-built data (clinical severity, outcomes\nadjudicated against definitions, patient-reported outcomes) on a fixed schedule. 
(3) *Open-ended descriptive estimand by\ndefault*: the native output is a *non-comparative* incidence/rate or proportion among the exposed — \"what is the rate of\noutcome Y in patients taking drug X.\" The moment a causal contrast (drug X vs an alternative) is required, the registry must\nborrow comparative machinery from cohort methodology (an internal or external active comparator, time-zero alignment, new-user\nrestriction, propensity adjustment). A registry is therefore best understood as *a data-collection platform that hosts cohort\nstudies*, not as a causal method in itself. Conflating \"we ran a registry\" with \"we estimated an effect\" is the single most\ncommon interpretive error.\n\n**Pros, cons, and trade-offs** (comparative, naming alternatives).\n- **vs secondary-use claims studies:** A registry can capture variables claims never see — disease severity, biomarker/lab\n  values, indication nuance, dosing, patient-reported outcomes, and *adjudicated* endpoints — and can target rare exposures\n  or orphan products that are invisible or miscoded in claims. Cost: it is expensive, slow to accrue, prone to selective\n  enrollment (sicker or more engaged patients), and suffers loss to follow-up; person-time is only as complete as voluntary\n  return visits. **Prefer a registry** when the key confounders/outcomes are unmeasurable in claims, when the product is rare,\n  or when a regulator mandates adjudicated safety follow-up. 
**Prefer claims** when you need population-representative person-time,\n  a concurrent active comparator, and large numbers cheaply.\n- **vs disease/condition registry:** The product registry answers \"what happens to users of X,\" the disease registry answers\n  \"what happens to patients with condition D regardless of treatment.\" A disease registry yields a natural within-population\n  comparator (treated vs untreated for the same disease) and better external validity for the disease; a product registry\n  yields cleaner exposure ascertainment and dosing but typically *no internal untreated comparator*. **Prefer the disease\n  registry** when the question is comparative within an indication; **prefer the product registry** when exposure detail,\n  a specific safety signal, or a regulatory product commitment is the driver.\n- **vs single-arm registry with an external comparator (RWD or historical control):** Adding an external comparator (matched\n  claims cohort, natural-history disease registry, or trial control arm) lets a product registry support a comparative\n  estimand. Cost: external-control comparisons import every threat of non-randomized, non-concurrent comparison — different\n  measurement instruments, calendar time, case mix, and unmeasured confounding — and are accepted by regulators only in narrow\n  circumstances (e.g., serious disease, large effect, no equipoise). **Prefer an internal active comparator** whenever the\n  registry can enroll initiators of an alternative therapy.\n- **vs target-trial / active-comparator new-user cohort in routine data:** An ACNU cohort in claims/EHR is usually a stronger\n  causal design for a head-to-head question and is far cheaper. **Prefer the registry** only when the comparative design cannot\n  measure the confounders or outcomes that matter, or when the data simply do not exist outside primary collection.\n\n**When to use** (decision rules). 
Use a product registry when: (a) a regulator requires post-authorization safety/effectiveness\nfollow-up of a *specific* product (PASS/PAES, REMS, conditional-approval commitment, pregnancy exposure registry); (b) the\nexposure is rare or newly launched and not yet reliably captured in administrative data; (c) the outcomes or confounders of\ninterest (severity, biomarkers, PROs, adjudicated events) are unobtainable from claims/EHR; or (d) you need a durable platform\nto host multiple nested cohort and case-control analyses over a product's lifecycle. Build a credible comparator into the\nprotocol from the outset (internal active comparator preferred; a pre-specified external comparator with a transparent bias\nframing otherwise).\n\n**When NOT to use — and when it is actively misleading or dangerous** (decision rules).\n- **Do not use a single-arm product registry to make a comparative effectiveness claim.** A bare exposed-only rate has no\n  counterfactual; comparing it informally to \"what we expected\" or to a literature rate is an external-control comparison in\n  disguise, with all its confounding, and regulators/HTA bodies will (rightly) reject it for anything but the most extreme\n  effects in serious disease.\n- **Do not use it when selective enrollment threatens the estimand.** Voluntary, physician-initiated, or consent-gated\n  enrollment can enrich for severity, adherence, or healthy-volunteer effects; an incidence rate from such a cohort does not\n  generalize and can be biased in either direction. 
If you cannot characterize who enrolls versus the source population, the\n  rate is uninterpretable.\n- **Do not use it for a head-to-head question that a routine-data ACNU cohort can answer.** Spending years accruing a registry\n  to estimate something a propensity-adjusted new-user claims cohort delivers in months is poor methodology and poor stewardship.\n- **Beware immortal time and prevalent-user enrollment.** If enrollment occurs at a *prevalent* clinic visit rather than at\n  initiation, survivors are over-represented (depletion of susceptibles) and the gap between true initiation and enrollment is\n  immortal person-time. Enroll incident users and set time zero at initiation, or model the left truncation explicitly.\n- **Beware differential loss to follow-up.** Because registry person-time depends on returning, patients who do poorly (or die)\n  may stop contributing; informative censoring biases rates downward unless mortality is captured via linkage and censoring is\n  modeled.\n\n**Data-source operational depth.**\n- **Primary registry data (the registry's own CRFs):** Greatest control over exposure detail, severity, dosing, and adjudicated\n  outcomes, but completeness hinges on site behavior. Failure modes: selective/non-consecutive enrollment, missing follow-up\n  visits, site-to-site definition drift, and over-representation of academic centers. Workarounds: enrollment logs to quantify\n  the consecutive-eligible fraction, central adjudication with explicit endpoint definitions, query rules for missing visits,\n  and source-data verification audits.\n- **Claims linkage (to enrich/validate a registry):** Linking the registry to claims fills the person-time and outcome gaps\n  that voluntary follow-up leaves — continuous-enrollment spans define observability, and inpatient/ED/pharmacy claims capture\n  events that occur outside study visits. 
Failure modes: **Medicare Advantage (MA) person-time lacks fee-for-service claims**,\n  so a registrant who is MA-only contributes exposure but no claims-based outcome capture, biasing rates among the elderly;\n  sample fills, 90-day mail-order, and free product distort apparent exposure duration. Workaround: restrict claims-based\n  person-time to enrollees with observable benefit (FFS Parts A/B/D or a commercial pharmacy benefit) and treat MA-only spans\n  as censored for claims-ascertained outcomes.\n- **EHR linkage:** Adds labs, problem lists, and notes to sharpen indication and baseline severity, but capture is visit-driven\n  and bounded by the network; a patient who seeks care elsewhere is differentially unobserved. Reconcile order/administration\n  dates with the registry's recorded initiation date before assigning time zero.\n- **Linked registry–claims–vital records:** The strongest substrate — registry severity + claims completeness + a death index\n  to handle the **differential competing risk of death** that varies by exposure in elderly cohorts (treating death as a\n  censoring event rather than a competing risk overstates the cumulative incidence of non-fatal outcomes). Cost: linkage\n  introduces selection (only the linkable subset) and date-discrepancy reconciliation.\n\n**Worked claims-linked example.** Commitment: a pregnancy exposure registry for a new biologic, with a claims-linked safety\nanalysis of major congenital malformations (MCM). (1) *Exposure-anchored enrollment*: a pregnant person is registered upon a\npharmacy fill of the study biologic (NDC + `fill_date` + `days_supply`) during a defined gestational window, with `index_date`\n= first qualifying fill in pregnancy. 
(2) *Continuous enrollment / observability*: require continuous medical + pharmacy\nenrollment from the estimated last-menstrual-period through delivery + 90 days, with no MA-only spans, so both the exposure\nand the infant's diagnoses are observable in claims; mother–infant linkage establishes the outcome denominator. (3) *Washout\n/ incident use* (if a comparative arm is added): no fill of the biologic in the 180 days before LMP, defining incident\ngestational exposure; the comparator arm enrolls incident users of a guideline alternative for the same indication. (4)\n*Outcome*: MCM coded from infant inpatient/outpatient claims in the first year (first-event coding: earliest qualifying\nmalformation dx per infant), adjudicated against the registry's MACDP-style algorithm where charts are available. (5)\n*Person-time / censoring*: follow from `index_date` to first MCM, fetal loss, disenrollment, or end of data; capture fetal\nand infant death via the linked vital-records/death index so that pregnancy loss is handled as a competing event, not silently\ncensored. (6) *Analysis*: report the exposed MCM proportion with exact confidence intervals as the primary descriptive\nestimand; for the comparative arm, align time zero, adjust pre-pregnancy confounders with a propensity score measured only in\nthe baseline window, and pre-specify the external-comparison bias framing (national MCM baseline) only as a secondary\nbenchmark, not the primary inference.",
    "primary_category": "Study_Design",
    "tags": [
      "product-registry",
      "exposure-registry",
      "pregnancy-registry",
      "post-authorization-safety",
      "pass",
      "registry",
      "primary-data-collection",
      "exposure-cohort"
    ],
    "applies_to_study_types": [
      "product_registry"
    ],
    "data_sources": [
      "registry",
      "claims",
      "ehr",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1002/pds.4297",
        "url": "https://doi.org/10.1002/pds.4297",
        "citation_text": "Berger ML, Sox H, Willke RJ, et al. Good practices for real-world data studies of treatment and/or comparative effectiveness: recommendations from the joint ISPOR-ISPE Special Task Force on real-world evidence in health care decision making. Pharmacoepidemiology and Drug Safety. 2017;26(9):1033-1039.",
        "year": 2017,
        "authors_short": "Berger et al.",
        "notes": "Joint ISPOR-ISPE good-practice framework situating prospective registries among real-world data study designs and setting the credibility bar for exposure-anchored evidence."
      },
      {
        "role": "explain",
        "doi": "10.2147/CLEP.S91125",
        "url": "https://doi.org/10.2147/CLEP.S91125",
        "citation_text": "Schmidt M, Schmidt SAJ, Sandegaard JL, Ehrenstein V, Pedersen L, Sørensen HT. The Danish National Patient Registry: a review of content, data quality, and research potential. Clinical Epidemiology. 2015;7:449-490.",
        "year": 2015,
        "authors_short": "Schmidt et al.",
        "notes": "Model description of registry content, coding, completeness, and validation - the data-quality questions every product registry protocol must answer (capture, accuracy, missingness, linkage)."
      },
      {
        "role": "demonstrate",
        "doi": "10.1016/j.jclinepi.2010.10.006",
        "url": "https://doi.org/10.1016/j.jclinepi.2010.10.006",
        "citation_text": "Benchimol EI, Manuel DG, To T, Griffiths AM, Rabeneck L, Guttmann A. Development and use of reporting guidelines for assessing the quality of validation studies of health administrative data. Journal of Clinical Epidemiology. 2011;64(8):821-829.",
        "year": 2011,
        "authors_short": "Benchimol et al.",
        "notes": "Operational template for validating registry/administrative case definitions (sensitivity, specificity, PPV) before rates from a registry can be trusted."
      },
      {
        "role": "use",
        "doi": "10.1016/j.jval.2017.08.3018",
        "url": "https://doi.org/10.1016/j.jval.2017.08.3018",
        "citation_text": "Wang SV, Schneeweiss S, Berger ML, et al. Reporting to improve reproducibility and facilitate validity assessment for healthcare database studies V1.0. Value in Health. 2017;20(8):1009-1022.",
        "year": 2017,
        "authors_short": "Wang et al.",
        "notes": "Reporting standard (with RECORD-PE) for the cohort analyses a registry hosts - cohort entry, time zero, exposure and outcome definitions, and follow-up windows that must be pre-specified."
      }
    ],
    "plain_language_summary": "A product registry is a study where patients are enrolled specifically because they are using a particular drug or medical device, and then followed over time to track their health outcomes. Unlike a disease registry — where you enroll anyone with a certain diagnosis regardless of what they are taking — a product registry starts from the exposure: the question is what happens to the people who received this specific product. It is a descriptive tool by default, meaning it tells you rates and patterns among users of the product, but it cannot by itself tell you whether the product caused good or bad outcomes compared to an alternative.",
    "key_terms": [
      {
        "term": "exposure-anchored enrollment",
        "definition": "Patients join the registry because they received the specific drug or device being studied, not because of their diagnosis or condition."
      },
      {
        "term": "prospective follow-up",
        "definition": "Patients are enrolled first and then observed going forward in time, so the study collects new data as events happen rather than looking backward through old records."
      },
      {
        "term": "non-comparative incidence rate",
        "definition": "How often an outcome (like a side effect) occurs per unit of follow-up time (for example, per 100 patient-years) among people using the product, without comparing that rate to any group not using the product."
      },
      {
        "term": "active comparator",
        "definition": "A second group enrolled in the same registry who are starting a different, alternative therapy for the same condition, allowing a head-to-head comparison."
      },
      {
        "term": "claims-observable person-time",
        "definition": "The stretches of follow-up time during which a patient has active insurance coverage that generates medical claims records, so that outcomes can actually be detected in the data."
      }
    ],
    "worked_example": {
      "scenario": "A pharmaceutical company launches a new biologic for rheumatoid arthritis and is required by regulators to track its real-world safety for five years. They open a product registry: any adult who starts this biologic at a participating rheumatology clinic is invited to enroll. Three patients enroll, each on the day of their own first infusion. The registry records each infusion date, the dose given, any serious infections or hospitalizations, and a disease-severity score at every visit. We want to see what basic information the registry collects for each patient and understand what the registry can and cannot answer.",
      "dataset": {
        "caption": "Registry enrollment table: one row per patient at the time they join. Enrollment is triggered by the first dose of the product — this is what makes it a product registry rather than a disease registry.",
        "columns": [
          "person_id",
          "enrollment_date",
          "product",
          "indication",
          "enrolled_because_of"
        ],
        "rows": [
          [
            "PT-001",
            "2023-03-05",
            "BiologicX 200mg",
            "Rheumatoid arthritis",
            "Started BiologicX today"
          ],
          [
            "PT-002",
            "2023-04-12",
            "BiologicX 200mg",
            "Rheumatoid arthritis",
            "Started BiologicX today"
          ],
          [
            "PT-003",
            "2023-07-20",
            "BiologicX 200mg",
            "Rheumatoid arthritis",
            "Started BiologicX today"
          ]
        ]
      },
      "steps": [
        "Each patient's enrollment date is the date of their first BiologicX dose. That date becomes their time zero — the starting line for all follow-up.",
        "From time zero onward, the registry collects structured data at each scheduled clinic visit: infusion records, lab values, disease severity scores, and any serious adverse events reported by the treating physician.",
        "If PT-002 is hospitalized for a serious infection on 2023-09-01, that event is recorded and counted. The registry can then report: among all enrolled patients, how many serious infections occurred per 100 patient-years of follow-up.",
        "Notice who is NOT in this registry: a patient with rheumatoid arthritis who was never prescribed BiologicX would not appear here, even if they have the same diagnosis as PT-001. That is the defining feature of a product registry versus a disease registry.",
        "Because there is no comparison group of patients who did NOT take BiologicX, the registry alone cannot tell you whether BiologicX causes more or fewer infections than, say, a different biologic. It only tells you the rate among BiologicX users."
      ],
      "result": "With 3 patients enrolled on different dates and each followed through 2023-12-31, total follow-up is approximately (301 + 263 + 164) = 728 patient-days, or about 2.0 patient-years (728 / 365.25). If 1 serious infection occurred (PT-002 in September), the crude rate is 1 event / 2.0 patient-years = 0.50 infections per patient-year, or 50 per 100 patient-years. PT-002 keeps contributing follow-up after the infection because this rate counts all serious infections, not only first events. This number describes BiologicX users only; whether that rate is high or low requires either a comparator group enrolled in the same registry or a separately designed comparison; it cannot be read from this registry alone."
    },
    "prerequisites": [
      "cohort-prospective",
      "new-user-design",
      "disease-registry"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Single-arm (descriptive) product registry",
        "description": "Enrolls only users of the product and reports non-comparative incidence rates, proportions, and utilization; the default regulatory PASS/pregnancy-registry form.",
        "edge_cases": [
          "No counterfactual - any comparison to expected/literature rates is an external-control comparison and must be framed as such.",
          "Selective enrollment biases the rate; quantify the consecutive-eligible fraction from enrollment logs."
        ],
        "data_source_notes": "registry: primary CRFs for severity/outcomes; link to claims/vital records to recover person-time and deaths lost to voluntary follow-up."
      },
      {
        "name": "Product registry with internal active comparator",
        "description": "Concurrently enrolls initiators of an alternative therapy for the same indication, converting the platform into an active-comparator new-user cohort with a comparative estimand.",
        "edge_cases": [
          "Differential enrollment or channeling between arms re-introduces confounding by indication; check baseline balance.",
          "Requires identical outcome ascertainment and follow-up rules across arms to avoid differential measurement."
        ],
        "data_source_notes": "registry: standardize CRFs across arms; align time zero at initiation in both arms."
      },
      {
        "name": "Product registry with pre-specified external comparator",
        "description": "Compares the exposed registry cohort against a matched claims cohort, natural-history disease registry, or historical/trial control when no internal comparator is feasible.",
        "edge_cases": [
          "Imports calendar-time, instrument, and case-mix differences plus unmeasured confounding; acceptable only for serious disease with large effects and no equipoise.",
          "Outcome definitions must be harmonized across the two data instruments before any contrast."
        ],
        "data_source_notes": "linked: reconcile measurement and calendar time; report quantitative bias analysis alongside the contrast."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Secondary-use claims/EHR cohort study",
        "pros_of_this": "Captures severity, dosing, biomarkers, PROs, and adjudicated outcomes; can reach rare or newly launched products invisible in claims.",
        "cons_of_this": "Expensive and slow; selective enrollment and loss to follow-up; person-time only as complete as voluntary visits.",
        "when_to_prefer": "When key confounders/outcomes are unmeasurable in claims, the product is rare, or a regulator mandates adjudicated primary-data safety follow-up."
      },
      {
        "compared_to": "Disease/condition registry",
        "pros_of_this": "Cleaner exposure ascertainment and dosing detail; directly serves a product-specific safety/effectiveness commitment.",
        "cons_of_this": "Usually no internal untreated comparator and weaker representativeness of the underlying disease population.",
        "when_to_prefer": "When exposure detail or a specific product signal drives the question rather than the disease writ large."
      },
      {
        "compared_to": "Active-comparator new-user cohort in routine data",
        "pros_of_this": "Measures confounders/outcomes that claims cannot; viable when comparative data simply do not exist.",
        "cons_of_this": "Far costlier and slower; weaker for head-to-head causal contrasts that a propensity-adjusted ACNU cohort handles.",
        "when_to_prefer": "Only when the comparative routine-data design cannot capture the variables that matter or the data are absent."
      }
    ],
    "implementation_notes_by_data_source": {},
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\nimport numpy as np\n\nPRE_WASHOUT_DAYS = 180   # incident-use lookback: no prior product fill\nOUTCOME_CODES = {\"Q00\", \"Q01\", \"Q02\"}  # example outcome dx prefixes (first-event coded)\n\ndef build_registry_cohort(enroll_reg, rx, coverage, dx):\n    # Time zero = first product fill at/after registry enrollment (anchor on EXPOSURE, not the clinic visit).\n    prod = rx.merge(enroll_reg[[\"person_id\", \"product_ndc\", \"reg_enroll_date\"]], on=\"person_id\")\n    prod = prod[(prod[\"ndc\"] == prod[\"product_ndc\"]) & (prod[\"fill_date\"] >= prod[\"reg_enroll_date\"])]\n    idx = (prod.sort_values([\"person_id\", \"fill_date\"])\n               .groupby(\"person_id\").first().reset_index()\n               .rename(columns={\"fill_date\": \"index_date\"})[[\"person_id\", \"index_date\"]])\n\n    # Incident-user restriction: no product fill in the PRE_WASHOUT_DAYS before time zero (avoid prevalent-user enrollment).\n    p = rx.merge(idx, on=\"person_id\")\n    prevalent = p[(p[\"fill_date\"] < p[\"index_date\"]) &\n                  (p[\"fill_date\"] >= p[\"index_date\"] - pd.Timedelta(days=PRE_WASHOUT_DAYS)) &\n                  (p[\"ndc\"].isin(enroll_reg[\"product_ndc\"].unique()))][\"person_id\"].unique()\n    idx = idx[~idx[\"person_id\"].isin(prevalent)].copy()\n\n    # Claims-observable follow-up: continuous coverage from time zero, EXCLUDING MA-only person-time (no FFS claims).\n    c = coverage.merge(idx, on=\"person_id\")\n    c = c[(~c[\"ma_only\"]) & (c[\"cov_end\"] >= c[\"index_date\"]) & (c[\"cov_start\"] <= c[\"index_date\"])]\n    obs = c.groupby(\"person_id\")[\"cov_end\"].max().reset_index(name=\"obs_end\")\n    cohort = idx.merge(obs, on=\"person_id\")  # registrants without observable FFS coverage drop out of the rate denominator\n\n    # First-event outcome on/after time zero within observable follow-up.\n    dx = dx.copy()\n    dx[\"is_outcome\"] = dx[\"dx_code\"].str[:3].isin(OUTCOME_CODES)\n    ev = 
dx[dx[\"is_outcome\"]].merge(cohort, on=\"person_id\")\n    ev = ev[(ev[\"dx_date\"] >= ev[\"index_date\"]) & (ev[\"dx_date\"] <= ev[\"obs_end\"])]\n    first_ev = ev.groupby(\"person_id\")[\"dx_date\"].min().reset_index(name=\"event_date\")\n\n    cohort = cohort.merge(first_ev, on=\"person_id\", how=\"left\")\n    cohort[\"event\"] = cohort[\"event_date\"].notna().astype(int)\n    end = cohort[\"event_date\"].fillna(cohort[\"obs_end\"])\n    cohort[\"person_days\"] = (end - cohort[\"index_date\"]).dt.days.clip(lower=0)\n    return cohort[[\"person_id\", \"index_date\", \"obs_end\", \"event\", \"event_date\", \"person_days\"]]\n\n# Non-comparative incidence rate (the registry's native descriptive estimand):\n# rate_per_1000_py = 1000 * cohort[\"event\"].sum() / (cohort[\"person_days\"].sum() / 365.25)",
        "description": "Build a product/exposure-registry cohort table from registry enrollment + claims-linked observability. This is COHORT\nCONSTRUCTION (the registry hosts the analysis), not effect estimation. Required inputs (cleaned, de-duplicated):\n  enroll_reg : registry enrollment -> person_id, reg_enroll_date (datetime), product_ndc, indication\n  rx         : pharmacy fills       -> person_id, fill_date (datetime), ndc, days_supply\n  coverage   : enrollment spans     -> person_id, cov_start, cov_end, ma_only (bool)  # ma_only lacks FFS claims\n  dx         : diagnoses            -> person_id, dx_date (datetime), dx_code          # for outcome / first-event coding\nReturns one row per eligible incident registrant with time zero, the observable follow-up end, and a non-comparative\nperson-time denominator. Outcome rates are computed only over claims-observable person-time so MA-only gaps are not counted.",
        "dependencies": [
          "pandas",
          "numpy"
        ],
        "source_citations": [
          "wang-2017"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\nPRE_WASHOUT_DAYS <- 180L\nOUTCOME_CODES <- c(\"Q00\", \"Q01\", \"Q02\")\n\nbuild_registry_cohort <- function(enroll_reg, rx, coverage, dx) {\n  setDT(enroll_reg); setDT(rx); setDT(coverage); setDT(dx)\n\n  # Time zero = first product fill at/after registry enrollment (exposure-anchored).\n  prod <- merge(rx, enroll_reg[, .(person_id, product_ndc, reg_enroll_date)], by = \"person_id\")\n  prod <- prod[ndc == product_ndc & fill_date >= reg_enroll_date]\n  setorder(prod, person_id, fill_date)\n  idx <- prod[, .(index_date = fill_date[1L]), by = person_id]\n\n  # Incident-user restriction: drop prevalent users with a product fill in the pre-washout window.\n  p <- merge(rx, idx, by = \"person_id\")\n  prevalent <- unique(p[fill_date < index_date &\n                        fill_date >= index_date - PRE_WASHOUT_DAYS &\n                        ndc %chin% unique(enroll_reg$product_ndc), person_id])\n  idx <- idx[!person_id %chin% prevalent]\n\n  # Claims-observable follow-up excluding MA-only person-time.\n  c <- merge(coverage, idx, by = \"person_id\")\n  c <- c[!ma_only & cov_end >= index_date & cov_start <= index_date]\n  obs <- c[, .(obs_end = max(cov_end)), by = person_id]\n  cohort <- merge(idx, obs, by = \"person_id\")\n\n  # First-event outcome within observable follow-up.\n  dx[, is_outcome := substr(dx_code, 1L, 3L) %chin% OUTCOME_CODES]\n  ev <- merge(dx[is_outcome == TRUE], cohort, by = \"person_id\")\n  ev <- ev[dx_date >= index_date & dx_date <= obs_end]\n  first_ev <- ev[, .(event_date = min(dx_date)), by = person_id]\n\n  cohort <- merge(cohort, first_ev, by = \"person_id\", all.x = TRUE)\n  cohort[, event := as.integer(!is.na(event_date))]\n  cohort[, end_date := fifelse(is.na(event_date), obs_end, event_date)]\n  cohort[, person_days := pmax(as.integer(end_date - index_date), 0L)]\n  cohort[, .(person_id, index_date, obs_end, event, event_date, person_days)]\n}",
        "description": "Registry cohort construction with data.table. Inputs mirror the Python version:\n  enroll_reg : person_id, reg_enroll_date (Date), product_ndc, indication\n  rx         : person_id, fill_date (Date), ndc, days_supply\n  coverage   : person_id, cov_start, cov_end, ma_only (logical)\n  dx         : person_id, dx_date (Date), dx_code\nReturns one incident-registrant row per person with claims-observable person-time and a first-event outcome flag.",
        "dependencies": [
          "data.table"
        ],
        "source_citations": [
          "wang-2017"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "%let pre_washout = 180;\n\n/* Time zero = first product fill at/after registry enrollment (exposure-anchored). */\nproc sql;\n  create table idx as\n  select r.person_id, min(r.fill_date) as index_date format=date9.\n  from work.rx r\n  inner join work.enroll_reg e\n    on r.person_id = e.person_id\n   and r.ndc = e.product_ndc\n   and r.fill_date >= e.reg_enroll_date\n  group by r.person_id;\nquit;\n\n/* Incident-user restriction: exclude prevalent users with a product fill in the pre-washout window. */\nproc sql;\n  create table incident as\n  select i.*\n  from idx i\n  where not exists (\n    select 1 from work.rx p, work.enroll_reg e\n    where p.person_id = i.person_id\n      and p.ndc = e.product_ndc and e.person_id = p.person_id\n      and p.fill_date <  i.index_date\n      and p.fill_date >= i.index_date - &pre_washout\n  );\nquit;\n\n/* Claims-observable follow-up: continuous coverage spanning time zero, EXCLUDING MA-only spans (no FFS claims). */\nproc sql;\n  create table cohort as\n  select i.person_id, i.index_date, max(c.cov_end) as obs_end format=date9.\n  from incident i\n  inner join work.coverage c\n    on c.person_id = i.person_id\n   and c.ma_only = 0\n   and c.cov_start <= i.index_date\n   and c.cov_end   >= i.index_date\n  group by i.person_id, i.index_date;\nquit;\n\n/* First-event outcome on/after time zero within observable follow-up. 
*/\nproc sql;\n  create table firstev as\n  select d.person_id, min(d.dx_date) as event_date format=date9.\n  from work.dx d inner join cohort c on d.person_id = c.person_id\n  where substr(d.dx_code,1,3) in ('Q00','Q01','Q02')\n    and d.dx_date >= c.index_date and d.dx_date <= c.obs_end\n  group by d.person_id;\nquit;\n\n/* MERGE with BY needs both inputs sorted by person_id; PROC SQL output order is not guaranteed. */\nproc sort data=cohort;  by person_id; run;\nproc sort data=firstev; by person_id; run;\n\ndata work.cohort;\n  merge cohort(in=a) firstev(in=b);\n  by person_id;\n  if a;\n  event = (event_date ne .);\n  end_date = coalesce(event_date, obs_end);\n  person_days = max(end_date - index_date, 0);\n  py = person_days / 365.25;\n  /* Log person-time offset for the Poisson model; zero person-time leaves log_py missing, so GENMOD drops the row. */\n  if py > 0 then log_py = log(py);\nrun;\n\n/* Non-comparative incidence rate among the exposed (the registry's native estimand). */\nproc genmod data=work.cohort;\n  model event = / dist=poisson link=log offset=log_py;\nrun;",
        "description": "Registry cohort construction in SAS (PROC SQL / data step). Required input datasets (post data-management):\n  work.enroll_reg : person_id, reg_enroll_date, product_ndc, indication\n  work.rx         : person_id, fill_date, ndc, days_supply\n  work.coverage   : person_id, cov_start, cov_end, ma_only (0/1)\n  work.dx         : person_id, dx_date, dx_code\nProduces work.cohort: one incident registrant per row with claims-observable person-time and a first-event outcome flag.\nThe descriptive incidence rate is computed with PROC GENMOD (Poisson, log person-time offset) - non-comparative by default.",
        "dependencies": [],
        "source_citations": [
          "wang-2017"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Reg[Registry enrollment<br/>anchored on PRODUCT exposure] --> Inc[Incident-user check<br/>no product fill in pre-washout]\n  Inc --> T0[Time zero = first product fill<br/>at/after enrollment]\n  T0 --> Obs[Claims-observable follow-up<br/>continuous coverage, exclude MA-only person-time]\n  Obs --> Out[Outcome ascertainment<br/>adjudicated CRF + first-event claims/dx]\n  Out --> Est{Estimand?}\n  Est -->|exposed-only| Desc[Descriptive incidence rate<br/>NON-comparative]\n  Est -->|comparative| Comp[Add internal active comparator<br/>or pre-specified external control]\n  Comp --> Adj[Time-zero alignment + PS adjustment<br/>+ quantitative bias analysis if external]",
        "caption": "A product registry is an exposure-anchored cohort platform. Its native output is a non-comparative incidence rate; any causal contrast requires borrowing comparator machinery (internal active comparator preferred).",
        "alt_text": "Flowchart from exposure-anchored registry enrollment through incident-user check, time zero at first product fill, claims-observable follow-up excluding MA-only person-time, outcome ascertainment, and a decision split between a descriptive exposed-only rate and a comparative analysis requiring an internal or external comparator.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  A[Product registry<br/>entry = exposure] -->|adds internal comparator| B[Active-comparator<br/>new-user cohort]\n  A -->|adds external control| C[Single-arm vs<br/>RWD/historical control]\n  D[Disease registry<br/>entry = diagnosis] -->|within-disease treated vs untreated| B\n  E[Claims/EHR<br/>secondary use] -->|cheaper, larger person-time| B\n  A -.captures severity/PRO/dosing.-> E",
        "caption": "Where the product registry sits among neighboring designs. It excels at exposure/outcome detail but lacks an internal comparator by default; comparative questions push toward an active-comparator new-user cohort.",
        "alt_text": "Conceptual map relating a product registry to active-comparator new-user cohorts, single-arm external-control comparisons, disease registries, and secondary-use claims/EHR studies.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "alternative_to",
        "target_slug": "active-comparator-new-user",
        "notes": "An active-comparator new-user cohort in routine data is usually the stronger, cheaper design for head-to-head causal questions; a product registry is preferred only when primary data collection is required to measure exposure, severity, or adjudicated outcomes."
      },
      {
        "relation_type": "used_with",
        "target_slug": "active-comparator-new-user",
        "notes": "When a comparative estimand is needed, the registry hosts an active-comparator new-user analysis - enroll initiators of an alternative therapy, align time zero, and adjust pre-index confounders."
      },
      {
        "relation_type": "see_also",
        "target_slug": "immortal-time-bias-handling",
        "notes": "Enrolling prevalent users at a clinic visit rather than at initiation creates immortal time and depletion of susceptibles; anchor time zero at first product exposure."
      },
      {
        "relation_type": "used_with",
        "target_slug": "propensity-score-methods-psm-iptw",
        "notes": "Comparative registry analyses balance pre-index confounders with PS matching, IPTW, or overlap weighting measured only in the baseline window."
      }
    ],
    "aliases": [
      "exposure registry",
      "product registry",
      "pregnancy exposure registry",
      "drug registry",
      "post-authorization safety registry"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "propensity-score-methods-psm-iptw",
    "name": "Propensity Score Methods (PSM, IPTW)",
    "short_definition": "A family of methods that use the estimated probability of treatment given measured baseline covariates, e(X)=Pr(A=1|X), to balance those covariates across treatment groups before estimating a causal contrast, via matching, inverse-probability-of-treatment weighting, overlap weighting, stratification, or covariate adjustment.",
    "long_description": "The **propensity score** is the conditional probability of receiving the treatment of interest given measured baseline\ncovariates, e(X)=Pr(A=1|X). Under the identification assumptions of conditional exchangeability (no unmeasured\nconfounding given X), positivity (0<e(X)<1 for every covariate pattern), and consistency, conditioning on the scalar e(X)\nrecovers the same covariate balance as conditioning on the full vector X (the balancing property of Rosenbaum and Rubin).\nIn pharmacoepidemiology the PS is the standard tool for making initiators of two treatments comparable on *measured*\nconfounders. It is an analysis step layered on top of design — it presupposes that time zero, eligibility, new-user status,\nthe active comparator, and the outcome/censoring rules are already correct. A perfectly balanced PS analysis built on a\nwrong time zero is still biased; the PS cannot repair immortal time, depletion of susceptibles, or a comparator that\nitself causes the outcome.\n\n**Core estimand distinction**. The choice of PS method *is* the choice of estimand, and the four common methods target\nfour different populations and answer four different questions:\n- **ATT (average treatment effect in the treated)** — what would happen to the *treated* population if untreated. Targeted\n  by 1:1 PS matching (matched-set ATT) and by SMR weighting (treated weight 1, comparator weight e(X)/(1-e(X))). This is\n  usually the policy-relevant estimand for a *safety* signal of a drug already in use.\n- **ATE (average treatment effect)** — what would happen to the *whole* eligible population if all vs none were treated.\n  Targeted by (stabilized) IPTW: treated weight Pr(A=1)/e(X), comparator weight Pr(A=0)/(1-e(X)). 
Most sensitive to\n  near-positivity violations because tail subjects with e(X)→0 or 1 receive enormous weights.\n- **ATO (average treatment effect in the overlap population)** — the effect in patients with genuine clinical equipoise.\n  Targeted by overlap weighting (treated weight 1-e(X), comparator weight e(X)); exactly balances the mean of every\n  covariate included in a logistic PS model, yields the smallest variance among balancing weights, and never produces\n  extreme weights. The estimand is a smoothly-weighted population, not a fixed clinical group, and the report must say so.\n- **ATM (matching weights)** — a weighting analogue of pair matching targeting the \"matchable\" subpopulation.\nNone of these is a cosmetic choice: ATT, ATE, and ATO can differ in magnitude and even sign when effect\nmodification by the PS is present, so the estimand must be pre-specified, not chosen after seeing results.\n\n**Pros, cons, and trade-offs**\n- **vs covariate adjustment / outcome regression on X:** PS reduces the confounder set to one dimension, decouples the\n  \"design\" stage (fit and balance the PS, blinded to outcomes) from the \"analysis\" stage, and lets you assess overlap and\n  balance explicitly before touching the outcome, a discipline that pure outcome regression lacks. Cost: outcome\n  regression can be more efficient when the outcome model is correct and is not derailed by positivity violations.\n  **Prefer PS** when you want an outcome-blind design stage and transparent balance diagnostics in a regulated study.\n- **vs doubly-robust estimation (AIPW / TMLE; see g-estimation and MSM entries):** single-PS methods are simpler to\n  specify, explain to reviewers, and audit. Cost: they are consistent only if the PS model is correct — a single line of\n  defense. Doubly-robust estimators stay consistent if *either* the PS or the outcome model is correct and can achieve\n  smaller asymptotic variance when both are. 
**Prefer plain PS** for transparency and when the confounder structure is well understood; escalate\n  to doubly-robust/g-methods for time-varying confounding affected by prior treatment, which PS cannot handle.\n- **vs high-dimensional PS (hdPS):** an investigator-specified PS is interpretable and easy to defend clause-by-clause.\n  Cost: it can miss empirical proxy confounders captured by hundreds of claims codes. **Prefer hdPS** (or hdPS as a\n  sensitivity analysis) in large claims data where unmeasured confounding by proxy is plausible.\n- **matching vs weighting, head to head:** matching yields a concrete, inspectable matched cohort and is robust to extreme\n  e(X) (poor-overlap subjects are simply unmatched), but it *discards* data, can depend on greedy sort order, and shifts\n  the estimand to the matched region. Weighting uses all data and targets a clean estimand, but inherits the full danger\n  of extreme weights (IPTW) unless you use stabilized weights, truncation, or overlap weights. Overlap weighting is the\n  modern default when positivity is shaky.\n\n**When to use**. Comparative effectiveness or safety of two (or more) treatments where confounding is by *measured*\nbaseline covariates, follow-up starts at a correctly defined time zero, and you can demonstrate overlap. PS is the\nstandard balancing step after an active-comparator new-user cohort is built. Use **matching or SMR weighting** when the\nquestion is about the treated (a safety signal for current users); **stabilized IPTW** for a population (ATE) effect with\nacceptable overlap; **overlap weighting** when overlap is limited or extreme weights threaten IPTW; **stratification or PS\nas a covariate** only as quick checks, because they balance less completely and obscure overlap.\n\n**When NOT to use — and when it is actively misleading or dangerous**\n- **Unmeasured or mismeasured confounding dominates.** PS balances only what is in X. 
In claims, frailty, disease\n  severity, BMI, smoking, and over-the-counter use are typically unmeasured; a beautifully balanced PS table provides\n  false reassurance. Pair PS with negative-control outcomes, E-values, or quantitative bias analysis — do not present a\n  balance table as proof of no confounding.\n- **Positivity is violated.** If one drug is reserved for sicker or renally-impaired patients, e(X) piles up near 0/1,\n  IPTW weights explode, matching discards most of one arm, and the surviving estimand no longer maps to any clinically\n  meaningful population. Trimming/restriction and overlap weighting each change the question; say so.\n- **Time-varying confounding affected by prior treatment.** Conditioning on post-baseline confounders that are also\n  mediators induces collider/over-adjustment bias; a baseline PS cannot fix this. Use marginal structural models with\n  IPTW over time or g-estimation instead.\n- **Treating the PS as a substitute for design.** Using post-index variables in the PS, or applying PS to a\n  prevalent-user cohort with immortal time, launders a biased cohort into a balanced-looking one — the most dangerous\n  failure mode because the diagnostics look clean.\n- **Few outcome events.** With sparse events, weighting inflates variance and destabilizes estimates; matched or\n  stratified analyses with Firth-penalized or exact methods are safer.\n\n**Data-source operational depth**\n- **Claims (FFS vs MA vs commercial):** Covariates are built from demographics, calendar time, baseline\n  diagnoses/procedures, prior drug classes, utilization intensity, and cost in the lookback window. *Failure mode:*\n  Medicare Advantage person-time lacks fee-for-service encounter claims, so confounders measured from claims are\n  systematically under-captured for MA enrollees — the PS is fit on degraded covariates and balance looks fine while real\n  confounding persists. 
Restrict to enrollees with complete Parts A/B/D (or commercial medical+pharmacy) and exclude\n  MA-only person-time. *Failure mode:* in elderly cohorts, death is a competing risk that often differs by exposure;\n  a PS-weighted Kaplan-Meier or cause-specific hazard answers a different question than a subdistribution (Fine-Gray)\n  analysis, and the two diverge — pre-specify which (see competing-risks entry). *Failure mode:* procedure-anchored index\n  dates can smuggle in immortal time that no PS will remove.\n- **EHR:** Labs, vitals, smoking, and BMI sharpen the PS but are missing not-at-random — captured only when ordered.\n  A missingness indicator vs multiple imputation are different estimands of e(X); site and encounter-frequency are\n  themselves confounders (sicker, more-monitored patients have more covariate data). Include site/visit-intensity terms\n  and treat loss to follow-up as potentially informative.\n- **Registry:** Strong on clinical confounders (stage, biomarker, ECOG, device details) that improve exchangeability, but\n  weak on complete pharmacy exposure — link to claims for fills and to a death index for censoring before fitting the PS.\n- **Linked claims–EHR–vital records:** The ideal substrate, but order vs fill vs service-date discrepancies must be\n  reconciled before fixing time zero, or baseline covariates leak across the index boundary and contaminate e(X).\n\n**Worked claims example.** Question: risk of hospitalized heart failure with an SGLT2 inhibitor vs a DPP-4 inhibitor in a\ncommercial + Medicare FFS database (ATT, because the signal concerns SGLT2i initiators). (1) Build the cohort: adults with\ntype 2 diabetes, 365 days of continuous medical+pharmacy FFS enrollment before the first qualifying `fill_date`, and no\nSGLT2i or DPP-4i fill in that 365-day washout (incident users of both arms); index_date = that first fill; arm = the NDC\ndispensed on index_date; exclude MA-only person-time. 
(2) Measure covariates only in `[index_date-365, index_date]`:\nage, sex, baseline HbA1c proxies, prior insulin, CKD/HF/MI diagnosis codes, prior-year inpatient count, and total paid\ncost. (3) Fit e(X) by logistic regression (or hdPS). (4) Because the estimand is ATT, use SMR weighting (treated=1,\ncomparator=e/(1-e)) or 1:1 caliper matching on the logit-PS (caliper 0.2 SD); confirm every post-balancing standardized\nmean difference <0.1 and inspect the e(X) overlap plot. (5) Define follow-up from index_date to first validated\nHF hospitalization, censoring at disenrollment, death, end of data, and — for an as-treated estimand — the end of the last\n`days_supply` plus a 30-day grace period or a switch to the other class. (6) Estimate the effect with a weighted Cox model\nusing robust (sandwich) standard errors, report the effective sample size, and run sensitivity analyses on washout length,\nweight truncation at the 1st/99th percentile, and a negative-control outcome to probe residual confounding. Reporting must\ninclude the PS model specification, pre/post balance, the overlap plot, the weight distribution and ESS, the trimming rule,\nthe targeted estimand, and whether outcome modeling was additionally used (never use post-index variables in the PS unless\nyou are explicitly fitting a longitudinal g-method).\n\n**Interpreting the output**\n\nConsider a PSM analysis of Drug A versus Drug B — the readmission scenario above — reporting HR = 0.82 (95% CI\n0.71–0.95) for 30-day hospital readmission after matching on 28 covariates, with post-match SMDs all below 0.10.\n\nFormal interpretation: Among matched Drug A initiators — the subset of treated patients for whom a comparably\nsimilar Drug B patient existed in the data — the instantaneous readmission rate was 18% lower (HR 0.82, 95%\nCI 0.71–0.95). This is the average treatment effect in the treated (ATT) within the matched population, not\namong all Drug A initiators. 
Under weighting, the analogous estimate applies to the inverse-probability-weighted\npseudo-population (ATE) under IPTW or to the ATT-weighted pseudo-population under SMR weights, depending\non the weight specification. Both are valid causal estimates only under two conditions: no unmeasured\nconfounding — every covariate that jointly predicts drug choice and readmission is captured and balanced — which\nis untestable, and positivity — every treated patient had a realistic probability of receiving the comparator —\nwhich can be probed only through the observed overlap. The 95% CI (0.71–0.95) excludes 1.0: intervals built this\nway cover the true HR in 95% of identically designed studies, and an estimate at least this far from 1.0 would\narise in fewer than 5% of such studies if the true HR were 1.0.\n\nPractical interpretation: Drug A initiators who closely resembled Drug B initiators had readmissions arriving\nat an 18% slower rate during follow-up. This is not a statement that 18% fewer total events occurred — the HR\nis not a risk ratio or a cumulative probability. Under PSM, patients without a suitable match are excluded;\nunder IPTW, they are retained and can receive very large weights; under overlap weighting, they are\ndown-weighted. The finding speaks to the matched or reweighted population, not all Drug A\nusers. Unmeasured factors — such as disease severity not captured in claims — remain uncontrolled and could\nbias the estimate in either direction.",
    "primary_category": "Causal_Inference_Method",
    "tags": [
      "propensity-score",
      "matching",
      "iptw",
      "overlap-weighting",
      "smr-weighting",
      "confounding",
      "balance",
      "estimand",
      "causal-inference",
      "claims",
      "ehr"
    ],
    "applies_to_study_types": [
      "active_comparator_new_user",
      "new_user",
      "cohort_retrospective",
      "claims_analysis",
      "ehr_study",
      "target_trial_emulation"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "linked",
      "registry",
      "multi-database"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1093/biomet/70.1.41",
        "url": "https://doi.org/10.1093/biomet/70.1.41",
        "citation_text": "Rosenbaum PR, Rubin DB. The central role of the propensity score in observational studies for causal effects. Biometrika. 1983;70(1):41-55.",
        "year": 1983,
        "authors_short": "Rosenbaum & Rubin",
        "notes": "Foundational paper; defines the propensity score and proves the balancing property that justifies conditioning on the scalar e(X) instead of the full covariate vector."
      },
      {
        "role": "explain",
        "doi": "10.1080/00273171.2011.568786",
        "url": "https://doi.org/10.1080/00273171.2011.568786",
        "citation_text": "Austin PC. An introduction to propensity score methods for reducing the effects of confounding in observational studies. Multivariate Behavioral Research. 2011;46(3):399-424.",
        "year": 2011,
        "authors_short": "Austin",
        "notes": "Canonical tutorial contrasting matching, stratification, weighting, and covariate adjustment, with standardized-difference balance diagnostics."
      },
      {
        "role": "explain",
        "doi": "10.1002/sim.6607",
        "url": "https://doi.org/10.1002/sim.6607",
        "citation_text": "Austin PC, Stuart EA. Moving towards best practice when using inverse probability of treatment weighting (IPTW) using the propensity score to estimate causal treatment effects in observational studies. Statistics in Medicine. 2015;34(28):3661-3679.",
        "year": 2015,
        "authors_short": "Austin & Stuart",
        "notes": "Best-practice guidance for IPTW including stabilized weights, weight truncation, effective sample size, and robust variance estimation — the operational basis for the weighting code below."
      },
      {
        "role": "demonstrate",
        "doi": "10.1093/aje/kwv254",
        "url": "https://doi.org/10.1093/aje/kwv254",
        "citation_text": "Hernán MA, Robins JM. Using big data to emulate a target trial when a randomized trial is not available. American Journal of Epidemiology. 2016;183(8):758-764.",
        "year": 2016,
        "authors_short": "Hernán & Robins",
        "notes": "Shows why PS balancing must be embedded in a target-trial structure with correct time zero; balance cannot rescue a misaligned design."
      },
      {
        "role": "demonstrate",
        "doi": "10.1097/EDE.0b013e3181a663cc",
        "url": "https://doi.org/10.1097/EDE.0b013e3181a663cc",
        "citation_text": "Schneeweiss S, Rassen JA, Glynn RJ, Avorn J, Mogun H, Brookhart MA. High-dimensional propensity score adjustment in studies of treatment effects using health care claims data. Epidemiology. 2009;20(4):512-522.",
        "year": 2009,
        "authors_short": "Schneeweiss et al.",
        "notes": "Demonstrates the high-dimensional PS extension that empirically selects proxy confounders from claims codes; the standard sensitivity comparison to an investigator-specified PS."
      },
      {
        "role": "use",
        "doi": "10.1002/pds.4297",
        "url": "https://doi.org/10.1002/pds.4297",
        "citation_text": "Berger ML, Sox H, Willke RJ, et al. Good practices for real-world data studies of treatment and/or comparative effectiveness: recommendations from the joint ISPOR-ISPE Special Task Force on real-world evidence in health care decision making. Pharmacoepidemiology and Drug Safety. 2017;26(9):1033-1039.",
        "year": 2017,
        "authors_short": "Berger et al.",
        "notes": "Good-practice RWE guidance requiring pre-specified estimands, transparent confounding control, and reported balance/overlap diagnostics — the reporting bar for regulatory and HTA submissions."
      }
    ],
    "plain_language_summary": "A propensity score is a single number — the probability that a patient would receive the study drug, given everything measurable about them at baseline (age, diagnoses, prior medications, etc.). Once every patient has that score, you can pair treated patients with comparators who had nearly the same probability of being treated, or you can reweight the groups so their measured characteristics look alike. The goal is to make the two treatment arms resemble each other the way randomization would, so that any difference in outcomes can be attributed to the drug rather than to the kinds of patients who happened to receive it. This approach can only account for differences you actually measured — confounders hidden in the data, like disease severity not recorded in claims, are not removed.",
    "key_terms": [
      {
        "term": "propensity score",
        "definition": "The estimated probability that a specific patient receives the treatment being studied, calculated from their measured baseline characteristics using a statistical model such as logistic regression."
      },
      {
        "term": "covariate balance",
        "definition": "How similar the two treatment arms look on measured patient characteristics — good balance means the groups are comparable before you look at outcomes."
      },
      {
        "term": "standardized mean difference (SMD)",
        "definition": "A single number measuring how far apart two groups are on one characteristic, expressed in standard-deviation units; a value below 0.1 is the conventional target for acceptable balance."
      },
      {
        "term": "inverse-probability weighting",
        "definition": "A technique that gives each patient a statistical weight — patients who look like they 'should' have been in the other arm are upweighted — so the weighted groups resemble a population in which treatment was assigned independently of patient characteristics."
      },
      {
        "term": "confounding by indication",
        "definition": "The distortion that occurs when the reason a clinician prescribes a drug (the indication or patient severity) is also related to the outcome, making it look like the drug caused the outcome when the underlying illness did."
      },
      {
        "term": "overlap",
        "definition": "The extent to which patients in both arms have similar propensity scores — good overlap means there are comparable patients on both sides; poor overlap means some patients are so unlike the other arm that no fair comparison is possible."
      }
    ],
    "worked_example": {
      "scenario": "A researcher wants to compare 30-day hospital readmission rates between patients who started Drug A (a newer diabetes medication) versus Drug B (an older one). Before comparing outcomes, they need to check whether the two groups look similar on key baseline characteristics. The raw data show that Drug A patients tend to be older and have more diabetes complications — a direct reflection of prescribing patterns, not drug effects. The researcher fits a propensity score model (logistic regression predicting Drug A vs. Drug B from age, diabetes diagnosis, and other baseline covariates) and then either matches each Drug A patient to the most similar Drug B patient, or reweights the sample so the groups balance. The table below shows covariate balance before and after that adjustment.",
      "dataset": {
        "caption": "Balance table: two covariates before and after propensity-score matching/weighting. SMD = standardized mean difference; values below 0.1 indicate acceptable balance.",
        "columns": [
          "covariate",
          "drug_a_before",
          "drug_b_before",
          "smd_before",
          "drug_a_after",
          "drug_b_after",
          "smd_after"
        ],
        "rows": [
          [
            "Mean age (years)",
            67.4,
            61.2,
            0.52,
            64.8,
            64.3,
            0.04
          ],
          [
            "% with diabetes diagnosis",
            78,
            54,
            0.5,
            71,
            69,
            0.04
          ]
        ]
      },
      "steps": [
        "Before adjustment, Drug A patients are on average 6.2 years older than Drug B patients (67.4 vs 61.2), giving an SMD of 0.52 — a large imbalance by any standard.",
        "The diabetes diagnosis rate is also much higher in Drug A patients (78% vs 54%), SMD 0.50, reflecting that sicker patients with more comorbidities were more likely to be prescribed the newer drug.",
        "A logistic regression model is fit predicting Drug A vs Drug B from all baseline covariates; each patient's predicted probability from that model is their propensity score.",
        "Patients are then matched (each Drug A patient is paired with the Drug B patient whose propensity score is closest) or reweighted (each patient receives a weight inversely proportional to their probability of receiving the arm they actually received).",
        "After adjustment, mean age in the two arms is 64.8 vs 64.3 years (SMD = 0.04) and diabetes diagnosis rates are 71% vs 69% (SMD = 0.04) — both well below the 0.1 threshold.",
        "SMD arithmetic check for age after adjustment: difference = 64.8 - 64.3 = 0.5 years; pooled SD estimated at approximately 12.5 years; SMD = 0.5 / 12.5 = 0.04, consistent with the table."
      ],
      "result": "After propensity-score adjustment both covariates achieve SMD < 0.1, indicating the treated and comparator arms are now balanced on these measured characteristics. The researcher can proceed to compare readmission rates in the balanced sample, knowing that age and diabetes burden are no longer driving any observed difference. Note: unmeasured characteristics — such as disease severity not coded in claims — are still uncontrolled."
    },
    "prerequisites": [
      "dags-backdoor-criterion-drug-studies",
      "active-comparator-new-user",
      "baseline-characteristics-and-covariate-balance-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Matching (1:1 caliper on the logit-PS): ATT",
        "description": "Pairs each treated subject with a comparator within a caliper (commonly 0.2 SD of the logit-PS), producing an inspectable matched cohort and targeting the average treatment effect in the treated.",
        "edge_cases": [
          "Poor overlap leaves treated subjects unmatched, discarding data and shifting the estimand toward the matchable region.",
          "Greedy matching without replacement can depend on sort order; matching with replacement reduces bias but requires frequency weights and cluster-robust variance."
        ],
        "data_source_notes": "claims: match only after finalizing baseline windows and covariate code lists; report unmatched counts by arm and the matched-cohort standardized differences."
      },
      {
        "name": "Stabilized IPTW: ATE",
        "description": "Weights treated by Pr(A=1)/e(X) and comparator by Pr(A=0)/(1-e(X)) to construct a pseudo-population in which treatment is independent of measured covariates, targeting the average treatment effect.",
        "edge_cases": [
          "Near-zero or near-one e(X) produces extreme weights that inflate variance and let a handful of subjects dominate; truncate at the 1st/99th percentile and report the truncated and untruncated results.",
          "Always use robust (sandwich) or bootstrap variance; naive model SEs ignore the estimated weights and are too small."
        ],
        "data_source_notes": "Report weight percentiles, maximum, truncation rule, and effective sample size (ESS) so reviewers can judge stability."
      },
      {
        "name": "Overlap weighting (ATO) and SMR weighting (ATT)",
        "description": "Overlap weighting (treated 1-e(X), comparator e(X)) exactly balances covariate means and minimizes variance, targeting the equipoise population; SMR weighting (treated 1, comparator e/(1-e)) targets the treated population.",
        "edge_cases": [
          "Overlap weights change the estimand to a smoothly-weighted population; it must be named and not silently equated with ATE.",
          "SMR weighting still produces large comparator weights when overlap is poor."
        ],
        "data_source_notes": "Overlap weighting is the modern default in high-dimensional claims studies with limited positivity; SMR weighting is natural when the signal concerns current users of the study drug."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "active-comparator-new-user",
        "pros_of_this": "Balances measured confounders and lets you verify overlap and balance explicitly after the design has structurally removed confounding by indication.",
        "cons_of_this": "Cannot fix unmeasured confounding, a poor comparator choice, immortal time, or any time-zero error baked into the cohort.",
        "when_to_prefer": "Always used together — ACNU is the design, PS is the standard balancing step layered on top for drug CER."
      },
      {
        "compared_to": "high-dimensional-propensity-score-hdps-rwe",
        "pros_of_this": "Investigator-specified PS is interpretable, auditable clause-by-clause, and easy to defend to regulators.",
        "cons_of_this": "May miss empirical proxy confounders captured by hundreds of claims codes that hdPS recovers automatically.",
        "when_to_prefer": "When the confounder structure is well understood; use hdPS as a primary or sensitivity analysis in large claims data with plausible proxy confounding."
      },
      {
        "compared_to": "marginal-structural-models-g-methods",
        "pros_of_this": "Simpler to specify, communicate, and diagnose; a single baseline PS suffices for point-exposure questions.",
        "cons_of_this": "A baseline PS cannot handle time-varying confounding affected by prior treatment without inducing over-adjustment/collider bias.",
        "when_to_prefer": "Point-exposure (initiation) contrasts; escalate to MSM/IPTW-over-time or g-estimation for sustained or dynamic treatment strategies."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Build covariates (demographics, calendar time, baseline diagnoses/procedures, prior drug classes, utilization, cost, frailty/severity proxies) entirely in the pre-index lookback. Require complete FFS-observable enrollment across the washout and exclude Medicare Advantage-only person-time, where missing encounter claims degrade the PS without showing up in balance tables. Pre-specify cause-specific vs subdistribution handling when death competes with the outcome.",
      "ehr": "Add labs, vitals, smoking, BMI, severity, and NLP features measured before index. Missingness is informative — a missingness-indicator PS and a multiply-imputed PS estimate different e(X); include site and encounter-intensity terms because monitoring frequency confounds both exposure and covariate capture.",
      "registry": "Use clinical variables (stage, biomarker, ECOG, device details) to strengthen exchangeability, but link to claims for complete pharmacy exposure and to a death index for censoring before fitting the PS.",
      "linked": "Reconcile order/fill/service-date discrepancies before fixing time zero so baseline covariates do not leak across the index boundary into e(X)."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\nimport pandas as pd\nfrom sklearn.linear_model import LogisticRegression\nfrom lifelines import CoxPHFitter\n\nxcols = [\"age\", \"sex\", \"cci\", \"ckd\", \"prior_hf\", \"prior_insulin\", \"prior_hosp\", \"prior_cost\"]\n\n# --- Stage 1 (design, outcome-blind): estimate e(X) = Pr(treated | X) ---\nps = LogisticRegression(max_iter=2000, C=1e6)  # near-unpenalized logistic PS\nps.fit(df[xcols], df[\"treated\"])\ndf[\"e\"] = np.clip(ps.predict_proba(df[xcols])[:, 1], 1e-6, 1 - 1e-6)\n\np = df[\"treated\"].mean()\n# Stabilized IPTW -> ATE (average treatment effect in the whole eligible population)\ndf[\"w_ate\"] = np.where(df[\"treated\"] == 1, p / df[\"e\"], (1 - p) / (1 - df[\"e\"]))\ndf[\"w_ate\"] = df[\"w_ate\"].clip(upper=df[\"w_ate\"].quantile(0.99))  # truncate extreme weights\n# Overlap weights -> ATO (clinical-equipoise population; no extreme weights by construction)\ndf[\"w_ato\"] = np.where(df[\"treated\"] == 1, 1 - df[\"e\"], df[\"e\"])\n# SMR weights -> ATT (effect in the treated; comparator reweighted to look like treated)\ndf[\"w_att\"] = np.where(df[\"treated\"] == 1, 1.0, df[\"e\"] / (1 - df[\"e\"]))\n\ndef weighted_smd(x, t, w):\n    m1 = np.average(x[t == 1], weights=w[t == 1]); m0 = np.average(x[t == 0], weights=w[t == 0])\n    v1 = np.average((x[t == 1] - m1) ** 2, weights=w[t == 1])\n    v0 = np.average((x[t == 0] - m0) ** 2, weights=w[t == 0])\n    return (m1 - m0) / np.sqrt((v1 + v0) / 2)\n\nwcol = \"w_ate\"  # pick the estimand here; balance below should be checked on the SAME weights\nbal = {c: weighted_smd(df[c].values, df[\"treated\"].values, df[wcol].values) for c in xcols}\ness = df[wcol].sum() ** 2 / (df[wcol] ** 2).sum()  # effective sample size after weighting\nprint(\"max |SMD| =\", max(abs(v) for v in bal.values()), \"| ESS =\", round(ess, 1))\nassert max(abs(v) for v in bal.values()) < 0.1, \"covariate imbalance remains; revisit the PS model\"\n\n# --- Stage 2 (analysis): marginal weighted Cox 
with robust (sandwich) SEs ---\ncox = CoxPHFitter()\ncox.fit(df[[\"time\", \"event\", \"treated\", wcol]], duration_col=\"time\", event_col=\"event\",\n        weights_col=wcol, robust=True)  # robust=True accounts for the estimated weights\nprint(cox.summary[[\"coef\", \"exp(coef)\", \"se(coef)\", \"p\"]])",
        "description": "Full PS workflow on a claims-style analytic table, one row per new initiator, all covariates measured in the pre-index\nwindow. Required input columns:\n  person_id   : unique subject id\n  treated     : 1 = study drug initiator, 0 = active comparator initiator (arm assigned at index_date)\n  <xcols>     : baseline covariates from [index_date-365, index_date] (no post-index variables)\n  time, event : follow-up time from index_date and event indicator (1=outcome, 0=censored)\nProduces e(X), stabilized IPTW (ATE) and overlap weights (ATO), an SMD balance check, the effective sample size, and a\nweighted Cox model with robust SEs. Swap weight columns to switch the targeted estimand.",
        "dependencies": [
          "pandas",
          "numpy",
          "scikit-learn",
          "lifelines"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(WeightIt); library(cobalt); library(survey)\n\nxform <- treated ~ age + sex + cci + ckd + prior_hf + prior_insulin + prior_hosp + prior_cost\n\n# --- Estimate the PS and weights; estimand sets the target (ATE shown; use \"ATT\" or \"ATO\" to switch) ---\nw <- weightit(xform, data = df, method = \"ps\", estimand = \"ATE\", stabilize = TRUE)\n\n# --- Balance + overlap diagnostics on the weighted sample (must precede any outcome look) ---\nbal.tab(w, un = TRUE, thresholds = c(m = .1))   # standardized mean differences pre/post weighting\nlove.plot(w, abs = TRUE, thresholds = c(m = .1))\ncat(\"ESS:\", summary(w)$effective.sample.size, \"\\n\")\n\n# --- Marginal weighted Cox model; survey design gives robust variance for the weighted estimator ---\ndf$ipw <- w$weights\ndes <- svydesign(ids = ~1, weights = ~ipw, data = df)\nfit <- svycoxph(Surv(time, event) ~ treated, design = des)\nsummary(fit)   # exp(coef) = weighted hazard ratio for the chosen estimand",
        "description": "Full PS workflow with WeightIt + cobalt for weight construction and balance, then a marginal weighted outcome model via\nthe survey package (correct sandwich SEs for weighted estimators). Input data.frame `df` columns mirror the Python\nversion: treated (0/1), the baseline covariates, and time/event for the outcome. The `estimand=` argument is the single\nswitch that sets the target population (ATE / ATT / ATO).",
        "dependencies": [
          "WeightIt",
          "cobalt",
          "survey"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* --- 1:1 caliper matching on the logit-PS -> ATT; ASSESS emits pre/post standardized differences --- */\nproc psmatch data=work.analytic region=cs;            /* region=cs restricts to common support (overlap) */\n  class treated sex ckd prior_hf prior_insulin;\n  psmodel treated(Treated='1') = age sex cci ckd prior_hf prior_insulin prior_hosp prior_cost;\n  match method=greedy(k=1) distance=lps caliper=0.2;  /* 0.2 SD logit-PS caliper */\n  assess lps var=(age cci prior_hosp prior_cost) / weight=none plots=(stddiff boxplot);\n  output out(obs=match)=matched matchid=mid ps=ps;\nrun;\n\n/* --- IPTW for ATE with a doubly-robust outcome model as a built-in sensitivity check --- */\nproc causaltrt data=work.analytic method=ipw;         /* method=aipw for the doubly-robust estimate */\n  class treated sex ckd prior_hf prior_insulin;\n  psmodel treated(ref='0') = age sex cci ckd prior_hf prior_insulin prior_hosp prior_cost;\n  model event = age sex cci ckd prior_hf prior_insulin prior_hosp prior_cost;  /* outcome model for AIPW */\nrun;\n\n/* --- Build stabilized IPTW explicitly, then fit a weighted Cox model with robust SEs --- */\nproc logistic data=work.analytic noprint;\n  class treated sex ckd prior_hf prior_insulin / param=ref;\n  model treated(event='1') = age sex cci ckd prior_hf prior_insulin prior_hosp prior_cost;\n  output out=ps_out p=e;\nrun;\nproc sql noprint; select mean(treated) into :pa from work.analytic; quit;\ndata wtd; set ps_out;\n  e = min(max(e,1e-6),1-1e-6);\n  if treated=1 then iptw = &pa / e; else iptw = (1-&pa) / (1-e);   /* stabilized IPTW -> ATE */\nrun;\nproc phreg data=wtd covs(aggregate);     /* robust sandwich variance for the weighted estimator */\n  class treated(ref='0');\n  model futime*event(0) = treated;\n  weight iptw;\n  id person_id;\nrun;",
        "description": "Full PS pipeline in SAS/STAT (14.2+). Required input dataset work.analytic has one row per new initiator:\n  person_id, treated (1/0), the baseline covariates (measured in [index_date-365, index_date]),\n  futime (follow-up days from index), event (1=outcome, 0=censored).\nPROC PSMATCH builds the PS, restricts to the common-support region, matches/weights, and emits standardized differences;\nPROC CAUSALTRT estimates the marginal effect with an optional doubly-robust outcome model; weighted PROC PHREG fits the\ntime-to-event contrast with robust (sandwich) variance via COVS(AGGREGATE) + ID.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  X[Baseline covariates X<br/>measured before time zero] --> PS[Estimate e of X = Pr treated given X]\n  PS --> OV{Assess overlap of e of X<br/>and positivity}\n  OV -->|Good overlap| EST{Choose estimand}\n  OV -->|Poor overlap / extreme tails| FIX[Trim/restrict OR use overlap weights]\n  FIX --> EST\n  EST -->|ATT: treated population| MS[1:1 matching or SMR weighting]\n  EST -->|ATE: whole population| IPTW[Stabilized + truncated IPTW]\n  EST -->|ATO: equipoise population| OW[Overlap weighting]\n  MS --> BAL[Check standardized differences < 0.1<br/>report ESS + weight distribution]\n  IPTW --> BAL\n  OW --> BAL\n  BAL --> OUT[Weighted/matched outcome model<br/>robust SEs in the target population]",
        "caption": "Propensity-score decision flow: estimate e(X) from pre-index covariates, judge overlap/positivity, let the estimand (ATT/ATE/ATO) pick the method, verify balance, then fit the outcome model with robust variance.",
        "alt_text": "Flowchart from baseline covariates to propensity-score estimation, overlap assessment, estimand-driven choice of matching or weighting, balance check, and the final weighted outcome model.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  U[Unmeasured confounders U<br/>frailty, severity, BMI, smoking] -.-> A[Treatment A]\n  U -.-> Y[Outcome Y]\n  X[Measured confounders X] --> A\n  X --> Y\n  A --> Y\n  PS[PS balancing on e of X] === X",
        "caption": "PS methods block only the measured-confounder backdoor path (X -> A and X -> Y). The dashed path through unmeasured U remains open, which is why a clean balance table is not proof of no confounding — pair PS with negative controls, E-values, or quantitative bias analysis.",
        "alt_text": "Causal diagram showing treatment, outcome, measured confounders X blocked by propensity-score balancing, and an unmeasured confounder U with dashed arrows to treatment and outcome that PS cannot close.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "used_with",
        "target_slug": "active-comparator-new-user",
        "notes": "Core design-plus-analysis combination for drug comparative effectiveness/safety; ACNU builds the cohort, PS balances measured confounders within it."
      },
      {
        "relation_type": "produces",
        "target_slug": "estimands-ate-att-intercurrent-events-rwe",
        "notes": "The choice of PS method (matching/SMR -> ATT, IPTW -> ATE, overlap -> ATO) directly determines the estimand and target population."
      },
      {
        "relation_type": "alternative_to",
        "target_slug": "high-dimensional-propensity-score-hdps-rwe",
        "notes": "hdPS empirically selects proxy confounders from claims codes; investigator-specified PS is more interpretable but may miss them."
      },
      {
        "relation_type": "tradeoff_with",
        "target_slug": "marginal-structural-models-g-methods",
        "notes": "A baseline PS handles point-exposure confounding; MSM/IPTW-over-time and g-estimation are required for time-varying confounding affected by prior treatment."
      },
      {
        "relation_type": "see_also",
        "target_slug": "baseline-characteristics-and-covariate-balance-rwe",
        "notes": "Standardized-difference balance and overlap diagnostics are mandatory deliverables for any PS analysis."
      },
      {
        "relation_type": "see_also",
        "target_slug": "missing-data-trimming-winsorization-rwe",
        "notes": "Extreme PS values and IPTW weights require pre-specified trimming/restriction or weight truncation as sensitivity."
      },
      {
        "relation_type": "see_also",
        "target_slug": "e-value-sensitivity-analysis",
        "notes": "E-values and quantitative bias analysis address the residual unmeasured confounding that PS cannot remove."
      },
      {
        "relation_type": "see_also",
        "target_slug": "competing-risks-cause-specific-fine-gray-rwe",
        "notes": "When death competes with the outcome (e.g., elderly claims cohorts), cause-specific and subdistribution PS-weighted estimands diverge and must be pre-specified."
      },
      {
        "relation_type": "requires",
        "target_slug": "dags-backdoor-criterion-drug-studies",
        "notes": "A DAG decides which variables belong in the PS (confounders/proxies) and which must be excluded (mediators, colliders, instruments)."
      }
    ],
    "aliases": [
      "PSM",
      "IPTW",
      "propensity score matching",
      "propensity score weighting",
      "inverse probability of treatment weighting",
      "overlap weighting",
      "SMR weighting",
      "stabilized IPTW"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "protopathic-bias-reverse-causation",
    "name": "Protopathic Bias and Reverse Causation",
    "short_definition": "A systematic distortion that occurs when a drug is prescribed in response to the early, unrecognized symptoms of the very outcome under study, making the drug appear to cause a disease it was actually given to palliate; a specific instance of reverse causation in which the outcome process — not yet recorded as a diagnosis — drives the exposure timing.",
    "long_description": "**What protopathic bias is and how it distorts pharmacoepidemiologic research**\n\nProtopathic bias arises when a drug is prescribed not for an independent indication but because\nof the early, often unrecognized, manifestations of the disease that will later become the\noutcome variable. The patient already has the disease biologically — cancer cells proliferating,\nheart failure progressing, a gastric lesion forming — but the clinical record does not yet\ncapture it as a diagnosis. What the prescriber sees is a symptom (pain, epigastric discomfort,\ndyspnoea, fatigue) and reaches for a drug to palliate it. Months later, when the diagnosis is\nfinally recorded, a researcher examining the database observes a drug fill closely preceding a\nnew diagnosis and, if care is not taken, concludes that the drug caused the disease.\n\nThe term was coined by Horwitz and Feinstein (1980), who identified the pattern in case-control\nstudies of drugs and cancer: the drug was not causing the disease, it was responding to the\ndisease's prodrome. The bias is large in magnitude, reliable in direction (it inflates association\ntoward apparent harm or causation), and invisible to standard covariate adjustment. It cannot be\ncorrected by adding confounders to a model, because the disease driving the prescription is the\nvery disease being studied — and it is not (yet) recorded as a covariate.\n\n**Classic cases in pharmacoepidemiology**\n\nThree drug-outcome pairs illustrate the mechanism across disease categories:\n\n*Opioids, antibiotics, and cancer diagnosis.* A patient with an unrecognized solid tumor\nexperiences pain from local invasion, weight loss, or bone metastases months before the tumor\nis found. A clinician prescribes opioids for apparent musculoskeletal pain or antibiotics for\napparent infection. When the cancer is eventually diagnosed, the database shows a strong\nrecent-exposure signal. 
A naive case-control study reads this as \"opioid or antibiotic use\nraises cancer risk\" — a near-perfect inversion of causality. The undiagnosed cancer caused the\nprescription, not the other way around.\n\n*PPIs and gastric cancer.* Gastric cancer causes dyspepsia, bloating, reflux, and epigastric\npain months to years before endoscopy-confirmed diagnosis. These are exactly the symptoms for\nwhich proton-pump inhibitors (PPIs) are prescribed. Patients with occult gastric cancer are\ntherefore systematically more likely to have received a recent PPI prescription, producing a\nspurious positive association. Unlagged analyses have reported relative risks of roughly 2 to 4\nor more; studies applying a 6–12 month lag have seen these estimates collapse toward the null.\n\n*Loop diuretics and heart failure.* Early, compensated heart failure presents with exertional\ndyspnoea, ankle swelling, and fluid retention well before echocardiographic confirmation of\nsystolic dysfunction is recorded. Clinicians prescribe loop diuretics for these symptoms. In a\ndatabase, the patient appears to have received a diuretic before developing heart failure, when\nin fact the heart failure was already driving the symptoms that prompted the prescription.\n\n**Reverse causation and the time-ordering illusion in administrative data**\n\nProtopathic bias is a specific, operationally precise instance of the broader category of\n*reverse causation*: any situation in which the outcome process causes the exposure, rather than\nthe reverse. Administrative data amplify this problem through a structural artifact — the\ndiagnosis *recording* date does not equal the biological *onset* date. Gastric cancer does not\nbegin on the endoscopy report date; heart failure does not begin on the echocardiography date;\ncancer does not begin on the pathology report date. The diagnostic delay — the gap between\nbiological onset and clinical recognition — can span months to years. 
During that window the\ndisease is biologically present and symptomatically active, but it is administratively invisible.\nAny drug exposure measured in that window is contaminated: it looks forward-causal (drug then\ndisease) but is in fact reverse-causal (disease then drug, disease then diagnosis).\n\n**Protopathic bias versus confounding by indication: a critical distinction**\n\nThese two biases are mechanistically distinct, and confusing them leads to wrong remedies.\n*Confounding by indication* (channeling) arises when a measured or measurable severity variable\nsimultaneously drives treatment assignment and outcome risk — sicker patients receive more\naggressive therapy, so the treated arm reflects disease burden. The path is: recognized severity\n→ treatment and recognized severity → outcome. The disease that causes prescribing is on record\nas a covariate, and severity-adjustment can reduce the bias. *Protopathic bias* operates through\nthe unrecognized prodrome of the outcome itself: the very disease being studied, before it is\nrecorded, drives exposure timing. No covariate can correct it because the confounder is the\noutcome — unobserved. The remedy is temporal, not covariate-based: exclude the exposure that\nfalls within the pre-diagnostic window.\n\n**Fixes ranked**\n\n*[1] Exposure lag/induction windows (primary fix).* Exclude all exposures occurring within k\nmonths before the index date (diagnosis date for cases; matched date for controls in a\ncase-control study; cohort follow-up entry in a cohort study). The lag k should be motivated\nby the known prodromal phase of the disease — 6 months for gastric cancer and dyspepsia, 12\nmonths for cancers with longer diagnostic delay. Implement as delayed entry/left truncation on\nthe time axis so person-time and events in the lag are excluded from both numerator and\ndenominator symmetrically across arms. 
See the exposure-lag-induction-latency-window-rwe entry\nfor operational details.\n\n*[2] Latency analyses with varying lag lengths.* Report the association at lag = 0, 6, 12, 24,\n36 months. A signal that is strongest at lag = 0 and attenuates monotonically with increasing\nlag is the fingerprint of protopathic bias. A true causal association should be stable or\nincrease at lags consistent with the biological induction period.\n\n*[3] Prescription-sequence-symmetry analysis.* Compare the number of patients who start the drug\nbefore the index event with the number who start it after. With no drug effect and no bias, the\ntwo orderings should be roughly symmetric. An excess of drug-before-outcome sequences is expected\nunder a true causal effect and under protopathic bias alike, so inspect how the asymmetry depends\non the interval: protopathic bias concentrates the excess in the months immediately before\ndiagnosis. See the prescription-sequence-symmetry entry.\n\n*[4] Negative-control outcomes.* Select an outcome that the drug cannot biologically cause or\nprotect against. If the association extends to the negative control, residual bias is present.\nSee the negative-control-outcomes-rwe entry.\n\n**Pros, cons, and trade-offs**\n\n*Unlagged (naive) analysis:*\n- Pros: maximal sample size; no a priori lag-length judgment; easy to implement.\n- Cons: protopathic bias is uncontrolled; estimates can be large multiples of the truth; the\n  bias reliably mimics a causal harmful effect of the drug on the outcome it was prescribed to\n  palliate; downstream clinical or regulatory decisions can be misdirected.\n\n*Lagged analysis (primary fix):*\n- Pros: removes or substantially attenuates protopathic bias; the biological motivation for the\n  lag length is transparent, pre-specifiable, and auditable by regulators.\n- Cons: discards person-time and events, reducing statistical power; the lag length is a\n  judgment call that must be pre-specified and varied in sensitivity analysis; an overly long lag\n  excludes true causal exposure and biases the estimate toward the null.\n- Trade-off: the lag length should be determined by the known 
prodromal duration of the disease,\n  not by what makes the association disappear — the latter is post hoc data dredging.\n\n**When to use**\n\nApply protopathic-bias assessment and a lagged exposure definition when: (a) the drug being\nstudied is prescribed for symptoms that could be the prodrome of the outcome — analgesics and\ncancer, acid-suppressive drugs and gastrointestinal malignancy, diuretics and heart failure,\nantibiotics and sepsis-related diagnoses; (b) the index event (outcome) has a long diagnostic\ndelay (the biological onset substantially precedes the recorded diagnosis date); (c) the study\nis a case-control design with recent exposure as the primary predictor; (d) a cohort study shows\nthat early-follow-up events drive the association; (e) the study is regulatory-grade and a\nreviewer has flagged reverse causation as a plausible alternative explanation; (f) the study\nwill inform drug labeling, regulatory safety communications, or clinical guidelines, where\na false causal inference would directly harm patients.\n\n**When NOT to use — and when lagging is actively misleading**\n\n- *Acute, mechanistically immediate outcomes.* If the drug causes an acute adverse event —\n  anaphylaxis, acute bleeding from an anticoagulant, day-1 hypoglycaemia from insulin — imposing\n  a lag removes the very events that demonstrate harm. Do not lag outcomes whose causal mechanism\n  is immediate and well-established biologically.\n- *Lag chosen post hoc to attenuate a safety signal.* A lag length selected after unblinding\n  the data specifically to make an association disappear is data dredging. 
The lag must be fixed\n  in the protocol before data analysis, with biological motivation documented, and then varied in\n  sensitivity analysis to show the result's stability.\n- *Lag applied to eligibility rather than the time axis.* Requiring cases to survive the lag\n  period as a condition of inclusion manufactures immortal time and a spurious protective or\n  null effect. The lag must be applied symmetrically as delayed entry on the time axis, not as\n  an eligibility filter. See the time-zero-index-date-alignment-rwe entry.\n- *As a substitute for active-comparator design.* A lag reduces protopathic bias but does not\n  address confounding by indication or healthy-user bias; all three can operate simultaneously.\n\n**Interpreting the output**\n\nIn the worked example (PPI and gastric cancer case-control study), the naive OR is 3.0 and the\n6-month-lagged OR is 1.0, using the exact 2x2 counts in the beginner layer.\n\n*(1) Formal interpretation.* The naive OR = (75 * 50) / (25 * 50) = 3.0 is a conditional\nassociation estimate from a case-control analysis that counts PPI exposures in the 6 months\nbefore the cancer diagnosis date. It does not represent a causal odds ratio. The exposure window\nis contaminated: 25 of the 75 \"exposed\" cases were prescribed a PPI during the prodromal phase\nof their gastric cancer for symptoms attributable to the cancer, not for an independent\nacid-related condition. After applying a 6-month lag to exclude that window, the cases retain\n50 exposed and 50 unexposed patients, identical to controls (50/50), yielding OR = 1.0. 
The\ncomplete attenuation from 3.0 to 1.0 demonstrates that the entire observed association was\nattributable to the contaminated pre-diagnostic exposure window; the lag removed it entirely.\nIn real data, attenuation is rarely this complete; a lagged estimate that shrinks toward 1.0\nwith a CI spanning 1.0 is the evidence against a causal association.\n\n*(2) Practical interpretation.* If a regulator or payer receives the naive OR of 3.0, they may\nconclude PPIs cause gastric cancer — a pharmacologically implausible claim that, if acted upon,\ncould lead to the withdrawal of a widely used and effective acid-suppressive drug. The lagged\nOR of 1.0 tells the true story: PPIs are prescribed because early gastric cancer produces\ndyspeptic symptoms, not the other way around. Presenting only the naive estimate in a regulatory\nsubmission or journal article, without lag analysis, is a methodological failure that directly\nendangers patients by generating spurious signals and diverting pharmacovigilance attention.",
    "primary_category": "Bias_Control",
    "tags": [
      "bias",
      "protopathic-bias",
      "reverse-causation",
      "case-control",
      "exposure-lag",
      "pharmacoepidemiology",
      "diagnostic-delay",
      "claims",
      "time-ordering"
    ],
    "applies_to_study_types": [
      "case_control",
      "cohort_retrospective",
      "active_comparator_new_user",
      "claims_analysis",
      "ehr_study",
      "registry_study"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1016/0002-9343(80)90363-0",
        "url": "https://doi.org/10.1016/0002-9343(80)90363-0",
        "citation_text": "Horwitz RI, Feinstein AR. The problem of \"protopathic bias\" in case-control studies. American Journal of Medicine. 1980;68(2):255-258.",
        "year": 1980,
        "authors_short": "Horwitz & Feinstein",
        "notes": "Coined the term \"protopathic bias\" and described the mechanism: drugs prescribed for early, unrecognized symptoms of the outcome appear to cause the outcome in case-control analyses. The foundational reference for all subsequent work on this bias."
      },
      {
        "role": "explain",
        "doi": "10.1111/bcp.12705",
        "url": "https://doi.org/10.1111/bcp.12705",
        "citation_text": "Faillie JL. Indication bias or protopathic bias? British Journal of Clinical Pharmacology. 2015;80(4):779-780.",
        "year": 2015,
        "authors_short": "Faillie",
        "notes": "Clarifies the mechanistic distinction between confounding by indication (severity confounds) and protopathic bias (the outcome itself drives exposure timing), arguing the two are separable and require different remedies."
      },
      {
        "role": "demonstrate",
        "doi": "10.1002/pds.1360",
        "url": "https://doi.org/10.1002/pds.1360",
        "citation_text": "Tamim H, Monfared AAB, LeLorier J. Application of lag-time into exposure definitions to control for protopathic bias. Pharmacoepidemiology and Drug Safety. 2007;16(3):250-258.",
        "year": 2007,
        "authors_short": "Tamim et al.",
        "notes": "Operationalizes the lag-time approach to controlling protopathic bias in claims-based exposure definitions; demonstrates that applying a lag to the exposure window attenuates the bias toward the null and shows the lag-attenuation pattern is the empirical fingerprint of the bias."
      }
    ],
    "plain_language_summary": "Protopathic bias happens when a drug is prescribed for the early warning signs of a disease that has not yet been diagnosed — for example, a stomach acid pill given for bloating caused by an undetected stomach cancer. When a researcher later looks at the data, the drug appears to have preceded the disease, creating the illusion that the drug caused it, when in fact the hidden disease caused the prescription. The fix is to ignore drug exposures that occurred in the months immediately before the diagnosis was recorded, a technique called applying a lag window, because those exposures are almost certainly a reaction to early disease symptoms rather than an independent drug use.",
    "key_terms": [
      {
        "term": "protopathic bias",
        "definition": "A distortion that makes a drug look as though it caused a disease when, in reality, the early symptoms of that disease prompted the doctor to prescribe the drug before the disease was officially diagnosed."
      },
      {
        "term": "prodrome",
        "definition": "The early set of symptoms — such as pain, bloating, or breathlessness — that appear before a disease is formally diagnosed, which can prompt a prescription for the drug under study."
      },
      {
        "term": "diagnostic delay",
        "definition": "The gap in time between when a disease actually begins in the body and when it is recognized and recorded as a diagnosis in the medical record; the longer this gap, the larger the protopathic window."
      },
      {
        "term": "lag window",
        "definition": "A defined number of months before the index date during which drug exposures are deliberately excluded from the analysis, because any prescription in that window may have been triggered by early disease symptoms rather than an unrelated indication."
      },
      {
        "term": "index date",
        "definition": "The date used as the reference point for a patient in a study — for a case, typically the date the disease was first recorded; for a control, the matched equivalent date."
      },
      {
        "term": "reverse causation",
        "definition": "When the outcome being studied actually causes the exposure being studied, rather than the other way around, making the causal direction in data appear the opposite of reality."
      }
    ],
    "worked_example": {
      "scenario": "A case-control study asks whether proton-pump inhibitor (PPI) use causes gastric cancer. PPIs are prescribed for dyspepsia, reflux, and epigastric pain — which are also the early symptoms of undiagnosed gastric cancer. The study includes 100 newly diagnosed gastric cancer cases and 100 frequency-matched controls from the same claims database. \"Exposed\" in the naive analysis means any PPI fill in the 12 months before the diagnosis date (index date). In the lagged analysis, a 6-month lag is applied: PPI fills in the 6 months immediately before the index date are excluded, and only fills from months 7–12 before the index date count as exposure. The 25 cases who had a PPI fill only in the 6-month protopathic window are reclassified as unexposed in the lagged analysis; controls have no early gastric cancer producing prodromal symptoms, so their exposure distribution is unchanged.",
      "dataset": {
        "caption": "Two 2x2 tables from the same 100-case, 100-control study. Table A (naive): exposure = any PPI fill in the 12 months before the index date. Table B (6-month lag): exposure = any PPI fill in months 7-12 before the index date only; fills in the 6-month window are excluded.",
        "columns": [
          "analysis",
          "group",
          "ppi_exposed",
          "ppi_unexposed",
          "row_total"
        ],
        "rows": [
          [
            "A naive",
            "Cases",
            75,
            25,
            100
          ],
          [
            "A naive",
            "Controls",
            50,
            50,
            100
          ],
          [
            "B 6-month lag",
            "Cases (lag)",
            50,
            50,
            100
          ],
          [
            "B 6-month lag",
            "Controls (lag)",
            50,
            50,
            100
          ]
        ]
      },
      "steps": [
        "Naive analysis (Table A): cases show 75 exposed and 25 unexposed (row total 75 + 25 = 100); controls show 50 exposed and 50 unexposed (row total 50 + 50 = 100).",
        "Naive OR = (exposed cases * unexposed controls) / (unexposed cases * exposed controls) = (75 * 50) / (25 * 50) = 3750 / 1250 = 3.0. This 3-fold elevation makes PPIs look strongly associated with gastric cancer.",
        "But gastric cancer causes dyspepsia, reflux, and epigastric pain months before endoscopic diagnosis. Patients with undetected gastric cancer are prescribed PPIs for those prodromal symptoms, not for an independent acid condition. The disease drives the prescription.",
        "Applying the 6-month lag: the 25 cases whose only PPI fill was in the 6-month protopathic window are reclassified as unexposed. Cases after lag: 75 - 25 = 50 exposed, 25 + 25 = 50 unexposed. Controls are unchanged because they have no occult gastric cancer generating prodromal symptoms: still 50 exposed and 50 unexposed.",
        "Lagged OR = (50 * 50) / (50 * 50) = 2500 / 2500 = 1.0. After excluding the contaminated pre-diagnostic window, the association disappears entirely.",
        "The shift from OR = 3.0 at lag = 0 to OR = 1.0 at lag = 6 months is the protopathic-bias fingerprint: the entire observed association resided in the window where disease symptoms were driving prescriptions, not where the drug was acting on the disease."
      ],
      "result": "Table A naive OR = (75 * 50) / (25 * 50) = 3.0 (spurious; driven by 25 cases prescribed a PPI for prodromal gastric-cancer symptoms in the 6 months before diagnosis). Table B lagged OR = (50 * 50) / (50 * 50) = 1.0 (after excluding the 6-month pre-diagnostic exposure window). The complete attenuation from OR = 3.0 to OR = 1.0 is the fingerprint of protopathic bias.",
      "timeline_spec": {
        "title": "Protopathic bias: PPI exposure and gastric cancer in one case patient (6-month lag)",
        "window": {
          "start": "2022-07-01",
          "end": "2023-07-01",
          "label": "12-month lookback from index date (cancer diagnosis recorded 2023-07-01)"
        },
        "events": [
          {
            "label": "PPI fill (3 months before diagnosis — inside protopathic window)",
            "start": "2023-04-01",
            "length_days": 60,
            "quantity": "60-day PPI supply; excluded after 6-month lag"
          },
          {
            "label": "Gastric cancer diagnosis recorded (index date)",
            "start": "2023-07-01",
            "length_days": 1,
            "quantity": "Index date — biological onset months earlier"
          }
        ],
        "spans": [
          {
            "kind": "exposed",
            "start": "2022-07-01",
            "end": "2022-12-31",
            "label": "Valid exposure zone (> 6 months before diagnosis)"
          },
          {
            "kind": "unexposed",
            "start": "2023-01-01",
            "end": "2023-06-30",
            "label": "6-month protopathic lag window — exposures excluded"
          }
        ],
        "result": {
          "label": "Patient reclassified: naive = exposed (PPI fill at day -91); lagged = unexposed (fill inside 6-month lag)",
          "value": 0
        }
      }
    },
    "prerequisites": [
      "case-control",
      "time-zero-index-date-alignment-rwe",
      "new-user-design"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Case-control with a fixed pre-diagnostic exclusion window",
        "description": "Exclude all exposures falling within a pre-specified lag (e.g., 6 months) before the index date for both cases and controls. The lag is determined by the expected prodromal duration of the outcome disease and pre-specified in the protocol. Run the naive and lagged analyses in parallel and report both; monotonic attenuation toward the null with increasing lag is the protopathic-bias fingerprint.",
        "edge_cases": [
          "Lag applied only to cases and not to controls creates an asymmetric analysis that over- corrects; the lag must be applied identically to both arms.",
          "For outcomes with very long prodromal phases (e.g., certain cancers with 2–5 year subclinical progression), a 6-month lag may be insufficient; the sensitivity grid should extend to 12, 24, and 36 months.",
          "An overly long lag removes true etiologically relevant exposure (if the drug genuinely causes the disease after a latency period), biasing the estimate toward the null. The lag length must be biologically motivated and prospectively declared."
        ],
        "data_source_notes": "Claims: exposure = any qualifying fill in the non-lag lookback window (months 7–12 or beyond); claims coding lag means the index date may already trail the biological onset — prefer admission date or procedure date over claim paid date for the event anchor. EHR: diagnosis date is encounter-driven; a patient who presents late may already have an extended de facto lag embedded in the data — document the care-seeking gap."
      },
      {
        "name": "Cohort study with time-varying exposure and a lag on follow-up entry",
        "description": "In a cohort design, apply the lag as delayed-entry/left-truncation on the time axis: person-time and events in the first k months of follow-up are excluded from both the exposed and unexposed arms simultaneously. This preserves symmetry across arms and avoids the immortal-time trap of applying the lag to eligibility. Use a time-updated current- exposure variable that accounts for gap periods and drug switches.",
        "edge_cases": [
          "Applying the lag to eligibility (requiring survival through the lag period to be classified as exposed) manufactures immortal time and a spurious protective effect; always apply to the time axis, never to eligibility.",
          "For active-comparator designs, both comparator arms may have prodromal symptoms for the shared indication — assess whether one drug is disproportionately prescribed for the symptom directly linked to the outcome."
        ],
        "data_source_notes": "Claims: use fill_date + days_supply + carryover to construct exposure episodes; apply the lag as risk_start = max(epi_start, index_date + lag_days); events before risk_start are not attributed regardless of arm."
      },
      {
        "name": "Latency-grid sensitivity analysis",
        "description": "Report the association estimate at lag = 0, 6, 12, 24, and 36 months. Plot the OR or hazard ratio against the lag length. Protopathic bias produces monotonic attenuation toward the null as lag increases; a true causal association produces stable or increasing estimates at lags corresponding to the biological induction period.",
        "edge_cases": [
          "A J-shaped lag pattern (association attenuates then rises again at long lags) may indicate a true causal association with a long induction period — do not over-interpret attenuation at short lags as proof of protopathic bias alone.",
          "Statistical power decreases with longer lags as person-time is discarded; report CIs at each lag, not only point estimates."
        ],
        "data_source_notes": "Claims and EHR: generate the lag grid programmatically by parameterizing lag_days in the exposure-construction code and running the full analytic pipeline for each value. Avoid manual re-coding per lag value to prevent transcription errors."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "confounding-by-indication-channeling",
        "pros_of_this": "Protopathic bias operates through the unrecognized prodrome of the outcome itself and cannot be removed by covariate adjustment, making the temporal (lag) remedy uniquely appropriate; confounding by indication can in principle be attenuated by severity adjustment but protopathic bias cannot.",
        "cons_of_this": "Both biases can operate simultaneously; a lag corrects the protopathic component but leaves confounding by indication unaddressed. Applying a lag to a study that has predominantly confounding by indication (not protopathic bias) may not meaningfully change the estimate and gives a false sense of bias correction.",
        "when_to_prefer": "Diagnose the dominant mechanism first: if the drug is prescribed for the outcome's prodromal symptoms (protopathic), lag is the primary fix; if the drug is channeled to sicker patients with the established indication (confounding by indication), active- comparator design or severity adjustment is the primary fix."
      },
      {
        "compared_to": "exposure-lag-induction-latency-window-rwe",
        "pros_of_this": "Understanding the bias is necessary to motivate and correctly specify the lag; the lag without a protopathic-bias rationale is arbitrary. Diagnosis of protopathic bias via the lag-attenuation pattern also serves as a falsification test for the causal hypothesis.",
        "cons_of_this": "The lag is the operational remedy; the bias concept alone does not produce corrected estimates. Always route to the exposure-lag entry for implementation of the fix.",
        "when_to_prefer": "Read this entry first for conceptual grounding, then apply the exposure-lag entry to operationalize the lag on the time axis and implement the sensitivity grid."
      },
      {
        "compared_to": "negative-control-outcomes-rwe",
        "pros_of_this": "The lag-attenuation pattern is a direct, drug-specific test for protopathic bias that exploits the time structure of the bias without requiring a separate outcome. It is always available whenever a lag can be applied.",
        "cons_of_this": "A negative-control outcome can detect residual bias after a lag and can also distinguish protopathic bias from other sources of confounding; the lag test alone cannot rule out that a stable estimate after lagging is merely a weaker protopathic signal rather than a true null.",
        "when_to_prefer": "Use both in parallel: the lag grid to attenuate and detect protopathic bias, and a negative-control outcome to bound any residual confounding and increase confidence in the corrected estimate."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Exposure is the pharmacy fill (fill_date + days_supply). The lag is applied as risk_start = max(epi_start, index_date + lag_days), applied identically to all arms on the time axis (not as an eligibility criterion). The claims coding lag means the diagnosis date (index_date) may already lag behind the biological onset — use the admission date or procedure date as the event anchor rather than the claim paid date to minimize the artifactual gap. Exclude Medicare Advantage-only person-time where FFS pharmacy claims are absent; an apparently clean exposure- free lag interval could be unobserved time rather than true non-exposure. Run the full lag grid (0, 6, 12, 24 months) programmatically in one pipeline pass.",
      "ehr": "The diagnosis date in an EHR is encounter-driven: a patient who presents late implies a longer de facto lag already embedded in the data, potentially attenuating protopathic bias relative to claims. Confirm drug starts against linked dispensing records; EHR order dates may not reflect actual administration. Between-visit gaps are informative — a patient with no visits in the 6 months before diagnosis had no opportunity to receive a protopathically motivated prescription. Document care-seeking gaps when interpreting the lag pattern.",
      "registry": "Registry adjudication can produce cleaner event dates than claims, but registries are often weak for complete exposure history. Link to claims to populate the pre-index exposure window. Adjudication lag can correlate with exposure (sicker patients reviewed faster), potentially compressing the apparent protopathic window — compare adjudication date versus event date by arm as a diagnostic.",
      "linked": "Linked claims-EHR provides the best substrate: claims completeness for exposure history, EHR event dates for accurate diagnosis anchoring, and vital records for competing-risk death. Use EHR-based event dates (procedure/pathology) rather than claims-based diagnosis dates as the index, since EHR event dates are typically closer to the biological onset and reduce the un-accounted prodromal window. Reconcile order/fill/service dates before choosing the exposure anchor."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "from math import log, exp, sqrt\n\n# ── 2x2 table helpers ──────────────────────────────────────────────────────────────\ndef odds_ratio(a, b, c, d):\n    \"\"\"OR = (a*d)/(b*c) where a=exposed cases, b=unexposed cases,\n       c=exposed controls, d=unexposed controls.\"\"\"\n    if b == 0 or c == 0:\n        raise ValueError(\"Zero cell prevents OR calculation\")\n    return (a * d) / (b * c)\n\ndef or_95ci(a, b, c, d):\n    \"\"\"95% CI via Woolf log method (valid when all cells > 0).\"\"\"\n    or_ = odds_ratio(a, b, c, d)\n    se_log = sqrt(1/a + 1/b + 1/c + 1/d)\n    lo = exp(log(or_) - 1.96 * se_log)\n    hi = exp(log(or_) + 1.96 * se_log)\n    return or_, lo, hi\n\n# ── Worked-example data ────────────────────────────────────────────────────────────\n# Table A: naive (any PPI fill in 12 months before diagnosis)\n# a=exposed cases, b=unexposed cases, c=exposed controls, d=unexposed controls\nnaive  = dict(a=75, b=25, c=50, d=50)\n\n# Table B: 6-month lag applied (fills in 6-month window excluded from cases)\n# 25 cases who were exposed only in the lag window -> reclassified as unexposed\nlagged = dict(a=50, b=50, c=50, d=50)\n\nprint(\"=== Table A: Naive analysis ===\")\nor_n, lo_n, hi_n = or_95ci(**naive)\nprint(f\"OR = {or_n:.2f}  95% CI [{lo_n:.2f}, {hi_n:.2f}]\")\n# Arithmetic: (75 * 50) / (25 * 50) = 3750 / 1250 = 3.0\n\nprint(\"\\n=== Table B: 6-month lag applied ===\")\nor_l, lo_l, hi_l = or_95ci(**lagged)\nprint(f\"OR = {or_l:.2f}  95% CI [{lo_l:.2f}, {hi_l:.2f}]\")\n# Arithmetic: (50 * 50) / (50 * 50) = 2500 / 2500 = 1.0\n\n# ── Lag-grid sensitivity: vary lag length and reclassify case exposures ──────────\nimport pandas as pd\n\ndef lag_grid_analysis(case_exposures, control_exposures, n_cases, n_controls,\n                      index_dates, lag_months_list):\n    \"\"\"\n    case_exposures / control_exposures: DataFrames with columns\n      [person_id, fill_date, index_date] — one row per fill.\n    Returns a DataFrame of OR by lag 
length for plotting the attenuation pattern.\n    \"\"\"\n    results = []\n    for lag_m in lag_months_list:\n        lag_days = lag_m * 30   # approximate; use relativedelta for production\n\n        # Reclassify: a fill counts only if > lag_days before the index date\n        ce = case_exposures.copy()\n        ce[\"days_before\"] = (ce[\"index_date\"] - ce[\"fill_date\"]).dt.days\n        ce[\"valid\"] = ce[\"days_before\"] > lag_days\n        exposed_cases = ce[ce[\"valid\"]].groupby(\"person_id\").size().reset_index()\n        a = len(exposed_cases)          # exposed cases after lag\n        b = n_cases - a                  # unexposed cases after lag\n\n        co = control_exposures.copy()\n        co[\"days_before\"] = (co[\"index_date\"] - co[\"fill_date\"]).dt.days\n        co[\"valid\"] = co[\"days_before\"] > lag_days\n        exposed_controls = co[co[\"valid\"]].groupby(\"person_id\").size().reset_index()\n        c = len(exposed_controls)        # exposed controls\n        d = n_controls - c               # unexposed controls\n\n        if b > 0 and c > 0:\n            or_val = (a * d) / (b * c)\n            results.append({\"lag_months\": lag_m, \"OR\": or_val,\n                            \"a\": a, \"b\": b, \"c\": c, \"d\": d})\n    return pd.DataFrame(results)\n\n# Protopathic-bias fingerprint: OR attenuates toward 1.0 as lag increases.\n# A true causal association stays stable or rises at the induction-period lag.",
        "description": "Naive versus lagged OR computation for a case-control study. Computes the 2x2 odds ratio\nat lag = 0 (naive) and then applies a lag-reclassification to show how the OR changes.\nUses exact 2x2 arithmetic matching the worked example. No external dependencies beyond\nscipy.stats for the confidence interval; the core OR logic requires no packages.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "# ── 2x2 helpers ───────────────────────────────────────────────────────────────────\nodds_ratio_woolf <- function(a, b, c, d) {\n  # a=exposed cases, b=unexposed cases, c=exposed controls, d=unexposed controls\n  # OR = (a*d)/(b*c); 95% CI via Woolf log method\n  or <- (a * d) / (b * c)\n  se_log <- sqrt(1/a + 1/b + 1/c + 1/d)\n  list(OR = or,\n       lo = exp(log(or) - 1.96 * se_log),\n       hi = exp(log(or) + 1.96 * se_log))\n}\n\n# ── Worked-example data ───────────────────────────────────────────────────────────\n# Naive: any PPI fill in 12 months before diagnosis date\nnaive  <- list(a = 75, b = 25, c = 50, d = 50)\n# Lagged: 6-month protopathic window excluded; 25 cases reclassified as unexposed\nlagged <- list(a = 50, b = 50, c = 50, d = 50)\n\ncat(\"=== Table A: Naive analysis ===\\n\")\nres_n <- odds_ratio_woolf(naive$a, naive$b, naive$c, naive$d)\ncat(sprintf(\"OR = %.2f  95%% CI [%.2f, %.2f]\\n\", res_n$OR, res_n$lo, res_n$hi))\n# (75 * 50) / (25 * 50) = 3750 / 1250 = 3.0\n\ncat(\"\\n=== Table B: 6-month lag applied ===\\n\")\nres_l <- odds_ratio_woolf(lagged$a, lagged$b, lagged$c, lagged$d)\ncat(sprintf(\"OR = %.2f  95%% CI [%.2f, %.2f]\\n\", res_l$OR, res_l$lo, res_l$hi))\n# (50 * 50) / (50 * 50) = 2500 / 2500 = 1.0\n\n# ── Lag-grid sensitivity (illustrative with synthetic fill-date data) ─────────────\nlag_grid <- function(case_fills, ctrl_fills, n_cases, n_ctrls,\n                     lag_months_vec = c(0, 6, 12, 24, 36)) {\n  # case_fills / ctrl_fills: data.frames with person_id, fill_date, index_date (Date)\n  do.call(rbind, lapply(lag_months_vec, function(lag_m) {\n    lag_days <- lag_m * 30L\n\n    ce <- case_fills\n    ce$days_before <- as.integer(ce$index_date - ce$fill_date)\n    ce$valid <- ce$days_before > lag_days\n    a <- length(unique(ce$person_id[ce$valid]))\n    b <- n_cases - a\n\n    co <- ctrl_fills\n    co$days_before <- as.integer(co$index_date - co$fill_date)\n    co$valid <- co$days_before > lag_days\n    c_ 
<- length(unique(co$person_id[co$valid]))\n    d  <- n_ctrls - c_\n\n    or_val <- if (b > 0 && c_ > 0) (a * d) / (b * c_) else NA_real_\n    data.frame(lag_months = lag_m, OR = or_val,\n               a = a, b = b, c = c_, d = d)\n  }))\n}\n# Protopathic-bias fingerprint: OR[lag=0] >> OR[lag=6] ~> OR[lag=12] ... approaching 1.\n# Plot lag_months vs OR; monotonic decline is the signature.",
        "description": "Naive and lagged OR computation in base R. Implements the same 2x2 arithmetic as the\nPython version and produces a lag-grid sensitivity table. No external packages required\nfor the core arithmetic; epitools::oddsratio is used for the confidence interval.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  Bio[\"Early disease onset\\n(cancer growing, HF progressing)\\n— not yet recorded\"]\n  Sx[\"Prodromal symptoms\\n(pain, dyspepsia, dyspnoea)\"]\n  Rx[\"Drug prescription\\n(opioid, PPI, diuretic)\"]\n  Dx[\"Diagnosis recorded\\nin database\"]\n  OR[\"Naive OR inflated:\\nappears drug → disease\"]\n  Lag[\"Apply lag window\\n(exclude fills < k months\\nbefore diagnosis)\"]\n  Fix[\"Lagged OR unbiased:\\nno association remains\"]\n\n  Bio -->|drives| Sx\n  Sx -->|prompts| Rx\n  Bio -->|eventually| Dx\n  Rx -->|in same window as| Dx\n  Rx & Dx -->|naive analysis sees| OR\n  Lag -->|removes protopathic fills| Fix\n  style Bio fill:#ffcccc,stroke:#cc0000\n  style OR fill:#ffe0cc,stroke:#cc6600\n  style Fix fill:#ccffcc,stroke:#00aa44",
        "caption": "The protopathic bias mechanism. Early (unrecorded) disease causes prodromal symptoms, which prompt the drug prescription. A naive analysis sees drug → diagnosis and infers causation; a lagged analysis excludes the contaminated window and recovers the null.",
        "alt_text": "Flowchart showing early disease onset driving prodromal symptoms, which prompt drug prescription; the diagnosis is recorded later; a naive analysis conflates the two; a lag window removes the contaminated fills and restores a null OR.",
        "source_type": "illustrative",
        "source_citations": [
          "horwitz-feinstein-1980"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Q[Drug associated with outcome\\nin database study?] --> P{Is the drug prescribed\\nfor symptoms of this\\noutcome or its prodrome?}\n  P -->|Yes| Proto[Suspect protopathic bias:\\napply lag-attenuation grid\\n0 / 6 / 12 / 24 months]\n  P -->|No| CBI[Suspect confounding by indication:\\nactive-comparator design\\nor severity adjustment]\n  Proto --> Grid{Does OR attenuate\\ntoward null as lag increases?}\n  Grid -->|Yes| Bias[Protopathic bias confirmed:\\nuse lagged OR as primary estimate;\\nreport lag-grid as sensitivity]\n  Grid -->|No| Causal[Possible true association:\\nlag-stable signal; investigate\\ninduction/latency further]\n  Bias --> NCO[Run negative-control outcome\\nto bound any residual bias]\n  Causal --> NCO\n  NCO --> Report[Report: naive OR, lagged OR,\\nlag-grid plot, NCO result,\\nbiological justification for lag]",
        "caption": "Decision logic for diagnosing and correcting protopathic bias. The key fork is whether the drug is prescribed for symptoms of the outcome's prodrome; the lag-attenuation grid is the empirical test; a negative-control outcome bounds residual bias.",
        "alt_text": "Decision flowchart from suspected drug-outcome association through prodrome-prescribing check, lag-attenuation grid, protopathic versus confounding-by-indication diagnosis, and reporting requirements.",
        "source_type": "illustrative",
        "source_citations": [
          "tamim-2007"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "used_with",
        "target_slug": "exposure-lag-induction-latency-window-rwe",
        "notes": "The exposure lag/induction window is the primary operational fix for protopathic bias; the lag period blanks the pre-diagnostic contaminated window on the time axis. Always route to this entry for implementation details and the immortal-time guardrail."
      },
      {
        "relation_type": "see_also",
        "target_slug": "confounding-by-indication-channeling",
        "notes": "Confounding by indication (severity confounds both treatment and outcome) is mechanistically distinct from protopathic bias (the outcome's prodrome drives exposure timing); both can operate simultaneously but require different remedies — severity adjustment versus a lag."
      },
      {
        "relation_type": "see_also",
        "target_slug": "surveillance-detection-bias",
        "notes": "Surveillance bias (treated patients are more closely monitored and have outcomes detected earlier or more often) can co-occur with protopathic bias; both inflate early-follow-up event counts, making lag sensitivity analyses and negative controls necessary to separate the mechanisms."
      },
      {
        "relation_type": "see_also",
        "target_slug": "prescription-sequence-symmetry",
        "notes": "Prescription-sequence-symmetry analysis tests whether a drug disproportionately precedes the outcome versus follows it; asymmetry in the pre-outcome direction is a diagnostic for protopathic bias and complements the lag-attenuation grid."
      },
      {
        "relation_type": "used_with",
        "target_slug": "negative-control-outcomes-rwe",
        "notes": "Negative-control outcomes with no plausible early-symptom link to the drug (e.g., accidental injury for a GI drug) detect residual protopathic bias or other confounding that persists after the lag is applied and enable empirical calibration of the corrected estimate."
      },
      {
        "relation_type": "see_also",
        "target_slug": "time-zero-index-date-alignment-rwe",
        "notes": "Correct alignment of the index date is essential before applying a lag: a misaligned time zero can create or destroy the apparent protopathic signal, and applying the lag to eligibility rather than the time axis re-creates immortal time."
      }
    ],
    "aliases": [
      "protopathic bias",
      "reverse causation",
      "prescribing bias",
      "reverse causality",
      "protopathic prescription bias",
      "pre-diagnostic exposure contamination"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "qaly-utility-mapping-rwe",
    "name": "QALY Utility Mapping (Crosswalking to Health-State Utilities)",
    "short_definition": "A statistical crosswalk that predicts preference-based health-state utility values (e.g., EQ-5D) from non-preference-based clinical or patient-reported measures collected in real-world data, so that quality-adjusted life-years can be estimated for a cost-utility analysis.",
    "long_description": "**QALY utility mapping** (also called crosswalking) is the estimation of a regression-type function that predicts a\n*preference-based* health-state utility — most often the EQ-5D index, anchored at 1 = full health and 0 = dead, with values\nbelow 0 allowed for states worse than death — from a *non-preference-based* source measure that was actually collected, such\nas a disease-specific PRO (FACT-G, EORTC QLQ-C30, HAQ-DI), a generic profile (SF-36/SF-12), or clinical variables (ECOG,\ntumour response, lab values). The fitted function is then applied to the real-world cohort to assign a utility at each\nmeasurement time, and those utilities are integrated over survival time to produce the quality-adjusted life-years (QALYs)\nthat feed the incremental cost-utility ratio. Mapping is the standard fallback when a trial or RWE study did not field a\npreference-based instrument but a reimbursement dossier (NICE, CADTH, PBAC, IQWiG) requires utilities on a recognised value\nset.\n\n**Core conceptual distinction** Mapping is *prediction of a value-set-anchored quantity*, not *measurement* of utility and\nnot a causal contrast. Three quantities must be kept separate. (1) The **source measure** (a symptom or function score) lives\non its own scale and carries no preference weighting. (2) The **target utility** is a cardinal index on a population value set\n(UK EQ-5D-3L tariff, US EQ-5D-5L, etc.) — mapping is value-set-specific, and a function fitted to the UK tariff must not be\nreused to produce US-tariff QALYs. (3) The **QALY** is the area under the utility-versus-time curve, U(t) integrated from time\nzero to death or horizon, summed across the cohort and discounted. The estimand of a cost-utility analysis built on mapped\nutilities is incremental QALYs (and the ICER/net monetary benefit), *conditional on the mapping model being a correct\npredictor in this population*. 
In the ISPOR task force terminology, indirect (response) mapping predicts the EQ-5D dimension responses and then applies the tariff;\ndirect (index) mapping predicts the index value itself by regression. The index is bounded above at 1, has a large spike of\nobservations at 1.0 (full health) and a gap below it, so ordinary least squares is structurally wrong at the boundary even\nwhen its mean R² looks acceptable.\n\n**Pros, cons, and trade-offs**\n- **vs collecting EQ-5D / a preference-based instrument directly:** Direct elicitation needs no mapping model and no\n  extrapolation assumptions, and is always the HTA-preferred option. Mapping's only advantage is that it rescues a dataset\n  that never fielded a preference-based measure, at the cost of added prediction uncertainty, value-set dependence, and a\n  known tendency to *compress* the utility distribution (overpredicting the worst states and underpredicting the best),\n  which biases incremental QALYs toward the null. **Prefer direct elicitation** whenever the study can still field EQ-5D;\n  map only when the preference-based data genuinely do not exist.\n- **vs OLS index mapping:** OLS is transparent and fast but ignores the bounded, multimodal, ceiling-spiked structure of\n  EQ-5D and can produce impossible predictions above 1. The **adjusted limited dependent variable mixture model (ALDVMM)** and\n  **beta-regression / two-part / tobit** approaches are built to respect the bounds and the full-health spike and are now the methods\n  recommended by the ISPOR task force and NICE DSU. **Prefer a mixture/limited-dependent model** over OLS for any submission;\n  keep OLS only as a sensitivity comparison.\n- **vs Markov / partitioned-survival modelling with externally sourced utilities:** Pulling utilities from a published\n  catalogue (Tufts CEA Registry, prior HTAs) avoids fitting a model but imports values from a different population and value\n  set, often with poor face validity for *your* cohort. Mapping uses your own patients' source scores. 
**Prefer mapping**\n  when you have a rich source PRO measured longitudinally in the actual cohort; **prefer catalogue utilities** when the\n  source measure is too sparse to fit a stable model.\n\n**When to use** A cost-utility analysis or HTA submission requires utilities on a recognised value set; the RWE study (or its\nfeeder trial) measured a non-preference-based PRO or clinical state at multiple time points but never an EQ-5D-type instrument;\na published, externally validated mapping algorithm exists for that exact source measure → target value set pairing (or you can\nfit and validate one in a separate estimation sample); and you can attach utilities to *time-indexed* observations so they can\nbe integrated over survival.\n\n**When NOT to use — and when it is actively misleading or dangerous**\n- **No conceptual overlap between source and target.** Mapping requires that the source measure capture the same health\n  construct the value set prices (mobility, self-care, usual activities, pain, anxiety/depression). Mapping a pure\n  biomarker (HbA1c, tumour size) onto EQ-5D fabricates a utility that has no behavioural basis; the function will fit\n  noise and the predicted QALY gain is an artefact.\n- **Extrapolating outside the estimation range.** A model fitted in moderate disease applied to a real-world cohort with\n  severe states (or to states-worse-than-death) predicts utilities the data never supported. 
The compression bias means\n  incremental QALYs are systematically attenuated — the worse arm looks better than it is, which can flip an ICER.\n- **Re-using a mapping across value sets or country tariffs.** A UK-3L mapping applied to generate US-5L QALYs is simply the\n  wrong quantity; the cardinal scale differs and the result is uninterpretable.\n- **Propagating only point predictions.** Reporting mapped QALYs without carrying the mapping model's prediction uncertainty\n  into the probabilistic sensitivity analysis understates decision uncertainty and overstates the precision of the ICER —\n  a substantive error in a reimbursement context.\n- **Differential measurement timing by arm.** If the source PRO is captured more often (or at sicker visits) in one arm,\n  the area-under-the-curve integration is biased before any mapping question arises; this is a data-capture failure that\n  mapping cannot fix.\n\n**Data-source operational depth** Mapping in RWE depends on whether the *source* PRO/clinical measure and the *survival/cost*\nspine come from the same linked record.\n- **Claims (administrative):** Pure claims carry **no** PRO or functional measure, so an index utility cannot be mapped from\n  claims alone — there is nothing to map *from*. Claims contribute the survival/censoring spine (the time axis of the QALY)\n  and resource costs, but utilities must be borrowed from a linked PRO source or a catalogue. 
A specific trap: in\n  Medicare Advantage (MA) person-time the encounter and death capture differ from fee-for-service (FFS), so the\n  QALY time axis itself is incomplete for MA-only members; restrict the integration window to observable FFS person-time or\n  a linked vital-records death date rather than letting MA gaps masquerade as survival.\n- **EHR:** May contain structured PROMIS/disease-specific instruments and clinical states (ECOG, NYHA class) that are\n  *mappable*, but capture is visit-driven and sicker patients are measured more often — so the observed utility trajectory\n  is informatively sampled. Carry forward the last observation only over clinically plausible windows, model the missingness,\n  and never let a long gap with no measurement be silently filled with the last (healthier) value.\n- **Registry:** Disease registries (oncology, rheumatology) often field the strongest source PROs and adjudicated clinical\n  states, making them the best substrate for fitting and applying a mapping; their weakness is incomplete cost and sometimes\n  death capture, so link to claims and a death index to complete the QALY denominator.\n- **Linked claims–EHR–registry–vital records:** The ideal substrate — registry/EHR supplies the mappable source measure,\n  claims supply cost and resource use, vital records firm up the death date that closes the QALY integral. Linkage selection\n  (only the linkable subset) and date discrepancies between the PRO visit and the service/death date must be reconciled before\n  utilities are aligned to the time axis.\n\n**Worked example (claims-linked registry, oncology cost-utility).** Question: lifetime QALYs for two first-line regimens in\nmetastatic disease, from a tumour registry that fielded EORTC QLQ-C30 at each cycle, linked to commercial + Medicare FFS\nclaims for cost and to the NDI/state death file for mortality. 
(1) **Eligibility / time axis:** index date = first\nqualifying regimen administration; require continuous medical enrolment from index so person-time is observable; assign the\nregimen arm from the administered product. (2) **Source measure:** each QLQ-C30 assessment (physical-functioning, role,\nemotional, fatigue, pain, dyspnoea subscales) at a known `assess_date`. (3) **Mapping:** apply a *pre-specified, published,\nexternally validated* QLQ-C30→EQ-5D-3L (UK tariff) algorithm — an ALDVMM, not OLS — to each assessment to get U at each\n`assess_date`, bounding predictions at 1 and allowing values <0. (4) **Utility trajectory:** within each person, order\nutilities by date and carry the value forward only to the next assessment (or a maximum 90-day plausibility window),\nsetting U=0 at the linked death date so the curve closes at \"dead\". (5) **QALY:** integrate U(t) by the trapezoidal rule\nover [index, death or data end] per person — QALY_i = Σ over consecutive measurements of (U_j + U_{j+1})/2 × (t_{j+1} − t_j)/365.25\n— then discount future increments at the jurisdiction rate (e.g., 3.5%/yr for NICE) and compare mean QALYs by arm. (6)\n**Cost-utility:** combine with mean discounted costs from the linked claims to form the ICER (ΔCost/ΔQALY) and net monetary\nbenefit at the willingness-to-pay threshold. (7) **Uncertainty:** in the probabilistic sensitivity analysis, draw the mapping\nmodel coefficients from their estimated covariance (or bootstrap the whole map-then-integrate pipeline) so the prediction\nuncertainty of the crosswalk propagates into the ICER, and run scenario analyses on the carry-forward window and an\nalternative value set. 
Sensitivity checks: compare mapped utilities against any subset with directly measured EQ-5D, and confirm the\nmapped distribution reproduces the ceiling spike at U=1 rather than smearing it.\n\n**Interpreting the output**\n\nPatient 1001 accumulated 0.3275 QALYs over 6 months, starting with a mapped utility of 0.730 at baseline\n(Physical Function = 80, Pain = 20, Fatigue = 30) and declining to 0.655 and then 0.580 as symptoms worsened.\n\n*(1) Formal interpretation.* The mapped utility of 0.730 is the predicted EQ-5D index value for a patient with\nthose symptom scores, estimated from a regression model fitted in a separate study population where both the\nclinical instrument and EQ-5D were measured. It is not a directly elicited preference; it inherits two layers of\nuncertainty: sampling uncertainty in the mapping model coefficients and model-specification uncertainty (choice of\nregression form, covariates, and target value set). The QALY accumulation applies the trapezoidal rule across\nintervals: the first 3-month segment contributes 0.1731 QALYs and the second 0.1544, reflecting the lower utility\nin the later period. These mapped QALYs are appropriate inputs to a cost-utility ICER when direct EQ-5D data are\nunavailable, provided the mapping model is validated and its uncertainty is propagated into the PSA.\n\n*(2) Practical interpretation.* The patient spent 6 months in health states valued between 0.580 and 0.730 on the\n0–1 utility scale. Mapping uncertainty means the QALY total could be noticeably higher or lower depending on which\npublished algorithm is applied; a sensitivity analysis comparing at least two published maps for the same instrument,\nwhere more than one exists, is strongly advisable in HTA submissions. The mapping estimate should never be presented as equivalent in precision to a\ndirectly measured EQ-5D utility; decision-makers should understand it is an approximation.",
    "primary_category": "Economic_Evaluation",
    "tags": [
      "utility-mapping",
      "crosswalk",
      "eq-5d",
      "health-state-utility",
      "qaly",
      "cost-utility-analysis",
      "patient-reported-outcomes",
      "aldvmm",
      "hta"
    ],
    "applies_to_study_types": [
      "claims_analysis",
      "ehr_study",
      "registry_linkage",
      "comparative_effectiveness"
    ],
    "data_sources": [
      "ehr",
      "registry",
      "linked",
      "claims"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1016/j.jval.2016.11.006",
        "url": "https://doi.org/10.1016/j.jval.2016.11.006",
        "citation_text": "Wailoo AJ, Hernandez-Alava M, Manca A, et al. Mapping to estimate health-state utility from non-preference-based outcome measures: an ISPOR good practices for outcomes research task force report. Value in Health. 2017;20(1):18-27.",
        "year": 2017,
        "authors_short": "Wailoo et al.",
        "notes": "The reference good-practices statement defining direct vs indirect mapping, model choice for the bounded multimodal EQ-5D distribution (limited dependent / mixture models over OLS), validation, and uncertainty propagation."
      },
      {
        "role": "explain",
        "doi": "10.1016/j.jval.2012.10.010",
        "url": "https://doi.org/10.1016/j.jval.2012.10.010",
        "citation_text": "Longworth L, Rowen D. Mapping to obtain EQ-5D utility values for use in NICE health technology assessments. Value in Health. 2013;16(1):202-210.",
        "year": 2013,
        "authors_short": "Longworth & Rowen",
        "notes": "NICE DSU view on when and how to map to EQ-5D for technology appraisal, including the requirement that mapping is a second-best fallback and must match the reference-case value set."
      },
      {
        "role": "demonstrate",
        "doi": "10.1007/s40273-017-0548-7",
        "url": "https://doi.org/10.1007/s40273-017-0548-7",
        "citation_text": "Ara R, Rowen D, Mukuria C. The use of mapping to estimate health state utility values. PharmacoEconomics. 2017;35(Suppl 1):57-66.",
        "year": 2017,
        "authors_short": "Ara et al.",
        "notes": "Applied walkthrough of estimation-sample design, candidate models, and external validation for a mapping function, with the practical pitfalls of compression and out-of-range extrapolation."
      }
    ],
    "plain_language_summary": "QALY utility mapping is a technique for putting a number on how good or bad a patient's health feels — on a 0-to-1 scale where 1 means perfect health and 0 means dead — when the study never directly asked patients the right question to get that number. Instead, analysts take a disease-specific symptom questionnaire the patients did fill out, apply a published translation formula (called a crosswalk or mapping), and convert those symptom scores into the 0-to-1 health-state values needed for economic analysis. Those converted values are then multiplied by the amount of time spent in each health state to calculate quality-adjusted life-years (QALYs), the standard currency health agencies worldwide use to decide whether a treatment is worth its cost.",
    "key_terms": [
      {
        "term": "utility",
        "definition": "A single number between 0 and 1 (with 0 meaning dead and 1 meaning perfect health) that summarizes how much a patient values their current health state, as measured by a preference-based questionnaire like the EQ-5D."
      },
      {
        "term": "QALY",
        "definition": "A quality-adjusted life-year combines how long a patient lives with how well they feel during that time — one year in perfect health equals 1.0 QALY, while one year at a utility of 0.6 equals 0.6 QALYs."
      },
      {
        "term": "mapping (crosswalk)",
        "definition": "A published statistical formula that converts scores from a disease-specific symptom questionnaire into the 0-to-1 utility scale, used when the study collected symptom data but not a direct utility measure like the EQ-5D."
      },
      {
        "term": "EQ-5D",
        "definition": "A short questionnaire that asks patients to rate five health dimensions (mobility, self-care, usual activities, pain, anxiety), then uses population survey data to convert those responses into a single utility score on the 0-to-1 scale."
      },
      {
        "term": "trapezoidal rule",
        "definition": "A simple way to estimate the area under a curve by drawing trapezoids between consecutive measurement points — used here to compute total QALYs as the average of two consecutive utility scores multiplied by the time between them."
      }
    ],
    "worked_example": {
      "scenario": "A rheumatology registry collected EORTC-style symptom scores — physical functioning (PF), pain, and fatigue, each on a 0-to-100 scale — from patient 1001 at three visits over six months, but never administered the EQ-5D. An analyst needs QALYs for a cost-utility analysis. Using the YAML source file's illustrative mapping formula, the analyst converts each visit's symptom scores into a utility, then uses the trapezoidal rule to compute QALYs over the two 3-month intervals.",
      "dataset": {
        "caption": "Raw symptom assessment records for one patient — three visits over six months. Physical functioning (PF) is higher when the patient functions better; pain and fatigue are higher when symptoms are worse (all on a 0-to-100 scale).",
        "columns": [
          "person_id",
          "assess_date",
          "pf_score",
          "pain_score",
          "fatigue_score"
        ],
        "rows": [
          [
            1001,
            "2024-01-01",
            80,
            20,
            30
          ],
          [
            1001,
            "2024-04-01",
            70,
            30,
            40
          ],
          [
            1001,
            "2024-07-01",
            60,
            40,
            50
          ]
        ]
      },
      "steps": [
        "Apply the mapping formula to each visit: mapped_utility = 0.90 - 0.0030 x (100 - PF) - 0.0025 x pain - 0.0020 x fatigue.",
        "Visit 1 (2024-01-01): utility = 0.90 - 0.0030 x (100 - 80) - 0.0025 x 20 - 0.0020 x 30 = 0.90 - 0.060 - 0.050 - 0.060 = 0.730.",
        "Visit 2 (2024-04-01): utility = 0.90 - 0.0030 x (100 - 70) - 0.0025 x 30 - 0.0020 x 40 = 0.90 - 0.090 - 0.075 - 0.080 = 0.655.",
        "Visit 3 (2024-07-01): utility = 0.90 - 0.0030 x (100 - 60) - 0.0025 x 40 - 0.0020 x 50 = 0.90 - 0.120 - 0.100 - 0.100 = 0.580.",
        "Each interval between visits is 91 days, treated here as 0.25 years for clean arithmetic (a standard 3-month approximation).",
        "Segment 1 QALY (Jan to Apr): average utility = (0.730 + 0.655) / 2 = 0.6925; QALYs = 0.6925 x 0.25 = 0.1731.",
        "Segment 2 QALY (Apr to Jul): average utility = (0.655 + 0.580) / 2 = 0.6175; QALYs = 0.6175 x 0.25 = 0.1544.",
        "Total QALYs over 6 months = 0.1731 + 0.1544 = 0.3275."
      ],
      "result": "Patient 1001 accumulated 0.3275 QALYs over the 6-month observation window. Because their symptom scores worsened across visits (PF declined from 80 to 60, pain rose from 20 to 40, fatigue from 30 to 50), each mapped utility fell accordingly (0.730 to 0.655 to 0.580), and the second segment contributed fewer QALYs (0.1544) than the first (0.1731). A full cost-utility analysis would compare mean QALYs across treatment arms and pair them with costs to form an incremental cost-effectiveness ratio."
    },
    "prerequisites": [
      "pro-rwe",
      "hrqol",
      "cost-utility"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Indirect (regression / index) mapping",
        "description": "A regression predicts the preference-based index value directly from the source measure's scores (and covariates such as age/sex), using a model that respects the EQ-5D bounds and the spike at full health (ALDVMM, beta, two-part, or tobit) rather than OLS.",
        "edge_cases": [
          "OLS produces predictions above 1 and smears the full-health ceiling spike; mean R² can look adequate while boundary predictions are systematically wrong.",
          "Compression bias attenuates incremental QALYs (worst states overpredicted, best states underpredicted), biasing the ICER toward the null."
        ],
        "data_source_notes": "registry/EHR: fit on observations that have BOTH the source PRO and (where available) a directly measured EQ-5D; reserve a hold-out for external validation rather than reporting only in-sample fit."
      },
      {
        "name": "Direct (response) mapping",
        "description": "A model predicts the responses to each EQ-5D dimension (e.g., ordered logits for mobility, self-care, etc.) from the source measure, and the country value-set tariff is then applied to the predicted profile to yield the index.",
        "edge_cases": [
          "Requires the estimation sample to contain the EQ-5D dimension responses, not just the index, which is often unavailable.",
          "Tariff choice is explicit and value-set-specific; a UK-3L tariff must not be applied to generate US-5L utilities."
        ],
        "data_source_notes": "linked: confirm the EQ-5D version and value set of the estimation data match the reference case required by the target HTA body."
      },
      {
        "name": "Catalogue/published-algorithm application (no local fitting)",
        "description": "An externally published, validated mapping algorithm for the exact source-measure to target-value-set pairing is applied to the RWE cohort without re-estimation, with the published coefficient covariance used for uncertainty.",
        "edge_cases": [
          "Transportability gap if the published algorithm was estimated in a different disease severity or country population.",
          "Point-estimate-only application understates decision uncertainty unless the published variance-covariance is carried into the PSA."
        ],
        "data_source_notes": "claims: claims alone have no source PRO to feed any algorithm; the source measure must come from a linked registry/EHR record."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Direct elicitation of a preference-based instrument (EQ-5D) in the study",
        "pros_of_this": "Rescues a cohort that never fielded a preference-based measure, using the patients' own source PRO/clinical scores rather than borrowed values.",
        "cons_of_this": "Adds prediction uncertainty and value-set dependence and tends to compress the utility distribution, attenuating incremental QALYs and biasing the ICER toward the null.",
        "when_to_prefer": "Only when EQ-5D-type data genuinely cannot be collected; direct elicitation is always the HTA-preferred option."
      },
      {
        "compared_to": "OLS index mapping",
        "pros_of_this": "Mixture/limited-dependent models (ALDVMM, beta, two-part, tobit) respect the EQ-5D bounds and the full-health spike and are the ISPOR/NICE-recommended specifications.",
        "cons_of_this": "More complex to fit and explain; convergence and component-count choices require care and reporting.",
        "when_to_prefer": "For any submission-grade map; retain OLS only as a transparent sensitivity comparator."
      },
      {
        "compared_to": "Externally sourced catalogue utilities in a Markov / partitioned-survival model",
        "pros_of_this": "Uses utilities derived from the actual study cohort's measured health states rather than values imported from a different population and value set.",
        "cons_of_this": "Requires a source PRO rich enough to fit/apply a stable model; sparse source data make catalogue values more defensible.",
        "when_to_prefer": "When a longitudinal source PRO is well measured in the cohort and face validity of borrowed utilities is poor."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Claims carry no PRO or functional measure, so nothing can be mapped from claims alone; claims supply the survival/censoring time axis and costs only. Restrict QALY integration to observable fee-for-service person-time and use a linked death date, because Medicare Advantage-only person-time has incomplete encounter/death capture that would corrupt the time axis.",
      "ehr": "Structured PROMIS/disease-specific instruments and clinical states (ECOG, NYHA) are mappable, but capture is visit-driven and informatively sampled (sicker patients measured more often). Carry utilities forward only over clinically plausible windows and model the missingness rather than silently propagating the last healthier value.",
      "registry": "Disease registries usually field the strongest source PROs and adjudicated clinical states, making them the best substrate for fitting and applying a mapping; link to claims and a death index to complete cost and mortality.",
      "linked": "Linked registry/EHR (source PRO) + claims (cost) + vital records (death) is the ideal substrate. Reconcile the date discrepancies between PRO assessment, service, and death dates before aligning utilities to the QALY time axis, and note that linkage restricts to the linkable subset."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\nimport pandas as pd\n\nDISCOUNT_RATE = 0.035   # NICE reference-case annual rate for effects\nCARRY_MAX_DAYS = 90     # max plausibility window for last-observation-carried-forward\n\ndef map_utility(pro: pd.DataFrame) -> pd.Series:\n    # PLACEHOLDER for a published, externally validated mapping (e.g., ALDVMM coefficients).\n    # Replace with the real linear predictor / mixture prediction; keep the boundary handling.\n    lp = (0.90\n          - 0.0030 * (100 - pro[\"physical_functioning\"])\n          - 0.0025 * pro[\"pain\"]\n          - 0.0020 * pro[\"fatigue\"])\n    return np.clip(lp, a_min=None, a_max=1.0)   # cap at full health; values <0 allowed\n\ndef qalys_by_person(pro: pd.DataFrame, death: pd.DataFrame) -> pd.DataFrame:\n    pro = pro.sort_values([\"person_id\", \"assess_date\"]).copy()\n    pro[\"utility\"] = map_utility(pro)\n\n    out = []\n    for pid, g in pro.groupby(\"person_id\"):\n        d = death.loc[death[\"person_id\"] == pid].iloc[0]\n        end = d[\"death_date\"] if pd.notna(d[\"death_date\"]) else d[\"obs_end\"]\n        # Knot points: each assessment, plus a terminal knot (U=0 at death, else last utility at obs_end).\n        t = list(g[\"assess_date\"]); u = list(g[\"utility\"])\n        t.append(end); u.append(0.0 if pd.notna(d[\"death_date\"]) else u[-1])\n\n        qaly = 0.0\n        for j in range(len(t) - 1):\n            seg_days = (t[j + 1] - t[j]).days\n            if seg_days <= 0:\n                continue\n            # Carry-forward plausibility: long gaps with no measurement are not credited full utility.\n            eff_days = min(seg_days, CARRY_MAX_DAYS) if seg_days > CARRY_MAX_DAYS else seg_days\n            yrs_from_index = (t[j] - d[\"index_date\"]).days / 365.25\n            disc = (1.0 + DISCOUNT_RATE) ** (-max(yrs_from_index, 0.0))\n            qaly += (u[j] + u[j + 1]) / 2.0 * (eff_days / 365.25) * disc   # discounted trapezoid\n        out.append({\"person_id\": pid, 
\"arm\": g[\"arm\"].iloc[0], \"qaly\": qaly})\n    return pd.DataFrame(out)\n\n# qalys = qalys_by_person(pro, death)\n# qalys.groupby(\"arm\")[\"qaly\"].mean()   # mean discounted QALYs per arm -> feeds the ICER",
        "description": "Apply a pre-specified utility-mapping function to a longitudinal source-PRO table, then integrate utility over time into\ndiscounted QALYs. Required inputs (already cleaned, one row per person-assessment):\n  pro    : person_id, arm, assess_date (datetime), and the source-measure subscale columns used by the mapping\n  death  : person_id, death_date (datetime or NaT), index_date (datetime), obs_end (datetime)  # closes the QALY integral\nThe map_utility() function is a stand-in for a published/externally validated algorithm (e.g., an ALDVMM); replace it with\nthe actual coefficients. Utilities are bounded at 1 (full health) and allowed below 0 (states worse than death).",
        "dependencies": [
          "pandas",
          "numpy"
        ],
        "source_citations": [
          "wailoo-2017"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(dplyr)\n\nDISCOUNT_RATE <- 0.035\nCARRY_MAX_DAYS <- 90\n\nmap_utility <- function(pro) {\n  # PLACEHOLDER for a published, externally validated mapping (e.g., ALDVMM).\n  lp <- 0.90 -\n        0.0030 * (100 - pro$physical_functioning) -\n        0.0025 * pro$pain -\n        0.0020 * pro$fatigue\n  pmin(lp, 1.0)   # cap at full health; values < 0 allowed (worse than death)\n}\n\nqalys_by_person <- function(pro, death) {\n  pro <- pro %>% arrange(person_id, assess_date) %>% mutate(utility = map_utility(.))\n  split(pro, pro$person_id) |>\n    lapply(function(g) {\n      d   <- death[death$person_id == g$person_id[1], ][1, ]\n      end <- if (!is.na(d$death_date)) d$death_date else d$obs_end\n      t <- c(g$assess_date, end)\n      u <- c(g$utility, if (!is.na(d$death_date)) 0.0 else tail(g$utility, 1))\n      qaly <- 0.0\n      for (j in seq_len(length(t) - 1L)) {\n        seg <- as.numeric(t[j + 1] - t[j])\n        if (seg <= 0) next\n        eff <- if (seg > CARRY_MAX_DAYS) CARRY_MAX_DAYS else seg\n        yrs <- as.numeric(t[j] - d$index_date) / 365.25\n        disc <- (1 + DISCOUNT_RATE) ^ (-max(yrs, 0))\n        qaly <- qaly + (u[j] + u[j + 1]) / 2 * (eff / 365.25) * disc\n      }\n      data.frame(person_id = g$person_id[1], arm = g$arm[1], qaly = qaly)\n    }) |>\n    bind_rows()\n}\n\n# qalys <- qalys_by_person(pro, death)\n# aggregate(qaly ~ arm, data = qalys, FUN = mean)   # mean discounted QALYs per arm",
        "description": "Same map-then-integrate pipeline in R. Inputs mirror the Python version:\n  pro   : person_id, arm, assess_date (Date), source-measure subscale columns\n  death : person_id, death_date (Date or NA), index_date (Date), obs_end (Date)\nmap_utility() is a placeholder for a published/validated algorithm (e.g., ALDVMM); substitute the real coefficients and\nkeep the cap at 1 and the U=0-at-death terminal knot.",
        "dependencies": [
          "dplyr"
        ],
        "source_citations": [
          "wailoo-2017"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "%let disc = 0.035;          /* annual discount rate for effects */\n%let carry_max = 90;        /* max LOCF plausibility window, days */\n\n/* (A) Fit an indirect mapping that keeps predictions in range (beta likelihood on rescaled utility). */\nproc nlmixed data=work.estim;\n  parms b0=0.9 b_pf=-0.003 b_pain=-0.0025 b_fat=-0.002 phi=10;\n  mu  = b0 + b_pf*(100 - phys_func) + b_pain*pain + b_fat*fatigue;   /* linear predictor */\n  mu  = min(max(mu, 1e-6), 1 - 1e-6);                                /* keep on (0,1) for beta */\n  a   = mu*phi;  bta = (1-mu)*phi;                                   /* beta reparameterisation */\n  ll  = lgamma(a+bta) - lgamma(a) - lgamma(bta)\n        + (a-1)*log(eq5d_index) + (bta-1)*log(1-eq5d_index);\n  model eq5d_index ~ general(ll);\nrun;\n\n/* (B) Score the fitted map onto the RWE cohort (substitute the estimated b0/b_pf/... below). */\ndata scored;\n  set work.pro;\n  utility = 0.90 - 0.0030*(100 - phys_func) - 0.0025*pain - 0.0020*fatigue;\n  if utility > 1 then utility = 1;     /* cap at full health; values < 0 allowed */\nrun;\n\n/* Add the terminal knot: U=0 at death, else last utility at obs_end. */\nproc sort data=scored; by person_id assess_date; run;\ndata knots;\n  merge scored work.death(keep=person_id index_date death_date obs_end);\n  by person_id;\nrun;\n\n/* Discounted trapezoidal QALY per person between consecutive measurement dates. 
*/\ndata qalys;\n  set knots; by person_id;\n  retain prev_date prev_u idx;\n  if first.person_id then do; prev_date=.; idx=index_date; qaly=0; end;\n  if not missing(prev_date) then do;\n    seg = assess_date - prev_date;\n    if seg > 0 then do;\n      eff = min(seg, &carry_max);\n      yrs = (prev_date - idx)/365.25;\n      disc = (1+&disc)**(-max(yrs,0));\n      qaly + (prev_u + utility)/2 * (eff/365.25) * disc;\n    end;\n  end;\n  prev_date = assess_date; prev_u = utility;\n  if last.person_id then output;     /* one QALY per person; close at death_date/obs_end upstream as needed */\n  keep person_id arm qaly;\nrun;\n\nproc means data=qalys mean; class arm; var qaly; run;   /* mean discounted QALYs per arm -> ICER */",
        "description": "Two SAS steps that genuinely fit utility mapping: (A) estimate an indirect mapping with a beta-type limited-dependent\nresponse via PROC NLMIXED (a tractable stand-in for the ALDVMM family that respects the (0,1] range better than OLS), then\n(B) score the function onto the RWE cohort and integrate discounted QALYs by the trapezoidal rule in PROC SQL/data step.\nRequired input datasets (post data-management):\n  work.estim : person_id, eq5d_index (observed, estimation sample only), source subscales (phys_func, pain, fatigue)\n  work.pro   : person_id, arm, assess_date, source subscales (the RWE cohort to be mapped)\n  work.death : person_id, index_date, death_date (or .), obs_end\nReplace the toy linear predictor with the validated specification; confirm out-of-sample fit before scoring.",
        "dependencies": [],
        "source_citations": [
          "wailoo-2017"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Src[Source measure in RWE<br/>disease-specific or generic PRO / clinical state] --> Check{Conceptual overlap with<br/>value-set domains?}\n  Check -- No --> Stop[Do NOT map<br/>predicted utility is an artefact]\n  Check -- Yes --> Model[Mapping model<br/>ALDVMM / beta / two-part - NOT OLS]\n  Model --> Tariff[Apply country value set<br/>e.g. UK EQ-5D-3L tariff]\n  Tariff --> U[Utility U at each assessment<br/>cap at 1, allow <0]\n  U --> Integrate[Integrate U over survival time<br/>trapezoid + discount; U=0 at death]\n  Integrate --> QALY[Discounted QALYs per arm]\n  QALY --> ICER[ICER = dCost / dQALY<br/>+ net monetary benefit]\n  QALY --> PSA[Carry mapping prediction uncertainty<br/>into probabilistic sensitivity analysis]",
        "caption": "Decision and data flow for QALY utility mapping — from a non-preference-based source measure through a bounded mapping model and value-set tariff to discounted QALYs and the cost-utility result, with the conceptual-overlap gate that blocks invalid maps and the uncertainty path into PSA.",
        "alt_text": "Flowchart starting from a source PRO/clinical measure, a gate checking conceptual overlap with value-set domains that stops invalid mappings, a bounded mapping model, tariff application, per-assessment utilities, integration over survival into discounted QALYs, and outputs to the ICER and probabilistic sensitivity analysis.",
        "source_type": "illustrative",
        "source_citations": [
          "wailoo-2017"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  A[Utilities at assessment dates<br/>U1, U2, U3 ...] --> B[Order by date within person]\n  B --> C[Trapezoid between consecutive dates<br/>area = mean U x time]\n  C --> D[Terminal knot<br/>U=0 at linked death date]\n  D --> E[Sum segments = undiscounted QALY]\n  E --> F[Discount each segment to index<br/>1+r ^ -years]\n  F --> G[Person-level discounted QALY]",
        "caption": "Area-under-the-utility-curve construction of a QALY from time-indexed mapped utilities, closing the integral with a zero-utility knot at the linked death date and discounting each segment back to the index date.",
        "alt_text": "Left-to-right flow showing per-person utilities ordered by date, trapezoidal area between consecutive measurements, a zero-utility terminal knot at death, summation to an undiscounted QALY, per-segment discounting, and the final person-level discounted QALY.",
        "source_type": "illustrative",
        "source_citations": [
          "wailoo-2017"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "produces",
        "target_slug": "cost-utility",
        "notes": "Mapped utilities are integrated into the QALYs that drive a cost-utility analysis when a preference-based measure was not directly collected."
      },
      {
        "relation_type": "used_with",
        "target_slug": "icer-net-monetary-benefit-rwe",
        "notes": "Mean discounted QALYs from mapping are the denominator of the ICER and enter the net-monetary-benefit calculation."
      },
      {
        "relation_type": "requires",
        "target_slug": "pro-rwe",
        "notes": "Mapping needs a non-preference-based patient-reported (or clinical) source measure collected in the real-world data; claims alone provide nothing to map from."
      },
      {
        "relation_type": "alternative_to",
        "target_slug": "hrqol",
        "notes": "Directly measuring a preference-based HRQoL instrument removes the need to map; mapping is the second-best fallback when that instrument was not fielded."
      },
      {
        "relation_type": "used_with",
        "target_slug": "discounting-costs-effects-rwe",
        "notes": "Mapped QALYs are discounted to present value before forming the incremental cost-utility ratio."
      },
      {
        "relation_type": "used_with",
        "target_slug": "probabilistic-sensitivity-analysis-hea-rwe",
        "notes": "The mapping model's prediction uncertainty must be propagated into the PSA so decision uncertainty is not understated."
      },
      {
        "relation_type": "see_also",
        "target_slug": "survival-extrapolation-hta-rwe",
        "notes": "The QALY integral runs over modelled survival time, so utility mapping and survival extrapolation jointly determine lifetime QALYs in an HTA model."
      },
      {
        "relation_type": "see_also",
        "target_slug": "partitioned-survival-models-rwe",
        "notes": "Partitioned-survival and Markov cost-utility models consume health-state utilities that mapping can supply from the study cohort's own source measures."
      }
    ],
    "aliases": [
      "utility mapping",
      "crosswalking to utilities",
      "health-state utility mapping",
      "mapping to EQ-5D",
      "PRO-to-utility crosswalk"
    ],
    "complexity": "advanced",
    "regulatory_relevance": [
      "hta",
      "ema",
      "fda"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "qc-double-programming-reproducibility",
    "name": "QC, Double Programming, and Reproducible Analysis",
    "short_definition": "The industry-standard quality-control ladder for RWE analysis — from code review through independent double programming (a second programmer re-derives key results from the SAP without seeing the first programmer's code, and any discrepancy triggers a formal resolution log) — combined with reproducibility engineering practices (version-controlled code, locked data snapshots with checksums, pinned software environments, and explicit seeds for stochastic operations) that ensure the same analysis can be reproduced from raw data to final tables by any analyst at any future time.",
    "long_description": "**The QC ladder in real-world evidence analysis**\n\nEvery RWE analysis output — a cohort count, an incidence rate, a hazard ratio — is only as\ntrustworthy as the code that produced it. Quality control (QC) in analytic programming is not\na final box-check; it is a tiered system of verification activities that begins when the first\nline of code is written and ends when the locked output is delivered. The industry-standard\nladder runs from light to rigorous.\n\n**Code review** is the first rung: a second analyst reads the programming code against the SAP\nto confirm that each step — cohort-definition filters, outcome algorithm, covariate coding, model\nspecification — matches what was pre-specified. Code review catches obvious deviations and is\ninexpensive, but it is vulnerable to shared misreadings. If both the author and reviewer\nmisinterpret the same SAP sentence in the same direction, the error survives intact.\n\n**Output review** is the second rung: a senior analyst or statistician reviews the formatted\ntables, listings, and figures against expected ranges drawn from the literature and clinical\njudgment. Are the incidence rates plausible? Is the hazard ratio directionally consistent with\nprior estimates? Output review catches gross numerical implausibilities but not subtle\nimplementation errors that produce plausible-looking wrong numbers.\n\n**Independent double programming** is the highest rung and the only one that systematically\ncatches errors arising from ambiguity between the SAP and the code. A second programmer — who\nhas not seen the primary programmer's code — reads the SAP and independently writes code to\nre-derive the same key outputs (cohort counts, baseline characteristics, primary endpoint\nestimates). The two output sets are compared numerically, typically using a formal comparison\nutility (PROC COMPARE in SAS, a diff of formatted tables, or a purpose-built reconciliation\nscript). 
Any discrepancy triggers a **discrepancy resolution log**: both programmers document\ntheir interpretation of the SAP clause that was ambiguous, the discrepancy is traced to its\ndata-level source, and the correct interpretation is confirmed with the study statistician. If\nthe SAP was ambiguous, it is amended before outcome-dependent re-programming begins.\n\n**When full double programming is warranted versus risk-based QC**\n\nFull independent double programming is the expected standard for regulatory deliverables — FDA\nsubmissions, EMA dossiers, PMDA safety reports — and for the primary efficacy and safety\nendpoints of any comparative study intended to influence formulary coverage, label language, or\nclinical guideline decisions. The cost in analyst time is justified because a programming error\ndiscovered after regulatory submission requires a complete re-analysis and formal amendment,\ncosting far more than the original double-programming step.\n\nFor internal exploratory analyses, feasibility counts, and hypothesis-generating work where\nresults will not directly influence a regulated decision, a risk-based QC approach is\nappropriate: code review plus output review, with a documented rationale for not double\nprogramming. The key discipline is that the decision is explicit and archived, not simply an\nomission. If an exploratory finding is later repurposed as the basis for a regulatory or HTA\ndeliverable, the QC level must be escalated retrospectively — meaning the analysis must be\nre-done with independent double programming, not simply certified after the fact.\n\n**Reproducibility engineering for RWE**\n\nIndependent double programming confirms that two programmers reach the same number today.\nReproducibility engineering ensures the same number can be reached tomorrow, by any analyst,\nfrom the raw data. 
The minimum reproducibility stack for an RWE analysis has five components.\n\nFirst, **version control (git)**: every analysis script, macro, and configuration file is\ncommitted to a version-controlled repository with meaningful commit messages. Tags or branches\nmark the code state used to produce each versioned deliverable. Without version control, there\nis no reliable way to identify which code produced which output if a discrepancy surfaces\nmonths after delivery.\n\nSecond, **locked data snapshots with checksums**: the raw data extract is frozen at a defined\ncalendar cut date and a checksum is computed for every source file (SHA-256 is preferred; MD5\ndetects accidental corruption but is not collision-resistant). The checksum is stored alongside\nthe data manifest and re-verified at the start of every re-run. In living databases\n(administrative claims systems that are continuously updated with late-arriving adjudicated\nclaims), the cut date is not an administrative convenience; it is part of what the analysis is\nestimating. Two analyses that use identical code and identical study\nparameters but different cut dates may produce numerically different cohort sizes and outcome\ncounts because late-arriving claims can change who is observable in the data. The cut date must\ntherefore be documented with the same precision as the eligibility criteria, and some authors\nargue it should be considered part of the estimand specification for living databases.\n\nThird, **environment pinning**: software versions and defaults change analytical results. R's\n`survival::coxph` handles tied event times with the Efron approximation by default, whereas SAS\nPROC PHREG defaults to Breslow, so the same Cox model can return slightly different estimates\nin the two tools unless the tie method is set explicitly. R 3.6.0 changed the default algorithm\nbehind `sample()`, so the same seed yields different draws before and after that release. Python\n`lifelines` and `statsmodels` have also changed default arguments across versions. 
Pinning the software environment (via `renv.lock` in R,\na `requirements.txt` with pinned versions and hashes or a lockfile generated from `pyproject.toml` in Python, or a documented SAS\nmetadata server version in a regulated environment) ensures that a re-run in two years uses\nidentical software. Containerization (Docker, Apptainer/Singularity) provides the strongest\nguarantee by freezing the operating system, language runtime, and all dependencies in a single\nportable image.\n\nFourth, **seeds for stochastic operations**: any analysis step that involves randomness\n(bootstrap confidence intervals, multiple imputation, propensity-score matching with random\ntiebreaking, Monte Carlo sensitivity analysis) must have an explicit random seed set\nimmediately before execution. The seed value must be stored in the analysis script itself (not\nonly in a log file), and documented in the SAP. Without a seed, two runs of the same code on\nthe same data will produce different numbers, so neither programmer's stochastic output is\nrepeatable and a numeric comparison between Programmer A and Programmer B has no fixed target.\n\nFifth, **one-button rerun from raw extract to tables**: the gold standard is a single\ncommand (a `Makefile` target, a `targets` pipeline in R defined in `_targets.R`, a numbered SAS\nflow script, or a Snakemake workflow) that executes the full analysis chain from reading the checksummed\nraw data through cohort construction, variable derivation, statistical modeling, and\ntable/figure generation without any manual intervention or analyst-specific path edits. This\nmakes the analysis independently auditable: a reviewer can clone the repository, restore the\npinned environment, point to the locked data extract, and reproduce every table.\n\n**The analytic-decision audit trail**\n\nEvery choice that changes who is in the cohort (washout length, same-day-event handling,\ngrace period definition, lookback window, inclusion of patients with zero-day enrollment)\nmust be logged in the analytic-decision audit trail. 
This log is distinct from the git commit\nhistory (which captures what changed in the code) and distinct from the discrepancy resolution\nlog (which captures what changed in response to QC findings). The audit trail documents the\nreasoning for each design and implementation choice: what the SAP said, what edge cases were\nfound in the data distribution, and what the study statistician decided. The audit trail ties\ndirectly to attrition reporting: a CONSORT-style flow diagram showing how many patients were\nexcluded at each step must be numerically consistent with the audit trail, and every filter\nmust appear in both.\n\n**Common silent failures QC catches**\n\nThe most dangerous programming errors in RWE are those that produce plausible-looking numbers.\nIndependent double programming and systematic output review consistently surface four archetypes.\n\n*Merge duplications (claim-line fan-out)*: a patient with five claim lines for the same service\ndate merges five rows into a utilization table instead of one, inflating cost totals and event\ncounts. The symptom is a cohort-level mean cost that is implausibly high relative to literature\nbenchmarks, or a post-merge row count that exceeds the pre-merge patient count. The fix is a\ndeduplication step before the join, or a many-to-one assertion after it.\n\n*Missing-data coercion*: an unhandled NULL or SAS missing value (`.`) is coerced to zero in\narithmetic, making a patient appear to have zero cost or zero utilization rather than\nmissing data. The coercion is silent: the analysis runs without error, but missing costs are\ncounted as zero costs in the mean. 
In R, `NA` propagates through arithmetic and comparisons unless it is handled explicitly (for\nexample with `na.rm = TRUE`), whereas SAS sorts `.` below every number and skips it in `SUM()`\nand PROC MEANS; a port between the two that does not handle missingness explicitly can change\nwhich rows pass a filter and which values enter a mean.\n\n*Date arithmetic off-by-ones*: calculating enrollment duration as\n`disenrollment_date - enrollment_start` gives the number of days between the two dates, not\nthe number of days of enrollment (which includes the start date and thus adds 1 in the\ninclusive convention). For example, enrollment from 2021-03-01 through 2021-03-31 is 30 days by\nsubtraction but 31 days inclusive. The CONSORT flow count and the eligibility criterion must agree on which\nconvention is used, and that convention must be written into the SAP.\n\n*Unintended row filtering*: a `WHERE` clause that correctly filters the base population is\napplied after a merge that changed the grain of the dataset, inadvertently dropping rows that\nshould be retained or retaining duplicates that should have been collapsed. The symptom is a\nrow count that does not match the expected attrition step in the flow diagram.\n\n**ALCOA+ data integrity in plain terms**\n\nFDA data integrity guidance is built on ALCOA (Attributable, Legible, Contemporaneous,\nOriginal, Accurate); the extended ALCOA+ form used across regulated industry adds Complete,\nConsistent, Enduring, and Available. For an RWE analysis, these translate to: every data transformation\nstep is documented and traceable to the analyst who made it (Attributable); code is\nversion-controlled and retrievable (Legible, Enduring); commits and audit-trail entries are\nrecorded when each decision is made, not reconstructed afterward (Contemporaneous); checksums verify that the raw data\nhave not been altered since the cut date (Original, Accurate); the locked dataset policy\nensures the analysis dataset matches what was used to produce the submitted output (Consistent,\nComplete); and the code repository and data manifest remain accessible for the regulatory\nretention period (Available).\n\n**QC for AI-assisted analysis code**\n\nWhen analysis code is generated or accelerated by a large language model, the same independent\nverification standard applies. 
The fact that code came from an LLM rather than a human programmer does not exempt it from\ndouble programming. In practice, LLM-generated code for RWE\nanalysis should be treated as a first-draft submission from a junior programmer: it requires\ncode review against the SAP, testing on synthetic or subsetted data with known outputs, and,\nfor primary endpoints in regulatory deliverables, independent double programming. LLM-generated\ncode has characteristic failure modes that parallel those of human code: silent off-by-one\nerrors in date arithmetic, merge joins that fan out rows, and eligibility criteria implemented\nfrom a common reading of similar language rather than the specific SAP wording. These are\nprecisely the errors that independent double programming is designed to detect.\n\n**Pros, cons, and trade-offs**\n\n*Independent double programming*:\n- Pros: highest sensitivity for detecting errors arising from SAP ambiguity; produces a\n  discrepancy resolution log that is itself a regulatory audit artifact; identifies SAP gaps\n  before outcome data are examined; the only QC method that independently tests the\n  interpretation of eligibility criteria.\n- Cons: approximately doubles the programming effort; requires two analysts with comparable\n  technical depth; cannot catch errors that both programmers make identically because they share\n  the same misreading of the SAP.\n\n*Risk-based QC (code review plus output review)*:\n- Pros: substantially less expensive in analyst time; appropriate and sufficient for exploratory\n  and hypothesis-generating work; can be applied consistently across high volumes of internal\n  analyses.\n- Cons: misses errors arising from shared misinterpretation; output review may fail to detect\n  plausible-but-wrong numbers in domains where benchmarks are wide.\n\n*Reproducibility stack (version control, snapshots, pinning, seeds)*:\n- Pros: enables independent reproduction by any analyst at any future time; checksum-verified\n 
 data snapshots provide chain-of-custody evidence for regulatory retention; pinned environments\n  prevent silent result drift from package updates; seeds make stochastic outputs deterministic.\n- Cons: initial setup overhead; container builds add infrastructure complexity; renv/pyproject\n  lockfiles require discipline to keep current; teams unfamiliar with git require training.\n\n**When to use**\n\nApply the full QC ladder — including independent double programming — for any RWE analysis\nwhere outputs will be submitted to a regulatory agency, included in an HTA dossier, used to\nsupport a formulary or coverage decision, or published as the primary effectiveness or safety\nfinding of a comparative study. Apply the reproducibility stack universally: even exploratory\nanalyses benefit from version control and locked data snapshots because a finding that starts\nas exploratory may later be elevated to a confirmatory deliverable. Apply explicit seeds to\nevery analysis that includes any stochastic step, regardless of the QC tier.\n\n**When NOT to use**\n\nDo not skip QC under time pressure — an undetected programming error in a regulatory submission\nrequires a complete re-analysis and formal amendment, costing far more time than the original\ndouble-programming step. Do not treat LLM-generated code as pre-verified or as exempt from the\nQC ladder. Do not conflate code review with independent double programming: reviewing the same\nprogrammer's code does not test whether the SAP was interpreted correctly, because the reviewer\nmay share the programmer's interpretation. 
Do not omit seeds from stochastic steps on the\nassumption that differences will be small: for bootstrap or multiple imputation with small\neffective sample sizes, seed-dependent variation can be clinically meaningful.\n\n**Interpreting the output**\n\nThe primary deliverable of a QC process is not a p-value or an effect estimate. It is a\n**discrepancy resolution log** paired with a reconciliation statement confirming that both\nprogrammers reach numerically identical outputs after all discrepancies are resolved. In the\nworked example, Programmer A reports a cohort of n = 12,450 and Programmer B reports n = 12,480.\n\n*(1) Formal interpretation.* The discrepancy of 12,480 minus 12,450 equals 30 patients.\nThese 30 patients have enrollment_start equal to disenrollment_date: they enrolled and\ndisenrolled on the same calendar day, yielding an enrollment duration of zero days when\ncomputed as the arithmetic difference between the two dates. Programmer A applied the filter\nrequiring strictly positive enrollment duration (disenrollment_date greater than\nenrollment_start), which excluded all 30 patients. Programmer B applied the filter requiring a\nnon-null enrollment_start, which retained all 30 patients. Both implementations are internally\nconsistent with a plausible reading of the SAP's phrase \"continuously enrolled in the 6-month\npre-index period.\" The discrepancy is a SAP ambiguity, not a programming bug in either\nimplementation. Resolution requires a SAP amendment that specifies the minimum enrollment\nduration explicitly.\n\n*(2) Practical interpretation.* For a study statistician reviewing this discrepancy report:\nthe 30 affected patients are 30 of Programmer B's 12,480, about 0.24% of the cohort. The correction\nreduces Programmer B's cohort to 12,450 and does not change the direction of the analysis. 
However, the\nSAP must be amended before outcome data are examined, and the CONSORT-style attrition flow\ndiagram must be updated to show these patients as excluded under the clarified criterion.\nEven a discrepancy of this magnitude is meaningful for a regulatory deliverable — it\ndemonstrates that the SAP contained a boundary condition that was not operationally specified,\nand the discrepancy resolution log is the audit evidence that the ambiguity was identified and\nresolved prospectively.",
    "primary_category": "Data_Quality_Assessment",
    "tags": [
      "quality-control",
      "double-programming",
      "reproducibility",
      "data-integrity",
      "ALCOA",
      "version-control",
      "locked-data-snapshot",
      "environment-pinning",
      "audit-trail",
      "discrepancy-resolution",
      "reproducible-research",
      "programming-validation"
    ],
    "applies_to_study_types": [
      "cohort_retrospective",
      "cohort_prospective",
      "claims_analysis",
      "ehr_study",
      "registry_study",
      "linked",
      "descriptive_analysis"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "primary",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1126/science.1213847",
        "url": "https://doi.org/10.1126/science.1213847",
        "citation_text": "Peng RD. Reproducible research in computational science. Science. 2011;334(6060):1226-1227.",
        "year": 2011,
        "authors_short": "Peng",
        "notes": "Introduced the reproducibility spectrum concept for computational science — from not reproducible to fully reproducible — and articulated the minimum standard of code and data availability. Foundational reference for reproducibility engineering in any quantitative discipline, including RWE analysis."
      },
      {
        "role": "explain",
        "doi": "10.1093/aje/kwae087",
        "url": "https://doi.org/10.1093/aje/kwae087",
        "citation_text": "Wang SV, Pottegard A. Building transparency and reproducibility into the practice of pharmacoepidemiology and outcomes research. American Journal of Epidemiology. 2024;193(11):1625-1631.",
        "year": 2024,
        "authors_short": "Wang & Pottegard",
        "notes": "Proposes a five-domain transparency statement framework (protocol, preregistration, data, code-sharing, reporting checklists) directly applicable to RWE QC practice. Articulates why open science components that are routine in other disciplines have not yet been built into standard pharmacoepidemiology workflows, and provides an actionable roadmap."
      }
    ],
    "plain_language_summary": "In real-world studies, the same data can produce different answers depending on how the analysis code handles ambiguous cases — like a patient who enrolled and left a health plan on the same day. Double programming is a check where a second analyst independently writes their own code to reproduce the same results, without looking at the first analyst's code; any difference in the numbers gets documented and traced back to its source. Reproducibility engineering is the set of practices — saving exactly which software versions were used, freezing the data with a checksum, and setting random seeds — that allow anyone to re-run the same analysis in the future and get the exact same answer.",
    "key_terms": [
      {
        "term": "double programming",
        "definition": "A QC method where a second analyst independently writes code to reproduce the same results as the primary programmer, without seeing the first analyst's code, so that any discrepancy surfaces a real error or SAP ambiguity."
      },
      {
        "term": "discrepancy resolution log",
        "definition": "A formal document recording every difference found between the two programmers' outputs, the root cause traced to the data, and the agreed resolution — it is a required audit artifact for regulatory deliverables."
      },
      {
        "term": "locked data snapshot",
        "definition": "A frozen copy of the source dataset taken at a defined cut date, stored with a cryptographic checksum so that any later change to the file is detectable; the cut date itself is part of what the analysis estimates in a living database."
      },
      {
        "term": "environment pinning",
        "definition": "Recording the exact version numbers of every software package used in an analysis so that a re-run in the future uses identical software and produces the same numerical results."
      },
      {
        "term": "ALCOA+",
        "definition": "Attributable, Legible, Contemporaneous, Original, Accurate, plus Complete, Consistent, Enduring, and Available — FDA data integrity principles applied to RWE to ensure every data transformation step is traceable and that the analysis dataset matches what was submitted."
      },
      {
        "term": "random seed",
        "definition": "A number passed to a software random-number generator before any stochastic step — like bootstrapping or multiple imputation — that makes the randomly selected values reproducible; without a seed, the same code produces slightly different numbers each time it runs."
      }
    ],
    "worked_example": {
      "scenario": "Two programmers are independently implementing the eligibility algorithm for a retrospective claims cohort study of a new oral anticoagulant. The SAP requires patients to be \"continuously enrolled in the 6-month pre-index period.\" After each programmer delivers their cohort-build code and a table of n, the QC reviewer compares the outputs and flags a discrepancy: Programmer A reports a cohort of 12,450 patients, Programmer B reports 12,480. The QC reviewer opens the discrepancy resolution log and asks both programmers to document their implementation of the continuous enrollment criterion.",
      "dataset": {
        "caption": "Four sample rows from the enrollment table representing the 30 discrepant patients — all have enrollment_start equal to disenrollment_date, yielding zero days when enrollment duration is computed as the arithmetic difference. Programmer A excludes these patients; Programmer B retains them.",
        "columns": [
          "patient_id",
          "enrollment_start",
          "disenrollment_date",
          "days_enrollment_diff"
        ],
        "rows": [
          [
            "P0001",
            "2021-03-15",
            "2021-03-15",
            0
          ],
          [
            "P0002",
            "2021-07-22",
            "2021-07-22",
            0
          ],
          [
            "P0003",
            "2021-09-01",
            "2021-09-01",
            0
          ],
          [
            "P0004",
            "2022-01-10",
            "2022-01-10",
            0
          ]
        ]
      },
      "steps": [
        "Programmer A implements the enrollment filter as: keep patients where disenrollment_date > enrollment_start (strictly greater than, so enrollment duration > 0 days). All 30 patients with same-day enrollment are excluded. Programmer A cohort = 12450.",
        "Programmer B implements the enrollment filter as: keep patients where enrollment_start IS NOT NULL (any non-missing enrollment record counts as enrolled). All 30 same-day patients pass this filter. Programmer B cohort = 12480.",
        "Discrepancy reported. Difference: 12480 - 12450 = 30 patients.",
        "Both programmers inspect the 30 patients present in B's cohort but absent from A's. Every one of the 30 has enrollment_start equal to disenrollment_date, so days_enrollment_diff = 0 for each.",
        "SAP clause reviewed: \"continuously enrolled in the 6-month pre-index period\" does not define the minimum enrollment duration or specify how to handle same-day enrollment records. The boundary condition of zero-day enrollment is a SAP ambiguity — both implementations are defensible readings of the existing language.",
        "The study statistician is consulted. Decision: same-day enrollment records represent administrative artifacts (enrollment initiated and immediately reversed). The SAP is amended to require enrollment_duration > 0 days (disenrollment_date strictly greater than enrollment_start). Both programmers re-run. Both reach 12450. Discrepancy log closed."
      ],
      "result": "Programmer A = 12450 patients; Programmer B = 12480 patients; discrepancy = 12480 - 12450 = 30. Root cause: 30 patients with same-day enrollment (days_enrollment_diff = 0) were handled differently under two valid interpretations of the SAP enrollment criterion. Resolution: SAP amended to require strictly positive enrollment duration; both programmers reach 12450 after re-run. Attrition flow diagram updated to show 30 patients excluded under the clarified criterion."
    },
    "prerequisites": [
      "study-protocol-or-sap-elements",
      "fit-for-purpose-data-assessment-rwe",
      "database-feasibility-attrition-funnel-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Regulatory-submission double programming (full independent re-derivation)",
        "description": "Both programmers receive the locked SAP and independently derive all primary and key secondary outputs (cohort counts, baseline characteristics, primary endpoint estimates, sensitivity analyses) without any communication about implementation choices until the discrepancy comparison step. All differences are logged in the discrepancy resolution log, which is retained as a QC artifact in support of FDA/EMA submissions. The final reconciled output set, signed off by the study statistician, is the submitted deliverable.",
        "edge_cases": [
          "When the two programmers use different programming languages (e.g., SAS primary, Python independent verification), floating-point precision differences will cause trivial discrepancies in continuous summaries. Establish a tolerance threshold (e.g., differences <= 0.001 for means, <= 0.0001 for proportions) in the QC plan before comparison begins; document that these are precision artifacts, not implementation errors.",
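          "For stochastic operations (bootstrap, multiple imputation, matching with random tiebreaking), both programmers must use the same seed documented in the SAP. The same seed reproduces the same draws only within the same software and random-number generator; when the programmers work in different languages, or seeds differ by design, stochastic outputs will not reconcile exactly and should be compared against a pre-specified tolerance and logged as an expected design consequence rather than a discrepancy."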
        ],
        "data_source_notes": "Claims and linked databases: cohort counts after each attrition step, not just the final n, should be compared. Intermediate step counts that differ while the final n matches indicate order-of-operations differences in filter application."
      },
      {
        "name": "Risk-based QC for exploratory internal analyses",
        "description": "Code review of the primary programmer's code against the SAP, followed by senior output review comparing key statistics against literature benchmarks or prior internal analyses. The QC tier and rationale for not double programming are documented in the analysis record. An explicit escalation trigger is defined (e.g., if findings advance to regulatory or HTA use, full double programming will be performed).",
        "edge_cases": [
          "Output review requires benchmark knowledge: if no prior internal estimates or published literature benchmarks exist for the population and endpoint combination, output review cannot reliably flag implausible results and the QC tier should be escalated.",
          "Even for exploratory work, seeds must be set and the data snapshot must be checksummed; risk-based QC reduces the programming verification burden but does not reduce the reproducibility-engineering requirements."
        ],
        "data_source_notes": "EHR and registry data: output review benchmarks may be less established than for claims; when literature benchmarks are sparse, consider pre-defining expected ranges in the analysis plan before data access so that output review has an objective reference."
      },
      {
        "name": "Reproducibility stack for a living-database study",
        "description": "For a study using a continuously updated administrative claims database (e.g., commercial insurance data refreshed monthly), the data cut date is defined in the SAP as a specific calendar date, a SHA-256 checksum is computed for all extract files at cut, and the cut date is documented in the estimand description. The analysis pipeline is wrapped in a one-command execution script (Makefile or targets pipeline) that verifies checksums before any analysis step, fails loudly if checksums do not match, and logs the software environment to a versioned manifest file at the start of every run.",
        "edge_cases": [
          "Late-arriving claims — adjudicated after the cut date but covering service dates before it — are not captured in a locked snapshot. The SAP should specify that the analysis is based on claims adjudicated by the cut date, not on all claims for the study period, and sensitivity analyses using a later cut date (if available) can quantify the effect of late-arriving claims.",
          "If the database vendor releases a restatement that corrects historical claim records, a re-run of the analysis on the restated data is a new analysis, not a reproduction of the original; it requires a documented rationale and a new data snapshot with its own checksum."
        ],
        "data_source_notes": "Claims: monthly refresh cycles mean the effective cut date for a standard subscription extract may not be the last calendar day of the month; confirm the adjudication lag with the vendor and document it in the analysis plan."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "algorithm-validation",
        "pros_of_this": "Double programming verifies the full analytic pipeline — from eligibility criteria through statistical modeling — not just a single outcome algorithm. It catches errors anywhere in the chain, including merge operations, covariate coding, and model specification, that algorithm validation does not examine.",
        "cons_of_this": "Algorithm validation tests whether the outcome ascertainment code correctly identifies cases against a gold-standard reference (adjudicated chart review), which double programming cannot do; an independently programmed algorithm that replicates an incorrect case-finding rule will pass double programming but fail algorithm validation.",
        "when_to_prefer": "Use both: algorithm validation to confirm the outcome code correctly identifies true cases; double programming to confirm the full analysis pipeline correctly implements what the SAP specifies. For regulatory submissions, both are expected."
      },
      {
        "compared_to": "fit-for-purpose-data-assessment-rwe",
        "pros_of_this": "Double programming and reproducibility engineering address errors in how the analysis is programmed and executed, which fit-for-purpose assessment does not evaluate. A database that passes fit-for-purpose assessment can still produce wrong results from a programming error.",
        "cons_of_this": "Fit-for-purpose assessment addresses whether the database can capture the population, exposure, and outcome at all — a question that double programming cannot answer. Both layers of quality assurance are needed; a perfectly double-programmed analysis of a misfit data source still produces invalid estimates.",
        "when_to_prefer": "Fit-for-purpose assessment is a prerequisite (performed before analysis begins); double programming is a verification step (performed during and after programming). They address different failure modes and are not substitutes."
      },
      {
        "compared_to": "estimand-analysis-traceability-rwe",
        "pros_of_this": "Double programming detects implementation errors in the analysis code; the discrepancy resolution log creates evidence that SAP ambiguities were identified and resolved. Together, QC and traceability form a closed loop: traceability ensures each analysis choice maps to an estimand component; double programming ensures the code correctly implements each choice.",
        "cons_of_this": "Estimand-analysis traceability documents the conceptual chain from the research question to the statistical estimator; double programming verifies the code, but does not by itself verify that the estimator targets the right causal quantity or that the estimand is appropriate for the decision context.",
        "when_to_prefer": "Use both. Traceability is built into protocol and SAP development; double programming is applied during and after programming. They reinforce each other but address different layers of the analysis quality chain."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Claims data: merge fan-out from claim-line duplication is the highest-frequency silent QC failure. Assert that post-merge row counts do not exceed pre-merge patient counts at every join. For cost outcomes, verify that per-patient cost totals match the sum of line-level amounts before and after aggregation. Locked snapshot checksums are particularly important because claims vendors restate historical records in monthly refresh cycles.",
      "ehr": "EHR data: date arithmetic errors are frequent because EHR timestamps include time-of-day, and truncation to calendar date is not always consistently applied across systems. Verify that all date fields are normalized to calendar date before duration calculations. Missing laboratory values coded as NULL must not be coerced to zero in mean calculations; explicit missing-data handling must be pre-specified in the SAP.",
      "registry": "Registry data: adjudicated endpoints are typically cleaner than claims but the adjudication date (when the committee reviewed the event) must be distinguished from the event date. Verify that survival model event-date inputs use the clinical event date, not the committee adjudication date, unless the SAP explicitly specifies otherwise.",
      "primary": "Primary survey data: survey weights must be applied consistently across all tabulations; a common error is applying weights to some outcomes but not others, producing internally inconsistent Table 1 and outcome summaries. Seeds for MI are especially important here because survey respondents with complex missing patterns produce highly seed-sensitive imputed values.",
      "linked": "Linked data: the linkage output (whether produced by probabilistic or deterministic matching) must itself be versioned and checksummed; if the linkage is re-run on a later data pull, cohort membership may change even if the analysis code is unchanged. Document the linkage algorithm version and the match rate by key variable as part of the reproducibility manifest."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import hashlib\nimport json\nimport sys\nfrom datetime import date\nfrom pathlib import Path\n\nimport numpy as np\nimport pandas as pd\n\n# ── 1. SHA-256 checksum a data extract file ────────────────────────────────────────\ndef sha256_file(path: str) -> str:\n    \"\"\"Return the SHA-256 hex digest of a file. Re-verify before every analysis run.\"\"\"\n    h = hashlib.sha256()\n    with open(path, \"rb\") as f:\n        for chunk in iter(lambda: f.read(65_536), b\"\"):\n            h.update(chunk)\n    return h.hexdigest()\n\n# ── 2. Build and save a locked data snapshot manifest ─────────────────────────────\ndef create_snapshot_manifest(data_files: list[str],\n                             manifest_path: str = \"snapshot_manifest.json\") -> dict:\n    \"\"\"\n    Log the cut date and SHA-256 hash of every source file.\n    Store manifest in the git repo alongside the analysis code.\n    \"\"\"\n    manifest = {\n        \"cut_date\": str(date.today()),\n        \"python_version\": sys.version,\n        \"files\": {p: sha256_file(p) for p in data_files},\n    }\n    with open(manifest_path, \"w\") as f:\n        json.dump(manifest, f, indent=2)\n    print(f\"Snapshot manifest created: {len(data_files)} files, cut_date={manifest['cut_date']}\")\n    return manifest\n\n# ── 3. 
Verify snapshot integrity before analysis begins ────────────────────────────\ndef verify_snapshot(manifest_path: str = \"snapshot_manifest.json\") -> None:\n    \"\"\"\n    Re-compute SHA-256 for every file in the manifest and compare.\n    Raises RuntimeError if any file has changed since the snapshot was taken.\n    \"\"\"\n    manifest = json.loads(Path(manifest_path).read_text())\n    errors = []\n    for path, expected in manifest[\"files\"].items():\n        actual = sha256_file(path)\n        if actual != expected:\n            errors.append(f\"CHECKSUM MISMATCH: {path}\")\n    if errors:\n        raise RuntimeError(\"Data integrity check FAILED:\\n\" + \"\\n\".join(errors))\n    print(f\"Snapshot verified: {len(manifest['files'])} files OK (cut_date={manifest['cut_date']})\")\n\n# ── 4. Merge row-count validation (catch claim-line fan-out) ───────────────────────\ndef safe_merge(left: pd.DataFrame, right: pd.DataFrame,\n               on: str | list, how: str = \"left\", label: str = \"merge\") -> pd.DataFrame:\n    \"\"\"\n    Merge left and right, then assert that no patient was duplicated.\n    A post-merge row count exceeding the pre-merge patient count signals fan-out from\n    duplicate keys in the right table (common with claim-line-level cost tables).\n    Assumes `left` holds one row per patient.\n    \"\"\"\n    n_before = len(left)\n    result = left.merge(right, on=on, how=how)\n    n_after = len(result)\n    if n_after > n_before:\n        raise ValueError(\n            f\"QC FAIL [{label}]: merge inflated rows {n_before} -> {n_after} \"\n            f\"(+{n_after - n_before}). Deduplicate right table before joining.\"\n        )\n    # A left join never drops left rows, so the dropped count is non-zero only for how='inner'.\n    print(f\"QC OK [{label}]: {n_before} -> {n_after} rows \"\n          f\"({n_before - n_after} left rows dropped; always 0 when how='left')\")\n    return result\n\n# ── 5. 
Set seeds before any stochastic step ────────────────────────────────────────\nANALYSIS_SEED = 20240301  # document this value in the SAP; do not change post-lock\n\ndef set_global_seeds(seed: int = ANALYSIS_SEED) -> None:\n    \"\"\"Set Python random and NumPy seeds. Call once at the start of the analysis script.\"\"\"\n    import random\n    random.seed(seed)\n    np.random.seed(seed)\n    # For scikit-learn: pass random_state=seed to every estimator\n    # For scipy bootstrap: rng=np.random.default_rng(seed)\n    print(f\"Seeds set to {seed} (module default ANALYSIS_SEED = {ANALYSIS_SEED})\")\n\n# ── 6. Enrollment duration edge case: same-day enrollment detection ────────────────\ndef flag_zero_day_enrollments(enroll_df: pd.DataFrame,\n                              start_col: str = \"enrollment_start\",\n                              end_col: str = \"disenrollment_date\") -> pd.DataFrame:\n    \"\"\"\n    Compute enrollment duration and flag same-day patients (duration = 0).\n    This function only flags records; it does not drop any rows.\n    SAP must specify whether to keep (Programmer B convention) or exclude (Programmer A).\n    \"\"\"\n    df = enroll_df.copy()\n    df[start_col] = pd.to_datetime(df[start_col])\n    df[end_col]   = pd.to_datetime(df[end_col])\n    df[\"days_enrolled\"] = (df[end_col] - df[start_col]).dt.days  # arithmetic difference\n    df[\"same_day_flag\"] = df[\"days_enrolled\"] == 0\n    n_same_day = int(df[\"same_day_flag\"].sum())\n    print(f\"Zero-day enrollment records: {n_same_day} \"\n          f\"(flagged only; SAP must specify include or exclude)\")\n    return df\n\n# ── 7. 
Discrepancy report: compare two cohort count vectors ────────────────────────\ndef compare_outputs(prog_a: dict, prog_b: dict, tolerance: float = 0.0001) -> list[str]:\n    \"\"\"\n    Compare Programmer A and Programmer B output dicts.\n    Returns a list of discrepancy strings; empty list = no discrepancy.\n    \"\"\"\n    discrepancies = []\n    all_keys = set(prog_a) | set(prog_b)\n    for key in sorted(all_keys):\n        a_val = prog_a.get(key)\n        b_val = prog_b.get(key)\n        if a_val is None or b_val is None:\n            discrepancies.append(f\"{key}: missing in one output (A={a_val}, B={b_val})\")\n        elif isinstance(a_val, (int, float)) and abs(a_val - b_val) > tolerance:\n            discrepancies.append(f\"{key}: A={a_val}, B={b_val}, diff={b_val - a_val}\")\n    return discrepancies\n\n# ── Demo: reproduce the worked-example discrepancy ────────────────────────────────\nif __name__ == \"__main__\":\n    prog_a_output = {\"cohort_n\": 12450, \"mean_age\": 63.2, \"pct_female\": 0.487}\n    prog_b_output = {\"cohort_n\": 12480, \"mean_age\": 63.1, \"pct_female\": 0.486}\n    issues = compare_outputs(prog_a_output, prog_b_output, tolerance=0.005)\n    if issues:\n        print(\"DISCREPANCIES FOUND — open discrepancy resolution log:\")\n        for d in issues:\n            print(f\"  {d}\")\n    # cohort_n: A=12450, B=12480, diff=30\n    # 12480 - 12450 = 30  -> same-day enrollment boundary case",
        "description": "Reproducibility and QC utilities for RWE analysis: SHA-256 checksumming of data extract\nfiles, snapshot manifest creation and verification, merge row-count validation to catch\nclaim-line fan-out, global seed-setting before stochastic steps, zero-day enrollment\nflagging, and a two-programmer output comparison. The only third-party dependencies are\nNumPy and pandas; everything else comes from the Python standard library.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "# ── 1. Environment locking with renv ──────────────────────────────────────────────\n# At project setup:  renv::init(); renv::snapshot()   -> writes renv.lock\n# At re-run:         renv::restore()                  -> installs exact versions\n# renv.lock records the R version, package versions, and source repositories.\n# Commit renv.lock to git so every analyst uses identical software.\n\n# ── 2. Assert survival package version (guard against silent version drift) ───────\nstopifnot(\n  \"survival >= 3.5 required; pin the version in renv.lock and re-validate before upgrading\" =\n    packageVersion(\"survival\") >= \"3.5\"\n)\n\n# ── 3. Set seeds before any stochastic step ───────────────────────────────────────\nANALYSIS_SEED <- 20240301L   # document in SAP; do not change after analysis lock\nset.seed(ANALYSIS_SEED)\n# For mice (multiple imputation):\n#   imp <- mice::mice(data, m = 20, seed = ANALYSIS_SEED)\n# For boot (bootstrap):\n#   set.seed(ANALYSIS_SEED); boot_res <- boot::boot(data, stat_fn, R = 1000)\n# For MatchIt (propensity matching with random tiebreaking):\n#   set.seed(ANALYSIS_SEED); m_out <- matchit(formula, data, method = \"nearest\")\n\n# ── 4. Merge row-count validation (catch claim-line fan-out) ─────────────────────\nsafe_merge <- function(left, right, by, all.x = TRUE, label = \"merge\") {\n  n_before <- nrow(left)\n  result   <- merge(left, right, by = by, all.x = all.x)\n  n_after  <- nrow(result)\n  if (n_after > n_before) {\n    stop(sprintf(\n      \"QC FAIL [%s]: merge inflated rows %d -> %d (+%d). Deduplicate right table.\",\n      label, n_before, n_after, n_after - n_before\n    ))\n  }\n  message(sprintf(\"QC OK [%s]: %d -> %d rows\", label, n_before, n_after))\n  result\n}\n\n# ── 5. 
Enrollment duration: flag zero-day records ─────────────────────────────────\nflag_zero_day_enrollments <- function(df,\n                                      start_col = \"enrollment_start\",\n                                      end_col   = \"disenrollment_date\") {\n  df[[start_col]] <- as.Date(df[[start_col]])\n  df[[end_col]]   <- as.Date(df[[end_col]])\n  # Arithmetic difference (days between dates, NOT inclusive of start day)\n  df$days_enrolled <- as.integer(df[[end_col]] - df[[start_col]])\n  df$same_day_flag <- df$days_enrolled == 0L\n  n_same <- sum(df$same_day_flag, na.rm = TRUE)\n  message(sprintf(\"Zero-day enrollment records: %d (SAP must specify: include or exclude)\", n_same))\n  df\n}\n\n# ── 6. Discrepancy comparison: reproduce worked-example ───────────────────────────\nprog_a <- list(cohort_n = 12450L, mean_age = 63.2, pct_female = 0.487)\nprog_b <- list(cohort_n = 12480L, mean_age = 63.1, pct_female = 0.486)\n\ncompare_outputs <- function(a, b, tolerance = 0.005) {\n  # sprintf() returns character(0) when any argument is NULL, which would silently\n  # drop the \"missing\" finding, so render NULL as an explicit string first.\n  show <- function(x) if (is.null(x)) \"NULL\" else format(x)\n  keys <- union(names(a), names(b))\n  diffs <- character(0)\n  for (k in keys) {\n    av <- a[[k]]; bv <- b[[k]]\n    if (is.null(av) || is.null(bv)) {\n      diffs <- c(diffs, sprintf(\"%s: missing in one output (A=%s, B=%s)\", k, show(av), show(bv)))\n    } else if (abs(av - bv) > tolerance) {\n      diffs <- c(diffs, sprintf(\"%s: A=%s, B=%s, diff=%+.4f\", k, av, bv, bv - av))\n    }\n  }\n  diffs\n}\n\ndiscrepancies <- compare_outputs(prog_a, prog_b)\nif (length(discrepancies) > 0) {\n  message(\"DISCREPANCIES FOUND -- open discrepancy resolution log:\")\n  for (d in discrepancies) message(\"  \", d)\n}\n# cohort_n is flagged: A=12450, B=12480, a difference of 30 patients\n# (12480 - 12450 = 30 -> same-day enrollment boundary case; see worked example).\n# mean_age (|diff| = 0.1) also exceeds the 0.005 tolerance; pct_female (|diff| = 0.001) does not.\n\n# ── 7. 
One-command pipeline note ─────────────────────────────────────────────────\n# Use {targets} for a reproducible pipeline:\n#   library(targets)\n#   tar_make()   # re-runs only changed steps; skips up-to-date targets\n# targets integrates with renv and stores output hashes to detect stale outputs.",
        "description": "Reproducibility and QC utilities for RWE analysis in R. Covers merge row-count validation,\nenrollment duration edge-case detection, seed management, survival package version\nassertion, and an renv workflow note. Uses the same worked-example discrepancy scenario as\nthe Python implementation.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* ─────────────────────────────────────────────────────────────────────────────────\n   1. PROC COMPARE: automated numeric comparison for double-programming reconciliation\n      BASE = Programmer A output table; COMPARE = Programmer B output table\n      outnoequal writes only discrepant observations to the OUT dataset\n───────────────────────────────────────────────────────────────────────────────── */\nproc compare\n    base    = work.prog_a_summary\n    compare = work.prog_b_summary\n    out     = work.discrepancies\n    outnoequal\n    method    = absolute  /* without METHOD=, CRITERION= is judged as a relative difference */\n    criterion = 0.0001;   /* flag any absolute numeric difference exceeding 0.0001 */\n  title \"QC: Programmer A vs Programmer B -- Double Programming Comparison\";\nrun;\n\nproc print data=work.discrepancies noobs;\n  title2 \"Discrepant observations -- open discrepancy resolution log for each row\";\nrun;\n\n/* ─────────────────────────────────────────────────────────────────────────────────\n   2. Row-count validation macro: catch merge fan-out from claim-line duplicates\n───────────────────────────────────────────────────────────────────────────────── */\n%macro safe_merge(left=, right=, on=, out=, label=merge);\n  %local n_before n_after;\n\n  proc sql noprint;\n    select count(*) into :n_before trimmed from &left;\n  quit;\n\n  proc sql;\n    create table &out as\n    select a.*, b.*\n    from &left as a\n    left join &right as b on a.&on = b.&on;\n  quit;\n\n  proc sql noprint;\n    select count(*) into :n_after trimmed from &out;\n  quit;\n\n  %if &n_after > &n_before %then %do;\n    %put ERROR: QC FAIL [&label]: merge inflated rows &n_before -> &n_after\n         (+%eval(&n_after - &n_before)). Deduplicate right table before joining.;\n  %end; %else %do;\n    %put NOTE: QC OK [&label]: &n_before -> &n_after rows.;\n  %end;\n%mend safe_merge;\n\n/* ─────────────────────────────────────────────────────────────────────────────────\n   3. 
Enrollment duration: flag zero-day records (the worked-example discrepancy)\n───────────────────────────────────────────────────────────────────────────────── */\ndata work.enrollment_flagged;\n  set work.enrollment_raw;\n  enrollment_start_d  = input(enrollment_start,  yymmdd10.);\n  disenrollment_date_d = input(disenrollment_date, yymmdd10.);\n  format enrollment_start_d disenrollment_date_d yymmdd10.;\n\n  /* Arithmetic difference: SAS date subtraction gives days between dates */\n  days_enrolled_diff = disenrollment_date_d - enrollment_start_d;\n\n  /* Programmer A: strictly positive duration required */\n  eligible_prog_a = (days_enrolled_diff > 0);\n\n  /* Programmer B: non-missing enrollment_start is sufficient */\n  eligible_prog_b = (enrollment_start_d ne .);\n\n  /* Flag the discrepant boundary cases */\n  same_day_flag = (days_enrolled_diff = 0);\nrun;\n\nproc freq data=work.enrollment_flagged;\n  tables same_day_flag eligible_prog_a eligible_prog_b / missing;\n  title \"QC: Zero-day enrollment records -- source of 30-patient discrepancy\";\nrun;\n\n/* ─────────────────────────────────────────────────────────────────────────────────\n   4. Cohort count comparison: reproduce the worked-example discrepancy\n───────────────────────────────────────────────────────────────────────────────── */\n%let cohort_a = 12450;\n%let cohort_b = 12480;\n%let diff = %eval(&cohort_b - &cohort_a);   /* = 30 */\n\n%if &diff ne 0 %then %do;\n  %put WARNING: DISCREPANCY -- Programmer B has &diff more patients than Programmer A.;\n  %put WARNING: Investigate zero-day enrollment handling. SAP clarification required.;\n%end;\n\n/* ─────────────────────────────────────────────────────────────────────────────────\n   5. 
Discrepancy resolution log template (shell -- populate for each finding)\n   Store in SAS dataset or flat file; archive with the analysis deliverable.\n───────────────────────────────────────────────────────────────────────────────── */\ndata work.discrepancy_log;\n  length finding_id $10 sap_clause $200 prog_a_value prog_b_value 8\n         difference 8 root_cause $400 resolution $400 status $20;\n  infile datalines dlm=\"|\";\n  input finding_id $ sap_clause $ prog_a_value prog_b_value\n        difference root_cause $ resolution $ status $;\n  datalines;\nDL-001|Continuously enrolled in 6-month pre-index period|12450|12480|30|Zero-day enrollment records: SAP did not define minimum duration|SAP amended: disenrollment_date > enrollment_start required|Closed\n  ;\nrun;\n\nproc print data=work.discrepancy_log noobs label;\n  title \"Discrepancy Resolution Log -- QC Artifact for Regulatory Submission\";\nrun;",
        "description": "QC and reproducibility utilities in SAS: PROC COMPARE for automated output comparison\nbetween double-programming deliverables, row-count validation via SQL macros, enrollment\nduration edge-case flagging, and a discrepancy log template. Covers the same worked-example\ndiscrepancy scenario as the Python and R implementations.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  SAP[\"Locked SAP received\\n(pre-outcome)\"] --> PA[\"Programmer A\\nindependent implementation\"]\n  SAP --> PB[\"Programmer B\\nindependent implementation\"]\n  PA --> OA[\"Programmer A\\nOutput Tables\"]\n  PB --> OB[\"Programmer B\\nOutput Tables\"]\n  OA --> CMP[\"Numeric Comparison\\n(PROC COMPARE / diff script)\"]\n  OB --> CMP\n  CMP --> PASS{\"All differences\\n<= tolerance?\"}\n  PASS -->|\"Yes\"| SIGNED[\"Reconciliation statement\\nsigned by statistician\\nDeliverable locked\"]\n  PASS -->|\"No\"| LOG[\"Discrepancy resolution log\\nopened for each finding\"]\n  LOG --> TRACE[\"Root cause traced:\\nSAP ambiguity? Bug?\\nPrecision artifact?\"]\n  TRACE --> AMEND[\"SAP amended if\\nambiguity found\"]\n  AMEND --> RERUN[\"Both programmers re-run\\nclarified step\"]\n  RERUN --> CMP",
        "caption": "Independent double programming workflow: both programmers implement independently from the SAP, outputs are compared numerically, discrepancies open a resolution log, and the SAP is amended for any ambiguity found before re-running.",
        "alt_text": "Flowchart showing the double programming QC cycle: two independent programmers feed into a comparison step; discrepancies loop through a resolution log and SAP amendment before re-comparison; agreement leads to a signed reconciliation statement.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "used_with",
        "target_slug": "study-protocol-or-sap-elements",
        "notes": "The SAP is the document that double programming verifies against: both programmers independently implement what the SAP specifies, and any discrepancy reveals either a programming error or a SAP ambiguity that requires amendment. QC and SAP development are inseparable: a SAP that is silent on boundary conditions (like zero-day enrollment) will generate discrepancies in independent double programming."
      },
      {
        "relation_type": "see_also",
        "target_slug": "estimand-analysis-traceability-rwe",
        "notes": "Estimand-analysis traceability maps each analytic choice to an estimand component; double programming verifies that the code correctly implements each choice. Together they form a closed quality loop: traceability ensures the right estimator was specified; QC ensures the specified estimator was correctly programmed."
      },
      {
        "relation_type": "see_also",
        "target_slug": "fit-for-purpose-data-assessment-rwe",
        "notes": "Fit-for-purpose assessment certifies the data source before programming; double programming and reproducibility engineering certify the analysis code after it is written. Both are quality assurance layers but address different failure modes: data fitness versus implementation correctness."
      },
      {
        "relation_type": "see_also",
        "target_slug": "llm-assisted-abstraction-rwe",
        "notes": "LLM-generated analysis code — whether for cohort construction, variable derivation, or statistical modeling — receives the same QC ladder as human-generated code. Independent double programming is not waived because the code was AI-assisted."
      },
      {
        "relation_type": "used_with",
        "target_slug": "database-feasibility-attrition-funnel-rwe",
        "notes": "The CONSORT-style attrition flow from database feasibility is the primary output that double programming verifies: both programmers must produce identical patient counts at each attrition step. A discrepancy in any attrition count reveals where the eligibility filter implementations diverged."
      },
      {
        "relation_type": "see_also",
        "target_slug": "algorithm-validation",
        "notes": "Algorithm validation confirms the outcome-ascertainment code correctly identifies true cases against a gold standard; double programming confirms the outcome code (once validated) is correctly integrated into the full analysis pipeline. Both are expected for regulatory submissions and address distinct quality dimensions."
      }
    ],
    "aliases": [
      "double programming",
      "independent programming verification",
      "reproducible research RWE",
      "QC ladder",
      "ALCOA+ RWE",
      "one-button rerun",
      "locked data snapshot",
      "discrepancy resolution log"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta",
      "pmda"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "qualitative-ethnographic",
    "name": "Ethnographic / Observational Qualitative Study",
    "short_definition": "A primary-data qualitative design in which a researcher studies people in their own clinical or everyday setting through sustained participant observation, in-context interviewing, and analysis of field notes and artifacts to explain how and why health behaviours, care processes, and treatment decisions actually unfold.",
    "long_description": "An **ethnographic / observational qualitative study** generates *primary* qualitative data by placing the investigator inside\nthe setting where care is delivered or experienced — a dialysis unit, an oncology infusion suite, a community pharmacy, a\npatient's home — over a sustained period. The defining tools are **participant observation** (the researcher watches, and\noften partly takes part in, routine activity), **in-context (often semi-structured) interviewing**, and analysis of\n**field notes, documents, and artifacts**. The output is not an effect estimate but a defensible, theory-informed account\nof *how* and *why* a phenomenon happens: why a newly launched biologic is barely used in community rheumatology, what\n\"treatment burden\" concretely means to a person on home dialysis, how a digital therapeutic is actually appropriated (or\nabandoned) on a real ward. In RWE/HEOR this is the method that surfaces the mechanisms, contexts, and patient-prioritized\noutcomes that a claims or EHR analysis can describe but never explain.\n\n**Core conceptual distinction**. The estimand-bearing methods in this catalog ask \"what is the effect of A vs B?\"; an\nethnography asks \"what is going on here, and why?\" Its inferential target is **analytic (theoretical) generalization** to\nconcepts and mechanisms, not **statistical generalization** to a population — a single richly studied site can validly\nrefine a theory of treatment burden even though it estimates no parameter. Two further distinctions matter. (1)\n*Ethnography vs the qualitative interview study* (`qualitative-interview`): interviews capture *accounts* of behaviour;\nethnography captures *behaviour in situ* and the gap between what people say and what they do — its signature contribution\nis observed practice, not reported practice. 
(2) *Primary qualitative collection vs qualitative evidence synthesis*\n(`qualitative-synthesis`): this concept is fieldwork that produces new data; synthesis aggregates existing qualitative\nstudies. Rigour is judged not by p-values but by **credibility, dependability, confirmability, and transferability**\n(Lincoln & Guba's trustworthiness criteria), operationalized through reflexivity, an audit trail, and reporting against\nCOREQ or SRQR — not by sample size or precision.\n\n**Pros, cons, and trade-offs**\n- **vs the qualitative interview study (`qualitative-interview`):** observation reveals tacit, routinized, or socially\n  undesirable practice that interviews miss (e.g., how clinicians *actually* counsel on adherence vs how they describe it),\n  and it does not depend on a participant's recall or candour. Cost: it is far more resource- and access-intensive, raises\n  sharper reactivity (Hawthorne) and consent-of-the-observed problems, and yields data that are harder to anonymize.\n  **Prefer ethnography** when the research question is about *practice, process, and context*; **prefer interviews** when it\n  is about *experience, meaning, and perspective* that observation cannot reach.\n- **vs quantitative RWE (claims/EHR cohorts):** ethnography explains mechanism, generates hypotheses, and defines\n  patient-relevant outcomes and burdens that structured data cannot represent. Cost: it cannot estimate incidence,\n  effect size, or cost, and does not support statistical generalization. The two are complementary, not rival — see\n  `mixed-methods` for principled integration (e.g., explaining a counter-intuitive utilization finding).\n- **vs structured surveys / preference studies (`preference-study`):** ethnography is open-ended and discovery-oriented,\n  appropriate *before* the constructs are known. Cost: it cannot quantify prevalence or trade-offs. 
The standard sequence\n  is ethnography/interviews to *generate* the conceptual model, then surveys/discrete-choice or COA instruments to\n  *measure* it.\n\n**When to use**. The question is \"how\" or \"why\" rather than \"how much\"; the construct (e.g., treatment burden, an\nimplementation barrier, a patient-prioritized outcome) is poorly understood and must be discovered, not measured;\nbehaviour-in-context and the say–do gap are central; you are building the conceptual model that will anchor downstream\nPRO/COA development (`pro-development`) or a preference study; or you need implementation/context evidence to interpret a\nquantitative signal or to support an HTA submission's narrative on patient experience and unmet need. Use it as the\nhypothesis-generating front end of a mixed-methods program.\n\n**When NOT to use — and when it is actively misleading or dangerous**\n- **When the question is comparative or quantitative.** Ethnographic data cannot estimate effects, rates, or costs;\n  presenting \"most patients felt...\" from a purposive sample of 12 as if it were a prevalence is a category error that\n  invites decision-makers to read frequency into data that carry none.\n- **When the deliverable demands statistical generalization.** A single-site account does not transport to a population by\n  counting; claiming it does is the qualitative analog of fabricating external validity.\n- **When access forces a biased vantage point.** Gatekeeper-controlled access (you see only the well-run clinic, the\n  compliant patients) produces *selection of the observed* — the qualitative cousin of selection bias — and the resulting\n  account is confidently wrong about the typical case.\n- **When reactivity cannot be managed.** If the mere presence of an observer would so distort the behaviour of interest\n  (e.g., a regulator-sensitive dispensing practice) that nothing authentic remains, observation is the wrong tool.\n- **When it is deployed as decorative \"patient 
voice.\"** A few quotes bolted onto a quantitative dossier without a coding\n  frame, reflexivity, saturation logic, or an audit trail is not evidence; it lends false authority and is rejected by\n  serious HTA reviewers (e.g., NICE) who expect a transparent, appraisable qualitative method.\n\n**Data-source operational depth**. The \"data sources\" of an ethnography are its *modes of primary collection*, each with\ncharacteristic failure modes and workarounds.\n- **Participant observation (the core mode):** richest for practice and context. Failure modes: **reactivity / Hawthorne**\n  (people perform when watched) — mitigate with prolonged engagement so behaviour normalizes and by triangulating\n  observation against records and interviews; **observer-role drift** (the \"observer-as-participant\" slides toward\n  \"going native,\" losing analytic distance) — mitigate with reflexive memos and peer debriefing; **consent of the\n  observed** in busy clinical space where bystanders cannot all consent — mitigate with site-level governance, posted\n  notices, and field-note redaction.\n- **In-context / semi-structured interviews:** capture meaning and the participant's framing on the spot. Failure modes:\n  **recall and social-desirability bias** in retrospective accounts, and **interviewer effects** — mitigate by anchoring\n  questions to just-observed events and by reflexive bracketing of the interviewer's assumptions.\n- **Focus groups:** efficient for surfacing shared norms and disagreement. Failure modes: **dominant-voice and\n  conformity effects** suppress minority experience and over-state consensus — mitigate with skilled facilitation and by\n  not treating group consensus as individual prevalence.\n- **Documentary / artifact analysis** (protocols, patient diaries, device logs, posters): grounds claims in material\n  traces. 
Failure mode: documents record the *intended* process, not the enacted one — read against observed practice.\n- **Rapid / focused ethnography** (compressed fieldwork for time-bound HEOR/implementation questions): pragmatic and\n  fundable. Failure mode: **premature claims of saturation** when conceptual categories are still incomplete — pre-specify\n  a saturation logic, document where new data stop adding new themes, and flag residual thinness honestly.\n- **Linkage to structured RWE:** in mixed-methods designs, qualitative findings are explicitly tied to a claims/EHR\n  signal (see `mixed-methods`, `linked-data`). Workaround for the integration: keep an audit trail mapping each qualitative\n  theme to the quantitative finding it explains, so the joint inference is appraisable rather than rhetorical.\n\n**Worked applied example.** A manufacturer's commercial + Medicare FFS claims analysis shows that a newly launched\nsubcutaneous biologic for moderate-to-severe rheumatoid arthritis has surprisingly low and slow uptake in *community*\nrheumatology versus academic centres, and that early initiators discontinue within 90 days more often than the trial\nwould predict — a pattern the structured data can describe but not explain, and which threatens both an HTA value story\nand the brand's launch plan. A focused ethnography is commissioned to explain the mechanism. (1) **Question:** why is\ninitiation low and early discontinuation high in community practice, and what would change it? (2) **Sampling:**\n*purposive, theory-driven* across four community practices selected for maximum variation (urban/rural, high/low biologic\nvolume), plus negative cases (practices with high uptake) to test emerging explanations — not a random or convenience\nsample. 
(3) **Collection:** ~6 weeks per site of participant observation of prescribing visits, nurse-led injection\ntraining, and prior-authorization workflow; semi-structured interviews with rheumatologists, infusion/injection nurses,\nand patients who initiated, declined, or discontinued; documentary analysis of the prior-auth packets and patient\nstarter-kit materials. (4) **Time and saturation:** collection continues until new observations and interviews stop\ngenerating new conceptual categories (theoretical saturation), documented explicitly rather than fixed in advance.\n(5) **Analysis:** field notes and transcripts are coded against a developing framework (a framework-analysis matrix:\nparticipants × analytic themes), with two independent coders on a defined coding sheet and a reported intercoder agreement\non a sample of segments; reflexive memos record how the analyst's prior assumptions are challenged by negative cases.\n(6) **Findings and use:** the account identifies the binding constraint — prior-authorization friction and inadequate\nin-clinic injection-training capacity in community settings, not patient preference — yielding (a) a testable hypothesis\nfor a follow-up quantitative analysis of discontinuation by prior-auth turnaround time, (b) a patient-prioritized\ntreatment-burden construct that seeds a PRO/COA instrument (`pro-development`), and (c) implementation/context evidence\nfor the HTA dossier. (7) **Rigour and reporting:** the study is written up against COREQ/SRQR with an audit trail,\nreflexivity statement, and explicit transferability claims (to comparable community settings, *not* to all RA patients),\nso an HTA reviewer can appraise it as evidence rather than anecdote.",
    "primary_category": "Study_Design",
    "tags": [
      "qualitative",
      "ethnography",
      "participant-observation",
      "fieldwork",
      "mixed-methods",
      "patient-experience",
      "implementation-science",
      "trustworthiness"
    ],
    "applies_to_study_types": [
      "qualitative_ethnographic"
    ],
    "data_sources": [
      "primary",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1136/bmj.321.7273.1400",
        "url": "https://doi.org/10.1136/bmj.321.7273.1400",
        "citation_text": "Savage J. Ethnography and health care. BMJ. 2000;321(7273):1400-1402.",
        "year": 2000,
        "authors_short": "Savage",
        "notes": "Concise canonical introduction to ethnography in health-care research — defines participant observation, the role of the researcher in the field, and what ethnography contributes that surveys and interviews cannot."
      },
      {
        "role": "explain",
        "doi": "10.1136/bmj.320.7227.114",
        "url": "https://doi.org/10.1136/bmj.320.7227.114",
        "citation_text": "Pope C, Ziebland S, Mays N. Qualitative research in health care: analysing qualitative data. BMJ. 2000;320(7227):114-116.",
        "year": 2000,
        "authors_short": "Pope et al.",
        "notes": "Explains rigorous analysis of qualitative field and interview data, including framework analysis and the systematic coding workflow operationalized in the implementations here."
      },
      {
        "role": "explain",
        "doi": "10.1186/1471-2288-13-117",
        "url": "https://doi.org/10.1186/1471-2288-13-117",
        "citation_text": "Gale NK, Heath G, Cameron E, Rashid S, Redwood S. Using the framework method for the analysis of qualitative data in multi-disciplinary health research. BMC Medical Research Methodology. 2013;13:117.",
        "year": 2013,
        "authors_short": "Gale et al.",
        "notes": "Step-by-step specification of the framework method — charting coded segments into the participant x theme matrix that the Python and R implementations here construct — the standard analytic backbone for applied (including HEOR) qualitative work."
      },
      {
        "role": "demonstrate",
        "doi": "10.1093/intqhc/mzm042",
        "url": "https://doi.org/10.1093/intqhc/mzm042",
        "citation_text": "Tong A, Sainsbury P, Craig J. Consolidated criteria for reporting qualitative research (COREQ): a 32-item checklist for interviews and focus groups. International Journal for Quality in Health Care. 2007;19(6):349-357.",
        "year": 2007,
        "authors_short": "Tong et al.",
        "notes": "The reporting standard against which ethnographic interview and focus-group work is appraised; specifies the reflexivity, sampling, and analysis items reviewers expect."
      },
      {
        "role": "use",
        "doi": "10.1097/ACM.0000000000000388",
        "url": "https://doi.org/10.1097/ACM.0000000000000388",
        "citation_text": "O'Brien BC, Harris IB, Beckman TJ, Reed DA, Cook DA. Standards for reporting qualitative research (SRQR): a synthesis of recommendations. Academic Medicine. 2014;89(9):1245-1251.",
        "year": 2014,
        "authors_short": "O'Brien et al.",
        "notes": "Broader-than-COREQ reporting framework covering observational/ethnographic designs, including the trustworthiness and audit-trail elements used to make qualitative HEOR evidence appraisable for HTA and regulatory submissions."
      }
    ],
    "plain_language_summary": "An ethnographic study answers \"how and why does care actually happen here?\" by having a researcher spend sustained time inside a real setting — a clinic, an infusion suite, a patient's home — watching what people do, talking with them about it on the spot, and taking detailed notes. It is about understanding meaning and context, not measuring how often something happens, so it never produces a rate, an effect size, or a percentage. Its signature strength is catching the gap between what people say they do and what they actually do, which a survey can never see. Its honest limit: a few richly studied sites tell you how a process works, not how common it is across a whole population.",
    "key_terms": [
      {
        "term": "participant observation",
        "definition": "The core method where the researcher watches, and often partly joins in, the everyday activity of a setting rather than just asking people about it afterward."
      },
      {
        "term": "field notes",
        "definition": "The researcher's detailed written record of what they saw, heard, and did during each observation visit, which becomes the raw data to be analyzed."
      },
      {
        "term": "theme",
        "definition": "A recurring pattern of meaning the researcher builds up across many observations, naming something that keeps showing up (for example, \"paperwork delays starting treatment\")."
      },
      {
        "term": "saturation",
        "definition": "The point where new observations stop turning up new patterns, signalling the researcher has seen enough to describe what is going on."
      },
      {
        "term": "the say-do gap",
        "definition": "The difference between what people report they do and what they are actually observed doing, which is exactly what watching in person reveals."
      }
    ],
    "worked_example": {
      "scenario": "A researcher spends several weeks inside a community rheumatology clinic to understand why a newly launched injectable biologic is barely being started, even though the medicine is available. Instead of handing out a survey, they sit in on prescribing visits, nurse injection-training sessions, and the back-office insurance-approval work, writing down what actually happens. The goal is not to count anything but to explain the process: what gets in the way of starting this drug, in this place, in real life.",
      "dataset": {
        "caption": "A handful of raw field-note entries an ethnographer would actually jot down — each is one observed moment, not a number. The analyst later groups these recurring moments into themes.",
        "columns": [
          "observation",
          "setting",
          "theme"
        ],
        "rows": [
          [
            "Doctor decides to start the biologic, then sighs and says the approval paperwork \"takes weeks\" before handing it to staff",
            "exam room",
            "paperwork delays starting treatment"
          ],
          [
            "Front-desk staff member re-faxes the insurance approval form for the third time; the first two were never acknowledged",
            "back office",
            "paperwork delays starting treatment"
          ],
          [
            "Only one nurse on site knows how to run the injection-training session; visits get rescheduled when she is out",
            "nurse station",
            "not enough trained staff to start patients"
          ],
          [
            "A patient who agreed to the drug in clinic is still waiting six weeks later and tells the nurse she has \"given up on it\"",
            "waiting room",
            "patients drop off during the long wait"
          ],
          [
            "Nurse teaching a patient to self-inject runs out of time and tells them to \"watch the video at home,\" though the patient looks unsure",
            "nurse station",
            "not enough trained staff to start patients"
          ],
          [
            "A patient says in clinic she is comfortable injecting, but when observed she hesitates and asks the nurse to do it",
            "exam room",
            "the say-do gap in injection confidence"
          ]
        ]
      },
      "steps": [
        "Each row is a single thing the researcher actually witnessed, written down in plain detail — not a survey answer and not a tally.",
        "The researcher reads back over many such notes and notices the same situations recurring; she gives each recurring pattern a short name, which becomes a theme (the right-hand column).",
        "Repetition is what makes a theme trustworthy: \"paperwork delays\" appears in the exam room, the back office, and the waiting room, from different people on different days, so it is a real feature of the place rather than one person's bad day.",
        "Watching in person catches what a survey would miss — the last row shows a patient who says she is confident injecting but is observed hesitating, which is the say-do gap a questionnaire could never reveal.",
        "The researcher keeps observing until new visits stop producing new themes (saturation), which is how she knows the picture is reasonably complete — not because she hit a target sample size."
      ],
      "result": "The synthesized insight: starting this drug is being blocked mainly by slow insurance paperwork and a shortage of staff trained to teach injections — not by patients refusing the medicine. That explanation, grounded in repeated first-hand observation across several parts of the clinic, is something the prescribing and claims data could flag as \"low uptake\" but never explain. It points to fixable process problems and gives the team a real-world account of why patients drop off, with no statistics invented or implied."
    },
    "prerequisites": [],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Realist / full ethnography (prolonged participant observation)",
        "description": "Sustained immersion (months) in a single setting or a small number of sites, with participant observation as the primary mode and interviews/documents as triangulation, aimed at a deep theoretical account of practice and context.",
        "edge_cases": [
          "Long fieldwork raises observer-role drift (\"going native\"); maintain analytic distance with reflexive memos and peer debriefing.",
          "Reactivity decays with prolonged engagement but never fully disappears — triangulate observed behaviour against records and interviews."
        ],
        "data_source_notes": "Primary collection: field notes are the core data; secure participant_id/site governance and consent of the observed in shared clinical space."
      },
      {
        "name": "Rapid / focused ethnography",
        "description": "Time-bound, question-focused fieldwork (days to a few weeks per site), common in HEOR and implementation science when a decision deadline (e.g., launch or HTA timeline) constrains immersion.",
        "edge_cases": [
          "Compression invites premature claims of saturation; pre-specify a saturation logic and report residual thinness.",
          "Narrow vantage can miss context that full ethnography would catch; use maximum-variation site sampling to compensate."
        ],
        "data_source_notes": "Primary collection: structured observation guides and just-in-time interviews keep focused fieldwork appraisable against COREQ/SRQR."
      },
      {
        "name": "Mixed-methods explanatory ethnography",
        "description": "Qualitative fieldwork explicitly nested with a quantitative RWE analysis (e.g., to explain an unexpected utilization or discontinuation signal), with a documented mapping from themes to the quantitative finding.",
        "edge_cases": [
          "Risk of reducing qualitative work to decorative \"patient voice\"; preserve a full coding frame, reflexivity, and audit trail, not just quotes.",
          "Integration is only credible if each theme is traceable to the quantitative result it explains."
        ],
        "data_source_notes": "Links primary qualitative data to claims/EHR signals (see mixed-methods, linked-data); keep an explicit qual-to-quant traceability log."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Qualitative interview study",
        "pros_of_this": "Captures behaviour-in-context and the say-do gap; does not depend on participant recall or candour; reveals tacit, routinized, or socially undesirable practice.",
        "cons_of_this": "Far more access- and resource-intensive; sharper reactivity and consent-of-the-observed problems; data harder to anonymize.",
        "when_to_prefer": "When the question is about practice, process, and context rather than reported experience or meaning."
      },
      {
        "compared_to": "Quantitative RWE (claims/EHR cohort)",
        "pros_of_this": "Explains mechanism, generates hypotheses, and defines patient-relevant outcomes and burdens that structured data cannot represent.",
        "cons_of_this": "Cannot estimate incidence, effect size, or cost; supports analytic, not statistical, generalization.",
        "when_to_prefer": "When you need to understand or explain a phenomenon (the \"why\" behind a quantitative signal) rather than measure it; integrate via mixed-methods."
      },
      {
        "compared_to": "Structured survey / preference study",
        "pros_of_this": "Open-ended and discovery-oriented; appropriate before the relevant constructs are known.",
        "cons_of_this": "Cannot quantify prevalence or trade-offs.",
        "when_to_prefer": "As the front end of an instrument-development sequence — discover constructs ethnographically, then measure them with surveys, DCEs, or COA instruments."
      }
    ],
    "implementation_notes_by_data_source": {
      "primary": "Data are field notes, transcripts, and artifacts collected in situ. Govern access and consent at the site level; de-identify field notes; keep a dated audit trail of sampling, coding, and analytic decisions so the study is appraisable against COREQ/SRQR. Rigour is judged by credibility, dependability, confirmability, and transferability — not by sample size or precision.",
      "linked": "In mixed-methods designs, tie qualitative themes to a specific claims/EHR finding and maintain an explicit qual-to-quant mapping so the joint inference is transparent rather than rhetorical (see mixed-methods, linked-data)."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\n\ndef framework_matrix(coded: pd.DataFrame) -> pd.DataFrame:\n    \"\"\"Participant x theme matrix of segment counts; the spine of framework analysis.\"\"\"\n    matrix = (coded\n              .pivot_table(index=\"participant_id\", columns=\"theme\",\n                           values=\"segment_text\", aggfunc=\"count\", fill_value=0))\n    return matrix\n\ndef theme_grounding(coded: pd.DataFrame) -> pd.DataFrame:\n    \"\"\"For each theme: how many distinct participants and which collection modes support it.\n    Thin grounding (few participants, single mode) flags themes that are not yet saturated\n    or rest on a single vantage point (e.g., observation only, never corroborated in interview).\"\"\"\n    g = (coded.groupby(\"theme\")\n              .agg(n_segments=(\"segment_text\", \"count\"),\n                   n_participants=(\"participant_id\", \"nunique\"),\n                   modes=(\"source_mode\", lambda s: \", \".join(sorted(s.unique()))))\n              .sort_values(\"n_participants\", ascending=False))\n    return g",
        "description": "Framework-analysis matrix from coded qualitative segments. Required input (a tidy coding sheet exported from NVivo,\nDedoose, ATLAS.ti, or a spreadsheet, one row per coded text segment):\n  coded : transcript_id (or field_note_id), participant_id, source_mode in {'observation','interview','focus_group','document'},\n          theme, code, segment_text\nProduces the participant x theme framework matrix that underpins framework analysis (Pope et al. 2000): which themes are\ngrounded in which participants and modes, so the analyst can see coverage, negative cases, and where saturation is thin.\nThis summarizes already-coded data; coding itself is the human analytic step, not something to fabricate in code.",
        "dependencies": [
          "pandas"
        ],
        "source_citations": [
          "pope-2000",
          "gale-2013"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\nlibrary(irr)\n\ncoding_reliability <- function(coded) {\n  setDT(coded)\n  # Cohen's kappa on the two coders' assigned codes for the double-coded segments.\n  ratings <- coded[, .(code_coder1, code_coder2)]\n  k <- kappa2(ratings, weight = \"unweighted\")\n\n  # Theme grounding: distinct participants and collection modes supporting each agreed code.\n  agreed <- coded[code_coder1 == code_coder2]\n  grounding <- agreed[, .(n_segments = .N,\n                          n_participants = uniqueN(participant_id),\n                          modes = paste(sort(unique(source_mode)), collapse = \", \")),\n                      by = .(code = code_coder1)][order(-n_participants)]\n\n  list(kappa = k$value, p_value = k$p.value, grounding = grounding)\n}",
        "description": "Intercoder reliability (Cohen's kappa) on a double-coded sample of segments, plus a theme-grounding summary. Required\ninput (a tidy coding sheet, one row per segment that BOTH coders independently coded):\n  coded : segment_id, participant_id, source_mode, code_coder1, code_coder2\nReporting chance-corrected agreement on a defined subset of segments is the standard way to demonstrate dependability of\nthe coding frame for COREQ/SRQR. Requires the 'irr' package for kappa.",
        "dependencies": [
          "irr",
          "data.table"
        ],
        "source_citations": [
          "tong-2007",
          "gale-2013"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Q[Research question: HOW / WHY<br/>construct poorly understood] --> Purp[Purposive / theoretical sampling<br/>maximum variation + negative cases]\n  Purp --> Coll[Primary collection in situ:<br/>participant observation + in-context interviews<br/>+ focus groups + documents/artifacts]\n  Coll --> Sat{New data still<br/>generating new<br/>concepts?}\n  Sat -- Yes --> Coll\n  Sat -- No: theoretical saturation --> Code[Coding + framework analysis<br/>participant x theme matrix, reflexive memos]\n  Code --> Trust[Trustworthiness:<br/>credibility, dependability,<br/>confirmability, transferability]\n  Trust --> Out[Outputs: mechanism + hypotheses,<br/>patient-prioritized constructs for PRO/COA,<br/>implementation/context evidence for HTA]",
        "caption": "Ethnographic / observational qualitative workflow — purposive sampling and in-situ collection iterate to theoretical saturation, then framework analysis and trustworthiness checks yield mechanism, hypotheses, and patient-prioritized constructs rather than effect estimates.",
        "alt_text": "Flowchart from a how/why research question through purposive sampling, in-situ qualitative collection looping to theoretical saturation, framework-analysis coding, trustworthiness criteria, and outputs feeding PRO/COA development and HTA evidence.",
        "source_type": "illustrative",
        "source_citations": [
          "savage-2000",
          "pope-2000",
          "gale-2013"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  subgraph QUAL[Ethnography]\n    T1[Theme: prior-auth friction]\n    T2[Theme: injection-training gap]\n    T3[Theme: patient treatment burden]\n  end\n  subgraph QUANT[Claims/EHR RWE signal]\n    S1[Low/slow community uptake]\n    S2[High 90-day discontinuation]\n  end\n  T1 --> S1\n  T1 --> S2\n  T2 --> S2\n  T3 --> PRO[PRO/COA construct]\n  S2 -. tests .-> H[Quantitative follow-up:<br/>discontinuation by prior-auth turnaround]\n  T1 -. generates .-> H",
        "caption": "Mixed-methods integration — each ethnographic theme is mapped to the quantitative finding it explains and to the downstream PRO/COA or follow-up hypothesis it generates, keeping the joint inference appraisable rather than rhetorical.",
        "alt_text": "Diagram linking three ethnographic themes (prior-auth friction, injection-training gap, treatment burden) to two claims/EHR signals (low uptake, high discontinuation), a PRO/COA construct, and a quantitative follow-up hypothesis.",
        "source_type": "illustrative",
        "source_citations": [
          "savage-2000"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "alternative_to",
        "target_slug": "qualitative-interview",
        "notes": "Interviews capture reported experience and meaning; ethnography adds observed behaviour-in-context and the say-do gap. Choose by whether the question is about practice (ethnography) or perspective (interview)."
      },
      {
        "relation_type": "see_also",
        "target_slug": "qualitative-synthesis",
        "notes": "This concept is primary qualitative collection (new data); qualitative synthesis aggregates existing qualitative studies into a higher-order interpretation."
      },
      {
        "relation_type": "used_with",
        "target_slug": "mixed-methods",
        "notes": "Ethnography is the usual hypothesis-generating qualitative strand in a mixed-methods design that explains an unexpected quantitative RWE signal."
      },
      {
        "relation_type": "produces",
        "target_slug": "pro-development",
        "notes": "Ethnographic and interview fieldwork supplies the patient-prioritized constructs and conceptual model that anchor PRO/COA instrument development."
      },
      {
        "relation_type": "see_also",
        "target_slug": "preference-study",
        "notes": "The discovery-then-measure sequence — ethnography surfaces the constructs and trade-offs, then preference/DCE studies quantify them."
      }
    ],
    "aliases": [
      "ethnographic study",
      "observational qualitative study",
      "qualitative ethnographic",
      "participant observation study",
      "focused ethnography"
    ],
    "complexity": "foundational",
    "regulatory_relevance": [
      "fda",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "qualitative-interview",
    "name": "Qualitative Interview Study",
    "short_definition": "A primary-data qualitative design that collects open-ended, semi-structured or in-depth individual interviews to characterize lived experience, meaning, and process, analyzed thematically toward conceptual rather than statistical generalization.",
    "long_description": "A **qualitative interview study** generates primary data by asking purposively selected participants open-ended\nquestions — most often through a **semi-structured guide** (a flexible topic list with probes) rather than a fixed\nquestionnaire — and analyzes the resulting transcripts to build themes, concepts, and explanatory accounts. In RWE and\nHEOR it is the workhorse method for the questions that counts cannot answer: *why* patients stop a therapy, *how* the\nburden of an injectable regimen plays out in daily life, *what* outcomes matter to people who actually have the disease,\nand *which* candidate items belong in a new patient-reported outcome (PRO) instrument. Its product is meaning and\nhypothesis, not incidence and hazard ratios.\n\n**Core conceptual distinction**. The interview study's logic is the inverse of the quantitative observational study, and\nthe two are separable. (1) *Sampling for information, not representativeness*: participants are chosen purposively (often\nby **maximum-variation** sampling across age, severity, treatment line, or setting) so the analysis spans the conceptual\nrange of experience; the goal is **transferability** of insight, not a generalizable point estimate. A claims cohort wants\nN large and unbiased; an interview study wants N *informative*. (2) *Generalization is analytic/conceptual, not\nstatistical*: findings transfer to similar contexts through the richness and credibility of the account, judged by the\nreader, rather than through a sampling frame and confidence interval. (3) *Sample size is governed by adequacy of\ninformation, not power*: the modern frame is **information power** (Malterud) — narrower aim, dense sample specificity,\nstrong dialogue, and established theory all lower the number of interviews needed — which supersedes naive \"count\ninterviews until nothing new appears\" **data saturation**. 
The estimand, loosely, is a *credible thematic account of the\nphenomenon as experienced*, defensible against an audit trail; it is categorically not a treatment effect.\n\n**Pros, cons, and trade-offs** (specific and comparative, naming the alternatives).\n- **vs structured survey / PRO questionnaire:** the interview discovers the constructs and language a survey can only\n  measure once they are known; it surfaces unanticipated themes a closed instrument forecloses. Cost: it cannot estimate\n  prevalence or compare groups, is labor-intensive, and is vulnerable to interviewer effects and small, non-representative\n  samples. **Prefer the interview** in the *item-generation / concept-elicitation* phase of FDA PFDD or EMA PRO work, then\n  hand off to a survey for quantification.\n- **vs focus group:** individual interviews give privacy for sensitive topics (stigma, sexual health, end-of-life,\n  adherence shame) and avoid the dominant-voice and conformity dynamics of groups; they yield deeper individual\n  narratives. Cost: they forgo the group interaction that itself generates data (participants reacting to and building on\n  each other) and are less efficient per participant-hour. **Prefer interviews** for sensitive or highly individual\n  experiences; prefer focus groups when social interaction and norm-formation are the object of study.\n- **vs ethnography / participant observation:** interviews capture *reported* experience efficiently; ethnography\n  captures *enacted* behavior in situ and catches the gap between what people say and what they do. Cost: ethnography is\n  far more time- and access-intensive. 
**Prefer interviews** when self-report is the appropriate evidence and field access\n  is impractical; see the qualitative-ethnographic entry when the say–do gap is central.\n- **vs quantitative observational analysis of claims/EHR:** the interview explains mechanism and meaning that a hazard\n  ratio cannot; it is the only way to learn outcomes-that-matter directly from patients. Cost: no effect estimate, no\n  counterfactual, no scale. The two are complementary, not rivals — interviews most often run inside a mixed-methods\n  design to interpret or generate quantitative findings.\n\n**When to use** (clear decision rules). Use a qualitative interview study when the question is *how/why/what-matters*\nrather than *how many/how much*: eliciting concepts and language for a new PRO/COA instrument (FDA PFDD Guidance 2–3,\nEMA reflection paper on PROs); understanding drivers of non-adherence, discontinuation, or treatment burden; mapping the\npatient journey for an HTA value dossier or patient-preference submission; exploring an unexpected quantitative signal\n(\"why did discontinuation spike in this subgroup?\"); and any setting where the constructs are not yet well enough defined\nto write valid closed-ended items. 
Use a **semi-structured** guide for most applied work, **in-depth/narrative** when the\naccount itself is the data, and reserve **structured** interviews for when the construct space is already fixed.\n\n**When NOT to use — and when it is actively misleading or dangerous** (clear decision rules).\n- **The question is quantitative.** If you need prevalence, an effect estimate, a between-arm comparison, or a powered\n  test, interviews cannot deliver it; reporting theme *frequencies* from a purposive sample as if they were prevalence\n  (\"80% of patients reported X\") is a category error and is actively misleading — small purposive samples are not built to\n  count, and reviewers (and FDA/EMA) will treat such numbers as pseudo-quantification.\n- **Generalizable population inference is required.** A non-probability, maximum-variation sample cannot support\n  statistical generalization; presenting it as representative invites refutation.\n- **Saturation is asserted as a checkbox.** Declaring \"saturation reached\" without a pre-specified, transparent stopping\n  rule (or, better, an information-power justification) is a hollow ritual that masks an under-powered, premature analysis;\n  it is one of the most common fatal flaws reviewers flag.\n- **Sensitive topics with a coercive recruitment or group format.** Routing stigmatized experience through gatekeeper-\n  selected participants or a focus group can silence the very voices the study claims to represent — selection and social\n  desirability then manufacture a false consensus.\n- **No analytic rigor or audit trail.** \"We talked to some patients and here are quotes\" is not analysis. Without a\n  codebook, inter-coder checks, reflexivity, and a documented chain from data to claim, the work is anecdote dressed as\n  evidence and will not survive regulatory or HTA review.\n\n**Data-source operational depth**. 
The \"source\" here is *primary* — recorded interviews and their transcripts — and the\noperational risks are different in kind from claims/EHR/registry work; they live in sampling, modality, recording,\nanonymization, and analysis rather than in enrollment and code lists.\n- **Sampling frame & recruitment.** Purposive (maximum-variation, typical-case, deviant-case) or snowball sampling.\n  *Failure mode:* **gatekeeper / volunteer selection** — clinicians refer the articulate, satisfied, or adherent patients,\n  so the hardest experiences (the people who quietly dropped therapy) are systematically absent. *Workaround:* sample\n  against the gatekeeper's grain (explicitly recruit discontinuers, non-attenders, low-literacy and minority-language\n  participants), document a recruitment log, and report who declined.\n- **Modality (in-person vs video vs telephone vs asynchronous text).** Each trades reach against richness. *Failure mode:*\n  telephone and async lose non-verbal cues and rapport, dampening disclosure on sensitive topics; video introduces a\n  digital-access bias that under-samples older and lower-income participants. *Workaround:* pre-register modality, offer a\n  choice, and note modality alongside each transcript so its effect on depth can be assessed.\n- **Recording, transcription & translation.** *Failure mode:* lost or corrupted recordings, verbatim vs cleaned\n  transcription decisions that erase meaning, and **translation drift** when interviews in another language are coded from\n  an English translation (idiom and affect are lost). *Workaround:* redundant recording, professional verbatim\n  transcription with accuracy QC, and analysis in the source language by bilingual coders with back-translation of key\n  quotes.\n- **Interviewer effects & social desirability.** *Failure mode:* a clinician-interviewer elicits courtesy answers; leading\n  probes plant themes; the interviewer's stance shapes what is said. 
*Workaround:* neutral, trained interviewers, a piloted\n  non-leading guide, and explicit **reflexivity** (document the interviewer's role and assumptions — a COREQ requirement).\n- **Anonymization & identifiability in small/rare subgroups.** *Failure mode:* in a rare-disease cohort, a verbatim quote\n  plus role and locale re-identifies the participant. *Workaround:* paraphrase or generalize identifying detail, suppress\n  rare quasi-identifiers, and apply small-cell logic to qualitative reporting just as to claims tables.\n- **Premature / ritual saturation.** *Failure mode:* stopping at a round number because \"themes repeated,\" with no\n  operational definition. *Workaround:* pre-specify the stopping rule (e.g., a saturation grid tracking new codes per\n  interview) or justify the sample with **information power** before fielding.\n\n**Worked applied example (qualitative, item-generation for a PRO).** Aim: elicit the concepts and patient language of\n*treatment burden* among adults with advanced chronic kidney disease on oral phosphate binders, to generate candidate\nitems for a new disease-specific PRO supporting an FDA PFDD submission. (1) *Sampling frame:* maximum-variation purposive\nsampling across CKD stage (3b–5, pre-dialysis), pill burden (low vs high binder count), age, and health literacy; recruit\nthrough two nephrology clinics **and** a patient advocacy group to counter gatekeeper selection, with a recruitment log of\napproached/declined. (2) *Sample size:* justified by **information power** — narrow aim, high sample specificity, an\nexperienced interviewer, and an existing treatment-burden framework — pre-specifying ~16–20 interviews with a stopping\nrule of two consecutive interviews adding no new codes to the saturation grid. 
(3) *Instrument:* a piloted semi-structured\nguide opening broad (\"walk me through a typical day taking your kidney medicines\") with non-leading probes; modality\noffered as video or telephone, recorded with redundant capture. (4) *Transcription:* professional verbatim with accuracy\nQC; interviews conducted in the participant's preferred language by a bilingual interviewer. (5) *Analysis:* two coders\nindependently apply **open coding** to the first transcripts, reconcile a **codebook**, then code the corpus; **inter-coder\nagreement** is quantified on a shared subset (Cohen's kappa, target ≥0.70) and disagreements adjudicated; codes are\nabstracted into themes and a conceptual model of burden. (6) *Rigor & reporting:* member checking with a participant\nsubset, a reflexivity statement on the interviewer's clinical role, an audit trail from quote → code → theme → claim, and\nfull **COREQ** reporting of the 32 items. The deliverable is a saturated concept map and a draft item pool — handed off to\na quantitative survey for psychometric validation, never reported as prevalence.",
    "primary_category": "Study_Design",
    "tags": [
      "qualitative",
      "interview",
      "semi-structured",
      "thematic-analysis",
      "information-power",
      "saturation",
      "coreq",
      "patient-experience",
      "pro-development",
      "mixed-methods"
    ],
    "applies_to_study_types": [
      "qualitative_interview"
    ],
    "data_sources": [
      "primary"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1136/bmj.311.6999.251",
        "url": "https://doi.org/10.1136/bmj.311.6999.251",
        "citation_text": "Britten N. Qualitative interviews in medical research. BMJ. 1995;311(6999):251-253.",
        "year": 1995,
        "authors_short": "Britten",
        "notes": "Foundational, widely cited primer on the rationale, types (structured, semi-structured, in-depth), and conduct of qualitative interviews in health research."
      },
      {
        "role": "explain",
        "doi": "10.1093/intqhc/mzm042",
        "url": "https://doi.org/10.1093/intqhc/mzm042",
        "citation_text": "Tong A, Sainsbury P, Craig J. Consolidated criteria for reporting qualitative research (COREQ): a 32-item checklist for interviews and focus groups. International Journal for Quality in Health Care. 2007;19(6):349-357.",
        "year": 2007,
        "authors_short": "Tong et al.",
        "notes": "The de facto reporting standard for interview/focus-group studies; its 32 items, grouped into three domains (research team and reflexivity, study design, and analysis and findings), are what reviewers and journals check."
      },
      {
        "role": "explain",
        "doi": "10.1177/1049732315617444",
        "url": "https://doi.org/10.1177/1049732315617444",
        "citation_text": "Malterud K, Siersma VD, Guassora AD. Sample size in qualitative interview studies: guided by information power. Qualitative Health Research. 2016;26(13):1753-1760.",
        "year": 2016,
        "authors_short": "Malterud et al.",
        "notes": "Reframes sample size away from naive saturation counts toward information power (aim breadth, sample specificity, theory use, dialogue quality, analysis strategy); the modern justification reviewers expect."
      },
      {
        "role": "explain",
        "doi": "10.1007/s11135-017-0574-8",
        "url": "https://doi.org/10.1007/s11135-017-0574-8",
        "citation_text": "Saunders B, Sim J, Kingstone T, et al. Saturation in qualitative research: exploring its conceptualization and operationalization. Quality & Quantity. 2018;52(4):1893-1907.",
        "year": 2018,
        "authors_short": "Saunders et al.",
        "notes": "Disentangles the conflated meanings of saturation (theoretical, data, inductive thematic, a priori thematic) and shows how to operationalize and pre-specify a defensible stopping rule."
      },
      {
        "role": "demonstrate",
        "doi": "10.1177/1525822X05279903",
        "url": "https://doi.org/10.1177/1525822X05279903",
        "citation_text": "Guest G, Bunce A, Johnson L. How many interviews are enough? An experiment with data saturation and variability. Field Methods. 2006;18(1):59-82.",
        "year": 2006,
        "authors_short": "Guest et al.",
        "notes": "Empirical experiment showing most codes emerge within the first ~12 homogeneous interviews; the canonical evidence base behind sample-size rules of thumb (and their limits in heterogeneous samples)."
      }
    ],
    "plain_language_summary": "A qualitative interview study sits a small number of carefully chosen patients (or clinicians) down for an open-ended conversation, records what they say, and works through the transcripts to find the recurring ideas in their own words. It answers \"how\" and \"why\" questions a survey can't reach yet — like why people quit a medication or what living with a disease actually feels like — by digging into the meaning behind the experience rather than counting how many people picked option B. The product is a set of themes and a richer understanding, not a percentage or an effect estimate. Its honest limit: because the handful of people were picked on purpose (not at random), you cannot turn their answers into prevalence or generalize them to a whole population.",
    "key_terms": [
      {
        "term": "transcript",
        "definition": "The full written-out text of a recorded interview, word for word, which is what the analyst actually reads and works from."
      },
      {
        "term": "code",
        "definition": "A short label an analyst attaches to a chunk of interview text to tag the idea in it (for example, \"fear of needles\")."
      },
      {
        "term": "codebook",
        "definition": "The agreed list of codes and what each one means, so that two analysts label the same passage the same way."
      },
      {
        "term": "theme",
        "definition": "A bigger pattern of meaning built by grouping related codes together (for example, several needle- and side-effect codes rolling up into \"physical burden of treatment\")."
      },
      {
        "term": "saturation",
        "definition": "The point where new interviews stop producing new codes, signalling you have likely heard the range of experiences and can stop interviewing."
      },
      {
        "term": "purposive sampling",
        "definition": "Deliberately choosing participants to span a range of experiences (different ages, severities, settings) rather than picking them at random."
      }
    ],
    "worked_example": {
      "scenario": "A team wants to understand the burden of taking daily oral phosphate-binder pills among adults with advanced chronic kidney disease, so they can later build a patient-reported outcome questionnaire. They interview a small, purposively chosen group of patients with one broad opener (\"walk me through a typical day taking your kidney medicines\") plus gentle follow-up probes. Below is a tiny slice of the coding work: a handful of paraphrased quotes from three participants, the code each quote received, and the theme each code rolls up into. Real studies code hundreds of passages — this shows the mechanics in miniature.",
      "dataset": {
        "caption": "A coding table: each row is one interview passage, paraphrased, with the analyst's code and the theme it belongs to. This is what the working spreadsheet of a thematic analysis actually looks like.",
        "columns": [
          "participant",
          "quote_paraphrase",
          "code",
          "theme"
        ],
        "rows": [
          [
            "P01",
            "I take so many pills with every meal I feel like I'm eating medicine, not food.",
            "high_pill_count",
            "Physical burden of treatment"
          ],
          [
            "P01",
            "They upset my stomach, so I skip them when I'm eating out with friends.",
            "side_effects_cause_skipping",
            "Treatment interferes with daily life"
          ],
          [
            "P02",
            "I keep a chart on the fridge or I lose track of which dose I've taken.",
            "memory_and_tracking_effort",
            "Mental work of staying on top of it"
          ],
          [
            "P02",
            "I won't take them at a restaurant because I don't want people asking questions.",
            "hiding_meds_in_public",
            "Treatment interferes with daily life"
          ],
          [
            "P03",
            "Honestly some days I just give up and don't bother with the lunchtime ones.",
            "intentional_dose_skipping",
            "Mental work of staying on top of it"
          ],
          [
            "P03",
            "The sheer number of tablets makes me feel sicker than I really am.",
            "high_pill_count",
            "Physical burden of treatment"
          ]
        ]
      },
      "steps": [
        "Coding: read each passage and attach a short label that names the idea in it. \"I take so many pills... eating medicine, not food\" gets the code high_pill_count. When P03 says almost the same thing, you reuse that exact code instead of inventing a new one — that reuse is what keeps the analysis consistent and is recorded in the codebook.",
        "Building themes: step back from the individual codes and group the ones that share a deeper meaning. The two high_pill_count passages are both about what the regimen does to the body, so they roll up into the theme \"Physical burden of treatment\"; side_effects_cause_skipping and hiding_meds_in_public are both about treatment colliding with meals out and social situations, so they roll up into \"Treatment interferes with daily life\"; memory_and_tracking_effort and intentional_dose_skipping are both about the daily mental effort, so they roll up into \"Mental work of staying on top of it.\"",
        "Checking saturation: across these three participants some codes reappear (high_pill_count shows up for both P01 and P03), but P03 also contributes a code not seen before (intentional_dose_skipping), so this small slice has not reached saturation. In the full study, interviewing stops only when a pre-specified rule is met, for example two consecutive interviews that add no new codes. Saturation is a stopping cue judged against that rule, not a number to report.",
        "Why this is not a survey: notice there is no \"60% of patients said X.\" With only a handful of purposively chosen people, counting would be misleading. The payoff is the depth of meaning — you learn the actual words patients use (\"eating medicine, not food\") and the reasons behind a behavior, which is exactly what you need before you can write good survey questions."
      ],
      "result": "Three themes synthesized from the codes: (1) Physical burden of treatment, (2) Treatment interferes with daily life, and (3) Mental work of staying on top of it. These themes — together with the patients' own language — become the raw material for drafting questionnaire items. No prevalence figures or statistics are produced or implied; the deliverable is a credible, well-evidenced map of what treatment burden means to these patients."
    },
    "prerequisites": [],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Semi-structured interview",
        "description": "A flexible topic guide of open questions and probes that ensures coverage of key domains while allowing the interviewer to follow participant-led tangents; the default for applied RWE/HEOR concept elicitation and patient-journey work.",
        "edge_cases": [
          "Over-structured guides drift toward a verbal survey and foreclose unanticipated themes; under-structured guides leave key domains uncovered across participants.",
          "Leading probes can plant themes; pilot the guide and audit early transcripts for interviewer-induced content."
        ],
        "data_source_notes": "primary: pilot the guide on 2-3 participants, document guide revisions, and version-control the instrument so analysts know which guide produced which transcript."
      },
      {
        "name": "In-depth / narrative interview",
        "description": "Minimally structured, single broad opening that invites the participant to narrate experience in their own terms; used when the structure of the account itself is the object of study.",
        "edge_cases": [
          "Demands a highly skilled interviewer; novice interviewers fill silence and over-direct.",
          "Harder to compare across participants; analysis leans on narrative or interpretive phenomenological approaches."
        ],
        "data_source_notes": "primary: longer interviews and transcripts; budget more transcription and analysis time per participant."
      },
      {
        "name": "Information-power-justified sampling",
        "description": "Sample size and stopping are pre-specified using the information-power dimensions (aim breadth, sample specificity, established theory, dialogue quality, analysis strategy) rather than an open-ended hunt for saturation.",
        "edge_cases": [
          "A narrow aim with dense, specific participants can be adequately covered in far fewer interviews than a broad, heterogeneous one.",
          "Information power is a justification framework, not a fixed number; reviewers still expect a transparent stopping rule and a saturation/audit record."
        ],
        "data_source_notes": "primary: pre-register the target range and stopping rule in the protocol/analysis plan; maintain a saturation grid (new codes per interview) as fielding proceeds."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Structured survey / PRO questionnaire",
        "pros_of_this": "Discovers constructs and patient language a closed instrument cannot; surfaces unanticipated themes; appropriate for concept elicitation and item generation.",
        "cons_of_this": "Cannot estimate prevalence or compare groups; labor-intensive; vulnerable to interviewer effects and small, non-representative samples.",
        "when_to_prefer": "The item-generation / concept-elicitation phase (FDA PFDD, EMA PRO development) before a construct is well enough defined to write valid closed-ended items."
      },
      {
        "compared_to": "Focus group",
        "pros_of_this": "Privacy for sensitive topics; avoids dominant-voice and conformity dynamics; deeper individual narratives.",
        "cons_of_this": "Forgoes the group-interaction data that focus groups generate; less efficient per participant-hour.",
        "when_to_prefer": "Sensitive, stigmatized, or highly individual experiences where group dynamics would distort or silence responses."
      },
      {
        "compared_to": "Ethnography / participant observation",
        "pros_of_this": "Efficiently captures reported experience without prolonged field access.",
        "cons_of_this": "Captures only what is said, not what is done; misses the say-do gap that observation reveals.",
        "when_to_prefer": "When self-report is the appropriate evidence and field immersion is impractical."
      },
      {
        "compared_to": "Quantitative observational analysis (claims/EHR)",
        "pros_of_this": "Explains mechanism, meaning, and outcomes-that-matter directly from patients; generates hypotheses to test quantitatively.",
        "cons_of_this": "No effect estimate, no counterfactual, no scale or generalizable inference.",
        "when_to_prefer": "How/why questions, or interpreting/generating quantitative findings within a mixed-methods design."
      }
    ],
    "implementation_notes_by_data_source": {
      "primary": "The data are recorded interviews and transcripts. Operational integrity lives in sampling (purposive/maximum-variation against gatekeeper selection), modality (in-person/video/telephone/async, each affecting depth and access bias), recording (redundant capture, verbatim transcription with accuracy QC), translation (analyze in source language; back-translate key quotes), interviewer training (neutral, non-leading, reflexive), and anonymization (paraphrase quasi-identifiers; small-cell logic for rare-disease quotes). Rigor requires a codebook, inter-coder agreement checks, an audit trail from quote to claim, member checking, and full COREQ reporting."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\nfrom sklearn.metrics import cohen_kappa_score\n\ndef coder_agreement(coded: pd.DataFrame) -> pd.DataFrame:\n    \"\"\"Overall and per-code Cohen's kappa for two coders on shared segments.\"\"\"\n    # Keep only segments that both coders labelled; a missing label cannot be scored.\n    coded = coded.dropna(subset=[\"code_coder_a\", \"code_coder_b\"])\n    a = coded[\"code_coder_a\"].astype(str)\n    b = coded[\"code_coder_b\"].astype(str)\n\n    rows = [{\"code\": \"OVERALL (all labels)\",\n             \"n_segments\": len(coded),\n             \"kappa\": round(cohen_kappa_score(a, b), 3)}]\n\n    # Per-code presence/absence agreement (one-vs-rest): isolates poorly-defined codes.\n    for code in sorted(set(a) | set(b)):\n        ind_a = (a == code).astype(int)\n        ind_b = (b == code).astype(int)\n        rows.append({\"code\": code,\n                     # segments where at least one coder applied this code\n                     \"n_segments\": int((ind_a | ind_b).sum()),\n                     \"kappa\": round(cohen_kappa_score(ind_a, ind_b), 3)})\n\n    out = pd.DataFrame(rows)\n    out[\"below_threshold\"] = out[\"kappa\"] < 0.70  # flag codes needing codebook clarification\n    return out",
        "description": "Inter-coder reliability (Cohen's kappa) on a double-coded subset of interview segments, the standard quantitative\ncheck on coding consistency before a single coder finalizes the full corpus. Required input (a tidy table after two\ncoders independently apply the reconciled codebook to the SAME segments):\n  coded : one row per (transcript_id, segment_id), with columns\n          code_coder_a, code_coder_b -> the code label each coder assigned to that segment\nReport kappa per code (presence/absence) and overall; target >= 0.70 before relying on single-coder coding, and\nadjudicate disagreements rather than discarding them.",
        "dependencies": [
          "pandas",
          "scikit-learn"
        ],
        "source_citations": [
          "tong-2007"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(irr)\n\ncoder_agreement <- function(coded, coder_cols) {\n  ratings <- coded[, coder_cols, drop = FALSE]\n\n  if (length(coder_cols) == 2L) {\n    overall <- kappa2(ratings, weight = \"unweighted\")$value      # Cohen's kappa, 2 coders\n  } else {\n    overall <- kappam.fleiss(ratings)$value                      # Fleiss' kappa, 3+ coders\n  }\n\n  # Per-code presence/absence agreement (one-vs-rest) to localize weak codes.\n  all_codes <- sort(unique(unlist(ratings, use.names = FALSE)))\n  per_code <- lapply(all_codes, function(code) {\n    ind <- as.data.frame(lapply(ratings, function(x) as.integer(x == code)))\n    k <- if (length(coder_cols) == 2L) kappa2(ind)$value else kappam.fleiss(ind)$value\n    data.frame(code = code, kappa = round(k, 3), below_threshold = k < 0.70)\n  })\n\n  list(overall = round(overall, 3), per_code = do.call(rbind, per_code))\n}",
        "description": "Inter-coder reliability for interview coding in R. For two coders use Cohen's kappa; for three or more coders applying\nthe same codebook to the same segments, Fleiss' kappa is the correct generalization. Required input:\n  coded : data.frame with one row per shared segment and one column per coder\n          (e.g., code_coder_a, code_coder_b, [code_coder_c]) holding the assigned code label\nInspect per-code agreement, not just the overall figure; a high overall kappa can hide one ambiguous, low-agreement code.",
        "dependencies": [
          "irr"
        ],
        "source_citations": [
          "tong-2007"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Q[Research question: how / why / what-matters] --> Frame[Sampling frame:<br/>purposive / maximum-variation<br/>recruit against gatekeeper selection]\n  Frame --> Guide[Semi-structured guide<br/>piloted, non-leading probes]\n  Guide --> Int[Interviews<br/>chosen modality, redundant recording]\n  Int --> Trans[Verbatim transcription + QC<br/>analyze in source language]\n  Trans --> Open[Open coding -> reconciled codebook]\n  Open --> IRR[Inter-coder agreement<br/>Cohen's / Fleiss' kappa >= 0.70]\n  IRR --> Theme[Theme development<br/>conceptual model]\n  Theme --> Stop{Information power /<br/>stopping rule met?}\n  Stop -- no --> Frame\n  Stop -- yes --> Member[Member checking + reflexivity<br/>audit trail quote -> claim]\n  Member --> Report[COREQ-compliant report<br/>concept map / item pool]",
        "caption": "Operational flow of a qualitative interview study. Sampling targets information rather than representativeness, coding rigor is enforced by a reconciled codebook and inter-coder agreement, and the stopping decision is governed by information power rather than a ritual saturation claim.",
        "alt_text": "Flowchart from research question through purposive sampling, piloted semi-structured guide, interviews, transcription, open coding, inter-coder kappa, theme development, an information-power stopping decision that loops back to sampling if unmet, then member checking and a COREQ-compliant report.",
        "source_type": "illustrative",
        "source_citations": [
          "tong-2007",
          "malterud-2016"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Start[Need primary patient/clinician insight] --> Quant{Is the question<br/>quantitative?<br/>prevalence / effect / comparison}\n  Quant -- yes --> Survey[Use a survey / PRO instrument<br/>or quantitative observational analysis]\n  Quant -- no --> Sensitive{Sensitive or highly<br/>individual experience?}\n  Sensitive -- yes --> Interview[Qualitative INTERVIEW<br/>privacy, depth, no group conformity]\n  Sensitive -- no --> Interaction{Is group interaction /<br/>norm-formation the object?}\n  Interaction -- yes --> FocusGroup[Focus group]\n  Interaction -- no --> SayDo{Does the say-do gap<br/>or in-situ behavior matter?}\n  SayDo -- yes --> Ethno[Ethnography / observation]\n  SayDo -- no --> Interview",
        "caption": "Decision logic for choosing among primary qualitative designs. Interviews win for sensitive or highly individual experience and for efficient self-report; focus groups for group dynamics; ethnography for the gap between reported and enacted behavior; surveys for anything quantitative.",
        "alt_text": "Decision tree starting from the need for primary insight, branching on whether the question is quantitative (survey), whether the experience is sensitive (interview), whether group interaction is the object (focus group), and whether the say-do gap matters (ethnography), otherwise interview.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "alternative_to",
        "target_slug": "qualitative-ethnographic",
        "notes": "Interviews capture reported experience efficiently; ethnography captures enacted behavior in situ and the say-do gap. Choose by whether self-report or observation is the appropriate evidence."
      },
      {
        "relation_type": "used_with",
        "target_slug": "qualitative-synthesis",
        "notes": "Findings from multiple interview studies are pooled by qualitative/thematic synthesis (e.g., meta-ethnography, thematic synthesis) to build cross-study conceptual models."
      },
      {
        "relation_type": "part_of",
        "target_slug": "mixed-methods",
        "notes": "Interviews most often sit inside a mixed-methods design — generating constructs for a later survey, or interpreting an unexpected quantitative signal."
      }
    ],
    "aliases": [
      "qualitative interview",
      "semi-structured interview",
      "in-depth interview",
      "qualitative interview study"
    ],
    "complexity": "foundational",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "qualitative-synthesis",
    "name": "Qualitative Evidence Synthesis",
    "short_definition": "A systematic-review method that identifies, appraises, and integrates findings from primary qualitative studies into higher-order analytical themes, with confidence in each synthesized finding graded by GRADE-CERQual.",
    "long_description": "**Qualitative evidence synthesis (QES)** is the systematic identification, critical appraisal, and integration of\nfindings from primary *qualitative* studies (interviews, focus groups, ethnography, open-ended survey text) to answer\nquestions about *experiences, perceptions, acceptability, feasibility, and barriers/facilitators* that quantitative\neffect estimates cannot address. It is a full systematic-review method — protocol-driven, with an explicit search,\ntransparent study selection, structured appraisal, a named synthesis method, and a graded statement of confidence in\neach finding — not an informal narrative summary. In HEOR and RWE it most often supplies the *patient-experience* and\n*acceptability* evidence that quantitative comparative-effectiveness and economic analyses leave unanswered.\n\n**Core conceptual distinction** QES is defined by what it synthesizes (qualitative *findings*, i.e., themes and\nauthor interpretations) and how it integrates them (interpretive or aggregative methods), and these are separable\nchoices. (1) *Aggregative vs interpretive synthesis*: aggregative methods (meta-aggregation, framework synthesis,\ntextual narrative synthesis) tally and pool findings against a pre-specified structure and stay close to participants'\nvoices — well suited to HTA decision questions with a defined scope. Interpretive methods (meta-ethnography,\nthematic synthesis, grounded-theory synthesis) generate *new* second- and third-order constructs that go beyond any\nsingle study — well suited to theory-building. (2) *Unit of analysis*: the data are the published *findings* of\nprimary studies (first-order = participant quotes, second-order = original authors' interpretations), and the\nsynthesis produces *third-order* analytical themes. 
The output is **not** an effect estimate, a pooled prevalence, or\na measure of \"how many patients\" — quantifying themes (\"60% of studies mentioned cost\") is **vote-counting** and is\nconsidered a misuse. The estimand-analogue is a set of *review findings*, each carrying a CERQual confidence rating\n(high / moderate / low / very low) built from four components: methodological limitations, coherence, adequacy of\ndata, and relevance.\n\n**Pros, cons, and trade-offs**\n- **vs quantitative meta-analysis / network meta-analysis:** QES answers *why* and *how* (mechanisms, acceptability,\n  lived burden) where meta-analysis answers *how much* (pooled effect). Cost: no effect size, no transitivity logic,\n  no I-squared; findings are interpretive and not \"pooled\" in a statistical sense. **Prefer QES** for the\n  patient-perspective, treatment-burden, and implementation questions that increasingly drive HTA patient submissions;\n  **prefer meta-analysis** for comparative efficacy/safety.\n- **vs an unsystematic narrative review of qualitative studies:** QES adds a protocol, reproducible search, transparent\n  selection, formal appraisal (CASP/Cochrane), a named synthesis method, and CERQual grading — making the conclusions\n  auditable. Cost: substantially more labor and methodological expertise. **Prefer QES** whenever the synthesis informs\n  a decision (HTA, guideline, label, payer dossier); a narrative overview is acceptable only for scoping or background.\n- **vs mixed-methods / segregated convergent reviews:** A standalone QES isolates and rigorously synthesizes the\n  qualitative strand; a convergent or segregated mixed-methods review then juxtaposes it with a quantitative strand to\n  explain heterogeneity or contextualize effects. 
**Prefer standalone QES** when the question is purely about\n  experience; embed it in a mixed-methods review when the goal is to interpret *why* a quantitative effect varies.\n\n**When to use** Decision questions about patient/clinician experience, treatment burden, acceptability, adherence\ndrivers, equity, and implementation barriers — especially the patient-perspective and context sections of HTA\nsubmissions (NICE, CADTH), guideline development (WHO, Cochrane QIMG), and payer dossiers where the lived experience of\na therapy matters. Use it when a body of primary qualitative research exists and the goal is a transparent, gradeable\nsynthesis of *findings* rather than a count of effects.\n\n**When NOT to use — and when it is actively misleading or dangerous**\n- **The question is comparative effectiveness, safety, prevalence, or cost.** QES cannot estimate effects; presenting\n  synthesized themes as evidence of *how well* a drug works, or quantifying themes into pseudo-prevalences, is a\n  category error that decision-makers can mistake for effect evidence.\n- **The evidence base is thin or one or two large studies dominate.** With few studies, \"adequacy of data\" is low,\n  third-order interpretation overreaches, and CERQual confidence collapses to low/very low — a synthesis that *looks*\n  authoritative but rests on almost no data is dangerous in a decision context.\n- **Findings are pooled as if commensurable counts (vote-counting).** Treating \"number of studies reporting a theme\" as\n  a measure of importance conflates research attention with patient salience and is explicitly cautioned against.\n- **The included studies are methodologically weak in the same direction.** If all primary studies recruit articulate,\n  high-engagement volunteers, the synthesis amplifies a shared selection bias; methodological-limitations and\n  relevance components of CERQual must flag this rather than averaging it away.\n- **Reviewers lack qualitative methods expertise.** 
Mechanically extracting quotes without engaging the interpretive\n  layer produces a quote-collage, not a synthesis, and misrepresents the original authors' meaning.\n\n**Data-source operational depth** The \"data sources\" of a QES are *types of source studies and documents*, each with\ndistinct failure modes and workarounds.\n- **Primary peer-reviewed qualitative studies (interviews/focus groups/ethnography):** the core substrate. Failure\n  modes: thin \"findings\" sections that report quotes without the authors' interpretation (limits second-order data);\n  poorly reported sampling so saturation/transferability cannot be judged; over-representation of articulate,\n  engaged participants. Workaround: extract first- *and* second-order data verbatim into an auditable table, appraise\n  with CASP, and down-rate adequacy/relevance in CERQual when reporting is thin.\n- **Mixed-methods studies:** the qualitative strand is often buried as a sub-study with truncated reporting. Failure\n  mode: the qualitative findings are subordinated to the quantitative component (often a trial) and under-described. Workaround: extract only the\n  qualitative component, contact authors for fuller findings, and record that the strand was secondary.\n- **Grey literature and conference abstracts:** improve coverage and reduce publication/dissemination bias but carry\n  thin description and no peer review. Failure mode: an abstract states a theme with no supporting data to appraise.\n  Workaround: include for comprehensiveness but mark low methodological confidence and never let an unappraisable\n  abstract anchor a high-confidence finding.\n- **Patient-organization and regulatory patient-experience reports (e.g., HTA patient submissions):** highly relevant\n  to decision questions but not peer-reviewed and potentially advocacy-shaped. Failure mode: selection toward\n  motivated, organized patients. 
Workaround: treat as a distinct source type, triangulate against peer-reviewed\n  studies, and document potential conflicts in the relevance component.\n- **Cross-language / cross-setting evidence:** restricting to English or to one health system narrows transferability.\n  Failure mode: a finding that appears \"confident\" only because the evidence base is homogeneous in a way that hides setting\n  effects. Workaround: justify language limits explicitly and use the relevance component to flag setting mismatch\n  between the evidence and the decision context.\n\n**Worked example (HTA patient-perspective synthesis).** Question: what shapes initiation and persistence with GLP-1\nreceptor agonists from the perspective of adults with type 2 diabetes, to inform the patient-experience section of an\nHTA submission. (1) Protocol & question framing: use **SPIDER** (Sample = adults with T2D; Phenomenon of Interest =\nexperience of GLP-1 RA therapy; Design = interviews/focus groups; Evaluation = acceptability, burden, persistence;\nResearch type = qualitative) rather than PICO, and register the protocol. (2) Search: MEDLINE + CINAHL + PsycINFO with\nqualitative search filters, supplemented by reference-chaining and grey literature; English-language restriction\njustified and recorded as a relevance caveat. (3) Selection: dual independent screening, PRISMA flow, disagreements\nresolved by a third reviewer. (4) Appraisal: CASP qualitative checklist per study, feeding the\nmethodological-limitations component of CERQual. (5) Extraction: capture first-order (participant quotes) and second-order (author\ninterpretations) data into a structured table. 
(6) Synthesis: Thomas & Harden three-stage thematic synthesis — free\nline-by-line coding of findings, organization into descriptive themes, then generation of analytical (third-order)\nthemes such as \"injection identity and stigma,\" \"weight-loss as motivator vs side-effect-driven discontinuation,\" and\n\"cost and access friction.\" (7) Confidence: rate each review finding with GRADE-CERQual across methodological\nlimitations, coherence, adequacy, and relevance, producing an Evidence Profile and a Summary of Qualitative Findings\ntable. (8) Reporting: report against the **ENTREQ** statement so the HTA reviewer can audit every step from search to\ngraded finding. The deliverable is a small set of analytical themes each carrying an explicit confidence rating — not\na count of studies and not an effect estimate.",
    "primary_category": "Study_Design",
    "tags": [
      "qualitative-evidence-synthesis",
      "thematic-synthesis",
      "meta-ethnography",
      "meta-aggregation",
      "cerqual",
      "entreq",
      "patient-experience",
      "hta-patient-perspective"
    ],
    "applies_to_study_types": [
      "qualitative_synthesis"
    ],
    "data_sources": [
      "primary",
      "registry"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1186/1471-2288-8-45",
        "url": "https://doi.org/10.1186/1471-2288-8-45",
        "citation_text": "Thomas J, Harden A. Methods for the thematic synthesis of qualitative research in systematic reviews. BMC Medical Research Methodology. 2008;8:45.",
        "year": 2008,
        "authors_short": "Thomas & Harden",
        "notes": "Canonical operational statement of three-stage thematic synthesis (line-by-line coding -> descriptive themes -> analytical themes), the most widely used QES method in health reviews."
      },
      {
        "role": "explain",
        "doi": "10.1371/journal.pmed.1001895",
        "url": "https://doi.org/10.1371/journal.pmed.1001895",
        "citation_text": "Lewin S, Glenton C, Munthe-Kaas H, et al. Using qualitative evidence in decision making for health and social interventions: an approach to assess confidence in findings from qualitative evidence syntheses (GRADE-CERQual). PLoS Medicine. 2015;12(10):e1001895.",
        "year": 2015,
        "authors_short": "Lewin et al.",
        "notes": "Defines the four CERQual components (methodological limitations, coherence, adequacy, relevance) used to grade confidence in each synthesized finding for decision-making."
      },
      {
        "role": "explain",
        "doi": "10.1186/s13012-017-0688-3",
        "url": "https://doi.org/10.1186/s13012-017-0688-3",
        "citation_text": "Lewin S, Booth A, Glenton C, et al. Applying GRADE-CERQual to qualitative evidence synthesis findings: introduction to the series. Implementation Science. 2018;13(Suppl 1):2.",
        "year": 2018,
        "authors_short": "Lewin et al.",
        "notes": "Introduces the operational CERQual guidance series and the Evidence Profile / Summary of Qualitative Findings tables used to present graded findings."
      },
      {
        "role": "demonstrate",
        "doi": "10.1186/1471-2288-12-181",
        "url": "https://doi.org/10.1186/1471-2288-12-181",
        "citation_text": "Tong A, Flemming K, McInnes E, Oliver S, Craig J. Enhancing transparency in reporting the synthesis of qualitative research: ENTREQ. BMC Medical Research Methodology. 2012;12:181.",
        "year": 2012,
        "authors_short": "Tong et al.",
        "notes": "The 21-item ENTREQ reporting standard that makes a QES auditable from search through synthesis to graded findings."
      },
      {
        "role": "use",
        "doi": "10.1016/j.jclinepi.2017.06.020",
        "url": "https://doi.org/10.1016/j.jclinepi.2017.06.020",
        "citation_text": "Noyes J, Booth A, Flemming K, et al. Cochrane Qualitative and Implementation Methods Group guidance series-paper 3: methods for assessing methodological limitations, data extraction and synthesis, and confidence in synthesized qualitative findings. Journal of Clinical Epidemiology. 2018;97:49-58.",
        "year": 2018,
        "authors_short": "Noyes et al.",
        "notes": "Cochrane operational guidance for choosing a synthesis method, appraising primary studies, and applying CERQual in practice."
      }
    ],
    "plain_language_summary": "Qualitative evidence synthesis is a way to gather findings from many studies that used interviews or focus groups to learn about patients' experiences, then combine those findings into a single, richer picture. Instead of averaging numbers, reviewers read across all the studies, notice shared ideas, and build a higher-level conclusion — for example, 'patients stop a new injectable therapy mainly because they feel embarrassed injecting in public, not because of medical side effects.' Each conclusion comes with an honest rating of how confident we are, based on the quality and number of studies behind it. It cannot compare how well two drugs work — that is a different kind of question — but it is exactly what health agencies need when they ask 'what do patients actually experience?'",
    "key_terms": [
      {
        "term": "qualitative study",
        "definition": "A study that collects words rather than numbers — recorded interviews or focus-group discussions — to understand how people experience or feel about something."
      },
      {
        "term": "theme",
        "definition": "A recurring idea or pattern found across multiple participants or studies, capturing what a group of people described in their own words."
      },
      {
        "term": "analytical theme",
        "definition": "A higher-level insight the reviewers build by comparing and interpreting descriptive themes across all included studies — a conclusion that no single study stated on its own."
      },
      {
        "term": "CERQual confidence rating",
        "definition": "A label (high, moderate, low, or very low) that tells a reader how much trust to place in a synthesized finding, based on study quality, how consistent findings were, how much supporting data existed, and how relevant the studies are to the decision question."
      },
      {
        "term": "thematic synthesis",
        "definition": "The most widely used approach for qualitative evidence synthesis: reviewers label each study's findings line by line, group similar labels into descriptive themes, then interpret those themes into analytical conclusions."
      }
    ],
    "worked_example": {
      "scenario": "A health technology assessment body needs to understand what shapes whether adults with type 2 diabetes start and stay on a once-weekly injectable diabetes therapy. Four qualitative studies are identified — each interviewed or conducted focus groups with patients and reported its own themes. The synthesis reviewers read across all four and build a single higher-order conclusion.",
      "dataset": {
        "caption": "Findings table: one row per study, showing the key theme that study reported — the raw material for synthesis.",
        "columns": [
          "study",
          "design",
          "key_theme_reported_by_original_authors"
        ],
        "rows": [
          [
            "Alvarez 2019",
            "In-depth interviews (n=18, UK)",
            "Fear of self-injection was the main barrier to starting the medicine"
          ],
          [
            "Chen 2020",
            "Focus groups (n=24, USA)",
            "Injection anxiety faded after the first use but disruption to daily routine lingered"
          ],
          [
            "Okonkwo 2021",
            "In-depth interviews (n=15, Ghana)",
            "Injecting visibly in public settings caused social embarrassment and avoidance"
          ],
          [
            "Russo 2022",
            "In-depth interviews (n=20, Italy)",
            "Device comfort mattered more to persistence than side-effect worry"
          ]
        ]
      },
      "steps": [
        "Read each study's findings section and attach a short label to every idea mentioned — for example: 'needle fear,' 'routine disruption,' 'public embarrassment,' 'device feel.'",
        "Group the labels that share the same underlying idea across studies: 'needle fear' (Alvarez) and 'injection anxiety' (Chen) belong together under a descriptive theme called 'Anxiety about the injection act itself'; 'public embarrassment' (Okonkwo) belongs under 'Social visibility and stigma of injecting.'",
        "Step back and ask what these descriptive themes are saying together: three of four studies describe some form of social or psychological discomfort around the injection — not the medicine's effects on the body.",
        "Write the analytical theme: 'The primary barriers to initiating and continuing this therapy are social and psychological — stigma, embarrassment, and disruption of public routine — rather than medical side effects.'",
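        "Rate confidence: the four studies come from four countries; three describe social or psychological discomfort directly, and the fourth (Russo 2022) is compatible with the theme in ranking side-effect worry below device comfort. Methods are sound and the patient populations match the decision context. Confidence is rated moderate (not high, because the number of studies is still relatively small)."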
      ],
      "result": "Synthesized analytical theme: 'Barriers to starting and continuing weekly injectable therapy center on social stigma and how injecting fits into public daily life, not primarily on side effects.' It is rated moderate CERQual confidence: three of the four included studies support it directly and the fourth (Russo 2022) does not contradict it, across four different countries. No numbers are pooled; the output is a graded interpretive statement, not an effect size or percentage."
    },
    "prerequisites": [
      "systematic-review",
      "qualitative-interview",
      "qualitative-ethnographic"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Thematic synthesis (Thomas & Harden)",
        "description": "Three interpretive stages applied to the findings of primary studies - free line-by-line coding, organization into descriptive themes, and generation of analytical (third-order) themes that go beyond the included studies. The default QES method for health intervention reviews.",
        "edge_cases": [
          "Distinguishing descriptive from analytical themes requires interpretive judgment; two reviewers should code independently and reconcile to limit single-reviewer drift.",
          "Thinly reported primary findings starve the line-by-line coding and force the synthesis to lean on first-order quotes rather than authors' interpretations."
        ],
        "data_source_notes": "Primary qualitative studies: extract both participant quotes (first-order) and author interpretations (second-order) before coding; mixed-methods sub-studies often need author contact for fuller findings."
      },
      {
        "name": "Meta-aggregation (JBI) / framework synthesis",
        "description": "Aggregative methods that extract findings and pool them against a pre-specified framework or as categorized findings, staying close to participants' voices. Preferred for scoped HTA/guideline decision questions.",
        "edge_cases": [
          "A pre-specified framework can blind the synthesis to themes outside it; allow inductive themes alongside the framework.",
          "Aggregation tempts vote-counting; the number of studies supporting a finding is not evidence of its importance."
        ],
        "data_source_notes": "Patient-organization and grey-literature sources fit aggregative tallies but must be flagged as lower methodological confidence and never anchor a high-confidence finding alone."
      },
      {
        "name": "Meta-ethnography (Noblit & Hare)",
        "description": "Highly interpretive translation of studies into one another (reciprocal, refutational, lines-of-argument synthesis) to build new theory. Powerful but resource-intensive and dependent on rich primary description.",
        "edge_cases": [
          "Reciprocal translation assumes conceptual comparability across settings; refutational synthesis is needed when studies genuinely conflict.",
          "Requires deep engagement with each paper; poorly described studies cannot be translated and should be down-weighted."
        ],
        "data_source_notes": "Best with thick-description ethnographic/interview studies; conference abstracts and thin mixed-methods strands rarely supply enough to translate."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Quantitative meta-analysis / network meta-analysis",
        "pros_of_this": "Answers why/how questions - mechanisms, acceptability, treatment burden, barriers - that effect estimates cannot, and supplies the patient-perspective evidence HTA bodies increasingly require.",
        "cons_of_this": "Produces no effect size, no pooled estimate, and no transitivity/heterogeneity statistics; findings are interpretive and must not be read as effect evidence.",
        "when_to_prefer": "Decision questions about patient/clinician experience, acceptability, adherence drivers, equity, and implementation context."
      },
      {
        "compared_to": "Unsystematic narrative review of qualitative studies",
        "pros_of_this": "Adds protocol, reproducible search, transparent selection, formal CASP/Cochrane appraisal, a named synthesis method, and CERQual confidence grading, making conclusions auditable for decisions.",
        "cons_of_this": "Substantially more labor and requires genuine qualitative-methods expertise; a mechanical version yields a quote-collage rather than a synthesis.",
        "when_to_prefer": "Any synthesis that informs an HTA submission, guideline, label, or payer dossier."
      },
      {
        "compared_to": "Mixed-methods (convergent/segregated) review",
        "pros_of_this": "A standalone QES rigorously isolates and grades the qualitative evidence on experience.",
        "cons_of_this": "On its own it does not explain heterogeneity in a quantitative effect; that requires juxtaposition with a quantitative strand.",
        "when_to_prefer": "When the question is purely about experience; embed QES in a mixed-methods review to interpret why an effect varies."
      }
    ],
    "implementation_notes_by_data_source": {
      "primary": "Core substrate is the published findings of primary qualitative studies. Extract both first-order (participant quotes) and second-order (author interpretations) data; appraise each study with CASP; down-rate CERQual adequacy and relevance when sampling, saturation, or transferability are poorly reported.",
      "registry": "Patient-experience registries and HTA patient-submission repositories are relevant but not peer-reviewed and may be advocacy-shaped; treat as a distinct source type, triangulate against peer-reviewed studies, and document potential conflicts in the relevance component."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\n\n# CERQual: start at \"high\" confidence, then down-grade for concerns in each of the four components.\n_DOWNGRADE = {\"none\": 0, \"minor\": 0, \"moderate\": 1, \"serious\": 2}\n_LEVELS = [\"very low\", \"low\", \"moderate\", \"high\"]  # index 3 == high\n\ndef cerqual_rating(findings: pd.DataFrame,\n                   appraisal: pd.DataFrame,\n                   relevance: pd.DataFrame,\n                   coherence: dict[str, str],          # finding_id -> 'none'/'minor'/'moderate'/'serious'\n                   min_studies_adequate: int = 4) -> pd.DataFrame:\n    rows = []\n    for fid, grp in findings[findings[\"supports\"]].groupby(\"finding_id\"):\n        studies = grp[\"study_id\"].unique()\n        n = len(studies)\n\n        # 1) Methodological limitations: worst CASP concern among contributing studies.\n        casp = appraisal.loc[appraisal[\"study_id\"].isin(studies), \"casp_concerns\"]\n        meth = max((_DOWNGRADE[c] for c in casp), default=2)\n\n        # 2) Coherence: reviewer-assigned (fit between data and the finding).\n        coh = _DOWNGRADE[coherence.get(fid, \"moderate\")]\n\n        # 3) Adequacy: thin data when few studies / quotes support the finding.\n        adeq = 0 if n >= min_studies_adequate else (1 if n >= 2 else 2)\n\n        # 4) Relevance: indirect setting match to the decision context down-grades.\n        rel_map = relevance[relevance[\"finding_id\"] == fid][\"setting_match\"]\n        rel = {\"direct\": 0, \"partial\": 1, \"indirect\": 2}.get(\n            rel_map.mode().iat[0] if len(rel_map) else \"indirect\", 2)\n\n        score = 3 - (meth + coh + adeq + rel)        # subtract total concern from \"high\"\n        rating = _LEVELS[max(0, min(3, score))]\n        rows.append({\"finding_id\": fid, \"n_studies\": n, \"meth_limits\": meth,\n                     \"coherence\": coh, \"adequacy\": adeq, \"relevance\": rel,\n                     \"cerqual_confidence\": rating})\n    return 
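pd.DataFrame(\n        rows,\n        # Explicit columns keep the sort valid even when no finding has supporting rows.\n        columns=[\"finding_id\", \"n_studies\", \"meth_limits\", \"coherence\",\n                 \"adequacy\", \"relevance\", \"cerqual_confidence\"],\n    ).sort_values(\"finding_id\").reset_index(drop=True)",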
        "description": "QES analytic backbone: structured finding extraction plus a programmatic GRADE-CERQual confidence rating per review\nfinding. This is bookkeeping/decision-logic code, not statistical estimation - QES does not pool effects. Required\ninput tables (already produced by reviewers during extraction and appraisal):\n  findings : one row per (review_finding, source study)\n             -> finding_id, study_id, order2_interpretation (str), supports (bool)\n  appraisal: one row per source study (from the CASP checklist)\n             -> study_id, casp_concerns in {'none','minor','moderate','serious'}\n  relevance: one row per (finding_id, study_id)\n             -> setting_match in {'direct','partial','indirect'}  # to the decision context\nReturns one CERQual rating per review finding. Coherence is reviewer-assigned (interpretive) and passed in.\nThe additive score is a first-pass aid only: GRADE-CERQual ratings are reviewer judgments made and justified per\nfinding, so each computed rating should be reviewed before it is reported.",
        "dependencies": [
          "pandas"
        ],
        "source_citations": [
          "lewin-2015",
          "noyes-2018"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(dplyr)\nlibrary(tidyr)\n\nthematic_synthesis <- function(codes, descriptive, analytical) {\n  # Stage 1->2: attach descriptive themes to coded findings.\n  coded <- codes |>\n    left_join(descriptive, by = \"code\")\n\n  # Stage 2->3: roll descriptive themes up to analytical (third-order) themes.\n  mapped <- coded |>\n    left_join(analytical, by = \"descriptive_theme\")\n\n  # Summary of Qualitative Findings skeleton: contributing studies per analytical theme,\n  # with how much of the support is second-order (author interpretation) vs first-order (quote).\n  mapped |>\n    group_by(analytical_theme) |>\n    summarise(\n      n_studies        = n_distinct(study_id),\n      n_descriptive    = n_distinct(descriptive_theme),\n      prop_second_order = mean(order == 2L),   # higher = richer interpretive support\n      .groups = \"drop\"\n    ) |>\n    arrange(desc(n_studies))\n    # NB: n_studies informs CERQual 'adequacy' - it is NOT a vote-count of importance.\n}",
        "description": "Thomas & Harden three-stage thematic synthesis support in R: tally line-by-line codes into descriptive themes, map\ndescriptive themes to analytical (third-order) themes, and emit a Summary-of-Qualitative-Findings skeleton. Inputs\nproduced by reviewers during coding (no raw effect data - QES synthesizes findings, not estimates):\n  codes        : study_id, finding_id, code (free line-by-line code), order (1=quote, 2=author interpretation)\n  descriptive  : code, descriptive_theme           # reviewer grouping of codes\n  analytical   : descriptive_theme, analytical_theme # reviewer 3rd-order interpretation",
        "dependencies": [
          "dplyr",
          "tidyr"
        ],
        "source_citations": [
          "thomas-2008",
          "noyes-2018"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Q[Decision question: experience / acceptability / barriers] --> P[Protocol + SPIDER framing, registered]\n  P --> S[Systematic search: MEDLINE + CINAHL + PsycINFO + grey lit]\n  S --> Sel[Dual screening -> PRISMA flow]\n  Sel --> App[Appraisal: CASP per primary study]\n  App --> Ext[Extract first-order quotes + second-order interpretations]\n  Ext --> Syn[Synthesis: thematic / meta-aggregation / meta-ethnography]\n  Syn --> Theme[Analytical 3rd-order themes]\n  Theme --> Cer[GRADE-CERQual per finding:<br/>meth limits + coherence + adequacy + relevance]\n  Cer --> Out[Summary of Qualitative Findings + Evidence Profile<br/>reported per ENTREQ]",
        "caption": "QES workflow from a decision question to graded review findings. Appraisal feeds the methodological-limitations component of CERQual; the number of contributing studies feeds adequacy, not a vote-count of importance.",
        "alt_text": "Flowchart from decision question through protocol, search, screening, CASP appraisal, extraction of first- and second-order data, synthesis, analytical themes, GRADE-CERQual rating, and ENTREQ-reported output.",
        "source_type": "illustrative",
        "source_citations": [
          "thomas-2008",
          "lewin-2015"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  O1[First-order data<br/>participant quotes] --> O2[Second-order data<br/>original authors' interpretations]\n  O2 --> O3[Third-order constructs<br/>reviewers' analytical themes]\n  O3 --> C{CERQual confidence}\n  C -->|serious concerns| Low[Low / very low]\n  C -->|few or no concerns| High[High / moderate]",
        "caption": "The three orders of interpretation in QES. The synthesis adds value at the third-order layer; CERQual then grades how much confidence a decision-maker can place in each third-order finding.",
        "alt_text": "Diagram showing first-order participant quotes feeding second-order author interpretations feeding third-order analytical themes, which are then graded by CERQual into low or high confidence.",
        "source_type": "illustrative",
        "source_citations": [
          "lewin-2015"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "is_variant_of",
        "target_slug": "systematic-review",
        "notes": "QES is a systematic-review method specialized for primary qualitative studies, with the same protocol-search-appraise-synthesize-grade discipline but interpretive synthesis methods and CERQual instead of GRADE."
      },
      {
        "relation_type": "alternative_to",
        "target_slug": "meta-analysis-obs",
        "notes": "Meta-analysis pools quantitative effect estimates (how much); QES synthesizes qualitative findings (why/how) and produces graded themes rather than an effect size."
      },
      {
        "relation_type": "see_also",
        "target_slug": "scoping-review",
        "notes": "A scoping review maps the breadth of literature without grading confidence; QES goes further to appraise and synthesize qualitative findings into CERQual-graded conclusions for a decision."
      },
      {
        "relation_type": "used_with",
        "target_slug": "network-meta-analysis",
        "notes": "In a mixed-methods review, QES findings on acceptability and treatment burden contextualize and help interpret heterogeneity in a quantitative (network) meta-analysis of effects."
      }
    ],
    "aliases": [
      "qualitative evidence synthesis",
      "QES",
      "meta-ethnography",
      "thematic synthesis",
      "meta-aggregation"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "quantile-regression",
    "name": "Quantile Regression",
    "short_definition": "Quantile regression estimates how covariates shift any specified percentile (quantile) of the conditional outcome distribution — not just the conditional mean. In health economics and outcomes research it reveals whether a treatment, exposure, or policy compresses or explodes the cost distribution at the median, the 75th percentile, or the catastrophic tail, effects that a mean regression can conceal entirely. No distributional assumption is required, and extreme outliers are handled by the estimator's loss function rather than by deleting or capping observations.",
    "long_description": "**What quantile regression is and why the mean is sometimes the wrong target**\n\nOrdinary least squares (OLS) regression estimates the conditional mean — E[Y | X] — the average\noutcome for patients with a given covariate profile. For outcomes like total healthcare costs or\nhospital length of stay (LOS), the mean is dominated by a small fraction of patients with extreme\nvalues, and a single estimator on the mean may hide clinically and economically important\nheterogeneity across the distribution. Quantile regression, introduced by Koenker and Bassett\n(1978), solves this by estimating the conditional τ-th quantile — Q_τ(Y | X) — for any chosen\nprobability level τ ∈ (0, 1). Setting τ = 0.5 yields median regression (also called LAD, least\nabsolute deviations); τ = 0.90 yields the model for the 90th-percentile patient. Fitting quantile\nregressions at a grid of τ values traces out the entire shape of the covariate effect across the\ndistribution.\n\nThe practical motivation in HEOR is stark. Suppose a specialty drug has no effect on the median\nannual cost — the typical patient's spending is unchanged — but dramatically increases the 95th-\npercentile cost because a subset of patients experience serious adverse events requiring intensive\nhospitalisation. An OLS regression on the mean picks up this tail shift and returns a significant\npositive mean cost difference, but gives no information about *where* in the distribution the\neffect is concentrated. 
A payer managing population risk cares whether a new therapy shifts 5% of\npatients into catastrophically expensive territory; a single mean estimate cannot answer that.\nQuantile regression at τ = 0.50 and τ = 0.90 (or 0.95) can.\n\n**The check-function intuition**\n\nQuantile regression minimises a weighted sum of absolute residuals — specifically, a loss function\ncalled the check function (or pinball loss) ρ_τ(u) that penalises positive residuals (predictions\ntoo low) by a factor τ and negative residuals (predictions too high) by a factor (1 − τ). At\nτ = 0.5 positive and negative errors are weighted equally, and minimising the total gives the\nconditional median — the value that splits predicted errors evenly above and below the fitted line.\nAt τ = 0.90, an under-prediction is penalised nine times more heavily than an over-prediction, so\nthe fitted line is pulled up to capture the 90th percentile. No Gaussian, gamma, or any other\ndistributional assumption is imposed. The estimator is semi-parametric: it assumes a linear\nconditional quantile function but leaves the full conditional distribution unspecified. This\nmakes it valid for the heavy-tailed, non-normal cost and LOS distributions typical of claims and\nEHR data.\n\n**Robustness without deleting data — contrast with trimming and winsorisation**\n\nQuantile regression at the median is inherently robust to outliers in the outcome. Because the\ncheck function weights residuals by their sign rather than squaring them, a single catastrophically\nexpensive patient contributes one large positive residual with bounded weight τ, rather than a\nsquared residual that can dominate the OLS objective. Crucially, this robustness is achieved\nwithout removing or winsorising the outlier. The patient remains in the dataset at their true cost;\nthe estimator simply does not let that cost dominate the fit.\n\nThis is a meaningful methodological distinction from trimming and winsorisation. 
Trimming an\nextreme cost observation changes the *estimand*: the trimmed mean estimates average cost in a\nhypothetical population where tail events are excluded, not average cost in the actual population.\nWinsorisation caps the value, introducing deliberate downward bias in the estimated mean. Quantile\nregression at the median, by contrast, estimates the *conditional median in the full population*,\na well-defined quantity that is not distorted by the magnitude of extreme values, because the\nmedian depends only on rank order, not on how far above it the most extreme observations lie.\nFor upper quantiles (τ = 0.75, 0.90), the estimand is similarly the conditional quantile at that\nprobability level in the full population, not in a tail-trimmed sub-population.\n\n**Inference: rank-based and bootstrap standard errors**\n\nTwo approaches to inference are commonly used for quantile regression estimates in practice:\n\n- *Rank-based inference* (inversion of a rank-score test; the small-sample default of summary()\n  for an R quantreg::rq fit, and available as CI=RANK in SAS PROC QUANTREG) yields confidence\n  intervals directly and does not require estimating the conditional density at the quantile.\n  It is reliable in moderate samples and is a common choice for quantiles near the median\n  (τ = 0.25 to 0.75).\n- *Bootstrap standard errors* (nonparametric xy-pair bootstrap, which re-samples patient rows\n  jointly to preserve the conditional quantile structure) are preferred for extreme quantiles\n  (τ near 0 or near 1) or when the analytic formula is unreliable due to heavy clustering,\n  complex survey design, or very small samples. In R, request them with summary(fit, se = \"boot\")\n  on a quantreg::rq fit; in SAS, the CI=RESAMPLING option of PROC QUANTREG requests\n  resampling-based intervals, and SEED= only fixes the random seed for reproducibility.\n\nNeither approach requires distributional assumptions beyond standard regularity conditions\n(continuity of the conditional distribution at the quantile, bounded moments). 
This is a stronger\nguarantee than OLS standard errors, which require homoscedasticity or the additional\nheteroscedasticity-robust sandwich correction.\n\n**When the mean is still the right target**\n\nQuantile regression is powerful, but it does not replace mean-based analysis when the decision\nproblem requires means. Budget-impact models, cost-effectiveness analyses (ICER numerators), and\npayer negotiations about total drug budget all depend on the *arithmetic mean* cost per patient\nbecause means aggregate to population totals. If median drug cost is $2,000 but mean drug cost\nis $15,000, the payer's actual budget exposure is proportional to the mean. Median regression\ncannot be directly used to project budget impact — it answers a different question. For\nbudget-impact and value-framework submissions, a gamma GLM with log link (see the gamma-distribution\nentry) remains the primary analytic tool. Quantile regression at the median is not interchangeable\nwith a mean estimate; they are separate estimands.\n\nQuantile regression is best positioned as:\n- *Alongside* a mean model, to characterise where effects concentrate and whether the mean\n  summarises the distribution adequately.\n- *As the primary analysis* when the policy question directly concerns a specific quantile: \"Does\n  this intervention reduce catastrophic cost (i.e., the 90th-percentile cost)?\" or \"What is the\n  median LOS for high-risk patients on this care pathway?\"\n- *As a robustness check* when outlier sensitivity is a concern; if the median regression and\n  the mean regression give qualitatively different answers, the difference is informative about\n  the role of the tail.\n\n**Censored quantile regression**\n\nWhen outcomes are right-censored (costs or LOS censored at the end of the observation window,\nor time-to-event outcomes), censored quantile regression — implemented via the Powell (1986)\nestimator and available in quantreg::crq() in R — extends the framework to 
survival-type data,\nanalogously to how the Cox model extends OLS regression to censored time-to-event outcomes.\n\n**Pros, cons, and trade-offs**\n\n*Pros:*\n- Estimates any conditional quantile of the outcome distribution; reveals heterogeneous covariate\n  effects that a single mean estimate cannot capture — for example, a drug that reduces median\n  cost while inflating 90th-percentile cost.\n- No distributional assumption on the outcome; valid for the heavy-tailed, zero-inclusive,\n  non-normal distributions typical of healthcare cost and LOS data.\n- Robust to outliers in the outcome without deleting or winsorising them; the estimand remains\n  the conditional quantile in the full population.\n- Fitting at multiple τ values (e.g., 0.25, 0.50, 0.75, 0.90, 0.95) provides a complete picture\n  of distributional shift — richer than any single mean estimate.\n- Implemented natively in Python (statsmodels QuantReg), R (quantreg::rq), and SAS (PROC\n  QUANTREG); no specialty packages required.\n\n*Cons:*\n- Estimates conditional quantiles, not means; cannot substitute for a mean-based estimator when\n  budget impact or ICER calculations are the goal.\n- Less efficient than OLS when the outcome is genuinely normally distributed; OLS achieves lower\n  variance for the mean under normality, while quantile regression sacrifices some efficiency for\n  distributional flexibility.\n- Covariate adjustment and propensity score weighting (weighted quantile regression) are feasible\n  but require careful specification, and bootstrap standard errors are computationally intensive\n  in large datasets.\n- Quantile crossing — fitted quantile curves for different τ values that cross each other at some\n  covariate values, violating the mathematical monotonicity requirement — can occur in finite\n  samples and requires joint quantile estimation or rearrangement to correct.\n- Interpretation of upper-quantile coefficients is less familiar to clinical and payer audiences\n  than a 
mean cost ratio and requires explicit communication in the analysis report.\n\n**When to use**\n\nUse quantile regression when:\n- The research question concerns a specific quantile of the cost or LOS distribution — the\n  median, the 90th percentile, or the probability of catastrophic cost above a threshold.\n- The treatment effect may be heterogeneous across the distribution: a drug may reduce median cost\n  while concentrating high costs in a small subgroup, a pattern that OLS on the mean cannot\n  identify.\n- The outcome distribution is sufficiently non-normal and heavy-tailed that distributional\n  assumptions in GLMs are unattractive and winsorisation would materially change the estimand.\n- A distributional-robustness check is desired alongside a primary mean analysis; divergence\n  between median and mean regression signals the presence of an influential tail.\n- The decision question is specifically about payer risk management at the catastrophic-cost\n  tier — \"What fraction of the benefit-risk trade-off occurs in the top 10% of spenders?\"\n\n**When NOT to use**\n\nDo not use quantile regression as the primary analysis when:\n- The decision-relevant estimand is the arithmetic mean cost (budget impact, ICER numerator,\n  PMPM premium calculation); the mean and the median are different quantities, and substituting\n  one for the other is a methodological error for budget-relevant questions. 
Route instead to a\n  gamma GLM with log link, a two-part model, or a bootstrap mean difference.\n- The sample is very small (n < 50 per group); quantile regression standard errors are unreliable\n  at extreme quantiles (τ near 0 or 1) with limited data, and bootstrap confidence intervals can\n  be wide and unstable.\n- Outcomes are binary or count-valued and an approximating continuous-quantile model is\n  inappropriate; use logistic regression for binary outcomes and Poisson or negative-binomial\n  models for counts.\n- The analysis requires causal language but confounding is uncontrolled; an unadjusted quantile\n  regression on observational data estimates a biased conditional quantile. Pair with propensity\n  score weighting (weighted quantile regression) or pre-specify as a descriptive sensitivity.\n\n**Interpreting the output**\n\nConsider a study comparing 12-month total healthcare costs for patients initiating Drug A versus\nDrug B in a commercial claims database, covariate-adjusted for age, sex, Charlson Comorbidity\nIndex, and prior-year cost. Quantile regression is fit at τ = 0.50 and τ = 0.90. Results: the\nτ = 0.50 coefficient on Drug A vs Drug B is −500 (95% CI −900 to −100); the τ = 0.90 coefficient\nis +8,200 (95% CI +1,500 to +14,900).\n\n*(1) Formal interpretation.* The τ = 0.50 coefficient of −500 estimates that, conditional on the\nfour adjustment covariates held fixed at their observed values, the 50th percentile of the\n12-month cost distribution is $500 lower for Drug A patients than for Drug B patients. This is a\nstatement about the *conditional median*, not the mean cost difference. A 95% CI of [−900, −100]\nmeans that values of the true conditional median difference between −$900 and −$100 are compatible\nwith the observed data under the repeated-sampling interpretation of a frequentist confidence\ninterval — it does NOT mean there is 95% probability the true median difference falls in that\nrange. 
The τ = 0.90 coefficient of +8,200 estimates that the 90th-percentile cost is $8,200\n*higher* for Drug A patients, with a 95% CI of [+1,500, +14,900] excluding zero. These two\nestimates come from separate quantile-regression fits and can differ in sign — that divergence\nis the signal quantile regression is designed to detect.\n\n*(2) Practical interpretation.* The typical Drug A patient (at the median) costs $500 less than\nthe typical Drug B patient, but the highest-cost 10% of Drug A patients cost $8,200 more than\ntheir counterparts on Drug B. Drug A appears to compress costs for most patients while\nconcentrating expense in a high-cost subgroup — possibly patients experiencing serious adverse\nevents or disease progression requiring rescue therapy. A mean regression might show a modest\npositive or null mean cost difference that conceals this distributional story entirely. A payer\nmanaging population risk needs both numbers: the median tells them what the majority of members\nwill experience; the 90th-percentile coefficient quantifies the tail-budget exposure. However,\nfor a formal budget-impact model, the payer must use the arithmetic mean cost estimate —\nobtainable from a gamma GLM — not the median, because population budget = mean cost × member\ncount.",
    "primary_category": "Inferential_Statistics",
    "tags": [
      "statistics",
      "cost-analysis",
      "quantile-regression",
      "robust-statistics",
      "inferential-statistics",
      "healthcare-costs",
      "HEOR",
      "LOS",
      "distribution",
      "median-regression"
    ],
    "applies_to_study_types": [
      "cohort_retrospective",
      "cohort_prospective",
      "cross_sectional",
      "claims_analysis",
      "ehr_study",
      "registry_study",
      "active_comparator_new_user",
      "descriptive_analysis"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "primary",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.2307/1913643",
        "url": "https://doi.org/10.2307/1913643",
        "citation_text": "Koenker R, Bassett G. Regression quantiles. Econometrica. 1978;46(1):33-50.",
        "year": 1978,
        "authors_short": "Koenker & Bassett",
        "notes": "The foundational paper introducing the quantile regression estimator and its check-function loss, establishing asymptotic properties and demonstrating that the median regression special case corresponds to least absolute deviations. The starting point for all applied and theoretical work on quantile regression."
      },
      {
        "role": "explain",
        "doi": "10.1002/sim.1851",
        "url": "https://doi.org/10.1002/sim.1851",
        "citation_text": "Austin PC, Tu JV, Daly PA, Alter DA. The use of quantile regression in health care research: a case study examining gender differences in the timeliness of thrombolytic therapy. Statistics in Medicine. 2005;24(5):791-816.",
        "year": 2005,
        "authors_short": "Austin et al.",
        "notes": "Applied tutorial demonstrating quantile regression in a health services context, showing how covariate effects on the median and upper quantiles of a clinical time-to-treatment outcome differ from OLS mean-regression estimates. Accessible to analysts without deep econometrics background and closely tied to the style of applications seen in RWE and HEOR."
      },
      {
        "role": "demonstrate",
        "doi": "10.1002/hec.2927",
        "url": "https://doi.org/10.1002/hec.2927",
        "citation_text": "Borah BJ, Basu A. Highlighting differences between conditional and unconditional quantile regression approaches through an application to assess medication adherence. Health Economics. 2013;22(9):1052-1070.",
        "year": 2013,
        "authors_short": "Borah & Basu",
        "notes": "Contrasts conditional quantile regression (estimating quantiles of Y given X) with unconditional quantile regression (estimating quantiles of the marginal distribution of Y), using medication adherence as the application. Particularly relevant for HEOR analysts who need to choose between these two estimands, and for understanding when the conditional-median coefficient is not the same as the shift in the population median."
      }
    ],
    "plain_language_summary": "Quantile regression is a statistical method that estimates how patient characteristics or treatment affect any chosen percentile of an outcome (such as the median cost or the 90th-percentile length of stay) rather than just the average. This matters enormously for healthcare costs, because a drug might look affordable for most patients (unchanged median) while quietly exploding the bills of the sickest 5% (a much higher 95th percentile). Unlike standard regression, quantile regression makes no assumption about the shape of the cost distribution and handles extreme high-cost patients without deleting or capping them, though it cannot directly replace a mean-cost estimate when budget impact is what the payer needs.",
    "key_terms": [
      {
        "term": "quantile (percentile)",
        "definition": "A cut-point in a ranked list of values; the 50th percentile (median) is the middle value, the 90th percentile is the value below which 90% of patients fall."
      },
      {
        "term": "conditional quantile",
        "definition": "The quantile of the outcome for patients who share a specific set of characteristics (age, treatment, comorbidity), estimated by the regression model — analogous to the conditional mean that ordinary regression estimates."
      },
      {
        "term": "check function (pinball loss)",
        "definition": "The loss function minimised by quantile regression; it penalises predictions that are too low more than predictions too high (or vice versa, depending on the target percentile), pulling the fitted line to the chosen quantile rather than the mean."
      },
      {
        "term": "least absolute deviations (LAD)",
        "definition": "Median regression by another name — the special case of quantile regression at the 50th percentile, which minimises the sum of absolute rather than squared residuals."
      },
      {
        "term": "rank-based standard error",
        "definition": "A measure of uncertainty for a quantile regression coefficient obtained by inverting a rank-score test; it yields a confidence interval without assuming a distribution for the outcome or estimating its density, and is a common software default for smaller samples and quantiles near the median."
      },
      {
        "term": "quantile crossing",
        "definition": "A finite-sample artefact where fitted quantile lines for different percentiles (say, the 50th and 75th) cross each other at some covariate values, which is mathematically impossible in the true distribution and indicates a model problem requiring correction."
      }
    ],
    "worked_example": {
      "scenario": "A health economist is comparing 12-month costs for five patients in a claims database: four patients had routine care and one had a catastrophic complication requiring a transplant. The economist wants to understand the difference between the mean and median cost summary, and what happens to each when the catastrophic patient's cost doubles — mimicking how a more expensive rescue therapy would affect the statistics reported to a payer.",
      "dataset": {
        "caption": "Twelve-month total cost (thousands of USD) for five patients. Patient P5 is a catastrophic outlier; the other four represent typical utilization. All values are used in both the mean and median calculation.",
        "columns": [
          "patient_id",
          "cost_thousand_usd"
        ],
        "rows": [
          [
            "P1",
            1
          ],
          [
            "P2",
            2
          ],
          [
            "P3",
            3
          ],
          [
            "P4",
            4
          ],
          [
            "P5",
            90
          ]
        ]
      },
      "steps": [
        "Sort the five costs in ascending order: 1, 2, 3, 4, 90.",
        "The median is the middle (3rd) value of the sorted list: median = 3 (thousand USD).",
        "Compute the mean: sum = 1+2+3+4+90 = 100. Mean = 100/5 = 20 (thousand USD).",
        "Now suppose a more expensive rescue therapy doubles P5's cost from 90 to 180 (thousand USD). The sorted list becomes 1, 2, 3, 4, 180.",
        "New mean: sum = 1+2+3+4+180 = 190. New mean = 190/5 = 38 (thousand USD). The mean nearly doubled, from 20 to 38.",
        "New median: the middle (3rd) value of [1, 2, 3, 4, 180] is still 3. The median is unchanged = 3 (thousand USD)."
      ],
      "result": "Baseline: median = 3 (thousand USD), mean = 100/5 = 20. After doubling P5's cost: median = 3 (unchanged), mean = 190/5 = 38. Doubling the extreme value nearly doubled the mean but left the median completely unchanged. A mean regression detects the catastrophic-cost shift; a median regression does not. A drug that causes 1 in 5 patients to have twice the catastrophic cost appears to nearly double average spending — yet the typical (median) patient's cost is unaffected. Both facts are true and relevant: the payer's budget depends on the mean; the clinical experience of most patients is captured by the median."
    },
    "prerequisites": [
      "ols-linear-regression",
      "descriptive-statistics"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Median regression (τ = 0.50, least absolute deviations)",
        "description": "The most common application in HEOR. Minimises the sum of absolute residuals; it is the analogue of OLS with the squared-error loss replaced by an absolute-error loss. Produces a conditional-median estimate that is insensitive to the magnitude of cost outliers and does not require any distributional assumption. The conditional median is a valid primary estimand when the policy question is about the typical patient's cost, but it cannot substitute for the mean when budget impact is the target.",
        "edge_cases": [
          "In datasets with a high fraction of tied values (e.g., many patients with zero cost), median regression solutions may not be unique; check whether a two-part model addressing the zero-cost mass is more appropriate.",
          "The conditional median may lie in a very different part of the distribution than the conditional mean for highly skewed cost data; always report both or clarify which estimand is primary and why."
        ],
        "data_source_notes": "Claims: compute patient-period cost totals (PPPY), then fit the median regression on those totals. Report the rank-based 95% CI alongside the coefficient. For sensitivity, also report the gamma GLM mean ratio and note whether they diverge."
      },
      {
        "name": "Upper-quantile regression (τ = 0.75, 0.90, 0.95)",
        "description": "Models the effect of covariates on the upper tail of the cost or LOS distribution, that is, the high-cost or long-stay segment that drives payer risk and budget unpredictability. Upper-quantile coefficients directly answer the question \"does this drug shift the 90th-percentile cost?\" without requiring tail truncation or winsorisation. Bootstrap standard errors are preferred over rank-based SEs for τ ≥ 0.90 in moderate sample sizes.",
        "edge_cases": [
          "Confidence intervals for τ near 1.0 can be very wide in moderate samples; at least n = 500 per group is advisable for reliable 95th-percentile estimates.",
          "Quantile crossing is more likely when fitting at multiple extreme quantiles simultaneously; use rearrangement or joint quantile estimation to correct."
        ],
        "data_source_notes": "Claims: upper-quantile estimates are most stable in large commercial or Medicare FFS cohorts (n > 1,000). Use bias-corrected and accelerated (BCa) bootstrap CIs for τ = 0.90 and above. Report the proportion of patients above the fitted quantile as a sanity check."
      },
      {
        "name": "Quantile treatment effects across a quantile grid",
        "description": "Fits the same covariate model at a grid of τ values (e.g., 0.10, 0.25, 0.50, 0.75, 0.90, 0.95) and plots the treatment coefficient against τ. A flat coefficient across τ indicates a uniform shift of the distribution (same as a location shift); a rising or falling coefficient reveals distributional heterogeneity. This approach characterises where in the outcome distribution the treatment adds or removes cost, which is a key input for risk-stratified payer contracts and outcomes-based agreements. It describes shifts in quantiles, not effects on identifiable individual patients.",
        "edge_cases": [
          "The quantile-grid plot should always include simultaneous confidence bands (not pointwise CIs) to avoid false inference about where the coefficient is significant.",
          "Individual τ-specific fits are separate linear-programming problems, each solved independently; the grid does not automatically enforce quantile monotonicity across τ values."
        ],
        "data_source_notes": "Claims and EHR linked data: this approach works well for characterising cost distributions across disease severity strata or across lines of therapy, where the question is \"at which cost tier does this treatment add or subtract value?\""
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "ols-linear-regression",
        "pros_of_this": "Quantile regression makes no distributional assumption and is robust to outliers; OLS requires homoscedastic residuals or a heteroscedasticity correction, and squared-residual weighting lets outliers dominate. Quantile regression also answers heterogeneous-effect questions that a single mean estimate cannot address.",
        "cons_of_this": "OLS produces mean estimates that aggregate to population totals needed for budget-impact modelling; OLS is also more efficient than quantile regression when the outcome is genuinely normal. OLS is more familiar to clinical and payer audiences and its output (mean difference, regression coefficient) requires less interpretive explanation.",
        "when_to_prefer": "Prefer quantile regression when the outcome is heavily skewed, when effect heterogeneity across the distribution is the question, or when outlier robustness without data deletion is required. Prefer OLS when the mean is the target estimand, sample sizes are moderate, and residuals are approximately homoscedastic."
      },
      {
        "compared_to": "gamma-distribution",
        "pros_of_this": "Quantile regression requires no distributional assumption at all; the gamma GLM requires that the variance function (V ∝ μ²) is approximately correct. Quantile regression produces tail-specific estimates; the gamma GLM estimates the mean.",
        "cons_of_this": "The gamma GLM with log link directly targets E[Y|X] on the original cost scale and produces mean cost ratios interpretable for budget modelling; quantile regression at the median cannot be used to derive mean costs or population budget projections without additional work. The gamma GLM also accommodates covariate adjustment, IPTW weighting, and marginal standardisation in a single framework with well-understood diagnostics.",
        "when_to_prefer": "Prefer quantile regression when the policy question is tail-specific (\"does this drug increase catastrophic-cost risk?\") or when distributional assumptions for the gamma are not supported by data. Prefer the gamma GLM with log link when the target estimand is the mean cost for budget-impact or ICER calculation."
      },
      {
        "compared_to": "cost-outlier-handling-rwe",
        "pros_of_this": "Quantile regression handles extreme costs through the check-function loss rather than by capping or removing observations; the estimand (conditional median or quantile) is not biased by the presence of extreme values and does not require a pre-specified winsorisation rule. The distributional heterogeneity that outlier-handling strategies try to manage is instead characterised explicitly across quantiles.",
        "cons_of_this": "Winsorisation is more transparent to reviewers who are unfamiliar with quantile regression; the gamma GLM with winsorisation as a sensitivity check is the standard primary comparative analysis for mean costs in HEOR, and winsorisation is simpler to pre-specify and audit.",
        "when_to_prefer": "Prefer quantile regression when outlier robustness is needed without changing the estimand, or when characterising the tail distribution is itself the scientific objective. Pair with the standard two-part GLM for the primary mean-cost comparison and report quantile regression as a sensitivity or complement."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Compute per-patient cost totals (PPPY or PPPM, FFS-observable only) before fitting quantile regression. For median cost, rank-based standard errors are appropriate; for τ ≥ 0.90, use bootstrap standard errors or bias-corrected bootstrap CIs. At large n (> 10,000), even clinically trivial differences can reach statistical significance; report the coefficient magnitude and a quantile-specific pseudo-R² alongside the p-value. Pre-specify the τ values in the SAP to avoid selective reporting of the quantile that shows the largest effect.",
      "ehr": "EHR cost variables are typically charges or chargemaster amounts rather than paid amounts; clarify the cost definition before fitting any quantile model. LOS in days is a natural quantile regression outcome for in-patient studies where distribution of stay length is of direct policy interest (discharge planning, DRG profitability). Informative missingness from visit-driven capture can create zero-cost artefacts; address before modelling.",
      "registry": "Disease registries capture adjudicated high-acuity patients with characterised comorbidities. Quantile regression on registry-linked cost data is well-suited for identifying whether a clinical severity stratum drives tail cost, which is obscured in population-average analyses.",
      "primary": "Prospective studies with patient-reported costs or resource utilisation diaries can use quantile regression without assuming a distributional form; for small n (< 100), median regression with bootstrap SEs is more reliable than upper-quantile estimates.",
      "linked": "Linked claims-EHR-registry datasets are the ideal substrate for quantile regression grids: large n allows stable upper-quantile estimates, clinical severity from EHR/registry enables meaningful covariate adjustment, and claims provide complete paid-amount cost data."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\nimport pandas as pd\nimport statsmodels.formula.api as smf\n\n# ── Simulate cost data: binary treatment, asymmetric distributional effect ──────────────────\nrng = np.random.default_rng(seed=42)\nn = 600\ntreatment = np.repeat([0, 1], n // 2)\nage       = rng.normal(60, 10, n)\n# True model: median cost lower for treated; upper tail higher for treated (heterogeneous effect)\nnoise = rng.exponential(scale=8_000, size=n)\ncost  = (5_000\n         + 100 * age\n         - 600 * treatment          # reduces median\n         + 12_000 * treatment * (noise > np.quantile(noise, 0.85))   # inflates 90th pct\n         + noise)\ndf = pd.DataFrame({\"cost\": cost, \"treatment\": treatment, \"age\": age})\n\n# ── 1. Median regression (tau = 0.50, rank-based SEs) ────────────────────────────────────\nmod = smf.quantreg(\"cost ~ treatment + age\", df)\nres_50 = mod.fit(q=0.5)                       # default: rank-based standard errors\nprint(\"=== Median regression (tau=0.50) ===\")\nprint(res_50.summary())\nci_50 = res_50.conf_int()\nprint(f\"Treatment: coef={res_50.params['treatment']:,.0f}  \"\n      f\"95% CI [{ci_50.loc['treatment', 0]:,.0f}, {ci_50.loc['treatment', 1]:,.0f}]\")\n\n# ── 2. 90th-percentile regression (tau = 0.90, bootstrap SEs) ────────────────────────────\nres_90 = mod.fit(q=0.9, vcov=\"iid\")           # use \"robust\" or bootstrap for extreme quantiles\nprint(\"\\n=== 90th-percentile regression (tau=0.90) ===\")\nci_90 = res_90.conf_int()\nprint(f\"Treatment: coef={res_90.params['treatment']:,.0f}  \"\n      f\"95% CI [{ci_90.loc['treatment', 0]:,.0f}, {ci_90.loc['treatment', 1]:,.0f}]\")\nprint(\"\\nNote: For tau>=0.90, prefer bootstrap SEs. statsmodels quantreg supports\")\nprint(\"      vcov='robust' (sandwich) or bootstrap via res.conf_int(p=0.05, q=0.90).\")\n\n# ── 3. 
Quantile grid: trace treatment coefficient across tau ──────────────────────────────\nprint(\"\\n=== Treatment coefficient across quantile grid ===\")\nprint(f\"{'tau':>6}  {'coef':>10}  {'CI_low':>10}  {'CI_high':>10}\")\nfor tau in [0.10, 0.25, 0.50, 0.75, 0.90]:\n    r   = mod.fit(q=tau, vcov=\"iid\")\n    ci  = r.conf_int()\n    b   = r.params[\"treatment\"]\n    lo  = ci.loc[\"treatment\", 0]\n    hi  = ci.loc[\"treatment\", 1]\n    print(f\"{tau:>6.2f}  {b:>10,.0f}  {lo:>10,.0f}  {hi:>10,.0f}\")\nprint(\"\\nA rising coefficient from tau=0.50 to tau=0.90 indicates the treatment\")\nprint(\"concentrates its cost effect in the upper tail (heterogeneous distributional effect).\")",
        "description": "Fits quantile regression at multiple τ values using statsmodels QuantReg. Demonstrates median\n(τ=0.50) and 90th-percentile (τ=0.90) models on simulated cost data with a binary treatment\nvariable and age covariate, prints coefficients with 95% confidence intervals (rank-based for\nthe median; bootstrap for the 90th percentile), and shows a quantile-grid loop to trace the\ntreatment coefficient across τ values. No external dependencies beyond statsmodels and numpy.",
        "dependencies": [
          "statsmodels",
          "numpy"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(quantreg)\n\n# ── Simulate cost data (same structure as Python example) ────────────────────────────────\nset.seed(42)\nn         <- 600\ntreatment <- rep(0:1, each = n / 2)\nage       <- rnorm(n, mean = 60, sd = 10)\nnoise     <- rexp(n, rate = 1 / 8000)\ncost      <- (5000\n              + 100 * age\n              - 600 * treatment\n              + 12000 * treatment * (noise > quantile(noise, 0.85))\n              + noise)\ndf <- data.frame(cost = cost, treatment = treatment, age = age)\n\n# ── 1. Median regression: tau = 0.50, rank-based standard errors ─────────────────────────\nfit_50 <- rq(cost ~ treatment + age, data = df, tau = 0.50)\ncat(\"=== Median regression (tau = 0.50) ===\\n\")\nprint(summary(fit_50, se = \"rank\"))          # rank-based SEs: default, reliable near median\n# Coefficient interpretation: the treatment coefficient estimates the shift in the\n# conditional MEDIAN cost (not the mean) for treated vs untreated patients with same age.\n\n# ── 2. 90th-percentile regression: tau = 0.90, bootstrap SEs ────────────────────────────\nfit_90 <- rq(cost ~ treatment + age, data = df, tau = 0.90)\ncat(\"\\n=== 90th-percentile regression (tau = 0.90) ===\\n\")\nprint(summary(fit_90, se = \"boot\", R = 1000, bsmethod = \"xy\"))\n# se = \"boot\", bsmethod = \"xy\": nonparametric xy-pair bootstrap; preferred for tau >= 0.90\n\n# ── 3. Quantile grid: fit all tau values at once and plot ─────────────────────────────────\ntaus    <- c(0.10, 0.25, 0.50, 0.75, 0.90)\nfit_all <- rq(cost ~ treatment + age, data = df, tau = taus)\ncat(\"\\n=== Treatment coefficient across quantile grid ===\\n\")\ngrid_summary <- summary(fit_all, se = \"rank\")\nfor (i in seq_along(taus)) {\n  tbl <- grid_summary[[i]]$coefficients\n  b   <- tbl[\"treatment\", \"Value\"]\n  lo  <- tbl[\"treatment\", \"Lower bd\"]\n  hi  <- tbl[\"treatment\", \"Upper bd\"]\n  cat(sprintf(\"tau=%.2f  coef=%8.0f  95%%CI [%8.0f, %8.0f]\\n\", taus[i], b, lo, hi))\n}\n\n# ── 4. 
Censored quantile regression for right-censored LOS outcomes ───────────────────────\n# crq() in quantreg implements the Powell (1986) censored quantile estimator.\n# Example structure (replace LOS and censor_flag with real columns):\n# fit_crq <- crq(Surv(LOS, 1 - censor_flag, type = \"right\") ~ treatment + age,\n#                data = df, method = \"Powell\", taus = 0.50)\n# summary(fit_crq, taus = 0.50)",
        "description": "Fits quantile regression using the quantreg package (Koenker). Demonstrates median regression\nwith rank-based standard errors (se=\"rank\"), 90th-percentile regression with bootstrap SEs\n(se=\"boot\"), and a quantile grid visualised as a coefficient plot across tau. Uses the same\nsimulated dataset structure as the Python implementation for comparability. The quantreg\npackage implements censored quantile regression via crq() for right-censored outcomes.",
        "dependencies": [
          "quantreg"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* ── Create simulated cost data ── */\ndata work.cost_data;\n  call streaminit(42);\n  do i = 1 to 600;\n    treatment = (i > 300);                    /* 0 = untreated (i=1-300), 1 = treated        */\n    age       = rand('normal', 60, 10);\n    noise     = rand('exponential', 8000);    /* exponential mean = 8000                     */\n    cost      = 5000\n              + 100 * age\n              - 600 * treatment\n              + 12000 * treatment * (noise > quantile('exponential', 0.85, 8000))\n              + noise;\n  end;\n  keep i treatment age cost;\nrun;\n\n/* ── 1. Median regression (tau=0.5) with rank-based standard errors ─── */\nproc quantreg data=work.cost_data ci=rank;\n  class treatment (ref='0');\n  model cost = treatment age / quantile=0.5;\n  /* Output: coefficient table at tau=0.5 with rank-based 95% CI.                          */\n  /* The 'treatment 1' row is the conditional-median cost difference (treated vs untreated) */\nrun;\n\n/* ── 2. 90th-percentile regression (tau=0.9) with bootstrap SEs ─────── */\nproc quantreg data=work.cost_data ci=boot(seed=42 samples=1000);\n  class treatment (ref='0');\n  model cost = treatment age / quantile=0.9;\n  /* BOOTSTRAP option activates nonparametric bootstrap SEs; preferred at extreme quantiles */\nrun;\n\n/* ── 3. Quantile grid: median + upper quantiles in a single PROC call ── */\nproc quantreg data=work.cost_data ci=rank;\n  class treatment (ref='0');\n  model cost = treatment age / quantile=0.10 0.25 0.50 0.75 0.90;\n  /* PROC QUANTREG produces a separate coefficient table for each tau value listed.         */\n  /* Compare the 'treatment 1' coefficient across tau to detect heterogeneous effects.     */\n  output out=work.qr_grid p=predicted;\nrun;",
        "description": "Fits quantile regression at the median (QUANTILE=0.5) and at the 90th percentile\n(QUANTILE=0.9) using PROC QUANTREG. Reports rank-based standard errors for the median and\nbootstrap SEs for the 90th percentile via the BOOTSTRAP option. A QUANTILE= list in a single\nPROC QUANTREG call fits all specified tau values simultaneously. Uses the same simulated\ndataset structure as the Python and R implementations; comments explain the key output rows\nand options.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Q[Continuous outcome\\ncost or LOS] --> Est{What estimand\\nis the goal?}\n  Est -->|Mean cost for\\nbudget-impact or ICER| Mean[\"Gamma GLM with log link\\nor two-part model\\n(see gamma-distribution)\"]\n  Est -->|Median or specific\\npercentile cost| QR[Quantile regression\\nat chosen tau]\n  Est -->|Does effect differ\\nacross the distribution?| Grid[\"Quantile grid\\ntau = 0.25, 0.50, 0.75, 0.90\"]\n  QR --> SE{Sample size\\nand quantile level}\n  SE -->|tau 0.25 to 0.75\\nmod to large n| Rank[\"Rank-based SEs\\n(default in R, SAS)\"]\n  SE -->|tau > 0.90\\nor small n| Boot[\"Bootstrap SEs\\nse='boot' or BOOTSTRAP option\"]\n  Grid --> Plot[\"Plot treatment coef vs tau\\nFlat = uniform shift\\nRising = tail concentration\"]",
        "caption": "Decision flow for quantile regression: the key choice is whether the target estimand is the mean (route to gamma GLM) or a specific quantile. For mean-based budget-impact questions, quantile regression is a complement, not a substitute.",
        "alt_text": "Flowchart from a continuous cost or LOS outcome through an estimand decision — mean versus quantile — to gamma GLM for the mean or quantile regression for specific percentiles, branching further on sample size and quantile level to rank-based or bootstrap standard errors.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "see_also",
        "target_slug": "ols-linear-regression",
        "notes": "OLS estimates the conditional mean; quantile regression estimates any conditional quantile. The two methods answer different questions about the same outcome distribution. For heavily skewed cost data the conditional median (from quantile regression at tau=0.50) and the conditional mean (from OLS or a gamma GLM) can diverge substantially, and reporting both characterises the full picture of a treatment effect."
      },
      {
        "relation_type": "see_also",
        "target_slug": "gamma-distribution",
        "notes": "The gamma GLM with log link is the modern standard for mean healthcare cost regression; it makes a distributional assumption (variance proportional to mean squared) but directly targets the arithmetic mean needed for budget-impact models. Quantile regression makes no distributional assumption and targets the median or upper quantile; the two are complements rather than substitutes when characterising cost outcomes fully."
      },
      {
        "relation_type": "see_also",
        "target_slug": "cost-outlier-handling-rwe",
        "notes": "Quantile regression and cost outlier handling are two different answers to the same problem — extreme high-cost patients. Winsorisation changes the estimand (capped mean); quantile regression at the median retains the true estimand without distorting it. The two approaches should be pre-specified together in the SAP: quantile regression as the primary distributional characterisation and the two-part gamma-log GLM with winsorisation sensitivity for the primary mean-cost comparison."
      },
      {
        "relation_type": "used_with",
        "target_slug": "healthcare-costs-pppm-pppy-pmpm",
        "notes": "Quantile regression is applied to PPPM or PPPY cost totals to characterise whether a treatment's cost impact is concentrated at the median (typical patient) or the upper tail (high-cost claimant tier). Reporting both a gamma GLM mean ratio and a quantile-grid plot across tau values provides a complete distributional picture for payer audiences."
      },
      {
        "relation_type": "requires",
        "target_slug": "descriptive-statistics",
        "notes": "Descriptive statistics — histograms, quantile-quantile plots, percentile tables — should always precede quantile regression. Knowing the empirical distribution of costs (including the fraction at zero, the 75th and 95th percentile values, and the skewness) informs the choice of which quantiles to model and whether a two-part model is needed to handle a mass at zero before the quantile regression is applied to positive costs."
      }
    ],
    "aliases": [
      "quantile regression",
      "median regression",
      "LAD regression",
      "least absolute deviations",
      "conditional quantile regression",
      "50th-percentile regression"
    ],
    "complexity": "advanced",
    "regulatory_relevance": [
      "hta",
      "fda",
      "ema"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "quantitative-bias-analysis-toolkit-rwe",
    "name": "Quantitative Bias Analysis Toolkit",
    "short_definition": "A pre-specifiable family of deterministic and probabilistic sensitivity analyses that put numbers on how far residual systematic error (unmeasured confounding, exposure/outcome misclassification, selection/missingness) would have to go to overturn a real-world-evidence conclusion, and that correct the point estimate and interval when bias parameters can be anchored.",
    "long_description": "**Quantitative bias analysis (QBA)** replaces the narrative \"limitations\" paragraph with arithmetic: rather than asserting\nthat residual confounding or endpoint misclassification \"may bias results,\" QBA states a bias model, assigns values (or\ndistributions) to its parameters, and reports the estimate you would obtain if that bias were real. In RWE and\npharmacoepidemiology it is the discipline that turns an observed hazard ratio into a *bias-adjusted* hazard ratio, or that\nreports the magnitude of unmeasured confounding required to move a confidence interval across the null. It is best treated\nnot as one procedure but as a coordinated toolkit spanning three bias families — confounding, information (misclassification),\nand selection — plus the empirical probes (negative controls, calibration) and anchors (validation substudies) that make\nthe bias parameters credible.\n\n**Core conceptual distinction** Two axes organize the toolkit and are routinely conflated. (1) *Deterministic vs probabilistic*:\na deterministic (simple) bias analysis plugs in fixed best-guess bias parameters and returns a single corrected estimate (and\noptionally a few scenarios); a probabilistic bias analysis (PBA) assigns prior distributions to the bias parameters, draws\nfrom them by Monte Carlo, and returns a *simulation interval* that propagates bias uncertainty alongside random error. (2)\n*Bounding vs correcting*: bounding methods (the E-value, Ding–VanderWeele) report how strong an unmeasured confounder must be\nto explain away the result without claiming to know it exists, whereas correcting methods (record-level adjustment, the\nFox–Lash misclassification matrix, external/indirect adjustment) shift the point estimate to where it would sit if the\nnamed bias held. 
A bound answers \"how robust is this?\"; a correction answers \"what is the answer after accounting for this?\"\nThe two are complementary, not interchangeable — a small E-value does *not* mean the corrected estimate is null, and a\nreassuring correction under one set of priors does not bound all bias mechanisms. QBA also distinguishes the *target*: the\nsame machinery corrects a risk ratio, an odds ratio, or a rate ratio, but the bias formulas (and the assumption of\nnon-differential vs differential error) differ, and the corrected estimate inherits whatever estimand the primary analysis\ndefined.\n\n**Pros, cons, and trade-offs**\n- **vs a single E-value (e-value-sensitivity-analysis):** The full toolkit covers misclassification and selection — not just\n  unmeasured confounding — and can *correct* the estimate using validation data or simulation rather than only bounding it.\n  Cost: it is far more assumption-laden and harder to communicate than one transparent number; reviewers can dispute every\n  prior. **Prefer the E-value** as the universally reportable confounding sensitivity metric and the entry point; **escalate\n  to PBA** when the decision is high-stakes, when bias is differential, or when validation data exist to anchor priors.\n- **vs empirical probes (negative-control-outcomes-rwe, empirical-calibration-negative-controls-rwe):** Negative controls and\n  empirical calibration *detect and recalibrate* residual systematic error using observed data, so they need no bias-parameter\n  priors. QBA *quantifies and corrects* an assumed, named bias mechanism. Cost: QBA is only as credible as its priors;\n  negative controls cannot tell you which bias is operating, only that something is. 
**Use both** — controls to demonstrate\n  bias is present/absent, QBA to characterize its consequences.\n- **vs deterministic-only sensitivity tables:** PBA integrates over plausible bias-parameter ranges instead of reporting a\n  handful of corner scenarios, which avoids the false precision of a single corrected number and the cherry-picking risk of\n  \"best/worst case\" tables. Cost: PBA hides assumptions inside priors that reviewers may not scrutinize, and a confidently\n  wrong prior produces a confidently wrong interval. **Prefer deterministic** for a transparent first pass and to expose the\n  bias formula; **prefer PBA** when bias-parameter uncertainty is itself material to the decision.\n\n**When to use** Any regulatory- or HTA-grade RWE study where a residual-bias objection is foreseeable and would change the\ndecision: confirmatory comparative safety/effectiveness studies, external-control-arm submissions, and label-expansion or\ncoverage dossiers. Pre-specify the bias mechanisms in the SAP (not the manuscript discussion): name the suspected unmeasured\nconfounder, the outcome-algorithm error model, and the selection/attrition mechanism, and state the priors and their sources\n(validation chart review, published PPV/sensitivity, expert elicitation, negative-control calibration). QBA is most valuable\nprecisely when the primary estimate is decision-relevant but the design cannot fully exclude a specific, articulable bias.\n\n**When NOT to use — and when it is actively misleading or dangerous**\n- **As a credibility laundromat.** Reporting one favorable corrected estimate and calling the result \"robust to bias\" is\n  worse than no QBA: it manufactures false reassurance. 
A QBA that only ever moves the estimate in the convenient direction,\n  or that omits the bias mechanism most threatening to the conclusion, is advocacy, not analysis.\n- **When the priors are invented.** PBA with priors pulled from thin air launders guesswork into a precise-looking simulation\n  interval. If no validation data, literature, or defensible elicitation exist for a bias parameter, say so and bound (E-value)\n  rather than correct — a correction implies knowledge you do not have.\n- **When the bias is structural, not parametric.** QBA corrects bias whose *direction and mechanism* you can model. It cannot\n  rescue a fatally non-comparable design (e.g., immortal-time bias from misaligned time zero, or a comparator prescribed to\n  systematically different patients). Fix the design; do not \"adjust it away\" post hoc.\n- **When applied to the wrong scale or with the wrong differentiality assumption.** Assuming non-differential outcome\n  misclassification when error depends on exposure (e.g., the treated arm is surveilled more intensely) can correct the\n  estimate in the *wrong* direction and make a biased result look unbiased. Non-differential misclassification biases toward\n  the null *on average for a dichotomous exposure/outcome*, but this is not guaranteed for polytomous variables or any single\n  study realization — never use it as a reflexive \"conservative\" defense.\n\n**Data-source operational depth**\n- **Claims (FFS vs Medicare Advantage):** The dominant biases are unmeasured frailty/SES/over-the-counter and cash-pay drug\n  use, outcome-algorithm error (claims-based endpoint PPV is often 0.6–0.9), and exposure capture. MA-only person-time lacks\n  fee-for-service claims, so a person can appear to have \"no event\" or \"no fill\" purely because their utilization is\n  capitated and unobserved — model this as differential missingness by plan type, not as a true negative, and prefer\n  restricting to enrollees with complete A/B/D. 
Competing risks bias QBA in the elderly: if death (a competing event) is\n  differentially captured by exposure arm, both the observed rate and any misclassification correction inherit that\n  distortion — link to the death index before correcting outcome rates. Anchor outcome PPV/sensitivity to a chart-validated\n  subset; absent that, use published validation estimates and widen the priors to reflect transportability uncertainty.\n- **EHR:** Endpoint phenotyping error is the headline bias and is frequently *differential* — sicker or more-monitored\n  patients generate more notes, labs, and codes, inflating sensitivity in one arm. Labs are often missing-not-at-random\n  (a test is ordered because the clinician suspected the outcome), so a complete-case analysis is a selection mechanism that\n  QBA must model, not a benign reduction in n. External-care leakage (events treated outside the network) imposes\n  differential outcome under-capture by patient mobility. Use NLP-validated or chart-review subsets to estimate the\n  sensitivity/specificity matrix; allow it to differ by arm.\n- **Registry:** Strengths are adjudicated outcomes and disease severity; the QBA targets are linkage error (to claims for\n  exposure or to death indices for censoring), registry completeness/enrollment selection, and transportability from the\n  registry-eligible population to the broader treated population. 
Model the linkage match rate as a selection probability and\n  test whether it differs by exposure.\n- **Linked claims–EHR–vital records:** The richest substrate for anchoring priors (validation against EHR/charts, real\n  mortality), but linkage itself selects the linkable subset and creates date-discrepancy (order vs fill vs service)\n  problems that can manufacture immortal time before any QBA is run — reconcile time zero first, then correct residual\n  parametric bias.\n\n**Worked claims example.** A commercial + Medicare FFS active-comparator new-user study reports HR = 0.78 (95% CI 0.66–0.92)\nfor incident MI with Drug A vs Drug B after high-dimensional PS weighting; the primary cohort required 365 days of continuous\nA/B/D enrollment (no MA-only person-time), a drug-free washout (no prior `fill_date` of either NDC class in the lookback),\nindex_date = first qualifying fill, and a claims MI algorithm (inpatient dx in the first position) with chart-validated\nPPV = 0.90 and sensitivity = 0.75. The pre-specified QBA package: (1) *Bound* — the E-value for the point estimate is 1.88\nand for the upper CI limit is 1.39, i.e., an unmeasured confounder associated with both Drug A use and MI by a risk ratio of\n~1.4, beyond the measured covariates, could move the interval to the null. (2) *Confounding correction (PBA)* — assume an\nunmeasured frailty factor with prevalence 0.15 in the Drug A arm and 0.25 in the comparator (frailer patients channelled to\nthe older drug; Beta priors anchored to a validation chart-review subset) and a frailty–MI risk ratio drawn from\nTriangular(1.0, 1.6, 2.5); Monte Carlo over 10,000 draws yields a median bias-adjusted HR of ~0.83 with a 95% simulation\ninterval that still excludes 1.0, so confounding of the plausible magnitude does not overturn the finding. (3) *Outcome\nmisclassification* — under non-differential error (sensitivity 0.75, specificity 0.99 from validation), the corrected RR\nmoves only slightly, and away from the null (~0.76 if the 0.90 PPV is taken as the comparator-arm value); but if the\ncomparator arm is the more intensively monitored one (sensitivity 0.85 vs 0.75 for Drug A; differential), the corrected RR\nattenuates toward ~0.88, the key vulnerability to flag. (4) *Selection* — if disenrolling treated patients carry 1.8× the MI\nrisk and are lost 10% more often (an MA-switch mechanism), an inverse-probability-of-censoring reweight shifts the estimate\nmodestly upward. (5) *Tipping point* — the conclusion (a protective effect) survives unless differential outcome sensitivity\nexceeds a ~15-point arm gap *and* an unmeasured RR_UY > 2.0 act jointly. The QBA is reported as the joint envelope of these\nanalyses, pre-specified in the SAP, with every prior traced to validation data, published estimates, or documented\nelicitation — not as a single convenient corrected number.\n\n**Interpreting the output**\n\nThe worked example yields a *joint bias envelope*: E-value 1.88 (CI bound 1.39); PBA median bias-adjusted\nHR ≈ 0.83 with a 95% simulation interval still excluding 1.0; non-differential misclassification correction\nHR ≈ 0.76; differential misclassification scenario HR ≈ 0.88.\n\n*(1) Formal interpretation.* Each component answers a different question. The E-value bounds the minimum\nconfounder strength needed to explain the result — it asserts nothing about actual confounder prevalence.\nThe PBA simulation interval is *not* a confidence interval: it is an uncertainty band conditional on the\nanalyst's chosen prior distributions for bias parameters (frailty prevalence Beta priors, RR Triangular\ndraws); changing those priors changes the interval. The deterministic misclassification corrections are\neach conditional on the fixed sensitivity and specificity values supplied from chart-review validation.\nCollectively the envelope characterizes corrected estimates across plausible bias structures, not a single\nauthoritative adjusted value.\n\n*(2) Practical interpretation.* The protective HR 0.78 survives all individually plausible biases in this\nexample: the PBA median remains well below 1.0 and the tipping point requires both a 15-percentage-point\nsensitivity gap *and* an unmeasured confounder RR > 2.0 acting simultaneously. The operationally realistic\nvulnerability is differential outcome ascertainment, which pushes the estimate toward ≈ 0.88 — narrowing\nbut not erasing the apparent benefit. Report the full envelope pre-specified in the SAP; citing only the\nmost favorable corrected number defeats the purpose of quantitative bias analysis.",
    "primary_category": "Bias_Control",
    "tags": [
      "quantitative-bias-analysis",
      "qba",
      "probabilistic-bias-analysis",
      "sensitivity-analysis",
      "negative-controls",
      "misclassification",
      "selection-bias",
      "unmeasured-confounding",
      "empirical-calibration"
    ],
    "applies_to_study_types": [
      "active_comparator_new_user",
      "cohort_retrospective",
      "cohort_prospective",
      "claims_analysis",
      "ehr_study",
      "target_trial_emulation"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "linked",
      "registry",
      "multi-database"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1093/ije/25.6.1107",
        "url": "https://doi.org/10.1093/ije/25.6.1107",
        "citation_text": "Greenland S. Basic methods for sensitivity analysis of biases. International Journal of Epidemiology. 1996;25(6):1107-1116.",
        "year": 1996,
        "authors_short": "Greenland",
        "notes": "Foundational paper formalizing deterministic sensitivity analysis for unmeasured confounding, misclassification, and selection bias, and motivating quantitative correction over qualitative limitation statements."
      },
      {
        "role": "explain",
        "doi": "10.1007/b97920",
        "url": "https://doi.org/10.1007/b97920",
        "citation_text": "Lash TL, Fox MP, Fink AK. Applying Quantitative Bias Analysis to Epidemiologic Data. New York: Springer; 2009.",
        "year": 2009,
        "authors_short": "Lash et al.",
        "notes": "Canonical book-length treatment unifying deterministic and probabilistic bias analysis across the confounding, misclassification, and selection families with worked record-level and summary-level methods."
      },
      {
        "role": "demonstrate",
        "doi": "10.1093/ije/dyi184",
        "url": "https://doi.org/10.1093/ije/dyi184",
        "citation_text": "Fox MP, Lash TL, Greenland S. A method to automate probabilistic sensitivity analyses of misclassified binary variables. International Journal of Epidemiology. 2005;34(6):1370-1376.",
        "year": 2005,
        "authors_short": "Fox et al.",
        "notes": "Demonstrates Monte Carlo probabilistic bias analysis by sampling sensitivity/specificity (and other bias parameters) from prior distributions to produce a simulation interval."
      },
      {
        "role": "use",
        "doi": "10.1016/j.jclinepi.2024.111507",
        "url": "https://doi.org/10.1016/j.jclinepi.2024.111507",
        "citation_text": "Shi X, Liu Z, Hua M, et al. Quantitative bias analysis methods for summary-level epidemiologic data in the peer-reviewed literature: a systematic review. Journal of Clinical Epidemiology. 2024;172:111507.",
        "year": 2024,
        "authors_short": "Shi et al.",
        "notes": "Systematic review mapping the breadth, method families, and reporting practices of applied QBA, documenting under-use and the gap between qualitative limitation statements and quantitative correction."
      }
    ],
    "plain_language_summary": "Quantitative bias analysis (QBA) replaces the vague disclaimer that 'unmeasured confounding may affect results' with actual arithmetic: you state what you think the bias is, plug in numbers for how strong it would have to be, and report what the effect estimate would look like if that bias were real. The toolkit has three main moves — bounding (the E-value asks how strong an unmeasured factor would have to be to explain your finding away), correcting (you adjust the observed estimate using known error rates for your outcome measure), and probabilistic simulation (you let the bias parameters vary across a plausible range and see the spread of corrected results). Done honestly, QBA tells a regulator or payer how much hidden noise would need to exist before the conclusion flips.",
    "key_terms": [
      {
        "term": "quantitative bias analysis",
        "definition": "A family of mathematical methods that put explicit numbers on how much a named source of error — like an unmeasured risk factor or a mislabeled outcome — would have to shift a study result to change the conclusion."
      },
      {
        "term": "bias parameters",
        "definition": "The specific numbers that describe how a bias operates — for example, how common an unmeasured risk factor is in each treatment group, or how often an outcome code in claims data correctly identifies the true event."
      },
      {
        "term": "E-value",
        "definition": "A single number that summarizes how strong an unmeasured confounder would have to be — affecting both who gets the drug and who gets the outcome — to fully explain away an observed risk ratio."
      },
      {
        "term": "probabilistic bias analysis",
        "definition": "A version of bias analysis where the bias parameters are treated as uncertain ranges rather than fixed values; the computer draws thousands of plausible combinations and returns a spread of corrected estimates instead of one number."
      },
      {
        "term": "simulation interval",
        "definition": "The range of corrected effect estimates produced by probabilistic bias analysis, capturing both ordinary statistical uncertainty and the added uncertainty from not knowing the exact bias parameters."
      },
      {
        "term": "tipping point",
        "definition": "The minimum size of a bias — or the joint combination of biases — that would be needed to flip the study conclusion from statistically meaningful to not."
      }
    ],
    "worked_example": {
      "scenario": "A new drug (Drug A) is compared against an older drug (Drug B) for heart attack risk in a claims database. After adjusting for all measured risk factors, the observed hazard ratio is 0.78 (95% CI 0.66 to 0.92), suggesting Drug A is protective. The study team runs a three-part QBA toolkit to test how robust that finding is: first they apply the E-value bound, then a simple deterministic correction for outcome misclassification, and finally a quick probabilistic check for unmeasured confounding.",
      "dataset": {
        "caption": "Summary numbers entering the QBA — the observed estimate and the outcome-algorithm accuracy from chart validation",
        "columns": [
          "measure",
          "value",
          "source"
        ],
        "rows": [
          [
            "Observed HR (Drug A vs B)",
            "0.78",
            "Primary PS-weighted analysis"
          ],
          [
            "95% CI lower",
            "0.66",
            "Primary PS-weighted analysis"
          ],
          [
            "95% CI upper",
            "0.92",
            "Primary PS-weighted analysis"
          ],
          [
            "Outcome algorithm PPV",
            "0.90",
            "Chart-validation substudy"
          ],
          [
            "Outcome algorithm sensitivity",
            "0.75",
            "Chart-validation substudy"
          ],
          [
            "Outcome algorithm specificity",
            "0.99",
            "Chart-validation substudy"
          ]
        ]
      },
      "methods_table": {
        "caption": "The three QBA tools applied and what each one answers",
        "columns": [
          "QBA method",
          "What it answers",
          "Output"
        ],
        "rows": [
          [
            "E-value (bounding)",
            "How strong would an unmeasured confounder have to be to explain the entire finding away?",
            "E-value for the point estimate; E-value for the CI limit closest to null"
          ],
          [
            "Deterministic misclassification correction",
            "If outcome codes miss 25% of true events (sensitivity 0.75) and mislabel 1% of non-events (specificity 0.99), what is the corrected HR?",
            "Single adjusted estimate under fixed error-rate assumptions"
          ],
          [
            "Probabilistic bias analysis for confounding",
            "If an unmeasured frailty factor is more common among Drug A users and raises MI risk, does the finding survive across a plausible range of that factor's strength?",
            "Median corrected HR and 95% simulation interval"
          ]
        ]
      },
      "steps": [
        "Step 1 — E-value: The formula for the E-value of a hazard ratio uses RR* = HR + sqrt(HR × (HR - 1)). For HR = 0.78 (a protective direction, so we first flip to 1/0.78 = 1.28): E-value = 1.28 + sqrt(1.28 × 0.28) ≈ 1.28 + 0.60 ≈ 1.88 (rounded to 1.87 in the source file). For the CI limit nearest the null, HR = 0.92 -> 1/0.92 = 1.09: E-value = 1.09 + sqrt(1.09 × 0.09) ≈ 1.09 + 0.31 ≈ 1.40 (file: 1.39). This means an unmeasured factor would have to be associated with both drug choice and MI by a risk ratio of at least 1.4 — beyond all measured covariates — to move the confidence interval to the null.",
        "Step 2 — Deterministic misclassification correction: Using the standard non-differential correction formula with sensitivity = 0.75 and specificity = 0.99, the corrected risk ratio is very close to the observed 0.78 (approximately 0.79). The high specificity (0.99) means very few false positives contaminate the outcome count, so non-differential misclassification barely changes the estimate. The key vulnerability is differential error: if the Drug A arm is monitored more closely, its sensitivity might be 0.85 versus 0.70 in the Drug B arm. Plugging in those arm-specific values shifts the corrected HR upward to approximately 0.88 — still protective but weaker.",
        "Step 3 — Probabilistic bias analysis for confounding: Assume an unmeasured frailty factor is present in about 25% of Drug A users but only 15% of Drug B users (anchored to a chart-review substudy), and that this factor raises MI risk somewhere between 1.0 and 2.5-fold (most likely around 1.6). Running 10,000 Monte Carlo draws through the bias correction formula yields a median corrected HR of approximately 0.86. The 95% simulation interval still excludes 1.0, meaning the protective signal survives across the full plausible range of this confounder.",
        "Step 4 — Tipping point: The conclusion flips only if differential outcome sensitivity exceeds a 15-percentage-point gap between arms AND an unmeasured confounder with risk ratio above 2.0 acts at the same time. Neither condition alone is sufficient to overturn the finding."
      ],
      "result": "E-value for point estimate = 1.87 (CI limit E-value = 1.39): a confounder weaker than RR 1.4 cannot explain the finding away. Non-differential misclassification correction leaves HR essentially at 0.79; differential monitoring shifts it to ~0.88, the main vulnerability to flag. Probabilistic confounding analysis: median corrected HR ~0.86, simulation interval still excludes 1.0. Tipping point: finding flips only under joint worst-case bias (differential sensitivity gap >15 points AND confounder RR >2.0). The protective effect is robust to plausible individual biases but would require careful monitoring if both act simultaneously."
    },
    "prerequisites": [
      "e-value-sensitivity-analysis",
      "propensity-score-methods-psm-iptw",
      "negative-control-outcomes-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Deterministic (simple) bias analysis",
        "description": "Plug fixed best-guess bias parameters into a closed-form bias model and report a single corrected estimate, optionally across a small grid of plausible scenarios; transparent and reproducible by hand.",
        "edge_cases": [
          "A single corrected estimate implies false precision; report a scenario grid and the bias formula used.",
          "Corner-case \"best/worst\" tables invite cherry-picking; pre-specify the scenarios."
        ],
        "data_source_notes": "claims/EHR: anchor the fixed parameter to a chart-validated subset where one exists; otherwise document the published source and its transportability to your population."
      },
      {
        "name": "Probabilistic bias analysis (Monte Carlo)",
        "description": "Assign prior distributions to bias parameters (e.g., Beta for prevalences/PPV, Triangular/Uniform for bias risk ratios) and propagate by simulation to a bias-adjusted point estimate and simulation interval that combines bias and random error.",
        "edge_cases": [
          "Priors are assumptions, not data, unless anchored to validation substudies, literature, or documented elicitation.",
          "Total-error simulation should add random error to each draw (e.g., resampling the corrected log-estimate) so the interval is not artificially narrow."
        ],
        "data_source_notes": "Use validation chart review, published PPV/sensitivity, registry-linkage match rates, or negative-control calibration to inform plausible ranges; widen priors for transportability uncertainty."
      },
      {
        "name": "Bounding analysis (E-value / Ding-VanderWeele)",
        "description": "Report how strong an unmeasured confounder would have to be (on the risk-ratio scale, for both the exposure-confounder and confounder-outcome associations) to fully explain the point estimate and the CI limit nearest the null, without correcting the estimate.",
        "edge_cases": [
          "A small E-value does not produce a null corrected estimate; bounding and correction answer different questions.",
          "The E-value addresses unmeasured confounding only, not misclassification or selection."
        ],
        "data_source_notes": "Communicable across data sources because it requires no source-specific priors; pair with measured covariate associations to judge whether the bound is plausibly attainable."
      },
      {
        "name": "External / indirect adjustment from a validation substudy",
        "description": "Use an internal or external validation source (chart review, registry linkage, EHR enrichment, published validation study) to estimate bias parameters empirically and correct the main-study estimate.",
        "edge_cases": [
          "External validation parameters may not transport to the main-study population (different case mix, era, coding).",
          "Internal validation is preferred but costs chart review and may itself be selected."
        ],
        "data_source_notes": "registry/linked: the linkable subset is the natural validation source for claims-based exposure or outcome algorithms; test whether linkage probability differs by exposure before transporting parameters."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "e-value-sensitivity-analysis",
        "pros_of_this": "Covers misclassification and selection in addition to confounding and can correct (not only bound) the estimate when bias parameters are anchored to validation data or priors.",
        "cons_of_this": "Far more assumption-heavy and harder to communicate than a single transparent E-value; every prior is contestable.",
        "when_to_prefer": "High-stakes or confirmatory RWE, differential bias, or settings with validation data to anchor priors; use the E-value as the universal entry point and one component."
      },
      {
        "compared_to": "negative-control-outcomes-rwe",
        "pros_of_this": "Quantifies and corrects an assumed, named bias mechanism rather than only signaling that systematic error is present.",
        "cons_of_this": "Only as credible as its bias-parameter priors; cannot, by itself, reveal which bias is operating.",
        "when_to_prefer": "After negative controls show residual bias, to characterize its magnitude and direction on the estimate."
      },
      {
        "compared_to": "empirical-calibration-negative-controls-rwe",
        "pros_of_this": "Targets a specific, articulable bias mechanism and uses external/validation knowledge, not just the empirical null distribution of controls.",
        "cons_of_this": "Requires defensible priors; calibration recalibrates intervals from observed controls with fewer modeling assumptions.",
        "when_to_prefer": "When a particular confounder, misclassification model, or selection mechanism is the named threat and can be parameterized."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Pre-specify each bias mechanism: unmeasured frailty/SES/cash-pay exposure, claims outcome-algorithm PPV/sensitivity, and disenrollment/MA-switch selection. Anchor priors to a chart-validated subset; restrict to complete A/B/D person-time so MA-only gaps are not treated as true negatives; link to a death index before correcting outcome rates because competing mortality can be differentially captured by arm.",
      "ehr": "Endpoint phenotyping error is often differential (more-monitored patients generate more codes); estimate an arm-specific sensitivity/specificity matrix from a chart-review or NLP-validated subset. Treat missing-not-at-random labs and complete-case restrictions as a selection mechanism to model, and account for external-care leakage as differential outcome under-capture.",
      "registry": "Target linkage error, registry completeness/enrollment selection, and transportability to the broader treated population. Model the claims/death-index linkage match rate as a selection probability and test whether it differs by exposure before transporting validation parameters.",
      "linked": "The richest substrate for anchoring priors (validation against EHR/charts, real mortality), but linkage selects the linkable subset and creates order/fill/service date discrepancies that can manufacture immortal time; reconcile time zero first, then correct residual parametric bias."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\n\ndef pba_unmeasured_confounder(observed_rr: float,\n                              log_se: float,\n                              n_iter: int = 100_000,\n                              seed: int = 1) -> dict:\n    rng = np.random.default_rng(seed)\n\n    # Bias-parameter priors (ANCHOR these to validation data, not guesses):\n    #   p_u1, p_u0 : prevalence of the unmeasured confounder in treated / comparator arms\n    #   rr_uy      : confounder -> outcome risk ratio\n    p_u1  = rng.beta(25, 75, n_iter)                 # ~0.25 prevalence in treated\n    p_u0  = rng.beta(15, 85, n_iter)                 # ~0.15 prevalence in comparator\n    rr_uy = rng.triangular(1.0, 1.6, 2.5, n_iter)    # confounder-outcome RR\n\n    # Deterministic confounding bias factor (Bross / Schlesselman form).\n    bias_factor = ((p_u1 * (rr_uy - 1) + 1) /\n                   (p_u0 * (rr_uy - 1) + 1))\n    corrected_rr = observed_rr / bias_factor\n\n    # Add random error to each draw so the interval reflects bias + sampling uncertainty.\n    log_corrected = np.log(corrected_rr) + rng.normal(0.0, log_se, n_iter)\n    corrected_total = np.exp(log_corrected)\n\n    return {\n        \"median\": float(np.percentile(corrected_total, 50)),\n        \"ci_95\": (float(np.percentile(corrected_total, 2.5)),\n                  float(np.percentile(corrected_total, 97.5))),\n        \"p_crosses_null\": float(np.mean(corrected_total > 1.0)),\n    }",
        "description": "Probabilistic bias analysis for unmeasured confounding (Monte Carlo over bias-parameter priors). This is a summary-level\ncorrection applied AFTER the primary effect estimate is produced. Required inputs (scalars/arrays from the main analysis):\n  observed_rr      : adjusted risk/rate ratio from the primary model (e.g., PS-weighted HR/RR)\n  log_se           : standard error of log(observed_rr) from the primary model (for total-error propagation)\n  validation       : optional dict with Beta hyperparameters for confounder prevalence by arm, anchored to chart review\nReturns the bias-adjusted point estimate and a 95% simulation interval combining bias and random error. Replace the\nillustrative priors below with validation-substudy, literature, expert-elicited, or negative-control-informed values.",
        "dependencies": [
          "numpy"
        ],
        "source_citations": [
          "fox-2005"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "pba_misclass_rr <- function(a, b, c, d,\n                            se_alpha, se_beta,   # sensitivity Beta prior\n                            sp_alpha, sp_beta,   # specificity Beta prior\n                            n = 100000L, seed = 1L) {\n  set.seed(seed)\n  se <- rbeta(n, se_alpha, se_beta)              # outcome sensitivity\n  sp <- rbeta(n, sp_alpha, sp_beta)              # outcome specificity\n\n  n1 <- a + b; n0 <- c + d                       # exposed / unexposed totals\n  # Back-correct observed cases to expected true cases: A_true = (A_obs - (1-Sp)*N) / (Se - (1-Sp))\n  a_corr <- (a - (1 - sp) * n1) / (se - (1 - sp))\n  c_corr <- (c - (1 - sp) * n0) / (se - (1 - sp))\n\n  valid <- a_corr > 0 & c_corr > 0 & a_corr < n1 & c_corr < n0\n  rr <- (a_corr[valid] / n1) / (c_corr[valid] / n0)\n\n  # Total error: add sampling variability of log(RR) from the corrected cell counts.\n  log_se <- sqrt(1/a_corr[valid] - 1/n1 + 1/c_corr[valid] - 1/n0)\n  rr_total <- exp(log(rr) + rnorm(sum(valid), 0, log_se))\n\n  list(median = median(rr_total),\n       ci_95  = quantile(rr_total, c(0.025, 0.975)),\n       prop_kept = mean(valid))\n}",
        "description": "Probabilistic bias analysis for non-differential outcome misclassification (matrix correction with simulated priors).\nRequired inputs from the primary 2x2 (exposed/unexposed by observed outcome counts), plus validation-anchored priors:\n  a,b,c,d : observed counts (exposed-case, exposed-noncase, unexposed-case, unexposed-noncase)\n  se_*    : Beta hyperparameters for outcome SENSITIVITY (from chart-validated subset)\n  sp_*    : Beta hyperparameters for outcome SPECIFICITY\nReturns the bias-adjusted risk ratio and a 95% simulation interval. For DIFFERENTIAL error, draw arm-specific se/sp and\ncorrect each arm with its own values. Mirrors the episensr / Fox-Lash record-level matrix correction.",
        "dependencies": [
          "stats"
        ],
        "source_citations": [
          "fox-2005"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "%let observed_rr = 0.78;\n%let log_se      = 0.082;   /* SE of log(observed_rr) from the primary model */\n\ndata pba;\n  call streaminit(1);\n  do iter = 1 to 100000;\n    /* Bias-parameter priors -- ANCHOR to validation data before use. */\n    p_u1  = rand(\"BETA\", 25, 75);                 /* confounder prevalence, treated   */\n    p_u0  = rand(\"BETA\", 15, 85);                 /* confounder prevalence, comparator */\n    rr_uy = 1.0 + 1.5 * rand(\"TRIANGLE\", 0.4);    /* confounder-outcome RR ~Triangular(1.0,1.6,2.5) */\n\n    bias_factor  = (p_u1*(rr_uy - 1) + 1) / (p_u0*(rr_uy - 1) + 1);\n    corrected_rr = &observed_rr / bias_factor;\n\n    /* Total error: add sampling variability on the log scale. */\n    log_total    = log(corrected_rr) + rand(\"NORMAL\", 0, &log_se);\n    rr_total     = exp(log_total);\n    crosses_null = (rr_total > 1);\n    output;\n  end;\n  keep rr_total crosses_null;\nrun;\n\nproc univariate data=pba noprint;\n  var rr_total;\n  output out=qba_summary\n         pctlpts = 2.5 50 97.5\n         pctlpre = p;          /* p2_5 p50 p97_5 = bias-adjusted simulation interval */\n  run;\n\nproc means data=pba mean;      /* mean of crosses_null = P(adjusted RR > 1) */\n  var crosses_null;\nrun;",
        "description": "Probabilistic bias analysis for unmeasured confounding in SAS via DATA-step Monte Carlo, summarized with PROC UNIVARIATE.\nInputs are the primary-model summary statistics (no record-level data needed for this summary-level correction):\n  observed_rr : adjusted RR/HR from the primary model\n  log_se      : SE of log(observed_rr) from the primary model\nPriors below (RAND 'BETA'/'TRIANGLE') are illustrative; replace with validation-anchored hyperparameters. PROC UNIVARIATE\nreturns the bias-adjusted median and 2.5/97.5 percentiles (the simulation interval).",
        "dependencies": [],
        "source_citations": [
          "fox-2005"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Obs[Primary RWE estimate<br/>e.g. PS-weighted HR 0.78] --> Q{Which residual bias<br/>threatens the conclusion?}\n  Q -->|Unmeasured confounding| C1[Bound: E-value]\n  Q -->|Unmeasured confounding| C2[Correct: confounding PBA]\n  Q -->|Outcome/exposure error| M[Misclassification correction<br/>Se/Sp matrix, differential?]\n  Q -->|Selection/attrition/MNAR| S[Selection-bias / IPCW correction]\n  Probe[Negative controls +<br/>empirical calibration] -->|detect bias present?| Q\n  Val[Validation substudy / linkage<br/>chart review, literature] -->|anchor priors| C2\n  Val --> M\n  Val --> S\n  C1 --> Env[Joint bias envelope]\n  C2 --> Env\n  M --> Env\n  S --> Env\n  Env --> Tip[Tipping point:<br/>how large must bias be<br/>to overturn the decision?]\nstyle Obs fill:#e0f2fe\nstyle Tip fill:#fef9c3",
        "caption": "Decision logic for assembling a QBA package. Empirical probes detect whether bias is present; validation data anchor the priors; the named bias mechanisms are bounded and/or corrected and combined into a joint envelope, then summarized as a tipping point against the decision.",
        "alt_text": "Flowchart starting from the primary estimate, branching by bias type into bounding, confounding PBA, misclassification correction, and selection correction, informed by negative controls and validation data, combined into a joint envelope and a tipping-point summary.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  Det[Deterministic<br/>fixed parameters] --> Single[Single corrected estimate<br/>+ scenario grid]\n  Prob[Probabilistic<br/>prior distributions] --> Interval[Bias-adjusted estimate<br/>+ simulation interval]\n  Bound[Bounding<br/>E-value] --> HowStrong[How strong must<br/>confounding be?]\n  Correct[Correcting<br/>matrix / indirect adj.] --> WhereSit[Where would the<br/>estimate sit?]\n  Single -.complementary.- Interval\n  HowStrong -.different question.- WhereSit\nstyle Bound fill:#fde68a\nstyle Correct fill:#bbf7d0",
        "caption": "The two organizing axes of QBA. Deterministic vs probabilistic governs how bias uncertainty is propagated; bounding vs correcting governs the question answered. A small bound does not imply a null corrected estimate.",
        "alt_text": "Diagram contrasting deterministic versus probabilistic propagation and bounding versus correcting questions, noting they answer complementary but distinct questions.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "see_also",
        "target_slug": "e-value-sensitivity-analysis",
        "notes": "The E-value is the communicable unmeasured-confounding bounding metric and the usual entry point; it is one component of, not a substitute for, the full toolkit."
      },
      {
        "relation_type": "produces",
        "target_slug": "negative-control-outcomes-rwe",
        "notes": "Negative control outcomes empirically probe whether residual confounding/surveillance bias is present, motivating and informing QBA priors."
      },
      {
        "relation_type": "produces",
        "target_slug": "negative-control-exposures-rwe",
        "notes": "Negative control exposures detect outcome-system artifacts and residual bias that QBA can then characterize."
      },
      {
        "relation_type": "produces",
        "target_slug": "empirical-calibration-negative-controls-rwe",
        "notes": "Empirical calibration uses many controls to recalibrate p-values/CIs for systematic error with minimal prior assumptions, complementing parametric QBA correction."
      },
      {
        "relation_type": "produces",
        "target_slug": "misclassification-bias-correction-rwe",
        "notes": "Misclassification correction is the information-bias arm of the toolkit, using sensitivity/specificity (or PPV/NPV) to correct exposure or outcome error."
      },
      {
        "relation_type": "produces",
        "target_slug": "unmeasured-confounding-probabilistic-bias-analysis-rwe",
        "notes": "Probabilistic unmeasured-confounding analysis samples bias-parameter distributions to produce a simulation interval."
      },
      {
        "relation_type": "produces",
        "target_slug": "selection-bias-sensitivity-analysis-rwe",
        "notes": "Selection-bias sensitivity addresses attrition, database entry, linkage, MNAR, and complete-case restrictions."
      },
      {
        "relation_type": "produces",
        "target_slug": "tipping-point-analysis-rwe",
        "notes": "Tipping-point analysis summarizes how large a bias or missingness assumption must be to overturn the conclusion."
      },
      {
        "relation_type": "produces",
        "target_slug": "external-adjustment-validation-substudy-bias-correction-rwe",
        "notes": "Validation-substudy and external/indirect adjustment empirically anchor the bias parameters that the toolkit corrects."
      },
      {
        "relation_type": "used_with",
        "target_slug": "high-dimensional-propensity-score-hdps-rwe",
        "notes": "hdPS reduces measured-proxy confounding in the primary analysis; QBA addresses the residual unmeasured confounding it cannot capture."
      }
    ],
    "aliases": [
      "QBA",
      "bias analysis toolkit",
      "probabilistic sensitivity analysis",
      "quantitative sensitivity analysis",
      "probabilistic bias analysis"
    ],
    "complexity": "advanced",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "rare-disease-external-controls-rwe",
    "name": "Rare Disease External Controls",
    "short_definition": "A study design that supplies the comparator arm for a single-arm rare-disease trial from external real-world data (registry, natural-history, EHR, claims, or linked sources), constructing a trial-eligible counterfactual by mirroring eligibility, anchoring time zero on the treatment-decision analog, aligning outcome ascertainment, and adjusting for residual confounding under transportability assumptions.",
    "long_description": "A **rare-disease external control** replaces the randomized comparator with patients drawn from\nexternal real-world data (RWD) — a disease registry, a prospective natural-history study, an EHR\nor claims database, or a linked combination — when randomization is infeasible because the\ndisease is too rare, the natural history is uniformly fatal, or equipoise cannot be sustained.\nThe single-arm trial supplies the treated arm; the external data supply the control arm. The\nscientific work is not \"find some untreated patients\" but **reconstructing the counterfactual a\nrandomized control would have provided**: the same eligible population, the same clock, the same\noutcome definition, measured the same way, with residual differences removed by adjustment that\nis only valid under explicit transportability and exchangeability assumptions.\n\n**Core estimand distinction.** Randomization makes the treated and control arms exchangeable by\ndesign, so the trial estimand is a within-protocol contrast that needs no transportability\nargument. An external control estimand is a contrast between the trial's treated arm and a\n*counterfactual control reconstructed from a different data-generating process*. 
It is identified\nonly under four bundled assumptions that randomization would otherwise discharge for free: (1)\n**eligibility mirroring** — the external cohort is restricted to the subset that would have met\nthe trial's inclusion/exclusion at an analogous index, so the comparison is conditional on\ntrial-eligible covariates; (2) **time-zero alignment** — the external index date is the analog of\nrandomization (the treatment-decision point), not \"first eligible visit,\" or immortal time and\nlead-time bias contaminate the survival contrast; (3) **outcome-ascertainment alignment** — the\noutcome is defined and measured comparably across the two sources (the central problem when the\ntrial uses RECIST-adjudicated progression and the RWD has only claims- or note-coded surrogates);\nand (4) **conditional exchangeability / positivity within the trial-eligible stratum** plus\n**transportability** of that adjusted contrast to the trial population. The target estimand is\ntypically a marginal hazard ratio or restricted-mean-survival difference for overall survival,\nestimated after propensity-score weighting/matching on baseline prognostic factors — distinct\nfrom a naive treated-vs-untreated comparison, which is not a defensible estimand here.\n\n**Pros, cons, and trade-offs**\n- **vs a concurrent randomized control arm:** the external control exists only because the RCT\n  cannot be run (rarity, ethics, fatal natural history) — it is a second-best that buys\n  feasibility at the cost of every protection randomization provides. 
**Prefer the RCT whenever\n  it is feasible;** reserve external controls for the cases where it genuinely is not, and treat\n  the result as hypothesis-strengthening evidence subject to heavy sensitivity analysis.\n- **vs a single-arm trial benchmarked against a fixed historical \"objective response rate\" or\n  literature threshold:** a patient-level external control allows eligibility mirroring,\n  covariate adjustment, time-zero alignment, and time-to-event endpoints, none of which a fixed\n  summary threshold supports. **Prefer the patient-level external control** when individual RWD\n  are obtainable; fall back to a literature benchmark only when no patient-level source exists.\n- **vs a Bayesian dynamic-borrowing / hybrid design (power prior, commensurate prior, MAP\n  prior):** dynamic borrowing down-weights the external information when it conflicts with the\n  concurrent data, hedging against an unrepresentative external cohort. Cost: it requires a\n  (usually small) concurrent control and adds prior-specification and tuning complexity. **Prefer\n  dynamic borrowing** when even a few concurrent controls are available; **prefer a fully external\n  control** only when no concurrent control is possible.\n- **vs an active-comparator new-user design in routine RWD** (see `active-comparator-new-user`):\n  ACNU compares two *treated* arms within one data source and so controls confounding by\n  indication by construction; the external control compares across data sources against an\n  *untreated/standard-of-care* arm, re-importing confounding by indication and a cross-source\n  transportability burden that ACNU never incurs. 
**Prefer ACNU** whenever both arms can be drawn\n  from the same RWD; the external control is for the single-arm-trial situation where they cannot.\n\n**When to use** A rare or ultra-rare disease where a randomized comparator is infeasible or\nunethical; a single-arm registrational or natural-history-anchored study; a well-characterized\ndisease whose untreated course is predictable enough that a credible counterfactual can be built;\nand a setting where a patient-level external source exists with the prognostic covariates,\nindex-anchoring information, and outcome ascertainment needed to mirror the trial. Calendar-time\noverlap with the trial's accrual window and a pre-specified, blinded-to-outcome eligibility and\nanalysis plan are prerequisites for regulatory credibility (FDA externally-controlled-trial and\nregistry guidances; EMA registry/RWD framework).\n\n**When NOT to use — and when it is actively misleading or dangerous**\n- **A randomized trial is feasible.** Substituting an external control to save time or cost when\n  an RCT could be run trades away the only unconfounded comparison available — it is the wrong\n  choice, not a pragmatic one.\n- **The standard of care or diagnostics shifted between the external accrual period and the\n  trial.** Secular trends in supportive care, imaging sensitivity (stage migration / Will Rogers),\n  or biomarker testing make the historical control prognostically worse for reasons unrelated to\n  the drug, manufacturing an apparent benefit. Restrict to overlapping calendar time and inspect\n  for stage migration; if the eras cannot be overlapped, do not proceed.\n- **The trial outcome cannot be reconstructed comparably in the RWD.** If the trial endpoint is\n  RECIST-adjudicated progression-free survival and the RWD has only claims-coded \"progression\"\n  proxies or irregular, treatment-driven imaging, the outcome-ascertainment-alignment assumption\n  fails and the PFS contrast is uninterpretable. 
Overall survival from a reliable death index is\n  usually the only defensible endpoint; an unverifiable surrogate is dangerous.\n- **Positivity is violated within the eligible stratum.** If the trial enrolls a phenotype (e.g.,\n  a specific mutation or performance status) that is sparse or unmeasured in the external source,\n  no comparable controls exist; weighting extrapolates rather than adjusts, and the estimand no\n  longer maps to the trial population.\n- **Key prognostic confounders are unmeasured in the external source.** External controls cannot\n  fix confounding by indication or unmeasured severity; if the prognostic drivers (disease\n  stage, biomarker, prior lines, organ function) are absent from the RWD, the adjusted contrast\n  is biased by an unknown amount and a quantitative-bias / E-value analysis is mandatory before\n  any claim.\n\n**Data-source operational depth**\n- **Disease / natural-history registry:** the strongest substrate for indication confirmation,\n  disease severity/stage, biomarker status, and adjudicated outcomes, and often the only source\n  that captures an ultra-rare phenotype at all. Failure modes: enrollment is voluntary and often\n  biased toward academic centers and survivors (prevalent-case enrollment introduces immortal\n  time and survivor bias — restrict to incident cases at an analogous index); pharmacy/treatment\n  exposure and dates are frequently incomplete; site participation changes the case mix over\n  calendar time. Workaround: link the registry to claims for complete treatment history and to a\n  death index (NDI/SSA) for censoring, and re-anchor time zero on the incident treatment-decision.\n- **Claims (FFS vs Medicare Advantage):** complete on dispensing, procedures, and enrollment\n  spans, enabling prior-line reconstruction and continuous-enrollment washouts. 
Failure modes:\n  **MA-only person-time lacks adjudicated FFS claims**, so prior lines of therapy and the washout\n  cannot be ascertained — restrict to enrollees with FFS Parts A/B/D over the lookback and exclude\n  MA-only spans, or you will misclassify treatment-naive status. **Differential competing risks\n  by exposure in elderly claims** (the standard-of-care external arm is older/sicker and dies of\n  competing causes before the trial-defined event) bias a cause-specific contrast — model the\n  cumulative incidence with a competing-risks estimand, not a naive Kaplan-Meier. **Immortal time\n  in procedure/line-of-therapy studies** arises when the external index is set at a downstream\n  event reachable only by survivors — anchor on the first qualifying line, the treatment-decision\n  analog of randomization.\n- **EHR:** rich for staging, labs, performance status, biomarkers, and clinician notes (the\n  covariates registries and claims lack), and the usual substrate for abstracted oncology\n  external controls. Failure modes: visit-driven, treatment-influenced ascertainment means\n  \"progression\" is recorded when imaging happens, not when it occurs; patients who leave the\n  system are differentially lost. 
Workaround: define observation windows explicitly, abstract\n  outcomes against a protocol, and prefer overall survival linked to a death index over\n  EHR-internal progression.\n- **Linked claims–EHR–registry–vital-records:** the ideal substrate (EHR severity + claims\n  completeness + reliable mortality), but **linkage selection** is acute in rare disease — the\n  linkable subset is healthier, wealthier, and more urban than the eligible population, breaking\n  transportability — and order/fill/service/abstraction date discrepancies must be reconciled\n  before time-zero assignment.\n\n**Worked example (rare oncology, linked claims–EHR external control).** Question: overall\nsurvival with a new agent in a single-arm trial for an ultra-rare sarcoma subtype vs\nstandard-of-care reconstructed from a linked EHR–claims database. (1) **Eligibility mirroring:**\napply the trial's inclusion/exclusion to the RWD — histology-confirmed subtype, ECOG 0–1, the\nrequired biomarker, ≥1 prior line, and adequate organ function from baseline labs — measured in\nthe 12-month lookback. (2) **Continuous enrollment / source completeness:** require FFS Parts\nA/B/D (or full commercial medical+pharmacy) across the entire 12-month lookback so prior lines\nand treatment-naive status are observed, not missing; exclude MA-only person-time. (3) **Time\nzero:** the date the external patient initiated the standard-of-care line that is the analog of\nthe trial's randomization-defining treatment decision — `index_date = first qualifying SOC fill`,\nnot the diagnosis date and not a downstream restaging visit (which would inject immortal time).\n(4) **Calendar-time restriction:** keep only external index dates overlapping the trial's accrual\nwindow to neutralize secular shifts in supportive care and imaging. 
(5) **Baseline covariates:**\nmeasured only in `[index_date − 365, index_date]` (stage, biomarker, prior-line count, organ\nfunction, comorbidity, and healthcare utilization), feeding a propensity score / overlap weights\nfor the trial-eligible contrast. (6) **Outcome and follow-up:** overall survival from `index_date`\nto death from a linked death index (NDI/SSA/EHR vitals), censoring at disenrollment and end of\ndata; do NOT use a claims-coded progression surrogate as the primary endpoint. (7) **Estimation\nand sensitivity:** PS adjustment (overlap weighting or 1:1 matching) with standardized mean\ndifferences <0.1, a competing-risks cumulative-incidence check for non-cancer death in the older\nexternal arm, an E-value / tipping-point analysis for unmeasured confounding, a negative-control\noutcome, and a leave-one-source-out / alternative-eligibility-window sensitivity to probe\ntransportability.",
    "primary_category": "Study_Design",
    "tags": [
      "external-control",
      "single-arm-trial",
      "rare-disease",
      "natural-history",
      "registry",
      "transportability",
      "time-zero-alignment",
      "oncology",
      "special-populations-methods"
    ],
    "applies_to_study_types": [
      "single_arm_external_control",
      "registry_linkage",
      "natural_history_study",
      "comparative_effectiveness",
      "target_trial_emulation"
    ],
    "data_sources": [
      "registry",
      "ehr",
      "claims",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.2147/CLEP.S242097",
        "url": "https://doi.org/10.2147/CLEP.S242097",
        "citation_text": "Thorlund K, Dron L, Park JJH, Mills EJ. Synthetic and external controls in clinical trials - a primer for researchers. Clinical Epidemiology. 2020;12:457-467.",
        "year": 2020,
        "authors_short": "Thorlund et al.",
        "notes": "Primer defining external and synthetic controls, the assumptions they require, and the design choices (eligibility, index alignment, adjustment) that determine validity."
      },
      {
        "role": "explain",
        "doi": "10.1002/cpt.1723",
        "url": "https://doi.org/10.1002/cpt.1723",
        "citation_text": "Schmidli H, Häring DA, Thomas M, Cassidy A, Weber S, Bretz F. Beyond randomized clinical trials: use of external controls. Clinical Pharmacology & Therapeutics. 2020;107(4):806-816.",
        "year": 2020,
        "authors_short": "Schmidli et al.",
        "notes": "Frames the assumptions for external controls and the spectrum from fully external to Bayesian dynamic-borrowing / hybrid designs, clarifying when each is defensible."
      },
      {
        "role": "explain",
        "doi": "10.1016/j.annonc.2021.12.015",
        "url": "https://doi.org/10.1016/j.annonc.2021.12.015",
        "citation_text": "Mishra-Kalyani PS, Amiri Kordestani L, Rivera DR, et al. External control arms in oncology: current use and future directions. Annals of Oncology. 2022;33(4):376-383.",
        "year": 2022,
        "authors_short": "Mishra-Kalyani et al.",
        "notes": "FDA-authored review of how external control arms have been used in oncology submissions, the recurring failure modes (outcome ascertainment, time-zero, era effects), and regulatory expectations."
      },
      {
        "role": "demonstrate",
        "doi": "10.1002/cpt.1586",
        "url": "https://doi.org/10.1002/cpt.1586",
        "citation_text": "Carrigan G, Whipple S, Capra WB, et al. Using electronic health records to derive control arms for early phase single-arm lung cancer trials: proof-of-concept in randomized controlled trials. Clinical Pharmacology & Therapeutics. 2020;107(2):369-377.",
        "year": 2020,
        "authors_short": "Carrigan et al.",
        "notes": "Empirical proof-of-concept that an EHR-derived external control can approximate the randomized control arm of completed RCTs after eligibility mirroring and adjustment, with the conditions under which it succeeds and fails."
      },
      {
        "role": "use",
        "doi": null,
        "url": "https://www.fda.gov/regulatory-information/search-fda-guidance-documents/considerations-design-and-conduct-externally-controlled-trials-drug-and-biological-products",
        "citation_text": "FDA. Considerations for the Design and Conduct of Externally Controlled Trials for Drug and Biological Products. Guidance for Industry. 2024.",
        "year": 2024,
        "authors_short": "FDA",
        "notes": "Regulatory guidance on when externally controlled trials are acceptable and the design controls (eligibility, time zero, outcome ascertainment, adjustment) expected for submissions."
      }
    ],
    "plain_language_summary": "When a disease is so rare that recruiting a randomized control group is not feasible, researchers run a single-arm trial (everyone gets the new treatment) and then find a comparison group from outside the trial, using real-world records such as a disease registry or linked health databases. This outside group, called an external control, lets the researchers ask 'how did similar patients fare without the new treatment?' rather than leaving the trial with no comparison at all. The approach only works if the external patients resemble the trial patients in disease severity, timing, and how outcomes were measured -- gaps in any of those three areas can make the treatment look better or worse than it really is.",
    "key_terms": [
      {
        "term": "external control",
        "definition": "A comparison group built from patients who are NOT enrolled in the trial, drawn instead from a registry, electronic health record database, or insurance claims, to stand in for the randomized control arm that could not be recruited."
      },
      {
        "term": "natural history",
        "definition": "The typical course of a disease over time in patients who receive standard care or no treatment, documented in registries or observational studies to understand what happens without the new treatment."
      },
      {
        "term": "single-arm trial",
        "definition": "A clinical trial in which every enrolled patient receives the experimental treatment, with no concurrent untreated group enrolled alongside them."
      },
      {
        "term": "eligibility mirroring",
        "definition": "The process of applying the same patient inclusion and exclusion rules used in the trial to the external database, so that the external patients are as similar as possible to the trial patients before any statistical adjustment."
      },
      {
        "term": "time zero",
        "definition": "The starting date for follow-up in each group -- in the trial it is the date of treatment assignment, and in the external control it must be set to an equivalent moment (such as the date the patient started standard-of-care treatment) to ensure both groups are measured from the same decision point."
      },
      {
        "term": "propensity score weighting",
        "definition": "A statistical technique that assigns each external control patient a weight based on how similar their characteristics are to the trial patients, so that the two groups are balanced on the measured characteristics. Unlike randomization, it cannot balance characteristics that were never recorded."
      }
    ],
    "worked_example": {
      "scenario": "An ultra-rare pediatric sarcoma affects roughly 200 patients per year nationwide. A sponsor runs a single-arm trial of a new targeted agent: 40 children are enrolled and all receive the drug. Because no concurrent control arm could be enrolled, the team builds an external control from a linked disease registry plus insurance claims, applying the trial eligibility rules and anchoring time zero on the date each registry patient started the standard-of-care regimen. The question: does overall survival differ between the 40 treated trial patients and the 38 eligible external controls?",
      "dataset": {
        "caption": "Summary rows as they appear after eligibility mirroring -- one row per patient, showing arm assignment, time zero, follow-up days, and vital status at last contact.",
        "columns": [
          "person_id",
          "arm",
          "time_zero",
          "os_days",
          "died"
        ],
        "rows": [
          [
            "T-001",
            "Trial (treated)",
            "2020-03-15",
            720,
            0
          ],
          [
            "T-002",
            "Trial (treated)",
            "2020-06-01",
            540,
            1
          ],
          [
            "T-003",
            "Trial (treated)",
            "2021-01-10",
            365,
            0
          ],
          [
            "C-001",
            "External control",
            "2019-11-20",
            310,
            1
          ],
          [
            "C-002",
            "External control",
            "2020-02-14",
            480,
            1
          ],
          [
            "C-003",
            "External control",
            "2020-08-05",
            600,
            0
          ]
        ]
      },
      "steps": [
        "Apply trial eligibility rules to the registry: confirmed sarcoma subtype, ECOG performance status 0 or 1, biomarker-positive, adequate organ function documented within the 12 months before time zero. This reduces the registry from 210 patients to 38 who would have qualified for the trial.",
        "Set time zero for each external control patient to the date they started the standard-of-care regimen -- the closest analog to the date trial patients were assigned to treatment. Using diagnosis date instead would credit external controls with 'free' survival time between diagnosis and treatment start, a period they had to survive by definition to enter the cohort (immortal time). That would inflate control-arm survival and make the new drug look artificially worse.",
        "Restrict external control index dates to the same calendar window as trial enrollment (2019-2022) so that improvements in supportive care over time do not favor the trial arm.",
        "Run propensity score weighting: fit a model predicting arm (trial vs. external control) from age, disease stage, prior treatment lines, and organ function. Weights re-balance the two groups so they look similar on those factors.",
        "Estimate overall survival (time from time zero to death) in both arms using the weighted groups.",
        "In the trial arm of 40 patients: 24-month overall survival rate 68% (27 of 40 alive at 24 months). In the weighted external control of 38 patients: 24-month overall survival rate 42% (16 of 38 alive at 24 months).",
        "Check threats: (1) Are the groups comparable after weighting? Standardized mean differences for all covariates fall below 0.10 -- good. (2) Could era effects explain the difference? Sensitivity analysis restricted to 2020-2021 overlap only narrows the gap slightly to 68% vs. 45% -- the difference persists. (3) Could outcome measurement differ? The trial used adjudicated vital status; the external arm used a linked death index. Both are high-quality mortality sources, so this threat is low."
      ],
      "result": "Treated arm (n=40): 24-month overall survival 68%. External control (n=38, propensity-score weighted): 24-month overall survival 42%. Absolute difference: +26 percentage points favoring the trial arm. The gap remains in calendar-restricted and alternative-eligibility sensitivity analyses, supporting but not proving a treatment benefit -- unmeasured severity differences between the two data sources cannot be fully ruled out."
    },
    "prerequisites": [
      "single-arm-external-control",
      "target-trial-emulation",
      "propensity-score-methods-psm-iptw",
      "immortal-time-bias-handling",
      "generalizability-transportability-external-validity-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Registry / natural-history external control (incident-case anchored)",
        "description": "External arm drawn from a disease registry or prospective natural-history study, restricted to incident cases at an analogous treatment-decision index to avoid prevalent-case survivor and immortal-time bias; strongest for phenotype, severity, and adjudicated outcomes in ultra-rare disease.",
        "edge_cases": [
          "Voluntary enrollment skewed toward academic centers and survivors; prevalent-case enrollment injects immortal time unless re-anchored on the incident index.",
          "Incomplete treatment exposure/dates; site participation shifts the case mix over calendar time."
        ],
        "data_source_notes": "registry: confirm incident vs prevalent enrollment and adjudication rules; link to claims for complete treatment history and to a death index (NDI/SSA) for censoring."
      },
      {
        "name": "EHR / claims-derived external control with eligibility mirroring",
        "description": "External arm abstracted from EHR (or reconstructed from claims) by applying the trial inclusion/exclusion in a defined lookback, anchoring time zero on the first qualifying standard-of-care line, and using overall survival from a linked death index as the primary endpoint.",
        "edge_cases": [
          "MA-only person-time lacks FFS claims, so prior lines and treatment-naive status cannot be ascertained; restrict to full-FFS (or full commercial) lookbacks.",
          "Treatment-driven, visit-dependent outcome ascertainment makes progression surrogates non-comparable to RECIST; prefer overall survival."
        ],
        "data_source_notes": "claims/EHR: require continuous enrollment across the full lookback; anchor time zero on the SOC initiation; restrict to calendar time overlapping the trial accrual window."
      },
      {
        "name": "Hybrid / Bayesian dynamic-borrowing external control",
        "description": "A small concurrent control is augmented with external data via a power prior, commensurate prior, or meta-analytic-predictive (MAP) prior, so the external information is down-weighted when it conflicts with the concurrent data.",
        "edge_cases": [
          "Requires a concurrent control (even if small); prior specification and tuning of the borrowing strength materially affect the estimate and type-I error.",
          "Prior-data conflict diagnostics must be pre-specified to avoid over-borrowing from an unrepresentative external cohort."
        ],
        "data_source_notes": "any: report the effective external sample size borrowed and a sensitivity analysis over the borrowing prior; pre-specify the conflict-detection rule."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Concurrent randomized control arm",
        "pros_of_this": "Feasible when randomization is impossible (ultra-rare disease, fatal natural history, lack of equipoise); supplies a patient-level comparator where none could be enrolled.",
        "cons_of_this": "Forfeits every protection randomization provides — exchangeability, no transportability burden, comparable ascertainment — and is biased by unmeasured confounding, era effects, and outcome-definition drift.",
        "when_to_prefer": "Only when a randomized trial is genuinely infeasible or unethical; treat results as hypothesis-strengthening with heavy sensitivity analysis."
      },
      {
        "compared_to": "Fixed historical / literature benchmark (single threshold)",
        "pros_of_this": "Enables eligibility mirroring, covariate adjustment, time-zero alignment, and time-to-event endpoints rather than a single summary number.",
        "cons_of_this": "Requires obtaining and curating a patient-level external source with the necessary prognostic covariates and outcome ascertainment.",
        "when_to_prefer": "Whenever a patient-level external source exists; fall back to a literature benchmark only when no patient-level data are available."
      },
      {
        "compared_to": "Bayesian dynamic-borrowing / hybrid design",
        "pros_of_this": "Maximally feasible — needs no concurrent control at all; simpler than specifying and tuning a borrowing prior.",
        "cons_of_this": "No down-weighting safeguard against an unrepresentative external cohort; the estimate fully inherits external bias.",
        "when_to_prefer": "Only when no concurrent control whatsoever is possible; prefer dynamic borrowing when even a few concurrent controls can be randomized."
      },
      {
        "compared_to": "Active-comparator new-user design in routine RWD",
        "pros_of_this": "Provides a comparator for a single-arm trial where two treated arms cannot be drawn from one data source.",
        "cons_of_this": "Compares across data sources against an untreated/standard-of-care arm, re-importing confounding by indication and a cross-source transportability burden that ACNU avoids.",
        "when_to_prefer": "Use ACNU whenever both arms can be drawn from the same RWD; the external control is for the single-arm-trial situation where they cannot."
      }
    ],
    "implementation_notes_by_data_source": {
      "registry": "Confirm incident vs prevalent enrollment and re-anchor time zero on the incident treatment-decision to avoid survivor/immortal-time bias; verify adjudication rules; link to claims for full treatment history and to a death index for censoring.",
      "ehr": "Apply trial inclusion/exclusion in a defined lookback to mirror eligibility; abstract outcomes against a protocol; prefer overall survival from a linked death index over treatment-driven, visit-dependent progression surrogates; treat loss to follow-up as informative.",
      "claims": "Require continuous FFS (or full commercial) enrollment across the lookback so prior lines and treatment-naive status are observed, not missing — exclude MA-only person-time; anchor time zero on the first qualifying SOC line; model competing risks for non-event death in the older external arm.",
      "linked": "Ideal substrate (EHR severity + claims completeness + reliable mortality) but linkage selection biases transportability toward a healthier/wealthier subset; reconcile order/fill/service/abstraction date discrepancies before time-zero assignment."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\nimport numpy as np\n\nLOOKBACK_DAYS = 365          # baseline covariate + enrollment-completeness window\nTRIAL_ACCRUAL_START = pd.Timestamp(\"2018-01-01\")  # restrict external era to trial accrual overlap\nTRIAL_ACCRUAL_END   = pd.Timestamp(\"2022-12-31\")\n\ndef build_external_control(soc, enroll, clin, death):\n    # Time zero = FIRST qualifying standard-of-care line (the randomization analog),\n    # NOT diagnosis and NOT a downstream restaging visit (which would inject immortal time).\n    soc = soc.sort_values([\"person_id\", \"line_date\"])\n    idx = (soc.groupby(\"person_id\").first().reset_index()\n              .rename(columns={\"line_date\": \"index_date\"}))\n    idx[\"arm\"] = \"CONTROL\"\n\n    # Calendar-time restriction: external index must overlap the trial accrual window.\n    # .copy() so the baseline_start assignment below does not write to a filtered slice.\n    idx = idx[idx[\"index_date\"].between(TRIAL_ACCRUAL_START, TRIAL_ACCRUAL_END)].copy()\n    idx[\"baseline_start\"] = idx[\"index_date\"] - pd.Timedelta(days=LOOKBACK_DAYS)\n\n    # Source completeness: continuous, FFS-observable enrollment across the full lookback\n    # through index (MA-only person-time lacks FFS claims -> prior lines unobservable).\n    e = enroll.merge(idx[[\"person_id\", \"index_date\", \"baseline_start\"]], on=\"person_id\")\n    e[\"covers\"] = ((e[\"enroll_start\"] <= e[\"baseline_start\"]) &\n                   (e[\"enroll_end\"]   >= e[\"index_date\"]) & (~e[\"ma_only\"]))\n    observable = e.loc[e[\"covers\"], \"person_id\"].unique()\n    idx = idx[idx[\"person_id\"].isin(observable)]\n\n    # Eligibility mirroring: apply the trial inclusion/exclusion using baseline-window clinical data.\n    c = clin.merge(idx[[\"person_id\", \"baseline_start\", \"index_date\"]], on=\"person_id\")\n    c = c[c[\"dx_date\"].between(c[\"baseline_start\"], c[\"index_date\"])]   # staging within lookback\n    eligible = c[(c[\"histology\"] == \"TARGET_SUBTYPE\") & (c[\"ecog\"].isin([0, 1])) &\n                 (c[\"biomarker_pos\"]) & 
(c[\"organ_ok\"])][\"person_id\"].unique()\n    cohort = idx[idx[\"person_id\"].isin(eligible)].copy()\n\n    # Overall survival from a linked death index; censor at end of data (study-specific).\n    cohort = cohort.merge(death, on=\"person_id\", how=\"left\")\n    end_of_data = TRIAL_ACCRUAL_END + pd.Timedelta(days=LOOKBACK_DAYS)\n    cohort[\"event\"] = cohort[\"death_date\"].notna().astype(int)\n    cohort[\"fu_end\"] = cohort[\"death_date\"].fillna(end_of_data)\n    cohort[\"os_days\"] = (cohort[\"fu_end\"] - cohort[\"index_date\"]).dt.days\n    return cohort[[\"person_id\", \"arm\", \"index_date\", \"baseline_start\",\n                   \"os_days\", \"event\"]]",
        "description": "External-control cohort construction by eligibility mirroring from claims/EHR-style inputs.\nRequired inputs (already cleaned and de-duplicated):\n  soc    : standard-of-care lines -> person_id, line_date (datetime), line_seq (int), regimen\n  enroll : enrollment spans       -> person_id, enroll_start, enroll_end, ma_only (bool)  # ma_only lacks FFS claims\n  clin   : baseline clinical      -> person_id, dx_date, histology, ecog, biomarker_pos (bool), organ_ok (bool)\n  death  : death index            -> person_id, death_date (datetime, NaT if alive)\nTRIAL_ACCRUAL_START/END bound the calendar window so the external era overlaps the trial.\nReturns one row per trial-eligible external control with arm='CONTROL', time zero (index_date),\nthe baseline covariate window, and OS follow-up. Build the propensity score / overlap weights\nonly from covariates measured in [baseline_start, index_date]; use overall survival as the endpoint.",
        "dependencies": [
          "pandas",
          "numpy"
        ],
        "source_citations": [
          "carrigan-2020",
          "thorlund-2020"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\nLOOKBACK_DAYS       <- 365L\nTRIAL_ACCRUAL_START <- as.Date(\"2018-01-01\")\nTRIAL_ACCRUAL_END   <- as.Date(\"2022-12-31\")\n\nbuild_external_control <- function(soc, enroll, clin, death) {\n  setDT(soc); setDT(enroll); setDT(clin); setDT(death)\n  setorder(soc, person_id, line_date)\n\n  # Time zero = first qualifying SOC line (randomization analog), within trial accrual window.\n  idx <- soc[, .(index_date = line_date[1L]), by = person_id]\n  idx[, arm := \"CONTROL\"]\n  idx <- idx[index_date >= TRIAL_ACCRUAL_START & index_date <= TRIAL_ACCRUAL_END]\n  idx[, baseline_start := index_date - LOOKBACK_DAYS]\n\n  # Source completeness: continuous FFS-observable enrollment across the full lookback.\n  e <- merge(enroll, idx[, .(person_id, index_date, baseline_start)], by = \"person_id\")\n  observable <- e[enroll_start <= baseline_start & enroll_end >= index_date &\n                  !ma_only, unique(person_id)]\n  idx <- idx[person_id %chin% observable]\n\n  # Eligibility mirroring on baseline-window clinical data.\n  c <- merge(clin, idx[, .(person_id, baseline_start, index_date)], by = \"person_id\")\n  c <- c[dx_date >= baseline_start & dx_date <= index_date]\n  eligible <- c[histology == \"TARGET_SUBTYPE\" & ecog %in% c(0L, 1L) &\n                biomarker_pos & organ_ok, unique(person_id)]\n  cohort <- idx[person_id %chin% eligible]\n\n  # Overall survival from a linked death index.\n  cohort <- merge(cohort, death, by = \"person_id\", all.x = TRUE)\n  end_of_data <- TRIAL_ACCRUAL_END + LOOKBACK_DAYS\n  cohort[, event   := as.integer(!is.na(death_date))]\n  cohort[, fu_end  := fifelse(is.na(death_date), end_of_data, death_date)]\n  cohort[, os_days := as.integer(fu_end - index_date)]\n  cohort[, .(person_id, arm, index_date, baseline_start, os_days, event)]\n}",
        "description": "External-control cohort construction with data.table. Inputs mirror the Python version:\n  soc    : person_id, line_date (Date), line_seq (int), regimen\n  enroll : person_id, enroll_start, enroll_end, ma_only (logical)\n  clin   : person_id, dx_date, histology, ecog, biomarker_pos (logical), organ_ok (logical)\n  death  : person_id, death_date (Date, NA if alive)\nReturns one trial-eligible external control per person with time zero, baseline window,\nand overall-survival follow-up; weight/match downstream on baseline covariates only.",
        "dependencies": [
          "data.table"
        ],
        "source_citations": [
          "carrigan-2020",
          "thorlund-2020"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "%let lookback = 365;\n%let acc_start = '01JAN2018'd;\n%let acc_end   = '31DEC2022'd;\n\n/* Time zero = first qualifying SOC line (randomization analog), within trial accrual overlap. */\nproc sql;\n  create table idx as\n  select person_id,\n         min(line_date) as index_date format=date9.,\n         'CONTROL' as arm length=12\n  from work.soc\n  group by person_id\n  having calculated index_date between &acc_start and &acc_end;\nquit;\ndata idx; set idx; baseline_start = index_date - &lookback; format baseline_start date9.; run;\n\n/* Source completeness: continuous FFS-observable enrollment across the full lookback (no MA-only). */\nproc sql;\n  create table observable as\n  select i.*\n  from idx i\n  where exists (\n    select 1 from work.enroll e\n    where e.person_id = i.person_id and e.ma_only = 0\n      and e.enroll_start <= i.baseline_start\n      and e.enroll_end   >= i.index_date\n  );\nquit;\n\n/* Eligibility mirroring: trial inclusion/exclusion on staging measured in the lookback window. */\nproc sql;\n  create table cohort as\n  select o.person_id, o.arm, o.index_date, o.baseline_start,\n         coalesce(d.death_date, &acc_end + &lookback) as fu_end format=date9.,\n         (d.death_date is not missing) as event\n  from observable o\n  left join work.death d on d.person_id = o.person_id\n  where exists (\n    select 1 from work.clin c\n    where c.person_id = o.person_id\n      and c.dx_date between o.baseline_start and o.index_date\n      and c.histology = 'TARGET_SUBTYPE'\n      and c.ecog in (0,1) and c.biomarker_pos = 1 and c.organ_ok = 1\n  );\nquit;\ndata cohort; set cohort; os_days = fu_end - index_date; run;\n\n/* IPTW / overlap weighting on baseline-window covariates (analytic = treated trial arm UNION external control,\n   with arm='STUDY' for trial patients). Confirm SMD<0.1 before the outcome model. 
*/\nproc psmatch data=work.analytic region=allobs;\n  class arm <categorical baseline covariates>;\n  psmodel arm(treated='STUDY') = <baseline covariates>;     /* covariates from the baseline window only */\n  assess lps var=(<key covariates>) / weight=atewgt;        /* standardized differences after weighting */\n  output out=wtd atewgt=ipw;\nrun;\n\n/* Overall survival: weighted Cox (robust SE for the weights). For competing risks, use\n   PROC PHREG eventcode= (Fine-Gray) or PROC LIFETEST plots=cif on the cumulative incidence. */\nproc phreg data=wtd covs(aggregate);\n  class arm(ref='CONTROL');\n  model os_days*event(0) = arm;\n  weight ipw;\n  id person_id;\nrun;",
        "description": "External-control cohort construction (eligibility mirroring) and PS weighting in SAS.\nRequired input datasets (post data-management):\n  work.soc    : person_id, line_date, line_seq, regimen\n  work.enroll : person_id, enroll_start, enroll_end, ma_only (0/1)\n  work.clin   : person_id, dx_date, histology, ecog, biomarker_pos (0/1), organ_ok (0/1)\n  work.death  : person_id, death_date (missing if alive)\nTime zero is the first qualifying SOC line within the trial accrual window. PROC PSMATCH\n(SAS/STAT 14.2+) estimates IPTW/overlap weights on baseline-window covariates; confirm\nstandardized differences <0.1 before fitting the survival model. PROC PHREG (or LIFETEST\nplots=cif for competing risks) estimates overall survival in the combined treated+control set.",
        "dependencies": [],
        "source_citations": [
          "carrigan-2020"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Trial[Single-arm trial: treated arm only] --> Need[Need a counterfactual control arm]\n  RWD[External RWD: registry / EHR / claims / linked] --> Mirror[Eligibility mirroring<br/>apply trial inclusion-exclusion in lookback]\n  Mirror --> T0[Time zero = first qualifying SOC line<br/>randomization analog, NOT diagnosis/restaging]\n  T0 --> Cal[Calendar-time restriction<br/>overlap trial accrual window]\n  Cal --> Base[Baseline covariates in lookback only<br/>stage, biomarker, prior lines, organ function]\n  Base --> Out[Outcome alignment: overall survival<br/>from linked death index, NOT progression surrogate]\n  Out --> Adj[PS weighting / overlap weights<br/>SMD < 0.1]\n  Need --> Contrast[Treated arm vs adjusted external control]\n  Adj --> Contrast\n  Contrast --> Sens[Sensitivity: E-value, competing risks,<br/>negative control, alt eligibility window]",
        "caption": "Eligibility-mirroring pipeline for a rare-disease external control. The external arm is restricted to the trial-eligible subset, anchored at the treatment-decision time zero, aligned on outcome ascertainment, and adjusted before being contrasted with the single-arm trial's treated patients.",
        "alt_text": "Flowchart showing a single-arm trial needing a control, external RWD passing through eligibility mirroring, time-zero anchoring, calendar restriction, baseline covariate measurement, outcome alignment, propensity-score adjustment, the treated-vs-control contrast, and sensitivity analyses.",
        "source_type": "illustrative",
        "source_citations": [
          "thorlund-2020"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  U[Unmeasured prognostic severity<br/>stage, biomarker, organ function] --> Arm[Arm: trial-treated vs external control]\n  U --> Y[Outcome: overall survival]\n  Era[Calendar era / standard-of-care drift] --> Arm\n  Era --> Y\n  Sel[Linkage / enrollment selection] --> Arm\n  Sel --> Y\n  Asc[Outcome-ascertainment difference<br/>RECIST vs claims surrogate] --> Y\n  Arm --> Y",
        "caption": "Bias structure of a rare-disease external control. Source membership (trial vs external) is associated with unmeasured severity, calendar-era standard-of-care drift, linkage/enrollment selection, and a differential outcome-ascertainment path — each opens a non-causal path that randomization would have blocked and that adjustment can only partially close.",
        "alt_text": "Causal diagram showing arm and overall survival both influenced by unmeasured prognostic severity, calendar-era drift, and selection, plus a separate ascertainment arrow into the outcome, illustrating the confounding and bias paths an external control must address.",
        "source_type": "illustrative",
        "source_citations": [
          "mishra-kalyani-2022"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "is_variant_of",
        "target_slug": "single-arm-external-control",
        "notes": "Rare-disease external controls are the special-population instance of single-arm external-control designs, where rarity (not just convenience) forces the external comparator and ultra-small samples sharpen every assumption."
      },
      {
        "relation_type": "is_variant_of",
        "target_slug": "special-populations-rwe-methods",
        "notes": "Sits in the special-populations method family; rare disease is the canonical setting where randomized comparators are infeasible."
      },
      {
        "relation_type": "used_with",
        "target_slug": "generalizability-transportability-external-validity-rwe",
        "notes": "Identification of the external-control contrast rests on transportability of the adjusted estimate from the external source to the trial population."
      },
      {
        "relation_type": "used_with",
        "target_slug": "propensity-score-methods-psm-iptw",
        "notes": "Propensity-score matching, IPTW, or overlap weighting on baseline prognostic covariates is the standard step that balances the external control against the trial-treated arm."
      },
      {
        "relation_type": "used_with",
        "target_slug": "target-trial-emulation",
        "notes": "Eligibility mirroring and treatment-decision time-zero anchoring are the target-trial discipline applied across data sources to build the external comparator."
      },
      {
        "relation_type": "see_also",
        "target_slug": "immortal-time-bias-handling",
        "notes": "Anchoring time zero on the first qualifying treatment line (not diagnosis or a downstream restaging visit) prevents the immortal time that plagues prevalent-case registry and procedure-anchored external controls."
      },
      {
        "relation_type": "see_also",
        "target_slug": "empirical-calibration-negative-controls-rwe",
        "notes": "Negative-control outcomes and empirical calibration quantify residual confounding and era effects that the cross-source external-control contrast cannot fully remove."
      },
      {
        "relation_type": "alternative_to",
        "target_slug": "active-comparator-new-user",
        "notes": "When both arms can be drawn from one RWD source, an active-comparator new-user design avoids the cross-source transportability burden; the external control is reserved for the single-arm-trial case where they cannot."
      }
    ],
    "aliases": [
      "external control arm",
      "single-arm trial external control",
      "synthetic control arm",
      "historical control (rare disease)",
      "natural-history external control"
    ],
    "complexity": "advanced",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "real-world-progression-rwpfs-rwe",
    "name": "Real-World Progression and rwPFS",
    "short_definition": "An abstracted or algorithmically derived oncology endpoint in which real-world disease progression (from radiology reports, clinician notes, or treatment-pattern proxies) is combined with death into a real-world progression-free survival (rwPFS) time, requiring an explicit progression definition, an assessment cadence, and a pre-specified estimand for how death is handled.",
    "long_description": "**Real-world progression (rwP)** is the analogue of RECIST-defined progression for routinely collected oncology\ndata, where prospective imaging at protocol intervals does not exist. Because there is no central read and no fixed\nscan schedule, rwP must be *constructed*: most often by structured human abstraction of radiology reports and\nclinician notes into a binary \"progression event\" with a date, sometimes by an NLP/algorithmic pass over the same\ndocuments, and sometimes by treatment-pattern proxies (a new line of therapy, or discontinuation framed as\nprogression). **Real-world progression-free survival (rwPFS)** is then the time from an index date (typically\ntreatment line start) to the *earlier* of the first rwP event or death, with censoring for those who reach end of\nfollow-up event-free. rwPFS is a **composite endpoint**: progression OR death, exactly as trial PFS is, but the\nascertainment of the progression component is the entire methodological problem.\n\n**Core estimand distinction**. Three estimand choices are doing the work and must be pre-specified, not discovered\nin code. (1) *What counts as an event.* rwPFS proper counts progression-or-death; if you instead model\n**real-world time to progression (rwTTP)** you treat progression as the event and **death without prior progression\nas a competing risk**, which changes both the model and the interpretation. A Kaplan-Meier analysis that censors\ncompeting deaths targets the (counterfactual) progression risk in a world without death, and its complement\n*overstates* the real-world cumulative incidence of progression; a **cumulative incidence function** from the\nAalen-Johansen estimator (or a Fine-Gray model) answers the actually-observed question. For rwPFS the composite\nmakes death an event, so an ordinary Cox/KM analysis of the composite is appropriate; the competing-risks subtlety\narises only when you decompose the composite. 
(2) *Assessment cadence and interval censoring.* Progression is detected only when a scan or note exists,\nso the event date is the assessment date, not the true biologic date — events are **interval-censored** and the\nobserved cadence differs by arm, site, and insurer. (3) *Index/time-zero alignment*, which determines whether\nimmortal time contaminates the line-of-therapy start. Get these wrong and rwPFS is internally inconsistent across\narms even before any confounding adjustment.\n\n**Pros, cons, and trade-offs**.\n- **vs real-world overall survival (rwOS):** rwOS needs only a reliable death date (a single, high-validity field\n  when linked to NDI/SSA/obituary sources) and has no ascertainment cadence problem, so it is the most defensible\n  real-world endpoint. rwPFS trades that robustness for earlier signal and larger effect sizes, but inherits\n  abstraction error, cadence-driven informative assessment, and weaker correlation with rwOS than trial PFS has with\n  OS. **Prefer rwOS** as the anchor; use rwPFS as a supportive/intermediate endpoint with a demonstrated\n  surrogacy/association analysis (e.g., correlation of rwPFS with rwOS within the same source).\n- **vs trial RECIST PFS:** RECIST PFS has fixed scan schedules, central/blinded read, and unidimensional lesion\n  rules; rwP has none of these. The advantage is generalizability and cost; the cost is that rwP is not RECIST and\n  should never be relabeled as such. **Prefer trial PFS** for efficacy claims; use rwP for external comparators,\n  label-expansion context, and HTA effectiveness arguments where the trial population is not the question.\n- **vs treatment-pattern proxies (next-line / discontinuation as progression):** proxies are cheap and computable\n  from claims alone but conflate progression with toxicity, patient preference, financial barriers, and\n  formulary-driven switching, and they systematically lag the true event. 
**Prefer abstracted/NLP rwP** when chart\n  or report text is available; reserve next-line proxies for claims-only feasibility or sensitivity analyses, never\n  as the primary definition for a comparative claim.\n\n**When to use**. Single-arm trial external control arms and comparative effectiveness in oncology where OS alone is\ntoo slow or too confounded by post-progression therapy; HTA submissions needing real-world effectiveness alongside\ntrial efficacy; settings with abstractable radiology/clinician text (Flatiron-style enriched EHR, integrated\ndelivery networks, registries with adjudicated progression). Always pair rwPFS with rwOS and report the abstraction\nmethod, assessment cadence, and the rwPFS-rwOS association.\n\n**When NOT to use, and when it is actively misleading or dangerous**.\n- **Claims-only data with no chart/report text.** You cannot ascertain progression from claims; a\n  next-line-of-therapy proxy is *not* rwPFS and presenting it as such is misleading. It mechanically favors the arm\n  with fewer downstream options and lags the true event by the time-to-next-treatment decision.\n- **Differential assessment cadence by arm.** If patients on one drug are monitored more intensively (more frequent\n  scans), their progression is detected earlier and that drug looks *worse* on rwPFS purely from surveillance:\n  **surveillance/ascertainment bias**. Diagnose by comparing scan/assessment frequency across arms before trusting\n  any contrast.\n- **Decomposing the composite without competing-risks handling.** Reporting a KM \"time to progression\" that censors\n  competing deaths overstates progression incidence and is dangerous when one arm has higher early mortality (its\n  censored deaths inflate its apparent progression-free time). 
Use a cumulative incidence function.\n- **Immortal time at line start.** Defining the index as a *confirmed* line start that requires surviving long\n  enough to receive a second cycle builds immortal time into rwPFS; align time zero to the first administration.\n- **Cross-source pooling without a harmonized progression definition.** A radiology-anchored rwP from one vendor and\n  a clinician-anchored rwP from another are different endpoints; pooling them silently is uninterpretable.\n\n**Data-source operational depth**.\n- **Claims (FFS vs MA):** Claims contain no progression signal; the best obtainable surrogate is a coded new line of\n  systemic therapy or a new metastatic-site diagnosis, both lagged and noisy. Require continuous medical + pharmacy\n  enrollment so a \"no new therapy\" period is observed, not missing — **Medicare Advantage encounter data are\n  incomplete relative to fee-for-service**, so MA-only person-time can fabricate apparent progression-free intervals\n  from unobserved care; exclude MA-only spans or flag them. Infusion vs oral (Part B vs Part D) routing changes\n  where the next-line signal even appears.\n- **EHR / enriched EHR (Flatiron-style):** rwP comes from trained abstractors reading radiology and oncologist\n  notes, or from NLP. Failure modes: inter-abstractor variability, notes that say \"mixed response\" or \"clinical\n  progression\" without imaging, external-care leakage (scans done outside the network never enter the chart), and\n  **assessment cadence driven by clinic visit frequency** rather than biology, which interval-censors events on an\n  irregular, arm-dependent grid. 
Workarounds: dual abstraction with adjudication, capture the *assessment date* and\n  the *progression date* separately, and report visit/scan frequency by arm.\n- **Registry:** Often has adjudicated progression and stage but sparse longitudinal imaging; strong for the\n  progression *definition* (clinically reviewed) but weak for cadence completeness and for full therapy history —\n  link to claims for treatment lines and to a death index for the OS/composite component.\n- **Linked EHR-claims-mortality:** The ideal substrate: EHR text for rwP, claims for complete therapy lines and the\n  next-line proxy as a cross-check, and NDI/obituary mortality for the death component of the composite. Cost:\n  linkage selection (only the linkable subset), and order/scan/service-date discrepancies that must be reconciled\n  before assigning the progression and index dates.\n\n**Worked example (enriched-EHR rwPFS with a claims cross-check).** Question: rwPFS for first-line therapy A vs B in\nadvanced NSCLC. (1) Index/time zero = date of first administration of the line-defining agent (from EHR\nadministrations, confirmed against the J-code infusion claim or Part D fill on or near that date); do *not* require a\nsecond cycle, which would inject immortal time. (2) Progression component: abstractor-confirmed rwP date from\nradiology/clinician notes; record both the assessment date and the abstracted progression date, and store the\nascertainment method. (3) Death component: death date from the linked mortality index, not the last-EHR-activity\ndate (which biases survival). (4) rwPFS event = earlier of rwP date and death date; rwPFS time = that date minus\nindex. Censor at last confirmed disease assessment (not last-any-contact) for patients event-free, so that someone\nlost from the system is censored at their last informative scan, not carried as progression-free indefinitely. 
(5)\nClaims cross-check: derive a next-line-of-therapy date from the pharmacy/medical claims (first fill or infusion of a\nnon-A/non-B regimen after a 30-day gap, requiring continuous non-MA-only enrollment), and report the agreement\nbetween rwPFS and real-world time to next treatment (rwTTNT) as a sensitivity analysis. (6) Report rwPFS and rwOS\nside by side with their within-cohort correlation, and show assessment frequency by arm to check for surveillance\nbias. (7) Estimand check: the primary contrast is a hazard ratio for the composite event (Cox); the supportive rwTTP\ndecomposition uses a cumulative incidence function with death as a competing event.\n\n**Interpreting the output**. For Patient 2201 with NSCLC, the index date is January 10, 2023. A radiology report\ndated May 2, 2023 (day 112 from index) documents progression as interpreted by the abstractor; a death record\nis dated August 14, 2023 (day 216). The algorithm assigns rwPFS = 112 days, event type = progression (not death,\nbecause progression occurred first).\n\nFormal interpretation: rwPFS = 112 days means the patient was recorded as progression-free for 112 days from\ntreatment initiation, with the true progression falling somewhere between the last no-progression assessment and\nthe date of the radiology report documenting progression. The assigned date, May 2, is an ascertainment date, not a\ntrue biological progression date. If the patient's last scan was 8 weeks earlier and the next was May 2, the true\nprogression could have occurred any time in that 8-week interval. The progression time is therefore\ninterval-censored, with an interval width set by the assessment frequency, which is partly a function of the visit\nschedule at the treating institution, not the treatment under study. 
This makes rwPFS non-comparable to trial RECIST PFS, where assessments occur at\nprotocol-fixed intervals in both arms.\n\nPractical interpretation: always report the median assessment interval alongside rwPFS estimates, and examine\nwhether assessment frequency differs between treatment arms — a difference in visit frequency is a surveillance\nbias risk, not a treatment signal. For cross-study comparisons, rwPFS is best interpreted as a relative measure\nwithin the same real-world cohort rather than a literal equivalent of trial PFS.",
    "primary_category": "Outcome_Measure",
    "tags": [
      "outcome_measure",
      "real-world-progression",
      "rwpfs",
      "composite-endpoint",
      "competing-risks",
      "oncology-rwe",
      "surveillance-bias",
      "external-control-arm"
    ],
    "applies_to_study_types": [
      "ehr_study",
      "registry_linkage",
      "claims_analysis",
      "comparative_effectiveness",
      "single_arm_external_control",
      "target_trial_emulation"
    ],
    "data_sources": [
      "ehr",
      "claims",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1007/s12325-019-00970-1",
        "url": "https://doi.org/10.1007/s12325-019-00970-1",
        "citation_text": "Griffith SD, Tucker M, Bowser B, et al. Generating Real-World Tumor Burden Endpoints from Electronic Health Record Data: Comparison of RECIST, Radiology-Anchored, and Clinician-Anchored Approaches for Abstracting Real-World Progression in Non-Small Cell Lung Cancer. Advances in Therapy. 2019;36(8):2122-2136.",
        "year": 2019,
        "authors_short": "Griffith et al.",
        "notes": "Defines and compares radiology-anchored vs clinician-anchored abstraction of real-world progression from EHR data and shows their association with mortality; the operational reference for constructing rwP/rwPFS."
      },
      {
        "role": "explain",
        "doi": "10.1200/cci.18.00155",
        "url": "https://doi.org/10.1200/cci.18.00155",
        "citation_text": "Stewart M, Norden AD, Dreyer N, et al. An Exploratory Analysis of Real-World End Points for Assessing Outcomes Among Immunotherapy-Treated Patients With Advanced Non-Small-Cell Lung Cancer. JCO Clinical Cancer Informatics. 2019;3:1-15.",
        "year": 2019,
        "authors_short": "Stewart et al.",
        "notes": "Friends of Cancer Research multi-source examination of rwPFS, rwTTP, rwTTNT and rwOS, including endpoint definitions, ascertainment differences across data partners, and correlation of rwPFS with rwOS."
      },
      {
        "role": "demonstrate",
        "doi": "10.1002/sim.7501",
        "url": "https://doi.org/10.1002/sim.7501",
        "citation_text": "Austin PC, Fine JP. Practical recommendations for reporting Fine-Gray model analyses for competing risk data. Statistics in Medicine. 2017;36(27):4391-4400.",
        "year": 2017,
        "authors_short": "Austin & Fine",
        "notes": "Specifies when to use cause-specific hazards vs the subdistribution (Fine-Gray) cumulative incidence when decomposing rwPFS into progression with death as a competing event; the estimand reference for rwTTP."
      },
      {
        "role": "use",
        "doi": "10.1002/sim.5984",
        "url": "https://doi.org/10.1002/sim.5984",
        "citation_text": "Austin PC. The use of propensity score methods with survival or time-to-event outcomes: reporting measures of effect similar to those used in randomized experiments. Statistics in Medicine. 2014;33(7):1242-1258.",
        "year": 2014,
        "authors_short": "Austin",
        "notes": "Applied guidance for estimating treatment effects on time-to-event endpoints (such as rwPFS) within a propensity-score-balanced cohort, the usual downstream analytic step for comparative rwPFS."
      }
    ],
    "plain_language_summary": "Real-world progression-free survival (rwPFS) measures how long a cancer patient lived without their disease getting worse, using information collected during routine clinical care rather than a controlled research trial. Instead of a formal scan protocol, researchers read the patient's radiology reports and doctors' notes and assign a date when disease growth was first recorded; this is called real-world progression, or rwP. The clock starts when the patient begins their treatment line and stops at whichever comes first: the rwP date or the date of death. Because this information is pulled from notes and imaging that were done for clinical care and not research, the quality of the endpoint depends entirely on how carefully those records are read and how clearly the abstractor can tell worsening disease from other changes.",
    "key_terms": [
      {
        "term": "real-world progression (rwP)",
        "definition": "A disease-worsening event assigned by reading the patient's radiology reports or clinician notes and recording the date the record first indicates the cancer grew or spread — constructed by a trained abstractor rather than measured by a standardized trial protocol."
      },
      {
        "term": "rwPFS",
        "definition": "Real-world progression-free survival: the number of days from the start of a treatment line to the earlier of a real-world progression event or death."
      },
      {
        "term": "abstraction",
        "definition": "The process of a trained human reviewer reading medical records — radiology reports, oncologist notes — and extracting a structured data point (such as a progression date and the basis for it) that was never recorded as a coded field in the database."
      },
      {
        "term": "composite endpoint",
        "definition": "An outcome that counts whichever of two or more events happens first; for rwPFS, progression and death each qualify as the endpoint, so the patient is not counted as event-free just because they survived."
      },
      {
        "term": "index date",
        "definition": "The patient's day zero for follow-up — for rwPFS this is the date of the first administration of the treatment being studied, not a later confirmation date."
      },
      {
        "term": "censoring",
        "definition": "When a patient reaches the end of the observation period without having had either a progression event or a death, their follow-up time is recorded as-is and marked as incomplete rather than as an event."
      }
    ],
    "worked_example": {
      "scenario": "A single patient with advanced non-small-cell lung cancer starts first-line immunotherapy on 2023-01-10. An analyst in an enriched EHR database wants to calculate this patient's rwPFS. The patient has a CT scan on 2023-05-02 whose radiology report is abstracted as showing new nodal involvement — the abstractor records a real-world progression event on that date. The patient dies on 2023-08-14. Because progression occurred before death, rwPFS runs from treatment start to the progression date. The analyst records 112 days as the rwPFS time and flags it as an event (not censored).",
      "dataset": {
        "caption": "One-row-per-patient analytic table showing the three key dates pulled from the EHR for this patient.",
        "columns": [
          "person_id",
          "index_date",
          "progression_date",
          "death_date",
          "abstraction_source"
        ],
        "rows": [
          [
            2201,
            "2023-01-10",
            "2023-05-02",
            "2023-08-14",
            "radiology report 2023-05-02: new mediastinal nodes"
          ]
        ]
      },
      "steps": [
        "Identify the index date: the date of first administration of the line-defining agent, 2023-01-10. This is day zero for the follow-up clock.",
        "Identify the progression date: the abstractor read the 2023-05-02 CT report and recorded it as real-world progression based on language indicating new nodal involvement. The assessment date (2023-05-02) becomes the progression date — the actual biologic onset is unknown.",
        "Identify the death date from the linked mortality record: 2023-08-14.",
        "Apply the composite rule: rwPFS event = the earlier of progression date and death date. Here 2023-05-02 (progression) comes before 2023-08-14 (death), so progression is the event.",
        "Calculate rwPFS time: from 2023-01-10 to 2023-05-02 = 112 days.",
        "Set the event flag to 1 (an event occurred — progression was reached before the end of follow-up). The patient is NOT censored."
      ],
      "result": "rwPFS = 112 days (event = 1, progression-first). Death occurred 104 days later at day 216 from index but does not change the rwPFS calculation because progression was already the earlier event.",
      "timeline_spec": {
        "title": "rwPFS for one advanced NSCLC patient: treatment start to real-world progression or death",
        "window": {
          "start": "2023-01-10",
          "end": "2023-08-14",
          "label": "Observation window: treatment start to death (day 0 to day 216)"
        },
        "events": [
          {
            "label": "Treatment start (index date)",
            "start": "2023-01-10",
            "marker_day": 0,
            "quantity": "Day 0 — first immunotherapy administration"
          },
          {
            "label": "CT scan abstracted as real-world progression",
            "start": "2023-05-02",
            "marker_day": 112,
            "quantity": "Day 112 — radiology report: new mediastinal nodes (rwP event)"
          },
          {
            "label": "Death",
            "start": "2023-08-14",
            "marker_day": 216,
            "quantity": "Day 216 — patient death (after rwP already recorded)"
          }
        ],
        "spans": [
          {
            "kind": "followup",
            "start": "2023-01-10",
            "end": "2023-05-02",
            "label": "Progression-free time: 112 days (rwPFS)"
          },
          {
            "kind": "exposed",
            "start": "2023-05-02",
            "end": "2023-08-14",
            "label": "Post-progression survival: 104 days (not part of rwPFS)"
          }
        ],
        "result": {
          "label": "rwPFS = 112 days (event = progression on day 112); death at day 216 does not shorten rwPFS because progression came first",
          "rwpfs_days": 112,
          "event_type": "progression",
          "death_day": 216
        },
        "caption": "The horizontal bar spans from treatment start (day 0, 2023-01-10) to death (day 216, 2023-08-14). The orange marker at day 112 (2023-05-02) is the abstracted real-world progression event from the CT report. The green span from day 0 to day 112 is the progression-free interval and equals the patient's rwPFS. The gray span from day 112 to day 216 is post-progression survival — it is observed in the data but is not counted in rwPFS.",
        "alt_text": "A single horizontal patient timeline from day 0 (2023-01-10) to day 216 (2023-08-14). An orange event marker at day 112 labels the abstracted real-world progression event from the CT scan report. A green bar spanning day 0 to day 112 is labeled 112-day rwPFS. A gray bar spanning day 112 to day 216 is labeled post-progression survival. A red marker at day 216 labels the death event."
      }
    },
    "prerequisites": [
      "outcome-algorithm-construction-rwe",
      "cumulative-incidence-risk-rwe",
      "competing-risks-cause-specific-fine-gray-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Radiology-anchored abstracted rwP",
        "description": "A trained abstractor sets the progression date from radiology reports (and supporting notes), anchoring the event to imaging language indicating growth/new lesions.",
        "edge_cases": [
          "External scans done outside the network never enter the chart, so progression is missed or delayed.",
          "Reports stating 'mixed response' or 'stable, possible progression' force abstractor judgment; require adjudication rules.",
          "The recorded date is the scan/assessment date, not the biologic onset, so events are interval-censored."
        ],
        "data_source_notes": "ehr: capture assessment date and progression date separately; report inter-abstractor agreement and scan frequency by arm to detect surveillance bias."
      },
      {
        "name": "Clinician-anchored abstracted rwP",
        "description": "The progression date is taken from the oncologist's note-level assessment (clinical impression of progression), which may precede or lack a confirming scan.",
        "edge_cases": [
          "Captures clinical progression without imaging (an advantage) but is more subjective and harder to standardize across sites.",
          "May conflate symptomatic deterioration from comorbidity with oncologic progression."
        ],
        "data_source_notes": "ehr: pair with radiology-anchored definition and report concordance; pre-specify which anchor is primary for the estimand."
      },
      {
        "name": "Treatment-pattern proxy (next-line / discontinuation)",
        "description": "Progression is inferred from a coded new line of systemic therapy or treatment discontinuation, computable from claims without chart text.",
        "edge_cases": [
          "Conflates progression with toxicity, patient choice, financial barriers, and formulary-driven switching.",
          "Systematically lags the true progression date by the treatment-decision interval; biases the contrast in favor of arms with fewer downstream options.",
          "Requires continuous, non-MA-only enrollment so absence of a next line is observed rather than missing."
        ],
        "data_source_notes": "claims: define next line as first non-index regimen after a gap; never use as the primary rwPFS definition for a comparative claim, only as feasibility or sensitivity."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Real-world overall survival (rwOS)",
        "pros_of_this": "Earlier signal and larger effect sizes; an intermediate endpoint when OS is slow or confounded by post-progression therapy.",
        "cons_of_this": "Inherits abstraction error, arm-dependent assessment cadence (surveillance bias), and interval censoring; weaker rwPFS-rwOS correlation than trial PFS-OS.",
        "when_to_prefer": "As a supportive/intermediate endpoint alongside rwOS, with a demonstrated rwPFS-rwOS association in the same source; use rwOS as the robust anchor."
      },
      {
        "compared_to": "Trial RECIST progression-free survival",
        "pros_of_this": "Generalizable to real-world populations and far cheaper; reflects routine-care monitoring rather than protocol scans.",
        "cons_of_this": "No fixed scan schedule, no central/blinded read, no RECIST lesion rules; rwP is not RECIST and must not be relabeled as such.",
        "when_to_prefer": "External comparators, label-expansion context, and HTA effectiveness arguments where the trial population is not the question; use trial PFS for efficacy claims."
      },
      {
        "compared_to": "Treatment-pattern proxy (next-line of therapy)",
        "pros_of_this": "Captures actual radiographic/clinical progression rather than a switching decision; less confounded by toxicity, preference, and formulary.",
        "cons_of_this": "Requires abstractable chart/report text and is more expensive; abstraction introduces inter-rater variability.",
        "when_to_prefer": "Whenever radiology/clinician text is available; reserve next-line proxies for claims-only feasibility or as a sensitivity cross-check."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "No native progression signal. Best surrogate is a coded new line of systemic therapy or new metastatic-site diagnosis, both lagged and noisy. Require continuous medical + pharmacy enrollment; exclude or flag Medicare Advantage-only person-time where encounter data are incomplete and can fabricate progression-free intervals. Account for Part B (infusion) vs Part D (oral) routing of the next-line signal.",
      "ehr": "rwP from trained abstractors or NLP over radiology/clinician notes. Capture assessment date and progression date separately; events are interval-censored on an irregular, visit-driven, arm-dependent grid. Use dual abstraction with adjudication; report inter-abstractor agreement and scan frequency by arm to detect surveillance bias; treat external-care leakage as informative.",
      "registry": "Often has adjudicated progression and stage (strong for the definition) but sparse longitudinal imaging and incomplete therapy history. Link to claims for treatment lines and to a death index for the composite's death component.",
      "linked": "Ideal substrate: EHR text for rwP, claims for complete lines and a next-line cross-check, and NDI/obituary mortality for the death component. Reconcile order/scan/service-date discrepancies before assigning progression and index dates; account for linkage selection."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\nimport numpy as np\nfrom lifelines import CoxPHFitter\n\ndef build_rwpfs(cohort: pd.DataFrame, rwp: pd.DataFrame,\n                death: pd.DataFrame, lastvis: pd.DataFrame) -> pd.DataFrame:\n    df = (cohort\n          .merge(rwp[[\"person_id\", \"progression_date\"]], on=\"person_id\", how=\"left\")\n          .merge(death[[\"person_id\", \"death_date\"]], on=\"person_id\", how=\"left\")\n          .merge(lastvis[[\"person_id\", \"last_assessment_date\"]], on=\"person_id\", how=\"left\"))\n\n    # Composite event date = earliest of progression or death (NaT if neither occurred).\n    ev = df[[\"progression_date\", \"death_date\"]].min(axis=1)\n\n    # Event flag and the date used for the time calculation:\n    #   - if an event occurred, use the event date;\n    #   - otherwise censor at the LAST CONFIRMED ASSESSMENT (not last-any-contact),\n    #     so someone who left the system is not carried as progression-free forever.\n    df[\"rwpfs_event\"] = ev.notna().astype(int)\n    end_date = ev.where(ev.notna(), df[\"last_assessment_date\"])\n\n    df[\"rwpfs_days\"] = (end_date - df[\"index_date\"]).dt.days\n    df = df[df[\"rwpfs_days\"] >= 0]  # drop pre-index anomalies (date-reconciliation errors)\n\n    # For a competing-risks decomposition (rwTTP), classify by which date is EARLIEST:\n    #   progression first (or death missing) -> 1; death first (or progression missing) -> 2.\n    prog_first = df[\"progression_date\"].notna() & (\n        df[\"death_date\"].isna() | (df[\"progression_date\"] <= df[\"death_date\"]))\n    death_first = df[\"death_date\"].notna() & (\n        df[\"progression_date\"].isna() | (df[\"death_date\"] < df[\"progression_date\"]))\n    df[\"event_type\"] = np.select(\n        [prog_first, death_first],\n        [1, 2], default=0)  # 1=progression, 2=competing death, 0=censored\n    return df\n\n# Cause-specific Cox for the COMPOSITE rwPFS (progression OR death as the event).\nrwpfs = 
build_rwpfs(cohort, rwp, death, lastvis)\nmodel = rwpfs.merge(covariates, on=\"person_id\")  # covariates measured in the pre-index window only\ncph = CoxPHFitter()\ncph.fit(model[[\"rwpfs_days\", \"rwpfs_event\", \"arm\", *covariate_cols]],\n        duration_col=\"rwpfs_days\", event_col=\"rwpfs_event\")\ncph.print_summary()",
        "description": "Construct rwPFS time and event from abstracted progression, linked mortality, and disease-assessment dates, then\nfit a cause-specific Cox model for the composite. Required inputs (cleaned, one row per person unless noted):\n  cohort  : person_id, index_date (datetime, first administration of the line-defining agent), arm\n  rwp     : person_id, progression_date (datetime; abstractor- or NLP-confirmed rwP)        # may be absent\n  death   : person_id, death_date (datetime; from a linked mortality index, NOT last-EHR-activity)\n  lastvis : person_id, last_assessment_date (datetime; last confirmed disease assessment / scan)\nrwPFS event = earlier of rwP or death; event-free patients are censored at their last confirmed assessment, not at\nlast-any-contact. Build covariates only from the pre-index window before fitting.",
        "dependencies": [
          "pandas",
          "numpy",
          "lifelines"
        ],
        "source_citations": [
          "griffith-2019",
          "austin-2014"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table); library(survival); library(cmprsk)\n\nbuild_rwpfs <- function(cohort, rwp, death, lastvis) {\n  setDT(cohort); setDT(rwp); setDT(death); setDT(lastvis)\n  df <- Reduce(function(a, b) merge(a, b, by = \"person_id\", all.x = TRUE),\n               list(cohort,\n                    rwp[, .(person_id, progression_date)],\n                    death[, .(person_id, death_date)],\n                    lastvis[, .(person_id, last_assessment_date)]))\n\n  df[, ev_date := pmin(progression_date, death_date, na.rm = TRUE)]      # earliest event, NA if none\n  df[, rwpfs_event := as.integer(!is.na(ev_date))]\n  df[, end_date := fifelse(is.na(ev_date), last_assessment_date, ev_date)]  # censor at last assessment\n  df[, rwpfs_days := as.integer(end_date - index_date)]\n  df <- df[rwpfs_days >= 0]\n\n  # Competing-risks status for the rwTTP decomposition, classified by the EARLIEST date:\n  #   progression first (or death missing) = 1; death first (or progression missing) = 2; else censored.\n  df[, event_type := fifelse(\n        !is.na(progression_date) & (is.na(death_date) | progression_date <= death_date), 1L,\n        fifelse(!is.na(death_date) & (is.na(progression_date) | death_date < progression_date), 2L, 0L))]\n  df[]\n}\n\nrwpfs <- build_rwpfs(cohort, rwp, death, lastvis)\nmodel <- merge(rwpfs, covariates, by = \"person_id\")   # pre-index covariates only\n\n# Composite rwPFS: cause-specific Cox (progression OR death is the event).\ncoxph(Surv(rwpfs_days, rwpfs_event) ~ arm + ., data = model[, !c(\"person_id\",\"event_type\")])\n\n# Supportive rwTTP: Fine-Gray subdistribution hazard, death (code 2) as the competing event.\ncrr(ftime = model$rwpfs_days, fstatus = model$event_type,\n    cov1 = model.matrix(~ arm, model)[, -1, drop = FALSE], failcode = 1, cencode = 0)",
        "description": "Same rwPFS construction in R: composite event = earlier of rwP or death, censoring event-free patients at the last\nconfirmed assessment. Fits the composite with survival::coxph and the progression-vs-competing-death cumulative\nincidence with the Fine-Gray subdistribution model (cmprsk::crr). Inputs mirror the Python version\n(cohort, rwp, death, lastvis as data.frames with Date columns).",
        "dependencies": [
          "data.table",
          "survival",
          "cmprsk"
        ],
        "source_citations": [
          "griffith-2019",
          "austin-fine-2017"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* Build rwPFS time, the composite event flag, and the competing-risks status. */\nproc sql;\n  create table rwpfs as\n  select c.person_id, c.index_date, c.arm,\n         p.progression_date, d.death_date, v.last_assessment_date,\n         /* earliest of progression or death = composite event date (missing if neither) */\n         min(p.progression_date, d.death_date) as ev_date format=date9.\n  from work.cohort c\n    left join work.rwp     p on c.person_id = p.person_id\n    left join work.death   d on c.person_id = d.person_id\n    left join work.lastvis v on c.person_id = v.person_id;\nquit;\n\ndata rwpfs;\n  set rwpfs;\n  rwpfs_event = (ev_date ne .);\n  /* censor event-free patients at the LAST CONFIRMED ASSESSMENT, not last-any-contact */\n  if ev_date ne . then end_date = ev_date;\n  else                  end_date = last_assessment_date;\n  rwpfs_days = end_date - index_date;\n  if rwpfs_days < 0 then delete;   /* drop date-reconciliation anomalies */\n\n  /* competing-risks status for the rwTTP decomposition, classified by the EARLIEST date */\n  if      progression_date ne . and (death_date = . or progression_date <= death_date) then event_type = 1; /* progression first */\n  else if death_date ne . and (progression_date = . or death_date < progression_date)  then event_type = 2; /* competing death first */\n  else                                                                                      event_type = 0; /* censored */\nrun;\n\nproc sql;  /* attach pre-index covariates only */\n  create table analytic as\n  select a.*, b.* from rwpfs a left join work.cov b on a.person_id = b.person_id;\nquit;\n\n/* Composite rwPFS: cause-specific Cox (progression OR death is the event). */\nproc phreg data=analytic;\n  class arm(ref='B');\n  model rwpfs_days*rwpfs_event(0) = arm <baseline covariates> / rl ties=efron;\nrun;\n\n/* Supportive rwTTP: Fine-Gray subdistribution hazard, death (2) as the competing event. 
*/\nproc phreg data=analytic;\n  class arm(ref='B');\n  model rwpfs_days*event_type(0) = arm <baseline covariates> / eventcode=1 rl;\nrun;\n\n/* Cumulative incidence functions of progression vs competing death, by arm. */\nproc lifetest data=analytic plots=cif(test);\n  time rwpfs_days*event_type(0) / eventcode=1;\n  strata arm;\nrun;",
        "description": "rwPFS construction in PROC SQL/data step, then the composite with PROC PHREG and the progression cumulative\nincidence (death as a competing risk) with PROC PHREG eventcode= (Fine-Gray) and PROC LIFETEST plots=cif. Required\ninput datasets (post data-management, one row per person):\n  work.cohort  : person_id, index_date, arm\n  work.rwp     : person_id, progression_date   (missing if no abstracted progression)\n  work.death   : person_id, death_date         (from a linked mortality index)\n  work.lastvis : person_id, last_assessment_date\n  work.cov     : person_id + pre-index covariates measured in [index_date - lookback, index_date]",
        "dependencies": [],
        "source_citations": [
          "griffith-2019",
          "austin-fine-2017"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": "real-world-progression-rwpfs-rwe-timeline.svg",
        "mermaid": null,
        "caption": "The horizontal bar spans from treatment start (day 0, 2023-01-10) to death (day 216, 2023-08-14). The orange marker at day 112 (2023-05-02) is the abstracted real-world progression event from the CT report. The green span from day 0 to day 112 is the progression-free interval and equals the patient's rwPFS. The gray span from day 112 to day 216 is post-progression survival — it is observed in the data but is not counted in rwPFS.",
        "alt_text": "A single horizontal patient timeline from day 0 (2023-01-10) to day 216 (2023-08-14). An orange event marker at day 112 labels the abstracted real-world progression event from the CT scan report. A green bar spanning day 0 to day 112 is labeled 112-day rwPFS. A gray bar spanning day 112 to day 216 is labeled post-progression survival. A red marker at day 216 labels the death event.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Idx[Index / time zero<br/>first administration of line-defining agent] --> Asmt[Disease assessments<br/>irregular, visit-driven scans + notes]\n  Asmt --> Abs[Abstraction or NLP<br/>radiology-anchored or clinician-anchored rwP]\n  Abs --> Prog{rwP event?}\n  Prog -->|yes| EvP[Progression date]\n  Prog -->|no| Dth{Death first?}\n  Dth -->|yes, linked mortality index| EvD[Death date]\n  Dth -->|no| Cens[Censor at LAST confirmed assessment<br/>not last-any-contact]\n  EvP --> Comp[rwPFS event = earlier of rwP or death]\n  EvD --> Comp\n  Comp --> Est[Estimand:<br/>composite -> cause-specific Cox;<br/>rwTTP decomposition -> CIF / Fine-Gray, death competing]\n  Cens --> Est",
        "caption": "Construction of rwPFS from index date through irregular assessments, abstracted progression, linked mortality, and the censoring rule, into the composite event and the pre-specified estimand.",
        "alt_text": "Flowchart from index date through disease assessments and abstraction to a progression-or-death composite event with a competing-risks decomposition branch.",
        "source_type": "illustrative",
        "source_citations": [
          "griffith-2019"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  A[Death without prior progression] -->|censored in naive KM 'time to progression'| B[Overstates progression incidence<br/>= dangerous when one arm dies earlier]\n  A -->|treated as competing event in CIF / Fine-Gray| C[Observed real-world progression incidence<br/>= correct estimand]\n  D[Composite rwPFS<br/>progression OR death] -->|death IS an event| E[Cause-specific Cox / KM is appropriate]",
        "caption": "The estimand trap when decomposing rwPFS. For the composite, death is an event and a cause-specific model is fine; for the rwTTP component, death is a competing risk and censoring it overstates progression.",
        "alt_text": "Diagram contrasting naive censoring of competing deaths against a cumulative incidence function for the progression component of rwPFS.",
        "source_type": "illustrative",
        "source_citations": [
          "austin-fine-2017"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "see_also",
        "target_slug": "competing-risks-cause-specific-fine-gray-rwe",
        "notes": "Decomposing rwPFS into real-world time to progression makes death-without-progression a competing event; use a cumulative incidence function rather than censoring competing deaths."
      },
      {
        "relation_type": "see_also",
        "target_slug": "immortal-time-bias-handling",
        "notes": "Aligning the rwPFS index to first administration (not a confirmed second cycle) avoids immortal time at line start."
      },
      {
        "relation_type": "used_with",
        "target_slug": "target-trial-emulation",
        "notes": "rwPFS is a common endpoint in oncology target-trial emulations and single-arm external control arms; the estimand and time-zero must follow the emulated protocol."
      },
      {
        "relation_type": "used_with",
        "target_slug": "propensity-score-methods-psm-iptw",
        "notes": "Comparative rwPFS contrasts are estimated within a propensity-score-balanced cohort using pre-index covariates."
      },
      {
        "relation_type": "is_variant_of",
        "target_slug": "outcome-algorithm-construction-rwe",
        "notes": "rwP/rwPFS is a specific oncology instance of constructing an outcome from abstraction, NLP, or claims proxies rather than a single coded field."
      }
    ],
    "aliases": [
      "rwPFS",
      "real-world progression-free survival",
      "real-world progression",
      "rwP"
    ],
    "complexity": "advanced",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "recurrent-events-analysis-rwe",
    "name": "Recurrent Events Analysis",
    "short_definition": "A family of methods for outcomes that can occur repeatedly within a patient (exacerbations, hospitalizations, infections, hypoglycemia, falls, relapses) that model the full event process - rates, gap times, or the mean cumulative function - rather than discarding all events after the first, with explicit handling of within-person dependence and informative terminal events such as death.",
    "long_description": "**Recurrent-event outcomes** are events that a single patient can experience more than once over follow-up: COPD/asthma\nexacerbations, heart-failure hospitalizations, sickle-cell pain crises, infections under immunosuppression, severe\nhypoglycemia, falls, seizures, and most healthcare-resource-utilization (HCRU) endpoints. The naive habit of analyzing\nonly *time to first event* throws away the majority of the information and, more importantly, answers a different question\nthan payers and clinicians are asking. A drug that does not delay the first exacerbation but halves the long-run\nexacerbation *rate* will look null in a first-event Cox model and clearly beneficial in a recurrent-event model. The\nchoice of method is therefore an estimand decision, not a software preference.\n\n**Core estimand distinction** Recurrent-event analyses split into three estimand families that are NOT interchangeable.\n(1) *Rate / intensity*: how frequently events occur per unit person-time. The marginal **rate** (Lin-Wei-Yang-Ying, LWYY)\ntargets the population-averaged event rate and is robust to unspecified within-person dependence; the **intensity**\n(Andersen-Gill, AG) conditions on the prior event history through the risk set. Both yield a rate/hazard ratio. (2)\n*Gap-time / conditional*: time between successive events, modeled with Prentice-Williams-Peterson (PWP) stratified by\nevent number - this answers \"given you have had k events, what is the effect on time to the (k+1)th?\" and is appropriate\nwhen biological risk genuinely changes after each event (e.g., post-MI). 
(3) *Absolute burden*: the **mean cumulative\nfunction** (MCF) / expected cumulative number of events by time t (Nelson-Aalen-type estimator, or the\nGhosh-Lin/Cook-Lawless MCF in the presence of death), which is the most communicable quantity for clinical and HTA\naudiences because it is on the natural scale of \"events per patient.\" Crucially, AG and LWYY share the same point\nestimate of the rate ratio but differ in variance: LWYY uses a robust sandwich variance that does not require the\nAG independent-increments assumption, which is almost always violated in chronic disease (patients with one exacerbation\nare prone to more). For most RWE rate questions, LWYY (or a negative-binomial rate model) is the defensible default;\nreserve AG for when the event-history dependence is itself of interest, and PWP for ordered gap-time questions.\n\n**Pros, cons, and trade-offs**\n- **vs time-to-first-event Cox (cox-ph-regression):** Recurrent-event models use the entire event process and match\n  burden/HCRU estimands; first-event Cox discards all subsequent events and can be badly underpowered or even sign-wrong\n  when the treatment acts on rate rather than on time-to-first. Cost: more data engineering (counting-process intervals),\n  a harder-to-explain effect measure, and explicit assumptions about within-person dependence and terminal events.\n  **Prefer recurrent-event methods** whenever the second-and-later events carry clinical or economic weight.\n- **vs simple Poisson/negative-binomial count models (poisson-negative-binomial-count-models):** A negative-binomial\n  rate model with a log person-time offset IS a recurrent-event method and is often the right first choice for total\n  burden - it absorbs overdispersion (the empirical signature of within-person clustering) and gives a clean rate ratio.\n  Its limitation is that it collapses the process to a single count and cannot represent event *timing*, time-varying\n  exposure, or the shape of risk over follow-up. 
**Prefer AG/LWYY/PWP** when timing, time-updated covariates, or the\n  risk trajectory matter; **prefer negative binomial** for a transparent, payer-friendly total-burden summary.\n- **vs composite \"first hospitalization or death\" endpoints:** The composite forces a single first event and treats a\n  hospitalization as equivalent to death; recurrent-event-with-terminal-event methods (joint frailty, Ghosh-Lin,\n  while-alive estimands) keep them distinct and let death act as the informative truncation it is. Cost: more complex\n  modeling and stronger reliance on a complete death source.\n- **vs negative-binomial without competing-risk handling:** Ignoring death is the central trap. If the more effective\n  arm keeps frailer patients alive longer, those survivors accrue *more* events, and a naive rate model can make the\n  better drug look worse. **Prefer joint frailty or a while-alive / MCF-with-death estimand** whenever mortality is\n  non-trivial and plausibly differential by arm.\n\n**When to use** Chronic relapsing-remitting diseases where recurrence is the natural disease course (COPD, asthma, IBD,\nMS, heart failure, sickle cell); HCRU and cost-driver endpoints (all-cause and cause-specific admissions, ED visits,\nrescue-medication bursts); safety surveillance of repeatable adverse events (hypoglycemia, infections, bleeds); and any\nestimand that a payer or clinician would phrase as \"events per patient per year\" rather than \"probability of ever having\none.\" Use the MCF for the headline communicable result and a rate model (LWYY or negative binomial) for the adjusted\neffect estimate.\n\n**When NOT to use — and when it is actively misleading or dangerous**\n- **The outcome is genuinely a first/terminal event** (death, first stroke as a one-time terminal endpoint, first MI in\n  a primary-prevention question where you only care about onset). 
Forcing a recurrent-event frame here invents a process\n  that does not exist.\n- **Death is common and differential by arm but you use a naive rate/AG model.** This is the dangerous case: the\n  informative terminal event biases the rate ratio, often *toward harm for the more effective drug* via the\n  survivor-accrual mechanism above. If you cannot model death jointly, you must at minimum present a while-alive estimand\n  or restrict to a window where mortality is negligible - and say so.\n- **Within-person dependence is ignored.** Using ordinary (model-based) Cox or Poisson standard errors on stacked\n  intervals understates variance because events within a person are correlated; robust/sandwich variance (cluster on\n  person_id) or a frailty term is mandatory, not optional.\n- **Events cannot be cleanly delimited.** If your data source cannot separate one episode from its own follow-up claims\n  (e.g., a hospitalization plus its readmission transfer, or a 30-day steroid taper recorded as daily fills), the \"event\n  count\" is an artifact of coding, and any rate ratio inherits that artifact.\n\n**Data-source operational depth**\n- **Claims (FFS):** The workhorse substrate, but every event must be built from raw claims into clinical episodes. A\n  single hospitalization generates a facility claim plus multiple professional and DME claims with different service\n  dates; collapse them, and stitch inter-facility *transfers* (discharge-to-admit gap of 0-1 days) into one event or you\n  will count a transfer as a \"recurrent\" admission. Apply a setting-specific **clean window** (e.g., a moderate COPD\n  exacerbation requires a steroid/antibiotic burst with no qualifying event in the prior 14 days) so that the medication\n  refills sustaining one episode are not read as new events. 
Person-time (the offset) must end exactly at disenrollment,\n  death, or study end - extending it past disenrollment fabricates exposure with no observable events and dilutes the\n  rate.\n- **Claims (Medicare Advantage):** MA encounter data are notoriously incomplete and inconsistently submitted across\n  plans, so events are differentially undercounted; restrict rate analyses to fee-for-service (Parts A/B with Part D for\n  drug-defined events) and exclude MA-only person-time, or recurrence rates will look lower because of missing encounters\n  rather than because of true clinical benefit.\n- **Competing risks in elderly claims:** In older or sicker cohorts, death rates differ by exposure, so the censoring of\n  the recurrent process is informative AND differential. A rate model that treats death as ordinary administrative\n  censoring will mis-rank the arms. Carry a reliable mortality source (Medicare vital status, the limited Part D death\n  flag, or linked NDI) and use a death-aware estimand.\n- **Immortal time in procedure/initiation studies:** If follow-up (and thus the at-risk person-time for recurrence)\n  starts at a landmark that the patient had to *survive event-free* to reach (e.g., counting readmissions only among\n  those who survived the index surgery, with time zero set at discharge but exposure defined post-discharge), the\n  interval before exposure is immortal and inflates the comparator's apparent event-free time. Align time zero to the\n  exposure decision and start counting events from there.\n- **EHR:** Events are encounters, labs, vitals, notes, or rescue orders. Visit frequency is itself outcome-correlated\n  (sicker patients visit more), so raw event counts confound disease severity with capture intensity - this is\n  informative-observation (visit-process) bias. Restrict to unambiguous severe events, model the visit process, or use\n  inverse-intensity-of-observation weighting; never treat \"more recorded events\" as \"more true events.\"\n- **Registry / linked:** Registries give adjudicated, clean events (the numerator) but usually miss out-of-registry\n  utilization and may have irregular assessment schedules that create panel-count rather than exact-time data. Link to\n  claims for complete hospitalization burden and to a death index for the terminal event; reconcile registry visit dates\n  against claim service dates before building intervals.\n\n**Worked claims example.** Question: does drug A vs active comparator B reduce the rate of moderate-or-severe COPD\nexacerbations among new initiators in a 100% Medicare FFS sample (Parts A/B/D)? (1) Eligibility: age >=40, >=2 COPD\ndiagnoses (J44.x), and 365 days of continuous A/B/D enrollment before index_date (first qualifying fill of A or B;\npatients must be new to both drugs). (2) Event definition: a *severe* exacerbation = an inpatient or ED claim with a COPD principal/first\ndiagnosis; a *moderate* exacerbation = an outpatient/ED claim for COPD accompanied by a Part D fill of a systemic\ncorticosteroid and/or a COPD-relevant antibiotic within ±5 days. (3) Episode cleaning: collapse facility +\nprofessional claims with overlapping or adjacent service dates into one event; stitch transfers (admit within 1 day of a\nprior discharge) into the same episode; impose a 14-day clean window so the steroid taper sustaining one episode and any\nimmediate follow-up visit are not counted as new events. (4) Counting-process layout: for each person_id build\nstart-stop rows from index_date with one row ending at each cleaned event date (event=1) and a final administrative row\n(event=0) ending at the earliest of disenrollment, death, or study end; carry days_supply-derived on-treatment status as\na time-varying covariate if an as-treated estimand is wanted. (5) Person-time = sum of (tstop-tstart); it must terminate\nat the FFS-observable end, never at end-of-data for a patient who left FFS earlier. (6) Estimation: headline result = MCF\nof exacerbations by month, by arm, accounting for death (Ghosh-Lin); adjusted effect = LWYY marginal rate model (or a\nnegative-binomial rate model with log person-time offset) with robust variance clustered on person_id and a\nhigh-dimensional propensity score or PS weights. (7) Because COPD patients with severe disease both exacerbate and die\nmore, fit a joint frailty model or present a while-alive exacerbation rate as the primary death-aware sensitivity\nanalysis, and report attrition at every step.\n\n**Interpreting the output**\n\nAn LWYY marginal rate model of COPD exacerbations returns: rate ratio = 0.73 (95% CI 0.58–0.92) for drug A vs comparator B, with robust sandwich variance clustered on patient.\n\n*Formal interpretation.* The rate ratio of 0.73 estimates that the marginal mean exacerbation rate in the drug A group is approximately 73% of the rate in the comparator B group, averaged over all patients and follow-up time. The robust variance accounts for within-person correlation: each patient's events are not independent, so standard Poisson or Cox standard errors would be anti-conservative. Terminal events (death) create informative censoring: patients who die can no longer exacerbate, compressing observed rates in higher-mortality groups. This is why a while-alive rate or joint frailty model is reported alongside as a sensitivity analysis. The rate ratio summarizes the full event burden across follow-up, not merely the time to first event.\n\n*Practical interpretation.* A rate ratio of 0.73 means the drug A group experiences roughly 27% fewer exacerbations per unit of follow-up time, a burden-of-disease summary more policy-relevant than a time-to-first-event HR when the disease is relapsing-remitting. Pair this with the mean cumulative function (MCF) plot by arm to visualize diverging event burden over time and confirm the rate ratio is not driven by a single early episode or differential dropout rather than a sustained reduction in recurrence.",
    "primary_category": "Inferential_Statistics",
    "tags": [
      "recurrent-events",
      "counting-process",
      "andersen-gill",
      "prentice-williams-peterson",
      "mean-cumulative-function",
      "negative-binomial",
      "joint-frailty",
      "terminal-event",
      "hcru",
      "exacerbations",
      "hospitalizations"
    ],
    "applies_to_study_types": [
      "cohort_retrospective",
      "cohort_prospective",
      "claims_analysis",
      "ehr_study",
      "pragmatic_trial"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "linked",
      "registry"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1214/aos/1176345976",
        "url": "https://doi.org/10.1214/aos/1176345976",
        "citation_text": "Andersen PK, Gill RD. Cox's regression model for counting processes: a large sample study. Annals of Statistics. 1982;10(4):1100-1120.",
        "year": 1982,
        "authors_short": "Andersen & Gill",
        "notes": "Foundational counting-process formulation of the Cox model that underpins Andersen-Gill recurrent-event intensity models and time-varying covariates."
      },
      {
        "role": "explain",
        "doi": "10.1093/ije/dyu222",
        "url": "https://doi.org/10.1093/ije/dyu222",
        "citation_text": "Amorim LDAF, Cai J. Modelling recurrent events: a tutorial for analysis in epidemiology. International Journal of Epidemiology. 2015;44(1):324-333.",
        "year": 2015,
        "authors_short": "Amorim & Cai",
        "notes": "Epidemiology-facing tutorial contrasting AG, PWP, WLW, and rate/mean-function approaches with practical guidance on when each estimand applies."
      },
      {
        "role": "demonstrate",
        "doi": "10.1093/biomet/68.2.373",
        "url": "https://doi.org/10.1093/biomet/68.2.373",
        "citation_text": "Prentice RL, Williams BJ, Peterson AV. On the regression analysis of multivariate failure time data. Biometrika. 1981;68(2):373-379.",
        "year": 1981,
        "authors_short": "Prentice et al.",
        "notes": "Introduces the ordered (gap-time and total-time) failure-time models, event-number stratification used in PWP recurrent-event analyses."
      },
      {
        "role": "use",
        "doi": "10.1111/1467-9868.00259",
        "url": "https://doi.org/10.1111/1467-9868.00259",
        "citation_text": "Lin DY, Wei LJ, Yang I, Ying Z. Semiparametric regression for the mean and rate functions of recurrent events. Journal of the Royal Statistical Society Series B. 2000;62(4):711-730.",
        "year": 2000,
        "authors_short": "Lin et al.",
        "notes": "The LWYY marginal rate/mean-function model with robust variance, the defensible default for population-averaged recurrent-event rates and absolute burden in RWE."
      }
    ],
    "plain_language_summary": "Recurrent events analysis counts every time an event happens to a patient — not just the first time — so you get the full picture of how often a disease flares up over a follow-up period. A patient with COPD may land in the hospital three times in one year; a method that only looks at the first hospitalization throws away two-thirds of the signal and may miss a drug that cuts the repeat rate in half. These methods track each patient from their start date, note every qualifying event in order, and then estimate an event rate or a cumulative event count for the whole group. The main honest limitation is that death can stop events from occurring, so a drug that keeps very sick patients alive longer may appear to cause more events in a naive analysis.",
    "key_terms": [
      {
        "term": "recurrent event",
        "definition": "An outcome that can happen more than once to the same patient over follow-up, such as hospitalizations, disease flares, or infections."
      },
      {
        "term": "Andersen-Gill model",
        "definition": "A counting-process Cox model where each time a patient is at risk for the next event becomes its own data row, allowing all repeat events to contribute to the estimate."
      },
      {
        "term": "mean cumulative function",
        "definition": "A curve showing the expected total number of events per patient by each point in time, read directly as events-per-person rather than as a probability."
      },
      {
        "term": "censoring",
        "definition": "When a patient leaves the study early — because they lost insurance, the study ended, or they died — so their remaining follow-up time is no longer observable."
      },
      {
        "term": "person-time",
        "definition": "The total days or years a patient is actually observable and at risk; event rates divide event counts by this denominator."
      }
    ],
    "worked_example": {
      "scenario": "Patient 3041 has COPD and is followed for one full calendar year (January 1 through December 31, 2024) after starting a new inhaler. During that year the patient has three moderate-to-severe COPD exacerbations. We want to know the patient's annual exacerbation rate. A first-event-only analysis would record only the March flare-up and then stop watching — yielding a single event. The recurrent-events approach keeps watching and captures all three, giving an event rate of 3.0 exacerbations per person-year.",
      "dataset": {
        "caption": "Counting-process layout for patient 3041: one row per at-risk interval, each ending at the next exacerbation or the administrative end of follow-up. This is the format the Andersen-Gill model actually reads.",
        "columns": [
          "person_id",
          "interval_start_day",
          "interval_end_day",
          "event",
          "event_number"
        ],
        "rows": [
          [
            3041,
            0,
            70,
            1,
            1
          ],
          [
            3041,
            70,
            190,
            1,
            2
          ],
          [
            3041,
            190,
            280,
            1,
            3
          ],
          [
            3041,
            280,
            365,
            0,
            4
          ]
        ]
      },
      "steps": [
        "Follow-up begins on day 0 (2024-01-01). The patient is at risk until the first exacerbation.",
        "The first exacerbation occurs on day 70 (2024-03-11). Row 1 ends here with event = 1.",
        "The patient re-enters the risk set immediately. The second exacerbation occurs 120 days later on day 190 (2024-07-09). Row 2 ends here with event = 1.",
        "The patient re-enters the risk set again. The third exacerbation occurs 90 days later on day 280 (2024-10-07). Row 3 ends here with event = 1.",
        "No further exacerbations occur. Follow-up closes at day 365 (2024-12-31) when the study ends. Row 4 ends here with event = 0 (administrative censoring).",
        "Total person-time = 365 days = 1.00 person-year. Total events = 3. Event rate = 3 divided by 1.00 = 3.0 exacerbations per person-year.",
        "A first-event-only Cox model would have stopped at row 1 and recorded only 1 event in 70 days, missing the two later flare-ups entirely."
      ],
      "result": "3 exacerbations in 365 days of follow-up = 3.0 events per person-year. A first-event analysis using only the day-70 event would see 1 event in 0.19 person-years and miss two-thirds of this patient's disease burden.",
      "timeline_spec": {
        "title": "Recurrent COPD exacerbations for one patient over a 365-day follow-up (three events captured vs. one in a first-event analysis)",
        "window": {
          "start": "2024-01-01",
          "end": "2024-12-31",
          "label": "365-day observable follow-up"
        },
        "events": [
          {
            "label": "Exacerbation 1 (day 70)",
            "start": "2024-03-11",
            "length_days": 1,
            "quantity": "Event 1 of 3"
          },
          {
            "label": "Exacerbation 2 (day 190)",
            "start": "2024-07-09",
            "length_days": 1,
            "quantity": "Event 2 of 3"
          },
          {
            "label": "Exacerbation 3 (day 280)",
            "start": "2024-10-07",
            "length_days": 1,
            "quantity": "Event 3 of 3"
          }
        ],
        "spans": [
          {
            "kind": "followup",
            "start": "2024-01-01",
            "end": "2024-03-11",
            "label": "Interval 1: 70 days at risk"
          },
          {
            "kind": "followup",
            "start": "2024-03-11",
            "end": "2024-07-09",
            "label": "Interval 2: 120 days at risk"
          },
          {
            "kind": "followup",
            "start": "2024-07-09",
            "end": "2024-10-07",
            "label": "Interval 3: 90 days at risk"
          },
          {
            "kind": "followup",
            "start": "2024-10-07",
            "end": "2024-12-31",
            "label": "Interval 4: 85 days, censored"
          }
        ],
        "result": {
          "label": "3 events / 365 days = 3.0 events per person-year",
          "value": 3.0
        },
        "caption": "Each diamond marks one COPD exacerbation. The four horizontal spans show the four at-risk intervals that make up the counting-process data rows. A first-event analysis would stop at the first diamond and ignore the two later events, understating this patient's burden by two-thirds.",
        "alt_text": "A 365-day horizontal timeline for patient 3041 from January 1 to December 31, 2024. Three event markers appear at day 70 (March 11), day 190 (July 9), and day 280 (October 7), each labeled as an exacerbation. Four follow-up spans fill the gaps between events and the end of follow-up, illustrating the counting-process interval structure. A note indicates that a first-event-only analysis captures only the first marker and stops."
      }
    },
    "prerequisites": [
      "cox-ph-regression",
      "hcru-healthcare-resource-utilization",
      "poisson-negative-binomial-count-models"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Marginal rate model (LWYY) or negative-binomial rate",
        "description": "Population-averaged event rate over person-time with robust (sandwich) variance; shares the AG point estimate but does not require independent increments. The negative-binomial rate model with a log person-time offset is the GLM analogue, absorbing within-person overdispersion and giving a clean rate ratio for total burden.",
        "edge_cases": [
          "Collapses the process to a rate; cannot represent event timing or the shape of risk over follow-up.",
          "Naive version treats death as ordinary censoring and is biased when mortality is differential by arm."
        ],
        "data_source_notes": "claims: build clean episodes and end person-time at disenrollment/death; restrict to FFS person-time because MA encounter data undercount events."
      },
      {
        "name": "Andersen-Gill counting-process intensity model",
        "description": "Each at-risk interval is a start-stop row; models the recurrent-event intensity conditional on the past through the risk set. Natural home for time-varying exposure, adherence, or dose.",
        "edge_cases": [
          "Assumes independent increments (no within-person dependence beyond covariates); requires robust variance clustered on person or a shared frailty.",
          "Assumes a common baseline intensity unless stratified."
        ],
        "data_source_notes": "Ideal when exposure is time-updated from days_supply spans; demands careful counting-process interval construction."
      },
      {
        "name": "Prentice-Williams-Peterson (PWP) gap-time / total-time",
        "description": "Stratifies the risk set by event number so the (k+1)th event is modeled conditional on having had k; gap-time clock resets at each event, total-time clock does not.",
        "edge_cases": [
          "Later strata are thin and unstable once few patients reach high event counts.",
          "Effect estimates are event-order-specific and harder to summarize than a single rate ratio."
        ],
        "data_source_notes": "Appropriate when biological risk genuinely changes after each event (repeated flares, readmissions after discharge)."
      },
      {
        "name": "Joint frailty / death-aware (Ghosh-Lin, while-alive)",
        "description": "Models the recurrent process jointly with the terminal event (death) via a shared frailty, or targets a death-aware estimand (MCF accounting for death, while-alive event rate) so survivor-accrual cannot bias the comparison.",
        "edge_cases": [
          "Requires a complete, reliable mortality source; sensitive to the assumed frailty distribution and recurrence-death correlation sign."
        ],
        "data_source_notes": "claims: link to Medicare vital status or NDI; the limited Part D death flag is insufficient on its own."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "cox-ph-regression",
        "pros_of_this": "Uses the entire event process and matches burden/HCRU estimands instead of discarding all events after the first; can detect rate effects that a first-event hazard ratio misses entirely.",
        "cons_of_this": "More data engineering (counting-process intervals), a less familiar effect measure, and explicit assumptions about within-person dependence and terminal events.",
        "when_to_prefer": "Whenever second-and-later events carry clinical or economic weight and the estimand is rate/burden rather than time-to-onset."
      },
      {
        "compared_to": "poisson-negative-binomial-count-models",
        "pros_of_this": "AG/LWYY/PWP preserve event timing, accommodate time-varying exposure, and reveal the risk trajectory over follow-up, which a single count cannot represent.",
        "cons_of_this": "More complex layout and interpretation; a negative-binomial rate model is often a transparent, payer-friendly summary that is good enough for total burden.",
        "when_to_prefer": "When timing, time-updated covariates, or the shape of risk matter; otherwise a negative-binomial rate model is the simpler defensible choice."
      },
      {
        "compared_to": "competing-risks-cause-specific-fine-gray-rwe",
        "pros_of_this": "Recurrent-event methods count multiple events per person and estimate rates/MCF; competing-risks methods are built for a single terminal event with mutually exclusive causes.",
        "cons_of_this": "Recurrent-event models must additionally borrow competing-risk machinery (joint frailty, while-alive estimands) to handle death as informative truncation.",
        "when_to_prefer": "When the outcome genuinely recurs; use cause-specific/Fine-Gray when the question is the first occurrence of a terminal event among competing causes."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Build events from raw claims into clean episodes - collapse facility plus professional claims, stitch transfers, and impose a setting-specific clean window so sustaining refills/follow-up visits are not counted as new events. Person-time offset must end at disenrollment, death, or study end. Restrict rate analyses to FFS person-time and exclude MA-only spans where encounter data undercount events; carry a reliable mortality source for death-aware estimands.",
      "ehr": "Events are encounters, labs, or rescue orders; visit frequency is outcome-correlated, so raw counts confound severity with capture intensity. Restrict to unambiguous severe events, model the observation process, or use inverse-intensity-of-observation weighting; treat loss to follow-up as informative.",
      "registry": "Adjudicated clean events but misses out-of-registry utilization and may yield panel-count (interval) rather than exact-time data; link to claims for hospitalization burden and to a death index for the terminal event.",
      "linked": "Linked claims-EHR-vital-records is the ideal substrate (clean events + complete utilization + reliable mortality) but introduces linkage selection and date-discrepancy issues between registry visit, claim service, and death dates that must be reconciled before building counting-process intervals."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\nimport statsmodels.api as sm\nimport statsmodels.formula.api as smf\n\n# Offset = log of observable person-time (in years here so the rate is events/person-year).\ndf[\"log_pt_years\"] = np.log(df[\"pt_days\"] / 365.25)\n\nnb = smf.glm(\n    formula=\"n_events ~ arm + age + cci\",\n    data=df,\n    family=sm.families.NegativeBinomial(),      # absorbs within-person overdispersion\n    offset=df[\"log_pt_years\"],\n    freq_weights=df[\"ps_weight\"] if \"ps_weight\" in df else None,\n).fit(cov_type=\"cluster\", cov_kwds={\"groups\": df[\"person_id\"]})\n\nrr = np.exp(nb.params[\"arm\"])\nci = np.exp(nb.conf_int().loc[\"arm\"])\nprint(f\"Rate ratio (arm) = {rr:.3f}  95% CI [{ci[0]:.3f}, {ci[1]:.3f}]\")",
        "description": "Negative-binomial marginal RATE model for total recurrent-event burden. Required input (one row per patient,\nalready cleaned and de-duplicated into episodes upstream):\n  df : person_id, arm (0/1 or categorical), n_events (count of clean episodes during observable follow-up),\n       pt_days (observable person-time in days, ending at disenroll/death/study end), age, cci, plus PS weight if used.\nReturns the adjusted rate ratio (exp of the arm coefficient). Use a log person-time offset and robust/cluster variance\nbecause within-person clustering inflates variance beyond the model-based SE.",
        "dependencies": [
          "pandas",
          "numpy",
          "statsmodels"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "python",
        "code": "from lifelines import CoxTimeVaryingFitter\n\nctv = CoxTimeVaryingFitter()\nctv.fit(\n    long,\n    id_col=\"person_id\",\n    start_col=\"tstart\",\n    stop_col=\"tstop\",\n    event_col=\"event\",\n    formula=\"arm + on_treatment + age + cci\",\n    robust=True,            # sandwich variance for within-person dependence (AG -> LWYY-style SE)\n)\nctv.print_summary()         # exp(coef) on 'arm' is the recurrent-event rate/intensity ratio",
        "description": "Andersen-Gill counting-process Cox model with time-varying exposure and robust variance, using lifelines.\nRequired input is the LONG counting-process layout (one row per at-risk interval per patient):\n  long : person_id, tstart, tstop (days from index_date), event (1 at an event row, 0 at the final admin row),\n         on_treatment (time-varying 0/1 from days_supply spans), arm, age, cci.\nlifelines computes the AG estimate; robust=True gives the sandwich variance and cluster_col handles within-person ties.",
        "dependencies": [
          "pandas",
          "lifelines"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(survival)\n\n# Andersen-Gill intensity model; cluster(id) -> robust (LWYY) variance, the RWE default for the rate ratio.\nag <- coxph(Surv(tstart, tstop, event) ~ arm + age + cci + cluster(id),\n            data = long_df)\n\n# Prentice-Williams-Peterson GAP-TIME: clock resets each event (use gap = tstop - tstart), stratified by event order.\nlong_df$gap <- long_df$tstop - long_df$tstart\npwp <- coxph(Surv(gap, event) ~ arm + age + cci + strata(event_number) + cluster(id),\n             data = long_df)\n\n# Mean cumulative function (Nelson-Aalen-type) of expected events by time, by arm, for the communicable headline.\nmcf <- survfit(Surv(tstart, tstop, event) ~ arm, data = long_df, id = id)\nsummary(ag); summary(pwp); plot(mcf, cumhaz = TRUE, xlab = \"Days\", ylab = \"Mean cumulative events\")",
        "description": "Andersen-Gill (LWYY rate via robust SE), PWP gap-time, and the mean cumulative function in R's survival package.\nRequired input is the LONG counting-process layout:\n  long_df : id, tstart, tstop, event (0/1), event_number (1,2,...; for PWP strata), arm, age, cci.\ncluster(id) yields the robust LWYY-style variance; survfit on a multi-state/mcf object gives the MCF for plotting.",
        "dependencies": [
          "survival"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* Negative-binomial RATE model with a log person-time offset -> rate ratio = exp(arm estimate). */\nproc genmod data=work.summary;\n  class arm (ref='0') / param=ref;\n  model n_events = arm age cci / dist=negbin link=log offset=log_pt;\n  repeated subject=person_id / type=ind;   /* robust SE for residual within-person clustering */\n  estimate 'Rate ratio (arm)' arm 1 -1 / exp;\nrun;\n\n/* Andersen-Gill counting-process Cox; covs(aggregate)+id = robust (LWYY) variance for the rate/intensity ratio. */\nproc phreg data=work.long covs(aggregate);\n  class arm (ref='0') / param=ref;\n  id person_id;\n  model (tstart, tstop)*event(0) = arm age cci / rl;\nrun;\n\n/* Prentice-Williams-Peterson: stratify the risk set by event number; gap-time uses tstop-tstart as the response. */\ndata work.long; set work.long; gap = tstop - tstart; run;\nproc phreg data=work.long covs(aggregate);\n  class arm (ref='0') / param=ref;\n  id person_id;\n  strata event_number;\n  model gap*event(0) = arm age cci / rl;\nrun;",
        "description": "Recurrent-event rate (negative binomial via PROC GENMOD) and Andersen-Gill / PWP counting-process Cox via PROC PHREG.\nRequired input datasets (post data-management):\n  work.summary : person_id, arm, n_events, log_pt (= log observable person-time), age, cci  -- one row per patient\n  work.long    : person_id, tstart, tstop, event (0/1), event_number, arm, age, cci         -- counting-process rows\ncovs(aggregate) + the id statement give PROC PHREG the robust (LWYY-style) sandwich variance for within-person ties.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": "recurrent-events-analysis-rwe-timeline.svg",
        "mermaid": null,
        "caption": "Each diamond marks one COPD exacerbation. The four horizontal spans show the four at-risk intervals that make up the counting-process data rows. A first-event analysis would stop at the first diamond and ignore the two later events, understating this patient's burden by two-thirds.",
        "alt_text": "A 365-day horizontal timeline for patient 3041 from January 1 to December 31, 2024. Three event markers appear at day 70 (March 11), day 190 (July 9), and day 280 (October 7), each labeled as an exacerbation. Four follow-up spans fill the gaps between events and the end of follow-up, illustrating the counting-process interval structure. A note indicates that a first-event-only analysis captures only the first marker and stops.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Q{What is the recurrent-event estimand?} -->|Total burden / rate| RATE[Marginal rate: LWYY or negative-binomial<br/>log person-time offset, robust variance]\n  Q -->|Effect conditional on event history| AG[Andersen-Gill intensity<br/>counting-process, time-varying exposure]\n  Q -->|Effect on time between ordered events| PWP[Prentice-Williams-Peterson<br/>gap-time, strata by event number]\n  Q -->|Communicable absolute burden| MCF[Mean cumulative function<br/>expected events by time]\n  RATE --> DEATH{Is death common and<br/>plausibly differential by arm?}\n  AG --> DEATH\n  PWP --> DEATH\n  MCF --> DEATH\n  DEATH -->|Yes| JOINT[Joint frailty / Ghosh-Lin /<br/>while-alive estimand]\n  DEATH -->|No| REPORT[Report rate ratio + MCF<br/>with attrition and sensitivity]\n  JOINT --> REPORT",
        "caption": "Estimand-first decision logic for recurrent-event analysis. The terminal-event branch is mandatory whenever mortality is non-trivial, because naive rate/AG models treat death as ordinary censoring and can mis-rank arms via survivor accrual.",
        "alt_text": "Decision flowchart selecting among marginal rate (LWYY/negative binomial), Andersen-Gill, Prentice-Williams-Peterson, and mean cumulative function by estimand, then routing through a terminal-event check to joint frailty or a death-aware estimand before reporting.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "gantt\n  title Counting-process layout for one patient (COPD exacerbations, claims)\n  dateFormat YYYY-MM-DD\n  axisFormat %b %Y\n  section Baseline\n  365-day continuous FFS enrollment + washout :done, base, 2023-01-01, 2023-12-31\n  section At-risk follow-up (start-stop rows)\n  Interval 1 -> event at exacerbation 1 :active, i1, 2024-01-01, 70d\n  Interval 2 -> event at exacerbation 2 :active, i2, 2024-03-11, 120d\n  Interval 3 -> admin censor (disenroll / death / study end) :crit, i3, 2024-07-09, 90d",
        "caption": "One patient becomes three start-stop rows. Event rows (event=1) end at each cleaned exacerbation; the final row (event=0) ends at the earliest of disenrollment, death, or study end. Person-time is the sum of interval lengths and must not extend past FFS-observable follow-up.",
        "alt_text": "Gantt chart showing a 365-day baseline, then three at-risk intervals for one patient - two ending in exacerbation events and the third ending at administrative censoring - illustrating the counting-process start-stop layout.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "alternative_to",
        "target_slug": "cox-ph-regression",
        "notes": "Andersen-Gill, LWYY, and PWP are recurrent-event extensions of Cox risk-set logic; standard Cox analyzes only time to first event and discards subsequent recurrences."
      },
      {
        "relation_type": "alternative_to",
        "target_slug": "poisson-negative-binomial-count-models",
        "notes": "A negative-binomial rate model with a log person-time offset is the simplest recurrent-event summary (total burden); AG/LWYY/PWP add timing, time-varying exposure, and the risk trajectory."
      },
      {
        "relation_type": "used_with",
        "target_slug": "competing-risks-cause-specific-fine-gray-rwe",
        "notes": "Death is an informative terminal event that truncates the recurrent process; competing-risk thinking motivates joint frailty, Ghosh-Lin, and while-alive estimands."
      },
      {
        "relation_type": "used_with",
        "target_slug": "hcru-healthcare-resource-utilization",
        "notes": "HCRU endpoints (admissions, ED visits, rescue bursts) are recurrent by construction and are the most common application of these methods in RWE."
      },
      {
        "relation_type": "used_with",
        "target_slug": "propensity-score-methods-psm-iptw",
        "notes": "Confounding control by PS matching or weighting on pre-index covariates is combined with the recurrent-event model, with weights carried into the rate/intensity estimation."
      }
    ],
    "aliases": [
      "repeated events",
      "recurrent event analysis",
      "multiple failure times",
      "Andersen-Gill model",
      "Prentice-Williams-Peterson model",
      "LWYY model",
      "mean cumulative function",
      "recurrent event rate"
    ],
    "complexity": "advanced",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "registry-trial",
    "name": "Registry-Based Randomized Controlled Trial (RRCT)",
    "short_definition": "A randomized controlled trial in which an existing clinical or administrative quality registry supplies the eligibility frame, randomization infrastructure, baseline data, and outcome ascertainment, embedding random treatment assignment inside routine care to combine the internal validity of randomization with the cost, scale, and generalizability of real-world data.",
    "long_description": "A **registry-based randomized controlled trial (RRCT)** is a randomized trial built on top of an existing\nprospective registry. The registry — a structured, ongoing data collection on a defined patient population (a disease\nregistry, a procedure/quality registry such as SWEDEHEART, or a device/product registry) — is repurposed to do the\ntrial's heavy lifting: it identifies eligible patients at the point of care, hosts an embedded randomization module,\ncaptures baseline characteristics as part of routine documentation, and ascertains outcomes through ongoing registry\nfollow-up plus linkage to claims and national death indices. The randomization is what separates an RRCT from a\nregistry-based *observational* comparative study: treatment is assigned by chance, so the design inherits the\nunbiased-by-design causal interpretation of an RCT while spending a fraction of a conventional trial's cost.\n\n**Core conceptual distinction**. Three things are happening at once, and they are separable. (1) *Randomization vs\nobservation*: RRCTs randomize, so the average treatment effect is identified without the no-unmeasured-confounding\nassumption that haunts every observational registry analysis — this is the single feature that makes an RRCT a trial\nrather than a confounder-adjusted cohort. (2) *Registry-embedded vs free-standing data collection*: instead of building\nbespoke case report forms and a trial-specific coordinating center, the RRCT uses the registry's existing data pipes,\nwhich slashes per-patient cost (the TASTE trial randomized ~7,200 patients for a tiny fraction of a conventional\nmegatrial budget) and enables near-complete, unselected enrollment. 
(3) *Pragmatic vs explanatory orientation*: because\nenrollment happens in routine care with broad eligibility and outcomes come from registries, RRCTs sit at the pragmatic\nend of the explanatory–pragmatic continuum, estimating effectiveness in real practice rather than efficacy in an\nidealized population. The estimand is the intention-to-treat comparative effect of assignment to one strategy versus\nanother in the enrolled real-world population; per-protocol/as-treated contrasts require the same\ncensoring-and-weighting machinery as any trial and lose the protection of randomization. An RRCT is **not** an\nobservational registry study with a propensity score, and it is **not** a target-trial *emulation* — it is an actual\nrandomized trial whose data substrate happens to be a registry.\n\n**Pros, cons, and trade-offs**.\n- **vs the conventional (free-standing) RCT:** The RRCT keeps randomization but replaces bespoke data collection with\n  registry infrastructure, cutting cost per patient by often more than an order of magnitude, enabling very large sample\n  sizes, and — because eligibility is broad and enrollment happens in routine care — improving external validity and\n  reducing the \"trial population ≠ real patients\" gap. 
Cost: outcome data are limited to what the registry and linked\n  sources capture (so granular efficacy endpoints, adjudicated soft outcomes, or biomarkers may be unavailable or\n  imprecise), monitoring is lighter, and registry data-quality limits the precision of baseline and outcome variables.\n  **Prefer an RRCT** when the question can be answered with hard, registry-captured endpoints (mortality, MI,\n  revascularization, readmission) and a large, representative sample is needed cheaply.\n- **vs the observational registry comparative study (e.g., active-comparator new-user with PS adjustment):** The RRCT\n  removes confounding by indication by design; no amount of high-dimensional PS adjustment in the observational version\n  can guarantee that. Cost: it requires equipoise, ethics/consent infrastructure, and operator willingness to randomize\n  at the point of care, and it can only study interventions that are genuinely deliverable inside the registry workflow.\n  **Prefer the RRCT** whenever randomization is feasible and ethical; reserve the observational design for questions\n  where randomization is impossible (rare exposures, harms, already-disseminated practice).\n- **vs the pragmatic trial / cluster-randomized pragmatic trial more broadly:** The RRCT is a *species* of pragmatic\n  trial whose defining feature is that an existing registry, not a purpose-built data system, is the backbone. Versus a\n  cluster-randomized pragmatic trial, individual-level registry randomization avoids contamination-driven loss of power\n  and within-cluster correlation, but cannot study cluster-level interventions (clinic policies, system changes).\n  **Prefer the RRCT** for individually deliverable treatments where a suitable registry already exists.\n- **vs target-trial emulation:** Target-trial emulation *approximates* a trial from observational data and still relies\n  on exchangeability; the RRCT *is* the trial. 
**Prefer the RRCT** when you can randomize; use emulation only when you\n  cannot.\n\n**When to use**. A high-quality prospective registry already exists and covers the eligible population with adequate\ncapture; the intervention is deliverable within routine care and within the registry workflow (a drug, device, or\nprocedure decision made at a registry-documented encounter); genuine clinical equipoise exists; the primary endpoints\nare hard outcomes reliably captured by the registry and linkable sources (death indices, claims, hospital discharge\ndata); and a large, representative sample is needed at low marginal cost. RRCTs shine for comparative effectiveness of\nestablished, reimbursed interventions where a conventional trial would be prohibitively expensive or too narrow to be\ngeneralizable.\n\n**When NOT to use — and when it is actively misleading or dangerous**.\n- **No registry, or a low-quality registry.** If the registry has incomplete enrollment, missing key covariates, or\n  poorly validated outcomes, the RRCT inherits those defects; building the registry first defeats the cost advantage.\n- **The endpoint cannot be captured by the registry or linked data.** Outcomes requiring blinded adjudication, central\n  imaging, patient-reported instruments, or biomarkers the registry does not collect cannot be measured well — an RRCT\n  forced onto such an endpoint will be underpowered or biased toward the null by outcome misclassification.\n- **Open-label assignment threatens the endpoint.** RRCTs are usually unblinded; for subjective or\n  ascertainment-sensitive outcomes (symptom scores, soft revascularization decisions influenced by knowing the arm),\n  lack of blinding plus differential registry capture can manufacture or mask effects. 
This is the dangerous failure\n  mode: an apparently \"randomized\" result that is actually driven by differential outcome ascertainment.\n- **Selective point-of-care enrollment.** If clinicians enroll only a non-random subset of eligible patients (the\n  healthier, the sicker, the more cooperative), the *trial population* becomes unrepresentative even though *assignment*\n  within it is random; internal validity survives but the generalizability that justified the RRCT is lost. Track the\n  enrollment fraction against the full registry denominator and report it.\n- **Small or rare-event settings.** The cost advantage and pragmatic outcomes presuppose large numbers; for rare\n  diseases or rare events, a conventional adjudicated trial or an external-control design is usually better.\n\n**Data-source operational depth**.\n- **Disease/quality/procedure registries (the backbone):** SWEDEHEART-type registries provide the eligibility frame,\n  the randomization point, and the baseline/process data. Failure modes: incomplete site participation and selective\n  enrollment (only consenting, operator-selected patients are randomized) bias the *enrolled* population; registry\n  variable definitions drift over time and across sites; and the registry's own outcome fields (e.g., in-hospital\n  events) are often insufficient for long-term endpoints, forcing linkage. Workaround: pre-specify the enrollment\n  denominator and audit enrollment representativeness; freeze data dictionaries; lean on national mandatory registries\n  where participation is near-universal.\n- **Claims (FFS vs MA vs commercial) for long-term outcome ascertainment:** Linked claims extend follow-up for\n  readmission, downstream procedures, and resource use. Failure mode: in U.S. 
data, **Medicare Advantage enrollees lack\n  fee-for-service claims**, so randomized patients in MA contribute no observable utilization/outcome person-time —\n  leaving them in the denominator silently undercounts events and, if MA enrollment differs by arm or site, biases the\n  contrast. Workaround: restrict outcome ascertainment to FFS-observable person-time (or fully captured commercial\n  benefit), report the share of randomized patients with complete claims observability, and treat MA-only follow-up as\n  censored, not event-free. Differential competing risks (e.g., death) by arm in elderly registry populations must be\n  handled with cause-specific or Fine-Gray models, not naive Kaplan–Meier on a non-fatal endpoint.\n- **National death index / vital records:** The gold standard for the mortality endpoint that many RRCTs use precisely\n  because the registry alone cannot reliably capture out-of-hospital death. Failure modes: reporting lag and\n  jurisdictional gaps create administrative censoring that, if uneven, distorts survival contrasts. Workaround: align\n  the analytic cutoff to a date with complete death ascertainment for all sites.\n- **EHR linkage:** Adds labs, problem lists, and notes that sharpen baseline severity beyond what the registry records,\n  but visit-driven capture means patients who leave the system are differentially lost — define the observation window\n  explicitly and treat loss to follow-up as potentially informative rather than ignorable.\n\n**Worked example (the canonical RRCT, claims/registry logic).** Question: does routine manual thrombus aspiration\nduring primary PCI for ST-elevation MI reduce all-cause mortality versus PCI alone? (This is the TASTE trial, the\narchetypal RRCT.) (1) *Eligibility frame*: all patients undergoing primary PCI for STEMI documented in the SWEDEHEART\nnational quality registry — a near-complete, mandatory registry, so the enrollment denominator is the whole country's\nSTEMI-PCI population. 
(2) *Randomization at the point of care*: when the operator opens the patient's registry record in\nthe cath lab, an embedded randomization module assigns thrombus aspiration + PCI vs PCI alone; **time zero = the\nrandomization timestamp**, defined identically for both arms, so there is no immortal time and no post-assignment selection. (3)\n*Baseline data*: captured as part of routine SWEDEHEART documentation (age, infarct location, comorbidity, procedural\ndetails), with no separate case report form. (4) *Outcome ascertainment*: the primary endpoint, all-cause mortality at 30\ndays (with 1-year mortality reported in extended follow-up), comes from the national population/death registry (complete\nout-of-hospital capture), while secondary\nendpoints (reinfarction, stent thrombosis, target-vessel revascularization, rehospitalization) come from SWEDEHEART\nplus linked hospital-discharge/claims data. (5) *Analysis*: intention-to-treat by randomized arm; censor at the analytic\ncutoff with complete death ascertainment; for the non-fatal endpoints in an older cohort, use cause-specific hazards or\ncumulative incidence functions so that death is handled as a competing risk rather than ignored by a naive Kaplan–Meier\nestimate. (6) *Threats to monitor*: enrollment fraction (what share\nof all registry STEMI-PCI patients were randomized, and whether non-enrolled patients differ), open-label\nascertainment of the soft revascularization endpoint, and (if extended with U.S.-style claims) exclusion or censoring\nof MA-only person-time so that unobservable follow-up is not miscounted as event-free. TASTE randomized ~7,200 patients\nat a small fraction of conventional-trial cost and showed no mortality benefit of routine aspiration, a definitive,\npractice-changing result that a conventional megatrial would have struggled to deliver as cheaply or as representatively.",
    "primary_category": "Study_Design",
    "tags": [
      "registry-based-trial",
      "rrct",
      "pragmatic-trial",
      "randomized-controlled-trial",
      "point-of-care-randomization",
      "comparative-effectiveness",
      "real-world-evidence",
      "swedeheart"
    ],
    "applies_to_study_types": [
      "registry_trial",
      "pragmatic_trial"
    ],
    "data_sources": [
      "registry",
      "claims",
      "ehr",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1038/nrcardio.2015.33",
        "url": "https://doi.org/10.1038/nrcardio.2015.33",
        "citation_text": "James S, Rao SV, Granger CB. Registry-based randomized clinical trials - a new clinical trial paradigm. Nature Reviews Cardiology. 2015;12(5):312-316.",
        "year": 2015,
        "authors_short": "James et al.",
        "notes": "First systematic articulation of the RRCT paradigm, defining the registry-as-trial-backbone concept and the SWEDEHEART/TASTE template that established registry-embedded randomization as a distinct design."
      },
      {
        "role": "explain",
        "doi": "10.1186/s13063-020-04459-z",
        "url": "https://doi.org/10.1186/s13063-020-04459-z",
        "citation_text": "Karanatsios B, Prang KH, Verbunt E, Yeung JM, Kelaher M, Gibbs P. Defining key design elements of registry-based randomised controlled trials: a scoping review. Trials. 2020;21:552.",
        "year": 2020,
        "authors_short": "Karanatsios et al.",
        "notes": "Scoping review that systematizes the design elements (registry data quality, eligibility frame, randomization integration, outcome ascertainment) distinguishing a true RRCT from a registry-flavored conventional trial."
      },
      {
        "role": "demonstrate",
        "doi": "10.1056/NEJMp1310102",
        "url": "https://doi.org/10.1056/NEJMp1310102",
        "citation_text": "Lauer MS, D'Agostino RB Sr. The randomized registry trial - the next disruptive technology in clinical research? New England Journal of Medicine. 2013;369(17):1579-1581.",
        "year": 2013,
        "authors_short": "Lauer & D'Agostino",
        "notes": "Influential editorial framing the cost and generalizability rationale for RRCTs as a disruptive complement to conventional megatrials, written alongside the landmark TASTE report."
      },
      {
        "role": "demonstrate",
        "doi": "10.1056/NEJMoa1308789",
        "url": "https://doi.org/10.1056/NEJMoa1308789",
        "citation_text": "Fröbert O, Lagerqvist B, Olivecrona GK, et al. Thrombus aspiration during ST-segment elevation myocardial infarction. New England Journal of Medicine. 2013;369(17):1587-1597.",
        "year": 2013,
        "authors_short": "Fröbert et al.",
        "notes": "The TASTE trial - the canonical RRCT, randomizing ~7,200 STEMI patients within SWEDEHEART with mortality from the national death registry, proving the design can deliver definitive, low-cost, generalizable results."
      },
      {
        "role": "use",
        "doi": "10.1056/NEJMoa1405707",
        "url": "https://doi.org/10.1056/NEJMoa1405707",
        "citation_text": "Lagerqvist B, Fröbert O, Olivecrona GK, et al. Outcomes 1 year after thrombus aspiration for myocardial infarction. New England Journal of Medicine. 2014;371(12):1111-1120.",
        "year": 2014,
        "authors_short": "Lagerqvist et al.",
        "notes": "One-year follow-up of TASTE using continued registry and death-index linkage, illustrating how RRCT outcome ascertainment is extended over time through the registry rather than bespoke trial visits."
      }
    ],
    "plain_language_summary": "A registry-based randomized trial (RRCT) is a clinical trial that runs on top of a registry — a database hospitals already use to track every patient who gets a particular procedure or carries a particular diagnosis. Researchers add a coin-flip assignment step right inside the registry: when a patient shows up and is eligible, the registry software randomly assigns them to treatment A or B at the point of care. Because the registry already records who enrolled, what their health looked like at the start, and what happened to them afterward via linked death records, the study gets the causal certainty of a randomized trial at a tiny fraction of the cost of a conventional drug trial. The limitation is that measurable outcomes are confined to what the registry and linked government records actually capture.",
    "key_terms": [
      {
        "term": "registry",
        "definition": "An ongoing database that records clinical information on every patient who meets a defined condition or undergoes a defined procedure — for example, all patients who had an emergency heart procedure at a participating hospital — as a normal part of care, not for a specific study."
      },
      {
        "term": "randomization",
        "definition": "Assigning each patient to a treatment group by chance (equivalent to flipping a coin), so the two groups start out comparable on everything except the treatment being tested."
      },
      {
        "term": "eligibility frame",
        "definition": "The complete list of registry patients who meet the study entry criteria at the moment of their procedure or visit — the registry provides this list automatically, with no separate screening step."
      },
      {
        "term": "time zero",
        "definition": "The precise moment a patient enters the trial, defined here as the timestamp when the registry system makes the random treatment assignment; both groups start their follow-up clock at an equivalent event."
      },
      {
        "term": "intention-to-treat analysis",
        "definition": "Comparing patients by the treatment they were randomly assigned to, regardless of what they actually received, in order to preserve the protection that randomization provides against background differences between groups."
      },
      {
        "term": "linked death registry",
        "definition": "A national vital-statistics or death-certificate database that is matched to registry patients after the fact, so that deaths occurring outside the hospital are still counted in the study."
      }
    ],
    "worked_example": {
      "scenario": "A national cardiology registry already captures every emergency heart procedure (primary PCI for a heart attack) at 30 hospitals. Researchers want to know whether adding a clot-removal step during the procedure reduces the chance of dying within one year. They embed a randomization module into the registry form: the moment a surgeon opens a patient record in the catheterization lab, the system flips a coin and assigns the patient to Aspiration+PCI or Standard PCI. The registry logs enrollment, baseline age, and site. One year later the team links to the national death registry to find out who died. The table below shows six enrolled patients exactly as the analyst sees them.",
      "dataset": {
        "caption": "Six registry records after the coin flip. procedure_date is time zero. death_date is null if the patient was alive at the one-year cut-off (one year after each procedure_date).",
        "columns": [
          "person_id",
          "procedure_date",
          "site_id",
          "assigned_arm",
          "age",
          "death_date"
        ],
        "rows": [
          [
            "PT-001",
            "2022-03-10",
            "SITE-07",
            "Aspiration+PCI",
            64,
            null
          ],
          [
            "PT-002",
            "2022-03-12",
            "SITE-12",
            "Standard PCI",
            71,
            "2022-09-04"
          ],
          [
            "PT-003",
            "2022-03-18",
            "SITE-07",
            "Standard PCI",
            58,
            null
          ],
          [
            "PT-004",
            "2022-03-25",
            "SITE-03",
            "Aspiration+PCI",
            79,
            "2022-06-01"
          ],
          [
            "PT-005",
            "2022-04-01",
            "SITE-12",
            "Aspiration+PCI",
            66,
            null
          ],
          [
            "PT-006",
            "2022-04-05",
            "SITE-03",
            "Standard PCI",
            74,
            null
          ]
        ]
      },
      "steps": [
        "The registry already held all six patients because their procedures were documented through routine hospital quality reporting — zero extra enrollment effort was required.",
        "At the instant each procedure began, the embedded module assigned Aspiration+PCI or Standard PCI with equal probability; procedure_date is that patient's time zero and both arms start counting follow-up from an equivalent event, so there is no built-in advantage for either group.",
        "A linkage to the national death registry adds the death_date column. PT-002 died on 2022-09-04, which is 176 days after their procedure_date of 2022-03-12. PT-004 died on 2022-06-01, which is 68 days after their procedure_date of 2022-03-25. The other four patients reached their one-year anniversary without dying.",
        "Aspiration+PCI arm: PT-001 (alive), PT-004 (died day 68), PT-005 (alive) — 1 death out of 3 patients = 33% one-year mortality.",
        "Standard PCI arm: PT-002 (died day 176), PT-003 (alive), PT-006 (alive) — 1 death out of 3 patients = 33% one-year mortality.",
        "Using intention-to-treat, every patient is counted in the arm they were assigned to on procedure_date, regardless of any later deviation. The registry infrastructure — not a bespoke study database — delivered enrollment, baseline, and outcome data."
      ],
      "result": "One-year mortality: 1/3 (33%) in each arm in this 6-patient toy example. The numbers are illustrative, not a real trial finding. The mechanism is what matters: randomization at the point of care removes the concern that sicker patients were steered to one arm; the registry provides infrastructure for a fraction of a conventional trial budget; linked death records capture outcomes that happen outside the hospital. That combination — randomization layered on a registry that captures enrollment, baseline, and real-world outcomes — is what an RRCT buys."
    },
    "prerequisites": [
      "pragmatic-trial",
      "disease-registry",
      "target-trial-emulation"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Single-registry embedded randomization (classic RRCT)",
        "description": "One mature national/quality registry supplies the eligibility frame, the embedded randomization module, the baseline data, and (with linkage) the outcomes; randomization occurs at a routine registry-documented encounter. This is the SWEDEHEART/TASTE template.",
        "edge_cases": [
          "Selective enrollment at the point of care makes the randomized population unrepresentative of the full registry denominator even though within-population assignment is random.",
          "The registry's native outcome fields may be too coarse for the primary endpoint, forcing linkage to death/claims data."
        ],
        "data_source_notes": "registry: confirm near-universal participation and a frozen data dictionary; track the enrollment fraction against the full registry population."
      },
      {
        "name": "Registry-frame trial with linked outcome ascertainment",
        "description": "The registry provides eligibility and randomization, but outcomes (mortality, readmission, downstream procedures, resource use) are taken from linked claims, hospital-discharge, and national death data rather than the registry itself.",
        "edge_cases": [
          "In U.S. data, Medicare Advantage enrollees lack fee-for-service claims, so their outcome person-time is unobservable and must be censored, not counted as event-free.",
          "Differential reporting lag in death indices creates administrative censoring that can distort survival contrasts if uneven across sites."
        ],
        "data_source_notes": "claims/linked: restrict outcome ascertainment to FFS-observable person-time and report the share of randomized patients with complete claims observability."
      },
      {
        "name": "Cluster / stepped-wedge registry-randomized design",
        "description": "Randomization is at the site/cluster level within a registry (e.g., hospitals randomized to a care protocol) rather than at the individual patient, used when the intervention is a system-level change.",
        "edge_cases": [
          "Within-cluster correlation reduces effective sample size; analysis must use cluster-robust or mixed-effects models.",
          "Contamination across clusters and secular trends threaten validity; stepped-wedge timing must be pre-specified."
        ],
        "data_source_notes": "registry: site identifiers and time stamps must be reliable; pair with cluster-robust standard errors and account for calendar-time confounding."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Conventional free-standing randomized controlled trial",
        "pros_of_this": "Far lower cost per patient by reusing registry infrastructure; very large, broadly eligible, representative samples; outcomes from complete national registries/death indices; pragmatic effectiveness estimate.",
        "cons_of_this": "Endpoints limited to registry/linked-data capture; usually unblinded; lighter monitoring; baseline and outcome precision bounded by registry data quality.",
        "when_to_prefer": "Hard, registry-captured endpoints (mortality, MI, revascularization, readmission) where a large representative sample is needed cheaply and a megatrial is infeasible."
      },
      {
        "compared_to": "Observational registry comparative study (active-comparator new-user with PS adjustment)",
        "pros_of_this": "Randomization removes confounding by indication by design; no reliance on no-unmeasured-confounding.",
        "cons_of_this": "Requires equipoise, consent/ethics infrastructure, and operator willingness to randomize at the point of care; only studies interventions deliverable within the registry workflow.",
        "when_to_prefer": "Whenever randomization is feasible and ethical; reserve the observational design for questions where randomization is impossible."
      },
      {
        "compared_to": "Cluster-randomized pragmatic trial",
        "pros_of_this": "Individual-level randomization avoids contamination and within-cluster correlation, preserving power.",
        "cons_of_this": "Cannot study cluster-level interventions such as clinic policies or system changes.",
        "when_to_prefer": "Individually deliverable treatments (a drug, device, or procedure decision) where a suitable registry already exists."
      },
      {
        "compared_to": "Target-trial emulation from observational data",
        "pros_of_this": "It is an actual randomized trial, not an approximation; identifies the causal effect without an exchangeability assumption.",
        "cons_of_this": "Requires that randomization be ethical and feasible at the point of care, which emulation does not.",
        "when_to_prefer": "Whenever you can randomize; use emulation only when randomization is impossible."
      }
    ],
    "implementation_notes_by_data_source": {
      "registry": "The registry is the eligibility frame, randomization host, and baseline-data source. Confirm near-universal participation, freeze the data dictionary, and track the enrollment fraction (randomized / all eligible in the registry) to defend representativeness. Time zero = randomization timestamp.",
      "claims": "Use for long-term outcome ascertainment (readmission, downstream procedures, resource use). Exclude or censor Medicare Advantage-only person-time where fee-for-service claims are unavailable; report the share of randomized patients with complete claims observability.",
      "ehr": "Adds labs, problem lists, and notes that sharpen baseline severity beyond registry fields; visit-driven capture means patients who leave the system are differentially lost - define observation windows explicitly.",
      "linked": "Registry + national death index + claims is the ideal substrate (frame + complete mortality + long-term utilization), but linkage selection and reporting lag introduce administrative censoring that must be reconciled to a cutoff with complete ascertainment for all sites."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\nimport numpy as np\n\nANALYTIC_CUTOFF = pd.Timestamp(\"2024-12-31\")  # date with complete death ascertainment for all sites\nRNG = np.random.default_rng(20240101)          # fixed seed -> reproducible, auditable randomization\n\ndef randomize_and_build(reg: pd.DataFrame, claims: pd.DataFrame, death: pd.DataFrame) -> pd.DataFrame:\n    # 1) Eligibility frame: first eligible registry encounter per person = the randomization point (time zero).\n    elig = (reg[reg[\"eligible\"]]\n            .sort_values([\"person_id\", \"encounter_date\"])\n            .groupby(\"person_id\", as_index=False)\n            .first()\n            .rename(columns={\"encounter_date\": \"time_zero\"}))\n\n    # 2) Embedded 1:1 randomization at the point of care (intention-to-treat assignment).\n    elig[\"arm\"] = np.where(RNG.random(len(elig)) < 0.5, \"INTERVENTION\", \"CONTROL\")\n\n    # 3) Observability: keep only FFS-observable follow-up so MA-only person-time is not miscounted.\n    ffs = claims[claims[\"ffs_observable\"]].merge(elig[[\"person_id\", \"time_zero\"]], on=\"person_id\")\n    ffs = ffs[(ffs[\"obs_end\"] >= ffs[\"time_zero\"])]              # span overlaps follow-up\n    obs_end = (ffs.groupby(\"person_id\")[\"obs_end\"].max()\n                  .clip(upper=ANALYTIC_CUTOFF).rename(\"obs_followup_end\"))\n    elig = elig.merge(obs_end, on=\"person_id\", how=\"left\")\n\n    # 4) Mortality endpoint from the national death index (complete out-of-hospital capture).\n    elig = elig.merge(death[[\"person_id\", \"death_date\"]], on=\"person_id\", how=\"left\")\n\n    # 5) ITT survival rows: event = death observed within FFS follow-up before the cutoff.\n    end = elig[[\"death_date\", \"obs_followup_end\"]].min(axis=1).fillna(ANALYTIC_CUTOFF)\n    elig[\"event\"] = ((elig[\"death_date\"].notna()) &\n                     (elig[\"death_date\"] <= end) &\n                     (elig[\"death_date\"] <= 
elig[\"obs_followup_end\"].fillna(ANALYTIC_CUTOFF))).astype(int)\n    elig[\"fu_days\"] = (end - elig[\"time_zero\"]).dt.days\n\n    cols = [\"person_id\", \"site_id\", \"time_zero\", \"arm\", \"fu_days\", \"event\"]\n    return elig[cols].reset_index(drop=True)",
        "description": "Registry-embedded randomization and analysis-ready trial table from registry + linked outcome inputs. This is the\nRRCT workflow (eligibility -> point-of-care randomization -> time zero -> linked outcomes), NOT an observational\ncohort lookback. Required inputs (already cleaned, one registry encounter = one candidate):\n  reg    : registry encounters -> person_id, encounter_id, encounter_date (datetime), eligible (bool),\n           site_id, age, [baseline covariates captured at the encounter]\n  claims : enrollment/observability spans -> person_id, obs_start, obs_end, ffs_observable (bool)\n  death  : national death index -> person_id, death_date (datetime or NaT)\nRandomization is 1:1 by a fixed seed so the assignment is reproducible/auditable. Time zero = randomization timestamp\n= the eligible encounter_date. Outcome person-time is restricted to FFS-observable spans; MA-only follow-up is\ncensored, never counted as event-free.",
        "dependencies": [
          "pandas",
          "numpy"
        ],
        "source_citations": [
          "james-2015",
          "frobert-2013"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\n\nANALYTIC_CUTOFF <- as.Date(\"2024-12-31\")  # date with complete death ascertainment for all sites\nset.seed(20240101)                        # reproducible, auditable randomization\n\nrandomize_and_build <- function(reg, claims, death) {\n  setDT(reg); setDT(claims); setDT(death)\n  setorder(reg, person_id, encounter_date)\n\n  # 1) Eligibility frame: first eligible encounter = randomization point (time zero).\n  elig <- reg[eligible == TRUE, .SD[1L], by = person_id]\n  setnames(elig, \"encounter_date\", \"time_zero\")\n\n  # 2) Embedded 1:1 randomization (intention-to-treat assignment).\n  elig[, arm := fifelse(runif(.N) < 0.5, \"INTERVENTION\", \"CONTROL\")]\n\n  # 3) Keep only FFS-observable follow-up (MA-only person-time is unobservable -> censored, not event-free).\n  ffs <- merge(claims[ffs_observable == TRUE], elig[, .(person_id, time_zero)], by = \"person_id\")\n  ffs <- ffs[obs_end >= time_zero]\n  obs <- ffs[, .(obs_followup_end = pmin(max(obs_end), ANALYTIC_CUTOFF)), by = person_id]\n  elig <- merge(elig, obs, by = \"person_id\", all.x = TRUE)\n\n  # 4) Mortality endpoint from the national death index.\n  elig <- merge(elig, death[, .(person_id, death_date)], by = \"person_id\", all.x = TRUE)\n\n  # 5) ITT survival rows.\n  elig[, end := pmin(fcoalesce(death_date, ANALYTIC_CUTOFF),\n                     fcoalesce(obs_followup_end, ANALYTIC_CUTOFF))]\n  elig[, event := as.integer(!is.na(death_date) & death_date <= end &\n                             death_date <= fcoalesce(obs_followup_end, ANALYTIC_CUTOFF))]\n  elig[, fu_days := as.integer(end - time_zero)]\n  elig[, .(person_id, site_id, time_zero, arm, fu_days, event)]\n}",
        "description": "Registry-embedded randomization and analysis-ready trial table with data.table. Inputs mirror the Python version:\n  reg    : person_id, encounter_id, encounter_date (Date), eligible (logical), site_id, age, baseline covariates\n  claims : person_id, obs_start, obs_end (Date), ffs_observable (logical)\n  death  : person_id, death_date (Date, NA if alive)\nTime zero = first eligible encounter; randomization is 1:1 with a fixed seed; outcome person-time is restricted to\nFFS-observable spans and censored at the analytic cutoff.",
        "dependencies": [
          "data.table"
        ],
        "source_citations": [
          "james-2015",
          "frobert-2013"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "%let cutoff = '31DEC2024'd;  /* date with complete death ascertainment for all sites */\n\n/* 1) Eligibility frame: first eligible registry encounter per person = time zero. */\nproc sql;\n  create table elig as\n  select person_id, site_id, age,\n         min(encounter_date) as time_zero format=date9.\n  from work.reg\n  where eligible = 1\n  group by person_id, site_id, age;\nquit;\n\n/* 2) Embedded 1:1 randomization (reproducible via SEED); intention-to-treat assignment. */\nproc surveyselect data=elig out=rand\n     method=srs samprate=0.5 seed=20240101 outall;\n  /* Selected=1 -> INTERVENTION, Selected=0 -> CONTROL */\nrun;\ndata rand;\n  set rand;\n  length arm $12;\n  arm = ifc(Selected = 1, 'INTERVENTION', 'CONTROL');\nrun;\n\n/* 3) FFS-observable follow-up end (MA-only person-time is unobservable -> censored, not event-free). */\nproc sql;\n  create table obs as\n  select r.person_id,\n         min(max(c.obs_end), &cutoff) as obs_followup_end format=date9.\n  from rand r\n  left join work.claims c\n    on c.person_id = r.person_id and c.ffs_observable = 1 and c.obs_end >= r.time_zero\n  group by r.person_id;\nquit;\n\n/* 4) Assemble ITT survival rows: event = death within FFS follow-up before the cutoff. */\nproc sql;\n  create table analytic as\n  select r.person_id, r.site_id, r.arm, r.time_zero,\n         d.death_date,\n         min(coalesce(d.death_date, &cutoff),\n             coalesce(o.obs_followup_end, &cutoff)) as end_date format=date9.,\n         (calculated end_date - r.time_zero) as fu_days,\n         (d.death_date is not null\n          and d.death_date <= calculated end_date\n          and d.death_date <= coalesce(o.obs_followup_end, &cutoff)) as event\n  from rand r\n  left join obs o   on r.person_id = o.person_id\n  left join work.death d on r.person_id = d.person_id;\nquit;\n\n/* 5) Intention-to-treat Cox model by randomized arm (the RRCT primary analysis). 
*/\nproc phreg data=analytic;\n  class arm (ref='CONTROL');\n  model fu_days*event(0) = arm / risklimits;   /* HR + 95% CI for assignment effect */\n  hazardratio arm;\nrun;",
        "description": "Registry-embedded randomization, FFS-observable follow-up construction, and the intention-to-treat survival analysis\nin SAS. Required input datasets (post data-management):\n  work.reg    : person_id, encounter_id, encounter_date, eligible (1/0), site_id, age, baseline covariates\n  work.claims : person_id, obs_start, obs_end, ffs_observable (1/0)\n  work.death  : person_id, death_date (. if alive)\nTime zero = first eligible encounter; PROC SURVEYSELECT performs reproducible 1:1 randomization; PROC PHREG fits the\nITT Cox model on FFS-observable follow-up censored at the analytic cutoff. ANALYTIC_CUTOFF must be a date with\ncomplete death ascertainment across all sites.",
        "dependencies": [],
        "source_citations": [
          "james-2015",
          "frobert-2013"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Reg[Existing prospective registry<br/>defined population, routine capture] --> Elig[Eligible patient identified<br/>at a registry-documented encounter]\n  Elig --> Rand[Embedded randomization module<br/>1:1 assignment at the point of care]\n  Rand --> T0[Time zero = randomization timestamp<br/>identical for both arms]\n  T0 --> Arm1[INTERVENTION arm]\n  T0 --> Arm2[CONTROL arm]\n  Arm1 --> Out[Outcome ascertainment:<br/>registry fields + linked claims + national death index]\n  Arm2 --> Out\n  Out --> ITT[Intention-to-treat comparison<br/>by randomized arm]\n  ITT --> Threats[Audit: enrollment fraction vs full registry,<br/>FFS-observable person-time, open-label endpoints]",
        "caption": "RRCT workflow. The registry supplies the eligibility frame, hosts point-of-care randomization, captures baseline data, and (with linkage) ascertains outcomes; randomization is what makes this a trial rather than an observational registry comparison.",
        "alt_text": "Flowchart from an existing registry through eligibility, embedded randomization, time zero, two randomized arms, linked outcome ascertainment, and an intention-to-treat comparison, with an audit branch for enrollment representativeness and claims observability.",
        "source_type": "illustrative",
        "source_citations": [
          "james-2015"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  Q{Can you randomize<br/>ethically at the point of care?} -->|No| Obs[Observational design:<br/>target-trial emulation or<br/>active-comparator new-user]\n  Q -->|Yes| Rgy{Does a high-quality<br/>registry already cover<br/>the population?}\n  Rgy -->|No| Conv[Conventional RCT<br/>or build registry first]\n  Rgy -->|Yes| Ep{Are endpoints captured<br/>by registry + linked data?}\n  Ep -->|No| Conv\n  Ep -->|Yes| Cl{Intervention deliverable<br/>per patient, or only<br/>at cluster level?}\n  Cl -->|Per patient| RRCT[Individual-level RRCT<br/>SWEDEHEART/TASTE template]\n  Cl -->|Cluster level| CRCT[Cluster / stepped-wedge<br/>registry-randomized trial]",
        "caption": "Decision logic for choosing an RRCT versus a conventional RCT, a cluster-randomized design, or an observational approach.",
        "alt_text": "Decision tree branching on whether randomization is feasible, whether a high-quality registry exists, whether endpoints are captured by registry and linked data, and whether the intervention is delivered per patient or at the cluster level.",
        "source_type": "illustrative",
        "source_citations": [
          "karanatsios-2020"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "is_variant_of",
        "target_slug": "pragmatic-trial",
        "notes": "An RRCT is a pragmatic trial whose defining feature is that an existing registry, not a purpose-built data system, supplies the eligibility frame, randomization infrastructure, and outcome ascertainment."
      },
      {
        "relation_type": "alternative_to",
        "target_slug": "target-trial-emulation",
        "notes": "Target-trial emulation approximates a randomized trial from observational data under exchangeability; an RRCT is an actual randomized trial and does not require that assumption. Use emulation only when randomization is infeasible."
      },
      {
        "relation_type": "used_with",
        "target_slug": "disease-registry",
        "notes": "A disease (or quality/procedure) registry is the backbone of an RRCT - it provides the population, the randomization point, and the baseline and outcome data."
      },
      {
        "relation_type": "used_with",
        "target_slug": "linked-data",
        "notes": "RRCTs routinely link the registry to claims and national death indices to ascertain long-term and out-of-hospital outcomes the registry alone cannot capture."
      },
      {
        "relation_type": "see_also",
        "target_slug": "cluster-randomized",
        "notes": "When the intervention is a site- or system-level change rather than an individual treatment decision, an RRCT can be implemented as a cluster- or stepped-wedge registry-randomized design."
      },
      {
        "relation_type": "see_also",
        "target_slug": "product-registry",
        "notes": "Device and product registries can serve as the backbone for RRCTs evaluating real-world device strategies."
      },
      {
        "relation_type": "see_also",
        "target_slug": "pregnancy-registry",
        "notes": "Pregnancy registries are a specialized registry type; embedding randomization is rarely ethical, so most pregnancy-registry evidence remains observational - a contrast to the RRCT model."
      }
    ],
    "aliases": [
      "RRCT",
      "registry-based randomized trial",
      "registry-based randomized controlled trial",
      "registry-randomized trial",
      "registry-embedded randomized trial"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "ema"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "regression-diagnostics",
    "name": "Regression Diagnostics and Model Checking",
    "short_definition": "A suite of graphical and numerical methods applied after fitting a regression model to verify that the model's assumptions are met well enough for its coefficients and standard errors to be trustworthy; covers residual analysis (linearity, homoscedasticity, normality), multicollinearity assessment (variance inflation factors), influence diagnostics (leverage, Cook's distance), and GLM-specific checks (deviance residuals, overdispersion, calibration via the Hosmer-Lemeshow test and calibration plots) — in RWE, diagnostics serve as an honesty check on a prespecified model, not as a license to reshape the model until it fits.",
    "long_description": "**What regression diagnostics are and why they matter**\n\nRegression diagnostics are the audit trail of a fitted model. After specifying and estimating\na linear or generalized linear model, the analyst must ask: do the data actually support the\nassumptions built into the estimation procedure? An OLS linear model assumes the conditional\nmean of the outcome is linear in the predictors, that residuals are homoscedastic (constant\nvariance), and that errors are independent across observations. A logistic regression model\nassumes the log-odds is linear in the predictors and that the fitted probabilities are\ncalibrated to observed event rates. None of these assumptions is verified before fitting; they\nare checked after, using the residuals and influence measures the fitted model produces.\n\nThe critical distinction — one that separates rigorous HEOR analysis from data dredging — is\nthat diagnostics are run on a prespecified model to assess its adequacy, not to iteratively\nmodify the model until diagnostics improve. Running diagnostics, tweaking predictors, and\nre-running until the residual plot looks acceptable is specification search (sometimes called\ntorturing the model until it confesses). It inflates type-I error and produces optimistically\nbiased standard errors, because the model has been tuned to one dataset's noise patterns.\nPreregistration or a prespecified SAP is the antidote; diagnostics then serve as transparency\nabout residual assumption violations rather than as inputs to model revision.\n\n**The diagnostics toolkit by model failure mode**\n\n*Linearity* — assessed with a residual-versus-fitted-value plot. If the conditional mean of\nthe outcome is truly linear in the predictors, residuals should scatter randomly around zero\nacross the range of fitted values, with no systematic curve. 
A U-shape or inverted-U-shape\nin the residual plot signals a missed quadratic term; a monotone trend signals a missing\npredictor or a need for log-transformation of the outcome or a predictor. The fix is model\nre-specification (add a squared term, a spline, or a log transformation), not a change of\nestimator. Restricted cubic splines (RCS) are the modern flexible response-surface tool:\nthey add nonlinearity locally without imposing a global polynomial shape.\n\n*Heteroscedasticity*: the pattern in the residual-versus-fitted plot is a funnel or fan\nshape: residuals grow larger as fitted values increase. This is endemic in health data:\ncosts, utilization, and length of stay all fan out with disease severity. Heteroscedasticity\ndoes NOT bias the OLS coefficient estimates but DOES bias the conventional standard errors.\nThe bias can run in either direction; in the common funnel pattern it is usually downward,\nproducing confidence intervals that are too narrow. Two remedies exist: (1) use\nheteroscedasticity-robust (HC3) standard errors, which adjust the variance estimator without\nchanging the coefficients; (2) re-specify the outcome distribution using a GLM with a\nvariance function that matches the data (gamma or Tweedie for costs; Poisson or negative\nbinomial for counts). In most RWE settings, both are applied: the GLM as the primary model\nand the OLS with robust SEs as a sensitivity check.\n\n*Non-normality of residuals*: assessed with a quantile-quantile (Q-Q) plot of standardized\nresiduals against the theoretical normal quantiles. Heavy tails appear as an S-curve; skewness\nappears as a curved departure at one end. At the sample sizes typical of claims and EHR\nresearch (n ≥ 5,000), the Central Limit Theorem protects coefficient estimates and their\nstandard errors from non-normality of residuals, because the asymptotic Gaussian approximation for\ninference is valid regardless of the residual shape. Non-normality of residuals therefore\nmatters least at exactly the sample sizes where RWE analysts most often encounter it. 
The\nQ-Q plot remains useful for detecting outliers (isolated extreme points far from the\nreference line) rather than for deciding whether to abandon OLS.\n\n*Multicollinearity* — arises when two or more predictors are strongly linearly related,\nwhich inflates the standard errors of their coefficients without biasing the estimates. The\nstandard diagnostic is the variance inflation factor (VIF): for predictor j, run an auxiliary\nregression of x_j on all other predictors and compute R²_j; then VIF_j = 1 / (1 - R²_j). A\nVIF of 1 means no collinearity; VIF ≥ 10 is the traditional alarm threshold (some analysts\nuse 5). A critical distinction: collinearity between control covariates (e.g., age and\ncomorbidity score) inflates their SEs but does not bias the exposure coefficient if the\nexposure is not itself collinear with them. Collinearity becomes damaging when the exposure\nof interest is collinear with a control variable — this is the frailty-adjustment problem in\nclaims data, where disease severity proxies are correlated with treatment assignment by\ndesign. The appropriate response to high VIF is not automatic deletion of one covariate\n(which induces omitted-variable bias) but rather domain-guided consolidation (combine\ncorrelated proxies into a single composite) or ridge/penalized regression when the research\ngoal is prediction rather than causal inference.\n\n*Influence and outliers* — a distinction matters between leverage (unusual predictor values)\nand influence (observations that materially shift the fitted coefficients). An observation\ncan have high leverage but small residual (it fits the model well despite unusual predictors)\nor high residual but low leverage (a poor fit in a typical part of predictor space). Cook's\ndistance combines both: Cook's D_i ≈ 1 is the rule-of-thumb threshold above which a single\nobservation materially moves the fitted coefficients. 
In claims data, extremely high-cost\nobservations (transplants, catastrophic illness events) often appear as influential; these\nare almost always real events, not data errors. The appropriate response is to route them to the\ncost-outlier handling workflow (winsorization, two-part models) rather than silent deletion.\n\n*GLM-specific diagnostics*: for logistic regression, Pearson and deviance residuals replace\nOLS residuals. The deviance residual for observation i is sign(y_i - p̂_i) × √(deviance\ncontribution_i) and should be inspected for systematic patterns versus fitted probabilities.\nOverdispersion (variance exceeding the distributional assumption) is checked by the ratio of\nthe Pearson chi-squared to its residual degrees of freedom; for grouped binomial or count\noutcomes this ratio should be near 1 (values > 2 suggest clustering or unmodeled heterogeneity;\nconsider GEE or mixed-effects logistic). For ungrouped binary outcomes the ratio is not\ninformative, because a single Bernoulli observation cannot be overdispersed; clustering is then\naddressed through the design (GEE or mixed-effects logistic) rather than detected from this\nratio. Calibration (whether predicted probabilities match observed event\nrates) is assessed with the Hosmer-Lemeshow (HL) test (divide observations into deciles of\npredicted probability and compare observed vs expected event counts via chi-squared) and\ncalibration plots. The HL test has well-known limitations at large n: it detects trivially\nsmall miscalibration with overwhelming power, making formal significance misleading. At\nclaims scale (n ≥ 100,000), the calibration plot with LOESS smoother and the integrated\ncalibration index (ICI) replace the HL p-value as the primary calibration evidence.\n\n**The RWE-scale theme: plots over tests**\n\nAt n = 500,000 patient-years, nearly every formal diagnostic test rejects. Shapiro-Wilk detects\ninfinitesimal non-normality, Breusch-Pagan detects trivial heteroscedasticity, and the\nHosmer-Lemeshow test can flag calibration error on the order of 0.001 probability units. These test\nrejections are statistically real but practically irrelevant. 
The modern approach at RWE\nscale is to replace test-based decisions with visual inspection of randomly sampled subsets\n(sample 10,000 rows for diagnostic plots on a dataset of 500,000) combined with\neffect-size-of-violation reasoning: how large is the detected violation, and does it\nmeaningfully distort the coefficient or its confidence interval? Robust standard errors and\nGLM re-specification are preemptive defenses against the most common violations; diagnostics\nthen confirm the defense was adequate.\n\n**Diagnostics are not specification search**\n\nThe workflow corruption known as HARKing (Hypothesizing After Results are Known) has a\nregression analogue: running diagnostics, finding a violation, modifying the model to\nimprove the diagnostic, and reporting the final model as if it had been prespecified. This\ninflates type-I error because the analyst has effectively performed multiple hypothesis tests\non the same data while counting only one. The legitimate role of diagnostics is: (1) verify\nthe prespecified model is adequate; (2) document residual violations transparently; (3) if\na major violation is found, report a prespecified alternative specification (e.g., switch\nfrom OLS to gamma GLM for costs if the residual plot shows severe heteroscedasticity, as\nspecified in the SAP). 
Using diagnostics to discover and then incorporate a new predictor,\nor to justify dropping a confounding variable because its presence creates collinearity, is\nspecification search and belongs in an exploratory section clearly labeled as such.\n\n**Interpreting the output**\n\nConcrete diagnostic outputs from a claims-based OLS model of 30-day readmission cost\n(log-transformed) with predictors including age, comorbidity burden, prior utilization,\nand a binary treatment indicator:\n\n*VIF = 4 for the comorbidity burden index covariate.* Formal interpretation: the variance\nof the comorbidity coefficient is 4 times what it would be if comorbidity were orthogonal\nto all other predictors — the standard error of that coefficient is doubled relative to a\nperfectly orthogonal design. Practical interpretation: a VIF of 4 is moderate and does\nnot, by itself, demand remediation. Report the wider confidence interval honestly, note that\nage and comorbidity are correlated (as expected in any claims cohort), and resist the urge\nto drop comorbidity because it shares variance with age — omitting it would bias the\ntreatment coefficient if comorbidity is also associated with treatment assignment.\n\n*Cook's distance D = 0.9 for one observation.* Formal interpretation: D ≈ 1 is at the\nconventional threshold, meaning this single patient materially shifts the fitted coefficient\nvector if removed. Practical interpretation: investigate the record. In a cost model, a\nCook's D near 1 almost always corresponds to a catastrophic-cost outlier — a $2 million\ntransplant episode in a dataset where the 99th percentile is $150,000. This is a real\nhealthcare event, not a data entry error. 
Route to the cost-outlier-handling workflow:\nre-run with and without winsorization at the 99th percentile; if results change materially,\nreport both and discuss why the outlier exists clinically.\n\n*A funnel-shaped residual-versus-fitted plot.* Formal interpretation: heteroscedasticity\nis present: the error variance increases with the fitted mean, violating the constant-variance\n(homoscedasticity) assumption of OLS. Conventional standard errors are biased, typically too\nsmall. Practical interpretation: switch to\nHC3-robust standard errors for the OLS model and compare to a gamma GLM with log link as\nthe primary specification. In a typical claims cost analysis, the gamma GLM will often show\ntighter confidence intervals when its variance function matches the mean-variance relationship\nin the data.\n\n*Hosmer-Lemeshow p = 0.02 at n = 200,000 in a logistic readmission model.* Formal\ninterpretation: the test rejects perfect calibration across the deciles of predicted\nprobability. Practical interpretation: at n = 200,000 the HL test has very high power to detect\nmiscalibration too small to matter clinically, so a rejection alone says little about the size\nof the problem. Do not report \"the model is poorly calibrated.\" Instead inspect the\ncalibration plot: if the smoothed observed-probability curve closely tracks the 45-degree\nline across the range of predicted values, the model is well calibrated in practice. 
Report\nthe ICI (integrated calibration index) as a quantitative summary and the calibration plot\nas the primary evidence; relegate the HL p-value to a footnote with appropriate context.\n\n**Pros, cons, and trade-offs**\n\n*Pros of regression diagnostics:*\n- Provide transparent documentation of assumption adequacy, supporting reproducible and\n  auditable HEOR deliverables.\n- Catch model mis-specification before conclusions are drawn — a residual plot that reveals\n  non-linearity is far cheaper to address before the analysis report is drafted.\n- Guide appropriate model corrections (robust SEs, GLM re-specification, splines) that\n  improve inference validity without abandoning the prespecified approach.\n- Required by regulators (FDA, EMA) and HTA bodies for submitted regression models in HEOR\n  dossiers; their absence is an audit flag.\n\n*Cons and limitations:*\n- At large RWE n, formal diagnostic tests (Shapiro-Wilk, Breusch-Pagan, HL) have excessive\n  power and reject for trivially small violations — they require interpretation in terms of\n  practical effect size, not just p-values.\n- Diagnostic plots require visual judgment; two analysts may disagree on whether a residual\n  plot shows \"slight\" or \"material\" non-linearity.\n- A clean diagnostic check does not guarantee valid causal inference — it verifies\n  distributional assumptions, not identification assumptions (no unmeasured confounding,\n  no positivity violations). 
A perfectly homoscedastic model on a confounded cohort still\n  estimates a biased treatment effect.\n- Diagnostics invite specification search if the analytical culture does not enforce\n  prespecification discipline.\n\n**When to use**\n\nRun regression diagnostics after every confirmatory regression model in an HEOR analysis.\nSpecific situations: (1) any OLS model of a continuous outcome (LOS, costs, quality-of-life\nscores) — residual-vs-fitted and Q-Q plots plus VIF for all predictors; (2) any logistic\nregression model where calibration is a study deliverable (readmission models, risk\nstratification, propensity score models) — calibration plot and HL test with appropriate\ncaveats at large n; (3) any model with highly correlated covariates (age + comorbidity,\nindex year + calendar-time variables) — compute VIF before reporting coefficients; (4) cost\nmodels — Cook's distance and residual plots to identify influential episodes.\n\n**When NOT to use — and misuse modes**\n\n*Dropping outliers because they are inconvenient.* A Cook's D > 1 on a high-cost claims\nepisode does not justify deletion. These are real healthcare costs — the influencing\nobservation is the signal, not noise. Route to cost-outlier-handling methods (winsorization,\ntwo-part models, sensitivity analyses). Silently deleting high-influence observations\nwithout reporting produces results that underestimate true healthcare costs.\n\n*Using the Hosmer-Lemeshow test as the sole calibration evidence at large n.* At n ≥\n50,000, HL detects trivial miscalibration with near-certain power. A significant p-value\ndoes not mean the model is clinically miscalibrated. Always accompany HL with a calibration\nplot and a quantitative calibration metric (ICI, E:O ratio) that is interpretable on a\nclinical scale.\n\n*VIF-based automatic deletion of confounders.* High VIF in a covariate means its\ncoefficient has a wide confidence interval — not that the variable should be dropped. 
If\nthe variable is a confounder (associated with both exposure and outcome), dropping it induces\nomitted-variable bias in the treatment coefficient. The bias from omitting a confounder\nusually outweighs the efficiency loss from multicollinearity. The only situation where\ndropping a collinear predictor is warranted is when two variables measure essentially the same\nconstruct (e.g., both Charlson and Elixhauser comorbidity indices in the same model),\nand even then, domain justification is required.\n\n*Treating diagnostics as a license for post-hoc model modification.* Finding a Q-Q plot\ndeviation, adding a Box-Cox transformation, refitting, finding a residual pattern, adding\na squared term, refitting, all without prespecification, is p-hacking dressed as rigor.\nIf the diagnostic reveals a clear structural issue not anticipated in the SAP, document it\nas a protocol deviation, report a clearly labeled post-hoc sensitivity analysis, and do not\nrevise the primary model silently.",
    "primary_category": "Inferential_Statistics",
    "tags": [
      "statistics",
      "regression",
      "diagnostics",
      "model-checking",
      "residuals",
      "multicollinearity",
      "influence",
      "calibration"
    ],
    "applies_to_study_types": [
      "cohort_retrospective",
      "cohort_prospective",
      "rct",
      "cross_sectional",
      "active_comparator_new_user",
      "claims_analysis",
      "ehr_study",
      "registry_study"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "primary",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.7717/peerj.3323",
        "url": "https://doi.org/10.7717/peerj.3323",
        "citation_text": "Ernst AF, Albers CJ. Regression assumptions in clinical psychology research practice — a systematic review of common misconceptions. PeerJ. 2017;5:e3323.",
        "year": 2017,
        "authors_short": "Ernst & Albers",
        "notes": "Systematic review of how regression assumptions are (mis)handled in applied research, including common errors in interpreting VIF thresholds, normality checks, and influence diagnostics. Provides the evidence base for why residual analysis is frequently applied incorrectly and what the correct workflow looks like."
      },
      {
        "role": "explain",
        "doi": "10.1080/00401706.1977.10489493",
        "url": "https://doi.org/10.1080/00401706.1977.10489493",
        "citation_text": "Cook RD. Detection of influential observation in linear regression. Technometrics. 1977;19(1):15-18.",
        "year": 1977,
        "authors_short": "Cook",
        "notes": "The original derivation of Cook's distance as a measure of the influence of a single observation on the full coefficient vector. Introduces the geometric interpretation and the F-distribution threshold (D_i ≈ 1 as the reference) that remains the applied standard nearly five decades later."
      },
      {
        "role": "demonstrate",
        "doi": "10.1007/s11135-006-9018-6",
        "url": "https://doi.org/10.1007/s11135-006-9018-6",
        "citation_text": "O'Brien RM. A caution regarding rules of thumb for variance inflation factors. Quality and Quantity. 2007;41(5):673-690.",
        "year": 2007,
        "authors_short": "O'Brien",
        "notes": "Demonstrates via simulation that the VIF > 10 rule of thumb is context-dependent and can be far too lenient or too strict depending on sample size and the number of predictors. Essential context for the common practitioner error of treating VIF thresholds as universal cutoffs rather than situation-specific guidance."
      },
      {
        "role": "use",
        "doi": "10.1002/sim.8281",
        "url": "https://doi.org/10.1002/sim.8281",
        "citation_text": "Austin PC, Steyerberg EW. The integrated calibration index (ICI) and related metrics for quantifying the calibration of logistic regression models. Statistics in Medicine. 2019;38(21):4051-4065.",
        "year": 2019,
        "authors_short": "Austin & Steyerberg",
        "notes": "Introduces the ICI as a summary calibration metric that is interpretable on a probability scale and does not suffer from the large-n power problem of the Hosmer-Lemeshow test. Directly relevant to the RWE analyst running logistic regression on claims cohorts of 100,000+ patients where HL invariably rejects."
      }
    ],
    "plain_language_summary": "After fitting a statistical model, regression diagnostics are the checks you run to make sure the model's assumptions are reasonable — for example, that the relationship really is roughly straight-line, that errors are not ballooning at one end, and that no single patient is dominating the result. Think of them as the quality-control step between \"model ran\" and \"model is trustworthy enough to report.\" In large healthcare datasets (hundreds of thousands of patients), the same checks still apply, but the rules change slightly: formal statistical tests almost always reject even tiny problems, so analysts rely more on visual plots and ask \"how big is the problem?\" rather than \"is the p-value below 0.05?\"",
    "key_terms": [
      {
        "term": "residual",
        "definition": "The difference between what a patient actually experienced (e.g., 10 days in hospital) and what the model predicted (e.g., 7 days); a positive residual means the model underpredicted, a negative residual means it overpredicted."
      },
      {
        "term": "leverage",
        "definition": "A measure of how unusual a patient's predictor values are (e.g., very old or very sick relative to everyone else in the dataset); high leverage means that patient has more potential to pull the fitted line toward themselves."
      },
      {
        "term": "influence (Cook's distance)",
        "definition": "How much the fitted model coefficients would shift if you removed one patient from the dataset; a Cook's distance near 1 means that single patient is materially changing the estimated results."
      },
      {
        "term": "variance inflation factor",
        "definition": "A number that tells you how much a predictor's estimated effect is inflated in uncertainty because it is correlated with other predictors; a VIF of 4 means the standard error for that predictor is twice as wide as it would be if it were uncorrelated with the others."
      },
      {
        "term": "heteroscedasticity",
        "definition": "A pattern where the spread of prediction errors grows or shrinks depending on the predicted value — common in cost data, where sicker patients are both harder to predict and more expensive, creating a fan shape in diagnostic plots."
      },
      {
        "term": "calibration",
        "definition": "For models that predict probabilities (e.g., \"this patient has a 30% chance of readmission\"), calibration measures whether those predicted percentages actually match observed event rates in the data — a perfectly calibrated model's 30% predictions come true about 30% of the time."
      }
    ],
    "worked_example": {
      "scenario": "A health outcomes analyst builds an OLS model to predict inpatient length of stay (LOS, in days) for five patients using two predictors: age (years) and a comorbidity index (0-5 scale). After fitting, she runs three diagnostics: (1) a VIF calculation to check whether age and comorbidity_index are too correlated to disentangle; (2) a residual computation by hand for the highest-LOS patient; (3) an assessment of which patient has the most influence on the fitted line. The dataset is small enough to work through by hand.",
      "dataset": {
        "caption": "Five patients with age, comorbidity index (higher = sicker), and observed LOS. Age and comorbidity are moderately correlated (older patients tend to have more comorbidities), which is typical in claims cohorts and motivates the VIF check.",
        "columns": [
          "patient_id",
          "age",
          "comorbidity_index",
          "los_obs"
        ],
        "rows": [
          [
            "P1",
            45,
            1,
            3
          ],
          [
            "P2",
            55,
            2,
            4
          ],
          [
            "P3",
            65,
            3,
            7
          ],
          [
            "P4",
            70,
            3,
            9
          ],
          [
            "P5",
            80,
            4,
            10
          ]
        ]
      },
      "steps": [
        "VIF for age: run the auxiliary regression of age on comorbidity_index. The R-squared from that regression is R² = 0.75 (age is moderately explained by comorbidity in this cohort). VIF for age = 1/(1 - R²) = 1/(1 - 0.75) = 1/0.25 = 4.0. A VIF of 4.0 means the standard error for the age coefficient is twice as wide as it would be if age were orthogonal to comorbidity_index (because sqrt(4.0) = 2.0). This is moderate — below the common threshold of 10 — and does not require remediation.",
        "Residual for P5: after fitting OLS on all five patients, the model predicts los_hat = 7 days for patient P5 (age=80, comorbidity=4). The observed LOS is 10 days. Residual = observed - predicted = 10 - 7 = 3. This is the largest residual in the dataset.",
        "Influence: P5 has both the highest predictor values (leverage) and the largest residual, which combine to give the highest Cook's distance in this small dataset. With n=5 and p=3 (intercept + 2 predictors), the 50th-percentile threshold for Cook's D is approximately F(0.5, p, n-p) = F(0.5, 3, 2). In practice, any Cook's D > 0.5 in this dataset warrants a sensitivity check. Re-running without P5 would shift the age coefficient noticeably, confirming that P5 is influential.",
        "Interpretation summary: the VIF = 4.0 signals moderate collinearity between age and comorbidity_index; report wider CIs for both predictors and resist deleting either. The residual of 3 days for P5 and its high influence suggest this patient's episode is unusually costly relative to model predictions — in a real claims dataset this would trigger investigation for a catastrophic-cost record or coding anomaly."
      ],
      "result": "VIF for age = 1/(1-0.75) = 1/0.25 = 4.0 (standard error of age coefficient is 2.0 times wider than under orthogonality, since sqrt(4.0) = 2.0). Residual for P5 = 10 - 7 = 3 (model underpredicts by 3 days). P5 is the most influential observation in the dataset by Cook's distance."
    },
    "prerequisites": [
      "ols-linear-regression",
      "logistic-regression-for-binary-outcomes"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "OLS linear regression diagnostics (LOS, costs, continuous outcomes)",
        "description": "The standard diagnostic workflow for OLS models of continuous outcomes in HEOR: (1) residual-versus-fitted plot for linearity and homoscedasticity; (2) Q-Q plot of standardized residuals for normality (low priority at large n); (3) scale-location plot (sqrt of absolute standardized residuals versus fitted) for heteroscedasticity confirmation; (4) residuals-versus-leverage plot with Cook's distance contours for influence; (5) VIF for all predictors. In R, plot(lm_fit) produces the four-panel diagnostic plot covering steps 1-4 in a single call; in Python statsmodels, the OLSInfluence class supplies the studentized residuals, leverage, and Cook's distance from which the same four panels are drawn manually with matplotlib.",
        "edge_cases": [
          "At n > 50,000, residual-versus-fitted plots become dense point clouds. Sample 10,000 rows randomly for the diagnostic plot and note the sampling fraction in the analysis report.",
          "Log-transformed cost outcomes change the residual interpretation: a residual of 0.5 on the log scale means the observed cost is about 65% higher than the back-transformed prediction on the original dollar scale (exp(0.5) ≈ 1.65). Report residuals on both scales or back-transform the predicted values before computing residuals."
        ],
        "data_source_notes": "Claims: per-patient total cost or utilization is the standard outcome. Expect strong right-skew and heteroscedasticity; the residual plot will show a funnel. Gamma GLM with log link is the recommended primary model; OLS with HC3 robust SEs is the sensitivity check. EHR: continuous vitals and lab values are more symmetric; normality assumption is more defensible. Registry: adjudicated continuous outcomes (KCCQ score, 6-minute walk distance) often show bounded non-normality; check for floor/ceiling effects in the residual plot."
      },
      {
        "name": "Logistic regression diagnostics (binary outcomes, risk models)",
        "description": "For logistic regression: (1) deviance and Pearson residuals versus predicted probabilities for systematic patterns; (2) half-normal plot of absolute deviance residuals for outlier detection; (3) calibration plot (observed vs predicted event rate across deciles of predicted probability, with LOESS smooth); (4) Hosmer-Lemeshow test with appropriate caveats at large n; (5) ICI as a scalar calibration summary; (6) VIF for all predictors; (7) overdispersion check for grouped (binomial) data (Pearson chi-squared / df ≈ 1 expected; ratio > 2 suggests clustering or unmodeled heterogeneity). For ungrouped 0/1 outcomes this ratio is not informative about overdispersion, so address clustering through the design (cluster-robust SEs, GEE, or random effects) instead.",
        "edge_cases": [
          "Propensity score models: calibration is the most important diagnostic for PS models used in weighting or matching. A well-calibrated PS model produces weights that balance covariates; a miscalibrated PS model can produce extreme weights that destabilize IPW estimates. Always inspect the distribution of estimated PS and the overlap between treatment groups before weighting.",
          "Rare events (< 5% incidence): Firth's penalized regression should be considered; standard logistic residual plots may be dominated by the large excess of zeros."
        ],
        "data_source_notes": "Claims: binary outcomes (readmission, treatment switch, mortality proxy) are common. At large n the HL test rejects even trivial miscalibration; use the calibration plot and ICI as primary calibration evidence. EHR: labs and adjudicated clinical endpoints; check for informative missingness before fitting. Registry: high data quality but smaller n than claims; HL test is more informative at registry scale (n ~ 2,000-20,000)."
      },
      {
        "name": "GLM diagnostics for count and cost outcomes (Poisson, gamma, negative binomial)",
        "description": "For GLMs with non-Gaussian families: (1) Pearson or deviance residuals versus fitted values on the linear-predictor (link) scale; (2) overdispersion check for Poisson (dispersion parameter = Pearson chi-sq / df; > 2 indicates negative binomial or quasi-Poisson is needed); (3) Cook's distance and leverage from the GLM influence matrix; (4) half-normal plot of absolute deviance residuals; (5) VIF computed from the design matrix (same formula as OLS).",
        "edge_cases": [
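          "Gamma GLM with log link: the linear predictor is on the log scale, while response residuals (observed minus fitted mean) are on the original cost scale; use deviance or Pearson residuals for diagnostic plots. A systematic upward curve in deviance residuals at high fitted values suggests a different variance function may be more appropriate, for example a Tweedie family, which also accommodates the spike at zero that the gamma family cannot model.",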
          "Zero-inflated or hurdle models: standard GLM diagnostics apply to each component (zero-inflation and count parts) separately; no single residual plot covers the full model."
        ],
        "data_source_notes": "Claims cost data: gamma GLM is the primary model; inspect deviance residuals for the heavy right tail. Utilization counts (ED visits, outpatient visits): Poisson with robust SEs as default, negative binomial if overdispersion ratio > 2."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "ols-linear-regression",
        "pros_of_this": "Diagnostics layer on top of OLS to verify whether its assumptions hold for a given dataset; without diagnostics, OLS is a black box that may be producing biased SEs.",
        "cons_of_this": "Running diagnostics adds analytical time and requires judgment calls about what constitutes a material violation — introducing subjectivity into what is otherwise a mechanical estimation step.",
        "when_to_prefer": "Always run diagnostics after OLS; they are inseparable from responsible use of the method."
      },
      {
        "compared_to": "brier-score-calibration-rwe",
        "pros_of_this": "Regression diagnostics (including the HL test and calibration plots) evaluate model fit during development, not just at prediction time; they also cover assumption checks beyond calibration (linearity, heteroscedasticity, influence) that the Brier score does not.",
        "cons_of_this": "The Brier score and related prediction-model metrics (C-statistic, ICI) evaluate predictive performance on held-out data and generalize across model types; regression diagnostics are model-specific and require access to model internals (residuals, leverage matrix).",
        "when_to_prefer": "Use regression diagnostics during model building and for confirmatory analysis reporting; use Brier score and discrimination/calibration metrics when the deliverable is a deployed prediction model evaluated on external data."
      }
    ],
    "implementation_notes_by_data_source": {},
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\nimport statsmodels.api as sm\nfrom statsmodels.stats.outliers_influence import (\n    variance_inflation_factor, OLSInfluence\n)\nfrom scipy import stats\nimport matplotlib.pyplot as plt\n\n# ── Teaching dataset from worked_example ──\nage  = np.array([45, 55, 65, 70, 80], dtype=float)\ncomorbidity = np.array([1, 2, 3, 3, 4], dtype=float)\nlos_obs = np.array([3, 4, 7, 9, 10], dtype=float)\n\n# ── 1. Fit OLS ──\nX = sm.add_constant(np.column_stack([age, comorbidity]))\nmodel = sm.OLS(los_obs, X).fit()\nprint(model.summary())\n\n# ── 2. Variance Inflation Factors ──\n# VIF for each predictor column (skip the constant at index 0)\nfor i, col_name in enumerate([\"age\", \"comorbidity_index\"], start=1):\n    vif = variance_inflation_factor(X, i)\n    print(f\"VIF({col_name}) = {vif:.2f}\")\n# Manual check via the auxiliary regression: VIF = 1/(1 - R^2).\n# The worked_example uses a stylized R^2 = 0.75 (VIF = 4.0); computed from these five\n# rows, age and comorbidity are almost collinear (R^2 is about 0.98), so the VIF printed\n# here is far above 4.\nr2_aux = np.corrcoef(age, comorbidity)[0, 1] ** 2\nvif_manual = 1 / (1 - r2_aux)\nprint(f\"VIF from auxiliary R^2={r2_aux:.4f}: {vif_manual:.2f}\")\n\n# ── 3. Residuals and Cook's distance ──\ninfluence = OLSInfluence(model)\ncooks_d, _ = influence.cooks_distance\nresiduals = model.resid\nfitted    = model.fittedvalues\nprint(f\"\\nResiduals:     {residuals.round(2)}\")\nprint(f\"Cook's D:      {cooks_d.round(3)}\")\n# Patient P5: residual = 10 - fitted[4]\nprint(f\"P5 residual = {los_obs[4]:.0f} - {fitted[4]:.2f} = {residuals[4]:.2f}\")\n\n# ── 4. 
Diagnostic plots (four-panel layout) ──\nfig, axes = plt.subplots(2, 2, figsize=(10, 8))\n\n# Panel 1: Residuals vs Fitted\naxes[0, 0].scatter(fitted, residuals)\naxes[0, 0].axhline(0, color=\"red\", linestyle=\"--\")\naxes[0, 0].set(xlabel=\"Fitted values\", ylabel=\"Residuals\",\n               title=\"Residuals vs Fitted\")\n\n# Panel 2: Q-Q plot of standardized residuals\n# Internally studentized residuals account for each point's leverage\nstd_resid = influence.resid_studentized_internal\nsm.qqplot(std_resid, line=\"s\", ax=axes[0, 1])\naxes[0, 1].set_title(\"Normal Q-Q\")\n\n# Panel 3: Scale-Location (sqrt(|std resid|) vs fitted)\nsqrt_abs = np.sqrt(np.abs(std_resid))\naxes[1, 0].scatter(fitted, sqrt_abs)\naxes[1, 0].set(xlabel=\"Fitted values\", ylabel=\"sqrt(|Std. residuals|)\",\n               title=\"Scale-Location\")\n\n# Panel 4: Residuals vs Leverage, points shaded by Cook's D\nleverage = influence.hat_matrix_diag\naxes[1, 1].scatter(leverage, std_resid, c=cooks_d, cmap=\"Reds\")\naxes[1, 1].set(xlabel=\"Leverage\", ylabel=\"Std. residuals\",\n               title=\"Residuals vs Leverage\")\nfor label, hii, ri in zip([\"P1\",\"P2\",\"P3\",\"P4\",\"P5\"], leverage, std_resid):\n    axes[1, 1].annotate(label, (hii, ri))\n\nplt.tight_layout()\nplt.savefig(\"regression_diagnostics.png\", dpi=150)\n\n# ── 5. 
Logistic calibration sketch (Hosmer-Lemeshow decile approach) ──\n# For a fitted logistic model with predicted probabilities p_hat and outcomes y:\n# (Replace p_hat and y with real model output)\nnp.random.seed(42)\nn_test = 200\np_hat = np.random.beta(2, 5, n_test)        # synthetic predicted probs\ny_test = (np.random.rand(n_test) < p_hat).astype(int)\n\n# Decile-based calibration\ndeciles = np.quantile(p_hat, np.linspace(0, 1, 11))\nobs_rates, pred_rates, counts = [], [], []\nlast_bin = len(deciles) - 2\nfor i, (lo, hi) in enumerate(zip(deciles[:-1], deciles[1:])):\n    # Close the last bin on the right so the maximum p_hat is not dropped\n    upper = (p_hat <= hi) if i == last_bin else (p_hat < hi)\n    mask = (p_hat >= lo) & upper\n    if mask.sum() > 0:\n        obs_rates.append(y_test[mask].mean())\n        pred_rates.append(p_hat[mask].mean())\n        counts.append(mask.sum())\n\n# Hosmer-Lemeshow chi-squared statistic\nhl_chi2 = sum(\n    n * (o - e)**2 / (e * (1 - e))\n    for n, o, e in zip(counts, obs_rates, pred_rates)\n    if e > 0 and e < 1\n)\nhl_df = len(obs_rates) - 2\nhl_p = stats.chi2.sf(hl_chi2, df=hl_df)  # upper-tail probability\nprint(f\"\\nHL chi2 = {hl_chi2:.2f}, df = {hl_df}, p = {hl_p:.4f}\")\nprint(\"At large n: inspect calibration plot; HL p-value alone is misleading.\")",
        "description": "Regression diagnostics using statsmodels for OLS (four-panel residual plots built from\nOLSInfluence quantities, VIF via variance_inflation_factor, Cook's distance) and a logistic\ncalibration check (Hosmer-Lemeshow decile approach via scipy; the per-decile observed and predicted\nrates it computes are the inputs to a calibration plot).\nUses the five-patient teaching dataset from the worked_example for VIF and residuals.\nAll plots use matplotlib; no seaborn dependency required.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "# ── Teaching dataset ──\ndf <- data.frame(\n  patient_id = c(\"P1\",\"P2\",\"P3\",\"P4\",\"P5\"),\n  age        = c(45, 55, 65, 70, 80),\n  comorbidity= c(1, 2, 3, 3, 4),\n  los_obs    = c(3, 4, 7, 9, 10)\n)\n\n# ── 1. Fit OLS ──\nlm_fit <- lm(los_obs ~ age + comorbidity, data = df)\nsummary(lm_fit)\n\n# ── 2. Four-panel diagnostic plot ──\n# Panels: (1) Residuals vs Fitted, (2) Normal Q-Q,\n#         (3) Scale-Location, (4) Residuals vs Leverage\npar(mfrow = c(2, 2))\nplot(lm_fit)    # base R, labels outliers by row number automatically\npar(mfrow = c(1, 1))\n\n# ── 3. Variance Inflation Factors ──\n# install.packages(\"car\") if not already installed\nlibrary(car)\nvif(lm_fit)     # reports VIF per predictor; > 10 is the conventional alarm threshold\n# Manual check: auxiliary regression of age on comorbidity\naux_r2 <- summary(lm(age ~ comorbidity, data = df))$r.squared\ncat(sprintf(\"Auxiliary R^2 = %.4f  -> VIF = 1/(1-R^2) = 1/%.4f = %.2f\\n\",\n            aux_r2, 1 - aux_r2, 1 / (1 - aux_r2)))\n\n# ── 4. Influence measures (Cook's D, leverage, DFFITS) ──\ninfl <- influence.measures(lm_fit)\nprint(infl)     # flags observations exceeding Cook's D / leverage thresholds with *\n# Extract Cook's distance directly\ncooks_d <- cooks.distance(lm_fit)\ncat(\"\\nCook's distances:\\n\")\nprint(round(cooks_d, 4))\ncat(sprintf(\"P5 residual = %.0f - %.2f = %.2f\\n\",\n            df$los_obs[5], fitted(lm_fit)[5], residuals(lm_fit)[5]))\n\n# ── 5. 
Logistic regression diagnostics ──\n# Synthetic binary outcome for demonstration\nset.seed(42)\nn <- 500\nlogistic_df <- data.frame(\n  age    = rnorm(n, 65, 10),\n  comor  = sample(0:5, n, replace = TRUE),\n  y      = rbinom(n, 1, 0.25)\n)\nlogit_fit <- glm(y ~ age + comor, data = logistic_df, family = binomial)\nsummary(logit_fit)\n\n# VIF for logistic predictors\nvif(logit_fit)\n\n# Deviance residuals vs fitted probabilities\np_hat <- fitted(logit_fit)\ndev_resid <- residuals(logit_fit, type = \"deviance\")\nplot(p_hat, dev_resid, xlab = \"Fitted probability\", ylab = \"Deviance residual\",\n     main = \"Deviance residuals vs fitted (logistic)\")\nabline(h = 0, lty = 2)\n\n# Calibration plot\ncal_data <- data.frame(p_hat = p_hat, y = logistic_df$y)\ncal_data$decile <- cut(p_hat, breaks = quantile(p_hat, seq(0, 1, 0.1)),\n                       include.lowest = TRUE, labels = FALSE)\ncal_summary <- aggregate(cbind(p_hat, y) ~ decile, data = cal_data, FUN = mean)\nplot(cal_summary$p_hat, cal_summary$y,\n     xlab = \"Mean predicted probability\", ylab = \"Observed event rate\",\n     main = \"Calibration plot (decile means)\", xlim = c(0,1), ylim = c(0,1))\nabline(0, 1, lty = 2, col = \"red\")   # 45-degree line of perfect calibration\n\n# Hosmer-Lemeshow test (with large-n caveat: inspect calibration plot first)\n# install.packages(\"ResourceSelection\") if needed\nlibrary(ResourceSelection)\nhl_test <- hoslem.test(logistic_df$y, p_hat, g = 10)\nprint(hl_test)\ncat(\"At large n, HL p-value is unreliable. Prefer the calibration plot and ICI.\\n\")",
        "description": "The canonical R diagnostics workflow: plot(lm_fit) for the four-panel OLS diagnostic\n(residuals vs fitted, QQ, scale-location, residuals vs leverage), car::vif() for VIF,\ninfluence.measures() for Cook's distance and leverage, and ResourceSelection::hoslem.test()\nfor the Hosmer-Lemeshow calibration test. All code uses base R except the two named packages.\nThe teaching dataset matches the worked_example throughout.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* ── Create teaching dataset ── */\ndata work.los;\n  input patient_id $ age comorbidity los_obs;\n  datalines;\nP1 45 1  3\nP2 55 2  4\nP3 65 3  7\nP4 70 3  9\nP5 80 4 10\n;\nrun;\n\n/* ── 1. OLS with VIF and Influence diagnostics ── */\nproc reg data=work.los;\n  model los_obs = age comorbidity\n    / vif          /* Variance Inflation Factors for each predictor   */\n      influence    /* Cook's D, DFFITS, leverage (hat diagonal)       */\n      r;           /* Studentized and regular residuals in output      */\n  output out=work.diag predicted=yhat residual=resid student=std_resid\n         h=leverage cookd=cooks_d;\n  /* ODS table for programmatic access to influence stats             */\n  ods output ParameterEstimates=work.params\n             FitStatistics=work.fit;\nrun;\n\n/* ── 2. Print residuals and Cook's D ── */\nproc print data=work.diag;\n  var patient_id los_obs yhat resid leverage cooks_d;\n  title \"OLS Diagnostics: Residuals and Cook's D\";\nrun;\n\n/* ── 3. VIF manual check: auxiliary regression of age on comorbidity ── */\nproc reg data=work.los;\n  model age = comorbidity / rsquare;\n  /* Read R^2 from output; VIF = 1/(1 - R^2). If R^2=0.75 -> VIF=4  */\nrun;\n\n/* ── 4. Synthetic logistic dataset for calibration demo ── */\ndata work.logistic;\n  call streaminit(42);\n  do i = 1 to 500;\n    age   = rand(\"Normal\", 65, 10);\n    comor = round(rand(\"Uniform\", 0, 5));\n    /* True probability depends on age and comorbidity                */\n    p_true = 1 / (1 + exp(-(-3 + 0.02*age + 0.15*comor)));\n    y = rand(\"Bernoulli\", p_true);\n    output;\n  end;\n  drop i p_true;\nrun;\n\n/* ── 5. 
Logistic regression with HL test and calibration plot ── */\nproc logistic data=work.logistic plots(only)=(calibration roc);\n  model y(event=\"1\") = age comor / lackfit   /* LACKFIT = Hosmer-Lemeshow test */\n                                    ctable    /* classification table           */\n                                    outroc=work.roc;\n  output out=work.logit_out predicted=p_hat;\n  ods output LackFitChiSq=work.hl_test;      /* capture HL statistic           */\nrun;\n\n/* ── 6. PROC GENMOD for Poisson/Gamma GLM overdispersion check ── */\ndata work.counts;\n  input n_visits @@;   /* trailing @@ reads every value on the line, one per observation */\n  datalines;\n2 0 5 3 1 8 4 2 6 1 0 3 7 2 4\n;\nrun;\ndata work.counts;\n  set work.counts;\n  x1 = mod(_N_, 3);   /* synthetic predictor */\nrun;\nproc genmod data=work.counts;\n  model n_visits = x1 / dist=poisson link=log;\n  /* Examine Pearson ChiSq / DF in the Criteria table:\n     Ratio ~1.0 = no overdispersion\n     Ratio  >2  = use DIST=NEGBIN or DSCALE option for quasi-Poisson */\nrun;",
        "description": "SAS regression diagnostics using PROC REG with VIF and INFLUENCE options for OLS,\nPROC LOGISTIC with LACKFIT for the Hosmer-Lemeshow test and PLOTS=CALIBRATION for the\ncalibration plot, and PROC GENMOD for the Poisson overdispersion check (Pearson chi-square / DF).\nPROC REG and PROC LOGISTIC use OUTPUT and ODS OUTPUT statements to capture results for downstream\nreporting. Uses the five-patient teaching dataset for OLS; a larger synthetic dataset for\nlogistic.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Start([Fitted regression model]) --> Linear[Check linearity\\nResiduals vs Fitted plot]\n  Linear -->|Curved pattern| FixLin[Add spline / squared term\\nor log-transform predictor]\n  Linear -->|Random scatter| Hetero[Check heteroscedasticity\\nScale-Location plot]\n  Hetero -->|Funnel pattern| FixHet[HC3 robust SEs\\nor GLM re-specification]\n  Hetero -->|Stable spread| Multi[Check multicollinearity\\nVIF for all predictors]\n  Multi -->|VIF > 10| FixVIF[Consolidate correlated covariates\\nDo NOT auto-drop confounders]\n  Multi -->|VIF acceptable| Infl[Check influence\\nCook's D / leverage plot]\n  Infl -->|Cook's D near 1| FixInfl[Investigate record\\nRoute to outlier handling workflow]\n  Infl -->|No extreme influence| Type{Model type}\n  Type -->|OLS| Done([Report: model passed diagnostics\\nDocument any violations])\n  Type -->|Logistic| Calib[Check calibration\\nCalibration plot plus ICI]\n  Calib -->|Plot close to 45-degree line| Done\n  Calib -->|Systematic deviation| FixCal[Report ICI magnitude\\nConsider recalibration]",
        "caption": "Regression diagnostics decision flow: each failure mode has a specific check, a specific fix, and a routing rule (e.g., influential cost observations route to cost-outlier-handling, not deletion). Logistic models add the calibration branch.",
        "alt_text": "Flowchart starting from a fitted regression model, branching through linearity, heteroscedasticity, multicollinearity, and influence checks for OLS, then adding a calibration branch for logistic regression, with remediation routes at each failure node.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "used_with",
        "target_slug": "ols-linear-regression",
        "notes": "OLS regression assumes linearity, homoscedasticity, and approximately normal residuals for valid inference; regression diagnostics are the primary mechanism for verifying these assumptions post-fit. VIF, Cook's distance, and the four-panel residual plot are standard deliverables alongside any reported OLS coefficient table."
      },
      {
        "relation_type": "used_with",
        "target_slug": "logistic-regression-for-binary-outcomes",
        "notes": "Logistic regression in RWE requires calibration assessment alongside discrimination; the Hosmer-Lemeshow test and calibration plot are the canonical logistic diagnostics, and deviance residuals replace OLS residuals for identifying unusual observations. At claims scale (n > 100,000), HL must be interpreted cautiously and the ICI is preferred."
      },
      {
        "relation_type": "see_also",
        "target_slug": "cost-outlier-handling-rwe",
        "notes": "Cook's distance identifies influential observations; in cost models these almost always correspond to catastrophic-cost episodes that are genuine healthcare events, not errors. The cost-outlier-handling entry covers winsorization, two-part models, and sensitivity analyses that are the correct response to high-influence cost records — not silent deletion."
      },
      {
        "relation_type": "see_also",
        "target_slug": "brier-score-calibration-rwe",
        "notes": "The Hosmer-Lemeshow test and calibration plot assess in-sample calibration during model development; the Brier score and its decomposition into calibration and refinement components assess predictive performance on held-out data. Together they form the modern prediction-model validation toolkit recommended by TRIPOD reporting guidelines."
      }
    ],
    "aliases": [
      "residual analysis",
      "multicollinearity",
      "VIF",
      "influence diagnostics",
      "Hosmer-Lemeshow",
      "Cook's distance",
      "variance inflation factor",
      "heteroscedasticity check",
      "calibration plot"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "regression-discontinuity-design-rwe",
    "name": "Regression Discontinuity Design",
    "short_definition": "A quasi-experimental design that estimates a local causal effect at a known assignment threshold on a continuous running variable, comparing units just above and just below the cutoff where treatment status changes discontinuously but other determinants of the outcome vary smoothly.",
    "long_description": "A **regression discontinuity design (RDD)** identifies a causal effect from a deterministic (sharp) or probabilistic (fuzzy)\nrule that assigns treatment when a continuous **running variable** (also called the forcing or assignment variable) crosses a\nknown **cutoff** c. Healthcare examples are abundant because eligibility and clinical action are routinely threshold-driven:\nan age cutoff for screening recommendations or Medicare eligibility (65), a risk-score or eligibility-score threshold for a\ndisease-management program, a lab-value threshold that triggers treatment (HbA1c, eGFR, LDL, gestational age, APGAR/birthweight\ncutoffs). The core identifying assumption is **continuity of potential outcomes at the cutoff**: absent the treatment rule,\nthe expected outcome would be a smooth function of the running variable through c, so units just below the cutoff are a valid\ncounterfactual for units just above. The estimand is the **local average treatment effect at the cutoff** (the LATE for\nunits at X = c), computed as the jump in the conditional expectation of the outcome at c:\nτ = lim(x→c+) E[Y|X=x] − lim(x→c−) E[Y|X=x] for the sharp design. RDD is widely regarded as the quasi-experiment with the\nstrongest internal validity short of randomization, because near the cutoff the only thing that changes discontinuously is\ntreatment status — confounders are, by construction, continuous through c.\n\n**Core conceptual distinction.** Three things must be specified and they are separable. 
(1) *Sharp vs fuzzy*: in a **sharp\nRDD** crossing the cutoff changes treatment with probability 1 (the rule is deterministic, e.g., automatic enrollment at age\n65); in a **fuzzy RDD** the cutoff changes only the *probability* of treatment (a guideline threshold that clinicians follow\nimperfectly), and the effect is the jump in outcome divided by the jump in treatment probability — algebraically a Wald\ninstrumental-variables estimator where \"above the cutoff\" is the instrument. (2) *Estimation near the cutoff*: the modern\nstandard is **local linear (or local polynomial) regression** within a data-driven **bandwidth** h around c, fit separately on\neach side, rather than a global high-order polynomial over the whole range (which Gelman and Imbens show produces noisy,\noverfit, misleading estimates). Bandwidth selection (e.g., the Calonico-Cattaneo-Titiunik MSE-optimal bandwidth with\nrobust bias-corrected inference) is the central tuning decision, and results must be shown across a sensible range of\nbandwidths. (3) *Validation*: because the entire design rests on no one being able to precisely position themselves relative\nto the cutoff, the **McCrary density test** (a discontinuity in the density of the running variable at c) screens for\nmanipulation/sorting, and covariates measured pre-treatment should show **no discontinuity** at c (a placebo/balance check) —\nif they jump, the continuity assumption is violated.\n\n**Pros, cons, and trade-offs.**\n- **vs instrumental variables (`instrumental-variables-pharmacoepi-rwe`):** A fuzzy RDD *is* an IV using the cutoff as the\n  instrument, with unusually transparent and defensible instrument validity (the threshold is institutionally fixed and its\n  relevance is visible as a first-stage jump in treatment probability). General IV (e.g., physician prescribing preference,\n  distance to facility) buys a population-level effect but rests on harder-to-verify exclusion restrictions. 
**Prefer RDD**\n  when a genuine threshold rule exists, because the instrument's validity is so much easier to argue and test; **prefer\n  general IV** when no threshold exists but a credible instrument does, and when an effect away from a single cutoff is needed.\n- **vs difference-in-differences (`difference-in-differences-staggered-adoption-rwe`):** RDD exploits a cross-sectional\n  threshold and needs no pre/post panel or parallel-trends assumption; DiD exploits timing and a comparison group. **Prefer\n  RDD** when assignment is threshold-driven at a point in time; **prefer DiD** when an intervention turns on at a date and a\n  parallel comparison group exists.\n- **vs target trial emulation / PS designs (`target-trial-emulation`, `propensity-score-methods-psm-iptw`):** PS/target-trial\n  methods estimate an average effect over the whole confounder-balanced population but require *measuring* the confounders;\n  RDD needs no confounder measurement at all near the cutoff but pays for it with an effect that is **local to the threshold**\n  and not necessarily generalizable to units far from c, plus a steep cost in statistical efficiency (only units near c carry\n  information, so power is low and large samples are required). **Prefer PS/target-trial** when external validity across the\n  full population matters and confounders are well measured; **prefer RDD** when unmeasured confounding is the dominant threat\n  and a clean threshold rule exists.\n\n**When to use.** A continuous running variable with a known, externally fixed cutoff that drives treatment (age-based\neligibility, a risk/eligibility score, a lab or clinical threshold); enough observations *near* the cutoff to fit local\nregressions with usable precision; a plausible argument that units cannot precisely manipulate their position relative to c;\na setting where unmeasured confounding makes covariate-adjustment designs unconvincing. 
RDD is especially valuable for\nevaluating coverage/eligibility policies and threshold-based clinical guidelines where randomization is impossible.\n\n**When NOT to use — and when it is actively misleading or dangerous.**\n- **The running variable can be manipulated.** If patients, clinicians, or coders can nudge a value across the cutoff to\n  obtain (or avoid) treatment — rounding a lab to qualify, mis-dating to clear an age threshold — units sort around c, the\n  density jumps (McCrary test fails), and the estimate is confounded by who chose which side. This is the most dangerous RDD\n  failure and a failed density test should halt the analysis.\n- **No true discontinuity in treatment.** If crossing the cutoff does not actually change treatment status or probability\n  (a guideline nobody follows), the first stage is flat and a fuzzy-RDD estimate explodes from dividing by a near-zero\n  denominator. Verify and report the first-stage jump.\n- **Other things change at the same cutoff.** If a *different* program or rule shares the threshold (turning 65 changes both\n  Medicare eligibility and the program under study), the jump conflates them; the effect is not attributable to the studied\n  treatment. Check that covariates and competing policies are continuous at c.\n- **Effect extrapolated away from the cutoff.** The RDD estimate is local to X = c. 
Reporting it as the average effect for the\n  whole population — including those far from the threshold where the effect may differ — overstates external validity.\n- **A global high-order polynomial is used.** Fitting a quartic/quintic over the full range manufactures spurious\n  discontinuities at the boundary (Gelman-Imbens); use local linear/quadratic within a data-driven bandwidth instead.\n\n**Data-source operational depth.**\n- **Claims (FFS vs MA):** The running variable is often age (clean, from enrollment files) or a constructed risk/eligibility\n  score; the treatment is an enrollment, coverage, or service indicator that switches at the cutoff. A standing failure mode\n  at the age-65 Medicare threshold: cohort composition changes because people shift from commercial to Medicare coverage at\n  exactly 65, so the *observed* population and its data capture change discontinuously — restrict to a consistent,\n  continuously observable population (or model the coverage transition explicitly) so the outcome jump is not a data-capture\n  artifact. Lab-threshold RDDs require the lab value itself, which fee-for-service claims usually lack; these need linked\n  EHR/lab data.\n- **EHR:** Best substrate for lab/clinical-threshold RDDs because the running variable (HbA1c, eGFR, blood pressure) is\n  recorded directly. But values are measured with error and may be re-tested when near a clinical threshold (a borderline\n  value triggers a repeat that crosses c), inducing exactly the manipulation/sorting RDD forbids — examine the density and\n  test for heaping at round numbers, and consider the measured running variable's noise.\n- **Registry / linked:** Birth/perinatal registries provide near-canonical RDD running variables (gestational age, birthweight\n  at NICU-admission thresholds) with adjudicated outcomes; linkage to claims supplies follow-up and cost outcomes. 
Watch for\n  heaping at reported round values (e.g., birthweight at 1500 g, 2500 g) which biases the local fit.\n\n**Worked claims/linked example.** Question: does initiating a lipid-lowering management program at a guideline LDL threshold\nof 190 mg/dL reduce 1-year cardiovascular hospitalization, using linked EHR-lab + claims data? (1) *Running variable and\ncutoff*: index LDL value X; cutoff c = 190 mg/dL; treatment = program enrollment, which clinicians follow imperfectly (a\n**fuzzy** RDD). (2) *First stage*: program enrollment probability jumps from ~0.20 just below 190 to ~0.65 just above — a\nvisible, statistically significant discontinuity, confirming relevance. (3) *Validation*: the McCrary density test shows no\ndiscontinuity in the LDL distribution at 190 (p = 0.41), and pre-treatment covariates (age, prior CV history, baseline\ncomorbidity index) are continuous through 190 — supporting the continuity assumption; a heaping check finds no excess mass at\nround LDL values near the cutoff. (4) *Estimation*: local linear regression on each side within the MSE-optimal bandwidth\n(h ≈ 22 mg/dL by Calonico-Cattaneo-Titiunik), with robust bias-corrected confidence intervals. The reduced-form outcome jump\nis −2.1 hospitalizations per 100 patients; dividing by the first-stage jump (0.45) gives a **fuzzy-RDD local effect of −4.7\nper 100** (95% robust CI −8.2 to −1.2) for compliers at the threshold. (5) *Sensitivity*: re-estimate across bandwidths\n(0.5h, h, 2h) and with local quadratic fits, confirm the effect is stable and not an artifact of bandwidth or polynomial\norder, and state clearly that the effect is **local to LDL ≈ 190** and does not license claims about patients far from the\ncutoff.\n\n**Interpreting the output**\n\nUsing the worked example: average hospitalization rate just below LDL 190 is 29.0 per 100 patients and just above\nis 24.0, giving a sharp-RDD discontinuity of 24.0 − 29.0 = −5.0 per 100. 
Adjusting for imperfect enrollment\n(fuzzy RDD) yields ≈ −11.1 per 100 compliers at the threshold (−5.0 ÷ 0.45). These figures come from the synthetic\nband-level table in the structured worked example, so they differ from the local-linear estimates in the linked-data\nexample above (reduced form −2.1, fuzzy effect −4.7 per 100).\n\nFormal interpretation: The −5.0 hospitalizations per 100 patients is the reduced-form (intention-to-treat) jump at\nLDL = 190 mg/dL: the causal effect of crossing the program eligibility threshold for patients whose LDL lands at\nthat cutoff. It would equal the treatment effect itself only in a sharp design, where enrollment jumps from 0 to 1.\nThis estimate is local: it applies only to the threshold neighborhood, and extrapolating it to patients with LDL\nwell above or below 190 is not supported by the design. In the fuzzy RDD, the ≈ −11.1 estimate is the LATE for\ncompliers at the threshold (patients who enroll only because their LDL is at or above 190), derived by dividing\nthe reduced-form outcome jump by the jump in enrollment probability (≈ 0.45). The key identification assumption\nis that no other determinant of hospitalization risk jumps discontinuously at LDL 190; it is probed by verifying a\nsmooth patient density (McCrary test) and smooth pre-determined covariates through the cutoff.\n\nPractical interpretation: Patients just above the 190 mg/dL eligibility line had approximately 5 fewer\ncardiovascular hospitalizations per 100 patients per year than similar patients just below the line, counting\neveryone above the line whether or not they enrolled. Among patients whose enrollment was decided by the threshold,\nthe implied reduction is about 11 per 100. This is a credible local causal estimate because the two groups are\nnear-identical in every measured characteristic except which side of the threshold they fell on; the estimate does\nnot speak to patients with much higher or lower LDL values.",
    "primary_category": "Causal_Inference_Method",
    "tags": [
      "regression-discontinuity",
      "running-variable",
      "local-linear-regression",
      "bandwidth-selection",
      "mccrary-density-test",
      "sharp-and-fuzzy",
      "local-average-treatment-effect",
      "quasi-experiment"
    ],
    "applies_to_study_types": [
      "claims_analysis",
      "ehr_study",
      "registry_linkage",
      "comparative_effectiveness",
      "policy_evaluation"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "linked",
      "registry"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1136/bmj.i1216",
        "url": "https://doi.org/10.1136/bmj.i1216",
        "citation_text": "Venkataramani AS, Bor J, Jena AB. Regression discontinuity designs in healthcare research. BMJ. 2016;352:i1216.",
        "year": 2016,
        "authors_short": "Venkataramani et al.",
        "notes": "The standard applied introduction to RDD for health researchers; explains sharp vs fuzzy designs, the local-cutoff estimand, threshold-based clinical and policy examples, and the validity checks (density and covariate continuity)."
      },
      {
        "role": "explain",
        "doi": "10.1016/j.jeconom.2007.05.001",
        "url": "https://doi.org/10.1016/j.jeconom.2007.05.001",
        "citation_text": "Imbens GW, Lemieux T. Regression discontinuity designs: a guide to practice. Journal of Econometrics. 2008;142(2):615-635.",
        "year": 2008,
        "authors_short": "Imbens & Lemieux",
        "notes": "The foundational practical guide establishing local-polynomial estimation, bandwidth choice, the fuzzy-RDD Wald/IV interpretation, and the graphical and validity diagnostics that define modern RDD practice."
      },
      {
        "role": "explain",
        "doi": "10.1016/j.jeconom.2007.05.005",
        "url": "https://doi.org/10.1016/j.jeconom.2007.05.005",
        "citation_text": "McCrary J. Manipulation of the running variable in the regression discontinuity design: a density test. Journal of Econometrics. 2008;142(2):698-714.",
        "year": 2008,
        "authors_short": "McCrary",
        "notes": "Defines the density (McCrary) test for manipulation/sorting at the cutoff, the central falsification check that detects when units can position themselves across the threshold and invalidate the design."
      },
      {
        "role": "demonstrate",
        "doi": "10.1257/jel.48.2.281",
        "url": "https://doi.org/10.1257/jel.48.2.281",
        "citation_text": "Lee DS, Lemieux T. Regression discontinuity designs in economics. Journal of Economic Literature. 2010;48(2):281-355.",
        "year": 2010,
        "authors_short": "Lee & Lemieux",
        "notes": "Comprehensive review formalizing the continuity-of-potential-outcomes identification, the local-randomization interpretation near the cutoff, and best-practice estimation, validation, and presentation; the standard methodological reference."
      }
    ],
    "plain_language_summary": "Regression discontinuity design (RDD) is a method for estimating a causal treatment effect when patients are assigned to treatment based on whether a continuous measurement crosses a fixed threshold. The core idea is that patients just barely above the threshold and patients just barely below it are nearly identical in every way except their treatment status, so comparing their outcomes is close to a randomized experiment. For example, if guidelines say to prescribe a drug when LDL cholesterol reaches 190 mg/dL, a patient measured at 191 is almost the same person as one measured at 189 — the small gap in their LDL values is essentially random, so any jump in health outcomes right at that 190 cutoff can be attributed to the treatment rather than to pre-existing differences between the groups.",
    "key_terms": [
      {
        "term": "running variable",
        "definition": "The continuous measurement (for example, LDL cholesterol in mg/dL or age in years) whose value determines whether a patient receives treatment once it crosses the cutoff."
      },
      {
        "term": "cutoff",
        "definition": "The fixed threshold on the running variable where treatment status changes — patients on one side are treated (or much more likely to be treated) and patients on the other side are not."
      },
      {
        "term": "local randomization",
        "definition": "The idea that patients who land just above versus just below the cutoff are so similar in everything else that their tiny difference in the running variable is essentially random, making the comparison as credible as a small randomized trial run right at the threshold."
      },
      {
        "term": "discontinuity",
        "definition": "The sudden jump in outcome rates at the cutoff — the size of that jump is the estimated treatment effect, because everything else about patients at the threshold is assumed to vary smoothly."
      }
    ],
    "worked_example": {
      "scenario": "A health plan wants to know whether a lipid-management program reduces cardiovascular hospitalizations. The program is offered when a patient's index LDL measurement is at or above 190 mg/dL. We look at patients whose LDL was measured within a narrow window of 10 mg/dL on either side of that cutoff (180-199 mg/dL) and compare 1-year hospitalization rates just below versus just above 190. Because those patients are nearly identical except for which side of the line they fell on, any jump in hospitalization rates at 190 estimates the program's causal effect.",
      "dataset": {
        "caption": "Observed 1-year cardiovascular hospitalization rates for patients near the LDL 190 mg/dL cutoff (synthetic illustration using proportions from the source file's worked example).",
        "columns": [
          "ldl_band_mg_dL",
          "side_of_cutoff",
          "n_patients",
          "hospitalizations_per_100"
        ],
        "rows": [
          [
            "180-184",
            "below",
            210,
            30
          ],
          [
            "185-189",
            "below",
            198,
            28
          ],
          [
            "190-194",
            "above",
            205,
            25
          ],
          [
            "195-199",
            "above",
            192,
            23
          ]
        ]
      },
      "steps": [
        "Average hospitalization rate just below the cutoff (bands 180-184 and 185-189): (30 + 28) / 2 = 29.0 per 100 patients.",
        "Average hospitalization rate just above the cutoff (bands 190-194 and 195-199): (25 + 23) / 2 = 24.0 per 100 patients.",
        "The discontinuity (jump) at the cutoff = 24.0 - 29.0 = -5.0 per 100 patients.",
        "This -5.0 figure is the reduced-form (intention-to-treat) estimate: being on the eligible side of the 190 mg/dL line lowers hospitalizations by about 5 per 100 patients per year, averaged over patients who did and did not enroll.",
        "Because the program is not followed perfectly by every clinician (some patients above 190 are not enrolled; a few below 190 are enrolled anyway), this is a fuzzy RDD. The effect among compliers (patients who enroll only because they crossed the cutoff) is larger: the -5.0 jump in outcomes divided by the jump in enrollment probability (about 0.45, from 20% below to 65% above the cutoff) gives approximately -11.1 per 100 compliers at the threshold."
      ],
      "result": "Reduced-form discontinuity: -5.0 hospitalizations per 100 patients (24.0 above minus 29.0 below). Fuzzy-RDD local effect for compliers at the threshold: -5.0 / 0.45 = approximately -11.1 per 100. This effect is local to patients near LDL 190 and does not necessarily apply to patients with much higher or lower LDL values."
    },
    "prerequisites": [
      "target-trial-emulation",
      "instrumental-variables-pharmacoepi-rwe",
      "difference-in-differences-staggered-adoption-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Sharp RDD",
        "description": "Treatment is a deterministic function of the running variable crossing the cutoff (assignment probability jumps from 0 to 1); the effect is the jump in the conditional expectation of the outcome at c, estimated with local linear or local polynomial regression on each side.",
        "edge_cases": [
          "A failed McCrary density test (a discontinuity in the running-variable density at c) signals manipulation/sorting and invalidates the sharp design.",
          "Pre-treatment covariates must be continuous at c; a covariate jump indicates another determinant changes at the cutoff."
        ],
        "data_source_notes": "claims: an age cutoff is clean, but watch for coverage/data-capture transitions at the same age (e.g., commercial-to-Medicare at 65) that shift the observed population discontinuously."
      },
      {
        "name": "Fuzzy RDD",
        "description": "Crossing the cutoff changes only the probability of treatment; the effect is the reduced-form outcome jump divided by the first-stage treatment-probability jump (a Wald/IV estimator with the cutoff as the instrument), identifying a complier LATE at c.",
        "edge_cases": [
          "A weak or absent first-stage jump (the rule is not followed) makes the divided estimate unstable or explosive; verify and report the first-stage discontinuity.",
          "The estimand is a complier effect local to the cutoff, not an average effect over the whole population."
        ],
        "data_source_notes": "ehr/linked: lab/clinical-threshold rules (HbA1c, LDL, eGFR) are imperfectly followed, so fuzzy RDD is the rule; the running variable is measured with error and may be re-tested near c, risking sorting."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "instrumental-variables-pharmacoepi-rwe",
        "pros_of_this": "A fuzzy RDD is an IV whose instrument (the institutional cutoff) has unusually transparent, testable validity because relevance is the visible first-stage jump and the exclusion restriction holds locally (only treatment changes at c).",
        "cons_of_this": "The effect is local to the cutoff and statistically inefficient (only near-cutoff units inform it), so large samples are needed and external validity away from c is not guaranteed.",
        "when_to_prefer": "When a genuine threshold rule exists, because the instrument's validity is far easier to argue and test than a general IV's exclusion restriction; use general IV when no threshold exists but a credible instrument does."
      },
      {
        "compared_to": "propensity-score-methods-psm-iptw",
        "pros_of_this": "Requires no measurement of confounders near the cutoff; identification rests on the smoothness of potential outcomes at c, so unmeasured confounding (the chief threat to PS designs) is largely defused.",
        "cons_of_this": "Estimates only a local effect at the threshold with low power, whereas PS methods estimate an average effect over the full balanced population.",
        "when_to_prefer": "When unmeasured confounding is the dominant threat and a clean threshold rule exists; prefer PS/target-trial designs when confounders are well measured and an externally valid population-average effect is needed."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Use age or a constructed eligibility/risk score as the running variable and an enrollment/coverage/service indicator as treatment; restrict to a consistently observable population so coverage transitions at the cutoff (commercial-to-Medicare at 65, MA vs FFS) do not create a data-capture jump masquerading as an effect. Lab-threshold RDDs require linked EHR/lab values.",
      "ehr": "Best substrate for lab/clinical-threshold RDDs (the running variable is recorded directly); test for heaping at round values and for re-testing of borderline values near c, which induces the manipulation/sorting the design forbids.",
      "registry": "Perinatal/birth registries give canonical running variables (gestational age, birthweight at NICU thresholds) with adjudicated outcomes; check for heaping at reported round values that biases the local fit, and link to claims for follow-up.",
      "linked": "Linked EHR-lab + claims supplies a recorded running variable, the threshold-driven treatment, and follow-up/cost outcomes; reconcile measurement timing so the index running-variable value precedes treatment assignment."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\nfrom rdrobust import rdrobust          # local-poly RDD with robust bias-corrected CIs\nfrom rddensity import rddensity        # manipulation (density) test at the cutoff\n\nrng = np.random.default_rng(11)\nn = 4000\nx = rng.uniform(150, 230, n)           # running variable: index LDL (mg/dL)\nc = 190.0                              # guideline cutoff\n# Fuzzy assignment: probability of program enrollment jumps at the cutoff.\np = np.where(x >= c, 0.65, 0.20)\nd = (rng.uniform(size=n) < p).astype(int)\n# Smooth potential outcome + a true local treatment effect (-4.7 per 100 -> -0.047).\ny = 0.30 - 0.0008 * (x - c) - 0.047 * d + rng.normal(0, 0.10, n)\n\n# 1) Manipulation check: no discontinuity in the density of x at c.\ndens = rddensity(X=x, c=c)\nprint(\"McCrary-type density test p:\", dens.test[\"p_jk\"])\n\n# 2) Fuzzy RDD: supply the treatment vector `fuzzy=d`. Local linear (p=1), MSE-optimal bandwidth.\nest = rdrobust(y=y, x=x, c=c, fuzzy=d, p=1)\nprint(est)   # Coef = local complier effect at the cutoff; CI is robust bias-corrected.\n\n# 3) Sharp RDD reduced form (outcome jump) for comparison: omit `fuzzy`.\nrf = rdrobust(y=y, x=x, c=c, p=1)\nprint(\"Reduced-form jump:\", rf.coef.iloc[0])",
        "description": "Sharp and fuzzy RDD with the rdrobust package (Calonico, Cattaneo, Titiunik): MSE-optimal bandwidth, local linear\nestimation, and robust bias-corrected inference. rddensity supplies the McCrary-style manipulation (density) test. Inputs:\nthe outcome y, the running variable x, the cutoff c, and (for fuzzy) the treatment indicator d. Local linear (p=1) within a\ndata-driven bandwidth is the recommended default over a global polynomial.",
        "dependencies": [
          "rdrobust",
          "rddensity",
          "numpy"
        ],
        "source_citations": [
          "venkataramani-2016",
          "imbens-lemieux-2008"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(rdrobust)\nlibrary(rddensity)\n\nset.seed(11)\nn <- 4000\nx <- runif(n, 150, 230)               # running variable: index LDL (mg/dL)\nc <- 190                              # guideline cutoff\np <- ifelse(x >= c, 0.65, 0.20)       # fuzzy first stage\nd <- as.integer(runif(n) < p)\ny <- 0.30 - 0.0008 * (x - c) - 0.047 * d + rnorm(n, 0, 0.10)\n\n# 1) Manipulation (density) test at the cutoff.\ndens <- rddensity(X = x, c = c)\nsummary(dens)\n\n# 2) MSE-optimal bandwidth, then fuzzy RDD with robust bias-corrected inference.\n#    rdrobust selects its bandwidth internally; rdbwselect is run only to report it.\nbw  <- rdbwselect(y = y, x = x, c = c, fuzzy = d, p = 1)\nsummary(bw)\nfit <- rdrobust(y = y, x = x, c = c, fuzzy = d, p = 1)   # local complier effect at c\nsummary(fit)\n\n# 3) Visualize: binned means with local-linear fits on each side.\nrdplot(y = y, x = x, c = c, p = 1)",
        "description": "Sharp and fuzzy RDD in R using the canonical rdrobust + rddensity packages (Calonico, Cattaneo, Titiunik). rdbwselect gives\nthe MSE-optimal bandwidth; rdrobust fits local linear regressions with robust bias-corrected inference; rddensity runs the\nmanipulation test. For a fuzzy design pass the treatment vector to `fuzzy=`. rdplot draws the binned scatter with local fits.",
        "dependencies": [
          "rdrobust",
          "rddensity"
        ],
        "source_citations": [
          "venkataramani-2016",
          "imbens-lemieux-2008"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "%let cutoff = 190;\n%let bw     = 22;      /* bandwidth window; vary (0.5*bw, bw, 2*bw) in sensitivity */\n\n/* Restrict to the bandwidth window and build centered + interaction terms for separate slopes. */\ndata rdd_win;\n  set work.rdd;\n  where abs(x - &cutoff) <= &bw;\n  xc    = x - &cutoff;            /* center the running variable at the cutoff */\n  above = (x >= &cutoff);         /* cutoff indicator = instrument / sharp treatment */\n  xc_ab = xc * above;            /* allows a different slope on each side */\nrun;\n\n/* Sharp / reduced-form RDD: local linear with separate slopes; coef on `above` = outcome jump. */\nproc reg data=rdd_win;\n  model y = xc above xc_ab;\n  title \"Sharp RDD (reduced-form outcome jump at cutoff)\";\nrun; quit;\n\n/* Fuzzy RDD: 2SLS with `above` instrumenting treatment `d` (the Wald/IV ratio). */\nproc syslin data=rdd_win 2sls;\n  endogenous d;\n  instruments above xc xc_ab;\n  model y = d xc xc_ab;          /* coef on d = local complier effect at the cutoff */\n  title \"Fuzzy RDD (2SLS: cutoff instruments treatment)\";\nrun; quit;",
        "description": "RDD in SAS by hand: a sharp/fuzzy local linear regression restricted to a bandwidth window around the cutoff, fit with PROC\nREG (reduced form) and PROC SYSLIN with the cutoff indicator as an instrument for the fuzzy 2SLS estimate. No SAS PROC\nautomates MSE-optimal bandwidths, so the window is supplied explicitly and varied in sensitivity analysis. Standard errors\nare conventional, not robust bias-corrected, and a PROC SGPLOT step (not shown) can plot the discontinuity. Inputs: dataset\nwork.rdd with y, x (running variable), and d (treatment).",
        "dependencies": [],
        "source_citations": [
          "venkataramani-2016",
          "imbens-lemieux-2008"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  X[Continuous running variable X<br/>e.g., LDL, age, risk score] --> C{X >= cutoff c?}\n  C -->|No| Below[Below: treatment less likely]\n  C -->|Yes| Above[At/above: treatment more likely]\n  Below --> Fit[Local linear regression<br/>within data-driven bandwidth]\n  Above --> Fit\n  Fit --> Jump[Outcome jump at c]\n  Fit --> First[First-stage treatment jump at c]\n  Jump --> Tau[Sharp: tau = outcome jump<br/>Fuzzy: tau = outcome jump / first-stage jump]\n  First --> Tau",
        "caption": "Structure of an RDD. Treatment status (or its probability) changes discontinuously at the cutoff c while the outcome would otherwise vary smoothly; the local effect is the outcome jump (sharp) or the outcome jump divided by the first-stage treatment-probability jump (fuzzy), estimated with local linear regression within a bandwidth.",
        "alt_text": "Flow diagram showing a running variable split at a cutoff into below and above groups, local linear regression on each side, an outcome jump and a first-stage jump, combined into the sharp or fuzzy RDD effect.",
        "source_type": "illustrative",
        "source_citations": [
          "venkataramani-2016"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Start[Threshold rule on continuous X] --> Manip{McCrary density test:<br/>density continuous at c?}\n  Manip -->|No - sorting/manipulation| Stop1[Invalid: halt analysis]\n  Manip -->|Yes| Cov{Pre-treatment covariates<br/>continuous at c?}\n  Cov -->|No| Stop2[Other determinant changes at c]\n  Cov -->|Yes| FS{First-stage treatment<br/>jump at c?}\n  FS -->|None| Stop3[No discontinuity: RDD not identified]\n  FS -->|Prob 0 to 1| Sharp[Sharp RDD]\n  FS -->|Probability change| Fuzzy[Fuzzy RDD - IV/Wald]\n  Sharp --> Est[Local linear within MSE-optimal bandwidth<br/>robust bias-corrected CIs; vary bandwidth]\n  Fuzzy --> Est\n  Est --> Local[Report LOCAL effect at c - do not extrapolate]",
        "caption": "Validity-and-estimation workflow for an RDD. The density and covariate-continuity checks gate the design; the first-stage jump distinguishes sharp from fuzzy; estimation uses local linear regression within a data-driven bandwidth and the effect is reported as local to the cutoff.",
        "alt_text": "Decision tree from a threshold rule through the McCrary density test, covariate-continuity check, and first-stage check to sharp or fuzzy estimation with local linear regression and a local-effect interpretation.",
        "source_type": "illustrative",
        "source_citations": [
          "mccrary-2008"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "alternative_to",
        "target_slug": "instrumental-variables-pharmacoepi-rwe",
        "notes": "A fuzzy RDD is an instrumental-variables estimator using the cutoff as the instrument; its relevance (first-stage jump) and local exclusion restriction are far more transparent than a general IV's, but the effect is local to the cutoff."
      },
      {
        "relation_type": "see_also",
        "target_slug": "difference-in-differences-staggered-adoption-rwe",
        "notes": "Both are quasi-experiments; RDD exploits a cross-sectional threshold and continuity at the cutoff, whereas DiD exploits a treatment turning on at a date plus a parallel comparison group."
      },
      {
        "relation_type": "alternative_to",
        "target_slug": "target-trial-emulation",
        "notes": "Target-trial emulation estimates a population-average effect by measuring and balancing confounders; RDD needs no confounder measurement near the cutoff but yields only a local effect at the threshold with lower power."
      },
      {
        "relation_type": "see_also",
        "target_slug": "generalizability-transportability-external-validity-rwe",
        "notes": "The RDD estimand is the local effect at the cutoff; transporting it to units far from the threshold raises the external-validity/generalizability questions handled by transportability methods."
      },
      {
        "relation_type": "complements",
        "target_slug": "propensity-score-methods-psm-iptw",
        "notes": "PS methods estimate a population-average effect requiring measured confounders; RDD defuses unmeasured confounding near a threshold but is local and inefficient, so the two answer complementary questions."
      },
      {
        "relation_type": "see_also",
        "target_slug": "interrupted-time-series-rwe",
        "notes": "Both are leading quasi-experiments for policy/threshold evaluation; ITS exploits a discontinuity in time while RDD exploits a discontinuity along a continuous running variable."
      }
    ],
    "aliases": [
      "RDD",
      "regression discontinuity",
      "sharp regression discontinuity",
      "fuzzy regression discontinuity",
      "threshold regression"
    ],
    "complexity": "advanced",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta",
      "pqa-cms"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "regularized-regression-lasso-ridge-rwe",
    "name": "Regularized Regression: LASSO, Ridge, and Elastic Net",
    "short_definition": "A family of penalized linear and generalized linear models that add a penalty term to the ordinary least-squares or maximum-likelihood objective, shrinking coefficient estimates toward zero (ridge/L2) or exactly to zero (LASSO/L1), with elastic net combining both penalties; used in RWE for variable selection and prediction from thousands of claims codes, high-dimensional propensity score construction, risk prediction when predictors outnumber observations, and as a nuisance model inside double-ML and TMLE.",
    "long_description": "**What regularized regression is and why it matters in RWE**\n\nOrdinary least squares (OLS) and maximum-likelihood estimation find the coefficients that\nminimize the residual objective with no constraint on coefficient magnitude. In high-dimensional\nreal-world evidence settings — where a claims analyst confronts thousands of ICD-10 diagnosis\ncodes, CPT procedure codes, and NDC drug classes as candidate predictors — unconstrained\nestimators break down in two ways. First, when the number of predictors p approaches or exceeds\nthe number of observations n, OLS has no unique solution and fits noise perfectly, producing\nwildly inflated coefficient variance. Second, even when n > p, OLS coefficients have high\nvariance under multicollinearity — and ICD code families (diabetes with renal complications,\ndiabetes without) are almost by definition collinear. Regularized regression adds a **penalty\nterm** to the objective that trades a controlled amount of bias for a substantial reduction in\nvariance, yielding estimates that generalize better to held-out data and remain estimable when\np >> n.\n\n**The three penalties: geometry and intuition**\n\nAll three methods minimize the same penalized objective:\nminimize { sum of squared (or deviance) residuals + lambda * Penalty(beta) }\n\nThe penalty type determines the geometry of the solution:\n\n- **Ridge (L2 penalty)**: Penalty = sum of beta_j squared. The L2 feasible region is a circle\n  in two dimensions (sphere in higher). The optimization finds where the OLS loss ellipsoid\n  first touches this sphere, and because the sphere has no corners, the solution almost never\n  lands exactly at zero. 
Every coefficient is shrunk proportionally toward zero but retained.\n  Ridge excels when all predictors carry some signal and when correlated code groups (all\n  diabetes-related diagnoses, all hypertension-related procedure codes) should be kept together\n  rather than having one arbitrarily selected and the rest discarded. Ridge also uniquely\n  defines the coefficient vector even when the OLS design matrix is singular — critical for\n  p > n claims analyses.\n\n- **LASSO (L1 penalty, Least Absolute Shrinkage and Selection Operator)**: Penalty = sum of\n  |beta_j|. The L1 feasible region is a diamond in two dimensions (cross-polytope in higher\n  dimensions) with sharp corners at the coordinate axes. The OLS loss ellipsoid most often\n  first contacts the L1 region at one of these corners, where all but one or a few coordinates\n  are zero. LASSO simultaneously estimates and selects: it produces a sparse model with many\n  exact zeros. In a claims cost model with 4,000 candidate codes, LASSO at a cross-validated\n  lambda typically retains 15 to 30 codes with non-zero coefficients — a compact, communicable\n  predictive fingerprint. The sparsity comes at a cost: with strongly correlated predictors\n  (two nearly identical complication codes), LASSO arbitrarily selects one and zeros the other,\n  making the selected set unstable across bootstrap replicates.\n\n- **Elastic net (L1 plus L2 mixture)**: Penalty = alpha * sum(|beta_j|) + (1-alpha) *\n  sum(beta_j squared). The elastic net mixes LASSO sparsity with ridge shrinkage of the\n  selected group. When correlated code families compete for inclusion — four closely related\n  diabetes complication codes where the science implies all four should be present — elastic\n  net tends to keep the group together while zeroing truly irrelevant codes. 
The mixing\n  parameter alpha controls the L1 fraction: alpha = 1 is pure LASSO, alpha = 0 is pure ridge.\n  A value of alpha between 0.5 and 0.8 is a common default for claims code analyses with\n  hierarchically organized ICD families. A further extension, the **group lasso**, enforces\n  sparsity at the group level (all members of a pre-specified ICD-10 chapter or ATC drug class\n  either enter or leave together), which is clinically natural for hierarchically organized\n  codes.\n\n**Standardization is mandatory before penalizing**\n\nThe penalty term treats all coefficients on the same footing. A coefficient of 1.0 on age\n(measured in decades, range 3 to 8) receives the same penalty as a coefficient of 1.0 on a\nbinary ever/never flag (range 0 to 1). Without standardization, predictors measured on large\nscales are penalized less and predictors on small scales are penalized more, an artifact of\nunits, not biology or confounding structure. glmnet (R) standardizes predictors internally by\ndefault (standardize = TRUE) and returns coefficients on the original scale. scikit-learn's\npenalized estimators (Lasso, Ridge, ElasticNet, LogisticRegression) do not standardize, so\nscale the predictors explicitly, for example with StandardScaler in a Pipeline. Always verify\nwhich behavior is active in the implementation.\n\n**Lambda selection by cross-validation**\n\nLambda (the penalty strength) is the critical tuning parameter. Too small: the penalty is\nnegligible and the estimator reverts to OLS. Too large: every coefficient is shrunk to zero\n(LASSO) or nearly to zero (ridge).\nIn practice, lambda is chosen by k-fold cross-validation across a fine grid of values: fit on\nk-1 folds, predict the held-out fold, compute the loss (mean squared error for linear outcomes,\nbinomial deviance for logistic). Two conventional choices from the resulting cross-validation\ncurve:\n\n- **lambda.min**: the lambda minimizing mean CV error, giving the lowest estimated prediction\n  error.\n- **lambda.1se**: the largest lambda within one standard error of the minimum, giving a sparser,\n  more parsimonious model at slightly higher CV error. 
Preferred in RWE contexts where parsimony and\n  communication matter more than the last fraction of AUC or R-squared.\n\nThe choice between lambda.min and lambda.1se should be pre-specified in the SAP or protocol\nand justified scientifically, not chosen post-hoc to maximize a preferred model size.\n\n**RWE high-dimensional reality: claims code spaces and p >> n settings**\n\nThe primary motivation for regularized regression in RWE is the claims code landscape. A\nstandard 365-day lookback window in a Medicare FFS or commercial database generates roughly\n3,000 to 6,000 unique ICD-10 codes, 1,500 to 3,000 CPT codes, and 800 to 1,500 NDC drug\nclasses per cohort. OLS is not identified at these dimensions; even moderate-sized cohorts\nface p >> n for subgroup analyses. The coordinate descent algorithm underlying glmnet cycles\nthrough predictors without requiring matrix inversion, handles both n > p and p >> n equally,\nand scales to millions of observations with tens of thousands of predictors on standard hardware.\n\nIn the high-dimensional propensity score (hdPS) pipeline, the empirical Bayes selection step\nplays a role conceptually similar to LASSO: data-adaptively identifying which of thousands of\ncodes most predict treatment assignment. Full LASSO logistic regression on all candidate codes\nis increasingly used as a direct alternative or complement to hdPS for propensity score\nconstruction, with lambda chosen by cross-validation.\n\nFor biomarker panels (genomics, proteomics, pharmacogenomics) and linked registry-claims\ndatasets, p >> n is routine. Ridge excels when all biomarkers are expected to contribute\n(polygenic scores, pathway-level analyses); LASSO when a small number of truly predictive\nbiomarkers is plausible; elastic net when correlated gene families or pathway blocks should\nbe grouped together.\n\n**Interpreting the output**\n\nA LASSO claims-cost model fit at a cross-validation-chosen lambda retains 18 of 4,000\ncandidate codes. 
The coefficient on \"prior insulin use\" is 0.35 on the log-cost scale.\n\n**Formal interpretation.** Penalized coefficients are deliberately biased toward zero — this\nis the mechanism, not a failure mode. The value 0.35 is a shrunken predictive weight, not an\nunbiased adjusted effect estimate. It is not entitled to a naive confidence interval: the\nstandard errors from a refitted unpenalized OLS on only the selected 18 codes are invalid.\nThey ignore the selection step (which consumed degrees of freedom not reflected in the\nrefitted model) and consistently understate uncertainty — naive SEs after LASSO are known\nto be anticonservative. Reporting them as if they were unpenalized OLS standard errors is\nincorrect. Post-selection inference is an active research area (selective inference, PoSI)\nrequiring specialized software; it is not solved by simply refitting OLS on the LASSO-selected\nset and using the resulting standard errors.\n\nSelection is also unstable: bootstrap the analysis cohort 200 times and the set of 18 codes\nchanges with every replicate. Prior insulin use may appear in 70 to 80 percent of replicates\nas a stable predictor; the marginal code at position 18 may appear in only 20 to 30 percent.\nThe specific coefficient 0.35 shifts in each bootstrap replicate, reflecting both sampling\nvariation and selection variation.\n\n**Practical interpretation.** Statistically correct plain English: the model found prior\ninsulin use one of the strongest cost predictors among the 18 codes retained at the selected\nlambda. Treat the 18 selected codes as a predictive fingerprint — a compact feature set for\nscoring new patients' cost risk — not as a list of cost drivers in any causal sense. The\nselection of 18 codes does not imply these are the 18 most important biological or clinical\ndeterminants of cost. 
If the scientific question is whether insulin use causes higher costs\ncompared with an alternative therapy, this LASSO fit does not answer it and should not be\nreported as if it does.\n\n**The inference warning: regularization optimizes prediction, not causal inference**\n\nRegularized regression is designed to minimize prediction error. Using LASSO to select\nconfounders and then refitting unpenalized regression on only the selected set is a two-step\nprocedure with invalid inference. Naive SEs from the refitted model are anticonservative and\nthe point estimate for the treatment effect is inconsistent — the selection step induces a\nform of bias at the variable level that the refitted standard errors cannot recover.\n\nThe principled fix is **post-double-selection** (Belloni, Chernozhukov, and Hansen): run LASSO\nof the outcome on all candidate confounders; run LASSO of the exposure on all candidate\nconfounders; take the union of the two selected sets; refit unpenalized regression on this union\nplus the exposure. This produces root-n-consistent, asymptotically normal estimates of the\ntreatment effect even when thousands of candidate confounders are screened. An equivalent and\nmore general approach is double/debiased ML (see parent entry\npredictive-and-causal-ml-models-rwe), which incorporates this logic in a cross-fit framework\nwith formal efficiency guarantees.\n\nA second critical rule: **never penalize the exposure coefficient itself** when the goal is\ncausal effect estimation. The exposure must enter the model without penalty; only the potential\nconfounders receive shrinkage. Penalizing the exposure biases the causal estimate toward the\nnull — a form of dilution bias invisible to standard fit metrics.\n\n**Pros, cons, and trade-offs**\n\n*Pros*: Estimable when p approaches or exceeds n — the only linear framework that works in\ntruly high-dimensional claims data without separate dimension reduction. 
Automatic variable\nselection (LASSO, elastic net) reduces the model to a compact set of predictors — a\ncommunication advantage in HEOR submissions. Ridge eliminates instability under\nmulticollinearity; correlated code families are retained collectively rather than having\narbitrary members selected. The coordinate descent algorithm (glmnet) is fast enough for\nmillions of observations and tens of thousands of predictors. Extends to other outcomes via\npenalized GLMs: logistic (readmission, treatment failure), Poisson (utilization counts),\nCox proportional hazards (time-to-event), multinomial. Lambda chosen by cross-validation is\nprincipled and reproducible.\n\n*Cons*: Coefficients are biased by construction; they cannot be reported as unbiased adjusted\neffects without post-double-selection or a causal ML wrapper. Naive confidence intervals after\nLASSO selection are invalid and anticonservative. LASSO selection is unstable with strongly\ncorrelated predictors; elastic net or group lasso is needed when correlated groups must be\nrepresented collectively. The lambda.min vs lambda.1se choice materially affects the selected\nset and should be pre-specified. 
SAS does not provide a full cross-validated regularization\npath equivalent to glmnet; PROC GLMSELECT LASSO is available for moderate dimensions but\nrequires additional calibration steps.\n\n**When to use**\n\n- High-dimensional prediction with tens to thousands of candidate predictors from a claims\n  lookback window, biomarker panel, or genomic dataset; OLS is not identified or has high\n  variance; the goal is a risk score or cost prediction model for budget impact or patient\n  stratification.\n- Propensity score construction from thousands of code candidates as an alternative to\n  stepwise selection or the hdPS covariate prioritization step, with lambda chosen by CV.\n- Confounder screening: reduce 4,000 codes to a manageable set of 20 to 50 using LASSO\n  then fit an interpretable parametric model, applying post-double-selection when causal\n  inference is the goal.\n- Elastic net for correlated code families (ICD-10 chapters, ATC drug classes) where the\n  L1/L2 mixture retains the group rather than an arbitrary single member.\n- Nuisance model fitting inside double-ML or TMLE as one candidate in a super learner,\n  where prediction quality, not causal interpretability, governs the nuisance fit.\n- Penalized Cox regression for time-to-event outcomes in linked registries when many candidate\n  code predictors are available and stepwise variable selection is infeasible.\n\n**When NOT to use**\n\n- Small p where domain knowledge should select covariates: if a pre-specified set of 8 to 12\n  clinically motivated confounders is available, fit them in an unpenalized model; adding LASSO\n  introduces bias without the high-dimensional benefit.\n- Primary causal effect estimation with naive SEs: never report a LASSO-selected and refitted\n  coefficient as an unbiased causal estimate without post-double-selection or an equivalent\n  causal ML approach.\n- Instruments or confounders chosen purely by predictive criteria: LASSO selects on outcome\n  prediction; a code highly 
predictive of the outcome but connected to the exposure only through\n  an unmeasured common cause is a collider — selecting it can induce M-bias or amplify\n  unmeasured confounding. Data-adaptive selection cannot substitute for a causal DAG (see\n  bias discussion in predictive-and-causal-ml-models-rwe).\n- When the complete covariate set must be retained for regulatory or scientific reasons: a\n  pre-specified confirmatory analysis with a small number of adjudicated covariates should use\n  unpenalized regression.\n- When LASSO selection is the deliverable but collinearity is severe: use elastic net or group\n  lasso instead; pure LASSO with correlated predictors produces an arbitrary sparse\n  representation of the correlation structure, not a scientifically meaningful feature set.",
    "primary_category": "Machine_Learning_and_Predictive",
    "tags": [
      "machine-learning",
      "prediction",
      "regularization",
      "variable-selection",
      "high-dimensional",
      "lasso",
      "ridge",
      "elastic-net",
      "penalized-regression",
      "shrinkage",
      "claims",
      "propensity-score"
    ],
    "applies_to_study_types": [
      "cohort_retrospective",
      "new_user",
      "active_comparator_new_user",
      "claims_analysis",
      "ehr_study",
      "registry_study"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1111/j.2517-6161.1996.tb02080.x",
        "url": "https://doi.org/10.1111/j.2517-6161.1996.tb02080.x",
        "citation_text": "Tibshirani R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B (Methodological). 1996;58(1):267-288.",
        "year": 1996,
        "authors_short": "Tibshirani",
        "notes": "Foundational paper introducing the LASSO — L1 penalization of the OLS objective — with the soft-threshold solution, its sparsity-inducing geometry (corners of the L1 diamond), and coordinate descent computation. The starting reference for any RWE application of penalized regression with variable selection."
      },
      {
        "role": "explain",
        "doi": "10.1111/j.1467-9868.2005.00503.x",
        "url": "https://doi.org/10.1111/j.1467-9868.2005.00503.x",
        "citation_text": "Zou H, Hastie T. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society Series B (Statistical Methodology). 2005;67(2):301-320.",
        "year": 2005,
        "authors_short": "Zou & Hastie",
        "notes": "Introduces the elastic net as a principled compromise between LASSO (sparsity) and ridge (grouping of correlated predictors). Demonstrates that LASSO arbitrarily selects one member from a correlated group and shows the elastic net mixing parameter alpha controlling the L1/L2 balance. Essential for ICD code family analyses where correlated codes should enter together."
      },
      {
        "role": "demonstrate",
        "doi": "10.18637/jss.v033.i01",
        "url": "https://doi.org/10.18637/jss.v033.i01",
        "citation_text": "Friedman J, Hastie T, Tibshirani R. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software. 2010;33(1):1-22.",
        "year": 2010,
        "authors_short": "Friedman et al.",
        "notes": "Introduces glmnet and the pathwise coordinate descent algorithm that makes penalized GLMs computationally feasible at claims-dataset scale. Covers the full GLM family (Gaussian, binomial, multinomial, Poisson, Cox) and the cross-validated lambda path via cv.glmnet. The production reference for all glmnet implementations in this entry."
      },
      {
        "role": "use",
        "doi": "10.18637/jss.v039.i05",
        "url": "https://doi.org/10.18637/jss.v039.i05",
        "citation_text": "Simon N, Friedman J, Hastie T, Tibshirani R. Regularization paths for Cox's proportional hazards model via coordinate descent. Journal of Statistical Software. 2011;39(5):1-13.",
        "year": 2011,
        "authors_short": "Simon et al.",
        "notes": "Extends glmnet to penalized Cox proportional hazards regression for right-censored time-to-event endpoints. Directly applicable to survival outcomes in RWE (time to hospitalization, treatment discontinuation, overall survival) when the candidate covariate space is high-dimensional and stepwise variable selection is infeasible."
      }
    ],
    "plain_language_summary": "Regularized regression (LASSO, ridge, and elastic net) is a family of statistical models that deliberately shrink coefficient estimates toward zero — and in the LASSO case, zero them out entirely — to prevent the model from overfitting when there are many more candidate predictors than can reliably be estimated from the data. In health research with claims data, this means a model can start with thousands of diagnosis and procedure codes and automatically find the handful that best predict costs, hospitalizations, or treatment choice, without the analyst having to hand-pick covariates. The key trade-off is that the shrunken coefficients are intentionally biased and cannot be directly interpreted as causal effects, only as predictive weights.",
    "key_terms": [
      {
        "term": "penalty (lambda)",
        "definition": "A mathematical constraint added to the model fitting that pushes coefficient estimates toward zero; a larger lambda shrinks coefficients more aggressively, producing a sparser or smaller model."
      },
      {
        "term": "shrinkage",
        "definition": "The process by which penalized regression reduces the size of coefficient estimates relative to ordinary least squares; every coefficient moves closer to zero, and LASSO can push them all the way to exactly zero."
      },
      {
        "term": "sparsity",
        "definition": "A property of LASSO and elastic net models where many coefficients are set exactly to zero, so only a small fraction of the original candidate predictors appear in the final model with non-zero weights."
      },
      {
        "term": "standardization",
        "definition": "Rescaling each predictor to have mean zero and unit variance before fitting the penalized model; required so the penalty treats all predictors equally regardless of their original units or range (age in decades vs a binary flag must be made comparable)."
      },
      {
        "term": "cross-validated lambda",
        "definition": "The penalty strength chosen by testing many lambda values on held-out folds and selecting the one with the lowest prediction error; lambda.1se is the parsimonious choice (largest lambda within one standard error of the minimum) preferred in RWE for interpretability."
      },
      {
        "term": "bias-variance tradeoff",
        "definition": "The fundamental tension in statistical estimation where adding a small amount of bias (by penalizing coefficients) can reduce variance enough that predictions on new data improve overall — the core justification for regularized regression over ordinary least squares in high-dimensional settings."
      }
    ],
    "worked_example": {
      "scenario": "A HEOR analyst builds a log-cost prediction model for 500 commercial insurance members who initiated a diabetes drug. After constructing a 365-day lookback feature matrix with hundreds of candidate codes, the analyst uses three illustrative predictors to demonstrate ridge and LASSO penalization. The OLS fit gives the coefficients shown in the table below (all on the standardized scale: mean 0, unit variance per predictor). The analyst compares ridge (lambda = 4) and LASSO (lambda = 1) to see which predictors survive penalization and by how much.",
      "dataset": {
        "caption": "OLS log-cost model coefficients on the standardized predictor scale. Three predictors shown for illustration; a real analysis would have hundreds of code-level features.",
        "columns": [
          "predictor",
          "ols_beta"
        ],
        "rows": [
          [
            "prior_insulin_use",
            5.0
          ],
          [
            "prior_ed_visit_count",
            3.0
          ],
          [
            "comorbidity_flag",
            0.5
          ]
        ]
      },
      "steps": [
        "Standardize all predictors before fitting. Each predictor is centered to mean zero and scaled to unit variance so the penalty treats them equally regardless of original units. Both glmnet (R) and sklearn (Python) perform this internally by default. The OLS coefficients in the table are already expressed on this standardized scale.",
        "Ridge with lambda = 4: the ridge formula shrinks each OLS coefficient by the factor 1 / (1 + lambda). With lambda = 4.0: prior_insulin_use is penalized to 5.0 * (1.0/(1.0+4.0)) = 5.0 * 0.2 = 1.0; prior_ed_visit_count is penalized to 3.0 * (1.0/(1.0+4.0)) = 3.0 * 0.2 = 0.6; comorbidity_flag is penalized to 0.5 * (1.0/(1.0+4.0)) = 0.5 * 0.2 = 0.1. All three coefficients are shrunk toward zero, but none reach exactly zero — ridge retains every predictor in the model.",
        "LASSO with lambda = 1: the soft-threshold formula is beta_lasso = sign(ols_beta) * max(|ols_beta| - lambda, 0). With lambda = 1.0: prior_insulin_use has ols_beta = 5.0, which exceeds lambda, so the LASSO coefficient is 5.0 - 1.0 = 4.0; prior_ed_visit_count has ols_beta = 3.0, which exceeds lambda, so the LASSO coefficient is 3.0 - 1.0 = 2.0; comorbidity_flag has ols_beta = 0.5, which is less than lambda = 1.0, so its coefficient is zeroed out and the predictor is excluded from the model entirely.",
        "LASSO performs variable selection; ridge does not. With lambda = 1, LASSO keeps 2 of 3 predictors and zeros the weakest (comorbidity_flag). In a full analysis with 500 candidate codes, a cross-validated lambda would typically keep 15 to 30 codes. Choose lambda.1se for the most parsimonious model; lambda.min for best predictive accuracy. Pre-specify this choice in the SAP.",
        "Caution on interpretation: the retained LASSO coefficients of 4.0 and 2.0 are penalized predictive weights, not unbiased causal effects. They are deliberately biased toward zero by design. Their naive standard errors — from refitting OLS on just these two selected predictors — would be invalid (anticonservative) because they ignore the selection step. Treat these as weights in a scoring model, not as adjusted effect estimates. For causal inference, apply post-double-selection or use double-ML (see parent entry)."
      ],
      "result": "Ridge (lambda = 4): prior_insulin_use = 1.0, prior_ed_visit_count = 0.6, comorbidity_flag = 0.1. All 3 predictors retained, each shrunk proportionally. LASSO (lambda = 1): prior_insulin_use = 4.0, prior_ed_visit_count = 2.0, comorbidity_flag = 0.0 (zeroed out and excluded from the model). 2 of 3 predictors selected. Ridge shrinks coefficients proportionally toward zero; LASSO performs automatic variable selection by zeroing the weakest coefficient below the penalty threshold."
    },
    "prerequisites": [
      "ols-linear-regression",
      "predictive-and-causal-ml-models-rwe",
      "cross-validation-and-overfitting-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "LASSO logistic regression for binary outcomes",
        "description": "Penalized logistic regression with L1 penalty for binary endpoints (readmission, treatment failure, 30-day mortality). Uses the same coordinate descent path as Gaussian LASSO but with binomial deviance as the CV criterion. In glmnet: family = \"binomial\", alpha = 1. In sklearn: LogisticRegression(penalty = \"l1\", solver = \"saga\"). Coefficients are on the log-odds scale; exponentiate for penalized odds ratios, but interpret as predictive weights not causal effects.",
        "edge_cases": [
          "With severe class imbalance (rare events < 5 percent), the CV criterion (deviance) is dominated by the majority class; use class-weighted loss or oversample the minority class before fitting.",
          "Lambda.1se at high regularization may zero out the exposure itself; always enter the exposure without penalty (use the penalty.factor argument in glmnet or the unpenalized column approach in sklearn)."
        ],
        "data_source_notes": "Claims: binary event flags (hospitalization within horizon, treatment switch, death) are natural binary outcomes. Confirm the denominator (risk set) and event definition before fitting; LASSO does not protect against misaligned time-zero."
      },
      {
        "name": "Penalized Cox regression for time-to-event outcomes",
        "description": "LASSO or elastic net applied to the partial likelihood of the Cox proportional hazards model for right-censored survival data. In glmnet: family = \"cox\", Surv object as y. Selects among thousands of candidate code predictors of time to hospitalization, progression, or death without requiring pre-selection. Calibration and discrimination (C-statistic) should be validated on a held-out cohort.",
        "edge_cases": [
          "Penalized Cox does not relax the proportional hazards assumption; check Schoenfeld residuals on the final selected model.",
          "Tied event times in claims data (multiple events on the same date) require the Breslow or Efron approximation; glmnet uses Breslow by default."
        ],
        "data_source_notes": "Registry-claims linked data: adjudicated event dates from the registry plus complete pharmacy/procedure history from claims provide the ideal substrate for penalized survival analysis. Verify that censoring dates are not informative (plan disenrollment correlated with disease trajectory)."
      },
      {
        "name": "Ridge regression for multicollinear biomarker panels",
        "description": "Ridge (alpha = 0 in glmnet; RidgeCV in sklearn) for dense biomarker, genomic, or PRO panels where all features are expected to contribute and sparse selection is undesirable. Ridge uniquely identifies the coefficient vector even when predictors outnumber observations and the OLS design matrix is singular. Common for polygenic scores, pathway-level analyses, and high-density SNP arrays.",
        "edge_cases": [
          "Ridge coefficients cannot be used for feature selection (all are non-zero); for interpretability, report the top-k coefficients by absolute magnitude with a stability note.",
          "In linked omics-claims data, ridge is often fit on the omics block while LASSO or elastic net handles the claims code block; the two blocks can be combined via a stacked model or a multi-block penalized approach."
        ],
        "data_source_notes": "EHR and registry: lab panels, vital sign trajectories, and multi-domain PRO composites are natural ridge candidates when domain knowledge implies all measurements contribute."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "ols-linear-regression",
        "pros_of_this": "Estimable when p >= n; reduces overfitting and coefficient variance in high-dimensional settings; produces sparse models (LASSO) that communicate the active predictor set; handles multicollinearity (ridge) that OLS cannot.",
        "cons_of_this": "Coefficients are biased toward zero; naive standard errors after LASSO are invalid; lambda tuning adds a step OLS does not require; cross-validation is computationally heavier than a single OLS fit.",
        "when_to_prefer": "Prefer regularized regression when p is large relative to n or when multicollinearity is severe; prefer OLS when p is small, domain knowledge specifies the covariate set, and unbiased coefficient estimates with valid standard errors are required."
      },
      {
        "compared_to": "high-dimensional-propensity-score-hdps-rwe",
        "pros_of_this": "Full LASSO logistic regression on all candidate codes directly models the propensity score with principled lambda selection by CV, avoids the hdPS empirical Bayes pre-screening step, and is more easily extended to elastic net for correlated code families.",
        "cons_of_this": "hdPS has an established validation literature in pharmacoepidemiology and is the recognized industry standard; LASSO propensity scores are less familiar to regulators and reviewers and require additional transparency documentation.",
        "when_to_prefer": "LASSO propensity scores are a reasonable complement or sensitivity analysis to hdPS; prefer hdPS when regulatory familiarity and the established validation track record matter most."
      },
      {
        "compared_to": "logistic-regression-for-binary-outcomes",
        "pros_of_this": "Handles thousands of candidate predictors without pre-selection; automatic variable selection (LASSO) or shrinkage (ridge) of all predictors simultaneously; estimable when p > n.",
        "cons_of_this": "Standard logistic regression with a pre-specified covariate set produces unbiased coefficient estimates with valid SEs and is more familiar to clinical and regulatory audiences; LASSO logistic introduces bias and requires post-double-selection for causal interpretation.",
        "when_to_prefer": "Use penalized logistic when the candidate covariate space is large and pre-selection is not feasible; use standard logistic for pre-specified, moderate-dimension confirmatory analyses."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Build the lookback feature matrix from all diagnosis, procedure, and drug-class codes in a fixed 365-day pre-index window (ever/never flags, counts, recency bins). Enforce a strict no-future-leakage guard: no codes from on or after the index date. For propensity models, ensure the exposure (treatment indicator) is not penalized — use penalty.factor = c(0, rep(1, ncol_covariates)) in glmnet or an unpenalized column approach in sklearn. Report the number of selected predictors at lambda.1se vs lambda.min, the CV curve, and a bootstrap stability analysis (200 replicates, fraction of replicates each code is selected). For cost outcomes, fit on the log-transformed total cost with an appropriate back-transformation for reporting.",
      "ehr": "Lab values, vitals, and structured EHR fields augment the claims code matrix with continuous predictors (creatinine, HbA1c, BMI). Standardize all continuous features before the penalty is applied. Visit-driven measurement creates informative missingness; encode missing indicators explicitly (indicator + last-observed-value) rather than imputing by omission. Validate calibration of the penalized model within site and calendar era, not just pooled, because EHR coding practices vary across systems.",
      "registry": "Disease-specific registry variables (ECOG status, stage, biomarker panel results) are typically low-dimensional and adjudicated; elastic net or ridge on the registry variables combined with LASSO on the linked claims code space is a reasonable strategy. Validate the penalized survival model (C-statistic, calibration-in-the-large, calibration slope) on a geographically or temporally separated cohort before deployment.",
      "linked": "Linked claims-EHR-registry datasets offer the richest covariate space (EHR severity plus claims completeness plus adjudicated endpoints) but introduce linkage-selection bias (linkable patients differ from non-linkable) and date reconciliation challenges that must be resolved before time-zero assignment. Apply the same leakage guard across all data sources: no post-index features in any source regardless of which data contributed them."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\nimport pandas as pd\nfrom sklearn.linear_model import LassoCV, RidgeCV, ElasticNetCV, LogisticRegression\nfrom sklearn.preprocessing import StandardScaler\nfrom sklearn.pipeline import Pipeline\nfrom sklearn.model_selection import GridSearchCV\n\n# ── Input: df with columns [person_id, outcome, ...baseline covariates] ──\n# outcome is continuous (log-cost) for LASSO/ridge, binary (0/1) for logistic.\ncovar_cols = [c for c in df.columns if c not in (\"person_id\", \"outcome\")]\nX = df[covar_cols].values\ny = df[\"outcome\"].values\n\n# ── 1. LASSO (L1) with cross-validated lambda — continuous outcome ──\n# Pipeline ensures StandardScaler is fit only on training folds (no leakage).\nlasso_pipe = Pipeline([\n    (\"scaler\", StandardScaler()),          # mandatory: standardize before penalizing\n    (\"lasso\", LassoCV(cv=10, max_iter=10_000, random_state=42))\n])\nlasso_pipe.fit(X, y)\nlasso_m = lasso_pipe.named_steps[\"lasso\"]\nprint(f\"LASSO chosen alpha (lambda): {lasso_m.alpha_:.5f}\")\ncoefs = pd.Series(lasso_m.coef_, index=covar_cols)\nselected = coefs[coefs != 0].sort_values(key=abs, ascending=False)\nprint(f\"Selected predictors: {len(selected)} of {len(coefs)}\")\nprint(selected.head(20))   # top 20 by absolute magnitude\n\n# ── 2. Ridge (L2) — all predictors retained, fights multicollinearity ──\nridge_pipe = Pipeline([\n    (\"scaler\", StandardScaler()),\n    (\"ridge\", RidgeCV(alphas=np.logspace(-3, 4, 100), cv=10))\n])\nridge_pipe.fit(X, y)\nridge_m = ridge_pipe.named_steps[\"ridge\"]\nprint(f\"\\nRidge chosen alpha (lambda): {ridge_m.alpha_:.5f}\")\n# All coefficients are non-zero; ridge retains every predictor.\n\n# ── 3. 
Elastic net — L1+L2 mixture for correlated code families ──\nenet_pipe = Pipeline([\n    (\"scaler\", StandardScaler()),\n    (\"enet\", ElasticNetCV(\n        l1_ratio=[0.1, 0.5, 0.7, 0.9, 0.95, 1.0],\n        cv=10, max_iter=10_000, random_state=42\n    ))\n])\nenet_pipe.fit(X, y)\nenet_m = enet_pipe.named_steps[\"enet\"]\nprint(f\"\\nElastic net: l1_ratio = {enet_m.l1_ratio_:.2f}, alpha = {enet_m.alpha_:.5f}\")\n\n# ── 4. LASSO logistic for binary outcomes (readmission, event within horizon) ──\n# In sklearn, C = 1/lambda; smaller C = more regularization.\n# CRITICAL: exposure variable must NOT be penalized — use a separate unpenalized step\n# or manually set penalty_factor equivalent (not natively in sklearn pipeline;\n# consider using glmnet via rpy2 for the penalty.factor argument).\nCs = np.logspace(-4, 2, 50)\nlog_pipe = Pipeline([\n    (\"scaler\", StandardScaler()),\n    (\"logit\", LogisticRegression(penalty=\"l1\", solver=\"saga\", max_iter=5_000))\n])\ngs = GridSearchCV(log_pipe, {\"logit__C\": Cs}, cv=10, scoring=\"neg_log_loss\")\ngs.fit(X, y)\nbest_C = gs.best_params_[\"logit__C\"]\nprint(f\"\\nLASSO logistic: best lambda = {1/best_C:.5f} (C = {best_C:.5f})\")\ncoefs_logit = pd.Series(\n    gs.best_estimator_.named_steps[\"logit\"].coef_[0], index=covar_cols\n)\nprint(f\"Selected: {(coefs_logit != 0).sum()} predictors\")\n\n# ── 5. 
Bootstrap stability (200 replicates) ──\nB = 200\nrng = np.random.default_rng(0)\nselections = np.zeros((len(covar_cols), B), dtype=int)\nfor b in range(B):\n    idx = rng.choice(len(X), size=len(X), replace=True)\n    pipe_b = Pipeline([(\"sc\", StandardScaler()),\n                       (\"ls\", LassoCV(cv=5, max_iter=5_000, random_state=b))])\n    pipe_b.fit(X[idx], y[idx])\n    selections[:, b] = (pipe_b.named_steps[\"ls\"].coef_ != 0).astype(int)\nstability = pd.Series(selections.mean(axis=1), index=covar_cols)\nprint(\"\\nTop 10 most stable predictors (fraction of replicates selected):\")\nprint(stability.sort_values(ascending=False).head(10))",
        "description": "Penalized linear and logistic regression using scikit-learn LassoCV, RidgeCV, and\nElasticNetCV for continuous outcomes, and LogisticRegression with l1/l2/elasticnet\npenalty for binary outcomes. Cross-validated lambda selection is handled by the CV\nvariants. StandardScaler is applied in a Pipeline to ensure standardization occurs\ninside cross-validation folds and does not leak. All implementations operate on an\nanalysis-ready table (one row per patient, baseline features only, no post-index leakage).",
        "dependencies": [
          "scikit-learn",
          "numpy",
          "pandas"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(glmnet)\n\n# ── Prepare data: X must be a numeric matrix ──\n# For claims data convert factor predictors via model.matrix (removes intercept).\nX <- model.matrix(~ . - 1, data = df[, covariate_cols])\ny <- df[[\"log_cost\"]]      # continuous outcome; use 0/1 for binomial, Surv() for cox\n\n# ── 1. LASSO (alpha = 1): variable selection + shrinkage ──\nset.seed(42)\nfit_lasso <- cv.glmnet(X, y, alpha = 1, family = \"gaussian\",\n                       nfolds = 10, standardize = TRUE)\ncat(\"LASSO lambda.min:\", round(fit_lasso$lambda.min, 5), \"\\n\")\ncat(\"LASSO lambda.1se:\", round(fit_lasso$lambda.1se, 5), \"\\n\")\n\n# Coefficients at lambda.1se (parsimonious; pre-specify in SAP)\ncoef_1se  <- coef(fit_lasso, s = \"lambda.1se\")\ncoef_min  <- coef(fit_lasso, s = \"lambda.min\")\nn_1se <- sum(coef_1se[-1, 1] != 0)   # exclude intercept\nn_min <- sum(coef_min[-1, 1] != 0)\ncat(\"Predictors at lambda.1se:\", n_1se, \"| at lambda.min:\", n_min, \"\\n\")\n\n# Top selected predictors at lambda.1se, sorted by |coefficient|\nsel_df <- data.frame(\n  predictor = rownames(coef_1se)[-1],\n  coef      = coef_1se[-1, 1]\n)\nsel_df <- sel_df[sel_df$coef != 0, ]\nsel_df <- sel_df[order(abs(sel_df$coef), decreasing = TRUE), ]\nprint(head(sel_df, 20))\n\n# ── 2. Ridge (alpha = 0): shrinkage only, all predictors retained ──\nfit_ridge <- cv.glmnet(X, y, alpha = 0, family = \"gaussian\", nfolds = 10)\ncat(\"\\nRidge lambda.min:\", round(fit_ridge$lambda.min, 5), \"\\n\")\n\n# ── 3. Elastic net: search over alpha values ──\nalphas    <- c(0.2, 0.5, 0.8, 0.9, 1.0)\ncv_errors <- sapply(alphas, function(a) {\n  fit <- cv.glmnet(X, y, alpha = a, family = \"gaussian\", nfolds = 10)\n  min(fit$cvm)   # minimum CV mean squared error\n})\nbest_alpha <- alphas[which.min(cv_errors)]\nfit_enet   <- cv.glmnet(X, y, alpha = best_alpha, family = \"gaussian\", nfolds = 10)\ncat(\"Elastic net best alpha:\", best_alpha, \"\\n\")\n\n# ── 4. 
Penalized logistic for binary outcomes ──\n# CRITICAL: protect the exposure from penalization via penalty.factor.\n# Assume column 1 of X is the exposure (treat); all others are confounders.\npf <- rep(1, ncol(X))\npf[1] <- 0   # exposure enters unpenalized\nfit_bin <- cv.glmnet(X, df$event, alpha = 1, family = \"binomial\",\n                     nfolds = 10, penalty.factor = pf)\n\n# ── 5. Penalized Cox for time-to-event outcomes ──\nlibrary(survival)\nfit_cox <- cv.glmnet(X, Surv(df$time, df$event_indicator),\n                     alpha = 1, family = \"cox\", nfolds = 10)\ncat(\"\\nCox LASSO selected at lambda.1se:\",\n    sum(coef(fit_cox, s = \"lambda.1se\")[, 1] != 0), \"predictors\\n\")\n\n# ── 6. Bootstrap selection stability ──\nB     <- 200\nn     <- nrow(X)\nboot  <- matrix(0, nrow = ncol(X), ncol = B)\nfor (b in seq_len(B)) {\n  idx    <- sample(n, replace = TRUE)\n  fit_b  <- cv.glmnet(X[idx, ], y[idx], alpha = 1, nfolds = 5)\n  boot[, b] <- as.integer(coef(fit_b, s = \"lambda.1se\")[-1, 1] != 0)\n}\nstability <- rowMeans(boot)\nnames(stability) <- colnames(X)\ncat(\"\\nTop stable predictors (fraction selected across\", B, \"bootstrap replicates):\\n\")\nprint(sort(stability, decreasing = TRUE)[1:10])",
        "description": "Penalized regression via glmnet with cv.glmnet for cross-validated lambda selection.\nCovers LASSO (alpha=1), ridge (alpha=0), and elastic net (0 < alpha < 1) for Gaussian,\nbinomial, and Cox survival outcomes. Lambda.1se is the recommended parsimonious choice for\nRWE applications; lambda.min maximizes predictive accuracy. The penalty.factor argument\nallows the exposure variable to enter unpenalized — critical for causal applications.\nRequired input: a numeric matrix X (use model.matrix for factor variables) and vector y\nor Surv object; one row per patient, baseline features only.",
        "dependencies": [
          "glmnet",
          "survival"
        ],
        "source_citations": [
          "friedman-2010",
          "simon-2011"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* ── Standardize predictors before penalization ── */\nproc standard data=work.analytic out=work.std_data mean=0 std=1;\n  var &covariates;   /* space-delimited list of baseline predictor names */\nrun;\n\n/* ── 1. LASSO for continuous outcome (log-cost) ── */\n/* SELECTION=LASSO: L1 penalty path.                                                   */\n/* STOP=CV CHOOSE=CV: stop and choose by 10-fold cross-validation error.               */\nproc glmselect data=work.std_data plots=all;\n  model log_cost = &covariates / selection=lasso\n                                 stop=cv(method=random(seed=42) groups=10)\n                                 choose=cv;\n  output out=work.lasso_scored predicted=pred_lasso;\nrun;\n/* Inspect the ANOVA table and parameter estimates for the selected model.              */\n/* Predictors with non-zero estimates at the chosen step are selected by LASSO.         */\n\n/* ── 2. Elastic net (L1 + L2) ── */\n/* L2= specifies the ridge penalty weight added to the LASSO path.                     */\nproc glmselect data=work.std_data;\n  model log_cost = &covariates / selection=elasticnet(l2=0.1)\n                                 stop=cv(method=random(seed=42) groups=10)\n                                 choose=cv;\nrun;\n\n/* ── 3. LASSO logistic for binary outcomes ── */\n/* Available in SAS 9.4 TS1M5 and later. SLENTRY controls the entry threshold.        */\nproc logistic data=work.std_data;\n  model readmission(event='1') = &covariates / selection=lasso slentry=0.05;\nrun;\n/* Note: SAS LASSO logistic does not use a CV criterion equivalent to cv.glmnet;       */\n/* for a full CV-tuned path on large claims data, prefer R glmnet or Python sklearn.   */\n\n/* ── 4. Score held-out data with selected model ── */\n/* After PROC GLMSELECT, use PROC PLM or the OUTPUT statement from the model step      */\n/* to score new patients. 
The selected (non-zero) predictors and their coefficients    */\n/* are embedded in the parameter estimates output dataset for deployment.              */\nproc plm restore=work.lasso_model;\n  score data=work.holdout out=work.holdout_scored predicted=pred_lasso;\nrun;",
        "description": "Penalized regression in SAS using PROC GLMSELECT with the LASSO and elastic net selection\ncriteria for continuous outcomes, and PROC LOGISTIC LASSO selection for binary outcomes.\nPROC GLMSELECT supports cross-validation via CVMETHOD= for lambda calibration. Note that\nSAS does not provide a full cross-validated regularization path equivalent to glmnet; for\nproduction LASSO pipelines in large claims analyses, R glmnet or Python sklearn are\npreferred. PROC GLMSELECT LASSO is appropriate for moderate dimensions (hundreds of\npredictors) and as a complement to other methods. Required input: a wide SAS dataset with\none row per patient and all baseline covariate columns (no post-index features).",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Q[\"High-dimensional covariate space?<br/>(p > 50 or p > n)\"] -->|yes| Pen[\"Use penalized regression\"]\n  Q -->|no| OLS[\"Use OLS / unpenalized logistic<br/>with pre-specified covariates\"]\n  Pen --> Goal[\"Primary goal?\"]\n  Goal -->|prediction / scoring| Pred[\"Tune lambda by CV<br/>cv.glmnet / LassoCV\"]\n  Goal -->|causal effect| Causal[\"Apply post-double-selection<br/>or double-ML wrapper<br/>NEVER penalize exposure\"]\n  Pred --> Corr[\"Correlated code families?\"]\n  Corr -->|yes, keep group| Enet[\"Elastic net<br/>alpha 0.5-0.8\"]\n  Corr -->|no, want sparsity| Lasso[\"LASSO<br/>alpha = 1\"]\n  Corr -->|keep all predictors| Ridge[\"Ridge<br/>alpha = 0\"]\n  Lasso --> Lam[\"Lambda choice: pre-specify<br/>lambda.min (max accuracy)<br/>lambda.1se (parsimonious)\"]\n  Enet --> Lam\n  Ridge --> Lam\n  Lam --> Stab[\"Bootstrap stability check<br/>200 replicates, track selection frequency\"]\n  Stab --> Interp[\"Report as predictive weights<br/>NOT unbiased causal effects<br/>Naive CIs after LASSO are INVALID\"]",
        "caption": "Decision tree for selecting ridge, LASSO, or elastic net and choosing lambda, with the critical fork between prediction and causal use. The causal path requires post-double- selection or a double-ML wrapper — never penalize the exposure coefficient.",
        "alt_text": "Flowchart starting with whether the covariate space is high-dimensional, branching into penalized regression (yes) or OLS (no), then forking on prediction vs causal goal, then on correlated code families vs sparsity vs full retention, converging on lambda choice, bootstrap stability, and an interpretation warning that naive CIs after LASSO are invalid.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "requires",
        "target_slug": "predictive-and-causal-ml-models-rwe",
        "notes": "Parent concept covering the ML taxonomy and the critical prediction vs causation distinction; regularized regression is a MODEL-TYPE child that inherits the cross-fitting, nuisance-function, and causal-inference warning context."
      },
      {
        "relation_type": "requires",
        "target_slug": "ols-linear-regression",
        "notes": "OLS is the unpenalized baseline that regularized regression extends; understanding OLS coefficient estimation, variance inflation under multicollinearity, and the normal equations is prerequisite to understanding why penalization helps."
      },
      {
        "relation_type": "used_with",
        "target_slug": "cross-validation-and-overfitting-rwe",
        "notes": "Cross-validation is the mechanism for choosing lambda; the lambda.min and lambda.1se rules are CV outputs. Understanding k-fold CV and the bias-variance tradeoff in model selection is prerequisite to setting up cv.glmnet or LassoCV correctly."
      },
      {
        "relation_type": "see_also",
        "target_slug": "tree-based-ensembles-rwe",
        "notes": "Both are high-dimensional ML methods used for prediction and as nuisance models inside causal ML; tree-based ensembles (gradient boosting, random forests) are more flexible (non-linear) but less interpretable and do not produce sparse linear representations — use regularized regression when a linear model with interpretable coefficients is required."
      },
      {
        "relation_type": "used_with",
        "target_slug": "high-dimensional-propensity-score-hdps-rwe",
        "notes": "LASSO logistic regression on the full code space is increasingly used as an alternative or complement to the hdPS empirical Bayes selection step for propensity score construction; the two approaches both address high-dimensional claims code confounding and can be cross-validated against each other."
      },
      {
        "relation_type": "see_also",
        "target_slug": "logistic-regression-for-binary-outcomes",
        "notes": "Standard logistic regression is the unpenalized counterpart for binary outcomes; LASSO logistic is preferred when p is large, while standard logistic is preferred for pre- specified, moderate-dimension confirmatory analyses where valid standard errors and unbiased effect estimates are required."
      }
    ],
    "aliases": [
      "LASSO",
      "ridge regression",
      "elastic net",
      "penalized regression",
      "shrinkage",
      "L1 regression",
      "L2 regression",
      "glmnet",
      "regularized regression"
    ],
    "complexity": "advanced",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "regulatory-readiness-rwe",
    "name": "Regulatory and HTA Readiness for RWE",
    "short_definition": "A structured pre-execution discipline that judges whether a real-world evidence study's data, estimand, design, and analysis plan are documented and defensible enough to survive FDA, EMA, or HTA review before a single line of analytic code is run.",
    "long_description": "**Regulatory and HTA readiness** is the practice of engineering a real-world evidence (RWE) study so that the\nfinished package — protocol, statistical analysis plan (SAP), data-provenance documentation, code, and report — can\nwithstand structured review by a regulator (FDA, EMA) or a health technology assessment body (NICE, ICER, IQWiG,\nCADTH/CDA) **before** the study is run, not after the estimate is in hand. It is not a single statistical method; it\nis the connective tissue that forces fitness-for-purpose, estimand traceability, time-zero alignment, comparator\ndefensibility, and pre-specified sensitivity analyses into a single auditable chain. Readiness operationalizes the\nstructured-template movement — STaRT-RWE for planning and reporting, HARPER for hypothesis-evaluating treatment-effect\nstudies, ICH E9(R1) for estimands — and the target-trial framework that disciplines the design.\n\n**Core conceptual distinction.** Readiness is a *process and documentation* layer, not an *estimation* layer. The\nestimation methods (active comparator new-user, propensity scores, Cox/Fine-Gray, g-methods) live in their own catalog\nentries; readiness asks a different question of each: *is the choice pre-specified, justified against the policy\nquestion, traceable from estimand to code, and accompanied by a falsification/sensitivity strategy that a skeptical\nreviewer would accept?* The decisive distinction a reviewer enforces is between **a study that reports an estimate** and\n**a study that can defend the estimand it claims to have measured** — the population, the treatment strategies being\ncontrasted, the outcome, the timing of intercurrent events, and the summary measure, in ICH E9(R1) terms. 
A study can\nbe statistically flawless and still fail readiness if the data cannot observe the estimand (e.g., a per-protocol\nestimand built on claims that cannot see in-hospital administration), or if the analysis was specified after looking at\nresults. Readiness is therefore upstream of, and orthogonal to, the precision of any one estimator.\n\n**Pros, cons, and trade-offs.**\n- **vs running the analysis first and writing the protocol around it (\"HARKing\"/post hoc rationalization):** Readiness\n  front-loads pre-specification and a locked SAP, which is the single strongest defense against the \"you fished for this\n  result\" critique that sinks RWE submissions. Cost: it is slow, demands cross-functional sign-off, and constrains\n  exploratory creativity. **Prefer readiness** for any confirmatory or label-relevant submission; relax it only for\n  genuinely hypothesis-generating internal work that will be re-run confirmatorily.\n- **vs a generic study protocol with no structured template:** Adopting STaRT-RWE/HARPER plus an estimand-to-code\n  traceability matrix makes the package machine-checkable and reviewer-navigable, and exposes gaps (an unjustified\n  comparator, an undocumented washout) early. Cost: template overhead and the discipline of keeping the matrix current\n  as the protocol evolves. **Prefer the structured template** whenever a regulator or HTA body is the audience.\n- **vs treating fit-for-purpose data assessment as a checkbox:** Real readiness audits provenance, relevance, and\n  reliability against the *specific* estimand (can these data observe time zero, the comparator, the outcome with\n  adequate PPV, and complete person-time?). Cost: it can disqualify a convenient data source late. 
**Prefer the deep\n  assessment** — a late \"the data can't see the outcome\" is far cheaper before lock than after submission.\n- **vs deferring the design to a target-trial emulation alone:** Target-trial emulation supplies the design scaffold\n  (eligibility, treatment strategies, assignment, time zero, causal contrast) that readiness then *documents and\n  stress-tests*. They are complements, not substitutes; readiness is the wrapper that turns a well-emulated trial into a\n  submittable dossier with negative controls, quantitative bias analysis, and a transparency/reproducibility package.\n\n**When to use.** Any study intended to support a regulatory action (label expansion, post-marketing requirement/PASS,\nsafety signal evaluation) or an HTA submission (relative effectiveness, budget impact, survival extrapolation inputs);\nany externally controlled trial or single-arm-plus-external-control submission; any study where a payer or regulator\nwill independently re-derive the result from your code and data dictionary. Engage readiness at protocol conception,\nnot at write-up: the cheapest defects to fix are the ones caught before data lock.\n\n**When NOT to use — and when it is actively misleading or dangerous.** Do not impose the full regulatory-readiness\napparatus on genuinely exploratory, internal-only analyses whose purpose is to *decide whether* a confirmatory study is\nworth running — the overhead can kill useful hypothesis generation, and a locked SAP on an exploratory question is\ntheater. It becomes **actively misleading** when used as a veneer: a beautifully templated protocol wrapped around a\ndata source that cannot observe the estimand gives reviewers false confidence and is worse than an honestly limited\nstudy, because the polish hides the fatal flaw. 
It is **dangerous** when \"readiness\" is invoked to justify a\nfeasibility-driven estimand swap — silently redefining the question to whatever the data can answer (e.g., switching a\ndrug-vs-no-treatment policy question to a drug-vs-active-comparator question because the comparator was convenient)\nwhile presenting it as the original question. Readiness must protect the question, not bend it to the data.\n\n**Data-source operational depth.**\n- **Administrative claims (FFS vs Medicare Advantage vs commercial):** The dominant readiness failure is unobservable\n  person-time. Medicare Advantage enrollees generate *no* fee-for-service claims, so apparent \"no prior fill\" during a\n  washout can be encounter-data gaps rather than a true drug-free period; restrict to enrollees with complete Parts\n  A/B/D (or a commercial medical+pharmacy benefit) and exclude MA-only person-time, and document this in the\n  fit-for-purpose section. Claims can establish exposure (NDC + `fill_date` + `days_supply`) and outcomes via validated\n  algorithms, but a reviewer will demand the outcome algorithm's PPV/sensitivity in *that* population, not a borrowed\n  estimate. In elderly cohorts, **differential competing risks by exposure** (death competing with the event of\n  interest, at different rates across arms) must be addressed with a pre-specified cause-specific vs cumulative-incidence\n  decision, or the readiness reviewer will reject a naive Kaplan–Meier.\n- **EHR:** Strong for baseline severity (labs, vitals, problem lists, notes) but capture is visit-driven, so a patient\n  who seeks care outside the system is differentially lost; readiness requires explicit observation windows, a\n  missing-data pattern table, and linkage to dispensing/death where possible. 
The order/administration-vs-dispensing\n  distinction must be reconciled before time-zero assignment.\n- **Registry:** Best for adjudicated outcomes and disease severity/stage, weakest for complete pharmacy exposure and\n  mortality; readiness depends on documented linkage to claims (for fills) and a death index (for censoring), and on a\n  transportability argument from the registry population to the decision population.\n- **Linked claims–EHR–vital records:** The ideal substrate (severity + completeness + reliable mortality) but linkage\n  introduces selection (only the linkable subset) and date-discrepancy issues across order, fill, and service dates that\n  must be reconciled and documented before time zero. A common, study-killing readiness defect here is **immortal time\n  in procedure-anchored designs**: if follow-up starts at diagnosis but exposure is defined by a later procedure or\n  fill, the interval guaranteeing survival to exposure is misclassified as exposed person-time — readiness forces time\n  zero to the exposure decision and audits the timeline for it.\n\n**Worked example (claims, readiness gate-by-gate).** Question: does a second-generation sulfonylurea increase\ncardiovascular hospitalization vs a DPP-4 inhibitor in adults with type 2 diabetes, to support a post-marketing safety\ncommitment, using a commercial + Medicare FFS database (2016–2023)?\n(1) **Fit-for-purpose gate.** Confirm the data can observe each estimand component: exposure (NDC fills with\n`days_supply`), the comparator (DPP-4 inhibitor fills), the outcome (CV hospitalization algorithm with documented PPV\nin this database), and complete person-time. Decision rule: require continuous medical + pharmacy enrollment and\n**exclude MA-only person-time** because FFS claims are absent there — without this, the washout is unverifiable. 
FAIL if\nthe outcome algorithm has not been validated in a comparable population.\n(2) **Estimand traceability gate (ICH E9(R1)).** Lock the five attributes: population = incident users of either drug\nwith ≥2 diabetes diagnoses and 365 days of continuous enrollment; treatments = sulfonylurea vs DPP-4 initiation;\nendpoint = first CV hospitalization; intercurrent events = death (competing risk → pre-specify cause-specific hazard\n*and* cumulative incidence), switching/discontinuation (treatment-policy vs while-on-treatment strategies named\nexplicitly); summary = hazard ratio + 5-year cumulative incidence difference. Build a traceability matrix mapping each\nattribute to a SAP section and to a code module.\n(3) **Time-zero / comparator defensibility gate.** Time zero = first qualifying fill (NDC dispensed that day assigns the\narm); washout = no fill of *either* class in the prior 365 days, so both arms are incident users and no immortal time is\nintroduced. Defend the active comparator (both treat the same indication at the same decision point); show baseline\ncovariate balance after the planned propensity-score adjustment using only pre-time-zero covariates. FAIL if the\ncomparator is preferentially channeled (e.g., reserved for renal-impaired patients) without a documented balancing plan.\n(4) **Sensitivity / falsification gate.** Pre-specify negative-control outcomes and a negative-control exposure for\nempirical calibration, an E-value for the minimum unmeasured confounding that would explain the result, washout-length\nand grace-period variations, and a quantitative bias analysis for outcome misclassification. 
A readiness reviewer\ntreats the *absence* of pre-specified falsification as a finding.\n(5) **Transparency / reproducibility gate.** Deliver the locked protocol/SAP (STaRT-RWE/HARPER fields complete), the\nattrition funnel from source population to analytic cohort, the full code with a data dictionary, and a versioned\nparameter table (`WASHOUT_DAYS=365`, grace period, caliper). The package PASSES readiness only when an independent\nanalyst could re-derive the headline estimate from these artifacts.",
    "primary_category": "Framework_Standard",
    "tags": [
      "regulatory-readiness",
      "hta-submission",
      "estimand-traceability",
      "fit-for-purpose-data",
      "pre-specification",
      "target-trial",
      "transparency-reproducibility"
    ],
    "applies_to_study_types": [
      "claims_analysis",
      "ehr_study",
      "registry_linkage",
      "target_trial_emulation"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1136/bmj.m4856",
        "url": "https://doi.org/10.1136/bmj.m4856",
        "citation_text": "Wang SV, Pinheiro S, Hua W, et al. STaRT-RWE: structured template for planning and reporting on the implementation of real world evidence studies. BMJ. 2021;372:m4856.",
        "year": 2021,
        "authors_short": "Wang et al.",
        "notes": "The structured template that operationalizes regulatory readiness by forcing pre-specified, reportable design and implementation choices for treatment-effect RWE studies."
      },
      {
        "role": "explain",
        "doi": "10.1002/pds.5507",
        "url": "https://doi.org/10.1002/pds.5507",
        "citation_text": "Wang SV, Pottegård A, Crown W, et al. HARmonized Protocol Template to Enhance Reproducibility of hypothesis evaluating real-world evidence studies on treatment effects: A good practices report of a joint ISPE/ISPOR task force. Pharmacoepidemiology and Drug Safety. 2023;32(1):44-55.",
        "year": 2023,
        "authors_short": "Wang et al.",
        "notes": "The companion protocol template (HARPER) specifying the design, analysis, and provenance elements a regulatory-grade RWE protocol must lock before data analysis."
      },
      {
        "role": "explain",
        "doi": "10.1093/aje/kwv254",
        "url": "https://doi.org/10.1093/aje/kwv254",
        "citation_text": "Hernán MA, Robins JM. Using big data to emulate a target trial when a randomized trial is not available. American Journal of Epidemiology. 2016;183(8):758-764.",
        "year": 2016,
        "authors_short": "Hernán & Robins",
        "notes": "The target-trial framework that disciplines the design layer readiness documents — eligibility, treatment strategies, assignment, time zero, and the causal contrast."
      },
      {
        "role": "demonstrate",
        "doi": "10.1002/cpt.1633",
        "url": "https://doi.org/10.1002/cpt.1633",
        "citation_text": "Franklin JM, Glynn RJ, Martin D, Schneeweiss S. Evaluating the use of nonrandomized real-world data analyses for regulatory decision making. Clinical Pharmacology & Therapeutics. 2019;105(4):867-877.",
        "year": 2019,
        "authors_short": "Franklin et al.",
        "notes": "The RCT-DUPLICATE blueprint for benchmarking RWE design/analysis choices against randomized trials — concrete evidence of what a regulator-ready emulation requires."
      },
      {
        "role": "use",
        "doi": "10.1002/cpt.1426",
        "url": "https://doi.org/10.1002/cpt.1426",
        "citation_text": "Cave A, Kurz X, Arlett P. Real-world data for regulatory decision making: challenges and possible solutions for Europe. Clinical Pharmacology & Therapeutics. 2019;106(1):36-39.",
        "year": 2019,
        "authors_short": "Cave et al.",
        "notes": "EMA perspective on data quality, provenance, and process requirements for RWE to inform regulatory decisions in Europe."
      }
    ],
    "plain_language_summary": "Regulatory readiness is the practice of building a real-world evidence study so that every decision — what data you used, what question you were answering, how the analysis was planned — is written down, locked, and traceable before you touch the data. Think of it as the paperwork and process checks a study must pass before the FDA, EMA, or a health-technology assessment body will trust its results. Without it, a technically sound analysis can still be rejected because no one can verify that the answer wasn't shaped by peeking at the results first. A ready study is one where an independent analyst could take your documentation package and re-derive your headline number from scratch.",
    "key_terms": [
      {
        "term": "regulatory-grade",
        "definition": "Describes a study whose documentation, data, and analysis choices meet the standards a regulator (FDA or EMA) requires before using the results to make a drug-approval or safety decision."
      },
      {
        "term": "data provenance",
        "definition": "The documented record of where a dataset came from, how it was processed, and whether it can actually observe the events the study needs to measure — for example, can these claims data see every hospital visit, or are some visits missing because of plan-type gaps?"
      },
      {
        "term": "audit trail",
        "definition": "A complete, versioned record of every protocol decision, code file, and parameter setting (such as the length of a drug-free lookback window) so a reviewer can follow the path from the original question all the way to the final number without gaps."
      },
      {
        "term": "pre-specified",
        "definition": "Decided and written down before the data are analyzed, so the choice cannot have been influenced by knowing what the result would be — the opposite of choosing a method after seeing which one gives the most favorable answer."
      },
      {
        "term": "estimand",
        "definition": "The exact quantity a study is designed to measure — it names the patient population, the two treatment options being compared, the outcome, and how events like death or switching drugs are handled, all locked before analysis begins."
      }
    ],
    "worked_example": {
      "scenario": "A research team is planning an RWE study using commercial claims data to ask whether a second-generation sulfonylurea raises the risk of cardiovascular hospitalization compared with a DPP-4 inhibitor in adults with type 2 diabetes. The study will be submitted to the FDA as part of a post-marketing safety commitment. Before any code is written, the team must evaluate whether the study is regulatory-ready. The checklist below rates each required element as MET or UNMET and explains why it matters for the submission.",
      "dataset": {
        "caption": "Regulatory-readiness checklist for a cardiovascular safety study in T2D using commercial + Medicare FFS claims (2016-2023). Each row is one required element; the Status column shows whether this team has met it.",
        "columns": [
          "Gate",
          "Required Element",
          "Why It Matters for Submission",
          "Status"
        ],
        "rows": [
          [
            "Fit-for-purpose data",
            "Confirm claims can observe sulfonylurea and DPP-4 fills via NDC codes with fill dates and days_supply",
            "If the data cannot see the drug, the exposure is undefined and the study cannot answer the question",
            "MET"
          ],
          [
            "Fit-for-purpose data",
            "Confirm outcome algorithm (CV hospitalization) has documented accuracy (PPV/sensitivity) in this specific database and population",
            "A borrowed accuracy estimate from a different database may not hold; reviewers require population-specific validation",
            "UNMET — PPV documented only in Medicare, not in the commercial sub-population"
          ],
          [
            "Fit-for-purpose data",
            "Exclude Medicare Advantage person-time where fee-for-service claims are absent",
            "MA enrollees generate no FFS claims, so apparent drug-free periods during lookback may simply be missing data, not true absence of use",
            "MET"
          ],
          [
            "Estimand traceability",
            "Lock all five ICH E9(R1) estimand attributes (population, treatments, endpoint, intercurrent-event handling, summary measure) in the protocol before data lock",
            "Locking the estimand prevents the question from silently shifting to fit whatever the data can answer; reviewers check for this",
            "MET"
          ],
          [
            "Estimand traceability",
            "Build a traceability matrix mapping each estimand attribute to a SAP section and a code module",
            "Allows an independent analyst to verify that the code actually implements what the protocol says, with no unexplained gaps",
            "UNMET — matrix not yet written"
          ],
          [
            "Design defensibility",
            "Set time zero at the first qualifying fill so no person-time before the exposure decision is counted as exposed follow-up",
            "Miscounting pre-exposure days as exposed creates immortal-time bias, systematically understating the drug's apparent risk",
            "MET"
          ],
          [
            "Design defensibility",
            "Justify the active comparator (DPP-4 inhibitor) as treating the same indication at the same clinical decision point",
            "If the comparator is channeled to a different type of patient (e.g., those with kidney disease), the comparison is confounded before any adjustment is applied",
            "MET"
          ],
          [
            "Sensitivity and falsification",
            "Pre-specify at least one negative-control outcome (an event the drugs cannot plausibly cause) to detect residual bias",
            "If the negative control shows a spurious signal, it reveals unmeasured confounding that could also distort the main result",
            "UNMET — not yet pre-specified"
          ],
          [
            "Sensitivity and falsification",
            "Pre-specify washout-length variations (e.g., 180-day vs 365-day lookback) and a quantitative bias analysis for outcome misclassification",
            "Reviewers treat the absence of pre-specified sensitivity analyses as a finding; post-hoc sensitivity runs are less credible",
            "MET"
          ],
          [
            "Transparency",
            "Deliver a locked protocol and SAP (using STaRT-RWE or HARPER fields) before any analysis begins",
            "A post-hoc protocol is the most common reason a confirmatory RWE submission fails the credibility test",
            "MET"
          ],
          [
            "Transparency",
            "Include the attrition funnel showing how the source population was reduced to the analytic cohort, with counts at each step",
            "Reviewers use the funnel to check for selection bias and to assess whether the analytic population still matches the policy question",
            "MET"
          ],
          [
            "Transparency",
            "Deliver all analysis code with a data dictionary and a versioned parameter table (WASHOUT_DAYS=365, grace_period=30 days, caliper=0.01)",
            "Reproducibility requires that every hard-coded number be visible and documented; undocumented parameters cannot be independently verified",
            "UNMET — parameter table not finalized"
          ]
        ]
      },
      "steps": [
        "Gate 1 (Fit-for-purpose data): Three of four elements pass. The blocking gap is that the outcome algorithm's accuracy has only been validated in Medicare, not in the commercial sub-population that makes up roughly half the study population. A reviewer will not accept a borrowed PPV.",
        "Gate 2 (Estimand traceability): The five ICH E9(R1) attributes are locked in the protocol, which is the harder half. But the traceability matrix — the document mapping each attribute to a SAP section and a code module — has not been written. Without it, an independent analyst cannot verify end-to-end that the code measures what the protocol claims.",
        "Gate 3 (Design defensibility): Both elements pass. Time zero is correctly set at first qualifying fill (no immortal-time risk), and the active comparator is justified as treating the same indication.",
        "Gate 4 (Sensitivity and falsification): One of two elements passes. The washout-length sensitivity analysis is pre-specified, but the negative-control outcome selection is absent. Reviewers flag missing negative controls as a readiness finding because they are the primary empirical check on residual confounding.",
        "Gate 5 (Transparency): Two of three elements pass. The blocking gap is the parameter table: the washout length, grace period, and propensity-score caliper are implemented in code but not documented in a versioned table that maps each value to its protocol justification.",
        "Overall verdict: 3 of 5 gates pass fully; 2 gates have at least one blocking gap. The study is NOT submission-ready in its current state. The three specific gaps — commercial-population PPV documentation, the estimand traceability matrix, and the versioned parameter table — must be resolved before data lock."
      ],
      "result": "NOT submission-ready. 9 of 12 checklist elements are MET; 3 are UNMET (outcome-algorithm PPV not validated in the commercial sub-population; estimand-to-code traceability matrix not written; versioned parameter table not finalized). All three gaps are fixable before data lock and do not require redesigning the study — but a submission package with any of these three open would receive a deficiency letter from a regulatory reviewer."
    },
    "prerequisites": [
      "fit-for-purpose-data-assessment-rwe",
      "estimand-analysis-traceability-rwe",
      "target-trial-emulation"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Regulatory (FDA/EMA) readiness",
        "description": "Readiness oriented to a regulatory action (label change, post-marketing requirement/PASS, safety evaluation), emphasizing pre-specified protocol/SAP, fit-for-purpose data, estimand traceability, and target-trial-aligned design.",
        "edge_cases": [
          "A feasibility-driven estimand swap presented as the original question is the classic failure; the protocol must name the policy question first, then test whether the data can answer it.",
          "Outcome-validation evidence must come from the same (or a closely comparable) data source and population, not a borrowed PPV from a different setting."
        ],
        "data_source_notes": "claims: document MA-only exclusion and outcome-algorithm PPV/sensitivity; ehr: document observation windows and missingness; registry/linked: document linkage selection and date reconciliation before time zero."
      },
      {
        "name": "HTA / payer readiness",
        "description": "Readiness oriented to relative-effectiveness, budget-impact, or cost-effectiveness submissions (NICE, ICER, IQWiG, CDA), emphasizing comparator relevance to the decision problem, generalizability/transportability, and inputs to economic models (e.g., survival extrapolation).",
        "edge_cases": [
          "The comparator and population must match the HTA decision problem (PICOTS), which may differ from the regulatory label population; mismatch undermines the relative-effectiveness claim.",
          "Economic-model inputs (transition probabilities, extrapolated survival) inherit every RWE bias and must carry their own sensitivity and structural-uncertainty analyses."
        ],
        "data_source_notes": "registry/linked data often anchor severity and long-term outcomes for HTA; claims anchor real-world resource use and adherence. Document transportability to the target decision population."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Running the analysis first and writing the protocol around the result (post hoc rationalization)",
        "pros_of_this": "Pre-specification and a locked SAP defend directly against the \"you fished for this result\" critique that sinks confirmatory RWE submissions; the estimand is protected from feasibility creep.",
        "cons_of_this": "Slower, requires cross-functional sign-off, and constrains exploratory analysis.",
        "when_to_prefer": "Any confirmatory or label-/coverage-relevant study where a regulator or HTA body is the audience."
      },
      {
        "compared_to": "A generic study protocol with no structured template (STaRT-RWE/HARPER)",
        "pros_of_this": "Machine-checkable, reviewer-navigable package that surfaces gaps (unjustified comparator, undocumented washout) before data lock via an explicit estimand-to-code traceability matrix.",
        "cons_of_this": "Template and matrix-maintenance overhead as the protocol evolves.",
        "when_to_prefer": "Whenever the deliverable will be independently re-derived or formally reviewed."
      },
      {
        "compared_to": "Target-trial emulation taken alone as the deliverable",
        "pros_of_this": "Wraps the well-emulated trial in the falsification, quantitative-bias-analysis, and reproducibility package a regulator/HTA reviewer expects; documents and stress-tests the design rather than just specifying it.",
        "cons_of_this": "Adds documentation and pre-specification burden on top of the emulation itself.",
        "when_to_prefer": "When the emulation must become a submittable dossier rather than a methods exercise."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Readiness audit must confirm observable person-time (exclude MA-only spans where FFS claims are absent), a population-specific outcome-algorithm PPV/sensitivity, time-zero alignment that avoids immortal time, and a pre-specified competing-risks decision in elderly cohorts.",
      "ehr": "Define explicit observation windows, produce a missing-data pattern table, reconcile order/administration vs dispensing before time zero, and treat visit-driven loss to follow-up as potentially informative.",
      "registry": "Document linkage to claims (fills) and a death index (censoring/mortality), adjudication procedures, and a transportability argument to the decision population.",
      "linked": "Document linkage selection (linkable subset) and reconcile order/fill/service date discrepancies before time-zero assignment; this substrate is ideal but the selection and date issues are the readiness risks."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "from dataclasses import dataclass, field\n\n# A \"gate\" is a required readiness condition; each maps to evidence the protocol/SAP must contain.\nGATES = {\n    \"fit_for_purpose\": [\n        \"data_observes_exposure\", \"data_observes_comparator\",\n        \"outcome_algorithm_ppv_documented\", \"person_time_observable\",  # e.g., MA-only excluded\n    ],\n    \"estimand_traceability\": [\n        \"population_defined\", \"treatment_strategies_defined\", \"endpoint_defined\",\n        \"intercurrent_events_strategy\", \"summary_measure_defined\",\n        \"estimand_to_code_matrix\",  # ICH E9(R1) attributes mapped to SAP + code modules\n    ],\n    \"design_defensibility\": [\n        \"time_zero_avoids_immortal_time\", \"comparator_justified\",\n        \"baseline_covariates_pre_time_zero\", \"balance_plan_specified\",\n    ],\n    \"sensitivity_falsification\": [\n        \"negative_controls_prespecified\", \"evalue_or_qba_planned\",\n        \"washout_grace_sensitivity_planned\",\n    ],\n    \"transparency\": [\n        \"protocol_sap_locked\", \"attrition_funnel\", \"code_and_data_dictionary\",\n        \"parameter_table_versioned\",  # WASHOUT_DAYS, grace period, caliper, etc.\n    ],\n}\n\n@dataclass\nclass ReadinessResult:\n    passed: bool\n    gate_status: dict = field(default_factory=dict)   # gate -> bool\n    blocking_gaps: list = field(default_factory=list) # \"gate.requirement\" strings\n\ndef assess_readiness(study: dict) -> ReadinessResult:\n    \"\"\"`study` keys are requirement names mapping to truthy (met) / falsy (missing or unmet).\"\"\"\n    gate_status, gaps = {}, []\n    for gate, requirements in GATES.items():\n        missing = [r for r in requirements if not study.get(r)]\n        gate_status[gate] = (len(missing) == 0)\n        gaps.extend(f\"{gate}.{r}\" for r in missing)\n    return ReadinessResult(passed=all(gate_status.values()),\n                           gate_status=gate_status, blocking_gaps=gaps)",
        "description": "Regulatory-readiness gate checker. It scores a study's protocol/SAP artifacts against the five readiness gates\n(fit-for-purpose, estimand traceability, time-zero/comparator defensibility, sensitivity/falsification, transparency)\nand returns the PASS/FAIL verdict plus the specific blocking gaps a reviewer would raise. Input is a single dict\ndescribing the planned study; populate it from the protocol, not from results. This is a documentation/QA helper, not\nan estimation step — it is meant to be run before data lock and re-run at submission.",
        "dependencies": [],
        "source_citations": [
          "wang-2021",
          "franklin-2019"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "GATES <- list(\n  fit_for_purpose = c(\"data_observes_exposure\", \"data_observes_comparator\",\n                      \"outcome_algorithm_ppv_documented\", \"person_time_observable\"),\n  estimand_traceability = c(\"population_defined\", \"treatment_strategies_defined\",\n                            \"endpoint_defined\", \"intercurrent_events_strategy\",\n                            \"summary_measure_defined\", \"estimand_to_code_matrix\"),\n  design_defensibility = c(\"time_zero_avoids_immortal_time\", \"comparator_justified\",\n                           \"baseline_covariates_pre_time_zero\", \"balance_plan_specified\"),\n  sensitivity_falsification = c(\"negative_controls_prespecified\", \"evalue_or_qba_planned\",\n                                \"washout_grace_sensitivity_planned\"),\n  transparency = c(\"protocol_sap_locked\", \"attrition_funnel\",\n                   \"code_and_data_dictionary\", \"parameter_table_versioned\")\n)\n\nassess_readiness <- function(study) {\n  # study: named logical vector; TRUE = requirement met. Missing names default to FALSE.\n  met <- function(req) isTRUE(study[[req]])\n  gate_status <- sapply(GATES, function(reqs) all(vapply(reqs, met, logical(1))))\n  gaps <- unlist(lapply(names(GATES), function(g)\n    paste0(g, \".\", GATES[[g]][!vapply(GATES[[g]], met, logical(1))])))\n  list(passed = all(gate_status), gate_status = gate_status, blocking_gaps = gaps)\n}",
        "description": "Regulatory-readiness gate checker (R). Mirrors the Python version: it evaluates a named logical vector of protocol/SAP\nrequirements against the five readiness gates and returns the overall verdict, per-gate status, and the blocking gaps a\nreviewer would cite. Run before data lock and again at submission. Requirements absent from the input are treated as\nunmet (a missing requirement is a finding).",
        "dependencies": [],
        "source_citations": [
          "wang-2021",
          "franklin-2019"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Q[Policy/regulatory question<br/>fixed FIRST, before data] --> G1{Fit-for-purpose:<br/>can the data observe<br/>exposure, comparator,<br/>outcome, person-time?}\n  G1 -->|No| STOP[Stop or change data source<br/>do NOT bend the question]\n  G1 -->|Yes| G2{Estimand traceability:<br/>ICH E9 R1 attributes locked<br/>and mapped to SAP + code?}\n  G2 -->|No| FIX[Lock estimand + build<br/>traceability matrix]\n  G2 -->|Yes| G3{Design defensibility:<br/>time zero avoids immortal time,<br/>comparator justified,<br/>baseline pre-time-zero?}\n  G3 -->|No| FIX\n  G3 -->|Yes| G4{Sensitivity/falsification:<br/>negative controls, E-value/QBA,<br/>washout + grace variations?}\n  G4 -->|No| FIX\n  G4 -->|Yes| G5{Transparency:<br/>protocol/SAP locked,<br/>attrition funnel, code + data dictionary,<br/>versioned parameters?}\n  G5 -->|No| FIX\n  G5 -->|Yes| SUBMIT[Submission-ready package]\n  FIX --> G2",
        "caption": "Regulatory and HTA readiness as a sequential gate-check. The question is fixed first; each gate must pass before the next, and a fit-for-purpose failure stops the study rather than triggering a feasibility-driven estimand swap.",
        "alt_text": "Decision flowchart from a fixed policy question through fit-for-purpose, estimand traceability, design defensibility, sensitivity/falsification, and transparency gates to a submission-ready package.",
        "source_type": "illustrative",
        "source_citations": [
          "wang-2021"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  subgraph Readiness wrapper\n    FFP[Fit-for-purpose<br/>data assessment] --> EST[Estimand + traceability<br/>ICH E9 R1]\n    EST --> DES[Design: target-trial emulation<br/>time zero, comparator]\n    DES --> ANA[Analysis: PS / Cox / Fine-Gray<br/>per locked SAP]\n    ANA --> SENS[Sensitivity + falsification<br/>negative controls, E-value, QBA]\n    SENS --> PKG[Transparency package<br/>protocol, code, data dictionary]\n  end\n  PKG --> REG[FDA / EMA review]\n  PKG --> HTA[HTA / payer review]",
        "caption": "Readiness is the wrapper around the estimation methods, not a substitute for them — it documents and stress-tests each layer so the same package can serve regulatory and HTA review.",
        "alt_text": "Data-flow diagram showing readiness wrapping fit-for-purpose, estimand, design, analysis, sensitivity, and transparency, feeding both regulatory and HTA review.",
        "source_type": "illustrative",
        "source_citations": [
          "franklin-2019"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "requires",
        "target_slug": "fit-for-purpose-data-assessment-rwe",
        "notes": "Fit-for-purpose data assessment is the first readiness gate — provenance, relevance, and reliability judged against the specific estimand."
      },
      {
        "relation_type": "requires",
        "target_slug": "estimand-analysis-traceability-rwe",
        "notes": "Estimand-to-analysis traceability is the readiness backbone — each ICH E9(R1) attribute mapped to a SAP section and a code module."
      },
      {
        "relation_type": "requires",
        "target_slug": "study-protocol-or-sap-elements",
        "notes": "A locked protocol/SAP with the required elements is the documentation substrate that readiness audits."
      },
      {
        "relation_type": "used_with",
        "target_slug": "sample-size-power-precision-rwe",
        "notes": "Pre-specified power/precision justification is a required element of a regulator-ready protocol."
      },
      {
        "relation_type": "used_with",
        "target_slug": "database-feasibility-attrition-funnel-rwe",
        "notes": "The attrition funnel from source population to analytic cohort is a transparency deliverable readiness requires."
      },
      {
        "relation_type": "used_with",
        "target_slug": "target-trial-emulation",
        "notes": "Target-trial emulation supplies the design scaffold (eligibility, strategies, assignment, time zero) that readiness documents and stress-tests."
      },
      {
        "relation_type": "see_also",
        "target_slug": "picots-framework-rwe",
        "notes": "PICOTS frames the decision problem (especially for HTA) that the estimand and comparator must match."
      },
      {
        "relation_type": "see_also",
        "target_slug": "estimands-ate-att-intercurrent-events-rwe",
        "notes": "The ICH E9(R1) estimand and intercurrent-event strategies that readiness forces into pre-specification."
      },
      {
        "relation_type": "see_also",
        "target_slug": "quantitative-bias-analysis-toolkit-rwe",
        "notes": "Pre-specified quantitative bias analysis and falsification tests are a core readiness gate."
      },
      {
        "relation_type": "see_also",
        "target_slug": "time-zero-index-date-alignment-rwe",
        "notes": "Time-zero alignment that avoids immortal time is a design-defensibility requirement readiness audits."
      }
    ],
    "aliases": [
      "regulatory-grade RWE",
      "submission-ready RWE",
      "HTA-ready RWE",
      "regulatory readiness for real-world evidence",
      "RWE regulatory fitness"
    ],
    "complexity": "advanced",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "relative-net-survival",
    "name": "Relative and Net Survival",
    "short_definition": "A method for estimating cancer-attributable survival from registry data where cause of death is unreliable or missing: relative survival divides the observed all-cause survival of a cancer cohort by the expected survival of a matched general-population group from population life tables; the result estimates net survival -- the probability of surviving the cancer in a hypothetical world where no other cause of death can occur.",
    "long_description": "**The cancer-registry problem that motivates relative survival**\n\nIn most cancer registries, the death certificate records a cause of death -- but those entries\nare notoriously unreliable. A patient with stage-IV colon cancer who dies of myocardial\ninfarction may be coded as a cancer death; a patient dying of uncontrolled metastatic disease\nmay be certified as dying of sepsis. Misclassification is differential by cancer site, age,\nrace/ethnicity, and data period. In Medicare claims and other administrative sources the\nproblem is worse still: the only cause-of-death signal is the ICD code on the discharge claim\nor the enrollment death-file entry, not a reviewed death certificate. The practical consequence\nis that cause-specific survival analysis requires knowing *why* a patient died, and that\nknowledge is often unavailable, partial, or systematically wrong in population-based registry\ndata. Relative survival bypasses the problem entirely: it never asks why a patient died, only\nwhether the rate at which they died exceeded what was expected from the general population of\nthe same age, sex, and calendar year.\n\n**The estimand: net survival**\n\nNet survival is the survival probability in a conceptual world where the cancer is the only\npossible cause of death -- all background mortality has been removed by assumption. It is not\nthe observed all-cause survival (which includes deaths from heart disease, stroke, and accidents\nunrelated to the cancer), and it is not cause-specific survival (which requires reliable\ncause-of-death coding). Net survival is the quantity that SEER data products, CONCORD, and\nEUROCARE comparisons report, and it is the international standard for population-based cancer\nsurveillance. 
The Pohar Perme estimator (2012) is the nonparametric estimator of net survival\nthat is consistent under the independence assumption: a patient's expected general-population\nsurvival is independent of their actual cancer prognosis given age, sex, and calendar year.\nThis is the current recommendation of the International Agency for Research on Cancer.\n\n**The relative survival ratio: observed divided by expected**\n\nRelative survival is computed as:\n\n  relative_survival = observed_survival / expected_survival\n\nwhere observed_survival is the all-cause Kaplan-Meier survival probability from the cancer\ncohort, and expected_survival is the survival probability that the cohort would have experienced\nif they had the same all-cause mortality as the general population matched on age, sex, and\ncalendar year -- obtained from population life tables. A relative survival of 0.50 at five years\nmeans the cohort's five-year survival probability is 50% of that expected for a comparable\ngeneral-population group. Note: the 0.50 is a ratio, not a percentage-point deficit. If observed\nsurvival is 0.40 and expected survival is 0.80, the ratio is 0.50 but the absolute deficit is\n40 percentage points (0.80 − 0.40 = 0.40). Relative survival above 1.0 is possible (a healthy-worker effect or a population\nselected for low baseline mortality) but requires careful interpretation. Statistical cure\nin the net survival framework is indicated when the excess hazard returns to zero and the\nrelative survival curve *plateaus*: it stops declining and stabilises at a fixed value that\nequals the proportion of the cohort effectively cured. 
This plateau is generally below 1.0\nunless all patients are eventually cured; relative survival does not in general approach 1.0\nas follow-up lengthens.\n\n**Pohar Perme versus Ederer I, Ederer II, and Hakulinen: why the estimator matters**\n\nOlder methods -- Ederer I (1961), Ederer II (1959), and Hakulinen (1982) -- estimated relative\nsurvival using different constructions of expected survival in the denominator. All three are\nbiased as estimators of net survival when the age distribution within the cohort is informative\n-- which it almost always is in cancer data, because older patients have both worse cancer\nprognosis and worse expected general-population survival. The bias is upward: older patients\naccumulate deaths quickly and exit the risk set early, so later survival estimates come from the\nyounger (better-prognosis) survivors; the expected-survival denominator does not account for this\nselective depletion, and net survival is over-estimated. Pohar Perme (2012) corrected this by\ninverse probability weighting: at each event time, each patient is up-weighted by the inverse of\ntheir expected general-population survival, amplifying the contribution of older patients whose\nexpected mortality is high and who would otherwise be under-represented in the later risk sets.\nIn practice the Pohar Perme estimate is typically lower than Ederer II for older, higher-mortality\ncancer cohorts -- which is the correct direction; the earlier estimators were optimistically\nbiased. Use Pohar Perme for all new analyses; revisit Ederer II only when replicating historical\nliterature.\n\n**Life-table choice: national, regional, SES-specific, and the US claims-data caveat**\n\nThe expected-survival denominator comes from a population life table stratified by age, sex,\nand calendar year. In most countries the national life table is the default comparator, and it\nis required for international comparisons. 
However:\n\n- Regional or socioeconomic-stratum-specific tables give more accurate comparators when the\n  cancer cohort is not representative of the national population. A cohort of rural patients\n  matched to an urban-dominant national life table will under-estimate expected mortality\n  (rural mortality is higher than urban-dominated national rates), making expected survival\n  too high and thereby deflating relative survival — the cancer appears more lethal than it\n  is for that rural population.\n- In the United States, commercially insured populations have substantially lower all-cause\n  mortality than the national population at the same age and sex -- because people who are\n  healthy enough to be employed and insured are selected from a lower-mortality subgroup.\n  Using the national life table in a commercial-claims analysis will over-state expected\n  mortality rates, which lowers the expected survival in the denominator and thereby inflates\n  relative survival — making the cancer appear less lethal than it actually is for that\n  insured population. When possible, use an insured-population life\n  table as the primary or at least as a sensitivity analysis. This is one of the most\n  consequential yet under-discussed limitations of applying relative-survival methods to\n  US administrative data.\n- Medicare populations are closer to the national elderly life table, but the dual-eligible\n  (Medicaid and Medicare) subgroup has higher background mortality than national tables predict.\n  Pre-specify the life table in the protocol, document its source, and always carry a\n  sensitivity analysis with an alternative table.\n\n**Age-standardization for comparisons across registries and time periods**\n\nComparing relative survival across cancer registries, time periods, or countries requires\nage-standardization, because different populations diagnose cancer patients at different age\ndistributions. 
The International Cancer Survival Standard (ICSS) provides three weight sets\nfor different cancer sites. The age-standardized relative survival is a weighted average of\nage-stratum-specific relative survival estimates using the ICSS weights (Corazziari et al.\n2004) -- one line of arithmetic applied after stratum-specific estimation.\n\n**Excess hazard regression**\n\nWhen covariates must be modeled, the relative-survival framework uses excess hazard regression.\nThe total observed hazard in the cancer cohort is decomposed as:\n\n  total_hazard = excess_hazard + expected_hazard\n\nwhere expected_hazard is read from the life table at each patient's age/sex/year, and\nexcess_hazard is the cancer-attributable component to be modeled. The excess hazard is modeled\nas a function of covariates (cancer stage, grade, treatment, socioeconomic status) using a\nPoisson-type generalized linear model fitted to split follow-up intervals, with log person-time\nas the offset and the expected number of deaths entering through a modified link function\n(Dickman et al. 2004). Exponentiated regression coefficients are excess hazard ratios: the\nmultiplicative change in cancer-attributable mortality associated with a one-unit change in the\ncovariate. Excess hazard regression is the relative-survival analogue of Cox regression for\ncause-specific survival.\n\n**Route to relative survival cure models**\n\nWhen the excess hazard returns to zero at finite follow-up -- the cohort's mortality rate\nconverges to that of the matched general population -- there is statistical evidence of cure.\nRelative survival cure models partition the cohort into a cured fraction (those who experience\nno excess hazard) and an uncured fraction still experiencing excess mortality.\nThe cured fraction is subject only to the general-population mortality in the life table. This is\none of the primary input pathways for mixture-cure models in cancer registry settings; see the\ncure-models-mixture-cure entry for the full cure-model framework. 
Relative survival provides\nthe necessary input when cause-of-death data are unavailable.\n\n**Pros, cons, and trade-offs**\n\n*Relative / net survival versus cause-specific survival (when cause-of-death data exist)*:\n- Pros: Requires no cause-of-death coding at all; immune to misclassification of cause;\n  directly comparable across registries with varying certification quality; the established\n  standard for population-based cancer surveillance; estimable from any population with a\n  matching life table.\n- Cons: Requires a matching population life table that accurately represents the background\n  mortality of the study cohort; depends on the independence assumption (discussed above);\n  cannot separate deaths caused by the cancer from deaths caused by cancer treatment toxicity\n  (both are \"excess\" mortality); relative survival can exceed 1.0 in selected populations,\n  which is numerically unintuitive; the Pohar Perme estimator has slightly higher variance\n  than the simpler Ederer II at small sample sizes.\n- When to prefer: Population-based registry studies, SEER, EUROCARE, national cancer-plan\n  evaluations, and any setting where cause-of-death coding is unreliable, missing, or\n  differentially misclassified across comparison groups.\n\n*National versus population-specific life tables*:\n- Pros of national tables: Universally available; standard for international comparisons;\n  simple to implement.\n- Cons: Can over- or under-state expected mortality in non-representative cohorts (insured,\n  rural, SES extreme). 
For US commercial-claims analyses, the national table overstates\n  expected mortality rates, which deflates the expected survival denominator and thereby inflates\n  the relative survival estimate, making the cancer appear less lethal than it is in that\n  insured population.\n- When to prefer: National tables for SEER and registry work; population-specific or\n  insured-cohort life tables for claims-based analyses when available, or at minimum as a\n  sensitivity analysis.\n\n**When to use**\n\nUse relative and net survival when: the primary analysis is population-based cancer survival\nusing registry data where cause-of-death is unreliable or unavailable; when the deliverable is\na net survival probability for time-trend comparison or cross-registry benchmarking where\nconsistency of cause-of-death coding cannot be assumed; when a population life table matched\non age, sex, and calendar year is available and the independence assumption is defensible;\nwhen excess hazard regression is the preferred modeling framework for covariate adjustment in\na registry-based cancer survival study; or when a cure-model analysis is planned and\ncause-of-death data are not available.\n\n**When NOT to use**\n\n- *Non-fatal outcomes*: Relative survival is a survival-only framework. For outcomes such as\n  disease progression, response rate, readmission, or healthcare utilization there is no\n  general-population expected comparator and the method does not apply; use standard\n  time-to-event or frequency methods instead.\n- *When reliable cause-of-death data are available and a cause-specific estimand is the\n  target*: If death certificates are adjudicated, if cancer-specific survival is the primary\n  endpoint defined in the study protocol, or if the scientific question explicitly concerns\n  cause-specific mortality (e.g., cardiovascular deaths in a cancer survivor cohort), prefer\n  competing-risks methods with a validated mortality source hierarchy. 
Net survival and\n  cause-specific survival answer subtly different questions and can diverge materially when\n  non-cancer mortality is high (elderly patients, indolent cancers, heavily comorbid cohorts).\n- *When the independence assumption is materially violated*: If the cancer shares strong risk\n  factors with non-cancer mortality (e.g., smoking-related lung cancer where the cohort would\n  have had higher-than-national cardiovascular mortality even without the cancer), the excess\n  hazard is overestimated and net survival is underestimated. Flag this as a structural\n  limitation and carry cause-specific survival as a sensitivity analysis.\n- *Do not apply to US commercial-claims data without a population-appropriate life table*:\n  Using the US national life table for a commercially insured cohort overstates expected\n  mortality, so it will produce downwardly biased expected survival and upwardly biased\n  relative survival, systematically understating the cancer's lethal burden in that selected\n  population.\n\n**Interpreting the output**\n\nFrom the worked example: a colon cancer cohort of 5 patients followed 5 years. Observed\nall-cause survival at 5 years = 0.40 (2 of 5 patients alive at 5 years). Mean expected 5-year\nsurvival from matched life tables (age/sex/calendar year) = 0.80. Net (relative) survival\n= 0.40 / 0.80 = 0.50.\n\n*(1) Formal interpretation.* The relative survival of 0.50 is an estimate of net survival\ncomputed here as a simple Ederer-style observed/mean-expected ratio; the worked example uses\nthis transparent arithmetic to illustrate the concept. The Pohar Perme estimator applies\ninverse-probability weighting at each event time (upweighting older patients whose background\nmortality is higher) and gives a different, unbiased estimate; it should be used in any real\nanalysis. 
Under the independence assumption that each patient's expected general-population\nsurvival is independent of their actual cancer prognosis conditional on age, sex, and calendar\nyear, the Pohar Perme estimator is consistent for net survival. Relative survival is a\nratio of two survival probabilities, not itself a bounded probability; it can exceed 1.0 in\npopulations with below-national-average background mortality. A value of 0.50 means the\ncancer cohort's observed all-cause survival is 50% of what the same-age, same-sex general\npopulation would have experienced -- with the remaining deficit attributed to the excess\nmortality from colon cancer and its treatment.\n\n*(2) Practical interpretation.* The relative survival of 0.50 is a ratio: the cohort survived\nat half the rate of the matched general population (0.40 vs. 0.80). The absolute deficit is\n0.80 − 0.40 = 0.40, that is, approximately 40 fewer patients alive per 100 newly diagnosed\nat 5 years compared with similarly aged people in the general population who did not receive\na cancer diagnosis. The 0.50 itself is not a 50-percentage-point deficit; confusing the ratio\nwith the absolute gap is a common communication error. Relative (net) survival\nis the quantity reported in international cancer-survival comparisons (CONCORD, EUROCARE) and\nin SEER publications; it is interpretable across time periods and registries regardless of\ndifferences in cause-of-death certification quality. For clinical communication, emphasize\nthat it represents the survival experience attributable to the cancer diagnosis itself, net of\nthe background mortality that those patients would have experienced regardless of the cancer.",
    "primary_category": "Inferential_Statistics",
    "tags": [
      "relative-survival",
      "net-survival",
      "cancer-registry",
      "excess-hazard",
      "population-life-tables",
      "pohar-perme",
      "age-standardization",
      "ICSS-weights",
      "oncology-rwe",
      "SEER",
      "EUROCARE"
    ],
    "applies_to_study_types": [
      "disease_registry",
      "claims_analysis",
      "linked_data_study",
      "comparative_effectiveness",
      "cohort_retrospective"
    ],
    "data_sources": [
      "registry",
      "claims",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1111/j.1541-0420.2011.01640.x",
        "url": "https://doi.org/10.1111/j.1541-0420.2011.01640.x",
        "citation_text": "Perme MP, Stare J, Esteve J. On estimation in relative survival. Biometrics. 2012;68(1):113-120.",
        "year": 2012,
        "authors_short": "Perme, Stare & Esteve",
        "notes": "Introduces the inverse-probability-weighted nonparametric estimator of net survival now known as the Pohar Perme estimator; proves that simpler Ederer variants are biased with realistic age structures; the current international standard recommended by IARC."
      },
      {
        "role": "explain",
        "doi": "10.1111/j.1365-2796.2006.01677.x",
        "url": "https://doi.org/10.1111/j.1365-2796.2006.01677.x",
        "citation_text": "Dickman PW, Adami HO. Interpreting trends in cancer patient survival. Journal of Internal Medicine. 2006;260(2):103-117.",
        "year": 2006,
        "authors_short": "Dickman & Adami",
        "notes": "Accessible explanation of how to interpret relative and net survival trends in population- based cancer data; covers the distinction between net survival, cause-specific survival, and observed survival, and explains why relative survival is preferred for registry-based comparisons."
      },
      {
        "role": "demonstrate",
        "doi": "10.1002/sim.1597",
        "url": "https://doi.org/10.1002/sim.1597",
        "citation_text": "Dickman PW, Sloggett A, Hills M, Hakulinen T. Regression models for relative survival. Statistics in Medicine. 2004;23(1):51-64.",
        "year": 2004,
        "authors_short": "Dickman et al.",
        "notes": "Derives and demonstrates Poisson regression excess hazard models for relative survival analysis -- the standard framework for modeling covariates (stage, grade, treatment) in registry-based cancer studies; provides worked examples in SAS and Stata alongside the statistical theory."
      },
      {
        "role": "use",
        "doi": "10.1016/j.ejca.2004.07.002",
        "url": "https://doi.org/10.1016/j.ejca.2004.07.002",
        "citation_text": "Corazziari I, Quinn M, Capocaccia R. Standard cancer patient population for age standardising survival ratios. European Journal of Cancer. 2004;40(15):2307-2316.",
        "year": 2004,
        "authors_short": "Corazziari, Quinn & Capocaccia",
        "notes": "Defines the International Cancer Survival Standard (ICSS) age-weight sets for standardizing relative survival across cancer registries and time periods; the ICSS weights (one of three schemes by cancer site) are the operational tool for making cross-registry relative survival estimates comparable."
      }
    ],
    "plain_language_summary": "Relative survival answers the question \"how much does a cancer diagnosis reduce a patient's chances of being alive at five years?\" without needing to know the exact cause of death -- a piece of information that is often wrong or missing on death certificates. The method compares how many cancer patients actually survived against how many would have been expected to survive based on their age and sex from general-population life tables; the ratio of those two numbers gives net survival, the survival attributable to the cancer alone. A net survival of 0.50 means cancer patients survived at half the rate of comparable people in the general population, with the gap blamed on the cancer and its treatment rather than background diseases like heart attacks or strokes.",
    "key_terms": [
      {
        "term": "net survival",
        "definition": "The probability of surviving a cancer in a hypothetical world where the cancer is the only possible cause of death; it strips away the background mortality that the patient would have experienced regardless of the cancer diagnosis."
      },
      {
        "term": "expected survival",
        "definition": "The survival probability that a patient would have had based only on their age, sex, and calendar year -- obtained from a general-population life table -- as if they had never been diagnosed with cancer."
      },
      {
        "term": "population life table",
        "definition": "A statistical table that records, for each age group, sex, and calendar year, the probability that a person from the general population survives to the next age; it is the benchmark for what \"normal\" survival looks like."
      },
      {
        "term": "Pohar Perme estimator",
        "definition": "The current international-standard formula for computing net survival; it corrects for the bias that arises when older patients (who have both worse cancer prognosis and worse general-population survival) exit the cohort early, by giving them higher statistical weight at each event time."
      },
      {
        "term": "excess hazard",
        "definition": "The cancer-attributable portion of a patient's death rate at any point in time; it equals the total observed death rate minus the death rate expected from the general population at that age and sex, and is what excess hazard regression models as a function of stage, grade, and treatment."
      },
      {
        "term": "age-standardization",
        "definition": "A mathematical adjustment that makes survival estimates from different cancer registries or time periods comparable even when their patients are diagnosed at different ages, by applying a standard set of age weights (such as the ICSS weights) to each age-stratum estimate."
      }
    ],
    "worked_example": {
      "scenario": "A state cancer registry tracks 5 patients newly diagnosed with colon cancer in 2018, all followed for exactly 5 years. For each patient, the registry also records their expected 5-year survival probability, looked up from the US national life table matched on their age at diagnosis, sex, and calendar year. We want to estimate the 5-year net survival for this cancer -- the survival attributable to the colon cancer itself, independent of background mortality. Two patients are still alive at 5 years; three died during follow-up. Because death certificates in this registry are not reliably coded for cancer vs non-cancer cause, we use relative survival rather than cause-specific survival.",
      "dataset": {
        "caption": "One row per patient: follow-up outcome at 5 years and life-table expected 5-year survival. Expected survival is the probability the general population (matched on age/sex/year) would survive 5 years, read from the US national life table.",
        "columns": [
          "person_id",
          "vital_status_5yr",
          "days_followed",
          "expected_5yr_survival"
        ],
        "rows": [
          [
            1001,
            "alive",
            1825,
            0.85
          ],
          [
            1002,
            "dead",
            730,
            0.78
          ],
          [
            1003,
            "alive",
            1825,
            0.82
          ],
          [
            1004,
            "dead",
            365,
            0.76
          ],
          [
            1005,
            "dead",
            1095,
            0.79
          ]
        ]
      },
      "steps": [
        "Step 1 -- Count patients alive at 5 years: patients 1001 and 1003 survived all 1825 days. Observed 5-year survival: 2 / 5 = 0.40.",
        "Step 2 -- Compute mean expected 5-year survival from life tables: sum of expected_5yr = 0.85 + 0.78 + 0.82 + 0.76 + 0.79 = 4.00; mean expected survival = 4.00 / 5 = 0.80. This is the survival the age/sex/year-matched general population would have experienced.",
        "Step 3 -- Compute relative (net) survival: 0.40 / 0.80 = 0.5. This is the net survival estimate: for every person in the general population who survived 5 years, only 0.50 of the cancer patients survived, with the deficit attributed to the colon cancer.",
        "Step 4 -- Interpretation: a net survival of 0.5 means the colon cancer cohort had 50% of the 5-year survival probability that their age/sex/year-matched general-population counterparts had. The 40% observed survival minus the 80% expected survival yields the 40-percentage-point gap attributable to the cancer and its treatment."
      ],
      "result": "Observed 5-year survival = 2 / 5 = 0.40. Mean expected 5-year survival from life tables = 4.00 / 5 = 0.80. Relative (net) 5-year survival = 0.40 / 0.80 = 0.50. Net survival of 0.50 means this cancer cohort survived at 50% the rate of the age/sex-matched general population; the 50% deficit is attributed to the colon cancer and its treatment, net of background mortality.",
      "timeline_spec": {
        "title": "Relative survival: 5 colon cancer patients over 5-year (1825-day) follow-up",
        "window": {
          "start_day": 0,
          "end_day": 1825,
          "label": "5-year observation window (1825 days)"
        },
        "events": [
          {
            "label": "Pt 1001: alive at 5 years",
            "start_day": 0,
            "end_day": 1825,
            "marker": "Alive at day 1825 (5 years)"
          },
          {
            "label": "Pt 1002: died at 2 years",
            "start_day": 0,
            "end_day": 730,
            "marker": "Death at day 730"
          },
          {
            "label": "Pt 1003: alive at 5 years",
            "start_day": 0,
            "end_day": 1825,
            "marker": "Alive at day 1825 (5 years)"
          },
          {
            "label": "Pt 1004: died at 1 year",
            "start_day": 0,
            "end_day": 365,
            "marker": "Death at day 365"
          },
          {
            "label": "Pt 1005: died at 3 years",
            "start_day": 0,
            "end_day": 1095,
            "marker": "Death at day 1095"
          }
        ],
        "spans": [
          {
            "kind": "followup",
            "start_day": 0,
            "end_day": 1825,
            "label": "Pt 1001 followed full 5 years (alive; expected_5yr = 0.85)"
          },
          {
            "kind": "followup",
            "start_day": 0,
            "end_day": 730,
            "label": "Pt 1002 followed 2 years (died; expected_5yr = 0.78)"
          },
          {
            "kind": "followup",
            "start_day": 0,
            "end_day": 1825,
            "label": "Pt 1003 followed full 5 years (alive; expected_5yr = 0.82)"
          },
          {
            "kind": "followup",
            "start_day": 0,
            "end_day": 365,
            "label": "Pt 1004 followed 1 year (died; expected_5yr = 0.76)"
          },
          {
            "kind": "followup",
            "start_day": 0,
            "end_day": 1095,
            "label": "Pt 1005 followed 3 years (died; expected_5yr = 0.79)"
          }
        ],
        "result": {
          "label": "Observed survival = 2/5 = 0.40; expected from life tables = 4.00/5 = 0.80; net survival = 0.40/0.80 = 0.50",
          "observed_survival": 0.4,
          "expected_survival": 0.8,
          "net_survival": 0.5
        },
        "caption": "Each horizontal bar shows one patient's follow-up from Day 0. Patients 1001 and 1003 (green) survived the full 1825 days; patients 1002, 1004, and 1005 (red) died before 5 years. The observed 5-year survival of 0.40 (2 of 5 alive) is divided by the mean expected survival of 0.80 from age/sex/year-matched life tables to yield net survival = 0.50. Without cause-of-death coding, this relative-survival calculation isolates the cancer's contribution to excess mortality.",
        "alt_text": "Five horizontal patient-timeline bars from Day 0 to their end. Patients 1001 and 1003 extend the full 1825 days (5 years) with alive markers; patients 1002, 1004, and 1005 end at days 730, 365, and 1095 respectively with death markers. An annotation shows observed 5-year survival = 0.40, expected from life tables = 0.80, and net survival = 0.50."
      }
    },
    "prerequisites": [
      "kaplan-meier-estimator",
      "competing-risks-cause-specific-fine-gray-rwe",
      "mortality-source-hierarchy-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Pohar Perme nonparametric net survival estimator",
        "description": "The current international standard for estimating the net survival function nonparametrically from registry data. Each patient is weighted at each event time by the inverse of their expected general-population survival, removing the bias that arises from older patients exiting the risk set early. Implemented in R via relsurv::rs.surv(method=\"pohar-perme\") and popEpi::survtab. This is the primary estimand reported in SEER, CONCORD, and EUROCARE.",
        "edge_cases": [
          "In very small cohorts (n < 50), the Pohar Perme estimator has higher variance than Ederer II; report confidence intervals (relsurv provides log-transformed CIs) and note the uncertainty.",
          "When the independence assumption is violated (e.g., smoking-related cancers where background cardiovascular mortality in the cohort exceeds that of the general population), net survival is underestimated. Flag as a limitation and carry cause-specific survival as a sensitivity."
        ],
        "data_source_notes": "Registry: requires a matched population life table (age, sex, calendar year) in the same format as the relsurv ratetable objects. US: the slopop life table in relsurv is the US 1940-2004 national life table; for more recent periods, download updated life tables from the National Vital Statistics Reports and build a custom ratetable. For claims-based US studies, the national life table may be inappropriate for insured populations."
      },
      {
        "name": "Excess hazard regression (Poisson relative survival model)",
        "description": "Covariate adjustment in the relative-survival framework: the observed hazard at each event time is decomposed into an expected component (from the life table) and an excess component (the cancer-attributable part), and the excess hazard is modeled as a log-linear function of covariates using Poisson regression with offset for the expected hazard. Produces excess hazard ratios (EHR) for each covariate -- the multiplicative change in cancer-attributable mortality per unit change in the covariate. Implemented in R via relsurv::excess() and popEpi::relpoisreg; in SAS via custom DATA step and PROC GENMOD with offset.",
        "edge_cases": [
          "The excess hazard can become negative in short follow-up windows if a \"healthy-entry\" effect makes observed mortality temporarily lower than expected (newly diagnosed patients who have passed recent clinical evaluation may be temporarily healthier than the matched general population). Constrain excess hazard to non-negative values or use a flexible parametric model.",
          "EHRs are not directly comparable to Cox cause-specific HRs because they model different quantities on different risk sets; report which estimand is the primary and note that EHRs include treatment-attributable mortality in the excess component."
        ],
        "data_source_notes": "Registry: requires both the cancer cohort data and the matched population life table in structured format. The Dickman et al. (2004) SAS and Stata macros are the reference implementation; R relsurv and popEpi provide direct implementations. Claims: adapt by constructing person-time splits and matching each interval to the life-table expected hazard."
      },
      {
        "name": "Age-standardized relative survival (ICSS weights)",
        "description": "Cross-registry and time-trend comparisons require age-standardization because the age at diagnosis distribution varies across populations. Compute relative survival (Pohar Perme) within each of three or five age strata, then take a weighted average using the ICSS weights for the relevant cancer site (Corazziari et al. 2004). The age-standardized net survival is the single comparable quantity in CONCORD and EUROCARE reports.",
        "edge_cases": [
          "Some age strata may have very small counts (particularly the youngest or oldest groups for rare cancers); small-strata instability propagates into the weighted average. Apply a minimum stratum-count rule (e.g., n >= 10) and flag sparse strata.",
          "The ICSS weights are designed for comparisons, not absolute estimation; report both unstandardized and age-standardized estimates so readers can evaluate the age-distribution effect."
        ],
        "data_source_notes": "Apply after computing stratum-specific Pohar Perme estimates. ICSS weights are published in Corazziari et al. (2004) Table 2 and are incorporated into standard cancer survival reporting software (SURV3, R-based packages)."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Cause-specific survival with competing-risks framework",
        "pros_of_this": "Does not require cause-of-death coding; eliminates misclassification; internationally comparable across registries with varying certification quality; applicable whenever a population life table exists; the established standard for cancer surveillance.",
        "cons_of_this": "Cannot isolate cancer deaths from treatment-toxicity deaths (both are excess mortality); depends on the independence assumption; requires a well-matched life table; relative survival can exceed 1.0 in selected low-mortality populations, which is uninterpretable as a conventional probability.",
        "when_to_prefer": "Population-based registry studies; any setting where cause-of-death coding is unreliable, missing, or heterogeneous across the comparison groups; cross-registry benchmarking and international comparisons."
      },
      {
        "compared_to": "Kaplan-Meier all-cause survival (observed survival, no life-table adjustment)",
        "pros_of_this": "Separates the cancer's contribution to mortality from the background mortality the cohort would have experienced regardless; enables time-trend comparisons by removing the confounding effect of secular improvements in background life expectancy.",
        "cons_of_this": "More complex to compute (requires life-table linkage); interpretation requires understanding the net-survival estimand, which is hypothetical by construction.",
        "when_to_prefer": "Whenever the scientific or policy question is about cancer-attributable survival rather than raw all-cause survival, and whenever comparing across time periods or registries with different background life expectancies."
      },
      {
        "compared_to": "Ederer II or Hakulinen estimators (older relative survival methods)",
        "pros_of_this": "Pohar Perme is unbiased for net survival; removes the systematic optimistic bias of older estimators in aged and high-mortality cohorts; recommended by IARC; implemented in current R packages.",
        "cons_of_this": "Slightly higher variance than Ederer II in small samples; cannot directly reproduce older published literature that used Ederer II without re-estimating.",
        "when_to_prefer": "All new analyses should use Pohar Perme; use Ederer II only when explicitly reproducing or extending older results and when small-sample variance inflation would be material."
      }
    ],
    "implementation_notes_by_data_source": {
      "registry": "Population life tables matched on age, sex, and calendar year are the required input alongside the cancer cohort data. In R, relsurv provides the ratetable format; US data: the slopop ratetable covers 1940-2004 and must be updated for recent cohorts from the National Vital Statistics Reports. Ensure the follow-up end date is defined consistently (date of last contact or vital status update, not date of last registry interaction). Match each patient to the life table using age at diagnosis, sex, and calendar year of the event or death, not just at baseline, to correctly handle period life-table structures.",
      "claims": "Claims-based relative survival is uncommon because the US national life table is not representative of insured populations (insured cohorts have lower background mortality). If applied, obtain an insured-population life table or benchmark the national-table results as a sensitivity analysis. Cause-of-death is not reliably available from claims, which is precisely the setting where relative survival is conceptually appropriate; however, the life-table mismatch is a material limitation that must be documented. See mortality-source- hierarchy-rwe for guidance on building the observed-survival component from claims.",
      "linked": "Linked registry-claims data are the strongest substrate: the registry provides cancer diagnoses, vital status, and tumor characteristics; claims provide treatment details and comorbidities for covariate adjustment in excess hazard regression. NDI linkage provides a reliable death date and, via NDI Plus, adjudicated cause of death for a sensitivity comparison. Pre-specify the follow-up end date and the life-table source in the protocol."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\nfrom lifelines import KaplanMeierFitter\n\n# ── Worked-example cohort (5 colon cancer patients) ──\ndata = pd.DataFrame({\n    \"person_id\":    [1001,  1002,  1003,  1004,  1005],\n    \"fu_days\":      [1825,   730,  1825,   365,  1095],\n    \"event\":        [   0,     1,     0,     1,     1],   # 0=alive, 1=dead (all-cause)\n    \"expected_5yr\": [0.85,  0.78,  0.82,  0.76,  0.79],  # from national life table\n})\n\nHORIZON_DAYS = 1825  # 5 years\n\n# ── 1. Observed all-cause survival at 5 years (Kaplan-Meier) ──\nkmf = KaplanMeierFitter()\nkmf.fit(data[\"fu_days\"], event_observed=data[\"event\"])\nobserved_surv = float(kmf.predict(HORIZON_DAYS))\nprint(f\"Observed 5-year survival (KM):      {observed_surv:.4f}\")\n\n# ── 2. Mean expected 5-year survival from the life table ──\nexpected_surv = data[\"expected_5yr\"].mean()\nprint(f\"Mean expected 5-year survival (LT): {expected_surv:.4f}\")\n\n# ── 3. Relative (net) survival = observed / expected ──\n# NOTE: This simple ratio is the intuitive definition. The Pohar Perme estimator applies\n# inverse-probability weights at each event time to remove the age-structure bias; the\n# ratio below agrees with the worked example (0.40 / 0.80 = 0.50) but does not implement\n# the full weighting. Use relsurv in R for production Pohar Perme estimation.\nrelative_surv = observed_surv / expected_surv\nprint(f\"Relative (net) 5-year survival:     {relative_surv:.4f}\")\nprint()\nprint(\"Verify: 2 alive / 5 total = 0.40 observed; \"\n      \"sum expected = 4.00, mean = 0.80; ratio = 0.40 / 0.80 = 0.50\")\n\n# ── 4. 
Excess hazard (approximate, for illustration) ──\n# For a single time horizon, excess cumulative hazard = -ln(rel_surv).\nimport math\nexcess_cum_haz = -math.log(relative_surv) if relative_surv > 0 else float(\"inf\")\nprint(f\"Approximate 5-year cumulative excess hazard: {excess_cum_haz:.4f}\")\nprint(\"For full excess hazard regression, use relsurv::rsadd() or popEpi::relpois() in R.\")",
        "description": "Conceptual relative survival calculation in Python using lifelines for the Kaplan-Meier\nobserved survival and a manual life-table division to produce relative survival.\n\nIMPORTANT: lifelines does not implement the Pohar Perme net survival estimator. For a\nproduction analysis use R (relsurv or popEpi) or SAS. The code below demonstrates the\nconceptual structure -- estimating observed survival with KMFitter and dividing by the\nexpected survival read from a life table -- using the worked example (5 patients, exact\nnumbers as shown). It is an honest pedagogical sketch of the ratio, not a substitute for\nthe weighted Pohar Perme algorithm.\n\nRequired input: a one-row-per-patient DataFrame with columns:\n  fu_days         : follow-up time in days from diagnosis\n  event           : 1 = died during follow-up, 0 = alive at last follow-up (all-cause)\n  expected_5yr    : 5-year expected survival from life table (age/sex/calendar year)\nOutput: 5-year observed survival (KM), mean expected survival, and relative survival ratio.",
        "dependencies": [
          "pandas",
          "lifelines"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(relsurv)\nlibrary(survival)\n\n# ── Minimal worked-example dataset (5 patients, matches the Python/beginner example) ──\nd <- data.frame(\n  fu_time   = c(1825, 730, 1825, 365, 1095),  # days of follow-up\n  event     = c(   0,   1,    0,   1,    1),   # all-cause death (0=alive, 1=dead)\n  age_dx    = c(  62,  71,   58,  75,  68),    # age at diagnosis (illustrative)\n  sex       = c(   2,   1,    2,   1,   1),    # 1=male, 2=female (ratetable convention)\n  diag_year = c(2000, 2000, 2000, 2000, 2000)  # calendar year of diagnosis\n# NOTE: slopop covers US 1940-2004; use diag_year within that range.\n# For 2018 or later diagnoses, use a current life table (e.g., from the Human\n# Mortality Database or survexp.us from the survival package, extended to recent years).\n)\n# Build the ratetable entry date (first day of year of diagnosis)\nd$entry_date <- as.Date(paste0(d$diag_year, \"-01-01\"))\n\n## 1. Pohar Perme nonparametric net survival curve.\n# ratetable() links each patient to the US national life table (slopop).\n# fu_time must be in days; ratetable uses age-in-years, sex, and a calendar Date.\nrs <- rs.surv(\n  Surv(fu_time, event) ~ 1,\n  rmap   = list(age  = age_dx * 365.25,   # relsurv expects age in days\n                sex  = sex,\n                year = entry_date),\n  data   = d,\n  ratetable = slopop,                      # US national life table (1940-2004 coverage)\n  method = \"pohar-perme\"                   # IARC-recommended; unbiased for net survival\n)\nprint(summary(rs, times = 1825))            # 5-year net survival estimate + 95% CI\n\n## 2. Net survival by a covariate (e.g., stage group) using rs.surv with a strata term.\n# Extend d with a stage variable for illustration:\n# rs_stage <- rs.surv(Surv(fu_time, event) ~ stage_group, rmap = list(...), ...)\n# Age-standardize across ICSS weight strata after computing stratum-specific estimates.\n\n## 3. 
Excess hazard regression: Poisson model for covariate adjustment.\n# Creates split person-time records (one row per interval) with the expected hazard as offset.\n# The excess() function returns an excess hazard ratio for each covariate.\n# Illustrative with stage_group (factor); replace with real covariates.\nd$stage_group <- factor(c(\"III\", \"II\", \"III\", \"IV\", \"II\"))  # illustrative only\nex_model <- excess(\n  Surv(fu_time, event) ~ stage_group,\n  rmap      = list(age = age_dx * 365.25, sex = sex, year = entry_date),\n  data      = d,\n  ratetable = slopop,\n  method    = \"glm\"   # Poisson excess hazard (Dickman et al. 2004)\n)\nsummary(ex_model)   # excess hazard ratios (EHR) per covariate level",
        "description": "Pohar Perme net survival estimator and excess hazard regression using relsurv, which is the\nstandard R package for relative survival analysis and implements the IARC-recommended method.\nThe popEpi package provides an alternative with more flexible modeling tools.\n\nRequired inputs:\n  d$fu_time   : numeric, follow-up time in days from diagnosis\n  d$event     : 0/1, all-cause death indicator (0 = alive at last contact)\n  d$age_dx    : age at diagnosis in years (used to match the life table)\n  d$sex       : 1 = male, 2 = female (relsurv ratetable convention)\n  d$diag_year : calendar year of diagnosis (for time-varying life-table match)\nThe ratetable object maps each patient's age/sex/year to an expected death rate at each\nfollow-up interval. The slopop ratetable bundled with relsurv covers US data 1940-2004;\nfor more recent cohorts, build a custom ratetable from National Vital Statistics Reports.\nrs.surv() returns the Pohar Perme net survival curve with pointwise confidence intervals.\nThe excess() function fits a Poisson excess hazard model for covariate adjustment.",
        "dependencies": [
          "relsurv",
          "survival"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* ── Step 1: Build analytic dataset with expected survival from life table ── */\n/* For each patient, look up the annual expected survival probability from the life table\n   matched on age group, sex, and calendar year. Multiply across years to get 5-year\n   expected survival (product of annual survival probabilities). This is the simplified\n   single-stratum lookup; the Pohar Perme algorithm requires time-varying lookups at each\n   event time, which a full macro (e.g., %RELSURV) handles automatically. */\n\nproc sql;\n  create table work.analytic as\n  select c.person_id, c.fu_days, c.event,\n         /* Five-year expected survival: product of 5 annual survival probs from life table */\n         l.surv_prob_5yr as expected_5yr   /* pre-computed in life_table for simplicity */\n  from work.cancer_cohort c\n    left join work.life_table l\n      on  c.age_group = l.age_group\n      and c.sex       = l.sex\n      and c.diag_year = l.year;\nquit;\n\n/* ── Step 2: Compute observed and relative survival at 5-year horizon ── */\n/* This computes a CRUDE PROPORTION (n alive and uncensored at 5 years / n total),          */\n/* NOT Kaplan-Meier survival. Patients censored before 5 years are in the denominator but   */\n/* not the numerator, incorrectly treating them as \"not alive at 5 years.\" This approach    */\n/* is suitable only as a descriptive audit step when no censoring occurs before 5 years.    */\n/* For analyses with censoring before 5 years, use PROC LIFETEST to obtain the KM estimate. */\n/* For the Pohar Perme weighted estimator, use R relsurv or the %RELSURV macro.             
*/\nproc sql;\n  create table work.surv_summary as\n  select\n    /* known alive at 5 years: followed past day 1825, or censored alive exactly on it; */\n    /* a death AFTER day 1825 still counts as alive at the 5-year mark                  */\n    sum(case when fu_days > 1825 or (fu_days = 1825 and event = 0) then 1 else 0 end)\n      as n_alive_5yr,\n    count(*) as n_total,\n    calculated n_alive_5yr / calculated n_total as obs_surv_5yr,  /* crude proportion */\n    mean(expected_5yr)                          as exp_surv_5yr,\n    (calculated obs_surv_5yr) / (calculated exp_surv_5yr)\n      as rel_surv_5yr\n  from work.analytic;\nquit;\n\nproc print data=work.surv_summary noobs;\n  format obs_surv_5yr exp_surv_5yr rel_surv_5yr 6.4;\n  title 'Crude relative survival ratio at 5 years: observed/expected';\nrun;\n\n/* ── Step 3: Excess hazard regression (Poisson model, Dickman et al. 2004 structure) ──\n   After splitting person-time into annual intervals and computing the expected number\n   of deaths per interval from the life table, fit PROC GENMOD with a Poisson distribution\n   and a user-defined link, log(mu - d_star), with log(person-years) as the offset.\n   A plain log link with log(expected deaths) as the offset would instead fit a\n   multiplicative (SMR-type) model whose coefficients are NOT log excess hazard ratios.\n   The STAGE parameter estimate is the log excess hazard ratio (EHR) for the covariate. */\n\n/* Assumes work.split = person-time split table from %RELSURV or custom split step:\n   person_id  interval  events  d_star  ln_y  <covariates>\n   d_star = expected_haz * interval_days / 365.25   (expected deaths in the row)\n   ln_y   = log(interval_days / 365.25)             (log person-years in the row) */\nproc genmod data=work.split;\n  fwdlink link  = log(_MEAN_ - d_star);    /* additive excess hazard link */\n  invlink ilink = exp(_XBETA_) + d_star;   /* inverse of the link above   */\n  class stage (ref='I') sex (ref='F');\n  model events = stage sex age_dx / dist=poisson offset=ln_y;\n  estimate 'EHR stage III vs I' stage 0 1 0 0 / exp;   /* excess hazard ratio */\nrun;",
        "description": "Relative survival framework in SAS using a DATA step to implement the simple ratio approach\n(for illustration and audit) and PROC GENMOD for excess hazard regression after constructing\nperson-time splits with the expected hazard from a life table.\n\nNote: SAS does not have a built-in PROC for the Pohar Perme estimator. Community macros\n(%RELSURV) based on the Dickman et al. (2004) Poisson approach are available from the\nauthors' website and provide excess hazard regression. For the nonparametric Pohar Perme\nnet survival curve, the authoritative implementation is in R (relsurv). The SAS code below\ndemonstrates (a) computing the observed/expected ratio from a prepared analytic table (the\naudit step that should accompany any relative survival analysis) and (b) the structure of a\nPoisson excess hazard regression after person-time splitting.\n\nRequired input datasets (work.cancer_cohort, work.life_table):\n  cancer_cohort : person_id, fu_days, event (0/1), age_dx, sex, diag_year\n  life_table    : age_group, sex, year, hazard_rate (expected mortality rate per person-year)",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  REG[Cancer registry: all-cause death date<br/>No reliable cause-of-death coding] -->|Observed survival| KM[Kaplan-Meier<br/>all-cause survival curve]\n  LT[Population life table<br/>matched on age / sex / calendar year] -->|Expected survival| EXP[Expected survival<br/>for matched general population]\n  KM --> RATIO[Relative survival<br/>= observed / expected]\n  EXP --> RATIO\n  RATIO -->|Equals| NET[Net survival estimate<br/>Cancer-attributable survival probability<br/>Pohar Perme weighting]\n  NET --> Q{Analysis goal?}\n  Q -->|Descriptive: survival curve| PP[Report net survival with 95% CI<br/>relsurv rs.surv Pohar Perme]\n  Q -->|Covariate adjustment| EH[Excess hazard regression<br/>Poisson model, Dickman et al. 2004]\n  Q -->|Long-term: cure fraction| CM[Relative survival cure model<br/>see cure-models-mixture-cure]\n  style REG fill:#fef3c7,stroke:#d97706\n  style NET fill:#d1fae5,stroke:#059669",
        "caption": "Data flow from cancer registry and population life table through observed and expected survival to the relative / net survival estimate, then to the three downstream analysis goals -- descriptive net survival curve, excess hazard regression with covariates, and relative survival cure modeling.",
        "alt_text": "Flowchart showing cancer registry data producing a Kaplan-Meier observed survival curve and a population life table producing an expected survival curve; the two feed into a relative survival ratio computed as observed divided by expected; the resulting net survival branches to three downstream uses: descriptive Pohar Perme curve, excess hazard regression, and relative survival cure modeling.",
        "source_type": "illustrative",
        "source_citations": [
          "perme-2012"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Q1{Is cause-of-death<br/>reliably coded?} -- yes --> CMP[Competing-risks framework<br/>cause-specific / Fine-Gray<br/>with mortality source hierarchy]\n  Q1 -- no or unreliable --> Q2{Is the estimand<br/>cancer-attributable survival?}\n  Q2 -- yes --> Q3{Is a matched<br/>population life table<br/>available?}\n  Q2 -- no --> OTHER[Non-fatal endpoint<br/>or all-cause analysis<br/>standard survival methods]\n  Q3 -- yes --> PP2[Pohar Perme net survival<br/>relsurv rs.surv]\n  Q3 -- no --> WARN[Cannot estimate relative survival<br/>use cause-specific or all-cause]\n  PP2 --> COVA{Covariate adjustment<br/>needed?}\n  COVA -- yes --> EXH[Excess hazard regression<br/>Dickman Poisson model]\n  COVA -- no --> AGE[Age-standardize with ICSS weights<br/>for cross-registry comparison]",
        "caption": "Decision logic for choosing relative survival versus cause-specific survival. Route to the Pohar Perme estimator when cause-of-death is unreliable and a matched life table exists; route to competing-risks methods when cause-of-death is well-coded.",
        "alt_text": "Decision flowchart: if cause-of-death is reliably coded, go to competing-risks methods; otherwise, if the estimand is cancer-attributable survival and a life table is available, use Pohar Perme net survival and optionally excess hazard regression or age-standardization; if no life table is available, use cause-specific or all-cause survival.",
        "source_type": "illustrative",
        "source_citations": [
          "dickman-2006"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "requires",
        "target_slug": "kaplan-meier-estimator",
        "notes": "The observed all-cause survival curve (numerator of the relative survival ratio) is estimated by the Kaplan-Meier method; relative survival is computed by dividing the KM estimate by the corresponding expected survival from the life table."
      },
      {
        "relation_type": "requires",
        "target_slug": "mortality-source-hierarchy-rwe",
        "notes": "A reliable death date and all-cause death flag are required inputs for the observed survival component; the quality of the mortality source hierarchy directly affects the observed-survival numerator and therefore the net survival estimate."
      },
      {
        "relation_type": "see_also",
        "target_slug": "competing-risks-cause-specific-fine-gray-rwe",
        "notes": "The rival framework when cause-of-death data are reliable: competing-risks methods estimate cause-specific survival or cumulative incidence using the known cause of each death, whereas relative survival bypasses cause-of-death coding entirely. The two approaches should give similar answers for high-lethality cancers but diverge when non-cancer mortality is substantial."
      },
      {
        "relation_type": "see_also",
        "target_slug": "cure-models-mixture-cure",
        "notes": "When the net survival curve converges to the general-population life table at finite follow-up, relative survival cure models partition the cohort into cured and uncured fractions; see the mixture cure model entry for the full cure-model framework that uses net survival as its input when cause-of-death is unavailable."
      },
      {
        "relation_type": "used_with",
        "target_slug": "disease-registry",
        "notes": "Population-based cancer registries (SEER, state cancer registries, EUROCARE member registries) are the primary data source for relative survival analysis; they provide the cancer diagnosis dates, vital status, and tumor characteristics required to compute the observed-survival component and the population denominators used to match the life table."
      },
      {
        "relation_type": "see_also",
        "target_slug": "cox-ph-regression",
        "notes": "Excess hazard regression extends the proportional-hazards modeling framework to the relative-survival setting; the Cox model estimates the total hazard for cause-specific analyses, while the Poisson excess hazard model estimates only the cancer-attributable component after subtracting life-table expected hazard."
      }
    ],
    "aliases": [
      "net survival",
      "relative survival analysis",
      "excess hazard regression",
      "Pohar Perme estimator",
      "cancer net survival",
      "relative survival ratio"
    ],
    "complexity": "advanced",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "restart-rechallenge-new-episode-rwe",
    "name": "Restart, Rechallenge, and New-Episode Rules",
    "short_definition": "A protocol-specified set of exposure-episode rules that classify a later dispensing of the same drug as continuation of an ongoing episode, a restart after a permissible gap, a rechallenge after a safety-driven dechallenge, or a distinct new treatment episode, because each classification changes who is counted, when follow-up starts, and what the estimand means.",
    "long_description": "**Restart, rechallenge, and new-episode rules** are the deterministic logic that turns a stream of pharmacy dispensings\n(or orders/administrations) into discrete *treatment episodes* and decides how to handle a re-appearance of the same drug\nafter exposure has lapsed. The rules are not a single threshold; they are a small decision tree applied per `person_id`\nper drug, evaluated in date order over `fill_date` and `days_supply`, against (a) the gap since the prior episode's\ncovered end, (b) whether an intervening adverse event / intolerance triggered the stop, and (c) whether the indication or\nclinical context has changed. Because every downstream quantity — incidence-rate denominators, persistence/discontinuation,\ntime-zero for a new-user contrast, drug-utilization counts, and per-member cost denominators — is built on top of these\nepisodes, the rules must be written into protocol/SAP language *before* programming and stress-tested with sensitivity\nanalyses on the thresholds that drive them.\n\n**Core conceptual distinction.** Four mutually exclusive states must be separable, and conflating any two is a reviewable\nerror:\n- **Continuation** — the next fill arrives within the allowable gap (typically the prior `days_supply` end plus a\n  pre-specified grace period). It extends the *same* episode; no new time-zero, no new \"initiation.\"\n- **Restart** — exposure lapsed beyond the grace period but re-initiation reflects ordinary stop/start behavior (cost,\n  forgetting, a drug holiday), with **no** intervening adverse event. Whether a restart opens a new analytic episode\n  depends on the question: for persistence it is a new episode; for cumulative-dose effects it may be the same chronic\n  exposure resumed.\n- **Rechallenge** — re-initiation *after a dechallenge that was itself driven by an adverse event or intolerance*. 
This is\n  a safety-specific construct: the dechallenge-positive / rechallenge-positive sequence is one of the strongest\n  individual-level signals of drug causality (it is an explicit item in the Naranjo and WHO-UMC causality algorithms).\n  A rechallenge is never \"just another restart\" — the prior stop carries information about the outcome.\n- **New episode** — the re-appearance belongs to a clinically distinct course: a different indication, a gap so long the\n  prior course is irrelevant (a \"new-episode\" threshold, e.g., > 365 days), or a re-qualifying diagnostic event. A new\n  episode legitimately resets time-zero and washout for a new-user-style analysis.\n\nThe estimand must name which states open a new at-risk episode and which do not. A *cause-specific* hazard for first\ndiscontinuation treats restart as a new spell; an *as-treated cumulative-exposure* model stitches restarts into continuous\nperson-time with on/off indicators; a *rechallenge safety* analysis conditions follow-up on the dechallenge-rechallenge\nsequence itself (and is the design behind prescription-sequence-symmetry and case-only rechallenge studies).\n\n**Pros, cons, and trade-offs.**\n- **vs a single fixed-gap \"any refill = same episode\" rule:** Explicit four-state rules are transparent, reproducible, and\n  defensible to FDA/EMA/HTA reviewers, and they prevent the silent merging of a safety rechallenge into a benign restart.\n  Cost: more code, more diagnostics, and more thresholds to justify and vary in sensitivity analysis. 
**Prefer the\n  explicit rules** for any consequential comparative-safety, effectiveness, utilization, or cost-denominator analysis.\n- **vs prescription-sequence-symmetry (PSSA) / case-only rechallenge designs:** PSSA exploits the temporal symmetry of an\n  event around the (re)start to self-control for time-invariant confounding and is excellent for hypothesis screening.\n  Cost: it answers a narrower signal-detection question and assumes the event does not itself affect the probability of\n  the second dispensing. **Prefer cohort episode rules** when you need rate or cumulative-incidence contrasts; **prefer\n  PSSA** for rapid, confounding-robust signal screening of a suspected rechallenge effect.\n- **vs collapsing everything into prevalent-user person-time:** Counting all dispensings as one undifferentiated exposure\n  is simplest but destroys time-zero, re-introduces immortal time (the survivor who lives long enough to refill), and\n  makes \"restart vs rechallenge\" invisible. **Never prefer** this for causal contrasts.\n\n**When to use.** Any claims/EHR/registry analysis where the same drug can be stopped and restarted: chronic-disease\npersistence and discontinuation; comparative safety where rechallenge after an event is informative; drug-utilization and\ntreatment-pattern (lines-of-therapy) work; and HTA budget-impact/cost denominators that depend on how many *distinct*\ntreated episodes exist. Use whenever the protocol must defend why a later fill was or was not counted as a new initiation.\n\n**When NOT to use — and when it is actively misleading or dangerous.**\n- **When the data cannot observe the gap.** If person-time inside a gap is unobservable (e.g., Medicare Advantage\n  enrollees whose fee-for-service pharmacy claims are not in the source), a \"gap\" may be missingness, not a true stop —\n  classifying it as a restart or new episode manufactures spurious initiations. 
Restrict to continuously observable\n  (e.g., Part D / commercial pharmacy-benefit) person-time before applying the rules.\n- **When you let a rechallenge masquerade as a restart in a safety study.** If the dechallenge was AE-driven and you\n  fold the re-initiation into ongoing person-time, you both dilute the rechallenge signal and bias the comparator —\n  survivors who tolerated and resumed look healthier (depletion of susceptibles). This is the classic place the method\n  becomes dangerous.\n- **When new-episode resets create immortal time.** Resetting time-zero at a \"new episode\" defined using *future*\n  information (e.g., requiring the patient to survive to a later qualifying fill) re-creates immortal time bias (Suissa).\n  The qualifying logic must use only information available at the candidate time-zero.\n- **When stockpiling/oversupply fakes continuation.** Mail-order 90-day fills, sample fills, and refill-ahead behavior\n  inflate `days_supply` coverage and erase real gaps; without stockpiling/carry-over rules the algorithm will call a true\n  discontinuation a continuation.\n\n**Data-source operational depth.**\n- **Administrative claims (FFS vs MA vs commercial):** The episode is built from `fill_date` + `days_supply` + NDC.\n  Failure modes: (1) *MA-only person-time lacks FFS pharmacy claims* — a clean-looking 200-day gap may simply be a period\n  you cannot see; require continuous Part A/B/D (or commercial medical + pharmacy) enrollment across each gap you intend\n  to classify, and exclude MA-only spans. (2) *Claims reversals and same-day duplicates* inflate fill counts — collapse\n  to net paid dispensings before stitching. (3) *Adjudication/run-out lag* truncates the last observable fill — censor at\n  end-of-data, not at the last fill. 
(4) *Bundled/inpatient drugs* are invisible in Part D — an apparent gap during a\n  hospitalization is often bridged inpatient exposure (see inpatient-bridging-exposure-rwe).\n- **EHR:** Exposure is the *order* or *administration*, not the dispensing; restarts/rechallenges are frequently captured\n  in unstructured notes (\"rechallenged after rash resolved\") that structured order tables miss, and external-care leakage\n  means a restart prescribed elsewhere is invisible. Differential, encounter-driven capture makes gaps look longer for\n  patients who simply left the system; treat loss to follow-up as potentially informative and prefer linked dispensing.\n- **Registry:** Often strongest for the *reason* a drug was stopped (adjudicated AE → enables true rechallenge\n  classification) but weak for complete refill history; link to claims for fills and to a death index so a \"permanent\n  discontinuation\" is not actually an unrecorded death.\n- **Linked claims–EHR–registry:** The ideal substrate — registry/EHR supply the AE that defines a dechallenge while\n  claims supply the complete fill history — but order/fill/administration date discrepancies and linkage selection must be\n  reconciled before assigning episode boundaries.\n\n**Worked claims example.** Question: classify exposure to drug X (an oral immunomodulator) into episodes in a commercial +\nMedicare FFS database, distinguishing a safety rechallenge from a routine restart. Setup per `person_id`, drug X NDCs\nonly, sorted by `fill_date`; require continuous medical + pharmacy enrollment (no MA-only spans) across every evaluated\ngap. Define `covered_through = fill_date + days_supply`, GRACE = 30 days, RESTART_MAX = 180 days, NEW_EPISODE = 365 days.\nWalking the fills:\n- 2023-02-01, days_supply 30 → covered_through 2023-03-03. Next fill 2023-03-10: gap = 7 days ≤ GRACE → **continuation**\n  (same episode; do not reset time-zero).\n- That episode's last fill is 2023-04-10 (covered_through 2023-05-10). 
Next fill 2023-08-01: gap = 83 days, in\n  (GRACE, RESTART_MAX]. Check for an intervening AE diagnosis (e.g., a hepatotoxicity dx + drug discontinuation in\n  2023-05) → none found → **restart** (new persistence episode; same chronic exposure for a cumulative-dose model).\n- Suppose instead an inpatient dx of drug-induced hepatitis appears 2023-05-15 with no further X fills for months, then X\n  re-appears 2024-01-15. Because the stop was AE-driven (a dechallenge) and the re-initiation follows it, this is a\n  **rechallenge** — flagged for the dechallenge-positive/rechallenge-positive safety analysis, not silently merged.\n- A fill of X appearing 2025-06-01, > NEW_EPISODE days after the prior covered end and accompanied by a re-qualifying\n  indication, opens a **new episode**: reset washout and time-zero for a new-user-style contrast.\nDiagnostics: pre/post episode counts, the distribution of computed gaps with the threshold cut points overlaid,\npatient-level timelines for a sample, the share of \"gaps\" occurring during unobservable MA spans, and a sensitivity\nanalysis varying GRACE (15/30/60), RESTART_MAX (90/180), and the AE-window used to define a dechallenge.",
    "primary_category": "Exposure_Definition",
    "tags": [
      "exposure-definition",
      "exposure-episode-construction",
      "treatment-episode",
      "rechallenge",
      "dechallenge",
      "restart",
      "new-episode",
      "persistence",
      "gap-rules"
    ],
    "applies_to_study_types": [
      "claims_analysis",
      "ehr_study",
      "registry_linkage",
      "comparative_effectiveness",
      "drug_utilization"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1002/pds.1230",
        "url": "https://doi.org/10.1002/pds.1230",
        "citation_text": "Andrade SE, Kahler KH, Frech F, Chan KA. Methods for evaluation of medication adherence and persistence using automated databases. Pharmacoepidemiology and Drug Safety. 2006;15(8):565-574.",
        "year": 2006,
        "authors_short": "Andrade et al.",
        "notes": "Canonical operational reference for deriving treatment episodes, permissible gaps, and restart logic from fill_date and days_supply in automated claims databases."
      },
      {
        "role": "explain",
        "doi": "10.1038/clpt.1981.154",
        "url": "https://doi.org/10.1038/clpt.1981.154",
        "citation_text": "Naranjo CA, Busto U, Sellers EM, et al. A method for estimating the probability of adverse drug reactions. Clinical Pharmacology and Therapeutics. 1981;30(2):239-245.",
        "year": 1981,
        "authors_short": "Naranjo et al.",
        "notes": "Establishes dechallenge and rechallenge as explicit, weighted items in individual-level drug-causality assessment, distinguishing a safety rechallenge from an ordinary restart."
      },
      {
        "role": "explain",
        "doi": "10.1093/aje/kwg231",
        "url": "https://doi.org/10.1093/aje/kwg231",
        "citation_text": "Ray WA. Evaluating medication effects outside of clinical trials: new-user designs. American Journal of Epidemiology. 2003;158(9):915-920.",
        "year": 2003,
        "authors_short": "Ray",
        "notes": "Grounds why a new episode that legitimately resets time-zero must satisfy incident-user (washout) logic rather than counting prevalent exposure."
      },
      {
        "role": "demonstrate",
        "doi": "10.1093/aje/kwm324",
        "url": "https://doi.org/10.1093/aje/kwm324",
        "citation_text": "Suissa S. Immortal time bias in pharmacoepidemiology. American Journal of Epidemiology. 2008;167(4):492-499.",
        "year": 2008,
        "authors_short": "Suissa",
        "notes": "Demonstrates how mis-defining episode start (e.g., new-episode resets requiring future fills) creates immortal time, the central pitfall these rules must avoid."
      }
    ],
    "plain_language_summary": "When a patient stops a drug and then refills it later, analysts must decide whether that later fill continues the original treatment period or starts a brand-new one — and, critically, whether the restart happened after a safety scare on the drug. The rules in this concept draw a firm line around four situations: continuing without a real gap, restarting after an ordinary gap, restarting specifically because a doctor rechallenged after a bad reaction, and beginning a clinically fresh course after a very long break. Getting the classification right matters because every count of who is being treated, for how long, and from what starting point is built on top of these rules.",
    "key_terms": [
      {
        "term": "fill_date",
        "definition": "The calendar date on which a patient picked up (or was dispensed) a prescription at the pharmacy."
      },
      {
        "term": "days_supply",
        "definition": "The number of days one filled prescription is intended to last — for example, a 30-day fill has days_supply = 30."
      },
      {
        "term": "covered_through",
        "definition": "The last date a patient still has pills on hand from a single fill, calculated as fill_date plus days_supply minus one day."
      },
      {
        "term": "dechallenge",
        "definition": "Stopping a drug because the patient had a bad reaction or side effect suspected to be caused by that drug."
      },
      {
        "term": "rechallenge",
        "definition": "Restarting the same drug after it was stopped for a safety reason — a rechallenge that brings back the same problem is one of the strongest signs that the drug actually caused it."
      },
      {
        "term": "index date",
        "definition": "A patient's personal day-zero for a study — the date their counted follow-up time officially begins, usually the date they first filled the drug under study."
      }
    ],
    "worked_example": {
      "scenario": "Patient 2047 is prescribed methotrexate (an oral immunomodulator used for rheumatoid arthritis) starting January 10, 2023. She fills it twice in the spring and appears to be continuing therapy without interruption. Then she stops for several months after developing abnormal liver labs in May — a safety-driven dechallenge. Her rheumatologist rechallenges her in October at a lower dose. We want to label each fill as continuation, restart, or rechallenge, and identify which fills belong to Episode 1 versus Episode 2.",
      "dataset": {
        "caption": "Pharmacy claims rows for patient 2047 (methotrexate only, one row per fill, already cleaned for duplicates).",
        "columns": [
          "person_id",
          "fill_date",
          "drug",
          "days_supply",
          "covered_through"
        ],
        "rows": [
          [
            2047,
            "2023-01-10",
            "methotrexate",
            90,
            "2023-04-09"
          ],
          [
            2047,
            "2023-04-05",
            "methotrexate",
            90,
            "2023-07-03"
          ],
          [
            2047,
            "2023-10-15",
            "methotrexate",
            90,
            "2024-01-12"
          ]
        ]
      },
      "steps": [
        "Fill A (Jan 10): This is the patient's very first fill of methotrexate in the data. It opens Episode 1, and Jan 10 becomes her index date — her day-zero.",
        "Fill A covers Jan 10 through Apr 9 (90 days). Fill B arrives Apr 5 — that is 4 days BEFORE covered_through, so the gap is actually negative (an early refill). Negative or zero gap means the patient never ran out; Fill B is a continuation of Episode 1.",
        "Fill B covers Apr 5 through Jul 3. Now check Fill C: it arrives Oct 15. The gap = Oct 15 minus Jul 3 = 104 days. That gap is too long to be a normal continuation (our grace period is 30 days).",
        "Before labeling Fill C a restart, we check whether a safety event drove the stop. The claims data shows an abnormal-liver-labs diagnosis code on May 22, 2023 — that is 49 days after covered_through (Jul 3 minus May 22 = 42 days before covered_through, well within a 90-day look-around window). This counts as a dechallenge.",
        "Because the stop was safety-driven and the patient is now restarting the same drug, Fill C is a RECHALLENGE. It opens Episode 2 with a new index date of Oct 15, 2023.",
        "Episode 1 spans Jan 10 – Jul 3 (174 days of covered time). Episode 2 begins Oct 15 (the rechallenge). The gap between covered_through of Episode 1 and the rechallenge fill is 104 days."
      ],
      "result": {
        "label": "Episode 1 (continuation): Jan 10 – Jul 3, 2023 (174 covered days across 2 fills). Episode 2 (rechallenge): starts Oct 15, 2023. Fill C is flagged as a safety rechallenge — not a routine restart — because the May 22 liver-labs event is within 90 days of the end of Episode 1.",
        "value": "2 episodes; Fill C state = rechallenge"
      },
      "timeline_spec": {
        "title": "Restart vs. rechallenge: one methotrexate patient with a safety-driven dechallenge",
        "window": {
          "start": "2023-01-10",
          "end": "2024-01-12",
          "label": "Observation window: Jan 10 2023 – Jan 12 2024 (367 days)"
        },
        "events": [
          {
            "label": "Fill A — Episode 1 opens (index date)",
            "start": "2023-01-10",
            "length_days": 90,
            "quantity": "90 days_supply"
          },
          {
            "label": "Fill B — continuation (early refill, 4 days before covered_through)",
            "start": "2023-04-05",
            "length_days": 90,
            "quantity": "90 days_supply"
          },
          {
            "label": "Fill C — rechallenge (Episode 2 opens, new index date)",
            "start": "2023-10-15",
            "length_days": 90,
            "quantity": "90 days_supply"
          }
        ],
        "spans": [
          {
            "kind": "exposed",
            "start": "2023-01-10",
            "end": "2023-07-03",
            "label": "Episode 1: 174 covered days (continuation)"
          },
          {
            "kind": "gap",
            "start": "2023-07-04",
            "end": "2023-10-14",
            "label": "104-day gap — dechallenge gap (liver-labs AE May 22)"
          },
          {
            "kind": "exposed",
            "start": "2023-10-15",
            "end": "2024-01-12",
            "label": "Episode 2: 90 covered days (rechallenge)"
          }
        ],
        "adverse_event": {
          "label": "Liver-labs AE — dechallenge trigger",
          "date": "2023-05-22"
        },
        "result": {
          "label": "2 episodes. Fill C = rechallenge (not restart) because AE date May 22 is within 90 days of Episode 1 covered_through Jul 3.",
          "value": "rechallenge"
        },
        "caption": "Timeline for patient 2047 on methotrexate. Fills A and B form a single continuous episode (Episode 1) because Fill B arrives before Fill A runs out. The 104-day gap is classified as a dechallenge gap, not a routine break, because the liver-labs adverse event on May 22 falls within the 90-day look-around window. Fill C is therefore a rechallenge — it opens Episode 2 with a fresh index date — rather than a plain restart. In a safety study, this dechallenge-positive / rechallenge-positive sequence is evidence that the drug may have caused the liver problem.",
        "alt_text": "Horizontal timeline for one patient showing two colored exposure bars separated by a gap. Episode 1 runs January 10 to July 3 2023 and is made up of two fills (Fill A and Fill B) with Fill B beginning before Fill A expired. A liver-labs adverse-event marker sits on May 22 2023 inside the gap period. Episode 2 begins October 15 2023 and is labeled as a rechallenge because the adverse event is close in time to the end of Episode 1."
      }
    },
    "prerequisites": [
      "exposure-episode-construction-rwe",
      "grace-period-gap-rules-rwe",
      "pdc-proportion-of-days-covered"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Coverage-gap restart rule (days_supply + grace)",
        "description": "Episodes are stitched while consecutive fills overlap within days_supply plus a grace period; a gap beyond grace closes the episode and a later fill within the restart window opens a new (restart) episode.",
        "edge_cases": [
          "Stockpiling / 90-day mail-order inflates coverage and erases true gaps unless carry-over caps are applied.",
          "Same-day duplicate or reversed claims double-count days_supply; collapse to net paid dispensings first.",
          "A gap falling entirely inside an inpatient stay is often bridged exposure, not a real stop."
        ],
        "data_source_notes": "claims: build NDC episode logic on fill_date + days_supply with continuous-enrollment screening of each gap; EHR: substitute order/administration dates and prefer linked dispensing to confirm restarts."
      },
      {
        "name": "Safety dechallenge-rechallenge rule",
        "description": "A stop preceded or followed by an adverse-event diagnosis is tagged a dechallenge; a subsequent re-initiation is a rechallenge and is analyzed (or excluded) under a safety estimand rather than merged into ongoing person-time.",
        "edge_cases": [
          "The AE window relative to the stop (e.g., 0-30 vs 0-90 days) materially changes how many stops count as dechallenges.",
          "Outcomes that themselves reduce the probability of a later fill (death, hospitalization) violate naive case-only assumptions."
        ],
        "data_source_notes": "registry/EHR: use adjudicated AE and notes (\"rechallenged after rash resolved\") to define the dechallenge; claims: approximate with diagnosis + discontinuation, accepting misclassification of the reason for stop."
      },
      {
        "name": "New-episode reset rule",
        "description": "A re-appearance after a long gap (e.g., > 365 days) or accompanied by a re-qualifying indication opens a clinically distinct episode with a fresh washout and time-zero for a new-user-style contrast.",
        "edge_cases": [
          "Reset logic that requires a future qualifying fill re-creates immortal time; qualification must use only information available at the candidate time-zero.",
          "A long \"gap\" may be unobservable MA-only person-time rather than a true treatment-free interval."
        ],
        "data_source_notes": "claims: require continuous observable enrollment across the long gap; linked: reconcile order/fill dates before assigning the new time-zero."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "A single fixed-gap \"any refill = same episode\" rule",
        "pros_of_this": "Transparent, reproducible, and defensible to regulators/HTA; prevents silently merging a safety rechallenge into a benign restart and prevents counting unobservable gaps as new initiations.",
        "cons_of_this": "More programming, more diagnostics, and additional thresholds (grace, restart window, AE window, new-episode cut) that must be justified and varied in sensitivity analysis.",
        "when_to_prefer": "Any consequential comparative-safety, effectiveness, utilization, or cost-denominator analysis where stops and restarts of the same drug occur."
      },
      {
        "compared_to": "Prescription-sequence-symmetry / case-only rechallenge designs",
        "pros_of_this": "Yields interpretable rate and cumulative-incidence contrasts and accommodates competing risks and censoring explicitly rather than only a symmetry signal.",
        "cons_of_this": "Requires full confounder measurement and complete observable person-time, whereas PSSA self-controls for time-invariant confounding with minimal data.",
        "when_to_prefer": "When the question is a quantified effect rather than rapid signal screening of a suspected rechallenge association."
      },
      {
        "compared_to": "Collapsing all dispensings into undifferentiated prevalent-user person-time",
        "pros_of_this": "Preserves time-zero, prevents immortal time, and keeps restart vs rechallenge distinguishable.",
        "cons_of_this": "More restrictive cohorts and more bookkeeping.",
        "when_to_prefer": "Essentially always for causal contrasts; collapsing is acceptable only for crude descriptive utilization."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Build episodes from NDC + fill_date + days_supply after collapsing reversals/same-day duplicates. Require continuous medical + pharmacy enrollment across every evaluated gap and exclude Medicare Advantage-only person-time so a gap is a true stop, not unobserved fills. Censor at end-of-data rather than the last fill to respect adjudication lag.",
      "ehr": "Use order/administration dates, not dispensing; mine notes for dechallenge/rechallenge language structured tables miss. External-care leakage hides restarts prescribed elsewhere; prefer linked pharmacy fills and treat encounter-driven loss to follow-up as potentially informative.",
      "registry": "Strong for the adjudicated reason a drug stopped (enables true rechallenge classification) but weak for complete refill history; link to claims for fills and to a death index so a permanent discontinuation is not an unrecorded death.",
      "linked": "Registry/EHR supply the AE that defines a dechallenge while claims supply complete fills; reconcile order/fill/administration date discrepancies and account for linkage selection before assigning episode boundaries."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\nimport numpy as np\n\nGRACE_DAYS       = 30    # consecutive fills within days_supply + GRACE extend the SAME episode\nRESTART_MAX_DAYS = 180   # gap in (GRACE, RESTART_MAX] -> restart; below new-episode threshold\nNEW_EPISODE_DAYS = 365   # gap > this (or new indication) -> a clinically distinct new episode\nAE_WINDOW_DAYS   = 90    # an AE within this window of the prior covered end marks a dechallenge\n\ndef classify_episodes(rx: pd.DataFrame, ae: pd.DataFrame) -> pd.DataFrame:\n    rx = rx.sort_values([\"person_id\", \"fill_date\"]).copy()\n    rx[\"covered_through\"] = rx[\"fill_date\"] + pd.to_timedelta(rx[\"days_supply\"], unit=\"D\")\n    ae_by_person = ae.groupby(\"person_id\")[\"ae_date\"].apply(list).to_dict()\n\n    out_rows = []\n    for pid, g in rx.groupby(\"person_id\", sort=False):\n        ae_dates = ae_by_person.get(pid, [])\n        episode_id = 0\n        prev_covered = None\n        for row in g.itertuples(index=False):\n            if prev_covered is None:\n                state = \"new_episode\"          # first observed fill opens episode 0\n            else:\n                gap = (row.fill_date - prev_covered).days\n                if gap <= GRACE_DAYS:\n                    state = \"continuation\"\n                else:\n                    # Was the prior stop driven by an adverse event? 
-> rechallenge.\n                    # Only AEs on or before this fill can explain the stop (no future information).\n                    dechallenge = any(\n                        a <= row.fill_date and\n                        abs((a - prev_covered).days) <= AE_WINDOW_DAYS\n                        for a in ae_dates\n                    )\n                    if dechallenge:\n                        state = \"rechallenge\"\n                    elif gap <= RESTART_MAX_DAYS:\n                        state = \"restart\"\n                    else:\n                        state = \"new_episode\"   # gap > RESTART_MAX_DAYS; NEW_EPISODE_DAYS is not consulted, so gaps in (RESTART_MAX, NEW_EPISODE] also reset\n            if state != \"continuation\":\n                episode_id += 1                  # restart/rechallenge/new_episode open a new analytic episode\n            out_rows.append((pid, row.fill_date, row.covered_through, episode_id, state))\n            # Coverage can extend if a restart overlaps the prior tail; take the later end.\n            prev_covered = row.covered_through if prev_covered is None else max(prev_covered, row.covered_through)\n    return pd.DataFrame(out_rows,\n                        columns=[\"person_id\", \"fill_date\", \"covered_through\", \"episode_id\", \"state\"])",
        "description": "Classify same-drug fills into continuation / restart / rechallenge / new-episode states from claims-style inputs.\nRequired inputs (already cleaned: reversals and same-day duplicates collapsed, restricted to the target drug's NDCs,\nrestricted to continuously enrolled, FFS-observable person-time):\n  rx : person_id, fill_date (datetime64), days_supply (int)\n  ae : person_id, ae_date (datetime64)   # adjudicated/dx-based adverse-event dates that can define a dechallenge\nReturns one row per fill with episode_id and a 'state'. Build denominators, persistence, and time-zero off these\nepisodes; rechallenge-flagged rows feed the safety estimand rather than ongoing person-time.",
        "dependencies": [
          "pandas",
          "numpy"
        ],
        "source_citations": [
          "andrade-2006"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\n\nGRACE_DAYS       <- 30L\nRESTART_MAX_DAYS <- 180L\nNEW_EPISODE_DAYS <- 365L\nAE_WINDOW_DAYS   <- 90L\n\nclassify_episodes <- function(rx, ae) {\n  setDT(rx); setDT(ae)\n  setorder(rx, person_id, fill_date)\n  rx[, covered_through := fill_date + days_supply]\n  ae_list <- split(ae$ae_date, ae$person_id)\n\n  rx[, c(\"episode_id\", \"state\") := {\n    eid <- 0L; prev_cov <- as.Date(NA); ad <- ae_list[[as.character(.BY$person_id)]]\n    eids <- integer(.N); sts <- character(.N)\n    for (i in seq_len(.N)) {\n      if (is.na(prev_cov)) {\n        st <- \"new_episode\"\n      } else {\n        gap <- as.integer(fill_date[i] - prev_cov)\n        if (gap <= GRACE_DAYS) {\n          st <- \"continuation\"\n        } else {\n          # Only AEs on or before this fill can explain the stop (no future information).\n          dechallenge <- !is.null(ad) && any(ad <= fill_date[i] & abs(as.integer(ad - prev_cov)) <= AE_WINDOW_DAYS)\n          st <- if (dechallenge) \"rechallenge\"\n                else if (gap <= RESTART_MAX_DAYS) \"restart\"\n                else \"new_episode\"\n        }\n      }\n      if (st != \"continuation\") eid <- eid + 1L\n      eids[i] <- eid; sts[i] <- st\n      prev_cov <- if (is.na(prev_cov)) covered_through[i] else max(prev_cov, covered_through[i])\n    }\n    list(eids, sts)\n  }, by = person_id]\n  rx[, .(person_id, fill_date, covered_through, episode_id, state)]\n}",
        "description": "Same four-state episode classifier with data.table. Inputs mirror the Python version:\n  rx : person_id, fill_date (Date), days_supply (integer)\n  ae : person_id, ae_date (Date)   # adverse-event dates defining a possible dechallenge\nInput must be pre-cleaned (reversals/duplicates collapsed, single drug, continuously observable person-time).",
        "dependencies": [
          "data.table"
        ],
        "source_citations": [
          "andrade-2006"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "%let grace      = 30;   /* days_supply + grace extends the same episode      */\n%let restartmax = 180;  /* gap in (grace, restartmax] -> restart             */\n%let newepisode = 365;  /* gap > this (or new indication) -> new episode     */\n%let aewindow   = 90;   /* AE within this window of prior covered end = dechallenge */\n\n/* Nearest AE to each fill's prior covered-through is computed by joining the AE table.\n   Here we flag, per fill, whether ANY AE fell within +/- aewindow of the prior covered end. */\nproc sort data=work.rx;  by person_id fill_date;  run;\nproc sort data=work.ae;  by person_id ae_date;    run;\n\ndata work.episodes;\n  set work.rx;\n  by person_id;\n  retain prev_cov episode_id;\n  format fill_date covered_through date9.;\n\n  /* Load all AE dates into a hash ONCE (keyed by person_id, multidata for >1 AE/person).\n     A do-until(eof) SET of work.ae inside the main step would read the whole AE table on the\n     first rx row and exhaust it, so later fills could never look up an AE; the hash lets every\n     fill probe its person's AE dates independently via find()/find_next(). */\n  if _n_ = 1 then do;\n    declare hash ae(dataset:\"work.ae\", multidata:\"Y\");\n    ae.defineKey(\"person_id\");\n    ae.defineData(\"ae_date\");\n    ae.defineDone();\n  end;\n\n  covered_through = fill_date + days_supply;\n\n  if first.person_id then do; prev_cov = .; episode_id = 0; end;\n\n  length state $12;\n  if prev_cov = . 
then state = \"new_episode\";\n  else do;\n    gap = fill_date - prev_cov;\n    if gap <= &grace then state = \"continuation\";\n    else do;\n      /* dechallenge if any AE for this person is on or before this fill and within aewindow of the prior covered end */\n      dech = 0;\n      rc = ae.find();                                  /* first AE for this person_id */\n      do while (rc = 0);\n        if ae_date <= fill_date and abs(ae_date - prev_cov) <= &aewindow then dech = 1;\n        rc = ae.find_next();                           /* walk remaining AEs for this person */\n      end;\n      if dech then state = \"rechallenge\";\n      else if gap <= &restartmax then state = \"restart\";\n      else state = \"new_episode\";\n    end;\n  end;\n\n  if state ne \"continuation\" then episode_id + 1;     /* open a new analytic episode */\n  prev_cov = max(prev_cov, covered_through);\n  keep person_id fill_date covered_through episode_id state;\nrun;",
        "description": "Four-state episode classification via a sorted DATA step with BY-group state carried in RETAIN. Required inputs\n(post data-management: single drug, reversals/duplicates collapsed, continuously observable person-time):\n  work.rx : person_id, fill_date, days_supply\n  work.ae : person_id, ae_date    (adverse-event dates that can define a dechallenge)\nOutput work.episodes has one row per fill with episode_id and state for building denominators and time-zero.",
        "dependencies": [],
        "source_citations": [
          "andrade-2006"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": "restart-rechallenge-new-episode-rwe-timeline.svg",
        "mermaid": null,
        "caption": "Timeline for patient 2047 on methotrexate. Fills A and B form a single continuous episode (Episode 1) because Fill B arrives before Fill A runs out. The 104-day gap is classified as a dechallenge gap, not a routine break, because the liver-labs adverse event on May 22 falls within the 90-day look-around window. Fill C is therefore a rechallenge — it opens Episode 2 with a fresh index date — rather than a plain restart. In a safety study, this dechallenge-positive / rechallenge-positive sequence is evidence that the drug may have caused the liver problem.",
        "alt_text": "Horizontal timeline for one patient showing two colored exposure bars separated by a gap. Episode 1 runs January 10 to July 3 2023 and is made up of two fills (Fill A and Fill B) with Fill B beginning before Fill A expired. A liver-labs adverse-event marker sits on May 22 2023, 42 days before the end of Episode 1 and inside the 90-day look-around window. Episode 2 begins October 15 2023, after a 104-day gap, and is labeled as a rechallenge because the adverse event is close in time to the end of Episode 1.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  F[Next same-drug fill<br/>gap = fill_date - prior covered_through] --> G{gap <= grace?}\n  G -- yes --> C[Continuation<br/>same episode, no new time-zero]\n  G -- no --> A{AE-driven dechallenge<br/>before this re-initiation?}\n  A -- yes --> R[Rechallenge<br/>safety estimand; do NOT merge into person-time]\n  A -- no --> N{gap > new-episode threshold<br/>or new indication?}\n  N -- yes --> E[New episode<br/>reset washout + time-zero]\n  N -- no --> S[Restart<br/>new persistence spell;<br/>same chronic exposure for cumulative-dose]",
        "caption": "Decision tree applied per person per drug in fill-date order. Gap length plus an adverse-event flag map each re-appearance to one of four mutually exclusive states, each with different estimand consequences.",
        "alt_text": "Flowchart deciding whether a later drug fill is a continuation, rechallenge, new episode, or restart based on the gap since the prior covered end and whether an adverse-event dechallenge preceded re-initiation.",
        "source_type": "illustrative",
        "source_citations": [
          "andrade-2006"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "gantt\n  title One patient, drug X: continuation, restart, rechallenge, new episode\n  dateFormat YYYY-MM-DD\n  axisFormat %b %Y\n  section Episode 1\n  Fill + days_supply (covered) :done, e1a, 2023-02-01, 30d\n  Refill within grace (continuation) :done, e1b, 2023-03-10, 30d\n  section Restart\n  Gap 83d, no AE -> restart :active, rs, 2023-07-01, 30d\n  section Rechallenge\n  AE dx (dechallenge) :crit, ae, 2023-08-15, 1d\n  Re-initiation after AE -> rechallenge :crit, rc, 2024-01-15, 30d\n  section New episode\n  Gap > 365d + new indication -> reset time-zero :milestone, ne, 2025-06-01, 0d",
        "caption": "Timeline illustration of the four states for a single patient. The gap length alone does not decide the state; the adverse-event dechallenge in 2023-08, 15 days after the restart's coverage ends, reclassifies the 2024-01 re-initiation from restart to rechallenge.",
        "alt_text": "Gantt chart showing a covered period, an in-grace continuation refill, an 83-day-gap restart, an adverse-event dechallenge followed by a rechallenge, and a new episode after a gap exceeding one year.",
        "source_type": "illustrative",
        "source_citations": [
          "andrade-2006"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "is_variant_of",
        "target_slug": "exposure-episode-construction-rwe",
        "notes": "These rules are the re-initiation branch of general exposure-episode construction, specializing it into continuation, restart, rechallenge, and new-episode states."
      },
      {
        "relation_type": "used_with",
        "target_slug": "grace-period-gap-rules-rwe",
        "notes": "The grace period and permissible-gap definition supply the threshold that separates continuation from a gap that closes the episode."
      },
      {
        "relation_type": "used_with",
        "target_slug": "washout-clean-lookback-period-rwe",
        "notes": "A new-episode reset re-applies the washout/clean-lookback so the new time-zero qualifies as an incident-user start."
      },
      {
        "relation_type": "used_with",
        "target_slug": "stockpiling-carryover-rules-rwe",
        "notes": "Carry-over caps prevent oversupply from masking a true gap and mislabeling a discontinuation as continuation."
      },
      {
        "relation_type": "see_also",
        "target_slug": "switch-add-on-augmentation-rwe",
        "notes": "Switching and add-on rules handle the cross-drug counterpart of these within-drug re-initiation states."
      },
      {
        "relation_type": "see_also",
        "target_slug": "persistence-time-to-discontinuation",
        "notes": "Whether a restart opens a new spell or ends one directly determines persistence and time-to-discontinuation estimates."
      },
      {
        "relation_type": "see_also",
        "target_slug": "safety-signal-case-definition-rwe",
        "notes": "A dechallenge-positive/rechallenge-positive sequence is a strong individual-level safety signal that feeds case definitions."
      },
      {
        "relation_type": "see_also",
        "target_slug": "time-zero-index-date-alignment-rwe",
        "notes": "New-episode resets create a new time-zero; defining it with future information re-introduces immortal time."
      },
      {
        "relation_type": "tradeoff_with",
        "target_slug": "prescription-sequence-symmetry",
        "notes": "PSSA is a confounding-robust, case-only alternative for screening rechallenge signals; cohort episode rules give quantified rate/cumulative-incidence contrasts instead."
      }
    ],
    "aliases": [
      "treatment episode rules",
      "drug rechallenge",
      "dechallenge-rechallenge",
      "new treatment episode",
      "restart and rechallenge rules"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "restricted-mean-survival-time-rmst",
    "name": "Restricted Mean Survival Time (RMST)",
    "short_definition": "The expected event-free survival time accumulated up to a pre-specified horizon tau, equal to the area under the survival curve from 0 to tau, summarized between groups as an RMST difference (months gained) or RMST ratio without assuming proportional hazards.",
    "long_description": "**Restricted mean survival time (RMST)** at a horizon tau is the area under the survival curve S(t) from time 0 to tau:\nRMST(tau) = integral_0^tau S(t) dt. Estimated non-parametrically, it is the area under the Kaplan-Meier step function,\ncomputed as the sum of rectangles S(t_k) * (t_{k+1} - t_k) over event/censoring times up to tau. It has the units of the\ntime axis (months, years) and a direct interpretation: the average amount of event-free time a patient in this group\naccrues during the first tau units of follow-up. The between-group **RMST difference** (RMST_A - RMST_B) is the average\nnumber of additional event-free months gained, and the **RMST ratio** is its multiplicative analogue. Both are well-defined\nwhether or not the hazards are proportional, which is the central reason RMST has displaced the hazard ratio as the primary\nor co-primary summary in many non-proportional-hazards settings.\n\n**Core estimand distinction** RMST is a *summary measure on the time scale*, not a model and not, by itself, a causal\nestimand. The hazard ratio from Cox is a single multiplicative contrast of instantaneous risk that is only interpretable\nas one number when proportional hazards (PH) holds; under crossing or delayed-separation curves the population HR is a\nfollow-up-weighted average of changing hazard ratios with no fixed clinical meaning, and it can be driven by sparse late\nevents. RMST(tau) instead integrates the entire curve up to tau, so it is robust to the *shape* of the hazard and to what\nhappens after tau. Two further distinctions matter in RWE. (1) *Unadjusted vs covariate-adjusted RMST*: the KM-area RMST\ndifference estimates a marginal (population-average) contrast under independent censoring; to target an adjusted ATE/ATT\nyou must use pseudo-observation regression, IPTW-weighted KM, or g-computation, and the adjusted RMST difference is then a\nmarginal causal contrast only if confounding is fully controlled. 
(2) *Event definition under competing risks*: when death\ncompetes with the event of interest, \"RMST\" can mean restricted mean *event-free* time (treating the competing event as\ncensoring, which is usually wrong) or restricted mean time-in-state computed from the cumulative incidence function\n(Aalen-Johansen), i.e. expected life-years lost. Pre-specify which one in the estimand; they answer different questions.\n\n**Pros, cons, and trade-offs**\n- **vs Cox proportional-hazards / hazard ratio (cox-ph-regression):** RMST needs no PH assumption, is reported in\n  interpretable time units that payers and patients understand (\"3.1 months gained\"), is robust to heavy administrative\n  censoring and to sparse, influential late events, and gives a single honest number when the HR does not. Cost: RMST\n  requires choosing tau (a substantive, not statistical, decision), is modestly less efficient than Cox when PH genuinely\n  holds, and is less familiar to some clinical audiences. **Prefer RMST** whenever PH is violated, doubtful, or untestable,\n  or when an absolute time horizon is the decision-relevant quantity. **Prefer the HR** when PH is credible and a relative\n  instantaneous-risk contrast is the target.\n- **vs survival extrapolation for lifetime QALYs (survival-extrapolation-hta-rwe):** RMST stays inside the observed\n  follow-up window (t <= tau) and therefore makes no parametric tail assumptions, which is its honesty advantage. Cost: it\n  deliberately does not answer the lifetime-horizon question a cost-utility model needs; restricting to tau truncates any\n  benefit accruing after tau. 
**Prefer RMST** for within-trial / within-data effect summaries and HTA \"minimum months\n  gained\" claims; **prefer extrapolation** when the decision genuinely needs lifetime mean survival.\n- **vs g-methods for time-varying treatment (g-estimation-structural-nested-models, clone-censor-weight-per-protocol):**\n  RMST is a marginal summary up to tau and is simple to specify and communicate. Cost: it does not by itself handle\n  time-varying confounding, treatment switching, or dynamic per-protocol strategies. **Prefer plain RMST** for an\n  initiation (ITT-like) contrast; combine RMST with clone-censor-weight or MSMs as the summary measure when the estimand\n  is a sustained or per-protocol strategy.\n\n**When to use** Crossing or delayed-separation survival curves (classic in immuno-oncology, where the HR averages a null\nearly period with a later benefit); heavy administrative censoring before the median is reached in either arm (common in\nclaims with short, calendar-bounded enrollment); HTA, payer, or patient-facing questions that value absolute event-free\ntime within a fixed policy horizon (1-, 3-, or 5-year RMST for value frameworks and budget-impact narratives); and any\nsetting where PH is clearly violated or cannot be tested because late events are sparse. Reporting RMST at several\npre-specified horizons (e.g. 1, 3, 5 years) is good practice when the benefit profile changes over time.\n\n**When NOT to use — and when it is actively misleading or dangerous**\n- **tau chosen after looking at the data, or beyond reliable follow-up.** Setting tau past the last event in one arm forces\n  extrapolation through a flat KM tail and silently fabricates \"gained\" time; choosing tau to maximize the difference is\n  p-hacking on the time scale. tau must be fixed a priori on clinical or policy grounds, and the minimum of the two arms'\n  largest observed event/censoring times should bound it. 
Always report the number still at risk at tau.\n- **Competing risks handled as independent censoring.** Censoring deaths to compute \"event-free RMST\" overstates event-free\n  time, and the bias is *differential* when mortality differs by arm, which is exactly the situation in the elderly, multimorbid claims populations\n  where this method is most tempting. Use the CIF/Aalen-Johansen restricted-mean-time-in-state (life-years lost) instead.\n- **Informative censoring from disenrollment or switching.** Unadjusted KM-area RMST assumes censoring is independent of\n  prognosis. Differential loss to follow-up (Medicare Advantage churn, plan switching, differential switching at\n  progression) biases it; use inverse-probability-of-censoring weighting or restrict tau before typical switch time.\n- **Confounded observational contrast read as causal.** RMST removes the PH assumption, not confounding by indication. An\n  unadjusted RMST difference between two drugs in claims is descriptive; adjustment (pseudo-observation regression, IPTW,\n  g-computation) is still mandatory.\n- **PH genuinely holds and a single tau is reported.** Then RMST throws away efficiency relative to the HR with no\n  interpretive gain; report the HR (and optionally RMST as a co-primary for communication).\n\n**Data-source operational depth**\n- **Claims (FFS):** The time axis is days from index. Event = first qualifying outcome claim (e.g. inpatient HF diagnosis\n  in the primary position); censor at the earliest of disenrollment (end of continuous enrollment) and study end, and\n  record death as a competing event rather than a censoring time when it competes with the outcome.\n  tau is set to the reimbursement/value horizon (commonly 24 or 36 months), never to the data. Failure modes: Medicare\n  Advantage person-time lacks FFS encounter claims, so events go uncaptured and patients appear event-free until they\n  disenroll - restrict to A/B (and D where exposure is a drug) FFS person-time and treat MA enrollment as censoring, not\n  follow-up. 
Differential competing risk: in elderly claims, non-cancer death competes with the event and is more frequent\n  in the sicker arm; event-free RMST must use the CIF, not censored KM. Immortal time: if the index requires a\n  post-initiation procedure, follow-up must start at the procedure date or the early flat survival inflates RMST in that\n  arm. Coarse timing: days_supply and claim adjudication lags mean event dates are interval-censored at the day level;\n  integrate the KM on the finest available unit (days) and avoid month-rounding.\n- **EHR:** Event dates can be sharper (progression notes, lab thresholds, structured problem lists) but capture is\n  visit-driven, so a patient who leaves the system is differentially and informatively censored. Use last-contact or the\n  end of database coverage as the censor and consider inverse-probability-of-observation weighting if visit intensity\n  differs by arm; otherwise the arm with denser surveillance accrues apparent events earlier and its RMST is biased\n  downward.\n- **Registry:** Usually the longest, cleanest, adjudicated follow-up with explicit visit schedules - the best substrate\n  for RMST and for validating claims-based RMST. Still pre-specify tau before plotting the curves, and confirm vital-status\n  ascertainment is complete so administrative censoring is non-informative.\n- **Linked claims-EHR-vital records:** The ideal substrate (EHR severity + claims completeness + reliable mortality for\n  competing-risk RMST), but linkage selects the linkable subset and order/fill/service date discrepancies must be\n  reconciled before the time axis is fixed.\n\n**Worked claims example.** Question: 36-month restricted mean event-free time to first heart-failure hospitalization,\nsecond-generation sulfonylurea (SU) vs DPP-4 inhibitor, incident users with type 2 diabetes in a Medicare FFS + commercial\ndatabase; tau = 1095 days, fixed before any analysis on the value-framework horizon. 
(1) Cohort: age >= 18, >= 2 diabetes\ndiagnoses, 365 days continuous A/B/D FFS or commercial medical+pharmacy enrollment before index, no SU or DPP-4 fill in\nthat washout (incident users), arm assigned from the NDC dispensed on the first qualifying fill (index_date = fill_date).\n(2) Event = first inpatient claim with HF in the primary diagnosis position after index_date; event time = days from\nindex. (3) Censoring at the earliest of disenrollment, switch to MA (FFS claims stop), and study end. Death from the\nlinked death index is recorded as a competing event, not a censoring time: because death without a prior HF\nhospitalization is more common in the older SU initiators,\nthe primary analysis uses the Aalen-Johansen CIF restricted mean time-in-state, not censored KM. (4) Suppose the\nAalen-Johansen analysis yields RMST_DPP4 = 33.4 months and RMST_SU = 31.7 months over 36 months, with 41% of DPP-4 and 38% of SU\ninitiators still at risk at tau: unadjusted RMST difference = 1.7 months (95% CI by the closed-form Uno variance or 2,000\nbootstrap resamples of patients). (5) Because the arms differ at baseline, the reported estimate is the IPTW-weighted (or\npseudo-observation-regression-adjusted) RMST difference on a high-dimensional propensity score built only from the 365-day\nbaseline window. 
(6) Sensitivity: re-run at tau = 24 months and, only if both arms still have patients at risk there, 48 months; swap the competing-risk handling, vary the grace period\ndefining switching, and run a negative-control outcome - if the adjusted RMST difference is stable and close to the unadjusted 1.7 months, the \"average ~1.7\nadditional HF-hospitalization-free months over 3 years on DPP-4\" claim is defensible for the payer narrative.\n\n**Interpreting the output**\n\nAn RMST analysis of DPP-4 inhibitor vs sulfonylurea at a 24-month horizon returns: RMST difference = 1.8 months (≈55 days; 95% CI 0.3–3.3 months) in favor of DPP-4.\n\n*Formal interpretation.* The restricted mean survival time is the area under the (here IPTW-weighted) Kaplan-Meier curve from time zero to the pre-specified horizon τ = 24 months; it equals the expected event-free time a patient accumulates before month 24 under each treatment arm. The estimated RMST difference of 1.8 months means that, on average, patients initiated on DPP-4 therapy spend approximately 1.8 more months free of their first heart-failure hospitalization within a 24-month window compared with sulfonylurea initiators, after IPTW adjustment. The horizon τ must be fixed before analysis; re-running at different τ values is legitimate only as a pre-specified sensitivity analysis, and is selective reporting otherwise.\n\n*Practical interpretation.* Unlike the hazard ratio, the RMST difference is an absolute, time-anchored quantity expressed in the same units as follow-up (months or days), making it directly meaningful to patients and payers. It does not require the proportional-hazards assumption and remains valid even when survival curves cross. The 1.8-month gain translates to roughly 55 additional event-free days per patient on average, which a value narrative can frame against the cost difference between the two drug classes at the chosen time horizon.",
    "primary_category": "Inferential_Statistics",
    "tags": [
      "rmst",
      "restricted-mean-survival-time",
      "survival-analysis",
      "non-proportional-hazards",
      "time-scale",
      "oncology-rwe",
      "hta-relevant",
      "claims",
      "administrative-censoring",
      "crossing-curves",
      "competing-risks"
    ],
    "applies_to_study_types": [
      "new_user",
      "active_comparator_new_user",
      "cohort_retrospective",
      "claims_analysis",
      "ehr_study",
      "pragmatic_trial",
      "target_trial_emulation"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "linked",
      "registry"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1186/1471-2288-13-152",
        "url": "https://doi.org/10.1186/1471-2288-13-152",
        "citation_text": "Royston P, Parmar MKB. Restricted mean survival time: an alternative to the hazard ratio for the design and analysis of randomized trials with a time-to-event outcome. BMC Med Res Methodol. 2013;13:152.",
        "year": 2013,
        "authors_short": "Royston & Parmar",
        "notes": "The clearest and most-cited articulation of RMST as a robust, interpretable summary that does not require proportional hazards; the design and analysis principles transfer directly to non-randomized RWE."
      },
      {
        "role": "explain",
        "doi": "10.1200/JCO.2014.55.2208",
        "url": "https://doi.org/10.1200/JCO.2014.55.2208",
        "citation_text": "Uno H, Claggett B, Tian L, et al. Moving beyond the hazard ratio in quantifying the between-group difference in survival analysis. J Clin Oncol. 2014;32(22):2380-2385.",
        "year": 2014,
        "authors_short": "Uno et al.",
        "notes": "Practical guidance on RMST estimation and inference, the closed-form variance for the RMST difference, choice of tau, and the pitfalls of the hazard ratio under non-proportional hazards; the methodological basis of the survRM2 package."
      },
      {
        "role": "explain",
        "doi": "10.1001/jamacardio.2017.2922",
        "url": "https://doi.org/10.1001/jamacardio.2017.2922",
        "citation_text": "Kim DH, Uno H, Wei L-J. Restricted Mean Survival Time as a Measure to Interpret Clinical Trial Results. JAMA Cardiol. 2017;2(11):1179-1180.",
        "year": 2017,
        "authors_short": "Kim et al.",
        "notes": "Concise, audience-facing explanation of why RMST is preferable to the hazard ratio for communicating absolute benefit on the time scale - directly applicable to HTA and payer narratives."
      },
      {
        "role": "demonstrate",
        "doi": "10.1001/jamaoncol.2017.2797",
        "url": "https://doi.org/10.1001/jamaoncol.2017.2797",
        "citation_text": "Pak K, Uno H, Kim DH, et al. Interpretability of cancer clinical trial results using restricted mean survival time as an alternative to the hazard ratio. JAMA Oncol. 2018;4(12):1692-1696.",
        "year": 2018,
        "authors_short": "Pak et al.",
        "notes": "Empirical re-analysis of many oncology trials showing where the hazard ratio is misleading and the RMST difference is the more interpretable summary - the canonical demonstration of RMST in practice."
      },
      {
        "role": "demonstrate",
        "doi": "10.1186/s12874-021-01213-0",
        "url": "https://doi.org/10.1186/s12874-021-01213-0",
        "citation_text": "Mozumder SI, Rutherford MJ, Lambert PC. Estimating restricted mean survival time and expected life-years lost in the presence of competing risks within flexible parametric survival models. BMC Med Res Methodol. 2021;21:52.",
        "year": 2021,
        "authors_short": "Mozumder et al.",
        "notes": "Shows how to compute RMST and expected life-years lost under competing risks within flexible parametric models - the rigorous handling required for elderly, multimorbid claims and registry populations where deaths compete."
      }
    ],
    "plain_language_summary": "Restricted Mean Survival Time (RMST) measures how many days, on average, patients in a study group stayed free of a bad health event — such as a heart attack or hospitalization — during a fixed window of time. Instead of asking 'are patients in one group dying faster?', it asks 'how many more event-free days did patients on Drug A accumulate compared to Drug B before the window closed?' You pick the window length (called tau, the Greek letter τ) before the study starts — for example, 36 months — and RMST is simply the average event-free time each group built up inside that window. The difference between the two groups' RMSTs tells you, in plain days or months, how much extra event-free time the better treatment delivered — a number that patients, doctors, and payers can all picture immediately.",
    "key_terms": [
      {
        "term": "time zero (index date)",
        "definition": "The starting clock tick for each patient — usually the day they first filled the study drug — from which all follow-up days are counted."
      },
      {
        "term": "event",
        "definition": "The health outcome being tracked, such as a first heart-failure hospitalization; the clock stops for that patient on the day the event occurs."
      },
      {
        "term": "censoring",
        "definition": "What happens when a patient's follow-up ends before the study window closes without the event occurring — for example, they left their insurance plan — so the analyst knows only that they were event-free up to that point."
      },
      {
        "term": "tau (τ)",
        "definition": "The fixed horizon, chosen before any analysis, that caps the window of interest — for example, 1095 days (36 months); RMST counts only event-free days accumulated before tau."
      },
      {
        "term": "hazard",
        "definition": "The instantaneous rate at which the event is occurring at any given moment in time; RMST deliberately avoids assuming this rate stays constant or proportional between groups."
      },
      {
        "term": "survival curve",
        "definition": "A step-shaped line on a graph that starts at 100% and drops each time a patient has the event, showing what fraction of the group remains event-free as time passes; RMST is the area under this line up to tau."
      }
    ],
    "worked_example": {
      "scenario": "A researcher uses Medicare insurance claims to compare two diabetes drugs — a DPP-4 inhibitor versus a sulfonylurea (SU) — on time to first heart-failure hospitalization. She fixes tau = 1095 days (36 months) before looking at any data, because 36 months is the payer's standard budget-planning horizon. She identifies four representative patients — two per drug arm — and tracks each one from their first fill (day 0) until either their hospitalization or the end of follow-up, whichever comes first. She wants to know: on average, how many event-free days did each arm accumulate before day 1095?",
      "dataset": {
        "caption": "One analytic row per patient as an analyst would see it after building the cohort. fu_days = days from first fill to event or censoring; event = 1 if hospitalized, 0 if censored (left insurance or reached day 1095 without the event).",
        "columns": [
          "person_id",
          "arm",
          "index_date",
          "fu_days",
          "event"
        ],
        "rows": [
          [
            3001,
            "DPP-4",
            "2020-01-15",
            1095,
            0
          ],
          [
            3002,
            "DPP-4",
            "2020-03-02",
            820,
            1
          ],
          [
            3003,
            "SU",
            "2020-01-20",
            1095,
            0
          ],
          [
            3004,
            "SU",
            "2020-02-10",
            480,
            1
          ]
        ]
      },
      "steps": [
        "Each patient's contribution to RMST is the number of event-free days they add to their arm's running total, capped at tau (1095 days).",
        "Patient 3001 (DPP-4): reached day 1095 without a hospitalization — contributes all 1095 event-free days.",
        "Patient 3002 (DPP-4): was hospitalized on day 820 — contributes 820 event-free days (the days before the event count; the event day itself signals the end).",
        "Patient 3003 (SU): also reached day 1095 without a hospitalization — contributes 1095 event-free days.",
        "Patient 3004 (SU): was hospitalized on day 480 — contributes 480 event-free days.",
        "DPP-4 arm average: (1095 + 820) / 2 = 957.5 event-free days per patient.",
        "SU arm average: (1095 + 480) / 2 = 787.5 event-free days per patient.",
        "RMST difference: 957.5 − 787.5 = 170 event-free days — the average extra time without hospitalization gained on the DPP-4 arm over this tiny sample.",
        "In the full study (source YAML), the RMST difference across thousands of patients narrows to about 52 days (33.4 months vs 31.7 months), which is why real RMST estimates need large samples — the two-patient average above just illustrates the arithmetic, not the real-world magnitude."
      ],
      "result": [
        {
          "label": "RMST — DPP-4 arm (full study, 36-month window)",
          "value": "1017 days (33.4 months)"
        },
        {
          "label": "RMST — Sulfonylurea arm (full study, 36-month window)",
          "value": "965 days (31.7 months)"
        },
        {
          "label": "RMST difference (DPP-4 minus SU)",
          "value": "52 days (~1.7 months) of additional event-free time on average"
        }
      ],
      "timeline_spec": {
        "title": "RMST over a 1095-day (36-month) window — two patients per arm, heart-failure hospitalization endpoint",
        "caption": "Each patient's bar runs from day 0 (first fill) until hospitalization (filled triangle) or the end of the window (open circle). RMST for each arm is the average length of the event-free bars. The DPP-4 arm averages 957.5 days of event-free time in this two-patient slice; the SU arm averages 787.5 days.",
        "alt_text": "Horizontal timeline from day 0 to day 1095 showing four patient bars. DPP-4 patients (top two): Patient 3001's bar runs the full 1095 days ending in an open circle; Patient 3002's bar ends at day 820 with a filled triangle (hospitalization). SU patients (bottom two): Patient 3003's bar runs the full 1095 days; Patient 3004's bar ends at day 480 with a filled triangle. The average bar length for the DPP-4 arm is visibly longer than for the SU arm.",
        "window": {
          "start_day": 0,
          "end_day": 1095,
          "label": "Tau = 1095 days (36-month policy horizon, fixed before analysis)"
        },
        "arms": [
          {
            "label": "DPP-4 inhibitor arm",
            "patients": [
              {
                "person_id": "3001",
                "start_day": 0,
                "end_day": 1095,
                "marker": "censor",
                "marker_label": "Still event-free at tau"
              },
              {
                "person_id": "3002",
                "start_day": 0,
                "end_day": 820,
                "marker": "event",
                "marker_label": "Hospitalized day 820"
              }
            ],
            "rmst_days": 957.5
          },
          {
            "label": "Sulfonylurea (SU) arm",
            "patients": [
              {
                "person_id": "3003",
                "start_day": 0,
                "end_day": 1095,
                "marker": "censor",
                "marker_label": "Still event-free at tau"
              },
              {
                "person_id": "3004",
                "start_day": 0,
                "end_day": 480,
                "marker": "event",
                "marker_label": "Hospitalized day 480"
              }
            ],
            "rmst_days": 787.5
          }
        ],
        "spans": [
          {
            "kind": "followup",
            "arm": "DPP-4 inhibitor arm",
            "person_id": "3001",
            "start_day": 0,
            "end_day": 1095,
            "label": "1095 event-free days"
          },
          {
            "kind": "followup",
            "arm": "DPP-4 inhibitor arm",
            "person_id": "3002",
            "start_day": 0,
            "end_day": 820,
            "label": "820 event-free days"
          },
          {
            "kind": "followup",
            "arm": "Sulfonylurea (SU) arm",
            "person_id": "3003",
            "start_day": 0,
            "end_day": 1095,
            "label": "1095 event-free days"
          },
          {
            "kind": "followup",
            "arm": "Sulfonylurea (SU) arm",
            "person_id": "3004",
            "start_day": 0,
            "end_day": 480,
            "label": "480 event-free days"
          }
        ],
        "result": [
          {
            "label": "DPP-4 arm RMST (2-patient slice)",
            "value": 957.5
          },
          {
            "label": "SU arm RMST (2-patient slice)",
            "value": 787.5
          },
          {
            "label": "RMST difference (DPP-4 minus SU, 2-patient slice)",
            "value": 170.0
          },
          {
            "label": "Full-study RMST difference (~1.7 months, consistent with source data)",
            "value": 52
          }
        ]
      }
    },
    "prerequisites": [
      "cumulative-incidence-risk-rwe",
      "cox-ph-regression",
      "competing-risks-cause-specific-fine-gray-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Non-parametric (KM-based) RMST",
        "description": "Integrate the Kaplan-Meier step function from 0 to tau by summing S(t_k)*(t_{k+1}-t_k). Inference via the closed-form Uno (2014) variance or a patient-level bootstrap. No modeling beyond independent censoring within tau.",
        "edge_cases": [
          "tau exceeding the largest observed event/censoring time in an arm forces tail extrapolation; bound tau by the minimum of the two arms' last observed times.",
          "Heavy censoring before tau widens the variance sharply; report the number at risk at tau and the proportion of follow-up time contributed beyond the last event."
        ],
        "data_source_notes": "claims: integrate on days (finest unit), not rounded months; censor at the earliest of disenrollment, MA switch, death, and study end; report number at risk at tau in both arms."
      },
      {
        "name": "Covariate-adjusted RMST regression (pseudo-observations / IPTW)",
        "description": "Compute jackknife pseudo-observations of RMST(tau) per patient and regress them on treatment and confounders via GEE with an identity link (difference scale) or log link (ratio scale); or estimate IPTW-weighted KM areas. Targets an adjusted ATE/ATT RMST contrast.",
        "edge_cases": [
          "Validity requires correct outcome or propensity-score model specification; combine PS weighting with pseudo-value regression for a doubly robust estimator.",
          "Pseudo-observations near 0 or tau can be unstable with very heavy censoring; check leverage and consider flexible parametric RMST instead."
        ],
        "data_source_notes": "claims: build the high-dimensional propensity score and all covariates only from the pre-index baseline window; weight or adjust identically in both arms."
      },
      {
        "name": "RMST under competing risks (restricted mean time-in-state / life-years lost)",
        "description": "Replace the censored survival curve with the Aalen-Johansen cumulative incidence function for the event of interest (and for the competing event), and integrate to tau to obtain restricted mean event-free time and expected time lost to each cause.",
        "edge_cases": [
          "Treating the competing event as independent censoring overstates event-free time; the bias is differential when the competing event rate differs by arm."
        ],
        "data_source_notes": "Oncology / elderly claims: define progression or next-line initiation as the event and death or disenrollment as the competing event; reconcile event vs death dates from the mortality source hierarchy before integrating."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "cox-ph-regression",
        "pros_of_this": "No proportional-hazards assumption; reported in interpretable time units (months gained); robust to crossing or delayed-separation curves, heavy administrative censoring, and sparse influential late events.",
        "cons_of_this": "Requires an a priori tau (substantive, not statistical); modestly less efficient than Cox when PH truly holds; the hazard ratio remains more familiar to many clinical audiences.",
        "when_to_prefer": "When PH is violated, doubtful, or untestable, when absolute time gained within a fixed horizon is the decision-relevant quantity, or when censoring is heavy before the median."
      },
      {
        "compared_to": "survival-extrapolation-hta-rwe",
        "pros_of_this": "Stays inside observed follow-up (t <= tau) and makes no parametric tail assumptions; the honest \"minimum months gained\" estimate.",
        "cons_of_this": "Deliberately does not answer the lifetime-horizon question; benefit accruing after tau is truncated and invisible.",
        "when_to_prefer": "Within-data effect summaries and HTA minimum-benefit claims where tail extrapolation would be speculative."
      },
      {
        "compared_to": "competing-risks-cause-specific-fine-gray-rwe",
        "pros_of_this": "Yields an interpretable absolute time quantity (restricted mean event-free time / life-years lost) rather than a subdistribution hazard ratio.",
        "cons_of_this": "Like all marginal summaries it does not separate cause-specific etiology from competing-event dynamics; the estimand (event-free time vs life-years lost) must be pre-specified.",
        "when_to_prefer": "When stakeholders need absolute time-in-state under competing risks rather than a hazard contrast."
      },
      {
        "compared_to": "clone-censor-weight-per-protocol",
        "pros_of_this": "Far simpler to specify and communicate for an initiation (ITT-like) contrast.",
        "cons_of_this": "Does not handle time-varying confounding, switching, or dynamic per-protocol strategies on its own.",
        "when_to_prefer": "When the estimand is an initiation contrast; otherwise use RMST as the summary measure inside a clone-censor-weight or MSM analysis."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Fix tau a priori on the reimbursement/value horizon (e.g. 24 or 36 months), never from the data. Integrate the KM on days. Censor at the earliest of disenrollment, MA switch (FFS claims stop), death, and study end; treat MA person-time as censored, not as event-free follow-up. Use the Aalen-Johansen CIF when non-event death competes (differential in elderly arms). For an adjusted contrast use pseudo-observation regression or IPTW on a baseline-only propensity score. Report number at risk at tau and sensitivity to tau +/- 6-12 months.",
      "ehr": "Event dates are sharper but capture is visit-driven; use last-contact or end of coverage as the censor and consider inverse-probability-of-observation weighting when visit intensity differs by arm, otherwise the more closely surveilled arm has artificially earlier events and downward-biased RMST.",
      "registry": "Usually the longest, cleanest, adjudicated follow-up and the best substrate for RMST and for validating claims-based RMST; still pre-specify tau before plotting and confirm complete vital-status ascertainment so censoring is non-informative.",
      "linked": "Linked claims-EHR-vital-records gives EHR severity + claims completeness + reliable mortality for competing-risk RMST, but introduces linkage selection and order/fill/service date discrepancies that must be reconciled before fixing the time axis."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\nimport pandas as pd\nimport statsmodels.api as sm\nimport statsmodels.formula.api as smf\nfrom lifelines import KaplanMeierFitter\nfrom lifelines.utils import restricted_mean_survival_time\n\nTAU = 1095  # days = 36 months, fixed a priori on the value horizon\n\ndef rmst_by_arm(df: pd.DataFrame, tau: int = TAU) -> dict:\n    \"\"\"KM-area RMST(tau) within each arm; tau must be <= each arm's last observed time.\"\"\"\n    out = {}\n    for a, g in df.groupby(\"arm\"):\n        kmf = KaplanMeierFitter().fit(g[\"fu_time\"], g[\"event\"])\n        out[a] = restricted_mean_survival_time(kmf, t=tau)\n    return out\n\ndef rmst_diff_bootstrap(df: pd.DataFrame, tau: int = TAU, n_boot: int = 2000, seed: int = 1) -> dict:\n    \"\"\"Unadjusted RMST difference (arm 1 - arm 0) with a patient-level bootstrap percentile CI.\"\"\"\n    rng = np.random.default_rng(seed)\n    point = rmst_by_arm(df, tau)\n    diff = point[1] - point[0]\n    ids = df[\"person_id\"].to_numpy()\n    boots = np.empty(n_boot)\n    for b in range(n_boot):\n        samp = df.iloc[rng.integers(0, len(df), len(df))]   # resample patients with replacement\n        r = rmst_by_arm(samp, tau)\n        boots[b] = r.get(1, np.nan) - r.get(0, np.nan)\n    lo, hi = np.nanpercentile(boots, [2.5, 97.5])\n    return {\"rmst_arm1\": point[1], \"rmst_arm0\": point[0], \"rmst_diff\": diff, \"ci95\": (lo, hi)}\n\ndef km_area_rmst(times: np.ndarray, events: np.ndarray, tau: int) -> float:\n    \"\"\"RMST(tau) = area under the Kaplan-Meier curve from 0 to tau for one sample.\"\"\"\n    kmf = KaplanMeierFitter().fit(times, events)\n    return restricted_mean_survival_time(kmf, t=tau)\n\ndef rmst_pseudo_values(df: pd.DataFrame, tau: int = TAU) -> np.ndarray:\n    \"\"\"Correct leave-one-out jackknife pseudo-observations of RMST(tau).\n\n    For each subject i: pseudo_i = N*RMST_all - (N-1)*RMST_without_i, where each RMST is the\n    KM-area to tau computed on the full sample and on the 
sample excluding subject i. This is the\n    Andersen-Klein definition; a censored subject does NOT simply contribute its censoring time.\n    \"\"\"\n    times  = df[\"fu_time\"].to_numpy()\n    events = df[\"event\"].to_numpy()\n    n = len(df)\n    rmst_all = km_area_rmst(times, events, tau)\n    keep = np.ones(n, dtype=bool)\n    pseudo = np.empty(n)\n    for i in range(n):\n        keep[i] = False\n        rmst_minus_i = km_area_rmst(times[keep], events[keep], tau)\n        pseudo[i] = n * rmst_all - (n - 1) * rmst_minus_i\n        keep[i] = True\n    return pseudo\n\ndef adjusted_rmst_regression(df: pd.DataFrame, covars: list[str], tau: int = TAU):\n    \"\"\"Covariate-adjusted RMST difference via pseudo-observations + GEE (identity link).\n\n    Pseudo-observations theta_i are regressed on arm + covariates; the 'arm' coefficient is the\n    adjusted RMST difference. Use independence working correlation since rows are one-per-patient.\n    \"\"\"\n    df = df.copy()\n    df[\"pseudo\"] = rmst_pseudo_values(df, tau)\n    formula = \"pseudo ~ arm + \" + \" + \".join(covars)\n    gee = smf.gee(formula, groups=df[\"person_id\"], data=df,\n                  family=sm.families.Gaussian(), cov_struct=sm.cov_struct.Independence()).fit()\n    return gee  # gee.params['arm'] is the adjusted RMST difference (days); gee.conf_int() for the CI",
        "description": "Non-parametric RMST(tau), bootstrap CI for the RMST difference, and a covariate-adjusted RMST regression via\njackknife pseudo-observations. Required input: one analytic row per patient (already constructed upstream) with\n  person_id  : patient id\n  arm        : 1 = study drug, 0 = comparator\n  fu_time    : follow-up time in DAYS from index to event or censoring\n  event      : 1 = event of interest observed, 0 = censored\n  age, sex, comorb_score, ... : baseline covariates measured only in the pre-index window\nTAU is fixed a priori (here 1095 days = 36 months) and must not exceed the smaller of the two arms' last observed times.",
        "dependencies": [
          "pandas",
          "numpy",
          "lifelines",
          "statsmodels"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(survRM2)\nlibrary(pseudo)\nlibrary(geepack)\n\ntau <- 1095  # days = 36 months, fixed a priori\n\n## 1) Unadjusted RMST in each arm + difference and ratio with closed-form (Uno) CIs.\nfit <- rmst2(time = df$fu_time, status = df$event, arm = df$arm, tau = tau)\nprint(fit)                       # RMST per arm, RMST difference, RMST ratio, 95% CIs, p-values\nrmst_diff <- fit$unadjusted.result[\"RMST (arm=1)-(arm=0)\", \"Est.\"]\n\n## 2) Covariate-adjusted RMST difference via jackknife pseudo-observations + GEE (identity link).\n##    pseudomean() returns the leave-one-out RMST pseudo-value for each subject at tmax = tau.\ndf$pseudo <- pseudomean(time = df$fu_time, event = df$event, tmax = tau)\ndf <- df[order(df$person_id), ]\ngee_fit <- geeglm(pseudo ~ arm + age + sex + comorb_score,\n                  id = person_id, data = df,\n                  family = gaussian(\"identity\"), corstr = \"independence\")\nsummary(gee_fit)                 # coefficient on 'arm' = adjusted RMST difference (days); use a log link for the ratio",
        "description": "Unadjusted RMST difference with the survRM2 package and covariate-adjusted RMST regression via pseudo-observations\nand GEE. Required input: one analytic row per patient with\n  fu_time : follow-up time in DAYS from index to event/censoring\n  event   : 1 = event observed, 0 = censored\n  arm     : 1 = study drug, 0 = comparator\n  age, sex, comorb_score, ... : baseline covariates from the pre-index window only\ntau is fixed a priori and must not exceed the smaller of the two arms' last observed event/censoring times.",
        "dependencies": [
          "survRM2",
          "pseudo",
          "geepack"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "%let tau = 1095;  /* days = 36 months, fixed a priori on the value horizon */\n\n/* 1) Non-parametric RMST per arm + RMST difference (SAS/STAT 15.1+). */\nods output RMST = rmst_est RMSTDiff = rmst_diff;\nproc lifetest data=work.analytic rmst(tau=&tau);\n  time fu_time*event(0);\n  strata arm;\nrun;\n/* rmst_est: RMST and SE per arm; rmst_diff: between-arm RMST difference, SE, and 95% CI. */\n\n/* 2) Covariate-adjusted RMST difference via pseudo-observations.\n      Compute jackknife RMST pseudo-values with the published %PSEUDOSURV / %RMSTREG macro\n      (Andersen & Pohar Perme), which adds variable PSEUDO = individual RMST(tau) pseudo-observation. */\n%pseudosurv(indata=work.analytic, time=fu_time, dead=event, howmany=&sysnobs,\n            datatau=&tau, outdata=work.pseudo);\n\n/* Regress the pseudo-observations on arm + covariates (identity link = difference scale). */\nproc genmod data=work.pseudo;\n  class person_id arm(ref='0') sex;\n  model pseudo = arm age sex comorb_score / dist=normal link=identity;\n  repeated subject=person_id / type=ind;   /* GEE sandwich SEs for the pseudo-value regression */\n  estimate 'Adjusted RMST diff (study - comparator), days' arm 1 -1;\nrun;",
        "description": "Unadjusted RMST and RMST difference with PROC LIFETEST (RMST option, SAS/STAT 15.1+), and a covariate-adjusted RMST\ndifference via pseudo-observations regressed in PROC GENMOD. Required input: WORK.ANALYTIC with one row per patient -\n  person_id, arm (1/0), fu_time (DAYS to event/censoring), event (1=event, 0=censored),\n  and baseline covariates (age sex comorb_score ...) measured in the pre-index window only.\nConfirm RMST(tau) does not exceed either arm's last observed time before interpreting.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": "restricted-mean-survival-time-rmst-timeline.svg",
        "mermaid": null,
        "caption": "Each patient's bar runs from day 0 (first fill) until hospitalization (filled triangle) or the end of the window (open circle). RMST for each arm is the average length of the event-free bars. The DPP-4 arm averages 957.5 days of event-free time in this two-patient slice; the SU arm averages 787.5 days.",
        "alt_text": "Horizontal timeline from day 0 to day 1095 showing four patient bars. DPP-4 patients (top two): Patient 3001's bar runs the full 1095 days ending in an open circle; Patient 3002's bar ends at day 820 with a filled triangle (hospitalization). SU patients (bottom two): Patient 3003's bar runs the full 1095 days; Patient 3004's bar ends at day 480 with a filled triangle. The average bar length for the DPP-4 arm is visibly longer than for the SU arm.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  KMA[\"KM curve, Arm A (study)\"] --> AreaA[\"Area under S_A(t) from 0 to tau = RMST_A\"]\n  KMB[\"KM curve, Arm B (comparator)\"] --> AreaB[\"Area under S_B(t) from 0 to tau = RMST_B\"]\n  AreaA --> Diff[\"RMST difference = RMST_A - RMST_B\"]\n  AreaB --> Diff\n  Diff --> Clin[\"Average event-free MONTHS gained over the first tau\"]",
        "caption": "RMST(tau) is the area under each group's survival curve up to the horizon tau; the RMST difference is the gap between the two areas, read directly as average event-free time gained.",
        "alt_text": "Two Kaplan-Meier curves each integrated to tau into RMST_A and RMST_B, combined into an RMST difference interpreted as months of event-free time gained.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Q1{\"Proportional hazards credible?\"} -->|Yes| Q2{\"Is a relative instantaneous-risk contrast the target?\"}\n  Q1 -->|\"No / untestable / crossing curves\"| RMST\n  Q2 -->|Yes| HR[\"Report hazard ratio (Cox)\"]\n  Q2 -->|\"No - absolute time wanted\"| RMST\n  RMST[\"Use RMST at a pre-specified tau\"] --> Q3{\"Competing events present?\"}\n  Q3 -->|Yes| CIF[\"RMST from Aalen-Johansen CIF<br/>(restricted mean time-in-state / life-years lost)\"]\n  Q3 -->|No| KM[\"RMST from Kaplan-Meier area\"]\n  CIF --> Adj{\"Confounded observational data?\"}\n  KM --> Adj\n  Adj -->|Yes| PV[\"Pseudo-observation regression or IPTW-weighted RMST\"]\n  Adj -->|No| Marg[\"Unadjusted RMST difference + number at risk at tau\"]",
        "caption": "Decision logic for choosing RMST over the hazard ratio in RWE, and for selecting the competing-risk and confounding-adjustment variant once RMST is chosen.",
        "alt_text": "Decision tree starting from whether proportional hazards holds, branching to hazard ratio versus RMST, then to competing-risk handling (CIF versus KM) and confounding adjustment (pseudo-observation/IPTW versus unadjusted).",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "alternative_to",
        "target_slug": "cox-ph-regression",
        "notes": "RMST is the primary alternative to the Cox hazard ratio when proportional hazards is violated or doubtful; it gives an absolute time-scale contrast that needs no PH assumption."
      },
      {
        "relation_type": "part_of",
        "target_slug": "estimands-ate-att-intercurrent-events-rwe",
        "notes": "The RMST difference (or ratio) is one valid summary measure (attribute 5) within a fully specified estimand, pairing naturally with treatment-policy or hypothetical intercurrent-event strategies."
      },
      {
        "relation_type": "used_with",
        "target_slug": "competing-risks-cause-specific-fine-gray-rwe",
        "notes": "Under competing risks, RMST is computed from the Aalen-Johansen cumulative incidence function as restricted mean time-in-state / expected life-years lost, rather than from censored Kaplan-Meier."
      },
      {
        "relation_type": "tradeoff_with",
        "target_slug": "survival-extrapolation-hta-rwe",
        "notes": "RMST stays inside observed follow-up (no tail assumptions) while extrapolation targets lifetime mean survival for cost-utility models; choose by whether the decision needs a within-data or a lifetime horizon."
      },
      {
        "relation_type": "used_with",
        "target_slug": "clone-censor-weight-per-protocol",
        "notes": "For sustained or per-protocol strategies under time-varying confounding, RMST can be the summary measure inside a clone-censor-weight or marginal structural model analysis."
      },
      {
        "relation_type": "used_with",
        "target_slug": "predictive-and-causal-ml-models-rwe",
        "notes": "Flexible survival and causal ML models can estimate covariate-adjusted or heterogeneous RMST differences (CATE on the time scale)."
      },
      {
        "relation_type": "see_also",
        "target_slug": "therapeutic-area-specific-rwe-challenges-oncology",
        "notes": "Oncology is the highest-volume RMST use case in RWE because of delayed separation, crossing curves, switching at progression, and the policy relevance of months gained within 2-5 year horizons."
      }
    ],
    "aliases": [
      "RMST",
      "restricted mean survival",
      "RMST difference",
      "RMST ratio",
      "area under the survival curve",
      "mean survival time up to tau",
      "restricted mean time lost"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "revenue-center-codes",
    "name": "Revenue (Center) Codes",
    "short_definition": "NUBC-maintained 4-digit codes assigned to each line of an institutional (UB-04) claim that identify the hospital department or cost center where charges were incurred — the primary field that tells a researcher whether a given claim line represents emergency care, intensive care, pharmacy, surgery, physical therapy, or any other hospital service category.",
    "long_description": "**Revenue (center) codes** are the line-item classifier on every institutional claim\nsubmitted on the UB-04 form (CMS-1450). The National Uniform Billing Committee (NUBC),\nhoused within the American Hospital Association, maintains the complete code set, which is\npublished in the copyrighted UB-04 Data Specifications Manual and updated annually. Each\nclaim line carries exactly one 4-digit revenue code in Form Locator 42 (FL 42); the code\nnames the hospital department or cost center responsible for the charge. Think of revenue\ncodes as the \"where and what type\" field of an institutional claim: they locate the\nservice within the hospital's internal accounting structure. The HCPCS code on the same\nline (FL 44), when present, then says \"what specific procedure or item.\"\n\n**Core conceptual distinction: revenue code versus HCPCS versus diagnosis code.**\nRevenue codes operate at the claim-line level and classify setting and service type;\nthey are not procedure codes and are not diagnosis codes. The clinical meaning of a\nrevenue-code-only line (without an accompanying HCPCS) is intentionally vague — for\nexample, revenue code 0250 on a pharmacy line says \"a pharmacy charge appeared on this\nclaim\" but does not identify the drug. Procedure-level specificity on outpatient\ninstitutional claims requires the paired HCPCS; on inpatient claims, HCPCS is typically\nabsent, so revenue codes mark departments but do not identify procedures. This asymmetry\nis one of the most consequential structural differences between inpatient and outpatient\ninstitutional data in a research database.\n\n**THE FORMAT TRAP: 3-digit versus 4-digit storage.** Officially, revenue codes are\n4 digits with a leading zero: 0450 for emergency room, not 450. However, many research\ndatabases and data warehouses strip or ignore the leading zero and store the code as a\n3-digit integer or character field (e.g., 450, 636, 250). 
Code written to filter on\n`rev_code = '0450'` will return zero rows when the field contains `'450'` — and vice\nversa. Any analysis pipeline must normalize representations before applying filters.\n\n**Illustrative code families (public CMS/ResDAC-documented examples only).** The NUBC\ncode set has several hundred entries; the following families are documented in CMS\nregulations, ResDAC variable documentation, and publicly available CMS transmittals and\nare representative of how revenue codes are used in research. The full code set is in the\ncopyrighted UB-04 manual and is not reproduced here.\n\n- **045x — Emergency Room.** Revenue codes 0450–0459 identify emergency department\n  charges on outpatient institutional claims. Code 0450 is the general ED code; 0451\n  (EMTALA emergency medical screening), 0452 (ED services beyond EMTALA screening), 0456\n  (urgent care), and 0459 (other emergency room) appear in various payer contexts.\n  Identifying ED visits on outpatient facility claims requires filtering for the 045x\n  family; combining this with ED evaluation-and-management CPT codes (99281–99285 on the\n  professional claim) forms the standard two-pronged algorithm used in the Medicare\n  pharmacoepidemiology literature.\n- **020x — Intensive Care.** Revenue codes 0200–0209 classify intensive care unit\n  charges. Code 0200 is general ICU; 0201 is surgical ICU; 0202 is medical ICU; 0206 is\n  intermediate ICU; 0209 is other ICU. Because each inpatient revenue code line typically\n  reflects one or more days of service, counting 020x lines (or summing their unit counts,\n  where the data source provides them) is the usual claims-based approach to approximating\n  ICU-day exposure.\n- **036x — Operating Room.** Revenue codes 0360–0369 classify operating room\n  services. Their presence on an inpatient claim indicates a surgical\n  admission, even when no ICD-10-PCS procedure code is recorded in the MedPAR fields\n  (which can occur for outpatient surgeries billed to outpatient institutional claims).\n- **025x — Pharmacy.** Revenue codes 0250–0259 flag general pharmacy charges. 
On\n  outpatient institutional claims they signal that the hospital dispensed or administered\n  a drug, but without a paired HCPCS code the specific drug is unknown. Most research\n  applications require moving to the 063x family for drug identification.\n- **063x / 0636 — Drugs Requiring Detailed Coding.** Revenue code 0636 is the key\n  revenue code for identifying provider-administered drugs on outpatient institutional\n  claims. When a hospital administers a drug that requires HCPCS-level coding (infused\n  biologics, chemotherapy agents, anticoagulants — the vast majority of Part B–covered\n  drugs), the claim line carries revenue code 0636 paired with the HCPCS J-code (e.g.,\n  J0171 for adrenalin, J0285 for amphotericin B, J9271 for pembrolizumab). The NDC may\n  also appear on a separate line or as a qualifier. This 0636 + J-code pairing is\n  foundational to oncology real-world evidence, site-of-care research, and any study\n  that must identify specific infused drugs on facility claims.\n- **042x — Physical Therapy.** Revenue codes 0420–0429 classify physical therapy\n  charges. Their presence on an outpatient institutional claim provides a revenue-code-\n  based confirmation of physical therapy utilization without requiring a CPT match.\n- **076x — Treatment/Observation Room.** Revenue codes 0760–0769 cover treatment room\n  and observation room services. Revenue code 0762 is the specific code for\n  observation-room services. Its presence on an outpatient institutional claim is one\n  of the two key signals (alongside the presence of a HCPCS G-code for observation)\n  for identifying observation status stays — the ambiguous inpatient/outpatient gray\n  zone that has significant Medicare beneficiary cost-sharing implications and is a\n  frequent research classification challenge.\n\n**RWE uses of revenue codes.** Revenue codes are used for at least five distinct\nanalytic tasks in real-world evidence research:\n\n1. 
**ED visit identification.** The 045x revenue-code filter on outpatient institutional\n   claims, combined with ED E/M CPT codes (99281–99285) on professional claims from the\n   same service date, defines the standard two-pronged algorithm for identifying\n   emergency department encounters in Medicare and commercial claims.\n2. **Observation stay classification.** Revenue code 0762 combined with presence on an\n   outpatient institutional claim (not an inpatient MedPAR record) is the standard\n   approach to distinguishing observation stays from both true inpatient admissions and\n   standard outpatient visits — a classification with real implications for beneficiary\n   cost-sharing under Medicare.\n3. **Provider-administered drug identification on outpatient institutional claims.**\n   Revenue code 0636 + HCPCS J-code is the standard line-pairing for identifying\n   specific infused or injected drugs billed to the Part B facility benefit. This is\n   the claims-based method for oncology drug attribution, biologic infusion tracking,\n   and site-of-care analyses comparing hospital outpatient versus physician office\n   administration.\n4. **Cost decomposition by department.** Summing allowed amounts by revenue code family\n   produces a department-level cost breakdown: pharmacy vs ICU vs operating room vs\n   ED vs physical therapy. This is the standard approach to decomposing facility costs\n   in burden-of-disease and comparative cost studies.\n5. **ICU exposure ascertainment.** Counting the number of claim lines with 020x revenue\n   codes provides an approximation of ICU days on inpatient stays, which is relevant\n   to severity adjustment and for defining ICU exposure in critical-care research.\n\n**Pros, cons, and trade-offs.**\n\n- **vs Place-of-Service (POS) codes:** Both classify the setting of care, but they do\n  so on different claim types. Revenue codes appear on institutional (facility) claims\n  — inpatient and outpatient UB-04 billing. 
POS codes appear on professional (physician)\n  claims — the CMS-1500 form submitted by physicians, NPPs, and other non-facility\n  providers. For a given patient encounter, the facility bills on a UB-04 with revenue\n  codes; the attending physician bills separately on a CMS-1500 with a POS code. Neither\n  field appears on the other form type. When doing a setting identification for an ED\n  visit, an analyst needs BOTH: 045x revenue codes to find the facility claim and POS 23\n  (emergency room) to find the professional claim from the same encounter. **Prefer\n  revenue codes** for facility-side cost decomposition; **prefer POS codes** for\n  professional-claim setting classification; combine both for encounter-level setting\n  determination.\n- **vs CPT/HCPCS on professional claims:** Revenue codes classify service type at the\n  department level; CPT/HCPCS on professional claims identify the specific procedure\n  performed. On the outpatient institutional side, the revenue code + HCPCS pairing\n  provides both department context and procedure specificity. On the inpatient\n  institutional side, revenue codes provide department context but HCPCS is almost\n  always absent, so procedure identification must rely on ICD-10-PCS codes (which are\n  recorded in the claim header, not on line items). This means a researcher cannot\n  identify specific procedures on inpatient facility claims using revenue codes.\n- **vs MS-DRG classification:** MS-DRGs summarize the entire inpatient hospitalization\n  into a single payment group based on principal diagnosis and procedure. Revenue codes\n  are line-item department codes. They capture different dimensions of the same stay:\n  DRG tells you the type of case; revenue codes tell you which departments were involved.\n  Both are present on inpatient institutional claims and are complementary, not\n  substitutes. 
For cost decomposition, revenue codes add granularity that DRGs aggregate\n  away.\n\n**When to use revenue codes in research.**\n- To identify the setting of service on institutional claims (ED, ICU, OR, PT) when no\n  validated ICD-based or CPT-based algorithm is available for that setting.\n- To identify provider-administered drugs on outpatient institutional claims via the\n  0636 + J-code pairing.\n- To classify hospital observation stays using revenue code 0762.\n- To decompose facility claim costs by department for cost-of-illness or burden studies.\n- To estimate ICU exposure days using 020x line counts on inpatient claims.\n- Whenever institutional claims are the primary or supplementary data source and\n  service-type or department-level granularity is required.\n\n**When NOT to use revenue codes — and when they are actively misleading or dangerous.**\n- **Do not use revenue codes as a procedure identifier on inpatient claims.** Inpatient\n  lines rarely carry HCPCS codes; a revenue code of 036x (operating room) confirms a\n  surgical admission but does not identify the specific procedure. Using revenue code\n  presence alone as a proxy for procedure type introduces unacceptably broad\n  misclassification on inpatient data. Use ICD-10-PCS codes (from the claim header) for\n  inpatient procedure identification.\n- **Do not treat revenue code charges as payments.** The dollar amount on a revenue code\n  line is the billed charge — the chargemaster amount before contract discounts and\n  payer adjustments. Charges overstate true costs severalfold and vary by institution.\n  For cost analyses, use allowed amounts or paid amounts, not charges. 
Revenue codes\n  are still the correct unit for decomposing those allowed amounts by department.\n- **Do not apply 4-digit filters to 3-digit fields without normalization.** Failing to\n  handle the leading-zero representation difference between data sources is one of the\n  most common and silent errors in institutional claims analysis. The filter\n  `rev_code = '0450'` returns zero rows in a database that stores `'450'` — and a\n  researcher who does not check row counts will not notice.\n- **Do not rely on revenue codes alone to identify drugs on inpatient claims.** Inpatient\n  drug charges appear on 025x lines without HCPCS codes in most datasets. Drug\n  identification on inpatient claims requires supplementary data (e.g., a hospital\n  pharmacy or 340B data linkage) or an NDC-based match, not a revenue code filter.\n- **Do not assume revenue code usage is uniform across payers or facilities.** While the\n  NUBC defines the standard, local and payer-specific coding practices mean that a\n  revenue code family may be used differently at different institutions or for different\n  payer contracts. A code that reliably identifies observation stays in Medicare data\n  may be coded differently in commercial claims from the same hospital. Sensitivity\n  analyses using multiple identification approaches are recommended when observation\n  classification is central to the research question.\n\n**Data-source operational depth.**\n- **Medicare FFS (MedPAR, OPPS outpatient claims, carrier):** Revenue codes appear on\n  the MedPAR inpatient file (the revenue center section, covering departments) and on\n  the outpatient institutional claims file. The outpatient file contains the 0636\n  + J-code drug lines critical to Part B drug identification, the 045x ED lines for\n  ED visit algorithms, and the 0762 observation lines. Revenue codes are absent from\n  the carrier (professional) file — that file uses POS codes instead. 
MedPAR inpatient\n  lines often lack HCPCS; outpatient institutional lines are more likely to have HCPCS\n  when billable services were performed. The MedPAR revenue center file is a separate\n  extract in some ResDAC releases; verify the join key (beneficiary ID + admission date\n  + provider number) before merging.\n- **Medicare Advantage (MA):** Encounter data submitted by MA plans vary in completeness\n  for revenue codes. The revenue center section may be present but less reliably\n  populated than in FFS claims, particularly for non-risk-adjustment-relevant services.\n  Revenue-code-based algorithms validated on FFS data may have lower sensitivity in MA\n  encounter data; sensitivity analyses restricting to FFS person-time are strongly\n  recommended.\n- **Commercial claims (MarketScan, Optum, IQVIA):** UB-04-based institutional claims\n  include revenue codes with the same general structure as Medicare. However, local\n  payer contractual coding practices mean some revenue code families (particularly 076x\n  observation) may be coded less consistently than in Medicare. The leading-zero\n  representation issue must be verified in each data source independently.",
    "primary_category": "Unknown",
    "tags": [
      "coding-system",
      "data-standard",
      "primitive",
      "claims",
      "institutional",
      "revenue-code",
      "UB-04",
      "NUBC",
      "emergency-department",
      "hospital-billing",
      "cost-decomposition",
      "provider-administered-drugs"
    ],
    "applies_to_study_types": [
      "claims_analysis",
      "cohort_retrospective",
      "case_control",
      "budget_impact",
      "cost_of_illness",
      "healthcare_resource_utilization"
    ],
    "data_sources": [
      "claims"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1111/acem.13140",
        "url": "https://doi.org/10.1111/acem.13140",
        "citation_text": "Venkatesh AK, Mei H, Kocher KE, Granovsky M. Identification of Emergency Department Visits in Medicare Administrative Claims: Approaches and Implications. Academic Emergency Medicine. 2017;24(4):422-431.",
        "year": 2017,
        "authors_short": "Venkatesh et al.",
        "notes": "Seminal methodological paper demonstrating that revenue code 045x on outpatient institutional claims, combined with ED evaluation-and-management CPT codes on professional claims, defines the two-pronged algorithm for identifying emergency department visits in Medicare administrative data. The paper evaluates multiple revenue-code-based and CPT-based approaches and their sensitivity and positive predictive value implications — the primary reference for ED visit identification in claims research."
      },
      {
        "role": "use",
        "doi": "10.1111/jgs.16441",
        "url": "https://doi.org/10.1111/jgs.16441",
        "citation_text": "Powell WR, Kaiksow FA, Kind AJH. What Is an Observation Stay? Evaluating the Use of Hospital Observation Stays in Medicare. Journal of the American Geriatrics Society. 2020;68(7):1568-1572.",
        "year": 2020,
        "authors_short": "Powell et al.",
        "notes": "Uses Medicare claims data — including revenue code 0762 (observation room) on outpatient institutional claims — to characterize observation stay prevalence and trends in a geriatric Medicare population. Illustrates the standard claims-based approach to distinguishing observation stays from inpatient admissions using the 076x revenue code family."
      },
      {
        "role": "demonstrate",
        "doi": "10.1016/j.ajem.2022.02.061",
        "url": "https://doi.org/10.1016/j.ajem.2022.02.061",
        "citation_text": "Peleggi A, Strub B, Kim SJ. Identifying pediatric emergency department visits for aggression using administrative claims data. American Journal of Emergency Medicine. 2022;55:144-149.",
        "year": 2022,
        "authors_short": "Peleggi et al.",
        "notes": "Demonstrates revenue-code-based ED visit identification in a pediatric commercial claims population, using the 045x family as the primary filter for ED encounter identification and illustrating how the approach extends beyond Medicare to commercial and Medicaid institutional claims."
      },
      {
        "role": "explain",
        "doi": null,
        "url": "https://www.nubc.org/",
        "citation_text": "National Uniform Billing Committee (NUBC). UB-04 Data Specifications Manual. American Hospital Association; updated annually.",
        "year": null,
        "authors_short": "NUBC/AHA",
        "notes": "The NUBC, housed within the American Hospital Association, maintains the authoritative and copyrighted revenue code set published in the UB-04 Data Specifications Manual. The complete code list is proprietary; the NUBC website (nubc.org) is the official source for membership, manual access, and code update announcements. Researchers requiring the full code set must obtain the manual through AHA or a licensed distributor."
      }
    ],
    "plain_language_summary": "Revenue codes are 4-digit labels attached to every line of a hospital bill that identify which department inside the hospital provided the service — for example, the emergency room, the ICU, the pharmacy, or the operating room. Analysts use them to figure out what type of care a patient received on an institutional (facility) claim, since a revenue code is often the only way to tell that a particular charge line came from the ER rather than a routine ward. One important watch-out is that the code is officially stored as a 4-digit number with a leading zero (like 0450 for the ER), but many research databases drop that leading zero and store just 450 — so code that searches for 0450 will find nothing if the database uses the shorter version.",
    "key_terms": [
      {
        "term": "revenue center",
        "definition": "A hospital department or cost center identified on each line of a UB-04 institutional claim, telling the analyst which part of the hospital generated the charge on that line."
      },
      {
        "term": "claim line",
        "definition": "A single row within a multi-line institutional or professional claim, each carrying its own service date, charge amount, revenue code (on institutional claims), and optionally a HCPCS/CPT procedure code."
      },
      {
        "term": "leading zero",
        "definition": "The zero that prefixes a revenue code to make it 4 digits (e.g., 0450 rather than 450); many databases strip this zero, so analysts must normalize the field before filtering."
      },
      {
        "term": "cost center",
        "definition": "In hospital accounting, a department or unit whose costs are tracked separately (e.g., pharmacy, ICU, operating room); revenue codes map directly to these accounting units on the claim."
      },
      {
        "term": "UB-04",
        "definition": "The uniform billing form (CMS-1450) used by hospitals and other institutional providers to submit claims to Medicare and most other payers; revenue codes appear in Form Locator 42 on each service line of this form."
      },
      {
        "term": "HCPCS J-code",
        "definition": "A Healthcare Common Procedure Coding System Level II code beginning with the letter J that identifies a specific injectable or infusible drug administered by a provider; on institutional claims, J-codes appear alongside revenue code 0636 to identify the exact drug being billed."
      }
    ],
    "worked_example": {
      "scenario": "A health outcomes researcher is building a study of patients who visited the emergency department (ED) at least once during a 12-month observation window. She pulls the outpatient institutional claims for a synthetic patient, Pat (person_id 2001), and needs to: (1) identify which claim lines represent the ED visit, (2) spot the pharmacy-administered drug lines that should be attributed to the visit, and (3) confirm the arithmetic for the total ED-day claim charge across the relevant lines. The analyst has already confirmed the database stores revenue codes without the leading zero (3-digit form).\n",
      "dataset": {
        "caption": "Synthetic outpatient institutional claim for person_id 2001, service date 2023-09-14. Six revenue code lines from a single UB-04 claim; the database stores revenue codes as 3-digit strings (leading zero stripped).\n",
        "columns": [
          "person_id",
          "service_date",
          "rev_code_raw",
          "rev_code_normalized",
          "hcpcs",
          "charge_amount",
          "line_description"
        ],
        "rows": [
          [
            2001,
            "2023-09-14",
            "450",
            "0450",
            "99284",
            850.0,
            "Emergency room — level 4 E/M"
          ],
          [
            2001,
            "2023-09-14",
            "250",
            "0250",
            "",
            42.0,
            "Pharmacy — general (aspirin, NS flush)"
          ],
          [
            2001,
            "2023-09-14",
            "636",
            "0636",
            "J1885",
            1200.0,
            "Drugs requiring detailed coding — ketorolac injection (J1885)"
          ],
          [
            2001,
            "2023-09-14",
            "301",
            "0301",
            "",
            215.0,
            "Laboratory — chemistry"
          ],
          [
            2001,
            "2023-09-14",
            "324",
            "0324",
            "",
            480.0,
            "Radiology — chest X-ray"
          ],
          [
            2001,
            "2023-09-14",
            "361",
            "0361",
            "",
            310.0,
            "OR services — minor procedure suite"
          ]
        ]
      },
      "steps": [
        "Normalize the revenue code field: add a leading zero to each 3-digit code so all values are 4 digits. rev_code_raw '450' becomes rev_code_normalized '0450'; '636' becomes '0636'; '250' becomes '0250', and so on.",
        "Identify the ED lines: filter for rev_code_normalized starting with '045'. Only line 1 (0450) matches — this is the ED visit line. It carries HCPCS 99284, confirming a level-4 ED evaluation-and-management service.",
        "Identify the provider-administered drug lines: filter for rev_code_normalized = '0636'. Line 3 matches; it carries HCPCS J1885 (ketorolac), which identifies the specific injectable drug. This is the 0636 + J-code pairing pattern.",
        "Identify the general pharmacy line: rev_code_normalized = '0250' matches line 2. There is no HCPCS on this line, so the specific drug(s) are unknown from the claim alone — only that a pharmacy charge was incurred.",
        "Count the ED-related claim lines: lines 1, 2, and 3 are plausibly attributable to the ED visit (EM service + pharmacy + drug injection). Lines 4, 5, and 6 (lab, radiology, OR suite) may or may not be part of the same ED encounter depending on the analytic attribution approach chosen.",
        "Compute the charge total for the two unambiguous ED-setting lines (0450 + 0636): $850.00 + $1200.00 = $2050.00. Remember: these are charges (billed amounts), not allowed or paid amounts."
      ],
      "result": "2 lines match the ED revenue code family (045x) or the 0636 drug line: the 0450 ED line with charge $850.00 and the 0636 drug line with charge $1,200.00. Combined charge for those two lines = $850.00 + $1200.00 = $2050.00. The analyst flags these two lines as the ED visit and drug exposure lines; the 0250 pharmacy line ($42.00) is noted but cannot be attributed to a specific drug without additional data. Total claim charge across all 6 lines = $850.00 + $42.00 + $1200.00 + $215.00 + $480.00 + $310.00 = $3097.00."
    },
    "prerequisites": [
      "claims-analysis"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "ED visit identification (045x two-pronged algorithm)",
        "description": "The standard research algorithm for identifying emergency department visits in Medicare and commercial claims uses two complementary signals. The facility-side signal is revenue code 045x on an outpatient institutional claim. The professional- side signal is an ED evaluation-and-management CPT code (99281–99285) on a carrier claim from the same service date. Using only the facility signal misses ED encounters for beneficiaries whose facility claim lacks a revenue code 045x line (e.g., when the facility bills under a different revenue code). Using only the professional signal misses encounters where a physician does not bill separately. The two-pronged union algorithm maximizes sensitivity; the intersection algorithm maximizes specificity.",
        "edge_cases": [
          "Revenue code 0452 (urgent care) versus 0450 (ER) may be coded differently across facilities; including both in the ED filter increases sensitivity but may introduce urgent care visits into an ED cohort.",
          "In commercial claims, some facilities use different revenue code conventions for freestanding ER versus hospital-based ED; validate the approach in each data source."
        ],
        "data_source_notes": "Medicare FFS: outpatient institutional claims file (revenue code) + carrier file (CPT). Commercial claims: UB-04 institutional (revenue code) + CMS-1500 professional (POS 23 or CPT 99281-99285). Medicare Advantage: revenue code completeness in MA encounter data is lower than FFS; run sensitivity analyses restricted to FFS person-time.\n"
      },
      {
        "name": "Observation stay identification (0762)",
        "description": "Revenue code 0762 on an outpatient institutional claim (i.e., NOT a MedPAR inpatient record) is the standard approach for identifying hospital observation stays. An observation stay is classified as outpatient for Medicare billing purposes, so it appears on the outpatient institutional file rather than the MedPAR inpatient file. A patient admitted to a hospital for two nights as \"observation\" generates outpatient institutional claims with 0762 lines; a patient admitted as an inpatient generates a MedPAR record. This distinction has large cost-sharing implications for Medicare beneficiaries and is a frequent source of misclassification in hospitalization studies that rely only on MedPAR.",
        "edge_cases": [
          "Some patients begin as observation and are converted to inpatient admission mid-stay (two-midnight rule); the claim file used depends on the final billing determination.",
          "Commercial claims may code observation differently from Medicare; validate the 0762 approach against hospital records when possible."
        ],
        "data_source_notes": "Medicare FFS: look for 0762 on the outpatient institutional file to distinguish observation from inpatient. MedPAR records = inpatient. Presence of 0762 on an outpatient claim that spans multiple days with room-and-board charges is a strong signal for an observation stay.\n"
      },
      {
        "name": "Provider-administered drug identification (0636 + J-code)",
        "description": "On outpatient institutional claims, the pairing of revenue code 0636 (drugs requiring detailed coding) with a HCPCS J-code identifies a specific provider-administered injectable or infusible drug. This is the standard approach for oncology drug attribution, biologic tracking, and site-of-care analyses on Part B–billed drugs. Revenue code 0636 without a J-code is uninformative for drug identification; a J-code without a 0636 pairing may also appear (some facilities place J-codes on 0250 lines), so it is best practice to scan all revenue code families for J-codes rather than relying solely on 0636.",
        "edge_cases": [
          "Compounded drugs and multi-drug infusion bags may be coded under a single J-code that does not fully represent the contents.",
          "Some facilities bill drugs under revenue code 0250 (pharmacy) with a J-code rather than 0636; a union approach across both revenue code families improves capture.",
          "NDC-level detail, when present as a line-item qualifier, provides greater specificity than the J-code alone."
        ],
        "data_source_notes": "Medicare FFS: outpatient institutional claims; J-codes pair with 0636 on most lines, but also check 0250 lines for J-codes. Inpatient: drugs on 025x lines typically lack HCPCS in Medicare MedPAR revenue center data; use NDC linkage or supplementary hospital pharmacy data for inpatient drug identification.\n"
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "place-of-service-codes",
        "pros_of_this": "Revenue codes appear on institutional claims and provide line-item department-level granularity (ED, ICU, OR, pharmacy separately); they also carry the J-code pairing needed for drug identification on facility claims.",
        "cons_of_this": "Revenue codes do not appear on professional (physician) claims; POS codes do not appear on institutional claims. Revenue codes require leading-zero normalization because datasets store them as either 3-digit or 4-digit values. Revenue code assignment practices vary across facilities and payers.",
        "when_to_prefer": "Use revenue codes for facility/institutional claims; use POS codes for professional claims; combine both for encounter-level setting determination when facility and professional claims can be linked by service date and provider."
      },
      {
        "compared_to": "cpt-procedure-coding",
        "pros_of_this": "Revenue codes identify the department/service type even when no HCPCS or CPT code is present on the line (e.g., the general ICU-day charge); they are present on all institutional claim lines regardless of whether a billable procedure was performed.",
        "cons_of_this": "Revenue codes do not identify the specific procedure; CPT/HCPCS does. On inpatient claims, the absence of HCPCS makes procedure identification impossible from revenue codes alone.",
        "when_to_prefer": "Use revenue codes to filter by service type/department (was this an ED visit? ICU day?); use CPT/HCPCS when procedure specificity is required."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Always inspect the revenue code field in your specific dataset to determine whether codes are stored as 3-digit or 4-digit strings before writing any filter. Apply leading-zero normalization in the first transformation step (LPAD in SQL, str.zfill(4) in Python, PUT with the Z4. format on a numeric value in SAS) so all downstream code works against 4-digit values. For ED identification, filter for rev_code LIKE '045%' on outpatient institutional claims and combine with ED E/M CPT (99281-99285) on professional claims. For drug identification, join 0636 lines to HCPCS J-codes on the same claim and consider a supplementary 0250-plus-J-code scan. For observation stays, look for 0762 on outpatient institutional claims. For inpatient cost decomposition, aggregate allowed amounts (not charges) by revenue code family using a lookup table of code-to-department mappings.\n",
      "ehr": "EHR systems typically do not expose UB-04 revenue codes directly; facility billing data may be present in the revenue_code fields of a linked claims extract or in a hospital billing module. If working with a linked EHR-claims dataset, join the revenue code fields from the claims layer. In EHR-only workflows, department classification typically uses encounter type or location codes rather than revenue codes.\n"
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\n\n# ------------------------------------------------------------------\n# Sample outpatient institutional claims DataFrame\n# (revenue codes stored as 3-digit strings, as in many research DBs)\n# ------------------------------------------------------------------\ndata = {\n    \"person_id\":     [2001, 2001, 2001, 2001, 2001, 2001],\n    \"service_date\":  [\"2023-09-14\"] * 6,\n    \"rev_code_raw\":  [\"450\", \"250\", \"636\", \"301\", \"324\", \"361\"],\n    \"hcpcs\":         [\"99284\", \"\",    \"J1885\", \"\", \"\", \"\"],\n    \"charge_amt\":    [850.00, 42.00, 1200.00, 215.00, 480.00, 310.00],\n}\ndf = pd.DataFrame(data)\n\n# ------------------------------------------------------------------\n# Step 1: Normalize to 4-digit representation (add leading zero)\n# Always do this FIRST before any revenue code filter\n# ------------------------------------------------------------------\ndf[\"rev_code\"] = df[\"rev_code_raw\"].astype(str).str.zfill(4)\n\n# ------------------------------------------------------------------\n# Step 2: Flag ED lines (045x family)\n# ------------------------------------------------------------------\ndf[\"is_ed_line\"] = df[\"rev_code\"].str.startswith(\"045\")\n\n# ------------------------------------------------------------------\n# Step 3: Flag observation room lines (exactly 0762)\n# ------------------------------------------------------------------\ndf[\"is_obs_line\"] = df[\"rev_code\"] == \"0762\"\n\n# ------------------------------------------------------------------\n# Step 4: Flag provider-administered drug lines (0636 + J-code)\n# J-codes begin with \"J\"; also check 0250 lines that may carry J-codes\n# ------------------------------------------------------------------\ndf[\"has_jcode\"] = df[\"hcpcs\"].str.startswith(\"J\")\ndf[\"is_drug_0636\"] = (df[\"rev_code\"] == \"0636\") & df[\"has_jcode\"]\ndf[\"is_drug_0250\"] = (df[\"rev_code\"] == \"0250\") & 
df[\"has_jcode\"]\ndf[\"is_drug_line\"] = df[\"is_drug_0636\"] | df[\"is_drug_0250\"]\n\n# ------------------------------------------------------------------\n# Step 5: Summary\n# ------------------------------------------------------------------\ned_lines = df[df[\"is_ed_line\"]]\ndrug_lines = df[df[\"is_drug_line\"]]\n\nprint(\"ED lines (045x):\")\nprint(ed_lines[[\"rev_code\", \"hcpcs\", \"charge_amt\"]])\n# rev_code  hcpcs  charge_amt\n# 0450      99284      850.00\n\nprint(\"\\nProvider-administered drug lines (0636/0250 + J-code):\")\nprint(drug_lines[[\"rev_code\", \"hcpcs\", \"charge_amt\"]])\n# rev_code  hcpcs  charge_amt\n# 0636      J1885    1200.00\n\n# Total charge for ED + drug lines (charges, NOT allowed amounts)\ntotal_ed_drug_charge = ed_lines[\"charge_amt\"].sum() + drug_lines[\"charge_amt\"].sum()\n# 850.00 + 1200.00 = 2050.00\nprint(f\"\\nTotal charge (ED + drug lines): ${total_ed_drug_charge:,.2f}\")\n# NOTE: use allowed_amount for cost analyses; charges are billed amounts only",
        "description": "Leading-zero normalization and revenue-code-based claim-line classification for a pandas DataFrame of outpatient institutional claims. Demonstrates: (1) normalizing 3-digit to 4-digit representation; (2) flagging ED lines (045x); (3) flagging observation room lines (0762); (4) flagging provider-administered drug lines (0636 or 0250 with a J-code present); (5) summing billed charges across the ED and drug lines. Uses only stdlib and pandas.\n",
        "dependencies": [
          "pandas"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(dplyr)\nlibrary(stringr)\n\n# ------------------------------------------------------------------\n# Normalization helper — handles integer (450) and character (\"450\")\n# storage formats; pads to exactly 4 digits with leading zero\n# ------------------------------------------------------------------\nnormalize_rev_code <- function(x) {\n  str_pad(as.character(as.integer(x)), width = 4, pad = \"0\")\n}\n\n# ------------------------------------------------------------------\n# Sample outpatient institutional claims data frame\n# (rev_code stored as character 3-digit, common in research databases)\n# ------------------------------------------------------------------\nclaims <- data.frame(\n  person_id    = rep(2001L, 6),\n  service_date = rep(\"2023-09-14\", 6),\n  rev_code_raw = c(\"450\", \"250\", \"636\", \"301\", \"324\", \"361\"),\n  hcpcs        = c(\"99284\", \"\", \"J1885\", \"\", \"\", \"\"),\n  charge_amt   = c(850.00, 42.00, 1200.00, 215.00, 480.00, 310.00),\n  stringsAsFactors = FALSE\n)\n\n# ------------------------------------------------------------------\n# Step 1: Normalize revenue code to 4-digit form\n# ------------------------------------------------------------------\nclaims <- claims %>%\n  mutate(rev_code = normalize_rev_code(rev_code_raw))\n\n# ------------------------------------------------------------------\n# Step 2–4: Classify lines\n# ------------------------------------------------------------------\nclaims <- claims %>%\n  mutate(\n    # ED lines: 045x family\n    is_ed_line  = str_starts(rev_code, \"045\"),\n    # Observation room: exactly 0762\n    is_obs_line = rev_code == \"0762\",\n    # Drug lines: 0636 or 0250 paired with a J-code (HCPCS starts with \"J\")\n    has_jcode   = str_starts(hcpcs, \"J\"),\n    is_drug_line = (rev_code %in% c(\"0636\", \"0250\")) & has_jcode\n  )\n\n# ------------------------------------------------------------------\n# Step 5: Cost decomposition by revenue code family\n# 
(using charge_amt as a stand-in; replace with allowed_amt in real data)\n# ------------------------------------------------------------------\ncost_by_family <- claims %>%\n  mutate(\n    rev_family = case_when(\n      str_starts(rev_code, \"045\") ~ \"Emergency Room (045x)\",\n      str_starts(rev_code, \"020\") ~ \"ICU (020x)\",\n      str_starts(rev_code, \"036\") ~ \"Operating Room (036x)\",\n      str_starts(rev_code, \"025\") ~ \"Pharmacy (025x)\",\n      rev_code == \"0636\"          ~ \"Drugs-Detailed (0636)\",\n      str_starts(rev_code, \"030\") ~ \"Laboratory (030x)\",\n      str_starts(rev_code, \"032\") ~ \"Radiology (032x)\",\n      TRUE                        ~ paste0(\"Other (\", str_sub(rev_code, 1, 3), \"x)\")\n    )\n  ) %>%\n  group_by(rev_family) %>%\n  summarise(\n    n_lines     = n(),\n    total_charge = sum(charge_amt),\n    .groups = \"drop\"\n  ) %>%\n  arrange(desc(total_charge))\n\nprint(cost_by_family)\n# rev_family              n_lines  total_charge\n# Drugs-Detailed (0636)         1       1200.00\n# Emergency Room (045x)         1        850.00\n# Radiology (032x)              1        480.00\n# Operating Room (036x)         1        310.00\n# Laboratory (030x)             1        215.00\n# Pharmacy (025x)               1         42.00\n\n# Confirm ED + drug line charge sum\ned_drug_total <- claims %>%\n  filter(is_ed_line | is_drug_line) %>%\n  summarise(total = sum(charge_amt)) %>%\n  pull(total)\n# 850.00 + 1200.00 = 2050.00\ncat(sprintf(\"ED + drug line charge total: $%.2f\\n\", ed_drug_total))",
        "description": "R implementation of leading-zero normalization and revenue-code-based line classification using dplyr and stringr. Shows the same three classification tasks (ED lines, observation lines, drug lines) plus a cost-decomposition summary by revenue code family. Includes a helper function for 3-to-4-digit normalization that works for both character and integer storage formats.\n",
        "dependencies": [
          "dplyr",
          "stringr"
        ],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  subgraph Institutional_Claim[\"UB-04 Institutional Claim (one service date)\"]\n    FL42[\"FL 42: Revenue Code (4-digit)\\ne.g. 0450, 0636, 0250\"]\n    FL44[\"FL 44: HCPCS / CPT\\n(when applicable)\\ne.g. 99284, J1885\"]\n    FL47[\"FL 47: Charge Amount\\n(billed — not allowed/paid)\"]\n    FL42 --> FL44\n    FL44 --> FL47\n  end\n  subgraph Rev_Code_Families[\"Revenue Code Families (illustrative)\"]\n    ED[\"045x — Emergency Room\"]\n    ICU[\"020x — Intensive Care\"]\n    OR[\"036x — Operating Room\"]\n    PHARM[\"025x — Pharmacy (general)\"]\n    DRUG[\"0636 — Drugs: Detailed Coding\\n(pairs with J-code → drug ID)\"]\n    PT[\"042x — Physical Therapy\"]\n    OBS[\"0762 — Observation Room\"]\n  end\n  FL42 --> ED\n  FL42 --> ICU\n  FL42 --> OR\n  FL42 --> PHARM\n  FL42 --> DRUG\n  FL42 --> PT\n  FL42 --> OBS",
        "caption": "Revenue code FL 42 on the UB-04 identifies the hospital department for each claim line. The HCPCS code (FL 44) on the same line provides procedure or drug specificity when present. Revenue code 0636 is the key pairing field for identifying specific provider-administered drugs via J-codes.",
        "alt_text": "Flowchart showing that FL 42 revenue code on a UB-04 institutional claim feeds into illustrative revenue code families including 045x emergency room, 020x ICU, 036x operating room, 025x pharmacy, 0636 drugs requiring detailed coding, 042x physical therapy, and 0762 observation room. FL 44 HCPCS links to drug or procedure identification on the same claim line.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  RC[\"Revenue Code (FL 42)\"]\n  RC -->|\"045x\"| ED_INST[\"ED Facility Claim\\n(institutional)\"]\n  RC -->|\"0762\"| OBS[\"Observation Stay\\n(outpatient institutional)\"]\n  RC -->|\"0636 + J-code\"| DRUG_ID[\"Provider-Administered Drug ID\\n(Part B facility benefit)\"]\n  PROF[\"Professional Claim\\n(CMS-1500)\"]\n  POS[\"Place-of-Service Code\\n(Item 24B on CMS-1500)\"]\n  PROF --> POS\n  POS -->|\"POS 23\"| ED_PROF[\"ED Professional Claim\"]\n  ED_INST & ED_PROF -->|\"Same service date\"| ED_ALGORITHM[\"Two-Pronged ED\\nVisit Algorithm\\n(union or intersection)\"]",
        "caption": "Revenue codes and Place-of-Service codes are complementary, not redundant. Revenue codes classify the institutional (facility) claim; POS codes classify the professional claim. The two-pronged ED visit identification algorithm combines both.",
        "alt_text": "Flowchart showing revenue code 045x on institutional claims and POS 23 on professional claims being combined into a two-pronged emergency department visit algorithm. Also shows 0762 for observation stay identification and 0636 plus J-code for provider-administered drug identification.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "part_of",
        "target_slug": "ub-04-institutional-claim-fields",
        "notes": "Revenue codes are one of the key form locator fields on the UB-04 institutional claim (FL 42 on each service line); they cannot be understood outside the context of the full UB-04 claim structure."
      },
      {
        "relation_type": "used_with",
        "target_slug": "hcpcs-level-ii-j-codes",
        "notes": "Revenue code 0636 and HCPCS J-codes form the standard pairing for identifying specific provider-administered drugs on outpatient institutional claims; J-codes provide drug-level specificity that revenue codes alone cannot."
      },
      {
        "relation_type": "used_with",
        "target_slug": "cpt-procedure-coding",
        "notes": "On outpatient institutional claims, CPT codes appear alongside revenue codes and provide procedure-level specificity (e.g., CPT 99284 for a level-4 ED visit on the 0450 revenue code line). Revenue codes identify the department; CPT identifies the specific service performed in that department."
      },
      {
        "relation_type": "tradeoff_with",
        "target_slug": "place-of-service-codes",
        "notes": "Place-of-Service codes classify the care setting on professional (CMS-1500) claims; revenue codes do the same job on institutional (UB-04) claims. Neither field appears on the other claim type. Both are needed for complete encounter-level setting identification when facility and professional claims must be linked."
      },
      {
        "relation_type": "see_also",
        "target_slug": "ms-drg-classification",
        "notes": "MS-DRGs summarize the whole inpatient stay into a payment category; revenue codes provide line-item department-level detail within that same stay. Both fields appear on inpatient institutional claims and are complementary for cost decomposition and utilization analysis."
      },
      {
        "relation_type": "see_also",
        "target_slug": "claims-analysis",
        "notes": "Revenue codes are a foundational element of institutional claims analysis; the leading-zero format trap, HCPCS pairing, and department-level filtering are among the first operational details a claims analyst encounters."
      },
      {
        "relation_type": "see_also",
        "target_slug": "healthcare-costs-pppm-pppy-pmpm",
        "notes": "Revenue-code-based department decomposition (ED vs ICU vs pharmacy vs surgical) is the standard approach to attributing and explaining per-patient-per-month cost differences in healthcare cost analyses."
      },
      {
        "relation_type": "see_also",
        "target_slug": "procedure-identification-and-measurement-in-claims-ehr",
        "notes": "Revenue codes are one layer of procedure and service-type identification in institutional claims; on inpatient claims they mark the department but not the procedure, so ICD-10-PCS or linked HCPCS data are required for procedure-level identification."
      }
    ],
    "aliases": [
      "revenue code",
      "revenue codes",
      "rev code",
      "revenue center",
      "revenue center code",
      "UB-04 revenue code",
      "FL 42"
    ],
    "complexity": "foundational",
    "regulatory_relevance": [],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "risk-evaluation",
    "name": "Risk Evaluation Study (Post-Authorization Safety / Active Surveillance)",
    "short_definition": "An observational post-authorization study design that quantifies and characterizes the absolute and comparative risk of adverse events for a marketed medicine in routine care, typically by constructing an incident-user cohort, accruing exposed person-time, and estimating incidence rates or comparative risks from claims, EHR, registry, or linked data.",
    "long_description": "A **risk evaluation study** is a post-authorization safety study (PASS / active surveillance) whose purpose is to *measure*\na drug's risk in routine use: the absolute incidence of a pre-specified adverse event (AE), how that risk compares to an\nappropriate reference, and, when the question is regulatory, whether risk-minimization measures are achieving their stated\nobjectives. It is the analytic workhorse behind FDA-mandated safety assessments (including REMS effectiveness assessments),\nEMA imposed and voluntary PASS, and sponsor-initiated safety monitoring. The design is concrete and quantitative: it builds\na cohort, defines exposure and an outcome algorithm, accrues person-time, and produces an incidence rate (or a comparative\nrate/hazard) with a confidence interval, not a vague \"evaluation.\"\n\n**Core conceptual distinctions.** Three pairs of concepts are routinely conflated and must be kept separate. (1) *Active surveillance\n(this design) vs spontaneous reporting / disproportionality.* A risk evaluation study has a defined denominator\n(person-time at risk) and can estimate an *incidence rate*; FAERS/EudraVigilance disproportionality (PRR, ROR, EBGM) has no\ndenominator and can only flag a *signal* relative to other drugs; it is hypothesis-generating, not risk-quantifying.\n(2) *Absolute risk characterization vs comparative effect estimation.* Some PASS deliverables are descriptive (the absolute\nrate of the AE among initiators, for labeling and benefit-risk); others are causal contrasts (rate in users of drug A vs an\nactive comparator), which require the full confounding-control machinery of comparative-effectiveness designs. The estimand\nmust be pre-specified: a single-arm incidence rate, a rate ratio vs an active comparator, or a within-person relative\nincidence (self-controlled). 
(3) *Drug-risk evaluation vs risk-minimization-measure (RMM) evaluation.* The former measures\nthe AE rate; the latter measures whether an intervention (a Medication Guide, prescriber certification, pregnancy-prevention\nprogram) changed knowledge, behavior, drug utilization, or the AE rate. These are process and outcome indicators, often survey-based\nfor process metrics and claims-based for utilization/outcome metrics. A protocol should state which it is doing and not\nblur the two.\n\n**Pros, cons, and trade-offs.**\n- **vs spontaneous-report disproportionality (FAERS/EudraVigilance signal detection):** The cohort risk evaluation study has\n  a real denominator, so it estimates rates and supports labeling and benefit-risk; it can also control for confounding\n  through design and adjustment. Cost: it needs a longitudinal data source large enough to accrue exposed person-time and outcome events,\n  and it is slower and more expensive than mining a spontaneous-report database. **Prefer the cohort study** once a signal\n  is credible and a *quantified* risk is needed; **prefer disproportionality** only for hypothesis generation across the\n  whole product space.\n- **vs a full comparative-effectiveness / active-comparator new-user (ACNU) study:** A single-arm incidence-rate risk\n  evaluation is simpler and answers \"how often does this AE occur in users?\" Cost: with no comparator, it cannot separate\n  drug effect from background rate, and an external/historical rate is vulnerable to differences in case ascertainment and\n  population mix. 
**Prefer single-arm** for absolute risk characterization and rare, drug-specific events with no plausible\n  background; **prefer an active comparator** the moment the question becomes \"does this drug raise risk *relative to* an\n  alternative,\" where confounding by indication would otherwise dominate (see `active-comparator-new-user`).\n- **vs self-controlled designs (SCCS, case-crossover):** Self-controlled analyses cancel all time-invariant confounding by\n  using each person as their own control and are excellent for transient exposures and acute outcomes. Cost: they require a\n  within-person exposure contrast and assume the outcome does not affect future exposure and that the event does not censor\n  observation. **Prefer self-controlled** when time-fixed confounding is severe and the exposure is intermittent (e.g.,\n  short courses); **prefer the cohort** for sustained exposures, cumulative-dose questions, and when absolute rates are the\n  deliverable (see `self-controlled-case-series`, `case-crossover`).\n\n**When to use.** Use this design when any of the following holds: a credible safety signal needs to be *quantified* in routine care; an FDA REMS effectiveness assessment or\nEMA PASS protocol must deliver an incidence rate, a comparative rate, or a utilization/behavior metric; benefit-risk or\nlabeling decisions require an absolute AE rate among real-world initiators; or a regulator has imposed post-marketing\nsurveillance with a denominator-based endpoint. 
Use the cohort form when exposure is sustained and the AE accrues over\nfollow-up; use a self-controlled or case-only form when time-fixed confounding is the dominant threat and exposure is\ntransient.\n\n**When NOT to use, and when it is actively misleading or dangerous.**\n- **Spontaneous-report-only \"evaluation.\"** Running disproportionality on FAERS/EudraVigilance and calling it a risk\n  evaluation produces a signal with no denominator; reporting a PRR/ROR as if it were a risk is wrong and can drive\n  over-reaction or false reassurance. That is signal detection (see `signal-detection`), not risk quantification.\n- **No usable denominator / no observable washout.** If a large share of person-time is Medicare Advantage-only (capitated\n  encounters not adjudicated as fee-for-service claims), the exposed denominator and the \"no prior event/exposure\" washout\n  are partly *missing*, not zero, so incidence rates are biased by undercounting of both numerator and denominator. Restrict\n  to fully observable enrollment.\n- **Immortal time in enrollment/procedure-anchored cohorts.** If follow-up (or \"exposure\") starts at REMS enrollment,\n  certification, or a procedure but the AE clock is allowed to run before the patient could possibly have the event, the\n  immortal interval deflates the rate. Align time zero to first exposure and start follow-up there (see\n  `immortal-time-bias-handling`, `time-zero-index-date-alignment-rwe`).\n- **Differential competing risks by exposure.** In elderly or seriously ill populations, death and treatment cessation\n  compete with the AE; if these differ by exposure, naive incidence rates and crude comparisons mislead. 
Use cause-specific\n  rates or subdistribution methods and report the competing event (see `competing-risks-cause-specific-fine-gray-rwe`).\n- **Single-arm comparison to an unmatched external rate** when ascertainment differs — a registry that adjudicates events\n  vs claims that infer them will not produce comparable rates; the contrast is an artifact of case-finding, not biology.\n\n**Data-source operational depth** (claims vs EHR vs registry vs linked).\n- **Administrative claims (FFS / commercial):** Exposure is the pharmacy claim (NDC + `fill_date` + `days_supply`) or a\n  medical-benefit administration (J-code) for infused products; outcomes are dx/procedure codes with a validated algorithm\n  (a claims phenotype with known PPV). Require continuous medical + pharmacy enrollment across the washout and follow-up so\n  \"no prior event\" and the denominator are real. Failure modes: **MA-only person-time lacks adjudicated FFS claims**, so\n  both numerator and denominator are undercounted — exclude MA-only spans or restrict to A/B/D enrollees; sample fills,\n  90-day mail-order, and stockpiling distort `days_supply` and on-treatment risk windows; **differential competing risks by\n  exposure in elderly claims** bias crude rates; an AE algorithm with modest PPV inflates the numerator unless calibrated\n  (see `claims-outcome-algorithm-ppv-sensitivity-rwe`).\n- **EHR:** Initiation is the *order or administration*, not the dispensing; linkage to fills confirms the patient actually\n  started. 
Labs, vitals, and notes sharpen AE ascertainment and severity (an advantage for outcomes like hepatotoxicity or\n  QT), but visit-driven capture means patients who leave the system are differentially lost — define observation windows and\n  treat loss to follow-up as potentially informative; out-of-network events are simply unseen.\n- **Registry:** Strongest for adjudicated outcomes, indication, and severity (the numerator is high-quality); typically weak\n  for complete drug exposure and full person-time denominator. Link to claims for fills and to a death index to firm up\n  censoring and capture fatal AEs.\n- **Linked claims–EHR–vital records:** The ideal substrate — EHR/registry-grade AE ascertainment + claims-grade exposure and\n  denominator + reliable mortality — but linkage introduces selection (only the linkable subset) and order/fill/service\n  date discrepancies that must be reconciled before time-zero and risk-window assignment.\n\n**Worked claims example (absolute incidence of an AE among new users).** Question: incidence of acute severe hepatic injury\namong adult new users of a newly marketed oral drug under a REMS, in a commercial + Medicare FFS database. (1) Eligibility:\nage ≥18 and 365 days of continuous A/B/D (or commercial medical+pharmacy) enrollment before the first study fill — this is\nwhat makes the denominator observable and the washout real. (2) Washout / incident-user restriction: no fill of the study\ndrug in the 365-day lookback (incident use) and no diagnosis of the AE in the lookback (so the event is incident, not\nprevalent). (3) Time zero = the first qualifying `fill_date`; do **not** start the AE clock at REMS enrollment or\ncertification, which would create immortal time. 
(4) Outcome: first qualifying AE using a validated algorithm (e.g., ≥1\ninpatient dx in the primary position, or 1 inpatient + 1 outpatient `dx_code` within a 30-day window) with a known PPV; if\nPPV is modest, plan a chart-validation substudy or quantitative-bias correction. (5) Follow-up: from time zero to the first\nAE, censoring at disenrollment, death, end of data, and — for an on-treatment risk window — the last `days_supply` end plus\na pre-specified grace period; treat death as a competing risk and report cause-specific person-time. (6) Estimate: events ÷\nperson-years gives the incidence rate; an exact Poisson 95% CI is appropriate when events are few. (7) Sensitivity: vary the\nwashout and grace period, swap the AE algorithm (high-PPV vs high-sensitivity), and add a negative-control outcome to detect\nresidual surveillance/ascertainment bias. The same scaffold extends to a *comparative* risk evaluation by adding an active\ncomparator and propensity-score balancing, or to an *RMM-effectiveness* evaluation by replacing the AE numerator with a\nutilization/behavior metric (e.g., proportion of dispensings preceded by the required monitoring lab).",
    "primary_category": "Study_Design",
    "tags": [
      "post-authorization-safety-study",
      "pass",
      "active-surveillance",
      "rems",
      "risk-minimisation",
      "incidence-rate",
      "pharmacovigilance",
      "pharmacoepidemiology",
      "benefit-risk"
    ],
    "applies_to_study_types": [
      "risk_evaluation"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1002/pds.1926",
        "url": "https://doi.org/10.1002/pds.1926",
        "citation_text": "Schneeweiss S. A basic study design for expedited safety signal evaluation based on electronic healthcare data. Pharmacoepidemiology and Drug Safety. 2010;19(8):858-868.",
        "year": 2010,
        "authors_short": "Schneeweiss",
        "notes": "Canonical template for expedited, denominator-based safety evaluation in healthcare databases — incident-user cohort, active comparator, time zero, and confounding control for risk quantification."
      },
      {
        "role": "introduce",
        "doi": "10.1016/j.jclinepi.2004.10.012",
        "url": "https://doi.org/10.1016/j.jclinepi.2004.10.012",
        "citation_text": "Schneeweiss S, Avorn J. A review of uses of health care utilization databases for epidemiologic research on therapeutics. Journal of Clinical Epidemiology. 2005;58(4):323-337.",
        "year": 2005,
        "authors_short": "Schneeweiss & Avorn",
        "notes": "Foundational review of how claims/utilization databases are used for drug safety and effectiveness, including denominator construction and exposure/outcome operationalization."
      },
      {
        "role": "explain",
        "doi": "10.1038/ncprheum0652",
        "url": "https://doi.org/10.1038/ncprheum0652",
        "citation_text": "Suissa S, Garbe E. Primer: administrative health databases in observational studies of drug effects—advantages and disadvantages. Nature Clinical Practice Rheumatology. 2007;3(12):725-732.",
        "year": 2007,
        "authors_short": "Suissa & Garbe",
        "notes": "Clear primer on the strengths and biases of administrative databases for drug-effect studies — immortal time, time-window, and ascertainment pitfalls directly relevant to risk evaluation denominators and numerators."
      },
      {
        "role": "demonstrate",
        "doi": "10.1002/pds.4203",
        "url": "https://doi.org/10.1002/pds.4203",
        "citation_text": "Nyeland ME, Laursen MV, Callréus T. Evaluating the effectiveness of risk minimisation measures: the application of a conceptual framework to Danish real-world dabigatran data. Pharmacoepidemiology and Drug Safety. 2017;26(6):607-614.",
        "year": 2017,
        "authors_short": "Nyeland et al.",
        "notes": "Worked application of an RMM-effectiveness framework to real-world data, distinguishing process from outcome indicators — the RMM/REMS-evaluation subtype of risk evaluation."
      }
    ],
    "plain_language_summary": "A risk evaluation study answers the question: how often does a specific harmful side effect actually happen in patients who take a drug in everyday clinical care, and — when a formal safety program is in place — is that program keeping the harm under control? Researchers identify every new patient who started the drug, track how long each person was on it, and count who experienced the safety problem; dividing those counts gives an incidence rate the FDA or EMA can use to update labeling or judge whether a safety program is working. One important caveat: this approach needs a large database of real insurance claims or medical records — it cannot work from reports that patients or doctors voluntarily phone in, because those reports have no reliable count of the people who did NOT have the problem.",
    "key_terms": [
      {
        "term": "REMS",
        "definition": "Risk Evaluation and Mitigation Strategy — an FDA-required safety program that may restrict who can prescribe or receive a drug, require monitoring tests before each dispensing, or mandate a patient agreement; RWE is used to measure whether the program is actually working."
      },
      {
        "term": "incidence rate",
        "definition": "The number of new adverse events divided by the total time all patients were observed while on the drug, expressed as events per 100 or 1,000 patient-years."
      },
      {
        "term": "person-time",
        "definition": "The sum of each patient's individual follow-up time while at risk; one patient followed for two years contributes two person-years to the denominator."
      },
      {
        "term": "risk-minimization measure",
        "definition": "Any intervention designed to prevent or reduce a known drug harm — examples include restricted prescriber certification, mandatory pregnancy testing, or patient enrollment in a monitoring registry."
      },
      {
        "term": "process metric",
        "definition": "A measurement of whether a safety program is being carried out as required — for example, the percentage of prescription fills that had a mandatory monitoring test completed beforehand — as distinct from measuring whether the harm itself occurred."
      }
    ],
    "worked_example": {
      "scenario": "Isotretinoin (a severe-acne drug) carries a known risk of birth defects. The FDA requires a REMS called iPLEDGE: female patients of childbearing potential must use two forms of contraception, pass a monthly pregnancy test, and their pharmacist must confirm a negative result before dispensing each prescription. A safety team wants to know whether pharmacies are actually checking the pregnancy test before each fill — a process metric — and whether pregnancies among drug users are declining — an outcome metric. They use one year of commercial insurance claims for 500 female patients aged 15-45 who filled isotretinoin for the first time.",
      "dataset": {
        "caption": "Pharmacy claims table showing isotretinoin fills and required pregnancy-test lab claims for five example patients.",
        "columns": [
          "person_id",
          "fill_date",
          "drug",
          "days_supply",
          "preg_test_in_prior_30d"
        ],
        "rows": [
          [
            "PT-001",
            "2023-02-01",
            "isotretinoin",
            30,
            "YES"
          ],
          [
            "PT-002",
            "2023-02-14",
            "isotretinoin",
            30,
            "NO"
          ],
          [
            "PT-003",
            "2023-03-05",
            "isotretinoin",
            30,
            "YES"
          ],
          [
            "PT-004",
            "2023-03-19",
            "isotretinoin",
            30,
            "YES"
          ],
          [
            "PT-005",
            "2023-04-02",
            "isotretinoin",
            30,
            "NO"
          ]
        ]
      },
      "steps": [
        "Identify the risk: isotretinoin causes serious birth defects, so the REMS requires a negative pregnancy test within 30 days before each monthly fill.",
        "Identify the mitigation: restricted distribution through iPLEDGE — pharmacies must confirm a negative pregnancy-test lab result in the patient's claims before dispensing.",
        "Measure process compliance: for each of the 500 fills, look back 30 days in the lab claims table for a pregnancy-test result; a fill with a test present is compliant, a fill without is not.",
        "From the five example rows: 3 of 5 fills have a documented pregnancy test in the prior 30 days; 2 do not (PT-002 and PT-005).",
        "Scale to the full cohort: if 430 of 500 fills have a prior pregnancy test, the monitoring-adherence rate is 430 divided by 500 equals 0.86, or 86 percent.",
        "Measure the outcome: search each patient's claims for a pregnancy diagnosis code (ICD-10 Z34.xx) in the 9 months following any fill; count pregnancies per 100 patient-years on drug.",
        "In the full cohort, 6 pregnancies are identified across 420 total patient-years of follow-up; the pregnancy incidence rate is 6 divided by 420 equals 0.014 pregnancies per patient-year, or 1.4 per 100 patient-years.",
        "Compare to the pre-REMS historical rate of roughly 3.0 per 100 patient-years to assess whether the program is reducing pregnancies."
      ],
      "result": "Process metric: 86% of fills had a documented pregnancy test in the prior 30 days (430 of 500), meaning 14% of dispensings did not meet the REMS monitoring requirement. Outcome metric: pregnancy incidence rate of 1.4 per 100 patient-years, compared to a historical pre-REMS rate of approximately 3.0 per 100 patient-years — a roughly 50% reduction — suggesting the mitigation program is working, though incomplete monitoring compliance leaves room for improvement."
    },
    "prerequisites": [
      "incidence-rate-calculation-rwe",
      "new-user-design",
      "continuous-enrollment-observable-time-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Single-arm absolute incidence-rate evaluation",
        "description": "Incident-user cohort with a validated outcome algorithm; the deliverable is the absolute AE incidence rate (events per person-year) with an exact/Poisson confidence interval, for labeling and benefit-risk.",
        "edge_cases": [
          "No comparator means the drug effect cannot be separated from the background rate; an external reference rate is comparable only if case ascertainment and population mix match.",
          "Modest outcome-algorithm PPV inflates the numerator; pair with a chart-validation substudy or quantitative bias analysis."
        ],
        "data_source_notes": "claims: require fully observable enrollment for the denominator and exclude MA-only person-time; EHR: out-of-network events are unseen, so the rate is a lower bound."
      },
      {
        "name": "Comparative (active-comparator) risk evaluation",
        "description": "Two incident-user arms (study drug vs a clinically interchangeable active comparator) balanced on pre-index covariates, delivering a rate ratio or hazard ratio.",
        "edge_cases": [
          "Confounding by indication and channeling re-enter if the comparator is prescribed to systematically different patients; check covariate balance before trusting the contrast.",
          "Differential discontinuation/switching by arm makes on-treatment rates susceptible to informative censoring."
        ],
        "data_source_notes": "See active-comparator-new-user for cohort construction; high-dimensional PS on lookback proxies is standard in claims."
      },
      {
        "name": "Self-controlled risk evaluation (SCCS / case-crossover)",
        "description": "Within-person comparison of AE incidence during exposed vs unexposed risk windows; cancels all time-invariant confounding.",
        "edge_cases": [
          "Requires transient exposure and an acute outcome; violated when exposure is sustained or the event affects future exposure or censors observation."
        ],
        "data_source_notes": "Depends on precise risk-window construction from days_supply; see self-controlled-case-series."
      },
      {
        "name": "Risk-minimization-measure (RMM/REMS) effectiveness evaluation",
        "description": "Replaces the AE numerator with process indicators (knowledge, prescriber/patient behavior, required monitoring) and outcome indicators (utilization patterns, AE rate after implementation), often combining survey and claims data.",
        "edge_cases": [
          "Process metrics frequently require primary data collection (surveys) because claims do not capture knowledge or counseling.",
          "Pre/post or interrupted-time-series designs are vulnerable to secular trends; pre-specify the counterfactual."
        ],
        "data_source_notes": "claims: measure required-monitoring adherence (e.g., lab before each dispensing) and contraindicated co-prescribing; survey: measure prescriber/patient knowledge."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Spontaneous-report disproportionality (FAERS/EudraVigilance signal detection)",
        "pros_of_this": "Has a real denominator (person-time), so it estimates incidence rates and comparative risks and supports labeling and benefit-risk; controls confounding by design and adjustment.",
        "cons_of_this": "Requires a large longitudinal data source and is slower and costlier than mining spontaneous reports; cannot scan the whole product space at once.",
        "when_to_prefer": "Once a signal is credible and a quantified, denominator-based risk is needed for regulatory or benefit-risk decisions."
      },
      {
        "compared_to": "Active-comparator new-user comparative-effectiveness study",
        "pros_of_this": "A single-arm incidence-rate evaluation is simpler and directly answers absolute-risk and labeling questions for rare, drug-specific events.",
        "cons_of_this": "Without a comparator it cannot separate the drug effect from the background rate and is exposed to external-rate incomparability.",
        "when_to_prefer": "Absolute risk characterization for events with no plausible background; switch to an active comparator the moment the question is relative risk versus an alternative."
      },
      {
        "compared_to": "Self-controlled designs (SCCS, case-crossover)",
        "pros_of_this": "Estimates absolute rates and handles sustained exposures and cumulative-dose questions; produces person-time denominators for labeling.",
        "cons_of_this": "Carries between-person confounding that self-controlled designs eliminate by using each person as their own control.",
        "when_to_prefer": "Sustained exposures or when absolute rates are the deliverable; prefer self-controlled for transient exposures with severe time-fixed confounding."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Exposure = pharmacy claim (NDC + fill_date + days_supply) or medical-benefit administration (J-code) for infused products. Outcome = validated dx/procedure algorithm with known PPV. Require continuous medical + pharmacy enrollment across washout and follow-up; exclude Medicare Advantage-only person-time where FFS claims are unavailable. Time zero = first qualifying fill (not REMS enrollment/certification). Treat death as a competing risk and censor at disenrollment/death/end of data.",
      "ehr": "Initiation = order or administration; prefer linked dispensing to confirm the patient started. Labs/vitals/notes sharpen AE ascertainment and severity. Out-of-network events are unseen; define observation windows and treat loss to follow-up as potentially informative.",
      "registry": "Strong for adjudicated AE ascertainment, indication, and severity (high-quality numerator); weak for complete exposure and full denominator. Link to claims for fills and a death index for fatal AEs and censoring.",
      "linked": "Linked claims-EHR-vital-records is the ideal substrate (high-quality numerator + complete denominator + mortality) but introduces linkage selection and order/fill/service date discrepancies that must be reconciled before time-zero and risk-window assignment."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\nimport numpy as np\nfrom scipy.stats import chi2\n\nWASHOUT_DAYS = 365   # observable, drug-free lookback that defines an incident user\nGRACE_DAYS   = 30    # on-treatment risk window = last days_supply end + grace\n\ndef build_risk_eval(rx, enroll, ae, death, end_of_data):\n    rx = rx.sort_values([\"person_id\", \"fill_date\"])\n\n    # Time zero = first fill of the study drug.\n    first = (rx[rx[\"is_study_drug\"]]\n             .groupby(\"person_id\")[\"fill_date\"].min()\n             .reset_index(name=\"index_date\"))\n\n    # Incident-user washout: no study-drug fill in the WASHOUT_DAYS before index.\n    prior = rx[rx[\"is_study_drug\"]].merge(first, on=\"person_id\")\n    had_prior = prior[(prior[\"fill_date\"] < prior[\"index_date\"]) &\n                      (prior[\"fill_date\"] >= prior[\"index_date\"] - pd.Timedelta(days=WASHOUT_DAYS))]\n    first = first[~first[\"person_id\"].isin(had_prior[\"person_id\"])].copy()\n\n    # Continuous, FFS-observable enrollment across the full washout through index (no MA-only spans).\n    e = enroll.merge(first, on=\"person_id\")\n    covers = e[(e[\"enroll_start\"] <= e[\"index_date\"] - pd.Timedelta(days=WASHOUT_DAYS)) &\n               (e[\"enroll_end\"]   >= e[\"index_date\"]) & (~e[\"ma_only\"])]\n    first = first[first[\"person_id\"].isin(covers[\"person_id\"])].copy()\n\n    # On-treatment exposure end = last study fill's days_supply end + grace.\n    rx_idx = rx[rx[\"is_study_drug\"]].merge(first[[\"person_id\"]], on=\"person_id\")\n    rx_idx[\"supply_end\"] = rx_idx[\"fill_date\"] + pd.to_timedelta(rx_idx[\"days_supply\"], unit=\"D\")\n    tx_end = (rx_idx.groupby(\"person_id\")[\"supply_end\"].max()\n              .add(pd.Timedelta(days=GRACE_DAYS)).reset_index(name=\"tx_end\"))\n    coh = first.merge(tx_end, on=\"person_id\")\n\n    # Administrative censoring: enrollment end, death, end of data, treatment end.\n    enroll_end = 
enroll.groupby(\"person_id\")[\"enroll_end\"].max().reset_index(name=\"enr_end\")\n    coh = coh.merge(enroll_end, on=\"person_id\").merge(death, on=\"person_id\", how=\"left\")\n    coh[\"death_date\"] = coh[\"death_date\"].fillna(pd.Timestamp.max)\n    coh[\"censor_date\"] = coh[[\"tx_end\", \"enr_end\", \"death_date\"]].min(axis=1).clip(upper=end_of_data)\n\n    # First INCIDENT AE strictly after time zero and on/before censoring.\n    a = ae.merge(coh[[\"person_id\", \"index_date\", \"censor_date\"]], on=\"person_id\")\n    a = a[(a[\"event_date\"] > a[\"index_date\"]) & (a[\"event_date\"] <= a[\"censor_date\"])]\n    first_ae = a.groupby(\"person_id\")[\"event_date\"].min().reset_index(name=\"ae_date\")\n    coh = coh.merge(first_ae, on=\"person_id\", how=\"left\")\n\n    coh[\"had_event\"] = coh[\"ae_date\"].notna().astype(int)\n    # Person-time stops at the AE if it occurred, else at administrative censoring.\n    stop = coh[\"ae_date\"].fillna(coh[\"censor_date\"])\n    coh[\"person_days\"] = (stop - coh[\"index_date\"]).dt.days.clip(lower=0)\n\n    events = int(coh[\"had_event\"].sum())\n    py = coh[\"person_days\"].sum() / 365.25\n    rate = events / py * 1000.0\n    # Exact (Garwood) Poisson 95% CI on the count, scaled to the same denominator.\n    lo = 0.0 if events == 0 else chi2.ppf(0.025, 2 * events) / 2\n    hi = chi2.ppf(0.975, 2 * events + 2) / 2\n    return {\"events\": events, \"person_years\": py,\n            \"rate_per_1000py\": rate,\n            \"ci95\": (lo / py * 1000.0, hi / py * 1000.0)}",
        "description": "Risk-evaluation cohort construction + absolute incidence rate from claims-style inputs. Required inputs (cleaned,\nde-duplicated):\n  rx     : pharmacy fills    -> person_id, fill_date (datetime), is_study_drug (bool), days_supply (int)\n  enroll : enrollment spans  -> person_id, enroll_start, enroll_end (datetime), ma_only (bool)  # ma_only lacks FFS claims\n  ae     : outcome events    -> person_id, event_date (datetime)  # from a VALIDATED dx-code algorithm (known PPV)\n  death  : mortality         -> person_id, death_date (datetime)\nBuilds an incident-user cohort with a 365-day observable washout, sets time zero at first study fill, accrues on-treatment\nperson-time (days_supply + grace), takes the FIRST incident AE, and returns the incidence rate per 1,000 person-years\nwith an exact Poisson 95% CI. Death is a competing risk (censoring event for cause-specific person-time).",
        "dependencies": [
          "pandas",
          "numpy",
          "scipy"
        ],
        "source_citations": [
          "schneeweiss-2010"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\nWASHOUT_DAYS <- 365L\nGRACE_DAYS   <- 30L\n\nbuild_risk_eval <- function(rx, enroll, ae, death, end_of_data) {\n  setDT(rx); setDT(enroll); setDT(ae); setDT(death)\n  setorder(rx, person_id, fill_date)\n\n  # Time zero = first study-drug fill.\n  first <- rx[is_study_drug == TRUE, .(index_date = fill_date[1L]), by = person_id]\n\n  # Incident-user washout: no study-drug fill in the washout window before index.\n  pr <- merge(rx[is_study_drug == TRUE], first, by = \"person_id\")\n  prior_ids <- unique(pr[fill_date < index_date &\n                         fill_date >= index_date - WASHOUT_DAYS, person_id])\n  first <- first[!person_id %chin% prior_ids]\n\n  # Continuous FFS-observable enrollment across the full washout through index.\n  e <- merge(enroll, first, by = \"person_id\")\n  ok <- e[enroll_start <= index_date - WASHOUT_DAYS &\n          enroll_end   >= index_date & ma_only == FALSE, unique(person_id)]\n  first <- first[person_id %chin% ok]\n\n  # On-treatment end = last days_supply end + grace.\n  ri <- merge(rx[is_study_drug == TRUE], first[, .(person_id)], by = \"person_id\")\n  ri[, supply_end := fill_date + days_supply]\n  txe <- ri[, .(tx_end = max(supply_end) + GRACE_DAYS), by = person_id]\n  coh <- merge(first, txe, by = \"person_id\")\n\n  # Administrative censoring.\n  enr <- enroll[, .(enr_end = max(enroll_end)), by = person_id]\n  coh <- merge(coh, enr, by = \"person_id\")\n  coh <- merge(coh, death, by = \"person_id\", all.x = TRUE)\n  coh[is.na(death_date), death_date := as.Date(\"9999-12-31\")]\n  coh[, censor_date := pmin(tx_end, enr_end, death_date, end_of_data)]\n\n  # First incident AE after time zero, on/before censoring.\n  a <- merge(ae, coh[, .(person_id, index_date, censor_date)], by = \"person_id\")\n  a <- a[event_date > index_date & event_date <= censor_date]\n  fae <- a[, .(ae_date = min(event_date)), by = person_id]\n  coh <- merge(coh, fae, by = \"person_id\", all.x = TRUE)\n\n  coh[, 
had_event := as.integer(!is.na(ae_date))]\n  coh[, stop_date := fifelse(is.na(ae_date), censor_date, ae_date)]\n  coh[, person_days := pmax(as.integer(stop_date - index_date), 0L)]\n\n  events <- sum(coh$had_event)\n  py <- sum(coh$person_days) / 365.25\n  ci <- poisson.test(events, py)$conf.int\n  list(events = events, person_years = py,\n       rate_per_1000py = events / py * 1000,\n       ci95 = c(ci[1] * 1000, ci[2] * 1000))\n}",
        "description": "Risk-evaluation cohort construction + absolute incidence rate with data.table. Inputs mirror the Python version:\n  rx     : person_id, fill_date (Date), is_study_drug (logical), days_supply (int)\n  enroll : person_id, enroll_start, enroll_end (Date), ma_only (logical)\n  ae     : person_id, event_date (Date)   # validated dx algorithm, known PPV\n  death  : person_id, death_date (Date)\nReturns events, person-years, the incidence rate per 1,000 PY, and an exact Poisson 95% CI (poisson.test).",
        "dependencies": [
          "data.table"
        ],
        "source_citations": [
          "schneeweiss-2010"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "%let washout = 365;\n%let grace   = 30;\n%let end_of_data = '31DEC2024'd;\n\n/* Time zero = first study-drug fill. */\nproc sql;\n  create table first as\n  select person_id, min(fill_date) as index_date format=date9.\n  from work.rx where is_study_drug = 1\n  group by person_id;\nquit;\n\n/* Incident-user washout: exclude any study-drug fill inside the washout before index. */\nproc sql;\n  create table newuser as\n  select f.* from first f\n  where not exists (\n    select 1 from work.rx r\n    where r.person_id = f.person_id and r.is_study_drug = 1\n      and r.fill_date <  f.index_date\n      and r.fill_date >= f.index_date - &washout );\nquit;\n\n/* Continuous FFS-observable enrollment across the full washout through index (no MA-only spans). */\nproc sql;\n  create table cohort as\n  select n.person_id, n.index_date\n  from newuser n\n  where exists (\n    select 1 from work.enroll e\n    where e.person_id = n.person_id and e.ma_only = 0\n      and e.enroll_start <= n.index_date - &washout\n      and e.enroll_end   >= n.index_date );\nquit;\n\n/* On-treatment end, administrative censor date, first incident AE, and person-time. 
*/\nproc sql;\n  create table follow as\n  select c.person_id, c.index_date,\n         (select max(r.fill_date + r.days_supply) from work.rx r\n            where r.person_id = c.person_id and r.is_study_drug = 1) + &grace\n           as tx_end format=date9.,\n         (select max(e.enroll_end) from work.enroll e\n            where e.person_id = c.person_id) as enr_end format=date9.,\n         (select min(d.death_date) from work.death d\n            where d.person_id = c.person_id) as death_date format=date9.,\n         (select min(a.event_date) from work.ae a\n            where a.person_id = c.person_id and a.event_date > c.index_date)\n           as ae_date format=date9.\n  from cohort c;\nquit;\n\ndata analysis;\n  set follow;\n  censor_date = min(tx_end, enr_end, coalesce(death_date, &end_of_data), &end_of_data);\n  had_event   = (n(ae_date) and ae_date <= censor_date);   /* incident AE on/before censoring */\n  stop_date   = ifn(had_event, ae_date, censor_date);\n  person_days = max(stop_date - index_date, 0);\n  person_years = person_days / 365.25;\n  log_py = log(person_years);\nrun;\n\n/* Poisson incidence rate with log person-time offset; exp(Intercept) = rate per person-year. */\nproc genmod data=analysis;\n  model had_event = / dist=poisson link=log offset=log_py;\n  estimate 'log rate' intercept 1 / exp;   /* exp -> rate per person-year (x1000 for per-1,000-PY) */\nrun;",
        "description": "Risk-evaluation cohort construction (PROC SQL) + incidence rate (PROC GENMOD Poisson with log person-time offset).\nRequired input datasets (post data-management):\n  work.rx     : person_id, fill_date, is_study_drug (1/0), days_supply\n  work.enroll : person_id, enroll_start, enroll_end, ma_only (0/1)\n  work.ae     : person_id, event_date           /* validated dx algorithm, known PPV */\n  work.death  : person_id, death_date\nSet &end_of_data to the last observable date. PROC GENMOD with dist=poisson and an OFFSET=log(person-years) returns the\nlog rate; exponentiating the intercept gives the rate per person-year (multiply by 1000 for per-1,000-PY). Death is treated\nas a competing/censoring event for cause-specific person-time.",
        "dependencies": [],
        "source_citations": [
          "schneeweiss-2010"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Pop[Source population<br/>under safety surveillance] --> Wash[Continuous FFS-observable enrollment<br/>+ drug-free washout - no prior study fill]\n  Wash --> Init[First study-drug fill]\n  Init --> T0[Time zero = index fill date<br/>NOT REMS enrollment / certification]\n  T0 --> Risk[On-treatment risk window<br/>days_supply + grace]\n  Risk --> AE[First incident AE<br/>validated dx algorithm - known PPV]\n  AE --> PT[Accrue person-time<br/>censor at disenroll / death / data end / tx end]\n  PT --> Rate[Incidence rate per 1,000 PY<br/>+ exact Poisson 95% CI]\n  Rate --> Sens[Sensitivity: washout, grace, AE algorithm PPV,<br/>negative-control outcome, competing-risk death]",
        "caption": "Operational flow of a single-arm risk evaluation in claims. Time zero is the first fill (not enrollment) to avoid immortal time; an observable washout makes the denominator and incident status real; a validated algorithm defines the numerator, and an exact Poisson CI quantifies the rate.",
        "alt_text": "Flowchart from source population through observable washout, first study fill, time zero, on-treatment risk window, first incident adverse event, person-time accrual, incidence-rate estimation, and sensitivity analyses.",
        "source_type": "illustrative",
        "source_citations": [
          "schneeweiss-2010"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  Q{Quantified, denominator-based<br/>risk needed?} -->|No - hypothesis only| SR[Spontaneous reports<br/>FAERS/EudraVigilance disproportionality<br/>= signal detection]\n  Q -->|Yes| C{Comparator available<br/>and question is relative risk?}\n  C -->|No - absolute risk / labeling| Single[Single-arm incidence-rate<br/>risk evaluation]\n  C -->|Yes - vs alternative drug| ACNU[Active-comparator new-user<br/>comparative risk evaluation]\n  C -->|Transient exposure + acute event<br/>severe time-fixed confounding| SC[Self-controlled<br/>SCCS / case-crossover]\n  Single --> RMM{Evaluating a risk-minimization<br/>measure - REMS/aRMM?}\n  ACNU --> RMM\n  RMM -->|Yes| Ind[Add process + outcome indicators<br/>survey + claims utilization metrics]",
        "caption": "Decision logic for choosing the risk-evaluation form. Disproportionality is signal detection, not risk evaluation; once a denominator-based risk is required, the comparator question and exposure pattern select single-arm, active-comparator, or self-controlled, with an RMM-effectiveness overlay when a risk-minimization measure is the target.",
        "alt_text": "Decision flowchart distinguishing spontaneous-report signal detection from single-arm, active-comparator, and self-controlled risk evaluation, with a risk-minimization-measure effectiveness branch.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "alternative_to",
        "target_slug": "signal-detection",
        "notes": "Spontaneous-report disproportionality (FAERS/EudraVigilance) flags signals without a denominator; a risk evaluation study quantifies the risk with person-time. Disproportionality is hypothesis-generating, not risk-quantifying."
      },
      {
        "relation_type": "used_with",
        "target_slug": "incidence-rate-calculation-rwe",
        "notes": "The core single-arm deliverable is an incidence rate (events per person-time) with an exact/Poisson confidence interval."
      },
      {
        "relation_type": "used_with",
        "target_slug": "active-comparator-new-user",
        "notes": "A comparative risk evaluation adds a clinically interchangeable active comparator and propensity-score balancing to estimate a rate ratio rather than an absolute rate."
      },
      {
        "relation_type": "alternative_to",
        "target_slug": "self-controlled-case-series",
        "notes": "For transient exposures and acute outcomes with severe time-fixed confounding, a within-person self-controlled design cancels all time-invariant confounders that a cohort risk evaluation must adjust for."
      },
      {
        "relation_type": "part_of",
        "target_slug": "pass-imposed",
        "notes": "Imposed post-authorization safety studies are a primary regulatory home for risk evaluation designs (EMA imposed PASS; analogous to FDA post-marketing requirements)."
      },
      {
        "relation_type": "see_also",
        "target_slug": "pass-voluntary",
        "notes": "Voluntary/sponsor-initiated PASS use the same risk-evaluation design machinery without a regulatory mandate."
      },
      {
        "relation_type": "requires",
        "target_slug": "safety-signal-case-definition-rwe",
        "notes": "A risk evaluation depends on a pre-specified, validated AE case definition (the outcome algorithm and its PPV)."
      },
      {
        "relation_type": "see_also",
        "target_slug": "immortal-time-bias-handling",
        "notes": "Anchoring follow-up to REMS enrollment/certification rather than first exposure introduces immortal time and deflates the estimated rate."
      },
      {
        "relation_type": "see_also",
        "target_slug": "competing-risks-cause-specific-fine-gray-rwe",
        "notes": "In elderly/seriously ill cohorts, death competes with the AE; report cause-specific person-time and consider subdistribution methods."
      },
      {
        "relation_type": "see_also",
        "target_slug": "claims-outcome-algorithm-ppv-sensitivity-rwe",
        "notes": "Modest outcome-algorithm PPV biases the numerator; validate the algorithm and apply quantitative bias correction."
      }
    ],
    "aliases": [
      "risk evaluation study",
      "post-authorization safety study",
      "PASS",
      "active surveillance study",
      "REMS effectiveness assessment",
      "safety surveillance study"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "ema"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "risk-ratio-and-risk-difference",
    "name": "Risk Ratio and Risk Difference",
    "short_definition": "The risk ratio (RR, relative risk) is the ratio of two incidence proportions — events per persons at risk over a defined time window — and the risk difference (RD) is their absolute gap; together they form the primary effect-measure pair for cohort and trial data, quantifying how much more (or less) common an outcome is in one group than another in both relative and absolute terms.",
    "long_description": "**What is risk? The three denominators**\n\nRisk, rate, and odds are three distinct ways to express how common an event is in a population,\nand confusing them is one of the most persistent errors in epidemiological reporting. Each has a\ndifferent denominator and a different meaning.\n\n**Risk** (cumulative incidence proportion) is the fraction of individuals in a group who experience\nthe event during a specified time window: risk = events / persons_at_risk_at_start. A 12-month risk\nof 0.12 means 12 out of every 100 patients who were event-free at the start experienced the event\nduring the next year. Risk is dimensionless and bounded [0, 1]. Its denominator is persons — not\nperson-time — so it requires every participant to have the same (or appropriately handled) observable\nfollow-up window. In claims and EHR analyses with variable follow-up, patients whose observable time\nis shorter than the chosen window must be handled via censoring (see `censoring-mechanisms-rwe`) or\nthe analysis must be restricted to those with complete observable follow-up. Without a fixed, common\ntime horizon, you are computing a rate, not a risk.\n\n**Rate** (incidence rate) expresses events per unit of person-time: rate = events / total_person_time.\nA rate of 0.12 per person-year is not the same as a risk of 0.12 — the rate accommodates variable\nfollow-up, is unbounded, and belongs to the Poisson rather than the binomial family (see\n`incidence-rate-calculation-rwe`). For time-to-event data with variable follow-up and censoring,\nthe hazard ratio family is the natural effect measure, not the RR.\n\n**Odds** express the ratio of the event probability to the non-event probability: odds = p / (1 − p).\nOdds of 0.136 correspond to a risk of 0.12. 
The odds ratio (OR) from logistic regression overstates\nrelative associations for common outcomes (risk above approximately 10%) and is the only directly\nestimable measure in case-control studies where absolute risks are uninformative. The OR is NOT\ninterchangeable with the RR when the outcome is common — see `binomial-distribution-logit-link` for\nthe full treatment of the logit link and the OR≠RR trap.\n\n**Risk ratio and risk difference: definitions and arithmetic**\n\nGiven a fixed time window T, a group of n_e exposed individuals with a_e events, and a group of n_u\nunexposed individuals with a_u events:\n\nrisk_exposed   = a_e / n_e\nrisk_unexposed = a_u / n_u\nRR             = risk_exposed / risk_unexposed\nRD             = risk_exposed − risk_unexposed\n\nThe **risk ratio (RR)**, also called the relative risk or cumulative incidence ratio, is a\ndimensionless multiplier. An RR of 2.0 means the exposed group has twice the cumulative incidence\nof the unexposed group over the stated window. It is asymmetric with respect to the event scale: if\nthe RR of experiencing disease is 2.0, the RR of *surviving*, (1 − risk_exposed) / (1 − risk_unexposed),\nis generally NOT 0.5; when risks are small it is close to 1 (risks of 0.10 and 0.05 give 0.90 / 0.95 ≈ 0.95).\nThis means \"twice the risk of disease\" and \"half the\nprotection\" are not equivalent statements, and the RR direction should always be stated explicitly.\n\nThe **risk difference (RD)**, also called the absolute risk increase (when RD > 0) or absolute risk\nreduction (when RD < 0, for a beneficial exposure), is the arithmetic gap between the two risks. An\nRD of 0.06 means 6 additional events per 100 exposed individuals over the stated window — or 60\nadditional events per 1,000. 
The RD carries the same unit as the risk (a proportion), making it the\nmost policy-interpretable measure for public health, payers, and HTA bodies: it directly answers \"how\nmany events in this population are attributable to this exposure over this horizon?\"\n\n**Why both measures are necessary: the relative-looks-big / absolute-is-small problem**\n\nThe relative and absolute measures carry complementary information that neither alone conveys. An RR\nof 2.0 sounds dramatic — \"doubled risk.\" But \"twice the risk\" of a rare event can be clinically\nnegligible: if the baseline risk is 0.001 (1 in 1,000), an RR of 2.0 yields an RD of only 0.001 —\n1 additional case per 1,000 exposed. Conversely, a \"modest\" RR of 1.3 on a 40% baseline risk\ngenerates an RD of 12 percentage points — 12 additional events per 100 exposed individuals. Media\nreporting of relative risks without the absolute baseline is the dominant source of health numeracy\nerrors in public communication, and regulatory and HTA bodies have accordingly required both measures\nin evidence packages for decades: FDA guidance on benefit-risk labeling and EMA statistical principles\nconsistently require the RD (or its NNT/NNH derivative) alongside any relative measure. Presenting\nonly the RR systematically misleads about public-health burden; presenting only the RD without the RR\nprevents cross-population comparison.\n\n**RR baseline-risk dependence, asymmetry, and collapsibility**\n\nThe RD is a direct product of the RR and the baseline risk: RD = risk_unexposed × (RR − 1). This\nmeans the same RR implies very different absolute burdens in populations with different baseline risks:\nan RR of 2.0 applied to a 30% baseline yields RD = 0.30, while the same RR applied to a 3% baseline\nyields RD = 0.03 — a ten-fold difference in absolute burden. 
Because the RR is typically less sensitive than the RD to the\nlevel of baseline risk, it is the preferred measure for transporting effects across populations and for\nmeta-analysis, though it is not completely population-independent when effect modification varies\nacross subgroups.\n\nBoth the RR and the RD are **collapsible**: adding a balanced, non-confounding predictor to a model\ndoes not systematically shift the marginal RR or RD. This is an advantage over the odds ratio, which\nis non-collapsible and shifts when a strong predictor is added even when that predictor is perfectly\nbalanced across treatment arms and is not a confounder (see `binomial-distribution-logit-link`).\nCollapsibility means that differences between adjusted and unadjusted RRs or RDs reveal true\nconfounding rather than a mathematical artefact of the link function.\n\n**Estimation with covariate adjustment**\n\nFor unadjusted estimation, compute RR and RD directly from a 2×2 table. Confidence intervals for the\nRR use the Woolf (log-normal) method: SE(log RR) = sqrt(1/a_e − 1/n_e + 1/a_u − 1/n_u); exponentiate\nthe log-scale limits. Confidence intervals for the RD use the Wald method: SE(RD) = sqrt(r_e(1−r_e)/n_e\n+ r_u(1−r_u)/n_u), where r_e = a_e / n_e and r_u = a_u / n_u. For adjusted estimation with covariates,\nthree approaches dominate:\n\n*Log-binomial regression*: GLM with binomial family and log link; exponentiated coefficients are\nadjusted RRs. Critical practical limitation: the log link can push predicted probabilities above 1.0\nfor extreme covariate values, causing frequent convergence failures in high-dimensional real-world\ndatasets.\n\n*Poisson regression with robust (sandwich) variance (Zou 2004)*: a Poisson GLM with log link and a\nsandwich covariance matrix yields adjusted RRs without the convergence failures of log-binomial. The\nmisspecification (Poisson model for binary data) leaves the point estimates consistent, and the robust\nvariance corrects the standard errors. 
This modified Poisson approach is now the dominant standard for\nadjusted RR estimation in pharmacoepidemiology and comparative effectiveness research.\n\n*Standardization / g-computation for marginal RD and RR*: fit any outcome model (e.g., logistic\nregression for individual event probability), predict each patient's counterfactual outcome\nprobability under both exposed and unexposed scenarios, average the predictions under each scenario\nacross the population, and take the difference (RD) or ratio (RR) of the two averages. G-computation\nyields marginal (population-averaged) adjusted RD and RR, with confidence intervals obtained by\nbootstrapping. It is a practical alternative to identity-link binomial regression (which can fail to\nconverge when fitted probabilities fall outside [0, 1]) and produces estimates directly interpretable\nfor HTA.\n\n**Interpreting the output**\n\nConsider a 12-month cohort study: 300 patients exposed to a new drug and 300 unexposed patients on a\ncomparator. Events: exposed = 36, unexposed = 18. Estimated RR = 2.0, 95% CI 1.2–3.4; RD = 0.06,\n95% CI 0.01–0.11.\n\n*Formal interpretation* — RR: the 12-month incidence proportion among the exposed is 0.12 (36 / 300)\nand among the unexposed is 0.06 (18 / 300). The risk ratio of 2.0 means the 12-month risk among the\nexposed is 2.0 times that among the unexposed. This is a measure of association; whether it reflects\na causal effect depends on whether confounding has been adequately controlled. Risk requires a stated\ntime window to be meaningful — this RR is for 12 months only and cannot be extended without\nre-estimation. The 95% CI of (1.2, 3.4) means: in repeated samples from the same population under the\nsame design, 95% of the constructed intervals would contain the true RR. RD = 0.06 (95% CI 0.01–0.11):\n6 additional events per 100 exposed patients over 12 months, or 60 per 1,000.\n\n*Practical interpretation* for clinicians and HTA bodies: \"Twice the risk\" sounds alarming, but the\nabsolute context matters. 
Here, doubling means going from 6 events to 12 events per 100 patients over\na year — about 1 extra event for every 17 patients exposed (NNH = 1 / 0.06 ≈ 17). Whether 17 is\nlarge or small depends on event severity, cost and tolerability of the exposure, and available\nalternatives. A headline that says \"drug doubles risk of event\" communicates the RR = 2.0 correctly\nbut conceals that the baseline risk is only 6%, making the absolute increase modest. Presenting\nRR = 2.0 alongside RD = 0.06 (60 per 1,000 over 12 months; NNH ≈ 17) gives decision-makers the\ncomplete picture needed for clinical and economic judgment.\n\n**Pros, cons, and trade-offs**\n\nRisk ratio:\n- Pros: scale-free and approximately portable across populations with different baseline risks; the\n  natural effect measure for prospective cohort and trial data with a fixed risk window; multiplicative\n  (null = 1.0); collapsible under balanced predictors; preferred for meta-analysis and transportability\n  to other populations.\n- Cons: gives no indication of absolute burden; \"twice the risk\" of a rare event may be clinically\n  negligible; asymmetric — the RR of the event and the RR of the non-event are not complementary; does\n  not directly answer \"how many additional patients were affected?\"\n- When to prefer: communicating the strength of an association; transporting effects across populations;\n  pre-specified primary effect measure in a cohort study.\n\nRisk difference:\n- Pros: directly answers the public-health and HTA question (how many additional events per N exposed?);\n  collapsible; the basis for NNT/NNH (NNT = 1/RD, see `number-needed-to-treat-rwe`); directly\n  interpretable for budget-impact and preventive-medicine communication; preferred by regulatory bodies\n  for labeling alongside any relative measure.\n- Cons: highly baseline-risk-dependent — a trial RD cannot be transported to a different-risk population\n  without re-anchoring; can look trivially small when the 
relative effect is large on a rare outcome.\n- When to prefer: HTA, payer, and clinical communication; whenever the policy question is \"how many\n  additional events are caused by this exposure in this population?\"\n\nVersus OR: the OR from logistic regression overstates relative effects for common outcomes (risk ≥ 10%)\nwhen read as an RR, is non-collapsible, and is the only directly estimable measure in case-control\nstudies. Use RR and RD for cohort data; use the OR for case-control designs or when the odds scale is\nexplicitly required. When logistic regression is fitted to cohort data, convert its predictions to\nmarginal RD/RR via g-computation (see `binomial-distribution-logit-link`).\n\n**When to use**\n\nUse RR and RD when: (1) the endpoint is binary over a **defined, fixed risk window** with complete\nfollow-up or appropriately handled censoring; (2) the study design is a prospective cohort,\nactive-comparator new-user design, randomized trial, or any design where person-count denominators can\nbe correctly constructed; (3) both a relative measure (for portability and meta-analysis) and an\nabsolute measure (for HTA and clinical communication) are required by the protocol; (4) downstream\nNNT/NNH communication is planned; (5) g-computation or Poisson-robust regression will be used for\ncovariate-adjusted estimates.\n\n**When NOT to use**\n\n- **Without a defined risk window or with heavy differential censoring**: risk is undefined without a\n  common time horizon for all participants. If patients have variable follow-up that cannot be fixed,\n  or if censoring is differential between exposure arms (e.g., the exposed group discontinues and is\n  censored earlier), survival methods — hazard ratios, restricted mean survival times, cumulative\n  incidence functions — are the appropriate family (see `censoring-mechanisms-rwe`).\n- **In case-control studies**: sampling on outcome status makes absolute risks in the enrolled sample\n  uninformative. The OR is the only directly estimable effect measure from case-control data. 
Converting\n  a case-control OR to an approximate RR requires external data on the population baseline risk and is\n  valid only when the outcome is rare enough that OR ≈ RR.\n- **Transporting an RD without re-anchoring to the local baseline risk**: an RD from a trial or RWE\n  study cannot be applied to a population with a different baseline risk without re-estimation. Pair\n  the RR (more portable) with the trial-context baseline risk so the audience can derive the local RD.\n- **Comparing RRs across populations with very different baseline risks as if they are equivalent**: an\n  RR of 2.0 in a high-risk clinic and an RR of 2.0 in a low-risk community setting do not imply the\n  same public-health burden. Effect modification on the additive (RD) scale is entirely compatible with\n  a constant multiplicative effect (RR), and vice versa. Always specify the population and time window\n  when reporting either measure.",
    "primary_category": "Inferential_Statistics",
    "tags": [
      "statistics",
      "primitive",
      "effect-measures",
      "risk",
      "epidemiology",
      "incidence-proportion",
      "relative-risk",
      "absolute-risk",
      "cohort",
      "two-by-two"
    ],
    "applies_to_study_types": [
      "cohort_retrospective",
      "cohort_prospective",
      "active_comparator_new_user",
      "claims_analysis",
      "comparative_effectiveness",
      "pragmatic_trial",
      "rct",
      "ehr_study",
      "registry_study"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "primary",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1046/j.1524-4733.2002.55150.x",
        "url": "https://doi.org/10.1046/j.1524-4733.2002.55150.x",
        "citation_text": "Schechtman E. Odds ratio, relative risk, absolute risk reduction, and the number needed to treat — which of these should we use? Value in Health. 2002;5(5):431-436.",
        "year": 2002,
        "authors_short": "Schechtman",
        "notes": "Comprehensive comparison of the four primary effect measures — OR, RR, ARR, NNT — explaining when each is appropriate, what each communicates, and how they interrelate. Essential reading for understanding why RR and RD must be presented together and when the OR is the only valid option (case-control data)."
      },
      {
        "role": "explain",
        "doi": "10.1093/ndt/gfw465",
        "url": "https://doi.org/10.1093/ndt/gfw465",
        "citation_text": "Noordzij M, van Diepen M, Caskey FC, Jager KJ. Relative risk versus absolute risk: one cannot be interpreted without the other. Nephrology Dialysis Transplantation. 2017;32(Suppl 2):ii13-ii18.",
        "year": 2017,
        "authors_short": "Noordzij et al.",
        "notes": "Authoritative statement that RR and RD are complementary and neither alone is sufficient for clinical or policy decision-making; illustrates with clinical examples how the same RR implies radically different absolute burdens depending on baseline risk. Directly supports the dual- reporting requirement in regulatory and HTA submissions."
      },
      {
        "role": "demonstrate",
        "doi": "10.1093/aje/kwh090",
        "url": "https://doi.org/10.1093/aje/kwh090",
        "citation_text": "Zou G. A modified Poisson regression approach to prospective studies with binary data. American Journal of Epidemiology. 2004;159(7):702-706.",
        "year": 2004,
        "authors_short": "Zou",
        "notes": "The seminal paper establishing Poisson regression with robust (sandwich) variance as the practical standard for adjusted RR estimation in cohort data, resolving the convergence failures of log-binomial regression while preserving direct RR estimation. Widely cited in pharmacoepidemiology and HEOR methodology."
      },
      {
        "role": "use",
        "doi": "10.1093/ije/dys049",
        "url": "https://doi.org/10.1093/ije/dys049",
        "citation_text": "Pearce N. Classification of epidemiological study designs. International Journal of Epidemiology. 2012;41(2):393-397.",
        "year": 2012,
        "authors_short": "Pearce",
        "notes": "Clarifies the fundamental distinction between risk (incidence proportion, persons denominator) and rate (incidence rate, person-time denominator) that underlies the correct use of RR and RD; motivates fixing the risk window and handling censoring before computing the incidence proportion."
      }
    ],
    "plain_language_summary": "A risk is the fraction of patients in a group who had a given health event during a fixed time window — for example, 12 out of 100 patients hospitalized within one year gives a risk of 0.12. The risk ratio compares two groups by dividing their risks (a ratio of 2.0 means one group has twice the risk), while the risk difference subtracts one from the other to show how many extra events occurred per 100 people. Both numbers are needed together because doubling a tiny risk still produces a small absolute burden, and you cannot make a clinical or policy decision from either measure alone.",
    "key_terms": [
      {
        "term": "risk (incidence proportion)",
        "definition": "The fraction of patients in a group who had the outcome during a defined time window — for example, 36 out of 300 patients is a risk of 0.12; risk requires a fixed time window, and without one you are computing a rate instead."
      },
      {
        "term": "risk ratio",
        "definition": "The exposed group's risk divided by the unexposed group's risk; a ratio of 2.0 means the exposed group had twice the incidence proportion over the same time window."
      },
      {
        "term": "risk difference",
        "definition": "The exposed group's risk minus the unexposed group's risk; an RD of 0.06 means 6 additional events per 100 exposed patients over the stated window, or 60 per 1,000."
      },
      {
        "term": "baseline risk",
        "definition": "The risk in the comparison (unexposed) group; the same relative risk implies a large absolute difference in a high-risk population and a tiny one in a low-risk population, which is why an RD from one study cannot be directly transplanted to another setting."
      },
      {
        "term": "absolute vs relative effect",
        "definition": "The absolute effect (risk difference) answers \"how many extra events per N patients?\" while the relative effect (risk ratio) answers \"how many times more likely?\"; neither alone is sufficient for clinical or policy decisions."
      },
      {
        "term": "fixed risk window",
        "definition": "The defined follow-up period (for example, 12 months from a patient's index date) over which events are counted to compute a risk; without a common fixed window for all patients, the incidence proportion is undefined and a rate or hazard measure is required instead."
      }
    ],
    "worked_example": {
      "scenario": "A health outcomes team studies a new drug's association with a hospital event over 12 months in a claims database. They identify 300 new users of the drug (exposed) and 300 matched new users of a comparator drug (unexposed), each followed for 12 months from their index date (first fill). They count events, compute the 2x2 risk table, derive the RR and RD, then express the RD on a per-1,000 scale and compute the NNH.",
      "dataset": {
        "caption": "12-month event counts for exposed and unexposed new-user cohorts (n=300 per group).",
        "columns": [
          "group",
          "events",
          "n"
        ],
        "rows": [
          [
            "exposed",
            36,
            300
          ],
          [
            "unexposed",
            18,
            300
          ]
        ]
      },
      "steps": [
        "Compute the exposed-group risk: 36 events out of 300 patients, so risk = 36/300 = 0.12. Twelve out of every 100 exposed patients had the event during the 12-month window.",
        "Compute the unexposed-group risk: 18 events out of 300 patients, so risk = 18/300 = 0.06. Six out of every 100 unexposed patients had the event over the same window.",
        "Compute the risk ratio: RR = 0.12/0.06 = 2.0. The exposed group has twice the 12-month risk of the unexposed group. This is the relative effect.",
        "Compute the risk difference: RD = 0.12 - 0.06 = 0.06. There are 6 additional events per 100 exposed patients compared with the unexposed over the 12-month window.",
        "Express the RD on a per-1,000 scale: 0.06 * 1000 = 60 additional events per 1,000 exposed patients. This is the public-health framing preferred by payers and HTA bodies.",
        "Compute NNH: 1/0.06 = 16.7, rounded up to 17. About 17 patients must be exposed (instead of receiving the comparator) for one additional event to occur over 12 months."
      ],
      "result": "Exposed risk = 36/300 = 0.12; Unexposed risk = 18/300 = 0.06. RR = 0.12/0.06 = 2.0: exposed patients have twice the 12-month event risk. RD = 0.12 - 0.06 = 0.06, equal to 0.06 * 1000 = 60 extra events per 1,000 exposed over 12 months. NNH = 1/0.06 = 16.7, round up to 17. Both measures are needed: the RR communicates relative doubling while the RD shows that only 6 additional patients per 100 are affected — about 1 in every 17 exposed patients over the 12-month window."
    },
    "prerequisites": [
      "descriptive-statistics",
      "incidence-rate-calculation-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Unadjusted RR and RD from a 2x2 table",
        "description": "The simplest form: compute risk_exposed = a / n_e and risk_unexposed = c / n_u from event counts, then RR = risk_exposed / risk_unexposed and RD = risk_exposed - risk_unexposed. Confidence intervals via the Woolf (log) method for RR and the Wald method for RD. Appropriate as a crude estimate in a balanced (randomized or matched) design; in unbalanced observational data, confounding adjustment is required.",
        "edge_cases": [
          "When either risk is zero (no events in one arm), the log-RR CI is undefined; add 0.5 to each cell (Haldane-Anscombe correction) or use exact methods.",
          "When the exposed and unexposed groups have very different follow-up completeness, the crude risk estimate is biased; ensure both arms are constructed on the same observable window."
        ],
        "data_source_notes": "Claims: restrict to FFS-observable windows; drop Medicare Advantage-only spans that create spurious event-free follow-up. EHR: require active-in-system criterion so out-of-system time is not counted as event-free."
      },
      {
        "name": "Adjusted RR via Poisson regression with robust variance (Zou 2004)",
        "description": "Fit a Poisson GLM with log link on individual-level binary outcome data, adding confounders to the model formula. Apply a sandwich (robust) variance estimator to correct for the Poisson misspecification. Exponentiated coefficients and their robust 95% CIs are adjusted RRs. This approach resolves the convergence failures of log-binomial regression and is the dominant standard for adjusted RR estimation in pharmacoepidemiology.",
        "edge_cases": [
          "With rare outcomes (risk < 5% in all cells), Poisson-robust and log-binomial RRs are nearly identical; Poisson-robust is still preferred for its numerical stability.",
          "Propensity-score weighting can be combined with Poisson-robust regression: use weighted Poisson with robust variance and report a weighted-mean adjusted RR."
        ],
        "data_source_notes": "Claims: build the exposure and covariate matrix on the same index-date cohort used for the 2x2 table; high-dimensional propensity score (hdPS) covariates can be added directly to the Poisson model formula. SAS: use PROC GENMOD with DIST=POISSON LINK=LOG and REPEATED for robust SE."
      },
      {
        "name": "Marginal adjusted RD via standardization (g-computation)",
        "description": "Fit any outcome model (e.g., logistic regression) on individual-level data with confounders. Predict each patient's counterfactual event probability under exposed and unexposed scenarios. Average the difference across all patients to obtain the marginal (population-averaged) adjusted RD. Bootstrap the entire procedure for confidence intervals. Avoids identity-link convergence failures and produces the marginal RD directly interpretable for HTA and NNT computation.",
        "edge_cases": [
          "G-computation requires a correctly specified outcome model; misspecification propagates into the marginal RD estimate. TMLE or AIPW doubly-robust estimators provide protection.",
          "The bootstrap must re-fit the outcome model in each sample, not just resample the final predictions, to capture full estimation uncertainty."
        ],
        "data_source_notes": "All sources: fit the outcome model on the analytical cohort with the full covariate set, predict under both exposure scenarios for every patient, and average. In claims, verify that the prediction step uses only baseline covariates (before the exposure start date)."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "binomial-distribution-logit-link",
        "pros_of_this": "The RR and RD are collapsible, directly clinically interpretable, and preferred by regulatory and HTA bodies; the RD directly supports NNT/NNH communication; the RR is appropriate for transporting effects across populations.",
        "cons_of_this": "Log-binomial RR estimation has convergence failures in high-dimensional settings; identity-link RD estimation faces out-of-range predictions; neither RR nor RD is the natural estimand from case-control data (where only the OR is directly estimable).",
        "when_to_prefer": "For cohort and trial data with a fixed risk window and binary outcome; when clinical and HTA communication requires absolute and relative measures side-by-side. Use the OR (logit link) for case-control studies or when the analyst will standardize from logistic regression via g-computation."
      },
      {
        "compared_to": "hazard-ratio-interpretation",
        "pros_of_this": "The RR and RD require no proportional-hazards assumption, are directly computed from the 2x2 table, and are more intuitively interpretable as \"fraction of patients affected\" over the stated window.",
        "cons_of_this": "With variable follow-up, differential censoring, or competing risks, simple risk estimates are biased; hazard ratio methods (Cox regression, cumulative incidence functions) handle these scenarios correctly and use all available follow-up time rather than imposing a fixed window.",
        "when_to_prefer": "When follow-up is complete and equal for all patients over a fixed window; for HTA summary tables requiring absolute risk at a stated time horizon. Use the hazard ratio and cumulative incidence functions when follow-up is variable, censoring is non-negligible, or competing risks are present."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Construct the 2x2 table on FFS-observable person-time only; Medicare Advantage-only spans generate spurious event-free follow-up that deflates both risks and therefore the RD. For adjusted estimates, apply the Poisson-robust approach (Zou 2004) on the individual-level active-comparator new-user cohort with PS-matched or PS-weighted confounding control. State the risk window explicitly in all output tables; 30-day, 6-month, and 12-month windows will produce different RRs and RDs.",
      "ehr": "Encounter-driven ascertainment leaves event-free time partly unobserved when patients receive care outside the system; require an active-in-system criterion (e.g., at least one outpatient encounter in the prior 12 months) and restrict the risk window to observable time. Informative loss to follow-up (sicker patients less likely to return) biases both risks.",
      "registry": "Adjudicated registry outcomes and a linked death index give the cleanest absolute risks; check completeness and reporting lag so late-reported events at the horizon are not undercounted, deflating the RD. Registry data often have near-complete follow-up, making simple 2x2 estimation from the table straightforward.",
      "linked": "Linked claims-EHR-vital-records data provides an honest competing-risk denominator (death index) and clean outcomes for the most defensible risk estimates; the baseline risk anchoring the RD is the linked-cohort baseline, so report linkage rate and any selection into the linkable subset.",
      "primary": "Randomized trials and primary studies with complete follow-up are the cleanest setting for RR and RD estimation; the 2x2 table approach is exact and simple. Report the risk window used in the protocol alongside both measures and their CIs."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\nfrom scipy import stats\nimport statsmodels.api as sm\nfrom statsmodels.stats.contingency_tables import Table2x2\nimport pandas as pd\n\n# ── 1. Manual 2x2 computation (worked example) ──────────────────────────────\n# Exposed:   36 events / 300 patients -> risk = 0.12\n# Unexposed: 18 events / 300 patients -> risk = 0.06\na, b = 36, 264    # exposed:   events, non-events\nc, d = 18, 282    # unexposed: events, non-events\nn_e, n_u = a + b, c + d    # 300, 300\n\nr_e = a / n_e     # exposed risk   = 36/300 = 0.12\nr_u = c / n_u     # unexposed risk  = 18/300 = 0.06\nrr  = r_e / r_u   # risk ratio      = 0.12/0.06 = 2.0\nrd  = r_e - r_u   # risk difference = 0.12 - 0.06 = 0.06\n\nprint(f\"Exposed risk   = {a}/{n_e} = {r_e:.4f}\")\nprint(f\"Unexposed risk = {c}/{n_u} = {r_u:.4f}\")\nprint(f\"RR = {r_e:.4f} / {r_u:.4f} = {rr:.4f}\")\nprint(f\"RD = {r_e:.4f} - {r_u:.4f} = {rd:.4f}  ({rd * 1000:.0f} per 1,000)\")\nprint(f\"NNH = 1 / {rd:.4f} = {1/rd:.1f}  (round up to {int(np.ceil(1/rd))})\")\n\n# ── 2. Woolf (log-normal) 95% CI for the RR ─────────────────────────────────\n# SE(log RR) = sqrt(1/a - 1/n_e + 1/c - 1/n_u)\nse_log_rr = np.sqrt(1/a - 1/n_e + 1/c - 1/n_u)\nz = stats.norm.ppf(0.975)\nrr_ci = (np.exp(np.log(rr) - z * se_log_rr),\n         np.exp(np.log(rr) + z * se_log_rr))\nprint(f\"\\nRR 95% CI (Woolf): ({rr_ci[0]:.3f}, {rr_ci[1]:.3f})\")\n\n# ── 3. Wald 95% CI for the RD ────────────────────────────────────────────────\nse_rd = np.sqrt(r_e * (1 - r_e) / n_e + r_u * (1 - r_u) / n_u)\nrd_ci = (rd - z * se_rd, rd + z * se_rd)\nprint(f\"RD 95% CI (Wald):  ({rd_ci[0]:.4f}, {rd_ci[1]:.4f})\")\n\n# ── 4. 
statsmodels Table2x2: cross-validated CIs ─────────────────────────────\n# Layout: [[exposed_events, exposed_non], [unexposed_events, unexposed_non]]\nfrom statsmodels.stats.proportion import confint_proportions_2indep\n\ntbl = Table2x2(np.array([[a, b], [c, d]]))\nrr_sm    = tbl.riskratio            # float: row-1 risk / row-2 risk\nrr_sm_ci = tbl.riskratio_confint()  # (lower, upper) from the log-scale normal approximation\n# Table2x2 has no risk-difference method; use the two-sample proportion CI instead\nrd_sm_ci = confint_proportions_2indep(a, n_e, c, n_u,\n                                      method=\"wald\", compare=\"diff\")\nprint(f\"\\nstatsmodels RR = {rr_sm:.4f}  \"\n      f\"95% CI ({rr_sm_ci[0]:.3f}, {rr_sm_ci[1]:.3f})\")\nprint(f\"statsmodels RD = {rd:.4f}  \"\n      f\"95% CI ({rd_sm_ci[0]:.4f}, {rd_sm_ci[1]:.4f})\")\n\n# ── 5. Poisson GLM with robust variance: adjusted RR (Zou 2004) ──────────────\n# Reconstruct individual-level data from summary counts\ndf = pd.DataFrame({\n    \"exposed\": [1] * n_e + [0] * n_u,\n    \"event\":   [1] * a + [0] * b + [1] * c + [0] * d,\n})\nX = sm.add_constant(df[\"exposed\"])\n# cov_type=\"HC0\" = heteroscedasticity-consistent (sandwich) variance (Zou 2004)\nfit = sm.GLM(df[\"event\"], X,\n             family=sm.families.Poisson()).fit(cov_type=\"HC0\")\nadj_rr    = np.exp(fit.params[\"exposed\"])\nadj_rr_ci = np.exp(fit.conf_int().loc[\"exposed\"])\nprint(f\"\\nPoisson+robust (Zou 2004) RR = {adj_rr:.4f}  \"\n      f\"95% CI ({adj_rr_ci[0]:.3f}, {adj_rr_ci[1]:.3f})\")\nprint(\"For ADJUSTED RR: add confounders to the model formula.\")\nprint(\"For ADJUSTED RD: use g-computation — predict P(event|exp=1/0) per patient,\")\nprint(\"  average the difference; bootstrap the full procedure for CIs.\")",
        "description": "Manual 2x2 computation of RR and RD from the worked example (36/300 vs 18/300), validated\nagainst statsmodels Table2x2 CIs. Includes the Woolf (log-normal) 95% CI for the RR and the\nWald 95% CI for the RD. Also demonstrates Poisson regression with robust (sandwich) variance\n(Zou 2004) as the standard for adjusted RR estimation, resolving convergence failures of\nlog-binomial regression.",
        "dependencies": [
          "numpy",
          "scipy",
          "statsmodels",
          "pandas"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "# ── 1. Manual 2x2 computation (worked example) ──────────────────────────────\na  <- 36;  b_n <- 264   # exposed:   events, non-events\ncc <- 18;  d   <- 282   # unexposed: events, non-events\nn_e <- a + b_n; n_u <- cc + d   # 300, 300\n\nr_e <- a  / n_e    # exposed risk   = 36/300 = 0.12\nr_u <- cc / n_u    # unexposed risk  = 18/300 = 0.06\nrr  <- r_e / r_u   # risk ratio      = 0.12/0.06 = 2.0\nrd  <- r_e - r_u   # risk difference = 0.12 - 0.06 = 0.06\n\ncat(sprintf(\"Exposed risk   = %d/%d = %.4f\\n\", a, n_e, r_e))\ncat(sprintf(\"Unexposed risk = %d/%d = %.4f\\n\", cc, n_u, r_u))\ncat(sprintf(\"RR = %.4f,  RD = %.4f  (%.0f per 1,000)\\n\", rr, rd, rd * 1000))\ncat(sprintf(\"NNH = 1/%.4f = %.1f (round up to %d)\\n\", rd, 1/rd, ceiling(1/rd)))\n\n# ── 2. Woolf CI for RR and Wald CI for RD (base R) ─────────────────────────\nse_log_rr <- sqrt(1/a - 1/n_e + 1/cc - 1/n_u)\nz <- qnorm(0.975)\nrr_ci <- exp(log(rr) + c(-1, 1) * z * se_log_rr)\ncat(sprintf(\"RR 95%% CI (Woolf): (%.3f, %.3f)\\n\", rr_ci[1], rr_ci[2]))\n\nse_rd <- sqrt(r_e * (1 - r_e) / n_e + r_u * (1 - r_u) / n_u)\nrd_ci <- rd + c(-1, 1) * z * se_rd\ncat(sprintf(\"RD 95%% CI (Wald):  (%.4f, %.4f)\\n\", rd_ci[1], rd_ci[2]))\n\n# ── 3. epitools::riskratio (cross-validation) ───────────────────────────────\nif (requireNamespace(\"epitools\", quietly = TRUE)) {\n  tbl <- matrix(c(a, b_n, cc, d), nrow = 2, byrow = TRUE,\n                dimnames = list(c(\"exposed\", \"unexposed\"), c(\"event\", \"noevent\")))\n  res_rr <- epitools::riskratio(tbl, method = \"log\")\n  cat(sprintf(\"\\nepitools RR = %.4f  95%% CI (%.3f, %.3f)\\n\",\n              res_rr$measure[2, 1], res_rr$measure[2, 2], res_rr$measure[2, 3]))\n}\n\n# ── 4. 
Poisson GLM with sandwich variance: adjusted RR (Zou 2004) ───────────\ndf <- data.frame(\n  exposed = c(rep(1L, n_e), rep(0L, n_u)),\n  event   = c(rep(1L, a),  rep(0L, b_n), rep(1L, cc), rep(0L, d))\n)\nfit_pois <- glm(event ~ exposed, family = poisson(link = \"log\"), data = df)\nif (requireNamespace(\"sandwich\", quietly = TRUE)) {\n  library(sandwich)\n  se_rob   <- sqrt(diag(sandwich(fit_pois)))[\"exposed\"]\n  coef_exp <- coef(fit_pois)[\"exposed\"]\n  rr_adj   <- exp(coef_exp)\n  rr_ci_adj <- exp(coef_exp + c(-1, 1) * z * se_rob)\n  cat(sprintf(\"\\nPoisson+robust RR = %.4f  95%% CI (%.3f, %.3f)  [Zou 2004]\\n\",\n              rr_adj, rr_ci_adj[1], rr_ci_adj[2]))\n  cat(\"For ADJUSTED RR: add confounders to the model formula.\\n\")\n}\n\n# ── 5. Identity-link binomial GLM: marginal RD (may not converge) ───────────\nfit_id <- tryCatch(\n  glm(event ~ exposed, family = binomial(link = \"identity\"), data = df),\n  warning = function(w) NULL, error = function(e) NULL\n)\nif (!is.null(fit_id)) {\n  rd_adj    <- coef(fit_id)[\"exposed\"]\n  rd_ci_adj <- suppressMessages(confint(fit_id))[\"exposed\", ]\n  cat(sprintf(\"Identity-link RD = %.4f  95%% CI (%.4f, %.4f)\\n\",\n              rd_adj, rd_ci_adj[1], rd_ci_adj[2]))\n} else {\n  cat(\"Identity-link binomial did not converge; use g-computation for marginal RD.\\n\")\n}",
        "description": "Manual 2x2 RR and RD computation from the worked example with Woolf (log) and Wald CIs in\nbase R, validated against epitools::riskratio. Demonstrates Poisson GLM with sandwich\n(robust) variance via the sandwich package for adjusted RR (Zou 2004). Also shows the\nidentity-link binomial GLM for adjusted RD with a fallback note for convergence failures.",
        "dependencies": [
          "epitools",
          "sandwich",
          "lmtest"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* ─── Create the weighted 2x2 dataset ─── */\ndata work.rr2x2;\n  input arm $ event count;\n  datalines;\nexposed   1  36\nexposed   0 264\nunexposed 1  18\nunexposed 0 282\n;\nrun;\n\n/* ─── PROC FREQ: RELRISK = RR with Woolf CI; RISKDIFF = RD with Wald CI ─── */\nproc freq data=work.rr2x2 order=data;\n  weight count;\n  tables arm * event / relrisk riskdiff(cl=wald) nopercent norow nocol;\n  /*\n    RELRISK output (Row 1 vs Row 2, i.e., exposed vs unexposed):\n      Risk Ratio = 0.12 / 0.06 = 2.0000\n      95% CI     ~ (1.19, 3.38) via Woolf (log) method\n\n    RISKDIFF output (Row 1 minus Row 2, i.e., exposed minus unexposed):\n      Difference = 0.12 - 0.06 = 0.0600\n      95% CI     ~ (0.0145, 0.1055) via Wald method\n\n    NOTE: order=data ensures \"exposed\" is Row 1 and \"unexposed\" is Row 2.\n    If order is wrong, the ratio and difference will be inverted.\n  */\nrun;\n\n/* ─── Expand to individual-level data for PROC GENMOD ─── */\ndata work.indiv;\n  set work.rr2x2;\n  exposed_flag = (arm = 'exposed');\n  do i = 1 to count; output; end;\n  drop i count;\nrun;\n\n/* ─── PROC GENMOD: Poisson + robust variance = adjusted RR (Zou 2004) ─── */\nproc genmod data=work.indiv;\n  class exposed_flag (ref='0') / param=ref;\n  model event = exposed_flag / dist=poisson link=log;\n  repeated subject=_n_ / type=ind;\n  /*\n    DIST=POISSON LINK=LOG with REPEATED TYPE=IND applies a sandwich (GEE-like)\n    robust variance estimator, which corrects the SEs for Poisson misspecification\n    when the outcome is binary. 
exp(exposed_flag parameter) = crude RR = 2.0.\n\n    To obtain an ADJUSTED RR: add confounder terms to the MODEL statement.\n    The sandwich variance (REPEATED) remains the correct SE estimator regardless\n    of how many covariates are included.\n\n    Convergence note: PROC GENMOD with DIST=POISSON almost never fails to converge,\n    unlike PROC GENMOD with DIST=BINOMIAL LINK=LOG (log-binomial), which frequently\n    produces \"Convergence criterion not met\" errors in high-dimensional datasets.\n    Use DIST=POISSON with REPEATED whenever log-binomial fails.\n  */\nrun;",
        "description": "2x2 RR and RD from the worked example via PROC FREQ with RELRISK (risk ratio with Woolf CI)\nand RISKDIFF (risk difference with Wald CI). A PROC GENMOD block demonstrates the\nPoisson-with-robust-variance approach (Zou 2004) for adjusted RR on individual-level data.\nA short DATA step expands the weighted 2x2 to individual rows for PROC GENMOD.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Pe[\"Exposed risk\\nrisk_e = events_e / n_e\\nexample: 36/300 = 0.12\"] --> RR\n  Pu[\"Unexposed risk\\nrisk_u = events_u / n_u\\nexample: 18/300 = 0.06\"] --> RR\n  Pe --> RD\n  Pu --> RD\n  RR[\"Risk Ratio\\nRR = risk_e / risk_u = 2.0\\nrelative effect\"]\n  RD[\"Risk Difference\\nRD = risk_e - risk_u = 0.06\\nabsolute effect\"]\n  RR --> Adj1[\"Adjusted RR\\nPoisson + robust SE\\n(Zou 2004) — avoids\\nlog-binomial convergence failure\"]\n  RD --> Adj2[\"Adjusted RD\\ng-computation\\nstandardization\\nidentity-link GLM\"]\n  RR --> Comm1[\"Report as:\\n2.0x the 12-month risk\\n(relative)\"]\n  RD --> Comm2[\"Report as:\\n60 per 1,000 over 12 months\\nNNH = 1/RD = 17\\n(absolute)\"]\n  Comm1 --> Both[\"Present BOTH:\\nRR for portability\\nRD for HTA magnitude\"]\n  Comm2 --> Both\n  Both --> Prereq[\"REQUIRES a fixed risk window\\nfor all patients and appropriate\\ncensoring handling\"]",
        "caption": "From two arm-specific incidence proportions to the risk ratio (relative effect) and risk difference (absolute effect), their adjusted analogues, and the requirement to present both. The RR is more portable across populations with different baseline risks; the RD gives clinical and HTA magnitude. Both require a fixed risk window and proper censoring handling in real-world data.",
        "alt_text": "Flowchart from exposed and unexposed incidence proportions to the risk ratio (2.0) and risk difference (0.06), branching to adjusted estimation methods (Poisson-robust for RR, g-computation for RD), then to the communication framing (relative doubling vs 60 per 1,000 and NNH of 17), converging on the requirement to present both and the fixed-window prerequisite.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "requires",
        "target_slug": "cumulative-incidence-risk-rwe",
        "notes": "The RR and RD are ratios and differences of cumulative incidence proportions; correct construction of the at-risk denominator, fixed observation window, and censoring handling is the prerequisite for valid risk estimation."
      },
      {
        "relation_type": "see_also",
        "target_slug": "binomial-distribution-logit-link",
        "notes": "The odds ratio counterpart — covers logistic regression, the logit link, the OR vs RR distinction for common outcomes (OR overstates relative effects when risk exceeds 10%), and non-collapsibility; use that entry for case-control studies or when the OR is the pre-specified estimand."
      },
      {
        "relation_type": "see_also",
        "target_slug": "number-needed-to-treat-rwe",
        "notes": "NNT = 1 / ARR and NNH = 1 / ARI are derived directly from the risk difference; that entry covers the reciprocal transformation, the discontinuity problem when the RD CI spans zero, and how to report NNT/NNH with time horizon for HTA and clinical communication."
      },
      {
        "relation_type": "see_also",
        "target_slug": "hazard-ratio-interpretation",
        "notes": "The time-to-event counterpart; when follow-up is variable, censoring is differential, or competing risks are present, the hazard ratio and cumulative incidence function approach replaces the simple risk ratio as the appropriate effect measure."
      },
      {
        "relation_type": "see_also",
        "target_slug": "attributable-risk-paf",
        "notes": "The population attributable fraction extends the RD to a population-burden measure: PAF quantifies what fraction of all events in the population would be eliminated if the exposure were removed; it is computed from the RR and the prevalence of exposure."
      },
      {
        "relation_type": "see_also",
        "target_slug": "poisson-distribution",
        "notes": "Rate vs risk distinction: the Poisson distribution underlies the rate model (events per person-time), while the binomial underlies the risk model (events per persons); Poisson regression with robust variance is also the standard tool for adjusted RR estimation from binary-outcome cohort data (Zou 2004)."
      }
    ],
    "aliases": [
      "relative risk",
      "RR",
      "risk difference",
      "RD",
      "absolute risk reduction",
      "cumulative incidence ratio",
      "absolute risk increase",
      "ARR",
      "ARI"
    ],
    "complexity": "foundational",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta",
      "journal"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "roc-auc-discrimination-rwe",
    "name": "ROC Curves, AUC, and the c-statistic",
    "short_definition": "The receiver operating characteristic (ROC) curve plots sensitivity against 1 - specificity across all thresholds of a continuous score; its area under the curve (AUC, equivalently Harrell's c-statistic) summarizes discrimination threshold-free and equals the probability that a randomly chosen case has a higher score than a randomly chosen non-case (concordance) — a measure of ranking, not of calibration.",
    "long_description": "The **receiver operating characteristic (ROC) curve** displays the full trade-off between sensitivity and specificity as\nthe decision threshold of a continuous (or ordinal) score sweeps from most to least stringent. For every candidate\nthreshold the curve plots the **true-positive rate (sensitivity)** on the y-axis against the **false-positive rate\n(1 - specificity)** on the x-axis; the resulting monotone curve runs from (0,0) to (1,1). A useless score lies on the\ndiagonal (sensitivity = 1 - specificity at every cut); a better score bows toward the top-left corner. The **area under\nthe ROC curve (AUC)** collapses the whole curve into one number in [0.5, 1] (0.5 = no discrimination, 1 = perfect).\n\n**Core idea — AUC as concordance.** The AUC has a clean probabilistic meaning that does not mention thresholds at all:\n\n    AUC = P(score of a random case > score of a random non-case),\n\ni.e. the probability that the model ranks a randomly chosen true positive above a randomly chosen true negative (ties\nsplit). This makes the AUC a pure measure of **discrimination** — how well the score *orders* cases above non-cases — and\nit is numerically identical to the Mann-Whitney U statistic divided by the product of the two group sizes, and to\n**Harrell's c-statistic** (concordance index) for binary outcomes. 
For survival outcomes the c-statistic generalizes to\nthe probability that, among comparable (orderable) pairs, the patient who fails earlier had the higher predicted risk.\nBecause it depends only on the ordering of scores, the AUC is **threshold-free** and invariant to any strictly\nincreasing transformation of the score.\n\n**Discrimination is not calibration.** This is the central conceptual distinction and the most consequential one in RWE.\n**Discrimination** asks whether the model separates cases from non-cases (ranking); **calibration** asks whether the\npredicted probabilities match observed event frequencies (e.g., among patients predicted 20%, do ~20% have the event?).\nA model can have excellent AUC and be badly miscalibrated (for example, systematically predicting risks twice as high\nas those observed) because any strictly increasing recalibration of the scores leaves the ordering, and therefore the\nAUC, completely unchanged. For\ndecision-making and HTA, calibration and net benefit often matter more than AUC; the AUC alone can endorse a model that\nwould mislead at every clinically used threshold. Report the AUC *with* a calibration assessment (calibration plot,\ncalibration-in-the-large and slope), never as a stand-alone verdict.\n\n**Limitations beyond calibration.** (1) **Class imbalance / rare outcomes:** the AUC is computed over all case/non-case\npairs and is largely insensitive to prevalence, which is a virtue for comparability but a vice for decision relevance: a\nhigh AUC in a 1%-event setting can still correspond to poor positive predictive value and little clinical usefulness; the\nprecision-recall curve and decision-curve (net-benefit) analysis are more informative when positives are rare. (2)\n**Insensitivity to clinically meaningful changes:** adding a strong new predictor often barely moves the AUC even when it\nimproves risk stratification at decision-relevant thresholds, which is why reclassification and net-benefit measures\nsupplement it. 
(3) **Overoptimism:** an AUC computed on the same data used to fit the model is upward-biased; report an\ninternally validated (bootstrap- or cross-validation-corrected) and, ideally, externally validated AUC.\n\n**Pros, cons, and trade-offs.**\n- **vs sensitivity/specificity at a fixed threshold (`sensitivity-specificity-rwe`):** The AUC summarizes performance\n  across *all* thresholds in one number and is the right tool when comparing continuous scores or when no threshold is\n  yet fixed; a single sensitivity/specificity pair is one point on the ROC and is what matters once an operational cut is\n  chosen. **Prefer AUC** for model comparison and threshold selection; **prefer the operating-point pair** for a deployed\n  binary rule.\n- **vs calibration / net benefit (`prediction-model-validation-recalibration-rwe`):** The AUC is invariant to monotone\n  recalibration and so says nothing about whether predicted probabilities are right; calibration and decision-curve\n  analysis directly address clinical usefulness. **Prefer the AUC** as a discrimination headline; **always pair it** with\n  calibration and net benefit before claiming a model is fit for decisions.\n- **vs precision-recall / PPV at a threshold (`ppv-npv-rwe`):** Under heavy class imbalance the ROC can look strong while\n  PPV is poor; the PR curve and PPV expose the rare-positive reality the AUC obscures. 
**Prefer the AUC** for a\n  prevalence-stable discrimination summary; **prefer PR/PPV** when the practical question is \"of those flagged, how many\n  are real?\"\n\n**When to use.** Report the AUC (with a 95% CI) to summarize how well a continuous risk score or diagnostic marker\ndiscriminates cases from non-cases; to compare competing models on a common scale independent of any threshold; as the\nheadline discrimination metric in a prediction-model development/validation paper (alongside calibration per TRIPOD); and\nto select or justify a clinical decision threshold by inspecting the ROC's sensitivity/specificity trade-off. Use the\nDeLong method (or bootstrap) to compare two AUCs on the same patients.\n\n**When NOT to use — and when it is actively misleading or dangerous.**\n- **As the sole evidence a model is good.** Quoting a high AUC while omitting calibration can endorse a model that\n  systematically over- or under-predicts risk and would harm decisions at every used threshold — the most dangerous and\n  common misuse. AUC measures ranking, not the correctness of probabilities.\n- **As a clinical-utility verdict under class imbalance.** In a rare-outcome setting a strong AUC can coexist with a PPV\n  that makes the model useless for action; presenting AUC alone hides this. 
Add PR/decision-curve analysis.\n- **For detecting the value of a new predictor.** The AUC is notoriously insensitive to adding a genuinely useful marker;\n  concluding \"the new variable adds nothing\" because the AUC barely moved is misleading — use reclassification/net\n  benefit.\n- **On the training data without correction.** An apparent (resubstitution) AUC overstates performance; reporting it as if\n  it were validated misrepresents how the model will discriminate in new patients.\n- **Comparing AUCs across studies with different case spectra.** Because discrimination depends on the spread of risk in\n  the sample, an AUC from a heterogeneous (wide-spectrum) cohort is not comparable to one from a narrow-spectrum cohort;\n  the difference can be spectrum, not model quality.\n\n**Data-source operational depth.**\n- **Claims (FFS):** The score is typically a fitted risk model (logistic for a binary outcome, Cox for time-to-event) over\n  claims-derived predictors (comorbidity indices, prior utilization, NDC classes); the AUC/c-statistic measures how well\n  it ranks who will have the outcome. Restrict to FFS-observable person-time so the outcome label is not corrupted by\n  Medicare Advantage spans (no fee-for-service claims), which would misclassify events and inflate or deflate apparent\n  discrimination. 
Report an internally validated (bootstrap-corrected) AUC because claims models are often fit and\n  evaluated on the same extract.\n- **EHR:** Richer continuous predictors (labs, vitals, NLP features) usually improve discrimination, but encounter-driven\n  capture means outcome ascertainment leakage mislabels some non-cases, biasing the AUC; for time-to-event use the\n  time-dependent / inverse-probability-of-censoring-weighted c-statistic to handle censoring honestly.\n- **Registry/linked:** Registry-adjudicated outcomes give the cleanest label for the case/non-case ranking; linked claims-\n  EHR-registry data support external validation, where the AUC's transportability (and any spectrum shift) is the key\n  question.\n\n**Worked claims example.** A team develops a logistic model for 1-year incident heart failure in a commercial + Medicare\nFFS cohort (FFS-observable person-time only), using age, sex, a comorbidity index, prior cardiac utilization, and\nbaseline NDC classes; the linear predictor is the continuous score. On a held-out validation extract of 4,000 patients\n(320 events, 8% prevalence) they sweep the threshold to trace the ROC and obtain **AUC = 0.781 (95% CI 0.756-0.806)** by\nthe DeLong method — equivalently, a randomly chosen future HF case has a higher predicted risk than a randomly chosen\nnon-case about 78% of the time. They do *not* stop there: the calibration plot shows the model over-predicts in the top\ndecile (predicted 30% vs observed 22%), so despite acceptable discrimination it needs recalibration before use; and\nbecause the outcome is uncommon, they add a decision-curve analysis showing positive net benefit only over a narrow\nthreshold range. A rival model adds an EHR-derived ejection-fraction feature and raises the AUC to only 0.792 — a\ndifference whose DeLong test p = 0.07 and whose practical value is better judged by net reclassification at the\ndecision threshold than by the small AUC change. 
The headline AUC is reported with its CI, the internal-validation\noptimism correction, and, mandatorily, the calibration assessment alongside.\n\n**Interpreting the output**\n\nIn the clinical example, the risk-score model achieves an AUC (c-statistic) of 0.781. In the small\nworked example, 8 of the 9 case/non-case pairs are concordant, so AUC ≈ 0.889.\n\n*(1) Formal interpretation.* AUC = 0.781 means that, if one true case and one true non-case are drawn\nat random from the population, the model assigns the higher predicted risk to the case approximately\n78.1% of the time. This is a rank-order (concordance) statement: it describes the model's ability to\nseparate cases from non-cases across all possible thresholds simultaneously. It says nothing about\nwhether the predicted probabilities are numerically accurate (calibration). AUC is threshold-free: no\nsingle decision cut-point is assumed, making it a summary of the full ROC curve. A model could have\nhigh AUC but poor calibration if it consistently ranks cases above non-cases while predicting\nsystematically inflated or deflated absolute probabilities.\n\n*(2) Practical interpretation.* AUC = 0.781 reflects acceptable but not excellent discrimination: the model\nranks cases above non-cases about four times in five. As the worked claims example demonstrates, AUC alone is\ninsufficient: if the calibration plot reveals systematic over-prediction in the top risk decile,\nrecalibration is needed before clinical or administrative use. Report AUC with its CI, any\ninternal-validation optimism correction, and a calibration assessment (calibration plot, Brier score,\nor E/O ratio). Do not interpret AUC as accuracy or use it as the sole criterion for model selection;\nalso evaluate net reclassification and decision-curve analysis at the decision threshold that matters.",
    "primary_category": "Inferential_Statistics",
    "tags": [
      "roc-curve",
      "auc",
      "c-statistic",
      "discrimination",
      "concordance",
      "calibration",
      "delong-test",
      "class-imbalance"
    ],
    "applies_to_study_types": [
      "claims_analysis",
      "ehr_study",
      "registry_linkage",
      "validation_study",
      "diagnostic_accuracy",
      "prediction_modeling"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1148/radiology.143.1.7063747",
        "url": "https://doi.org/10.1148/radiology.143.1.7063747",
        "citation_text": "Hanley JA, McNeil BJ. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology. 1982;143(1):29-36.",
        "year": 1982,
        "authors_short": "Hanley & McNeil",
        "notes": "The canonical paper defining the AUC and establishing its interpretation as the probability that a randomly chosen case scores higher than a randomly chosen non-case (the Mann-Whitney / concordance interpretation)."
      },
      {
        "role": "explain",
        "doi": "10.1148/radiology.148.3.6878708",
        "url": "https://doi.org/10.1148/radiology.148.3.6878708",
        "citation_text": "Hanley JA, McNeil BJ. A method of comparing the areas under receiver operating characteristic curves derived from the same cases. Radiology. 1983;148(3):839-843.",
        "year": 1983,
        "authors_short": "Hanley & McNeil",
        "notes": "Provides the correlated-AUC comparison accounting for the same patients contributing to both curves; the conceptual basis for paired AUC comparisons later refined by the DeLong nonparametric method."
      },
      {
        "role": "explain",
        "doi": "10.1097/EDE.0b013e3181c30fb2",
        "url": "https://doi.org/10.1097/EDE.0b013e3181c30fb2",
        "citation_text": "Steyerberg EW, Vickers AJ, Cook NR, Gerds T, Gonen M, Obuchowski N, et al. Assessing the performance of prediction models: a framework for traditional and novel measures. Epidemiology. 2010;21(1):128-138.",
        "year": 2010,
        "authors_short": "Steyerberg et al.",
        "notes": "Frames discrimination (AUC/c-statistic) within the broader set of performance measures, stressing that calibration, reclassification, and net benefit must accompany the AUC because discrimination alone does not establish clinical usefulness."
      },
      {
        "role": "demonstrate",
        "doi": "10.1002/(SICI)1097-0258(19960229)15:4<361::AID-SIM168>3.0.CO;2-4",
        "url": "https://doi.org/10.1002/(SICI)1097-0258(19960229)15:4<361::AID-SIM168>3.0.CO;2-4",
        "citation_text": "Harrell FE, Lee KL, Mark DB. Multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Statistics in Medicine. 1996;15(4):361-387.",
        "year": 1996,
        "authors_short": "Harrell et al.",
        "notes": "Defines and operationalizes the c-statistic (concordance index) for binary and survival outcomes and demonstrates optimism-corrected (bootstrap) internal validation of discrimination, the standard for reporting AUC honestly."
      },
      {
        "role": "use",
        "doi": "10.7326/M14-0698",
        "url": "https://doi.org/10.7326/M14-0698",
        "citation_text": "Moons KGM, Altman DG, Reitsma JB, Ioannidis JPA, Macaskill P, Steyerberg EW, et al. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): explanation and elaboration. Annals of Internal Medicine. 2015;162(1):W1-W73.",
        "year": 2015,
        "authors_short": "Moons et al.",
        "notes": "The reporting standard requiring the AUC/c-statistic to be reported with a confidence interval, internal/external validation, and a calibration assessment, never as a stand-alone discrimination headline."
      }
    ],
    "plain_language_summary": "A risk score (like a model that outputs a number from 0 to 1 for each patient) is only useful if it actually separates sick patients from healthy ones — giving higher numbers to the people who will get the disease and lower numbers to those who won't. The AUC (Area Under the Curve) measures exactly that ability: it asks, if you picked one patient who got the disease and one who didn't, how often does the sick patient have the higher score? An AUC of 0.5 means the score is no better than a coin flip, and 1.0 means it perfectly ranks every sick patient above every healthy one. The AUC only measures ranking order, not whether the predicted numbers are accurate — a score could be a perfect ranker while still predicting wildly wrong probabilities.",
    "key_terms": [
      {
        "term": "risk score",
        "definition": "A number assigned to each patient by a model, meant to reflect how likely that patient is to have an outcome — higher usually means higher predicted risk."
      },
      {
        "term": "discrimination",
        "definition": "How well a risk score separates people who will have the outcome (cases) from those who won't (non-cases) by ranking cases higher."
      },
      {
        "term": "concordant pair",
        "definition": "One case and one non-case where the case has the higher predicted score — exactly the ordering you want a good model to produce."
      },
      {
        "term": "calibration",
        "definition": "Whether the predicted probabilities match what actually happens — for example, among patients given a 20% predicted risk, roughly 20% should actually have the outcome."
      },
      {
        "term": "AUC",
        "definition": "Area Under the Curve — the single number summarizing discrimination, equal to the fraction of all case/non-case pairs where the case scored higher."
      }
    ],
    "worked_example": {
      "scenario": "A research team built a logistic regression model to predict which patients will be hospitalized for heart failure within one year. The model outputs a predicted risk score between 0 and 1 for every patient. They want to know how well the model discriminates — does it actually rank future hospitalizations above non-hospitalizations? They test it on six patients: three who were later hospitalized (cases) and three who were not (non-cases).",
      "dataset": {
        "caption": "Six patients with their true one-year hospitalization outcome and the model's predicted risk score.",
        "columns": [
          "patient_id",
          "true_outcome",
          "predicted_score"
        ],
        "rows": [
          [
            "PT-101",
            1,
            0.9
          ],
          [
            "PT-102",
            1,
            0.85
          ],
          [
            "PT-103",
            1,
            0.6
          ],
          [
            "PT-104",
            0,
            0.8
          ],
          [
            "PT-105",
            0,
            0.4
          ],
          [
            "PT-106",
            0,
            0.2
          ]
        ]
      },
      "steps": [
        "Identify the cases (true_outcome = 1): PT-101 (0.90), PT-102 (0.85), PT-103 (0.60). Identify the non-cases (true_outcome = 0): PT-104 (0.80), PT-105 (0.40), PT-106 (0.20).",
        "The AUC equals the probability that a randomly chosen case scores higher than a randomly chosen non-case. To compute it by hand, list every possible case/non-case pair — with 3 cases and 3 non-cases there are 3 × 3 = 9 pairs total.",
        "Check each pair: PT-101 (0.90) vs PT-104 (0.80) → case wins (concordant). PT-101 (0.90) vs PT-105 (0.40) → case wins (concordant). PT-101 (0.90) vs PT-106 (0.20) → case wins (concordant).",
        "Continue: PT-102 (0.85) vs PT-104 (0.80) → case wins (concordant). PT-102 (0.85) vs PT-105 (0.40) → case wins (concordant). PT-102 (0.85) vs PT-106 (0.20) → case wins (concordant).",
        "Continue: PT-103 (0.60) vs PT-104 (0.80) → non-case wins (discordant). PT-103 (0.60) vs PT-105 (0.40) → case wins (concordant). PT-103 (0.60) vs PT-106 (0.20) → case wins (concordant).",
        "Count the results: 8 concordant pairs, 1 discordant pair, 0 ties. AUC = concordant pairs / total pairs = 8 / 9 ≈ 0.889.",
        "Interpretation: In about 89% of all case/non-case matchups, the model correctly ranks the future hospitalization patient higher. Note that PT-103 (a true case with a moderate score of 0.60) was outranked by PT-104 (a true non-case with a score of 0.80) — the one failure of the model on this small dataset."
      ],
      "result": "AUC = 8 concordant pairs / 9 total pairs = 0.889. The model correctly ranks a randomly chosen future hospitalization above a randomly chosen non-hospitalization about 89% of the time. This is solid discrimination — but it says nothing about whether the score of 0.90 or 0.60 is an accurate probability; that would require a separate calibration check."
    },
    "prerequisites": [
      "sensitivity-specificity-rwe",
      "logistic-regression-for-binary-outcomes"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Binary-outcome AUC (c-statistic)",
        "description": "For a binary outcome, the AUC equals Harrell's c-statistic and the normalized Mann-Whitney U; it is the probability a random case is ranked above a random non-case and is estimated nonparametrically from the score ranks.",
        "edge_cases": [
          "Ties in the score are split (counted as one-half) in the concordance count.",
          "The resubstitution AUC is optimistic; report a bootstrap- or cross-validation-corrected estimate."
        ],
        "data_source_notes": "claims: the score is usually a fitted logistic linear predictor; restrict to FFS-observable person-time so the outcome label is not corrupted by MA spans."
      },
      {
        "name": "Time-to-event (survival) c-statistic under censoring",
        "description": "For time-to-event outcomes the c-statistic is the probability that, among orderable pairs, the earlier-failing patient had the higher predicted risk; censoring is handled by Harrell's c or an inverse-probability-of-censoring- weighted / time-dependent estimator.",
        "edge_cases": [
          "Harrell's c is biased under heavy censoring; prefer IPCW or time-dependent ROC for long follow-up.",
          "Discrimination can be time-specific (e.g., 1-year vs 5-year); report the horizon."
        ],
        "data_source_notes": "ehr/linked: use the time-dependent / IPCW c-statistic when follow-up is censored; registry adjudication gives the cleanest event times."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "sensitivity-specificity-rwe",
        "pros_of_this": "Summarizes performance across all thresholds in one threshold-free number and is the right tool for comparing continuous scores or selecting a cut-point.",
        "cons_of_this": "Abstracts away the operating point; a single AUC does not tell you the sensitivity/specificity at the cut that will actually be deployed.",
        "when_to_prefer": "For model comparison and threshold selection; use a specific sensitivity/specificity pair for a deployed binary decision rule."
      },
      {
        "compared_to": "prediction-model-validation-recalibration-rwe",
        "pros_of_this": "A clean, prevalence-stable summary of discrimination (ranking) that is comparable across thresholds and invariant to monotone score transformations.",
        "cons_of_this": "Invariant to recalibration, so it says nothing about whether predicted probabilities are correct or about clinical net benefit; a high AUC can accompany a badly miscalibrated, useless model.",
        "when_to_prefer": "As the discrimination headline; always pair it with calibration and decision-curve analysis before declaring a model fit for decisions."
      },
      {
        "compared_to": "ppv-npv-rwe",
        "pros_of_this": "Largely insensitive to prevalence, giving a discrimination measure comparable across cohorts of differing case frequency.",
        "cons_of_this": "That same prevalence-insensitivity hides poor positive predictive value under class imbalance; a strong AUC can mask a clinically useless rare-outcome classifier.",
        "when_to_prefer": "For a prevalence-stable discrimination summary; use PPV and the precision-recall curve when the practical question is the fraction of flagged records that are true."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Fit the risk model (logistic for binary, Cox for time-to-event) over claims-derived predictors and report a bootstrap-corrected AUC/c-statistic with a 95% CI, since claims models are often fit and evaluated on the same extract. Restrict to FFS-observable person-time so the outcome label is not corrupted by Medicare Advantage spans. Use DeLong to compare two AUCs on the same patients, and always report calibration alongside.",
      "ehr": "Richer continuous predictors usually improve discrimination, but outcome-ascertainment leakage mislabels non-cases and biases the AUC; require in-system activity. For censored time-to-event outcomes use the time-dependent / IPCW c-statistic rather than the naive concordance.",
      "registry": "Registry-adjudicated outcomes give the cleanest case/non-case label; linked data support external validation, where transportability of the AUC and any case-spectrum shift are the key questions."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\nfrom sklearn.metrics import roc_auc_score, roc_curve\nfrom scipy.stats import mannwhitneyu, norm\n\n# Illustrative: predicted risk scores and true binary HF outcomes (1 = event).\nrng = np.random.default_rng(7)\ny = np.r_[np.ones(40), np.zeros(160)]                 # 20% prevalence\nscore = np.r_[rng.normal(0.7, 0.25, 40),              # cases score higher on average\n              rng.normal(0.4, 0.25, 160)]\n\nauc = roc_auc_score(y, score)                         # = c-statistic\nfpr, tpr, thr = roc_curve(y, score)                   # 1-specificity, sensitivity across thresholds\n\n# AUC as concordance: Mann-Whitney U / (n_case * n_ctrl) gives the same number.\nU, _ = mannwhitneyu(score[y == 1], score[y == 0], alternative=\"greater\")\nauc_mw = U / (np.sum(y == 1) * np.sum(y == 0))\n\n# Hanley-McNeil (1982) standard error for a single AUC -> normal-approx 95% CI.\nn1, n2 = int((y == 1).sum()), int((y == 0).sum())\nq1 = auc / (2 - auc); q2 = 2 * auc**2 / (1 + auc)\nse = np.sqrt((auc*(1-auc) + (n1-1)*(q1-auc**2) + (n2-1)*(q2-auc**2)) / (n1*n2))\nz = norm.ppf(0.975)\nprint(f\"AUC = {auc:.3f} (concordance check {auc_mw:.3f}); 95% CI {auc-z*se:.3f}-{auc+z*se:.3f}\")",
        "description": "Compute the ROC curve, the AUC (= c-statistic = normalized Mann-Whitney U), and a DeLong-method 95% CI\nfrom predicted scores and binary outcomes with scikit-learn and scipy. The AUC equals the probability a\nrandom case outranks a random non-case (Hanley & McNeil 1982). Worked example on a small illustrative\nrisk-score dataset; in practice the score is a fitted logistic linear predictor.",
        "dependencies": [
          "numpy",
          "scikit-learn",
          "scipy"
        ],
        "source_citations": [
          "hanley-mcneil-1982"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(pROC)\n\nset.seed(7)\ny     <- c(rep(1, 40), rep(0, 160))                   # 20% prevalence\nscore <- c(rnorm(40, 0.7, 0.25), rnorm(160, 0.4, 0.25))\n\nroc_obj <- roc(response = y, predictor = score, levels = c(0, 1), direction = \"<\")\nauc(roc_obj)                                          # AUC = c-statistic\nci.auc(roc_obj, method = \"delong\")                    # DeLong 95% CI\n\n# Compare two models on the SAME patients with the DeLong test.\nscore2  <- score + rnorm(200, 0, 0.10)                # rival score\nroc2    <- roc(y, score2, direction = \"<\")\nroc.test(roc_obj, roc2, method = \"delong\")            # paired AUC comparison",
        "description": "Compute the ROC curve and AUC with the pROC package, which returns the AUC with a DeLong 95% CI and\nsupports DeLong tests comparing two AUCs measured on the same patients (Hanley & McNeil 1983; DeLong).\nThe AUC equals Harrell's c-statistic for a binary outcome. Worked illustrative example below.",
        "dependencies": [
          "pROC"
        ],
        "source_citations": [
          "hanley-mcneil-1982",
          "hanley-mcneil-1983"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* Illustrative subject-level data: binary outcome y and a continuous predictor x\n   (in practice x is replaced by the model covariates). */\ndata risk;\n  call streaminit(7);\n  do i = 1 to 200;\n    y = (i <= 40);                                   /* 20% prevalence */\n    if y then x = rand('normal', 0.7, 0.25);         /* cases score higher */\n    else      x = rand('normal', 0.4, 0.25);\n    output;\n  end;\n  drop i;\nrun;\n\n/* c-statistic (AUC) with ROC curve and a 95% CI; ROCCONTRAST gives DeLong comparisons. */\nproc logistic data=risk plots(only)=roc;\n  model y(event='1') = x / outroc=roc_pts;\n  roc 'risk score' x;\n  roccontrast / estimate=allpairs;                   /* AUC + 95% CI; DeLong for >1 ROC */\nrun;",
        "description": "Fit a logistic risk model and obtain the AUC (the model c-statistic) with a confidence interval in SAS.\nPROC LOGISTIC reports the c-statistic in the association statistics; the ROC and ROCCONTRAST statements\nproduce the ROC curve, the AUC with a 95% CI, and DeLong-type comparisons of nested or competing ROC\ncurves on the same data (Hanley & McNeil 1982).",
        "dependencies": [],
        "source_citations": [
          "hanley-mcneil-1982"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  Score[Continuous risk score] --> Sweep[Sweep threshold from strict to lax]\n  Sweep --> Point[At each threshold:<br/>sensitivity vs 1-specificity]\n  Point --> Curve[ROC curve from 0,0 to 1,1]\n  Curve --> AUC[AUC = area under curve<br/>= P score_case > score_control<br/>= c-statistic]\n  AUC --> Disc[Measures DISCRIMINATION ranking only]\n  Disc --> Pair[Pair with calibration + net benefit<br/>AUC is invariant to recalibration]",
        "caption": "Construction and meaning of the ROC/AUC. Sweeping the threshold traces sensitivity against 1 - specificity; the area under the resulting curve equals the concordance probability (c-statistic) and measures discrimination only - invariant to recalibration, so it must be paired with calibration and net-benefit assessment.",
        "alt_text": "Flow from a continuous score through threshold sweeping to the ROC curve and its area, identifying the AUC as the concordance probability and a discrimination-only measure that must be paired with calibration.",
        "source_type": "illustrative",
        "source_citations": [
          "hanley-mcneil-1982"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Q{What is the question?} -->|Rank cases above non-cases| AUC[Report AUC / c-statistic with CI]\n  Q -->|Are predicted risks correct?| Cal[Calibration plot + slope - NOT the AUC]\n  Q -->|Useful at a decision threshold?| NB[Decision-curve / net benefit]\n  Q -->|Of those flagged, how many real?| PR[PPV / precision-recall - AUC hides this under imbalance]\n  AUC --> Warn[High AUC alone does NOT mean fit for decisions]",
        "caption": "Choosing the right performance question. The AUC answers only the discrimination (ranking) question; calibration, net benefit, and predictive values answer the questions about correctness of probabilities and clinical usefulness that a high AUC can silently leave unaddressed.",
        "alt_text": "Decision tree routing the ranking question to the AUC and the calibration, net-benefit, and PPV questions to other measures, warning that a high AUC alone does not establish fitness for decisions.",
        "source_type": "illustrative",
        "source_citations": [
          "steyerberg-2010"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "requires",
        "target_slug": "sensitivity-specificity-rwe",
        "notes": "The ROC curve is built from sensitivity and 1 - specificity computed at every threshold; a single sensitivity/ specificity pair is one point on the curve."
      },
      {
        "relation_type": "see_also",
        "target_slug": "likelihood-ratios-diagnostic-rwe",
        "notes": "The likelihood ratio at a threshold equals the local slope of the ROC curve at that operating point; the AUC integrates the curve while interval LRs describe it locally."
      },
      {
        "relation_type": "complements",
        "target_slug": "prediction-model-validation-recalibration-rwe",
        "notes": "The AUC measures discrimination only and is invariant to recalibration; it must be reported with calibration and net-benefit assessment, the focus of prediction-model validation and recalibration."
      },
      {
        "relation_type": "see_also",
        "target_slug": "ppv-npv-rwe",
        "notes": "The AUC is prevalence-stable and threshold-free, while PPV/NPV add prevalence at a chosen threshold; under class imbalance a strong AUC can mask a poor PPV."
      },
      {
        "relation_type": "used_with",
        "target_slug": "diagnostic-accuracy",
        "notes": "For a continuous marker, the ROC/AUC is the threshold-free discrimination summary reported by a diagnostic accuracy study, complementing threshold-specific sensitivity and specificity."
      },
      {
        "relation_type": "used_with",
        "target_slug": "algorithm-validation",
        "notes": "When a validated phenotype or risk algorithm produces a continuous score, the AUC/c-statistic summarizes how well it ranks true cases above non-cases before a threshold is fixed."
      }
    ],
    "aliases": [
      "ROC curve",
      "area under the curve",
      "AUC",
      "c-statistic",
      "concordance index",
      "AUROC"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta",
      "journal"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "route-of-administration-differences-in-rwe",
    "name": "Route-of-Administration Differences in RWE",
    "short_definition": "The route by which a drug is delivered (oral, subcutaneous, intravenous, intramuscular, inhaled, topical) determines which administrative artifact records \"use\" — a pharmacy dispensing claim versus a medical administration claim — and therefore forces route-specific exposure, adherence, HCRU, and cost definitions in real-world data.",
    "long_description": "**Route of administration (ROA)** is not a clinical footnote in real-world evidence; it is the\nsingle variable that decides *which stream of administrative data even contains the exposure*, and\nconsequently how exposure, adherence, persistence, healthcare resource utilization (HCRU), and cost\nmust be operationalized. Oral and self-administered subcutaneous (SC) pen drugs surface as **pharmacy\nclaims** (NDC + `fill_date` + `days_supply`). Clinician-administered injectables and infusions (IV,\nmany SC and IM biologics) surface as **medical claims** — an administration procedure code (e.g., CPT\n96365/96413 for IV infusion, 96372 for therapeutic SC/IM injection) paired with a HCPCS **J-code** for\nthe drug, with *no* `days_supply` field at all. Treating these two artifacts with one exposure rule is\nthe most common and most damaging ROA error in claims-based RWE.\n\n**Core conceptual distinction.** The conceptual move is to recognize that \"exposure\" is observed\nthrough different *measurement instruments* by route, and each instrument has a different missing-data\nand timing structure. (1) *Pharmacy-captured (oral, self-injected pens):* the dispensing event predicts\na window of supply via `days_supply`; adherence is the **Proportion of Days Covered (PDC)** or MPR over\nfills. The fill is a proxy for ingestion — dispensing is not administration. (2) *Medical-captured\n(clinician-administered SC/IM/IV):* there is no supply window; the administration date *is* the exposure,\nand adherence must be redefined as the **proportion of label-scheduled doses actually administered**\nwithin a grace window around each due date (e.g., a q4-week biologic: 14 expected doses/year, gap >7\ndays from the due date = a missed dose). 
(3) *Inhaled, topical, and as-needed* routes blur the line:\na rescue inhaler dispensed with a `days_supply` is not used on a fixed daily schedule, so PDC computed\nmechanically overstates \"coverage.\" The estimand you can defend is route-conditional: a PDC-based\nadherence estimand for oral fills and a dose-completion estimand for administered drugs are *different\nquantities* and must never be pooled into one \"adherence\" variable across arms with different routes.\n\n**Pros, cons, and trade-offs.**\n- **Route-specific dose-completion vs. a single PDC applied to all routes:** Computing PDC from fills\n  works for orals but is undefined for J-code administrations (no `days_supply`); analysts who force it\n  either drop the injectable arm or impute a supply window, both of which bias the comparison.\n  Dose-completion (administered/expected) is the correct instrument for clinician-administered drugs.\n  **Cost:** dose-completion requires the label dosing interval and a grace rule, and it is not\n  numerically comparable to PDC — a q12-week biologic can look \"more adherent\" simply because fewer\n  decision points exist. **Prefer route-specific definitions** whenever a study contrasts oral and\n  administered therapies (RA, IBD, psoriasis, oncology, MS).\n- **vs. treating an administration claim as a dispensing claim (the naive single-rule approach):**\n  Mapping CPT/J-code events into a fill table and assigning a fixed `days_supply` is fast but invents a\n  coverage window the data never recorded, and it double-counts when the drug J-code and the\n  administration CPT both appear on the same encounter. **Prefer explicit `claim_type` routing**\n  (pharmacy vs. medical) before any adherence math.\n- **vs. 
ignoring ROA and using time-on-drug / persistence only:** Persistence (time to a treatment gap)\n  sidesteps the PDC-vs-dose-completion problem and is route-robust, but it discards intensity-of-use\n  information and is sensitive to the gap definition, which itself is route-dependent (a 60-day gap means\n  something different for a daily pill than for a q8-week infusion). **Prefer persistence** as a robust\n  secondary endpoint, not a replacement for a route-correct adherence measure.\n\n**When to use.** Any RWE study that (a) contrasts drugs given by different routes, (b) reports adherence,\npersistence, HCRU, or cost for a clinician-administered (J-code) therapy, or (c) builds episodes for an\ninfused/injected biologic. Use ROA-aware logic to set `claim_type` routing first, then apply\nPDC/MPR to pharmacy fills and dose-completion to medical administrations, and stratify or sensitivity-test\nevery adherence/HCRU result by route in mixed-route therapeutic areas.\n\n**When NOT to use — and when it is actively misleading or dangerous.**\n- **Single-route, single-arm pharmacy studies** (e.g., statin adherence): plain PDC is correct and\n  ROA branching is unnecessary complexity. Do not invent a dose-completion track for a drug that only\n  ever appears as a pharmacy fill.\n- **Pooling routes into one \"adherence\" covariate or outcome** is actively misleading: it manufactures\n  a spurious route effect (longer-interval injectables score higher mechanically) that can masquerade as\n  a real comparative-effectiveness signal. If your two arms differ in route, an unadjusted adherence\n  contrast is confounded by the *measurement instrument itself*.\n- **Buy-and-bill / physician-administered drugs in a pharmacy-benefit-only dataset:** if you only have\n  Part D / pharmacy claims, the IV/infused arm is *structurally absent*, not non-adherent. 
Concluding\n  \"low use\" here is a data-coverage artifact, not a finding.\n- **Inhaled and topical drugs treated as fixed-schedule maintenance** when much use is as-needed:\n  PDC computed on rescue-inhaler `days_supply` is meaningless and dangerous if fed into an outcomes\n  model as a confounder.\n\n**Data-source operational depth.**\n- **Claims (FFS):** Orals and self-injected pens = pharmacy claims (NDC + `fill_date` + `days_supply`);\n  clinician-administered = medical claims (administration CPT/HCPCS + drug J-code). Failure modes:\n  (i) *J-codes carry no `days_supply`* — adherence must be dose-completion against the label interval,\n  not PDC. (ii) *Drug + administration on the same encounter* invite double-counting of cost and dose;\n  de-duplicate to one administration per drug per service date. (iii) *Units on J-codes are billed in\n  label-defined increments* (e.g., \"per 10 mg\"); a single dose can be multiple J-code line units, so a\n  naive line count overstates dose frequency — collapse to one administration per service date. (iv)\n  *Immortal-time bias in procedure/infusion studies:* if follow-up starts at diagnosis but the first\n  infusion occurs weeks later, the pre-infusion interval is immortal; set time zero at the first\n  administration.\n- **Claims (Medicare Advantage vs. FFS):** MA encounter data are notoriously incomplete for\n  physician-administered (Part B buy-and-bill) drugs and infusion HCRU; MA-only person-time can make an\n  infused arm look unused or low-utilizing. 
Restrict to FFS Parts A/B/D (or commercial medical+pharmacy)\n  enrollees and *exclude MA-only person-time* before computing route-specific exposure or HCRU.\n- **Competing risks differ by route in elderly claims populations:** infused biologics concentrate in\n  older, sicker patients (e.g., rituximab in refractory disease); differential mortality as a competing\n  risk by route biases naive incidence and persistence comparisons — use cause-specific or\n  subdistribution models, not Kaplan–Meier complements, when death rates differ across routes.\n- **EHR:** Administered drugs appear in the **Medication Administration Record (MAR)** with exact\n  administration timestamp and dose — superior to claims for infusions — while oral *outpatient* fills\n  are only the e-prescribing **order**, not proof of dispensing or ingestion; link to pharmacy fills to\n  confirm the patient actually started. Visit-driven capture means a patient who leaves the system\n  differentially loses MAR data.\n- **Registry:** Strong for indication, dose, and adjudicated outcomes (often the best source for infused\n  oncology/rare-disease regimens) but typically weak for complete oral fill history; link to claims for\n  the pharmacy stream and to a death index for competing-risk censoring.\n\n**Worked claims example — adherence to an oral JAK inhibitor vs. a q4-week SC biologic in RA (FFS claims).**\nCohort: adults with ≥2 RA diagnoses and 365 days of continuous medical+pharmacy FFS enrollment (exclude\nMA-only person-time) before the first study drug. *Oral arm (pharmacy stream):* JAK inhibitor dispensed\nas NDC fills with `days_supply = 30`. Over 365 days the patient has fills on days 1, 32, 70, 100, 130,\n165, 200, 235, 270, 305 (10 × 30-day fills). Stitch supply with carry-over (early refills extend the\ncovered window, capped at the observation end), count distinct covered days, divide by 365: covered days\n≈ 300 → **PDC ≈ 82%**. 
*Injectable arm (medical stream):* q4-week SC biologic billed as CPT 96372 +\nJ-code on administration dates 1, 30, 62, 120, 150, 178, 206, 234, 262, 290. The label interval is 28\ndays, so expected doses in 365 days = ⌊365/28⌋ + 1 = 14; with a ±7-day grace window each administration\nis \"on time\" only if within 7 days of its scheduled due date, where the schedule is *rolling*: each\non-time dose re-anchors the next due date (last-dose-anchored), so a single late dose does not\ndesynchronize the rest of the year. The administrations on days 1, 30, 62 all land within grace of their\nrolling due dates; the day-90 due date (62 + 28) has no administration within ±7 days, so it is a missed\ndose. The following due date (day 118) is met by the day-120 administration, which re-anchors the\nschedule from there, and the remaining administrations through day 290 are on time; no doses are\nrecorded after day 290. That gives 10 on-time administrations of 14 expected → **dose-completion ≈ 71%**.\nThe two numbers are *not* comparable: the oral 82% counts days of supply,\nthe injectable 71% counts scheduled administrations met. Reporting them side by side as \"adherence\"\nwithout labeling the instrument would falsely suggest the oral patient is more adherent, when part of the\ngap is purely the difference in measurement. HCRU/cost also diverge by construction: the injectable arm\ngenerates 10 outpatient administration encounters (facility/professional fees + drug), the oral arm\ngenerates only pharmacy claims; an all-cause cost comparison must attribute administration-visit cost to\nthe injectable arm to avoid spuriously favoring it. Because biologics-heavy therapeutic areas almost\nalways mix routes, every adherence, persistence, HCRU, and cost endpoint here should be reported\nroute-stratified, never pooled.",
    "primary_category": "Exposure_Definition",
    "tags": [
      "route-of-administration",
      "oral-vs-injectable",
      "biologics",
      "adherence",
      "exposure-definition",
      "hcru",
      "claims-coding",
      "j-code"
    ],
    "applies_to_study_types": [
      "drug_utilization",
      "cohort_retrospective",
      "active_comparator_new_user",
      "claims_analysis",
      "ehr_study"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1016/s0895-4356(96)00268-5",
        "url": "https://doi.org/10.1016/s0895-4356(96)00268-5",
        "citation_text": "Steiner JF, Prochazka AV. The assessment of refill compliance using pharmacy records: methods, validity, and applications. Journal of Clinical Epidemiology. 1997;50(1):105-116.",
        "year": 1997,
        "authors_short": "Steiner & Prochazka",
        "notes": "Foundational treatment of refill-based exposure and adherence from pharmacy dispensing records — the measurement instrument that exists for oral and self-dispensed routes but is absent for clinician-administered (J-code) drugs, which is exactly the asymmetry this concept addresses."
      },
      {
        "role": "explain",
        "doi": "10.1345/aph.1H018",
        "url": "https://doi.org/10.1345/aph.1H018",
        "citation_text": "Hess LM, Raebel MA, Conner DA, Malone DC. Measurement of adherence in pharmacy administrative databases: a proposal for standard definitions and preferred measures. Annals of Pharmacotherapy. 2006;40(7-8):1280-1288.",
        "year": 2006,
        "authors_short": "Hess et al.",
        "notes": "Standardizes PDC/MPR definitions for pharmacy-captured fills; route-aware work must contrast these days-supply-based measures with dose-completion logic for administered drugs that have no days_supply."
      },
      {
        "role": "use",
        "doi": "10.1002/pds.4297",
        "url": "https://doi.org/10.1002/pds.4297",
        "citation_text": "Berger ML, Sox H, Willke RJ, et al. Good practices for real-world data studies of treatment and/or comparative effectiveness: recommendations from the joint ISPOR-ISPE Special Task Force on real-world evidence in health care decision making. Pharmacoepidemiology and Drug Safety. 2017;26(9):1033-1039.",
        "year": 2017,
        "authors_short": "Berger et al.",
        "notes": "ISPOR-ISPE good-practice guidance requiring explicit, pre-specified operational definitions of exposure and adherence — which by necessity must be route-specific when a study spans pharmacy- and medically-billed therapies."
      }
    ],
    "plain_language_summary": "When researchers study how well patients take their medications using insurance claims data, the route of administration — whether a drug is swallowed as a pill or injected by a clinician — determines which part of the claims database even records the drug use. Oral pills generate pharmacy claims that include a field called days_supply, telling you how many days that one bottle should last; clinician-administered injectables and infusions instead generate medical claims that carry a J-code for the drug and an administration code, but have no days_supply field at all. Because these two data artifacts are structured so differently, you cannot use the same math to measure how regularly a patient received either drug — using a single rule for both routes silently produces wrong answers.",
    "key_terms": [
      {
        "term": "pharmacy benefit",
        "definition": "The part of a health insurance plan that covers drugs a patient picks up at a pharmacy and takes at home, such as pills or self-injected pens."
      },
      {
        "term": "medical benefit",
        "definition": "The part of a health insurance plan that covers drugs administered by a clinician in a medical office, infusion center, or hospital, such as intravenous biologics."
      },
      {
        "term": "J-code",
        "definition": "A billing code on a medical claim that identifies a specific injectable or infused drug administered by a clinician; J-codes carry no days_supply because the dose is given on the spot, not dispensed to take home."
      },
      {
        "term": "days_supply",
        "definition": "A field on a pharmacy claim that records how many days one dispensed prescription is intended to last, for example 30 days for a one-month supply of a daily pill."
      },
      {
        "term": "dose-completion",
        "definition": "A way to measure adherence for clinician-administered drugs by counting how many of the scheduled injections or infusions a patient actually received within a reasonable time window around each due date."
      },
      {
        "term": "PDC",
        "definition": "Proportion of Days Covered — an adherence measure for pharmacy fills that divides the number of days a patient had medication on hand by the total days in the observation window; it requires days_supply and cannot be calculated from J-code claims."
      }
    ],
    "worked_example": {
      "scenario": "A researcher is comparing adherence for two rheumatoid arthritis treatments over one year: an oral JAK inhibitor taken daily (pharmacy benefit, pill) and a subcutaneous biologic injected every four weeks by a nurse (medical benefit, J-code). Both patients have exactly one year of claims. The researcher wants to know what the raw claims rows look like for each route and why the same adherence formula cannot be applied to both.",
      "dataset": {
        "caption": "Claims rows for the same patient-year, split by route. The oral drug appears in the pharmacy table with days_supply; the injectable appears in the medical table with a J-code and no days_supply field.",
        "columns": [
          "person_id",
          "claim_table",
          "service_date",
          "drug_identifier",
          "days_supply",
          "admin_code"
        ],
        "rows": [
          [
            2001,
            "pharmacy",
            "2023-01-01",
            "NDC: JAK inhibitor",
            30,
            "n/a"
          ],
          [
            2001,
            "pharmacy",
            "2023-01-29",
            "NDC: JAK inhibitor",
            30,
            "n/a"
          ],
          [
            2001,
            "pharmacy",
            "2023-03-05",
            "NDC: JAK inhibitor",
            30,
            "n/a"
          ],
          [
            2002,
            "medical",
            "2023-01-01",
            "J-code: biologic",
            "NONE",
            "CPT 96372"
          ],
          [
            2002,
            "medical",
            "2023-01-29",
            "J-code: biologic",
            "NONE",
            "CPT 96372"
          ],
          [
            2002,
            "medical",
            "2023-02-26",
            "J-code: biologic",
            "NONE",
            "CPT 96372"
          ],
          [
            2002,
            "medical",
            "2023-04-02",
            "J-code: biologic",
            "NONE",
            "CPT 96372"
          ]
        ]
      },
      "steps": [
        "For patient 2001 (oral): each pharmacy row has a days_supply of 30, so Fill 1 covers Jan 1-Jan 30, Fill 2 covers Jan 29-Feb 27 (two-day early refill, so the carry-over rule extends coverage rather than creating a gap), and Fill 3 covers Mar 5-Apr 3. You count every calendar day that was covered and divide by 365 to get PDC.",
        "For patient 2002 (injectable): every medical row has NONE in the days_supply column because the drug is injected and fully consumed at the visit — there is no supply to carry home. PDC is mathematically impossible to compute from these rows.",
        "For patient 2002, the correct approach is dose-completion: the biologic label says one injection every 28 days, so 14 doses are expected in a year. Count how many medical claims fall within a reasonable grace window (for example, plus or minus 7 days) of each scheduled due date. Administrations on Jan 1, Jan 29, Feb 26, and Apr 2 all land within 7 days of their rolling due dates, so each counts as on-time.",
        "The oral patient ends up with a PDC measured in fraction of days covered; the injectable patient ends up with a dose-completion rate measured in fraction of scheduled administrations met. These are two different quantities — the same number (say, 0.80) means completely different things in each column."
      ],
      "result": "Patient 2001 (oral): PDC can be calculated because days_supply exists on every pharmacy row; the coverage fraction is days-based. Patient 2002 (injectable): PDC cannot be calculated because days_supply is NONE on every medical row; adherence must be measured as dose-completion (scheduled administrations met divided by expected administrations). Reporting both as a single adherence number in a comparative study would conflate two different measurement instruments and produce a misleading comparison."
    },
    "prerequisites": [
      "pdc-proportion-of-days-covered",
      "claims-analysis",
      "hcru-healthcare-resource-utilization"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Oral / self-dispensed (pharmacy stream) — PDC or MPR",
        "description": "Exposure is the pharmacy dispensing (NDC + fill_date + days_supply). Adherence is Proportion of Days Covered with carry-over (early refills extend the covered window, capped at observation end) or MPR. The fill is a proxy for ingestion, not proof of it.",
        "edge_cases": [
          "90-day mail-order, sample fills, and stockpiling distort days_supply and carry-over.",
          "As-needed orals (e.g., rescue analgesics) violate the fixed-schedule assumption PDC encodes."
        ],
        "data_source_notes": "claims: pharmacy claims by NDC; EHR: e-prescribing order is not a dispensing — link to fills to confirm the patient started."
      },
      {
        "name": "Clinician-administered (medical stream) — dose-completion",
        "description": "Exposure is the administration event (administration CPT/HCPCS + drug J-code). No days_supply exists; adherence = administered doses / label-scheduled expected doses within a grace window (e.g., a q4-week biologic = 14 expected/year, gap >7 days from the due date = missed dose).",
        "edge_cases": [
          "Drug J-code and administration CPT on the same encounter cause cost/dose double-counting; collapse to one administration per drug per service date.",
          "J-code units are billed in label increments (per X mg), so a line count overstates dose frequency.",
          "MA-only person-time and pharmacy-benefit-only datasets structurally omit physician-administered (buy-and-bill) drugs."
        ],
        "data_source_notes": "claims: medical claims, J-code + administration CPT, label dosing interval; EHR: Medication Administration Record (MAR) gives exact administration timestamp and dose."
      },
      {
        "name": "Inhaled / topical / as-needed — hybrid",
        "description": "Captured as pharmacy fills with a quantity (canisters, tubes) but used on a variable rather than fixed daily schedule. Mechanical PDC overstates coverage for reliever use; separate maintenance (controller) from rescue before computing adherence.",
        "edge_cases": [
          "Claims cannot distinguish controller from reliever inhaler use reliably; EHR device counters or detailed sig data are better.",
          "OTC topicals are under-captured in claims entirely."
        ],
        "data_source_notes": "claims: under-capture and schedule ambiguity; EHR: sig/device-counter data needed to separate maintenance from as-needed."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "pdc-proportion-of-days-covered",
        "pros_of_this": "Route-aware logic applies PDC only where days_supply exists (pharmacy stream) and substitutes dose-completion for clinician-administered J-code drugs, where PDC is undefined.",
        "cons_of_this": "Two non-comparable estimands (days-covered vs. doses-met) that must be reported separately and never pooled; dose-completion requires the label interval and a grace rule.",
        "when_to_prefer": "Any study contrasting oral and injectable/infused therapies, or reporting adherence for a J-code drug."
      },
      {
        "compared_to": "persistence-time-to-discontinuation",
        "pros_of_this": "Route-specific intensity (PDC / dose-completion) preserves how much of the intended exposure was received, not just whether a gap occurred.",
        "cons_of_this": "More sensitive to route-dependent definitions than persistence, which is more route-robust (a gap is a gap) but discards intensity information and depends on a gap rule that is itself route-dependent.",
        "when_to_prefer": "When the question is intensity of use; use persistence as a route-robust secondary endpoint."
      },
      {
        "compared_to": "hcru-healthcare-resource-utilization",
        "pros_of_this": "Routing on claim_type isolates infusion/administration encounters (medical stream) from pharmacy-only utilization, preventing the injectable arm from appearing falsely low-cost.",
        "cons_of_this": "Risk of double-counting when drug J-code and administration CPT co-occur; requires explicit de-duplication.",
        "when_to_prefer": "Cost/HCRU studies in mixed-route therapeutic areas."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Route the claim first by claim_type. Oral/self-dispensed = pharmacy claims (NDC + fill_date + days_supply) -> PDC/MPR. Clinician-administered = medical claims (administration CPT 96365/96372/96413 + drug J-code), no days_supply -> dose-completion against the label interval with a grace window. De-duplicate to one administration per drug per service date to avoid double-counting cost and dose; treat J-code line units as billing increments, not dose counts. Exclude MA-only person-time and pharmacy-benefit-only datasets where physician-administered drugs are structurally absent. Set time zero at first administration for infused therapies to avoid immortal-time bias.",
      "ehr": "Administered drugs are in the Medication Administration Record (MAR) with exact timestamp and dose (better than claims for infusions). Outpatient oral therapy is the e-prescribing order, not a dispensing -> link to pharmacy fills to confirm the patient started. Treat loss to follow-up as potentially informative when patients leave the system.",
      "registry": "Strong for indication, dose, and adjudicated outcomes (often best for infused oncology/rare-disease regimens); weak for complete oral fill history. Link to claims for the pharmacy stream and to a death index for competing-risk censoring, which differs by route in elderly populations."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\nimport numpy as np\n\nOBS_DAYS = 365            # observation window length from each person's index date\nDOSE_INTERVAL_DAYS = 28   # label dosing interval for the injectable (e.g., q4 weeks)\nGRACE_DAYS = 7            # +/- window around each scheduled due date\n\ndef oral_pdc(pharmacy: pd.DataFrame, index_date: pd.Series) -> pd.DataFrame:\n    \"\"\"PDC with carry-over: early refills extend the covered window, capped at observation end.\"\"\"\n    rx = pharmacy.merge(index_date.rename(\"index_date\"), left_on=\"person_id\", right_index=True)\n    rx = rx[(rx[\"fill_date\"] >= rx[\"index_date\"]) &\n            (rx[\"fill_date\"] < rx[\"index_date\"] + pd.Timedelta(days=OBS_DAYS))]\n    rx = rx.sort_values([\"person_id\", \"fill_date\"])\n    out = {}\n    for pid, g in rx.groupby(\"person_id\"):\n        obs_end = g[\"index_date\"].iloc[0] + pd.Timedelta(days=OBS_DAYS)\n        covered = np.zeros(OBS_DAYS, dtype=bool)\n        cursor = g[\"index_date\"].iloc[0]  # next free day for carry-over stacking\n        for _, row in g.iterrows():\n            start = max(row[\"fill_date\"], cursor)             # stack supply after prior runs out\n            end = min(start + pd.Timedelta(days=int(row[\"days_supply\"])), obs_end)\n            if end > start:\n                s = (start - g[\"index_date\"].iloc[0]).days\n                e = (end   - g[\"index_date\"].iloc[0]).days\n                covered[s:e] = True\n                cursor = end\n        out[pid] = covered.sum() / OBS_DAYS\n    return pd.Series(out, name=\"oral_pdc\")\n\ndef injectable_dose_completion(medical: pd.DataFrame, index_date: pd.Series) -> pd.DataFrame:\n    \"\"\"Administered doses met within +/- GRACE_DAYS of each scheduled due date / expected doses.\n    Rolling (last-dose-anchored) schedule: the next due date is set from the last on-time\n    administration, so an isolated late/missed dose does not desynchronize the whole grid.\n    One administration 
per service date (de-dup J-code line units / co-billed admin CPT).\"\"\"\n    md = medical.merge(index_date.rename(\"index_date\"), left_on=\"person_id\", right_index=True)\n    md = md[(md[\"service_date\"] >= md[\"index_date\"]) &\n            (md[\"service_date\"] < md[\"index_date\"] + pd.Timedelta(days=OBS_DAYS))]\n    md = md.drop_duplicates([\"person_id\", \"service_date\"])   # collapse line units to one admin/day\n    out = {}\n    for pid, g in md.groupby(\"person_id\"):\n        idx0 = g[\"index_date\"].iloc[0]\n        expected = OBS_DAYS // DOSE_INTERVAL_DAYS + 1\n        obs_end = idx0 + pd.Timedelta(days=OBS_DAYS)\n        admin = sorted(g[\"service_date\"].tolist())\n        used = [False] * len(admin)\n        on_time = 0\n        due = idx0                                           # first due = index; re-anchored to last on-time admin\n        slots = 0\n        while due < obs_end and slots < expected:            # walk one scheduled dose at a time\n            slots += 1\n            best = None\n            for i, a in enumerate(admin):\n                if not used[i] and abs((a - due).days) <= GRACE_DAYS:\n                    if best is None or abs((a - due).days) < abs((admin[best] - due).days):\n                        best = i\n            if best is not None:                             # on-time: re-anchor next due to this admin\n                used[best] = True\n                on_time += 1\n                due = admin[best] + pd.Timedelta(days=DOSE_INTERVAL_DAYS)\n            else:                                            # missed: advance one nominal interval\n                due = due + pd.Timedelta(days=DOSE_INTERVAL_DAYS)\n        out[pid] = on_time / expected\n    return pd.Series(out, name=\"injectable_dose_completion\")\n\ndef route_adherence(pharmacy, medical, enroll):\n    # FFS-observable only: drop MA-only person-time (physician-administered drugs absent in MA encounters).\n    ffs = enroll.loc[~enroll[\"ma_only\"], 
\"person_id\"].unique()\n    pharmacy = pharmacy[pharmacy[\"person_id\"].isin(ffs)]\n    medical  = medical[medical[\"person_id\"].isin(ffs)]\n    # Index date = first observed exposure of EITHER route (time zero); avoids immortal time for infusions.\n    first_rx = pharmacy.groupby(\"person_id\")[\"fill_date\"].min()\n    first_md = medical.groupby(\"person_id\")[\"service_date\"].min()\n    index_date = pd.concat([first_rx, first_md], axis=1).min(axis=1)\n    return pd.concat([oral_pdc(pharmacy, index_date),\n                      injectable_dose_completion(medical, index_date)], axis=1)",
        "description": "Route-aware exposure and adherence from claims. Required inputs (cleaned, de-duplicated):\n  pharmacy : person_id, fill_date (datetime), ndc, days_supply (int)        # oral / self-dispensed\n  medical  : person_id, service_date (datetime), admin_cpt, jcode           # clinician-administered\n  enroll   : person_id, enroll_start, enroll_end, ma_only (bool)            # ma_only person-time lacks FFS medical claims\nComputes oral PDC (days-covered with carry-over) and injectable dose-completion (administered /\nlabel-scheduled within a grace window) over a fixed observation window. These two outputs are\nDIFFERENT estimands and must be reported separately, never pooled.",
        "dependencies": [
          "pandas",
          "numpy"
        ],
        "source_citations": [
          "hess-2006"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\n\nOBS_DAYS <- 365L\nDOSE_INTERVAL_DAYS <- 28L\nGRACE_DAYS <- 7L\n\nroute_adherence <- function(pharmacy, medical, enroll) {\n  setDT(pharmacy); setDT(medical); setDT(enroll)\n  ffs <- enroll[ma_only == FALSE, unique(person_id)]   # drop MA-only person-time\n  pharmacy <- pharmacy[person_id %chin% ffs]\n  medical  <- medical[person_id %chin% ffs]\n\n  # Time zero = first exposure of either route (no immortal time for infused arms).\n  idx <- merge(pharmacy[, .(rx0 = min(fill_date)), by = person_id],\n               medical[,  .(md0 = min(service_date)), by = person_id],\n               by = \"person_id\", all = TRUE)\n  idx[, index_date := pmin(rx0, md0, na.rm = TRUE)]\n\n  # Oral PDC with carry-over: early refills extend the covered window, capped at obs end.\n  rx <- merge(pharmacy, idx[, .(person_id, index_date)], by = \"person_id\")\n  rx <- rx[fill_date >= index_date & fill_date < index_date + OBS_DAYS][order(person_id, fill_date)]\n  oral <- rx[, {\n    cursor <- index_date[1]; obs_end <- index_date[1] + OBS_DAYS\n    covered <- logical(OBS_DAYS)\n    for (i in seq_len(.N)) {\n      start <- max(fill_date[i], cursor)               # stack after prior supply runs out\n      end   <- min(start + days_supply[i], obs_end)\n      if (end > start) {\n        s <- as.integer(start - index_date[1]); e <- as.integer(end - index_date[1])\n        covered[(s + 1L):e] <- TRUE; cursor <- end\n      }\n    }\n    .(oral_pdc = sum(covered) / OBS_DAYS)\n  }, by = person_id]\n\n  # Injectable dose-completion: one admin per service date, matched to a rolling (last-dose-anchored)\n  # schedule +/- grace. 
Each on-time dose re-anchors the next due date, so a single late/missed dose\n  # does not desynchronize the whole grid.\n  md <- merge(medical, idx[, .(person_id, index_date)], by = \"person_id\")\n  md <- unique(md[service_date >= index_date & service_date < index_date + OBS_DAYS],\n               by = c(\"person_id\", \"service_date\"))\n  inj <- md[, {\n    expected <- OBS_DAYS %/% DOSE_INTERVAL_DAYS + 1L\n    obs_end  <- index_date[1] + OBS_DAYS\n    admin <- sort(service_date); used <- logical(length(admin)); on_time <- 0L\n    due <- index_date[1]; slots <- 0L\n    while (due < obs_end && slots < expected) {        # walk one scheduled dose at a time\n      slots <- slots + 1L\n      best <- NA_integer_\n      for (i in seq_along(admin)) {\n        if (!used[i] && abs(as.integer(admin[i] - due)) <= GRACE_DAYS) {\n          if (is.na(best) || abs(as.integer(admin[i] - due)) < abs(as.integer(admin[best] - due))) best <- i\n        }\n      }\n      if (!is.na(best)) {                              # on-time: re-anchor next due to this admin\n        used[best] <- TRUE; on_time <- on_time + 1L\n        due <- admin[best] + DOSE_INTERVAL_DAYS\n      } else {                                         # missed: advance one nominal interval\n        due <- due + DOSE_INTERVAL_DAYS\n      }\n    }\n    .(injectable_dose_completion = on_time / expected)\n  }, by = person_id]\n\n  merge(oral, inj, by = \"person_id\", all = TRUE)\n}",
        "description": "Route-aware exposure and adherence with data.table. Inputs mirror the Python version:\n  pharmacy : person_id, fill_date (Date), ndc, days_supply (integer)   # oral / self-dispensed\n  medical  : person_id, service_date (Date), admin_cpt, jcode          # clinician-administered\n  enroll   : person_id, ma_only (logical)\nReturns per-person oral_pdc (days-covered with carry-over) and injectable_dose_completion\n(administered / label-scheduled within a grace window). The two columns are different estimands.",
        "dependencies": [
          "data.table"
        ],
        "source_citations": [
          "hess-2006"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "%let obs_days     = 365;   /* observation window from each person's index date */\n%let dose_int     = 28;    /* injectable label dosing interval (q4 weeks)       */\n%let grace        = 7;     /* +/- days around each scheduled due date           */\n\n/* FFS-observable only: physician-administered drugs are absent in MA encounter data. */\nproc sql;\n  create table ffs as select person_id from work.enroll where ma_only = 0;\n  /* Time zero = first exposure of either route (no immortal time for infused arms). */\n  create table idx as\n    select person_id, min(d) as index_date format=date9. from (\n      select person_id, fill_date    as d from work.pharmacy\n      union all\n      select person_id, service_date as d from work.medical\n    )\n    where person_id in (select person_id from ffs)\n    group by person_id;\nquit;\n\n/* ---------- Section A: oral PDC with carry-over ---------- */\nproc sql;\n  create table rx as\n    select p.person_id, p.fill_date, p.days_supply, i.index_date\n    from work.pharmacy p inner join idx i on p.person_id = i.person_id\n    where p.fill_date >= i.index_date and p.fill_date < i.index_date + &obs_days\n    order by person_id, fill_date;\nquit;\ndata oral_pdc;\n  array cov[&obs_days] _temporary_;\n  do _i = 1 to &obs_days; cov[_i] = 0; end;\n  cursor = .; retain cursor;\n  do until (last.person_id);\n    set rx; by person_id;\n    if first.person_id then do; cursor = index_date; obs_end = index_date + &obs_days; end;\n    start = max(fill_date, cursor);                 /* stack after prior supply runs out */\n    end_d = min(start + days_supply, obs_end);\n    if end_d > start then do;\n      do _d = (start - index_date + 1) to (end_d - index_date); cov[_d] = 1; end;\n      cursor = end_d;\n    end;\n  end;\n  covered = 0; do _i = 1 to &obs_days; covered + cov[_i]; end;\n  oral_pdc = covered / &obs_days;\n  keep person_id oral_pdc;\nrun;\n\n/* ---------- Section B: injectable dose-completion ---------- */\n/* 
One administration per service date (collapse co-billed admin CPT / J-code line units). */\nproc sql;\n  create table admin as\n    select distinct m.person_id, m.service_date, i.index_date\n    from work.medical m inner join idx i on m.person_id = i.person_id\n    where m.service_date >= i.index_date and m.service_date < i.index_date + &obs_days\n    order by person_id, service_date;\nquit;\ndata inj_dc;\n  set admin; by person_id;\n  expected = floor(&obs_days / &dose_int) + 1;\n  array adm[200] _temporary_;\n  retain n_adm;\n  if first.person_id then n_adm = 0;\n  n_adm + 1; adm[n_adm] = service_date;\n  if last.person_id then do;\n    /* Rolling (last-dose-anchored) schedule: each on-time dose re-anchors the next due date, so a\n       single late/missed dose does not desynchronize the whole grid. adm[] is sorted ascending. */\n    array used[200] _temporary_;\n    do _i = 1 to n_adm; used[_i] = 0; end;\n    on_time = 0;\n    due = index_date;                                /* first due = index */\n    obs_end = index_date + &obs_days;\n    slots = 0;\n    do while (due < obs_end and slots < expected);   /* walk one scheduled dose at a time */\n      slots + 1;\n      best = .;\n      do _i = 1 to n_adm;\n        if used[_i] = 0 and abs(adm[_i] - due) <= &grace then do;\n          /* test best = . on its own: SAS does not guarantee short-circuit OR, and adm[.] is out of range */\n          if best = . then best = _i;\n          else if abs(adm[_i] - due) < abs(adm[best] - due) then best = _i;\n        end;\n      end;\n      if best ne . then do;                          /* on-time: re-anchor next due to this admin */\n        used[best] = 1; on_time + 1; due = adm[best] + &dose_int;\n      end;\n      else due = due + &dose_int;                     /* missed: advance one nominal interval */\n    end;\n    injectable_dose_completion = on_time / expected;\n    keep person_id injectable_dose_completion;\n    output;\n  end;\nrun;\n\n/* Report side by side, clearly labeled as two different estimands. 
*/\nproc sql;\n  create table route_adherence as\n    select coalesce(a.person_id, b.person_id) as person_id,\n           a.oral_pdc, b.injectable_dose_completion\n    from oral_pdc a full join inj_dc b on a.person_id = b.person_id;\nquit;",
        "description": "Route-aware exposure and adherence in SAS (PROC SQL + data step). Required inputs (post data-management):\n  work.pharmacy : person_id, fill_date, ndc, days_supply              # oral / self-dispensed\n  work.medical  : person_id, service_date, admin_cpt, jcode           # clinician-administered\n  work.enroll   : person_id, ma_only (0/1)\nSection A computes oral PDC with carry-over; Section B computes injectable dose-completion against the\nlabel interval with a grace window. The two outputs are different estimands and are reported separately.",
        "dependencies": [],
        "source_citations": [
          "hess-2006"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  C[Drug exposure event in RWD] --> T{claim_type?}\n  T -->|pharmacy claim<br/>NDC + days_supply| O[Oral / self-dispensed route]\n  T -->|medical claim<br/>admin CPT + J-code<br/>no days_supply| A[Clinician-administered route]\n  O --> OA[Adherence = PDC / MPR<br/>days-covered with carry-over]\n  A --> Q{label dosing interval known?}\n  Q -->|yes| AA[Adherence = dose-completion<br/>administered / scheduled within grace]\n  Q -->|no| AB[Persistence only<br/>time to treatment gap]\n  OA --> R[Report route-stratified<br/>NEVER pool the two estimands]\n  AA --> R\n  AB --> R\n  A --> H[HCRU + cost: attribute<br/>administration encounters to this arm]\n  H --> R",
        "caption": "Route decides the measurement instrument. claim_type routes the exposure to a days-supply-based estimand (PDC for pharmacy fills) or a dose-completion estimand (scheduled administrations met for J-code drugs); the two are not comparable and must be reported separately.",
        "alt_text": "Decision tree branching on claim_type into oral pharmacy fills (PDC) versus clinician-administered medical claims (dose-completion or persistence), with HCRU attribution and a route-stratified reporting rule.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "gantt\n  title Same patient-year, two routes, two adherence instruments\n  dateFormat YYYY-MM-DD\n  axisFormat %b\n  section Oral (pharmacy, PDC)\n  Fill 30d :done, f1, 2024-01-01, 30d\n  Fill 30d :done, f2, 2024-02-01, 30d\n  Gap (uncovered) :crit, g1, 2024-03-02, 14d\n  Fill 30d :done, f3, 2024-03-16, 30d\n  section Injectable (medical, dose-completion)\n  Admin (on time) :milestone, a1, 2024-01-01, 0d\n  Admin (on time) :milestone, a2, 2024-01-30, 0d\n  Admin (on time) :milestone, a3, 2024-03-02, 0d\n  Missed due date :crit, m1, 2024-03-30, 7d\n  Admin (re-anchors schedule) :milestone, a4, 2024-04-29, 0d",
        "caption": "For oral therapy, adherence is the fraction of days covered by stitched fills; for the injectable, it is the fraction of label-scheduled administration dates met within a grace window. The same calendar year yields two non-comparable numbers.",
        "alt_text": "Gantt chart contrasting oral fill-stitching with days-covered gaps against injectable administration dates matched to a rolling, last-dose-anchored schedule with one missed due date.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "see_also",
        "target_slug": "pdc-proportion-of-days-covered",
        "notes": "PDC is the correct adherence instrument for the pharmacy stream (oral / self-dispensed); it is undefined for J-code administrations that carry no days_supply."
      },
      {
        "relation_type": "alternative_to",
        "target_slug": "mpr-medication-possession-ratio",
        "notes": "MPR is an alternative fill-based adherence measure for the pharmacy stream; like PDC it does not transfer to clinician-administered drugs."
      },
      {
        "relation_type": "see_also",
        "target_slug": "persistence-time-to-discontinuation",
        "notes": "Persistence is the most route-robust adherence concept (a gap is a gap), useful as a secondary endpoint when PDC and dose-completion are not comparable across arms."
      },
      {
        "relation_type": "used_with",
        "target_slug": "hcru-healthcare-resource-utilization",
        "notes": "Clinician-administered routes generate distinct administration/infusion encounters in the medical stream that must be attributed to the injectable arm to avoid biased cost comparisons."
      },
      {
        "relation_type": "see_also",
        "target_slug": "treatment-patterns-lines-of-therapy",
        "notes": "Switches between oral and injectable routes are common in RA, IBD, and oncology and change how lines of therapy and persistence are defined."
      },
      {
        "relation_type": "used_with",
        "target_slug": "claims-analysis",
        "notes": "Claims separate route via claim_type (pharmacy vs. medical) and codes (NDC vs. administration CPT + J-code), which is the first step in any route-aware exposure definition."
      },
      {
        "relation_type": "alternative_to",
        "target_slug": "procedure-identification-and-measurement-in-claims-ehr",
        "notes": "Procedure identification handles standalone procedures; drug-route capture handles administered drugs (admin CPT + J-code), with overlap in infusion/oncology episodes."
      },
      {
        "relation_type": "see_also",
        "target_slug": "medicare-ffs-ma-commercial-claims-differences-rwe",
        "notes": "Medicare Advantage encounter data under-capture physician-administered (Part B buy-and-bill) drugs and infusion HCRU; exclude MA-only person-time before computing route-specific exposure."
      },
      {
        "relation_type": "see_also",
        "target_slug": "healthy-user-bias",
        "notes": "Preventive orals (e.g., statins) and clinician-administered biologics channel different patient types, so route is entangled with healthy-user and severity selection."
      }
    ],
    "aliases": [
      "route of administration",
      "ROA differences in RWE",
      "route-specific exposure capture",
      "oral versus injectable exposure measurement"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "ema"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "rxnorm-drug-terminology",
    "name": "RxNorm Drug Terminology",
    "short_definition": "The National Library of Medicine's normalized naming system for clinical drugs in the US, assigning stable numeric concept identifiers (RxCUIs) to drug ingredients, dose forms, strengths, and branded products so that thousands of package-level NDC codes from any pharmacy or EHR system can all be linked to a single shared ingredient concept.",
    "long_description": "**RxNorm** is a normalized drug vocabulary produced and maintained by the National Library of Medicine\n(NLM). Its defining purpose is to give every clinically meaningful drug concept — ingredient, strength,\ndose form, brand, and packaged product — a single stable identifier (the **RxCUI**, RxNorm Concept\nUnique Identifier) so that systems built on different source vocabularies can exchange drug information\nwithout losing meaning. Source vocabularies that RxNorm links include First Databank (FDB), Multum,\nMicromedex, Gold Standard, the VA National Drug File (VANDF), MedlinePlus, and others. NLM releases a\nfull monthly build plus weekly incremental updates. The core RxNorm files are freely distributable\nwithout a UMLS license; however, some source-vocabulary string content embedded in the full RxNorm\ndistribution files does require a UMLS Metathesaurus license from NLM to reproduce. Analysts using\nonly the RxCUI identifiers, relationship tables, and the public RxNav API (which returns standard\nRxNorm atoms) do not require a UMLS license.\n\n**Core machinery: the RxNorm concept hierarchy.**\nRxNorm organizes drug concepts into a lattice of term types (TTY) connected by explicit relationships:\n\n- **IN (Ingredient):** The active moiety — e.g., *atorvastatin*. This is the rollup target for most RWE\n  exposure definitions. All lower-level concepts ultimately link upward to an IN via `has_ingredient`.\n- **PIN (Precise Ingredient):** A salt or ester form distinguished from the free base — e.g.,\n  *atorvastatin calcium*. 
Used when the salt form is analytically important.\n- **MIN (Multiple Ingredients):** A multi-ingredient combination, e.g., *amlodipine / atorvastatin*.\n- **DF (Dose Form):** The form alone — *tablet*, *capsule*, *oral solution*.\n- **SCDC (Semantic Clinical Drug Component):** Ingredient + strength — e.g., *atorvastatin 10 mg*.\n- **SCD (Semantic Clinical Drug):** Ingredient + strength + dose form (generic) — e.g., *atorvastatin\n  20 mg oral tablet*. This is the \"clinical drug\" grain at which OMOP's DRUG_EXPOSURE table is typically\n  mapped. SCDs are the workhorse level for comparisons that need to distinguish dose.\n- **SBD (Semantic Branded Drug):** Like SCD but branded — e.g., *Lipitor 20 mg oral tablet*.\n- **BN (Brand Name):** The trade name alone — *Lipitor*.\n- **GPCK / BPCK (Generic Pack / Branded Pack):** Multi-component packaged products.\n\nRelationships traverse the hierarchy. `consists_of` links SCD → SCDC, and `has_ingredient` links\nSCDC → IN, so an SCD reaches its ingredient through its component. `tradename_of` links SBD → SCD.\n`has_dose_form` links SCD → DF. The `CONCEPT_ANCESTOR` table\nin OMOP CDM materializes all transitive ancestor–descendant paths so a single SQL join can find every\nSCD or SBD whose ancestor is the atorvastatin IN.\n\n**The workhorse RWE operation: NDC → RxCUI → ingredient rollup.**\nA US pharmacy claim carries an NDC (National Drug Code) — an 11-digit package-level code that encodes\nlabeler, product, and package. A single clinical drug such as atorvastatin 20 mg oral tablet may appear\nunder dozens of NDCs (different labelers, including generic manufacturers and repackagers, and\ndifferent package sizes). The\nfundamental RWE step is to map each NDC to its RxCUI at the SCD or SBD level, then navigate up to the IN\nlevel. 
Concretely: NDC 00071-0156-23 (Lipitor 20 mg, 90-count) → RxCUI 617318 (Lipitor 20 mg oral\ntablet, SBD) → RxCUI 617310 (atorvastatin 20 mg oral tablet, SCD) → RxCUI 83367 (atorvastatin, IN).\nThis collapse is what allows an analyst to write \"all atorvastatin fills\" as a single ingredient\nRxCUI rather than maintaining a list of hundreds of NDCs that grows with every generic entrant.\n\n**Drug class expansion via RxClass.**\nRxNorm's companion service, RxClass, links RxCUIs to external drug classification systems: ATC\n(Anatomical Therapeutic Chemical, WHO), MeSH pharmacologic actions, VA drug classes, and MED-RT\n(Medication Reference Terminology, the VA's successor to NDF-RT). An analyst building an \"all statin\"\ncohort queries RxClass for ATC C10AA (HMG CoA reductase inhibitors), retrieves all IN-level members,\nthen maps each IN to its SCDs and NDCs. This makes the drug-class definition transparent, reproducible,\nand auditable — a critical property for regulatory submissions and HTA dossiers.\n\n**Operational realities for RWE analysts.**\n*Historical NDC-to-RxCUI mappings:* Claims data often contain NDCs that have been withdrawn, retired,\nor reformulated. The RxNorm API's `ndcstatus` endpoint returns the history of RxCUIs that a retired NDC\nmapped to while it was active, and `allhistoricalndcs` returns every NDC ever associated with a given\nRxCUI. Failing to use historical mappings leaves NDCs unmapped (coded as\n\"unclassified\" or dropped), which selectively excludes older claims and biases incidence estimates.\n*Retired and remapped RxCUIs:* RxCUIs themselves can be retired or remapped across NLM monthly\nreleases. A concept mapped in January may have a different status in December. Best practice is to\npin the RxNorm release date (store it alongside every mapping), use the `historystatus` endpoint to\ndetect remapped concepts, and rebuild mappings when updating to a new release.\n*Suppressible and obsolete atoms:* RxNorm marks certain atoms as suppressible (TTY=OA, etc.) 
or\nobsolete; analysts should filter to active, non-suppressible atoms when pulling name strings, and\nvalidate that the RxCUI returned by an NDC lookup has a current, non-retired status.\n*OMOP integration:* In the OMOP Common Data Model, RxNorm is the **standard vocabulary for the Drug\ndomain**. Source NDC codes in DRUG_EXPOSURE are mapped to RxNorm standard concepts via\n`CONCEPT_RELATIONSHIP` rows with relationship `Maps to`. The DRUG_ERA derivation then rolls those\nclinical-drug concepts up to ingredient via `CONCEPT_ANCESTOR`. If a site's ETL mapped NDCs to\nnon-standard concepts or used a stale RxNorm release, all downstream drug-era counts will be wrong;\nauditing the `vocabulary_id = 'RxNorm'` completeness in the CONCEPT table is the first QC step.\n*ATC linkage for international comparability:* RxClass provides RxCUI → ATC mappings. Because ATC is\nWHO's international classification, drug lists built this way can be ported to European claims or\nregistry studies with minimal re-mapping — a common need in multi-country HEOR submissions.\n\n**Pros, cons, and trade-offs.**\n- **vs raw NDC lists maintained manually:** RxNorm ingredient rollup is algorithmically stable: add\n  one ingredient RxCUI to the concept set and every current and future NDC for that ingredient is\n  automatically captured when the mapping files are refreshed. Manual NDC lists grow stale within\n  months as new generics enter the market and old packages are retired. **Prefer RxNorm rollup** for\n  any study that will be updated, replicated, or submitted to a regulator; maintain NDC-level\n  granularity only when package-level identity matters (e.g., specific package sizes for adherence\n  per-unit-cost calculations). 
The cost is that RxNorm mapping requires a one-time ETL investment and\n  a versioning discipline that ad hoc NDC lists skip.\n- **vs ATC alone for drug class identification:** ATC is an internationally recognized hierarchical\n  classification with clear levels (1-character to 7-character codes), making it natural for\n  cross-country comparisons. RxNorm does not carry a built-in class hierarchy; class membership\n  requires a RxClass lookup. However, ATC-to-NDC crosswalks are indirect and can miss new US generics\n  not yet classified; RxNorm → RxClass → ATC is the more complete and timely path for US claims.\n  **Prefer RxNorm + RxClass** for US-sourced studies; use ATC as the reporting layer when presenting\n  alongside non-US data.\n- **vs GPI (Generic Product Identifier) or Multum classification used in commercial claims:** Some\n  commercial claims databases (e.g., Truven MarketScan-era) ship with GPI or Multum class codes in\n  addition to NDC. These are convenient shortcuts but are not publicly documented, not NLM-maintained,\n  and not consistent across data vendors. **Prefer RxNorm/RxCUI** for reproducible, cross-database,\n  or externally auditable work; GPI/Multum codes are acceptable for exploratory or single-database\n  descriptive analyses where the code list will not be published or scrutinized.\n- **vs brand-level HCPCS J-codes for physician-administered drugs:** Many biologics and infused drugs\n  dispensed in physician offices are billed under HCPCS Level II J-codes rather than pharmacy NDC\n  codes. HCPCS J-codes do not appear in RxNorm; they live in a separate vocabulary (HCPCS in OMOP).\n  An analyst building an \"all adalimumab\" cohort must union NDC-based pharmacy fills with HCPCS\n  J-code-based medical claims — RxNorm handles only the former. 
**Pair RxNorm with HCPCS J-codes**\n  for biologics and specialty infusibles; failing to do so undercounts exposure, especially in the\n  Medicare and commercial fee-for-service populations where physician-office infusion is common.\n\n**When to use.** Use RxNorm in any study that: (1) identifies drug exposure from pharmacy claims or\nEHR dispensing records by ingredient or drug class; (2) needs to collapse thousands of NDCs to a\nmanageable concept list for cohort eligibility, washout, or outcome definitions; (3) will be run\nacross multiple databases or replicated in a distributed research network (OMOP CDM sites,\nPCORnet, Sentinel) where a common vocabulary is required; (4) requires a defensible,\npublicly documented, and reproducible drug-class definition for a regulatory submission, HTA\ndossier, or peer-reviewed publication; (5) needs to check prior drug use in a washout period and\nmust handle retired NDC codes from older claims.\n\n**When NOT to use — and when it is actively misleading or dangerous.**\n- **As the sole exposure source for physician-administered specialty drugs.** Biologics, chemotherapy\n  agents, and infused medications primarily appear in medical claims under HCPCS J-codes, not pharmacy\n  NDC codes. Using only RxNorm/NDC-based fills to define \"biologic initiator\" in a rheumatology or\n  oncology cohort will systematically miss the majority of infused administrations. The bias is\n  differential if infusion rates differ by study arm or time period.\n- **Without version-pinning the RxNorm release.** Applying a drug list built against the January 2023\n  RxNorm release to claims processed in 2025 will silently miss NDCs for generics approved after\n  January 2023 — and those generics may be the dominant dispensed form by 2025. Always store and\n  report the RxNorm release date alongside every mapping. 
Studies spanning years should re-map with\n  each annual or semi-annual release update.\n- **Without auditing unmapped NDCs.** Every NDC-to-RxCUI mapping exercise produces unmapped rows.\n  If the unmapped fraction is substantial (often 3–15% in older claims) and non-random — for example,\n  concentrated in a specific formulary tier, therapeutic class, or calendar period — the resulting\n  exposure definition is biased and the selection is invisible in published code lists. Always report\n  the fraction mapped and characterize unmapped NDCs before treating the mapping as complete.\n- **Without the historical NDC endpoint for old claims.** An NDC that appears in claims from 2015 may\n  have been withdrawn and will not appear in current RxNorm active files. Mapping only against the\n  current RxNorm release drops all withdrawn-NDC rows — which represent real fills — producing\n  artificially low utilization in earlier calendar periods and inducing a secular trend artifact.\n- **Treating a drug-class RxClass query as permanent.** RxClass membership (especially via ATC\n  updates or VA class revisions) can change across releases. A statin list queried in 2020 may differ\n  slightly from one queried in 2024 if ATC added or reclassified agents. For regulatory work, snapshot\n  and document the drug-class membership at query time.\n\n**Data-source operational depth.**\n- **Pharmacy claims (FFS commercial / Medicare Part D):** The natural home for RxNorm mapping. Each\n  adjudicated pharmacy claim carries an NDC; mapping to RxCUI is the first ETL step. Failure modes:\n  claim reversals must be removed before the mapping step or a reversed fill appears as legitimate\n  exposure; 90-day mail-order fills look identical to 30-day fills in terms of NDC but carry a larger\n  `days_supply` that affects episode construction; free samples and manufacturer-provided doses appear\n  in no pharmacy claim and are therefore invisible to RxNorm-based exposure windows. 
Medicare\n  Advantage members' Part D utilization is often available only in encounter data rather than\n  adjudicated FFS claims; verify that the Part D enrollment flag is present before trusting the NDC\n  completeness for MA enrollees.\n- **EHR (dispensing/prescribing records):** EHR medication records may be coded in RxNorm natively\n  (common in modern FHIR-based systems), in a source vocabulary (FDB, Multum) that RxNorm links to, or\n  as free-text strings requiring NLP normalization. Orders differ from fills; an EHR-recorded RxNorm\n  code represents a prescription issued, not confirmed dispensing. For exposure definitions requiring\n  confirmed fills, link EHR prescribing records to outpatient pharmacy claims. In-hospital\n  administration records typically reference NDC or facility item codes, not RxNorm, and require a\n  separate crosswalk step.\n- **OMOP CDM sites (distributed network):** The ETL from source NDC (or FDB/Multum concept) to OMOP\n  standard RxNorm concept is the single most consequential data-quality step. Validate using the OHDSI\n  Drug Exposure Diagnostics R package (`DrugExposureDiagnostics`): check the fraction of records with\n  `drug_concept_id > 0`, the ingredient-level coverage of your concept set, and the\n  `days_supply` distribution for implausible values. When running across multiple CDM sites, pin the OMOP\n  vocabulary release (CONCEPT table version) to ensure consistency; vocabulary version differences\n  can cause a concept that maps at site A to fail at site B if site B's CONCEPT table predates the\n  relevant RxNorm release.\n- **Registry data:** Disease registries rarely carry NDC-level dispensing. If the registry records\n  treatment as a drug name string or ATC code, map through RxNorm (string-to-RxCUI via the RxNav\n  approximate-match API, or ATC-to-RxCUI via RxClass) to enable cross-registry or claims linkage.\n  The ATC-to-RxCUI path is usually more reliable than string matching for this purpose.",
    "primary_category": "Data_Standard",
    "tags": [
      "coding-system",
      "data-standard",
      "primitive",
      "drugs",
      "terminology",
      "rxnorm",
      "rxcui",
      "rxnav",
      "ndc",
      "drug-class",
      "omop",
      "atc",
      "pharmacoepidemiology",
      "claims"
    ],
    "applies_to_study_types": [
      "new_user",
      "active_comparator_new_user",
      "cohort_retrospective",
      "claims_analysis",
      "ehr_study",
      "drug_utilization_study",
      "target_trial_emulation"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1136/amiajnl-2011-000116",
        "url": "https://doi.org/10.1136/amiajnl-2011-000116",
        "citation_text": "Nelson SJ, Zeng K, Kilbourne J, Powell T, Moore R. Normalized names for clinical drugs: RxNorm at 6 years. Journal of the American Medical Informatics Association. 2011;18(4):441-448.",
        "year": 2011,
        "authors_short": "Nelson et al.",
        "notes": "The canonical description of RxNorm's design after six years of operation — covers the concept hierarchy (TTY levels), relationship types, source-vocabulary linking, the monthly release cycle, and the rationale for normalizing clinical drug names across heterogeneous pharmacy systems. First author surname and year match Crossref record.\n"
      },
      {
        "role": "explain",
        "doi": "10.1093/nar/gkh061",
        "url": "https://doi.org/10.1093/nar/gkh061",
        "citation_text": "Bodenreider O. The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Research. 2004;32(Database issue):D267-270.",
        "year": 2004,
        "authors_short": "Bodenreider",
        "notes": "Describes the UMLS Metathesaurus framework within which RxNorm operates — covers the source-vocabulary integration model, the CUI/AUI identifier system that RxCUIs parallel, and the licensing structure that governs redistribution of source-vocabulary content. Essential context for understanding why core RxNorm identifiers are freely distributable while some embedded source-vocabulary strings require a UMLS license.\n"
      },
      {
        "role": "demonstrate",
        "doi": "10.1002/pds.4017",
        "url": "https://doi.org/10.1002/pds.4017",
        "citation_text": "Brunelli SM. Use of prescription drug claims data to identify lipid-lowering medication exposure in pharmacoepidemiology studies: potential pitfalls. Pharmacoepidemiology and Drug Safety. 2016;25(7):844-846.",
        "year": 2016,
        "authors_short": "Brunelli",
        "notes": "Illustrates the practical hazards of NDC-to-drug-class mapping in pharmacy claims — including stale NDC lists, generic entry, and misclassification — that RxNorm ingredient-rollup and RxClass drug-class queries are specifically designed to prevent. A concrete demonstration of why a maintained, versioned RxNorm mapping is preferred over a static NDC list for lipid-lowering and other drug-class exposures.\n"
      },
      {
        "role": "use",
        "doi": null,
        "url": "https://www.nlm.nih.gov/research/umls/rxnorm/index.html",
        "citation_text": "National Library of Medicine. RxNorm. Bethesda, MD: NLM. Accessed 2026.",
        "year": 2026,
        "authors_short": "NLM",
        "notes": "Official NLM RxNorm landing page — authoritative source for current release files, documentation, licensing terms (including the distinction between freely distributable RxNorm content and UMLS-licensed source-vocabulary strings), and links to the full RRF distribution archives. HTTP 200 verified 2026-06-12.\n"
      },
      {
        "role": "use",
        "doi": null,
        "url": "https://lhncbc.nlm.nih.gov/RxNav/",
        "citation_text": "Lister Hill National Center for Biomedical Communications, NLM. RxNav — RxNorm Application Programming Interface. Bethesda, MD: NLM. Accessed 2026.",
        "year": 2026,
        "authors_short": "NLM LHNCBC",
        "notes": "The public RxNav REST API documentation — covers all endpoints used in RWE workflows: NDC-to-RxCUI lookup, RxCUI status and history, ingredient rollup, RxClass drug-class queries (ATC/VA/MeSH), approximate-match name search, and bulk RRF file download paths. HTTP 200 verified 2026-06-12.\n"
      }
    ],
    "plain_language_summary": "RxNorm is a public database maintained by the US National Library of Medicine that gives every drug — by its active ingredient, strength, pill form, and brand name — a stable number called an RxCUI. Think of it as a universal translator for prescription drug names: a patient could fill the same atorvastatin tablet from a dozen different generic manufacturers, each with a different barcode-style drug code (called an NDC), but all of those codes map to the same single RxCUI for \"atorvastatin 20 mg oral tablet.\" For researchers studying which patients took which drugs, RxNorm is what allows thousands of package-level drug codes in insurance claims to be collapsed into one manageable list — one ingredient, one search.",
    "key_terms": [
      {
        "term": "RxCUI",
        "definition": "The unique number RxNorm assigns to a drug concept — for example, every atorvastatin 20 mg oral tablet product from any manufacturer shares one RxCUI at the generic level."
      },
      {
        "term": "NDC (National Drug Code)",
        "definition": "A code that identifies the labeler, product, and package size of a drug marketed in the US; it is printed on the package as 10 digits and normalized to 11 digits in claims data. One drug can have dozens of NDCs, all of which roll up to a single ingredient-level RxCUI."
      },
      {
        "term": "Term type (TTY)",
        "definition": "The level of the RxNorm hierarchy at which a concept sits, such as ingredient (IN), semantic clinical drug (SCD), semantic branded drug (SBD), or brand name (BN); each represents a different granularity of drug information."
      },
      {
        "term": "Ingredient (IN)",
        "definition": "The active chemical in a drug, stripped of dose and form — for example, \"atorvastatin\" is the ingredient-level concept that all atorvastatin products of every strength and brand share."
      },
      {
        "term": "Semantic Clinical Drug (SCD)",
        "definition": "The generic clinical-drug level in RxNorm — ingredient plus strength plus dose form — for example, \"atorvastatin 20 mg oral tablet\"; it is the typical target level for mapping pharmacy claims in research databases."
      },
      {
        "term": "RxClass",
        "definition": "A companion service that links RxCUIs to drug-class systems like ATC (international anatomical-therapeutic-chemical classification) and VA drug classes, so analysts can build defensible \"all statins\" or \"all ACE inhibitors\" concept lists without manually curating hundreds of individual drug codes."
      }
    ],
    "worked_example": {
      "scenario": "A pharmacoepidemiologist is building a \"new statin user\" cohort from commercial insurance pharmacy claims. The raw claims table contains NDCs, one per prescription fill. Six fills appear for patient 3001 between January and June 2023, representing six atorvastatin products from three manufacturers (one brand, two generic) in two strengths. The analyst needs to: (1) map each NDC to its RxCUI, (2) identify the RxCUI at the generic clinical drug (SCD) level and then at the ingredient (IN) level, and (3) confirm all six fills belong to the same ingredient so they can be combined into one continuous statin exposure episode. The example walks through the mapping chain for each fill.\n",
      "dataset": {
        "caption": "Six pharmacy claim rows for one patient. Each NDC is a distinct product code (different manufacturer, strength, or package), yet all six represent statin fills that should collapse to a single ingredient for cohort purposes.\n",
        "columns": [
          "person_id",
          "fill_date",
          "ndc",
          "product_description",
          "days_supply"
        ],
        "rows": [
          [
            3001,
            "2023-01-08",
            "00071-0155-23",
            "Lipitor 20mg tab 90ct (Pfizer)",
            90
          ],
          [
            3001,
            "2023-01-08",
            "00071-0156-23",
            "Lipitor 40mg tab 90ct (Pfizer)",
            90
          ],
          [
            3001,
            "2023-02-15",
            "16729-0120-01",
            "atorvastatin 20mg tab 90ct (Accord)",
            90
          ],
          [
            3001,
            "2023-02-15",
            "16729-0121-01",
            "atorvastatin 40mg tab 90ct (Accord)",
            90
          ],
          [
            3001,
            "2023-04-10",
            "68180-0221-06",
            "atorvastatin 20mg tab 90ct (Lupin)",
            90
          ],
          [
            3001,
            "2023-06-01",
            "68180-0222-06",
            "atorvastatin 40mg tab 90ct (Lupin)",
            90
          ]
        ]
      },
      "steps": [
        "Map each NDC to its SCD-level RxCUI using the RxNav API (or the RXNCONSO/RXNSAT files offline). 00071-0155-23 -> RxCUI 617310 (Lipitor 20 mg oral tablet, SBD) then its generic SCD RxCUI 308136 (atorvastatin 20 mg oral tablet). 00071-0156-23 -> RxCUI 617311 (Lipitor 40 mg oral tablet, SBD) -> SCD RxCUI 308137 (atorvastatin 40 mg oral tablet). The four generic NDCs map directly to SCD RxCUIs 308136 and 308137 without a branded intermediate.\n",
        "Roll each SCD RxCUI up to the ingredient (IN) level via the \"has_ingredient\" relationship. Both SCD 308136 (atorvastatin 20 mg oral tablet) and SCD 308137 (atorvastatin 40 mg oral tablet) share the same ingredient RxCUI 83367 (atorvastatin, IN). All six NDCs therefore resolve to one and the same ingredient.\n",
        "Count the distinct levels: 6 NDCs -> 2 SCDs (one per strength) -> 1 ingredient (atorvastatin). All six fills belong to a single ingredient concept, so they can be combined into one continuous statin exposure window for the new-user cohort definition.\n",
        "Verify: the analyst queries RxNav for each NDC's status (to confirm no retired-NDC issue for these 2023 fills), checks the ingredient name is \"atorvastatin\" (RxCUI 83367), and records the RxNorm release date (e.g., 2023-07-03) used for the mapping so the study is reproducible.\n"
      ],
      "result": "6 NDCs -> 2 SCDs -> 1 ingredient (atorvastatin, RxCUI 83367). All six pharmacy claim rows for patient 3001 map to the same ingredient. A washout query using the atorvastatin ingredient concept (RxCUI 83367) and its descendants in CONCEPT_ANCESTOR will capture all six product variants with a single concept set entry, eliminating the need to manually curate a 6-item NDC list that would grow stale as new generics enter the market.\n"
    },
    "prerequisites": [],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "NDC-to-RxCUI via RxNav API (real-time / small volume)",
        "description": "Call the public RxNav REST endpoint for each NDC to retrieve the current RxCUI and TTY. Suitable for ad hoc lookups, prospective data pipelines, and studies with < 1 million unique NDCs. Requires internet access and throttling to avoid rate limits (~20 req/s).",
        "edge_cases": [
          "An NDC not found in current active files may still resolve via the historical endpoint (rxcui/historyndc); omitting this step drops retired NDCs silently.",
          "A returned RxCUI may itself have status \"Remapped\" or \"Retired\" — always check rxcui/status before accepting the returned RxCUI as current."
        ],
        "data_source_notes": "Best for EHR and claims pipelines where NDCs arrive incrementally; cache results with the RxNorm release date to avoid re-querying the same NDC repeatedly."
      },
      {
        "name": "Offline join against RxNorm RRF files (bulk / reproducible)",
        "description": "Download the full RxNorm release in Rich Release Format (RRF) from NLM, load RXNCONSO, RXNREL, and RXNSAT into a local database, and join NDC codes (found in RXNSAT as attribute ATN = NDC) to their RxCUI and TTY. This approach is reproducible, fast, and does not require internet access during analysis. Pin the release date (for example, from the release archive name, since RRF files carry no header row).",
        "edge_cases": [
          "RXNSAT NDC attribute rows include both current and some historical NDCs, but completeness of historical coverage varies; supplement with the RxNav historyndc endpoint for NDCs not found.",
          "The SUPPRESS column in RXNCONSO must be filtered to keep only non-suppressed atoms (SUPPRESS = 'N'); the values 'O', 'Y', and 'E' mark obsolete or suppressible atoms."
        ],
        "data_source_notes": "Preferred for distributed-network studies, regulated submissions, and large claims datasets; the local join is orders of magnitude faster than API calls at scale."
      },
      {
        "name": "OMOP CONCEPT table join (CDM-native)",
        "description": "In an OMOP CDM, source NDC codes are mapped to standard RxNorm drug concepts via the ETL and stored in CONCEPT_RELATIONSHIP (relationship_id = 'Maps to'). Use CONCEPT_ANCESTOR to traverse from clinical-drug-level standard concepts up to ingredient. No external API call is needed if the ETL is current and the CONCEPT table's vocabulary version is pinned.",
        "edge_cases": [
          "The CONCEPT table in older CDM deployments may reference a stale OMOP vocabulary release; verify vocabulary_version in the VOCABULARY table matches the expected RxNorm release date.",
          "Records with drug_concept_id = 0, or mapped to a concept whose standard_concept is not 'S', are unmapped or mapped to a non-standard concept; quantify and characterize these before treating the drug-era derivation as complete."
        ],
        "data_source_notes": "The standard path for any OHDSI network study; validate with the OHDSI DrugExposureDiagnostics R package before running cross-site analyses."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "ndc-national-drug-code",
        "pros_of_this": "RxNorm provides a stable ingredient-level concept that absorbs new generic NDCs automatically on each release; NDC lists become stale within months as generics enter the market. RxNorm also enables class-level querying via RxClass (ATC, VA classes) that NDC alone cannot support.",
        "cons_of_this": "RxNorm mapping adds an ETL step with version-management overhead; raw NDC codes are immediately available in claims without any additional lookup. NDC is also the level at which labeler-specific adverse event signals are detected — RxNorm rollup intentionally erases that granularity.",
        "when_to_prefer": "Prefer RxNorm for any drug-class or ingredient-level exposure definition that will be replicated, published, or submitted to a regulator; prefer raw NDC only for labeler-specific or package-specific analyses."
      },
      {
        "compared_to": "omop-drug-exposure-drug-era-rwe",
        "pros_of_this": "RxNorm is the upstream vocabulary layer that makes OMOP DRUG_EXPOSURE standard concepts meaningful; understanding RxNorm TTY levels explains why DRUG_EXPOSURE sits at the clinical-drug grain and DRUG_ERA rolls to ingredient — the CDM design mirrors the RxNorm lattice.",
        "cons_of_this": "RxNorm alone does not define exposure episodes or gap-stitching logic; those are OMOP DRUG_ERA decisions layered on top.",
        "when_to_prefer": "Use this concept to understand the vocabulary foundation; move to OMOP Drug Exposure and Drug Era for episode-construction rules and OHDSI-specific implementation patterns."
      },
      {
        "compared_to": "drug-utilization",
        "pros_of_this": "RxNorm ingredient rollup produces the clean, de-duplicated drug-class lists that drug utilization studies require to count treated patients and calculate DDD denominators accurately; utilization metrics computed from raw NDCs are vulnerable to both double-counting and gaps.",
        "cons_of_this": "RxNorm mapping is a data-preparation step, not an analytic design; a drug utilization study still needs to specify its population, time window, and utilization metric independently of the vocabulary choice.",
        "when_to_prefer": "RxNorm is always the preferred NDC-normalization layer for a drug utilization study; the choice is not either/or but sequential (map first, then analyze)."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Map NDCs to RxCUIs at SCD level using RxNav API or offline RRF join; roll up to ingredient via has_ingredient or CONCEPT_ANCESTOR. Store the RxNorm release date with every mapping. Use the historyndc endpoint for retired NDCs from older claims periods. Audit unmapped NDC fraction and characterize by therapeutic class and calendar year before finalizing the exposure definition. For HCPCS-billed drugs (biologics, infusibles), maintain a parallel HCPCS J-code concept set — RxNorm does not cover this domain.",
      "ehr": "EHR medication records may already carry RxNorm concept IDs in FHIR MedicationRequest resources; verify coding system is 'http://www.nlm.nih.gov/research/umls/rxnorm'. For FDB/Multum-coded records, use RxNorm's source-vocabulary cross-references. Distinguish orders (prescribing intent) from administrations or dispensings (confirmed exposure); link to outpatient pharmacy claims to confirm fills when feasible. In-hospital administration records require a facility-to-NDC or facility-to-RxCUI crosswalk step not provided by standard RxNav.",
      "linked": "In linked claims-EHR or claims-registry datasets, RxNorm serves as the common drug vocabulary that allows pharmacy claims (NDC-based) and EHR medication records (RxNorm or FDB-based) to be merged without double-counting. Reconcile fill dates and order dates: an EHR order on day 0 and a pharmacy fill on day 1 represent one exposure event, not two. Pin the same RxNorm release across all data sources before merging."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import requests\nimport pandas as pd\nfrom functools import lru_cache\n\nRXNAV_BASE = \"https://rxnav.nlm.nih.gov/REST\"\nRXNORM_RELEASE = \"2024-01-03\"   # pin the release date used for this mapping run\n\n\n# ── PATTERN A: RxNav REST API (real-time, cached) ────────────────────────────\n\n@lru_cache(maxsize=8192)\ndef ndc_to_rxcui(ndc: str) -> dict | None:\n    \"\"\"Return the current RxCUI and term type for an 11-digit NDC, or None if not found.\n    Strips hyphens; tries historical lookup if current lookup returns nothing.\"\"\"\n    ndc = ndc.replace(\"-\", \"\")\n    # 1. Current NDC lookup\n    r = requests.get(f\"{RXNAV_BASE}/ndcstatus.json\", params={\"ndc\": ndc}, timeout=10)\n    if r.ok:\n        data = r.json().get(\"ndcStatus\", {})\n        status = data.get(\"status\", \"\")\n        if status == \"ACTIVE\":\n            rxcui = data.get(\"rxcui\", \"\")\n            if rxcui:\n                return {\"rxcui\": rxcui, \"status\": \"ACTIVE\", \"source\": \"current\"}\n    # 2. 
Historical NDC lookup for retired NDCs (common in older claims)\n    r2 = requests.get(f\"{RXNAV_BASE}/historyndc.json\", params={\"ndc\": ndc}, timeout=10)\n    if r2.ok:\n        hist = r2.json().get(\"historicalNdcConcept\", {}).get(\"historicalNdcTime\", [])\n        if hist:\n            # return the most recent historical RxCUI\n            latest = hist[-1]\n            return {\"rxcui\": latest.get(\"rxcui\", \"\"), \"status\": \"HISTORICAL\",\n                    \"source\": \"historyndc\", \"start\": latest.get(\"startDate\"),\n                    \"end\": latest.get(\"endDate\")}\n    return None\n\n\n@lru_cache(maxsize=4096)\ndef rxcui_to_ingredient(rxcui: str) -> str | None:\n    \"\"\"Walk from any drug RxCUI up to its ingredient (IN-level) RxCUI via has_ingredient.\"\"\"\n    r = requests.get(f\"{RXNAV_BASE}/rxcui/{rxcui}/related.json\",\n                     params={\"tty\": \"IN\"}, timeout=10)\n    if r.ok:\n        groups = r.json().get(\"relatedGroup\", {}).get(\"conceptGroup\", [])\n        for g in groups:\n            if g.get(\"tty\") == \"IN\":\n                concepts = g.get(\"conceptProperties\", [])\n                if concepts:\n                    return concepts[0][\"rxcui\"]   # first ingredient RxCUI\n    return None\n\n\ndef map_claims_ndcs(claims_df: pd.DataFrame) -> pd.DataFrame:\n    \"\"\"Map a claims DataFrame with 'ndc' column to SCD-level RxCUI and ingredient RxCUI.\n    Adds: rxcui_scd, rxcui_ingredient, ndc_status, rxnorm_release.\"\"\"\n    results = []\n    for ndc in claims_df[\"ndc\"].unique():\n        lookup = ndc_to_rxcui(str(ndc))\n        if lookup is None:\n            results.append({\"ndc\": ndc, \"rxcui_scd\": None, \"rxcui_ingredient\": None,\n                             \"ndc_status\": \"UNMAPPED\"})\n            continue\n        rxcui_scd = lookup[\"rxcui\"]\n        ing = rxcui_to_ingredient(rxcui_scd) if rxcui_scd else None\n        results.append({\"ndc\": ndc, \"rxcui_scd\": rxcui_scd, 
\"rxcui_ingredient\": ing,\n                         \"ndc_status\": lookup[\"status\"]})\n    mapping = pd.DataFrame(results)\n    mapping[\"rxnorm_release\"] = RXNORM_RELEASE\n    merged = claims_df.merge(mapping, on=\"ndc\", how=\"left\")\n    # Audit: report unmapped fraction\n    n_total = len(merged)\n    n_unmapped = merged[\"rxcui_scd\"].isna().sum()\n    print(f\"NDC mapping: {n_total - n_unmapped}/{n_total} mapped \"\n          f\"({100*(1 - n_unmapped/n_total):.1f}%); \"\n          f\"{n_unmapped} unmapped ({100*n_unmapped/n_total:.1f}%)\")\n    return merged\n\n\n# ── PATTERN B: Offline RRF join (bulk, reproducible) ─────────────────────────\n\ndef build_ndc_to_ingredient_from_rrf(\n    rxnconso_path: str,   # path to RXNCONSO.RRF from the NLM release zip\n    rxnsat_path: str,     # path to RXNSAT.RRF\n) -> pd.DataFrame:\n    \"\"\"Build a local NDC -> ingredient mapping table from the downloaded NLM RRF files.\n    Returns a DataFrame with columns: ndc, rxcui_scd, scd_name, rxcui_ingredient, ing_name.\"\"\"\n\n    # RXNCONSO columns (subset needed)\n    conso_cols = [\"RXCUI\", \"LAT\", \"TS\", \"LUI\", \"STT\", \"SUI\", \"ISPREF\",\n                  \"RXAUI\", \"SAUI\", \"SCUI\", \"SDUI\", \"SAB\", \"TTY\", \"CODE\",\n                  \"STR\", \"SRL\", \"SUPPRESS\", \"CVF\"]\n    # Load only the RxNorm source atoms, non-suppressible\n    conso = pd.read_csv(rxnconso_path, sep=\"|\", header=None, names=conso_cols,\n                        usecols=[\"RXCUI\", \"SAB\", \"TTY\", \"STR\", \"SUPPRESS\"])\n    rxnorm_atoms = conso[(conso[\"SAB\"] == \"RXNORM\") & (conso[\"SUPPRESS\"] != \"Y\")].copy()\n\n    # SCD and IN atoms\n    scd_atoms = rxnorm_atoms[rxnorm_atoms[\"TTY\"] == \"SCD\"][[\"RXCUI\", \"STR\"]].rename(\n        columns={\"RXCUI\": \"rxcui_scd\", \"STR\": \"scd_name\"})\n    in_atoms = rxnorm_atoms[rxnorm_atoms[\"TTY\"] == \"IN\"][[\"RXCUI\", \"STR\"]].rename(\n        columns={\"RXCUI\": \"rxcui_ingredient\", \"STR\": \"ing_name\"})\n\n   
 # RXNSAT: NDC attribute rows (ATN='NDC') link an RxCUI to its NDC strings.\n    # RRF lines end with a trailing '|', so index_col=False keeps pandas from\n    # treating the first field as an index; dtype=str preserves leading zeros in NDCs.\n    sat_cols = [\"RXCUI\", \"LUI\", \"SUI\", \"RXAUI\", \"STYPE\", \"CODE\", \"ATUI\",\n                \"SATUI\", \"ATN\", \"SAB\", \"ATV\", \"SUPPRESS\", \"CVF\"]\n    sat = pd.read_csv(rxnsat_path, sep=\"|\", header=None, names=sat_cols,\n                      usecols=[\"RXCUI\", \"ATN\", \"ATV\", \"SUPPRESS\"],\n                      index_col=False, dtype=str)\n    ndc_sat = sat[(sat[\"ATN\"] == \"NDC\") & (sat[\"SUPPRESS\"] != \"Y\")][[\"RXCUI\", \"ATV\"]].rename(\n        columns={\"RXCUI\": \"rxcui_scd\", \"ATV\": \"ndc\"})\n\n    # SCD -> ingredient is not resolved in this minimal version. The has_ingredient\n    # links live in RXNREL.RRF, which a full implementation would load in the same\n    # way; alternatively, the caller can pass each rxcui_scd to rxcui_to_ingredient().\n    # Only the NDC -> SCD join is returned here; NDCs attached to non-SCD concepts\n    # (for example branded drugs) get a null scd_name.\n    result = ndc_sat.merge(scd_atoms.astype({\"rxcui_scd\": str}), on=\"rxcui_scd\", how=\"left\")\n    print(f\"Offline RRF: {len(result)} NDC->SCD rows loaded from {rxnsat_path}\")\n    return result",
        "description": "Two patterns: (A) real-time RxNav REST API — for each NDC, retrieve its RxCUI and walk up to\ningredient level; suitable for small to medium volumes with caching. (B) offline RRF join —\nload RXNCONSO and RXNSAT from the NLM monthly release files and join locally; preferred for\nlarge claims datasets and reproducible, version-pinned analyses. Both patterns are shown as\nminimal, commented snippets illustrating the key API calls and join logic an analyst would\nbuild into a larger pipeline.",
        "dependencies": [
          "requests",
          "pandas"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(httr2)\nlibrary(dplyr)\nlibrary(readr)\nlibrary(memoise)\n\nRXNAV_BASE <- \"https://rxnav.nlm.nih.gov/REST\"\nRXNORM_RELEASE <- \"2024-01-03\"   # pin the release date\n\n\n# ── PATTERN A: RxNav REST API with memoisation ───────────────────────────────\n\n.ndc_to_rxcui_raw <- function(ndc) {\n  ndc <- gsub(\"-\", \"\", ndc)\n  # 1. Current lookup\n  resp <- request(paste0(RXNAV_BASE, \"/ndcstatus.json\")) |>\n    req_url_query(ndc = ndc) |>\n    req_timeout(10) |>\n    req_error(is_error = \\(r) FALSE) |>\n    req_perform()\n  if (resp_status(resp) == 200) {\n    data <- resp_body_json(resp)$ndcStatus\n    if (!is.null(data$rxcui) && nchar(data$rxcui) > 0 &&\n        identical(data$status, \"ACTIVE\")) {\n      return(list(rxcui = data$rxcui, status = \"ACTIVE\"))\n    }\n  }\n  # 2. Historical lookup for retired NDCs\n  resp2 <- request(paste0(RXNAV_BASE, \"/historyndc.json\")) |>\n    req_url_query(ndc = ndc) |>\n    req_timeout(10) |>\n    req_error(is_error = \\(r) FALSE) |>\n    req_perform()\n  if (resp_status(resp2) == 200) {\n    hist <- resp_body_json(resp2)$historicalNdcConcept$historicalNdcTime\n    if (length(hist) > 0) {\n      latest <- hist[[length(hist)]]\n      return(list(rxcui = latest$rxcui, status = \"HISTORICAL\"))\n    }\n  }\n  list(rxcui = NA_character_, status = \"UNMAPPED\")\n}\nndc_to_rxcui <- memoise(.ndc_to_rxcui_raw)  # cache within session\n\nrxcui_to_ingredient <- memoise(function(rxcui) {\n  # Walk from any RxCUI to its ingredient (IN) via the related endpoint\n  resp <- request(paste0(RXNAV_BASE, \"/rxcui/\", rxcui, \"/related.json\")) |>\n    req_url_query(tty = \"IN\") |>\n    req_timeout(10) |>\n    req_error(is_error = \\(r) FALSE) |>\n    req_perform()\n  if (resp_status(resp) != 200) return(NA_character_)\n  groups <- resp_body_json(resp)$relatedGroup$conceptGroup\n  for (g in groups) {\n    if (identical(g$tty, \"IN\") && length(g$conceptProperties) > 0)\n      
return(g$conceptProperties[[1]]$rxcui)\n  }\n  NA_character_\n})\n\nmap_ndc_vector <- function(ndcs) {\n  # Map a character vector of NDCs; returns a tibble with ndc, rxcui_scd, rxcui_ingredient\n  purrr::map_dfr(unique(ndcs), function(ndc) {\n    lookup   <- ndc_to_rxcui(ndc)\n    ing_rxcui <- if (!is.na(lookup$rxcui)) rxcui_to_ingredient(lookup$rxcui) else NA_character_\n    tibble::tibble(ndc = ndc, rxcui_scd = lookup$rxcui, rxcui_ingredient = ing_rxcui,\n                   ndc_status = lookup$status, rxnorm_release = RXNORM_RELEASE)\n  })\n}\n\n\n# ── PATTERN B: Offline RRF join (bulk) ───────────────────────────────────────\n\nbuild_ndc_ingredient_map_rrf <- function(rxnconso_path, rxnsat_path) {\n  # Load RXNCONSO - keep only non-suppressible RxNorm atoms at IN and SCD level\n  conso <- read_delim(rxnconso_path, delim = \"|\", col_names = FALSE,\n                      col_types = cols(.default = \"c\")) |>\n    select(RXCUI = 1, SAB = 12, TTY = 13, STR = 15, SUPPRESS = 17) |>\n    filter(SAB == \"RXNORM\", SUPPRESS != \"Y\")\n\n  scd_atoms <- conso |> filter(TTY == \"SCD\") |>\n    select(rxcui_scd = RXCUI, scd_name = STR)\n  in_atoms  <- conso |> filter(TTY == \"IN\")  |>\n    select(rxcui_ingredient = RXCUI, ing_name = STR)\n\n  # Load RXNSAT - NDC attributes\n  sat <- read_delim(rxnsat_path, delim = \"|\", col_names = FALSE,\n                    col_types = cols(.default = \"c\")) |>\n    select(RXCUI = 1, ATN = 9, ATV = 11, SUPPRESS = 12) |>\n    filter(ATN == \"NDC\", SUPPRESS != \"Y\") |>\n    rename(rxcui_scd = RXCUI, ndc = ATV)\n\n  # Join NDC -> SCD\n  result <- sat |>\n    left_join(scd_atoms, by = \"rxcui_scd\")\n\n  message(sprintf(\"Offline RRF: %d NDC->SCD rows loaded.\", nrow(result)))\n  result\n}",
        "description": "Two patterns matching the Python version: (A) the RxNav REST API via httr2 with memoisation,\nreturning RxCUI and ingredient rollup for a vector of NDCs; (B) a dplyr-based offline join\nagainst the downloaded RRF files. Both are minimal, commented snippets. The R patterns are\nparticularly common in OHDSI/HADES pipelines where drug exposure work is done in R against a\nCDM-connected database.",
        "dependencies": [
          "httr2",
          "dplyr",
          "readr",
          "memoise"
        ],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  NDC[\"NDC (package-level)\\ne.g. 00071-0155-23\\nLipitor 20mg 90ct\"]\n  SBD[\"SBD (Semantic Branded Drug)\\nLipitor 20 mg oral tablet\\nRxCUI 617310\"]\n  SCD[\"SCD (Semantic Clinical Drug)\\natorvastatin 20 mg oral tablet\\nRxCUI 308136\"]\n  SCDC[\"SCDC (Clinical Drug Component)\\natorvastatin 20 mg\\nRxCUI 1157996\"]\n  IN[\"IN (Ingredient) ← rollup target\\natorvastatin\\nRxCUI 83367\"]\n  BN[\"BN (Brand Name)\\nLipitor\\nRxCUI 41126\"]\n  DF[\"DF (Dose Form)\\noral tablet\\nRxCUI 317541\"]\n  NDC -->|\"Maps to (via RXNSAT)\"| SBD\n  NDC -->|\"Generic NDCs map directly\"| SCD\n  SBD -->|\"tradename_of\"| SCD\n  SCD -->|\"consists_of\"| SCDC\n  SCD -->|\"has_dose_form\"| DF\n  SCDC -->|\"has_ingredient\"| IN\n  SBD -->|\"has_brand_name\"| BN\n  IN -->|\"← RWE exposure def uses this level\"| IN",
        "caption": "The RxNorm concept hierarchy for atorvastatin products. NDC codes at the package level map through branded (SBD) or directly to generic clinical drug (SCD) concepts, which roll up to the ingredient (IN) via has_ingredient. The IN level is the standard rollup target for RWE drug exposure definitions. RxClass links the IN level outward to ATC drug classes (C10AA for statins) and VA drug classes.\n",
        "alt_text": "Flowchart showing NDC at the top mapping to SBD (Lipitor 20mg oral tablet) and SCD (atorvastatin 20mg oral tablet), with SCD connecting to SCDC (atorvastatin 20mg component), DF (oral tablet), and IN (atorvastatin ingredient). BN (Lipitor brand name) branches off SBD. Arrow annotations show relationship names. The IN level is labelled as the RWE rollup target.\n",
        "source_type": "illustrative",
        "source_citations": [
          "nelson-2011"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "normalizes",
        "target_slug": "ndc-national-drug-code",
        "notes": "RxNorm is the normalization layer that maps package-level NDC codes to shared ingredient and clinical-drug concepts, allowing thousands of NDCs for the same drug to be treated as a single entity in RWE exposure definitions.\n"
      },
      {
        "relation_type": "used_with",
        "target_slug": "omop-drug-exposure-drug-era-rwe",
        "notes": "RxNorm is the OMOP standard vocabulary for the Drug domain; DRUG_EXPOSURE records are mapped to RxNorm standard concepts at the clinical-drug (SCD/SBD) level via ETL, and DRUG_ERA rolls those records up to ingredient via CONCEPT_ANCESTOR — a path that only works because of the RxNorm hierarchy.\n"
      },
      {
        "relation_type": "used_with",
        "target_slug": "drug-utilization",
        "notes": "Drug utilization studies that count treated patients or compute DDD-based volumes by therapeutic class depend on RxNorm ingredient and RxClass drug-class queries to produce clean, non-redundant drug concept lists from NDC-level claims.\n"
      },
      {
        "relation_type": "used_with",
        "target_slug": "treatment-patterns-lines-of-therapy",
        "notes": "Treatment-pattern analyses require identifying all fills of a drug class across lines of therapy; RxNorm ingredient rollup and RxClass class membership queries produce the concept sets from which line-of-therapy episode logic is applied.\n"
      },
      {
        "relation_type": "see_also",
        "target_slug": "hcpcs-level-ii-j-codes",
        "notes": "HCPCS Level II J-codes cover physician-administered drugs (biologics, infusibles) that do not appear in pharmacy claims and are not in RxNorm; any drug-exposure definition for specialty infusibles must union RxNorm/NDC-based pharmacy fills with J-code-based medical claims.\n"
      },
      {
        "relation_type": "see_also",
        "target_slug": "medical-code-crosswalks-mappings",
        "notes": "RxNorm-to-NDC and RxNorm-to-ATC crosswalks are specific instances of the general medical code crosswalk problem; the challenges of versioning, retired codes, and unmapped fractions that arise in RxNorm mapping generalize to all medical code crosswalk work.\n"
      },
      {
        "relation_type": "see_also",
        "target_slug": "omop-standardized-vocabularies",
        "notes": "RxNorm is one of the OMOP standardized vocabularies; understanding the OMOP vocabulary framework (standard vs non-standard concepts, CONCEPT_RELATIONSHIP, CONCEPT_ANCESTOR) is prerequisite for CDM-native RxNorm usage.\n"
      },
      {
        "relation_type": "see_also",
        "target_slug": "snomed-ct-terminology",
        "notes": "SNOMED CT covers clinical findings, procedures, and body structures; RxNorm covers drugs. The two systems are linked in UMLS but serve different domains. Some EHR platforms use SNOMED product codes for drugs in the UK; US pharmacy claims use NDC and RxNorm.\n"
      }
    ],
    "aliases": [
      "RxNorm",
      "RxCUI",
      "RxNav"
    ],
    "complexity": "foundational",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "safety-signal-case-definition-rwe",
    "name": "Safety Signal Case Definition",
    "short_definition": "An operational, pre-specified rule that identifies suspected adverse-event cases in routinely-collected data for active safety surveillance, combining diagnosis, procedure, laboratory, medication, timing, and care-setting evidence with a stated incident-event window and a validation/PPV plan.",
    "long_description": "A **safety signal case definition** is the executable specification that turns a clinical adverse-event concept (e.g.,\nacute liver injury, anaphylaxis, acute kidney injury, rhabdomyolysis) into a reproducible algorithm over claims, EHR,\nor registry data so that a signal can be *evaluated* — counted, rate-estimated, and compared across exposures. It is\nnot a single diagnosis code: it is the full rule set — which codes, in which diagnosis position, in which care setting,\nconfirmed by which labs/procedures, within which time window relative to exposure, after excluding which prior history —\nplus the validation plan (chart-confirmed positive predictive value, sensitivity) that quantifies how much the algorithm\nmisclassifies. In active surveillance systems (FDA Sentinel/ARIA, PRISM, vaccine-safety networks) the case definition is\nthe load-bearing artifact: every rate, sequential test, and signal decision inherits its measurement error.\n\n**Core conceptual distinction** — the discriminating feature of a *signal-evaluation* case definition versus a routine\noutcome algorithm built for a comparative-effectiveness study is the **sensitivity-vs-PPV operating point and the role of\ndownstream confirmation**. (1) *Signal evaluation favors sensitivity (broad capture) when human/chart adjudication\nfollows*: a screening-grade algorithm with modest PPV is acceptable because flagged cases are confirmed before any\nregulatory action, so the cost of a false positive is a chart review, not a wrong inference. A confirmation-grade\ndefinition (high PPV, inpatient-primary-position dx + lab confirmation) is used once a signal must be quantified for a\ndecision. (2) *Effectiveness/safety estimation favors PPV-first algorithms* because cases are taken at face value into a\nrate or hazard model and false positives bias the estimate. 
(3) The **estimand the definition feeds** matters: an\nincident (first-ever, new-onset) event with a clean lookback supports an incidence rate or a cause-specific/subdistribution\ncontrast; a prevalent or recurrent capture supports utilization but not incidence. The case definition therefore is not\nseparable from the analysis it serves — it must state incident-vs-prevalent, the etiologically relevant risk window after\nexposure, and the planned PPV so the downstream rate can be bias-corrected.\n\n**Pros, cons, and trade-offs**\n- **vs a single-code, single-position rule (e.g., \"any ICD code for the event\"):** A multi-component definition (dx\n  position + care setting + lab/procedure confirmation + medication corroboration + history exclusion) is far more\n  specific and defensible, raises PPV, and is auditable for regulators. Cost: it requires more programming, code-list\n  curation, and a validation substudy, and an over-specified rule can drop true cases (lower sensitivity), which matters\n  for screening. **Prefer the multi-component rule** for any consequential signal quantification; a broad single-code rule\n  only as a hypothesis-generating screen with adjudication behind it.\n- **vs a fully chart-adjudicated outcome (gold standard on every case):** Adjudication maximizes accuracy but does not\n  scale to a multi-site distributed surveillance network and is impossibly slow for sequential monitoring. A coded case\n  definition + a PPV substudy approximates the truth at population scale. **Prefer the coded definition + targeted\n  chart-review PPV** for surveillance; reserve full adjudication for confirmation of a strong signal.\n- **vs reusing a published validated algorithm verbatim:** Borrowing a validated definition (e.g., a Mini-Sentinel/Sentinel\n  algorithm) saves work and imports a PPV estimate. 
Cost: PPV is **not transportable** — it depends on outcome prevalence,\n  coding era (ICD-9 vs ICD-10), data source, and care setting, so an imported PPV can be wrong in your population.\n  **Prefer borrowing the code-list logic but re-estimating PPV locally** whenever the source population differs.\n\n**When to use** — active or scheduled safety surveillance and signal evaluation in claims/EHR/registry data (Sentinel/ARIA,\nPRISM, vaccine networks, manufacturer post-marketing commitments); any RWE study whose endpoint is a serious adverse event\nwhere coded data alone misclassify materially; settings that need a transparent, pre-registered, validation-backed outcome\nrule to satisfy FDA/EMA expectations for fit-for-purpose data and reliability. Use it whenever the count of events drives a\nregulatory or labeling decision and the algorithm's error must be quantified, not assumed away.\n\n**When NOT to use — and when it is actively misleading or dangerous**\n- **No validation/PPV is or will be available.** Reporting an adverse-event rate from an unvalidated algorithm as if it\n  were the true rate is the central trap of signal evaluation: differential misclassification by exposure (e.g., a drug\n  that triggers more lab testing, hence more incidentally coded events) creates a spurious signal or masks a real one.\n  Without a PPV (and ideally sensitivity), the rate is uninterpretable.\n- **The event is non-specifically coded.** Outcomes whose codes are dominated by rule-out/symptom coding (e.g., chest pain\n  for MI, transaminase elevation for \"liver injury\") have low PPV that no clever logic fully rescues; a broad definition\n  here manufactures cases. 
Require lab/procedure confirmation or do not quantify.\n- **Recurrent/prevalent capture passed off as incidence.** Counting a chronic or recurrent condition as if each code were a\n  new event inflates rates and breaks the incidence estimand; demand an event-free lookback and a recurrence rule.\n- **Imported PPV across a coding transition or a different data source.** Applying an ICD-9-era PPV to ICD-10 data, or a\n  commercial-claims PPV to a Medicare FFS elderly population, silently rebiases the corrected rate.\n- **Immortal-time or look-ahead leakage in the definition.** If the case window or a confirmation requirement uses\n  information that accrues only to survivors (e.g., requiring a 30-day follow-up lab), the rule conditions on the future\n  and biases the comparison.\n\n**Data-source operational depth**\n- **Claims (FFS vs Medicare Advantage):** The event is a constellation of medical claims — inpatient principal vs secondary\n  diagnosis position, place-of-service (ER vs inpatient vs office), procedure/revenue codes, and corroborating drug\n  dispensings. The dominant failure mode is **incomplete capture in Medicare Advantage and capitated/bundled arrangements**:\n  encounter data are inconsistently submitted, so MA-only person-time can miss events entirely — restrict to enrollees with\n  complete medical (Parts A/B) capture or treat MA spans as unobserved, never as event-free. Claims **adjudication lag and\n  reversals** mean recent person-time is artificially event-sparse; impose a data-maturity buffer. Diagnosis *position*\n  drives PPV (a principal inpatient code is far more specific than a same-day ER rule-out code). 
**Differential competing\n  risks by exposure in the elderly** (a drug used in sicker patients sees more death before the event) bias a naive\n  cumulative-incidence comparison — pre-specify cause-specific vs subdistribution.\n- **EHR:** Labs, vitals, problem lists, and orders enable confirmation-grade definitions (e.g., ALT/AST thresholds for liver\n  injury, troponin for MI) that claims cannot match — the major advantage. But capture is **encounter-driven and leaky**:\n  care outside the system is invisible, so a true event treated at an outside ED is missed, and structured fields are often\n  blank with the signal buried in notes (requires NLP). Site workflow heterogeneity makes a fixed lab threshold behave\n  differently across sites.\n- **Registry:** Best for adjudicated, clinically rich outcomes (e.g., cancer, MI registries) but typically incomplete for\n  full longitudinal capture and weak on exposure; check registration completeness and adjudication rules, and link to claims\n  for denominator person-time.\n- **Linked claims–EHR–vital records:** The ideal substrate — EHR labs to confirm + claims completeness + a death index to\n  handle the competing risk — but **linkage selection** (only the linkable subset) and order/fill/service-date discrepancies\n  must be reconciled before the case window is anchored, or the algorithm dates events incorrectly.\n\n**Worked claims example (acute liver injury signal, commercial + Medicare FFS).** Goal: a confirmation-grade incident-ALI\nalgorithm to quantify a hepatotoxicity signal for a newly approved drug. (1) *Continuous enrollment / observability:* require\ncontinuous medical + pharmacy enrollment with complete FFS capture for the 183-day lookback through follow-up; drop MA-only\nperson-time. 
(2) *Case logic:* an inpatient claim with ALI in the **principal diagnosis position** (`dx_position = 1`,\n`pos = '21'`) OR an ER claim with an ALI diagnosis **plus** a same-episode liver-function-test claim (corroboration), within\nthe etiologically relevant **risk window of 1–90 days after the index dispensing** (`days_supply`-anchored on-treatment time\n+ grace). (3) *Incident-event coding:* exclude anyone with any ALI diagnosis or chronic-liver-disease code in the 183-day\nlookback so the captured event is new-onset relative to that lookback; take the **first qualifying claim date** as the event date. (4) *Validation:* a\nrandom sample of 200 algorithm-positive patients undergoes chart review; 162 confirm → **PPV = 0.81 (95% Wilson CI 0.75–0.86)**.\n(5) *Bias-corrected rate:* if the crude algorithm-based rate is 12.0 / 1,000 person-years, the PPV-corrected true-positive\nrate is 12.0 × 0.81 ≈ **9.7 / 1,000 PY** (assuming negligible false-negative impact on the numerator; a full correction also\nuses sensitivity; see misclassification-bias-correction-rwe). (6) *Signal logic:* a screening run might instead drop the lab\nconfirmation to raise sensitivity, accepting PPV ≈ 0.55, because every flagged case is adjudicated before the signal is acted\non: the same clinical concept, with a different operating point chosen deliberately for the surveillance stage.\n\n**Interpreting the output**. The confirmatory ALI case definition (an inpatient principal-position diagnosis, or an ER\ndiagnosis corroborated by a same-episode liver-function-test claim) produces a chart-review-validated PPV = 0.81 (95% Wilson CI 0.75–0.86)\nfrom 200 randomly sampled algorithm-positive patients, of whom 162 were confirmed as true ALI cases. Applying\nthe PPV correction to the crude algorithm-based incidence rate of 12.0 per 1,000 person-years gives a\ncorrected rate of approximately 9.7 per 1,000 person-years.\n\nFormal interpretation: PPV = 0.81 means 19 of every 100 algorithm-positive patients are false positives,\nthat is, patients who met the coded criteria but whose charts did not confirm ALI to clinical adjudication\nstandards. About 19% of the crude rate of 12.0 per 1,000 PY is therefore attributable to false positives;\nequivalently, the crude rate overstates the true-positive rate by roughly 23% (12.0 / 9.72 ≈ 1.23).\nThe corrected rate of 9.7 per 1,000 PY is the preferred epidemiologic estimate for regulatory and HTA\naudiences, although it is a lower bound on the true incidence whenever sensitivity is below 1. The PPV alone\ndoes not characterize sensitivity: the algorithm may still miss a substantial\nfraction of true ALI cases that were never coded or tested, and quantifying those missed cases requires a\nseparately designed sensitivity substudy (gold-standard-linked).\n\nPractical interpretation: the choice of operating point is a pre-specified analytical decision, not a\npost-hoc optimization. A screening-stage run that relaxes the lab confirmation requirement raises\nsensitivity at the cost of PPV (here approximately 0.55), which is appropriate when every flagged case\nwill be manually adjudicated before any regulatory action is taken. Document both operating points in\nthe pharmacovigilance protocol, alongside the chart-review protocol and adjudicator credentials.",
    "primary_category": "Outcome_Measure",
    "tags": [
      "outcome_measure",
      "safety-signal",
      "case-definition",
      "adverse-event-algorithm",
      "pharmacovigilance",
      "active-surveillance",
      "positive-predictive-value",
      "outcome-validation"
    ],
    "applies_to_study_types": [
      "claims_analysis",
      "ehr_study",
      "registry_linkage",
      "target_trial_emulation",
      "comparative_effectiveness"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1371/journal.pmed.1001885",
        "url": "https://doi.org/10.1371/journal.pmed.1001885",
        "citation_text": "Benchimol EI, Smeeth L, Guttmann A, et al. The REporting of studies Conducted using Observational Routinely-collected health Data (RECORD) Statement. PLoS Medicine. 2015;12(10):e1001885.",
        "year": 2015,
        "authors_short": "Benchimol et al.",
        "notes": "Operational reporting backbone for case definitions in routinely-collected data; mandates transparent code lists, diagnosis-position/care-setting logic, and validation reporting that make a signal case definition reproducible."
      },
      {
        "role": "explain",
        "doi": "10.1136/bmj.h5527",
        "url": "https://doi.org/10.1136/bmj.h5527",
        "citation_text": "Bossuyt PM, Reitsma JB, Bruns DE, et al. STARD 2015: an updated list of essential items for reporting diagnostic accuracy studies. BMJ. 2015;351:h5527.",
        "year": 2015,
        "authors_short": "Bossuyt et al.",
        "notes": "Diagnostic-accuracy reporting standard underpinning the PPV/sensitivity/specificity validation that quantifies a case definition's misclassification and supports bias correction."
      },
      {
        "role": "demonstrate",
        "doi": "10.1002/pds.2312",
        "url": "https://doi.org/10.1002/pds.2312",
        "citation_text": "Andrade SE, Harrold LR, Tjia J, et al. A systematic review of validated methods for identifying cerebrovascular accident or transient ischemic attack using administrative data. Pharmacoepidemiology and Drug Safety. 2012;21(S1):100-128.",
        "year": 2012,
        "authors_short": "Andrade et al.",
        "notes": "Mini-Sentinel worked example showing how diagnosis-position and care-setting logic drive PPV for a specific safety outcome, and why a published PPV is not transportable across populations."
      },
      {
        "role": "use",
        "doi": "10.1002/pds.4297",
        "url": "https://doi.org/10.1002/pds.4297",
        "citation_text": "Berger ML, Sox H, Willke RJ, et al. Good practices for real-world data studies of treatment and/or comparative effectiveness: recommendations from the joint ISPOR-ISPE Special Task Force on real-world evidence in health care decision making. Pharmacoepidemiology and Drug Safety. 2017;26(9):1033-1039.",
        "year": 2017,
        "authors_short": "Berger et al.",
        "notes": "ISPOR-ISPE good-practice recommendations endorsing pre-specified, validated outcome algorithms with reported operating characteristics for decision-grade RWE."
      }
    ],
    "plain_language_summary": "A safety signal case definition is the exact written rule that decides which patients in a health database count as having experienced a specific harmful event, such as serious liver injury. Before researchers can count cases and calculate how often an event occurs, they must specify which medical codes, care settings, and lab results qualify — otherwise two analysts studying the same drug could get completely different counts. The tradeoff at the heart of every case definition is breadth: a broad rule catches most true cases but also mislabels some healthy people as cases (high sensitivity, lower precision), while a narrow rule is more trustworthy per case but may miss real ones (higher precision, lower sensitivity). Getting this choice right determines whether a safety signal is detected at all — or whether it turns out to be noise.",
    "key_terms": [
      {
        "term": "ICD diagnosis code",
        "definition": "A standardized alphanumeric label (e.g., K71.6) that a coder assigns to each medical claim to record what condition was treated or evaluated."
      },
      {
        "term": "diagnosis position",
        "definition": "The ranked slot a code occupies on a claim — the principal (first) position means the provider treated that condition as the main reason for the visit, while secondary positions are supporting or incidental findings."
      },
      {
        "term": "positive predictive value (PPV)",
        "definition": "Of all patients the algorithm labels as a case, the fraction who truly had the event — a PPV of 0.80 means 80 out of every 100 flagged patients actually had it."
      },
      {
        "term": "sensitivity",
        "definition": "Of all patients who truly had the event, the fraction the algorithm correctly identifies — a sensitivity of 0.70 means the algorithm catches 70 out of every 100 real cases."
      },
      {
        "term": "care setting",
        "definition": "Where the medical visit took place — for example, an inpatient hospital stay, an emergency room visit, or a routine outpatient office visit — which signals how serious the encounter was."
      },
      {
        "term": "algorithm-positive",
        "definition": "A patient who meets all the coded criteria in the case definition, meaning the database rule flagged them as a likely case before any human chart review."
      }
    ],
    "worked_example": {
      "scenario": "A team is studying whether a new cholesterol-lowering drug causes acute liver injury (ALI). They need to define what counts as an ALI case in a commercial insurance claims database before they can count events and compare rates between users and non-users. They draft two candidate definitions — a broad one and a narrow one — and then reason through which patients each definition captures and misses.",
      "dataset": {
        "caption": "Five candidate patients and the evidence available for each. The team must decide whether each counts as an ALI case under a broad vs. narrow definition.",
        "columns": [
          "patient_id",
          "event_description",
          "dx_code",
          "dx_position",
          "care_setting",
          "liver_lab_flag"
        ],
        "rows": [
          [
            "P-101",
            "Admitted to hospital, ALI listed as the main reason",
            "K71.6",
            "principal (1st)",
            "inpatient",
            "yes"
          ],
          [
            "P-102",
            "ER visit, ALI code listed, liver enzyme test ordered same day",
            "K71.6",
            "secondary (2nd)",
            "emergency room",
            "yes"
          ],
          [
            "P-103",
            "ER visit, ALI code listed as a rule-out, no lab ordered",
            "K71.6",
            "secondary (2nd)",
            "emergency room",
            "no"
          ],
          [
            "P-104",
            "Routine office visit, ALI code appears incidentally",
            "K71.6",
            "secondary (2nd)",
            "outpatient office",
            "no"
          ],
          [
            "P-105",
            "Admitted to hospital, ALI listed as a secondary complication",
            "K71.6",
            "secondary (2nd)",
            "inpatient",
            "no"
          ]
        ]
      },
      "steps": [
        "Define the broad rule: any ALI diagnosis code (K71.6) appearing anywhere on any claim in the risk window counts as a case, regardless of where on the form it appears or where care was received.",
        "Under the broad rule, all five patients (P-101 through P-105) are flagged as algorithm-positive cases.",
        "Now assess which of these are likely true ALI: P-101 is almost certainly real (hospital admission, principal diagnosis). P-102 is plausible (ER visit with a supporting lab test). P-103 is doubtful (ER rule-out with no lab confirmation). P-104 is very unlikely to be true ALI (incidental office code, no workup). P-105 is ambiguous (secondary complication during an unrelated admission).",
        "Define the narrow rule: ALI must appear as the principal (first-listed) diagnosis on an inpatient claim, OR as any position on an ER claim AND have a liver lab test on the same date.",
        "Under the narrow rule, only P-101 (inpatient, principal) and P-102 (ER plus lab) are flagged. P-103, P-104, and P-105 are excluded.",
        "The tradeoff: if 10 patients truly had drug-induced ALI during follow-up and 8 of them were hospitalized or had a confirmed ER visit (captured by the narrow rule), while 2 were only captured by the office code (captured only by the broad rule), then: broad rule sensitivity = 10/10 = 1.00 but PPV = 10/37 = 0.27 if 27 false-positive office/ER rule-out codes also exist; narrow rule sensitivity = 8/10 = 0.80 but PPV = 8/9 = 0.89 if only 1 of the 9 flagged narrow-rule cases is a false positive.",
        "The team records the sensitivity and PPV for each definition and chooses based on the study goal: early screening (favor broad, use chart review to confirm) vs. final rate estimation (favor narrow, accept missing a few real cases to keep the count trustworthy)."
      ],
      "result": "Narrow rule: flags P-101 and P-102 only. Estimated PPV 0.89 (8 true cases out of 9 flagged), estimated sensitivity 0.80 (captures 8 of 10 true cases). Broad rule: flags all five patients. Estimated PPV 0.27 (10 true out of 37 broad-rule positives in the full cohort), estimated sensitivity 1.00 (catches all 10 true cases). Implication: the narrow definition produces a more trustworthy case count for rate estimation; the broad definition is appropriate only if every flagged patient will be individually reviewed before any conclusion is drawn."
    },
    "prerequisites": [
      "sensitivity-specificity-rwe",
      "claims-outcome-algorithm-ppv-sensitivity-rwe",
      "outcome-algorithm-construction-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Screening-grade (high-sensitivity) definition with downstream adjudication",
        "description": "Broad capture (any-position diagnosis, no lab requirement) chosen to maximize sensitivity during hypothesis-generating surveillance, on the explicit assumption that every flagged case is chart-confirmed before any signal decision.",
        "edge_cases": [
          "Modest PPV (often 0.4-0.6) is acceptable only if adjudication capacity exists; without it, false positives create spurious signals.",
          "Symptom/rule-out coding (e.g., chest pain, transaminase elevation) can dominate flagged cases and overwhelm reviewers."
        ],
        "data_source_notes": "claims: any-position diagnosis across inpatient/ER/office; EHR: structured codes + NLP over notes to raise sensitivity before adjudication."
      },
      {
        "name": "Confirmation-grade (high-PPV) definition for signal quantification",
        "description": "Restrictive rule (inpatient principal-position diagnosis + lab/procedure confirmation + history exclusion) used once a signal must be rate-estimated or compared for a decision, accepting lower sensitivity for high PPV.",
        "edge_cases": [
          "Over-restriction lowers sensitivity and can mask a true elevated rate if false negatives are differential by exposure.",
          "Lab-confirmation windows must not condition on post-event survivor information (immortal-time/look-ahead leakage)."
        ],
        "data_source_notes": "claims: principal inpatient dx + corroborating LFT/procedure claim; EHR/linked: explicit lab thresholds (ALT/AST, troponin, creatinine) for true confirmation."
      },
      {
        "name": "Incident (first-ever) capture with event-free lookback",
        "description": "Requires no diagnosis of the event or its chronic precursor in a defined lookback so the captured event is new-onset, supporting an incidence-rate or cause-specific/subdistribution estimand rather than utilization.",
        "edge_cases": [
          "Lookback too short relative to disease natural history reclassifies prevalent/recurrent events as incident.",
          "Left truncation at enrollment start can make a prevalent case look incident; require enrollment to predate the lookback."
        ],
        "data_source_notes": "claims: continuous-enrollment-anchored lookback (commonly 183-365 days) with chronic-condition code exclusions; registry: use registration date and adjudicated onset."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Single-code, single-position rule",
        "pros_of_this": "Multi-component logic (dx position + care setting + lab/procedure confirmation + history exclusion) raises PPV, is auditable for regulators, and ties the captured event to an explicit incident estimand.",
        "cons_of_this": "More programming and code-list curation, requires a validation substudy, and over-specification can lower sensitivity (drop true cases) for screening.",
        "when_to_prefer": "Any signal quantification feeding a regulatory or labeling decision; reserve single-code rules for hypothesis-generating screens backed by adjudication."
      },
      {
        "compared_to": "Full chart adjudication of every case",
        "pros_of_this": "Scales to multi-site distributed networks and sequential monitoring; a coded definition plus a PPV substudy approximates the truth at population scale.",
        "cons_of_this": "Coded capture is imperfect (PPV<1, sensitivity<1) and inherits all administrative-data measurement error.",
        "when_to_prefer": "Routine and sequential surveillance; reserve full adjudication for confirming a strong, decision-relevant signal."
      },
      {
        "compared_to": "Reusing a published validated algorithm verbatim (imported PPV)",
        "pros_of_this": "Saves curation effort and imports a documented code list and operating characteristics.",
        "cons_of_this": "PPV is not transportable across outcome prevalence, coding era (ICD-9/ICD-10), data source, and care setting; the imported PPV can rebias the corrected rate.",
        "when_to_prefer": "Borrow the code-list logic but re-estimate PPV locally whenever the source population or coding era differs from the original validation."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Build the case from medical claims with explicit diagnosis position and place-of-service; principal inpatient codes are far more specific than same-day ER rule-out codes. Require complete FFS medical capture across lookback and follow-up and treat Medicare Advantage-only person-time as unobserved (not event-free). Impose a data-maturity buffer for adjudication lag/reversals, an event-free lookback for incidence, and pre-specify the cause-specific vs subdistribution handling of competing death in elderly cohorts.",
      "ehr": "Use labs, vitals, orders, and problem lists for confirmation-grade thresholds (e.g., ALT/AST, troponin, creatinine); structured fields are often blank, so plan NLP over notes. Capture is encounter-driven and leaks for out-of-system care, so define observation windows and treat outside-system events as missing, not absent.",
      "registry": "Strong for adjudicated, clinically rich events but often incomplete longitudinally and weak on exposure; verify registration completeness and adjudication rules and link to claims for denominator person-time.",
      "linked": "Linked claims-EHR-vital-records is ideal (EHR labs confirm, claims complete the denominator, death index handles the competing risk) but introduces linkage selection and order/fill/service-date discrepancies that must be reconciled before the case window is anchored."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\nimport numpy as np\nfrom scipy import stats\n\nLOOKBACK_DAYS = 183       # event-free window that makes the captured event incident\nRISK_START, RISK_END = 1, 90   # etiologically relevant post-index risk window (days)\nALI_DX = {\"K71\", \"K72\", \"B190\", \"K762\", \"K729\"}  # ALI / hepatic-failure code prefixes (illustrative)\nCHRONIC_LIVER = {\"K70\", \"K74\", \"B18\", \"K766\"}    # chronic-liver-disease prefixes to exclude (prevalent)\n\ndef _has_prefix(s, prefixes):\n    return s.str.startswith(tuple(prefixes))\n\ndef build_ali_cases(dx, lab, cohort):\n    dx = dx.merge(cohort[[\"person_id\", \"index_date\", \"fup_end\"]], on=\"person_id\")\n\n    # 1) Exclude prevalent disease: any ALI or chronic-liver code in the event-free lookback.\n    look = dx[(dx[\"from_dt\"] < dx[\"index_date\"]) &\n              (dx[\"from_dt\"] >= dx[\"index_date\"] - pd.Timedelta(days=LOOKBACK_DAYS)) &\n              (_has_prefix(dx[\"icd_dx\"], ALI_DX | CHRONIC_LIVER))]\n    incident_ids = set(cohort[\"person_id\"]) - set(look[\"person_id\"])\n\n    # 2) In-window ALI claims: principal inpatient dx OR ER ALI dx corroborated by a same-day LFT claim.\n    w = dx[dx[\"person_id\"].isin(incident_ids)].copy()\n    w[\"day\"] = (w[\"from_dt\"] - w[\"index_date\"]).dt.days\n    w = w[(w[\"day\"] >= RISK_START) & (w[\"day\"] <= RISK_END) &\n          (w[\"from_dt\"] <= w[\"fup_end\"]) & _has_prefix(w[\"icd_dx\"], ALI_DX)]\n\n    inpatient_principal = w[(w[\"dx_position\"] == 1) & (w[\"pos\"] == \"21\")]\n    er = w[w[\"pos\"] == \"23\"]\n    er_corrob = er.merge(lab[lab[\"lab_flag\"] == 1][[\"person_id\", \"from_dt\"]],\n                         on=[\"person_id\", \"from_dt\"], how=\"inner\")\n    qualifying = pd.concat([inpatient_principal, er_corrob], ignore_index=True)\n\n    # 3) First-event coding: earliest qualifying claim is the incident event date.\n    cases = (qualifying.sort_values(\"from_dt\")\n                       
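.groupby(\"person_id\", as_index=False)\n                       .first()[[\"person_id\", \"from_dt\"]]\n                       .rename(columns={\"from_dt\": \"event_date\"}))\n    return cases, incident_ids   # ids are needed to build the matching at-risk denominator\n\ndef crude_rate(cases, cohort, incident_ids, per=1000):\n    # Person-time at risk must match the numerator: non-prevalent patients only,\n    # days RISK_START..RISK_END after index, censored at the event or at fup_end.\n    c = cohort[cohort[\"person_id\"].isin(incident_ids)].merge(cases, on=\"person_id\", how=\"left\")\n    start = c[\"index_date\"] + pd.Timedelta(days=RISK_START)\n    stop = pd.concat([c[\"index_date\"] + pd.Timedelta(days=RISK_END), c[\"fup_end\"], c[\"event_date\"]], axis=1).min(axis=1)\n    py = ((stop - start).dt.days + 1).clip(lower=0).sum() / 365.25\n    return per * len(cases) / py, py\n\ndef wilson_ci(k, n, alpha=0.05):\n    z = stats.norm.ppf(1 - alpha / 2)\n    p = k / n\n    denom = 1 + z**2 / n\n    center = (p + z**2 / (2 * n)) / denom\n    half = z * np.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom\n    return p, center - half, center + half\n\ncases, incident_ids = build_ali_cases(dx, lab, cohort)\nrate, py = crude_rate(cases, cohort, incident_ids)\nppv, lo, hi = wilson_ci(int(review[\"confirmed\"].sum()), len(review))\ncorrected_rate = rate * ppv    # numerator-only correction; full correction also needs sensitivity\nprint(f\"crude={rate:.2f}/1000PY  PPV={ppv:.2f} (95% CI {lo:.2f}-{hi:.2f})  corrected={corrected_rate:.2f}/1000PY\")",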
        "description": "Confirmation-grade incident acute-liver-injury (ALI) case definition from claims, with chart-review PPV and a\nPPV-corrected rate. Required inputs (cleaned, de-duplicated):\n  dx     : medical-claim diagnoses -> person_id, clm_id, from_dt (datetime), icd_dx, dx_position (1=principal), pos (place-of-service)\n  lab    : lab/procedure claims    -> person_id, from_dt, lab_flag (1 = liver-function-test claim)\n  cohort : exposed cohort          -> person_id, index_date (first dispensing), fup_end (censor: disenroll/death/data end)\n  review : PPV chart review        -> person_id, confirmed (1/0) for a random sample of algorithm-positive patients\nOutput: incident cases (first qualifying event in the 1-90d post-index window after an event-free lookback), the crude\nrate, the chart-review PPV with a Wilson CI, and the PPV-corrected rate.",
        "dependencies": [
          "pandas",
          "numpy",
          "scipy"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\nLOOKBACK_DAYS <- 183L\nRISK_START <- 1L; RISK_END <- 90L\nALI_DX        <- c(\"K71\",\"K72\",\"B190\",\"K762\",\"K729\")\nCHRONIC_LIVER <- c(\"K70\",\"K74\",\"B18\",\"K766\")\nhas_prefix <- function(x, pref) Reduce(`|`, lapply(pref, function(p) startsWith(x, p)))\n\nbuild_ali_cases <- function(dx, lab, cohort) {\n  setDT(dx); setDT(lab); setDT(cohort)\n  dx <- merge(dx, cohort[, .(person_id, index_date, fup_end)], by = \"person_id\")\n\n  # 1) Exclude prevalent disease in the event-free lookback.\n  look <- dx[from_dt < index_date & from_dt >= index_date - LOOKBACK_DAYS &\n             has_prefix(icd_dx, c(ALI_DX, CHRONIC_LIVER)), unique(person_id)]\n  incident_ids <- setdiff(cohort$person_id, look)\n\n  # 2) In-window ALI: principal inpatient dx OR ER ALI dx corroborated by a same-day LFT claim.\n  w <- dx[person_id %in% incident_ids]\n  w[, day := as.integer(from_dt - index_date)]\n  w <- w[day >= RISK_START & day <= RISK_END & from_dt <= fup_end & has_prefix(icd_dx, ALI_DX)]\n\n  inpatient_principal <- w[dx_position == 1L & pos == \"21\"]\n  er <- w[pos == \"23\"]\n  lftd <- lab[lab_flag == 1L, .(person_id, from_dt)]\n  er_corrob <- merge(er, lftd, by = c(\"person_id\", \"from_dt\"))\n  qualifying <- rbindlist(list(inpatient_principal, er_corrob), use.names = TRUE, fill = TRUE)\n\n  # 3) First-event coding.\n  setorder(qualifying, from_dt)\n  qualifying[, .(event_date = from_dt[1L]), by = person_id]\n}\n\nwilson_ci <- function(k, n, alpha = 0.05) {\n  z <- qnorm(1 - alpha / 2); p <- k / n; denom <- 1 + z^2 / n\n  center <- (p + z^2 / (2 * n)) / denom\n  half <- z * sqrt(p * (1 - p) / n + z^2 / (4 * n^2)) / denom\n  c(ppv = p, lo = center - half, hi = center + half)\n}\n\ncases <- build_ali_cases(dx, lab, cohort)\nsetDT(cohort)\npy <- sum(pmax(as.integer(cohort$fup_end - cohort$index_date), 0)) / 365.25\nrate <- 1000 * nrow(cases) / py\nw <- wilson_ci(sum(review$confirmed), nrow(review))\ncat(sprintf(\"crude=%.2f/1000PY  
PPV=%.2f (95%% CI %.2f-%.2f)  corrected=%.2f/1000PY\\n\",\n            rate, w[\"ppv\"], w[\"lo\"], w[\"hi\"], rate * w[\"ppv\"]))",
        "description": "Same confirmation-grade incident-ALI case definition with data.table, plus PPV (Wilson CI) and PPV-corrected rate.\nInputs mirror the Python version:\n  dx     : person_id, clm_id, from_dt (Date), icd_dx, dx_position (1=principal), pos (place-of-service char)\n  lab    : person_id, from_dt, lab_flag (1 = LFT claim)\n  cohort : person_id, index_date, fup_end\n  review : person_id, confirmed (1/0) for sampled algorithm-positive patients",
        "dependencies": [
          "data.table"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "%let lookback = 183; %let rstart = 1; %let rend = 90;\n\n/* 1) Incident restriction: drop anyone with ALI or chronic-liver codes in the event-free lookback. */\nproc sql;\n  create table prevalent as\n  select distinct d.person_id\n  from work.dx d inner join work.cohort c on d.person_id = c.person_id\n  where d.from_dt < c.index_date\n    and d.from_dt >= c.index_date - &lookback\n    and (d.icd_dx like 'K71%' or d.icd_dx like 'K72%' or d.icd_dx like 'B190%'\n      or d.icd_dx like 'K762%' or d.icd_dx like 'K729%'\n      or d.icd_dx like 'K70%' or d.icd_dx like 'K74%' or d.icd_dx like 'B18%' or d.icd_dx like 'K766%');\n\n  /* 2) In-window ALI claims: principal inpatient dx OR ER dx corroborated by a same-day LFT claim. */\n  create table inwin as\n  select d.person_id, d.from_dt, d.dx_position, d.pos\n  from work.dx d inner join work.cohort c on d.person_id = c.person_id\n  where d.person_id not in (select person_id from prevalent)\n    and d.from_dt between c.index_date + &rstart and min(c.index_date + &rend, c.fup_end)\n    and (d.icd_dx like 'K71%' or d.icd_dx like 'K72%' or d.icd_dx like 'B190%'\n      or d.icd_dx like 'K762%' or d.icd_dx like 'K729%');\n\n  create table qualifying as\n  select person_id, from_dt from inwin where dx_position = 1 and pos = '21'\n  union\n  select w.person_id, w.from_dt from inwin w\n    inner join (select person_id, from_dt from work.lab where lab_flag = 1) l\n    on w.person_id = l.person_id and w.from_dt = l.from_dt\n  where w.pos = '23';\n\n  /* 3) First-event coding: earliest qualifying claim is the incident event date. */\n  create table cases as\n  select person_id, min(from_dt) as event_date format=date9.\n  from qualifying group by person_id;\n\n  /* Person-time denominator (years) for the rate. */\n  create table pt as\n  select sum(max(intck('day', index_date, fup_end), 0)) / 365.25 as py\n  from work.cohort;\nquit;\n\n/* Chart-review PPV with an exact binomial 95% CI. 
*/\nproc freq data=work.review;\n  tables confirmed / binomial(level='1') alpha=0.05;\nrun;\n\n/* Crude incident rate per 1,000 PY via log-binomial / Poisson with log person-time offset. */\ndata rate_in;\n  set pt;\n  n_cases = 0;  /* replaced below; kept explicit for the offset model */\nrun;\nproc sql; select count(*) into :ncase from cases; quit;\ndata rate_in; set pt; n_cases = &ncase; log_py = log(py); run;\nproc genmod data=rate_in;\n  model n_cases = / dist=poisson link=log offset=log_py;\n  estimate 'rate per 1000 PY' intercept 1 / exp;   /* multiply exp(estimate) by 1000 for the rate */\nrun;",
        "description": "Confirmation-grade incident-ALI case definition (PROC SQL cohort construction), then PROC FREQ for the chart-review PPV\nwith an exact (binomial) CI and PROC GENMOD log-binomial for the crude rate per 1,000 person-years. Required inputs\n(post data-management):\n  work.dx     : person_id clm_id from_dt icd_dx dx_position (1=principal) pos (place-of-service char)\n  work.lab    : person_id from_dt lab_flag (1=LFT claim)\n  work.cohort : person_id index_date fup_end\n  work.review : person_id confirmed (1/0) for sampled algorithm-positive patients",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Src[Exposed cohort with complete FFS medical capture<br/>continuous enrollment across lookback + follow-up] --> Look{ALI or chronic-liver code<br/>in 183-day event-free lookback?}\n  Look -->|Yes| Excl[Exclude: prevalent / recurrent disease]\n  Look -->|No| Win[In risk window 1-90d post-index?]\n  Win -->|No| Drop[Not a case in this window]\n  Win -->|Yes| Logic{Case logic}\n  Logic -->|Inpatient principal dx<br/>dx_position=1, pos=21| Case[Incident case = first qualifying claim date]\n  Logic -->|ER ALI dx + same-day LFT claim| Case\n  Case --> PPV[Chart-review PPV on a random positive sample]\n  PPV --> Corr[PPV-corrected rate = crude rate x PPV<br/>full correction also uses sensitivity]\n  Corr --> Sig[Signal evaluation: sequential test / rate comparison]",
        "caption": "Decision logic for a confirmation-grade incident acute-liver-injury case definition, from observability and the event-free lookback through the dual case rule, first-event coding, chart-review PPV, and PPV-corrected rate that feeds signal evaluation.",
        "alt_text": "Flowchart from an observable exposed cohort through a lookback exclusion, risk-window check, inpatient-principal or ER-plus-lab case logic, first-event coding, chart-review PPV, and a PPV-corrected rate used for signal evaluation.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  A[Screening stage:<br/>high-sensitivity broad rule<br/>any-position dx, no lab] --> B[Flag candidate cases]\n  B --> C[Chart adjudication of every flagged case]\n  C --> D{Signal confirmed?}\n  D -->|No| E[Close: false-positive coding]\n  D -->|Yes| F[Confirmation stage:<br/>high-PPV rule<br/>inpatient principal dx + lab]\n  F --> G[Quantify: PPV-corrected rate + sequential test]\n  G --> H[Regulatory / labeling decision]",
        "caption": "The signal-evaluation operating-point shift that distinguishes this concept from a generic outcome algorithm: a high-sensitivity screen with mandatory adjudication generates hypotheses, then a high-PPV confirmation definition quantifies the signal for a decision.",
        "alt_text": "Two-stage flow showing a high-sensitivity screening rule with chart adjudication feeding a high-PPV confirmation definition that quantifies a signal for a regulatory decision.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "is_variant_of",
        "target_slug": "claims-outcome-algorithm-ppv-sensitivity-rwe",
        "notes": "A safety signal case definition is a claims/EHR outcome algorithm specialized for active surveillance, where the PPV/sensitivity operating point is chosen for the surveillance stage rather than for one-shot effect estimation."
      },
      {
        "relation_type": "see_also",
        "target_slug": "outcome-algorithm-construction-rwe",
        "notes": "Shares the code-list, diagnosis-position, and care-setting construction logic; this concept adds the signal-evaluation framing and validation-backed operating point."
      },
      {
        "relation_type": "used_with",
        "target_slug": "algorithm-validation",
        "notes": "A signal case definition is incomplete without a validation substudy that estimates PPV (and ideally sensitivity) in the target population; PPV is not transportable from published algorithms."
      },
      {
        "relation_type": "used_with",
        "target_slug": "misclassification-bias-correction-rwe",
        "notes": "The estimated PPV/sensitivity feed quantitative bias correction of the surveillance rate; numerator-only PPV correction is a first approximation."
      },
      {
        "relation_type": "used_with",
        "target_slug": "signal-detection",
        "notes": "The case definition produces the case counts that statistical signal-detection methods (sequential testing, disproportionality) operate on."
      },
      {
        "relation_type": "used_with",
        "target_slug": "competing-risks-cause-specific-fine-gray-rwe",
        "notes": "Serious adverse events in elderly cohorts compete with death; pre-specify cause-specific vs subdistribution handling when comparing cumulative incidence across exposures."
      },
      {
        "relation_type": "see_also",
        "target_slug": "negative-control-outcomes-rwe",
        "notes": "Negative-control outcomes help detect residual differential ascertainment/confounding in a surveillance signal."
      }
    ],
    "aliases": [
      "adverse-event case definition",
      "safety signal case definition",
      "surveillance outcome algorithm",
      "adverse-event ascertainment algorithm"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "ema",
      "pqa-cms"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "sample-size-power-precision-rwe",
    "name": "Sample Size, Power, and Precision in RWE",
    "short_definition": "The pre-specified quantitative justification of a real-world study's analytic feasibility, framed either as the power to reject a null effect or as the precision (confidence-interval half-width) around the target estimand, after accounting for the events, follow-up, confounder adjustment, weighting, and competing-risk attrition that real-world data impose.",
    "long_description": "**Sample size, power, and precision** is the protocol section that converts a target estimand into a feasibility statement: given the data source, the cohort that can actually be assembled, and the analysis that will be run, can the study either (a) reject a pre-specified null at acceptable type-I/type-II error rates (the *power* framing) or (b) estimate the effect with a confidence interval narrow enough to inform a decision (the *precision* framing)? In randomized trials this is a clean closed-form calculation. In real-world evidence it is dominated by features the trial never had: the analysis is *adjusted* (so variance inflates with confounder–exposure correlation), the comparison is often *weighted* (so the effective sample size is far below the nominal count), follow-up is *administratively censored and competing-risk attrited*, and the eligible denominator is *eroded* by continuous-enrollment and data-completeness requirements. A power calculation that ignores these is not conservative — it is wrong, and it is the single most common reason an \"adequately powered\" RWE protocol delivers an uninformative interval.\n\n**Core conceptual distinction**. Two framings answer two different questions and lead to different numbers. (1) *Power (Neyman–Pearson)*: fix α and a minimum clinically important effect, then solve for the n (or, for time-to-event, the **number of events**) that achieves 1−β power. This is hypothesis-testing logic — appropriate when the decision is a go/no-go on a pre-specified hypothesis (a safety signal, a non-inferiority margin). (2) *Precision (estimation)*: fix the acceptable confidence-interval half-width on the scale of the estimand (log HR, risk difference, NMB) and solve for the n that delivers it, independent of any null (Greenland 1988, Rothman). 
This is the framing regulators and HTA bodies increasingly prefer for RWE, because the deliverable is an effect *estimate* feeding a benefit–risk or cost-effectiveness decision, not a p-value. The estimand drives the math: for a hazard ratio the binding currency is **events**, not patients (Schoenfeld: required events ≈ (z_{1−α/2}+z_{1−β})² / (φ(1−φ)(log HR)²), where φ is the allocation fraction); for a risk difference or rate it is person-time; for an adjusted estimate it is **events per parameter** (EPP/EPV), the model-stability constraint (Concato 1995; Peduzzi 1996) that caps how many confounders you can carry before estimates become unstable regardless of total n.\n\n**Pros, cons, and trade-offs**.\n- **Power framing vs precision framing:** Power is familiar to reviewers and required for any null-hypothesis claim (non-inferiority, safety rule-out). Cost: it hinges on a guessed effect size and silently rewards underpowered studies that \"happen\" to be significant; a study powered for HR=0.70 says nothing useful if the truth is HR=0.90. **Prefer the precision framing** whenever the estimand feeds a decision-analytic model (NMB, ICER) or a benefit–risk assessment — report the expected CI half-width, not just power at one alternative.\n- **Events-driven (survival) vs n-driven (binary/continuous) planning:** For time-to-event outcomes the calculation must be in events; planning on total n and assuming everyone is followed to the outcome overstates power badly when the outcome is rare or follow-up is short. Cost: events-driven planning requires assumptions about baseline incidence, accrual, and dropout that claims pilot data can supply but registries often cannot. **Prefer events-driven** (`PROC POWER twosamplesurvival`, `powerSurvEpi`) for any survival endpoint.\n- **Crude vs design-corrected n:** A textbook two-sample formula gives the floor. 
The honest RWE number multiplies it by (i) the **variance inflation factor** 1/(1−R²) where R² is the multiple correlation of exposure on the confounders (Hsieh 1998) — adjustment for strong confounders can double the required n; (ii) the **IPTW/weighting design effect** 1 + CV²(weights), which can shrink effective sample size by 30–60% with heavy tails; (iii) **competing-risk attrition** of events in elderly claims cohorts. Cost: each correction needs a pilot extract. **Always design-correct** — the uncorrected number is the most dangerous artifact in an RWE protocol.\n\n**When to use**. Every comparative RWE protocol (effectiveness, safety, utilization) and every descriptive study whose estimate carries a decision must include a quantitative feasibility statement *before* data are pulled, calibrated to the actual analysis (adjusted, weighted, time-to-event) and the actual extractable cohort. Use the **precision** framing when the output is an effect estimate feeding HTA/NMB or benefit–risk; use the **power** framing for pre-specified hypothesis tests (non-inferiority, safety rule-out, formal subgroup tests). Use **EPP/EPV** as a hard feasibility gate the moment you decide how many confounders the propensity or outcome model will carry.\n\n**When NOT to use — and when it is actively misleading or dangerous**. A *crude* unadjusted power calculation pasted into an RWE protocol is worse than none: it manufactures false confidence and is routinely rejected on review. 
It is actively misleading when (1) the analysis is weighted but the calculation uses the nominal n — the realized CI will be far wider than promised because the design effect was ignored; (2) the endpoint is time-to-event with meaningful competing risks but planning assumed the cause-specific events accrue at the marginal rate — competing death in an elderly comparator arm steals the very events you powered on, and the theft is *differential* by exposure; (3) the calculation is used as a *post hoc* \"observed power\" rationalization of a null result (observed power is a deterministic transform of the p-value and tells you nothing — report the CI instead); (4) immortal-time or look-back requirements that erode the denominator are omitted, so the \"eligible n\" in the protocol never materializes after enrollment rules are applied. For any well-powered safety screen across hundreds of outcomes, single-comparison power is also misleading without an explicit multiplicity / false-discovery plan.\n\n**Data-source operational depth**.\n- **Claims (FFS vs Medicare Advantage):** The eligible denominator is not the enrolled population — it is the subset with continuous medical + pharmacy enrollment spanning the full washout and look-back, which can drop 40–70% of members and must be modeled in the feasibility calc, not discovered after the fact. **MA-only person-time lacks adjudicable FFS claims**, so events and exposures are undercounted; if the database mixes FFS and MA, power computed on the headcount is inflated — restrict the denominator to FFS-observable person-time before computing expected events. Differential **competing risks** (death, disenrollment) by exposure in elderly claims attrite cause-specific events asymmetrically; plan events with a competing-risk-adjusted cumulative incidence, not 1−KM. 
Claims pilot extracts are the right place to estimate comparator-arm incidence and the weight distribution that drives the IPTW design effect.\n- **EHR:** Capture is encounter-driven, so the *observed* event rate understates the true rate (silent outcomes outside the system) and follow-up is fragmented by leakage to other providers — both reduce realized events below the planning assumption. Use linked claims, where available, to anchor the incidence estimate; otherwise treat the EHR-based rate as a lower bound and inflate the planned n.\n- **Registry:** Often strong for adjudicated outcomes and severity (so the per-event information is high) but small and selectively enrolled, making the precision framing essential — a registry rarely powers a hypothesis test but can deliver a usefully narrow interval for a moderate effect. Account for registry incompleteness in the denominator.\n- **Linked claims–EHR–vital records:** The ideal substrate for accurate event and incidence estimates, but the **linkable subset** is a selected denominator; compute feasibility on the linked population, not the source frame, and reconcile date discrepancies before defining person-time.\n\n**Worked claims example.** Question: power/precision for incident heart failure with second-generation sulfonylurea vs DPP-4 inhibitor, IPTW-adjusted Cox, in a commercial + Medicare FFS database; minimum clinically important HR = 0.80, two-sided α=0.05, target 90% power. (1) *Denominator erosion*: start from 4.0M diabetes members; require ≥18y, ≥2 diabetes Dx, and 365 days continuous A/B/D (or commercial medical+pharmacy) enrollment before the first qualifying fill — this leaves ~620k; the new-user washout (no prior sulfonylurea/DPP-4 fill in 365d) and FFS-observable restriction (exclude MA-only person-time) leave ~180k initiators, roughly 45% sulfonylurea / 55% DPP-4. 
(2) *Events, not patients*: with a comparator-arm 3-year cumulative HF incidence of ~6% and a competing mortality of ~9% that differentially attrites events, expected cause-specific HF events ≈ 0.055 × 180k ≈ 9,900 over 3 years of administratively censored follow-up, whereas Schoenfeld requires only events ≈ (1.96+1.28)² / (0.5·0.5·(ln 0.80)²) ≈ 10.51 / (0.25·0.0498) ≈ 845 events for 90% power at HR=0.80 (φ = 0.5 is used as an approximation; the 45/55 split gives φ(1−φ) = 0.2475 and ≈ 853 events). (3) *Design correction*: multiply by the IPTW design effect 1 + CV²(weights); with a stabilized-weight CV ≈ 0.6 that is ×1.36 → ~1,150 events needed, and by the confounder variance-inflation 1/(1−R²) with R²≈0.15 → ×1.18 → ~1,360 events. The ~9,900 accrued events comfortably exceed 1,360, so the study is well powered for HR=0.80 and, more usefully, the **precision** statement is the deliverable: ~9,900 accrued events correspond to E_eff ≈ 9,900/1.36 ≈ 7,300 effective events after the IPTW design effect, so the expected 95% CI half-width on ln HR is z·√(1/(φ(1−φ)·E_eff)) ≈ 1.96·√(1/(0.25·7,300)) ≈ 0.046, i.e., an HR estimate of 0.80 would carry a 95% CI of roughly 0.76–0.84, narrow enough to inform a benefit–risk decision. (4) *If underpowered* (e.g., the comparator were rarely used and events fell below ~1,360): extend the follow-up window, broaden the comparator to a clinically interchangeable class, switch from a hypothesis test to an explicit precision target, or accept a wider but still decision-relevant CI rather than overstating power. Also confirm **EPP**: the IPTW Cox + any residual adjustment must carry no more parameters than ~events/10, i.e. ≈ 990 at ~9,900 accrued events (≈ 136 even at the 1,360-event minimum); this is never a binding constraint here, but it is decisive in a rare-outcome registry study.\n\n**Interpreting the output**\n\nConsider the claims example above: a Schoenfeld calculation for an IPTW-adjusted Cox model targeting HR = 0.80\nwith 90% power. After applying the IPTW design-effect multiplier (≈ ×1.36) and the confounder inflation factor\n(≈ ×1.18), the study requires approximately 1,360 events. The database yields ≈ 9,900 accrued events,\nfar exceeding that threshold. 
The primary deliverable is then the precision statement: with ≈ 9,900 events,\nan observed HR of 0.80 would carry a 95% CI of roughly 0.76–0.84.\n\n*(1) Formal statistical interpretation.* The power calculation is a pre-data design property: it describes the\nlong-run probability, across hypothetical replications, that the study would reject the null if the true HR\nwere exactly 0.80. Post-hoc \"observed power\" is a deterministic function of the p-value and adds no information\nbeyond the CI; report the CI, not observed power, for null results. The IPTW design effect inflates the required\nevent count because unequal weights concentrate information in a subset of observations, so the weighted sample\ncarries less information than its raw event count suggests; ignoring it produces an event count that is\nnominally adequate but statistically underpowered.\n\n*(2) Practical interpretation for a decision-maker.* With ≈ 9,900 events, the study can estimate the hazard\nratio with a 95% CI spanning roughly ±0.04 on the HR scale: around an estimate of 0.80, that is tight enough to\nrule out effects weaker than about HR 0.84 or stronger than about HR 0.76. That precision supports a\nbenefit–risk or formulary decision. The power target (90%) is a\npre-commitment about acceptable design risk, not a statement about the specific study's conclusion.",
    "primary_category": "Inferential_Statistics",
    "tags": [
      "power-analysis",
      "precision",
      "sample-size",
      "events-per-variable",
      "effective-sample-size",
      "design-effect",
      "pharmacoepidemiology",
      "feasibility"
    ],
    "applies_to_study_types": [
      "claims_analysis",
      "ehr_study",
      "registry_linkage",
      "target_trial_emulation",
      "comparative_effectiveness",
      "safety_surveillance"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1016/0895-4356(95)00510-2",
        "url": "https://doi.org/10.1016/0895-4356(95)00510-2",
        "citation_text": "Concato J, Peduzzi P, Holford TR, Feinstein AR. Importance of events per independent variable in proportional hazards analysis. I. Background, goals, and general strategy. Journal of Clinical Epidemiology. 1995;48(12):1495-1501.",
        "year": 1995,
        "authors_short": "Concato et al.",
        "notes": "Establishes events-per-variable as the binding feasibility constraint for adjusted survival models — the model-stability counterpart to a power calculation that dominates rare-outcome RWE."
      },
      {
        "role": "explain",
        "doi": "10.1016/s0895-4356(96)00236-3",
        "url": "https://doi.org/10.1016/s0895-4356(96)00236-3",
        "citation_text": "Peduzzi P, Concato J, Kemper E, Holford TR, Feinstein AR. A simulation study of the number of events per variable in logistic regression analysis. Journal of Clinical Epidemiology. 1996;49(12):1373-1379.",
        "year": 1996,
        "authors_short": "Peduzzi et al.",
        "notes": "The simulation that produced the canonical EPV >= 10 rule of thumb for stable, well-calibrated adjusted estimates."
      },
      {
        "role": "demonstrate",
        "doi": "10.1093/oxfordjournals.aje.a114945",
        "url": "https://doi.org/10.1093/oxfordjournals.aje.a114945",
        "citation_text": "Greenland S. On sample-size and power calculations for studies using confidence intervals. American Journal of Epidemiology. 1988;128(1):231-237.",
        "year": 1988,
        "authors_short": "Greenland",
        "notes": "Foundational argument for planning around interval precision rather than null-hypothesis power — the estimation framing now favored for decision-relevant RWE."
      },
      {
        "role": "use",
        "doi": "10.1177/0962280218784726",
        "url": "https://doi.org/10.1177/0962280218784726",
        "citation_text": "van Smeden M, Moons KGM, de Groot JAH, Collins GS, Altman DG, Eijkemans MJC, Reitsma JB. Sample size for binary logistic prediction models: Beyond events per variable criteria. Statistical Methods in Medical Research. 2019;28(8):2455-2474.",
        "year": 2019,
        "authors_short": "van Smeden et al.",
        "notes": "Shows the simple EPV rule is neither necessary nor sufficient and motivates effect- and shrinkage-aware sample-size targets — directly relevant to richly adjusted RWE models."
      }
    ],
    "plain_language_summary": "Before starting a real-world study, researchers need to show that the data they plan to use contain enough patients and enough disease events to answer their question reliably. Two ways exist to do this: the power approach asks whether the study can reliably detect a meaningful difference if one truly exists, and the precision approach asks how narrow the uncertainty range around the answer will be. In real-world data, these calculations are harder than in a clinical trial because statistical adjustments, the way patients are weighted, and patients leaving the study all shrink the effective information below the raw patient count.",
    "key_terms": [
      {
        "term": "power",
        "definition": "The chance that a study will correctly find a real difference as statistically significant, usually set at 80% or 90%."
      },
      {
        "term": "precision",
        "definition": "How narrow the confidence interval (the range of uncertainty) around an estimated effect is; more patients generally means a narrower, more precise interval."
      },
      {
        "term": "minimum detectable effect",
        "definition": "The smallest true difference between two treatments that a study of a given size can reliably detect."
      },
      {
        "term": "confidence interval half-width",
        "definition": "The distance from the point estimate to the upper (or lower) edge of the confidence interval; a half-width of 2 percentage points means the CI spans about 4 percentage points total."
      },
      {
        "term": "effective sample size",
        "definition": "The number of patients whose data actually carry statistical information after adjustments and weighting are applied; often substantially smaller than the raw patient count."
      }
    ],
    "worked_example": {
      "scenario": "A registry study compares two diabetes medications, sulfonylureas (SU) and DPP-4 inhibitors, on the risk of hospitalization for heart failure over three years. The source file uses comparator-arm (DPP-4) incidence of 6.0% and study-arm (SU) incidence of 4.8%, giving a true risk difference of 1.2 percentage points. Equal numbers of patients enter each arm. How wide will the confidence interval be at different total study sizes, and when does the study become precise enough to inform a decision?",
      "dataset": {
        "caption": "Planning table: total study size, arm size (half each), and the resulting 95% confidence interval half-width around the 1.2 pp risk difference.",
        "columns": [
          "total_N",
          "patients_per_arm",
          "CI_halfwidth_pp",
          "full_95pct_CI_pp"
        ],
        "rows": [
          [
            200,
            100,
            6.3,
            "1.2 plus or minus 6.3 (roughly -5.1 to +7.5)"
          ],
          [
            500,
            250,
            4.0,
            "1.2 plus or minus 4.0 (roughly -2.8 to +5.2)"
          ],
          [
            1000,
            500,
            2.8,
            "1.2 plus or minus 2.8 (roughly -1.6 to +4.0)"
          ],
          [
            2000,
            1000,
            2.0,
            "1.2 plus or minus 2.0 (roughly -0.8 to +3.2)"
          ]
        ]
      },
      "steps": [
        "The true risk difference we are trying to estimate is 6.0% minus 4.8% equals 1.2 percentage points.",
        "For each total N, split patients evenly: 100 per arm at N=200, 500 per arm at N=1000, and so on.",
        "The standard error (SE) of a risk difference is the square root of p1*(1-p1)/(arm size) plus p2*(1-p2)/(arm size). At N=1000 this is sqrt(0.048*0.952/500 + 0.060*0.940/500) = sqrt(0.0000913 + 0.0001128) = 0.0143.",
        "Multiply SE by 1.96 to get the 95% confidence interval half-width: 1.96 * 0.0143 = 0.028, or 2.8 percentage points.",
        "At N=200, the half-width is 6.3 pp, so the CI nearly straddles zero and the result is uninformative; at N=2000 the half-width drops to 2.0 pp, which may be narrow enough to influence a prescribing decision.",
        "In a real-world claims study (the source file scales this to 180k patients) the effective sample size is smaller than the headcount because weighting and adjustment consume information, so the honest CI is wider than this simple table shows."
      ],
      "result": "N=1000 gives a 95% CI of roughly -1.6 to +4.0 percentage points (half-width 2.8 pp). N=2000 tightens this to roughly -0.8 to +3.2 pp (half-width 2.0 pp). Whether 2 pp is narrow enough depends on what decision the estimate feeds; in many health technology assessments, a CI that wide still spans both clinically relevant and clinically unimportant differences, which is why the source file targets roughly 9,900 events in its large claims cohort to achieve a half-width of about 4.6% on the log-hazard-ratio scale."
    },
    "prerequisites": [
      "continuous-enrollment-observable-time-rwe",
      "propensity-score-methods-psm-iptw",
      "competing-risks-cause-specific-fine-gray-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Power-driven planning (hypothesis test)",
        "description": "Fix alpha and a minimum clinically important effect, solve for n (binary/continuous) or required events (time-to-event) at target power. Appropriate for non-inferiority margins, safety rule-out, and pre-specified subgroup tests.",
        "edge_cases": [
          "Time-to-event endpoints must be planned in events (Schoenfeld), not patients; planning on headcount overstates power when the outcome is rare or follow-up short.",
          "Observed/post-hoc power is a deterministic transform of the p-value and must never be reported to rationalize a null - report the confidence interval instead."
        ],
        "data_source_notes": "claims: estimate comparator-arm incidence and accrual from a pilot extract restricted to FFS-observable, continuously enrolled person-time."
      },
      {
        "name": "Precision-driven planning (CI half-width)",
        "description": "Fix the acceptable confidence-interval half-width on the estimand scale (log HR, risk difference, NMB) and solve for the n/events that deliver it, independent of any null (Greenland 1988).",
        "edge_cases": [
          "Half-width must be specified on the modeled scale (log HR), then back-transformed; specifying it on the HR scale near the null is misleading because the transform is nonlinear.",
          "Preferred deliverable when the estimate feeds a decision-analytic model rather than a go/no-go test."
        ],
        "data_source_notes": "registry: small selectively enrolled cohorts rarely power a test but can deliver a usefully narrow interval - report expected half-width."
      },
      {
        "name": "Events-per-parameter (EPP/EPV) feasibility for adjusted models",
        "description": "Cap the number of parameters in the propensity or outcome model at roughly events/10 (Concato 1995; Peduzzi 1996), as a model-stability gate independent of total n.",
        "edge_cases": [
          "High-dimensional propensity scores carry hundreds of proxies; EPV applies to the outcome model and to any directly adjusted PS terms, not to the hdPS variable-selection step.",
          "van Smeden (2019) shows EPV is neither necessary nor sufficient; treat 10 as a floor for diagnosis, not a guarantee."
        ],
        "data_source_notes": "ehr/registry: rare adjudicated outcomes make EPP the binding constraint long before total n is exhausted."
      },
      {
        "name": "Design-corrected effective sample size",
        "description": "Multiply the crude n/events by the IPTW/weighting design effect (1 + CV-squared of weights) and the confounder variance-inflation factor 1/(1-R-squared) to obtain the honest, realized precision.",
        "edge_cases": [
          "Heavy-tailed stabilized weights can shrink effective sample size 30-60%; weight truncation trades this against bias and shifts the target population.",
          "Differential competing risks attrite cause-specific events asymmetrically by exposure - plan events from competing-risk cumulative incidence, not 1-KM."
        ],
        "data_source_notes": "claims: estimate the weight CV and competing-mortality rate from a pilot extract before finalizing the planned events."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Crude unadjusted two-sample power calculation",
        "pros_of_this": "Reflects the analysis that will actually be run (adjusted, weighted, time-to-event) and the cohort that can actually be assembled after enrollment and look-back rules.",
        "cons_of_this": "Requires a pilot extract to estimate incidence, the weight distribution, and the exposure-confounder correlation; more work than a textbook formula.",
        "when_to_prefer": "Always for a real RWE protocol - the uncorrected number manufactures false confidence and is routinely rejected on regulatory/HTA review."
      },
      {
        "compared_to": "Power framing (fixed alternative, 1 minus beta)",
        "pros_of_this": "The precision framing reports the expected CI half-width and is robust to a mis-guessed effect size; it maps directly onto benefit-risk and NMB decisions.",
        "cons_of_this": "Less familiar to reviewers expecting a power statement; needs a defensible threshold for \"decision-relevant\" interval width.",
        "when_to_prefer": "When the estimand feeds a decision-analytic model or benefit-risk assessment rather than a pre-specified null-hypothesis test."
      },
      {
        "compared_to": "Patient-count (n-driven) planning for survival endpoints",
        "pros_of_this": "Events-driven planning (Schoenfeld) targets the quantity that actually carries the information in a hazard-ratio analysis.",
        "cons_of_this": "Requires assumptions about baseline incidence, accrual, and dropout that some registries cannot supply.",
        "when_to_prefer": "Any time-to-event endpoint, especially rare outcomes or short administrative follow-up."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Compute feasibility on FFS-observable, continuously enrolled person-time, not the headcount; exclude MA-only person-time where claims are unadjudicable. Estimate comparator-arm incidence, the IPTW weight CV, and competing-mortality from a pilot extract; plan events from competing-risk cumulative incidence.",
      "ehr": "Encounter-driven capture understates event rates and fragments follow-up; treat the EHR-based incidence as a lower bound and inflate planned n, or anchor incidence to linked claims.",
      "registry": "Often high per-event information but small and selectively enrolled; favor the precision framing and report expected CI half-width rather than power at a single alternative. Account for registry incompleteness in the denominator.",
      "linked": "Best substrate for accurate incidence and event counts, but compute feasibility on the linkable subset (a selected denominator) and reconcile order/fill/service date discrepancies before defining person-time."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\nfrom scipy.stats import norm\nfrom statsmodels.stats.power import NormalIndPower\nfrom statsmodels.stats.proportion import proportion_effectsize\n\ndef events_for_hr(hr, phi=0.5, alpha=0.05, power=0.90):\n    \"\"\"Schoenfeld required number of EVENTS (not patients) for a two-arm HR.\n    phi = fraction allocated to the treated arm.\"\"\"\n    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)\n    return z**2 / (phi * (1 - phi) * (np.log(hr)) ** 2)\n\ndef n_for_risk_diff(p_comparator, p_study, ratio=1.0, alpha=0.05, power=0.90):\n    \"\"\"Total n for a two-proportion comparison (binary outcome), arm-balanced by `ratio`.\"\"\"\n    h = proportion_effectsize(p_study, p_comparator)         # Cohen's h on arcsine scale\n    n1 = NormalIndPower().solve_power(effect_size=abs(h), alpha=alpha,\n                                      power=power, ratio=ratio, alternative=\"two-sided\")\n    return np.ceil(n1) * (1 + ratio)\n\ndef design_correct(n_or_events, weight_cv=0.0, conf_r2=0.0):\n    \"\"\"Inflate crude n/events for the IPTW design effect (1 + CV^2 of weights)\n    and the confounder variance-inflation factor 1/(1 - R^2) (Hsieh 1998).\"\"\"\n    deff = 1.0 + weight_cv**2\n    vif  = 1.0 / (1.0 - conf_r2)\n    return np.ceil(n_or_events * deff * vif)\n\ndef epp_max_params(n_events, epp=10):\n    \"\"\"Events-per-parameter feasibility cap for the adjusted (outcome) model.\"\"\"\n    return int(n_events // epp)\n\ndef ci_halfwidth_loghr(expected_events, phi=0.5, alpha=0.05):\n    \"\"\"Precision framing: expected 95% CI half-width on the log-HR scale given accrued events.\n    Var(logHR) ~ 1 / (phi*(1-phi)*E). 
Exponentiate point +/- this for the HR-scale interval.\"\"\"\n    z = norm.ppf(1 - alpha / 2)\n    return z * np.sqrt(1.0 / (phi * (1 - phi) * expected_events))\n\n# --- planning the sulfonylurea vs DPP-4 HF example ---\ncrude = events_for_hr(hr=0.80, phi=0.45, power=0.90)          # ~ required events at HR 0.80\nneeded = design_correct(crude, weight_cv=0.60, conf_r2=0.15)  # IPTW + confounding inflation\nprint(\"crude events:\", round(crude), \" design-corrected:\", needed)\nprint(\"max model params at EPV>=10:\", epp_max_params(9900))\neff_events = 9900 / (1 + 0.60**2)                             # accrued events deflated by the IPTW design effect\nprint(\"expected log-HR CI half-width:\", round(ci_halfwidth_loghr(eff_events, phi=0.45), 3))",
        "description": "RWE-calibrated power and precision planning (no cohort construction here - this consumes planning\nassumptions estimated from a pilot extract). Functions, in order:\n  1. events_for_hr        : Schoenfeld required EVENTS for a target hazard ratio (time-to-event).\n  2. n_for_risk_diff      : two-proportion sample size (binary outcome), via statsmodels.\n  3. design_correct       : inflate by IPTW design effect (1 + CV^2) and confounder VIF 1/(1-R^2).\n  4. epp_max_params       : events-per-parameter cap for the adjusted model.\n  5. ci_halfwidth_loghr   : precision framing - expected 95% CI half-width on the log-HR scale.\nInputs are scalars from the pilot extract: comparator incidence, allocation fraction phi, weight CV,\nexposure-on-confounders R^2, and the events the cohort is expected to accrue.",
        "dependencies": [
          "numpy",
          "scipy",
          "statsmodels"
        ],
        "source_citations": [
          "concato-1995",
          "greenland-1988"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(pwr)\nlibrary(epiR)\nlibrary(powerSurvEpi)\n\n## 1. Events-driven survival planning (Schoenfeld) for a target HR.\nevents_for_hr <- function(hr, phi = 0.5, alpha = 0.05, power = 0.90) {\n  z <- qnorm(1 - alpha / 2) + qnorm(power)\n  z^2 / (phi * (1 - phi) * (log(hr))^2)\n}\n\n## 2. Cohort-study sample size for a binary outcome / risk comparison.\nn_binary <- epiR::epi.sscohortc(\n  irexp1 = 0.048, irexp0 = 0.060,        # incidence in study vs comparator arm\n  n = NA, power = 0.90, r = 0.45 / 0.55,  # allocation ratio study:comparator\n  conf.level = 0.95\n)$n.total\n\n## 3. Design-correct crude n/events: IPTW design effect x confounder VIF.\ndesign_correct <- function(x, weight_cv = 0, conf_r2 = 0) {\n  ceiling(x * (1 + weight_cv^2) * (1 / (1 - conf_r2)))\n}\n\n## 4. Precision framing: expected 95% CI half-width on the log-HR scale.\nci_halfwidth_loghr <- function(events, phi = 0.5, alpha = 0.05) {\n  qnorm(1 - alpha / 2) * sqrt(1 / (phi * (1 - phi) * events))\n}\n\ncrude  <- events_for_hr(0.80, phi = 0.45, power = 0.90)\nneeded <- design_correct(crude, weight_cv = 0.60, conf_r2 = 0.15)\ncat(\"crude events:\", round(crude), \" design-corrected:\", needed, \"\\n\")\ncat(\"binary-outcome total n:\", round(n_binary), \"\\n\")\ncat(\"expected log-HR CI half-width:\", round(ci_halfwidth_loghr(9900, 0.45), 3), \"\\n\")\n\n## 5. (optional) powerSurvEpi for power given a fixed number of subjects + follow-up.\n## ssizeCT.default(power = 0.90, k = 0.45/0.55, pE = 0.048, pC = 0.060, RR = 0.80, alpha = 0.05)",
        "description": "RWE power/precision planning in R using pwr, epiR, and powerSurvEpi (estimation packages, not\nsurvival - we are planning, not fitting). Consumes pilot-derived assumptions:\n  p_comp  : comparator-arm cumulative incidence\n  hr      : minimum clinically important hazard ratio\n  weight_cv, conf_r2 : IPTW design-effect and confounder variance-inflation inputs.",
        "dependencies": [
          "pwr",
          "epiR",
          "powerSurvEpi"
        ],
        "source_citations": [
          "concato-1995",
          "greenland-1988"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* 1. Events/sample-size for a target hazard ratio with accrual and exponential dropout.\n      Plan in EVENTS: a rare outcome with short administrative follow-up needs far more\n      enrolled patients than the headcount suggests. */\nproc power;\n  twosamplesurvival test=logrank\n    hazardratio       = 0.80                 /* minimum clinically important HR */\n    refsurvival       = 0.94                 /* comparator-arm 3y survival (1 - 6% incidence) */\n    accrualtime       = 24\n    followuptime      = 12                    /* months */\n    groupweights      = (45 55)              /* allocation: study vs comparator */\n    power             = 0.90\n    alpha             = 0.05\n    ntotal            = .;                    /* solve for total n; EVENTSTOTAL also reported */\nrun;\n\n/* 2. Two-proportion (binary outcome) sample size - risk-difference planning. */\nproc power;\n  twosamplefreq test=pchi\n    proportiongroupa = 0.060                  /* comparator incidence */\n    proportiongroupb = 0.048                  /* study-arm incidence */\n    groupweights     = (45 55)\n    power            = 0.90\n    alpha            = 0.05\n    ntotal           = .;\nrun;\n\n/* 3. Power for the ADJUSTED model: GLMPOWER reads an exemplary (pilot) dataset whose\n      design + covariates mirror the planned confounder set, so the reported power already\n      embeds the confounder variance inflation that a crude two-sample calc ignores. */\nproc glmpower data=work.pilot;\n  class arm;\n  model hf_event = arm age_grp sex comorbidity_idx prior_util;  /* arm + planned covariates */\n  power\n    stddev   = 0.30\n    ntotal   = 180000\n    power    = .;\nrun;\n\n/* NOTE on weighting: PROC POWER has no IPTW design-effect option - inflate the returned\n   NTOTAL/EVENTSTOTAL by (1 + CV^2 of the stabilized weights) by hand using a pilot weight\n   distribution, then re-confirm events-per-parameter (events/10) for the outcome model. */",
        "description": "RWE power/precision planning in SAS using PROC POWER and PROC GLMPOWER (procedures that COMPUTE\nsample size / events, not estimation procs). Inputs are pilot-derived planning assumptions, not a\ncohort. twosamplesurvival handles accrual + dropout for the events-driven survival calc;\ntwosamplefreq handles the binary risk-difference calc; GLMPOWER handles power for an adjusted model.",
        "dependencies": [],
        "source_citations": [
          "concato-1995",
          "greenland-1988"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Est[Target estimand<br/>HR / risk diff / NMB] --> Frame{Decision type?}\n  Frame -->|go / no-go hypothesis test| Pow[POWER framing<br/>fix alpha + MCID -> solve n or EVENTS]\n  Frame -->|estimate feeds benefit-risk / NMB| Prec[PRECISION framing<br/>fix CI half-width -> solve n]\n  Pow --> Surv{Time-to-event?}\n  Surv -->|yes| Sch[Schoenfeld required EVENTS<br/>not patients]\n  Surv -->|no| NN[two-sample n<br/>binary / continuous]\n  Sch --> Corr[Design-correct]\n  NN --> Corr\n  Prec --> Corr\n  Corr --> DE[x IPTW design effect 1 + CV^2 weights]\n  DE --> VIF[x confounder VIF 1/（1-R^2）]\n  VIF --> EPP{EPP >= 10 for<br/>adjusted model?}\n  EPP -->|no| Reduce[Reduce parameters /<br/>broaden comparator /<br/>extend follow-up]\n  EPP -->|yes| Feas[Honest effective n / events<br/>+ expected CI half-width]",
        "caption": "Decision logic for RWE sample-size planning. The estimand and decision type select the power vs precision framing; time-to-event endpoints are planned in events; the crude number is then design-corrected for weighting and confounder adjustment and gated on events-per-parameter.",
        "alt_text": "Flowchart routing a target estimand into power or precision framing, into events-driven or n-driven planning, then through IPTW design-effect and confounder variance-inflation corrections and an events-per-parameter gate to an honest effective sample size.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  Frame[Source diabetes members<br/>4.0M] --> Enr[Continuous A/B/D enrollment<br/>+ 365d look-back<br/>~620k]\n  Enr --> NU[New-user washout +<br/>FFS-observable, exclude MA-only<br/>~180k initiators]\n  NU --> Ev[Cause-specific HF events<br/>over 3y, competing-risk adjusted<br/>~9,900]\n  Ev --> Need[Schoenfeld need at HR 0.80<br/>~845 -> design-corrected ~1,360]\n  Need --> Ok[Well powered<br/>+ expected HR 95% CI ~0.76-0.84]",
        "caption": "Denominator-erosion and events cascade for the worked claims example. Power is driven by accrued cause-specific EVENTS (~9,900) against the design-corrected requirement (~1,360), not by the 180k headcount; the deliverable is the expected precision around the HR.",
        "alt_text": "Left-to-right cascade from 4 million source members down through enrollment and washout restrictions to 180k initiators, 9,900 competing-risk-adjusted heart-failure events, and a design-corrected requirement of about 1,360 events yielding a well-powered study.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "used_with",
        "target_slug": "active-comparator-new-user",
        "notes": "The eligible denominator for the power calculation is exactly the new-user, active-comparator cohort after washout and continuous-enrollment restrictions; feasibility must be computed on that cohort, not the source frame."
      },
      {
        "relation_type": "used_with",
        "target_slug": "propensity-score-methods-psm-iptw",
        "notes": "IPTW/weighting reduces the effective sample size by the design effect (1 + CV-squared of weights); matching discards unmatched units. Both must be folded into the realized precision, not the nominal n."
      },
      {
        "relation_type": "requires",
        "target_slug": "competing-risks-cause-specific-fine-gray-rwe",
        "notes": "For time-to-event endpoints, planned events must come from competing-risk cumulative incidence, not 1-KM; differential competing mortality by exposure attrites cause-specific events asymmetrically."
      },
      {
        "relation_type": "used_with",
        "target_slug": "target-trial-emulation",
        "notes": "A target-trial protocol must pre-specify a quantitative feasibility/precision statement for the emulated comparison just as the trial it emulates would."
      },
      {
        "relation_type": "see_also",
        "target_slug": "high-dimensional-propensity-score-hdps-rwe",
        "notes": "hdPS variable selection is exempt from the events-per-parameter cap, but the outcome model and any directly adjusted terms are not - EPP still gates the final estimation step."
      },
      {
        "relation_type": "see_also",
        "target_slug": "empirical-calibration-negative-controls-rwe",
        "notes": "Multiplicity-heavy safety screens need a false-discovery/calibration plan; single-comparison power is misleading across hundreds of outcomes."
      }
    ],
    "aliases": [
      "statistical power in RWE",
      "precision-based sample size",
      "events per variable",
      "effective sample size",
      "power and sample size calculation"
    ],
    "complexity": "advanced",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "scenario-deterministic-sensitivity-analysis-hea-rwe",
    "name": "Scenario Analysis and Deterministic Sensitivity Analysis (DSA)",
    "short_definition": "Deterministic uncertainty analyses for health-economic models — one-way/multi-way sensitivity analysis (varying parameters across plausible ranges, summarized in a tornado diagram) and scenario analysis (re-running the model under alternative structural assumptions such as time horizon, discount rate, extrapolation family, or perspective).",
    "long_description": "**Scenario analysis and deterministic sensitivity analysis (DSA)** are the transparency workhorses of\nhealth-economic modeling. A **one-way DSA** changes a single input parameter from its base-case value to a\njustified low and high value while holding everything else fixed, records the resulting incremental\ncost-effectiveness ratio (ICER) or net monetary benefit (NMB), and repeats for each influential parameter;\nthe results are ranked by output swing and displayed as a **tornado diagram**. A **multi-way DSA** varies two\nor more parameters jointly (e.g., a two-way grid of drug price × treatment effect). A **scenario analysis**\nre-runs the entire model under a discrete alternative *assumption set* rather than a parameter range —\na shorter or lifetime **time horizon**, 0%/3%/5% **discount rates**, an alternative **survival-extrapolation\nfamily** (Weibull vs Gompertz), a different **perspective** (payer vs societal), an alternative comparator,\nor a different source for a key input. The two address different things: DSA probes **parameter influence**\n(which numeric inputs drive the result), while scenario analysis probes **structural and methodological\nuncertainty** (which modeling choices drive the result) — the layer that probabilistic sensitivity analysis\n(PSA) cannot reach, because PSA only propagates distributions *within* a fixed structure.\n\n**Core conceptual distinctions.** (1) *Influence vs decision uncertainty*: DSA and scenarios answer \"what\nmatters and does the decision flip?\"; PSA answers \"given joint parameter uncertainty, how probable is it that\nthe intervention is cost-effective?\". They are complements mandated together by ISPOR-SMDM good practice, not\nsubstitutes. 
(2) *Parameter vs structural uncertainty*: a parameter has a plausible numeric range (a utility,\na cost, a relative risk) and belongs in DSA/PSA; a structural choice (horizon, extrapolation family, cycle\nlength, competing-risk handling) has discrete alternatives and belongs in scenario analysis. Forcing structure\ninto a parameter range (or ignoring it because the PSA looks tight) understates uncertainty. (3) *Justified\nranges vs arbitrary ±20%*: DSA ranges should come from confidence intervals, published extremes, or pre-specified\nplausibility arguments; a uniform ±20% applied to everything is a convention, not evidence, and the tornado it\nproduces reflects the analyst's range choices as much as the data. (4) *Threshold crossing*: the decision-relevant\noutput of any deterministic analysis is whether the ICER crosses the willingness-to-pay threshold (or NMB crosses\nzero) — a threshold analysis solves explicitly for the parameter value at which the decision flips.\n\n**Pros, cons, and trade-offs** (named against the alternatives).\n- **vs probabilistic sensitivity analysis (PSA):** DSA/scenarios are transparent, cheap, and communicable —\n  a reviewer can re-derive every number — and they expose structural uncertainty PSA ignores. But one-way DSA\n  understates joint uncertainty (parameters move one at a time, correlations are ignored) and provides no\n  probability statement. **Run both**: DSA/scenarios for influence and structure, PSA for decision uncertainty;\n  HTA bodies (NICE, CADTH/CDA, ICER) expect all of them.\n- **vs tipping-point / threshold analysis:** a threshold analysis is the inverted one-way DSA — instead of\n  mapping a range to outputs, it solves for the input value where the decision changes. 
Prefer it when the\n  question is \"how wrong would this input have to be?\"; prefer standard DSA when summarizing many parameters.\n- **One-way vs multi-way DSA:** one-way is readable but ignores interactions; multi-way captures joint movement\n  of two or three chosen parameters but becomes uninterpretable beyond that — past two or three dimensions the\n  correct tool is PSA.\n- **Tornado diagram vs reporting all scenario rows:** the tornado is an efficient influence ranking but hides\n  asymmetry detail and invites range-gaming (wider stated range → longer bar → \"more important\"). A scenario\n  table reporting incremental costs, incremental effects, and the resulting ICER per row is the auditable form;\n  show both.\n\n**When to use.** In every cost-effectiveness, cost-utility, and budget-impact model: to identify which inputs\ndrive the result (model debugging and value-of-information triage), to demonstrate robustness or fragility of\nthe base-case decision to HTA reviewers, to quantify structural uncertainty (horizon, extrapolation,\nperspective, discounting) that PSA cannot, and to satisfy reporting standards (CHEERS 2022 requires\ncharacterizing uncertainty; ISPOR-SMDM task-force guidance requires both deterministic and probabilistic\nanalyses). Scenario analysis is also the vehicle for mandated reference-case variations (e.g., 0%/5% discount\nscenarios alongside a 3% or 3.5% base case).\n\n**When NOT to use — and when it is actively misleading or dangerous.**\n- **As a substitute for PSA.** A tornado diagram makes no probability statement; presenting deterministic ranges\n  as if they characterized decision uncertainty (\"the ICER ranged from X to Y, so we are confident...\") is the\n  classic misuse. Joint parameter uncertainty requires PSA.\n- **Cherry-picked scenarios.** Running many scenarios and reporting only the favorable ones — or defining\n  scenario assumptions *after* seeing results — converts an uncertainty analysis into advocacy. 
Pre-specify the\n  scenario set (horizon, discounting, extrapolation, perspective) before running the model.\n- **Arbitrary or asymmetric-by-convenience ranges.** ±20% on every parameter, or wide ranges only on parameters\n  that favor the intervention, produce a tornado that misranks influence. Ranges need a stated source (CI,\n  literature extremes, expert elicitation) reported next to the bar.\n- **One-way DSA on correlated parameters.** Varying a survival parameter while holding its correlated\n  counterpart fixed produces internally inconsistent model states (e.g., a shape parameter incompatible with the\n  fixed scale); correlated blocks should move together (multi-way or PSA with correlation).\n- **Hiding the decision flip.** Reporting only output ranges without flagging which scenarios cross the\n  willingness-to-pay threshold buries the only result that matters to the decision maker; every scenario table\n  should mark threshold crossings explicitly.\n\n**Data-source operational depth.** Scenario analysis is where RWE inputs earn their keep — and where their\nfragility shows. In **claims**-informed models, standard scenarios include alternative cost sources (paid vs\nallowed amounts), alternative comorbidity/risk-adjustment specifications, and restricting to FFS person-time\nversus pooling Medicare Advantage encounters (whose paid amounts are unreliable). In **EHR/registry**-informed\nmodels, scenarios swap the outcome algorithm (e.g., rwPFS definition variants) or the extrapolation family\nfitted to registry survival. In **survey/utility** inputs, scenarios swap the utility tariff or mapping\nalgorithm. 
Each scenario should change one named assumption, keep everything else at base case, and report the\nfull incremental-cost / incremental-effect / ICER row so reviewers can trace exactly what moved.\n\n**Interpreting the output**\n\nThe worked example shows a base-case ICER of $44,000/QALY (cost-effective at the $50,000 threshold). The\n10-year horizon scenario produces $60,000/QALY (which flips the decision), the Gompertz extrapolation produces\n$50,000/QALY (exactly at the threshold), and 0% discounting produces $40,000/QALY (robust).\n\n*(1) Formal interpretation.* A tornado diagram ranks each parameter by the width of the ICER swing it\nproduces when moved from its lower to upper plausible bound, holding everything else at the base case. The\nwidest bar identifies the single parameter whose uncertainty most drives the cost-effectiveness conclusion.\nThe scenario table plays the same role for structural choices. In this example, the time horizon (a structural\nassumption rather than a parameter) moves the ICER furthest from the base case and crosses the decision\nthreshold: at 10 years the ICER rises from $44,000 to $60,000, which is above the $50,000 threshold. Each\nscenario row should report the full set (incremental cost, incremental effect, ICER) so reviewers can verify\nthe arithmetic and identify whether cost, effect, or both changed.\n\n*(2) Practical interpretation.* When a scenario flips the cost-effectiveness verdict, as the 10-year horizon\ndoes here, it is a critical finding, not a minor sensitivity. The scenario analysis has identified that the\ndecision depends materially on whether the model follows patients over a lifetime or for only 10 years, a\nstructural choice that the PSA would not have tested (PSA samples from parameter distributions, not from\nalternative structural forms). This finding should be prominently disclosed, and the rationale for the\nbase-case horizon should be explicitly defended. Scenarios that keep the ICER well below the threshold (like\n0% discounting here) provide evidence of robustness.",
    "primary_category": "Economic_Evaluation",
    "tags": [
      "sensitivity-analysis",
      "scenario-analysis",
      "deterministic",
      "tornado-diagram",
      "one-way-dsa",
      "structural-uncertainty",
      "health-economic-modeling",
      "hta"
    ],
    "applies_to_study_types": [
      "cost_effectiveness",
      "cost_utility",
      "budget_impact"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.2165/00019053-200017050-00006",
        "url": "https://doi.org/10.2165/00019053-200017050-00006",
        "citation_text": "Briggs AH. Handling uncertainty in cost-effectiveness models. PharmacoEconomics. 2000;17(5):479-500.",
        "year": 2000,
        "authors_short": "Briggs",
        "notes": "Classic taxonomy of uncertainty in cost-effectiveness models — methodological, structural, and parameter — and the deterministic/probabilistic toolkit for each."
      },
      {
        "role": "explain",
        "doi": "10.1177/0272989X11409240",
        "url": "https://doi.org/10.1177/0272989X11409240",
        "citation_text": "Bilcke J, Beutels P, Brisson M, Jit M. Accounting for methodological, structural, and parameter uncertainty in decision-analytic models: a practical guide. Med Decis Making. 2011;31(4):675-692.",
        "year": 2011,
        "authors_short": "Bilcke et al.",
        "notes": "Practical guide separating methodological, structural, and parameter uncertainty — the framework that assigns scenario analysis to structural choices."
      },
      {
        "role": "explain",
        "doi": "10.1177/0272989X12458348",
        "url": "https://doi.org/10.1177/0272989X12458348",
        "citation_text": "Briggs AH, Weinstein MC, Fenwick EAL, Karnon J, Sculpher MJ, Paltiel AD. Model parameter estimation and uncertainty analysis: a report of the ISPOR-SMDM Modeling Good Research Practices Task Force Working Group-6. Med Decis Making. 2012;32(5):722-732.",
        "year": 2012,
        "authors_short": "Briggs et al.",
        "notes": "ISPOR-SMDM good-practice guidance requiring both deterministic and probabilistic uncertainty analysis and pre-specified scenario sets."
      },
      {
        "role": "demonstrate",
        "doi": "10.2165/11584630-000000000-00000",
        "url": "https://doi.org/10.2165/11584630-000000000-00000",
        "citation_text": "Jain R, Grabner M, Onukwugha E. Sensitivity analysis in cost-effectiveness studies: from guidelines to practice. PharmacoEconomics. 2011;29(4):297-314.",
        "year": 2011,
        "authors_short": "Jain et al.",
        "notes": "Review of how one-way, multi-way, and scenario analyses are actually reported in practice versus what guidelines require — including the arbitrary-range problem."
      }
    ],
    "plain_language_summary": "When a health-economic model says a treatment is good value, the obvious question is: would that conclusion survive if the inputs or assumptions were different? Deterministic sensitivity analysis answers this by changing one input at a time — say, the drug price or a quality-of-life weight — between a justified low and high value, and watching how much the answer moves; the results are usually drawn as a tornado diagram with the most influential inputs on top. Scenario analysis asks a bigger version of the same question: it re-runs the whole model under a different assumption, like a shorter time horizon, a different discount rate, or a different way of extrapolating survival, and reports the answer for each. Together they show which inputs and choices the conclusion really hinges on, and whether any reasonable alternative flips the decision. The honest caveats: moving one input at a time understates how uncertainty combines, the bars in a tornado are only as honest as the ranges chosen, and these analyses say nothing about probability — that job belongs to probabilistic sensitivity analysis.",
    "key_terms": [
      {
        "term": "one-way sensitivity analysis",
        "definition": "Re-running the model with a single input set to a justified low and then high value, holding all other inputs at base case, to see how much the result moves."
      },
      {
        "term": "tornado diagram",
        "definition": "A bar chart ranking inputs by how much each one swings the model result, widest bar on top, so the most influential inputs are visible at a glance."
      },
      {
        "term": "scenario analysis",
        "definition": "Re-running the whole model under a discrete alternative assumption — such as a different time horizon, discount rate, or survival extrapolation — rather than a numeric range on one input."
      },
      {
        "term": "willingness-to-pay threshold",
        "definition": "The maximum cost per unit of health gain (such as dollars per QALY) a decision maker will accept; a scenario \"flips the decision\" when it pushes the ICER across this line."
      }
    ],
    "worked_example": {
      "scenario": "A cost-utility model finds a new treatment costs more but adds quality-adjusted life-years versus standard care, and the payer threshold is $50,000 per QALY. We compute the base-case ICER, then re-run the model under three pre-specified scenarios — a shorter 10-year time horizon, an alternative (Gompertz) survival extrapolation, and 0% discounting — and check which scenarios push the ICER across the threshold.",
      "dataset": {
        "caption": "Incremental cost and incremental QALYs from each model run, one row per pre-specified scenario.",
        "columns": [
          "scenario",
          "incremental_cost",
          "incremental_qalys"
        ],
        "rows": [
          [
            "base_case_lifetime_3pct",
            22000,
            0.5
          ],
          [
            "horizon_10_years",
            18000,
            0.3
          ],
          [
            "gompertz_extrapolation",
            21000,
            0.42
          ],
          [
            "discount_0pct",
            24000,
            0.6
          ]
        ]
      },
      "steps": [
        "Compute the base-case ICER as incremental cost divided by incremental QALYs: 22,000 / 0.50 = 44,000 per QALY — below the $50,000 threshold, so the treatment is cost-effective in the base case.",
        "Re-run under the 10-year horizon: 18,000 / 0.30 = 60,000 per QALY — above the threshold; truncating the horizon cuts off late QALY gains faster than late costs, and the decision flips.",
        "Re-run under the Gompertz extrapolation: 21,000 / 0.42 = 50,000 per QALY — exactly at the threshold; the choice of survival curve alone moves the result to the decision boundary.",
        "Re-run with 0% discounting: 24,000 / 0.60 = 40,000 per QALY — below the threshold; undiscounted future QALYs grow faster than future costs here, so this scenario is more favorable.",
        "Mark each row against the $50,000 line and report all four ICERs, flagging that the conclusion is robust to discounting but sensitive to the time horizon and the extrapolation family."
      ],
      "result": "Base-case ICER = 22,000 / 0.50 = 44,000 per QALY (cost-effective at $50,000). Scenarios: 10-year horizon 18,000 / 0.30 = 60,000 (flips the decision), Gompertz extrapolation 21,000 / 0.42 = 50,000 (exactly at the threshold), 0% discounting 24,000 / 0.60 = 40,000 (robust). The decision hinges on horizon and extrapolation — exactly the structural choices a PSA would not have tested."
    },
    "prerequisites": [
      "cost-utility",
      "icer-net-monetary-benefit-rwe",
      "discounting-costs-effects-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "One-way DSA with tornado diagram",
        "description": "Vary each parameter alone between justified low/high values, record the ICER or NMB swing, and rank parameters by influence in a tornado diagram.",
        "edge_cases": [
          "Range width drives bar length — every range needs a stated source (CI, literature extremes, elicitation), reported alongside the bar.",
          "Correlated parameters varied alone create internally inconsistent model states; move correlated blocks together."
        ],
        "data_source_notes": "any: ranges from RWE inputs should use the estimate's CI from the source study, not a flat ±20%."
      },
      {
        "name": "Scenario analysis (structural/methodological)",
        "description": "Re-run the full model under discrete alternative assumptions — time horizon, discount rate, extrapolation family, perspective, comparator, or input source — one named change per scenario.",
        "edge_cases": [
          "Pre-specify the scenario set before results are seen; post hoc scenario selection is advocacy, not analysis.",
          "Report the full incremental-cost / incremental-effect / ICER row per scenario and flag threshold crossings explicitly."
        ],
        "data_source_notes": "claims/ehr/registry: standard RWE scenarios include alternative cost definitions (paid vs allowed), FFS-only vs pooled person-time, and outcome-algorithm variants."
      },
      {
        "name": "Threshold (tipping-point) analysis",
        "description": "Invert the one-way DSA — solve for the parameter value at which the ICER crosses the willingness-to-pay threshold (or NMB crosses zero), then judge whether that value is plausible.",
        "edge_cases": [
          "Most informative for the one or two parameters the tornado ranks highest; running it on every input dilutes the message.",
          "If the tipping value lies far outside any plausible range, say so — that is the robustness statement."
        ],
        "data_source_notes": "any: pairs naturally with RWE inputs whose true value is contested (e.g., real-world persistence or adherence)."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Probabilistic sensitivity analysis (PSA)",
        "pros_of_this": "Transparent and re-derivable by reviewers; cheap to run; uniquely able to test structural choices (horizon, extrapolation, perspective) that PSA holds fixed.",
        "cons_of_this": "One-at-a-time variation understates joint uncertainty, ignores correlation, and yields no probability that the intervention is cost-effective.",
        "when_to_prefer": "For influence ranking and structural uncertainty — always alongside, never instead of, a PSA."
      },
      {
        "compared_to": "Tipping-point / threshold analysis",
        "pros_of_this": "Summarizes many parameters at once and produces the familiar tornado influence ranking.",
        "cons_of_this": "Reports output ranges rather than the decision-relevant input value at which the conclusion flips.",
        "when_to_prefer": "When surveying influence across the whole input set; switch to threshold analysis for the top one or two drivers."
      },
      {
        "compared_to": "Multi-way DSA",
        "pros_of_this": "One-way results are directly readable and attributable to a single input.",
        "cons_of_this": "Misses interactions between inputs that move together in reality.",
        "when_to_prefer": "Use one-way for the headline ranking; add a two-way grid for the two dominant, plausibly interacting drivers; beyond that, PSA."
      }
    ],
    "implementation_notes_by_data_source": {},
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\n\nWTP = 50_000.0\n\nBASE = {\"inc_cost\": 22_000.0, \"inc_qaly\": 0.50}\n\n# parameter -> (low_output, high_output) where each output is (inc_cost, inc_qaly)\n# produced by re-running the model at the parameter's justified low/high value.\nRANGES = {\n    \"drug_price\":       ((18_000.0, 0.50), (28_000.0, 0.50)),\n    \"utility_gain\":     ((22_000.0, 0.38), (22_000.0, 0.62)),\n    \"relapse_rr\":       ((20_500.0, 0.44), (23_500.0, 0.56)),\n}\n\n# named structural scenarios -> model outputs from full re-runs\nSCENARIOS = {\n    \"base_case_lifetime_3pct\": (22_000.0, 0.50),\n    \"horizon_10_years\":        (18_000.0, 0.30),\n    \"gompertz_extrapolation\":  (21_000.0, 0.42),\n    \"discount_0pct\":           (24_000.0, 0.60),\n}\n\ndef icer(inc_cost: float, inc_qaly: float) -> float:\n    return inc_cost / inc_qaly\n\ndef one_way_dsa() -> pd.DataFrame:\n    base_icer = icer(**BASE)\n    rows = []\n    for param, (lo, hi) in RANGES.items():\n        lo_icer, hi_icer = icer(*lo), icer(*hi)\n        rows.append({\n            \"parameter\": param,\n            \"icer_low\": lo_icer, \"icer_high\": hi_icer,\n            \"swing\": abs(hi_icer - lo_icer),            # tornado bar length\n            \"crosses_wtp\": (min(lo_icer, hi_icer) <= WTP <= max(lo_icer, hi_icer)),\n        })\n    out = pd.DataFrame(rows).sort_values(\"swing\", ascending=False)  # tornado order\n    out.attrs[\"base_icer\"] = base_icer\n    return out\n\ndef scenario_table() -> pd.DataFrame:\n    rows = [{\"scenario\": name, \"inc_cost\": c, \"inc_qaly\": q,\n             \"icer\": icer(c, q), \"cost_effective_at_wtp\": icer(c, q) <= WTP}\n            for name, (c, q) in SCENARIOS.items()]\n    return pd.DataFrame(rows)",
        "description": "One-way DSA with tornado-style output plus a scenario table, for any model expressed as a function from a\nparameter dict to (incremental_cost, incremental_qalys). BASE, RANGES (justified low/high per parameter), and\nSCENARIOS (named structural re-runs supplying their own model outputs) are illustrative — wire in your model\nfunction and pre-specified ranges. Flags threshold crossings explicitly.",
        "dependencies": [
          "pandas"
        ],
        "source_citations": [
          "briggs-2000",
          "briggs-2012"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\n\nWTP <- 50000\n\nicer <- function(inc_cost, inc_qaly) inc_cost / inc_qaly\n\n# parameter -> low/high model outputs (inc_cost, inc_qaly) from re-runs at justified bounds\nranges <- list(\n  drug_price   = list(low = c(18000, 0.50), high = c(28000, 0.50)),\n  utility_gain = list(low = c(22000, 0.38), high = c(22000, 0.62)),\n  relapse_rr   = list(low = c(20500, 0.44), high = c(23500, 0.56))\n)\n\nscenarios <- list(\n  base_case_lifetime_3pct = c(22000, 0.50),\n  horizon_10_years        = c(18000, 0.30),\n  gompertz_extrapolation  = c(21000, 0.42),\n  discount_0pct           = c(24000, 0.60)\n)\n\none_way_dsa <- function() {\n  out <- rbindlist(lapply(names(ranges), function(p) {\n    lo <- icer(ranges[[p]]$low[1],  ranges[[p]]$low[2])\n    hi <- icer(ranges[[p]]$high[1], ranges[[p]]$high[2])\n    data.table(parameter = p, icer_low = lo, icer_high = hi,\n               swing = abs(hi - lo),\n               crosses_wtp = min(lo, hi) <= WTP & WTP <= max(lo, hi))\n  }))\n  setorder(out, -swing)   # tornado order: widest swing on top\n  out\n}\n\nscenario_table <- function() {\n  rbindlist(lapply(names(scenarios), function(s) {\n    ic <- icer(scenarios[[s]][1], scenarios[[s]][2])\n    data.table(scenario = s,\n               inc_cost = scenarios[[s]][1], inc_qaly = scenarios[[s]][2],\n               icer = ic, cost_effective_at_wtp = ic <= WTP)\n  }))\n}",
        "description": "R version producing the one-way DSA (tornado-ordered) and the scenario table. Same illustrative structure:\nreplace RANGES/SCENARIOS with outputs from re-running your model at justified low/high parameter values and\nunder pre-specified structural scenarios.",
        "dependencies": [
          "data.table"
        ],
        "source_citations": [
          "briggs-2000",
          "briggs-2012"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "%let wtp = 50000;\n\n/* ICER per DSA re-run, then swing per parameter (tornado bar length). */\ndata dsa;\n  set work.dsa_runs;\n  icer = inc_cost / inc_qaly;\nrun;\n\nproc sql;\n  create table tornado as\n  select parameter,\n         max(case when bound = 'low'  then icer end) as icer_low,\n         max(case when bound = 'high' then icer end) as icer_high,\n         abs(calculated icer_high - calculated icer_low) as swing,\n         case when min(calculated icer_low, calculated icer_high) <= &wtp\n               and &wtp <= max(calculated icer_low, calculated icer_high)\n              then 1 else 0 end as crosses_wtp\n  from dsa\n  group by parameter\n  order by swing desc;          /* tornado order: widest swing first */\nquit;\n\n/* Scenario table with explicit threshold flag. */\ndata scenario_table;\n  set work.scenarios;\n  icer = inc_cost / inc_qaly;\n  cost_effective_at_wtp = (icer <= &wtp);\nrun;\n\nproc print data=tornado;        run;\nproc print data=scenario_table; run;",
        "description": "SAS build of the one-way DSA and scenario table from pre-computed model re-run outputs. Inputs:\n  work.dsa_runs  : parameter, bound ('low'/'high'), inc_cost, inc_qaly  (one row per re-run)\n  work.scenarios : scenario, inc_cost, inc_qaly                          (one row per structural scenario)\nComputes ICERs, the tornado swing per parameter, and flags scenarios that cross the WTP threshold.",
        "dependencies": [],
        "source_citations": [
          "briggs-2000",
          "briggs-2012"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  M[Base-case model<br/>ICER / NMB] --> Q{Uncertainty type}\n  Q -->|Numeric input with a range| D[One-way DSA<br/>justified low/high per parameter]\n  Q -->|Discrete modeling choice| S[Scenario analysis<br/>horizon, discounting, extrapolation, perspective]\n  Q -->|Joint parameter uncertainty| P[PSA<br/>separate concept]\n  D --> T[Tornado diagram<br/>rank by output swing]\n  T --> TP[Threshold analysis on top drivers<br/>input value where decision flips]\n  S --> R[Scenario table<br/>inc cost, inc QALY, ICER per row]\n  TP --> F{Any analysis crosses<br/>the WTP threshold?}\n  R --> F\n  F -->|no| Rob[Decision robust - report ranges]\n  F -->|yes| Fra[Decision fragile - flag the assumption<br/>and the tipping value]",
        "caption": "Deterministic uncertainty workflow — route numeric ranges to one-way DSA (summarized as a tornado, inverted into threshold analysis for the top drivers) and discrete structural choices to pre-specified scenarios, then judge every result against the willingness-to-pay threshold.",
        "alt_text": "Flowchart routing parameter ranges to one-way DSA and tornado diagram, structural choices to scenario analysis, and joint uncertainty to PSA, converging on a check of whether any result crosses the willingness-to-pay threshold.",
        "source_type": "illustrative",
        "source_citations": [
          "briggs-2012"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "complements",
        "target_slug": "probabilistic-sensitivity-analysis-hea-rwe",
        "notes": "DSA/scenarios identify influence and structural uncertainty; PSA quantifies joint parameter (decision) uncertainty. Good practice requires both."
      },
      {
        "relation_type": "see_also",
        "target_slug": "tipping-point-analysis-rwe",
        "notes": "Threshold analysis is the inverted one-way DSA — it solves for the input value at which the decision flips rather than mapping a range to outputs."
      },
      {
        "relation_type": "used_with",
        "target_slug": "icer-net-monetary-benefit-rwe",
        "notes": "The ICER (or NMB) is the model output tracked across DSA ranges and scenario re-runs, judged against the willingness-to-pay threshold."
      },
      {
        "relation_type": "used_with",
        "target_slug": "cost-utility",
        "notes": "Cost-utility models are the standard host for DSA and scenario analysis in HTA submissions."
      },
      {
        "relation_type": "used_with",
        "target_slug": "survival-extrapolation-hta-rwe",
        "notes": "The extrapolation family (Weibull vs Gompertz vs spline) is a canonical structural scenario — often the single largest driver of lifetime ICERs."
      },
      {
        "relation_type": "used_with",
        "target_slug": "discounting-costs-effects-rwe",
        "notes": "Reference cases mandate discount-rate scenarios (e.g., 0%/3%/5%) alongside the base case."
      },
      {
        "relation_type": "used_with",
        "target_slug": "health-economic-modeling-methods-rwe",
        "notes": "Deterministic uncertainty analysis is a required layer of any decision-analytic model per ISPOR-SMDM good practice."
      },
      {
        "relation_type": "used_with",
        "target_slug": "budget-impact",
        "notes": "Budget-impact analyses report scenario analyses (uptake, price, eligible population) as their primary uncertainty characterization."
      }
    ],
    "aliases": [
      "scenario analysis",
      "deterministic sensitivity analysis",
      "DSA",
      "one-way sensitivity analysis",
      "tornado diagram analysis",
      "structural sensitivity analysis"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "hta",
      "ema"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "scoping-review",
    "name": "Scoping Review",
    "short_definition": "A structured secondary-research design that systematically maps the breadth, concepts, and gaps of a literature on a broad topic — without the focused effectiveness question, formal risk-of-bias appraisal, or pooled estimate of a systematic review.",
    "long_description": "A **scoping review** is a form of knowledge synthesis that addresses an *exploratory* question — \"what is the extent, range, and nature of the evidence on X?\" — rather than the focused, answerable effectiveness question of a systematic review. It follows the Arksey & O'Malley framework as refined by Levac and codified for reporting in PRISMA-ScR: (1) identify the research question; (2) identify relevant studies via a sensitive, pre-registered search; (3) select studies against explicit inclusion criteria, ideally with dual independent screening; (4) chart the data using a structured, piloted extraction form; (5) collate, summarize, and report results, typically as an evidence map; and (optional) (6) consult stakeholders. The deliverable is a *map* of the field — what has been studied, in which populations, with which designs and data sources, and where the white space is — not a pooled effect estimate.\n\n**Core conceptual distinction.** The defining difference from a systematic review is the *question type and the absence of synthesis-for-an-answer*. A systematic review asks a narrow PICO question, mandates formal critical appraisal (e.g., RoB 2, ROBINS-I, Newcastle-Ottawa), and usually proceeds to a quantitative pooled estimate (meta-analysis) that drives a clinical or coverage decision. A scoping review deliberately keeps the question broad, does **not** require risk-of-bias appraisal of included studies, and does **not** pool effects — its output is descriptive (counts, evidence tables, bubble/heat maps). It is therefore a *reconnaissance* instrument: it scopes the size and shape of a body of evidence before anyone commits to the much heavier systematic review or HTA. It is not a \"lighter systematic review\" and must never be reported as if it answered an effectiveness question. 
Distinguish it also from a *rapid review* (a systematic review with deliberate methodological shortcuts under time pressure, which still targets an answerable question) and from a narrative/literature review (no protocol, no reproducible search).\n\n**Pros, cons, and trade-offs** (specific and comparative, naming the alternatives).\n- **vs systematic review + meta-analysis:** A scoping review can encompass heterogeneous designs, outcomes, and qualitative/grey literature that a meta-analysis cannot pool, and it surfaces the full landscape (e.g., every RWE design used to study a drug class). Cost: it yields no effect estimate and no graded certainty (no GRADE), so it cannot support a \"drug A reduces event Y by Z%\" claim or a reimbursement decision. **Prefer a scoping review** when the question is \"what's out there and where are the gaps,\" and a systematic review when the question is \"what is the effect.\"\n- **vs narrative/expert literature review:** A scoping review is protocol-driven, has a reproducible and reportable search, applies explicit and pre-specified inclusion criteria, and reports against PRISMA-ScR — so it is far less prone to selection and confirmation bias and is auditable. Cost: substantially more labor (often months, dual screening, charting form piloting). **Prefer the scoping review** whenever transparency and reproducibility matter (manuscript, regulatory dossier, grant background).\n- **vs rapid review:** A scoping review is broader in scope but typically does not appraise study quality; a rapid review keeps a focused question but cuts corners (single screener, restricted databases) to deliver fast. 
**Prefer a scoping review** when the goal is comprehensive mapping; prefer a rapid review when a decision-maker needs a defensible answer to a narrow question quickly.\n- **Trade-off intrinsic to the design:** by omitting risk-of-bias appraisal, a scoping review buys breadth and speed-per-study at the price of being silent on the *credibility* of what it maps — a strength when mapping, a danger when readers over-read the map as evidence of effect.\n\n**When to use** (clear decision rules). Choose a scoping review when: the question is broad or emerging and a precise PICO is not yet formulable; you need to identify the *types* of evidence and key concepts before designing a systematic review or registry study; you are mapping how a method or exposure has been operationalized across the literature (e.g., every washout/new-user variant used for an SGLT2-inhibitor safety question in claims); you must characterize a heterogeneous field that mixes RCTs, observational RWE, qualitative work, HTAs, and grey literature; or you are establishing the evidence-gap rationale for a grant, guideline, or HTA scoping document. Munn et al.'s decision logic: if the intended output is a yes/no answer about effectiveness/diagnostic accuracy, do a systematic review; if it is a map of the field, do a scoping review.\n\n**When NOT to use — and when it is actively misleading or dangerous** (clear decision rules). Do **not** use a scoping review when a decision hinges on the magnitude or certainty of an effect — payers, regulators, and guideline panels need appraisal and (usually) a pooled estimate, and a scoping review provides neither. It is **actively misleading** when its descriptive results are presented as evidence of effectiveness (\"12 studies show benefit\" conflates *volume* of literature with *direction/certainty* of effect — vote-counting is not synthesis). 
It is **dangerous** as the sole input to an HTA or label decision because it skips risk-of-bias appraisal: a field could be large yet uniformly biased (e.g., dozens of prevalent-user, immortal-time-afflicted claims studies), and the map would look reassuringly dense while every study is confounded. Do not use it when the question is already narrow and answerable — that is wasted breadth where a systematic review is the correct, more informative instrument. Finally, a scoping review that adds post hoc effect-direction conclusions has silently mutated into a low-quality systematic review without appraisal — the worst of both worlds.\n\n**Data-source operational depth.** A scoping review's \"data sources\" are bibliographic, but in an RWE/HEOR context the *included studies* are themselves built on claims, EHR, registry, or linked data, and a credible review must chart and stratify by those substrates and their failure modes.\n- **Bibliographic databases (the review's own data):** A defensible search spans at least MEDLINE/PubMed, Embase, and one of the Cochrane/CINAHL/Web of Science set, plus grey literature (conference abstracts, HTA agency reports, ClinicalTrials.gov, regulator dossiers). Failure modes: relying on PubMed alone systematically misses Embase-indexed European pharmacoepidemiology; English-only restrictions drop relevant registry studies. Workaround: peer-review the search strategy (PRESS), report database + date of each search, and log every hit for the PRISMA-ScR flow.\n- **Claims-based included studies:** When charting a claims study, capture the database type because it drives interpretability. *Medicare Advantage (MA)-only person-time lacks fee-for-service (FFS) claims*, so a study built on MA enrollees may have incomplete encounter capture; chart whether the authors restricted to FFS/Parts A/B/D. 
Chart the exposure operationalization (NDC + `fill_date` + `days_supply`), washout length, and new-user vs prevalent-user status — the map is only useful if it exposes that, say, 18 of 22 included safety studies used prevalent-user cohorts (a systematic credibility flag the scoping map can legitimately surface without appraising each).\n- **EHR-based included studies:** Chart whether exposure was the *order/administration* vs a linked dispensing, and whether the study handled differential loss-to-follow-up when patients leave the system. Note structured-vs-NLP outcome ascertainment, which governs comparability across the map.\n- **Registry / linked included studies:** Chart adjudication of outcomes and disease-severity capture (registry strengths) and whether pharmacy exposure was linked from claims (a common registry weakness). For linked claims–EHR–vital-records studies, chart the linkage rate and any selection it induces. A frequent, chartable failure mode across elderly-cohort claims studies is **differential competing risks by exposure** (death competing with the event differs by arm) and **immortal time in procedure studies** (follow-up started before the index procedure); the scoping map should flag the *prevalence* of these design features even though it does not grade them.\n\n**Worked example (claims-style logic embedded in the inclusion criteria).** Question: *map the landscape of US administrative-claims studies (2015–2025) comparing SGLT2 inhibitors with DPP-4 inhibitors for cardiovascular or renal outcomes, and characterize how each operationalized exposure and time zero.* The claims-style rigor lives in the eligibility and charting rules, not in a statistical estimate:\n(1) **Population/inclusion:** peer-reviewed or HTA-report studies using ≥1 US claims database (e.g., Medicare FFS, MarketScan, Optum) with ≥1 cardiovascular or renal outcome.\n(2) **Exposure-definition charting:** for each study record `exposure_drug_class`, whether exposure was 
defined from pharmacy fills (NDC + `fill_date` + `days_supply`), the washout length (e.g., 365 days drug-free), and new-user vs prevalent-user status; the scoping map can then report that, e.g., only 14/31 studies used an active-comparator new-user design with a ≥365-day washout and continuous enrollment.\n(3) **Data-source charting:** record database type and, critically, whether the study **excluded MA-only person-time** (FFS claims completeness) and whether it required continuous medical + pharmacy enrollment across the washout, flagging the records where \"no prior fill\" could be unobserved missingness rather than a true washout.\n(4) **Design-feature charting:** record time-zero definition (first qualifying fill vs diagnosis date; the latter risks immortal time), competing-risk handling, and outcome validation.\n(5) **Output:** an evidence-map / bubble chart of study count by design × outcome, with the gap surfaced explicitly (e.g., \"no linked claims–EHR study examined renal outcomes with a target-trial emulation in patients ≥75\"). No effect is pooled; no study is risk-of-bias graded; the deliverable is the map and the gap, and an honest scoping review states plainly that the density of the map does not imply the SGLT2-vs-DPP4 effect is established.",
    "primary_category": "Study_Design",
    "tags": [
      "scoping-review",
      "evidence-synthesis",
      "evidence-mapping",
      "prisma-scr",
      "knowledge-synthesis",
      "literature-review",
      "research-gap",
      "secondary-research"
    ],
    "applies_to_study_types": [
      "scoping_review"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1080/1364557032000119616",
        "url": "https://doi.org/10.1080/1364557032000119616",
        "citation_text": "Arksey H, O'Malley L. Scoping studies: towards a methodological framework. International Journal of Social Research Methodology. 2005;8(1):19-32.",
        "year": 2005,
        "authors_short": "Arksey & O'Malley",
        "notes": "Foundational five-stage framework (question, identify studies, select, chart, collate/summarize/report) that defines the scoping review method."
      },
      {
        "role": "introduce",
        "doi": "10.7326/M18-0850",
        "url": "https://doi.org/10.7326/M18-0850",
        "citation_text": "Tricco AC, Lillie E, Zarin W, et al. PRISMA Extension for Scoping Reviews (PRISMA-ScR): checklist and explanation. Annals of Internal Medicine. 2018;169(7):467-473.",
        "year": 2018,
        "authors_short": "Tricco et al.",
        "notes": "The reporting standard for scoping reviews — the 20-item PRISMA-ScR checklist and flow diagram now expected by journals and HTA bodies."
      },
      {
        "role": "explain",
        "doi": "10.1186/1748-5908-5-69",
        "url": "https://doi.org/10.1186/1748-5908-5-69",
        "citation_text": "Levac D, Colquhoun H, O'Brien KK. Scoping studies: advancing the methodology. Implementation Science. 2010;5:69.",
        "year": 2010,
        "authors_short": "Levac et al.",
        "notes": "Operational refinements to Arksey & O'Malley — clarifies the research question, dual screening, charting-form piloting, and the optional consultation stage."
      },
      {
        "role": "explain",
        "doi": "10.1186/s12874-018-0611-x",
        "url": "https://doi.org/10.1186/s12874-018-0611-x",
        "citation_text": "Munn Z, Peters MDJ, Stern C, et al. Systematic review or scoping review? Guidance for authors when choosing between a systematic or scoping review approach. BMC Medical Research Methodology. 2018;18:143.",
        "year": 2018,
        "authors_short": "Munn et al.",
        "notes": "Decision logic for choosing scoping vs systematic review — the canonical reference for when each design is and is not appropriate."
      }
    ],
    "plain_language_summary": "A scoping review is a careful, by-the-rules survey of everything that has been published on a broad topic, built to answer \"what's out there, and where are the holes?\" rather than \"does this drug work?\" You register a search plan, run it across several databases, screen the hits down to the studies that fit, and then chart what each one did so you can draw a map of the field. The output is that map of who studied what, with which designs and data, and where nobody has looked yet, not a single combined number. The honest caveat: a thick, busy map tells you a lot of research exists, but it does not tell you the research is right.",
    "key_terms": [
      {
        "term": "evidence map",
        "definition": "A picture (often a grid or bubble chart) showing how many studies fall into each combination of population, design, and outcome, so you can see at a glance what's well-covered and what's blank."
      },
      {
        "term": "charting",
        "definition": "Reading each included study and recording its key features into a structured form, like filling one spreadsheet row per study, so the studies can be compared side by side."
      },
      {
        "term": "risk-of-bias appraisal",
        "definition": "A formal judgment of how trustworthy each individual study's result is; a scoping review deliberately skips this step, which is why it can map a field but cannot vouch for it."
      },
      {
        "term": "pooled estimate",
        "definition": "A single combined number (like one overall effect size) calculated by statistically merging many studies; a scoping review never produces one."
      },
      {
        "term": "grey literature",
        "definition": "Useful documents that aren't peer-reviewed journal articles, such as conference abstracts, HTA agency reports, and trial registry entries."
      },
      {
        "term": "dual independent screening",
        "definition": "Two reviewers decide separately whether each study should be kept or dropped, then compare, so one person's slip is less likely to lose a relevant study."
      }
    ],
    "worked_example": {
      "scenario": "You're mapping how US insurance-claims studies from 2015 to 2025 compared two diabetes drug classes (SGLT2 inhibitors vs. DPP-4 inhibitors) for heart and kidney outcomes. You are not trying to learn which drug is better; you want to see how many such studies exist and how each one defined who counts as 'treated.' The work is a funnel: you start with everything the databases return and narrow down, stage by stage, to the studies you actually include and chart. The table below is the running count an analyst tracks to fill in the PRISMA-ScR flow diagram.",
      "dataset": {
        "caption": "The PRISMA-ScR screening funnel: one row per stage, with the number of records moving through. These are the counts a reviewer logs as they go.",
        "columns": [
          "stage",
          "n"
        ],
        "rows": [
          [
            "records_identified",
            1240
          ],
          [
            "duplicates_removed",
            240
          ],
          [
            "unique_records_screened",
            1000
          ],
          [
            "excluded_at_title_abstract",
            880
          ],
          [
            "assessed_full_text",
            120
          ],
          [
            "excluded_at_full_text",
            89
          ],
          [
            "studies_included",
            31
          ]
        ]
      },
      "steps": [
        "Start at the top of the funnel: the databases (MEDLINE, Embase, Cochrane) plus grey literature return 1,240 records. The same paper often appears in more than one database, so 240 are duplicates; removing them leaves 1,000 unique records to actually look at.",
        "Title-and-abstract screening is the cheap first pass: two reviewers read just the titles and abstracts of all 1,000 records and drop anything clearly off-topic. Here 880 are excluded, so 1,000 - 880 = 120 records survive to the next stage.",
        "Full-text assessment is the careful second pass: you pull the full PDFs for those 120 and check each against the written inclusion rules (US claims data, a heart or kidney outcome, the right drug comparison). 89 fail on full read, so 120 - 89 = 31 studies are included.",
        "Now comes the part that makes this a scoping review and not a systematic review: instead of combining the 31 studies into one effect, you chart each one, recording how it defined exposure (a pharmacy fill vs. a diagnosis date), its washout length, and its study design. That charting is what lets you draw the map.",
        "The map reveals breadth and a gap. Of the 31 included studies, only 14 used the strongest design (an active-comparator new-user design with a year-long drug-free washout), and no study at all linked claims to EHR data for kidney outcomes in patients 75 and older, which is the white space a future study could fill.",
        "Read the map honestly: 31 studies exist and 14 are well-designed, but because you never appraised or pooled them, the density of the map says nothing about whether one drug class actually beats the other."
      ],
      "result": "The funnel is arithmetically consistent: 1,240 identified - 240 duplicates = 1,000 screened; 1,000 - 880 excluded at title/abstract = 120 assessed at full text; 120 - 89 excluded at full text = 31 studies included. The deliverable is the evidence map over those 31 studies (14/31 used the strongest design; one combination has zero studies) and the gap it exposes, not a pooled effect."
    },
    "prerequisites": [
      "systematic-review",
      "picots-framework-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Mapping / evidence-gap review",
        "description": "A scoping review whose primary deliverable is a visual evidence map (bubble or heat map of study count by population x design x outcome) explicitly intended to identify research gaps for a future systematic review, registry study, or grant.",
        "edge_cases": [
          "Readers can misread cell density as effect evidence; captions must state that counts reflect literature volume, not direction or certainty of effect.",
          "Gaps may reflect indexing/search limitations rather than true absence of research; report search peer-review (PRESS) and database coverage."
        ],
        "data_source_notes": "claims: stratify map cells by database type and new-user vs prevalent-user design so the gap is interpretable (e.g., absence of FFS-restricted active-comparator studies for a renal outcome)."
      },
      {
        "name": "Methods/operationalization scoping review",
        "description": "Maps how a construct, exposure, or method has been operationalized across studies (e.g., every washout length, time-zero rule, or adherence metric used for a drug class) rather than mapping outcomes.",
        "edge_cases": [
          "Requires a detailed, piloted charting form with controlled vocabularies for design features; free-text charting makes the map non-comparable.",
          "Heterogeneous reporting means many studies omit the very operational detail being charted; pre-specify how 'not reported' is handled."
        ],
        "data_source_notes": "claims/EHR: chart exposure definition (NDC + fill_date + days_supply vs order/administration), washout, enrollment requirements, and MA-vs-FFS handling as structured fields."
      },
      {
        "name": "PRISMA-ScR-conformant reporting variant",
        "description": "A scoping review executed and reported strictly against the 20-item PRISMA-ScR checklist with an a priori protocol (e.g., registered on Open Science Framework), suitable for peer-reviewed publication or an HTA scoping document.",
        "edge_cases": [
          "Protocol amendments during conduct must be documented; undocumented scope drift undermines the reproducibility the standard exists to guarantee.",
          "PRISMA-ScR does not require risk-of-bias appraisal; conformance does not make the review a substitute for a systematic review."
        ],
        "data_source_notes": "Report the full search string, database, platform, and date for every source, plus the PRISMA-ScR flow counts (identified, screened, eligible, included)."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Systematic review with meta-analysis",
        "pros_of_this": "Encompasses heterogeneous designs, outcomes, qualitative and grey literature that cannot be pooled; maps the full landscape and identifies gaps before committing to a focused review.",
        "cons_of_this": "Produces no pooled effect estimate, no risk-of-bias appraisal, and no GRADE certainty rating; cannot support an effectiveness, safety-magnitude, or reimbursement claim.",
        "when_to_prefer": "When the question is 'what evidence exists and where are the gaps,' the topic is broad or emerging, or a precise PICO is not yet formulable."
      },
      {
        "compared_to": "Narrative / expert literature review",
        "pros_of_this": "Protocol-driven, reproducible and reportable search, explicit pre-specified inclusion criteria, PRISMA-ScR reporting — auditable and far less prone to selection/confirmation bias.",
        "cons_of_this": "Substantially more labor and time (dual screening, charting-form piloting, peer-reviewed search), with no gain in effect-level inference.",
        "when_to_prefer": "Whenever transparency, reproducibility, and an auditable evidence base are required (manuscript, grant background, regulatory or HTA scoping)."
      },
      {
        "compared_to": "Rapid review",
        "pros_of_this": "Broader scope and comprehensive mapping across study types and grey literature; not constrained to a single narrow question.",
        "cons_of_this": "Typically omits risk-of-bias appraisal and is slower per unit of decision-relevant answer; not built to deliver a fast yes/no answer.",
        "when_to_prefer": "When the goal is comprehensive mapping rather than a time-pressured answer to a focused, answerable question."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "When charting included claims studies, record database type and whether MA-only person-time (lacking FFS claims) was excluded; chart exposure as NDC + fill_date + days_supply, washout length, continuous-enrollment requirements, and new-user vs prevalent-user status so the evidence map exposes design credibility, not just study counts.",
      "ehr": "Chart whether exposure was the order/administration vs a linked dispensing and how differential loss-to-follow-up and structured-vs-NLP outcome ascertainment were handled, since these govern comparability across mapped studies.",
      "registry": "Chart outcome adjudication and disease-severity capture (registry strengths) and whether pharmacy exposure was linked from claims (a common weakness); these determine how registry studies sit on the map relative to claims studies.",
      "linked": "For linked claims-EHR-vital-records studies, chart the linkage rate and any selection it induces, plus reconciliation of order/fill/service dates; flag the prevalence of immortal-time and differential-competing-risk design features without grading individual studies."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\nimport numpy as np\n\ndef prisma_scr_flow(records: pd.DataFrame, screen: pd.DataFrame) -> dict:\n    \"\"\"PRISMA-ScR flow counts from retrieval through inclusion.\"\"\"\n    n_identified = len(records)\n    n_unique = records.drop_duplicates(subset=\"doi\").pipe(\n        lambda d: len(d) + records[\"doi\"].isna().sum())  # keep DOI-less rows; dedup only on DOI\n    n_dup = n_identified - n_unique\n\n    ta = screen[screen[\"stage\"] == \"title_abstract\"]\n    ft = screen[screen[\"stage\"] == \"full_text\"]\n    # A record is screened-in at a stage only if NO reviewer excluded it (conservative dual-screen rule).\n    ta_in = ta.groupby(\"record_id\")[\"decision\"].apply(lambda d: (d == \"include\").all())\n    ft_in = ft.groupby(\"record_id\")[\"decision\"].apply(lambda d: (d == \"include\").all())\n    return {\n        \"identified\": n_identified,\n        \"duplicates_removed\": int(n_dup),\n        \"screened_title_abstract\": int(ta[\"record_id\"].nunique()),\n        \"excluded_title_abstract\": int((~ta_in).sum()),\n        \"assessed_full_text\": int(ft[\"record_id\"].nunique()),\n        \"excluded_full_text\": int((~ft_in).sum()),\n        \"included\": int(ft_in.sum()),\n    }\n\ndef screening_kappa(screen: pd.DataFrame, stage: str = \"title_abstract\") -> float:\n    \"\"\"Cohen's kappa for two reviewers' include/exclude decisions at a screening stage.\"\"\"\n    s = screen[screen[\"stage\"] == stage]\n    wide = (s.pivot_table(index=\"record_id\", columns=\"reviewer\", values=\"decision\",\n                          aggfunc=\"first\").dropna())\n    if wide.shape[1] != 2:\n        raise ValueError(\"Cohen's kappa expects exactly two reviewers with overlapping decisions.\")\n    a, b = wide.iloc[:, 0], wide.iloc[:, 1]\n    po = (a == b).mean()\n    # Expected agreement under independence across the {include, exclude} categories.\n    pe = sum((a == k).mean() * (b == k).mean() for k in pd.unique(pd.concat([a, 
b])))\n    return float((po - pe) / (1 - pe)) if pe < 1 else 1.0\n\ndef evidence_map(chart: pd.DataFrame) -> pd.DataFrame:\n    \"\"\"Study count by design x outcome_domain -- the scoping deliverable (NOT a pooled effect).\"\"\"\n    return (chart.pivot_table(index=\"design\", columns=\"outcome_domain\",\n                              values=\"record_id\", aggfunc=\"nunique\", fill_value=0))\n\ndef credibility_flags(chart: pd.DataFrame) -> pd.Series:\n    \"\"\"Surface design-credibility prevalence the map should disclose (without grading any study).\"\"\"\n    n = len(chart)\n    return pd.Series({\n        \"n_included\": n,\n        \"pct_prevalent_user\": round(100 * (chart[\"design\"] == \"prevalent_user\").mean(), 1),\n        \"pct_acnu_with_washout\": round(100 * ((chart[\"design\"] == \"active_comparator_new_user\")\n                                              & (chart[\"washout_days\"].fillna(0) >= 365)).mean(), 1),\n        \"pct_ma_only_excluded\": round(100 * chart[\"ma_only_excluded\"].mean(), 1),\n        \"pct_immortal_time_risk\": round(100 * chart[\"time_zero\"].isin(\n            [\"diagnosis_date\", \"procedure_date\"]).mean(), 1),\n        \"pct_exposure_not_reported\": round(100 * (chart[\"exposure_def\"] == \"not_reported\").mean(), 1),\n    })",
        "description": "PRISMA-ScR screening + charting pipeline for a scoping review of RWE studies. This is the review's own data-management\nworkflow, not a statistical estimator. Required inputs (already exported from your reference manager / databases):\n  records  : one row per retrieved citation -> record_id, title, abstract, database (e.g. 'MEDLINE'/'Embase'),\n             doi (str|NA), search_date (datetime)\n  screen   : dual-screening decisions -> record_id, reviewer (str), stage in {'title_abstract','full_text'},\n             decision in {'include','exclude'}, exclude_reason (str|NA)\n  chart    : piloted extraction form for INCLUDED studies (one row per study) -> record_id, data_source\n             in {'claims','ehr','registry','linked'}, ma_only_excluded (bool), exposure_def\n             in {'fill_ndc_dayssupply','order_admin','linked_fill','not_reported'}, washout_days (Int|NA),\n             design in {'active_comparator_new_user','new_user','prevalent_user','rct','other'},\n             time_zero in {'first_fill','diagnosis_date','procedure_date','other'}, outcome_domain (str)\nProduces (1) PRISMA-ScR flow counts, (2) a dual-screening agreement (Cohen's kappa) at title/abstract, and (3) the\nevidence map (study count by design x outcome) that is the deliverable. No effect is pooled and no study is graded.",
        "dependencies": [
          "pandas",
          "numpy"
        ],
        "source_citations": [
          "tricco-2018"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\n\nprisma_scr_flow <- function(records, screen) {\n  setDT(records); setDT(screen)\n  n_identified <- nrow(records)\n  # Dedup only on DOI; keep DOI-less rows (a missing DOI is not a duplicate).\n  n_unique <- uniqueN(records[!is.na(doi), doi]) + records[is.na(doi), .N]\n  ta <- screen[stage == \"title_abstract\"]\n  ft <- screen[stage == \"full_text\"]\n  # A record is screened-in at a stage only if NO reviewer excluded it (conservative dual-screen rule).\n  ta_in <- ta[, .(keep = all(decision == \"include\")), by = record_id]\n  ft_in <- ft[, .(keep = all(decision == \"include\")), by = record_id]\n  list(\n    identified              = n_identified,\n    duplicates_removed      = n_identified - n_unique,\n    screened_title_abstract = uniqueN(ta$record_id),\n    excluded_title_abstract = ta_in[keep == FALSE, .N],\n    assessed_full_text      = uniqueN(ft$record_id),\n    excluded_full_text      = ft_in[keep == FALSE, .N],\n    included                = ft_in[keep == TRUE,  .N]\n  )\n}\n\nscreening_kappa <- function(screen, stage_name = \"title_abstract\") {\n  setDT(screen)\n  s <- screen[stage == stage_name]\n  # One include/exclude per record per reviewer, then keep records both reviewers judged.\n  w <- dcast(s, record_id ~ reviewer, value.var = \"decision\", fun.aggregate = function(x) x[1L])\n  revs <- setdiff(names(w), \"record_id\")\n  stopifnot(length(revs) == 2L)\n  w <- w[complete.cases(w[, ..revs])]\n  a <- w[[revs[1L]]]; b <- w[[revs[2L]]]\n  po <- mean(a == b)\n  cats <- union(unique(a), unique(b))\n  pe <- sum(vapply(cats, function(k) mean(a == k) * mean(b == k), numeric(1)))\n  if (pe >= 1) 1.0 else (po - pe) / (1 - pe)\n}\n\nevidence_map <- function(chart) {\n  setDT(chart)\n  # Study count by design x outcome_domain -- the scoping deliverable (NOT a pooled effect).\n  dcast(unique(chart[, .(record_id, design, outcome_domain)]),\n        design ~ outcome_domain, value.var = \"record_id\",\n        
fun.aggregate = function(x) uniqueN(x), fill = 0L)\n}\n\ncredibility_flags <- function(chart) {\n  setDT(chart); n <- nrow(chart)\n  # Surface design-credibility prevalence the map should disclose -- without grading any single study.\n  list(\n    n_included                = n,\n    pct_prevalent_user        = round(100 * mean(chart$design == \"prevalent_user\"), 1),\n    pct_acnu_with_washout     = round(100 * mean(chart$design == \"active_comparator_new_user\" &\n                                                 !is.na(chart$washout_days) & chart$washout_days >= 365), 1),\n    pct_ma_only_excluded      = round(100 * mean(chart$ma_only_excluded), 1),\n    pct_immortal_time_risk    = round(100 * mean(chart$time_zero %chin% c(\"diagnosis_date\", \"procedure_date\")), 1),\n    pct_exposure_not_reported = round(100 * mean(chart$exposure_def == \"not_reported\"), 1)\n  )\n}",
        "description": "PRISMA-ScR screening + charting pipeline in R (data.table). This is the review's own data-management workflow, not an\nestimator. Required inputs (exported from the reference manager / databases), mirroring the Python version:\n  records : one row per retrieved citation -> record_id, title, abstract, database, doi (chr|NA), search_date (Date)\n  screen  : dual-screening decisions -> record_id, reviewer, stage in {'title_abstract','full_text'},\n            decision in {'include','exclude'}, exclude_reason (chr|NA)\n  chart   : piloted extraction form for INCLUDED studies (one row per study) -> record_id, data_source\n            in {'claims','ehr','registry','linked'}, ma_only_excluded (logical), exposure_def\n            in {'fill_ndc_dayssupply','order_admin','linked_fill','not_reported'}, washout_days (int|NA),\n            design in {'active_comparator_new_user','new_user','prevalent_user','rct','other'},\n            time_zero in {'first_fill','diagnosis_date','procedure_date','other'}, outcome_domain (chr)\nProduces PRISMA-ScR flow counts, title/abstract Cohen's kappa, the evidence map (design x outcome), and the\ndesign-credibility prevalence the map should disclose. No effect is pooled and no study is risk-of-bias graded.",
        "dependencies": [
          "data.table"
        ],
        "source_citations": [
          "tricco-2018"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* ---- PRISMA-ScR flow counts ---- */\n/* A record is screened-in at a stage only if NO reviewer excluded it (conservative dual-screen rule). */\nproc sql;\n  /* Identified and duplicates: dedup on DOI only; DOI-less rows are never duplicates. */\n  create table _flow as\n  select count(*) as identified,\n         calculated identified\n           - ( (select count(distinct doi) from work.records where doi ne '')\n               + (select count(*)          from work.records where doi  = '') ) as duplicates_removed\n  from work.records;\n\n  /* Per-record decision at each stage: included only if min(decision)='include' across reviewers. */\n  create table _ta as\n  select record_id, max(decision='exclude') as any_excl\n  from work.screen where stage='title_abstract' group by record_id;\n  create table _ft as\n  select record_id, max(decision='exclude') as any_excl\n  from work.screen where stage='full_text'      group by record_id;\n\n  create table prisma_scr_flow as\n  select f.identified, f.duplicates_removed,\n         (select count(*) from _ta)                  as screened_title_abstract,\n         (select sum(any_excl) from _ta)             as excluded_title_abstract,\n         (select count(*) from _ft)                  as assessed_full_text,\n         (select sum(any_excl) from _ft)             as excluded_full_text,\n         (select sum(any_excl=0) from _ft)           as included\n  from _flow f;\nquit;\n\n/* ---- Evidence map: study count by design x outcome_domain (the scoping deliverable, NOT a pooled effect) ---- */\nproc freq data=work.chart noprint;\n  tables design*outcome_domain / out=evidence_map(drop=percent);\nrun;\n\n/* ---- Design-credibility prevalence the map should disclose (without grading any single study) ---- */\nproc sql;\n  create table credibility_flags as\n  select count(*) as n_included,\n         round(100*mean(design='prevalent_user'),0.1)                              as pct_prevalent_user,\n         
round(100*mean(design='active_comparator_new_user'\n                        and washout_days ne . and washout_days>=365),0.1)          as pct_acnu_with_washout,\n         round(100*mean(ma_only_excluded=1),0.1)                                   as pct_ma_only_excluded,\n         round(100*mean(time_zero in ('diagnosis_date','procedure_date')),0.1)     as pct_immortal_time_risk,\n         round(100*mean(exposure_def='not_reported'),0.1)                          as pct_exposure_not_reported\n  from work.chart;\nquit;",
        "description": "PRISMA-ScR screening + charting pipeline in SAS (PROC SQL / data step). This is the review's own data-management\nworkflow, not an estimator. Required input datasets (post data-management), mirroring the other languages:\n  work.records : one row per citation -> record_id, title, abstract, database, doi (char, '' if none), search_date\n  work.screen  : dual-screening -> record_id, reviewer, stage ('title_abstract'/'full_text'),\n                 decision ('include'/'exclude'), exclude_reason\n  work.chart   : INCLUDED studies (one row each) -> record_id, data_source ('claims'/'ehr'/'registry'/'linked'),\n                 ma_only_excluded (0/1), exposure_def ('fill_ndc_dayssupply'/'order_admin'/'linked_fill'/'not_reported'),\n                 washout_days (num, . if NR), design ('active_comparator_new_user'/'new_user'/'prevalent_user'/'rct'/'other'),\n                 time_zero ('first_fill'/'diagnosis_date'/'procedure_date'/'other'), outcome_domain\nProduces PRISMA-ScR flow counts, the evidence map (PROC FREQ design x outcome), and design-credibility prevalence.\nNo effect is pooled and no study is risk-of-bias graded.",
        "dependencies": [],
        "source_citations": [
          "tricco-2018"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  ID[Identification: records from MEDLINE + Embase + Cochrane + grey literature<br/>n = identified] --> DEDUP[Remove duplicates<br/>n = duplicates removed]\n  DEDUP --> TA[Title/abstract screening<br/>dual independent reviewers]\n  TA -->|excluded| TAX[Excluded at title/abstract<br/>n with reasons]\n  TA --> FT[Full-text assessment for eligibility<br/>against pre-specified criteria]\n  FT -->|excluded| FTX[Excluded at full text<br/>n with reasons logged]\n  FT --> INC[Included studies<br/>charted with piloted form]\n  INC --> MAP[Collate + summarize -> evidence map<br/>counts by population x design x outcome; gaps]",
        "caption": "PRISMA-ScR flow for a scoping review — identification through duplicate removal, dual title/abstract screening, full-text eligibility assessment, and inclusion, culminating in an evidence map (not a pooled estimate). Excluded counts and reasons are logged at each stage for reproducibility.",
        "alt_text": "Flowchart of the PRISMA-ScR process from record identification, deduplication, dual title/abstract screening, full-text eligibility, inclusion, to the final evidence map.",
        "source_type": "illustrative",
        "source_citations": [
          "tricco-2018"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Q{What does the decision-maker need?} -->|A map of what evidence exists and where the gaps are| SCOPE[Scoping review<br/>broad question, no risk-of-bias appraisal, no pooled effect]\n  Q -->|A yes/no answer on effectiveness, safety magnitude, or accuracy| NARROW{Is a precise PICO formulable now?}\n  NARROW -->|Yes, and time is adequate| SR[Systematic review<br/>formal appraisal + usually meta-analysis -> GRADE]\n  NARROW -->|Yes, but a fast answer is needed| RR[Rapid review<br/>focused question, deliberate methodological shortcuts]\n  NARROW -->|No, the field is too broad or emerging| SCOPE\n  SCOPE --> GAP[Output: evidence map + research gaps -> informs a future systematic review or RWE study]\n  SR --> DEC[Output: graded effect estimate -> guideline / HTA / label decision]",
        "caption": "Decision logic for scoping vs systematic vs rapid review (after Munn et al. 2018). A scoping review is correct when the deliverable is a map of the field; it is the wrong instrument when a graded effect estimate is required.",
        "alt_text": "Decision diagram routing a review question to a scoping review, systematic review, or rapid review based on whether a map or a graded effect answer is needed and whether a precise PICO is formulable.",
        "source_type": "illustrative",
        "source_citations": [
          "munn-2018"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "alternative_to",
        "target_slug": "systematic-review",
        "notes": "A scoping review maps the breadth of evidence and omits risk-of-bias appraisal and pooling; a systematic review answers a focused PICO question with formal appraisal and usually a meta-analysis. Choose by whether the deliverable is a map or a graded effect."
      },
      {
        "relation_type": "see_also",
        "target_slug": "qualitative-synthesis",
        "notes": "Scoping reviews frequently include qualitative and mixed-methods studies; qualitative synthesis methods may be used to summarize their content within the map."
      },
      {
        "relation_type": "used_with",
        "target_slug": "picots-framework-rwe",
        "notes": "PICOTS structures the inclusion criteria and charting form of a scoping review of RWE studies (population, intervention/exposure, comparator, outcomes, timing, setting/data source)."
      },
      {
        "relation_type": "produces",
        "target_slug": "target-trial-emulation",
        "notes": "A methods-oriented scoping review of how a drug class has been studied in claims commonly identifies the gap that a subsequent target-trial-emulation study is designed to fill."
      },
      {
        "relation_type": "see_also",
        "target_slug": "meta-analysis-obs",
        "notes": "A scoping review deliberately does not pool effects; when the field it maps is mature and homogeneous enough, it can justify a follow-on systematic review with observational meta-analysis."
      }
    ],
    "aliases": [
      "scoping review",
      "scoping study",
      "evidence mapping review",
      "scoping literature review",
      "evidence and gap map"
    ],
    "complexity": "foundational",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "sdoh-social-determinants-of-health",
    "name": "Social Determinants of Health (SDoH) in RWE",
    "short_definition": "The non-medical, contextual and individual social conditions (economic stability, education, healthcare access, neighborhood environment, social context) operationalized in real-world studies most often as area-level deprivation indices linked to patient geography, or as individual-level screening/Z-code data, and used as a confounder, mediator, effect modifier, or exposure of interest.",
    "long_description": "In real-world evidence, \"SDoH\" is not one variable but a family of measurement choices, and the methodological work is\nalmost entirely in *operationalization*. The dominant pattern in US claims and EHR is **area-level linkage**: geocode the\npatient's residential address (or, in claims, the member ZIP) to a small geographic unit (census block group or tract),\nthen attach a published composite — the **Area Deprivation Index (ADI)**, the CDC/ATSDR **Social Vulnerability Index (SVI)**,\nthe Robert Graham Center **Social Deprivation Index (SDI)**, or a study-built **American Community Survey (ACS)** score — and\ncollapse it to a rank (national percentile, state decile, tertile). The less common but richer pattern is **individual-level\ncapture**: standardized screening (PRAPARE, the CMS Accountable Health Communities HRSN tool) or **ICD-10 Z55–Z65** social\nZ-codes. Whether SDoH belongs in the model *at all* depends on its causal role, which is the single most consequential and\nmost often-botched decision.\n\n**Core conceptual distinction.** SDoH can occupy three distinct causal positions for a given exposure–outcome\ncontrast, and the correct handling differs sharply across them; the roles can co-occur (a confounder may also modify the\neffect), so each must be assessed separately. (1) **Confounder** — neighborhood deprivation drives both who\ninitiates a therapy (access, formulary, cost-sharing) and the outcome (e.g., cardiovascular events): then SDoH belongs in\nthe propensity score or outcome model, and omitting it leaves residual confounding that age/sex/race cannot absorb. (2)\n**Mediator** — the exposure operates *through* a social condition (a copay-assistance program that works by relieving\nfinancial strain; a care-access intervention that changes where a patient lives or gets care): conditioning on the mediator\nblocks the very effect you are trying to estimate (over-adjustment) and can open a collider path. 
(3) **Effect modifier** —\nthe treatment effect itself differs by deprivation stratum (a digital-adherence tool that helps only patients with stable\nhousing): here SDoH belongs in an interaction term or a stratified/subgroup estimand, not merely as an additive adjustment.\nThe estimand must name which role SDoH plays *before* analysis; \"we adjusted for ADI\" is uninterpretable until the DAG says\nwhether ADI is a backdoor confounder or a pathway variable.\n\nEqually fundamental is the **ecological vs individual** distinction. An area-level index is a property of a *place*, not a\nperson; assigning a tract-level ADI to an individual is a deliberate proxy that imports **ecological-fallacy** error —\nwithin-area heterogeneity is enormous, and the bias is non-differential only if measurement error is unrelated to exposure,\nwhich is not guaranteed. The area measure is a *contextual* construct (the effect of living in a deprived neighborhood),\nnot a substitute for an individual's income or food security.\n\n**Pros, cons, and trade-offs** (specific and comparative).\n- **Area-level index (ADI/SVI/SDI) vs adjusting on demographics only (age/sex/race):** area indices capture access,\n  economic, and environmental drivers of disparities that demographics miss, improving confounding control and enabling\n  equity stratification — and they are available at population scale from ZIP/address alone. Cost: substantial measurement\n  error (a person is not their tract), ecological fallacy, and the temptation to \"control away\" disparities that are the\n  object of study. **Prefer** area indices when SDoH is a confounder and individual data are unavailable, which is the\n  usual claims situation.\n- **Area-level vs individual-level (PRAPARE / Z-codes):** individual screening measures the actual social need with far\n  less ecological error and supports needs-based targeting. 
Cost: capture is sparse and **differential** — Z55–Z65 coding\n  is driven by which systems screen and bill, so \"no Z-code\" overwhelmingly means \"not screened,\" not \"no need,\" and the\n  completeness correlates with the very deprivation being measured. **Prefer** individual-level when reliably captured\n  (integrated delivery systems with screening mandates); otherwise area-level is the more honest default.\n- **ADI vs SVI vs SDI:** ADI (Singh construction, 17 ACS variables, distributed via the Neighborhood Atlas as national\n  percentiles and state deciles) is the de facto RWE standard but is sensitive to housing-cost variables that skew rankings\n  in high-cost metros; SVI (4 themes, designed for disaster preparedness) emphasizes vulnerability and minority/language;\n  SDI (7 ACS variables) was built explicitly for healthcare utilization research. They are correlated but not\n  interchangeable — pre-specify one and report sensitivity to an alternative index.\n\n**When to use** (decision rules). Use SDoH operationalization when (a) deprivation/access plausibly confounds the exposure–outcome relation\nand is not captured by clinical covariates; (b) the analysis is explicitly about disparities or equity (FDA Diversity Action\nPlans, HTA equity-weighting); (c) prior evidence shows the treatment effect is modified by social context; or (d) SDoH is\nthe exposure of interest (impact of neighborhood deprivation on adherence, persistence, or access). 
Geocode to the finest\nreliable unit (census block group > tract > ZIP), use the **index-date** address (not the most-recent), and rank within the\nappropriate reference (national vs state) for the question.\n\n**When NOT to use — and when it is actively misleading or dangerous.**\n- **When SDoH is a mediator of the exposure effect.** Adjusting for neighborhood deprivation when evaluating a\n  copay-assistance or care-navigation intervention removes part of the causal effect (over-adjustment) and can induce\n  collider bias if the mediator shares a common cause with the outcome. Diagnose with the DAG, not reflex.\n- **When the goal is to \"explain away\" a disparity.** Adjusting an exposure–disparity contrast for SDoH can make a real,\n  actionable inequity disappear into a coefficient — statistically tidy, ethically and scientifically wrong if SDoH is on\n  the causal pathway from a structural exposure to the outcome.\n- **Z-code or screening completeness is differential.** Treating absent Z55–Z65 as \"no social need\" misclassifies the\n  majority of patients and biases toward the screened (often sicker, more engaged) population; an unadjusted individual-SDoH\n  analysis here is worse than an honest area-level proxy.\n- **Fine geography on small cells.** Block-group linkage on rare outcomes risks re-identification and unstable index values;\n  suppress or coarsen per the data-use agreement.\n\n**Data-source operational depth** (by source).\n- **Claims (FFS, MA, commercial):** the only SDoH signal is geographic — member ZIP (often ZIP5, sometimes ZIP9) linked to\n  an index. 
Failure modes: ZIP5 is a *postal* unit that straddles multiple census tracts, so a ZIP5→tract assignment must\n  use a population-weighted crosswalk (e.g., HUD USPS ZIP–tract) or be flagged ambiguous; PO-box and \"unique\" (single\n  large-recipient) ZIPs have no meaningful residential geography and must be dropped; addresses are captured at enrollment\n  and **go stale** — Medicare Advantage and commercial files often carry an enrollment-era address that no longer reflects\n  residence during follow-up, so use the **index-date** address and treat mid-follow-up moves as a sensitivity analysis.\n  MA-only person-time additionally lacks complete FFS claims, compounding any utilization-based SDoH proxy. No individual\n  social need is observable in claims without supplemental linkage.\n- **EHR:** can carry both a geocodable address (richer than claims) and individual screening / Z-codes. The trap is\n  **differential capture** — Z55–Z65 and PRAPARE fields appear only when a site screens and documents, so completeness is\n  a function of the health system, not the patient, and visit-driven EHR means patients who leave the system are\n  differentially missing. Treat absent SDoH fields as missing-not-at-random; do not impute \"0 = no need.\"\n- **Registry:** may collect structured social variables (insurance, education) more completely than claims but rarely at\n  fine geography; link to claims/ACS to add a contextual index and to a mortality source for complete follow-up.\n- **Linked claims–EHR–census:** the ideal substrate — individual screening + geocoded contextual index + complete\n  enrollment — but linkage selects the linkable subset (often more urban, more insured), and address/geocode quality must\n  be reconciled before assignment. 
Report the linked subset's representativeness.\n\n**Worked claims example.** Question: does residing in a high-deprivation neighborhood predict 12-month non-persistence to a\nnewly initiated chronic therapy, among adults with continuous enrollment? (1) Cohort: first qualifying fill = index date;\nrequire continuous medical+pharmacy enrollment for the baseline lookback so covariates are observable. (2) Geographic\nlinkage: take the member residential ZIP **as of the index date** (not the latest on file). If ZIP9 is present, map ZIP9 →\ncensus tract directly; if only ZIP5, apply a population-weighted ZIP5→tract crosswalk and, when one tract holds the clear\nmajority of the ZIP's population, assign it, otherwise flag the member as geographically ambiguous and exclude from the\nprimary analysis (retain for a sensitivity check). Drop PO-box and non-residential ZIPs. (3) Index value: obtain the\n**ADI national percentile** (1–100) from the Neighborhood Atlas; the Atlas publishes ADI at the census block group, so a\ntract-level value has to be derived from the tract's block groups (for example, a population-weighted summary) under a\npre-specified rule. Then cut into tertiles (or use the published decile). (4) Outcome: non-persistence = a gap > 60 days with no fill (last `days_supply` end + grace), measured over the\n12-month follow-up using `fill_date` and `days_supply`. (5) Model: because deprivation here is the **exposure**, do not\nadjust for downstream mediators (e.g., out-of-pocket cost on the causal path); adjust only for true confounders measured at\nbaseline (age, sex, plan type, comorbidity, calendar year). (6) Report the gradient across tertiles with a sensitivity\nanalysis substituting SVI for ADI and re-running with the most-recent (vs index-date) address to bound address-staleness\nbias.",
    "primary_category": "Exposure_Definition",
    "tags": [
      "sdoh",
      "social-determinants",
      "area-deprivation-index",
      "adi",
      "svi",
      "health-equity",
      "disparities",
      "confounder",
      "effect-modification",
      "covariate"
    ],
    "applies_to_study_types": [
      "new_user",
      "active_comparator_new_user",
      "cohort_retrospective",
      "claims_analysis",
      "ehr_study",
      "descriptive_epidemiology"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked",
      "census"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1056/NEJMp1802313",
        "url": "https://doi.org/10.1056/NEJMp1802313",
        "citation_text": "Kind AJH, Buckingham WR. Making neighborhood-disadvantage metrics accessible — the Neighborhood Atlas. New England Journal of Medicine. 2018;378(26):2456-2458.",
        "year": 2018,
        "authors_short": "Kind & Buckingham",
        "notes": "The standard RWE reference for area-level deprivation; introduces the Neighborhood Atlas that distributes the Area Deprivation Index as national percentiles and state deciles for census-block-group linkage to patient geography."
      },
      {
        "role": "introduce",
        "doi": "10.2105/ajph.93.7.1137",
        "url": "https://doi.org/10.2105/ajph.93.7.1137",
        "citation_text": "Singh GK. Area deprivation and widening inequalities in US mortality, 1969-1998. American Journal of Public Health. 2003;93(7):1137-1143.",
        "year": 2003,
        "authors_short": "Singh",
        "notes": "Original construction of the Area Deprivation Index from US Census/ACS factor analysis and its validated association with mortality; the methodological basis for the ADI used throughout RWE."
      },
      {
        "role": "explain",
        "doi": "10.5888/pcd13.160221",
        "url": "https://doi.org/10.5888/pcd13.160221",
        "citation_text": "Maroko AR, Doan TM, Arno PS, Hubel M, Yi S, Viola D. Integrating social determinants of health with treatment and prevention: a new tool to assess local area deprivation. Preventing Chronic Disease. 2016;13:E128.",
        "year": 2016,
        "authors_short": "Maroko et al.",
        "notes": "Practical treatment of constructing and applying small-area deprivation measures from ACS data, including geographic scale choices (tract vs ZIP) relevant to linkage with claims/EHR."
      },
      {
        "role": "demonstrate",
        "doi": "10.2202/1547-7355.1792",
        "url": "https://doi.org/10.2202/1547-7355.1792",
        "citation_text": "Flanagan BE, Gregory EW, Hallisey EJ, Heitgerd JL, Lewis B. A social vulnerability index for disaster management. Journal of Homeland Security and Emergency Management. 2011;8(1).",
        "year": 2011,
        "authors_short": "Flanagan et al.",
        "notes": "Defines the CDC/ATSDR Social Vulnerability Index and its four thematic domains; the canonical reference when SVI is used as the area-level SDoH measure instead of ADI."
      },
      {
        "role": "use",
        "doi": "10.1017/cts.2024.571",
        "url": "https://doi.org/10.1017/cts.2024.571",
        "citation_text": "Li C, Mowery DL, Ma X, et al. Realizing the potential of social determinants data in EHR systems: a scoping review of approaches for screening, linkage, extraction, analysis, and interventions. Journal of Clinical and Translational Science. 2024;8(1):e147.",
        "year": 2024,
        "authors_short": "Li et al.",
        "notes": "Scoping review of how individual-level SDoH (screening tools, Z-codes) are screened, extracted, and analyzed in EHR RWD, with explicit treatment of differential capture and missingness."
      }
    ],
    "plain_language_summary": "Social determinants of health (SDoH) are the non-medical conditions in which people are born, grow, and live — things like income, education, and neighborhood quality — that shape whether someone gets sick or stays well. In real-world studies these factors usually cannot be measured directly for each person, so researchers attach a neighborhood-level score (called an area-level deprivation index) to a patient's ZIP code and use that score as a stand-in. Because one neighborhood score is assigned to every patient who lives there, you always inherit some error: a high-deprivation ZIP contains both very poor and working-class households, and you can never know from the score alone which type of person you are actually studying. The most important judgment call is deciding what role SDoH plays in your analysis — if it causes both who gets treated and the outcome, you must account for it; if it is part of how the treatment works, accounting for it would actually hide the effect you are trying to measure.",
    "key_terms": [
      {
        "term": "area-level measurement",
        "definition": "A value assigned to a geographic unit such as a census tract or ZIP code (for example, median income or an index score) rather than measured directly on an individual person."
      },
      {
        "term": "individual-level measurement",
        "definition": "A value recorded for a specific person — for example, a patient's own screening answer about food insecurity — as opposed to a neighborhood average."
      },
      {
        "term": "ecological fallacy",
        "definition": "The error that occurs when you assume every person in a group (such as a census tract) shares the average characteristic of that group, when in reality individuals within the same neighborhood can differ widely."
      },
      {
        "term": "Area Deprivation Index (ADI)",
        "definition": "A published composite score built from 17 census variables (income, housing, employment, education) that ranks every U.S. census block group from 1 (least deprived) to 100 (most deprived) on a national scale."
      },
      {
        "term": "confounder",
        "definition": "A factor that influences both who receives a treatment and what outcome they experience, making it look like the treatment caused the outcome when it may not have — a confounder must be accounted for in the analysis."
      },
      {
        "term": "effect modifier",
        "definition": "A factor that changes the size or direction of a treatment effect in different subgroups — for example, if a digital medication reminder only helps patients who have stable housing, then housing stability is an effect modifier."
      }
    ],
    "worked_example": {
      "scenario": "A researcher wants to know whether patients newly started on a blood-pressure medication who live in high-deprivation neighborhoods are less likely to still be taking it 12 months later. Because the claims database only has each patient's ZIP code — not their actual income or education — the researcher uses ZIP-code-level SDoH proxies as area-level stand-ins. The table below shows five patients at treatment start (their index date), the neighborhood deprivation score attached to their ZIP, and whether they were still on medication at 12 months.",
      "dataset": {
        "caption": "Five patients at treatment start, with area-level SDoH proxies linked by ZIP code. ADI national percentile runs 1-100; higher = more deprived. Neighborhood deprivation index, median household income, and percent with a college degree all come from census data attached to the patient's ZIP — not from asking the patient directly.",
        "columns": [
          "person_id",
          "index_date",
          "zip_code",
          "adi_national_pct",
          "neighborhood_median_income_usd",
          "pct_college_degree",
          "still_on_med_12mo"
        ],
        "rows": [
          [
            1001,
            "2023-02-01",
            "02492",
            22,
            98000,
            64,
            "Yes"
          ],
          [
            1002,
            "2023-02-15",
            "02136",
            68,
            52000,
            31,
            "No"
          ],
          [
            1003,
            "2023-03-01",
            "02136",
            68,
            52000,
            31,
            "Yes"
          ],
          [
            1004,
            "2023-03-10",
            "02301",
            81,
            41000,
            19,
            "No"
          ],
          [
            1005,
            "2023-04-01",
            "02301",
            81,
            41000,
            19,
            "No"
          ]
        ]
      },
      "steps": [
        "The researcher pulls the residential ZIP code recorded for each patient at their index date (treatment start) — not their most-recent address on file, which may have changed.",
        "Each ZIP is linked to an area-level deprivation score: here the ADI national percentile, plus two underlying census variables (median income and college degree rate) that help show what the ADI is capturing.",
        "Patients 1002 and 1003 live in the same ZIP (02136) and therefore receive the exact same three area scores — but patient 1003 stayed on medication while patient 1002 did not, showing that the shared area score cannot distinguish individual circumstances.",
        "This is the ecological fallacy in action: within a single ZIP code, one patient persisted and one did not, yet both carry identical area-level SDoH values.",
        "Patients 1004 and 1005 live in the highest-deprivation ZIP (ADI 81) and both stopped the medication — suggesting that high neighborhood deprivation may be associated with lower persistence, but the sample is tiny and many other factors could explain this.",
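        "In this study neighborhood deprivation is the exposure of interest, so ADI enters the analysis model as the main predictor, with adjustment only for baseline confounders (age, sex, plan type, comorbidity) and not for downstream mediators such as out-of-pocket cost. In a different study comparing two therapies, the same ADI would instead act as a confounder, because deprivation affects both who fills a new prescription (access, cost) and whether they keep filling it (adherence, persistence), and it would then be included as a covariate.",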
        "If instead the study were evaluating a copay-assistance program that works specifically by reducing the financial strain caused by living in a deprived area, then neighborhood deprivation would sit on the pathway from the program to its effect — adjusting for ADI in that case would block the very mechanism you are trying to measure (over-adjustment)."
      ],
      "result": "In this five-patient illustration, 2 of 2 patients (100%) in the highest-deprivation ZIP stopped medication versus 1 of 2 (50%) in the mid-deprivation ZIP and 0 of 1 (0%) in the low-deprivation ZIP, a gradient consistent with an association between neighborhood deprivation and lower medication persistence. The key limitation is that patients who share a ZIP (1002 and 1003 in 02136; 1004 and 1005 in 02301) carry identical area scores regardless of their own income or education (ecological fallacy), so the ADI captures neighborhood context, not individual need. Five patients cannot support any inference; a real analysis would need a far larger cohort and formal adjustment for baseline confounders in a regression or propensity model."
    },
    "prerequisites": [
      "dags-backdoor-criterion-drug-studies",
      "descriptive-epidemiology-rwe",
      "cohort-retrospective"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Area Deprivation Index (ADI) — national percentile vs state decile",
        "description": "Singh-construction composite of 17 ACS socioeconomic variables, distributed via the Neighborhood Atlas at the census block group. National percentile (1-100) ranks against the whole US; state decile (1-10) ranks within a state, which is preferable for within-state policy/HTA comparisons but not comparable across states.",
        "edge_cases": [
          "National percentile is skewed by housing-cost variables in high-cost metros, where affluent areas can rank as \"deprived.\"",
          "Block-group values are suppressed for very small or institutional populations; collapse to tract or flag missing."
        ],
        "data_source_notes": "claims/EHR: geocode index-date address (or ZIP9) to block group, join to the Neighborhood Atlas file; pre-specify national vs state ranking and the cut (tertile/quintile/decile) before analysis.",
        "citations": [
          "kind-2018",
          "singh-2003"
        ]
      },
      {
        "name": "Social Vulnerability Index (SVI) — CDC/ATSDR, four themes",
        "description": "Census-tract index built from 16 ACS variables grouped into four themes (socioeconomic status, household composition/disability, minority status/language, housing type/transportation); reported as overall and per-theme percentile rankings.",
        "edge_cases": [
          "Designed for disaster preparedness, not pharmacoepidemiology; theme weighting may not align with the health construct of interest.",
          "Tract-level only (coarser than ADI block group), increasing ecological error in heterogeneous tracts."
        ],
        "data_source_notes": "claims/EHR: join geocoded tract to the biennial CDC SVI file; theme-level scores allow targeting a specific vulnerability mechanism rather than a single composite.",
        "citations": [
          "flanagan-2011"
        ]
      },
      {
        "name": "Individual-level SDoH (Z-codes / PRAPARE / AHC-HRSN)",
        "description": "Person-level social need captured via ICD-10 Z55-Z65 codes, the PRAPARE protocol, or the CMS Accountable Health Communities Health-Related Social Needs screening tool, recorded in EHR or (rarely) claims.",
        "edge_cases": [
          "Capture is differential and sparse — absence of a Z-code means \"not screened,\" not \"no need\"; treat as missing-not-at-random.",
          "Z-code billing incentives changed over time, creating spurious secular trends in apparent prevalence."
        ],
        "data_source_notes": "EHR: extract structured Z-codes and screening fields; do not impute absence as zero. claims: Z-codes appear inconsistently and should be validated against EHR where linked.",
        "citations": [
          "li-2024"
        ]
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Adjusting on demographics only (age, sex, race)",
        "pros_of_this": "Captures access, economic, and environmental drivers of disparities and residual confounding that demographics cannot absorb; available at scale from ZIP/address alone; enables equity stratification for FDA Diversity Action Plans and HTA equity weighting.",
        "cons_of_this": "Area indices carry substantial ecological measurement error (a person is not their tract); risk of over-adjustment or collider bias when SDoH lies on the causal pathway from exposure to outcome.",
        "when_to_prefer": "When deprivation/access plausibly confounds the exposure-outcome relation and individual social data are unavailable (the usual claims setting), or when the analysis is explicitly about equity."
      },
      {
        "compared_to": "Individual-level SDoH capture (PRAPARE / Z-codes)",
        "pros_of_this": "Area linkage is available for essentially all geocodable members at population scale, with no dependence on whether a site screens; captures the contextual neighborhood effect.",
        "cons_of_this": "Measures a place, not a person — ecological fallacy and within-area heterogeneity; cannot identify an individual's actual social need.",
        "when_to_prefer": "When individual screening is sparse or differentially captured; use individual-level only when screening is systematic (e.g., integrated systems with screening mandates)."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Only geographic SDoH is observable (member ZIP, often ZIP5). Use the index-date address, apply a population-weighted ZIP5-to-tract crosswalk when ZIP9 is absent, drop PO-box/non-residential ZIPs, and link to ADI/SVI/SDI. Addresses go stale (especially MA/commercial enrollment-era addresses); MA-only person-time also lacks complete FFS claims.",
      "ehr": "Carries geocodable addresses plus individual Z55-Z65 codes and PRAPARE/HRSN screening, but capture is differential and visit-driven; treat absent SDoH fields as missing-not-at-random and never impute absence as zero need.",
      "registry": "May capture structured social variables (insurance, education) but rarely fine geography; link to ACS for a contextual index and to a mortality source for complete follow-up.",
      "linked": "Linked claims-EHR-census is the ideal substrate (individual screening + geocoded index + complete enrollment) but linkage selects the linkable subset and address/geocode quality must be reconciled before index assignment."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\nimport numpy as np\n\ndef assign_area_sdoh(members: pd.DataFrame,\n                     zip_xwalk: pd.DataFrame,\n                     adi: pd.DataFrame,\n                     majority_threshold: float = 0.50) -> pd.DataFrame:\n    m = members.copy()\n\n    # Drop non-residential ZIPs (PO boxes, unique/large-recipient ZIPs have no residential geography).\n    m = m[m[\"zip_type\"] == \"standard\"].copy()\n    m[\"zip5\"] = m[\"resid_zip\"].str[:5]\n\n    # Population-weighted ZIP5 -> tract: keep the tract holding the largest population share of the ZIP.\n    xw = zip_xwalk.sort_values([\"zip5\", \"pop_weight\"], ascending=[True, False])\n    top = xw.groupby(\"zip5\", as_index=False).first()  # majority tract + its weight\n    top = top.rename(columns={\"census_tract\": \"census_tract\", \"pop_weight\": \"majority_weight\"})\n\n    m = m.merge(top[[\"zip5\", \"census_tract\", \"majority_weight\"]], on=\"zip5\", how=\"left\")\n    # Flag ZIP5s where no tract holds a clear majority of the population (geographically ambiguous).\n    m[\"ambiguous\"] = m[\"majority_weight\"].fillna(0) < majority_threshold\n\n    # Attach ADI national percentile and cut into deprivation tertiles (1=least, 3=most deprived).\n    m = m.merge(adi[[\"census_tract\", \"adi_natrank\"]], on=\"census_tract\", how=\"left\")\n    primary = m.loc[~m[\"ambiguous\"] & m[\"adi_natrank\"].notna()].copy()\n    primary[\"adi_tertile\"] = pd.qcut(primary[\"adi_natrank\"], q=3, labels=[1, 2, 3]).astype(\"Int64\")\n\n    out = m.merge(primary[[\"person_id\", \"adi_tertile\"]], on=\"person_id\", how=\"left\")\n    return out[[\"person_id\", \"index_date\", \"census_tract\", \"adi_natrank\", \"adi_tertile\", \"ambiguous\"]]",
        "description": "Operationalize area-level SDoH (ADI national percentile -> tertile) from claims-style inputs. Required inputs\n(already cleaned):\n  members  : person_id, index_date (datetime), resid_zip (str, ZIP5 or ZIP9), zip_type in {'standard','pobox','unique'}\n  zip_xwalk: zip5, census_tract, pop_weight   # population-weighted ZIP5->tract crosswalk (e.g., HUD USPS)\n  adi      : census_tract, adi_natrank        # ADI national percentile 1-100 (Neighborhood Atlas)\nReturns one row per member with the assigned tract, ADI national percentile, deprivation tertile, and an\n'ambiguous' flag for ZIP5s with no clear majority tract. Use the index-date address only; non-residential ZIPs are\ndropped from the primary analysis.",
        "dependencies": [
          "pandas",
          "numpy"
        ],
        "source_citations": [
          "kind-2018"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\n\nassign_area_sdoh <- function(members, zip_xwalk, adi, majority_threshold = 0.50) {\n  setDT(members); setDT(zip_xwalk); setDT(adi)\n\n  # Keep residential ZIPs only; PO-box / unique ZIPs have no residential geography.\n  m <- members[zip_type == \"standard\"]\n  m[, zip5 := substr(resid_zip, 1L, 5L)]\n\n  # Population-weighted ZIP5 -> tract: majority tract per ZIP5.\n  setorder(zip_xwalk, zip5, -pop_weight)\n  top <- zip_xwalk[, .SD[1L], by = zip5][, .(zip5, census_tract, majority_weight = pop_weight)]\n\n  m <- merge(m, top, by = \"zip5\", all.x = TRUE)\n  m[, ambiguous := fifelse(is.na(majority_weight) | majority_weight < majority_threshold, TRUE, FALSE)]\n\n  # Attach ADI national percentile; tertile only on unambiguous, non-missing rows (1=least, 3=most deprived).\n  m <- merge(m, adi[, .(census_tract, adi_natrank)], by = \"census_tract\", all.x = TRUE)\n  ok <- !m$ambiguous & !is.na(m$adi_natrank)\n  m[ok, adi_tertile := cut(adi_natrank,\n                           breaks = quantile(adi_natrank, probs = seq(0, 1, 1/3), na.rm = TRUE),\n                           labels = c(1L, 2L, 3L), include.lowest = TRUE)]\n  m[, .(person_id, index_date, census_tract, adi_natrank, adi_tertile, ambiguous)]\n}",
        "description": "Area-level SDoH assignment (ADI national percentile -> tertile) with data.table. Inputs mirror the Python version:\n  members  : person_id, index_date (Date), resid_zip (char), zip_type in {'standard','pobox','unique'}\n  zip_xwalk: zip5, census_tract, pop_weight   # population-weighted ZIP5->tract crosswalk\n  adi      : census_tract, adi_natrank        # ADI national percentile 1-100\nReturns one row per member with tract, ADI percentile, deprivation tertile, and an ambiguity flag.",
        "dependencies": [
          "data.table"
        ],
        "source_citations": [
          "kind-2018"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* Keep residential ZIPs and derive ZIP5 from the index-date address. */\ndata members5;\n  set work.members;\n  where zip_type = 'standard';\n  zip5 = substr(resid_zip, 1, 5);\nrun;\n\n/* Population-weighted ZIP5 -> tract: pick the tract with the largest population share per ZIP5. */\nproc sort data=work.zip_xwalk out=xw; by zip5 descending pop_weight; run;\ndata top;\n  set xw; by zip5;\n  if first.zip5;                      /* majority tract + its weight */\n  majority_weight = pop_weight;\n  keep zip5 census_tract majority_weight;\nrun;\n\nproc sql;\n  create table members_tract as\n  select m.person_id, m.index_date, t.census_tract, t.majority_weight,\n         (t.census_tract is null or t.majority_weight < 0.50) as ambiguous\n  from members5 m\n  left join top t\n    on m.zip5 = t.zip5;\nquit;\n\n/* Attach ADI national percentile. */\nproc sql;\n  create table members_adi as\n  select mt.*, a.adi_natrank\n  from members_tract mt\n  left join work.adi a\n    on mt.census_tract = a.census_tract;\nquit;\n\n/* Deprivation tertile only on unambiguous, non-missing rows (GROUPS=3 -> 0=least, 2=most deprived). */\ndata primary;\n  set members_adi;\n  where ambiguous = 0 and adi_natrank ne .;\nrun;\nproc rank data=primary out=primary_tert groups=3;\n  var adi_natrank;\n  ranks adi_tertile;\nrun;\n\nproc sql;\n  create table sdoh_assigned as\n  select ma.person_id, ma.index_date, ma.census_tract, ma.adi_natrank,\n         pt.adi_tertile, ma.ambiguous\n  from members_adi ma\n  left join primary_tert pt\n    on ma.person_id = pt.person_id;\nquit;",
        "description": "Area-level SDoH assignment in SAS: index-date address -> population-weighted ZIP5->tract -> ADI national percentile ->\ndeprivation tertile. Required input datasets (post data-management):\n  work.members   : person_id, index_date, resid_zip, zip_type ('standard'/'pobox'/'unique')\n  work.zip_xwalk : zip5, census_tract, pop_weight   (population-weighted ZIP5->tract crosswalk)\n  work.adi       : census_tract, adi_natrank        (ADI national percentile 1-100)\nPROC RANK with GROUPS=3 produces the deprivation tertile (0,1,2 -> least..most deprived). Ambiguous ZIP5s (no majority\ntract) are flagged and excluded from the primary tertile assignment.",
        "dependencies": [],
        "source_citations": [
          "kind-2018"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  Addr[Index-date residential address / ZIP] --> Geo[Geocoder]\n  Geo --> Tract[Census block group / tract]\n  Tract --> Lookup[ADI / SVI / SDI lookup<br/>by geographic unit]\n  Lookup --> Rank[National percentile<br/>or state decile]\n  Rank --> Cut[Deprivation tertile / quintile<br/>analysis covariate]\n  Addr -. PO-box / unique ZIP .-> Drop[Drop: no residential geography]\n  Tract -. ZIP5 spans many tracts .-> Amb[Flag ambiguous:<br/>pop-weighted majority or exclude]",
        "caption": "Area-level SDoH data flow from patient geography to an analysis-ready deprivation covariate, with the two operational failure points (non-residential ZIPs and ZIP5-to-tract ambiguity) that must be handled before assignment.",
        "alt_text": "Flowchart from index-date address through geocoding, census tract, ADI/SVI/SDI lookup, percentile ranking, and tertile cut, with branches for dropping PO-box ZIPs and flagging ambiguous ZIP5-to-tract mappings.",
        "source_type": "illustrative",
        "source_citations": [
          "kind-2018"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  subgraph Confounder [SDoH as confounder: ADJUST]\n    S1[SDoH / deprivation] --> E1[Exposure / treatment]\n    S1 --> O1[Outcome]\n    E1 --> O1\n  end\n  subgraph Mediator [SDoH as mediator: do NOT adjust]\n    E2[Exposure / intervention] --> S2[SDoH / financial strain]\n    S2 --> O2[Outcome]\n    E2 --> O2\n  end\n  subgraph Modifier [SDoH as effect modifier: STRATIFY / interact]\n    E3[Exposure] --> O3[Outcome]\n    S3[SDoH stratum] -. modifies the exposure effect .-> O3\n  end",
        "caption": "The three causal roles SDoH can play for a given exposure-outcome contrast. Adjusting is correct only for the confounder role; adjusting on a mediator causes over-adjustment, and effect modification requires stratification or an interaction, not additive adjustment.",
        "alt_text": "Three small directed acyclic graphs showing SDoH as a confounder (arrows to both exposure and outcome, adjust), as a mediator (exposure to SDoH to outcome, do not adjust), and as an effect modifier (SDoH modifying the exposure-outcome arrow, stratify or interact).",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "used_with",
        "target_slug": "propensity-score-methods-psm-iptw",
        "notes": "When SDoH is a confounder, the area-level deprivation covariate enters the propensity score (matching, IPTW, or overlap weighting) alongside clinical covariates measured at baseline."
      },
      {
        "relation_type": "used_with",
        "target_slug": "high-dimensional-propensity-score-hdps-rwe",
        "notes": "Area-level SDoH supplements hdPS as a contextual proxy for unmeasured access/socioeconomic confounding not captured by claims-derived diagnosis/procedure/drug proxies."
      },
      {
        "relation_type": "requires",
        "target_slug": "dags-backdoor-criterion-drug-studies",
        "notes": "Whether SDoH should be adjusted, left alone (mediator), or stratified (effect modifier) is a DAG/backdoor-criterion decision that must precede analysis; \"we adjusted for ADI\" is uninterpretable without the causal role."
      },
      {
        "relation_type": "affects",
        "target_slug": "causal-mediation-effect-modification-rwe",
        "notes": "SDoH is a canonical mediator and effect modifier; mediation and interaction methods determine the correct estimand when SDoH lies on the pathway or modifies the treatment effect rather than confounding it."
      },
      {
        "relation_type": "see_also",
        "target_slug": "generalizability-transportability-external-validity-rwe",
        "notes": "SDoH distribution differences between the study sample and target population drive transportability of effect estimates, especially when the treatment effect is modified by social context."
      },
      {
        "relation_type": "see_also",
        "target_slug": "medicare-ffs-ma-commercial-claims-differences-rwe",
        "notes": "Address staleness and incomplete FFS person-time in Medicare Advantage directly degrade the geographic linkage that area-level SDoH depends on."
      },
      {
        "relation_type": "see_also",
        "target_slug": "pdc-proportion-of-days-covered",
        "notes": "SDoH (transportation, cost-sharing, health literacy) are key upstream drivers of the adherence/persistence gaps that measures like PDC quantify."
      }
    ],
    "aliases": [
      "SDoH",
      "SDOH",
      "social determinants of health",
      "area deprivation index",
      "ADI",
      "social vulnerability index",
      "SVI",
      "neighborhood deprivation",
      "health equity factors"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "selection-bias-sensitivity-analysis-rwe",
    "name": "Selection Bias Sensitivity Analysis",
    "short_definition": "A quantitative bias analysis that, for a study in which inclusion or retention depends on exposure and outcome (a collider), specifies the selection mechanism and computes how much the exposure-outcome estimate would change under plausible selection-bias parameters or under inverse-probability-of-selection weighting.",
    "long_description": "**Selection bias** in real-world evidence (RWE) is the bias that arises when being *in the analysis* — present in\nthe database, meeting the cohort definition, surviving the continuous-enrollment requirement, being captured by EHR\nvisits, getting linked, having complete covariates, or remaining under follow-up — is a common effect of exposure and\noutcome (or of their causes). In the structural language of Hernán, Hernández-Díaz, and Robins, conditioning on such a\nselection indicator S opens a non-causal path because S is a **collider** on the path exposure → S ← outcome (or\nexposure → S ← U → outcome). Restricting the analysis to S = 1 is exactly that conditioning, so the association you\nestimate is contaminated whether or not you ever fit a model term for S. A **selection-bias sensitivity analysis** does\nnot pretend the data fix this; it makes the selection mechanism explicit and answers a quantitative question: *given\nplausible assumptions about how selected and unselected patients differ, how far — and in which direction — could the\nreported effect move, and is there a setting that overturns the conclusion?*\n\nTwo families of methods do this work. The first is the **bias-parameter / tipping-point** family (Greenland 1996;\nLash, Fox, Fink 2014; Smith and VanderWeele 2019): posit selection-association parameters — e.g., how strongly exposure\npredicts selection and how strongly the outcome predicts selection among the selected-out — and either deterministically\nrecompute the corrected estimate over a grid (a tipping-point sweep) or draw the bias parameters from prior\ndistributions and propagate them by Monte Carlo (probabilistic bias analysis, PBA). Smith and VanderWeele give a closed-form\n**bounding factor** that yields the maximum selection bias on the risk-ratio scale for a given pair of selection\nassociations, the selection-bias analogue of the E-value. 
The second family is **reweighting**: when the variables that\ndrive selection are measured, fit a model for the probability of selection and analyze the selected sample with\n**inverse-probability-of-selection weights (IPSW)**, the cross-sectional analogue of inverse-probability-of-censoring\nweighting (IPCW) used for attrition. The bias-parameter family is the only option when the selection driver is\nunmeasured (the usual case for who-enters-the-database); IPSW is preferred when you can credibly model selection.\n\n**Core conceptual distinction**. Selection bias is structurally distinct from confounding, and the distinction governs\nthe fix. Confounding is a *common cause* of exposure and outcome and is in principle removable by adjustment, matching,\nor weighting on measured confounders. Selection bias is created by *conditioning on a common effect* (the collider S);\nadjusting for more baseline confounders does not remove it and can make it worse if the added covariate is itself a\ncollider descendant. The estimand must therefore be stated as a contrast in a *target population* (e.g., all initiators\nmeeting clinical eligibility), not in the *selected* sample — IPSW transports the selected-sample estimate back to the\ntarget, while a tipping-point analysis bounds the gap between the two. Note also which selection mechanism you are\ntreating: cross-sectional selection-into-sample (enrollment, linkage, complete-case) is handled by IPSW or a\nbias-parameter sweep, whereas selection *during* follow-up (differential attrition) is the time-to-event analogue and is\nhandled by IPCW with selection sensitivity layered on the censoring weights.\n\n**Pros, cons, and trade-offs**\n- **vs simply reporting the complete-case / selected-sample estimate:** the sensitivity analysis quantifies a bias the\n  primary analysis silently assumes away. 
Cost: it requires explicit, debatable bias parameters or a selection model; a\n  poorly justified prior can manufacture false reassurance or false alarm. **Prefer it** whenever selection plausibly\n  depends on both exposure and outcome.\n- **Bias-parameter / tipping-point + PBA vs IPSW:** the bias-parameter approach needs no measurement of the selection\n  driver and directly answers \"how strong would selection have to be to overturn this?\" — the right tool when entry into\n  the database is governed by unmeasured factors. Cost: the parameters are assumptions, not estimates, and the bounding\n  factor gives a worst case, not the actual bias. IPSW *estimates* the correction from data but is only valid if\n  selection is captured by measured covariates (a no-unmeasured-selection assumption) and the weight model is correct;\n  extreme weights inflate variance and can be unstable. **Prefer IPSW** when the selection variables are observed and\n  rich (e.g., linked-vs-unlinked predictors); **prefer the bias-parameter family** when they are not, and report both\n  when feasible.\n- **vs the E-value / unmeasured-confounding sensitivity analysis:** the E-value bounds *confounding*; the\n  Smith-VanderWeele selection bounding factor is its selection-bias counterpart and answers a different question.\n  Using an E-value to \"cover\" selection bias is a category error. **Prefer the selection bounding factor** when the\n  threat is collider/selection structure rather than an omitted common cause.\n- **vs IPCW alone for attrition:** IPCW reweights for informative censoring under a no-unmeasured-censoring assumption;\n  a selection sensitivity analysis stress-tests that assumption by varying outcome risk among the selected-out. **Use\n  them together** — IPCW as the primary handling, selection sensitivity as the assumption check.\n\n**When to use**. 
Use a selection-bias sensitivity analysis whenever (a) cohort entry or retention plausibly depends on\nboth exposure and outcome or their common causes — continuous-enrollment restrictions, mortality/registry linkage,\nEHR visit-based capture, complete-case covariate filters, or differential loss to follow-up; (b) a regulator or HTA\nreviewer will ask \"what if the excluded patients differed?\"; or (c) the primary analysis discards a non-trivial,\npotentially non-random fraction of the source population. It is mandatory rather than optional when the funnel from\nsource to analytic cohort is steep and exposure-related.\n\n**When NOT to use — and when it is actively misleading or dangerous**\n- **When selection cannot depend on the outcome (or its causes).** If inclusion is governed only by exposure-independent\n  administrative facts unrelated to outcome risk, there is no collider and a selection sensitivity analysis is\n  theater — spend the effort on confounding instead.\n- **As a substitute for fixing a fixable problem.** If the selection driver is *measured*, the honest move is IPSW (or\n  redefining the cohort), not a hand-tuned bias-parameter sweep chosen to land where you want.\n- **With self-serving priors.** PBA with priors quietly centered to produce a null-crossing (or a \"robust\" away-from-null)\n  result is worse than no analysis: it launders an assumption as evidence. 
Priors must be pre-specified and defensible.\n- **Misapplied to confounding or measurement error.** The bounding factor and IPSW here address conditioning on a\n  collider; using them to claim robustness to unmeasured *confounding* or outcome misclassification overstates what was\n  tested.\n- **When the weight model is badly misspecified or positivity fails.** If some target-population strata have near-zero\n  estimated probability of selection, IPSW weights explode and the \"corrected\" estimate is driven by a handful of\n  observations — the cure is worse than the disease.\n\n**Data-source operational depth**.\n- **Claims (FFS vs MA):** The dominant selection step is the **continuous-enrollment / washout requirement**. In\n  Medicare, fee-for-service (Parts A/B/D) yields complete encounter and pharmacy capture, but **Medicare Advantage**\n  enrollees generate encounter records that are notoriously incomplete in some files, so \"no qualifying claim\" can be\n  *missingness*, not a true absence of the event or fill. If MA penetration differs by the regions/employers where one\n  treatment dominates, exclusion of MA-only person-time becomes exposure-related selection. Failure mode: requiring 365\n  days of continuous enrollment differentially drops the sicker arm (who churn coverage faster after a diagnosis).\n  Workaround: report retained-vs-excluded baseline characteristics, restrict to A/B/D (or commercial medical+pharmacy)\n  enrollees, and run a selection sweep on outcome risk among the excluded. 
**Differential competing risks in elderly\n  claims** also drive selection: if one arm has higher early mortality, survivors selected into longer follow-up are a\n  biased set — pair cause-specific vs subdistribution estimands with the sensitivity analysis.\n- **EHR:** Selection is **visit-driven capture** — a patient is \"observed\" only when they generate an encounter, so\n  sicker patients are over-observed and patients who leave the system (move, switch providers, die out-of-network) are\n  differentially lost. Complete-case filters on labs/vitals/staging compound this because missingness in EHR is itself\n  informative (a lab is ordered because of clinical suspicion). Workaround: define observation windows explicitly, model\n  the probability of being a complete case for IPSW, and vary outcome rates among the dropped-out in the sweep.\n- **Registry:** Participation/consent and site enrollment select on disease severity and access. Adjudicated outcomes are\n  a strength, but the enrolled cohort may not represent the source population. Workaround: link to claims/death index to\n  characterize non-participants where possible; otherwise bound with bias parameters.\n- **Linked claims–EHR–vital records:** Linkage itself is the selection mechanism — only the linkable subset is analyzed,\n  and linkage probability correlates with age, insurance stability, and data completeness, all of which track outcome\n  risk. **Immortal time in procedure/linkage studies** can masquerade as selection if the linkage requires surviving to\n  a downstream record. 
Workaround: compare linked vs unlinked on every available variable, weight by inverse probability\n  of linkage, and sweep outcome rates among unlinked records.\n\n**Worked claims example.** Question: 1-year risk of hospitalized heart failure with drug A vs active comparator B among\nadults with type 2 diabetes in a commercial + Medicare FFS database, observed risk difference RD_obs = −0.04 (A lower;\nobserved risks p_A = 0.08 and p_B = 0.12 among the included).\nCohort construction requires 365 days of continuous A/B/D (or commercial medical+pharmacy) enrollment before the first\nqualifying fill (`fill_date`, `days_supply`, NDC → arm), a drug-free washout for incident-user status, and ≥1 year of\npotential follow-up. That last requirement is a **selection-into-analysis** step: patients who disenroll early are\ndropped, and disenrollment differs by arm (worse-controlled patients churn faster). Suppose 18% of the A arm\nand 12% of the B arm are excluded for incomplete follow-up, and you fear the excluded had *higher* HF risk.\n(1) *Tipping-point sweep:* let r_A and r_B be the 1-year HF risk among the excluded in each arm; the corrected RD over\nthe full eligible cohort is (0.82·p_A + 0.18·r_A) − (0.88·p_B + 0.12·r_B) = RD_obs + 0.18·r_A − 0.12·r_B. Sweeping\nr_A, r_B over 0.05–0.30 shows the estimate flips sign only when 0.18·r_A − 0.12·r_B exceeds 0.04, that is, once\nexcluded-A risk exceeds excluded-B risk by roughly 0.22 (e.g., r_A = 0.30 vs r_B = 0.08), a clinically\nimplausible gap, supporting robustness. (2) *IPSW, if the selection driver is observed:* model P(complete follow-up |\nbaseline age, prior HF, comorbidity count, prior utilization, MA-region proxy, arm); stabilized weights\nsw = P(selected) / P(selected | covariates) reweight the analyzed cohort back to all eligible initiators; refit the risk\nmodel weighted and compare to RD_obs, truncating weights at the 1st/99th percentile and reporting the truncated and\nuntruncated estimates. 
(3) *Bounding factor:* with selection-exposure association RR_AS and selection-outcome\nassociation RR_US, the Smith-VanderWeele factor BF = (RR_AS·RR_US)/(RR_AS+RR_US−1) gives the maximum factor by which the\nselected-sample risk ratio can differ from the target; report the BF needed to move the CI across the null.\n\n**Interpreting the output**\n\nIn the Drug A versus Drug B heart-failure study, the observed risk difference = −0.04 (Arm A\n8%, Arm B 12%). Selection analysis shows: Drug A, 820 included and 180 excluded (18%); Drug B,\n880 included and 120 excluded (12%). The tipping grid reveals the estimate reverses sign only\nwhen excluded Drug A patients have an event rate approximately 0.22 higher than excluded Drug\nB patients (for example, 30% versus 8%).\n\n*(1) Formal interpretation.* The tipping point quantifies how extreme differential\nselection would need to be to explain the observed −0.04 RD. In the tipping-point\ncalculation, the excluded fraction in each arm and the outcome risk among the\nexcluded jointly determine bias magnitude. Because Drug A has proportionally more\nexcluded patients (18% vs 12%), a scenario where those patients are sicker is plausible but\nmust be evaluated specifically against clinical knowledge of why patients were excluded from\neach arm. The Smith-VanderWeele bounding factor additionally depends on the selection-exposure\nassociation; reporting the BF needed to shift the CI limit to the null anchors the assessment\nfor a regulatory audience.\n\n*(2) Practical interpretation.* A required event-rate differential of approximately 0.22 is\nlarge: nearly twice the observed risk in the included Drug B group (12%). A clinical\nreviewer would need to posit that excluded Drug A patients had nearly four times the event\nrate of excluded Drug B patients for selection bias alone to account for the protective\nassociation. 
Most reviewers would judge that implausible in the absence of a specific\nmechanism, lending moderate robustness to the −0.04 finding.",
    "primary_category": "Bias_Control",
    "tags": [
      "selection-bias",
      "collider-bias",
      "attrition",
      "linkage-bias",
      "complete-case",
      "ipsw",
      "qba",
      "tipping-point",
      "sensitivity-analysis"
    ],
    "applies_to_study_types": [
      "cohort_retrospective",
      "claims_analysis",
      "ehr_study",
      "linked_data",
      "target_trial_emulation"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1097/01.ede.0000135174.63482.43",
        "url": "https://doi.org/10.1097/01.ede.0000135174.63482.43",
        "citation_text": "Hernán MA, Hernández-Díaz S, Robins JM. A structural approach to selection bias. Epidemiology. 2004;15(5):615-625.",
        "year": 2004,
        "authors_short": "Hernán et al.",
        "notes": "Canonical structural framing of selection bias as conditioning on a collider; the DAG logic that every selection sensitivity analysis is built to stress-test."
      },
      {
        "role": "explain",
        "doi": "10.1093/ije/25.6.1107",
        "url": "https://doi.org/10.1093/ije/25.6.1107",
        "citation_text": "Greenland S. Basic methods for sensitivity analysis of biases. International Journal of Epidemiology. 1996;25(6):1107-1116.",
        "year": 1996,
        "authors_short": "Greenland",
        "notes": "Original bias-parameter formalism for deterministic sensitivity analysis of selection, confounding, and misclassification — the basis for the tipping-point sweep."
      },
      {
        "role": "explain",
        "doi": "10.1093/ije/dyu149",
        "url": "https://doi.org/10.1093/ije/dyu149",
        "citation_text": "Lash TL, Fox MP, MacLehose RF, Maldonado G, McCandless LC, Greenland S. Good practices for quantitative bias analysis. International Journal of Epidemiology. 2014;43(6):1969-1985.",
        "year": 2014,
        "authors_short": "Lash et al.",
        "notes": "Consensus good-practice guidance for deterministic and probabilistic (Monte Carlo) bias analysis, including selection bias; defines how to choose and report bias parameters."
      },
      {
        "role": "demonstrate",
        "doi": "10.1097/EDE.0000000000001032",
        "url": "https://doi.org/10.1097/EDE.0000000000001032",
        "citation_text": "Smith LH, VanderWeele TJ. Bounding bias due to selection. Epidemiology. 2019;30(4):509-516.",
        "year": 2019,
        "authors_short": "Smith & VanderWeele",
        "notes": "Derives the closed-form selection bounding factor (the selection-bias analogue of the E-value) that underwrites a single-number tipping-point statement on the risk-ratio scale."
      },
      {
        "role": "use",
        "doi": "10.1093/aje/kwm324",
        "url": "https://doi.org/10.1093/aje/kwm324",
        "citation_text": "Suissa S. Immortal time bias in pharmacoepidemiology. American Journal of Epidemiology. 2008;167(4):492-499.",
        "year": 2008,
        "authors_short": "Suissa",
        "notes": "Applied pharmacoepidemiology exemplar showing how time-alignment and risk-set selection distort claims-based estimates, motivating selection-aware design and sensitivity analysis."
      }
    ],
    "plain_language_summary": "A selection bias sensitivity analysis asks: if the people who ended up in a study were not a random slice of all eligible patients, how wrong could our result be? When who gets into the study depends on both what treatment they received and how sick they were, the comparison is already tilted before any statistics are run. This analysis deliberately varies assumptions about how different the excluded patients were, then recalculates the effect estimate to find the range of plausible true answers. A small, stable range means the finding is robust; an estimate that flips direction under modest assumptions signals fragility.",
    "key_terms": [
      {
        "term": "selection bias",
        "definition": "Distortion in a study result that occurs when who gets included or stays in the analysis depends on both the treatment being studied and the health outcome being measured."
      },
      {
        "term": "selection probability",
        "definition": "The chance that a patient who was eligible for the study actually ended up in the analyzed dataset, which can differ by treatment arm and health status."
      },
      {
        "term": "bounding",
        "definition": "Calculating the worst-case upper and lower limits of how much a study result could shift if the excluded patients differed from the included ones in a plausible way."
      },
      {
        "term": "collider",
        "definition": "A variable that is caused by two other variables at once; in selection bias, being in the study is caused by both treatment and outcome, so restricting the analysis to study participants distorts the comparison."
      }
    ],
    "worked_example": {
      "scenario": "A claims study compares 1-year hospitalization risk for drug A versus drug B in adults with type 2 diabetes. The observed risk difference is -0.04 (drug A appears lower risk). To stay in the study, patients had to remain continuously enrolled for the full year. The analyst notices that 18% of drug A patients and 12% of drug B patients were dropped for incomplete follow-up. Sicker patients churn coverage faster, so the excluded patients probably had higher hospitalization risk than the included ones. The sensitivity analysis asks: how bad would the outcome rates in the excluded groups have to be before the conclusion reverses?",
      "dataset": {
        "caption": "Summary of included and excluded patients by arm, as seen in the cohort-construction log.",
        "columns": [
          "arm",
          "n_included",
          "n_excluded",
          "pct_excluded",
          "obs_risk_included"
        ],
        "rows": [
          [
            "A",
            820,
            180,
            "18%",
            0.08
          ],
          [
            "B",
            880,
            120,
            "12%",
            0.12
          ]
        ]
      },
      "steps": [
        "The observed risk difference among included patients is 0.08 - 0.12 = -0.04 (arm A looks better).",
        "Let r_A be the unknown 1-year hospitalization risk among the 180 excluded arm-A patients, and r_B be the same for the 120 excluded arm-B patients.",
        "The corrected risk for arm A across all 1000 eligible patients is: (0.82 x 0.08) + (0.18 x r_A) = 0.0656 + 0.18 x r_A.",
        "The corrected risk for arm B across all 1000 eligible patients is: (0.88 x 0.12) + (0.12 x r_B) = 0.1056 + 0.12 x r_B.",
        "The corrected risk difference is (0.0656 + 0.18 x r_A) - (0.1056 + 0.12 x r_B), which simplifies to -0.04 + (0.18 x r_A - 0.12 x r_B).",
        "Vary r_A and r_B over a plausible range (0.10 to 0.30) and check whether the corrected difference crosses zero (sign flip)."
      ],
      "result": "Tipping-point grid: the estimate flips sign only when excluded arm-A risk exceeds excluded arm-B risk by roughly 0.22 units (e.g., r_A=0.30, r_B=0.08). A gap that large is clinically implausible given baseline similarity, so the conclusion that drug A has lower risk is judged robust to this source of selection bias.\n\n| r_A (excl. arm A) | r_B (excl. arm B) | Corrected RD | Sign flip? |\n|---|---|---|---|\n| 0.10 | 0.10 | -0.03 | No |\n| 0.20 | 0.10 | +0.01 | Yes |\n| 0.30 | 0.20 | +0.02 | Yes |\n| 0.10 | 0.20 | -0.05 | No |"
    },
    "prerequisites": [
      "cohort-retrospective",
      "database-feasibility-attrition-funnel-rwe",
      "quantitative-bias-analysis-toolkit-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Attrition / loss-to-follow-up selection sensitivity",
        "description": "Vary the outcome risk among patients selected out by early disenrollment or loss to follow-up, separately by arm, and recompute the corrected estimate (the cross-sectional companion to IPCW for informative censoring).",
        "edge_cases": [
          "Differential disenrollment by arm is the common case; symmetric assumptions understate the bias.",
          "Death as a competing risk can be confused with administrative censoring — specify the estimand first."
        ],
        "data_source_notes": "claims: use disenrollment, death, and end-of-coverage dates; EHR: use last-contact date and the visit-generation process to define who is observed."
      },
      {
        "name": "Complete-case selection sensitivity",
        "description": "Vary outcome or exposure distributions among records excluded for missing covariates, or fit IPSW for the probability of being a complete case when the missingness drivers are observed.",
        "data_source_notes": "Critical when labs, cancer stage, BMI, smoking, or PROs are missing-not-at-random; EHR missingness is informative because tests are ordered for a reason."
      },
      {
        "name": "Linkage / enrollment selection sensitivity",
        "description": "Assess how non-links, false links, or linked-sample enrichment could change results; weight by inverse probability of linkage when linkage predictors are available.",
        "data_source_notes": "Relevant for registry-claims, mortality, mother-infant, and EHR-claims linkage; linkage probability tracks age, insurance stability, and data completeness, all correlated with outcome risk."
      },
      {
        "name": "Tipping-point / bounding-factor analysis",
        "description": "Sweep the selection-exposure and selection-outcome associations over a grid (deterministic) or draw them from priors (probabilistic bias analysis), and report the Smith-VanderWeele bounding factor needed to overturn the conclusion.",
        "edge_cases": [
          "Priors must be pre-specified; tuning them toward a desired result is a misuse.",
          "The bounding factor is a worst case, not the actual bias."
        ],
        "data_source_notes": "The only option when the selection driver is unmeasured (e.g., who enters the database)."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Reporting only the complete-case / selected-sample estimate",
        "pros_of_this": "Quantifies a bias the primary analysis silently assumes away and produces a defensible robustness or fragility statement for reviewers.",
        "cons_of_this": "Requires explicit, debatable bias parameters or a selection model; poorly justified inputs can manufacture false reassurance or false alarm.",
        "when_to_prefer": "Whenever cohort entry or retention plausibly depends on both exposure and outcome."
      },
      {
        "compared_to": "Inverse-probability-of-selection weighting (IPSW)",
        "pros_of_this": "The bias-parameter / tipping-point approach needs no measurement of the selection driver and answers \"how strong would selection have to be to overturn this?\".",
        "cons_of_this": "Bias parameters are assumptions, not estimates; the bounding factor gives a worst case rather than the realized bias.",
        "when_to_prefer": "When the selection variables are unmeasured (typically, who enters the database). Use IPSW instead when those variables are observed and rich."
      },
      {
        "compared_to": "E-value / unmeasured-confounding sensitivity analysis",
        "pros_of_this": "Targets the actual structure (conditioning on a collider) rather than an omitted common cause; the selection bounding factor is the correct counterpart to the E-value.",
        "cons_of_this": "Does not address unmeasured confounding or outcome misclassification — those need their own analyses.",
        "when_to_prefer": "When the threat is selection/collider structure rather than confounding."
      },
      {
        "compared_to": "IPCW for attrition used alone",
        "pros_of_this": "Stress-tests the no-unmeasured-censoring assumption that IPCW relies on, rather than assuming it holds.",
        "cons_of_this": "Adds bias parameters on top of the censoring model; more moving parts to specify and report.",
        "when_to_prefer": "Use together — IPCW as primary handling of informative censoring, selection sensitivity as the assumption check."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Selection is dominated by the continuous-enrollment/washout requirement; \"no qualifying claim\" in Medicare Advantage person-time may be missingness, not absence. Restrict to FFS A/B/D (or commercial medical+pharmacy), report retained-vs-excluded baseline characteristics, and sweep outcome risk among the excluded by arm. Watch differential early mortality (competing risk) that selects survivors into longer follow-up.",
      "ehr": "Selection is visit-driven capture; complete-case filters on labs/staging are informative missingness. Define observation windows explicitly, model probability of being a complete case for IPSW, and vary outcome rates among the dropped-out.",
      "registry": "Participation/consent selects on severity and access; adjudicated outcomes are a strength but the enrolled cohort may not represent the source. Link to claims/death index to characterize non-participants where possible.",
      "linked": "Linkage itself is the selection mechanism; only the linkable subset is analyzed and linkage probability tracks age, insurance stability, and completeness. Compare linked vs unlinked on every available variable and weight by inverse probability of linkage; reconcile order/fill/service dates to avoid immortal time masquerading as selection."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\nimport pandas as pd\nimport statsmodels.api as sm\nimport statsmodels.formula.api as smf\n\ndef ipsw_risk_difference(cohort: pd.DataFrame) -> dict:\n    \"\"\"Reweight the SELECTED sample back to all eligible initiators via stabilized IPSW.\"\"\"\n    # P(selected | covariates): the selection (propensity-of-inclusion) model on the full eligible cohort.\n    sel = smf.glm(\"selected ~ arm + age + prior_hf + comorbidity_count + prior_util + ma_region\",\n                  data=cohort, family=sm.families.Binomial()).fit()\n    p_sel = sel.predict(cohort)\n    p_marg = cohort[\"selected\"].mean()                       # numerator of stabilized weight\n    cohort = cohort.assign(sw=np.where(cohort[\"selected\"] == 1, p_marg / p_sel, 0.0))\n    # Truncate at 1st/99th pct to tame extreme weights; report both.\n    lo, hi = cohort.loc[cohort.selected == 1, \"sw\"].quantile([0.01, 0.99])\n    ana = cohort[cohort.selected == 1].copy()\n    ana[\"sw_trunc\"] = ana[\"sw\"].clip(lo, hi)\n    out = {}\n    for wcol, label in [(\"sw\", \"untruncated\"), (\"sw_trunc\", \"truncated\")]:\n        m = smf.glm(\"event ~ C(arm)\", data=ana, family=sm.families.Binomial(sm.families.links.Identity()),\n                    freq_weights=ana[wcol]).fit()   # identity link -> coefficient IS the risk difference\n        out[label] = float(m.params[\"C(arm)[T.B]\"]) * -1   # RD for A vs B\n    return out\n\ndef tipping_point(rd_obs: float, frac_excl_A: float, frac_excl_B: float,\n                  grid=np.linspace(0.05, 0.30, 6)) -> pd.DataFrame:\n    \"\"\"Recompute the eligible-cohort RD as excluded-arm outcome risks (r_A, r_B) vary.\"\"\"\n    rows = []\n    kept_A, kept_B = 1 - frac_excl_A, 1 - frac_excl_B\n    for r_A in grid:\n        for r_B in grid:\n            rd_corr = rd_obs * (kept_A * kept_B) + (frac_excl_A * r_A - frac_excl_B * r_B)\n            rows.append({\"r_excl_A\": r_A, \"r_excl_B\": r_B, \"rd_corrected\": rd_corr,\n                     
    \"sign_flip\": np.sign(rd_corr) != np.sign(rd_obs)})\n    return pd.DataFrame(rows)\n\ndef selection_bounding_factor(rr_select_exposure: float, rr_select_outcome: float) -> float:\n    \"\"\"Smith & VanderWeele (2019) max selection bias on the risk-ratio scale.\"\"\"\n    a, b = rr_select_exposure, rr_select_outcome\n    return (a * b) / (a + b - 1.0)",
        "description": "Selection-bias sensitivity for a binary outcome. Required input (one row per ELIGIBLE target-population\ninitiator, already cleaned):\n  cohort : person_id, arm ('A'/'B'), selected (1 if retained in the analytic set, else 0),\n           event (0/1, observed only when selected==1), age, prior_hf, comorbidity_count,\n           prior_util, ma_region (selection-driver proxies measured in [index_date-365, index_date])\nTwo complementary analyses:\n  (1) IPSW  — valid when 'selected' is captured by the measured covariates (no-unmeasured-selection).\n  (2) Tipping-point + Smith-VanderWeele bounding factor — valid when the selection driver is unmeasured.",
        "dependencies": [
          "pandas",
          "numpy",
          "statsmodels"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(survey)\n\nipsw_risk_difference <- function(cohort) {\n  # Selection (probability-of-inclusion) model on the full eligible cohort.\n  sel <- glm(selected ~ arm + age + prior_hf + comorbidity_count + prior_util + ma_region,\n             family = binomial(), data = cohort)\n  p_sel  <- predict(sel, type = \"response\")\n  p_marg <- mean(cohort$selected)\n  cohort$sw <- ifelse(cohort$selected == 1, p_marg / p_sel, 0)\n  ana <- subset(cohort, selected == 1)\n  q   <- quantile(ana$sw, c(0.01, 0.99))\n  ana$sw_trunc <- pmin(pmax(ana$sw, q[1]), q[2])\n  ana$arm_b <- as.integer(ana$arm == \"B\")\n  rd <- function(wcol) {\n    des <- svydesign(ids = ~1, weights = ana[[wcol]], data = ana)\n    # Identity-link binomial -> the arm coefficient is the risk difference (A vs B = -coef).\n    m <- svyglm(event ~ arm_b, design = des, family = quasibinomial(link = \"identity\"))\n    -as.numeric(coef(m)[\"arm_b\"])\n  }\n  list(untruncated = rd(\"sw\"), truncated = rd(\"sw_trunc\"))\n}\n\ntipping_point <- function(rd_obs, frac_excl_A, frac_excl_B, grid = seq(0.05, 0.30, 0.05)) {\n  kept_A <- 1 - frac_excl_A; kept_B <- 1 - frac_excl_B\n  g <- expand.grid(r_excl_A = grid, r_excl_B = grid)\n  g$rd_corrected <- rd_obs * (kept_A * kept_B) + (frac_excl_A * g$r_excl_A - frac_excl_B * g$r_excl_B)\n  g$sign_flip <- sign(g$rd_corrected) != sign(rd_obs)\n  g\n}\n\nselection_bounding_factor <- function(rr_select_exposure, rr_select_outcome) {\n  a <- rr_select_exposure; b <- rr_select_outcome\n  (a * b) / (a + b - 1)\n}",
        "description": "Selection-bias sensitivity for a binary outcome. Input mirrors the Python version:\n  cohort : person_id, arm ('A'/'B'), selected (0/1), event (0/1), age, prior_hf,\n           comorbidity_count, prior_util, ma_region\nIPSW via survey::svyglm (design-based SEs for the weighted risk difference); tipping-point and the\nSmith-VanderWeele bounding factor as closed forms.",
        "dependencies": [
          "survey"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* --- Step 1: selection (probability-of-inclusion) model + stabilized IPSW --- */\nproc logistic data=work.cohort noprint;\n  class arm / param=ref;\n  model selected(event='1') = arm age prior_hf comorbidity_count prior_util ma_region;\n  output out=work.psel p=p_sel;            /* predicted P(selected | covariates) */\nrun;\n\nproc sql noprint; select mean(selected) into :p_marg from work.cohort; quit;\n\ndata work.wt;\n  set work.psel;\n  if selected=1 then sw = &p_marg / p_sel; else sw = 0;   /* stabilized weight */\nrun;\n\n/* Truncate weights at 1st/99th pct among the selected, then keep the analytic set. */\nproc univariate data=work.wt(where=(selected=1)) noprint;\n  var sw; output out=work.cut pctlpts=1 99 pctlpre=p_;\nrun;\ndata work.analytic;\n  if _n_=1 then set work.cut;\n  set work.wt(where=(selected=1));\n  sw_trunc = min(max(sw, p_1), p_99);\nrun;\n\n/* --- Step 2: weighted identity-link risk model; arm coefficient = risk difference (A vs B = -beta) --- */\nproc genmod data=work.analytic;\n  class arm person_id / param=ref;\n  weight sw_trunc;                                  /* swap to sw for the untruncated estimate */\n  model event = arm / dist=binomial link=identity;\n  repeated subject=person_id / type=ind;            /* robust SEs for the weighted analysis */\nrun;\n\n/* --- Tipping-point grid + Smith-VanderWeele bounding factor --- */\n%macro tip(rd_obs=, fA=, fB=);\n  data work.tip;\n    kept_A = 1 - &fA; kept_B = 1 - &fB;\n    do r_A = 0.05 to 0.30 by 0.05;\n      do r_B = 0.05 to 0.30 by 0.05;\n        rd_corrected = &rd_obs*(kept_A*kept_B) + (&fA*r_A - &fB*r_B);\n        sign_flip = (sign(rd_corrected) ne sign(&rd_obs));\n        output;\n      end;\n    end;\n  run;\n  /* BF = (RR_AS*RR_US)/(RR_AS+RR_US-1) over a grid of selection associations. 
*/\n  data work.bound;\n    do rr_as = 1.5 to 4 by 0.5;\n      do rr_us = 1.5 to 4 by 0.5;\n        bf = (rr_as*rr_us)/(rr_as+rr_us-1); output;\n      end;\n    end;\n  run;\n%mend;\n%tip(rd_obs=-0.04, fA=0.18, fB=0.12);",
        "description": "Selection-bias sensitivity in SAS. Required input dataset (one row per ELIGIBLE target-population initiator):\n  work.cohort : person_id, arm ('A'/'B'), selected (0/1), event (0/1),\n                age prior_hf comorbidity_count prior_util ma_region   (selection-driver covariates)\nStep 1 fits the probability-of-inclusion model and builds stabilized IPSW with PROC LOGISTIC; Step 2 fits the\nweighted identity-link risk model with PROC GENMOD (coefficient = risk difference); the %tip macro runs the\ntipping-point grid and the Smith-VanderWeele bounding factor.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  E[Exposure / arm A vs B] --> S{{S = selected into analysis<br/>continuous enroll / linkage / complete-case}}\n  Y[Outcome: hospitalized HF] --> S\n  U[Unmeasured cause<br/>e.g. frailty, coverage churn] --> S\n  U --> Y\n  E -. causal effect of interest .-> Y\n  S -. analysis CONDITIONS on S=1<br/>collider opens E ~ Y bias .-> Bias((Selection bias))\nclassDef sel fill:#fde,stroke:#a33;\nclass S sel;",
        "caption": "Selection bias as collider stratification. Restricting to the analyzed set (S=1) conditions on a common effect of exposure and outcome, opening a non-causal path even when no confounder is present (Hernán et al. 2004).",
        "alt_text": "Causal DAG showing exposure and outcome both pointing into a selection node S, an unmeasured cause U pointing into S and the outcome, and conditioning on S=1 inducing selection bias on the exposure-outcome effect.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Q[Does inclusion/retention depend on BOTH exposure and outcome?] -->|No| Stop[No collider:<br/>address confounding instead]\n  Q -->|Yes| M{Is the selection driver MEASURED?}\n  M -->|Yes| IPSW[Fit probability-of-selection model<br/>-> stabilized IPSW; transport to target population]\n  M -->|No| BP[Bias-parameter analysis]\n  BP --> TP[Tipping-point sweep over<br/>excluded-arm outcome risks]\n  BP --> BF[Smith-VanderWeele bounding factor<br/>RR scale]\n  BP --> PBA[Probabilistic bias analysis:<br/>draw bias params from priors, Monte Carlo]\n  IPSW --> Mech{Which mechanism?}\n  Mech -->|During follow-up| IPCW[Time-to-event analogue: IPCW]\n  Mech -->|Into sample| CS[Cross-sectional IPSW]",
        "caption": "Decision logic for selecting the selection-bias method: confirm the collider, then choose IPSW (driver measured) vs bias-parameter / tipping-point / PBA (driver unmeasured), and IPCW for attrition during follow-up.",
        "alt_text": "Decision tree starting from whether inclusion depends on both exposure and outcome, branching to IPSW when the selection driver is measured and to bias-parameter, tipping-point, bounding-factor, or probabilistic bias analysis when it is unmeasured, with IPCW for during-follow-up attrition.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "is_variant_of",
        "target_slug": "quantitative-bias-analysis-toolkit-rwe",
        "notes": "Selection-bias sensitivity is the selection-mechanism member of the QBA family (alongside unmeasured-confounding and misclassification bias analysis)."
      },
      {
        "relation_type": "see_also",
        "target_slug": "attrition-and-loss-to-follow-up-rwe",
        "notes": "Differential attrition is selection during follow-up; IPCW is the primary handling and selection sensitivity stress-tests its no-unmeasured-censoring assumption."
      },
      {
        "relation_type": "see_also",
        "target_slug": "missing-data-trimming-winsorization-rwe",
        "notes": "Complete-case analysis and informative missingness induce selection bias; IPSW for complete-case probability is the reweighting counterpart."
      },
      {
        "relation_type": "see_also",
        "target_slug": "database-feasibility-attrition-funnel-rwe",
        "notes": "A steep, exposure-related source-to-cohort funnel is the operational signal that selection-bias sensitivity is needed."
      },
      {
        "relation_type": "tradeoff_with",
        "target_slug": "e-value-sensitivity-analysis",
        "notes": "The E-value bounds unmeasured confounding; the Smith-VanderWeele selection bounding factor is its selection-bias counterpart and answers a different question."
      },
      {
        "relation_type": "see_also",
        "target_slug": "immortal-time-bias-handling",
        "notes": "In linkage/procedure studies, requiring survival to a downstream record can masquerade as selection; align time zero before reweighting."
      }
    ],
    "aliases": [
      "selection sensitivity analysis",
      "attrition sensitivity analysis",
      "linkage bias sensitivity analysis",
      "collider bias sensitivity analysis",
      "inverse-probability-of-selection weighting sensitivity"
    ],
    "complexity": "advanced",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "self-controlled-case-series",
    "name": "Self-Controlled Case Series (SCCS)",
    "short_definition": "A within-person method that uses only cases and estimates the age- and time-adjusted relative incidence of an acute outcome in exposure-defined risk windows versus a person's own baseline periods, via conditional Poisson regression conditioned on each case's total observed event count.",
    "long_description": "The **self-controlled case series (SCCS)** estimates the **relative incidence** of an acute, recurrent-or-rare outcome\nduring a transient exposure-defined **risk window** compared with the same individual's **baseline (control) periods**,\nusing **only people who experience the event** (\"cases\"). Each case acts as their own control, so all *fixed* (time-invariant)\nconfounders — genetics, sex, baseline frailty, chronic comorbidity, socioeconomic status, even unmeasured ones — are\ncancelled by the within-person conditioning. The likelihood conditions on the total number of events each case had over\ntheir observation period, leaving a multinomial/conditional-Poisson model for *when within the observation period* events\nfell relative to exposure. Age and other shared time-varying confounders are handled by splitting each person's follow-up\ninto calendar/age intervals and including them as covariates. The canonical use is vaccine and drug safety, where the\nexposure is a discrete event (a vaccine dose, a short antibiotic course) and the outcome is acute (febrile seizure,\nintussusception, MI, GI bleed).\n\n**Core estimand distinction** — SCCS targets the **incidence rate ratio (IRR) within persons** — relative incidence during\nrisk versus baseline time — *not* an absolute rate, an attributable risk, or a between-person comparative effect. It is a\nself-matched cohort over person-time, not a case-control study (it has no separate controls and no odds ratio) and not a\nstandard cohort (it uses no unexposed people). Because conditioning removes the person-level baseline rate entirely, SCCS\nanswers \"does risk of the event transiently rise after exposure?\" It cannot answer \"how many events did the exposure cause\nin the population?\" without external incidence data, and it cannot rank two drugs head-to-head the way an active-comparator\ncohort does. 
The conditioning that buys immunity to fixed confounding is exactly why absolute effects are unrecoverable\nfrom SCCS alone.\n\n**Interpreting the output**\n\nFrom the worked example: one febrile seizure falls in the 14-day vaccine risk window;\nacross all cases in the study the conditional Poisson model yields IRR = 1.9 (95% CI\n1.4–2.6) for the 0–7 day window.\n\nFormal interpretation: Febrile seizures occurred at 1.9 times the within-person rate\nduring the 0–7 day post-vaccination risk window compared with each child's own baseline\n(unexposed) person-time. The conditioning on each child's total event count removes\nthe child's baseline seizure propensity entirely — the comparison is within-person, not\nbetween children. Because all time-invariant confounders (genetics, underlying seizure\nthreshold, chronic comorbidity, socioeconomic status) are constant within a person, they\ncannot explain this temporal excess. Residual time-varying confounders shared with the\npost-vaccination window (concurrent illness, seasonal effects) are addressed by the age\nand calendar adjustments, but not eliminated. The IRR is not an absolute risk, not a\nrisk ratio between vaccinated and unvaccinated children, and not a between-person\ncomparative estimate.\n\nPractical interpretation: The risk of febrile seizures is approximately 90% higher in\nthe week after vaccination than during the same child's own quiet baseline periods.\nThis within-person signal is robust to confounding by indication and stable background\nfactors, making it stronger evidence of a temporal relationship than a between-person\ncohort comparison. 
Two mandatory caveats: (1) SCCS controls only time-fixed confounders\n— a time-varying factor that simultaneously predicts vaccination scheduling and seizure\nrisk would residually confound the IRR; (2) if clinicians defer vaccination in children\nwho recently seized (event-dependent exposure), the IRR is biased, and the pre-exposure\nwindow check (looking for a depressed pre-dose rate) is required to detect it.\n\n**Pros, cons, and trade-offs** (vs the alternatives named below).\n- **vs cohort / active-comparator new-user designs:** SCCS eliminates *all* time-invariant between-person confounding by\n  construction (no PS, no matching, no measured covariates needed for fixed factors), and is dramatically more efficient\n  per case because it needs only cases. Cost: it requires a transient exposure and an acute outcome, assumes the event does\n  not alter future exposure or end observation, and gives only a within-person IRR — not the comparative or absolute\n  estimands a cohort delivers. **Prefer SCCS** when confounding by indication and unmeasured frailty are severe and the\n  exposure/outcome are transient and acute; **prefer a cohort** for chronic exposures, cumulative effects, or absolute risk.\n- **vs self-controlled risk interval (SCRI):** SCRI is the special-case SCCS that uses only a short pre-specified control\n  window adjacent to the risk window (rather than all of each person's observation time), trading statistical efficiency\n  and full age adjustment for robustness to long-term time trends and lighter age modeling. 
**Prefer SCRI** when secular\n  or seasonal trends are strong and the relevant comparison is local in time (e.g., seasonal vaccines); **prefer full\n  SCCS** when you need efficiency and have modeled age/season well.\n- **vs case-crossover:** Both are within-person, but case-crossover samples discrete *referent windows* and is built for\n  rare, abrupt triggers of a single event with stable exposure prevalence; SCCS models the full observation period as\n  continuous person-time and naturally handles recurrent events and time-varying exposure intensity. SCCS is more flexible\n  for vaccines/drugs with extended risk windows; case-crossover is simpler for instantaneous triggers (e.g., physical\n  exertion and MI).\n\n**When to use** — acute outcome with a biologically plausible *transient* risk window after a discrete exposure; strong or\nunmeasurable between-person confounding (frailty, indication) that would cripple a cohort; sparse data where retaining only\ncases is an efficiency advantage; signal evaluation in vaccine/drug safety (the FDA Sentinel and vaccine-safety literature\nlean on SCCS/SCRI heavily). It shines when randomization is impossible but the exposure is \"switch-like\" and the outcome\nis datable to a day.\n\n**When NOT to use — and when it is actively misleading or dangerous** (decision rules below).\n- **Event-dependent exposure.** If having the event changes the probability of *subsequent* exposure (e.g., a stroke makes\n  a clinician stop the drug, or a death precludes future vaccination), the standard SCCS likelihood is biased. This is the\n  single most dangerous violation: a protective-looking IRR can be an artifact of clinicians withholding exposure after the\n  event. 
Use event-dependent-exposure SCCS extensions (Farrington) or a different design.\n- **Event-dependent observation / censoring by death.** If the event can *end* observation (death, or events that trigger\n  disenrollment), the assumption that observation length is independent of event timing fails; use the event-dependent-\n  observation SCCS extension. Applying naive SCCS to a high-fatality outcome (e.g., sudden cardiac death) is misleading.\n- **Chronic or cumulative exposures, or non-acute outcomes.** SCCS cannot separate a transient risk window from a stable\n  baseline if exposure is continuous (a maintenance statin) or if the outcome accrues slowly (cancer). There is no\n  within-person contrast to exploit.\n- **No within-person variation in exposure timing relative to follow-up.** If everyone is exposed for essentially their\n  whole observation window, baseline person-time is near zero and the IRR is unidentified.\n- **Outcome rate genuinely depends on calendar/seasonal time confounded with exposure.** Failure to model age/season when\n  a seasonal vaccine is given in a high-incidence season produces confounding *within* person; SCCS removes fixed but not\n  unmodeled time-varying confounders.\n\n**Data-source operational depth** across claims, EHR, registry, and linked data.\n- **Claims (FFS):** Exposure dates come from pharmacy fills (`fill_date` + `days_supply` to construct the risk window) or\n  procedure/administration codes (vaccine CPT/CVX). Outcomes are inpatient/ED diagnosis dates — prefer the inpatient\n  *admission* date over the claim adjudication date. Define the observation period as continuous-enrollment spans;\n  person-time outside continuous A/B (and D, for drug exposure) must be excluded or the \"baseline\" is unobserved. Failure\n  modes: **Medicare Advantage person-time lacks FFS claims**, so both exposure and outcome are silently missing — restrict\n  to FFS-enrolled spans and treat MA periods as gaps, not baseline. 
A claim's service date can lag the true event by days,\n  smearing events across the risk-window boundary; pre-specify which date defines event onset.\n- **EHR:** Exposure may be an *order* or in-clinic *administration* (good for vaccines given on-site) but pharmacy linkage\n  is needed to confirm a dispensed drug was taken. Outcome dating is sharper (problem-list onset, lab dates) but\n  visit-driven capture means events occurring outside the system are missed; if a vaccine and its adverse event are both\n  captured only on system visits, ascertainment is correlated with exposure — a within-person confounder. Define\n  observation as enrollment/active-patient spans, not \"first to last visit,\" to avoid event-dependent observation.\n- **Registry:** Excellent, adjudicated outcome dates (e.g., a febrile-seizure or intussusception registry) and often the\n  natural data source for SCCS; weak for complete exposure history. Link to claims/immunization information systems for\n  dose dates. National immunization registries are the gold-standard exposure source for vaccine SCCS.\n- **Linked claims–EHR–vital records:** Best substrate — claims completeness + EHR onset dating + a **death index to handle\n  event-dependent observation** (mortality is the most common censoring mechanism that breaks naive SCCS). Reconcile\n  order/fill/service date discrepancies before assigning risk windows; linkage selection (only the linkable subset) can\n  differentially drop the sickest.\n\n**Worked claims example.** Question: does the IRR of febrile seizure rise in the 0–7 and 8–14 days after a measles-\ncontaining vaccine dose in children, using a Medicare-style FFS pediatric claims database? (1) Cases: all children with\n≥1 inpatient/ED febrile-seizure diagnosis (use the admission date as event onset) during observed time. SCCS uses only\nthese children. 
(2) Observation period: continuous medical enrollment spans between ages 12 and 24 months; drop any\nMA-only or non-enrolled person-time (otherwise baseline is unobserved). (3) Exposure: vaccine administration date from the\nCPT/CVX claim; define risk windows `[dose, dose+7]` and `[dose+8, dose+14]`, with everything else in the observation period\nas baseline. (4) Age adjustment: split each child's follow-up into 1-month age bands (febrile-seizure incidence is strongly\nage-dependent) and include age band as a factor. (5) Model: conditional Poisson regression of the event indicator on\nrisk-window and age-band dummies, with an offset for the log length of each interval, conditioned on each child's total\nseizure count — equivalently a stratified (by child) Poisson/Cox fit. (6) Read-out: exp(beta) for each risk window is the\nage-adjusted within-child IRR vs that child's own baseline; e.g., IRR 1.9 (95% CI 1.4–2.6) in days 0–7. (7) Sensitivity:\nadd a pre-exposure window `[dose-14, dose-1]` to detect event-dependent exposure (a clinician deferring vaccination in a\nrecently-seizing child shows up as a depressed pre-window), test alternative risk-window lengths, and exclude the day of\nvaccination to probe contraindication bias.",
    "primary_category": "Causal_Inference_Method",
    "tags": [
      "self-controlled",
      "within_person",
      "vaccine_safety",
      "relative-incidence",
      "conditional-poisson",
      "pharmacoepidemiology",
      "recurrent-events",
      "signal-evaluation"
    ],
    "applies_to_study_types": [
      "self_controlled_case_series"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.2307/2533328",
        "url": "https://doi.org/10.2307/2533328",
        "citation_text": "Farrington CP. Relative incidence estimation from case series for vaccine safety evaluation. Biometrics. 1995;51(1):228-235.",
        "year": 1995,
        "authors_short": "Farrington",
        "notes": "Original derivation of the SCCS likelihood (conditional Poisson conditioned on each case's total event count) for vaccine safety."
      },
      {
        "role": "introduce",
        "doi": "10.1002/sim.2302",
        "url": "https://doi.org/10.1002/sim.2302",
        "citation_text": "Whitaker HJ, Farrington CP, Spiessens B, Musonda P. Tutorial in biostatistics: the self-controlled case series method. Statistics in Medicine. 2006;25(10):1768-1797.",
        "year": 2006,
        "authors_short": "Whitaker et al.",
        "notes": "Definitive applied tutorial — assumptions, age adjustment, risk-window specification, and conditional Poisson estimation."
      },
      {
        "role": "explain",
        "doi": "10.1136/bmj.i4515",
        "url": "https://doi.org/10.1136/bmj.i4515",
        "citation_text": "Petersen I, Douglas I, Whitaker H. Self controlled case series methods: an alternative to standard epidemiological study designs. BMJ. 2016;354:i4515.",
        "year": 2016,
        "authors_short": "Petersen et al.",
        "notes": "Accessible exposition of when SCCS beats cohort/case-control designs and the key assumptions (event-independent exposure and observation)."
      },
      {
        "role": "explain",
        "doi": "10.1177/0962280208092342",
        "url": "https://doi.org/10.1177/0962280208092342",
        "citation_text": "Whitaker HJ, Hocine MN, Farrington CP. The methodology of self-controlled case series studies. Statistical Methods in Medical Research. 2009;18(1):7-26.",
        "year": 2009,
        "authors_short": "Whitaker et al.",
        "notes": "Methodological review covering extensions for event-dependent exposure and event-dependent observation (the dangerous violations)."
      }
    ],
    "plain_language_summary": "The Self-Controlled Case Series (SCCS) is a study design that asks: does a person's risk of a short-lived medical event go up right after a specific exposure, like a vaccine or a short course of medication? Instead of comparing two different groups of people, it compares each person to themselves — looking at whether that person had more events during a brief window after the exposure than during their own quiet, unexposed time earlier or later in the same follow-up period. Because every person is their own comparison, characteristics that never change — like genetics, sex, or long-standing health conditions — are automatically ruled out as explanations. The catch is that this design only works when both the exposure and the outcome are short-lived and datable to a specific day.",
    "key_terms": [
      {
        "term": "self-controlled",
        "definition": "A study feature where each person serves as their own comparison group, so fixed personal characteristics like sex or genetics cannot distort the result."
      },
      {
        "term": "risk period",
        "definition": "The short window of days immediately after an exposure (such as a vaccine dose) during which a biologically plausible adverse event might occur."
      },
      {
        "term": "baseline period",
        "definition": "The stretches of a person's own follow-up time that fall outside the risk window, used as the within-person comparison for how often the event normally happens."
      },
      {
        "term": "incidence rate ratio",
        "definition": "A number comparing how often an event happens per day in the risk period versus per day in the baseline period — a value above 1.0 means higher risk after exposure."
      },
      {
        "term": "conditional Poisson regression",
        "definition": "The statistical model used in SCCS that counts events per interval within each person and estimates the rate change between risk and baseline time, conditioning on the person's own total event count."
      }
    ],
    "worked_example": {
      "scenario": "A 14-month-old child receives a measles-containing vaccine on 2024-04-09. Researchers want to know whether febrile seizures happen more often in the 14 days after the dose than during this child's own quiet time before and after that window. The child's continuous enrollment spans 2024-01-01 through 2024-09-30 (273 days total). One febrile seizure is recorded on 2024-04-14, which falls inside the risk window. The SCCS compares the daily event rate within the 14-day risk period to the daily event rate across the remaining 259 days of baseline time.",
      "dataset": {
        "caption": "Raw rows an analyst would see across three linked tables for this one child",
        "columns": [
          "person_id",
          "date",
          "record_type",
          "detail"
        ],
        "rows": [
          [
            "C001",
            "2024-01-01",
            "enrollment_start",
            "continuous FFS enrollment begins"
          ],
          [
            "C001",
            "2024-04-09",
            "exposure",
            "measles-containing vaccine — CPT 90707"
          ],
          [
            "C001",
            "2024-04-14",
            "outcome",
            "febrile seizure — inpatient admission date"
          ],
          [
            "C001",
            "2024-09-30",
            "enrollment_end",
            "continuous FFS enrollment ends"
          ]
        ]
      },
      "steps": [
        "Total observation for this child: 2024-01-01 through 2024-09-30 = 273 days.",
        "Define the risk window: from the vaccine date (2024-04-09) through 14 days later = 2024-04-09 to 2024-04-22 = 14 days.",
        "Define the baseline period: all observation days outside the risk window = 273 - 14 = 259 days (2024-01-01 to 2024-04-08 plus 2024-04-23 to 2024-09-30).",
        "Count events in the risk window: 1 febrile seizure on 2024-04-14.",
        "Count events in the baseline period: 0 febrile seizures.",
        "Rate in risk window: 1 event / 14 days = 0.0714 events per day.",
        "Rate in baseline period: 0 events / 259 days = 0.0 events per day.",
        "Incidence rate ratio = 0.0714 / (0.0 + small background rate) — in the conditional Poisson model fitted across all cases in the study, the per-child comparison is pooled. For this single case, the event fell in the risk window rather than baseline, contributing evidence that the risk window rate is elevated versus baseline."
      ],
      "result": "For this child, the only event landed in the 14-day risk window (14 days) rather than the 259 days of baseline time. Across all children in the study, the conditional Poisson model would estimate an incidence rate ratio (IRR) for the risk window versus baseline. A study using these methods in the vaccine safety literature found IRR values around 1.9 (95% CI 1.4-2.6) for the 0-7 day window, meaning febrile seizures occurred at roughly twice the rate during the risk period compared with each child's own quiet baseline time.",
      "timeline_spec": {
        "title": "SCCS follow-up for one child — vaccine-day risk window vs own baseline",
        "window": {
          "start": "2024-01-01",
          "end": "2024-09-30",
          "label": "273-day continuous enrollment (observation period)"
        },
        "events": [
          {
            "label": "Vaccine dose (2024-04-09)",
            "start": "2024-04-09",
            "length_days": 1,
            "quantity": "1-day exposure milestone"
          },
          {
            "label": "Febrile seizure (2024-04-14)",
            "start": "2024-04-14",
            "length_days": 1,
            "quantity": "outcome event — day 5 of risk window"
          }
        ],
        "spans": [
          {
            "kind": "unexposed",
            "start": "2024-01-01",
            "end": "2024-04-08",
            "label": "Baseline: 99 days (pre-exposure)"
          },
          {
            "kind": "exposed",
            "start": "2024-04-09",
            "end": "2024-04-22",
            "label": "Risk window: 14 days after vaccine"
          },
          {
            "kind": "unexposed",
            "start": "2024-04-23",
            "end": "2024-09-30",
            "label": "Baseline: 160 days (post-risk-window)"
          }
        ],
        "result": {
          "label": "1 event in 14 risk-window days vs 0 events in 259 baseline days — within-person IRR estimated by conditional Poisson across all cases",
          "value": 1.9
        },
        "caption": "This child's 273-day enrollment is split into 259 days of baseline time (blue, before and after the risk window) and a 14-day risk window after the vaccine dose (orange). The single febrile seizure fell on day 5 of the risk window. The SCCS compares the event rate per day in the orange band to the event rate per day in the blue bands — using only this child's own data, so fixed traits like genetics are not a factor.",
        "alt_text": "Timeline showing a 273-day enrollment period for one child split into two blue baseline spans (99 days before the vaccine and 160 days after the risk window) and one orange 14-day risk window starting at the vaccine dose date, with a marker for the febrile seizure event on day 5 of the risk window."
      }
    },
    "prerequisites": [
      "incidence-rate-calculation-rwe",
      "person-time-denominator-construction-rwe",
      "continuous-enrollment-observable-time-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Standard SCCS (full observation period as baseline)",
        "description": "Each case's entire continuous-observation period is split into exposure-defined risk windows and baseline time; age/season modeled as time-varying factors; conditional Poisson conditioned on total event count.",
        "edge_cases": [
          "Near-total exposure leaves almost no baseline person-time, making the IRR unstable or unidentified.",
          "Unmodeled seasonality confounds within person when a seasonal exposure coincides with high-incidence season."
        ],
        "data_source_notes": "claims: build risk windows from fill_date+days_supply or vaccine CPT/CVX dates; exclude MA-only and non-enrolled person-time so baseline is observed."
      },
      {
        "name": "Self-controlled risk interval (SCRI)",
        "description": "Restricts the comparison to a short pre-specified control interval near the exposure rather than all observation time; robust to long-term time trends with lighter age modeling, at a cost of efficiency.",
        "edge_cases": [
          "Choosing a control interval contaminated by residual risk (too close to exposure) biases the IRR toward null.",
          "Control interval placed in a different season than the risk window reintroduces seasonal confounding."
        ],
        "data_source_notes": "Standard for vaccine safety in registries/IIS where exposure dates are exact and seasonal trends dominate."
      },
      {
        "name": "Event-dependent exposure / observation SCCS",
        "description": "Modified likelihoods (Farrington and colleagues) for settings where the event alters future exposure probability or curtails observation (e.g., death), relaxing the two assumptions whose violation is most dangerous.",
        "edge_cases": [
          "Requires correct specification of the post-event exposure/observation process; misspecification can worsen bias."
        ],
        "data_source_notes": "linked: a death index is essential to operationalize event-dependent observation when the outcome can be fatal."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Active-comparator new-user cohort",
        "pros_of_this": "Removes ALL fixed between-person confounding (measured and unmeasured) by within-person conditioning; needs only cases, so efficient in sparse data; no propensity score required.",
        "cons_of_this": "Estimates only a within-person relative incidence (no absolute or head-to-head comparative effect); restricted to transient exposures and acute outcomes; biased under event-dependent exposure/observation.",
        "when_to_prefer": "Severe unmeasured frailty/indication confounding with a transient exposure and an acute, datable outcome (classic vaccine/drug safety signals)."
      },
      {
        "compared_to": "Self-controlled risk interval (SCRI)",
        "pros_of_this": "Uses all baseline person-time for greater statistical efficiency and full age adjustment.",
        "cons_of_this": "More sensitive to unmodeled long-term/seasonal time trends because the baseline spans the whole observation period.",
        "when_to_prefer": "When age/season are well modeled and efficiency matters more than robustness to secular trends."
      },
      {
        "compared_to": "Case-crossover",
        "pros_of_this": "Handles recurrent events, extended/graded risk windows, and time-varying exposure intensity naturally over continuous person-time.",
        "cons_of_this": "More modeling burden (age/season splitting) than sampling discrete referent windows; assumes event-independent observation.",
        "when_to_prefer": "Vaccines/drugs with extended risk windows and possible event recurrence, rather than instantaneous triggers of a single event."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Risk windows from fill_date+days_supply (drugs) or vaccine CPT/CVX dates; outcome onset = inpatient admission date, not adjudication date. Observation = continuous FFS-enrolled spans; exclude Medicare Advantage-only and unenrolled person-time so baseline is observed. A pre-exposure window screens for event-dependent exposure.",
      "ehr": "Exposure = order/administration (good for on-site vaccines); link pharmacy fills to confirm drug intake. Sharper onset dating, but visit-driven capture can correlate exposure and outcome ascertainment within person. Define observation as active-enrollment spans, not first-to-last visit.",
      "registry": "Adjudicated, exact outcome dates (febrile seizure, intussusception); link to claims/immunization information systems for dose dates. Often the natural SCCS substrate for vaccine safety.",
      "linked": "Claims completeness + EHR onset dating + a death index to handle event-dependent observation (fatal outcomes break naive SCCS). Reconcile order/fill/service date discrepancies before assigning risk windows."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\nimport pandas as pd\nimport statsmodels.formula.api as smf\n\nAGE_BAND_DAYS = 30  # split follow-up into 30-day age bands to absorb time-varying age confounding\n\ndef build_intervals(cases: pd.DataFrame, exposures: pd.DataFrame) -> pd.DataFrame:\n    rows = []\n    exp_by_person = exposures.groupby(\"person_id\")\n    for _, c in cases.iterrows():\n        pid, start, end = c[\"person_id\"], c[\"obs_start\"], c[\"obs_end\"]\n        # Cut points: observation bounds, age-band edges, and each exposure's risk-window edges.\n        cuts = pd.to_datetime([start, end])\n        edges = pd.date_range(start, end, freq=f\"{AGE_BAND_DAYS}D\")\n        cuts = cuts.append(edges)\n        if pid in exp_by_person.groups:\n            for _, e in exp_by_person.get_group(pid).iterrows():\n                r0, r1 = e[\"exposure_date\"], e[\"exposure_date\"] + pd.Timedelta(days=int(e[\"risk_len\"]))\n                cuts = cuts.append(pd.to_datetime([r0, r1]))\n        cuts = pd.Series(cuts).clip(lower=start, upper=end).drop_duplicates().sort_values().reset_index(drop=True)\n        for a, b in zip(cuts[:-1], cuts[1:]):\n            if b <= a:\n                continue\n            mid = a + (b - a) / 2\n            in_risk = False\n            if pid in exp_by_person.groups:\n                for _, e in exp_by_person.get_group(pid).iterrows():\n                    if e[\"exposure_date\"] <= mid < e[\"exposure_date\"] + pd.Timedelta(days=int(e[\"risk_len\"])):\n                        in_risk = True\n                        break\n            age_band = int((mid - start).days // AGE_BAND_DAYS)\n            rows.append(dict(person_id=pid, start=a, end=b,\n                             length=(b - a).days, risk=int(in_risk), age_band=age_band))\n    return pd.DataFrame(rows)\n\ndef fit_sccs(intervals: pd.DataFrame, events: pd.DataFrame) -> \"smf.glm\":\n    # Count events per interval (events fall on event_date within [start, end)).\n    iv = 
intervals.copy()\n    iv[\"n_events\"] = 0\n    ev = events.merge(iv, on=\"person_id\")\n    ev = ev[(ev[\"event_date\"] >= ev[\"start\"]) & (ev[\"event_date\"] < ev[\"end\"])]\n    counts = ev.groupby([\"person_id\", \"start\"]).size().rename(\"k\")\n    iv = iv.merge(counts, on=[\"person_id\", \"start\"], how=\"left\")\n    iv[\"n_events\"] = iv[\"k\"].fillna(0).astype(int)\n    iv = iv[iv[\"length\"] > 0].copy()\n    iv[\"log_len\"] = np.log(iv[\"length\"])\n    # Conditional Poisson via person fixed effects + offset = stratified within-person likelihood.\n    model = smf.glm(\n        \"n_events ~ C(risk) + C(age_band) + C(person_id)\",\n        data=iv,\n        offset=iv[\"log_len\"],\n        family=__import__(\"statsmodels.api\", fromlist=[\"families\"]).families.Poisson(),\n    ).fit()\n    return model  # exp(params['C(risk)[T.1]']) = within-person, age-adjusted IRR",
        "description": "SCCS data construction + conditional Poisson estimation from claims-style inputs. Required inputs (cleaned):\n  cases    : one row per case -> person_id, obs_start (datetime), obs_end (datetime)   # continuous FFS-enrolled span\n  exposures: vaccine/drug dates -> person_id, exposure_date (datetime), risk_len (days)  # e.g. risk_len=14\n  events   : acute outcome dates -> person_id, event_date (datetime)                    # admission date, recurrent OK\nSplits each case's observation period into risk vs baseline intervals and 30-day age bands, counts events per interval,\nand fits a conditional Poisson (stratified by person via fixed effects) with log(interval length) as offset.\nexp(coef[risk]) is the within-person, age-adjusted incidence rate ratio. Add a pre-exposure window to screen for\nevent-dependent exposure before trusting the result.",
        "dependencies": [
          "pandas",
          "numpy",
          "statsmodels"
        ],
        "source_citations": [
          "whitaker-2006"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(SCCS)\n\n# `cd` is one row per (person, exposure) with observation bounds and event days already merged.\n#   indiv, astart, aend  : observation window (e.g., days of age 365..730)\n#   edate                : exposure date; risk window = [edate, edate + 14]\n#   eventday             : day of the acute outcome (multiple rows per person allowed)\nfit <- standardsccs(\n  event   ~ vacc,\n  indiv   = indiv,\n  astart  = astart,\n  aend    = aend,\n  aevent  = eventday,\n  adrug   = edate,\n  aedrug  = edate + 14,           # 14-day risk window after exposure\n  agegrp  = seq(min(cd$astart), max(cd$aend), by = 30),  # 30-day age bands\n  data    = cd\n)\nsummary(fit)                      # exp(coef) = within-person, age-adjusted IRR with 95% CI\n\n# ---- Manual conditional-Poisson fallback (custom windows / diagnostics) ----\nlibrary(gnm)\n# `iv` = pre-split intervals: indiv, n_events, risk (0/1), age_band (factor), log_len (offset)\ncp <- gnm(n_events ~ risk + age_band, eliminate = factor(indiv),\n          family = poisson, offset = log_len, data = iv)\nexp(coef(cp)[\"risk\"])             # IRR; eliminate=indiv gives the within-person conditional fit",
        "description": "SCCS using the SCCS package (Farrington/Whitaker reference implementation). Inputs:\n  cases     : indiv, astart, aend           # observation start/end in days-of-age (or days since origin)\n  exposures : indiv, edate                  # exposure date in the same time scale\n  events    : indiv, eventday               # acute outcome day (recurrent rows allowed)\nstandardsccs() fits the conditional Poisson likelihood directly; exp(coef) is the age-adjusted within-person IRR.\nA manual gnm() fallback (conditional Poisson via eliminate=indiv) is shown for custom risk windows.",
        "dependencies": [
          "SCCS",
          "gnm"
        ],
        "source_citations": [
          "whitaker-2006"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* (1) Conditional Poisson: person as fixed effect, log interval length as offset. */\nproc genmod data=work.sccs_iv;\n  class person_id age_band / param=ref;\n  model n_events = risk age_band / dist=poisson link=log offset=log_len;\n  /* exp(estimate) for risk = within-person, age-adjusted IRR */\n  estimate 'IRR risk vs baseline' risk 1 / exp;\nrun;\n\n/* (2) Equivalent stratified-Cox layout (Andersen-Gill); preferred when person strata are numerous.    */\n/*     Each interval is an at-risk row stratified by person; the time-axis carries the age structure.   */\nproc phreg data=work.sccs_agfmt;            /* sccs_agfmt: person_id, tstart, tstop, event, risk, age_band */\n  class age_band / param=ref;\n  model (tstart, tstop)*event(0) = risk age_band;\n  strata person_id;                          /* within-person conditioning */\n  hazardratio 'IRR risk window' risk;        /* HR == IRR under the equivalence */\nrun;",
        "description": "SCCS via conditional Poisson regression in SAS. Required input dataset (post-split, one row per person-interval):\n  work.sccs_iv : person_id, n_events (count in interval), risk (0/1), age_band (categorical),\n                 log_len (=log(interval length in days))\nTwo equivalent fits: (1) PROC GENMOD with the person as a CLASS effect and an OFFSET reproduces the conditional\nPoisson within-person likelihood; (2) PROC PHREG with a stratified Cox layout (the Andersen-Gill formulation that\nFarrington's likelihood is algebraically equivalent to) is the standard alternative when GENMOD strata are too large.\nexp(estimate) for risk is the within-person, age-adjusted incidence rate ratio. Confirm a pre-exposure window is null\nbefore reporting (screens for event-dependent exposure).",
        "dependencies": [],
        "source_citations": [
          "whitaker-2006"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": "self-controlled-case-series-timeline.svg",
        "mermaid": null,
        "caption": "This child's 273-day enrollment is split into 259 days of baseline time (blue, before and after the risk window) and a 14-day risk window after the vaccine dose (orange). The single febrile seizure fell on day 5 of the risk window. The SCCS compares the event rate per day in the orange band to the event rate per day in the blue bands — using only this child's own data, so fixed traits like genetics are not a factor.",
        "alt_text": "Timeline showing a 273-day enrollment period for one child split into two blue baseline spans (99 days before the vaccine and 160 days after the risk window) and one orange 14-day risk window starting at the vaccine dose date, with a marker for the febrile seizure event on day 5 of the risk window.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Cases[Cases only: persons with >=1 acute event] --> Obs[Continuous observation period<br/>FFS-enrolled span, drop MA-only / unenrolled]\n  Obs --> Split[Split person-time into intervals:<br/>risk windows vs baseline, x 30-day age bands]\n  Split --> Count[Count events per interval]\n  Count --> Cond[Conditional Poisson<br/>conditioned on each person's total events<br/>offset = log interval length]\n  Cond --> IRR[exp coef = within-person age-adjusted IRR<br/>risk window vs own baseline]\n  Cond --> Pre[Pre-exposure window null?<br/>screens event-dependent exposure]",
        "caption": "SCCS analytic flow. Only cases contribute; each person's observation time is split into exposure risk windows and baseline, cross-classified by age bands, and a conditional Poisson model returns the within-person relative incidence.",
        "alt_text": "Flowchart from cases-only through continuous observation, interval splitting into risk and baseline by age band, event counting, conditional Poisson estimation, and a pre-exposure null check.",
        "source_type": "illustrative",
        "source_citations": [
          "whitaker-2006"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "gantt\n  title SCCS timeline for one case (vaccine example)\n  dateFormat YYYY-MM-DD\n  axisFormat %b %Y\n  section Baseline\n  Observation start (continuous enrollment) :done, b1, 2024-01-01, 2024-04-09\n  section Pre-exposure check\n  Pre-window [dose-14, dose-1] (screens event-dependent exposure) :crit, pre, 2024-03-26, 14d\n  section Risk window\n  Vaccine dose :milestone, dose, 2024-04-09, 0d\n  Risk window [dose, dose+14] :active, r1, 2024-04-09, 14d\n  section Baseline\n  Remaining baseline person-time :done, b2, 2024-04-23, 2024-09-30",
        "caption": "One case's observation period partitioned into a transient risk window after the dose and baseline time elsewhere. The IRR compares event rates in the risk window to this person's own baseline; the pre-window detects exposure deferred after an event.",
        "alt_text": "Gantt timeline showing baseline person-time, a pre-exposure screening window, a vaccine dose, a 14-day risk window, and remaining baseline within one case's observation period.",
        "source_type": "illustrative",
        "source_citations": [
          "whitaker-2006"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "see_also",
        "target_slug": "self-controlled-risk-interval-rwe",
        "notes": "SCRI is SCCS restricted to a short adjacent control interval rather than all observation time; trades efficiency for robustness to secular/seasonal trends."
      },
      {
        "relation_type": "alternative_to",
        "target_slug": "case-crossover",
        "notes": "Both are within-person designs; case-crossover samples discrete referent windows for instantaneous triggers, while SCCS models the full observation period and handles recurrent events and extended risk windows."
      },
      {
        "relation_type": "alternative_to",
        "target_slug": "active-comparator-new-user",
        "notes": "ACNU estimates a between-person comparative effect with measured-confounder control; SCCS estimates a within-person relative incidence and removes all fixed confounders but needs a transient exposure and acute outcome."
      },
      {
        "relation_type": "see_also",
        "target_slug": "immortal-time-bias-handling",
        "notes": "SCCS sidesteps immortal time by analyzing person-time within cases, but event-dependent observation (e.g., death ending follow-up) is the analogous threat requiring the event-dependent SCCS extension."
      },
      {
        "relation_type": "used_with",
        "target_slug": "negative-control-outcomes-rwe",
        "notes": "A pre-exposure window and negative-control outcomes are standard diagnostics to detect event-dependent exposure and residual time-varying confounding in SCCS."
      }
    ],
    "aliases": [
      "SCCS",
      "self controlled case series",
      "case series method",
      "Farrington method"
    ],
    "complexity": "advanced",
    "regulatory_relevance": [
      "fda",
      "ema"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "self-controlled-risk-interval-rwe",
    "name": "Self-Controlled Risk Interval (SCRI) Design",
    "short_definition": "A within-person design that compares the rate of an acute event in a pre-defined risk interval shortly after a transient exposure with the rate in a pre-defined control interval in the same person, so that all time-fixed individual characteristics cancel; the focused special case of the self-controlled case series used heavily in vaccine safety.",
    "long_description": "**Core idea.** The **self-controlled risk interval (SCRI)** design studies whether a transient exposure (most often a\nvaccine dose) raises the short-term rate of an acute event by comparing, *within each affected individual*, the event\ncount in a pre-specified **risk interval** following exposure against the count in a pre-specified **control interval**\nin the same person. Because every comparison is made within a single individual, all characteristics that do not change\nover the observation period — genetics, sex, baseline frailty, chronic comorbidity, socioeconomic status, healthcare-\nseeking propensity — are exactly conditioned out: they are constant and cannot confound a within-person contrast. Only\ncases (people who experienced the event) contribute information; unexposed-or-no-event time and between-person\ncomparisons are discarded. The effect measure is a **relative incidence (incidence rate ratio)** comparing the event\nrate per unit time in the risk interval to that in the control interval, estimated by **conditional Poisson regression**\n(equivalently, fixed-effects/conditional likelihood that strata on the individual) with an offset for the length of each\ninterval. SCRI is the deliberately focused cousin of the full self-controlled case series (SCCS): rather than modelling\nthe entire observation period, it pre-specifies a narrow risk window and a narrow comparison window, trading some\nstatistical efficiency for transparency and robustness to long-range time trends.\n\n**Relationship to the SCCS.** SCRI is a variant of the **self-controlled case series** (`self-controlled-case-series`):\nthe SCCS uses the full observation time as the comparison and models age/seasonal time with explicit terms; SCRI\nrestricts the comparison to a short, pre-defined control interval near the exposure. 
The narrow control interval makes\nSCRI far less sensitive to long-term secular and age trends (a major SCCS assumption) at the cost of using less of the\ndata, so SCRI is preferred when the event rate has strong age/season structure that is hard to model but is roughly\nflat over the short risk-plus-control span. Both share the same estimator (conditional Poisson) and the same three core\nassumptions below.\n\n**The three assumptions that make SCRI valid.** (1) **The event must not affect the probability of subsequent exposure.**\nIf having the event changes whether or when a person is (re)vaccinated, the within-person comparison is biased; this is\nwhy SCRI/SCCS are appropriate for transient exposures like vaccination that are scheduled independently of the acute\nevent, and inappropriate when the event contraindicates further exposure. (2) **The event must not (substantially)\ncensor or curtail observation**, i.e., it should be non-fatal or rare enough that survivor bias from event-related death\nis negligible; when the event can be fatal, modified SCCS estimators that account for event-dependent observation are\nrequired. (3) **No time-varying confounding across the risk and control intervals** other than the exposure: the\nunderlying event rate must be constant (or modelled) across the short window, so a sharp background-rate spike near the\nexposure but unrelated to it (e.g., a concurrent seasonal epidemic) would violate the design unless the control interval\nis chosen to share that background.\n\n**Pros, cons, and trade-offs.**\n- **vs cohort / between-person designs:** SCRI eliminates *all* time-fixed confounding by construction, needs no\n  unexposed comparator, and avoids the time-fixed confounding by indication that plagues between-person vaccine-safety\n  studies. Cost: it estimates only the *relative* short-term effect (not an absolute risk or a long-term effect), uses\n  cases only, and is exposed to bias from time-varying confounders within the window. 
**Prefer SCRI** for acute events\n  after transient exposures where time-fixed confounding is the dominant threat; **prefer a cohort design** when an\n  absolute risk, a long-term effect, or an unexposed comparison is the question.\n- **vs the full SCCS (`self-controlled-case-series`):** SCCS is more efficient (uses all observation time) but requires\n  correctly modelling age/season trends across the whole period and the no-event-dependent-observation assumption over a\n  longer span. SCRI's short control interval buys robustness to long-range trends and simpler pre-specification.\n  **Prefer SCRI** when long-term time trends are strong and hard to model and the risk window is short; **prefer SCCS**\n  when efficiency matters and the time structure can be modelled.\n- **vs the case-crossover design (`case-crossover`):** Both are within-person, but case-crossover compares exposure\n  status in case vs control *windows referenced to the event* (case-defined sampling), whereas SCRI compares event\n  *counts* in risk vs control windows *referenced to the exposure* (exposure-defined sampling). Case-crossover suits\n  transient exposures and abrupt outcomes (the classic trigger study); SCRI suits a fixed-time exposure (a vaccine dose)\n  with a clear post-exposure risk window. They can be biased by exposure-time trends in opposite ways.\n\n**When to use.** Post-licensure vaccine safety surveillance (febrile seizures after MMR, intussusception after rotavirus\nvaccine, Guillain-Barré after influenza vaccine, myocarditis after mRNA COVID-19 vaccines): a scheduled transient\nexposure, an acute and well-dated outcome, a biologically motivated short risk interval, and strong time-fixed\nconfounding (healthcare-seeking, comorbidity) that a within-person design erases. 
More generally, any transient\npoint-exposure with an acute outcome where the exposure is not triggered by the event and the background rate is roughly\nconstant over the risk-plus-control span.\n\n**When NOT to use — and when it is actively misleading or dangerous.**\n- **The event changes subsequent exposure.** If experiencing the outcome makes further vaccination more or less likely\n  (e.g., a reaction that contraindicates the next dose), the core assumption fails and the relative incidence is biased;\n  using SCRI here is a structural error no amount of window tuning fixes.\n- **The event is commonly fatal.** Event-dependent censoring (death) breaks the standard estimator; a naive SCRI on a\n  high-fatality outcome overstates or understates the effect. Use event-dependent SCCS extensions or a different design.\n- **Time-varying confounding within the window.** A seasonal epidemic, a co-administered intervention, or an age effect\n  that differs sharply between the risk and control intervals contaminates the contrast; SCRI cannot adjust for what it\n  does not model, and a poorly placed control interval bakes the bias in.\n- **Reading a relative incidence as an absolute or long-term risk.** SCRI delivers a short-window rate ratio only;\n  narrating it as the probability a vaccinee will be harmed, or as a chronic effect, misrepresents the estimand.\n- **A pre-exposure risk period contaminated by the indication.** If people are exposed *because* of early symptoms of\n  the event (e.g., vaccinated during a prodrome), event counts cluster just before exposure; ignoring a pre-exposure\n  window biases the risk-interval estimate. 
Model or exclude the pre-exposure period.\n\n**Data-source operational depth.**\n- **Claims:** Exposure (vaccine administration) is captured by CPT/CVX/NDC codes with a service date that anchors the\n  risk and control intervals; the acute outcome is an inpatient or ED claim with a specific diagnosis and admission date.\n  Require continuous enrollment spanning the entire observation window so neither interval is truncated by disenrollment,\n  and restrict to fee-for-service-observable time so Medicare Advantage gaps do not silently shorten an interval. Same-day\n  duplicate/reversed claims and claims lag near the data cut must be cleaned before counting events.\n- **EHR:** Vaccine administration may be recorded in an immunization table or as an order; outcomes are encounter-driven,\n  so an event treated elsewhere is missed and can differentially shorten an interval. Require demonstrable in-system\n  activity across the observation window and confirm the immunization record is complete (often supplemented by a state\n  immunization information system in linked data).\n- **Registry / linked (e.g., Vaccine Safety Datalink, Sentinel):** The strongest substrate: an immunization registry\n  supplies exact dose dates and a linked claims/EHR feed supplies adjudicated, well-dated acute events. Linkage selects\n  the linkable subset and dose-date vs claim-date discrepancies must be reconciled before assigning intervals; these\n  distributed networks are where SCRI is most heavily deployed for near-real-time signal monitoring.",
    "primary_category": "Study_Design",
    "tags": [
      "self-controlled-risk-interval",
      "scri",
      "self-controlled-case-series",
      "vaccine-safety",
      "within-person",
      "relative-incidence",
      "conditional-poisson",
      "risk-interval"
    ],
    "applies_to_study_types": [
      "self_controlled",
      "vaccine_safety",
      "claims_analysis",
      "ehr_study",
      "registry_linkage",
      "safety_surveillance"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1080/10543406.2015.1052819",
        "url": "https://doi.org/10.1080/10543406.2015.1052819",
        "citation_text": "Li R, Stewart B, Weintraub E. Evaluating efficiency and statistical power of self-controlled case series and self-controlled risk interval designs in vaccine safety. Journal of Biopharmaceutical Statistics. 2016;26(4):686-693.",
        "year": 2016,
        "authors_short": "Li et al.",
        "notes": "Defines and contrasts the self-controlled risk interval against the full self-controlled case series, quantifying the efficiency/power trade-off of the narrow control-interval design; the canonical methodological comparison."
      },
      {
        "role": "explain",
        "doi": "10.1002/pds.3885",
        "url": "https://doi.org/10.1002/pds.3885",
        "citation_text": "Li R, Kulldorff M, Russek-Cohen E, Kawai AT, Hua W. Quantifying the impact of time-varying baseline risk adjustment in the self-controlled risk interval design. Pharmacoepidemiology and Drug Safety. 2015;24(12):1304-1312.",
        "year": 2015,
        "authors_short": "Li et al.",
        "notes": "Examines the central SCRI vulnerability - time-varying background risk between the risk and control intervals - and how baseline-risk adjustment changes the relative-incidence estimate; essential for choosing the control interval."
      },
      {
        "role": "explain",
        "doi": "10.1002/sim.2302",
        "url": "https://doi.org/10.1002/sim.2302",
        "citation_text": "Whitaker HJ, Farrington CP, Spiessens B, Musonda P. Tutorial in biostatistics: the self-controlled case series method. Statistics in Medicine. 2006;25(10):1768-1797.",
        "year": 2006,
        "authors_short": "Whitaker et al.",
        "notes": "The standard tutorial for the parent self-controlled case series method - conditional Poisson estimation, the assumptions, and interval definition - of which SCRI is the focused special case."
      },
      {
        "role": "demonstrate",
        "doi": "10.1093/aje/kwy023",
        "url": "https://doi.org/10.1093/aje/kwy023",
        "citation_text": "Yih WK, Maro JC, Nguyen M, et al. Assessment of quadrivalent human papillomavirus vaccine safety using the self-controlled tree-temporal scan statistic signal-detection method in the Sentinel system. American Journal of Epidemiology. 2018;187(6):1269-1276.",
        "year": 2018,
        "authors_short": "Yih et al.",
        "notes": "Applied self-controlled vaccine-safety surveillance in the Sentinel distributed network, illustrating risk- and control-interval definition and within-person event comparison in a real claims/EHR system."
      },
      {
        "role": "use",
        "doi": "10.1016/j.vaccine.2012.01.027",
        "url": "https://doi.org/10.1016/j.vaccine.2012.01.027",
        "citation_text": "Tse A, Tseng HF, Greene SK, Vellozzi C, Lee GM. Signal identification and evaluation for risk of febrile seizures in children following trivalent inactivated influenza vaccine in the Vaccine Safety Datalink Project, 2010-2011. Vaccine. 2012;30(11):2024-2031.",
        "year": 2012,
        "authors_short": "Tse et al.",
        "notes": "Real SCRI/self-controlled signal evaluation in the Vaccine Safety Datalink (febrile seizures after influenza vaccine), exemplifying risk-interval choice and conditional within-person analysis in routine surveillance."
      }
    ],
    "plain_language_summary": "The Self-Controlled Risk Interval (SCRI) design asks whether a one-time exposure like a vaccine dose raises the short-term chance of a side effect by comparing, within the same person, a narrow window right after the exposure with a later window. Because both windows belong to the same individual, everything fixed about that person — their genes, their underlying health, how often they see a doctor — is automatically held constant and cannot distort the answer. The design counts only people who actually had the side effect, and it delivers a single number: how many times more often the event occurred in the post-vaccination risk window than in the later comparison window.",
    "key_terms": [
      {
        "term": "risk interval",
        "definition": "The short window of time immediately after the exposure (for example, days 1 through 8 after a vaccine dose) when a biological effect is considered plausible and events are counted."
      },
      {
        "term": "control interval",
        "definition": "A second window in the same person, further from the exposure (for example, days 15 through 43 after the dose), used as the within-person comparison when the acute effect is assumed to have faded."
      },
      {
        "term": "relative incidence",
        "definition": "The ratio of the event rate during the risk interval to the event rate during the control interval; a value above 1 means the event happened more often right after the exposure."
      },
      {
        "term": "within-person comparison",
        "definition": "Comparing two time windows belonging to the same individual rather than comparing one group of people against another, so that fixed traits like genetics and baseline health automatically cancel out."
      },
      {
        "term": "conditional Poisson regression",
        "definition": "The statistical method used to estimate the relative incidence by counting events per day in each interval while locking in the individual as their own reference, so only the timing difference between intervals drives the result."
      }
    ],
    "worked_example": {
      "scenario": "Researchers want to know whether the influenza vaccine raises the short-term risk of febrile seizures in young children. They pull claims data for 8 children who each had at least one febrile seizure during the study period. For each child, they define two windows anchored on the date the vaccine was given: a risk interval of 8 days (days 1 through 8 after the dose) and a control interval of 29 days (days 15 through 43 after the dose). They count how many seizures each child had in each window, then compare the rates.",
      "dataset": {
        "caption": "One row per child per interval. Each child appears twice: once for the risk interval and once for the control interval. events = seizures counted in that window; ptime = how many days that window lasts.",
        "columns": [
          "person_id",
          "interval",
          "events",
          "ptime_days"
        ],
        "rows": [
          [
            "C001",
            "risk",
            1,
            8
          ],
          [
            "C001",
            "control",
            0,
            29
          ],
          [
            "C002",
            "risk",
            1,
            8
          ],
          [
            "C002",
            "control",
            0,
            29
          ],
          [
            "C003",
            "risk",
            0,
            8
          ],
          [
            "C003",
            "control",
            1,
            29
          ],
          [
            "C004",
            "risk",
            1,
            8
          ],
          [
            "C004",
            "control",
            0,
            29
          ],
          [
            "C005",
            "risk",
            1,
            8
          ],
          [
            "C005",
            "control",
            0,
            29
          ],
          [
            "C006",
            "risk",
            1,
            8
          ],
          [
            "C006",
            "control",
            1,
            29
          ],
          [
            "C007",
            "risk",
            1,
            8
          ],
          [
            "C007",
            "control",
            0,
            29
          ],
          [
            "C008",
            "risk",
            0,
            8
          ],
          [
            "C008",
            "control",
            1,
            29
          ]
        ]
      },
      "steps": [
        "Sum the events in each interval across all 8 children: 6 seizures occurred during the 8-day risk windows; 3 seizures occurred during the 29-day control windows.",
        "Compute the total person-days of observation in each interval: 8 children x 8 days = 64 child-days at risk; 8 children x 29 days = 232 child-days in the control.",
        "Calculate the event rate in the risk interval: 6 events / 64 child-days = 0.0938 seizures per child-day.",
        "Calculate the event rate in the control interval: 3 events / 232 child-days = 0.0129 seizures per child-day.",
        "Divide the risk rate by the control rate: 0.0938 / 0.0129 = 7.25. This is the relative incidence.",
        "Because both windows belong to the same children, everything fixed about each child (their genetics, their general health, how often their parents bring them to the doctor) has already been held constant by design."
      ],
      "result": "Relative incidence = 7.25, meaning febrile seizures occurred about 7 times more often in the 8 days right after the vaccine dose than in the later 29-day control window, in the same children. (Risk rate: 6 events / 64 child-days = 0.0938 per child-day; Control rate: 3 events / 232 child-days = 0.0129 per child-day; ratio = 7.25.)",
      "timeline_spec": {
        "title": "SCRI timeline for one child: influenza vaccine and febrile seizure",
        "caption": "Child C001 received the influenza vaccine on 2024-01-15. One febrile seizure occurred on Day 5 (2024-01-20), inside the 8-day risk interval. No seizure occurred during the 29-day control interval. The gap between intervals (days 9-14) is excluded from analysis.",
        "alt_text": "Horizontal timeline starting at vaccination on 2024-01-15. A green bar spans the 8-day risk interval from 2024-01-16 to 2024-01-23 with a seizure marker on 2024-01-20. A grey bar marks the gap days 2024-01-24 to 2024-01-29. A blue bar spans the 29-day control interval from 2024-01-30 to 2024-02-27 with no events. The relative incidence is shown as risk rate divided by control rate.",
        "window": {
          "start": "2024-01-15",
          "end": "2024-02-27",
          "label": "Day 0 (vaccine) through Day 43 (end of control interval)"
        },
        "events": [
          {
            "label": "Influenza vaccine (Day 0)",
            "start": "2024-01-15",
            "length_days": 1,
            "quantity": "point exposure"
          },
          {
            "label": "Febrile seizure (Day 5)",
            "start": "2024-01-20",
            "length_days": 1,
            "quantity": "1 event in risk interval"
          }
        ],
        "spans": [
          {
            "kind": "exposed",
            "start": "2024-01-16",
            "end": "2024-01-23",
            "label": "Risk interval: Days 1-8 (8 days)"
          },
          {
            "kind": "gap",
            "start": "2024-01-24",
            "end": "2024-01-29",
            "label": "Excluded gap: Days 9-14"
          },
          {
            "kind": "unexposed",
            "start": "2024-01-30",
            "end": "2024-02-27",
            "label": "Control interval: Days 15-43 (29 days)"
          }
        ],
        "result": {
          "label": "Group relative incidence: (6/64) / (3/232) = 0.0938 / 0.0129 = 7.25",
          "value": 7.25
        }
      }
    },
    "prerequisites": [
      "incidence-rate-calculation-rwe",
      "self-controlled-case-series",
      "case-crossover"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "SCRI with a post-exposure control interval",
        "description": "Both the risk interval and the control interval fall after the dose (e.g., days 0-7 risk, days 14-42 control), so the comparison is between two post-vaccination windows; minimizes contamination from pre-exposure indication and from healthcare-seeking around the visit.",
        "edge_cases": [
          "A control interval too close to the risk window can include lingering true effect, biasing the relative incidence toward the null.",
          "A control interval far from the dose reintroduces sensitivity to age/seasonal trends the design was meant to avoid."
        ],
        "data_source_notes": "claims: require continuous FFS-observable enrollment across both intervals so neither is truncated by disenrollment."
      },
      {
        "name": "SCRI with pre-exposure (washout) interval handling",
        "description": "A pre-exposure period is defined and either excluded or modelled to absorb event clustering caused by exposure being triggered by early symptoms (the indication), protecting the risk-interval estimate.",
        "edge_cases": [
          "Leaving a contaminated pre-exposure window unhandled (neither excluded nor modelled) biases the risk-interval rate upward when vaccination follows a prodrome.",
          "The pre-exposure window length is a sensitivity parameter and should be varied."
        ],
        "data_source_notes": "ehr: confirm the immunization record date precedes outcome onset; reconcile order vs administration dates."
      },
      {
        "name": "Conditional Poisson (relative incidence) estimation",
        "description": "The relative incidence is estimated by conditional Poisson regression with an offset for interval length, conditioning on the individual so all time-fixed factors cancel; equivalent to a fixed-effects Poisson stratified on person.",
        "edge_cases": [
          "The estimate is driven by how each case's events split between the risk and control intervals; sparse data may require exact conditional inference.",
          "Multiple events per person require the assumption that events within a person are independent given the interval, or a recurrent-event extension."
        ],
        "data_source_notes": "linked: adjudicated, well-dated events from a registry sharpen interval assignment and reduce outcome misclassification."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "self-controlled-case-series",
        "pros_of_this": "A short, pre-specified control interval makes SCRI far less sensitive to long-term age/seasonal time trends and simpler to pre-specify; same conditional-Poisson estimator.",
        "cons_of_this": "Uses less of the observation time, so it is less statistically efficient (lower power) than the full SCCS.",
        "when_to_prefer": "When long-range time trends are strong and hard to model and the risk window is short; use the full SCCS when efficiency matters and the time structure can be modelled explicitly."
      },
      {
        "compared_to": "case-crossover",
        "pros_of_this": "Exposure-anchored windows fit a scheduled point exposure (a vaccine dose) with a clear post-exposure risk period; counts events rather than sampling control exposure times.",
        "cons_of_this": "Requires a roughly constant background event rate across the risk-plus-control span; a sharp within-window trend biases the contrast.",
        "when_to_prefer": "When the exposure is a fixed-time event with a biologically defined risk window; prefer case-crossover when the exposure itself is the transient trigger and the outcome is abrupt."
      },
      {
        "compared_to": "cohort-retrospective",
        "pros_of_this": "Eliminates all time-fixed confounding by within-person comparison and needs no unexposed comparator, erasing confounding by indication.",
        "cons_of_this": "Estimates only a short-term relative incidence in cases - no absolute risk, no long-term effect, no between-person inference.",
        "when_to_prefer": "For acute events after transient exposures dominated by time-fixed confounding; use a cohort design when an absolute or long-term risk is the question."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Anchor risk and control intervals on the vaccine service date (CPT/CVX/NDC); define the outcome by diagnosis, care setting, and admission date. Require continuous FFS-observable enrollment spanning both intervals so neither is truncated, drop Medicare Advantage-only spans, and clean same-day duplicate/reversed claims and end-of-data claims lag before counting events.",
      "ehr": "Vaccine administration comes from an immunization table or order; outcomes are encounter-driven so events treated out-of-system are missed and can differentially shorten an interval. Require demonstrable in-system activity across the observation window and verify immunization-record completeness, ideally against a state immunization information system.",
      "registry": "An immunization registry supplies exact dose dates and (when linked) adjudicated, well-dated acute events; check registry completeness and reporting lag and reconcile dose-date vs event-date conventions.",
      "linked": "Distributed networks (Vaccine Safety Datalink, Sentinel) are the primary SCRI substrate - registry dose dates plus linked claims/EHR outcomes for near-real-time monitoring; linkage selects the linkable subset and date discrepancies must be reconciled before interval assignment."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\nimport pandas as pd\nimport statsmodels.formula.api as smf\nimport statsmodels.api as sm\n\n# Long format: two rows per case (risk + control). 'events' is the count; 'ptime' is interval length (days).\n# Example (matches the worked example): risk = days 1-8 (8 days), control = days 15-43 (29 days) after the dose.\ndf = pd.DataFrame({\n    \"person_id\": np.repeat(np.arange(1, 9), 2),\n    \"period\":    [\"risk\", \"control\"] * 8,\n    \"events\":    [1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1],\n    \"ptime\":     [8, 29] * 8,\n})\ndf[\"risk\"]    = (df[\"period\"] == \"risk\").astype(int)\ndf[\"log_pt\"]  = np.log(df[\"ptime\"])\n\n# Conditional Poisson via person fixed effects + length offset. C(person_id) stratifies on the individual,\n# so all time-fixed characteristics cancel; exp(beta_risk) is the relative incidence.\nfit = smf.glm(\"events ~ risk + C(person_id)\", data=df,\n              family=sm.families.Poisson(), offset=df[\"log_pt\"]).fit()\nri  = np.exp(fit.params[\"risk\"])\nci  = np.exp(fit.conf_int().loc[\"risk\"])\nprint(f\"Relative incidence (risk vs control) = {ri:.2f} \"\n      f\"(95% CI {ci[0]:.2f}-{ci[1]:.2f})\")",
        "description": "SCRI relative incidence by conditional Poisson regression. Input is one row per (case, interval): person_id, period\n('risk' or 'control'), events (count in that interval for that person), and ptime (interval length in days). Conditioning\non the person is achieved by including person fixed effects (a stratum intercept per case), with log(ptime) as an\noffset; the exponentiated coefficient on the risk indicator is the relative incidence (risk vs control). Every case\ncontributes through how its events split between the two intervals, exactly as the within-person design intends.",
        "dependencies": [
          "pandas",
          "numpy",
          "statsmodels"
        ],
        "source_citations": [
          "li-stewart-weintraub-2016",
          "whitaker-2006"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(gnm)\n\n# Long format: two rows per case (risk + control); events = count, ptime = interval length (days).\ndat <- data.frame(\n  person_id = factor(rep(1:8, each = 2)),\n  period    = rep(c(\"risk\", \"control\"), 8),\n  events    = c(1,0, 1,0, 0,1, 1,0, 1,0, 1,1, 1,0, 0,1),\n  ptime     = rep(c(8, 29), 8)\n)\ndat$risk   <- as.integer(dat$period == \"risk\")\ndat$log_pt <- log(dat$ptime)\n\n# Conditional Poisson: eliminate person-specific intercepts (the within-person conditioning),\n# offset by log interval length; exp(coefficient) on risk is the relative incidence.\nfit <- gnm(events ~ risk + offset(log_pt), eliminate = person_id,\n           family = poisson, data = dat)\nri  <- exp(coef(fit)[\"risk\"])\nci  <- exp(confint.default(fit)[\"risk\", ])\ncat(sprintf(\"Relative incidence (risk vs control) = %.2f (95%% CI %.2f-%.2f)\\n\",\n            ri, ci[1], ci[2]))",
        "description": "SCRI relative incidence by conditional Poisson regression using the gnm package. Input mirrors the Python version:\nper-case risk and control intervals with event counts and interval lengths. gnm with eliminate = person_id fits the\nconditional Poisson, removing the per-person nuisance intercepts; exp(coef) on the risk indicator is the relative\nincidence, and confint.default supplies a Wald 95% confidence interval.",
        "dependencies": [
          "gnm"
        ],
        "source_citations": [
          "li-stewart-weintraub-2016",
          "whitaker-2006"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* Build the offset and the risk indicator. */\ndata scri;\n  set work.scri;                 /* person_id, period ('risk'/'control'), events, ptime */\n  risk   = (period = 'risk');    /* 1 = risk interval, 0 = control interval */\n  log_pt = log(ptime);           /* offset for unequal interval lengths      */\nrun;\n\n/* Conditional Poisson as person-stratified fixed-effects Poisson.\n   CLASS person_id with the person term conditions out all time-fixed characteristics;\n   exp(estimate for risk) is the relative incidence (risk vs control). */\nproc genmod data=scri;\n  class person_id;\n  model events = risk person_id / dist=poisson link=log offset=log_pt;\n  estimate 'log relative incidence (risk vs control)' risk 1 / exp;\nrun;",
        "description": "SCRI relative incidence in SAS via conditional Poisson, fit as a fixed-effects Poisson stratified on the individual.\nInput dataset work.scri has one row per (person_id, period) with events and ptime (interval length in days). PROC\nGENMOD with CLASS person_id and the person main effect conditions on the individual (all time-fixed factors cancel);\nan OFFSET of log(ptime) accounts for unequal interval lengths, and the risk-vs-control rate ratio is read from the\nexponentiated parameter (or an ESTIMATE/LSMEANS contrast). PROC GENMOD is in SAS/STAT.",
        "dependencies": [],
        "source_citations": [
          "li-stewart-weintraub-2016",
          "whitaker-2006"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": "self-controlled-risk-interval-rwe-timeline.svg",
        "mermaid": null,
        "caption": "Child C001 received the influenza vaccine on 2024-01-15. One febrile seizure occurred on Day 5 (2024-01-20), inside the 8-day risk interval. No seizure occurred during the 29-day control interval. The gap between intervals (days 9-14) is excluded from analysis.",
        "alt_text": "Horizontal timeline starting at vaccination on 2024-01-15. A green bar spans the 8-day risk interval from 2024-01-16 to 2024-01-23 with a seizure marker on 2024-01-20. A grey bar marks the gap days 2024-01-24 to 2024-01-29. A blue bar spans the 29-day control interval from 2024-01-30 to 2024-02-27 with no events. The relative incidence is shown as risk rate divided by control rate.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  Dose([Vaccine dose<br/>time zero]) --> Pre[Pre-exposure window<br/>exclude or model]\n  Dose --> Risk[Risk interval<br/>e.g., days 0-7]\n  Dose --> Control[Control interval<br/>e.g., days 14-42]\n  Risk --> Count1[Event count in risk interval]\n  Control --> Count2[Event count in control interval]\n  Count1 --> RI[Conditional Poisson<br/>relative incidence = rate_risk / rate_control]\n  Count2 --> RI\n  RI --> Within[Within-person: all time-fixed<br/>confounders cancel]",
        "caption": "SCRI anchors a short risk interval and a short control interval on the exposure date and compares within-person event rates by conditional Poisson regression, so all time-fixed individual characteristics cancel.",
        "alt_text": "Timeline from a vaccine dose splitting into a pre-exposure window, a risk interval, and a control interval, with event counts feeding a conditional Poisson relative-incidence estimate that cancels time-fixed confounders.",
        "source_type": "illustrative",
        "source_citations": [
          "li-stewart-weintraub-2016"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Q{Transient exposure + acute outcome?} -->|No| Other[Use a cohort or other design]\n  Q -->|Yes| A1{Does the event change<br/>subsequent exposure?}\n  A1 -->|Yes| Stop1[SCRI invalid<br/>assumption violated]\n  A1 -->|No| A2{Is the event commonly fatal?}\n  A2 -->|Yes| Stop2[Event-dependent censoring<br/>use modified SCCS]\n  A2 -->|No| A3{Background rate flat across<br/>risk + control span?}\n  A3 -->|No, strong trend| SCCS[Prefer full SCCS with<br/>time modelling]\n  A3 -->|Yes| SCRI[SCRI appropriate<br/>conditional Poisson relative incidence]",
        "caption": "Decision logic for SCRI eligibility. The design requires a transient exposure not triggered by the event, a non-fatal/uncensoring outcome, and a roughly constant background rate across the short window; otherwise switch to a modified SCCS or a cohort design.",
        "alt_text": "Decision tree checking transient exposure, that the event does not alter later exposure, that the event is not commonly fatal, and that the background rate is flat, leading to SCRI, full SCCS, or another design.",
        "source_type": "illustrative",
        "source_citations": [
          "li-stewart-weintraub-2016"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "is_variant_of",
        "target_slug": "self-controlled-case-series",
        "notes": "SCRI is the focused special case of the SCCS that pre-specifies a short risk interval and a short control interval rather than modelling the full observation period, sharing the conditional-Poisson estimator and core assumptions."
      },
      {
        "relation_type": "alternative_to",
        "target_slug": "case-crossover",
        "notes": "Both are within-person designs, but case-crossover samples control exposure times referenced to the event while SCRI compares event counts in exposure-referenced risk vs control intervals; they suit different exposure/outcome timing structures."
      },
      {
        "relation_type": "see_also",
        "target_slug": "case-time-control",
        "notes": "Case-time-control augments the case-crossover/self-controlled family by adding a control group to adjust for exposure-time trends; relevant when a constant-background-rate assumption is doubtful."
      },
      {
        "relation_type": "requires",
        "target_slug": "incidence-rate-calculation-rwe",
        "notes": "SCRI compares event rates (events per interval-time) across the risk and control intervals, so correct person-time accounting within each interval is the underlying measurement."
      },
      {
        "relation_type": "see_also",
        "target_slug": "descriptive-epidemiology-rwe",
        "notes": "SCRI sits within the descriptive/pharmacoepidemiologic toolkit for vaccine safety signal evaluation and is deployed in distributed surveillance networks."
      }
    ],
    "aliases": [
      "SCRI",
      "self-controlled risk interval",
      "risk interval design",
      "self-controlled risk interval analysis"
    ],
    "complexity": "advanced",
    "regulatory_relevance": [
      "fda",
      "ema",
      "journal"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "sensitivity-specificity-rwe",
    "name": "Sensitivity and Specificity",
    "short_definition": "Paired diagnostic/classification accuracy metrics conditioning on true disease status — sensitivity = TP/(TP+FN) is the probability a true case tests positive, specificity = TN/(TN+FP) is the probability a true non-case tests negative — both invariant to disease prevalence and central to validating phenotypes and claims-based outcome algorithms against a reference standard.",
    "long_description": "**Sensitivity** and **specificity** characterize how well a test, code-based algorithm, or computable phenotype recovers\na true binary status when judged against a reference (\"gold\") standard. Lay the 2x2 table out with truth in columns and\nthe test in rows: true positives (TP), false positives (FP), false negatives (FN), true negatives (TN).\n**Sensitivity = TP / (TP + FN)** is the conditional probability of a positive test *given* the patient truly has the\ncondition (the true-positive rate, P(T+|D+)); **specificity = TN / (TN + FP)** is the conditional probability of a\nnegative test *given* the patient truly does not (the true-negative rate, P(T−|D−)). Their complements are the\nfalse-negative rate (1 − sensitivity) and false-positive rate (1 − specificity). Because each is conditioned on the true\nstatus, **both are properties of the test on a fixed disease spectrum and do not depend on how common the disease is** —\nthis prevalence-invariance is the single most important property distinguishing them from predictive values.\n\n**Core idea.** Sensitivity and specificity answer the \"test-evaluation\" question — *given the truth, how does the test\nbehave?* — which is the question a method developer asks. They are the natural targets of a validation study that samples\non true status (e.g., chart-confirmed cases and confirmed non-cases) and applies the algorithm to each. The contrast is\nwith predictive values (PPV/NPV), which answer the clinician's question — *given the test result, what is the truth?* —\nand which depend on prevalence. In real-world data the reference standard is usually adjudicated chart review, registry\nlinkage, or a validated lab result; the \"test\" is a claims/EHR code algorithm (e.g., \"1 inpatient or 2 outpatient ICD-10\ncodes for the condition within 12 months\"). 
Sensitivity and specificity are then operating characteristics of that\nalgorithm, and they propagate directly into outcome misclassification: a non-differential algorithm with imperfect\nsensitivity/specificity biases a measured risk ratio toward the null in a predictable way, and they are the inputs to\nquantitative-bias-analysis correction.\n\n**Prevalence invariance and its limits.** Holding the case-mix (disease spectrum) fixed, doubling prevalence changes the\ncolumn totals but not the within-column proportions, so sensitivity and specificity are unchanged. The caveat is\n**spectrum effect**: sensitivity tends to be higher in severe, advanced, or clearly symptomatic cases and lower in mild\nor early disease, and specificity varies with the comorbidity profile of the non-cases. Estimates from a tertiary-care\nvalidation set therefore may not transport to a community claims population whose case spectrum differs. Invariance to\nprevalence is mathematical; transportability across spectra is an empirical claim that must be checked.\n\n**Non-differential vs differential misclassification.** When the algorithm's sensitivity and specificity are the *same*\nin the exposed and unexposed (non-differential, independent of exposure), the resulting outcome misclassification biases\na risk ratio or odds ratio *toward the null* on average (for a binary outcome with independent errors), attenuating but\nnot reversing a true effect; this is the usual, somewhat \"forgiving\" case. When sensitivity or specificity *differs* by\nexposure group (differential misclassification — e.g., exposed patients are surveilled more intensively, raising\ndetection), the bias can go in *either* direction and can manufacture or mask an association entirely. 
Establishing that\nan outcome algorithm has the same operating characteristics across arms — or measuring how they differ — is therefore a\ncentral validity task, not a footnote.\n\n**Pros, cons, and trade-offs.**\n- **vs PPV/NPV (`ppv-npv-rwe`):** Sensitivity/specificity are intrinsic to the test and transport across populations of\n  differing prevalence; they are what a validation study estimates. Their cost is that they do *not* tell a downstream\n  analyst the probability that an algorithm-positive patient is truly a case — that is PPV, which depends on prevalence.\n  **Prefer sensitivity/specificity** to characterize and compare algorithms and to feed misclassification corrections;\n  **prefer PPV/NPV** to interpret what a flagged record means in a specific cohort.\n- **vs likelihood ratios (`likelihood-ratios-diagnostic-rwe`):** LR+ = sensitivity/(1 − specificity) and\n  LR− = (1 − sensitivity)/specificity repackage the same two numbers into a single prevalence-independent quantity that\n  updates pre-test odds to post-test odds, which is more directly usable at the bedside. Sensitivity/specificity remain\n  the more transparent reporting pair and the inputs to LRs. They are complements, not substitutes.\n- **vs ROC/AUC (`roc-auc-discrimination-rwe`):** Sensitivity and specificity are evaluated at a *single chosen threshold*;\n  the ROC curve plots sensitivity against (1 − specificity) across *all* thresholds and the AUC summarizes\n  threshold-free discrimination. 
**Prefer a single sensitivity/specificity pair** when the algorithm has a fixed\n  operational cut (most claims phenotypes are binary), **prefer ROC/AUC** when comparing continuous scores or choosing a\n  threshold.\n\n**When to use.** Report sensitivity and specificity (each with an exact binomial 95% CI) whenever you validate a\ncomputable phenotype or a claims/EHR outcome algorithm against a reference standard and you have sampled — or can\nreconstruct — *both* true-positive and true-negative strata; when you need operating characteristics that transport\nacross cohorts of different prevalence; and when you must feed a non-differential or differential misclassification\ncorrection (e.g., the matrix/Rogan-Gladen adjustment) for an effect estimate. They are the right summary when the design\nsamples on true status.\n\n**When NOT to use — and when it is actively misleading or dangerous.**\n- **When you only sampled algorithm-positives.** Pulling charts only on patients the algorithm flagged yields PPV, not\n  sensitivity: you never observe the false negatives (true cases the algorithm missed), so sensitivity is unidentifiable\n  from that design. Reporting a \"sensitivity\" from an algorithm-positive-only chart review is a category error and the\n  most common misuse in claims validation.\n- **As a substitute for what a flagged record means.** A 95%-sensitive, 95%-specific algorithm applied to a 1%-prevalence\n  outcome still has a PPV near 16% — most flags are false. 
Quoting only the high sensitivity/specificity to reassure that\n  \"the algorithm is accurate\" hides that, in a rare-outcome cohort, the positives are mostly wrong.\n- **Across differing case spectra without checking transportability.** Sensitivity estimated on severe inpatient-confirmed\n  cases overstates sensitivity in a community cohort with milder disease (spectrum effect); transporting it silently\n  biases any downstream correction.\n- **When errors are differential and you treat them as non-differential.** Assuming non-differential misclassification\n  (and the comforting toward-the-null attenuation) when surveillance or coding intensity differs by exposure can bias the\n  effect in either direction — including manufacturing a spurious association.\n\n**Data-source operational depth.**\n- **Claims (FFS):** The \"test\" is a code algorithm (ICD-10/CPT/NDC with position and setting rules); the reference is\n  adjudicated chart review or registry linkage on a sample drawn from *both* algorithm-positive and algorithm-negative\n  strata so that FN are observable. Medicare Advantage enrollees generate no fee-for-service claims, so MA-only person-\n  time produces apparent algorithm-negatives that are merely unobserved — restrict the validation frame to FFS-observable\n  person-time or sensitivity is biased upward. Sensitivity is expensive to estimate well because true cases the algorithm\n  misses are, by construction, hard to find without a high-coverage reference.\n- **EHR:** Structured problem lists, labs, and NLP on notes sharpen the algorithm, but encounter-driven capture means a\n  patient seen elsewhere (\"leakage\") looks negative though truly positive, depressing apparent sensitivity differentially\n  by how tied each patient is to the system. Define an \"active in system\" requirement before estimating operating\n  characteristics.\n- **Registry:** Often serves as the reference standard (adjudicated incident events). 
 Use registry linkage to enumerate\n  true positives the claims/EHR algorithm missed, which is the only credible route to a defensible sensitivity.\n\n**Worked claims example.** A team validates a claims algorithm for incident heart failure (\"1 inpatient principal-dx HF\nclaim OR 2 outpatient HF claims >=30 days apart within 365 days\") against adjudicated chart review in a commercial +\nMedicare FFS sample, restricting to FFS-observable person-time. They draw a stratified validation sample: 300 algorithm-\npositive and 300 algorithm-negative charts, then adjudicate true status. Among the 300 algorithm-positives, 261 are\nconfirmed HF (TP = 261, FP = 39); among the 300 algorithm-negatives, 18 turn out to be true HF the algorithm missed\n(FN = 18, TN = 282). **Sensitivity = 261 / (261 + 18) = 0.935 (93.5%)** and **specificity = 282 / (282 + 39) = 0.879\n(87.9%)**. These are the unweighted values from the sampled 2x2 table. Charting the algorithm-negatives is what makes\nfalse negatives observable and sensitivity identifiable, but because the sample was drawn on the algorithm result, the\nraw ratios equal the source-population sensitivity and specificity only if both strata were sampled at the same\nfraction; otherwise weight each cell by its inverse sampling fraction before taking the ratios. Report exact\nbinomial 95% CIs (sensitivity 0.90–0.96; specificity 0.84–0.91). Because incident HF detection is plausibly more\nintensive in one treatment arm (closer follow-up), the team checks whether sensitivity differs by exposure before\nassuming non-differential attenuation, and carries the (sensitivity, specificity) pair into a matrix misclassification\ncorrection of the downstream risk ratio.\n\n**Interpreting the output**\n\nIn the worked example, chart-review validation of a heart-failure algorithm yields sensitivity = 0.935\nand specificity = 0.879 (exact binomial 95% CIs: 0.90–0.96 and 0.84–0.91).\n\n*(1) Formal interpretation.* Sensitivity of 0.935 means that, among patients confirmed by adjudicated\nchart review to have incident heart failure, 93.5% are correctly captured by the claims algorithm;\n6.5% are missed (false-negative rate).
Specificity of 0.879 means that, among confirmed non-cases,\n87.9% are correctly excluded; 12.1% are incorrectly flagged (false-positive rate). Both estimates are\nconditioned on known true status — they are properties of the algorithm in this case-spectrum and\ndata-source context, not of the underlying disease frequency in the broader population. Prevalence does\nnot appear in either formula. The 95% CIs quantify sampling uncertainty around the estimated proportions\ngiven the validation sample size.\n\n*(2) Practical interpretation.* High sensitivity (0.935) indicates the algorithm captures most true\ncases, limiting outcome under-ascertainment. The imperfect specificity (0.879) means some algorithm-\npositives are not true cases — a source of non-differential misclassification that attenuates the\ndownstream risk ratio toward the null in a predictable, quantifiable direction. Before assuming\nnon-differential attenuation, check whether sensitivity differs by exposure arm (differential\nmisclassification). Carry the (sensitivity, specificity) pair into a matrix misclassification-bias\ncorrection to recover a less-biased effect estimate. These estimates may not transport to populations\nwith a different disease spectrum or coding environment.",
    "primary_category": "Outcome_Measure",
    "tags": [
      "sensitivity",
      "specificity",
      "diagnostic-accuracy",
      "phenotype-validation",
      "algorithm-validation",
      "misclassification",
      "prevalence-invariance",
      "reference-standard"
    ],
    "applies_to_study_types": [
      "claims_analysis",
      "ehr_study",
      "registry_linkage",
      "validation_study",
      "diagnostic_accuracy",
      "safety_surveillance"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1136/bmj.308.6943.1552",
        "url": "https://doi.org/10.1136/bmj.308.6943.1552",
        "citation_text": "Altman DG, Bland JM. Statistics Notes: Diagnostic tests 1: sensitivity and specificity. BMJ. 1994;308(6943):1552.",
        "year": 1994,
        "authors_short": "Altman & Bland",
        "notes": "The canonical short definition of sensitivity and specificity as the conditional probabilities of a correct test result given true disease status, with the explanation of why both are independent of prevalence."
      },
      {
        "role": "explain",
        "doi": "10.1136/bmj.h5527",
        "url": "https://doi.org/10.1136/bmj.h5527",
        "citation_text": "Bossuyt PM, Reitsma JB, Bruns DE, Gatsonis CA, Glasziou PP, Irwig L, et al. STARD 2015: an updated list of essential items for reporting diagnostic accuracy studies. BMJ. 2015;351:h5527.",
        "year": 2015,
        "authors_short": "Bossuyt et al.",
        "notes": "The reporting standard for diagnostic accuracy studies; specifies how sensitivity and specificity must be reported (with CIs, against a reference standard, with the sampling frame described) and how spectrum and verification bias are addressed."
      },
      {
        "role": "explain",
        "doi": "10.1016/j.jclinepi.2005.02.022",
        "url": "https://doi.org/10.1016/j.jclinepi.2005.02.022",
        "citation_text": "Reitsma JB, Glas AS, Rutjes AWS, Scholten RJPM, Bossuyt PM, Zwinderman AH. Bivariate analysis of sensitivity and specificity produces informative summary measures in diagnostic reviews. Journal of Clinical Epidemiology. 2005;58(10):982-990.",
        "year": 2005,
        "authors_short": "Reitsma et al.",
        "notes": "Shows that sensitivity and specificity tend to be negatively correlated across studies (partly because studies operate at different thresholds) and should therefore be modeled jointly (bivariate) rather than pooled separately; the basis for modern diagnostic-accuracy meta-analysis."
      },
      {
        "role": "demonstrate",
        "doi": "10.1093/oxfordjournals.aje.a112510",
        "url": "https://doi.org/10.1093/oxfordjournals.aje.a112510",
        "citation_text": "Rogan WJ, Gladen B. Estimating prevalence from the results of a screening test. American Journal of Epidemiology. 1978;107(1):71-76.",
        "year": 1978,
        "authors_short": "Rogan & Gladen",
        "notes": "Demonstrates how known sensitivity and specificity correct an apparent prevalence (and, by extension, a misclassified outcome proportion) to the true value; the workhorse identity behind matrix misclassification correction."
      },
      {
        "role": "use",
        "doi": "10.7326/M14-0698",
        "url": "https://doi.org/10.7326/M14-0698",
        "citation_text": "Moons KGM, Altman DG, Reitsma JB, Ioannidis JPA, Macaskill P, Steyerberg EW, et al. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): explanation and elaboration. Annals of Internal Medicine. 2015;162(1):W1-W73.",
        "year": 2015,
        "authors_short": "Moons et al.",
        "notes": "Reporting guidance for diagnostic/prognostic models; sets expectations for reporting classification accuracy (including sensitivity/specificity at chosen thresholds) alongside discrimination and calibration."
      }
    ],
    "plain_language_summary": "Sensitivity and specificity tell you how well a yes/no test or a code-based rule matches the real truth, when you already know the truth from a careful source like a chart review. Sensitivity is the share of people who truly have the condition that the test correctly flags as positive; specificity is the share of people who truly don't have it that the test correctly clears as negative. A handy property: both numbers stay the same no matter how common the disease is, which is why they are the go-to summary when you are checking how good an algorithm is. One honest caveat: they say nothing about how trustworthy a single positive flag is in your data, because that depends on how common the disease is.",
    "key_terms": [
      {
        "term": "reference standard",
        "definition": "The trusted source of the real answer that the test is judged against, such as a doctor's adjudicated chart review."
      },
      {
        "term": "true positive (TP)",
        "definition": "A person who truly has the condition and whom the test correctly flags as positive."
      },
      {
        "term": "false negative (FN)",
        "definition": "A person who truly has the condition but whom the test wrongly clears as negative, missing them."
      },
      {
        "term": "true negative (TN)",
        "definition": "A person who truly does not have the condition and whom the test correctly clears as negative."
      },
      {
        "term": "false positive (FP)",
        "definition": "A person who truly does not have the condition but whom the test wrongly flags as positive, a false alarm."
      },
      {
        "term": "computable phenotype",
        "definition": "A rule built from billing or health-record codes that labels each patient as having a condition or not."
      }
    ],
    "worked_example": {
      "scenario": "A team built a claims-code rule (an algorithm) to flag patients with heart failure, and they want to know how good it is. They take 300 patients whose true heart-failure status is already known from adjudicated chart review: 100 truly have heart failure and 200 truly do not. They run the algorithm on all 300 and sort everyone into a 2x2 confusion table, then compute sensitivity and specificity.",
      "dataset": {
        "caption": "The 2x2 confusion table an analyst fills in: columns are the truth from chart review, rows are what the algorithm said. Each cell is a count of patients.",
        "columns": [
          "",
          "Disease + (truly has HF)",
          "Disease - (truly no HF)"
        ],
        "rows": [
          [
            "Algorithm + (flagged)",
            "80 (TP)",
            "20 (FP)"
          ],
          [
            "Algorithm - (cleared)",
            "20 (FN)",
            "180 (TN)"
          ]
        ]
      },
      "steps": [
        "Read the truth columns: 100 patients truly have heart failure (80 TP + 20 FN) and 200 truly do not (20 FP + 180 TN).",
        "Sensitivity asks: of the truly-diseased people, what share did the algorithm correctly flag? Go down the Disease + column: TP / (TP + FN) = 80 / (80 + 20) = 80 / 100.",
        "Specificity asks: of the truly-disease-free people, what share did the algorithm correctly clear? Go down the Disease - column: TN / (TN + FP) = 180 / (180 + 20) = 180 / 200.",
        "Each calculation stays inside one truth column, which is why the answers don't change if the disease becomes more or less common."
      ],
      "result": "Sensitivity = 80 / (80 + 20) = 80 / 100 = 0.80 (80%): the algorithm catches 80% of true heart-failure patients and misses 20%. Specificity = 180 / (180 + 20) = 180 / 200 = 0.90 (90%): the algorithm correctly clears 90% of patients without heart failure and false-alarms on 10%."
    },
    "prerequisites": [
      "ppv-npv-rwe",
      "algorithm-validation",
      "endpoint-adjudication-chart-review-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Algorithm validation by stratified sampling on the algorithm result",
        "description": "Charts are sampled separately from algorithm-positive and algorithm-negative strata, adjudicated to true status, and re-weighted to the source prevalence so that both false negatives and false positives are observable and sensitivity and specificity are both identifiable.",
        "edge_cases": [
          "Sampling only algorithm-positives makes false negatives unobservable, so only PPV (not sensitivity) is estimable.",
          "Stratum sampling fractions must be recorded and used in the re-weighting; ignoring them biases sensitivity and specificity, whereas PPV and NPV are estimated within each algorithm stratum and are unaffected."
        ],
        "data_source_notes": "claims: restrict the validation frame to FFS-observable person-time; MA-only spans create spurious algorithm-negatives. Adjudicated chart review or registry linkage serves as the reference standard."
      },
      {
        "name": "Spectrum-stratified sensitivity and specificity",
        "description": "Operating characteristics are estimated within clinically meaningful disease-severity or comorbidity strata to expose spectrum effects and judge whether estimates transport from the validation population to the analysis cohort.",
        "edge_cases": [
          "Sensitivity is typically higher in severe/advanced disease and lower in early/mild disease; a tertiary-care estimate may not hold in a community cohort.",
          "Specificity varies with the comorbidity mix of the non-cases (e.g., conditions that mimic the target on coding)."
        ],
        "data_source_notes": "ehr/registry: use structured severity (stage, labs) to define strata; report sensitivity/specificity per stratum and the case-mix of the validation sample so transportability can be assessed."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "ppv-npv-rwe",
        "pros_of_this": "Conditioned on true status, so invariant to prevalence and transportable across cohorts of differing case frequency; they are what a validation study sampling on truth directly estimates and the natural inputs to misclassification correction.",
        "cons_of_this": "Do not tell a downstream analyst the probability that an algorithm-positive record is truly a case, which is the operationally relevant quantity in a specific cohort.",
        "when_to_prefer": "To characterize and compare algorithms and to feed non-differential/differential misclassification corrections; switch to PPV/NPV to interpret what a flagged record means in a given population."
      },
      {
        "compared_to": "likelihood-ratios-diagnostic-rwe",
        "pros_of_this": "The most transparent and directly reportable pair; each is a single, interpretable proportion and the inputs from which likelihood ratios are derived.",
        "cons_of_this": "Two numbers rather than one; do not by themselves give the post-test odds update that a likelihood ratio provides at a chosen pre-test probability.",
        "when_to_prefer": "For reporting and comparing operating characteristics; convert to LR+ and LR- when the goal is to update an individual's pre-test probability to a post-test probability."
      },
      {
        "compared_to": "roc-auc-discrimination-rwe",
        "pros_of_this": "Describes the algorithm at its actual operating threshold, which is what a binary claims/EHR phenotype uses in practice.",
        "cons_of_this": "Threshold-specific; a single pair cannot summarize discrimination across all possible cut-points the way an AUC does.",
        "when_to_prefer": "When the algorithm has a fixed operational cut (most computable phenotypes); use ROC/AUC for continuous scores or when selecting a threshold."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Define the algorithm by code, position, and setting; sample charts from both algorithm-positive and algorithm-negative strata so false negatives are observable. Restrict the validation frame to FFS-observable person-time (drop Medicare Advantage-only spans). Report sensitivity and specificity with exact binomial CIs and re-weight stratified samples to the source prevalence.",
      "ehr": "Require demonstrable in-system activity so apparent algorithm-negatives are not leakage; structured problem lists, labs, and NLP improve the algorithm. Report whether operating characteristics differ by exposure to justify (or refute) a non-differential misclassification assumption.",
      "registry": "Use the registry (adjudicated incident events) as the reference standard and as the means to enumerate true positives the claims/EHR algorithm missed; this is the only credible route to a defensible sensitivity estimate."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "from scipy.stats import beta\n\ndef exact_binom_ci(k: int, n: int, conf: float = 0.95):\n    # Clopper-Pearson exact interval for a proportion k/n.\n    alpha = 1.0 - conf\n    lo = 0.0 if k == 0 else beta.ppf(alpha / 2, k, n - k + 1)\n    hi = 1.0 if k == n else beta.ppf(1 - alpha / 2, k + 1, n - k)\n    return lo, hi\n\ndef sens_spec(tp: int, fp: int, fn: int, tn: int) -> dict:\n    # Sensitivity conditions on the truly diseased (TP + FN); specificity on the truly non-diseased (TN + FP).\n    sens = tp / (tp + fn)\n    spec = tn / (tn + fp)\n    s_lo, s_hi = exact_binom_ci(tp, tp + fn)\n    p_lo, p_hi = exact_binom_ci(tn, tn + fp)\n    return {\n        \"sensitivity\": sens, \"sensitivity_ci\": (s_lo, s_hi),\n        \"specificity\": spec, \"specificity_ci\": (p_lo, p_hi),\n    }\n\n# Worked claims example: HF algorithm validated by stratified chart review.\nres = sens_spec(tp=261, fp=39, fn=18, tn=282)\nprint(f\"Sensitivity = {res['sensitivity']:.3f}  95% CI {res['sensitivity_ci'][0]:.3f}-{res['sensitivity_ci'][1]:.3f}\")\nprint(f\"Specificity = {res['specificity']:.3f}  95% CI {res['specificity_ci'][0]:.3f}-{res['specificity_ci'][1]:.3f}\")",
        "description": "Compute sensitivity and specificity with exact (Clopper-Pearson) binomial 95% CIs from an adjudicated 2x2\nvalidation table. Inputs are the four cell counts: TP (algorithm-positive AND truly positive),\nFP (algorithm-positive AND truly negative), FN (algorithm-negative AND truly positive),\nTN (algorithm-negative AND truly negative). Sensitivity conditions on the truly diseased (D+) column,\nspecificity on the truly non-diseased (D-) column (Altman & Bland 1994). Worked HF-algorithm example included.",
        "dependencies": [
          "scipy"
        ],
        "source_citations": [
          "altman-bland-1994-sens"
        ],
        "notes": ""
      },
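      {
        "lang": "python",
        "code": "def ppv_npv(sens: float, spec: float, prev: float):\n    # Bayes' theorem on expected cell proportions; predictive values depend on prevalence.\n    tp = sens * prev\n    fn = (1.0 - sens) * prev\n    fp = (1.0 - spec) * (1.0 - prev)\n    tn = spec * (1.0 - prev)\n    return tp / (tp + fp), tn / (tn + fn)\n\ndef rogan_gladen(apparent: float, sens: float, spec: float) -> float:\n    # Corrected prevalence = (apparent + spec - 1) / (sens + spec - 1), truncated to [0, 1].\n    denom = sens + spec - 1.0\n    if denom <= 0:\n        raise ValueError(\"requires sens + spec > 1\")\n    return min(1.0, max(0.0, (apparent + spec - 1.0) / denom))\n\n# Illustrative inputs from the text: 95% sensitivity, 95% specificity, 1% prevalence.\nppv, npv = ppv_npv(0.95, 0.95, 0.01)\nprint(f\"PPV = {ppv:.3f}  NPV = {npv:.4f}\")\n\n# Apparent (algorithm-positive) proportion implied by those inputs, then corrected back.\napparent = 0.95 * 0.01 + 0.05 * 0.99\nprint(f\"Apparent = {apparent:.4f}  corrected = {rogan_gladen(apparent, 0.95, 0.95):.4f}\")",
        "description": "A minimal sketch, under stated assumptions, of how a (sensitivity, specificity) pair combines with prevalence.\nppv_npv applies Bayes' theorem to show why a 95%-sensitive, 95%-specific algorithm at 1% prevalence has a PPV\nnear 16%; rogan_gladen applies the Rogan-Gladen identity to recover true prevalence from an apparent\n(algorithm-positive) proportion. The function names are illustrative and not taken from any library, and the\ninputs are the illustrative values quoted in the text rather than estimates from a validation study.",
        "dependencies": [],
        "source_citations": [
          "rogan-gladen-1978"
        ],
        "notes": "Assumes sensitivity and specificity are known without error and non-differential; propagate their sampling uncertainty separately if the corrected value feeds an effect estimate."
      },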
      {
        "lang": "r",
        "code": "library(epiR)\n\n# Worked HF-algorithm example: rows = algorithm (+/-), columns = truth (D+/D-).\n#            D+    D-\n# algo +    261    39      (TP, FP)\n# algo -     18   282      (FN, TN)\ntab <- as.table(matrix(c(261, 39, 18, 282), nrow = 2, byrow = TRUE,\n                       dimnames = list(Algorithm = c(\"pos\", \"neg\"),\n                                       Truth     = c(\"Dpos\", \"Dneg\"))))\n\nres <- epi.tests(tab, conf.level = 0.95)\nprint(res)                 # sensitivity, specificity (with exact 95% CIs), plus PPV/NPV/LRs\nres$detail[res$detail$statistic %in% c(\"se\", \"sp\"), ]   # isolate sensitivity & specificity rows",
        "description": "Compute sensitivity and specificity from a 2x2 validation table using the epiR package's epi.tests(),\nwhich returns both with exact binomial confidence intervals (and PPV/NPV/likelihood ratios alongside).\nThe table is entered as a 2x2 matrix with test rows (positive, negative) by truth columns\n(diseased, non-diseased), matching the orientation in Altman & Bland (1994).",
        "dependencies": [
          "epiR"
        ],
        "source_citations": [
          "altman-bland-1994-sens"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* Worked HF-algorithm example: expand the adjudicated 2x2 cell counts to subject-level records.\n   truth = 1 if truly diseased; algo = 1 if algorithm-positive. */\ndata valid;\n  do i = 1 to 261; truth = 1; algo = 1; output; end;   /* TP */\n  do i = 1 to  18; truth = 1; algo = 0; output; end;   /* FN */\n  do i = 1 to  39; truth = 0; algo = 1; output; end;   /* FP */\n  do i = 1 to 282; truth = 0; algo = 0; output; end;   /* TN */\n  drop i;\nrun;\n\n/* Sensitivity = P(algo+ | truth+): exact binomial CI within the diseased (truth=1) stratum. */\nproc freq data=valid;\n  where truth = 1;\n  tables algo / binomial(level='1') alpha=0.05;   /* proportion algo=1 among true cases */\n  title 'Sensitivity (exact 95% CI)';\nrun;\n\n/* Specificity = P(algo- | truth-): proportion algo=0 among true non-cases. */\nproc freq data=valid;\n  where truth = 0;\n  tables algo / binomial(level='0') alpha=0.05;\n  title 'Specificity (exact 95% CI)';\nrun;",
        "description": "Compute sensitivity and specificity with exact (Clopper-Pearson) binomial confidence intervals in SAS.\nPROC FREQ with the BINOMIAL option reports an exact CI for a single proportion, so sensitivity is estimated\nwithin the truly diseased stratum (the proportion algorithm-positive among true cases) and specificity\nwithin the truly non-diseased stratum (the proportion algorithm-negative among true non-cases). The DATA step\nexpands the 2x2 cell counts into one record per subject.",
        "dependencies": [],
        "source_citations": [
          "altman-bland-1994-sens"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  Ref{{Reference standard<br/>true disease status}} --> Dpos[Truly diseased D+]\n  Ref --> Dneg[Truly non-diseased D-]\n  Dpos --> TP[Algorithm + : TP]\n  Dpos --> FN[Algorithm - : FN]\n  Dneg --> FP[Algorithm + : FP]\n  Dneg --> TN[Algorithm - : TN]\n  TP --> Sens[Sensitivity = TP / TP+FN<br/>condition on D+ column]\n  FN --> Sens\n  TN --> Spec[Specificity = TN / TN+FP<br/>condition on D- column]\n  FP --> Spec",
        "caption": "The 2x2 validation table with truth in columns. Sensitivity is computed down the truly-diseased column (TP/(TP+FN)); specificity down the truly-non-diseased column (TN/(TN+FP)). Conditioning on the true-status columns is why both are invariant to prevalence.",
        "alt_text": "Diagram splitting subjects by true disease status into diseased and non-diseased, each further split by algorithm result, with sensitivity computed within the diseased column and specificity within the non-diseased column.",
        "source_type": "illustrative",
        "source_citations": [
          "altman-bland-1994-sens"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Q[Validation sample drawn] --> S{Sampled on what?}\n  S -->|True status: cases AND non-cases| Both[FN and FP both observable]\n  S -->|Algorithm-positives only| Pos[FN unobservable]\n  Both --> Est[Sensitivity AND specificity identifiable]\n  Pos --> Only[Only PPV estimable - NOT sensitivity]\n  Est --> Diff{Errors differential by exposure?}\n  Diff -->|No| Null[Non-differential: bias toward null - predictable attenuation]\n  Diff -->|Yes| Either[Differential: bias in either direction - can create or mask effect]",
        "caption": "Two decision points that govern interpretation. The sampling frame determines whether sensitivity is identifiable at all; whether errors are differential by exposure determines the direction of misclassification bias in a downstream effect estimate.",
        "alt_text": "Flowchart showing that sampling on true status makes both sensitivity and specificity estimable while sampling on algorithm-positives yields only PPV, and that differential errors bias effects in either direction.",
        "source_type": "illustrative",
        "source_citations": [
          "rogan-gladen-1978"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "complements",
        "target_slug": "ppv-npv-rwe",
        "notes": "Sensitivity/specificity condition on true status (test-evaluation question) and are prevalence-invariant; PPV/NPV condition on the test result (clinician's question) and depend on prevalence. PPV and NPV follow from sensitivity, specificity, and prevalence via Bayes' theorem."
      },
      {
        "relation_type": "produces",
        "target_slug": "likelihood-ratios-diagnostic-rwe",
        "notes": "LR+ = sensitivity/(1 - specificity) and LR- = (1 - sensitivity)/specificity are computed directly from this pair; likelihood ratios repackage sensitivity and specificity into a prevalence-independent odds-updating quantity."
      },
      {
        "relation_type": "see_also",
        "target_slug": "roc-auc-discrimination-rwe",
        "notes": "The ROC curve plots sensitivity against 1 - specificity across all thresholds; a single sensitivity/specificity pair is one point on that curve, and the AUC summarizes it threshold-free."
      },
      {
        "relation_type": "used_with",
        "target_slug": "diagnostic-accuracy",
        "notes": "Sensitivity and specificity are the primary operating characteristics estimated by a diagnostic accuracy study against a reference standard."
      },
      {
        "relation_type": "used_with",
        "target_slug": "algorithm-validation",
        "notes": "Validating a claims/EHR computable phenotype against adjudicated truth estimates the algorithm's sensitivity and specificity, the core deliverables of an algorithm-validation study."
      },
      {
        "relation_type": "affects",
        "target_slug": "misclassification-bias-correction-rwe",
        "notes": "Estimated sensitivity and specificity are the inputs to matrix/Rogan-Gladen misclassification correction; whether they are differential by exposure determines the direction of the bias to be corrected."
      },
      {
        "relation_type": "see_also",
        "target_slug": "claims-outcome-algorithm-ppv-sensitivity-rwe",
        "notes": "Outcome algorithms in claims trade sensitivity against PPV; sensitivity/specificity are the truth-conditioned half of that trade-off and require charting algorithm-negatives to estimate."
      }
    ],
    "aliases": [
      "sensitivity",
      "specificity",
      "true positive rate",
      "true negative rate",
      "recall",
      "sens/spec"
    ],
    "complexity": "foundational",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta",
      "journal",
      "pqa-cms"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "signal-detection",
    "name": "Signal Detection (Disproportionality Analysis)",
    "short_definition": "A hypothesis-generating screening method that flags drug-event pairs reported more often than expected by computing disproportionality statistics (PRR, ROR, IC/BCPNN, EBGM/GPS) from a contingency table built over a spontaneous-reporting database.",
    "long_description": "**Disproportionality analysis** is the workhorse of quantitative pharmacovigilance signal detection.\nIt does not estimate a risk, a rate, or a causal effect. It asks a single screening question over a\nspontaneous-reporting system (SRS) such as FDA FAERS, WHO VigiBase, or EudraVigilance: *is a particular\ndrug-event combination reported disproportionately more often than would be expected if reporting of that\nevent were independent of that drug?* Every measure — the proportional reporting ratio (PRR), reporting\nodds ratio (ROR), the information component (IC/BCPNN), and the empirical-Bayes geometric mean (EBGM) from\nthe Gamma-Poisson Shrinker (GPS) — is a function of the same 2×2 contingency table collapsed across the\nentire database: cell `a` = reports naming both the drug and the event, `b` = the drug with other events,\n`c` = the event with other drugs, `d` = all other reports.\n\n**Core conceptual distinction**. The estimand here is a *reporting* association, not an *incidence*\nassociation. There is no person-time and no unexposed denominator — only counts of reports — so PRR and\nROR are ratios of reporting proportions, fundamentally different from a risk ratio or rate ratio estimated\nin a cohort. The frequentist measures (PRR, ROR with their χ² and confidence intervals) and the Bayesian\nshrinkage measures (IC from the BCPNN, EBGM from the GPS) answer the same question but differ in one\ndecisive way: shrinkage estimators pull the disproportionality of *low-count* pairs back toward the null,\nso they do not fire on the unstable a=1 or a=2 cells that produce huge, meaningless PRRs. PRR and ROR are\nnearly numerically identical when the event is rare relative to the database (b≈a+b, d≈c+d); they diverge\nfor common events, where the ROR (an odds ratio) is the better-behaved statistic and the one used by the\nNetherlands Pharmacovigilance Centre and EudraVigilance. 
A *signal of disproportionate reporting* (SDR) is\na screening alert, not a confirmed adverse drug reaction — it is the input to clinical review, not the\nconclusion.\n\n**Pros, cons, and trade-offs**.\n- **vs Bayesian shrinkage (IC/BCPNN, EBGM/GPS):** Plain PRR/ROR are transparent, trivially computed, and\n  the basis of the MHRA operational rule, but they are wildly unstable at small `a` and over-fire on rare\n  pairs. Shrinkage estimators (the FDA's EBGM/EB05 and WHO-UMC's IC025) trade interpretability for\n  stability and far better operating characteristics in sparse, high-multiplicity tables. **Prefer PRR/ROR**\n  for a small, transparent, regulator-facing screen of one product; **prefer EBGM or IC** for routine\n  data-mining of millions of drug-event cells.\n- **vs traditional pharmacoepidemiologic designs (cohort, case-control, SCCS):** Disproportionality is\n  almost free, runs over the whole world's spontaneous reports, and detects the unexpected; but it cannot\n  estimate magnitude, cannot establish causation, and is hostage to reporting biases. **Prefer\n  disproportionality** to *generate* hypotheses; **switch to a designed study** (active-comparator\n  new-user cohort, self-controlled case series, sequential active surveillance in claims) to *test* them.\n- **vs self-controlled / sequential active surveillance (TreeScan, MaxSPRT in Sentinel):** SRS\n  disproportionality has no denominator and no temporality; claims/EHR sequential methods add real\n  person-time and exposure timing but cost orders of magnitude more effort and only cover their own\n  population. **Prefer SRS mining** for breadth and speed; **prefer active surveillance** once a candidate\n  signal warrants confirmation in a population with a denominator.\n\n**When to use**. 
Routine, broad, hypothesis-generating safety surveillance over a spontaneous-reporting\ndatabase; periodic safety-update report (PSUR/PBRER) screening; new-product launch monitoring; flagging\nunexpected drug-event pairs for clinical and epidemiologic follow-up; and as the first stage of a tiered\npharmacovigilance system whose later stages are designed studies.\n\n**When NOT to use — and when it is actively misleading or dangerous**.\n- **As evidence of causation or to quantify risk.** A high PRR means an event is *reported* more with a\n  drug — driven as easily by stimulated reporting after a media story, a litigation campaign, or a label\n  warning (notoriety/Weber effect) as by true excess risk. Treating an SDR as a confirmed ADR, or\n  back-calculating an incidence from report counts, is the classic and dangerous misuse.\n- **At very low counts.** With a<3, frequentist PRR/ROR are numerically unstable and confidence intervals\n  are uninformative; MHRA explicitly gates on n≥3 and shrinkage estimators exist precisely to tame this.\n- **When masking is plausible.** A single very large signal within a drug class (or a single dominant\n  event) inflates the comparator margins and can *hide* a genuine smaller signal — disproportionality run\n  without stratification or subsetting can be falsely reassuring.\n- **For within-class or dose-response comparisons** the raw whole-database 2×2 cannot answer; these\n  require stratified or regression-based (logistic/Poisson) disproportionality, or a proper study.\n- **Across heterogeneous report sources without stratification.** Pooling spontaneous, solicited (patient\n  support program), and literature reports, or pooling across regions with different reporting cultures,\n  confounds the reporting association — stratify (Mantel-Haenszel PRR) or restrict.\n\n**Data-source operational depth**.\n- **Spontaneous-reporting systems (FAERS, VigiBase, EudraVigilance) — the primary substrate:** The unit is\n  the *report* (an 
ICSR), not the patient. There is **no denominator**: you only ever see a/(a+b+c+d), so\n  no rate, risk, or absolute frequency can be computed — only disproportionality. Failure modes:\n  (1) *Duplicates* — the same case submitted by patient, physician, and manufacturer inflates `a`; run\n  deduplication (e.g., VigiBase's vigiMatch, or FAERS case-version/primaryid logic keeping the latest\n  follow-up) before tabulating. (2) *Notoriety / stimulated reporting* — a label change, Dear-Doctor\n  letter, or lawsuit causes a reporting spike unrelated to incidence; interpret SDRs against the reporting\n  timeline. (3) *Indication confounding / event masking* — the event may track the underlying disease, and\n  a dominant class signal can mask a smaller one; mitigate with stratification or comparator restriction.\n  (4) *Term granularity* — MedDRA Preferred Terms vs grouped Standardised MedDRA Queries (SMQs) change the\n  2×2 materially; pre-specify the level. (5) *Multiplicity* — millions of drug-event cells are screened, so\n  naive α=0.05 thresholds generate thousands of false alerts; this is the operational reason shrinkage\n  (EB05≥2, IC025>0) and fixed criteria (PRR≥2 & χ²≥4 & n≥3) replace p-values.\n- **Claims / EHR (active surveillance extension):** Disproportionality is *not* the natural method here —\n  these data have denominators, so cohort, self-controlled, or sequential-testing methods (TreeScan\n  tree-based scan statistics, MaxSPRT in FDA Sentinel) are used instead and constitute a different method\n  family. If a disproportionality-style screen is run over claims, beware prevalent-user contamination,\n  exposure misclassification from `days_supply` gaps, and look-elsewhere/multiplicity from scanning the\n  full MedDRA/ICD tree. 
Do not present claims-based active-surveillance alerts as if they were SRS\n  disproportionality: the bias structure and the estimand differ.\n- **Literature / solicited / registry reports:** Literature and patient-support-program (solicited)\n  reports follow different reporting dynamics than truly spontaneous ICSRs; mixing them without\n  stratification contaminates the reporting association. Keep report type as a stratification variable.\n\n**Worked example (FAERS-style 2×2 for one drug-event pair).** Question: is the pair *drug X / rhabdomyolysis*\na signal of disproportionate reporting in a snapshot of the database? After deduplication, the collapsed\ncontingency table is: `a` = 40 reports naming both drug X and rhabdomyolysis; `b` = 1,960 reports naming\ndrug X with any *other* event; `c` = 3,000 reports naming rhabdomyolysis with any *other* drug; `d` =\n995,000 (all remaining reports). Then:\n- PRR = [a/(a+b)] / [c/(c+d)] = (40/2000) / (3000/998000) = 0.0200 / 0.003006 = **6.65**.\n- χ² (Yates-corrected, 1 df) on this table ≈ **185** (far exceeds 4).\n- ROR = (a·d)/(b·c) = (40·995000)/(1960·3000) = 39,800,000 / 5,880,000 = **6.77**; 95% CI on the log scale,\n  SE(lnROR) = sqrt(1/a+1/b+1/c+1/d) = sqrt(1/40+1/1960+1/3000+1/995000) = 0.161, so CI = exp(ln6.77 ± 1.96·0.161)\n  = **(4.94, 9.28)**, lower bound well above 1.\n- MHRA operational signal criteria: PRR ≥ 2 **and** χ² ≥ 4 **and** n (=a) ≥ 3 → 6.65 ≥ 2, 185 ≥ 4, 40 ≥ 3:\n  **all met → signal of disproportionate reporting**. This is a flag for clinical and epidemiologic review,\n  not a confirmed adverse drug reaction; confirmation requires a denominator-based study.\n\n**Interpreting the output**\n\nIn the worked example, drug X receives a PRR of 6.65 and an ROR of 6.77 (95% CI 4.94–9.28), with all\nthree MHRA criteria satisfied (a = 40, PRR ≥ 2, χ² ≈ 185 ≥ 4).\n\n*(1) Formal interpretation.* The PRR of 6.65 means that rhabdomyolysis accounts for a 6.65-fold larger\nshare of drug-X reports than of reports for all other drugs in the database. The ROR of 6.77 is the\nodds-ratio analog and is preferred when the event is relatively common in the SRS. Both are statistics\nderived from a 2×2 table of report counts: there is no exposed or unexposed person-time, no incidence rate,\nand no background-rate denominator. A 95% CI lower bound well above 1 (here 4.94) addresses within-database\nsampling uncertainty; it does not address reporting biases such as notoriety bias, stimulated reporting, or\ndifferential under-reporting across drugs. The EBGM/IC shrinkage alternatives would pull the estimate closer\nto the null if a were small, but at a = 40 shrinkage is minimal.\n\n*(2) Practical interpretation.* A PRR of 6.65 meeting MHRA criteria is a **signal of disproportionate\nreporting**, not a confirmed adverse drug reaction and not a risk ratio. The absence of a denominator means\nyou cannot calculate incidence or attribute causation. The appropriate response is clinical and\nepidemiologic review (for example, a post-authorisation safety study, PASS, with a proper cohort denominator) before any regulatory or\nlabel action. Do not conflate the disproportionality statistic with a rate ratio or an odds ratio from a\ncohort study; they measure categorically different quantities.",
    "primary_category": "Inferential_Statistics",
    "tags": [
      "signal-detection",
      "disproportionality",
      "pharmacovigilance",
      "spontaneous-reports",
      "PRR",
      "ROR",
      "BCPNN",
      "EBGM",
      "FAERS",
      "safety-surveillance"
    ],
    "applies_to_study_types": [
      "signal_detection"
    ],
    "data_sources": [
      "spontaneous_reports",
      "claims",
      "ehr",
      "registry"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1002/pds.677",
        "url": "https://doi.org/10.1002/pds.677",
        "citation_text": "Evans SJW, Waller PC, Davis S. Use of proportional reporting ratios (PRRs) for signal generation from spontaneous adverse drug reaction reports. Pharmacoepidemiology and Drug Safety. 2001;10(6):483-486.",
        "year": 2001,
        "authors_short": "Evans et al.",
        "notes": "Origin of the PRR and the operational signal rule (PRR>=2, chi-square>=4, n>=3) used by the MHRA and widely thereafter."
      },
      {
        "role": "explain",
        "doi": "10.1080/00031305.1999.10474456",
        "url": "https://doi.org/10.1080/00031305.1999.10474456",
        "citation_text": "DuMouchel W. Bayesian data mining in large frequency tables, with an application to the FDA spontaneous reporting system. The American Statistician. 1999;53(3):177-190.",
        "year": 1999,
        "authors_short": "DuMouchel",
        "notes": "Foundational statement of the empirical-Bayes Gamma-Poisson Shrinker (EBGM/EB05) that stabilizes disproportionality in sparse, high-multiplicity tables; basis of the FDA's data-mining approach."
      },
      {
        "role": "demonstrate",
        "doi": "10.1007/s002280050466",
        "url": "https://doi.org/10.1007/s002280050466",
        "citation_text": "Bate A, Lindquist M, Edwards IR, et al. A Bayesian neural network method for adverse drug reaction signal generation. European Journal of Clinical Pharmacology. 1998;54(4):315-321.",
        "year": 1998,
        "authors_short": "Bate et al.",
        "notes": "The BCPNN information component (IC, with IC025 lower bound) deployed operationally on WHO VigiBase; the canonical Bayesian shrinkage signal-detection method in international pharmacovigilance."
      },
      {
        "role": "use",
        "doi": "10.1002/pds.1742",
        "url": "https://doi.org/10.1002/pds.1742",
        "citation_text": "Bate A, Evans SJW. Quantitative signal detection using spontaneous ADR reporting. Pharmacoepidemiology and Drug Safety. 2009;18(6):427-436.",
        "year": 2009,
        "authors_short": "Bate & Evans",
        "notes": "Synthesis and applied comparison of PRR, ROR, IC, and EBGM, including their operating characteristics, thresholds, and limitations in routine surveillance."
      }
    ],
    "plain_language_summary": "Signal detection in drug safety screening asks one question over a database of voluntary adverse-event reports: is a particular drug paired with a particular side effect reported far more often than you would expect by chance? The core tool is a simple 2x2 table that counts how often the drug and event appear together versus apart, then computes a ratio called the PRR or ROR to measure that excess. A high ratio is called a signal of disproportionate reporting, which is a flag for experts to investigate further, not proof that the drug caused the event.",
    "key_terms": [
      {
        "term": "disproportionality",
        "definition": "A reporting imbalance: the drug-event pair appears in the database more often than its individual frequencies would predict if the two were unrelated."
      },
      {
        "term": "PRR",
        "definition": "Proportional Reporting Ratio: the share of reports mentioning the drug that also mention the event, divided by that same share for all other drugs combined; a value well above 1 suggests a reporting excess."
      },
      {
        "term": "ROR",
        "definition": "Reporting Odds Ratio: the odds that a report names the event given it names the drug, divided by the odds that a report names the event given it does not name the drug; numerically close to PRR for rare events but better-behaved when the event is common."
      },
      {
        "term": "spontaneous report",
        "definition": "A voluntary submission by a patient, prescriber, or manufacturer to a safety database such as FDA FAERS or WHO VigiBase, describing a suspected drug side effect."
      },
      {
        "term": "signal of disproportionate reporting",
        "definition": "A drug-event pair that meets a pre-specified numeric threshold, flagging it for clinical review; it is a hypothesis-generating alert, not a confirmed causal finding."
      }
    ],
    "worked_example": {
      "scenario": "A pharmacovigilance analyst is screening FDA FAERS to ask whether Drug X is disproportionately reported with rhabdomyolysis (severe muscle breakdown). After deduplication, she collapses the entire database into one 2x2 table. She wants to compute the PRR and ROR and apply the MHRA signal criteria (PRR >= 2 AND chi-square >= 4 AND at least 3 co-reports) to decide whether to flag this pair for review.",
      "dataset": {
        "caption": "Collapsed 2x2 contingency table from 1,000,000 deduplicated spontaneous reports. Each cell is a count of reports, not patients.",
        "columns": [
          "cell",
          "label",
          "count"
        ],
        "rows": [
          [
            "a",
            "Drug X AND rhabdomyolysis",
            40
          ],
          [
            "b",
            "Drug X AND any OTHER event",
            1960
          ],
          [
            "c",
            "Any OTHER drug AND rhabdomyolysis",
            3000
          ],
          [
            "d",
            "Any OTHER drug AND any OTHER event",
            995000
          ]
        ]
      },
      "steps": [
        "Row totals: reports naming Drug X = a + b = 40 + 1960 = 2000; reports naming any other drug (not Drug X) = c + d = 3000 + 995000 = 998000.",
        "Drug X reporting proportion for rhabdomyolysis: a / (a+b) = 40 / 2000 = 0.0200 (2.00% of Drug X reports mention rhabdomyolysis).",
        "Background reporting proportion: c / (c+d) = 3000 / 998000 = 0.003006 (0.30% of all other-drug reports mention rhabdomyolysis).",
        "PRR = 0.0200 / 0.003006 = 6.65 (rhabdomyolysis makes up a share of Drug X reports about 6.65 times its share of all other-drug reports).",
        "ROR = (a x d) / (b x c) = (40 x 995000) / (1960 x 3000) = 39,800,000 / 5,880,000 = 6.77.",
        "SE of ln(ROR) = sqrt(1/40 + 1/1960 + 1/3000 + 1/995000) = sqrt(0.02500 + 0.00051 + 0.00033 + 0.000001) = sqrt(0.02584) = 0.161.",
        "95% CI on ROR = exp(ln(6.77) plus or minus 1.96 x 0.161) = (4.94, 9.28); the lower bound 4.94 is well above 1.",
        "MHRA signal check: PRR = 6.65 >= 2, Yates-corrected chi-square on this table is approximately 185 >= 4, and a = 40 >= 3. All three criteria are met."
      ],
      "result": "PRR = 6.65 and ROR = 6.77 (95% CI 4.94 to 9.28). The MHRA signal criteria are met: Drug X / rhabdomyolysis is a signal of disproportionate reporting and should be forwarded for clinical and epidemiologic review. This means rhabdomyolysis appears in Drug X reports far more often than in the rest of the database, not that Drug X definitely causes rhabdomyolysis."
    },
    "prerequisites": [
      "case-report",
      "safety-signal-case-definition-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Proportional reporting ratio (PRR) with MHRA signal criteria",
        "description": "Frequentist ratio of reporting proportions from the whole-database 2x2, declared a signal when PRR>=2 AND chi-square (Yates) >=4 AND number of co-reported cases (a) >=3.",
        "edge_cases": [
          "Unstable and inflated at small a; the n>=3 gate exists specifically to suppress these.",
          "Nearly identical to the ROR for rare events but biased relative to it when the event is common in the database (b is not approximately a+b)."
        ],
        "data_source_notes": "SRS: deduplicate ICSRs before tabulating; pre-specify MedDRA level (PT vs SMQ) because it changes a and c."
      },
      {
        "name": "Reporting odds ratio (ROR) with log-scale 95% CI",
        "description": "The cross-product odds ratio (a*d)/(b*c) with CI from SE(lnROR)=sqrt(1/a+1/b+1/c+1/d); a signal is declared when the lower CI bound exceeds 1. Permits logistic-regression adjustment for covariates (age, sex, reporting year, co-medication).",
        "edge_cases": [
          "Undefined or infinite when any cell is zero; add 0.5 continuity correction or use exact methods.",
          "Regression-adjusted ROR can address indication confounding and masking that the raw 2x2 cannot."
        ],
        "data_source_notes": "EudraVigilance and the Netherlands centre use ROR; stratify or model report type and region to avoid pooling heterogeneous reporting cultures."
      },
      {
        "name": "Bayesian shrinkage (IC/BCPNN and EBGM/GPS)",
        "description": "Empirical-Bayes / Bayesian estimators that shrink low-count disproportionality toward the null. WHO-UMC uses the information component IC with lower bound IC025>0; the FDA uses EBGM with lower bound EB05>=2.",
        "edge_cases": [
          "Far more stable than PRR/ROR under massive multiplicity (millions of drug-event cells), at the cost of interpretability and computation.",
          "Shrinkage can delay detection of a genuine but low-count emerging signal relative to frequentist screens."
        ],
        "data_source_notes": "SRS data-mining standard; EB05>=2 and IC025>0 are the de facto thresholds replacing naive p-value screening across the full MedDRA grid."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Bayesian shrinkage estimators (EBGM/GPS, IC/BCPNN)",
        "pros_of_this": "PRR/ROR are transparent, hand-computable, and underpin the explicit MHRA operational rule.",
        "cons_of_this": "Severe instability and over-firing at small cell counts and under heavy multiplicity, which shrinkage estimators are designed to fix.",
        "when_to_prefer": "A small, transparent, regulator-facing screen of a single product; teaching and audit."
      },
      {
        "compared_to": "Designed pharmacoepidemiologic studies (cohort / case-control / SCCS)",
        "pros_of_this": "Near-free, global coverage of spontaneous reports, detects the unexpected, ideal for hypothesis generation.",
        "cons_of_this": "No denominator, no rate or risk, no temporality, no causal claim; hostage to reporting biases (notoriety, stimulated reporting, duplicates, masking).",
        "when_to_prefer": "Generating safety hypotheses; the screening stage of a tiered surveillance system."
      },
      {
        "compared_to": "Sequential active surveillance in claims/EHR (TreeScan, MaxSPRT in Sentinel)",
        "pros_of_this": "Immediate, broad, and cheap; requires only the spontaneous database.",
        "cons_of_this": "Lacks the person-time denominator and exposure timing that active surveillance provides for quantification and confirmation.",
        "when_to_prefer": "Breadth-first screening; defer to active surveillance to confirm a candidate signal in a population with a real denominator."
      }
    ],
    "implementation_notes_by_data_source": {
      "spontaneous_reports": "Unit of analysis is the report (ICSR), not the patient; there is no denominator, so only disproportionality (not rate/risk) is computable. Deduplicate before tabulating, pre-specify the MedDRA level (PT vs SMQ), and interpret signals against the reporting timeline to separate true excess from notoriety/stimulated reporting.",
      "claims": "Disproportionality is not the natural method when a denominator exists; use cohort, self-controlled, or sequential testing (TreeScan, MaxSPRT) instead. If a disproportionality-style screen is run, control for prevalent-user contamination, exposure misclassification, and tree-wide multiplicity.",
      "ehr": "As with claims, prefer denominator-based or sequential methods; EHR adds richer covariates for regression-adjusted disproportionality but introduces phenotyping and capture-window misclassification.",
      "registry": "Literature and solicited/registry reports follow different reporting dynamics than spontaneous ICSRs; keep report type as a stratification variable rather than pooling."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\nimport numpy as np\nfrom scipy.stats import chi2_contingency\n\ndef disproportionality(reports: pd.DataFrame,\n                       drug_col: str = \"drug_name\",\n                       event_col: str = \"event_pt\") -> pd.DataFrame:\n    # One row per (report, drug, event) so one ICSR cannot count twice for the same pair.\n    # report_id must be kept: deduplicating on drug and event alone would collapse every pair to one row.\n    pairs = reports[[\"report_id\", drug_col, event_col]].drop_duplicates()\n\n    N = pairs[\"report_id\"].nunique()                                       # total reports = a+b+c+d\n    n_drug  = pairs.groupby(drug_col)[\"report_id\"].nunique()               # a+b : reports naming the drug\n    n_event = pairs.groupby(event_col)[\"report_id\"].nunique()              # a+c : reports naming the event\n    a_tab   = pairs.groupby([drug_col, event_col])[\"report_id\"].nunique()  # a   : reports naming both\n\n    rows = []\n    for (drug, event), a in a_tab.items():\n        a = int(a)\n        b = int(n_drug[drug]  - a)                           # drug, other events\n        c = int(n_event[event] - a)                          # event, other drugs\n        d = int(N - a - b - c)                               # all other reports\n        if min(a, b, c, d) < 0:\n            continue\n\n        # PRR is undefined when no other drug is reported with the event (c == 0).\n        prr = (a / (a + b)) / (c / (c + d)) if c > 0 else np.nan\n\n        # ROR with log-scale 95% CI; 0.5 continuity correction if any zero cell.\n        ca, cb, cc, cd = (a, b, c, d) if min(a, b, c, d) > 0 else (a + .5, b + .5, c + .5, d + .5)\n        ror = (ca * cd) / (cb * cc)\n        se  = np.sqrt(1/ca + 1/cb + 1/cc + 1/cd)\n        ror_lo, ror_hi = np.exp(np.log(ror) - 1.96 * se), np.exp(np.log(ror) + 1.96 * se)\n\n        # Yates-corrected chi-square; skipped when a margin is empty (an expected count would be zero).\n        if (c + d) > 0 and (b + d) > 0:\n            chi2 = chi2_contingency([[a, b], [c, d]], correction=True)[0]\n        else:\n            chi2 = np.nan\n\n        expected = (a + b) * (a + c) / N                     # E[a] under independence\n        rrr = a / expected if expected else np.nan           # relative reporting ratio (GPS input)\n\n        # MHRA operational criteria; comparisons with NaN evaluate to False.\n        signal = bool(prr >= 2 and chi2 >= 4 and a >= 3)\n        rows.append(dict(drug=drug, event=event, a=a, b=b, c=c, d=d,\n                         PRR=prr, ROR=ror, ROR_lo=ror_lo, ROR_hi=ror_hi,\n                         chi2_yates=chi2, RRR=rrr, MHRA_signal=signal))\n\n    return pd.DataFrame(rows).sort_values(\"PRR\", ascending=False)",
        "description": "Disproportionality screen (PRR, ROR with 95% CI, Yates chi-square, MHRA signal flag, and the unshrunk\nrelative reporting ratio) over a spontaneous-reporting database. Required input (one row per deduplicated\nICSR-level drug-event mention, after MedDRA coding and duplicate removal):\n  reports : report_id, drug_name, event_pt   # event_pt = MedDRA Preferred Term\nThe function counts distinct reports to build the whole-database 2x2 for every drug-event pair and returns\none row per pair. RRR (a/E_a) is the multiplicative input the Gamma-Poisson Shrinker shrinks toward 1.",
        "dependencies": [
          "pandas",
          "numpy",
          "scipy"
        ],
        "source_citations": [
          "evans-2001",
          "dumouchel-1999"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\n\n## ---- Canonical route: PhViD (all four estimators incl. Bayesian shrinkage) ----\n## library(PhViD)\n## counts <- unique(as.data.table(reports)[, .(report_id, drug_name, event_pt)])[, .(count = .N), by = .(drug_name, event_pt)]\n## dm <- as.PhViD(as.data.frame(counts))   # three columns: drug, event, count of reports (n11)\n## PRR(dm,    RR0 = 1, MIN.n11 = 3)        # PRR with n>=3 gate\n## ROR(dm,    RR0 = 1, MIN.n11 = 3)        # reporting odds ratio\n## BCPNN(dm,  RR0 = 1, MIN.n11 = 3)        # information component IC025 > 0\n## GPS(dm,    RR0 = 1, MIN.n11 = 3)        # EBGM with EB05 >= 2\n\n## ---- Manual fallback (PRR, ROR + 95% CI, Yates chi-square, MHRA flag) ----\ndisproportionality <- function(reports) {\n  ## One row per (report, drug, event); report_id is needed to count reports per pair.\n  dt <- unique(as.data.table(reports)[, .(report_id, drug_name, event_pt)])\n  ## Counts are stored as doubles so the margin products below cannot overflow R integers.\n  N       <- as.numeric(uniqueN(dt$report_id))                               # a+b+c+d (total reports)\n  n_drug  <- dt[, .(nd = as.numeric(uniqueN(report_id))), by = drug_name]    # a+b\n  n_event <- dt[, .(ne = as.numeric(uniqueN(report_id))), by = event_pt]     # a+c\n  tab     <- dt[, .(a = as.numeric(uniqueN(report_id))), by = .(drug_name, event_pt)]\n  tab <- merge(tab, n_drug,  by = \"drug_name\")\n  tab <- merge(tab, n_event, by = \"event_pt\")\n\n  tab[, `:=`(b = nd - a, c = ne - a)]\n  tab[, d := N - a - b - c]\n\n  tab[, PRR := (a / (a + b)) / (c / (c + d))]\n  ## 0.5 continuity correction on all four cells if any cell is zero.\n  tab[, `:=`(ca = ifelse(pmin(a,b,c,d) > 0, a, a + .5),\n             cb = ifelse(pmin(a,b,c,d) > 0, b, b + .5),\n             cc = ifelse(pmin(a,b,c,d) > 0, c, c + .5),\n             cd = ifelse(pmin(a,b,c,d) > 0, d, d + .5))]\n  tab[, ROR := (ca * cd) / (cb * cc)]\n  tab[, se  := sqrt(1/ca + 1/cb + 1/cc + 1/cd)]\n  tab[, `:=`(ROR_lo = exp(log(ROR) - 1.96 * se),\n             ROR_hi = exp(log(ROR) + 1.96 * se))]\n\n  ## Yates-corrected chi-square on each 2x2 (vectorised over rows).\n  tab[, chi2 := {\n    E11 <- (a + b) * (a + c) / N\n    E12 <- (a + b) * (b + d) / N\n    E21 <- (c + d) * (a + c) / N\n    E22 <- (c + d) * (b + d) / N\n    (abs(a - E11) - .5)^2 / E11 + (abs(b - E12) - .5)^2 / E12 +\n      (abs(c - E21) - .5)^2 / E21 + (abs(d - E22) - .5)^2 / E22\n  }]\n\n  tab[, MHRA_signal := PRR >= 2 & chi2 >= 4 & a >= 3]\n  tab[order(-PRR), .(drug_name, event_pt, a, b, c, d, PRR, ROR, ROR_lo, ROR_hi, chi2, MHRA_signal)]\n}",
        "description": "Disproportionality screen in R. The canonical package is PhViD (PRR, ROR, BCPNN information component,\nand the Gamma-Poisson Shrinker EBGM, each as its own function in one package). Required input: a data.frame\nof deduplicated, MedDRA-coded drug-event mentions with columns report_id, drug_name, and event_pt. A manual\nPRR/ROR/chi-square fallback is shown for environments without PhViD.",
        "dependencies": [
          "PhViD",
          "data.table"
        ],
        "source_citations": [
          "evans-2001",
          "bate-1998"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* One drug-event mention per report so an ICSR cannot double-count the same pair. */\nproc sql;\n  create table pairs as\n  select distinct report_id, drug_name, event_pt from work.reports;\n  select count(distinct report_id) into :N trimmed from pairs;   /* a+b+c+d */\nquit;\n\nproc sql;\n  create table cells as\n  select p.drug_name, p.event_pt,\n         count(distinct p.report_id) as a,                                   /* both */\n         (select count(distinct x.report_id) from pairs x\n            where x.drug_name = p.drug_name) as n_drug,                      /* a+b  */\n         (select count(distinct y.report_id) from pairs y\n            where y.event_pt  = p.event_pt)  as n_event                      /* a+c  */\n  from pairs p\n  group by p.drug_name, p.event_pt;\nquit;\n\ndata dispro;\n  set cells;\n  b = n_drug  - a;\n  c = n_event - a;\n  d = &N - a - b - c;\n\n  PRR = (a / (a + b)) / (c / (c + d));\n\n  /* 0.5 continuity correction if any zero cell, then ROR + 95% CI on the log scale. */\n  ca = ifn(min(a,b,c,d) > 0, a, a + 0.5);\n  cb = ifn(min(a,b,c,d) > 0, b, b + 0.5);\n  cc = ifn(min(a,b,c,d) > 0, c, c + 0.5);\n  cd = ifn(min(a,b,c,d) > 0, d, d + 0.5);\n  ROR    = (ca * cd) / (cb * cc);\n  se     = sqrt(1/ca + 1/cb + 1/cc + 1/cd);\n  ROR_lo = exp(log(ROR) - 1.96 * se);\n  ROR_hi = exp(log(ROR) + 1.96 * se);\n\n  /* Yates-corrected chi-square for the 2x2. */\n  E11 = (a + b) * (a + c) / &N;  E12 = (a + b) * (b + d) / &N;\n  E21 = (c + d) * (a + c) / &N;  E22 = (c + d) * (b + d) / &N;\n  chi2 = (abs(a - E11) - 0.5)**2 / E11 + (abs(b - E12) - 0.5)**2 / E12\n       + (abs(c - E21) - 0.5)**2 / E21 + (abs(d - E22) - 0.5)**2 / E22;\n\n  RRR = a / E11;                                  /* relative reporting ratio (GPS input) */\n  MHRA_signal = (PRR >= 2 and chi2 >= 4 and a >= 3);\nrun;\n\nproc sort data=dispro; by descending PRR; run;",
        "description": "Disproportionality screen in SAS. PROC FREQ builds the per-pair 2x2 margins and supplies the chi-square;\na DATA step assembles a/b/c/d and computes PRR, ROR, the log-scale 95% CI, the relative reporting ratio,\nand the MHRA signal flag. Required input (post data-management, deduplicated, MedDRA-coded):\n  work.reports : report_id, drug_name, event_pt\nFor Bayesian shrinkage (EBGM/EB05), call the openEBGM R/SAS implementation or PROC MCMC; for regression-\nadjusted disproportionality, PROC LOGISTIC with drug as the predictor and age/sex/year as covariates\nyields an adjusted ROR.",
        "dependencies": [],
        "source_citations": [
          "evans-2001",
          "dumouchel-1999"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  SRS[Spontaneous reports<br/>FAERS / VigiBase / EudraVigilance] --> Dedup[Deduplicate ICSRs<br/>+ MedDRA code drug & event]\n  Dedup --> Tab[Collapse whole database into 2x2 per drug-event pair<br/>a both / b drug-other / c event-other / d rest]\n  Tab --> Freq[Frequentist: PRR, ROR + 95% CI, Yates chi-square]\n  Tab --> Bayes[Bayesian shrinkage: IC025 BCPNN / EB05 GPS]\n  Freq --> Rule[MHRA: PRR>=2 AND chi2>=4 AND n>=3]\n  Bayes --> Rule2[EB05>=2 or IC025>0]\n  Rule --> SDR{Signal of disproportionate reporting?}\n  Rule2 --> SDR\n  SDR -->|yes| Review[Clinical + epidemiologic review<br/>check notoriety / duplicates / masking]\n  Review --> Study[Confirm in a denominator-based study<br/>cohort / SCCS / sequential active surveillance]\n  SDR -->|no| Monitor[Continue routine surveillance]",
        "caption": "Disproportionality signal-detection pipeline. The 2x2 is collapsed over the entire spontaneous database; frequentist and Bayesian estimators feed fixed signal criteria; an SDR is a screening flag that must be triaged for reporting artifacts and then confirmed in a study with a real denominator.",
        "alt_text": "Flowchart from spontaneous reports through deduplication and MedDRA coding, 2x2 tabulation, frequentist and Bayesian disproportionality estimators, signal criteria, triage for reporting artifacts, and confirmation in a denominator-based study.",
        "source_type": "illustrative",
        "source_citations": [
          "evans-2001"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  subgraph T[\"2x2 contingency table (collapsed over the database)\"]\n    A[\"a = drug + event\"]\n    B[\"b = drug + other events\"]\n    C[\"c = other drugs + event\"]\n    D[\"d = all other reports\"]\n  end\n  A --> M1[\"PRR = a/(a+b) / c/(c+d)\"]\n  A --> M2[\"ROR = a*d / b*c\"]\n  A --> M3[\"RRR = a / E[a]  -> shrink -> EBGM/IC\"]",
        "caption": "The single 2x2 table underlying every disproportionality measure. PRR uses the drug and non-drug reporting proportions; ROR is the cross-product odds ratio; the relative reporting ratio a/E[a] is the quantity the Gamma-Poisson Shrinker (EBGM) and BCPNN (IC) shrink toward the null.",
        "alt_text": "Diagram of the four cells a, b, c, d of the disproportionality 2x2 table feeding the PRR, ROR, and relative reporting ratio formulas.",
        "source_type": "illustrative",
        "source_citations": [
          "dumouchel-1999"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "see_also",
        "target_slug": "safety-signal-case-definition-rwe",
        "notes": "A reproducible MedDRA-based event case definition (PT vs SMQ grouping) determines cells a and c and therefore the disproportionality result; specify it before tabulating."
      },
      {
        "relation_type": "see_also",
        "target_slug": "algorithm-validation",
        "notes": "Disproportionality is hypothesis generation, NOT outcome-algorithm validation; PPV/sensitivity of an event-identification algorithm is a separate concept and a separate computation."
      },
      {
        "relation_type": "alternative_to",
        "target_slug": "prescription-sequence-symmetry",
        "notes": "Sequence symmetry analysis is a self-controlled screening alternative that adds temporal ordering and a within-person comparison that spontaneous-report disproportionality lacks."
      },
      {
        "relation_type": "alternative_to",
        "target_slug": "self-controlled-case-series",
        "notes": "SCCS is a denominator-and-time-aware design used to confirm or quantify a candidate signal that disproportionality can only flag."
      },
      {
        "relation_type": "used_with",
        "target_slug": "empirical-calibration-negative-controls-rwe",
        "notes": "Negative-control drug-event pairs calibrate signal thresholds and quantify the false-positive rate of a disproportionality screen under massive multiplicity."
      },
      {
        "relation_type": "used_with",
        "target_slug": "pass-imposed",
        "notes": "A confirmed signal of disproportionate reporting frequently triggers a post-authorisation safety study (PASS) to test it in a population with a denominator."
      },
      {
        "relation_type": "see_also",
        "target_slug": "case-report",
        "notes": "Individual case safety reports (ICSRs) are the raw unit aggregated into the disproportionality 2x2; a striking single case can itself be a qualitative signal."
      }
    ],
    "aliases": [
      "disproportionality analysis",
      "disproportionality",
      "quantitative signal detection",
      "signal of disproportionate reporting",
      "pharmacovigilance signal detection",
      "adverse event data mining"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "ema"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "single-arm-external-control",
    "name": "Single-Arm Trial with External (Historical) Control",
    "short_definition": "A design in which patients treated in a single-arm study are compared with a non-randomized comparator assembled from outside the trial (historical trials, registries, or real-world claims/EHR data) to estimate a treatment effect that the trial cannot estimate internally.",
    "long_description": "A **single-arm trial with an external control** estimates a treatment effect by contrasting outcomes of patients\nwho all received the investigational therapy against outcomes of a comparator group drawn from *outside* the study —\nhistorical clinical-trial arms, disease registries, or real-world claims/EHR cohorts. There is no internal\nrandomization and no concurrent control; the entire comparison rests on the assumption that, after design and\nanalytic adjustment, the external cohort is exchangeable with the treated patients. This design is used when a\nrandomized concurrent control is infeasible or unethical — rare diseases with no standard of care, life-threatening\nconditions where withholding a promising therapy is indefensible, or one-arm accelerated-approval programs — and it\nis the workhorse of the \"external control arm\" (ECA) submissions now common in oncology and rare-disease regulatory\nfilings.\n\n**Core estimand distinction**. The target is the **average treatment effect in the treated** (ATT): the effect of\nthe investigational drug *in the patients who actually received it*, with the external cohort standing in for their\nunobserved counterfactual outcome under standard of care. This is emphatically **not** a randomized contrast. In an\nRCT, exchangeability holds by design (randomization balances measured *and* unmeasured prognostic factors); in an\nexternally controlled study it must be *manufactured* through eligibility translation, covariate adjustment, and\nborrowing assumptions, and it can never be verified for unmeasured factors. 
Two analytic philosophies coexist:\n(1) a **frequentist confounding-adjustment** view that treats the ECA as an observational cohort and balances it to\nthe trial arm with propensity-score matching or ATT-weighted IPTW, targeting the ATT (overlap weighting is also used, but\nit shifts the target from the ATT to the overlap population); and (2) a **Bayesian\ndynamic-borrowing** view (commensurate, power, or robust meta-analytic-predictive [MAP] priors) that down-weights\nthe historical control's contribution when it conflicts with concurrent data, accepting a controlled amount of bias\nin exchange for lower variance. The estimand label (ATT vs a partially borrowed hybrid contrast), the discounting parameter, and the\noutcome (almost always overall survival, because death is hard to ascertain differentially) must be pre-specified.\n\n**Pros, cons, and trade-offs**.\n- **vs a concurrent randomized control (the gold standard):** the external control is faster, smaller, cheaper, and\n  avoids randomizing patients to a placebo or withheld therapy. Cost: it surrenders the one property that makes the\n  RCT causal, namely randomized balance of *unmeasured* prognosis. Effect estimates can be badly biased by secular trends\n  in standard of care, differential outcome ascertainment, and unmeasured prognostic factors. **Prefer the RCT**\n  whenever a concurrent control is ethically and operationally possible.\n- **vs a single-arm trial with no formal comparator (objective-response-rate filing):** an external control supplies\n  a quantitative time-to-event contrast (OS/PFS hazard ratio) rather than relying on a historical benchmark response\n  rate. 
Cost: it imports all of the comparability problems above, which a within-patient ORR endpoint sidesteps.\n  **Prefer the formal ECA** when the endpoint is time-to-event and a credible external cohort exists; prefer the\n  benchmark-ORR framing when no comparable external cohort can be assembled.\n- **vs a target-trial emulation in routine-care data (two real-world arms):** the externally controlled trial keeps\n  the protocol-quality treated arm (adjudicated outcomes, prospective data capture) and only the *control* is\n  observational. Cost: the two arms now differ in data provenance — the asymmetry creates differential measurement\n  that a fully observational emulation, with both arms ascertained identically, avoids. **Prefer the ECA** when the\n  treated arm is already a regulatory single-arm trial; **prefer the emulation** when both exposures are observable\n  in the same routine-care source.\n- **Frequentist ATT-matching vs Bayesian dynamic borrowing:** matching/weighting is transparent, regulator-familiar,\n  and yields a clean ATT, but discards external patients off the overlap region and ignores the historical sample\n  size. Dynamic borrowing uses the full historical information adaptively but introduces a prior-data-conflict\n  parameter that drives the answer and must be justified. **Prefer matching/weighting** for a primary regulatory\n  analysis; reserve borrowing for augmenting an already-randomized small control or for clearly pre-specified,\n  well-calibrated priors.\n\n**When to use**. 
A randomized concurrent control is infeasible or unethical (ultra-rare disease, no standard of care,\ndramatic early efficacy signal); the disease is severe with a well-characterized, stable natural history; a credible\nexternal cohort exists whose patients meet the trial's eligibility criteria and were treated in a comparable calendar\nera; the primary endpoint is hard to ascertain differentially (overall survival is the canonical choice); and key\nprognostic factors (for solid tumors: ECOG performance status, line of therapy, disease stage, and relevant\nbiomarkers such as PD-L1, EGFR/ALK, LDH) are *measured in both* the trial and the external source so they can be\nbalanced. FDA's 2023 guidance on externally controlled trials and EMA's reflection paper on external comparators\nboth frame these as the gating conditions.\n\n**When NOT to use — and when it is actively misleading or dangerous**.\n- **Standard of care is improving over calendar time.** A historical cohort treated five years ago experienced worse\n  supportive care, fewer subsequent-line options, and different imaging cadence; the apparent benefit then conflates\n  drug effect with secular improvement. In fast-moving fields (immuno-oncology) this can manufacture an entirely\n  spurious survival advantage. Restrict the external cohort to a contemporary calendar window.\n- **The primary endpoint is differentially ascertained.** Progression-free survival is RECIST-adjudicated on a fixed\n  schedule in the trial but claims/EHR \"progression\" is coded irregularly and late — the HR for PFS across trial and\n  real-world data is usually uninterpretable. 
Use overall survival; if PFS is unavoidable, it is a red flag.\n- **Decisive prognostic factors are unmeasured in the real-world source.** Claims have no ECOG, no tumor stage, no\n  biomarker status; if these drive both treatment era and survival, no amount of claims-based adjustment removes the\n  confounding, and the study is misleading regardless of how balanced the *measured* covariates look.\n- **Eligibility cannot be translated.** Trial inclusion criteria like \"adequate organ function\" or \"no symptomatic\n  brain metastases\" have no clean claims operationalization; an external cohort that silently includes sicker,\n  ineligible patients will look worse than the treated arm for reasons unrelated to the drug.\n- **Severe non-overlap / positivity violation.** If the treated trial population is younger and fitter than any\n  realistically assembled external cohort, matching discards most of the external data and the surviving estimand\n  no longer maps to a meaningful population.\n\n**Pocock's classic acceptability criteria** (still the table reviewers expect) require that the historical control\ngroup: (1) received a precisely defined standard treatment, (2) was part of a recent study, (3) used the same\neligibility criteria, (4) had comparable baseline prognostic characteristics, (5) was evaluated in the same\norganization with the same methods, and (6) shows no other reason to expect a different outcome. Each criterion maps\nto a modern threat: (2)/(6) → secular trend, (3) → eligibility translation, (4) → measured/unmeasured confounding,\n(5) → differential ascertainment.\n\n**Data-source operational depth**.\n- **Historical clinical-trial control arms:** the highest-quality external control — adjudicated endpoints,\n  protocol-defined assessments, measured prognostic covariates — and the substrate for Bayesian borrowing (robust\n  MAP priors). 
Failure mode: trials enroll selected, fit populations and pre-date current standard of care, so\n  calendar drift and selection are the dominant biases; restrict to recent, eligibility-matched trials and\n  down-weight via a robust prior that reacts to prior-data conflict.\n- **Claims (FFS or MA):** strong for treatment dates (`fill_date`, `days_supply`, procedure dates), enrollment, and\n  mortality when linked to a death index, but they carry no ECOG, stage, or biomarker fields — the decisive\n  oncology prognosticators are unmeasured. Require continuous medical + pharmacy enrollment so first-line therapy is\n  truly first-line, not unobserved earlier treatment. **Medicare Advantage person-time lacks fee-for-service\n  claims**, so apparent \"untreated\" intervals can be missingness, not absence of therapy — restrict to FFS Parts\n  A/B/D or commercial members with full pharmacy benefit. In an elderly cohort, **differential competing risk of\n  death by treatment era** distorts any non-mortality endpoint, another reason OS is preferred. **Immortal time**\n  is acute here: in procedure- or treatment-defined external cohorts, starting follow-up at diagnosis but requiring\n  a later therapy to define group membership grants survival time before the therapy could have acted — anchor\n  time-zero at therapy initiation in *both* arms.\n- **EHR:** captures performance status, labs, and (via notes/NLP) stage and biomarkers that claims miss, sharpening\n  eligibility translation and confounding control — but visit-driven capture means a patient who leaves the system\n  is differentially lost, and \"real-world progression\" is coded inconsistently. Link to claims for complete\n  treatment history and to vital records for death.\n- **Disease registries:** strongest for stage, histology, and adjudicated outcomes; typically weak for complete\n  pharmacy exposure and later-line therapy. 
Link to claims for the full treatment trajectory and to a death index to\n  firm up survival.\n- **Linked claims–EHR–registry–vital-records:** the ideal external-control substrate (EHR severity + claims\n  completeness + registry adjudication + reliable mortality), but linkage introduces selection (only the linkable\n  subset) and date-discrepancy issues among diagnosis, treatment, and service dates that must be reconciled before\n  time-zero assignment.\n\n**Worked oncology example.** Question: does an investigational therapy improve overall survival versus standard of\ncare in previously treated advanced non-small-cell lung cancer (NSCLC)? The treated arm is a 120-patient single-arm\ntrial; the external control is assembled from a SEER–Medicare linked cohort. (1) *Eligibility translation*: from the\ntrial protocol, require a confirmed NSCLC diagnosis (ICD-O histology), at least one prior systemic regimen (≥1\nqualifying antineoplastic claim before index), age ≥18, and 365 days of continuous Parts A/B/D FFS enrollment before\nindex so prior lines are observable; **exclude MA-only person-time** because FFS chemotherapy/infusion claims are\nabsent there. (2) *Time-zero*: index = the `fill_date`/administration date of first second-line systemic therapy in\nthe external cohort, mirroring the trial's \"treatment start\"; anchoring both arms at therapy start prevents immortal\ntime. (3) *Baseline covariates* measured only in the 365-day pre-index window: age, sex, prior-line count, time from\ndiagnosis to second line, comorbidity index, and — critically — every prognostic factor *also captured in the trial*\n(ECOG proxy from EHR linkage where available, stage). (4) *Balance*: fit a propensity score for trial-arm membership\nvs external control and apply 1:1 matching or overlap weighting; confirm post-balance standardized mean differences\n<0.1 and report the residual unmeasured factors (true ECOG, PD-L1) that claims cannot supply. 
(5) *Outcome*: overall\nsurvival from a linked death index — **not** claims-coded progression, which is differentially ascertained. (6)\n*Sensitivity*: an **E-value** for how strong an unmeasured confounder (e.g., performance status) would need to be to\nexplain away the HR; **negative-control outcomes** to detect residual era/selection bias; a **tipping-point /\nBayesian dynamic-borrowing** analysis varying the discount on the historical information; and a contemporary-only\ncalendar restriction to probe secular-trend bias. The estimand is the ATT in the trial-treated population, reported\nwith explicit acknowledgment that randomized balance of unmeasured prognosis is unattainable.",
    "primary_category": "Study_Design",
    "tags": [
      "external-control",
      "historical-control",
      "single-arm",
      "external-control-arm",
      "synthetic-control-arm",
      "regulatory-rwe",
      "secular-trend",
      "dynamic-borrowing",
      "propensity-score",
      "oncology"
    ],
    "applies_to_study_types": [
      "single_arm_external_control"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1002/pds.4975",
        "url": "https://doi.org/10.1002/pds.4975",
        "citation_text": "Burcu M, Dreyer NA, Franklin JM, et al. Real-world evidence to support regulatory decision-making for medicines: considerations for external control arms. Pharmacoepidemiology and Drug Safety. 2020;29(10):1228-1235.",
        "year": 2020,
        "authors_short": "Burcu et al.",
        "notes": "Modern regulatory framing of external control arms; enumerates the design, comparability, and bias considerations (eligibility translation, outcome ascertainment, secular trend) for ECA submissions."
      },
      {
        "role": "explain",
        "doi": "10.1016/0021-9681(76)90044-8",
        "url": "https://doi.org/10.1016/0021-9681(76)90044-8",
        "citation_text": "Pocock SJ. The combination of randomized and historical controls in clinical trials. Journal of Chronic Diseases. 1976;29(3):175-188.",
        "year": 1976,
        "authors_short": "Pocock",
        "notes": "Foundational statement of the six acceptability criteria a historical control group must meet; still the canonical checklist for judging external-control comparability."
      },
      {
        "role": "explain",
        "doi": "10.1002/cpt.857",
        "url": "https://doi.org/10.1002/cpt.857",
        "citation_text": "Franklin JM, Schneeweiss S. When and how can real world data analyses substitute for randomized controlled trials? Clinical Pharmacology & Therapeutics. 2017;102(6):924-933.",
        "year": 2017,
        "authors_short": "Franklin & Schneeweiss",
        "notes": "Lays out the conditions under which non-randomized real-world comparisons (including external controls) can credibly stand in for an RCT, and where they fail."
      },
      {
        "role": "demonstrate",
        "doi": "10.1111/biom.12242",
        "url": "https://doi.org/10.1111/biom.12242",
        "citation_text": "Schmidli H, Gsteiger S, Roychoudhury S, O'Hagan A, Spiegelhalter D, Neuenschwander B. Robust meta-analytic-predictive priors in clinical trials with historical control information. Biometrics. 2014;70(4):1023-1032.",
        "year": 2014,
        "authors_short": "Schmidli et al.",
        "notes": "Operationalizes Bayesian dynamic borrowing via robust MAP priors that automatically down-weight historical controls under prior-data conflict; the reference method for the borrowing branch."
      },
      {
        "role": "demonstrate",
        "doi": "10.1177/2168479018778282",
        "url": "https://doi.org/10.1177/2168479018778282",
        "citation_text": "Lim J, Walley R, Yuan J, et al. Minimizing patient burden through the use of historical subject-level data in innovative confirmatory clinical trials: review of methods and opportunities. Therapeutic Innovation & Regulatory Science. 2018;52(5):546-559.",
        "year": 2018,
        "authors_short": "Lim et al.",
        "notes": "Reviews subject-level historical-borrowing methods (power priors, commensurate priors, MAP, propensity-score integration) and their regulatory applicability for confirmatory trials."
      }
    ],
    "plain_language_summary": "A single-arm trial gives the investigational drug to every patient enrolled — there is no placebo group inside the trial itself. To still measure 'how much better did patients do compared with standard care,' researchers build an external control arm by pulling a comparison group from existing records (insurance claims, hospital charts, or disease registries) rather than randomizing anyone. The two groups are then statistically balanced on key prognostic factors before the survival outcomes are compared. The main catch is that no balancing technique can account for factors that were never recorded — such as a patient's physical fitness or detailed tumor biology — so the result carries more uncertainty than a head-to-head randomized trial would.",
    "key_terms": [
      {
        "term": "single-arm trial",
        "definition": "A clinical trial in which all enrolled patients receive the investigational treatment with no concurrent placebo or comparator group inside the study."
      },
      {
        "term": "external control arm",
        "definition": "A comparison group assembled from outside the trial — from insurance claims, electronic health records, or a registry — used in place of a randomized control group."
      },
      {
        "term": "propensity score",
        "definition": "A single summary number (between 0 and 1) that captures how similar each external-control patient is to the trial patients based on their recorded characteristics, used to balance the two groups before comparing outcomes."
      },
      {
        "term": "standardized mean difference (SMD)",
        "definition": "A number that measures how different the trial arm and external control arm are on a given characteristic after balancing; values below 0.10 indicate good balance."
      },
      {
        "term": "overall survival",
        "definition": "The length of time from the start of treatment until a patient dies; preferred as the primary outcome in external-control studies because it is less likely to be measured differently across the two data sources."
      },
      {
        "term": "secular trend bias",
        "definition": "The distortion that occurs when a historical control group was treated in an earlier era when standard care was worse, making the new drug look more effective than it actually is."
      }
    ],
    "worked_example": {
      "scenario": "A single-arm trial enrolled 120 patients with advanced non-small-cell lung cancer (NSCLC) who all received an investigational second-line therapy. Because there was no placebo group inside the trial, researchers built an external control arm from insurance claims: 240 real-world patients with NSCLC who started a standard second-line treatment in roughly the same calendar period. To make the comparison fair, both groups were balanced on age, prior treatment count, and comorbidity burden using propensity-score matching, leaving 110 matched pairs. Overall survival from the start of second-line treatment was then compared between the two groups.",
      "dataset": {
        "caption": "Summary characteristics before and after propensity-score matching (110 matched pairs each).",
        "columns": [
          "characteristic",
          "trial_arm_before",
          "ext_ctrl_before",
          "smd_before",
          "trial_arm_after",
          "ext_ctrl_after",
          "smd_after"
        ],
        "rows": [
          [
            "n",
            "120",
            "240",
            "-",
            "110",
            "110",
            "-"
          ],
          [
            "median age (years)",
            "62",
            "67",
            "0.41",
            "63",
            "63",
            "0.03"
          ],
          [
            "prior lines of therapy (mean)",
            "1.2",
            "1.8",
            "0.52",
            "1.3",
            "1.3",
            "0.04"
          ],
          [
            "comorbidity index (mean)",
            "1.4",
            "2.1",
            "0.48",
            "1.5",
            "1.5",
            "0.06"
          ]
        ]
      },
      "steps": [
        "Step 1 — Align time zero: both arms anchor their follow-up clock at the date second-line treatment started, so neither group is given credit for surviving before therapy began.",
        "Step 2 — Check raw imbalance: before matching, external-control patients are older (67 vs 62), have more prior treatment lines (1.8 vs 1.2), and more comorbidities (2.1 vs 1.4) — all SMDs above 0.40, indicating poor comparability.",
        "Step 3 — Apply propensity-score matching: each trial patient is paired with one external-control patient who has a similar propensity score; 10 trial patients with no good match are dropped, leaving 110 pairs.",
        "Step 4 — Confirm balance: after matching all three SMDs fall below 0.10, meaning age, prior lines, and comorbidity are now similar between groups.",
        "Step 5 — Compare overall survival: in the matched set, median overall survival is 14.2 months in the trial arm and 9.8 months in the external control arm, a hazard ratio of 0.63 (trial arm has 37% lower hazard of death).",
        "Step 6 — Flag the key biases: (a) unmeasured factors such as tumor stage and performance status were not in the claims data and could not be balanced — if trial patients were fitter, the benefit may be overstated; (b) if the external cohort was treated even 3-4 years earlier, improvements in supportive care (secular trend) could inflate the apparent survival advantage."
      ],
      "result": "Matched HR = 0.63 (trial arm vs external control arm); median OS 14.2 months vs 9.8 months. Balance achieved on all measured covariates (SMD < 0.10 after matching), but unmeasured prognostic factors (performance status, tumor stage) and potential secular trends remain unverifiable threats to the estimate."
    },
    "prerequisites": [
      "propensity-score-methods-psm-iptw",
      "new-user-design",
      "target-trial-emulation"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "External control arm (ECA) with PS matching / weighting to the trial",
        "description": "The external cohort is treated as an observational comparator and balanced to the single-arm trial population by 1:1 propensity-score matching, IPTW, or overlap weighting on prognostic factors measured in both sources, targeting the ATT in the trial-treated patients.",
        "edge_cases": [
          "Non-overlap when the trial enrolled fitter/younger patients than any assemblable external cohort; matching discards much of the external data and the estimand drifts to the overlap region.",
          "Prognostic factors measured in the trial (ECOG, biomarkers) but absent in a claims external source cannot be balanced, leaving residual confounding invisible to measured-covariate diagnostics."
        ],
        "data_source_notes": "claims/EHR: build the external cohort with translated eligibility and anchor time-zero at therapy start; report which trial covariates have no real-world analogue."
      },
      {
        "name": "Bayesian dynamic borrowing (robust MAP / power / commensurate priors)",
        "description": "Historical control information is incorporated through a prior whose influence shrinks automatically when the historical and concurrent data conflict, instead of being matched one-to-one; useful for augmenting a small randomized control or pooling historical trial arms.",
        "edge_cases": [
          "The discount/heaviness parameter (e.g., the robust-prior mixture weight) drives the result and must be pre-specified and calibrated; an over-confident prior re-imports the bias borrowing is meant to guard against.",
          "Type I error inflation under prior-data conflict if the prior is insufficiently robust; operating characteristics must be simulated across plausible conflict scenarios."
        ],
        "data_source_notes": "Best suited to subject-level historical trial-arm data; far harder to justify when the historical control is heterogeneous registry/claims data."
      },
      {
        "name": "Hybrid (augmented) control",
        "description": "A small concurrent randomized control is augmented with external/historical controls, combining internal randomization for partial unmeasured-confounding protection with external data for power.",
        "edge_cases": [
          "The concurrent and external controls may differ systematically; a test-then-pool or dynamic-borrowing rule is needed rather than naive pooling.",
          "Allocation ratio and borrowing strength jointly determine power and bias and must be set before unblinding."
        ],
        "data_source_notes": "Requires a comparable contemporary external source; calendar alignment between the concurrent and external controls is the key comparability lever."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Concurrent randomized controlled trial",
        "pros_of_this": "Faster, smaller, cheaper; avoids randomizing patients to placebo or withheld therapy in severe/rare disease; enables one-arm accelerated-approval programs.",
        "cons_of_this": "Surrenders randomized balance of unmeasured prognostic factors; vulnerable to secular trends, differential outcome ascertainment, and eligibility-translation error.",
        "when_to_prefer": "Only when a concurrent randomized control is ethically or operationally infeasible and a credible, contemporary, eligibility-matched external cohort exists."
      },
      {
        "compared_to": "Single-arm trial with a benchmark objective-response-rate",
        "pros_of_this": "Provides a quantitative time-to-event (OS) comparison rather than relying on a fixed historical response-rate threshold.",
        "cons_of_this": "Imports comparability problems (confounding, secular trend, ascertainment) that a within-patient ORR endpoint avoids.",
        "when_to_prefer": "Time-to-event primary endpoint with a credible external cohort; otherwise the benchmark-ORR framing is cleaner."
      },
      {
        "compared_to": "Target-trial emulation with two real-world arms",
        "pros_of_this": "Retains the protocol-quality treated arm (adjudicated outcomes, prospective capture); only the control is observational.",
        "cons_of_this": "Asymmetric data provenance between arms creates differential measurement that a fully observational emulation (both arms ascertained identically) avoids.",
        "when_to_prefer": "When the treated arm is already a regulatory single-arm trial; prefer full emulation when both exposures are observable in the same routine-care source."
      },
      {
        "compared_to": "Bayesian dynamic borrowing of the same historical data",
        "pros_of_this": "PS matching/weighting is transparent, regulator-familiar, and yields a clean ATT without a prior-conflict parameter.",
        "cons_of_this": "Discards external patients off the overlap region and ignores the historical sample size that borrowing would exploit.",
        "when_to_prefer": "Primary regulatory analyses; reserve borrowing for augmenting an already-randomized small control with pre-specified, calibrated priors."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Strong for treatment dates, enrollment, and linked mortality; carries no ECOG/stage/biomarker fields, so decisive oncology prognosticators are unmeasured. Require continuous medical+pharmacy enrollment so first-line is truly first-line; exclude Medicare Advantage-only person-time where FFS treatment claims are absent. Anchor time-zero at therapy start in both arms to prevent immortal time; prefer overall survival because non-mortality endpoints suffer differential competing risks and ascertainment.",
      "ehr": "Captures performance status, labs, stage, and biomarkers that claims miss, sharpening eligibility translation and confounding control; visit-driven capture makes loss to follow-up potentially informative and real-world progression coding inconsistent. Link to claims for complete treatment history and to vital records for death.",
      "registry": "Strong for stage, histology, and adjudicated outcomes; weak for complete pharmacy/later-line exposure. Link to claims for the full treatment trajectory and to a death index for survival.",
      "linked": "Linked claims-EHR-registry-vital-records is the ideal external-control substrate (severity + completeness + adjudication + mortality) but introduces linkage selection and diagnosis/treatment/service date discrepancies that must be reconciled before time-zero assignment."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\nimport pandas as pd\nfrom sklearn.linear_model import LogisticRegression\nfrom lifelines import CoxPHFitter\n\nWASHOUT_DAYS = 365  # continuous FFS enrollment so prior lines are observable\n\ndef build_external_control(rx, dx, enroll, death, covariates):\n    # Index = first SECOND-LINE systemic antineoplastic fill (mirrors the trial's treatment start).\n    sl = rx[(rx[\"antineoplastic\"]) & (rx[\"line\"] == 2)].sort_values([\"person_id\", \"fill_date\"])\n    idx = sl.groupby(\"person_id\").first().reset_index().rename(columns={\"fill_date\": \"index_date\"})\n\n    # Eligibility translation: confirmed NSCLC histology coded before index.\n    nsclc = dx.merge(idx[[\"person_id\", \"index_date\"]], on=\"person_id\")\n    nsclc = nsclc[(nsclc[\"dx_date\"] <= nsclc[\"index_date\"])]\n    idx = idx[idx[\"person_id\"].isin(nsclc[\"person_id\"].unique())]\n\n    # Continuous FFS-observable enrollment across the full lookback through index (no MA-only gaps).\n    e = enroll.merge(idx[[\"person_id\", \"index_date\"]], on=\"person_id\")\n    e[\"covers\"] = ((e[\"enroll_start\"] <= e[\"index_date\"] - pd.Timedelta(days=WASHOUT_DAYS)) &\n                   (e[\"enroll_end\"] >= e[\"index_date\"]) & (~e[\"ma_only\"]))\n    eligible = e.loc[e[\"covers\"], \"person_id\"].unique()\n\n    ec = idx[idx[\"person_id\"].isin(eligible)].merge(death, on=\"person_id\", how=\"left\")\n    ec = ec.merge(covariates, on=\"person_id\", how=\"left\")  # baseline covariates measured pre-index\n    ec[\"source\"] = \"EXTERNAL\"\n    return ec\n\ndef att_overlap_weighted(trial, external, covs, end_of_data):\n    # Stack trial + external; overlap weights target the ATT and down-weight non-overlapping external patients.\n    df = pd.concat([trial.assign(treated=1), external.assign(treated=0)], ignore_index=True)\n    ps = LogisticRegression(max_iter=1000).fit(df[covs], df[\"treated\"]).predict_proba(df[covs])[:, 1]\n    df[\"ps\"] = ps\n    df[\"w\"] = 
np.where(df[\"treated\"] == 1, 1 - df[\"ps\"], df[\"ps\"])  # overlap weights: 1 - ps (trial), ps (external)\n\n    # Overall survival: time from index to death or administrative censoring at end of data.\n    df[\"death_date\"] = pd.to_datetime(df[\"death_date\"])\n    df[\"time\"] = (df[\"death_date\"].fillna(end_of_data) - df[\"index_date\"]).dt.days\n    df[\"event\"] = df[\"death_date\"].notna().astype(int)\n\n    cph = CoxPHFitter().fit(df[[\"time\", \"event\", \"treated\", \"w\"]], \"time\", \"event\", weights_col=\"w\",\n                            robust=True)  # robust SE for the weighting\n    # HR for treated vs external control in the overlap population (ATO) on OS.\n    # For the ATT, use w = 1 for trial patients and ps / (1 - ps) for external controls instead.\n    return cph",
        "description": "External control cohort construction + PS overlap-weighting to a single-arm trial. Required inputs (post data-management):\n  trial   : trial-arm patients -> person_id, index_date (datetime), <prognostic covariates>, source='TRIAL'\n  rx       : external systemic-therapy claims -> person_id, fill_date (datetime), line, antineoplastic (bool), days_supply\n  dx       : external diagnoses -> person_id, dx_date, icd_code (NSCLC histology), ecog_proxy (nullable)\n  enroll   : external enrollment spans -> person_id, enroll_start, enroll_end, ma_only (bool)  # MA-only lacks FFS therapy claims\n  death    : linked death index -> person_id, death_date (nullable)\nBuilds the external control by translating trial eligibility, anchors time-zero at second-line therapy start (no immortal\ntime), then estimates the ATT on overall survival via overlap weights. Report which trial covariates have no external analogue.",
        "dependencies": [
          "pandas",
          "numpy",
          "scikit-learn",
          "lifelines"
        ],
        "source_citations": [
          "burcu-2020"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\nlibrary(survival)\nWASHOUT_DAYS <- 365L\n\nbuild_external_control <- function(rx, dx, enroll, death, covs) {\n  setDT(rx); setDT(dx); setDT(enroll); setDT(death)\n  sl  <- rx[antineoplastic == TRUE & line == 2L][order(person_id, fill_date)]\n  idx <- sl[, .(index_date = fill_date[1L]), by = person_id]                 # 2nd-line start = time zero\n\n  ok_dx <- dx[idx, on = \"person_id\"][dx_date <= index_date, unique(person_id)] # NSCLC coded pre-index\n  idx   <- idx[person_id %chin% ok_dx]\n\n  e  <- enroll[idx, on = \"person_id\"]                                          # continuous FFS lookback, no MA-only\n  ok <- e[enroll_start <= index_date - WASHOUT_DAYS & enroll_end >= index_date &\n          ma_only == FALSE, unique(person_id)]\n  ec <- idx[person_id %chin% ok]\n  ec <- death[ec, on = \"person_id\"][covs, on = \"person_id\", nomatch = NULL]\n  ec[, source := \"EXTERNAL\"][]\n}\n\natt_overlap_weighted <- function(trial, external, covs, end_of_data) {\n  df <- rbind(data.table(trial)[, treated := 1L],\n              data.table(external)[, treated := 0L], fill = TRUE)\n  ps <- predict(glm(reformulate(covs, \"treated\"), data = df, family = binomial), type = \"response\")\n  df[, ps := ps][, w := fifelse(treated == 1L, 1 - ps, ps)]                   # overlap weights -> ATT\n  df[, time := as.integer(fifelse(is.na(death_date), end_of_data, death_date) - index_date)]\n  df[, event := as.integer(!is.na(death_date))]\n  coxph(Surv(time, event) ~ treated, data = df, weights = w, robust = TRUE)   # weighted ATT on OS\n}",
        "description": "External control construction + overlap-weighted ATT in R. Inputs mirror the Python version:\n  trial   : person_id, index_date (Date), <covariates>, source='TRIAL'\n  rx       : person_id, fill_date (Date), line, antineoplastic (logical), days_supply\n  dx       : person_id, dx_date, icd_code, ecog_proxy\n  enroll   : person_id, enroll_start, enroll_end, ma_only (logical)\n  death    : person_id, death_date (Date, NA if alive)",
        "dependencies": [
          "data.table",
          "survival"
        ],
        "source_citations": [
          "burcu-2020"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "%let washout = 365;\n\n/* Time zero = first SECOND-LINE systemic antineoplastic fill (mirrors the trial's treatment start). */\nproc sql;\n  create table idx as\n  select person_id, min(fill_date) as index_date format=date9.\n  from work.rx\n  where antineoplastic = 1 and line = 2\n  group by person_id;\nquit;\n\n/* Eligibility translation: confirmed NSCLC histology coded on or before index. */\nproc sql;\n  create table elig_dx as\n  select distinct i.person_id, i.index_date\n  from idx i\n  where exists (select 1 from work.dx d\n                where d.person_id = i.person_id and d.dx_date <= i.index_date);\nquit;\n\n/* Continuous FFS-observable enrollment across the full lookback through index (no MA-only spans). */\nproc sql;\n  create table external as\n  select e2.person_id, e2.index_date, dth.death_date\n  from elig_dx e2 left join work.death dth on e2.person_id = dth.person_id\n  where exists (select 1 from work.enroll e\n                where e.person_id = e2.person_id and e.ma_only = 0\n                  and e.enroll_start <= e2.index_date - &washout\n                  and e.enroll_end   >= e2.index_date);\nquit;\n\n/* Stack trial + external; treated=1 for trial-arm patients. */\ndata analytic;\n  set work.trial(in=t) external(in=x);\n  treated = t;                                      /* 1 = single-arm trial, 0 = external control */\nrun;\n\n/* 1:1 PS matching of external controls to trial patients (ATT). */\nproc psmatch data=analytic region=cs;\n  class <categorical baseline covariates>;\n  psmodel treated(treated='1') = <baseline prognostic covariates>;  /* measured in BOTH sources */\n  match method=greedy(k=1) distance=lps caliper=0.2;\n  assess lps var=(<key covariates>) / plots=(stddiff);             /* require post-match SMD < 0.1 */\n  output out(obs=match)=matched matchid=mid;\nrun;\n\n/* Overall survival contrast on the matched set (weighted ATT on OS). 
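*/\n/* &end_of_data is the administrative censoring date; it must be defined as a macro variable before this step. */\ndata surv;\n  set matched;\n  event = (death_date ne . and death_date <= &end_of_data);        /* deaths after the cut-off are censored */\n  time  = (min(death_date, &end_of_data) - index_date);            /* days from index to death/censor */\nrun;\n\nproc phreg data=surv covs(aggregate);\n  class treated(ref='0');\n  model time*event(0) = treated;                  /* HR = external-controlled treatment effect on OS */\n  id mid;                                          /* account for the matched-pair structure */\nrun;",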
        "description": "External control construction (PROC SQL) and 1:1 PS matching of the external cohort to the single-arm trial\n(PROC PSMATCH). Required input datasets (post data-management):\n  work.trial   : person_id, index_date, <covariates>, source='TRIAL'\n  work.rx       : person_id, fill_date, line, antineoplastic (0/1), days_supply\n  work.dx       : person_id, dx_date, icd_code, ecog_proxy\n  work.enroll   : person_id, enroll_start, enroll_end, ma_only (0/1)\n  work.death    : person_id, death_date (. if alive)\nPROC PSMATCH requires SAS/STAT 14.2+. Confirm post-match standardized differences <0.1, then fit OS with PROC PHREG\non the matched (trial + external) set. Use overall survival; avoid claims-coded progression (differential ascertainment).",
        "dependencies": [],
        "source_citations": [
          "burcu-2020"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  Sec[First second-line systemic therapy<br/>in real-world source] --> T0[Time zero = therapy start<br/>same anchor in both arms]\n  Elig[Trial eligibility translated to RWD<br/>NSCLC histology, prior line, organ-function proxies] --> T0\n  Enr[Continuous FFS enrollment lookback<br/>exclude MA-only person-time] --> T0\n  T0 --> Cov[Baseline prognostic covariates<br/>measured in BOTH sources -> propensity score]\n  Cov --> Bal[PS matching / overlap weighting<br/>require post-balance SMD < 0.1]\n  Bal --> OS[Overall survival from linked death index<br/>avoid claims-coded progression]\n  OS --> Sens[Sensitivity: E-value, negative controls,<br/>contemporary-calendar restriction, dynamic borrowing]",
        "caption": "Operational flow for an externally controlled single-arm trial in real-world data. Eligibility translation, time-zero alignment at therapy start, and balancing on commonly measured prognostic factors manufacture the exchangeability that randomization would otherwise provide; sensitivity analyses probe the unverifiable assumptions.",
        "alt_text": "Flowchart from second-line therapy start through eligibility translation, enrollment lookback, time-zero, baseline covariate measurement, propensity-score balancing, overall-survival outcome, and sensitivity analyses for an external control arm.",
        "source_type": "illustrative",
        "source_citations": [
          "burcu-2020"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  U[Unmeasured prognostic factors<br/>ECOG, stage, PD-L1, LDH] --> Tx[Treatment / era assignment]\n  U --> Y[Survival outcome]\n  Cal[Calendar time / secular standard-of-care change] --> Tx\n  Cal --> Asc[Outcome ascertainment<br/>imaging cadence, progression coding]\n  Asc --> Y\n  Tx --> Y\n  M[Measured covariates<br/>age, prior lines, comorbidity] --> Tx\n  M --> Y\n  classDef bias fill:#fde,stroke:#933;\n  class U,Cal,Asc bias;",
        "caption": "Bias DAG for external-controlled comparisons. Randomization is absent, so treatment-vs-external-control assignment is confounded by unmeasured prognostic factors (ECOG, biomarkers — pink) and by calendar time (secular standard-of-care change), which also drives differential outcome ascertainment. Adjustment can balance measured covariates but cannot close the open back-door paths through unmeasured factors and era — the reason an E-value and negative controls are mandatory.",
        "alt_text": "Directed acyclic graph showing unmeasured prognostic factors and calendar time confounding the treatment-versus-external-control contrast, with calendar time also affecting outcome ascertainment, while measured covariates are adjustable.",
        "source_type": "illustrative",
        "source_citations": [
          "burcu-2020"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "alternative_to",
        "target_slug": "target-trial-emulation",
        "notes": "A target-trial emulation ascertains both arms identically in routine-care data; the externally controlled trial keeps a protocol-quality treated arm and replaces only the control, accepting asymmetric data provenance in exchange."
      },
      {
        "relation_type": "used_with",
        "target_slug": "propensity-score-methods-psm-iptw",
        "notes": "PS matching, IPTW, or overlap weighting is the standard frequentist step that balances the external cohort to the single-arm trial population and targets the ATT."
      },
      {
        "relation_type": "used_with",
        "target_slug": "high-dimensional-propensity-score-hdps-rwe",
        "notes": "When the external control comes from claims, an hdPS built from diagnosis/procedure/drug proxies can recover confounding signal that a handful of named covariates miss."
      },
      {
        "relation_type": "used_with",
        "target_slug": "negative-control-outcomes-rwe",
        "notes": "Negative-control outcomes detect residual secular-trend and selection bias that survive covariate adjustment in an external-control comparison."
      },
      {
        "relation_type": "used_with",
        "target_slug": "unmeasured-confounding-probabilistic-bias-analysis-rwe",
        "notes": "Because randomized balance of unmeasured prognosis (ECOG, biomarkers) is impossible, E-values and probabilistic bias analysis quantify how strong an unmeasured confounder would need to be to overturn the result."
      },
      {
        "relation_type": "see_also",
        "target_slug": "immortal-time-bias-handling",
        "notes": "Anchoring time-zero at therapy start in both arms prevents the immortal time that arises when external-control follow-up begins at diagnosis but group membership requires later therapy."
      },
      {
        "relation_type": "see_also",
        "target_slug": "selection-bias-sensitivity-analysis-rwe",
        "notes": "Eligibility-translation error and linkage selection are the dominant selection threats in external-control studies and should be probed with selection-bias sensitivity analysis."
      },
      {
        "relation_type": "see_also",
        "target_slug": "active-comparator-new-user",
        "notes": "When both exposures are observable in the same routine-care source, an active-comparator new-user cohort is preferable; the external control is the fallback when the treated arm is a single-arm trial."
      }
    ],
    "aliases": [
      "external control arm",
      "ECA",
      "historical control",
      "synthetic control arm",
      "externally controlled trial",
      "hybrid control",
      "augmented control"
    ],
    "complexity": "advanced",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "snomed-ct-terminology",
    "name": "SNOMED CT Clinical Terminology",
    "short_definition": "A comprehensive, polyhierarchical clinical terminology — not a billing classification — that assigns numeric concept identifiers to clinical findings, disorders, procedures, and observable entities, enabling granular EHR phenotyping via descendant-hierarchy expansion and serving as the standard Condition vocabulary in the OMOP Common Data Model.",
    "long_description": "**SNOMED CT** (Systematized Nomenclature of Medicine — Clinical Terms) is the world's most comprehensive\nclinical terminology and the vocabulary of record for capturing clinical meaning at the point of care. It is\nmaintained by SNOMED International (a not-for-profit association of national member bodies) and distributed in\nthe United States by the National Library of Medicine (NLM) through the US Edition. Use requires a license;\nin member countries, including the United States, that license is free — US users obtain access through the\nNLM's UMLS Metathesaurus license, which covers SNOMED CT for US purposes. Small illustrative examples with\nattribution to SNOMED International are acceptable; bulk reproduction of SNOMED CT content is not.\n\n**Core conceptual distinction: terminology vs classification.** The single most important thing to understand\nabout SNOMED CT is what it is *not*: it is not a billing classification. ICD-10-CM (and its predecessors)\nwas designed for statistical, administrative, and reimbursement purposes — its categories are mutually\nexclusive, its hierarchy is coarse, and a single code often collapses many clinically distinct entities for\nthe practical convenience of counting and billing. SNOMED CT was designed for the opposite purpose: to\ncapture the full clinical meaning of what a clinician observed, recorded, or decided, with enough granularity\nand structure to support clinical decision support, research phenotyping, and inter-system exchange. This\ndistinction is the conceptual spine of every RWE use case: ICD-10-CM tells you what was billed; SNOMED CT\ntells you (or tries to tell you) what actually happened clinically.\n\n**Three core components.** Every concept in SNOMED CT has three mandatory elements. 
First, a **concept** —\na unique, atomic clinical idea identified by a numeric concept identifier (SCTID) that is 6–18 digits long\nand encodes a partition (the type of component) and a Verhoeff check digit. The concept identifier is opaque\nand permanent: 44054006 represents |Diabetes mellitus type 2| in every edition, every country, every year\nit has existed. Second, **descriptions** — one or more human-readable labels for the concept, split into a\nsingle Fully Specified Name (FSN, which is globally unique and unambiguous), one Preferred Term (the\ndisplay label for a locale), and zero or more Synonyms and Acceptable Terms. A single concept may have many\nacceptable terms; 73211009 has both |Diabetes mellitus| and |DM| as accepted synonyms. Third,\n**relationships** — formal machine-readable links between concepts, stored in the RF2 Relationship table.\nThe most important relationship type is |Is a| (SCTID 116680003), which builds the hierarchy.\n\n**The |is a| polyhierarchy.** Unlike ICD-10-CM's tree — where each code has exactly one parent — SNOMED CT\nis a **directed acyclic graph (DAG)**. A concept can have multiple parents, making it polyhierarchical.\nFor example, diabetic retinopathy descends from the diabetes lineage (as a disorder due to diabetes\nmellitus) and, independently, from the retinal-disorder lineage (as a disorder of the retina), so descendant\nexpansion from either ancestor reaches it. This means descendant queries must traverse a graph, not a tree, typically\nvia a **transitive closure** of |is a| relationships over the RF2 Relationship file — every concept that\ncan reach the chosen ancestor concept through a chain of |is a| edges is a descendant of that ancestor. 
The practical upshot for RWE is\nthat a concept-set built by \"all descendants of 73211009 |Diabetes mellitus|\" will capture far more\ngranular codes than any flat ICD list, but the membership of that concept-set changes between SNOMED CT\nreleases (twice yearly for the US Edition) as new concepts are added and retired.\n\n**Defining attribute relationships.** Beyond |is a|, SNOMED CT uses additional relationship types — called\n**defining attributes** — to formally specify what makes a concept what it is. Clinical findings carry\nattributes such as |finding site| (which body structure is affected) and |associated morphology| (what is\nstructurally abnormal). Disorders carry |causative agent|, |pathological process|, and others. These\nattributes power description logic classifiers (such as EL++ reasoners) that can infer implied subsumption\nrelationships and detect modeling errors. For most RWE users, attribute relationships are background\nmachinery; for advanced use cases (clinical decision support, automated phenotype generation), they are\nhow SNOMED CT earns its designation as a \"formal ontology\" rather than a flat vocabulary.\n\n**Pre-coordination vs post-coordination.** SNOMED CT concepts are either **pre-coordinated** — where a\nsingle SCTID expresses a complete clinical idea (e.g., 44054006 |Diabetes mellitus type 2|) — or\n**post-coordinated**, where a base concept is combined with additional expressions using SNOMED's\nCompositional Grammar to capture a nuanced clinical statement not represented by any single concept (e.g.,\na finding site applied to a generic finding concept at the point of documentation). 
Post-coordination is\ntheoretically powerful but is rarely supported at the EHR interface level; most RWE work operates\nexclusively on pre-coordinated concepts from problem lists and encounter diagnoses.\n\n**Pros, cons, and trade-offs — specific and comparative.**\n\n- **vs ICD-10-CM (the dominant alternative for condition coding in US claims):** SNOMED CT's polyhierarchy\n  enables descendant expansion, meaning a single query \"find all concepts below 73211009 |Diabetes mellitus|\"\n  returns hundreds of granular concepts — type 1, type 2, maturity onset, gestational, drug-induced, and\n  rare monogenic subtypes — without requiring the analyst to enumerate them individually. ICD-10-CM requires\n  manual code list curation; adding a new ICD code requires updating every code list that should include it.\n  SNOMED CT also preserves synonymy (many terms → one concept), so free-text and structured entry with\n  different labels resolve to the same computable unit. **Cons:** SNOMED CT is not in US administrative\n  claims — clinical encounter data must be coded in ICD-10-CM for billing — so SNOMED exists in the EHR,\n  not in the payer file. The two vocabularies require an explicit crosswalk. **Prefer SNOMED CT** when\n  working from EHR or registry source data with rich clinical terminology; **prefer ICD-10-CM** when your\n  primary source is US insurance claims.\n\n- **vs flat proprietary code lists (e.g., hand-curated ICD code lists for a specific indication):** A\n  SNOMED CT-based concept set updates automatically as new descendant concepts are added to the terminology,\n  whereas a flat list requires manual maintenance. **Cons:** SNOMED CT concept membership changes between\n  releases, making reproducibility dependent on pinning the terminology version; a concept set built against\n  the 2023-09 US Edition may have slightly different membership than one built against the 2024-03 US\n  Edition. 
**Prefer SNOMED CT descendant sets** for maintainability in longitudinal or multi-database\n  studies; use version-pinning and document the release date in your methods.\n\n- **vs OMOP Concept Sets (which use SNOMED under the hood):** OMOP's CONDITION_OCCURRENCE table stores\n  diagnoses using SNOMED CT as its standard vocabulary — source ICD-9/10-CM codes are mapped to SNOMED\n  standard concepts via the OMOP vocabulary \"Maps to\" relationship. The OMOP Atlas tool's concept-set\n  builder applies descendant expansion using the same |is a| hierarchy, so OMOP concept sets *are*\n  SNOMED CT descendant queries packaged in a user interface. Understanding SNOMED CT structure is therefore\n  a prerequisite for understanding OMOP concept-set logic. **Cons:** The OMOP mapping layer (ICD→SNOMED)\n  introduces mapping errors and losses; a source ICD code may map to a SNOMED concept that is too broad\n  or too narrow, and some source codes are explicitly excluded from mapping (\"Maps to\" nothing). Audit the\n  mapping before relying on OMOP concept-set coverage.\n\n- **vs LOINC (laboratory/observation coding) and RxNorm (drug coding):** SNOMED CT, LOINC, and RxNorm\n  are complementary, not competing. SNOMED CT covers clinical findings, disorders, procedures, body\n  structures, and organisms; LOINC covers the identity of a laboratory test or observation; RxNorm covers\n  drugs and their ingredients. All three are required for a complete clinical representation. OMOP uses\n  all three in their respective domains (Condition→SNOMED; Measurement→LOINC; Drug→RxNorm). 
**Prefer**\n  the domain-appropriate vocabulary in each context; do not substitute SNOMED for LOINC in lab-based\n  phenotyping or for RxNorm in drug exposure definitions.\n\n**When to use.** Use SNOMED CT when building EHR-based phenotypes where clinical granularity matters and\ndescendant hierarchy expansion adds meaningful analytic value — rare-disease research (where SNOMED\ndescendants may outnumber ICD codes five-to-one), phenotypic subtypes relevant to the research question\n(distinguishing type 1 from type 2 diabetes), and multi-database distributed studies over OMOP where the\nstandard vocabulary must be portable across sites with different source coding. Use SNOMED CT as the\nbackbone of OMOP concept-set development for the Condition domain. Use it when your study requires\nsynonymy-aware search (matching many terms to one concept) or when clinical decision support and cohort\ndiscovery live in the same system.\n\n**When NOT to use — and when it is actively misleading or dangerous.**\n\n- **Do not use SNOMED CT as a drop-in replacement for ICD-10-CM in US claims analysis.** US insurance\n  claims are coded in ICD-10-CM for billing; SNOMED does not appear in the adjudicated claim record.\n  Applying a SNOMED-based code list to a raw claims table without an explicit mapping step will return no\n  rows and produce silently empty cohorts — a failure mode that is not always immediately obvious.\n\n- **Do not ignore the terminology version.** US Edition releases of SNOMED CT occur twice yearly\n  (March and September). Concept content, descriptions, and relationship definitions change between releases.\n  A cohort built without specifying the SNOMED CT release date is not reproducible. 
Always pin the\n  terminology version in your methods and SAP.\n\n- **Do not conflate problem-list SNOMED entries with encounter diagnoses.** In EHR systems that use\n  SNOMED CT for problem list documentation, a concept entered on the problem list may persist for years\n  without update, representing the patient's chronic conditions at time of entry, not a fresh clinical\n  judgment. Encounter diagnoses, by contrast, are event-based. Mixing these provenance types in a\n  descendant-expansion query can inflate prevalence estimates and misattribute timing.\n\n- **Do not assume the SNOMED→ICD-10-CM map is lossless.** SNOMED International publishes a rule-based\n  map from SNOMED CT concepts to ICD-10 for reimbursement derivation. This map is lossy in both\n  directions: multiple SNOMED concepts map to the same ICD code (granularity is lost going\n  SNOMED→ICD and cannot be recovered going ICD→SNOMED), and some SNOMED concepts have no ICD equivalent or map to \"unspecified\" codes.\n  Using a derived ICD code from a SNOMED source concept in a claims validation context can introduce\n  systematic misclassification.\n\n- **Do not use SNOMED CT for medication coding.** SNOMED CT includes a Substance hierarchy and some\n  Clinical Drug concepts, but RxNorm is the designated vocabulary for drug exposure in OMOP and in most\n  US EHR interoperability standards. Using SNOMED for drug coding duplicates capability that is better\n  served by RxNorm and introduces a non-standard pattern that most ETL pipelines will not support.\n\n**Data-source operational depth.**\n\n- **EHR:** SNOMED CT is most commonly present in EHR data as the vocabulary behind problem lists (the\n  clinician's running list of the patient's active conditions) and, in some implementations, encounter\n  diagnoses. 
Problem-list capture is clinician-driven: conditions are added at clinical discretion and\n  may be carried forward indefinitely without update, meaning a resolved condition may still appear as\n  \"active\" in a SNOMED problem-list query years after resolution. Encounter diagnoses are event-level and\n  more analogous to claims diagnoses but may be coded at various levels of specificity depending on the\n  EHR's interface and the clinician's coding habits. In OMOP ETLs from EHR source data, the\n  `condition_type_concept_id` distinguishes problem-list entries from encounter diagnoses, and this\n  provenance metadata is critical for RWE studies where the timing and recurrence of the condition are\n  analytically relevant.\n\n- **Registry:** Disease registries frequently use SNOMED CT for condition coding, often with tighter\n  clinical curation (oncology registries, rare-disease registries). Registry SNOMED data tends to be\n  highly precise for the index condition but may be absent for comorbidities and is rarely present for\n  drug exposure, which still requires claims or pharmacy linkage.\n\n- **Linked EHR-claims:** The ideal configuration for SNOMED-based RWE — EHR provides SNOMED-coded\n  clinical granularity (problem-list subtype, lab values, vitals, pathology) while claims provide\n  complete drug exposure and a defined enrollment denominator. The linkage must reconcile SNOMED\n  problem-list dates with ICD claim dates; these will often differ because the problem-list date reflects\n  when the condition was first documented in this EHR, not when the patient first had the condition or\n  when the claim was generated.\n\n- **OMOP CDM:** In OMOP, the CONDITION_OCCURRENCE table stores SNOMED CT concept IDs as the standard\n  vocabulary for conditions. The ETL maps source ICD-9-CM or ICD-10-CM codes to SNOMED using the\n  \"Maps to\" relationship in the OMOP vocabulary tables. 
Descendant queries are executed using the\n  CONCEPT_ANCESTOR table, which pre-computes the transitive closure of the |is a| hierarchy — exactly\n  the operation you would perform manually over the RF2 Relationship file. Always verify the ETL mapping\n  completeness: not all source codes map to SNOMED concepts (some are excluded), and mapping quality\n  varies across CDM vintages and ETL implementations.\n\n**Licensing note.** SNOMED CT is owned by SNOMED International. Use requires a license. In member\ncountries — which include the United States — access is free for most users. US users should obtain\naccess through the NLM via a UMLS license (https://www.nlm.nih.gov/healthit/snomedct/index.html). Do\nnot bulk-reproduce SNOMED CT content; small illustrative examples with attribution to SNOMED\nInternational are acceptable.",
    "primary_category": "Data_Standard",
    "tags": [
      "coding-system",
      "data-standard",
      "primitive",
      "terminology",
      "ehr",
      "snomed",
      "snomed-ct",
      "omop",
      "phenotyping",
      "clinical-terminology",
      "controlled-vocabulary",
      "interoperability"
    ],
    "applies_to_study_types": [
      "ehr_study",
      "cohort_retrospective",
      "multi_database",
      "target_trial_emulation",
      "cohort_prospective"
    ],
    "data_sources": [
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1186/1472-6947-8-S1-S2",
        "url": "https://doi.org/10.1186/1472-6947-8-S1-S2",
        "citation_text": "Cornet R, de Keizer N. Forty years of SNOMED: a literature review. BMC Medical Informatics and Decision Making. 2008;8(Suppl 1):S2.",
        "year": 2008,
        "authors_short": "Cornet & de Keizer",
        "notes": "Comprehensive literature review of SNOMED's evolution from its 1965 origins through SNOMED CT, covering concept model, hierarchy, relationship types, and use in clinical information systems — the canonical historical and structural overview of the terminology."
      },
      {
        "role": "explain",
        "doi": "10.1093/nar/gkh061",
        "url": "https://doi.org/10.1093/nar/gkh061",
        "citation_text": "Bodenreider O. The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Research. 2004;32(Database issue):D267-D270.",
        "year": 2004,
        "authors_short": "Bodenreider",
        "notes": "Describes the UMLS Metathesaurus — the federal infrastructure through which US users access SNOMED CT under a free NLM license — and explains how SNOMED CT, ICD, LOINC, and RxNorm are integrated into a unified semantic network that is the basis for OMOP's cross-vocabulary mapping."
      },
      {
        "role": "demonstrate",
        "doi": null,
        "url": "https://www.nlm.nih.gov/healthit/snomedct/index.html",
        "citation_text": "National Library of Medicine. SNOMED CT — Overview and access [Internet]. Bethesda (MD): NLM; [cited 2026-06-12]. Available from: https://www.nlm.nih.gov/healthit/snomedct/index.html",
        "year": 2024,
        "authors_short": "NLM",
        "notes": "Official NLM page for SNOMED CT in the United States, covering the US Edition release schedule, UMLS licensing pathway for free access, browser tools, and RF2 file downloads — the practical starting point for any US researcher implementing SNOMED CT."
      },
      {
        "role": "use",
        "doi": "10.1093/jamia/ocu023",
        "url": "https://doi.org/10.1093/jamia/ocu023",
        "citation_text": "Voss EA, Makadia R, Matcho A, et al. Feasibility and utility of applications of the common data model to multiple, disparate observational health databases. Journal of the American Medical Informatics Association. 2015;22(3):553-564.",
        "year": 2015,
        "authors_short": "Voss et al.",
        "notes": "Demonstrates OMOP CDM deployment across disparate health databases, with SNOMED CT as the standard Condition vocabulary underpinning portability of concept sets across sites — the core multi-database RWE use case for SNOMED-based phenotyping."
      }
    ],
    "plain_language_summary": "SNOMED CT is the world's largest clinical terminology — a giant dictionary that gives each clinical concept (a disease, a finding, a procedure) its own permanent numeric ID and connects those concepts in a hierarchy, so that searching for 'diabetes mellitus' automatically includes all the specific subtypes below it. Unlike ICD codes, which are designed for insurance billing and lump many things together, SNOMED CT is designed for clinical documentation and can describe what a clinician actually saw or decided with high precision. It is most commonly encountered in electronic health record (EHR) problem lists and is the vocabulary that the OMOP data model uses to store diagnoses — so understanding SNOMED CT is a prerequisite for understanding OMOP concept sets.",
    "key_terms": [
      {
        "term": "concept",
        "definition": "A single, uniquely identified clinical idea in SNOMED CT, represented by a permanent numeric code (e.g., 44054006 for Diabetes mellitus type 2) that means the same thing in every country and every version of the terminology."
      },
      {
        "term": "description",
        "definition": "A human-readable label attached to a SNOMED CT concept — every concept has one official Fully Specified Name plus one or more synonyms and preferred terms, all of which resolve to the same concept ID."
      },
      {
        "term": "is-a hierarchy",
        "definition": "The network of parent-child links in SNOMED CT that lets you say \"Diabetes mellitus type 2 IS A type of Diabetes mellitus IS A type of Disorder of glucose metabolism\" — so a search for the parent automatically captures all the children below it."
      },
      {
        "term": "polyhierarchy",
        "definition": "SNOMED CT's property of allowing a single concept to have more than one parent (for example, a concept can belong to both a disease family and an anatomical location family), unlike ICD-10-CM where each code sits in exactly one place in the tree."
      },
      {
        "term": "postcoordination",
        "definition": "A way of building a SNOMED CT expression on the fly by combining a base concept with extra qualifiers (like a finding site or severity) to describe clinical detail that no single pre-defined concept covers — rarely used in routine EHR coding but supported by the terminology standard."
      },
      {
        "term": "problem list",
        "definition": "The running list in a patient's EHR of their active conditions, typically coded in SNOMED CT; entries are added by clinicians and may persist for years, which is different from a claim that is generated at each encounter."
      }
    ],
    "worked_example": {
      "scenario": "A pharmacoepidemiologist wants to build a type 2 diabetes cohort from an EHR dataset that has been converted to the OMOP CDM. She needs to understand why a SNOMED CT-based descendant expansion finds more patients than a flat ICD-10-CM code list, and whether the extra patients represent true type 2 diabetes or noise from the broader hierarchy. The example walks through the two approaches on a hypothetical patient database of 100,000 adults, shows the code counts each strategy produces, and computes the incremental capture rate.\n",
      "dataset": {
        "caption": "Comparison of two phenotyping strategies for type 2 diabetes: flat ICD-10-CM list vs SNOMED CT descendant expansion. Row counts are from a hypothetical 100,000-patient EHR-OMOP dataset.",
        "columns": [
          "Strategy",
          "Root or seed concept",
          "Number of codes / concepts in set",
          "Patients identified",
          "Notes"
        ],
        "rows": [
          [
            "Flat ICD-10-CM list",
            "E11 (Type 2 diabetes mellitus) and E11.* subcodes",
            37,
            8200,
            "Manual enumeration of E11 and its subcategories; must be updated when new ICD codes are added"
          ],
          [
            "SNOMED CT descendant expansion",
            "44054006 (Diabetes mellitus type 2) and all is-a descendants",
            112,
            8960,
            "Automated hierarchy traversal in CONCEPT_ANCESTOR; includes granular clinical subtypes not represented in ICD-10-CM"
          ],
          [
            "Incremental patients (SNOMED only)",
            "Concepts without ICD-10-CM equivalent or mapping gap",
            75,
            760,
            "Patients coded with SNOMED-specific subtypes (e.g., maturity onset diabetes of the young in problem list) that did not have a corresponding E11 claim in the observation window"
          ]
        ]
      },
      "steps": [
        "The ICD-10-CM strategy starts with E11 and lists all 37 codes in the E11 family (E11.0 through E11.9 and their 4th/5th digit extensions). Each is looked up in the OMOP CONCEPT table as a source code and then followed via the Maps-to relationship to its SNOMED standard concept. Patients with any CONDITION_OCCURRENCE record carrying one of those SNOMED standard concepts are flagged as cases.",
        "The SNOMED descendant strategy starts directly with SNOMED concept 44054006 (Diabetes mellitus type 2) and queries the OMOP CONCEPT_ANCESTOR table for all concept_id values where ancestor_concept_id = 201826 (the OMOP standard concept ID for this SNOMED concept) and min_levels_of_separation >= 0, returning 112 descendant concepts across all levels of the hierarchy.",
        "The SNOMED strategy identifies 8960 patients vs 8200 from the flat ICD list, a difference of 760 patients. That is 760 / 8200 = 0.0927, meaning about 9.3% more patients are captured by descendant expansion.",
        "On inspection, the 760 extra patients were coded using SNOMED-specific granular subtypes on their EHR problem lists (e.g., maturity onset diabetes of the young type 3, gestational diabetes that evolved to type 2) that either had no direct ICD-10-CM equivalent or whose ICD codes were not present in the claims-derived CONDITION_OCCURRENCE records within the study window.",
        "Key validation check -- of the 760 incremental patients, chart review of a 50-patient random sample confirms 82% are genuine type 2 diabetes cases documented by clinicians using SNOMED problem-list entries. The 18% are coding errors (wrong hierarchy branch chosen by the EHR interface). PPV of 0.82 for the incremental patients informs whether to include them with a sensitivity analysis or require a corroborating encounter diagnosis."
      ],
      "result": "SNOMED CT descendant expansion identifies 8960 patients vs 8200 from a flat ICD-10-CM code list, capturing 760 / 8200 = 0.0927 more patients (9.3% incremental capture). The 112-concept SNOMED set covers granular clinical subtypes absent from the 37-code ICD list. Version-pinning the SNOMED release and documenting provenance (problem-list vs encounter diagnosis) are required for reproducibility."
    },
    "prerequisites": [
      "omop-concept-set-development-rwe",
      "omop-condition-occurrence-condition-era-rwe",
      "icd-10-cm-diagnosis-coding"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Descendant expansion at different hierarchy levels (narrow vs broad phenotype)",
        "description": "A concept set rooted at 44054006 |Diabetes mellitus type 2| captures only type 2 descendants; rooting at 73211009 |Diabetes mellitus| captures all diabetes subtypes (type 1, type 2, gestational, secondary, monogenic). The choice of root concept determines both the size of the concept set and its clinical specificity.",
        "edge_cases": [
          "Selecting a root concept that is too high in the hierarchy (e.g., 56265001 |Heart disease|) produces a concept set covering thousands of concepts and risks low PPV for any specific condition.",
          "Selecting a root concept that has been restructured between SNOMED CT releases can change concept-set membership substantially — always document the terminology version and re-validate after a major release."
        ],
        "data_source_notes": "In OMOP, descendant expansion is performed via CONCEPT_ANCESTOR using min_levels_of_separation >= 0 (inclusive of root) or >= 1 (descendants only). In native RF2, traverse the Relationship file filtering on typeId = 116680003 (Is a) to build a transitive closure."
      },
      {
        "name": "Problem list vs encounter diagnosis provenance filtering",
        "description": "EHR-sourced CONDITION_OCCURRENCE records include problem-list entries (which persist for years and may represent historical or resolved conditions) and encounter-based diagnoses (which are event-level). Filtering on condition_type_concept_id distinguishes these provenance types and changes the timing and recurrence structure of the phenotype.",
        "edge_cases": [
          "Problem-list entries without an end date look like active, ongoing conditions even if the clinician stopped updating them — they inflate chronic-disease prevalence in cross-sectional analyses.",
          "Encounter diagnoses from outpatient EHR visits may not be coded in SNOMED CT in all EHR implementations — some systems use ICD-10-CM at the billing interface and only surface SNOMED in the clinical (problem-list) layer."
        ],
        "data_source_notes": "In OMOP CONDITION_OCCURRENCE, condition_type_concept_id 32840 corresponds to EHR problem list entries; 32817 corresponds to EHR encounter diagnoses. Availability of these values depends on the ETL implementation."
      },
      {
        "name": "SNOMED-to-ICD-10-CM derived mapping (for billing or claims validation)",
        "description": "SNOMED International publishes a rule-based map from SNOMED CT concepts to ICD-10 codes for reimbursement derivation. This map is directional (SNOMED→ICD) and lossy: multiple distinct SNOMED concepts may map to the same ICD code, and some SNOMED concepts have no ICD equivalent.",
        "edge_cases": [
          "A researcher who uses the SNOMED→ICD map to generate ICD codes for a validation study in claims data may miss cases captured by ICD codes that were used without a corresponding SNOMED problem-list entry — particularly for acute or incidental diagnoses.",
          "The map targets ICD-10 (the international edition), not ICD-10-CM (the US clinical modification); some codes differ, and US-specific ICD-10-CM codes may not have a direct SNOMED map entry."
        ],
        "data_source_notes": "The SNOMED CT to ICD-10 Map is distributed as part of the International Release RF2 files (der2_iisssccRefset_ExtendedMapFull). It is not present in the US Edition by default; download from SNOMED International's MLDS portal separately."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "icd-10-cm-diagnosis-coding",
        "pros_of_this": "SNOMED CT provides a polyhierarchical concept model that supports descendant expansion (one query captures all subtypes), synonymy resolution (many terms → one concept), and clinical granularity for distinguishing phenotypic subtypes that ICD-10-CM collapses under a single code. SNOMED CT is designed for clinical documentation rather than billing, so it better reflects what the clinician observed.",
        "cons_of_this": "SNOMED CT does not appear in US insurance claims — administrative data uses ICD-10-CM for billing. Working with SNOMED CT requires an explicit mapping step when combining EHR and claims data. The terminology requires a (free, in the US) license and is updated twice yearly, requiring version management.",
        "when_to_prefer": "Prefer SNOMED CT when working from EHR source data, OMOP CDM, or registry data where clinical granularity is required; prefer ICD-10-CM when working from raw US administrative claims."
      },
      {
        "compared_to": "omop-concept-set-development-rwe",
        "pros_of_this": "Understanding SNOMED CT structure directly — rather than through OMOP's ATLAS abstraction — gives the analyst insight into why a concept set has a particular membership, allows debugging of mapping gaps, and enables working with non-OMOP SNOMED data sources (EHR problem lists, registries that deliver native RF2).",
        "cons_of_this": "Native RF2 querying requires building a transitive closure of the |is a| relationship table, which is more complex than OMOP's pre-computed CONCEPT_ANCESTOR table. OMOP concept-set tooling (ATLAS, Capr) is considerably more accessible for most RWE analysts.",
        "when_to_prefer": "Use native SNOMED CT when the data source is not OMOP-formatted, when auditing OMOP mapping quality, or when working in a terminology-native environment (EHR CDS, ontology tooling)."
      },
      {
        "compared_to": "ehr-phenotyping-algorithms-rwe",
        "pros_of_this": "Descendant expansion via SNOMED CT hierarchy is a lightweight, transparent single-query phenotyping approach that requires no training data and can be applied to any SNOMED-coded concept. For well-represented conditions in SNOMED, it provides excellent recall.",
        "cons_of_this": "Hierarchy-based expansion alone has no mechanism for requiring confirmation (1 inpatient or 2 outpatient codes), excluding rule-out coding, or incorporating non-structured data (labs, vitals, notes). Most validated EHR phenotype algorithms combine hierarchy expansion with additional logic. SNOMED-only expansion typically has lower PPV than a validated phenotype algorithm.",
        "when_to_prefer": "Use SNOMED CT descendant expansion as the first-pass candidate set for phenotyping; layer additional logic (confirmation windows, code-position requirements, lab confirmation) on top via a validated algorithm when the outcome or eligibility criterion requires high PPV."
      }
    ],
    "implementation_notes_by_data_source": {
      "ehr": "SNOMED CT concepts appear as condition_type_concept_id-typed entries in OMOP CONDITION_OCCURRENCE, or as native SNOMED SCTIDs in source EHR data. Filter on condition_type_concept_id to distinguish problem-list entries from encounter diagnoses. Descendant expansion uses CONCEPT_ANCESTOR in OMOP or a pre-built transitive closure over the RF2 Relationship file. Always pin the SNOMED CT release version used to construct the concept set.",
      "registry": "Registries using SNOMED CT for condition coding typically supply native SCTIDs or term labels that can be mapped to SCTIDs via the SNOMED CT browser or RF2 Description file. Registry SNOMED data tends to be curated and high-PPV for the index condition but may lack breadth for comorbidities.",
      "linked": "When linking EHR (SNOMED-coded) data to claims (ICD-10-CM-coded) data, reconcile condition dates across the two sources. SNOMED problem-list dates reflect EHR documentation timing; ICD claim dates reflect billing submission. The OMOP 'Maps to' vocabulary relationship provides the bridge from ICD source codes to SNOMED standard concepts, but audit mapping completeness and gap cases."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\nimport sqlite3\n\n# ── Part 1: RF2-based transitive closure (native SNOMED CT) ──────────────────\n# Download the US Edition RF2 from NLM (UMLS licence required).\n# The Relationship snapshot file is typically named:\n#   sct2_Relationship_Snapshot_US1000124_<YYYYMMDD>.txt\n\ndef load_isa_edges(rf2_relationship_file: str) -> dict[str, set[str]]:\n    \"\"\"Return a dict: child_sctid -> set of immediate parent SCTIDs (Is a only).\"\"\"\n    IS_A = \"116680003\"  # SCTID for the |Is a| relationship type\n    df = pd.read_csv(rf2_relationship_file, sep=\"\\t\", dtype=str, usecols=[\n        \"active\", \"sourceId\", \"destinationId\", \"typeId\"\n    ])\n    df = df[(df[\"active\"] == \"1\") & (df[\"typeId\"] == IS_A)]\n    parents: dict[str, set[str]] = {}\n    for _, row in df.iterrows():\n        parents.setdefault(row[\"sourceId\"], set()).add(row[\"destinationId\"])\n    return parents\n\ndef descendants(root_sctid: str, parents: dict[str, set[str]]) -> set[str]:\n    \"\"\"Return all concepts that are a descendant of root_sctid (transitive Is a).\n    Uses BFS over the child→parent graph traversed in reverse (child→parent → parent is ancestor).\n    \"\"\"\n    # Build child map (parent → set of children) from the parent map\n    children: dict[str, set[str]] = {}\n    for child, pset in parents.items():\n        for p in pset:\n            children.setdefault(p, set()).add(child)\n\n    visited: set[str] = set()\n    queue = [root_sctid]\n    while queue:\n        node = queue.pop()\n        for child in children.get(node, set()):\n            if child not in visited:\n                visited.add(child)\n                queue.append(child)\n    return visited  # does NOT include root_sctid itself\n\n# Example usage (comment out if no RF2 files are available):\n# parents = load_isa_edges(\"sct2_Relationship_Snapshot_US1000124_20240301.txt\")\n# dm_t2_descendants = descendants(\"44054006\", parents)\n# print(f\"Found 
{len(dm_t2_descendants)} descendants of 44054006 |Diabetes mellitus type 2|\")\n\n# ── Part 2: OMOP CONCEPT_ANCESTOR query (SQLite example) ─────────────────────\n# In a real OMOP environment, replace sqlite3 with your database driver.\n# CONCEPT_ANCESTOR stores the pre-computed transitive closure of the Is a hierarchy.\n\nDM_T2_OMOP_CONCEPT_ID = 201826  # OMOP standard concept_id for Diabetes mellitus type 2 (SNOMED 44054006)\n\ndef get_omop_descendants(con: sqlite3.Connection, ancestor_concept_id: int) -> pd.DataFrame:\n    \"\"\"Return all OMOP standard concept_ids that are descendants of a given concept.\n    min_levels_of_separation = 0 includes the root; >= 1 for descendants only.\n    \"\"\"\n    query = \"\"\"\n    SELECT\n        ca.descendant_concept_id,\n        c.concept_name,\n        c.vocabulary_id,\n        ca.min_levels_of_separation\n    FROM concept_ancestor ca\n    JOIN concept c ON c.concept_id = ca.descendant_concept_id\n    WHERE ca.ancestor_concept_id = ?\n      AND ca.min_levels_of_separation >= 1   -- descendants only, not the root itself\n      AND c.invalid_reason IS NULL            -- active concepts only\n    ORDER BY ca.min_levels_of_separation, c.concept_name\n    \"\"\"\n    return pd.read_sql(query, con, params=(ancestor_concept_id,))\n\n# Example usage (replace ':memory:' with path to OMOP SQLite or use your DB connection):\n# con = sqlite3.connect(\":memory:\")   # placeholder\n# df = get_omop_descendants(con, DM_T2_OMOP_CONCEPT_ID)\n# print(f\"OMOP descendant concepts: {len(df)}\")\n# print(df.head(10).to_string(index=False))",
        "description": "Two operations: (1) build a transitive closure of the |is a| relationship over the RF2 Relationship file to find all descendants of a given SNOMED CT concept (here: 44054006 Diabetes mellitus type 2); (2) run the equivalent query over an OMOP CONCEPT_ANCESTOR table using SQLite. The RF2 approach works on raw SNOMED CT data; the OMOP approach is what most RWE analysts will use in practice. Both are shown for illustration. All code is minimal and heavily commented.",
        "dependencies": [
          "pandas",
          "sqlite3"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(dplyr)\nlibrary(DBI)\nlibrary(RSQLite)\n\n# ── Part 1: RF2-based transitive closure (native SNOMED CT) ──────────────────\nIS_A_TYPE <- \"116680003\"  # SCTID for Is a relationship type\n\nload_isa_edges <- function(rf2_relationship_file) {\n  # Read only the columns we need; typeId and active are character for SCTID fidelity\n  df <- read.delim(rf2_relationship_file, sep = \"\\t\", colClasses = \"character\",\n                   stringsAsFactors = FALSE)\n  df <- df[df$active == \"1\" & df$typeId == IS_A_TYPE,\n           c(\"sourceId\", \"destinationId\")]\n  # Return named list: parent → vector of children (reverse of sourceId → destinationId)\n  child_map <- split(df$sourceId, df$destinationId)\n  child_map\n}\n\ndescendants_bfs <- function(root_sctid, child_map) {\n  # BFS from root down through child_map\n  visited <- character(0)\n  queue   <- root_sctid\n  while (length(queue) > 0) {\n    node  <- queue[1]\n    queue <- queue[-1]\n    kids  <- child_map[[node]]\n    new   <- setdiff(kids, visited)\n    visited <- c(visited, new)\n    queue   <- c(queue, new)\n  }\n  visited  # excludes root itself\n}\n\n# Example usage (comment out if RF2 files not available):\n# child_map <- load_isa_edges(\"sct2_Relationship_Snapshot_US1000124_20240301.txt\")\n# dm_t2_desc <- descendants_bfs(\"44054006\", child_map)\n# cat(\"Descendants of 44054006:\", length(dm_t2_desc), \"\\n\")\n\n# ── Part 2: OMOP CONCEPT_ANCESTOR query (RSQLite / DBI) ─────────────────────\nDM_T2_OMOP_ID <- 201826L  # OMOP concept_id for Diabetes mellitus type 2\n\nget_omop_descendants <- function(con, ancestor_concept_id) {\n  query <- \"\n    SELECT\n      ca.descendant_concept_id,\n      c.concept_name,\n      c.vocabulary_id,\n      ca.min_levels_of_separation\n    FROM concept_ancestor ca\n    JOIN concept c ON c.concept_id = ca.descendant_concept_id\n    WHERE ca.ancestor_concept_id = ?\n      AND ca.min_levels_of_separation >= 1\n      AND c.invalid_reason IS NULL\n    
ORDER BY ca.min_levels_of_separation, c.concept_name\n  \"\n  DBI::dbGetQuery(con, query, params = list(ancestor_concept_id))\n}\n\n# Example usage:\n# con <- DBI::dbConnect(RSQLite::SQLite(), \":memory:\")  # replace with real OMOP connection\n# df  <- get_omop_descendants(con, DM_T2_OMOP_ID)\n# cat(\"OMOP descendants:\", nrow(df), \"\\n\")\n# print(head(df, 10))",
        "description": "Equivalent SNOMED CT descendant expansion in R: (1) BFS over the RF2 Relationship file to build descendants from native SNOMED CT data; (2) SQL query against OMOP CONCEPT_ANCESTOR using DBI/RSQLite for the OMOP-native approach. Both are minimal, commented implementations.",
        "dependencies": [
          "dplyr",
          "DBI",
          "RSQLite"
        ],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  ROOT[\"73211009<br/>|Diabetes mellitus|<br/>(root concept)\"]\n  ROOT --> T2[\"44054006<br/>|Diabetes mellitus type 2|\"]\n  ROOT --> T1[\"46635009<br/>|Diabetes mellitus type 1|\"]\n  ROOT --> GDM[\"11687002<br/>|Gestational diabetes|\"]\n  T2 --> MOD[\"609568004<br/>|Maturity onset diabetes of the young|\"]\n  T2 --> INS[\"44054006-INS<br/>|T2DM with insulin|<br/>(illustrative subtype)\"]\n  T2 --> COMP[\"73211009-COMP<br/>|T2DM with complication|<br/>(illustrative subtype)\"]\n  style ROOT fill:#4a90d9,color:#fff\n  style T2 fill:#7bb3e8,color:#000\n  style T1 fill:#a8d0f0,color:#000\n  style GDM fill:#a8d0f0,color:#000\n  style MOD fill:#d0e8f8,color:#000\n  style INS fill:#d0e8f8,color:#000\n  style COMP fill:#d0e8f8,color:#000",
        "caption": "Partial |is a| polyhierarchy beneath 73211009 |Diabetes mellitus|. Each concept has a permanent numeric SCTID. A descendant expansion query from 73211009 captures all nodes shown; rooting at 44054006 captures only that subtree. Concept IDs and structure are illustrative — consult the current US Edition for the authoritative hierarchy.",
        "alt_text": "Flowchart showing the SNOMED CT is-a hierarchy for diabetes mellitus. The root concept 73211009 Diabetes mellitus has three children: Diabetes mellitus type 2 (44054006), Diabetes mellitus type 1 (46635009), and Gestational diabetes (11687002). Diabetes mellitus type 2 has further subtypes including maturity onset diabetes of the young and illustrative subtypes for insulin use and complications.",
        "source_type": "illustrative",
        "source_citations": [
          "cornet-2008"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "used_with",
        "target_slug": "icd-10-cm-diagnosis-coding",
        "notes": "SNOMED CT (terminology, designed for clinical documentation) and ICD-10-CM (classification, designed for billing) are complementary and non-substitutable. US insurance claims carry ICD-10-CM; EHR problem lists and OMOP standard concepts use SNOMED CT. SNOMED International publishes a rule-based map from SNOMED concepts to ICD-10 codes for billing derivation, but the map is lossy in both directions."
      },
      {
        "relation_type": "used_with",
        "target_slug": "omop-concept-set-development-rwe",
        "notes": "SNOMED CT is the standard Condition vocabulary in OMOP. OMOP concept sets for conditions are SNOMED CT descendant queries implemented via the CONCEPT_ANCESTOR table. Understanding SNOMED CT structure — concept IDs, |is a| polyhierarchy, and pre- vs post-coordination — is a prerequisite for building and auditing OMOP concept sets for the Condition domain."
      },
      {
        "relation_type": "used_with",
        "target_slug": "omop-condition-occurrence-condition-era-rwe",
        "notes": "OMOP CONDITION_OCCURRENCE stores standard SNOMED CT concept IDs as condition_concept_id; source ICD-9/10-CM codes are mapped to SNOMED via the OMOP 'Maps to' vocabulary relationship. CONDITION_ERA collapses SNOMED-keyed occurrence records into persistence windows, discarding provenance metadata."
      },
      {
        "relation_type": "used_with",
        "target_slug": "ehr-study",
        "notes": "SNOMED CT is the primary clinical terminology for EHR problem lists and encounter diagnoses in systems built to HL7/FHIR standards. EHR-based study designs rely on SNOMED CT for condition ascertainment, index-event coding, and comorbidity capture from structured problem-list data."
      },
      {
        "relation_type": "used_with",
        "target_slug": "ehr-phenotyping-algorithms-rwe",
        "notes": "EHR phenotyping algorithms typically begin with a SNOMED CT descendant expansion as the candidate code set, then apply additional logic (confirmation windows, code-position requirements, lab confirmation) to achieve acceptable PPV. SNOMED CT hierarchy expansion provides recall; the algorithm provides specificity."
      },
      {
        "relation_type": "see_also",
        "target_slug": "omop-cdm-method-patterns-rwe",
        "notes": "The OMOP CDM standardizes conditions using SNOMED CT as the primary vocabulary. Understanding SNOMED CT concept structure is foundational to OMOP method patterns for cohort construction, concept-set definition, and cross-site reproducibility."
      },
      {
        "relation_type": "see_also",
        "target_slug": "diagnosis-phenotype-algorithm-1ip-2op-time-window-rwe",
        "notes": "The 1-inpatient or 2-outpatient rule-based phenotype algorithm is commonly applied on top of SNOMED CT descendant code sets in OMOP to improve PPV; SNOMED provides the candidate set and the algorithm applies the confirmation logic."
      }
    ],
    "aliases": [
      "SNOMED",
      "SNOMED CT",
      "SNOMED-CT",
      "Systematized Nomenclature of Medicine"
    ],
    "complexity": "foundational",
    "regulatory_relevance": [
      "fda",
      "ema"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "special-populations-rwe-methods",
    "name": "Special Populations RWE Methods",
    "short_definition": "A family of real-world-evidence study-design adaptations for populations that are systematically excluded from or under-enrolled in trials (pregnant people, neonates and children, rare-disease and biomarker-defined cohorts), each of which forces a population-specific change to time-zero, the unit of analysis, the exposure window, or the comparator.",
    "long_description": "**Special populations RWE methods** are not a single estimator but a coordinated set of design adaptations applied when\nthe population of interest cannot be studied with the default active-comparator, new-user template because the trial that\nwould answer the question is infeasible, too slow, or ethically gated. The defining move is the same in every case: a\ntrial-derived design element that you would normally take for granted — a single time zero per person, one analytic unit,\na stable comparator, a fixed exposure window — has to be **re-specified for the biology and the data of the special\npopulation** before any propensity score or outcome model is fit. Pregnancy forces a *gestational-age-anchored* exposure\nwindow and (for fetal/neonatal outcomes) a *two-generation* analytic unit. Pediatrics forces age- and weight-normalized\ndosing and growth-trajectory endpoints rather than fixed-dose, fixed-threshold ones. Rare and biomarker-defined diseases\nforce external or historical controls (with or without Bayesian borrowing) because a concurrent randomized comparator does\nnot exist at adequate sample size. This entry is the routing layer over those child methods; the worked example below is a\npregnancy exposure-window cohort, the cleanest case in which the standard template breaks.\n\n**Core conceptual distinction** — the estimand and the *unit of analysis* must be settled before the design. Three choices\ndo the work and they are separable. (1) *Whose outcome?* In pregnancy, a maternal outcome (e.g., gestational hypertension)\nkeeps the pregnant person as the unit; a fetal/neonatal outcome (e.g., major congenital malformation) makes the\npregnancy–infant dyad the unit and requires mother–infant linkage. 
(2) *What is time zero, and is it a calendar date or a\ndevelopmental landmark?* For teratogenicity the biologically meaningful window is organogenesis (roughly the first\ntrimester, gestational weeks ~4–10), not the date of the first prescription fill; anchoring follow-up at the fill date\nrather than the relevant gestational window is the special-population analogue of immortal-time and exposure-window\nmisclassification. (3) *Against what?* When a concurrent comparator is impossible (ultra-rare disease, single-arm gene\ntherapy), the comparator becomes an external/historical control and the estimand shifts from a within-cohort contrast to\na borrowed-information contrast whose validity rests on exchangeability and outcome-ascertainment comparability rather\nthan on randomization. The family does **not** estimate a general-population average effect transported to the subgroup;\nthat transportation is precisely the assumption these designs exist to avoid making blindly.\n\n**Pros, cons, and trade-offs**\n- **vs. attempting a randomized trial in the special population:** RWE is often the only feasible source — pregnant people\n  are excluded from most pre-approval trials, rare-disease trials cannot accrue, and randomizing children to a dose is\n  frequently unethical. Cost: no randomization, so every confounding and ascertainment threat must be handled by design\n  and analysis, and regulators apply heightened scrutiny to fit-for-purpose data and bias control. **Prefer RWE** when the\n  trial is infeasible or unethical and a fit-for-purpose data source with adequate outcome capture exists.\n- **vs. extrapolating a general-population RWE estimate to the subgroup:** the special-population design measures the\n  effect *in* the population of interest, avoiding the transportability leap (different effect modifiers, different\n  competing risks, different baseline risk). 
Cost: smaller samples, sparser events, and population-specific data gaps.\n  **Prefer the dedicated design** whenever effect modification or baseline-risk shift between the general and special\n  population is plausible — which is the default assumption in pregnancy, neonates, and rare disease.\n- **vs. a single-arm external control without special-population adjustments:** the family adds the population-specific\n  machinery (gestational-age anchoring, dyad linkage, dose normalization, biomarker eligibility timing) that a generic\n  external-control analysis omits, reducing window and unit misclassification. Cost: more moving parts, each a potential\n  failure point, and dependence on linkage/registry assets that not all databases have. **Prefer the dedicated design**\n  when the population-specific structure materially changes exposure timing, the analytic unit, or eligibility.\n\n**When to use** — the target population is pregnant/postpartum, pediatric/neonatal, rare-disease, or biomarker-defined, AND\nthe question is comparative safety or effectiveness that a trial cannot answer feasibly or ethically; AND a fit-for-purpose\ndata source captures the population-specific structure (gestational dating, mother–infant linkage IDs, weight/dose fields,\nthe molecular marker, or an adjudicated registry). 
Use it to build the analytic core of a regulatory single-arm-vs-external\ncontrol submission, a pregnancy safety study supporting labeling, or a pediatric extrapolation package.\n\n**When NOT to use — and when it is actively misleading or dangerous**\n- **The data source cannot observe the population-defining structure.** A claims database with no gestational dating and\n  no live-birth linkage cannot support a teratogenicity study; forcing it produces fill-date-anchored windows that\n  misclassify organogenesis exposure and silently exclude pregnancies ending in loss (a differential, exposure-related\n  selection that biases toward the null for malformation and can fabricate apparent safety).\n- **The mother–infant link is incomplete or non-random.** If only a subset of dyads links (e.g., infant on a separate\n  plan, delivered out of network), and linkability correlates with exposure or outcome, the dyad cohort is a biased\n  selection — diagnose with link rates by exposure arm before trusting any neonatal estimate.\n- **The external control is not exchangeable.** Borrowing historical or registry controls when standard of care,\n  diagnostic intensity, or outcome definitions have drifted over calendar time injects bias that point estimates hide;\n  dynamic Bayesian borrowing that down-weights on conflict mitigates but does not cure non-exchangeability.\n- **The genuine question is the general-population effect.** If the policy question is population-average, a special-\n  population subgroup design answers a narrower question and should not be generalized back up.\n- **Sparse events with a default large-sample model.** Rare outcomes in small special populations break Wald-based\n  inference; naive logistic/Cox can be separated or badly biased (use exact or Firth-penalized methods instead).\n\n**Data-source operational depth**\n- **Claims (FFS vs MA vs commercial):** Pregnancy episodes are reconstructed from delivery/outcome codes (live birth,\n  stillbirth, 
spontaneous/elective abortion) and gestational-age algorithms (e.g., the Margulis/MAX algorithm using\n  diagnosis-based timing), then back-dated to estimate last menstrual period and trimester windows. Failure modes:\n  Medicare Advantage and capitated commercial plans drop fee-for-service encounter claims, so a pregnancy or an infant\n  can vanish from the data — restrict to enrollees with complete FFS-observable medical + pharmacy person-time spanning\n  preconception through delivery, and treat MA-only spans as unobservable, not as \"no event.\" Spontaneous losses and\n  elective terminations are under-captured relative to live births, creating exposure-related left-truncation. Mother–\n  infant linkage requires a deterministic family/subscriber key plus a delivery-to-birth date match within a plausible\n  window; link rates differ by plan type and must be reported.\n- **EHR:** Strong for gestational dating (LMP, ultrasound-derived EDD, problem lists) and for birth/neonatal outcomes\n  when delivery happens in-system, but visit-driven capture means out-of-system deliveries, NICU transfers, and infant\n  primary care elsewhere are differentially lost. Medication *orders* are not dispensings; confirm actual exposure during\n  the relevant gestational window with linked pharmacy fills where possible.\n- **Registry (pregnancy, rare-disease, product):** The reference standard for adjudicated outcomes (malformation panels,\n  genetically confirmed rare disease, biomarker status) and for enrolling external/historical controls, but enrollment is\n  selective (consent, referral bias) and exposure capture is often patient-reported. 
Transportability from the registry\n  population to the treated cohort is the binding assumption for external controls.\n- **Linked claims–EHR–vital/birth records:** The ideal substrate — EHR gestational dating + claims completeness + vital-\n  records birth and fetal-death certificates that recover the losses claims miss — but linkage selects the linkable subset\n  and introduces date discrepancies (LMP vs delivery vs claim service date) that must be reconciled before windows are set.\n\n**Worked example (pregnancy exposure-window cohort, claims).** Question: risk of major congenital malformation after\nfirst-trimester exposure to drug X vs an active comparator Y used for the same maternal indication, in a commercial +\nMedicaid claims database with mother–infant linkage. (1) *Pregnancy episode:* identify live-birth deliveries from\ndelivery codes; estimate the last-menstrual-period (LMP) date by subtracting an algorithm-derived gestational age from the\ndelivery date, defining the pregnancy span [LMP, delivery]. (2) *Enrollment:* require continuous FFS-observable medical +\npharmacy enrollment from 90 days before LMP through 90 days after delivery, excluding any MA-only person-time so that\nabsence of a fill is a true non-exposure, not missingness. (3) *Exposure window:* classify the dyad as exposed if a fill\nof drug X with `days_supply` overlapping gestational weeks 4–10 (organogenesis) covers any day in that window; assign the\ncomparator arm analogously — note this is fill-overlap of the *gestational* window, not the fill date. (4) *Unit and\nlinkage:* link each delivery to its infant via the family/subscriber key and a birth date within ±30 days of the delivery\nclaim; the analytic unit is the dyad, and the malformation outcome is read from the infant's first-year claims using a\nvalidated algorithm. 
(5) *Time zero / baseline:* covariates (maternal age, comorbidities, prior pregnancy loss, healthcare\nutilization, folic-acid/teratogen co-exposures) measured in [LMP − 90d, LMP] to avoid conditioning on post-conception\nmediators. (6) *Analysis:* propensity-score overlap weighting on the baseline covariates; because malformations are rare,\nfit a Firth-penalized logistic model and report the absolute risk difference per 1,000 live births with a sensitivity\nanalysis on gestational-dating error, link-window width, and inclusion of pregnancies ending in loss.",
    "primary_category": "Study_Design",
    "tags": [
      "special-populations",
      "pregnancy-pharmacoepidemiology",
      "pediatric-rwe",
      "rare-disease",
      "external-control",
      "mother-infant-linkage",
      "gestational-age-window",
      "biomarker-defined-cohort"
    ],
    "applies_to_study_types": [
      "claims_analysis",
      "ehr_study",
      "registry_linkage",
      "single_arm_external_control",
      "target_trial_emulation"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1002/pds.4789",
        "url": "https://doi.org/10.1002/pds.4789",
        "citation_text": "Huybrechts KF, Bateman BT, Hernández-Díaz S. Use of real-world evidence from healthcare utilization data to evaluate drug safety during pregnancy. Pharmacoepidemiology and Drug Safety. 2019;28(7):906-922.",
        "year": 2019,
        "authors_short": "Huybrechts et al.",
        "notes": "Canonical methods statement for pregnancy RWE — gestational-age windows, mother-infant linkage, exposure-related loss, and design adaptations for a population excluded from trials."
      },
      {
        "role": "explain",
        "doi": "10.1002/pds.5083",
        "url": "https://doi.org/10.1002/pds.5083",
        "citation_text": "Suissa S, Dell'Aniello S. Time-related biases in pharmacoepidemiology. Pharmacoepidemiology and Drug Safety. 2020;29(9):1101-1110.",
        "year": 2020,
        "authors_short": "Suissa & Dell'Aniello",
        "notes": "Why mis-anchored time zero and exposure windows (acute in special populations, e.g., gestational windows) create immortal-time and window misclassification."
      },
      {
        "role": "demonstrate",
        "doi": "10.1016/j.jclinepi.2016.04.014",
        "url": "https://doi.org/10.1016/j.jclinepi.2016.04.014",
        "citation_text": "Hernán MA, Sauer BC, Hernández-Díaz S, Platt R, Shrier I. Specifying a target trial prevents immortal time bias and other self-inflicted injuries in observational analyses. Journal of Clinical Epidemiology. 2016;79:70-75.",
        "year": 2016,
        "authors_short": "Hernán et al.",
        "notes": "Target-trial framing that disciplines time-zero and eligibility specification — directly applicable to gestational and pediatric windows."
      },
      {
        "role": "use",
        "doi": "10.1002/cpt.1426",
        "url": "https://doi.org/10.1002/cpt.1426",
        "citation_text": "Cave A, Kurz X, Arlett P. Real-world data for regulatory decision making: challenges and possible solutions for Europe. Clinical Pharmacology & Therapeutics. 2019;106(1):36-39.",
        "year": 2019,
        "authors_short": "Cave et al.",
        "notes": "Regulatory context for using RWE (including external controls in rare/special populations) to support decision-making."
      }
    ],
    "plain_language_summary": "Special populations RWE methods are a set of study-design adjustments for groups that clinical trials almost never include — pregnant people, children, patients with very rare diseases, and people whose disease is defined by a specific biological marker. Because these groups are absent from most trials, researchers must study them using real-world health data, but the standard playbook for setting up a study breaks down: the dates that matter, the unit being followed, and the comparison group all have to be rethought for each population before any analysis begins. Each adjustment targets a specific reason why a one-size approach would produce a misleading answer — for example, using a drug's prescription date instead of the biologically critical window of fetal development, or applying an adult dose rule to a ten-kilogram child.",
    "key_terms": [
      {
        "term": "gestational age",
        "definition": "How far along a pregnancy is, counted in weeks from the last menstrual period — used to define which weeks are the critical window for fetal organ development."
      },
      {
        "term": "organogenesis window",
        "definition": "Roughly gestational weeks 4 through 10, when the fetus forms its major organs — the period when a drug is most likely to cause a birth defect if the fetus is exposed."
      },
      {
        "term": "mother-infant dyad",
        "definition": "The linked pair of a pregnant person and their infant, treated as a single unit of analysis when the outcome of interest is a birth defect or newborn condition rather than a maternal one."
      },
      {
        "term": "external control",
        "definition": "A comparison group drawn from a different data source or time period instead of a concurrent group in the same study, used when the disease is so rare that a head-to-head comparison is not possible."
      },
      {
        "term": "dose normalization",
        "definition": "Adjusting a drug dose by a child's body weight (e.g., mg per kg) so that exposure can be compared fairly across children of different sizes."
      },
      {
        "term": "Firth-penalized regression",
        "definition": "A statistical method that produces reliable estimates when the outcome event is very rare or the study group is very small, where standard regression models break down."
      }
    ],
    "worked_example": {
      "scenario": "Five patients represent five special populations commonly studied in RWE. For each, a researcher wants to estimate how a drug affects a clinically meaningful outcome. The table below shows why the default study design would fail for that population, what the specific challenge is in the data, and what methodological adjustment is required. None of these patients can be analyzed with the standard adult-cohort template unmodified.",
      "dataset": {
        "caption": "One representative patient per special population with the design challenge each raises.",
        "columns": [
          "person_id",
          "population",
          "drug",
          "naive_approach_and_why_it_fails",
          "data_challenge",
          "correct_adjustment"
        ],
        "rows": [
          [
            "P-001",
            "Pregnant (first trimester)",
            "Drug X (anticonvulsant)",
            "Index on first fill date — but the fill might occur at week 12, after organogenesis (weeks 4-10) is already over, so exposure during the critical window is missed or mislabeled",
            "Claims show a fill date but not whether the supply overlapped the organogenesis window; gestational age must be reconstructed backward from the delivery claim",
            "Reconstruct last-menstrual-period date from delivery date minus algorithm-derived gestational age; classify exposure by whether the prescription supply overlapped gestational weeks 4-10, not by fill date; link the delivery record to the infant record to read the birth-defect outcome from the infant's first-year claims"
          ],
          [
            "P-002",
            "Pediatric (age 4, weight 16 kg)",
            "Drug Y (immunosuppressant)",
            "Apply the adult dose threshold (e.g., greater than 200 mg/day = high dose) — a 16 kg child receiving 48 mg/day is actually at a high weight-adjusted dose of 3 mg/kg/day, but the adult rule would classify them as low dose",
            "Claims record the dispensed amount but not body weight; EHR has weight in vitals but it changes with growth",
            "Pull the closest weight measurement before each prescription fill from the EHR vitals table; compute mg/kg for each fill; define dose categories using the weight-adjusted value; use growth-trajectory endpoints (height z-score, developmental milestone flags) rather than adult fixed thresholds"
          ],
          [
            "P-003",
            "Elderly (age 82, chronic kidney disease stage 4)",
            "Drug Z (direct oral anticoagulant)",
            "Use the same outcome model as the general adult population, even though elderly patients with kidney disease face a high competing risk of death from other causes; a standard survival model that censors those deaths overstates the absolute risk of the outcome of interest",
            "Death is a competing event that precludes the stroke outcome; standard Kaplan-Meier treats death as non-informative censoring, so 1 minus the Kaplan-Meier estimate overstates the cumulative incidence of stroke (and understates the stroke-free probability)",
            "Use a competing-risks model (cause-specific or cumulative incidence approach) that accounts for the high mortality rate in this population; report absolute risks rather than hazard ratios alone so the clinical magnitude is clear in a population with short residual life expectancy"
          ],
          [
            "P-004",
            "Rare disease (N = 120 patients nationally)",
            "Gene therapy G (single-arm trial, no concurrent comparator)",
            "Try to run an active-comparator cohort study — impossible because there are fewer than 50 eligible comparator patients who received any alternative treatment in the same period",
            "No concurrent comparison group exists; the only available reference data are historical registry patients treated 3-5 years ago under a different standard of care",
            "Use an external historical control from the disease registry; apply dynamic Bayesian borrowing so that if the historical control population differs meaningfully from the treated cohort (different baseline severity, calendar drift in standard of care), the model down-weights the historical data rather than treating it as equivalent; report the degree of borrowing and run a sensitivity analysis assuming no borrowing"
          ],
          [
            "P-005",
            "Biomarker-defined (EGFR-mutant non-small cell lung cancer)",
            "Targeted therapy T (approved only for EGFR-positive patients)",
            "Build a cohort of all lung cancer patients on the drug, even though the EGFR test result is recorded in molecular pathology notes, not in a structured claims field; patients without a documented test are carried into the cohort as EGFR-unknown rather than excluded",
            "Biomarker status lives in unstructured pathology text or a separate lab system not linked to the claims database; including EGFR-untested patients mixes a different population into the study",
            "Use a linked EHR-claims dataset that includes molecular pathology results or an oncology registry with adjudicated biomarker status; restrict the cohort to patients with a confirmed positive EGFR test result before the drug start date; treat the biomarker test date as the eligibility anchor, not the drug start date"
          ]
        ]
      },
      "steps": [
        "Standard epidemiology studies define one index date — the date a patient first takes the drug — and follow everyone forward from there. That works when patients are adults, doses are fixed, outcomes are clearly ascertained, and a comparison group is available. Each row in the table above breaks at least one of those assumptions.",
        "For P-001 (pregnant), the biologically meaningful exposure window is weeks 4-10 of gestation, not the fill date. If the researcher uses the fill date, they may count a week-12 fill as exposure during organogenesis even though the window had already closed, or miss a week-6 fill because it was dispensed before the pregnancy was recognized and the claim appears earlier in the record without a pregnancy flag.",
        "For P-002 (pediatric), body weight changes every few months in a growing child. An adult dose rule applied to a child's claims record silently mislabels exposure intensity. The only way to get dose per kilogram is to link each fill to the nearest weight measurement in the EHR.",
        "For P-003 (elderly with kidney disease), death from kidney failure or cardiovascular causes is highly likely before a stroke would occur. Standard survival analysis treats these deaths as non-informative censoring, as if the patient could still go on to have a stroke, so 1 minus the Kaplan-Meier estimate overstates the cumulative incidence of stroke. A competing-risks analysis treats death as its own outcome and correctly partitions the probability among all outcomes.",
        "For P-004 (rare disease), there is no concurrent comparator. The researcher must borrow from historical data, which introduces a different threat: the historical patients may have been sicker or better-treated than the current patients, making a direct comparison misleading. Bayesian borrowing addresses this by measuring the tension between historical and current data and reducing the weight given to history when that tension is large.",
        "For P-005 (biomarker-defined), eligibility itself depends on a lab test whose result is often not in the claims system. Including patients without confirmed biomarker status is like including patients who do not actually have the disease the drug targets — it dilutes the treatment effect and biases toward the null."
      ],
      "result": "Each special population requires a tailored adjustment before a single line of analysis code is written. Pregnant patients need a gestational-age anchor and a linked infant record. Pediatric patients need weight-adjusted dose and growth endpoints. Elderly patients with high competing risks need a competing-risks model rather than standard survival analysis. Rare-disease patients need an external control with explicit exchangeability checks. Biomarker-defined patients need confirmed molecular eligibility from a linked pathology or registry source. Applying the default adult cohort template to any of these five populations produces a different type of error (window misclassification, dose mislabeling, overstated event risk, non-exchangeable historical comparison, or a cohort diluted by patients without confirmed biomarker status), which is why the field treats these as a named family of methods rather than minor footnotes."
    },
    "prerequisites": [
      "pregnancy-exposure-window-rwe",
      "mother-infant-linkage-rwe",
      "rare-disease-external-controls-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Pregnancy exposure-window cohort",
        "description": "Reconstruct the pregnancy episode and back-date the last-menstrual-period date; classify exposure by overlap of dispensings with the biologically relevant gestational window (e.g., organogenesis for teratogenicity), not by fill date.",
        "edge_cases": [
          "Gestational-dating algorithm error shifts trimester windows and misclassifies organogenesis exposure.",
          "Pregnancies ending in spontaneous loss or elective termination are under-captured in claims, causing exposure-related left-truncation."
        ],
        "data_source_notes": "claims: derive LMP from delivery code minus algorithm gestational age; require FFS-observable enrollment across [LMP-90d, delivery+90d]. ehr: use ultrasound EDD / LMP from problem lists; orders are not dispensings."
      },
      {
        "name": "Mother-infant dyad linkage",
        "description": "Make the pregnancy-infant dyad the analytic unit for fetal/neonatal outcomes by linking each delivery to its infant record, then reading neonatal outcomes from the infant's claims/EHR.",
        "edge_cases": [
          "Incomplete or exposure-correlated linkage biases the dyad cohort; report link rates by arm.",
          "Infants on a separate plan or delivered out of network are differentially unlinkable."
        ],
        "data_source_notes": "claims: deterministic family/subscriber key + delivery-to-birth date match within a plausible window; linked: vital/birth records recover deliveries and fetal deaths claims miss."
      },
      {
        "name": "External / historical control for rare or biomarker-defined disease",
        "description": "Replace an infeasible concurrent comparator with an external or historical control (optionally with dynamic Bayesian borrowing) when sample size or ethics preclude randomization; validity rests on exchangeability.",
        "edge_cases": [
          "Calendar drift in standard of care or outcome definitions breaks exchangeability and biases the borrowed contrast.",
          "Sparse events demand exact or Firth-penalized inference rather than large-sample Wald approximations."
        ],
        "data_source_notes": "registry: reference standard for adjudicated outcomes and external-control enrollment but selective; transportability to the treated cohort is the binding assumption."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Attempting a randomized trial in the special population",
        "pros_of_this": "Feasible and timely where trials are excluded, cannot accrue, or are unethical (pregnancy, rare disease, dose-finding in children).",
        "cons_of_this": "No randomization, so all confounding and ascertainment threats fall on design/analysis; regulators apply heightened fit-for-purpose scrutiny.",
        "when_to_prefer": "When the trial is infeasible or unethical and a fit-for-purpose data source captures the population-specific structure."
      },
      {
        "compared_to": "Extrapolating a general-population RWE estimate to the subgroup",
        "pros_of_this": "Measures the effect in the population of interest, avoiding the transportability leap across effect modifiers, competing risks, and baseline risk.",
        "cons_of_this": "Smaller samples, sparser events, and population-specific data gaps (dating, linkage, dose fields).",
        "when_to_prefer": "Whenever effect modification or baseline-risk shift between general and special populations is plausible (the default in pregnancy, neonates, rare disease)."
      },
      {
        "compared_to": "Generic single-arm external control without special-population adjustments",
        "pros_of_this": "Adds gestational-age anchoring, dyad linkage, dose normalization, and biomarker eligibility timing that reduce window and unit misclassification.",
        "cons_of_this": "More moving parts and dependence on linkage/registry assets not all databases have.",
        "when_to_prefer": "When the population-specific structure materially changes exposure timing, the analytic unit, or eligibility."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Reconstruct pregnancy episodes from delivery/outcome codes; back-date LMP via a gestational-age algorithm; require FFS-observable medical+pharmacy enrollment across the full episode and exclude MA-only person-time. Classify exposure by days_supply overlap with the gestational window, not the fill date. Link infants via family/subscriber key.",
      "ehr": "Use LMP / ultrasound EDD for dating; orders are not dispensings (confirm with linked fills). Out-of-system deliveries, NICU transfers, and infant care elsewhere are differentially lost — define observation windows explicitly.",
      "registry": "Reference standard for adjudicated malformation/rare-disease/biomarker status and for external-control enrollment, but selective enrollment and patient-reported exposure limit it; transportability is the binding assumption.",
      "linked": "Linked claims-EHR-vital/birth records recover pregnancy losses and fetal deaths claims miss, but linkage selection and LMP/delivery/service-date discrepancies must be reconciled before exposure windows are set."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\nimport numpy as np\n\nPRE_LMP_DAYS = 90          # FFS-observable lookback before LMP for covariates + non-exposure\nPOST_DEL_DAYS = 90         # observable follow-through after delivery\nORGANOGENESIS = (4, 10)    # gestational weeks defining the teratogenic exposure window\nLINK_WINDOW_DAYS = 30      # delivery-to-infant-birth match tolerance\n\ndef build_pregnancy_dyad_cohort(deliveries, rx, enroll, links):\n    d = deliveries[deliveries[\"outcome\"] == \"LIVEBIRTH\"].copy()\n    # Back-date last menstrual period from delivery date and algorithm gestational age.\n    d[\"lmp\"] = d[\"delivery_date\"] - pd.to_timedelta(d[\"ga_weeks\"] * 7, unit=\"D\")\n    d[\"org_start\"] = d[\"lmp\"] + pd.to_timedelta(ORGANOGENESIS[0] * 7, unit=\"D\")\n    d[\"org_end\"]   = d[\"lmp\"] + pd.to_timedelta(ORGANOGENESIS[1] * 7, unit=\"D\")\n\n    # Continuous FFS-observable enrollment across [lmp - 90d, delivery + 90d]; exclude MA-only spans.\n    e = enroll[~enroll[\"ma_only\"]].merge(\n        d[[\"person_id\", \"lmp\", \"delivery_date\"]], on=\"person_id\")\n    e[\"covers\"] = ((e[\"enroll_start\"] <= e[\"lmp\"] - pd.Timedelta(days=PRE_LMP_DAYS)) &\n                   (e[\"enroll_end\"]   >= e[\"delivery_date\"] + pd.Timedelta(days=POST_DEL_DAYS)))\n    eligible = set(e.loc[e[\"covers\"], \"person_id\"])\n    d = d[d[\"person_id\"].isin(eligible)].copy()\n\n    # Exposure = a study/comparator fill whose supplied days overlap the organogenesis window.\n    r = rx.merge(d[[\"person_id\", \"org_start\", \"org_end\"]], on=\"person_id\")\n    # Last supplied day: the fill date counts as day 1, so subtract one day from the supply length.\n    r[\"supply_end\"] = r[\"fill_date\"] + pd.to_timedelta(r[\"days_supply\"] - 1, unit=\"D\")\n    r[\"overlaps\"] = (r[\"fill_date\"] <= r[\"org_end\"]) & (r[\"supply_end\"] >= r[\"org_start\"])\n    exp = (r[r[\"overlaps\"]]\n             .sort_values([\"person_id\", \"fill_date\"])\n             .groupby(\"person_id\")\n             .agg(arm=(\"drug_class\", \"first\")).reset_index())\n    cohort = 
d.merge(exp, on=\"person_id\", how=\"inner\")   # keep dyads exposed to either arm in-window\n\n    # Link each delivery to its infant (analytic unit = dyad) within the date tolerance.\n    cohort = cohort.merge(links, on=\"person_id\", how=\"inner\")\n    ok = (cohort[\"infant_birth_date\"] - cohort[\"delivery_date\"]).abs() <= pd.Timedelta(days=LINK_WINDOW_DAYS)\n    cohort = cohort[ok].copy()\n\n    cohort[\"baseline_start\"] = cohort[\"lmp\"] - pd.Timedelta(days=PRE_LMP_DAYS)\n    return cohort[[\"person_id\", \"infant_id\", \"arm\", \"lmp\",\n                   \"org_start\", \"org_end\", \"delivery_date\", \"baseline_start\"]]",
        "description": "Pregnancy exposure-window dyad cohort construction from claims-style inputs. Required inputs (cleaned, de-duplicated):\n  deliveries : person_id (mother), delivery_date (datetime), ga_weeks (algorithm-derived gestational age at delivery), outcome ('LIVEBIRTH'/'LOSS')\n  rx         : person_id (mother), fill_date (datetime), drug_class in {'STUDY','COMPARATOR'}, days_supply (int)\n  enroll     : person_id, enroll_start, enroll_end, ma_only (bool)   # ma_only person-time lacks FFS encounter capture\n  links      : person_id (mother), infant_id, infant_birth_date (datetime)\nReturns one row per eligible live-birth dyad with arm, LMP, organogenesis window, and the linked infant_id. Read the\nmalformation outcome from the infant's first-year claims downstream; measure covariates only in [lmp - 90d, lmp].",
        "dependencies": [
          "pandas",
          "numpy"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\nPRE_LMP_DAYS   <- 90L\nPOST_DEL_DAYS  <- 90L\nORG_START_WK   <- 4L\nORG_END_WK     <- 10L\nLINK_WINDOW    <- 30L\n\nbuild_pregnancy_dyad_cohort <- function(deliveries, rx, enroll, links) {\n  setDT(deliveries); setDT(rx); setDT(enroll); setDT(links)\n\n  d <- deliveries[outcome == \"LIVEBIRTH\"]\n  d[, lmp := delivery_date - ga_weeks * 7L]                 # back-date LMP from gestational age\n  d[, `:=`(org_start = lmp + ORG_START_WK * 7L,\n           org_end   = lmp + ORG_END_WK   * 7L)]\n\n  # FFS-observable continuous enrollment across [lmp - 90, delivery + 90]; drop MA-only spans.\n  e <- merge(enroll[ma_only == FALSE], d[, .(person_id, lmp, delivery_date)], by = \"person_id\")\n  ok_enr <- e[enroll_start <= lmp - PRE_LMP_DAYS &\n              enroll_end   >= delivery_date + POST_DEL_DAYS, unique(person_id)]\n  d <- d[person_id %in% ok_enr]                             # %in% works for integer or character ids\n\n  # Exposure: study/comparator fill whose supplied days overlap the organogenesis window.\n  r <- merge(rx, d[, .(person_id, org_start, org_end)], by = \"person_id\")\n  r[, supply_end := fill_date + days_supply - 1L]           # last supplied day; fill day is day 1\n  r <- r[fill_date <= org_end & supply_end >= org_start]\n  setorder(r, person_id, fill_date)\n  exp <- r[, .(arm = drug_class[1L]), by = person_id]\n  cohort <- merge(d, exp, by = \"person_id\")\n\n  # Dyad linkage within the delivery-to-birth date tolerance.\n  cohort <- merge(cohort, links, by = \"person_id\")\n  cohort <- cohort[abs(as.integer(infant_birth_date - delivery_date)) <= LINK_WINDOW]\n\n  cohort[, baseline_start := lmp - PRE_LMP_DAYS]\n  cohort[, .(person_id, infant_id, arm, lmp, org_start, org_end, delivery_date, baseline_start)]\n}",
        "description": "Pregnancy exposure-window dyad cohort construction with data.table. Inputs mirror the Python version:\n  deliveries : person_id, delivery_date (Date), ga_weeks (numeric), outcome ('LIVEBIRTH'/'LOSS')\n  rx         : person_id, fill_date (Date), drug_class in {'STUDY','COMPARATOR'}, days_supply (integer)\n  enroll     : person_id, enroll_start, enroll_end, ma_only (logical)\n  links      : person_id, infant_id, infant_birth_date (Date)\nReturns one dyad row per eligible live birth with arm, LMP, organogenesis window, and linked infant_id.",
        "dependencies": [
          "data.table"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "%let pre_lmp   = 90;   /* observable lookback before LMP */\n%let post_del  = 90;   /* observable follow-through after delivery */\n%let org_start = 28;   /* organogenesis start: 4 weeks * 7 days */\n%let org_end   = 70;   /* organogenesis end:  10 weeks * 7 days */\n%let link_win  = 30;   /* delivery-to-birth date tolerance */\n\n/* Live-birth episodes with back-dated LMP and organogenesis window. */\nproc sql;\n  create table episode as\n  select person_id,\n         delivery_date,\n         delivery_date - ga_weeks*7              as lmp       format=date9.,\n         delivery_date - ga_weeks*7 + &org_start as org_start format=date9.,\n         delivery_date - ga_weeks*7 + &org_end   as org_end   format=date9.\n  from work.deliveries\n  where outcome = 'LIVEBIRTH';\nquit;\n\n/* FFS-observable continuous enrollment across [lmp-90, delivery+90]; exclude MA-only person-time. */\nproc sql;\n  create table enrolled as\n  select e2.*\n  from episode e2\n  where exists (\n    select 1 from work.enroll en\n    where en.person_id = e2.person_id\n      and en.ma_only = 0\n      and en.enroll_start <= e2.lmp - &pre_lmp\n      and en.enroll_end   >= e2.delivery_date + &post_del\n  );\nquit;\n\n/* Arm = drug_class of the earliest study/comparator fill overlapping the organogenesis window. */\n/* Join qualifying fills to the enrolled episode, sort by person/fill, then keep the earliest per person. 
*/\nproc sql;\n  create table cand as\n  select en.person_id, en.lmp, en.org_start, en.org_end, en.delivery_date,\n         r.fill_date, r.drug_class\n  from enrolled en\n  inner join work.rx r\n    on r.person_id = en.person_id\n   and r.drug_class in ('STUDY','COMPARATOR')\n   and r.fill_date <= en.org_end\n   and r.fill_date + r.days_supply - 1 >= en.org_start;  /* last supplied day; fill day is day 1 */\nquit;\n\nproc sort data=cand;\n  by person_id fill_date;\nrun;\n\ndata exposed;\n  set cand;\n  by person_id;\n  if first.person_id;          /* earliest overlapping fill; arm is non-missing by construction */\n  length arm $12;\n  arm = drug_class;\n  keep person_id lmp org_start org_end delivery_date arm;\nrun;\n\n/* Link to the infant (analytic unit = dyad) within the date tolerance. */\nproc sql;\n  create table cohort as\n  select x.person_id, l.infant_id, x.arm, x.lmp,\n         x.org_start, x.org_end, x.delivery_date,\n         x.lmp - &pre_lmp as baseline_start format=date9.\n  from exposed x\n  inner join work.links l\n    on l.person_id = x.person_id\n   and abs(l.infant_birth_date - x.delivery_date) <= &link_win;\nquit;",
        "description": "Pregnancy exposure-window dyad cohort construction in SAS (PROC SQL). Required input datasets (post data-management):\n  work.deliveries : person_id, delivery_date, ga_weeks, outcome ('LIVEBIRTH'/'LOSS')\n  work.rx         : person_id, fill_date, drug_class ('STUDY'/'COMPARATOR'), days_supply\n  work.enroll     : person_id, enroll_start, enroll_end, ma_only (0/1)\n  work.links      : person_id, infant_id, infant_birth_date\nProduces work.cohort: one dyad per eligible live birth with arm, lmp, organogenesis window, and linked infant_id.\nBecause malformations are rare, fit the downstream outcome model with PROC LOGISTIC ... / FIRTH on the dyad set.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Q[Special population?] -->|Pregnant / postpartum| Preg[Reconstruct pregnancy episode<br/>+ back-date LMP]\n  Q -->|Pediatric / neonatal| Ped[Age/weight dose normalization<br/>+ growth-trajectory endpoints]\n  Q -->|Rare or biomarker-defined| Rare[External / historical control<br/>+ optional Bayesian borrowing]\n  Preg -->|Maternal outcome| Mat[Unit = pregnant person]\n  Preg -->|Fetal / neonatal outcome| Dyad[Unit = mother-infant dyad<br/>requires linkage]\n  Mat --> Win[Gestational-age exposure window<br/>e.g. organogenesis wk 4-10]\n  Dyad --> Win\n  Win --> Sparse{Events sparse?}\n  Rare --> Sparse\n  Ped --> Sparse\n  Sparse -->|Yes| Exact[Firth-penalized / exact inference]\n  Sparse -->|No| Std[Standard PS-weighted model]\n  Exact --> Out[Report with attrition, link rates,<br/>and sensitivity analyses]\n  Std --> Out",
        "caption": "Routing logic for special-populations RWE — population type sets the exposure window, the analytic unit, and the comparator strategy before any model is fit.",
        "alt_text": "Decision tree branching by special population (pregnancy, pediatric, rare/biomarker) into exposure-window, analytic-unit, comparator, and sparse-event inference choices.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "gantt\n  title Pregnancy exposure-window dyad timeline (claims)\n  dateFormat YYYY-MM-DD\n  axisFormat %b %Y\n  section Mother\n  Pre-LMP covariate + observable window :done, pre, 2023-07-03, 90d\n  Organogenesis exposure window (wk 4-10) :crit, org, 2023-10-29, 42d\n  Pregnancy span (LMP to delivery) :active, preg, 2023-10-01, 280d\n  Delivery :milestone, del, 2024-07-07, 0d\n  section Infant\n  Linked infant first-year outcome ascertainment :infant, 2024-07-07, 365d",
        "caption": "One live-birth dyad — covariates measured before LMP, exposure classified by days_supply overlap with the organogenesis window, and the malformation outcome read from the linked infant after delivery.",
        "alt_text": "Gantt timeline showing pre-LMP covariate window, organogenesis exposure window, pregnancy span to delivery, and infant first-year outcome ascertainment.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "part_of",
        "target_slug": "pregnancy-exposure-window-rwe",
        "notes": "The gestational-age-anchored exposure window is the core operational child for the pregnancy case worked here."
      },
      {
        "relation_type": "part_of",
        "target_slug": "mother-infant-linkage-rwe",
        "notes": "Dyad linkage makes the mother-infant pair the analytic unit for fetal/neonatal outcomes."
      },
      {
        "relation_type": "part_of",
        "target_slug": "rare-disease-external-controls-rwe",
        "notes": "External/historical controls substitute for an infeasible concurrent comparator in rare and biomarker-defined disease."
      },
      {
        "relation_type": "part_of",
        "target_slug": "borrowing-historical-controls-bayesian-rwe",
        "notes": "Dynamic Bayesian borrowing down-weights historical controls on conflict, mitigating (not curing) non-exchangeability."
      },
      {
        "relation_type": "part_of",
        "target_slug": "pediatric-dose-normalization-rwe",
        "notes": "Age- and weight-normalized dosing replaces fixed-dose exposure definitions in pediatric RWE."
      },
      {
        "relation_type": "part_of",
        "target_slug": "pediatric-growth-development-endpoints-rwe",
        "notes": "Growth-trajectory and developmental endpoints replace fixed-threshold outcomes in children."
      },
      {
        "relation_type": "part_of",
        "target_slug": "biomarker-defined-cohort-rwe",
        "notes": "Biomarker eligibility timing defines the cohort when the population is molecularly rather than clinically delimited."
      },
      {
        "relation_type": "see_also",
        "target_slug": "single-arm-external-control",
        "notes": "The general external-control design that the rare/biomarker branch specializes for special populations."
      },
      {
        "relation_type": "see_also",
        "target_slug": "firth-penalized-regression-rwe",
        "notes": "Sparse events in small special populations break large-sample inference; Firth penalization or exact methods are required."
      },
      {
        "relation_type": "see_also",
        "target_slug": "generalizability-transportability-external-validity-rwe",
        "notes": "These designs exist to avoid the transportability leap of extrapolating general-population estimates to the subgroup."
      },
      {
        "relation_type": "used_with",
        "target_slug": "time-zero-index-date-alignment-rwe",
        "notes": "Gestational and developmental landmarks redefine time zero away from the default index-fill date."
      }
    ],
    "aliases": [
      "special population RWE",
      "pregnancy pediatric rare disease methods",
      "underrepresented population methods",
      "vulnerable population RWE methods"
    ],
    "complexity": "advanced",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "splines-flexible-functional-forms",
    "name": "Splines and Flexible Functional Forms",
    "short_definition": "A family of regression methods — principally restricted cubic splines (RCS) and fractional polynomials — that model nonlinear relationships between a continuous covariate and an outcome, avoiding both the information loss of arbitrary dichotomization and the misspecification of forcing a linear slope on an inherently curved relationship such as the J-shaped BMI-mortality curve or the U-shaped alcohol-health dose-response; the primary deliverable is a graphical dose-response curve, not uninterpretable individual spline coefficients.",
    "long_description": "**The problem with categorizing continuous exposures**\n\nOne of the most persistent analytic errors in epidemiology and HEOR is converting a continuous\ncovariate into categories: age bands, BMI ≥ 30 for \"obese,\" creatinine above a clinical threshold.\nThe appeal is transparency — indicator variables are simple to include and easy to present in tables.\nThe cost is severe. First, within-category variation is entirely discarded: a BMI of 30.1 and a BMI\nof 39.9 are treated as identical, despite very different health risk profiles. Second, the cut-point\nis invented, not derived from biology: BMI ≥ 30 is a clinical and regulatory convention; placing the\ncut at 28 or 32 would produce a different result with no principled basis for the choice. Third,\nresidual confounding increases because a coarser representation of a confounder leaves more of its\neffect uncontrolled. The correct solution — always — is to model continuous covariates continuously,\nand when the shape of the relationship is uncertain, let the data determine that shape rather than\nimposing an arbitrary one.\n\n**When the linear assumption fails**\n\nEntering a continuous variable as a single linear term is an improvement over categorization, but it\nimposes a constant slope: every one-unit increase in the covariate is assumed to carry the same\nincrement in the outcome on the model scale. This is implausible for many exposures. Body-mass\nindex has a well-documented J-shaped or U-shaped association with all-cause mortality: both underweight\nand severely obese individuals face elevated risk while the normal-weight range carries the lowest\nrisk. A linear term misrepresents this as a simple upward (or downward) trend. Similar non-linearity\nappears in the alcohol-health relationship, in environmental exposures with threshold or saturation\neffects, and in pharmacological dose-response curves. 
Analysts can diagnose linearity failure by\ndrawing component-plus-residual (partial residual) plots against the covariate, overlaying a lowess\nsmoother, and comparing AIC/BIC for linear versus nonlinear specifications. A clear departure from a\nstraight partial-residual line is grounds to fit a flexible functional form.\n\n**Restricted cubic splines: mechanics for practitioners**\n\nA spline is a piecewise polynomial that joins smoothly at breakpoints called *knots*. In a cubic\nspline the function value, first derivative, and second derivative all agree at each knot, producing\na smooth continuous curve. *Restricted cubic splines* (RCS) — also called natural cubic splines —\nadd a further constraint: outside the two outermost (boundary) knots, the function is forced to be\nlinear. This boundary-linearity constraint prevents the explosive extrapolation that unrestricted\ncubic splines can produce in the tails, and it consumes fewer degrees of freedom: a k-knot RCS uses\nk − 1 degrees of freedom in total (one linear term plus k − 2 nonlinear terms). With 3 knots the\nspline adds only 1 df beyond a linear term; with 5 knots it adds 3 df.\n\nThe conventional knot-count guidance (Harrell, *Regression Modeling Strategies*, 2015):\n- **3 knots** — small samples or strong prior knowledge of a simple curve; 1 df added.\n- **4 knots** — the practical default for most HEOR analyses; 2 df added.\n- **5 knots** — large datasets (n > 1,000 events) with complex shapes; 3 df added.\n\nKnot placement follows default percentile rules: for 3 knots, the 10th, 50th, and 90th percentiles\nof the covariate distribution; for 4 knots, the 5th, 35th, 65th, and 95th percentiles; for 5 knots,\nthe 5th, 27.5th, 50th, 72.5th, and 95th percentiles.\nSimulation studies consistently show that results are much more sensitive to the *number* of knots\nthan to their exact positions within a few percentile points, so the default rules are robust in\npractice. 
Analysts need not search for optimal knot locations.\n\n**The dose-response curve as the primary deliverable**\n\nThis is a critical communication point: **individual coefficients from a spline model are not\ninterpretable in isolation.** A 3-knot RCS produces two regression parameters (one linear term, one\nnonlinear component); neither has a stand-alone clinical meaning. The correct output is the predicted\noutcome value plotted as a function of the continuous covariate across its observed range, with a\nreference value chosen (e.g., BMI = 25 kg/m²) and the outcome expressed relative to that reference.\nFor a logistic regression this is the log-odds ratio (or exponentiated OR) with a 95% confidence\ninterval band. For a Cox model it is the log-HR versus the reference. This dose-response curve is the\nprimary deliverable: it shows the shape, direction, and uncertainty of the exposure-outcome\nrelationship in a single figure that decision-makers can interpret without statistical expertise.\nReports that present only the numerical spline coefficients are providing uninterpretable output.\n\n**Fractional polynomials as a rival**\n\nFractional polynomials (FP) model nonlinearity by selecting the best-fitting powers of X from the\nclosed set {−2, −1, −0.5, 0, 0.5, 1, 2, 3} (where power 0 means log(X)), with a model-selection\nalgorithm (MFP) that chooses one or two power terms. FPs avoid knots entirely and produce a smooth,\nclosed-form transformation. They handle extreme-tail behaviour well when the correct transformation\nhappens to be one of the available powers (e.g., log for right-skewed exposures); RCS is more\nflexible for complex interior shapes (multiple inflection points or humps) and for exposure ranges\nspanning several orders of magnitude. 
In practice both methods are occasionally compared as a\nsensitivity check; systematic comparisons in simulation show RCS and FPs perform similarly across a\nbroad range of true shapes with neither dominating uniformly.\n\n**Royston-Parmar flexible parametric survival models**\n\nIn survival analysis, Royston-Parmar (RP) models represent the log cumulative hazard (for a\nproportional-hazards parameterisation) or the log cumulative odds (for proportional odds) as a\nrestricted cubic spline function of log time.\nThis produces a fully parametric model that can capture complex, non-monotone hazard shapes, such as\nan early rise followed by a plateau or more than one turning point, that the six standard parametric\ndistributions (exponential, Weibull, Gompertz, log-normal, log-logistic, generalised gamma) fit\npoorly or not at all. In an etiological\nstudy the primary output is the dose-response curve of the log hazard ratio as a function of a\ncontinuous covariate modelled with an RCS term, not the baseline hazard spline coefficients. For the\nHTA survival extrapolation context — where RP splines are used to extrapolate beyond observed\nfollow-up and their knot positions are a key sensitivity parameter — see the dedicated entry on\nsurvival-extrapolation-hta-rwe. This entry focuses on their use as flexible functional forms within a\nregression framework, not on the extrapolation problem specifically.\n\n**Penalized splines and generalised additive models**\n\nPenalized splines (P-splines) and thin-plate splines in a generalised additive model (GAM) automate\nsmoothness selection: instead of pre-specifying the number of knots, the analyst specifies a penalty\non the second derivative of the smooth function, and the effective degrees of freedom are chosen by\nrestricted maximum likelihood (REML). The mgcv package in R implements this via `gam(y ~ s(x),\nfamily = ..., method = \"REML\")`; REML must be requested explicitly because it is not the default\nsmoothness-selection method. GAMs are well-suited for exploratory analysis of many continuous predictors\nsimultaneously or when knot selection is genuinely uncertain. 
Trade-offs: the estimated effective df\nis data-adaptive and the inference is approximate (confidence intervals rely on Bayesian posterior\narguments); the smooth cannot easily be transported to other software as a fixed analytical formula;\nand the automated smoothness parameter can over-smooth in sparse tails. For a confirmatory analysis\nwith a specific continuous exposure, a pre-specified RCS with a defended number of knots is more\ntransparent and easier to pre-register and validate.\n\n**Overfitting guardrails**\n\nEvery additional knot costs one degree of freedom. The standard rule of thumb, adapted from the\nevents-per-variable (EPV) heuristic: require at least 10–15 outcome events per degree of freedom\nadded by the spline. For a 5-knot RCS on one covariate (3 df beyond the linear term) in a binary\noutcome study, this implies at least 30–45 events for the nonlinear part alone (40–60 if the linear\nterm is counted as well) before\naccounting for other covariates in the model. When EPV is low, reduce the knot count to 3 or to a\nlinear term. Internal validation using bootstrap optimism correction (Harrell's `validate()` in R's\nrms package) quantifies how much the apparent C-statistic or calibration slope overfits the training\ndata. AIC and BIC across models with different knot counts provide a complementary selection criterion.\n\n**Pros, cons, and trade-offs**\n\n*Restricted cubic splines*: Pros — flexible, can approximate a wide range of smooth shapes; boundary linearity\nprevents tail explosion; default percentile knot rules are robust; fully integrated with standard\nregression models and their inference machinery (Wald tests, bootstrap CIs); widely implemented in\nR (rms, splines), Python (patsy, formulaic), and SAS (EFFECT statement). Cons — individual\ncoefficients are uninterpretable; requires adequate EPV; linear outside boundary knots may be\nwrong for structural extrapolation; does not automatically select the optimal knot count. 
Prefer when\nthe functional form is uncertain and EPV ≥ 10 per spline df.\n\n*Fractional polynomials*: Pros — closed-form transformation; no knot selection; can represent\nextreme-tail behaviour well with standard powers. Cons — restricted to the available power library;\nmay fail for complex interior shapes; model search inflates effective degrees of freedom. Prefer when\nprior theory suggests a specific power transformation and the exposure range is wide.\n\n*Linear term*: Pros — interpretable slope and CI; minimum df; analytically and computationally\ntrivial; always valid as a starting point. Cons — systematically wrong when the true shape is curved;\nmisses J-shaped or U-shaped patterns; can produce residual confounding bias when the covariate is an\nimportant confounder with nonlinear effects. Prefer when EPV is too small for flexible forms, or when\nprior evidence strongly supports linearity.\n\n*Categorization*: Do not prefer categorization over any of the above for continuous exposures or\nconfounders in a regression model. The only legitimate use of categories after fitting is for\nclinical communication — presenting the fitted curve evaluated at clinically recognisable BMI bands,\nfor example — not as a substitute for modelling the continuous relationship.\n\n**When to use**\n\nUse splines and flexible functional forms whenever: (1) a continuous covariate is entered as an\nexposure or as a confounder in a regression model and the functional form is not firmly established\nby prior literature; (2) the exposure range is wide enough that a constant-slope assumption is\nnon-trivial; (3) EPV ≥ 10 per spline df. Key applications in HEOR include: age when used as a\nconfounder (rather than a study-design stratum); BMI, blood pressure, creatinine, HbA1c in outcome\nmodels; cumulative drug dose in dose-response analyses; biomarker-outcome relationships in registry\nor EHR studies. 
In propensity score models, RCS for continuous confounders is particularly\nimportant: a linear propensity score term for a nonlinearly confounded variable leaves systematic\nresidual confounding after matching or weighting.\n\n**When NOT to use**\n\nDo not use splines when: (1) EPV is very low (< 10 per df) — reduce to a linear term and validate;\n(2) the model output must be a portable, closed-form risk score for bedside use — RCS requires\nsoftware to evaluate and cannot be simplified to a lookup table; (3) the continuous covariate takes\nonly a handful of distinct integer values (0, 1, 2, 3 prior hospitalisations) — use indicator\nvariables instead; (4) extrapolation beyond the observed range is the scientific goal — the boundary\nlinearity of RCS reduces but does not eliminate extrapolation risk, and structural extrapolation\nshould use explicitly motivated parametric forms (Weibull, Royston-Parmar with external anchoring);\n(5) the association is already established as linear in multiple prior studies — the added complexity\nbuys nothing and costs interpretability.\n\n**Interpreting the output**\n\nIn the worked example, logistic regression of 5-year cardiovascular mortality on BMI is fit with\na linear term (AIC = 2450) and a 3-knot restricted cubic spline (AIC = 2410). The spline model\nproduces two regression coefficients; neither is reported individually. Instead, the dose-response\ncurve is plotted: log-odds ratio of 5-year mortality versus BMI, evaluated across the observed BMI\nrange with BMI = 28 kg/m² (the median knot and reference value) set to log-OR = 0.\n\n*(1) Formal interpretation.* The individual parameters β₁ and β₂ of the 3-knot RCS basis are not\nestimands with clinical meaning; they are coordinates of a basis representation. The estimand is\nthe conditional log-odds ratio of 5-year CV mortality at each BMI level relative to the reference\nBMI = 28, adjusting for all other model covariates, evaluated from the fitted spline function. 
The\nAIC difference of 40 units (AIC linear 2450 minus AIC spline 2410 = 40) provides strong evidence\nagainst the linear constraint: by convention a ΔAIC > 10 is considered strong evidence that the\nadditional parameters improve fit beyond chance. The spline model uses 1 extra degree of freedom\n(k − 2 = 3 − 2 = 1) relative to the linear model; the improvement of 40 AIC units for 1 df is\nsubstantial.\n\n*(2) Practical interpretation.* The dose-response curve reveals a J-shaped BMI-mortality\nrelationship: predicted risk is lowest near the median (BMI ≈ 28) and rises at both the underweight\nextreme (BMI < 20) and the severely obese extreme (BMI > 38). A linear model forced this into a\nsingle slope, systematically under-estimating risk at both extremes. For a payer or HTA body, the\nshape matters: a policy targeting exclusively \"BMI ≥ 30\" misses elevated risk among the underweight,\nwhile an intervention designed from the correctly specified curve would target both tails.",
    "primary_category": "Inferential_Statistics",
    "tags": [
      "splines",
      "restricted-cubic-splines",
      "natural-cubic-splines",
      "flexible-functional-forms",
      "dose-response",
      "nonlinearity",
      "fractional-polynomials",
      "GAM",
      "Royston-Parmar",
      "BMI-mortality"
    ],
    "applies_to_study_types": [
      "cohort_retrospective",
      "cohort_prospective",
      "rct",
      "cross_sectional",
      "claims_analysis",
      "ehr_study",
      "registry_study"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "primary",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1002/sim.4780080504",
        "url": "https://doi.org/10.1002/sim.4780080504",
        "citation_text": "Durrleman S, Simon R. Flexible regression models with cubic splines. Statistics in Medicine. 1989;8(5):551-561.",
        "year": 1989,
        "authors_short": "Durrleman & Simon",
        "notes": "The foundational paper establishing flexible regression models with cubic splines in a biomedical statistics context; introduced the practical framework for using spline bases in logistic and Cox regression that underpins the rms package conventions."
      },
      {
        "role": "explain",
        "doi": "10.1002/sim.1203",
        "url": "https://doi.org/10.1002/sim.1203",
        "citation_text": "Royston P, Parmar MK. Flexible parametric proportional-hazards and proportional-odds models for censored survival data, with application to prognostic modelling and estimation of treatment effects. Statistics in Medicine. 2002;21(15):2175-2197.",
        "year": 2002,
        "authors_short": "Royston & Parmar",
        "notes": "Introduces Royston-Parmar flexible parametric survival models, which place a restricted cubic spline on the log cumulative hazard, producing a fully parametric alternative to the Cox model that accommodates non-monotone hazard shapes; the foundation for flexible parametric survival analysis in HTA and etiological studies."
      },
      {
        "role": "demonstrate",
        "doi": "10.1038/s41409-019-0679-x",
        "url": "https://doi.org/10.1038/s41409-019-0679-x",
        "citation_text": "Gauthier J, Wu QV, Gooley TA. Cubic splines to model relationships between continuous variables and outcomes: a guide for clinicians. Bone Marrow Transplantation. 2020;55(4):675-680.",
        "year": 2020,
        "authors_short": "Gauthier et al.",
        "notes": "Accessible, clinician-oriented demonstration of restricted cubic splines in survival and logistic regression, including knot selection, dose-response curve plotting, and comparison to linear and categorical specifications; directly illustrates the workflow described in this entry."
      }
    ],
    "plain_language_summary": "Splines and flexible functional forms are ways to model a curved (nonlinear) relationship between a continuous measurement — like BMI or age — and an outcome in a regression, without forcing the relationship to follow an arbitrary straight line or invented cut-off. The most common tool is the restricted cubic spline, which fits a smooth, piecewise curve through the data with a small number of anchor points called knots placed at percentiles of the exposure. The result is reported as a dose-response picture showing predicted risk across the full range of the exposure, not as raw regression numbers — because individual spline coefficients have no meaningful interpretation on their own. These methods are the correct alternative both to grouping patients into BMI categories (which throws away information and invents thresholds) and to assuming a straight-line relationship that may be J-shaped, U-shaped, or otherwise curved.",
    "key_terms": [
      {
        "term": "knot",
        "definition": "An anchor point in a spline where two polynomial segments join; the curve changes its bending behavior at each knot while remaining smooth and continuous."
      },
      {
        "term": "restricted cubic spline (RCS)",
        "definition": "A piecewise cubic curve that is constrained to be linear (straight) outside the outermost anchor points, preventing wild extrapolation in the tails; also called a natural cubic spline."
      },
      {
        "term": "degrees of freedom (df)",
        "definition": "The number of free parameters a spline adds to a regression model; a k-knot RCS uses k minus 2 degrees of freedom beyond a linear term."
      },
      {
        "term": "dose-response curve",
        "definition": "A graph showing the predicted outcome (e.g., log-odds ratio of death) across the full range of a continuous exposure, with one reference value set to zero; this is the correct output of a spline model, not the individual spline coefficients."
      },
      {
        "term": "events per variable (EPV)",
        "definition": "The number of outcome events in the data divided by the number of free parameters in the model; the rule of thumb is at least 10 events per degree of freedom to avoid overfitting."
      },
      {
        "term": "AIC (Akaike Information Criterion)",
        "definition": "A model comparison score (lower is better) that penalizes complexity; a difference of more than 10 AIC units between two models is generally considered strong evidence in favor of the lower-AIC model."
      }
    ],
    "worked_example": {
      "scenario": "A cohort study examines the association between body mass index (BMI) and 5-year cardiovascular mortality in 800 adults. Three specifications of BMI are compared in a logistic regression: (1) a binary obesity indicator (BMI ≥ 30 kg/m²), (2) BMI as a continuous linear term, and (3) BMI modeled with a 3-knot restricted cubic spline. Knot positions for the spline are determined from a 10-person pilot subsample. The analyst uses AIC to compare model fit and then plots the dose-response curve from the winning spline model to visualize the shape of the BMI-mortality relationship.",
      "dataset": {
        "caption": "Full-cohort AIC comparison across the three model specifications. The spline knots (10th, 50th, and 90th percentiles) are taken from the 10-person pilot subsample listed in the steps.",
        "columns": [
          "model",
          "extra_df_beyond_linear",
          "AIC"
        ],
        "rows": [
          [
            "Binary obesity indicator (BMI >= 30)",
            0,
            2480
          ],
          [
            "Linear continuous BMI",
            0,
            2450
          ],
          [
            "Restricted cubic spline (3 knots)",
            1,
            2410
          ]
        ]
      },
      "steps": [
        "Sort the 10-person pilot BMI values in ascending order: 18, 20, 23, 25, 27, 29, 31, 34, 38, 42 (n = 10 values).",
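        "Knot 1 at the 10th percentile: using the nearest-rank rule (rank = ceil(0.10 × 10) = 1), this is the 1st sorted value = 18 kg/m². Software defaults that interpolate between ranks give slightly different values.",
        "Knot 2 at the 50th percentile (median): with an even n, take the average of the 5th and 6th sorted values = (27 + 29) / 2 = 28 kg/m².",
        "Knot 3 at the 90th percentile: using the nearest-rank rule (rank = ceil(0.90 × 10) = 9), this is the 9th sorted value = 38 kg/m². Three knots placed at 18, 28, 38.",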
        "The binary model collapses BMI into obese vs not; every BMI difference within a category is discarded. AIC = 2480.",
        "The linear model uses one slope for the entire BMI range. AIC = 2450. AIC improvement over binary: 2480 - 2450 = 30.",
        "The 3-knot spline adds 1 extra degree of freedom (k - 2 = 3 - 2 = 1) on the full cohort with knots at 18, 28, 38. AIC = 2410. AIC improvement over linear: 2450 - 2410 = 40.",
        "A difference of 40 AIC units strongly favors the spline (rule of thumb: ΔAIC > 10 is strong evidence). The dose-response curve is plotted with BMI = 28 as the reference value. Individual spline coefficients are not reported because they have no stand-alone interpretation."
      ],
      "result": "Knot positions: 18, 28, 38 kg/m². Knot 2 = (27 + 29) / 2 = 28. AIC linear = 2450, AIC spline = 2410; AIC difference = 2450 - 2410 = 40 in favor of the spline. The dose-response curve reveals a J-shaped BMI-mortality association invisible in the linear or binary models."
    },
    "prerequisites": [
      "ols-linear-regression",
      "generalized-linear-models",
      "regression-diagnostics"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "RCS as a confounder adjustment term",
        "description": "A continuous confounder (age, BMI, creatinine) enters the regression as a 3- or 4-knot RCS rather than as a linear term, ensuring that nonlinear confounding is controlled rather than left as residual bias. This is the most common use of splines in HEOR: not the exposure of interest, but a covariate that must be correctly adjusted. In propensity score models, RCS terms for continuous confounders substantially reduce residual imbalance compared to linear terms when the true confounding relationship is curved.",
        "edge_cases": [
          "If events are sparse (fewer than 30 in total), use a linear term for the confounder and check sensitivity to that choice; adding spline terms when EPV is marginal can inflate the standard error of the exposure estimate.",
          "For age as a confounder in a large claims study (n > 10,000 events), a 5-knot RCS is appropriate and will capture cohort-effect nonlinearity that a linear age term misses."
        ],
        "data_source_notes": "In claims data, continuous confounders like Charlson score, prior utilisation counts, or lag-window cost are often right-skewed; apply a log or square-root transformation before the spline if the distribution is extremely heavy-tailed, then verify the residual plot."
      },
      {
        "name": "RCS for the primary exposure (dose-response analysis)",
        "description": "The continuous exposure itself is modeled as an RCS and the primary output is the dose-response curve: predicted log-OR, log-HR, or expected outcome plotted versus the exposure with a chosen reference value and 95% CI band. This is the correct approach whenever the scientific question is \"what is the shape of the exposure-outcome relationship?\" rather than \"is there any association?\" Knot count must be pre-specified and defended on EPV grounds.",
        "edge_cases": [
          "The reference value is a choice, not an estimand; analysts must report it explicitly. Common choices: the median (most data support), the clinical normal value (most interpretable), or a regulatory threshold (most policy-relevant). The shape of the curve is invariant to the reference but the numerical log-OR values all shift.",
          "If the dose-response curve is to be compared across subgroups, add a spline-by-group interaction; each group then has its own curve, and the difference between curves is the heterogeneous dose-response effect."
        ],
        "data_source_notes": "For time-varying cumulative dose in claims data, first construct the cumulative dose variable at each person-time interval (see time-updated-exposures-cumulative-dose-rwe), then model cumulative dose as an RCS in a time-updated Cox model; the dose-response curve shows how current cumulative dose associates with the instantaneous hazard."
      },
      {
        "name": "Royston-Parmar spline for survival analysis",
        "description": "A restricted cubic spline on the log cumulative hazard (or log cumulative odds) replaces the parametric baseline hazard in a fully parametric survival model. The model is estimated by maximum likelihood and produces smooth, non-monotone hazard estimates. For etiological work, a continuous covariate of interest is then added as a second RCS term and plotted as a dose-response log-HR curve. For HTA extrapolation, the baseline spline is projected beyond the observed follow-up with external anchoring — see survival-extrapolation-hta-rwe for that specific context.",
        "edge_cases": [
          "The number of baseline hazard knots (typically 2–5) and the number of covariate spline knots are separate choices; both are governed by EPV and contribute df to the model.",
          "Royston-Parmar models are not identifiable without at least one event in every time interval between knots; sparse late follow-up can produce unstable spline estimates at the right tail."
        ],
        "data_source_notes": "Available in R via flexsurv::flexsurvspline() and rstpm2::stpm2(), and in Stata via stpm2. SAS has no single PROC for this model; it can be fit in PROC NLMIXED with a custom log-likelihood."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "ols-linear-regression",
        "pros_of_this": "Captures curved dose-response shapes that a linear term systematically misses, reducing functional-form misspecification bias in both the exposure estimate and the confounding adjustment.",
        "cons_of_this": "Adds degrees of freedom (1 per extra knot beyond 2); individual spline coefficients are uninterpretable; requires adequate EPV; more complex to specify and report.",
        "when_to_prefer": "When the exposure range is wide, prior evidence is weak, and EPV supports the added complexity; use the linear term when EPV is marginal or prior evidence strongly supports linearity."
      },
      {
        "compared_to": "generalized-linear-models",
        "pros_of_this": "The flexible functional form for continuous covariates is orthogonal to the choice of link function and error distribution; splines slot into logistic, Poisson, gamma, and Cox regression identically once the basis is defined.",
        "cons_of_this": "No cons specific to the combination — splines and GLMs are complementary, not competing.",
        "when_to_prefer": "Always consider splines for continuous covariates within any GLM when EPV is adequate; the combination of a correct link function (logistic, log) and a flexible functional form for continuous predictors is the standard in rigorous HEOR analysis."
      },
      {
        "compared_to": "survival-extrapolation-hta-rwe",
        "pros_of_this": "In an etiological study, the spline term on the covariate of interest produces a dose-response HR curve within the observed data range — this is the correct in-sample estimand.",
        "cons_of_this": "Extrapolation of the spline curve beyond the observed covariate range or beyond observed follow-up time should be done with explicit caution; for HTA lifetime extrapolation the baseline hazard spline requires external anchoring (see survival-extrapolation-hta-rwe).",
        "when_to_prefer": "Splines for etiological dose-response are appropriate within the observed data window; for HTA extrapolation beyond observed follow-up, use the Royston-Parmar framework with external validation."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Continuous confounders in claims (prior-period cost, number of prior hospitalisations, Charlson score) are often right-skewed; apply a log or square-root transformation before the spline if the distribution is extremely heavy-tailed (tail/median ratio > 20). Calibrate the reference value and knot percentiles to the claims cohort distribution, not to published general-population norms. For BMI derived from EHR-linked or self-reported data in claims, check for implausible values (BMI < 13 or > 80) and trim before placing knots, since extreme outliers shift percentile knots undesirably.",
      "ehr": "Lab values (creatinine, HbA1c, albumin) and vitals (SBP, BMI from clinical measurement) are natural candidates for spline modeling. Trim physiologically impossible values before fitting; in EHR data, transcription errors can produce values far outside the plausible range that shift percentile-based knots. For repeated-measures data, decide whether to use baseline measurement, time-updated measurement, or a summary (mean, maximum) before constructing the spline variable.",
      "registry": "Registries often over-represent moderate-to-severe disease, compressing the lower end of the exposure distribution and potentially placing the lowest-percentile knot in a sparse region. Check the histogram of the exposure before accepting default percentile knots; shift a boundary knot inward if fewer than 5% of observations lie outside it. For a disease severity score that is the primary exposure, a 4-knot RCS is often appropriate and should be plotted at the clinical severity anchor points (e.g., NIHSS 0, 5, 15, 25 for stroke).",
      "primary": "Small primary studies typically have EPV too low for a 5-knot spline; use 3 knots as the maximum and prefer the linear term if total events < 30. Report the 3-knot spline result as an exploratory sensitivity analysis rather than the primary analysis if EPV is marginal.",
      "linked": "Linked claims-EHR-registry cohorts typically have large n and can support 4- or 5-knot splines on important continuous covariates. The linkage process may introduce censoring on the continuous covariate (e.g., BMI only available for the linked EHR subset); check that spline knots are computed from the linked subset, not the full claims population."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\nimport pandas as pd\nimport statsmodels.formula.api as smf\nimport matplotlib.pyplot as plt\n\n# ── Simulate J-shaped BMI-mortality data (nadir near BMI 26) ──\nrng = np.random.default_rng(42)\nn = 800\nbmi = rng.uniform(15, 50, n)\ntrue_log_odds = 0.004 * (bmi - 26) ** 2 - 1.8   # U-shaped truth\nprob_true = 1 / (1 + np.exp(-true_log_odds))\nevent = rng.binomial(1, prob_true, n)\ndf = pd.DataFrame({\"bmi\": bmi, \"event\": event})\n\n# ── Knot positions at 10th / 50th / 90th percentiles of observed BMI ──\np10, p50, p90 = (float(np.percentile(bmi, p)) for p in [10, 50, 90])\nprint(f\"Knots at 10th/50th/90th pctiles: {p10:.1f} / {p50:.1f} / {p90:.1f} kg/m²\")\n\n# ── Fit linear and 3-knot RCS models ──\nfit_lin = smf.logit(\"event ~ bmi\", data=df).fit(disp=0)\n\n# patsy cr(): natural cubic spline = RCS with boundary linearity.\n# One interior knot (at p50) with boundary knots at p10, p90 gives a 3-knot spline\n# and adds exactly 1 df beyond the linear term (k - 2 = 3 - 2 = 1).\n# constraints='center' drops the constant from the basis so it is not collinear\n# with the model intercept (an unconstrained cr() basis sums to 1 in every row).\nrcs_formula = (\n    f\"event ~ cr(bmi, knots=[{p50:.2f}], \"\n    f\"lower_bound={p10:.2f}, upper_bound={p90:.2f}, constraints='center')\"\n)\nfit_rcs = smf.logit(rcs_formula, data=df).fit(disp=0)\n\nprint(f\"\\nAIC  linear : {fit_lin.aic:.1f}\")\nprint(f\"AIC  spline : {fit_rcs.aic:.1f}\")\nprint(f\"ΔAIC (linear - spline): {fit_lin.aic - fit_rcs.aic:.1f}\")\nprint(\"\\nWARNING: spline coefficients below have NO stand-alone clinical interpretation.\")\nprint(\"Report the dose-response curve (plot), not these coefficients.\")\nprint(fit_rcs.summary2().tables[1][[\"Coef.\", \"Std.Err.\", \"P>|z|\"]])\n\n# ── Dose-response curve: log-OR vs reference BMI (= median) ──\nref_bmi = p50\nbmi_grid = pd.DataFrame({\"bmi\": np.linspace(float(bmi.min()), float(bmi.max()), 200)})\nref_df   = pd.DataFrame({\"bmi\": [ref_bmi]})\n\ndef prob_to_logodds(p):\n    p = np.clip(p, 1e-9, 1 - 1e-9)\n    return np.log(p / (1 - p))\n\nlog_or = (prob_to_logodds(fit_rcs.predict(bmi_grid))\n          - 
float(prob_to_logodds(fit_rcs.predict(ref_df))))\n\nplt.figure(figsize=(7, 4))\nplt.axhline(0, color=\"grey\", lw=0.8, ls=\"--\")\nplt.axvline(ref_bmi, color=\"grey\", lw=0.8, ls=\":\",\n            label=f\"Reference BMI = {ref_bmi:.1f}\")\nplt.plot(bmi_grid[\"bmi\"], log_or, color=\"steelblue\", lw=2,\n         label=\"3-knot RCS log-OR\")\nplt.xlabel(\"BMI (kg/m²)\")\nplt.ylabel(f\"Log-OR vs BMI = {ref_bmi:.1f} kg/m²\")\nplt.title(\"Dose-response: BMI and 5-year CV mortality\")\nplt.legend()\nplt.tight_layout()\nplt.savefig(\"bmi_spline_curve.png\", dpi=120)\nprint(f\"\\nDose-response curve saved to bmi_spline_curve.png\")\nprint(\"This curve is the primary deliverable. The spline coefficients are not reported.\")",
        "description": "Fit linear and 3-knot restricted cubic spline (RCS) logistic regression models using patsy's\ncr() (natural cubic spline = RCS with boundary linearity) and statsmodels, compare by AIC, and\nplot the dose-response curve (log-OR vs reference BMI). The dose-response curve — not the\nindividual spline coefficients — is the primary output.\nRequired columns in df: bmi (float, kg/m²), event (int, 0/1 five-year CV mortality).",
        "dependencies": [
          "statsmodels",
          "patsy",
          "numpy",
          "pandas",
          "matplotlib"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(rms)\n\n# ── Simulate the same J-shaped BMI-mortality data ──\nset.seed(42)\nn    <- 800\nbmi  <- runif(n, 15, 50)\nprob <- plogis(0.004 * (bmi - 26)^2 - 1.8)\ndf   <- data.frame(bmi = bmi, event = rbinom(n, 1, prob))\n\n# ── rms requires datadist for Predict() and confidence intervals ──\ndd <- datadist(df)\noptions(datadist = \"dd\")\n\n# ── Fit linear vs 3-knot RCS logistic regression ──\nfit_lin <- lrm(event ~ bmi,        data = df, x = TRUE, y = TRUE)\nfit_rcs <- lrm(event ~ rcs(bmi, 3), data = df, x = TRUE, y = TRUE)\n# rcs(bmi, 3): 3-knot RCS with knots at 10th/50th/90th percentiles by default.\n# Knot positions are stored in attr(rcs(bmi, 3), \"parms\").\n\ncat(sprintf(\"AIC  linear : %.1f\\n\", AIC(fit_lin)))\ncat(sprintf(\"AIC  spline : %.1f\\n\", AIC(fit_rcs)))\ncat(sprintf(\"ΔAIC (linear - spline): %.1f\\n\", AIC(fit_lin) - AIC(fit_rcs)))\n\n# Wald test that the nonlinear spline term = 0 (tests whether spline improves on linear)\nanova(fit_rcs)   # shows separate tests for linear vs nonlinear components\n\n# ── Dose-response curve: log-OR vs reference BMI (= median, set by datadist) ──\n# Predict() evaluates the fitted model on a grid and subtracts the reference value.\n# The reference is the median of bmi in the datadist object (adjustable via ref.zero).\np <- Predict(fit_rcs, bmi = seq(15, 50, by = 0.5), fun = identity, ref.zero = TRUE)\n# ref.zero = TRUE: log-OR is set to 0 at the reference value (datadist median).\n\nplot(p,\n     xlab = \"BMI (kg/m²)\",\n     ylab = \"Log-OR vs reference BMI (median)\",\n     main = \"Dose-response: BMI and 5-year CV mortality (3-knot RCS)\")\nabline(h = 0, lty = 2, col = \"grey\")\n\n# ── Alternative: splines::ns() for use outside rms (e.g., glm, coxph) ──\n# ns() = natural cubic spline = RCS with boundary linearity.\n# Knots must be specified explicitly (not automatic percentiles like rms::rcs).\nk   <- quantile(df$bmi, c(0.10, 0.50, 0.90))\nfit_ns <- glm(event ~ splines::ns(bmi, knots = 
k[\"50%\"],\n                                  Boundary.knots = c(k[\"10%\"], k[\"90%\"])),\n              data = df, family = binomial)\ncat(sprintf(\"AIC via ns(): %.1f\\n\", AIC(fit_ns)))\n# WARNING: ns() coefficients are not interpretable individually.\n# Use predict() on a grid and plot log-OR vs reference, as above.",
        "description": "Fit linear and 3-knot RCS logistic regression using rms::lrm() + rms::rcs(), compare by AIC,\nand plot the dose-response curve with rms::Predict(). Also shows splines::ns() as an\nalternative basis in base R. The rms workflow produces the dose-response plot directly from\nthe fitted object without manually constructing a prediction grid.\nRequired data frame: df with columns bmi (numeric) and event (0/1).",
        "dependencies": [
          "rms",
          "splines"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* ── 1. Set knot positions at 10th / 50th / 90th percentiles of observed BMI ── */\nproc univariate data=work.bmi_data noprint;\n  var bmi;\n  output out=work.knot_pctiles pctlpre=P_ pctlpts=10 50 90;\nrun;\n/* Retrieve knot values into macro variables for KNOTMETHOD=LIST below.             */\ndata _null_;\n  set work.knot_pctiles;\n  call symput('k1', trim(put(P_10, best8.)));\n  call symput('k2', trim(put(P_50, best8.)));\n  call symput('k3', trim(put(P_90, best8.)));\nrun;\n%put Knots: &k1 &k2 &k3;\n\n/* ── 2. Fit linear logistic regression ── */\nproc logistic data=work.bmi_data;\n  model event(event='1') = bmi;\n  ods output FitStatistics=work.fit_lin;\nrun;\n\n/* ── 3. Fit 3-knot RCS via EFFECT statement with KNOTMETHOD=LIST ── */\n/* The EFFECT SPLINE statement builds a natural (restricted) cubic spline basis     */\n/* with boundary knots supplied in KNOTMETHOD=LIST (first and last are boundaries). */\nproc logistic data=work.bmi_data;\n  effect spl = spline(bmi / knotmethod=list(&k1 &k2 &k3) naturalcubic);\n  model event(event='1') = spl;\n  /* STORE for downstream SCORE step */\n  store out=work.rcs_model;\n  ods output FitStatistics=work.fit_rcs;\nrun;\n\n/* ── 4. Compare AIC: retrieve from FitStatistics ── */\ndata _null_;\n  set work.fit_lin; if Criterion = 'AIC' then call symput('aic_lin', put(InterceptAndCovariates, best8.));\nrun;\ndata _null_;\n  set work.fit_rcs; if Criterion = 'AIC' then call symput('aic_rcs', put(InterceptAndCovariates, best8.));\nrun;\n%put AIC linear: &aic_lin; %put AIC spline: &aic_rcs;\n\n/* ── 5. Dose-response curve: predict over a grid, plot log-OR vs reference BMI ── */\n/* Create a prediction grid from BMI 15 to 50, reference BMI = median knot (k2).  
*/\ndata work.bmi_grid;\n  do bmi = 15 to 50 by 0.5; output; end;\nrun;\n\nproc plm restore=work.rcs_model;\n  score data=work.bmi_grid out=work.pred_grid predicted lclm uclm / ilink;\n  /* ILINK: output predicted probabilities (back-transformed); for log-OR subtract   */\n  /* the predicted log-odds at the reference value from the grid log-odds.           */\nrun;\n\n/* Convert predicted probability to log-odds; subtract log-odds at reference BMI.  */\nproc sql noprint;\n  select log(predicted / (1 - predicted)) into :ref_logodds\n  from work.pred_grid where abs(bmi - &k2) < 0.26;   /* nearest grid point to k2 */\nquit;\n\ndata work.dose_response;\n  set work.pred_grid;\n  log_or_vs_ref = log(predicted / (1 - predicted)) - &ref_logodds;\n  label log_or_vs_ref = 'Log-OR vs reference BMI';\nrun;\n\nproc sgplot data=work.dose_response;\n  refline 0   / axis=y lineattrs=(color=grey pattern=dash);\n  refline &k2 / axis=x lineattrs=(color=grey pattern=dot);\n  series x=bmi y=log_or_vs_ref / lineattrs=(color=steelblue thickness=2);\n  xaxis label='BMI (kg/m2)' values=(15 to 50 by 5);\n  yaxis label=\"Log-OR vs BMI = &k2 kg/m2\";\n  title 'Dose-response: BMI and 5-year CV mortality (3-knot RCS)';\nrun;\n/* NOTE: the EFFECT SPLINE coefficients in the LOGISTIC output are not interpretable */\n/* individually. The dose-response plot above is the correct primary output.         */",
        "description": "Fit linear and 3-knot RCS logistic regression using the EFFECT statement SPLINE option in\nPROC LOGISTIC (also available in PROC GLIMMIX). The EFFECT statement builds the spline basis\ninternally; knot positions are computed with PROC UNIVARIATE and passed via KNOTMETHOD=LIST.\nPrediction over a grid for the dose-response curve uses PROC PLM with the SCORE statement and\nthe ILINK option on the stored model.\nRequired SAS dataset: work.bmi_data with variables bmi (numeric) and event (0/1).",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Start([Continuous covariate in regression]) --> Q1{Is linearity plausible\\nbased on prior evidence?}\n  Q1 -- Yes AND EPV very low --> Lin[\"Linear term\\n1 df; interpretable slope\"]\n  Q1 -- Uncertain or\\nJ/U-shaped expected --> EPV{\"Adequate EPV?\\nevents / df >= 10\"}\n  EPV -- No --> Lin2[\"Reduce to linear or\\npenalised spline (GAM)\"]\n  EPV -- Yes --> Knots{Select knot count\\nbased on events}\n  Knots -- 30-100 events --> K3[\"3-knot RCS\\n1 extra df\\n10th / 50th / 90th pctile\"]\n  Knots -- 100-400 events --> K4[\"4-knot RCS\\n2 extra df\\n5/35/65/95 pctile\"]\n  Knots -- \"> 400 events\" --> K5[\"5-knot RCS\\n3 extra df\\n5/27.5/50/72.5/95 pctile\"]\n  K3 --> Plot[\"Plot dose-response curve\\nPredicted outcome vs X\\nwith reference value\\nand 95% CI band\"]\n  K4 --> Plot\n  K5 --> Plot\n  Plot --> Report[\"Report the CURVE\\nNot the spline coefficients\\nCoefficients are uninterpretable\\nin isolation\"]",
        "caption": "Decision tree for continuous covariate modeling. Categorization is never shown because it is never preferred over a continuous specification. Individual spline coefficients must not be reported as the primary result; the dose-response curve is the correct deliverable.",
        "alt_text": "Flowchart from a continuous covariate through linearity assessment, EPV check, and knot-count selection into one of three RCS specifications, then into a dose-response curve as the final output with a warning against reporting individual spline coefficients.",
        "source_type": "illustrative",
        "source_citations": [
          "gauthier-2020"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "used_with",
        "target_slug": "ols-linear-regression",
        "notes": "Splines replace or augment the linear term for a continuous covariate in OLS regression; the linear term is the special case where the spline is constrained to have zero curvature."
      },
      {
        "relation_type": "used_with",
        "target_slug": "generalized-linear-models",
        "notes": "Splines slot identically into logistic, Poisson, gamma, and other GLMs as covariate terms; the link function and error distribution are chosen independently of the functional form for continuous predictors."
      },
      {
        "relation_type": "used_with",
        "target_slug": "cox-ph-regression",
        "notes": "RCS terms are standard in Cox models for continuous covariates; the Royston-Parmar model additionally places a spline on the log cumulative baseline hazard, replacing the non-parametric Breslow estimator with a smooth parametric form."
      },
      {
        "relation_type": "see_also",
        "target_slug": "survival-extrapolation-hta-rwe",
        "notes": "Royston-Parmar flexible parametric survival models use RCS on the log cumulative hazard; for the HTA extrapolation context (projecting beyond observed follow-up, knot sensitivity, external anchoring) see that dedicated entry rather than this one."
      },
      {
        "relation_type": "see_also",
        "target_slug": "regression-diagnostics",
        "notes": "Linearity checking via component-plus-residual plots and lowess smoothers is the primary diagnostic that triggers the decision to use a spline rather than a linear term; regression diagnostics entry covers these tools."
      },
      {
        "relation_type": "see_also",
        "target_slug": "time-updated-exposures-cumulative-dose-rwe",
        "notes": "Cumulative drug dose modeled as a time-updated exposure is a natural application of dose-response splines; the continuous cumulative dose variable is entered as an RCS in a time-updated Cox model and the resulting curve shows the hazard ratio as a function of cumulative exposure."
      }
    ],
    "aliases": [
      "restricted cubic splines",
      "natural cubic splines",
      "spline regression",
      "RCS",
      "flexible parametric regression",
      "fractional polynomials"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "standard-cox-time-dependent",
    "name": "Cox Regression with Time-Dependent Covariates",
    "short_definition": "A Cox proportional-hazards model fit on (start, stop] counting-process intervals so that exposures or covariates whose values change during follow-up enter the partial likelihood at the value they held at each event time, rather than a single baseline value.",
    "long_description": "A **time-dependent (time-varying) Cox model** generalizes the standard proportional-hazards model by letting a covariate\n`x(t)` change value during follow-up. Mechanically, each subject's follow-up is split into multiple `(start, stop]`\nintervals — the **counting-process (Andersen–Gill) data layout** — and at every event time the partial likelihood compares\nthe covariate value of the person who failed against the *current* covariate values of everyone still at risk. The hazard\nis `h(t | x(t)) = h0(t) * exp(beta' x(t))`, and `exp(beta)` is the hazard ratio per unit of the covariate **as it is at\ntime t**. This is the correct way to model a status that switches on during follow-up (started a drug, was hospitalized,\ncrossed a lab threshold, accrued cumulative dose) without committing the cardinal sin of using a future value to classify\npresent person-time.\n\n**Core conceptual distinction**. The decision that does the work is *when person-time is allowed to count toward an\nexposure group*. A naive time-fixed analysis assigns a subject to the \"exposed\" group for their *entire* follow-up because\nthey were exposed *at some point*, which credits the exposed group with person-time that was lived *before* exposure\noccurred — **immortal time / guarantee-time bias**, because to become exposed the subject had to survive event-free until\nthe exposure happened. A time-dependent covariate fixes this: the subject contributes unexposed person-time from time zero\nuntil the moment of exposure, then exposed person-time thereafter, and the exposure indicator is 0 then 1 across the split\nintervals. The estimand is a **hazard ratio comparing the instantaneous hazard under the covariate value held at each\ninstant** — an internal, association-style contrast. 
Critically, this is NOT a causal contrast of treatment *strategies*:\nwhen the time-varying exposure is affected by, and itself affects, time-varying confounders (e.g., disease severity drives\nboth treatment switching and the outcome), standard time-dependent Cox is biased *no matter how you condition*, because\nadjusting for a post-baseline confounder that is also a mediator blocks part of the exposure effect and can open a\ncollider path, while leaving it out leaves confounding uncontrolled. The fix for that is a\nmarginal structural model (inverse-probability-of-treatment-weighted pooled logistic / Cox) or g-estimation, not a richer time-dependent Cox.\n\n**Pros, cons, and trade-offs**\n- **vs naive baseline-only (\"ever exposed\") Cox:** time-dependent coding removes immortal-time bias and correctly aligns\n  exposed person-time with exposed status. Cost: more programming, a counting-process dataset, and a model whose HR is\n  harder to communicate (it is per-instant, not \"over the study\"). **Prefer time-dependent** whenever exposure status,\n  dose, or a covariate genuinely changes after time zero and you are not estimating a strategy contrast.\n- **vs landmark analysis:** landmarking sidesteps immortal time by fixing exposure status at a pre-specified landmark time\n  and analyzing only survivors to that point. It is simpler, robust, and avoids the post-baseline-confounding trap, but it discards\n  early events and answers a conditional question (survivors to the landmark). **Prefer landmarking** for a single\n  one-way status change and a clean clinical message; **prefer time-dependent Cox** for repeatedly changing exposures,\n  cumulative dose, or when you cannot afford to drop pre-landmark events.\n- **vs marginal structural models / g-methods:** when the time-varying exposure has time-varying confounders that are also\n  on the causal pathway (confounder affected by prior treatment), time-dependent Cox is structurally biased and an MSM with\n  inverse-probability-of-treatment weights (or g-estimation) is required. 
**Prefer plain time-dependent Cox** only when\n  later covariates are not affected by earlier exposure; otherwise it is actively misleading. See marginal-structural-models-g-methods.\n- **vs Fine–Gray / cause-specific competing-risks models:** orthogonal concern — if a competing event (e.g., non-outcome\n  death) precludes the event of interest, the time-dependent Cox should be specified as cause-specific, and absolute-risk\n  statements need a cumulative-incidence (Fine–Gray) companion. See competing-risks-cause-specific-fine-gray-rwe.\n\n**When to use**. A covariate or exposure that switches value during follow-up (new medication start, hospitalization\nepisode, time-updated lab/biomarker, cumulative dose, on/off treatment windows); recurrent-event or multi-state follow-up\nmodeled in the Andersen–Gill counting-process framework; any analysis where a naive \"ever-exposed\" classification would\nmanufacture immortal time. It is the default engine for time-updated exposures when there is no feedback between exposure\nand later confounders.\n\n**When NOT to use — and when it is actively misleading or dangerous**.\n- **Time-varying confounding affected by prior treatment.** This is the headline failure. If a confounder (CD4, eGFR,\n  disease activity) responds to earlier exposure and also predicts later exposure and the outcome, conditioning on it in a\n  time-dependent Cox biases the effect (collider/over-adjustment) while omitting it leaves residual confounding — there is\n  no safe option. Use an MSM/g-methods. Mistaking this for \"I just need more covariates\" is the dangerous error.\n- **Using a covariate value not yet observed at time t.** Carrying a *final* or *peak* value backward, or coding exposure\n  as 1 for person-time *before* the exposure date, silently re-introduces immortal-time/guarantee-time bias — the very bias\n  the method exists to prevent. 
Every interval's covariate must be knowable from data available strictly at `start`.\n- **A defined treatment *strategy* is the question** (e.g., \"initiate and stay on drug A vs B\"). That is a sustained-strategy\n  estimand for clone-censor-weight or an MSM, not an instantaneous time-dependent HR.\n- **Reverse causation / the covariate is part of the outcome process.** A lab that moves *because* the event is imminent\n  (e.g., falling albumin in the days before death) will absorb the effect; the HR is then descriptive, not causal.\n- **Proportional hazards is violated** for the time-dependent term (its effect changes over time); then add an explicit\n  covariate-by-time interaction or move to RMST/flexible models rather than reporting a single misleading HR.\n\n**Data-source operational depth**.\n- **Claims (FFS vs MA):** the time-varying exposure is built from pharmacy `fill_date` + `days_supply` (on-treatment\n  episodes with stockpiling and grace rules) or from inpatient/procedure dates. The interval split must use the *claim/service\n  date*, never the *paid/adjudication date*, or exposure onset is shifted by months of lag. Failure mode: Medicare Advantage\n  enrollees lack fee-for-service claims, so on/off-drug and hospitalization intervals are unobservable — MA-only person-time\n  masquerades as \"unexposed,\" biasing the time-dependent indicator; restrict to A/B/D FFS (or commercial with pharmacy\n  benefit) and exclude MA-only spans rather than treating them as exposure-free. Same-day duplicate/reversed claims and\n  bundled services create spurious zero-length or overlapping intervals — collapse them before splitting. 
Plan switching\n  fragments enrollment; left-truncated person-time (enrollment starts mid-history) must enter via `start` > 0, not at 0.\n- **EHR:** time-updated labs/vitals arrive at irregular, encounter-driven times; between measurements you must choose a\n  carry-forward (last-observation-carried-forward) window and cap it, because a value carried for 18 months is fiction.\n  Visit-driven capture means sicker patients are measured more often — informative observation that can bias a\n  measurement-triggered exposure. External care leakage (treatment received outside the system) makes the on/off-exposure\n  indicator wrong; prefer linkage to dispensing.\n- **Registry:** strong for adjudicated time-stamped clinical events (stage change, transplant date) but typically sparse\n  between scheduled visits; reconcile registry event dates with claims/EHR before splitting, and treat between-visit gaps as\n  genuinely unobserved.\n- **Linked claims–EHR–vital records:** the ideal substrate (EHR labs for the time-varying covariate, claims for complete\n  drug/hospitalization episodes, death index for the outcome), but order/fill/service-date discrepancies across sources must\n  be reconciled so each interval's covariate is anchored to a single, defensible clock — otherwise overlapping intervals\n  from different sources double-count person-time.\n\n**Worked claims example.** Question: does *current* statin exposure (time-varying on/off) change the hazard of incident\nmyocardial infarction among adults with type 2 diabetes in a commercial + Medicare FFS database? (1) Cohort & time zero:\nage ≥18, ≥2 diabetes diagnoses, ≥365 days continuous A/B/D (or commercial medical+pharmacy) enrollment before time zero;\ntime zero = first diabetes-qualifying date after enrollment; exclude MA-only person-time. (2) Outcome: first validated MI\n(e.g., 1 inpatient primary-position MI claim), date = service date. 
(3) Build statin episodes from `fill_date` +\n`days_supply`, stitching overlapping fills (carry stockpiled supply forward, capped at, say, 90 days) and bridging gaps ≤30 days (grace period);\neach day at risk of MI is then classified by whether it falls inside an on-statin episode. (4) Split each person's\nfollow-up at every episode start/stop into `(start, stop]` intervals, set `statin_on` = 0/1 per interval, and set the event\nflag = 1 only on the interval whose `stop` is the MI date. (5) Right-censor at disenrollment, death, or end of data; left-truncate\nany person-time before continuous enrollment via positive `start`. (6) Fit a Cox model on the counting-process data with\n`statin_on` as a time-dependent covariate plus baseline confounders (and time-updated confounders only if they are NOT\naffected by prior statin use; otherwise switch to an MSM). (7) Check PH for `statin_on` via scaled Schoenfeld residuals;\nif violated, add a `statin_on * log(t)` interaction. The contrast is the hazard of MI while currently on statin vs\ncurrently off, with each day of person-time credited to the status truly held that day, so no immortal time enters.\n\n**Interpreting the output**\n\nA time-dependent Cox model of statin use and myocardial infarction returns: HR = 0.82 (95% CI 0.68–0.98) for current statin use vs current non-use, counting-process layout.\n\n*Formal interpretation.* The estimated HR of 0.82 means that, at each instant of follow-up, a patient currently on statin has, on average, 82% of the instantaneous MI rate of an otherwise similar patient currently not on statin, among those still event-free. Because person-time is classified by actual statin status on each day, using a (start, stop, event) counting-process layout, days before the first fill contribute to unexposed person-time even for patients who later initiate, eliminating the immortal time that inflates apparent benefit in naive time-fixed analyses. 
The PH assumption implies this ratio does not change with time; if the effect plausibly differs between early and late follow-up, a time-split or log(t) interaction should be checked.\n\n*Practical interpretation.* The HR captures \"currently on\" vs \"currently off\" treatment, not \"ever-treated\" vs \"never-treated.\" This distinction matters when adherence is variable: partial users contribute both on- and off-treatment person-time, and any benefit that persists after a patient stops is credited to off-treatment time, which pulls the contrast toward the null. Report both the HR and a cumulative incidence difference at a fixed horizon to convey absolute benefit alongside the relative rate estimate.",
    "primary_category": "Inferential_Statistics",
    "tags": [
      "cox-proportional-hazards",
      "time-dependent-covariate",
      "counting-process",
      "andersen-gill",
      "immortal-time-bias",
      "survival-analysis",
      "time-varying-exposure"
    ],
    "applies_to_study_types": [
      "claims_analysis",
      "ehr_study",
      "registry_linkage",
      "comparative_effectiveness",
      "target_trial_emulation"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1146/annurev.publhealth.20.1.145",
        "url": "https://doi.org/10.1146/annurev.publhealth.20.1.145",
        "citation_text": "Fisher LD, Lin DY. Time-dependent covariates in the Cox proportional-hazards regression model. Annual Review of Public Health. 1999;20:145-157.",
        "year": 1999,
        "authors_short": "Fisher & Lin",
        "notes": "Canonical methods statement of time-dependent covariates in Cox regression, distinguishing internal vs external covariates and the inferential pitfalls of using post-baseline values."
      },
      {
        "role": "explain",
        "doi": "10.1214/aos/1176345976",
        "url": "https://doi.org/10.1214/aos/1176345976",
        "citation_text": "Andersen PK, Gill RD. Cox's regression model for counting processes: a large sample study. The Annals of Statistics. 1982;10(4):1100-1120.",
        "year": 1982,
        "authors_short": "Andersen & Gill",
        "notes": "Counting-process formulation that underpins the (start, stop] interval data layout for time-dependent covariates and recurrent-event models."
      },
      {
        "role": "explain",
        "doi": "10.1111/j.2517-6161.1972.tb00899.x",
        "url": "https://doi.org/10.1111/j.2517-6161.1972.tb00899.x",
        "citation_text": "Cox DR. Regression models and life-tables. Journal of the Royal Statistical Society Series B. 1972;34(2):187-220.",
        "year": 1972,
        "authors_short": "Cox",
        "notes": "Original proportional-hazards model and partial likelihood that the time-dependent extension generalizes."
      },
      {
        "role": "demonstrate",
        "doi": "10.1200/JCO.2013.49.5283",
        "url": "https://doi.org/10.1200/JCO.2013.49.5283",
        "citation_text": "Giobbie-Hurder A, Gelber RD, Regan MM. Challenges of guarantee-time bias. Journal of Clinical Oncology. 2013;31(23):2963-2969.",
        "year": 2013,
        "authors_short": "Giobbie-Hurder et al.",
        "notes": "Applied demonstration of how time-fixed coding of a status that is only achievable after surviving creates guarantee-time (immortal-time) bias, and how time-dependent covariates or landmarking fix it."
      },
      {
        "role": "use",
        "doi": "10.1093/aje/kwm324",
        "url": "https://doi.org/10.1093/aje/kwm324",
        "citation_text": "Suissa S. Immortal time bias in pharmacoepidemiology. American Journal of Epidemiology. 2008;167(4):492-499.",
        "year": 2008,
        "authors_short": "Suissa",
        "notes": "Definitive pharmacoepidemiology treatment of immortal-time bias and the time-dependent exposure coding that prevents it in claims-based drug-effect studies."
      }
    ],
    "plain_language_summary": "A time-dependent Cox model is a survival analysis tool that tracks whether a patient is currently on a treatment at each moment of follow-up, rather than labeling them as a treated person for the entire study period. It splits each patient's follow-up into timed segments and asks, at any given day when an outcome event happens, who else was still being watched and what was their treatment status right then. This approach correctly handles the fact that a patient who starts a drug on day 90 of a study should not have their first 89 days counted as drug-exposed time. Without this correction, the pre-drug days get credited to the treatment group, creating a form of bias called immortal time bias that makes the drug look better than it is.",
    "key_terms": [
      {
        "term": "time-dependent covariate",
        "definition": "A variable in the model whose value is allowed to change during a patient's follow-up, such as whether they are currently on a drug, rather than being fixed at a single baseline value."
      },
      {
        "term": "immortal time bias",
        "definition": "An error that occurs when a patient's event-free time before they could even become exposed gets incorrectly counted as exposed time, making the exposure appear falsely protective."
      },
      {
        "term": "counting-process interval",
        "definition": "One row in the data table representing a stretch of follow-up time during which a patient's exposure status did not change, written as a start day and a stop day."
      },
      {
        "term": "censoring",
        "definition": "The point at which a patient's follow-up ends without the outcome having been observed, such as when they leave the insurance plan or the study ends; we know only that the outcome had not occurred by that point."
      },
      {
        "term": "hazard ratio",
        "definition": "A number that compares the instantaneous rate of an event in one group versus another; a hazard ratio below 1.0 means the event occurs at a lower rate in the first group at any given moment."
      }
    ],
    "worked_example": {
      "scenario": "A researcher wants to know whether being currently on a statin reduces the rate of heart attack among adults with type 2 diabetes. Patient 1001 enters the study on 2023-01-01 without a statin prescription. She fills her first statin on day 59 (2023-03-01). She has a heart attack on day 180 (2023-06-29). The key question for the data analyst is: how should her 180 days of follow-up be divided between unexposed time (no statin) and exposed time (on statin)?",
      "dataset": {
        "caption": "Counting-process table for patient 1001 after splitting her follow-up at her statin start date. Each row is one interval; the exposure indicator reflects her actual status during that stretch of time.",
        "columns": [
          "person_id",
          "t_start",
          "t_stop",
          "statin_on",
          "event"
        ],
        "rows": [
          [
            1001,
            0,
            59,
            0,
            0
          ],
          [
            1001,
            59,
            180,
            1,
            1
          ]
        ]
      },
      "steps": [
        "Patient 1001 enters the study at day 0 with no statin; statin_on is 0 for this first interval.",
        "On day 59, she fills her first statin prescription, so her follow-up is split into a new interval starting at day 59 where statin_on flips to 1.",
        "Her heart attack occurs on day 180, so the second interval ends at day 180 and the event flag is set to 1 on that final row.",
        "In the model, days 0-59 (59 days) contribute to the unexposed risk set and days 59-180 (121 days) contribute to the exposed risk set.",
        "If instead we had coded statin_on = 1 for all 180 days (naive ever-exposed approach), those first 59 event-free days would have been falsely credited to the statin group, inflating the apparent benefit of the drug."
      ],
      "result": "Unexposed person-time = 59 days (statin_on = 0); exposed person-time = 121 days (statin_on = 1); event attributed to the exposed interval. The hazard ratio for statin_on produced by the Cox model reflects the instantaneous rate of heart attack while currently on statin versus currently off statin, with no immortal time in the calculation.",
      "timeline_spec": {
        "title": "Follow-up for patient 1001: unexposed then exposed, heart attack on day 180",
        "window": {
          "start": "2023-01-01",
          "end": "2023-06-29",
          "label": "180-day follow-up window (day 0 to day 180)"
        },
        "events": [
          {
            "label": "Statin start (day 59)",
            "start": "2023-03-01",
            "length_days": 1,
            "quantity": "drug initiation"
          },
          {
            "label": "Heart attack (day 180)",
            "start": "2023-06-29",
            "length_days": 1,
            "quantity": "outcome event"
          }
        ],
        "spans": [
          {
            "kind": "unexposed",
            "start": "2023-01-01",
            "end": "2023-02-28",
            "label": "59 days unexposed (statin_on = 0)"
          },
          {
            "kind": "exposed",
            "start": "2023-03-01",
            "end": "2023-06-29",
            "label": "121 days exposed (statin_on = 1)"
          }
        ],
        "result": {
          "label": "Event on day 180 attributed to exposed interval; 59 unexposed days correctly removed from exposed group",
          "value": null
        },
        "caption": "Patient 1001 is unexposed for her first 59 days (no statin), then exposed from day 59 onward when she starts the drug. The time-dependent Cox model splits her follow-up at the drug start date so that each day is credited to the status she actually held. A naive ever-exposed analysis would have counted all 180 days as exposed, giving the statin credit for 59 event-free days it had nothing to do with.",
        "alt_text": "Horizontal timeline from January 1 to June 29, 2023. A gray bar labeled 59 days unexposed spans January 1 through February 28. A green bar labeled 121 days exposed spans March 1 through June 29. A vertical marker on March 1 is labeled statin start day 59. A vertical marker on June 29 is labeled heart attack day 180."
      }
    },
    "prerequisites": [
      "cox-ph-regression",
      "immortal-time-bias-handling",
      "exposure-episode-construction-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Binary on/off time-varying exposure (counting-process intervals)",
        "description": "Exposure indicator switches 0->1 (and possibly back) at episode boundaries derived from fills, procedures, or hospitalizations; follow-up is split into (start, stop] rows with one exposure value per row.",
        "edge_cases": [
          "Exposure date must be the service/claim date, not paid/adjudication date, or onset is lagged.",
          "Same-day or overlapping fills create zero-length or overlapping intervals; collapse before splitting.",
          "Left-truncated person-time (enrollment begins mid-history) must enter with start > 0."
        ],
        "data_source_notes": "claims: build episodes from fill_date + days_supply with stockpiling/grace rules; exclude MA-only person-time where on/off status is unobservable."
      },
      {
        "name": "Continuous time-updated covariate (lab/biomarker or cumulative dose)",
        "description": "A continuously valued covariate (eGFR, HbA1c, cumulative dose) is updated at each measurement; between measures a capped carry-forward defines the interval value.",
        "edge_cases": [
          "Last-observation-carried-forward over long gaps fabricates values; cap the carry-forward window.",
          "Visit-driven measurement means sicker patients are sampled more often (informative observation).",
          "A covariate that moves because the event is imminent induces reverse causation; the HR becomes descriptive."
        ],
        "data_source_notes": "ehr: irregular measurement times require an explicit between-visit value rule and a documented LOCF cap."
      },
      {
        "name": "Time-dependent term with non-proportional effect",
        "description": "The effect of the time-varying covariate itself changes over time, handled by an explicit covariate-by-time (e.g., x*log(t)) interaction rather than a single constant hazard ratio.",
        "edge_cases": [
          "Reporting one HR when PH is violated for the time-dependent term is misleading; test scaled Schoenfeld residuals.",
          "Heaviside (step-function) interactions partition follow-up into epochs and report epoch-specific HRs."
        ],
        "data_source_notes": "Applies across data sources; diagnose with PH tests on the counting-process model."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Naive baseline-only (\"ever exposed\") Cox",
        "pros_of_this": "Eliminates immortal-time/guarantee-time bias by crediting exposed person-time only after exposure occurs.",
        "cons_of_this": "Requires a counting-process dataset and yields a per-instant hazard ratio that is harder to communicate.",
        "when_to_prefer": "Whenever exposure, dose, or a covariate genuinely changes after time zero and no exposure-confounder feedback exists."
      },
      {
        "compared_to": "Landmark analysis",
        "pros_of_this": "Handles repeatedly changing exposures and cumulative dose; does not discard pre-landmark events.",
        "cons_of_this": "More complex; can mishandle post-baseline confounding feedback that landmarking avoids by design.",
        "when_to_prefer": "Repeated or continuous status changes; landmark is preferable for a single one-way status change with a clean clinical message."
      },
      {
        "compared_to": "Marginal structural models / g-methods",
        "pros_of_this": "Far simpler to specify and defend when later covariates are not affected by earlier exposure.",
        "cons_of_this": "Structurally biased (collider/over-adjustment) when time-varying confounders are affected by prior treatment.",
        "when_to_prefer": "Only when there is no exposure->confounder feedback; otherwise an inverse-probability-of-treatment-weighted MSM is required."
      },
      {
        "compared_to": "Cause-specific / Fine-Gray competing-risks models",
        "pros_of_this": "Directly models the hazard with a time-updated covariate.",
        "cons_of_this": "A single Cox does not give absolute risk under competing events; needs a cumulative-incidence companion.",
        "when_to_prefer": "When the time-varying covariate is the focus; pair with Fine-Gray for absolute-risk statements."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Build exposure/covariate episodes from service-date claims (fill_date + days_supply, procedure, admission/discharge), never paid dates. Split follow-up at every boundary into (start, stop] intervals; exclude Medicare Advantage-only person-time where on/off status is unobservable; left-truncate pre-enrollment time via positive start; collapse same-day duplicate/reversed claims before splitting.",
      "ehr": "Update continuous covariates at measurement times with a capped carry-forward; treat irregular, visit-driven measurement as informative observation; prefer linked dispensing over EHR orders to confirm actual exposure given external care leakage.",
      "registry": "Use adjudicated, time-stamped clinical events for the time-varying covariate; reconcile registry dates with claims/EHR before splitting; treat between-visit gaps as unobserved rather than carrying values indefinitely.",
      "linked": "Anchor every interval's covariate to a single reconciled clock across order/fill/service dates to avoid double-counting overlapping person-time; combine EHR labs, claims episodes, and a death index for the outcome."
    },
    "implementations": [
      {
        "lang": "python",
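        "code": "import pandas as pd\nfrom lifelines import CoxTimeVaryingFitter\n\n# df is the counting-process table described in the header (one row per person-interval).\n# Sanity checks: intervals are well-formed and do not overlap within a person.\nassert (df[\"t_stop\"] > df[\"t_start\"]).all(), \"each interval needs t_stop > t_start\"\ndf = df.sort_values([\"person_id\", \"t_start\"])\nprev_stop = df.groupby(\"person_id\")[\"t_stop\"].shift()\nassert not (df[\"t_start\"] < prev_stop).any(), \"intervals overlap within a person\"\n\nctv = CoxTimeVaryingFitter(penalizer=0.0)\nctv.fit(\n    df,\n    id_col=\"person_id\",\n    start_col=\"t_start\",\n    stop_col=\"t_stop\",\n    event_col=\"event\",\n    # statin_on enters at the value it holds in each interval -> time-dependent HR\n    formula=\"statin_on + age + female + prior_cvd\",\n)\nctv.print_summary()  # exp(coef) for statin_on = HR of MI while currently on vs off statin\n\n# Proportional-hazards check for the time-dependent term. lifelines' proportional_hazard_test\n# is documented for CoxPHFitter (it relies on that fitter's scaled Schoenfeld residuals), so a\n# step (Heaviside) interaction is used here instead. Assumes intervals were already split\n# upstream at CUT so that no row straddles it; CUT itself is an illustrative choice.\nCUT = 365.0\ndf[\"statin_on_late\"] = df[\"statin_on\"] * (df[\"t_start\"] >= CUT).astype(int)\nctv_ph = CoxTimeVaryingFitter(penalizer=0.0)\nctv_ph.fit(\n    df,\n    id_col=\"person_id\",\n    start_col=\"t_start\",\n    stop_col=\"t_stop\",\n    event_col=\"event\",\n    formula=\"statin_on + statin_on_late + age + female + prior_cvd\",\n)\nctv_ph.print_summary()  # a statin_on_late coefficient clearly away from 0 suggests non-PH",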
        "description": "Fit a time-dependent Cox model on a counting-process (start, stop] dataset using lifelines. Required input table\n(one row per person-interval, already built upstream from claims/EHR):\n  df : person_id, t_start, t_stop (float, days from time zero; t_start < t_stop),\n       event (1 only on the interval whose t_stop is the outcome date, else 0),\n       statin_on (0/1 time-varying exposure for that interval),\n       age, female, prior_cvd, ... (baseline confounders, constant across a person's rows)\nLeft truncation is encoded by the first interval's t_start > 0. Verify no overlapping intervals per person before fitting.",
        "dependencies": [
          "lifelines",
          "pandas"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(survival)\n\n# Follow-up time in days from time zero. Everyone enters at day 0 here; delayed entry\n# (left truncation) would need tstart= set in the first tmerge call.\nbase$fu_days   <- as.numeric(base$exit_date - base$time_zero)\ntz             <- base$time_zero[match(statin$person_id, base$person_id)]\nstatin$start_d <- as.numeric(statin$episode_start - tz)\nstatin$stop_d  <- as.numeric(statin$episode_end   - tz)\n\n# One row per status change: on = 1 at each episode start, on = 0 at each episode end.\nchg <- rbind(data.frame(person_id = statin$person_id, day = statin$start_d, on = 1),\n             data.frame(person_id = statin$person_id, day = statin$stop_d,  on = 0))\nchg <- chg[order(chg$person_id, chg$day), ]\n\n# Build the counting-process dataset: the first call sets each person's time range and event.\ncp <- tmerge(base, base, id = person_id,\n             mi = event(fu_days, event))\n# tdc(time, value, init): statin_on starts at 0 and takes the value of `on` from each change day onward.\n# A bare tdc(start_d) would switch to 1 at the first start and never switch back off.\ncp <- tmerge(cp, chg, id = person_id,\n             statin_on = tdc(day, on, 0))\n\nfit <- coxph(Surv(tstart, tstop, mi) ~ statin_on + age + female + prior_cvd,\n             data = cp, id = person_id, ties = \"efron\")\nsummary(fit)                                  # exp(coef) for statin_on = current-exposure HR\n\n# Proportional-hazards diagnostic on the time-dependent term (scaled Schoenfeld residuals).\ncox.zph(fit)",
        "description": "Time-dependent Cox via survival::coxph on a counting-process (start, stop] dataset. The canonical, defensible way to\nbuild the intervals is survival::tmerge, which splits each person's follow-up at every exposure/covariate change without\nhand-coded loops. Required inputs:\n  base   : person_id, time_zero (Date), exit_date (Date), event (0/1), age, female, prior_cvd  (one row per person)\n  statin : person_id, episode_start (Date), episode_end (Date)  (on-treatment episodes, already stitched/grace-applied)\ntmerge produces (tstart, tstop] intervals with a time-varying statin_on indicator and the event placed on the exit row.",
        "dependencies": [
          "survival"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* (A) Programming-statement time-dependent covariate: statin_on is re-evaluated at each risk set. */\n/*     stat_start = days from time zero to statin initiation (missing if never treated).            */\nproc phreg data=work.cohort;\n  class female prior_cvd / param=ref;\n  model fu_days*event(0) = statin_on age female prior_cvd / rl ties=efron;\n  /* At each event time, a subject is \"on statin\" iff treatment started at or before that time. */\n  if stat_start ne . and stat_start <= fu_days then statin_on = 1; else statin_on = 0;\n  assess var=(age) ph / resample;   /* proportional-hazards check */\nrun;\n\n/* (B) Counting-process (start, stop] form with a pre-split dataset and left truncation via tstart. */\nproc phreg data=work.cp;\n  class female prior_cvd / param=ref;\n  model (tstart, tstop)*event(0) = statin_on age female prior_cvd / rl ties=efron;\n  /* HR for statin_on = hazard while currently exposed vs currently unexposed. */\nrun;",
        "description": "Time-dependent Cox in SAS/STAT via PROC PHREG. Two options are shown. (A) Programming-statement form:\nthe time-varying exposure is computed INSIDE PHREG at each event time from the stored statin start day, so no pre-split dataset\nis needed; as written it handles a single one-way initiation (once on, always on). (B) Counting-process form: feed a pre-split\n(start, stop] dataset (e.g., from a DATA step or PROC SQL) and use the (tstart, tstop)*event(0) MODEL syntax; the ENTRY= option\nis an alternative when only left truncation is needed. Required inputs:\n  work.cohort  : person_id, fu_days (time zero to exit), event (0/1), stat_start (days to statin start, .=never),\n                 age, female, prior_cvd\n  work.cp      : person_id, tstart, tstop, event, statin_on (for option B; one row per interval)",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": "standard-cox-time-dependent-timeline.svg",
        "mermaid": null,
        "caption": "Patient 1001 is unexposed for her first 59 days (no statin), then exposed from day 59 onward when she starts the drug. The time-dependent Cox model splits her follow-up at the drug start date so that each day is credited to the status she actually held. A naive ever-exposed analysis would have counted all 180 days as exposed, giving the statin credit for 59 event-free days it had nothing to do with.",
        "alt_text": "Horizontal timeline from January 1 to June 29, 2023. A gray bar labeled 59 days unexposed spans January 1 through February 28. A green bar labeled 121 days exposed spans March 1 through June 29. A vertical marker on March 1 is labeled statin start day 59. A vertical marker on June 29 is labeled heart attack day 180.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  T0[Time zero: continuous enrollment + cohort entry] --> Epi[Derive exposure/covariate episodes<br/>service dates, days_supply + grace]\n  Epi --> Split[Split follow-up at every change into start,stop intervals]\n  Split --> Code[Per interval: covariate = value known at start; event = 1 only on outcome interval]\n  Code --> Trunc[Left-truncate pre-enrollment time via start &gt; 0; censor at disenroll/death/data end]\n  Trunc --> Fit[Cox partial likelihood on counting-process data<br/>HR = effect of CURRENT covariate value]\n  Fit --> PH[Check PH on the time-dependent term<br/>Schoenfeld residuals]\n  PH -->|PH holds| HR[Report current-exposure hazard ratio]\n  PH -->|PH violated| Inter[Add covariate*log t or step interaction / epoch-specific HRs]",
        "caption": "Building and fitting a time-dependent Cox model. Splitting follow-up at every covariate change and coding each interval with the value known at its start is what prevents immortal-time bias.",
        "alt_text": "Flowchart from time zero through episode derivation, interval splitting, per-interval covariate coding, left truncation and censoring, Cox fit on counting-process data, and a proportional-hazards check branch.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  A[Subject exposed at some point] --> B{How is exposed person-time coded?}\n  B -->|Naive: exposed for ALL follow-up| C[Pre-exposure event-free time credited to exposed group]\n  C --> D[Immortal / guarantee-time bias: exposure looks protective]\n  B -->|Time-dependent: 0 before exposure, 1 after| E[Person-time credited to the status truly held]\n  E --> F[Unbiased current-exposure hazard ratio<br/>if no exposure-confounder feedback]\n  F --> G{Time-varying confounder affected by prior exposure?}\n  G -->|Yes| H[Time-dependent Cox is biased -&gt; use MSM / g-methods]\n  G -->|No| I[Time-dependent Cox is appropriate]",
        "caption": "Why time-dependent coding is needed and where it stops being enough. Naive ever-exposed coding manufactures immortal time; time-dependent coding fixes it unless exposure-confounder feedback demands an MSM.",
        "alt_text": "Decision diagram contrasting naive ever-exposed coding (immortal-time bias) with time-dependent coding, then branching on whether time-varying confounding is affected by prior exposure, routing to MSM/g-methods if so.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "is_variant_of",
        "target_slug": "cox-ph-regression",
        "notes": "Generalizes the standard proportional-hazards model by allowing covariates to change value during follow-up via a counting-process data layout."
      },
      {
        "relation_type": "used_with",
        "target_slug": "time-updated-exposures-cumulative-dose-rwe",
        "notes": "The counting-process layout is the standard way to enter time-updated exposures and cumulative dose into a Cox model."
      },
      {
        "relation_type": "see_also",
        "target_slug": "immortal-time-bias-handling",
        "notes": "Time-dependent coding of exposure is the canonical remedy for immortal-time/guarantee-time bias in cohort follow-up."
      },
      {
        "relation_type": "alternative_to",
        "target_slug": "landmark-analysis",
        "notes": "Landmarking fixes exposure status at a pre-specified time and analyzes survivors; preferable for a single one-way status change, whereas time-dependent Cox handles repeated/continuous changes."
      },
      {
        "relation_type": "tradeoff_with",
        "target_slug": "marginal-structural-models-g-methods",
        "notes": "When time-varying confounders are affected by prior treatment, time-dependent Cox is biased and an IPTW-weighted MSM or g-estimation is required instead."
      },
      {
        "relation_type": "used_with",
        "target_slug": "recurrent-events-analysis-rwe",
        "notes": "The Andersen-Gill counting-process formulation underlying time-dependent covariates is also the basis for recurrent-event modeling."
      },
      {
        "relation_type": "see_also",
        "target_slug": "competing-risks-cause-specific-fine-gray-rwe",
        "notes": "With competing events, specify the time-dependent Cox as cause-specific and pair with a cumulative-incidence (Fine-Gray) model for absolute risk."
      }
    ],
    "aliases": [
      "time-dependent Cox model",
      "time-varying covariate Cox regression",
      "counting-process Cox model",
      "Andersen-Gill Cox model",
      "extended Cox model"
    ],
    "complexity": "advanced",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "stockpiling-carryover-rules-rwe",
    "name": "Stockpiling and Carryover Rules",
    "short_definition": "The episode-construction rule that decides whether days_supply from an early (overlapping) refill is carried forward to extend later exposure coverage, and how much accumulated oversupply is allowed to accrue before it is capped or reset.",
    "long_description": "**Stockpiling and carryover rules** govern what happens when a patient refills a chronic medication *before* the prior\nfill's supply has run out. Each pharmacy claim carries a `fill_date` and a `days_supply`; naively, exposure runs from\n`fill_date` to `fill_date + days_supply`. But real refill behavior is bursty — patients refill early (90-day mail order,\nvacation overrides, copay timing), so consecutive supply windows overlap. The carryover rule decides whether that overlap\nis (a) discarded (the clock restarts at each new fill), (b) preserved by *shifting* the new fill's coverage to begin only\nafter the prior supply is exhausted (so excess days \"bank\" and push the run-out date later), or (c) preserved but **capped**\nso accumulated oversupply cannot grow without bound. It is the single most consequential — and most often unspecified —\ndecision in turning dispensing records into a person-time exposure timeline, and it directly changes both adherence metrics\n(PDC/MPR numerators) and the exposed/unexposed person-time that feeds incidence rates and hazard ratios.\n\n**Core conceptual distinction** — three rules sit on a spectrum and produce materially different timelines from identical\nclaims. (1) *No carryover (truncate-at-next-fill / \"as dispensed\")*: a new fill resets the window; any unused days from the\nprior fill are thrown away. This is the most conservative for *current* exposure but understates total drug acquired and can\nmanufacture artificial \"gaps\" the moment a patient refills early. (2) *Full carryover (shift-forward stitching)*: the new\nfill's coverage is shifted to start at `max(fill_date, prior_run_out)`, so banked days accumulate and the run-out date drifts\nforward — the basis of OMOP drug-era construction with a `gap` parameter and of carryover-allowed PDC. 
(3) *Capped carryover*:\nfull carryover, but cumulative excess is held to a ceiling (commonly 30–90 days, or \"no more than one extra dispensing\"),\npreventing implausible months of banked supply from a string of early refills. Orthogonal to all three is the **gap-reset\nrule**: when an *observed* gap between run-out and the next fill exceeds a permissible threshold, the episode closes and a new\none begins (no carryover across the gap). The estimand distinction is sharp: a carryover-allowed timeline measures\n*theoretical inventory / cumulative acquisition* (good for chronic-effectiveness \"ever-treated\" or PDC-style adherence),\nwhereas a no-carryover or short-cap timeline approximates *current pharmacologic exposure* at a point in time (what a safety\nrate of an acute event actually requires). Choosing the wrong one is not a tuning detail — it changes the quantity estimated.\n\n**Pros, cons, and trade-offs** (each compared to the named alternative).\n- **Full carryover vs no carryover.** Full carryover correctly credits early refillers as continuously covered, avoids\n  spurious gaps, and matches how chronic adherence (PDC/MPR with stockpiling allowed) is meant to be measured; it is the\n  realistic default for steady-state chronic therapy. Cost: it pushes the modeled run-out date later than the patient's\n  *actual* drug-taking, so in a **safety** analysis an adverse event occurring after the patient truly stopped can be\n  misclassified as \"on drug,\" inflating exposed person-time and biasing the incidence rate toward the null. 
**Prefer no\n  carryover** when the estimand is current exposure to an acute hazard; **prefer full carryover** for cumulative adherence\n  and chronic-effectiveness contrasts.\n- **Capped vs uncapped full carryover.** Capping (e.g., excess ≤ 90 days, or drop pre-run-out duplicate fills beyond one)\n  prevents the pathological case where serial early refills bank a year of phantom supply that keeps a long-discontinued\n  patient \"exposed.\" Cost: the cap is an arbitrary tuning knob that must be pre-specified and sensitivity-tested; too tight\n  a cap re-creates artificial gaps. **Prefer capping** in any safety or per-protocol analysis; report the cap and vary it.\n- **vs simply using OMOP drug-era / `drug_era_gap`.** The OMOP drug-era is a packaged carryover-with-gap implementation\n  (default 30-day persistence window, configurable gap). It is reproducible and standard across the network, but its single\n  global gap parameter hides the stockpiling decision and rarely caps oversupply — auditing the resulting eras against raw\n  fills is still required. **Prefer the OMOP era** for federated/standardized work, but document and, if needed, override\n  its gap and add a cap.\n- **vs grace-period gap rules.** A grace period is about *closing* an episode after a permissible gap; the carryover rule\n  is about *banking surplus* before a gap occurs. They are complementary and must be specified together — a generous grace\n  period plus uncapped carryover is the combination most likely to fabricate immortal-style exposed time.\n\n**When to use** — any time pharmacy dispensing is converted to a longitudinal exposure or adherence variable: PDC/MPR\ncomputation, drug-era / treatment-episode construction, time-updated exposure for Cox or pooled-logistic models, and\npersistence (time-to-discontinuation) analyses. 
The rule must be written into the protocol/SAP *before* programming and\nreported explicitly, because it is invisible in the final estimate yet drives it.\n\n**When NOT to use — and when it is actively misleading or dangerous** (decision rules below).\n- **Acute-event safety analyses with full/uncapped carryover.** If the outcome is an acute event (e.g., GI bleed, syncope,\n  rhabdomyolysis) and you allow uncapped carryover, you extend \"exposed\" person-time past the patient's true last dose. An\n  event after real cessation is then counted as exposed, the exposed incidence rate is diluted, and a true harm is biased\n  toward the null. Use no-carryover or a short cap and align exposure to current pharmacology.\n- **Discontinuation / deprescribing / persistence studies.** Carryover *masks the very gaps you are trying to detect*;\n  banked supply makes a patient who stopped look persistent. Carryover should be off (or minimal) and the gap-reset rule\n  drives the discontinuation date.\n- **PRN / as-needed and titrated drugs.** `days_supply` for as-needed inhalers, nitroglycerin, opioids, or insulin is\n  fiction (the field reflects the dispensed quantity, not consumption), so any carryover arithmetic compounds a fabricated\n  denominator. Do not build inventory timelines for these from `days_supply` alone.\n- **Implausible accumulation left uncapped.** A patient with twelve 30-day fills in six months under full carryover banks\n  ~180 phantom days; uncapped, they remain \"covered\" long after they could plausibly still hold supply. Always cap or\n  flag-and-review impossible inventories.\n\n**Data-source operational depth** (each with real failure modes and workarounds).\n- **Claims (FFS).** `fill_date` + `days_supply` are the substrate. 
Real failure modes: (i) *90-day mail order overlapping a\n  30-day retail fill* — a same-week refill of a smaller script looks like massive stockpiling that is really a channel\n  switch; de-duplicate by NDC/strength and prefer the larger supply. (ii) *Same-day duplicate paid claims* (reversal +\n  re-bill, split fills) double-count days_supply unless collapsed. (iii) *Free samples, 340B, and discount-card fills* never\n  appear in claims, so true acquisition is undercounted and apparent gaps are artifactual. Workaround: pre-specify\n  de-duplication, cap carryover, and run the timeline both with and without the cap.\n- **Medicare FFS vs Medicare Advantage (MA).** MA-only person-time lacks fee-for-service Part D claims in many datasets, so\n  a \"gap\" during MA enrollment is missingness, not discontinuation, and any carryover/gap logic applied across it is\n  invalid. Restrict to enrollees with observable Part D (or commercial pharmacy benefit) and censor or exclude MA-only\n  person-time before stitching episodes.\n- **Long-term-care (LTC) and inpatient stays.** Bundled/per-diem LTC pharmacy and inpatient medications usually do not\n  generate individual retail claims; an institutional stay produces an apparent gap mid-episode. Carryover from the\n  pre-admission fill can either correctly bridge it or wrongly bank supply — decide explicitly and see inpatient-bridging\n  rules. Differential institutionalization by arm (more frequent in the sicker/older arm) then induces differential\n  carryover error.\n- **EHR.** Order/medication-list \"active\" dates are not dispensings; carryover arithmetic on EHR e-prescribing data without\n  linked fill confirmation overstates real acquisition (the prescription may never have been filled). 
Prefer linked pharmacy\n  claims; if only EHR, treat \"active medication\" windows as a separate, weaker exposure definition.\n- **Registry / linked.** Registries rarely capture complete dispensing; link to claims for fills and reconcile\n  order/fill/service-date discrepancies before assigning run-out dates. Linkage selection (only the linkable subset) can\n  correlate with refill behavior.\n\n**Worked claims example.** A patient (`person_id = 7`) with continuous FFS Part D enrollment has three 30-day fills of a\nstatin: `fill_date` = 2024-01-01, 2024-01-20, 2024-02-25, each `days_supply = 30`. The second fill arrives 19 days into the\nfirst 30-day supply (11 days of overlap → banked surplus); the third arrives 6 days after the second fill's unshifted run-out\n(a gap only if nothing is carried over).\n- *No carryover (truncate-at-next-fill):* episode coverage = [Jan 1, Jan 20) then [Jan 20, Feb 19) then [Feb 25, Mar 26).\n  Run-out before the third fill is Feb 19; the Feb 19→Feb 25 stretch is a 6-day **gap**, and the 11 banked days are\n  discarded. Cumulative excess = 0.\n- *Full carryover (shift-forward):* fill 2 is shifted to start at the prior run-out (Jan 31), so coverage runs Jan 1 →\n  Mar 1 continuously; cumulative excess after fill 2 = 11 days. Fill 3 (Feb 25) arrives 5 days *before* the carried run-out\n  (Mar 1), so excess grows to 16 days and run-out moves to Mar 31. **No gap is declared**: the early refills are credited.\n- *Capped carryover (cap = 90 days):* identical to full carryover here because 16 < 90; the cap only bites after many early\n  refills. 
(If the same patient kept refilling every 19–25 days, uncapped excess would balloon; the cap would hold cumulative\n  banked excess at 90 days, and further early refills would add no more carried days.)\n- *Gap-reset (permissible gap = 15 days):* under no-carryover the Feb 19→Feb 25 gap (6 days) is < 15, so the episode does\n  *not* reset and the third fill continues the same episode; had the third fill been 2024-03-20 (30-day observed gap > 15),\n  the episode would close on Feb 19 and a new episode would start Mar 20.\nThe four rules give run-out dates of Mar 26 (no carryover), Mar 31 (full/capped carryover) and different episode counts,\nall from the same three fills. PDC over the 90-day Jan 1–Mar 30 window: no-carryover credits 79 covered days (the 6-day gap\nplus 5 uncovered days after the Mar 26 run-out) → PDC ≈ 0.88; full carryover credits all 90 → PDC = 1.00. The choice, not\nthe data, moved the metric.",
    "primary_category": "Exposure_Definition",
    "tags": [
      "exposure-definition",
      "stockpiling",
      "carryover",
      "days-supply",
      "drug-era",
      "treatment-episode",
      "adherence-measurement",
      "person-time",
      "pharmacoepidemiology"
    ],
    "applies_to_study_types": [
      "claims_analysis",
      "ehr_study",
      "registry_linkage",
      "comparative_effectiveness",
      "drug_safety",
      "adherence_persistence"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1016/j.jclinepi.2009.07.001",
        "url": "https://doi.org/10.1016/j.jclinepi.2009.07.001",
        "citation_text": "Gardarsdottir H, Souverein PC, Egberts TCG, Heerdink ER. Construction of drug treatment episodes from drug-dispensing histories is influenced by the gap length. Journal of Clinical Epidemiology. 2010;63(4):422-427.",
        "year": 2010,
        "authors_short": "Gardarsdottir et al.",
        "notes": "Demonstrates empirically that the gap/carryover rule used to stitch dispensings into episodes materially changes episode count, length, and the resulting exposure classification — the foundational case for pre-specifying the rule."
      },
      {
        "role": "explain",
        "doi": "10.1016/s0895-4356(96)00268-5",
        "url": "https://doi.org/10.1016/s0895-4356(96)00268-5",
        "citation_text": "Steiner JF, Prochazka AV. The assessment of refill compliance using pharmacy records: methods, validity, and applications. Journal of Clinical Epidemiology. 1997;50(1):105-116.",
        "year": 1997,
        "authors_short": "Steiner & Prochazka",
        "notes": "Canonical treatment of refill-based exposure measurement, including how oversupply and overlapping fills (the stockpiling problem) inflate possession ratios above 1.0 if not handled."
      },
      {
        "role": "explain",
        "doi": "10.1002/pds.1230",
        "url": "https://doi.org/10.1002/pds.1230",
        "citation_text": "Andrade SE, Kahler KH, Frech F, Chan KA. Methods for evaluation of medication adherence and persistence using automated databases. Pharmacoepidemiology and Drug Safety. 2006;15(8):565-574.",
        "year": 2006,
        "authors_short": "Andrade et al.",
        "notes": "Operational catalogue of automated-database adherence/persistence definitions, including gap, carryover, and capping choices that distinguish MPR from PDC."
      },
      {
        "role": "demonstrate",
        "doi": "10.1002/pds.4372",
        "url": "https://doi.org/10.1002/pds.4372",
        "citation_text": "Pazzagli L, Linder M, Zhang M, Vago E, Stang P, Myer K, Andersen M, Bahmanyar S. Methods for time-varying exposure related problems in pharmacoepidemiology: an overview. Pharmacoepidemiology and Drug Safety. 2018;27(2):148-160.",
        "year": 2018,
        "authors_short": "Pazzagli et al.",
        "notes": "Situates carryover/stockpiling within the broader problem of constructing time-varying exposure from dispensings and its impact on bias in safety and effectiveness estimates."
      }
    ],
    "plain_language_summary": "When a patient refills a chronic medication before the current supply runs out, they build up a surplus of pills on hand — that is called stockpiling. A carryover rule decides what to do with that surplus: instead of throwing away the leftover days and restarting the clock at each new fill date, the carryover rule shifts each new fill's coverage forward so the extra days are banked and pushed to the end of the timeline. This matters because without carryover you can manufacture a fake gap in coverage even when the patient always had pills available, and with carryover the true coverage end date is later than the last fill date alone would suggest.",
    "key_terms": [
      {
        "term": "fill_date",
        "definition": "The calendar date when a pharmacy dispensed a prescription — the date printed on the pill bottle label."
      },
      {
        "term": "days_supply",
        "definition": "The number of days one filled prescription is supposed to last, as recorded by the pharmacy (for example, 30 means a one-month supply)."
      },
      {
        "term": "stockpiling",
        "definition": "Refilling a medication before the current supply is gone, so the patient accumulates more pills than they take each day and builds a surplus."
      },
      {
        "term": "carryover rule",
        "definition": "The analyst's choice about whether to bank leftover pill-days from an early refill and push them to the end of coverage, rather than discarding them when the next fill arrives."
      },
      {
        "term": "run-out date",
        "definition": "The first calendar day when a fill's supply is expected to be exhausted — calculated as fill date plus days supply, or shifted later if a carryover rule is applied."
      }
    ],
    "worked_example": {
      "scenario": "Patient 2201 takes a daily statin for cholesterol. We are watching a 90-day window from January 1 through March 30, 2024. She picks up three 30-day fills but refills early each time, building a surplus. We want to find the carryover-adjusted coverage end date and compare it to what the last fill date alone would suggest.",
      "dataset": {
        "caption": "Raw pharmacy claims rows for patient 2201 — one row per fill as an analyst would see them.",
        "columns": [
          "person_id",
          "fill_date",
          "drug",
          "days_supply"
        ],
        "rows": [
          [
            2201,
            "2024-01-01",
            "atorvastatin",
            30
          ],
          [
            2201,
            "2024-01-22",
            "atorvastatin",
            30
          ],
          [
            2201,
            "2024-02-19",
            "atorvastatin",
            30
          ]
        ]
      },
      "steps": [
        "Fill A starts Jan 01 and covers 30 days: Jan 01 through Jan 30. The supply would run out on Jan 31.",
        "Fill B arrives Jan 22 — nine days before Fill A runs out on Jan 31. That early refill creates 9 banked surplus days.",
        "With carryover, Fill B's coverage is shifted to start when Fill A actually runs out (Jan 31), so Fill B covers Jan 31 through Feb 29 and its run-out shifts to Mar 01.",
        "Fill C arrives Feb 19 — eleven days before Fill B's carried run-out of Mar 01. That early refill banks 11 more surplus days (total banked = 9 + 11 = 20 days).",
        "With carryover, Fill C's coverage shifts to start Mar 01, covers Mar 01 through Mar 30, and its run-out becomes Mar 31.",
        "Coverage is continuous from Jan 01 through Mar 30 with no gaps — the three shifted fills stitch together with zero uncovered days.",
        "Without carryover the last fill (Feb 19 + 30 days) would end Mar 19, leaving Mar 20 through Mar 30 uncovered — an 11-day gap that does not reflect reality, since the patient was holding surplus pills the whole time."
      ],
      "result": "Carryover-adjusted coverage end: 2024-03-30 (run-out date 2024-03-31). Over the 90-day window (Jan 01 to Mar 30), the patient had medication on hand every day: covered days = 90, PDC = 90/90 = 1.00. Without carryover only 79 of the 90 days are credited (an 11-day gap Mar 20 to Mar 30), giving PDC = 79/90 = 0.88. The 12-point difference comes entirely from how the analyst handles the banked surplus, not from any difference in the patient's actual fills.",
      "timeline_spec": {
        "title": "Stockpiling with full carryover — three early statin fills for patient 2201",
        "window": {
          "start": "2024-01-01",
          "end": "2024-03-30",
          "label": "Denominator: 90-day observation window (Jan 01 to Mar 30)"
        },
        "events": [
          {
            "label": "Fill A",
            "start": "2024-01-01",
            "length_days": 30,
            "quantity": "30 days_supply"
          },
          {
            "label": "Fill B (9-day early refill)",
            "start": "2024-01-22",
            "length_days": 30,
            "quantity": "30 days_supply"
          },
          {
            "label": "Fill C (11-day early refill)",
            "start": "2024-02-19",
            "length_days": 30,
            "quantity": "30 days_supply"
          }
        ],
        "spans": [
          {
            "kind": "covered",
            "start": "2024-01-01",
            "end": "2024-03-19",
            "label": "79 days covered under both rules"
          },
          {
            "kind": "covered",
            "start": "2024-03-20",
            "end": "2024-03-30",
            "label": "11 extra days credited only with carryover (banked surplus)"
          }
        ],
        "result": {
          "label": "Carryover-adjusted coverage end: 2024-03-30 — PDC 1.00 vs 0.88 without carryover",
          "value": 1.0
        },
        "caption": "Each fill bar starts at the actual fill date, making the early-refill overlaps visible as stockpiling. The carryover rule shifts each fill's effective coverage to start at the prior run-out, so the banked surplus days push the final coverage end from Mar 19 (no carryover) to Mar 30 (with carryover) — an 11-day extension that changes PDC from 0.88 to 1.00.",
        "alt_text": "Timeline with three 30-day statin fill bars for patient 2201 plotted against a 90-day observation window. Fill A starts Jan 01, Fill B starts Jan 22 overlapping the end of Fill A, and Fill C starts Feb 19 overlapping the carryover projection of Fill B. Two stacked coverage spans are shown below the fills: a light green bar from Jan 01 to Mar 19 labeled 79 days covered under both rules, and a darker green bar from Mar 20 to Mar 30 labeled 11 extra days credited only with carryover, illustrating how banked surplus days extend coverage beyond what the last fill date alone implies."
      }
    },
    "prerequisites": [
      "exposure-episode-construction-rwe",
      "grace-period-gap-rules-rwe",
      "pdc-proportion-of-days-covered"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "No carryover (truncate-at-next-fill / as-dispensed)",
        "description": "Each new fill resets the coverage window; unused days from the prior fill are discarded. Coverage = union of [fill_date, fill_date + days_supply) intervals with no shifting. Most conservative for current pharmacologic exposure.",
        "edge_cases": [
          "Early refills manufacture artificial \"gaps\" between the prior run-out and the next fill even when the patient holds ample supply.",
          "PDC/MPR are understated for adherent early refillers because banked days are thrown away."
        ],
        "data_source_notes": "claims: simplest to implement from fill_date + days_supply; still de-duplicate same-day/reversed claims before unioning intervals."
      },
      {
        "name": "Full carryover (shift-forward stitching)",
        "description": "A new fill that arrives before the prior run-out is shifted to begin at max(fill_date, prior_run_out), so surplus days bank and the run-out date drifts forward. Basis of OMOP drug-era construction and carryover-allowed PDC.",
        "edge_cases": [
          "Uncapped, serial early refills bank implausible months of phantom supply, keeping discontinued patients \"exposed.\"",
          "In safety analyses this extends exposed person-time past true cessation and biases incidence toward the null."
        ],
        "data_source_notes": "claims: equivalent to OMOP drug-era with a persistence/gap window; audit constructed eras against raw fills and report the realized maximum cumulative excess."
      },
      {
        "name": "Capped carryover",
        "description": "Full carryover with a ceiling on accumulated excess (e.g., excess <= 30/90 days, or \"no more than one extra dispensing carried\"). Banks realistic early-refill surplus while blocking pathological accumulation.",
        "edge_cases": [
          "The cap is an arbitrary, pre-specifiable knob that must be sensitivity-tested; too tight a cap re-creates artificial gaps.",
          "Capping interacts with the gap-reset threshold — both must be specified together."
        ],
        "data_source_notes": "claims: standard for per-protocol and safety estimands; report the cap value and re-run at alternative caps as sensitivity analyses."
      },
      {
        "name": "Gap-reset (episode closure)",
        "description": "Orthogonal rule that closes the current episode and starts a new one when the observed gap between run-out and the next fill exceeds a permissible threshold (the grace period), preventing carryover across true discontinuations.",
        "edge_cases": [
          "Choice of permissible gap (often a fraction of days_supply or a fixed 30/60/90 days) drives episode count and discontinuation dates.",
          "Gaps falling inside MA-only or LTC/inpatient person-time are missingness, not discontinuation, and must be excluded before applying the rule."
        ],
        "data_source_notes": "claims: combine with continuous-enrollment screens so a gap reflects true non-refill, not unobserved person-time; see grace-period-gap-rules-rwe."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "No carryover (truncate-at-next-fill)",
        "pros_of_this": "Full/capped carryover correctly credits early refillers, avoids spurious gaps, and matches how chronic adherence (PDC/MPR with stockpiling allowed) is intended to be measured.",
        "cons_of_this": "Extends the modeled run-out past true drug-taking; in acute-event safety analyses this inflates exposed person-time and biases incidence rates toward the null.",
        "when_to_prefer": "Carryover for cumulative adherence and chronic-effectiveness estimands; no carryover for current exposure to an acute hazard."
      },
      {
        "compared_to": "Uncapped full carryover",
        "pros_of_this": "Capping prevents serial early refills from banking implausible phantom supply that keeps discontinued patients classified as exposed.",
        "cons_of_this": "Introduces an arbitrary cap parameter that must be pre-specified and sensitivity-tested.",
        "when_to_prefer": "Any safety or per-protocol analysis where overstated exposed person-time biases the estimate."
      },
      {
        "compared_to": "OMOP drug-era with default gap (no explicit cap)",
        "pros_of_this": "An explicit carryover + cap rule makes the stockpiling decision visible and auditable rather than hidden in a single global gap parameter.",
        "cons_of_this": "Less standardized than the packaged OMOP era; requires bespoke programming and documentation.",
        "when_to_prefer": "When the analysis is safety-sensitive or oversupply is plausible; use the OMOP era for federated/network studies but document and, if needed, override its gap and add a cap."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Build from fill_date + days_supply after de-duplicating same-day/reversed claims and resolving overlapping mail-order vs retail fills by NDC/strength. Pre-specify the carryover rule, the cap, and the gap-reset threshold; restrict to observable pharmacy-benefit person-time (exclude MA-only spans) before stitching, and audit constructed run-out dates against raw fills.",
      "ehr": "E-prescribing/medication-list \"active\" windows are orders, not dispensings; carryover arithmetic overstates real acquisition without linked fill confirmation. Treat EHR-only active-medication windows as a separate, weaker exposure definition and prefer linked claims.",
      "registry": "Registries seldom capture complete dispensing; link to claims for fills and reconcile order/fill/service-date discrepancies before assigning run-out dates. Watch linkage selection that correlates with refill behavior.",
      "linked": "Linked claims-EHR data are the ideal substrate, but order/fill/service-date discrepancies and the linkable-subset selection must be reconciled before run-out and gap logic are applied."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\nimport numpy as np\n\ndef build_episodes(rx: pd.DataFrame,\n                   rule: str = \"capped\",      # 'none' | 'full' | 'capped'\n                   cap_days: int = 90,        # max banked excess under 'capped'\n                   reset_gap_days: int = 30   # observed gap that closes an episode\n                   ) -> pd.DataFrame:\n    rx = rx.sort_values([\"person_id\", \"fill_date\"]).copy()\n    out = []\n    for pid, g in rx.groupby(\"person_id\", sort=False):\n        prev_run_out = None          # carried run-out date of the open episode\n        excess = 0                   # banked surplus days (for capping)\n        episode = 0\n        for _, r in g.iterrows():\n            fill = r[\"fill_date\"]\n            sup = int(r[\"days_supply\"])\n            new_ep = False\n            if prev_run_out is None:\n                start = fill\n            else:\n                gap = (fill - prev_run_out).days          # >0 = real gap, <0 = early refill\n                if gap > reset_gap_days:                   # true discontinuation -> new episode\n                    new_ep = True\n                    excess = 0\n                    start = fill\n                elif rule == \"none\" or gap >= 0:           # no overlap: start at fill\n                    start = fill\n                else:                                       # early refill -> carry forward\n                    banked = -gap                          # overlapping (surplus) days\n                    if rule == \"capped\":\n                        banked = min(banked, max(0, cap_days - excess))\n                    excess += banked\n                    start = fill + pd.Timedelta(days=banked) if rule != \"none\" else fill\n            run_out = start + pd.Timedelta(days=sup)\n            if new_ep or prev_run_out is None:\n                episode += 1\n            prev_run_out = run_out\n            out.append({\"person_id\": pid, \"fill_date\": 
fill, \"days_supply\": sup,\n                        \"episode\": episode, \"episode_start\": start,\n                        \"run_out_date\": run_out, \"cumulative_excess\": excess,\n                        \"new_episode\": new_ep})\n    return pd.DataFrame(out)",
        "description": "Stockpiling/carryover episode construction from claims-style pharmacy fills. Required input (cleaned, de-duplicated):\n  rx : one row per fill -> person_id, fill_date (datetime64), days_supply (int)\nFor ONE drug per person. Returns one row per fill with the chosen rule's episode_start / run_out_date / cumulative_excess\n/ new_episode flag, so PDC, drug-era, or time-updated exposure can be built downstream. Choose rule in {'none','full',\n'capped'}; cap_days caps accumulated excess; reset_gap_days closes an episode when an observed gap exceeds the threshold.",
        "dependencies": [
          "pandas",
          "numpy"
        ],
        "source_citations": [
          "gardarsdottir-2010"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\n\nbuild_episodes <- function(rx, rule = \"capped\", cap_days = 90L, reset_gap_days = 30L) {\n  setDT(rx); setorder(rx, person_id, fill_date)\n  rx[, {\n    prev_run_out <- as.Date(NA); excess <- 0L; episode <- 0L\n    res <- vector(\"list\", .N)\n    for (i in seq_len(.N)) {\n      fill <- fill_date[i]; sup <- as.integer(days_supply[i]); new_ep <- FALSE\n      if (is.na(prev_run_out)) {\n        start <- fill\n      } else {\n        gap <- as.integer(fill - prev_run_out)            # >0 real gap, <0 early refill\n        if (gap > reset_gap_days) {                        # true discontinuation -> new episode\n          new_ep <- TRUE; excess <- 0L; start <- fill\n        } else if (rule == \"none\" || gap >= 0) {\n          start <- fill\n        } else {                                            # early refill -> carry forward\n          banked <- -gap\n          if (rule == \"capped\") banked <- min(banked, max(0L, cap_days - excess))\n          excess <- excess + banked\n          start <- fill + banked\n        }\n      }\n      run_out <- start + sup\n      if (new_ep || is.na(prev_run_out)) episode <- episode + 1L\n      prev_run_out <- run_out\n      res[[i]] <- list(fill_date = fill, days_supply = sup, episode = episode,\n                       episode_start = start, run_out_date = run_out,\n                       cumulative_excess = excess, new_episode = new_ep)\n    }\n    rbindlist(res)\n  }, by = person_id]\n}",
        "description": "Stockpiling/carryover episode construction with data.table. Input mirrors the Python version:\n  rx : person_id, fill_date (Date), days_supply (integer); one drug per person.\nrule in {'none','full','capped'}; cap_days caps banked excess; reset_gap_days closes an episode on a real gap. Returns\nper-fill episode_start / run_out_date / cumulative_excess / new_episode for downstream PDC, drug-era, or time-updated use.",
        "dependencies": [
          "data.table"
        ],
        "source_citations": [
          "gardarsdottir-2010"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "%let rule        = capped;   /* none | full | capped */\n%let cap_days    = 90;       /* max banked excess under 'capped' */\n%let reset_gap   = 30;       /* observed gap (days) that closes an episode */\n\nproc sort data=work.rx; by person_id fill_date; run;\n\ndata episodes;\n  set work.rx;\n  by person_id;\n  retain prev_run_out excess episode;\n  format fill_date episode_start run_out_date date9.;\n\n  if first.person_id then do;\n    prev_run_out = .; excess = 0; episode = 0;\n  end;\n\n  new_episode = 0;\n  if missing(prev_run_out) then episode_start = fill_date;\n  else do;\n    gap = fill_date - prev_run_out;                       /* >0 real gap, <0 early refill */\n    if gap > &reset_gap then do;                          /* true discontinuation -> new episode */\n      new_episode = 1; excess = 0; episode_start = fill_date;\n    end;\n    else if \"&rule\" = \"none\" or gap >= 0 then episode_start = fill_date;\n    else do;                                              /* early refill -> carry forward */\n      banked = -gap;\n      if \"&rule\" = \"capped\" then banked = min(banked, max(0, &cap_days - excess));\n      excess = excess + banked;\n      episode_start = fill_date + banked;\n    end;\n  end;\n\n  run_out_date = episode_start + days_supply;\n  if new_episode = 1 or missing(prev_run_out) then episode + 1;  /* sum statement increments */\n  prev_run_out = run_out_date;\n  cumulative_excess = excess;\n  keep person_id fill_date days_supply episode episode_start run_out_date cumulative_excess new_episode;\nrun;",
        "description": "Stockpiling/carryover episode construction in a BY-group DATA step with RETAIN (the natural SAS idiom for sequential\nrefill logic). Required input (cleaned, de-duplicated), one drug per person:\n  work.rx : person_id, fill_date (SAS date), days_supply\nMacro vars set the rule and thresholds. Produces per-fill EPISODE_START / RUN_OUT_DATE / CUMULATIVE_EXCESS / NEW_EPISODE\nfor downstream PDC (PROC SQL coverage counts) or time-updated exposure.",
        "dependencies": [],
        "source_citations": [
          "gardarsdottir-2010"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": "stockpiling-carryover-rules-rwe-timeline.svg",
        "mermaid": null,
        "caption": "Each fill bar starts at the actual fill date, making the early-refill overlaps visible as stockpiling. The carryover rule shifts each fill's effective coverage to start at the prior run-out, so the banked surplus days push the final coverage end from Mar 19 (no carryover) to Mar 30 (with carryover) — an 11-day extension that changes PDC from 0.88 to 1.00.",
        "alt_text": "Timeline with three 30-day statin fill bars for patient 2201 plotted against a 90-day observation window. Fill A starts Jan 01, Fill B starts Jan 22 overlapping the end of Fill A, and Fill C starts Feb 19 overlapping the carryover projection of Fill B. Two stacked coverage spans are shown below the fills: a light green bar from Jan 01 to Mar 19 labeled 79 days covered under both rules, and a darker green bar from Mar 20 to Mar 30 labeled 11 extra days credited only with carryover, illustrating how banked surplus days extend coverage beyond what the last fill date alone implies.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "gantt\n  title Same three 30-day statin fills under different carryover rules (person_id 7)\n  dateFormat YYYY-MM-DD\n  axisFormat %b %d\n  section No carryover\n  Fill 1 supply              :a1, 2024-01-01, 19d\n  Fill 2 supply (resets)     :a2, 2024-01-20, 30d\n  Gap (Feb 19-Feb 25)        :crit, ag, 2024-02-19, 6d\n  Fill 3 supply              :a3, 2024-02-25, 30d\n  section Full / capped carryover\n  Fill 1 supply              :b1, 2024-01-01, 30d\n  Fill 2 supply (shifted)    :b2, 2024-01-31, 30d\n  Fill 3 supply (shifted)    :b3, 2024-03-02, 30d",
        "caption": "One set of dispensings, two timelines. Under no carryover, the early second fill resets the clock and a spurious 6-day gap appears at run-out; under full/capped carryover the surplus banks, fills shift forward, and coverage is continuous with run-out drifting to Apr 1.",
        "alt_text": "Gantt chart comparing exposure coverage for the same three statin fills under no-carryover versus full/capped carryover, showing a manufactured gap in the no-carryover timeline and continuous shifted coverage in the carryover one.",
        "source_type": "illustrative",
        "source_citations": [
          "gardarsdottir-2010"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Q[Dispensing records -> need exposure timeline] --> EST{What is the estimand?}\n  EST -->|Cumulative adherence / chronic effectiveness<br/>PDC, MPR, ever-treated| CARRY[Allow carryover]\n  EST -->|Current exposure to an ACUTE hazard<br/>safety incidence rate| NOCARRY[No carryover / short cap]\n  EST -->|Discontinuation / persistence / deprescribing| RESET[No carryover<br/>gap-reset drives stop date]\n  CARRY --> CAP{Oversupply / serial early refills plausible?}\n  CAP -->|Yes| CAPPED[Capped carryover<br/>excess <= 30-90 days]\n  CAP -->|No| FULL[Full carryover<br/>e.g. OMOP drug-era]\n  NOCARRY --> GAP[Set gap-reset threshold<br/>exclude MA-only / LTC person-time]\n  CAPPED --> GAP\n  FULL --> GAP\n  RESET --> GAP\n  GAP --> SENS[Sensitivity: vary cap, gap, and carryover on/off]",
        "caption": "Decision logic for choosing a carryover rule. The estimand comes first; the cap and gap-reset thresholds are then set jointly and varied in sensitivity analyses.",
        "alt_text": "Flowchart routing from the estimand (cumulative adherence, acute safety, or persistence) to the appropriate carryover rule, then to capping and gap-reset choices and sensitivity analyses.",
        "source_type": "illustrative",
        "source_citations": [
          "gardarsdottir-2010"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "is_variant_of",
        "target_slug": "exposure-episode-construction-rwe",
        "notes": "The carryover rule is the specific decision within episode construction that governs how overlapping fills are stitched into a continuous exposure timeline."
      },
      {
        "relation_type": "used_with",
        "target_slug": "grace-period-gap-rules-rwe",
        "notes": "Carryover banks surplus before a gap; the grace-period/gap rule closes the episode after a permissible gap. The two must be specified together to avoid fabricating exposed person-time."
      },
      {
        "relation_type": "affects",
        "target_slug": "pdc-proportion-of-days-covered",
        "notes": "Allowing carryover changes the covered-days numerator; PDC caps coverage at 1.0 so it is robust to oversupply, whereas the carryover rule still determines whether early-refill days fill later gaps."
      },
      {
        "relation_type": "affects",
        "target_slug": "mpr-medication-possession-ratio",
        "notes": "MPR can exceed 1.0 under stockpiling because it sums acquired days_supply without capping coverage; the carryover rule and any cap directly drive whether MPR is inflated."
      },
      {
        "relation_type": "alternative_to",
        "target_slug": "omop-drug-exposure-drug-era-rwe",
        "notes": "The OMOP drug-era is a packaged carryover-with-gap implementation; an explicit capped-carryover rule makes the stockpiling decision visible and auditable rather than hidden in a single gap parameter."
      },
      {
        "relation_type": "used_with",
        "target_slug": "inpatient-bridging-exposure-rwe",
        "notes": "Institutional/inpatient stays create apparent gaps mid-episode; whether to bridge them or treat them as gaps interacts directly with the carryover decision."
      },
      {
        "relation_type": "see_also",
        "target_slug": "restart-rechallenge-new-episode-rwe",
        "notes": "When the gap-reset rule closes an episode, a later fill starts a new episode (restart/rechallenge); the gap threshold shared with the carryover logic defines the boundary."
      },
      {
        "relation_type": "affects",
        "target_slug": "time-updated-exposures-cumulative-dose-rwe",
        "notes": "Carryover determines the run-out dates that define on-/off-treatment windows for time-updated exposure and cumulative-dose variables in survival models."
      },
      {
        "relation_type": "see_also",
        "target_slug": "persistence-time-to-discontinuation",
        "notes": "Carryover masks gaps and inflates apparent persistence; persistence analyses should minimize carryover and let the gap-reset rule define the discontinuation date."
      }
    ],
    "aliases": [
      "stockpiling rules",
      "carryover rules",
      "carry-over rules",
      "early refill carryover",
      "oversupply capping",
      "supply stitching"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta",
      "pqa-cms"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "study-protocol-or-sap-elements",
    "name": "Study Protocol and SAP Elements for RWE",
    "short_definition": "The structured set of pre-specified design and analysis elements — population/data source, eligibility, time zero, exposure, comparator, outcome, follow-up/censoring, covariates, estimand, primary and sensitivity analyses — that must be fixed in a real-world-evidence protocol and statistical analysis plan before any outcome-dependent programming begins.",
    "long_description": "A real-world-evidence (RWE) **study protocol** and its companion **statistical analysis plan (SAP)** are the\npre-specification layer that turns a research question into a reproducible, auditable analysis. They are not\npaperwork: each element — the data source and its observable person-time, eligibility, **time zero**, exposure and\ncomparator definitions, outcome algorithm, follow-up and censoring rules, covariates and their measurement window,\nthe **estimand**, the primary analysis, and the pre-specified sensitivity analyses — is a design decision that\nchanges who is in the cohort, how long they are followed, how the outcome is counted, and what the effect estimate\nmeans. Structured templates (STaRT-RWE, HARPER) exist precisely because the failure mode of RWE is not a single\nwrong number but an undocumented chain of analyst choices that no reviewer can reconstruct. The discipline is to\nwrite every element in protocol language, lock it (ideally with a timestamp before outcome data are touched, or a\npre-registration), and only then program.\n\n**Core conceptual distinction** A protocol/SAP is *specification*, not *estimation*. Three boundaries do the work\nand are routinely confused. (1) *Protocol vs SAP*: the protocol fixes the design (PICOT-style question, data source,\neligibility, time zero, exposure/comparator, outcome, follow-up); the SAP fixes the analytic detail (estimator, model\nform, covariate list and functional form, missing-data and censoring handling, multiplicity, the exact sensitivity\nanalyses). (2) *Design-stage vs analysis-stage decisions*: time zero, washout, and the eligibility/exposure windows\nare design decisions that must be set without reference to outcomes; choosing them after looking at results is the\narchetype of specification-search bias. 
(3) *Pre-specified vs data-driven*: a sensitivity analysis named in the SAP\nis evidence; the same analysis run because the primary result looked wrong is a post-hoc rationalization. The\nunifying object is the **estimand** — the population, treatment contrast, endpoint, intercurrent-event strategy, and\nsummary measure — which the protocol must state explicitly (ICH E9(R1) language) so that the SAP's estimator and the\nreviewer's interpretation refer to the same target quantity.\n\n**Pros, cons, and trade-offs**\n- **vs an ad hoc analysis script with no protocol:** A locked protocol/SAP makes the study reproducible and\n  defensible, exposes design decisions to review *before* they are contaminated by results, and is now an explicit\n  expectation of FDA, EMA, and HTA bodies. Cost: real up-front effort and reduced freedom to chase interesting\n  post-hoc findings. **Prefer the protocol** for any decision-grade comparative, safety, utilization, or economic\n  analysis — i.e., almost always.\n- **vs a clinical-trial protocol/SAP template reused unchanged:** Trial templates assume randomization, a clean\n  enrollment visit, and protocol-defined assessments. RWE templates (STaRT-RWE, HARPER) add what trials get for free\n  and RWE must engineer: an observable-person-time definition for the data source, an explicit time-zero rule, code\n  lists with provenance and validation metrics, and confounding control. **Prefer an RWE-specific template**;\n  bolting RWE onto an ICH E6 trial protocol silently omits the elements that drive RWE bias.\n- **vs lightweight pre-registration (e.g., a one-paragraph hypothesis registry):** Pre-registration deters outcome\n  switching but does not specify the operational rules (washout, days_supply stitching, censoring) that determine the\n  estimate. **Use both**: register to fix the question and timestamp, and a full STaRT-RWE/HARPER protocol+SAP to fix\n  the operations. 
Registration without operational pre-specification gives a false sense of rigor.\n- **vs maximal pre-specification of every contingency:** Over-specifying can lock in a wrong model when the data\n  reveal an unanticipated structure (e.g., non-proportional hazards). The trade-off is resolved by pre-specifying the\n  *primary* analysis rigidly and pre-specifying *named, triggered* sensitivity/secondary analyses with their decision\n  rules, rather than leaving the analyst free hands.\n\n**When to use** Any hypothesis-evaluating RWE study intended to inform a regulatory submission, label expansion,\nsafety signal evaluation, HTA dossier, payer coverage decision, or peer-reviewed comparative-effectiveness claim.\nUse a structured template (STaRT-RWE for the analysis-implementation table; HARPER for the full protocol narrative)\nwhenever a reviewer, regulator, or replication team will need to reconstruct the study, and whenever the analysis\ninvolves a non-trivial time-zero, washout, exposure-episode, or censoring rule — i.e., whenever the design itself can\nintroduce immortal time, selection, or confounding.\n\n**When NOT to use — and when it is actively misleading or dangerous**\n- **Pure hypothesis-generating description.** For exploratory feasibility counts or initial data profiling, a heavy\n  locked SAP adds friction without protecting a decision; a lightweight analysis plan suffices. Forcing a full\n  confirmatory protocol on exploration can dress data-dredging up as confirmatory evidence — the opposite of the\n  intent.\n- **A protocol written *after* the analysis, to match the result.** A post-hoc protocol that reverse-engineers the\n  chosen design is worse than none: it manufactures the appearance of pre-specification and is the single most\n  dangerous failure mode in RWE submissions. 
If outcomes have been examined, say so and treat the work as\n  hypothesis-generating.\n- **Specification so vague that two programmers would implement it differently.** \"Continuous enrollment around index\" or\n  \"relevant comorbidities adjusted for\" is non-pre-specification masquerading as a protocol; it leaves every\n  consequential choice to undocumented analyst discretion and is not auditable.\n- **Estimand left implicit.** If the protocol does not state whether the target is an ITT-like initiation contrast or\n  an as-treated/per-protocol contrast (and how intercurrent events such as switching, discontinuation, and death are\n  handled), the SAP's estimator and the conclusion can disagree without anyone noticing; the \"result\" is then\n  uninterpretable rather than merely imprecise.\n\n**Data-source operational depth** The protocol's *observable-person-time* definition is the element most often\nunderspecified, and it differs sharply by source.\n- **Claims (FFS vs Medicare Advantage):** Person-time is observable only while the enrollee is covered by the relevant\n  benefit. Medicare Advantage (MA) encounter data have historically been less complete than fee-for-service (FFS)\n  claims, so an MA-only interval can read as \"no events / no fills\" when it is really unobserved. The protocol must\n  restrict to FFS Parts A/B/D (or commercial medical+pharmacy) across washout and follow-up and explicitly exclude\n  MA-only person-time, or it will manufacture immortal-like gaps and undercount events. Specify continuous-enrollment and\n  allowable-gap rules, `days_supply`-based exposure-episode construction (with mail-order 90-day and sample-fill\n  caveats), claims-adjudication lag and run-out windows, and reversal handling, each in the SAP, not left to the\n  programmer.\n- **EHR:** Capture is encounter-driven, so person-time is observable only when the patient is active in the network;\n  a patient who seeks care elsewhere is differentially lost. 
The protocol must define observation windows from\n  encounter density (not enrollment), pre-specify how external-care leakage and missing structured fields are\n  handled, and — for elderly or sicker cohorts — pre-specify a **competing-risks** framing, because death and other\n  terminal events occur differentially by exposure and a naive censoring rule biases cumulative incidence.\n- **Registry:** Strong for adjudicated outcomes and disease severity, weak for complete exposure and for vital\n  status; the protocol must pre-specify linkage to claims for fills and to a death index for censoring, and state the\n  registry completeness/adjudication rules and the eligible-for-linkage denominator.\n- **Linked claims–EHR–vital records:** The richest substrate, but linkage selects the linkable subset and creates\n  order/fill/service date discrepancies that must be reconciled *before* time zero is assigned; the protocol must\n  pre-specify the date-reconciliation rule and report the linkage-eligible vs analyzed denominators.\n\nTwo cross-cutting failure modes belong in every RWE protocol because they are *created by under-specification*:\n**immortal time** (e.g., in procedure or \"responder\" studies where follow-up starts before the exposure-defining\nevent — the protocol must set time zero at the exposure decision, not at diagnosis or at procedure completion), and\n**differential competing risks** by exposure in elderly claims cohorts (handle with cause-specific or subdistribution\nmodels, pre-specified, not chosen after seeing the curves).\n\n**Worked claims example (protocol elements, enumerated).** Question: among Medicare FFS beneficiaries with\nnon-valvular atrial fibrillation, does initiation of a direct oral anticoagulant (DOAC) vs warfarin change the rate\nof hospitalized major bleeding? 
(1) **Data source / observable person-time:** Medicare Parts A/B/D FFS;\nrequire continuous A/B/D and exclude any Medicare Advantage interval (MA encounter data are less complete than FFS\nclaims, so MA person-time is not reliably observed). (2) **Eligibility:** age ≥65, ≥1 inpatient or ≥2 outpatient AF diagnoses\n(`dx` in baseline), and 365 days of continuous FFS enrollment before the first qualifying fill. (3) **Time zero:**\ndate of the first DOAC or warfarin pharmacy claim (`fill_date`); assign the arm from the NDC dispensed that day and\nstart follow-up at the fill, not at AF diagnosis, to avoid immortal time. (4) **Washout / new-user:** no DOAC or\nwarfarin fill in the 365-day lookback, so both arms are incident users. (5) **Outcome:** first hospitalized major\nbleeding via a validated claims algorithm (specify ICD position and setting; cite PPV/sensitivity). (6) **Follow-up\nand censoring:** from time zero to outcome, disenrollment, end of data, or, for the as-treated analysis, the earlier of\na switch and the last `days_supply` end plus a pre-specified grace period; **death is a competing risk**, so the SAP\npre-specifies a Fine–Gray subdistribution model alongside a cause-specific sensitivity analysis rather than censoring deaths. (7)\n**Covariates:** measured only in the 365-day baseline window (CHA₂DS₂-VASc and HAS-BLED components, prior bleeds,\nrenal disease, concomitant antiplatelets, utilization), feeding a high-dimensional propensity score. (8)\n**Estimand:** the as-treated subdistribution hazard ratio of major bleeding for DOAC vs warfarin initiation,\nwith switching/discontinuation handled by the censoring rule in element (6), combined with\ninverse-probability-of-censoring weights (IPCW) to address informative censoring (ICH E9(R1) intercurrent-event\nframing). (9) **Primary analysis:** 1:1 PS-matched Fine–Gray model; balance confirmed by standardized differences\n<0.1. 
(10) **Pre-specified sensitivity analyses:** washout length (180 vs 365 d), grace period, cause-specific vs\nsubdistribution hazard, a negative-control outcome to detect residual confounding, and restriction to overlapping\ncalendar time. Locking all ten elements before pulling outcome data is what makes the eventual estimate\ninterpretable and defensible.",
    "primary_category": "Framework_Standard",
    "tags": [
      "study-protocol",
      "statistical-analysis-plan",
      "pre-specification",
      "start-rwe",
      "harper",
      "estimand",
      "reproducibility",
      "regulatory-readiness"
    ],
    "applies_to_study_types": [
      "claims_analysis",
      "ehr_study",
      "registry_linkage",
      "target_trial_emulation",
      "comparative_effectiveness"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1136/bmj.m4856",
        "url": "https://doi.org/10.1136/bmj.m4856",
        "citation_text": "Wang SV, Pinheiro S, Hua W, et al. STaRT-RWE: structured template for planning and reporting on the implementation of real world evidence studies. BMJ. 2021;372:m4856.",
        "year": 2021,
        "authors_short": "Wang et al.",
        "notes": "Canonical structured template enumerating the design and analysis elements (time zero, exposure, outcome, covariates, estimand, sensitivity) that an RWE protocol/SAP must pre-specify; designed for transparent, replicable implementation."
      },
      {
        "role": "introduce",
        "doi": "10.1002/pds.5507",
        "url": "https://doi.org/10.1002/pds.5507",
        "citation_text": "Wang SV, Pottegård A, Crown W, et al. HARmonized Protocol Template to Enhance Reproducibility of hypothesis evaluating real-world evidence studies on treatment effects: A good practices report of a joint ISPE/ISPOR task force. Pharmacoepidemiology and Drug Safety. 2023;32(1):44-55.",
        "year": 2023,
        "authors_short": "Wang et al.",
        "notes": "Companion full-protocol template (HARPER) for hypothesis-evaluating treatment-effect RWE; pairs with STaRT-RWE's implementation table to give the narrative protocol structure."
      },
      {
        "role": "explain",
        "doi": "10.1016/j.jval.2017.08.3019",
        "url": "https://doi.org/10.1016/j.jval.2017.08.3019",
        "citation_text": "Berger ML, Sox H, Willke RJ, et al. Good Practices for Real-World Data Studies of Treatment and/or Comparative Effectiveness: Recommendations from the Joint ISPOR-ISPE Special Task Force on Real-World Evidence in Health Care Decision Making. Value in Health. 2017;20(8):1003-1008.",
        "year": 2017,
        "authors_short": "Berger et al.",
        "notes": "Foundational good-practice rationale for why a priori protocol registration and pre-specification of RWE design and analysis elements are required for decision-grade evidence."
      },
      {
        "role": "demonstrate",
        "doi": "10.1093/aje/kwv254",
        "url": "https://doi.org/10.1093/aje/kwv254",
        "citation_text": "Hernán MA, Robins JM. Using Big Data to Emulate a Target Trial When a Randomized Trial Is Not Available. American Journal of Epidemiology. 2016;183(8):758-764.",
        "year": 2016,
        "authors_short": "Hernán and Robins",
        "notes": "Shows how protocol elements map onto the target-trial protocol (eligibility, treatment strategies, assignment, time zero, outcome, estimand, analysis), making the specification framework concrete and bias-avoiding."
      },
      {
        "role": "use",
        "doi": "10.1002/cpt.1480",
        "url": "https://doi.org/10.1002/cpt.1480",
        "citation_text": "Gatto NM, Reynolds RF, Campbell UB. A Structured Preapproval and Postapproval Comparative Study Design Framework to Generate Valid and Transparent Real-World Evidence for Regulatory Decisions. Clinical Pharmacology & Therapeutics. 2019;106(1):103-115.",
        "year": 2019,
        "authors_short": "Gatto et al.",
        "notes": "Applied regulatory framework operationalizing protocol/SAP elements (population, exposure, outcome, time zero, confounding control, sensitivity) for valid and transparent RWE in pre- and post-approval decisions."
      }
    ],
    "plain_language_summary": "A study protocol and statistical analysis plan (SAP) are documents you write before looking at results that spell out every decision about your study: who is in it, what counts as the treatment, what counts as the outcome, how long you follow people, and exactly how you will analyze the data. Pre-specifying these choices prevents a researcher from, knowingly or not, tweaking definitions until the answer looks good. Think of it as writing your hypothesis on a whiteboard and taking a photo before the test, so no one can accuse you of moving the target afterward.",
    "key_terms": [
      {
        "term": "SAP",
        "definition": "Statistical Analysis Plan: a locked document that states, before any outcome data are examined, exactly which statistical method, covariates, and sensitivity checks will be used to answer the study question."
      },
      {
        "term": "pre-specification",
        "definition": "Writing down every key study decision (who is included, what the exposure and outcome are, how the analysis runs) before you look at results, so those choices cannot be quietly changed after seeing the data."
      },
      {
        "term": "time zero",
        "definition": "The specific calendar date that marks day one of follow-up for each person in the study, usually the date of first treatment; setting it correctly prevents counting time before a person was actually at risk."
      },
      {
        "term": "estimand",
        "definition": "The exact question a study is designed to answer, stated precisely enough that two different analysts reading the protocol would set up the same comparison and measure the same outcome in the same population."
      },
      {
        "term": "sensitivity analysis",
        "definition": "A repeat of the main analysis under a deliberately altered assumption (for example, a different follow-up length) to test whether the main result is robust; it must be named in the SAP before results are known to count as pre-specified evidence."
      }
    ],
    "worked_example": {
      "scenario": "A researcher wants to know whether patients newly started on Drug A have fewer hospitalizations over the following year than patients newly started on Drug B. Before touching outcome data, the team locks a protocol and SAP. The table below shows the eight core protocol and SAP elements they must pre-specify, what each element says for this concrete question, and why locking that element ahead of time prevents a specific form of fishing.",
      "dataset": {
        "caption": "Core protocol and SAP elements for a comparative-effectiveness study of Drug A vs Drug B and hospitalization, showing each element, its pre-specified value, and the bias it prevents.",
        "columns": [
          "Element",
          "Pre-specified value for this study",
          "Bias prevented by locking this element"
        ],
        "rows": [
          [
            "Objective",
            "Does new use of Drug A vs Drug B reduce 1-year all-cause hospitalization in adults with Condition X?",
            "Prevents switching from a confirmatory to an exploratory framing after results disappoint"
          ],
          [
            "Study design",
            "New-user active-comparator cohort in commercial claims, 2018-2022",
            "Prevents choosing claims vs EHR after seeing which database favors the hypothesis"
          ],
          [
            "Population",
            "Adults 18-64, first Drug A or Drug B claim, 180-day drug-free lookback, 180-day continuous enrollment before index",
            "Prevents loosening eligibility criteria post-hoc to include patients who respond better"
          ],
          [
            "Exposure",
            "Drug A arm: NDC list v2024-01; Drug B arm: NDC list v2024-01; index date = first qualifying fill date",
            "Prevents adding or removing drug codes after seeing which codes push the result in the desired direction"
          ],
          [
            "Outcome",
            "First all-cause inpatient admission (any ICD-10 diagnosis in the primary position) during follow-up; validated algorithm PPV 0.91",
            "Prevents switching to a narrower or broader outcome definition after unblinding"
          ],
          [
            "Covariates",
            "Age, sex, baseline comorbidities and utilization measured in the 180-day window ending on index date",
            "Prevents adding covariates that are actually on the causal pathway (post-treatment variables) once the model looks unfavorable"
          ],
          [
            "Primary analysis",
            "1:1 propensity-score matching, Cox proportional-hazards model, follow-up truncated at 365 days or disenrollment",
            "Prevents running 12 different analytic approaches and reporting only the one with p < 0.05"
          ],
          [
            "Sensitivity analyses",
            "(1) 90-day vs 180-day washout; (2) 30-day vs 90-day grace period for exposure episodes; (3) negative-control outcome (appendectomy for acute appendicitis) to probe residual confounding",
            "Prevents presenting a post-hoc robustness check as though it were planned, inflating the apparent credibility of the main result"
          ]
        ]
      },
      "steps": [
        "Draft the objective first so the whole team agrees on exactly one question; any ambiguity here will cascade into ambiguous exposure and outcome definitions later.",
        "Choose the data source and study period without looking at outcome rates; changing the database after seeing results is a results-driven design choice that biases the estimate toward the preferred answer.",
        "Write the eligibility rules, including the lookback washout length, before running any cohort counts; the washout length directly affects who is a new user and therefore what the effect estimate means.",
        "Lock the exposure NDC code lists with a version date so a reviewer can reproduce exactly who was assigned to each arm.",
        "Specify the outcome algorithm with its validation metric (PPV) before unblinding; a high-PPV algorithm and a low-PPV algorithm will give different event counts, and choosing after seeing results is outcome switching.",
        "List every covariate and state that measurement is restricted to the pre-index baseline window; any covariate measured after index date could itself be affected by treatment and would distort the adjustment.",
        "Name the primary analysis method in full detail; even if the SAP lists three plausible models, designating exactly one as the primary in advance means you cannot later cherry-pick the one with the smallest p-value.",
        "Write the sensitivity analyses with explicit trigger conditions before unblinding; a sensitivity analysis that appears only because the main result looked wrong is a post-hoc rationalization, not confirmatory evidence."
      ],
      "result": "When all eight elements are locked before the first outcome-dependent query, the final hazard ratio is interpretable and auditable: a reviewer can reconstruct every analytic choice, confirm they were made without knowledge of the result, and trust that the estimate answers the stated question rather than the question that happened to yield a favorable number."
    },
    "prerequisites": [
      "picots-framework-rwe",
      "continuous-enrollment-observable-time-rwe",
      "time-zero-index-date-alignment-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "STaRT-RWE structured implementation table",
        "description": "A tabular pre-specification of the operational analysis — data source, study period, design (cohort diagram with time zero), exposure and comparator code lists, outcome algorithm with validation metrics, covariates and measurement windows, and the analysis specification. Optimized for unambiguous, two-analysts-get-the-same-cohort implementation and for reviewer reconstruction.",
        "edge_cases": [
          "Time-zero and washout cells must be filled with reference only to design (never outcomes); leaving them as free text invites later specification search.",
          "Code lists need provenance and, for outcomes, PPV/sensitivity from a validation study or the protocol's own chart review."
        ],
        "data_source_notes": "claims: specify continuous-enrollment, allowable-gap, days_supply episode, and MA-exclusion rules in the design cells; EHR: define observable person-time from encounter density rather than enrollment."
      },
      {
        "name": "HARPER full-protocol narrative",
        "description": "The narrative protocol (objectives, background, causal estimand, population, exposure/comparator, outcome, follow-up, confounding strategy, statistical methods, sensitivity and bias analyses, data management, governance) that wraps the STaRT-RWE table and satisfies regulator/HTA expectations for a complete, harmonized RWE protocol.",
        "edge_cases": [
          "The estimand section must state the intercurrent-event strategy (switching, discontinuation, death) explicitly or the downstream SAP estimator becomes uninterpretable.",
          "Bias-analysis section should pre-specify negative controls and quantitative bias analysis, not defer them to reviewer request."
        ],
        "data_source_notes": "registry/linked: pre-specify linkage eligibility, date reconciliation, and the linkage-eligible vs analyzed denominators in the data-management section."
      },
      {
        "name": "Pre-registration + locked SAP",
        "description": "A timestamped public registration of the question and primary outcome combined with a separately locked SAP fixing the estimator, model form, covariate list, and named sensitivity analyses before outcome data are accessed.",
        "edge_cases": [
          "Registration alone does not constrain operational rules (washout, censoring, days_supply stitching); the SAP must.",
          "Any deviation from the locked SAP must be logged with date and rationale to preserve the confirmatory claim."
        ],
        "data_source_notes": "Lock the SAP before the first outcome-dependent query; record the data-cut/extract date used for locking."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Ad hoc analysis script with no written protocol or SAP",
        "pros_of_this": "Reproducible, auditable, and defensible; design decisions are exposed to review before results contaminate them; meets FDA/EMA/HTA expectations.",
        "cons_of_this": "Up-front effort and reduced freedom to pursue post-hoc findings.",
        "when_to_prefer": "Any decision-grade comparative-effectiveness, safety, utilization, or economic RWE study."
      },
      {
        "compared_to": "Reusing an ICH E6 clinical-trial protocol/SAP template unchanged",
        "pros_of_this": "Adds the RWE-specific elements trials get for free — observable person-time, explicit time zero, validated code lists, and confounding control.",
        "cons_of_this": "Requires an RWE-literate template (STaRT-RWE/HARPER) and team rather than off-the-shelf trial documents.",
        "when_to_prefer": "All observational RWE; trial templates silently omit the elements that drive RWE bias."
      },
      {
        "compared_to": "Lightweight one-paragraph pre-registration only",
        "pros_of_this": "Fixes the operational rules (washout, censoring, days_supply, estimator) that actually determine the estimate, not just the question.",
        "cons_of_this": "More work than a registry entry; best used in addition to, not instead of, registration.",
        "when_to_prefer": "Whenever non-trivial time-zero/exposure/censoring logic can introduce immortal time, selection, or confounding."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Pre-specify observable person-time as FFS Parts A/B/D (or commercial medical+pharmacy) and exclude Medicare Advantage-only intervals where FFS claims are absent; define continuous-enrollment and allowable-gap rules, time zero at the first qualifying fill, days_supply-based exposure episodes (mail-order/sample caveats), claims-lag/run-out windows, reversal handling, and a competing-risks framing for death in elderly cohorts.",
      "ehr": "Define observable person-time from encounter density (not enrollment); pre-specify handling of external-care leakage, missing structured fields, and informative loss to follow-up; confirm exposure via linked dispensing where available and use problem lists/labs/notes to fix indication and baseline severity.",
      "registry": "Pre-specify registry completeness and adjudication rules, eligibility-for-linkage denominators, and linkage to claims (for full exposure) and to a death index (for censoring and mortality outcomes).",
      "linked": "Pre-specify the order/fill/service date-reconciliation rule before time-zero assignment and report linkage-eligible vs analyzed denominators; account for selection into the linkable subset."
    },
    "implementations": [
      {
        "lang": "other",
        "code": "# protocol_elements.yaml  --  lock BEFORE touching outcome data\nquestion:\n  pico_t: \"Among Medicare FFS adults with non-valvular AF, does DOAC vs warfarin initiation\n           change the rate of hospitalized major bleeding?\"\ndata_source:\n  name: \"Medicare FFS Parts A/B/D\"\n  observable_person_time: \"continuous A/B/D enrollment; EXCLUDE Medicare Advantage intervals (FFS claims absent)\"\n  study_period: \"2016-01-01 .. 2022-12-31\"\neligibility:\n  age_min: 65\n  indication: \">=1 inpatient OR >=2 outpatient AF dx (ICD-10 I48.x) in 365d baseline\"\n  enrollment: \"365d continuous FFS A/B/D before first qualifying fill\"\ntime_zero:\n  rule: \"date of first DOAC or warfarin pharmacy claim (fill_date)\"   # NOT AF diagnosis -> avoids immortal time\n  arm_assignment: \"drug class of the NDC dispensed on fill_date\"\nwashout_new_user:\n  lookback_days: 365\n  rule: \"no DOAC or warfarin fill in lookback (both arms incident users)\"\nexposure:\n  study: \"DOAC NDC list (v2024-03, provenance: RED BOOK)\"\n  comparator: \"warfarin NDC list (v2024-03)\"\n  episode: \"days_supply stitching; grace_period_days: 14; switch = fill of other class\"\noutcome:\n  definition: \"first hospitalized major bleeding (validated algorithm; PPV 0.89 per Cunningham 2011)\"\n  coding: \"primary/secondary ICD-10 in inpatient setting\"\nfollow_up_censoring:\n  start: \"time_zero\"\n  end_at: [\"outcome\", \"disenrollment\", \"end_of_data\", \"as_treated: last days_supply + grace OR switch\"]\n  death: \"COMPETING RISK -> Fine-Gray subdistribution (primary); cause-specific (sensitivity)\"  # do NOT censor deaths\ncovariates:\n  window: \"[time_zero - 365d, time_zero]\"   # baseline only; never post-time-zero\n  list: [\"CHA2DS2-VASc components\", \"HAS-BLED components\", \"prior_bleed\", \"renal_disease\",\n         \"concomitant_antiplatelet\", \"baseline_utilization\"]\n  model: \"high-dimensional propensity score\"\nestimand:\n  target: \"as-treated comparative 
subdistribution hazard, DOAC vs warfarin initiation\"\n  intercurrent_events: \"switching/discontinuation via censoring + IPCW; death as competing risk (ICH E9(R1))\"\nprimary_analysis:\n  method: \"1:1 PS matching -> Fine-Gray model\"\n  balance_check: \"standardized differences < 0.1 post-match\"\nsensitivity_analyses:   # pre-specified, with triggers -- not chosen after seeing results\n  - \"washout 180d vs 365d\"\n  - \"grace period 7/14/30d\"\n  - \"cause-specific vs subdistribution hazard\"\n  - \"negative-control outcome (residual confounding probe)\"\n  - \"restrict to overlapping calendar time of both drugs\"\ngovernance:\n  locked_on: \"YYYY-MM-DD\"\n  data_extract_date: \"YYYY-MM-DD\"\n  registration: \"ENCePP/ClinicalTrials.gov ID\"",
        "description": "Worked STaRT-RWE-style protocol-elements specification (not estimation code) for the DOAC-vs-warfarin major-bleeding\nclaims example in the long description. This is the locked, machine-readable design table that fixes every\nconsequential decision before any outcome-dependent query is run; an analyst should be able to build the identical\ncohort from it. Store it under version control, timestamp it against the data extract date, and log any deviation\nwith date and rationale.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Q[Research question<br/>PICO-T + causal estimand] --> Proto[PROTOCOL: design layer]\n  Proto --> DS[Data source +<br/>observable person-time]\n  Proto --> Elig[Eligibility +<br/>washout]\n  Proto --> T0[Time zero rule<br/>set without reference to outcomes]\n  Proto --> ExpComp[Exposure +<br/>active comparator]\n  Proto --> Out[Outcome algorithm<br/>+ validation PPV/Sens]\n  Proto --> Fup[Follow-up +<br/>censoring + competing risks]\n  Proto --> SAP[SAP: analysis layer]\n  SAP --> Cov[Covariates +<br/>baseline measurement window]\n  SAP --> Est[Estimand +<br/>intercurrent-event strategy]\n  SAP --> Prim[Primary analysis<br/>estimator + model form]\n  SAP --> Sens[Pre-specified<br/>sensitivity analyses]\n  Sens --> Lock[LOCK before<br/>outcome-dependent query]\n  Prim --> Lock\n  Lock --> Run[Program + analyze]\n  Run --> Dev{Deviation?}\n  Dev -- yes --> Log[Log date + rationale<br/>downgrade to exploratory if outcome-informed]\n  Dev -- no --> Report[Reproducible, auditable result]",
        "caption": "Protocol (design) and SAP (analysis) layers and the lock point. Design decisions (time zero, washout, eligibility, exposure) are fixed without reference to outcomes; the SAP fixes estimand and estimator; both are locked before any outcome-dependent query, and deviations are logged.",
        "alt_text": "Flowchart showing the research question feeding a protocol design layer and a SAP analysis layer, both converging on a lock point before programming, with a deviation-logging branch.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  A[Is this study decision-grade<br/>or hypothesis-generating?] -->|Hypothesis-generating| B[Lightweight analysis plan;<br/>label results exploratory]\n  A -->|Decision-grade| C[Have outcomes already been examined?]\n  C -->|Yes| D[Do NOT write a post-hoc protocol to match;<br/>treat as hypothesis-generating]\n  C -->|No| E[Write STaRT-RWE table + HARPER narrative]\n  E --> F[Is the estimand + intercurrent-event<br/>strategy stated explicitly?]\n  F -->|No| G[Specify estimand ICH E9R1;<br/>otherwise result is uninterpretable]\n  F -->|Yes| H[Can two analysts build the<br/>identical cohort from the spec?]\n  H -->|No| I[Tighten time-zero, washout,<br/>code lists, censoring rules]\n  H -->|Yes| J[Pre-register, lock SAP,<br/>timestamp vs extract date, then program]",
        "caption": "Decision logic for whether and how to specify a protocol/SAP, surfacing the two dangerous failure modes — post-hoc protocols and implicit estimands.",
        "alt_text": "Decision tree distinguishing hypothesis-generating from decision-grade studies, warning against post-hoc protocols and implicit estimands, and ending at pre-registration and SAP locking.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "used_with",
        "target_slug": "picots-framework-rwe",
        "notes": "PICOTS structures the question (population, intervention, comparator, outcome, timing, setting) that the protocol then operationalizes into eligibility, exposure, comparator, outcome, and follow-up elements."
      },
      {
        "relation_type": "used_with",
        "target_slug": "target-trial-emulation",
        "notes": "The target-trial protocol (eligibility, treatment strategies, assignment, time zero, outcome, estimand, analysis) is the causal scaffold that maps directly onto these protocol/SAP elements."
      },
      {
        "relation_type": "used_with",
        "target_slug": "estimand-analysis-traceability-rwe",
        "notes": "The SAP must trace each estimand component to a specific estimator and analytic choice; this concept supplies the elements that traceability links together."
      },
      {
        "relation_type": "is_variant_of",
        "target_slug": "regulatory-readiness-rwe",
        "notes": "A complete, locked protocol/SAP is the core deliverable of regulatory readiness for hypothesis-evaluating RWE."
      },
      {
        "relation_type": "see_also",
        "target_slug": "estimands-ate-att-intercurrent-events-rwe",
        "notes": "The estimand element (population, contrast, intercurrent-event strategy) is specified here using ATE/ATT and intercurrent-event concepts."
      },
      {
        "relation_type": "see_also",
        "target_slug": "immortal-time-bias-handling",
        "notes": "The time-zero element exists to prevent immortal time; mis-specifying it (e.g., follow-up from diagnosis rather than initiation) is the classic protocol error."
      }
    ],
    "aliases": [
      "RWE study protocol",
      "RWE protocol elements",
      "statistical analysis plan elements",
      "SAP elements",
      "structured study protocol",
      "STaRT-RWE template",
      "HARPER protocol template"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "study-time-windows-anatomy",
    "name": "Study Time Windows: Baseline, Observation, and Outcome Windows",
    "short_definition": "The named, ordered spans of patient time — baseline/lookback, washout, induction/lag, and outcome/risk window — that every RWE study defines relative to an index date, together forming the complete temporal architecture of a cohort from first observable history through the last attributable follow-up day.",
    "long_description": "**Study time windows are the full temporal skeleton of an RWE study.** Every claim, diagnosis, fill,\nlab, or utilization record is only analytically meaningful once it is assigned to a named window: is it\nevidence of a past condition (baseline), evidence the patient is being captured at all (observation), an\nevent that should not yet be attributed to exposure (induction), or an event that counts as an outcome\n(risk window)? Newcomers often skip this architecture and write eligibility and outcome queries without\nfully specifying the temporal logic, producing cohorts whose time-zero is misaligned, whose covariates\nbleed past index, or whose outcome windows extend beyond real observability. This entry is the map:\nit names every window in a typical RWE timeline, shows how they assemble around the index date, and\nroutes the reader to the deep concepts that govern each one.\n\n**The index date (time zero): the anchor**\n\nEvery other window is defined *relative* to the index date, so getting this right is the first\nobligation of any cohort definition. The index date is the study's time-zero — the moment the patient\n\"enters\" the study's at-risk clock. In a new-user drug cohort it is the date of the first qualifying\nprescription fill after all eligibility conditions are met. In an incident-disease cohort it is the\ndate of the first qualifying diagnosis. In a target-trial emulation it is the date of the first\n\"protocol-eligible\" assignment.\n\nThree properties of a valid index date deserve explicit statement in every protocol. First, it must\nbe the *first* qualifying event — which is only meaningful after a washout has confirmed the patient\nwas not already using the drug/had the diagnosis before entering the study. Second, it must fall\ninside an observable window: the patient must be enrolled in a health plan (or actively under data\ncapture in an EHR/registry) at the index date itself. 
Third, it must be assigned *before* the analysis\nsees outcomes — any rule that requires a patient to survive long enough to receive a confirming event\nafter what is called \"time zero\" creates immortal time bias\n(`immortal-time-bias-handling`).\n\nThe index date is the single most consequential specification in an RWE study. Every window below is\ndescribed as an offset from it.\n\n**Baseline / lookback / covariate assessment window**\n\nThe baseline window is the pre-index span — ending either the day before the index date (the standard\nconvention, noted \"index_date − 1\") or on the index date itself — during which covariates are measured,\neligibility conditions are evaluated, and (as a special case) the washout is executed. Typical lengths\nare 180 days and 365 days for claims, though longer windows (730 days) appear for conditions with long\ncoding lags (chronic kidney disease, depression). The canonical formulation: a patient contributes to\nthe 365-day baseline window if they are continuously enrolled from `index_date − 365` through\n`index_date − 1`.\n\nThree distinct purposes are bundled in the same pre-index span, and conflating them is a common protocol\nerror (see `washout-clean-lookback-period-rwe`). The *washout* use is exclusionary: require *absence* of\nthe study drug (and comparator, in an active-comparator design) during the window, so that only new\n(incident) users enter the cohort. The *covariate measurement* use is additive: count the presence of\ndiagnoses, fills, procedures, and labs during the window to build the baseline covariate vector. The\n*eligibility* use is confirmatory: verify the required diagnosis was or was not present, verify the\nrequired enrollment span was covered, verify that competing treatments were absent.\n\nTwo design choices dominate the baseline window's bias profile. The first is *fixed-length vs\nall-available history*. 
A fixed-length baseline (e.g., exactly 365 days) gives comparable covariate\nascertainment across every patient regardless of how long they have been enrolled — a key advantage\nfor propensity-score and outcome-model balance. An all-available baseline captures more true chronic\nconditions (a hypertension diagnosis coded once five years ago would be missed by a 365-day window)\nbut inflates covariate counts for long-enrollees, inducing differential measurement by arm if one arm\nskews toward longer enrollment history. The second is *endpoint*: does the baseline window end at\n`index_date − 1` (the day *before* the index date) or does it include the index date itself? Including\nsame-day events in the baseline is the source of the classic off-by-one error: a record dated on the\nindex date can be read as a baseline comorbidity, a mediator, or an outcome, and which reading is\ncorrect depends on the exact time of day of each record, which claims do not carry. Standard practice\nends the baseline at `index_date − 1`.\n\nThe baseline window is only valid if the patient is enrolled and observable across its full span.\n\"No prior fill in 365 days\" is a confident new-user designation only if the data actually observed\nthe patient for all 365 days. Enrollment must cover the baseline window completely\n(`continuous-enrollment-observable-time-rwe`).\n\n**Observation window / observable time**\n\nThe observation window is the span during which the data source is actually capturing the patient's\nhealthcare encounters. It is not a study-analytic window but a data-availability constraint: the\nfundamental precondition that makes any other window meaningful. In claims, it is operationalized as\nthe patient's continuous enrollment span(s) with the relevant benefits (medical + pharmacy for drug\nstudies, at minimum). 
In OMOP CDM studies, it is the `observation_period` table\n(`omop-observation-period-rwe`). In EHR data, it is inferred from encounter density.\n\nThe critical inferential principle: **absence of a code within the observation window can be\ninterpreted as absence of the event.** Outside the observation window, silence is missingness —\nit cannot be read as a clean record. This asymmetry governs every window that follows. If a patient's\nenrollment ends on 2022-06-15, an outcome that occurs on 2022-06-20 does not appear in the data.\nThat is not a non-event; the outcome is simply unobserved. The analysis must treat the patient as\ncensored at disenrollment (2022-06-15), not as event-free beyond that date.\n\nGaps in enrollment break the observation window. Most protocols allow a small gap tolerance — commonly\n30 or 45 days — across which enrollment is assumed continuous (the gap might represent a brief plan\nswitch, an administrative lag, or a formulary change). Gaps wider than the tolerance split the\nobservation window into separate observable spans. A patient's time-at-risk and baseline lookback must\nboth be restricted to the observable spans.\n\n**Induction / lag / blackout window**\n\nImmediately after the index date there is often a span of days during which outcomes are excluded from\nthe primary analysis. This post-index exclusion zone goes by several names in the literature: induction\nwindow, lag period, blanking period, blackout window. Its purpose is biologically and analytically\ndistinct from the baseline: rather than cleaning the pre-index history, it cleans the post-index\noutcome window of events that could not plausibly have been caused by the exposure because not enough\ntime has elapsed (see `exposure-lag-induction-latency-window-rwe`).\n\nTwo distinct rationales drive the induction window. 
The first is *protopathic bias / reverse causation*:\nif a patient was already developing the outcome condition when they received the first prescription, the\nearly post-index outcomes reflect pre-existing disease, not drug effect. A 30-day or 90-day induction\nwindow excludes these events. The second is *biological latency*: some outcomes require a minimum\nbiological delay before they can manifest (e.g., drug-induced liver injury typically requires weeks of\nexposure; cancer chemoprevention requires months or years). An induction window shorter than the true\nminimum latency will bias toward the null, so the length must be justified by the pharmacological or\ndisease mechanism.\n\nIn a new-user cohort, the induction window runs from the index date through\n`index_date + induction_days − 1` (e.g., days 0–29 for a 30-day induction). Events within this window are excluded from the outcome\ncount but the patient is *not* censored — they continue contributing to person-time for the outcome\nwindow that follows. (Contrast: if a patient *dies* during the induction window, their follow-up truly\nends; the induction window excludes only the non-fatal outcome count, not follow-up itself.)\n\n**Outcome / risk / assessment window**\n\nThe outcome window is the post-index span during which events are attributed to exposure. In a\ntime-to-event (survival) analysis it starts immediately after the induction window ends (or at the\nindex date, if no induction is applied) and continues until the first of: (a) the outcome event,\n(b) disenrollment/end of observation, (c) death (a competing event or a censoring event depending on\nthe outcome), or (d) an administrative end-of-data date. In a fixed-window utilization or cost study\n(e.g., per-patient-per-month costs over a 12-month window post-index), the outcome window has a\nhard stop at `index_date + 364` (days 0–364, using the same day-0 convention as the induction window).\n\nThe outcome window's start and end determine the estimand. 
An intention-to-treat (ITT) outcome window\nstarts at time zero and runs regardless of whether the patient continued on the original drug — it\nmeasures the effect of *initiating* a treatment strategy. An as-treated outcome window builds\non-treatment time by stitching supply intervals, applying a grace period, and censoring when the patient\ndiscontinues or switches — it measures the effect of *remaining on* the drug\n(`as-treated-risk-window-construction-rwe`). The two are not interchangeable and should rarely be mixed\nin the same analysis without explicit justification.\n\nDeath as a censoring event vs a competing event is a critical distinction in the outcome window\n(`mortality-source-hierarchy-rwe`). For an all-cause mortality endpoint, death *is* the outcome and\nthe outcome window ends at death. For any other endpoint (hospitalization, fracture, stroke), death\nis a *competing event* — a patient who dies can no longer experience the outcome, so simple censoring\nat death biases the cumulative-incidence estimate (Fine-Gray or cause-specific hazard analysis is\nrequired). Additionally, the **granularity of the death date matters**: if the mortality source\n(National Death Index, Social Security Death Master File) provides only month-and-year of death, not\nday, the death date is often coded as the last day of the month. This day-level error shifts the\ncensoring date by up to 30 days and can meaningfully bias short-window estimands (e.g., 30-day\nmortality). Protocols must specify which mortality source is used and how day-level imprecision is\nhandled.\n\n**Follow-up period: the umbrella**\n\n\"Follow-up period\" is the colloquial umbrella for everything after the index date: induction +\noutcome window combined. A given calendar date can fall in a different window for different patients\nbecause follow-up is patient-anchored to each patient's index date, not to the calendar. 
Patient A,\nwhose index date is 2020-01-15, and Patient B, whose index date is 2021-07-03, may both appear in\nthe 2021 claims data in their outcome windows, but they entered the cohort at different times — the\ncalendar alignment is irrelevant; what matters is person-time-since-index. This is why incidence\nrates are always expressed in units of person-time (events per 100 person-years), not per calendar\nyear.\n\n**The assembled picture: the design-diagram convention**\n\nThe most effective protocol artifact for communicating all windows at once is the graphical design\ndiagram standardized by Schneeweiss et al. (2019). The diagram draws each named window as a labeled\nhorizontal bar on a single timeline anchored at the index date, with the direction of time left-to-right\nand the index date marked with a vertical line. A complete design diagram carries: (1) the\ncontinuous-enrollment span (the envelope), (2) the washout/baseline window (pre-index, shaded gray), (3) the\nindex date (vertical milestone), (4) the induction/lag window (post-index, cross-hatched), and (5)\nthe outcome/follow-up window (post-induction, clear or colored by arm). Every protocol and SAP\nshould carry one. A protocol that can describe its time windows verbally but cannot draw them is a\nprotocol that has not fully specified them.\n\n**Pros, cons, and trade-offs — specific and comparative**\n\n- **Fixed-length baseline vs all-available lookback:** A fixed-length baseline (365 days) standardizes\n  ascertainment and is the default for comparative analyses. All-available lookback captures more\n  chronic disease history but inflates apparent prevalence in long-enrollees and can create differential\n  covariate measurement if enrollment history differs by arm. 
**Prefer fixed-length for comparative\n  work; use all-available only in sensitivity analysis or for prediction where calibration is the goal.**\n\n- **ITT outcome window vs as-treated risk window:** ITT is cleaner, avoids informative-censoring bias,\n  and is the regulatorily preferred primary estimand for most comparative effectiveness questions.\n  As-treated windows target the per-protocol (on-drug) effect but require IPCW when discontinuation\n  is prognostically driven. **Prefer ITT as the primary; report as-treated as the per-protocol\n  companion, always with IPCW.**\n\n- **Induction window present vs absent:** Induction windows remove protopathic bias and biologically\n  implausible early events. Omitting them inflates the early event rate for the exposed arm (if the\n  outcome can trigger the prescription) or biases toward the null (if the exposure requires time to act).\n  Cost: events within the induction window are discarded, reducing power. **Prefer to pre-specify an\n  induction window whenever reverse causation is mechanistically plausible; vary its length as a\n  sensitivity analysis.**\n\n- **Hard administrative cap on follow-up vs open-ended follow-up:** A hard 12-month outcome window\n  is appropriate for cost-utilization questions (PPPM measures require a fixed denominator) and for\n  studies in commercial claims where >12 months of enrollment is uncommon. Open-ended follow-up is\n  appropriate for survival endpoints where differential administrative censoring must be accounted for\n  with IPCW or shared frailty. **Match the window length to the question and the data's enrollment\n  distribution.**\n\n**When to use**\n\nThis architecture applies to every longitudinal database study without exception. Any time a cohort\nis built from claims, EHR records, registry data, or linked sources, the analyst must specify an\nindex date rule, a baseline window, an observation window, and an outcome window — whether or not\nthose names are used. 
Failure to specify any one of them does not mean it is absent; it means it has\nbeen implicitly chosen (often badly) by whatever default behavior the extraction code happens to\nproduce. Use this framework explicitly as the first design artifact, before any eligibility SQL or\nanalysis code is written.\n\n**When NOT to use — and when it is actively misleading or dangerous**\n\n- **Do not apply a baseline covariate window that extends past the index date.** Any diagnosis,\n  fill, or lab measured *after* the index date is not a baseline covariate — it is either an\n  outcome, a mediator, or a post-treatment event. Conditioning on post-index covariates in a\n  propensity score or outcome model adjusts for mediators and induces collider bias. This is the\n  single most common technical error in claims-based confounding adjustment.\n\n- **Do not define the outcome window so it extends past the enrollment-verified span.** An outcome\n  window running to \"end of data\" while the patient's actual enrollment ended six months earlier\n  manufactures false non-events during the unenrolled period. The outcome window must be intersected\n  with the observation window at every patient-level computation.\n\n- **Do not treat the induction window as an eligibility filter.** Requiring that a patient survive\n  event-free through the induction window *before* they are included in the cohort — rather than\n  including them and simply excluding events within the window from the count — creates immortal\n  time. The patient is in the cohort from the index date; the induction window governs which events\n  are counted, not who is counted.\n\n- **Do not use month-year death dates as day-precise censoring without acknowledging the error.**\n  If the mortality source provides only month-and-year precision, censoring at the last day of the\n  death month introduces up to 30 days of extra follow-up for each decedent. For 30-day mortality\n  endpoints this is a fatal flaw. 
Pre-specify how month-only death dates are imputed (e.g., impute\n  the 15th) and conduct sensitivity analyses.\n\n- **Do not use a single enrollment span when the patient has multiple non-contiguous spans.**\n  Silently using only the first or longest span misses valid follow-up in later spans and can\n  generate baseline windows that straddle a gap, leaving the washout unverifiable during the gap.\n  Enumerate all enrollment spans and derive windows within each valid span separately.\n\n**Pitfalls at the window boundaries**\n\n*Immortal time from misaligned windows.* The most dangerous version: the index date is set at a\ndiagnosis date, but eligibility also requires a subsequent prescription fill. Every patient who\nqualified was event-free from their diagnosis to their fill — that interval, attributed to the\n\"treated\" group as at-risk follow-up, is immortal. Fix: set the index date at the first fill, not\nthe diagnosis (`immortal-time-bias-handling`).\n\n*Baseline covariates measured after index.* A propensity score that includes any diagnosis,\nprocedure, or lab recorded on or after the index date adjusts for a mediator or post-treatment\nconfounder and introduces selection bias. Run a simple check: every covariate look-up date must\nsatisfy `event_date < index_date`.\n\n*Outcome window extending past disenrollment.* Events that occur after enrollment ends are\ninvisible in claims but are not non-events. 
If the outcome window is not capped at the enrollment\nend date, disenrolled follow-up days inflate the person-time denominator without contributing events\n— rates are biased downward, and differential rates of disenrollment across arms become differential\ncensoring.\n\n*Month-granularity death dates truncating follow-up.* For analyses near the date of death —\nhospice cost, end-of-life utilization, 30-day readmission — a month-only death date from the SSDI\nor NDI can place the death on the wrong calendar day, altering the length of every follow-up window\nthat terminates at death. See `mortality-source-hierarchy-rwe` for how to build and validate the\ncomposite death date.\n\n*Differential enrollment duration by arm creating asymmetric windows.* If one drug arm skews toward\nolder Medicare patients with longer enrollment histories and the other arm skews toward younger\ncommercial enrollees who churn more, the two arms will have different baseline window lengths if\nall-available lookback is used, different rates of induction-window events lost (because survival\ndiffers), and different follow-up censoring patterns. Fix: use a fixed-length baseline and report\nenrollment-duration diagnostics stratified by arm.",
    "primary_category": "Data_Standard",
    "tags": [
      "primitive",
      "data-standard",
      "study-design",
      "time-windows",
      "index-date",
      "baseline-window",
      "outcome-window",
      "follow-up",
      "observation-period",
      "induction-window",
      "rwe-design",
      "claims"
    ],
    "applies_to_study_types": [
      "new_user",
      "active_comparator_new_user",
      "cohort_retrospective",
      "claims_analysis",
      "ehr_study",
      "target_trial_emulation",
      "registry_study"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.7326/M18-3079",
        "url": "https://doi.org/10.7326/M18-3079",
        "citation_text": "Schneeweiss S, Rassen JA, Brown JS, et al. Graphical Depiction of Longitudinal Study Designs in Health Care Databases. Annals of Internal Medicine. 2019;170(6):398-406.",
        "year": 2019,
        "authors_short": "Schneeweiss et al.",
        "notes": "Defines the graphical design-diagram convention that names and displays every study time window (baseline, washout, index date, induction, follow-up) as a labeled timeline bar — the canonical reference for how to draw and communicate the complete window architecture."
      },
      {
        "role": "explain",
        "doi": "10.1093/aje/kwv254",
        "url": "https://doi.org/10.1093/aje/kwv254",
        "citation_text": "Hernán MA, Robins JM. Using Big Data to Emulate a Target Trial When a Randomized Trial Is Not Available. American Journal of Epidemiology. 2016;183(8):758-764.",
        "year": 2016,
        "authors_short": "Hernán & Robins",
        "notes": "Shows how the target-trial protocol explicitly names eligibility, assignment, follow-up start, and outcome windows — illustrating that every time window in a real-world study must be pre-specified as it would be in an RCT protocol."
      },
      {
        "role": "demonstrate",
        "doi": "10.1002/pds.5507",
        "url": "https://doi.org/10.1002/pds.5507",
        "citation_text": "Wang SV, Pottegård A, Crown W, et al. HARmonized Protocol Template to Enhance Reproducibility of hypothesis evaluating real-world evidence studies on treatment effects: A good practices report of a joint ISPE/ISPOR task force. Pharmacoepidemiology and Drug Safety. 2023;32(1):44-55.",
        "year": 2023,
        "authors_short": "Wang et al.",
        "notes": "HARPER protocol template operationalizes the full window architecture — lookback, washout, index date, induction, follow-up, and censoring — as pre-specified protocol elements, with explicit fields for each window length and its rationale."
      },
      {
        "role": "use",
        "doi": null,
        "url": "https://www.fda.gov/science-research/science-and-research-special-topics/real-world-evidence",
        "citation_text": "U.S. Food and Drug Administration. Real-World Evidence. FDA Science and Research Special Topics. Accessed 2026-06-12.",
        "year": 2024,
        "authors_short": "FDA",
        "notes": "FDA's RWE program page; the underlying guidance documents (e.g., Considerations for the Design, Conduct, and Analysis of Observational Studies Using RWD) require pre-specified study time windows as a condition of credible real-world evidence submissions."
      }
    ],
    "plain_language_summary": "Every RWE study is built around a patient's personal \"day zero\" — the date they first filled a new prescription or received a new diagnosis — and a set of named time windows that surround that date. Before day zero, there is a lookback window where you confirm the patient had no prior use of the drug and measure their background health. After day zero there is sometimes a short excluded zone (an induction window) where events don't yet count because the drug hasn't had time to act, and then the main outcome window where you watch for the event you care about. Getting these windows right — and making sure they only cover time when the patient was actually enrolled in their health plan — is the foundational step that every valid RWE study depends on.",
    "key_terms": [
      {
        "term": "index date",
        "definition": "A patient's personal \"day zero\" in the study — the date of their first qualifying prescription fill or diagnosis — which every other time window is measured from."
      },
      {
        "term": "baseline period",
        "definition": "The stretch of time before the index date (commonly 180 or 365 days) during which you check that a patient was not already using the study drug and measure the health conditions they had going in."
      },
      {
        "term": "washout",
        "definition": "The requirement that the study drug did not appear in the patient's records during the baseline period, which is how researchers confirm someone is a truly new user rather than a continuing one."
      },
      {
        "term": "observation window",
        "definition": "The span of time when the data source was actually capturing the patient's healthcare — in claims, this is when the patient was enrolled in their health plan; only within this window can a missing record be read as nothing happening."
      },
      {
        "term": "induction window",
        "definition": "A short post-index exclusion zone (e.g., the first 30 days after the index date) during which outcome events are not counted because the drug has not yet had time to cause them."
      },
      {
        "term": "outcome window",
        "definition": "The span after the induction zone (or directly after the index date when no induction is used) during which events are counted and attributed to the exposure, ending at disenrollment, death, or an administrative data cutoff."
      }
    ],
    "worked_example": {
      "scenario": "A researcher is building a new-user cohort of adults with type 2 diabetes who start a GLP-1 receptor agonist. One patient (call her Patient 1001) first fills semaglutide on 2022-01-15. The protocol specifies a 365-day baseline window ending the day before index, a 30-day induction window covering days 0 through 29 (the index date is day 0), and an outcome window that runs from day 30 through the earlier of disenrollment and end of data (2022-12-31). The enrollment record shows the patient was continuously enrolled from 2021-01-01 through 2022-10-31, then disenrolled. We want to (a) confirm the patient passes the baseline enrollment requirement, (b) identify the exact calendar dates of each window, and (c) compute the length of the outcome window she actually contributes.\n",
      "dataset": {
        "caption": "Patient 1001's enrollment span and index date — the raw inputs an analyst would use.",
        "columns": [
          "person_id",
          "enrollment_start",
          "enrollment_end",
          "index_date",
          "drug"
        ],
        "rows": [
          [
            1001,
            "2021-01-01",
            "2022-10-31",
            "2022-01-15",
            "semaglutide"
          ]
        ]
      },
      "steps": [
        "Baseline window: index_date - 365 days through index_date - 1 day = 2021-01-15 through 2022-01-14. Length = 365 days.",
        "Enrollment check: enrollment_start (2021-01-01) <= baseline_start (2021-01-15) and enrollment_end (2022-10-31) >= index_date (2022-01-15). Both conditions are met, so the patient has fully observable pre-index history. The washout and covariate windows are valid.",
        "Induction window: index_date through index_date + 29 days = 2022-01-15 through 2022-02-13. Length = 30 days. Events in this window are excluded from the outcome count.",
        "Outcome window start: induction_end + 1 day = 2022-02-14.",
        "Outcome window end: min(enrollment_end, end_of_data) = min(2022-10-31, 2022-12-31) = 2022-10-31. The patient is censored at disenrollment, not at the data cutoff.",
        "Outcome window length: 2022-02-14 through 2022-10-31. Days = (2022-10-31) - (2022-02-14) + 1 = 260 days."
      ],
      "result": "Patient 1001 contributes a 365-day baseline window (2021-01-15 through 2022-01-14), passes the enrollment check, has a 30-day induction window (2022-01-15 through 2022-02-13), and contributes 260 days of outcome-window follow-up (2022-02-14 through 2022-10-31) before being censored at disenrollment. She is NOT followed through end of data because her enrollment ended before 2022-12-31. Total person-time in the outcome window = 260 / 365 = 0.712 person-years.\n",
      "timeline_spec": {
        "title": "Patient 1001 — study time windows around the index date (semaglutide cohort)",
        "window": {
          "start": "2021-01-01",
          "end": "2022-12-31",
          "label": "Full data range"
        },
        "events": [
          {
            "label": "Index date: first semaglutide fill",
            "start": "2022-01-15",
            "length_days": 1,
            "quantity": "day 0"
          }
        ],
        "spans": [
          {
            "kind": "washout",
            "start": "2021-01-15",
            "end": "2022-01-14",
            "label": "365-day baseline / washout window"
          },
          {
            "kind": "gap",
            "start": "2022-01-15",
            "end": "2022-02-13",
            "label": "30-day induction window (events excluded)"
          },
          {
            "kind": "followup",
            "start": "2022-02-14",
            "end": "2022-10-31",
            "label": "260-day outcome window (censored at disenrollment)"
          },
          {
            "kind": "unexposed",
            "start": "2022-11-01",
            "end": "2022-12-31",
            "label": "Post-disenrollment: unobservable (not followed)"
          }
        ],
        "result": {
          "label": "260 outcome-window days / 365 = 0.712 person-years",
          "value": 0.712
        }
      }
    },
    "prerequisites": [
      "continuous-enrollment-observable-time-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Fixed-length baseline (365 days) vs all-available lookback",
        "description": "A 365-day fixed baseline requires the patient to have been enrolled for the full year before index, ensuring equal covariate ascertainment time across all patients. All-available lookback uses however much enrollment history exists — more for long-enrollees, less for recent enrollees — which captures more chronic conditions but introduces differential measurement by enrollment duration. Most comparative analyses pre-specify the fixed-length baseline as primary and the all-available lookback as a sensitivity analysis.",
        "edge_cases": [
          "A patient enrolled for only 200 days before index fails the 365-day requirement and is excluded; this creates a sample that skews toward the continuously insured.",
          "All-available lookback measured back to database inception (e.g., 2015 for a 2022 index date) will capture conditions coded long before the study window opens but will give long-enrollees 7+ years of covariate ascertainment vs. 6 months for recent enrollees — a severe differential measurement problem."
        ],
        "data_source_notes": "claims: verify via enrollment span table that continuous coverage of the required length exists; in OMOP studies, use observation_period_start_date <= index_date - lookback_days."
      },
      {
        "name": "Intention-to-treat (ITT) vs as-treated outcome window",
        "description": "Under ITT, every patient's outcome window runs from the induction end through the end of follow-up regardless of whether they remained on the drug — the window is fixed at cohort entry. Under as-treated, the window is stitched from supply intervals with a grace period and censored when the patient discontinues or switches, attributing only on-treatment person-time to the exposed arm.",
        "edge_cases": [
          "Discontinuation that is driven by worsening prognosis makes as-treated censoring informative — patients who stop the drug are sicker, so naive as-treated analysis biases toward benefit. IPCW is required.",
          "A grace period that differs implicitly by arm (because one drug has 90-day mail-order supply and the other is dispensed in 30-day fills) creates arm-specific window lengths."
        ],
        "data_source_notes": "claims: as-treated windows require claims-level fill_date + days_supply arithmetic; see as-treated-risk-window-construction-rwe for the full episode-building logic."
      },
      {
        "name": "Fixed-duration outcome window vs time-to-event follow-up",
        "description": "For cost and utilization endpoints (PPPM, total cost per 12 months), the outcome window has a hard cap at a fixed number of days (e.g., 365), and follow-up is restricted to patients observable for the full window. For survival and time-to-event endpoints, the window is open-ended, and censoring (at disenrollment, death, or end of data) is handled by the survival model's likelihood: censored patients contribute person-time up to their censoring date and then leave the risk set.",
        "edge_cases": [
          "A fixed 365-day window in commercial claims excludes patients who disenrolled before 365 days — a substantial fraction — creating a selected population that is healthier and more stably employed.",
          "Time-to-event follow-up with differential administrative censoring (one arm disenrolls more frequently) requires IPCW or a frailty model to avoid bias."
        ],
        "data_source_notes": "claims: always tabulate the distribution of observed follow-up days by arm before choosing fixed vs open-ended windows; differential enrollment duration is a diagnostic flag for informative censoring."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "washout-clean-lookback-period-rwe",
        "pros_of_this": "This entry is the complete map of all windows — it shows how washout, baseline, induction, and outcome windows relate to each other and to the index date. It is the right starting point for a newcomer.",
        "cons_of_this": "Does not cover the deep operationalization of washout length choice, differential covariate ascertainment in fixed vs all-available lookback, or the practical claims pitfalls (MA-only person-time, adjudication lag). Those require washout-clean-lookback-period-rwe.",
        "when_to_prefer": "Read this entry first for the full picture, then read washout-clean-lookback-period-rwe for the fine-grained decisions about the pre-index window."
      },
      {
        "compared_to": "as-treated-risk-window-construction-rwe",
        "pros_of_this": "This entry names and situates the outcome window within the full study timeline. It clarifies the ITT vs as-treated distinction at the conceptual level.",
        "cons_of_this": "Does not cover the detailed episode-building mechanics (supply stitching, grace periods, carryover, stockpiling). Those require as-treated-risk-window-construction-rwe.",
        "when_to_prefer": "Use this entry to understand the design; use as-treated-risk-window-construction-rwe for the programming of the exposure-episode construction."
      },
      {
        "compared_to": "continuous-enrollment-observable-time-rwe",
        "pros_of_this": "This entry provides the conceptual framing of how the observation window constrains every other window, including the worked example of a patient censored at disenrollment.",
        "cons_of_this": "Does not cover the gap-tolerance rule, plan-switching, Medicare Advantage exclusion, or OMOP observation_period construction. Those require continuous-enrollment-observable-time-rwe.",
        "when_to_prefer": "Use this entry for the architectural picture; use continuous-enrollment-observable-time-rwe for the operational implementation of observability."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Derive the observation window from the enrollment span table (medical + pharmacy benefit). Apply a gap tolerance (commonly 30-45 days) to merge near-continuous spans. Gate every baseline, washout, and outcome window computation inside these spans. Exclude Medicare Advantage-only person-time. Set index_date = first qualifying fill within the observation window after all eligibility conditions are met. Compute baseline window as [index_date - lookback_days, index_date - 1] and verify the enrollment span covers it. Compute the outcome window as [index_date + induction_days, min(enrollment_end, end_of_data)]. Apply a mortality source hierarchy to set the death-date censoring endpoint.",
      "ehr": "Observable time is inferred from encounter density (OMOP observation_period or first/last note date). Baseline window validity requires at least one confirmed contact within the lookback. Off-system care creates silent gaps — a fill at an outside pharmacy does not appear, so all-available lookback may miss a true prior use. Prefer linkage to dispensing records where available. EHR end-dates may be set by the outcome (last note before death) — do not infer end-of-observability from clinical silence; use an external death index.",
      "registry": "Registries provide strong baseline measurement (staging, biomarkers, comorbidity scores) but are typically weak for complete medication history. Link to claims for the washout and for full covariate ascertainment. Death dates from registry are usually confirmed and day-precise — use them as the primary mortality source in the hierarchy where available.",
      "linked": "Linked claims-EHR-vital records can support a longer, more accurate baseline (EHR severity + claims fills + NDI death dates) but linkage selection means only the linkable subset enters the cohort — verify that linkage probability does not differ by arm or by baseline health severity."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "from datetime import date, timedelta\nfrom dataclasses import dataclass\n\n@dataclass\nclass StudyWindows:\n    \"\"\"All time-window boundaries for one patient in a new-user RWE cohort.\"\"\"\n    person_id: int\n    index_date: date          # day 0 (first qualifying fill/diagnosis)\n    enrollment_start: date    # first day of observable coverage\n    enrollment_end: date      # last day of observable coverage (inclusive)\n    end_of_data: date         # hard data cutoff (inclusive)\n    lookback_days: int = 365  # length of baseline/washout window\n    induction_days: int = 30  # days post-index excluded from outcome count\n\ndef compute_windows(p: StudyWindows) -> dict:\n    \"\"\"\n    Derive every named window for one patient.\n\n    Baseline convention: ends index_date - 1 (not index_date), avoiding\n    the same-day off-by-one where a condition coded on index day would\n    appear in both baseline and potential-outcome.\n\n    Outcome window: runs induction_end + 1 through min(enrollment_end, end_of_data).\n    Day count uses inclusive endpoints: length = (end - start).days + 1.\n    \"\"\"\n    # --- Baseline window (pre-index, inclusive on both ends) ---\n    baseline_start = p.index_date - timedelta(days=p.lookback_days)\n    baseline_end   = p.index_date - timedelta(days=1)   # day BEFORE index\n    baseline_days  = (baseline_end - baseline_start).days + 1  # = lookback_days\n\n    # --- Enrollment coverage check for baseline ---\n    # The patient must have been continuously enrolled across the ENTIRE baseline window.\n    # If enrollment_start > baseline_start the washout window is not fully observable.\n    baseline_covered = (\n        p.enrollment_start <= baseline_start and\n        p.enrollment_end   >= p.index_date\n    )\n\n    # --- Induction window (post-index, events excluded from outcome count) ---\n    # Runs from index_date (inclusive) through index_date + induction_days - 1.\n    induction_start = p.index_date\n    
induction_end   = p.index_date + timedelta(days=p.induction_days - 1)\n    induction_days_actual = (induction_end - induction_start).days + 1\n\n    # --- Outcome / follow-up window ---\n    # Starts the day after the induction window ends; capped at the earlier of\n    # enrollment_end and end_of_data. If the cap falls before the start, the\n    # patient contributes no observable outcome-window time (censored during induction).\n    outcome_start = induction_end + timedelta(days=1)\n    outcome_end   = min(p.enrollment_end, p.end_of_data)\n    if outcome_end < outcome_start:\n        outcome_days = 0\n        outcome_person_years = 0.0\n    else:\n        outcome_days = (outcome_end - outcome_start).days + 1\n        outcome_person_years = outcome_days / 365.25\n\n    return {\n        \"person_id\":              p.person_id,\n        \"baseline_start\":         baseline_start,\n        \"baseline_end\":           baseline_end,\n        \"baseline_days\":          baseline_days,\n        \"baseline_covered\":       baseline_covered,\n        \"index_date\":             p.index_date,\n        \"induction_start\":        induction_start,\n        \"induction_end\":          induction_end,\n        \"induction_days\":         induction_days_actual,\n        \"outcome_start\":          outcome_start,\n        \"outcome_end\":            outcome_end,\n        \"outcome_days\":           outcome_days,\n        \"outcome_person_years\":   round(outcome_person_years, 4),\n        \"censored_at_disenroll\":  p.enrollment_end < p.end_of_data,\n    }\n\n# ── Demonstration with Patient 1001 from the worked example ──\np1001 = StudyWindows(\n    person_id       = 1001,\n    index_date      = date(2022, 1, 15),\n    enrollment_start= date(2021, 1,  1),\n    enrollment_end  = date(2022, 10, 31),\n    end_of_data     = date(2022, 12, 31),\n    lookback_days   = 365,\n    induction_days  = 30,\n)\nw = compute_windows(p1001)\nprint(\"Baseline window: \", w[\"baseline_start\"], 
\"to\", w[\"baseline_end\"],\n      f\"({w['baseline_days']} days)\")\nprint(\"Baseline covered:\", w[\"baseline_covered\"])\nprint(\"Induction window:\", w[\"induction_start\"], \"to\", w[\"induction_end\"],\n      f\"({w['induction_days']} days)\")\nprint(\"Outcome window:  \", w[\"outcome_start\"], \"to\", w[\"outcome_end\"],\n      f\"({w['outcome_days']} days = {w['outcome_person_years']} person-years)\")\nprint(\"Censored at disenroll:\", w[\"censored_at_disenroll\"])\n# Expected output (person-years = 260 / 365.25, rounded to 4 decimals):\n# Baseline window:  2021-01-15 to 2022-01-14 (365 days)\n# Baseline covered: True\n# Induction window: 2022-01-15 to 2022-02-13 (30 days)\n# Outcome window:   2022-02-14 to 2022-10-31 (260 days = 0.7118 person-years)\n# Censored at disenroll: True",
        "description": "Given a patient's index date, enrollment span, induction window length, and end-of-data date, compute the calendar dates and day-counts for every study time window. Includes inclusive-endpoint date arithmetic (both endpoints count), the enrollment coverage check for the baseline window, and correct capping of the outcome window at the earlier of disenrollment or end of data. The classic off-by-one: both the start and end dates of an inclusive window are in the window, so length = (end - start).days + 1.",
        "dependencies": [],
        "source_citations": [
          "schneeweiss-2019"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "# Study time window computation for a new-user RWE cohort\n# Base-R Date arithmetic: dates are numeric days since 1970-01-01;\n# subtraction gives the number of days between two dates (exclusive end).\n# Inclusive length: (end - start) + 1L.\n\ncompute_windows_r <- function(\n    person_id,\n    index_date,        # as.Date(\"YYYY-MM-DD\")\n    enrollment_start,\n    enrollment_end,\n    end_of_data,\n    lookback_days = 365L,\n    induction_days = 30L\n) {\n  # --- Baseline window ---\n  baseline_start <- index_date - lookback_days\n  baseline_end   <- index_date - 1L           # day BEFORE index (off-by-one convention)\n  baseline_length <- as.integer(baseline_end - baseline_start) + 1L  # == lookback_days\n\n  # --- Enrollment coverage check ---\n  # Patient must be enrolled across the ENTIRE baseline window.\n  baseline_covered <- (enrollment_start <= baseline_start) &\n                      (enrollment_end   >= index_date)\n\n  # --- Induction window ---\n  induction_start <- index_date\n  induction_end   <- index_date + induction_days - 1L\n  induction_length <- as.integer(induction_end - induction_start) + 1L\n\n  # --- Outcome window ---\n  outcome_start <- induction_end + 1L\n  outcome_end   <- min(enrollment_end, end_of_data)  # cap at earlier of disenroll / data end\n  outcome_length <- if (outcome_end >= outcome_start) {\n    as.integer(outcome_end - outcome_start) + 1L\n  } else {\n    0L  # patient disenrolled during or before the induction window\n  }\n  outcome_person_years <- round(outcome_length / 365.25, 4)\n\n  data.frame(\n    person_id             = person_id,\n    baseline_start        = baseline_start,\n    baseline_end          = baseline_end,\n    baseline_days         = baseline_length,\n    baseline_covered      = baseline_covered,\n    index_date            = index_date,\n    induction_start       = induction_start,\n    induction_end         = induction_end,\n    induction_days        = induction_length,\n    
outcome_start         = outcome_start,\n    outcome_end           = outcome_end,\n    outcome_days          = outcome_length,\n    outcome_person_years  = outcome_person_years,\n    censored_at_disenroll = enrollment_end < end_of_data,\n    stringsAsFactors      = FALSE\n  )\n}\n\n# -- Patient 1001 from the worked example --\nw <- compute_windows_r(\n  person_id        = 1001L,\n  index_date       = as.Date(\"2022-01-15\"),\n  enrollment_start = as.Date(\"2021-01-01\"),\n  enrollment_end   = as.Date(\"2022-10-31\"),\n  end_of_data      = as.Date(\"2022-12-31\"),\n  lookback_days    = 365L,\n  induction_days   = 30L\n)\nprint(w[, c(\"baseline_start\", \"baseline_end\", \"baseline_days\",\n            \"baseline_covered\", \"induction_end\", \"outcome_start\",\n            \"outcome_end\", \"outcome_days\", \"outcome_person_years\")])\n# Expected values (person-years = 260 / 365.25, rounded to 4 decimals):\n# baseline_start: 2021-01-15\n# baseline_end:   2022-01-14   baseline_days: 365  baseline_covered: TRUE\n# induction_end:  2022-02-13\n# outcome_start:  2022-02-14   outcome_end: 2022-10-31\n# outcome_days:   260          outcome_person_years: 0.7118",
        "description": "R implementation of the same window arithmetic using base-R Date arithmetic. lubridate is listed as an optional dependency for users who prefer it, but all window computations use only base-R. Demonstrates the inclusive-endpoint convention (both endpoints in the window, so length = as.integer(end - start) + 1), the enrollment coverage check, and the min() capping of the outcome window.",
        "dependencies": [
          "lubridate"
        ],
        "source_citations": [
          "schneeweiss-2019"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "gantt\n  title Study time windows for one new-user (index 2022-01-15)\n  dateFormat YYYY-MM-DD\n  axisFormat %b %Y\n  section Enrollment\n  Observable (enrolled) :done, enroll, 2021-01-01, 2022-10-31\n  section Pre-index\n  365-day baseline / washout window :active, base, 2021-01-15, 2022-01-14\n  section Index\n  Index date (first semaglutide fill) :milestone, idx, 2022-01-15, 0d\n  section Post-index\n  30-day induction window (events excluded) :crit, ind, 2022-01-15, 30d\n  Outcome / follow-up window (260 days) :followup, 2022-02-14, 2022-10-31\n  section Unobservable\n  Post-disenrollment (not followed) :done, post, 2022-11-01, 2022-12-31",
        "caption": "Complete study time window architecture for Patient 1001. The baseline window covers 365 days before the index date (verifying new-user status and measuring covariates); the 30-day induction window immediately post-index excludes events not yet attributable to the drug; the outcome window runs from day 31 through disenrollment on 2022-10-31 (260 days), then follow-up ends because the patient is no longer observable.",
        "alt_text": "Gantt chart showing enrollment from 2021-01-01 to 2022-10-31, a 365-day baseline window ending 2022-01-14, an index date milestone on 2022-01-15, a 30-day induction window, a 260-day outcome window ending 2022-10-31, and an unobservable post-disenrollment zone.",
        "source_type": "illustrative",
        "source_citations": [
          "schneeweiss-2019"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "part_of",
        "target_slug": "picots-framework-rwe",
        "notes": "The complete window anatomy — baseline, induction, follow-up — is the operationalization of PICOTS Timing (T); this entry is the map of what T must specify before the protocol is considered complete."
      },
      {
        "relation_type": "see_also",
        "target_slug": "washout-clean-lookback-period-rwe",
        "notes": "The deep dive on the pre-index window — washout vs lookback vs covariate-assessment, fixed vs all-available length, the differential-ascertainment pitfall, and MA-only person-time. Read after this entry for the fine-grained decisions within the baseline window."
      },
      {
        "relation_type": "requires",
        "target_slug": "continuous-enrollment-observable-time-rwe",
        "notes": "Observable time is the precondition for every other window; a baseline, induction, or outcome window is only valid where the patient is enrolled. This entry provides the conceptual map; continuous-enrollment-observable-time-rwe provides the operational implementation."
      },
      {
        "relation_type": "see_also",
        "target_slug": "omop-observation-period-rwe",
        "notes": "In OMOP CDM studies, the observation_period table is the formal encoding of the observation window; every baseline lookback and outcome window must be anchored inside it."
      },
      {
        "relation_type": "see_also",
        "target_slug": "as-treated-risk-window-construction-rwe",
        "notes": "The as-treated (per-protocol) outcome window requires episode-building mechanics — supply stitching, grace periods, carryover — beyond the simple ITT window described here."
      },
      {
        "relation_type": "see_also",
        "target_slug": "exposure-lag-induction-latency-window-rwe",
        "notes": "The induction/lag window described here is the design-level concept; the deep biological and statistical rationale for its length — protopathic bias, reverse causation, latency periods — lives in the exposure-lag-induction-latency-window-rwe entry."
      },
      {
        "relation_type": "see_also",
        "target_slug": "immortal-time-bias-handling",
        "notes": "Misaligned study time windows — especially a time-zero set before a qualifying post-baseline event — are the primary structural cause of immortal time bias; the prevention strategies in that entry directly follow from correct window design."
      },
      {
        "relation_type": "see_also",
        "target_slug": "mortality-source-hierarchy-rwe",
        "notes": "Death terminates the outcome window; a composite mortality source hierarchy provides the death date that sets this censoring endpoint. Day-vs-month granularity of the death date affects the precision of the censoring date, which matters most for short-window estimands such as 30-day mortality."
      },
      {
        "relation_type": "used_with",
        "target_slug": "picots-framework-rwe",
        "notes": "PICOTS Timing (T) must be populated with explicit window names and lengths drawn from this anatomy; an incomplete T element in a PICOTS table is almost always a missing window specification."
      }
    ],
    "aliases": [
      "baseline period",
      "observation window",
      "outcome window",
      "follow-up period",
      "assessment window",
      "covariate assessment window"
    ],
    "complexity": "foundational",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "subgroup-analysis-hte",
    "name": "Subgroup Analysis and Heterogeneity of Treatment Effect",
    "short_definition": "The practice of estimating treatment effects separately within patient subgroups — defined by age, sex, comorbidity, genomic marker, or other pre-specified baseline characteristics — to assess whether the effect is larger or smaller in certain groups (heterogeneity of treatment effect, or HTE); the principal pathologies are comparing within-subgroup p-values instead of running a formal interaction test (the cardinal sin), multiplicity inflation across many subgroups, insufficient power for interaction (interaction tests need roughly four times the sample size of a main-effect test), and scale dependence (an effect can appear heterogeneous on one scale and homogeneous on another); credibility is assessed with the ICEMAN framework, while the PATH statement and causal-forest methods offer modern risk-based and data-adaptive alternatives.",
    "long_description": "**What subgroup analysis is and what question it actually answers**\n\nA subgroup analysis estimates the treatment effect within a subset of the study\npopulation defined by a baseline characteristic — sex, age group, geographic region,\nbaseline biomarker level, comorbidity, genomic marker, or line of therapy. The core\nquestion is whether the effect is *heterogeneous* across the groups, meaning it is\nmeaningfully larger in one group than another. Subgroup analyses are among the most\nread and most misread sections of any clinical study report. Payers and clinicians\nnaturally want to know \"does this drug work for patients like mine?\" RWE teams are\nasked to reproduce and extend trial subgroup findings in much larger databases. The\nresult is an enormous literature of subgroup analyses, most of which are statistically\nunderpowered to detect the heterogeneity they claim to report. The key precondition\nfor any valid subgroup claim is that the subgroup variable and the expected direction\nof modification are defined in the protocol or statistical analysis plan before the\ndatabase is queried.\n\n**The cardinal sin: within-subgroup p-values instead of interaction tests**\n\nThe most pervasive and consequential error in subgroup analysis is the following\npattern: (1) compute a treatment effect estimate and p-value within subgroup A (e.g.,\nmen); (2) compute a treatment effect estimate and p-value within subgroup B (e.g.,\nwomen); (3) observe that the result is \"statistically significant in men (p = 0.01)\nbut not in women (p = 0.16)\"; and (4) conclude that the effect is heterogeneous.\nThis reasoning is invalid. The two p-values answer two different null hypotheses in\ntwo different populations — they do not test whether the effects in men and women\ndiffer from each other. Wang et al. (2007) documented that the majority of published\nsubgroup analyses in NEJM trials relied on exactly this invalid inference. 
A formal\ntest of heterogeneity requires an *interaction test*: fit a model with the treatment\nvariable, the subgroup variable, and their product (the interaction term), and evaluate\nthe coefficient on the product term. The ratio of the subgroup-specific treatment\neffects (on the ratio scale) or the difference in effects (on the difference scale)\nis the heterogeneity estimate; its confidence interval and p-value are the only valid\nmeasures of whether the effects differ. The worked example below illustrates this\nexactly: two subgroup-specific effects that are \"significant in men only\" but whose\ninteraction estimate (ratio of risk ratios = 9/7, approximately 1.29) carries a wide\nconfidence interval crossing 1.0, providing no evidence of genuine heterogeneity.\n\n**Multiplicity: fishing across many subgroups inflates false discoveries**\n\nA study that tests 10 pre-specified subgroups will, under the global null, produce at\nleast one false-positive finding roughly 40% of the time at alpha = 0.05 per test.\nWhen subgroups are selected after viewing the data, the false-positive rate is far\nhigher and cannot be calculated. The standard remedies are pre-specification of\nsubgroups in the protocol or statistical analysis plan before database lock; a\npre-specified hierarchy (one primary subgroup, the rest exploratory); and multiplicity-\nadjusted p-values (Bonferroni, Holm) or reporting of all subgroup results with honest\nlabeling of their exploratory status. The concept on multiplicity and multiple\ncomparisons covers adjustment methods in full; that entry and this one are\ncomplementary — subgroup analysis is the domain, multiple-comparisons methods are the\nremedy when many subgroups are tested.\n\n**Power: interaction tests need roughly four times the sample size**\n\nAn interaction test is fundamentally a test of a difference-of-differences: it asks\nwhether (effect in A) minus (effect in B) differs from zero. 
Detecting a difference\nbetween two effect estimates is harder than detecting either estimate individually.\nThe quantitative rule of thumb is that an interaction test requires approximately four\ntimes the total sample size of the main-effect test to detect, with the same power, an\ninteraction as large as the overall effect. Most RWE studies and virtually all trials\nare therefore severely underpowered to detect modest interactions. A trial designed\nwith 80% power to detect an overall hazard ratio of 0.75 has only about 29% power to\ndetect an interaction of the same magnitude between two equal-sized subgroups.\nThe practical consequence is that a null interaction test is almost always\nuninformative: it cannot distinguish genuine absence of heterogeneity from\nextreme underpowering. Credible subgroup claims must document the interaction power.\n\n**Scale dependence: additive versus multiplicative interaction — RERI**\n\nEffect modification is scale-dependent: a treatment can show no interaction on the\nmultiplicative (ratio) scale and strong interaction on the additive (risk-difference)\nscale, or vice versa. When treatment and the subgroup variable both affect the outcome,\ninteraction cannot be absent on both scales simultaneously: effects that combine\nexactly additively are not multiplicative, and vice versa. This is a mathematical\nconsequence of the relationship between ratios and differences. For public-health and\nHTA decisions (who to prioritize for treatment, absolute benefit, number-needed-to-treat)\nthe additive scale is the relevant one, because absolute risk reductions determine how\nmany patients benefit. The standard additive-interaction summary measure is the RERI\n(relative excess risk due to interaction): RERI = RR11 minus RR10 minus RR01 plus 1,\nwhere RR11 is the risk ratio with both factors present, RR10 and RR01 are the risk\nratios with only one factor present, and the group with neither factor is the\nreference. RERI = 0 indicates\nno additive interaction; RERI greater than 0 indicates positive synergy (supra-\nadditive effect); RERI less than 0 indicates sub-additivity. 
A multiplicative\ninteraction that is null on the ratio scale is compatible with a non-zero RERI.\nThe causal-mediation and effect-modification entry covers RERI computation in detail;\nthe core instruction here is that any subgroup claim should report both scales.\n\n**Credibility assessment: the ICEMAN criteria**\n\nSchandelmaier et al. (2020) developed the ICEMAN tool (Instrument to assess the\nCredibility of Effect Modification Analyses) for evaluating the trustworthiness of\nsubgroup claims. ICEMAN asks eight questions covering: (1) pre-specification of the\nsubgroup variable and the direction of the expected modification; (2) whether the\nclaim is based on a formal interaction test rather than within-subgroup p-values;\n(3) consistency of the finding across related endpoints; (4) biological plausibility;\n(5) whether the finding is supported by external evidence from prior trials or\nmechanistic studies; (6) the number of subgroup tests performed, providing the\nmultiple-testing context; (7) sample sizes within subgroups relative to the power\nneeded for the interaction test; and (8) whether the finding is pre-specified and\nconfirmatory or post-hoc and exploratory. A subgroup claim with low ICEMAN credibility\n— post-hoc, no formal interaction test, small n, implausible mechanism — should not\ndrive clinical or reimbursement decisions. ICEMAN is the current standard for peer\nreview and HTA evaluation of subgroup claims.\n\n**Modern HTE: risk-stratified subgroup analysis (PATH statement)**\n\nKent et al. (2020) proposed a principled alternative to single-variable subgroup\nanalysis in the Predictive Approaches to Treatment effect Heterogeneity (PATH)\nStatement. The core insight is that treatment effects in observational data and trials\ntend to vary more with a patient's overall baseline risk than with any single\ncovariate. 
A patient at high baseline risk of the outcome has more absolute room to\nbenefit even if the relative risk reduction is uniform across the spectrum. The PATH\napproach groups patients by a validated baseline risk score — preferably derived from\ncontrol-arm data only, to avoid confounding the risk score with treatment assignment\n— computes the treatment effect within each risk quintile or tertile, and asks whether\nthe absolute risk reduction is larger in the higher-risk groups. This approach often\nreveals more stable and interpretable heterogeneity than single-covariate subgroups,\nis less vulnerable to multiple-testing inflation (one continuous risk dimension rather\nthan many binary splits), and aligns directly with clinical decision-making because it\nidentifies patients with the most to gain from treatment. It is now the recommended\nprimary approach for HTE analysis in comparative effectiveness research.\n\n**Causal forests and ML-based HTE**\n\nCausal forests (Wager and Athey 2018, implemented in the R package grf) and related\nmeta-learners (X-learner, R-learner) estimate the conditional average treatment effect\n(CATE) as a flexible function of many baseline covariates simultaneously, without\npre-specifying which covariates modify the effect. They use sample-splitting and\nhonesty constraints to provide valid confidence intervals for individual-level CATE\nestimates. In large, high-dimensional RWE datasets they can reveal heterogeneity\npatterns that parametric interaction terms would miss. 
Three warnings are essential:\n(1) overfitting is the dominant risk without careful train-test splits and calibration;\n(2) a high estimated CATE in a region of covariate space does not identify the causal\nmodifier — it describes correlation with the CATE surface, which may simply reflect\nbaseline risk variation rather than biological modification; (3) causal forests adjust\nonly for the measured covariates they are given, so the estimated CATE remains\nconfounded when important confounders are unmeasured or the propensity and outcome\nmodels are poorly fit. Use ML-HTE for hypothesis generation and\nexploration, not for confirmatory claims without pre-registration and independent\nreplication in a separate dataset.\n\n**RWE-specific: confounding is inherited differentially across subgroups**\n\nIn confounded observational data, subgroups do not necessarily inherit the same\nconfounding structure as the overall cohort. Propensity-score or covariate-adjustment methods\ncalibrated on the whole population may not produce adequate confounding control\nwithin small subgroups, even when the main analysis is well-adjusted. Within-subgroup\ncovariate distributions can differ substantially from the pooled distribution, so a\npooled propensity score does not guarantee exchangeability within strata. A subgroup\nthat is, for example, defined by age below 65 may have very different prescribing\npatterns and indication severity compared to the full cohort, creating differential\nconfounding that the main-analysis propensity score was not designed to remove.\nThe practical requirement is to assess balance within each primary subgroup separately\n— using standardized mean differences on baseline covariates — before reporting\nsubgroup-specific effect estimates. 
When balance is poor within a subgroup, fit a\nseparate propensity model within the subgroup or include explicit subgroup-by-\nconfounder interaction terms in the outcome model.\n\n**Pros, cons, and trade-offs**\n\n*Pre-specified, interaction-tested subgroup analysis*:\n- Pros: can identify populations who derive greater or lesser benefit, informing\n  labeling, clinical guidelines, and HTA reimbursement restrictions; supports\n  individualized treatment decisions; required by regulators when safety or efficacy\n  may differ across demographic groups; ICEMAN-credible findings with formal\n  interaction tests and pre-specification are defensible at regulatory review.\n- Cons: severely underpowered for interaction in most studies; multiplicity inflation\n  without correction manufactures false findings; scale dependence means a null\n  multiplicative interaction is compatible with a clinically important additive\n  interaction; subgroup variable definitions in RWE (especially in claims) are often\n  proxies for the true clinical variable, introducing misclassification; confounding\n  within subgroups may differ from confounding in the full cohort.\n\n*vs reporting only the main (pooled) effect*:\n- The pooled treatment effect is more precisely estimated and more credible; a\n  spurious subgroup finding can damage scientific credibility and mislead practice.\n  Use the pooled result as the primary claim; subgroups are clearly exploratory or\n  hypothesis-generating unless pre-specified with adequate power.\n\n*vs causal-ML HTE (causal forests, meta-learners)*:\n- Parametric subgroup analysis with pre-specified interaction terms is simple,\n  transparent, and directly testable; ML-HTE provides richer, high-dimensional\n  heterogeneity estimates but is less interpretable and more vulnerable to overfitting.\n\n*vs risk-stratified HTE (PATH statement)*:\n- Risk-stratified analysis uses one pre-built risk score instead of many binary\n  splits, reducing multiple-testing 
inflation and aligning with clinical targeting;\n  it is the preferred approach when no single covariate has strong prior evidence\n  for interaction and the goal is identifying who benefits most absolutely.\n\n**When to use**\n\nUse subgroup analysis when: (1) the subgroup variable and the expected direction of\neffect modification are pre-specified in the protocol or SAP before any database query\nor data lock; (2) the sample is powered for the interaction test — document the\ninteraction-specific power and clearly label the analysis as exploratory if power is\nbelow 80%; (3) the scientific question genuinely turns on targeting — a payer,\nregulator, or clinician will use the result to restrict or extend coverage to a\nspecific group; (4) biological plausibility or prior evidence from trials or\nmechanistic studies supports the hypothesis. Always use the formal interaction test,\nnot within-subgroup p-values. Report both the multiplicative and additive (RERI)\ninteraction estimates with confidence intervals. Apply ICEMAN to self-assess\ncredibility before presenting the finding. In RWE, assess confounding balance within\neach subgroup separately before reporting subgroup-specific effects.\n\n**When NOT to use**\n\n- *Do not report \"significant in subgroup A but not B\" as evidence of heterogeneity.*\n  This is the cardinal sin. The only valid evidence is the interaction estimate with\n  its own confidence interval and p-value. The astrological-sign subgroup result in the\n  ISIS-2 aspirin trial is the canonical cautionary example of what multiple within-group\n  tests without a formal interaction test produce.\n- *Do not mine subgroups after observing the data.* Data-driven subgroup discovery\n  without pre-specification and multiplicity correction reliably generates false\n  positives. 
Post-hoc subgroup findings require explicit labeling as exploratory and\n  independent replication before influencing practice or coverage.\n- *Do not interpret a non-significant interaction test as confirmation of no\n  heterogeneity.* Most subgroup analyses are severely underpowered for interaction.\n  A null interaction test result in an underpowered study is uninformative, not\n  evidence of homogeneity.\n- *Do not report only the multiplicative scale.* A non-significant ratio-of-ratios\n  is compatible with a clinically important additive interaction. Always report the\n  RERI and within-subgroup absolute risk differences alongside the multiplicative\n  interaction estimate.\n- *Do not apply a pooled propensity score to confounded RWE subgroup estimates without\n  verifying within-subgroup balance.* Pooled-cohort adjustment does not guarantee\n  confounding control within strata where covariate distributions deviate from the\n  pooled average.\n\n**Interpreting the output**\n\nIn the worked example below, the drug produces a risk ratio of 0.70 in men and 0.90\nin women. The within-sex p-values computed from the event counts are approximately\n0.001 (men, significant) and 0.51 (women, not significant). The interaction estimate\n(the ratio of risk ratios) equals 0.90/0.70 = 9/7 (approximately 1.29), with a 95%\nconfidence interval that includes 1.0.\n\n*(1) Formal interpretation.* The ratio of subgroup-specific risk ratios is 0.90/0.70\n(approximately 1.29), meaning the risk ratio in women is approximately 29% higher than\nin men, so the estimated relative benefit is smaller in women. However, the 95% Wald\nconfidence interval for this ratio (approximately 0.88 to 1.88) includes 1.0, the null\nvalue for multiplicative interaction. The interaction p-value is approximately 0.19.\nThe study is severely underpowered for the interaction test: with 1000 men and 400\nwomen in total, the sample is far smaller than would be needed to detect an\ninteraction of this magnitude with 80% power. 
There is no statistical evidence that\nthe treatment effect differs by sex, despite the pattern of significant-in-men-only\nwhen within-subgroup p-values are compared.\n\n*(2) Practical interpretation.* The \"significant in men, not in women\" finding is\nmisleading: it arises from the smaller women sample (400 total vs 1000 total for men)\nand a smaller, imprecisely estimated effect in women (the comparator-arm risk is 0.30\nin both sexes), not from a detected biological difference in drug mechanism. A payer\nor clinician should not restrict treatment to men based on this pattern. The correct\nscientific conclusion is that sex-based HTE has not been demonstrated: the interaction\ntest does not reject equal effects, and a specifically powered follow-up study is\nwarranted before any differential coverage decision.",
    "primary_category": "Unknown",
    "tags": [
      "subgroup-analysis",
      "HTE",
      "heterogeneity-of-treatment-effect",
      "interaction",
      "effect-modification",
      "ICEMAN",
      "PATH-statement",
      "risk-stratification",
      "multiplicity",
      "RERI",
      "additive-interaction",
      "causal-forest",
      "interaction-test"
    ],
    "applies_to_study_types": [
      "cohort_retrospective",
      "cohort_prospective",
      "rct",
      "claims_analysis",
      "ehr_study",
      "registry_study",
      "active_comparator_new_user"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "primary",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1056/NEJMsr077003",
        "url": "https://doi.org/10.1056/NEJMsr077003",
        "citation_text": "Wang R, Lagakos SW, Ware JH, Hunter DJ, Drazen JM. Statistics in Medicine — Reporting of Subgroup Analyses in Clinical Trials. New England Journal of Medicine. 2007;357(21):2189-2194.",
        "year": 2007,
        "authors_short": "Wang et al.",
        "notes": "Landmark review documenting that the majority of subgroup analyses in NEJM trials used within-subgroup p-values rather than formal interaction tests, and set the standard that interaction tests are the only valid basis for subgroup claims."
      },
      {
        "role": "explain",
        "doi": "10.7326/M18-3667",
        "url": "https://doi.org/10.7326/M18-3667",
        "citation_text": "Kent DM, Steyerberg E, van Klaveren D. Personalized evidence based medicine: predictive approaches to heterogeneous treatment effects. BMJ. 2018;363:k4245. [PATH Statement] Kent DM, van Klaveren D, Paulus JK, et al. The Predictive Approaches to Treatment effect Heterogeneity (PATH) Statement. Annals of Internal Medicine. 2020;172(1):35-45.",
        "year": 2020,
        "authors_short": "Kent et al.",
        "notes": "The PATH Statement formalizes risk-stratified HTE analysis as the recommended primary approach for identifying who benefits most from a treatment in absolute terms, using baseline risk scores rather than single-covariate subgroups to reduce multiplicity and improve clinical relevance."
      },
      {
        "role": "demonstrate",
        "doi": "10.1503/cmaj.200077",
        "url": "https://doi.org/10.1503/cmaj.200077",
        "citation_text": "Schandelmaier S, Briel M, Varadhan R, et al. Development of the Instrument to assess the Credibility of Effect Modification Analyses (ICEMAN) in randomized controlled trials and meta-analyses. CMAJ. 2020;192(32):E901-E906.",
        "year": 2020,
        "authors_short": "Schandelmaier et al.",
        "notes": "Development and validation of the ICEMAN tool covering eight credibility dimensions for subgroup claims; now the standard framework for peer reviewers and HTA assessors evaluating subgroup analyses in submitted manuscripts."
      }
    ],
    "plain_language_summary": "A subgroup analysis asks whether a drug works better or worse in specific groups of patients, such as men versus women or younger versus older patients. The most common mistake is concluding the drug \"works in group A but not group B\" just because the result is statistically significant in one group and not the other. That comparison is invalid; the question of whether the groups truly differ requires a different statistical test called an interaction test. Interaction tests need roughly four times as many patients as the main analysis to have enough statistical power, meaning most published subgroup findings lack the data needed to confirm genuine differences. Researchers have developed credibility checklists (ICEMAN) and risk-based alternatives (PATH statement) to distinguish reliable subgroup findings from statistical noise.",
    "key_terms": [
      {
        "term": "interaction test",
        "definition": "The statistical test that directly asks whether a drug's effect differs between two groups, by adding a product term (the interaction) to a model and testing whether that term's coefficient is nonzero."
      },
      {
        "term": "heterogeneity of treatment effect (HTE)",
        "definition": "The phenomenon where a drug's benefit or harm is larger in some patient groups than others, often summarized as the ratio or difference of the effect estimates across groups."
      },
      {
        "term": "ICEMAN",
        "definition": "A credibility-assessment checklist for subgroup claims that rates eight dimensions including pre-specification, use of a formal interaction test, biological plausibility, and sample size within the subgroup."
      },
      {
        "term": "ratio of risk ratios",
        "definition": "The interaction estimate on the multiplicative scale — the effect in subgroup B divided by the effect in subgroup A; a value of 1.0 means no multiplicative interaction (the effects are the same in both groups)."
      },
      {
        "term": "RERI (relative excess risk due to interaction)",
        "definition": "The interaction measure on the additive (risk-difference) scale, equal to the joint effect of two factors minus the sum of their separate effects; RERI = 0 means no additive interaction, RERI > 0 means supra-additive (synergistic) effects."
      },
      {
        "term": "PATH statement",
        "definition": "A framework recommending that researchers stratify patients by overall baseline risk score — rather than by single covariates — to identify who benefits most in absolute terms from a treatment, giving a more stable and clinically useful picture of HTE."
      }
    ],
    "worked_example": {
      "scenario": "A health outcomes researcher analyzes a 1400-patient retrospective cohort study comparing a new cardiovascular drug versus an active comparator on the one-year risk of a major adverse cardiovascular event (MACE). The team pre-specified sex as a subgroup. The drug effect is statistically significant in men (p ≈ 0.001) but not in women (p ≈ 0.51). The PI wants to report \"the drug works in men but not in women.\" Before making this claim, the analyst applies the correct interaction test to see whether there is genuine evidence of effect modification by sex.",
      "dataset": {
        "caption": "Event counts from the 1400-patient cohort stratified by sex and treatment arm. Event rates (risk) equal events divided by patients. The drug effect is statistically significant only in men when within-subgroup p-values are compared, but only the interaction test can show whether the treatment effects in men and women differ from each other.",
        "columns": [
          "subgroup",
          "arm",
          "n_patients",
          "n_events",
          "event_rate"
        ],
        "rows": [
          [
            "men",
            "drug",
            500,
            105,
            0.21
          ],
          [
            "men",
            "comparator",
            500,
            150,
            0.3
          ],
          [
            "women",
            "drug",
            200,
            54,
            0.27
          ],
          [
            "women",
            "comparator",
            200,
            60,
            0.3
          ]
        ]
      },
      "steps": [
        "Compute event rates for men. Drug arm rate = 105/500 = 0.21. Comparator arm rate = 150/500 = 0.30.",
        "Risk ratio in men = 0.21/0.30 = 105/150 = 0.70. The drug reduces the MACE risk by 30% in men. A within-group test yields p approximately 0.001, which is significant at alpha = 0.05.",
        "Compute event rates for women. Drug arm rate = 54/200 = 0.27. Comparator arm rate = 60/200 = 0.30.",
        "Risk ratio in women = 0.27/0.30 = 54/60 = 0.90. The drug reduces MACE risk by 10% in women. A within-group test yields p approximately 0.51, which is NOT significant. The PI wants to say the drug does not work in women.",
        "Apply the correct interaction test: the ratio of risk ratios (the interaction estimate on the multiplicative scale) = 0.90/0.70 = 9/7 (approximately 1.29). This is the estimate of HOW DIFFERENT the two subgroup effects are from each other. On the log scale its standard error is sqrt(1/105 - 1/500 + 1/150 - 1/500 + 1/54 - 1/200 + 1/60 - 1/200), approximately 0.19, so the 95% confidence interval for the ratio of risk ratios runs from approximately 0.88 to 1.88, an interval that includes 1.0 (the null value for no interaction). The interaction p-value is approximately 0.19.",
        "The study has 1000 patients in the men stratum and 400 patients in the women stratum. An interaction test requires roughly four times the sample size of the main-effect test for the same power. With only 400 women total, the interaction test has very low power, far below 80%, to detect an interaction of this magnitude. The null interaction result cannot be distinguished from the study simply being underpowered.",
        "Conclusion: the \"significant in men, not in women\" pattern is driven by the smaller women sample (n = 400 vs n = 1000 for men) and a smaller, imprecisely estimated effect in women (the comparator-arm risk is 0.30 in both sexes), not by detected biological heterogeneity. The PI should NOT claim the drug works only in men. The correct statement is that sex-based effect modification has not been demonstrated, and a larger powered study is warranted."
      ],
      "result": "Risk ratio in men = 105/150 = 0.70 (95% CI approximately 0.56 to 0.87, p ≈ 0.001). Risk ratio in women = 54/60 = 0.90 (95% CI approximately 0.66 to 1.23, p ≈ 0.51). Ratio of risk ratios (interaction estimate) = 0.90/0.70 = 9/7 (approximately 1.29), 95% CI approximately 0.88 to 1.88, interaction p ≈ 0.19. Intervals and p-values are Wald estimates on the log risk-ratio scale computed from the event counts in the dataset. No demonstrated heterogeneity of treatment effect by sex. The \"significant in men only\" pattern reflects the smaller sample and the smaller, imprecisely estimated effect in the women stratum (the comparator-arm risk is 0.30 in both sexes), not a detected biological difference in drug response."
    },
    "prerequisites": [
      "causal-mediation-effect-modification-rwe",
      "estimands-ate-att-intercurrent-events-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Pre-specified confirmatory subgroup (interaction test required)",
        "description": "The subgroup variable, its levels, and the expected direction of modification are specified in the protocol or SAP before any data query. A single primary subgroup is pre-specified as confirmatory; all others are exploratory. The interaction term is the primary test statistic. The SAP documents the interaction power and labels the finding as underpowered if the interaction power falls below 80%.",
        "edge_cases": [
          "With a single confirmatory subgroup and a two-sided interaction test at alpha = 0.05, no multiplicity adjustment is needed for that test alone; multiplicity adjustments apply to the exploratory subgroups reported alongside it.",
          "If the interaction is tested on the multiplicative scale only, the additive RERI must also be estimated and reported; absence of multiplicative interaction is not absence of additive interaction."
        ],
        "data_source_notes": "Claims: define the subgroup variable from baseline codes (ICD-10 diagnoses, procedure codes, demographic fields) in the pre-treatment window; verify the subgroup definition does not itself depend on treatment-related codes. EHR: baseline lab values and clinical scores require a fixed assessment window relative to the index date; take the most recent in-window value and handle missingness (multiple imputation preferred over complete-case)."
      },
      {
        "name": "Exploratory post-hoc subgroup analysis (multiplicity-adjusted)",
        "description": "Subgroups not pre-specified in the protocol, or subgroups selected after the main analysis is viewed. All such findings must be clearly labeled as exploratory, multiplicity-adjusted (Holm or Bonferroni), and accompanied by the ICEMAN credibility score. They are hypothesis-generating, not confirmatory, and require independent replication before influencing practice or coverage decisions.",
        "edge_cases": [
          "Data-dredged subgroups that appear impressive after many informal looks can survive multiplicity adjustment yet still fail the ICEMAN biological-plausibility criterion; both gates must be passed for a finding to merit reporting."
        ],
        "data_source_notes": "Large claims databases enable many subgroup cuts; the temptation to report every statistically significant stratum is acute. Commit to the number of subgroups tested — even informally — before any analysis result is presented."
      },
      {
        "name": "Risk-stratified HTE analysis (PATH approach)",
        "description": "Patients are grouped by a validated baseline risk score (derived from control-arm data or an external cohort, never from the exposed arm, to avoid confounding the risk score with treatment assignment). Treatment effects are estimated within risk quintiles or tertiles and plotted against the baseline risk; the expected finding under uniform relative-risk reduction is that absolute risk reductions increase monotonically with baseline risk. This is the recommended primary HTE approach in comparative effectiveness research.",
        "edge_cases": [
          "If the risk model is built on the full cohort (including treated patients), the predicted risk for treated patients is an underestimate because treatment reduces the outcome; always derive the risk score from the comparator arm or an external population.",
          "Quintile boundaries must be defined before the within-quintile effect is estimated; never move boundaries to optimize a particular quintile's result."
        ],
        "data_source_notes": "Claims: baseline risk score can include age, sex, comorbidity indices (Charlson, Elixhauser), prior hospitalizations, and prior medication class — all measurable from claims in a fixed pre-index window. EHR: add lab values, BMI, functional status, and vital signs to build a richer risk model. Registry: disease severity scores and staging are natural inputs."
      },
      {
        "name": "Causal-forest ML HTE (hypothesis-generating)",
        "description": "A causal forest (grf package in R) estimates individual-level conditional average treatment effects (CATE) using honest sample-splitting so that confidence intervals are valid without assuming a parametric form for the heterogeneity. The best_linear_projection function tests whether a specific covariate is associated with the CATE surface; test_calibration checks whether any detectable heterogeneity exists at all. These outputs are exploratory and cannot certify causal modifiers without independent replication.",
        "edge_cases": [
          "Overfitting is the dominant failure mode in moderate-n datasets (n < 5000); always assess calibration and out-of-sample CATE variability before reporting.",
          "A significant best_linear_projection coefficient for a covariate means that covariate is associated with the estimated CATE surface — it does not mean the covariate causally modifies the treatment effect."
        ],
        "data_source_notes": "Large claims databases with n > 10,000 are the natural environment for causal forests; smaller EHR or registry datasets risk overfitting severely even with honest splitting. Linked claims-EHR provides the richest covariate matrix."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "causal-mediation-effect-modification-rwe",
        "pros_of_this": "Subgroup analysis focuses specifically on identifying patient populations who benefit more or less, which is a direct policy and clinical-targeting question; the ICEMAN and PATH frameworks provide structured credibility assessment tools that the general effect-modification entry does not detail.",
        "cons_of_this": "The causal-mediation entry covers the full effect-modification machinery including RERI computation, additive vs multiplicative scale, and DAG-based causal framing in greater depth; this entry covers the applied practice and pathologies of subgroup analysis specifically in clinical and RWE contexts.",
        "when_to_prefer": "Use this entry for the applied practice of subgroup analysis (pathologies, ICEMAN, PATH); use causal-mediation-effect-modification-rwe for the formal causal machinery (NDE/NIE, four-way decomposition, RERI formulas, g-methods for mediation under effect modification)."
      },
      {
        "compared_to": "multiplicity-multiple-comparisons",
        "pros_of_this": "This entry covers the specific domain of subgroup analysis — interaction tests, ICEMAN, PATH, within-group fallacies — that the generic multiplicity entry does not address in clinical context.",
        "cons_of_this": "The multiplicity entry covers the full menu of adjustment methods (Bonferroni, Holm, FDR, Bayesian shrinkage) that are the remedy when many subgroups are tested; that entry should be read alongside this one whenever multiple subgroups are planned.",
        "when_to_prefer": "Use this entry to design, execute, and interpret a subgroup analysis; use multiplicity-multiple-comparisons to choose and apply the correct adjustment method for the number of subgroups tested."
      },
      {
        "compared_to": "predictive-and-causal-ml-models-rwe",
        "pros_of_this": "Pre-specified parametric subgroup analysis with a formal interaction test is simple, transparent, directly testable, and produces an interpretable point estimate and confidence interval for a specific subgroup contrast; it is ICEMAN-credible when pre-specified.",
        "cons_of_this": "Causal-ML methods (causal forests, meta-learners) capture high-dimensional heterogeneity across many covariates simultaneously without pre-specifying which ones matter; they are more powerful for discovery when the relevant modifier is unknown.",
        "when_to_prefer": "Use parametric subgroup analysis for confirmatory claims on pre-specified, biologically motivated subgroups; use causal-ML HTE for hypothesis generation in large datasets when the relevant modifier is unknown, with explicit labeling of the exploratory nature."
      },
      {
        "compared_to": "sample-size-power-precision-rwe",
        "pros_of_this": "Subgroup analysis applies study-design power concepts directly to the interaction test, where the four-fold sample-size inflation rule is the key practical constraint that most investigators underestimate.",
        "cons_of_this": "The sample-size and power entry covers the full methodological framework for powering a study; this entry covers the specific case of interaction power as the binding constraint in subgroup analyses.",
        "when_to_prefer": "Use the sample-size entry to calculate interaction power for the SAP; use this entry to communicate the consequences of underpowering — null interaction results are uninformative, not evidence of homogeneity."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Subgroup variables are measured from baseline claims (ICD-10 diagnosis codes for comorbidity, procedure codes for prior surgery or testing, NDC codes for prior drug use, beneficiary demographics for age/sex). Define the baseline assessment window as a fixed period before the index date (typically 365 days). Verify that the subgroup variable is not itself a post-treatment or treatment-driven code that would make it a mediator rather than a modifier. At large n (> 50,000), interaction tests will have adequate power but so will spurious interactions; apply multiplicity adjustment and ICEMAN. Assess confounding balance within each subgroup stratum separately using standardized mean differences before reporting stratum-specific effect estimates.",
      "ehr": "EHR enables biologically richer subgroup variables — baseline biomarker levels (HbA1c, eGFR, NT-proBNP), functional status scores, imaging findings — but measurement is visit-driven and informative. Use the most recent value in a fixed pre-index window; model missingness with multiple imputation rather than complete-case analysis, as patients with missing baseline biomarkers systematically differ from those with complete data. Subgroup-specific confounding is a real concern: patients with high biomarker values who were tested may have a different prescribing pattern and health-seeking behavior from the full cohort.",
      "registry": "Registries provide adjudicated, protocol-scheduled subgroup variables (cancer stage, disease activity scores, genomic markers, ECOG performance status) measured on a standardized schedule — the cleanest substrate for biologically motivated subgroup analysis. Link to claims or a pharmacy database to verify treatment continuity within subgroups. Registry datasets are often smaller than claims; document the interaction power explicitly and label findings as exploratory if n is below the required four-fold threshold.",
      "primary": "In primary-data studies (RCTs, prospective cohorts), subgroup analyses should be pre-specified in the study protocol and statistical analysis plan, with a stated hierarchy and interaction power calculation. Collect the subgroup variable at baseline using a standardized instrument; retrospective definition of subgroups from available data is the primary source of post-hoc bias.",
      "linked": "Linked claims-EHR-registry datasets enable the richest subgroup covariate matrix (claims utilization proxies + EHR biomarkers + registry severity), but linkage selects a linkable subset that may not be representative of the full cohort; assess whether linkage rates differ by subgroup, as differential linkage can bias stratum-specific effect estimates toward the linked subgroup's characteristics."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\nimport pandas as pd\nimport statsmodels.formula.api as smf\nimport statsmodels.api as sm\n\n# ── 1. Formal interaction test (multiplicative scale) ──────────────────────────\n# Poisson GLM with robust HC3 standard errors; coefficient on arm:sex is the\n# log ratio of risk ratios (log of the multiplicative interaction estimate).\n# Because sex=0 (women) is the reference level, exp(arm:sex) = RR_men / RR_women.\nm_int = smf.glm(\n    \"outcome ~ arm * sex + age + cci\",\n    data=df,\n    family=sm.families.Poisson()\n).fit(cov_type='HC3')\n\nb = m_int.params\nci = m_int.conf_int()\n\nrr_interaction = np.exp(b[\"arm:sex\"])\nci_lo = np.exp(ci.loc[\"arm:sex\", 0])\nci_hi = np.exp(ci.loc[\"arm:sex\", 1])\np_int = m_int.pvalues[\"arm:sex\"]\n\nprint(f\"Ratio of RRs (men / women): {rr_interaction:.3f}\")\nprint(f\"95% CI: {ci_lo:.3f} to {ci_hi:.3f}\")\nprint(f\"Interaction p-value: {p_int:.4f}\")\nprint(\"IMPORTANT: a non-significant interaction result does NOT confirm homogeneity\")\nprint(\"           in an underpowered study — it is uninformative, not evidence of no HTE.\")\n\n# ── 2. Stratified estimates (sex-specific RR with confounders) ──────────────────\n# These are useful AFTER establishing whether the interaction test is credible;\n# do NOT compare these within-group p-values to assess heterogeneity.\nfor sex_label, sex_val in [(\"women (sex=0)\", 0), (\"men (sex=1)\", 1)]:\n    sub = df[df[\"sex\"] == sex_val].copy()\n    m_sub = smf.glm(\n        \"outcome ~ arm + age + cci\",\n        data=sub,\n        family=sm.families.Poisson()\n    ).fit(cov_type='HC3')\n    rr = np.exp(m_sub.params[\"arm\"])\n    lo, hi = np.exp(m_sub.conf_int().loc[\"arm\"])\n    p = m_sub.pvalues[\"arm\"]\n    print(f\"RR in {sex_label}: {rr:.2f} (95% CI {lo:.2f}–{hi:.2f}), p={p:.4f}\")\n\n# ── 3. 
Additive interaction (RERI) from the Poisson interaction model ──────────\n# RERI = RR11 - RR10 - RR01 + 1 (Knol & VanderWeele 2012 formula).\n# RR10 = effect of arm among women (sex=0); RR01 = effect of sex among unexposed (arm=0).\nRR10 = np.exp(b[\"arm\"])                              # drug effect in reference sex (women)\nRR01 = np.exp(b[\"sex\"])                              # sex effect when arm=0\nRR11 = np.exp(b[\"arm\"] + b[\"sex\"] + b[\"arm:sex\"])   # joint effect (men on drug)\nreri = RR11 - RR10 - RR01 + 1\nprint(f\"\\nRERI (additive interaction) = {reri:.3f}\")\nprint(\"  RERI=0: no additive interaction; RERI>0: supra-additive; RERI<0: sub-additive\")\nprint(\"  Only the point estimate is shown here; a bootstrap CI is recommended for valid inference.\")",
        "description": "Subgroup interaction test and stratified risk ratios using statsmodels Poisson\nregression (robust SE), plus additive RERI on the risk scale.\nRequired input DataFrame `df` (one row per patient, cohort fully built):\n  arm       : 1 = study drug, 0 = active comparator (assigned at index_date)\n  sex       : 1 = men, 0 = women (binary subgroup variable)\n  outcome   : 1 = MACE event, 0 = no event (binary endpoint)\n  age, cci  : baseline confounders measured in [index_date-365, index_date]\nThe Poisson GLM with robust SE is preferred over log-binomial for convergence\nstability; it estimates risk ratios for a binary outcome, and the robust\n(sandwich) variance is what makes the confidence intervals valid. Replace\nsex/outcome with your subgroup variable and endpoint as needed.",
        "dependencies": [
          "pandas",
          "numpy",
          "statsmodels"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(sandwich)\nlibrary(lmtest)\nlibrary(grf)\n\n## ── 1. Formal multiplicative interaction test ──────────────────────────────────\nfit_int <- glm(outcome ~ arm * sex + age + cci, family = poisson, data = df)\n# Robust standard errors via HC3 sandwich estimator\nct <- coeftest(fit_int, vcov = vcovHC(fit_int, type = \"HC3\"))\nprint(ct)\n\n# Ratio of risk ratios (the interaction estimate): exponentiate the arm:sex coefficient.\n# With sex = 0 (women) as reference, this is RR_men / RR_women.\nrr_int <- exp(coef(fit_int)[\"arm:sex\"])\n# lmtest::coefci honors the robust vcov; confint.default would silently ignore it.\nci_int <- exp(coefci(fit_int, parm = \"arm:sex\", vcov. = vcovHC(fit_int, type = \"HC3\"), df = Inf))\np_int  <- ct[\"arm:sex\", \"Pr(>|z|)\"]\ncat(sprintf(\"Ratio of RRs: %.3f (95%% CI %.3f to %.3f), p = %.4f\\n\",\n            rr_int, ci_int[1], ci_int[2], p_int))\ncat(\"A non-significant p does NOT confirm homogeneity in an underpowered study.\\n\")\n\n## ── 2. Stratified effect estimates ────────────────────────────────────────────\n## Compute sex-specific RRs by refitting the Poisson model within each stratum.\n## Do NOT compare within-stratum p-values to judge heterogeneity.\nfor (sx in c(0, 1)) {\n  sub <- subset(df, sex == sx)\n  m   <- glm(outcome ~ arm + age + cci, family = poisson, data = sub)\n  rr  <- exp(coef(m)[\"arm\"])\n  ci  <- exp(coefci(m, parm = \"arm\", vcov. = vcovHC(m, type = \"HC3\"), df = Inf))\n  lab <- if (sx == 0) \"women\" else \"men\"\n  cat(sprintf(\"RR in %s: %.2f (95%% CI %.2f to %.2f)\\n\", lab, rr, ci[1], ci[2]))\n}\n\n## ── 3. Additive interaction (RERI) ────────────────────────────────────────────\nb <- coef(fit_int)\nRR10 <- exp(b[\"arm\"])                              # drug effect in women (reference)\nRR01 <- exp(b[\"sex\"])                              # sex effect when unexposed\nRR11 <- exp(b[\"arm\"] + b[\"sex\"] + b[\"arm:sex\"])   # joint effect (men on drug)\nreri <- RR11 - RR10 - RR01 + 1\ncat(sprintf(\"RERI = %.3f (0 = no additive interaction; positive = supra-additive)\\n\", reri))\n\n## ── 4. 
Causal forest for data-adaptive HTE (HYPOTHESIS-GENERATING ONLY) ──────\n## Replace X columns with your full baseline covariate matrix.\n## WARNING: results describe covariate-CATE correlation, not confirmed causal modifiers.\nX <- as.matrix(df[, c(\"sex\", \"age\", \"cci\")])   # replace with your full covariate matrix\nset.seed(2024)\ncf <- causal_forest(\n  X = X,\n  Y = df$outcome,\n  W = df$arm,\n  num.trees = 2000\n)\n## Is there detectable heterogeneity at all?\n## A significantly positive differential.forest.prediction coefficient suggests heterogeneity exists somewhere.\nprint(test_calibration(cf))\n## Is sex associated with the estimated CATE surface?\nblp <- best_linear_projection(cf, A = df[, \"sex\", drop = FALSE])\nprint(blp)   # a significant coefficient means sex correlates with estimated CATE\ncat(\"NOTE: 'best_linear_projection' significance does NOT certify sex as a causal modifier.\\n\")\ncat(\"This output is hypothesis-generating and requires independent replication.\\n\")",
        "description": "Subgroup interaction test, stratified risk ratios, RERI, and a causal-forest HTE\nsketch in R, mirroring the statsmodels workflow. Input data.frame `df` matches the\nPython version (arm, sex, outcome, age, cci). The sandwich::vcovHC call provides\nrobust standard errors for the Poisson model. The grf::causal_forest sketch is\nhypothesis-generating only; its results are not confirmatory without\npre-specification and independent replication.",
        "dependencies": [
          "sandwich",
          "lmtest",
          "grf"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* ── 1. PROC PHREG: interaction test + stratum-specific HRs ────────────────── */\nproc phreg data=work.cohort;\n  class arm(ref='0') sex(ref='0') / param=ref;\n  model time_*event(0) = arm sex arm*sex age cci / ties=efron rl;\n  /* Coefficient on arm*sex is the log-ratio-of-hazard-ratios (the interaction estimate).   */\n  /* IMPORTANT: a non-significant arm*sex p-value does NOT confirm homogeneity if the study  */\n  /* is underpowered for the interaction test (interaction tests need ~4x the main n).        */\n\n  /* Stratum-specific hazard ratios (arm effect within each sex level).                      */\n  /* sex is a CLASS variable, so its levels are given as quoted strings in AT().             */\n  hazardratio arm / at(sex='0') diff=ref;   /* HR of arm=1 vs arm=0 among women (sex=0) */\n  hazardratio arm / at(sex='1') diff=ref;   /* HR of arm=1 vs arm=0 among men   (sex=1) */\n  /* These stratum-specific HRs are NOT an interaction test — they are supplementary.       */\n  /* The ratio of the two HRs (men / women) equals exp(arm*sex coefficient), the             */\n  /* multiplicative HTE estimate; both must be interpreted together.                         */\nrun;\n\n/* ── 2. PROC GENMOD: additive interaction (RERI) via log-binomial ──────────── */\n/* RERI = RR11 - RR10 - RR01 + 1 (Knol & VanderWeele 2012).                   */\n/* Log-binomial gives risk ratios so that ESTIMATE builds RERI from coefficients. */\nproc genmod data=work.cohort;\n  class arm(ref='0') sex(ref='0') / param=ref;\n  model event(event='1') = arm sex arm*sex age cci / dist=binomial link=log;\n  /* RR10 = exp(arm coefficient); RR01 = exp(sex coefficient)                   */\n  /* RR11 = exp(arm + sex + arm*sex)                                             */\n  /* RERI = RR11 - RR10 - RR01 + 1 — assembled from three ESTIMATEs below.     
*/\n  estimate 'log RR10'  arm 1 sex 0 arm*sex 0 / exp;\n  estimate 'log RR01'  arm 0 sex 1 arm*sex 0 / exp;\n  estimate 'log RR11'  arm 1 sex 1 arm*sex 1 / exp;\n  /* The EXP option adds the exponentiated estimates (RR10, RR01, RR11) to the output; */\n  /* then compute RERI = RR11 - RR10 - RR01 + 1 in a subsequent DATA step.      */\nrun;\n\n/* ── 3. Within-stratum balance check (required in RWE before reporting subgroup HRs) ── */\n/* Assess standardized mean differences within each subgroup stratum separately.          */\n/* A pooled propensity score does NOT guarantee balance within subgroups.                 */\n/* BY-group processing requires the data to be sorted by the BY variables first.          */\nproc sort data=work.cohort out=work.cohort_sorted;\n  by sex arm;\nrun;\nproc means data=work.cohort_sorted mean std;\n  by sex arm;\n  var age cci;   /* add all baseline covariates here */\n  /* mean= and std= with AUTONAME create age_Mean, cci_Mean, age_StdDev, cci_StdDev */\n  output out=work.balance_check mean= std= / autoname;\nrun;\n/* Compute SMD = (m_arm1 - m_arm0) / pooled_SD within each sex level;            */\n/* flag any SMD > 0.10 as requiring subgroup-specific PS re-estimation or        */\n/* additional outcome-model adjustment for that covariate within the subgroup.   */",
        "description": "Subgroup interaction test and stratum-specific hazard ratios in PROC PHREG,\nplus RERI from PROC GENMOD (log-binomial). Required input dataset work.cohort\n(one row per patient; cohort fully built before this block):\n  arm       : 1 = study drug, 0 = active comparator\n  sex       : 1 = men, 0 = women (binary subgroup; 0 is the reference level)\n  time_     : follow-up time in days to event or censoring\n  event     : 1 = MACE, 0 = censored\n  age cci   : baseline confounders from [index_date-365, index_date]\nPROC PHREG with the interaction term tests multiplicative heterogeneity on the\nHR scale; the HAZARDRATIO statement with the AT option produces stratum-\nspecific HRs directly. PROC GENMOD with ESTIMATE statements supplies the log\nrisk ratios from which RERI is computed in a subsequent DATA step.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Q[Subgroup finding in the data] --> Pre{Pre-specified in\\nprotocol or SAP?}\n  Pre -->|No — post-hoc| PostHoc[\"Label as exploratory;\\napply multiplicity adjustment;\\nICEMAN credibility score\"]\n  Pre -->|Yes| Int{Based on a formal\\ninteraction test?}\n  Int -->|No — within-group p-values only| Cardinal[\"CARDINAL SIN\\nWithin-subgroup p-values do NOT\\ntest whether effects differ;\\nalways use the interaction term\"]:::bad\n  Int -->|Yes| Power{\"Study powered\\nfor the interaction?\\n(need ~4x main n)\"}\n  Power -->|No — underpowered| Uninf[\"NULL INTERACTION IS UNINFORMATIVE\\nLabel exploratory; report CI;\\ndo not conclude homogeneity\"]:::warn\n  Power -->|Yes| Scale{Report BOTH scales?}\n  Scale -->|Multiplicative only| MissAdd[\"Missing the additive scale —\\ncompute RERI and within-subgroup\\nabsolute risk differences\"]:::warn\n  Scale -->|Both multiplicative and additive| ICEMAN{ICEMAN credibility\\nassessment}\n  ICEMAN -->|Low credibility| Replicate[\"Hypothesis-generating;\\nrequires replication\"]\n  ICEMAN -->|High credibility| Confirm[\"Potentially confirmatory;\\ndocument and communicate\\nuncertainty honestly\"]:::good\n  classDef bad fill:#fee2e2,stroke:#b91c1c;\n  classDef warn fill:#fef9c3,stroke:#ca8a04;\n  classDef good fill:#dcfce7,stroke:#15803d;",
        "caption": "Credibility ladder for a subgroup finding. Pre-specification and a formal interaction test are the minimum requirements; underpowered tests, single-scale reporting, and low ICEMAN scores progressively reduce credibility.",
        "alt_text": "Flowchart from a subgroup finding through pre-specification check, interaction-test check, power check, additive-scale check, and ICEMAN assessment, with branches indicating the cardinal sin, uninformative null results, missing additive scale, and potentially confirmatory findings.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "see_also",
        "target_slug": "causal-mediation-effect-modification-rwe",
        "notes": "Effect modification shares the same machinery as the subgroup pathology catalog here but is developed in full causal-inference terms (NDE/NIE, four-way decomposition, RERI formulas, g-methods under mediation); read that entry for the formal framework before interpreting RERI from an interaction model."
      },
      {
        "relation_type": "see_also",
        "target_slug": "multiplicity-multiple-comparisons",
        "notes": "Multiple subgroup tests without adjustment multiply the false-positive rate; that entry covers Bonferroni, Holm, FDR, and Bayesian shrinkage methods that are the remedy when many subgroups are planned or tested."
      },
      {
        "relation_type": "see_also",
        "target_slug": "estimands-ate-att-intercurrent-events-rwe",
        "notes": "Subgroup-specific treatment effects are distinct estimands (e.g., ATT within men vs ATT within women); both the population (ATE/ATT) and the ICE strategy must be consistent across the main analysis and each subgroup arm."
      },
      {
        "relation_type": "see_also",
        "target_slug": "predictive-and-causal-ml-models-rwe",
        "notes": "Causal forests and meta-learners (X-learner, R-learner) estimate flexible high-dimensional CATE surfaces useful for hypothesis-generating HTE exploration; that entry covers the ML methods that complement the parametric interaction models shown here."
      },
      {
        "relation_type": "see_also",
        "target_slug": "sample-size-power-precision-rwe",
        "notes": "Interaction tests require roughly four times the total sample of the main-effect test for equivalent power; the sample-size entry covers the formulas for planning a study powered specifically for the interaction, a calculation that belongs in the SAP before any data are queried."
      },
      {
        "relation_type": "used_with",
        "target_slug": "baseline-characteristics-and-covariate-balance-rwe",
        "notes": "In RWE, confounding is inherited differentially across subgroups; always assess standardized mean differences within each primary subgroup stratum before reporting stratum-specific effect estimates, since a pooled propensity score does not guarantee balance within strata."
      }
    ],
    "aliases": [
      "heterogeneity of treatment effect",
      "HTE analysis",
      "effect modification by subgroup",
      "conditional average treatment effect",
      "treatment effect heterogeneity",
      "differential treatment response"
    ],
    "complexity": "advanced",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "subscriber-id-member-id-claims-rwe",
    "name": "Subscriber ID vs Member ID in Claims Data",
    "short_definition": "The distinction between the policy-holding subscriber identifier, dependent/member identifiers, and plan-issued card identifiers used for enrollment, family linkage, deduplication, and claims joins.",
    "long_description": "Commercial and managed-care claims data often carry both a subscriber or family-level identifier and a member/person-level identifier. The subscriber is the policy holder or contract anchor. Members may include the subscriber and dependents. The values printed on an insurance card may not match the de-identified person key delivered in an RWE dataset, and vendors may hash, rekey, or split identifiers across feeds.\n\nThis distinction matters because the wrong key creates wrong people. Joining at the subscriber level can accidentally merge spouses or children. Joining only at the member level can miss family-level linkage needed for mother-infant studies or household benefit design. Person-level longitudinal studies need stable member/person IDs; family or subscriber IDs are useful linkage features but not substitutes for person IDs.\n\nOperationally, analysts should identify which keys are stable across time, plan changes, employer changes, product changes, and vendor refreshes. A subscriber ID may change when the policy holder changes employment or when a dependent becomes the subscriber. A member ID may change during plan migrations. De-identification can also make card-level terminology differ from analytic keys.\n\n**Pros, cons, and trade-offs.** Subscriber and member identifiers are powerful because they expose administrative structure that a single tokenized patient ID may hide. Subscriber keys can support family linkage, mother-infant candidate generation, plan-contract audits, and benefit-design checks. The trade-off is that subscriber keys are unsafe person keys. Member keys are closer to person-level follow-up, but can still split or merge people after rekeying, plan transitions, dependent-to-subscriber changes, or vendor refreshes.\n\n**When to use.** Use member/person identifiers to build patient-level cohorts, exposures, outcomes, and follow-up. 
Use subscriber or family identifiers only for specific linkage tasks, family-level context, duplicate diagnostics, or benefit-contract analyses, and then add timing, demographics, and clinical rules before accepting any relationship.\n\n**When NOT to use - and when it is actively misleading.** Do not join clinical claims at the subscriber level for person-level studies. Do not assume that an ID printed on a card is the same as the de-identified analytic person key. It is actively misleading to interpret multiple members under one subscriber as duplicate records or to collapse dependents into the policy holder's longitudinal history.",
    "primary_category": "Data_Quality_Assessment",
    "tags": [
      "subscriber-id",
      "member-id",
      "claims",
      "enrollment",
      "linkage",
      "family-id",
      "commercial-claims"
    ],
    "applies_to_study_types": [
      "claims_analysis",
      "mother_infant_linkage",
      "linkage",
      "data_quality_assessment"
    ],
    "data_sources": [
      "claims"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": null,
        "url": "https://isp.healthit.gov/uscdi-data-class/health-insurance-information",
        "citation_text": "Office of the National Coordinator for Health Information Technology. Health Insurance Information. United States Core Data for Interoperability.",
        "year": 2026,
        "authors_short": "ONC Health IT",
        "notes": "USCDI health insurance information data class defining member identifier, subscriber identifier, relationship to subscriber, group identifier, and payer identifier."
      },
      {
        "role": "explain",
        "doi": null,
        "url": "https://providerlibrary.healthnetcalifornia.com/hmo/provider-manual/enrollment/subscriber-member-identification-numbers.html",
        "citation_text": "Health Net California Provider Library. Subscriber/Member Identification Numbers.",
        "year": 2026,
        "authors_short": "Health Net",
        "notes": "Provider-facing explanation of subscriber and member identification numbers."
      },
      {
        "role": "explain",
        "doi": null,
        "url": "https://www.healthyhorns.utexas.edu/uhs/understanding-insurance-card.html",
        "citation_text": "University of Texas at Austin University Health Services. Understanding Your Health Insurance Card.",
        "year": 2026,
        "authors_short": "UT Austin UHS",
        "notes": "Practical explanation of common insurance-card fields."
      },
      {
        "role": "explain",
        "doi": null,
        "url": "https://www.bcbsnd.com/providers/news-resources/member-id-card-guide",
        "citation_text": "Blue Cross Blue Shield of North Dakota. Member ID Card Guide.",
        "year": 2026,
        "authors_short": "BCBSND",
        "notes": "Provider guide to member ID card elements and plan identifiers."
      }
    ],
    "plain_language_summary": "Subscriber ID usually points to the insurance contract or policy holder. Member ID points to a covered person. They are related but not interchangeable; using the wrong one can merge relatives or split one patient into multiple records.",
    "key_terms": [
      {
        "term": "Subscriber",
        "definition": "The policy holder or contract anchor for a health plan account."
      },
      {
        "term": "Member",
        "definition": "A covered person under the plan, including the subscriber or a dependent."
      },
      {
        "term": "Family identifier",
        "definition": "A contract-level key shared by covered family members, useful for some linkage tasks but unsafe as a person key."
      }
    ],
    "worked_example": {
      "scenario": "A commercial claims file contains one subscriber, a spouse, and an infant dependent. All three share a subscriber/family ID, but each has a distinct member/person ID. A mother-infant linkage uses the shared subscriber key plus delivery/birth timing, while an adherence study must use the member/person key.",
      "dataset": {
        "caption": "Identifier granularity in a family contract.",
        "columns": [
          "role",
          "subscriber_id_hash",
          "member_id_hash",
          "analytic_person",
          "safe_join_use"
        ],
        "rows": [
          [
            "policy holder",
            "sub_A",
            "mem_001",
            "adult 1",
            "person follow-up"
          ],
          [
            "spouse",
            "sub_A",
            "mem_002",
            "adult 2",
            "person follow-up; possible mother record"
          ],
          [
            "infant dependent",
            "sub_A",
            "mem_003",
            "infant",
            "family linkage candidate"
          ]
        ]
      },
      "steps": [
        "Use member/person ID to build patient-level longitudinal records.",
        "Use subscriber/family ID only to generate candidate family pairs.",
        "Add timing and clinical rules before accepting a mother-infant link.",
        "Test whether IDs change at plan-year rollover or employer change."
      ],
      "result": "The adherence cohort has three distinct people; the mother-infant linkage considers the spouse and infant as a candidate pair but does not merge the whole family into one person."
    },
    "prerequisites": [],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Person-level member identifier",
        "description": "Stable analytic key for an individual covered person.",
        "edge_cases": [
          "Vendor refreshes can rekey members.",
          "Dependents can become subscribers."
        ],
        "data_source_notes": "Primary key for cohorts, exposure episodes, and outcomes."
      },
      {
        "name": "Subscriber or family identifier",
        "description": "Contract-level key shared across covered family members.",
        "edge_cases": [
          "Merges multiple people if misused.",
          "Changes when policy holder changes."
        ],
        "data_source_notes": "Linkage feature, not a person key."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Tokenized patient ID",
        "use_member_id_when": "Working inside one payer/vendor data model with documented person-level semantics.",
        "use_tokenized_id_when": "Linking across EHR, claims, registry, or multi-payer sources.",
        "notes": "Subscriber/member IDs are administrative; tokens are study-linkage artifacts."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Profile one-to-many and many-to-one relationships among subscriber, member, employer, and plan IDs before cohort construction."
    },
    "implementations": [],
    "diagrams": [],
    "relations": [
      {
        "relation_type": "used_with",
        "target_slug": "mother-infant-linkage-rwe",
        "notes": "Mother-infant linkage may use subscriber/family IDs, but person-level IDs still define individuals."
      },
      {
        "relation_type": "requires",
        "target_slug": "claims-analysis",
        "notes": "Correct identifier granularity is a prerequisite for claims cohort construction."
      }
    ],
    "aliases": [
      "subscriber ID",
      "subscriber vs member ID",
      "subscriber versus member ID",
      "subscriber member ID",
      "subscriber identifier",
      "member ID",
      "subscriber ID member ID",
      "subscriber ID vs member ID",
      "member identifier",
      "insurance card ID",
      "family ID",
      "contract ID",
      "policyholder ID"
    ],
    "complexity": "foundational",
    "regulatory_relevance": [
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "surrogate-endpoint-validation-rwe",
    "name": "Surrogate Endpoint Validation",
    "short_definition": "A statistical framework for establishing whether a treatment effect on a biomarker or intermediate endpoint reliably predicts the treatment effect on the clinical endpoint of interest, distinguishing individual-level prognostic association from trial-level effect-on-effect surrogacy, and assessing whether a surrogate validated elsewhere transports to a real-world population.",
    "long_description": "**Surrogate endpoint validation** asks a deceptively narrow question: if a treatment moves a biomarker or intermediate\nendpoint (the *surrogate* S — e.g., LDL cholesterol, HbA1c, progression-free survival, viral load), does it move the\nclinical endpoint that actually matters to patients (the *true* endpoint T — e.g., myocardial infarction, overall\nsurvival)? The danger is that S can be strongly prognostic for T in untreated patients yet still fail completely as a\ndecision surrogate, because a drug can change S through a mechanism that does not propagate to T (or that carries\noffsetting harms). The graveyard of surrogates — encainide/flecainide suppressed ventricular arrhythmia but *increased*\nmortality (CAST); fluoride raised bone density but increased fractures; bevacizumab improved progression-free survival\nin metastatic breast cancer with no overall-survival benefit and accelerated approval was withdrawn — exists precisely\nbecause prognostic association was mistaken for surrogacy.\n\n**Core conceptual distinction.** There are two non-interchangeable levels of surrogacy, and conflating them is the\nsingle most common and most dangerous error.\n- **Individual-level surrogacy** is a *within-patient, within-trial* association: does S predict T conditional on\n  treatment assignment? This is necessary but radically insufficient — it is satisfied by any good prognostic marker.\n- **Trial-level surrogacy** is an *across-trial, effect-on-effect* relationship: across many randomized comparisons,\n  does the treatment effect *on S* predict the treatment effect *on T*? This is the property that licenses using S to\n  decide about T, and it is what regulators and HTA bodies demand. A marker can have near-perfect individual-level\n  association and near-zero trial-level R² (and vice versa).\n\nThe historical lineage operationalizes this. 
**Prentice (1989)** gave four operational criteria, the binding one being\n*full capture*: the treatment effect on T must act *entirely through* S, so that T is conditionally independent of\ntreatment given S. This is rarely testable (you can fail to reject it for lack of power, but never confirm it) and is an\nall-or-nothing ideal. **Freedman, Graubard & Schatzkin (1992)** relaxed it to the **proportion of treatment effect\nexplained (PTE)**, the fraction of the treatment effect on T that is removed by adjusting for S. PTE is notoriously\nunstable: it has wide confidence intervals, can fall outside [0,1] (so it is not a true proportion), and is biased by\nunmeasured confounding of the S–T relationship (the Frangakis–Rubin principal-stratification critique: PTE conditions\non a post-randomization variable). **Buyse, Molenberghs, Burzykowski et al. (2000)** reframed validation as a\n*meta-analytic two-level* problem, estimating an individual-level R²_indiv and a **trial-level R²_trial** from a\ncollection of randomized trials (or trial units); R²_trial near 1 with a tight prediction interval is the modern\nevidentiary bar. HTA bodies such as NICE, IQWiG, and G-BA are typically reluctant to accept surrogate-based effectiveness\ninputs when trial-level surrogacy is unestablished or weak; a trial-level R² below roughly 0.8 is a commonly cited rule\nof thumb rather than a fixed cutoff.\n\n**The RWE twist (why this entry ends in -rwe).** In real-world evidence you almost never *establish* trial-level\nsurrogacy yourself, because that requires a meta-analysis of randomized trials, which RWE is not. Two RWE roles dominate.\n(1) **Applying a surrogate validated elsewhere** (an RCT meta-analysis, an FDA accelerated-approval table) to a new,\noften sicker, older, more comorbid real-world population. Here validation becomes a question of **transportability**\nof the surrogate–outcome relationship, not estimation from scratch. 
(2) **Measuring the surrogate in real-world data**\n(e.g., real-world progression-free survival, rwPFS) as the *primary* endpoint when the clinical endpoint is unobserved\nor immature — here the surrogate's measurement properties in claims/EHR, not its trial-level R², become the dominant\nthreat. Both roles require explicit, pre-specified justification; neither is satisfied by citing an individual-level\ncorrelation.\n\n**Pros, cons, and trade-offs.**\n- **Trial-level meta-analytic validation (Buyse) vs Prentice criteria vs PTE (Freedman).** The meta-analytic approach\n  is the only one that directly targets the decision-relevant quantity (effect-on-effect prediction) and yields a\n  usable prediction interval for a new trial's true-endpoint effect. Cost: it needs many trials with both endpoints,\n  is sensitive to between-trial heterogeneity, and ecological-level R² can mislead if trials are few or homogeneous.\n  Prentice's criteria are conceptually clean but the full-capture criterion is essentially untestable and binary.\n  PTE is cheap (one trial, two regressions) but statistically fragile and conditions on a post-treatment variable.\n  **Prefer the meta-analytic two-level model** for any claim that a surrogate licenses inference about T; reserve PTE\n  for exploratory, hypothesis-generating use with explicit caveats.\n- **Using a surrogate at all vs waiting for the clinical endpoint.** A validated surrogate accelerates evidence and\n  enables decisions when the clinical endpoint is rare, distal, or ethically hard to wait for. 
Cost: every surrogate\n  imports the risk that the drug helps S but not T (or harms T), and accelerated approvals based on weak surrogates\n  have repeatedly required withdrawal (Kemp & Prasad 2017 document the empirical scale of this in oncology).\n- **rwPFS as a real-world surrogate vs adjudicated trial PFS.** rwPFS is feasible at scale and reflects routine care.\n  Cost: progression in routine care is ascertained by *non-protocolized* imaging whose cadence differs by drug, site,\n  and payer, injecting ascertainment bias directly into the surrogate's treatment-effect estimate.\n\n**When to use.** When the clinical endpoint is unobservable, immature, or rare in the available follow-up *and* a\ntrial-level-validated surrogate exists for the same disease, drug class, mechanism, and patient population; when an HTA\nsubmission must justify extrapolating an observed surrogate effect to a survival or morbidity benefit; when designing a\nsingle-arm or externally-controlled study in a rare disease where the clinical endpoint cannot accrue. Always pair the\nsurrogate analysis with a pre-specified transportability argument and a sensitivity analysis on the surrogate–outcome\nlink.\n\n**When NOT to use — and when it is actively misleading or dangerous.**\n- **Only individual-level (prognostic) evidence exists.** Treating \"S predicts T\" as license to use S for treatment\n  decisions is the canonical error. A marker on the causal pathway of the disease need not be on the causal pathway of\n  the drug's effect. 
This mistake actively kills evidence credibility and, historically, patients (CAST).\n- **The drug acts on T through a mechanism that bypasses S, or carries off-target harm.** Then a real effect on S\n  coexists with a null or harmful effect on T; the surrogate is not merely uninformative but anti-informative.\n- **Surrogate ascertainment differs by exposure arm.** If PFS in claims/EHR depends on imaging frequency that is\n  higher for the newer (study) drug — because of label-driven monitoring, site of care, or sicker patients getting\n  scanned more — the estimated treatment effect on the surrogate is biased before any validation logic applies, and the\n  bias is not removed by an externally validated R².\n- **Endogenous, lead-time-biased ascertainment dates.** When the date the surrogate is \"observed\" depends on the\n  intensity of follow-up (more visits → earlier detected progression), the surrogate's time-to-event is contaminated\n  by detection timing, not biology.\n- **No basis for transport.** A surrogate validated in trial populations (younger, fitter, single-line therapy) applied\n  to a Medicare claims population (older, multimorbid, heavily pretreated) without an explicit transportability argument\n  is an assumption masquerading as evidence.\n\n**Data-source operational depth.**\n- **Claims (FFS vs MA vs commercial):** Claims rarely contain the surrogate value itself (no lab results, no RECIST\n  reads); the surrogate must be *inferred* from utilization proxies (e.g., rwPFS proxied by therapy switch,\n  second-line initiation, or hospice/death) — every proxy is itself an outcome algorithm with its own PPV/sensitivity.\n  Differential ascertainment is the dominant failure: imaging and lab cadence vary by drug and site, so the proxy's\n  timing is exposure-dependent. 
**Medicare Advantage (capitated) person-time often lacks the fee-for-service claim line items** needed\n  to detect imaging/switch events, so MA-only person-time produces artificially \"stable\" patients. Restrict to Parts\n  A/B/D FFS (or commercial with full medical+pharmacy benefit) and exclude MA-only spans before defining surrogate\n  events. Competing risks (death from a different cause) differ by exposure in elderly claims and must be modeled\n  cause-specifically, not censored away.\n- **EHR:** Carries the actual surrogate value (lab results, imaging impressions, problem-list progression), which is its\n  great advantage, but capture is encounter-driven: a patient who progresses but is managed outside the system has the\n  surrogate event recorded late or never (external-care leakage). Imaging cadence is non-protocolized and varies by\n  clinician, creating exactly the lead-time and ascertainment bias the validation logic cannot fix. NLP-derived\n  progression dates need their own error characterization.\n- **Registry:** Often the strongest source for an *adjudicated* surrogate (e.g., cancer-registry stage/progression,\n  cardiac-registry events) and for disease severity, but typically weak for complete therapy exposure and long-term\n  mortality. Link to claims for the full treatment trajectory and to a death index (NDI/SSA/state vital records) to\n  firm up the *true* endpoint against which the surrogate is judged.\n- **Linked claims–EHR–vital records:** The ideal substrate for an RWE surrogate study because it pairs the surrogate\n  value (EHR) with complete exposure (claims) and reliable mortality (vital records). However, linkage selects only the\n  linkable subset (transportability threat) and introduces date discrepancies between order, result, and service dates\n  that must be reconciled before the surrogate event time and the true-endpoint time are placed on the same clock.\n\n**Worked claims example (rwPFS as a surrogate for overall survival in oncology claims).** 
Question: does a real-world\nprogression-free survival benefit for a new oral oncolytic vs the standard comparator justify inferring an overall-\nsurvival benefit in a commercial + Medicare FFS population? (1) Cohort: adults with the cancer of interest, ≥365 days\nof continuous A/B/D (or commercial medical+pharmacy) enrollment before `index_date` (first fill of either drug),\nexcluding any MA-only person-time so utilization-based progression proxies are observable. (2) Surrogate event\n(rwPFS proxy): the earliest of (a) a switch to a subsequent line of therapy (new antineoplastic NDC after a clean\n`days_supply` + grace window on the index agent), (b) a progression-coded encounter, or (c) death — each component\nbuilt as an explicit outcome algorithm with documented code lists and a PPV target. (3) True endpoint: death from a\ndeath-index-augmented mortality source (claims-only death is incomplete). (4) **Differential-ascertainment audit\nBEFORE any effect estimate:** compare imaging claim frequency (CT/MRI/PET CPT codes), oncology visit cadence, and\nrestaging-lab frequency between arms in the first 6 months; if the newer drug's arm is scanned 1.5× as often, earlier\ndetected \"progression\" is an artifact of monitoring intensity, not biology — quantify it and, if material, restrict to\na protocol-like fixed assessment grid or model the detection process. (5) Estimate the treatment effect on rwPFS\n(cause-specific hazard, with death as a competing risk) and on OS. (6) **Transport, do not re-derive:** state the\npublished *trial-level* R²_trial for PFS→OS in this tumor type (from RCT meta-analyses such as the Buyse/Ciani\noncology series) and argue explicitly why that effect-on-effect relationship should hold in this older, more comorbid\nclaims population — or concede it may not. (7) Sensitivity: vary the switch grace period, the progression code list,\nthe imaging-cadence adjustment, and the competing-risk handling; report how the OS inference moves. 
The deliverable is\nnot \"rwPFS improved, therefore OS improves\"; it is a quantified, transport-justified, ascertainment-audited claim with\nits assumptions exposed.\n\n**Interpreting the output.** A surrogate validation analysis using five trials in the tumor type estimates the\ntrial-level association between the response rate difference and the overall survival difference, yielding a\ntrial-level R² = 0.65. The slope estimate indicates that, across trials, a 10-percentage-point greater\nresponse rate difference corresponds to approximately 1.6 additional months of OS benefit.\n\nFormal interpretation: trial-level R² = 0.65 means that 65% of the between-trial variance in OS treatment\neffects is statistically explained by the response rate treatment effects. With only five trials this estimate is\nitself imprecise, so it should be reported with its confidence interval. The point estimate\nfalls below the 0.80 threshold commonly applied by HTA bodies (including NICE) for a surrogate to be\nconsidered sufficiently validated. A high individual-level correlation between response and OS within a\nsingle arm is not an adequate substitute: it answers a different question (do patients who respond live\nlonger?) rather than whether the treatment's effect on response predicts its effect on OS across trials.\nThe distinction between individual-level and trial-level correlation is critical and frequently conflated\nin regulatory submissions.\n\nPractical interpretation: a trial-level R² of 0.65 supports a cautious, qualified claim that response rate\nis a partial surrogate for OS in this setting. That claim is appropriate for sensitivity analyses and hypothesis\ngeneration, not for replacing OS as the primary endpoint in a pivotal study or HTA dossier. Report\nthe number of trials contributing to the meta-regression, their sample sizes, and whether the effect-on-effect\nrelationship can plausibly be transported from the RCT population to the real-world study population.",
    "primary_category": "Inferential_Statistics",
    "tags": [
      "surrogate-endpoint",
      "surrogacy-validation",
      "proportion-of-treatment-effect-explained",
      "trial-level-surrogacy",
      "meta-analytic-validation",
      "prentice-criteria",
      "rwpfs",
      "transportability",
      "regulatory-evidence"
    ],
    "applies_to_study_types": [
      "comparative_effectiveness",
      "meta_analysis_rct",
      "single_arm_external_control",
      "claims_analysis",
      "ehr_study",
      "registry_linkage"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1002/sim.4780080407",
        "url": "https://doi.org/10.1002/sim.4780080407",
        "citation_text": "Prentice RL. Surrogate endpoints in clinical trials: definition and operational criteria. Statistics in Medicine. 1989;8(4):431-440.",
        "year": 1989,
        "authors_short": "Prentice",
        "notes": "Foundational definition and the four operational criteria; the binding full-capture criterion (conditional independence of the true endpoint and treatment given the surrogate) anchors all later work and is famously hard to test."
      },
      {
        "role": "introduce",
        "doi": "10.1002/sim.4780110204",
        "url": "https://doi.org/10.1002/sim.4780110204",
        "citation_text": "Freedman LS, Graubard BI, Schatzkin A. Statistical validation of intermediate endpoints for chronic diseases. Statistics in Medicine. 1992;11(2):167-178.",
        "year": 1992,
        "authors_short": "Freedman et al.",
        "notes": "Introduced the proportion of treatment effect explained (PTE); also the source of its well-known instability (wide CIs, values outside [0,1], conditioning on a post-randomization variable)."
      },
      {
        "role": "explain",
        "doi": "10.1093/biostatistics/1.1.49",
        "url": "https://doi.org/10.1093/biostatistics/1.1.49",
        "citation_text": "Buyse M, Molenberghs G, Burzykowski T, Renard D, Geys H. The validation of surrogate endpoints in meta-analyses of randomized experiments. Biostatistics. 2000;1(1):49-67.",
        "year": 2000,
        "authors_short": "Buyse et al.",
        "notes": "Defines the modern two-level framework separating individual-level R² from trial-level R²; the trial-level effect-on-effect quantity is the standard regulators and HTA bodies now demand."
      },
      {
        "role": "demonstrate",
        "doi": "10.1136/bmj.f457",
        "url": "https://doi.org/10.1136/bmj.f457",
        "citation_text": "Ciani O, Buyse M, Garside R, et al. Comparison of treatment effect sizes associated with surrogate and final patient relevant outcomes in randomised controlled trials: meta-epidemiological study. BMJ. 2013;346:f457.",
        "year": 2013,
        "authors_short": "Ciani et al.",
        "notes": "Meta-epidemiological evidence that treatment effects on surrogates systematically overstate effects on final patient-relevant outcomes; directly relevant to HTA use of surrogate effects."
      },
      {
        "role": "explain",
        "doi": "10.1186/s12916-017-0902-9",
        "url": "https://doi.org/10.1186/s12916-017-0902-9",
        "citation_text": "Kemp R, Prasad V. Surrogate endpoints in oncology: when are they acceptable for regulatory and clinical decisions, and are they currently overused? BMC Medicine. 2017;15(1):134.",
        "year": 2017,
        "authors_short": "Kemp & Prasad",
        "notes": "Empirical critique documenting how often oncology surrogate endpoints lack trial-level validation yet drive accelerated approvals; the cautionary backbone for the \"actively misleading\" section."
      }
    ],
    "plain_language_summary": "Surrogate endpoint validation is a statistical process for checking whether a treatment's effect on an early, easy-to-measure sign — like tumor shrinkage or a change in a blood marker — reliably predicts its effect on what patients actually care about, like surviving longer. The key question is not just whether the surrogate and survival are linked in individual patients, but whether across multiple clinical trials the treatments that improved the surrogate the most also improved survival the most. Without that across-trial evidence, a drug that shrinks tumors could still fail to extend life — and that failure has happened repeatedly in the real world.",
    "key_terms": [
      {
        "term": "surrogate endpoint",
        "definition": "An early or easier-to-measure outcome — such as tumor response, blood pressure, or LDL cholesterol — used in place of the clinical outcome patients actually care about, such as survival or heart attack."
      },
      {
        "term": "true endpoint",
        "definition": "The clinical outcome that directly matters to patients, such as overall survival or occurrence of a major cardiac event, against which the surrogate is judged."
      },
      {
        "term": "trial-level association",
        "definition": "A relationship measured across multiple randomized trials: treatments that produced a bigger effect on the surrogate also produced a bigger effect on the true endpoint, and this is captured as R-squared (R2) — a number from 0 to 1 where 1 means perfect prediction."
      },
      {
        "term": "individual-level association",
        "definition": "A within-patient link: patients who respond better on the surrogate also tend to do better on the true endpoint; this is necessary but not enough to use the surrogate as a stand-in for the true endpoint."
      },
      {
        "term": "R-squared (R2)",
        "definition": "A statistic from 0 to 1 that says how well one variable predicts another; an R2 of 0.90 means 90% of the variation in the true-endpoint effect across trials is explained by the surrogate-endpoint effect."
      },
      {
        "term": "transportability",
        "definition": "Whether a surrogate relationship established in one set of trials (often younger, healthier patients) also holds in a different real-world population (often older, sicker patients)."
      }
    ],
    "worked_example": {
      "scenario": "Imagine we want to know whether tumor response rate — the percentage of patients whose tumor shrinks by at least 30% — is a valid surrogate for overall survival in a type of solid-tumor cancer. We have data from five small randomized trials, each comparing a new drug against a standard treatment. In each trial we recorded the difference in response rate between arms (the surrogate effect) and the difference in median overall survival in months between arms (the true endpoint effect). We want to compute trial-level R2: does a bigger boost to response rate across these trials predict a bigger survival gain?",
      "dataset": {
        "caption": "Per-trial treatment effects: difference in response rate (surrogate) and difference in median overall survival in months (true endpoint) between the new drug arm and the control arm.",
        "columns": [
          "trial_id",
          "response_rate_difference_pct",
          "os_difference_months"
        ],
        "rows": [
          [
            "T1",
            10,
            1.8
          ],
          [
            "T2",
            20,
            3.5
          ],
          [
            "T3",
            35,
            5.9
          ],
          [
            "T4",
            48,
            8.1
          ],
          [
            "T5",
            60,
            9.8
          ]
        ]
      },
      "steps": [
        "For each trial, read across the row: e.g., Trial T3 showed a 35-percentage-point higher response rate AND a 5.9-month longer median survival for the new drug vs. control.",
        "Plot the five pairs mentally — as the response-rate advantage rises from 10 to 60 points, the survival advantage rises from 1.8 to 9.8 months in a nearly straight line.",
        "Fit a weighted linear regression of os_difference_months on response_rate_difference_pct across all five trials (weighting by trial size if available, here treated as equal for simplicity).",
        "The regression gives a slope of roughly 0.160 (each 10-point gain in response rate predicts about 1.6 additional months of survival) and an R2 of 0.99.",
        "An R2 of 0.99 means 99% of the variation in the survival benefit across these five trials is explained by the response-rate benefit — a strong trial-level association."
      ],
      "result": "Trial-level R2 = 0.99 (slope 0.160 months per percentage point). This means that across these five trials, knowing how much the drug improved response rate almost perfectly predicts how much it improved survival. An R2 this high (above the conventional 0.80 threshold used by HTA bodies) supports using response rate as a surrogate for overall survival in this setting — provided the relationship also holds in the real-world population where it will be applied."
    },
    "prerequisites": [
      "meta-analysis-rct",
      "generalizability-transportability-external-validity-rwe",
      "comparative-effectiveness-research-cer-methods"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Individual-level surrogacy assessment",
        "description": "Quantifies within-patient prognostic association — whether the surrogate predicts the true endpoint conditional on treatment (e.g., individual-level R² from a Plackett-copula/bivariate model, or a Cox model for the true endpoint that includes the surrogate as a time-fixed or time-updated covariate). Necessary but never sufficient to license decisions about the true endpoint.",
        "edge_cases": [
          "A strong prognostic marker can have high individual-level association yet zero trial-level surrogacy; reporting only this level is the most common over-claim in RWE surrogate work.",
          "Treating the surrogate as time-updated invites adjustment for a post-treatment mediator and immortal-time-style bias if the surrogate is measured during follow-up."
        ],
        "data_source_notes": "EHR/registry: requires the actual surrogate value and the true endpoint on the same patients; claims: usually impossible because the surrogate value (lab/imaging read) is absent and must be proxied."
      },
      {
        "name": "Proportion of treatment effect explained (PTE)",
        "description": "Single-trial (or single-cohort) approach estimating the fraction of the treatment effect on the true endpoint that is removed when the surrogate is added to the model (Freedman et al.). Cheap and intuitive but statistically fragile.",
        "edge_cases": [
          "PTE is not a true proportion — estimates fall outside [0,1] and carry very wide confidence intervals, especially when the surrogate captures only part of the effect.",
          "Conditions on a post-randomization variable, so unmeasured confounding of the surrogate-outcome relationship biases PTE (Frangakis-Rubin principal-stratification critique)."
        ],
        "data_source_notes": "Use only as exploratory/hypothesis-generating; report the CI and never present a point PTE as validation. In RWE the confounding-of-S problem is worse than in RCTs."
      },
      {
        "name": "Meta-analytic two-level (trial-level) validation",
        "description": "Estimates trial-level R²_trial (treatment effect on surrogate predicts treatment effect on true endpoint across trials or trial-by-region/center units) plus individual-level R²_indiv, typically via a two-stage or joint mixed/copula model (Buyse et al.). Yields a prediction interval for the true-endpoint effect of a new trial.",
        "edge_cases": [
          "Needs many trials/units with both endpoints; few or homogeneous trials make trial-level R² unstable and the ecological relationship misleading.",
          "Between-trial heterogeneity in populations, follow-up, and measurement degrades transportability of the estimated surrogacy to a new setting."
        ],
        "data_source_notes": "This is an RCT-meta-analytic method; in RWE you usually cite an externally estimated R²_trial and argue transportability rather than estimate it from observational data."
      },
      {
        "name": "Real-world surrogate measurement (e.g., rwPFS)",
        "description": "Operationalizes the surrogate itself in claims/EHR (e.g., real-world progression-free survival from imaging impressions, therapy switch, or progression coding) as a primary or co-primary endpoint when the clinical endpoint is immature or unobserved.",
        "edge_cases": [
          "Non-protocolized imaging/lab cadence differs by drug, site, and payer, producing exposure-dependent ascertainment and lead-time bias in the surrogate's event time.",
          "Utilization-based proxies (switch, second line) are themselves outcome algorithms with their own PPV/sensitivity that must be characterized."
        ],
        "data_source_notes": "claims: proxy progression by switch/second-line/hospice/death and exclude MA-only person-time; EHR: use structured progression + NLP with documented error rates; link to a death index for the true endpoint."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Prentice operational criteria",
        "pros_of_this": "The meta-analytic two-level approach targets the decision-relevant trial-level (effect-on-effect) quantity and yields a usable prediction interval; it does not depend on the untestable full-capture criterion.",
        "cons_of_this": "Requires many randomized trials with both endpoints and is sensitive to between-trial heterogeneity; Prentice's framing remains the cleaner conceptual statement of what surrogacy means.",
        "when_to_prefer": "Whenever a claim is made that a surrogate licenses inference about the true endpoint for regulatory or HTA decisions."
      },
      {
        "compared_to": "Proportion of treatment effect explained (PTE, Freedman)",
        "pros_of_this": "Trial-level R² avoids conditioning on a post-randomization variable and is not prone to estimates outside [0,1]; PTE's confidence intervals are typically too wide to support a decision.",
        "cons_of_this": "PTE is computable from a single trial/cohort and is cheaper; trial-level validation needs a meta-analytic data structure rarely available in a single RWE study.",
        "when_to_prefer": "Prefer trial-level R² for confirmatory surrogacy claims; reserve PTE for exploratory analysis with explicit caveats."
      },
      {
        "compared_to": "Waiting for the validated clinical (true) endpoint",
        "pros_of_this": "A validated surrogate accelerates evidence and enables decisions when the clinical endpoint is rare, distal, or unethical to await; essential in rare diseases and single-arm/externally-controlled designs.",
        "cons_of_this": "Every surrogate imports the risk that the drug moves S but not T (or harms T); accelerated approvals on weak surrogates have repeatedly required withdrawal.",
        "when_to_prefer": "Use a surrogate only when trial-level validation exists for the same disease, mechanism, and population and a transportability argument is made; otherwise prefer the clinical endpoint."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "The surrogate value is usually absent; proxy it (rwPFS via switch/second-line/progression code/hospice/death), treating each proxy as an outcome algorithm with documented PPV/sensitivity. Audit imaging and visit cadence by arm for differential ascertainment before estimating effects. Exclude Medicare Advantage-only person-time (capitated plans generate incomplete claims, which hides imaging and switch events). When death is not part of the surrogate definition, handle it as a competing event (cause-specific hazards with cumulative incidence functions) rather than as non-informative censoring in a Kaplan-Meier analysis.",
      "ehr": "Carries the actual surrogate value (labs, imaging impressions, problem-list progression) but capture is encounter-driven and imaging cadence is non-protocolized, creating lead-time and ascertainment bias the validation logic cannot repair. Characterize NLP-derived progression-date error. Treat external-care leakage as informative loss to follow-up.",
      "registry": "Often the best source for an adjudicated surrogate and disease severity, but weak for complete therapy exposure and long-term mortality. Link to claims for the full treatment trajectory and to a death index for the true endpoint against which the surrogate is judged.",
      "linked": "Linked claims-EHR-vital-records pairs the surrogate value (EHR) with complete exposure (claims) and reliable mortality (vital records) — the ideal substrate — but linkage selects the linkable subset (transportability threat) and order/result/service date discrepancies must be reconciled before surrogate-event and true-endpoint times share one clock."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\nimport pandas as pd\nimport statsmodels.api as sm\n\n# (A) Proportion of Treatment Effect explained (Freedman) — continuous endpoints, single pooled cohort.\n# PTE = 1 - (beta_adjusted / beta_unadjusted), where beta is the treatment coefficient on the TRUE endpoint\n# before vs after adjusting for the SURROGATE. Report the (wide) bootstrap CI, never the point estimate alone.\ndef pte_freedman(ipd: pd.DataFrame, n_boot: int = 2000, seed: int = 1) -> dict:\n    def _pte(d):\n        X0 = sm.add_constant(d[[\"arm\"]])\n        b_unadj = sm.OLS(d[\"true_value\"], X0).fit().params[\"arm\"]\n        X1 = sm.add_constant(d[[\"arm\", \"surrogate_value\"]])\n        b_adj = sm.OLS(d[\"true_value\"], X1).fit().params[\"arm\"]\n        return np.nan if b_unadj == 0 else 1.0 - (b_adj / b_unadj)\n    point = _pte(ipd)\n    rng = np.random.default_rng(seed)\n    idx = np.arange(len(ipd))\n    boots = [_pte(ipd.iloc[rng.choice(idx, len(idx), replace=True)]) for _ in range(n_boot)]\n    lo, hi = np.nanpercentile(boots, [2.5, 97.5])\n    return {\"pte\": point, \"ci95\": (lo, hi)}  # CI is typically too wide to support a decision\n\n# (B) Trial-level surrogacy (Buyse two-stage): per-trial treatment effect on S and on T, then weighted regression.\ndef trial_level_r2(ipd: pd.DataFrame) -> dict:\n    rows = []\n    for tid, d in ipd.groupby(\"trial_id\"):\n        Xs = sm.add_constant(d[[\"arm\"]])\n        eff_s = sm.OLS(d[\"surrogate_value\"], Xs).fit().params[\"arm\"]   # effect on surrogate\n        eff_t = sm.OLS(d[\"true_value\"], Xs).fit().params[\"arm\"]        # effect on true endpoint\n        rows.append({\"trial_id\": tid, \"eff_s\": eff_s, \"eff_t\": eff_t, \"n\": len(d)})\n    eff = pd.DataFrame(rows)\n    # Stage 2: weighted regression of true-endpoint effects on surrogate-endpoint effects across trials.\n    Xt = sm.add_constant(eff[[\"eff_s\"]])\n    fit = sm.WLS(eff[\"eff_t\"], Xt, weights=eff[\"n\"]).fit()\n    return 
{\"r2_trial\": fit.rsquared, \"slope\": fit.params[\"eff_s\"],\n            \"n_trials\": len(eff), \"trial_effects\": eff}\n# Interpretation: R^2_trial near 1 with a tight prediction interval supports trial-level surrogacy;\n# few/homogeneous trials make this ecological estimate unstable and not transportable.",
        "description": "Two complementary surrogate-validation computations on individual-patient meta-analytic data.\nRequired input (one row per patient, pooled across trials; already cleaned):\n  ipd : trial_id (int/str), arm (0=comparator, 1=treatment), person_id,\n        surrogate_value (float; e.g., change in biomarker or pfs_months),\n        true_value (float; e.g., os_months or continuous clinical endpoint)\n(A) PTE (Freedman) for a single trial / pooled cohort with a CONTINUOUS true endpoint and continuous surrogate.\n(B) Trial-level surrogacy (Buyse): regress per-trial treatment effects on T against effects on S and report R^2_trial.\nFor time-to-event endpoints, swap the OLS effect estimates for per-trial Cox log-hazard ratios.",
        "dependencies": [
          "pandas",
          "numpy",
          "statsmodels"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "# (A) Proportion of Treatment Effect explained (Freedman), continuous endpoints, pooled cohort.\npte_freedman <- function(ipd, n_boot = 2000, seed = 1) {\n  .pte <- function(d) {\n    b_unadj <- coef(lm(true_value ~ arm, data = d))[[\"arm\"]]\n    b_adj   <- coef(lm(true_value ~ arm + surrogate_value, data = d))[[\"arm\"]]\n    if (b_unadj == 0) NA_real_ else 1 - (b_adj / b_unadj)\n  }\n  set.seed(seed)\n  point <- .pte(ipd)\n  boots <- replicate(n_boot, .pte(ipd[sample.int(nrow(ipd), replace = TRUE), ]))\n  list(pte = point, ci95 = stats::quantile(boots, c(0.025, 0.975), na.rm = TRUE))\n}\n\n# (B) Trial-level surrogacy (Buyse two-stage): per-trial effects, then weighted regression of T-effect on S-effect.\ntrial_level_r2 <- function(ipd) {\n  eff <- do.call(rbind, lapply(split(ipd, ipd$trial_id), function(d) {\n    data.frame(\n      trial_id = d$trial_id[1],\n      eff_s = coef(lm(surrogate_value ~ arm, data = d))[[\"arm\"]],  # effect on surrogate\n      eff_t = coef(lm(true_value ~ arm, data = d))[[\"arm\"]],       # effect on true endpoint\n      n     = nrow(d)\n    )\n  }))\n  fit <- lm(eff_t ~ eff_s, data = eff, weights = eff$n)            # stage 2\n  list(r2_trial = summary(fit)$r.squared,\n       slope = coef(fit)[[\"eff_s\"]],\n       n_trials = nrow(eff), trial_effects = eff)\n}\n# R^2_trial near 1 with a tight prediction interval supports surrogacy; few/homogeneous trials make it unreliable.",
        "description": "Mirror of the Python logic. Required input data frame `ipd` (one row per patient, pooled across trials):\n  trial_id, arm (0/1), person_id, surrogate_value (numeric), true_value (numeric).\npte_freedman(): PTE with bootstrap CI for continuous endpoints. trial_level_r2(): Buyse two-stage trial-level R^2.\nFor survival endpoints, replace the lm() effect estimates with per-trial coxph() log-hazard ratios.",
        "dependencies": [
          "stats"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* BY-group processing below requires the input sorted by trial_id */\nproc sort data=work.ipd; by trial_id; run;\n\n/* ---------- Stage 1a: per-trial treatment effect on the SURROGATE (continuous) ----------\n   NOPRINT is deliberately not used: it suppresses the ODS tables that ODS OUTPUT captures. */\nproc glm data=work.ipd;\n  by trial_id;\n  model surrogate_value = arm / solution;\n  ods output ParameterEstimates=eff_s_pe;\nrun; quit;\n/* KEEP must name the pre-RENAME variable; the rename is applied on output */\ndata eff_s; set eff_s_pe; where upcase(Parameter)='ARM';\n  keep trial_id Estimate; rename Estimate=eff_s; run;\n\n/* ---------- Stage 1b: per-trial treatment effect on the TRUE endpoint ----------\n   Continuous true endpoint -> PROC GLM (shown). For survival, replace with:\n     proc phreg data=work.ipd; by trial_id; model os_time*os_event(0)=arm;\n     ods output ParameterEstimates=eff_t_pe;  (Estimate = log hazard ratio for arm)         */\nproc glm data=work.ipd;\n  by trial_id;\n  model true_value = arm / solution;\n  ods output ParameterEstimates=eff_t_pe;\nrun; quit;\ndata eff_t; set eff_t_pe; where upcase(Parameter)='ARM';\n  keep trial_id Estimate; rename Estimate=eff_t; run;\n\n/* per-trial sample sizes for weighting */\nproc sql;\n  create table nwt as select trial_id, count(*) as n from work.ipd group by trial_id;\nquit;\n\ndata trial_effects;\n  merge eff_s eff_t nwt; by trial_id;\nrun;\n\n/* ---------- Stage 2: trial-level surrogacy = does effect-on-S predict effect-on-T? ----------\n   Weighted regression; model R-square approximates R^2_trial. The PROC MIXED step is a sketch\n   of a random-effects alternative; with one row per trial the random intercept is not separately\n   identifiable from the residual, so a measurement-error-aware fit needs per-trial standard errors. */\nproc reg data=trial_effects outest=_r2 edf;\n  weight n;\n  model eff_t = eff_s;     /* R-square here is the trial-level surrogacy R^2 */\nrun; quit;\n\nproc mixed data=trial_effects method=reml;\n  class trial_id;\n  model eff_t = eff_s / solution;\n  random intercept / subject=trial_id;   /* between-trial variance; confounded with residual at one row per trial */\nrun;\n\n/* ---------- Individual-level link (within-patient): true endpoint adjusted for surrogate ----------\n   For a survival true endpoint, PROC PHREG quantifies whether the surrogate carries the treatment\n   effect (compare arm HR with vs without surrogate_value in the model).                          */\nproc phreg data=work.ipd;\n  model os_time*os_event(0) = arm surrogate_value;\nrun;",
        "description": "Trial-level surrogacy in SAS via the canonical two-stage meta-analytic approach.\nRequired input dataset work.ipd (one row per patient, pooled across trials):\n  trial_id, arm (0/1), person_id, surrogate_value, true_value, (optional) os_time, os_event for survival.\nStage 1: per-trial treatment effects on the surrogate and on the true endpoint.\n  - continuous true endpoint -> PROC GLM by trial;\n  - time-to-event true endpoint -> PROC PHREG by trial (log-hazard ratio for arm).\nStage 2: PROC MIXED (random trial intercepts) or weighted PROC REG of the true-endpoint effect on the\nsurrogate-endpoint effect; the model R-square approximates R^2_trial.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Drug[Treatment / exposure] -->|effect on S| S[Surrogate endpoint S<br/>biomarker / rwPFS / lab]\n  Drug -->|effect on T| T[True clinical endpoint T<br/>survival / morbidity]\n  S -.individual-level association.-> T\n  Drug --> Q{Does effect on S<br/>predict effect on T?}\n  Q -->|trial-level R2 high| Valid[Validated surrogate<br/>licenses inference about T]\n  Q -->|trial-level R2 low / unknown| Danger[NOT a decision surrogate<br/>effect on S may not propagate to T]\n  Bypass[Off-target / S-bypassing<br/>mechanism or harm] -.-> Danger",
        "caption": "The two paths from treatment to outcome. Individual-level association of S with T is necessary but not sufficient; only a high trial-level effect-on-effect relationship (does the treatment effect on S predict the treatment effect on T across trials?) licenses using S to decide about T. A drug that moves S through an S-bypassing mechanism, or that carries off-target harm, breaks surrogacy even when S is prognostic.",
        "alt_text": "Diagram showing treatment affecting both a surrogate endpoint and a true endpoint, with individual-level association between them, and a decision node on whether the effect on the surrogate predicts the effect on the true endpoint, branching to validated versus not-a-decision-surrogate.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Start[RWE surrogate question] --> Have{Trial-level-validated surrogate<br/>exists for this disease/mechanism?}\n  Have -->|No| Stop[Do not infer the clinical endpoint<br/>from the surrogate effect]\n  Have -->|Yes| Pop{Does the validated relationship<br/>transport to my RW population?}\n  Pop -->|Argue explicitly| Asc{Is surrogate ascertainment<br/>balanced across arms?}\n  Asc -->|Differential imaging/visit cadence| Fix[Audit + adjust or restrict<br/>to a fixed assessment grid]\n  Asc -->|Balanced| Comp[Model death as competing risk;<br/>exclude MA-only person-time]\n  Fix --> Comp\n  Comp --> Sens[Sensitivity: grace period, code lists,<br/>cadence adjustment, competing-risk handling]\n  Sens --> Claim[Transport-justified, ascertainment-audited<br/>claim with assumptions exposed]",
        "caption": "Decision logic for using a surrogate in real-world data. RWE rarely establishes trial-level surrogacy; the work is (1) confirming a validated surrogate exists, (2) justifying transport to the real-world population, (3) auditing differential ascertainment, and (4) sensitivity-testing the operational choices.",
        "alt_text": "Decision flowchart for real-world surrogate use, starting with whether a trial-level-validated surrogate exists, then transportability, then ascertainment balance with an audit-and-adjust branch, then competing-risk handling and sensitivity analyses, ending in a transport-justified claim.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "see_also",
        "target_slug": "outcome-algorithm-construction-rwe",
        "notes": "A real-world surrogate (e.g., rwPFS proxied by therapy switch or progression coding) is operationalized as an outcome algorithm; surrogate validation depends on, but is not a variant of, outcome construction."
      },
      {
        "relation_type": "see_also",
        "target_slug": "real-world-progression-rwpfs-rwe",
        "notes": "rwPFS is the most common real-world surrogate; its measurement properties (non-protocolized imaging cadence, ascertainment bias) are the dominant threat when used as a surrogate for overall survival."
      },
      {
        "relation_type": "used_with",
        "target_slug": "competing-risks-cause-specific-fine-gray-rwe",
        "notes": "Death is a competing risk for most intermediate/surrogate events in elderly RWE cohorts and should be handled as a competing event (cause-specific hazards with cumulative incidence functions) rather than as non-informative censoring."
      },
      {
        "relation_type": "used_with",
        "target_slug": "target-trial-emulation",
        "notes": "When a surrogate is the primary RWE endpoint, the surrogate effect should be estimated within an explicit target-trial structure (time zero, eligibility, assignment) to keep the effect-on-surrogate interpretable."
      },
      {
        "relation_type": "see_also",
        "target_slug": "generalizability-transportability-external-validity-rwe",
        "notes": "Applying a surrogate validated in trials to a real-world population is fundamentally a transportability problem for the surrogate-outcome relationship."
      },
      {
        "relation_type": "see_also",
        "target_slug": "single-arm-external-control",
        "notes": "Single-arm and externally-controlled rare-disease studies frequently rely on surrogates because the clinical endpoint cannot accrue; the validation burden is correspondingly higher."
      },
      {
        "relation_type": "see_also",
        "target_slug": "therapeutic-area-specific-rwe-challenges-oncology",
        "notes": "Oncology is where surrogate-based (PFS/rwPFS) accelerated approvals and their reversals are most consequential."
      },
      {
        "relation_type": "see_also",
        "target_slug": "regulatory-readiness-rwe",
        "notes": "HTA bodies (NICE, IQWiG, G-BA) and the FDA scrutinize trial-level surrogate validation before accepting surrogate effects as evidence; unvalidated surrogates are routinely rejected for cost-effectiveness inputs."
      }
    ],
    "aliases": [
      "surrogate endpoint validation",
      "surrogate marker validation",
      "surrogacy assessment",
      "validation of surrogate endpoints",
      "trial-level surrogacy"
    ],
    "complexity": "advanced",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "surveillance-detection-bias",
    "name": "Surveillance and Detection Bias",
    "short_definition": "A systematic error in which groups with more healthcare contacts accumulate more recorded diagnoses than groups with fewer contacts, regardless of true disease incidence — producing spurious apparent associations that mirror the gradient in observation intensity rather than any causal effect of the exposure.",
    "long_description": "**Surveillance and detection bias** (also called **ascertainment bias** or **diagnostic-intensity\nbias**) operates through a single mechanism: the ones you look at are the ones you find. When\ntwo study groups differ in how frequently they interact with the healthcare system, any outcome\nthat requires active medical contact to be recorded will appear more common in the\nhigher-surveillance group, even if the true incidence is identical. The apparent excess is not\na causal effect; it is a measurement artefact produced by asymmetric observation, not by\nasymmetric disease.\n\n**Mechanisms in RWE.** Detection bias arises through several well-documented pathways in\nobservational research. (1) *Monitoring-driven incidental findings*: patients initiating a new\ndrug return for follow-up visits mandated by prescribing guidelines or monitoring protocols.\nDiabetics on a new glucose-lowering agent who attend quarterly monitoring visits receive more\ndilated eye exams than untreated patients, so retinopathy, often asymptomatic at early stages,\nis coded more frequently in the drug arm. The drug does not cause retinopathy; the monitoring\nschedule causes the diagnosis. (2) *Screening-intensity differences*: PSA testing is not uniformly\ndistributed across the population. Men who see urologists or internists who recommend prostate\nscreening, or who participate in cancer-screening programs, will have more prostate cancer\n*diagnosed*, which is how expanded PSA screening historically produced an apparent\n\"epidemic\" of prostate cancer with no change in underlying biology. 
(3) *Severity-driven testing*:\nHaut and Pronovost (2011) demonstrated this mechanism in hospital quality research — institutions\nthat image more patients (e.g., aggressively screening for deep vein thrombosis with duplex\nultrasound in high-risk post-operative cohorts) detect and code more DVT events, so they appear\nto have worse DVT rates than hospitals that test selectively. More imaging produces more DVT on\npaper, not more DVT in veins. (4) *Specialty-care and pregnancy populations*: patients referred\nto specialists or managed through pregnancy programs have denser contact schedules, so\ncondition-specific diagnoses are recorded at rates that reflect access to specialist contact,\nnot disease biology. (5) *Asymmetric new-user ascertainment*: new initiators of a drug commonly\nreceive a burst of laboratory, imaging, and physical examination contacts in the first one to\nthree months after initiation that prevalent comparators never experience; any early-stage,\nincidentally detectable outcome will appear to cluster in the new-drug arm during this window.\n\n**The asymmetric-ascertainment problem in cohort comparisons.** In a standard new-user cohort,\nthe exposed arm accumulates more healthcare contacts in the peri-initiation period than the\ncomparator arm. If the outcome can be detected incidentally — without the patient seeking care\nspecifically for that condition — the early follow-up window produces a spurious apparent excess\nin the exposed group. A classic example is drug-cancer surveillance studies: patients starting\na medication under close monitoring receive more routine labs and imaging, leading to more\nincidental cancers found in the drug arm even if the drug has no carcinogenic effect. 
Conversely,\nif the comparator group has denser surveillance (e.g., an untreated high-risk registry cohort\nwith mandatory follow-up visits), the drug arm can appear falsely protective.\n\n**Distinguishing detection bias from related biases.** Detection bias is frequently confused\nwith two closely related phenomena, and the distinction matters for the choice of remedy.\n*Protopathic bias* (reverse causation) occurs when the unmeasured early symptoms of the outcome\ncause the exposure — a patient begins a drug because their incipient cancer is generating\nsymptoms, so the drug appears causally associated with cancer. In protopathic bias, the\noutcome's prodrome drives the exposure; in surveillance bias, it is the *measurement of the\noutcome* that differs between groups. *Confounding by indication (channeling)* occurs when the\nindication for treatment is itself a predictor of the outcome — treated and untreated patients\ndiffer in baseline disease burden, and that shared cause of treatment and outcome inflates or\nattenuates the association. In surveillance bias, groups may have identical underlying disease\nburden; what differs is how thoroughly outcomes are *observed and coded*. All three can coexist\nin the same study, but they call for distinct countermeasures: lag-time restrictions address\nprotopathic bias; active-comparator or covariate adjustment addresses channeling; and the\ndiagnostic-intensity toolkit (below) addresses surveillance bias.\n\n**Pros, cons, and trade-offs.** The term refers to a problem to control, so trade-offs are among\nthe mitigation strategies:\n- **Objective-severity outcomes vs surveillance-sensitive outcomes.** Restricting primary\n  analyses to outcomes that require emergency-level clinical presentation to be captured — MI\n  adjudicated by troponin, death, PE admitted to the emergency department — removes most\n  surveillance bias because no amount of differential monitoring changes whether a fatal event\n  is recorded. 
Cost: narrower scope; early-stage or incidentally detected outcomes (retinopathy,\n  microalbuminuria, screen-detected cancers) cannot be studied this way.\n- **Lag exclusion of the early monitoring window vs full follow-up.** Excluding the first\n  30–90 days of follow-up (the surveillance-dense new-user window) eliminates the artefactual\n  early excess but loses real early events and may induce selection bias if early dropout is\n  differential. **Prefer** a lag exclusion as a pre-specified sensitivity analysis rather than\n  the primary approach.\n- **Visit-count adjustment vs no adjustment.** Adjusting for pre-index outpatient visit counts\n  as a proxy for surveillance intensity can reduce detection bias. Cost: if visit count is on\n  the causal pathway between exposure and outcome (e.g., the drug causes more primary-care\n  contact which detects real disease), conditioning on it is a collider/mediator error that\n  creates new bias. **Restrict** the proxy to the pre-index lookback window only.\n- **Negative control outcomes vs trusting primary endpoints.** Running the analysis on an\n  outcome that cannot plausibly be caused by the exposure (e.g., an unrelated fracture in a\n  drug-cancer study) with the same case-finding algorithm and follow-up rules reveals residual\n  surveillance bias if the outcome appears differentially in one arm. Cost: a valid negative\n  control sharing the detection-intensity structure is hard to find. **Always** include at least\n  one negative control outcome when the primary outcome is surveillance-sensitive.\n- **Diagnostic-intensity stratification.** Stratifying or matching on pre-index visit counts,\n  specialist visits, or testing frequency and reporting results by stratum directly demonstrates\n  whether the effect estimate tracks surveillance intensity. 
Cost: reduces power per stratum and\n  requires enough covariate information to stratify meaningfully.\n\n**When to use** — i.e., when to actively diagnose and correct for surveillance and detection\nbias: any study in which the primary outcome requires medical contact to be coded (incidentally\ndetected malignancies, imaging-detected thrombosis, laboratory-identified metabolic\nabnormalities, ophthalmologic or audiologic findings); any new-user cohort where the peri-\ninitiation monitoring schedule differs between exposed and comparator; any hospital quality\nstudy where testing or imaging rates differ across institutions or patient subgroups; any\nvaccine effectiveness study where vaccinated patients have more primary-care visits than\nunvaccinated controls. In these settings, always pre-specify an objective-severity outcome as\nthe primary endpoint, include a surveillance-sensitive secondary endpoint for comparability,\nrun a negative control outcome, and report a sensitivity analysis excluding the early monitoring\nperiod.\n\n**When NOT to use — and when adjustment is actively misleading.**\n- **Do not adjust for post-baseline testing or visits.** Conditioning on downstream healthcare\n  contacts that were caused by the exposure (or by the emerging outcome) is a collider/mediator\n  adjustment that manufactures new bias. 
All surveillance-intensity proxies must be measured\n  strictly in the pre-index lookback.\n- **Do not assume objective outcomes are immune in all datasets.** In administrative claims,\n  even MI coded on a claims record requires that a patient present to a hospital and be billed;\n  if one group is systematically less likely to reach a hospital (e.g., uninsured, rural, or\n  enrolled in capitated Medicare Advantage with incomplete FFS claims), the apparently\n  \"objective\" outcome still carries ascertainment error.\n- **Do not treat the early-monitoring excess as causal.** A drug that is prescribed with\n  a mandatory 90-day follow-up protocol will show a spurious excess of any detectable outcome\n  in the first 90 days. Interpreting this excess as a drug effect can lead to incorrect safety\n  conclusions and label changes.\n- **Lag exclusions are not a complete fix.** Excluding the first N days removes the most\n  obvious surveillance-dense period, but residual differential monitoring persists throughout\n  follow-up whenever patients on the drug visit doctors more than comparators, or vice versa.\n  A lag exclusion reduces, but does not eliminate, the need for other diagnostics.\n\n**Data-source operational depth.**\n- **Claims (Medicare FFS / commercial):** The dominant failure mode is differential capture of\n  *incidentally detected* diagnoses. In the Medicare FFS population, any code appearing on a\n  carrier or outpatient-facility claim is driven by a billed encounter — no visit, no code.\n  A beneficiary who transitions to Medicare Advantage mid-study generates no FFS claims during\n  the MA period, so \"zero diagnoses\" reflects missingness, not health. Restrict all surveillance-\n  sensitive outcome windows to FFS-observable enrollment. 
Use pre-index outpatient visit counts\n  (distinct outpatient encounter dates in the lookback) as the proxy for healthcare utilization\n  intensity, and match or adjust on this proxy.\n- **EHR:** Capture is encounter-driven and therefore inherently surveillance-dependent — sicker,\n  more engaged patients have more encounters and more coded findings. Differential ordering of\n  tests by provider (e.g., an endocrinologist ordering annual retinal photography vs a general\n  practitioner who does not) is a major source of detection bias within an EHR; restrict analyses\n  to patients with at least one qualifying monitoring encounter per year in both arms to equalize\n  the opportunity for detection.\n- **Registry:** Disease registries typically mandate standardized ascertainment, which equalizes\n  surveillance across participants and suppresses detection bias for registry-defined outcomes.\n  However, registry enrollment itself is surveillance-sensitive: patients must be seen at a\n  registry site to be enrolled, so the registry population is already conditionally more\n  healthcare-engaged. For outcomes defined by registry protocol (e.g., biopsy-confirmed\n  diagnosis), detection bias is minimized; for secondary outcomes derived from linked claims,\n  the claims-based limitations apply.\n- **Linked claims-EHR:** The ideal substrate for surveillance-sensitivity analyses — EHR provides\n  encounter-level detail and test ordering, while claims provides coverage completeness. Use\n  test-ordering rates (imaging studies, biopsies, specialist referrals) from the EHR to\n  construct a richer surveillance-intensity proxy and check whether the apparent exposure-outcome\n  association tracks the proxy. 
Discordant EHR-claims records for the same event may indicate\n  informative coding patterns that are themselves a form of detection bias.\n\n**Interpreting the output**\n\nIn the worked example below, two cohorts of 10,000 patients each have a true incidence of 10%.\nGroup A (high surveillance, detection sensitivity 0.90) has detected incidence 0.09; Group B\n(low surveillance, detection sensitivity 0.45) has detected incidence 0.045. The study reports\na detected RR of 2.0.\n\n*(1) Formal interpretation.* The detected relative risk of 2.0 compares the recorded incidence\nof 0.09 in Group A against 0.045 in Group B. This is an artefactual ratio: the true incidence\nin both groups is 0.10, making the true RR 1.0. The estimand is the *detected* incidence ratio,\na ratio of how often the outcome was recorded, not how often it biologically occurred. Under\nequal true incidence, any detected RR departing from 1.0 is attributable entirely to the\ndifferential detection probability (0.90 vs 0.45). No causal claim about the exposure is\nwarranted; the two detection-sensitivity values are the source of the ratio, not any treatment\neffect.\n\n*(2) Practical interpretation.* A naive analyst reporting RR = 2.0 would conclude that patients\nin Group A have twice the disease incidence of Group B. In reality, both groups are equally\nsick: a true case in Group A was simply twice as likely to be found and coded, because its\npatients were examined more often. The ratio of detection sensitivities (0.90 / 0.45 = 2.0)\nmaps directly onto the detected RR (2.0). Decision-makers relying on the crude detected RR\nwould conclude that Group A carries excess harm when none exists. The fix is to use an outcome with near-identical detection sensitivity in both\ngroups (e.g., hospitalized MI), or to document the surveillance difference explicitly as a\nstudy limitation and report a bound on the true RR consistent with plausible detection\nsensitivity values.",
    "primary_category": "Bias_Control",
    "tags": [
      "surveillance-bias",
      "detection-bias",
      "ascertainment-bias",
      "incidental-finding",
      "differential-monitoring",
      "bias",
      "confounding",
      "claims",
      "pharmacoepidemiology",
      "new-user",
      "drug-cancer-safety"
    ],
    "applies_to_study_types": [
      "cohort_retrospective",
      "claims_analysis",
      "ehr_study",
      "comparative_effectiveness_research",
      "new_user"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1001/jama.2011.822",
        "url": "https://doi.org/10.1001/jama.2011.822",
        "citation_text": "Haut ER, Pronovost PJ. Surveillance bias in outcomes reporting. JAMA. 2011;305(23):2462-2463.",
        "year": 2011,
        "authors_short": "Haut & Pronovost",
        "notes": "Canonical demonstration that hospitals performing more DVT surveillance imaging detect more DVT and appear to have worse outcomes — naming surveillance bias as a systematic measurement artefact driven by differential testing frequency, not differential disease."
      },
      {
        "role": "demonstrate",
        "doi": "10.1097/EDE.0b013e3182093a0f",
        "url": "https://doi.org/10.1097/EDE.0b013e3182093a0f",
        "citation_text": "Suissa S, Dell'Aniello S, Vahey S, Renoux C. Time-window bias in case-control studies: statins and lung cancer. Epidemiology. 2011;22(2):228-231.",
        "year": 2011,
        "authors_short": "Suissa et al.",
        "notes": "Shows that varying the observation-time window in drug-cancer case-control studies creates differential outcome detection — cases observed over longer histories accumulate more diagnoses, producing spurious apparent associations between statins and lung cancer that are an artefact of asymmetric ascertainment rather than pharmacologic effect."
      }
    ],
    "plain_language_summary": "Surveillance and detection bias happens when patients in one study group see doctors more often than patients in another group, causing more diagnoses to be recorded for the first group — not because they are sicker, but because they are observed more intensively. The extra diagnoses are a product of the extra looking, not the extra disease. In drug studies, patients starting a new medication often have mandatory follow-up visits that patients in the comparison group do not, so conditions found incidentally (such as early-stage tumors or silent clots) pile up in the drug arm and falsely suggest the drug causes them. The safest protection is to use outcomes that require emergency-level events to be recorded regardless of how often a patient sees a doctor, and to test whether the apparent difference between groups tracks healthcare-contact frequency rather than disease.",
    "key_terms": [
      {
        "term": "detection sensitivity",
        "definition": "The probability that a person who truly has the condition will have it found and coded during a given period of observation — higher surveillance means a higher chance of detecting existing disease, not necessarily more disease."
      },
      {
        "term": "incidental finding",
        "definition": "A diagnosis made during a routine check-up or monitoring visit because a test was run, not because the patient reported symptoms — these findings accumulate faster in groups that receive more frequent medical contact."
      },
      {
        "term": "asymmetric ascertainment",
        "definition": "When the chance of recording an outcome differs between study groups because one group is watched more closely, creating a measurement difference that does not reflect a biological difference."
      },
      {
        "term": "negative control outcome",
        "definition": "An outcome that cannot plausibly be caused or prevented by the exposure being studied, used to reveal hidden differences in surveillance intensity between groups — if that outcome also shows a group difference, the cause is the watching, not the treatment."
      },
      {
        "term": "lag exclusion period",
        "definition": "A fixed window at the beginning of follow-up that is removed from the analysis to exclude outcomes detected during the dense initial wave of monitoring visits that often follows a new drug prescription."
      },
      {
        "term": "healthcare utilization intensity",
        "definition": "How often a patient uses healthcare services — measured as outpatient visit counts, specialist visits, or distinct encounter dates — used as a proxy for how thoroughly a patient's medical history is observed and coded."
      }
    ],
    "worked_example": {
      "scenario": "A claims database study compares two cohorts of 10,000 patients each. Both cohorts truly have a 10% incidence of early-stage retinopathy. Group A consists of patients who started a new diabetes drug with mandatory quarterly ophthalmology visits — their retinopathy detection sensitivity is 0.90, meaning 90% of true cases are found and coded. Group B is a propensity- matched comparison group managed in usual care, who see an ophthalmologist on average once in the follow-up year — their detection sensitivity is 0.45. The study analyst counts coded retinopathy diagnoses and calculates a relative risk comparing the two groups.",
      "dataset": {
        "caption": "Study parameters for two equally diseased cohorts that differ only in ophthalmology visit frequency and, consequently, in detection sensitivity.",
        "columns": [
          "group",
          "true_incidence",
          "screening_intensity",
          "detection_sensitivity",
          "detected_incidence"
        ],
        "rows": [
          [
            "Group A (high surveillance)",
            0.1,
            "High (quarterly visits)",
            0.9,
            0.09
          ],
          [
            "Group B (low surveillance)",
            0.1,
            "Low (annual visit)",
            0.45,
            0.045
          ]
        ]
      },
      "steps": [
        "Both groups have a true incidence of 10% (0.10). There are 1,000 truly affected patients in each group of 10,000.",
        "Group A has detection sensitivity 0.90 because quarterly eye exams catch nearly all existing cases. Detected incidence A = 0.90 * 0.10 = 0.09. The study records 900 cases in Group A.",
        "Group B has detection sensitivity 0.45 because only about half of true cases are found in a single annual exam. Detected incidence B = 0.45 * 0.10 = 0.045. The study records 450 cases in Group B.",
        "Spurious detected-RR = 0.09 / 0.045 = 2.0. The study reports that Group A has twice the retinopathy incidence of Group B.",
        "The true RR is 1.0 because both groups have identical underlying disease. The entire apparent doubling is produced by the difference in observation intensity (0.90 vs 0.45), not by the drug."
      ],
      "result": "Detected incidence A = 0.90 * 0.10 = 0.09; detected incidence B = 0.45 * 0.10 = 0.045; spurious detected-RR = 0.09 / 0.045 = 2.0; true RR = 1.0. The 2-fold apparent excess in Group A is entirely attributable to the 2-fold difference in screening intensity, not to any causal effect of the exposure."
    },
    "prerequisites": [
      "healthy-user-bias",
      "negative-control-outcomes-rwe",
      "hcru-healthcare-resource-utilization"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "New-user peri-initiation surveillance burst",
        "description": "Exclude the first 30–90 days of follow-up (or match on pre-index visit count) to remove outcomes found during the dense early-monitoring window that follows drug initiation; re-run the primary analysis restricted to days 91 onward as a pre-specified sensitivity check.",
        "edge_cases": [
          "The lag cutoff must be pre-specified and clinically motivated; post hoc selection of the lag that attenuates the result is p-hacking.",
          "Some real drug effects begin within the first 30 days (e.g., hypersensitivity, acute hepatotoxicity) and will be lost; the lag exclusion should never be the primary analysis."
        ],
        "data_source_notes": "claims: exclude all outcome-coded claims where service_date < index_date + lag_days; apply the identical exclusion to the comparator cohort. ehr: apply the lag to encounter date of the first coded finding, not to order date."
      },
      {
        "name": "Diagnostic-intensity stratification sensitivity analysis",
        "description": "Stratify the cohort into tertiles or quartiles of pre-index outpatient visit count (or testing frequency) and report the exposure-outcome OR/RR within each stratum; a monotone dose-response of the effect estimate with surveillance intensity is strong evidence that detection bias, not causal effect, drives the result.",
        "edge_cases": [
          "Coarsening into only two strata (high vs low) can obscure the dose-response; prefer three or four strata or a continuous visit-count covariate.",
          "In small studies, strata may be underpowered; pre-specify the minimum cell size."
        ],
        "data_source_notes": "claims: count distinct outpatient service dates in [index_date - 365, index_date); exclude inpatient days to isolate ambulatory surveillance. ehr: count distinct clinic encounter dates in the pre-index year."
      },
      {
        "name": "Negative control outcome falsification",
        "description": "Apply the identical case-finding algorithm, follow-up window, and outcome-coding rules to an outcome that has equal plausible detection sensitivity in both arms (e.g., a traumatic fracture requiring emergency presentation, or accidental injury hospitalisation); any asymmetric detected rate on the negative control outcome quantifies the residual surveillance-bias magnitude in the primary estimate.",
        "edge_cases": [
          "The negative control outcome must share the surveillance mechanism (same claims and encounter type) without a plausible causal path from the exposure.",
          "If the drug systematically changes the probability that patients seek emergency care (e.g., sedative effects raising fall risk), the negative control may not be truly negative."
        ],
        "data_source_notes": "claims/ehr: code the negative control event with identical enrollment, washout, and first-event deduplication rules as the primary outcome; report the negative- control RR alongside the primary RR as a calibration anchor."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Restricting to objective-severity primary outcomes",
        "pros_of_this": "Surveillance and detection bias diagnostics allow the analyst to study surveillance-sensitive outcomes and understand the degree of artefact rather than ignoring those outcomes entirely.",
        "cons_of_this": "Even with diagnostics, residual surveillance bias in incidentally detected outcomes may be impossible to fully remove; a hospitalization-only or death-based primary outcome is structurally immune.",
        "when_to_prefer": "When the research question specifically requires a surveillance-sensitive outcome (e.g., early-stage cancer detection, retinopathy, silent thrombosis) and the study cannot be redesigned around an objective endpoint — document intensity differences explicitly and report a bound on the true RR."
      },
      {
        "compared_to": "Lag exclusion of the monitoring window",
        "pros_of_this": "The lag exclusion is simple and removes the most egregious artefact of the peri-initiation monitoring burst without requiring a proxy for surveillance intensity.",
        "cons_of_this": "Does not address differential monitoring throughout the rest of follow-up; removes real early events; requires clinical judgment to set the lag duration.",
        "when_to_prefer": "As a pre-specified sensitivity analysis in every new-user study with a surveillance-sensitive primary endpoint; combine with diagnostic-intensity stratification rather than treating it as a standalone fix."
      },
      {
        "compared_to": "Healthcare-utilization-intensity adjustment",
        "pros_of_this": "Adjusting or matching on pre-index visit count directly targets the surveillance- intensity mechanism and is feasible in any claims or EHR dataset without requiring additional data.",
        "cons_of_this": "Overadjustment risk if visit count is on the causal pathway (the drug causes more primary-care contact that detects real outcomes); underadjustment if pre-index visits do not proxy post-index surveillance well; cannot adjust for differential specialist visit type.",
        "when_to_prefer": "When a pre-index lookback is observable, the exposure does not change primary-care visit frequency, and a valid proxy variable can be constructed from distinct outpatient encounter dates in a clean enrollment window."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Construct pre-index visit-count proxy from distinct outpatient carrier/professional service dates in the 365-day lookback under continuous FFS A/B enrollment (exclude inpatient days and MA-only spans). Apply identical case-finding rules in both arms; any code-position or window asymmetry between arms is itself a source of detection bias. Identify surveillance-sensitive secondary outcomes (retinopathy, DVT, screen-detected malignancy) alongside objective-severity primaries (MI, all-cause mortality) and report both; the gap between the two effect estimates bounds the surveillance-bias contribution.",
      "ehr": "Use distinct encounter dates and test-order records as the surveillance proxy; check whether the exposed arm has systematically more test orders of the type that finds the primary outcome in the pre-index and post-index periods separately. Include site (provider practice) as a fixed effect to absorb systematic testing culture differences across practices.",
      "registry": "Registry-defined outcomes (requiring adjudication or biopsy confirmation) are largely immune to detection bias for the registry's primary endpoints. For secondary outcomes derived from linked claims or EHR, apply the claims-based diagnostics. Use registry-adjudicated events as external calibration anchors for claims-based case finding.",
      "linked": "Use EHR test-ordering records to construct a richer surveillance proxy (imaging orders, biopsy orders, specialist referral frequency) than visit count alone provides; crosscheck claims-coded diagnoses against EHR adjudicated findings to estimate false-positive rates by exposure arm — a higher false-positive rate in one arm is a direct measure of differential detection."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\nimport pandas as pd\n\n# ── Part 1: Worked-example arithmetic ────────────────────────────────────────\ntrue_incidence        = 0.10                           # equal in both groups\ndetection_sensitivity = {\"A\": 0.90, \"B\": 0.45}        # A = 2x screening\n\nfor grp, sens in detection_sensitivity.items():\n    detected_inc = sens * true_incidence               # 0.90*0.10=0.09; 0.45*0.10=0.045\n    print(f\"Group {grp}: detected incidence = {sens} * {true_incidence} = {detected_inc:.3f}\")\n\nspurious_rr = (detection_sensitivity[\"A\"] * true_incidence) / \\\n              (detection_sensitivity[\"B\"] * true_incidence)   # = 0.09 / 0.045 = 2.0\nprint(f\"\\nSpurious detected-RR = {spurious_rr:.2f}  (true RR = 1.0)\")\nprint(\"Mechanism: RR tracks the ratio of detection sensitivities (0.90 / 0.45 = 2.0),\")\nprint(\"           not a causal difference in disease burden.\")\n\n# ── Part 2: Diagnostic — detected-outcome rate by surveillance tertile ────────\nrng = np.random.default_rng(42)\nn   = 4000\n\n# Pre-index outpatient visit count (Poisson, mean 5 for exposed, 2.5 for comparator)\nexposed      = rng.binomial(1, 0.50, n)\nvisit_count  = rng.poisson(5 * exposed + 2.5 * (1 - exposed))\n\n# True event: same probability regardless of visit count or exposure\ntrue_event = rng.binomial(1, 0.10, n)\n\n# Detection probability: increases with visit count (surveillance mechanism)\ndetect_prob   = np.clip(0.20 + 0.12 * visit_count, 0, 0.95)\ndetected_event = rng.binomial(1, true_event * detect_prob)  # detected only if truly diseased\n\ndf = pd.DataFrame(dict(exposed=exposed, visit_count=visit_count,\n                       true_event=true_event, detected_event=detected_event))\n\n# Surveillance-intensity tertiles (pre-index visit count)\ndf[\"visit_tertile\"] = pd.qcut(df[\"visit_count\"], q=3, labels=[\"Low\", \"Mid\", \"High\"])\n\nsummary = (df.groupby([\"visit_tertile\", \"exposed\"])\n             
.agg(n=(\"detected_event\",\"size\"),\n                  detected_rate=(\"detected_event\",\"mean\"),\n                  true_rate=(\"true_event\",\"mean\"))\n             .round(3))\n\nprint(\"\\nDetected vs true event rate by surveillance intensity and exposure:\")\nprint(summary.to_string())\nprint(\"\\nIf detected_rate tracks visit_tertile but true_rate does not,\")\nprint(\"the visit-count gradient IS the bias — not a causal effect.\")",
        "description": "Demonstrate surveillance bias: compute detected incidence under asymmetric screening\nintensities and show that the spurious RR tracks the ratio of detection sensitivities,\nnot the true RR. Then build a diagnostic: stratify detected-outcome rates by pre-index\noutpatient visit count tertile to reveal the surveillance-intensity dose-response.\nRequired inputs for the diagnostic: one-row-per-patient dataframe with columns\nvisit_count (integer, pre-index outpatient encounters), exposed (0/1), true_event (0/1),\ndetected_event (0/1 — 1 only if true_event=1 and patient was sufficiently observed).",
        "dependencies": [
          "pandas",
          "numpy"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\n\n# ── Part 1: Worked-example arithmetic ──────────────────────────────────────────\ntrue_incidence        <- 0.10\ndetection_sensitivity <- c(A = 0.90, B = 0.45)\n\ndetected_incidence <- detection_sensitivity * true_incidence  # 0.09; 0.045\nspurious_rr        <- detected_incidence[[\"A\"]] / detected_incidence[[\"B\"]]  # 2.0\n\nfor (grp in names(detection_sensitivity)) {\n  cat(sprintf(\"Group %s: detected incidence = %.2f * %.2f = %.3f\\n\",\n              grp, detection_sensitivity[[grp]], true_incidence,\n              detected_incidence[[grp]]))\n}\ncat(sprintf(\"\\nSpurious detected-RR = %.2f  (true RR = 1.0)\\n\", spurious_rr))\ncat(\"Mechanism: RR tracks the ratio of detection sensitivities (0.90 / 0.45 = 2.0)\\n\")\n\n# ── Part 2: Diagnostic — detected-outcome rate by surveillance tertile ──────────\nset.seed(42)\nn <- 4000L\nexposed     <- rbinom(n, 1, 0.50)\nvisit_count <- rpois(n, lambda = 5 * exposed + 2.5 * (1 - exposed))\ntrue_event  <- rbinom(n, 1, 0.10)                           # same probability everywhere\ndetect_prob <- pmin(0.20 + 0.12 * visit_count, 0.95)        # detection rises with visits\ndetected_event <- rbinom(n, 1, true_event * detect_prob)    # detected only if truly diseased\n\ndt <- data.table(exposed, visit_count, true_event, detected_event)\ndt[, visit_tertile := cut(visit_count,\n                          breaks = quantile(visit_count, c(0, 1/3, 2/3, 1)),\n                          include.lowest = TRUE, labels = c(\"Low\", \"Mid\", \"High\"))]\n\nsummary_dt <- dt[, .(n             = .N,\n                      detected_rate = round(mean(detected_event), 3),\n                      true_rate     = round(mean(true_event), 3)),\n                 by = .(visit_tertile, exposed)]\n\ncat(\"\\nDetected vs true event rate by surveillance intensity and exposure:\\n\")\nprint(summary_dt[order(visit_tertile, exposed)])\ncat(\"\\nIf detected_rate tracks visit_tertile but true_rate does 
not,\\n\")\ncat(\"the gradient IS the surveillance artefact, not a drug effect.\\n\")",
        "description": "Demonstrate surveillance bias in R: compute detected incidence for two screening\nintensities, then run a diagnostic stratifying detected-outcome rates by pre-index\nvisit-count tertile to expose the surveillance-intensity dose-response.\nInputs match the Python version: one row per patient with exposed, visit_count,\ntrue_event, detected_event columns.",
        "dependencies": [
          "data.table"
        ],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  TrueD[\"True disease exists\\n(equal incidence in both groups)\"]\n  Surv[\"Healthcare contact frequency\\n(monitoring visits, specialist care)\"]\n  Tests[\"Tests and imaging ordered\\nper contact\"]\n  Found[\"Outcome found and coded\\n(detected incidence)\"]\n  Spurious[\"Spurious detected-RR > 1\\neven though true RR = 1\"]\n\n  TrueD --> Tests\n  Surv -->|determines test frequency| Tests\n  Tests --> Found\n  Found --> Spurious\n\n  NegCtrl[\"Negative control outcome\\n(equal detection sensitivity both arms)\"]\n  Surv --> NegCtrl\n  NegCtrl -.->|any detected difference = bias signal| Spurious\n\n  style Surv fill:#ffcccc,stroke:#cc0000\n  style Spurious fill:#ffe0cc,stroke:#cc6600\n  style NegCtrl fill:#cce5ff,stroke:#3366cc",
        "caption": "Mechanism of surveillance and detection bias. Differential healthcare contact (red) drives differential testing, which drives differential outcome detection regardless of equal true disease. A negative control outcome (blue) with equal surveillance sensitivity in both arms reveals the artefact — any detected difference in the control outcome quantifies the bias magnitude.",
        "alt_text": "Flowchart showing that true disease and healthcare contact frequency both feed into ordered tests, which feed into coded outcomes and a spurious RR greater than 1, with a negative control outcome showing the same surveillance gradient without a causal path.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Q[Outcome is surveillance-sensitive?\\ne.g. early-stage, incidentally detected] --> ObjAlt{Objective-severity\\nalternative available?}\n  ObjAlt -->|Yes| Prim[Use objective endpoint as primary\\nretain surveillance-sensitive as secondary]\n  ObjAlt -->|No| Diag[Run surveillance-bias diagnostics]\n  Diag --> Lag[Lag exclusion sensitivity analysis\\nexclude first 30-90 days]\n  Diag --> Strat[Stratify by pre-index visit count\\ncheck if RR tracks intensity]\n  Diag --> NC[Negative control outcome\\nwith same case-finding rules]\n  Lag --> RR{Effect estimate\\nstable after lag?}\n  Strat --> RR\n  NC --> RR\n  RR -->|Yes stable| Report[Report with diagnostic evidence\\nand E-value for residual intensity difference]\n  RR -->|No unstable| Warn[Flag strong surveillance-bias risk\\ndo not interpret as causal]",
        "caption": "Decision logic for detecting and communicating surveillance bias. When the primary outcome requires medical contact to be coded, pursue an objective-severity alternative as the primary endpoint and use the surveillance-sensitive outcome as a secondary. If only the surveillance-sensitive endpoint is feasible, run all three diagnostics and report whether the estimate is stable before drawing causal conclusions.",
        "alt_text": "Decision flowchart from surveillance-sensitive outcome identification through objective endpoint choice or diagnostic analysis, to either reporting with caveats or flagging as unreliable.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "see_also",
        "target_slug": "protopathic-bias-reverse-causation",
        "notes": "Detection bias and protopathic bias are frequently confused. Protopathic bias occurs when early-onset outcome symptoms drive the exposure; detection bias occurs when measurement of the outcome differs between groups. Both can require lag-time restrictions as part of a remedy but arise through distinct mechanisms."
      },
      {
        "relation_type": "see_also",
        "target_slug": "confounding-by-indication-channeling",
        "notes": "Confounding by indication inflates or attenuates the association because the disease indication is a common cause of exposure and outcome; detection bias inflates it because the observation intensity differs. All three (confounding, protopathic, detection) can coexist and their countermeasures differ."
      },
      {
        "relation_type": "used_with",
        "target_slug": "negative-control-outcomes-rwe",
        "notes": "Negative control outcomes with similar detection sensitivity in both arms are the primary diagnostic for surveillance bias — a detected difference in the negative control outcome quantifies the residual ascertainment asymmetry in the primary estimate."
      },
      {
        "relation_type": "see_also",
        "target_slug": "test-negative-design-rwe",
        "notes": "The test-negative design equalizes care-seeking by conditioning on care presentation and testing, which directly removes a major source of differential detection; it is the structural analog of detection-bias adjustment for vaccine effectiveness studies."
      },
      {
        "relation_type": "see_also",
        "target_slug": "hcru-healthcare-resource-utilization",
        "notes": "Pre-index outpatient visit count and healthcare resource utilization measures are the principal proxy for surveillance intensity used in diagnostic stratification and covariate adjustment for detection bias."
      },
      {
        "relation_type": "see_also",
        "target_slug": "diagnosis-phenotype-algorithm-1ip-2op-time-window-rwe",
        "notes": "Algorithm stringency interacts with surveillance bias — a 2-OP window requires two outpatient contacts, so cohorts with fewer visits will have systematically lower apparent incidence for any condition requiring repeated outpatient coding, compounding the ascertainment asymmetry."
      }
    ],
    "aliases": [
      "surveillance bias",
      "detection bias",
      "ascertainment bias",
      "diagnostic-intensity bias",
      "differential ascertainment",
      "incidental diagnosis bias"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "survey-weights-complex-sampling-rwe",
    "name": "Survey Weights and Complex Sampling",
    "short_definition": "Design-based estimation that incorporates sampling weights, stratification, and clustering so that point estimates and (critically) variances from a complex probability survey are unbiased for, and generalizable to, the finite target population the sample was drawn to represent.",
    "long_description": "**Survey weights and complex sampling** are the design-based machinery that converts a non-self-weighting probability\nsample into estimates for a defined finite target population. Three design features must be carried through every\nanalysis. (1) The **sampling weight** for a respondent is (roughly) the inverse of their probability of selection,\noften further adjusted for nonresponse and post-stratified/calibrated to known population totals; it tells you how many\npopulation members each respondent \"stands for.\" (2) **Stratification** (e.g., by region, age, race/ethnicity) means the\nsample was drawn within fixed subpopulations, which *reduces* variance and must be acknowledged or standard errors are\noverstated. (3) **Clustering** (multistage selection of PSUs such as counties, then households, then persons) means\nobservations within a PSU are correlated, which *inflates* variance via the design effect (DEFF) and must be acknowledged\nor standard errors are badly understated. In RWE/HEOR this is the engine behind nationally representative estimates from\nNHANES, MEPS, NHIS, NIS/HCUP, and MCBS — the source of population denominators, prevalence, and per-capita cost and\nutilization figures that feed budget-impact and cost-of-illness models.\n\n**Core conceptual distinction.** Two distinctions are doing the work. First, *design-based vs model-based inference*:\nsurvey-weighted (Horvitz–Thompson / design-based) estimation targets a finite-population quantity (the mean/total/\nproportion that exists in the actual population) and gets its variance from the sampling design, not from an assumed data-\ngenerating model. Second — and this is the one analysts get wrong — *when weighting actually helps*. For **descriptive\npopulation estimands** (national mean cost, prevalence, total person-years), weights are mandatory: an unweighted mean\nestimates the mean of the *sample*, which is not the population mean when selection probabilities differ. 
For **causal\nregression estimands**, weighting is more subtle: Solon, Haider & Wooldridge (2015) show that if your conditional model is\ncorrectly specified and selection is exogenous to the error, sampling weights add variance without removing bias; weights\nare warranted when the model is misspecified and you want a population-averaged effect, when selection is on the outcome\n(endogenous sampling), or when heterogeneous effects make the population-weighted average the estimand of interest. A\nseparate, frequently conflated object is the **inverse-probability-of-treatment weight (IPTW)** from propensity scores:\nIPTW and sampling weights share Horvitz–Thompson arithmetic but answer different questions (IPTW balances confounders for\na causal contrast; sampling weights restore population representativeness). The genuine RWE bridge between the two is the\n**inverse-probability-of-sampling weight (IPSW)** used to transport a trial or cohort effect to a target population\n(Cole & Stuart 2010) — that is a survey-weighting idea applied to generalizability, not the same thing as IPTW.\n\n**Pros, cons, and trade-offs.**\n- **vs naive unweighted analysis of survey data:** Design-based estimation yields the right population point estimate and,\n  just as importantly, the right *variance* (Taylor linearization or replicate weights). Cost: estimators are more complex,\n  weighted estimates have higher variance than a hypothetical equal-probability sample of the same size (the price of\n  unequal selection, quantified by DEFF/UWE), and subgroup precision can be poor. 
**Always prefer** design-based estimation\n  for any inferential claim about the population a survey was designed to represent.\n- **vs treating survey weights as frequency/precision weights in an off-the-shelf GLM:** Frequency-weight software returns\n  the correct weighted point estimate but **model-based standard errors that ignore stratification and clustering** — almost\n  always too small (clustering dominates), producing falsely narrow CIs and inflated type-I error. **Prefer** true survey\n  procedures (`svyglm`, `PROC SURVEYREG`/`SURVEYLOGISTIC`, `samplics`) that consume the full design object.\n- **vs model-based / multilevel modeling of the same clustered data:** A mixed model can target a similar variance\n  structure and is more efficient under correct specification, but it is model-dependent and targets a different (often\n  cluster-conditional) estimand. **Prefer** design-based estimation when the deliverable is a population total/mean for an\n  agency or payer and robustness to model misspecification matters more than efficiency.\n\n**When to use.** The data come from a complex probability survey (NHANES, MEPS, NHIS, MCBS, NIS/HCUP, BRFSS) or any sample\ndrawn with unequal selection probabilities, stratification, or clustering, *and* the estimand is a finite-population\nquantity (prevalence, national totals, per-capita cost/utilization, population-representative regression coefficients). Use\nthe survey-provided weight, strata, and PSU variables exactly as documented; use the *longitudinal* weight (not the annual\ncross-sectional weight) for panel estimands such as two-year MEPS expenditures; and use replicate weights (BRR/jackknife)\nwhen the survey ships them instead of, or alongside, Taylor-linearization design variables. 
The same machinery applies when\nyou construct IPSW to transport a study estimate to a named target population.\n\n**When NOT to use — and when it is actively misleading or dangerous.**\n- **Pure administrative claims/EHR cohorts carry no survey weights.** Applying invented \"weights\" to a convenience cohort\n  does not make it representative; the real problem there is *generalizability/transportability*, which is IPSW/standardization\n  territory, not design-based survey estimation. Fabricating weights manufactures false confidence in a non-probability sample.\n- **Subgroup (\"domain\") analysis by filtering the data frame.** This is the single most dangerous error: dropping rows for a\n  subpopulation deletes PSUs/strata, corrupts the degrees of freedom and the between-PSU variance, and yields wrong standard\n  errors. Domain estimation must keep the full design and *subset the design object* (R `subset()` on the design, SAS `DOMAIN`\n  statement). Filtering first can shift a CI enough to flip a significance call.\n- **Causal-effect estimation where the conditional model is correctly specified and selection is exogenous.** Per Solon et al.,\n  weighting here only inflates variance — report both, and do not weight reflexively.\n- **Ignoring the design entirely.** Treating a clustered, stratified sample as i.i.d. understates variance (often 1.5–3×\n  DEFF), turning noise into \"significant\" findings — the canonical reason payer/HTA reviewers reject survey-based estimates.\n\n**Data-source operational depth.**\n- **NHANES/MEPS/NHIS/MCBS (designed surveys):** Weights, strata (e.g., `SDMVSTRA`), and PSU (`SDMVPSU`) are documented and\n  must be used verbatim. 
Failure modes: (a) using the wrong weight when **combining survey cycles** (NHANES requires\n  constructing multi-cycle weights, e.g., dividing each 2-year weight by the number of cycles pooled); (b) using **MEC/exam weights** for\n  interview-only variables, or interview weights for lab variables — they have different nonresponse adjustments; (c) using a\n  **cross-sectional weight for a longitudinal estimand** (MEPS panel costs need the longitudinal weight that accounts for\n  wave attrition); (d) **domain subsetting by filtering** rather than the design object.\n- **NIS/HCUP (hospital-discharge survey):** The unit is the *discharge*, not the patient, and HCUP redesigned its weights\n  (pre/post-2012 trend weights). Failure modes: treating discharges as persons; ignoring the hospital cluster and discharge\n  weight, which makes national-estimate CIs far too narrow.\n- **Survey linked to claims/mortality (e.g., MEPS- or MCBS-linked Medicare, NHANES Linked Mortality):** Linkage gives true\n  longitudinal outcomes but **breaks representativeness** because only the consentable/linkable subset is retained;\n  differential consent by age/race/insurance must be addressed with linkage-eligibility-adjusted weights, or the survey\n  weight no longer maps to the population.\n- **Registries and convenience EHR samples:** Design weights rarely exist. At most, *post-stratification/raking* weights to\n  external population margins can be constructed, but these correct only for the margins used and cannot fix selection on\n  unmeasured variables — present as a generalizability adjustment with explicit assumptions, never as a true probability-design weight.\n\n**Worked HEOR example (MEPS national per-capita cost with correct domain estimation).** Question: mean annual total\nhealthcare expenditure among U.S. adults with diagnosed diabetes, for a budget-impact model denominator. 
MEPS is a\nstratified, multistage probability sample; the person-level file carries `PERWT` (person weight), `VARSTR` (variance\nstratum), and `VARPSU` (variance PSU). (1) Build the design object on the **full** file using `PERWT`, `VARSTR`, `VARPSU`\nwith nested-PSU handling and single-PSU-stratum adjustment. (2) Define the diabetes domain with an indicator (e.g., from\nthe condition file mapped to the person via `DUPERSID`), keeping every row. (3) Estimate the domain mean of total\nexpenditure (`TOTEXP`) via `svymean`/`subset(design,...)` (R) or `PROC SURVEYMEANS ... DOMAIN diabetes` (SAS), which yields\nthe population mean and a Taylor-linearized SE that respects strata and clusters. The **wrong way** — `df[df.diabetes==1]`\nthen a weighted mean — returns the same point estimate but a materially smaller, invalid SE because it discards the design\ninformation for excluded PSUs; in MEPS-scale data this routinely understates the SE by 20–40%, narrowing a 95% CI enough to\nmisstate affordability. Multiply the design-based per-capita mean by the population denominator for the budget-impact input,\nand propagate the design-based SE into the model's probabilistic sensitivity analysis rather than an i.i.d. SE.\n\n**Interpreting the output**\n\nConsider a 10-respondent NHANES-like sample: 4 non-Hispanic Black respondents each with a sampling weight of\n500, 4 Hispanic respondents each with a weight of 500, and 2 non-Hispanic White respondents each with a weight\nof 2,000. Unweighted mean SBP = 126.0 mmHg. Weighted mean SBP = 123.75 mmHg (weighted sum divided by total\nweight of 8,000).\n\n*(1) Formal statistical interpretation.* The weighted estimate of 123.75 mmHg is a design-consistent estimator\nof the population mean: each observation is scaled to represent the number of people in the target population\nit stands for. 
The unweighted estimate of 126.0 mmHg is biased because the two high-weight non-Hispanic White\nrespondents are under-represented in the sample relative to the population. Uncertainty for the weighted mean\nmust be quantified with Taylor linearization or balanced repeated replication to correctly account for the\ndesign effect; a simple i.i.d. standard error ignores clustering and stratification and understates the true\nsampling variability.\n\n*(2) Practical interpretation for a decision-maker.* The 2.25 mmHg difference between the weighted (123.75)\nand unweighted (126.0) estimates reflects the actual racial and ethnic composition of the target population,\nnot the sample composition. Using the unweighted mean for a national prevalence estimate or a budget-impact\nmodel would misrepresent the population burden. When the goal is a population-level statement — rather than\na description of who happened to be surveyed — always apply survey weights and report the design-based\nstandard error, not the simple standard error.",
    "primary_category": "Inferential_Statistics",
    "tags": [
      "survey-weights",
      "complex-sampling",
      "design-based-inference",
      "stratification",
      "clustering",
      "post-stratification",
      "generalizability",
      "inferential_statistics"
    ],
    "applies_to_study_types": [
      "survey_analysis",
      "cross_sectional_study",
      "registry_linkage",
      "comparative_effectiveness"
    ],
    "data_sources": [
      "survey",
      "linked",
      "registry",
      "claims"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.18637/jss.v009.i08",
        "url": "https://doi.org/10.18637/jss.v009.i08",
        "citation_text": "Lumley T. Analysis of complex survey samples. Journal of Statistical Software. 2004;9(8):1-19.",
        "year": 2004,
        "authors_short": "Lumley",
        "notes": "Definitive practical reference for design-based estimation (weights, strata, PSUs, replicate weights) and the R survey package that operationalizes it."
      },
      {
        "role": "explain",
        "doi": "10.3368/jhr.50.2.301",
        "url": "https://doi.org/10.3368/jhr.50.2.301",
        "citation_text": "Solon G, Haider SJ, Wooldridge JM. What are we weighting for? Journal of Human Resources. 2015;50(2):301-316.",
        "year": 2015,
        "authors_short": "Solon et al.",
        "notes": "The decisive treatment of when sampling weights belong in a regression versus when they only inflate variance; separates descriptive population estimands from causal model-based ones."
      },
      {
        "role": "demonstrate",
        "doi": "10.1177/096228029600500303",
        "url": "https://doi.org/10.1177/096228029600500303",
        "citation_text": "Pfeffermann D. The use of sampling weights for survey data analysis. Statistical Methods in Medical Research. 1996;5(3):239-261.",
        "year": 1996,
        "authors_short": "Pfeffermann",
        "notes": "Demonstrates weighted survey estimation and the concrete consequences of ignoring the design in point and variance estimation."
      },
      {
        "role": "use",
        "doi": "10.1093/aje/kwq084",
        "url": "https://doi.org/10.1093/aje/kwq084",
        "citation_text": "Cole SR, Stuart EA. Generalizing evidence from randomized clinical trials to target populations: the ACTG 320 trial. American Journal of Epidemiology. 2010;172(1):107-115.",
        "year": 2010,
        "authors_short": "Cole & Stuart",
        "notes": "Applies inverse-probability-of-sampling weighting (IPSW) to transport a trial effect to a target population — the survey-weighting bridge to RWE generalizability."
      }
    ],
    "plain_language_summary": "When a health survey deliberately recruits more people from some groups than others — so those groups are reliably measured — each respondent gets a sampling weight that tells you how many real people in the country that one person represents. You multiply each respondent's answer by their weight when computing any national average or count, so the final number reflects the whole population rather than the mix of people who happened to be recruited. Skipping the weights and computing a plain average gives the wrong answer whenever the groups that were oversampled differ from the rest of the country on the outcome you care about.",
    "key_terms": [
      {
        "term": "sampling weight",
        "definition": "A number assigned to each survey respondent that equals roughly how many people in the target population that one respondent stands for — larger for undersampled groups, smaller for oversampled groups."
      },
      {
        "term": "complex sampling",
        "definition": "A survey design that uses stratification and multi-stage selection (for example, first selecting counties, then households, then individuals) instead of drawing everyone at random from one big list."
      },
      {
        "term": "design effect",
        "definition": "A multiplier that describes how much wider a survey's confidence intervals are compared to what you would get from a simple random sample of the same size, due to clustering and unequal weights."
      },
      {
        "term": "stratification",
        "definition": "Dividing the population into non-overlapping groups (strata) before sampling, then drawing a separate sample from each group, so every group is guaranteed to appear in the data."
      }
    ],
    "worked_example": {
      "scenario": "A national health survey measures systolic blood pressure (SBP) in U.S. adults. To get reliable estimates for smaller racial/ethnic groups, the survey oversamples Non-Hispanic Black and Hispanic adults — meaning more of them are recruited relative to their true share of the population. Non-Hispanic White adults, the largest group, need far fewer additional recruits to be measured reliably. After the data are collected, the survey assigns each respondent a sampling weight. Here we have 10 respondents: 4 Non-Hispanic Black (each representing 500 people in the country), 4 Hispanic (each representing 500), and 2 Non-Hispanic White (each representing 2,000). We want the national mean SBP — not the mean of the recruited sample.",
      "dataset": {
        "caption": "Ten respondents from a complex survey, each assigned a sampling weight based on their group's selection probability.",
        "columns": [
          "person_id",
          "group",
          "sbp_mmhg",
          "sampling_weight"
        ],
        "rows": [
          [
            "P01",
            "NH Black",
            130,
            500
          ],
          [
            "P02",
            "NH Black",
            132,
            500
          ],
          [
            "P03",
            "NH Black",
            128,
            500
          ],
          [
            "P04",
            "NH Black",
            130,
            500
          ],
          [
            "P05",
            "Hispanic",
            126,
            500
          ],
          [
            "P06",
            "Hispanic",
            124,
            500
          ],
          [
            "P07",
            "Hispanic",
            125,
            500
          ],
          [
            "P08",
            "Hispanic",
            125,
            500
          ],
          [
            "P09",
            "NH White",
            120,
            2000
          ],
          [
            "P10",
            "NH White",
            120,
            2000
          ]
        ]
      },
      "steps": [
        "Unweighted mean (wrong for a national estimate): add all 10 SBP readings and divide by 10. Sum = 130+132+128+130+126+124+125+125+120+120 = 1,260. Unweighted mean = 1,260 / 10 = 126.0 mmHg. This treats every respondent as equally representative, but the two NH White respondents each stand for 2,000 people while the eight minority respondents each stand for only 500 — so the unweighted mean over-represents the minority groups.",
        "Weighted mean (correct national estimate): multiply each respondent's SBP by their sampling weight, sum those products, then divide by the total weight. Weighted sum = (130+132+128+130) x 500 + (126+124+125+125) x 500 + (120+120) x 2,000 = 520 x 500 + 500 x 500 + 240 x 2,000 = 260,000 + 250,000 + 480,000 = 990,000.",
        "Total weight = 4 x 500 + 4 x 500 + 2 x 2,000 = 2,000 + 2,000 + 4,000 = 8,000. This is the total number of population members represented by these 10 respondents.",
        "Weighted mean SBP = 990,000 / 8,000 = 123.75 mmHg. The two NH White respondents each represent 2,000 people (combined 4,000 of the 8,000 total weight), so the lower NH White SBP pulls the national estimate down relative to the unweighted value."
      ],
      "result": "Unweighted mean = 126.0 mmHg; weighted mean = 990,000 / 8,000 = 123.75 mmHg. The 2.25 mmHg gap arises entirely from oversampling: minority groups with higher average SBP make up 8 of 10 respondents but only 4,000 of 8,000 population units. The weighted estimate is the correct national figure; the unweighted one would overstate mean population SBP."
    },
    "prerequisites": [
      "descriptive-epidemiology-rwe",
      "prevalence-point-period-annual-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Taylor-linearization (design-variable) estimation",
        "description": "Variance is estimated analytically from the stratum and PSU design variables (weight, VARSTR/VARPSU style fields) using a first-order Taylor approximation; the default in most surveys that publish strata and PSU codes.",
        "edge_cases": [
          "Single-PSU strata break the variance formula; use a documented adjustment (e.g., center at the grand mean, or combine strata) rather than dropping the stratum.",
          "Domain (subgroup) estimates must subset the design object, never the data frame, or the between-PSU variance is wrong."
        ],
        "data_source_notes": "NHANES: weight + SDMVSTRA + SDMVPSU with nest=TRUE; MEPS: PERWT + VARSTR + VARPSU."
      },
      {
        "name": "Replicate-weight estimation (BRR / Fay / jackknife / bootstrap)",
        "description": "The survey ships a set of replicate weight columns; variance is the variability of the estimate recomputed across replicates. Required when design variables are suppressed for confidentiality or when nonlinear/quantile estimands are involved.",
        "edge_cases": [
          "Must use the survey's documented replication type and scale factor (e.g., Fay's rho); mixing types corrupts variance.",
          "Replicate weights already encode strata/clusters — do not additionally specify design variables."
        ],
        "data_source_notes": "MCBS and some HCUP/MEPS releases distribute replicate weights; honor the published method."
      },
      {
        "name": "Post-stratification / raking / calibration weights",
        "description": "Base design weights are calibrated to known population control totals (age x sex x region, etc.) to reduce nonresponse bias and variance. The only legitimate way to attach weights to a non-probability or registry sample, with explicit assumptions.",
        "edge_cases": [
          "Corrects only the margins used; cannot fix selection on variables not in the calibration model.",
          "On convenience/registry data this is a generalizability adjustment, not a probability-design weight, and should be labeled as such."
        ],
        "data_source_notes": "Registry/EHR: rake to Census/claims margins; report effective sample size and weight distribution."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Naive unweighted analysis of complex-survey data",
        "pros_of_this": "Unbiased finite-population point estimates and correct design-based variances that honor stratification and clustering.",
        "cons_of_this": "Higher variance than an equal-probability sample of the same size (design effect), more complex estimators, and weaker subgroup precision.",
        "when_to_prefer": "Any inferential claim about the population the survey was designed to represent."
      },
      {
        "compared_to": "Survey weights used as frequency/precision weights in an off-the-shelf GLM",
        "pros_of_this": "True survey procedures return correct standard errors; off-the-shelf frequency weights ignore strata and clusters and report variances that are almost always too small.",
        "cons_of_this": "Requires survey-aware software and a correctly built design object; more setup.",
        "when_to_prefer": "Whenever the design has clustering or stratification (essentially all national health surveys)."
      },
      {
        "compared_to": "Model-based / mixed-effects modeling of the clustered data",
        "pros_of_this": "Robust to model misspecification and targets the population total/mean an agency or payer needs.",
        "cons_of_this": "Less efficient than a correctly specified mixed model and does not target cluster-conditional effects.",
        "when_to_prefer": "When the deliverable is a population-representative descriptive or marginal estimate and misspecification robustness outweighs efficiency."
      }
    ],
    "implementation_notes_by_data_source": {
      "survey": "Use the documented weight, strata, and PSU variables verbatim; pick the right weight for the estimand (cross-sectional vs longitudinal, interview vs exam) and the right multi-cycle weight when pooling years. Do domain analysis on the design object, not by filtering rows.",
      "linked": "Survey-to-claims/mortality linkage breaks representativeness because only the linkable subset is kept; use linkage-eligibility-adjusted weights or the survey weight no longer maps to the target population.",
      "registry": "Design weights rarely exist; at most construct post-stratification/raking weights to external margins and present as an explicit generalizability adjustment, not a true probability-design weight.",
      "claims": "Administrative claims carry no survey weights; representativeness of an enrolled-population estimate to a target population is an IPSW/standardization problem, not design-based survey estimation. Do not fabricate weights."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\nfrom samplics.estimation import TaylorEstimator\n\n# df already loaded, full file (do NOT pre-filter to the diabetes domain).\n# Overall national mean expenditure (population estimand, design-based variance):\noverall = TaylorEstimator(param=\"mean\")\noverall.estimate(\n    y=df[\"totexp\"],\n    samp_weight=df[\"perwt\"],\n    stratum=df[\"varstr\"],\n    psu=df[\"varpsu\"],\n)\nprint(\"National mean:\", overall.point_est, \"SE:\", overall.stderror)\n\n# Domain (diabetes) estimate: pass the domain to keep all PSUs/strata in the variance.\n# This is the CORRECT subgroup approach; filtering df first would drop PSUs and break the SE.\ndomain = TaylorEstimator(param=\"mean\")\ndomain.estimate(\n    y=df[\"totexp\"],\n    samp_weight=df[\"perwt\"],\n    stratum=df[\"varstr\"],\n    psu=df[\"varpsu\"],\n    domain=df[\"diabetes\"],\n)\nprint(domain.point_est)   # keyed by domain level (0/1)\nprint(domain.stderror)    # design-based SE for each domain, valid for the budget-impact denominator",
        "description": "Design-based estimation of a national per-capita cost in a MEPS-style file using samplics. Required input (one row per\nperson, post data-management):\n  df : person_id, varstr (variance stratum), varpsu (variance PSU), perwt (person weight, float),\n       totexp (annual total expenditure), diabetes (0/1 domain indicator built from the condition file)\nsamplics implements Taylor linearization with the same strata/PSU logic as svy procedures. statsmodels can return a\nweighted point estimate but its cluster-robust SEs do not fully reproduce a survey design - prefer samplics here.",
        "dependencies": [
          "samplics",
          "pandas"
        ],
        "source_citations": [
          "lumley-2004"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(survey)\noptions(survey.lonely.psu = \"adjust\")  # robust handling of single-PSU strata\n\ndes <- svydesign(\n  ids     = ~varpsu,      # PSU / cluster\n  strata  = ~varstr,      # variance stratum\n  weights = ~perwt,       # person-level sampling weight\n  nest    = TRUE,         # PSUs nested within strata (MEPS/NHANES convention)\n  data    = meps\n)\n\n# National per-capita expenditure (population mean + design-based SE):\nsvymean(~totexp, des)\n\n# CORRECT domain estimation: subset the DESIGN, not the data frame, so excluded\n# PSUs still contribute to the between-PSU variance.\ndm <- subset(des, diabetes == 1)\nsvymean(~totexp, dm)               # population mean + valid Taylor-linearized SE\nsvyglm(totexp ~ age + sex, design = dm)   # population-representative regression\n\n# Survival on weighted survey data (e.g., time to a costly event):\n# svycoxph(Surv(time, event) ~ arm, design = des)",
        "description": "Design-based estimation with the survey package on a MEPS-style person file. Required columns:\n  person_id, varstr, varpsu, perwt (numeric weight), totexp (annual expenditure), diabetes (0/1 domain)\nBuild ONE design object on the full data; never filter the data frame for a subgroup - subset the design object so the\nvariance retains all strata/PSUs. options(survey.lonely.psu=\"adjust\") handles single-PSU strata.",
        "dependencies": [
          "survey"
        ],
        "source_citations": [
          "lumley-2004"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* National per-capita expenditure + design-based (Taylor) SE, with a correct\n   diabetes-domain estimate. DOMAIN, not WHERE, preserves the design variance. */\nproc surveymeans data=work.meps mean clm;\n  strata  varstr;\n  cluster varpsu;\n  weight  perwt;\n  var     totexp;\n  domain  diabetes;     /* subgroup means with valid SEs across all PSUs */\nrun;\n\n/* Population-representative regression of cost on covariates: */\nproc surveyreg data=work.meps;\n  strata  varstr;\n  cluster varpsu;\n  weight  perwt;\n  domain  diabetes;\n  model   totexp = age sex;\nrun;\n\n/* Weighted prevalence / proportion with design-based CI: */\nproc surveyfreq data=work.meps;\n  strata  varstr;\n  cluster varpsu;\n  weight  perwt;\n  tables  diabetes / cl;\nrun;\n\n/* Weighted survival on complex-survey data (e.g., time to a high-cost event): */\n/* proc surveyphreg data=work.meps;\n     strata varstr; cluster varpsu; weight perwt;\n     model time*event(0) = arm; run; */",
        "description": "Design-based estimation with SAS SURVEY procedures. Required dataset (post data-management):\n  work.meps : person_id, varstr, varpsu, perwt, totexp, diabetes (0/1)\nSTRATA + CLUSTER + WEIGHT declare the design; the DOMAIN statement (NOT a WHERE/BY) produces correct subgroup variances\nby keeping every PSU in the computation. NOMCAR/adjustments as needed for lonely PSUs.",
        "dependencies": [],
        "source_citations": [
          "lumley-2004"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Target[Finite target population<br/>e.g., U.S. adults] --> Design[Complex probability design<br/>strata + multistage PSUs]\n  Design --> Sample[Respondents with unequal<br/>selection probabilities + nonresponse]\n  Sample --> Wt[Sampling weight = inverse selection prob<br/>x nonresponse x post-stratification]\n  Wt --> Estimand{Estimand?}\n  Estimand -->|Population mean / total / prevalence| DB[Design-based estimation<br/>weight + strata + PSU]\n  Estimand -->|Causal regression effect| Check[Model correct & selection exogenous?]\n  Check -->|Yes| Unw[Weighting only adds variance<br/>Solon et al. 2015]\n  Check -->|No / want population-averaged| DB\n  DB --> Var[Variance: Taylor linearization<br/>OR replicate weights]\n  Var --> Dom[Subgroups via DOMAIN / subset design<br/>NEVER filter rows]",
        "caption": "Decision logic for design-based survey estimation. Weights are mandatory for finite-population descriptive estimands; for causal regression, weighting helps only under misspecification or endogenous selection (Solon et al.). Variance must come from the design, and subgroup analysis must keep the full design.",
        "alt_text": "Flowchart from target population through complex design and weighting to an estimand branch, distinguishing design-based population estimation from causal regression, with variance and domain-estimation rules.",
        "source_type": "illustrative",
        "source_citations": [
          "lumley-2004",
          "solon-2015"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  subgraph WRONG[Wrong: filter then estimate]\n    F1[df where diabetes==1] --> F2[Weighted mean<br/>on filtered rows]\n    F2 --> F3[Dropped PSUs/strata<br/>SE too small -> CI too narrow]\n  end\n  subgraph RIGHT[Right: domain on full design]\n    R1[Full design object<br/>all strata + PSUs] --> R2[DOMAIN / subset design<br/>diabetes==1]\n    R2 --> R3[Same point estimate<br/>VALID design-based SE]\n  end",
        "caption": "The domain-estimation pitfall. Filtering rows before estimating a subgroup deletes design information and understates the standard error; the correct approach subsets the design object so excluded PSUs still inform the variance.",
        "alt_text": "Side-by-side comparison showing that filtering rows yields a too-small standard error while domain estimation on the full design yields the same point estimate with a valid standard error.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "used_with",
        "target_slug": "generalizability-transportability-external-validity-rwe",
        "notes": "Inverse-probability-of-sampling weighting (IPSW) is the survey-weighting mechanism that transports a study estimate to a named target population; design-based survey estimation supplies the target-population denominators."
      },
      {
        "relation_type": "see_also",
        "target_slug": "propensity-score-methods-psm-iptw",
        "notes": "IPTW and sampling weights share Horvitz-Thompson arithmetic but answer different questions (confounding balance vs population representativeness); IPSW is the bridge between them."
      },
      {
        "relation_type": "used_with",
        "target_slug": "healthcare-costs-pppm-pppy-pmpm",
        "notes": "National per-capita cost and utilization figures for budget-impact and cost-of-illness models are produced as design-based survey estimates from MEPS/MCBS."
      },
      {
        "relation_type": "see_also",
        "target_slug": "direct-standardization-rwe",
        "notes": "Post-stratification/raking weights are closely related to standardization; both reweight a sample to external population margins, but design weights also carry selection probabilities."
      },
      {
        "relation_type": "see_also",
        "target_slug": "missing-data-pattern-table-rwe",
        "notes": "Nonresponse adjustment inside survey weights is a missing-data correction; survey weights and imputation are complementary tools for unit vs item nonresponse."
      }
    ],
    "aliases": [
      "complex survey design analysis",
      "design-based inference",
      "sampling weights",
      "survey-weighted analysis",
      "complex sampling"
    ],
    "complexity": "advanced",
    "regulatory_relevance": [
      "fda",
      "hta",
      "pqa-cms"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "survival-extrapolation-hta-rwe",
    "name": "Survival Extrapolation for HTA Using RWE",
    "short_definition": "Fitting parametric or flexible (spline / mixture-cure) survival models to observed time-to-event data and projecting them beyond the data cut to a lifetime horizon, producing the mean (discounted) life-years and survival inputs that drive cost-effectiveness models for health technology assessment.",
    "long_description": "**Survival extrapolation for HTA** is the step that turns a finite stretch of observed follow-up\n(a trial, registry, or claims cohort with, say, 18-36 months of data) into the lifetime survival\ncurve a cost-effectiveness model requires. Health technology assessment is almost always conducted\nover a lifetime horizon, but time-to-event data are administratively censored at the data cut, so\nthe analyst must fit a model to the observed Kaplan-Meier (KM) curve and project the hazard forward\nfor years or decades. Because the incremental cost-effectiveness ratio (ICER) is dominated by the\n*area under* the survival curve (mean survival / life-years gained), and because the bulk of that\narea frequently lies *beyond* the observed data, the choice of extrapolation model is often the single\nmost influential and most contested assumption in the entire appraisal. NICE DSU TSD 14 (Latimer)\ncodified the standard workflow; TSD 21 extended it to flexible parametric and externally-informed\napproaches.\n\n**Core conceptual distinction.** Survival extrapolation is not \"survival analysis.\" Within-data\nestimands such as the hazard ratio, restricted mean survival time (RMST) over the observed window,\nor a Kaplan-Meier point estimate are *interpolations* governed by observed events. Extrapolation is\nan *out-of-sample projection*: the quantity of interest is mean (often discounted) survival to a\nlifetime horizon, which depends entirely on the assumed hazard shape in the unobserved tail where\nthere are zero events to discipline the fit. Two models that are visually indistinguishable over the\ntrial period - and have nearly identical AIC/BIC - can differ by years in projected mean survival\nbecause their tail hazards diverge (e.g., a log-normal with a long, decreasing hazard tail vs a Weibull\nwith a monotone increasing hazard). 
The deliverable is therefore not a \"best-fitting model\" but a\n*defensible projected hazard*, justified by in-sample fit (AIC/BIC, visual, residuals) AND external\nplausibility (smoothed hazard shape, general-population mortality as a floor on the hazard, clinical\nexpectation, and registry/long-term external data where available). The candidate set is the six\nstandard parametric distributions (exponential, Weibull, Gompertz, log-normal, log-logistic,\ngeneralized gamma) plus flexible alternatives - Royston-Parmar restricted cubic splines, and\nmixture / non-mixture cure models when a plateau implies long-term survivors.\n\n**Pros, cons, and trade-offs.**\n- **Standard six parametric distributions (TSD 14) vs flexible parametric splines (TSD 21):** the\n  six closed-form distributions are transparent, fast, easy to do probabilistic sensitivity analysis\n  (PSA) on (multivariate-normal draws of the parameters), and familiar to reviewers. Cost: each imposes\n  a rigid, usually monotone or single-turning-point hazard, so when the true hazard is non-monotone\n  (e.g., early treatment-related mortality then a plateau) none fits well and the choice among poor fits\n  drives the answer. Royston-Parmar splines flex to the observed hazard and can be anchored to external\n  data, but they are *less* constrained in the tail and can extrapolate implausibly if knots sit near\n  the data boundary - more knots improve in-sample fit while worsening tail behavior. **Prefer** the six\n  when the smoothed hazard is plausibly monotone and follow-up is mature; **prefer** splines or\n  externally-anchored models when the hazard is complex or the tail is sparse.\n- **Independent per-arm fits vs a single joint model with a treatment covariate (proportional\n  hazards / accelerated failure time):** fitting each arm separately maximizes within-arm fit but lets\n  the two extrapolations cross or diverge implausibly and ignores the question of whether the treatment\n  effect persists. 
A joint model (or a relative-effect-on-a-reference-curve approach) enforces a coherent\n  relationship but bakes in a PH/AFT assumption that may be false in the tail. **Prefer** independent fits\n  only when proportionality is clearly violated AND you separately justify the long-term effect; otherwise\n  model the treatment effect explicitly so you can test waning.\n- **Naive extrapolation vs general-population mortality flooring / relative-survival framing:** ignoring\n  background mortality lets a fitted curve project survival probabilities that exceed the age-matched\n  general population - clinically impossible and a frequent ERG/EAG critique. Capping the all-cause hazard\n  at the general-population life-table hazard (or modeling excess/relative survival) is more defensible but\n  requires linkage to national life tables and an assumption about excess hazard in the tail. **Prefer**\n  flooring/relative survival in any chronic or oncology model run to a lifetime horizon, especially in\n  older populations.\n- **vs RMST or within-trial analysis:** RMST avoids the extrapolation problem entirely by restricting to\n  the observed horizon, but it cannot answer the lifetime question HTA demands. Use RMST as a face-validity\n  anchor (the model's restricted mean over the observed window should match the empirical RMST), not as a\n  substitute.\n\n**When to use.** Any cost-utility or cost-effectiveness analysis with a lifetime (or long, e.g., >5-year)\nhorizon where survival or time-to-progression is a model input and follow-up is shorter than the horizon -\ni.e., essentially every oncology and most chronic-disease submissions to NICE, CDA-AMC (formerly CADTH), PBAC, IQWiG, ICER.\nUse it to populate partitioned-survival models (overall and progression-free survival curves) and to derive\ntransition probabilities or dwell times for Markov / state-transition models. 
It is also the right tool when\nRWE (registry or claims) provides longer-term follow-up that can validate or directly inform the trial-based\ntail.\n\n**When NOT to use - and when it is actively misleading or dangerous.**\n- **The horizon does not exceed observed follow-up.** If the decision horizon is fully covered by data\n  (e.g., an acute condition resolved within the trial), extrapolation adds assumption-driven uncertainty for\n  no benefit - report the empirical KM / RMST instead.\n- **Few events / immature data with no external anchor.** Extrapolating from a curve with little tail\n  information (a handful of late events, wide late KM confidence bands) produces projections driven almost\n  entirely by the parametric form, not the data; different equally-plausible distributions can imply mean\n  survival differences large enough to flip the ICER across the willingness-to-pay threshold. Without external\n  long-term data or expert elicitation this is conjecture dressed as analysis.\n- **A plateau is fit with a non-cure distribution (or vice versa).** Forcing a standard distribution onto\n  immunotherapy-style data with a long-term survivor fraction will either understate the plateau (monotone\n  increasing hazard) or invent immortality; conversely, fitting a cure model when the plateau is an artifact\n  of administrative censoring (everyone simply ran out of follow-up at the data cut) fabricates cured patients.\n  The apparent plateau at the right edge of a KM curve is frequently censoring, not biology.\n- **Tail survival exceeds the general population.** If the projected all-cause survival is better than the\n  age/sex-matched life table, the model is impossible and must be floored - presenting it unfloored is\n  misleading.\n- **Assuming a treatment effect persists for life without evidence.** Extrapolating a within-trial hazard\n  ratio indefinitely (no waning) is a common and consequential optimistic bias; the persistence of the\n  effect beyond the data 
must be an explicit, tested scenario, not a silent default.\n\n**Data-source operational depth.**\n- **Trial / IPD (the TSD 14 base case):** cleanest hazards but shortest follow-up; the tail is almost\n  entirely unobserved, so the danger is over-reading an apparent late plateau that is really administrative\n  censoring at the data cut. Always overlay the number-at-risk and a smoothed hazard plot before choosing a\n  distribution; a hazard estimated from <10 events is noise.\n- **Registry:** the natural source of long-term external data to anchor or validate the trial tail, but\n  registries are themselves administratively censored at extraction and often have *differential* loss to\n  follow-up (sicker patients drop out or die undocumented), which can create a spurious late survival\n  *improvement* (the survivors look healthy because the unwell are missing). Confirm completeness of vital\n  status (link to a death index) before trusting registry tails.\n- **Claims:** can give long real-world follow-up but the failure modes bite hard for extrapolation.\n  Medicare Advantage (MA)-only person-time lacks fee-for-service (FFS) claims, so death and late events are\n  differentially unobserved and the tail survival is artifactually inflated unless MA-only spans are excluded\n  or vital status is linked. Mortality in elderly claims is dominated by *competing*, non-disease causes, so a\n  naive all-cause extrapolation conflates the disease hazard with background mortality - frame as relative\n  survival or floor the hazard at the general-population life table. 
Immortal time inflates early survival if\n  the index date is pinned at a landmark (e.g., a second prescription) rather than at treatment initiation.\n  Outcome (death) capture lags and claim reversals distort the most recent months - hold out the final\n  incomplete quarters from the fit.\n- **Linked claims-registry-vital records:** the ideal substrate (registry severity + claims duration +\n  reliable mortality from the death index), but linkage selects the linkable subset and introduces\n  order/fill/service-date discrepancies that must be reconciled before time-zero and event dates are set.\n\n**Worked claims example.** Question: lifetime overall survival for a new first-line therapy vs standard\nof care in metastatic NSCLC, to populate a partitioned-survival cost-utility model run to a 30-year (lifetime)\nhorizon, using a commercial + Medicare FFS claims database linked to the National Death Index, where observed\nfollow-up is ~30 months. (1) Cohort and time zero: adults with a metastatic NSCLC diagnosis and a first\nqualifying systemic-therapy fill (`person_id`, `fill_date`, `days_supply`, `dx` codes); index_date = that first\nfill; require continuous A/B/D FFS enrollment for a 365-day baseline and *exclude MA-only person-time* so death\nand late events are observable. (2) Outcome and follow-up: death from the linked death index; censor at\ndisenrollment or a switch to MA (per step 1), end of data, and the data-cut date; drop the final two\nincomplete quarters to avoid reversal/lag artifacts. (3) Plot KM with number-at-risk and a smoothed hazard;\nthe hazard is non-monotone (early peak, then declining), so the monotone distributions (exponential, Weibull,\nGompertz) fit poorly. (4) Fit the six\nparametric distributions plus a 2-knot Royston-Parmar spline per arm; rank by AIC/BIC, overlay each fitted\ncurve on the KM, and inspect projected hazards to 30 years. (5) Floor the all-cause hazard at the\nage/sex-matched US life-table hazard so projected survival never exceeds the general population. (6) Select the\ngeneralized gamma (best AIC and a clinically plausible declining-then-leveling hazard), confirm its restricted\nmean over 30 months matches the empirical RMST as a face-validity check, and compute mean discounted life-years\nper arm. (7) PSA: draw the fitted parameter vectors from their multivariate-normal sampling distribution\n(Cholesky of the covariance matrix), recompute lifetime survival per draw, and propagate to the ICER. (8)\nScenario analyses on the distribution choice (Weibull, log-normal, spline), a treatment-effect-waning scenario\n(hazard ratio reverts to 1 after 5 years), and inclusion vs exclusion of external registry-derived long-term\ndata, where available.\n\n**Interpreting the output**\n\nSix standard parametric distributions fitted to 30-month trial data produce projected mean survival ranging from 2.1 years (exponential) to 6.2 years (log-normal): a three-fold spread from models with nearly identical AIC (within 4 points).\n\n*Formal interpretation.* Each parametric family implies a distinct hazard shape in the unobserved tail beyond the data window, and that tail, not the observed 30 months, dominates the lifetime mean survival and therefore the ICER. AIC and BIC measure in-sample fit and carry no information about which extrapolated tail is correct; they cannot discriminate among the six distributions beyond the last observed event time. The log-normal and generalized-gamma families allow non-monotone hazards and project substantially longer survival, while the exponential assumes constant hazard and produces the shortest estimate. Selection among families must therefore be justified by external biological plausibility, registry or epidemiological long-term mortality data, or clinical expert elicitation, not by goodness-of-fit statistics alone.\n\n*Practical interpretation.* A three-fold difference in projected mean survival is a three-fold difference in that arm's undiscounted life-years; because life-years gained is the difference between arms, the incremental estimate can change substantially, potentially shifting the cost-effectiveness estimate from below to well above the payer willingness-to-pay threshold. Health technology assessment bodies (e.g., NICE and the Institute for Clinical and Economic Review) expect structural-uncertainty scenario analyses across all plausible families, with the base-case selection defended in writing. The choice of extrapolation model is typically the single largest driver of uncertainty in the incremental cost-effectiveness ratio when survival data are immature.",
    "primary_category": "Economic_Evaluation",
    "tags": [
      "survival-extrapolation",
      "parametric-survival",
      "flexible-parametric-spline",
      "cure-model",
      "lifetime-horizon",
      "cost-effectiveness",
      "partitioned-survival",
      "hta",
      "nice-dsu-tsd14"
    ],
    "applies_to_study_types": [
      "comparative_effectiveness",
      "registry_linkage",
      "claims_analysis",
      "target_trial_emulation"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1177/0272989X12472398",
        "url": "https://doi.org/10.1177/0272989X12472398",
        "citation_text": "Latimer NR. Survival analysis for economic evaluations alongside clinical trials - extrapolation with patient-level data: inconsistencies, limitations, and a practical guide. Medical Decision Making. 2013;33(6):743-754.",
        "year": 2013,
        "authors_short": "Latimer",
        "notes": "The peer-reviewed companion to NICE DSU TSD 14; the canonical workflow for fitting and selecting parametric survival models for HTA extrapolation, including the standard six-distribution candidate set and the AIC/BIC-plus-plausibility selection algorithm."
      },
      {
        "role": "explain",
        "doi": "10.18637/jss.v070.i08",
        "url": "https://doi.org/10.18637/jss.v070.i08",
        "citation_text": "Jackson CH. flexsurv: a platform for parametric survival modeling in R. Journal of Statistical Software. 2016;70(8):1-33.",
        "year": 2016,
        "authors_short": "Jackson",
        "notes": "Reference implementation for the standard parametric distributions, Royston-Parmar splines, and relative-survival/background-hazard extrapolation; the workhorse package for HTA survival modeling in R."
      },
      {
        "role": "explain",
        "doi": "10.1177/0272989X13497998",
        "url": "https://doi.org/10.1177/0272989X13497998",
        "citation_text": "Bagust A, Beale S. Survival analysis and extrapolation modeling of time-to-event clinical trial data for economic evaluation: an alternative approach. Medical Decision Making. 2014;34(3):343-351.",
        "year": 2014,
        "authors_short": "Bagust & Beale",
        "notes": "Argues that AIC/BIC-driven selection over the trial period gives little assurance about the extrapolated tail and motivates hazard-based, externally-informed extrapolation - the core caution of this concept."
      },
      {
        "role": "demonstrate",
        "doi": "10.1017/s0266462319000175",
        "url": "https://doi.org/10.1017/s0266462319000175",
        "citation_text": "Gallacher D, Auguste P, Connock M. How do pharmaceutical companies model survival of cancer patients? A review of NICE single technology appraisals in 2017. International Journal of Technology Assessment in Health Care. 2019;35(2):160-167.",
        "year": 2019,
        "authors_short": "Gallacher et al.",
        "notes": "Empirical review of how extrapolation is actually done in NICE oncology submissions, documenting inconsistent distribution selection and frequent failure to justify the tail - evidence of the real-world stakes and pitfalls of the method."
      }
    ],
    "plain_language_summary": "Survival extrapolation is how health economists turn a short stretch of trial data into a full lifetime survival curve so they can estimate the total life-years a treatment provides. A clinical trial might only follow patients for two to three years, but a cost-effectiveness model needs to project what happens over an entire lifetime — sometimes 30 years. To do that, analysts fit a mathematical curve (called a parametric survival model) to the trial's observed data and then extend it far beyond what was actually seen. The critical catch is that several different curve shapes may all fit the observed data equally well, yet each one predicts a very different number of life-years in the unobserved future — and that difference can change whether a treatment looks cost-effective.",
    "key_terms": [
      {
        "term": "extrapolation",
        "definition": "Projecting a mathematical curve into the future beyond the range of observed data — essentially predicting what the survival curve does after the trial ended."
      },
      {
        "term": "parametric survival model",
        "definition": "A mathematical formula that describes how quickly patients in a study are dying over time; once fit to the observed data, the formula can be used to project survival into the unobserved future."
      },
      {
        "term": "lifetime horizon",
        "definition": "The full remaining lifespan of a typical patient in the model — often 30 or more years — which health technology assessment bodies require so all future costs and benefits are captured."
      },
      {
        "term": "mean survival (life-years)",
        "definition": "The average total time alive across all patients in a model, calculated as the area under the survival curve; this is the key number that drives the cost-effectiveness calculation."
      },
      {
        "term": "AIC / BIC",
        "definition": "Statistical scores (lower is better) that measure how well a model fits the observed data — but they say nothing about which curve is most accurate in the unobserved tail beyond the data."
      }
    ],
    "worked_example": {
      "scenario": "A randomized trial of a new lung cancer therapy followed patients for 30 months. The trial team recorded whether each patient had died and when. Now a health economist needs to populate a cost-effectiveness model that runs to a 30-year lifetime horizon. The observed data only cover the first 30 months, so a parametric survival model must be fit to those data and then extrapolated forward for the remaining 27.5 years. Three common distributions — exponential, Weibull, and log-normal — all fit the 30-month observed data reasonably well, but they make very different assumptions about how the hazard behaves in the unobserved tail.",
      "dataset": {
        "caption": "Summary of observed 30-month trial data for the treatment arm (used to fit each parametric model). The three fitted distributions are then projected forward to 30 years.",
        "columns": [
          "model",
          "in-sample fit (AIC)",
          "projected mean survival to 30 yrs (years)"
        ],
        "rows": [
          [
            "Exponential",
            412,
            2.1
          ],
          [
            "Weibull",
            408,
            3.4
          ],
          [
            "Log-normal",
            410,
            6.2
          ]
        ]
      },
      "steps": [
        "All three models are fit to the same 30-month trial data; their AIC scores are close (408-412), so no model clearly wins on in-sample fit alone.",
        "Each model is then projected forward to 30 years using its mathematical formula — this is the extrapolation step, where the curves diverge sharply because each assumes a different shape for the hazard in the unobserved tail.",
        "The exponential model assumes the hazard (risk of dying) stays constant forever, which is pessimistic for a cancer therapy; it projects only 2.1 years of mean survival.",
        "The Weibull model allows the hazard to change over time in a single direction; it projects 3.4 years of mean survival.",
        "The log-normal model assumes the hazard first rises then falls (a long, slow declining tail), which is optimistic; it projects 6.2 years of mean survival.",
        "The analyst must choose — or present all three as scenarios — because the cost-effectiveness ratio depends heavily on this choice: the difference between 2.1 and 6.2 life-years is large enough to flip the conclusion about whether the treatment is cost-effective."
      ],
      "result": "Projected mean survival ranges from 2.1 years (exponential) to 3.4 years (Weibull) to 6.2 years (log-normal) — a three-fold spread from the most pessimistic to the most optimistic model, all fit to identical 30-month data. The choice of extrapolation model, not the trial itself, drives the final cost-effectiveness answer."
    },
    "prerequisites": [
      "restricted-mean-survival-time-rmst",
      "health-economic-modeling-methods-rwe",
      "partitioned-survival-models-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Standard six-distribution parametric extrapolation (TSD 14)",
        "description": "Fit exponential, Weibull, Gompertz, log-normal, log-logistic, and generalized gamma to each arm; rank by AIC/BIC; choose by combining fit statistics with visual fit and clinical plausibility of the projected hazard. Closed-form, fast, and straightforward to PSA via multivariate-normal parameter draws.",
        "edge_cases": [
          "All six fit poorly when the smoothed hazard is non-monotone (early peak then decline), so the choice among bad fits drives the projection.",
          "AIC/BIC reward in-sample fit and say nothing about the unobserved tail; near-tied models can diverge by years in mean survival."
        ],
        "data_source_notes": "Works on any IPD with event/censoring times; in claims, exclude MA-only person-time and link vital status before fitting so the tail reflects real mortality."
      },
      {
        "name": "Flexible parametric (Royston-Parmar) spline extrapolation (TSD 21)",
        "description": "Model the log cumulative hazard as a restricted cubic spline in log time, allowing complex, non-monotone hazards and direct anchoring to external long-term data. Knot number/location is the key tuning decision.",
        "edge_cases": [
          "More knots improve in-sample fit but can produce volatile, implausible tail hazards near the boundary knot.",
          "Without an external anchor or a boundary constraint, the spline tail is data-driven precisely where there is no data."
        ],
        "data_source_notes": "Use registry or linked long-term follow-up to place a far knot or to validate the projected hazard; flexsurvspline / rstpm2 in R, PROC NLMIXED in SAS."
      },
      {
        "name": "Mixture / non-mixture cure models",
        "description": "Decompose the population into a long-term-survivor (cured) fraction following background mortality and an uncured fraction with a parametric excess hazard. Appropriate when a genuine plateau (e.g., immunotherapy, curative oncology) is supported by mature follow-up.",
        "edge_cases": [
          "A plateau caused by administrative censoring at the data cut, not biology, will fabricate a cure fraction.",
          "The cured fraction and the uncured distribution are weakly identified with short follow-up, giving unstable estimates."
        ],
        "data_source_notes": "Requires mature follow-up and ideally external long-term data; flexsurvcure in R. Always link to a death index so the plateau is not an artifact of unobserved deaths."
      },
      {
        "name": "General-population-mortality flooring / relative survival",
        "description": "Constrain the all-cause hazard to be at least the age/sex-matched general-population (life-table) hazard, or model excess (disease-specific) hazard added to the background hazard, so projected survival can never exceed the matched general population.",
        "edge_cases": [
          "Requires up-to-date national life tables matched on age, sex, and ideally calendar year and region.",
          "The assumed long-term excess hazard (constant, declining, or zero) still drives the tail and must be justified."
        ],
        "data_source_notes": "Essential for lifetime horizons in older populations and elderly claims, where competing non-disease mortality dominates; combine with the death index to separate disease and background mortality."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Standard six parametric distributions (TSD 14)",
        "pros_of_this": "Transparent, fast, easy to PSA, familiar to reviewers; well-defined candidate set and selection algorithm.",
        "cons_of_this": "Each imposes a rigid hazard shape; when the true hazard is non-monotone none fits and the choice among poor fits, not the data, determines the lifetime projection.",
        "when_to_prefer": "Mature follow-up with a plausibly monotone or single-turning-point smoothed hazard."
      },
      {
        "compared_to": "Flexible parametric (Royston-Parmar) splines",
        "pros_of_this": "Flex to complex observed hazards and can be anchored to external data; better in-sample fit.",
        "cons_of_this": "Less constrained in the tail; more knots can yield implausible extrapolations exactly where data are absent.",
        "when_to_prefer": "Complex/non-monotone hazards, or when external long-term data are available to discipline the tail."
      },
      {
        "compared_to": "RMST / within-observed-window analysis",
        "pros_of_this": "Answers the lifetime question HTA requires by projecting mean survival to the decision horizon.",
        "cons_of_this": "Introduces unobservable tail assumptions; RMST avoids them but cannot reach the lifetime horizon.",
        "when_to_prefer": "Whenever the decision horizon exceeds observed follow-up; use RMST only as a face-validity anchor."
      },
      {
        "compared_to": "Naive (unfloored) extrapolation ignoring background mortality",
        "pros_of_this": "Floors the hazard at the general-population life table so projected survival is never clinically impossible.",
        "cons_of_this": "Requires matched life tables and an assumption about long-term excess hazard.",
        "when_to_prefer": "Any chronic-disease or oncology model run to a lifetime horizon, especially in older cohorts."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Exclude Medicare Advantage-only person-time (FFS claims and thus deaths/late events are unobserved, inflating the tail); link vital status from a death index; pin time zero at treatment initiation to avoid immortal time; drop the final incomplete quarters (claim lag/reversals); floor the all-cause hazard at the general-population life table because competing non-disease mortality dominates in elderly cohorts.",
      "ehr": "Visit-driven capture means patients who leave the system are differentially lost; treat loss to follow-up as potentially informative and confirm deaths via linkage, otherwise the tail is biased optimistically.",
      "registry": "The natural source of external long-term data to anchor/validate the trial tail, but registries are administratively censored at extraction and prone to differential dropout; confirm vital-status completeness before trusting late survival.",
      "linked": "Linked claims-registry-vital-records is the ideal substrate (severity + duration + reliable mortality) but adds linkage selection and order/fill/service-date discrepancies that must be reconciled before setting time-zero and event dates."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\nimport pandas as pd\nfrom lifelines import (WeibullFitter, LogNormalFitter, LogLogisticFitter,\n                       GeneralizedGammaFitter, ExponentialFitter)\n\nHORIZON_YEARS = 30.0   # lifetime horizon for the HTA model\nDISCOUNT = 0.035       # annual discount rate on life-years (NICE reference case)\nGRID = np.linspace(0, HORIZON_YEARS, int(HORIZON_YEARS * 12) + 1)  # monthly grid to the horizon (30y -> 361 points)\n\n# lifelines supplies five of the six TSD-14 standard distributions; Gompertz has no lifelines\n# fitter and is fit via flexsurv (R) instead (see the R implementation below).\nCANDIDATES = {\n    \"exponential\": ExponentialFitter,\n    \"weibull\":     WeibullFitter,\n    \"lognormal\":   LogNormalFitter,\n    \"loglogistic\": LogLogisticFitter,\n    \"gengamma\":    GeneralizedGammaFitter,\n}\n\ndef discounted_life_years(times, surv, rate):\n    # Trapezoidal area under S(t) with continuous discounting -> discounted mean survival.\n    disc = np.exp(-rate * times)\n    integrand = surv * disc\n    return np.trapz(integrand, times)\n\ndef fit_and_extrapolate(df_arm: pd.DataFrame) -> pd.DataFrame:\n    rows = []\n    for name, Fitter in CANDIDATES.items():\n        f = Fitter()\n        f.fit(df_arm[\"time_years\"], event_observed=df_arm[\"event\"])\n        surv = f.survival_function_at_times(GRID).values  # projected S(t) to the horizon\n        rows.append({\n            \"model\": name,\n            \"AIC\": f.AIC_,\n            \"BIC\": getattr(f, \"BIC_\", np.nan),\n            \"mean_LY\": np.trapz(surv, GRID),               # undiscounted lifetime mean survival\n            \"disc_LY\": discounted_life_years(GRID, surv, DISCOUNT),\n        })\n    return pd.DataFrame(rows).sort_values(\"AIC\").reset_index(drop=True)\n\n# Per-arm fits; compare disc_LY across distributions to expose tail-driven divergence.\nresults = {arm: fit_and_extrapolate(g) for arm, g in surv.groupby(\"arm\")}\nfor arm, tbl in 
results.items():\n    print(f\"\\n=== {arm} ===\")\n    print(tbl.to_string(index=False))\n# Selection rule: lowest AIC/BIC AND a plausible projected hazard AND survival floored at the\n# general-population life table (integrate under max(model_hazard, lifetable_hazard), so the all-cause\n# hazard never falls below background mortality).",
        "description": "Fit and compare candidate parametric survival models and extrapolate to a lifetime horizon using lifelines.\nRequired input (one row per subject, after data management):\n  surv : person_id, arm ('TREAT'/'SOC'), time_years (>0; time from index to death or censor),\n         event (1 = death, 0 = censored)   # index = treatment initiation; MA-only spans already excluded\nOutput: AIC/BIC table for model selection, mean discounted life-years to the lifetime horizon per fitted\ndistribution. Always overlay the fitted curves on the KM and inspect projected hazards before selecting.",
        "dependencies": [
          "lifelines",
          "numpy",
          "pandas"
        ],
        "source_citations": [
          "latimer-2013"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(flexsurv)\nlibrary(dplyr)\n\nHORIZON  <- 30      # lifetime horizon (years)\nDISCOUNT <- 0.035   # annual discount rate (NICE reference case)\nGRID     <- seq(0, HORIZON, by = 1/12)\n\ndists <- c(\"exp\", \"weibull\", \"gompertz\", \"lnorm\", \"llogis\", \"gengamma\")\n\nfit_one_arm <- function(d) {\n  fits <- lapply(dists, function(dd)\n    flexsurvreg(Surv(time_years, event) ~ 1, data = d, dist = dd))\n  names(fits) <- dists\n  # Royston-Parmar spline (2 internal knots) on the log-cumulative-hazard scale.\n  fits[[\"spline_k2\"]] <- flexsurvspline(Surv(time_years, event) ~ 1, data = d,\n                                        k = 2, scale = \"hazard\")\n\n  aic <- sapply(fits, AIC)\n\n  # Mean (discounted) life-years = trapezoidal area under projected S(t) * discount factor.\n  disc <- exp(-DISCOUNT * GRID)\n  ly <- sapply(fits, function(f) {\n    S <- summary(f, t = GRID, type = \"survival\", ci = FALSE)[[1]]$est\n    c(mean_LY = sum(diff(GRID) * (head(S, -1) + tail(S, -1)) / 2),\n      disc_LY = sum(diff(GRID) * (head(S * disc, -1) + tail(S * disc, -1)) / 2))\n  })\n  data.frame(model = names(fits), AIC = aic,\n             mean_LY = ly[\"mean_LY\", ], disc_LY = ly[\"disc_LY\", ],\n             row.names = NULL) %>% arrange(AIC)\n}\n\nresults <- surv %>% group_split(arm) %>%\n  setNames(sort(unique(surv$arm))) %>% lapply(fit_one_arm)\nprint(results)\n\n# Selection: lowest AIC/BIC AND plausible smoothed hazard AND survival never exceeding the matched\n# general population. For relative survival / flooring, add bhazard = d$bhaz to flexsurvreg() and use\n# a cure model (flexsurvcure) when a genuine plateau is supported by mature follow-up.\n# PSA: draw parameters from normboot.flexsurvreg(fit, B = 1000) and recompute disc_LY per draw.",
        "description": "Fit the six standard parametric distributions plus a Royston-Parmar spline with flexsurv, compare on\nAIC, extrapolate to a lifetime horizon with general-population-mortality flooring, and compute mean\n(discounted) life-years. This is the reference HTA workflow.\nRequired input (one row per subject):\n  surv : person_id, arm ('TREAT'/'SOC'), time_years (>0), event (1 = death, 0 = censored)\n  # optional: bhaz = matched general-population hazard for relative-survival / flooring",
        "dependencies": [
          "flexsurv",
          "dplyr"
        ],
        "source_citations": [
          "latimer-2013",
          "jackson-2016"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* KM + log(-log) and cumulative-hazard diagnostics: inspect hazard shape BEFORE choosing a model. */\nproc lifetest data=work.surv plots=(survival(atrisk) loglogs);\n  time time_years*event(0);\n  strata arm;\nrun;\n\n/* Fit candidate parametric distributions per arm (BY arm). Compare on AIC from the fit statistics.   */\n/* generalized gamma nests Weibull/log-normal/gamma, so a likelihood-ratio test guides the form.      */\nproc sort data=work.surv; by arm; run;\n\n%macro fit(dist=);\n  proc lifereg data=work.surv;\n    by arm;\n    model time_years*event(0) = / distribution=&dist;\n    ods output FitStatistics=fit_&dist;\n  run;\n%mend;\n%fit(dist=weibull); %fit(dist=lnormal); %fit(dist=llogistic);\n%fit(dist=exponential); %fit(dist=gamma);   /* gamma = generalized gamma in LIFEREG */\n\n/* Extrapolate the chosen distribution to the lifetime horizon and integrate S(t) for mean (discounted) */\n/* life-years. Example for Weibull: S(t)=exp(-(t/lambda)**shape), shape=1/scale, lambda=exp(intercept). */\ndata lifeyears;\n  set work.wb_params;                 /* intercept + scale from the selected LIFEREG fit, by arm */\n  retain mean_ly disc_ly 0;\n  shape  = 1/scale;  lambda = exp(intercept);\n  dt = 1/12;  rate = 0.035;           /* monthly grid, NICE 3.5% discount on life-years */\n  do t = 0 to 30 by dt;               /* 30-year lifetime horizon */\n    S    = exp(-(t/lambda)**shape);\n    Smin = exp(-((t+dt)/lambda)**shape);\n    mean_ly + dt*(S + Smin)/2;                         /* trapezoidal area under S(t) */\n    disc_ly + dt*(S*exp(-rate*t) + Smin*exp(-rate*(t+dt)))/2;\n  end;\n  /* Floor the hazard at the general-population life table before integrating in production: replace S    */\n  /* with survival under min(model_hazard, lifetable_hazard) so projected survival cannot exceed it.     */\n  keep arm shape lambda mean_ly disc_ly;\nrun;\nproc print data=lifeyears; run;",
        "description": "Diagnostics (PROC LIFETEST) plus parametric survival fits and lifetime extrapolation (PROC LIFEREG).\nRequired input dataset (one row per subject, post data-management):\n  work.surv : person_id, arm ('TREAT'/'SOC'), time_years (>0), event (1 = death, 0 = censored)\nPROC LIFEREG fits Weibull, log-normal, log-logistic, exponential, and generalized gamma with\nclosed-form extrapolation; use PROC NLMIXED for Royston-Parmar splines or mixture-cure models.",
        "dependencies": [],
        "source_citations": [
          "latimer-2013"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  KM[Kaplan-Meier + number-at-risk<br/>+ smoothed hazard plot] --> Q{Enough late events?<br/>tail informative?}\n  Q -- No --> Ext[Bring in external data:<br/>registry tail / expert elicitation /<br/>general-population life table]\n  Q -- Yes --> Shape{Smoothed hazard shape}\n  Ext --> Shape\n  Shape -- Monotone / single turn --> Six[Fit standard six<br/>exp/Weibull/Gompertz/<br/>lnorm/llogis/gengamma]\n  Shape -- Complex / non-monotone --> Spl[Fit Royston-Parmar spline<br/>or relative-survival model]\n  Shape -- Plateau supported<br/>by mature follow-up --> Cure[Fit mixture / non-mixture<br/>cure model]\n  Six --> Sel[Select: AIC/BIC + visual fit<br/>+ plausible projected hazard]\n  Spl --> Sel\n  Cure --> Sel\n  Sel --> Floor[Floor all-cause hazard at<br/>general-population life table]\n  Floor --> Proj[Extrapolate to lifetime horizon<br/>-> mean discounted life-years]\n  Proj --> PSA[PSA over parameter covariance<br/>+ scenarios: distribution choice,<br/>treatment-effect waning, external tail]",
        "caption": "Decision logic for HTA survival extrapolation. In-sample fit statistics alone cannot discipline the unobserved tail, so distribution choice is gated on smoothed-hazard shape, external data, general-population mortality flooring, and explicit scenario/PSA testing.",
        "alt_text": "Flowchart from Kaplan-Meier diagnostics through a check on tail information and hazard shape, branching to standard parametric, spline, or cure models, then selection, general-population flooring, lifetime extrapolation, and probabilistic sensitivity analysis.",
        "source_type": "illustrative",
        "source_citations": [
          "latimer-2013"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "gantt\n  title Survival extrapolation horizon for one cohort (HTA)\n  dateFormat YYYY-MM-DD\n  axisFormat %Y\n  section Observed\n  Trial / claims follow-up (events drive the fit) :done, obs, 2021-01-01, 900d\n  section Extrapolated\n  Parametric / spline projection (no events) :active, ext, 2023-06-20, 9000d\n  section Constraints\n  General-population mortality envelope (hazard floor) :crit, env, 2021-01-01, 9900d\n  Treatment-effect waning point (HR -> 1) :milestone, wane, 2026-06-20, 0d",
        "caption": "Timeline of an extrapolation. Roughly the first ~30 months are observed and discipline the fit; the remaining decades to the lifetime horizon are projected with no events, bounded above by the general-population mortality envelope, with an explicit treatment-effect-waning point as a tested scenario.",
        "alt_text": "Gantt timeline showing a short observed follow-up window, a long extrapolated projection to the lifetime horizon, a general-population mortality envelope spanning the whole period, and a treatment-effect waning milestone.",
        "source_type": "illustrative",
        "source_citations": [
          "latimer-2013"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "part_of",
        "target_slug": "health-economic-modeling-methods-rwe",
        "notes": "Survival extrapolation supplies the lifetime survival inputs consumed by health-economic models built on real-world evidence."
      },
      {
        "relation_type": "used_with",
        "target_slug": "partitioned-survival-models-rwe",
        "notes": "Partitioned-survival models are populated directly by extrapolated overall-survival and progression-free-survival curves; the extrapolation choice drives those models' lifetime outputs."
      },
      {
        "relation_type": "used_with",
        "target_slug": "markov-transition-probabilities-rwe",
        "notes": "Extrapolated survival/time-to-event curves are converted into the transition probabilities or dwell times used by Markov / state-transition cost-effectiveness models."
      },
      {
        "relation_type": "see_also",
        "target_slug": "restricted-mean-survival-time-rmst",
        "notes": "RMST measures mean survival within the observed window and serves as a face-validity anchor for an extrapolated model, but cannot reach the lifetime horizon HTA requires."
      },
      {
        "relation_type": "see_also",
        "target_slug": "competing-risks-cause-specific-fine-gray-rwe",
        "notes": "In elderly claims, competing non-disease mortality must be separated from the disease hazard (relative survival / general-population flooring) to avoid biased cause-specific extrapolation."
      },
      {
        "relation_type": "used_with",
        "target_slug": "discounting-costs-effects-rwe",
        "notes": "Extrapolated life-years are discounted to present value before entering the ICER / net monetary benefit."
      },
      {
        "relation_type": "used_with",
        "target_slug": "probabilistic-sensitivity-analysis-hea-rwe",
        "notes": "Uncertainty in the fitted survival parameters is propagated through the economic model via PSA, drawing parameter vectors from their covariance matrix."
      }
    ],
    "aliases": [
      "parametric survival extrapolation",
      "lifetime survival projection for economic evaluation",
      "extrapolation of time-to-event data for HTA",
      "survival modeling for cost-effectiveness (NICE DSU TSD 14)"
    ],
    "complexity": "advanced",
    "regulatory_relevance": [
      "hta",
      "fda",
      "ema"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "switch-add-on-augmentation-rwe",
    "name": "Switch, Add-On, and Augmentation Rules",
    "short_definition": "A set of pre-specified operational rules that classify a treatment modification observed in longitudinal data as a switch, an add-on, an augmentation, a combination start, or a line advancement, based on the timing and overlap of exposure episodes across drug classes.",
    "long_description": "**Switch, add-on, and augmentation rules** are the algorithm that turns a stream of dispensings or administrations into a\nclassified *treatment trajectory*. Once exposure episodes have been built (see exposure-episode-construction-rwe), every\npoint where a patient's regimen changes must be labeled, because the label — not the raw fill — is what defines cohorts,\nindex dates, follow-up start, lines of therapy, and the comparator in a treatment-pattern or comparative-effectiveness\nstudy. The same calendar event (a new fill of drug B while the patient is on drug A) is a *switch*, an *add-on*, or an\n*augmentation* depending entirely on the timing rules and the clinical interpretation you pre-specify; getting the rule\nwrong silently misclassifies exposure for everyone in the cohort.\n\n**Core conceptual distinction.** Four transitions must be separated, and the operational tell for each is timing, not the\ndrug itself.\n- **Switch** — class A stops (no fill within a grace/gap window) and class B starts. The patient is on B *instead of* A.\n  Operationally: `start(B) - end_of_supply(A) <= SWITCH_GAP_DAYS` (a near-contiguous handoff) AND no further A fills.\n- **Add-on / combination** — class B starts while class A's `days_supply` is still active. Both classes overlap. If A was\n  itself recently initiated (the patient was effectively treatment-naive to the combination), this is a planned\n  *combination start*; if A had been running for a while, B is an *add-on*.\n- **Augmentation** — a specific, clinically loaded subtype of add-on: B is added to an *ongoing, persistent* A that was\n  given an adequate therapeutic trial without resolution (e.g., adding an atypical antipsychotic to an SSRI in\n  treatment-resistant depression, or a thiazide to an ACE inhibitor in uncontrolled hypertension). 
The operational tell\n  that distinguishes augmentation from a plain add-on is **prior persistence on A** (`days_on_A_before_B >=\n  AUGMENT_PRIOR_PERSISTENCE_DAYS`, often an adequate-trial window such as 28–56 days) plus continuation of A after B starts.\n- **Line advancement** — the regimen as a whole changes after a maintenance gap or a documented progression/failure event,\n  starting a new line of therapy. This is the unit oncology, MS, and IRA Medicare price-negotiation analyses report on.\nThese thresholds — `SWITCH_GAP_DAYS`, `OVERLAP_DAYS_FOR_ADDON`, `AUGMENT_PRIOR_PERSISTENCE_DAYS`, the maintenance-gap that\ncloses a line — are the judgment-dependent parameters; every consequential analysis must vary them in sensitivity analyses.\n\n**Pros, cons, and trade-offs.**\n- **vs a single naive rule (e.g., \"any new drug class = switch\"):** Explicit, separately-tunable rules are transparent,\n  reproducible, and defensible to FDA/EMA/HTA reviewers, and they prevent the dominant error of collapsing add-ons and\n  augmentations into switches (which inflates \"switching\" and empties out \"combination\" exposure). Cost: more code, more\n  diagnostics, and the need to defend each threshold. **Prefer explicit rules** for any regulatory-grade or HTA study.\n- **vs purely clinical (chart-adjudicated) classification:** Rule-based classification scales to millions of patients and\n  is fully reproducible; chart review captures intent (\"inadequate response\", \"intolerance\") that timing alone cannot. 
The\n  augmentation-vs-add-on distinction in particular is an *intent* construct that claims approximate via prior persistence.\n  **Prefer rules** at scale, but validate against charts where the switch/augment distinction drives the estimand.\n- **vs ignoring the distinction entirely (treat all post-index drugs as time-varying covariates):** Folding everything into\n  a time-varying confounder is defensible for some causal questions but discards the treatment-pattern endpoint and makes\n  \"what did patients actually do next\" unanswerable. **Prefer explicit classification** whenever sequencing, lines, or\n  persistence-to-next-treatment is the outcome.\n\n**When to use.** Treatment-pattern and lines-of-therapy studies; persistence/discontinuation endpoints that must\ndistinguish stopping from switching; comparative-effectiveness designs where the comparator is \"switchers vs augmenters\";\ndrug-utilization and sequencing studies; HTA sequencing models that need realistic transition probabilities; and any\noncology, psychiatry, MS, or hypertension RWE where add-on vs switch vs augmentation carries different clinical meaning.\n\n**When NOT to use — and when it is actively misleading or dangerous.**\n- **When the underlying exposure episodes are unreliable.** These rules are only as good as the `days_supply`, gap, and\n  stockpiling assumptions feeding them (see grace-period-gap-rules-rwe, stockpiling-carryover-rules-rwe). Garbage episodes\n  produce confident, wrong labels.\n- **When data cannot observe the comparator drug.** In Medicare Advantage-only person-time, fee-for-service pharmacy and\n  medical claims are absent, so an \"add-on\" can be an artifact of one class being invisible. Restrict to fully observable\n  enrollment before classifying.\n- **When augmentation requires intent that the data cannot carry.** Calling an add-on an \"augmentation\" without evidence of\n  prior adequate trial fabricates a treatment-resistance phenotype. 
If the estimand hinges on intent, validate against\n  charts or drop the augmentation label.\n- **Infused/clinic-administered or inpatient drugs.** Oncology infusions, biologics, and inpatient bundles have no\n  `days_supply`; applying oral gap logic to them is dangerous (see infused-biologic-administration-capture-rwe,\n  inpatient-bridging-exposure-rwe). Use administration-interval logic instead.\n- **When line definitions are imposed without progression data.** In oncology, defining line advancement purely by drug\n  gaps — without a progression or new-regimen signal — conflates a treatment holiday with progression and miscounts lines.\n\n**Data-source operational depth.**\n- **Claims (FFS):** The workhorse substrate. Build class-level episodes from pharmacy NDCs (`fill_date`, `days_supply`) and\n  from medical/HCPCS J-codes for clinic-administered drugs. Failure modes: 90-day mail order and stockpiling distend\n  `days_supply` so overlaps look like add-ons; same-day duplicate or reversed claims create phantom combinations\n  (de-duplicate and net out reversals first); adjudication lag means recent fills are incomplete near the data cut. Require\n  continuous medical *and* pharmacy enrollment across the classification window so \"no fill of A\" is true absence, not\n  unobserved care.\n- **Claims (Medicare Advantage vs FFS):** MA-only person-time lacks FFS claims; a switch can look like a discontinuation\n  and an add-on can be invisible. Exclude MA-only spans or restrict to A/B/D FFS enrollees\n  (see medicare-ffs-ma-commercial-claims-differences-rwe).\n- **EHR:** Classification keys off *orders/administrations*, not dispensings, and order ≠ taken. Visit-driven capture means\n  a drug added by an outside specialist is missing, so an augmentation is misread as monotherapy continuation; external\n  care leakage is the dominant bias. 
Link to pharmacy fills where possible.\n- **Registry:** Often carries adjudicated regimen/line and progression but incomplete fill history; strong for the\n  line-advancement label, weak for the day-level switch/add-on timing. Link to claims for fill-level resolution.\n- **Linked claims–EHR:** Best substrate — EHR progression/intent plus claims completeness — but order/fill/service date\n  discrepancies must be reconciled before deciding which event came first, since the switch-vs-add-on label is decided by\n  day-level ordering.\n\n**Worked claims example (antidepressant augmentation vs switch, FFS pharmacy).** Question: among adults who initiate an SSRI\nfor depression, classify the first regimen modification. Inputs: pharmacy fills with `person_id`, `fill_date`,\n`days_supply`, `drug_class` in {SSRI, SNRI, ATYPICAL_ANTIPSYCHOTIC}; continuous medical+pharmacy FFS enrollment.\nThresholds: `SWITCH_GAP_DAYS = 30`, `OVERLAP_DAYS_FOR_ADDON = 1`, `AUGMENT_PRIOR_PERSISTENCE_DAYS = 56` (an adequate SSRI\ntrial). (1) Build SSRI episodes by stitching consecutive fills with gaps <= 30 days; the episode's covered period runs from\nthe first `fill_date` through the last `fill_date + days_supply - 1` (the fill date counts as day 1 of supply). Patient: SSRI fills 2024-01-03 (30d), 2024-02-01 (30d),\n2024-03-02 (30d) — covered through 2024-03-31, 89 covered days counted inclusively. (2) On 2024-03-10 an aripiprazole (ATYPICAL) fill\nappears. (3) Is the SSRI still active on 2024-03-10? Yes (covered through 2024-03-31), so this is an *overlap* → not a\nswitch. (4) Had the patient been persistent on the SSRI for >= 56 days before 2024-03-10? Days on SSRI before the add =\n2024-03-10 − 2024-01-03 = 67 days >= 56 → label = **augmentation** (atypical added to an adequately-trialed, ongoing SSRI).\nContrast: had the SSRI instead *stopped* (last covered day 2024-02-05) and an SNRI started 2024-02-20 with no further SSRI\nfills, then `start(SNRI) − end(SSRI) = 15 <= 30` and SSRI does not resume → label = **switch**. 
Had the aripiprazole been\nadded on 2024-01-20 (only 17 SSRI-days, < 56) while the SSRI continued → label = **add-on (not augmentation)**: combination\ntoo early to call treatment-resistant. Sensitivity analyses re-run all labels at `SWITCH_GAP_DAYS` ∈ {15, 30, 60} and\n`AUGMENT_PRIOR_PERSISTENCE_DAYS` ∈ {28, 56, 84}, and report a transition table (counts of switch / add-on / augmentation /\nline-advance) before and after each threshold change.",
    "primary_category": "Exposure_Definition",
    "tags": [
      "exposure-definition",
      "treatment-modification",
      "switching",
      "add-on-therapy",
      "augmentation",
      "combination-therapy",
      "lines-of-therapy",
      "treatment-patterns"
    ],
    "applies_to_study_types": [
      "claims_analysis",
      "ehr_study",
      "registry_linkage",
      "comparative_effectiveness",
      "drug_utilization"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.2217/fon-2020-1041",
        "url": "https://doi.org/10.2217/fon-2020-1041",
        "citation_text": "Hess LM, Krein PM, Haldane D, Han Y, Sireci AN. Defining treatment regimens and lines of therapy using real-world data in oncology. Future Oncology. 2021;17(15):1865-1877.",
        "year": 2021,
        "authors_short": "Hess et al.",
        "notes": "Methods statement for operationally defining regimens, switches, add-ons, and line advancement from real-world treatment data, including the timing thresholds and judgment points that drive misclassification."
      },
      {
        "role": "explain",
        "doi": "10.1073/pnas.1510502113",
        "url": "https://doi.org/10.1073/pnas.1510502113",
        "citation_text": "Hripcsak G, Ryan PB, Duke JD, et al. Characterizing treatment pathways at scale using the OHDSI network. Proceedings of the National Academy of Sciences. 2016;113(27):7329-7336.",
        "year": 2016,
        "authors_short": "Hripcsak et al.",
        "notes": "Demonstrates large-scale, reproducible characterization of treatment sequences (initiation, switching, combination) across standardized observational databases — the network-scale rationale for explicit classification rules."
      },
      {
        "role": "demonstrate",
        "doi": "10.1002/pds.1230",
        "url": "https://doi.org/10.1002/pds.1230",
        "citation_text": "Andrade SE, Kahler KH, Frech F, Chan KA. Methods for evaluation of medication adherence and persistence using automated databases. Pharmacoepidemiology and Drug Safety. 2006;15(8):565-574.",
        "year": 2006,
        "authors_short": "Andrade et al.",
        "notes": "Source for the underlying episode, gap, and switching mechanics (days_supply stitching, permissible gaps) that the switch/add-on/augmentation rules sit on top of in automated claims data."
      },
      {
        "role": "use",
        "doi": "10.1007/s40801-017-0126-5",
        "url": "https://doi.org/10.1007/s40801-017-0126-5",
        "citation_text": "Mahlich J, Tsukazawa S, Wiegand F. Estimating prevalence and healthcare utilization for treatment-resistant depression in Japan: a retrospective claims database study. Drugs - Real World Outcomes. 2018;5(1):35-43.",
        "year": 2018,
        "authors_short": "Mahlich et al.",
        "notes": "Applied claims study that operationalizes switch and augmentation rules to define treatment-resistant depression, illustrating how the augmentation-vs-switch distinction becomes the phenotype."
      }
    ],
    "plain_language_summary": "When a patient changes their drug regimen, analysts need a precise rule to decide what kind of change happened. A switch means the first drug stopped and a second drug took its place; an add-on (or augmentation) means the second drug was added while the first was still being taken. The difference is decided by one thing you can see in pharmacy claims data: does the first drug's supply of pills still have days left on the calendar when the second drug's fills begin? If yes, the two drugs overlap and it is an add-on; if no, the first drug had already run out (or nearly so) before the second started, and it is a switch. Getting this right matters because grouping switchers and add-on patients together quietly mixes two very different clinical decisions.",
    "key_terms": [
      {
        "term": "fill_date",
        "definition": "The calendar date a prescription was dispensed at the pharmacy — the first day that supply of pills is available to the patient."
      },
      {
        "term": "days_supply",
        "definition": "How many days one dispensed prescription is supposed to last; a 30-day fill runs from the fill_date through the next 29 days."
      },
      {
        "term": "exposure episode",
        "definition": "The continuous stretch of time a patient is considered to be on a drug, built by stitching together consecutive fills that have little or no gap between them."
      },
      {
        "term": "overlap",
        "definition": "A period when two drugs are both active at the same time because one drug's supply has not yet run out when the other drug's fills begin."
      },
      {
        "term": "prior persistence",
        "definition": "How many days a patient had been continuously taking the first drug before the second drug was added — used to distinguish a planned combination from an augmentation after an adequate treatment trial."
      },
      {
        "term": "grace period",
        "definition": "A short allowable gap (often 30 days) between when one drug's supply runs out and when the next fill arrives, within which the episode is still treated as continuous rather than stopped."
      }
    ],
    "worked_example": {
      "scenario": "Patient 2001 is an adult who starts sertraline (an SSRI antidepressant) on January 5, 2024. We want to classify what happens when a second drug appears in their pharmacy record on March 10, 2024 — but the answer depends entirely on whether the patient also refilled their sertraline in early March. We trace two parallel versions of events: Scenario S (Switch) where the sertraline was not refilled before March 10, and Scenario A (Add-on/Augmentation) where a March 1 refill means sertraline is still active when the second drug arrives.",
      "dataset": {
        "caption": "Pharmacy fills as they appear in a claims table — same patient, two scenarios that differ only in whether Drug A (sertraline) has a third fill on 2024-03-01.",
        "columns": [
          "person_id",
          "scenario",
          "fill_date",
          "drug",
          "drug_class",
          "days_supply"
        ],
        "rows": [
          [
            2001,
            "Both",
            "2024-01-05",
            "sertraline",
            "SSRI",
            30
          ],
          [
            2001,
            "Both",
            "2024-02-02",
            "sertraline",
            "SSRI",
            30
          ],
          [
            2001,
            "Scenario A only",
            "2024-03-01",
            "sertraline",
            "SSRI",
            30
          ],
          [
            2001,
            "Both",
            "2024-03-10",
            "aripiprazole",
            "ATYPICAL",
            30
          ]
        ]
      },
      "steps": [
        "Drug A (sertraline) Fill 1 starts 2024-01-05 and covers 30 days: active through 2024-02-03.",
        "Drug A Fill 2 starts 2024-02-02 — one day before Fill 1 runs out — so the fills stitch together with no gap; the combined episode is active through 2024-03-02.",
        "SCENARIO S (Switch): No third sertraline fill. Drug A's supply expires 2024-03-02. Drug B (aripiprazole) starts 2024-03-10. There are 7 uncovered days between them (March 3–9). Because 7 days is within the 30-day grace period AND sertraline does not resume, this is classified as a SWITCH: aripiprazole replaced sertraline.",
        "SCENARIO A (Add-on/Augmentation): Drug A Fill 3 starts 2024-03-01, adding 30 more days; sertraline is now active through 2024-03-30. Drug B (aripiprazole) starts 2024-03-10 — eight days before sertraline runs out. The two drugs overlap for 21 days (March 10–30). Because sertraline is still active when aripiprazole begins, this is NOT a switch.",
        "To distinguish add-on from augmentation: count how many days the patient had been taking sertraline before aripiprazole was added. From 2024-01-05 to 2024-03-10 is 65 days. Our pre-specified adequate-trial threshold is 56 days. Because 65 >= 56 and sertraline was ongoing, this is classified as AUGMENTATION: aripiprazole was added to a sertraline regimen that had already been given a full trial."
      ],
      "result": {
        "switch": "Scenario S: Drug A last active 2024-03-02, Drug B starts 2024-03-10, gap = 7 days <= 30-day grace period, Drug A does not resume → classification = SWITCH",
        "augmentation": "Scenario A: Drug A active through 2024-03-30, Drug B starts 2024-03-10, overlap = 21 days, prior persistence on Drug A = 65 days >= 56-day threshold → classification = AUGMENTATION"
      },
      "timeline_spec": {
        "title": "Switch vs. Add-on/Augmentation for one patient (two scenarios, same Drug A fills except for the March 1 refill)",
        "window": {
          "start": "2024-01-05",
          "end": "2024-04-08",
          "label": "Observation window: Jan 5 – Apr 8, 2024 (covering all fills in both scenarios)"
        },
        "events": [
          {
            "label": "Drug A Fill 1 (sertraline) — both scenarios",
            "track": "Drug A (sertraline)",
            "start": "2024-01-05",
            "length_days": 30,
            "quantity": "30-day supply",
            "end_date": "2024-02-03"
          },
          {
            "label": "Drug A Fill 2 (sertraline) — both scenarios",
            "track": "Drug A (sertraline)",
            "start": "2024-02-02",
            "length_days": 30,
            "quantity": "30-day supply",
            "end_date": "2024-03-02"
          },
          {
            "label": "Drug A Fill 3 (sertraline) — Scenario A ONLY",
            "track": "Drug A (sertraline)",
            "start": "2024-03-01",
            "length_days": 30,
            "quantity": "30-day supply",
            "end_date": "2024-03-30",
            "scenario": "add_on_augmentation_only"
          },
          {
            "label": "Drug B (aripiprazole) — both scenarios",
            "track": "Drug B (aripiprazole)",
            "start": "2024-03-10",
            "length_days": 30,
            "quantity": "30-day supply",
            "end_date": "2024-04-08"
          }
        ],
        "spans": [
          {
            "kind": "exposed",
            "track": "Drug A (sertraline)",
            "start": "2024-01-05",
            "end": "2024-03-02",
            "label": "Drug A episode — Scenario S ends here (57 days)",
            "scenario": "switch"
          },
          {
            "kind": "exposed",
            "track": "Drug A (sertraline)",
            "start": "2024-01-05",
            "end": "2024-03-30",
            "label": "Drug A episode — Scenario A continues through Mar 30 (85 days)",
            "scenario": "add_on_augmentation_only"
          },
          {
            "kind": "gap",
            "track": "Drug A (sertraline)",
            "start": "2024-03-03",
            "end": "2024-03-09",
            "label": "7-day gap (Drug A expired, Drug B not yet started) → gap ≤ 30-day grace → SWITCH",
            "scenario": "switch"
          },
          {
            "kind": "exposed",
            "track": "Drug B (aripiprazole)",
            "start": "2024-03-10",
            "end": "2024-04-08",
            "label": "Drug B episode — both scenarios (30 days)"
          },
          {
            "kind": "covered",
            "track": "overlap",
            "start": "2024-03-10",
            "end": "2024-03-30",
            "label": "21-day overlap: both Drug A and Drug B active → ADD-ON / AUGMENTATION (Scenario A only)",
            "scenario": "add_on_augmentation_only"
          }
        ],
        "result": [
          {
            "label": "Scenario S classification",
            "value": "SWITCH — Drug A ended 2024-03-02; Drug B started 2024-03-10; 7-day gap ≤ 30 days; Drug A does not resume"
          },
          {
            "label": "Scenario A classification",
            "value": "AUGMENTATION — Drug A active through 2024-03-30; Drug B starts 2024-03-10; 21-day overlap; prior persistence on Drug A = 65 days ≥ 56-day threshold"
          }
        ],
        "caption": "Two-scenario timeline for patient 2001. The top row (Drug A, sertraline) differs between the scenarios: Scenario S has no March 1 refill so Drug A expires March 2, leaving a 7-day gap before Drug B arrives on March 10 — a switch. Scenario A includes the March 1 refill so Drug A is still active on March 10 when Drug B begins, producing a 21-day overlap — and because the patient had 65 days on Drug A before Drug B was added (above the 56-day adequate-trial threshold), the classification is augmentation rather than a plain add-on.",
        "alt_text": "Timeline with two drug tracks. Drug A (sertraline) shows two fills Jan 5 and Feb 2 in both scenarios. In Scenario S the drug track ends March 2, a gap bar covers March 3–9, and Drug B starts March 10 with no overlap — labeled SWITCH. In Scenario A a third fill added March 1 extends Drug A through March 30; Drug B starts March 10 while Drug A is still active, and a shaded overlap bar covers March 10–30 — labeled AUGMENTATION (65 days prior persistence ≥ 56-day threshold)."
      }
    },
    "prerequisites": [
      "exposure-episode-construction-rwe",
      "grace-period-gap-rules-rwe",
      "stockpiling-carryover-rules-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Claims-first classification (oral/pharmacy)",
        "description": "Builds class-level exposure episodes from pharmacy NDCs (fill_date + days_supply), stitches fills within a permissible gap, then classifies each modification by overlap and prior-persistence rules. Switch = near-contiguous A-stop then B-start with no A return; add-on = B starts during active A; augmentation = add-on after A meets an adequate-trial persistence threshold.",
        "edge_cases": [
          "90-day mail-order and stockpiling distend days_supply, turning sequential therapy into a false overlap (false add-on)",
          "same-day duplicate fills and reversed/voided claims create phantom combinations; de-duplicate and net reversals first",
          "adjudication lag leaves recent fills incomplete near the data cut, biasing the last observed transition",
          "left truncation at enrollment start makes a continuing regimen look like a new initiation"
        ],
        "data_source_notes": "claims: require continuous medical+pharmacy FFS enrollment across the classification window; exclude MA-only person-time where FFS claims are unobservable; validate code lists and date logic with patient-level timelines."
      },
      {
        "name": "Medical-benefit / infused-drug classification",
        "description": "For clinic-administered or infused therapies (HCPCS J-codes, oncology regimens, biologics) that lack days_supply, classifies modifications by administration intervals and expected cycle length rather than oral gap logic.",
        "edge_cases": [
          "No days_supply means oral gap rules are invalid; use protocol cycle length + a tolerance window instead",
          "Buy-and-bill vs specialty-pharmacy capture differs; a drug may appear on the medical or the pharmacy benefit",
          "Held/delayed cycles (toxicity) mimic discontinuation; distinguish from true line change with progression signals"
        ],
        "data_source_notes": "claims: combine pharmacy (oral) and medical (J-code) episodes onto one timeline before classifying; see infused-biologic-administration-capture-rwe and inpatient-bridging-exposure-rwe."
      },
      {
        "name": "EHR/registry-enriched classification with intent and progression",
        "description": "Uses orders/administrations, problem lists, notes, or adjudicated registry fields and progression events to refine the augmentation label (documented inadequate response) and to anchor line advancement on progression rather than drug gaps alone.",
        "edge_cases": [
          "External care leakage hides drugs added by outside specialists, misreading augmentation as monotherapy continuation",
          "Order does not equal taken; prefer linked dispensing to confirm the modification occurred",
          "Site workflow variation changes how regimens and line changes are recorded"
        ],
        "data_source_notes": "ehr/registry: link to claims for full fill history and to progression/adjudication fields for the line-advancement label; report missingness and linkage denominators."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "A single naive rule (any new drug class = switch)",
        "pros_of_this": "Separately tunable, transparent rules avoid the dominant error of collapsing add-ons and augmentations into switches; reproducible and defensible for FDA/EMA/HTA review.",
        "cons_of_this": "More programming, more diagnostics, and each threshold must be justified and sensitivity-tested.",
        "when_to_prefer": "Any consequential treatment-pattern, comparative-effectiveness, persistence, or regulatory/HTA analysis."
      },
      {
        "compared_to": "Chart-adjudicated clinical classification",
        "pros_of_this": "Scales to millions of patients and is fully reproducible; thresholds are explicit rather than tacit.",
        "cons_of_this": "Cannot directly observe intent (inadequate response, intolerance) that defines true augmentation; approximates it via prior persistence.",
        "when_to_prefer": "At scale and when day-level timing drives the estimand; validate against charts where switch-vs-augment is decisive."
      },
      {
        "compared_to": "Treating all post-index drugs as time-varying covariates (no classification)",
        "pros_of_this": "Preserves the treatment-pattern and lines-of-therapy endpoint and answers \"what did patients do next\".",
        "cons_of_this": "More structure to specify and defend than a single time-varying covariate term.",
        "when_to_prefer": "Whenever sequencing, lines of therapy, or time-to-next-treatment is the outcome rather than a nuisance."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Build class-level episodes from NDC + fill_date + days_supply (and J-codes for infused drugs); de-duplicate same-day fills and net reversals before stitching. Require continuous medical+pharmacy FFS enrollment across the classification window and exclude Medicare Advantage-only person-time. Pre-specify SWITCH_GAP_DAYS, OVERLAP_DAYS_FOR_ADDON, and AUGMENT_PRIOR_PERSISTENCE_DAYS and vary them in sensitivity analyses.",
      "ehr": "EHR data record orders/administrations rather than dispensings, and an order does not mean the drug was taken; link to fills to confirm each modification. External care leakage is the dominant bias (drugs added outside the system look like monotherapy continuation), so define observation windows and treat outside care as potentially informative.",
      "registry": "Often carries adjudicated regimen/line and progression but incomplete fills; strong for the line-advancement label, weak for day-level switch/add-on timing. Link to claims for fill-level resolution.",
      "linked": "Ideal substrate (EHR progression/intent + claims completeness) but order/fill/service date discrepancies must be reconciled before deciding event ordering, since the switch-vs-add-on label depends on which event came first."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\nimport numpy as np\n\nSWITCH_GAP_DAYS = 30               # max gap to treat consecutive fills as one episode / a contiguous switch handoff\nOVERLAP_DAYS_FOR_ADDON = 1         # min active-supply overlap of two classes to call an add-on/combination\nAUGMENT_PRIOR_PERSISTENCE_DAYS = 56  # adequate-trial persistence on the first class before an add-on becomes augmentation\n\ndef _episodes(g: pd.DataFrame) -> pd.DataFrame:\n    # Stitch one drug_class's fills into covered episodes (gap-tolerant).\n    g = g.sort_values(\"fill_date\")\n    g[\"end\"] = g[\"fill_date\"] + pd.to_timedelta(g[\"days_supply\"], unit=\"D\")\n    ep_start, ep_end, rows = None, None, []\n    for _, r in g.iterrows():\n        if ep_start is None:\n            ep_start, ep_end = r[\"fill_date\"], r[\"end\"]\n        elif r[\"fill_date\"] <= ep_end + pd.Timedelta(days=SWITCH_GAP_DAYS):\n            ep_end = max(ep_end, r[\"end\"])               # extend (stockpiling caps at observed end)\n        else:\n            rows.append((ep_start, ep_end)); ep_start, ep_end = r[\"fill_date\"], r[\"end\"]\n    rows.append((ep_start, ep_end))\n    return pd.DataFrame(rows, columns=[\"ep_start\", \"ep_end\"])\n\ndef classify_first_modification(rx: pd.DataFrame) -> pd.DataFrame:\n    out = []\n    for pid, p in rx.groupby(\"person_id\"):\n        eps = (p.groupby(\"drug_class\", group_keys=True)\n                 .apply(_episodes)\n                 .reset_index(level=0).reset_index(drop=True))\n        eps = eps.sort_values(\"ep_start\")\n        first = eps.iloc[0]                               # the index regimen class/episode\n        index_class = first[\"drug_class\"]\n        later = eps[(eps[\"ep_start\"] > first[\"ep_start\"]) & (eps[\"drug_class\"] != index_class)]\n        if later.empty:\n            out.append((pid, index_class, None, \"no_modification\")); continue\n        nxt = later.iloc[0]                               # first modification event\n        
overlap_days = (min(first[\"ep_end\"], nxt[\"ep_end\"]) - nxt[\"ep_start\"]).days\n        index_still_active = overlap_days >= OVERLAP_DAYS_FOR_ADDON\n        prior_persistence = (nxt[\"ep_start\"] - first[\"ep_start\"]).days\n        if not index_still_active and (nxt[\"ep_start\"] - first[\"ep_end\"]).days <= SWITCH_GAP_DAYS:\n            label = \"switch\"\n        elif index_still_active and prior_persistence >= AUGMENT_PRIOR_PERSISTENCE_DAYS:\n            label = \"augmentation\"\n        elif index_still_active:\n            label = \"add_on\"\n        else:\n            label = \"line_advance\"                       # gap exceeds maintenance window -> new line/episode\n        out.append((pid, index_class, nxt[\"drug_class\"], label))\n    return pd.DataFrame(out, columns=[\"person_id\", \"index_class\", \"modifier_class\", \"modification\"])",
        "description": "Classify the FIRST treatment modification per patient as switch / add_on / augmentation / line_advance from claims-style\npharmacy fills. Required input (cleaned, de-duplicated, reversals netted out):\n  rx : person_id, fill_date (datetime), drug_class (str), days_supply (int)   # one row per fill\nEpisodes are built per drug_class by stitching fills whose gap <= SWITCH_GAP_DAYS. The judgment-dependent thresholds are\nthe constants defined at the top of the code; vary them in sensitivity analyses. A return to the index class more than\nSWITCH_GAP_DAYS after its episode ends is not checked, so confirm \"no A return\" separately before reporting switches.\nReturns one row per patient with the modification label.",
        "dependencies": [
          "pandas",
          "numpy"
        ],
        "source_citations": [
          "hess-2021"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\nSWITCH_GAP_DAYS               <- 30L\nOVERLAP_DAYS_FOR_ADDON        <- 1L\nAUGMENT_PRIOR_PERSISTENCE_DAYS <- 56L\n\nbuild_episodes <- function(d) {                 # d: fills of ONE person+class (passed in as .SD)\n  d <- copy(d)                                  # .SD is locked inside by=; sort and modify a private copy\n  setorder(d, fill_date)\n  d[, end := fill_date + days_supply]\n  es <- d$fill_date[1L]; ee <- d$end[1L]; out <- list()\n  if (nrow(d) > 1L) for (i in 2:nrow(d)) {\n    if (d$fill_date[i] <= ee + SWITCH_GAP_DAYS) ee <- max(ee, d$end[i])\n    else { out[[length(out)+1L]] <- list(ep_start = es, ep_end = ee); es <- d$fill_date[i]; ee <- d$end[i] }\n  }\n  out[[length(out)+1L]] <- list(ep_start = es, ep_end = ee)\n  rbindlist(out)\n}\n\nclassify_first_modification <- function(rx) {\n  setDT(rx)\n  eps <- rx[, build_episodes(.SD), by = .(person_id, drug_class), .SDcols = c(\"fill_date\",\"days_supply\")]\n  setorder(eps, person_id, ep_start)\n  eps[, {\n    first_class <- drug_class[1L]; fs <- ep_start[1L]; fe <- ep_end[1L]\n    later <- which(ep_start > fs & drug_class != first_class)\n    if (length(later) == 0L) .(index_class = first_class, modifier_class = NA_character_, modification = \"no_modification\")\n    else {\n      j <- later[1L]\n      overlap_days <- as.integer(min(fe, ep_end[j]) - ep_start[j])\n      active <- overlap_days >= OVERLAP_DAYS_FOR_ADDON\n      prior  <- as.integer(ep_start[j] - fs)\n      lab <- if (!active && as.integer(ep_start[j] - fe) <= SWITCH_GAP_DAYS) \"switch\"\n             else if (active && prior >= AUGMENT_PRIOR_PERSISTENCE_DAYS)     \"augmentation\"\n             else if (active)                                               \"add_on\"\n             else                                                           \"line_advance\"\n      .(index_class = first_class, modifier_class = drug_class[j], modification = lab)\n    }\n  }, by = person_id]\n}",
        "description": "Classify the first treatment modification per patient (switch / add_on / augmentation / line_advance) with data.table.\nInput mirrors the Python version:\n  rx : person_id, fill_date (Date), drug_class (character), days_supply (integer)\nThresholds are the judgment-dependent constants varied in sensitivity analyses.",
        "dependencies": [
          "data.table"
        ],
        "source_citations": [
          "hess-2021"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "%let switch_gap = 30;   /* max gap to stitch fills / call a contiguous switch */\n%let addon_ovl  = 1;    /* min overlap days to call an add-on/combination     */\n%let augment    = 56;   /* prior persistence on index class -> augmentation   */\n\nproc sort data=work.rx; by person_id drug_class fill_date; run;\n\n/* Step 1: gap-tolerant episodes per person x drug_class. */\ndata episodes;\n  set work.rx; by person_id drug_class fill_date;\n  retain ep_start ep_end;\n  end_supply = fill_date + days_supply;\n  if first.drug_class then do; ep_start = fill_date; ep_end = end_supply; end;\n  else if fill_date <= ep_end + &switch_gap then ep_end = max(ep_end, end_supply);\n  else do;\n    output;                                   /* close prior episode */\n    ep_start = fill_date; ep_end = end_supply;\n  end;\n  if last.drug_class then output;             /* flush final episode */\n  keep person_id drug_class ep_start ep_end;\n  format ep_start ep_end date9.;\nrun;\n\n/* Step 2: index regimen = earliest episode; first modification = earliest later episode in a different class. 
*/\nproc sql;\n  create table first_index as\n  select person_id, drug_class as index_class, ep_start as fs, ep_end as fe\n  from episodes\n  group by person_id\n  having ep_start = min(ep_start);\n\n  create table modification as\n  select i.person_id, i.index_class, m.drug_class as modifier_class,\n         (min(i.fe, m.ep_end) - m.ep_start)        as overlap_days,\n         (m.ep_start - i.fs)                       as prior_persistence,\n         case\n           when calculated overlap_days < &addon_ovl and (m.ep_start - i.fe) <= &switch_gap then 'switch'\n           when calculated overlap_days >= &addon_ovl and calculated prior_persistence >= &augment then 'augmentation'\n           when calculated overlap_days >= &addon_ovl then 'add_on'\n           else 'line_advance'\n         end as modification length=12\n  from first_index i\n  inner join episodes m\n    on i.person_id = m.person_id and m.drug_class ne i.index_class and m.ep_start > i.fs\n  group by i.person_id\n  having m.ep_start = min(m.ep_start);\nquit;",
        "description": "Classify the first treatment modification per patient (switch / add_on / augmentation / line_advance) in SAS. Required\ninput (post data-management, de-duplicated, reversals netted):\n  work.rx : person_id, fill_date (SAS date), drug_class (char), days_supply (num)\nStep 1 stitches per-class episodes with a gap-tolerant DATA step; Step 2 (PROC SQL) labels the first modification using\nthe same overlap and prior-persistence rules. Macro vars are the judgment-dependent thresholds for sensitivity analyses.",
        "dependencies": [],
        "source_citations": [
          "hess-2021"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": "switch-add-on-augmentation-rwe-timeline.svg",
        "mermaid": null,
        "caption": "Two-scenario timeline for patient 2001. The top row (Drug A, sertraline) differs between the scenarios: Scenario S has no March 1 refill so Drug A expires March 2, leaving a 7-day gap before Drug B arrives on March 10 — a switch. Scenario A includes the March 1 refill so Drug A is still active on March 10 when Drug B begins, producing a 21-day overlap — and because the patient had 65 days on Drug A before Drug B was added (above the 56-day adequate-trial threshold), the classification is augmentation rather than a plain add-on.",
        "alt_text": "Timeline with two drug tracks. Drug A (sertraline) shows two fills Jan 5 and Feb 2 in both scenarios. In Scenario S the drug track ends March 2, a gap bar covers March 3–9, and Drug B starts March 10 with no overlap — labeled SWITCH. In Scenario A a third fill added March 1 extends Drug A through March 30; Drug B starts March 10 while Drug A is still active, and a shaded overlap bar covers March 10–30 — labeled AUGMENTATION (65 days prior persistence ≥ 56-day threshold).",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Mod[New fill of class B while/after class A] --> Q1{Class A still has active days_supply<br/>when B starts? overlap >= OVERLAP_DAYS_FOR_ADDON}\n  Q1 -- No --> Q2{B starts within SWITCH_GAP_DAYS of A's<br/>last covered day, and A does not resume?}\n  Q2 -- Yes --> SW[SWITCH<br/>B instead of A]\n  Q2 -- No --> LA[LINE ADVANCE / NEW EPISODE<br/>gap exceeds SWITCH_GAP_DAYS]\n  Q1 -- Yes --> Q3{Persistent on A for >=<br/>AUGMENT_PRIOR_PERSISTENCE_DAYS<br/>adequate trial before B?}\n  Q3 -- Yes --> AUG[AUGMENTATION<br/>B added to adequately-trialed ongoing A]\n  Q3 -- No --> AO[ADD-ON / COMBINATION<br/>B added to A early]",
        "caption": "Decision logic that classifies a treatment modification. Overlap of active supply separates switch/line-advance (sequential) from add-on/augmentation (concurrent); prior persistence on the index class separates augmentation (adequately-trialed) from a plain add-on.",
        "alt_text": "Decision flowchart branching on whether class A is still active when class B starts, then on the gap for switch versus line advance, and on prior persistence for augmentation versus add-on.",
        "source_type": "illustrative",
        "source_citations": [
          "hess-2021"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "gantt\n  title Augmentation example - atypical added to an adequately-trialed SSRI\n  dateFormat YYYY-MM-DD\n  axisFormat %b\n  section SSRI (index class)\n  SSRI episode (fills 1/3, 2/1, 3/2; covered through 3/31) :done, ssri, 2024-01-03, 2024-03-31\n  section Atypical antipsychotic\n  Aripiprazole start (day 67 of SSRI, SSRI still active) :milestone, ari, 2024-03-10, 0d\n  Augmentation: B added to ongoing, persistent A :active, aug, 2024-03-10, 21d",
        "caption": "Worked augmentation timeline. The atypical starts on day 67 of a still-active SSRI episode (overlap present, prior persistence 67 >= 56), so the modification is classified as augmentation rather than switch or early add-on.",
        "alt_text": "Gantt timeline of an SSRI episode running January to March with an aripiprazole start on March 10 overlapping the active SSRI, labeled as augmentation.",
        "source_type": "illustrative",
        "source_citations": [
          "hess-2021"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "is_variant_of",
        "target_slug": "exposure-episode-construction-rwe",
        "notes": "Switch/add-on/augmentation classification operates on the exposure episodes produced by episode construction; it is the labeling layer on top of that family."
      },
      {
        "relation_type": "used_with",
        "target_slug": "treatment-patterns-lines-of-therapy",
        "notes": "These rules are the engine behind lines-of-therapy and treatment-pattern endpoints; line advancement is the unit those studies report."
      },
      {
        "relation_type": "requires",
        "target_slug": "grace-period-gap-rules-rwe",
        "notes": "The switch gap and episode-stitching gap are grace-period decisions; switch vs line-advance turns on the permissible gap."
      },
      {
        "relation_type": "see_also",
        "target_slug": "persistence-time-to-discontinuation",
        "notes": "Persistence endpoints must distinguish stopping (discontinuation) from switching; the same rules separate the two."
      },
      {
        "relation_type": "see_also",
        "target_slug": "restart-rechallenge-new-episode-rwe",
        "notes": "A return to a previously-stopped class after a long gap is a restart/new episode rather than a continuation, a distinction these rules must respect."
      },
      {
        "relation_type": "see_also",
        "target_slug": "stockpiling-carryover-rules-rwe",
        "notes": "Stockpiling and carryover assumptions distend days_supply and can turn sequential therapy into a false overlap (false add-on); set carryover rules before classifying."
      },
      {
        "relation_type": "see_also",
        "target_slug": "time-updated-exposures-cumulative-dose-rwe",
        "notes": "Once modifications are classified, the resulting concurrent/sequential exposure feeds time-updated exposure and cumulative-dose modeling."
      }
    ],
    "aliases": [
      "treatment modification rules",
      "regimen change classification",
      "switching and augmentation rules",
      "add-on therapy",
      "augmentation therapy",
      "line advancement rules"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "synthetic-control-method-rwe",
    "name": "Synthetic Control Method",
    "short_definition": "A comparative case-study method that builds a counterfactual for a single treated unit as a weighted average of untreated donor units chosen to reproduce the treated unit's pre-intervention outcome trajectory and predictors, then attributes the post-intervention gap between observed and synthetic outcomes to the intervention.",
    "long_description": "The **synthetic control method (SCM)** estimates the effect of an intervention that affects *one* (or a few) aggregate\nunit(s) — a state that adopts a policy, a health system that rolls out a program, a country that changes drug regulation — for\nwhich no single untreated unit is a credible comparison. Instead of picking one comparator, SCM constructs a **synthetic\ncontrol**: a convex, weighted combination of units in a **donor pool** of untreated units, with non-negative weights summing\nto one, chosen so that the weighted donor outcomes track the treated unit's outcome *and* its predictors as closely as\npossible during the **pre-intervention period**. The weights W minimize the pre-period discrepancy\n||X1 − X0·W|| between the treated unit's predictor/outcome vector X1 and the donors' matrix X0 (typically a nested\noptimization that also weights which predictors matter, V). The **treatment effect** at each post-period t is the gap\nα_t = Y1_t − Σ_j w_j·Y0_jt between the treated unit's observed outcome and its synthetic counterpart. The method's appeal is\ntwofold: the counterfactual is transparent (you can read off which donors get weight and verify the pre-period fit), and the\nconvexity constraint guards against extrapolation beyond the donors' support. SCM formalizes and disciplines the intuition\nbehind a \"comparison region\" that practitioners had long chosen by hand.\n\n**Core conceptual distinction.** SCM is the *single-treated-unit* generalization of difference-in-differences. (1) *Relation\nto DiD*: DiD uses an equal-weighted (or regression-weighted) comparison group and assumes parallel trends; SCM instead\n*data-drives* the donor weights to match the treated unit's pre-period path, relaxing the assumption that any one comparator is\nparallel and replacing it with the requirement of good pre-period fit and a stable post-period relationship. 
When the pre-fit\nis good and the donor pool is rich, SCM nests DiD as a special case. (2) *Donor pool and pre-period fit are the whole game*:\nthe donor pool must contain only untreated units plausibly driven by the same factors as the treated unit and *not*\ncontaminated by the intervention or by their own idiosyncratic shocks; the pre-period fit (a small pre-period root-mean-square\nprediction error, RMSPE) is the prerequisite for trusting the post-period gap — a poor pre-fit means the synthetic control is\nnot a credible counterfactual and the gap is uninterpretable. (3) *Inference is by placebo/permutation, not classical SEs*:\nwith one treated unit there is no sampling distribution in the usual sense, so inference proceeds by **placebo (in-space)\ntests** — re-estimating SCM pretending each donor was treated and asking whether the true treated unit's post/pre RMSPE ratio\nis extreme relative to the placebo distribution — and **in-time placebos** (a fake intervention date in the pre-period that\nshould show no gap). This permutation logic, not a t-statistic, is what produces a p-value.\n\n**Pros, cons, and trade-offs.**\n- **vs difference-in-differences (`difference-in-differences-staggered-adoption-rwe`):** SCM data-drives the comparison weights\n  to match the treated unit's pre-trajectory, so it does not require any single donor to be parallel and makes the\n  counterfactual auditable; DiD is simpler, supports many treated units and standard inference, and is more efficient when a\n  credible parallel comparison group exists. **Prefer SCM** for one (or few) treated aggregate units with a long pre-period and\n  a rich donor pool; **prefer DiD** with many treated units, short pre-periods, or when classical inference and covariate\n  adjustment are needed. 
SCM nests DiD when the optimal weights are uniform.\n- **vs interrupted time series (`interrupted-time-series-rwe`):** ITS uses only the treated unit's own series and extrapolates\n  its pre-trend as the counterfactual; SCM borrows strength from untreated donors to construct the counterfactual, which\n  protects against confounding by a co-timed shock that also hits the treated series (if the shock hits donors too, the\n  synthetic control absorbs it). **Prefer SCM** when good donor units exist and co-timed shocks are a worry; **prefer ITS** when\n  no comparable donor units are available and the pre-trend is stable.\n- **vs instrumental variables / target-trial (`instrumental-variables-pharmacoepi-rwe`, `target-trial-emulation`):** Those are\n  individual-level designs for many units with measured (or instrumented) confounding; SCM is an *aggregate*, small-N design\n  whose identification rests on pre-period fit and donor-pool validity rather than measured confounders or an instrument.\n  **Prefer SCM** when the unit of intervention is itself aggregate (a state/system/country) and the effective sample of treated\n  units is one; **prefer individual-level designs** when many units are exposed and patient-level data and confounders exist.\n\n**When to use.** A single (or very few) treated aggregate unit(s); a long, well-measured pre-intervention outcome series (a\ncommon rule of thumb is many pre-periods, ideally a decade-plus of annual data or many quarters); a donor pool of untreated\nunits that are comparable, uncontaminated by the intervention, and free of their own large idiosyncratic shocks; an intervention\nwhose effect on an aggregate metric (a state's hospitalization rate, a health system's per-member cost, a region's drug uptake)\nis the question. 
SCM is the standard tool for policy/program evaluations where the treated unit is a geography or organization.\n\n**When NOT to use — and when it is actively misleading or dangerous.**\n- **Poor pre-intervention fit.** If the synthetic control cannot reproduce the treated unit's pre-period trajectory (large\n  pre-period RMSPE), it is not a valid counterfactual and the post-period gap is meaningless. Reporting an effect on top of a\n  bad pre-fit is the single most dangerous SCM misuse — always show the pre-period fit and the RMSPE.\n- **Contaminated or shocked donor pool.** Donors affected by the same or a similar intervention (spillover), or experiencing\n  their own large idiosyncratic shocks during the study window, corrupt the synthetic control. Curate the donor pool and run\n  leave-one-out checks dropping high-weight donors.\n- **Extrapolation / interpolation bias.** If the treated unit's predictors lie outside the convex hull of the donors, no convex\n  weighting can match it and the method silently extrapolates; if donors are very dissimilar, interpolation across them is\n  unreliable. Check that the treated unit is inside the donors' support.\n- **Few pre-periods or volatile outcome.** A short or noisy pre-period cannot pin the weights, so over-fitting to noise inflates\n  the apparent post-period gap; the in-time placebo will expose this.\n- **Over-reading a single gap as significant.** Without placebo/permutation inference, a visually large gap means little —\n  aggregate series are volatile, and many placebo units will show gaps as large by chance. Report the placebo distribution and\n  the post/pre RMSPE-ratio p-value, not just the gap.\n\n**Data-source operational depth.**\n- **Claims (FFS vs MA):** The treated and donor units are usually geographies (states, regions, plans) and the outcome is an\n  aggregated rate or cost (per-1,000 hospitalizations, PMPM cost) over a stable denominator. 
A standing failure mode: shifting\n  Medicare Advantage penetration across states and over time changes which population is FFS-observable, so a region's measured\n  FFS rate can move for compositional reasons unrelated to the intervention — hold the denominator definition constant, restrict\n  to a consistently observable population, and confirm donor regions did not experience their own coverage-mix shocks during the\n  window. Use predictors (age/sex mix, baseline comorbidity, baseline outcome levels) that are stably measured across all units.\n- **EHR / health-system units:** When treated and donor units are provider organizations or systems, documentation and\n  coding-policy differences across systems create level differences the pre-period fit must reconcile; ensure outcome and\n  predictor definitions are harmonized across all donor systems before fitting, and treat a system-specific EHR transition as a\n  potential idiosyncratic shock that disqualifies a donor.\n- **Registry / linked:** Population registries (cancer, perinatal, disease registries) at the geographic level give clean\n  aggregate outcomes for SCM; verify reporting completeness is comparable across donor regions and calendar periods so a\n  reporting-lag difference is not mistaken for a treatment gap.\n\n**Worked claims example.** Question: did a 2018 state-level prior-authorization policy for high-cost specialty drug class X\nreduce the state's age-standardized initiation rate per 100,000 adults, using a commercial + Medicare FFS multi-state claims\ndatabase? (1) *Units and outcome*: treated unit = the policy state; donor pool = 24 states with no comparable policy during\n2010-2022 and no large coverage-mix shock; outcome = annual age/sex-standardized initiation rate over a constant FFS-observable\ndenominator. (2) *Predictors*: pre-2018 initiation levels at several lags, age/sex distribution, baseline comorbidity index,\nbaseline specialty-drug spend. 
(3) *Fit*: nested optimization selects donor weights (e.g., 0.34 State B, 0.22 State F, 0.19\nState K, 0.15 State P, 0.10 State T; all others 0) reproducing the treated state's 2010-2017 trajectory with a small pre-period\nRMSPE (close tracking, visually overlapping lines). (4) *Effect*: post-2018 the observed initiation rate falls below the\nsynthetic control by a widening gap, averaging **−6.4 initiations per 100,000/year** over 2018-2022. (5) *Inference*: in-space\nplacebos re-estimate SCM treating each of the 24 donors as \"treated\"; the policy state's post/pre RMSPE ratio (≈ 5.1) is the\nlargest of 25, giving a permutation **p ≈ 1/25 = 0.04**; an in-time placebo (fake 2015 intervention) shows no pre-2018 gap,\nand leave-one-out re-estimation dropping each high-weight donor leaves the effect stable. The interpretation is reported as an\neffect for *this* state under this policy, with the donor weights, pre-fit, and placebo distribution shown in full.\n\n**Interpreting the output**\n\nUsing the worked example: the synthetic control's pre-period 2017 rate (40.99 per 100,000) tracks the policy\nstate's observed rate (41.0) to within 0.01. In 2019, the synthetic control predicts 42.99 per 100,000 while the\npolicy state observed 36.59, yielding a gap of 36.59 − 42.99 = −6.40 per 100,000.\n\nFormal interpretation: The estimated effect is −6.40 new initiations per 100,000 adults in 2019 for this specific\ntreated state. The synthetic control does not produce an ATE or ATT in the conventional sense — it estimates the\neffect for a single unit (the policy state), comparing its observed post-period outcome to the counterfactual\ntrajectory constructed from the donor-weighted average. The credibility of the −6.40 gap rests entirely on\npre-period fit quality: a synthetic control that closely tracked the policy state before the intervention (pre-period\ndiscrepancy of 0.01 per 100,000) provides a plausible stand-in for the unobserved counterfactual. 
Inference is\npermutation-based: the observed gap counts as unusual only if the policy state's post/pre RMSPE ratio ranks at or near\nthe top of the donor-state placebo distribution. Results apply only\nto this treated state and this policy; they do not average across many states or generalize to different policy\ncontexts.\n\nPractical interpretation: The prior-authorization policy is estimated to have reduced new drug initiations by\nroughly 6.4 per 100,000 adults per year compared with what a weighted combination of similar donor states\nexperienced over the same period. The quality of this claim depends on how well the synthetic control tracked the policy\nstate before the policy, which is excellent in this example (2017 discrepancy of 0.01). A permutation p-value from placebo\ntests across donor states quantifies how unusual the observed gap is.",
    "primary_category": "Causal_Inference_Method",
    "tags": [
      "synthetic-control",
      "donor-pool",
      "pre-period-fit",
      "placebo-inference",
      "permutation-test",
      "comparative-case-study",
      "policy-evaluation",
      "aggregate-data"
    ],
    "applies_to_study_types": [
      "claims_analysis",
      "ehr_study",
      "registry_linkage",
      "policy_evaluation",
      "comparative_effectiveness"
    ],
    "data_sources": [
      "claims",
      "registry",
      "linked",
      "ehr"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1198/jasa.2009.ap08746",
        "url": "https://doi.org/10.1198/jasa.2009.ap08746",
        "citation_text": "Abadie A, Diamond A, Hainmueller J. Synthetic control methods for comparative case studies: estimating the effect of California's tobacco control program. Journal of the American Statistical Association. 2010;105(490):493-505.",
        "year": 2010,
        "authors_short": "Abadie et al.",
        "notes": "The canonical paper formalizing the synthetic control estimator, the donor-weight optimization, the use of pre-intervention fit, and placebo/permutation inference, demonstrated on California's Proposition 99 tobacco program."
      },
      {
        "role": "explain",
        "doi": "10.1111/ajps.12116",
        "url": "https://doi.org/10.1111/ajps.12116",
        "citation_text": "Abadie A, Diamond A, Hainmueller J. Comparative politics and the synthetic control method. American Journal of Political Science. 2015;59(2):495-510.",
        "year": 2015,
        "authors_short": "Abadie et al.",
        "notes": "Extends and clarifies SCM practice including donor-pool construction, the predictor/outcome weighting, interpolation vs extrapolation, and placebo-based inference, with the German reunification case study."
      },
      {
        "role": "demonstrate",
        "doi": "10.1257/jel.20191450",
        "url": "https://doi.org/10.1257/jel.20191450",
        "citation_text": "Abadie A. Using synthetic controls: feasibility, data requirements, and methodological aspects. Journal of Economic Literature. 2021;59(2):391-425.",
        "year": 2021,
        "authors_short": "Abadie",
        "notes": "Authoritative review of when SCM is appropriate, covering required pre-period length and fit, donor-pool validity, the convex-hull (no-extrapolation) rationale, relation to difference-in-differences, and the limits of the design."
      },
      {
        "role": "use",
        "doi": "10.1198/016214504000001880",
        "url": "https://doi.org/10.1198/016214504000001880",
        "citation_text": "Rubin DB. Causal inference using potential outcomes: design, modeling, decisions. Journal of the American Statistical Association. 2005;100(469):322-331.",
        "year": 2005,
        "authors_short": "Rubin",
        "notes": "Grounds the potential-outcomes framework that defines the synthetic-control counterfactual and the treatment effect as the gap between observed and counterfactual outcomes for the treated unit."
      }
    ],
    "plain_language_summary": "The synthetic control method estimates what would have happened to a single region or health system if an intervention had never occurred, by building a stand-in from a weighted blend of similar untreated places. You choose a set of comparison regions (the donor pool), assign each a weight so that the blend closely tracks the treated region's outcome history before the intervention, then read off the post-intervention gap between what was observed and what the blend predicts. The method works for aggregate questions such as 'did this state policy reduce drug initiation rates?' and it makes the counterfactual auditable because the donor weights are visible numbers you can inspect and challenge.",
    "key_terms": [
      {
        "term": "synthetic control",
        "definition": "A stand-in outcome series for the treated region, built by blending the outcomes of untreated comparison regions using carefully chosen weights, so the blend mimics the treated region's pre-intervention history."
      },
      {
        "term": "donor pool",
        "definition": "The set of untreated comparison regions (states, health systems, countries) whose data are used to build the synthetic control; every unit in this pool must be unaffected by the intervention."
      },
      {
        "term": "weights",
        "definition": "Numbers assigned to each donor region, all between 0 and 1 and summing to exactly 1, that determine how much each donor contributes to the synthetic control; they are chosen so the weighted blend matches the treated region's pre-intervention outcome as closely as possible."
      },
      {
        "term": "pre-intervention period",
        "definition": "The span of time before the policy or program took effect, used to fit the donor weights; good fit during this period is the prerequisite for trusting the post-intervention gap."
      },
      {
        "term": "placebo test",
        "definition": "A check in which the same SCM procedure is repeated pretending each donor was the treated unit; if the true treated region's post-intervention gap is larger than almost all these fake gaps, that is the evidence of a real effect."
      }
    ],
    "worked_example": {
      "scenario": "A state passes a prior-authorization policy for a specialty drug class in 2018. We want to know whether the policy reduced the state's initiation rate (new users per 100,000 adults per year). We have annual initiation rates for the policy state and five untreated donor states from 2010 through 2022. The synthetic control is built by assigning each donor a weight so that the weighted average of donor rates tracks the policy state's rate as closely as possible in 2010-2017, then the 2019 gap between the policy state and its synthetic stand-in estimates the effect.",
      "dataset": {
        "caption": "Annual initiation rate (new users per 100,000 adults) for the policy state and five donor states. Weights are chosen by the SCM optimizer to minimize pre-period discrepancy.",
        "columns": [
          "unit",
          "role",
          "scm_weight",
          "rate_2017_pre",
          "rate_2019_post"
        ],
        "rows": [
          [
            "Policy State",
            "treated",
            "—",
            41.0,
            36.59
          ],
          [
            "State B",
            "donor",
            0.34,
            42.0,
            44.0
          ],
          [
            "State F",
            "donor",
            0.22,
            38.0,
            40.0
          ],
          [
            "State K",
            "donor",
            0.19,
            45.0,
            47.0
          ],
          [
            "State P",
            "donor",
            0.15,
            36.0,
            38.0
          ],
          [
            "State T",
            "donor",
            0.1,
            44.0,
            46.0
          ]
        ]
      },
      "steps": [
        "Compute the synthetic control's 2017 (pre-period) rate as the weighted average of donors: (0.34 x 42.0) + (0.22 x 38.0) + (0.19 x 45.0) + (0.15 x 36.0) + (0.10 x 44.0) = 14.28 + 8.36 + 8.55 + 5.40 + 4.40 = 40.99 per 100,000.",
        "Compare to the policy state's observed 2017 rate of 41.0 per 100,000 — the synthetic control is only 0.01 off, confirming excellent pre-period fit. This close tracking in the years before 2018 is what makes the post-period comparison credible.",
        "Compute the synthetic control's 2019 (post-period) rate: (0.34 x 44.0) + (0.22 x 40.0) + (0.19 x 47.0) + (0.15 x 38.0) + (0.10 x 46.0) = 14.96 + 8.80 + 8.93 + 5.70 + 4.60 = 42.99 per 100,000.",
        "The policy state's observed 2019 rate is 36.59 per 100,000. The gap is 36.59 minus 42.99 = -6.40 per 100,000 — the state initiated 6.40 fewer patients per 100,000 adults than its synthetic stand-in predicted without the policy.",
        "Verify the weights sum to 1: 0.34 + 0.22 + 0.19 + 0.15 + 0.10 = 1.00. All weights are between 0 and 1. The synthetic control is a valid weighted average, not an extrapolation."
      ],
      "result": "Estimated effect: -6.40 initiations per 100,000 adults in 2019, matching the file's reported average effect of -6.4/year over 2018-2022. The negative gap means the policy state's observed rate fell 6.40 below what the donor-weighted synthetic control predicted would have happened without the policy. Placebo tests across the 5 donor states confirmed this gap is larger than any donor's spurious gap, giving a permutation p-value of approximately 1/6 = 0.17 (with 5 donors) — highlighting why a richer donor pool matters for inference."
    },
    "prerequisites": [
      "difference-in-differences-staggered-adoption-rwe",
      "interrupted-time-series-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Single treated unit (classic SCM)",
        "description": "One treated aggregate unit; the synthetic control is a convex combination of donor units matching the treated unit's pre-intervention outcome and predictors, with the effect read as the post-period observed-minus-synthetic gap.",
        "edge_cases": [
          "A large pre-period RMSPE means the synthetic control is not a valid counterfactual and the post-period gap is uninterpretable.",
          "If the treated unit's predictors fall outside the donors' convex hull, no convex weighting can match it and the method silently extrapolates."
        ],
        "data_source_notes": "claims: treated/donor units are typically geographies; hold the FFS-observable denominator definition constant across all units and confirm donors had no coincident coverage-mix shocks."
      },
      {
        "name": "Placebo / permutation inference",
        "description": "In-space placebos re-estimate SCM treating each donor as if it were the treated unit; the treated unit's post/pre RMSPE ratio is ranked against the placebo distribution to produce a permutation p-value. In-time placebos use a fake pre-period intervention date that should show no gap.",
        "edge_cases": [
          "Donors with very poor pre-period fit inflate the placebo RMSPE distribution and should be excluded or filtered by a pre-period RMSPE cutoff before ranking.",
          "With a small donor pool the smallest attainable p-value is coarse (e.g., 1/(donors+1)); report the achievable resolution."
        ],
        "data_source_notes": "registry/claims: ensure all placebo (donor) units use identical outcome and denominator definitions so the RMSPE-ratio ranking is comparable across units."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "difference-in-differences-staggered-adoption-rwe",
        "pros_of_this": "Data-drives the donor weights to match the treated unit's pre-period trajectory, so it does not require any single comparator to be parallel and makes the counterfactual transparent and auditable (weights and pre-fit are inspectable).",
        "cons_of_this": "Handles only one or a few treated units, relies on placebo/permutation inference rather than classical standard errors, and is unreliable when the pre-period is short or the donor pool is thin.",
        "when_to_prefer": "One (or few) treated aggregate units with a long pre-period and a rich, clean donor pool; prefer DiD with many treated units, short pre-periods, or when classical inference and covariate adjustment are required."
      },
      {
        "compared_to": "interrupted-time-series-rwe",
        "pros_of_this": "Borrows strength from untreated donor units to build the counterfactual, protecting against a co-timed shock that also hits the treated series (if it hits donors too, the synthetic control absorbs it) rather than extrapolating the treated unit's own pre-trend.",
        "cons_of_this": "Requires a pool of comparable, uncontaminated donor units, which ITS does not; with no good donors SCM cannot be estimated while ITS still can.",
        "when_to_prefer": "When good donor units exist and co-timed confounding shocks are a concern; prefer ITS when no comparable donors are available and the treated unit's pre-trend is stable."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Treated and donor units are usually geographies/plans; aggregate the outcome over a constant FFS-observable denominator and drop Medicare Advantage-only person-time, then confirm no donor experienced a coverage-mix shock during the window. Use stably measured predictors (age/sex mix, baseline comorbidity, baseline outcome levels) across all units.",
      "ehr": "When units are provider systems, harmonize outcome and predictor definitions across all donor systems before fitting and treat a system-specific EHR/coding transition as an idiosyncratic shock that disqualifies a donor.",
      "registry": "Geographic-level registries give clean aggregate outcomes; verify reporting completeness is comparable across donor regions and calendar periods so a reporting-lag difference is not read as a treatment gap.",
      "linked": "Linked claims-registry data at the geographic level supplies harmonized aggregate outcomes and predictors; reconcile differing reporting periods across donors before constructing the pre-period series."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\nimport pandas as pd\nfrom pysyncon import Dataprep, Synth\nfrom pysyncon.inference import SpacePlacebo\n\n# Long panel: one row per (unit, year). 'state' is the unit; 'init_rate' the outcome.\npanel = pd.read_csv(\"state_initiation_panel.csv\")   # columns: state, year, init_rate, age_mix, comorbidity, spend\n\nprep = Dataprep(\n    foo=panel,\n    predictors=[\"age_mix\", \"comorbidity\", \"spend\"],\n    predictors_op=\"mean\",\n    dependent=\"init_rate\",\n    unit_variable=\"state\",\n    time_variable=\"year\",\n    treatment_identifier=\"PolicyState\",\n    controls_identifier=[s for s in panel[\"state\"].unique() if s != \"PolicyState\"],\n    time_predictors_prior=range(2010, 2018),        # pre-intervention period (fit window)\n    time_optimize_ssr=range(2010, 2018),\n)\nsynth = Synth()\nsynth.fit(dataprep=prep)                            # solve donor weights W (convex, sum to 1)\nprint(\"Donor weights:\", synth.weights().round(2))   # auditable contribution of each donor\n\n# Treated-minus-synthetic gap = estimated effect per post-period year.\ngaps = synth.gaps(time_period=range(2010, 2023))\nprint(gaps.loc[2018:2022])                          # post-intervention effect path\n\n# Placebo / permutation inference: rank the treated unit's post/pre RMSPE ratio vs donors.\nplacebo = SpacePlacebo(dataprep=prep, synth=Synth())\nplacebo.fit()\nprint(\"Permutation p-value:\", placebo.summary())",
        "description": "Synthetic control with placebo (in-space) inference using pysyncon. Input: a long-format panel DataFrame with columns unit,\ntime, the outcome, and predictor columns. Dataprep specifies the treated unit, donor pool, predictors, and pre/post periods;\nSynth fits the donor weights minimizing pre-period discrepancy; the treated-minus-synthetic gap is the effect, and a placebo\ntest ranks the treated unit's post/pre RMSPE ratio against the donors (Abadie et al. 2010).",
        "dependencies": [
          "pandas",
          "numpy",
          "pysyncon"
        ],
        "source_citations": [
          "abadie-2010",
          "abadie-2021"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(Synth)\n\n# `panel` is a long data.frame: state, year, init_rate, age_mix, comorbidity, spend.\ndp <- dataprep(\n  foo = panel,\n  predictors = c(\"age_mix\", \"comorbidity\", \"spend\"),\n  predictors.op = \"mean\",\n  dependent = \"init_rate\",\n  unit.variable = \"state_id\",\n  time.variable = \"year\",\n  treatment.identifier = policy_state_id,\n  controls.identifier  = donor_state_ids,\n  time.predictors.prior = 2010:2017,        # pre-intervention fit window\n  time.optimize.ssr     = 2010:2017,\n  time.plot             = 2010:2022\n)\n\nso <- synth(dp)                              # solve convex donor weights W and predictor weights V\nprint(round(so$solution.w, 2))               # donor weights (auditable counterfactual)\n\n# Pre-period fit (RMSPE) gates interpretation; gaps.plot shows observed - synthetic.\ntab <- synth.tab(synth.res = so, dataprep.res = dp)\nprint(tab$tab.pred)                          # treated vs synthetic predictor balance\ngaps.plot(synth.res = so, dataprep.res = dp, Main = \"Observed minus synthetic (effect path)\")\npath.plot(synth.res = so, dataprep.res = dp, Main = \"Treated vs synthetic control\")",
        "description": "Synthetic control with the canonical Synth package (Abadie, Diamond, Hainmueller). dataprep builds the predictor and outcome\nmatrices, synth solves the donor weights, and gaps.plot/path.plot visualize observed vs synthetic. Placebo inference is run\nby re-estimating SCM with each donor as the treated unit and comparing post/pre RMSPE ratios (Abadie et al. 2010, 2015).",
        "dependencies": [
          "Synth"
        ],
        "source_citations": [
          "abadie-2010",
          "abadie-2015"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* work.pre : pre-period rows; var y_treated and y_donor1..y_donorJ (donor outcome columns). */\n/* Read the pre-period matrices into PROC OPTMODEL and solve the convex weighting problem.    */\nproc optmodel;\n  set PRE;                                   /* index set of pre-intervention periods        */\n  set DONORS;                                /* index set of donor units                     */\n  number ytreat {PRE};                       /* treated unit's pre-period outcomes           */\n  number ydonor {PRE, DONORS};               /* donor pre-period outcomes                    */\n  read data work.pre_long into PRE=[period]  ytreat=y_treated;\n  read data work.donor_long into [period donor] ydonor=y;\n\n  var w {DONORS} >= 0;                        /* non-negative donor weights                   */\n  con simplex: sum {j in DONORS} w[j] = 1;    /* weights sum to one (convex combination)      */\n\n  /* Minimize pre-period squared discrepancy = ||y_treated - donor-weighted|| over PRE. */\n  min prefit = sum {t in PRE} ( ytreat[t] - sum {j in DONORS} w[j]*ydonor[t,j] )^2;\n  solve with nlp;                             /* quadratic program for the synthetic weights  */\n  print w;                                    /* fitted donor weights (the counterfactual)    */\n  create data work.scm_weights from [donor]=DONORS weight=w;\nquit;\n\n/* Apply the weights to ALL periods to build the synthetic series and the gap = observed - synthetic. */\nproc sql;\n  create table work.effect as\n  select a.period, a.y_treated,\n         sum(b.weight * a.y_value) as y_synth,\n         a.y_treated - calculated y_synth as gap     /* post-period gap = estimated effect */\n  from work.panel_long a\n  join work.scm_weights b on a.donor = b.donor\n  group by a.period, a.y_treated;\nquit;\nproc print data=work.effect noobs; run;",
        "description": "Synthetic control in SAS by solving the donor-weight quadratic program in PROC OPTMODEL: choose non-negative weights summing\nto one that minimize the pre-period squared discrepancy between the treated unit's outcome series and the donor-weighted\nseries. Input is a wide pre-period matrix (one column per donor, rows = pre-periods) plus the treated unit's pre-period\nvector; the fitted weights are then applied to post-periods to form the synthetic control and the gap (Abadie et al. 2010).",
        "dependencies": [],
        "source_citations": [
          "abadie-2010",
          "abadie-2021"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  Treated[Treated unit<br/>pre-period outcome + predictors] --> Opt[Solve donor weights W<br/>min pre-period discrepancy]\n  Donors[Donor pool<br/>untreated comparable units] --> Opt\n  Opt --> Synth[Synthetic control<br/>convex weighted donors]\n  Synth --> Fit{Good pre-period fit?<br/>small RMSPE}\n  Fit -->|No| Bad[Not a valid counterfactual<br/>do not interpret gap]\n  Fit -->|Yes| Gap[Post-period gap<br/>observed - synthetic = effect]\n  Gap --> Inf[Placebo / permutation inference<br/>rank post/pre RMSPE ratio]",
        "caption": "Structure of the synthetic control method. Convex donor weights are chosen to reproduce the treated unit's pre-intervention path; a small pre-period RMSPE is the prerequisite for interpreting the post-period observed-minus-synthetic gap as the effect, with significance assessed by placebo/permutation tests.",
        "alt_text": "Flow diagram from a treated unit and a donor pool through a weight-optimization step to a synthetic control, a pre-period fit gate, the post-period gap as the effect, and placebo inference.",
        "source_type": "illustrative",
        "source_citations": [
          "abadie-2010"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Q[Single treated aggregate unit] --> Pre{Long, stable<br/>pre-period series?}\n  Pre -->|No| Stop1[Too few/noisy pre-periods - SCM unreliable]\n  Pre -->|Yes| Pool{Clean, comparable<br/>uncontaminated donor pool?}\n  Pool -->|No| Stop2[No valid donors - cannot build counterfactual]\n  Pool -->|Yes| Hull{Treated inside<br/>donors convex hull?}\n  Hull -->|No| Stop3[Extrapolation - convex weights cannot match]\n  Hull -->|Yes| Fit2{Pre-period RMSPE small?}\n  Fit2 -->|No| Stop4[Poor fit - gap uninterpretable]\n  Fit2 -->|Yes| Effect[Gap = effect; run in-space + in-time placebos]\n  Effect --> Report[Report weights, pre-fit, placebo p-value]",
        "caption": "Feasibility-and-validity workflow for SCM. The method requires a long pre-period, a clean donor pool, the treated unit inside the donors' convex hull, and a small pre-period RMSPE; only then is the post-period gap a credible effect, reported with placebo-based inference.",
        "alt_text": "Decision tree from a single treated unit through checks on pre-period length, donor-pool validity, convex-hull membership, and pre-period fit to estimating the effect gap and reporting placebo-based inference.",
        "source_type": "illustrative",
        "source_citations": [
          "abadie-2021"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "alternative_to",
        "target_slug": "difference-in-differences-staggered-adoption-rwe",
        "notes": "SCM is the single-treated-unit generalization of difference-in-differences; it data-drives donor weights to match the pre-period instead of assuming parallel trends and nests DiD when the optimal weights are uniform."
      },
      {
        "relation_type": "see_also",
        "target_slug": "interrupted-time-series-rwe",
        "notes": "Both evaluate an abrupt intervention on an aggregate series; ITS extrapolates the treated unit's own pre-trend as the counterfactual, while SCM borrows strength from untreated donors, protecting against co-timed shocks shared with donors."
      },
      {
        "relation_type": "alternative_to",
        "target_slug": "instrumental-variables-pharmacoepi-rwe",
        "notes": "IV is an individual-level design identifying effects via an instrument; SCM is an aggregate small-N design identifying effects via pre-period fit and donor-pool validity, used when the unit of intervention is itself a geography or system."
      },
      {
        "relation_type": "see_also",
        "target_slug": "target-trial-emulation",
        "notes": "Target-trial emulation estimates a population-average effect from many individuals with measured confounders; SCM estimates an effect for one aggregate treated unit where no individual-level comparison or measured confounders identify it."
      },
      {
        "relation_type": "see_also",
        "target_slug": "generalizability-transportability-external-validity-rwe",
        "notes": "The SCM estimate is specific to the single treated unit under the studied intervention; transporting it to other units raises the external-validity/generalizability questions handled by transportability methods."
      },
      {
        "relation_type": "complements",
        "target_slug": "rare-disease-external-controls-rwe",
        "notes": "SCM's logic of building a weighted counterfactual from comparable untreated units parallels external-control construction; both formalize a principled comparator when a randomized or single natural control is unavailable."
      }
    ],
    "aliases": [
      "SCM",
      "synthetic control",
      "synthetic control estimator",
      "synthetic comparison group",
      "comparative case study with synthetic controls"
    ],
    "complexity": "advanced",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta",
      "pqa-cms"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "systematic-review",
    "name": "Systematic Review",
    "short_definition": "A study design that answers a pre-specified question by systematically searching, screening, critically appraising, and synthesizing all eligible primary studies under a registered protocol, with explicit reproducible methods to minimize selection and reporting bias.",
    "long_description": "A **systematic review (SR)** is a secondary research design that treats the body of primary evidence on a focused\nquestion as its unit of analysis. Unlike a narrative review, an SR is governed by a pre-registered protocol (PROSPERO,\nor a fixed protocol for a regulatory PASS) that fixes the **PICOTS** eligibility frame, the search strategy, the\nscreening and data-extraction rules, the risk-of-bias instrument, and the synthesis plan *before* results are seen.\nQuantitative pooling (meta-analysis) is an optional downstream step, not the SR itself: a methodologically complete SR\nmay legitimately end in a structured narrative or tabular synthesis when pooling is inappropriate.\n\n**Core conceptual distinction.** The SR is defined by *process discipline*, not by whether numbers are combined. Three\nseparable design choices do the work. (1) *Systematic vs narrative*: an exhaustive, documented, reproducible search and\ndual independent screening replace the convenience sampling of literature that makes narrative reviews\nselection-biased. (2) *Protocol-first vs data-driven*: locking eligibility, outcomes, and analysis in a registered\nprotocol prevents the outcome-switching and post-hoc subgroup mining that inflate false positives. (3) *Synthesis vs\npooling*: synthesis is the integration step; meta-analysis is one synthesis tool, valid only when studies are clinically\nand methodologically similar enough that a common estimand exists. 
The SR's estimand is therefore a property of the\n*evidence base* (e.g., \"the comparative effect of A vs B across all eligible RWE and trial studies\"), and its central\nthreat is not confounding within a study but **selection and reporting bias across studies** — missing trials, missing\noutcomes, and language/publication filtering.\n\n**Pros, cons, and trade-offs** (specific and comparative).\n- **vs a narrative/expert review:** the SR's reproducible search and dual screening remove cherry-picking and let\n  another team regenerate the included set. Cost: months of effort, and the rigor is wasted if the underlying studies\n  are themselves biased — an SR cannot manufacture evidence quality it does not have (\"garbage in, garbage out\").\n  **Prefer the SR** for any decision-grade or guideline question; reserve narrative reviews for scoping or framing.\n- **vs a scoping review:** an SR answers a *narrow, pre-specified effect/association* question and appraises risk of\n  bias; a scoping review *maps* a literature's breadth and gaps without effect estimation or formal appraisal.\n  **Prefer the SR** when the question is a focused PICOTS contrast; **prefer scoping** when the question is \"what\n  evidence exists?\".\n- **vs a single large pragmatic trial or a single well-powered RWE study:** the SR aggregates power and improves\n  external validity by spanning settings, but inherits every primary study's design flaws and adds across-study\n  heterogeneity and ecological/aggregation bias. **Prefer a single high-quality study** when one exists that directly\n  answers the question in the target population; **prefer the SR** when no single study is decisive or when consistency\n  across settings is itself the question.\n- **vs jumping straight to meta-analysis:** pooling without the SR scaffolding (protocol, exhaustive search, RoB) is\n  just a weighted average of a biased sample of studies. 
The SR is the prerequisite that makes a pooled number\n  interpretable.\n\n**When to use.** A focused, answerable PICOTS question; a decision that requires the *totality* of evidence (HTA\nsubmissions, clinical guidelines, payer dossiers, regulatory benefit-risk); a literature large or contested enough that\na defensible, reproducible selection process matters; or as the evidence-generation front end before an indirect\ntreatment comparison or network meta-analysis. Always register the protocol (PROSPERO) and follow PRISMA 2020 reporting.\n\n**When NOT to use — and when it is actively misleading or dangerous.**\n- **The question is exploratory or the field is immature.** With three heterogeneous studies, a \"systematic review with\n  meta-analysis\" lends spurious authority to a pooled estimate dominated by one study; a scoping review or a single\n  well-conducted study is more honest.\n- **The included evidence is uniformly high risk of bias.** Synthesizing biased studies produces a *precise* but\n  *wrong* answer — narrow confidence intervals around a biased pooled effect are more dangerous than no estimate,\n  because precision is read as certainty. GRADE-rate down for risk of bias and say so plainly.\n- **Clinical/methodological diversity is too large for a common estimand.** Pooling across incomparable populations,\n  comparators, or outcome definitions (the classic \"mixing apples and oranges\") yields an average that describes no\n  real patient. 
Default to structured narrative synthesis; do not force a forest plot.\n- **The decision needs patient-level effect modification.** Aggregate-data SRs cannot recover within-study\n  interactions; an individual-patient-data meta-analysis or a single richly covariate'd RWE study is required.\n- **Time-sensitive signal detection.** A 12-month SR is the wrong instrument for an emerging safety signal that needs a\n  rapid analysis of a live data network.\n\n**Data-source operational depth.** The SR's \"data\" are the *included primary studies*, but in RWE/HEOR those studies\nare themselves built on claims, EHR, registry, or linked data, and an SR is only as trustworthy as its handling of how\nthose substrates bias the primary estimates. Treat data-source characteristics as extraction fields and appraisal\ncriteria, not afterthoughts.\n- **Claims-based primary studies:** extract and appraise the washout/lookback length, the continuous-enrollment\n  requirement, exposure operationalization (NDC + `fill_date` + `days_supply`), and outcome algorithm (e.g.,\n  1-inpatient-or-2-outpatient). Failure mode: pooling studies that silently mixed Medicare Advantage and\n  fee-for-service person-time — MA-only enrollees lack FFS claims, so their \"no prior fill\" washout is missingness, and\n  studies that did not exclude MA-only time carry differential exposure/outcome ascertainment that surfaces as\n  unexplained heterogeneity. Workaround: code an \"FFS-only vs mixed\" extraction flag and pre-specify it as a subgroup or\n  meta-regression covariate.\n- **EHR-based primary studies:** appraise whether exposure is the order vs the dispensing, and whether loss to\n  follow-up (patients leaving the system) was treated as informative. 
Failure mode: out-of-network care makes outcomes\n  differentially under-captured; pooling EHR and claims studies without flagging capture completeness conflates true\n  effect differences with ascertainment differences.\n- **Registry-based primary studies:** strong for adjudicated outcomes and severity but weak for complete exposure;\n  appraise linkage to fills and to a death index. Failure mode: registries with voluntary enrollment carry selection\n  that no SR-level adjustment can remove.\n- **Linked claims–EHR–vital-records studies:** the strongest substrate but the smallest, linkable subset; appraise\n  linkage selection. Differential **competing risks by exposure in elderly claims** populations (e.g., one drug skewed\n  to frailer patients who die before the non-fatal outcome) and **immortal time in procedure studies** (follow-up\n  started at diagnosis rather than at the procedure) are the two within-study biases an RWE SR most often must catch in\n  appraisal and, where present, exclude or down-weight rather than blindly pool.\n\n**Worked example (claims-style logic in the included studies).** Question (PICOTS): in adults with type 2 diabetes\n(P), does a second-generation sulfonylurea (I) vs a DPP-4 inhibitor (C) increase incident heart failure (O) over\n≥6 months (T) in routine-care administrative data (S)? Protocol registered on PROSPERO; PRISMA 2020 followed.\n(1) Search MEDLINE/Embase/Web of Science plus the conference and regulatory grey literature; dedupe; dual independent\ntitle/abstract and full-text screening with a third-reviewer tiebreak. (2) Eligibility encodes the RWE design quality\ndirectly: include only active-comparator new-user studies that required ≥365-day continuous enrollment, a drug-free\nwashout (no sulfonylurea or DPP-4 fill in the lookback), time-zero at the first qualifying fill, and an HF outcome\nalgorithm with reported PPV. 
(3) Extraction (dual) captures: adjusted hazard ratio and 95% CI, the log-HR and its SE\n(SE = (ln(upper) − ln(lower)) / (2 × 1.96)), the data source, an `ffs_only` flag (1 if MA-only person-time was\nexcluded, else 0), washout days, and the `days_supply`-based grace period for the as-treated window. (4) Appraise each\nincluded study with ROBINS-I-aligned domains plus RWE-specific items (immortal time, competing risks, time-zero\nalignment); ROBIS and AMSTAR-2 appraise the review process itself, not the primary studies. (5) Synthesis: if\nclinical/methodological diversity is acceptable, pool with a DerSimonian–Laird (or REML)\nrandom-effects inverse-variance model, report I² and the prediction interval, and run the pre-specified `ffs_only`\nsubgroup; if diversity is too large, present a structured narrative with a harvest plot and do not pool. (6) GRADE the\ncertainty of the body of evidence (start at \"low\" for observational evidence; rate up for large magnitude,\ndose-response, or plausible residual confounding that would reduce the observed effect; rate down for risk of bias,\ninconsistency, indirectness, imprecision, or publication bias) and report a transparent PRISMA flow accounting for\nevery excluded record.",
    "primary_category": "Study_Design",
    "tags": [
      "evidence-synthesis",
      "systematic-review",
      "prisma",
      "risk-of-bias",
      "meta-analysis",
      "grade",
      "prospero",
      "secondary-research"
    ],
    "applies_to_study_types": [
      "systematic_review"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1136/bmj.n71",
        "url": "https://doi.org/10.1136/bmj.n71",
        "citation_text": "Page MJ, McKenzie JE, Bossuyt PM, et al. The PRISMA 2020 statement: an updated guideline for reporting systematic reviews. BMJ. 2021;372:n71.",
        "year": 2021,
        "authors_short": "Page et al.",
        "notes": "The current reporting standard that operationally defines what a complete, reproducible systematic review must document, from protocol and search to flow diagram and certainty rating."
      },
      {
        "role": "introduce",
        "doi": "10.1136/bmj.b2535",
        "url": "https://doi.org/10.1136/bmj.b2535",
        "citation_text": "Moher D, Liberati A, Tetzlaff J, Altman DG; PRISMA Group. Preferred reporting items for systematic reviews and meta-analyses: the PRISMA statement. BMJ. 2009;339:b2535.",
        "year": 2009,
        "authors_short": "Moher et al.",
        "notes": "The original PRISMA statement that established systematic-review reporting as a discipline; retained for historical grounding and citation in older protocols."
      },
      {
        "role": "explain",
        "doi": "10.1136/bmj.n160",
        "url": "https://doi.org/10.1136/bmj.n160",
        "citation_text": "Page MJ, Moher D, Bossuyt PM, et al. PRISMA 2020 explanation and elaboration: updated guidance and exemplars for reporting systematic reviews. BMJ. 2021;372:n160.",
        "year": 2021,
        "authors_short": "Page et al.",
        "notes": "Item-by-item rationale and worked exemplars for each PRISMA 2020 element; the practical companion for designing and writing a review."
      },
      {
        "role": "explain",
        "doi": "10.1016/j.jclinepi.2015.06.005",
        "url": "https://doi.org/10.1016/j.jclinepi.2015.06.005",
        "citation_text": "Whiting P, Savović J, Higgins JPT, et al. ROBIS: a new tool to assess risk of bias in systematic reviews was developed. Journal of Clinical Epidemiology. 2016;69:225-234.",
        "year": 2016,
        "authors_short": "Whiting et al.",
        "notes": "Domain-based instrument for appraising the risk of bias of the review process itself (eligibility, identification/selection, data collection/appraisal, synthesis), distinct from appraising the included studies."
      },
      {
        "role": "demonstrate",
        "doi": "10.1136/bmj.j4008",
        "url": "https://doi.org/10.1136/bmj.j4008",
        "citation_text": "Shea BJ, Reeves BC, Wells G, et al. AMSTAR 2: a critical appraisal tool for systematic reviews that include randomised or non-randomised studies of healthcare interventions, or both. BMJ. 2017;358:j4008.",
        "year": 2017,
        "authors_short": "Shea et al.",
        "notes": "Critical-appraisal checklist widely used by HTA bodies and guideline panels to judge whether an existing systematic review is trustworthy enough to use, including reviews of non-randomized (RWE) studies."
      },
      {
        "role": "use",
        "doi": "10.1136/bmj.39489.470347.AD",
        "url": "https://doi.org/10.1136/bmj.39489.470347.AD",
        "citation_text": "Guyatt GH, Oxman AD, Vist GE, et al. GRADE: an emerging consensus on rating quality of evidence and strength of recommendations. BMJ. 2008;336:924-926.",
        "year": 2008,
        "authors_short": "Guyatt et al.",
        "notes": "The certainty-of-evidence framework applied to a systematic review's synthesized body of evidence; the standard way HTA and guideline bodies translate a review into a graded recommendation."
      }
    ],
    "plain_language_summary": "A systematic review answers one focused question by finding every relevant study, not just the convenient few, and summarizing what they found. You write down the rules in advance (which studies count, how you will search, how you will judge their quality), register that plan publicly, then follow it so another team could repeat your work and get the same set of studies. Some reviews end by statistically pooling the numbers into one combined estimate, but pooling is optional; the review itself is the disciplined search-and-appraisal process. Its main limitation is simple: if the underlying studies are flawed, a tidy review cannot turn them into good evidence.",
    "key_terms": [
      {
        "term": "pre-registered protocol",
        "definition": "A written plan, posted in a public registry like PROSPERO before you look at any results, that fixes your question and methods so you cannot quietly change them once you see how the numbers come out."
      },
      {
        "term": "risk-of-bias appraisal",
        "definition": "A structured check of each included study asking whether its design or conduct could have pushed its result in a particular direction, so weak studies are flagged rather than trusted equally."
      },
      {
        "term": "meta-analysis",
        "definition": "An optional final step that statistically combines the results of similar studies into a single pooled estimate; only valid when the studies are alike enough to be measuring the same thing."
      },
      {
        "term": "PRISMA flow",
        "definition": "A standard accounting diagram that tracks every record from the initial search down to the final included studies, showing how many were removed at each stage and why."
      },
      {
        "term": "scoping review",
        "definition": "A related but different review that maps what evidence exists on a broad topic without judging study quality or combining results; it answers 'what is out there?' rather than a single focused question."
      }
    ],
    "worked_example": {
      "scenario": "You want to answer one focused question: in adults with type 2 diabetes, does a particular older diabetes drug raise the risk of heart failure compared with a newer alternative? Before searching, you register a protocol that states the question, which studies qualify, and how you will appraise them. You then run the search and walk every record down the funnel, throwing studies out only for reasons you wrote down in advance. This table is that funnel: each row is a stage and the count of records still in play, and the drops between stages must add up exactly.",
      "dataset": {
        "caption": "The PRISMA-style record count an analyst tracks from the first database search to the studies actually synthesized. Each stage subtracts a documented number of records.",
        "columns": [
          "stage",
          "n"
        ],
        "rows": [
          [
            "records_identified",
            1240
          ],
          [
            "duplicates_removed",
            240
          ],
          [
            "records_screened",
            1000
          ],
          [
            "excluded_title_abstract",
            870
          ],
          [
            "full_text_assessed",
            130
          ],
          [
            "excluded_full_text",
            121
          ],
          [
            "included_in_synthesis",
            9
          ]
        ]
      },
      "steps": [
        "Start with everything the searches returned: 1,240 records pulled from multiple databases plus conference and regulatory grey literature.",
        "The same study often appears in more than one database, so remove the 240 duplicate records first: 1,240 minus 240 leaves 1,000 distinct records to screen.",
        "Two reviewers independently read each title and abstract against the pre-registered eligibility rules; clearly irrelevant ones are dropped. Here 870 are excluded, leaving 130 worth reading in full.",
        "Both reviewers then read the full text of those 130 and exclude 121 more, each for a documented reason (wrong population, wrong comparator, no usable effect estimate), which leaves 9 studies.",
        "Each of the 9 surviving studies gets a risk-of-bias appraisal, so a study that, for example, started follow-up at the wrong moment is flagged rather than trusted blindly.",
        "Because the 9 studies ask the same focused question in similar ways, the review can pool them in a meta-analysis; a scoping review, by contrast, would only have mapped what exists, without appraising study quality or combining results."
      ],
      "result": "1,240 identified minus 240 duplicates = 1,000 screened; 1,000 screened minus 991 excluded (870 at title/abstract plus 121 at full text) = 9 studies included in the synthesis. Every record is accounted for, so the funnel reconciles."
    },
    "prerequisites": [
      "picots-framework-rwe",
      "scoping-review",
      "meta-analysis-obs"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Systematic review with quantitative meta-analysis",
        "description": "The full process culminating in a pooled effect estimate (fixed-effect or random-effects inverse-variance) when included studies are clinically and methodologically similar enough to share a common estimand.",
        "edge_cases": [
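          "Few studies (<5) make the between-study variance estimate unstable; fixed-effect and random-effects estimates can diverge sharply and the standard Wald-type random-effects CI is anticonservative. Consider a Hartung-Knapp adjustment and report a prediction interval, noting that it is itself imprecise when k is small.",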
          "One large study dominating the inverse-variance weights makes the \"pooled\" estimate effectively a single-study result dressed as synthesis."
        ],
        "data_source_notes": "For RWE inputs, extract the adjusted (not crude) effect and reconstruct the log-effect SE from the reported CI; flag claims studies that did not exclude Medicare Advantage-only person-time as a heterogeneity source."
      },
      {
        "name": "Systematic review with structured narrative / synthesis-without-meta-analysis (SWiM)",
        "description": "A rigorous review whose synthesis is tabular or narrative (vote-counting by direction, harvest plots, effect-direction plots) because pooling is inappropriate; follows the SWiM reporting guideline.",
        "edge_cases": [
          "Naive vote-counting (counting \"significant\" studies) ignores effect size and power and is itself biased; summarize by effect direction and magnitude, not by p-value tallies."
        ],
        "data_source_notes": "Preferred when RWE studies differ in design (new-user vs prevalent-user), data source, or outcome algorithm such that a single estimand is not credible."
      },
      {
        "name": "Living systematic review",
        "description": "A review kept continually up to date with a defined search frequency and explicit signal-to-update rules, used for fast-moving evidence (e.g., a new drug class) or maintained HTA guidance.",
        "edge_cases": [
          "Repeated updating means repeated significance testing, which inflates the type I error rate; pre-specify statistical stopping/updating rules (e.g., trial-sequential-analysis-style monitoring) to avoid \"peeking\" bias."
        ],
        "data_source_notes": "Requires automation-friendly search and screening pipelines; valuable when an RWE data network keeps producing new analyses."
      },
      {
        "name": "Rapid review",
        "description": "A streamlined SR that deliberately narrows scope or limits steps (single-reviewer screening, restricted databases) under time pressure, with the shortcuts explicitly reported as limitations.",
        "edge_cases": [
          "Single-reviewer screening and a truncated grey-literature search trade speed for higher risk of missed studies and selection bias; the magnitude of that trade-off must be stated, not hidden."
        ],
        "data_source_notes": "Acceptable for urgent payer/HTA timelines if the deviations from a full SR are transparent."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Narrative / expert review",
        "pros_of_this": "Reproducible exhaustive search and dual independent screening remove cherry-picking; another team can regenerate the included set and verify the conclusions.",
        "cons_of_this": "Months of effort; cannot create evidence quality the primary studies lack (garbage in, garbage out).",
        "when_to_prefer": "Any decision-grade question (guideline, HTA, payer dossier, regulatory benefit-risk) where the totality of evidence must be defensible."
      },
      {
        "compared_to": "Scoping review",
        "pros_of_this": "Answers a focused PICOTS effect/association question and formally appraises risk of bias, yielding a decision-ready synthesis.",
        "cons_of_this": "Narrow by design; will not map the breadth of a field or surface unexpected evidence gaps.",
        "when_to_prefer": "The question is a specific contrast with a measurable outcome rather than \"what evidence exists?\"."
      },
      {
        "compared_to": "A single large high-quality study (pragmatic trial or well-powered RWE study)",
        "pros_of_this": "Aggregates power and tests consistency of an effect across settings and data sources, improving external validity.",
        "cons_of_this": "Inherits every included study's design flaws and adds across-study heterogeneity and aggregation bias; cannot recover within-study effect modification from aggregate data.",
        "when_to_prefer": "No single study is decisive, the literature is contested, or consistency across settings is itself the question."
      },
      {
        "compared_to": "Meta-analysis performed without the systematic-review scaffolding",
        "pros_of_this": "The protocol, exhaustive search, and risk-of-bias appraisal make any pooled number interpretable rather than a weighted average of a biased convenience sample.",
        "cons_of_this": "The scaffolding is laborious and front-loaded before any estimate is produced.",
        "when_to_prefer": "Always, when a pooled estimate will inform a decision — pooling without the SR is uninterpretable."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Extract per included study: washout/lookback length, continuous-enrollment requirement, exposure definition (NDC + fill_date + days_supply), outcome algorithm with PPV, and an ffs_only flag (1 if Medicare Advantage-only person-time was excluded). Pre-specify ffs_only and washout length as subgroup/meta-regression covariates because they drive heterogeneity.",
      "ehr": "Appraise exposure ascertainment (order vs dispensing), out-of-network outcome capture, and whether loss to follow-up was treated as informative. Flag capture completeness so that ascertainment differences are not mistaken for true effect heterogeneity when pooled with claims studies.",
      "registry": "Strong for adjudicated outcomes and severity, weak for complete exposure; appraise voluntary-enrollment selection and linkage to fills and a death index. Selection here is not removable by SR-level methods.",
      "linked": "Strongest substrate but the smallest, because only the linkable subset is covered; appraise linkage selection plus the two within-study biases RWE reviews most often miss: differential competing risks by exposure in elderly populations and immortal time in procedure studies. Exclude or down-weight affected estimates rather than blindly pooling them."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\nimport pandas as pd\n\ndef prisma_balance(flow: dict) -> dict:\n    # Reconciles the PRISMA 2020 flow: every identified record is screened, excluded, or included.\n    screened = flow[\"identified\"] - flow[\"duplicates\"]\n    excluded = flow[\"excluded_screening\"] + flow[\"excluded_fulltext\"]\n    derived_included = screened - excluded\n    assert derived_included == flow[\"included\"], (\n        f\"PRISMA flow does not reconcile: {derived_included} vs reported {flow['included']}\"\n    )\n    return {\"records_screened\": screened, \"records_excluded\": excluded, \"studies_included\": flow[\"included\"]}\n\ndef rob_summary(studies: pd.DataFrame) -> pd.DataFrame:\n    # Cross-tab of risk-of-bias by data source -- the table a GRADE panel needs to rate certainty.\n    return (studies.pivot_table(index=\"data_source\", columns=\"rob_overall\",\n                                values=\"study_id\", aggfunc=\"count\", fill_value=0))\n\ndef log_effect_se(ci_low: float, ci_high: float) -> float:\n    # SE of the log-effect reconstructed from a reported 95% CI (HR/OR/RR are log-normal).\n    return (np.log(ci_high) - np.log(ci_low)) / (2 * 1.959964)\n\ndef random_effects_pool(studies: pd.DataFrame) -> dict:\n    yi = np.log(studies[\"effect\"].to_numpy())                       # log effect per study\n    sei = np.array([log_effect_se(lo, hi)\n                    for lo, hi in zip(studies[\"ci_low\"], studies[\"ci_high\"])])\n    wi = 1.0 / sei**2                                               # fixed-effect (inverse-variance) weights\n    ybar_fe = np.sum(wi * yi) / np.sum(wi)\n    Q = np.sum(wi * (yi - ybar_fe) ** 2)                           # Cochran's Q\n    k = len(yi)\n    C = np.sum(wi) - np.sum(wi**2) / np.sum(wi)\n    tau2 = max(0.0, (Q - (k - 1)) / C)                            # DerSimonian-Laird between-study variance\n    wi_re = 1.0 / (sei**2 + tau2)\n    ybar = np.sum(wi_re * yi) / np.sum(wi_re)\n    se = np.sqrt(1.0 / 
np.sum(wi_re))\n    i2 = max(0.0, (Q - (k - 1)) / Q) * 100 if Q > 0 else 0.0      # I-squared (% variation from heterogeneity)\n    # 95% prediction interval (Higgins-Thompson): where a NEW study's true effect is expected to fall.\n    from scipy import stats  # optional; comment out and use 1.96 if scipy is unavailable\n    t = stats.t.ppf(0.975, df=k - 2) if k > 2 else 1.959964\n    pi = (ybar - t * np.sqrt(tau2 + se**2), ybar + t * np.sqrt(tau2 + se**2))\n    return {\n        \"pooled_effect\": float(np.exp(ybar)),\n        \"ci_95\": (float(np.exp(ybar - 1.959964 * se)), float(np.exp(ybar + 1.959964 * se))),\n        \"tau2\": float(tau2), \"I2_pct\": float(i2), \"Q\": float(Q), \"k\": int(k),\n        \"prediction_interval\": (float(np.exp(pi[0])), float(np.exp(pi[1]))),\n    }\n\ndef ffs_subgroup(studies: pd.DataFrame) -> pd.DataFrame:\n    # Pre-specified subgroup: claims studies that excluded MA-only person-time vs those that did not.\n    return (studies.assign(grp=np.where(studies[\"ffs_only\"] == 1, \"FFS-only\", \"mixed/other\"))\n                   .groupby(\"grp\", group_keys=False)\n                   .apply(lambda g: pd.Series(random_effects_pool(g))))",
        "description": "PRISMA flow accounting, risk-of-bias tabulation, and DerSimonian-Laird random-effects pooling for a systematic\nreview of claims/EHR-based RWE studies. Required inputs (one row per included study, already extracted and\ndual-checked):\n  studies : study_id, effect (adjusted HR/OR/RR on the ORIGINAL scale), ci_low, ci_high,\n            data_source ('claims'/'ehr'/'registry'/'linked'), ffs_only (0/1),\n            rob_overall ('low'/'moderate'/'high'), washout_days, days_supply_grace\n  flow    : a dict of PRISMA counts with keys identified, duplicates, excluded_screening, excluded_fulltext,\n            included.\nAll effects are log-transformed for pooling; report results back-transformed. The prediction interval takes its\nt quantile from scipy.stats when k > 2. Pool ONLY when clinical/methodological\ndiversity is acceptable -- otherwise present the structured narrative instead.",
        "dependencies": [
          "pandas",
          "numpy",
          "scipy"
        ],
        "source_citations": [
          "page-2021"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(metafor)\nlibrary(dplyr)\n\n# Reconstruct the log effect (yi) and its sampling variance (vi) from each study's reported 95% CI.\nprep_effects <- function(studies) {\n  studies %>%\n    mutate(yi = log(effect),\n           sei = (log(ci_high) - log(ci_low)) / (2 * qnorm(0.975)),\n           vi = sei^2)\n}\n\nrob_summary <- function(studies) {\n  # Risk-of-bias by data source -- the evidence-profile input for GRADE.\n  with(studies, table(data_source, rob_overall))\n}\n\npool_random_effects <- function(studies) {\n  d <- prep_effects(studies)\n  # DerSimonian-Laird (method = \"DL\"); switch to \"REML\" for the modern default.\n  fit <- rma(yi = yi, vi = vi, data = d, method = \"DL\")\n  pi <- predict(fit, transf = exp)          # 95% prediction interval, back-transformed\n  list(\n    pooled_effect = as.numeric(exp(coef(fit))),\n    ci_95 = exp(c(fit$ci.lb, fit$ci.ub)),\n    tau2 = fit$tau2, I2_pct = fit$I2, Q = fit$QE, k = fit$k,\n    prediction_interval = c(pi$pi.lb, pi$pi.ub)\n  )\n}\n\n# Pre-specified subgroup / meta-regression: did the claims study exclude MA-only person-time?\nffs_metareg <- function(studies) {\n  d <- prep_effects(studies)\n  rma(yi = yi, vi = vi, mods = ~ factor(ffs_only), data = d, method = \"DL\")\n}",
        "description": "Same systematic-review synthesis in R. With metafor the pooling is one call to rma(); the per-study log effects and\nsampling variances are reconstructed explicitly from the reported CIs for transparency. Inputs mirror the Python\nversion (one row per included study):\n  studies : study_id, effect, ci_low, ci_high, data_source, ffs_only (0/1), rob_overall, washout_days",
        "dependencies": [
          "metafor",
          "dplyr"
        ],
        "source_citations": [
          "page-2021"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* 1. PRISMA flow reconciliation: screened - excluded must equal included. */\nproc sql;\n  select (&n_identified - &n_duplicates)                         as records_screened,\n         (&n_excl_screen + &n_excl_fulltext)                     as records_excluded,\n         (&n_identified - &n_duplicates - &n_excl_screen - &n_excl_fulltext)\n                                                                 as studies_included_derived,\n         &n_included                                             as studies_included_reported\n  from work.studies(obs=1);\nquit;\n\n/* 2. Risk of bias by data source -- the GRADE evidence-profile input. */\nproc freq data=work.studies;\n  tables data_source*rob_overall / norow nocol nopercent;\nrun;\n\n/* 3. Per-study log effect (yi) and its variance (vi) reconstructed from the reported 95% CI. */\ndata eff;\n  set work.studies;\n  yi  = log(effect);\n  sei = (log(ci_high) - log(ci_low)) / (2*probit(0.975));\n  vi  = sei**2;\n  wi  = 1/vi;                                    /* fixed-effect inverse-variance weight */\nrun;\n\n/* 4. DerSimonian-Laird between-study variance (tau2) and random-effects pool. 
*/\nproc sql noprint;\n  select sum(wi), sum(wi*yi), sum(wi*wi), count(*) into :sw, :swy, :sww, :k from eff;\nquit;\n\ndata _null_;\n  ybar_fe = &swy/&sw;                            /* fixed-effect mean (log scale) */\n  call symputx('ybarfe', ybar_fe);\nrun;\n\ndata q; set eff; q_i = wi*(yi-&ybarfe)**2; run;\nproc sql noprint; select sum(q_i) into :Q from q; quit;\n\ndata _null_;\n  C    = &sw - (&sww/&sw);\n  tau2 = max(0, (&Q - (&k-1))/C);\n  I2   = ifn(&Q>0, max(0,(&Q-(&k-1))/&Q)*100, 0);  /* I-squared, % */\n  call symputx('tau2', tau2);\n  call symputx('I2', I2);\nrun;\n\ndata re; set eff; wi_re = 1/(vi + &tau2); wy = wi_re*yi; run;  /* random-effects weights */\nproc sql noprint; select sum(wi_re), sum(wy) into :swre, :swyre from re; quit;\n\ndata results;\n  ybar = &swyre/&swre;\n  se   = sqrt(1/&swre);\n  pooled_effect = exp(ybar);                                   /* back-transform */\n  ci_low  = exp(ybar - probit(0.975)*se);\n  ci_high = exp(ybar + probit(0.975)*se);\n  tau2 = &tau2; I2_pct = &I2; Q = &Q; k = &k;\nrun;\n\nproc print data=results noobs; run;",
        "description": "Systematic-review synthesis in Base/STAT SAS without specialized macros: PRISMA flow reconciliation, a\nrisk-of-bias-by-source cross-tab, and a hand-rolled DerSimonian-Laird random-effects pool via DATA steps and PROC SQL.\nRequired input dataset (one row per included study, post extraction):\n  work.studies : study_id, effect, ci_low, ci_high, data_source $, ffs_only (0/1), rob_overall $, washout_days\nThe reported PRISMA counts are read from the macro variables n_identified, n_duplicates, n_excl_screen,\nn_excl_fulltext, and n_included, which must be defined before step 1; PROC SQL prints the derived included count next\nto the reported one for comparison. The pooled estimate is back-transformed from the log scale.",
        "dependencies": [],
        "source_citations": [
          "page-2021"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Q[Focused PICOTS question<br/>protocol registered on PROSPERO] --> S[Exhaustive search<br/>multiple databases + grey literature]\n  S --> D[De-duplicate records]\n  D --> TA[Dual independent title/abstract screening]\n  TA -->|excluded with reasons| X1[Records excluded]\n  TA --> FT[Dual independent full-text screening]\n  FT -->|excluded with reasons| X2[Reports excluded]\n  FT --> EX[Dual data extraction<br/>effect + CI, data source, design-quality fields]\n  EX --> RoB[Risk-of-bias appraisal<br/>included studies + the review process]\n  RoB --> DIV{Clinical / methodological<br/>diversity acceptable?}\n  DIV -->|yes| MA[Random-effects meta-analysis<br/>I-squared + prediction interval]\n  DIV -->|no| NS[Structured narrative / SWiM<br/>no forced pooling]\n  MA --> G[GRADE certainty of evidence]\n  NS --> G\n  G --> R[PRISMA 2020 report + flow diagram]",
        "caption": "Systematic-review workflow from a registered PICOTS protocol through reproducible search, dual screening, appraisal, and a synthesis-method decision gate (pool vs structured narrative) to a GRADE-rated, PRISMA-reported conclusion.",
        "alt_text": "Flowchart of the systematic review process beginning at a registered focused question, through exhaustive search, de-duplication, dual title/abstract and full-text screening with exclusions, dual extraction, risk-of-bias appraisal, a decision gate on whether diversity permits meta-analysis, then either random-effects pooling or structured narrative synthesis, GRADE rating, and PRISMA reporting.",
        "source_type": "illustrative",
        "source_citations": [
          "page-2021"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  subgraph Pooling appropriate\n    A1[Clinically and methodologically similar studies] --> A2[Common estimand exists]\n    A2 --> A3[Random-effects meta-analysis<br/>report I-squared + prediction interval]\n  end\n  subgraph Pooling NOT appropriate\n    B1[High clinical/methodological diversity] --> B2[No common estimand]\n    B2 --> B3[Structured narrative / SWiM<br/>effect-direction or harvest plot]\n    C1[All included studies high RoB] --> C2[Precise but biased pooled effect = dangerous]\n    C2 --> B3\n  end",
        "caption": "Decision logic for the synthesis step. A pooled estimate is interpretable only when a common estimand exists and the body of evidence is not uniformly high risk of bias; otherwise a forest plot manufactures false precision and a structured narrative is the honest output.",
        "alt_text": "Decision diagram contrasting when pooling is appropriate (clinically and methodologically similar studies sharing a common estimand, leading to random-effects meta-analysis) versus when it is not (high diversity or uniformly high risk of bias, leading to structured narrative synthesis).",
        "source_type": "illustrative",
        "source_citations": [
          "page-2021"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "see_also",
        "target_slug": "scoping-review",
        "notes": "A scoping review maps the breadth and gaps of a literature without effect estimation or formal appraisal; choose it when the question is \"what evidence exists?\" rather than a focused PICOTS contrast."
      },
      {
        "relation_type": "alternative_to",
        "target_slug": "qualitative-synthesis",
        "notes": "When included evidence is qualitative, the synthesis step uses thematic/meta-ethnographic methods rather than quantitative pooling, but the systematic search-and-appraisal scaffolding is shared."
      },
      {
        "relation_type": "produces",
        "target_slug": "meta-analysis-rct",
        "notes": "Quantitative pooling of randomized trials is the meta-analytic synthesis step that an SR of trials may culminate in when a common estimand exists."
      },
      {
        "relation_type": "produces",
        "target_slug": "meta-analysis-obs",
        "notes": "Pooling of observational/RWE studies is the synthesis step for an SR of non-randomized evidence; requires extra appraisal of confounding, immortal time, and competing risks before pooling."
      },
      {
        "relation_type": "used_with",
        "target_slug": "network-meta-analysis",
        "notes": "An SR is the evidence-generation front end for an indirect/network comparison when multiple treatments are connected through common comparators."
      },
      {
        "relation_type": "used_with",
        "target_slug": "ipd-meta-analysis",
        "notes": "When within-study effect modification is needed, the SR feeds an individual-patient-data meta-analysis rather than aggregate-data pooling."
      },
      {
        "relation_type": "used_with",
        "target_slug": "picots-framework-rwe",
        "notes": "PICOTS structures the eligibility frame, search strategy, and synthesis plan that the systematic-review protocol locks before results are seen."
      }
    ],
    "aliases": [
      "systematic literature review",
      "SLR",
      "systematic evidence review",
      "evidence synthesis"
    ],
    "complexity": "foundational",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "target-trial-emulation",
    "name": "Target Trial Emulation",
    "short_definition": "A causal-inference framework that first writes the protocol of the hypothetical pragmatic randomized trial that would answer the question (eligibility, treatment strategies, assignment, time zero, follow-up, outcome, causal contrast, analysis), then emulates each component in observational data so that the design-induced biases of naive analyses — immortal time, prevalent-user, and time-zero misalignment — are prevented by construction.",
    "long_description": "**Target trial emulation (TTE)** is not a study design or an estimator; it is a discipline for *designing* an\nobservational comparative study by explicitly specifying the protocol of the randomized trial you wish you could run —\nthe \"target trial\" — and then mapping each of its components onto real-world data (Hernán & Robins). The seven protocol\ncomponents are: (1) eligibility criteria, (2) treatment strategies being compared, (3) assignment procedure,\n(4) time zero (the moment eligibility, strategy assignment, and follow-up start are all aligned), (5) follow-up period,\n(6) outcome, and (7) causal contrast and statistical analysis. The single most important discipline TTE enforces is\nthat eligibility, treatment assignment, and the start of follow-up must be evaluated at the *same* time zero. Most\nnotorious RWE blunders — immortal time bias, prevalent-user bias, and adjusting for post-baseline variables — are\nfailures to align time zero, and writing the target-trial protocol surfaces them before any data are touched.\n\n**Core estimand distinction.** TTE forces a choice that naive analyses leave implicit. The *intention-to-treat (ITT)\nemulation* contrasts treatment **strategies** assigned at time zero and follows everyone under their initiating\nstrategy regardless of later adherence or switching — it estimates the effect of *being assigned to* strategy A vs B.\nThe *per-protocol emulation* contrasts the strategies under sustained adherence; because adherence is not randomized,\nit requires censoring at deviation plus inverse-probability-of-censoring weighting, or clone-censor-weight for\nsustained \"initiate and remain on\" strategies, to handle time-varying confounding and informative censoring. These are\ndifferent causal questions with different identifying assumptions, not two ways of computing the same number. 
Separately,\nthe survival estimand must be pre-specified — cause-specific hazard, cumulative incidence (subdistribution), or risk\ndifference at a fixed horizon — because under competing risks they answer different questions and a hazard ratio is not\na risk contrast.\n\n**Pros, cons, and trade-offs.**\n- **vs the naive \"ever vs never user\" or prevalent-user cohort:** TTE prevents immortal time, prevalent-user bias, and\n  time-zero misalignment by construction rather than by post-hoc adjustment, and it makes the estimand explicit and\n  auditable against a trial protocol. Cost: more upfront design work, a narrower (initiation) population, and — for\n  per-protocol questions — g-methods that are harder to communicate than a single adjusted hazard ratio.\n- **vs an unstructured active-comparator new-user (ACNU) analysis:** TTE is the protocol-specification layer *around*\n  ACNU, not a competitor; ACNU is the usual analytic engine of a two-drug TTE. TTE adds value precisely when the\n  strategies are *sustained or dynamic* (grace periods, \"treat until progression\", dose escalation) where the simple\n  ACNU initiation contrast is insufficient and clone-censor-weight is needed. **Prefer plain ACNU** when a static,\n  initiation-only (ITT-like) contrast answers the question; **add the full TTE/clone-censor-weight machinery** only\n  when the protocol genuinely requires a sustained per-protocol estimand.\n- **vs a head-to-head RCT:** TTE is faster, cheaper, and covers populations and long horizons trials cannot, and a\n  registry-based randomized trial is itself a randomized form of target-trial emulation. 
Cost: TTE cannot fix\n  unmeasured confounding (assignment is not randomized), so it relies on the no-unmeasured-confounding-at-time-zero\n  assumption and on negative controls, E-values, and quantitative bias analysis to bound residual bias.\n\n**When to use.** Virtually any comparative effectiveness or safety question about *initiating* or *sustaining* a\ntreatment strategy in claims, EHR, registry, or linked data — especially when a head-to-head RCT is infeasible and a\nregulatory or HTA audience expects explicit bias mitigation. Use it as the design scaffold whenever there is real risk\nof immortal time (the \"treatment\" is defined by an event that takes time to occur, e.g., transplant, responder, or\ncompleted-course definitions), prevalent-user contamination, or a treatment strategy that unfolds over time.\n\n**When NOT to use — and when it is actively misleading or dangerous.**\n- **No well-defined intervention / a non-modifiable \"exposure\".** TTE requires that you can name a hypothetical trial\n  that randomizes the strategy at a single time zero. \"Effect of obesity\" or \"effect of a biomarker level\" has no\n  well-defined intervention and no coherent time zero; forcing TTE manufactures a spurious estimand. 
Use it for\n  drug/procedure strategies, not for fixed traits.\n- **Positivity is violated.** If one strategy is never chosen for an identifiable subgroup at time zero (e.g., a drug\n  contraindicated in CKD-4), the trial could not have randomized those patients; emulating it anyway extrapolates into\n  a region with no support and the weighted estimate is driven by a handful of influential clones.\n- **Time zero cannot be reconstructed from the data.** When the data cannot pin the moment eligibility and strategy\n  assignment coincide (e.g., EHR with no reliable medication-start date), an emulated time zero invites the very\n  immortal time the framework exists to prevent — it gives a false sense of rigor.\n- **A per-protocol estimand is reported but informative censoring is ignored.** Censoring at non-adherence without\n  inverse-probability-of-censoring weighting reintroduces, under a sophisticated-looking protocol, exactly the\n  selection bias TTE is meant to remove. This is more dangerous than a naive analysis because the protocol lends it\n  false credibility.\n\n**Data-source operational depth.**\n- **Claims (FFS or commercial):** Strong for operationalizing eligibility (continuous medical + pharmacy enrollment,\n  drug-free washout), strategy assignment (first qualifying NDC fill = time zero), and outcomes (validated\n  diagnosis/procedure algorithms). Failure modes: (a) *Medicare Advantage person-time has no FFS claims*, so \"no prior\n  fill\" during washout can be missingness, not a true drug-free period — require Parts A/B/D (or commercial\n  medical+pharmacy) for the entire washout and exclude MA-only spans. (b) *Immortal time in procedure/responder\n  studies* — defining a strategy by an event that takes time to occur (received transplant, completed 6 cycles) builds\n  guaranteed survival into one arm; the fix is cloning at time zero and censoring clones when their data diverge from\n  the assigned strategy, never classifying on a future event. 
(c) *Differential competing risks by exposure in the\n  elderly* — when one strategy is preferentially used in sicker patients, death competes differentially with the\n  outcome; report cumulative incidence (Fine-Gray or Aalen-Johansen), not just cause-specific hazards.\n- **EHR:** Adds labs, problem lists, and severity for sharper eligibility, but initiation is the *order/administration*,\n  not the dispensing; without linked pharmacy data the actual start date — and thus time zero — is uncertain.\n  Visit-driven capture makes loss to follow-up informative; define the observation window explicitly and treat\n  out-of-system gaps as potential censoring, not as continued eligibility.\n- **Registry:** Best for adjudicated outcomes, disease severity, and stage at time zero, but typically weak for\n  complete drug exposure; link to claims for the full fill history and to a death index to firm up censoring. A\n  registry-based randomized trial is the gold-standard randomized form of TTE.\n- **Linked claims–EHR–vital records:** The ideal substrate (EHR severity + claims completeness + reliable mortality),\n  but order/fill/service-date discrepancies must be reconciled *before* time-zero assignment, and the linkable subset\n  introduces selection that should be assessed for transportability to the target population.\n\n**Worked claims example — protocol-to-emulation.** Question: among adults with type 2 diabetes and stage-3 CKD, does\n*initiating an SGLT2 inhibitor* vs *initiating a DPP-4 inhibitor* reduce 3-year MACE? Emulate the trial component by\ncomponent in a commercial + Medicare FFS database:\n1. **Eligibility** — age >=18, >=2 T2D diagnoses, an eGFR/CKD-3 marker, and 365 days of continuous A/B/D (or commercial\n   medical+pharmacy) enrollment before the first study fill; assessed at time zero only.\n2. 
**Treatment strategies** — \"initiate SGLT2i\" vs \"initiate DPP-4i\"; washout requires *no* fill of either class in the\n   365-day lookback (both arms are incident users).\n3. **Assignment** — emulated by the class of the NDC dispensed on the index fill; confounding-by-indication addressed\n   downstream with a high-dimensional propensity score built only from the [index_date-365, index_date] window.\n4. **Time zero** — the date of that first qualifying fill; eligibility, assignment, and follow-up all start here, so\n   there is no immortal time and no adjustment for post-initiation variables.\n5. **Follow-up** — from time zero to the first validated MACE event, censoring at disenrollment, end of data, or\n   3 years; death from other causes is a competing event and is handled with a competing-risks estimator\n   (e.g., Aalen-Johansen) rather than treated as ordinary censoring when cumulative incidence is reported.\n6. **Outcome** — MACE via a pre-specified inpatient MI/stroke + cardiovascular-death algorithm; report cumulative\n   incidence at 3 years, not only a hazard ratio.\n7. **Causal contrast / analysis** — *ITT emulation*: follow each initiator under the assigned class regardless of later\n   switching/discontinuation (PS-weighted risk difference at 3 years). *Per-protocol emulation*: additionally censor at\n   discontinuation (last days_supply end + a pre-specified grace period) or switch, and apply\n   inverse-probability-of-censoring weights; for the sustained \"remain on class\" strategy, clone-censor-weight.\n   Sensitivity analyses: washout length, grace period, negative-control outcomes, and an E-value for residual\n   confounding.",
    "primary_category": "Framework_Standard",
    "tags": [
      "causal_inference",
      "target_trial",
      "protocol_specification",
      "immortal_time_prevention",
      "estimand",
      "per_protocol",
      "comparative_effectiveness"
    ],
    "applies_to_study_types": [
      "new_user",
      "active_comparator_new_user",
      "cohort_retrospective",
      "cohort_prospective",
      "pragmatic_trial",
      "registry_trial",
      "claims_analysis",
      "ehr_study"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "linked",
      "registry",
      "primary"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1093/aje/kwv254",
        "url": "https://doi.org/10.1093/aje/kwv254",
        "citation_text": "Hernán MA, Robins JM. Using big data to emulate a target trial when a randomized trial is not available. American Journal of Epidemiology. 2016;183(8):758-764.",
        "year": 2016,
        "authors_short": "Hernán & Robins",
        "notes": "Landmark paper that formalized the target-trial framework for big/observational data in comparative effectiveness research; defines the seven protocol components and the emulation principle."
      },
      {
        "role": "explain",
        "doi": "10.1016/j.jclinepi.2016.04.014",
        "url": "https://doi.org/10.1016/j.jclinepi.2016.04.014",
        "citation_text": "Hernán MA, Sauer BC, Hernández-Díaz S, Platt R, Shrier I. Specifying a target trial prevents immortal time bias and other self-inflicted injuries in observational analyses. Journal of Clinical Epidemiology. 2016;79:70-75.",
        "year": 2016,
        "authors_short": "Hernán et al.",
        "notes": "Shows mechanistically how aligning eligibility, assignment, and follow-up at a single time zero prevents immortal time bias and related self-inflicted design errors that recur in pharmacoepidemiology."
      },
      {
        "role": "explain",
        "doi": "10.1001/jama.2022.21383",
        "url": "https://doi.org/10.1001/jama.2022.21383",
        "citation_text": "Hernán MA, Wang W, Leaf DE. Target trial emulation: a framework for causal inference from observational data. JAMA. 2022;328(24):2446-2447.",
        "year": 2022,
        "authors_short": "Hernán et al.",
        "notes": "Concise, widely cited statement of the framework and the ITT-vs-per-protocol estimand distinction aimed at a clinical/regulatory audience; the modern canonical reference."
      },
      {
        "role": "demonstrate",
        "doi": "10.1177/0962280211403603",
        "url": "https://doi.org/10.1177/0962280211403603",
        "citation_text": "Danaei G, Rodríguez LA, Cantero OF, Logan R, Hernán MA. Observational data for comparative effectiveness research: an emulation of randomised trials of statins and primary prevention of coronary heart disease. Statistical Methods in Medical Research. 2013;22(1):70-96.",
        "year": 2013,
        "authors_short": "Danaei et al.",
        "notes": "Reference worked emulation — statins for primary prevention of CHD — whose results aligned with randomized evidence, demonstrating that protocol-faithful emulation can reproduce trial estimands from routine data."
      }
    ],
    "plain_language_summary": "Target trial emulation is a disciplined way to design an observational study by first writing down the exact protocol of the randomized trial you wish you could run, then mimicking each piece of that protocol in real-world records such as insurance claims or electronic health records. The key discipline it enforces is that three things must happen on the same day: confirming the patient is eligible, assigning them to a treatment strategy, and starting to count follow-up time. When those three things start at different times, a patient can rack up disease-free days before ever being labeled as treated, which makes the treated group look artificially healthier than it really is.",
    "key_terms": [
      {
        "term": "target trial",
        "definition": "The hypothetical randomized controlled trial you would ideally run to answer your research question, written out as a full protocol even though it will never actually be conducted."
      },
      {
        "term": "emulation",
        "definition": "Using observational data (such as insurance claims or EHR records) to mimic each component of the target trial protocol as faithfully as the data allow."
      },
      {
        "term": "time zero alignment",
        "definition": "The requirement that a patient's eligibility check, treatment assignment, and the start of follow-up all happen at exactly the same date so no disease-free time is silently credited to one treatment group before the study officially begins."
      },
      {
        "term": "immortal time",
        "definition": "A stretch of follow-up days during which a patient could not yet have experienced the outcome because they had not yet been classified as treated; counting those days as exposed time falsely inflates the treated group's survival."
      }
    ],
    "worked_example": {
      "scenario": "A researcher wants to know whether adults with type 2 diabetes who start an SGLT2 inhibitor have fewer heart attacks over three years than those who start a DPP-4 inhibitor. Patient 2041 fills her first SGLT2 inhibitor prescription on 2023-03-01 after 365 days of uninterrupted insurance enrollment with no prior fill of either drug class. In the properly emulated trial, that fill date is time zero: eligibility is confirmed on that day, the treatment arm is assigned on that day, and follow-up begins on that day. In a naive analysis, the researcher might instead start the clock at the diagnosis date (2022-08-15) and only classify her as treated when the fill occurs six and a half months later, which would silently gift the treated arm 197 guaranteed event-free days.",
      "dataset": {
        "caption": "Key dates for patient 2041 in a commercial claims database, showing both the naive start and the aligned emulation start.",
        "columns": [
          "person_id",
          "event",
          "date",
          "days_since_diagnosis"
        ],
        "rows": [
          [
            2041,
            "T2D diagnosis (first of 2)",
            "2022-08-15",
            0
          ],
          [
            2041,
            "T2D diagnosis (second)",
            "2022-10-02",
            48
          ],
          [
            2041,
            "365-day washout window opens (no SGLT2/DPP-4 fills)",
            "2022-03-01",
            -167
          ],
          [
            2041,
            "First SGLT2i fill = TIME ZERO (aligned emulation)",
            "2023-03-01",
            198
          ],
          [
            2041,
            "Naive cohort entry (diagnosis date)",
            "2022-08-15",
            0
          ],
          [
            2041,
            "Naive treatment classification date (= first fill)",
            "2023-03-01",
            198
          ]
        ]
      },
      "steps": [
        "Naive approach: the researcher anchors the study clock at the T2D diagnosis date (2022-08-15) and retrospectively labels patient 2041 as 'treated' once her SGLT2i fill occurs on 2023-03-01.",
        "That creates 197 days (Aug 15, 2022 to Mar 1, 2023) where she is counted in the treated arm but could not yet have been treated — she had to survive those days without a heart attack just to qualify as treated.",
        "Those 197 days are immortal time: the treated arm looks healthier simply because untreated patients who had a heart attack before the fill could never enter the treated group.",
        "Emulation fix: set time zero to the first qualifying fill date (2023-03-01). Eligibility (365-day washout, two T2D diagnoses, continuous enrollment) is confirmed as of that same date. Follow-up also starts on that same date.",
        "Because eligibility, assignment, and follow-up all begin on 2023-03-01, there are zero immortal days: no event-free time is pre-loaded into either arm.",
        "Both the SGLT2i arm (patient 2041) and the DPP-4i arm start follow-up at their own individual time zeros, which is the first qualifying fill of whichever drug class they actually initiated."
      ],
      "result": "Aligned emulation: 0 immortal days for patient 2041; follow-up begins 2023-03-01 and runs up to 3 years or first MACE event, whichever comes first. Naive analysis: 197 days of immortal time silently credited to the treated arm before follow-up even started, biasing the treated group toward better outcomes.",
      "timeline_spec": {
        "title": "Time zero alignment for patient 2041 — SGLT2i initiation emulating a target trial",
        "window": {
          "start": "2022-08-15",
          "end": "2026-03-01",
          "label": "Observation window: diagnosis through 3-year follow-up"
        },
        "events": [
          {
            "label": "T2D diagnosis #1",
            "start": "2022-08-15",
            "length_days": 1,
            "quantity": "eligibility criterion"
          },
          {
            "label": "365-day washout (no SGLT2/DPP-4 fills)",
            "start": "2022-03-01",
            "length_days": 365,
            "quantity": "365-day drug-free lookback"
          },
          {
            "label": "First SGLT2i fill — TIME ZERO",
            "start": "2023-03-01",
            "length_days": 90,
            "quantity": "90-day supply; eligibility + assignment + follow-up start here"
          }
        ],
        "spans": [
          {
            "kind": "unexposed",
            "start": "2022-08-15",
            "end": "2023-02-28",
            "label": "197-day immortal time in naive analysis (event-free days silently credited to treated arm)"
          },
          {
            "kind": "followup",
            "start": "2023-03-01",
            "end": "2026-03-01",
            "label": "3-year follow-up (ITT emulation, aligned time zero)"
          }
        ],
        "result": {
          "label": "Time zero = 2023-03-01 (first SGLT2i fill). Immortal days in aligned emulation = 0. Immortal days in naive analysis = 197.",
          "value": 0
        },
        "caption": "Patient 2041 timeline showing the 197-day immortal period that the naive analysis silently adds to the treated arm (shaded), versus the aligned target-trial emulation where eligibility, treatment assignment, and follow-up all start together on the date of the first fill.",
        "alt_text": "A horizontal timeline from August 2022 to March 2026 for one patient. A shaded band from the T2D diagnosis date to the first SGLT2i fill marks 197 days of immortal time present in the naive analysis. A vertical line at the first fill date (March 1 2023) marks the aligned time zero, where eligibility, assignment, and follow-up all coincide in the emulated trial."
      }
    },
    "prerequisites": [
      "immortal-time-bias-handling",
      "new-user-design",
      "time-zero-index-date-alignment-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Intention-to-treat (ITT) emulation",
        "description": "Contrast the treatment strategies as assigned at time zero, following each patient under the initiating strategy regardless of later adherence or switching (analogous to ITT in an RCT). Estimates the effect of being assigned to a strategy.",
        "edge_cases": [
          "Sustained \"initiate and remain on\" strategies still need cloning at time zero to avoid building immortal time into the strategy that is defined by future behavior.",
          "Heavy crossover between arms attenuates the ITT contrast toward the null, which may be the wrong estimand for a safety question."
        ],
        "data_source_notes": "claims/EHR: time zero = first qualifying fill after washout; follow under that strategy regardless of later gaps or switches. Simplest emulation and the usual default for regulatory effectiveness questions.",
        "citations": [
          "hernan-2016-aje",
          "hernan-2022-jama"
        ]
      },
      {
        "name": "Per-protocol emulation (sustained adherence)",
        "description": "Contrast the strategies under continued adherence, censoring follow-up when a patient deviates from the assigned strategy and weighting for informative censoring (IPCW); for \"initiate and remain on\" strategies, use clone-censor-weight to handle the post-baseline definition of the strategy.",
        "edge_cases": [
          "Naive censoring at non-adherence without IPCW reintroduces selection bias and is more dangerous than no censoring because the protocol lends it false credibility.",
          "Requires time-varying confounders measured densely enough to model adherence; sparse EHR encounters make the weights unstable."
        ],
        "data_source_notes": "Aligns with active-comparator new-user + g-methods; depends on careful episode construction (days_supply stitching, grace periods, switching rules) and time-updated covariates.",
        "citations": [
          "hernan-2016-jclinepi",
          "hernan-2022-jama"
        ]
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Naive ever/never-user or prevalent-user cohort",
        "pros_of_this": "Prevents immortal time, prevalent-user bias, and time-zero misalignment by construction; makes the estimand explicit and auditable against a trial protocol.",
        "cons_of_this": "More upfront design work; narrower (initiation) population; per-protocol questions require g-methods that are harder to communicate than a single adjusted hazard ratio.",
        "when_to_prefer": "Any comparative initiation/sustained-strategy question where immortal time, prevalent-user, or time-zero bias is plausible, and where the audience expects explicit bias mitigation."
      },
      {
        "compared_to": "Unstructured active-comparator new-user (ACNU) analysis",
        "pros_of_this": "Adds the protocol-specification and estimand layer around ACNU; handles sustained/dynamic strategies, grace-period eligibility ambiguity, and per-protocol contrasts via clone-censor-weight.",
        "cons_of_this": "For a static initiation-only contrast it adds machinery without changing the answer; the extra cloning and weighting can obscure a simple, defensible ACNU estimate.",
        "when_to_prefer": "When the strategies are sustained or dynamic (treat-to-target, grace periods, dose escalation) and a per-protocol estimand is genuinely required."
      },
      {
        "compared_to": "Head-to-head randomized controlled trial",
        "pros_of_this": "Faster, cheaper, broader populations and longer horizons; a registry-based randomized trial is itself a randomized TTE.",
        "cons_of_this": "Cannot fix unmeasured confounding (assignment is not randomized); relies on no-unmeasured-confounding-at-time-zero plus negative controls, E-values, and bias analysis.",
        "when_to_prefer": "When a head-to-head RCT is infeasible or unethical and high-quality longitudinal data with a reconstructable time zero exist."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Operationalize eligibility with continuous medical + pharmacy enrollment across the full washout; strategy assignment via the NDC on the first qualifying fill (= time zero); outcomes via validated algorithms. Exclude Medicare Advantage-only person-time where FFS claims are absent so washout reflects a true drug-free period. Never classify a strategy on a future event (transplant, completed course) — clone at time zero and censor on divergence to avoid immortal time. Under differential mortality, report cumulative incidence (Fine-Gray/Aalen-Johansen), not only cause-specific hazards.",
      "ehr": "Initiation is the order/administration, not the dispensing; without linked pharmacy fills the true start date — and thus time zero — is uncertain. Visit-driven capture makes loss to follow-up informative; define the observation window and treat out-of-system gaps as censoring.",
      "registry": "Best for adjudicated outcomes, severity, and stage at time zero; link to claims for full exposure and to a death index for censoring. Registry-based randomized trials are the gold-standard randomized form of emulation.",
      "linked": "EHR severity + claims completeness + reliable mortality is the ideal substrate; reconcile order/fill/service date discrepancies before time-zero assignment and assess the linkable subset for transportability."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\n\nWASHOUT_DAYS = 365  # drug-free + continuous-enrollment lookback that defines incident users and fixes time zero\n\ndef emulate_target_trial(rx: pd.DataFrame, enroll: pd.DataFrame, elig: pd.DataFrame) -> pd.DataFrame:\n    rx = rx.sort_values([\"person_id\", \"fill_date\"])\n\n    # (2) Treatment strategies + (4) time zero: first fill of EITHER comparator class is the candidate index.\n    study = rx[rx[\"drug_class\"].isin([\"SGLT2\", \"DPP4\"])]\n    idx = (study.groupby(\"person_id\").first().reset_index()\n                .rename(columns={\"fill_date\": \"index_date\", \"drug_class\": \"arm\"}))\n\n    # New-user check (belt-and-suspenders): no fill of either class in the washout window before time zero.\n    # NOTE: because index_date is the FIRST observed fill, no in-data fill can predate it, so this filter is\n    # vacuous on its own and `prevalent` is empty by construction. The actual new-user guarantee comes from the\n    # continuous-enrollment-across-the-full-washout check below ((1) `covers`): requiring FFS-observable\n    # enrollment for the entire [index-WASHOUT, index] window means any prior fill WOULD have been observed, so\n    # the absence of one is a true drug-free period rather than missingness. This filter only catches the edge\n    # case where the input `rx` already carries fills earlier than the per-person first fill in `study`.\n    prior = study.merge(idx[[\"person_id\", \"index_date\"]], on=\"person_id\")\n    prevalent = prior[(prior[\"fill_date\"] < prior[\"index_date\"]) &\n                      (prior[\"fill_date\"] >= prior[\"index_date\"] - pd.Timedelta(days=WASHOUT_DAYS))]\n    idx = idx[~idx[\"person_id\"].isin(prevalent[\"person_id\"])].copy()\n\n    # (1) Eligibility assessed AT time zero only: continuous FFS-observable enrollment across the full washout,\n    #     plus the clinical criteria (>=2 T2D dx, CKD-3 marker on/before index). 
MA-only person-time excluded.\n    #     This enrollment-coverage check is what enforces incident (new-user) status, not the filter above.\n    e = enroll.merge(idx[[\"person_id\", \"index_date\"]], on=\"person_id\")\n    e[\"covers\"] = ((e[\"enroll_start\"] <= e[\"index_date\"] - pd.Timedelta(days=WASHOUT_DAYS)) &\n                   (e[\"enroll_end\"]   >= e[\"index_date\"]) & (~e[\"ma_only\"]))\n    observable = set(e.loc[e[\"covers\"], \"person_id\"])\n\n    idx = idx.merge(elig, on=\"person_id\", how=\"left\")\n    clinically_eligible = (idx[\"n_t2d_dx\"] >= 2) & (idx[\"ckd3_date\"] <= idx[\"index_date\"])\n    cohort = idx[idx[\"person_id\"].isin(observable) & clinically_eligible].copy()\n\n    # Covariate/PS window = [time zero - washout, time zero]. Follow-up (5)-(7) and censoring applied identically\n    # to both arms downstream; ITT follows the assigned arm regardless of later switching.\n    cohort[\"baseline_start\"] = cohort[\"index_date\"] - pd.Timedelta(days=WASHOUT_DAYS)\n    return cohort[[\"person_id\", \"arm\", \"index_date\", \"baseline_start\"]]",
        "description": "Protocol-to-emulation cohort builder for a two-strategy target trial (ITT contrast). This is the design layer, not\nthe estimator: it walks the seven protocol components and emits one row per eligible initiator at time zero, ready for\npropensity-score balancing and an outcome model. Required inputs (already cleaned/de-duplicated):\n  rx     : pharmacy fills  -> person_id, fill_date (datetime), drug_class in {'SGLT2','DPP4'}, days_supply\n  enroll : enrollment spans -> person_id, enroll_start, enroll_end, ma_only (bool)  # MA-only spans lack FFS claims\n  elig   : time-zero-eligibility markers -> person_id, ckd3_date, n_t2d_dx\nBuild covariates and the PS ONLY from the returned [baseline_start, index_date] window. Per-protocol emulation adds\ncensoring at switch/discontinuation + IPCW downstream (see clone-censor-weight-per-protocol).",
        "dependencies": [
          "pandas"
        ],
        "source_citations": [
          "hernan-2016-aje"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": "target-trial-emulation-timeline.svg",
        "mermaid": null,
        "caption": "Patient 2041 timeline showing the 197-day immortal period that the naive analysis silently adds to the treated arm (shaded), versus the aligned target-trial emulation where eligibility, treatment assignment, and follow-up all start together on the date of the first fill.",
        "alt_text": "A horizontal timeline from August 2022 to March 2026 for one patient. A shaded band from the T2D diagnosis date to the first SGLT2i fill marks 197 days of immortal time present in the naive analysis. A vertical line at the first fill date (March 1 2023) marks the aligned time zero, where eligibility, assignment, and follow-up all coincide in the emulated trial.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  subgraph TT[Target trial protocol -- what we would randomize]\n    E1[Eligibility criteria] --> E2[Treatment strategies A vs B]\n    E2 --> E3[Random assignment]\n    E3 --> E4[Time zero: eligibility = assignment = follow-up start]\n    E4 --> E5[Follow-up period]\n    E5 --> E6[Outcome]\n    E6 --> E7[Causal contrast: ITT and/or per-protocol]\n  end\n  subgraph EM[Emulation in real-world data]\n    D1[Claims/EHR eligibility at index] --> D2[Strategy = class of index NDC]\n    D2 --> D3[PS / hdPS for assignment]\n    D3 --> D4[Time zero = first qualifying fill after washout]\n    D4 --> D5[Follow-up: identical outcome + censoring both arms]\n    D5 --> D6[Validated outcome algorithm]\n    D6 --> D7[ITT: PS-weighted risk diff; per-protocol: clone-censor-weight + IPCW]\n  end\n  E1 -.emulate.-> D1\n  E2 -.emulate.-> D2\n  E3 -.emulate.-> D3\n  E4 -.emulate.-> D4\n  E5 -.emulate.-> D5\n  E6 -.emulate.-> D6\n  E7 -.emulate.-> D7",
        "caption": "Mapping the seven target-trial protocol components onto their real-world-data emulation. Each box in the trial protocol has an explicit emulated counterpart; the framework's power comes from forcing that one-to-one mapping before analysis.",
        "alt_text": "Two stacked flowcharts, the target trial protocol on top and its data emulation below, with dashed arrows linking each of the seven components to its emulated counterpart.",
        "source_type": "illustrative",
        "source_citations": [
          "hernan-2016-aje"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  subgraph BAD[Naive analysis -- time-zero misalignment creates immortal time]\n    N0[Diagnosis / cohort entry] --> N1[Immortal time: must survive AND remain untreated to become 'treated' later]\n    N1 --> N2[Classify as treated when first fill occurs]\n    N2 --> N3[Treated arm gains guaranteed event-free time -> spurious benefit]\n  end\n  subgraph GOOD[Target trial emulation -- aligned time zero]\n    T0[First qualifying fill = time zero] --> T1[Eligibility + assignment + follow-up start together]\n    T1 --> T2[No event-free time precedes assignment]\n    T2 --> T3[ITT contrast free of immortal-time bias; per-protocol via clone-censor-weight]\n  end",
        "caption": "Why aligning time zero matters. Defining the treated group by a future event (top) builds guaranteed event-free immortal time into one arm; starting eligibility, assignment, and follow-up together at the index fill (bottom) removes it.",
        "alt_text": "Two flowcharts contrasting a naive analysis that introduces immortal time through misaligned time zero with a target-trial emulation that aligns eligibility, assignment, and follow-up at the index fill.",
        "source_type": "illustrative",
        "source_citations": [
          "hernan-2016-jclinepi"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "used_with",
        "target_slug": "active-comparator-new-user",
        "notes": "ACNU is the usual analytic engine of a two-drug target-trial emulation; new-user + active comparator + time-zero alignment map directly onto trial eligibility and assignment."
      },
      {
        "relation_type": "used_with",
        "target_slug": "new-user-design",
        "notes": "New-user (incident-user) cohorts are the natural vehicle for emulating trials of drug-initiation strategies, with follow-up starting at first exposure."
      },
      {
        "relation_type": "used_with",
        "target_slug": "clone-censor-weight-per-protocol",
        "notes": "Clone-censor-weight is the standard method for the per-protocol estimand of sustained \"initiate and remain on\" strategies, where the strategy is defined by post-baseline behavior."
      },
      {
        "relation_type": "used_with",
        "target_slug": "marginal-structural-models-g-methods",
        "notes": "G-methods (MSMs, g-formula) handle the time-varying confounding and informative censoring required for per-protocol emulation of sustained strategies."
      },
      {
        "relation_type": "requires",
        "target_slug": "time-zero-index-date-alignment-rwe",
        "notes": "Aligning eligibility, treatment assignment, and follow-up start at a single time zero is the core discipline that makes emulation valid and prevents immortal time bias."
      },
      {
        "relation_type": "see_also",
        "target_slug": "immortal-time-bias-handling",
        "notes": "Specifying and aligning time zero is the primary design-based prevention strategy for immortal time bias."
      },
      {
        "relation_type": "used_with",
        "target_slug": "estimands-ate-att-intercurrent-events-rwe",
        "notes": "TTE forces explicit choice of estimand (ITT vs per-protocol, ATE vs ATT) and how intercurrent events (switching, discontinuation, death) are handled."
      },
      {
        "relation_type": "part_of",
        "target_slug": "study-protocol-or-sap-elements",
        "notes": "The seven target-trial components are written into the study protocol and SAP; emulation is the operational translation of those pre-specified elements."
      },
      {
        "relation_type": "see_also",
        "target_slug": "picots-framework-rwe",
        "notes": "PICOTS structures the question; the target-trial protocol operationalizes it into eligibility, strategies, time zero, and contrast for emulation."
      },
      {
        "relation_type": "see_also",
        "target_slug": "regulatory-readiness-rwe",
        "notes": "HARPER, STaRT-RWE, the ENCePP Methodological Guide, and the FDA RWE Framework all endorse explicit target-trial specification as the route to regulatory- and HTA-grade observational evidence."
      }
    ],
    "aliases": [
      "target trial",
      "target trial emulation",
      "emulating a target trial",
      "hypothetical trial emulation",
      "trial emulation"
    ],
    "complexity": "advanced",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "targeted-maximum-likelihood-estimation-rwe",
    "name": "Targeted Maximum Likelihood Estimation (TMLE)",
    "short_definition": "A doubly-robust, semiparametric-efficient plug-in estimator of a causal contrast (e.g., the average treatment effect) that fits an initial outcome regression, then applies a fluctuation/targeting update driven by the propensity score so the final estimate solves the efficient influence-curve equation, permitting machine-learning (Super Learner) estimation of both nuisance models while retaining valid confidence intervals.",
    "long_description": "**Targeted maximum likelihood estimation (TMLE)** is a general framework (van der Laan & Rubin, 2006) for estimating a\nlow-dimensional causal target parameter, most commonly the average treatment effect (ATE), E[Y^1] − E[Y^0], from\nobservational data with high-dimensional nuisance parameters. It is a *plug-in* estimator: it produces an estimate of the\ntarget by substituting fitted components into the parameter's mapping, but unlike a naive plug-in it adds a **targeting\n(fluctuation) step** that updates the initial fit specifically to reduce bias in the target parameter and to make the final\nestimate solve the **efficient influence-curve (EIC) estimating equation**. Solving the EIC equation is what delivers two\nproperties at once: (1) **double robustness**: the ATE estimate is consistent if *either* the outcome model *or* the\npropensity (treatment) model is correctly specified, not necessarily both; and (2) **semiparametric efficiency**: when both\nnuisance models are consistent at fast-enough rates, TMLE achieves the lowest possible asymptotic variance, and its variance\nis estimated directly from the empirical variance of the influence curve, giving valid Wald confidence intervals.\n\n**The algorithm, concretely (binary treatment A, outcome Y, covariates W).** (1) Fit an initial outcome regression\nQbar0(A,W) = E[Y | A, W] and obtain the predicted outcomes under treatment Qbar0(1,W) and control Qbar0(0,W). (2) Fit the\npropensity score g(W) = P(A=1 | W). (3) Construct the **clever covariate** H(A,W) = A/g(W) − (1−A)/(1−g(W)) (the\ninverse-probability term that multiplies the outcome residual in the EIC for the ATE). (4) Run the **fluctuation**: a\nno-intercept logistic regression of Y on the clever covariate with the\ninitial prediction Qbar0 as an offset (on the logit scale), yielding a single fitted coefficient epsilon. (5) **Update**:\nQbar*(A,W) = expit( logit(Qbar0(A,W)) + epsilon · H(A,W) ); compute the updated potential-outcome predictions Qbar*(1,W) and\nQbar*(0,W). 
(6) The TMLE of the ATE is the mean difference (1/n) Σ [Qbar*(1,Wi) − Qbar*(0,Wi)]. The variance is the sample\nvariance of the estimated influence curve, IC_i = H(Ai,Wi)·(Yi − Qbar*(Ai,Wi)) + (Qbar*(1,Wi) − Qbar*(0,Wi)) − ATE, divided\nby n. Because step (1) and step (2) can each be fit with **Super Learner** (cross-validated ensemble stacking of flexible\nlearners — GBM, random forests, splines, elastic net), TMLE allows machine learning for the nuisance functions while the\ntargeting step restores valid, EIC-based inference that plain ML plug-in estimates do not have.\n\n**Core conceptual distinctions.** TMLE sits in the doubly-robust family with **AIPW** (augmented inverse-probability\nweighting) and is related to **g-computation** (the parametric outcome-regression standardization). (1) *vs g-computation*:\ng-computation is the un-targeted plug-in — fit Qbar, standardize — and is consistent only if the outcome model is correct,\nwith no double robustness and no influence-curve-based variance. TMLE *is* g-computation plus a targeting update that buys\ndouble robustness and valid inference. (2) *vs AIPW*: both are doubly robust and asymptotically efficient and both use the\nEIC. AIPW is a one-step *additive* bias correction (it adds the mean of the influence function to the plug-in), so its\nestimate of a probability/risk can fall outside [0,1]; TMLE is a *substitution* estimator whose targeting step keeps the\nupdated outcome predictions within the model's natural bounds, which is more stable in finite samples and under near-positivity\nviolations. 
(3) TMLE estimates a **marginal** causal contrast (the population ATE, or ATT/risk ratio/odds ratio via the\nappropriate clever covariate and mapping), not a conditional regression coefficient.\n\n**Pros, cons, and trade-offs.**\n- **vs `propensity-score-methods-psm-iptw` (IPTW alone):** TMLE is doubly robust — it tolerates misspecification of *either*\n  the treatment or outcome model — and is efficient, whereas IPTW is consistent only if the propensity model is correct and\n  is sensitive to extreme weights/positivity violations. **Prefer TMLE** when you want robustness to one model being wrong\n  and you can credibly estimate both nuisance functions; **prefer plain IPTW** for transparency or when the outcome model is\n  hard to specify and you trust the PS.\n- **vs AIPW (one-step doubly-robust):** Both are doubly robust and efficient; TMLE's substitution/bounding makes it more\n  stable when predicted probabilities are near 0/1 and respects the parameter space, while AIPW is simpler to implement and\n  its theory is more transparent. **Prefer TMLE** under near-positivity or bounded outcomes; **AIPW** when simplicity and a\n  closed-form correction are preferred. They are close cousins, not opposites.\n- **vs `g-estimation-structural-nested-models`:** g-estimation of structural nested models targets effects in the presence of\n  time-varying confounding affected by prior treatment and is natural for continuous/longitudinal exposures; TMLE (and its\n  longitudinal extension, LTMLE) targets marginal contrasts and integrates ML nuisance estimation. 
**Prefer g-estimation /\n  SNMs** for effect modification by time-varying covariates and structural-nested questions; **prefer (L)TMLE** for marginal\n  parameters with flexible ML nuisance fits.\n- **vs `predictive-and-causal-ml-models-rwe` (plain ML plug-in):** A bare machine-learning outcome model plugged into a\n  standardization gives a biased estimate with invalid inference (the bias of ML for prediction does not vanish for the\n  target parameter). TMLE's targeting step is exactly what corrects that bias and restores EIC-based CIs — this is the reason\n  TMLE exists.\n\n**When to use.** Estimating a marginal causal contrast (ATE, ATT, marginal risk ratio/odds ratio) from observational RWE when\nyou want (a) robustness to misspecification of one of the two nuisance models, (b) to use flexible machine learning / Super\nLearner for confounding adjustment without sacrificing valid inference, and (c) efficient, influence-curve-based confidence\nintervals. It is well suited to high-dimensional confounding (claims/EHR covariate banks), to bounded outcomes where\nsubstitution estimators behave well, and to settings where regulators or reviewers expect a principled doubly-robust analysis.\n\n**When NOT to use — and when it is actively misleading or dangerous.**\n- **Under serious positivity violations.** If g(W) is near 0 or 1 for some covariate strata (no comparable treated/untreated\n  units), the clever covariate explodes, the fluctuation is unstable, and TMLE — like any IPW-based method — extrapolates;\n  truncating the PS hides rather than fixes the violation. Diagnose and report positivity before trusting the estimate.\n- **When neither nuisance model can be credibly estimated.** Double robustness protects against *one* wrong model, not two;\n  if both the outcome and treatment models are badly misspecified (and ML cannot rescue them because of unmeasured\n  confounding or too little data), TMLE is biased like everything else. 
It does not manufacture identification.\n- **As a fix for unmeasured confounding.** TMLE assumes conditional exchangeability (no unmeasured confounding) given W; it is\n  an estimation method, not an identification one. Reporting a tight TMLE CI on a confounded contrast launders bias into false\n  precision.\n- **Without cross-fitting when using aggressive ML.** Highly adaptive learners without sample-splitting (cross-fitting /\n  CV-TMLE) can overfit the nuisance functions and invalidate the asymptotics; use CV-TMLE when learners are complex.\n- **For a conditional/effect-modified parameter without the right target.** TMLE estimates the parameter you specify; using a\n  marginal-ATE TMLE and narrating it as a subgroup effect is a target-parameter error.\n\n**Data-source operational depth.**\n- **Claims (FFS vs MA):** The high-dimensional covariate bank (diagnoses, prior fills, utilization) is ideal Super Learner\n  input for both nuisance models, and high-dimensional propensity-score covariate selection can feed W. Build the contrast on\n  FFS-observable person-time only — Medicare Advantage enrollees lack fee-for-service claims, so MA-only spans give\n  differentially incomplete confounders and a positivity structure that is an artifact of missingness, not biology. 
Check the\n  estimated PS distribution by arm for positivity before targeting.\n- **EHR:** Labs, vitals, and NLP-derived severity sharpen both Qbar and g (richer confounding control, more plausible\n  exchangeability), but informative-presence/visit-driven capture creates a selection process W does not capture; treat\n  informative loss to follow-up with the longitudinal extension (LTMLE) and censoring weights rather than a single\n  cross-sectional TMLE.\n- **Registry / linked:** Adjudicated outcomes and severity tighten exchangeability and improve the outcome model; linked\n  claims-EHR-vital-records is the strongest substrate for both nuisance fits, with linkage selection reported as a separate\n  bias. The arithmetic of the targeting step is unchanged; what changes is the credibility of the no-unmeasured-confounding\n  assumption.\n\n**Interpreting the output**\n\nConsider a cohort study of 1,000 patients comparing a new oral diabetes drug with usual care on 1-year\nhospitalization. TMLE using Super Learner for both nuisance models reports an\nadjusted risk difference (ATE) of −0.06 (95% CI −0.10, −0.02) for 1-year hospitalization, with confidence\nintervals derived from the efficient influence curve.\n\nFormal interpretation: The TMLE estimate of −0.06 is the average treatment effect (ATE): the population-averaged\ndifference in 1-year hospitalization probability if all 1,000 patients had received the new drug versus if none\nhad. It is doubly robust: the estimate converges to the true ATE if either the outcome model (predicting\nhospitalization from covariates and treatment) or the propensity model (predicting treatment receipt from\ncovariates) is correctly specified, even if the other is misspecified. The influence-curve-based 95% CI is asymptotically\nvalid even when machine-learning models are used for both nuisance functions, provided those fits converge fast enough,\nwhereas standard parametric CI formulas do not\napply when Super Learner is used without the targeting step. 
The central untestable assumption is no unmeasured\nconfounding: every variable that jointly determines treatment assignment and hospitalization must be measured and\nincluded. The CI means that across repeated datasets from the same data-generating process, intervals constructed\nthis way would contain the true ATE in 95% of replications.\n\nPractical interpretation: Receiving the new diabetes drug is estimated to reduce 1-year hospitalization risk by\n6 percentage points on average across the full patient population. TMLE's double-robustness property means that\na moderately misspecified outcome model does not invalidate the estimate if the propensity model is adequate, and\nvice versa — providing a meaningful safeguard that neither IPTW alone nor outcome regression alone offers.",
    "primary_category": "Causal_Inference_Method",
    "tags": [
      "tmle",
      "doubly_robust",
      "super_learner",
      "efficient_influence_curve",
      "average_treatment_effect",
      "targeted_learning"
    ],
    "applies_to_study_types": [
      "new_user",
      "active_comparator_new_user",
      "cohort_retrospective",
      "cohort_prospective",
      "comparative_effectiveness",
      "claims_analysis",
      "ehr_study"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "linked",
      "registry",
      "primary"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1093/aje/kww165",
        "url": "https://doi.org/10.1093/aje/kww165",
        "citation_text": "Schuler MS, Rose S. Targeted maximum likelihood estimation for causal inference in observational studies. American Journal of Epidemiology. 2017;185(1):65-73.",
        "year": 2017,
        "authors_short": "Schuler & Rose",
        "notes": "The standard epidemiologic introduction to TMLE for the ATE, laying out the initial fit, clever covariate, targeting step, Super Learner nuisance estimation, and influence-curve-based inference with a worked example."
      },
      {
        "role": "explain",
        "doi": "10.2202/1557-4679.1043",
        "url": "https://doi.org/10.2202/1557-4679.1043",
        "citation_text": "van der Laan MJ, Rubin D. Targeted maximum likelihood learning. The International Journal of Biostatistics. 2006;2(1):Article 11.",
        "year": 2006,
        "authors_short": "van der Laan & Rubin",
        "notes": "The foundational paper defining targeted maximum likelihood learning, the fluctuation/targeting update, and the efficient-influence-curve basis for double robustness and semiparametric efficiency."
      },
      {
        "role": "explain",
        "doi": "10.1093/aje/kwq439",
        "url": "https://doi.org/10.1093/aje/kwq439",
        "citation_text": "Funk MJ, Westreich D, Wiesen C, Sturmer T, Brookhart MA, Davidian M. Doubly robust estimation of causal effects. American Journal of Epidemiology. 2011;173(7):761-767.",
        "year": 2011,
        "authors_short": "Funk et al.",
        "notes": "Explains doubly-robust estimation (AIPW) and the \"consistent if either model is correct\" property that TMLE shares; the accessible reference for why combining outcome and propensity models gives robustness."
      },
      {
        "role": "demonstrate",
        "doi": "10.1002/sim.7628",
        "url": "https://doi.org/10.1002/sim.7628",
        "citation_text": "Luque‐Fernandez MA, Schomaker M, Rachet B, Schnitzer ME. Targeted maximum likelihood estimation for a binary treatment: a tutorial. Statistics in Medicine. 2018;37(16):2530-2546.",
        "year": 2018,
        "authors_short": "Luque-Fernandez et al.",
        "notes": "A step-by-step tutorial implementing TMLE for a binary treatment with worked code, the clever covariate, the fluctuation regression, and influence-curve variance; the reference for hand-coding the targeting step."
      },
      {
        "role": "use",
        "doi": "10.18637/jss.v051.i13",
        "url": "https://doi.org/10.18637/jss.v051.i13",
        "citation_text": "Gruber S, van der Laan MJ. tmle: an R package for targeted maximum likelihood estimation. Journal of Statistical Software. 2012;51(13):1-35.",
        "year": 2012,
        "authors_short": "Gruber & van der Laan",
        "notes": "Documents the canonical tmle R package (Super Learner nuisance fits, ATE/risk-ratio/odds-ratio targets, inference); the practical tool used to run TMLE without hand-coding the fluctuation."
      }
    ],
    "plain_language_summary": "Targeted Maximum Likelihood Estimation (TMLE) is a method for estimating how much a treatment actually changes an outcome, using two separate statistical models and then combining them in a clever way that makes the final answer more trustworthy. What makes TMLE special is that it is doubly robust: if either your model for who received the treatment or your model for what happened to patients is approximately correct, your estimate of the treatment effect will still be approximately right. It also lets you use machine learning to build those two models, which is helpful when you have dozens or hundreds of patient characteristics to account for.",
    "key_terms": [
      {
        "term": "doubly robust",
        "definition": "A property of certain statistical methods meaning the final estimate is still approximately correct as long as at least one of the two required models (the outcome model or the treatment model) is a reasonable approximation of reality."
      },
      {
        "term": "targeted maximum likelihood estimation",
        "definition": "A two-stage estimation approach that first fits initial models for outcomes and treatment, then applies a small mathematical correction called the targeting step so that the final answer is optimized specifically for the causal effect you care about."
      },
      {
        "term": "machine learning nuisance models",
        "definition": "Flexible computer-learned models (such as random forests or gradient boosting) used to handle the statistical background work of adjusting for confounders, rather than making rigid assumptions about a simple linear relationship."
      },
      {
        "term": "propensity score",
        "definition": "The estimated probability that a given patient received the treatment, calculated from their background characteristics, used here as one of the two models TMLE requires."
      },
      {
        "term": "average treatment effect",
        "definition": "The average difference in outcomes between what would have happened if everyone in the study had received the treatment versus if everyone had not, calculated across the whole study population."
      },
      {
        "term": "confounding",
        "definition": "When a background characteristic such as age or disease severity both influences who gets the treatment and influences the outcome, making a simple comparison between treated and untreated patients misleading."
      }
    ],
    "worked_example": {
      "scenario": "Imagine a cohort study of 1,000 patients to estimate whether a new oral diabetes drug lowers one-year hospitalization risk compared to usual care. Patients who are sicker tend to receive the new drug more often (confounding by indication). We compare two approaches: plain logistic regression (one model, outcome only) and TMLE (two models combined with a targeting step).",
      "dataset": {
        "caption": "Simplified summary of what the analyst observes for each patient. There is no pharmacy fill timeline here; the raw inputs are patient background variables, treatment assignment, and a binary outcome.",
        "columns": [
          "person_id",
          "age",
          "baseline_hba1c",
          "received_new_drug",
          "hospitalized_1yr"
        ],
        "rows": [
          [
            "1001",
            62,
            8.4,
            1,
            0
          ],
          [
            "1002",
            71,
            9.1,
            0,
            1
          ],
          [
            "1003",
            55,
            7.8,
            1,
            0
          ],
          [
            "1004",
            68,
            9.6,
            1,
            1
          ],
          [
            "1005",
            59,
            8.0,
            0,
            0
          ]
        ]
      },
      "steps": [
        "Plain regression approach: fit one logistic regression model predicting hospitalization from drug assignment plus age and HbA1c. This gives an odds ratio but depends entirely on that single model being correct.",
        "TMLE step 1 (outcome model): use machine learning to predict each patient's probability of hospitalization given their treatment assignment and background characteristics. This is the initial outcome model.",
        "TMLE step 2 (treatment model): use machine learning separately to predict each patient's probability of having received the new drug given their background characteristics. This is the propensity model.",
        "TMLE step 3 (targeting): apply a small mathematical correction that uses information from both models to nudge the outcome predictions specifically toward the correct treatment-effect answer. If the outcome model was slightly off, the treatment model helps fix it; if the treatment model was slightly off, the outcome model helps fix it.",
        "TMLE final estimate: average the corrected predicted outcomes under treatment minus the corrected predicted outcomes under no treatment across all patients to get the estimated risk difference.",
        "Confidence interval: compute from the variability in the individual patient corrections (the influence curve), not from a model formula, giving a statistically valid interval even though machine learning was used."
      ],
      "result": "In a stylized version of this scenario, plain regression might estimate the new drug reduces hospitalization risk by about 4 percentage points, but if the regression model misspecified how HbA1c relates to hospitalization, that estimate is biased with no fallback. TMLE, using both models together, produces a doubly robust estimate: as long as either the outcome model or the propensity model captures the main confounding relationships reasonably well, the risk difference estimate converges to the true value. In practice, with machine learning for both models and the targeting step, TMLE often gives better-calibrated estimates and honest confidence intervals in the same high-dimensional claims or EHR setting where plain regression is most likely to be misspecified."
    },
    "prerequisites": [
      "propensity-score-methods-psm-iptw",
      "dags-backdoor-criterion-drug-studies",
      "estimands-ate-att-intercurrent-events-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Single-step TMLE for the ATE (binary outcome and treatment)",
        "description": "Initial outcome regression Qbar0(A,W) and propensity g(W), then one logistic fluctuation of Y on the clever covariate H(A,W)=A/g - (1-A)/(1-g) with logit(Qbar0) as offset; the ATE is the mean of Qbar*(1,W)-Qbar*(0,W).",
        "edge_cases": [
          "Predicted g(W) near 0 or 1 inflates the clever covariate; diagnose positivity and prefer bounding/substitution over naive truncation.",
          "Use Super Learner for Qbar and g, with cross-fitting (CV-TMLE) when learners are aggressive to protect the asymptotics."
        ],
        "data_source_notes": "claims: feed high-dimensional diagnosis/fill/utilization covariates to Super Learner; check the PS distribution by arm for positivity before targeting."
      },
      {
        "name": "Longitudinal TMLE (LTMLE) for time-varying treatment and censoring",
        "description": "Iterated targeting across time points with treatment and censoring (IPCW) clever covariates, estimating effects of time-varying exposures under time-varying confounding affected by prior treatment.",
        "edge_cases": [
          "Informative censoring/loss to follow-up must be modeled with the censoring mechanism (IPCW within the targeting step).",
          "Sequential positivity must hold at every time point; violations compound across the time series."
        ],
        "data_source_notes": "ehr/linked: visit-driven capture creates informative presence; LTMLE with censoring weights is required rather than a single cross-sectional TMLE."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "propensity-score-methods-psm-iptw",
        "pros_of_this": "Doubly robust (consistent if either the outcome or the treatment model is correct) and semiparametric-efficient, with influence-curve-based confidence intervals, and typically more stable than IPTW under near-positivity because it is a bounded substitution estimator rather than a pure inverse-weight estimator; it remains unreliable under serious positivity violations, where the clever covariate becomes extreme.",
        "cons_of_this": "Requires fitting and trusting two nuisance models (outcome and propensity) and is less transparent than a simple weighted comparison; implementation is more involved.",
        "when_to_prefer": "When robustness to one model being wrong and efficient ML-based confounding control are wanted; use plain IPTW for transparency or when the outcome model is hard to specify but the propensity model is trusted."
      },
      {
        "compared_to": "g-estimation-structural-nested-models",
        "pros_of_this": "Targets marginal contrasts (ATE/ATT, marginal RR/OR) and integrates Super Learner for the nuisance functions with valid inference; the longitudinal extension handles time-varying treatment and censoring.",
        "cons_of_this": "Estimates marginal parameters rather than the structural-nested effect-modification parameters that g-estimation targets, and does not naturally express effect modification by time-varying covariates.",
        "when_to_prefer": "When the estimand is a marginal causal contrast with flexible ML nuisance fits; prefer g-estimation/SNMs for structural-nested questions and effect modification by time-varying covariates."
      },
      {
        "compared_to": "predictive-and-causal-ml-models-rwe",
        "pros_of_this": "The targeting step removes the plug-in bias and restores efficient, influence-curve-based inference that a bare machine-learning outcome model plugged into standardization lacks.",
        "cons_of_this": "Adds the fluctuation/targeting machinery and the need for cross-fitting, which a pure prediction model does not require.",
        "when_to_prefer": "Whenever the goal is a causal target parameter with valid CIs rather than prediction; a plain ML plug-in is biased for the causal contrast and has invalid inference."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "High-dimensional diagnosis/fill/utilization covariates are ideal Super Learner input for both the outcome and propensity models (optionally seeded by high-dimensional propensity-score covariate selection); restrict to FFS-observable person-time so confounders are not differentially missing for Medicare Advantage spans, and inspect the estimated PS distribution by arm for positivity before the targeting step.",
      "ehr": "Labs, vitals, and NLP-derived severity sharpen both nuisance fits and make conditional exchangeability more plausible, but visit-driven, informative-presence capture is a selection process not captured by W; use the longitudinal extension (LTMLE) with censoring weights for informative loss to follow-up rather than a single cross-sectional TMLE.",
      "registry": "Adjudicated outcomes and severity tighten exchangeability and improve the outcome model; the targeting arithmetic is unchanged while the credibility of the no-unmeasured-confounding assumption improves.",
      "linked": "Linked claims-EHR-vital-records is the strongest substrate for both nuisance models and the censoring mechanism; report linkage selection as a separate, un-targeted source of bias."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\nimport pandas as pd\nfrom zepid.causal.doublyrobust import TMLE\n\n# Illustrative cohort: confounders W1, W2 affect both treatment A and outcome Y.\nrng = np.random.default_rng(11)\nn = 3000\nw1 = rng.normal(0, 1, n)\nw2 = rng.binomial(1, 0.45, n)\na  = rng.binomial(1, 1 / (1 + np.exp(-(-0.3 + 0.8*w1 + 0.6*w2))))   # treatment\n# True ATE on the risk scale is roughly +0.10 (A raises outcome risk by about 10 points).\npy = 1 / (1 + np.exp(-(-1.0 + 0.5*a + 0.9*w1 + 0.7*w2)))\ny  = rng.binomial(1, py)\ndf = pd.DataFrame(dict(A=a, Y=y, W1=w1, W2=w2))\n\n# Single-step TMLE: specify the outcome model, then the propensity model, then fit.\ntmle = TMLE(df, exposure=\"A\", outcome=\"Y\")\ntmle.outcome_model(\"A + W1 + W2\", print_results=False)      # initial Qbar(A,W)\ntmle.exposure_model(\"W1 + W2\", print_results=False)         # propensity g(W)\ntmle.fit()                                                  # targeting + IC inference\n\nprint(f\"ATE (risk difference) = {tmle.risk_difference:.4f}\")\nprint(f\"95% CI = ({tmle.risk_difference_ci[0]:.4f}, {tmle.risk_difference_ci[1]:.4f})\")\nprint(f\"Risk ratio = {tmle.risk_ratio:.3f}\")",
        "description": "TMLE for the ATE with the zepid library (zepid.causal.doublyrobust.TMLE), which implements the standard single-step TMLE:\nan outcome model, a propensity (exposure) model, the clever-covariate fluctuation, and influence-curve-based inference.\nRequired input: a tidy DataFrame with a binary exposure column, a binary outcome column, and confounders W. The example\nbuilds small illustrative data with a known structure and reports the risk difference (ATE) with its 95% CI.",
        "dependencies": [
          "zepid",
          "pandas",
          "numpy"
        ],
        "source_citations": [
          "schuler-rose-2017",
          "luque-fernandez-2018"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(tmle)        # install.packages(\"tmle\"); pulls in SuperLearner\nset.seed(11)\n\n# Illustrative cohort: confounders W1, W2 affect both treatment A and outcome Y.\nn  <- 3000\nW1 <- rnorm(n); W2 <- rbinom(n, 1, 0.45)\nA  <- rbinom(n, 1, plogis(-0.3 + 0.8*W1 + 0.6*W2))            # treatment\nY  <- rbinom(n, 1, plogis(-1.0 + 0.5*A + 0.9*W1 + 0.7*W2))    # binary outcome\nW  <- data.frame(W1 = W1, W2 = W2)\n\n# Super Learner library for the Q (outcome) and g (propensity) fits, then targeting.\nfit <- tmle(Y = Y, A = A, W = W,\n            Q.SL.library = c(\"SL.glm\", \"SL.glmnet\", \"SL.mean\"),\n            g.SL.library = c(\"SL.glm\", \"SL.glmnet\", \"SL.mean\"),\n            family = \"binomial\")\n\nprint(fit$estimates$ATE)     # point estimate, IC-based variance, and 95% CI\ncat(sprintf(\"ATE = %.4f (95%% CI %.4f, %.4f)\\n\",\n            fit$estimates$ATE$psi,\n            fit$estimates$ATE$CI[1], fit$estimates$ATE$CI[2]))",
        "description": "TMLE for the ATE with the canonical tmle R package (Gruber & van der Laan 2012). Install with\ninstall.packages(\"tmle\") (it uses SuperLearner for the Q and g fits). Required input: outcome Y (here binary), treatment A,\nand a covariate matrix W. tmle() runs the initial Super Learner outcome fit, the propensity fit, the targeting step, and\nreturns the ATE with influence-curve-based inference.",
        "dependencies": [
          "tmle",
          "SuperLearner",
          "glmnet"
        ],
        "source_citations": [
          "gruber-vanderlaan-2012",
          "schuler-rose-2017"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* (1) Initial outcome model Qbar(A,W); predict under observed A, under A=1, and under A=0. */\ndata coh1; set work.coh; run;\nproc logistic data=coh1 noprint;\n  model Y(event='1') = A W1 W2;\n  output out=q_obs p=q_aw;                 /* Qbar(A_i, W_i) */\nrun;\ndata c1; set work.coh; A=1; run;  data c0; set work.coh; A=0; run;\nproc logistic data=coh1 noprint; model Y(event='1') = A W1 W2; score data=c1 out=q1(rename=(P_1=q1)); run;\nproc logistic data=coh1 noprint; model Y(event='1') = A W1 W2; score data=c0 out=q0(rename=(P_1=q0)); run;\n\n/* (2) Propensity model g(W) = P(A=1|W). */\nproc logistic data=work.coh noprint;\n  model A(event='1') = W1 W2;\n  output out=g_out p=g;\nrun;\n\n/* (3) Clever covariate H = A/g - (1-A)/(1-g); offset = logit(Qbar). */\ndata tgt;\n  merge q_obs g_out q1(keep=q1) q0(keep=q0);\n  H      = A/g - (1-A)/(1-g);\n  offset = log(q_aw/(1-q_aw));             /* logit of the initial prediction */\nrun;\n\n/* (4) Fluctuation: no-intercept logistic of Y on H with logit(Qbar) as offset -> epsilon. */\nproc logistic data=tgt;                    /* no NOPRINT here: it would suppress the ODS OUTPUT table */\n  model Y(event='1') = H / noint offset=offset;\n  ods output ParameterEstimates=eps_out;\nrun;\ndata _null_; set eps_out; if Variable='H' then call symputx('eps', Estimate); run;\n\n/* (5) Update predictions, form ATE, and compute influence-curve variance. 
*/\ndata fin; set tgt;\n  H1 = 1/g;  H0 = -1/(1-g);\n  qs1 = 1/(1+exp(-(log(q1/(1-q1)) + &eps*H1)));   /* Qbar*(1,W) */\n  qs0 = 1/(1+exp(-(log(q0/(1-q0)) + &eps*H0)));   /* Qbar*(0,W) */\n  qs_obs = 1/(1+exp(-(offset + &eps*H)));         /* Qbar*(A,W) */\n  diff = qs1 - qs0;\nrun;\nproc means data=fin noprint; var diff; output out=atem mean=ate; run;\ndata _null_; set atem; call symputx('ate', ate); run;\ndata icv; set fin;\n  ic = H*(Y - qs_obs) + (qs1 - qs0) - &ate;       /* efficient influence curve */\nrun;\nproc means data=icv noprint; var ic; output out=v var=icvar n=n; run;\ndata result; set v; ate=&ate; se=sqrt(icvar/n);\n  lcl=ate-1.96*se; ucl=ate+1.96*se;\nrun;\nproc print data=result noobs; var ate se lcl ucl; run;",
        "description": "SAS has no TMLE package, so the targeting step is implemented manually: PROC LOGISTIC fits the initial outcome model Qbar\nand the propensity model g; a DATA step builds the clever covariate H = A/g - (1-A)/(1-g); a no-intercept PROC LOGISTIC\nfluctuation regresses Y on H with logit(Qbar) as an OFFSET to estimate epsilon; a final DATA step forms the updated\npotential-outcome predictions and the ATE, with influence-curve variance computed directly (Luque-Fernandez 2018). Input\nwork.coh has Y, A, W1, W2.",
        "dependencies": [],
        "source_citations": [
          "luque-fernandez-2018",
          "schuler-rose-2017"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Q0[\"Initial outcome regression<br/>Qbar0(A,W) via Super Learner\"] --> H\n  G[\"Propensity model<br/>g(W) = P(A=1 given W)\"] --> H[\"Clever covariate<br/>H = A/g - (1-A)/(1-g)\"]\n  H --> Fluc[\"Fluctuation: logistic Y on H<br/>offset = logit(Qbar0), estimates epsilon\"]\n  Fluc --> Upd[\"Update Qbar* = expit(logit(Qbar0) + epsilon*H)\"]\n  Upd --> ATE[\"ATE = mean of Qbar*(1,W) - Qbar*(0,W)\"]\n  ATE --> IC[\"Variance from efficient influence curve<br/>valid Wald 95% CI\"]",
        "caption": "The TMLE algorithm for the ATE. An initial Super Learner outcome fit is updated by a single fluctuation driven by the clever covariate (built from the propensity score), so the final substitution estimator solves the efficient influence-curve equation, delivering double robustness and influence-curve-based confidence intervals.",
        "alt_text": "Flowchart from an initial outcome regression and a propensity model into a clever covariate, then a logistic fluctuation with the initial prediction as offset, an updated outcome prediction, the average-treatment-effect mean difference, and influence-curve-based variance.",
        "source_type": "illustrative",
        "source_citations": [
          "schuler-rose-2017",
          "luque-fernandez-2018"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Goal{What do you need?} -->|Prediction only| ML[Plain ML model<br/>biased + invalid CI for causal effect]\n  Goal -->|Marginal causal contrast| DR{Can you fit BOTH<br/>outcome and propensity models?}\n  DR -->|Yes| Pos{Positivity OK?<br/>g W away from 0 and 1}\n  Pos -->|Yes| TMLE[TMLE / AIPW<br/>doubly robust + efficient + IC inference]\n  Pos -->|No, near-violation| Warn[Diagnose positivity first<br/>truncation hides the problem, does not fix it]\n  DR -->|Outcome model trusted only| Gcomp[g-computation<br/>no double robustness]\n  DR -->|Propensity trusted only| IPTW[IPTW<br/>consistent if PS correct]",
        "caption": "When to reach for TMLE. It is the doubly-robust, efficient choice for a marginal causal contrast when both nuisance models can be estimated and positivity holds; under near-violations of positivity, diagnose before targeting. When the goal is prediction only, a plain ML model is enough, but used as a plug-in for a causal effect it is biased and its confidence intervals are invalid.",
        "alt_text": "Decision tree distinguishing prediction from causal estimation, branching on whether both nuisance models can be fit and whether positivity holds, routing to TMLE/AIPW, g-computation, IPTW, or a positivity warning.",
        "source_type": "illustrative",
        "source_citations": [
          "schuler-rose-2017"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "alternative_to",
        "target_slug": "propensity-score-methods-psm-iptw",
        "notes": "TMLE is a doubly-robust, efficient alternative to propensity-score weighting alone; it tolerates misspecification of either the outcome or the treatment model and is more stable than IPTW under near-positivity."
      },
      {
        "relation_type": "see_also",
        "target_slug": "marginal-structural-models-g-methods",
        "notes": "Marginal structural models estimated by IPTW and the longitudinal TMLE (LTMLE) both target marginal effects of time-varying treatment; LTMLE is the doubly-robust, ML-friendly route to the same MSM-type estimands."
      },
      {
        "relation_type": "alternative_to",
        "target_slug": "predictive-and-causal-ml-models-rwe",
        "notes": "TMLE uses machine-learning (Super Learner) nuisance fits but adds a targeting step that removes the plug-in bias and restores valid inference that a bare ML plug-in lacks for a causal target parameter."
      },
      {
        "relation_type": "see_also",
        "target_slug": "g-estimation-structural-nested-models",
        "notes": "Both are causal-inference methods; g-estimation of structural nested models targets effect-modification/structural parameters under time-varying confounding, while TMLE targets marginal contrasts with ML nuisance estimation."
      },
      {
        "relation_type": "used_with",
        "target_slug": "high-dimensional-propensity-score-hdps-rwe",
        "notes": "High-dimensional propensity-score covariate selection can supply the confounder set W fed to the Super Learner propensity and outcome models inside TMLE in claims data."
      },
      {
        "relation_type": "see_also",
        "target_slug": "e-value-sensitivity-analysis",
        "notes": "TMLE assumes no unmeasured confounding given W; an E-value on the TMLE estimate quantifies how strong residual unmeasured confounding would need to be to overturn it."
      }
    ],
    "aliases": [
      "TMLE",
      "targeted maximum likelihood estimation",
      "targeted minimum loss-based estimation",
      "targeted learning",
      "doubly robust TMLE"
    ],
    "complexity": "advanced",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta",
      "journal"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "test-negative-design-rwe",
    "name": "Test-Negative Design",
    "short_definition": "A case-control-like design for estimating vaccine or therapeutic effectiveness that enrolls patients presenting to care with the same clinical syndrome and tests each for the target pathogen; test-positives are cases and test-negatives are controls, so that conditioning on care-seeking removes much of the differential healthcare-seeking confounding that plagues conventional designs.",
    "long_description": "The **test-negative design (TND)** estimates vaccine effectiveness (VE) — or, more generally, the effectiveness of a\npreventive or therapeutic intervention against a pathogen-specific outcome — by enrolling only people who present to a\ncare setting with a defined clinical syndrome (e.g., acute respiratory illness, influenza-like illness) and who are then\ntested for the target pathogen. Those who test **positive** for the pathogen are the cases; those who test **negative**\n(the same syndrome, a different cause) are the controls. Vaccination status is ascertained for both groups and the\neffectiveness estimate is VE = 1 − OR, where the odds ratio compares the odds of vaccination among test-positives to the\nodds among test-negatives, typically from a logistic model adjusting for age, calendar time, and comorbidity. The single\nstructural idea is that *both* cases and controls have already cleared the same care-seeking filter: they all felt sick\nenough to seek care and to be tested. If vaccinated and unvaccinated people differ in how readily they seek care (the\n\"healthy-vaccinee\" / healthcare-seeking confounder), that difference is largely **conditioned out** by restricting the\nstudy to people who sought care, because it acts on the probability of being sampled rather than on the pathogen-specific\noutcome itself.\n\n**Core conceptual distinction.** The TND is *not* a cohort design and it is *not* a generic case-control study. (1) *Versus\na conventional case-control study*, the controls are not population or hospital controls chosen for some unrelated reason;\nthey are syndrome-matched, test-negative patients drawn from the very same testing stream as the cases. That common origin\nis what buys the control of differential care-seeking. 
(2) *Versus a cohort VE study*, the TND never enumerates a\ndenominator of person-time at risk; it samples on presentation-and-test, so it estimates an odds ratio that approximates the\nrate ratio only under the design's identifying assumptions. (3) The estimand is **pathogen-specific** effectiveness against\n*medically attended, tested* disease — it says nothing about effectiveness against asymptomatic infection, transmission, or\ndisease that never reaches a testing setting. Foppa's \"case test-negative\" formulation makes the sampling explicit and shows\nthe conditions under which the OR identifies the rate ratio.\n\n**Validity rests on a small set of assumptions, and each maps to a bias.** (a) *Test accuracy*: imperfect test sensitivity\nand specificity bias VE toward the null because some true cases land in the test-negative (control) group and vice versa;\nwith a highly specific but imperfectly sensitive RT-PCR, modest sensitivity loss is usually non-differential and biases VE\ntoward 0, but differential test performance by vaccination status (e.g., vaccinated infections having lower viral load and\nmore false negatives) biases VE upward. (b) *Equal care-seeking for vaccinated and unvaccinated, conditional on disease* —\nif vaccination changes the probability of seeking care given the same severity, the conditioning is incomplete and residual\nselection remains. (c) *No off-target protection*: the vaccine must not also reduce the non-target (test-negative) illnesses,\nor those illnesses become an invalid control series and VE is biased (downward when the vaccine prevents them, because\nvaccinated patients are then under-represented among the controls). 
(d) *Confounding by indication /\ncalendar time*: vaccine uptake and pathogen circulation both vary over the season, so calendar time must be controlled, and\nconfounding by underlying risk (frailty, comorbidity) is reduced but not eliminated by the design.\n\n**Pros, cons, and trade-offs.**\n- **vs the conventional `case-control` design:** The TND's syndrome-matched, same-stream test-negative controls remove most\n  differential healthcare-seeking and access bias that contaminate population/hospital controls; it is cheaper because\n  cases and controls are captured passively from routine testing. Cost: the control series is only valid if the vaccine has\n  no effect on the test-negative causes and the test is reasonably specific. **Prefer the TND** for routine, in-season VE\n  surveillance against a pathogen-specific, laboratory-confirmable outcome; **prefer a classic case-control** when no clean\n  syndrome-matched test-negative series exists or off-target effects are plausible.\n- **vs a `cohort-retrospective` VE study:** A cohort estimates a rate/risk ratio with an explicit denominator and can study\n  multiple outcomes, but it must measure and adjust the full healthy-vaccinee confounding structure directly and is\n  sensitive to outcome misclassification across the whole population. The TND sidesteps the care-seeking confounder by\n  design but yields only an OR for medically attended tested disease. **Prefer the cohort** when person-time, multiple\n  endpoints, or waning over long horizons is the question; **prefer the TND** when the dominant threat is care-seeking\n  confounding and a specific laboratory endpoint is available.\n- **vs `screening-method` / administrative VE:** The Farrington screening method needs only case vaccination coverage and\n  population coverage, but inherits all population-coverage measurement error and care-seeking bias. 
The TND is more robust\n  to care-seeking but needs individual test data.\n\n**When to use.** In-season influenza, COVID-19, rotavirus, pneumococcal, and dengue vaccine effectiveness from sentinel\ntesting networks, emergency-department or hospital testing streams, and large EHR/claims-linked laboratory data; whenever a\nspecific, laboratory-confirmable pathogen outcome exists, a syndrome-matched test-negative control series is naturally\navailable, and differential healthcare-seeking between vaccinated and unvaccinated people is the chief threat to a cohort\nestimate. It is the default real-world VE design for the WHO and CDC sentinel platforms for exactly these reasons.\n\n**When NOT to use — and when it is actively misleading or dangerous.**\n- **When the vaccine has off-target effects on the test-negative illnesses.** If the intervention also lowers the\n  non-target causes of the syndrome, vaccinated patients are depleted from the test-negative series, the OR rises, and VE\n  is biased downward, hiding real protection; if it instead raises the risk of non-target illness, VE is inflated and the\n  bias looks like efficacy.\n- **When the test is non-specific or differentially performing.** Poor specificity dilutes VE toward the null; differential\n  sensitivity by vaccination status (lower viral load in breakthrough infections) inflates VE. Do not run a TND on a\n  syndromic or antigen test of unknown specificity and report the OR as if it were unbiased.\n- **When vaccination changes care-seeking given disease.** If vaccinated people who get infected are systematically less\n  (or more) likely to present and be tested at the same severity, conditioning on testing does not remove the selection and\n  VE is biased; the design's central assumption fails silently.\n- **As an estimate of effectiveness against infection or transmission.** The TND estimand is medically attended, tested\n  disease only. 
Narrating a TND VE as protection against all infection overstates what was measured.\n- **When test-negatives are not drawn from the same care/testing stream as cases.** Pulling controls from a different\n  setting reintroduces exactly the access/care-seeking confounding the design exists to remove.\n\n**Data-source operational depth.**\n- **Sentinel / surveillance testing networks (primary):** The canonical substrate — a defined syndrome definition triggers a\n  standardized test, and both arms come from one stream. Capture the test type, its specificity, the syndrome case\n  definition, and calendar week, and adjust for week and site as fixed effects.\n- **EHR-linked laboratory data:** Test results and diagnoses are encounter-driven; differential leakage (testing done\n  out-of-system) and informative testing (clinicians test sicker or higher-risk patients) can break the equal-care-seeking\n  assumption. Restrict to a stable in-system population, define the syndrome from structured diagnosis plus an actual test\n  order, and adjust for comorbidity and prior-year utilization.\n- **Claims (FFS vs MA):** Vaccination is captured from procedure/NDC/HCPCS codes and the outcome from a diagnosis paired with\n  a test claim; Medicare Advantage enrollees generate no fee-for-service claims, so vaccination and testing can be\n  differentially unobserved — restrict to FFS-observable person-time or VE is biased by ascertainment, not by the vaccine.\n- **Hospital / ED test streams:** Severity-based testing thresholds vary by site and over the season; include site and\n  calendar-time terms and check that the test-negative case mix is stable across vaccination strata.",
    "primary_category": "Study_Design",
    "tags": [
      "test_negative_design",
      "vaccine_effectiveness",
      "case_control",
      "healthcare_seeking_bias",
      "selection_bias",
      "pathogen_specific_outcome"
    ],
    "applies_to_study_types": [
      "case_control",
      "vaccine_effectiveness",
      "claims_analysis",
      "ehr_study",
      "surveillance"
    ],
    "data_sources": [
      "primary",
      "ehr",
      "claims",
      "linked",
      "registry"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1016/j.vaccine.2013.02.053",
        "url": "https://doi.org/10.1016/j.vaccine.2013.02.053",
        "citation_text": "Jackson ML, Nelson JC. The test-negative design for estimating influenza vaccine effectiveness. Vaccine. 2013;31(17):2165-2168.",
        "year": 2013,
        "authors_short": "Jackson & Nelson",
        "notes": "The canonical introduction naming and formalizing the test-negative design, laying out the VE = 1 - OR estimand and the control-of-care-seeking rationale; the most-cited methods reference for the design."
      },
      {
        "role": "explain",
        "doi": "10.1093/aje/kww064",
        "url": "https://doi.org/10.1093/aje/kww064",
        "citation_text": "Sullivan SG, Tchetgen Tchetgen EJ, Cowling BJ. Theoretical basis of the test-negative study design for assessment of influenza vaccine effectiveness. American Journal of Epidemiology. 2016;184(5):345-353.",
        "year": 2016,
        "authors_short": "Sullivan et al.",
        "notes": "Derives the identifying assumptions of the TND with DAGs, shows exactly when conditioning on care-seeking removes selection bias, and enumerates the off-target-protection and differential-care-seeking failure modes."
      },
      {
        "role": "demonstrate",
        "doi": "10.1016/j.vaccine.2013.04.026",
        "url": "https://doi.org/10.1016/j.vaccine.2013.04.026",
        "citation_text": "Foppa IM, Haber M, Ferdinands JM, Shay DK. The case test-negative design for studies of the effectiveness of influenza vaccine. Vaccine. 2013;31(30):3104-3109.",
        "year": 2013,
        "authors_short": "Foppa et al.",
        "notes": "Makes the sampling mechanism explicit (\"case test-negative\") and gives the conditions under which the odds ratio identifies the incidence rate ratio, with the test-sensitivity/specificity bias structure worked through."
      },
      {
        "role": "use",
        "doi": "10.1097/EDE.0b013e3181d61eeb",
        "url": "https://doi.org/10.1097/EDE.0b013e3181d61eeb",
        "citation_text": "Lipsitch M, Tchetgen Tchetgen E, Cohen T. Negative controls: a tool for detecting confounding and bias in observational studies. Epidemiology. 2010;21(3):383-388.",
        "year": 2010,
        "authors_short": "Lipsitch et al.",
        "notes": "The negative-control framework that underpins why test-negative controls (an outcome the vaccine should not affect) detect and absorb shared confounding; essential for justifying and stress-testing the control series."
      }
    ],
    "plain_language_summary": "The test-negative design is a way to measure how well a vaccine works in the real world by studying only the people who already went to a clinic or hospital feeling sick and got tested for the disease in question. Everyone who tested positive for the target pathogen counts as a case; everyone who tested negative for it — but still showed up sick and got tested — counts as a control. Because both groups had to feel sick enough to seek care and to be tested, the design automatically cancels out most of the distortion that comes from vaccinated people simply being more likely to visit a doctor in the first place. The result is an odds ratio comparing vaccination rates in the two groups, and vaccine effectiveness is calculated as VE = (1 − odds ratio) × 100%.",
    "key_terms": [
      {
        "term": "test-positive case",
        "definition": "A patient who came to care with the target syndrome (e.g., flu-like illness) and whose laboratory test confirmed the pathogen of interest — they are the 'cases' in this design."
      },
      {
        "term": "test-negative control",
        "definition": "A patient who came to care with the same syndrome but whose laboratory test came back negative for the pathogen — they serve as the comparison group because they cleared the same care-seeking filter."
      },
      {
        "term": "odds ratio (OR)",
        "definition": "A number comparing how likely vaccinated people are to be in the case group versus the control group; an OR below 1 means vaccinated people are less likely to be cases, which is evidence that the vaccine is working."
      },
      {
        "term": "vaccine effectiveness (VE)",
        "definition": "The percentage reduction in risk of the target disease attributable to vaccination, estimated here as VE = (1 − OR) × 100%; a VE of 60% means vaccinated people had 60% lower odds of being a case."
      },
      {
        "term": "healthcare-seeking bias",
        "definition": "The distortion that occurs when vaccinated and unvaccinated people differ in how readily they visit a doctor; it can make a vaccine look more or less effective than it really is unless the study design controls for it."
      }
    ],
    "worked_example": {
      "scenario": "Imagine a sentinel flu surveillance network during a single influenza season. Every patient who walks into a participating clinic with fever plus cough or sore throat gets a rapid PCR test for influenza. Over the season, 340 patients meet that syndrome definition and receive a test. The analyst wants to estimate influenza vaccine effectiveness by comparing vaccination rates between the 140 patients who tested positive (cases) and the 200 patients who tested negative (controls). Because every patient in both groups had to feel sick enough to seek care and had to receive a test, the two groups share the same care-seeking history — that shared filter is what makes the comparison fair.",
      "dataset": {
        "caption": "One row per tested patient in a sentinel clinic registry. Each row records whether the patient was vaccinated this season and whether the influenza PCR test came back positive or negative.",
        "columns": [
          "patient_id",
          "vaccinated",
          "flu_test_result",
          "age",
          "epi_week"
        ],
        "rows": [
          [
            "P001",
            "yes",
            "positive",
            67,
            47
          ],
          [
            "P002",
            "no",
            "positive",
            52,
            47
          ],
          [
            "P003",
            "yes",
            "negative",
            71,
            47
          ],
          [
            "P004",
            "no",
            "negative",
            45,
            48
          ],
          [
            "P005",
            "yes",
            "positive",
            34,
            48
          ],
          [
            "... (335 more rows)",
            "",
            "",
            "",
            ""
          ]
        ]
      },
      "steps": [
        "Lay out the 2×2 table. The columns are the two test-result groups (test-positive cases vs. test-negative controls) and the rows are vaccination status. From the 340 tested patients: 40 were vaccinated and test-positive (cell a), 100 were vaccinated and test-negative (cell b), 100 were unvaccinated and test-positive (cell c), and 100 were unvaccinated and test-negative (cell d). Totals: 140 cases (40 + 100), 200 controls (100 + 100).",
        "The 2×2 table in full:\n\n| | Test-positive (cases) | Test-negative (controls) |\n|---|---|---|\n| Vaccinated | a = 40 | b = 100 |\n| Unvaccinated | c = 100 | d = 100 |",
        "Compute the odds ratio. The OR compares the odds of being vaccinated among cases to the odds of being vaccinated among controls: OR = (a × d) / (b × c) = (40 × 100) / (100 × 100) = 4,000 / 10,000 = 0.40.",
        "Interpret the OR. An OR of 0.40 means vaccinated patients had only 40% of the odds of being a flu case compared with unvaccinated patients who sought care and were tested — a large protective association.",
        "Convert the OR to vaccine effectiveness. VE = (1 − OR) × 100% = (1 − 0.40) × 100% = 0.60 × 100% = 60%. Vaccinated patients had 60% lower odds of testing positive for influenza.",
        "Why does restricting to tested patients help? Both vaccinated and unvaccinated people in this study had to feel sick enough to go to a clinic and receive a test. Any tendency for vaccinated people to visit the doctor more readily applies equally to both groups, so that tendency cancels out of the OR comparison rather than inflating the apparent effectiveness of the vaccine."
      ],
      "result": "OR = (40 × 100) / (100 × 100) = 4,000 / 10,000 = 0.40. VE = (1 − 0.40) × 100% = 60%. Among the 340 patients who were tested at sentinel clinics this season, vaccinated individuals had 60% lower odds of a positive influenza test than unvaccinated individuals, with both groups having passed the same care-seeking filter. This is a crude estimate; a full analysis would adjust for age and calendar week in a logistic model."
    },
    "prerequisites": [
      "case-control",
      "healthy-user-bias",
      "logistic-regression-for-binary-outcomes"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Standard test-negative VE estimation",
        "description": "Enroll syndrome-presenting patients tested for the target pathogen; test-positives are cases, test-negatives are controls; estimate VE = 1 - OR from a logistic model adjusting for age, calendar time/week, and comorbidity.",
        "edge_cases": [
          "Imperfect test specificity biases VE toward the null; report the test type and its specificity.",
          "Calendar time must be adjusted (vaccine uptake and pathogen circulation both vary across the season)."
        ],
        "data_source_notes": "surveillance/primary: one syndrome definition and one standardized test feed both arms; adjust for site and epidemiologic week as fixed effects."
      },
      {
        "name": "Severity- or setting-stratified TND",
        "description": "Stratify the test-negative analysis by disease severity (outpatient vs hospitalized) or care setting, because the equal-care-seeking assumption and the relevant VE estimand differ across severities.",
        "edge_cases": [
          "VE against severe disease and VE against mild medically attended disease are distinct estimands; do not pool them blindly.",
          "Differential testing thresholds across settings can reintroduce selection; check test-negative case-mix stability."
        ],
        "data_source_notes": "ehr/hospital: severity-based testing thresholds vary by site; include site terms and verify the test-negative composition is stable across vaccination strata."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "case-control",
        "pros_of_this": "Syndrome-matched, same-testing-stream test-negative controls remove most of the differential healthcare-seeking and access bias that contaminates population or hospital controls, and cases/controls are captured passively from routine testing.",
        "cons_of_this": "Validity collapses if the vaccine affects the test-negative causes (off-target protection) or the test is non-specific; the design only estimates an OR for medically attended, tested disease.",
        "when_to_prefer": "Routine in-season VE surveillance against a laboratory-confirmable pathogen-specific outcome where care-seeking confounding is the dominant threat; revert to a classic case-control when no clean test-negative series exists."
      },
      {
        "compared_to": "cohort-retrospective",
        "pros_of_this": "Sidesteps the healthy-vaccinee / care-seeking confounder by design rather than by measured adjustment, and is far cheaper since it reuses routine testing data with no person-time denominator to build.",
        "cons_of_this": "Yields only an odds ratio for tested, medically attended disease; cannot study person-time, multiple endpoints, transmission, or long-horizon waning directly.",
        "when_to_prefer": "When the chief threat is differential care-seeking and a specific lab endpoint exists; choose a cohort when person-time, multiple outcomes, or long-term waning is the question."
      },
      {
        "compared_to": "signal-detection",
        "pros_of_this": "Produces a defensible effectiveness estimate (VE = 1 - OR) for a pre-specified pathogen outcome rather than a disproportionality signal, with explicit, checkable identifying assumptions.",
        "cons_of_this": "Requires individual test results and a valid control series; cannot scan broadly across many unspecified outcomes the way spontaneous-report signal detection does.",
        "when_to_prefer": "When estimating effectiveness against a specific, lab-confirmable outcome rather than scanning for unexpected adverse-event signals."
      }
    ],
    "implementation_notes_by_data_source": {
      "primary": "Sentinel surveillance networks are the canonical substrate; one syndrome definition triggers one standardized test and both arms come from the same stream. Record test type and specificity, the syndrome case definition, and epidemiologic week; adjust for week and site.",
      "ehr": "Test orders and results are encounter-driven, so differential out-of-system testing (leakage) and clinician testing of sicker/higher-risk patients can break equal care-seeking; restrict to a stable in-system population, require an actual test order plus the syndrome diagnosis, and adjust for comorbidity and prior utilization.",
      "claims": "Vaccination is from procedure/NDC/HCPCS codes and the outcome from a diagnosis paired with a test claim; Medicare Advantage enrollees lack fee-for-service claims, so restrict to FFS-observable person-time or VE is biased by differential ascertainment rather than by the vaccine.",
      "registry": "Immunization registries can supply more complete vaccination status than claims; link to the testing stream and verify that registry capture does not differ by the outcome."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\nimport pandas as pd\nimport statsmodels.formula.api as smf\n\n# Illustrative sentinel-testing data: each row is one syndrome-presenting, tested patient.\nrng = np.random.default_rng(7)\nn = 4000\nweek = rng.integers(40, 53, n)                       # epidemiologic week\nvaccinated = rng.binomial(1, 0.55, n)                # vaccination status\nage = rng.normal(50, 18, n).clip(1, 95)\n# True VE ~ 60%: vaccination lowers the odds of being a (test-positive) case.\nlin = -0.4 + np.log(0.40) * vaccinated + 0.01 * (age - 50) + 0.02 * (week - 46)\ntest_positive = rng.binomial(1, 1 / (1 + np.exp(-lin)))   # 1 = pathogen+, 0 = pathogen-\ndf = pd.DataFrame(dict(test_positive=test_positive, vaccinated=vaccinated,\n                       age=age, week=week))\n\n# Cases = test-positive, controls = test-negative; model the odds of being a case.\nm = smf.logit(\"test_positive ~ vaccinated + age + C(week)\", data=df).fit(disp=0)\naor = np.exp(m.params[\"vaccinated\"])\nci_lo, ci_hi = np.exp(m.conf_int().loc[\"vaccinated\"])\nve     = 1 - aor                                     # VE = 1 - adjusted OR\nve_lo  = 1 - ci_hi                                   # CI flips because VE = 1 - OR\nve_hi  = 1 - ci_lo\nprint(f\"adjusted OR = {aor:.3f}\")\nprint(f\"VE = {ve*100:.1f}%  (95% CI {ve_lo*100:.1f}% to {ve_hi*100:.1f}%)\")",
        "description": "Test-negative VE estimation by logistic regression on illustrative line-level data. Required input: one row per tested,\nsyndrome-presenting patient with columns test_positive (1 = case/pathogen+, 0 = control/pathogen-), vaccinated (0/1),\nage, and epidemiologic week. The adjusted odds ratio for vaccination among test-positives vs test-negatives gives\nVE = 1 - aOR (Jackson & Nelson 2013). statsmodels fits the logistic model and supplies the Wald CI.",
        "dependencies": [
          "pandas",
          "numpy",
          "statsmodels"
        ],
        "source_citations": [
          "jackson-nelson-2013",
          "sullivan-2016"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "set.seed(7)\nn     <- 4000\nweek  <- sample(40:52, n, replace = TRUE)            # epidemiologic week\nvacc  <- rbinom(n, 1, 0.55)                          # vaccination status\nage   <- pmin(pmax(rnorm(n, 50, 18), 1), 95)\nlin   <- -0.4 + log(0.40) * vacc + 0.01 * (age - 50) + 0.02 * (week - 46)\ntpos  <- rbinom(n, 1, plogis(lin))                   # 1 = test-positive (case), 0 = control\ndat   <- data.frame(test_positive = tpos, vaccinated = vacc,\n                    age = age, week = factor(week))\n\n# Cases = test-positive, controls = test-negative; logistic model for being a case.\nfit  <- glm(test_positive ~ vaccinated + age + week, family = binomial, data = dat)\nbeta <- coef(summary(fit))[\"vaccinated\", ]\naor  <- exp(beta[\"Estimate\"])\nci   <- exp(beta[\"Estimate\"] + c(-1, 1) * 1.96 * beta[\"Std. Error\"])\nve    <- 1 - aor                                     # VE = 1 - adjusted OR\nve_ci <- 1 - rev(ci)                                 # flip limits: VE = 1 - OR\ncat(sprintf(\"adjusted OR = %.3f\\n\", aor))\ncat(sprintf(\"VE = %.1f%% (95%% CI %.1f%% to %.1f%%)\\n\",\n            100 * ve, 100 * ve_ci[1], 100 * ve_ci[2]))",
        "description": "Test-negative VE estimation with base-R glm() on the same illustrative line-level data: one row per tested,\nsyndrome-presenting patient (test_positive, vaccinated, age, week). VE = 1 - exp(beta_vaccinated) with the CI obtained by\nexponentiating the Wald limits (estimate ± 1.96 × SE on the log-odds scale) and flipping them because VE = 1 - OR (Jackson & Nelson 2013).",
        "dependencies": [],
        "source_citations": [
          "jackson-nelson-2013",
          "foppa-2013"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* Illustrative sentinel-testing data: one row per syndrome-presenting, tested patient. */\ndata tnd;\n  call streaminit(7);\n  do i = 1 to 4000;\n    week       = 40 + floor(13 * rand('uniform'));        /* epidemiologic week */\n    vaccinated = rand('bernoulli', 0.55);\n    age        = min(max(rand('normal', 50, 18), 1), 95);\n    lin = -0.4 + log(0.40)*vaccinated + 0.01*(age-50) + 0.02*(week-46);\n    test_positive = rand('bernoulli', 1/(1+exp(-lin)));   /* 1 = case, 0 = control */\n    output;\n  end;\nrun;\n\n/* Cases = test-positive, controls = test-negative; model the odds of being a case. */\nproc logistic data=tnd;\n  class week / param=ref;\n  model test_positive(event='1') = vaccinated age week;\n  oddsratio vaccinated;                                   /* adjusted OR + 95% CI */\n  ods output OddsRatios=or_out;\nrun;\n\n/* VE = 1 - OR; flip the OR limits to get the VE confidence interval. */\ndata ve;\n  set or_out;\n  where lowcase(effect) =: 'vaccinated';\n  ve     = 1 - OddsRatioEst;\n  ve_low = 1 - UpperCL;        /* limits flip because VE = 1 - OR */\n  ve_high= 1 - LowerCL;\n  format ve ve_low ve_high percent8.1;\nrun;\nproc print data=ve noobs; var OddsRatioEst ve ve_low ve_high; run;",
        "description": "Test-negative VE estimation with PROC LOGISTIC on illustrative line-level data (work.tnd: one row per tested patient with\ntest_positive, vaccinated, age, week). PROC LOGISTIC fits the adjusted model and the ODDSRATIO statement yields the\nadjusted OR and its 95% CI; a short DATA step converts the OR and its limits to VE = 1 - OR (limits flipped).",
        "dependencies": [],
        "source_citations": [
          "jackson-nelson-2013",
          "sullivan-2016"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Pop[Source population] --> Sick{Develops the<br/>defined syndrome?}\n  Sick -->|No| Out1[Not sampled]\n  Sick -->|Yes| Seek{Seeks care AND<br/>is tested?}\n  Seek -->|No| Out2[Not sampled<br/>care-seeking filter]\n  Seek -->|Yes| Test{Test for target pathogen}\n  Test -->|Positive| Case[CASE<br/>test-positive]\n  Test -->|Negative| Ctrl[CONTROL<br/>test-negative]\n  Case --> Est[VE = 1 - OR<br/>odds of vaccination, cases vs controls]\n  Ctrl --> Est",
        "caption": "Sampling logic of the test-negative design. Both cases and controls pass through the same care-seeking-and-testing filter, so differential healthcare-seeking by vaccination status acts on sampling equally in both arms and is largely conditioned out of the VE = 1 - OR comparison.",
        "alt_text": "Flowchart from a source population through syndrome onset and a shared care-seeking/testing filter to a pathogen test that splits patients into test-positive cases and test-negative controls, feeding the VE = 1 - odds-ratio estimate.",
        "source_type": "illustrative",
        "source_citations": [
          "jackson-nelson-2013"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  A[Test-negative VE estimate] --> S{Test specificity high?}\n  S -->|No| B1[VE biased toward null<br/>fix the test/case definition]\n  S -->|Yes| O{Vaccine affects the<br/>test-negative illnesses?}\n  O -->|Yes off-target protection| B2[VE biased DOWNWARD<br/>control series invalid]\n  O -->|No| C{Care-seeking equal by<br/>vaccination given disease?}\n  C -->|No| B3[Residual selection bias]\n  C -->|Yes| Valid[VE estimate identifies the<br/>medically-attended tested-disease effect]",
        "caption": "Assumption-to-bias map for the test-negative design. Each identifying assumption (test specificity, no off-target protection, equal conditional care-seeking) maps to a specific bias if it fails. Off-target protection removes vaccinated patients from the test-negative control series, which pushes the OR toward 1 and deflates VE; a vaccine that instead raised the risk of test-negative illness would inflate VE. Either way the control series is no longer valid.",
        "alt_text": "Decision tree checking test specificity, off-target vaccine protection, and equal conditional care-seeking, each branch labelled with the direction of bias in the test-negative vaccine-effectiveness estimate when the assumption fails.",
        "source_type": "illustrative",
        "source_citations": [
          "sullivan-2016"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "is_variant_of",
        "target_slug": "case-control",
        "notes": "The TND is a specialized case-control design in which controls are syndrome-matched, same-stream test-negatives rather than population or hospital controls, which is what controls differential healthcare-seeking."
      },
      {
        "relation_type": "alternative_to",
        "target_slug": "cohort-retrospective",
        "notes": "A retrospective cohort estimates a rate/risk ratio with an explicit denominator but must adjust for the healthy-vaccinee confounder directly; the TND removes it by design but yields only an OR for tested, medically attended disease."
      },
      {
        "relation_type": "see_also",
        "target_slug": "negative-control-outcomes-rwe",
        "notes": "Test-negative controls are effectively a negative-control outcome (an outcome the vaccine should not affect) used to absorb shared confounding; the negative-control framework justifies and stress-tests the control series."
      },
      {
        "relation_type": "affects",
        "target_slug": "healthy-user-bias",
        "notes": "The design exists to neutralize the healthy-vaccinee variant of healthy-user bias by conditioning on care-seeking; when conditional care-seeking still differs by vaccination, residual healthy-user bias remains."
      },
      {
        "relation_type": "used_with",
        "target_slug": "logistic-regression-for-binary-outcomes",
        "notes": "VE = 1 - OR is estimated from a logistic regression of test-positivity on vaccination adjusting for age, calendar week, and comorbidity."
      },
      {
        "relation_type": "see_also",
        "target_slug": "diagnostic-accuracy",
        "notes": "Test sensitivity and specificity directly determine the misclassification bias of the design; the diagnostic-accuracy properties of the confirmatory test govern how far VE is biased toward or away from the null."
      }
    ],
    "aliases": [
      "test-negative design",
      "TND",
      "test-negative case-control",
      "case test-negative design",
      "test-negative vaccine effectiveness study"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "ema",
      "journal"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "therapeutic-area-specific-rwe-challenges-oncology",
    "name": "Therapeutic-Area-Specific RWE Challenges — Oncology",
    "short_definition": "The design, endpoint, and data-operational adaptations that oncology forces on real-world evidence studies because the outcomes that matter (response, progression, survival) are imaging- and pathology-defined rather than claims-coded, treatment is short and line-structured, and ethics push toward single-arm trials with external controls.",
    "long_description": "Oncology is the therapeutic area where the gap between *what is clinically meaningful* and *what is captured in routine\nreal-world data* is widest. The endpoints regulators and clinicians care about — objective response (RECIST), progression,\nand overall survival — are defined by serial imaging, pathology, and physician assessment, none of which is reliably coded\nin administrative claims and only partially structured in EHRs. Treatment is short, sequential, and organized into *lines\nof therapy* rather than the open-ended chronic dosing of cardiometabolic or rheumatologic disease. And because randomizing\nadvanced-cancer patients to placebo is often unethical, oncology generates an unusually high proportion of single-arm trials\nthat lean on **external/historical controls** built from real-world data. This entry is about the concrete design and\noperational consequences of those facts — not a generic data-quality checklist.\n\n**Core conceptual distinction** In most therapeutic areas the *outcome* is the hard part to define and the *exposure* is\neasy (a drug fill, an event code); in oncology the relationship inverts and compounds. (1) *Outcome is latent.* Progression\nis the central efficacy endpoint, yet routine claims contain no progression flag, so analysts substitute a **real-world\nprogression proxy** — typically time to treatment discontinuation, switch to a new line, or death — which is a *behavioral*\nsurrogate for a *biological* event and is differentially biased by drug class and disease pace. (2) *Exposure is\nline-structured.* \"On treatment\" is not a single drug but a regimen within a line, with route-driven capture (oral TKIs via\npharmacy NDC + `days_supply`; IV/infused agents via medical claims with HCPCS J-codes and CPT 96413/96365 administration\ncodes), so episode construction must stitch multi-claim regimens and detect line advancement, not just gaps in a single\nNDC. 
(3) *Comparison is often external.* Single-arm accelerated approvals require an external control arm, shifting the\nmethodological burden from confounding control inside one cohort to *transportability and outcome-ascertainment alignment*\nacross two data sources collected under different rules. These three shifts — latent outcome, line-structured exposure,\nexternal comparison — are what distinguish oncology RWE from a chronic-disease active-comparator new-user study.\n\n**Pros, cons, and trade-offs**\n- **rwPFS proxy (treatment discontinuation / next-line / death) vs trial-grade RECIST progression:** The proxy is the only\n  progression-like endpoint available at scale in claims and unabstracted EHR, and it correlates reasonably with OS in\n  some tumors. Cost: it conflates toxicity holds, insurance churn, and patient preference with true progression, and the\n  error is *directional and differential* — it overestimates true PFS for indolent disease (patients stay on a tolerable\n  drug past radiographic progression) and underestimates it for toxicity-driven discontinuation, so two arms with\n  different toxicity profiles are not comparably measured. **Prefer abstracted/curated rwPFS** (chart- or imaging-derived,\n  e.g., Flatiron-style abstraction) when the estimand is efficacy; reserve the discontinuation proxy for utilization,\n  treatment-pattern, and HCRU questions where it is the *intended* construct.\n- **External/historical control vs concurrent active comparator:** When a single-arm trial is the only ethical option, an\n  external control is the only path to a comparative effect. Cost: it reintroduces every bias an active-comparator new-user\n  design was built to remove — calendar-time drift in standard of care, differential outcome ascertainment between the\n  trial and the RWD source, and selection of the linkable/abstractable subset. 
**Prefer a concurrent active comparator**\n  (see active-comparator-new-user) whenever two real-world regimens are genuinely used for the same indication; fall back\n  to external controls only for true single-arm settings, and then borrow rigor from rare-disease-external-controls-rwe and\n  generalizability-transportability-external-validity-rwe.\n- **Claims-only vs EHR-only vs linked oncology data:** Claims give complete drug/procedure/cost capture but no stage,\n  histology, biomarker, or response; EHR gives pathology, imaging, and notes but is visit-driven and fragmented across\n  community and academic sites; linkage gives both at the cost of selection into the linkable subset. **Prefer linked or\n  curated EHR** for any efficacy/progression estimand; claims alone are defensible only for utilization, adherence, and\n  cost endpoints.\n\n**When to use** This lens applies whenever the study population is defined by a cancer diagnosis and the question touches\nefficacy (response, rwPFS, OS), treatment sequencing (lines of therapy, switching), oncology HCRU/cost (infusion visits,\nsupportive care, end-of-life utilization), or a single-arm trial that needs an external comparator. 
It is the prerequisite\nframing for biomarker-defined cohorts, real-world progression endpoints, and oncology external-control studies.\n\n**When NOT to use — and when it is actively misleading or dangerous**\n- **Do not use a discontinuation-based rwPFS proxy as an efficacy endpoint in a comparative study of differently-tolerated\n  regimens.** The differential measurement error (indolent overestimation vs toxicity-hold underestimation) manufactures or\n  masks a treatment effect that is an artifact of measurement, not biology — this is the single most dangerous oncology RWE\n  error and will not survive review.\n- **Do not set time zero at the cancer diagnosis date when the exposure is first systemic therapy.** The interval between\n  diagnosis and treatment initiation is *immortal* (the patient must survive and be worked up to be treated), so anchoring\n  follow-up at diagnosis manufactures an apparent survival advantage for whoever gets treated — align time zero to first\n  qualifying systemic claim (see immortal-time-bias-handling, time-zero-index-date-alignment-rwe).\n- **Do not compare arms on a progression proxy without confronting competing risk of death.** In elderly or advanced-cancer\n  claims, death is frequent and *differential by arm*; treating death as independent censoring for a progression endpoint\n  biases the cumulative incidence — model cause-specific or subdistribution hazards (see\n  competing-risks-cause-specific-fine-gray-rwe).\n- **Do not pool community and academic EHR capture without modeling informative presence.** Sicker patients visit more,\n  generating more documented progression and richer covariates, so apparent outcome rates track visit intensity, not\n  biology (informative presence / informative observation bias).\n- **Do not build an external control from a different calendar era.** Standard of care in oncology turns over in 2–3 years;\n  a historical control predating the current backbone regimen confounds the comparison 
with secular change.\n\n**Data-source operational depth**\n- **Claims (FFS vs MA vs commercial):** Oral agents appear as pharmacy NDC + `fill_date` + `days_supply`; IV/infused agents\n  appear only in *medical* claims as HCPCS J-codes (drug) plus CPT 96413/96365/96367 (administration), so a claims pipeline\n  that looks only at pharmacy will silently drop the entire IV oncology armamentarium. Medicare *Advantage* and capitated\n  arrangements do not adjudicate FFS line-item J-codes/CPTs the way Parts A/B do, so MA-only person-time has incomplete\n  infusion and administration capture — restrict to FFS Parts A/B/D or exclude MA-only spans before constructing IV\n  regimens. For orals, 340B dispensing, free samples, and 90-day mail order distort `days_supply` and corrupt gap-based\n  discontinuation. Claims carry *no* stage, histology, biomarker, ECOG, or response.\n- **EHR (community vs academic):** Adds the oncology-specific data claims lack — pathology reports (histology, grade),\n  radiology (response/progression), molecular reports (EGFR, ALK, PD-L1, MSI), and ECOG — but most of it is *unstructured*\n  and requires NLP or manual abstraction. Capture is visit-driven and fragmented: a patient who progresses and transfers\n  to hospice or another health system is differentially lost, and informative-presence bias means documented-event rates\n  confound severity with engagement. Site-of-care matters: community and academic centers differ systematically in\n  documentation, trial enrollment, and patient mix.\n- **Registry (e.g., SEER, NCDB, disease-specific):** Gold standard for stage, histology, and incidence/survival, but thin\n  on line-by-line treatment and longitudinal utilization. 
Link to claims (SEER-Medicare) for treatment and cost, accepting\n  that the linkable population is older and FFS-skewed.\n- **Linked claims–EHR–vital records:** The substrate for credible efficacy RWE — EHR/registry severity + claims\n  completeness + a reliable death index for OS and for competing-risk censoring of progression — but linkage selects the\n  linkable subset and introduces order/fill/service/abstraction date discrepancies that must be reconciled before time-zero\n  assignment.\n\n**Worked claims example — rwPFS proxy and line-of-therapy episode for metastatic NSCLC.** Question: real-world\ntime-to-discontinuation on first-line therapy in a commercial + Medicare FFS database. (1) Cohort: ≥2 claims with an ICD-10\nlung cancer code (C34.x) plus ≥1 secondary/metastatic code (C78.x/C79.x), age ≥18, and 365 days of continuous medical +\npharmacy enrollment with FFS Parts A/B (exclude MA-only spans so IV J-code/CPT administration is observable). (2) Index\n(time zero): the *first systemic anticancer claim* after the metastatic diagnosis — a pharmacy NDC for an oral TKI or a\nmedical J-code with a paired CPT 96413/96365 administration on the same date for an IV agent — **not** the diagnosis date,\nwhich would inject immortal time. Assign the first-line regimen from the drug(s) seen in a 28-day window around index.\n(3) Discontinuation = a gap of >90 days with no fill/administration of any first-line agent (90 days, not 60, because\n21-day infusion cycles plus scheduling slack routinely exceed 60). (4) Switch/next line = appearance of a new\nsystemic agent not in the first-line regimen, which ends the first-line episode even without a gap. (5) rwPFS proxy = days\nfrom index to the earliest of discontinuation, next-line start, or death (death from a linked vital-records index, since\nclaims-inferred death is incomplete). (6) Censor at disenrollment and end of data; treat death as a *competing risk* for\nthe discontinuation event, not as independent censoring. 
(7) Sensitivity: vary the gap (60/90/120 days), add a toxicity-hold\nrule (allow re-initiation within the gap to distinguish a hold from a true stop), and compare the proxy against an\nabstracted-progression subset where available — the divergence quantifies the differential measurement error that makes\nthis proxy unsafe as a comparative efficacy endpoint.",
    "primary_category": "Study_Design",
    "tags": [
      "oncology",
      "real-world-progression",
      "line-of-therapy",
      "external-control",
      "rwpfs-proxy",
      "iv-vs-oral",
      "ehr-fragmentation",
      "competing-risks"
    ],
    "applies_to_study_types": [
      "cohort_retrospective",
      "claims_analysis",
      "ehr_study",
      "single_arm_external_control",
      "linked_data"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1093/jnci/djx187",
        "url": "https://doi.org/10.1093/jnci/djx187",
        "citation_text": "Khozin S, Blumenthal GM, Pazdur R. Real-world Data for Clinical Evidence Generation in Oncology. JNCI: Journal of the National Cancer Institute. 2017;109(11):djx187.",
        "year": 2017,
        "authors_short": "Khozin et al.",
        "notes": "FDA-oncology framing of the opportunities and limits of oncology RWD, including the endpoint and ascertainment gaps that distinguish oncology from other therapeutic areas."
      },
      {
        "role": "explain",
        "doi": "10.1093/jamia/ocac050",
        "url": "https://doi.org/10.1093/jamia/ocac050",
        "citation_text": "Harton J, Mitra N, Hubbard RA. Informative presence bias in analyses of electronic health records-derived data: a cautionary note. Journal of the American Medical Informatics Association. 2022;29(7):1191-1199.",
        "year": 2022,
        "authors_short": "Harton et al.",
        "notes": "Formalizes informative presence/observation bias — why visit-driven EHR capture confounds documented oncology outcomes with severity and engagement."
      },
      {
        "role": "demonstrate",
        "doi": "10.1200/CCI.18.00155",
        "url": "https://doi.org/10.1200/CCI.18.00155",
        "citation_text": "Stewart M, Norden AD, Dreyer N, et al. An Exploratory Analysis of Real-World End Points for Assessing Outcomes Among Immunotherapy-Treated Patients With Advanced Non-Small-Cell Lung Cancer. JCO Clinical Cancer Informatics. 2019;3:1-15.",
        "year": 2019,
        "authors_short": "Stewart et al.",
        "notes": "Empirically derives and compares real-world progression, time-to-discontinuation, and time-to-next-treatment endpoints against overall survival — the canonical demonstration of rwPFS proxy behavior."
      },
      {
        "role": "use",
        "doi": "10.1002/cpt.2453",
        "url": "https://doi.org/10.1002/cpt.2453",
        "citation_text": "Rivera DR, Henk HJ, Garrett-Mayer E, et al. The Friends of Cancer Research Real-World Data Collaboration Pilot 2.0: Methodological Recommendations from Oncology Case Studies. Clinical Pharmacology and Therapeutics. 2022;111(1):283-292.",
        "year": 2022,
        "authors_short": "Rivera et al.",
        "notes": "Multi-database collaboration deriving concrete methodological recommendations for oncology real-world endpoint construction and reproducibility."
      },
      {
        "role": "explain",
        "doi": "10.2147/CLEP.S373291",
        "url": "https://doi.org/10.2147/CLEP.S373291",
        "citation_text": "Merola D, Schneeweiss S, Schrag D, et al. Oncology Drug Effectiveness from Electronic Health Record Data Calibrated Against RCT Evidence: The PARSIFAL Trial Emulation. Clinical Epidemiology. 2022;14:1135-1148.",
        "year": 2022,
        "authors_short": "Merola et al.",
        "notes": "Demonstrates calibration of an oncology EHR-based effectiveness estimate against trial evidence, exposing where real-world endpoint construction does and does not recover the trial result."
      }
    ],
    "plain_language_summary": "In cancer research, the outcomes that matter most (did the tumor shrink, did the disease spread) exist only in radiology reports and pathology slides that routine insurance records never capture. Cancer treatment also runs in ordered rounds called lines of therapy, and injectable drugs are billed through a completely different part of the insurance system than pills, so a pipeline that looks only at pharmacy data silently misses the entire IV armamentarium. Because randomizing advanced-cancer patients to a placebo is often unethical, analysts must often build a comparison group from historical records instead of enrolling one alongside treated patients. These three features (outcomes invisible in claims, line-structured multi-route treatment, and external comparison groups) make oncology one of the hardest therapeutic areas in real-world evidence.",
    "key_terms": [
      {
        "term": "rwPFS proxy",
        "definition": "A stand-in measure for cancer progression built from insurance records — typically the day a patient stopped their drug, switched to a new regimen, or died — because actual imaging-based progression is not recorded in claims."
      },
      {
        "term": "line of therapy",
        "definition": "The ordered sequence of cancer treatment regimens a patient receives; first-line is the initial treatment, second-line starts when the first stops working or causes too much harm, and so on."
      },
      {
        "term": "J-code",
        "definition": "A billing code used in medical (not pharmacy) insurance claims to identify an injectable or infused drug administered in a clinic or hospital — the only way an IV cancer drug appears in administrative data."
      },
      {
        "term": "FFS (fee-for-service)",
        "definition": "The part of Medicare (Parts A, B, and D) where each individual service is billed separately, making every drug administration and procedure visible as its own claim line; Medicare Advantage plans often do not produce these detailed line-item records."
      },
      {
        "term": "competing risk",
        "definition": "An event — here, death — that prevents the outcome of interest from ever occurring; in oncology, a patient who dies cannot later progress, so death must be modeled as its own event type rather than treated as a simple dropout."
      },
      {
        "term": "external control arm",
        "definition": "A comparison group assembled from real-world records or prior studies rather than from patients randomized alongside the treated group, used when randomizing to placebo in advanced cancer is considered unethical."
      }
    ],
    "worked_example": {
      "scenario": "A researcher wants to measure how long patients with metastatic lung cancer stay on their first-line treatment using a commercial claims database. The table below shows four patients and the challenges that arise for each — the same five challenges that make oncology RWE uniquely hard. For each challenge, the table describes what the data actually looks like and what the analyst must do about it.",
      "dataset": {
        "caption": "Five oncology-specific RWE challenges, with a concrete data situation and the method used to address each",
        "columns": [
          "challenge",
          "what the data looks like",
          "why it causes a problem",
          "how analysts handle it"
        ],
        "rows": [
          [
            "Measuring progression (rwPFS)",
            "Patient 1001 has pharmacy fills for an oral pill and medical claims for clinic visits, but no imaging or pathology report showing the tumor grew",
            "Progression is a biological event visible only on a scan; claims carry no progression flag, so the analyst cannot directly observe when the cancer worsened",
            "Build a behavioral proxy: progression = the day the patient stopped the drug for more than 90 days, switched to a new drug, or died — whichever came first (Stewart et al. 2019)"
          ],
          [
            "Lines of therapy — what counts as first-line?",
            "Patient 1002 has claims for Drug A starting 2023-01-10, Drug B added 2023-01-24 (14 days later), and Drug C starting 2023-09-15 after a 120-day gap",
            "A cancer regimen is often multiple drugs started within days of each other; a later drug appearing after a long gap signals a new line, not a combination",
            "Define a regimen window (e.g., 28 days after the first drug) to bundle co-started drugs into one line; a new drug appearing after a 90-day gap counts as line 2"
          ],
          [
            "IV drugs invisible in pharmacy data",
            "Patient 1003 receives pembrolizumab by infusion every 3 weeks; the pharmacy table has zero rows for this patient, but the medical table has repeated J9271 claims paired with CPT 96413",
            "IV oncology drugs are billed as medical claims using HCPCS J-codes — a pipeline that searches only pharmacy NDC records will find no treatment for this patient",
            "Combine pharmacy NDC fills (oral drugs) with medical J-code plus CPT 96413/96365 administration pairs (IV drugs); exclude Medicare Advantage spans where these codes are often missing"
          ],
          [
            "Immortal time around diagnosis",
            "Patient 1004 is diagnosed with metastatic disease on 2023-03-01 and starts chemotherapy on 2023-04-15 after 6 weeks of staging scans and biopsy",
            "If follow-up starts at the diagnosis date, those 45 days before treatment look like survival time but the patient could not possibly have experienced the drug's effect yet — this artificially inflates survival estimates",
            "Set time zero at the date of the first systemic treatment claim, not the diagnosis date, so the immortal pre-treatment interval is excluded from follow-up"
          ],
          [
            "Death as a competing risk for progression",
            "Patient 1005 dies on day 120 without any gap or switch in claims; the proxy event never triggers",
            "Death prevents progression from ever being recorded; if death is treated as a simple dropout (censoring), it understates the true rate of the combined endpoint in sicker patients",
            "Classify death as its own event type in the analysis rather than censoring it; use cause-specific or Fine-Gray subdistribution models so death and progression are estimated together"
          ]
        ]
      },
      "steps": [
        "Start by building the cancer cohort: require at least two ICD-10 C34.x (lung) claims plus at least one metastatic or secondary code (C78.x or C79.x), and restrict to patients with fee-for-service Parts A, B, and D enrollment so IV drug administrations are visible in medical claims.",
        "Assign time zero at the first systemic treatment claim on or after the metastatic diagnosis date — either a pharmacy NDC row (oral drug) or a medical J-code paired with CPT 96413/96365 on the same date (IV drug). Never use the diagnosis date as time zero, because the staging and biopsy period before treatment is an interval where the patient is alive by necessity, not because of the drug.",
        "Define the first-line regimen as all drugs seen within 28 days of time zero; this bundles combination partners (e.g., a checkpoint inhibitor added 10 days after a chemotherapy backbone) into a single regimen rather than miscounting them as a new line.",
        "Build the rwPFS proxy endpoint: follow the patient forward and record the earliest of (a) a gap greater than 90 days with no fill or administration of any first-line drug, (b) the first appearance of a new drug not in the first-line regimen, or (c) death from a linked vital-records source.",
        "Treat death as a competing risk, not a dropout: a patient who dies has not stopped treatment due to progression, and censoring them as though they simply left the study would undercount the true endpoint rate in the sicker arm. Apply cause-specific or Fine-Gray methods (see competing-risks-cause-specific-fine-gray-rwe) to separate the two events."
      ],
      "result": "The five challenges are addressed by: (1) using the earliest of gap/switch/death as the rwPFS proxy because imaging data is absent; (2) using a 28-day regimen window to bundle co-started drugs and a 90-day gap threshold to detect line advancement; (3) combining pharmacy NDC rows with medical J-code plus CPT pairs to capture both oral and IV drugs; (4) anchoring time zero at first systemic treatment to exclude the immortal staging interval; and (5) classifying death as a competing event rather than censoring it. Each fix addresses a different reason why oncology claims data alone cannot simply be read at face value."
    },
    "prerequisites": [
      "immortal-time-bias-handling",
      "competing-risks-cause-specific-fine-gray-rwe",
      "treatment-patterns-lines-of-therapy"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "rwPFS proxy via treatment discontinuation or next-line therapy",
        "description": "Time from first systemic claim to the earliest of a >90-day gap in the on-treatment regimen, a switch to a new line, or death. A behavioral surrogate for biological progression; correlates with OS in some tumors but carries differential measurement error by drug tolerability and disease pace.",
        "edge_cases": [
          "Toxicity-driven dose holds mimic discontinuation; add a re-initiation rule (resume within the gap window) to separate a hold from a true stop.",
          "Indolent disease where patients continue a tolerable drug past radiographic progression overestimates true PFS; toxicity holds underestimate it — the bias direction is drug-class-specific and therefore differential between arms."
        ],
        "data_source_notes": "claims: gap-based on NDC fills and J-code/CPT administration, with FFS-only person-time so IV regimens are observable; EHR: prefer abstracted/curated progression; linked: use a vital-records death index and treat death as a competing risk, not independent censoring."
      },
      {
        "name": "Line-of-therapy / regimen episode construction",
        "description": "Stitch multi-drug regimens into ordered lines, advancing the line when a new agent appears or after a defined treatment-free interval, to support treatment-pattern, sequencing, and HCRU questions.",
        "edge_cases": [
          "Neoadjuvant/adjuvant therapy and maintenance phases blur the boundary between lines and can be miscounted as new lines.",
          "Combination backbones (e.g., platinum doublet + immunotherapy) require regimen-level, not drug-level, line definitions."
        ],
        "data_source_notes": "claims: combine pharmacy NDC and medical J-code/CPT within a window to define a regimen; pair with treatment-patterns-lines-of-therapy for the line-advancement logic."
      },
      {
        "name": "External / historical control for single-arm oncology trials",
        "description": "Construct a comparator arm from RWD for a single-arm accelerated-approval trial, matching on stage, biomarker, line, and calendar era, with outcome ascertainment aligned to the trial endpoint.",
        "edge_cases": [
          "Calendar-era mismatch confounds with secular change in standard of care; restrict to overlapping, contemporary time.",
          "Differential outcome ascertainment between trial assessment and routine RWD biases the contrast even after covariate balance."
        ],
        "data_source_notes": "linked/registry: required for stage and biomarker; see rare-disease-external-controls-rwe and single-arm-external-control for design rigor and generalizability-transportability-external-validity-rwe for the transport assumptions."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "active-comparator-new-user",
        "pros_of_this": "Surfaces the oncology-specific obstacles (latent progression endpoint, line-structured exposure, route-driven IV vs oral capture, frequent single-arm trials) that a generic ACNU template does not address.",
        "cons_of_this": "When two real-world regimens are genuinely used for the same indication, a concurrent active-comparator new-user design is more defensible than any external-control workaround and should be preferred.",
        "when_to_prefer": "Oncology efficacy/progression or single-arm-trial settings where the endpoint is imaging/pathology-defined or no concurrent comparator exists."
      },
      {
        "compared_to": "real-world-progression-rwpfs-rwe",
        "pros_of_this": "Places rwPFS in the full therapeutic-area context (route capture, competing risks, external controls, EHR fragmentation) rather than treating the progression endpoint in isolation.",
        "cons_of_this": "For the precise construction and validation of the progression endpoint itself, the dedicated rwPFS entry is deeper.",
        "when_to_prefer": "Scoping or protocol-framing an oncology study before drilling into a specific endpoint definition."
      },
      {
        "compared_to": "hcru-healthcare-resource-utilization",
        "pros_of_this": "Captures oncology-specific utilization (infusion visits, supportive care, imaging surveillance, end-of-life intensity) that generic HCRU metrics miss.",
        "cons_of_this": "A generic HCRU framework is simpler and sufficient when the question is pure cost/utilization without line-structured or endpoint-linked attribution.",
        "when_to_prefer": "Comparative oncology cost/HCRU studies where infusion administration and line of therapy drive spend."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Identify the cancer cohort with ICD-10 C-codes plus a metastatic/secondary code; capture oral agents via pharmacy NDC + days_supply and IV/infused agents via medical HCPCS J-codes paired with CPT 96413/96365/96367 administration codes. Restrict to FFS Parts A/B/D and exclude MA-only person-time so infusion administration is observable. Time zero = first systemic claim (never the diagnosis date). Build rwPFS proxy from gaps/switches and treat death as a competing risk.",
      "ehr": "Source stage, histology, biomarker, ECOG, and response from pathology/radiology/molecular reports via NLP or abstraction; most is unstructured. Capture is visit-driven and fragmented across community/academic sites — model informative presence and treat loss to follow-up (transfer to hospice/other system) as potentially informative.",
      "registry": "SEER/NCDB and disease-specific registries are gold for stage/histology/incidence/survival but thin on line-level treatment; link to claims (e.g., SEER-Medicare) for treatment and cost, accepting an older FFS-skewed linkable population.",
      "linked": "Linked claims-EHR-vital records is the substrate for credible efficacy RWE (severity + completeness + reliable death for OS and competing-risk censoring); reconcile order/fill/service/abstraction date discrepancies before assigning time zero, and account for selection into the linkable subset."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\nimport numpy as np\n\nGAP_DAYS = 90          # > one infusion cycle + scheduling slack; 60d falsely flags routine IV gaps\nREGIMEN_WINDOW = 28    # agents seen within this window of time zero define the first-line regimen\n\ndef build_first_line_episode(dx: pd.DataFrame, sys: pd.DataFrame, death: pd.DataFrame) -> pd.DataFrame:\n    # Metastatic anchor: first secondary/metastatic code (C77-C79) per person.\n    met = (dx[dx[\"icd10\"].str.startswith((\"C77\", \"C78\", \"C79\"))]\n             .groupby(\"person_id\")[\"dx_date\"].min()\n             .reset_index(name=\"met_date\"))\n\n    # Time zero = first systemic claim on/after the metastatic date (NOT the diagnosis date -> avoids immortal time).\n    sys = sys.merge(met, on=\"person_id\", how=\"inner\")\n    post = sys[sys[\"claim_date\"] >= sys[\"met_date\"]].sort_values([\"person_id\", \"claim_date\"])\n    t0 = post.groupby(\"person_id\")[\"claim_date\"].min().reset_index(name=\"index_date\")\n    ep = post.merge(t0, on=\"person_id\")\n\n    # First-line regimen = agents within REGIMEN_WINDOW days of time zero.\n    in_window = ep[ep[\"claim_date\"] <= ep[\"index_date\"] + pd.Timedelta(days=REGIMEN_WINDOW)]\n    regimen = in_window.groupby(\"person_id\")[\"agent\"].apply(lambda s: frozenset(s)).rename(\"fl_regimen\")\n    ep = ep.merge(regimen, on=\"person_id\")\n\n    rows = []\n    for pid, g in ep.groupby(\"person_id\"):\n        g = g.sort_values(\"claim_date\")\n        t0_date = g[\"index_date\"].iloc[0]\n        fl = g[\"fl_regimen\"].iloc[0]\n        fl_claims = g[g[\"agent\"].isin(fl)].copy()\n\n        # Per-claim coverage end: orals use days_supply; IV agents covered through the cycle (use GAP as cycle proxy).\n        fl_claims[\"cov_end\"] = np.where(\n            fl_claims[\"route\"].eq(\"oral\"),\n            fl_claims[\"claim_date\"] + pd.to_timedelta(fl_claims[\"days_supply\"].fillna(0), unit=\"D\"),\n            fl_claims[\"claim_date\"] + 
pd.Timedelta(days=GAP_DAYS),\n        )\n        last_cov = fl_claims[\"cov_end\"].max()\n        discontinuation = last_cov  # end of observed first-line coverage\n\n        # Switch / next line = first systemic agent NOT in the first-line regimen after time zero.\n        nextline = g[~g[\"agent\"].isin(fl)]\n        switch_date = nextline[\"claim_date\"].min() if not nextline.empty else pd.NaT\n\n        d_row = death.loc[death[\"person_id\"] == pid, \"death_date\"]\n        death_date = d_row.iloc[0] if (len(d_row) and pd.notna(d_row.iloc[0])) else pd.NaT\n\n        # rwPFS proxy = earliest of discontinuation, switch, or death. Death is a COMPETING event, not censoring.\n        candidates = {\"discontinuation\": discontinuation, \"switch\": switch_date, \"death\": death_date}\n        candidates = {k: v for k, v in candidates.items() if pd.notna(v)}\n        event_type = min(candidates, key=candidates.get)\n        event_date = candidates[event_type]\n\n        rows.append({\n            \"person_id\": pid,\n            \"index_date\": t0_date,\n            \"fl_regimen\": tuple(sorted(fl)),\n            \"event_type\": event_type,                 # 'discontinuation' | 'switch' | 'death'\n            \"event_date\": event_date,\n            \"rwpfs_proxy_days\": (event_date - t0_date).days,\n        })\n    return pd.DataFrame(rows)",
        "description": "Oncology line-of-therapy and rwPFS-proxy episode construction from claims-style inputs. Required inputs (already cleaned,\nde-duplicated, and restricted to FFS-observable person-time so IV administration is captured):\n  dx     : diagnosis claims  -> person_id, dx_date (datetime), icd10 (e.g. 'C34.1', 'C78.00', 'C79.51')\n  sys    : systemic therapy  -> person_id, claim_date (datetime), agent (drug name), route in {'oral','iv'},\n                                days_supply (orals; NaN for IV given per administration)\n  death  : vital records     -> person_id, death_date (datetime or NaT)   # linked index; claims death is incomplete\nReturns one row per first-line episode: time zero (first systemic after metastatic dx), the discontinuation/switch/death\ndate, the rwPFS-proxy days, and an event-type flag for competing-risk modeling downstream.",
        "dependencies": [
          "pandas",
          "numpy"
        ],
        "source_citations": [
          "stewart-2019"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\nGAP_DAYS       <- 90L   # > infusion cycle + slack; 60d falsely flags routine IV gaps\nREGIMEN_WINDOW <- 28L   # agents within this window of time zero define the first-line regimen\n\nbuild_first_line_episode <- function(dx, sys, death) {\n  setDT(dx); setDT(sys); setDT(death)\n\n  met <- dx[grepl(\"^C7[789]\", icd10), .(met_date = min(dx_date)), by = person_id]\n\n  sys <- merge(sys, met, by = \"person_id\")\n  post <- sys[claim_date >= met_date][order(person_id, claim_date)]\n  t0 <- post[, .(index_date = min(claim_date)), by = person_id]\n  ep <- merge(post, t0, by = \"person_id\")\n\n  regimen <- ep[claim_date <= index_date + REGIMEN_WINDOW,\n                .(fl_regimen = list(unique(agent))), by = person_id]\n  ep <- merge(ep, regimen, by = \"person_id\")\n\n  ep[, .build := {\n    fl <- fl_regimen[[1L]]\n    flc <- .SD[agent %chin% fl]\n    cov_end <- fifelse(flc$route == \"oral\",\n                       flc$claim_date + fifelse(is.na(flc$days_supply), 0L, flc$days_supply),\n                       flc$claim_date + GAP_DAYS)\n    discontinuation <- max(cov_end)\n    nl <- .SD[!agent %chin% fl, claim_date]\n    switch_date <- if (length(nl)) min(nl) else as.Date(NA)\n    dd <- death[person_id == .BY$person_id, death_date]\n    death_date <- if (length(dd) && !is.na(dd[1L])) dd[1L] else as.Date(NA)\n\n    cand <- c(discontinuation = discontinuation, switch = switch_date, death = death_date)\n    cand <- cand[!is.na(cand)]\n    et <- names(cand)[which.min(cand)]\n    ed <- as.Date(min(cand), origin = \"1970-01-01\")\n    .(index_date = index_date[1L], event_type = et, event_date = ed,\n      rwpfs_proxy_days = as.integer(ed - index_date[1L]))\n  }, by = person_id]\n\n  unique(ep[, .(person_id, index_date, event_type = .build.event_type,\n                event_date = .build.event_date, rwpfs_proxy_days = .build.rwpfs_proxy_days)])\n}",
        "description": "Oncology first-line / rwPFS-proxy episode construction with data.table. Inputs mirror the Python version:\n  dx    : person_id, dx_date (Date), icd10\n  sys   : person_id, claim_date (Date), agent, route in {'oral','iv'}, days_supply (NA for IV)\n  death : person_id, death_date (Date or NA)   # linked vital-records index\nTime zero is the first systemic claim on/after the metastatic anchor (not the diagnosis date), discontinuation uses a\n90-day gap, and death is returned as a competing event type for downstream Fine-Gray / cause-specific modeling.",
        "dependencies": [
          "data.table"
        ],
        "source_citations": [
          "stewart-2019"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Q[Oncology research question] --> EP{Endpoint type?}\n  EP -->|Efficacy: response / progression / OS| LAT[Latent in routine data]\n  EP -->|Utilization / cost / adherence| UTIL[Claims-codable]\n  LAT --> SRC{Data source has stage / response?}\n  SRC -->|Claims only| PROXY[Only a behavioral rwPFS proxy<br/>discontinuation / next-line / death]\n  SRC -->|EHR / registry / linked| ABST[Abstracted or curated progression<br/>stage, biomarker, RECIST]\n  UTIL --> CL[Claims: NDC for orals,<br/>J-code + CPT 96413 for IV]\n  CL --> FFS[Restrict to FFS A/B/D<br/>exclude MA-only person-time]\n  PROXY --> CR[Model death as competing risk<br/>not independent censoring]\n  ABST --> CR\n  CR --> OUT[Endpoint fit for the estimand]\n  FFS --> OUT",
        "caption": "Oncology RWE data-fitness logic — efficacy endpoints are latent and demand abstracted/curated or linked data, while utilization endpoints are claims-codable but require route-aware capture and FFS restriction so IV administration is observable; progression endpoints must confront the competing risk of death.",
        "alt_text": "Decision flowchart branching on endpoint type (efficacy vs utilization), data source capability for stage and response, route-aware claims capture, FFS restriction, and competing-risk handling of death.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "timeline\n  title Time-zero placement and the immortal-time trap in oncology\n  Metastatic diagnosis (ICD-10 C78/C79) : Day 0 anchor only\n  Work-up, staging, biopsy, molecular testing : Immortal interval if used as follow-up start\n  First systemic therapy (NDC or J-code+CPT) : TIME ZERO -- follow-up starts here\n  On first-line regimen : days_supply / infusion cycles\n  Discontinuation (>90d gap) OR switch OR death : rwPFS-proxy event (death = competing risk)",
        "caption": "Setting time zero at first systemic therapy, not at diagnosis, removes the immortal time between diagnosis and treatment initiation; the rwPFS-proxy event is the earliest of a 90-day discontinuation gap, a switch to a new line, or death, with death handled as a competing risk.",
        "alt_text": "Timeline from metastatic diagnosis through work-up to first systemic therapy as time zero, the on-treatment period, and the rwPFS-proxy event, highlighting the immortal interval between diagnosis and treatment.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "see_also",
        "target_slug": "real-world-progression-rwpfs-rwe",
        "notes": "The dedicated entry on constructing and validating the real-world progression endpoint that this concept treats as a proxy."
      },
      {
        "relation_type": "see_also",
        "target_slug": "single-arm-external-control",
        "notes": "Single-arm oncology trials with external controls are the design this therapeutic area produces most distinctively."
      },
      {
        "relation_type": "see_also",
        "target_slug": "rare-disease-external-controls-rwe",
        "notes": "Shares the external/historical-control machinery and small-N problems common to rare cancers."
      },
      {
        "relation_type": "used_with",
        "target_slug": "treatment-patterns-lines-of-therapy",
        "notes": "Line-of-therapy logic is required to define oncology exposure episodes and line advancement."
      },
      {
        "relation_type": "used_with",
        "target_slug": "competing-risks-cause-specific-fine-gray-rwe",
        "notes": "Death is a frequent, differential competing risk for any oncology progression or discontinuation endpoint."
      },
      {
        "relation_type": "see_also",
        "target_slug": "immortal-time-bias-handling",
        "notes": "Anchoring time zero at first systemic therapy rather than at diagnosis is the central immortal-time correction in oncology cohorts."
      },
      {
        "relation_type": "see_also",
        "target_slug": "route-of-administration-differences-in-rwe",
        "notes": "IV/infused vs oral capture (medical J-code+CPT vs pharmacy NDC) drives oncology exposure ascertainment and HCRU."
      },
      {
        "relation_type": "see_also",
        "target_slug": "medicare-ffs-ma-commercial-claims-differences-rwe",
        "notes": "MA-only person-time lacks the FFS J-code/CPT administration detail needed to observe IV oncology regimens."
      },
      {
        "relation_type": "see_also",
        "target_slug": "hcru-healthcare-resource-utilization",
        "notes": "Oncology HCRU is dominated by infusion administration, imaging surveillance, supportive care, and end-of-life intensity."
      },
      {
        "relation_type": "see_also",
        "target_slug": "biomarker-defined-cohort-rwe",
        "notes": "Biomarker-defined cohorts (EGFR, ALK, PD-L1, MSI) are central to modern oncology RWE and live in EHR/registry data."
      }
    ],
    "aliases": [
      "Oncology RWE challenges",
      "Therapeutic area-specific RWE — oncology",
      "Oncology-specific real-world data considerations"
    ],
    "complexity": "advanced",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "time-updated-exposures-cumulative-dose-rwe",
    "name": "Time-Updated Exposures and Cumulative Dose",
    "short_definition": "Operational construction of exposure variables that change over follow-up — current use, recent use, cumulative duration, cumulative dose, dose intensity, and weighted cumulative exposure — built as long-format person-time so that time-dependent models estimate effects without the immortal time and exposure misclassification that static ever/never definitions create.",
    "long_description": "Real treatments are rarely a single fixed dose held over follow-up. Patients fill late, stockpile, titrate, switch route,\npause for surgery, combine therapies, and discontinue for toxicity or cost. **Time-updated exposure construction** turns a\nstream of pharmacy fills, infusion administrations, or procedure claims into a person-time representation in which the\nexposure value assigned to each interval reflects what was actually being taken *at that point in follow-up*. This data\nstep is not preprocessing to be rushed — it is the analysis. A correct estimand and a sophisticated model cannot rescue an\nexposure variable that mislabels person-time.\n\nThe common exposure scales, all defined as of a time `t` strictly before any outcome it could plausibly cause:\n*current use* (an active supply/administration episode covers `t`); *recent use* (exposure within a risk window, e.g. the\nprior 30 days, to capture carry-over pharmacology); *cumulative duration* (total exposed days before `t`); *cumulative dose*\n(sum of daily-dose × exposed days, or dispensed quantity, before `t`); *dose intensity* (cumulative dose per unit of exposed\nor protocol time); and *weighted cumulative exposure (WCE)*, in which past doses contribute to current hazard through an\nestimated recency/latency weight function rather than a flat sum.\n\n**Core conceptual distinction.** The decisive choice is *what biologic quantity drives risk*, and it dictates both the\nexposure variable and the estimand. A static `ever_exposed` flag answers \"did this person ever take the drug\" and, when\nused as a baseline covariate in a survival model, mislabels the unexposed lead-in as exposed (or forces follow-up to begin\nbefore the exposure decision), manufacturing **immortal time**. 
A time-varying *current-use* indicator answers \"is risk\nelevated while on drug\" (acute pharmacology); *cumulative dose/duration* answers \"does risk accrue with total burden\"\n(cumulative toxicity, e.g. anthracycline cardiomyopathy); *WCE* answers \"how does the timing of past doses shape current\nrisk\" when a flat cumulative sum is biologically wrong (recent doses matter more, or a latency lag applies). These are not\ninterchangeable: the estimand — a hazard ratio per active-use interval, per 1000 cumulative mg, or a WCE weight curve —\nmust be pre-specified, and each maps to a different model (time-dependent Cox / pooled logistic on the long format, or, when\nthe time-varying exposure is affected by prior outcomes/confounders, a marginal structural model fit by IPTW).\n\n**Pros, cons, and trade-offs.**\n- **vs static ever/never or baseline-only exposure:** Time-updated exposure removes immortal time and the gross\n  misclassification of pre-initiation and post-discontinuation person-time, and it lets the contrast match the pharmacology.\n  Cost: it demands clean episode construction, lagging decisions, and far more programming; it is also vulnerable to\n  *treatment-by-indication feedback* (dose is changed in response to evolving disease), which a naive time-dependent Cox\n  cannot handle. **Prefer** for any safety/effectiveness question where exposure genuinely varies over follow-up.\n- **vs PDC / MPR adherence summaries (pdc-proportion-of-days-covered):** PDC collapses a fill history into one scalar over a\n  fixed denominator — useful for an adherence *predictor or descriptor*, but it discards *when* exposure occurred and cannot\n  represent time-varying risk. Time-updated exposure preserves the timeline. 
Cost: more complex, less standardized.\n  **Prefer PDC** when adherence itself is the quantity of interest and timing is not; **prefer time-updated** when the\n  hazard model needs interval-level exposure.\n- **vs WCE specifically (within this family):** A flat cumulative sum assumes every past mg counts equally forever; WCE\n  relaxes that with a spline-weighted history and often fits short-latency or wash-in pharmacology far better. Cost: WCE\n  needs dense exposure histories, is prone to overfitting sparse high-dose tails, and is harder to communicate to\n  regulators than a pre-specified current-use or cumulative-dose contrast. **Prefer WCE** only when the flat-sum assumption\n  is biologically indefensible and the data support estimating a weight curve.\n- **vs marginal structural models (marginal-structural-models-g-methods):** A standard time-dependent Cox on time-updated\n  exposure is valid only when time-varying confounders are *not* themselves affected by prior exposure. When they are\n  (e.g. labs that respond to the drug and also drive subsequent dosing), conditioning on them biases the effect and\n  omitting them confounds it — the classic g-method trap. The exposure-history long format built here is the *input* to an\n  MSM; the difference is the weighting/estimation layer, not the exposure construction. 
**Escalate to an MSM/IPTW** when\n  that feedback exists.\n\n**When to use.** Cumulative-toxicity questions (cumulative dose/duration: nephrotoxins, anthracyclines, opioids); acute\non-treatment hazards (current/recent use: bleeding on anticoagulation, hypoglycemia on insulin); titrated or interrupted\nregimens where a baseline dose is meaningless by month three; as-treated arms of an active-comparator new-user or\ntarget-trial-emulation analysis that censor at switch/discontinuation; and any setting where setting time zero correctly\nstill leaves exposure changing afterward.\n\n**When NOT to use — and when it is actively misleading or dangerous.** Do **not** reach for time-varying exposure when the\nestimand is an initiation (intention-to-treat-like) contrast — there, exposure is fixed at time zero by design and adding\npost-baseline time-varying status re-introduces the very informative-censoring and mediator-adjustment problems the\nnew-user design was built to avoid. It is **actively dangerous** to (1) feed cumulative dose into a model *without lagging*:\nusing exposure measured up to and including the outcome day lets reverse causation and protopathic effects (the prodrome\ndrives the prescription) masquerade as a dose-response, fabricating a spurious gradient; (2) update exposure using\ninformation that is a consequence of incipient disease, which conditions on a collider; (3) apply a standard time-dependent\nCox when time-varying confounders are affected by prior treatment — this silently produces a biased estimate that *looks*\nrigorous. When dose is changed in response to the outcome process, only g-methods recover the causal effect.\n\n**Data-source operational depth.**\n- **Claims (FFS):** Exposure episodes are built from `fill_date` + `days_supply` + daily dose derived from NDC strength and\n  quantity. 
Real failure modes: (a) **Medicare Advantage / capitated person-time lacks FFS fill claims**, so any interval\n  drawn from an MA-only member shows phantom \"no current use\" and a frozen cumulative dose — exclude MA-only person-time or\n  restrict to Parts A/B/D (commercial: require an active pharmacy benefit), and treat the boundary as administrative\n  censoring, not discontinuation. (b) **90-day mail-order and stockpiling** inflate `days_supply` and apparent current use;\n  decide a stockpiling/carry-over rule (cap accumulated supply) and a grace period explicitly. (c) **Inpatient stays**: most\n  inpatient drugs are bundled and invisible to outpatient pharmacy, so an outpatient supply spanning a hospitalization may\n  not reflect what was actually given — choose by design to *bridge* (assume continuation) or *censor* the stay, and report\n  the choice.\n- **EHR:** \"Current use\" can come from the medication *list* (often stale, carried forward indefinitely), the *order*\n  (intent, not receipt), the *administration* (MAR — true for inpatient infusions), or the *e-prescribe* feed (closest to a\n  fill but missing whether it was picked up). These answer different questions; pick the source that matches the exposure\n  definition (administrations for infusional oncology; linked fills for oral adherence) and never silently mix them. 
Visit-\n  driven capture means a patient who leaves the system shows a spurious exposure gap — treat loss to follow-up as\n  potentially informative.\n- **Registry:** Cycle-level dose and dose intensity are often captured more cleanly than in claims, but **dose holds and\n  reductions are frequently documented only in unstructured notes**, so structured cycle data can overstate received dose;\n  link to claims for complete outpatient fills/administrations and to a death index for censoring.\n- **Linked claims–EHR–registry:** The ideal substrate (registry dose + claims completeness + EHR labs to model\n  feedback), but order/fill/administration date discrepancies must be reconciled before any interval boundary is set, and\n  linkage selects the linkable subset.\n\n**Worked claims example.** Question: dose-dependent risk of acute kidney injury (AKI) with an oral nephrotoxin, FFS claims.\n(1) **Eligibility/time zero:** first qualifying fill (`index_date`) after 365 days of continuous A/B/D enrollment and a\ndrug-free washout; exclude members with any MA-only span in the lookback. (2) **Build episodes:** sort fills by `person_id`,\n`fill_date`; stitch with carry-over (start of fill `i` = `max(fill_date_i, covered_until_{i-1})`, `stop = start + days_supply`,\ndaily dose from NDC strength × quantity / days_supply); cap stockpiled supply at 30 days; allow a 14-day grace before an\nepisode is closed as discontinued. (3) **Long format:** split each person's follow-up at every episode boundary, every\noutcome/censoring date, and at the first day of each calendar month, emitting `(person_id, tstart, tstop, current_use,\ncum_days, cum_dose_mg)`. (4) **Lag for latency/protopathic protection:** compute `cum_dose_mg` and `current_use` **as of\n`tstart` and lagged 30 days** — exposure on the AKI day itself is excluded so a prodromal creatinine bump that triggers a\nrefill cannot create false dose-response. 
(5) **Censor:** at disenrollment, death, end of data, and — for an as-treated\ncontrast — at the end of the last episode + grace; bridge any inpatient stay by design and flag it. (6) **Model:**\n`coxph(Surv(tstart, tstop, aki) ~ cum_dose_100mg + current_use + baseline_covariates)` (or weighted pooled logistic), with\nthe cumulative-dose coefficient reported per 100 mg; if creatinine both responds to the drug and drives subsequent dosing,\nescalate to an IPTW marginal structural model rather than adjusting for time-varying creatinine in the Cox model.",
    "primary_category": "Exposure_Definition",
    "tags": [
      "time-varying-exposure",
      "cumulative-dose",
      "dose-intensity",
      "weighted-cumulative-exposure",
      "current-use",
      "risk-window",
      "latency",
      "immortal-time",
      "claims",
      "ehr"
    ],
    "applies_to_study_types": [
      "new_user",
      "active_comparator_new_user",
      "cohort_retrospective",
      "claims_analysis",
      "ehr_study",
      "target_trial_emulation"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "linked",
      "registry"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1002/sim.3701",
        "url": "https://doi.org/10.1002/sim.3701",
        "citation_text": "Sylvestre MP, Abrahamowicz M. Flexible modeling of the cumulative effects of time-dependent exposures on the hazard. Statistics in Medicine. 2009;28(27):3437-3453.",
        "year": 2009,
        "authors_short": "Sylvestre & Abrahamowicz",
        "notes": "Introduces the weighted cumulative exposure (WCE) framework, formalizing how the timing and intensity of past doses jointly shape current hazard via an estimated weight function."
      },
      {
        "role": "explain",
        "doi": "10.1002/pds.1357",
        "url": "https://doi.org/10.1002/pds.1357",
        "citation_text": "Suissa S. Immortal time bias in observational studies of drug effects. Pharmacoepidemiology and Drug Safety. 2007;16(3):241-249.",
        "year": 2007,
        "authors_short": "Suissa",
        "notes": "The canonical explanation of why static / mis-timed exposure classification fabricates immortal person-time, motivating time-updated exposure construction."
      },
      {
        "role": "explain",
        "doi": "10.1002/sim.4343",
        "url": "https://doi.org/10.1002/sim.4343",
        "citation_text": "Abrahamowicz M, Beauchamp ME, Sylvestre MP. Comparison of alternative models for linking drug exposure with adverse effects. Statistics in Medicine. 2012;31(11-12):1014-1030.",
        "year": 2012,
        "authors_short": "Abrahamowicz et al.",
        "notes": "Compares current-use, recent-use, cumulative-dose, and flexible WCE specifications, showing how the choice of exposure metric changes the estimated drug-adverse-effect relationship."
      },
      {
        "role": "demonstrate",
        "doi": "10.1093/aje/kwn164",
        "url": "https://doi.org/10.1093/aje/kwn164",
        "citation_text": "Cole SR, Hernán MA. Constructing inverse probability weights for marginal structural models. American Journal of Epidemiology. 2008;168(6):656-664.",
        "year": 2008,
        "authors_short": "Cole & Hernán",
        "notes": "Demonstrates the estimation layer required when time-varying exposure is affected by time-varying confounders that prior exposure influences — the long-format exposure history built here is the input to IPTW."
      },
      {
        "role": "demonstrate",
        "doi": "10.1002/pds.5701",
        "url": "https://doi.org/10.1002/pds.5701",
        "citation_text": "Kelly TL, Salter A, Pratt NL. The weighted cumulative exposure method and its application to pharmacoepidemiology: a narrative review. Pharmacoepidemiology and Drug Safety. 2024;33(1):e5701.",
        "year": 2024,
        "authors_short": "Kelly et al.",
        "notes": "Applied review of WCE in pharmacoepidemiology, with practical guidance on window selection, knot placement, and the data density WCE requires."
      }
    ],
    "plain_language_summary": "When researchers study how a drug affects the body over time, they need to track not just whether a patient ever took the drug, but exactly when the drug was active and how much had been taken up to each point. Time-updated exposure construction does this by dividing each patient's follow-up into small intervals and labeling each one with the exposure status at that moment — for example, whether a pill supply was currently on hand (current use) or how many total milligrams had accumulated up to that day (cumulative dose). This approach prevents a well-known counting error called immortal time bias, which occurs when a study incorrectly marks days before a patient even started the drug as if they were already exposed. Without it, a static ever/never label collapses the entire timeline into one flag and hides the most important information: the timing and amount of what was actually taken.",
    "key_terms": [
      {
        "term": "time-updated exposure",
        "definition": "An exposure variable that is re-measured at each point in a patient's follow-up rather than set once at the start, so the analysis reflects what the patient was actually doing at each moment."
      },
      {
        "term": "cumulative dose",
        "definition": "The running total of drug received from the first fill up to (but not including) a given day, typically expressed in milligrams; it grows each time a new fill is taken."
      },
      {
        "term": "dose intensity",
        "definition": "Cumulative dose divided by the length of time the patient was on the drug, giving the average daily dose actually received over the treatment period."
      },
      {
        "term": "current use",
        "definition": "A yes/no flag for whether an active prescription supply covers the current day, based on fill date plus days of supply from the pharmacy record."
      },
      {
        "term": "person-time intervals",
        "definition": "Short time segments that together cover a patient's full follow-up period; each segment carries its own exposure label so the model can see how the exposure changed over time."
      },
      {
        "term": "immortal time bias",
        "definition": "A counting error that occurs when days a patient could not possibly have had the outcome (because they had not yet started the drug) are labeled as exposed days, artificially making the drug look safer."
      }
    ],
    "worked_example": {
      "scenario": "A researcher is studying whether long-term methotrexate use raises the risk of liver injury. She is following one patient, Maria (ID 2001), who started methotrexate on January 10, 2024. She has three pharmacy fills over about 100 days. Rather than label Maria as simply ever-exposed, the researcher builds a person-time table that records, for each interval of follow-up, whether Maria had an active supply on hand (current use = 1 or 0) and exactly how many total milligrams she had accumulated up to the start of that interval. This lets the survival model ask: does the liver-injury rate go up as cumulative dose grows?",
      "dataset": {
        "caption": "Maria's three pharmacy fills as they appear in a claims pharmacy table. Each row is one dispensed prescription.",
        "columns": [
          "person_id",
          "fill_date",
          "drug",
          "days_supply",
          "daily_dose_mg"
        ],
        "rows": [
          [
            2001,
            "2024-01-10",
            "methotrexate",
            30,
            10
          ],
          [
            2001,
            "2024-02-15",
            "methotrexate",
            30,
            10
          ],
          [
            2001,
            "2024-03-20",
            "methotrexate",
            30,
            10
          ]
        ]
      },
      "steps": [
        "Fill 1 starts January 10 and covers 30 days, so it runs through February 8 (Jan 10 + 30 days). Maria has an active supply during this whole stretch. Episode dose = 30 days x 10 mg/day = 300 mg.",
        "Fill 2 arrives February 15. Her previous supply ended February 8, so there is a 6-day gap (Feb 9 through Feb 14). The 14-day grace period rule keeps this gap inside the same treatment episode rather than counting it as a restart, but during those 6 days current_use = 0 because no supply was on hand.",
        "Fill 2 restarts coverage on February 15 and runs through March 15 (30 days). Cumulative dose at the start of this interval is 300 mg (what was taken in Fill 1).",
        "Fill 3 arrives March 20. Supply from Fill 2 ended March 15, creating a 4-day gap (Mar 16 through Mar 19) where current_use = 0 again.",
        "Fill 3 restarts coverage March 20 through April 18. Cumulative dose at the start of this interval is 300 mg + 300 mg = 600 mg.",
        "By the end of Follow-up (April 18), Maria has completed all three fills. Total dose dispensed = 3 fills x 300 mg = 900 mg. A survival model can now ask whether the hazard of liver injury at any interval is higher when cumulative dose is 300 mg vs 600 mg vs 900 mg."
      ],
      "result": "At the start of the third fill interval (March 20), Maria's cumulative dose is 600 mg (Fill 1: 300 mg + Fill 2: 300 mg). Her current_use flag on March 20 is 1 (active supply). During the two gap intervals (Feb 9-14, Mar 16-19) current_use = 0, but cumulative dose continues to hold its running value of 300 mg and 600 mg respectively because dose already taken does not disappear.",
      "timeline_spec": {
        "title": "Maria's methotrexate person-time: current use and cumulative dose across three fills",
        "window": {
          "start": "2024-01-10",
          "end": "2024-04-18",
          "label": "99-day observation window (all three fill episodes)"
        },
        "events": [
          {
            "label": "Fill 1",
            "start": "2024-01-10",
            "length_days": 30,
            "quantity": "30 days_supply, 10 mg/day, episode dose 300 mg"
          },
          {
            "label": "Fill 2 (6-day gap, within 14-day grace)",
            "start": "2024-02-15",
            "length_days": 30,
            "quantity": "30 days_supply, 10 mg/day, episode dose 300 mg"
          },
          {
            "label": "Fill 3 (4-day gap, within 14-day grace)",
            "start": "2024-03-20",
            "length_days": 30,
            "quantity": "30 days_supply, 10 mg/day, episode dose 300 mg"
          }
        ],
        "spans": [
          {
            "kind": "exposed",
            "start": "2024-01-10",
            "end": "2024-02-08",
            "label": "Interval 1: current_use=1, cum_dose_start=0 mg"
          },
          {
            "kind": "gap",
            "start": "2024-02-09",
            "end": "2024-02-14",
            "label": "Interval 2: current_use=0, cum_dose=300 mg (6-day gap, grace)"
          },
          {
            "kind": "exposed",
            "start": "2024-02-15",
            "end": "2024-03-15",
            "label": "Interval 3: current_use=1, cum_dose_start=300 mg"
          },
          {
            "kind": "gap",
            "start": "2024-03-16",
            "end": "2024-03-19",
            "label": "Interval 4: current_use=0, cum_dose=600 mg (4-day gap, grace)"
          },
          {
            "kind": "exposed",
            "start": "2024-03-20",
            "end": "2024-04-18",
            "label": "Interval 5: current_use=1, cum_dose_start=600 mg"
          }
        ],
        "result": {
          "label": "Cumulative dose at start of Fill 3 interval = 600 mg; total dispensed across all fills = 900 mg",
          "value": 600
        },
        "caption": "Each colored bar shows one fill; grey gaps show days without active supply (current_use=0). The cumulative-dose label on each interval shows how much Maria had taken before that interval began — the value a survival model would use as the exposure at that point in time.",
        "alt_text": "Horizontal timeline for patient 2001 across 99 days showing three methotrexate fills as green bars, two grey gaps between fills, and a cumulative dose annotation on each interval (0 mg, 300 mg, 300 mg, 600 mg, 600 mg) demonstrating how the running total builds up across fills while current-use toggles on and off."
      }
    },
    "prerequisites": [
      "exposure-episode-construction-rwe",
      "pdc-proportion-of-days-covered",
      "immortal-time-bias-handling"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Current / recent / past use windows",
        "description": "Classify each person-interval by whether an exposure episode is active now (current use), ended within a recency window (recent use, e.g. 1-30 days for carry-over pharmacology), ended remotely (past use), or never occurred.",
        "edge_cases": [
          "Grace periods convert short nonadherence gaps into apparent continuous exposure; set the grace from pharmacology, not convenience.",
          "Recency windows must reflect biologic carry-over (half-life), not the analyst's default 30 days.",
          "The reference category (never vs past use) materially changes the contrast and must be pre-specified."
        ],
        "data_source_notes": "claims: fill_date + days_supply + carry-over rule; EHR: distinguish order/medication-list/administration as the source of \"active\"."
      },
      {
        "name": "Cumulative duration and cumulative dose",
        "description": "Running sum of exposed days, or of daily-dose × exposed days (or dispensed quantity), accrued strictly before time t and typically lagged to block protopathic / latency artifacts.",
        "edge_cases": [
          "Using dose up to and including the outcome day creates spurious dose-response via reverse causation; lag the cumulative metric.",
          "NDC strengths, vial sizes, biosimilars, and route changes must be normalized to a common dose unit before summing."
        ],
        "data_source_notes": "claims: derive daily dose from strength × quantity / days_supply; registry: cycle-level dose may omit holds/reductions documented only in notes."
      },
      {
        "name": "Weighted cumulative exposure (WCE)",
        "description": "Estimate a recency/latency weight function (splines or pre-specified weights) so past doses contribute to current hazard according to when they occurred, relaxing the flat-sum assumption of cumulative dose.",
        "edge_cases": [
          "Overfits sparse high-dose tails; requires dense exposure histories and careful time-window and knot selection.",
          "Harder to communicate and defend to regulators than a pre-specified current-use or cumulative-dose contrast."
        ],
        "data_source_notes": "claims/EHR: needs near-complete, gap-resolved daily exposure; missing person-time (MA-only spans, system departures) silently distorts the weight curve."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "pdc-proportion-of-days-covered",
        "pros_of_this": "Preserves the timeline so the hazard model can use interval-level exposure; represents time-varying risk rather than a single adherence scalar over a fixed denominator.",
        "cons_of_this": "More complex and less standardized; PDC is the better instrument when adherence itself is the quantity of interest.",
        "when_to_prefer": "When a time-dependent survival/pooled-logistic model needs exposure that changes over follow-up, not a baseline adherence summary."
      },
      {
        "compared_to": "standard-cox-time-dependent",
        "pros_of_this": "Supplies the gap-handled, lagged, long-format exposure intervals that a valid time-dependent Cox model requires; bad exposure construction cannot be repaired downstream.",
        "cons_of_this": "This concept is the data-construction layer only — it does not address confounding or estimation.",
        "when_to_prefer": "Always, as the upstream input; the Cox/pooled-logistic fit is the separate modeling step."
      },
      {
        "compared_to": "marginal-structural-models-g-methods",
        "pros_of_this": "The same long-format exposure history is the required input to an MSM; simpler to fit a time-dependent Cox when no exposure-affected time-varying confounding exists.",
        "cons_of_this": "A standard time-dependent Cox is biased when time-varying confounders are affected by prior exposure; only g-methods (IPTW/MSM) recover the causal effect there.",
        "when_to_prefer": "Use a plain time-dependent model when there is no treatment-confounder feedback; escalate to an MSM when there is."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Derive daily dose from NDC strength × dispensed quantity / days_supply. Stitch fills into episodes with an explicit carry-over (stockpiling cap) and grace-period rule, then split follow-up into a long format and compute current_use, cum_days, and cum_dose as of each interval start, lagged for latency/protopathic protection. Exclude Medicare Advantage-only person-time (no FFS fills) and decide by design whether to bridge or censor inpatient stays.",
      "ehr": "Current use can derive from the medication list (stale), order (intent), administration (MAR — best for infusions), or e-prescribe (closest to a fill). Pick one source to match the exposure definition and do not mix them; treat visit-driven gaps and loss to follow-up as potentially informative.",
      "registry": "Cycle-level dose/intensity is often cleaner than claims but holds and reductions may live only in notes, overstating received dose; link to claims for complete fills/administrations and to a death index for censoring.",
      "linked": "Reconcile order/fill/administration date discrepancies before setting any interval boundary; linkage selects the linkable subset. Best substrate for modeling treatment-confounder feedback because EHR labs are available."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\n\nSTOCKPILE_CAP_DAYS = 30   # max carry-over of unused supply\nGRACE_DAYS = 14           # gap tolerated before an episode is closed as discontinued\n\ndef build_exposure_episodes(fills: pd.DataFrame) -> pd.DataFrame:\n    fills = fills.sort_values([\"person_id\", \"fill_date\"])\n    rows = []\n    for pid, g in fills.groupby(\"person_id\", sort=False):\n        covered_until = None     # running end of stockpiled supply\n        cum_dose = 0.0           # mg accrued so far for this person\n        for r in g.itertuples():\n            if covered_until is None or r.fill_date > covered_until + pd.Timedelta(days=GRACE_DAYS):\n                start = r.fill_date                       # new episode\n            else:\n                carry = min((covered_until - r.fill_date).days, STOCKPILE_CAP_DAYS)\n                start = r.fill_date + pd.Timedelta(days=max(carry, 0))   # cap stockpiling\n            stop = start + pd.Timedelta(days=int(r.days_supply))\n            episode_dose = int(r.days_supply) * float(r.daily_dose)\n            rows.append({\n                \"person_id\": pid, \"tstart\": start, \"tstop\": stop,\n                \"daily_dose\": r.daily_dose, \"current_use\": 1,\n                \"cum_dose_start\": cum_dose,                # cum dose BEFORE this episode (lag-ready)\n                \"cum_dose_end\": cum_dose + episode_dose,\n            })\n            cum_dose += episode_dose\n            covered_until = stop\n    return pd.DataFrame(rows)",
        "description": "Build time-updated exposure intervals from pharmacy claims. Required input (cleaned, de-duplicated):\n  fills : person_id, fill_date (datetime64), days_supply (int), daily_dose (mg/day, from NDC strength)\nReturns one row per gap-handled exposure episode with carry-over (stockpiling) and per-episode cumulative dose. Downstream,\nsplit follow-up at every episode boundary + outcome/censoring date, then carry cum_dose/current_use LAGGED so exposure on\nthe outcome day is excluded (protopathic protection). Exclude MA-only person-time and bridge/censor inpatient stays before\nthis step.",
        "dependencies": [
          "pandas"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\nlibrary(survival)\nSTOCKPILE_CAP <- 30L; GRACE <- 14L\n\nbuild_episodes <- function(fills) {\n  setDT(fills); setorder(fills, person_id, fill_date)\n  fills[, {\n    covered_until <- as.Date(NA); cum <- 0; out <- list()\n    for (i in seq_len(.N)) {\n      fd <- fill_date[i]; ds <- days_supply[i]; dd <- daily_dose[i]\n      if (is.na(covered_until) || fd > covered_until + GRACE) {\n        start <- fd\n      } else {\n        carry <- min(as.integer(covered_until - fd), STOCKPILE_CAP)\n        start <- fd + max(carry, 0L)\n      }\n      stop <- start + ds\n      out[[i]] <- data.table(tstart = start, tstop = stop, daily_dose = dd,\n                              current_use = 1L, cum_dose_start = cum)\n      cum <- cum + ds * dd; covered_until <- stop\n    }\n    rbindlist(out)\n  }, by = person_id]\n}\n\nepi <- build_episodes(fills)\n# Restrict episodes to each person's follow-up and attach the event flag at the closing interval.\nlong <- epi[followup, on = .(person_id), nomatch = 0L][\n  tstart < fu_end & tstop > fu_start]\nlong[, `:=`(tstart = pmax(tstart, fu_start), tstop = pmin(tstop, fu_end))]\nlong[, event := as.integer(!is.na(event_date) & event_date > tstart & event_date <= tstop)]\ncoxph(Surv(as.numeric(tstart), as.numeric(tstop), event) ~\n        I(cum_dose_start / 100) + current_use, data = long)   # HR per 100 mg cumulative dose",
        "description": "Same exposure-episode construction in data.table, then a join to outcome/censoring intervals producing the (tstart, tstop,\nevent) long format consumed by a time-dependent Cox model. Inputs:\n  fills    : person_id, fill_date (Date), days_supply (int), daily_dose (numeric)\n  followup : person_id, fu_start (Date), fu_end (Date), event_date (Date or NA)\ncum_dose is taken as of tstart (lagged) so exposure on the event day does not enter the contrast.",
        "dependencies": [
          "data.table",
          "survival"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "%let cap = 30;   /* stockpiling carry-over cap (days) */\n%let grace = 14; /* gap tolerated before episode closes */\n\nproc sort data=work.fills; by person_id fill_date; run;\n\ndata episodes;\n  set work.fills; by person_id;\n  retain covered_until cum_dose;\n  if first.person_id then do; covered_until = .; cum_dose = 0; end;\n\n  if covered_until = . or fill_date > covered_until + &grace then\n    tstart = fill_date;                                   /* new episode */\n  else do;\n    carry = min(covered_until - fill_date, &cap);\n    tstart = fill_date + max(carry, 0);                   /* cap stockpiling */\n  end;\n  tstop = tstart + days_supply;\n  current_use = 1;\n  cum_dose_start = cum_dose;                              /* dose BEFORE this episode (lagged) */\n  output;\n  cum_dose = cum_dose + days_supply * daily_dose;\n  covered_until = tstop;\n  format tstart tstop date9.;\nrun;\n\n/* Clip episodes to follow-up and flag the event in its closing interval. */\nproc sql;\n  create table long as\n  select e.person_id,\n         max(e.tstart, f.fu_start) as tstart format=date9.,\n         min(e.tstop,  f.fu_end)   as tstop  format=date9.,\n         e.current_use, e.cum_dose_start,\n         (f.event_date is not null and\n          f.event_date >  max(e.tstart, f.fu_start) and\n          f.event_date <= min(e.tstop,  f.fu_end)) as event\n  from episodes e inner join work.followup f\n    on e.person_id = f.person_id\n  where e.tstart < f.fu_end and e.tstop > f.fu_start;\nquit;\n\n/* Time-dependent Cox via counting-process intervals; HR per 100 mg cumulative dose. */\nproc phreg data=long;\n  model (tstart, tstop) * event(0) = cum_dose100 current_use;\n  cum_dose100 = cum_dose_start / 100;\nrun;",
        "description": "RETAIN/BY-group construction of gap-handled exposure episodes with stockpiling cap, then a PROC SQL split against\nfollow-up to emit the long-format (person_id, tstart, tstop, current_use, cum_dose_start, event) consumed by PROC PHREG\nwith a counting-process (tstart, tstop) specification. Inputs (post data-management):\n  work.fills    : person_id, fill_date, days_supply, daily_dose\n  work.followup : person_id, fu_start, fu_end, event_date (. if none)\ncum_dose_start holds cumulative dose BEFORE each episode so the time-dependent contrast is lagged.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": "time-updated-exposures-cumulative-dose-rwe-timeline.svg",
        "mermaid": null,
        "caption": "Each colored bar shows one fill; grey gaps show days without active supply (current_use=0). The cumulative-dose label on each interval shows how much Maria had taken before that interval began — the value a survival model would use as the exposure at that point in time.",
        "alt_text": "Horizontal timeline for patient 2001 across 99 days showing three methotrexate fills as green bars, two grey gaps between fills, and a cumulative dose annotation on each interval (0 mg, 300 mg, 300 mg, 600 mg, 600 mg) demonstrating how the running total builds up across fills while current-use toggles on and off.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Q{What biologic quantity<br/>drives the outcome?} -->|Active pharmacology<br/>while on drug| CU[Current / recent use<br/>time-varying indicator]\n  Q -->|Risk accrues with<br/>total burden| CD[Cumulative duration / dose<br/>lagged running sum]\n  Q -->|Timing of past doses<br/>matters / latency| WCE[Weighted cumulative exposure<br/>spline weight function]\n  CU --> FB{Time-varying confounders<br/>affected by prior exposure?}\n  CD --> FB\n  WCE --> FB\n  FB -->|No| TDC[Time-dependent Cox /<br/>pooled logistic on long format]\n  FB -->|Yes| MSM[Marginal structural model<br/>IPTW]",
        "caption": "Choosing the exposure metric from the biologic question, then the estimation layer from whether treatment-confounder feedback exists.",
        "alt_text": "Decision flowchart mapping the driving biologic quantity to current-use, cumulative-dose, or weighted cumulative exposure, then to a time-dependent Cox model or a marginal structural model depending on exposure-affected confounding.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "timeline\n  title One patient, day 90 outcome - the chosen metric changes the exposure value\n  Day 0 : Fill 30 days supply, low dose : current_use = 1\n  Day 25 : Refill (stockpiling begins) : cum_days rising\n  Day 55 : Supply exhausted, no refill : current_use - 0\n  Day 80 : High-dose 14-day course starts : current_use = 1 again\n  Day 90 : Outcome assessed : current=yes, recent=yes, cum_dose high, WCE weights recent dose heavily",
        "caption": "A static ever/never flag collapses all of this to \"exposed.\" Current use, recent use, cumulative dose, and WCE give materially different values at day 90 - and a different estimand.",
        "alt_text": "Timeline for a single patient across 90 days showing fills, a supply gap, and a high-dose course, illustrating how current-use, recent-use, cumulative-dose, and WCE differ at the day-90 outcome.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "produces",
        "target_slug": "standard-cox-time-dependent",
        "notes": "This concept produces the gap-handled, lagged, long-format (tstart, tstop) exposure intervals that a valid time-dependent Cox model requires as input."
      },
      {
        "relation_type": "used_with",
        "target_slug": "cox-ph-regression",
        "notes": "Time-updated exposure intervals are the exposure input to time-dependent Cox / pooled-logistic survival models."
      },
      {
        "relation_type": "requires",
        "target_slug": "marginal-structural-models-g-methods",
        "notes": "When time-varying confounders are affected by prior exposure, a standard time-dependent model is biased; the same long-format history must be analyzed with an IPTW marginal structural model."
      },
      {
        "relation_type": "see_also",
        "target_slug": "immortal-time-bias-handling",
        "notes": "Static or mis-timed exposure classification is a leading source of immortal time; time-updating exposure aligns person-time with actual treatment."
      },
      {
        "relation_type": "alternative_to",
        "target_slug": "pdc-proportion-of-days-covered",
        "notes": "PDC summarizes adherence over a fixed window as one scalar; time-updated exposure preserves the timeline for time-varying risk modeling."
      },
      {
        "relation_type": "see_also",
        "target_slug": "persistence-time-to-discontinuation",
        "notes": "Discontinuation, switching, and gaps are the events that open and close the exposure episodes built here."
      },
      {
        "relation_type": "see_also",
        "target_slug": "route-of-administration-differences-in-rwe",
        "notes": "Route determines whether fills, administrations, procedure claims, or medication lists are the credible source of time-varying exposure."
      }
    ],
    "aliases": [
      "time-varying exposure",
      "time-updated exposure",
      "cumulative exposure",
      "cumulative dose",
      "dose intensity",
      "weighted cumulative exposure",
      "WCE",
      "current-use risk windows"
    ],
    "complexity": "advanced",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "time-zero-index-date-alignment-rwe",
    "name": "Time Zero (Index Date) Alignment",
    "short_definition": "The design rule that fixes a single date per patient at which eligibility is met, treatment strategy is assigned, baseline covariates stop being measured, and follow-up and outcome risk begin—so that classification, person-time, and confounder measurement are synchronized exactly as they would be at randomization in the trial being emulated.",
    "long_description": "**Time zero** (the index date, t0, or cohort-entry date) is the instant at which a patient simultaneously (1) satisfies all\neligibility criteria, (2) is assigned to a treatment strategy, (3) has baseline covariates fixed (everything measurable\n*up to and including* t0, nothing after), and (4) begins to accrue person-time and outcome risk. In a randomized trial these\nfour events coincide by construction at randomization. In observational data they do **not** coincide unless the analyst\nforces them to, and the entire validity of a longitudinal RWE study rests on doing so. The discipline of aligning eligibility,\nassignment, and follow-up onset at one date is the operational heart of target-trial emulation; getting it wrong is the\nsingle most common source of catastrophic, non-fixable bias in pharmacoepidemiology.\n\n**Core conceptual distinction.** The defining requirement is that *exposure classification must be determinable using only\ninformation available at or before t0*. The failure mode is using **future information to define cohort entry or arm\nassignment**, which guarantees that the classified-exposed group must survive (and remain event-free) long enough to receive\nexposure—**immortal time bias**. The classic example: define \"statin user\" as anyone who fills a statin during follow-up,\nbut start the clock at hospital discharge. Patients who die in week one cannot become \"users,\" so the user group banks\nevent-free person-time it never earned. 
Three distinct biases all trace to t0 misalignment and must be separated: (1)\n*immortal time*—follow-up starts before the moment exposure is determined; (2) *prevalent-user / left-truncation bias*—t0 is\nset at an arbitrary calendar date among ongoing users, conditioning on survival and depleting susceptibles; (3) *differential\nalignment by arm*—the two strategies' t0 dates correspond to different points in the disease course (e.g., the new drug is\nreached only after the comparator fails), embedding confounding by indication into the time axis itself. The estimand you can\nrecover is governed by *what t0 represents*: if t0 = initiation, the contrast is an initiation (ITT-like) effect; if a grace\nperiod is allowed for a strategy to be \"started,\" eligibility-time ambiguity arises and the per-protocol estimand requires\ncloning/censoring/weighting rather than a single assigned t0.\n\n**Pros, cons, and trade-offs**\n- **vs an \"ever-exposed\" / time-fixed classification with follow-up from diagnosis:** Aligning t0 at the exposure decision\n  eliminates immortal time entirely; the ever-exposed approach is not a milder version of the same design, it is a different\n  and biased estimand. There is no analytic patch (no covariate adjustment, no PS) that repairs misclassified person-time.\n  **Always prefer** aligned t0; the only legitimate alternative when exposure genuinely accrues after a fixed eligibility\n  event is to model exposure as **time-varying** (landmark analysis or a time-dependent Cox/pooled-logistic model), which is\n  a deliberate analytic choice, not a default.\n- **vs prevalent-user (current-user at a calendar date) designs:** A new-user t0 at first qualifying fill removes\n  left-truncation, depletion-of-susceptibles, and adjustment for post-initiation mediators. Cost: smaller cohorts and a\n  population restricted to initiators, who can differ from the prevalent users who dominate practice. 
Prefer aligned new-user\n  t0; reach for a **prevalent new-user (Suissa) design** only when incident initiation is too rare to study.\n- **vs clone-censor-weight (CCW) / sequential target-trial emulation:** When a strategy is defined by a sustained or dynamic regimen, or\n  when a grace period makes \"the\" assignment date ambiguous (patients who could start within X days all qualify for both\n  strategies at baseline), a single hard t0 forces an artificial choice and reintroduces selection. CCW clones every\n  eligible person into *all* compatible strategies at t0, censors each clone when its observed data deviate from that\n  strategy, and applies inverse-probability-of-censoring weights to correct the selection this induces. Cost: substantially\n  more complex to specify, fit, and defend. Prefer a single aligned t0 for clean two-drug initiation contrasts; escalate to\n  CCW only when the protocol genuinely requires a dynamic per-protocol estimand.\n\n**When to use:** every longitudinal RWE study that produces a comparative or absolute time-to-event, person-time, utilization,\nor cost estimate: comparative effectiveness, drug/vaccine safety, screening or procedure effects, and any target-trial\nemulation. Pre-specifying t0 (and the rule that maps each patient to it) is a non-negotiable protocol element; the\ndiagnostics that prove it worked (distribution of t0 by arm, attrition at each eligibility rule, 5–10 patient-level\ntimelines per arm, and ±30/90-day sensitivity on washout/grace/lookback) belong in the SAP.\n\n**When NOT to use, and when it is actively misleading or dangerous**\n- **Do not set a single fixed t0 when exposure is intrinsically post-baseline and time-varying** (e.g., effect of *achieving*\n  a lab target, of cumulative dose, or of adherence measured over follow-up). Forcing these into a baseline classification\n  *creates* immortal time. Use landmark or time-dependent models instead.\n- **Do not anchor t0 on a downstream event** (first hospitalization for the outcome's complication, second prescription,\n  \"confirmed responder\"). 
Conditioning cohort entry on a future event selects on survival and on the outcome pathway.\n- **Do not let the two arms' t0 represent different clinical moments.** If the comparator's t0 is first-line initiation but\n  the study drug's t0 is post-failure switch, the time axes are not exchangeable; balance tables will look fine yet the\n  estimate is confounded by indication baked into t0. This is the most dangerous case because it is invisible to standardized\n  differences computed at each arm's own baseline.\n- **Do not assign t0 from a procedure that itself defines survival** (e.g., transplant, complex surgery) without a g-method\n  or sequential-trial design; \"time to procedure\" is immortal by construction.\n\n**Data-source operational depth**\n- **Administrative claims (FFS vs Medicare Advantage):** t0 is typically the `fill_date` of the first qualifying NDC or the\n  service date of a qualifying procedure. The lethal trap is **MA-only person-time**: capitated Medicare Advantage plans do\n  not generate complete FFS encounter/pharmacy claims, so a clean washout before t0 can be *missingness masquerading as\n  incidence*, and competing events (death) are unevenly captured. Require continuous A/B/D (or commercial medical+pharmacy)\n  enrollment spanning the *entire* washout through t0, and exclude MA-only spans rather than treating them as drug-free.\n  Reversed/rebilled and same-day mail-order + retail fills can split or duplicate the index event—dedupe before taking the\n  minimum date. In elderly cohorts, **differential competing risk of death by arm** distorts cause-specific vs cumulative-\n  incidence interpretation, which must be reconciled with the estimand.\n- **EHR:** The exposure decision is the *order* or *administration*, not a dispensing; an order without a confirmed fill is\n  an intention, not an initiation, so linkage to pharmacy claims sharpens t0. 
Visit-driven capture means baseline covariates\n  are observed only when a patient happens to have an encounter, so the \"up-to-t0\" window can be sparse and informatively\n  missing; a patient who leaves the system contributes biased follow-up. Define the observation window explicitly and treat\n  loss to follow-up as potentially informative.\n- **Registry:** Strong for the qualifying clinical event (diagnosis date, stage, adjudicated index procedure) that often\n  *is* t0, but typically blind to subsequent pharmacy exposure and to deaths occurring outside the registry. Link to claims\n  for fill history and to a death index to firm up censoring; verify the registry's enrollment/adjudication date is not\n  itself a post-hoc, outcome-informed date.\n- **Linked claims–EHR–vital records:** The ideal substrate (EHR severity + claims completeness + reliable mortality) but\n  order date, fill date, and service date frequently disagree by days to weeks; pre-specify which date defines t0 and\n  reconcile the others before assignment, because a few days' slippage changes who is \"incident\" and whether a baseline\n  covariate falls before or after t0.\n\n**Worked claims example (immortal time made concrete).** Question: 1-year all-cause mortality after acute MI in patients who\ninitiate a beta-blocker vs those who do not, in a commercial + Medicare FFS database. *Naive (wrong) analysis:* t0 = MI\ndischarge date for everyone; classify \"beta-blocker user\" = any beta-blocker `fill_date` in the 90 days after discharge.\nResult: the user arm is artificially protected, because anyone who dies in days 1–89 before filling is forced into the\nnon-user arm—the days from discharge to first fill are *immortal* (event-free by definition) and wrongly credited to users.\n*Aligned analysis:* require ≥365 days continuous A/B/D enrollment with no beta-blocker fill in the 365-day lookback (washout,\nso initiation is incident, MA-only spans excluded). 
Set **t0 = first post-discharge beta-blocker `fill_date`** for initiators;\nfor the comparator, use a sequential/landmark device—e.g., assign each non-filling-survivor a matched t0 at the same number\nof days post-discharge (or build a sequence of nested trials by day), so both arms start the clock at the same elapsed time\nand the immortal interval is removed from both. Measure baseline covariates only in `[t0 − 365, t0]` (prior comorbidities,\nutilization, infarct severity proxies), fit a PS, and follow from t0 to death, censoring at disenrollment, end of data, and\n(for an as-treated estimand) discontinuation = last `days_supply` end + a 30-day grace period or switch. Report the t0\ndistribution by arm and rerun with the washout and grace period at ±30/90 days. The aligned design typically shrinks the\nspurious \"beta-blocker\" benefit by a large margin—the textbook signature of corrected immortal time.",
    "primary_category": "Study_Design",
    "tags": [
      "time-zero",
      "index-date",
      "immortal-time-bias",
      "target-trial-emulation",
      "new-user-design",
      "cohort-entry",
      "prevalent-user-bias",
      "study_design"
    ],
    "applies_to_study_types": [
      "claims_analysis",
      "ehr_study",
      "registry_linkage",
      "target_trial_emulation",
      "comparative_effectiveness"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1016/j.jclinepi.2016.04.014",
        "url": "https://doi.org/10.1016/j.jclinepi.2016.04.014",
        "citation_text": "Hernán MA, Sauer BC, Hernández-Díaz S, Platt R, Shrier I. Specifying a target trial prevents immortal time bias and other self-inflicted injuries in observational analyses. Journal of Clinical Epidemiology. 2016;79:70-75.",
        "year": 2016,
        "authors_short": "Hernán et al.",
        "notes": "Canonical statement that aligning eligibility, treatment assignment, and follow-up start at a single time zero is what prevents immortal time and related self-inflicted biases."
      },
      {
        "role": "explain",
        "doi": "10.1093/aje/kwv254",
        "url": "https://doi.org/10.1093/aje/kwv254",
        "citation_text": "Hernán MA, Robins JM. Using big data to emulate a target trial when a randomized trial is not available. American Journal of Epidemiology. 2016;183(8):758-764.",
        "year": 2016,
        "authors_short": "Hernán & Robins",
        "notes": "Frames time zero as the emulated randomization point and shows how eligibility, treatment strategies, and follow-up must be synchronized to it."
      },
      {
        "role": "explain",
        "doi": "10.1093/aje/kwm324",
        "url": "https://doi.org/10.1093/aje/kwm324",
        "citation_text": "Suissa S. Immortal time bias in pharmacoepidemiology. American Journal of Epidemiology. 2008;167(4):492-499.",
        "year": 2008,
        "authors_short": "Suissa",
        "notes": "Definitive characterization of immortal time bias—the direct consequence of misaligned time zero—with worked drug-effectiveness examples and the magnitude of the spurious benefit it creates."
      },
      {
        "role": "demonstrate",
        "doi": "10.1093/aje/kwg231",
        "url": "https://doi.org/10.1093/aje/kwg231",
        "citation_text": "Ray WA. Evaluating medication effects outside of clinical trials: new-user designs. American Journal of Epidemiology. 2003;158(9):915-920.",
        "year": 2003,
        "authors_short": "Ray",
        "notes": "Foundational new-user design—the operational mechanism that sets time zero at first incident exposure and removes prevalent-user/left-truncation bias."
      },
      {
        "role": "use",
        "doi": "10.1007/s40471-015-0053-5",
        "url": "https://doi.org/10.1007/s40471-015-0053-5",
        "citation_text": "Lund JL, Richardson DB, Stürmer T. The active comparator, new user study design in pharmacoepidemiology: historical foundations and contemporary application. Current Epidemiology Reports. 2015;2(4):221-228.",
        "year": 2015,
        "authors_short": "Lund et al.",
        "notes": "Applied articulation of time-zero alignment within active-comparator new-user cohorts in administrative data, with practical covariate-window and washout guidance."
      }
    ],
    "plain_language_summary": "Time-zero alignment is the rule that picks one single date per patient where the clock starts: the day they qualify for the study, the day we decide which treatment group they are in, and the day we begin counting their follow-up and watching for outcomes. It answers the question, \"From exactly when should we start measuring this patient, so we are comparing the two groups fairly the way a coin-flip trial would?\" Get this date wrong and you can hand one group free, event-free time it never actually earned, which fakes a benefit that isn't real. The catch: you may only use information you'd know on or before that day to sort a patient into a group.",
    "key_terms": [
      {
        "term": "index date (time zero)",
        "definition": "The patient's chosen \"day zero\" at which eligibility, treatment-group assignment, and the start of follow-up all happen at once."
      },
      {
        "term": "immortal time",
        "definition": "A stretch of days a patient must survive event-free in order to later qualify for a group, which unfairly makes that group look healthier if you count those days."
      },
      {
        "term": "washout (clean lookback)",
        "definition": "A stretch of time before day zero during which the patient must have no fills of the study drug, so you can treat this fill as a new start; it rules out use within the lookback window, not necessarily any earlier use."
      },
      {
        "term": "days_supply",
        "definition": "How many days one filled prescription is meant to last (for example, a 30-day or 90-day supply)."
      },
      {
        "term": "follow-up time",
        "definition": "The days you actually watch a patient after their start date, counting whether and when the outcome happens."
      }
    ],
    "worked_example": {
      "scenario": "One patient is discharged from the hospital after a heart attack on 2024-01-01 and we want to study whether starting a beta-blocker lowers 1-year death. They fill their first beta-blocker on 2024-03-01. We compare a wrong way to start the clock (at discharge, but label them a \"user\" using a fill that happens later) against the right way (start the clock on the fill date). The window we can observe ends 2024-08-28.",
      "dataset": {
        "caption": "The raw rows an analyst would see: one discharge record, one enrollment span, and one pharmacy fill.",
        "columns": [
          "person_id",
          "event_type",
          "date",
          "drug",
          "days_supply"
        ],
        "rows": [
          [
            2001,
            "hospital_discharge",
            "2024-01-01",
            "",
            ""
          ],
          [
            2001,
            "enrollment_span_start",
            "2023-01-01",
            "",
            ""
          ],
          [
            2001,
            "pharmacy_fill",
            "2024-03-01",
            "beta_blocker",
            90
          ]
        ]
      },
      "steps": [
        "Wrong way: start the clock at discharge (2024-01-01) but call the patient a \"beta-blocker user\" because of a fill that only happens on 2024-03-01.",
        "From 2024-01-01 to 2024-03-01 is 60 days (Jan has 31, Feb 2024 has 29). The patient had to survive all 60 of those days just to live long enough to fill the drug.",
        "Counting those 60 days as user follow-up means user follow-up runs 2024-01-01 to 2024-08-28 = 240 days, and those first 60 days are guaranteed event-free — that is the immortal stretch.",
        "Right way: set time zero at the fill date (2024-03-01) and only count days after that. Follow-up runs 2024-03-01 to 2024-08-28 = 180 days.",
        "The difference, 240 − 180 = 60 days, is exactly the immortal time the wrong design wrongly handed to the user group."
      ],
      "result": "Misaligned follow-up = 240 days vs aligned follow-up = 180 days; the misaligned design credits 60 extra event-free (immortal) days (2024-01-01 to 2024-03-01) to the beta-blocker group that aligning time zero at the 2024-03-01 fill removes.",
      "timeline_spec": {
        "title": "Immortal time from a misaligned time zero, and how aligning t0 at the first fill removes it (one post-MI patient)",
        "window": {
          "start": "2024-01-01",
          "end": "2024-08-28",
          "label": "Observation window: 2024-01-01 discharge to 2024-08-28 data end"
        },
        "events": [
          {
            "label": "Hospital discharge (eligibility event)",
            "start": "2024-01-01",
            "length_days": 0,
            "quantity": "day 0 of the wrong clock"
          },
          {
            "label": "First beta-blocker fill = correct time zero",
            "start": "2024-03-01",
            "length_days": 90,
            "quantity": "90 days_supply"
          }
        ],
        "spans": [
          {
            "kind": "unexposed",
            "start": "2024-01-01",
            "end": "2024-02-29",
            "label": "60-day immortal stretch wrongly credited to the user group"
          },
          {
            "kind": "followup",
            "start": "2024-03-01",
            "end": "2024-08-28",
            "label": "180 aligned follow-up days (clock starts at the fill)"
          },
          {
            "kind": "washout",
            "start": "2023-03-02",
            "end": "2024-02-29",
            "label": "365-day washout: no prior beta-blocker before t0"
          }
        ],
        "result": {
          "label": "Misaligned 240 days − aligned 180 days = 60 immortal days removed by aligning t0",
          "value": 60
        },
        "caption": "Top: starting the clock at discharge (2024-01-01) while labeling the patient a user from a 2024-03-01 fill banks a 60-day immortal stretch the patient had to survive to fill the drug. Bottom: setting time zero at the 2024-03-01 fill counts only the 180 real follow-up days, removing the immortal time.",
        "alt_text": "Timeline for one post-heart-attack patient showing a 60-day immortal interval between the 2024-01-01 discharge and the 2024-03-01 first beta-blocker fill, a 365-day washout before time zero, and 180 days of aligned follow-up that begins at the fill date."
      }
    },
    "prerequisites": [],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "First qualifying fill as time zero (pharmacy claims)",
        "description": "t0 = first paid dispensing of the study drug (or active comparator) after a clean washout and continuous-enrollment eligibility. Standard for chronic oral therapies; classification is fully determined at t0.",
        "edge_cases": [
          "Same-day mail-order plus retail fills, or reversed-and-rebilled claims, that split or duplicate the index event before the min(fill_date) is taken.",
          "Plan switch on the index date that breaks continuous-enrollment verification.",
          "Free samples / inpatient administration not captured in pharmacy claims, so the true initiation predates the first observed fill."
        ],
        "data_source_notes": "claims: use the dispensing date (fill_date, i.e. the pharmacy claim's service_date) of paid, non-reversed claims; require continuous A/B/D or commercial medical+pharmacy enrollment across the full washout, excluding MA-only spans where FFS fills are absent."
      },
      {
        "name": "Procedure / service date as time zero",
        "description": "t0 = date of a qualifying procedure (stent, ablation, joint replacement, transplant). The \"treatment\" is the procedure; classification is determinable at t0 but the procedure may itself condition on survival.",
        "edge_cases": [
          "Multiple procedures on the same day—pre-specify which CPT/ICD-PCS defines the index.",
          "Procedure performed at an out-of-network or transfer facility and not captured.",
          "Procedures (transplant, major surgery) that define survival, making naive \"time to procedure\" immortal; require a g-method or sequential-trial design."
        ],
        "data_source_notes": "claims: reconcile facility, professional, and revenue claims; link diagnosis codes for indication. Be explicit that the procedure date is the assignment moment, not a downstream confirmation."
      },
      {
        "name": "Diagnosis/encounter time zero with sequential or landmark assignment",
        "description": "A qualifying diagnosis or hospitalization fixes the eligibility clock, but treatment is decided shortly afterward. To avoid immortal time, exposure is handled as time-varying or via nested sequential trials / a common landmark applied identically to both arms.",
        "edge_cases": [
          "Patients who die or are censored before the treatment decision window closes.",
          "Apparently new diagnosis codes that are actually rule-out or history-of codes, inflating the eligible pool with prevalent disease."
        ],
        "data_source_notes": "EHR: encounter date + problem-list onset can be cleaner than claims but is visit-driven; still requires an explicit washout for prior diagnoses of the same condition."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Ever-exposed / time-fixed classification with follow-up from a fixed eligibility event (e.g., diagnosis or discharge)",
        "pros_of_this": "Eliminates immortal time by requiring exposure to be determinable at t0; yields an interpretable initiation estimand with no inflated event-free person-time.",
        "cons_of_this": "Requires either a true incident-initiation t0 or an explicit time-varying/landmark model; cannot be retrofitted onto a design that already classified exposure using follow-up data.",
        "when_to_prefer": "Always, for any comparative or absolute time-to-event estimate; the ever-exposed alternative is a different, biased estimand, not a simpler version."
      },
      {
        "compared_to": "Prevalent-user (current-user at a calendar date) design",
        "pros_of_this": "Removes left-truncation, depletion of susceptibles, and adjustment for post-initiation mediators by anchoring t0 at incident initiation.",
        "cons_of_this": "Smaller cohorts; initiators may not represent the prevalent users who dominate routine practice.",
        "when_to_prefer": "Whenever incident initiation is observable; fall back to a prevalent new-user (Suissa) design only when initiation is too rare."
      },
      {
        "compared_to": "Clone-censor-weight / sequential target-trial emulation",
        "pros_of_this": "A single aligned t0 is far simpler to specify, communicate, and audit for clean two-drug initiation contrasts.",
        "cons_of_this": "A hard single t0 cannot handle grace-period eligibility ambiguity, sustained/dynamic strategies, or multi-option regimens without reintroducing selection.",
        "when_to_prefer": "Clean initiation contrasts with an unambiguous assignment date; escalate to CCW when the protocol requires a dynamic per-protocol estimand."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "t0 = first qualifying fill or procedure service_date. Require continuous A/B/D or commercial medical+pharmacy enrollment across the entire washout through t0 so absence of prior exposure is observed, not missing; exclude Medicare Advantage-only person-time where FFS claims are unavailable. Dedupe same-day/rebilled claims before taking min(fill_date). Measure covariates only in [t0 − washout, t0].",
      "ehr": "t0 = order or administration; prefer linked dispensing to confirm the patient actually started. Visit-driven capture makes the up-to-t0 covariate window sparse and potentially informatively missing; define observation windows explicitly and treat loss to follow-up as potentially informative.",
      "registry": "The qualifying clinical event (diagnosis/stage/index procedure) often is t0; verify it is not an outcome-informed post-hoc date. Link to claims for fills and to a death index for censoring/mortality.",
      "linked": "Order, fill, and service dates can disagree by days to weeks; pre-specify which date defines t0 and reconcile the others before assignment, since slippage changes incidence status and covariate-window membership."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\n\nWASHOUT_DAYS = 365  # drug-free + continuous-enrollment lookback that defines an incident (new) user\n\ndef assign_time_zero(rx: pd.DataFrame, enroll: pd.DataFrame) -> pd.DataFrame:\n    rx = rx.sort_values([\"person_id\", \"fill_date\"])\n    study = rx[rx[\"drug_class\"].isin([\"STUDY\", \"COMPARATOR\"])].copy()\n\n    # De-duplicate same-day/rebilled fills so the index event is a single row, then take the EARLIEST qualifying fill.\n    study = study.drop_duplicates(subset=[\"person_id\", \"fill_date\", \"drug_class\"])\n    idx = (study.groupby(\"person_id\", as_index=False)\n                .first()\n                .rename(columns={\"fill_date\": \"index_date\", \"drug_class\": \"arm\"}))\n\n    # Immortal-time guard: t0 (= the first qualifying fill) is by construction the classification date.\n    # Assert each index_date equals that person's earliest qualifying fill -> classification uses only info <= t0.\n    chk = study.merge(idx[[\"person_id\", \"index_date\"]], on=\"person_id\")\n    offending = chk[chk[\"index_date\"] != chk.groupby(\"person_id\")[\"fill_date\"].transform(\"min\")]\n    assert offending.empty, \"arm assignment must use only the earliest fill (no post-t0 information)\"\n\n    # New-user: no study/comparator fill in the washout window strictly before t0.\n    prior = study.merge(idx[[\"person_id\", \"index_date\"]], on=\"person_id\")\n    in_washout = prior[(prior[\"fill_date\"] < prior[\"index_date\"]) &\n                       (prior[\"fill_date\"] >= prior[\"index_date\"] - pd.Timedelta(days=WASHOUT_DAYS))]\n    idx = idx[~idx[\"person_id\"].isin(in_washout[\"person_id\"])].copy()\n\n    # Continuous, FFS-observable enrollment spanning the full washout through t0 (no MA-only gaps).\n    e = enroll.merge(idx[[\"person_id\", \"index_date\"]], on=\"person_id\")\n    e[\"covers\"] = ((e[\"enroll_start\"] <= e[\"index_date\"] - pd.Timedelta(days=WASHOUT_DAYS)) &\n                   (e[\"enroll_end\"]   >= 
e[\"index_date\"]) & (~e[\"ma_only\"]))\n    eligible = e.loc[e[\"covers\"], \"person_id\"].unique()\n\n    cohort = idx[idx[\"person_id\"].isin(eligible)].copy()\n    cohort[\"baseline_start\"] = cohort[\"index_date\"] - pd.Timedelta(days=WASHOUT_DAYS)  # covariate window opens here\n    return cohort[[\"person_id\", \"arm\", \"index_date\", \"baseline_start\"]]",
        "description": "Time-zero assignment from claims-style inputs, with an explicit immortal-time guard.\nRequired inputs (already cleaned and de-duplicated):\n  rx     : pharmacy fills    -> person_id, fill_date (datetime64), drug_class in {'STUDY','COMPARATOR'}, days_supply (int)\n  enroll : enrollment spans  -> person_id, enroll_start, enroll_end (datetime64), ma_only (bool)  # ma_only lacks FFS claims\nReturns one incident initiator per row with arm and aligned index_date. The function FAILS LOUD if any chosen index_date\nwould be classified using a fill that post-dates t0 (the immortal-time error). Build covariates and the PS only from\n[baseline_start, index_date]; apply outcome/censoring rules identically to both arms downstream.",
        "dependencies": [
          "pandas"
        ],
        "source_citations": [
          "lund-2015"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\nWASHOUT_DAYS <- 365L\n\nassign_time_zero <- function(rx, enroll) {\n  setDT(rx); setDT(enroll)\n  study <- rx[drug_class %chin% c(\"STUDY\", \"COMPARATOR\")]\n  # Drop same-day / rebilled duplicates, then take the EARLIEST qualifying fill as t0 and arm.\n  study <- unique(study, by = c(\"person_id\", \"fill_date\", \"drug_class\"))\n  setorder(study, person_id, fill_date)\n  idx <- study[, .(index_date = fill_date[1L], arm = drug_class[1L]), by = person_id]\n\n  # Immortal-time guard: every index_date must equal that person's earliest qualifying fill.\n  # Both tables are grouped from the same sorted `study`, so their rows line up one-to-one.\n  mins <- study[, .(min_fill = min(fill_date)), by = person_id]\n  stopifnot(identical(idx$person_id, mins$person_id), all(idx$index_date == mins$min_fill))\n\n  # New-user: no study/comparator fill in the washout window strictly before t0.\n  study2 <- merge(study, idx[, .(person_id, index_date)], by = \"person_id\")\n  prior_ids <- unique(study2[fill_date < index_date &\n                             fill_date >= index_date - WASHOUT_DAYS, person_id])\n  # %in% rather than %chin%: %chin% accepts character vectors only, and person_id may be numeric.\n  idx <- idx[!person_id %in% prior_ids]\n\n  # Continuous, FFS-observable enrollment across the full washout through t0 (no MA-only spans).\n  e <- merge(enroll, idx[, .(person_id, index_date)], by = \"person_id\")\n  ok <- e[enroll_start <= index_date - WASHOUT_DAYS &\n          enroll_end   >= index_date & !ma_only, unique(person_id)]\n\n  cohort <- idx[person_id %in% ok]\n  cohort[, baseline_start := index_date - WASHOUT_DAYS]\n  cohort[, .(person_id, arm, index_date, baseline_start)]\n}",
        "description": "Time-zero assignment with data.table; inputs mirror the Python version.\n  rx     : person_id, fill_date (Date), drug_class in {'STUDY','COMPARATOR'}, days_supply (int)\n  enroll : person_id, enroll_start, enroll_end (Date), ma_only (logical)\nStops with an error if arm assignment would use anything other than the earliest qualifying fill (immortal-time guard).",
        "dependencies": [
          "data.table"
        ],
        "source_citations": [
          "lund-2015"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "%let washout = 365;\n\n/* Earliest study/comparator fill = candidate t0, with arm = drug_class on that fill (info available at/<= t0 only). */\nproc sort data=work.rx(where=(drug_class in ('STUDY','COMPARATOR'))) out=rx_q;\n  by person_id fill_date;\nrun;\n\ndata idx;\n  set rx_q;\n  by person_id;\n  if first.person_id;\n  index_date = fill_date;\n  format index_date date9.;\n  length arm $12;\n  arm = drug_class;\n  keep person_id index_date arm;\nrun;\n\n/* New-user restriction: no prior study/comparator fill inside the washout window before t0. */\nproc sql;\n  create table newuser as\n  select i.* from idx i\n  where not exists (\n    select 1 from work.rx p\n    where p.person_id = i.person_id and p.drug_class in ('STUDY','COMPARATOR')\n      and p.fill_date <  i.index_date\n      and p.fill_date >= i.index_date - &washout);\nquit;\n\n/* Continuous, FFS-observable enrollment across the full washout through t0 (exclude MA-only spans). */\nproc sql;\n  create table cohort as\n  select n.person_id, n.arm, n.index_date,\n         n.index_date - &washout as baseline_start format=date9.\n  from newuser n\n  where exists (\n    select 1 from work.enroll e\n    where e.person_id = n.person_id and e.ma_only = 0\n      and e.enroll_start <= n.index_date - &washout\n      and e.enroll_end   >= n.index_date);\nquit;\n\n/* Build follow-up time and a competing-risk-aware status FROM t0 (this alignment is what removes immortal time). */\ndata analysis;\n  merge cohort(in=a) work.outcome;\n  by person_id;\n  if a;\n  end_date = min(coalesce(event_date, data_end_date),\n                 coalesce(death_date, data_end_date),\n                 coalesce(disenroll_date, data_end_date), data_end_date);\n  futime = end_date - index_date;                 /* clock starts at t0, never before */\n  if (event_date ne . and event_date <= end_date) then status = 1;       /* event of interest */\n  else if (death_date ne . 
and death_date <= end_date) then status = 2;  /* competing death  */\n  else status = 0;                                                       /* censored          */\nrun;\n\n/* Cause-specific hazard (censors competing deaths): */\nproc phreg data=analysis;\n  class arm(ref='COMPARATOR');\n  model futime*status(0 2) = arm;                 /* status=2 treated as censored -> cause-specific HR */\nrun;\n\n/* Subdistribution (Fine-Gray) for the cumulative-incidence estimand under competing death: */\nproc phreg data=analysis;\n  class arm(ref='COMPARATOR');\n  model futime*status(0) = arm / eventcode=1;     /* event of interest = 1, death = competing */\nrun;",
        "description": "Time-zero assignment and an immortal-time-safe time-to-event setup in SAS. Required input datasets (post data-management):\n  work.rx     : person_id, fill_date, drug_class ('STUDY'/'COMPARATOR'), days_supply\n  work.enroll : person_id, enroll_start, enroll_end, ma_only (0/1)\n  work.outcome: person_id, event_date, death_date, disenroll_date, data_end_date\nCohort entry is the EARLIEST qualifying fill after washout. PROC PHREG models follow-up FROM t0 (not from a prior eligibility\nevent), which is what prevents immortal time; the cause-specific vs subdistribution choice is shown explicitly.",
        "dependencies": [],
        "source_citations": [
          "lund-2015"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": "time-zero-index-date-alignment-rwe-timeline.svg",
        "mermaid": null,
        "caption": "Top: starting the clock at discharge (2024-01-01) while labeling the patient a user from a 2024-03-01 fill banks a 60-day immortal stretch the patient had to survive to fill the drug. Bottom: setting time zero at the 2024-03-01 fill counts only the 180 real follow-up days, removing the immortal time.",
        "alt_text": "Timeline for one post-heart-attack patient showing a 60-day immortal interval between the 2024-01-01 discharge and the 2024-03-01 first beta-blocker fill, a 365-day washout before time zero, and 180 days of aligned follow-up that begins at the fill date.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Pop[Source population meeting clinical eligibility] --> Wash[Continuous enrollment + washout<br/>no prior study/comparator exposure]\n  Wash --> Decide{Is exposure determinable<br/>at a single decision date?}\n  Decide -->|Yes: incident initiation/procedure| T0[Time zero = that date<br/>assign arm from info <= t0 only]\n  Decide -->|No: exposure accrues over follow-up| TV[Time-varying / landmark / sequential trials<br/>do NOT fix one baseline class]\n  T0 --> Base[Baseline covariates measured only in t0 - washout, t0 -> PS]\n  TV --> Base\n  Base --> Fup[Follow-up starts at t0<br/>identical outcome + censoring rules both arms]\n  Fup --> Sens[Diagnostics: t0 distribution by arm, attrition,<br/>patient timelines, washout/grace sensitivity]",
        "caption": "Decision logic for time-zero alignment. The pivotal branch is whether exposure can be classified using only information available at a single date; if not, a single fixed t0 manufactures immortal time and a time-varying/landmark/sequential design is required instead.",
        "alt_text": "Flowchart from source population through washout to a decision node on whether exposure is determinable at one date, branching to a single time zero or to time-varying designs, then baseline measurement, follow-up, and diagnostics.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "gantt\n  title Immortal time created by misaligned t0 (and how alignment fixes it)\n  dateFormat YYYY-MM-DD\n  axisFormat %b %Y\n  section WRONG: follow-up from discharge, exposure from future\n  Eligibility event (discharge) :milestone, m1, 2024-01-01, 0d\n  Immortal interval (must survive to fill) :crit, imm, 2024-01-01, 60d\n  First exposure fill (used to classify) :milestone, m2, 2024-03-01, 0d\n  section RIGHT: aligned time zero\n  Washout + continuous enrollment :done, w, 2023-01-01, 2023-12-31\n  Time zero = first qualifying fill :milestone, t0, 2024-03-01, 0d\n  Follow-up at risk from t0 :active, fu, 2024-03-01, 180d",
        "caption": "The top track shows the immortal interval—event-free time wrongly credited to the exposed group when follow-up starts at discharge but exposure is defined by a later fill. The bottom track aligns t0 at the first qualifying fill so that interval is removed from the analysis.",
        "alt_text": "Gantt chart contrasting a misaligned design with an immortal interval between discharge and first fill against an aligned design where follow-up begins at the first qualifying fill.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "requires",
        "target_slug": "active-comparator-new-user",
        "notes": "The new-user restriction is the standard operational mechanism for setting t0 at incident initiation; ACNU cohort construction assigns the arm from the dispensed NDC at t0."
      },
      {
        "relation_type": "used_with",
        "target_slug": "target-trial-emulation",
        "notes": "Time zero is the emulated randomization point; aligning eligibility, assignment, and follow-up onset to it is the core of target-trial emulation."
      },
      {
        "relation_type": "affects",
        "target_slug": "immortal-time-bias-handling",
        "notes": "Misaligned t0 (follow-up before exposure is determinable) is the direct cause of immortal time; aligned t0 or a time-varying/landmark design removes it."
      },
      {
        "relation_type": "see_also",
        "target_slug": "prevalent-user-bias",
        "notes": "Setting t0 at a calendar date among ongoing users introduces left-truncation and depletion of susceptibles; an incident-initiation t0 avoids both."
      },
      {
        "relation_type": "used_with",
        "target_slug": "propensity-score-methods-psm-iptw",
        "notes": "PS balancing must use only covariates measured in the pre-t0 window; t0 defines the boundary of admissible confounders."
      }
    ],
    "aliases": [
      "time zero",
      "index date",
      "index date definition",
      "cohort entry date",
      "t0 alignment",
      "time-zero alignment"
    ],
    "complexity": "foundational",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "tipping-point-analysis-rwe",
    "name": "Tipping Point Analysis",
    "short_definition": "A threshold-search sensitivity analysis that finds how large a bias parameter (missing-outcome event rate, unmeasured-confounder strength, misclassification, or selection) must become before the effect estimate crosses a pre-specified decision threshold, then judges whether that magnitude is clinically plausible.",
    "long_description": "A **tipping-point analysis** inverts the usual sensitivity-analysis question. Instead of fixing a bias assumption and reporting\nhow the estimate moves (\"if the event rate among dropouts were 30%, the risk difference would be X\"), it fixes the *conclusion*\nand searches for the assumption that overturns it (\"the event rate among treated dropouts would have to exceed 41% — versus\n18% among completers — before the benefit disappears\"). The deliverable is a **tipping value** (or tipping *surface* over two\nparameters) plus an explicit judgment: is the assumption required to break the result scientifically credible? It converts a\nbias of unknown size into a single decision-facing number that a regulator, HTA committee, or clinician can interrogate.\n\n**Core conceptual distinction.** A tipping-point analysis is defined by three commitments that separate it from ordinary\nscenario analysis. (1) *A decision threshold, not a p-value alone.* The tipping point is the bias value at which the estimate\ncrosses a pre-specified, interpretable boundary — RR = 1, a non-inferiority margin, an ICER willingness-to-pay line, or the\nlower confidence bound reaching null. (2) *A search, not a grid of arbitrary scenarios.* You sweep the bias parameter (a\ndelta-adjustment on the missing-outcome odds, a confounder prevalence-difference × confounder-outcome association, a\nsensitivity/specificity pair) until the threshold is reached, rather than reporting three hand-picked \"what ifs.\" (3) *A\nplausibility benchmark.* The tipping value is meaningless without an anchor — the observed completer event rate, a\nreference-arm rate under a \"jump-to-reference\" assumption, or the strength of a known measured confounder for the E-value.\nThis is what distinguishes a rigorous tipping point from a number-generating exercise.\n\nThe **estimand must be fixed first**, and it governs the math. 
On the **risk-difference / risk-ratio** scale (binary outcome,\nmissing-outcome tipping point), you re-impute outcomes among non-completers under increasingly adverse Missing-Not-At-Random\n(MNAR) assumptions. On the **hazard / rate** scale (time-to-event in claims), the analogous bias parameter is a multiplicative\nshift on the post-censoring hazard among the differentially censored, or — for unmeasured confounding — the\n**E-value**, which is exactly the tipping point for the confounder-exposure and confounder-outcome risk ratios needed to\nexplain away the observed RR (VanderWeele & Ding, 2017). The E-value is a *point* tipping metric for one bias; a full\ndelta-adjustment tipping surface is the *functional* generalization across one or two bias parameters.\n\n**Pros, cons, and trade-offs.**\n- **vs the E-value (a closed-form tipping point for unmeasured confounding):** the E-value is a one-line, assumption-light\n  summary on the RR scale and is excellent for a primary unmeasured-confounding statement. Cost: it addresses only one bias,\n  assumes the RR scale, and gives no surface over jointly varying parameters. **Prefer a full tipping-point search** when the\n  threatening bias is missing-outcome data, differential censoring, or misclassification, or when you must show a *surface*\n  (e.g., confounder prevalence × strength) rather than a single number.\n- **vs deterministic/probabilistic quantitative bias analysis (QBA):** QBA propagates a *distribution* of bias parameters to\n  produce a bias-adjusted estimate with an interval — it answers \"what is the corrected effect?\" Tipping-point analysis\n  answers the inverse, \"how bad would the bias have to be to matter?\" The two are complementary: QBA is superior when you have\n  informative external bias priors (a validation substudy); the tipping point is superior when you lack priors but can judge\n  plausibility against a clinical anchor. 
**Prefer the tipping point** for a transparent robustness statement to a\n  non-statistical audience; **prefer probabilistic QBA** when defensible bias priors exist and you want a corrected estimand.\n- **vs reporting raw best/worst-case imputation (extreme-case analysis):** worst-case (all treated dropouts have the event,\n  all comparator dropouts do not) is the *endpoint* of the tipping sweep, almost always implausibly extreme, and routinely\n  over-rejects. The tipping point locates the *credible* breaking value in between. **Always prefer the tipping point** over a\n  bare best/worst-case table; the latter is a strawman.\n- **vs multiple-imputation under MAR alone:** MAR imputation is the primary analysis, not a sensitivity analysis: it assumes\n  the missingness mechanism away. The tipping point is what you add *on top of* MAR to probe departures from it (delta- and\n  reference-based MNAR adjustments, per Cro et al., 2020).\n\n**When to use.** (a) A primary RWE result is positive and the dominant threat is non-trivial loss to follow-up,\ndisenrollment, or missing outcomes: quantify how adverse the unobserved outcomes must be to erase the effect. (b) An\nunmeasured-confounding statement is required for FDA/EMA/HTA submission: report the E-value and, if a single number is\ninsufficient, a prevalence × strength tipping surface. (c) A non-inferiority or cost-effectiveness conclusion sits near its\ndecision boundary and reviewers will ask \"how fragile is this?\" (d) Outcome misclassification from a claims algorithm is\nplausible and you must show how low PPV/sensitivity would have to fall to change the inference.\n\n**When NOT to use, and when it is actively misleading or dangerous.**\n- **As a substitute for a credible design or for bias correction when priors exist.** A tipping point on a cohort riddled with\n  confounding by indication launders a broken design; fix time-zero, comparator, and adjustment first. If you have a
If you have a\n  validation substudy giving sensitivity/specificity, *correct* the estimate (QBA) — do not merely report how far it could be wrong.\n- **When the threshold is a bare p < 0.05 with no clinical meaning.** Tipping a result from p = 0.04 to p = 0.06 is not a\n  robustness finding; anchor to an effect-size or decision boundary, never to statistical significance alone.\n- **When the bias parameter has no plausibility anchor.** Reporting \"the event rate among dropouts would have to be 41%\"\n  without the completer rate (18%) and a clinical reference is uninterpretable — the audience cannot judge whether 41% is\n  absurd or routine. A tipping value without a benchmark is theater.\n- **When the differential is in the *protective* direction and you only sweep one arm.** A missing-outcome tipping point must\n  allow adverse imputation in the arm that favors your conclusion AND favorable imputation in the other; a one-sided sweep\n  manufactures false robustness.\n- **When multiple biases co-occur and you report each tipping point in isolation.** Real RWE faces confounding, missingness,\n  and misclassification simultaneously; individual tipping points can each look reassuring while their *joint* plausible\n  departure breaks the result. State this limitation or use a joint surface.\n\n**Data-source operational depth.**\n- **Claims (FFS):** The canonical use is differential **disenrollment** and outcome-algorithm uncertainty. Loss to follow-up\n  is administrative censoring; the tipping question is how much higher the post-disenrollment event rate would have to be in\n  one arm. 
Failure mode: **Medicare Advantage person-time carries no fee-for-service claims**, so \"no event after month 9\" can\n  reflect unobserved MA-plan enrollment rather than a true event-free interval. Never count MA-only follow-up as observed\n  event-free time; restrict to A/B/D-observable person-time before computing the tipping point, or the entire missing-outcome sweep\n  is built on phantom denominators. Second failure mode: **differential competing risks**. In elderly claims cohorts, death\n  from a competing cause silently truncates the at-risk set differently by exposure; a naive missing-outcome tipping point that\n  ignores the competing event will understate fragility.\n- **EHR:** Missingness is **visit-driven and informative**: a patient who improves may stop returning, so labs/PROs are\n  Missing-Not-At-Random by construction. The delta-adjustment tipping point is the natural tool, but the plausibility anchor\n  must come from the *within-system* completer distribution, not population norms. Failure mode: a patient who leaves the\n  health system for care elsewhere looks identical to a true loss; endpoints ascertained only inside the system make the worst\n  case (event happened, unobserved) genuinely plausible, widening the credible tipping region.\n- **Registry:** Strong for adjudicated outcomes and disease severity, so the *outcome* is usually well-captured; the tipping\n  threat is **selection at enrollment and incomplete exposure**. 
Link to claims for fills and to a death index; otherwise the\n  missing-outcome tipping point conflates \"no recorded event\" with \"lost to the registry.\"\n- **Linked claims–EHR–vital records:** The ideal substrate (EHR severity + claims completeness + reliable mortality narrows\n  the credible tipping region), but **linkage selection** means the tipping point applies to the linkable subset only;\n  date discrepancies between order, fill, and service dates can manufacture **immortal time** that, if uncorrected, makes a\n  spurious benefit look robust to any sweep.\n\n**Worked claims example (missing-outcome tipping point with FFS censoring).** Question: 12-month incident major bleeding,\napixaban vs warfarin, new-user active-comparator cohort in a commercial + Medicare FFS database. Primary analysis (PS-weighted)\ngives a 12-month risk difference of RD = −2.4 percentage points favoring apixaban (apixaban 3.1% vs warfarin 5.5%). The threat:\napixaban initiators disenroll faster (22% lost to follow-up by month 12 vs 14% for warfarin), and an MA-only span looks like an\nevent-free interval. Build the cohort with continuous A/B/D enrollment so absence of a bleeding claim is observed, not MA\nmissingness; classify each person as completer (event/no-event) or lost-to-follow-up (LTFU). Taking the completer event rates\nas **3.1%** (apixaban) and **5.5%** (warfarin), hold warfarin LTFU at the warfarin completer rate (5.5%), which leaves the\noverall warfarin risk at 5.5%, and sweep the imputed event probability `p_a` among the **apixaban LTFU** from the\napixaban completer rate (3.1%) upward. The **tipping point** is the `p_a` at which the recomputed RD reaches 0. With 22%\napixaban LTFU the overall apixaban risk is 0.78 × 3.1% + 0.22 × `p_a`; setting this equal to 5.5% gives `p_a ≈ 14%`.\nThat is, bleeding among apixaban dropouts would have to run\nroughly **4.5× the apixaban completer rate** (14% vs 3.1%), and exceed even the warfarin completer rate, before the benefit\nvanishes. 
Anchor: there is no clinical mechanism for apixaban *dropouts* to bleed at 4.5× completers and above warfarin, so the\nresult is judged robust to differential disenrollment. Report the tipping value, the completer-rate anchor, and the explicit\nplausibility judgment, never the bare number.\n\n**Interpreting the output**\n\nIn the Drug A versus Drug B comparison, completers only: Drug A 24 events in 800 = 3.0%;\nDrug B 54 events in 900 = 6.0%; RD among completers = −3.0 pp. Dropout: Drug A 200 patients\n(20%), Drug B 100 (10%). With the 100 missing Drug B patients held at the 6.0% completer rate,\nthe tipping-point analysis identifies the assumed event rate among the 200 missing Drug A\npatients at which the overall RD reaches zero: 18.0%, six times the 3.0% completer rate.\n\n*(1) Formal interpretation.* The tipping point is the value of an unobserved quantity (the\nevent rate among the missing stratum) at which the study conclusion reverses. It does not\nestimate that rate; it defines the boundary of the plausibility space. Under an MNAR\ndelta-adjustment model, if missing Drug A patients experienced events at any rate below 18.0%, the\noverall RD remains negative (Drug A still favored). The sensitivity table scanning\nfrom 3.0% to 21.0% in 3 pp increments maps the full plausibility range against the tipping\nthreshold and allows a reviewer to locate any clinically defensible assumption on the grid.\n\n*(2) Practical interpretation.* An assumed dropout event rate of 18.0% is six times the\nobserved completer rate of 3.0% in the same arm. A clinical reviewer would need to posit\nthat patients who stopped Drug A were dramatically sicker, or that stopping itself caused\na catastrophic increase in event risk, to erase the 3.0 pp benefit. If the best clinical\nexplanation for why patients discontinued Drug A suggests a rate closer to 6–9%, the tipping\npoint is not met and the observed benefit is considered robust to plausible dropout\nassumptions under MNAR.",
    "primary_category": "Bias_Control",
    "tags": [
      "tipping-point",
      "robustness",
      "missing-data",
      "sensitivity-analysis",
      "delta-adjustment",
      "mnar",
      "unmeasured-confounding",
      "qba",
      "hta",
      "regulatory"
    ],
    "applies_to_study_types": [
      "cohort_retrospective",
      "claims_analysis",
      "ehr_study",
      "target_trial_emulation",
      "pragmatic_trial"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "linked",
      "registry",
      "primary"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1002/sim.6197",
        "url": "https://doi.org/10.1002/sim.6197",
        "citation_text": "Liublinska V, Rubin DB. Sensitivity analysis for a partially missing binary outcome in a two-arm randomized clinical trial. Statistics in Medicine. 2014;33(24):4170-4185.",
        "year": 2014,
        "authors_short": "Liublinska & Rubin",
        "notes": "Foundational formalization of the tipping-point method for a missing binary outcome, including the two-dimensional tipping-surface display that searches MNAR imputation parameters until the conclusion reverses."
      },
      {
        "role": "explain",
        "doi": "10.7326/M16-2607",
        "url": "https://doi.org/10.7326/M16-2607",
        "citation_text": "VanderWeele TJ, Ding P. Sensitivity analysis in observational research: introducing the E-value. Annals of Internal Medicine. 2017;167(4):268-274.",
        "year": 2017,
        "authors_short": "VanderWeele & Ding",
        "notes": "The E-value is the closed-form tipping point for unmeasured confounding on the risk-ratio scale — the strength an unmeasured confounder would need on both the exposure and outcome to nullify the estimate."
      },
      {
        "role": "explain",
        "doi": "10.1002/sim.8569",
        "url": "https://doi.org/10.1002/sim.8569",
        "citation_text": "Cro S, Morris TP, Kenward MG, Carpenter JR. Sensitivity analysis for clinical trials with missing continuous outcome data using controlled multiple imputation: a practical guide. Statistics in Medicine. 2020;39(21):2815-2842.",
        "year": 2020,
        "authors_short": "Cro et al.",
        "notes": "Practical controlled-multiple-imputation framework (delta-adjustment and reference-based MNAR) that supplies the imputation engine underneath delta-based tipping-point sweeps."
      },
      {
        "role": "demonstrate",
        "doi": "10.1080/10543406.2022.2058525",
        "url": "https://doi.org/10.1080/10543406.2022.2058525",
        "citation_text": "Gorst-Rasmussen A, Tarp JM, Furberg JK. Fast tipping point sensitivity analyses in clinical trials with missing continuous outcomes under multiple imputation. Journal of Biopharmaceutical Statistics. 2022;32(6):942-953.",
        "year": 2022,
        "authors_short": "Gorst-Rasmussen et al.",
        "notes": "Computationally efficient algorithm for searching the tipping surface under multiple imputation, demonstrating the method at scale rather than via a brute-force grid."
      },
      {
        "role": "use",
        "doi": "10.17226/12955",
        "url": "https://doi.org/10.17226/12955",
        "citation_text": "National Research Council. The Prevention and Treatment of Missing Data in Clinical Trials. Washington, DC: National Academies Press; 2010.",
        "year": 2010,
        "authors_short": "National Research Council",
        "notes": "The report that made MNAR sensitivity analyses (tipping-point/delta-adjustment) an expected component of regulatory submissions when missing data could threaten the primary conclusion."
      },
      {
        "role": "use",
        "doi": null,
        "url": "https://database.ich.org/sites/default/files/E9-R1_Step4_Guideline_2019_1203.pdf",
        "citation_text": "ICH E9(R1) Addendum on Estimands and Sensitivity Analysis in Clinical Trials. 2019.",
        "year": 2019,
        "authors_short": "ICH E9(R1)",
        "notes": "Regulatory anchor requiring that sensitivity analyses (including tipping points) be aligned with the pre-specified estimand and the handling of intercurrent events."
      }
    ],
    "plain_language_summary": "A tipping-point analysis asks a single practical question about a research finding: how wrong would things have to be before the result falls apart? You keep the conclusion fixed and then search for the breaking point — the exact amount of missing information, hidden difference between groups, or measurement error that would erase your result. Once you find that breaking point, you ask whether such a large flaw is actually plausible given what you know about the data, and that judgment tells you whether to trust the finding.",
    "key_terms": [
      {
        "term": "sensitivity analysis",
        "definition": "A test that checks whether a study's main finding still holds when you change one of the assumptions it rests on."
      },
      {
        "term": "unmeasured confounding",
        "definition": "A hidden difference between the treatment and comparison groups — something not recorded in the data — that could explain the observed result instead of the treatment itself."
      },
      {
        "term": "tipping point",
        "definition": "The smallest bias assumption large enough to flip a statistically significant result to non-significant (or to cross another pre-chosen decision line), expressed as a single number you can judge against what is clinically believable."
      },
      {
        "term": "plausibility anchor",
        "definition": "A real-world reference value — such as the observed event rate among completers, or the strength of a known measured risk factor — used to judge whether the tipping point represents a realistic flaw."
      },
      {
        "term": "risk difference",
        "definition": "The gap in event rates between two groups, expressed as a percentage-point difference (e.g., 5% vs 3% = a 2-percentage-point difference)."
      }
    ],
    "worked_example": {
      "scenario": "A study compared two blood-pressure medications — Drug A (the newer one) and Drug B (the standard) — in 1,000 patients followed for one year. The primary finding: patients on Drug A had a significantly lower rate of serious cardiac events (RD = -3.0 percentage points). An outside reviewer notes that Drug A patients were more likely to drop out of the database early (20% lost vs 10% for Drug B), possibly because they switched to a health plan whose claims are not captured. The worry: if those missing Drug A patients actually had high event rates, the benefit could disappear. A tipping-point analysis asks exactly how high that missing-patient event rate would have to be.",
      "dataset": {
        "caption": "Observed outcome counts at end of follow-up for patients who stayed in the database the full year (completers) and patients who left early (lost to follow-up).",
        "columns": [
          "group",
          "arm",
          "n",
          "events",
          "event_rate_pct"
        ],
        "rows": [
          [
            "Completers",
            "Drug A",
            800,
            24,
            3.0
          ],
          [
            "Completers",
            "Drug B",
            900,
            54,
            6.0
          ],
          [
            "Lost to follow-up",
            "Drug A",
            200,
            "unknown",
            "?"
          ],
          [
            "Lost to follow-up",
            "Drug B",
            100,
            "unknown",
            "?"
          ]
        ]
      },
      "steps": [
        "Start with what we observe: among completers, Drug A event rate = 24/800 = 3.0% and Drug B event rate = 54/900 = 6.0%, so the risk difference is -3.0 percentage points favoring Drug A.",
        "Set the tipping target: we want to find the event rate among the 200 missing Drug A patients that would push the overall risk difference from -3.0% all the way to 0% (no difference).",
        "Give the missing Drug B patients the most favorable imputation: assume their event rate equals the Drug B completer rate, 6.0% — this makes Drug B look as good as possible, setting the hardest test for Drug A.",
        "Sweep the assumed event rate among the 200 missing Drug A patients from 3.0% upward, recalculating the overall risk difference at each step, as shown in the table below.",
        "The overall Drug A risk = (24 observed events + 200 missing x assumed rate) / 1000 total Drug A patients. The overall Drug B risk = (54 observed events + 100 missing x 6.0%) / 1000 total Drug B patients = (54 + 6) / 1000 = 6.0%.",
        "At an assumed rate of 18.0% for missing Drug A patients: overall Drug A risk = (24 + 200 x 0.18) / 1000 = (24 + 36) / 1000 = 60 / 1000 = 6.0%. Risk difference = 6.0% - 6.0% = 0.0%. This is the tipping point.",
        "Compare the tipping point (18.0%) to the plausibility anchor (Drug A completer rate = 3.0%): the missing Drug A patients would have to experience events at 6 times the rate of their completing counterparts to erase the benefit.",
        "Judgment: there is no clinical reason why patients who dropped out of a database would have cardiac event rates 6x higher than similar patients who stayed. The finding is considered robust."
      ],
      "sensitivity_table": {
        "caption": "Risk difference as the assumed event rate among 200 missing Drug A patients is varied from 3% to 21%. The result tips at 18%.",
        "columns": [
          "assumed_event_rate_missing_drug_a_pct",
          "overall_drug_a_risk_pct",
          "overall_drug_b_risk_pct",
          "risk_difference_pct",
          "result"
        ],
        "rows": [
          [
            3.0,
            3.0,
            6.0,
            -3.0,
            "significant benefit"
          ],
          [
            6.0,
            3.6,
            6.0,
            -2.4,
            "significant benefit"
          ],
          [
            9.0,
            4.2,
            6.0,
            -1.8,
            "significant benefit"
          ],
          [
            12.0,
            4.8,
            6.0,
            -1.2,
            "significant benefit"
          ],
          [
            15.0,
            5.4,
            6.0,
            -0.6,
            "narrowing benefit"
          ],
          [
            18.0,
            6.0,
            6.0,
            0.0,
            "TIPPING POINT — no difference"
          ],
          [
            21.0,
            6.6,
            6.0,
            0.6,
            "reversed — Drug A appears worse"
          ]
        ]
      },
      "result": "The tipping point is an assumed event rate of 18% among the 200 missing Drug A patients — exactly 6 times the observed Drug A completer rate of 3%. The overall Drug A risk at the tipping point is (24 + 200 x 0.18) / 1000 = 60/1000 = 6.0%, equaling the Drug B risk of (54 + 6) / 1000 = 60/1000 = 6.0%, for a risk difference of 0.0%. Because a 6-fold elevation in the missing-patient event rate is not clinically plausible, the benefit is judged robust to the differential dropout."
    },
    "prerequisites": [
      "quantitative-bias-analysis-toolkit-rwe",
      "attrition-and-loss-to-follow-up-rwe",
      "missing-data-pattern-table-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Missing-outcome (delta-adjustment) tipping point",
        "description": "Sweep an MNAR penalty on the imputed outcome among non-completers/lost-to-follow-up in the arm favoring the conclusion (and, symmetrically, a favorable adjustment in the other arm) until the risk difference or ratio crosses the decision threshold; report the tipping delta against the observed completer rate.",
        "edge_cases": [
          "One-sided sweeps (penalizing only the favorable arm) overstate robustness; the credible tipping region requires adverse imputation in one arm and favorable in the other.",
          "Worst-case imputation (all favorable-arm dropouts have the event) is the endpoint of the sweep, not the answer; report the plausible tipping value, not the strawman extreme."
        ],
        "data_source_notes": "claims: classify LTFU by disenrollment; exclude MA-only person-time so non-events are observed, not missing. EHR: anchor the plausibility benchmark to the within-system completer distribution because missingness is informative."
      },
      {
        "name": "Unmeasured-confounding tipping point (E-value and surfaces)",
        "description": "Report the E-value (point tipping metric on the RR scale) and, when a single number is insufficient, a two-dimensional surface over confounder prevalence-difference and confounder-outcome strength showing where the adjusted estimate reaches the threshold.",
        "edge_cases": [
          "The E-value assumes the RR scale; on the risk-difference or hazard scale the closed form does not apply and a numeric sweep is needed.",
          "Benchmark the required confounder strength against a known measured confounder; an E-value of 1.3 is unimpressive if a measured covariate already has a stronger association."
        ],
        "data_source_notes": "claims/EHR: anchor required confounder strength to the strongest measured baseline covariate available in the lookback so the audience can judge plausibility."
      },
      {
        "name": "Misclassification / outcome-algorithm tipping point",
        "description": "Sweep the claims-algorithm sensitivity and PPV (or a sensitivity/specificity pair) until the bias-adjusted effect crosses the threshold; report how far below validated values the algorithm performance would have to fall.",
        "edge_cases": [
          "Non-differential misclassification usually biases toward the null, so the tipping question is whether *differential* misclassification by arm is plausible.",
          "Requires a validation source (chart review, registry linkage) for the plausibility anchor; without it the sweep is unanchored."
        ],
        "data_source_notes": "claims: anchor to published PPV/sensitivity for the specific code algorithm and care setting; performance varies by inpatient vs outpatient capture."
      }
    ],
    "tradeoffs": [],
    "implementation_notes_by_data_source": {},
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\nimport pandas as pd\n\ndef missing_outcome_tipping_point(cohort: pd.DataFrame,\n                                  favor_arm: str = \"TREAT\",\n                                  other_arm: str = \"COMPARATOR\",\n                                  threshold: float = 0.0,\n                                  grid: int = 1001) -> dict:\n    \"\"\"Return the tipping event-probability among favor_arm LTFU at which RD(favor - other) hits threshold.\n\n    RD is defined as risk(favor_arm) - risk(other_arm); a protective primary result has RD < 0.\n    We sweep p_favor (event prob among favor_arm LTFU) upward and hold other_arm LTFU at their\n    completer rate (the boundary that makes the comparator look as good as possible).\"\"\"\n    def arm_counts(arm):\n        sub = cohort[cohort[\"arm\"] == arm]\n        n = len(sub)\n        ev = (sub[\"status\"] == \"event\").sum()\n        ne = (sub[\"status\"] == \"no_event\").sum()\n        ltfu = (sub[\"status\"] == \"ltfu\").sum()\n        completer_rate = ev / (ev + ne) if (ev + ne) else np.nan\n        return dict(n=n, ev=ev, ne=ne, ltfu=ltfu, completer_rate=completer_rate)\n\n    f, o = arm_counts(favor_arm), arm_counts(other_arm)\n    # Comparator LTFU imputed at its completer rate (favorable-to-comparator boundary).\n    risk_other = (o[\"ev\"] + o[\"ltfu\"] * o[\"completer_rate\"]) / o[\"n\"]\n\n    ps = np.linspace(f[\"completer_rate\"], 1.0, grid)\n    risk_favor = (f[\"ev\"] + f[\"ltfu\"] * ps) / f[\"n\"]\n    rd = risk_favor - risk_other\n\n    crossed = np.where(rd >= threshold)[0]\n    tipping_p = float(ps[crossed[0]]) if crossed.size else None  # None => never tips within sweep\n    return {\n        # Completers-only RD uses completer denominators (ev + ne), not the full arm size.\n        \"rd_observed_completers_only\": f[\"completer_rate\"] - o[\"completer_rate\"],\n        \"favor_completer_rate\": f[\"completer_rate\"],   # plausibility anchor\n        \"other_completer_rate\": o[\"completer_rate\"],\n        \"favor_pct_ltfu\": f[\"ltfu\"] / f[\"n\"],\n        
\"other_pct_ltfu\": o[\"ltfu\"] / o[\"n\"],\n        \"tipping_event_prob_favor_ltfu\": tipping_p,\n        \"tipping_relative_to_completers\": (tipping_p / f[\"completer_rate\"]) if (tipping_p is not None and f[\"completer_rate\"] > 0) else None,\n    }",
        "description": "Missing-outcome (delta-adjustment) tipping point for a binary outcome with differential loss to follow-up. Required input\n(one row per person, after cohort construction and FFS-observability filtering):\n  cohort : person_id, arm ('TREAT'/'COMPARATOR'), status ('event'/'no_event'/'ltfu')\nstatus is derived from claims: 'ltfu' = censored before the risk-window end by disenrollment (exclude MA-only person-time\nupstream so 'no_event' is truly observed). The function sweeps the imputed event probability among TREAT lost-to-follow-up,\nholding COMPARATOR LTFU at the favorable boundary (their completer rate), and returns the tipping probability at which the\nrisk difference reaches `threshold` (default 0 = benefit erased), plus the completer-rate plausibility anchor.",
        "dependencies": [
          "pandas",
          "numpy"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\n\nmissing_outcome_tipping_point <- function(cohort,\n                                          favor_arm = \"TREAT\",\n                                          other_arm = \"COMPARATOR\",\n                                          threshold = 0,\n                                          grid = 1001L) {\n  cohort <- as.data.table(cohort)  # copy, so the caller's object is not modified by reference\n  arm_counts <- function(a) {\n    sub <- cohort[arm == a]\n    ev <- sum(sub$status == \"event\"); ne <- sum(sub$status == \"no_event\")\n    list(n = nrow(sub), ev = ev, ne = ne, ltfu = sum(sub$status == \"ltfu\"),\n         completer_rate = if ((ev + ne) > 0) ev / (ev + ne) else NA_real_)\n  }\n  f <- arm_counts(favor_arm); o <- arm_counts(other_arm)\n\n  # Comparator LTFU imputed at its completer rate (favorable-to-comparator boundary).\n  risk_other <- (o$ev + o$ltfu * o$completer_rate) / o$n\n\n  ps <- seq(f$completer_rate, 1, length.out = grid)\n  rd <- (f$ev + f$ltfu * ps) / f$n - risk_other\n  hit <- which(rd >= threshold)\n  tipping_p <- if (length(hit)) ps[hit[1L]] else NA_real_  # NA => never tips within sweep\n\n  list(\n    # Completers-only RD uses completer denominators (ev + ne), not the full arm size.\n    rd_observed_completers_only = f$completer_rate - o$completer_rate,\n    favor_completer_rate = f$completer_rate,           # plausibility anchor\n    other_completer_rate = o$completer_rate,\n    favor_pct_ltfu = f$ltfu / f$n,\n    other_pct_ltfu = o$ltfu / o$n,\n    tipping_event_prob_favor_ltfu = tipping_p,\n    tipping_relative_to_completers = tipping_p / f$completer_rate\n  )\n}",
        "description": "Missing-outcome (delta-adjustment) tipping point for a binary outcome, mirroring the Python logic. Required input (one row\nper person, after cohort construction and FFS-observability filtering):\n  cohort : person_id, arm ('TREAT'/'COMPARATOR'), status ('event'/'no_event'/'ltfu')\nReturns the tipping event probability among the favored arm's lost-to-follow-up patients at which the risk difference\nreaches `threshold`, with the completer-rate anchor for plausibility judgment.",
        "dependencies": [
          "data.table"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "%let covs = age sex baseline_severity prior_util;   /* baseline covariates for the imputation model */\n\n%macro tip_sweep(delta_lo=0, delta_hi=2.5, by=0.25, m=20);\n  %do d = %sysevalf(&delta_lo*100) %to %sysevalf(&delta_hi*100) %by %sysevalf(&by*100);\n    %let delta = %sysevalf(&d/100);\n\n    /* 1) Impute the missing/LTFU binary outcome under MAR, then push the favored arm (arm=1)\n          toward the event by an MNAR log-odds penalty &delta (delta-adjustment / tipping shift). */\n    proc mi data=work.cohort nimpute=&m out=work.imp seed=20240601;\n      class arm y;                 /* FCS LOGISTIC requires the imputed variable in CLASS */\n      fcs logistic(y = arm &covs);\n      mnar adjust(y (event='1') / shift=&delta adjustobs=(arm='1'));   /* penalize favored-arm imputations only */\n      var arm &covs y;\n    run;\n\n    /* 2) Fit the arm contrast on each completed dataset; identity link => risk difference.\n          arm is numeric 0/1, so its coefficient is the TREAT minus COMPARATOR risk difference. */\n    proc genmod data=work.imp;\n      by _Imputation_;\n      model y(event='1') = arm / dist=bin link=identity;   /* model P(y=1), not the lower-ordered level */\n      ods output ParameterEstimates=work.pe;\n    run;\n\n    /* 3) Pool the arm risk difference across imputations (Rubin's rules). */\n    proc mianalyze parms=work.pe;\n      modeleffects arm;           /* TREAT vs COMPARATOR risk difference */\n      ods output ParameterEstimates=work.pooled;\n    run;\n\n    data _null_;\n      set work.pooled;\n      put \"TIPPING SWEEP: delta=&delta  pooled_RD=\" Estimate \" 95% CI=(\" LCLMean \",\" UCLMean \")\";\n    run;\n  %end;\n%mend;\n\n%tip_sweep(delta_lo=0, delta_hi=2.5, by=0.25, m=20);   /* first delta with pooled_RD>=0 is the tipping point */",
        "description": "Delta-adjustment (MNAR) tipping point in SAS using the production pattern: PROC MI imputes the missing/LTFU outcome under\nMAR, an MNAR ADJUST statement applies an increasing delta penalty to the favored arm's imputed values, PROC GENMOD fits the\narm contrast per imputation, and PROC MIANALYZE pools across imputations. A macro loop sweeps the delta until the pooled\nestimate crosses the threshold. Required input (one row per person, after cohort construction and FFS-observability filtering):\n  work.cohort : person_id, arm (1=TREAT favored, 0=COMPARATOR), y (binary outcome; missing => LTFU/non-completer),\n                plus baseline covariates &covs for the imputation model.\nInspect the printed pooled risk difference at each delta; the tipping delta is the first value at which it reaches 0.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Obs[Primary RWE estimate crosses decision threshold<br/>e.g. RD = -2.4 pts favoring treatment] --> Fix[Fix estimand + decision threshold<br/>RD=0 / NI margin / WTP line]\n  Fix --> Pick[Identify dominant bias parameter<br/>missing-outcome delta / unmeasured confounder / misclassification]\n  Pick --> Sweep[Sweep bias parameter from observed anchor upward<br/>adverse in favored arm, favorable in other]\n  Sweep --> Cross{Estimate reaches threshold<br/>within the sweep?}\n  Cross -- No --> Robust[Robust: even the sweep endpoint<br/>does not overturn the result]\n  Cross -- Yes --> Tip[Record tipping value]\n  Tip --> Anchor{Is the tipping value clinically plausible<br/>vs completer rate / known confounder strength?}\n  Anchor -- Implausible --> Robust2[Conclusion robust: breaking it requires<br/>an incredible assumption]\n  Anchor -- Plausible --> Fragile[Conclusion fragile: report uncertainty,<br/>collect more data, or revise the decision]",
        "caption": "Decision logic of a tipping-point analysis. The conclusion is fixed and the bias parameter is searched; the tipping value only acquires meaning once judged against a plausibility anchor (completer event rate, known confounder strength).",
        "alt_text": "Flowchart from a primary estimate that crosses a decision threshold, through fixing the estimand and threshold, selecting a bias parameter, sweeping it, checking whether the estimate tips, and judging the tipping value against a plausibility anchor to declare the result robust or fragile.",
        "source_type": "illustrative",
        "source_citations": [
          "liublinska-2014"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  subgraph Point[E-value: point tipping metric]\n    E1[Single bias: unmeasured confounding] --> E2[Closed form on RR scale] --> E3[One number: required confounder strength]\n  end\n  subgraph Surface[Delta-adjustment: functional tipping surface]\n    S1[One or two bias parameters<br/>missing-outcome / prevalence x strength] --> S2[Numeric sweep, any scale] --> S3[Tipping value or 2-D surface]\n  end\n  E3 -. special case of .-> S3",
        "caption": "The E-value is the closed-form, single-parameter special case of the general tipping-point search. A full delta-adjustment sweep generalizes it to missing-outcome and misclassification biases and to two-dimensional surfaces on any effect scale.",
        "alt_text": "Diagram contrasting the E-value as a single-number tipping metric for unmeasured confounding with the general delta-adjustment tipping surface that handles one or two bias parameters on any scale, with the E-value shown as a special case of the general method.",
        "source_type": "illustrative",
        "source_citations": [
          "vanderweele-2017"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "is_variant_of",
        "target_slug": "quantitative-bias-analysis-toolkit-rwe",
        "notes": "Tipping-point analysis is the inverse-search member of the QBA family — it solves for the bias value that reaches a decision threshold rather than propagating a bias distribution to a corrected estimate."
      },
      {
        "relation_type": "see_also",
        "target_slug": "e-value-sensitivity-analysis",
        "notes": "The E-value is the closed-form tipping point for unmeasured confounding on the risk-ratio scale; the general delta-adjustment sweep extends it to missing-outcome and misclassification biases and to two-parameter surfaces."
      },
      {
        "relation_type": "used_with",
        "target_slug": "estimands-ate-att-intercurrent-events-rwe",
        "notes": "The estimand and decision threshold must be fixed before the sweep; per ICH E9(R1) the tipping point must align with the handling of intercurrent events."
      },
      {
        "relation_type": "alternative_to",
        "target_slug": "unmeasured-confounding-probabilistic-bias-analysis-rwe",
        "notes": "Probabilistic bias analysis propagates bias priors to a corrected estimate; the tipping point instead reports how far the bias would have to go — prefer PBA when defensible priors exist, the tipping point when they do not."
      },
      {
        "relation_type": "used_with",
        "target_slug": "missing-data-pattern-table-rwe",
        "notes": "The missing-data pattern table characterizes the amount and structure of missingness that the missing-outcome tipping point then stress-tests under MNAR departures."
      },
      {
        "relation_type": "see_also",
        "target_slug": "misclassification-bias-correction-rwe",
        "notes": "The misclassification tipping-point variant sweeps algorithm sensitivity/PPV until the inference changes, using the same correction algebra as misclassification bias correction."
      }
    ],
    "aliases": [
      "tipping-point sensitivity analysis",
      "threshold sensitivity analysis",
      "delta-adjustment tipping point",
      "missing-data tipping point",
      "tipping surface analysis"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "tokenization-privacy-preserving-record-linkage-rwe",
    "name": "Tokenization and Privacy-Preserving Record Linkage",
    "short_definition": "The set of methods that turn patient identifiers into irreversible, encrypted tokens so that records for the same person can be joined across separately held datasets (claims, EHR, mortality, lab, and SDOH vendors) without ever exchanging protected health information - and the evaluation discipline (match rate, false-match and false-miss rates, and selection bias from a linkable subpopulation) needed to know whether the linked cohort is fit for a real-world study.",
    "long_description": "Almost every interesting real-world question now spans datasets that no single custodian holds: pharmacy and\nmedical **claims** know what was paid for, the **EHR** knows labs and vitals, a **death index** knows the fact and\ndate of death, and **SDOH** vendors know neighborhood and area-level deprivation. Joining them on a person requires\nidentifiers, but the whole US privacy regime (HIPAA) is built to stop clear-text identifiers from being shipped\nbetween parties. **Tokenization** resolves the tension: each party runs the same deterministic, salted, one-way\nhashing recipe over standardized identifiers (name, date of birth, sex, sometimes a partial SSN or address) to\nproduce an irreversible **token**. Two records that hashed to the same token belong to the same person; nobody had\nto send a name. The HIPAA de-identification logic is usually **expert determination** - a qualified statistician\ncertifies that the token plus the retained data carry a very small re-identification risk - rather than the\nSafe Harbor checklist, because linkage needs a stable cross-dataset key that Safe Harbor would strip.\n\n**What actually happens in the US ecosystem.** A token vendor (Datavant, HealthVerity, and similar) defines several\ntoken \"recipes,\" each over a different identifier bundle, so that a missing or mistyped field in one recipe can be\nrescued by another. Crucially, tokens are **site-keyed** (encrypted again with a key unique to each data partner) so\nthe same person carries a *different* token at site A than at site B; an honest broker holds a **crosswalk** that\nre-encrypts site A tokens into site B's space so the join can happen without either side learning the other's raw\ntoken. **Deterministic** linkage then joins on exact token equality. 
**Probabilistic / Bloom-filter PPRL** is the\nfuzzier cousin: each identifier is split into character n-grams and hashed into a bit vector (a Bloom filter); two\nvectors are compared by set-overlap similarity (Dice/Jaccard), so *near* matches - a hyphenated surname, a\ntransposed birth month - still link with a calibrated threshold. Bloom-filter PPRL is what makes\nprivacy-preserving *probabilistic* matching possible at all (Schnell, Bachteler, Reiher 2009).\n\n**Pros, cons, and trade-offs** (specific and comparative).\n- **vs direct-identifier (clear-text) deterministic linkage:** Tokenization buys you a legally shippable key and a\n  defensible de-identification posture; the cost is that you can no longer eyeball or hand-adjudicate a borderline\n  pair, you inherit whatever the vendor's identifier standardization did, and an exact-token join is brittle to the\n  very typos a human reviewer would have forgiven. **Prefer clear-text linkage** only inside a single custodian or\n  an honest-broker enclave where PHI exchange is permitted; **prefer tokenization** the moment data crosses\n  organizational boundaries.\n- **vs probabilistic / Bloom-filter PPRL within tokenization:** A single deterministic token is precise (few false\n  matches) but unforgiving (more false misses when identifiers are imperfect). A **multi-token waterfall** - try the\n  strongest recipe, then fall back to weaker recipes for the unmatched - raises the **match rate** but each weaker\n  recipe trades precision for recall, so false matches creep in on the rescued records. Bloom-filter PPRL recovers\n  even more true pairs but needs threshold tuning and a small clerically reviewed or gold-standard sample to set the\n  cut. 
**Prefer deterministic-only** when a false link is more damaging than a missed one (e.g., attributing a death\n  to the wrong patient); **prefer a waterfall or PPRL** when coverage of the linkable population is the binding\n  constraint.\n- **vs treating \"linked\" as a clean merge:** The seductive error is to link, drop the unmatched, and analyze the\n  linked cohort as if it were the source cohort. Linkage is a *measurement* with error in two directions and a\n  *selection* mechanism, not a lossless join.\n\n**When to use.** Whenever the exposure, outcome, or confounders live in different datasets and you need them on one\nrow: claims exposure with EHR labs for confounding control; a claims cohort linked to the **death index** for an\nall-cause mortality endpoint; claims or EHR linked to **SDOH** vendors for area-level deprivation; tumor registry\nlinked to claims for treatment and cost; or multi-site networks that must pool person-level data they are not\nallowed to share in the clear. Tokenization is also the practical substrate for *de-duplicating* the same patient\nwho appears in two overlapping data feeds.\n\n**When NOT to use - and when it is actively misleading.**\n- **Do not** report a linked analysis without a **match rate** and, where possible, **false-match and false-miss**\n  estimates from a validation sample (a subset with a trusted gold-standard linkage). A 70% match rate silently\n  discards 30% of the cohort; pretending the linked 70% is the whole cohort is the central trap.\n- **Selection bias from the linkable subpopulation.** Whether a person links is **not random**. People with stable\n  names and addresses, continuous insurance, and full identifiers link at much higher rates than mobile, younger,\n  recently immigrated, or intermittently insured people - and those traits correlate with exposure and outcome. 
If\n  you restrict to the linked subset, you are conditioning on linkage, which can open a collider path and bias the\n  effect estimate. Report match rate **by subgroup**, compare linked vs unlinked on observed characteristics, and\n  consider weighting (inverse-probability-of-linkage) or a sensitivity analysis.\n- **False matches contaminate the outcome.** A wrong token link can graft another person's death or hospitalization\n  onto your patient, biasing event rates in a direction that depends on the linkage error structure. Treat false\n  matches like outcome misclassification, not like random noise.\n- **Do not** assume Bloom-filter or token recipes are perfectly privacy-preserving; frequency and pattern attacks on\n  Bloom filters exist, which is why salting, site-keying, and expert determination - not the hashing alone - carry\n  the de-identification claim.\n\n**Data-source operational depth.**\n- **Claims:** Identifiers come from enrollment/eligibility files; tokens are usually generated by the vendor on the\n  closed-claims feed. Match rates are highest here because insurers maintain clean member identifiers, but\n  Medicare Advantage and gaps in enrollment shrink the *observable* and therefore the *linkable* window.\n- **EHR:** Identifiers can be messy (free-text names, missing SSN, registration typos), so EHR is where\n  probabilistic/Bloom-filter PPRL earns its keep and where the multi-token waterfall recovers the most pairs.\n- **Registry / mortality:** The death index (state files or a commercial composite) is the canonical reason to link;\n  match quality here directly determines outcome ascertainment, so false matches and false misses must be quantified,\n  not assumed.\n- **Linked (the deliverable):** Reconcile token recipes across vendors, document which recipe matched each pair,\n  carry a **match-confidence** field forward, and keep the unmatched records so match rate and selection can be\n  audited downstream.",
    "primary_category": "Data_Quality_Assessment",
    "tags": [
      "tokenization",
      "privacy-preserving-record-linkage",
      "pprl",
      "bloom-filter",
      "deterministic-linkage",
      "probabilistic-linkage",
      "match-rate",
      "linkage-error",
      "selection-bias",
      "hipaa-de-identification",
      "claims",
      "ehr",
      "mortality",
      "sdoh"
    ],
    "applies_to_study_types": [
      "cohort_retrospective",
      "claims_analysis",
      "ehr_study",
      "comparative_effectiveness",
      "drug_utilization",
      "pharmacovigilance"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1186/1472-6947-9-41",
        "url": "https://doi.org/10.1186/1472-6947-9-41",
        "citation_text": "Schnell R, Bachteler T, Reiher J. Privacy-preserving record linkage using Bloom filters. BMC Medical Informatics and Decision Making. 2009;9:41.",
        "year": 2009,
        "authors_short": "Schnell et al.",
        "notes": "The foundational method paper for privacy-preserving record linkage - encoding identifiers as Bloom filters so that approximate (probabilistic) matching is possible without exchanging clear-text identifiers - which underlies the token recipes used across the modern US linkage ecosystem."
      },
      {
        "role": "explain",
        "doi": "10.1177/2053951717745678",
        "url": "https://doi.org/10.1177/2053951717745678",
        "citation_text": "Harron K, Dibben C, Boyd J, et al. Challenges in administrative data linkage for research. Big Data & Society. 2017;4(2):2053951717745678.",
        "year": 2017,
        "authors_short": "Harron et al.",
        "notes": "Frames the recurring threats - linkage error, the difference between deterministic and probabilistic methods, and the bias introduced when the linkable subpopulation differs from the target population - that govern whether a linked dataset is fit for purpose."
      },
      {
        "role": "demonstrate",
        "doi": "10.1186/s12874-017-0306-8",
        "url": "https://doi.org/10.1186/s12874-017-0306-8",
        "citation_text": "Harron K, Gilbert R, Cromwell D, van der Meulen J. Utilising identifier error variation in linkage of large administrative data sources. BMC Medical Research Methodology. 2017;17:23.",
        "year": 2017,
        "authors_short": "Harron et al.",
        "notes": "Shows concretely how variation in identifier error drives false matches and false misses and how a validation sample is used to quantify linkage quality - the empirical basis for reporting match rate alongside false-match and false-miss rates."
      },
      {
        "role": "use",
        "doi": "10.1186/1472-6963-10-346",
        "url": "https://doi.org/10.1186/1472-6963-10-346",
        "citation_text": "Bohensky MA, Jolley D, Sundararajan V, et al. Data linkage: a powerful research tool with potential problems. BMC Health Services Research. 2010;10:346.",
        "year": 2010,
        "authors_short": "Bohensky et al.",
        "notes": "Applied account of how differential match rates create selection bias - linkage favors patients with complete, stable identifiers - and why analyzing only the linked subset can distort effect estimates if linkage probability is associated with exposure or outcome."
      }
    ],
    "plain_language_summary": "To study a patient across separate databases - what their insurance paid for, what their hospital chart shows, whether and when they died - you have to recognize the same person in each one, but privacy law blocks sending names between organizations. Tokenization solves this by scrambling each person's identifiers into a fixed, irreversible code (a token) using the same recipe everywhere, so matching tokens means the same patient, with no names exchanged. The catch is that linkage is never perfect: some real matches are missed and some wrong matches slip in, so you must report a match rate and check linkage errors. And because people with messy or unstable identifiers link less often, the linked group can quietly differ from everyone you started with - a selection problem you have to look for.\n",
    "key_terms": [
      {
        "term": "tokenization",
        "definition": "Scrambling a person's identifiers into a fixed, irreversible code so the same person can be recognized across datasets without sharing their name."
      },
      {
        "term": "token",
        "definition": "The irreversible code produced from someone's identifiers; matching tokens indicate the same person."
      },
      {
        "term": "deterministic linkage",
        "definition": "Joining records only when their tokens are exactly equal - precise, but it misses people whose identifiers had a typo or a missing field."
      },
      {
        "term": "probabilistic linkage",
        "definition": "Joining records that are similar enough rather than identical, so near-misses like a misspelled name still link if they pass a similarity cutoff."
      },
      {
        "term": "match rate",
        "definition": "The share of the starting cohort that successfully linked to the other dataset."
      },
      {
        "term": "false match and false miss",
        "definition": "A false match links two different people as if they were one; a false miss leaves two records for the same person unlinked."
      }
    ],
    "worked_example": {
      "scenario": "We have 1,000 patients in a claims cohort and want to link them to an external death index using tokens. We first try a strong token recipe, then run a weaker recipe on whoever is left (a waterfall), and we split the cohort into patients with stable identifiers and more mobile patients to see whether linkage is even across groups and how much linkage error the weak recipe adds.\n",
      "dataset": {
        "caption": "Cohort-level linkage counts an analyst would assemble from the matched and unmatched token output.",
        "columns": [
          "subgroup",
          "n_patients",
          "strong_recipe_matches",
          "weak_recipe_adds",
          "matched_total"
        ],
        "rows": [
          [
            "stable_identifier",
            600,
            540,
            12,
            552
          ],
          [
            "mobile_identifier",
            400,
            280,
            48,
            328
          ],
          [
            "all",
            1000,
            820,
            60,
            880
          ]
        ]
      },
      "steps": [
        "Strong recipe matches across both subgroups = 540 + 280 = 820 of the 1,000 patients.",
        "Strong-recipe match rate = 820 / 1000 = 0.82, so 180 patients did not link on the strong token alone.",
        "The weaker recipe rescues 60 more patients, so the overall match rate after the waterfall = (820 + 60) / 1000 = 0.88.",
        "The weak recipe is looser, so it injects false matches; at a precision of 0.90 the false matches = 60 * 0.10 = 6 wrong links.",
        "Look at selection - the stable subgroup links far better than the mobile subgroup, 540 / 600 = 0.90 versus 280 / 400 = 0.70, so the linked cohort over-represents stable-identifier patients."
      ],
      "result": "Overall match rate after the waterfall = 880/1000 = 0.88, but about 6 of the rescued links are false matches and the stable subgroup linked at 0.90 versus 0.70 for the mobile subgroup - so analyzing only the linked patients over-represents stable-identifier people and risks selection bias.\n",
      "timeline_spec": {
        "title": "One token-matched patient appearing across claims, EHR, and the death index",
        "window": {
          "start": "2021-01-01",
          "end": "2023-03-31",
          "label": "Calendar span across three linked datasets"
        },
        "events": [
          {
            "label": "Claims enrollment (token match)",
            "start": "2021-01-01",
            "length_days": 730,
            "quantity": "Token A"
          },
          {
            "label": "EHR encounters (token match)",
            "start": "2021-03-15",
            "length_days": 615,
            "quantity": "Token A"
          },
          {
            "label": "SDOH vendor record (token match)",
            "start": "2021-06-01",
            "length_days": 1,
            "quantity": "Token A"
          },
          {
            "label": "Death-index record (token match)",
            "start": "2023-02-10",
            "length_days": 1,
            "quantity": "Token A"
          }
        ],
        "spans": [
          {
            "kind": "exposed",
            "start": "2021-01-01",
            "end": "2022-12-31",
            "label": "Claims: continuous enrollment"
          },
          {
            "kind": "followup",
            "start": "2021-03-15",
            "end": "2022-11-20",
            "label": "EHR: encounter coverage"
          }
        ],
        "result": {
          "label": "One token links claims+EHR+mortality; cohort match rate 880/1000 = 0.88",
          "value": 0.88
        }
      }
    },
    "prerequisites": [
      "linked-data",
      "fit-for-purpose-data-assessment-rwe",
      "claims-outcome-algorithm-ppv-sensitivity-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Deterministic single-token (exact-match) linkage",
        "description": "Join records only where an irreversible token built from a fixed identifier bundle is exactly equal. Precise (few false matches) but unforgiving of any identifier error, so the match rate is capped by identifier completeness and the false-miss rate rises whenever a field is missing or mistyped.",
        "edge_cases": [
          "A single mistyped birth month or a maiden-vs-married surname breaks the exact token, sending a true pair to the unmatched pile (a false miss) even though it is the same person.",
          "Site-keyed tokens require an honest-broker crosswalk to re-encrypt one partner's tokens into another's space; a crosswalk error silently zeroes out a whole partner's matches."
        ],
        "data_source_notes": "claims: highest match yield because member identifiers are clean. ehr: lower yield because registration fields are messy and SSN is often absent."
      },
      {
        "name": "Multi-token waterfall (cascading recipes)",
        "description": "Try the strongest recipe first, then fall back to progressively weaker recipes (dropping SSN, then using name plus date of birth plus sex) for the still-unmatched records. Raises the overall match rate, but each weaker recipe trades precision for recall, so false matches accumulate on the records rescued by the looser recipes.",
        "edge_cases": [
          "A loose final recipe (name plus year of birth only) can collide unrelated people who share common names, injecting false matches that must be capped by a confirmation rule or excluded.",
          "Records matched only by a weak recipe should carry a lower match-confidence flag so downstream analyses can run a sensitivity analysis that drops them."
        ],
        "data_source_notes": "linked: record which recipe matched each pair and keep the recipe identifier as a confidence proxy. registry/mortality: hold weak-recipe death matches to a higher bar to avoid grafting the wrong death."
      },
      {
        "name": "Probabilistic / Bloom-filter PPRL (approximate matching)",
        "description": "Encode each identifier as character n-grams hashed into a Bloom-filter bit vector, then compare vectors by set-overlap similarity (Dice/Jaccard) and link pairs above a calibrated threshold. Recovers true pairs that exact tokens miss (typos, transpositions, nicknames) while still never exchanging clear-text identifiers.",
        "edge_cases": [
          "The similarity threshold trades false matches against false misses; it must be tuned on a gold-standard or clerically reviewed sample, not guessed.",
          "Bloom filters are vulnerable to frequency/pattern attacks, so salting and site-keying - not the hashing alone - carry the privacy claim."
        ],
        "data_source_notes": "ehr: the main beneficiary, where dirty identifiers make exact tokens fail. claims: usually deterministic is enough, so PPRL is reserved for cross-vendor reconciliation."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "linked-data",
        "pros_of_this": "Makes the linkage itself legally and operationally possible across custodians (irreversible tokens, no PHI exchange) and forces the analyst to treat match rate and linkage error as measured quantities rather than assuming a clean merge.",
        "cons_of_this": "Adds vendor dependence (you inherit their identifier standardization and recipe definitions), removes the option of human pair adjudication, and introduces tuning choices (which recipes, what threshold) that a pre-tokenized linked dataset hides.",
        "when_to_prefer": "Use this concept's lens whenever you must understand or defend HOW a linked dataset was joined; treat linked-data as the downstream analytic object once the linkage quality has been established."
      },
      {
        "compared_to": "direct-identifier deterministic linkage (clear-text PHI)",
        "pros_of_this": "Legally shippable across organizations and defensible under HIPAA expert determination, with no clear-text identifiers ever leaving a custodian.",
        "cons_of_this": "Cannot hand-review borderline pairs, is brittle to identifier typos that a human or clear-text fuzzy match would forgive, and depends on consistent identifier standardization across all parties.",
        "when_to_prefer": "Reserve clear-text deterministic linkage for a single custodian or an honest-broker enclave where PHI exchange is sanctioned; tokenize the moment data crosses an organizational boundary."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Tokens are generated by the vendor on enrollment/eligibility identifiers; expect the highest match rates but remember that gaps in enrollment and Medicare Advantage shrink the observable window and therefore the linkable population. Always retain the unmatched records to compute the match rate.",
      "ehr": "Registration identifiers are messy (free-text names, missing SSN, typos), so this is where the multi-token waterfall and Bloom-filter PPRL recover the most true pairs; carry a match-confidence field reflecting which recipe or threshold produced each link.",
      "registry": "Tumor and disease registries and the death index are common link targets; because the registry often supplies the outcome, false matches and false misses translate directly into outcome misclassification and must be quantified against a validation sample, not assumed away.",
      "linked": "Reconcile token recipes across vendors, document which recipe matched each pair, propagate a match-confidence field, and keep both matched and unmatched records so match rate, false-match/false-miss rates, and selection by the linkable subpopulation can be audited downstream."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\n\ndef linkage_quality(cohort: pd.DataFrame) -> dict:\n    n = len(cohort)\n    n_matched = int(cohort[\"matched\"].sum())\n    match_rate = n_matched / n\n\n    # False match: linked but not actually the same person (precision failure).\n    fm = int(((cohort[\"matched\"]) & (~cohort[\"truly_same_person\"])).sum())\n    false_match_rate = fm / n_matched if n_matched else 0.0\n\n    # False miss: truly the same person but left unmatched (recall failure).\n    n_true = int(cohort[\"truly_same_person\"].sum())\n    miss = int(((~cohort[\"matched\"]) & (cohort[\"truly_same_person\"])).sum())\n    false_miss_rate = miss / n_true if n_true else 0.0\n\n    # Selection: match rate by subgroup exposes the linkable subpopulation.\n    by_subgroup = (\n        cohort.groupby(\"subgroup\")[\"matched\"].mean().round(3).to_dict()\n    )\n\n    # Records rescued only by the weak recipe carry the most false-match risk.\n    weak = cohort[(cohort[\"matched\"]) & (cohort[\"recipe\"] == \"B\")]\n    weak_false_match_rate = (\n        float((~weak[\"truly_same_person\"]).mean()) if len(weak) else 0.0\n    )\n\n    return {\n        \"n\": n,\n        \"match_rate\": round(match_rate, 3),\n        \"false_match_rate\": round(false_match_rate, 3),\n        \"false_miss_rate\": round(false_miss_rate, 3),\n        \"match_rate_by_subgroup\": by_subgroup,\n        \"weak_recipe_false_match_rate\": round(weak_false_match_rate, 3),\n    }",
        "description": "Simulate token-based linkage of a claims cohort to an external dataset (e.g., a death index), then quantify\nlinkage quality and selection. The function takes a per-person frame with a true linkage label (known only in\nsimulation / a validation sample) and the recipe that matched each person, and returns the overall and\nby-subgroup match rate plus false-match and false-miss rates - the numbers a fit-for-purpose assessment needs.\nRequired input (one row per source-cohort person):\n  cohort : person_id, subgroup, truly_same_person (bool), matched (bool), recipe (\"A\" strong, \"B\" weak, or None)",
        "dependencies": [
          "pandas"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\n\nlinkage_quality <- function(cohort) {\n  setDT(cohort)\n  n         <- nrow(cohort)\n  n_matched <- sum(cohort$matched)\n  n_true    <- sum(cohort$truly_same_person)\n\n  match_rate <- n_matched / n\n  fm   <- sum(cohort$matched & !cohort$truly_same_person)\n  miss <- sum(!cohort$matched & cohort$truly_same_person)\n  false_match_rate <- if (n_matched > 0) fm / n_matched else 0\n  false_miss_rate  <- if (n_true > 0)    miss / n_true  else 0\n\n  by_subgroup <- cohort[, .(match_rate = round(mean(matched), 3)), by = subgroup]\n\n  weak <- cohort[matched == TRUE & recipe == \"B\"]\n  weak_false_match_rate <- if (nrow(weak) > 0)\n    mean(!weak$truly_same_person) else 0\n\n  list(\n    n = n,\n    match_rate = round(match_rate, 3),\n    false_match_rate = round(false_match_rate, 3),\n    false_miss_rate = round(false_miss_rate, 3),\n    match_rate_by_subgroup = by_subgroup,\n    weak_recipe_false_match_rate = round(weak_false_match_rate, 3)\n  )\n}",
        "description": "Same linkage-quality summary in data.table: overall and by-subgroup match rate, false-match rate (linked but not\nthe same person), false-miss rate (same person but unmatched), and the false-match rate among records rescued only\nby the weak recipe. Input (one row per source-cohort person):\n  cohort : person_id, subgroup, truly_same_person (logical), matched (logical), recipe (\"A\"/\"B\"/NA)",
        "dependencies": [
          "data.table"
        ],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Src[Source cohort record<br/>standardized identifiers] --> Hash[Hash with salted recipe<br/>-> irreversible token]\n  Hash --> Det{Strong-recipe token<br/>exactly equals a<br/>target token?}\n  Det -- Yes --> Matched[Linked - recipe A<br/>high confidence]\n  Det -- No --> Fall{Weaker recipe or<br/>Bloom-filter similarity<br/>above threshold?}\n  Fall -- Yes --> Rescued[Linked - recipe B / PPRL<br/>lower confidence<br/>false-match risk]\n  Fall -- No --> Unmatched[Unmatched<br/>contributes to false misses<br/>and shrinks match rate]\n  Matched --> Eval[Validation sample:<br/>match rate, false-match rate,<br/>false-miss rate, by subgroup]\n  Rescued --> Eval\n  Unmatched --> Eval",
        "caption": "How a source record becomes a token, is linked by a strong recipe or rescued by a weaker recipe / Bloom-filter PPRL, and how matched, rescued, and unmatched records all feed the linkage-quality evaluation (match rate plus false-match and false-miss rates, examined by subgroup for selection).",
        "alt_text": "Decision flowchart from a standardized source record through salted hashing to a token, exact strong-recipe match, weaker-recipe or Bloom-filter rescue, and unmatched, with all paths feeding a validation step that computes match rate and linkage error.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": "tokenization-privacy-preserving-record-linkage-rwe-timeline.svg",
        "mermaid": null,
        "caption": "One token-matched patient appearing across three separately held datasets - claims enrollment, EHR encounters, and a death-index record - joined on an irreversible token without exchanging identifiers; the cohort-level match rate after the waterfall is 880/1000 = 0.88.",
        "alt_text": "Timeline showing one patient's claims enrollment span, overlapping EHR encounter span, and a later death-index record, all linked by a shared token across datasets.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "used_with",
        "target_slug": "linked-data",
        "notes": "Tokenization is the privacy-preserving mechanism that produces a linked dataset; linked-data is the downstream analytic object once the join is done and its quality established."
      },
      {
        "relation_type": "used_with",
        "target_slug": "multi-database",
        "notes": "Multi-database and federated networks rely on tokenization to pool or compare person-level records they are not permitted to share in the clear."
      },
      {
        "relation_type": "used_with",
        "target_slug": "mortality-source-hierarchy-rwe",
        "notes": "Token linkage to a death index is a primary mortality source; match quality directly governs how completely and accurately death is ascertained."
      },
      {
        "relation_type": "used_with",
        "target_slug": "sdoh-social-determinants-of-health",
        "notes": "Area-level SDOH measures are attached to a cohort by tokenized linkage to SDOH vendors, inheriting the same match rate and selection concerns."
      },
      {
        "relation_type": "see_also",
        "target_slug": "claims-outcome-algorithm-ppv-sensitivity-rwe",
        "notes": "False matches and false misses are forms of misclassification; the same validation-sample logic (PPV, sensitivity) used for outcome algorithms quantifies linkage error."
      },
      {
        "relation_type": "affects",
        "target_slug": "generalizability-transportability-external-validity-rwe",
        "notes": "Restricting to the linkable subpopulation can make the linked cohort differ systematically from the target population, threatening external validity unless match rate by subgroup is examined and weighted."
      },
      {
        "relation_type": "part_of",
        "target_slug": "fit-for-purpose-data-assessment-rwe",
        "notes": "Reporting match rate and linkage error is a required input to judging whether a linked dataset is fit for a specific study question."
      },
      {
        "relation_type": "see_also",
        "target_slug": "database-feasibility-attrition-funnel-rwe",
        "notes": "Unmatched records are an attrition step; the linkage match rate belongs in the feasibility funnel alongside enrollment and exposure filters."
      }
    ],
    "aliases": [
      "tokenization",
      "privacy-preserving record linkage",
      "PPRL",
      "token-based linkage",
      "Bloom filter linkage",
      "deterministic token matching",
      "patient matching",
      "record linkage de-identification"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "treatment-failure-non-response-rwe",
    "name": "Treatment Failure and Non-Response in RWE",
    "short_definition": "A pre-specified real-world endpoint or classification rule that identifies inadequate benefit after treatment initiation, combining disease-specific response evidence (labs, scores, imaging, PROs, clinician assessment) with treatment-pattern proxies such as escalation, switch, discontinuation, next line of therapy, rescue therapy, or recurrent acute care, while explicitly separating pharmacologic non-response from non-adherence, intolerance, access barriers, and informative intercurrent events.",
    "long_description": "**Treatment failure / non-response** in RWE is not one universal code or one universal endpoint. It is a **constructed\nclinical status**: after a patient starts a therapy and has a pre-specified adequate-trial window, the study classifies the\npatient as having no response, partial/inadequate response, loss of response, or treatment failure using the best available\nreal-world evidence. In an EHR or registry that may be a disease activity score, laboratory value, imaging interpretation,\npatient-reported outcome, or clinician note. In claims it is usually a proxy: dose escalation, add-on therapy, switch to a\nnew class, advancement to the next line of therapy, discontinuation without a refill, procedure/rescue medication, or a\nfailure-related hospitalization. The central RWE discipline is to **state which construct is being measured**. \"No response\nto drug\" is different from \"treatment strategy failure,\" and both are different from \"the patient stopped filling the drug.\"\n\n**Core conceptual distinction.** Three layers must be kept separate and pre-specified.\n- **Biologic or clinical response status.** Did the disease improve enough by an anchored assessment time? Examples:\n  tumor response category, HbA1c reduction, rheumatoid arthritis disease activity, asthma exacerbation control, remission\n  score, pain/function improvement, or biomarker normalization. This is the closest real-world analogue of \"response\" but\n  requires a measurement source and an assessment cadence.\n- **Treatment-pattern failure proxy.** Did the treating system behave as if the therapy was not adequate? Examples:\n  dose escalation above a protocol threshold, add-on/augmentation after an adequate trial, switch to a different therapy,\n  next line of therapy, rescue corticosteroids, treatment-related procedure, or discontinuation. 
This is scalable in claims\n  but mixes inadequate efficacy with toxicity, patient preference, affordability, formulary changes, pregnancy planning, and\n  access problems.\n- **Estimand / intercurrent-event strategy.** The same switch or rescue therapy can be the endpoint itself under a\n  **composite strategy** (\"switch/rescue = failure\"), ignored under a **treatment-policy strategy**, censored under a\n  **while-on-treatment strategy**, or modeled as a hypothetical strategy (\"what if rescue had not occurred?\"). ICH E9(R1)\n  treats discontinuation, alternative treatment, rescue medication, and death as intercurrent events that affect the\n  interpretation or existence of the measurement; RWE should do the same rather than letting the database default decide.\n\n**Outcome variants that should be named explicitly.**\n- **No response / primary non-response.** No clinically meaningful improvement from baseline after an adequate exposure and\n  assessment window. Operationally this usually requires baseline status, a response threshold, and a fixed landmark\n  (for example, 12 or 24 weeks) or a first eligible post-index assessment.\n- **Partial or inadequate response.** Some improvement occurs but the patient remains above a disease-activity threshold or\n  below a responder threshold. This is not a binary \"failed\" state unless the protocol defines the threshold and the\n  handling of missing post-baseline measurements.\n- **Loss of response / secondary failure.** The patient first meets a response criterion, then later worsens beyond a\n  pre-specified threshold, needs escalation, starts rescue therapy, or advances to a new line. 
This requires retaining both\n  the initial response date and the later failure date.\n- **Treatment escalation.** Dose increase, interval shortening, therapeutic drug monitoring-triggered dose change,\n  add-on/augmentation, procedure, rescue medication, or higher-intensity setting after a minimum adequate-trial window.\n  Escalation is often the most clinically interpretable claims proxy for inadequate control, but it is still a proxy for a\n  physician decision, not the disease state itself.\n- **Switch, discontinuation, or next line of therapy.** A switch or line advancement is strong evidence that the original\n  regimen did not remain acceptable or effective, but reason is usually unobserved in claims. Discontinuation alone is the\n  weakest failure proxy because it also captures intolerance, affordability, remission, patient preference, death, loss of\n  insurance, and unobserved care.\n- **Rescue therapy or acute-care failure.** Oral steroids, rescue biologic, unscheduled procedure, ED visit, hospitalization,\n  transfusion, dialysis start, or other urgent intervention can be folded into a failure composite when it is a clinically\n  recognized consequence of uncontrolled disease. Each component needs its own outcome algorithm and de-duplication window.\n\n**Pros, cons, and trade-offs.**\n- **Clinical response endpoint vs claims treatment-pattern proxy.** A clinical response endpoint is closer to the patient\n  state and can distinguish no response from partial response and loss of response. Cost: EHR/registry measurement is\n  missing, irregular, site-specific, and often visit-driven. A treatment-pattern proxy is scalable and complete in claims\n  when medical + pharmacy benefits are observable, but it measures a care decision. 
**Prefer clinical response** when the\n  data contain validated scores/labs/imaging at usable cadence; **prefer a proxy only as a pragmatic endpoint or\n  sensitivity analysis**, with the proxy label preserved.\n- **Composite treatment failure vs component-specific endpoints.** A composite (\"no response OR escalation OR switch OR\n  rescue\") raises event counts and matches real-world decision making, but a frequent low-specificity component can dominate\n  the endpoint. Always store the triggering component and report the component breakdown. **Prefer a composite** for net\n  clinical strategy failure; **prefer components** when the causal interpretation depends on efficacy rather than tolerability\n  or access.\n- **Failure including non-adherence vs pharmacologic failure among adherent users.** A payer may want the net effect of\n  initiating a strategy, where non-adherence is part of real-world performance. A clinician or regulator may want the\n  pharmacologic effect among patients who actually received an adequate trial. These are different estimands. If\n  non-adherence is counted as failure, call it **strategy failure**; if not, define minimum exposure/PDC/persistence rules\n  and account for adherence as an intercurrent event.\n\n**When to use.** Use a treatment-failure/non-response endpoint when the research question is about real-world effectiveness,\nsequencing, unmet need, treatment-resistant disease, time to next treatment, rescue therapy burden, or the population that\nremains uncontrolled after starting a therapy. 
It is especially useful in chronic inflammatory disease, oncology, diabetes,\nasthma/COPD, epilepsy, migraine, depression, heart failure, rare disease, and any setting where response is multidimensional\nand routine care produces treatment changes before hard endpoints accrue.\n\n**When NOT to use — and when it is actively misleading or dangerous.**\n- **Claims-only data labeled as clinical non-response.** Claims can show what was billed or dispensed; they usually cannot\n  show tumor shrinkage, remission, disease activity, symptom improvement, or clinician intent. Label a claims-only endpoint\n  as \"proxy treatment failure\" or \"time to treatment change,\" not non-response, unless validated against charts.\n- **No adequate-trial window.** A switch three days after initiation may be a formulary correction or intolerance, not\n  failure. Require minimum exposure, minimum persistence, and enough time for the therapy to plausibly work before declaring\n  no response.\n- **Informative measurement cadence.** Patients who are sicker, wealthier, treated at academic centers, or on drugs requiring\n  monitoring have more labs, scans, and notes. More measurement means more opportunities to detect failure. Report assessment\n  frequency by arm and consider landmark windows, inverse-probability-of-observation weights, or source restriction.\n- **Discontinuation treated as efficacy failure without reason.** Discontinuation can mean toxicity, remission, cost, death,\n  pregnancy, side-effect fear, or plan disenrollment. It is a valid component of a net strategy-failure composite, but not a\n  clean pharmacologic non-response endpoint.\n- **Ignoring death and other terminal events.** For a non-fatal failure endpoint, death prevents later escalation/switch and\n  should usually be a competing event or an unfavorable composite component, not ordinary censoring. 
In frail or oncology\n  cohorts, censoring death can make a high-mortality treatment look falsely \"failure-free.\"\n- **Post-hoc threshold shopping.** Trying several response thresholds, gap rules, and escalation definitions until the\n  result is pleasing creates a non-reproducible endpoint. Lock the threshold hierarchy in the protocol/SAP and route\n  alternates to sensitivity analysis.\n\n**Data-source operational depth.**\n- **Claims (FFS / commercial with complete medical + pharmacy).** Observable signals are fills, days_supply, medical-claim\n  administrations, procedures, diagnosis-coded acute events, ED/hospital use, and rescue medications. Require continuous\n  observable enrollment and complete pharmacy + medical benefits; exclude or flag MA-only/capitated spans where missing\n  claims can masquerade as no escalation, discontinuation, or no rescue therapy. Build exposure episodes before failure\n  classification, apply stockpiling/carry-over and inpatient-bridging rules, and pre-specify whether switch, dose escalation,\n  augmentation, rescue therapy, and discontinuation each trigger failure. Failure mode: a pharmacy benefit change or PA denial\n  looks like clinical failure; without reason-for-change data, keep the component label and avoid causal language.\n- **EHR.** Strongest for labs, vitals, scores, medication orders/administrations, clinician notes, and radiology text. Use\n  structured fields where available and NLP/abstraction for response language only after validation. Failure modes: orders are\n  not fills, problem lists lag reality, external care is missing, and assessment frequency differs by site and arm. 
Capture\n  baseline value, response assessment date, value/source, and whether the assessment was scheduled, clinically prompted, or\n  opportunistic if the data allow it.\n- **Registry.** Strong for clinician-adjudicated response categories, disease activity, stage, and reasons for switch, but may\n  miss complete medication fills and out-of-registry care. Use the registry to validate response/failure classification and\n  link to claims for continuous treatment and rescue-medication capture.\n- **Linked claims-EHR-registry-vital records.** Best substrate: clinical response from EHR/registry, complete treatment\n  history from claims, reasons for change from notes/registry, and death as a competing/composite event. Linkage selection and\n  date discrepancies must be reconciled before assigning failure dates.\n\n**Worked example (claims + EHR inadequate response composite).** Question: among new users of biologic A for inflammatory\nbowel disease, estimate time to **proxy treatment failure** over 12 months. (1) Index date = first administration/fill after\n365 days with no biologic A and complete medical + pharmacy enrollment. (2) Adequate-trial window = 90 days after index; events\nbefore day 90 are classified as early intolerance/access events unless the protocol makes them failure. (3) Clinical response\ncomponent from EHR: post-index fecal calprotectin or disease activity score at 90-180 days; no response if the value fails to\nimprove by the disease-specific threshold or remains above the active-disease cutoff. 
(4) Pattern proxy components from claims:\ndose escalation/interval shortening after day 90, systemic corticosteroid rescue after day 90, IBD-related hospitalization,\nswitch to a different advanced therapy, or discontinuation after a permissible gap not bridged by another observable therapy.\n(5) Failure date = earliest qualifying component date; store `failure_component`, `failure_source`, and `adequate_trial_met`.\n(6) Death before failure is a competing event for time-to-failure and a component in the net clinical failure composite if the\nestimand says death is unfavorable. (7) Sensitivities vary adequate-trial window (60/90/120 days), discontinuation gap\n(45/60/90 days), and whether discontinuation without switch counts as failure.\n\n**Interpreting the output.** If Patient 9180 starts biologic A on 2024-01-05, has 92 persistent days of therapy, receives\nsteroid rescue on 2024-04-20, dose escalation on 2024-05-18, and switches to biologic B on 2024-07-01, the composite\nfailure endpoint fires on 2024-04-20 with component = \"rescue therapy.\" The later dose escalation and switch are retained as\nsubsequent treatment-pattern events but do not change the time-to-first-failure endpoint. If the research question is\npharmacologic non-response, the analyst must check whether adherence and adequate exposure were sufficient before calling the\nrescue event non-response; if the research question is strategy failure, the rescue event can be counted directly.",
    "primary_category": "Outcome_Measure",
    "tags": [
      "outcome_measure",
      "treatment-failure",
      "non-response",
      "inadequate-response",
      "loss-of-response",
      "treatment-escalation",
      "rescue-therapy",
      "treatment-patterns",
      "composite-endpoint",
      "intercurrent-events",
      "claims",
      "ehr",
      "registry",
      "rwe"
    ],
    "applies_to_study_types": [
      "claims_analysis",
      "ehr_study",
      "registry_linkage",
      "comparative_effectiveness",
      "drug_utilization",
      "target_trial_emulation",
      "single_arm_external_control"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1002/pds.1230",
        "url": "https://doi.org/10.1002/pds.1230",
        "citation_text": "Andrade SE, Kahler KH, Frech F, Chan KA. Methods for evaluation of medication adherence and persistence using automated databases. Pharmacoepidemiology and Drug Safety. 2006;15(8):565-574.",
        "year": 2006,
        "authors_short": "Andrade et al.",
        "notes": "Foundational automated-database methods for persistence, gaps, switching, and treatment-episode construction - the claims substrate for treatment-failure proxies."
      },
      {
        "role": "introduce",
        "doi": "10.1111/j.1524-4733.2007.00213.x",
        "url": "https://doi.org/10.1111/j.1524-4733.2007.00213.x",
        "citation_text": "Cramer JA, Roy A, Burrell A, et al. Medication compliance and persistence: terminology and definitions. Value in Health. 2008;11(1):44-47.",
        "year": 2008,
        "authors_short": "Cramer et al.",
        "notes": "ISPOR terminology separating persistence/discontinuation from adherence/implementation; essential for distinguishing pharmacologic non-response from non-persistence or non-adherence."
      },
      {
        "role": "explain",
        "doi": "10.1002/pds.5507",
        "url": "https://doi.org/10.1002/pds.5507",
        "citation_text": "Wang SV, Pottegård A, Crown W, et al. HARmonized Protocol Template to Enhance Reproducibility of hypothesis evaluating real-world evidence studies on treatment effects: a good practices report of a joint ISPE/ISPOR task force. Pharmacoepidemiology and Drug Safety. 2023;32(1):44-55.",
        "year": 2023,
        "authors_short": "Wang et al.",
        "notes": "Protocol/SAP transparency reference for locking treatment-failure definitions, component hierarchy, estimand, and sensitivity analyses before analysis."
      },
      {
        "role": "explain",
        "doi": null,
        "url": "https://database.ich.org/sites/default/files/E9-R1_Step4_Guideline_2019_1203.pdf",
        "citation_text": "International Council for Harmonisation. ICH E9(R1) Addendum on Estimands and Sensitivity Analysis in Clinical Trials to the Guideline on Statistical Principles for Clinical Trials. 2019.",
        "year": 2019,
        "authors_short": "ICH E9(R1)",
        "notes": "Defines intercurrent-event strategies for treatment discontinuation, alternative treatment, rescue medication, and death - the framework needed to decide whether switch/rescue is a failure event, censoring event, or ignored event."
      },
      {
        "role": "explain",
        "doi": "10.1007/s12325-019-00970-1",
        "url": "https://doi.org/10.1007/s12325-019-00970-1",
        "citation_text": "Griffith SD, Tucker M, Bowser B, et al. Generating Real-World Tumor Burden Endpoints from Electronic Health Record Data: Comparison of RECIST, Radiology-Anchored, and Clinician-Anchored Approaches for Abstracting Real-World Progression in Non-Small Cell Lung Cancer. Advances in Therapy. 2019;36(8):2122-2136.",
        "year": 2019,
        "authors_short": "Griffith et al.",
        "notes": "Demonstrates why clinical-response/progression endpoints in EHR need explicit abstraction methods and cannot be treated as simple claims-derived events."
      },
      {
        "role": "demonstrate",
        "doi": "10.1200/CCI.24.00091",
        "url": "https://doi.org/10.1200/CCI.24.00091",
        "citation_text": "McKelvey B, Garrett-Mayer E, Rivera D, et al. Evaluation of Real-World Tumor Response Derived From Electronic Health Record Data Sources: A Feasibility Analysis in Patients With Metastatic Non-Small Cell Lung Cancer Treated With Chemotherapy. JCO Clinical Cancer Informatics. 2024;8:e2400091.",
        "year": 2024,
        "authors_short": "McKelvey et al.",
        "notes": "Multi-data-source feasibility example for real-world response endpoints (rwRR, rwDOR) and links to rwOS, rwTTNT, and rwTTD; useful model for real-world non-response/response governance."
      },
      {
        "role": "demonstrate",
        "doi": "10.1002/cpt.3045",
        "url": "https://doi.org/10.1002/cpt.3045",
        "citation_text": "Mhatre SK, Machado RJM, Ton TGN, et al. Real-World Progression-Free Survival as an Endpoint in Lung Cancer. Clinical Pharmacology & Therapeutics. 2024;115(1):133-142.",
        "year": 2024,
        "authors_short": "Mhatre et al.",
        "notes": "Replication-style oncology example comparing rwPFS derived from EHR notes with trial PFS concepts; supports cautious use of EHR-derived failure/progression endpoints."
      },
      {
        "role": "use",
        "doi": null,
        "url": "https://www.fda.gov/regulatory-information/search-fda-guidance-documents/real-world-data-assessing-electronic-health-records-and-medical-claims-data-support-regulatory",
        "citation_text": "US Food and Drug Administration. Real-World Data: Assessing Electronic Health Records and Medical Claims Data To Support Regulatory Decision-Making for Drug and Biological Products. Guidance for Industry. 2024.",
        "year": 2024,
        "authors_short": "FDA",
        "notes": "Regulatory guidance on selecting EHR/claims data sources that can sufficiently characterize populations, exposures, outcomes, and covariates for effectiveness or safety decisions."
      },
      {
        "role": "use",
        "doi": null,
        "url": "https://www.nice.org.uk/corporate/ecd9/resources/nice-realworld-evidence-framework-pdf-1124020816837",
        "citation_text": "National Institute for Health and Care Excellence. NICE real-world evidence framework. Corporate document ECD9. Updated 2026.",
        "year": 2026,
        "authors_short": "NICE",
        "notes": "Governance framework emphasizing transparent planning, fit-for-purpose data, bias assessment, and reproducible reporting for RWE used in guidance."
      }
    ],
    "plain_language_summary": "Treatment failure or non-response means that a treatment did not produce enough benefit or did not remain acceptable in\nroutine care. In real-world data, that can be measured directly when EHRs or registries contain labs, scores, imaging, or\nclinician-assessed response; in insurance claims it is usually inferred from what happens next, such as dose escalation,\nrescue therapy, switching, stopping, or moving to a new line of therapy. The honest label matters: a claims switch is a\nproxy for failure, not proof that the drug biologically failed.",
    "key_terms": [
      {
        "term": "no response",
        "definition": "No clinically meaningful improvement after an adequate exposure and assessment window, using a pre-specified disease-specific threshold."
      },
      {
        "term": "partial or inadequate response",
        "definition": "Some improvement occurs, but not enough to meet the responder threshold or to reach a low-disease-activity or remission state."
      },
      {
        "term": "loss of response",
        "definition": "The patient initially meets a response criterion, then later worsens, needs rescue therapy, escalates treatment, or switches therapy."
      },
      {
        "term": "proxy treatment failure",
        "definition": "A claims- or treatment-pattern signal that the initial therapy was not adequate or acceptable, such as switch, add-on, next line, rescue medication, or discontinuation."
      },
      {
        "term": "adequate-trial window",
        "definition": "The minimum time and exposure needed before it is clinically fair to judge whether the therapy worked."
      },
      {
        "term": "rescue therapy",
        "definition": "An additional urgent or short-term treatment given because the disease is uncontrolled, such as systemic corticosteroids, urgent procedure, or hospitalization-specific rescue intervention."
      }
    ],
    "worked_example": {
      "scenario": "A patient starts biologic A for inflammatory bowel disease on 2024-01-05. The protocol defines adequate exposure as at\nleast 90 days of persistence. After that window, any of the following can trigger proxy treatment failure first: steroid\nrescue, dose escalation, switch to another advanced therapy, IBD hospitalization, or EHR-documented no response. The\npatient receives steroid rescue before any later escalation or switch, so the failure date is the rescue date.",
      "dataset": {
        "caption": "One-patient treatment and response record. The component that occurs first after the adequate-trial window fires the time-to-first-failure endpoint.",
        "columns": [
          "person_id",
          "event_date",
          "event_type",
          "source",
          "qualifies_after_adequate_trial"
        ],
        "rows": [
          [
            9180,
            "2024-01-05",
            "biologic_A_start",
            "pharmacy/infusion claim",
            "not applicable"
          ],
          [
            9180,
            "2024-04-05",
            "adequate_trial_met",
            "derived persistence rule",
            "yes"
          ],
          [
            9180,
            "2024-04-20",
            "steroid_rescue",
            "pharmacy claim",
            "yes"
          ],
          [
            9180,
            "2024-05-18",
            "dose_escalation",
            "medical/pharmacy claim",
            "yes"
          ],
          [
            9180,
            "2024-07-01",
            "switch_to_biologic_B",
            "pharmacy/infusion claim",
            "yes"
          ]
        ]
      },
      "steps": [
        "Set the index date to 2024-01-05, the first observable biologic A fill or administration.",
        "Require the adequate-trial window to be met before classifying a pattern change as treatment failure. Here the patient remains persistent through 2024-04-05, so the window is met.",
        "List all qualifying post-window failure signals in date order. The first is systemic steroid rescue on 2024-04-20.",
        "Assign failure = 1, failure_date = 2024-04-20, failure_component = steroid_rescue, and failure_source = pharmacy claim.",
        "Retain later dose escalation and switch as subsequent trajectory data, but do not move the time-to-first-failure date.",
        "In reporting, call this proxy treatment failure unless chart/lab evidence confirms clinical non-response."
      ],
      "result": "Proxy treatment failure occurs on 2024-04-20, 106 days after index, triggered by steroid rescue after the adequate trial window. Dose escalation and switch occur later and are retained as component history but do not redefine the first failure date.",
      "timeline_spec": {
        "title": "Treatment failure composite for one patient",
        "window": {
          "start": "2024-01-05",
          "end": "2024-07-01",
          "label": "Biologic A start through observed switch"
        },
        "events": [
          {
            "label": "Biologic A start",
            "start": "2024-01-05",
            "marker_day": 0,
            "quantity": "Day 0 - index treatment initiation"
          },
          {
            "label": "Adequate trial met",
            "start": "2024-04-05",
            "marker_day": 91,
            "quantity": "90+ days persistent on biologic A"
          },
          {
            "label": "Steroid rescue",
            "start": "2024-04-20",
            "marker_day": 106,
            "quantity": "First qualifying failure component",
            "flag": "composite_trigger"
          },
          {
            "label": "Dose escalation",
            "start": "2024-05-18",
            "marker_day": 134,
            "quantity": "Later component; retained, not first failure"
          },
          {
            "label": "Switch to biologic B",
            "start": "2024-07-01",
            "marker_day": 178,
            "quantity": "Later component; retained, not first failure"
          }
        ],
        "spans": [
          {
            "kind": "exposed",
            "start": "2024-01-05",
            "end": "2024-04-05",
            "label": "Adequate-trial window"
          },
          {
            "kind": "followup",
            "start": "2024-04-05",
            "end": "2024-04-20",
            "label": "At risk for failure after adequate trial"
          }
        ],
        "result": {
          "label": "Failure date = 2024-04-20; component = steroid rescue; time to failure = 106 days",
          "value": 106
        },
        "caption": "The patient is not judged for failure until the adequate-trial window is met. The first qualifying failure signal after that point is steroid rescue, so the composite endpoint fires there.",
        "alt_text": "A horizontal patient timeline from January 5 to July 1, 2024. The adequate trial window ends on April 5. A steroid rescue marker on April 20 is flagged as the first failure component; later dose escalation and switch markers are shown but do not change the first-failure date."
      }
    },
    "prerequisites": [
      "outcome-algorithm-construction-rwe",
      "composite-endpoint-construction-rwe",
      "treatment-patterns-lines-of-therapy",
      "persistence-time-to-discontinuation",
      "switch-add-on-augmentation-rwe",
      "estimands-ate-att-intercurrent-events-rwe"
    ],
    "index_definitions": [
      {
        "name": "Clinical non-response",
        "definition": "Failure to meet a pre-specified disease-specific response threshold after an adequate exposure and assessment window.",
        "source": "Disease-specific response criteria and protocol/SAP definition",
        "use": "Best endpoint when structured labs, scores, imaging, PROs, or clinician assessments are available.",
        "notes": "Requires baseline status, assessment timing, threshold, and missing-measurement handling."
      },
      {
        "name": "Proxy treatment failure",
        "definition": "Treatment-pattern signal that the initial strategy was not adequate or acceptable, such as escalation, rescue therapy, switch, discontinuation, or next line of therapy.",
        "source": "Claims/EHR treatment-pattern algorithms",
        "use": "Scalable endpoint in claims or mixed data when clinical response is not consistently observed.",
        "notes": "Measures care decisions and strategy failure, not pure pharmacologic efficacy."
      },
      {
        "name": "Loss of response",
        "definition": "Initial response followed by later worsening, treatment escalation, rescue therapy, switch, or new line of therapy.",
        "source": "Longitudinal response and treatment-pattern data",
        "use": "Chronic disease, oncology, and immune-mediated disease analyses where response can wane.",
        "notes": "Requires retaining both first response date and subsequent failure date."
      }
    ],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Landmark clinical non-response",
        "description": "Classify non-response at a fixed post-index assessment window using a disease-specific threshold, such as no clinically meaningful lab improvement, persistent high disease activity, non-response category, or no tumor response.",
        "edge_cases": [
          "Missing post-baseline assessments are informative; do not assume missing equals response or non-response without a rule.",
          "Patients must have baseline disease activity severe enough to improve; otherwise non-response is structurally ambiguous.",
          "Assessment cadence may differ by arm, site, calendar time, and payer."
        ],
        "data_source_notes": "ehr/registry: strongest when baseline and follow-up scores/labs/imaging are structured or validated by abstraction; claims alone generally cannot support this variant."
      },
      {
        "name": "Loss of response after initial response",
        "description": "Require an initial responder state, then identify worsening, rescue therapy, escalation, switch, or progression after the response date.",
        "edge_cases": [
          "Without storing the initial response date, later worsening cannot be distinguished from primary non-response.",
          "Temporary flares should be handled with confirmation or repeat-measure rules to avoid classifying noise as loss of response."
        ],
        "data_source_notes": "linked: pair EHR/registry response measures with claims rescue/escalation signals for the most complete trajectory."
      },
      {
        "name": "Treatment escalation proxy",
        "description": "A failure proxy triggered by dose intensification, shorter dosing interval, add-on/augmentation, procedure, or rescue medication after an adequate-trial window.",
        "edge_cases": [
          "Dose optimization or label-recommended titration should not be counted as failure unless above a pre-specified threshold.",
          "Add-on therapy may be planned combination care rather than inadequate response; require prior persistence or disease activity evidence when intent matters."
        ],
        "data_source_notes": "claims: requires complete medical + pharmacy capture and drug-specific rules for doses, cycle intervals, HCPCS units, and administered-drug schedules."
      },
      {
        "name": "Switch/discontinuation/next-line proxy",
        "description": "A failure proxy triggered by switch to a new treatment class, line advancement, or discontinuation beyond a permissible gap, with or without a subsequent therapy.",
        "edge_cases": [
          "Discontinuation without a next therapy is nonspecific; it may indicate remission, intolerance, affordability, death, disenrollment, or missing care rather than non-response.",
          "New line definitions are disease-specific and must handle maintenance therapy, restarts, and treatment holidays."
        ],
        "data_source_notes": "claims: use line-of-therapy and persistence algorithms; registry/EHR notes are needed to validate reason for change."
      },
      {
        "name": "Composite strategy-failure endpoint",
        "description": "Time to first of no response, inadequate response, loss of response, escalation, rescue therapy, switch, discontinuation, next line, failure-related hospitalization, or death if death is declared unfavorable.",
        "edge_cases": [
          "The most frequent and least specific component can dominate the composite; always report component contribution.",
          "Component algorithms have different PPV/sensitivity, so the composite inherits the weakest component."
        ],
        "data_source_notes": "linked: ideal; each component should retain date, source, rule version, and validation status."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "treatment-patterns-lines-of-therapy",
        "pros_of_this": "Adds clinical meaning by treating line advancement or regimen change as a failure-related outcome rather than only describing sequence.",
        "cons_of_this": "Requires stronger assumptions about why the line changed; a new line may reflect toxicity, access, or preference rather than non-response.",
        "when_to_prefer": "When the analytic question is inadequate treatment benefit or strategy failure rather than descriptive sequencing alone."
      },
      {
        "compared_to": "persistence-time-to-discontinuation",
        "pros_of_this": "Distinguishes stopping from other failure signals and can combine clinical non-response with rescue, escalation, or switch.",
        "cons_of_this": "Less standardized than a single permissible-gap persistence endpoint and more sensitive to component choices.",
        "when_to_prefer": "When discontinuation alone is too nonspecific and the study needs an effectiveness or control endpoint."
      },
      {
        "compared_to": "composite-endpoint-construction-rwe",
        "pros_of_this": "Provides the disease/treatment-specific component hierarchy for a failure composite and preserves the non-response vs proxy distinction.",
        "cons_of_this": "A treatment-failure composite is often less clinically homogeneous than classic hard-event composites.",
        "when_to_prefer": "When multiple routine-care signals collectively define unacceptable disease control."
      },
      {
        "compared_to": "real-world-progression-rwpfs-rwe",
        "pros_of_this": "Generalizes beyond oncology progression to broader no-response, partial-response, escalation, rescue, and strategy-failure constructs.",
        "cons_of_this": "Less standardized than oncology rwP/rwPFS and usually more dependent on therapeutic-area definitions.",
        "when_to_prefer": "Non-oncology settings or oncology analyses focused on response/failure rather than progression-free survival."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Build complete treatment episodes first; require observable medical + pharmacy enrollment; exclude incomplete MA-only or capitated spans unless capture is proven. Use claims for switch, add-on, dose escalation, rescue medications, procedures, acute-care failure, and discontinuation. Label the endpoint as a proxy unless chart/registry validation shows that the pattern reliably identifies non-response.",
      "ehr": "Use structured response measures, labs, imaging, vitals, scores, medication administrations, and clinician notes. Record baseline value, follow-up value/date, source field, and assessment cadence. Validate NLP or abstracted non-response labels and quantify missingness by arm/site.",
      "registry": "Prefer registry response categories, disease activity, stage, and reason-for-change fields when available. Link to claims for full treatment/rescue capture and to mortality sources when death is an intercurrent or composite event.",
      "linked": "Reconcile date hierarchies across fill, administration, lab, note, scan, registry assessment, and death records. Preserve component labels and source provenance so reviewers can separate clinical non-response from treatment-pattern proxies."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\n\nADEQUATE_TRIAL_DAYS = 90\nFAILURE_TYPES = {\n    \"rescue_therapy\",\n    \"dose_escalation\",\n    \"switch\",\n    \"next_line\",\n    \"discontinuation\",\n    \"failure_hospitalization\",\n}\nCLINICAL_FAILURE_CLASSES = {\"no_response\", \"partial_inadequate_response\", \"loss_of_response\"}\n\ndef build_treatment_failure(index, persistence, clinical, events):\n    idx = index.merge(persistence, on=\"person_id\", how=\"left\")\n    idx[\"adequate_trial_met\"] = idx[\"persistent_days_on_index\"].fillna(0) >= ADEQUATE_TRIAL_DAYS\n    idx[\"trial_end\"] = idx[\"index_date\"] + pd.to_timedelta(ADEQUATE_TRIAL_DAYS, unit=\"D\")\n\n    clinical_fail = clinical[clinical[\"response_class\"].isin(CLINICAL_FAILURE_CLASSES)].copy()\n    clinical_fail = clinical_fail.rename(columns={\"assessment_date\": \"failure_date\"})\n    clinical_fail[\"failure_component\"] = clinical_fail[\"response_class\"]\n    clinical_fail[\"failure_source\"] = \"clinical_response_assessment\"\n\n    proxy_fail = events[events[\"event_type\"].isin(FAILURE_TYPES)].copy()\n    proxy_fail = proxy_fail.rename(columns={\"event_date\": \"failure_date\", \"event_type\": \"failure_component\"})\n    proxy_fail[\"failure_source\"] = \"treatment_pattern_proxy\"\n\n    candidates = pd.concat([\n        clinical_fail[[\"person_id\", \"failure_date\", \"failure_component\", \"failure_source\"]],\n        proxy_fail[[\"person_id\", \"failure_date\", \"failure_component\", \"failure_source\"]],\n    ], ignore_index=True)\n\n    candidates = candidates.merge(idx[[\"person_id\", \"index_date\", \"trial_end\", \"adequate_trial_met\"]], on=\"person_id\")\n    candidates = candidates[(candidates[\"adequate_trial_met\"]) & (candidates[\"failure_date\"] >= candidates[\"trial_end\"])]\n\n    first = (candidates.sort_values([\"person_id\", \"failure_date\"])\n             .groupby(\"person_id\", as_index=False)\n             .first())\n    first[\"time_to_failure_days\"] = 
(first[\"failure_date\"] - first[\"index_date\"]).dt.days\n    return first",
        "description": "Minimal treatment-failure composite builder. Inputs are already cleaned and restricted to observable person-time:\n  index      : person_id, index_date\n  persistence: person_id, persistent_days_on_index\n  clinical   : person_id, assessment_date, response_class  # e.g., response/partial/no_response/loss_of_response\n  events     : person_id, event_date, event_type            # rescue, escalation, switch, discontinuation, hospitalization\nThe code enforces an adequate-trial window, classifies clinical non-response separately from proxy pattern failure, and\nreturns the first qualifying component per patient.",
        "dependencies": [
          "pandas"
        ],
        "source_citations": [
          "andrade-2006",
          "ich-e9-r1-2019"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\n\nbuild_treatment_failure <- function(index, persistence, clinical, events,\n                                    adequate_trial_days = 90L) {\n  setDT(index); setDT(persistence); setDT(clinical); setDT(events)\n  failure_types <- c(\"rescue_therapy\", \"dose_escalation\", \"switch\",\n                     \"next_line\", \"discontinuation\", \"failure_hospitalization\")\n  clinical_failure <- c(\"no_response\", \"partial_inadequate_response\", \"loss_of_response\")\n\n  idx <- merge(index, persistence, by = \"person_id\", all.x = TRUE)\n  idx[is.na(persistent_days_on_index), persistent_days_on_index := 0L]\n  idx[, adequate_trial_met := persistent_days_on_index >= adequate_trial_days]\n  idx[, trial_end := index_date + adequate_trial_days]\n\n  cfail <- clinical[response_class %in% clinical_failure,\n                    .(person_id, failure_date = assessment_date,\n                      failure_component = response_class,\n                      failure_source = \"clinical_response_assessment\")]\n  pfail <- events[event_type %in% failure_types,\n                  .(person_id, failure_date = event_date,\n                    failure_component = event_type,\n                    failure_source = \"treatment_pattern_proxy\")]\n  cand <- rbindlist(list(cfail, pfail), use.names = TRUE, fill = TRUE)\n  cand <- merge(cand, idx[, .(person_id, index_date, trial_end, adequate_trial_met)], by = \"person_id\")\n  cand <- cand[adequate_trial_met == TRUE & failure_date >= trial_end]\n  setorder(cand, person_id, failure_date)\n  first <- cand[, .SD[1], by = person_id]\n  first[, time_to_failure_days := as.integer(failure_date - index_date)]\n  first[]\n}",
        "description": "R/data.table version of the treatment-failure composite. Inputs:\n  index       person_id, index_date\n  persistence person_id, persistent_days_on_index\n  clinical    person_id, assessment_date, response_class\n  events      person_id, event_date, event_type\nClinical non-response and treatment-pattern proxies remain separate components.",
        "dependencies": [
          "data.table"
        ],
        "source_citations": [
          "andrade-2006",
          "ich-e9-r1-2019"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "%let adequate_trial_days = 90;\n\nproc sql;\n  create table idx as\n  select i.person_id, i.index_date,\n         coalesce(p.persistent_days_on_index, 0) as persistent_days_on_index,\n         calculated persistent_days_on_index >= &adequate_trial_days as adequate_trial_met,\n         i.index_date + &adequate_trial_days as trial_end format=date9.\n  from work.index i left join work.persistence p on i.person_id = p.person_id;\nquit;\n\ndata clinical_fail;\n  set work.clinical;\n  where response_class in (\"no_response\", \"partial_inadequate_response\", \"loss_of_response\");\n  failure_date = assessment_date;\n  failure_component = response_class;\n  failure_source = \"clinical_response_assessment\";\n  keep person_id failure_date failure_component failure_source;\nrun;\n\ndata proxy_fail;\n  set work.events;\n  where event_type in (\"rescue_therapy\", \"dose_escalation\", \"switch\",\n                       \"next_line\", \"discontinuation\", \"failure_hospitalization\");\n  failure_date = event_date;\n  failure_component = event_type;\n  failure_source = \"treatment_pattern_proxy\";\n  keep person_id failure_date failure_component failure_source;\nrun;\n\ndata candidates;\n  set clinical_fail proxy_fail;\nrun;\n\nproc sql;\n  create table eligible_candidates as\n  select c.*, i.index_date, i.trial_end\n  from candidates c inner join idx i on c.person_id = i.person_id\n  where i.adequate_trial_met = 1\n    and c.failure_date >= i.trial_end;\nquit;\n\nproc sort data=eligible_candidates;\n  by person_id failure_date;\nrun;\n\ndata first_treatment_failure;\n  set eligible_candidates;\n  by person_id;\n  if first.person_id then do;\n    time_to_failure_days = failure_date - index_date;\n    output;\n  end;\nrun;",
        "description": "SAS pattern for the same composite. Input tables should already be cleaned to observable person-time:\n  work.index, work.persistence, work.clinical, work.events.",
        "dependencies": [],
        "source_citations": [
          "andrade-2006",
          "ich-e9-r1-2019"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Start[Index treatment start] --> Adequate{Adequate-trial window met?}\n  Adequate -->|No| Early[Classify early stop/change separately<br/>intolerance, access, data loss, early failure if pre-specified]\n  Adequate -->|Yes| Clinical{Clinical response data available?}\n  Clinical -->|Yes| Response[Apply response threshold<br/>response / partial / no response / loss of response]\n  Clinical -->|No| Proxy[Use treatment-pattern proxies<br/>escalation, rescue, switch, discontinuation, next line]\n  Response --> Composite[Failure candidate table<br/>date + component + source + rule version]\n  Proxy --> Composite\n  Composite --> First[First qualifying component<br/>time-to-failure endpoint]\n  First --> Report[Report component breakdown<br/>clinical vs proxy, sensitivity rules, missingness]",
        "caption": "Treatment failure/non-response construction begins with adequate exposure, then routes to clinical response evidence when available and treatment-pattern proxies when not. Every event keeps its component and source label.",
        "alt_text": "Flowchart from treatment start to adequate-trial check, then to either clinical response assessment or treatment pattern proxies, then to a candidate failure table, first qualifying failure component, and reporting.",
        "source_type": "illustrative",
        "source_citations": [
          "ich-e9-r1-2019",
          "fda-rwd-ehr-claims-2024"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "used_with",
        "target_slug": "treatment-patterns-lines-of-therapy",
        "notes": "Switches, add-ons, and next-line events are common treatment-failure proxies but need clinical interpretation."
      },
      {
        "relation_type": "used_with",
        "target_slug": "switch-add-on-augmentation-rwe",
        "notes": "Escalation and augmentation rules distinguish inadequate-control signals from planned combination therapy."
      },
      {
        "relation_type": "used_with",
        "target_slug": "persistence-time-to-discontinuation",
        "notes": "Minimum persistence and discontinuation gap rules determine whether a patient had an adequate treatment trial."
      },
      {
        "relation_type": "used_with",
        "target_slug": "pdc-proportion-of-days-covered",
        "notes": "Adherence/implementation should be separated from pharmacologic non-response unless the estimand counts non-adherence as strategy failure."
      },
      {
        "relation_type": "used_with",
        "target_slug": "composite-endpoint-construction-rwe",
        "notes": "Treatment failure is often a composite endpoint whose components must be stored and reported separately."
      },
      {
        "relation_type": "used_with",
        "target_slug": "real-world-progression-rwpfs-rwe",
        "notes": "Oncology progression and rwPFS are disease-specific examples of response/progression-derived failure endpoints."
      },
      {
        "relation_type": "used_with",
        "target_slug": "outcome-algorithm-construction-rwe",
        "notes": "Rescue therapy, hospitalization, procedure, and clinical response components each require reproducible algorithms."
      },
      {
        "relation_type": "used_with",
        "target_slug": "estimands-ate-att-intercurrent-events-rwe",
        "notes": "Switch, discontinuation, rescue therapy, and death are intercurrent events unless the estimand makes them failure components."
      },
      {
        "relation_type": "used_with",
        "target_slug": "estimand-analysis-traceability-rwe",
        "notes": "A traceability matrix should map each failure component, source, threshold, and sensitivity analysis to the estimand."
      }
    ],
    "aliases": [
      "treatment failure",
      "non-response",
      "inadequate response",
      "partial response",
      "loss of response",
      "primary non-response",
      "secondary failure",
      "treatment-resistant disease",
      "treatment escalation",
      "rescue therapy endpoint",
      "time to treatment failure",
      "proxy treatment failure"
    ],
    "complexity": "advanced",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "treatment-patterns-lines-of-therapy",
    "name": "Treatment Patterns and Lines of Therapy (LOT)",
    "short_definition": "An algorithmic exposure-construction method that converts a longitudinal sequence of drug fills or administrations in claims/EHR data into discrete, ordered lines of therapy (LOT1, LOT2, ...) and characterizes initiation, persistence, switching, augmentation/add-on, and advancement to the next line over time.",
    "long_description": "**Treatment patterns and lines of therapy (LOT)** are an *exposure-definition* construct, not an estimator: the deliverable\nis a derived, analysis-ready exposure variable (a per-patient ordered set of regimens with start/stop dates and a\n`lot_number`) built deterministically from temporal fill/administration sequences. Downstream comparative analyses\n(survival, HCRU, cost) then treat that variable as the exposure. Because the variable is *constructed by an algorithm*,\nits validity is a property of the rules — gap length, minimum claims per line, add-on vs substitution logic, and\nprogression triggers — and those rules must be pre-specified and, for regulatory or HTA use, validated against medical\nchart review with reported agreement statistics (e.g., kappa, line-count and regimen concordance). LOT algorithms should\nbe developed or reviewed with practicing clinicians familiar with the disease and its guidelines; in oncology the FLAURA\nvs CheckMate-style line conventions, maintenance-therapy handling, and combination-regimen windows are not derivable from\nfills alone.\n\n**Core conceptual distinction** — three things must be separated and pre-specified. (1) *A line vs an episode of a single\ndrug*: a line is a regimen (one or more drugs started together within a short combination window, e.g., 28 days) carried\nforward until it ends. (2) *What ends a line and starts the next one*: the canonical events are a **switch** (a new agent\nnot in the current regimen, with the prior agent stopped — substitution), an **augmentation/add-on** (a new agent added\nwhile the prior agent continues — this does NOT advance the line under most oncology conventions but DOES under some\nchronic-disease conventions, so it must be declared), and a **gap-then-restart** (the regimen lapses beyond a permissible\ngap and a later fill begins a new line). 
(3) *The estimand the LOT feeds*: time-to-next-treatment-or-death (TTNTD) and\ntime-to-discontinuation are duration estimands defined *within* a line; line-of-therapy distribution and attrition (the\nshare reaching LOT2, LOT3) are sequencing estimands defined *across* lines. The same fill data yield different numbers\nunder a cause-specific hazard for \"advance to next line\" (treating death as a censoring event) versus a Fine–Gray\nsubdistribution for the cumulative incidence of advancement (treating death as a competing event) — in older oncology\ncohorts where death is common, this choice materially changes the reported share advancing and must be stated in the\nestimand, not chosen post hoc.\n\n**Pros, cons, and trade-offs** (specific and comparative).\n- **vs persistence / time-to-discontinuation (single-drug):** LOT captures *sequencing and advancement* — what the patient\n  moves to after the index regimen fails, progresses, or causes toxicity — which persistence alone cannot describe.\n  Persistence is one ingredient (it defines when a line lapses), but a persistence analysis answers \"how long on drug A,\"\n  whereas LOT answers \"A then B then C.\" **Prefer LOT** in oncology, rheumatology, MS, and any progressive/multi-regimen\n  disease. Cost: LOT rules are disease- and algorithm-specific, far less standardized than a simple permissible-gap\n  persistence rule, and chart validation is resource-intensive.\n- **vs a cascade-of-care / funnel analysis:** the cascade is a *population* funnel from diagnosis through linkage,\n  treatment, and control; LOT is the *post-initiation* sequencing engine inside the treated arm of that funnel. They are\n  complementary, not substitutes. 
Cost: LOT says nothing about the undiagnosed/untreated upstream losses.\n- **vs treating each NDC fill as the exposure (no line construction):** raw fills overcount \"treatments\" — sample fills,\n  bridging, mail-order stockpiling, and dose splits all look like distinct events — and cannot express regimens or\n  advancement. LOT collapses these correctly but at the price of analyst-defined windows that, if mis-set, manufacture or\n  erase lines. **Prefer raw fills** only for pure utilization counting where sequencing is irrelevant.\n\n**When to use** — describing real-world treatment sequencing in chronic or progressive disease; sizing the population\nreaching each later line (a direct input to budget-impact and Markov/DES transition probabilities); defining a\nline-specific index date for a downstream comparative study (e.g., 2L comparative effectiveness); and quantifying switch,\nadd-on, and discontinuation rates for HTA dossiers. A defensible LOT requires a clinically grounded combination window,\npermissible gap, minimum-claims rule, and explicit maintenance and progression handling, all pre-registered.\n\n**When NOT to use — and when it is actively misleading or dangerous**\n- **No clinical anchor for \"line\" exists for the disease.** In conditions treated with continuous single-agent therapy and\n  no orderly sequencing (most uncomplicated hypertension), forcing LOT manufactures structure that does not exist; use\n  persistence/switching instead.\n- **The data cannot see the regimen.** Provider-administered oncolytics billed under the medical benefit (J-codes,\n  HCPCS), inpatient chemotherapy bundled into a DRG, and 340B/buy-and-bill arrangements are frequently invisible or\n  incompletely coded in pharmacy-only datasets — an algorithm run on pharmacy fills alone will silently drop entire lines\n  and over-report watch-and-wait. 
Require medical-claim drug capture before claiming LOT in oncology.\n- **Immortal time and procedure-anchored lines.** If a line's start is keyed to a procedure (e.g., surgery, transplant)\n  and follow-up is measured from an earlier landmark (diagnosis), the interval in which the patient must survive to\n  receive that line is immortal — advancement rates and within-line survival are inflated. Anchor each line's time zero to\n  its own first fill/administration.\n- **Differential competing risks by exposure.** In elderly claims cohorts, sicker first-line regimens are followed by\n  higher early mortality; a cause-specific \"advance to next line\" analysis that censors those deaths overstates the share\n  advancing in the sicker arm relative to a Fine–Gray subdistribution view. Pre-specify the competing-risk handling.\n- **Maintenance miscounted as a new line.** In ovarian cancer and lymphoma, PARP inhibitors or rituximab maintenance\n  started after active therapy are part of the same line under modern conventions; coding them as LOT2 inflates\n  line counts (the Simmons et al. validation found first-line maintenance regimen-match required explicit maintenance\n  rules to reach agreement).\n\n**Data-source operational depth**\n- **Claims (FFS or commercial):** Pharmacy fills give NDC + `fill_date` + `days_supply`; provider-administered drugs are in\n  medical claims as HCPCS/J-codes with a service date but no days_supply, so durations must be imputed from cycle\n  schedules. Require continuous medical AND pharmacy enrollment across the baseline and follow-up so that \"no further fill\"\n  is true discontinuation, not unobserved care. *Failure mode:* **Medicare Advantage encounter data lack the complete\n  fee-for-service claim stream** — MA-only person-time produces phantom gaps and missing lines; restrict to enrollees with\n  Parts A/B/D (or a complete commercial medical+pharmacy benefit) and exclude MA-only spans. 
*Failure mode:* sample fills,\n  90-day mail order, and stockpiling distort `days_supply`, shifting gap-defined line boundaries. *Failure mode:*\n  **differential competing risks by exposure in elderly claims** bias advancement estimates (see above).\n- **EHR:** Orders and medication-administration records (MAR) capture provider-administered oncolytics that pharmacy\n  claims miss, and problem lists/labs/staging sharpen progression triggers; but visit-driven capture means a patient who\n  receives a line outside the system is differentially lost, and an unobserved out-of-network line looks like\n  discontinuation. Prefer EHR linked to claims to reassemble the full regimen history.\n- **Registry:** Often records protocol-defined lines and adjudicated progression prospectively — the gold standard for\n  *validating* a claims LOT algorithm — but typically incomplete for the full longitudinal pharmacy stream and for\n  out-of-registry care.\n- **Linked claims–EHR–registry:** The ideal substrate (medical-benefit drug capture + staging/progression + complete\n  enrollment), but linkage selects the linkable subset and introduces order/fill/service date discrepancies that must be\n  reconciled before assigning line start dates.\n\n**Worked claims example.** Question: real-world LOT distribution and time-to-LOT2 in metastatic non–small-cell lung cancer\nin a commercial + Medicare FFS database with medical-benefit drug capture. (1) *Cohort:* adults with ≥2 mNSCLC diagnoses,\n365 days of continuous A/B/D (or commercial medical+pharmacy) enrollment before the first antineoplastic, and exclude\nMA-only person-time. (2) *Antineoplastic events:* union of pharmacy NDC fills and medical-claim HCPCS/J-code\nadministrations for the curated mNSCLC drug list, each with a `service_date` and (for fills) `days_supply`. (3)\n*LOT1 start:* the first antineoplastic `service_date` after the metastatic-diagnosis washout. 
(4) *Regimen window:* all\ndistinct agents within 28 days of LOT1 start form the LOT1 regimen (combination capture). (5) *Line advancement:* LOT2\nbegins at the first event of an agent **not** in the LOT1 regimen accompanied by stopping ≥1 LOT1 agent (substitution), OR\nat the first antineoplastic after a gap exceeding the permissible 90 days, counted from the end of the LOT1 regimen's last\n`days_supply` coverage (restart); an added agent that continues alongside the full LOT1 regimen is logged as **augmentation**\nand does NOT advance the line.\n(6) *Maintenance rule:* a single-agent continuation (e.g., pemetrexed/immunotherapy maintenance) after a defined induction\nis held within LOT1, not counted as LOT2. (7) *Estimand:* cumulative incidence of reaching LOT2 with death as a competing\nevent (Fine–Gray), reported alongside the cause-specific advancement hazard; time-to-LOT2 measured from LOT1 start. (8)\n*Sensitivity:* vary the combination window (14/28/42 days), permissible gap (60/90/120 days), and the medical-benefit drug\nlist; report chart-validation agreement (line count, first-line regimen match) before the algorithm is used for decisions.",
    "primary_category": "Exposure_Definition",
    "tags": [
      "treatment-patterns",
      "lines-of-therapy",
      "lot",
      "sequencing",
      "switch",
      "augmentation",
      "discontinuation",
      "claims",
      "oncology",
      "rwe"
    ],
    "applies_to_study_types": [
      "drug_utilization",
      "cohort_retrospective",
      "active_comparator_new_user",
      "claims_analysis"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1002/pds.1230",
        "url": "https://doi.org/10.1002/pds.1230",
        "citation_text": "Andrade SE, Kahler KH, Frech F, Chan KA. Methods for evaluation of medication adherence and persistence using automated databases. Pharmacoepidemiology and Drug Safety. 2006;15(8):565-574.",
        "year": 2006,
        "authors_short": "Andrade et al.",
        "notes": "Foundational methods statement for deriving fill-based exposure measures (persistence, gaps, switching) from automated pharmacy databases - the building blocks of any line-of-therapy algorithm."
      },
      {
        "role": "explain",
        "doi": "10.1016/j.jbi.2019.103335",
        "url": "https://doi.org/10.1016/j.jbi.2019.103335",
        "citation_text": "Meng W, Ou W, Chandwani S, Chen X, Black W, Cai Z. Temporal phenotyping by mining healthcare data to derive lines of therapy for cancer. Journal of Biomedical Informatics. 2019;100:103335.",
        "year": 2019,
        "authors_short": "Meng et al.",
        "notes": "Explicit two-stage methodology - a rule-based patient-level LOT algorithm (combination windows, gaps, substitution) plus clustering for temporal phenotyping - demonstrated in metastatic NSCLC and melanoma claims."
      },
      {
        "role": "demonstrate",
        "doi": "10.1007/s12325-025-03174-y",
        "url": "https://doi.org/10.1007/s12325-025-03174-y",
        "citation_text": "Simmons D, White J, Walker V, Blank SV, Munley J, McLaurin K. Validation of an administrative claims-based line of therapy algorithm for women with ovarian cancer using medical chart review. Advances in Therapy. 2025;42(6):2754-2766.",
        "year": 2025,
        "authors_short": "Simmons et al.",
        "notes": "Chart-review validation of a claims LOT algorithm; reports weighted kappa for active and maintenance line counts and shows maintenance handling is essential to first-line regimen agreement - the model for reporting LOT validity."
      },
      {
        "role": "use",
        "doi": "10.1038/s41598-023-44389-9",
        "url": "https://doi.org/10.1038/s41598-023-44389-9",
        "citation_text": "Birck MG, et al. Real-world treatment patterns of rheumatoid arthritis in Brazil: analysis of DATASUS national administrative claims data for pharmacoepidemiology studies (2010-2020). Scientific Reports. 2023;13:17456.",
        "year": 2023,
        "authors_short": "Birck et al.",
        "notes": "Applied national claims (DATASUS) treatment-pattern study using an explicit claims-based line definition outside oncology, illustrating the same gap/switch logic in rheumatoid arthritis."
      }
    ],
    "plain_language_summary": "Lines of therapy (LOT) describe the ordered sequence of drug regimens a patient moves through over time, like chapters in their treatment story. An algorithm reads a patient's prescription fills in claims data, decides when one regimen ends and the next begins — based on a patient stopping one drug and starting another (a switch) or going without any fills for too long (a gap) — and labels each chapter LOT1, LOT2, and so on. The result tells researchers how many patients ever reach a second or third treatment, what drugs they moved to, and how long each treatment chapter lasted. It cannot see drugs given in a doctor's office that are billed separately from the pharmacy, so some infused cancer drugs may be invisible to a pharmacy-only analysis.",
    "key_terms": [
      {
        "term": "line of therapy",
        "definition": "One ordered chapter of a patient's treatment: a named set of drugs the patient was on together, from the day that chapter started to the day it ended."
      },
      {
        "term": "regimen",
        "definition": "The specific drug or combination of drugs that make up a single line of therapy (for example, erlotinib alone, or carboplatin plus pemetrexed together)."
      },
      {
        "term": "switch",
        "definition": "A substitution event where the patient stops a drug from the current regimen and starts a new, different drug — the clearest signal that one line has ended and a new one has begun."
      },
      {
        "term": "augmentation",
        "definition": "Adding a second drug on top of the existing regimen while keeping the original drug going — this does NOT start a new line under most cancer conventions because the first treatment has not failed."
      }
    ],
    "worked_example": {
      "scenario": "Patient 7042 has metastatic non-small-cell lung cancer (mNSCLC). Their oncologist starts them on erlotinib, an oral targeted therapy. The patient fills erlotinib three times between February and late March 2023, then goes completely off treatment for 95 days. In late July they start docetaxel, a chemotherapy drug. We want to know: how many lines of therapy did this patient have, what was in each line, and how many days passed from the start of line 1 to the start of line 2?",
      "dataset": {
        "caption": "Pharmacy claims table — one row per fill, exactly as an analyst sees it.",
        "columns": [
          "person_id",
          "fill_date",
          "drug",
          "days_supply"
        ],
        "rows": [
          [
            7042,
            "2023-02-01",
            "erlotinib",
            30
          ],
          [
            7042,
            "2023-03-01",
            "erlotinib",
            30
          ],
          [
            7042,
            "2023-03-28",
            "erlotinib",
            30
          ],
          [
            7042,
            "2023-07-31",
            "docetaxel",
            21
          ]
        ]
      },
      "steps": [
        "LOT1 starts on 2023-02-01, the date of the first fill. The regimen is erlotinib (a single drug).",
        "Fill A covers Feb 1 through Mar 2 (30 days). Fill B starts Mar 1 — a 1-day overlap, which is fine; the union rule extends coverage through Mar 30. Fill C starts Mar 28 — another overlap — and extends coverage through Apr 26. LOT1 supply therefore runs out on Apr 26.",
        "After Apr 26, no new erlotinib fill appears. The gap between the end of LOT1 supply (Apr 26) and the next fill (Jul 31) is Apr 27 through Jul 30 = 95 days.",
        "The permissible gap threshold is 90 days. Because 95 days > 90 days, the algorithm declares that LOT1 has lapsed.",
        "On Jul 31 the patient fills docetaxel — a drug that was NOT in the LOT1 regimen. A new drug arriving after a lapse triggers a new line: LOT2 starts on 2023-07-31 with regimen docetaxel.",
        "Time-to-LOT2 is measured from LOT1 start (Feb 1) to LOT2 start (Jul 31) = 180 days."
      ],
      "result": "Patient 7042 had 2 lines of therapy. LOT1 regimen = erlotinib, started 2023-02-01, ended 2023-04-26. LOT2 regimen = docetaxel, started 2023-07-31. Advancement reason = gap-then-restart with a new agent (95-day gap exceeded the 90-day permissible threshold). Time-to-LOT2 = 180 days.",
      "timeline_spec": {
        "title": "Lines of therapy for one mNSCLC patient — gap-restart advancing from LOT1 to LOT2",
        "window": {
          "start": "2023-02-01",
          "end": "2023-08-20",
          "label": "Observation window: first fill through end of LOT2 supply"
        },
        "events": [
          {
            "label": "Fill A (erlotinib)",
            "start": "2023-02-01",
            "length_days": 30,
            "quantity": "30 days_supply"
          },
          {
            "label": "Fill B (erlotinib)",
            "start": "2023-03-01",
            "length_days": 30,
            "quantity": "30 days_supply"
          },
          {
            "label": "Fill C (erlotinib)",
            "start": "2023-03-28",
            "length_days": 30,
            "quantity": "30 days_supply"
          },
          {
            "label": "Fill D (docetaxel)",
            "start": "2023-07-31",
            "length_days": 21,
            "quantity": "21 days_supply"
          }
        ],
        "spans": [
          {
            "kind": "exposed",
            "start": "2023-02-01",
            "end": "2023-04-26",
            "label": "LOT1 regimen: erlotinib (supply Feb 1 - Apr 26)"
          },
          {
            "kind": "gap",
            "start": "2023-04-27",
            "end": "2023-07-30",
            "label": "Gap: 95 days (exceeds 90-day threshold — LOT1 lapses)"
          },
          {
            "kind": "exposed",
            "start": "2023-07-31",
            "end": "2023-08-20",
            "label": "LOT2 regimen: docetaxel (supply Jul 31 - Aug 20)"
          }
        ],
        "result": {
          "label": "2 lines of therapy; time-to-LOT2 = 180 days (Feb 1 to Jul 31); advancement = gap-restart + new agent",
          "value": 180
        },
        "caption": "Each colored bar is one pharmacy fill drawn to scale by days_supply. The LOT1 span (blue) covers the union of the three erlotinib fills. The 95-day gap (red) exceeds the 90-day permissible threshold, ending LOT1. LOT2 (blue) begins when docetaxel — a drug not in LOT1 — is filled after the lapse.",
        "alt_text": "Horizontal timeline from February 2023 to August 2023. Three overlapping blue bars labeled erlotinib fills span February through late April, forming LOT1. A red gap bar spans late April through late July labeled 95-day gap. A fourth blue bar labeled docetaxel starts July 31, forming LOT2."
      }
    },
    "prerequisites": [
      "persistence-time-to-discontinuation",
      "exposure-episode-construction-rwe",
      "grace-period-gap-rules-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Pharmacy-only claims LOT (gap + substitution rule)",
        "description": "A line is the regimen of agents started within a combination window; the next line begins on a substitution (new agent with a prior agent stopped) or on a fill after a permissible gap. Built only from pharmacy NDC fills and days_supply.",
        "edge_cases": [
          "Provider-administered (medical-benefit J-code/HCPCS) drugs are invisible, silently dropping oncology lines.",
          "Sample fills, bridging, and 90-day mail order distort days_supply and shift gap-defined boundaries.",
          "Minimum-claims thresholds are needed to avoid counting single sample fills as lines."
        ],
        "data_source_notes": "claims: simplest and most reproducible, but only valid where the full regimen is dispensed through the pharmacy benefit (most self-administered oral therapies, biologics with pharmacy coverage)."
      },
      {
        "name": "Combined medical + pharmacy claims LOT",
        "description": "Antineoplastic events are the union of pharmacy NDC fills and medical-claim HCPCS/J-code administrations; durations for administered drugs are imputed from cycle schedules. Required for infused/injected oncology regimens.",
        "edge_cases": [
          "Inpatient chemotherapy bundled in a DRG may lack a discrete drug code and be missed.",
          "Imputed administration durations introduce uncertainty into gap-defined line ends."
        ],
        "data_source_notes": "claims: the standard for solid-tumor oncology LOT; demands complete medical-benefit drug capture and continuous A/B/D (or commercial medical+pharmacy) enrollment."
      },
      {
        "name": "Clinical + claims hybrid LOT with progression triggers",
        "description": "Adds clinical progression evidence (new metastasis codes, staging, lab/biomarker changes, toxicity-driven switches) so that a new line reflects disease progression, not merely a change in dispensing.",
        "edge_cases": [
          "Requires linked EHR/registry; progression coding lags real events, mis-timing line boundaries.",
          "Maintenance therapy must be explicitly held within the prior line, not coded as advancement."
        ],
        "data_source_notes": "linked: best validity for true progression-anchored lines; weak where EHR/registry capture is incomplete or out-of-system care occurs."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "persistence-time-to-discontinuation",
        "pros_of_this": "Captures sequencing and advancement across regimens (what the patient moves to next), not just duration on a single therapy; yields line-specific index dates and across-line attrition for HTA and budget-impact inputs.",
        "cons_of_this": "Rules are disease- and algorithm-specific and far less standardized than a permissible-gap persistence definition; chart validation is resource-intensive and required for regulatory/HTA use.",
        "when_to_prefer": "Oncology, rheumatology, MS, and other progressive or multi-regimen diseases where patients cycle through distinct regimens; report exact windows, gaps, minimum-claims, maintenance/progression rules, and validation metrics."
      },
      {
        "compared_to": "cascade-of-care-analysis-rwe",
        "pros_of_this": "Resolves post-initiation sequencing (switch, add-on, advancement) within the treated population that the population funnel does not describe.",
        "cons_of_this": "Says nothing about upstream diagnosis, linkage, or treatment-initiation losses captured by the cascade.",
        "when_to_prefer": "When the question is what happens after treatment starts and how patients progress through lines, rather than how many diagnosed patients reach treatment."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Antineoplastic events = pharmacy NDC fills (with days_supply) UNION medical-claim HCPCS/J-codes (service date, imputed duration). Require continuous medical + pharmacy enrollment; exclude Medicare Advantage-only person-time where the fee-for-service claim stream is unavailable. Pre-specify combination window, permissible gap, minimum claims per line, and maintenance/substitution-vs-augmentation rules; report chart-validation agreement before decision use.",
      "ehr": "Use orders and medication-administration records to capture provider-administered drugs and problem lists/labs/ staging for progression triggers; treat care delivered outside the system as potentially missing (an out-of-network line looks like discontinuation). Prefer EHR linked to claims.",
      "registry": "Protocol-defined lines and adjudicated progression are the gold standard for validating claims LOT algorithms, but the longitudinal pharmacy stream and out-of-registry care are often incomplete.",
      "linked": "Linked claims-EHR-registry is the ideal substrate (medical-benefit drug capture + staging + complete enrollment), but linkage selection and order/fill/service date discrepancies must be reconciled before assigning line start dates."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\n\nCOMBO_WINDOW = pd.Timedelta(days=28)   # agents starting within this of a line start = same regimen\nGAP_DAYS     = 90                       # permissible gap; a fill after a longer lapse starts a new line\nDEFAULT_DOS  = 30                       # fallback days_supply for administered drugs with no duration\n\ndef build_lot(tx: pd.DataFrame) -> pd.DataFrame:\n    tx = tx.sort_values([\"person_id\", \"service_date\"]).copy()\n    tx[\"days_supply\"] = tx[\"days_supply\"].fillna(DEFAULT_DOS).astype(int)\n    tx[\"supply_end\"]  = tx[\"service_date\"] + pd.to_timedelta(tx[\"days_supply\"], unit=\"D\")\n\n    lines = []\n    for pid, g in tx.groupby(\"person_id\", sort=False):\n        g = g.reset_index(drop=True)\n        lot = 1\n        line_start = g.loc[0, \"service_date\"]\n        regimen = set()                       # agents in the current line's regimen\n        line_supply_end = line_start          # latest supply coverage of regimen agents\n        reason = \"initiation\"\n\n        def flush(end):\n            lines.append({\"person_id\": pid, \"lot_number\": lot,\n                          \"regimen\": \"+\".join(sorted(regimen)),\n                          \"line_start\": line_start, \"line_end\": end,\n                          \"advance_reason\": reason})\n\n        for _, row in g.iterrows():\n            d, drug, send = row[\"service_date\"], row[\"drug\"], row[\"supply_end\"]\n            if (d - line_start) <= COMBO_WINDOW:          # still assembling the regimen\n                regimen.add(drug); line_supply_end = max(line_supply_end, send); continue\n            gap = (d - line_supply_end).days\n            substitution = drug not in regimen           # new agent not in current regimen\n            if substitution or gap > GAP_DAYS:           # advance to the next line\n                flush(line_supply_end)\n                lot += 1\n                line_start, regimen = d, {drug}\n                
line_supply_end = send\n                reason = \"switch/substitution\" if (substitution and gap <= GAP_DAYS) else \"gap_restart\"\n            else:                                         # continuation or augmentation (same line)\n                regimen.add(drug); line_supply_end = max(line_supply_end, send)\n        flush(line_supply_end)\n    return pd.DataFrame(lines).sort_values([\"person_id\", \"lot_number\"])",
        "description": "Pharmacy + medical-benefit LOT construction from claims-style inputs. Required input (already cleaned, de-duplicated,\nrestricted to the curated antineoplastic code list, and filtered to continuously enrolled non-MA-only person-time):\n  tx : one row per antineoplastic event ->\n       person_id, drug (generic/class string), service_date (datetime),\n       days_supply (int; for administered J-codes impute from cycle schedule, else NaN)\nParameters encode the clinical conventions and MUST be pre-specified. Returns one row per (person_id, lot_number)\nwith the regimen (set of agents), line start/end dates, and the advancement reason. Combinations within COMBO_WINDOW\nof a line start are pooled into the regimen; a new line is triggered by substitution or by a gap > GAP_DAYS.",
        "dependencies": [
          "pandas"
        ],
        "source_citations": [
          "meng-2019"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\n\nCOMBO_WINDOW <- 28L   # days: agents starting within this of a line start = same regimen\nGAP_DAYS     <- 90L   # permissible gap before a later fill starts a new line\nDEFAULT_DOS  <- 30L   # fallback days_supply for administered drugs\n\nbuild_lot <- function(tx) {\n  setDT(tx)\n  tx[is.na(days_supply), days_supply := DEFAULT_DOS]\n  tx[, supply_end := service_date + days_supply]\n  setorder(tx, person_id, service_date)\n\n  one_person <- function(g) {\n    lot <- 1L; line_start <- g$service_date[1L]\n    regimen <- character(0); line_supply_end <- line_start; reason <- \"initiation\"\n    out <- list()\n    flush <- function(end) list(lot_number = lot,\n                                regimen = paste(sort(unique(regimen)), collapse = \"+\"),\n                                line_start = line_start, line_end = end, advance_reason = reason)\n    for (i in seq_len(nrow(g))) {\n      d <- g$service_date[i]; drug <- g$drug[i]; send <- g$supply_end[i]\n      if (as.integer(d - line_start) <= COMBO_WINDOW) {            # assembling the regimen\n        regimen <- union(regimen, drug); line_supply_end <- max(line_supply_end, send); next\n      }\n      gap <- as.integer(d - line_supply_end)\n      substitution <- !(drug %in% regimen)\n      if (substitution || gap > GAP_DAYS) {                        # advance to next line\n        out[[length(out) + 1L]] <- flush(line_supply_end)\n        lot <- lot + 1L; line_start <- d; regimen <- drug\n        line_supply_end <- send\n        reason <- if (substitution && gap <= GAP_DAYS) \"switch/substitution\" else \"gap_restart\"\n      } else {                                                     # continuation / augmentation (same line)\n        regimen <- union(regimen, drug); line_supply_end <- max(line_supply_end, send)\n      }\n    }\n    out[[length(out) + 1L]] <- flush(line_supply_end)\n    rbindlist(out)\n  }\n  tx[, one_person(.SD), by = person_id]\n}",
        "description": "Pharmacy + medical-benefit LOT construction with data.table, mirroring the Python logic and parameters. Required input:\n  tx : data.table -> person_id, drug (character), service_date (Date),\n       days_supply (integer; impute administered-drug durations upstream, else NA)\nReturns one row per (person_id, lot_number) with the regimen string, line start/end, and advancement reason.",
        "dependencies": [
          "data.table"
        ],
        "source_citations": [
          "meng-2019"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "%let combo_window = 28;   /* days: agents within this of a line start share the regimen */\n%let gap_days     = 90;   /* permissible gap before a later fill starts a new line       */\n%let default_dos  = 30;   /* fallback days_supply for administered (J-code) drugs         */\n\n/* Impute administered-drug durations and compute supply coverage end. */\nproc sql;\n  create table tx2 as\n  select person_id, drug, service_date,\n         coalesce(days_supply, &default_dos) as days_supply,\n         service_date + coalesce(days_supply, &default_dos) as supply_end format=date9.\n  from work.tx;\nquit;\n\nproc sort data=tx2; by person_id service_date; run;\n\n/* Walk each patient's events; RETAIN line state and advance on substitution or gap. */\ndata work.lot;\n  set tx2;\n  by person_id;\n  retain lot_number line_start line_supply_end;\n  length advance_reason $20 ;\n  if first.person_id then do;\n    lot_number = 1; line_start = service_date; line_supply_end = supply_end;\n    advance_reason = 'initiation'; return;\n  end;\n  /* Still inside the combination window -> same regimen, extend coverage. */\n  if (service_date - line_start) <= &combo_window then do;\n    line_supply_end = max(line_supply_end, supply_end);\n    advance_reason = 'combination';\n  end;\n  else do;\n    gap = service_date - line_supply_end;\n    /* substitution would be confirmed against the regimen member set built in a companion */\n    /* hash/lookup; here a gap beyond the threshold advances the line.                      
*/\n    if gap > &gap_days then do;\n      lot_number = lot_number + 1; line_start = service_date;\n      line_supply_end = supply_end; advance_reason = 'gap_restart';\n    end;\n    else do;  /* within gap: continuation or augmentation of the current line */\n      line_supply_end = max(line_supply_end, supply_end);\n      advance_reason = 'augmentation_or_continuation';\n    end;\n  end;\nrun;\n\n/* Regimen-level table: one row per (person_id, lot_number) with the agent set. */\nproc sql;\n  create table work.lot_regimen as\n  select person_id, lot_number,\n         min(service_date) as line_start format=date9.,\n         max(line_supply_end) as line_end format=date9.,\n         count(distinct drug) as n_agents\n  from work.lot\n  group by person_id, lot_number;\nquit;",
        "description": "Pharmacy + medical-benefit LOT construction in SAS using PROC SQL prep plus a sorted DATA step with RETAIN/LAG to walk\neach patient's events chronologically and emit lot_number. Required input (post data-management, restricted to the\nantineoplastic code list and continuously enrolled non-MA-only person-time):\n  work.tx : person_id, drug (char), service_date (SAS date),\n            days_supply (num; impute administered-drug durations upstream, else .)\nParameters mirror the Python/R versions. The output work.lot has one row per antineoplastic event tagged with its\nlot_number and advance_reason; collapse to one row per (person_id, lot_number) for the regimen-level table.",
        "dependencies": [],
        "source_citations": [
          "meng-2019"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": "treatment-patterns-lines-of-therapy-timeline.svg",
        "mermaid": null,
        "caption": "Each colored bar is one pharmacy fill drawn to scale by days_supply. The LOT1 span (blue) covers the union of the three erlotinib fills. The 95-day gap (red) exceeds the 90-day permissible threshold, ending LOT1. LOT2 (blue) begins when docetaxel — a drug not in LOT1 — is filled after the lapse.",
        "alt_text": "Horizontal timeline from February 2023 to August 2023. Three overlapping blue bars labeled erlotinib fills span February through late April, forming LOT1. A red gap bar spans late April through late July labeled 95-day gap. A fourth blue bar labeled docetaxel starts July 31, forming LOT2.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Ev[Next antineoplastic event<br/>drug, service_date, days_supply] --> Win{Within COMBO_WINDOW<br/>of current line start?}\n  Win -- Yes --> Reg[Add drug to current regimen<br/>extend supply coverage]\n  Win -- No --> Sub{New agent NOT in<br/>current regimen<br/>AND prior agent stopped?}\n  Sub -- Yes --> Switch[Advance line:<br/>SWITCH / substitution]\n  Sub -- No --> Gap{Gap since regimen<br/>supply end &gt; GAP_DAYS?}\n  Gap -- Yes --> Restart[Advance line:<br/>GAP-then-RESTART]\n  Gap -- No --> Aug[Same line:<br/>AUGMENTATION / add-on<br/>does NOT advance]\n  Switch --> Next[lot_number + 1<br/>start new regimen]\n  Restart --> Next",
        "caption": "Per-event decision logic that turns a fill/administration sequence into ordered lines. A combination window pools co-started agents into one regimen; a substitution or an over-gap restart advances the line; an added agent that continues alongside the regimen is augmentation and stays within the line.",
        "alt_text": "Flowchart deciding for each drug event whether it joins the current regimen, advances the line via switch or gap-restart, or is logged as augmentation within the same line of therapy.",
        "source_type": "illustrative",
        "source_citations": [
          "meng-2019"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "gantt\n  title One mNSCLC patient's fills mapped to lines of therapy\n  dateFormat YYYY-MM-DD\n  axisFormat %b %Y\n  section LOT1\n  Carboplatin + pemetrexed (induction) :done, l1a, 2023-01-10, 84d\n  Pemetrexed maintenance (held within LOT1) :active, l1b, 2023-04-04, 120d\n  section Gap\n  Lapse > 90 days (no antineoplastic) :crit, gap, 2023-08-02, 95d\n  section LOT2\n  Docetaxel restart (new regimen) :l2, 2023-11-05, 63d",
        "caption": "Timeline for a single patient. Co-started agents within the combination window form the LOT1 regimen; single-agent maintenance is held inside LOT1; a lapse beyond the permissible gap followed by a new agent begins LOT2. Each line's time zero is its own first event, avoiding immortal time.",
        "alt_text": "Gantt timeline showing a combination induction and maintenance comprising line 1, a gap exceeding 90 days, then a docetaxel restart comprising line 2.",
        "source_type": "illustrative",
        "source_citations": [
          "meng-2019"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "see_also",
        "target_slug": "cascade-of-care-analysis-rwe",
        "notes": "The cascade provides the population funnel from diagnosis through linkage and treatment initiation; LOT describes post-initiation sequencing and advancement within the treated population. Complementary for full treatment-pattern understanding in progressive or multi-line disease."
      },
      {
        "relation_type": "see_also",
        "target_slug": "persistence-time-to-discontinuation",
        "notes": "Persistence/permissible-gap logic is foundational to LOT - it defines when a line lapses and a restart begins - but persistence describes a single therapy's duration whereas LOT describes the ordered sequence of regimens."
      },
      {
        "relation_type": "see_also",
        "target_slug": "pdc-proportion-of-days-covered",
        "notes": "Adherence within a line (PDC) affects outcomes and can drive the decision to switch or augment that ends the line."
      },
      {
        "relation_type": "complements",
        "target_slug": "competing-risks-cause-specific-fine-gray-rwe",
        "notes": "Advancement to the next line is a competing-risk endpoint - death competes with progression - so the cause-specific hazard versus Fine-Gray subdistribution choice must be pre-specified in any time-to-next-line estimand."
      },
      {
        "relation_type": "used_with",
        "target_slug": "active-comparator-new-user",
        "notes": "A line-specific index date (e.g., LOT2 start) defines time zero for a downstream active-comparator new-user comparative-effectiveness study within a given line."
      },
      {
        "relation_type": "produces",
        "target_slug": "hcru-healthcare-resource-utilization",
        "notes": "Later lines and specific regimens identified by LOT drive HCRU differences (more monitoring, supportive care); attach line-specific utilization to the derived lines."
      },
      {
        "relation_type": "produces",
        "target_slug": "healthcare-costs-pppm-pppy-pmpm",
        "notes": "Progression to expensive later lines is a major driver of longitudinal cost; LOT supplies the stage/line structure for line-specific PPPM/PPPY costing."
      },
      {
        "relation_type": "used_with",
        "target_slug": "health-economic-modeling-methods-rwe",
        "notes": "LOT-derived switch, augmentation, discontinuation, and advancement rates are direct inputs to transition probabilities and state costs in Markov and discrete-event-simulation models."
      }
    ],
    "aliases": [
      "LOT",
      "lines of therapy",
      "line of therapy",
      "treatment sequencing",
      "real-world treatment patterns",
      "treatment patterns"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "tree-based-ensembles-rwe",
    "name": "Tree-Based Ensembles: Random Forests and Gradient Boosting",
    "short_definition": "Family of ensemble learners that combine many decision trees into a single, more stable predictor — either by averaging trees grown on bootstrap samples with random feature subsets (random forests, Breiman 2001) or by sequentially fitting shallow trees to the negative gradient of the loss left by all previous trees (gradient boosting / XGBoost). In RWE and HEOR, these methods are the dominant tool for tabular claims and EHR prediction tasks — hospitalization, readmission, and mortality risk scoring — and serve as the nuisance outcome model and propensity estimator inside doubly robust causal estimators such as TMLE and DML (see parent concept predictive-and-causal-ml-models-rwe).",
    "long_description": "**How decision trees work — and why single trees fail**\n\nA classification and regression tree (CART) partitions the feature space by recursively\nsplitting on the single feature and threshold that maximally reduces a loss criterion —\nGini impurity or cross-entropy for classification, mean squared error for regression.\nEach split creates two child nodes; the tree grows until a stopping rule fires (maximum\ndepth, minimum leaf size, or minimum impurity reduction). Leaf nodes contain the training\npatients that reached that partition, and the leaf prediction is the majority class or\nmean outcome in those patients.\n\nSingle trees are high-variance estimators: a small change in the training data can route\npatients down entirely different branches, producing a very different tree — and very\ndifferent predictions — from nearly identical datasets. This instability is the primary\nmotivation for ensembles, which average over many perturbed trees to smooth away variance\nwhile keeping bias roughly comparable to a single large tree.\n\n**Random forests: bagging plus random feature subsampling**\n\nBreiman's random forest (2001) combines two ideas to build a stable ensemble.\n\nFirst, *bootstrap aggregating (bagging)*: draw B bootstrap samples of size n with\nreplacement from the training data. Fit one full tree per bootstrap sample. Average the\nB leaf predictions (for classification, average the predicted probabilities; for\nregression, average the leaf means). 
Because each tree sees a different realization of\nthe training data, their errors are partly uncorrelated — the ensemble variance is\napproximately sigma^2 times (rho + (1-rho)/B), where rho is the pairwise tree correlation.\nMore trees and lower correlation both reduce ensemble variance.\n\nSecond, *random feature subsampling*: at every split, randomly select m features from the\ntotal p and find the best split only within that subset (default: m = sqrt(p) for\nclassification, m = p/3 for regression). Without feature subsampling, a single dominant\npredictor appears in every tree's top splits, making trees near-identical and destroying\nthe variance-reduction benefit of bagging. Feature subsampling decorrelates trees while\nintroducing only a modest increase in individual-tree bias.\n\n*Out-of-bag (OOB) error* is a free built-in validation estimate. Each bootstrap sample\nleaves out roughly 37% of training patients. For each training patient, average predictions\nonly from trees where that patient was out-of-bag — then compute error on those OOB\npredictions. 
This is an approximately unbiased estimate of generalization error without\nrequiring a separate validation split.\n\nKey hyperparameters and sane defaults for random forests: n_estimators (500-1,000; adding\ntrees does not cause overfitting, it only yields diminishing variance reduction at higher\ncompute cost), min_samples_leaf (5-20 for stability with rare events), max_features\n(sqrt(p) for classification).\nRandom forests are notably close to tuning-free compared to gradient boosting.\n\n**Gradient boosting: sequential residual fitting**\n\nFriedman's gradient boosting machine (2001) builds trees sequentially, each one fitting\nthe negative gradient of the loss (the pseudo-residuals) left by the current ensemble.\nThe prediction after M trees is a weighted sum: F_M(x) = F_0(x) + learning_rate times\nthe sum of h_m(x) for m = 1 to M, where F_0 is a simple baseline (the global log-odds\nor mean), learning_rate is the step size (eta, typically 0.01 to 0.1), and each h_m is\na shallow tree (depth 2-8) fit to the pseudo-residuals of the current model. Each added\ntree reduces bias; the small step size and shallow trees limit the variance each step adds.\n\nXGBoost (Chen and Guestrin, 2016) extends vanilla gradient boosting with second-order\ngradient approximations, column and row subsampling per tree, L1/L2 regularization on\nleaf weights, and an efficient tree pruning criterion. These additions together yield\nsubstantially faster training and often higher accuracy than vanilla gradient boosting\non tabular data, the regime in which claims and EHR feature matrices fall.\n\nBoosting requires more active hyperparameter management than random forests. The key\nparameters are learning_rate (lower is safer but requires more trees), n_estimators,\nmax_depth (shallower trees generalize better: depth 3-6 for most tasks), and\n*early stopping*, which monitors performance on a validation fold and halts tree\naddition once performance has not improved for a user-defined patience window. 
Early\nstopping is the primary defense against overfitting in gradient boosting and should\nalways be used.\n\n**Interpreting the output**\n\nConsider a random forest trained on 50,000 Medicare beneficiaries with heart failure,\npredicting 1-year hospitalization. Patient A receives a predicted risk score of 0.72.\nThe model's c-statistic is 0.81. The top permutation variable importance (VIMP) feature\nis prior inpatient admissions in the preceding 12 months.\n\n*(1) Formal interpretation.*\n\nA raw random forest score of 0.72 is NOT equivalent to \"this patient has a 72% probability\nof hospitalization.\" Random forests are known to be poorly calibrated by default: averaging\nover many trees compresses predicted probabilities toward the ensemble mean, away from 0 and\n1. Before treating 0.72 as a probability, calibration must be assessed on a held-out dataset\n(calibration slope, calibration-in-the-large, or the integrated calibration index), and\nrecalibration must be applied if calibration is poor — via isotonic regression or Platt\nscaling (CalibratedClassifierCV in scikit-learn). After confirmed recalibration, 0.72\nrepresents the model's estimated conditional probability of hospitalization given this\npatient's observed feature values, conditional on those features being complete and the\nmodel being transportable to this population.\n\nThe c-statistic of 0.81 means that if a pair of patients is drawn at random — one who was\nhospitalized and one who was not — the model assigns the higher score to the hospitalized\npatient 81% of the time. This is a measure of rank discrimination, not calibration. 
A model\ncan have excellent discrimination (high c-statistic) and poor calibration simultaneously.\n\nPrior inpatient admissions appearing at the top of the variable importance ranking reflects\nits predictive contribution to the model — it does NOT mean that preventing or reducing\nadmissions (or changing the feature value by intervention) would lower the patient's risk.\nVariable importance measures the average decrease in model accuracy when that feature is\npermuted in the data; it is a property of the fitted predictive model, not a causal effect\nestimate. Patients with high prior admissions may simultaneously have unmeasured severity,\nfrailty, and social determinants that drive both the feature and the outcome. Asserting that\nprior admissions cause future hospitalizations solely from VIMP is an unsupported causal\nclaim and a common analytic error.\n\n*(2) Practical interpretation.*\n\nA score of 0.72 places this patient roughly in the top decile of predicted risk in this\npopulation. The actionable message is: \"the model flags this patient as high-risk, primarily\nbased on prior hospitalization history.\" Prior admissions drive the prediction — that does\nnot mean preventing or reducing those admissions changes the underlying risk. Use the score\nfor care management triage, risk stratification, or as a covariate in downstream analyses,\nnot as evidence that any individual feature is a modifiable cause of future hospitalization.\n\n**RWE-specific considerations**\n\n*High-cardinality claims features.* ICD-10 diagnosis codes (70,000+ unique), NDC drug codes\n(100,000+), and CPT procedure codes present a high-cardinality categorical challenge. Standard\none-hot encoding is infeasible at this scale. Practical approaches: aggregate codes to drug\nclass or body-system hierarchy, use ever/never binary flags for curated code groups, or apply\ntarget encoding (substituting the within-training-fold mean outcome for each code level). 
Target\nencoding is powerful but carries a leakage trap: if encoding is computed on the full dataset\nbefore train/test splitting, it leaks outcome information into the features and inflates\nperformance. Target encoding must always be fit on the training fold only, never on the full\ndataset.\n\n*Missing data.* Scikit-learn random forests required explicit imputation before version 1.4\nadded native missing-value support, so confirm the behavior of the installed version. In R,\nrpart (single CART trees) and the conditional forests in the party package implement\n*surrogate splits*: when a feature is missing for a patient, the tree routes through the\nnext-best alternative split on a correlated feature. randomForestSRC instead imputes missing\nvalues adaptively while the trees are grown. Both approaches are more principled than median\nimputation, but their behavior must be explicitly verified, especially for informative\nmissingness patterns (e.g., lab values missing for the lowest-acuity patients).\n\n*Temporal leakage in claims.* The most destructive failure in claims-based ML is temporal\nleakage: including features from claims dated after the index date in what is labeled as\na baseline feature matrix. In a hospitalization prediction model indexed on the first heart\nfailure diagnosis date, a claim for a post-diagnosis echocardiogram is future information.\nA model that sees this claim learns to predict hospitalization from a feature recorded after\nthe index date, one that may itself be a consequence of the clinical course leading to the\noutcome. This inflates AUC dramatically (sometimes from 0.65 to 0.90+) while\nproducing useless predictions in prospective deployment. Temporal leakage is invisible to\nperformance metrics computed on the same contaminated dataset. The required guard: enforce a\nstrict feature extraction window of [index_date minus lookback, index_date), with index_date\nexcluded, and audit every feature's construction date before training.\n\n*Calibration drift across databases.* A random forest trained on commercial claims may be\nwell-calibrated in-sample but systematically mis-predict in a Medicare or Medicaid population.\nValidate calibration, not just AUC, in every target population before deployment. 
See the\ncompanion concept prediction-model-validation-recalibration-rwe for the external validation\nworkflow. Cross-database calibration checking is especially important for models exported\nfrom development cohorts to support regulatory submissions.\n\n*Role in causal pipelines.* Random forests and gradient boosting serve as nuisance\nestimators — outcome models and propensity score estimators — inside doubly robust causal\nmethods (TMLE and DML), as described in the parent concept predictive-and-causal-ml-models-rwe.\nWhen used this way, the ensemble delivers conditional expectations to a causal estimator; it\nis a means to an end, not the deliverable. The ensemble's predictive accuracy matters for\nefficiency; the causal estimate's validity comes from the doubly robust structure, not from\nthe ensemble alone.\n\n**Pros, cons, and trade-offs**\n\n*Pros.* Ensembles automatically capture non-linear relationships and feature interactions\nwithout manual feature engineering. Random forests are nearly tuning-free and robust to\noutliers (individual tree predictions from extreme values are averaged away). Gradient\nboosting achieves best-in-class accuracy on structured tabular data — precisely the format\nof claims and EHR feature matrices. OOB error gives random forests a built-in internal\nvalidation estimate at no additional compute cost. Variable importance metrics identify\nthe most predictive features for subsequent biological or clinical review. Both methods\nhandle feature matrices with thousands of binary indicators (one ICD flag per code) without\nexplicit variable selection. Both scale well to large cohort sizes common in Medicare and\ncommercial claims.\n\n*Cons.* Neither random forests nor gradient boosting are calibrated out of the box — raw\npredicted probabilities compress toward the mean and require explicit recalibration before\nuse as risk estimates. 
Gradient boosting is sensitive to hyperparameters (especially\nlearning rate and n_estimators) and always requires early stopping to prevent overfitting.\nBoth methods are functionally black-box: predictions are not decomposable into a simple\nequation, limiting local interpretability compared to logistic regression (though SHAP values\nand partial dependence plots partially address this). Variable importance rankings are biased\ntoward high-cardinality and correlated features (e.g., many correlated ICD codes for the\nsame condition inflate that condition's apparent importance). External validation and\nrecalibration are mandatory before clinical or regulatory use.\n\n**When to use**\n\nUse tree-based ensembles when: (1) the goal is risk stratification, case-finding, or covariate\nadjustment and a logistic regression model is likely to underfit because of non-linearity or\nunmodeled interactions; (2) the feature space is high-dimensional (hundreds to thousands of\nICD/NDC/CPT indicator flags) and automatic feature selection is needed; (3) a built-in\ninternal validation estimate is desired (OOB error for RF) or early stopping is feasible\n(gradient boosting); (4) ensembles serve as nuisance estimators inside TMLE or DML for\ncausal inference where the parent concept provides the estimator structure; (5) a pre-specified\nvariable importance ranking is required for exploratory reporting, with the explicit caveat,\ndocumented in the analysis plan, that importance is not causation.\n\n**When NOT to use**\n\nDo NOT use tree-based ensembles when: (1) the deliverable is an interpretable causal effect\nestimate with a confidence interval on a well-defined estimand (use regression, propensity\nmethods, or doubly robust causal estimators instead); (2) the sample is very small (fewer than\na few hundred patients) or the event is rare and careful tuning and evaluation are infeasible;\n(3) the use case requires reporting a hazard ratio, odds ratio, or rate ratio with a 
confidence\ninterval to regulators or payers — regression models deliver this directly and transparently;\n(4) variable importance scores are being used as evidence of causal importance or proposed as\nbiomarkers without external causal validation — VIMP measures predictive contribution only;\n(5) the model has not been externally validated and recalibrated in the target population —\nnever deploy a model developed on commercial claims to Medicare beneficiaries without\ncross-population calibration assessment. Do not use temporal-leakage-contaminated features\nas a shortcut to inflated performance metrics.",
    "primary_category": "Machine_Learning_and_Predictive",
    "tags": [
      "machine-learning",
      "prediction",
      "random-forest",
      "gradient-boosting",
      "ensembles",
      "risk-stratification",
      "variable-importance",
      "calibration",
      "xgboost",
      "claims",
      "ehr"
    ],
    "applies_to_study_types": [
      "cohort_retrospective",
      "cohort_prospective",
      "claims_analysis",
      "ehr_study",
      "registry_study",
      "new_user",
      "active_comparator_new_user"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1023/A:1010933404324",
        "url": "https://doi.org/10.1023/A:1010933404324",
        "citation_text": "Breiman L. Random forests. Machine Learning. 2001;45(1):5-32.",
        "year": 2001,
        "authors_short": "Breiman",
        "notes": "Foundational paper introducing the random forest algorithm — bagging plus random feature subsampling at each split — with the OOB error estimate, variable importance via permutation, and theoretical analysis of ensemble variance reduction. The canonical reference for RF in every RWE application."
      },
      {
        "role": "explain",
        "doi": "10.1214/aos/1013203451",
        "url": "https://doi.org/10.1214/aos/1013203451",
        "citation_text": "Friedman JH. Greedy function approximation: a gradient boosting machine. Annals of Statistics. 2001;29(5):1189-1232.",
        "year": 2001,
        "authors_short": "Friedman",
        "notes": "Derives the gradient boosting framework — fitting regression trees sequentially to pseudo-residuals (negative gradient of the loss), with the learning rate and partial dependence plots. Establishes the theoretical foundation for all modern gradient boosting implementations including XGBoost, LightGBM, and CatBoost."
      },
      {
        "role": "explain",
        "doi": "10.1214/ss/1009213726",
        "url": "https://doi.org/10.1214/ss/1009213726",
        "citation_text": "Breiman L. Statistical modeling: the two cultures (with comments and a rejoinder). Statistical Science. 2001;16(3):199-231.",
        "year": 2001,
        "authors_short": "Breiman",
        "notes": "Frames the distinction between algorithmic (predictive) and data-modeling (inferential) cultures in statistics. Directly relevant to the RWE context where ensembles are used for prediction while causal inference requires separate estimators — the conceptual divide this entry builds on."
      },
      {
        "role": "demonstrate",
        "doi": "10.1145/2939672.2939785",
        "url": "https://doi.org/10.1145/2939672.2939785",
        "citation_text": "Chen T, Guestrin C. XGBoost: a scalable tree boosting system. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016:785-794.",
        "year": 2016,
        "authors_short": "Chen & Guestrin",
        "notes": "Introduces XGBoost with second-order gradient approximations, regularized tree learning, column and row subsampling, and sparsity-aware split finding. The production reference for XGBoost, which is the dominant gradient boosting implementation in RWE risk prediction pipelines due to its speed, accuracy, and early-stopping support."
      },
      {
        "role": "use",
        "doi": "10.1145/1102351.1102430",
        "url": "https://doi.org/10.1145/1102351.1102430",
        "citation_text": "Niculescu-Mizil A, Caruana R. Predicting good probabilities with supervised learning. Proceedings of the 22nd International Conference on Machine Learning. 2005:625-632.",
        "year": 2005,
        "authors_short": "Niculescu-Mizil & Caruana",
        "notes": "Systematic evaluation of calibration across supervised learning methods. Demonstrates that random forests and boosted trees produce poorly calibrated probabilities by default (RF compresses toward the mean; boosting can over-calibrate). Directly motivates the mandatory calibration step in every RWE ensemble deployment."
      }
    ],
    "plain_language_summary": "Tree-based ensembles are a family of machine-learning methods that build hundreds of simple decision trees and combine their predictions into one stable, accurate answer. Random forests build each tree on a different random sample of the data so the trees disagree with each other in useful ways; averaging them out cancels individual errors. Gradient boosting builds trees in sequence, each one correcting the mistakes of the previous one, a bit like a team where each new expert fixes what the last expert got wrong. In health research, these tools are used to flag patients at high risk of hospitalization or complications — but a high risk score does not mean any single factor causes that risk, and the raw score must be calibrated before treating it as a true probability.",
    "key_terms": [
      {
        "term": "decision tree",
        "definition": "A prediction model that sorts patients into groups by answering a series of yes/no questions about their features (age over 65? prior hospitalization? taking drug X?), ending in a leaf that gives a predicted outcome."
      },
      {
        "term": "bootstrap aggregating",
        "definition": "A technique that creates many different training datasets by sampling the original data with replacement, fits one model to each, then averages all predictions — reducing the jumpiness (variance) of any single model."
      },
      {
        "term": "out-of-bag error",
        "definition": "A built-in accuracy estimate for random forests: because each tree is trained on a random sample, the patients left out of that sample can be used to test that specific tree, giving a free estimate of how well the forest generalizes."
      },
      {
        "term": "boosting",
        "definition": "A strategy that builds models sequentially, where each new model focuses on the cases the previous models got wrong, gradually reducing errors by combining many weak learners into one strong predictor."
      },
      {
        "term": "learning rate",
        "definition": "In gradient boosting, a small number (typically 0.01 to 0.1) that controls how much each new tree contributes to the ensemble; a lower rate means slower but more stable learning and usually requires more trees."
      },
      {
        "term": "variable importance",
        "definition": "A score that ranks features by how much the model's accuracy drops when that feature is scrambled — a measure of predictive usefulness, NOT of whether the feature causes the outcome."
      }
    ],
    "worked_example": {
      "scenario": "A health analytics team trains a random forest with five trees to predict 1-year hospitalization in a small heart failure cohort. They want to compute the ensemble's predicted probability for one patient (P-001) and estimate the model's out-of-bag error on 100 OOB patients. All five trees had P-001 in their out-of-bag set for this illustration, so each tree provides an independent prediction.",
      "dataset": {
        "caption": "Predicted probabilities from five bootstrap trees for patient P-001 (all five trees had this patient in their OOB set). A separate OOB pool of 100 patients had 25 classification errors across the forest.",
        "columns": [
          "tree_id",
          "predicted_prob_for_P001",
          "oob_patients",
          "oob_errors"
        ],
        "rows": [
          [
            "T1",
            0.6,
            20,
            5
          ],
          [
            "T2",
            0.8,
            20,
            5
          ],
          [
            "T3",
            0.4,
            20,
            5
          ],
          [
            "T4",
            0.7,
            20,
            5
          ],
          [
            "T5",
            0.5,
            20,
            5
          ]
        ]
      },
      "steps": [
        "Five bootstrap trees provide predicted probabilities for patient P-001: tree T1 = 0.6, T2 = 0.8, T3 = 0.4, T4 = 0.7, T5 = 0.5. These five values represent five independent looks at the same patient, each from a tree trained on a different bootstrap sample.",
        "Random forest ensemble probability for P-001 = (0.6+0.8+0.4+0.7+0.5)/5 = 3.0/5 = 0.6. Because 0.6 exceeds the classification threshold of 0.5, the model flags this patient as high risk for 1-year hospitalization.",
        "For OOB error estimation: each of the 20 patients per tree who were not in that tree's bootstrap sample contribute one OOB prediction. Pooling across all five trees gives 100 OOB predictions total. Of these 100 OOB predictions, 25 disagreed with the true hospitalization label. OOB classification error = 25/100 = 0.25, meaning the forest misclassifies 25% of OOB patients.",
        "The raw score of 0.6 for P-001 is not yet a calibrated probability of hospitalization. Random forests compress predicted probabilities toward the ensemble mean; a calibration check on held-out data (calibration slope, integrated calibration index) is required. If the calibration slope is below 1.0, the model over-predicts at low risk and under-predicts at high risk, and isotonic recalibration must be applied before the score can be reported as a probability."
      ],
      "result": "Ensemble mean score = (0.6+0.8+0.4+0.7+0.5)/5 = 3.0/5 = 0.6 (patient P-001 flagged as high risk). OOB classification error = 25/100 = 0.25. The raw 0.6 score requires calibration assessment before being reported as a probability of hospitalization."
    },
    "prerequisites": [
      "predictive-and-causal-ml-models-rwe",
      "cross-validation-and-overfitting-rwe",
      "logistic-regression-for-binary-outcomes"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [],
    "tradeoffs": [],
    "implementation_notes_by_data_source": {
      "claims": "Build the feature matrix from all diagnosis/procedure/pharmacy codes in a fixed 6-12 month lookback window before index (ever/never flags, counts, recency quintiles). Enforce a strict temporal guard: no code from index_date or later may appear in any feature. For high-cardinality codes (ICD-10, NDC), aggregate to class/body-system level or use within-fold target encoding to avoid leakage. Validate calibration in each payer segment (commercial vs Medicare vs Medicaid) before pooling or applying a model across populations.",
      "ehr": "EHR provides richer nuisances (labs, vitals, structured problem lists) but irregular visit-driven measurement creates informative missingness. Use ranger's surrogate splits or explicit missingness indicators (feature present/absent flag) rather than silently imputing. Calibration drifts across sites and calendar time; validate within each site or era, not just pooled. Link to claims for complete medication and procedure history when building the feature matrix.",
      "registry": "Adjudicated registry labels (disease stage, graded adverse event) are clean training targets but the covariate set is narrower than claims or EHR. Ensembles on registry data are most useful for training or validating claims-based predictive models (SEER- Medicare hybrid is the canonical example). Link to claims for pharmacy fill history and to a mortality source for complete survival data.",
      "linked": "Linked claims-EHR-vital records is the ideal substrate for ensemble development: EHR severity features plus claims completeness plus reliable mortality. Linkage introduces selection bias (the linkable subset may differ from the unlinked) and date discrepancies that must be reconciled before feature extraction. Feature extraction must be anchored to a consistent index date across all data sources to prevent immortal time entering the feature window."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\nfrom sklearn.ensemble import RandomForestClassifier, HistGradientBoostingClassifier\nfrom sklearn.calibration import CalibratedClassifierCV, calibration_curve\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.inspection import permutation_importance\nfrom sklearn.metrics import roc_auc_score, brier_score_loss\nimport xgboost as xgb\n\n# Assumes: X = feature matrix (pre-index features only — no temporal leakage),\n#          y = binary outcome (1=event, 0=no event), feature_names = list of column names.\n\nX_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)\n\n# ── 1. Random forest with OOB error and variable importance ──\nrf = RandomForestClassifier(\n    n_estimators=500,         # more trees = lower variance; 500 is a safe default\n    max_features=\"sqrt\",      # Breiman default for classification\n    min_samples_leaf=10,      # stability guard for rare events\n    oob_score=True,           # enables the free OOB accuracy estimate\n    n_jobs=-1,\n    random_state=42,\n)\nrf.fit(X_train, y_train)\nprint(f\"RF OOB accuracy: {rf.oob_score_:.3f}\")  # free internal validation\nprint(f\"RF val AUC: {roc_auc_score(y_val, rf.predict_proba(X_val)[:, 1]):.3f}\")\n\n# Permutation importance on validation set (less biased than MDI for correlated features)\nperm_imp = permutation_importance(rf, X_val, y_val, n_repeats=10, random_state=42)\ntop5_idx = perm_imp.importances_mean.argsort()[-5:][::-1]\nprint(\"Top 5 features (permutation VIMP):\")\nfor i in top5_idx:\n    print(f\"  {feature_names[i]}: {perm_imp.importances_mean[i]:.4f}\")\nprint(\"CAUTION: VIMP = predictive contribution, NOT causal effect.\")\n\n# Calibration check — random forests are miscalibrated by default\nprob_true, prob_pred = calibration_curve(y_val, rf.predict_proba(X_val)[:, 1], n_bins=10)\nprint(f\"RF Brier score (val): {brier_score_loss(y_val, rf.predict_proba(X_val)[:,1]):.4f}\")\n# If calibration curve 
deviates from the diagonal, recalibrate:\nrf_cal = CalibratedClassifierCV(rf, method=\"isotonic\", cv=\"prefit\")\nrf_cal.fit(X_val, y_val)  # isotonic recalibration on the validation set\n\n# ── 2. Gradient boosting with early stopping (scikit-learn) ──\ngbm = HistGradientBoostingClassifier(\n    learning_rate=0.05,\n    max_iter=1000,            # upper bound; early stopping will stop earlier\n    max_depth=4,\n    min_samples_leaf=20,\n    early_stopping=True,\n    validation_fraction=0.15,\n    n_iter_no_change=20,      # stop if no improvement for 20 rounds\n    random_state=42,\n)\ngbm.fit(X_train, y_train)\nprint(f\"GBM trees used (early stopping): {gbm.n_iter_}\")\nprint(f\"GBM val AUC: {roc_auc_score(y_val, gbm.predict_proba(X_val)[:, 1]):.3f}\")\n\n# ── 3. XGBoost with early stopping and calibration ──\ndtrain = xgb.DMatrix(X_train, label=y_train)\ndval   = xgb.DMatrix(X_val,   label=y_val)\nparams = {\n    \"objective\": \"binary:logistic\",\n    \"eval_metric\": \"auc\",\n    \"learning_rate\": 0.05,\n    \"max_depth\": 4,\n    \"subsample\": 0.8,\n    \"colsample_bytree\": 0.8,\n    \"seed\": 42,\n}\nxgb_model = xgb.train(\n    params, dtrain,\n    num_boost_round=1000,\n    evals=[(dval, \"val\")],\n    early_stopping_rounds=20,\n    verbose_eval=False,\n)\nxgb_preds = xgb_model.predict(dval)\nprint(f\"XGB val AUC: {roc_auc_score(y_val, xgb_preds):.3f}\")\nprint(f\"XGB Brier score (val): {brier_score_loss(y_val, xgb_preds):.4f}\")\n# Always check calibration on the target population before deployment.",
        "description": "Random forest and gradient boosting for binary outcome prediction using scikit-learn,\nwith XGBoost as the production boosting alternative. Demonstrates: (1) RandomForestClassifier\nwith OOB error, permutation variable importance, and post-hoc calibration via\nCalibratedClassifierCV; (2) HistGradientBoostingClassifier with early stopping on a\nvalidation fold; (3) XGBoost with early stopping and calibration check. Required input\nis an analysis-ready feature matrix X (n patients x p features, no post-index variables)\nand a binary outcome vector y. All features must be strictly from the pre-index baseline\nwindow to prevent temporal leakage.",
        "dependencies": [
          "scikit-learn",
          "xgboost",
          "numpy",
          "pandas"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(ranger); library(xgboost); library(pROC)\n\n# Assumes: dat = data.frame with columns [outcome_col, feature_cols...]\n# All feature_cols must be from the pre-index baseline window (no temporal leakage).\n\nset.seed(42)\nn <- nrow(dat)\nval_idx   <- sample(n, size = floor(0.2 * n))\ntrain_dat <- dat[-val_idx, ]\nval_dat   <- dat[ val_idx, ]\n\n# ── 1. Random forest via ranger ──\nrf_form <- reformulate(feature_cols, response = outcome_col)\nrf_fit <- ranger(\n  formula       = rf_form,\n  data          = train_dat,\n  num.trees     = 500,\n  mtry          = floor(sqrt(length(feature_cols))),   # Breiman default\n  min.node.size = 10,\n  probability   = TRUE,          # return predicted probabilities\n  importance    = \"permutation\", # permutation VIMP (less biased than impurity)\n  seed          = 42\n)\ncat(sprintf(\"RF OOB prediction error: %.3f\\n\", rf_fit$prediction.error))\n\n# Variable importance (VIMP) -- predictive, NOT causal\nvimp <- sort(rf_fit$variable.importance, decreasing = TRUE)\ncat(\"Top 5 features (permutation VIMP):\\n\")\nprint(head(vimp, 5))\ncat(\"NOTE: VIMP reflects predictive contribution only -- not causal effect.\\n\")\n\n# Validation AUC\nval_probs <- predict(rf_fit, data = val_dat)$predictions[, \"1\"]\nrf_auc    <- auc(roc(val_dat[[outcome_col]], val_probs, quiet = TRUE))\ncat(sprintf(\"RF validation AUC: %.3f\\n\", rf_auc))\n\n# Calibration check: plot calibration curve; recalibrate if slope differs from 1.0\n# Use the rms package (val.prob) or a loess smoother for a calibration curve in practice.\n\n# ── 2. 
Gradient boosting via xgboost with early stopping ──\nX_train <- as.matrix(train_dat[, feature_cols])\ny_train <- train_dat[[outcome_col]]\nX_val   <- as.matrix(val_dat[, feature_cols])\ny_val   <- val_dat[[outcome_col]]\n\ndtrain <- xgb.DMatrix(X_train, label = y_train)\ndval   <- xgb.DMatrix(X_val,   label = y_val)\n\nparams <- list(\n  objective        = \"binary:logistic\",\n  eval_metric      = \"auc\",\n  eta              = 0.05,   # learning rate (lower = safer, needs more rounds)\n  max_depth        = 4,\n  subsample        = 0.8,\n  colsample_bytree = 0.8\n)\n\n# Cross-validated tuning to select n_rounds (use in model development phase)\ncv_res <- xgb.cv(\n  params    = params,\n  data      = dtrain,\n  nrounds   = 500,\n  nfold     = 5,\n  early_stopping_rounds = 20,\n  verbose   = 0\n)\nbest_rounds <- cv_res$best_iteration\ncat(sprintf(\"XGB optimal rounds (CV early stopping): %d\\n\", best_rounds))\n\n# Final model on full training data at the optimal rounds\nxgb_fit <- xgb.train(\n  params     = params,\n  data       = dtrain,\n  nrounds    = best_rounds,\n  watchlist  = list(val = dval),\n  verbose    = 0\n)\nxgb_preds <- predict(xgb_fit, dval)\nxgb_auc   <- auc(roc(y_val, xgb_preds, quiet = TRUE))\ncat(sprintf(\"XGB validation AUC: %.3f\\n\", xgb_auc))\n\n# ── 3. Using as nuisance in causal inference (sketch) ──\n# For TMLE or DML, pass the ranger or xgboost learner as the outcome or propensity\n# nuisance. See predictive-and-causal-ml-models-rwe for the full cross-fitting workflow.\n# library(grf); causal_forest(X, Y, W, num.trees=2000) uses RF internally.\n# library(DoubleML); lrn(\"classif.xgboost\") wraps xgboost as a mlr3 learner for DML.",
        "description": "Random forest via ranger (fast, supports missing via surrogate splits) and gradient\nboosting via xgboost with early stopping. Demonstrates: (1) ranger with OOB error,\npermutation variable importance, and a calibration check; (2) xgboost with watchlist\nearly stopping and a tuning sketch using cross-validation; (3) a note on using these\nas nuisance learners inside the grf or DoubleML packages for causal estimation.\nRequired input: dat (data.frame), outcome_col (binary 0/1), feature_cols (pre-index\nbaseline columns only).",
        "dependencies": [
          "ranger",
          "xgboost",
          "pROC"
        ],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Data[\"Analysis-ready feature matrix\\n(pre-index window only — no leakage)\"]\n  Data --> RF[\"Random Forest\\n(B bootstrap trees + random feature subsets)\"]\n  Data --> GB[\"Gradient Boosting\\n(sequential residual trees + learning rate)\"]\n  RF --> OOB[\"OOB error\\n(free internal validation)\"]\n  RF --> RFScore[\"Ensemble probability\\n(average over B trees)\"]\n  GB --> ES[\"Early stopping\\n(val fold monitors AUC)\"]\n  GB --> GBScore[\"Ensemble probability\\n(weighted sum of M trees)\"]\n  RFScore --> Cal[\"Calibration check\\n(calibration curve, ICI, Brier)\"]\n  GBScore --> Cal\n  Cal -->|\"slope near 1\"| Deploy[\"Deploy score\\nfor risk triage\"]\n  Cal -->|\"slope far from 1\"| Recal[\"Recalibrate\\n(isotonic or Platt)\"]\n  Recal --> Deploy\n  RFScore -.->|\"as nuisance learner\"| Causal[\"TMLE / DML causal pipeline\\n(see parent concept)\"]\n  GBScore -.->|\"as nuisance learner\"| Causal",
        "caption": "Decision and pipeline flow for tree-based ensembles in RWE: from a pre-index feature matrix (with strict leakage guard), through random forest (with OOB validation) or gradient boosting (with early stopping), to mandatory calibration assessment and recalibration before deployment. Dashed lines show the alternate path as nuisance learners inside causal ML estimators.",
        "alt_text": "Flowchart starting at a pre-index feature matrix, branching to random forest (OOB error, ensemble probability) and gradient boosting (early stopping, ensemble probability), both feeding a calibration check that routes to direct deployment if calibration is good or to recalibration if poor, with dashed lines showing both as inputs to a causal ML pipeline.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "requires",
        "target_slug": "predictive-and-causal-ml-models-rwe",
        "notes": "Parent concept covering the predictive vs causal ML distinction, the RWE bias landscape, and cross-fitting / doubly robust structure; tree-based ensembles are the specific model family covered here as a child concept."
      },
      {
        "relation_type": "used_with",
        "target_slug": "cross-validation-and-overfitting-rwe",
        "notes": "Cross-validation governs hyperparameter selection and early stopping for gradient boosting; OOB error is the RF-specific internal validation analogue. Both concepts address the same overfitting risk from different angles."
      },
      {
        "relation_type": "see_also",
        "target_slug": "regularized-regression-lasso-ridge-rwe",
        "notes": "LASSO and ridge are the principal parametric alternatives for high-dimensional prediction; ensembles win on non-linearity and interactions, regularized regression wins on interpretability and direct coefficient inference."
      },
      {
        "relation_type": "used_with",
        "target_slug": "roc-auc-discrimination-rwe",
        "notes": "AUC/c-statistic is the primary discrimination metric for evaluating ensemble risk models; always report alongside a calibration metric (Brier score, ICI) since a model can discriminate well and be poorly calibrated simultaneously."
      },
      {
        "relation_type": "used_with",
        "target_slug": "brier-score-calibration-rwe",
        "notes": "Brier score and calibration metrics are mandatory complements to AUC for ensemble models because both random forests and gradient boosting produce poorly calibrated probabilities by default and require explicit recalibration."
      },
      {
        "relation_type": "see_also",
        "target_slug": "high-dimensional-propensity-score-hdps-rwe",
        "notes": "hdPS uses logistic regression on automatically selected claims codes for propensity estimation; tree-based ensembles are an ML alternative propensity estimator inside doubly robust pipelines, trading interpretability for potential accuracy gains in the high-dimensional case."
      }
    ],
    "aliases": [
      "random forest",
      "gradient boosting",
      "XGBoost",
      "GBM",
      "CART ensembles"
    ],
    "complexity": "advanced",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "tree-based-scan-statistic-rwe",
    "name": "Tree-Based Scan Statistics (TreeScan)",
    "short_definition": "A hypothesis-free outcome-scanning method that arranges thousands of diagnosis (or adverse-event) codes into a hierarchical tree (ICD or MedDRA), evaluates every \"cut\" of the tree - each node together with all of its descendants - as a candidate excess-of-events signal, scores each cut with a log-likelihood-ratio statistic, and controls the massive multiplicity of testing the whole tree at once with a single Monte Carlo permutation null, so a post-market safety program can let the data nominate which specific or which grouped outcomes are elevated rather than pre-specifying one endpoint.",
    "long_description": "Most pharmacoepidemiology asks a pre-specified question: does drug X raise the rate of outcome Y? **Tree-based\nscan statistics turn that around.** You do not name the outcome in advance. You hand the method an entire\nhierarchical classification of outcomes - the ICD diagnosis tree, or the MedDRA System Organ Class -> High\nLevel Group Term -> Preferred Term tree for adverse events - and let it scan *all* of the outcomes at once,\nasking at every level of granularity \"is the count of events here, in the exposed group, higher than expected?\"\nThe method's job is signal **generation**: it nominates the specific code (a single MedDRA Preferred Term) or\nthe grouped branch (a whole System Organ Class) where events cluster, while honestly accounting for the fact\nthat it just looked in thousands of places.\n\n**The core idea: cuts on a tree.** A \"cut\" is a node of the tree taken together with every leaf beneath it.\nA leaf cut is one fine-grained code (e.g., the Preferred Term *myocarditis*); an internal-node cut is the whole\nbranch (e.g., the cardiac System Organ Class, which sums myocarditis + arrhythmia + ...). Scanning the tree means\nevaluating every cut as a candidate signal. This is what lets TreeScan catch a real effect whether it is\nconcentrated in **one** code (sharp leaf signal) or **smeared** across a related family of codes that no single\ncode makes significant on its own (a branch signal that only appears once you sum the siblings). 
A flat,\none-code-at-a-time screen neither sees the grouping nor shares strength across related outcomes.\n\n**The statistic and the multiplicity fix.** Each cut gets a **log-likelihood-ratio (LLR)** comparing the observed\ncount under the cut to its expected count, with the alternative being \"more events here than expected.\" The test\nstatistic for the whole analysis is the **maximum LLR over all cuts** - the most surprising place on the tree.\nThe hard part is that you tested thousands of overlapping, correlated cuts, so a raw per-cut p-value is\nmeaningless. TreeScan controls this with a **Monte Carlo permutation null**: it repeatedly re-generates the data\nunder the no-signal hypothesis (e.g., randomly redistributing the events across the tree in proportion to the\nexpected counts, or permuting exposed/unexposed labels), recomputes the maximum LLR each time, and builds the\nnull distribution of that maximum. A cut's signal is \"significant\" only if its observed LLR beats almost all of\nthose simulated maxima. Because the threshold is set on the *maximum* over the whole tree, the family-wise error\nis controlled across the entire scan in one shot - no Bonferroni explosion, and the correlation among nested\ncuts is handled automatically.\n\n**Interpreting the output**\n\nFrom the worked example: six cuts evaluated over four MedDRA Preferred Terms under\ntwo System Organ Classes. The myocarditis leaf cut has the highest LLR (LLR = 4.07,\nobserved 8 vs expected 3), exceeding the permutation-derived critical boundary of\napproximately 3.7 (p ≈ 0.017 by Monte Carlo). The cardiac System Organ Class branch\ncut (observed 10, expected 6) has LLR ≈ 2.10, below the boundary. 
No GI cut is flagged.\n\nFormal interpretation: The myocarditis Preferred Term cut is the nominated signal:\nthe single location in the outcome tree where the observed-to-expected excess is large\nenough that it would arise by chance in fewer than 5% of permutation replicates under\nthe null of no excess anywhere on the tree. The multiplicity correction is achieved\nthrough the permutation null on the maximum LLR: because the test statistic is the\nmaximum over all cuts, the family-wise error rate across the entire tree is controlled\nat alpha without Bonferroni correction and without assuming independence among the\noverlapping, nested cuts. The branch cut (cardiac SOC) was not flagged separately\nbecause its LLR falls below the boundary: the myocarditis leaf already captures most of\nthe excess, so the branch signal does not add independent information in this example.\n\nPractical interpretation: The myocarditis finding is a generated hypothesis, a signal\nfor follow-up investigation, not a confirmed causal relationship. Report the LLR, the\npermutation p-value, the observed and expected counts, and the tie fraction to the\nnearest-ancestor branch. Next steps are: verify the myocarditis case definition PPV\n(a low-PPV code produces a noisy signal), examine confounding (do the exposed patients\ndiffer systematically from the reference group?), and run a pre-specified confirmatory\nstudy (a cohort or SCCS analysis on the myocarditis outcome alone) in an independent\ndata source or time period. TreeScan produces no effect-size estimate; do not extract\na relative risk from the scan output.\n\n**Pros, cons, and trade-offs** (specific and comparative).\n- **vs signal-detection (disproportionality: PRR, ROR, the Bayesian BCPNN/MGPS):** Disproportionality works on\n  spontaneous-report databases (FAERS, VigiBase) and asks, pair by pair, whether a drug-event combination is\n  reported more than expected *given the report mix*. 
It has no population denominator, no hierarchy, and treats\n  every Preferred Term as an independent test. **TreeScan** runs on **cohort or self-controlled data with real\n  person-time/denominators**, exploits the **MedDRA/ICD hierarchy** to borrow strength across related codes, and\n  delivers **one honest family-wise-controlled** answer instead of thousands of separately-corrected pair tests.\n  **Prefer disproportionality** when all you have is a spontaneous-report dump with no denominator; **prefer\n  TreeScan** when you have an enumerated cohort (claims/EHR) where expected counts are estimable.\n- **vs a pre-specified single-outcome analysis (a Cox model or sequential MaxSPRT on outcome Y):** A targeted\n  analysis is more powerful *for the outcome you named* and is the right tool for confirmation. TreeScan trades\n  some per-outcome power for **breadth** - it will surface the outcome you would not have thought to pre-specify.\n  The cost is that a TreeScan hit is a **lead, not a verdict**: it must be re-tested in an independent,\n  pre-specified, confounding-controlled study. **Do not** report a scan hit as a confirmed causal effect.\n- **vs naive multiple testing (scan every code, Bonferroni-correct):** Bonferroni over thousands of correlated,\n  nested codes is brutally conservative and ignores the hierarchy; it cannot express a branch-level signal at all.\n  TreeScan's permutation-on-the-maximum is both **less conservative** (it respects the correlation) and **able to\n  test groupings**. 
The price is **computation** (thousands of permutations x a full tree scan each).\n\n**When to use.** Active post-market safety surveillance where you want the data to nominate the adverse events\n(FDA Sentinel-style monitoring of a new vaccine or drug across a claims network); broad hypothesis-free outcome\nscanning when you genuinely do not know which outcome to pre-specify; situations where a real effect may be\nspread across a family of related codes (a System Organ Class) rather than any single Preferred Term; and any\nscreen where you need **one** multiplicity-honest answer over a large structured outcome space instead of a pile\nof separately-corrected tests. The **self-controlled tree-temporal** variant is the workhorse for vaccines: it\nscans the outcome tree *and* a post-exposure risk-window tree simultaneously, finding both *which* event and\n*when* after exposure it clusters, using each person as their own control.\n\n**When NOT to use - and when it is actively misleading.**\n- **Do not treat a signal as a confirmed effect.** TreeScan is a generator. A flagged cut is hypothesis-\n  *generating*; reporting it as an established harm conflates screening with confirmation and invites\n  over-reaction to noise. Every hit needs an independent, pre-specified, confounder-adjusted follow-up.\n- **Garbage expected counts -> garbage signals.** The whole method rests on credible **expected** counts (from a\n  comparator cohort, historical rates, or the self-controlled comparison window). If the expected counts are\n  confounded - the exposed and reference groups differ in age, comorbidity, or surveillance intensity - TreeScan\n  will faithfully flag that confounding as a \"signal.\" The unconditional Poisson model especially assumes the\n  expected counts are correct; it does not adjust for confounding on its own. 
Use a self-controlled design,\n  matching, or stratified expected counts, and pair scanning with negative-control outcomes / empirical\n  calibration to gauge residual bias.\n- **It does not estimate effect size or do confounding control.** TreeScan answers \"where is there an excess?\",\n  not \"how big, and is it causal?\". Pulling a relative risk off a scan hit and acting on it - without a designed\n  comparative study - is a misuse.\n- **Outcome misclassification rides along.** The leaves are code-based outcome definitions; if a Preferred Term\n  or ICD code has poor positive predictive value, its \"signal\" may be a coding artifact, not a real cluster.\n\n**Data-source operational depth.**\n- **Claims (FFS):** The natural substrate - an enumerated cohort with denominators (person-time) and an ICD\n  diagnosis tree. Build the exposed cohort, define the post-exposure risk window, count events at each ICD leaf,\n  and get expected counts from a comparator cohort (active comparator or matched unexposed) or from the same\n  people's comparison window (self-controlled). Watch Medicare Advantage / capitated person-time (incomplete\n  encounter capture deflates counts) and the usual claims outcome-validity caveats - a noisy leaf code makes a\n  noisy signal.\n- **EHR:** Richer outcome detail (labs, vitals, problem lists) can sharpen leaf definitions and let you build a\n  finer outcome tree, but encounter capture is leakier (care outside the system is invisible), so expected counts\n  and risk windows are more fragile. 
Best when the network is reasonably closed.\n- **Registry:** Adjudicated outcomes make clean leaves, but registries usually track a narrow outcome set, which\n  defeats the point of scanning a *broad* tree; use when the registry itself is the surveillance target.\n- **Linked claims-EHR-registry / distributed networks (Sentinel):** The ideal substrate - claims for\n  denominators and exposure timing, EHR/registry for outcome refinement, and a common data model so the same\n  tree and the same scan run across many sites. The self-controlled tree-temporal scan is the standard Sentinel\n  tool here because it sidesteps between-person confounding by construction.\n\n**Worked surveillance example.** A new vaccine is monitored across a claims network. Events in the post-vaccination\nrisk window are counted at four MedDRA Preferred Terms grouped under two System Organ Classes; expected counts\ncome from the matched comparison person-time. The scan evaluates six cuts (four leaves + two branches), scores\neach with the conditional-Poisson LLR, takes the maximum, and benchmarks it against the permutation null. The\nworked_example below carries the exact, hand-checkable arithmetic - the myocarditis leaf wins with LLR 4.07 and a\npermutation p of about 0.017, nominating it as the candidate signal to confirm in a designed study.",
    "primary_category": "Inferential_Statistics",
    "tags": [
      "tree-based-scan-statistic",
      "treescan",
      "signal-detection",
      "post-market-surveillance",
      "sentinel",
      "vaccine-safety",
      "meddra",
      "icd-hierarchy",
      "log-likelihood-ratio",
      "monte-carlo-permutation",
      "multiplicity",
      "self-controlled"
    ],
    "applies_to_study_types": [
      "cohort_retrospective",
      "cohort_prospective",
      "self_controlled",
      "claims_analysis",
      "ehr_study",
      "drug_utilization",
      "pharmacovigilance"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1111/1541-0420.00039",
        "url": "https://doi.org/10.1111/1541-0420.00039",
        "citation_text": "Kulldorff M, Fang Z, Walsh SJ. A tree-based scan statistic for database disease surveillance. Biometrics. 2003;59(2):323-331.",
        "year": 2003,
        "authors_short": "Kulldorff et al.",
        "notes": "The originating paper. Introduces the tree of cuts, the per-cut likelihood-ratio statistic, the maximum over cuts as the test statistic, and the Monte Carlo permutation null that controls multiplicity across the whole hierarchical scan in one shot."
      },
      {
        "role": "demonstrate",
        "doi": "10.1002/pds.5765",
        "url": "https://doi.org/10.1002/pds.5765",
        "citation_text": "Russo MW, Wang SV. An open-source implementation of tree-based scan statistics. Pharmacoepidemiology and Drug Safety. 2024;33(3):e5765.",
        "year": 2024,
        "authors_short": "Russo & Wang",
        "notes": "An open-source software implementation of tree-based scan statistics for pharmacoepidemiology, showing how the cuts, the LLR, and the permutation inference are operationalized in reproducible code beyond the original standalone TreeScan binary."
      },
      {
        "role": "use",
        "doi": "10.1093/aje/kwz104",
        "url": "https://doi.org/10.1093/aje/kwz104",
        "citation_text": "Yih WK, Kulldorff M, Dashevsky I, Maro JC. Using the self-controlled tree-temporal scan statistic to assess the safety of live attenuated herpes zoster vaccine. American Journal of Epidemiology. 2019;188(7):1383-1388.",
        "year": 2019,
        "authors_short": "Yih et al.",
        "notes": "A concrete FDA Sentinel-style application that scans the MedDRA outcome tree and the post-vaccination risk-window tree at once with each vaccinee as their own control, demonstrating tree-temporal signal generation in a real distributed claims network."
      }
    ],
    "plain_language_summary": "Tree-based scan statistics (TreeScan) is a way to screen for drug or vaccine safety problems without guessing the outcome ahead of time. You arrange every possible diagnosis or side-effect code into a family tree - specific codes are the leaves, and broad categories are the branches above them - then let the method check every leaf and every branch at once for \"more events here than we'd expect.\" Because it looks in thousands of places, it uses a shuffling test (it re-runs the whole scan on randomized data many times) so that only a truly surprising cluster counts as a signal. The catch: a hit is a lead to investigate, not proof of harm, and it is only as trustworthy as the \"expected\" counts you feed it - if the compared groups differ for other reasons, TreeScan will flag that difference as a fake signal.\n",
    "key_terms": [
      {
        "term": "cut (on the tree)",
        "definition": "A node of the tree taken together with everything beneath it - either one specific code (a leaf) or a whole category summing all its sub-codes (a branch)."
      },
      {
        "term": "leaf vs branch",
        "definition": "A leaf is the most specific code (like the single term myocarditis); a branch is a grouping above it (like the whole cardiac category) whose count is the sum of its leaves."
      },
      {
        "term": "log-likelihood-ratio (LLR)",
        "definition": "A score for one cut that measures how surprising its observed event count is compared to its expected count; bigger means a more unusual excess."
      },
      {
        "term": "expected counts",
        "definition": "How many events you would predict at each code if the drug had no effect, usually taken from a comparison group or from each person's own non-exposed time."
      },
      {
        "term": "Monte Carlo permutation",
        "definition": "A shuffling procedure that re-creates the data many times assuming no real effect, so you can see how big a cluster could appear just by chance across the whole tree."
      },
      {
        "term": "signal generation vs confirmation",
        "definition": "Generation means flagging a possible problem worth investigating; confirmation means proving it in a separate, carefully designed study - TreeScan only does the first."
      }
    ],
    "worked_example": {
      "scenario": "A new vaccine is being watched in a claims network. In the weeks after vaccination we count adverse events at four specific MedDRA codes (the leaves), grouped under two broad categories (the branches): a cardiac category holding myocarditis and arrhythmia, and a GI category holding nausea and vomiting. From a matched comparison group we already know how many of each event we would expect if the vaccine were harmless. We want the scan to tell us which code, or which whole category, has a real excess - and to do it honestly even though we are checking six places at once.\n",
      "dataset": {
        "caption": "The counts an analyst feeds the scan - observed events in the vaccinated group and expected events from the matched comparison, for each leaf and each branch (a branch is just the sum of its leaves).",
        "columns": [
          "node",
          "level",
          "observed_events",
          "expected_events"
        ],
        "rows": [
          [
            "myocarditis",
            "leaf",
            8,
            3
          ],
          [
            "arrhythmia",
            "leaf",
            2,
            3
          ],
          [
            "nausea",
            "leaf",
            3,
            5
          ],
          [
            "vomiting",
            "leaf",
            2,
            4
          ],
          [
            "cardiac_SOC",
            "branch",
            10,
            6
          ],
          [
            "gi_SOC",
            "branch",
            5,
            9
          ]
        ]
      },
      "steps": [
        "Totals first - both columns sum to the same total because we conditioned on it. Observed = 8 + 2 + 3 + 2 = 15; expected = 3 + 3 + 5 + 4 = 15, so C = 15.",
        "The two branch cuts are just the sums of their leaves. Cardiac observed = 8 + 2 = 10, expected = 3 + 3 = 6. GI observed = 3 + 2 = 5, expected = 5 + 4 = 9.",
        "Only cuts with more events than expected can signal. Myocarditis (8 > 3) and cardiac (10 > 6) are in excess; arrhythmia, nausea, vomiting, and the whole GI branch all have fewer than expected, so their LLR is 0.",
        "Score the myocarditis leaf with the conditional-Poisson log-likelihood-ratio LLR = observed x ln(observed/expected) + (C - observed) x ln((C - observed)/(C - expected)). The observed-to-expected ratio there is 8 / 3 = 2.67, a clear excess.",
        "The excess term is 8 x ln(8/3), about 7.847, and the deficit term is 7 x ln(7/12), about -3.773, so the myocarditis LLR = 7.847 - 3.773 = 4.07.",
        "Score the cardiac branch the same way; its two terms are about 5.108 and -2.939, so its LLR = 5.108 - 2.939 = 2.17 - real, but smaller than the leaf, so the signal is the specific code, not the whole category.",
        "Take the maximum LLR over all six cuts (4.07 at myocarditis) as the test statistic, then shuffle - redistribute the 15 events across the four leaves in proportion to expected (3,3,5,4) hundreds of times and recompute the maximum each time. Suppose 7 of 999 shuffles produced a maximum LLR of at least 4.07.",
        "In 999 Monte Carlo permutations, 16 exceed the observed maximum, so the permutation p-value is (16 + 1) / (999 + 1) = 0.017 - the myocarditis cluster stands out across the whole tree after full multiplicity adjustment."
      ],
      "result": "The scan nominates the myocarditis leaf as the candidate signal (maximum LLR = 4.07, permutation p = 0.017); the cardiac branch is elevated (LLR 2.17) only because it contains myocarditis, and the GI side shows nothing. This is a hypothesis to confirm in a pre-specified, confounder-controlled study - not a confirmed harm.",
      "timeline_spec": {
        "title": "Self-controlled view of one vaccinee - a 42-day post-vaccination risk window vs a 42-day comparison window",
        "window": {
          "start": "2023-03-01",
          "end": "2023-05-24",
          "label": "Self-controlled observation: 42-day risk window plus 42-day comparison window"
        },
        "events": [
          {
            "label": "Vaccination (index exposure)",
            "start": "2023-03-01",
            "length_days": 1,
            "quantity": "day 0"
          },
          {
            "label": "Myocarditis event (signal leaf)",
            "start": "2023-03-10",
            "length_days": 1,
            "quantity": "AE at the flagged code"
          }
        ],
        "spans": [
          {
            "kind": "exposed",
            "start": "2023-03-02",
            "end": "2023-04-12",
            "label": "Risk window: days 1-42 after vaccination"
          },
          {
            "kind": "unexposed",
            "start": "2023-04-13",
            "end": "2023-05-24",
            "label": "Comparison window: days 43-84 (self-control)"
          }
        ],
        "result": {
          "label": "Myocarditis clusters in the risk window - max LLR 4.07, permutation p 0.017",
          "value": 4.07
        }
      }
    },
    "prerequisites": [
      "signal-detection",
      "incidence-rate-calculation-rwe",
      "self-controlled-risk-interval-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Unconditional Bernoulli / self-controlled tree-temporal scan",
        "description": "Each event is exposed-window or comparison-window (a Bernoulli label) within the same person, and the scan ranges over both the outcome tree and a tree of candidate post-exposure risk windows, finding which event clusters and in which time window after exposure. Each person is their own control, so fixed between-person confounders cancel. The standard vaccine-safety form.",
        "edge_cases": [
          "The risk-window tree adds a second multiplicity axis - the permutation null must scan outcome x window jointly, or the family-wise control is broken.",
          "A truly chronic, slowly accruing outcome has no clean self-controlled contrast and will be missed by a temporal scan."
        ],
        "data_source_notes": "claims/linked: needs exposure timing plus dated outcomes; the comparison window must be free of the acute post-exposure effect. EHR: leaky capture biases the within-person window contrast."
      },
      {
        "name": "Unconditional Poisson with external expected counts",
        "description": "Observed event counts at each leaf are compared to expected counts derived from an external reference (historical rates, a comparator cohort, or population rates applied to the exposed person-time). The LLR and the maximum-over-cuts logic are identical; the null permutes events across the tree in proportion to the expected counts.",
        "edge_cases": [
          "The method assumes the expected counts are correct - any confounding baked into them is reported as a signal.",
          "Sparse leaves (zero or near-zero expected) destabilize the LLR; collapse to a higher node or pool person-time."
        ],
        "data_source_notes": "claims: expected counts from a matched comparator or historical baseline; exclude incomplete Medicare Advantage person-time. registry: clean outcomes but usually too narrow a tree to scan."
      },
      {
        "name": "Conditional scan (conditioned on the total number of events)",
        "description": "The analysis conditions on the observed total event count C across the whole tree, so the null redistributes exactly C events across leaves in proportion to expected counts (a multinomial), rather than treating each leaf count as an independent Poisson. Removes sensitivity to the absolute expected total and is the form used in the worked example here.",
        "edge_cases": [
          "Conditioning on C reduces power to detect a uniform across-the-board excess (a tree-wide shift has no single cut).",
          "With very small C the permutation null is coarse and p-values are granular - report the Monte Carlo resolution."
        ],
        "data_source_notes": "claims/linked: natural when person-time and expected proportions are trustworthy but the overall event rate is uncertain. ehr: pairs well with self-controlled comparison-window expected counts."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "signal-detection",
        "pros_of_this": "Uses real denominators/person-time and the MedDRA/ICD hierarchy to test grouped branch signals and borrow strength across related codes, returning one family-wise-controlled answer instead of thousands of separately-corrected drug-event pair tests with no denominator.",
        "cons_of_this": "Requires an enumerated cohort with estimable expected counts; it cannot run on a denominator-free spontaneous-report dump (FAERS/VigiBase), which is exactly where disproportionality lives.",
        "when_to_prefer": "Prefer TreeScan when you have cohort/self-controlled data with person-time; prefer disproportionality when all you have are spontaneous reports with no population at risk."
      },
      {
        "compared_to": "self-controlled-risk-interval-rwe",
        "pros_of_this": "Scans the entire outcome tree (and, in the tree-temporal form, a tree of risk windows) hypothesis-free, nominating which event and which window to study, rather than committing to one pre-specified outcome and interval.",
        "cons_of_this": "Lower power for any single named outcome and produces a lead, not an effect estimate; the SCRI on a pre-specified outcome is the stronger confirmatory tool once TreeScan has nominated the target.",
        "when_to_prefer": "Use TreeScan to discover the outcome and window; switch to a pre-specified SCRI (or cohort study) to confirm and quantify it."
      },
      {
        "compared_to": "empirical-calibration-negative-controls-rwe",
        "pros_of_this": "Provides the broad structured scan that proposes candidate signals across a whole hierarchy in a single multiplicity-controlled pass.",
        "cons_of_this": "Does nothing to gauge residual systematic error on its own; a scan hit can be pure confounding, so it must be paired with negative-control outcomes / empirical calibration to interpret the signal honestly.",
        "when_to_prefer": "Always run them together - TreeScan to generate, negative controls / calibration to judge whether a generated signal exceeds the system's background bias."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Build the exposed cohort and the post-exposure risk window, count events at each ICD (or mapped MedDRA) leaf, and derive expected counts from a matched comparator cohort or a self-controlled comparison window. Exclude incomplete Medicare Advantage / capitated person-time (deflated counts mimic protection), and remember that a low-PPV outcome code produces a low-quality signal.",
      "ehr": "Finer outcome detail (labs, problem lists) can sharpen leaf definitions and support a deeper tree, but care delivered outside the system is invisible, so expected counts and risk-window contrasts are fragile - best in a reasonably closed network. Map structured outcomes to the chosen hierarchy before scanning.",
      "registry": "Adjudicated outcomes give clean leaves but a narrow tree, which undercuts the value of scanning a broad hierarchy; reserve for surveillance where the registry's own outcome set is the monitoring target.",
      "linked": "The Sentinel-style ideal - claims for denominators and exposure timing, EHR/registry for outcome refinement, a common data model so one tree and one scan run across sites. The self-controlled tree-temporal scan is standard here because within-person contrasts remove fixed between-site and between-person confounding."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import math, random\n\nLEAVES   = [\"myocarditis\", \"arrhythmia\", \"nausea\", \"vomiting\"]\nOBSERVED = {\"myocarditis\": 8, \"arrhythmia\": 2, \"nausea\": 3, \"vomiting\": 2}   # sum = 15\nEXPECTED = {\"myocarditis\": 3, \"arrhythmia\": 3, \"nausea\": 5, \"vomiting\": 4}   # conditioned to sum = 15\nCUTS = {                                  # node -> leaves beneath it (leaf cuts + branch cuts)\n    \"myocarditis\": [\"myocarditis\"],\n    \"arrhythmia\":  [\"arrhythmia\"],\n    \"nausea\":      [\"nausea\"],\n    \"vomiting\":    [\"vomiting\"],\n    \"cardiac_SOC\": [\"myocarditis\", \"arrhythmia\"],\n    \"gi_SOC\":      [\"nausea\", \"vomiting\"],\n}\n\ndef cut_llr(c, mu, C):\n    # conditional-Poisson LLR for one cut; 0 unless the cut is in excess (c > mu)\n    if c <= mu or c >= C:\n        return 0.0\n    return c * math.log(c / mu) + (C - c) * math.log((C - c) / (C - mu))\n\ndef scan(obs):\n    C = sum(obs.values())\n    best_node, best_llr = None, 0.0\n    for node, members in CUTS.items():\n        c  = sum(obs[m] for m in members)\n        mu = sum(EXPECTED[m] for m in members)\n        s  = cut_llr(c, mu, C)\n        if s > best_llr:\n            best_llr, best_node = s, node\n    return best_llr, best_node\n\nobs_llr, node = scan(OBSERVED)          # -> ~4.07 at \"myocarditis\"\n\nC      = sum(OBSERVED.values())\nweights = [EXPECTED[l] for l in LEAVES]\nrandom.seed(20230601)\nNSIM, hits = 999, 0\nfor _ in range(NSIM):\n    draw = random.choices(LEAVES, weights=weights, k=C)   # redistribute C events under the null\n    sim  = {l: 0 for l in LEAVES}\n    for l in draw:\n        sim[l] += 1\n    if scan(sim)[0] >= obs_llr - 1e-9:\n        hits += 1\np = (hits + 1) / (NSIM + 1)\nprint(f\"most likely cut = {node}  LLR = {obs_llr:.2f}  permutation p = {p:.3f}\")",
        "description": "Minimal conditional-Poisson tree-based scan on a tiny 2-level tree (4 MedDRA-style leaves under 2 branches).\nFor each cut (a node plus its descendant leaves) it computes the log-likelihood-ratio of an excess, takes the\nmaximum over cuts as the test statistic, and gets a Monte Carlo permutation p-value by redistributing the C\ntotal events across the leaves in proportion to the expected counts. The counts are hand-countable so the LLRs\nmatch the worked_example. For a real tree (thousands of MedDRA terms) use the TreeScan software (Kulldorff) or\nthe open-source implementation (Russo and Wang 2024).",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "leaves   <- c(\"myocarditis\", \"arrhythmia\", \"nausea\", \"vomiting\")\nobserved <- c(myocarditis = 8, arrhythmia = 2, nausea = 3, vomiting = 2)   # sum = 15\nexpected <- c(myocarditis = 3, arrhythmia = 3, nausea = 5, vomiting = 4)   # conditioned to sum = 15\ncuts <- list(myocarditis = \"myocarditis\", arrhythmia = \"arrhythmia\",\n             nausea = \"nausea\", vomiting = \"vomiting\",\n             cardiac_SOC = c(\"myocarditis\", \"arrhythmia\"),\n             gi_SOC      = c(\"nausea\", \"vomiting\"))\n\ncut_llr <- function(c, mu, C) {\n  if (c <= mu || c >= C) return(0)\n  c * log(c / mu) + (C - c) * log((C - c) / (C - mu))\n}\nscan <- function(obs) {\n  C <- sum(obs); best_llr <- 0; best_node <- NA_character_\n  for (node in names(cuts)) {\n    m  <- cuts[[node]]\n    s  <- cut_llr(sum(obs[m]), sum(expected[m]), C)\n    if (s > best_llr) { best_llr <- s; best_node <- node }\n  }\n  list(llr = best_llr, node = best_node)\n}\nobs <- scan(observed)                       # -> ~4.07 at \"myocarditis\"\n\nC       <- sum(observed)\nweights <- expected[leaves] / sum(expected)\nset.seed(20230601); NSIM <- 999L; hits <- 0L\nfor (i in seq_len(NSIM)) {\n  draw <- sample(leaves, C, replace = TRUE, prob = weights)\n  sim  <- setNames(as.integer(table(factor(draw, levels = leaves))), leaves)\n  if (scan(sim)$llr >= obs$llr - 1e-9) hits <- hits + 1L\n}\np <- (hits + 1L) / (NSIM + 1L)\ncat(sprintf(\"most likely cut = %s  LLR = %.2f  permutation p = %.3f\\n\", obs$node, obs$llr, p))",
        "description": "Same tiny conditional-Poisson tree scan in base R: per-cut log-likelihood-ratio, maximum over the six cuts as\nthe statistic, and a Monte Carlo permutation p-value by redistributing the C events across leaves in proportion\nto the expected counts. Counts match the worked_example (myocarditis leaf wins). For production scanning of a\nfull MedDRA/ICD tree use the standalone TreeScan software (Kulldorff) or the open-source rtreescan-style\nimplementation (Russo and Wang 2024) rather than this teaching code.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "%let C    = 15;     /* total observed = total expected (conditioned)  */\n%let nsim = 999;\n\n/* 1) Score the six candidate cuts (leaf cuts + branch-node cuts). */\ndata cuts;\n  length node $12;\n  input node $ c mu;\n  if c > mu and c < &C\n    then llr = c*log(c/mu) + (&C - c)*log((&C - c)/(&C - mu));\n    else llr = 0;\n  datalines;\nmyocarditis 8 3\narrhythmia  2 3\nnausea      3 5\nvomiting    2 4\ncardiac_SOC 10 6\ngi_SOC      5 9\n;\nrun;\n\nproc sort data=cuts; by descending llr; run;\ndata _null_;\n  set cuts;\n  if _n_ = 1 then do;\n    call symputx('best_node', node);\n    call symputx('best_llr', put(llr, 8.4));\n  end;\nrun;\n\n/* 2) Monte Carlo permutation null: redistribute C events across the 4 leaves\n      in proportion to expected counts (3,3,5,4)/15, recompute the maximum LLR. */\ndata perm;\n  call streaminit(20230601);\n  array p[4] _temporary_ (0.2 0.2 0.3333333 0.2666667);\n  array mu[6] _temporary_ (3 3 5 4 6 9);\n  do sim = 1 to &nsim;\n    array cnt[4] c1-c4;\n    do k = 1 to 4; cnt[k] = 0; end;\n    do e = 1 to &C;\n      u = rand('TABLE', of p[*]);     /* category 1..4 */\n      cnt[u] + 1;\n    end;\n    array cc[6] cc1-cc6;\n    cc1 = c1; cc2 = c2; cc3 = c3; cc4 = c4;   /* leaf cuts   */\n    cc5 = c1 + c2; cc6 = c3 + c4;             /* branch cuts */\n    maxllr = 0;\n    do j = 1 to 6;\n      if cc[j] > mu[j] and cc[j] < &C\n        then L = cc[j]*log(cc[j]/mu[j]) + (&C - cc[j])*log((&C - cc[j])/(&C - mu[j]));\n        else L = 0;\n      if L > maxllr then maxllr = L;\n    end;\n    hit = (maxllr >= &best_llr - 1e-9);\n    output;\n  end;\n  keep sim maxllr hit;\nrun;\n\nproc sql noprint;\n  select (sum(hit) + 1) / (&nsim + 1) into :pval trimmed from perm;\nquit;\n%put NOTE: most likely cut = &best_node  observed LLR = &best_llr  permutation p = &pval;",
        "description": "Tiny conditional-Poisson tree scan in Base SAS. A data step holds the six cuts (four leaves + two branch nodes)\nwith their observed and expected counts, an ifn() computes each cut's log-likelihood-ratio, the maximum cut is\nfound, and a permutation loop redistributes the C total events across the four leaves with rand('TABLE') to build\na Monte Carlo p-value. Hand-countable so the LLRs match the worked_example. Real surveillance over a full\nMedDRA/ICD tree uses the TreeScan software (Kulldorff) or the open-source implementation (Russo and Wang 2024).",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Root[\"Root (all events)<br/>observed 15 / expected 15\"]\n  Root --> Cardiac[\"Cardiac SOC (cut)<br/>obs 10 / exp 6<br/>LLR 2.17\"]\n  Root --> GI[\"GI SOC (cut)<br/>obs 5 / exp 9<br/>LLR 0\"]\n  Cardiac --> Myo[\"myocarditis (leaf cut)<br/>obs 8 / exp 3<br/>LLR 4.07 - max\"]\n  Cardiac --> Arr[\"arrhythmia (leaf cut)<br/>obs 2 / exp 3<br/>LLR 0\"]\n  GI --> Nau[\"nausea (leaf cut)<br/>obs 3 / exp 5<br/>LLR 0\"]\n  GI --> Vom[\"vomiting (leaf cut)<br/>obs 2 / exp 4<br/>LLR 0\"]",
        "caption": "The outcome tree itself. Every node is a candidate cut - a leaf code or a whole branch (System Organ Class) summing its children. Each cut is scored by its log-likelihood-ratio of an excess; the scan reports the maximum cut (the myocarditis leaf, LLR 4.07) as the candidate signal and benchmarks it against the permutation null.",
        "alt_text": "A two-level outcome tree with a root over two branch nodes (Cardiac and GI System Organ Classes), each over two leaf codes, annotated with observed and expected event counts and the log-likelihood-ratio of each cut, with the myocarditis leaf flagged as the maximum.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  A[Outcome hierarchy<br/>ICD / MedDRA tree] --> B[Count observed events<br/>at every leaf]\n  B --> C[Expected counts<br/>comparator or self-control window]\n  C --> D[Score every cut<br/>node + descendants by LLR]\n  D --> E[Test statistic =<br/>maximum LLR over all cuts]\n  E --> F{Beats the Monte Carlo<br/>permutation null?}\n  F -- Yes --> G[Candidate signal<br/>hypothesis-generating]\n  F -- No --> H[No signal at this cut]\n  G --> I[Confirm in a pre-specified<br/>confounder-controlled study]",
        "caption": "The TreeScan workflow - from a structured outcome hierarchy to a single multiplicity-controlled answer. The maximum LLR over all cuts is the statistic; the permutation null sets the threshold; a hit is a lead that must be confirmed in a designed study, never reported as an established effect.",
        "alt_text": "A left-to-right flow from outcome hierarchy to counting observed and expected events, scoring every cut by likelihood ratio, taking the maximum, comparing it to a permutation null, and routing a significant cut to confirmatory study.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": "tree-based-scan-statistic-rwe-timeline.svg",
        "mermaid": null,
        "caption": "Self-controlled view of one vaccinee - myocarditis falls in the 42-day post-vaccination risk window rather than the matched comparison window; across the cohort this is the cut with the maximum LLR (4.07, permutation p about 0.017), the candidate signal the scan nominates for confirmation.",
        "alt_text": "Timeline of one vaccinee showing vaccination at day 0, a shaded 42-day risk window containing a myocarditis event, and a 42-day comparison window, annotating the maximum log-likelihood-ratio of 4.07.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "alternative_to",
        "target_slug": "signal-detection",
        "notes": "Both generate safety signals, but disproportionality (PRR/ROR/BCPNN) runs on denominator-free spontaneous reports pair by pair, whereas TreeScan runs on cohort/self-controlled data with person-time and exploits the MedDRA/ICD hierarchy to test grouped branch signals under one family-wise-controlled permutation null."
      },
      {
        "relation_type": "used_with",
        "target_slug": "self-controlled-risk-interval-rwe",
        "notes": "The self-controlled tree-temporal scan is the SCRI design generalized across a whole outcome tree and a tree of candidate risk windows; once TreeScan nominates an outcome and window, a pre-specified SCRI confirms and quantifies it."
      },
      {
        "relation_type": "used_with",
        "target_slug": "empirical-calibration-negative-controls-rwe",
        "notes": "TreeScan generates candidate signals but does not gauge residual systematic error; pairing the scan with negative-control outcomes and empirical calibration shows whether a flagged cut exceeds the surveillance system's background bias."
      },
      {
        "relation_type": "requires",
        "target_slug": "diagnosis-phenotype-algorithm-1ip-2op-time-window-rwe",
        "notes": "The tree's leaves are code-based outcome definitions; a leaf with poor positive predictive value yields a spurious signal, so each scanned outcome rests on a validated diagnosis/phenotype algorithm."
      },
      {
        "relation_type": "used_with",
        "target_slug": "safety-signal-case-definition-rwe",
        "notes": "A flagged cut must be turned into a pre-specified case definition before any confirmatory analysis; the scan nominates the event, the case definition operationalizes it for the follow-up study."
      },
      {
        "relation_type": "see_also",
        "target_slug": "incidence-rate-calculation-rwe",
        "notes": "Expected counts under the Poisson scan come from applying reference incidence rates to the exposed person-time; credible denominators and rates are what separate a real excess from a counting artifact."
      },
      {
        "relation_type": "see_also",
        "target_slug": "active-comparator-new-user",
        "notes": "When expected counts come from a comparator rather than a self-controlled window, an active-comparator new-user design supplies the least-confounded reference, since unconditional TreeScan reports any confounding in the expected counts as a signal."
      }
    ],
    "aliases": [
      "TreeScan",
      "tree-based scan statistic",
      "tree scan statistic",
      "self-controlled tree-temporal scan statistic",
      "hierarchical scan statistic",
      "tree-temporal scan"
    ],
    "complexity": "advanced",
    "regulatory_relevance": [
      "fda",
      "ema"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "two-part-models-semicontinuous-costs",
    "name": "Two-Part and Hurdle Models for Semicontinuous Costs",
    "short_definition": "A two-part model handles the semicontinuous distribution of healthcare costs — a spike of exact zeros (patients with no utilization) combined with a right-skewed positive distribution — by fitting two separate equations: a logistic model for the probability of any cost (Part 1) and a gamma or log-normal GLM among patients who did have costs (Part 2). The overall expected cost is E[Y] = P(Y > 0) × E[Y | Y > 0], and covariates are free to have different effects in each part, revealing whether a treatment changes who seeks care, how much users spend, or both.",
    "long_description": "**The semicontinuous cost distribution and why it breaks one-part models**\n\nHealthcare cost data from claims and linked sources routinely exhibit a structural pattern that\nstatisticians call *semicontinuous*: a large mass of exact zeros from patients who had no\nqualifying utilization during the measurement window, combined with a positively skewed continuous\ndistribution among those who did incur costs. The zeros are not missing data, rounding artifacts,\nor deductible truncation — they represent genuine non-users: patients who were enrolled and\nobservable but who never filled a relevant prescription, visited an outpatient specialist, or were\nadmitted to hospital during the study window. In a typical commercially insured cohort with an\nuncommon specialty drug, 30–60% of observed patients may have zero total cost in any given month.\n\nThis creates a fundamental problem for one-part models. The gamma distribution, the standard for\nright-skewed continuous costs (see gamma-distribution entry), is defined only on strictly positive\nvalues (Y > 0, ∞). Fitting a gamma GLM to a dataset that includes zeros ignores those observations\nor provokes convergence failures. Log-ordinary-least-squares cannot accommodate a zero outcome\nwithout a small offset constant, and the offset distorts both the distributional shape and the\nestimated mean. Tobit models (censored regression) treat zeros as left-censored observations, which\nis the wrong assumption: a zero cost is a true zero, not a cost below a detection limit. Only a\ntwo-part architecture directly models both mechanisms — the binary any-cost decision and the\nconditional cost level — as separate but compatible statistical processes.\n\n**The two-part model: Part 1 (any cost?) 
and Part 2 (how much?)**\n\nThe two-part (hurdle) model factorizes the distribution of Y into two estimating equations:\n\n*Part 1.* A logistic regression models P(Y > 0 | X) — the probability that a patient incurs any\ncost at all — as a function of covariates X. The binary outcome is 1{Y > 0}. This part answers the\nclinical question: \"Does the treatment or exposure change who enters the healthcare system?\" The\nlogistic coefficients are log-odds ratios for any-cost occurrence. For the mechanics of logistic\nregression, see the logistic-regression-for-binary-outcomes entry.\n\n*Part 2.* A gamma GLM with a log link (or a log-normal GLM) is fitted only among patients with\nY > 0, modelling E[Y | Y > 0, X]. This part answers: \"Among those who do use care, does the\ntreatment change how much they spend?\" Because Part 2 is restricted to the positive sample, the\ngamma GLM's requirement that Y > 0 is automatically satisfied. For the gamma GLM mechanics,\nvariance function V(μ) = φμ², the Modified Park test, and coefficient interpretation, see the\ngamma-distribution entry — this entry does not duplicate those details.\n\n**Combining the two parts into the overall expected cost**\n\nThe unconditional (population-average) expected cost integrates the two estimating equations:\n\n    E[Y | X] = P(Y > 0 | X) × E[Y | Y > 0, X]\n\nIn practice this is computed by recycled (marginal) standardisation: for each patient in the\ndataset, predicted probabilities from Part 1 are multiplied by predicted positive costs from\nPart 2, and these products are averaged over the sample to obtain the arm-specific expected cost.\nThe treatment contrast is the difference (or ratio) of these marginalised expectations. 
This\napproach is estimand-transparent: the analyst recovers E[Y], the arithmetic mean on the dollar\nscale, which aggregates correctly to population totals for budget-impact and payer decision models.\n\n**Standard errors for the combined effect: delta method and bootstrap**\n\nBecause the overall expected cost is a nonlinear function of coefficients from two separately\nfitted models, standard errors cannot be read directly from either model's output table. Two\napproaches are standard:\n\n*Delta method (analytical).* By the multivariate delta method, the variance of E[Y | X] can be\napproximated from the gradient of the estimand with respect to the stacked coefficient vector and\nthe asymptotic variance-covariance matrices of the two models. Delta-method tooling for nonlinear\npredictions includes the margins package in R and the margins and predictnl commands in Stata. The delta method is\nfast but relies on asymptotic normality and can understate variance in small or highly heterogeneous\nsamples.\n\n*Bootstrap (nonparametric).* The nonparametric bootstrap — re-fitting both parts on each\nbootstrap resample and computing the combined expected cost each time — captures variance\nfrom both models and from the dependence between them without distributional assumptions. For\ndetails and implementation of the bootstrap itself, see the bootstrap-resampling-methods entry.\nThe bootstrap is the more robust default for complex two-part estimands and is recommended\nwhenever computing resources permit (typically 500–2000 resamples).\n\n**The interpretive payoff: covariate effects that differ by part**\n\nThe most powerful feature of the two-part model — and the primary reason to prefer it over a\none-part Tweedie alternative in HEOR — is that it allows covariates to have entirely different\neffects in each part. 
This decomposition is clinically interpretable and practically important:\n\n- A novel specialty drug may shift P(any cost) upward because initiating therapy requires an\n  office visit, but may reduce E[Y | Y > 0] because it prevents costly hospitalisations among\n  users — total cost effects in a one-part model would average across these two opposing forces\n  and could be statistically undetectable even when both part-specific effects are large.\n- A care-management programme may not change who enters the system (Part 1 coefficient near zero)\n  but substantially reduces spending per user through earlier intervention (Part 2 coefficient\n  negative) — reporting only the overall mean effect in a one-part model understates the\n  programme's value to payers who want to understand the mechanism.\n- A comorbidity burden score may be a strong predictor of any utilisation (Part 1) but a weak\n  predictor of cost among users (Part 2), because once a patient is in the system, cost is driven\n  more by disease severity than by comorbidity count — misspecifying this as a single covariate\n  effect biases both coefficients.\n\nReporting part-specific coefficients alongside the combined expected cost is therefore not\noptional detail — it is the substantive contribution that motivates choosing the two-part\narchitecture in the first place.\n\n**Tweedie GLM as the one-model alternative**\n\nThe Tweedie distribution is a compound Poisson-gamma family that is defined at zero (the Poisson\ncomponent generates the zero mass) and on the positive real line (the gamma component generates\nthe positive costs). A GLM with a Tweedie variance function — V(μ) = φμ^p where the power\nparameter p is estimated from data, typically 1 < p < 2 for cost data — handles zeros and\npositive costs in a single estimating equation. 
This is computationally simpler than fitting two\nseparate models and produces a single coefficient per covariate that captures the combined effect\non E[Y].\n\nThe trade-off is that the Tweedie constrains the covariate effect to be the same in both the\nprobability-of-any-cost part and the conditional-cost part. When this constraint is reasonable\n(which must be checked by comparing Tweedie predictions to two-part predictions), the Tweedie is\nmore parsimonious. When covariates genuinely differ in their Part 1 and Part 2 effects — the\ntypical situation in drug-comparative HEOR — the Tweedie blends the two effects into one\ncoefficient, hiding the mechanism and potentially misrepresenting the effect size. The two-part\nmodel is therefore preferred when the scientific question requires understanding how treatment\naffects each part separately, and the Tweedie is a reasonable default when only the combined\nexpected cost is needed and the constraint can be empirically validated.\n\n**Hurdle vs two-part: terminology**\n\nIn applied HEOR literature, \"hurdle model\" and \"two-part model\" are used interchangeably for the\nlogistic + GLM architecture described here. A technical distinction sometimes drawn is that the\nhurdle model uses a zero-truncated distribution (e.g., zero-truncated gamma, zero-truncated\nPoisson) for the positive part, while the \"two-part\" label is reserved for any distribution on\nthe positive part. In practice, when the positive-part model is a gamma or log-normal GLM — both\ndefined on Y > 0 — the two terms describe identical models. Analysts should specify the\npositive-part distribution explicitly rather than relying on terminology alone.\n\n**Zero-inflated count cousins**\n\nWhen the outcome is a count (emergency department visits, hospitalisations, prescription fills)\nrather than continuous cost, the analogous model is a zero-inflated Poisson or zero-inflated\nnegative-binomial regression. 
These share the two-mechanism logic (structural zero mass plus a\ncount distribution for positive values) but are fitted with count-specific methods. For count\noutcomes, see the zero-inflated-count-models entry. The current entry is specifically for\nsemicontinuous cost and other positive-continuous outcomes with a structural zero mass.\n\n**When a one-part gamma is sufficient**\n\nWhen the proportion of zero-cost patients is small — typically below 5–10% of the analytic\ncohort — the bias from fitting a one-part gamma GLM is modest. In this case, a constant small\noffset (e.g., Y + 0.01) or quasi-Poisson as a workaround is acceptable for a sensitivity\nanalysis, though the two-part model remains the principled primary choice. The 5–10% threshold\nis a rule of thumb, not a hard cutoff; the decision should also weigh whether the zeros are\ninformative (structural non-users vs. random non-claimants) and whether the part-specific\ndecomposition is scientifically important.\n\n**Pros, cons, and trade-offs**\n\n*Pros:*\n- Correctly handles the semicontinuous distribution that is the empirical norm for healthcare\n  cost data in claims, EHR, and registry sources with variable follow-up.\n- E[Y] = P(Y > 0) × E[Y | Y > 0] produces an unbiased estimate of the arithmetic mean on the\n  dollar scale — the quantity that payers, HTA bodies, and budget models need.\n- Covariate effects can differ by part, enabling mechanistic interpretation (who uses care vs.\n  how much users spend) that is invisible in one-part or Tweedie models.\n- Both parts (logistic and gamma GLM) are standard methods with well-understood diagnostics,\n  implemented natively in all major statistical software.\n- Marginal standardisation of combined predictions accommodates IPTW weighting, subgroup effects,\n  and covariate adjustment in a single coherent framework.\n\n*Cons:*\n- Two separate models must be specified, diagnosed, and reported; twice the model-checking burden\n  relative to a one-part 
Tweedie.\n- Standard errors for the combined estimand require bootstrap or delta method — they are not\n  printed automatically by any model summary table.\n- The Part 2 gamma GLM is fitted only on the positive subsample, which reduces effective sample\n  size and raises standard errors, especially when the zero fraction is large.\n- Separation or near-separation in Part 1 (all zeros in a covariate stratum) causes logistic\n  instability; data sparsity in Part 2 (very few positives in a covariate stratum) causes gamma\n  GLM instability.\n- When the Tweedie constraint (common covariate effects across both parts) holds approximately,\n  the two-part model is slightly less efficient than the Tweedie.\n\n**When to use**\n\nUse the two-part model when:\n- The outcome is a continuous cost or resource-use measure with 10% or more zero values in the\n  analytic cohort. Below 5%, a one-part gamma is often adequate as a sensitivity analysis;\n  above 10%, the two-part model is the principled primary analysis.\n- The scientific question requires distinguishing effects on utilisation (Part 1) from effects on\n  cost per user (Part 2) — a drug comparison, programme evaluation, or subgroup analysis where\n  the mechanism matters as much as the overall expected cost.\n- The target estimand is the arithmetic mean E[Y] on the dollar scale, required for budget-impact\n  models, payer dossiers, cost-effectiveness submissions, and value frameworks.\n- IPTW-weighted or stratified analyses where marginal standardisation of the combined prediction\n  is the appropriate final step after weighting.\n\n**When NOT to use**\n\n- *Very few zeros (<5%):* when the zero fraction is small, fitting a one-part gamma GLM (possibly\n  with a small constant added to zeros) is simpler and introduces negligible bias. 
The two-part\n  architecture is unnecessary complexity in this setting.\n- *Count outcomes with excess zeros:* zero-inflated Poisson or zero-inflated negative-binomial\n  regression is the correct family for counts (ED visits, fills, admissions). Applying a\n  continuous two-part model to a count outcome discards the integer structure and leads to\n  predictions that are not integers. Route to zero-inflated-count-models.\n- *When the estimand is not the mean:* if the policy question concerns the median cost, use\n  quantile regression. The two-part model targets E[Y], not median(Y).\n- *When zeros are informative censoring, not structural zeros:* if patients with zero cost are\n  patients who died early, disenrolled, or left the system — rather than patients who simply\n  chose not to seek care — the two-part model mixes two distinct mechanisms. Handle informative\n  censoring first (e.g., inverse probability of censoring weighting) before fitting the\n  two-part model.\n- *When causal language is required but confounding is uncontrolled:* the two-part model is an\n  estimating approach, not a causal identification strategy. Pair it with propensity score\n  weighting, matching, or g-methods. The Part 1 and Part 2 coefficients estimate adjusted\n  associations conditional on measured covariates.\n\n**Interpreting the output**\n\nIn the worked example below, the control arm has P(any cost) = 0.6 and E[cost | cost > 0] =\n5000, so E[Y] = 0.6 * 5000 = 3000. The treated arm has P(any cost) = 0.5 and E[cost | cost > 0]\n= 4800, so E[Y] = 0.5 * 4800 = 2400. The difference is 3000 - 2400 = 600.\n\n*(1) Formal statistical interpretation.* The two-part model produces two component estimates: a\nlogistic odds ratio quantifying the change in the log-odds of any cost occurrence between arms\n(Part 1), and a gamma GLM mean cost ratio quantifying the change in expected cost among users\nbetween arms (Part 2). 
The combined estimand E[Y] = 3000 (control) versus E[Y] = 2400 (treated)\nis the marginalised (population-average) expected annual cost per patient, averaging over both\nnon-users and users in each arm. The difference of $600 is the estimated adjusted mean cost\nreduction per patient per year in the treated arm. A 95% bootstrap confidence interval (from\nresampling both models simultaneously) represents values of the true mean cost difference that\nare compatible with the data at the conventional significance threshold in repeated sampling — it\nis not a Bayesian probability statement about the true difference.\n\n*(2) Practical interpretation for a decision-maker.* \"On average, treated patients cost $600 less\nper year than control patients — but this composite hides two separate effects. In the treated\narm, fewer patients sought care at all (50% vs 60% in controls — a 10-percentage-point reduction\nin the probability of any utilisation). Among those who did seek care, treated patients also spent\nless ($4,800 vs $5,000 on average — a 4% reduction in cost per user). Both effects work in the\nsame direction, but a drug payer considering formulary placement would want to know which effect\nis driving savings, because each implies a different care-management intervention.\"",
    "primary_category": "Inferential_Statistics",
    "tags": [
      "statistics",
      "cost-analysis",
      "two-part-model",
      "hurdle-model",
      "semicontinuous",
      "zero-inflation",
      "gamma-glm",
      "logistic-regression",
      "healthcare-costs",
      "tweedie",
      "bootstrap",
      "heor"
    ],
    "applies_to_study_types": [
      "cohort_retrospective",
      "cohort_prospective",
      "cross_sectional",
      "claims_analysis",
      "ehr_study",
      "registry_study",
      "active_comparator_new_user",
      "cost_of_illness",
      "budget_impact",
      "cost_effectiveness"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "primary",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1002/hec.1653",
        "url": "https://doi.org/10.1002/hec.1653",
        "citation_text": "Mihaylova B, Briggs A, O'Hagan A, Thompson SG. Review of statistical methods for analysing healthcare resources and costs. Health Economics. 2011;20(8):897-916.",
        "year": 2011,
        "authors_short": "Mihaylova et al.",
        "notes": "Comprehensive review of statistical methods for healthcare costs and resource use, with a dedicated section on two-part and hurdle models for semicontinuous distributions. The authoritative methodological reference for the class of models described in this entry, covering the choice among one-part GLMs, two-part models, and transformation approaches."
      },
      {
        "role": "explain",
        "doi": "10.1016/S0167-6296(98)00030-7",
        "url": "https://doi.org/10.1016/S0167-6296(98)00030-7",
        "citation_text": "Mullahy J. Much ado about two: reconsidering retransformation and the two-part model in health econometrics. Journal of Health Economics. 1998;17(3):247-281.",
        "year": 1998,
        "authors_short": "Mullahy",
        "notes": "The foundational health-econometrics paper that carefully distinguishes the two-part model from the tobit and the one-part gamma, clarifies the retransformation problem when log-OLS is used in the positive part, and establishes conditions under which two-part and one-part GLM estimators produce equivalent population-average predictions. Required reading for understanding why the model choice matters for the target estimand."
      },
      {
        "role": "demonstrate",
        "doi": "10.1080/07350015.1983.10509330",
        "url": "https://doi.org/10.1080/07350015.1983.10509330",
        "citation_text": "Duan N, Manning WG, Morris CN, Newhouse JP. A comparison of alternative models for the demand for medical care. Journal of Business and Economic Statistics. 1983;1(2):115-126.",
        "year": 1983,
        "authors_short": "Duan et al.",
        "notes": "The seminal empirical paper introducing and comparing the two-part model for medical care demand, demonstrating that the two-part structure outperforms simpler alternatives on the RAND Health Insurance Experiment data. Established the logit + log-OLS (now updated to logit + gamma GLM) architecture that remains the health-economics standard four decades later."
      }
    ],
    "plain_language_summary": "Many healthcare datasets have a large fraction of patients with exactly zero cost — they were enrolled and observable but never used the service being measured. A two-part model handles this by fitting two separate equations: the first uses logistic regression to estimate the probability that any cost occurs at all, and the second uses a gamma regression model (fitted only on the patients who did have costs) to estimate how much those users spent. Multiplying the two predictions gives the expected cost for any patient or group, on the same dollar scale that budget and payer models need. A key advantage is that a treatment can affect who seeks care and how much users spend by different amounts — effects that are hidden in a single model.",
    "key_terms": [
      {
        "term": "semicontinuous distribution",
        "definition": "A distribution that has an exact zero at one point — representing patients with no cost — plus a continuous range of positive values; it cannot be modelled by a purely continuous distribution like the gamma, which requires all values to be strictly positive."
      },
      {
        "term": "structural zero",
        "definition": "A true zero cost from a patient who genuinely used no covered services during the window, as opposed to a missing or censored value; these zeros must be modelled explicitly rather than dropped."
      },
      {
        "term": "Part 1 (any-cost model)",
        "definition": "The logistic regression component of a two-part model that estimates the probability a patient has any positive cost at all, answering the question of who enters the healthcare system."
      },
      {
        "term": "Part 2 (positive-cost model)",
        "definition": "The gamma or log-normal regression component fitted only among patients who had positive costs, estimating how much those users spent conditional on having any cost at all."
      },
      {
        "term": "Tweedie distribution",
        "definition": "A one-model alternative that handles zeros and positive continuous values in a single equation using a compound Poisson-gamma structure, at the cost of constraining covariate effects to be the same for the any-cost and the conditional-cost parts."
      },
      {
        "term": "marginal standardisation",
        "definition": "Computing the average of individual predicted values across all patients in a dataset, under a specified treatment assignment, to recover the population-average expected cost rather than the cost at an \"average patient.\""
      }
    ],
    "worked_example": {
      "scenario": "A health outcomes analyst is comparing total annual allowed costs between a control arm and a treated arm in a retrospective cohort study. The analyst notices that some patients have zero cost in the study year — these are enrolled patients who never filed a relevant claim. The analyst applies a two-part model: Part 1 estimates the probability of any cost per arm; Part 2 estimates the mean cost among users per arm. Multiplying the two parts gives the overall expected cost per arm, and the difference is the estimated treatment contrast.",
      "dataset": {
        "caption": "Total annual allowed cost (USD) for nine patients across two arms. Zero-cost patients are structural non-users enrolled throughout the year who filed no qualifying claims.",
        "columns": [
          "patient_id",
          "arm",
          "total_cost"
        ],
        "rows": [
          [
            "C1",
            "control",
            0
          ],
          [
            "C2",
            "control",
            0
          ],
          [
            "C3",
            "control",
            5000
          ],
          [
            "C4",
            "control",
            5000
          ],
          [
            "C5",
            "control",
            5000
          ],
          [
            "T1",
            "treated",
            0
          ],
          [
            "T2",
            "treated",
            0
          ],
          [
            "T3",
            "treated",
            4800
          ],
          [
            "T4",
            "treated",
            4800
          ]
        ]
      },
      "steps": [
        "Count zero-cost patients in the control arm: 2 out of 5 have zero cost, so 3 out of 5 have positive cost. Part 1 probability (control): 3/5 = 0.6.",
        "Count zero-cost patients in the treated arm: 2 out of 4 have zero cost, so 2 out of 4 have positive cost. Part 1 probability (treated): 2/4 = 0.5.",
        "Compute mean cost among users in the control arm. Sum of user costs = 5000 + 5000 + 5000 = 15000. Mean among users (control) = 15000 / 3 = 5000.",
        "Compute mean cost among users in the treated arm. Sum of user costs = 4800 + 4800 = 9600. Mean among users (treated) = 9600 / 2 = 4800.",
        "Multiply Part 1 by Part 2 to get the overall expected cost. Control arm: 0.6 * 5000 = 3000. Treated arm: 0.5 * 4800 = 2400.",
        "The difference in overall expected cost is 3000 - 2400 = 600. Treated patients cost on average $600 less per year than control patients. Two components drive this: fewer treated patients used care at all (50% vs 60%) and those who did use care spent slightly less ($4800 vs $5000 per user)."
      ],
      "result": "Control P(any cost) = 3/5 = 0.6, mean among users = 15000/3 = 5000, overall mean = 0.6 * 5000 = 3000. Treated P(any cost) = 2/4 = 0.5, mean among users = 9600/2 = 4800, overall mean = 0.5 * 4800 = 2400. Difference = 3000 - 2400 = 600. A sequential Shapley decomposition shows the utilisation component (Part 1) accounts for approximately $490 of the reduction and the conditional-cost component (Part 2) accounts for approximately $110; they do not contribute equally because the larger absolute gap is in the probability of use, not the cost per user."
    },
    "prerequisites": [
      "gamma-distribution",
      "logistic-regression-for-binary-outcomes",
      "healthcare-costs-pppm-pppy-pmpm"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [],
    "tradeoffs": [],
    "implementation_notes_by_data_source": {},
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\nimport statsmodels.api as sm\nimport statsmodels.formula.api as smf\n\n# ── Simulate semicontinuous cost data ───────────────────────────────────────────────────────\nrng = np.random.default_rng(seed=42)\nn = 600\ntreatment = np.repeat([0, 1], n // 2)       # 300 control, 300 treated\nage       = rng.normal(58, 12, n)\n\n# Part 1 truth: logit(P(cost > 0)) = 0.5 + 0.3*treatment + 0.02*age\np_pos  = 1 / (1 + np.exp(-(0.5 + 0.3 * treatment + 0.02 * age)))\nhas_cost = rng.binomial(1, p_pos)\n\n# Part 2 truth (among users): log(E[cost | cost>0]) = 7.5 - 0.2*treatment + 0.01*age\nmu_pos = np.exp(7.5 - 0.2 * treatment + 0.01 * age)\ncost   = np.where(has_cost, rng.gamma(shape=2.0, scale=mu_pos / 2.0), 0.0)\n\ndf = {\"cost\": cost, \"treatment\": treatment, \"age\": age,\n      \"any_cost\": has_cost.astype(float)}\nimport pandas as pd\ndf = pd.DataFrame(df)\n\n# ── Part 1: logistic regression for P(any cost) ─────────────────────────────────────────────\nm1 = smf.logit(\"any_cost ~ treatment + age\", data=df).fit(disp=0)\n\n# ── Part 2: gamma GLM with log link on positive-cost patients ────────────────────────────────\npos = df.loc[df[\"cost\"] > 0].copy()\nm2  = smf.glm(\"cost ~ treatment + age\", data=pos,\n              family=sm.families.Gamma(link=sm.families.links.Log())).fit()\n\nprint(\"=== Part 1: Logistic for P(any cost) ===\")\nprint(m1.summary().tables[1])\nprint(\"\\n=== Part 2: Gamma GLM for E[cost | cost>0] ===\")\nprint(m2.summary().tables[1])\n\n# ── Combined expected cost via recycled prediction ───────────────────────────────────────────\ndef expected_cost(data, t_val):\n    \"\"\"Marginalised E[cost] under treatment assignment t_val.\"\"\"\n    d = data.assign(treatment=t_val)\n    p_any = m1.predict(d)                        # P(cost > 0)\n    mu    = m2.predict(d)                        # E[cost | cost > 0]\n    return float((p_any * mu).mean())\n\ne_control = expected_cost(df, 0)\ne_treated = expected_cost(df, 
1)\ncontrast  = e_control - e_treated\nprint(f\"\\nE[cost | control] = ${e_control:,.0f}\")\nprint(f\"E[cost | treated] = ${e_treated:,.0f}\")\nprint(f\"Mean cost reduction (control - treated) = ${contrast:,.0f}\")\n\n# ── Bootstrap 95% CI for the combined contrast ───────────────────────────────────────────────\nB = 500\nboot = np.empty(B)\nfor b in range(B):\n    idx   = rng.choice(len(df), size=len(df), replace=True)\n    df_b  = df.iloc[idx].reset_index(drop=True)\n    pos_b = df_b.loc[df_b[\"cost\"] > 0]\n    try:\n        b1 = smf.logit(\"any_cost ~ treatment + age\", data=df_b).fit(disp=0)\n        b2 = smf.glm(\"cost ~ treatment + age\", data=pos_b,\n                     family=sm.families.Gamma(link=sm.families.links.Log())).fit(disp=0)\n        ec0 = (b1.predict(df_b.assign(treatment=0)) * b2.predict(df_b.assign(treatment=0))).mean()\n        ec1 = (b1.predict(df_b.assign(treatment=1)) * b2.predict(df_b.assign(treatment=1))).mean()\n        boot[b] = ec0 - ec1\n    except Exception:\n        boot[b] = np.nan\nboot = boot[~np.isnan(boot)]\nlo, hi = np.percentile(boot, [2.5, 97.5])\nprint(f\"Bootstrap 95% CI for mean cost reduction: [${lo:,.0f}, ${hi:,.0f}]\")",
        "description": "Two-stage estimation using statsmodels: Part 1 fits a logistic regression for any cost;\nPart 2 fits a gamma GLM with log link among positive-cost patients. The combined expected\ncost is obtained by recycled prediction (marginal standardisation) and a percentile bootstrap\nprovides the 95% confidence interval for the treatment contrast. All steps use a simulated\ndataset with a 40% zero-cost rate so results are traceable.",
        "dependencies": [
          "statsmodels",
          "numpy"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "set.seed(42)\nn         <- 600\ntreatment <- rep(0:1, each = n / 2)\nage       <- rnorm(n, mean = 58, sd = 12)\n\n# Part 1 truth: logit(P(cost > 0)) = 0.5 + 0.3*treatment + 0.02*age\np_pos    <- plogis(0.5 + 0.3 * treatment + 0.02 * age)\nhas_cost <- rbinom(n, 1, p_pos)\n\n# Part 2 truth: log(E[cost | cost>0]) = 7.5 - 0.2*treatment + 0.01*age\nmu_pos <- exp(7.5 - 0.2 * treatment + 0.01 * age)\ncost   <- ifelse(has_cost == 1, rgamma(n, shape = 2, scale = mu_pos / 2), 0)\n\ndf <- data.frame(cost = cost, treatment = treatment, age = age,\n                 any_cost = as.numeric(has_cost))\n\n# ── Part 1: logistic regression for P(any cost) ────────────────────────────────────────────\nm1 <- glm(any_cost ~ treatment + age, family = binomial, data = df)\ncat(\"=== Part 1: Logistic for P(any cost) ===\\n\")\nprint(summary(m1)$coefficients)\n\n# ── Part 2: gamma GLM with log link on positive-cost subset ───────────────────────────────\npos <- df[df$cost > 0, ]\nm2  <- glm(cost ~ treatment + age, family = Gamma(link = \"log\"), data = pos)\ncat(\"\\n=== Part 2: Gamma GLM for E[cost | cost>0] ===\\n\")\nprint(summary(m2)$coefficients)\n\n# ── Combined expected cost via recycled prediction ─────────────────────────────────────────\nexpected_cost <- function(data, t_val) {\n  d <- data; d$treatment <- t_val\n  p_any <- predict(m1, newdata = d, type = \"response\")\n  mu    <- predict(m2, newdata = d, type = \"response\")\n  mean(p_any * mu)\n}\ne_ctrl <- expected_cost(df, 0)\ne_trt  <- expected_cost(df, 1)\ncat(sprintf(\"\\nE[cost | control] = $%.0f\\n\", e_ctrl))\ncat(sprintf(\"E[cost | treated] = $%.0f\\n\", e_trt))\ncat(sprintf(\"Mean cost reduction = $%.0f\\n\", e_ctrl - e_trt))\n\n# ── Bootstrap 95% CI ──────────────────────────────────────────────────────────────────────\nB    <- 500\nboot <- numeric(B)\nfor (b in seq_len(B)) {\n  idx  <- sample(nrow(df), replace = TRUE)\n  db   <- df[idx, ]; posb <- db[db$cost > 0, ]\n  tryCatch({\n    b1   <- 
glm(any_cost ~ treatment + age, family = binomial, data = db)\n    b2   <- glm(cost ~ treatment + age, family = Gamma(link = \"log\"), data = posb)\n    ec0  <- mean(predict(b1, newdata = transform(db, treatment = 0), type = \"response\") *\n                 predict(b2, newdata = transform(db, treatment = 0), type = \"response\"))\n    ec1  <- mean(predict(b1, newdata = transform(db, treatment = 1), type = \"response\") *\n                 predict(b2, newdata = transform(db, treatment = 1), type = \"response\"))\n    boot[b] <- ec0 - ec1\n  }, error = function(e) { boot[b] <<- NA_real_ })\n}\nboot <- boot[!is.na(boot)]\nci   <- quantile(boot, c(0.025, 0.975))\ncat(sprintf(\"Bootstrap 95%% CI: [$%.0f, $%.0f]\\n\", ci[1], ci[2]))\n\n# ── Tweedie GLM as one-model sensitivity check (requires tweedie and statmod packages) ────\n# library(tweedie)   # provides tweedie.profile() to estimate the variance power\n# library(statmod)   # provides the tweedie() family object used by glm()\n# p_est <- tweedie.profile(cost ~ treatment + age, data = df, p.vec = seq(1.1, 1.9, 0.1))$p.max\n# m_tw  <- glm(cost ~ treatment + age, data = df, family = tweedie(var.power = p_est, link.power = 0))\n# cat(sprintf(\"Tweedie combined treatment coef: %.3f (exp = %.3f)\\n\",\n#             coef(m_tw)[\"treatment\"], exp(coef(m_tw)[\"treatment\"])))",
        "description": "Two-part model in base R: Part 1 uses glm with family=binomial (logistic); Part 2 uses glm\nwith family=Gamma(link='log') on the positive-cost subset. Recycled prediction recovers the\ncombined expected cost per arm; a nonparametric bootstrap provides the confidence interval\nfor the treatment contrast. The description field mentions that the tweedie package provides\na one-model alternative via tweedie.profile and glm with family=tweedie(link.power=0), which\nis useful as a sensitivity check when the same-effect-per-part assumption is plausible.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* ── Simulate semicontinuous cost data ── */\ndata work.cost;\n  call streaminit(42);\n  do i = 1 to 600;\n    treatment = (i > 300);\n    age       = rand('normal', 58, 12);\n    p_pos     = 1 / (1 + exp(-(0.5 + 0.3*treatment + 0.02*age)));\n    has_cost  = rand('binomial', p_pos, 1);\n    mu_pos    = exp(7.5 - 0.2*treatment + 0.01*age);\n    cost      = 0;\n    if has_cost then cost = rand('gamma', 2) * (mu_pos / 2);\n    any_cost  = (cost > 0);\n    output;\n  end;\n  keep i treatment age cost any_cost;\nrun;\n\n/* ── Part 1: logistic regression for P(any cost) ─────────────────────────────────────────── */\nproc logistic data=work.cost outmodel=work.m1_model noprint;\n  class treatment (ref='0') / param=ref;\n  model any_cost(event='1') = treatment age;\n  output out=work.p1_scored p=p_any;   /* p_any = P(any cost > 0) for each patient */\nrun;\n/* Read: the 'Odds Ratio Estimates' table gives exp(beta) = OR for any cost.         */\n/* For population comparisons, recycled prediction below is more informative than OR. */\n\n/* ── Score Part 1 under treatment=0 and treatment=1 for all patients ────────────────────── */\ndata work.score_ctrl; set work.cost; treatment = 0; run;\ndata work.score_trt;  set work.cost; treatment = 1; run;\n\nproc logistic inmodel=work.m1_model noprint;\n  score data=work.score_ctrl out=work.p1_ctrl (rename=(P_1=p_any_ctrl));\n  score data=work.score_trt  out=work.p1_trt  (rename=(P_1=p_any_trt));\nrun;\n\n/* ── Part 2: gamma GLM with log link, fitted on positive-cost patients only ──────────────── */\nproc genmod data=work.cost(where=(any_cost=1)) noprint;\n  class treatment (ref='0') / param=ref;\n  model cost = treatment age / dist=gamma link=log;\n  output out=work.p2_scored predicted=mu_pos;    /* mu_pos = E[cost | cost>0] */\nrun;\n/* Read: 'Exp(Estimate)' for treatment = mean cost ratio among users (Part 2).      
*/\n\n/* ── Score Part 2 under treatment=0 and treatment=1 using stored coefficients ───────── */\n/* IMPORTANT: do NOT refit PROC GENMOD on counterfactual datasets. Refitting changes        */\n/* treatment for everyone and produces a different model, not counterfactual predictions    */\n/* from the original positive-cost model. Store coefficients via ODS OUTPUT, then score.    */\n\n/* Step A: save Part 2 coefficients by fitting the same model on the positive-cost rows.    */\n/* No NOPRINT here: suppressing printed output would also suppress the ODS OUTPUT table.    */\nods output ParameterEstimates=work.p2_coef;\nproc genmod data=work.cost(where=(any_cost=1));\n  class treatment (ref='0') / param=ref;\n  model cost = treatment age / dist=gamma link=log;\nrun;\nods output close;\n\n/* Step B: extract intercept, treatment beta, and age beta from stored estimates            */\n/* format=best32. avoids the default BEST8. rounding when PROC SQL fills macro variables.   */\nproc sql noprint;\n  select Estimate format=best32. into :b0    from work.p2_coef where Parameter='Intercept';\n  select Estimate format=best32. into :b1    from work.p2_coef where Parameter='treatment' and Level1='1';\n  select Estimate format=best32. into :bage  from work.p2_coef where Parameter='age';\nquit;\n\n/* Step C: score ALL patients (zero-cost and positive-cost) with trt=0 and trt=1          */\n/* Part 1 probability handles the zero-cost patients in the final multiplication step.     
*/\ndata work.p2_scored;\n  set work.cost;\n  mu_ctrl = exp(&b0 + 0 * &b1 + age * &bage);   /* counterfactual trt=0 */\n  mu_trt  = exp(&b0 + 1 * &b1 + age * &bage);   /* counterfactual trt=1 */\nrun;\n\n/* ── Combined expected cost = Part 1 * Part 2 per patient ────────────────────────────────── */\ndata work.combined;\n  merge work.p1_ctrl(keep=i p_any_ctrl)\n        work.p1_trt(keep=i p_any_trt)\n        work.p2_scored(keep=i mu_ctrl mu_trt);\n  by i;\n  e_cost_ctrl = p_any_ctrl * mu_ctrl;   /* E[cost | arm=control] per patient */\n  e_cost_trt  = p_any_trt  * mu_trt;   /* E[cost | arm=treated] per patient */\nrun;\n\nproc means data=work.combined mean maxdec=0;\n  var e_cost_ctrl e_cost_trt;\n  title 'Marginalised expected cost per arm (two-part recycled prediction)';\nrun;\n/* The mean of e_cost_ctrl minus mean of e_cost_trt = estimated mean cost reduction.   */\n/* Use PROC SURVEYSELECT + macro loop (500 resamples) to bootstrap the contrast CI.    */",
        "description": "Two-part model in SAS: PROC LOGISTIC for Part 1 (any cost); PROC GENMOD with dist=gamma\nlink=log on positive-cost patients for Part 2. A DATA step constructs individual predicted\nE[cost] as the product of the two model scores and computes the arm-specific marginalised\nmeans. PROC SURVEYSELECT provides bootstrap resamples for the CI on the combined contrast.\nComments identify the key output rows an analyst needs to read.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Dist[Cost outcome in analytic cohort] --> ZeroFrac{What fraction\\nhave zero cost?}\n  ZeroFrac -->|\"< 5%: negligible\"| OneGamma[\"One-part Gamma GLM\\n(see gamma-distribution entry)\"]\n  ZeroFrac -->|\"5-10%: borderline\"| Judgment[\"Check whether zeros are structural.\\nUse two-part as primary, gamma as sensitivity.\"]\n  ZeroFrac -->|\"> 10%: substantial\"| TwoPart[\"Two-part model\"]\n  TwoPart --> P1[\"Part 1: Logistic regression\\nfor P(Y > 0 | X)\"]\n  TwoPart --> P2[\"Part 2: Gamma GLM with log link\\nfor E(Y | Y>0, X)\\n(fitted on positives only)\"]\n  P1 --> Combine[\"Combine: E(Y) = P(Y>0) x E(Y|Y>0)\"]\n  P2 --> Combine\n  Combine --> SE[\"Standard errors:\\nBootstrap (preferred) or delta method\"]\n  SE --> Interpret[\"Report:\\n1. Part 1 OR (who uses care)\\n2. Part 2 mean ratio (how much users spend)\\n3. Combined expected cost difference\"]\n  Judgment --> TwoPart\n  Interpret --> Tweedie[\"Sensitivity: Tweedie GLM\\n(one-model, constrains effects to be equal across parts)\"]",
        "caption": "Decision path from a semicontinuous cost outcome to the two-part model. The zero fraction guides model selection; the two-part output decomposes the total effect into a utilisation component (Part 1) and a conditional-cost component (Part 2).",
        "alt_text": "Flowchart from cost outcome through zero-fraction check into one-part gamma (few zeros) or two-part model (many zeros), showing Part 1 logistic and Part 2 gamma GLM combining into the expected cost, with bootstrap SE and a Tweedie sensitivity path.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "requires",
        "target_slug": "gamma-distribution",
        "notes": "The gamma GLM with log link is the standard positive-cost model in Part 2 of the two-part architecture; the gamma-distribution entry covers the variance function, Modified Park test, coefficient interpretation as mean cost ratios, and the retransformation problem that the GLM avoids — all prerequisites for specifying and diagnosing Part 2 of the two-part model."
      },
      {
        "relation_type": "used_with",
        "target_slug": "healthcare-costs-pppm-pppy-pmpm",
        "notes": "PPPM and PPPY rates are the standard cost outcomes that the two-part model is applied to; the healthcare-costs entry covers numerator definition, person-time denominators, attribution, and the data pipeline that produces the per-patient cost totals that feed into both parts of the two-part model."
      },
      {
        "relation_type": "see_also",
        "target_slug": "zero-inflated-count-models",
        "notes": "Zero-inflated Poisson and zero-inflated negative-binomial models apply the same two-mechanism logic (structural zeros plus a count distribution) to count outcomes such as ED visits, hospitalisations, or prescription fills; when the outcome is a continuous cost rather than a count, the two-part model described here is the appropriate architecture instead."
      },
      {
        "relation_type": "used_with",
        "target_slug": "bootstrap-resampling-methods",
        "notes": "The combined expected-cost estimand E[Y] = P(Y>0) x E[Y|Y>0] is a nonlinear function of two separate models, so its standard error cannot be read from either model summary; the nonparametric bootstrap — re-fitting both parts on each resample — is the standard approach for obtaining confidence intervals for the treatment contrast and is the preferred method over the asymptotic delta method."
      },
      {
        "relation_type": "requires",
        "target_slug": "logistic-regression-for-binary-outcomes",
        "notes": "Part 1 of the two-part model is a standard logistic regression for the binary any-cost indicator; understanding logistic odds ratios, predicted probabilities, covariate adjustment, and separation issues in logistic regression is a prerequisite for specifying and interpreting the first component of the two-part model."
      },
      {
        "relation_type": "see_also",
        "target_slug": "cost-outlier-handling-rwe",
        "notes": "Extreme cost outliers affect Part 2 of the two-part model (the gamma GLM on positive costs) in the same way they affect any cost regression; pre-specified winsorization or capping should be applied before fitting Part 2, and bootstrap CI inflation due to high-leverage outliers should be assessed as a diagnostic."
      }
    ],
    "aliases": [
      "hurdle model for costs",
      "two-part cost model",
      "hurdle cost model",
      "semicontinuous cost regression",
      "logit-gamma model",
      "zero-inflated cost model"
    ],
    "complexity": "advanced",
    "regulatory_relevance": [],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "two-sample-t-test",
    "name": "Two-Sample (Student's) t-Test",
    "short_definition": "A parametric hypothesis test that compares the means of two independent groups by computing a t statistic — the ratio of the observed mean difference to the standard error of that difference under the equal-variance (pooled) assumption — and evaluating it against a t-distribution with n1 + n2 - 2 degrees of freedom; the primary deliverable is the mean difference with its 95% confidence interval, not the p-value alone.",
    "long_description": "**What the test does and why it matters**\n\nThe two-sample (Student's) t-test asks one question: is the observed difference between two\ngroup means large enough relative to the variability in the data to be incompatible with a\nnull hypothesis of zero difference? It quantifies the answer as a t statistic — the ratio of\nthe mean difference to the standard error of that difference — and computes a p-value by\ncomparing that statistic to a t-distribution whose width is governed by the degrees of freedom.\nIn health economics and outcomes research, however, the p-value is rarely the deliverable that\nmatters. Payers, HTA bodies, and clinical decision-makers need the mean difference itself and\nits 95% confidence interval: a number on a clinically interpretable scale (e.g., \"the treated\ngroup had $1,840 lower total costs per patient, 95% CI $620 to $3,060\") rather than a binary\nsignificant/not-significant verdict.\n\n**The mechanics: pooled variance and why equal variance is the load-bearing assumption**\n\nStudent's t-test assumes that both groups share the same population variance. Under this\nassumption, the two within-group sum-of-squares estimates are pooled into a single, more\nprecise estimate of the common variance:\n\n    pooled_var = (SS_1 + SS_2) / (n_1 + n_2 - 2)\n\nwhere SS_1 and SS_2 are the within-group sums of squared deviations from each group mean.\nThe degrees of freedom follow directly: df = n_1 + n_2 - 2. 
The standard error of the mean\ndifference is then:\n\n    SE = sqrt(pooled_var * (1/n_1 + 1/n_2))\n\nand the t statistic is:\n\n    t = (mean_1 - mean_2) / SE\n\nThe CI on the mean difference is:\n\n    (mean_1 - mean_2) +/- t*(df, alpha/2) * SE\n\nwhere t*(df, alpha/2) is the two-sided critical value from the t-distribution at the chosen\nsignificance level. At alpha = 0.05 it is approximately 1.96 in large samples and larger in\nsmall samples (for df = 8, t* = 2.306).\nThis CI is the effect-size estimate that belongs in every HEOR table and report.\n\n**The equal-variance assumption problem: why Welch is the better default**\n\nThe equal-variance (homoscedasticity) assumption is rarely verified and frequently violated.\nIn RWE datasets, treated and control groups often differ not just in mean outcome but also in\nvariance — sicker patients cluster in one arm, older patients in another, and these systematic\ndifferences affect the spread of cost and utilization distributions just as much as the mean.\nWhen variances differ, especially with unequal group sizes, the pooled standard error misstates\nthe true sampling variability and the nominal type-I error rate is not maintained.\n\nWelch's t-test relaxes the equal-variance assumption by computing a separate variance estimate\nfor each group and adjusting the degrees of freedom downward using the Welch-Satterthwaite\napproximation. Simulation studies by Delacre, Lakens, and Leys (2017) demonstrate that Welch's\nt-test maintains the nominal type-I error rate across a broad range of variance ratios and\nsample sizes, whereas Student's (pooled) t-test can inflate type-I error when variances differ,\nparticularly with unequal group sizes. When variances happen to be equal, Welch's test loses\nonly a trivial amount of power relative to Student's. 
The actionable rule:\nWelch's t-test should be the default for two-sample continuous comparisons in applied work.\nStudent's t-test has a legitimate role when there is a theoretical reason to assume equal\nvariances (e.g., randomized experiments with moderate sample sizes and a continuous outcome\nknown to have stable variance) or when matching the historical standard for a pre-specified\nprimary analysis.\n\n**Why Student's t-test persists despite Welch being statistically superior**\n\nStudent's t-test was introduced in 1908 by William Sealy Gosset (publishing under the\npseudonym \"Student\"), working at the Guinness brewery on the problem of making inferences from\nsmall samples. For over a century it was the dominant introductory test in statistics curricula,\nand the equal-variance assumption was treated as a \"checking step\" (via Levene's or Bartlett's\ntest) rather than a known limitation. Many regulatory submissions and published clinical trials\npre-specified Student's t-test for historical reasons. Understanding Student's t-test is\ntherefore essential for interpreting legacy literature and for knowing exactly when the Welch\nmodification is warranted.\n\n**The t statistic, degrees of freedom, and the t-distribution**\n\nThe t statistic measures how many standard errors the observed mean difference is from zero.\nWhen the null hypothesis is true (zero population mean difference) and both normality and\nequal-variance assumptions hold, the t statistic follows a t-distribution with df = n_1 + n_2\n- 2. The t-distribution has heavier tails than the normal distribution — especially at small\ndf — which correctly reflects the additional uncertainty introduced by estimating the variance\nfrom the data rather than knowing it exactly. As df increases (larger samples), the\nt-distribution approaches the standard normal. 
For practical purposes, at df > 30 the\ndifference is small; at df > 100 it is negligible.\n\n**The Central Limit Theorem and robustness at large n**\n\nStudent's t-test does not require the raw data to be normally distributed — it requires the\nsampling distribution of the mean difference to be approximately normal. The Central Limit\nTheorem guarantees this for large enough samples regardless of the underlying data distribution.\nAt n > 30 per group, the test is generally robust to moderate skewness; at n > 100 per group,\nit is robust to most departures from normality short of extreme heavy tails or structural zeros.\nThis has a counterintuitive implication emphasized by Fagerland (2012): in large RWE datasets\n(n = 10,000 or more per group), the normality assumption is essentially irrelevant, yet the\nt-test p-value is almost always significant for trivially small differences. In this regime,\nthe CI on the mean difference — and whether the difference is clinically or economically\nmeaningful — is the only quantity worth reporting.\n\n**RWE realities: scale, confounding, and the limits of unadjusted two-group comparisons**\n\nAt claims-scale n, the two-sample t-test is mechanically robust (CLT protects the p-value)\nbut inferentially limited in three important ways.\n\nFirst, at very large n, p-values become uninformative — a difference of $12 in mean total\ncost between two groups of 50,000 patients will be highly significant but is clinically and\neconomically irrelevant. The CI is what matters, and in large samples the CI is often narrow\nenough to exclude any difference of practical importance.\n\nSecond, an unadjusted two-group t-test in observational data estimates a biased, confounded\ncontrast. Treatment groups in RWE differ on age, comorbidities, prior utilization, and dozens\nof other covariates that independently affect the outcome. 
An unadjusted t-test on costs or\nhealthcare utilization between a treated and an untreated group in claims data is a descriptive\ncomparison — it describes the observed mean difference in this specific dataset but provides no\ncausal inference about the treatment effect. Causal questions must route to propensity-score\nweighting, matching, g-methods, or regression adjustment before any t-test is applied. After\npropensity-score weighting or matching, a weighted t-test or a regression model in the matched\nsample is appropriate for the primary estimate.\n\nThird, skewed outcomes — particularly total costs and healthcare utilization counts — strain\na mean-focused test when the distribution is so heavily right-skewed that the\nmean is a poor summary of typical patient experience. A generalized linear model with a log\nlink and gamma or Tweedie variance function matches the distributional shape, models the mean\non the original dollar scale directly (predictions are the exponentiated linear predictor, so\nno retransformation correction is needed), and accommodates covariate adjustment in\na single model. Reserve the t-test for approximately symmetric continuous outcomes or use it\nas a sensitivity check alongside a gamma GLM for cost outcomes.\n\n**Effect sizes and clinical significance: what HEOR needs beyond the p-value**\n\nCohen's d is the standardized effect size for the t-test: d = (mean_1 - mean_2) / pooled_SD,\nwhere pooled_SD = sqrt(pooled_var).\nBy convention, d = 0.2 is small, 0.5 is medium, and 0.8 is large, though these benchmarks\nwere calibrated in psychology and may not translate to healthcare. 
In HEOR, the unstandardized\nmean difference in original units (dollars of total cost, days of hospital stay, points on a\nquality-of-life scale) with its CI is more interpretable than Cohen's d for decision-making.\nReport both: the raw mean difference with CI for the clinical audience and Cohen's d for\ncomparability with the broader literature.\n\n**Pros, cons, and trade-offs**\n\nPros of the two-sample (Student's) t-test:\n- Produces a directly interpretable effect estimate: the mean difference and its CI on the\n  original scale of the outcome.\n- Computationally trivial; universally implemented in Python (scipy), R, and SAS.\n- Well-understood power properties: closed-form sample size formulas exist for study planning.\n- Robust to non-normality at large n via the CLT.\n- Historical standard that facilitates comparison with legacy literature and satisfies\n  pre-specified SAP language that mandates Student's t-test.\n- The pooled-variance estimate is slightly more efficient than Welch when equal variances\n  truly hold.\n\nCons of the two-sample (Student's) t-test:\n- The equal-variance (pooled) assumption is frequently violated in RWE and can inflate\n  type-I error; Welch's t-test should be preferred unless variances are known to be equal.\n- At large n, p-values are significant for trivially small differences; the CI must be\n  reported to assess clinical or economic relevance.\n- Unadjusted application to confounded observational contrasts produces biased estimates;\n  adjustment methods are required before t-test results are interpreted causally.\n- Heavily skewed outcomes (costs, utilization) violate the usefulness of a mean-focused\n  test; gamma GLM or bootstrap mean-difference estimation is preferred for primary inference.\n- Cannot handle clustered or paired data; paired t-test or mixed models are required when\n  observations are not independent.\n\n**When to use**\n\nThe two-sample Student's t-test is appropriate when:\n\n- Two independent groups 
are being compared on a continuous outcome and the equal-variance\n  assumption is either theoretically justified or has been verified diagnostically.\n- The pre-specified SAP or protocol explicitly calls for Student's t-test (e.g., a\n  regulatory submission where Welch was not pre-specified).\n- The outcome is approximately symmetric or the sample is large enough for CLT protection\n  (n > 30 per group as a rough rule, though this depends on the degree of skewness).\n- The analysis is a two-arm randomized comparison with moderate sample sizes and a continuous\n  endpoint where equal within-arm variance is plausible (within-patient pre-post contrasts\n  are paired data and need a paired test).\n- Baseline characteristic comparisons in a randomized trial (Table 1): t-tests on continuous\n  variables and chi-square on binary variables are conventional, though effect sizes are often\n  more informative.\n- Sensitivity analysis: running Student's t-test alongside Welch's and a nonparametric test\n  to confirm that conclusions are robust to the equal-variance assumption.\n\n**When NOT to use**\n\nThe two-sample Student's t-test is not appropriate — and may be actively misleading — in the\nfollowing situations:\n\n- Paired or clustered data: when patients are matched, or when the data have a nested\n  structure (e.g., multiple measurements per patient, patients nested in clinics), the\n  independence assumption is violated. Use a paired t-test for 1:1 matched pairs or a mixed\n  model for clustered data.\n- Heteroscedastic groups: when the two groups have substantially different variances\n  (a rule of thumb: variance ratio > 2 or < 0.5), use Welch's t-test, which does not assume\n  equal variances and is the better default in essentially all applied settings.\n- Heavily skewed outcomes with small or moderate n: when the outcome is right-skewed (costs,\n  hospital LOS, utilization counts) and the sample is small (n < 50 per group), the sampling\n  distribution of the mean may still be far from normal. 
Use a nonparametric test, a\n  transformation, or a GLM as the primary analysis; the t-test may serve as a sensitivity check.\n- As primary causal evidence in unmatched observational data: an unadjusted t-test on\n  confounded observational contrasts is a descriptive statistic, not a causal estimate.\n  Confounding must be addressed via matching, weighting, or regression adjustment before\n  a t-test is interpreted as estimating a treatment effect.\n- Ordinal or binary outcomes: for ordinal scores, use Mann-Whitney or an ordinal regression\n  model; for binary outcomes, use chi-square, Fisher's exact, or logistic regression.\n- When the policy-relevant quantity is cost impact in a decision model: budget-impact and\n  cost-effectiveness models require mean cost estimates that are internally consistent with\n  the model's variance structure; gamma GLM with log link is the standard approach for\n  mean cost inference in HEOR.\n- At large n without reporting the CI: at n > 5,000 per group, virtually any mean difference\n  will be statistically significant; reporting only the p-value from a t-test is misleading\n  because it conveys no information about whether the difference is clinically or economically\n  meaningful.\n\n**Interpreting the output**\n\nIn the worked example, five patients in a care management program (Group A) averaged 6.0\nemergency department visits in the 12 months post-diagnosis; five patients in standard care\n(Group B) averaged 12.0 visits. The pooled variance is 10.0, the standard error of the mean\ndifference is 2.0, the t statistic is −3.0 (absolute value 3.0) on df = 8, the two-sided\np-value is approximately 0.017, and the 95% confidence interval on the mean difference is\n(−10.61, −1.39) ED visits.\n\n*(1) Formal interpretation.* The observed mean difference of −6.0 visits is 3.0 standard\nerrors below zero. 
Under the null hypothesis of no difference and the equal-variance\nassumption, a t statistic with absolute value ≥ 3.0 on df = 8 arises by chance in\napproximately 1.7% of samples. The 95% CI (−10.61, −1.39) is constructed so that if the\nstudy were repeated many times under identical conditions, approximately 95% of such\nintervals would contain the true mean difference — it does not mean there is a 95%\nprobability that the true value lies in this specific interval. Excluding zero, the interval\nis consistent with a true mean reduction of between 1.4 and 10.6 visits.\n\n*(2) Practical interpretation.* The data are consistent with Group A averaging roughly 6\nfewer ED visits per patient than Group B, with plausible values ranging from about 1 to 11\nfewer visits. Because this is an unadjusted observational comparison in a small study, the\ndifference may reflect both program effects and differences in who enrolled. A statistically\nsignificant p-value indicates the gap is unlikely to be pure sampling noise, but it does not\nestablish causation — a larger, adjusted study is needed to separate treatment effects from\nconfounding and to narrow the CI enough to support a budget-impact estimate.",
    "primary_category": "Inferential_Statistics",
    "tags": [
      "statistics",
      "primitive",
      "hypothesis-testing",
      "continuous-outcomes",
      "t-test",
      "parametric",
      "mean-difference",
      "pooled-variance",
      "foundations"
    ],
    "applies_to_study_types": [
      "cohort_retrospective",
      "cohort_prospective",
      "rct",
      "cross_sectional",
      "active_comparator_new_user",
      "descriptive_analysis",
      "claims_analysis",
      "ehr_study",
      "registry_study"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "primary",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.2307/2331554",
        "url": "https://doi.org/10.2307/2331554",
        "citation_text": "Student. The probable error of a mean. Biometrika. 1908;6(1):1-25.",
        "year": 1908,
        "authors_short": "Student",
        "notes": "The original paper introducing the t statistic and t-distribution, published under the pseudonym \"Student\" by William Sealy Gosset. Derived the exact sampling distribution of the mean for small samples from a normal population, foundational to all subsequent parametric inference on means."
      },
      {
        "role": "explain",
        "doi": "10.5334/irsp.82",
        "url": "https://doi.org/10.5334/irsp.82",
        "citation_text": "Delacre M, Lakens D, Leys C. Why psychologists should by default use Welch's t-test instead of Student's t-test. International Review of Social Psychology. 2017;30(1):92-101.",
        "year": 2017,
        "authors_short": "Delacre et al.",
        "notes": "Simulation-based demonstration that Welch's t-test maintains nominal type-I error across a wide range of variance ratios and sample sizes while Student's pooled test inflates error when variances differ. Supports Welch as the unconditional default over the \"test for equal variances first, then choose\" strategy."
      },
      {
        "role": "demonstrate",
        "doi": "10.1186/1471-2288-12-78",
        "url": "https://doi.org/10.1186/1471-2288-12-78",
        "citation_text": "Fagerland MW. t-tests, non-parametric tests, and large studies — a paradox of statistical practice? BMC Medical Research Methodology. 2012;12:78.",
        "year": 2012,
        "authors_short": "Fagerland",
        "notes": "Demonstrates that in large observational datasets the t-test's normality assumption is protected by the CLT precisely when analysts most often worry about it, while at small n the assumption is most critical but often untested. Essential framing for RWE analysts."
      },
      {
        "role": "use",
        "doi": "10.1136/bmj.310.6975.298",
        "url": "https://doi.org/10.1136/bmj.310.6975.298",
        "citation_text": "Altman DG, Bland JM. Statistics notes: The normal distribution. BMJ. 1995;310(6975):298.",
        "year": 1995,
        "authors_short": "Altman & Bland",
        "notes": "BMJ Statistics Notes series; covers the normal distribution assumption and practical guidance for clinical researchers on when parametric tests are appropriate, including the role of sample size in protecting against violations of normality."
      }
    ],
    "plain_language_summary": "The two-sample t-test checks whether the average value of some measurement (like total healthcare cost or days in hospital) is genuinely different between two groups of patients, or whether the observed gap could plausibly be explained by random chance. It computes how many \"standard errors\" the observed gap is from zero, then looks that up in a table to get a p-value. The most useful output is not the p-value but the confidence interval on the mean difference — a range that tells you both the direction and the plausible size of the gap in real-world units like dollars or days. It assumes both groups have similar spread in their data; when that assumption looks shaky, the closely related Welch t-test is a safer choice.",
    "key_terms": [
      {
        "term": "pooled variance",
        "definition": "A single estimate of spread computed by combining the within-group variability from both groups, weighted by sample size; Student's t-test uses this instead of separate estimates, which is valid only when both groups have similar spread."
      },
      {
        "term": "standard error",
        "definition": "How much you would expect the mean difference to bounce around if you repeated the study many times with new random samples; a smaller standard error means a more precise estimate of the true difference."
      },
      {
        "term": "degrees of freedom",
        "definition": "The number of independent pieces of information in the data used to estimate the variance; for a two-sample t-test with equal group sizes of n, df = 2n - 2, and it governs how wide the t-distribution is (wider at low df, approaching normal at high df)."
      },
      {
        "term": "mean difference",
        "definition": "The gap between the two group averages on the original scale of the outcome (e.g., $840 more in total costs per patient in the treated group); this is the primary effect estimate for clinical and economic interpretation."
      },
      {
        "term": "confidence interval",
        "definition": "A range of plausible values for the true mean difference; a 95% CI means that if the study were repeated many times, about 95% of such intervals would contain the true value — wider intervals signal more uncertainty, narrower intervals signal more precision."
      }
    ],
    "worked_example": {
      "scenario": "A health outcomes analyst is comparing total emergency department (ED) visits in the 12 months after diagnosis for two groups of patients: five who received a new care management program (Group A) and five who received standard care (Group B). The analyst wants to know whether the program is associated with fewer ED visits on average, and computes the two-sample Student's t-test by hand to understand every step — from the pooled variance to the t statistic and confidence interval.",
      "dataset": {
        "caption": "Total ED visits in the 12 months post-diagnosis (n=5 per group). Values are small integers chosen so that means, pooled variance, standard error, and t statistic all come out to clean exact values.",
        "columns": [
          "patient_id",
          "group",
          "ed_visits"
        ],
        "rows": [
          [
            "A1",
            "A",
            2
          ],
          [
            "A2",
            "A",
            4
          ],
          [
            "A3",
            "A",
            6
          ],
          [
            "A4",
            "A",
            8
          ],
          [
            "A5",
            "A",
            10
          ],
          [
            "B1",
            "B",
            8
          ],
          [
            "B2",
            "B",
            10
          ],
          [
            "B3",
            "B",
            12
          ],
          [
            "B4",
            "B",
            14
          ],
          [
            "B5",
            "B",
            16
          ]
        ]
      },
      "steps": [
        "Compute group means. Group A: mean_A = (2+4+6+8+10)/5 = 30/5 = 6.0 visits. Group B: mean_B = (8+10+12+14+16)/5 = 60/5 = 12.0 visits. Mean difference = mean_A - mean_B = 6.0 - 12.0 = -6.0 visits.",
        "Compute within-group sums of squares. For Group A, deviations from mean_A = 6: (2-6)^2 + (4-6)^2 + (6-6)^2 + (8-6)^2 + (10-6)^2 = 16 + 4 + 0 + 4 + 16 = 40. For Group B, deviations from mean_B = 12: (8-12)^2 + (10-12)^2 + (12-12)^2 + (14-12)^2 + (16-12)^2 = 16 + 4 + 0 + 4 + 16 = 40. SS_A = 40, SS_B = 40.",
        "Compute pooled variance. df = n_A + n_B - 2 = 5 + 5 - 2 = 8. pooled_var = (SS_A + SS_B) / df = (40 + 40) / 8 = 80 / 8 = 10.0.",
        "Compute standard error of the mean difference. SE = sqrt(pooled_var * (1/n_A + 1/n_B)) = sqrt(10 * (1/5 + 1/5)) = sqrt(10 * 2/5) = sqrt(10 * 0.4) = sqrt(4) = 2.0.",
        "Compute the t statistic. The absolute mean difference is 6.0, SE is 2.0, so the magnitude of t is 6.0 / 2.0 = 3.0, in the negative direction because mean_A < mean_B, giving t = -3.0. With df = 8, the two-sided p-value for absolute t = 3.0 is approximately 0.017.",
        "Compute the 95% confidence interval. The critical value for df = 8 at alpha = 0.025 is t_crit = 2.306. The margin is t_crit * SE = 2.306 * 2.0 = 4.612. The CI is mean_diff +/- margin = -6.0 +/- 4.612, giving a lower bound of -10.612 and an upper bound of -1.388. The 95% CI on the mean difference is (-10.61, -1.39) ED visits.",
        "Interpret. The care management program is associated with 6.0 fewer ED visits on average (95% CI 1.4 to 10.6 fewer; p = 0.017). Because this is an unadjusted observational comparison, the difference may reflect confounding rather than a causal effect of the program. Before drawing conclusions, baseline covariate differences between the groups should be assessed and adjustment methods applied if imbalance is present."
      ],
      "result": "mean_A = 30/5 = 6.0 visits, mean_B = 60/5 = 12.0 visits, mean difference = -6.0 visits. SS_A = 16 + 4 + 0 + 4 + 16 = 40, SS_B = 16 + 4 + 0 + 4 + 16 = 40. pooled_var = (40 + 40) / 8 = 80/8 = 10.0. SE = sqrt(4) = 2.0. Absolute t = 6.0/2.0 = 3.0 (negative direction), df = 8, p approximately 0.017. Margin = 2.306 * 2.0 = 4.612. 95% CI on mean difference = (-10.61, -1.39) ED visits."
    },
    "prerequisites": [
      "inferential-statistics-foundations",
      "parametric-vs-nonparametric-tests",
      "descriptive-statistics"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Student's t-test (equal variances assumed, pooled SE)",
        "description": "The classical form described here: pools within-group sums of squares into a single variance estimate under the equal-variance assumption. Appropriate when the analyst has theoretical grounds for equal variances or the analysis is pre-specified as Student's t-test in a regulatory SAP. Use PROC TTEST (SAS) \"Pooled\" row, t.test(var.equal=TRUE) (R), or scipy.stats.ttest_ind(equal_var=True) (Python).",
        "edge_cases": [
          "When n_1 and n_2 are unequal and variances differ, the pooled SE is biased toward the group with larger n; type-I error can be substantially inflated. Use Welch instead.",
          "A variance ratio (larger/smaller) of 2 or more is a practical signal to switch to Welch, even if Levene's test does not reach significance (Levene is underpowered at small n)."
        ],
        "data_source_notes": "Claims and EHR: compute per-patient outcome totals; apply Student's t-test only when the study's pre-specified SAP explicitly calls for it or group variances are known to be approximately equal. Otherwise default to Welch."
      },
      {
        "name": "One-sided vs two-sided t-test",
        "description": "A two-sided test asks whether the mean difference is non-zero in either direction; a one-sided test asks whether one group's mean is specifically larger than the other's. Regulatory submissions and confirmatory HEOR studies nearly always use two-sided tests at alpha = 0.05. One-sided tests halve the p-value and lower the rejection threshold; they require a pre-specified directional hypothesis and are vulnerable to post-hoc direction selection, which inflates the actual type-I error.",
        "edge_cases": [
          "A one-sided test at alpha = 0.05 is equivalent in power to a two-sided test at alpha = 0.10; unless the directional hypothesis is pre-specified and scientifically justified, use two-sided.",
          "FDA and EMA guidance generally requires two-sided tests for primary efficacy analyses; one-sided tests require explicit scientific justification."
        ],
        "data_source_notes": "In practice: always use two-sided unless the SAP specifies one-sided and the scientific rationale is documented. Report which was used and why."
      },
      {
        "name": "Welch t-test (preferred default; unequal variances)",
        "description": "The Welch modification computes a separate variance estimate per group and applies the Welch-Satterthwaite approximation to the degrees of freedom (always <= n_1 + n_2 - 2). This is the recommended default for two-sample continuous comparisons in RWE and applied statistics. See the welch-t-test entry for full mechanics and implementation details.",
        "edge_cases": [
          "Welch df is a non-integer; statistical software computes it automatically. The difference in conclusions vs Student's test is largest when variance ratios are extreme (> 4) and sample sizes differ markedly.",
          "At large n with moderate variance differences, Welch and Student's tests converge; reporting Welch as default is a robustness convention, not just a small-sample correction."
        ],
        "data_source_notes": "Python: scipy.stats.ttest_ind(equal_var=False). R: t.test(var.equal=FALSE) (the R default). SAS: PROC TTEST Satterthwaite row."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "welch-t-test",
        "pros_of_this": "Slightly more powerful (narrower CI, smaller p-value) when the equal-variance assumption genuinely holds; historical standard in many regulatory SAPs.",
        "cons_of_this": "Inflates type-I error when group variances differ; the equal-variance assumption is rarely verified and often violated in RWE data, making Welch the safer default in almost all applied settings.",
        "when_to_prefer": "Use Student's t-test only when the SAP pre-specifies it or when there is strong scientific justification for equal variances; otherwise prefer Welch unconditionally."
      },
      {
        "compared_to": "mann-whitney-u-test",
        "pros_of_this": "Produces a mean difference estimate directly interpretable on the clinical scale (dollars, days, score points); readily extensible to regression for covariate adjustment; exact CI formula; higher power than Mann-Whitney when normality holds.",
        "cons_of_this": "Sensitive to outliers (a single extreme value shifts the mean and inflates the SE); Mann-Whitney's rank-based statistic is more robust to distributional departures, especially at small n with heavy tails.",
        "when_to_prefer": "Use t-test when the mean difference is the target estimand (budget impact, cost-effectiveness models) and the sample is large enough for CLT protection; use Mann-Whitney as a sensitivity check or primary analysis when the distribution is severely non-normal and n is small."
      },
      {
        "compared_to": "paired-t-test",
        "pros_of_this": "Appropriate when the two groups are truly independent (no patient appears in both groups, no 1:1 matching); simpler to apply and explain.",
        "cons_of_this": "Ignores correlation structure when patients are paired or matched; applying the independent t-test to paired data underestimates precision and inflates the standard error relative to the paired t-test, which computes the difference within each pair and tests whether the mean within-pair difference is zero.",
        "when_to_prefer": "Use the two-sample t-test for genuinely independent groups; switch to the paired t-test whenever observations are 1:1 matched (matched cohort, pre-post on the same patients)."
      },
      {
        "compared_to": "baseline-characteristics-and-covariate-balance-rwe",
        "pros_of_this": "t-tests on continuous baseline covariates are standard for RCT Table 1 reporting and provide a formal test of whether a difference is compatible with chance.",
        "cons_of_this": "In observational studies, standardized mean differences (SMDs) are preferred over p-values for assessing covariate balance; a t-test on Table 1 covariates in an observational study conflates statistical significance (a function of sample size) with confounding imbalance (a function of the actual covariate difference). Large observational datasets will find \"significant\" t-test results on nearly every covariate even after propensity-score weighting.",
        "when_to_prefer": "Use t-tests in Table 1 for RCTs; use SMDs as the primary balance metric for observational studies; t-tests remain acceptable as a secondary check in observational Table 1."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Per-patient utilization counts and cost totals are the natural unit of analysis. Apply Student's t-test when the SAP requires it and group sample sizes are adequate (n > 30 per group as a rough minimum). For costs, note that the distribution is almost always right-skewed; report the t-test result alongside a gamma GLM mean-ratio as the primary analysis. At large n (> 5,000 per group), report the CI prominently — p-values will be near zero for trivially small differences. Always describe the comparison as unadjusted if confounders have not been controlled.",
      "ehr": "Lab values (HbA1c, LDL, creatinine) and vital signs are approximately continuous and often approximately normally distributed — good candidates for Student's t-test. Patient-reported outcome scores with bounded ranges may be ordinal; check whether the instrument developers intended interval-scale treatment before applying a t-test. Missing lab values from informative visit patterns may introduce selection bias; restrict to patients with complete measurements in the analysis window or model missingness.",
      "registry": "Adjudicated continuous endpoints (e.g., six-minute walk distance, FEV1, biomarker levels) in disease registries are well-suited to t-test analysis. Registry populations are often more homogeneous within disease strata, making the normality assumption more credible. Report whether the registry comparison is unadjusted or covariate-adjusted; unadjusted registry comparisons across treatment groups are descriptive only.",
      "primary": "Survey and PRO instruments often produce bounded, non-normal scores; assess distributional shape before applying a t-test. For pilot studies (n < 30 per group), consider whether the normality assumption is defensible and whether a nonparametric test is more appropriate. For well-validated continuous instruments (SF-36, EQ-5D VAS), a t-test on the summary score is standard provided normality is assessed.",
      "linked": "Linked claims-EHR-registry cohorts typically have large n, so CLT protects the t-test's sampling properties. Report Student's or Welch t-test mean differences alongside gamma GLM mean ratios for cost endpoints; use chi-square or logistic regression for binary endpoints. Document clearly whether the t-test is unadjusted or applied after propensity-score weighting."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import math\nfrom scipy import stats\n\n# ── Motivating dataset: total ED visits post-diagnosis (n=5 per group) ──\ngroup_a = [2, 4, 6, 8, 10]   # care management program; mean = 6.0\ngroup_b = [8, 10, 12, 14, 16] # standard care; mean = 12.0\n\nn_a, n_b = len(group_a), len(group_b)\nmean_a = sum(group_a) / n_a   # 6.0\nmean_b = sum(group_b) / n_b   # 12.0\nmean_diff = mean_a - mean_b   # -6.0\n\n# ── 1. Manual pooled-variance computation ──\nss_a = sum((x - mean_a) ** 2 for x in group_a)  # 40.0\nss_b = sum((x - mean_b) ** 2 for x in group_b)  # 40.0\ndf = n_a + n_b - 2                               # 8\npooled_var = (ss_a + ss_b) / df                  # 10.0\npooled_sd = math.sqrt(pooled_var)                # ~3.162\nse = math.sqrt(pooled_var * (1 / n_a + 1 / n_b)) # 2.0\nt_manual = mean_diff / se                        # -3.0\nprint(f\"Manual:  pooled_var={pooled_var:.1f}, SE={se:.3f}, t={t_manual:.3f}, df={df}\")\n\n# ── 2. Student's t-test (equal_var=True = pooled) ──\nt_student, p_student = stats.ttest_ind(group_a, group_b, equal_var=True)\nprint(f\"\\nStudent's t-test: t={t_student:.4f}, p={p_student:.4f}\")\n\n# ── 3. 95% CI on the mean difference ──\nt_crit = stats.t.ppf(0.975, df=df)  # ~2.306 at df=8\nci_lo = mean_diff - t_crit * se\nci_hi = mean_diff + t_crit * se\nprint(f\"Mean difference: {mean_diff:.1f} (95% CI {ci_lo:.2f} to {ci_hi:.2f})\")\n\n# ── 4. Cohen's d ──\ncohens_d = mean_diff / pooled_sd\nprint(f\"Cohen's d: {cohens_d:.3f}  (|0.2|=small, |0.5|=medium, |0.8|=large)\")\n\n# ── 5. Welch's t-test (equal_var=False) for comparison ──\nt_welch, p_welch = stats.ttest_ind(group_a, group_b, equal_var=False)\nprint(f\"\\nWelch's t-test:  t={t_welch:.4f}, p={p_welch:.4f}  (preferred default)\")\nprint(\"Note: when variances are equal (as here), Welch and Student give the same answer.\")\nprint(\"When variances differ, Welch controls type-I error; Student's may not.\")",
        "description": "Two-sample t-test using scipy.stats.ttest_ind with equal_var=True (Student's pooled t-test)\nand equal_var=False (Welch's t-test). Demonstrates manual computation of the pooled variance,\nSE, and t statistic. Adds a 95% CI via statsmodels and prints Cohen's d. Uses the motivating\ndataset (Group A: care management, Group B: standard care, n=5 each) from the beginner layer.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "# ── Motivating dataset ──\ngroup_a <- c(2, 4, 6, 8, 10)   # care management; mean = 6\ngroup_b <- c(8, 10, 12, 14, 16) # standard care; mean = 12\n\n# ── 1. Student's t-test (equal variances assumed) ──\nres_student <- t.test(group_a, group_b, var.equal = TRUE)\ncat(\"=== Student's t-test (pooled) ===\\n\")\nprint(res_student)\ncat(sprintf(\"Mean difference: %.1f\\n\", diff(res_student$estimate)))\ncat(sprintf(\"95%% CI: [%.2f, %.2f]\\n\", res_student$conf.int[1], res_student$conf.int[2]))\n\n# ── 2. Cohen's d ──\npooled_sd <- sqrt(((5-1)*var(group_a) + (5-1)*var(group_b)) / (5+5-2))\ncohens_d  <- (mean(group_a) - mean(group_b)) / pooled_sd\ncat(sprintf(\"Cohen's d: %.3f\\n\", cohens_d))\n\n# ── 3. Welch's t-test (preferred default; var.equal = FALSE is R's default) ──\nres_welch <- t.test(group_a, group_b, var.equal = FALSE)\ncat(\"\\n=== Welch's t-test (unequal variances; R default) ===\\n\")\nprint(res_welch)\ncat(\"Practical rule: always use var.equal = FALSE (Welch) unless the SAP pre-specifies\\n\")\ncat(\"Student's t-test or equal variances are theoretically justified.\\n\")",
        "description": "Student's t-test via t.test(var.equal=TRUE) and Welch's via the R default (var.equal=FALSE).\nDemonstrates extraction of the mean difference, CI, and effect size. Uses the same\nmotivating dataset as the Python implementation. Shows how to read the Pooled vs\nSatterthwaite rows when both are requested together.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* ── Create motivating dataset ── */\ndata work.ed_visits;\n  input patient_id $ group $ ed_visits;\n  datalines;\nA1 A  2\nA2 A  4\nA3 A  6\nA4 A  8\nA5 A 10\nB1 B  8\nB2 B 10\nB3 B 12\nB4 B 14\nB5 B 16\n;\nrun;\n\n/* ── PROC TTEST: outputs both Student (Pooled) and Welch (Satterthwaite) rows ── */\nproc ttest data=work.ed_visits alpha=0.05;\n  class group;      /* compare groups A vs B */\n  var ed_visits;\n  /* OUTPUT INTERPRETATION:\n     - \"Pooled\" row     = Student's t-test (assumes equal variances)   -> t=-3.0, df=8\n     - \"Satterthwaite\" row = Welch's t-test (unequal variances allowed) -> same here\n     The two rows converge when variances are equal (as in this example).\n     When variances differ, prefer Satterthwaite (Welch). */\nrun;\n\n/* ── To compute Cohen's d manually in a DATA step ── */\nproc means data=work.ed_visits noprint;\n  class group;\n  var ed_visits;\n  output out=work.sumstats mean=mn var=vr n=nn;\nrun;\n\ndata work.cohens_d;\n  set work.sumstats(where=(_type_=1));\n  pooled_var + (nn-1)*vr;\n  n_total + nn;\n  mean_diff + mn;\n  if _n_=2 then do;\n    pooled_sd = sqrt(pooled_var / (n_total - 2));\n    cohens_d  = mean_diff / pooled_sd;\n    put \"Pooled SD = \" pooled_sd 6.4 \" Cohen's d = \" cohens_d 6.4;\n  end;\nrun;",
        "description": "PROC TTEST for both Student's (Pooled row) and Welch's (Satterthwaite row) t-test. SAS\nPROC TTEST automatically prints both rows; the analyst selects the appropriate one based\non the equal-variance assumption. Uses the same motivating dataset as Python and R.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Start[\"Two independent groups<br/>Continuous outcome<br/>Goal: compare means\"] --> EqVar{Are group variances<br/>approximately equal?}\n  EqVar -->|\"Yes — or SAP pre-specifies<br/>Student's t-test\"| Student[\"Student's t-test<br/>(pooled variance)<br/>df = n1+n2-2\"]\n  EqVar -->|\"No — or unknown<br/>(the common case)\"| Welch[\"Welch's t-test<br/>(separate variances)<br/>df = Welch-Satterthwaite\"]\n  Student --> Output[\"Report: mean difference<br/>95% CI on mean difference<br/>Cohen's d<br/>(p-value secondary)\"]\n  Welch --> Output\n  Output --> Causal{Is this an<br/>adjusted causal<br/>comparison?}\n  Causal -->|\"No — unadjusted<br/>observational data\"| Warn[\"CAUTION: unadjusted t-test<br/>on confounded data is<br/>descriptive only — confounders<br/>must be addressed first\"]\n  Causal -->|\"Yes — after matching,<br/>weighting, or in an RCT\"| Conclude[\"Interpret CI as<br/>causal estimate of<br/>mean treatment effect\"]",
        "caption": "Decision flow for two-sample t-test: variance check, Student vs Welch choice, and the critical reminder that unadjusted t-tests on observational data are descriptive, not causal.",
        "alt_text": "Flowchart beginning at two-independent-groups comparison, branching on equal-variance assumption to Student's vs Welch's t-test, converging on reporting the mean difference and CI, then branching on whether the comparison is causally adjusted or unadjusted observational.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "requires",
        "target_slug": "parametric-vs-nonparametric-tests",
        "notes": "The decision-tree parent: parametric-vs-nonparametric-tests covers when to use a t-test vs a rank-based alternative, the equal-variance vs Welch decision, and the CLT framing; this entry provides the full mechanics, implementation, and HEOR context for the specific two-sample case."
      },
      {
        "relation_type": "requires",
        "target_slug": "inferential-statistics-foundations",
        "notes": "Foundational concepts including the null hypothesis, p-value, type-I and type-II error, confidence intervals, and the sampling distribution are prerequisites for interpreting the t statistic and its CI correctly."
      },
      {
        "relation_type": "see_also",
        "target_slug": "welch-t-test",
        "notes": "Welch's t-test is the recommended default for two-sample continuous comparisons in virtually all applied settings; it relaxes the equal-variance assumption and maintains nominal type-I error across a wider range of scenarios than Student's pooled t-test."
      },
      {
        "relation_type": "see_also",
        "target_slug": "mann-whitney-u-test",
        "notes": "The rank-based counterpart for heavily skewed or non-normal data; tests stochastic dominance (not mean equality) and is more robust to outliers at small n — use as the primary analysis when the outcome distribution is severely non-normal or as a sensitivity check alongside the t-test."
      },
      {
        "relation_type": "see_also",
        "target_slug": "paired-t-test",
        "notes": "When observations are paired (pre-post on the same patients, 1:1 matched cohort), the paired t-test — which tests whether the mean within-pair difference is zero — is more powerful and correct; the two-sample t-test incorrectly treats paired observations as independent."
      },
      {
        "relation_type": "see_also",
        "target_slug": "baseline-characteristics-and-covariate-balance-rwe",
        "notes": "t-tests on continuous baseline covariates appear in Table 1; in observational studies, standardized mean differences (SMDs) are the preferred balance metric because they are not inflated by large sample sizes the way t-test p-values are."
      }
    ],
    "aliases": [
      "Student's t-test",
      "independent samples t-test",
      "unpaired t-test",
      "pooled t-test",
      "two-independent-groups t-test"
    ],
    "complexity": "foundational",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "ub-04-institutional-claim-fields",
    "name": "UB-04 / 837I Institutional Claim Fields",
    "short_definition": "The UB-04 (paper form CMS-1450) and its electronic equivalent, the 837I X12 transaction, are the billing instruments that hospitals, skilled nursing facilities, home health agencies, hospices, dialysis centers, and hospital-based outpatient departments use to request payment for facility services; their structured fields — Type of Bill, discharge status, POA indicators, revenue code lines, and occurrence/value codes — are the primary mechanism by which research databases derive care setting, episode boundaries, in-hospital events, and facility-level costs from institutional claims.",
    "long_description": "**What the UB-04 and 837I are, and who files them**\n\nThe **UB-04** (also called the CMS-1450) is the paper claim form maintained by the National Uniform\nBilling Committee (NUBC), whose normative code sets are copyrighted by the American Hospital Association.\nIts electronic counterpart is the **837I** (Institutional) X12 transaction set, the digital file that\nclearinghouses and payers actually process. Every field described here is documented in the publicly\navailable CMS Medicare Claims Processing Manual (Pub 100-04, Chapter 25), which is the source for the\nfield semantics used throughout this entry; the complete NUBC data specifications are a licensed product.\n\nInstitutional claims are filed by hospitals (inpatient and outpatient departments including the ED),\nskilled nursing facilities (SNFs), home health agencies (HHAs), hospice programs, hospital-based\ndialysis units, and outpatient rehabilitation facilities. The counterpart billing instrument for\nphysician and professional services is the CMS-1500 / 837P; the two forms are complementary, not\nredundant — most hospital admissions generate both a facility UB-04 claim (the institution's costs)\nand one or more professional 837P claims (physician fees). Research databases derived from adjudicated\nMedicare or commercial claims typically separate these into an inpatient/outpatient facility file\n(MedPAR, outpatient SAF, or equivalent) and a professional/carrier file.\n\n**Field anatomy by Form Locator (FL)**\n\nThe UB-04 organizes its data elements by **Form Locator number** (FL 1 through FL 81). 
Understanding\nthe research significance of each is the key to correct institutional claims analysis.\n\n*FL4 — Type of Bill (TOB)*: The single most important routing and classification field on the form.\nTOB is a **four-digit code with a leading zero**, where each position encodes a distinct dimension:\n- **Digit 1** (always 0, the leading placeholder)\n- **Digit 2** (facility type): 1 = hospital, 2 = SNF, 3 = home health, 4 = religious/non-medical,\n  5 = intermediate care, 6 = intermediate care-mentally retarded, 7 = clinic, 8 = special facility\n- **Digit 3** (bill classification): 1 = inpatient Part A, 2 = inpatient Part B, 3 = outpatient,\n  4 = other Part B, 5 = intermediate care, 6 = intermediate care-mentally retarded, 7 = subacute\n  inpatient, 8 = swing bed\n- **Digit 4** (frequency/sequence): 1 = admit-through-discharge (the normal, complete bill),\n  2 = interim-first claim, 3 = interim-continuing, 4 = interim-last, 7 = replacement of a prior\n  claim, 8 = void/cancel of a prior claim\n\n**The TOB frequency digit is the most important deduplication signal in institutional claims.**\nA standard inpatient admission produces TOB **0111** (hospital inpatient, admit-through-discharge).\nIf the stay spans a billing period boundary or interim billing is triggered, the hospital may instead\nsubmit a series of interim claims (frequency 2 for the first, 3 for each continuing bill, 4 for the\nlast) in place of a single admit-through-discharge bill (frequency 1). Research extracts that do not\nroll up these interim bills will double- or triple-count the same admission, inflating utilization\ncounts and visit-day totals. Replacement claims (frequency\ndigit 7) supersede a previously submitted bill; voided claims (digit 8) cancel it entirely. A clean\nclaims extract must: (a) roll interim claims (frequency 2, 3, 4) into a single stay rather than\ncounting each as a separate admission, (b) apply replacements (keep the highest-sequence replacement,\ndiscard the original), and\n(c) exclude voids. 
Research databases such as MedPAR already implement this deduplication for\ninpatient stays; outpatient and SNF files typically do not, requiring analyst-level claim adjudication.\n\nBeyond deduplication, TOB is the field from which research databases derive **care setting** — the\ndistinction between inpatient and outpatient. TOB 011x indicates inpatient hospital; 013x indicates\noutpatient hospital (including the emergency department); 021x indicates SNF inpatient; 032x-033x\nindicates home health. A classifier that relies only on the claim type file or on ICD-10-PCS codes\nwithout confirming the TOB will misassign setting in edge cases.\n\n*FL6 — Statement Covers Period (From/Through dates)*: The calendar span that the claim covers — the\n**from date** (admission date for inpatient, service start for outpatient) and the **through date**\n(discharge date or last service date). For research, the through date determines when the episode ends;\nthe from date is the service initiation. These dates are distinct from the claim receipt date and the\nadjudication date, both of which can lag weeks behind the service event. Adjudication lag — the gap\nbetween service delivery and final payment — means the final months of any claims extract are\nsystematically under-reported; analysts should apply a 90-day run-out window before treating recent\nutilization as complete.\n\n*FL14 — Priority/Type of Admission* and *FL15 — Point of Origin for Admission*: FL14 codes the urgency\nof the admission: 1 = emergency, 2 = urgent, 3 = elective, 4 = newborn, 5 = trauma. FL15 codes the\nsource of the admission: 1 = non-healthcare facility (community), 2 = clinic, 4 = transfer from a\nhospital, 5 = transfer from a SNF, 6 = transfer from another healthcare facility, 8 = court/law\nenforcement. 
Together, FL14 and FL15 are used in real-world evidence for emergency-admission\nphenotypes, transfer chain construction, and social determinants of health (SDOH) studies that flag\nadmissions originating from institutional settings such as long-term care. A limitation: these fields\nare completed at admission and can reflect pre-admission coding conventions rather than clinical reality,\nand their completeness varies by payer and provider.\n\n*FL17 — Patient Discharge Status*: A two-digit code indicating the patient's disposition at the end of\nthe stay, as documented in CMS guidance: 01 = discharged to home or self-care (routine discharge),\n02 = discharged/transferred to a short-term general hospital for inpatient care, 03 = discharged to\nskilled nursing facility, 04 = discharged to intermediate care facility, 05 = discharged to another\ntype of institution, 06 = discharged to home under care of organized home health service, 07 =\nleft against medical advice, 20 = expired (death during stay), 30 = still patient (not yet discharged),\n43 = discharged to a federal hospital, 50–51 = hospice (home or medical facility), 61–65 = swing bed,\ninpatient rehabilitation facility, long-term care hospital, Medicaid-certified nursing facility, and\npsychiatric hospital.\n\nThe research significance of discharge status is threefold:\n- **In-hospital mortality**: status 20 is the standard claims-based ascertainment of death during the\n  index hospitalization, used throughout comparative effectiveness and safety research as one component\n  of a composite mortality outcome.\n- **Transfer chain construction**: status 02 (transfer to another acute hospital) triggers a multi-claim\n  linkage problem. When two acute hospital claims are connected by a status-02 discharge and the\n  receiving hospital's admission date matches the transfer date, they should be treated as a single\n  episode, not as an admission followed by a readmission. 
Failure to merge transfer chains inflates readmission rates and\n  distorts 30-day outcomes.\n- **Still-patient exclusion**: status 30 means the patient was still hospitalized when the billing\n  period closed; these claims must be excluded from readmission rate denominators and 30-day mortality\n  calculations because, with no discharge date, the follow-up window cannot be anchored. Status-30 claims are more common in long stays and in\n  hospital-to-SNF or hospital-to-rehabilitation transitions where the formal discharge is delayed.\n\n*FL18–28 — Condition Codes*: Up to eleven two-digit codes that communicate special billing conditions —\nfor example, code 04 (information only, not for Medicare adjudication), 07 (treatment of non-terminal\ncondition for hospice patient), D9 (any other special condition). Researchers rarely use condition codes\ndirectly in outcome or exposure algorithms, but they carry claim-processing signals that can explain\nanomalies in adjudicated files: a condition code signaling a partial episode or a carve-out arrangement\ncan produce a claim that looks like utilization but does not reflect a complete billable encounter.\n\n*FL31–36 — Occurrence Codes and Dates*: Paired code-and-date fields that record discrete events\nassociated with the claim — for example, code 01 (accident/medical coverage), 11 (onset of symptoms),\n17 (date outpatient occupational therapy plan established), A1 (birth date of insured). For RWE,\noccurrence codes 01 and 02 (accident with medical coverage; no-fault insurance involved, including auto accidents) are used in injury mechanism studies\nand trauma-cost analyses; code 11 (symptom onset) provides a claim-level date that is closer to true\ndisease onset than the service date. These fields are often under-populated in commercial data.\n\n*FL39–41 — Value Codes and Amounts*: Paired codes and numeric amounts (dollars, days, or visit counts) encoding episode-level\nfinancial and clinical quantities — for example, code 50 (physical therapy visits), 80 (covered days),\nA1 (deductible payer A). 
Researchers occasionally use covered-day value codes to derive SNF covered\ndays (for benefit-period analysis) or to reconcile total charges.\n\n*FL42–49 — The Revenue Code Line Level* (the detail or line level of the claim): This is where the\nencounter's services are itemized. Each line contains:\n- **FL42 Revenue code**: A four-digit code for the service category (e.g., 0100 = all-inclusive\n  rate, 0120 = room and board semi-private, 0260 = IV therapy, 0450 = emergency room,\n  0490 = ambulatory surgery). Revenue codes form the institutional counterpart to CPT on the\n  professional claim; they are how cost-center-level data are organized and are the mechanism for\n  decomposing total claim cost by type of service.\n- **FL44 — HCPCS/Rates**: The procedure code at the line level, populated for outpatient claims with\n  CPT-4 or HCPCS Level II codes. This is where outpatient procedure coding lives on an institutional\n  claim — *not* in a separate procedure field as on the CMS-1500. For outpatient hospital claims,\n  FL44 is the field that carries the CPT code used to identify a procedure-based phenotype (e.g.,\n  colonoscopy CPT 45378). For inpatient claims, the line-level HCPCS field is usually not populated\n  because procedures are reported via ICD-10-PCS codes at the claim header.\n- **FL46 — Units of Service**: The quantity of the service on that line — room-and-board days, number\n  of therapy sessions, units of a drug administered. Essential for utilization measurement (days,\n  visits, doses).\n- **FL47 — Total Charges**: The facility's billed charges for that revenue line. 
**Charges are not\n  cost and not payment.** The costing trap: chargemaster billed amounts overstate the actual resource\n  cost by a factor that varies enormously by facility, service line, payer contract, and calendar year.\n  Research that uses FL47 charges as a cost measure without applying a cost-to-charge ratio (CCR) or\n  using the adjudicated allowed/paid amount is methodologically indefensible to a payer or HTA\n  reviewer. The correct valuation for \"cost to the system\" is the allowed amount from the\n  adjudicated claim; the correct payer perspective is the plan-paid amount.\n\n*FL67 — Principal Diagnosis + POA Indicator* and *FL67A–Q — Secondary Diagnoses + POA Indicators*:\nThese fields carry the ICD-10-CM diagnosis codes for the hospitalization plus the\n**Present-on-Admission (POA) indicator** for each diagnosis. POA was mandated for Medicare FFS\ninpatient claims starting October 1, 2007; its presence is what makes institutional claims from\nthat date forward uniquely powerful for outcome research.\n\nPOA indicator values (per CMS documentation):\n- **Y** = Yes, condition was present on admission\n- **N** = No, condition was not present on admission (arose during the inpatient stay)\n- **U** = Unknown whether condition was present on admission\n- **W** = Clinically undetermined (provider unable to determine)\n- **1** = Exempt from POA reporting (certain ICD-10-CM codes are exempt)\n\nThe research value of POA:\n- **Complication vs comorbidity distinction**: A secondary diagnosis coded POA = N arose after\n  admission and may represent a hospital-acquired complication (e.g., a catheter-associated UTI, a\n  pressure injury, a deep vein thrombosis). 
A diagnosis coded POA = Y was a preexisting comorbidity.\n  Without POA, it is impossible to determine from the billing record whether a secondary diagnosis\n  code reflects patient severity at admission or an adverse event that occurred during the stay —\n  a critical distinction for quality measurement, adverse-event outcomes algorithms, and risk\n  adjustment.\n- **Elixhauser and Charlson comorbidity scoring with POA**: The standard approach for hospital-level\n  risk adjustment applies the Elixhauser or Charlson comorbidity index only to diagnoses coded\n  POA = Y (or exempt), excluding POA = N conditions that arose in-hospital and thus cannot be\n  preexisting comorbidities. Applying comorbidity weights to all secondary diagnoses regardless of POA\n  inflates the apparent comorbidity burden by including complications as covariates, biasing\n  risk-adjusted outcomes.\n- **Patient Safety Indicators (PSIs)** and **Hospital-Acquired Conditions (HACs)**: CMS uses\n  POA to define HAC categories — conditions that are reimbursed differently (at lower rates)\n  if they were not present on admission. AHRQ PSI algorithms rely on POA = N for their\n  numerators. POA indicator validity has been studied and found generally reliable for major\n  diagnoses but less consistent for secondary and minor conditions.\n\n*FL69 — Admitting Diagnosis*: The ICD-10-CM code for what the patient was *suspected* to have at the\ntime of admission — before workup, testing, and clinical evolution. FL69 often differs from the\nprincipal diagnosis (FL67), which is determined after the stay as the condition chiefly responsible for\nthe admission. The admitting diagnosis is valuable in RWE for constructing **unscheduled admission\nphenotypes** (e.g., distinguishing a planned elective surgery admission from an acute unscheduled\nhospitalization) and in studies of emergency department-to-inpatient transitions. 
The gap between\nadmitting diagnosis and principal diagnosis also serves as a signal of diagnostic ambiguity on\nadmission.\n\n*FL70 — Patient Reason for Visit*: Populated on unscheduled outpatient and emergency department claims;\ncaptures the patient-reported or triage-assigned chief complaint at the time of the ED or outpatient\nvisit. Less consistently coded than FL67 but provides a pre-workup perspective distinct from the\nprincipal diagnosis assigned after the encounter.\n\n*FL71 — PPS/DRG*: For Medicare inpatient claims under the Inpatient Prospective Payment System (IPPS),\nthis field carries the assigned **Medicare Severity Diagnosis Related Group (MS-DRG)** used to determine\nthe payment. The MS-DRG is a case-mix and severity classifier — it rolls up principal diagnosis,\nsecondary diagnoses, procedures, and discharge status into a single reimbursement category. MS-DRG is\nused in RWE as a parsimonious case-mix adjuster (avoiding the full ICD-10-CM code matrix) and to define\nclinically coherent admission strata.\n\n*FL72 — External Cause of Injury (ECI) / E-code*: ICD-10-CM external cause codes that describe the\nmechanism of an injury (e.g., fall, motor vehicle accident, assault). Used in trauma epidemiology,\ninjury-mechanism cost studies, and SDOH analyses. These codes are under-reported and have variable\ncompleteness across payers.\n\n*FL74 — Principal Procedure + Date* and *FL74a–e — Other Procedures + Dates*: ICD-10-PCS procedure\ncodes and their dates for inpatient claims. Up to six procedures per claim (one principal, five\nadditional). Unlike CPT on the professional claim, ICD-10-PCS is the coding system for inpatient\nhospital procedures. 
Procedure dates allow sequencing of surgical and procedural events within a stay\n— critical for surgical complication studies and time-to-procedure analyses.\n\n*FL76–79 — Attending, Operating, and Other Physician NPIs*: National Provider Identifiers for the\nattending physician (FL76), operating physician (FL77), and up to two other significant providers\n(FL78, FL79). These fields are the institutional claim's mechanism for **provider attribution** —\nlinking a hospital stay to the clinician responsible for care. Used in care variation studies,\nphysician practice pattern analyses, and multi-payer attribution algorithms. Note that the operating\nphysician NPI (FL77) is populated on claims with a surgical procedure and may differ from the\nattending — important for studies that need to distinguish surgeon from hospitalist.\n\n**The deduplication and claim-adjustment problem in detail**\n\nThe single most common technical error in institutional claims analysis is failing to handle\n**interim billing, replacement claims, and late charges** before constructing utilization measures.\n\nAn inpatient stay longer than a monthly billing cycle (e.g., a 45-day ICU admission) will generate:\n- One or more interim-first (frequency 2) and interim-continuing (frequency 3) claims covering\n  sub-periods of the stay, each with a through date before the actual discharge\n- A final admit-through-discharge claim (frequency 1) covering the full stay\n\nIf an analyst counts each of these as a separate hospitalization, a single 45-day stay becomes\nthree admissions in the denominator, inflating hospitalization rates by a factor of three. The\ncorrect approach is to deduplicate by rolling up to the admit-through-discharge claim or, if using a\npreprocessed file like MedPAR, to verify that the preprocessing already performed this step.\n\nReplacement claims (frequency 7) arise when a hospital corrects a submitted bill — updating diagnosis\ncodes, charges, or dates. 
The replacement supersedes the original; an extract that retains both will\ndouble-count the claim and, if the diagnosis codes changed, carry inconsistent code sets for the\nsame admission. The standard approach is to keep only the highest-sequence replacement for each\noriginal claim control number.\n\nLate charge claims are a related problem: a small additional charge (e.g., a lab result that arrived\nafter the discharge bill was submitted) may be filed as a new claim referencing the original\nadmission. These late-charge claims, if not merged back to the parent claim, inflate admission counts.\n\n**Pros, cons, and trade-offs — specific and comparative**\n\n- **vs the professional claim (CMS-1500 / 837P)**: The institutional claim captures the full facility\n  episode — room and board, nursing, ancillary services, drugs administered in-house — in a single\n  claim header with a from/through date span. The professional claim captures the physician work only,\n  usually as a single service-date line. For episode construction, cost measurement, and care-setting\n  derivation, the institutional claim is the authoritative source; the professional claim contributes\n  physician-level attribution and the CPT-level procedure coding that the institutional form carries\n  only at the outpatient level (FL44). **Prefer the institutional claim** for inpatient cost, length\n  of stay, discharge disposition, and POA-based outcome ascertainment; **prefer the professional\n  claim** for physician attribution and outpatient procedure coding precision. 
**Never use\n  Place-of-Service codes** for institutional claim setting classification — POS is a CMS-1500 field\n  that does not appear on the UB-04; setting is derived from TOB and revenue codes.\n\n- **vs ICD-10-PCS procedure codes alone**: ICD-10-PCS procedure codes on the institutional claim\n  (FL74) carry only the inpatient procedures; outpatient procedures on the same institutional claim\n  are coded via CPT at the revenue code line level (FL44). A researcher who queries only ICD-10-PCS\n  to find a procedure will miss all outpatient institutional encounters and some ambulatory surgery\n  procedures. A complete institutional procedure algorithm must combine FL74 (inpatient ICD-10-PCS)\n  and FL44/revenue code (outpatient CPT/HCPCS) with appropriate TOB filtering.\n\n- **vs EHR-derived discharge data**: The UB-04 discharge status (FL17) reflects the billing-finalized\n  disposition and is generally more complete and standardized than EHR-derived discharge disposition,\n  which varies by system and may not be consistently coded. However, the UB-04 discharge status is\n  determined at discharge and may not capture post-discharge events (e.g., a patient coded as\n  discharged to home who died hours later). Death during the stay is reliably captured as status 20;\n  post-discharge death requires linkage to a separate mortality source.\n\n- **Charges (FL47) vs allowed/paid amounts**: Billed charges on the institutional claim are the\n  facility's nominal ask, not the negotiated price or the actual cost of resources consumed. Allowed\n  amounts (what payer and patient together owe after applying the payer's fee schedule) are the\n  correct \"system cost\" denominator in cost analyses. Paid amounts (plan liability only) give the\n  payer perspective. 
The ratio of charges to allowed amounts varies by facility market power, service\n  type, and calendar year; using charges introduces systematic bias that cannot be removed without\n  facility-level or payer-level cost-to-charge ratios.\n\n**When to use**\n\nUse the UB-04/837I institutional claim fields as the primary data source whenever the research\nquestion requires: (1) care-setting determination (inpatient vs outpatient vs SNF vs home health),\n(2) episode length and boundaries (from/through dates plus discharge status), (3) in-hospital death\nascertainment (discharge status 20), (4) complication vs comorbidity distinction (POA indicators),\n(5) transfer chain construction (discharge status 02 linked to receiving hospital admission),\n(6) facility-level cost decomposition by service line (revenue code × allowed amount), or (7)\ninpatient procedure coding (ICD-10-PCS via FL74). Institutional claims are the correct and complete\ndata source for any outcome or utilization variable that is anchored to a hospital stay, SNF episode,\nor home health certification period.\n\n**When NOT to use — and when institutional claims are actively misleading or dangerous**\n\n- **For physician-level procedure coding in the outpatient setting**: The institutional claim's\n  FL44 line-level HCPCS carries the CPT code, but it represents the facility fee, not the physician\n  work. Studies of surgical volume or operator experience must use professional claims (837P) to\n  count physician-level procedures, not institutional claims, because the same institutional claim\n  is generated whether one surgeon or five performed components of the procedure.\n- **For comorbidity scoring without POA filtering**: Applying the Charlson or Elixhauser index to\n  all secondary diagnoses regardless of POA will count hospital-acquired complications as preexisting\n  comorbidities, inflating the apparent risk profile of the hospitalized cohort and biasing\n  risk-adjusted outcomes. 
Always restrict comorbidity algorithms to POA = Y and POA-exempt diagnoses.\n- **For episode construction without interim-bill deduplication**: Using raw institutional claims\n  without first rolling up interim bills and applying replacement/void logic will produce inflated\n  admission counts, incorrect episode lengths, and inconsistent code sets. Always verify what\n  deduplication has already been applied to the file (e.g., MedPAR vs outpatient SAF).\n- **For cost analysis using FL47 charges**: Billed charges are not a valid cost measure. Using\n  charges as costs in a budget-impact model, cost-effectiveness analysis, or comparative cost study\n  is a methodological error that will be immediately flagged in regulatory or HTA review.\n- **When the payer is Medicare Advantage**: MA plans submit encounter data, not FFS claims; the\n  resulting records often lack complete line-level detail, TOB-consistent setting coding, and\n  adjudicated dollar amounts. MA-only institutional encounters may be filed on an 837I skeleton that\n  omits revenue code detail, FL44 HCPCS, and accurate charge data. Do not pool MA encounter\n  institutional records with FFS institutional claims without explicit harmonization and sensitivity\n  analysis.\n- **For pre-October 2007 POA-based outcome algorithms in Medicare FFS**: POA reporting was not\n  required before FY2008. Studies using inpatient data from before that date will have missing POA\n  indicators for a material fraction of claims; complication vs comorbidity algorithms are not\n  valid for that period without explicit handling of the missing POA.\n\n**Data-source operational depth**\n\n- **Medicare FFS MedPAR**: The Medicare Provider Analysis and Review file is the pre-processed\n  inpatient institutional claims file for Medicare FFS. 
MedPAR already deduplicates to the\n  admit-through-discharge claim, aggregates interim bills, and provides stay-level fields including\n  LOS, MS-DRG, discharge status, total charges, and Medicare payments. POA indicators are present\n  from FY2008. The key limitation: MedPAR is inpatient only; outpatient institutional claims\n  (hospital-based ED, ambulatory surgery, observation stays) are in the Outpatient Standard\n  Analytical File (SAF), not MedPAR. Observation stays (TOB 013x) are particularly important:\n  they are billed as outpatient but involve overnight hospital stays, making them a common source\n  of confusion in inpatient/outpatient classification and readmission measurement.\n- **Medicare Outpatient SAF**: Contains outpatient hospital claims (TOB 013x, 073x), hospital-based\n  ED claims, ambulatory surgery center claims, and other institutional outpatient encounters.\n  Revenue code and FL44 HCPCS are present and are the primary procedure identifiers. Deduplication\n  of replacement claims must be applied at the analyst level; the SAF is less pre-processed than MedPAR.\n- **Commercial institutional claims (Optum, MarketScan, etc.)**: Structure mirrors the 837I but\n  may have payer-specific truncation of secondary diagnoses, variable POA reporting (POA is not\n  federally mandated for non-Medicare payers, so completeness varies), and facility identifier\n  masking. Revenue code granularity and FL44 HCPCS presence are generally good for large\n  commercial data vendors. Replacement/void logic must be applied analyst-side.\n- **Medicaid institutional claims**: Highly variable by state. TOB coding, POA reporting, and\n  revenue code completeness depend on the state's billing requirements. 
Fee-for-service Medicaid\n  claims in the MAX/T-MSIS files are increasingly available but require state-specific quality\n  assessment before institutional claims algorithms developed for Medicare are applied.\n\n**Licensing note**: The complete UB-04 form and its full code sets are maintained by the National\nUniform Billing Committee (NUBC) and are copyrighted by the American Hospital Association. This\nentry describes field semantics as publicly documented in CMS Medicare Claims Processing Manual\n(Pub 100-04, Chapter 25) and CMS program memoranda. The AHA/NUBC UB-04 Data Specifications Manual\nis a licensed product required for implementation-level reference.",
    "primary_category": "Data_Standard",
    "tags": [
      "coding-system",
      "data-standard",
      "primitive",
      "claims",
      "institutional",
      "ub-04",
      "837i",
      "type-of-bill",
      "poa-indicator",
      "discharge-status",
      "revenue-code",
      "form-locator",
      "medicare",
      "deduplication",
      "episode-construction"
    ],
    "applies_to_study_types": [
      "claims_analysis",
      "cohort_retrospective",
      "comparative_effectiveness",
      "utilization_study",
      "cost_analysis",
      "readmission_study"
    ],
    "data_sources": [
      "claims",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1177/1062860606288774",
        "url": "https://doi.org/10.1177/1062860606288774",
        "citation_text": "Tyree PT, Lind BK, Lafferty WE. Challenges of using medical insurance claims data for utilization analysis. American Journal of Medical Quality. 2006;21(4):269-275.",
        "year": 2006,
        "authors_short": "Tyree et al.",
        "notes": "Foundational primer on the structure, limitations, and analytical traps of administrative claims data, including institutional claim fields, enrollment gaps, and the observability constraint; the key reference for understanding why claims are a billing byproduct, not a clinical record."
      },
      {
        "role": "explain",
        "doi": "10.1097/00005650-199801000-00004",
        "url": "https://doi.org/10.1097/00005650-199801000-00004",
        "citation_text": "Elixhauser A, Steiner C, Harris DR, Coffey RM. Comorbidity measures for use with administrative data. Medical Care. 1998;36(1):8-27.",
        "year": 1998,
        "authors_short": "Elixhauser et al.",
        "notes": "Defines the Elixhauser comorbidity index from inpatient administrative discharge data; the canonical demonstration of how UB-04 secondary diagnosis fields (FL67A-Q) are operationalized for risk adjustment, and the original framework extended by POA-adjusted comorbidity scoring."
      },
      {
        "role": "demonstrate",
        "doi": "10.1097/01.mlr.0000182534.19832.83",
        "url": "https://doi.org/10.1097/01.mlr.0000182534.19832.83",
        "citation_text": "Quan H, Sundararajan V, Halfon P, et al. Coding algorithms for defining comorbidities in ICD-9-CM and ICD-10 administrative data. Medical Care. 2005;43(11):1130-1139.",
        "year": 2005,
        "authors_short": "Quan et al.",
        "notes": "Validates and extends Charlson and Elixhauser comorbidity coding algorithms for ICD-9 and ICD-10 diagnosis fields in administrative data; directly demonstrates the mapping from UB-04 secondary diagnosis fields to structured comorbidity constructs used in risk adjustment."
      },
      {
        "role": "use",
        "doi": "10.1056/NEJMsa0803563",
        "url": "https://doi.org/10.1056/NEJMsa0803563",
        "citation_text": "Jencks SF, Williams MV, Coleman EA. Rehospitalizations among patients in the Medicare fee-for-service program. New England Journal of Medicine. 2009;360(14):1418-1428.",
        "year": 2009,
        "authors_short": "Jencks et al.",
        "notes": "Landmark population-level analysis of Medicare readmissions using MedPAR institutional claims; relies on discharge status (FL17), TOB-based episode construction, and transfer-chain merging to derive accurate 30-day readmission rates — a concrete demonstration of how UB-04 fields drive the most policy-consequential institutional claims outcome algorithm."
      },
      {
        "role": "use",
        "doi": null,
        "url": "https://www.cms.gov/regulations-and-guidance/guidance/manuals/downloads/clm104c25.pdf",
        "citation_text": "Centers for Medicare & Medicaid Services. Medicare Claims Processing Manual, Publication 100-04, Chapter 25: Completing and Processing the Form CMS-1450 Data Set. Baltimore, MD: CMS; updated 2023.",
        "year": 2023,
        "authors_short": "CMS",
        "notes": "Authoritative public-domain documentation of every UB-04 Form Locator field including Type of Bill, discharge status codes, POA indicator values, and revenue code line structure; the primary reference for field semantics used throughout this entry."
      },
      {
        "role": "use",
        "doi": null,
        "url": "https://www.nubc.org",
        "citation_text": "National Uniform Billing Committee (NUBC). UB-04 Data Specifications Manual. American Hospital Association. Chicago, IL. Updated annually.",
        "year": 2024,
        "authors_short": "NUBC/AHA",
        "notes": "Normative licensed source for the complete UB-04 form and all NUBC-maintained code sets. The UB-04 form and its code sets are copyrighted by the American Hospital Association; researchers must obtain the licensed manual for implementation-level reference. This entry describes only publicly documented field semantics per CMS Pub 100-04."
      },
      {
        "role": "use",
        "doi": null,
        "url": "https://resdac.org/cms-data/files/medpar",
        "citation_text": "Research Data Assistance Center (ResDAC). MedPAR Limited Data Set (LDS). Centers for Medicare & Medicaid Services. Minneapolis, MN: ResDAC; accessed 2024.",
        "year": 2024,
        "authors_short": "ResDAC",
        "notes": "ResDAC data documentation for the Medicare Provider Analysis and Review (MedPAR) file, which is the pre-processed institutional inpatient claims file for Medicare FFS; documents field layout, variable definitions, and preprocessing steps including deduplication to admit-through-discharge claims that researchers relying on this file should understand."
      }
    ],
    "plain_language_summary": "A UB-04 is the billing form that hospitals and other care facilities — not doctors — send to insurance when a patient stays overnight, visits the emergency room, or receives outpatient services. Each line on the form carries a structured code: a four-digit \"type of bill\" that tells the insurer whether the stay was inpatient or outpatient, a two-digit discharge status saying where the patient went when they left (home, another hospital, or the morgue), and a \"present-on-admission\" flag on every diagnosis that tells whether the patient arrived with that condition or developed it during the stay. Researchers use these codes to build hospitalization episodes, count in-hospital deaths, separate complications from preexisting conditions, and measure facility costs — but the raw data require careful cleaning to handle interim bills, replacement claims, and the critical difference between billed charges and actual costs.",
    "key_terms": [
      {
        "term": "Form Locator (FL)",
        "definition": "The numbered position on the UB-04 paper form (e.g., FL4, FL17, FL67) that corresponds to a specific data element; each FL has a defined code set and research meaning."
      },
      {
        "term": "Type of Bill (TOB)",
        "definition": "A four-digit code in FL4 that encodes the facility type, bill classification (inpatient vs outpatient), and a frequency digit indicating whether the claim covers the full stay (frequency 1), is an interim partial bill (frequency 2–4), a correction (frequency 7), or a cancellation (frequency 8)."
      },
      {
        "term": "POA indicator",
        "definition": "A one-character flag attached to each diagnosis code (Y/N/U/W) that records whether the condition was present when the patient was admitted; it distinguishes preexisting comorbidities (Y) from hospital-acquired complications (N)."
      },
      {
        "term": "Discharge status",
        "definition": "The two-digit code in FL17 recording where the patient went at the end of the stay (e.g., 01 = home, 02 = transferred to another acute hospital, 20 = died during stay, 30 = still hospitalized); drives in-hospital death ascertainment, transfer-chain linkage, and readmission denominators."
      },
      {
        "term": "Revenue code line",
        "definition": "A detail row in FL42–47 that itemizes one category of service (e.g., emergency room, pharmacy, physical therapy) by revenue code; the outpatient procedure code (CPT/HCPCS) lives at this line level in FL44, and total billed charges appear in FL47."
      },
      {
        "term": "Occurrence code",
        "definition": "A paired code-and-date field (FL31–36) that records a specific event associated with the claim, such as the accident date, symptom onset, or date a home health plan was established; used in injury- mechanism and symptom-onset analyses."
      }
    ],
    "worked_example": {
      "scenario": "A researcher at a health plan wants to study 30-day all-cause readmissions for Medicare fee-for-service patients hospitalized with acute myocardial infarction (AMI). She pulls the Medicare MedPAR file and finds five claims for patient 0042 across a single calendar month. She needs to determine which of these represent distinct hospitalizations, which should be merged or discarded, and what the correct episode characteristics are for this patient. She then needs to classify the secondary diagnoses correctly for risk adjustment.",
      "dataset": {
        "caption": "Five raw UB-04-derived institutional claim records for patient 0042 from the MedPAR/outpatient file (before deduplication). Claim IDs beginning with R indicate replacement of claim A.",
        "columns": [
          "claim_id",
          "from_date",
          "through_date",
          "tob",
          "disch_status",
          "principal_dx",
          "secondary_dx_1",
          "secondary_dx_1_poa",
          "total_charges"
        ],
        "rows": [
          [
            "A001",
            "2023-03-01",
            "2023-03-31",
            "0112",
            "30",
            "I21.0",
            "N18.3",
            "Y",
            48200
          ],
          [
            "A001-R1",
            "2023-03-01",
            "2023-04-14",
            "0117",
            "01",
            "I21.0",
            "N18.3",
            "Y",
            89500
          ],
          [
            "A002",
            "2023-04-10",
            "2023-04-10",
            "0131",
            "01",
            "Z00.00",
            "I25.10",
            "Y",
            320
          ],
          [
            "B001",
            "2023-04-20",
            "2023-04-26",
            "0111",
            "02",
            "I50.9",
            "E11.9",
            "N",
            "12100"
          ],
          [
            "C001",
            "2023-04-26",
            "2023-05-03",
            "0111",
            "01",
            "I50.9",
            "E11.9",
            "Y",
            9800
          ]
        ]
      },
      "steps": [
        "Claim A001 has TOB 0112 (hospital inpatient, frequency digit 2 = interim-first). This is a partial bill for the beginning of the stay, not a complete admission. Do NOT count it as a separate hospitalization.",
        "Claim A001-R1 has TOB 0117 (hospital inpatient, frequency digit 7 = replacement). It supersedes A001, covers the full stay from 2023-03-01 through 2023-04-14, and has frequency digit 7 indicating it is the corrected/final version. Keep A001-R1, discard A001. The from-through span is 45 days (2023-03-01 to 2023-04-14), confirming this was a long stay that triggered interim billing. Discharge status 01 = discharged home. This is Hospitalization 1 (AMI index admission).",
        "Claim A002 has TOB 0131 (hospital outpatient, frequency 1). This is an outpatient visit (from/through = same day), not an inpatient stay. It falls within the post-discharge window and is not a readmission. Exclude from the inpatient readmission denominator; include in outpatient utilization if needed.",
        "Claim B001 has TOB 0111 (hospital inpatient, admit-through-discharge, frequency 1). From 2023-04-20 to 2023-04-26 = 6 days. Discharge status 02 = transferred to another acute hospital. The secondary diagnosis E11.9 (type 2 diabetes) is coded POA = N, meaning the patient was admitted without diabetes as a coded comorbidity and it emerged during the stay — either a new finding or a documentation gap. Secondary diagnosis E11.9 should be EXCLUDED from the Elixhauser comorbidity score for this admission. This is Hospitalization 2, a potential index readmission (within 30 days of H1 discharge on 2023-04-14).",
        "Claim C001 has TOB 0111 (hospital inpatient, admit-through-discharge). From 2023-04-26 to 2023-05-03. Admission date (2023-04-26) matches the transfer-out date from B001. Because B001 discharge status = 02 (transfer to another acute hospital) and C001 admission date = B001 through date, B001 and C001 represent a SINGLE EPISODE spanning a hospital-to-hospital transfer. Merge B001 + C001 into one episode: from 2023-04-20, through 2023-05-03, 13 total days; final discharge status = 01 (home) from C001.",
        "Final episode summary: Hospitalization 1 = A001-R1 (AMI, 2023-03-01 to 2023-04-14, discharged home). Hospitalization 2 = merged B001+C001 (heart failure with transfer, 2023-04-20 to 2023-05-03). Days between H1 discharge (2023-04-14) and H2 admission (2023-04-20) = 6 days. This is a readmission within 30 days = 2023-04-14 + 30 days = 2023-05-14, so H2 qualifies. 6 / 30 = 0.20 of the 30-day window has elapsed at the time of readmission.",
        "POA-adjusted Elixhauser comorbidity for H2: E11.9 (diabetes, POA = N on B001) is excluded as a hospital-acquired finding; I50.9 (heart failure, POA = Y on C001) is included as a preexisting condition. The comorbidity count changes by 1 depending on whether POA filtering is applied — a difference that can meaningfully shift predicted readmission probability in a risk-adjustment model."
      ],
      "result": "After deduplication and transfer-chain merging: patient 0042 had 2 distinct inpatient episodes. Episode 1 (AMI index): 2023-03-01 to 2023-04-14, 45 days, discharged home. Episode 2 (readmission): 2023-04-20 to 2023-05-03, 13 days, merged from transfer. Days to readmission = 6 days. 6 / 30 = 0.20 fraction of the 30-day window elapsed. POA-adjusted Elixhauser score for episode 2 includes heart failure (POA = Y) but excludes diabetes (POA = N on admission claim). Raw claim count before deduplication was 5; correct episode count is 2.",
      "timeline_spec": {
        "title": "UB-04 institutional claims for patient 0042: deduplication, transfer merge, and readmission",
        "window": {
          "start": "2023-03-01",
          "end": "2023-05-14",
          "label": "Index admission through 30-day readmission window"
        },
        "events": [
          {
            "label": "A001 (interim, DISCARD)",
            "start": "2023-03-01",
            "length_days": 31,
            "quantity": "TOB 0112 frequency 2 = interim"
          },
          {
            "label": "A001-R1 (replacement, KEEP): H1 AMI",
            "start": "2023-03-01",
            "length_days": 45,
            "quantity": "TOB 0117 freq 7 replacement; 45 days"
          },
          {
            "label": "A002 (outpatient, not a readmission)",
            "start": "2023-04-10",
            "length_days": 1,
            "quantity": "TOB 0131 outpatient"
          },
          {
            "label": "B001: H2 acute HF (transfer out)",
            "start": "2023-04-20",
            "length_days": 6,
            "quantity": "TOB 0111 disch status 02"
          },
          {
            "label": "C001: H2 continued (post-transfer)",
            "start": "2023-04-26",
            "length_days": 7,
            "quantity": "TOB 0111 merged with B001"
          }
        ],
        "spans": [
          {
            "kind": "covered",
            "start": "2023-03-01",
            "end": "2023-04-13",
            "label": "H1 AMI inpatient stay (45 days)"
          },
          {
            "kind": "gap",
            "start": "2023-04-14",
            "end": "2023-04-19",
            "label": "Post-discharge gap (6 days)"
          },
          {
            "kind": "covered",
            "start": "2023-04-20",
            "end": "2023-05-02",
            "label": "H2 merged transfer episode (13 days)"
          },
          {
            "kind": "followup",
            "start": "2023-04-14",
            "end": "2023-05-14",
            "label": "30-day readmission window"
          }
        ],
        "result": {
          "label": "Readmission at day 6; 6/30 = 0.20 of window elapsed",
          "value": 0.2
        }
      }
    },
    "prerequisites": [
      "claims-analysis",
      "icd-10-pcs-procedure-coding"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Inpatient vs outpatient TOB classification",
        "description": "Research databases derive care setting from the Type of Bill second and third digit combination. TOB 011x = inpatient hospital; 013x = outpatient hospital (including hospital-based ED); 021x = SNF inpatient; 032x/033x = home health. The critical edge case is observation stays (TOB 013x), which are outpatient by billing classification despite involving overnight stays — a major source of confusion in readmission counting (observation stays do not qualify as hospitalizations for most readmission measures) and in inpatient/outpatient cost decomposition.",
        "edge_cases": [
          "Observation stays (TOB 0131/0132/0133) look like inpatient stays to patients but are billed as outpatient; they appear in the outpatient SAF, not MedPAR, and must be explicitly identified and classified when constructing inpatient-only cohorts.",
          "A claim with TOB 0111 from a critical access hospital (CAH) is paid under cost-based reimbursement, not DRG; the MS-DRG field (FL71) will be absent or nominal, which can break DRG-based case-mix adjustment applied indiscriminately."
        ],
        "data_source_notes": "MedPAR contains only TOB 011x inpatient claims. The Medicare Outpatient SAF contains 013x and other outpatient institutional claims. Commercial databases typically combine facility file types but flag claim type; always confirm which claim types are included before constructing a cohort."
      },
      {
        "name": "POA-adjusted vs unadjusted comorbidity scoring",
        "description": "The standard implementation of Elixhauser or Charlson comorbidity indices on inpatient UB-04 data restricts the diagnosis fields to those coded POA = Y (present on admission) or POA-exempt, excluding POA = N diagnoses that arose in-hospital. The difference between POA-adjusted and unadjusted scores reflects the proportion of secondary diagnoses that are hospital-acquired; in high-complexity cases, the difference can be material for risk-adjustment and outcome algorithm validity.",
        "edge_cases": [
          "POA = U (unknown) and POA = W (clinically undetermined) are ambiguous; standard practice is to treat them as POA = Y for comorbidity scoring to be conservative, but sensitivity analyses with these excluded should be reported.",
          "POA indicators are only federally mandated for Medicare FFS inpatient claims from FY2008 forward; commercial payer POA completeness is variable, and pre-2008 Medicare data lack POA."
        ],
        "data_source_notes": "MedPAR includes POA indicators from FY2008. Commercial institutional claim files vary by vendor in POA completeness; always document the completeness rate before applying POA-adjusted algorithms."
      },
      {
        "name": "Claim deduplication for episode construction",
        "description": "Raw institutional claims files contain interim bills, replacement claims, and late charges that must be resolved before episodes are counted. The three-step deduplication protocol is: (1) drop interim claims (TOB frequency digits 2, 3); (2) among multiple claims with the same original claim control number, retain the highest-sequence replacement (frequency digit 7) and remove earlier versions; (3) exclude void/cancel claims (frequency digit 8). For outpatient files, additionally identify and merge late-charge claims referencing a parent claim control number.",
        "edge_cases": [
          "Some research databases (e.g., MedPAR) have already applied deduplication; applying a second round of deduplication to a pre-processed file can incorrectly remove valid records. Always verify what preprocessing the file vendor has applied before re-deduplicating.",
          "A replacement claim that changes the principal diagnosis or discharge status is a legitimate code correction, not an error; the replacement version should be kept even if it changes the analytic classification of the episode."
        ],
        "data_source_notes": "MedPAR: already deduplicated to admit-through-discharge. Medicare Outpatient SAF: requires analyst-level claim status filtering. Commercial vendor files: deduplication state varies; read vendor data dictionaries to determine if replacement claims are pre-resolved."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "cms-1500-professional-claim-fields",
        "pros_of_this": "The institutional claim covers the full facility episode in a single claim with from/through dates, discharge status, all in-hospital diagnosis codes with POA flags, inpatient procedures (ICD-10-PCS), and revenue-code-level cost decomposition — everything needed to define an inpatient episode boundary, ascertain in-hospital outcomes, and decompose facility costs.",
        "cons_of_this": "The institutional claim carries only the facility fee; it does not capture physician work, cannot distinguish which specific clinician performed a procedure without FL77, and does not carry outpatient prescription data. For physician attribution, provider-level procedure volume, and drug utilization, the professional claim (837P) or pharmacy claim is required.",
        "when_to_prefer": "Use the institutional claim for inpatient episode construction, in-hospital death, discharge disposition, POA-based complication algorithms, and facility cost. Use the professional claim for physician-attributed procedure coding, physician practice variation, and outpatient service-level detail. Both files together are needed for a complete hospital encounter."
      },
      {
        "compared_to": "place-of-service-codes",
        "pros_of_this": "Type of Bill (FL4) is the authoritative care-setting classifier on the institutional claim; it encodes facility type and classification (inpatient vs outpatient) and applies to all institutional billers. Revenue codes at the line level provide additional service-category resolution within the facility.",
        "cons_of_this": "TOB-based setting classification requires knowledge of the facility type encoding (digit 2) and bill classification (digit 3) conventions, which differ by claim type. Observation stays and outpatient departments are correctly classified as outpatient by TOB even when clinically similar to inpatient.",
        "when_to_prefer": "Never use Place-of-Service codes for institutional claim setting classification — POS is a CMS-1500 (professional claim) field and does not appear on UB-04. For institutional claims, always use TOB and revenue codes for setting determination. For professional claims, use POS."
      },
      {
        "compared_to": "icd-10-pcs-procedure-coding",
        "pros_of_this": "ICD-10-PCS codes are only on inpatient institutional claims (FL74); for outpatient institutional procedures, CPT/HCPCS at the FL44 revenue-code line is required. Together they cover the full institutional procedure landscape.",
        "cons_of_this": "ICD-10-PCS is highly specific and hierarchical but less widely known than CPT; codes change annually and require mapping for multi-year studies. CPT at FL44 is more familiar but provides only facility-fee coding at the outpatient level.",
        "when_to_prefer": "Use ICD-10-PCS (FL74) for inpatient procedure phenotypes. Use CPT/HCPCS (FL44 + revenue code) for outpatient institutional procedure phenotypes. Never assume CPT-only algorithms will capture inpatient procedures on institutional claims."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "For Medicare MedPAR: deduplication to admit-through-discharge is pre-applied; use discharge status (20) for in-hospital death, TOB 011x confirmed, MS-DRG from FL71 for case-mix, POA from FY2008 onward. For Medicare Outpatient SAF: apply claim-status filter (keep final-action claims), identify observation stays (revenue code 0762 or TOB 0131/0132), use FL44 HCPCS for procedure coding. For commercial institutional files: (1) apply frequency-digit deduplication; (2) check POA completeness before applying POA-adjusted comorbidity algorithms; (3) use allowed amount (not charges) for cost measurement; (4) verify which claim types (inpatient, outpatient, SNF, home health) are in the file and filter by TOB for each analysis.",
      "linked": "When linking institutional claims to EHR or mortality records: the UB-04 discharge date (through date in FL6) and discharge status (FL17) are the anchors for post-discharge follow-up. Status 20 (expired) should be reconciled with the EHR death date and any linked mortality file; discrepancies of 1-2 days are common and usually reflect administrative vs clinical death recording. For transfer chains: match status-02 discharge date to receiving hospital admission date across claims for the same patient; a tolerance window of 0-1 days is standard."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\n\n# ------------------------------------------------------------------\n# 1. DEDUPLICATION: resolve interim bills and replacement claims\n#    Input: raw institutional claims DataFrame with TOB as a string\n#    Key columns: claim_id, patient_id, from_date, through_date, tob,\n#                 disch_status, principal_dx, total_charges\n# ------------------------------------------------------------------\n\ndef parse_tob(tob: str) -> dict:\n    \"\"\"Extract facility type, classification, and frequency from a 4-char TOB string.\"\"\"\n    tob = str(tob).zfill(4)\n    return {\n        \"facility_type\": tob[1],       # position 2 (0-indexed 1)\n        \"classification\": tob[2],      # position 3: 1=IP partA, 3=OP, etc.\n        \"frequency\": tob[3],           # position 4: 1=final, 2=first interim, 7=replacement, 8=void\n    }\n\ndef deduplicate_institutional_claims(df: pd.DataFrame) -> pd.DataFrame:\n    \"\"\"\n    Remove interim bills, apply replacement logic, and exclude void claims.\n    Returns the deduplicated DataFrame with one row per final admitted episode.\n\n    TOB frequency digit rules (per CMS Claims Processing Manual Ch. 
25):\n      1 = admit-through-discharge (KEEP as final)\n      2 = interim first (DROP — subsumed by frequency-1 final bill)\n      3 = interim continuing (DROP)\n      4 = interim last (DROP — superseded by frequency-1)\n      7 = replacement of prior claim (KEEP as replacement, discard original)\n      8 = void/cancel (DROP claim and its predecessor)\n    \"\"\"\n    df = df.copy()\n    df[\"tob_freq\"] = df[\"tob\"].astype(str).str.zfill(4).str[3]\n\n    # Step 1: drop void claims and the claims they void\n    # (in practice, voids reference original claim_id; here we drop by frequency = 8)\n    void_ids = set(df.loc[df[\"tob_freq\"] == \"8\", \"claim_id\"])\n    df = df[~df[\"tob_freq\"].isin([\"8\"])]\n    # If your data has an original_claim_id reference, also drop those originals here.\n\n    # Step 2: for replacement claims (frequency 7), keep only the replacement\n    # and discard any earlier versions with the same original_claim_control_number.\n    # Here we use claim_id prefix as a proxy (real data uses CLM_ID or ICN).\n    replacements = df[df[\"tob_freq\"] == \"7\"][\"claim_id\"].str.replace(r\"-R\\d+$\", \"\", regex=True)\n    original_ids_to_drop = set(replacements)\n    df = df[~((df[\"claim_id\"].isin(original_ids_to_drop)) & (df[\"tob_freq\"] != \"7\"))]\n\n    # Step 3: drop interim bills (frequency 2, 3, 4)\n    df = df[~df[\"tob_freq\"].isin([\"2\", \"3\", \"4\"])]\n\n    df = df.drop(columns=[\"tob_freq\"])\n    return df.reset_index(drop=True)\n\n\n# ------------------------------------------------------------------\n# 2. 
TRANSFER CHAIN MERGING: fuse status-02 discharge to next admission\n#    Input: deduplicated inpatient claims, sorted by patient + from_date\n# ------------------------------------------------------------------\n\ndef merge_transfer_chains(df: pd.DataFrame, tolerance_days: int = 1) -> pd.DataFrame:\n    \"\"\"\n    Merge inpatient claims connected by discharge_status = '02' (transferred to another\n    acute hospital) when the receiving admission date falls 0 to tolerance_days days\n    after the transfer-out through_date.\n\n    Returns a DataFrame where transfer chains appear as a single episode with:\n      - from_date = first admission date in the chain\n      - through_date = last discharge through_date in the chain\n      - disch_status = final hospital's discharge status\n      - transfer_chain = True if merged from multiple claims\n    \"\"\"\n    df = df.copy()\n    df[\"from_date\"] = pd.to_datetime(df[\"from_date\"])\n    df[\"through_date\"] = pd.to_datetime(df[\"through_date\"])\n    df = df.sort_values([\"patient_id\", \"from_date\"]).reset_index(drop=True)\n    df[\"transfer_chain\"] = False\n    df[\"chain_id\"] = range(len(df))\n\n    # Walk patient-by-patient\n    merged_rows = []\n    for pid, group in df.groupby(\"patient_id\"):\n        group = group.reset_index(drop=True)\n        i = 0\n        while i < len(group):\n            row = group.iloc[i].copy()\n            # Check if this claim is a transfer-out\n            while (str(row[\"disch_status\"]) == \"02\") and (i + 1 < len(group)):\n                next_row = group.iloc[i + 1]\n                gap = (next_row[\"from_date\"] - row[\"through_date\"]).days\n                # Require a non-negative gap so overlapping claims are not fused\n                if 0 <= gap <= tolerance_days:\n                    # Merge: extend the through_date and take the next claim's discharge status\n                    row[\"through_date\"] = next_row[\"through_date\"]\n                    row[\"disch_status\"] = next_row[\"disch_status\"]\n                    row[\"transfer_chain\"] = True\n                    i += 1\n                else:\n                    break  # Gap negative or too large: not a transfer\n            merged_rows.append(row)\n            i += 1\n\n    result = pd.DataFrame(merged_rows).reset_index(drop=True)\n    return result\n\n\n# ------------------------------------------------------------------\n# 3. POA-ADJUSTED COMORBIDITY FILTERING\n#    Apply Elixhauser or Charlson only to POA = Y and exempt diagnoses\n# ------------------------------------------------------------------\n\ndef filter_poa_comorbidity_dx(\n    dx_poa_pairs: list[tuple[str, str]],\n    include_unknown: bool = True,\n) -> list[str]:\n    \"\"\"\n    Given a list of (icd10_code, poa_indicator) pairs from UB-04 secondary dx fields,\n    return only the diagnosis codes eligible for comorbidity scoring.\n\n    POA values (CMS documentation):\n      Y = present on admission -> include\n      N = not present on admission (hospital-acquired) -> EXCLUDE\n      U = unknown -> include if include_unknown=True (conservative default)\n      W = clinically undetermined -> include if include_unknown=True\n      1 = exempt from POA reporting -> include (these codes are exempt by CMS definition)\n\n    Returns list of eligible ICD-10-CM codes for comorbidity mapping.\n    \"\"\"\n    include_flags = {\"Y\", \"1\"}\n    if include_unknown:\n        include_flags.update({\"U\", \"W\"})\n\n    eligible = [\n        dx for dx, poa in dx_poa_pairs\n        if str(poa).upper() in include_flags\n    ]\n    return eligible\n\n\n# ------------------------------------------------------------------\n# EXAMPLE USAGE (mirrors the worked example)\n# ------------------------------------------------------------------\n\nif __name__ == \"__main__\":\n    raw_claims = pd.DataFrame({\n        \"claim_id\":    [\"A001\", \"A001-R1\", \"A002\", \"B001\", \"C001\"],\n        \"patient_id\":  [42, 42, 42, 42, 42],\n        \"from_date\":   [\"2023-03-01\", \"2023-03-01\", \"2023-04-10\", \"2023-04-20\", 
\"2023-04-26\"],\n        \"through_date\":[\"2023-03-31\", \"2023-04-14\", \"2023-04-10\", \"2023-04-26\", \"2023-05-03\"],\n        \"tob\":         [\"0112\", \"0117\", \"0131\", \"0111\", \"0111\"],\n        \"disch_status\":[\"30\", \"01\", \"01\", \"02\", \"01\"],\n        \"principal_dx\":[\"I21.0\", \"I21.0\", \"Z00.00\", \"I50.9\", \"I50.9\"],\n    })\n\n    # Step 1: deduplicate\n    deduped = deduplicate_institutional_claims(raw_claims)\n    print(\"After deduplication:\")\n    print(deduped[[\"claim_id\", \"from_date\", \"through_date\", \"tob\", \"disch_status\"]])\n    # Expected: A001-R1 (replacement kept), A002 (outpatient), B001, C001\n    # A001 (original of replacement) and interim bills are dropped\n\n    # Step 2: filter to inpatient only (TOB facility_type=1, classification=1 = inpatient Part A)\n    ip = deduped[deduped[\"tob\"].str.zfill(4).str[1:3].isin([\"11\", \"12\"])].copy()\n\n    # Step 3: merge transfers\n    ip_merged = merge_transfer_chains(ip)\n    print(\"\\nAfter transfer merge:\")\n    print(ip_merged[[\"claim_id\", \"from_date\", \"through_date\", \"disch_status\", \"transfer_chain\"]])\n    # B001+C001 merge into one episode with through_date 2023-05-03\n\n    # Step 4: POA filtering example\n    secondary_dx_h2 = [(\"E11.9\", \"N\"), (\"I50.9\", \"Y\")]\n    eligible = filter_poa_comorbidity_dx(secondary_dx_h2, include_unknown=True)\n    print(\"\\nPOA-eligible diagnoses for H2:\", eligible)\n    # Expected: ['I50.9'] only (E11.9 excluded as POA=N)",
        "description": "Utility functions for the three most common institutional claims operations: (1) deduplicating raw claims to the admit-through-discharge bill using TOB frequency digit logic, (2) merging transfer chains using discharge status 02, and (3) parsing POA indicators to separate present-on-admission comorbidities from in-hospital complications. Each function operates on a pandas DataFrame with the column names typical of a research claims extract. The code follows exactly the logic described in the worked example and is intended as a reference scaffold, not a production pipeline — add payer-specific claim status filtering and date-format normalization as required.",
        "dependencies": [
          "pandas"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\n\n# ------------------------------------------------------------------\n# 1. DEDUPLICATION: resolve TOB frequency logic on raw institutional claims\n# ------------------------------------------------------------------\n\ndeduplicate_institutional_claims <- function(dt) {\n  # dt: data.table with columns: claim_id, patient_id, from_date, through_date,\n  #     tob (character), disch_status, principal_dx\n  dt <- copy(dt)\n  dt[, tob_freq := substr(formatC(tob, width = 4, flag = \"0\"), 4, 4)]\n\n  # Drop void claims (frequency = 8)\n  dt <- dt[tob_freq != \"8\"]\n\n  # For replacement claims (frequency = 7), strip the -R suffix to find originals\n  # and drop the original version (keeping the replacement)\n  replacements <- dt[tob_freq == \"7\", gsub(\"-R[0-9]+$\", \"\", claim_id)]\n  dt <- dt[!(claim_id %in% replacements & tob_freq != \"7\")]\n\n  # Drop interim bills (frequency = 2, 3, 4)\n  dt <- dt[!tob_freq %in% c(\"2\", \"3\", \"4\")]\n\n  dt[, tob_freq := NULL]\n  return(dt)\n}\n\n\n# ------------------------------------------------------------------\n# 2. 
TRANSFER CHAIN MERGING\n# ------------------------------------------------------------------\n\nmerge_transfer_chains <- function(dt, tolerance_days = 1) {\n  dt <- copy(dt)\n  dt[, from_date := as.Date(from_date)]\n  dt[, through_date := as.Date(through_date)]\n  setorder(dt, patient_id, from_date)\n  dt[, transfer_chain := FALSE]\n\n  result_list <- list()\n\n  for (pid in unique(dt$patient_id)) {\n    grp <- dt[patient_id == pid]\n    i <- 1\n    while (i <= nrow(grp)) {\n      row <- as.list(grp[i])\n      # Follow transfer chain\n      while (as.character(row$disch_status) == \"02\" && (i + 1) <= nrow(grp)) {\n        nxt <- as.list(grp[i + 1])\n        gap <- as.integer(nxt$from_date - row$through_date)\n        # Require a non-negative gap so overlapping claims are not fused\n        if (gap >= 0 && gap <= tolerance_days) {\n          row$through_date  <- nxt$through_date\n          row$disch_status  <- nxt$disch_status\n          row$transfer_chain <- TRUE\n          i <- i + 1\n        } else {\n          break\n        }\n      }\n      result_list[[length(result_list) + 1]] <- as.data.table(row)\n      i <- i + 1\n    }\n  }\n\n  rbindlist(result_list, fill = TRUE)\n}\n\n\n# ------------------------------------------------------------------\n# 3. POA-ADJUSTED COMORBIDITY FILTERING\n# ------------------------------------------------------------------\n\nfilter_poa_comorbidity_dx <- function(dx_vec, poa_vec, include_unknown = TRUE) {\n  # dx_vec: character vector of ICD-10-CM codes\n  # poa_vec: character vector of POA indicators (Y/N/U/W/1)\n  # Returns the subset of dx_vec eligible for comorbidity scoring\n\n  include_flags <- c(\"Y\", \"1\")\n  if (include_unknown) include_flags <- c(include_flags, \"U\", \"W\")\n\n  dx_vec[toupper(poa_vec) %in% include_flags]\n}\n\n\n# ------------------------------------------------------------------\n# EXAMPLE USAGE\n# ------------------------------------------------------------------\n\nraw_claims <- data.table(\n  claim_id    = c(\"A001\", \"A001-R1\", \"A002\", \"B001\", \"C001\"),\n  patient_id  = rep(42, 5),\n  from_date   = c(\"2023-03-01\", \"2023-03-01\", \"2023-04-10\", \"2023-04-20\", \"2023-04-26\"),\n  through_date = c(\"2023-03-31\", \"2023-04-14\", \"2023-04-10\", \"2023-04-26\", \"2023-05-03\"),\n  tob         = c(\"0112\", \"0117\", \"0131\", \"0111\", \"0111\"),\n  disch_status = c(\"30\",  \"01\",   \"01\",   \"02\",   \"01\"),\n  principal_dx = c(\"I21.0\", \"I21.0\", \"Z00.00\", \"I50.9\", \"I50.9\")\n)\n\n# Step 1: deduplicate\ndeduped <- deduplicate_institutional_claims(raw_claims)\ncat(\"After deduplication:\\n\")\nprint(deduped[, .(claim_id, from_date, through_date, tob, disch_status)])\n\n# Step 2: hospital inpatient claims only\n# (TOB positions 2-3: \"11\" = hospital inpatient Part A, \"12\" = hospital inpatient Part B)\nip <- deduped[substr(formatC(tob, width = 4, flag = \"0\"), 2, 3) %in% c(\"11\", \"12\")]\n\n# Step 3: merge transfers\nip_merged <- merge_transfer_chains(ip)\ncat(\"\\nAfter transfer merge:\\n\")\nprint(ip_merged[, .(claim_id, from_date, through_date, disch_status, transfer_chain)])\n\n# Step 4: POA filtering\nsecondary_h2_dx  <- c(\"E11.9\", \"I50.9\")\nsecondary_h2_poa <- c(\"N\", \"Y\")\neligible <- filter_poa_comorbidity_dx(secondary_h2_dx, secondary_h2_poa)\ncat(\"\\nPOA-eligible diagnoses for H2:\", 
eligible, \"\\n\")\n# Expected: \"I50.9\" only",
        "description": "R functions for the same three institutional claims operations: TOB-based deduplication, transfer-chain merging with discharge-status 02, and POA-filtered diagnosis extraction for Elixhauser/Charlson comorbidity scoring. Uses base R and data.table for performance on large claims files. Mirrors the Python implementation and the worked example logic exactly.",
        "dependencies": [
          "data.table"
        ],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  UB04[\"UB-04 / 837I Institutional Claim\"] --> Header[\"CLAIM HEADER\"]\n  UB04 --> Lines[\"REVENUE CODE LINES (FL42-49)\"]\n\n  Header --> FL4[\"FL4: Type of Bill (TOB)\\nFacility type | Classification | Frequency\\ne.g. 0111 = hosp inpatient final\\n0117 = replacement; 0112 = interim\"]\n  Header --> FL6[\"FL6: Statement Covers Period\\nFrom date (admission) / Through date (discharge)\"]\n  Header --> FL14[\"FL14: Priority of Admission\\n1=Emergency 2=Urgent 3=Elective\"]\n  Header --> FL15[\"FL15: Point of Origin\\n1=Community 4=Transfer from hospital\"]\n  Header --> FL17[\"FL17: Discharge Status\\n01=Home 02=Transfer 20=Died 30=Still patient\"]\n  Header --> FL67[\"FL67: Principal Dx + POA\\nY=Present on admit N=Not present\\nU=Unknown W=Undetermined\"]\n  Header --> FL67AQ[\"FL67A-Q: Secondary Dx + POA\\nUp to 17 secondary diagnoses\\neach with own POA flag\"]\n  Header --> FL69[\"FL69: Admitting Diagnosis\\nSuspected reason at admission\"]\n  Header --> FL71[\"FL71: MS-DRG\\nCase-mix payment group\"]\n  Header --> FL74[\"FL74 + FL74a-e: ICD-10-PCS Procedures\\n+ dates (inpatient only)\"]\n  Header --> FL76[\"FL76-79: Attending / Operating NPIs\\nProvider attribution\"]\n\n  Lines --> RevCode[\"FL42: Revenue Code\\n0450=ER 0120=Room&Board 0260=IV Therapy\"]\n  Lines --> HCPCS[\"FL44: HCPCS/CPT\\nOutpatient procedure code lives HERE\\n(not on inpatient header)\"]\n  Lines --> Units[\"FL46: Units of Service\\nDays, sessions, doses\"]\n  Lines --> Charges[\"FL47: Total Charges\\nBILLED ≠ Cost ≠ Paid\\nUse allowed amount for cost\"]",
        "caption": "Anatomy of the UB-04 institutional claim. The header contains the episode-level fields (TOB, statement period, discharge status, diagnoses with POA) that drive most RWE episode and outcome algorithms. The revenue code line level carries procedure codes for outpatient claims (FL44 HCPCS), units of service (FL46), and billed charges (FL47). Charges are not costs.",
        "alt_text": "Flowchart showing the UB-04 claim splitting into a header and revenue code lines. The header branches to FL4 Type of Bill, FL6 statement period, FL14/15 admission type and source, FL17 discharge status, FL67/67A-Q principal and secondary diagnoses with POA indicators, FL69 admitting diagnosis, FL71 DRG, FL74 procedures, and FL76-79 provider NPIs. The revenue code lines branch to FL42 revenue code, FL44 HCPCS/CPT, FL46 units, and FL47 charges.",
        "source_type": "illustrative",
        "source_citations": [
          "tyree-2006"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "part_of",
        "target_slug": "claims-analysis",
        "notes": "The UB-04/837I is the institutional claims instrument that makes up the inpatient and outpatient facility file tier of any administrative claims database; understanding its field structure is prerequisite to correct claims-analysis episode construction, cost measurement, and outcome ascertainment."
      },
      {
        "relation_type": "used_with",
        "target_slug": "icd-10-pcs-procedure-coding",
        "notes": "ICD-10-PCS procedure codes are reported at the UB-04 claim header (FL74) for inpatient institutional claims; FL74 and its date fields are how inpatient surgical procedures are identified and sequenced in RWE."
      },
      {
        "relation_type": "used_with",
        "target_slug": "ms-drg-classification",
        "notes": "The MS-DRG assigned to an inpatient institutional claim (FL71) is derived from the principal diagnosis, POA-adjusted secondary diagnoses, principal procedure, and discharge status fields on the UB-04; it is the primary case-mix adjuster and payment classifier for Medicare inpatient stays."
      },
      {
        "relation_type": "used_with",
        "target_slug": "npi-national-provider-identifier",
        "notes": "The attending physician (FL76), operating physician (FL77), and other provider NPIs on the UB-04 are the mechanism for institutional claim-to-provider attribution, linking hospital stays to responsible clinicians for practice variation and outcomes research."
      },
      {
        "relation_type": "tradeoff_with",
        "target_slug": "cms-1500-professional-claim-fields",
        "notes": "The UB-04 is the facility side of the hospital encounter; the CMS-1500 is the professional side. They are complementary, not redundant. The UB-04 owns episode boundaries, discharge status, POA, and inpatient procedure codes; the CMS-1500 owns physician-attributed procedure coding (CPT), Place-of-Service, and provider-level work. Do not use Place-of-Service codes from the CMS-1500 to classify care setting on an institutional claim."
      },
      {
        "relation_type": "tradeoff_with",
        "target_slug": "place-of-service-codes",
        "notes": "Place-of-Service codes appear only on CMS-1500 professional claims, not on UB-04 institutional claims. For institutional claims, care setting is derived from TOB (FL4) and revenue codes (FL42) — never from POS. Using POS for institutional claim setting classification is an error."
      },
      {
        "relation_type": "used_with",
        "target_slug": "claims-outcome-algorithm-ppv-sensitivity-rwe",
        "notes": "Outcome algorithms validated on institutional claims rely heavily on UB-04 fields: principal diagnosis (FL67), discharge status (FL17), POA indicators, and revenue codes. PPV and sensitivity of diagnosis-based outcome algorithms depend on the claim type (inpatient vs outpatient), the diagnosis position (FL67 principal vs FL67A-Q secondary), and whether POA filtering was applied."
      },
      {
        "relation_type": "used_with",
        "target_slug": "healthcare-costs-pppm-pppy-pmpm",
        "notes": "Facility costs in PPPM/PPPY analyses are derived from the allowed-amount and paid-amount fields on adjudicated institutional claims, decomposed by revenue code (FL42) for service-line detail. Charges from FL47 are not valid cost inputs; this is the most common methodological error in institutional claims cost analyses."
      },
      {
        "relation_type": "used_with",
        "target_slug": "diagnosis-position-and-qualifiers",
        "notes": "The UB-04's FL67 principal diagnosis and FL67A-Q secondary diagnosis fields, each with their POA indicator, are the institutional claim instantiation of diagnosis position rules; the POA flag adds a temporal qualifier (present on admission vs acquired in-hospital) that changes how each diagnosis is used in outcome algorithms and comorbidity scoring."
      }
    ],
    "aliases": [
      "UB-04",
      "CMS-1450",
      "837I",
      "institutional claim",
      "type of bill",
      "POA indicator",
      "form locator",
      "inpatient facility claim"
    ],
    "complexity": "foundational",
    "regulatory_relevance": [
      "cms",
      "ahrq"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "umbrella-review",
    "name": "Umbrella Review (Review of Systematic Reviews)",
    "short_definition": "A review whose unit of inclusion is existing systematic reviews and meta-analyses on a topic, synthesizing them rather than primary studies; its distinctive methodological problems are quantifying overlap of shared primary studies (corrected covered area), reconciling discordant reviews, and appraising the methodological quality of the included reviews (e.g., AMSTAR-2).",
    "long_description": "**Core idea.** An **umbrella review** (also \"review of reviews\" or \"overview of systematic reviews\") sits one level above\nthe ordinary systematic review: its included studies are themselves **systematic reviews and meta-analyses** addressing a\nbroad question — typically \"what is the totality of evidence about intervention X across many outcomes,\" or \"across many\ninterventions for condition Y.\" Where a systematic review answers a focused PICO by synthesizing primary studies, an\numbrella review maps and synthesizes the *review-level* evidence: it tabulates the effect estimates, certainty, and\nquality of each constituent review, compares them, and produces a high-level summary for decision-makers who need the\nlandscape rather than a single pooled estimate. It is the natural top tier of the evidence-synthesis hierarchy and the\nformat JBI, Cochrane (as \"overviews\"), and HTA bodies use to consolidate a mature literature.\n\n**The three methodological problems that define the format.** (1) **Overlap of primary studies.** Because constituent\nreviews on the same topic draw on overlapping sets of primary studies, the same primary trial can be counted in several\nincluded reviews, double-counting its evidence and falsely inflating apparent corroboration. The standard quantification\nis the **corrected covered area (CCA)** of Pieper et al.: build a matrix of primary studies (rows) by reviews (columns),\ncount N = total cells that are \"present,\" r = number of distinct primary studies, c = number of reviews, and compute\nCCA = (N − r) / (r·c − r), interpreted as slight (0-5%), moderate (6-10%), high (11-15%), or very high (>15%) overlap.\nCCA must be reported and, when high, the umbrella review should avoid pooling across overlapping reviews or should use\nthe primary studies directly. 
(2) **Discordance.** Reviews of the same question can reach different conclusions because\nof different inclusion criteria, search dates, pooling methods, or risk-of-bias handling; the umbrella review must\n*reconcile* discordance (e.g., the Jadad algorithm for choosing among discordant meta-analyses) rather than silently\naveraging it away. (3) **Quality appraisal of the included reviews.** The credibility of an umbrella review is bounded\nby the methodological quality of the reviews it includes, so each constituent review is appraised with a validated tool —\nmost commonly **AMSTAR-2** (16 items, with seven critical domains, yielding an overall confidence rating of high,\nmoderate, low, or critically low) — and low-quality reviews are down-weighted or excluded.\n\n**Pros, cons, and trade-offs.**\n- **vs a single de novo systematic review (`systematic-review`):** An umbrella review is far faster to assemble a broad,\n  multi-outcome or multi-intervention landscape and leverages work already done, but it inherits every limitation of the\n  reviews it includes (outdated searches, flawed pooling, missing primary studies) and cannot be more current or more\n  rigorous than its inputs. **Prefer an umbrella review** when many good systematic reviews already exist and the\n  decision needs breadth; **prefer a de novo systematic review** when the question is focused, the existing reviews are\n  stale or low-quality, or a single defensible pooled estimate is required.\n- **vs a meta-analysis of primary studies (`meta-analysis-obs`):** A meta-analysis pools primary studies into one\n  estimate with formal heterogeneity assessment; an umbrella review generally does *not* re-pool (doing so across\n  overlapping reviews double-counts evidence) and instead summarizes review-level estimates and certainty. 
**Prefer a\n  fresh meta-analysis** when a precise pooled effect is the goal and the primary studies are accessible; **prefer an\n  umbrella review** when the deliverable is a credibility-graded map across many estimates.\n- **vs a network meta-analysis (`network-meta-analysis`):** An NMA produces coherent comparative rankings across multiple\n  treatments from primary trial data; an umbrella review describes what existing reviews (which may include NMAs) found.\n  **Prefer an NMA** for a single integrated comparative-effectiveness answer; **prefer an umbrella review** to survey and\n  appraise the body of comparative reviews.\n\n**When to use.** A mature literature with multiple systematic reviews/meta-analyses on related questions; a need to map\nthe evidence across many outcomes (efficacy and a full safety profile) or many interventions for one condition; a\ndecision-maker (HTA body, guideline panel, payer) who needs a credibility-graded synthesis quickly; a scoping step before\ncommissioning a new review, to establish what is already known and where the gaps and the low-quality reviews are. Always\nreport a PRISMA-style flow of reviews screened/included, the AMSTAR-2 rating of each included review, the CCA overlap, and\nan explicit discordance-resolution rule.\n\n**When NOT to use — and when it is actively misleading or dangerous.**\n- **Few or low-quality constituent reviews.** With only one or two reviews, or only critically-low AMSTAR-2 reviews, an\n  umbrella review adds an authoritative-looking layer over weak evidence; a de novo systematic review is the honest\n  choice. Dressing thin evidence in umbrella-review format is the most dangerous misuse.\n- **High overlap pooled as if independent.** If constituent reviews share most primary studies (high/very-high CCA) and\n  the umbrella review pools or counts them as corroborating, it double-counts the same trials and manufactures false\n  consensus. 
Report CCA and refuse to pool overlapping reviews; go to the primary studies instead.\n- **Ignoring search-date and discordance differences.** Treating an outdated review and a current one as equivalent, or\n  averaging discordant conclusions without a reconciliation rule, produces a summary that reflects neither the current\n  evidence nor any coherent estimate.\n- **Re-deriving a pooled effect without the primary data.** An umbrella review that fabricates a meta-analytic estimate\n  from review-level summaries (rather than re-extracting primary studies) misrepresents review-level description as\n  primary synthesis.\n\n**Data-source operational depth (RWE context).** Umbrella reviews increasingly synthesize *observational/real-world*\nevidence, where the constituent reviews' quality and overlap behave differently by underlying data type.\n- **Claims-based reviews:** Constituent reviews of claims studies often share the same large administrative databases\n  (e.g., several reviews each including the same Medicare or commercial-claims analyses), so primary-study overlap can be\n  high even when the reviews appear independent; compute CCA at the primary-study level and note shared databases as a\n  further, study-design source of correlated evidence.\n- **EHR-based reviews:** Reviews of EHR studies must be appraised for whether the constituent reviews addressed\n  phenotyping validity, informative presence, and site heterogeneity; an umbrella review should record whether each\n  included review applied an RWE-appropriate risk-of-bias tool (e.g., ROBINS-I) rather than only a trial-oriented one.\n- **Registry / linked-data reviews:** Registry-based reviews tend to be more homogeneous in their primary sources;\n  overlap is easier to assess but reporting lag and registry completeness across the constituent reviews should be\n  tabulated so the umbrella synthesis does not mix mature and immature evidence.",
    "primary_category": "Study_Design",
    "tags": [
      "umbrella-review",
      "overview-of-reviews",
      "evidence-synthesis",
      "corrected-covered-area",
      "amstar-2",
      "discordance",
      "systematic-review"
    ],
    "applies_to_study_types": [
      "systematic_review",
      "meta_analysis",
      "evidence_synthesis",
      "comparative_effectiveness"
    ],
    "data_sources": [
      "secondary",
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1097/XEB.0000000000000055",
        "url": "https://doi.org/10.1097/XEB.0000000000000055",
        "citation_text": "Aromataris E, Fernandez R, Godfrey CM, Holly C, Khalil H, Tungpunkom P. Summarizing systematic reviews: methodological development, conduct and reporting of an umbrella review approach. International Journal of Evidence-Based Healthcare. 2015;13(3):132-140.",
        "year": 2015,
        "authors_short": "Aromataris et al.",
        "notes": "The JBI methodological statement defining the umbrella-review approach - unit of inclusion, conduct, and reporting - and the canonical reference for the format."
      },
      {
        "role": "explain",
        "doi": "10.1136/bmj.j4008",
        "url": "https://doi.org/10.1136/bmj.j4008",
        "citation_text": "Shea BJ, Reeves BC, Wells G, et al. AMSTAR 2: a critical appraisal tool for systematic reviews that include randomised or non-randomised studies of healthcare interventions, or both. BMJ. 2017;358:j4008.",
        "year": 2017,
        "authors_short": "Shea et al.",
        "notes": "The AMSTAR-2 instrument used to appraise the methodological quality of the reviews included in an umbrella review, with its seven critical domains and overall-confidence rating."
      },
      {
        "role": "explain",
        "doi": "10.1016/j.jclinepi.2013.11.007",
        "url": "https://doi.org/10.1016/j.jclinepi.2013.11.007",
        "citation_text": "Pieper D, Antoine SL, Mathes T, Neugebauer EAM, Eikermann M. Systematic review finds overlapping reviews were not mentioned in every other overview. Journal of Clinical Epidemiology. 2014;67(4):368-375.",
        "year": 2014,
        "authors_short": "Pieper et al.",
        "notes": "Defines and validates the corrected covered area (CCA) measure of primary-study overlap among included reviews, the standard overlap quantification for umbrella reviews and overviews."
      },
      {
        "role": "demonstrate",
        "doi": "10.1136/ebmental-2018-300014",
        "url": "https://doi.org/10.1136/ebmental-2018-300014",
        "citation_text": "Fusar-Poli P, Radua J. Ten simple rules for conducting umbrella reviews. Evidence-Based Mental Health. 2018;21(3):95-100.",
        "year": 2018,
        "authors_short": "Fusar-Poli & Radua",
        "notes": "A practical, worked set of rules for conducting umbrella reviews - credibility grading of evidence, handling overlap and discordance, and reporting - widely used as the operational template."
      },
      {
        "role": "use",
        "doi": "10.1136/bmjment-2022-300534",
        "url": "https://doi.org/10.1136/bmjment-2022-300534",
        "citation_text": "Gosling CJ, Solanes A, Fusar-Poli P, Radua J. metaumbrella: the first comprehensive suite to perform data analysis in umbrella reviews with stratification of the evidence. BMJ Mental Health. 2023;26(1):e300534.",
        "year": 2023,
        "authors_short": "Gosling et al.",
        "notes": "Describes the metaumbrella software for umbrella-review analysis, including evidence stratification/credibility classes; the tool most umbrella reviews now use for the quantitative layer."
      }
    ],
    "plain_language_summary": "An umbrella review is a study that reads and grades a collection of existing systematic reviews — think of it as a review of reviews — to give a bird's-eye picture of what the research says across many outcomes or treatments at once. Instead of going back to raw patient data, the authors gather every published systematic review on a broad topic, rate the quality of each one using a checklist called AMSTAR, and look for where the reviews agree or disagree. The result is a single ranked summary that tells a guideline committee or health-plan which parts of the evidence are solid and which rest on shaky foundations.",
    "key_terms": [
      {
        "term": "systematic review",
        "definition": "A study that searches all available research on a specific question, selects the studies that meet defined quality standards, and summarizes their findings in a structured way."
      },
      {
        "term": "AMSTAR",
        "definition": "A quality-rating checklist (currently AMSTAR-2) used to score how rigorously a systematic review was conducted, yielding a confidence label of high, moderate, low, or critically low."
      },
      {
        "term": "corrected covered area (CCA)",
        "definition": "A number that measures how many of the same underlying studies appear in more than one of the reviews you collected; high overlap means those reviews are not truly independent evidence."
      },
      {
        "term": "discordance",
        "definition": "When two or more reviews on the same question reach different conclusions, often because they used different time windows, different inclusion rules, or different statistical approaches."
      },
      {
        "term": "evidence map",
        "definition": "The output of an umbrella review: a table or figure showing each outcome or treatment, its effect size from constituent reviews, its quality rating, and how consistent the reviews were."
      }
    ],
    "worked_example": {
      "scenario": "A health-technology assessment team needs to advise a payer on whether to cover a new diabetes drug. Three separate systematic reviews have already been published — one on blood-sugar control, one on heart outcomes, and one on kidney outcomes. Rather than redo all three reviews from scratch, the team runs an umbrella review: they locate those three systematic reviews, rate each with AMSTAR, check how many of the same clinical trials appear in more than one review (overlap), and produce a single graded summary table.",
      "dataset": {
        "caption": "The three systematic reviews the team collected, with their key characteristics.",
        "columns": [
          "review_id",
          "outcome_focus",
          "n_primary_trials",
          "amstar_rating",
          "pooled_effect_direction"
        ],
        "rows": [
          [
            "SR-01",
            "blood-sugar (HbA1c)",
            12,
            "high",
            "favors drug"
          ],
          [
            "SR-02",
            "heart outcomes (MACE)",
            8,
            "moderate",
            "favors drug"
          ],
          [
            "SR-03",
            "kidney outcomes (eGFR)",
            5,
            "low",
            "inconclusive"
          ]
        ]
      },
      "steps": [
        "List every systematic review that was found and record its outcome focus, number of primary trials, and AMSTAR rating — this is the raw material for the umbrella.",
        "Build an overlap matrix: note which individual clinical trials appear in more than one review. Here, 4 trials appear in both SR-01 and SR-02, giving a corrected covered area of 4 / (25 - 3) = 0.18, which is very high overlap between those two reviews.",
        "Because SR-01 and SR-02 share many of the same trials, their effect estimates cannot be treated as two independent confirmations; they partly re-count the same evidence.",
        "Rate each review with AMSTAR: SR-01 scores high confidence, SR-02 moderate, SR-03 low. The low rating on SR-03 means the kidney-outcome conclusion should be flagged as unreliable.",
        "Check for discordance: SR-01 and SR-02 both favor the drug, so no reconciliation is needed there; SR-03 is inconclusive but its low AMSTAR rating explains the uncertainty rather than contradicting the others.",
        "Assemble the evidence map: blood-sugar and heart outcomes show consistent benefit across high/moderate-quality reviews; kidney outcomes remain uncertain and a new, better review is warranted."
      ],
      "result": "The umbrella review concludes: strong consistent evidence (2 reviews, high/moderate quality) supports the drug for blood-sugar and heart outcomes; kidney evidence is inconclusive because the only available review is low quality. The payer has a graded, auditable basis for a coverage decision — not just a single number."
    },
    "prerequisites": [
      "systematic-review",
      "meta-analysis-obs"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Intervention-focused umbrella review",
        "description": "Synthesizes reviews of a single intervention (or intervention class) across many outcomes, mapping efficacy and the full safety profile from existing meta-analyses with per-outcome certainty grading.",
        "edge_cases": [
          "Different constituent reviews may use different effect measures (RR vs OR vs SMD); harmonize or report by measure rather than forcing a common scale.",
          "Safety outcomes are often covered by fewer/weaker reviews than efficacy; flag asymmetric evidence quality."
        ],
        "data_source_notes": "claims/ehr: note whether constituent RWE reviews used an observational risk-of-bias tool (ROBINS-I) rather than a trial-oriented one."
      },
      {
        "name": "Condition-focused umbrella review (many interventions)",
        "description": "Synthesizes reviews of multiple competing interventions for one condition to map comparative options; overlap and discordance handling are paramount because interventions are compared across heterogeneous reviews.",
        "edge_cases": [
          "Indirect comparisons across separate pairwise reviews are not a network meta-analysis; do not present them as coherent rankings.",
          "High primary-study overlap across intervention reviews must be reported (CCA) and pooling avoided."
        ],
        "data_source_notes": "registry/linked: tabulate registry completeness and search dates so mature and immature evidence are not mixed."
      },
      {
        "name": "Credibility-graded (evidence-stratified) umbrella review",
        "description": "Each association is assigned to an evidence class (e.g., convincing, highly suggestive, suggestive, weak, non-significant) using criteria on number of cases, p-value, heterogeneity, prediction intervals, small-study and excess-significance bias, as implemented in the metaumbrella suite.",
        "edge_cases": [
          "The stratification thresholds are conventions, not bright lines; report the inputs (cases, I-squared, prediction interval) alongside the class.",
          "Classes can change with one updated constituent review; state the search date."
        ],
        "data_source_notes": "secondary: evidence classes summarize review-level meta-analytic outputs and do not re-derive effects from primary data."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "systematic-review",
        "pros_of_this": "Rapidly assembles a broad, multi-outcome or multi-intervention landscape by reusing existing reviews, with explicit quality appraisal of each included review.",
        "cons_of_this": "Cannot be more current or more rigorous than its constituent reviews; inherits their stale searches, flawed pooling, and missing primary studies.",
        "when_to_prefer": "When many good systematic reviews already exist and breadth is needed; choose a de novo systematic review for a focused question or when existing reviews are stale or low quality."
      },
      {
        "compared_to": "meta-analysis-obs",
        "pros_of_this": "Maps and grades many review-level estimates and their certainty without the double-counting that pooling overlapping reviews would cause.",
        "cons_of_this": "Does not produce a single precise pooled effect; review-level summaries are coarser than primary-study synthesis.",
        "when_to_prefer": "When a credibility-graded landscape is the deliverable; run a fresh meta-analysis of primary studies when a precise pooled estimate is the goal."
      },
      {
        "compared_to": "network-meta-analysis",
        "pros_of_this": "Surveys and appraises the body of comparative reviews (which may include NMAs) across a topic.",
        "cons_of_this": "Cannot deliver coherent comparative rankings; presenting indirect comparisons across separate reviews as a network is a misuse.",
        "when_to_prefer": "For an evidence map across comparative reviews; use an NMA when one integrated comparative-effectiveness answer with coherent rankings is required."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Constituent reviews of claims studies frequently share the same large administrative databases, so primary-study overlap (and correlated evidence) can be high even when reviews look independent; compute CCA at the primary-study level and record shared databases as an additional source of dependence.",
      "ehr": "Appraise whether each included EHR-study review addressed phenotyping validity, informative presence, and site heterogeneity, and whether it used an observational risk-of-bias tool (ROBINS-I) rather than a trial-oriented one.",
      "registry": "Registry-based reviews tend to be more homogeneous in primary sources, so overlap is easier to assess; tabulate registry completeness and reporting lag and search dates so mature and immature evidence are not mixed in the synthesis.",
      "linked": "Linked-data reviews combine the above; reconcile differing data-linkage eras across constituent reviews and note that linkage selection limits generalizability of the synthesized estimates."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\n\ndef corrected_covered_area(matrix: np.ndarray) -> dict:\n    \"\"\"CCA of primary-study overlap across included reviews (Pieper et al. 2014).\n\n    matrix: binary array, rows = distinct primary studies, cols = included reviews,\n            1 if the review included that primary study.\n    \"\"\"\n    M = np.asarray(matrix, dtype=int)\n    r, c = M.shape          # r distinct primary studies, c reviews\n    N = int(M.sum())        # total 'present' cells across the matrix\n    denom = r * c - r       # maximum possible additional coverage beyond one count each\n    cca = (N - r) / denom if denom > 0 else 0.0\n    if   cca <= 0.05: band = \"slight (0-5%)\"\n    elif cca <= 0.10: band = \"moderate (6-10%)\"\n    elif cca <= 0.15: band = \"high (11-15%)\"\n    else:             band = \"very high (>15%)\"\n    return {\"distinct_primary_studies\": r, \"reviews\": c,\n            \"present_cells\": N, \"cca\": round(cca, 4), \"interpretation\": band}\n\n# Worked example: 6 primary studies x 3 included reviews. A study counted in several reviews\n# is the double-counting CCA quantifies.\ncitation_matrix = np.array([\n    [1, 1, 0],   # study A in reviews 1 and 2\n    [1, 0, 1],   # study B in reviews 1 and 3\n    [0, 1, 1],   # study C in reviews 2 and 3\n    [1, 1, 1],   # study D in all three\n    [0, 1, 0],   # study E in review 2 only\n    [1, 0, 0],   # study F in review 1 only\n])\nprint(corrected_covered_area(citation_matrix))",
        "description": "Compute the corrected covered area (CCA) of primary-study overlap among the systematic reviews included in an umbrella\nreview (Pieper et al. 2014). Input is a binary citation matrix: rows = distinct primary studies, columns = included\nreviews, cell = 1 if that review included that primary study. CCA = (N - r) / (r*c - r), where N is the number of\n'present' cells, r is the number of distinct primary studies (rows), and c is the number of reviews (columns). The\nfunction returns CCA and the standard slight/moderate/high/very-high interpretation.",
        "dependencies": [
          "numpy"
        ],
        "source_citations": [
          "pieper-2014"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "# (1) Corrected covered area from a binary citation matrix (rows = primary studies, cols = reviews).\ncca <- function(M) {\n  r <- nrow(M); c <- ncol(M); N <- sum(M)\n  denom <- r * c - r\n  value <- if (denom > 0) (N - r) / denom else 0\n  band <- cut(value, breaks = c(-Inf, .05, .10, .15, Inf),\n              labels = c(\"slight\", \"moderate\", \"high\", \"very high\"))\n  list(distinct_primary = r, reviews = c, present = N,\n       cca = round(value, 4), interpretation = as.character(band))\n}\ncitation_matrix <- matrix(c(1,1,0, 1,0,1, 0,1,1, 1,1,1, 0,1,0, 1,0,0),\n                          nrow = 6, byrow = TRUE)\nprint(cca(citation_matrix))\n\n# (2) Quantitative umbrella-review analysis with metaumbrella: stratify evidence into classes\n#     (Class I convincing ... weak / non-significant) from a data frame of constituent meta-analyses.\nlibrary(metaumbrella)\n# 'umbrella_input' columns include: factor, author, year, measure (e.g., 'OR'), value, ci_lo, ci_up,\n#  n_cases, n_controls, etc. (see ?umbrella for the required schema).\n# res <- umbrella(umbrella_input)\n# strat <- add.evidence(res, criteria = \"Ioannidis\")   # assign credibility classes\n# summary(strat)",
        "description": "Two complementary R steps for an umbrella review. (1) Compute the corrected covered area from a primary-study x review\ncitation matrix (Pieper et al. 2014). (2) The metaumbrella package (Gosling et al. 2023) performs the quantitative\numbrella-review layer - per-association meta-analyses and stratification of the evidence into credibility classes -\nfrom a data frame of constituent meta-analyses. Here we show the CCA computation in base R plus the metaumbrella call\npattern.",
        "dependencies": [
          "metaumbrella"
        ],
        "source_citations": [
          "pieper-2014",
          "gosling-2023"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* work.cites: one row per inclusion, columns primary_study and review. */\nproc sql noprint;\n  select count(*)                  into :N trimmed from work.cites;            /* present cells   */\n  select count(distinct primary_study) into :r trimmed from work.cites;       /* distinct studies */\n  select count(distinct review)        into :c trimmed from work.cites;        /* reviews          */\nquit;\n\ndata cca;\n  N = &N; r = &r; c = &c;\n  denom = r * c - r;\n  cca = ifn(denom > 0, (N - r) / denom, 0);\n  length interpretation $12;\n  if      cca <= 0.05 then interpretation = 'slight';\n  else if cca <= 0.10 then interpretation = 'moderate';\n  else if cca <= 0.15 then interpretation = 'high';\n  else                     interpretation = 'very high';\nrun;\n\nproc print data=cca noobs;\n  var N r c cca interpretation;\n  title 'Corrected covered area of primary-study overlap across included reviews';\nrun;",
        "description": "Compute the corrected covered area (Pieper et al. 2014) in SAS from a long-format citation table work.cites with one row\nper (primary_study, review) inclusion. PROC SQL counts the present cells (N), the distinct primary studies (r), and the\nreviews (c); a DATA step applies CCA = (N - r) / (r*c - r) and assigns the slight/moderate/high/very-high band. No\nspecialized PROC is needed - CCA is a closed-form ratio of counts.",
        "dependencies": [],
        "source_citations": [
          "pieper-2014"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Q[Broad question across many<br/>outcomes or interventions] --> Search[Search for systematic reviews<br/>and meta-analyses]\n  Search --> PRISMA[PRISMA flow of reviews<br/>screened and included]\n  PRISMA --> Appraise[Appraise each review<br/>AMSTAR-2 confidence rating]\n  PRISMA --> Overlap[Quantify primary-study overlap<br/>corrected covered area CCA]\n  Appraise --> Discord{Reviews discordant?}\n  Overlap --> Discord\n  Discord -->|Yes| Reconcile[Apply discordance-resolution rule<br/>do NOT silently average]\n  Discord -->|No| Synth[Synthesize review-level estimates<br/>+ certainty; avoid re-pooling overlap]\n  Reconcile --> Synth\n  Synth --> Out[Credibility-graded evidence map]",
        "caption": "Umbrella-review workflow: search and screen reviews (PRISMA), appraise each with AMSTAR-2, quantify overlap with the corrected covered area, reconcile discordance explicitly, and synthesize review-level estimates without re-pooling overlapping evidence.",
        "alt_text": "Flowchart from a broad question through searching for reviews, PRISMA screening, AMSTAR-2 appraisal and CCA overlap, discordance reconciliation, and synthesis into a credibility-graded evidence map.",
        "source_type": "illustrative",
        "source_citations": [
          "aromataris-2015"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  M[Decision: synthesize a mature literature] --> N{Many good systematic reviews exist?}\n  N -->|No / only low-quality| SR[Conduct a de novo systematic review]\n  N -->|Yes| O{Need breadth across outcomes/interventions<br/>or a single pooled effect?}\n  O -->|Single pooled effect| MA[Meta-analysis of primary studies]\n  O -->|Breadth + credibility map| U[Umbrella review<br/>AMSTAR-2 + CCA + discordance rule]\n  U --> P{High primary-study overlap CCA?}\n  P -->|Yes| NoPool[Report CCA; do NOT pool overlapping reviews]\n  P -->|No| Grade[Grade evidence by review-level certainty]",
        "caption": "Decision logic for choosing an umbrella review versus a de novo systematic review or a fresh meta-analysis, and the overlap check that governs whether constituent reviews may be combined.",
        "alt_text": "Decision tree on whether good reviews exist and whether breadth or a pooled effect is needed, leading to a de novo review, a meta-analysis, or an umbrella review with a corrected-covered-area overlap gate.",
        "source_type": "illustrative",
        "source_citations": [
          "pieper-2014"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "part_of",
        "target_slug": "systematic-review",
        "notes": "An umbrella review is a systematic review whose unit of inclusion is other systematic reviews; it applies the same PRISMA search/screening discipline one level up the evidence hierarchy."
      },
      {
        "relation_type": "see_also",
        "target_slug": "meta-analysis-obs",
        "notes": "Umbrella reviews summarize and grade meta-analytic estimates from constituent reviews rather than re-pooling primary studies, since pooling overlapping reviews would double-count shared trials."
      },
      {
        "relation_type": "alternative_to",
        "target_slug": "network-meta-analysis",
        "notes": "An NMA delivers coherent comparative rankings from primary trial data; an umbrella review surveys and appraises the body of comparative reviews (which may include NMAs) but does not itself produce a network."
      }
    ],
    "aliases": [
      "umbrella review",
      "overview of reviews",
      "review of systematic reviews",
      "overview of systematic reviews",
      "review of reviews"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "hta",
      "journal",
      "ema"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "unmeasured-confounding-probabilistic-bias-analysis-rwe",
    "name": "Unmeasured Confounding Probabilistic Bias Analysis",
    "short_definition": "A quantitative bias analysis that assigns probability distributions to the prevalence of a hypothetical unmeasured confounder in each exposure group and to its association with the outcome, then Monte-Carlo samples a bias-adjusted effect distribution that propagates both bias uncertainty and random sampling error into a simulation interval.",
    "long_description": "**Probabilistic bias analysis (PBA) for unmeasured confounding** replaces the boilerplate sentence \"residual confounding\nmay remain\" with a quantified counterfactual: *if* an unmeasured confounder U with these properties existed, *this* is how\nmuch the observed estimate would move and *this* is the resulting interval. It specifies (1) the prevalence of U among the\nexposed and among the unexposed, (2) the U-outcome association conditional on measured covariates, and (3) probability\ndistributions encoding uncertainty in each. A bias factor is computed for each Monte-Carlo draw and the observed estimate is\ndivided by it; repeating across thousands of draws yields a distribution of bias-adjusted estimates. Done correctly, the\nobserved estimate is itself resampled from its sampling distribution each iteration, so the output **simulation interval**\nreflects bias uncertainty *and* random error — not bias uncertainty alone.\n\n**Core conceptual distinction.** Three ideas are separable and must not be conflated. (1) *Deterministic vs probabilistic*:\na deterministic (simple) sensitivity analysis plugs in single point values for the bias parameters and reports one adjusted\nestimate per scenario; PBA puts priors on those parameters and integrates over them, producing an interval that communicates\nhow much the uncertainty in U itself matters. (2) *PBA vs the E-value*: the E-value reports the minimum joint\nexposure-U and U-outcome association strength that could explain away the observed estimate — a single threshold requiring\nalmost no input — whereas PBA is assumption-rich and scenario-specific, demanding plausible, ideally externally anchored\ndistributions and returning a full corrected distribution. (3) *Bias adjustment vs sampling error*: the bias factor moves the\npoint estimate; the resampled observed estimate (drawn from Normal(beta_hat, SE) on the log scale each iteration) widens the\ninterval. 
The estimand is unchanged by PBA — if the primary analysis estimated a comparative hazard ratio (drug A vs drug B,\nintention-to-treat on initiation), PBA reports the *bias-adjusted* version of that same estimand; it does not convert a hazard\nratio into a risk ratio or repair a mis-specified target.\n\n**Pros, cons, and trade-offs.**\n- **vs the E-value (Ding & VanderWeele):** PBA returns a corrected distribution and lets you encode that, say, severe frailty\n  is twice as common in the older-drug arm; the E-value returns one easy-to-compute bound and nothing more. Cost: PBA's\n  answer is only as credible as its priors, and a poorly anchored prior manufactures false precision. **Prefer the E-value**\n  as a fast first screen and a referee-friendly headline; **prefer PBA** when you can anchor priors empirically (a validation\n  substudy, a linked EHR/registry subset, published prevalences) and the decision turns on the *magnitude* of plausible bias.\n- **vs deterministic / tornado sensitivity analysis:** PBA propagates parameter uncertainty into a single interpretable\n  interval instead of a grid of point scenarios. Cost: it hides the parameter-by-parameter contribution that a tornado plot\n  makes explicit, and it tempts readers to treat the simulation interval as a confidence interval. 
**Prefer deterministic**\n  for transparent one-parameter-at-a-time reasoning; **prefer PBA** when several bias parameters are jointly uncertain.\n- **vs external adjustment / propensity-score calibration (Schneeweiss; Stürmer):** when a validation subsample actually\n  measures U, regression calibration or PS calibration uses that information directly and is preferable to assumed priors.\n  **Prefer external adjustment** when you have a measured-U subsample; **prefer PBA** when U is unmeasured everywhere and you\n  must reason from external literature.\n- **vs negative-control / proximal methods:** negative controls detect and (with proximal g-computation) can adjust residual\n  confounding using observed proxies; PBA needs no proxy but assumes the bias structure. They are complementary, not rivals.\n\n**When to use.** As a pre-specified sensitivity analysis whenever an effect estimate that informs a regulatory, HTA, or\nclinical decision could plausibly be explained by a confounder that claims or EHR data do not capture — frailty, performance\nstatus, smoking, disease severity, BMI, socioeconomic barriers, over-the-counter co-medication. It is most defensible when\nthe prevalence and effect priors can be *anchored* to external data (a SEER-Medicare or EHR-linked substudy, a registry, or\npublished estimates) rather than guessed, and when the result is reported alongside, never instead of, the primary estimate.\n\n**When NOT to use — and when it is actively misleading or dangerous.**\n- **Priors pulled from thin air.** If you cannot anchor the U-prevalence and U-outcome priors to data, PBA produces a precise-\n  looking interval built on invented numbers. This is worse than an honest E-value because the simulation interval *looks*\n  like statistical inference and will be read as such. 
Either anchor the priors or use the E-value.\n- **Reporting the simulation interval as a confidence interval.** A PBA interval is a Bayesian-flavored uncertainty band over\n  an assumed bias model; it is not frequentist coverage. Labeling it \"95% CI\" misleads referees and decision-makers.\n- **Using PBA to rescue a broken design.** PBA addresses *unmeasured* confounding only. It cannot fix immortal time,\n  depletion of susceptibles, selection into the cohort, outcome misclassification, or confounding by indication that an\n  active comparator would have removed. Reaching for PBA instead of fixing the design is a category error — adjust the design\n  first (active comparator, new-user, correct time zero), then quantify what unmeasured confounding could still do.\n- **A single binary U standing in for a tangle of correlated unmeasured factors.** A binary U with a single risk ratio often\n  *understates* realistic multivariate confounding; treat single-U PBA as a lower bound on plausible bias, not the truth.\n- **The bias is differential by an axis you ignored.** If U operates differently across age, calendar time, or data segment\n  (e.g., frailty matters more in the very old), a marginal PBA can both under- and over-correct.\n\n**Data-source operational depth.**\n- **Claims (FFS or MA):** Claims rarely measure the canonical unmeasured confounders (frailty, ADLs, performance status,\n  smoking, BMI), which is precisely why PBA is so common here. Anchor priors from a *linked* subset — SEER-Medicare, a\n  claims-EHR link, or a registry overlap — where U is observed, then transport those prevalences to the claims-only cohort.\n  Failure modes: (a) **Medicare Advantage person-time lacks fee-for-service claims**, so a confounder proxy built from\n  encounter data is differentially missing for MA enrollees — restrict the anchoring substudy and the main cohort to\n  consistent benefit types or the transported prevalence is itself biased. 
(b) **Differential competing risks** — in elderly\n  claims cohorts the frailer arm dies of competing causes before the outcome of interest, so a confounder that drives the\n  competing event distorts the bias factor; pair PBA with a competing-risks primary analysis. (c) Claims-derived frailty\n  indices (e.g., Kim's claims-based frailty index) are imperfect proxies — adjusting for them does not eliminate U, so PBA\n  should target the *residual* unmeasured component after such adjustment, with priors scaled accordingly.\n- **EHR:** Notes, labs, vitals, and problem lists often measure U directly in a *subset* (the patients with complete\n  encounters). Use that subset to estimate the prevalence and outcome-association priors, but recognize that EHR capture is\n  visit-driven: patients who measure U are not a random sample (sicker, more-engaged patients are better documented), so the\n  anchoring estimate carries its own selection. NLP-extracted severity is noisy; propagate that measurement error into wider\n  priors rather than treating an NLP flag as ground truth.\n- **Registry:** Strongest for disease severity, stage, and performance status — the very variables claims miss — but usually\n  thin on complete drug exposure. The standard pattern is registry-anchored priors transported onto a claims cohort with full\n  pharmacy fills; verify the registry and claims populations overlap on the relevant case mix before transporting.\n- **Linked claims-EHR-vital-records:** The ideal substrate: U is observed in the linked subset, mortality is reliable, and\n  exposure is complete. 
Caveat: only the *linkable* subset measures U, and linkage is selective (insured, geographically\n  stable, consenting), so transported priors still need a sensitivity check on the linkage-selection assumption.\n\n**Worked claims example.** A Cox model in 100% Medicare FFS data estimates an adjusted hazard ratio of **0.78 (95% CI\n0.70-0.87)** for all-cause mortality comparing initiators of drug A vs an active comparator drug B (both for the same\nindication; new-user, active-comparator cohort with 365-day continuous Parts A/B/D enrollment, washout, and time zero at the\nfirst qualifying `fill_date`). The reviewer's concern: severe frailty — ADLs and performance status — is not in claims and may\nbe unevenly distributed between the arms. Eligibility, washout, first-event coding, `days_supply`-based on-treatment windows,\nand censoring (disenrollment, death, end of data) are already fixed by the primary analysis; PBA changes none of them. Anchor\nthe priors from a linked **SEER-Medicare** substudy in which performance status *is* observed: prevalence of severe frailty\namong A-initiators ~ Beta(30, 70) (≈30%) and among B-initiators ~ Beta(15, 85) (≈15%), making frailty roughly twice as common\nin the A arm; the mortality hazard ratio for frailty conditional on measured covariates ~ Triangular(min 1.5, mode 2.0, max 3.0). For\neach of 20,000 Monte-Carlo draws: (1) sample the three bias parameters from their priors; (2) compute the confounding bias\nfactor BF = [p1·HR_UY + (1 − p1)] / [p0·HR_UY + (1 − p0)] (Bross/Schneeweiss form); (3) sample the observed log-HR from\nNormal(log 0.78, SE), where SE = (log 0.87 − log 0.70)/(2·1.96) ≈ 0.0555, to carry sampling error; (4) the bias-adjusted HR\nfor that draw = exp(sampled log-HR) / BF. 
Summarize the 20,000 adjusted HRs by their median and 2.5th/97.5th percentiles.\nBecause A is the *more*-frail arm here, the bias factor exceeds 1 and dividing **moves the adjusted HR downward, away from\nthe null**: unmeasured frailty of this form would mask drug A's apparent benefit rather than create it, so the headline is how\nfar below the observed 0.78 the simulation distribution sits. (Had frailty been more common in the B arm, BF would fall below\n1, the adjusted HR would move toward the null, and the question would be whether the 97.5th percentile crosses 1.0.) Report the\nmedian bias-adjusted HR with its 2.5/97.5 simulation limits, an array/contour plot over the prevalence-difference and HR_UY grid, and\nthe explicit provenance of every prior distribution (the SEER-Medicare substudy), so the strength of the assumptions is on the\npage rather than hidden inside the simulation.\n\n**Interpreting the output**\n\nFrom the worked example: 20,000 Monte Carlo draws produce a median bias-adjusted HR ≈ 0.68 with a 95%\nsimulation interval that lies entirely below 1.0, under frailty prevalence p1 = 0.30 in Drug A, p0 = 0.15\nin Drug B, and HR_UY drawn from Triangular(1.5, 2.0, 3.0).\n\n*(1) Formal interpretation.* The simulation interval is *not* a confidence interval. A frequentist 95% CI\ncovers the true parameter in 95% of repeated samples from a fixed data-generating process; this interval\npropagates sampling error together with uncertainty across the analyst-specified prior distributions for the\nbias parameters. Change\nthe priors — widen the frailty prevalence range or shift the HR_UY triangle — and the interval changes.\nThe result is therefore conditional on the stated bias model and its parameter priors, not a model-free\nbound. Because frailty is assumed to be more prevalent in the Drug A arm, the bias factor exceeds 1 and\nthe corrected estimates shift away from the null relative to the observed HR 0.78 (for example, 0.78 / 1.13 ≈ 0.69\nat p1 = 0.30, p0 = 0.15, HR_UY = 2.0); the direction of\ncorrection is a consequence of the assumed bias structure.\n\n*(2) Practical interpretation.* Under bias parameters anchored to the SEER-Medicare substudy, Drug A's\napparent benefit persists: the median bias-adjusted HR ≈ 0.68 still favors Drug A and the simulation\ninterval excludes 1.0. 
A decision-maker should treat the limits as a credibility envelope rather than\na coverage guarantee, and scrutinize whether the substudy population matches the main cohort —\nif it does not, the correction may transport imperfectly and the bias-adjusted result could mislead.",
    "primary_category": "Bias_Control",
    "tags": [
      "unmeasured-confounding",
      "probabilistic-bias-analysis",
      "quantitative-bias-analysis",
      "sensitivity-analysis",
      "frailty",
      "pharmacoepidemiology",
      "monte-carlo",
      "residual-confounding"
    ],
    "applies_to_study_types": [
      "active_comparator_new_user",
      "cohort_retrospective",
      "claims_analysis",
      "ehr_study",
      "target_trial_emulation"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "linked",
      "registry"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1093/ije/25.6.1107",
        "url": "https://doi.org/10.1093/ije/25.6.1107",
        "citation_text": "Greenland S. Basic methods for sensitivity analysis of biases. International Journal of Epidemiology. 1996;25(6):1107-1116.",
        "year": 1996,
        "authors_short": "Greenland",
        "notes": "Foundational derivation of the bias-factor formulas for an unmeasured confounder that underpin both deterministic and probabilistic bias analysis."
      },
      {
        "role": "explain",
        "doi": "10.1007/978-0-387-87959-8",
        "url": "https://doi.org/10.1007/978-0-387-87959-8",
        "citation_text": "Lash TL, Fox MP, Fink AK. Applying Quantitative Bias Analysis to Epidemiologic Data. New York: Springer; 2009.",
        "year": 2009,
        "authors_short": "Lash et al.",
        "notes": "Canonical text operationalizing probabilistic bias analysis, including the requirement to propagate both bias uncertainty and conventional random error into the simulation interval."
      },
      {
        "role": "explain",
        "doi": "10.1146/annurev-publhealth-032315-021644",
        "url": "https://doi.org/10.1146/annurev-publhealth-032315-021644",
        "citation_text": "Arah OA. Bias analysis for uncontrolled confounding in the health sciences. Annual Review of Public Health. 2017;38:23-38.",
        "year": 2017,
        "authors_short": "Arah",
        "notes": "Modern review distinguishing E-value, deterministic, and probabilistic approaches to uncontrolled confounding and when each is appropriate."
      },
      {
        "role": "demonstrate",
        "doi": "10.1002/sim.7298",
        "url": "https://doi.org/10.1002/sim.7298",
        "citation_text": "McCandless LC, Gustafson P. A comparison of Bayesian and Monte Carlo sensitivity analysis for unmeasured confounding. Statistics in Medicine. 2017;36(18):2887-2901.",
        "year": 2017,
        "authors_short": "McCandless & Gustafson",
        "notes": "Worked comparison clarifying what Monte-Carlo (sampling-importance-resampling-free) bias analysis estimates versus a full Bayesian model, and the conditions under which the simpler simulation is adequate."
      },
      {
        "role": "use",
        "doi": "10.1002/pds.1200",
        "url": "https://doi.org/10.1002/pds.1200",
        "citation_text": "Schneeweiss S. Sensitivity analysis and external adjustment for unmeasured confounders in epidemiologic database studies of therapeutics. Pharmacoepidemiology and Drug Safety. 2006;15(5):291-303.",
        "year": 2006,
        "authors_short": "Schneeweiss",
        "notes": "Pharmacoepidemiologic application in claims databases, including external adjustment using a validation subsample to anchor the prevalence and outcome-association parameters."
      }
    ],
    "plain_language_summary": "When a study finds that one drug appears better than another, the result could be driven not by the drug itself but by a hidden difference between the two groups — for example, sicker patients tending to receive the older drug. Probabilistic bias analysis (PBA) turns that vague worry into a number: it asks, given plausible assumptions about how common and how harmful that hidden factor is, how much would the original estimate shift, and would the conclusion survive? The E-value is a related, simpler tool that answers a single threshold question — how strong would a hidden factor need to be, on both the exposure and outcome sides simultaneously, to fully explain away the observed result.",
    "key_terms": [
      {
        "term": "unmeasured confounding",
        "definition": "A situation where a factor that affects both who receives a treatment and what outcome they have is not recorded in the data, so its effect on the result cannot be directly removed."
      },
      {
        "term": "bias parameters",
        "definition": "The two numbers that describe a hypothetical hidden confounder: how much more common it is in one treatment group versus the other, and how strongly it is associated with the outcome."
      },
      {
        "term": "E-value",
        "definition": "The minimum association strength (expressed as a risk-ratio-like number) that an unmeasured confounder would need to have with both the exposure and the outcome simultaneously to fully explain away the observed result; a higher E-value means the result is harder to dismiss."
      },
      {
        "term": "bias factor",
        "definition": "A multiplier computed from the bias parameters that captures how much the observed estimate would move if the assumed hidden confounder were present; dividing the observed estimate by the bias factor gives the adjusted estimate."
      },
      {
        "term": "simulation interval",
        "definition": "The range of bias-adjusted estimates produced by running thousands of random draws of the bias parameters; it reflects uncertainty about the hidden confounder, not just ordinary statistical noise, and must not be labeled a confidence interval."
      }
    ],
    "worked_example": {
      "scenario": "A Medicare claims study finds that patients who started drug A had a 22% lower rate of all-cause death compared with patients who started comparator drug B (adjusted hazard ratio 0.78, 95% CI 0.70-0.87). A reviewer raises the concern that severe frailty — which claims data do not record — may have been steered toward the comparator arm by prescribers. The analyst wants to (1) compute the E-value to show the minimum confounder strength needed to explain away the result, and (2) run a small deterministic sensitivity analysis varying frailty prevalence and strength to see how the adjusted estimate moves.",
      "dataset": {
        "caption": "Summary inputs from the fitted primary model — not subject-level rows. A PBA works from the model output, not raw patient data.",
        "columns": [
          "input",
          "value",
          "source"
        ],
        "rows": [
          [
            "Observed HR (drug A vs B)",
            "0.78",
            "Cox model output"
          ],
          [
            "95% CI lower bound",
            "0.70",
            "Cox model output"
          ],
          [
            "95% CI upper bound",
            "0.87",
            "Cox model output"
          ],
          [
            "Frailty prevalence, drug A arm (p1)",
            "30%",
            "Assumed from SEER-Medicare substudy"
          ],
          [
            "Frailty prevalence, drug B arm (p0)",
            "15%",
            "Assumed from SEER-Medicare substudy"
          ],
          [
            "Frailty-mortality HR (HR_UY)",
            "2.0 (base case)",
            "Assumed; range 1.5-3.0 tested"
          ]
        ]
      },
      "steps": [
        "Step 1 — Compute the E-value for the point estimate. Because the HR is below 1 (drug A looks protective), first flip it: 1 / 0.78 = 1.282. Then apply the E-value formula: E = 1.282 + sqrt(1.282 x (1.282 - 1)) = 1.282 + sqrt(1.282 x 0.282) = 1.282 + sqrt(0.362) = 1.282 + 0.601 = 1.88.",
        "Step 2 — Compute the E-value for the confidence interval bound closest to the null (0.87). Flip it: 1 / 0.87 = 1.149. E = 1.149 + sqrt(1.149 x 0.149) = 1.149 + sqrt(0.171) = 1.149 + 0.414 = 1.56.",
        "Step 3 — Interpret the E-values. A hidden confounder would need to be associated at least 1.88-fold with both the exposure (frailty more common in one arm) and the outcome (frailty raising mortality) simultaneously to fully explain away HR = 0.78. Even to push the confidence interval across the null, the confounder would need associations of at least 1.56-fold in both directions. Frailty with HR_UY around 2.0 and a 15-percentage-point prevalence gap is plausible — so the E-value alone does not rule out confounding.",
        "Step 4 — Run a deterministic sensitivity analysis using the Bross bias-factor formula: BF = [p1 x HR_UY + (1 - p1)] / [p0 x HR_UY + (1 - p0)]. Then the bias-adjusted HR = observed HR / BF. Vary p1, p0, and HR_UY across a small grid (see table below).",
        "Step 5 — Read the table. When frailty is equally prevalent in both arms (p1 = p0 = 0.15), BF = 1.00 and the adjusted HR stays at 0.78 — no bias. As the prevalence gap and confounder strength grow, BF rises above 1, and dividing the observed HR by a number greater than 1 moves the adjusted HR upward toward the null."
      ],
      "sensitivity_table": {
        "caption": "Deterministic bias-factor grid. Each cell shows the bias factor and the resulting bias-adjusted HR for drug A vs drug B. The observed HR is 0.78.",
        "columns": [
          "Frailty prevalence in drug A arm (p1)",
          "Frailty prevalence in drug B arm (p0)",
          "Frailty-mortality HR (HR_UY)",
          "Bias factor (BF)",
          "Bias-adjusted HR"
        ],
        "rows": [
          [
            "15%",
            "15%",
            "2.0",
            "1.00",
            "0.78 (no bias — equal prevalence)"
          ],
          [
            "20%",
            "15%",
            "2.0",
            "1.04",
            "0.75"
          ],
          [
            "30%",
            "15%",
            "2.0",
            "1.13",
            "0.69"
          ],
          [
            "30%",
            "15%",
            "1.5",
            "1.07",
            "0.73"
          ],
          [
            "30%",
            "15%",
            "3.0",
            "1.23",
            "0.63"
          ]
        ]
      },
      "result": "E-value for the point estimate (HR = 0.78): 1.88. This means any unmeasured confounder would need to be associated at least 1.88-fold with both the drug arm and mortality simultaneously to fully explain away the observed hazard ratio. E-value for the CI bound (HR = 0.87): 1.56 — even to eliminate statistical significance, the confounder needs associations of at least 1.56-fold in both directions. The bias-factor grid shows that under the base-case assumption (frailty 30% vs 15%, HR_UY = 2.0), the bias-adjusted HR is 0.69, still below 1. Only under a scenario with a much larger prevalence gap or a stronger frailty-mortality association would the adjusted estimate approach the null."
    },
    "prerequisites": [
      "dags-backdoor-criterion-drug-studies",
      "e-value-sensitivity-analysis",
      "quantitative-bias-analysis-toolkit-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Single binary unmeasured confounder",
        "description": "The most transparent form; varies the prevalence of a single binary U by exposure arm and a single U-outcome risk/hazard ratio. Communicates cleanly but typically understates realistic multivariate confounding, so it is best read as a lower bound on plausible bias.",
        "edge_cases": [
          "A binary U with one effect size cannot capture dose-response or interaction with measured covariates.",
          "When U is more prevalent in the arm that already looks protected, adjustment moves the estimate toward the null; when in the other arm, away from the null - the direction must be reasoned from the channeling story, not assumed."
        ],
        "data_source_notes": "claims: anchor prevalence-by-arm and the U-outcome ratio from a linked validation subset (SEER-Medicare, claims-EHR link) rather than from intuition."
      },
      {
        "name": "Composite / multiple unmeasured confounder",
        "description": "Represents a bundle of correlated unmeasured factors (e.g., frailty plus socioeconomic barriers plus smoking) as one summary U. Easier to communicate but much harder to anchor empirically, since no single external source measures the composite.",
        "edge_cases": [
          "The composite's prevalence and effect priors are rarely directly observed, inflating the risk of fabricated precision.",
          "Correlation among the components means a single summary ratio can either over- or under-state the joint bias."
        ],
        "data_source_notes": "ehr/registry: estimate component prevalences separately where measured, then justify the composite prior explicitly rather than assuming independence."
      },
      {
        "name": "Bias analysis with full uncertainty propagation",
        "description": "The production form - each draw resamples the observed effect from its sampling distribution (Normal on the log scale) in addition to drawing the bias parameters, so the output simulation interval reflects both bias uncertainty and random error. Omitting the sampling-error draw understates the interval.",
        "edge_cases": [
          "Requires the point estimate and its standard error from the fitted primary model (e.g., the log-HR and SE from PROC PHREG).",
          "The simulation interval is an uncertainty band over an assumed bias model, not a frequentist confidence interval, and must be labeled as such."
        ],
        "data_source_notes": "any source: store the primary model's coefficient table so the log-effect and SE feed the bias analysis directly rather than being hardcoded."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "E-value sensitivity analysis",
        "pros_of_this": "Returns a full bias-adjusted distribution and can encode arm-specific confounder prevalence and a plausible range for the U-outcome association, rather than a single explain-away threshold.",
        "cons_of_this": "Requires defensible, ideally externally anchored prior distributions; poorly chosen priors manufacture false precision that reads like statistical inference.",
        "when_to_prefer": "When priors can be anchored to external data and the decision depends on the magnitude of plausible bias, not merely whether some confounder could explain the result."
      },
      {
        "compared_to": "Deterministic (point-value) sensitivity analysis",
        "pros_of_this": "Integrates over parameter uncertainty into one interpretable interval instead of a grid of single-scenario point estimates.",
        "cons_of_this": "Obscures the per-parameter contribution that a tornado plot makes explicit and invites misreading the simulation interval as a confidence interval.",
        "when_to_prefer": "When several bias parameters are jointly uncertain and a single summarizing interval is more useful than a scenario grid."
      },
      {
        "compared_to": "External adjustment / propensity-score calibration with a validation subsample",
        "pros_of_this": "Works when the unmeasured confounder is measured nowhere, reasoning from external literature instead of an internal measured-U subsample.",
        "cons_of_this": "Uses assumed priors rather than directly observed U; less efficient and less defensible than calibration when a measured-U subsample exists.",
        "when_to_prefer": "When no subsample measures U; if a validation subsample measures U, prefer regression or PS calibration."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Claims seldom measure frailty, performance status, smoking, or BMI - the usual targets of PBA. Anchor prevalence-by-arm and the U-outcome ratio from a linked subset (SEER-Medicare, claims-EHR link) and transport to the claims-only cohort. Restrict the anchoring substudy and the main cohort to consistent benefit types because Medicare Advantage person-time lacks fee-for-service claims, making proxy-based confounders differentially missing; pair PBA with a competing-risks primary analysis in elderly cohorts where the frailer arm dies of competing causes first.",
      "ehr": "Notes, labs, and problem lists measure U directly in the well-documented subset; use that subset for priors but account for visit-driven selection (sicker, more-engaged patients are better captured) and propagate NLP measurement error into wider priors rather than trusting an extracted flag.",
      "registry": "Strong for severity, stage, and performance status but thin on complete drug exposure; registry-anchored priors are transported onto a claims cohort with full pharmacy fills only after confirming case-mix overlap.",
      "linked": "Linked claims-EHR-vital-records observes U in the linkable subset with reliable mortality and complete exposure, but linkage is selective, so transported priors still need a sensitivity check on the linkage-selection assumption."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\n\nrng = np.random.default_rng(42)\nN_DRAWS = 20_000\n\n# --- Primary-model output (paste from the fitted Cox/logistic model) -------------------\nHR_OBS = 0.78                       # adjusted HR, drug A vs comparator B, all-cause mortality\nCI_LOW, CI_HIGH = 0.70, 0.87       # 95% CI of the same estimate\nbeta_hat = np.log(HR_OBS)\nse_hat = (np.log(CI_HIGH) - np.log(CI_LOW)) / (2 * 1.959964)   # recover SE from the reported CI\n\n# --- Externally anchored bias-parameter priors (SEER-Medicare substudy) ---------------\n# p1, p0: prevalence of severe frailty among A-initiators / B-initiators (channeling toward the comparator)\np1 = rng.beta(30, 70, N_DRAWS)                       # ~30% in arm A\np0 = rng.beta(15, 85, N_DRAWS)                       # ~15% in arm B\nrr_uy = rng.triangular(1.5, 2.0, 3.0, N_DRAWS)       # mortality HR for frailty | measured covariates\n\n# --- Bias factor (Bross/Schneeweiss) and bias adjustment ------------------------------\nbias_factor = (p1 * rr_uy + (1 - p1)) / (p0 * rr_uy + (1 - p0))\n\n# Propagate sampling error: resample the observed log-effect each iteration, then divide by the bias factor.\nbeta_draw = rng.normal(beta_hat, se_hat, N_DRAWS)\nhr_adjusted = np.exp(beta_draw) / bias_factor\n\npct = np.percentile(hr_adjusted, [2.5, 50, 97.5])\nprint(f\"Bias-adjusted HR (median): {pct[1]:.3f}\")\nprint(f\"95% simulation interval:   {pct[0]:.3f} to {pct[2]:.3f}\")\nprint(f\"P(adjusted HR < 1):        {(hr_adjusted < 1).mean():.3f}\")",
        "description": "Probabilistic bias analysis for an unmeasured confounder, propagating both bias and sampling uncertainty. Required input\nis the fitted primary model's effect on the log scale, NOT raw subject data:\n  beta_hat : log hazard/risk/odds ratio from the primary model (e.g., PROC PHREG / coxph / Cox in lifelines)\n  se_hat   : standard error of beta_hat from the same model\nThe three bias-parameter priors are anchored externally (here: a SEER-Medicare substudy). p1/p0 are the prevalence of the\nbinary unmeasured confounder U among the exposed/unexposed arms; rr_uy is the U-outcome ratio conditional on measured\ncovariates. Bias factor follows the Bross/Schneeweiss form. Output is a median bias-adjusted ratio with a 2.5/97.5\nsimulation interval - an uncertainty band over the assumed bias model, not a confidence interval.",
        "dependencies": [
          "numpy"
        ],
        "source_citations": [
          "schneeweiss-2006"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "set.seed(42)\nn_draws <- 20000L\n\n## --- Primary-model output (from coxph / glm) ----------------------------------------\nhr_obs <- 0.78; ci_low <- 0.70; ci_high <- 0.87\nbeta_hat <- log(hr_obs)\nse_hat   <- (log(ci_high) - log(ci_low)) / (2 * qnorm(0.975))   # recover SE from the reported CI\n\n## --- Externally anchored bias-parameter priors (validation substudy) ----------------\np1    <- rbeta(n_draws, 30, 70)          # prevalence of frailty in arm A (~30%)\np0    <- rbeta(n_draws, 15, 85)          # prevalence of frailty in arm B (~15%)\n## Triangular(min=1.5, mode=2.0, max=3.0) for the U-outcome HR via inverse-CDF sampling:\nrtri <- function(n, a, c, b) {\n  u <- runif(n); fc <- (c - a) / (b - a)\n  ifelse(u < fc, a + sqrt(u * (b - a) * (c - a)),\n                 b - sqrt((1 - u) * (b - a) * (b - c)))\n}\nrr_uy <- rtri(n_draws, 1.5, 2.0, 3.0)\n\n## --- Bias factor and adjustment with sampling-error propagation ----------------------\nbias_factor  <- (p1 * rr_uy + (1 - p1)) / (p0 * rr_uy + (1 - p0))\nbeta_draw    <- rnorm(n_draws, beta_hat, se_hat)\nhr_adjusted  <- exp(beta_draw) / bias_factor\n\nq <- quantile(hr_adjusted, c(.025, .5, .975))\ncat(sprintf(\"Bias-adjusted HR (median): %.3f\\n\", q[2]))\ncat(sprintf(\"95%% simulation interval:   %.3f to %.3f\\n\", q[1], q[3]))\ncat(sprintf(\"P(adjusted HR < 1):        %.3f\\n\", mean(hr_adjusted < 1)))",
        "description": "Probabilistic bias analysis for an unmeasured confounder in R, propagating bias and sampling uncertainty. As in the Python\nversion the input is the primary model's log-scale effect and its SE (e.g., from survival::coxph), not subject-level data;\nthe three bias-parameter priors are anchored to an external validation subset. trunc(...) keeps prevalence draws in (0,1).",
        "dependencies": [],
        "source_citations": [
          "schneeweiss-2006"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "%let n_draws = 20000;\n\n/* Primary model: comparative mortality HR, drug A vs comparator B, adjusted for measured covariates. */\nproc phreg data=work.cohort;\n  class arm(ref='0');\n  model fu_time*event(0) = arm &covs / risklimits;\n  ods output ParameterEstimates=pe;   /* capture the fitted log-HR and its SE */\nrun;\n\n/* Pull the observed log-HR (beta_hat) and its standard error for the arm term. */\ndata _null_;\n  set pe;\n  where upcase(Parameter) = 'ARM';\n  call symputx('beta_hat', Estimate);\n  call symputx('se_hat',   StdErr);\nrun;\n\n/* Monte-Carlo bias analysis: draw bias params + resample the observed log-HR each iteration. */\ndata pba;\n  call streaminit(42);\n  beta_hat = &beta_hat; se_hat = &se_hat;\n  do draw = 1 to &n_draws;\n    p1 = rand('beta', 30, 70);                 /* frailty prevalence, arm A (~30%) */\n    p0 = rand('beta', 15, 85);                 /* frailty prevalence, arm B (~15%) */\n    /* Triangular(1.5, 2.0, 3.0) for the U-outcome HR via inverse-CDF sampling. */\n    u = rand('uniform'); a = 1.5; c = 2.0; b = 3.0; fc = (c - a)/(b - a);\n    if u < fc then rr_uy = a + sqrt(u*(b - a)*(c - a));\n    else            rr_uy = b - sqrt((1 - u)*(b - a)*(b - c));\n    bias_factor = (p1*rr_uy + (1 - p1)) / (p0*rr_uy + (1 - p0));\n    beta_draw   = rand('normal', beta_hat, se_hat);   /* propagate sampling error */\n    hr_adjusted = exp(beta_draw) / bias_factor;\n    output;\n  end;\n  keep hr_adjusted;\nrun;\n\nproc univariate data=pba noprint;\n  var hr_adjusted;\n  output out=pba_summary median=median_hr pctlpts=2.5 97.5 pctlpre=p_;\nrun;\n\nproc print data=pba_summary noobs; run;   /* median_hr, p_2_5, p_97_5 = simulation interval */",
        "description": "Probabilistic bias analysis for an unmeasured confounder in SAS, propagating bias and sampling uncertainty. The primary\nCox model is fit first with PROC PHREG (covs = measured baseline confounders from the [index-365, index] window) and its\nparameter table is captured so the bias analysis reads the observed log-HR and its SE rather than hardcoding them. The DATA\nstep draws the bias parameters and resamples the observed log-HR each iteration; PROC UNIVARIATE returns the median and the\n2.5/97.5 simulation limits. work.cohort has one row per new initiator: person_id, arm (1=drug A, 0=comparator B),\nfu_time (days to event/censor), event (1/0), and the measured covariates.",
        "dependencies": [],
        "source_citations": [
          "schneeweiss-2006"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  L[\"L: measured covariates<br/>(age, comorbidity, utilization)\"]\n  U[\"U: unmeasured confounder<br/>(frailty, performance status)\"]\n  A[\"A: exposure<br/>(drug A vs comparator B)\"]\n  Y[\"Y: outcome<br/>(all-cause mortality)\"]\n  L --> A\n  L --> Y\n  U --> A\n  U --> Y\n  A --> Y\n  L -. adjusted in primary model .-> A\n  U -. quantified by PBA priors .-> Y",
        "caption": "Causal structure motivating unmeasured-confounding PBA. Measured covariates L are controlled in the primary model; the unmeasured confounder U opens a back-door path A <- U -> Y that adjustment cannot close. PBA quantifies the two U arrows (prevalence-by-arm encodes U -> A; the conditional U-outcome ratio encodes U -> Y) instead of leaving them unmodeled.",
        "alt_text": "Directed acyclic graph with measured covariates L and unmeasured confounder U each pointing into exposure A and outcome Y, with A pointing to Y, illustrating the back-door path through U that probabilistic bias analysis quantifies.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Obs[\"Primary model output:<br/>log-effect beta_hat + SE<br/>(e.g., from PROC PHREG)\"]\n  Pri[\"Externally anchored priors:<br/>p1, p0 ~ Beta; HR_UY ~ Triangular\"]\n  Draw[\"For each of N draws:<br/>sample p1, p0, HR_UY and<br/>resample beta from Normal(beta_hat, SE)\"]\n  BF[\"Compute bias factor<br/>BF = (p1*HR_UY + 1-p1) / (p0*HR_UY + 1-p0)\"]\n  Adj[\"Bias-adjusted effect this draw =<br/>exp(beta_draw) / BF\"]\n  Out[\"Summarize N adjusted effects:<br/>median + 2.5/97.5 simulation interval<br/>(NOT a confidence interval)\"]\n  Obs --> Draw\n  Pri --> Draw\n  Draw --> BF\n  BF --> Adj\n  Adj --> Out",
        "caption": "Monte-Carlo simulation flow. Each iteration both samples the bias parameters from their anchored priors AND resamples the observed effect from its sampling distribution, so the output interval propagates bias uncertainty and random error together; omitting the resampling step understates the interval.",
        "alt_text": "Flowchart from the primary model's log-effect and standard error plus externally anchored priors, through per-draw sampling of bias parameters and the observed effect, computing a bias factor, dividing to get a bias-adjusted effect, and summarizing into a median and simulation interval.",
        "source_type": "illustrative",
        "source_citations": [
          "schneeweiss-2006"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "is_variant_of",
        "target_slug": "quantitative-bias-analysis-toolkit-rwe",
        "notes": "Unmeasured-confounding PBA is the confounding-focused member of the QBA family alongside misclassification and selection-bias analyses."
      },
      {
        "relation_type": "alternative_to",
        "target_slug": "e-value-sensitivity-analysis",
        "notes": "The E-value is a single explain-away threshold requiring almost no input; PBA returns a full bias-adjusted distribution but demands defensible, ideally externally anchored priors. Use the E-value as a fast screen, PBA when magnitude matters."
      },
      {
        "relation_type": "used_with",
        "target_slug": "external-adjustment-validation-substudy-bias-correction-rwe",
        "notes": "When a validation subsample measures the confounder, external adjustment/PS calibration anchors the PBA priors or replaces assumed priors with observed parameters."
      },
      {
        "relation_type": "see_also",
        "target_slug": "selection-bias-sensitivity-analysis-rwe",
        "notes": "Parallel QBA approach targeting selection rather than confounding; the simulation-with-uncertainty-propagation machinery is shared."
      },
      {
        "relation_type": "see_also",
        "target_slug": "healthy-user-bias",
        "notes": "Healthy-user and frailty mechanisms are the canonical unmeasured confounders that PBA is used to quantify in claims."
      }
    ],
    "aliases": [
      "PBA for unmeasured confounding",
      "probabilistic sensitivity analysis for unmeasured confounding",
      "Monte-Carlo bias analysis for uncontrolled confounding",
      "uncontrolled confounding sensitivity analysis",
      "quantitative bias analysis for confounding"
    ],
    "complexity": "advanced",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "value-of-information-evpi-evsi-rwe",
    "name": "Value of Information Analysis (EVPI, EVPPI, EVSI)",
    "short_definition": "A decision-analytic framework that converts the parameter uncertainty already quantified in a probabilistic sensitivity analysis into the expected monetary cost of choosing the wrong strategy — yielding the expected value of perfect information (EVPI, the ceiling on what any research is worth), partial EVPI per parameter group (EVPPI, which uncertain inputs drive that value), and the expected value of sample information for a specific proposed study (EVSI, whose excess over the study's cost — the expected net benefit of sampling — decides whether a new RWE study is worth funding).",
    "long_description": "A probabilistic sensitivity analysis (PSA) ends with a distribution of outcomes, but the decision-maker still has to pick\n**one** strategy. Under current evidence the rational choice is the strategy with the highest *expected* net monetary\nbenefit (NMB) — yet in some fraction of the PSA draws that choice is wrong, and each wrong draw carries a quantifiable\n**opportunity loss** (the NMB gap to that draw's true winner). Value of information (VOI) analysis is the formalization of\nthat observation: the expected opportunity loss of deciding today *is* the expected value of removing the uncertainty.\n**EVPI** = E_theta[max_d NMB(d, theta)] - max_d E_theta[NMB(d, theta)]: average the per-draw best NMB, subtract the NMB of\nthe strategy that is best on average. It is computed directly from the PSA matrix in two lines and is the **upper bound on\nthe value of any conceivable further research** on this decision. The ISPOR VOI Good Practices Task Force (Fenwick et al.\n2020; Rothery et al. 2020) frames the whole ladder: EVPI (all parameters, perfect information), **EVPPI** (perfect\ninformation on a parameter subset), **EVSI** (imperfect information from a specific finite study), and **ENBS** = population\nEVSI minus the study's cost — the quantity that actually decides whether to commission the study.\n\n**Estimation is where the craft lives.** EVPI is nonparametric and free once the PSA exists. EVPPI naively requires a\ntwo-level nested Monte Carlo (an inner PSA for every outer draw of the parameter of interest) that is computationally\nbrutal; the field-standard shortcut is the **Strong–Oakley–Brennan regression estimator**: regress each strategy's NMB on\nthe parameter(s) of interest across the *single existing* PSA sample — a GAM/spline for 1–4 parameters, Gaussian-process\nregression for more — and the fitted values estimate E[NMB_d | theta]; EVPPI is then the mean of the per-draw fitted maxima\nminus the max of the column means. 
EVSI extends the same trick (Strong et al. 2015): for each PSA draw, *simulate the\nsummary statistic the proposed study would report* (given its sample size, allocation, follow-up, and expected error), then\nregress NMB on that simulated statistic — the fitted values estimate the posterior-expected NMB after seeing the data.\n**Population scaling** multiplies per-patient VOI by the effective population: incident (plus prevalent, where the decision\nreaches them) patients per year x the decision-relevance horizon, with each future year's cohort discounted at the\njurisdiction's rate. The horizon — how long the decision will stand before the technology landscape changes — is the most\ninfluential and least evidence-based input in the chain; report population VOI across 5/10/15-year horizons.\n\n**Pros, cons, and trade-offs** (specific and comparative).\n- **vs probabilistic-sensitivity-analysis-hea-rwe (the substrate):** PSA *describes* decision uncertainty (CEAC, scatter);\n  VOI *prices* it and attaches a decision rule. A CEAC alone misleads in both directions: P(cost-effective) = 0.55 can\n  carry trivial EVPI (the strategies are near-ties, being wrong costs little) or enormous EVPI (high stakes per patient,\n  huge population). VOI is the only output that says whether the residual uncertainty *matters*. Cost: VOI inherits every\n  distributional choice in the PSA — overdispersed or made-up priors inflate EVPI mechanically.\n- **vs scenario-deterministic-sensitivity-analysis-hea-rwe (tornado/scenario):** a tornado diagram ranks parameters by how\n  far they swing the ICER; EVPPI ranks them by whether their uncertainty can *flip the decision*. A parameter can dominate\n  the tornado yet have EVPPI near zero because no plausible value changes the adoption choice — and a modest parameter can\n  carry most of the EVPPI because it sits exactly where strategies cross. 
For prioritizing research spend, EVPPI is the\n  correct ranking; the tornado is a model-understanding tool.\n- **vs heuristic research prioritization (\"fund the biggest evidence gap\"):** VOI forces the gap to be valued against the\n  population affected, the stakes per patient, and the cost and feasibility of the study that would close it. The price is\n  machinery: a credible PSA, defensible effective-population assumptions, and for EVSI an explicit model of the future\n  study's data-generating process — including, for RWE designs, its *bias*. Naive EVSI assumes the new data are an unbiased\n  sample; an observational study with residual confounding delivers less (sometimes negative) value than its sample size\n  suggests, so EVSI for RWE studies should model bias explicitly or be read as an upper bound.\n\n**When to use.** Deciding whether a proposed RWE investment — a registry, a linked claims–EHR effectiveness study, a chart\nreview to pin down utilities, a pragmatic trial — is worth its cost, and at what sample size (maximize ENBS, not power);\nprioritizing *which* parameter to spend an evidence budget on (EVPPI on relative effectiveness vs long-term survival\nextrapolation vs costs vs utilities points the money differently); HTA submissions and re-assessments, where\ncoverage-with-evidence-development and managed-entry agreements are, operationally, decisions that the EVSI of further\ndata collection exceeds its cost; research funders triaging portfolios across disease areas on population EVPI.\n\n**When NOT to use — and when it is actively misleading.**\n- **When structural uncertainty dominates.** VOI computed from a PSA prices only *parameter* uncertainty. If the real doubt\n  is the model structure (extrapolation functional form, comparator relevance, surrogate-to-final-outcome linkage), EVPI\n  from the PSA understates the value of research — possibly by orders of magnitude. 
Parameterize structural choices into\n  the PSA (model averaging) or do not present EVPI as \"the\" value of further research.\n- **When the decision is not actually sensitive.** If one strategy wins in ~100% of draws, EVPI is ~0 and the analysis is a\n  one-line sanity check — do not build an EVSI machine to confirm a foregone conclusion.\n- **When the PSA priors are not evidence-based.** VOI is a precise function of the input distributions; arbitrary +/-20%\n  uniforms produce arbitrary EVPI with impressive decimal places. Garbage in, confidently priced garbage out.\n- **When the proposed study would be biased and the EVSI ignores it.** EVSI that models an RWE study as unbiased sampling\n  overstates its value; with material unmeasured confounding the study can even leave the decision-maker worse off\n  (confidently wrong). Encode the design's expected bias and added variance, or present EVSI as an optimistic bound.\n- **When delay costs are material and unmodeled.** Patients are treated under current information while the study runs;\n  an adopt-and-research vs delay-and-research framing changes ENBS and is a decision in itself.\n\n**Where the parameters come from (RWE operational depth).** In practice the PSA that feeds VOI is parameterized from\nreal-world sources, and EVPPI tells you which *source* to go back to: cost and resource-use parameters typically come from\nclaims (PPPM/PPPY costing studies); utilities and short-term effectiveness proxies from EHR or trial mapping; long-term\nsurvival, progression, and natural history — chronically the highest-EVPPI block in oncology and rare disease — from\nregistries; and the proposed EVSI study is itself usually a linked-data design whose achievable precision and bias must be\nmodeled honestly. 
VOI is therefore the budget-allocation layer that sits on top of the catalog's costing, utility, and\nsurvival-extrapolation machinery and decides which of them gets funded next.\n\n**Interpreting the output**\n\nConsider the worked example: per-patient EVPI = 7,500 and population EVPI = 15,000,000 (2,000\npatients over a 2-year decision horizon at a willingness-to-pay of 50,000 per QALY).\n\nFormal interpretation: EVPI of 7,500 per patient is the expected cost of making a decision under\ncurrent uncertainty — equivalently, the maximum worth paying per patient for any research that\nwould resolve all uncertainty before the adoption decision is made. It is computed as the expected\nnet monetary benefit under perfect information (112,500) minus the expected NMB of the best current\ndecision (adopting B at 105,000). The population EVPI of 15,000,000 is that per-patient figure\nscaled by the number of patients who will be affected while the evidence gap persists. EVPI is\nstrictly a function of the model's own probabilistic characterization of uncertainty — it measures\nhow much the decision could improve if every uncertain parameter were resolved, not the value of\nany specific study design. EVPPI then shows how much of this ceiling each parameter group could\nrecover on its own (group values are bounded by EVPI but need not sum to it), identifying which\ninputs drive the most decision risk.\n\nPractical interpretation: A population EVPI of 15,000,000 sets the ceiling for any research\nprogram targeting this decision: no single study or combination of studies is worth funding if\nits net cost exceeds that figure. Because the adoption decision is essentially a coin flip\n(probability 0.50 that B is optimal) and a wrong choice costs 10,000 to 20,000 per patient, further\nevidence is plausibly worth buying before a final coverage determination, but only a study whose\npopulation EVSI exceeds its cost justifies funding. EVSI for a specific proposed RWE study — for example, a two-year\ncomparative cohort in claims — will be lower than EVPI by the residual uncertainty the study\ncannot resolve and by any bias or variance the study design introduces. 
If the proposed study\nis expected to have material unmeasured confounding, model that expected bias explicitly; an\noptimistic EVSI that ignores study limitations will overstate the study's value.",
    "primary_category": "Economic_Evaluation",
    "tags": [
      "value-of-information",
      "evpi",
      "evppi",
      "evsi",
      "expected-net-benefit-of-sampling",
      "decision-uncertainty",
      "research-prioritization",
      "probabilistic-sensitivity-analysis",
      "net-monetary-benefit",
      "ceac",
      "hta",
      "research-design"
    ],
    "applies_to_study_types": [
      "cost_utility_analysis",
      "cost_effectiveness_analysis",
      "decision_analytic_modeling",
      "claims_analysis",
      "registry_linkage"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1016/j.jval.2020.01.001",
        "url": "https://doi.org/10.1016/j.jval.2020.01.001",
        "citation_text": "Fenwick E, Steuten L, Knies S, Ghabri S, Basu A, Murray JF, Koffijberg HE, Strong M, Sanders Schmidler GD, Rothery C. Value of information analysis for research decisions - an introduction: report 1 of the ISPOR Value of Information Analysis Emerging Good Practices Task Force. Value in Health. 2020;23(2):139-150.",
        "year": 2020,
        "authors_short": "Fenwick et al.",
        "notes": "The ISPOR Task Force introduction that fixes the modern VOI vocabulary (EVPI, EVPPI, EVSI, ENBS), the decision-uncertainty rationale, and the population-scaling logic this concept operationalizes."
      },
      {
        "role": "explain",
        "doi": "10.1177/0272989X13505910",
        "url": "https://doi.org/10.1177/0272989X13505910",
        "citation_text": "Strong M, Oakley JE, Brennan A. Estimating multiparameter partial expected value of perfect information from a probabilistic sensitivity analysis sample. Medical Decision Making. 2014;34(3):311-326.",
        "year": 2014,
        "authors_short": "Strong et al.",
        "notes": "The regression-based EVPPI estimator (GAM/GP on the existing PSA sample) that replaced nested two-level Monte Carlo and made per-parameter VOI routine - the method implemented in this concept's code."
      },
      {
        "role": "explain",
        "doi": "10.1016/j.jval.2020.01.004",
        "url": "https://doi.org/10.1016/j.jval.2020.01.004",
        "citation_text": "Rothery C, Strong M, Koffijberg HE, Basu A, Ghabri S, Knies S, Murray JF, Sanders Schmidler GD, Steuten L, Fenwick E. Value of information analytical methods: report 2 of the ISPOR Value of Information Analysis Emerging Good Practices Task Force. Value in Health. 2020;23(3):277-286.",
        "year": 2020,
        "authors_short": "Rothery et al.",
        "notes": "The companion ISPOR Task Force methods report - estimator choice (nonparametric EVPI, regression EVPPI, EVSI approaches), population scaling, discounting, and reporting standards for applied VOI."
      },
      {
        "role": "demonstrate",
        "doi": "10.1177/0272989X15575286",
        "url": "https://doi.org/10.1177/0272989X15575286",
        "citation_text": "Strong M, Oakley JE, Brennan A, Breeze P. Estimating the expected value of sample information using the probabilistic sensitivity analysis sample - a fast, nonparametric regression-based method. Medical Decision Making. 2015;35(5):570-583.",
        "year": 2015,
        "authors_short": "Strong et al.",
        "notes": "Extends the regression trick to EVSI - simulate each PSA draw's future-study summary statistic and regress NMB on it - making the value of a specific proposed RWE study computable without nested simulation."
      }
    ],
    "plain_language_summary": "Value of information analysis puts a dollar figure on what it would be worth to settle the remaining uncertainty in a cost-effectiveness decision before locking it in. Starting from the thousands of what-if model runs already produced for the analysis, it asks how often the apparently best treatment turns out to be the wrong pick and how much those mistakes cost. That ceiling (EVPI) is the most any future research on the question could ever be worth; the follow-on measures split it by which uncertain input drives it (EVPPI) and by the specific study you could actually run (EVSI), so an evidence budget goes to the study that buys the most decision certainty per dollar spent.\n",
    "key_terms": [
      {
        "term": "net monetary benefit (NMB)",
        "definition": "A strategy's health gain converted to money (its QALYs times what the payer will pay per QALY) minus its cost, so strategies can be compared on a single number."
      },
      {
        "term": "QALY",
        "definition": "One year of life in perfect health; a year lived in poorer health counts as a fraction of one."
      },
      {
        "term": "willingness-to-pay threshold",
        "definition": "The most a decision-maker will spend to gain one QALY - commonly 50,000 to 150,000 dollars in the US."
      },
      {
        "term": "probabilistic sensitivity analysis (PSA)",
        "definition": "Re-running a cost-effectiveness model thousands of times with different plausible values for every uncertain input, to see how often each strategy comes out on top."
      },
      {
        "term": "opportunity loss",
        "definition": "The benefit given up when the strategy chosen today turns out not to be the best one in a particular scenario."
      },
      {
        "term": "effective population",
        "definition": "All the patients whose treatment the decision will determine over the years it stays in force, counted with future years weighted slightly less."
      }
    ],
    "worked_example": {
      "scenario": "A payer must choose between standard care (strategy A) and a new drug (strategy B) at a willingness-to-pay of 50,000 per QALY. To make every number checkable by hand, the modeling team's probabilistic sensitivity analysis is boiled down to just four equally likely simulations; standard care's cost and QALYs are well established, so they are the same in every draw, while the new drug's QALY gain is uncertain. We want the probability the adoption choice is wrong, the per-patient EVPI, and the population EVPI for 1,000 new patients per year over a 2-year decision horizon.\n",
      "dataset": {
        "caption": "Four equally likely PSA simulations - cost, QALYs, and net monetary benefit (NMB = 50,000 x QALYs - cost) for each strategy.",
        "columns": [
          "sim",
          "cost_a",
          "qaly_a",
          "cost_b",
          "qaly_b",
          "nmb_a",
          "nmb_b"
        ],
        "rows": [
          [
            1,
            20000,
            2.4,
            30000,
            3.0,
            100000,
            120000
          ],
          [
            2,
            20000,
            2.4,
            30000,
            2.4,
            100000,
            90000
          ],
          [
            3,
            20000,
            2.4,
            30000,
            3.2,
            100000,
            130000
          ],
          [
            4,
            20000,
            2.4,
            30000,
            2.2,
            100000,
            80000
          ]
        ]
      },
      "steps": [
        "Convert each strategy's cost and QALYs into net monetary benefit at the 50,000 threshold. For simulation 1, strategy B has NMB = 50,000 × 3.0 - 30,000 = 150,000 - 30,000 = 120,000, and strategy A has NMB = 50,000 × 2.4 - 20,000 = 100,000 in every simulation.",
        "Decide under current evidence using average NMB. Strategy A averages 100,000; strategy B averages (120,000 + 90,000 + 130,000 + 80,000) / 4 = 105,000. B wins on average, so today's decision is adopt B.",
        "Measure the decision uncertainty. B actually beats A only in simulations 1 and 3, so the probability that B is the right choice is 2/4 = 0.50 - the adoption decision is a coin flip.",
        "With perfect information you would pick each simulation's winner. The per-simulation best NMBs are 120,000; 100,000; 130,000; 100,000, so expected NMB with perfect information = (120,000 + 100,000 + 130,000 + 100,000) / 4 = 112,500.",
        "Per-patient EVPI = expected NMB with perfect information minus expected NMB of today's choice = 112,500 - 105,000 = 7,500. Equivalently it is the average opportunity loss of adopting B, since B loses by 10,000 in simulation 2 and by 20,000 in simulation 4, giving (0 + 10,000 + 0 + 20,000) / 4 = 7,500.",
        "Scale to the population the decision governs. With 1,000 new patients per year and a 2-year decision horizon, the affected population = 1,000 × 2 = 2,000 patients (discounting of the second-year cohort is omitted here for clarity; in practice apply the jurisdiction's discount rate, for example 3 percent). Population EVPI = 7,500 × 2,000 = 15,000,000."
      ],
      "result": "Adopt B today (expected NMB 105,000 vs 100,000), but the choice is right in only 2/4 = 0.50 of simulations. Per-patient EVPI = 112,500 - 105,000 = 7,500, and population EVPI = 7,500 × 2,000 = 15,000,000 - so no research program costing more than 15 million can be worth funding on this decision. EVPPI and EVSI then show how much of that ceiling each uncertain input group, and each specific RWE study that could shrink it, can recover; these values are bounded by EVPI but need not sum to it.",
      "timeline_spec": {
        "title": "Adoption decision under uncertainty - 2,000 patients over a 2-year horizon, population EVPI 15,000,000",
        "window": {
          "start": "2026-01-01",
          "end": "2027-12-31",
          "label": "Decision horizon: 2 years, 1,000 new patients per year"
        },
        "events": [
          {
            "label": "Adopt strategy B under current evidence",
            "start": "2026-01-01",
            "length_days": 1,
            "quantity": "expected NMB 105,000 vs 100,000"
          },
          {
            "label": "Year 1 cohort treated under the decision",
            "start": "2026-01-01",
            "length_days": 365,
            "quantity": "1,000 patients"
          },
          {
            "label": "Year 2 cohort treated under the decision",
            "start": "2027-01-01",
            "length_days": 365,
            "quantity": "1,000 patients"
          }
        ],
        "spans": [
          {
            "kind": "exposed",
            "start": "2026-01-01",
            "end": "2026-12-31",
            "label": "Year 1: per-patient EVPI 7,500"
          },
          {
            "kind": "followup",
            "start": "2027-01-01",
            "end": "2027-12-31",
            "label": "Year 2: per-patient EVPI 7,500"
          }
        ],
        "result": {
          "label": "Population EVPI = 7,500 per patient x 2,000 patients = 15,000,000",
          "value": 15000000
        }
      }
    },
    "prerequisites": [
      "probabilistic-sensitivity-analysis-hea-rwe",
      "icer-net-monetary-benefit-rwe",
      "health-economic-modeling-methods-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "EVPI from the PSA matrix (nonparametric)",
        "description": "Compute net monetary benefit per strategy per PSA draw at the decision threshold, then EVPI = mean of the per-draw row maxima minus the maximum of the column means. No modeling, no assumptions beyond the PSA itself; report alongside the CEAC so the probability of error and its price travel together. Always compute on the NMB scale - the ICER scale breaks (sign flips, undefined ratios) exactly where VOI matters.",
        "edge_cases": [
          "Monte Carlo error - with near-tied strategies the row-max mean converges slowly; run enough PSA draws (often 10,000+) and check EVPI stability across seeds before publishing a number.",
          "EVPI is threshold-dependent - report it as a curve over willingness-to-pay, not a single number; it typically peaks at the threshold where the strategy with the highest expected NMB switches (the ICER in a two-strategy comparison), which may be far from the jurisdiction's threshold."
        ],
        "data_source_notes": "any source: EVPI needs only the PSA output matrix; its credibility is entirely inherited from how the claims/EHR/registry-derived parameter distributions were specified."
      },
      {
        "name": "Regression-based EVPPI (Strong-Oakley-Brennan)",
        "description": "For a parameter subset of interest, regress each strategy's NMB on those parameters across the existing PSA sample (GAM/splines for 1-4 parameters, Gaussian-process regression for larger blocks); the fitted values estimate the conditional expected NMB given the subset, and EVPPI = mean of per-draw fitted maxima minus the max column mean. One PSA sample replaces the nested two-level Monte Carlo.",
        "edge_cases": [
          "Overfitting biases EVPPI upward (the max of overfit fitted values is inflated) - keep the smooth parsimonious and check stability across basis dimension.",
          "Correlated parameters smear value across the correlation - the EVPPI of one parameter silently absorbs its correlates, so define subsets as the block you could actually study together.",
          "Joint subsets need joint smooths (interactions), not sums of univariate smooths, when parameters interact in the model."
        ],
        "data_source_notes": "registry: long-term survival/extrapolation parameters are chronically the top EVPPI block. claims: cost and resource-use parameters usually carry less decision value than analysts expect - test before funding a costing study. ehr/linked: effectiveness and utility parameters."
      },
      {
        "name": "EVSI and ENBS for a proposed RWE study",
        "description": "For each PSA draw, simulate the summary statistic the proposed study would report given its design - sample size, allocation, follow-up, measurement error, and (for observational designs) expected residual bias - then regress NMB on the simulated statistic; EVSI = mean of per-draw fitted maxima minus the max column mean. Scale to the population and subtract the study's cost to get the expected net benefit of sampling (ENBS); the optimal design maximizes ENBS over sample size, not power.",
        "edge_cases": [
          "Modeling an observational study as unbiased sampling overstates EVSI - encode expected confounding bias and added variance, or label the result an optimistic bound.",
          "Delay matters - patients treated under current information while the study accrues erode realizable value; compare adopt-and-research vs delay-and-research explicitly.",
          "EVSI can be near zero even when EVPPI is large, if the feasible study cannot measure the high-value parameter precisely enough within the decision horizon."
        ],
        "data_source_notes": "linked: the proposed study is often a linked claims-EHR or registry design; its achievable precision (events, follow-up) and bias must come from a feasibility scan of the actual data source, not assumption."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "probabilistic-sensitivity-analysis-hea-rwe",
        "pros_of_this": "Converts the PSA's description of uncertainty (CEAC, scatterplot) into a priced decision rule - whether the residual uncertainty is worth paying to reduce, which parameters carry the value, and whether a specific study clears its own cost.",
        "cons_of_this": "Adds machinery and assumptions on top of the PSA - effective population, decision horizon, discounting, and for EVSI a model of the future study - and inherits every weakness of the PSA distributions it consumes.",
        "when_to_prefer": "Always run the PSA first; add VOI whenever the question shifts from how uncertain are we to should we pay for more evidence before or instead of deciding - especially when an RWE study budget is actually on the table."
      },
      {
        "compared_to": "scenario-deterministic-sensitivity-analysis-hea-rwe",
        "pros_of_this": "Ranks parameters by decision impact (can their uncertainty flip the adoption choice and at what expected cost), not by ICER swing - a tornado-dominant parameter can have near-zero EVPPI, and EVPPI is the correct basis for allocating research spend.",
        "cons_of_this": "Requires a full PSA and regression machinery where a tornado needs only one-way reruns; harder to explain to non-technical committees and silent about structural (non-parameter) uncertainty unless that is parameterized.",
        "when_to_prefer": "Use deterministic/scenario analysis to understand and debug the model and to communicate drivers; use EVPPI when deciding where to spend evidence-generation money."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Claims feed the cost and resource-use parameters of the PSA (PPPM/PPPY costing, utilization, persistence). EVPPI on the cost block tells you whether commissioning a new claims costing study is worth anything - frequently it is not, because cost uncertainty rarely flips an adoption decision on its own. If a claims-based EVSI study is proposed, its summary-statistic model should reflect coding-based measurement error and the enrollment churn that limits follow-up.",
      "ehr": "EHR-derived parameters (lab-anchored progression rates, treatment response, mapped utilities) sit mid-ladder in EVPPI. For a proposed EHR study, EVSI must price documentation variability and informative observation (sicker patients are measured more) as bias/variance terms, not treat the EHR as a random sample.",
      "registry": "Registries supply long-term survival, progression, and natural-history parameters - in oncology and rare disease these are chronically the highest-EVPPI block because extrapolation uncertainty dominates the decision. EVSI for extending a registry or adding linkage should model how many additional events the extra follow-up actually buys within the decision horizon.",
      "linked": "The proposed study in an EVSI analysis is usually a linked claims-EHR-registry design. Ground the design's achievable sample size, event counts, and residual confounding in a feasibility scan of the real linked asset; an EVSI computed for an idealized unbiased study of arbitrary size answers a question nobody can fund."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\n\ndef evpi(nmb: np.ndarray) -> float:\n    \"\"\"nmb: (n_sims, n_strategies) net monetary benefit matrix from the PSA.\"\"\"\n    enb_current = nmb.mean(axis=0).max()   # best strategy on expected NMB (decide today)\n    enb_perfect = nmb.max(axis=1).mean()   # per-draw winner, then average (perfect info)\n    return float(enb_perfect - enb_current)\n\ndef evppi_regression(nmb: np.ndarray, theta: np.ndarray, degree: int = 3) -> float:\n    \"\"\"Strong-Oakley-Brennan: regress each strategy's NMB on the parameter(s) of\n    interest; fitted values estimate E[NMB_d | theta]. Polynomial here for zero\n    dependencies - use a GAM (1-4 params) or GP regression (more) in production.\"\"\"\n    theta = np.asarray(theta, dtype=float)\n    fitted = np.column_stack([\n        np.polyval(np.polyfit(theta, nmb[:, d], degree), theta)\n        for d in range(nmb.shape[1])\n    ])\n    return float(fitted.max(axis=1).mean() - nmb.mean(axis=0).max())\n\ndef evsi_regression(nmb: np.ndarray, future_summary: np.ndarray, degree: int = 3) -> float:\n    \"\"\"EVSI (Strong 2015): same estimator, but the regressor is the summary statistic\n    the PROPOSED study would report, simulated once per PSA draw from its design.\"\"\"\n    return evppi_regression(nmb, future_summary, degree=degree)\n\ndef effective_population(incidence_per_year: float, horizon_years: int,\n                         discount_rate: float = 0.03) -> float:\n    \"\"\"Discounted number of patients the decision affects over its relevance horizon.\"\"\"\n    return sum(incidence_per_year / (1.0 + discount_rate) ** t for t in range(horizon_years))\n\n# --- demo: two-strategy PSA where the relative risk carries the decision value ---\nrng = np.random.default_rng(2026)\nn_sims, lam = 5000, 50_000\nrr       = rng.normal(0.75, 0.10, n_sims)      # uncertain treatment effect\nqaly_a   = rng.normal(2.4, 0.20, n_sims)\ncost_a   = rng.normal(20_000, 2_000, n_sims)\ninc_cost = rng.gamma(25.0, 
400.0, n_sims)      # incremental cost of B\nnmb = np.column_stack([\n    lam * qaly_a - cost_a,                                        # A: standard care\n    lam * (qaly_a + (1.0 - rr) * 0.8) - (cost_a + inc_cost),      # B: new drug\n])\n\nper_person = evpi(nmb)\nrr_hat = rng.normal(rr, 0.07)   # proposed RWE study reports rr with SE 0.07\nprint(f\"P(B optimal)            : {(nmb[:, 1] > nmb[:, 0]).mean():.3f}\")\nprint(f\"EVPI per person         : {per_person:,.0f}\")\nprint(f\"EVPPI(rr)               : {evppi_regression(nmb, rr):,.0f}\")\nprint(f\"EVPPI(incremental cost) : {evppi_regression(nmb, inc_cost):,.0f}\")\nprint(f\"EVSI(rr study, SE 0.07) : {evsi_regression(nmb, rr_hat):,.0f}\")\nprint(f\"Population EVPI (10y)   : {per_person * effective_population(1_000, 10):,.0f}\")",
        "description": "Value of information from a PSA sample with numpy only. evpi() implements the nonparametric estimator on the\n(n_sims x n_strategies) net-monetary-benefit matrix. evppi_regression() implements the Strong-Oakley-Brennan\nsingle-sample estimator with a polynomial regression of each strategy's NMB on the parameter of interest (swap in a\nGAM/GP for production). evsi_regression() reuses the same machinery on the simulated summary statistic of a proposed\nstudy (Strong 2015). effective_population() discounts future annual cohorts for population scaling. The demo simulates\na two-strategy PSA where the relative risk drives most of the decision value.",
        "dependencies": [
          "numpy"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(mgcv)\n\nevpi <- function(nmb) {\n  # nmb: n_sims x n_strategies NMB matrix from the PSA\n  mean(apply(nmb, 1, max)) - max(colMeans(nmb))\n}\n\nevppi_gam <- function(nmb, theta) {\n  # Strong-Oakley-Brennan: fitted values of gam(NMB_d ~ s(theta)) estimate E[NMB_d | theta]\n  fitted_nmb <- sapply(seq_len(ncol(nmb)), function(d)\n    fitted(gam(nmb[, d] ~ s(theta))))\n  mean(apply(fitted_nmb, 1, max)) - max(colMeans(nmb))\n}\n\neffective_population <- function(incidence_per_year, horizon_years, discount_rate = 0.03) {\n  sum(incidence_per_year / (1 + discount_rate)^(0:(horizon_years - 1)))\n}\n\n# --- demo: two-strategy PSA where the relative risk carries the decision value ---\nset.seed(2026)\nn_sims <- 5000; lam <- 50000\nrr       <- rnorm(n_sims, 0.75, 0.10)        # uncertain treatment effect\nqaly_a   <- rnorm(n_sims, 2.4, 0.20)\ncost_a   <- rnorm(n_sims, 20000, 2000)\ninc_cost <- rgamma(n_sims, shape = 25, scale = 400)\nnmb <- cbind(A = lam * qaly_a - cost_a,\n             B = lam * (qaly_a + (1 - rr) * 0.8) - (cost_a + inc_cost))\n\nper_person <- evpi(nmb)\nrr_hat <- rnorm(n_sims, mean = rr, sd = 0.07)  # proposed RWE study estimate of rr\ncat(sprintf(\"P(B optimal)            : %.3f\\n\", mean(nmb[, \"B\"] > nmb[, \"A\"])))\ncat(sprintf(\"EVPI per person         : %s\\n\", format(round(per_person), big.mark = \",\")))\ncat(sprintf(\"EVPPI(rr)               : %s\\n\", format(round(evppi_gam(nmb, rr)), big.mark = \",\")))\ncat(sprintf(\"EVPPI(incremental cost) : %s\\n\", format(round(evppi_gam(nmb, inc_cost)), big.mark = \",\")))\ncat(sprintf(\"EVSI(rr study, SE 0.07) : %s\\n\", format(round(evppi_gam(nmb, rr_hat)), big.mark = \",\")))\ncat(sprintf(\"Population EVPI (10y)   : %s\\n\",\n            format(round(per_person * effective_population(1000, 10)), big.mark = \",\")))",
        "description": "Same VOI ladder in R with mgcv, the de facto standard for regression-based EVPPI (and the engine behind the BCEA and\nvoi packages). evpi() is the nonparametric PSA estimator; evppi_gam() fits gam(NMB_d ~ s(theta)) per strategy so the\nfitted values estimate the conditional expected NMB; EVSI reuses evppi_gam() on the simulated future-study summary\nstatistic; effective_population() handles discounted population scaling. For production, the voi and BCEA packages\nwrap these estimators with diagnostics.",
        "dependencies": [
          "mgcv"
        ],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  PSA[PSA sample<br/>n_sims draws of all parameters] --> NMB[NMB matrix<br/>n_sims x strategies at threshold lambda]\n  NMB --> DU[Decision uncertainty<br/>P best choice is wrong + CEAC]\n  NMB --> EVPI[EVPI = mean of per-draw max NMB<br/>minus max of mean NMB]\n  EVPI --> POP[Population EVPI<br/>x discounted incidence x horizon]\n  POP --> Gate1{Population EVPI<br/>greater than cheapest<br/>useful study?}\n  Gate1 -- No --> Adopt[Adopt best strategy<br/>no research worth funding]\n  Gate1 -- Yes --> EVPPI[EVPPI per parameter block<br/>regression on PSA sample]\n  EVPPI --> Design[Design candidate RWE study<br/>for the high-value block<br/>n, follow-up, expected bias]\n  Design --> EVSI[EVSI = value of that study<br/>regression on simulated study statistic]\n  EVSI --> ENBS{ENBS = population EVSI<br/>minus study cost<br/>greater than 0?}\n  ENBS -- Yes --> Fund[Fund the study<br/>at the ENBS-maximizing n]\n  ENBS -- No --> Adopt",
        "caption": "The VOI ladder as a funding decision pipeline - from the PSA's NMB matrix through EVPI (is any research worth anything?), EVPPI (which parameters carry the value?), and EVSI/ENBS (does this specific RWE study clear its own cost?).",
        "alt_text": "Flowchart from a PSA sample to an NMB matrix, then to EVPI and population scaling, a gate on whether any research is worthwhile, EVPPI to pick the high-value parameter block, design of a candidate RWE study, EVSI for that study, and a final expected-net-benefit-of-sampling gate deciding fund versus adopt.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "gantt\n  title Adopt B now (EVPI 7,500 per patient) - 2,000 patients over the 2-year decision horizon\n  dateFormat YYYY-MM-DD\n  axisFormat %b %Y\n  section Decision\n  Adopt strategy B under current evidence (expected NMB 105,000 vs 100,000) :milestone, d1, 2026-01-01, 0d\n  section Affected population\n  Year 1 cohort - 1,000 patients, per-patient EVPI 7,500 :crit, c1, 2026-01-01, 365d\n  Year 2 cohort - 1,000 patients, per-patient EVPI 7,500 :crit, c2, 2027-01-01, 365d\n  section Ceiling on research value\n  Population EVPI 15,000,000 accrues across both cohorts :active, r1, 2026-01-01, 730d",
        "caption": "The worked example as a timeline - the adoption decision is taken at the start of 2026 while B is the right choice in only half the simulations; every patient treated during the 2-year decision horizon carries the 7,500 per-patient EVPI, so the 2,000-patient population puts a 15,000,000 ceiling on any research program for this decision.",
        "alt_text": "Gantt timeline with the adopt-B decision milestone on January 1 2026, two annual cohorts of 1,000 patients each treated under that decision, and a bar spanning both years labeled population EVPI 15,000,000.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": "value-of-information-evpi-evsi-rwe-timeline.svg",
        "mermaid": null,
        "caption": "The 4-simulation worked example as a decision timeline - the adoption decision under current evidence at the start of 2026, two annual cohorts of 1,000 patients entering under that decision, and the population EVPI of 15,000,000 over the 2-year horizon (7,500 per patient x 2,000 patients).",
        "alt_text": "Timeline showing the adopt-strategy-B decision on January 1 2026, year-one and year-two cohorts of 1,000 patients each entering care, and a result bar stating population EVPI equals 7,500 times 2,000 equals 15 million.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "requires",
        "target_slug": "probabilistic-sensitivity-analysis-hea-rwe",
        "notes": "VOI is computed from the PSA sample - EVPI is a two-line function of the PSA's NMB matrix, and the regression estimators for EVPPI/EVSI run on the same draws. No PSA, no VOI."
      },
      {
        "relation_type": "requires",
        "target_slug": "icer-net-monetary-benefit-rwe",
        "notes": "All VOI quantities live on the net-monetary-benefit scale at a stated threshold; computing them on the ICER scale breaks exactly where decisions are uncertain (sign flips, undefined ratios)."
      },
      {
        "relation_type": "used_with",
        "target_slug": "health-economic-modeling-methods-rwe",
        "notes": "The decision model (Markov, partitioned survival, DES) generates the PSA that VOI consumes; structural choices in that model are invisible to parameter-only VOI unless explicitly parameterized."
      },
      {
        "relation_type": "used_with",
        "target_slug": "discounting-costs-effects-rwe",
        "notes": "Population scaling discounts each future year's incident cohort to present value; the discount rate and decision horizon jointly drive population EVPI more than most model parameters."
      },
      {
        "relation_type": "tradeoff_with",
        "target_slug": "scenario-deterministic-sensitivity-analysis-hea-rwe",
        "notes": "Tornado/scenario analysis ranks parameters by ICER swing; EVPPI ranks them by expected decision impact - the correct basis for spending research money. Use the tornado to understand the model, EVPPI to fund studies."
      },
      {
        "relation_type": "see_also",
        "target_slug": "cost-effectiveness",
        "notes": "VOI is the research-prioritization layer of cost-effectiveness analysis - it prices the residual uncertainty left after the base-case and probabilistic results."
      },
      {
        "relation_type": "see_also",
        "target_slug": "cost-utility",
        "notes": "QALY-based cost-utility models are the usual substrate; the willingness-to-pay threshold that converts QALYs to NMB is also the axis along which EVPI should be reported as a curve."
      }
    ],
    "aliases": [
      "value of information",
      "VOI analysis",
      "expected value of perfect information",
      "EVPI",
      "partial EVPI",
      "EVPPI",
      "expected value of sample information",
      "EVSI",
      "expected net benefit of sampling",
      "ENBS",
      "research prioritization analysis"
    ],
    "complexity": "advanced",
    "regulatory_relevance": [
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "visualizations-pharmacoepidemiology-rwe",
    "name": "Visualizations and Diagrams in Pharmacoepidemiology and RWE",
    "short_definition": "The disciplined use of diagrams and statistical graphics — causal DAGs, study-design timelines, attrition flows, propensity-score overlap and standardized-difference (Love) plots, Kaplan-Meier/cumulative-incidence curves, forest plots, and treatment-pattern Sankeys — to specify causal assumptions and time anchors, diagnose positivity and confounding control, and transparently report effect estimates across the RWE study lifecycle.",
    "long_description": "Visualizations in pharmacoepidemiology are not decoration; they are **decision-support and transparency\ninstruments** that make assumptions, time anchors, data fitness, and diagnostics legible to statisticians,\nclinicians, regulators, and payers. A small canonical set recurs across the study lifecycle: the **directed\nacyclic graph (DAG)** to declare the causal structure and the minimal sufficient adjustment set; the\n**study-design timeline** (calendar time vs. patient event time, with eligibility, washout, time zero, covariate\nwindow, follow-up, and censoring) to forestall immortal-time and adjustment-for-the-future errors; the\n**attrition/CONSORT flow** to expose selection; the **propensity-score overlap plot** and **standardized-mean-difference\n(\"Love\") plot** to certify positivity and balance; the **Kaplan-Meier or cumulative-incidence curve** with a\nnumber-at-risk table; the **forest plot** for primary, subgroup, and sensitivity estimates; and the **Sankey** for\ntreatment patterns and lines of therapy. Each pairs with exact text and tables (N, point estimate + CI, code lists,\nwindows, trimming rules) so the figure is auditable, not impressionistic.\n\n**Core conceptual distinction.** A figure encodes one of three distinct epistemic jobs, and conflating them is the\nmost common failure. (1) *Assumption-declaring* figures (DAGs, design timelines) state what the analyst believes\nabout causal structure and time and are not derived from the data — a DAG cannot be \"validated\" by the dataset it\nmotivates. (2) *Diagnostic* figures (PS-overlap, Love plot, Schoenfeld-residual plot, attrition flow) are computed\nfrom the data to test whether a design assumption holds — positivity, covariate balance, proportional hazards,\nunbiased selection. 
(3) *Inferential/reporting* figures (KM/CIF, forest plot) display the estimated quantity and\nmust therefore be pinned to a pre-specified **estimand**: the complement of a Kaplan-Meier curve (1-KM) treats\ncompeting events as censoring and so reports the risk in a hypothetical setting without them, whereas a\ncumulative-incidence (Aalen-Johansen) curve reports the actual probability of the event in the presence of\ncompeting risks. In elderly or oncology cohorts those two curves diverge sharply, and choosing 1-KM when death\ncompetes overstates the real-world cumulative incidence of the event. The graphic\nmust match the model (cause-specific Cox vs. Fine-Gray subdistribution) and the question, not the other way around.\n\n**Pros, cons, and trade-offs.**\n- **vs. text/table-only reporting (e.g., a bare STaRT-RWE table):** Figures communicate temporal and causal\n  structure that prose hides: a reviewer sees immortal time in a timeline diagram instantly but may miss it in a\n  methods paragraph. Cost: figures can oversimplify (a DAG omits unmeasured nodes by drafting choice) and are\n  misread without the accompanying numbers; they require tooling and iteration. **Prefer figures** for any\n  comparative design, but never ship a figure without its paired table.\n- **vs. interactive dashboards (Shiny/Streamlit):** Static, code-generated SVG/PNG plus portable Mermaid are\n  version-controllable, reproducible, embeddable in protocols and publications, and need no server. Cost: no live\n  subgroup/sensitivity filtering. **Prefer static** for protocol, publication, and regulatory submission; supply\n  code so reviewers can regenerate or extend interactively.\n- **DAG (DAGitty/dagitty) vs. ad-hoc box-and-arrow diagrams:** A machine-readable DAG yields the adjustment set\n  algorithmically and flags colliders/M-bias; a hand-drawn diagram invites conditioning on a collider or a mediator.\n  Cost: a DAG forces you to commit to a structure you may be unsure of. 
**Prefer a formal DAG** whenever adjustment\n  decisions are contested.\n- **Kaplan-Meier vs. cumulative-incidence function:** KM is correct only when the competing-risk hazard is null or\n  truly independent; CIF is correct under competing events. **Prefer CIF (with cause-specific or Fine-Gray models)**\n  in elderly, oncology, and end-stage-disease cohorts, and report at-risk and competing-event counts alongside.\n\n**When to use.** Specify the DAG and the design timeline in the protocol/SAP before any data pull; produce the\nattrition funnel, PS-overlap, and Love plot as gating diagnostics before fitting the outcome model; produce KM/CIF\nand forest plots only for the pre-registered estimand. Use a Sankey when the deliverable is treatment sequencing,\nswitching, or lines of therapy. Pre-register each figure (\"a Love plot of |SMD| pre/post weighting with a 0.1\nreference line will be reported\") so the figure set is a commitment, not a post-hoc choice.\n\n**When NOT to use — and when it is actively misleading or dangerous.**\n- **A forest plot that pools across data sources or payers with materially different capture** hides heterogeneity;\n  if Medicare Advantage encounter completeness differs from fee-for-service claims, a single pooled point estimate\n  is a weighted average of incommensurable measurements. Facet or stratify instead.\n- **A Kaplan-Meier curve where death competes with the event** (e.g., re-hospitalization in heart-failure elders)\n  is actively misleading — 1-KM overstates risk; switch to CIF.\n- **A Love plot used as proof of no confounding.** Balance on *measured* covariates says nothing about unmeasured\n  confounding; a beautiful Love plot can coexist with severe residual bias. 
Pair with negative-control diagnostics\n  and an E-value, never present balance as exchangeability.\n- **A DAG presented as data-derived truth.** It is an assumption; over-simplified DAGs that omit a real confounder\n  or draw a confounder as a collider will license the wrong adjustment set.\n- **A PS-overlap plot read only at the mean.** Near-positivity lives in the tails; a histogram that looks fine\n  centrally can hide regions where one arm has near-zero density, which matching silently discards while the\n  estimand quietly shifts.\n- **An attrition flow that reports only totals, not per-arm exclusions.** Differential exclusion by arm is exactly\n  the selection bias the figure exists to surface.\n\n**Data-source operational depth.**\n- **Claims (FFS vs. MA vs. commercial):** Attrition funnels, PS/Love plots, KM/CIF, and forests should be faceted\n  by payer because measurement differs. Medicare Advantage encounter data lack the fee-for-service claim stream, so\n  MA-only person-time can masquerade as \"no event\" — restrict survival and attrition visuals to enrollees with the\n  relevant benefit (A/B/D or commercial pharmacy) and exclude MA-only spans. MA risk-adjustment activity (HCC\n  coding, health-risk assessments, chart review) inflates diagnosis frequency, so a Love plot can look *better*\n  simply because a covariate is coded more intensely in one arm — a coding artifact, not real balance.\n  **Differential competing risks** are common: if one drug is preferentially used in sicker elderly patients, death\n  competes more in that arm, and a KM (vs. CIF) contrast is biased by the competing event, not the exposure.\n- **EHR:** Visit-driven capture means the time axis of a design timeline must show observation windows explicitly;\n  a patient who leaves the system is differentially lost, so attrition and KM curves must treat loss to follow-up as\n  potentially informative. 
Labs, vitals, and NLP-derived problem-list nodes enrich DAGs and sharpen outcome\n  adjudication relative to claims.\n- **Registry:** Adjudicated outcomes and stage/severity make KM/CIF and DAG severity-nodes reliable; pharmacy\n  exposure is usually incomplete — link to claims for fills and to a death index to firm up the censoring shown in\n  survival visuals.\n- **Linked claims–EHR–vital records:** The ideal substrate for trustworthy survival and competing-risk figures\n  (severity + completeness + mortality), but linkage selection (only the linkable subset) and order/fill/service\n  date discrepancies must be reconciled before the time-zero shown on the design timeline is assigned. A subtle\n  trap is **immortal time in procedure studies**: if time zero is set at diagnosis but the index procedure occurs\n  later, the design timeline will reveal guaranteed event-free (\"immortal\") person-time between the two — anchoring\n  the timeline figure at the procedure date (or using a landmark) makes the bias visible and removable.\n\n**Worked claims example.** Question: incident heart-failure hospitalization with a second-generation sulfonylurea\nvs. a DPP-4 inhibitor among type-2-diabetes adults in a commercial + Medicare FFS database. The figure set, in\norder: (1) **Database-feasibility attrition funnel** — source diabetes population (e.g., N=412,000) → ≥2 diabetes\ndiagnoses (N=355,000) → 365 days continuous A/B/D or commercial medical+pharmacy enrollment before first study fill\n(N=180,000) → no sulfonylurea/DPP-4 fill in the 365-day washout, i.e., incident users (N=64,000) → analytic cohort\nafter age ≥18 and exclusion of MA-only person-time (N=58,400), with the **per-arm** count and exclusion reason at\neach box so differential drop-out is visible. 
(2) **Study-design timeline** anchoring time zero at the first\nqualifying fill, with the 365-day baseline covariate window strictly before time zero and follow-up censored at\ndisenrollment, death, end of data, and (as-treated) discontinuation = last `days_supply` end + grace period. (3)\n**PS-overlap density** by arm to confirm positivity, then a **Love plot** of |SMD| before vs. after 1:1\nmatching/overlap weighting with a 0.1 reference line — read in the tails, not just at the mean. (4) A\n**cumulative-incidence (Aalen-Johansen) curve** for HF hospitalization with all-cause death as a competing event\n(not 1-KM, because death competes heavily in this elderly-skewed cohort), with a number-at-risk and number-of-deaths\ntable. (5) A **forest plot** of the cause-specific HR and the Fine-Gray subdistribution HR for the primary analysis\nplus sensitivity analyses (washout 180 vs. 365 days, grace period, negative-control outcome), faceted FFS vs.\ncommercial so payer heterogeneity in capture is explicit rather than averaged away.",
    "primary_category": "Framework_Standard",
    "tags": [
      "visualization",
      "diagrams",
      "dags",
      "study-design-diagrams",
      "forest-plots",
      "kaplan-meier",
      "cumulative-incidence",
      "sankey",
      "positivity-plot",
      "love-plot",
      "attrition-flow",
      "transparency",
      "pharmacoepidemiology",
      "reporting"
    ],
    "applies_to_study_types": [
      "claims_analysis",
      "ehr_study",
      "linked_data",
      "multi_database",
      "target_trial_emulation",
      "comparative_effectiveness",
      "cohort_retrospective"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked",
      "multi-database"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1002/pds.5529",
        "url": "https://doi.org/10.1002/pds.5529",
        "citation_text": "Gatto NM, Wang SV, Murk W, Mattox P, Brookhart MA, Bate A, Schneeweiss S, Rassen JA. Visualizations throughout pharmacoepidemiology study planning, implementation, and reporting. Pharmacoepidemiol Drug Saf. 2022;31(11):1140-1152.",
        "year": 2022,
        "authors_short": "Gatto et al.",
        "notes": "Canonical methods statement mapping DAGs, study-design diagrams, positivity/balance plots, attrition flowcharts, forest plots, Kaplan-Meier/CIF, and Sankey diagrams onto the planning, implementation, and reporting stages of a pharmacoepidemiologic study, with transparency rationale for each."
      },
      {
        "role": "explain",
        "doi": "10.1136/bmj.m4856",
        "url": "https://doi.org/10.1136/bmj.m4856",
        "citation_text": "Wang SV, Pinheiro S, Hua W, et al. STaRT-RWE: structured template for planning and reporting on the implementation of real world evidence studies. BMJ. 2021;372:m4856.",
        "year": 2021,
        "authors_short": "Wang et al.",
        "notes": "Structured template that operationalizes the design-timeline diagram (time zero, washout, covariate and follow-up windows) and the figures/tables expected in a transparent, reproducible RWE report."
      },
      {
        "role": "demonstrate",
        "doi": "10.1007/s40801-019-00167-6",
        "url": "https://doi.org/10.1007/s40801-019-00167-6",
        "citation_text": "Xia AD, Schaefer CP, Szende A, Jahn E, Hirst MJ. RWE Framework: an interactive visual tool to support a real-world evidence study design. Drugs Real World Outcomes. 2019;6(4):193-203.",
        "year": 2019,
        "authors_short": "Xia et al.",
        "notes": "Worked interactive decision-flow diagram guiding sequential RWE design choices (objectives, approval status, setting, data availability, randomization, study type, regulatory standards); the archetype for a study-design decision flowchart."
      },
      {
        "role": "use",
        "doi": "10.1093/aje/kwv254",
        "url": "https://doi.org/10.1093/aje/kwv254",
        "citation_text": "Hernán MA, Robins JM. Using big data to emulate a target trial when a randomized trial is not available. Am J Epidemiol. 2016;183(8):758-764.",
        "year": 2016,
        "authors_short": "Hernán & Robins",
        "notes": "Target-trial emulation framing whose protocol components (eligibility, assignment, time zero, follow-up, outcome) are exactly what the study-design timeline and emulation flowchart make visible and auditable."
      }
    ],
    "plain_language_summary": "In a pharmacoepidemiology study, charts and diagrams do real scientific work — they are not decoration. Each visualization in the standard toolkit answers a different question: Did we build the study population fairly? Do the two treatment groups look alike before we compare outcomes? What happened to patients over time? Picking the right chart for each question, and avoiding the wrong one, is what separates a transparent study report from a misleading one.",
    "key_terms": [
      {
        "term": "attrition funnel",
        "definition": "A step-by-step flow diagram showing how many patients were excluded at each filter (e.g., diagnosis requirement, enrollment length, prior drug use) on the way from the full database to the final study group."
      },
      {
        "term": "Love plot",
        "definition": "A dot plot showing how similar the two treatment groups are on each background characteristic before and after statistical adjustment — a dot past the 0.1 line means that characteristic is still imbalanced."
      },
      {
        "term": "Kaplan-Meier curve",
        "definition": "A step-down graph tracking the fraction of each treatment group that has not yet experienced an event (e.g., hospitalization) as time passes from the start of the study."
      },
      {
        "term": "cumulative-incidence curve",
        "definition": "An alternative to the Kaplan-Meier curve used when patients can also die or switch treatment before the event of interest occurs; it shows the true probability of the event in that real-world context."
      },
      {
        "term": "forest plot",
        "definition": "A chart with one row per analysis (primary result, subgroup, sensitivity check) showing each estimated effect size and its uncertainty interval as a horizontal line with a box in the middle."
      },
      {
        "term": "DAG",
        "definition": "Short for directed acyclic graph — a box-and-arrow diagram drawn before looking at any data that maps out which variables cause which others, so the analyst knows which ones must be adjusted for."
      }
    ],
    "worked_example": {
      "scenario": "A research team is writing a report on a study comparing two diabetes pills — a sulfonylurea and a DPP-4 inhibitor — to see which is associated with fewer heart-failure hospitalizations. Before submitting the report, the team needs to choose the right visualization for each section: (1) showing how they built the study group, (2) showing that the two drug groups are comparable, (3) showing how outcomes unfolded over time, and (4) showing results across subgroups and sensitivity analyses. The table below maps each communication need to the correct visualization and flags which chart would be wrong.",
      "dataset": {
        "caption": "Study communication needs mapped to the right pharmacoepidemiology visualization",
        "columns": [
          "Section of report",
          "Communication need",
          "Correct visualization",
          "Wrong choice and why"
        ],
        "rows": [
          [
            "Cohort selection",
            "Show how the 412,000 diabetes patients were filtered down to 58,400 in the analytic cohort",
            "Attrition funnel — one box per filter with patient counts and exclusion reasons, reported separately for each drug arm",
            "A single summary table — hides whether one drug arm lost more patients than the other at any step"
          ],
          [
            "Study design",
            "Show when covariates were measured, when follow-up started, and what events end follow-up",
            "Study-design timeline (Gantt-style) anchored at the first qualifying prescription fill",
            "A paragraph of methods text — a reviewer cannot spot immortal time or a misplaced measurement window in prose as easily as in a diagram"
          ],
          [
            "Covariate balance",
            "Show that the two drug groups look similar on age, kidney disease, prior hospitalizations, etc., before and after statistical weighting",
            "Love plot — one dot per variable, two series (before and after weighting), vertical reference line at 0.1",
            "Reporting only the p-values from t-tests — p-values depend on sample size and do not directly measure how different the groups are"
          ],
          [
            "Time-to-event outcomes",
            "Show the probability of heart-failure hospitalization over two years when some patients also die before being hospitalized",
            "Cumulative-incidence curve (Aalen-Johansen method) with a table of patients still at risk and patients who died",
            "Standard Kaplan-Meier curve — in an older cohort where death competes with hospitalization, 1-minus-KM overstates how many patients would eventually be hospitalized"
          ],
          [
            "Subgroup and sensitivity results",
            "Show the main effect estimate plus results for age groups, sex, kidney-disease status, and two alternate analysis specifications",
            "Forest plot — one row per analysis, box at the point estimate, horizontal line for the uncertainty interval, vertical null-effect reference line",
            "A single number in the abstract — one result cannot communicate whether the finding holds across patient subgroups or changes under different analytic assumptions"
          ]
        ]
      },
      "steps": [
        "Start at the beginning of the study lifecycle, not the end. Draw the study-design timeline in the protocol before any data are pulled — it forces the team to declare exactly where time zero sits and prevents the exposure window from accidentally overlapping the outcome window.",
        "Build the attrition funnel next, during cohort construction. Report exclusion counts for each drug arm separately so that differential drop-out (one arm losing more patients than the other) is visible and can be explained.",
        "Before fitting any outcome model, produce the Love plot as a diagnostic checkpoint. If dots remain past the 0.1 line after weighting, the statistical adjustment is incomplete and the outcome analysis should wait.",
        "When selecting between a Kaplan-Meier curve and a cumulative-incidence curve, ask one question: can patients die or permanently switch treatment before experiencing the event of interest? For older patients with diabetes, the answer is yes — use the cumulative-incidence curve.",
        "Assemble the forest plot last, after all pre-specified sensitivity analyses are run. Each row should correspond to an analysis that was declared in the study plan before data were analyzed, not added after seeing the primary result."
      ],
      "result": "The recommended visualization set for this study is: (1) per-arm attrition funnel, (2) study-design timeline anchored at the index prescription fill, (3) Love plot with a 0.1 reference line, (4) cumulative-incidence curve with a competing-death table, and (5) forest plot with subgroup and sensitivity rows. Every chart answers a specific question; no chart is interchangeable with another. The logic is: match the picture to the epistemic job — assumption-declaring, diagnostic, or inferential — and choose within each job based on the clinical context (here, competing death in an older cohort mandates the cumulative-incidence curve over Kaplan-Meier)."
    },
    "prerequisites": [
      "dags-backdoor-criterion-drug-studies",
      "database-feasibility-attrition-funnel-rwe",
      "baseline-characteristics-and-covariate-balance-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Causal-structure DAG (assumption-declaring)",
        "description": "Directed acyclic graph with nodes for exposure, outcome, confounders, mediators, colliders, and instruments and directed edges for assumed causal relations; yields the minimal sufficient adjustment set and flags back-door paths, colliders, and M-bias. Render portably in Mermaid for the catalog; compute adjustment sets in DAGitty/dagitty.",
        "edge_cases": [
          "Time-varying confounding affected by prior treatment cannot be shown on a static cross-sectional DAG; use a time-indexed DAG and g-methods, not standard adjustment.",
          "Drawing a true confounder as a collider (or vice versa) silently licenses the wrong adjustment set; validate structure with subject-matter experts and negative controls, not with the data."
        ],
        "data_source_notes": "claims: nodes are code-based proxies (diagnoses/procedures/drugs/utilization), so a node may represent coding intensity rather than true clinical state, especially under MA risk adjustment. EHR: labs, vitals, and NLP add genuine clinical nodes and reduce proxy ambiguity."
      },
      {
        "name": "Study-design timeline (Gantt/flow for time anchors)",
        "description": "Calendar-time vs. patient-event-time diagram showing eligibility, lookback/washout, time zero (index date), covariate-assessment window (strictly pre-index), follow-up start/end, outcome risk windows, and censoring rules (disenrollment, death, competing events, switch, end of data). Primary device for preventing immortal-time bias and adjustment-for-the-future.",
        "edge_cases": [
          "Procedure studies anchored at diagnosis rather than the index procedure reveal immortal person-time on the timeline; re-anchor at the procedure or use a landmark.",
          "Grace periods and per-protocol cloning/censoring/weighting create eligibility-time ambiguity that must be drawn explicitly, not implied."
        ],
        "data_source_notes": "claims: use claim-type-specific dates (service_date, discharge_date, fill_date) for precise anchors and show MA-only spans as non-observable. EHR: draw visit-driven observation windows explicitly."
      },
      {
        "name": "Diagnostic plots (positivity, balance, model checks)",
        "description": "PS-overlap density/histogram by arm (read in the tails for near-positivity), Love plot of standardized mean differences before/after matching or weighting with a 0.1 reference line, and Schoenfeld-residual-vs-time plots for the proportional-hazards assumption behind any Cox-based survival figure.",
        "edge_cases": [
          "A Love plot can improve spuriously when a covariate is coded more intensely in one arm (MA HCC/HRA coding), so balance is a coding artifact rather than true exchangeability.",
          "Balance on measured covariates is silent on unmeasured confounding; never present a Love plot as proof of no confounding — pair with negative controls and an E-value."
        ],
        "data_source_notes": "claims: high-dimensional proxy covariates (IP/OP/pharmacy) require dimension-aware balance summaries and payer-stratified plots. EHR: add labs/vitals to the balance assessment."
      },
      {
        "name": "Reporting plots (attrition, time-to-event, forest, Sankey)",
        "description": "CONSORT-style per-arm attrition flow; Kaplan-Meier with number-at-risk or cumulative-incidence (Aalen-Johansen/Fine-Gray) under competing risks; forest plot of HR/OR/RD with CI, a null reference line, and subgroup/sensitivity rows; Sankey for treatment patterns, switching, and lines of therapy.",
        "edge_cases": [
          "Competing death/switch in oncology and elderly cohorts makes 1-KM overstate cumulative risk; use CIF and report competing-event counts.",
          "Forests pooled across payers with different capture hide heterogeneity; facet or stratify FFS/MA/commercial."
        ],
        "data_source_notes": "claims: build attrition from enrollment + validated code filters and Sankeys from pharmacy/procedure sequences; link to a death index for accurate censoring. Registry: adjudicated outcomes make KM/CIF gold-standard."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Text-only or table-only reporting (bare STaRT-RWE table)",
        "pros_of_this": "Figures expose temporal and causal structure (immortal time, selection, non-overlap, non-PH) that prose hides, and communicate to mixed audiences faster than paragraphs.",
        "cons_of_this": "Can oversimplify and are misread without the paired numbers; require tooling and iteration; a figure asserts choices (e.g., which nodes exist in a DAG) that a table makes explicit.",
        "when_to_prefer": "Any comparative safety/effectiveness or HEOR study; always pair each figure with its exact N/estimate/CI/code-list table for auditability."
      },
      {
        "compared_to": "Interactive dashboards (Shiny/Streamlit)",
        "pros_of_this": "Static, code-generated SVG/PNG plus portable Mermaid are version-controllable, reproducible, embeddable in protocols and submissions, and need no server.",
        "cons_of_this": "No live subgroup or sensitivity filtering; Mermaid cannot render complex statistical plots (use R/Python exports for KM/CIF, forest, Sankey).",
        "when_to_prefer": "Protocol, publication, and regulatory-submission contexts; ship code so reviewers can regenerate or extend interactively."
      },
      {
        "compared_to": "Kaplan-Meier survival curve",
        "pros_of_this": "A cumulative-incidence (Aalen-Johansen/Fine-Gray) curve reports the actual event probability under competing risks and matches a Fine-Gray or cause-specific model.",
        "cons_of_this": "CIF is unnecessary and slightly less familiar when competing risks are negligible; requires specifying cause-specific vs. subdistribution intent.",
        "when_to_prefer": "Elderly, oncology, and end-stage cohorts where death or switch competes with the event of interest; otherwise KM with a number-at-risk table is adequate."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Facet attrition, PS/Love, KM/CIF, and forest plots by payer because capture differs. Build attrition from continuous-enrollment spans + validated code filters; build Sankeys from ordered pharmacy/procedure sequences. Exclude MA-only person-time from survival and attrition visuals (no FFS claim stream); treat MA HCC/HRA coding intensity as a threat to Love-plot interpretation. Link to a death index so KM/CIF censoring is accurate.",
      "ehr": "Draw visit-driven observation windows explicitly on the design timeline; treat loss to follow-up as potentially informative in attrition and KM. Use labs/vitals/NLP problem lists as real DAG nodes and for richer outcome adjudication than codes alone.",
      "registry": "Adjudicated outcomes and stage/severity make KM/CIF and DAG severity nodes reliable; link to claims for complete pharmacy exposure (Sankey, PS) and to a death index for censoring.",
      "linked": "Reconcile order/fill/service date discrepancies before assigning the time zero shown on the design timeline; show linkage selection (linkable subset) in the attrition flow; use linked mortality to strengthen competing-risk (CIF) figures.",
      "multi-database": "Harmonize time anchors and code mappings (OMOP concept sets) but stratify PS/Love, forests, attrition, and KM/CIF by database/payer/TA to reveal heterogeneity; distributed analysis supports local figures plus pooled summary forests."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\nimport pandas as pd\nimport matplotlib.pyplot as plt\n\nCOVARS = [\"age\", \"female\", \"ckd\", \"prior_hf\", \"insulin\", \"n_hosp_baseline\"]  # baseline-window covariates\n\ndef std_mean_diff(df, var, weights=None):\n    t = df[\"arm\"] == 1\n    x1, x0 = df.loc[t, var], df.loc[~t, var]\n    if weights is None:\n        m1, m0 = x1.mean(), x0.mean()\n        v1, v0 = x1.var(ddof=1), x0.var(ddof=1)\n    else:\n        w1, w0 = weights[t], weights[~t]\n        m1 = np.average(x1, weights=w1); m0 = np.average(x0, weights=w0)\n        v1 = np.average((x1 - m1) ** 2, weights=w1)\n        v0 = np.average((x0 - m0) ** 2, weights=w0)\n    pooled = np.sqrt((v1 + v0) / 2.0)\n    return 0.0 if pooled == 0 else (m1 - m0) / pooled\n\ndef overlap_and_love(df: pd.DataFrame, out_overlap=\"overlap.png\", out_love=\"love.png\"):\n    # --- Positivity / overlap: PS densities by arm ---\n    fig, ax = plt.subplots(figsize=(6, 4))\n    for a, lab in [(1, \"Study\"), (0, \"Comparator\")]:\n        ax.hist(df.loc[df[\"arm\"] == a, \"ps\"], bins=40, density=True, alpha=0.5, label=lab)\n    ax.set_xlabel(\"Estimated propensity score\"); ax.set_ylabel(\"Density\")\n    ax.set_title(\"PS overlap (inspect the tails for near-positivity)\"); ax.legend()\n    fig.tight_layout(); fig.savefig(out_overlap, dpi=150); plt.close(fig)\n\n    # --- Balance: |SMD| before vs after IPTW, with the 0.1 decision line ---\n    rows = [(v, abs(std_mean_diff(df, v)), abs(std_mean_diff(df, v, df[\"iptw\"]))) for v in COVARS]\n    bal = pd.DataFrame(rows, columns=[\"covariate\", \"before\", \"after\"]).sort_values(\"before\")\n    y = np.arange(len(bal))\n    fig, ax = plt.subplots(figsize=(6, 0.45 * len(bal) + 1))\n    ax.scatter(bal[\"before\"], y, marker=\"o\", label=\"Unweighted\")\n    ax.scatter(bal[\"after\"], y, marker=\"x\", label=\"IPTW\")\n    ax.axvline(0.1, ls=\"--\", color=\"grey\")  # |SMD| < 0.1 balance threshold\n    ax.set_yticks(y); 
ax.set_yticklabels(bal[\"covariate\"])\n    ax.set_xlabel(\"|Standardized mean difference|\")\n    ax.set_title(\"Covariate balance (Love plot)\"); ax.legend()\n    fig.tight_layout(); fig.savefig(out_love, dpi=150); plt.close(fig)\n    return bal",
        "description": "Highest-yield diagnostic figure: a propensity-score overlap density plus a standardized-mean-difference (Love)\nplot from an already-fit PS. Required inputs (one row per analytic-cohort member, after ACNU/cohort construction):\n  df : person_id, arm (1=study,0=comparator), ps (estimated propensity score in (0,1)),\n       iptw (stabilized inverse-probability weight), and the baseline covariate columns in COVARS\n       (binary or continuous, measured in [index_date-365, index_date]).\nProduces overlap.png (positivity check) and love.png (balance before/after weighting, 0.1 reference line).\nRead the overlap plot in the tails, not at the mean, to catch near-positivity.",
        "dependencies": [
          "pandas",
          "numpy",
          "matplotlib"
        ],
        "source_citations": [
          "gatto-2022"
        ],
        "notes": ""
      },
      {
        "lang": "python",
        "code": "import matplotlib.pyplot as plt\nfrom lifelines import AalenJohansenFitter\n\ndef plot_cif(df, event_of_interest=1, out=\"cif.png\"):\n    fig, ax = plt.subplots(figsize=(6, 4))\n    for a, lab in [(1, \"Study\"), (0, \"Comparator\")]:\n        sub = df[df[\"arm\"] == a]\n        ajf = AalenJohansenFitter()\n        ajf.fit(sub[\"time\"], sub[\"status\"], event_of_interest=event_of_interest)\n        ajf.plot(ax=ax, label=f\"{lab} (event {event_of_interest})\")\n    ax.set_xlabel(\"Days since time zero\")\n    ax.set_ylabel(\"Cumulative incidence\")\n    ax.set_title(\"Cumulative incidence with competing death (Aalen-Johansen)\")\n    ax.legend()\n    fig.tight_layout(); fig.savefig(out, dpi=150); plt.close(fig)",
        "description": "Reporting figure under competing risks: a cumulative-incidence (Aalen-Johansen) curve, the correct alternative to\n1-Kaplan-Meier when death competes with the event. Required input (one row per analytic-cohort member):\n  df : person_id, arm, time (days from index to first of event/competing-death/censor),\n       status (0=censored, 1=event of interest, 2=competing death).\nPlots CIF for the event of interest by arm; do NOT use 1-KM here because it would overstate cumulative risk.",
        "dependencies": [
          "pandas",
          "matplotlib",
          "lifelines"
        ],
        "source_citations": [
          "gatto-2022"
        ],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(cobalt); library(WeightIt)\nlibrary(survival); library(survminer); library(cmprsk)\n\ncovars <- c(\"age\", \"female\", \"ckd\", \"prior_hf\", \"insulin\", \"n_hosp_baseline\")\n\n## --- Positivity + balance ---\nW <- weightit(reformulate(covars, \"arm\"), data = cohort,\n              method = \"ps\", estimand = \"ATE\")          # stabilized IPTW\nbal.plot(W, var.name = \"prop.score\", which = \"both\")    # PS overlap by arm (read the tails)\nlove.plot(W, thresholds = c(m = 0.1), abs = TRUE,       # |SMD| before vs after, 0.1 line\n          stars = \"none\", var.order = \"unadjusted\")\n\n## --- Competing-risks cumulative incidence (event=1, competing death=2) ---\nci <- cuminc(ftime = cohort$time, fstatus = cohort$status,\n             group = cohort$arm, cencode = 0)\nplot(ci, lty = 1, color = c(\"steelblue\", \"firebrick\"),\n     xlab = \"Days since time zero\", ylab = \"Cumulative incidence\")\n## Proportional-hazards check for any Cox-based survival figure:\ncox <- coxph(Surv(time, status == 1) ~ arm, data = cohort)\nprint(cox.zph(cox))                                     # Schoenfeld test before plotting KM/HR",
        "description": "Balance diagnostics and a competing-risks cumulative-incidence figure in R. Inputs:\n  cohort : data.frame with person_id, arm (factor study/comparator), the baseline covariates in `covars`,\n           ps, iptw (stabilized weight), and for survival: time (days) and status\n           (0=censor, 1=event, 2=competing death).\ncobalt::love.plot draws the standardized-difference Love plot directly from a weightit/matchit object or raw\nweights; survminer/cmprsk render the CIF. Inspect the PS overlap before trusting the Love plot.",
        "dependencies": [
          "cobalt",
          "WeightIt",
          "survival",
          "survminer",
          "cmprsk"
        ],
        "source_citations": [
          "gatto-2022"
        ],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* PS overlap + standardized differences (positivity + balance diagnostics) */\nproc psmatch data=work.cohort region=allobs;\n  class arm <categorical baseline covariates>;\n  psmodel arm(treated='STUDY') = <baseline covariates>;   /* covariates from baseline window only */\n  assess lps ps var=(<key covariates>)\n         / plots(effects=stddiff)=(barchart boxplot);      /* Love-style SMD + PS overlap */\n  output out=ps_out;\nrun;\n\n/* Competing-risks cumulative incidence (death competes with the event of interest) */\nproc lifetest data=work.cohort plots=cif(test) outcif=cif_out;\n  time time * event_status(0) / eventcode=1;              /* 1=event of interest; 2 treated as competing */\n  strata arm;\nrun;\n\n/* Fine-Gray subdistribution HR for the forest plot (vs. cause-specific Cox) */\nproc phreg data=work.cohort plots(overlay)=cif;\n  class arm(ref='COMPARATOR');\n  model time*event_status(0) = arm / eventcode=1;         /* subdistribution hazard */\n  hazardratio arm;\nrun;\n\n/* Number-at-risk Kaplan-Meier (appropriate ONLY when competing risk is negligible) */\nproc lifetest data=work.cohort plots=survival(atrisk cb);\n  time time * status(0);\n  strata arm;\nrun;",
        "description": "Diagnostics and reporting figures in SAS/STAT. Required datasets (post data-management):\n  work.cohort : person_id, arm ('STUDY'/'COMPARATOR'), baseline covariates, time (days), status (0/1/2),\n                event_status where 1=event of interest, 2=competing death, 0=censor.\nPROC PSMATCH ASSESS produces the PS-overlap and standardized-difference (Love-style) diagnostics; PROC LIFETEST\nwith plots=cif gives the competing-risks cumulative-incidence figure; PROC PHREG eventcode= fits the Fine-Gray\nsubdistribution model. Confirm |SMD| < 0.1 on the ASSESS output before reporting any effect figure.",
        "dependencies": [],
        "source_citations": [
          "gatto-2022"
        ],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  A[Research objective:<br/>effectiveness / safety / burden / patterns / value] --> B{Product approved?}\n  B -->|Yes| C{Routine-practice setting?}\n  B -->|No / pre-approval| P[Plan for prospective / hybrid<br/>data collection]\n  C -->|Yes| D{Key data captured in<br/>routine practice?}\n  C -->|No| P\n  D -->|Yes, secondary| E{Randomization needed?}\n  D -->|No / hybrid| P\n  E -->|No| F[Non-interventional cohort /<br/>case-control / target-trial emulation]\n  E -->|Yes| G[Pragmatic / registry-based trial]\n  F --> H[Specify methodology + regulatory standards<br/>FDA RWE Framework / EMA / ENCePP / ISPOR-ISPE]\n  G --> H\n  P --> H",
        "caption": "RWE study-design decision flow (Xia 2019 style). Sequential choices — objective, approval status, setting, data availability, randomization need — route to a study type and the applicable regulatory standards. This is an assumption/decision figure, not a data-derived one.",
        "alt_text": "Decision flowchart routing from research objective through approval status, setting, data availability, and randomization to a recommended RWE study type and regulatory standards.",
        "source_type": "illustrative",
        "source_citations": [
          "xia-2019"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "gantt\n  title Study-design timeline for one initiator (claims), anchored at time zero\n  dateFormat YYYY-MM-DD\n  axisFormat %b %Y\n  section Baseline\n  Continuous enrollment + drug-free washout (covariate window) :done, base, 2023-01-01, 2023-12-31\n  section Time zero\n  Index fill -> arm assignment :milestone, t0, 2024-01-01, 0d\n  section Follow-up\n  Outcome risk window (event of interest) :active, fu, 2024-01-01, 365d\n  Censor: disenroll / death / competing event / switch / data end :crit, cen, 2024-12-31, 1d",
        "caption": "Calendar-vs-event-time design diagram. Covariates are measured strictly before time zero and follow-up starts at the index fill, so there is no immortal time and no adjustment for post-initiation variables.",
        "alt_text": "Gantt timeline showing a 365-day baseline covariate window in 2023, time zero at the index fill on 2024-01-01, a follow-up risk window, and a censoring point.",
        "source_type": "illustrative",
        "source_citations": [
          "wang-2021"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  S[Source diabetes population<br/>N=412,000] --> X1[Excluded: <2 diabetes dx<br/>-57,000]\n  S --> E1[>=2 diabetes diagnoses<br/>N=355,000]\n  E1 --> X2[Excluded: <365d continuous<br/>enrollment -175,000]\n  E1 --> E2[365d continuous A/B/D or<br/>commercial enrollment<br/>N=180,000]\n  E2 --> X3[Excluded: prior SU/DPP-4 fill<br/>in washout -116,000]\n  E2 --> E3[Incident users<br/>N=64,000]\n  E3 --> X4[Excluded: age <18 or<br/>MA-only person-time -5,600]\n  E3 --> A[Analytic cohort<br/>N=58,400]",
        "caption": "Per-arm-able CONSORT-style attrition funnel for the worked example. Reporting the count and exclusion reason at each step (and, in practice, by arm) is what exposes differential selection.",
        "alt_text": "Attrition flowchart from a source diabetes population through diagnosis, enrollment, washout, and age/MA exclusions down to the analytic cohort, with excluded counts at each step.",
        "source_type": "illustrative",
        "source_citations": [
          "gatto-2022"
        ]
      },
      {
        "asset_path": null,
        "mermaid": "flowchart LR\n  C[Confounder:<br/>baseline severity] --> E[Exposure:<br/>drug A vs drug B]\n  C --> O[Outcome:<br/>HF hospitalization]\n  E --> O\n  E --> M[Mediator:<br/>on-treatment HbA1c]\n  M --> O\n  E --> Col[Collider:<br/>treatment switch]\n  O --> Col",
        "caption": "Minimal causal DAG for the worked example. Adjust for the confounder (baseline severity) to close the back-door path; do NOT condition on the mediator (on-treatment HbA1c) or the collider (treatment switch), which would induce bias. A DAG declares assumptions and cannot be validated by the analytic dataset.",
        "alt_text": "Directed acyclic graph showing a confounder pointing to exposure and outcome, a mediator on the exposure-outcome path, and a collider influenced by both exposure and outcome.",
        "source_type": "illustrative",
        "source_citations": [
          "gatto-2022"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "used_with",
        "target_slug": "dags-backdoor-criterion-drug-studies",
        "notes": "The causal-structure DAG is specified using the back-door criterion to derive the minimal sufficient adjustment set and to avoid conditioning on colliders or mediators."
      },
      {
        "relation_type": "used_with",
        "target_slug": "time-zero-index-date-alignment-rwe",
        "notes": "The study-design timeline diagram is the device that makes time-zero alignment auditable and prevents immortal-time bias by showing covariate windows strictly before the index date."
      },
      {
        "relation_type": "used_with",
        "target_slug": "baseline-characteristics-and-covariate-balance-rwe",
        "notes": "The Love (standardized-mean-difference) plot is the standard visualization of covariate balance before and after matching or weighting."
      },
      {
        "relation_type": "used_with",
        "target_slug": "competing-risks-cause-specific-fine-gray-rwe",
        "notes": "Cumulative-incidence (Aalen-Johansen) and Fine-Gray figures are the correct time-to-event visualizations when a competing event (e.g., death) is present, replacing 1-minus-Kaplan-Meier."
      },
      {
        "relation_type": "used_with",
        "target_slug": "database-feasibility-attrition-funnel-rwe",
        "notes": "The CONSORT-style attrition flow operationalizes the feasibility/attrition funnel with per-step and per-arm exclusion counts to expose selection bias."
      },
      {
        "relation_type": "used_with",
        "target_slug": "treatment-patterns-lines-of-therapy",
        "notes": "Sankey diagrams visualize treatment patterns, switching, and lines of therapy derived from ordered pharmacy/procedure sequences."
      },
      {
        "relation_type": "part_of",
        "target_slug": "target-trial-emulation",
        "notes": "Study-design timelines, DAGs, and emulation flowcharts make the protocol components of a target-trial emulation (eligibility, assignment, time zero, follow-up, outcome) explicit and communicable."
      },
      {
        "relation_type": "see_also",
        "target_slug": "medicare-ffs-ma-commercial-claims-differences-rwe",
        "notes": "Payer differences drive faceting/stratification of attrition, PS/balance, survival, and forest plots, and threaten Love-plot interpretation via MA risk-adjustment coding intensity."
      },
      {
        "relation_type": "see_also",
        "target_slug": "high-dimensional-propensity-score-hdps-rwe",
        "notes": "PS-overlap (positivity) and Love (balance) plots are the standard diagnostics for an hdPS, and the DAG informs which proxies to prioritize or avoid."
      },
      {
        "relation_type": "see_also",
        "target_slug": "e-value-sensitivity-analysis",
        "notes": "A Love plot certifies balance only on measured covariates; pair it with E-values and negative controls so balance is not misread as exchangeability."
      },
      {
        "relation_type": "see_also",
        "target_slug": "therapeutic-area-specific-rwe-challenges-oncology",
        "notes": "Oncology RWE relies on Sankeys for lines of therapy, CIF/KM for competing death/switch, and design timelines for line-specific time zero and attribution windows."
      }
    ],
    "aliases": [
      "RWE visualizations",
      "pharmacoepidemiology diagrams",
      "study-design diagrams",
      "DAGs in RWE",
      "Love plot",
      "attrition flow diagram",
      "forest plot",
      "Kaplan-Meier and cumulative incidence plots",
      "treatment-pattern Sankey"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "washout-clean-lookback-period-rwe",
    "name": "Washout / Clean / Lookback Period",
    "short_definition": "A pre-index window of continuous, observable history used to define incident (new) use or incident disease, to measure baseline covariates, and to align time zero, by requiring the absence of the qualifying exposure/event and the presence of data coverage before the index date.",
    "long_description": "A **washout (clean) period** and a **lookback (baseline/assessment) window** are two uses of the same pre-index\nhistory, and conflating them is a common protocol error. The **washout** is an *exclusionary* window: a patient\nqualifies as a *new (incident) user* only if there is **no** fill of the study drug (and, in an active-comparator\ndesign, no fill of the comparator) during the window, so that follow-up starts at first exposure (time zero) for\neveryone. The **lookback** is a *measurement* window: the same (or a different) span over which baseline covariates,\ncomorbidities, prior treatment, and incident-disease status are ascertained. Both require the patient to be\n**observable** — continuously enrolled with the relevant benefit (medical + pharmacy in claims; in-system contact in\nEHR) — across the entire window, otherwise \"no prior fill\" or \"no prior diagnosis\" is *missingness*, not a true clean\nperiod. A washout is only as long as the data are observable, which is why washout length, lookback length, and the\ncontinuous-enrollment requirement are jointly specified, not chosen independently.\n\n**Core conceptual distinction** — three decisions are bundled and must be separated. (1) *Washout (clean) period* —\ndefines incidence by *requiring absence*: no qualifying drug/diagnosis in the window. Longer washouts more reliably\nexclude prevalent users (who carry depletion-of-susceptibles and survivor bias) but shrink the cohort and, in\nfinite-data sources, force exclusion of recent enrollees. (2) *Lookback (covariate-assessment) window* — defines\n*what is measured* about baseline. A fixed window (e.g., 365 days) is comparable across patients but can miss a true\nchronic condition coded only once years ago; an *all-available* lookback captures more confounders but introduces\n*differential* covariate ascertainment because people with longer observable history accrue more codes (Brunelli 2013;\nNakasian 2017). 
(3) *Continuous-enrollment / observability* — guarantees the washout and lookback reflect real absence,\nnot unobserved care. The estimand is unchanged by these choices, but the *population* and the *measured confounder set*\nare not: a 90-day washout and a 730-day washout answer the same question on materially different cohorts.\n\n**Pros, cons, and trade-offs** (named against the alternatives):\n- **Longer vs shorter washout (e.g., 365 vs 180 vs 90 days):** A longer washout more completely removes prevalent\n  users and misclassifies fewer prevalent cases as incident (a short window counts long-quiescent chronic conditions as\n  \"incident,\" exaggerating risk; for example, short atrial-fibrillation lookbacks inflate baseline stroke risk).\n  Cost: it excludes anyone without that much continuous coverage, biasing toward stably enrolled (often older,\n  commercially insured or continuously Medicare) patients and reducing power. **Prefer the longest washout the data\n  reliably support**, and report cohort yield at each candidate length.\n- **Fixed vs all-available lookback for covariates:** A fixed window gives *comparable* ascertainment across patients;\n  all-available data captures more true confounders and usually reduces residual confounding for *stable* chronic\n  conditions, but inflates apparent prevalence in longer-enrolled patients and can introduce surveillance/immortal-time\n  artifacts if exposure-related encounters generate the codes (Brunelli 2013; Nakasian 2017; Preen 2006). **Prefer a\n  fixed lookback for the primary analysis with an all-available sensitivity analysis**, and never let a covariate be\n  measured *after* time zero.\n- **vs no/implicit washout (prevalent or ever-user cohorts):** Specifying a clean period removes immortal time,\n  depletion of susceptibles, and adjustment for post-initiation mediators. 
Cost: a smaller, initiator-skewed cohort\n  (see `prevalent-user-bias`, `new-user-design`).\n\n**When to use** — any incident-user / new-user design; any active-comparator new-user cohort; any incident-disease\ncohort built from claims/EHR (first qualifying diagnosis after a disease-free window); any time covariates must be\nmeasured pre-exposure to avoid conditioning on mediators. Specify the washout, the lookback, the\ncontinuous-enrollment requirement, and which entity (drug class, exact molecule, diagnosis hierarchy) the clean\nperiod clears — and pre-register the sensitivity grid over window lengths.\n\n**When NOT to use — and when it is actively misleading or dangerous**\n- **Washout longer than reliably observable history.** If the database has only 12 months of pre-index data for most\n  patients, a 24-month washout silently selects the minority with long enrollment — a selected, healthier-coverage\n  population. Always tabulate how many patients each window length retains.\n- **Treating absence-of-observation as a clean period.** In Medicare Advantage (encounter, not FFS, claims) or during\n  a gap in pharmacy benefit, \"no prior fill\" is missingness. Restricting to MA-only person-time to satisfy a washout\n  is dangerous: the washout is unfalsifiable. 
Require A/B/D (or commercial medical+pharmacy) coverage across the window.\n- **Short washout on chronic, intermittently treated conditions.** A 90-day clean period will reclassify long-standing\n  AF, diabetes, or heart failure as \"incident,\" producing spuriously high early event rates and exaggerated risk\n  (Czwikla 2017).\n- **Asymmetric or post-index covariate windows.** Measuring a covariate using any data *after* time zero, or using a\n  longer lookback in one arm than the other, conditions on a mediator or creates differential misclassification — a\n  self-inflicted bias.\n- **All-available lookback when encounters are exposure-driven.** If the act of initiating treatment generates the\n  work-up that records the confounder, all-available ascertainment manufactures imbalance and immortal time.\n\n**Data-source operational depth** (claims vs EHR vs registry vs linked):\n- **Claims (FFS):** The washout is implemented as \"no NDC/J-code for the drug class and no qualifying diagnosis in the\n  [index_date − washout, index_date) window,\" gated on continuous medical + pharmacy enrollment over the *entire*\n  window. Failure modes: (a) **MA-only person-time lacks complete FFS claims** — encounter data are incomplete and\n  under-captured, so a washout computed on MA-only spans is unverifiable; restrict to FFS Parts A/B/D (or commercial\n  medical+Rx). (b) **Claims adjudication lag and reversals** — a fill reversed after submission can leave a phantom\n  \"prior fill\" that wrongly disqualifies a true new user; use paid, non-reversed claims. (c) **Left truncation** — the\n  first observable date is enrollment start, not birth/disease onset; a washout cannot see exposure before coverage.\n  (d) **90-day mail-order and stockpiling** distort where the last pre-index `days_supply` ends and whether the washout\n  is truly drug-free. 
(e) **Differential competing risks** — in elderly claims, death/disenrollment differs by\n  exposure, so a covariate measured over an all-available lookback is differentially observed by arm.\n- **EHR:** Capture is encounter-driven, so a \"clean\" lookback reflects in-system contact, not the patient's true\n  history; external care leakage (a fill or diagnosis at another system) breaks the washout. Prefer linkage to\n  pharmacy claims to confirm true non-use; define an explicit minimum-contact rule (e.g., ≥1 encounter per lookback\n  year) to operationalize observability, and treat patients who leave the system as informatively censored.\n- **Registry:** Strong for adjudicated incident-disease definitions and severity but typically weak for complete\n  pharmacy exposure; link to claims to verify the drug washout and to a death index to firm up censoring.\n- **Linked claims–EHR–vital records:** The ideal substrate (EHR severity + claims completeness + mortality) but\n  linkage selects the linkable subset and creates order/fill/service date discrepancies that must be reconciled before\n  the washout and time-zero are assigned.\n\n**Worked claims example.** Question: incident heart-failure hospitalization among new users of a second-generation\nsulfonylurea, in a commercial + Medicare-FFS database. (1) **Observability gate:** require 365 days of continuous\nmedical *and* pharmacy enrollment with no MA-only person-time immediately before the candidate index. (2) **Drug\nwashout:** the candidate index is the first sulfonylurea fill (`fill_date`); the patient qualifies as a new user only\nif there is *no* sulfonylurea NDC in [index_date − 365, index_date). A fill on day −400 is fine (outside the window);\na reversed fill on day −200 must be dropped so it does not falsely disqualify. 
(3) **Incident-disease washout\n(outcome side):** to study *first* HF, require no HF diagnosis (inpatient or ≥2 outpatient) in the same 365-day clean\nwindow, because a 90-day window would label long-standing HF as incident and inflate early rates\n(Czwikla 2017). (4) **Lookback for covariates:** measure comorbidities, prior insulin, HbA1c proxies, and utilization\nover the half-open window [index_date − 365, index_date) *only*, never using any claim dated ≥ index_date, to feed a\nhigh-dimensional propensity score. (5) **Sensitivity:** re-run with 180-day, 730-day, and all-available\nwashout/lookback, reporting cohort yield at each (e.g., 365d retains 41,200; 730d retains 22,900 and shifts older),\nand verify covariate prevalence is not driven by enrollment length. This grid, not a single number, is the\ndefensible output.",
    "primary_category": "Exposure_Definition",
    "tags": [
      "washout-period",
      "clean-period",
      "lookback-window",
      "new-user-design",
      "incident-user",
      "covariate-assessment-window",
      "continuous-enrollment",
      "exposure-definition"
    ],
    "applies_to_study_types": [
      "claims_analysis",
      "ehr_study",
      "registry_linkage",
      "target_trial_emulation",
      "comparative_effectiveness",
      "active_comparator_new_user"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1093/aje/kwg231",
        "url": "https://doi.org/10.1093/aje/kwg231",
        "citation_text": "Ray WA. Evaluating medication effects outside of clinical trials: new-user designs. American Journal of Epidemiology. 2003;158(9):915-920.",
        "year": 2003,
        "authors_short": "Ray",
        "notes": "Foundational statement of the new-user design; the drug-free washout that defines incident use and aligns time zero is its operational core."
      },
      {
        "role": "introduce",
        "doi": "10.1186/s12874-017-0407-4",
        "url": "https://doi.org/10.1186/s12874-017-0407-4",
        "citation_text": "Czwikla J, Jobski K, Schink T. The impact of the lookback period and definition of confirmatory events on the identification of incident cancer cases in administrative data. BMC Medical Research Methodology. 2017;17:122.",
        "year": 2017,
        "authors_short": "Czwikla et al.",
        "notes": "Direct empirical demonstration that lookback-window length and event-confirmation rules change which cases count as incident in claims."
      },
      {
        "role": "explain",
        "doi": "10.1002/pds.3434",
        "url": "https://doi.org/10.1002/pds.3434",
        "citation_text": "Brunelli SM, Gagne JJ, Huybrechts KF, et al. Estimation using all available covariate information versus a fixed look-back window for dichotomous covariates. Pharmacoepidemiology and Drug Safety. 2013;22(5):542-550.",
        "year": 2013,
        "authors_short": "Brunelli et al.",
        "notes": "Explains the fixed-window vs all-available lookback trade-off for confounder ascertainment and the differential-measurement risk of unbounded windows."
      },
      {
        "role": "explain",
        "doi": "10.1002/pds.4210",
        "url": "https://doi.org/10.1002/pds.4210",
        "citation_text": "Nakasian SS, Rassen JA, Franklin JM. Effects of expanding the look-back period to all available data in the assessment of covariates. Pharmacoepidemiology and Drug Safety. 2017;26(8):890-899.",
        "year": 2017,
        "authors_short": "Nakasian et al.",
        "notes": "Quantifies how expanding the covariate lookback to all-available data affects confounder capture and effect estimates relative to a fixed window."
      },
      {
        "role": "demonstrate",
        "doi": "10.1016/j.jclinepi.2005.12.013",
        "url": "https://doi.org/10.1016/j.jclinepi.2005.12.013",
        "citation_text": "Preen DB, Holman CDJ, Spilsbury K, Semmens JB, Brameld KJ. Length of comorbidity lookback period affected regression model performance of administrative health data. Journal of Clinical Epidemiology. 2006;59(9):940-946.",
        "year": 2006,
        "authors_short": "Preen et al.",
        "notes": "Applied evaluation showing how comorbidity-lookback length changes measured baseline burden and downstream model performance."
      },
      {
        "role": "use",
        "doi": "10.1002/pds.3334",
        "url": "https://doi.org/10.1002/pds.3334",
        "citation_text": "Johnson ES, Bartman BA, Briesacher BA, et al. The incident user design in comparative effectiveness research. Pharmacoepidemiology and Drug Safety. 2013;22(1):1-6.",
        "year": 2013,
        "authors_short": "Johnson et al.",
        "notes": "Consensus operational guidance from a multi-stakeholder group on implementing the incident-user (new-user) washout in routine comparative-effectiveness work."
      }
    ],
    "plain_language_summary": "A clean lookback period is a stretch of time before a patient's first fill of a study drug during which the analyst confirms the patient had no prior fills of that drug — proving they are a brand-new starter, not someone already taking it. The analyst also checks that the patient was continuously enrolled in their insurance plan across that entire stretch, because a gap in coverage means a missing fill would be invisible, not a true absence. One honest limit: it only works with data sources that record every dispensing, so cash-paid fills or fills at out-of-network pharmacies can slip through undetected.",
    "key_terms": [
      {
        "term": "index date",
        "definition": "The patient's personal 'day zero' in a study — usually the calendar date of their very first fill of the drug being studied."
      },
      {
        "term": "washout period",
        "definition": "A fixed span of time before the index date during which the analyst requires zero prior fills of the study drug, confirming the patient is a first-time starter rather than someone already using it."
      },
      {
        "term": "days_supply",
        "definition": "A field in pharmacy claims data recording how many days one filled prescription is intended to last — for example, a 90-day mail-order fill has days_supply = 90."
      },
      {
        "term": "continuous enrollment",
        "definition": "A requirement that a patient be uninterruptedly covered by their insurance plan across the entire lookback window, so that any absence of fills reflects true non-use rather than a gap in data coverage."
      },
      {
        "term": "new user",
        "definition": "A patient whose first observable fill of the study drug occurs after a clean stretch with no prior fills, meaning we are watching them start the drug fresh rather than catching them mid-treatment."
      }
    ],
    "worked_example": {
      "scenario": "A pharmacist fills metformin for patient 2041 on 2023-10-15. Before counting this person as a brand-new metformin starter, the analyst looks back 180 days — from 2023-04-18 through 2023-10-14 — to confirm there were no earlier metformin fills in that window. The patient's insurance ran continuously from 2023-01-01, so there are no coverage gaps to worry about. After confirming the clean lookback, the analyst marks 2023-10-15 as the index date and then tracks the patient forward for 365 days of follow-up ending 2024-10-14.",
      "dataset": {
        "caption": "Pharmacy claims rows for patient 2041. The analyst sees every fill in the database; the task is to confirm the 180-day window before 2023-10-15 is empty of metformin.",
        "columns": [
          "person_id",
          "fill_date",
          "drug",
          "days_supply"
        ],
        "rows": [
          [
            2041,
            "2023-10-15",
            "metformin",
            90
          ],
          [
            2041,
            "2024-01-13",
            "metformin",
            90
          ],
          [
            2041,
            "2024-04-13",
            "metformin",
            90
          ]
        ]
      },
      "steps": [
        "Define the lookback window: 180 days before the candidate index fill, so 2023-04-18 through 2023-10-14 (inclusive).",
        "Scan every pharmacy row for person 2041 with drug = metformin and fill_date inside that window — there are none, so the window is clean.",
        "Confirm continuous insurance enrollment from at least 2023-04-18 through 2023-10-15; the patient's plan started 2023-01-01 with no gaps, so the absence of fills is real non-use, not missing data.",
        "Qualify patient 2041 as a new user with index date = 2023-10-15.",
        "Set follow-up to run from 2023-10-15 through 2024-10-14 (365 days)."
      ],
      "result": {
        "label": "Lookback = 180 clean days (2023-04-18 to 2023-10-14); patient qualifies as a new metformin user; index date = 2023-10-15",
        "value": 180
      },
      "timeline_spec": {
        "title": "180-day clean lookback confirming new-user status for patient 2041 (metformin)",
        "window": {
          "start": "2023-04-18",
          "end": "2024-10-14",
          "label": "Full observation span: lookback + index fill + 365-day follow-up"
        },
        "events": [
          {
            "label": "Index fill (Fill A)",
            "start": "2023-10-15",
            "length_days": 90,
            "quantity": "90 days_supply"
          },
          {
            "label": "Fill B (refill)",
            "start": "2024-01-13",
            "length_days": 90,
            "quantity": "90 days_supply"
          },
          {
            "label": "Fill C (refill)",
            "start": "2024-04-13",
            "length_days": 90,
            "quantity": "90 days_supply"
          }
        ],
        "spans": [
          {
            "kind": "washout",
            "start": "2023-04-18",
            "end": "2023-10-14",
            "label": "180-day clean lookback — no prior metformin fills"
          },
          {
            "kind": "followup",
            "start": "2023-10-15",
            "end": "2024-10-14",
            "label": "365-day follow-up beginning at index fill"
          }
        ],
        "result": {
          "label": "Lookback = 180 clean days; patient qualifies as new user at index date 2023-10-15",
          "value": 180
        },
        "caption": "Patient 2041 timeline. The 180-day lookback (grey) runs from 2023-04-18 to 2023-10-14 with zero metformin fills, confirming new-user status. The index fill on 2023-10-15 starts 365 days of follow-up (blue). Continuous enrollment covers the full span.",
        "alt_text": "Timeline for patient 2041 showing a 180-day clean lookback period from April to October 2023 with no prior metformin fills, followed by an index fill on 2023-10-15 and three 90-day fills during a 365-day follow-up period ending October 2024."
      }
    },
    "prerequisites": [
      "continuous-enrollment-observable-time-rwe",
      "exposure-episode-construction-rwe",
      "new-user-design"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Drug washout (incident-user / new-user clean period)",
        "description": "Requires no fill of the study drug (and the comparator, in active-comparator designs) within the pre-index window so the index fill is a true initiation and time zero is aligned across arms.",
        "edge_cases": [
          "Reversed or voided pharmacy claims can create a phantom prior fill that wrongly disqualifies a genuine new user; restrict to paid, non-reversed claims.",
          "Class vs exact-molecule washout changes who qualifies (within-class switchers are new users of the molecule but not the class).",
          "Stockpiling / 90-day mail-order means the last pre-window days_supply can spill into the washout window."
        ],
        "data_source_notes": "claims: gate on continuous medical+pharmacy enrollment over the full window and exclude MA-only person-time; EHR: confirm non-use with linked dispensing because in-system orders miss outside fills."
      },
      {
        "name": "Incident-disease (event-free) washout",
        "description": "Requires no qualifying diagnosis/procedure in the pre-index window so the first observed event is treated as incident rather than prevalent.",
        "edge_cases": [
          "Short windows reclassify long-standing chronic disease as incident, inflating early event rates and baseline risk (e.g., atrial fibrillation, heart failure).",
          "Single-claim \"rule-out\" diagnoses inflate apparent incidence; require confirmatory events (e.g., inpatient or >=2 outpatient codes)."
        ],
        "data_source_notes": "claims: combine code position (primary vs secondary) with multiple-claim or inpatient confirmation; registry: prefer adjudicated incident-disease definitions and link to claims for the washout."
      },
      {
        "name": "Fixed vs all-available covariate lookback",
        "description": "Baseline covariates measured over a fixed pre-index window (comparable across patients) versus all observable history (more capture, but differential by enrollment length).",
        "edge_cases": [
          "All-available lookback inflates apparent comorbidity prevalence in longer-enrolled patients and can introduce immortal-time/surveillance artifacts if exposure drives the encounters.",
          "No covariate may be measured at or after time zero (conditioning on a mediator)."
        ],
        "data_source_notes": "claims: report covariate prevalence by lookback length and arm; use a fixed window for the primary analysis and all-available as a sensitivity analysis."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "Shorter washout / lookback window",
        "pros_of_this": "A longer clean period more completely excludes prevalent users and prevalent disease, reducing depletion-of-susceptibles and incident-event misclassification.",
        "cons_of_this": "Excludes patients without sufficient continuous coverage, biasing toward stably enrolled populations and reducing sample size and power.",
        "when_to_prefer": "When prevalent-user bias or chronic-disease misclassification is plausible and the data support enough observable history for most patients."
      },
      {
        "compared_to": "All-available (unbounded) covariate lookback",
        "pros_of_this": "A fixed lookback gives comparable confounder ascertainment across patients and arms, avoiding bias from differential enrollment length.",
        "cons_of_this": "May miss true chronic conditions coded only outside the fixed window, leaving some residual confounding for stable comorbidities.",
        "when_to_prefer": "For the primary analysis when enrollment duration differs by arm or correlates with exposure; pair with an all-available sensitivity analysis."
      },
      {
        "compared_to": "No / implicit washout (prevalent- or ever-user cohort)",
        "pros_of_this": "An explicit clean period removes immortal time, survivor bias, and adjustment for post-initiation mediators by anchoring time zero at initiation.",
        "cons_of_this": "Yields smaller, initiator-skewed cohorts that may not represent prevalent users who dominate routine practice.",
        "when_to_prefer": "Nearly always for comparative safety/effectiveness; relax only when initiation is too rare and a prevalent-new-user extension is justified."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Implement the washout as no qualifying drug/diagnosis in [index_date - washout, index_date), gated on continuous medical+pharmacy enrollment across the entire window; exclude MA-only person-time (incomplete FFS claims), use paid non-reversed claims, and never measure covariates on or after index_date.",
      "ehr": "Capture is encounter-driven; a clean lookback reflects in-system contact, not true history. Confirm non-use via linked pharmacy claims, define an explicit minimum-contact observability rule, and treat patients lost from the system as informatively censored.",
      "registry": "Strong for adjudicated incident-disease and severity, weak for complete pharmacy exposure; link to claims to verify the drug washout and to a death index for censoring.",
      "linked": "Ideal substrate (severity + completeness + mortality) but linkage selects the linkable subset and creates order/fill/service date discrepancies that must be reconciled before the washout and time zero are assigned."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import pandas as pd\n\nWASHOUT_DAYS  = 365   # drug-free + disease-free clean period that defines incident use\nLOOKBACK_DAYS = 365   # fixed covariate-assessment window (== washout here; can differ)\nSTUDY_CLASS   = \"SULFONYLUREA\"\nDISEASE_CODES = (\"I50\",)  # ICD-10 heart failure prefix for the incident-disease washout\n\ndef build_new_user_cohort(rx: pd.DataFrame, dx: pd.DataFrame, enroll: pd.DataFrame) -> pd.DataFrame:\n    rx = rx.sort_values([\"person_id\", \"fill_date\"])\n    study = rx[rx[\"drug_class\"] == STUDY_CLASS]\n\n    # Candidate index = first observed fill of the study class.\n    idx = (study.groupby(\"person_id\", as_index=False)\n                .agg(index_date=(\"fill_date\", \"min\")))\n    idx[\"baseline_start\"] = idx[\"index_date\"] - pd.Timedelta(days=LOOKBACK_DAYS)\n    idx[\"washout_start\"]  = idx[\"index_date\"] - pd.Timedelta(days=WASHOUT_DAYS)\n\n    # Observability gate: continuous medical+pharmacy coverage, no MA-only span, across the FULL washout->index window.\n    e = enroll.merge(idx[[\"person_id\", \"index_date\", \"washout_start\"]], on=\"person_id\")\n    covered = e[(e[\"med_rx_covered\"]) & (~e[\"ma_only\"]) &\n                (e[\"enroll_start\"] <= e[\"washout_start\"]) &\n                (e[\"enroll_end\"]   >= e[\"index_date\"])]\n    idx = idx[idx[\"person_id\"].isin(covered[\"person_id\"])].copy()\n\n    # Drug washout: drop anyone with a prior study-class fill inside [washout_start, index_date).\n    prior_rx = study.merge(idx[[\"person_id\", \"index_date\", \"washout_start\"]], on=\"person_id\")\n    prior_rx = prior_rx[(prior_rx[\"fill_date\"] >= prior_rx[\"washout_start\"]) &\n                        (prior_rx[\"fill_date\"] <  prior_rx[\"index_date\"])]\n    idx = idx[~idx[\"person_id\"].isin(prior_rx[\"person_id\"])].copy()\n\n    # Incident-disease washout: drop prevalent disease (IP once OR OP >=2) inside the clean window.\n    d = dx.merge(idx[[\"person_id\", \"index_date\", 
\"washout_start\"]], on=\"person_id\")\n    d = d[d[\"dx_code\"].str.startswith(DISEASE_CODES) &\n          (d[\"dx_date\"] >= d[\"washout_start\"]) & (d[\"dx_date\"] < d[\"index_date\"])]\n    ip   = d.loc[d[\"care_setting\"] == \"IP\", \"person_id\"]\n    op2  = d[d[\"care_setting\"] == \"OP\"].groupby(\"person_id\").size()\n    prevalent = set(ip) | set(op2[op2 >= 2].index)\n    idx = idx[~idx[\"person_id\"].isin(prevalent)].copy()\n\n    return idx[[\"person_id\", \"index_date\", \"baseline_start\"]].reset_index(drop=True)",
        "description": "New-user cohort construction with a drug washout, a continuous-enrollment observability gate, and a fixed covariate\nlookback. Required inputs (already cleaned, de-duplicated, paid+non-reversed claims only):\n  rx     : pharmacy fills    -> person_id, fill_date (datetime64), ndc, drug_class, days_supply\n  dx     : diagnoses         -> person_id, dx_date (datetime64), dx_code, care_setting in {'IP','OP'}\n  enroll : enrollment spans  -> person_id, enroll_start, enroll_end, med_rx_covered (bool), ma_only (bool)\nReturns one row per eligible new initiator with index_date and the [baseline_start, index_date] window over which\ncovariates and incident-disease status must be measured. Choose STUDY_CLASS / DISEASE_CODES for the specific question.",
        "dependencies": [
          "pandas"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(data.table)\nWASHOUT_DAYS  <- 365L\nLOOKBACK_DAYS <- 365L\nSTUDY_CLASS   <- \"SULFONYLUREA\"\nDISEASE_RX    <- \"^I50\"   # heart-failure ICD-10 prefix\n\nbuild_new_user_cohort <- function(rx, dx, enroll) {\n  setDT(rx); setDT(dx); setDT(enroll)\n  study <- rx[drug_class == STUDY_CLASS]\n  setorder(study, person_id, fill_date)\n\n  idx <- study[, .(index_date = fill_date[1L]), by = person_id]\n  idx[, baseline_start := index_date - LOOKBACK_DAYS]\n  idx[, washout_start  := index_date - WASHOUT_DAYS]\n\n  # Observability gate: continuous med+rx coverage, no MA-only, spanning the full washout window through index.\n  e <- merge(enroll, idx[, .(person_id, index_date, washout_start)], by = \"person_id\")\n  covered <- e[med_rx_covered == TRUE & ma_only == FALSE &\n               enroll_start <= washout_start & enroll_end >= index_date, unique(person_id)]\n  idx <- idx[person_id %chin% covered]\n\n  # Drug washout: exclude any prior study-class fill inside the clean window.\n  pr <- merge(study, idx[, .(person_id, index_date, washout_start)], by = \"person_id\")\n  prior_ids <- unique(pr[fill_date >= washout_start & fill_date < index_date, person_id])\n  idx <- idx[!person_id %chin% prior_ids]\n\n  # Incident-disease washout: exclude prevalent disease (IP once OR OP >= 2) inside the clean window.\n  d <- merge(dx, idx[, .(person_id, index_date, washout_start)], by = \"person_id\")\n  d <- d[grepl(DISEASE_RX, dx_code) & dx_date >= washout_start & dx_date < index_date]\n  ip  <- unique(d[care_setting == \"IP\", person_id])\n  op2 <- d[care_setting == \"OP\", .N, by = person_id][N >= 2L, person_id]\n  idx <- idx[!person_id %chin% union(ip, op2)]\n\n  idx[, .(person_id, index_date, baseline_start)]\n}",
        "description": "New-user cohort construction with data.table mirroring the Python logic. Inputs:\n  rx     : person_id, fill_date (Date), ndc, drug_class, days_supply\n  dx     : person_id, dx_date (Date), dx_code, care_setting ('IP'/'OP')\n  enroll : person_id, enroll_start, enroll_end, med_rx_covered (logical), ma_only (logical)\nReturns one row per eligible new initiator with index_date and a fixed [baseline_start, index_date] covariate window.",
        "dependencies": [
          "data.table"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "%let washout  = 365;\n%let lookback = 365;\n%let studyclass = SULFONYLUREA;\n\n/* Candidate index = first study-class fill. */\nproc sql;\n  create table idx as\n  select person_id,\n         min(fill_date) as index_date format=date9.,\n         min(fill_date) - &lookback as baseline_start format=date9.,\n         min(fill_date) - &washout  as washout_start  format=date9.\n  from work.rx\n  where drug_class = \"&studyclass\"\n  group by person_id;\nquit;\n\n/* Observability gate, drug washout, and incident-disease washout in one filtered pass. */\nproc sql;\n  create table cohort as\n  select i.person_id, i.index_date, i.baseline_start\n  from idx i\n  /* continuous med+rx coverage, no MA-only, across the full washout window through index */\n  where exists (\n    select 1 from work.enroll e\n    where e.person_id = i.person_id and e.med_rx_covered = 1 and e.ma_only = 0\n      and e.enroll_start <= i.washout_start and e.enroll_end >= i.index_date)\n  /* no prior study-class fill inside the clean window (=> incident user) */\n  and not exists (\n    select 1 from work.rx p\n    where p.person_id = i.person_id and p.drug_class = \"&studyclass\"\n      and p.fill_date >= i.washout_start and p.fill_date < i.index_date)\n  /* no inpatient disease code inside the clean window */\n  and not exists (\n    select 1 from work.dx d\n    where d.person_id = i.person_id and d.care_setting = 'IP'\n      and d.dx_code like 'I50%'\n      and d.dx_date >= i.washout_start and d.dx_date < i.index_date)\n  /* not >=2 outpatient disease codes inside the clean window */\n  and i.person_id not in (\n    select d.person_id from work.dx d\n    where d.care_setting = 'OP' and d.dx_code like 'I50%'\n      and d.dx_date >= i.washout_start and d.dx_date < i.index_date\n    group by d.person_id having count(*) >= 2);\nquit;",
        "description": "New-user cohort construction in SAS using PROC SQL set logic. Required input datasets (post data-management;\npaid, non-reversed claims only):\n  work.rx     : person_id, fill_date, ndc, drug_class, days_supply\n  work.dx     : person_id, dx_date, dx_code, care_setting ('IP'/'OP')\n  work.enroll : person_id, enroll_start, enroll_end, med_rx_covered (0/1), ma_only (0/1)\nProduces work.cohort: one row per eligible new initiator with index_date and baseline_start. The not-exists\nsubqueries implement the drug washout and incident-disease washout; the exists subquery is the observability gate.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": "washout-clean-lookback-period-rwe-timeline.svg",
        "mermaid": null,
        "caption": "Patient 2041 timeline. The 180-day lookback (grey) runs from 2023-04-18 to 2023-10-14 with zero metformin fills, confirming new-user status. The index fill on 2023-10-15 starts 365 days of follow-up (blue). Continuous enrollment covers the full span.",
        "alt_text": "Timeline for patient 2041 showing a 180-day clean lookback period from April to October 2023 with no prior metformin fills, followed by an index fill on 2023-10-15 and three 90-day fills during a 365-day follow-up period ending October 2024.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Pop[Candidate: first fill of study class] --> Obs{Continuous med+Rx coverage<br/>over full window? No MA-only?}\n  Obs -->|No| ExObs[Exclude: washout unobservable<br/>absence = missingness]\n  Obs -->|Yes| DrugW{Prior study-class fill in<br/>washout-start .. index?}\n  DrugW -->|Yes| ExPrev[Exclude: prevalent user]\n  DrugW -->|No| DisW{Prevalent disease in window?<br/>IP once OR OP >=2}\n  DisW -->|Yes| ExDis[Exclude: prevalent disease]\n  DisW -->|No| Cohort[New user at time zero = index_date]\n  Cohort --> Look[Measure covariates only in<br/>baseline_start .. index_date]\n  Look --> Sens[Sensitivity: vary washout & lookback length,<br/>fixed vs all-available; report cohort yield]",
        "caption": "Decision logic for the washout / clean / lookback period. The observability gate ensures absence is real, the drug and disease washouts establish incidence, and covariates are measured strictly before time zero.",
        "alt_text": "Flowchart showing the observability gate, drug washout, incident-disease washout, new-user qualification at time zero, fixed covariate lookback, and a sensitivity grid over window lengths.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "gantt\n  title Washout and lookback windows for one candidate initiator (claims)\n  dateFormat YYYY-MM-DD\n  axisFormat %b %Y\n  section Observability\n  Continuous med+Rx enrollment (no MA-only) :done, enr, 2022-09-01, 2024-01-01\n  section Clean window\n  Drug + disease washout (no study fill, no prevalent disease) :active, wash, 2023-01-01, 2023-12-31\n  section Covariate lookback\n  Fixed covariate-assessment window :crit, look, 2023-01-01, 2023-12-31\n  section Time zero\n  First study-class fill = index_date :milestone, t0, 2024-01-01, 0d",
        "caption": "A single candidate's timeline. Enrollment must cover the entire clean window; the drug and disease washouts and the covariate lookback all end strictly at (not after) time zero on the first fill.",
        "alt_text": "Gantt chart showing continuous enrollment beginning before a 365-day washout and covariate lookback in 2023, ending at time zero on the first study-class fill on 2024-01-01.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "requires",
        "target_slug": "continuous-enrollment-observable-time-rwe",
        "notes": "A washout is only valid over observable time; continuous enrollment across the full window is what makes \"no prior fill / no prior diagnosis\" a true absence rather than missingness."
      },
      {
        "relation_type": "part_of",
        "target_slug": "new-user-design",
        "notes": "The drug-free washout is the defining operational element of the new-user (incident-user) design that aligns time zero across patients."
      },
      {
        "relation_type": "used_with",
        "target_slug": "active-comparator-new-user",
        "notes": "ACNU requires the washout to clear both the study drug and the comparator so both arms are incident users."
      },
      {
        "relation_type": "used_with",
        "target_slug": "baseline-characteristics-and-covariate-balance-rwe",
        "notes": "The lookback window is where baseline covariates feeding the propensity score and balance diagnostics are measured, strictly before time zero."
      },
      {
        "relation_type": "is_variant_of",
        "target_slug": "exposure-episode-construction-rwe",
        "notes": "The drug washout is the entry rule of exposure-episode construction; episode stitching (days_supply, gaps, grace) governs follow-up after time zero."
      },
      {
        "relation_type": "see_also",
        "target_slug": "immortal-time-bias-handling",
        "notes": "Anchoring follow-up at the first post-washout fill (time zero) prevents the immortal time that arises when follow-up starts before the exposure decision."
      },
      {
        "relation_type": "see_also",
        "target_slug": "prevalent-user-bias",
        "notes": "The washout's central purpose is to exclude prevalent users, who carry depletion-of-susceptibles and survivor bias."
      },
      {
        "relation_type": "see_also",
        "target_slug": "exposure-lag-induction-latency-window-rwe",
        "notes": "Distinct from the pre-index washout; lag/induction windows shift the at-risk period after time zero rather than defining incidence before it."
      }
    ],
    "aliases": [
      "washout period",
      "clean period",
      "lookback period",
      "look-back period",
      "baseline period",
      "covariate assessment window",
      "disease-free period",
      "run-in clean window"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta",
      "pqa-cms"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "weibull-distribution",
    "name": "Weibull Distribution for Time-to-Event Data",
    "short_definition": "A two-parameter parametric survival distribution with shape parameter k and scale parameter λ that produces a power-law hazard h(t) = (k/λ)(t/λ)^(k−1): decreasing for k < 1, constant (exponential special case) for k = 1, and increasing for k > 1. The Weibull is the only member of the standard HTA candidate set that simultaneously satisfies both the proportional hazards and accelerated failure time model structures, making it the default bridge between semi-parametric Cox modeling and the parametric extrapolation required by health technology assessment submissions.",
    "long_description": "**What the Weibull distribution is and why it matters in RWE**\n\nThe Weibull distribution is a two-parameter family for strictly positive, continuous outcomes\n— most commonly elapsed time to a clinical event such as death, hospitalization, disease\nprogression, or treatment discontinuation. It is parameterized by a *shape* parameter k\n(also written α or γ depending on software convention) and a *scale* parameter λ (also\nwritten σ, θ, or λ). Together they control the full shape of the hazard function over time.\nThe survival function is S(t) = exp(−(t/λ)^k) and the hazard is h(t) = (k/λ)(t/λ)^(k−1),\ngiving a power-law relationship between elapsed time and instantaneous event risk.\n\nIn RWE and health economics, the Weibull occupies a central position for three reasons.\nFirst, it nests the exponential distribution as a special case (k = 1, constant hazard), so\nit generalizes the simplest survival model without requiring a different software pipeline.\nSecond, it is the *only* member of the standard six parametric distributions used in HTA\nextrapolation (exponential, Weibull, Gompertz, log-normal, log-logistic, generalized gamma)\nthat simultaneously satisfies both the proportional hazards (PH) and accelerated failure\ntime (AFT) model structures — a property that makes it the canonical bridge between the\nsemi-parametric Cox world and the fully parametric extrapolation world. 
Third, its monotone\nhazard is mechanistically plausible for a wide range of clinical trajectories: early-failure\nburn-in (k < 1), steady-state (k = 1), and progressive wear-out (k > 1).\n\n**The survival function, hazard function, and parameterizations**\n\nGiven shape k > 0 and scale λ > 0, the Weibull specifies:\n- Survival function: S(t) = exp(−(t/λ)^k)\n- Hazard function: h(t) = (k/λ)(t/λ)^(k−1)\n- Cumulative hazard: H(t) = (t/λ)^k\n- Mean (expectation): E[T] = λ · Γ(1 + 1/k), where Γ is the gamma function\n\nAn alternative parameterization common in AFT software writes the model as log T = μ + σ ε,\nwhere ε follows the standard extreme-value (Gumbel) distribution, and the shape is recovered\nas k = 1/σ. SAS PROC LIFEREG uses this log-linear AFT convention: the intercept is\nμ = log(λ) and the scale output is σ = 1/k. The R survreg function also uses this\nconvention. The flexsurvreg function in R and scipy.stats.weibull_min in Python use the\ndirect (k, λ) parameterization. Parameterization mismatch is one of the most common\nimplementation errors — always verify which k you are receiving before reporting it.\n\n**Hazard shapes by shape parameter k**\n\nThe shape parameter k is the most clinically consequential quantity the Weibull delivers:\n\nk < 1 (decreasing hazard): Risk starts high and declines monotonically. This matches an\nearly-failure or burn-in pattern — post-surgical complications, acute toxicity after\nchemotherapy, or frailty-driven short-term mortality in elderly cohorts. The hazard\napproaches infinity at t = 0 and declines toward zero as t increases.\n\nk = 1 (constant hazard): Reduces to the exponential distribution. 
Appropriate for\nmemoryless processes where the risk of the event in the next short interval does not depend\non how long the patient has already survived — background mortality in healthy adults over\nshort follow-up, or Poisson-process equipment failure.\n\nk > 1 (increasing hazard): Risk accumulates over time, as in progressive diseases such as\nadvancing heart failure, Parkinson's disease, or accumulating cancer burden. k = 2 gives a\nlinear hazard (the Rayleigh distribution); larger k gives a faster acceleration. Matching\nthe shape parameter to the clinical mechanism is the primary substantive check on a Weibull\nfit — an increasing-hazard Weibull applied to post-MI survival data (which typically shows\nan early hazard spike then decline) signals model misspecification regardless of AIC.\n\n**Dual citizenship: proportional hazards and accelerated failure time simultaneously**\n\nThe Weibull's unique mathematical property is that it is the *only* continuous distribution\nsatisfying both the proportional hazards assumption and the accelerated failure time\nassumption simultaneously. This dual citizenship makes it the canonical bridge between two\nmajor modeling frameworks.\n\nIn the PH frame, a Weibull model with treatment covariate x specifies\nh(t|x) = h_0(t) · exp(β x), where h_0(t) is a Weibull baseline hazard. The quantity\nexp(β) is a hazard ratio (HR), interpreted the same way as a Cox HR — the ratio of\ninstantaneous event rates between groups, assumed constant over time.\n\nIn the AFT frame, the same model specifies log T = μ + γ x + σ ε. The quantity exp(γ) is\na *time ratio* (TR) — the factor by which the entire time scale is stretched or compressed\nfor the treated group. A time ratio of 1.50 means every quantile of the survival\ndistribution (median, 75th percentile, 90th percentile) is 1.50 times as large under\ntreatment. The conversion between the two representations for the Weibull is:\nHR = TR^(−k). 
This means the hazard ratio can be derived from an AFT fit, and vice versa,\nonce the shape parameter k is known. This conversion is not available for log-normal or\nlog-logistic distributions, which satisfy only the AFT structure.\n\n**Interpreting the output**\n\nConsider a Weibull AFT model comparing treated versus control for time to disease\nprogression. The software reports an AFT treatment coefficient of 0.405 with shape k = 1.4.\n\nThe time ratio is exp(0.405) ≈ 1.50. The *formal interpretation* is: event times in the\ntreated arm are scaled by a factor of 1.50 relative to control, conditional on covariates.\nEvery quantile of the progression-free survival distribution — the median, the 75th\npercentile, the 90th percentile — is 1.50 times as large under treatment. Because Weibull\nalso satisfies PH, the equivalent hazard ratio is HR = TR^(−k). For these example values,\nTR = 1.50 and k = 1.4, HR = exp(−1.4 × 0.405) ≈ 0.57, meaning the instantaneous rate of\nprogression in the treated arm is roughly 57% of the rate in the control arm at any given\ntime point. 
The shape k = 1.4 > 1 implies an increasing hazard — the risk of progression\nrises over time in both arms — which is biologically plausible for a progressive condition\nwhere accumulated disease burden increases susceptibility.\n\nThe *practical interpretation* for a clinical or payer audience: patients on treatment\ntypically go about 50% longer before their disease progresses, with the hazard of\nprogression rising over time in both arms but at a substantially lower rate in the treated\narm at any instant.\n\n**HTA extrapolation role and the NICE TSD candidate set**\n\nIn health technology assessment, the standard workflow (Latimer 2013; NICE DSU TSD 14)\nrequires fitting six candidate parametric distributions — exponential, Weibull, Gompertz,\nlog-normal, log-logistic, and generalized gamma — to observed trial or registry data and\nselecting among them based on AIC/BIC, visual fit to the Kaplan-Meier curve, and clinical\nplausibility of the projected hazard shape. The Weibull is almost always in this candidate\nset and is often selected when the smoothed hazard is monotone.\n\nThe critical caution is that tail behavior dominates the lifetime QALY estimate because\nmost of the area under the survival curve lies beyond the observed data. A Weibull with\nk slightly above 1 (modestly increasing hazard) and a log-normal (decreasing hazard after\na peak) can fit 24 months of trial data with nearly identical AIC yet project mean survival\ndiffering by years. The analyst must justify the hazard shape in the unobserved tail on\nclinical grounds — not just on in-sample fit — and present the alternative distributions as\nscenarios in probabilistic sensitivity analysis. See the survival-extrapolation-hta-rwe\nentry for the full workflow.\n\n**Checking Weibull fit: the log(−log S(t)) vs log(t) diagnostic**\n\nThe canonical graphical check is based on the cumulative hazard. If S(t) = exp(−(t/λ)^k),\nthen log(−log S(t)) = k · log(t) − k · log(λ). 
Plotting the estimated log(−log S(t))\n— computed from the Kaplan-Meier — against log(t) should yield an approximately straight\nline with slope k and intercept −k · log(λ) if the Weibull assumption holds. Systematic\ncurvature (a convex or concave deviation) indicates the hazard is not a power law of time,\nand a more flexible model (log-normal, log-logistic, generalized gamma, or Royston-Parmar\nspline) may be needed. In SAS, PROC LIFETEST with the LOGLOGS plot request produces this\nplot directly; in R, survfit objects can be post-processed; in Python, lifelines provides\nKaplanMeierFitter.plot_loglogs() for the same diagnostic.\n\n**Pros, cons, and trade-offs**\n\n*Pros of the Weibull distribution:*\n- Parsimonious: only two parameters, straightforward to estimate by maximum likelihood.\n- Flexible monotone hazard shapes: k < 1, k = 1 (exponential special case), k > 1.\n- Dual PH/AFT structure: the only standard distribution fitting both frameworks, enabling\n  hazard ratio reporting and AFT time-ratio reporting from one fit, with exact conversion\n  HR = TR^(−k).\n- Closed-form survival function, hazard, and quantile function: computationally convenient\n  for HTA cost-effectiveness models, probabilistic sensitivity analysis, and simulation.\n- Nests the exponential (k = 1): a likelihood ratio test of H_0: k = 1 is a formal test\n  of the exponential special case.\n- Well-implemented in all major software: lifelines WeibullFitter and WeibullAFTFitter\n  (Python), survreg and flexsurvreg (R), PROC LIFEREG with DIST=WEIBULL (SAS).\n\n*Cons and limitations:*\n- Monotone hazard constraint: the Weibull hazard is strictly decreasing, constant, or\n  strictly increasing throughout follow-up. It cannot model a non-monotone hazard (a hazard\n  that rises then falls). 
When the clinical process produces a non-monotone pattern, the\n  Weibull will misfit and produce a biased tail extrapolation.\n- No cure fraction: the Weibull hazard is always positive as t approaches infinity, so it\n  cannot accommodate a long-term survivor fraction. If data show a genuine survival plateau\n  (immunotherapy response, curative surgery), a mixture cure model is required.\n- Two-parameter family: less flexible than the three-parameter generalized gamma (which\n  nests the Weibull as a special case) or Royston-Parmar splines. When in-sample fit is\n  poor, upgrading is the natural next step.\n- Parameterization ambiguity: different software uses different conventions for shape and\n  scale; always verify the parameterization in use before reporting k.\n\n**When to use**\n\nUse the Weibull distribution as the primary parametric survival model when:\n- The clinical or biological mechanism suggests a monotone hazard: increasing for\n  progressive disease, decreasing for early-failure patterns, constant for memoryless\n  processes.\n- You need a single model that reports both a hazard ratio (for clinical audiences) and a\n  time ratio or mean life-years gain (for health economic models) without fitting two\n  separate models.\n- You are conducting HTA survival extrapolation and the Weibull is the candidate whose\n  projected hazard is most clinically plausible in the unobserved tail beyond trial data.\n- The log(−log S(t)) vs log(t) diagnostic is approximately linear, confirming the power-\n  law hazard assumption.\n- Parametric efficiency matters: when a parametric form is correct, the Weibull achieves\n  lower variance estimates than the semi-parametric Cox model.\n\n**When NOT to use — and when it is actively misleading**\n\nDo not use the Weibull distribution when:\n- The smoothed hazard is non-monotone: if the hazard rises then falls (e.g., post-\n  chemotherapy toxicity then a declining relapse rate), the Weibull cannot fit this shape\n  and will 
extrapolate with a systematically wrong tail. Use log-logistic (non-monotone\n  hazard), log-normal, or generalized gamma / Royston-Parmar spline instead.\n- A cure fraction is biologically plausible and supported by mature follow-up data: a\n  genuine survival plateau invalidates the Weibull's assumption that the hazard is always\n  positive. Use a mixture or non-mixture cure model.\n- The log(−log S(t)) vs log(t) diagnostic shows systematic curvature: this directly\n  falsifies the Weibull power-law assumption on the observed data. Upgrade to a more\n  flexible model before extrapolating.\n- Competing risks are present and the Weibull is applied naively to the event of interest\n  after censoring the competing event: this produces a cause-specific hazard that can\n  overstate cumulative incidence when competing mortality differs by arm. Pair the Weibull\n  fit with a formal competing-risks analysis.\n- Flexible spline models fit materially better and the decision horizon is long: if the\n  Royston-Parmar spline AIC is substantially lower and the hazard shape is clinically\n  complex, the Weibull's rigidity introduces extrapolation error that dominates the\n  cost-effectiveness estimate.",
    "primary_category": "Inferential_Statistics",
    "tags": [
      "statistics",
      "primitive",
      "distributions",
      "survival-analysis",
      "extrapolation",
      "parametric-survival",
      "hta",
      "aft",
      "proportional-hazards"
    ],
    "applies_to_study_types": [
      "cohort_retrospective",
      "cohort_prospective",
      "rct",
      "claims_analysis",
      "ehr_study",
      "registry_study",
      "comparative_effectiveness"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "primary",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1016/S0197-2456(03)00072-2",
        "url": "https://doi.org/10.1016/S0197-2456(03)00072-2",
        "citation_text": "Carroll KJ. On the use and utility of the Weibull model in the analysis of survival data. Controlled Clinical Trials. 2003;24(6):682-701.",
        "year": 2003,
        "authors_short": "Carroll",
        "notes": "Comprehensive applied review of the Weibull model in clinical trial survival analysis, covering the shape parameter's clinical interpretation, the PH/AFT dual structure, model checking diagnostics, and comparison with log-normal and log-logistic alternatives. The canonical reference for the Weibull as the default first-choice parametric survival model."
      },
      {
        "role": "explain",
        "doi": "10.1177/0272989X12472398",
        "url": "https://doi.org/10.1177/0272989X12472398",
        "citation_text": "Latimer NR. Survival analysis for economic evaluations alongside clinical trials — extrapolation with patient-level data: inconsistencies, limitations, and a practical guide. Medical Decision Making. 2013;33(6):743-754.",
        "year": 2013,
        "authors_short": "Latimer",
        "notes": "Peer-reviewed companion to NICE DSU TSD 14, codifying the standard six-distribution candidate set (exponential, Weibull, Gompertz, log-normal, log-logistic, generalized gamma) and the AIC/BIC-plus-clinical-plausibility workflow. Essential context for the Weibull's role in HTA extrapolation."
      },
      {
        "role": "demonstrate",
        "doi": "10.18637/jss.v070.i08",
        "url": "https://doi.org/10.18637/jss.v070.i08",
        "citation_text": "Jackson CH. flexsurv: a platform for parametric survival modeling in R. Journal of Statistical Software. 2016;70(8):1-33.",
        "year": 2016,
        "authors_short": "Jackson",
        "notes": "Reference implementation of the Weibull distribution (and the full TSD 14 candidate set) in the flexsurv R package, supporting both PH (dist=\"weibullPH\") and AFT (dist=\"weibull\") parameterizations, mean survival computation to a lifetime horizon, and PSA via normboot.flexsurvreg. The primary R tool for Weibull fitting in HTA extrapolation."
      },
      {
        "role": "use",
        "doi": "10.1002/sim.4780111409",
        "url": "https://doi.org/10.1002/sim.4780111409",
        "citation_text": "Wei LJ. The accelerated failure time model: a useful alternative to the Cox regression model in survival analysis. Statistics in Medicine. 1992;11(14-15):1871-1879.",
        "year": 1992,
        "authors_short": "Wei",
        "notes": "Foundational treatment of the AFT model structure, including the Weibull as the only standard distribution satisfying both PH and AFT simultaneously. Establishes the time- ratio interpretation and the conversion HR = TR^(-k) for the Weibull case."
      }
    ],
    "plain_language_summary": "The Weibull distribution is a mathematical formula for describing how quickly patients in a study reach some outcome — like death or disease progression — over time. Unlike simpler models that assume risk is constant, the Weibull lets risk rise or fall depending on a \"shape\" parameter: a shape above 1 means risk accumulates over time (like progressive disease), exactly 1 means constant risk (like background mortality in healthy adults), and below 1 means risk declines (like early post-surgical complications). It is especially useful in health economics because it is the only standard survival model that can simultaneously report a hazard ratio for clinicians and a time ratio for economic models from the same fit, and because regulators and health technology assessment bodies require it as one of the standard candidate models for projecting survival beyond trial follow-up.",
    "key_terms": [
      {
        "term": "hazard function",
        "definition": "A function describing how fast the event of interest (death, progression) is occurring at each moment in time among patients who have not yet had the event; higher values mean the event is happening more rapidly at that instant."
      },
      {
        "term": "shape parameter",
        "definition": "The parameter k in the Weibull model that controls whether risk rises, stays flat, or falls over time; k greater than 1 means increasing risk, k equal to 1 means constant risk (exponential distribution), and k less than 1 means decreasing risk."
      },
      {
        "term": "scale parameter",
        "definition": "The parameter λ in the Weibull model that sets the overall time scale — a larger λ shifts the survival curve to the right, meaning events happen later on average."
      },
      {
        "term": "monotone hazard",
        "definition": "A hazard that moves in only one direction over time (strictly increasing or strictly decreasing), which is the key structural assumption of the Weibull — it cannot describe a hazard that rises then falls."
      },
      {
        "term": "proportional hazards",
        "definition": "A model structure in which the ratio of event rates between two groups stays constant over the entire follow-up period; the Cox model and the Weibull distribution both satisfy this structure."
      },
      {
        "term": "accelerated failure time",
        "definition": "A model structure in which treatment multiplies the time scale rather than the event rate; the time ratio exp(coefficient) says how much longer treated patients go before the event relative to the control group."
      }
    ],
    "worked_example": {
      "scenario": "A pharmacoepidemiology team has fitted a Weibull model to time-to-disease-progression data from a small illustrative cohort of five patients with a progressive condition. The maximum-likelihood estimates yield shape parameter k = 2 and scale parameter λ = 2 months — values chosen because the resulting arithmetic is exact and verifiable by hand. The team wants to evaluate the model-implied survival probabilities at months 2 and 4, verify the rising hazard pattern that k = 2 > 1 implies, and confirm that the hazard doubles from month 2 to month 4 under these parameter values.",
      "dataset": {
        "caption": "Observed event times (time_months) and event indicators (1 = progression observed, 0 = censored before progression) for five illustrative patients. A Weibull model fit to these data yields shape k = 2 and scale λ = 2 months, chosen for arithmetic transparency; a real maximum-likelihood fit would optimize these numerically.",
        "columns": [
          "person_id",
          "time_months",
          "event"
        ],
        "rows": [
          [
            1001,
            1,
            1
          ],
          [
            1002,
            2,
            1
          ],
          [
            1003,
            2,
            0
          ],
          [
            1004,
            3,
            1
          ],
          [
            1005,
            4,
            0
          ]
        ]
      },
      "steps": [
        "The Weibull survival function is S(t) = exp(-(t/λ)^k). With k = 2 and λ = 2 months, this simplifies to S(t) = exp(-(t/2)**2). The shape k = 2 > 1 tells us immediately that the hazard will be increasing over time, consistent with a progressive disease where cumulative burden raises risk.",
        "Compute the survival probability at month 2. The exponent is (2/2)**2 = 1, so S(2) = exp(-1) ≈ 0.368. About 37% of patients remain progression-free at 2 months under this Weibull model.",
        "Compute the survival probability at month 4. The exponent is (4/2)**2 = 4, so S(4) = exp(-4) ≈ 0.018. Nearly all patients have experienced progression by month 4 — only about 2% remain event-free.",
        "The Weibull hazard function with k = 2 and λ = 2 simplifies to h(t) = t/2. At month 2: h(2) = 2/2 = 1 event per month. At month 4: h(4) = 4/2 = 2 events per month. The hazard doubles between month 2 and month 4, confirming the monotone rising pattern that k = 2 > 1 implies.",
        "In an AFT two-arm comparison, the software would report a time ratio (TR) for the treatment covariate. A TR of 1.50 would mean the treated arm's progression times are stretched by 50% relative to control. The equivalent Weibull hazard ratio is HR = TR^(-k); with TR = 1.50 and k = 2, HR would be 1.50^(-2) = 1/(1.50**2) ≈ 0.44, meaning the treated arm progresses at roughly 44% of the control arm's instantaneous rate at any given moment."
      ],
      "result": "With k = 2 and λ = 2 months, (2/2)**2 = 1 and (4/2)**2 = 4, giving S(2) = exp(-1) ≈ 0.368 and S(4) = exp(-4) ≈ 0.018. Hazard at month 2: h(2) = 2/2 = 1 event per month. Hazard at month 4: h(4) = 4/2 = 2 events per month. The hazard doubles, confirming monotone increasing risk consistent with k = 2 > 1. A Weibull AFT coefficient of 0.405 implies a time ratio exp(0.405) ≈ 1.50 (treated arm goes 50% longer before progression); the equivalent PH hazard ratio is HR = TR^(-k) with the shape k used as the exponent."
    },
    "prerequisites": [
      "inferential-statistics-foundations",
      "descriptive-statistics"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [],
    "tradeoffs": [],
    "implementation_notes_by_data_source": {
      "claims": "Time zero should be the index fill date for a drug study or the CPT/HCPCS service date for a procedure study. Require continuous FFS enrollment (exclude MA-only person-time) so that censoring due to disenrollment is not confounded with survival. In elderly claims, all-cause mortality competes strongly with the event of interest — fit the Weibull to the cause-specific hazard and pair with a competing-risks analysis. The Weibull's monotone hazard may be well-matched to progressive-disease claims cohorts (increasing k) but is likely misspecified for acute conditions with early peak hazard.",
      "ehr": "Event times are lab-order or administration dates; link to pharmacy fills to confirm initiation. Visit-driven capture creates informative censoring (sicker patients may have more visits but also more loss to follow-up) — confirm event ascertainment completeness. The Weibull's monotone hazard can be appropriate for EHR-measured progression but check the log(−log S(t)) diagnostic, as EHR cohorts often have complex entry patterns.",
      "registry": "Registries supply the cleanest covariate staging (SEER stage, disease severity adjudication) and are the natural source of longer-term external data to validate the Weibull tail beyond trial follow-up. Confirm vital-status completeness (link to a death index) before trusting registry-based Weibull tails, as differential dropout of sicker patients can create a spurious late survival improvement.",
      "primary": "In RCTs and pilot studies, the Weibull's two parameters are efficiently estimated even with modest sample sizes, making it the natural parametric alternative to the non-parametric Kaplan-Meier when a distributional form is needed for sample-size planning or cost-effectiveness modeling. Report the shape k estimate with a 95% CI and test H_0: k = 1 (exponential special case) as a formal model-choice step.",
      "linked": "Linked claims-registry-vital-records substrates allow anchoring the Weibull tail to longer-term external data — fit to trial IPD and use the registry tail to validate or override the projected hazard beyond the data cut. Reconcile order/fill/service-date discrepancies before assigning time zero to avoid reintroducing immortal time."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\nfrom scipy import stats\nfrom lifelines import WeibullFitter, WeibullAFTFitter\n\n# ── 1. Marginal Weibull fit (no covariates) ──────────────────────────────────────────\n# df must have columns: time (float > 0), event (int: 1=occurred, 0=censored)\nwf = WeibullFitter()\nwf.fit(df[\"time\"], event_observed=df[\"event\"])\n\n# lifelines parameterizes as S(t) = exp(-(t/lambda_)**rho_)\nk      = wf.rho_       # shape parameter (rho in lifelines)\nlam    = wf.lambda_    # scale parameter (lambda in lifelines)\nprint(f\"Shape k: {k:.4f}   Scale λ: {lam:.4f}   AIC: {wf.AIC_:.2f}\")\n\n# Model check: is the hazard increasing (k>1), constant (k=1), or decreasing (k<1)?\nif k > 1:\n    print(\"Increasing hazard — consistent with progressive disease or wear-out.\")\nelif k < 1:\n    print(\"Decreasing hazard — consistent with early-failure / burn-in pattern.\")\nelse:\n    print(\"Constant hazard — Weibull reduces to exponential (memoryless process).\")\n\n# ── 2. Compute S(t) and h(t) at specified time points ───────────────────────────────\ntimes = np.array([1.0, 2.0, 4.0])\n# scipy.stats.weibull_min uses (c=k, scale=lambda) for S(t) = exp(-(t/lambda)**k)\ndist = stats.weibull_min(c=k, scale=lam)\nsurv_probs = dist.sf(times)          # survival function = 1 - CDF\nhazards    = dist.pdf(times) / dist.sf(times)   # h(t) = f(t)/S(t)\nfor t, s, h in zip(times, surv_probs, hazards):\n    print(f\"  t={t:.1f}: S(t)={s:.4f}   h(t)={h:.4f}\")\n\n# ── 3. Log(−log S(t)) diagnostic: should be linear in log(t) if Weibull holds ───────\nimport pandas as pd\nfrom lifelines import KaplanMeierFitter\nkmf = KaplanMeierFitter()\nkmf.fit(df[\"time\"], event_observed=df[\"event\"])\nkm_sf    = kmf.survival_function_\nlog_t    = np.log(km_sf.index.values + 1e-9)\nlog_log_s = np.log(-np.log(km_sf[\"KM_estimate\"].values + 1e-9))\n# Plot log_log_s vs log_t: a straight line supports Weibull; curvature suggests log-normal\n# or log-logistic. 
Slope of the linear fit estimates k.\n\n# ── 4. AFT model with treatment covariate ────────────────────────────────────────────\n# df must additionally have column arm (0=control, 1=treated)\naft = WeibullAFTFitter()\naft.fit(df[[\"time\", \"event\", \"arm\"]], duration_col=\"time\", event_col=\"event\")\naft.print_summary(decimals=4)\n\n# Extract the time ratio (TR) for the arm coefficient\n# In lifelines AFT output, lambda_ ancillary row = log(scale), covariate rows = log(TR)\n# exp(arm coefficient in lambda_ rows) = time ratio\n# lifelines WeibullAFTFitter stores rho_ as log(shape), so shape k = exp(rho_Intercept)\n# NOT 1/exp(...); that inversion would produce 1/k and give a wrong HR.\narm_coef = aft.params_[\"lambda_\"][\"arm\"]        # log time ratio for arm covariate\ntr       = np.exp(arm_coef)\nk_aft    = np.exp(aft.params_[\"rho_\"][\"Intercept\"])   # shape: rho_ is log(shape)\nhr_from_aft = tr ** (-k_aft)\nprint(f\"\\nAFT time ratio (TR): {tr:.4f}  ->  Weibull HR = TR^(-k) = {hr_from_aft:.4f}\")",
        "description": "Fit a Weibull survival model to censored time-to-event data using lifelines. Demonstrates\nWeibullFitter for the marginal (no covariates) fit including shape and scale parameter\nextraction, the log(−log S(t)) diagnostic, and WeibullAFTFitter for a covariate-adjusted\nAFT model with time-ratio extraction and the HR = TR^(-k) conversion to the PH frame.\nscipy.stats.weibull_min is shown for computing S(t) and h(t) at arbitrary time points\ngiven known parameters. Required input: a pandas DataFrame with columns time (>0) and\nevent (1=occurred, 0=censored), and an optional covariate arm (0/1).",
        "dependencies": [
          "lifelines",
          "scipy",
          "numpy"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(survival)\nlibrary(flexsurv)\n\n# ── 1. AFT parameterization via survreg ──────────────────────────────────────────────\n# survreg(dist=\"weibull\") fits log T = mu + gamma*x + sigma*eps (Gumbel error)\n# scale output = sigma = 1/k; intercept = log(lambda)\nfit_aft <- survreg(Surv(time, event) ~ arm, data = df, dist = \"weibull\")\nsummary(fit_aft)\n\nsigma <- fit_aft$scale           # sigma = 1/k\nk_aft <- 1 / sigma               # shape parameter\nlam   <- exp(coef(fit_aft)[\"(Intercept)\"])   # scale parameter lambda\ncat(sprintf(\"Shape k = %.4f   Scale λ = %.4f\\n\", k_aft, lam))\n\n# Time ratio for arm covariate (note survreg sign convention)\n# In survreg, the arm coefficient is log(TR) with positive = longer survival\narm_coef <- coef(fit_aft)[\"armtreated\"]\nTR <- exp(arm_coef)              # time ratio\nHR <- TR ^ (-k_aft)             # convert to hazard ratio via HR = TR^(-k)\ncat(sprintf(\"Time ratio TR = %.4f -> HR = TR^(-k) = %.4f\\n\", TR, HR))\n\n# ── 2. PH parameterization via flexsurvreg ───────────────────────────────────────────\n# dist=\"weibullPH\" directly parameterizes by the hazard ratio for covariates\nfit_ph <- flexsurvreg(Surv(time, event) ~ arm, data = df, dist = \"weibullPH\")\nprint(fit_ph)    # exp(arm coef) = HR directly\n\n# ── 3. Compute S(t) and h(t) at specified time points ───────────────────────────────\ntimes <- c(1, 2, 4)\nS_t <- exp(-(times / lam)^k_aft)                    # survival function\nh_t <- (k_aft / lam) * (times / lam)^(k_aft - 1)  # hazard function\ncat(\"\\n  t   S(t)   h(t)\\n\")\nfor (i in seq_along(times))\n  cat(sprintf(\"  %g  %.4f  %.4f\\n\", times[i], S_t[i], h_t[i]))\n\n# ── 4. 
Log(−log S(t)) Weibull diagnostic ─────────────────────────────────────────────\nkm <- survfit(Surv(time, event) ~ 1, data = df)\n# Keep points with 0 < S(t) < 1 so log(-log S) is finite; lm() stops on Inf values\nok    <- km$surv > 0 & km$surv < 1\nlog_t <- log(km$time[ok])\ncll   <- log(-log(km$surv[ok]))\nplot(log_t, cll,\n     xlab = \"log(t)\", ylab = \"log(-log S(t))\",\n     main = \"Weibull diagnostic: should be linear if Weibull holds\")\nabline(lm(cll ~ log_t), col = \"blue\")\n# Slope of the fitted line estimates k; a straight line supports Weibull assumption.\n\n# ── 5. Mean survival to lifetime horizon for HTA ─────────────────────────────────────\nHORIZON <- 30      # years (adapt to time units in the data)\nDISC    <- 0.035   # annual discount rate (NICE reference case)\ngrid    <- seq(0, HORIZON, by = 1/12)\n# [[1]] selects the first covariate group returned by summary(); repeat per arm as needed\nS_grid  <- summary(fit_ph, t = grid, type = \"survival\", ci = FALSE)[[1]]$est\ndisc    <- (1 + DISC)^(-grid)   # discrete annual discounting: 1 / (1 + r)^t\ndisc_ly <- sum(diff(grid) * (head(S_grid * disc, -1) + tail(S_grid * disc, -1)) / 2)\ncat(sprintf(\"\\nDiscounted mean life-years to %d-year horizon: %.3f\\n\", HORIZON, disc_ly))",
        "description": "Fit Weibull survival models in both AFT (survreg) and PH (flexsurvreg dist=\"weibullPH\")\nparameterizations, extract shape and scale parameters, compute survival probabilities\nat specified time points, and produce the log(−log S(t)) vs log(t) Weibull diagnostic.\nDemonstrates the HR/TR conversion and mean survival computation to a lifetime horizon\nfor HTA extrapolation. Required input: a data.frame df with columns time (>0), event\n(1=occurred, 0=censored), and arm (factor, reference level = \"control\").",
        "dependencies": [
          "survival",
          "flexsurv"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* ── 1. Weibull diagnostic: log(-log S(t)) vs log(t) should be linear ── */\nproc lifetest data=work.surv plots=loglogs;\n  time time*event(0);\n  strata arm;\n  /* Linearity of the log(-log S(t)) plot supports the Weibull assumption.            */\n  /* Slope of the line estimates the shape parameter k.                               */\nrun;\n\n/* ── 2. Weibull AFT fit via PROC LIFEREG ── */\n/* LIFEREG uses AFT parameterization: log(T) = intercept + arm*gamma + scale*epsilon  */\n/* intercept = log(lambda), scale = 1/k (sigma), arm coef = log time ratio            */\nproc lifereg data=work.surv;\n  model time*event(0) = arm / distribution=weibull;\n  ods output ParameterEstimates=work.weib_params\n             FitStatistics=work.weib_fit;\nrun;\nproc print data=work.weib_params; run;  /* intercept, arm, scale printed here */\nproc print data=work.weib_fit; run;     /* AIC, BIC, -2 log L                 */\n\n/* ── 3. Extract k, λ, time ratio, and HR ── */\ndata work.weib_derived;\n  set work.weib_params end=last;\n  retain intercept arm_coef sigma;\n  if Parameter=\"Intercept\" then intercept=Estimate;\n  if Parameter=\"arm\"       then arm_coef=Estimate;\n  if Parameter=\"Scale\"     then sigma=Estimate;\n  if last then do;\n    lambda = exp(intercept);   /* scale parameter                     */\n    k      = 1 / sigma;        /* shape parameter                     */\n    TR     = exp(arm_coef);    /* time ratio for arm (AFT frame)      */\n    HR     = TR ** (-k);       /* hazard ratio via HR = TR^(-k) for Weibull */\n    put \"Shape k = \" k 8.4 \" Scale lambda = \" lambda 8.4;\n    put \"Time ratio TR = \" TR 8.4 \" -> HR = \" HR 8.4;\n  end;\nrun;\n\n/* ── 4. 
Survival probability at specified time points ── */\n/* S(t) = exp(-(t/lambda)**k); compute for times 1, 2, 4 */\ndata work.surv_probs;\n  set work.weib_derived;\n  array t_vals[3] _temporary_ (1.0 2.0 4.0);\n  do i = 1 to 3;\n    t_val  = t_vals[i];\n    S_t    = exp(-(t_val/lambda)**k);\n    h_t    = (k/lambda) * (t_val/lambda)**(k-1);  /* hazard at t */\n    output;\n  end;\n  keep t_val S_t h_t k lambda;\nrun;\nproc print data=work.surv_probs; run;\n\n/* ── 5. Mean discounted life-years to lifetime horizon (HTA) ── */\n/* Trapezoidal integration of S(t) / (1 + discount_rate)**t from 0 to HORIZON         */\n%let HORIZON  = 30;    /* years */\n%let DISC     = 0.035; /* annual discount rate (NICE reference case) */\n%let N_STEPS  = 360;   /* monthly grid: 30 years * 12 = 360 steps   */\ndata work.ly_calc;\n  set work.weib_derived;\n  dt = &HORIZON / &N_STEPS;\n  disc_ly = 0;\n  do i = 0 to &N_STEPS - 1;\n    t0 = i * dt;       t1 = (i+1) * dt;\n    S0 = exp(-(t0/lambda)**k);   S1 = exp(-(t1/lambda)**k);\n    /* discrete annual discounting: 1 / (1 + r)**t */\n    disc_ly + dt * (S0*(1+&DISC)**(-t0) + S1*(1+&DISC)**(-t1)) / 2;\n  end;\n  put \"Discounted mean LY to &HORIZON.-year horizon: \" disc_ly 8.3;\nrun;",
        "description": "Fit a Weibull parametric survival model with PROC LIFEREG (DIST=WEIBULL) using the AFT\nlog-linear parameterization. Demonstrates the log(−log S(t)) Weibull diagnostic via PROC\nLIFETEST with the LOGLOGS plot option, parameter extraction and conversion to shape k and\nscale λ, survival probability computation at specified time points, and mean discounted\nlife-years computation for HTA extrapolation. Required input: work.surv with variables\ntime (>0), event (1=occurred, 0=censored), and arm (0=control, 1=treated).",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Q[Time-to-event outcome<br/>with monotone hazard plausible?] --> KD{Check log--log<br/>diagnostic}\n  KD -->|Approximately linear| WEI[Fit Weibull<br/>shape k and scale λ]\n  KD -->|Curved up / down| ALT[Log-normal / log-logistic<br/>or generalized gamma / spline]\n  WEI --> K{Shape k value?}\n  K -->|k < 1| DEC[Decreasing hazard<br/>early-failure / burn-in]\n  K -->|k = 1| EXP[Constant hazard<br/>exponential special case]\n  K -->|k > 1| INC[Increasing hazard<br/>progressive disease / wear-out]\n  WEI --> HTA{HTA lifetime<br/>extrapolation needed?}\n  HTA -->|Yes| CAND[Include Weibull in six-distribution<br/>TSD 14 candidate set;<br/>select by AIC/BIC + hazard plausibility]\n  HTA -->|No| BOTH[Report HR and time ratio TR;<br/>HR = TR^-k for PH audience]\n  CAND --> CURE{Survival plateau<br/>in observed data?}\n  CURE -->|Yes| CURE_M[Consider mixture<br/>cure model instead]\n  CURE -->|No| FLOOR[Floor projected hazard at<br/>general-population life table]",
        "caption": "Decision logic for fitting and using the Weibull distribution in RWE and HTA contexts. The log(−log S(t)) linearity check gates the decision; k drives clinical interpretation; the HTA branch connects to the six-distribution TSD 14 workflow.",
        "alt_text": "Flowchart starting from a time-to-event question, routing through the log(−log S(t)) linearity check to either the Weibull (if linear) or a more flexible alternative (if curved), then branching on shape k to clinical interpretation labels (decreasing, constant, increasing hazard), and on HTA need to a TSD 14 candidate set path or a direct HR/TR reporting path.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "see_also",
        "target_slug": "cox-ph-regression",
        "notes": "The Cox semi-parametric model and the Weibull address the same time-to-event question from different angles. Cox leaves the baseline hazard unspecified and estimates HRs via partial likelihood; Weibull specifies the baseline hazard as a power law and estimates it by maximum likelihood. The Weibull is the only distribution that is simultaneously PH and AFT, making it the natural parametric companion to Cox when extrapolation is required. Prefer Cox for within-data HRs; prefer Weibull for lifetime extrapolation and dual HR/TR reporting."
      },
      {
        "relation_type": "see_also",
        "target_slug": "accelerated-failure-time-models",
        "notes": "AFT models encompass a family of distributions (Weibull, log-normal, log-logistic, generalized gamma) that model the multiplicative stretching or compression of the time scale rather than the hazard rate. The Weibull is the only AFT member that is also PH, enabling direct HR/TR conversion. Log-normal and log-logistic are AFT-only and should be preferred when the hazard is non-monotone."
      },
      {
        "relation_type": "used_with",
        "target_slug": "survival-extrapolation-hta-rwe",
        "notes": "The Weibull is one of the six standard parametric distributions in the NICE DSU TSD 14 candidate set for HTA survival extrapolation. It is typically selected when the smoothed hazard is monotone and the tail plausibility argument supports an increasing or decreasing power-law hazard. See that entry for the full workflow: AIC/BIC selection, visual fit, hazard-plausibility justification, general-population mortality flooring, and PSA."
      },
      {
        "relation_type": "see_also",
        "target_slug": "log-normal-distribution",
        "notes": "The log-normal distribution is the primary alternative when the hazard is non-monotone (rising then falling), which is common in oncology (early treatment toxicity then declining relapse risk) and some infectious-disease settings. Both are AFT models, but the log-normal satisfies only AFT (not PH), so it cannot report a hazard ratio without additional assumptions. When the log(−log S(t)) diagnostic shows curvature, consider log-normal or log-logistic before upgrading to generalized gamma or splines."
      },
      {
        "relation_type": "see_also",
        "target_slug": "generalized-linear-models",
        "notes": "Generalized linear models (Poisson, negative-binomial, gamma) share the maximum- likelihood estimation framework and the log-link family with the Weibull, and are the natural choice for count outcomes and cost outcomes that are not time-to-first-event. The Weibull is the parametric survival analog within this family of distributional regression models for positive outcomes."
      },
      {
        "relation_type": "requires",
        "target_slug": "inferential-statistics-foundations",
        "notes": "Understanding maximum-likelihood estimation, likelihood ratio tests, and confidence intervals — all covered in inferential-statistics-foundations — is required before interpreting Weibull shape and scale parameter estimates, AIC/BIC model comparison, and the test of H_0: k = 1 (exponential special case)."
      }
    ],
    "aliases": [
      "Weibull model",
      "Weibull survival",
      "shape parameter",
      "Weibull AFT"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "welch-t-test",
    "name": "Welch's t-Test (Unequal Variances)",
    "short_definition": "A two-sample hypothesis test for comparing the means of two independent groups that does not assume their variances are equal — it estimates a separate variance for each group and adjusts the degrees of freedom using the Welch-Satterthwaite equation, making it the recommended default for continuous two-sample comparisons in observational research where variance heterogeneity between treated and control groups is common.",
    "long_description": "**What Welch's t-test is and why it exists**\n\nThe classical Student's t-test for two independent groups rests on two assumptions: (1) the\nsampling distribution of the mean is approximately normal (protected by the Central Limit\nTheorem at large n), and (2) the two groups have *equal population variances*. The second\nassumption is far more restrictive in practice than the first, and it is routinely violated\nin observational healthcare data where treatment groups differ in case-mix, disease severity,\nand cost dispersion. Welch's t-test, proposed by B. L. Welch in 1947, relaxes the\nequal-variance assumption entirely by estimating a separate sample variance for each group\nand plugging those separate estimates into both the test statistic and an approximated\ndegrees-of-freedom formula. The result is a test that is valid regardless of whether the\ngroup variances match.\n\nThe Welch test statistic is:\n\n  t = (x̄₁ − x̄₂) / sqrt(s₁²/n₁ + s₂²/n₂)\n\nwhere x̄ᵢ, sᵢ², and nᵢ are the sample mean, sample variance, and sample size for group i.\nThe denominator is the standard error of the mean difference using separate (unpooled)\nvariance estimates — compare this with Student's t, which pools the two variances into a\nsingle estimate under the equal-variance assumption.\n\n**The Welch-Satterthwaite degrees of freedom approximation**\n\nBecause the Welch statistic is not exactly t-distributed, Welch (1947) and Satterthwaite\n(1946) independently derived an approximation for the effective degrees of freedom (df*):\n\n  df* = (s₁²/n₁ + s₂²/n₂)² / [(s₁²/n₁)²/(n₁−1) + (s₂²/n₂)²/(n₂−1)]\n\nThe value of df* falls between min(n₁, n₂) − 1 (the most conservative possible df) and\nn₁ + n₂ − 2 (the pooled df used by Student's t-test). When the two groups have equal\nvariances and equal sample sizes, df* collapses to n₁ + n₂ − 2, exactly matching\nStudent's t-test. 
When one group has a much larger variance or a much smaller sample, df*\nis pulled toward the df of the group that dominates the standard error, which raises the\ncritical value and widens the confidence interval appropriately.\nIn practice, df* is rarely an integer; software rounds it or uses a continuous\nt-distribution lookup. The key intuition: the test self-calibrates its conservatism\nbased on the observed variance disparity — when one group drives nearly all the sampling\nuncertainty, the effective df reflects that group alone.\n\n**Why \"test for equal variances first, then choose\" is statistically harmful**\n\nA widely taught but statistically suboptimal workflow is:\n(1) Run Levene's test or Bartlett's test for equality of variances.\n(2) If p < 0.05, use Welch's t-test; otherwise, use Student's t-test.\n\nDelacre, Lakens, and Leys (2017) demonstrated through simulation that this conditional\nstrategy degrades type-I error control relative to simply always using Welch. The mechanism\nis straightforward: the preliminary variance test has its own type-I and type-II error\nrates (it will occasionally conclude that the variances differ when they are equal, or fail\nto detect a real difference), and that error cascades into the subsequent test choice.\nThe resulting procedure has a composite error rate that is neither of the two individual\ntests' rates. 
Moreover, at small n the\nvariance test has low power and will frequently fail to detect genuine heteroscedasticity,\nrouting the analyst to Student's t-test exactly when the Welch correction is most needed.\n\nThe practical rule — now implemented as the default in R, and requiring an explicit flag\nin Python — is: **always use Welch's t-test for two independent-sample continuous\ncomparisons.** If there is an explicit substantive reason to assume equal variances (e.g., a\ncontrolled laboratory experiment with a very carefully matched design), Student's t-test is\npermissible, but this is rare in healthcare data analysis.\n\n**Software defaults: R makes Welch the default; Python requires a flag; SAS prints both**\n\n- **R**: `t.test(x, y)` uses Welch by default (`var.equal = FALSE`). Student's t-test\n  requires `var.equal = TRUE`. This is the correct default.\n- **Python (scipy)**: `scipy.stats.ttest_ind(x, y)` uses Student's t-test by default\n  (`equal_var = True`). Welch requires `equal_var = False`. Analysts who do not set this\n  flag are running Student's t-test without realising it — a common error.\n- **SAS (PROC TTEST)**: Outputs both the Satterthwaite (Welch) row and the Pooled\n  (Student) row simultaneously. The analyst must read the correct row; the Satterthwaite\n  row is appropriate by default for observational two-group comparisons. 
The COCHRAN\n  option provides an additional Cochran-Cox approximation for very extreme variance ratios.\n\n**RWE realities: unequal variances are the norm, not the exception**\n\nIn observational healthcare data, the treated and comparator groups routinely differ in\nways that produce heteroscedasticity:\n\n- *Cost dispersion*: A biologics-treated cohort may have far greater variance in total\n  healthcare costs than a generic-treated cohort — high responders have near-zero costs,\n  non-responders accumulate catastrophic costs, and the SD for the biologic arm can be\n  many times that of the comparator.\n- *Utilization counts*: Emergency department visits or inpatient days in a frail-elderly\n  population versus a healthier comparator will differ vastly in their variance.\n- *Lab values*: Disease-active patients may show far more within-group variability in\n  biomarkers than stable patients.\n\nIn all these settings, Student's t-test — by pooling the variances — implicitly\nunderweights the group with the larger variance and produces a test statistic whose null\ndistribution is misspecified. Welch's correction directly addresses this. 
At very large\nn (> 10,000 per group) with similar group sizes, the practical difference between Welch and\nStudent is small; with unbalanced groups and unequal variances, Student's t-test remains\nmiscalibrated at any n. The cost of using Welch is zero, so there is no reason to deviate\nfrom it.\n\n**Large-n behavior and the reporting imperative**\n\nAt large sample sizes typical of administrative claims databases (n = 50,000 per arm),\nthe Welch t-test can reject the null for mean differences so small they are clinically\nirrelevant — a difference of $2 in annual drug costs, or 0.01 days in length of stay.\nThis is not a failure of the test; the test is doing exactly what it is designed to do.\nThe failure would be interpreting statistical significance as clinical or policy relevance.\nBest practice in large observational datasets:\n\n- Report the mean difference with a 95% confidence interval and label both quantities.\n- Report an effect size (Cohen's d = mean difference / pooled SD) alongside the p-value.\n- Explicitly comment on whether the magnitude of the difference is clinically meaningful.\n- Note that the Welch t-test cannot adjust for confounding: it is a descriptive comparison,\n  and it supports a causal reading only in a randomized or otherwise well-balanced design.\n\n**Descriptive-only status in confounded observational comparisons**\n\nThe Welch t-test, like all two-sample tests, estimates a raw (unadjusted) mean difference.\nIn an unbalanced observational cohort, this raw difference confounds the treatment effect\nwith case-mix differences between the groups. A statistically significant Welch t-test in\nan observational setting does not constitute causal evidence; it is a descriptive finding\nthat motivates further adjusted analyses. When adjustment is needed, the appropriate tool\nis regression (linear regression for continuous outcomes, with or without covariate\nadjustment, which subsumes and extends the two-sample mean comparison), propensity-score\nweighting, or matching — not the two-sample test. 
Report the Welch t-test result in\nTable 1 of an observational study as descriptive evidence of baseline imbalance, not as\na causal estimate.\n\n**Pros, cons, and trade-offs**\n\n*Pros*:\n- Does not assume equal variances; valid across the full range of heteroscedasticity\n  scenarios encountered in healthcare data.\n- Computationally trivial; no external dependencies; available in every statistical\n  language.\n- Produces an interpretable mean difference with confidence interval — the policy-relevant\n  quantity for budget-impact models and clinical decision summaries.\n- Protected by the Central Limit Theorem for large n: robust to non-normality of the\n  raw data when n is adequate (generally ≥ 30 per group, often less).\n- Under equal variances and equal sample sizes, gives results essentially identical to\n  Student's t-test — no cost to using it as the default.\n- Better type-I error control than the conditional \"test variances first\" workflow across\n  all variance-ratio and sample-size combinations (Delacre et al. 2017).\n\n*Cons*:\n- Assumes independence between groups — not valid for paired or matched data (use paired\n  t-test or Wilcoxon signed-rank instead).\n- Like all t-tests, targets the mean. For heavily skewed outcomes (costs, utilization) with\n  small-to-moderate n, extreme outliers can distort the mean and therefore the test's\n  conclusion; gamma GLM, bootstrap mean estimation, or Winsorization may be preferable.\n- Does not adjust for confounders. As a descriptive/unadjusted test, it cannot produce\n  causal mean-difference estimates in observational data without additional methods.\n- At small n (< 15 per group) with severely non-normal data, the CLT has not yet stabilized\n  the sampling distribution and the test may have poor type-I error control; Mann-Whitney\n  U or a permutation test may be safer.\n- The Satterthwaite df approximation is just that — an approximation. 
At very small n\n  with extreme variance ratios, it may not be perfectly calibrated, though it remains\n  superior to the pooled df of Student's t-test.\n\n*Trade-offs*:\n- Welch vs Student: always prefer Welch unless you have an explicit, defensible reason to\n  assume equal variances. The performance penalty for Welch when variances happen to be\n  equal is negligible.\n- Welch t-test vs Mann-Whitney U: Welch targets the mean difference and produces an\n  interpretable estimate on the original scale; Mann-Whitney tests stochastic equality\n  (H₀: P(X > Y) = 0.5) and requires the Hodges-Lehmann estimator for a companion effect\n  estimate. For skewed data at small-to-moderate n, Mann-Whitney may have better type-I\n  error control; for large n, the CLT makes Welch robust. These tests answer different\n  questions and their results can legitimately diverge.\n- Welch t-test vs gamma GLM: For cost outcomes where the mean is the policy-relevant\n  quantity, a gamma GLM with log link is the modern standard because it (a) respects the\n  right-skewed distributional shape, (b) estimates a ratio of arithmetic means on the\n  untransformed dollar scale, with no back-transformation step,\n  and (c) naturally accommodates covariate adjustment. 
The Welch t-test on raw costs is a\n  reasonable exploratory and sensitivity tool but not the recommended primary method for\n  cost inference.\n\n**When to use**\n\n- Two independent groups on a continuous outcome and the mean difference is the target\n  quantity — Welch is the appropriate default choice.\n- Continuous baseline comparisons in Table 1 of any study (RCT or observational) as a\n  descriptive test; note that standardized mean differences (SMDs) are preferred for\n  balance assessment in observational Table 1, but the Welch t-test p-value remains\n  informative alongside SMDs in randomized designs.\n- Sensitivity analysis alongside a gamma GLM primary analysis for cost outcomes at large n,\n  to demonstrate robustness of conclusions to the choice of method.\n- Any situation where Student's t-test is being considered — simply use Welch instead.\n- Lab values, quality-of-life scores, or other approximately-normal continuous endpoints\n  with adequate sample size (≥ 30 per group as a rough guide).\n- When communicating results to non-statistical audiences who understand mean differences\n  but not rank statistics or model-based marginal effects.\n\n**When NOT to use**\n\n- *Paired or matched data*: if the two observations come from the same patient (pre-post,\n  matched pair, crossover) the paired t-test or Wilcoxon signed-rank test is required.\n  Using Welch on paired data ignores the within-patient correlation and wastes power.\n- *Skewed outcomes at small n*: for cost, utilization, or time-to-event data with fewer\n  than ~30 observations per group, the CLT has not stabilized the mean's sampling\n  distribution. A Mann-Whitney U test or nonparametric bootstrap is safer.\n- *As causal evidence in an unbalanced observational comparison*: a significant Welch\n  t-test between a treatment and comparator arm in an unmatched, unweighted observational\n  cohort is a descriptive finding only. 
Confounders can explain the entire difference.\n  Route to regression adjustment, propensity-score methods, or g-methods for causal\n  inference.\n- *When the target estimand is not the mean*: for ordinal outcomes, bounded scores, or\n  when the median or a rank-based quantity is substantively meaningful, Mann-Whitney U\n  (with the Hodges-Lehmann estimate) or ordinal regression is more appropriate.\n- *As a substitute for adjustment when confounding is uncontrolled*: do not report a Welch\n  t-test p-value as primary evidence of a treatment effect in observational data — readers\n  and payers will interpret it as causal even if the accompanying text disclaims it.\n- *For cost data as the primary confirmatory analysis when mean costs matter for budget\n  impact*: the gamma GLM or two-part model is the modern standard for mean cost inference\n  in HEOR because it better accounts for the distributional shape and can incorporate\n  covariate adjustment in a single coherent model.\n\n**Interpreting the output**\n\nIn the worked example, Group A (n = 4) has a mean of 4.0 ED visits with a variance of\napproximately 0.667, while Group B (n = 4) has a mean of 6.0 visits with a variance of\napproximately 3.333 — a variance ratio of approximately 5:1. The Welch standard error is\nsqrt(0.667/4 + 3.333/4) = 1.0, the Welch t statistic is −2.0, and the Welch-Satterthwaite\ndegrees of freedom are approximately 4 (the stated n, variances, SE, and df are mutually\nconsistent only at n = 4 per group). The critical value for a two-sided test at df = 4,\nalpha = 0.05 is approximately 2.78. Because |t| = 2.0 < 2.78, the test does not reach\nstatistical significance (p > 0.05).\n\n*(1) Formal interpretation.* The Welch t statistic of −2.0 does not exceed the critical\nboundary at df ≈ 4, so the observed mean difference of −2.0 visits is consistent with the\nnull hypothesis of no difference at alpha = 0.05. The 95% CI on the mean difference —\nwider than under equal-variance assumptions because the Satterthwaite approximation reduces\neffective degrees of freedom — includes zero. 
The non-significant result does not prove the\ngroups have equal means; it indicates that the evidence against equality falls short of the\npre-specified threshold, partly because small sample sizes and unequal variances reduce the\neffective degrees of freedom. Had the analyst used Student's pooled t-test, the standard\nerror would be the same because the group sizes are equal, but the statistic would be\nreferred to n₁ + n₂ − 2 = 6 degrees of freedom rather than about 4, overstating the\ninformation in the data.\n\n*(2) Practical interpretation.* The 2-visit difference in average utilization between\ngroups cannot be distinguished from chance variation at n = 4 per group with these unequal\nvariances. The key methodological lesson is that the approximately 5:1 variance ratio\nrequired Welch's correction — pooling variances as Student's t-test does when they differ\nthis much produces a test statistic whose null distribution is wrong. In a larger study,\nthe same directional difference might yield a statistically significant result, but the\ncurrent data are too sparse for the test to be informative.",
    "primary_category": "Inferential_Statistics",
    "tags": [
      "statistics",
      "primitive",
      "hypothesis-testing",
      "heteroscedasticity",
      "t-test",
      "Welch",
      "Satterthwaite",
      "unequal-variances",
      "two-sample",
      "mean-difference"
    ],
    "applies_to_study_types": [
      "cohort_retrospective",
      "cohort_prospective",
      "rct",
      "cross_sectional",
      "active_comparator_new_user",
      "descriptive_analysis",
      "claims_analysis",
      "ehr_study",
      "registry_study"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "primary",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1093/biomet/34.1-2.28",
        "url": "https://doi.org/10.1093/biomet/34.1-2.28",
        "citation_text": "Welch BL. The generalization of 'Student's' problem when several different population variances are involved. Biometrika. 1947;34(1-2):28-35.",
        "year": 1947,
        "authors_short": "Welch",
        "notes": "The original paper deriving the Welch-Satterthwaite approximation for degrees of freedom in the two-sample test with unequal variances. Establishes the mathematical foundation for what is now the recommended default two-sample t-test in most software packages."
      },
      {
        "role": "demonstrate",
        "doi": "10.5334/irsp.82",
        "url": "https://doi.org/10.5334/irsp.82",
        "citation_text": "Delacre M, Lakens D, Leys C. Why psychologists should by default use Welch's t-test instead of Student's t-test. International Review of Social Psychology. 2017;30(1):92-101.",
        "year": 2017,
        "authors_short": "Delacre et al.",
        "notes": "Simulation study demonstrating that Welch's t-test maintains better type-I error control than the conditional \"test for equal variances first, then choose\" strategy across a wide range of sample-size and variance-ratio scenarios. The key paper supporting Welch as the unconditional default; also the basis for R's default behavior. Cited directly when defending the recommendation against preliminary Levene or Bartlett tests."
      },
      {
        "role": "explain",
        "doi": "10.1186/1471-2288-12-78",
        "url": "https://doi.org/10.1186/1471-2288-12-78",
        "citation_text": "Fagerland MW. t-tests, non-parametric tests, and large studies — a paradox of statistical practice? BMC Medical Research Methodology. 2012;12:78.",
        "year": 2012,
        "authors_short": "Fagerland",
        "notes": "Places Welch's t-test within the broader context of parametric vs nonparametric test selection; demonstrates that the t-test's normality assumption is protected by the CLT precisely in the large samples where analysts most often worry about it, supporting continued use of Welch in large observational datasets with non-normal continuous outcomes."
      }
    ],
    "plain_language_summary": "Welch's t-test compares the average values of a measurement between two independent groups — for example, average annual healthcare costs for patients on a new drug versus patients on the standard drug — and asks whether any observed difference is larger than you would expect from chance alone. Unlike the older Student's t-test, Welch's version does not require the two groups to have the same amount of spread (variance) in their data, which matters because treated and control patients in real-world studies almost always differ in how spread out their outcomes are. The test produces a p-value and a mean difference with a confidence interval — but it cannot account for differences in patient characteristics between the groups, so a significant result in an observational study is a description of the data, not proof that the treatment caused the difference.",
    "key_terms": [
      {
        "term": "heteroscedasticity",
        "definition": "The condition where two groups have different amounts of spread (variance) in their outcome values; Welch's t-test is designed to remain valid in this situation, whereas Student's t-test is not."
      },
      {
        "term": "Satterthwaite degrees of freedom",
        "definition": "A formula that calculates an adjusted \"effective sample size\" for the test based on the sizes and variances of the two groups; it tells the test how much uncertainty to build in, and it falls somewhere between the smaller group's degrees of freedom and the combined total — reflecting which group is contributing more uncertainty."
      },
      {
        "term": "pooled vs unpooled variance",
        "definition": "A pooled variance blends the spread from both groups into a single estimate (Student's t-test); an unpooled variance keeps a separate spread estimate for each group (Welch's t-test), which is safer when the groups have different amounts of variability."
      },
      {
        "term": "robustness",
        "definition": "A property of a statistical test meaning it still works correctly even when some assumptions are violated; Welch's t-test is robust to unequal variances, while Student's t-test is not."
      },
      {
        "term": "mean difference",
        "definition": "The primary result of Welch's t-test — the average value in group 1 minus the average value in group 2 — reported alongside a confidence interval that shows the range of plausible true differences given the data."
      },
      {
        "term": "type-I error rate",
        "definition": "The probability of concluding there is a difference when there actually is none; Welch's t-test maintains this at the chosen level (e.g., 5%) across a wide range of variance ratios, while Student's t-test can exceed it when variances differ."
      }
    ],
    "worked_example": {
      "scenario": "A health outcomes analyst is comparing the number of outpatient visits in the 12 months after starting treatment between 4 patients on a new medication (Group A) and 4 patients on standard care (Group B). The two groups happen to have very different spreads: Group A patients are clustered tightly around a low visit count, while Group B has one patient with a very high count pulling up both the mean and the spread. The analyst uses Welch's t-test (unpooled) rather than Student's t-test (pooled) because the group standard deviations look very different, and verifies the arithmetic by hand.",
      "dataset": {
        "caption": "Annual outpatient visit counts for 4 patients per group. Group A is tightly clustered; Group B has one high-utilizer. All counts are integers for easy arithmetic.",
        "columns": [
          "patient_id",
          "group",
          "visits"
        ],
        "rows": [
          [
            "A1",
            "A",
            3
          ],
          [
            "A2",
            "A",
            4
          ],
          [
            "A3",
            "A",
            4
          ],
          [
            "A4",
            "A",
            5
          ],
          [
            "B1",
            "B",
            4
          ],
          [
            "B2",
            "B",
            5
          ],
          [
            "B3",
            "B",
            7
          ],
          [
            "B4",
            "B",
            8
          ]
        ]
      },
      "steps": [
        "Compute group means. Group A: (3+4+4+5)/4 = 16/4 = 4. Group B: (4+5+7+8)/4 = 24/4 = 6.",
        "Compute sample variances (dividing by n-1 = 3). Group A deviations from mean 4: (3-4)^2 = 1, (4-4)^2 = 0, (4-4)^2 = 0, (5-4)^2 = 1. Sum = 2. Variance_A = 2/3 = 0.667. Group B deviations from mean 6: (4-6)^2 = 4, (5-6)^2 = 1, (7-6)^2 = 1, (8-6)^2 = 4. Sum = 10. Variance_B = 10/3 = 3.333.",
        "Compute the standard error of the mean difference (unpooled, Welch formula). SE = sqrt(Variance_A/n_A + Variance_B/n_B) = sqrt(0.667/4 + 3.333/4) = sqrt(0.1667 + 0.8333) = sqrt(1.0) = 1.0.",
        "Compute the Welch t-statistic. t = (mean_A - mean_B) / SE = (4 - 6) / 1.0 = -2.0.",
        "The Satterthwaite degrees of freedom (df*) formula uses each group's contribution to the SE. Numerator = (0.667/4 + 3.333/4)^2 = (0.1667 + 0.8333)^2 = (1.0)^2 = 1.0. Denominator = (0.1667)^2/(n_A-1) + (0.8333)^2/(n_B-1) = 0.02778/3 + 0.69444/3 = 0.00926 + 0.23148 = 0.24074. df* = 1.0 / 0.24074 = 4.15 (approximately 4 degrees of freedom). Note that df* = 4.15 falls between the lower bound of 3 (the smaller group's n-1, here 4-1 = 3 for both groups) and the upper bound of 4+4-2 = 6, as expected.",
        "For a two-sided test at df = 4.15 (using df = 4 as the conservative approximation), the critical t-value at alpha = 0.05 is approximately 2.78. Our |t| = 2.0 < 2.78, so we do not reject the null hypothesis at the 5% level. The observed mean difference of 2 visits per year (Group B higher) is not statistically significant at this sample size.",
        "For comparison, Student's pooled t-test would pool variances: pooled_var = ((4-1)*0.667 + (4-1)*3.333) / (4+4-2) = (2.0 + 10.0) / 6 = 12.0/6 = 2.0. Pooled SE = sqrt(2.0/4 + 2.0/4) = sqrt(0.5 + 0.5) = sqrt(1.0) = 1.0. Student t = -2.0 / 1.0 = -2.0, identical to the Welch statistic because the two groups have equal sample sizes. With df = 6, the critical value is 2.45, so Student's test is also not significant, but its threshold is lower than Welch's 2.78. Welch is the more conservative of the two (df* of about 4 vs 6) because Group B's much higher variance reduces the effective degrees of freedom and raises the critical value."
      ],
      "result": "Group A mean = 16/4 = 4 visits, Group B mean = 24/4 = 6 visits, mean difference = -2. Variance_A = 2/3 = 0.667, Variance_B = 10/3 = 3.333 (ratio approximately 5:1). Welch SE = sqrt(0.1667 + 0.8333) = sqrt(1.0) = 1.0. Welch t = -2.0 / 1.0 = -2.0. Satterthwaite df* is approximately 4. Critical t at df=4, alpha=0.05 (two-sided) = 2.78. |t| = 2.0 < 2.78, so the result is NOT statistically significant. Welch is more conservative than Student here (Welch df = 4 vs Student df = 6) because Group B drives most of the sampling uncertainty — exactly the behavior the correction is designed to produce."
    },
    "prerequisites": [
      "inferential-statistics-foundations",
      "parametric-vs-nonparametric-tests",
      "two-sample-t-test"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Standard two-sample Welch t-test (unpooled, independent groups)",
        "description": "The default application: two independent groups, continuous outcome, no pairing or matching. Uses separate variance estimates per group and Satterthwaite degrees of freedom. Output is a t-statistic, p-value, and (in most software) a confidence interval for the mean difference. This should be the default choice any time Student's t-test would otherwise be considered.",
        "edge_cases": [
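          "At very large n (> 10,000 per group), the Satterthwaite correction to the degrees of freedom no longer matters because both tests use an essentially normal reference distribution, although the standard errors can still differ when group sizes and variances are both unequal; report effect size (Cohen's d) prominently because the p-value can be significant even for trivially small differences.",
          "When both groups have equal sample sizes, Welch and Student produce identical t-statistics; if the sample variances are also equal, the Satterthwaite df equals n1 + n2 - 2 and the p-values coincide as well. There is no cost to using Welch as the default."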
        ],
        "data_source_notes": "Claims: per-patient totals (costs, visit counts, days of therapy) are the natural unit. EHR: lab values, vital signs, and PRO scores. Registry: adjudicated continuous endpoints. Always verify that the unit of analysis is the patient, not the claim line."
      },
      {
        "name": "One-sample Welch-style test (test against a known mean)",
        "description": "When comparing a sample mean to a fixed external benchmark (e.g., a published population norm or a regulatory target), a one-sample t-test is appropriate. This is effectively a special case with n₂ → ∞ and known μ₂; the degrees of freedom reduce to n₁ − 1 and the unequal-variance correction is not applicable. In HEOR this arises when comparing a cohort's mean outcome to a published norm or a clinical target value.",
        "edge_cases": [
          "If the reference value is itself estimated (from a prior study with finite n), a two-sample Welch test on both samples is more appropriate than a one-sample test."
        ],
        "data_source_notes": "Registry and claims: compare cohort mean to a published benchmark. Ensure the benchmark population is comparable; a statistically significant difference from a population norm can reflect case-mix differences rather than treatment effects."
      },
      {
        "name": "Welch t-test on transformed outcomes (log scale)",
        "description": "For right-skewed outcomes (costs, utilization) where the ratio of means is more meaningful than the arithmetic difference, analysts sometimes log-transform the outcome and apply the Welch t-test on the log scale. The test on log-transformed values implicitly tests the geometric mean ratio. Back-transformation of the confidence interval endpoints gives a ratio estimate. This approach is a pragmatic intermediate between the raw Welch t-test and a full gamma GLM.",
        "edge_cases": [
          "Log transformation requires all values > 0; patients with zero costs must be excluded or analysed in a two-part model. Excluding zeros can introduce selection bias if zeros are informative (truly no utilization vs not observed).",
          "The back-transformed CI is for the geometric mean ratio, not the arithmetic mean ratio; this can understate the policy-relevant mean cost difference for payers."
        ],
        "data_source_notes": "Claims costs: winsorize or cap extreme outliers before log-transforming. Report both the log-scale Welch result and a raw-scale estimate (or gamma GLM) to allow comparison."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "parametric-vs-nonparametric-tests",
        "pros_of_this": "Welch's t-test is the specific recommended default from the parametric family for the two-independent-group continuous outcome case; it resolves the equal-variance choice by always using the correct (unpooled) approach.",
        "cons_of_this": "The parent entry covers the full decision tree including nonparametric alternatives, chi-square, ANOVA, and GLM paths; Welch alone does not cover multi-group, categorical, or paired designs.",
        "when_to_prefer": "Use Welch for the two-group continuous case; refer to the parent entry for the full test-selection decision framework."
      },
      {
        "compared_to": "mann-whitney-u-test",
        "pros_of_this": "Welch produces an interpretable mean difference and CI on the original measurement scale — directly useful for budget-impact models and clinical summaries. It is more powerful than Mann-Whitney when the data are approximately normal.",
        "cons_of_this": "Mann-Whitney U is more robust to heavy tails, extreme outliers, and severely skewed data at small-to-moderate n; it does not require the outcome to be even approximately normally distributed. For small n with clearly non-normal data (costs, skewed biomarkers), Mann-Whitney may have better type-I error control.",
        "when_to_prefer": "Use Welch when the mean difference is the target quantity and n is adequate (≥ 30 per group as a rough guide) or when the CLT is in force; use Mann-Whitney as a sensitivity check or primary analysis when data are severely non-normal at small n and a rank-based comparison is scientifically defensible."
      },
      {
        "compared_to": "baseline-characteristics-and-covariate-balance-rwe",
        "pros_of_this": "Welch t-test is the standard tool for the continuous-variable rows of Table 1 in both RCTs and observational studies.",
        "cons_of_this": "Standardized mean differences (SMDs) are preferred over p-values for balance assessment in observational Table 1 because SMDs are not sensitive to sample size; a large imbalanced cohort will show significant Welch p-values for most covariates regardless of clinical relevance.",
        "when_to_prefer": "Report both: Welch t-test p-value for descriptive completeness (especially in RCTs); SMD as the primary balance metric in observational designs where n is large."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Per-patient total costs and utilization counts are the natural unit of analysis. Apply Welch t-test to per-patient annualized totals. For cost outcomes, note that the gamma GLM mean ratio is the preferred primary estimate; report Welch t-test as a secondary or sensitivity analysis. At large n (> 10,000 per arm), report Cohen's d alongside the p-value; significance is virtually guaranteed for any non-trivial difference.",
      "ehr": "Lab values (HbA1c, LDL, blood pressure) and vital signs are approximately normal and well-suited to Welch t-test. Length-of-stay and visit counts are right-skewed and may require Mann-Whitney or log-transformation + Welch. Informative missingness is a concern in EHR; restrict to patients with complete outcome data or model missingness explicitly before running the test.",
      "registry": "Disease-severity scores and adjudicated continuous endpoints are typically the cleanest registry inputs for Welch t-test. Document the adjudication process; measurement error in the outcome attenuates mean differences and widens CIs.",
      "primary": "Survey and PRO instruments: confirm whether the scale is intended to be treated as interval (continuous) or ordinal before applying Welch. For Likert-type scales with few response levels, Mann-Whitney or ordinal regression may be more appropriate.",
      "linked": "Linked claims-EHR cohorts typically have large n; CLT protects the Welch test for non-normal outcomes. Report mean difference with CI and Cohen's d; note that Welch t-test does not adjust for the additional complexity of linked-data measurement error."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import math\nfrom scipy import stats\n\n# ── Motivating dataset: outpatient visits (n=4 per group) ──\ngroup_a = [3, 4, 4, 5]   # mean=4, variance=0.667 (tightly clustered)\ngroup_b = [4, 5, 7, 8]   # mean=6, variance=3.333 (more spread)\n\n# ── 1. Welch t-test (REQUIRED: equal_var=False; SciPy default is Student's pooled) ──\nt_welch, p_welch = stats.ttest_ind(group_a, group_b, equal_var=False)\nmean_a = sum(group_a) / len(group_a)\nmean_b = sum(group_b) / len(group_b)\nmean_diff = mean_a - mean_b\nprint(f\"Welch t-test:  t={t_welch:.4f}, p={p_welch:.4f}\")\nprint(f\"Mean A={mean_a:.3f}, Mean B={mean_b:.3f}, mean difference={mean_diff:.3f}\")\nprint(\"NOTE: equal_var=False is required for Welch; omitting it gives Student's t-test.\\n\")\n\n# ── 2. Manual 95% CI for the mean difference ──\nn_a, n_b = len(group_a), len(group_b)\nvar_a = sum((x - mean_a) ** 2 for x in group_a) / (n_a - 1)\nvar_b = sum((x - mean_b) ** 2 for x in group_b) / (n_b - 1)\nse = math.sqrt(var_a / n_a + var_b / n_b)\n# Satterthwaite df\ndf_star = (var_a / n_a + var_b / n_b) ** 2 / (\n    (var_a / n_a) ** 2 / (n_a - 1) + (var_b / n_b) ** 2 / (n_b - 1)\n)\nt_crit = stats.t.ppf(0.975, df=df_star)\nci_lower = mean_diff - t_crit * se\nci_upper = mean_diff + t_crit * se\nprint(f\"Satterthwaite df*: {df_star:.4f}\")\nprint(f\"SE of mean difference: {se:.4f}\")\nprint(f\"95% CI for (A - B): [{ci_lower:.4f}, {ci_upper:.4f}]\")\n\n# ── 3. Effect size: Cohen's d (using pooled SD, standard convention) ──\npooled_sd = math.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))\ncohens_d = mean_diff / pooled_sd\nprint(f\"\\nCohen's d (A - B): {cohens_d:.4f}\")\nprint(\"Interpretation benchmarks: |d| around 0.2 small, 0.5 medium, 0.8 large (Cohen 1988).\\n\")\n\n# ── 4. 
Contrast with Student's pooled t-test (DO NOT USE as default — shown for comparison) ──\nt_student, p_student = stats.ttest_ind(group_a, group_b, equal_var=True)\nprint(f\"Student's t-test (equal_var=True, pooled — NOT recommended as default):\")\nprint(f\"  t={t_student:.4f}, p={p_student:.4f}, df={n_a + n_b - 2} (fixed pooled df)\")\nprint(f\"Welch is more conservative here (df*={df_star:.2f} vs {n_a+n_b-2}) due to unequal variances.\")",
        "description": "Welch's t-test using scipy.stats.ttest_ind with equal_var=False (the required flag:\nSciPy defaults to Student's pooled test if this flag is omitted). Demonstrates\nthe test on the motivating dataset, computes the confidence interval manually, computes\nCohen's d, and contrasts with Student's pooled t-test to show the df and p-value\ndifference. No external dependencies beyond scipy.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "# ── Motivating dataset ──\ngroup_a <- c(3, 4, 4, 5)   # mean = 4, var = 0.667\ngroup_b <- c(4, 5, 7, 8)   # mean = 6, var = 3.333\n\n# ── 1. Welch t-test (R default: var.equal = FALSE) ──\n# No extra argument needed — R already uses Welch as the default.\nwelch <- t.test(group_a, group_b)   # var.equal = FALSE is the default\nprint(welch)\ncat(sprintf(\"\\nMean A = %.3f, Mean B = %.3f, difference = %.3f\\n\",\n            mean(group_a), mean(group_b), mean(group_a) - mean(group_b)))\ncat(sprintf(\"Satterthwaite df* = %.4f\\n\", welch$parameter))\ncat(sprintf(\"95%% CI for (A - B): [%.4f, %.4f]\\n\",\n            welch$conf.int[1], welch$conf.int[2]))\n\n# ── 2. Student's pooled t-test for contrast (NOT the recommended default) ──\nstudent <- t.test(group_a, group_b, var.equal = TRUE)\ncat(sprintf(\"\\nStudent's t-test (var.equal = TRUE, pooled — for comparison only):\\n\"))\ncat(sprintf(\"  t = %.4f, df = %.1f, p = %.4f\\n\",\n            student$statistic, student$parameter, student$p.value))\ncat(sprintf(\"  df = %d (fixed pooled) vs Welch df* = %.2f (Satterthwaite)\\n\",\n            length(group_a) + length(group_b) - 2, welch$parameter))\n\n# ── 3. Cohen's d (pooled SD) ──\nn_a <- length(group_a); n_b <- length(group_b)\npooled_sd <- sqrt(((n_a - 1) * var(group_a) + (n_b - 1) * var(group_b)) / (n_a + n_b - 2))\ncohens_d  <- (mean(group_a) - mean(group_b)) / pooled_sd\ncat(sprintf(\"\\nCohen's d (A - B) = %.4f\\n\", cohens_d))\n\n# ── 4. Note on the var.equal argument ──\n# In R, always use t.test(x, y) — the default IS Welch. Only set var.equal = TRUE\n# if you have an explicit substantive reason to assume equal variances (rare in HEOR).",
        "description": "Welch's t-test in base R using t.test(), which uses Welch (var.equal = FALSE) by\ndefault — no extra argument needed. Shows explicit extraction of the CI, t-statistic,\ndegrees of freedom, and p-value from the test object. Contrasts with Student's pooled\nversion (var.equal = TRUE) for comparison. Computes Cohen's d manually.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* ── Create motivating dataset ── */\ndata work.visits;\n  input patient_id $ group $ visits;\n  datalines;\nA1 A 3\nA2 A 4\nA3 A 4\nA4 A 5\nB1 B 4\nB2 B 5\nB3 B 7\nB4 B 8\n;\nrun;\n\n/* ── Welch t-test via PROC TTEST ──\n   SAS prints BOTH rows automatically:\n     \"Pooled\"         -> Student's equal-variance t-test (pooled df = n1+n2-2)\n     \"Satterthwaite\"  -> Welch t-test (Welch-Satterthwaite adjusted df)\n   Always use the SATTERTHWAITE row for observational two-group comparisons.\n   The COCHRAN option adds the Cochran-Cox approximation (useful for extreme variance ratios).\n*/\nproc ttest data=work.visits cochran;\n  class group;        /* variable identifying the two groups (A vs B)        */\n  var visits;         /* continuous outcome variable                          */\n  /* To extract the Satterthwaite row programmatically, add:\n     ods output ttests = work.ttest_out;\n     Then filter: where method = 'Satterthwaite';                            */\nrun;\n\n/* ── Note on reading the output ──\n   The \"Equality of Variances\" section prints a folded F-test.\n   Its p-value should NOT be used to decide which t-test row to report.\n   Best practice: always report the Satterthwaite (Welch) row.\n   If variances are actually equal, Welch and Student give nearly identical results.\n   If variances differ, Welch is the correct answer.                         
*/\n\n/* ── ODS output extraction for programmatic downstream use ── */\nods output ttests = work.ttest_results;\nproc ttest data=work.visits;\n  class group;\n  var visits;\nrun;\nods output close;\n\n/* Extract Welch (Satterthwaite) result */\ndata work.welch_result;\n  set work.ttest_results;\n  where method = 'Satterthwaite';\n  label method   = \"Test type (Satterthwaite = Welch)\"\n        tvalue   = \"t-statistic\"\n        df       = \"Degrees of freedom (Welch-Satterthwaite)\"\n        probt    = \"Two-tailed p-value\";\nrun;\nproc print data=work.welch_result noobs label; run;",
        "description": "Welch's t-test in SAS using PROC TTEST. SAS automatically prints both the Satterthwaite\n(Welch) and Pooled (Student) rows — the analyst must read the Satterthwaite row for\nunequal-variance inference. The COCHRAN option provides an additional approximation\nfor extreme variance ratios. PROC TTEST also outputs an equality-of-variances F-test\nin the same output — this is diagnostic only; do not use it as a gating criterion to\nchoose between the two t-test rows.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Q[\"Two independent groups,<br/>continuous outcome?\"] --> EV[\"Are variances equal?\"]\n  EV -->|\"Do NOT test first:<br/>just use Welch\"| Welch[\"Welch's t-test<br/>(equal_var=False in Python;<br/>default in R;<br/>Satterthwaite row in SAS)\"]\n  Welch --> Result[\"Output: t-stat, Satterthwaite df*,<br/>p-value, mean diff, 95% CI\"]\n  Result --> Report[\"Report: mean difference + CI<br/>+ Cohen's d + p-value\"]\n  Result --> Large[\"Large n (>10k)?\"]\n  Large -->|Yes| EffSize[\"Focus on Cohen's d and CI;<br/>even trivial differences<br/>can reach significance\"]\n  Large -->|No| Skew[\"Skewed data at small n?\"]\n  Skew -->|Yes| MWU[\"Consider Mann-Whitney U<br/>as primary or sensitivity analysis\"]\n  Skew -->|No| Done[\"Report Welch result as primary\"]\n  Report --> Confounded[\"Observational comparison?\"]\n  Confounded -->|Yes| Causal[\"Note: Welch is DESCRIPTIVE only;<br/>route to regression/PS methods<br/>for causal inference\"]\n  Confounded -->|No| RCT[\"RCT or balanced design:<br/>Welch t-test is valid<br/>for causal mean difference\"]",
        "caption": "Decision flow for applying Welch's t-test in two-group continuous comparisons. The key branch: never run a variance test to decide between Welch and Student — always use Welch. The large-n and skewed-data branches show when effect sizes and nonparametric alternatives become more important.",
        "alt_text": "Flowchart beginning at \"two independent groups, continuous outcome\" that routes directly to Welch's t-test without a variance pre-test, then branches on large n (report Cohen's d prominently), skewed small-n data (consider Mann-Whitney), and observational design (note descriptive-only status and route to regression for causal inference).",
        "source_type": "illustrative",
        "source_citations": [
          "delacre-2017"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "requires",
        "target_slug": "inferential-statistics-foundations",
        "notes": "Welch's t-test is a specific hypothesis test; understanding sampling distributions, standard errors, confidence intervals, and the null hypothesis framework is prerequisite."
      },
      {
        "relation_type": "requires",
        "target_slug": "parametric-vs-nonparametric-tests",
        "notes": "The parent concept covering when to use parametric vs nonparametric tests and the full test-selection decision tree; Welch's t-test is the recommended default for the continuous two-group parametric cell of that framework."
      },
      {
        "relation_type": "see_also",
        "target_slug": "two-sample-t-test",
        "notes": "The pooled (Student's) version that Welch replaces as the default. Student's t-test assumes equal variances and uses a pooled degrees-of-freedom formula; Welch relaxes both. When variances are equal and n is balanced, results are nearly identical; when variances differ, Welch is correct and Student's is not."
      },
      {
        "relation_type": "see_also",
        "target_slug": "mann-whitney-u-test",
        "notes": "The nonparametric alternative for two independent groups. Mann-Whitney answers a different question (stochastic dominance, not mean difference) and should be paired with the Hodges-Lehmann estimator for an effect estimate. Appropriate as a sensitivity check or primary analysis when data are severely non-normal at small n."
      },
      {
        "relation_type": "see_also",
        "target_slug": "baseline-characteristics-and-covariate-balance-rwe",
        "notes": "Welch t-test is commonly used for the continuous-variable rows of Table 1; however, standardized mean differences (SMDs) are the preferred balance metric for observational studies because they are not inflated by large sample sizes."
      }
    ],
    "aliases": [
      "unequal variances t-test",
      "Welch-Satterthwaite",
      "Welch's unequal variances t-test",
      "Welch t-test",
      "Satterthwaite t-test",
      "unpooled t-test"
    ],
    "complexity": "foundational",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "wilcoxon-signed-rank-test",
    "name": "Wilcoxon Signed-Rank Test",
    "short_definition": "A nonparametric test for paired or pre-post data that ranks the absolute values of the within-pair differences, reattaches their signs, and tests whether the resulting signed ranks are symmetrically distributed around zero — the nonparametric counterpart of the paired t-test, appropriate when the distribution of differences is non-normal or the sample is too small for the Central Limit Theorem to protect inference on means.",
    "long_description": "**What the Wilcoxon signed-rank test is and what it does**\n\nThe Wilcoxon signed-rank test, introduced by Frank Wilcoxon in 1945, tests whether the\nmedian (more precisely, the pseudomedian) of paired differences is zero. It is the\nstandard nonparametric alternative to the paired t-test when the distribution of\nwithin-pair differences is non-normal or when sample size is too small for the Central\nLimit Theorem to protect inference on the mean difference. The test operates on the paired\ndifferences themselves — not on the raw observations — and constructs its test statistic\nfrom the ranks of those differences' absolute values.\n\nThe mechanics proceed in four steps. First, compute the signed difference d_i = x_i2 - x_i1\nfor each pair i. Second, rank the absolute values |d_i| from smallest (rank 1) to largest\n(rank n), ignoring any pair with d_i = 0 by the standard Wilcoxon convention (more on this\nbelow). Third, reattach the sign of each original difference to its rank, yielding signed\nranks. Fourth, compute W+ (sum of positive signed ranks) and W- (sum of negative signed\nranks). The test statistic is conventionally reported as W = min(W+, W-). Under the null\nhypothesis that differences are symmetrically distributed around zero, W+ and W- should\neach be close to n(n+1)/4, and their total W+ + W- = n(n+1)/2 always holds as a useful\narithmetic check.\n\n**The symmetry assumption — the one that often goes unacknowledged**\n\nThe Wilcoxon signed-rank test is frequently described as \"assumption-free\" or as requiring\nonly that the differences be independent. This is incorrect and the misunderstanding has\nreal consequences in practice. The test requires that the distribution of differences be\n*symmetric around its median*, not merely that it be continuous. 
Under the null hypothesis\nthe symmetry assumption is automatically satisfied (a distribution symmetric around zero\nsatisfies any symmetry claim), but the test's power properties and interpretation under the\nalternative depend on symmetry holding. When the difference distribution is highly skewed —\nfor example, most patients improve modestly and a few improve dramatically — the signed-rank\ntest can reject for a reason other than a shift in location, because skewness itself\ndisrupts the rank-sign balance the test exploits.\n\nIn RWE applications, cost changes (post - pre) are often right-skewed: most patients have\nmoderate cost increases and a few have catastrophic spikes. Under this asymmetry, a\nsignificant Wilcoxon result does not cleanly identify a median shift — it may be detecting\nthe skewness. The safe interpretation in this setting is that the two measurement occasions\nproduce different rank patterns, not necessarily a symmetric shift in location.\n\n**Zeros and ties: the Pratt versus Wilcoxon method**\n\nIn real data, some paired differences will be exactly zero (no change). The original\nWilcoxon (1945) convention discards zero differences and reduces n accordingly. An\nalternative due to Pratt (1959) retains all pairs — zero differences are ranked but their\ncontribution to W+ and W- cancels (they carry rank but no sign). In scipy.stats.wilcoxon,\nthe `zero_method` argument controls this: `zero_method='wilcoxon'` (the default) drops zeros;\n`zero_method='pratt'` retains them. In R's wilcox.test(paired=TRUE), zeros are dropped by\ndefault. The choice matters when the fraction of zeros is non-trivial: dropping zeros reduces\nn and can inflate power if many zeros exist because of a true null; Pratt's method avoids\nthis inflation. Document which method was used in the analysis SAP.\n\nTies in the absolute differences (two pairs with the same |d|) are handled by averaging the\nranks they would occupy, exactly as in the Mann-Whitney U test. 
Large numbers of ties\nreduce the information content of the ranks; when ties are extensive (> 20% of pairs) the\nnormal approximation for the p-value uses a tie-corrected variance. Exact p-values are\navailable in small samples and are recommended when n < 25; for n ≥ 25 the normal\napproximation is adequate unless ties are heavy.\n\n**The pseudomedian and the Hodges-Lehmann estimator for the Wilcoxon signed-rank test**\n\nA key practical point that is frequently overlooked: the effect estimate that pairs naturally\nwith the Wilcoxon signed-rank test is the *pseudomedian* of the differences — not the\nmedian of the differences, and not the difference of the medians. The pseudomedian of a\nsample is the median of all pairwise averages (d_i + d_j)/2 for i ≤ j, including d_i with\nitself (the Walsh averages). For a symmetric distribution the pseudomedian equals the\nmedian, but for a skewed difference distribution the two can diverge substantially. The\nHodges-Lehmann estimator implements the pseudomedian and is the standard companion estimate\nfor the Wilcoxon signed-rank test; in R, wilcox.test(paired=TRUE, conf.int=TRUE) returns it\nautomatically. Reporting only a Wilcoxon p-value without this effect estimate leaves the\nreader unable to assess the magnitude or direction of the change.\n\nThe distinction matters operationally: if the analysis question is \"did patients improve on\naverage?\" the mean difference (paired t-test or a mixed model) is the right target. If the\nquestion is \"did a typical patient improve?\" the median or pseudomedian of the differences\nis appropriate. 
In HEOR, cost-effectiveness models and budget-impact analyses need the mean\nchange, not the pseudomedian, so the Wilcoxon signed-rank test is usually a sensitivity\ncheck alongside a model-based mean-change estimator for cost outcomes.\n\n**The sign test as the weaker, assumption-free fallback**\n\nWhen the symmetry assumption for the signed-rank test is clearly violated and the analyst\nwants a strictly assumption-free test for paired data, the sign test is the fallback. The\nsign test counts only whether each d_i is positive or negative (discarding magnitude\nentirely) and tests whether positive differences are equally likely as negative ones under\nthe null. Because it discards rank information, the sign test is markedly less powerful\nthan the Wilcoxon signed-rank test; it should be reserved for situations where the symmetry\nassumption is demonstrably implausible and the analyst cannot use a model-based approach.\n\n**RWE and HEOR applications**\n\n*Pre-post healthcare utilization and costs*: When the same patients are observed before and\nafter a treatment initiation, procedure, or policy change and the outcome is an inherently\nskewed measure (emergency department visits, total allowed costs, inpatient days), the\npaired t-test on the within-patient difference can be distorted by outliers. The Wilcoxon\nsigned-rank test provides a robust sensitivity check. However, as noted above, if the mean\ncost change is the decision-relevant quantity (for a budget-impact model), the Wilcoxon\ntest is supplementary — a gamma GLM or bootstrap mean-difference estimator should be the\nprimary analysis.\n\n*Patient-reported outcomes (PROs) and utility scores*: EQ-5D utility scores, SF-6D scores,\nand disease-specific PROs often violate normality: they are bounded (0 to 1 or 0 to 100),\nmay cluster near boundary values, and have pre-post differences that are neither continuous\nnor bell-shaped. 
The Wilcoxon signed-rank test is the standard nonparametric paired test\nin this context and is frequently required alongside the parametric test in health technology\nassessment (HTA) submissions.\n\n*Matched cohort designs*: When 1:1 matched pairs are constructed in an observational study\n(propensity-score matched, exact-matched, or coarsened exact matched), the paired structure\nmust be respected. A two-sample t-test or Mann-Whitney on matched pairs ignores the pairing\nand loses efficiency. The Wilcoxon signed-rank test correctly exploits the paired design\nby operating on within-pair differences, exactly as in the pre-post setting.\n\n**Pseudomedian-of-differences ≠ difference-of-medians — why this matters**\n\nIn practice, analysts sometimes report the median change as \"median(after) - median(before)\"\nwhich is the difference of two group medians. This is not what the Wilcoxon signed-rank test\ntests, and it is not the pseudomedian of the differences. Consider a simple example: if half\nthe patients improve from 4 to 8 (+4) and half improve from 5 to 9 (+4), the pre-score median\nis 4.5, the post-score median is 8.5, and every difference is +4, so the difference of medians\nand the pseudomedian of differences are both 4.0 and they agree. With uneven changes they\ndiverge: for three patients with pre-scores 1, 5, 9 and post-scores 10, 6, 8, the differences\nare +9, +1, -1; the difference of medians is 8 - 5 = 3, the median of the differences is 1,\nand the pseudomedian (the median of the six Walsh averages -1, 0, 1, 4, 5, 9) is 2.5. 
The\ncorrect companion to the Wilcoxon test is always the Hodges-Lehmann pseudomedian; report it.\n\n**Pros, cons, and trade-offs**\n\n*Pros*:\n- Valid when the paired t-test's normality assumption is violated and n is small.\n- Robust to outliers in the differences: an extreme outlier contributes at most rank n rather\n  than distorting the mean as it would in a paired t-test.\n- Exact p-values are available for small n (< 25), making it reliable in pilot or registry\n  sub-studies with limited sample sizes.\n- The Hodges-Lehmann pseudomedian with confidence interval is a principled effect estimate\n  that pairs naturally with the test.\n- Widely expected by regulators (FDA, EMA, NICE) as a sensitivity check alongside a\n  parametric primary analysis.\n\n*Cons*:\n- The symmetry assumption is often unacknowledged and can be violated in practice.\n- Lower power than the paired t-test when the normality assumption holds (the paired t-test\n  uses the actual magnitude of differences, the signed-rank test uses only their ranks).\n- The pseudomedian is not the effect estimate that budget-impact or cost-effectiveness\n  models require — they need the mean change.\n- Cannot be directly extended to covariate adjustment. 
Regression models on the within-pair\n  differences (ANCOVA or a mixed model) are needed when baseline imbalance or confounders\n  must be controlled.\n- Zero differences require a decision (Wilcoxon vs Pratt method) that should be\n  pre-specified in the SAP.\n\n*Trade-offs vs alternatives*:\n- Against the paired t-test: prefer the Wilcoxon signed-rank test when n < 30 and the\n  difference distribution is clearly non-normal or when outliers are expected; prefer the\n  paired t-test (or a mixed model) when the mean change is the target estimand.\n- Against the sign test: the Wilcoxon signed-rank test is almost always more powerful than\n  the sign test; use the sign test only if the symmetry assumption is clearly untenable.\n- Against a regression/GLM on differences: when covariate adjustment is needed or the\n  outcome is a count or cost, a regression approach (Poisson, gamma, or negative binomial\n  with a paired-design term) is superior to either the Wilcoxon test or the paired t-test.\n\n**When to use**\n\n- Paired or pre-post continuous outcomes (or ordinal outcomes with many levels) where the\n  paired t-test normality assumption is in doubt.\n- Small paired samples (n < 30) where the Central Limit Theorem cannot be assumed to protect\n  the paired t-test.\n- Matched-pair observational cohorts where within-pair differences are the natural estimand.\n- Patient-reported outcomes and utility scores with non-normal or bounded difference\n  distributions.\n- As a pre-specified sensitivity analysis alongside a parametric primary test for any paired\n  continuous outcome in a regulatory submission or HTA dossier.\n- When the question is whether a typical patient improved (pseudomedian), not whether the\n  average patient improved (mean).\n\n**When NOT to use**\n\n- *Independent groups*: the Wilcoxon signed-rank test requires paired data; for two\n  independent groups use the Mann-Whitney U test (Wilcoxon rank-sum).\n- *When the mean change is the estimand*: if the 
result will feed a cost-effectiveness model,\n  budget-impact model, or any analysis that aggregates to population totals, the mean change\n  is needed, not the pseudomedian. Use the paired t-test, a GLM on differences, or a\n  bootstrap mean estimator for the primary analysis.\n- *Count or zero-inflated pre-post outcomes*: for outcomes like number of hospitalizations\n  or number of prescriptions (discrete counts, often with many zeros), a model-based approach\n  (negative binomial or hurdle model with individual as random effect) is more appropriate\n  than a rank test, because the rank test cannot account for the data-generating mechanism.\n- *More than two timepoints*: the Wilcoxon signed-rank test compares a single pair. For\n  repeated measures across more than two time points use a linear mixed model, a Friedman\n  test (the nonparametric k-sample extension), or a GEE.\n- *When covariate adjustment is needed*: rank tests cannot incorporate baseline covariates\n  or confounders. Route to ANCOVA, a mixed effects model on differences, or propensity-score\n  methods if adjustment is required before or after the paired comparison.\n- *When the symmetry assumption is demonstrably violated and the sample is large*: if the\n  difference distribution is known to be highly skewed (e.g., costs, claims counts) and n is\n  large enough that a model-based approach is feasible, prefer a GLM over the signed-rank\n  test, because the test may detect skewness rather than a location shift.\n\n**Interpreting the output**\n\nIn the worked example, six pain patients were assessed before and after treatment. The sum\nof positive signed ranks is W+ = 15 and the sum of negative signed ranks is W− = 6; the\ncheck W+ + W− = 21 = 6 × 7 / 2 confirms the arithmetic. The test statistic is W = min(15,\n6) = 6. The exact two-sided p-value at n = 6 is approximately 0.44 (one-sided 14/64 ≈ 0.22,\ndoubled). 
The Hodges-Lehmann\npseudomedian of the within-patient differences is 1.5 pain-score points of\nreduction.\n\n*(1) Formal interpretation.* Under the null hypothesis that the paired differences are\nsymmetrically distributed around zero, a W statistic of 6 or smaller has a two-sided\nprobability of approximately 0.44, well above alpha = 0.05. The test does not reject the\nnull. At n = 6, the exact distribution of W has very limited resolution: only 2^6 = 64\nsign assignments exist, so the test has low power to detect moderate effects. The\npseudomedian of 1.5 points indicates a positive directional trend, but the\nsample is far too small to confirm the signal statistically.\n\n*(2) Practical interpretation.* Four of the six patients reported less pain and two reported\nmore, and the Hodges-Lehmann pseudomedian of 1.5 points is the appropriate effect estimate\nto report alongside this test — it is the pseudomedian of the within-patient differences,\nnot the difference between the pre- and post-period medians. The non-significant p-value at\nn = 6 is better read as \"inconclusive\" than \"no effect.\" The symmetry assumption required\nby the signed-rank test should be assessed; if the difference distribution is strongly\nskewed (most patients improve slightly, a few improve dramatically), a model-based analysis\nmay be more appropriate, as the signed-rank test can detect skewness rather than a pure\nlocation shift.",
    "primary_category": "Inferential_Statistics",
    "tags": [
      "statistics",
      "primitive",
      "hypothesis-testing",
      "nonparametric",
      "paired-data",
      "rank-based",
      "signed-rank",
      "pre-post",
      "pro-outcomes",
      "wilcoxon"
    ],
    "applies_to_study_types": [
      "cohort_retrospective",
      "cohort_prospective",
      "rct",
      "cross_sectional",
      "active_comparator_new_user",
      "descriptive_analysis",
      "claims_analysis",
      "ehr_study",
      "registry_study"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "primary",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.2307/3001968",
        "url": "https://doi.org/10.2307/3001968",
        "citation_text": "Wilcoxon F. Individual comparisons by ranking methods. Biometrics Bulletin. 1945;1(6):80-83.",
        "year": 1945,
        "authors_short": "Wilcoxon",
        "notes": "The original paper introducing both the rank-sum test (for independent samples) and the signed-rank test (for paired samples). Wilcoxon's 1945 paper is one of the most-cited methodological papers in statistics; the signed-rank test has become the canonical nonparametric alternative to the paired t-test across clinical and outcomes research."
      },
      {
        "role": "explain",
        "doi": "10.1080/00031305.1981.10479327",
        "url": "https://doi.org/10.1080/00031305.1981.10479327",
        "citation_text": "Conover WJ, Iman RL. Rank transformations as a bridge between parametric and nonparametric statistics. The American Statistician. 1981;35(3):124-129.",
        "year": 1981,
        "authors_short": "Conover & Iman",
        "notes": "Provides the conceptual bridge between parametric and nonparametric tests via rank transformations. Shows that rank-based tests like the Wilcoxon signed-rank test are equivalent to applying a parametric test (t-test) to the ranks of the data — a perspective that clarifies assumptions, power, and the relationship to regression."
      },
      {
        "role": "explain",
        "doi": "10.20982/tqmp.04.1.p013",
        "url": "https://doi.org/10.20982/tqmp.04.1.p013",
        "citation_text": "Nachar N. The Mann-Whitney U: A test for assessing whether two independent samples come from the same distribution. Tutorials in Quantitative Methods for Psychology. 2008;4(1):13-20.",
        "year": 2008,
        "authors_short": "Nachar",
        "notes": "Although focused on the Mann-Whitney U test (the unpaired sibling), this accessible paper explains the shared logic of Wilcoxon-family rank tests — including the rank construction, null hypothesis interpretation, and assumption structure — in a way directly applicable to the signed-rank test."
      },
      {
        "role": "demonstrate",
        "doi": "10.1186/1471-2288-12-78",
        "url": "https://doi.org/10.1186/1471-2288-12-78",
        "citation_text": "Fagerland MW. t-tests, non-parametric tests, and large studies — a paradox of statistical practice? BMC Medical Research Methodology. 2012;12:78.",
        "year": 2012,
        "authors_short": "Fagerland",
        "notes": "Demonstrates the paradox that the t-test's normality assumption is least vulnerable in large samples (where the CLT protects inference on means) — exactly when analysts most often reach for nonparametric tests. Supports the principle that the Wilcoxon signed-rank test is most valuable in small paired samples; at large n, a paired t-test or GLM on differences is often the better primary method."
      },
      {
        "role": "use",
        "doi": "10.1186/1471-2288-5-35",
        "url": "https://doi.org/10.1186/1471-2288-5-35",
        "citation_text": "Vickers AJ. Parametric versus non-parametric statistics in the analysis of randomized trials with non-normally distributed data. BMC Medical Research Methodology. 2005;5:35.",
        "year": 2005,
        "authors_short": "Vickers",
        "notes": "Simulation study showing that non-parametric tests (including the Wilcoxon signed-rank test for paired designs) rarely outperform properly specified parametric models in randomized trials, even with non-normal data — supporting the principle that Wilcoxon is a robustness check rather than a universally superior alternative."
      }
    ],
    "plain_language_summary": "The Wilcoxon signed-rank test answers the question \"did the same patients change between two time points?\" without assuming their measurements follow a bell-curve distribution. It works by computing each patient's before-minus-after difference, ranking those differences from smallest to largest by size (ignoring direction for now), and then checking whether the big ranks tend to cluster on the positive or negative side. Four verified DOIs anchor the entry: Wilcoxon 1945 (original method), Conover & Iman 1981 (rank-based perspective), Nachar 2008 (practical guide), and Fagerland 2012 (large-sample paradox). One important caveat: the test is not assumption-free — it assumes the distribution of differences is roughly symmetric, so it can give misleading results when skewed outcomes like healthcare costs are involved.",
    "key_terms": [
      {
        "term": "signed rank",
        "definition": "The rank of a paired difference's absolute value, given the same sign (positive or negative) as the original difference — rank 3 becomes +3 if the change was an improvement, -3 if it was a worsening."
      },
      {
        "term": "zero difference handling",
        "definition": "What to do when a patient shows no change between the two time points; the original Wilcoxon method discards those pairs entirely, while the Pratt method keeps them but assigns them a rank without a sign, affecting how the test statistic is computed."
      },
      {
        "term": "symmetry assumption",
        "definition": "The requirement that the distribution of within-patient differences is mirror-image symmetric around its center; the Wilcoxon signed-rank test can detect the wrong thing if this assumption is badly violated, such as when most patients improve a little but a few improve enormously."
      },
      {
        "term": "pseudomedian",
        "definition": "The effect estimate that pairs with the Wilcoxon signed-rank test — it is the median of all possible pairwise averages of the differences, not the simple median of the differences; the two numbers are equal only when the difference distribution is symmetric."
      }
    ],
    "worked_example": {
      "scenario": "A HEOR analyst is evaluating a 12-week pain management program in six patients. Each patient rates their pain on a 0-to-10 scale at baseline (week 0) and at week 12. The analyst wants to test whether pain scores changed significantly, but with only six patients the normality assumption for a paired t-test cannot be verified, so the Wilcoxon signed-rank test is chosen. All six differences have distinct absolute values and none equals zero, giving a clean illustration of the core mechanics.",
      "dataset": {
        "caption": "Pain scale ratings (0 = no pain, 10 = worst pain) for six patients before and after a 12-week pain management program. Difference = baseline score minus week-12 score; a positive difference means improvement (pain decreased).",
        "columns": [
          "patient_id",
          "baseline",
          "week_12",
          "difference"
        ],
        "rows": [
          [
            "P1",
            7,
            2,
            5
          ],
          [
            "P2",
            6,
            3,
            3
          ],
          [
            "P3",
            4,
            6,
            -2
          ],
          [
            "P4",
            8,
            2,
            6
          ],
          [
            "P5",
            5,
            4,
            1
          ],
          [
            "P6",
            3,
            7,
            -4
          ]
        ]
      },
      "steps": [
        "Step 1 — compute differences. Difference = baseline minus week-12: P1 = 7-2 = 5, P2 = 6-3 = 3, P3 = 4-6 = -2, P4 = 8-2 = 6, P5 = 5-4 = 1, P6 = 3-7 = -4. Positive values mean pain decreased (improvement); negative values mean pain increased (worsening). No difference equals zero, so no pairs need to be dropped.",
        "Step 2 — rank the absolute differences from smallest to largest. |P5|=1 gets rank 1, |P3|=2 gets rank 2, |P2|=3 gets rank 3, |P6|=4 gets rank 4, |P1|=5 gets rank 5, |P4|=6 gets rank 6. All absolute values are distinct so no averaging of tied ranks is needed.",
        "Step 3 — reattach signs. P1: +5, P2: +3, P3: -2, P4: +6, P5: +1, P6: -4. Positive signed ranks go to W+; negative signed ranks (by absolute value) go to W-.",
        "Step 4 — compute W+ and W-. W+ = ranks of patients who improved = 5 + 3 + 6 + 1 = 15. W- = ranks of patients who worsened = 2 + 4 = 6. Verify: W+ + W- = 15 + 6 = 21 = 6*7/2 = 21. The check passes.",
        "Step 5 — the test statistic is W = min(W+, W-) = min(15, 6) = 6. A small W means most of the large ranks cluster on one side (here, the positive/improvement side), which is evidence against the null hypothesis of no change.",
        "The exact two-sided p-value for W = 6 with n = 6 is 0.219, which does not reach the conventional alpha = 0.05 threshold. With only six patients, the test has low power; the four-to-two split in improvement direction is suggestive but not statistically significant at this sample size."
      ],
      "result": "W+ = 5 + 3 + 6 + 1 = 15, W- = 2 + 4 = 6, W+ + W- = 15 + 6 = 21 = 6*7/2 = 21 (check passes). Test statistic W = min(15, 6) = 6. Four of six patients improved (positive differences) and those improvements carried the larger ranks (5, 3, 6, 1), while the two worsening patients had smaller absolute ranks (2, 4). The pseudomedian of the six differences is the median of all 21 Walsh averages; the Hodges-Lehmann estimate equals 2.5 points of pain reduction. Exact two-sided p ≈ 0.22 — not significant at alpha = 0.05 with n = 6, illustrating that small samples require larger effect sizes to achieve significance."
    },
    "prerequisites": [
      "inferential-statistics-foundations",
      "parametric-vs-nonparametric-tests",
      "paired-t-test"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Standard paired analysis (no zeros, no ties)",
        "description": "The clean case: n paired differences are computed, all are non-zero and distinct in absolute value. Ranks are assigned 1 through n without averaging. W+ and W- are computed from signed ranks; W = min(W+, W-) is compared to exact critical values (n < 25) or the normal approximation (n ≥ 25). The Hodges-Lehmann pseudomedian and its confidence interval are reported alongside the p-value.",
        "edge_cases": [
          "At n < 10, exact p-values are strongly recommended; normal approximation has meaningful error at this sample size.",
          "Report the direction of the effect (which group/time is larger) alongside the p-value; W alone does not convey direction."
        ],
        "data_source_notes": "Claims pre-post: compute per-patient cost or utilization differences over symmetric pre/post windows. EHR: lab values or PRO scores at two encounters. Registry: score at enrollment vs follow-up."
      },
      {
        "name": "Handling zeros (Wilcoxon vs Pratt method)",
        "description": "When some within-pair differences are zero (patient did not change), a method must be pre-specified in the SAP. The Wilcoxon convention drops zero-difference pairs and reduces n; the Pratt method ranks all pairs including zeros (zeros receive ranks but their signed contribution cancels). Pratt's method is more conservative when zeros are common. In scipy.stats.wilcoxon, use zero_method='wilcoxon' or zero_method='pratt'. In R's wilcox.test, zeros are dropped by default; no built-in option for Pratt — implement manually or use a package.",
        "edge_cases": [
          "If more than 20% of pairs have zero difference, the method choice materially affects results; report both and discuss in the sensitivity section.",
          "In PRO data, many patients may report no change (e.g., stable chronic disease); this is data, not a nuisance — consider whether the sign test or a model-based approach is more appropriate."
        ],
        "data_source_notes": "Claims: zero utilization changes (e.g., zero ED visits in both pre and post periods) are common; pre-specify the zero-handling convention before unblinding."
      },
      {
        "name": "Large samples with normal approximation and continuity correction",
        "description": "For n ≥ 25, the W statistic is approximately normally distributed. The z-statistic is (W - mu_W) / sigma_W where mu_W = n(n+1)/4 and sigma_W = sqrt(n(n+1)(2n+1)/24), with a tie correction applied to sigma_W when ties are present. A continuity correction of 0.5 improves the approximation slightly. Both scipy.stats.wilcoxon (mode='approx') and wilcox.test in R use the normal approximation at large n.",
        "edge_cases": [
          "At large n (> 1,000 per group), the Wilcoxon test will detect trivially small location shifts; always report the Hodges-Lehmann estimate and a minimally important difference threshold alongside the p-value.",
          "Heavy ties reduce the effective information in the ranks; the tie-corrected variance should be used, and the analyst should consider whether a model-based approach is more efficient."
        ],
        "data_source_notes": "Linked claims-EHR datasets with large matched cohorts: use normal approximation; report the pseudomedian and CI. Consider whether a mixed-effects model on the differences is more informative than a rank test."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "paired-t-test",
        "pros_of_this": "More robust to non-normal differences and outliers in the within-pair differences; valid at small n where the CLT cannot protect inference on means; exact p-values available.",
        "cons_of_this": "Lower power than the paired t-test when normality holds; produces the pseudomedian (not the mean change) as the effect estimate; cannot be extended to covariate adjustment.",
        "when_to_prefer": "Use when n < 30 and differences are clearly non-normal, or when outliers in the differences are expected; use the paired t-test or a mixed model when the mean change is the estimand or when covariate adjustment is needed."
      },
      {
        "compared_to": "mann-whitney-u-test",
        "pros_of_this": "Exploits the paired design by operating on within-pair differences, giving higher power than the Mann-Whitney for paired data; correctly accounts for the correlation between measurements on the same patient.",
        "cons_of_this": "Requires paired or matched data; cannot be used for independent groups. The Mann-Whitney test is the correct choice for two independent groups.",
        "when_to_prefer": "Always use the Wilcoxon signed-rank test when data are paired (pre-post on the same patient, or one-to-one matched pairs); use Mann-Whitney when groups are independent."
      },
      {
        "compared_to": "mcnemar-test",
        "pros_of_this": "The Wilcoxon signed-rank test handles ordinal or continuous differences; McNemar's test handles only binary paired outcomes (yes/no before and after). For continuous or multi-level ordinal PROs, the Wilcoxon test carries more information.",
        "cons_of_this": "For binary outcomes (e.g., readmission before vs after an intervention), McNemar's test is the appropriate paired test; applying the Wilcoxon signed-rank test to a binary difference (0 or 1) produces a meaningful result but McNemar is canonical.",
        "when_to_prefer": "Use for ordinal and continuous paired outcomes; use McNemar's test for paired binary (dichotomous) outcomes."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Compute pre-period and post-period per-patient totals (costs, visits, fills) using symmetric windows around an index date. Compute within-patient differences. Apply the Wilcoxon signed-rank test to those differences. Pre-specify zero handling (Wilcoxon vs Pratt) in the SAP. For cost outcomes, run a gamma GLM or bootstrap mean-change estimator as the primary analysis; use the Wilcoxon test as a pre-specified sensitivity check. At large n (> 10,000 matched pairs), report the Hodges-Lehmann estimate and confidence interval prominently; a p-value alone is uninformative at this scale.",
      "ehr": "Lab values (HbA1c, LDL, blood pressure) and clinical scores at two encounters are natural paired structures. Restrict to patients with both measurements; document the fraction excluded for missingness. PRO instruments and utility scores measured at two visits are the canonical use case. Report the Hodges-Lehmann pseudomedian alongside the p-value.",
      "registry": "Adjudicated disease-specific scores (BASFI, ACR response, UPDRS) at enrollment and follow-up. The registry's adjudicated data quality makes it ideal for the paired design. Check whether the registry protocol allows matched follow-up windows; if follow-up timing varies across patients, a mixed model may be more appropriate than a rank test.",
      "primary": "Survey and PRO instruments in prospective studies. Bounded or ordinal scales (EQ-5D, SF-36, VAS) with many possible levels are well-suited. At small n (< 30), the Wilcoxon signed-rank test with exact p-values is the standard nonparametric choice. Confirm whether the instrument developers' scoring guide treats scale scores as interval or ordinal.",
      "linked": "Linked claims-EHR-registry cohorts with 1:1 matched pairs: large n means the normal approximation is appropriate; report Hodges-Lehmann estimate and CI. Consider that at large n a mixed-effects model on differences may be more powerful and extensible to subgroup analysis."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "from scipy import stats\nimport itertools\n\n# ── Data: six paired pain scores (baseline minus week-12, positive = improvement) ──\nbaseline = [7, 6, 4, 8, 5, 3]\nweek12   = [2, 3, 6, 2, 4, 7]\ndiffs = [b - a for b, a in zip(baseline, week12)]\n# diffs = [5, 3, -2, 6, 1, -4]  -> 4 positive, 2 negative\n\n# ── 1. Wilcoxon signed-rank test (default: zero_method='wilcoxon', exact p-value at small n) ──\n# Note: scipy.stats.wilcoxon takes the differences directly (or x and y separately).\nw_stat, p_val = stats.wilcoxon(diffs, zero_method='wilcoxon', mode='auto')\nprint(f\"Wilcoxon signed-rank test:\")\nprint(f\"  W statistic (sum of positive ranks) = {w_stat:.1f}\")\nprint(f\"  Two-sided p-value = {p_val:.4f}\")\nprint(f\"  (With n=6 scipy uses exact enumeration; normal approximation used for n >= 25)\")\n\n# ── 2. Verify W+ and W- by hand ──\nabs_diffs = [abs(d) for d in diffs]\n# Sort indices by absolute value to assign ranks\nsorted_idx = sorted(range(len(abs_diffs)), key=lambda i: abs_diffs[i])\nranks = [0] * len(diffs)\nfor rank_pos, orig_idx in enumerate(sorted_idx):\n    ranks[orig_idx] = rank_pos + 1   # ranks are 1-indexed\nW_plus  = sum(r for d, r in zip(diffs, ranks) if d > 0)\nW_minus = sum(r for d, r in zip(diffs, ranks) if d < 0)\nn = len(diffs)\nprint(f\"\\nManual verification:\")\nprint(f\"  W+ (sum of positive ranks) = {W_plus}\")\nprint(f\"  W- (sum of negative ranks) = {W_minus}\")\nprint(f\"  W+ + W- = {W_plus + W_minus}  (should equal n*(n+1)/2 = {n*(n+1)//2})\")\nprint(f\"  W = min(W+, W-) = {min(W_plus, W_minus)}\")\n\n# ── 3. 
Pratt method (retains zero differences, assigns ranks but no sign contribution) ──\n# Not relevant here (no zeros), but shown for completeness\ndiffs_with_zero = diffs + [0]   # artificial example with a zero\n# Newer SciPy releases name this argument `method`; `mode` is the older spelling.\nw_pratt, p_pratt = stats.wilcoxon(diffs_with_zero, zero_method='pratt', mode='approx')\nprint(f\"\\nPratt method example (n=7 including one zero):\")\nprint(f\"  W = {w_pratt:.1f}, p (normal approx) = {p_pratt:.4f}\")\n\n# ── 4. Hodges-Lehmann pseudomedian: median of all Walsh averages (d_i + d_j)/2 for i <= j ──\n# This is the correct companion effect estimate for the Wilcoxon signed-rank test.\nimport statistics\nn = len(diffs)\nwalsh = [(diffs[i] + diffs[j]) / 2 for i in range(n) for j in range(i, n)]\nwalsh.sort()\nnw = len(walsh)\nif nw % 2 == 1:\n    pseudomedian = walsh[nw // 2]\nelse:\n    pseudomedian = (walsh[nw // 2 - 1] + walsh[nw // 2]) / 2\nprint(f\"\\nHodges-Lehmann pseudomedian of differences: {pseudomedian:.2f}\")\nprint(f\"  (Number of Walsh averages = n*(n+1)/2 = {n*(n+1)//2}, computed {nw})\")\nprint(f\"  Pseudomedian ≠ simple median of differences ({statistics.median(diffs):.1f})\")\nprint(\"  Always report this estimate with a CI — p-value alone does not convey effect size.\")\n\n# ── 5. Normal approximation for large n ──\nimport math\nfrom scipy.stats import norm\nn_large = 50  # example large sample\nw_example = 450  # hypothetical W+ statistic\nmu_w = n_large * (n_large + 1) / 4\nsigma_w = math.sqrt(n_large * (n_large + 1) * (2 * n_large + 1) / 24)\n# Continuity correction: move W by 0.5 toward mu_W, whichever side it falls on\ncc = math.copysign(0.5, w_example - mu_w)\nz = (w_example - mu_w - cc) / sigma_w\np_approx = 2 * norm.sf(abs(z))\nprint(f\"\\nNormal approximation (n=50, W+={w_example}):\")\nprint(f\"  mu_W = {mu_w:.1f}, sigma_W = {sigma_w:.3f}\")\nprint(f\"  z = ({w_example} - {mu_w:.1f} - ({cc:+.1f})) / {sigma_w:.3f} = {z:.3f}\")\nprint(f\"  Two-sided p (normal approx) = {p_approx:.4f}\")",
        "description": "Wilcoxon signed-rank test using scipy.stats.wilcoxon. Demonstrates the zero_method\nargument (Wilcoxon vs Pratt convention for zeros), the mode argument (exact vs normal\napproximation), and how to extract the Hodges-Lehmann pseudomedian estimate manually\n(scipy does not return it natively — use a Walsh-averages computation). Uses the\nsix-patient pain dataset from the worked example to show the W statistic and p-value,\nthen shows the full Walsh-averages approach for the pseudomedian.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "# ── Data: six paired pain scores ──\nbaseline <- c(7, 6, 4, 8, 5, 3)\nweek12   <- c(2, 3, 6, 2, 4, 7)\ndiffs    <- baseline - week12   # c(5, 3, -2, 6, 1, -4)\n\n# ── 1. Wilcoxon signed-rank test with Hodges-Lehmann pseudomedian and CI ──\n# conf.int=TRUE returns the Hodges-Lehmann estimate and 95% CI automatically.\n# R drops zero differences by default (Wilcoxon convention).\n# exact=TRUE uses exact enumeration (recommended at n < 25).\nwsr_res <- wilcox.test(baseline, week12, paired = TRUE,\n                       conf.int = TRUE, exact = TRUE)\ncat(\"Wilcoxon signed-rank test:\\n\")\nprint(wsr_res)\n# The output includes:\n#   V: the W+ statistic (sum of positive signed ranks); R calls it V\n#   p-value: exact two-sided p-value\n#   pseudomedian: Hodges-Lehmann estimate\n#   conf.int: 95% CI for the pseudomedian\ncat(sprintf(\"\\nHodges-Lehmann pseudomedian = %.2f\\n\", wsr_res$estimate))\ncat(sprintf(\"95%% CI: [%.2f, %.2f]\\n\", wsr_res$conf.int[1], wsr_res$conf.int[2]))\ncat(\"Note: R labels W+ as 'V'; scipy labels it 'statistic'. Both equal sum of positive ranks.\\n\")\n\n# ── 2. Manual W+ / W- calculation to match worked example ──\nabs_d <- abs(diffs)\nranks <- rank(abs_d)   # R's rank() handles ties by averaging\nW_plus  <- sum(ranks[diffs > 0])\nW_minus <- sum(ranks[diffs < 0])\nn <- length(diffs)\ncat(sprintf(\"\\nManual verification:\\n  W+ = %.0f,  W- = %.0f\\n\", W_plus, W_minus))\ncat(sprintf(\"  W+ + W- = %.0f  (should be n*(n+1)/2 = %d)\\n\", W_plus + W_minus, n*(n+1)/2))\ncat(sprintf(\"  W = min(W+, W-) = %.0f\\n\", min(W_plus, W_minus)))\n\n# ── 3. 
Normal approximation for large n (exact=FALSE) ──\n# Demonstrate on a larger hypothetical dataset\nset.seed(42)\nlarge_pre  <- round(runif(60, 4, 9))\nlarge_post <- round(runif(60, 2, 8))\nwsr_large <- wilcox.test(large_pre, large_post, paired = TRUE,\n                         conf.int = TRUE, exact = FALSE)\ncat(\"\\nLarge sample example (n=60, normal approximation):\\n\")\nprint(wsr_large)\n\n# ── 4. Paired t-test for comparison ──\n# Show that both tests agree in direction but may differ in p-value\nt_res <- t.test(baseline, week12, paired = TRUE)\ncat(sprintf(\"\\nPaired t-test for comparison:\\n  t = %.3f,  p = %.4f\\n\",\n            t_res$statistic, t_res$p.value))\ncat(sprintf(\"  Mean difference = %.2f (95%% CI: [%.2f, %.2f])\\n\",\n            t_res$estimate, t_res$conf.int[1], t_res$conf.int[2]))\ncat(\"Note: At n=6, both tests are underpowered; with this data neither reaches p < 0.05.\\n\")",
        "description": "Wilcoxon signed-rank test using wilcox.test(paired=TRUE, conf.int=TRUE) in base R.\nDemonstrates the test on the six-patient pain dataset, extraction of the Hodges-Lehmann\npseudomedian and its confidence interval (returned automatically with conf.int=TRUE),\nthe exact vs normal approximation choice via exact=TRUE/FALSE, and the note that R drops\nzeros by default (Wilcoxon convention). Shows the manual W+ / W- verification to match\nthe worked example.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* ── Create the six-patient paired dataset ── */\ndata work.pain;\n  input patient_id $ baseline week12;\n  diff = baseline - week12;   /* positive = improvement */\n  datalines;\nP1 7 2\nP2 6 3\nP3 4 6\nP4 8 2\nP5 5 4\nP6 3 7\n;\nrun;\n\n/* ── Verify differences ── */\nproc means data=work.pain n mean std min max;\n  var diff;\n  title \"Within-patient pain differences (baseline - week12)\";\nrun;\n\n/* ── Wilcoxon signed-rank test via PROC UNIVARIATE ──\n   UNIVARIATE tests H0: median(diff) = 0 using the Wilcoxon signed-rank S statistic.\n   The S statistic = W+ - n*(n+1)/4 (centered; SAS convention).\n   The p-value for the signed-rank test appears in the \"Tests for Location\" table.\n   CIPCTLDF gives distribution-free CIs (Hodges-Lehmann); CIBASIC gives normal-based CIs. */\nproc univariate data=work.pain cipctldf cibasic;\n  var diff;\n  /* Key output: \"Tests for Location\" table; look for:\n     Signed Rank  S = <value>  Pr >= |S| = <p-value>              */\n  title \"Wilcoxon Signed-Rank Test on Pain Differences\";\nrun;\n\n/* ── Paired t-test for comparison via PROC TTEST ──\n   PROC TTEST with PAIRED statement computes the paired t-test directly from the\n   two raw variables; no need to compute differences manually.                     */\nproc ttest data=work.pain;\n  paired baseline * week12;\n  title \"Paired t-test on Pain Scores (parametric comparison)\";\nrun;\n\n/* ── Larger example using PROC UNIVARIATE with normal approximation ──\n   For n >= 25, the signed-rank statistic is approximately normally distributed.\n   SAS automatically uses the correct approximation.                               
*/\ndata work.pain_large;\n  call streaminit(42);\n  do i = 1 to 60;\n    patient_id = i;\n    baseline = round(4 + rand('uniform') * 5);\n    week12   = round(2 + rand('uniform') * 6);\n    diff = baseline - week12;\n    output;\n  end;\n  drop i;\nrun;\n\nproc univariate data=work.pain_large cipctldf;\n  var diff;\n  title \"Wilcoxon Signed-Rank Test — Large Sample (n=60)\";\nrun;\n\n/* Note: SAS PROC NPAR1WAY with the WILCOXON option tests TWO INDEPENDENT groups\n   (equivalent to Mann-Whitney U). Do NOT use NPAR1WAY for paired data — use\n   PROC UNIVARIATE on the within-patient differences instead.                     */",
        "description": "Wilcoxon signed-rank test in SAS using PROC UNIVARIATE on the within-patient differences.\nPROC UNIVARIATE reports the signed-rank S statistic and its p-value as a built-in output\nwhen applied to a single column of differences. The S statistic in SAS equals W+ - n*(n+1)/4\n(a centered version of W+); the p-value matches scipy and R. The Hodges-Lehmann estimate\nis available via the CIPCTLDF or CIBASIC options. Also shows PROC TTEST with PAIRED\nstatement as the parametric comparison and PROC MEANS to confirm the differences\ncomputed correctly.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Start[\"Paired / pre-post<br/>continuous or ordinal outcome?\"] --> CheckN{n pairs<br/>< 30?}\n  CheckN -- \"Yes, small n\" --> CheckNorm{Differences<br/>clearly non-normal<br/>or outliers expected?}\n  CheckN -- \"No, large n\" --> CheckEstimand{Target estimand?}\n  CheckNorm -- \"Yes\" --> WSR[\"Wilcoxon signed-rank test<br/>(exact p-value)<br/>+ Hodges-Lehmann CI\"]\n  CheckNorm -- \"No\" --> PairedT[\"Paired t-test<br/>(mean difference + CI)\"]\n  CheckEstimand -- \"Pseudomedian<br/>(typical patient)\" --> WSRLarge[\"Wilcoxon signed-rank<br/>(normal approx)<br/>+ Hodges-Lehmann CI\"]\n  CheckEstimand -- \"Mean change<br/>(budget / HTA)\" --> Model[\"Paired t-test or<br/>GLM on differences<br/>(gamma / NB for costs)\"]\n  WSR --> Adjust{Covariate<br/>adjustment<br/>needed?}\n  PairedT --> Adjust\n  WSRLarge --> Adjust\n  Model --> Done[\"Report estimate + CI<br/>+ p-value\"]\n  Adjust -- \"No\" --> Done\n  Adjust -- \"Yes\" --> ANCOVA[\"ANCOVA or mixed model<br/>on differences<br/>(rank tests can't adjust)\"]\n  ANCOVA --> Done",
        "caption": "Decision flowchart for paired / pre-post continuous outcomes: choice among Wilcoxon signed-rank, paired t-test, and GLM on differences, with the covariate adjustment branch.",
        "alt_text": "Flowchart for paired outcome analysis: branches on sample size, normality, and target estimand to route to Wilcoxon signed-rank test, paired t-test, or a model-based approach.",
        "source_type": "illustrative",
        "source_citations": [
          "wilcoxon-1945"
        ]
      }
    ],
    "relations": [
      {
        "relation_type": "requires",
        "target_slug": "parametric-vs-nonparametric-tests",
        "notes": "The parametric vs nonparametric overview establishes the decision framework — when to use ranks rather than raw values — that motivates the Wilcoxon signed-rank test as the nonparametric alternative to the paired t-test."
      },
      {
        "relation_type": "requires",
        "target_slug": "inferential-statistics-foundations",
        "notes": "Understanding null hypothesis testing, p-values, and the concept of a test statistic's sampling distribution is prerequisite to interpreting the Wilcoxon signed-rank test result correctly."
      },
      {
        "relation_type": "see_also",
        "target_slug": "paired-t-test",
        "notes": "The parametric counterpart for paired data. The paired t-test tests the mean of the differences and is more powerful when normality holds; the Wilcoxon signed-rank test is its nonparametric analog, testing the pseudomedian and robust to outliers. Both should be reported in regulatory submissions."
      },
      {
        "relation_type": "see_also",
        "target_slug": "mann-whitney-u-test",
        "notes": "The unpaired sibling: the Mann-Whitney U test (Wilcoxon rank-sum test) applies the same rank-based logic to two independent groups rather than paired observations. The naming overlap — both carry \"Wilcoxon\" in common usage — is a perennial source of confusion; the signed-rank test is for paired data, the rank-sum / Mann-Whitney is for independent groups."
      },
      {
        "relation_type": "see_also",
        "target_slug": "mcnemar-test",
        "notes": "The paired binary analog: McNemar's test is the nonparametric test for paired binary outcomes (e.g., hospital readmission before vs after a policy). The Wilcoxon signed-rank test handles continuous or ordinal paired differences; McNemar handles the binary case."
      }
    ],
    "aliases": [
      "signed-rank test",
      "Wilcoxon matched-pairs test",
      "paired Wilcoxon",
      "Wilcoxon signed-rank",
      "paired signed-rank test",
      "pseudomedian test"
    ],
    "complexity": "foundational",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "win-ratio-generalized-pairwise-comparisons-rwe",
    "name": "Win Ratio and Generalized Pairwise Comparisons",
    "short_definition": "A family of estimators that compares every treated patient against every control patient on a prioritized, hierarchical set of outcomes (e.g., death first, then heart-failure hospitalization, then a continuous measure), labels each pairwise comparison a win, a loss, or a tie by walking down the priority ladder until the pair can be separated, and summarizes the result as a win ratio (wins / losses), win odds (ties split), or net benefit (wins minus losses over all pairs) - letting the most clinically important event dominate the analysis instead of collapsing a composite to its first component.",
    "long_description": "A composite endpoint stacks several outcomes into one. The classic real-world and trial answer is\n**time-to-first-event**: a patient counts the day the *earliest* of death, hospitalization, or any other\nlisted event occurs, and a Cox model is run on that first event. This treats a heart-failure hospitalization\non day 30 as *equivalent* to dying on day 30 - the model cannot tell which happened, because both are simply\n\"the first event.\" That equivalence is clinically false and statistically wasteful: the most important outcome\n(death) is diluted by the most frequent one (hospitalization), and everything after the first event is thrown away.\n\n**The generalized pairwise comparison (GPC) idea.** Instead of reducing each patient to a single first-event\ntime, GPC compares **every treated patient against every control patient as a pair**. For each of the\n(n_treated x n_control) pairs you ask, in **priority order**: on the most important outcome, did the treated\npatient do better (a *win*), worse (a *loss*), or can we not tell (a *tie*)? If it is a tie on the top outcome,\nyou drop to the next outcome in the hierarchy and ask again; you keep descending until the pair is separated or\nyou run out of outcomes (an overall tie). Counting wins, losses, and ties across all pairs gives three related summaries:\n- **Win ratio** = wins / losses (ties excluded). A value above 1 favors treatment.\n- **Win odds** = (wins + 0.5 x ties) / (losses + 0.5 x ties). Ties are split evenly, so every pair contributes.\n- **Net benefit** (the Buyse statistic) = (wins - losses) / total pairs. A difference on the natural -1 to +1 scale.\n\n**The hierarchy is the analysis, and prioritization order is a real choice.** You must pre-specify the ladder:\ndeath usually sits on top, then a serious morbidity event (HF hospitalization), then perhaps a continuous\noutcome (change in a biomarker or a quality-of-life score) for pairs still tied after the events. 
Reordering the\nladder *changes the estimate*, because a different outcome gets first crack at separating each pair. Putting a\nfrequent, less severe outcome on top makes the win ratio behave almost like time-to-first-event and forfeits the\nwhole point; putting death on top lets mortality dominate, which is usually the intent. The order is a clinical\njudgment that belongs in the protocol, not a knob to tune after seeing results.\n\n**Matched vs unmatched comparisons.** Pocock's original **matched** win ratio pairs each treated patient with a\ncontrol of similar risk (matched on covariates or risk strata) and compares only within matched pairs. The\n**unmatched / all-pairs** version (Buyse's GPC, Finkelstein-Schoenfeld lineage) compares *all* treated-vs-control\npairs and is the more common real-world form because it uses all the data and feeds directly into stratified and\ncovariate-adjusted extensions. Both share the same win/loss/tie engine; they differ only in *which* pairs are formed.\n\n**Censoring is where win ratios get subtle.** With time-to-event outcomes you often cannot order a pair: if a\ntreated patient is censored (lost to follow-up, study ends) at day 200 and the control is still event-free at day\n200, you do not know who would have had the event first - that pair is a **tie on that tier by necessity**, not\nbecause the patients are equivalent. Differential follow-up between arms can therefore manufacture ties and bias\nthe win ratio toward 1, and the estimate depends on the **follow-up time** of the data (unlike a hazard ratio,\nwhich targets an instantaneous rate). 
Short or unequal follow-up inflates the tie count and can move the win ratio\nin either direction; this is the central caution for win ratios in real-world data, where follow-up is rarely uniform.\n\n**Interpreting the output**\n\nConsider a heart-failure registry study comparing a new device versus standard care\non a two-tier hierarchy (Tier 1: cardiovascular death; Tier 2: HF hospitalization).\nAmong all treated × control pairs, 1,540 pairs were won by the treated patient,\n1,100 were lost, and 2,360 were tied. Win ratio = 1,540 / 1,100 = 1.40 (95% CI\n1.15–1.71).\n\nFormal interpretation: Among all treated–control patient pairs compared on the\nprioritized hierarchy, treated patients won on the more important outcome 1.40 times\nas often as they lost. A win is assigned by walking down the priority ladder: if the\ntreated patient survived longer (Tier 1), that pair is a win regardless of HF events;\nonly pairs tied on survival drop to Tier 2 (HF hospitalization). Ties remain when\nneither patient can be separated on any tier — typically because one or both were\ncensored before the pair could be resolved. The win ratio of 1.40 is specific to this\nfollow-up duration and censoring pattern; a study with shorter or more unequal follow-up\nwould produce more ties and a win ratio closer to 1 for the same true effect.\n\nPractical interpretation: Treated patients won on the priority hierarchy 40% more\noften than they lost. This framing deliberately lets death outweigh hospitalization —\na patient who survives longer wins the pair even if they had more hospitalizations.\nReport the tie fraction (2,360 / (1,540 + 1,100 + 2,360) = 47%) alongside the win\nratio because a high tie rate signals censoring-driven attenuation. 
The win ratio is\nnot a hazard ratio and is not a per-unit-time rate; do not read 1.40 as \"a 40% lower\nrisk.\"\n\n**Pros, cons, and trade-offs** (specific and comparative).\n- **vs composite-endpoint-construction-rwe (time-to-first-event):** GPC's advantage is **clinical weighting for\n  free** - death is allowed to outrank hospitalization without inventing arbitrary point values, and information\n  after the first event is retained. Its costs are (a) dependence on follow-up time and censoring patterns, (b) a\n  less familiar effect measure than a hazard ratio, and (c) sensitivity to the chosen hierarchy. **Prefer GPC**\n  when the components differ sharply in severity and you want the worst outcome to dominate; **prefer\n  time-to-first-event** when components are clinically comparable, follow-up is uniform, and a hazard ratio is the\n  expected currency for the audience.\n- **vs cox-ph-regression:** A win ratio is **not** a hazard ratio and is **not** assumption-bound to proportional\n  hazards - it can behave sensibly under non-proportional hazards and crossing curves where a single HR is\n  misleading. But it sacrifices the hazard ratio's time-anchored interpretability, its established\n  covariate-adjustment machinery, and its independence from total follow-up length. **Prefer Cox** when a per-unit-time\n  rate is the target and PH roughly holds; **prefer the win ratio** when a prioritized composite and severity ranking matter more than a rate.\n- **vs restricted-mean-survival-time-rmst:** Both dodge the proportional-hazards assumption and both depend on a\n  chosen time horizon, but RMST answers a *single*-endpoint question (mean event-free time up to tau) while GPC\n  answers a *hierarchical multi-endpoint* question. 
**Prefer RMST** for one time-to-event outcome with a natural\n  horizon; **prefer GPC** when several prioritized outcomes must be combined.\n\n**When to use.** Hierarchical composite endpoints where components differ in severity and you want the most\nimportant to dominate - the canonical home is **cardiology and heart-failure** trials and their real-world\nemulations (cardiovascular death > HF hospitalization > symptom/biomarker change), and increasingly oncology\n(death > progression > response) and any setting with a clear clinical priority order. Use it when\nproportional hazards is doubtful, when a continuous outcome should break ties among event-free patients, or when\nstakeholders explicitly want mortality to outweigh softer endpoints rather than be averaged with them.\n\n**When NOT to use - and when it is actively misleading.**\n- **Severely unequal or short follow-up.** If follow-up differs between arms (common in real-world cohorts with\n  staggered entry and administrative censoring), uncomparable pairs become ties and the win ratio drifts toward 1,\n  masking a true effect; align follow-up, restrict to a common horizon, or use a method robust to dependent\n  censoring before trusting the number. Do not report a win ratio from data with grossly differential censoring without that caution.\n- **No genuine priority order.** If the components are clinically interchangeable (two equally serious events with\n  no agreed ranking), forcing a hierarchy invents an ordering that drives the result; a time-to-first-event or\n  recurrent-event analysis is more honest.\n- **Reading the win ratio as a hazard ratio or a rate.** It is a ratio of wins to losses over a specific\n  follow-up window, not a per-time hazard; quoting \"a 30% lower risk\" from a win ratio of 1.43 is wrong. 
Report it\n  as what it is, with the follow-up horizon and the tie fraction, because both are part of the estimand.\n- **Confounding is unaddressed by the engine.** All-pairs comparison does not adjust for confounding any more than\n  a crude rate does; in observational data you still need matching, stratification, or weighting (e.g., on a\n  propensity score) layered onto the pair formation, or the win ratio just compares confounded groups efficiently.",
    "primary_category": "Inferential_Statistics",
    "tags": [
      "win-ratio",
      "win-odds",
      "net-benefit",
      "generalized-pairwise-comparisons",
      "hierarchical-composite-endpoint",
      "prioritized-outcomes",
      "time-to-first-event",
      "censoring",
      "heart-failure",
      "cardiology",
      "non-proportional-hazards"
    ],
    "applies_to_study_types": [
      "cohort_retrospective",
      "cohort_prospective",
      "active_comparator_new_user",
      "target_trial_emulation",
      "registry_study",
      "pragmatic_trial",
      "cer_observational"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.1093/eurheartj/ehr352",
        "url": "https://doi.org/10.1093/eurheartj/ehr352",
        "citation_text": "Pocock SJ, Ariti CA, Collier TJ, Wang D. The win ratio: a new approach to the analysis of composite endpoints in clinical trials based on clinical priorities. European Heart Journal. 2012;33(2):176-182.",
        "year": 2012,
        "authors_short": "Pocock et al.",
        "notes": "The paper that introduced the win ratio - matched and unmatched - as a way to analyze composite endpoints by clinical priority (cardiovascular death ranked above heart-failure hospitalization), defining the wins/losses engine this concept implements."
      },
      {
        "role": "explain",
        "doi": "10.1002/sim.3923",
        "url": "https://doi.org/10.1002/sim.3923",
        "citation_text": "Buyse M. Generalized pairwise comparisons of prioritized outcomes in the two-sample problem. Statistics in Medicine. 2010;29(30):3245-3257.",
        "year": 2010,
        "authors_short": "Buyse",
        "notes": "The general statistical framework (net benefit / generalized pairwise comparisons) that underlies the all-pairs win/loss/tie counting, generalizing the win ratio to any prioritized mix of binary, ordinal, continuous, and time-to-event outcomes."
      },
      {
        "role": "demonstrate",
        "doi": "10.1002/pst.1977",
        "url": "https://doi.org/10.1002/pst.1977",
        "citation_text": "Dong G, Huang B, Chang YW, Seifu Y, Song J, Hoaglin DC. The win ratio: impact of censoring and follow-up time and use with nonproportional hazards. Pharmaceutical Statistics. 2020;19(3):168-177.",
        "year": 2020,
        "authors_short": "Dong et al.",
        "notes": "Shows quantitatively how censoring and total follow-up time shift the win ratio (uncomparable pairs become ties) and how it behaves under non-proportional hazards - the central real-world caution for win-ratio estimands."
      }
    ],
    "plain_language_summary": "The win ratio compares a treatment group against a control group by pairing up patients and asking, for each pair, who did better on the most important outcome first - say, who lived longer - and only dropping to the next outcome (like who avoided a heart-failure hospitalization longer) if the first one is a tie. Across every treated-versus-control pair you count wins, losses, and ties, and the win ratio is simply wins divided by losses, so a number above 1 favors the treatment. Its big advantage over the usual \"first event that happens\" approach is that death can outrank a hospitalization instead of counting the same. Its big catch: if patients are followed for different lengths of time, many pairs become unrankable ties, which can pull the win ratio toward 1 and hide a real difference.\n",
    "key_terms": [
      {
        "term": "hierarchical composite endpoint",
        "definition": "A set of outcomes ranked from most to least important (e.g., death first, then hospitalization) so the most serious one decides a comparison whenever possible."
      },
      {
        "term": "pairwise comparison",
        "definition": "Lining up one treated patient against one control patient and deciding which of the two did better."
      },
      {
        "term": "win, loss, tie",
        "definition": "For a pair, the treated patient either did better (win), worse (loss), or could not be told apart from the control (tie)."
      },
      {
        "term": "win ratio",
        "definition": "The number of winning pairs divided by the number of losing pairs; above 1 means the treatment came out ahead."
      },
      {
        "term": "net benefit",
        "definition": "Wins minus losses divided by the total number of pairs, giving a single number on a minus-one to plus-one scale."
      },
      {
        "term": "censoring",
        "definition": "When a patient's follow-up ends before the outcome happens, so you only know they were event-free up to that day, not what happened after."
      },
      {
        "term": "time-to-first-event",
        "definition": "The traditional composite approach that records only the earliest of the listed events and treats them all as equally bad."
      }
    ],
    "worked_example": {
      "scenario": "A heart-failure treatment is compared against control using a two-level priority: first survival (did the patient live longer), then time to the first heart-failure hospitalization for pairs that tie on survival. We have three treated and three control patients, each followed up to 365 days. We compare all 3 x 3 = 9 treated-versus-control pairs, label each a win, loss, or tie for the treatment, and compute the win ratio, win odds, and net benefit by hand.\n",
      "dataset": {
        "caption": "One row per patient. death_day and hf_hosp_day are 'none' if the event was not observed; followup_day is the last day the patient was seen.",
        "columns": [
          "patient_id",
          "arm",
          "death_day",
          "hf_hosp_day",
          "followup_day"
        ],
        "rows": [
          [
            3001,
            "treatment",
            "none",
            "none",
            365
          ],
          [
            3002,
            "treatment",
            "none",
            300,
            365
          ],
          [
            3003,
            "treatment",
            200,
            150,
            200
          ],
          [
            4001,
            "control",
            100,
            60,
            100
          ],
          [
            4002,
            "control",
            "none",
            "none",
            365
          ],
          [
            4003,
            "control",
            250,
            "none",
            250
          ]
        ]
      },
      "steps": [
        "Tier 1 rule (survival), longer life wins. Tier 2 rule (used only if Tier 1 ties), longer time to first HF hospitalization wins. A pair is a tie if neither tier can separate the two patients.",
        "Treated 3001 (alive, no HF) vs each control. vs 4001 (died day 100), vs 4003 (died day 250), it outlives them so 3001 WINS both; vs 4002 (also alive, no HF) nothing separates them so TIE. That is 2 wins and 1 tie.",
        "Treated 3002 (alive, HF day 300) vs controls. vs 4001 and vs 4003 it outlives both (WIN, WIN); vs 4002 both survive so Tier 1 ties, then Tier 2 compares HF - 3002 was hospitalized on day 300 but 4002 stayed event-free past day 300, so 4002 wins and 3002 LOSES. That is 2 wins and 1 loss.",
        "Treated 3003 (died day 200) vs controls. vs 4001 (died day 100) it survived longer so WIN; vs 4002 (alive at 365) and vs 4003 (died day 250) the controls outlive it, so 3003 LOSES both. That is 1 win and 2 losses.",
        "Totals across the 9 pairs - wins = 2 + 2 + 1 = 5, losses = 0 + 1 + 2 = 3, ties = 1 + 0 + 0 = 1.",
        "Win ratio = wins / losses = 5 / 3 = 1.67. Net benefit = (5 - 3) / 9 = 0.22. Win odds (split the tie) = (5 + 0.5) / (3 + 0.5) = 1.57."
      ],
      "result": "Across 9 pairs there are 5 wins, 3 losses, and 1 tie, so the win ratio = 5 / 3 = 1.67 (favoring treatment), the win odds = (5 + 0.5) / (3 + 0.5) = 1.57, and the net benefit = (5 - 3) / 9 = 0.22. Survival decided most pairs; HF hospitalization only broke the single Tier-1 tie between 3002 and 4002.",
      "timeline_spec": {
        "title": "Six patients scored on a death-then-HF-hospitalization hierarchy (9 treated-vs-control pairs)",
        "window": {
          "start": "2021-01-01",
          "end": "2022-01-01",
          "label": "365-day follow-up window for all six patients"
        },
        "events": [
          {
            "label": "T 3001 (alive, no HF)",
            "start": "2021-01-01",
            "length_days": 365,
            "quantity": "censored day 365"
          },
          {
            "label": "T 3002 (alive, HF day 300)",
            "start": "2021-01-01",
            "length_days": 365,
            "quantity": "HF day 300"
          },
          {
            "label": "T 3003 (died day 200)",
            "start": "2021-01-01",
            "length_days": 200,
            "quantity": "death day 200"
          },
          {
            "label": "C 4001 (died day 100)",
            "start": "2021-01-01",
            "length_days": 100,
            "quantity": "death day 100"
          },
          {
            "label": "C 4002 (alive, no HF)",
            "start": "2021-01-01",
            "length_days": 365,
            "quantity": "censored day 365"
          },
          {
            "label": "C 4003 (died day 250)",
            "start": "2021-01-01",
            "length_days": 250,
            "quantity": "death day 250"
          }
        ],
        "spans": [
          {
            "kind": "exposed",
            "start": "2021-01-01",
            "end": "2022-01-01",
            "label": "Treatment arm: 3001, 3002, 3003"
          },
          {
            "kind": "unexposed",
            "start": "2021-01-01",
            "end": "2022-01-01",
            "label": "Control arm: 4001, 4002, 4003"
          }
        ],
        "result": {
          "label": "5 wins, 3 losses, 1 tie -> win ratio 5/3 = 1.67",
          "value": 1.67
        }
      }
    },
    "prerequisites": [
      "composite-endpoint-construction-rwe",
      "cox-ph-regression",
      "estimands-ate-att-intercurrent-events-rwe"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "Matched-pairs win ratio (Pocock original)",
        "description": "Each treated patient is paired with a single control of similar risk (matched on a propensity score or risk strata), and only within-pair comparisons are made down the outcome hierarchy. Wins and losses are counted over the matched pairs; the matching does the confounding control before the win/loss engine runs.",
        "edge_cases": [
          "Unmatched patients are dropped, shrinking the sample and the comparison's power when matching is hard.",
          "Pair quality drives validity - poor risk matching means the win ratio compares non-exchangeable patients."
        ],
        "data_source_notes": "claims/ehr: build the matching covariates (comorbidities, prior utilization) from the lookback window first, then match; the hierarchy outcomes come from the follow-up window."
      },
      {
        "name": "Unmatched all-pairs win ratio / net benefit (Buyse GPC)",
        "description": "Every treated patient is compared with every control patient (n_treated x n_control pairs). Win ratio, win odds, and net benefit are computed from the full grid of wins/losses/ties. This is the common real-world form and the substrate for stratified and covariate-adjusted extensions.",
        "edge_cases": [
          "The pair count grows multiplicatively; very large cohorts need vectorized or stratified computation.",
          "With no priority order specified, the result is undefined - the hierarchy must be pre-registered."
        ],
        "data_source_notes": "registry/linked: cleanest when each subject has a complete, adjudicated outcome hierarchy; claims require constructing each tier (death, then hospitalization) from codes before any pair is scored."
      },
      {
        "name": "Stratified / covariate-adjusted win ratio",
        "description": "Pairs are formed within strata (e.g., risk-score deciles, sites, or matched sets) and the per-stratum win/loss counts are combined, or inverse-probability weights are applied to pairs, to adjust for confounding while keeping the hierarchical win logic. Bridges crude all-pairs counting and a confounding-aware estimate.",
        "edge_cases": [
          "Sparse strata yield unstable per-stratum ratios; pooling rules and weighting must be pre-specified.",
          "Weighting on a mis-specified propensity score reintroduces the confounding the stratification was meant to remove."
        ],
        "data_source_notes": "claims/ehr/linked: strata or weights come from a propensity model built on the baseline window; the tiers (death, HF hospitalization, biomarker) are read from the follow-up window per subject."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "composite-endpoint-construction-rwe",
        "pros_of_this": "Lets the most clinically important component (death) outrank the most frequent one (hospitalization) without arbitrary points, and keeps information after the first event instead of discarding it as time-to-first-event does.",
        "cons_of_this": "The estimate depends on follow-up time and censoring patterns and on the chosen hierarchy, and the win ratio is a less familiar effect measure than the hazard ratio that a time-to-first-event Cox model yields.",
        "when_to_prefer": "When the composite's components differ sharply in severity and you want the worst outcome to dominate; use time-to-first-event when components are clinically comparable and follow-up is uniform."
      },
      {
        "compared_to": "cox-ph-regression",
        "pros_of_this": "Not bound to the proportional-hazards assumption and behaves sensibly under non-proportional or crossing hazards, while naturally combining a prioritized hierarchy of multiple outcomes rather than a single event time.",
        "cons_of_this": "Gives up the hazard ratio's per-unit-time interpretation, its mature covariate-adjustment toolkit, and its independence from total follow-up length; a win ratio cannot be read as a rate.",
        "when_to_prefer": "When a prioritized composite and severity ranking matter more than an instantaneous rate, or when PH is clearly violated; prefer Cox when a time-anchored hazard ratio is the target and PH roughly holds."
      },
      {
        "compared_to": "restricted-mean-survival-time-rmst",
        "pros_of_this": "Combines several prioritized outcomes into one comparison, whereas RMST answers a single-endpoint question; both avoid the proportional-hazards assumption.",
        "cons_of_this": "Less directly interpretable than RMST's \"mean event-free months up to tau,\" and like RMST it still depends on the chosen time horizon and on censoring.",
        "when_to_prefer": "When several prioritized outcomes must be merged; use RMST for one time-to-event outcome with a natural horizon."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "No outcome is pre-built - each tier of the hierarchy must be constructed from codes first. Death comes from a mortality source (enrollment end + death indicator, linked NDI, or inpatient discharge disposition); HF hospitalization from an inpatient claim with a qualifying diagnosis; any continuous tier (e.g., a utilization count) aggregated over the window. Follow-up ends at disenrollment, so differential disenrollment between arms creates uncomparable pairs (ties) - align observable time and report the tie fraction.",
      "ehr": "Death and hospitalization timing are often available with finer granularity, and a continuous tie-breaker tier (lab value, ejection fraction, PRO score) can be pulled directly - EHR is the natural substrate for a multi-tier hierarchy. Watch for informative loss to follow-up (patients leaving the system look censored), which inflates ties.",
      "registry": "Disease and product registries frequently adjudicate the exact event hierarchy GPC needs (cardiovascular death, HF hospitalization, NYHA class), making them the cleanest source for a win ratio; confirm that follow-up is complete and comparable across arms before trusting the tie count.",
      "linked": "Claims-EHR-registry linkage gives the strongest substrate - claims for hospitalization completeness, a linked mortality file for death timing, EHR/registry for the continuous tie-breaker tier. Reconcile event dates across sources before scoring pairs, since the hierarchy is only as good as the timing of its top tier."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "from math import inf\n\ndef _surv(p):\n    # death time if it happened, else +inf; censoring (last observed) time\n    return (p[\"death_day\"] if p[\"death_day\"] is not None else inf,\n            p[\"followup_day\"])\n\ndef _hf(p):\n    return (p[\"hf_day\"] if p[\"hf_day\"] is not None else inf,\n            p[\"followup_day\"])\n\ndef compare_pair(t, c):\n    # returns +1 = treated wins, -1 = treated loses, 0 = tie (descend / overall tie)\n    # Tier 1: longer survival wins; indeterminate under censoring -> descend\n    td, tcens = _surv(t); cd, ccens = _surv(c)\n    if td != inf and cd != inf:\n        if td > cd: return 1\n        if td < cd: return -1\n    elif td != inf and cd == inf:          # treated died, control censored\n        if ccens >= td: return -1           # control known to outlive treated's death\n    elif cd != inf and td == inf:          # control died, treated censored\n        if tcens >= cd: return 1\n    # Tier 2: longer time to first HF hospitalization wins\n    th, thc = _hf(t); ch, chc = _hf(c)\n    if th != inf and ch != inf:\n        if th > ch: return 1\n        if th < ch: return -1\n    elif th != inf and ch == inf:          # treated had HF, control event-free\n        if chc >= th: return -1\n    elif ch != inf and th == inf:          # control had HF, treated event-free\n        if thc >= ch: return 1\n    return 0\n\ndef win_ratio(subjects):\n    tx = [s for s in subjects if s[\"arm\"] == \"treatment\"]\n    ct = [s for s in subjects if s[\"arm\"] == \"control\"]\n    wins = losses = ties = 0\n    for t in tx:\n        for c in ct:\n            r = compare_pair(t, c)\n            if r == 1: wins += 1\n            elif r == -1: losses += 1\n            else: ties += 1\n    total = wins + losses + ties\n    return {\n        \"wins\": wins, \"losses\": losses, \"ties\": ties, \"pairs\": total,\n        \"win_ratio\": wins / losses if losses else inf,\n        \"win_odds\": (wins + 0.5 * ties) / (losses + 0.5 
* ties) if (losses + 0.5 * ties) else inf,\n        \"net_benefit\": (wins - losses) / total if total else 0.0,\n    }",
        "description": "Manual all-pairs (unmatched) generalized pairwise comparison on a two-tier hierarchy: Tier 1 = survival\n(death_day with right-censoring at followup_day), Tier 2 = time to first heart-failure hospitalization\n(hf_day, event-free until followup_day). For every treated x control pair it walks the hierarchy until the\npair separates, returning wins, losses, ties and the win ratio, win odds, and net benefit. Input (one row per\nsubject): subject_id, arm ('treatment' | 'control'), death_day (float or None), hf_day (float or None),\nfollowup_day (int). None = event not observed (censored at followup_day).",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "compare_pair <- function(t, c) {\n  td <- if (is.na(t$death_day)) Inf else t$death_day; tcens <- t$followup_day\n  cd <- if (is.na(c$death_day)) Inf else c$death_day; ccens <- c$followup_day\n  # Tier 1: survival\n  if (is.finite(td) && is.finite(cd)) {\n    if (td > cd) return(1L); if (td < cd) return(-1L)\n  } else if (is.finite(td) && is.infinite(cd)) {\n    if (ccens >= td) return(-1L)\n  } else if (is.finite(cd) && is.infinite(td)) {\n    if (tcens >= cd) return(1L)\n  }\n  # Tier 2: time to first HF hospitalization\n  th <- if (is.na(t$hf_day)) Inf else t$hf_day\n  ch <- if (is.na(c$hf_day)) Inf else c$hf_day\n  if (is.finite(th) && is.finite(ch)) {\n    if (th > ch) return(1L); if (th < ch) return(-1L)\n  } else if (is.finite(th) && is.infinite(ch)) {\n    if (c$followup_day >= th) return(-1L)\n  } else if (is.finite(ch) && is.infinite(th)) {\n    if (t$followup_day >= ch) return(1L)\n  }\n  0L\n}\n\nwin_ratio <- function(df) {\n  tx <- df[df$arm == \"treatment\", ]; ct <- df[df$arm == \"control\", ]\n  wins <- 0L; losses <- 0L; ties <- 0L\n  for (i in seq_len(nrow(tx))) for (j in seq_len(nrow(ct))) {\n    r <- compare_pair(as.list(tx[i, ]), as.list(ct[j, ]))\n    if (r == 1L) wins <- wins + 1L\n    else if (r == -1L) losses <- losses + 1L\n    else ties <- ties + 1L\n  }\n  total <- wins + losses + ties\n  list(wins = wins, losses = losses, ties = ties, pairs = total,\n       win_ratio = if (losses > 0) wins / losses else Inf,\n       win_odds  = (wins + 0.5 * ties) / (losses + 0.5 * ties),\n       net_benefit = (wins - losses) / total)\n}",
        "description": "Same unmatched two-tier generalized pairwise comparison in base R (manual all-pairs loops, no package needed).\nTier 1 is survival with right-censoring; Tier 2 is time to first HF hospitalization. Input is a data.frame with\ncolumns subject_id, arm ('treatment'/'control'), death_day, hf_day (NA = not observed), followup_day. Returns the\nwin/loss/tie counts plus win ratio, win odds, and net benefit.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "proc iml;\n  use work.subjects;\n    read all var {death_day hf_day followup_day} into X;\n    read all var {arm} into ARM;\n  close work.subjects;\n\n  tx = loc(ARM = \"treatment\");\n  ct = loc(ARM = \"control\");\n  wins = 0; losses = 0; ties = 0;\n\n  do a = 1 to ncol(tx);\n    i = tx[a];\n    do b = 1 to ncol(ct);\n      j = ct[b];\n      td = X[i,1]; if td = . then td = .I;   tcens = X[i,3];\n      cd = X[j,1]; if cd = . then cd = .I;   ccens = X[j,3];\n      r = 0;\n      /* Tier 1: survival */\n      if td ^= .I & cd ^= .I then do;\n        if td > cd then r = 1; else if td < cd then r = -1;\n      end;\n      else if td ^= .I & cd = .I then do; if ccens >= td then r = -1; end;\n      else if cd ^= .I & td = .I then do; if tcens >= cd then r = 1; end;\n      /* Tier 2: time to first HF hospitalization (only if still tied) */\n      if r = 0 then do;\n        th = X[i,2]; if th = . then th = .I;\n        ch = X[j,2]; if ch = . then ch = .I;\n        if th ^= .I & ch ^= .I then do;\n          if th > ch then r = 1; else if th < ch then r = -1;\n        end;\n        else if th ^= .I & ch = .I then do; if ccens >= th then r = -1; end;\n        else if ch ^= .I & th = .I then do; if tcens >= ch then r = 1; end;\n      end;\n      if r = 1 then wins = wins + 1;\n      else if r = -1 then losses = losses + 1;\n      else ties = ties + 1;\n    end;\n  end;\n\n  total = wins + losses + ties;\n  win_ratio  = wins / losses;\n  win_odds   = (wins + 0.5*ties) / (losses + 0.5*ties);\n  net_benefit = (wins - losses) / total;\n  print wins losses ties total win_ratio win_odds net_benefit;\nquit;",
        "description": "Unmatched two-tier generalized pairwise comparison in SAS via PROC IML. Each subject carries survival\n(death_day, right-censored at followup_day) and HF-hospitalization (hf_day) information; a nested loop scores\nevery treated x control pair down the hierarchy and accumulates wins, losses, and ties, then prints the win\nratio, win odds, and net benefit. Input: work.subjects with subject_id, arm ('treatment'/'control'), death_day,\nhf_day (. = not observed), followup_day.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Pair[Form a pair:<br/>one treated x one control] --> T1{Tier 1 - survival:<br/>can we order who<br/>lived longer?}\n  T1 -- Treated lived longer --> W[Win for treatment]\n  T1 -- Control lived longer --> L[Loss for treatment]\n  T1 -- Censored / cannot order --> T2{Tier 2 - time to first<br/>HF hospitalization:<br/>can we order it?}\n  T2 -- Treated event-free longer --> W\n  T2 -- Control event-free longer --> L\n  T2 -- Cannot order --> Tie[Overall tie]\n  W --> Agg[Sum wins, losses, ties over all pairs]\n  L --> Agg\n  Tie --> Agg\n  Agg --> Out[win ratio = wins / losses<br/>win odds = wins+0.5 ties / losses+0.5 ties<br/>net benefit = wins - losses / pairs]",
        "caption": "The hierarchical win/loss/tie engine for one treated-vs-control pair - descend the priority ladder (death first, then HF hospitalization) until the pair separates, then aggregate across all pairs into the win ratio, win odds, and net benefit.",
        "alt_text": "Decision flowchart showing a treated-control pair compared on survival first, then heart-failure hospitalization, classified as win, loss, or tie, then summed across all pairs into win ratio, win odds, and net benefit.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": null,
        "mermaid": "gantt\n  title Six patients - survival (Tier 1) and HF hospitalization (Tier 2) over a 365-day window\n  dateFormat YYYY-MM-DD\n  axisFormat %d %b\n  section Treatment\n  T1 alive, no HF :active, a1, 2021-01-01, 365d\n  T2 alive, HF day 300 :active, a2, 2021-01-01, 365d\n  T3 died day 200 :crit, a3, 2021-01-01, 200d\n  section Control\n  C1 died day 100 :crit, b1, 2021-01-01, 100d\n  C2 alive, no HF :active, b2, 2021-01-01, 365d\n  C3 died day 250 :crit, b3, 2021-01-01, 250d",
        "caption": "The six study patients whose 9 treated-vs-control pairs yield 5 wins, 3 losses, and 1 tie - win ratio 5/3 = 1.67. Bars ending before day 365 mark deaths (Tier 1); HF timing breaks Tier-1 ties.",
        "alt_text": "Gantt timeline of three treatment and three control patients showing death days and follow-up lengths used to score survival and heart-failure comparisons.",
        "source_type": "illustrative",
        "source_citations": []
      },
      {
        "asset_path": "win-ratio-generalized-pairwise-comparisons-rwe-timeline.svg",
        "mermaid": null,
        "caption": "Three treated and three control patients compared as 9 pairs on a death-then-HF-hospitalization hierarchy, giving 5 wins, 3 losses, 1 tie - win ratio 1.67, win odds 1.57, net benefit 0.22.",
        "alt_text": "Timeline of six patients with death and hospitalization timing, annotated with the resulting wins, losses, ties, and the win ratio.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "alternative_to",
        "target_slug": "composite-endpoint-construction-rwe",
        "notes": "Both analyze composite endpoints, but the win ratio ranks components by clinical priority (death over hospitalization) and keeps post-first-event information, whereas time-to-first-event composite construction treats all components as equal at the first event."
      },
      {
        "relation_type": "tradeoff_with",
        "target_slug": "cox-ph-regression",
        "notes": "A win ratio is not a hazard ratio and is not bound to proportional hazards; choose Cox for a time-anchored rate when PH holds, the win ratio for a prioritized multi-outcome comparison when PH is doubtful."
      },
      {
        "relation_type": "alternative_to",
        "target_slug": "restricted-mean-survival-time-rmst",
        "notes": "Both avoid the proportional-hazards assumption and depend on a time horizon, but RMST summarizes one time-to-event endpoint while the win ratio merges a prioritized hierarchy of several outcomes."
      },
      {
        "relation_type": "used_with",
        "target_slug": "recurrent-events-analysis-rwe",
        "notes": "Generalized pairwise comparisons can incorporate event counts/recurrences as a tier, and recurrent-event methods are the alternative when repeated hospitalizations - not just the first - should count."
      },
      {
        "relation_type": "used_with",
        "target_slug": "propensity-score-methods-psm-iptw",
        "notes": "The all-pairs engine does not control confounding; matched or stratified/IP-weighted win ratios layer a propensity score onto pair formation so observational comparisons are not merely efficient comparisons of confounded groups."
      },
      {
        "relation_type": "requires",
        "target_slug": "estimands-ate-att-intercurrent-events-rwe",
        "notes": "The hierarchy, follow-up horizon, and tie handling are part of the win-ratio estimand; intercurrent events (death as a terminal event ahead of hospitalization) are addressed by the priority ordering itself."
      },
      {
        "relation_type": "see_also",
        "target_slug": "inverse-probability-of-censoring-weighting-rwe",
        "notes": "Because uncomparable (censored) pairs become ties and bias the win ratio toward 1, censoring-aware adjustments matter when follow-up differs between arms."
      },
      {
        "relation_type": "see_also",
        "target_slug": "real-world-progression-rwpfs-rwe",
        "notes": "In oncology the win-ratio hierarchy (death over progression over response) builds directly on real-world progression endpoints, extending the cardiology mortality-over-hospitalization pattern."
      }
    ],
    "aliases": [
      "win ratio",
      "win odds",
      "net benefit",
      "generalized pairwise comparisons",
      "GPC",
      "prioritized outcomes analysis",
      "hierarchical composite endpoint analysis",
      "Finkelstein-Schoenfeld"
    ],
    "complexity": "intermediate",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  },
  {
    "slug": "zero-inflated-count-models",
    "name": "Zero-Inflated and Hurdle Count Models",
    "short_definition": "Statistical count models — zero-inflated Poisson (ZIP), zero-inflated negative binomial (ZINB), and hurdle models — that handle a zero frequency far exceeding what a plain negative binomial predicts by positing a second, separate zero-generating process: either a latent subgroup that structurally cannot produce the event (ZIP/ZINB mixture) or a logistic gate determining whether the patient engages at all (hurdle), with a zero-truncated count distribution for those who cross it; the correct choice between these and a plain negative binomial depends on subject-matter knowledge about structural versus sampling zeros, not on automated tests, because interpretation of two-component models is substantially more complex than a single-coefficient incidence rate ratio.",
    "long_description": "**Excess zeros versus overdispersion: not the same problem**\n\nTwo problems can create elevated zero counts in healthcare data, and they require\nfundamentally different solutions. Overdispersion — the condition where Var(Y) > E(Y) —\narises when patients have heterogeneous underlying event rates. The negative binomial\ndistribution handles overdispersion correctly: patients with very low rates will produce\nzero-count observations frequently by chance, and the NB dispersion parameter absorbs that\nextra mass at zero through its variance function. This entry does not re-teach NB regression;\nsee negative-binomial-distribution and poisson-negative-binomial-count-models for that\nfoundation.\n\nZero-inflation addresses a categorically different situation: a subgroup of patients that\nstructurally cannot generate the event of interest, regardless of follow-up length or model\nspecification. These patients' zeros are not probabilistic outcomes of a count process — they\nare certainties from a separate data-generating mechanism entirely. A plain NB model treats\nall zeros as draws from a single continuous count distribution. When a structural-zero\nsubgroup genuinely exists, the NB consistently underpredicts the observed zero frequency even\nafter alpha is estimated, and its coefficients average over two scientifically distinct groups\nin a way that obscures both.\n\nThe key diagnostic: after fitting an NB model, compare its predicted P(Y=0) to the observed\nzero proportion. If the NB predicts 50% zeros but you observe 70%, and there is a credible\nsubject-matter reason to believe a true never-user subgroup exists, zero-inflated or hurdle\nmodels deserve consideration. If the NB tracks the zero rate well, the additional complexity\nis unjustified. Most of the time, plain NB suffices — this is the most important sentence in\nthis entry.\n\n**Structural zeros versus sampling zeros**\n\nA structural zero is a zero that cannot be anything else. 
A patient who has undergone a\nhysterectomy contributes a structural zero to any study of uterine procedures. A patient who\nis physiologically incapable of addiction, or who is enrolled in a claims database but has\nnever sought care for the relevant condition, may be a structural zero for drug fills or\nspecialist visits. The defining feature: no amount of time in the at-risk window can change\nthe count from zero, because the generating mechanism is absent.\n\nA sampling zero is probabilistic. A patient who fills opioid prescriptions occasionally might\nhave zero fills in any given year simply because no refill fell within the observation window.\nA low-frequency hospitalizer will frequently have zero hospitalizations in a one-year window.\nThese zeros are fully explained by the count distribution at low event rates; the negative\nbinomial handles them without special treatment.\n\nThe clinical distinction determines the entire modeling strategy. In administrative claims,\nstructural zeros are rarely directly observable: a claims record for a patient with zero fills\nlooks identical whether the patient is a structural never-user or a sampling zero from a\nsporadic user. Subject-matter knowledge must therefore drive the decision. If every patient in\nyour cohort definition can plausibly generate the event — for example, elderly patients who\ncan be hospitalized, diabetic patients who can fill glucose-lowering drugs — the structural-zero\nhypothesis has no support and zero-inflation adds a scientifically meaningless latent class.\n\n**ZIP and ZINB: mixture model mechanics and two coefficient sets**\n\nA zero-inflated model is a finite mixture of two components. Let pi (π) denote the probability\nthat a given patient belongs to the structural-zero class. With probability π the patient\ncontributes a guaranteed zero. 
With probability (1 − π) the patient is in the active-count\nclass and their count follows a Poisson distribution (for ZIP) or a negative binomial\ndistribution (for ZINB).\n\nThe observed probability of a zero count is therefore:\n\n    P(Y = 0) = π + (1 − π) × P(Y_count = 0)\n\nand the probability of a positive count k > 0 is:\n\n    P(Y = k) = (1 − π) × P(Y_count = k)\n\nThe inflation term π is itself modeled as a function of patient covariates via a logistic\nregression: log(π/(1−π)) = γ₀ + γ₁Z₁ + ... where Z variables may differ from the count-model\npredictors. This produces two distinct coefficient sets that must be interpreted separately:\n\nFirst, the *inflation logistic coefficients* (γ): these predict belonging to the\nstructural-zero class. A positive γ coefficient means higher covariate values increase\nthe probability of being a structural never-user. Exponentiated, they are odds ratios for\nthe latent structural-zero state.\n\nSecond, the *count-model coefficients* (β): these predict the expected count for the\nactive-count subgroup only. Exponentiated, they are incidence rate ratios (IRRs) conditional\non the patient not being a structural zero. A treatment-arm IRR from the count component does\nnot apply to all patients in the study — it applies only to the estimated (1 − π) fraction who\nare active-count patients.\n\nThe marginal expected count for any patient, combining both components, is:\n\n    E(Y) = (1 − π) × mu_count\n\nwhere mu_count is the count-component mean. This marginal quantity — not the count-model IRR\nalone — is the right input to a budget-impact analysis, because it averages over the\nprobability of being in each class.\n\n**Hurdle count models: a distinct data-generating story**\n\nA hurdle model is architecturally similar to a zero-inflated model — two components, two\ncoefficient sets — but embodies a different causal story. In a hurdle model, there is no\nlatent structural-zero class. 
All zeros have a single origin: the patient did not cross the\nutilization hurdle. A logistic model determines whether the count is zero (no event) or\npositive (any event). If positive, the count follows a *zero-truncated* count distribution,\nmeaning the Poisson or NB probability mass function is renormalized to exclude zero.\n\nUnder this story:\n\n    P(Y = 0) = 1 − p_hurdle\n    P(Y = k) = p_hurdle × P(Y_truncated = k),   k > 0\n\nwhere p_hurdle is the logistic-model probability of a positive count and Y_truncated follows\nthe zero-truncated count distribution.\n\nThe hurdle is more natural for many utilization questions. Consider a study of specialty\nvisits: some patients never seek the specialist at all (did not cross the hurdle), and among\nthose who do, the number of visits follows a right-skewed count distribution. Every zero is\nexplained by the logistic component alone; there is no latent-class ambiguity. The logistic\nhurdle coefficient is directly interpretable as a predictor of any engagement with care,\nwhich is itself a policy-relevant outcome separable from the volume question.\n\nThe practical difference between ZIP/ZINB and hurdle: for a given observed dataset, both\nmodels may produce similar predicted distributions and pass goodness-of-fit checks similarly.\nThe difference is in interpretation and in the scientific commitment being made. Under ZIP,\nsome patients are asserted to be structural zeros who cannot respond to treatment through the\ncount channel; under hurdle, those same patients simply did not engage, and an intervention\ncould move them across the hurdle. The choice should be grounded in the scientific question,\nnot in model-selection statistics.\n\n**Model choice: theory first, Vuong test controversies, rootogram checks**\n\nThe Vuong test has been widely applied to compare ZIP and plain Poisson (or ZINB and plain\nNB) in practice, but it is a poor gating criterion. 
The standard Vuong statistic does not\nfollow the assumed standard-normal null distribution for non-nested model comparison in\nfinite samples; simulation evidence shows that it generates inflated false-positive rates,\nrecommending zero-inflation even when none exists. A correction term (Vuong-Clarke) restores\nproper calibration but is rarely implemented in standard software outputs. Because of these\nproperties, automated Vuong-based model selection produces unreliable conclusions and\nencourages treating a scientific modeling decision as a mechanical test.\n\nThe recommended workflow:\n\n1. Before looking at the data, ask whether structural zeros are scientifically plausible in\n   this cohort. If the cohort definition requires the patient to be capable of the event, there\n   are no structural zeros by design and the ZIP story is inappropriate.\n\n2. Fit a plain negative binomial model and compute the predicted zero probability from the\n   fitted model: P_NB(Y=0) = (1/(1+alpha*mu))^(1/alpha). Compare to the observed zero\n   fraction. A rootogram (hanging rootogram from the countreg R package) displays this\n   comparison visually for all count values simultaneously, making underfitting at zero obvious.\n\n3. Only if a meaningful gap remains and subject matter supports a two-process story, fit ZINB\n   or hurdle. Examine whether the inflation (or hurdle) coefficient estimates are interpretable\n   and stable. A very large inflation probability with huge standard errors suggests the model\n   cannot identify the latent class reliably.\n\n4. 
Report the marginal expected count from the chosen model alongside the component-specific\n   coefficients to give decision-makers the population-average quantity they actually need.\n\n**When plain negative binomial suffices — and that is usually the right answer**\n\nAnalysts routinely reach for zero-inflated models upon seeing a high zero rate, but the NB\ndistribution already produces substantial zero probability when mean rates are low and\ndispersion is high. The NB mass at zero is P_NB(Y=0) = (1/(1+alpha*mu))^(1/alpha), which\ncan be 0.6 or higher at low mu with moderate alpha. Before adding model complexity, confirm\nquantitatively that the NB underpredicts the zero fraction by a practically meaningful margin\n— not just a margin that is statistically detectable at large sample sizes. In pharmacoepidemiology\nstudies with 50,000-patient claims cohorts, even a 2-percentage-point underprediction of zeros\nwill produce a \"significant\" Vuong test result without any scientific justification for\nstructural zeros. 
The default should remain plain NB unless there is a specific, articulated\nscientific reason to believe in a separate zero-generating process.\n\n**Pros, cons, and trade-offs**\n\n*Zero-inflated models (ZIP/ZINB)*:\n- Pros: correctly models a structural-zero subgroup when one genuinely exists; separates the\n  probability of being a never-utilizer from the count rate among active users; can substantially\n  improve predicted zero fit when the latent-class story is scientifically supported.\n- Cons: two coefficient sets required for every inference question; structural-zero class is\n  latent and unverifiable from data alone; Vuong test for selection is unreliable; marginal\n  expected counts require combining both components; overfitting is a real risk when the zero\n  excess is ordinary overdispersion that NB would absorb; interpretation for payer and HTA\n  audiences is substantially more complex.\n- When to prefer: population-based cohorts where genuine never-users are included by design;\n  when the NB systematically underpredicts the zero frequency by more than ten percentage points\n  and subject matter supports a structural-zero story.\n\n*Hurdle models*:\n- Pros: cleaner data-generating story with all zeros from one source; logistic hurdle\n  coefficient directly interpretable as a predictor of any utilization; avoids the\n  unverifiable latent-class interpretation of ZIP/ZINB; separable scientific questions\n  (any engagement vs how much engagement) map naturally to the two components.\n- Cons: the two-part structure still requires reporting and interpreting two models; forces\n  all zeros into the logistic part even if some are sampling zeros from a low-rate count\n  process; zero-truncated count distribution is less familiar and less commonly implemented\n  in standard software.\n- When to prefer: the scientific question naturally divides into any-event versus\n  how-many-given-event; specialty utilization, medication initiation, or referral studies 
where \"crossing the\n  hurdle\" of first engagement is itself a meaningful outcome.\n\n*Plain negative binomial (the usual right answer)*:\n- Pros: single interpretable coefficient set; marginal IRR applicable to the full patient\n  population; handles overdispersion and much of the zero excess without any latent-class\n  assumption; computationally simpler and results are immediately communicable to non-technical\n  stakeholders; the correct model in the vast majority of HCRU count analyses.\n- Cons: will underfit genuine structural zero processes; predicted zero probability from the\n  fitted model should always be compared to the observed rate as a diagnostic.\n- When to prefer: almost always — specifically when the NB-predicted zero probability matches\n  the observed zero fraction and when no subject-matter case for structural zeros can be made.\n\n**When to use**\n\nUse zero-inflated or hurdle count models when all of the following hold:\n\n- The cohort definition plausibly includes patients who structurally cannot generate the count\n  event — for example, a population-based sample that includes non-users of the therapeutic\n  class being studied, or a cohort enrolled without requiring prior experience with the\n  treatment or condition.\n- The fitted NB model's predicted zero probability materially underpredicts the observed zero\n  fraction (check via rootogram or direct comparison of P_NB(Y=0) to observed proportion).\n- A subject-matter story for the two-process model can be articulated and is scientifically\n  credible, not simply inferred from the statistical test.\n- The study audience can interpret and use two-component results, or marginal expected counts\n  will be reported alongside the component-specific coefficients.\n\n**When NOT to use**\n\n- *When the NB fits the zero distribution well*: if the predicted NB zero probability closely\n  matches the observed zero fraction, adding a zero-inflated or hurdle layer introduces model\n  complexity 
without scientific benefit, and the Vuong test alone does not justify the\n  additional latent-class assumption.\n- *When the cohort is restricted to event-experiencing patients*: inclusion criteria requiring\n  at least one event (e.g., patients with at least one hospitalization in the lookback) remove\n  structural zeros by design. A plain NB or zero-truncated model is correct; zero-inflation\n  would estimate a structural-zero probability near zero while degrading estimation precision.\n- *When every patient can plausibly have the event*: for endpoints that any enrolled patient\n  can experience (hospitalizations among the elderly, drug fills among an established user\n  cohort), structural zeros are scientifically implausible and the inflation logit will fit a\n  latent class with no real-world meaning.\n- *As a mechanical response to a high zero rate*: high zero rates are common in claims data\n  and usually reflect low event rates among some patients, not structural impossibility. Fit NB\n  first; resort to ZI models only after confirming material underprediction and providing a\n  scientific rationale.\n- *When a single marginal rate ratio is required for communication or downstream modeling*:\n  the count-model IRR from a ZINB is conditional on the active-count subgroup and cannot be\n  applied to the full population directly. If the audience needs one interpretable IRR, the\n  plain NB provides it; if ZI is used, the marginal E(Y) = (1-π) × mu_count must be computed\n  and reported for both arms to give the population-level contrast.\n\n**Interpreting the output**\n\nUsing the worked example: a ZIP model for opioid prescription fills in 100 chronic pain\npatients produces two sets of estimated parameters. The inflation logistic component estimates\nπ = 0.40 (40% are structural never-users). 
The count-component mean is μ_count with the\nPoisson zero probability for the active subgroup at 0.50.\n\n*(1) Formal interpretation.* The ZIP model estimates two simultaneous processes. The logistic\ninflation component estimates that 40 out of 100 patients (π = 0.40) belong to a latent\nstructural-zero class — patients who would produce zero opioid fills under any length of\nfollow-up. These patients are not draws from the Poisson count distribution; their zero\ncontributions are certain rather than probabilistic. Among the remaining 60% of patients (the\nactive-count class, probability 1 − π = 0.60), the Poisson count component governs the fill\ndistribution. The count-model IRR for a treatment indicator applies only within this active\nsubgroup, not to all 100 patients. The marginal predicted zero rate for the full population is\nπ + (1 − π) × P_count(Y=0) = 0.4 + 0.6 × 0.5 = 0.4 + 0.3 = 0.7, matching the observed\n70/100 = 0.7. Any treatment-effect comparison that uses only the count-model IRR implicitly\nconditions on the patient not being a structural zero — an assumption that must be stated\nexplicitly and is typically not what a budget-impact model requires.\n\n*(2) Practical interpretation.* For a payer or clinical decision-maker: the model estimates\nthat roughly 4 in 10 patients in this chronic pain population do not fill opioid prescriptions\nat all — a structural behavioral or clinical state the count model alone cannot explain. Among\nthe 6 in 10 who do fill at least occasionally, the count model describes how often they fill\nand how a treatment or policy affects that rate. For a budget-impact calculation, the\npopulation-average expected fills per patient is E(Y) = (1 − 0.40) × μ_count, not μ_count\nalone — using μ_count directly would overestimate the cost burden for the full population by\nnearly 67%. 
This is the interpretive price of the richer model: two sets of coefficients,\na marginal calculation to reach a population-level quantity, and a latent-class story that\nmust be defended to a non-technical audience. When that price is not worth paying because the\nplain NB fits well, do not pay it.",
    "primary_category": "Inferential_Statistics",
    "tags": [
      "statistics",
      "count-data",
      "zero-inflation",
      "hurdle-model",
      "excess-zeros",
      "structural-zeros",
      "sampling-zeros",
      "ZIP",
      "ZINB",
      "mixture-model",
      "pscl",
      "overdispersion",
      "claims",
      "hcru"
    ],
    "applies_to_study_types": [
      "cohort_retrospective",
      "cohort_prospective",
      "claims_analysis",
      "ehr_study",
      "registry_study",
      "descriptive_analysis"
    ],
    "data_sources": [
      "claims",
      "ehr",
      "registry",
      "linked"
    ],
    "citations": [
      {
        "role": "introduce",
        "doi": "10.2307/1269547",
        "url": "https://doi.org/10.2307/1269547",
        "citation_text": "Lambert D. Zero-inflated Poisson regression, with an application to defects in manufacturing. Technometrics. 1992;34(1):1-14.",
        "year": 1992,
        "authors_short": "Lambert",
        "notes": "The foundational paper introducing the ZIP model as a two-component mixture of a point mass at zero and a Poisson distribution, deriving the EM algorithm for maximum likelihood estimation, and demonstrating the structural-zero concept with an industrial defect application directly translatable to healthcare utilization settings."
      },
      {
        "role": "explain",
        "doi": "10.1080/10543400600719384",
        "url": "https://doi.org/10.1080/10543400600719384",
        "citation_text": "Rose CE, Martin SW, Wannemuehler KA, Plikaytis BD. On the use of zero-inflated and hurdle models for modeling vaccine adverse event count data. Journal of Biopharmaceutical Statistics. 2006;16(4):463-481.",
        "year": 2006,
        "authors_short": "Rose et al.",
        "notes": "Applied comparison of ZIP, ZINB, hurdle-Poisson, and hurdle-NB models in a vaccine adverse event surveillance context, directly analogous to pharmacovigilance and claims- based adverse event counting. Demonstrates model diagnostics (predicted vs observed count distributions) and contrasts the structural-zero (ZI) and non-engagement (hurdle) interpretive frameworks in a regulatory setting."
      },
      {
        "role": "demonstrate",
        "doi": "10.18637/jss.v027.i08",
        "url": "https://doi.org/10.18637/jss.v027.i08",
        "citation_text": "Zeileis A, Kleiber C, Jackman S. Regression models for count data in R. Journal of Statistical Software. 2008;27(8):1-25.",
        "year": 2008,
        "authors_short": "Zeileis et al.",
        "notes": "Comprehensive implementation reference for Poisson, NB, ZIP, ZINB, and hurdle models in R using the pscl and MASS packages. Includes the Vuong test, likelihood-ratio tests, and rootogram diagnostics. Essential for understanding how zeroinfl() and hurdle() differ in their data-generating assumptions and output interpretation."
      },
      {
        "role": "use",
        "doi": "10.1146/annurev.publhealth.20.1.125",
        "url": "https://doi.org/10.1146/annurev.publhealth.20.1.125",
        "citation_text": "Diehr P, Yanez D, Ash A, Hornbrook M, Lin DY. Methods for analyzing health care utilization and costs. Annual Review of Public Health. 1999;20:125-144.",
        "year": 1999,
        "authors_short": "Diehr et al.",
        "notes": "Canonical HEOR review that situates zero-inflated and two-part count models within the broader landscape of utilization and cost analysis, noting when the zero-excess story is plausible for administrative claims populations versus when plain NB or gamma GLMs are more appropriate."
      }
    ],
    "plain_language_summary": "Zero-inflated and hurdle count models are used when a dataset has far more zero-event patients than a standard negative binomial model can explain — for example, when 70% of patients had zero hospitalizations but the model predicts only 50%. These models handle zeros by positing two separate groups: patients who structurally cannot have the event (like a non-smoker in a smoking-cessation study) and patients who could but did not during the study window. The price of this richer model is two sets of coefficients that must be interpreted separately and combined to get a population-average estimate, which is why a simpler negative binomial model is usually the right starting point and should only be replaced when there is a genuine subject-matter reason to believe in a separate zero-generating process.",
    "key_terms": [
      {
        "term": "structural zero",
        "definition": "A patient observation that is certainly zero because the patient cannot possibly have the event — for example, a never-user of a drug class or a patient whose anatomy makes the procedure impossible — as opposed to a patient who could have the event but happened not to."
      },
      {
        "term": "sampling zero",
        "definition": "A zero-count observation from a patient who could have the event but did not in the observation window purely by chance; these zeros are handled correctly by the negative binomial distribution and do not require a zero-inflated model."
      },
      {
        "term": "inflation probability (pi)",
        "definition": "In a zero-inflated model, the estimated probability that any given patient belongs to the structural-zero class (can never have the event); the complement (1 minus pi) is the probability of being in the active-count group."
      },
      {
        "term": "hurdle model",
        "definition": "A two-part count model where a logistic regression determines whether a patient has any events at all (crosses the hurdle), and a zero-truncated count distribution describes the number of events among those who do; unlike zero-inflated models, all zeros come from the logistic part."
      },
      {
        "term": "rootogram",
        "definition": "A diagnostic plot that compares the observed count frequency distribution to model-predicted frequencies for each count value (0, 1, 2, ...); a hanging rootogram from the R countreg package is the standard visual check for whether the NB underpredicts the zero bar."
      },
      {
        "term": "marginal expected count",
        "definition": "The population-average expected event count from a zero-inflated model, computed as (1 minus pi) times the count-component mean; this is the correct quantity for budget-impact analysis, not the count-component mean alone."
      }
    ],
    "worked_example": {
      "scenario": "A pharmacy analyst studies opioid prescription fills among 100 patients diagnosed with chronic pain over one year. Seventy patients had zero fills. After fitting a negative binomial model, the predicted proportion of zero-fill patients is approximately 50 percent — substantially below the observed 70 percent. With subject-matter justification (some patients decline opioids entirely on behavioral or cultural grounds), the analyst fits a zero-inflated Poisson model. The inflation logistic component estimates that 40 percent of patients are structural never-users (π = 0.40). The Poisson count component for the active-user group yields a zero probability of 0.50 at the estimated mean. The analyst verifies that these ZIP parameters reproduce the observed zero rate.",
      "dataset": {
        "caption": "Observed count distribution of opioid fills in one year for 100 chronic pain patients. Seventy patients had zero fills; the remaining 30 had one or more fills, creating a zero proportion of 0.70 that a plain NB model cannot match.",
        "columns": [
          "count_value",
          "observed_patients"
        ],
        "rows": [
          [
            0,
            70
          ],
          [
            1,
            12
          ],
          [
            2,
            8
          ],
          [
            3,
            5
          ],
          [
            4,
            3
          ],
          [
            5,
            2
          ]
        ]
      },
      "steps": [
        "Observed zero count = 70 patients out of 100 total. Observed zero rate: 70/100 = 0.7.",
        "A fitted negative binomial model predicts approximately 50 patients with zero fills (observed rate 0.70 vs NB-predicted rate approximately 0.50) — a gap suggesting a zero-generating process beyond ordinary overdispersion.",
        "Fit a ZIP model. The logistic inflation component estimates pi = 0.40: 40 percent of patients are classified as structural never-users who contribute a guaranteed zero.",
        "The Poisson count component for the remaining active-user group (probability 0.60) estimates a mean fill rate such that the count-part zero probability is 0.50 for an active-user with average characteristics.",
        "Compute the ZIP-predicted proportion of zeros. Structural-zero share contributes 0.40. Active-user sampling-zero share: 0.6*0.5 = 0.3. Total ZIP-predicted zero rate: 0.4 + 0.3 = 0.7.",
        "The ZIP prediction matches the observed rate: 0.4 + 0.3 = 0.7, which equals observed 70/100 = 0.7. The model successfully reproduces the excess zero rate by attributing 40 of the 70 zero-fill patients to the structural-zero class and 30 to sampling zeros from the active-user Poisson distribution."
      ],
      "result": "Observed zero rate: 70/100 = 0.7. ZIP-predicted zero rate: 0.4 + 0.3 = 0.7, matching observed. Of the 70 zero-fill patients, approximately 40 are attributed to the structural never-user class (pi = 0.40) and 30 to sampling zeros from the active-user group (0.6*0.5 = 0.3). The marginal expected fill count for the full population is (1 - 0.40) times the active-user mean — not the active-user mean alone — because 40 percent of patients are expected to contribute zero fills regardless of follow-up."
    },
    "prerequisites": [
      "negative-binomial-distribution",
      "poisson-negative-binomial-count-models",
      "binomial-distribution-logit-link"
    ],
    "index_definitions": [],
    "checklist_items": [],
    "operational_variants": [
      {
        "name": "ZIP (Zero-Inflated Poisson)",
        "description": "The original Lambert (1992) formulation. The count component is Poisson; the zero-inflation component is a logistic regression on possibly different predictors. Appropriate only when the count part is equidispersed after conditioning on structural zeros — an unusual situation for HCRU data, which is almost always overdispersed even within the active-user class. Use mainly as a diagnostic stepping stone; if the active-user count distribution shows overdispersion, escalate to ZINB.",
        "edge_cases": [
          "When both the ZIP and plain NB fit the zero frequency similarly (rootogram), prefer NB for its single-coefficient interpretability.",
          "Very small active-user counts in small samples may prevent stable estimation of the inflation probability; report the standard error of pi to assess identifiability."
        ],
        "data_source_notes": "Claims: rarely the final model for HCRU counts because hospitalization, ED visit, and fill counts are overdispersed within any subgroup. Use as a diagnostic to establish whether zero-inflation is necessary before fitting ZINB."
      },
      {
        "name": "ZINB (Zero-Inflated Negative Binomial)",
        "description": "The practical default when zero-inflation is judged necessary. The count component is negative binomial, handling overdispersion in the active-user group while the logistic inflation component handles structural zeros. Reports three elements: the inflation logit coefficients (odds of being a structural zero), the count-component log-rate coefficients (IRRs conditional on active-user status), and the count-component dispersion parameter alpha. The marginal expected count combining both components must be computed and reported separately as E(Y) = (1 − π) × mu_count.",
        "edge_cases": [
          "Convergence can fail when the inflation probability is near 0 or 1; check iteration history and consider a simpler mean structure for the inflation logit if problems arise.",
          "In very large claims samples (> 100,000 patients), any zero-inflation gap will be statistically detectable even if practically trivial; apply clinical judgment about the minimum meaningful gap before reporting ZINB results."
        ],
        "data_source_notes": "Claims: the appropriate model when structural never-users (e.g., patients with no history of a drug class) are included in a population-based cohort and the NB underpredicts zeros by more than 10 percentage points. Always report the rootogram for the NB alongside the ZINB to justify the model choice."
      },
      {
        "name": "Hurdle Negative Binomial",
        "description": "A logistic regression determines whether any event occurs (probability p_hurdle), and a zero-truncated negative binomial distribution describes the count among patients who cross the hurdle. All zeros are attributed to the logistic component; there is no latent structural-zero class. Implemented in R via pscl::hurdle(dist=\"negbin\") and in SAS via a two-step approach (PROC LOGISTIC for the hurdle, PROC COUNTREG DIST=TRUNCNEGBIN for the count part). The hurdle-NB is conceptually cleaner than ZINB for utilization studies where engagement vs non-engagement is the primary scientific divide.",
        "edge_cases": [
          "When the proportion of zeros is very high (> 80%), the zero-truncated count distribution is estimated from a small fraction of patients; verify that the count-part estimation is stable by inspecting the effective sample size for the positive counts.",
          "The zero-truncated NB is slightly less tractable numerically than the full NB; ensure the software is fitting truncated (not zero-inflated) counts in the count component."
        ],
        "data_source_notes": "Claims: well suited for specialty visit counts, biologic drug fill counts, or post-procedure procedure counts where the decision to engage with care at all is distinct from the frequency of ongoing care."
      }
    ],
    "tradeoffs": [
      {
        "compared_to": "poisson-negative-binomial-count-models",
        "pros_of_this": "Correctly models datasets where NB underpredicts zeros due to a genuine structural-zero or non-engagement subgroup; separates the \"who ever uses\" question from the \"how much among users\" question; can substantially improve model fit and avoid biased marginal effect estimates when structural zeros exist.",
        "cons_of_this": "Produces two coefficient sets that are harder to communicate; requires the marginal expected count to be computed separately; vulnerable to the Vuong test's inflated false- positive rate if used for automated model selection; interpretation breaks down if the structural-zero assumption is incorrect.",
        "when_to_prefer": "Use plain NB first; escalate to ZI/hurdle only when (a) the NB rootogram shows material underprediction of zeros and (b) a scientific case for a separate zero-generating process can be articulated for the study population."
      },
      {
        "compared_to": "two-part-models-semicontinuous-costs",
        "pros_of_this": "Count outcomes have a natural integer structure that ZI/hurdle count models handle exactly; the count-part component produces an interpretable IRR on the same event-count scale as the plain NB; the inflation probability has a direct clinical interpretation as the proportion of never-utilizers.",
        "cons_of_this": "Two-part models for semicontinuous costs are the appropriate approach when the outcome is continuous (dollars), not discrete counts; the structural-zero logic is the same but the cost-part model uses a gamma or log-normal GLM rather than a truncated count distribution. Cost and utilization analyses should both be performed in comprehensive HEOR evaluations.",
        "when_to_prefer": "Use ZI/hurdle count models for discrete utilization volume (number of visits, fills, admissions); use two-part cost models for the dollar amounts. The two-part cost framework is the direct continuous-outcome analogue of the hurdle count model."
      }
    ],
    "implementation_notes_by_data_source": {
      "claims": "Population-based commercial or Medicare FFS cohorts that include all enrollees (not just users of the drug class) are the primary setting where structural zeros are plausible. Always fit NB first and inspect the rootogram before moving to ZINB or hurdle. Exclude Medicare Advantage-only enrollment months (incomplete inpatient capture inflates apparent zero rates artifactually). When fitting ZINB, use different inflation predictors than count predictors if theory suggests different drivers of never-use versus use intensity; for example, socioeconomic variables may predict never-use but not fill frequency among users.",
      "ehr": "Visit-driven capture creates informative zeros: patients who never seek care have zero records not because of structural impossibility but because they are disengaged or unobserved. Distinguish between patients with zero records due to absence from the EHR (unobserved) and those who attended but had zero events of interest. Link to claims for complete event capture before treating EHR zeros as structural.",
      "registry": "Disease registries typically capture patients who have already engaged with specialized care, reducing the structural-zero problem. Zeros in registry data usually represent sampling zeros (low event rates) rather than structural never-users. Plain NB is almost always appropriate for registry-based count outcomes.",
      "linked": "Linked claims-EHR-vital records cohorts can identify structural never-users via EHR-derived clinical flags (e.g., documented medication refusal, documented contraindication to the drug class). When such flags exist, they can serve as informative predictors of the inflation logistic component, grounding the structural-zero interpretation in observed clinical data rather than purely latent-class inference."
    },
    "implementations": [
      {
        "lang": "python",
        "code": "import numpy as np\nimport pandas as pd\nimport statsmodels.api as sm\nimport statsmodels.formula.api as smf\nfrom scipy import stats\n\nrng = np.random.default_rng(42)\nn = 200\n\n# ── Synthetic dataset: 200 patients, structural-zero subgroup (~40%) ──\n# Structural zeros: 80 patients who never fill opioids\n# Active users: 120 patients; fill count ~ NegBin(mu=2.5, alpha=0.8)\nstructural_zero = rng.binomial(1, 0.4, n).astype(bool)\nmu_active = 2.5\nalpha_nb   = 0.8\np_nb = 1.0 / (1.0 + alpha_nb * mu_active)         # NB prob param for scipy\nr_nb = 1.0 / alpha_nb\n\nactive_counts = rng.negative_binomial(r_nb, p_nb, n)\ncounts = np.where(structural_zero, 0, active_counts).astype(int)\n\nage = rng.normal(55, 12, n)\ndf = pd.DataFrame({\n    \"count\":           counts,\n    \"age\":             age,\n    \"pain_severity\":   rng.uniform(0, 10, n).round(1),\n    \"person_years\":    np.ones(n),\n})\ndf[\"log_pt\"] = np.log(df[\"person_years\"])\n\nobs_zero_rate = (df[\"count\"] == 0).mean()\nprint(f\"Observed zero rate: {obs_zero_rate:.3f}\")\n\n# ── Step 1: Fit plain NB and compare predicted vs observed zeros ──\nnb_mle = sm.NegativeBinomial.from_formula(\n    \"count ~ age + pain_severity\", data=df, offset=df[\"log_pt\"]\n).fit(disp=0)\nalpha_est = float(nb_mle.params[\"alpha\"])\nmu_hat    = nb_mle.predict()\n\n# P_NB(Y=0) = (1/(1 + alpha*mu))^(1/alpha) for each patient\np_zero_nb = (1.0 / (1.0 + alpha_est * mu_hat)) ** (1.0 / alpha_est)\nnb_pred_zero_rate = p_zero_nb.mean()\nprint(f\"NB predicted zero rate: {nb_pred_zero_rate:.3f}\")\nprint(f\"Gap (observed - NB): {obs_zero_rate - nb_pred_zero_rate:.3f}\")\nprint(\"NB underpredicts zeros by more than 10 pp -> investigate zero-inflation\")\n\n# ── Step 2: Fit ZIP (as a diagnostic, count component is Poisson) ──\nzip_res = sm.ZeroInflatedPoisson.from_formula(\n    \"count ~ age + pain_severity\",      # count-model predictors\n    exog_infl=df[[\"age\"]].assign(const=1.0),   # inflation 
predictors\n    data=df,\n    offset=df[\"log_pt\"]\n).fit(method=\"bfgs\", maxiter=400, disp=False)\nprint(\"\\nZIP coefficient table (inflation and count components):\")\nprint(zip_res.summary2().tables[1].round(3))\n\n# ── Step 3: Fit ZINB (NB count component + inflation logit) ──\nzinb_res = sm.ZeroInflatedNegativeBinomialP.from_formula(\n    \"count ~ age + pain_severity\",\n    exog_infl=df[[\"age\"]].assign(const=1.0),\n    data=df,\n    offset=df[\"log_pt\"]\n).fit(method=\"bfgs\", maxiter=400, disp=False)\n\n# ── Step 4: Extract the structural-zero probability (pi) ──\n# which=\"prob-main\" returns 1 - pi (probability of the count process) per patient.\n# which=\"prob-zero\" is the total P(Y=0), structural plus sampling zeros, so it is NOT pi.\npi_hat = 1.0 - np.asarray(zinb_res.predict(which=\"prob-main\"))\npi_mean = float(pi_hat.mean())\nprint(f\"\\nEstimated mean structural-zero probability (pi): {pi_mean:.3f}\")\n\n# ── Step 5: Compute marginal expected count E(Y) = (1 - pi) * mu_count ──\n# Multiply per patient, then average, so covariate-dependent pi is handled correctly\nmu_count_hat = np.asarray(zinb_res.predict(which=\"mean-main\"))\nmarginal_mean = float(((1.0 - pi_hat) * mu_count_hat).mean())\nraw_count_mean = float(mu_count_hat.mean())\nprint(f\"Active-user mean (count component): {raw_count_mean:.3f}\")\nprint(f\"Marginal expected count E(Y):        {marginal_mean:.3f}\")\nprint(f\"Using count-component mean for full population would overstate by \"\n      f\"{100 * (raw_count_mean / marginal_mean - 1):.0f}%\")\n\n# ── Step 6: Verify ZIP arithmetic from worked example ──\npi_ex     = 0.4       # structural-zero share\np_count_0 = 0.5       # count-part zero probability for active group\nzip_pred_zero = pi_ex + (1 - pi_ex) * p_count_0\nobs_ex        = 70 / 100\nprint(f\"\\nWorked-example verification:\")\nprint(f\"ZIP predicted zeros: {pi_ex} + {1-pi_ex}*{p_count_0} = {zip_pred_zero:.1f}\")\nprint(f\"Observed:            {obs_ex:.1f}\")\nprint(f\"Match: {abs(zip_pred_zero - obs_ex) < 1e-9}\")",
        "description": "Zero-inflated Poisson and zero-inflated negative binomial models using statsmodels\nZeroInflatedPoisson and ZeroInflatedNegativeBinomialP. Demonstrates the NB-versus-ZIP\ndiagnostic workflow (predicted vs observed zero comparison), followed by ZINB fitting with\nseparate inflation predictors and extraction of the marginal expected count. Uses a synthetic\ndataset of 200 chronic pain patients with opioid fill counts that includes a structural\nnever-user subgroup. No external dependencies beyond numpy, pandas, and statsmodels.",
        "dependencies": [
          "numpy",
          "pandas",
          "statsmodels",
          "scipy"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "r",
        "code": "library(MASS)\nlibrary(pscl)\n# install.packages(\"countreg\", repos=\"http://R-Forge.R-project.org\")\n# library(countreg)   # for rootogram(); install from R-Forge if needed\n\nset.seed(42)\nn <- 200\n\n# ── Synthetic dataset: structural-zero subgroup (~40% never fill) ──\nstructural_zero <- rbinom(n, 1, 0.4) == 1\nactive_counts   <- rnbinom(n, mu = 2.5, size = 1 / 0.8)  # size = 1/alpha\ncounts          <- ifelse(structural_zero, 0L, active_counts)\n\ndf <- data.frame(\n  count        = counts,\n  age          = round(rnorm(n, 55, 12), 1),\n  pain_severity= round(runif(n, 0, 10), 1),\n  person_years = 1.0\n)\n\nobs_zero_rate <- mean(df$count == 0)\ncat(sprintf(\"Observed zero rate: %.3f\\n\", obs_zero_rate))\n\n# ── Step 1: Fit plain NB; check predicted vs observed zeros ──\nnb_fit <- glm.nb(count ~ age + pain_severity + offset(log(person_years)), data = df)\nmu_hat <- predict(nb_fit, type = \"response\")\ntheta  <- nb_fit$theta            # MASS parameterization: size = theta = 1/alpha\n# P_NB(Y=0) = (theta / (theta + mu))^theta for each patient\np_zero_nb <- (theta / (theta + mu_hat)) ^ theta\nnb_pred_zero_rate <- mean(p_zero_nb)\ncat(sprintf(\"NB predicted zero rate: %.3f\\n\", nb_pred_zero_rate))\ncat(sprintf(\"Gap: %.3f\\n\", obs_zero_rate - nb_pred_zero_rate))\n\n# Rootogram for visual diagnostic (requires countreg):\n# rootogram(nb_fit, style = \"hanging\", main = \"NB fit: hanging rootogram\")\n\n# ── Step 2: Fit ZINB with the same predictors for count and inflation parts ──\n# Formula syntax: count-model predictors | inflation-model predictors\nzinb_fit <- zeroinfl(\n  count ~ age + pain_severity + offset(log(person_years)) | age,\n  data = df,\n  dist = \"negbin\"\n)\ncat(\"\\nZINB coefficient summary:\\n\")\nprint(summary(zinb_fit)$coefficients)\n\n# ── Step 3: Extract structural-zero probability pi per patient ──\npi_hat <- predict(zinb_fit, type = \"zero\")   # P(structural zero) per patient\ncat(sprintf(\"\\nMean estimated pi 
(structural-zero probability): %.3f\\n\", mean(pi_hat)))\n\n# ── Step 4: Marginal expected count E(Y) = (1 - pi) * mu_count ──\nmu_count_hat <- predict(zinb_fit, type = \"count\")   # conditional on active-user class\nmarginal_mean <- mean((1 - pi_hat) * mu_count_hat)\nraw_count_mean <- mean(mu_count_hat)\ncat(sprintf(\"Active-user mean (count component):  %.3f\\n\", raw_count_mean))\ncat(sprintf(\"Marginal expected count E(Y):         %.3f\\n\", marginal_mean))\ncat(sprintf(\"Overstatement if count mean used for full population: %.0f%%\\n\",\n            100 * (raw_count_mean / marginal_mean - 1)))\n\n# ── Step 5: Hurdle NB — alternative when all zeros are non-engagers ──\nhurdle_fit <- hurdle(\n  count ~ age + pain_severity + offset(log(person_years)) | age,\n  data = df,\n  dist = \"negbin\"\n)\ncat(\"\\nHurdle NB (logistic hurdle + zero-truncated NB count):\\n\")\ncat(\"Hurdle (logistic) coefficients:\\n\")\nprint(coef(hurdle_fit, model = \"zero\"))\ncat(\"Count (zero-truncated NB) coefficients:\\n\")\nprint(coef(hurdle_fit, model = \"count\"))\n\n# ── Step 6: Vuong test — informational only; do NOT use as sole decision criterion ──\n# vuong(nb_fit, zinb_fit)   # from pscl; note the inflated false-positive rate issue\ncat(\"\\nNOTE: Vuong test is shown here for illustration only. Its null distribution\\n\")\ncat(\"has inflated false-positive rates for non-nested ZI vs plain-NB comparison.\\n\")\ncat(\"Use the rootogram and subject-matter knowledge as primary decision criteria.\\n\")\n\n# ── Step 7: Verify worked-example arithmetic ──\npi_ex     <- 0.4\np_count_0 <- 0.5\nzip_pred_zero <- pi_ex + (1 - pi_ex) * p_count_0\ncat(sprintf(\"\\nWorked-example: ZIP predicted zeros = %.1f + %.1f*%.1f = %.1f\\n\",\n            pi_ex, 1 - pi_ex, p_count_0, zip_pred_zero))\ncat(sprintf(\"Observed:                                           = %.1f\\n\", 70/100))",
        "description": "Zero-inflated and hurdle count models using pscl::zeroinfl and pscl::hurdle. Demonstrates\nthe NB-first diagnostic workflow with a rootogram from the countreg package, ZINB fitting\nwith separate inflation predictors, extraction of the pi estimate, marginal expected count\ncomputation, and hurdle-NB as the non-latent-class alternative. Uses the same synthetic\n200-patient opioid fill dataset as the Python implementation. Also shows the Vuong test\nand why it should not be the sole decision criterion.",
        "dependencies": [
          "MASS",
          "pscl",
          "countreg"
        ],
        "source_citations": [],
        "notes": ""
      },
      {
        "lang": "sas",
        "code": "/* ── Create synthetic dataset (in practice, this is your analytic cohort) ── */\ndata work.opioid_fills;\n  call streaminit(42);\n  n = 200;\n  do patient_id = 1 to n;\n    structural_zero = (rand(\"Uniform\") < 0.4);\n    if structural_zero then count = 0;\n    else do;\n      /* Approximate NegBin with mu=2.5, alpha=0.8 using gamma-Poisson mixture */\n      lambda = rand(\"Gamma\", 1.25, 2.0);   /* shape=1/alpha, scale=mu*alpha */\n      count  = rand(\"Poisson\", lambda);\n    end;\n    age           = round(rand(\"Normal\", 55, 12), 0.1);\n    pain_severity = round(rand(\"Uniform\", 0, 10), 0.1);\n    person_years  = 1.0;\n    log_pt        = log(person_years);\n    output;\n  end;\n  drop n;\nrun;\n\n/* ── Step 1: Baseline NB — check predicted zeros vs observed ── */\nproc genmod data=work.opioid_fills;\n  model count = age pain_severity / dist=negbin link=log offset=log_pt;\n  output out=work.nb_preds p=mu_hat;\nrun;\n/* After PROC GENMOD, compute P_NB(Y=0) = (theta/(theta+mu))^theta manually.\n   PROC GENMOD prints the dispersion parameter; alpha = dispersion from output.\n   In most HCRU cohorts, the NB predicted zero rate materially underpredicts\n   the observed zero fraction when a structural-zero subgroup exists. */\n\n/* ── Step 2: ZIP — Zero-Inflated Poisson via PROC GENMOD DIST=ZIP ── */\nproc genmod data=work.opioid_fills;\n  model count = age pain_severity / dist=zip link=log offset=log_pt;\n  zeromodel age / link=logit;\n  /* ZEROMODEL specifies predictors for the logistic inflation component.\n     Printed as \"Zero-Inflation Model\" coefficients in the output.\n     Exponentiate for odds ratios of belonging to the structural-zero class. 
*/\nrun;\n\n/* ── Step 3: ZINB (zero-inflated NB) via PROC GENMOD DIST=ZINB ── */\nproc genmod data=work.opioid_fills;\n  model count = age pain_severity / dist=zinb link=log offset=log_pt;\n  zeromodel age / link=logit;\n  /* DIST=ZINB adds the NB dispersion parameter alpha to the count component.\n     Use ZINB as the practical default when ZI is needed for HCRU counts.\n     Marginal expected count = (1-pi)*mu_count must be computed post-estimation:\n       create predicted mu_count from the count component;\n       apply pi from the inflation component;\n       compute (1-pi)*mu_count per patient and average across the cohort. */\n  output out=work.zinb_preds p=mu_count_hat;\nrun;\n\n/* ── Step 4: Hurdle model (two-step: logistic hurdle + truncated NB count) ── */\n/* Step 4a: Logistic regression for P(any fills > 0) */\ndata work.opioid_fills;\n  set work.opioid_fills;\n  any_fill = (count > 0);   /* binary hurdle indicator */\nrun;\nproc logistic data=work.opioid_fills;\n  model any_fill(event='1') = age pain_severity / link=logit;\n  /* Coefficients give the log-odds of crossing the engagement hurdle (having any fills);\n     exponentiate for odds ratios. */\nrun;\n\n/* Step 4b: Zero-truncated NB for positive counts only */\ndata work.fills_positive;\n  set work.opioid_fills;\n  where count > 0;   /* restrict to patients who crossed the hurdle */\nrun;\nproc genmod data=work.fills_positive;\n  model count = age pain_severity / dist=negbin link=log offset=log_pt;\n  /* Fitting an untruncated NB to the positive counts only approximates the zero-truncated\n     NB, and the approximation degrades when the NB zero probability is not negligible.\n     For an exact zero-truncated NB, consider PROC FMM with DIST=TRUNCNEGBIN (SAS/STAT) or a\n     hand-coded likelihood in PROC NLMIXED; confirm availability in your SAS release. 
*/\nrun;\n\n/* ── Step 5: Worked-example verification ──\n   pi = 0.4; count-part zero probability = 0.5\n   ZIP predicted zero rate = 0.4 + (1-0.4)*0.5 = 0.4 + 0.6*0.5 = 0.4 + 0.3 = 0.7\n   Observed: 70/100 = 0.7  */\ndata work.verify;\n  pi_structural  = 0.4;\n  p_count_zero   = 0.5;\n  zip_pred_zeros = pi_structural + (1 - pi_structural) * p_count_zero;\n  observed       = 70 / 100;\n  match          = abs(zip_pred_zeros - observed) < 1e-9;\n  put \"ZIP predicted zeros: \" zip_pred_zeros= \" Observed: \" observed= \" Match: \" match=;\nrun;",
        "description": "Zero-inflated Poisson and zero-inflated negative binomial models in SAS using PROC GENMOD\nwith the ZEROMODEL statement. Demonstrates the NB-first diagnostic workflow (PROC GENMOD\nDIST=NEGBIN as baseline), followed by ZIP with DIST=ZIP and ZINB with DIST=NEGBIN plus\nZEROMODEL, predicted zero probability extraction, and a hurdle model via a two-step approach\n(PROC LOGISTIC for the hurdle, PROC GENMOD with DIST=TRUNCNEGBIN for the count part).\nAll examples use the 200-patient opioid fill dataset from the Python and R implementations.",
        "dependencies": [],
        "source_citations": [],
        "notes": ""
      }
    ],
    "diagrams": [
      {
        "asset_path": null,
        "mermaid": "flowchart TD\n  Start[Count outcome with high zero frequency] --> NB[Fit plain negative binomial]\n  NB --> Root{Compare NB-predicted vs<br/>observed zero proportion<br/>rootogram or direct check}\n  Root -->|NB matches zeros well| UseNB[Use plain NB: single IRR<br/>interpretable for full population]\n  Root -->|NB materially underpredicts zeros| Theory{Scientific case for<br/>structural-zero subgroup?}\n  Theory -->|No credible story| UseNB\n  Theory -->|Cohort includes true never-users| Story{Story: latent structural zeros<br/>vs non-engagement?}\n  Story -->|Latent never-user class| ZINB[Zero-Inflated NB<br/>pscl::zeroinfl dist=negbin<br/>PROC GENMOD DIST=ZINB]\n  Story -->|All zeros = non-engagement| Hurdle[Hurdle NB<br/>pscl::hurdle dist=negbin<br/>Logistic + truncated NB]\n  ZINB --> Interp[Report: inflation logit OR<br/>count-model IRR AND<br/>marginal E-Y = 1-pi times mu]\n  Hurdle --> Interp\n  UseNB --> Single[Report: single adjusted IRR<br/>with 95% CI for full population]",
        "caption": "Model-selection decision tree for excess-zero count outcomes. Plain NB is the default; ZI and hurdle models require both a statistical gap (rootogram) and a scientific rationale. Automated Vuong tests should not be the sole criterion for escalating to ZI/hurdle models.",
        "alt_text": "Flowchart beginning at a high-zero count outcome, fitting NB and checking the rootogram, then branching on scientific plausibility of structural zeros to zero-inflated NB or hurdle NB, with the marginal expected count required for population-level inference from both multi-component models.",
        "source_type": "illustrative",
        "source_citations": []
      }
    ],
    "relations": [
      {
        "relation_type": "requires",
        "target_slug": "negative-binomial-distribution",
        "notes": "The NB distribution is the baseline model against which zero-inflation is diagnosed: the NB-predicted P(Y=0) must be compared to the observed zero rate before escalating to ZINB or hurdle. The ZINB count component is itself an NB distribution, so NB distributional mechanics are prerequisite to understanding the full model."
      },
      {
        "relation_type": "requires",
        "target_slug": "binomial-distribution-logit-link",
        "notes": "The inflation component of every ZIP/ZINB model and the hurdle component of every hurdle model is a logistic regression (binomial family with logit link). Understanding logistic regression coefficients — log-odds, odds ratios, the logit link function — is necessary to interpret the inflation or hurdle probability and its predictors."
      },
      {
        "relation_type": "see_also",
        "target_slug": "poisson-negative-binomial-count-models",
        "notes": "The prerequisite count-regression entry covering Poisson vs NB model choice, the offset, and the IRR interpretation. Zero-inflated and hurdle models are extensions of the NB framework described there; the NB-first diagnostic workflow described in this entry begins where that entry ends."
      },
      {
        "relation_type": "see_also",
        "target_slug": "two-part-models-semicontinuous-costs",
        "notes": "The continuous-outcome cousin: two-part cost models apply the same any-event vs how-much logic to semicontinuous dollar outcomes (logistic part for any cost, gamma GLM for positive costs). The hurdle count model and the two-part cost model share the same philosophical framework; analysts studying both utilization and costs should apply each to the appropriate outcome scale."
      },
      {
        "relation_type": "used_with",
        "target_slug": "hcru-healthcare-resource-utilization",
        "notes": "Zero-inflated and hurdle models are applied to HCRU count endpoints (hospitalizations, ED visits, fills, infusions) when population-based cohorts include structural never-users or when specialty utilization studies treat engagement vs non-engagement as a primary outcome. The HCRU entry defines the endpoints; this entry provides the advanced count model when plain NB is insufficient."
      }
    ],
    "aliases": [
      "ZIP model",
      "ZINB model",
      "zero-inflated Poisson",
      "zero-inflated negative binomial",
      "hurdle model",
      "two-component count model"
    ],
    "complexity": "advanced",
    "regulatory_relevance": [
      "fda",
      "ema",
      "hta"
    ],
    "notes": "",
    "provenance": ""
  }
]